Are we in a GPT-4-style leap that evals can't see?
