Optimizing LLMs to be good at specific tests backfires on Meta, Stability.
