Bounty: Diverse hard tasks for LLM agents

METR (formerly ARC Evals) is looking for (1) ideas, (2) detailed specifications, and (3) well-tested implementations for tasks to measure performance of autonomous LLM agents.

Read more here: External Link