Bounty: Diverse hard tasks for LLM agents

Jan 20, 2024

METR (formerly ARC Evals) is looking for (1) ideas, (2) detailed specifications, and (3) well-tested implementations of tasks that measure the performance of autonomous LLM agents.