A curated map of agentic evaluation benchmarks — coding, math, research, science, and more.
Updated April 2026
We review submissions for quality and relevance. Open a GitHub issue with a link to your benchmark and we'll take a look.