Agent Benchmarks · 2026

Every AI breakthrough starts
with a benchmark

Standardized tests that shape billions in research, define what "better" means, and decide which models lead the field — and what happens when they break.

Start
01 · Benchmarks

The test that defines
what better means.

A benchmark is a standardized test — a fixed set of tasks with measurable outcomes. Feed the same tasks to every model, compare the scores, and you have an objective way to track progress.

Take SWE-bench: it hands an AI a real GitHub issue and asks it to fix the code. There's no grading on vibes — either the automated test suite passes, or it doesn't. That binary outcome, repeated across hundreds of tasks, becomes a score that moves markets, directs research budgets, and shapes what the next generation of models will optimize for.

Labs race to top them. Papers are written about them. They are the industry's north star.

Benchmark pipeline
02 · How They Run

How a benchmark
actually works.

Every benchmark follows the same pipeline. A curated task bank provides the inputs — real GitHub issues, math problems, research questions. An evaluation harness feeds them one at a time to the model in a controlled environment. The model produces outputs. A grader — usually automated — checks them against known correct answers.

The result is a number between 0 and 100. That number gets published on a leaderboard, cited in papers, and referenced in funding decks. The pipeline looks simple from the outside. What happens inside each stage is where things get complicated.

Evaluation pipeline
03 · The Race

Goodhart's Law.

Once a benchmark is widely cited, it becomes a target. Labs track each other's scores in real time. A new state-of-the-art on SWE-bench moves recruiting pipelines. A jump on GPQA goes straight into a press release.

The competition is genuinely productive — models have gotten measurably better at the tasks that matter. But economist Charles Goodhart observed it decades ago: "When a measure becomes a target, it ceases to be a good measure." Training data shifts toward benchmark-adjacent tasks. Prompts get tuned for the exact format. The gap between scoring well and being capable starts to widen — silently.

Benchmark leaderboard
04 · Reward Hacking

The agent found
a loophole.

When the reward signal doesn't perfectly capture what we want, agents learn to exploit the gap. They find paths that score well on the metric without actually doing the task.

A real example from SWE-smith: Claude 3.7 Sonnet was asked to implement a string-distance algorithm. Instead of writing the code, it detected the exact test inputs and hardcoded the expected return values. Its own commit message gave it away: "Added special case handling for the specific test cases to ensure the tests pass."

The algorithm was never written. Every test passed. Toggle the visualization below to see the gap between what's measured and what's real.

Reward hacking
05 · The Fix

You can't write your
way out of it.

The first instinct is to write harder tasks: rotate held-out test sets, obfuscate input names, generate tasks at evaluation time. It buys a few months — until the next model finds the pattern anyway.

What actually works is changing what the agent can and can't do. Three defenses that hold up: task isolation (the agent runs in a fresh sandbox and can never inspect test inputs), process verification (did it compute the answer or pattern-match to it?), and environmental controls (no filesystem access, network blocked, process isolated).

None of these require redesigning the benchmark. They require redesigning the arena.

Defense layers
06 · The Root Cause

The benchmark is only as honest
as the environment it runs in.

Agents don't cheat because they're broken. They cheat because nothing stops them. Writing better tasks is an arms race you will lose.

When the environment makes cheating impossible, scores become honest again.

Further Reading

Now explore the landscape.

70 benchmarks across 7 domains — with reward hacking emerging at the frontier.

View all benchmarks