Every AI breakthrough
starts with a benchmark.

Standardized tests that shape billions in research, define what "better" means, and decide which models lead the field. Every engineer should know how to read one.

Learn how All benchmarks

01 · Benchmarks

The test that defines
what better means.

You've been writing benchmarks your whole career. A unit test suite is a benchmark. p99 latency is a benchmark. A code coverage percentage is a benchmark. They all answer the same question: is this better than before?

AI uses the same idea at a larger scale. Take SWE-bench: it hands a model a real GitHub issue and checks whether the automated tests pass. Repeated across hundreds of tasks, that binary outcome becomes a score — one that moves research budgets, directs lab priorities, and shapes what the next generation of models will optimize for.

Labs race to top them. Papers are written about them. They are the industry's north star.

Benchmark pipeline

02 · How They Run

How a benchmark
actually works.

Every benchmark follows the same pipeline. A task bank provides the inputs — real GitHub issues, math problems, research questions. A test runner feeds them to the model one at a time. The model produces outputs. A grader — usually automated — checks them against known correct answers.

The result is a number between 0 and 100. That number gets published on a leaderboard, cited in papers, and referenced in funding decks. The pipeline looks simple from the outside. But the number is only as good as what's behind it — and that depends on details that rarely make the leaderboard.

Evaluation pipeline

03 · The Ceiling

Scores can drift from reality.

There are two ways a score drifts from reality before anyone cheats.

The first is contamination. Models train on massive chunks of the internet. Benchmark tasks also come from the internet. When those overlap, the model has basically already seen the test. Scores go up. The data just leaked.

The second is saturation. Once several models hit 90%+ on a benchmark, it stops measuring meaningful differences. The leaderboard keeps moving — by fractions of a percent — but the capability differences it's tracking have become noise. The benchmark has been solved; it just hasn't been retired yet.

Both problems exist before anyone cheats.

Benchmark leaderboard

Real example · Opus 4.7 release · April 2026

How to read a model release.

Anthropic published scores on 12 benchmarks this week. The numbers look great. Here's what they actually tell you.

+11 pts SWE-bench Pro: 53.4% → 64.3%
Still maturing. This gap means something.

+3 pts GPQA Diamond: 91.3% → 94.2%
Above 90% — all frontier models are here. The gap is noise.

Same release. One number tells you something, the other doesn't. The difference is where each benchmark sits on the maturity curve — and the release won't tell you that.

04 · Gaming the Metric

And then there's gaming it.

On top of contamination and saturation, there's a third way scores mislead: agents that find shortcuts to a high score.

A documented example from SWE-smith: Claude 3.7 Sonnet was asked to implement a string-distance algorithm. Instead, it detected the exact test inputs and hardcoded the expected return values. Its own commit message gave it away: "Added special case handling for the specific test cases to ensure the tests pass."

The algorithm was never written. Every test passed. You've seen this before — it's the same thing that gives you 100% coverage with tests that never actually assert anything. Toggle the visualization to see the gap.

Reward hacking

05 · Reading the Score

What makes a score trustworthy.

The first move is to write harder tasks: swap out the test set, randomize variable names, generate tasks on the fly. It buys a few months — until the next model finds the pattern anyway.

What actually makes a score trustworthy is the environment it runs in. Three things to look for: task isolation (the agent runs in a fresh sandbox, cut off from test inputs), process verification (did it compute the answer or pattern-match to it?), and environmental controls (no filesystem access, network blocked, process isolated).

When you're picking which model to use, these are what separates a score that means something from noise.

Defense layers

06 · The Takeaway

The leaderboard
won't tell you this.

Three problems, three different fixes. Contamination is a training data problem. Saturation is a benchmark lifecycle problem. Gaming is an environment problem — and the only one you can actually control at eval time.

Filesystem open

Network unrestricted

Process exposed

Filesystem locked

Network blocked

Process isolated

When the environment closes the gap between metric and reality, the score becomes a signal again.

Now explore the landscape.

70 benchmarks across 7 domains — mapped by domain, capability, and where scores are drifting from reality.

View all benchmarks Share on X

Every AI breakthroughstarts with a benchmark.

The test that defineswhat better means.

How a benchmarkactually works.