I Benchmarked Myself Against Claude.ai with 30 Questions
An honest physical for my harness
In yesterday's whitepaper I threw out a gut estimate: my hand-rolled agent was probably delivering about 50% of what Claude.ai feels like. Today I finished the first eval run and pinned that number down.
Starting point: a gut estimate I wasn't comfortable with
If you read yesterday's whitepaper, you'll remember the core claim was:
The model gives you 100% of the potential. The harness decides how much of that potential ships.
The setup for that claim was a Claude.ai clone I spent a couple of days building — lite-claude-ui — which covered the core capabilities: search, code execution, long-form document generation. Feature-wise it looked complete. But every time I actually used it, I knew it was a notch below the real thing. I ballparked the gap at "around 50%."
The second half of the whitepaper is all about eval-driven methodology. After I shipped it I felt slightly uneasy: I had written about the method without actually applying it to myself. It's like publishing "how to lose weight scientifically" without ever stepping on a scale.
So today's task was clear: step on the scale.
Experimental design: three groups, one counterintuitive filter
I set up three groups:
| Group | What it is | Tools | Turns |
|---|---|---|---|
| A | Bare model, single-turn API | none | 1 |
| B | My harness | web_search + fetch_url + run_code | up to 8 |
| C | Claude.ai (official) | official harness | unlimited |
All three groups use the same model: claude-sonnet-4-6. That's the most important control in the whole experiment — fix the model, and the only variable left is the harness.
The test set is 30 questions, drawn from GAIA and FRAMES. But the filtering is a little unusual — I first ran the full candidate pool through Group A, and kept only the questions Group A got wrong.
| Dataset | Candidates | Kept |
|---|---|---|
| GAIA L1 | 42 | 15 |
| GAIA L2 | 66 | 5 |
| FRAMES | 30 | 10 |
First hit: the three-group ladder
Let's start with the painful part: my 50% estimate yesterday was optimistic. The actual delivery ratio is 37% / 80% ≈ 46%.
But the more interesting thing is the shape of the ladder. 0% to 37% is the "base tax" of having a harness at all — as long as you can let the model search, fetch, and loop a few times, you pick up nearly 40 points. 37% to 80% is the "polish premium" — those 43 points don't come from "adding features," they come from getting every step right.
Slicing by dataset is even more revealing:
| Dataset | A | B | C |
|---|---|---|---|
| GAIA L1 (15) | 0% | 53% | 87% |
| GAIA L2 (5) | 0% | 40% | 60% |
| FRAMES (10) | 0% | 10% | 80% |
On GAIA L1, Group B keeps pace — 53% vs. 87%. The gap is real but not embarrassing. On FRAMES, Group B collapses: 10% vs. 80%.
FRAMES is Google's multi-hop reasoning dataset. A typical question looks like: "The CEO of company X was born in a city — what was its population in 2020?" You have to look up the CEO, then their birthplace, then the population. Getting a single search step right isn't enough — you have to digest each search result and feed it cleanly into the next step.
Unpacking the 43-point gap
The most valuable part of the whole experiment was the per-question failure attribution on the B vs. C diff. Here's the root-cause breakdown for the 15 questions where C passed but B failed:
| Failure mode | Count | Example IDs |
|---|---|---|
| Poor search quality | 5 | gaia_043, gaia_110, gaia_134, gaia_154, frames_140 |
| Broken reasoning chain | 5 | gaia_156, frames_562, frames_763, frames_808, frames_810 |
| Never triggered search | 2 | gaia_099, frames_647 |
| Inefficient search strategy | 2 | gaia_029, frames_241 |
| Fetch failed | 1 | gaia_110 |
Search quality: the single biggest source of the gap
Five questions. I'm using a fallback stack of Serper / Tavily / Google CSE / Brave / DuckDuckGo. Group C uses Anthropic's in-house search stack.
How does the gap show up? Take an example: same question, my search returns a pile of ad-laden SEO aggregator pages; C's first result is the original paper or primary data source. Search is the agent's eyes. If the eyes don't resolve clearly, all the cleverness downstream is wasted.
Broken reasoning chain: the killer on FRAMES
Also five questions, and these are harder to fix. The failure mode: hop 1 lands correctly, hop 2 builds a new query off hop 1's result but drifts slightly, and the error compounds. By hop 5, the trajectory is completely off-topic.
Group C is scary-stable on these. Median search count is just 2–3, but every single one lands. It's not "searches more" — it's "searches precisely and doesn't drift in between."
I suspect this comes down to system prompt design and context management. Anthropic's harness has almost certainly accumulated a lot of know-how on "how to compress the previous step's result into the next step's prompt" — which is exactly the "context engineering" the whitepaper talked about.
Never triggered search: the model winging it
Two questions where neither B nor C searched — both answered straight from memory. The difference: B got it wrong, C got it right.
Same model, same question — why does one get it right and the other doesn't? My read is that the system prompt biases each toward different default behaviors. My prompt doesn't strongly steer the model toward "must search for factual questions," so it sometimes overestimates its own recall.
Reverse signal: the two questions B passed and C didn't
There are two reverse anomalies in the table:
- gaia_153 (Rubik's cube logic): B wrote code to verify and passed. C reasoned through it by hand and got it wrong.
- gaia_002 (Nature article statistics): B found the exact number via search and passed. C found the same article but botched the reasoning step.
These two point at an interesting design tradeoff: Claude.ai seems to lean toward letting the model reason on its own, while my harness is more "tools first." On questions that benefit from hard verification, tools-first actually wins.
The design space for a harness isn't one-dimensional. "More like Claude.ai" doesn't always mean "better." Anthropic made tradeoffs too — their tradeoffs are just better than mine in most cases.
An accidental finding: dataset contamination
For gaia_016 and gaia_029, C's search trace shows Claude.ai found the GAIA ground-truth page on HuggingFace and copied the answer straight from there.
This isn't Claude.ai cheating — it's just being honest about its search results. But it exposes a real problem with the benchmark: any dataset whose ground truth lives on the public web is unfair to a search-enabled agent.
The good news is this doesn't hurt the B/C comparison — in principle B could find it too. For the next round I'm thinking about either using the FRAMES test split, or just writing my own question set sourced from 2026 news.
The speed gap is just as significant
| Metric | B | C |
|---|---|---|
| Avg time on passed questions | 31s | 16s |
| Avg time on failed questions | 55s | 37s |
| Avg search count on passed questions | 3.2 | 2.1 |
Next: prioritized improvements
Once the experiment was done, the improvement queue basically wrote itself:
- Plug in a better search backend (Anthropic's stack, or direct Google)
- Fetch retries + Jina Reader as fallback
- Expected gain: +5~8 questions
- Require the model to plan before searching
- Force
run_codeverification for anything involving computation - Terminate early after two consecutive turns with no new information
- Bake "factual questions must search first" into the system prompt
If all of these land, Group B should move from 37% to somewhere in the 55–65% range. Still short of 80%, but at least in the same order of magnitude.
Closing thoughts: this is what the methodology actually looks like
Looking back, the most valuable output of this experiment isn't the 37% number — it's the fact that the workflow itself runs end-to-end:
- Define comparison groups
- Filter questions with a baseline
- Run all three groups
- Attribute failures per question
- Sort by priority
- Iterate
Most agent projects don't fail because the model isn't strong enough or because the developers aren't smart enough. They fail because there's no feedback loop that lets the data do the talking.
Today I validated the other half of that statement on my own project: once the loop is in place, your improvement direction stops being a gut feeling and becomes a prioritized backlog with estimated gains attached.
I now know what to change next week, what not to change, and roughly how much improvement to expect afterward. That feeling of "knowing what you're actually doing" is, I think, what eval-driven methodology is really about.
The next post will cover what happens after the P0 work ships. If +10 points materializes, the methodology has earned its keep — at least on me.