Eval Report

I Benchmarked Myself Against Claude.ai with 30 Questions

An honest physical for my harness

2026-05-13 · Eval · Agent · Benchmark · Claude

In yesterday's whitepaper I threw out a gut estimate: my hand-rolled agent was probably delivering about 50% of what Claude.ai feels like. Today I finished the first eval run and pinned that number down.

My Harness: 37% (11/30 passed)
Claude.ai: 80% (24/30 passed)

Starting point: a gut estimate I wasn't comfortable with

If you read yesterday's whitepaper, you'll remember the core claim was:

The model gives you 100% of the potential. The harness decides how much of that potential ships.

The setup for that claim was a Claude.ai clone I spent a couple of days building — lite-claude-ui — which covered the core capabilities: search, code execution, long-form document generation. Feature-wise it looked complete. But every time I actually used it, I knew it was a notch below the real thing. I ballparked the gap at "around 50%."

The second half of the whitepaper is all about eval-driven methodology. After I shipped it I felt slightly uneasy: I had written about the method without actually applying it to myself. It's like publishing "how to lose weight scientifically" without ever stepping on a scale.

So today's task was clear: step on the scale.

Experimental design: three groups, one counterintuitive filter

I set up three groups:

| Group | What it is | Tools | Turns |
| --- | --- | --- | --- |
| A | Bare model, single-turn API | none | 1 |
| B | My harness | web_search + fetch_url + run_code | up to 8 |
| C | Claude.ai (official) | official harness | unlimited |

All three groups use the same model: claude-sonnet-4-6. That's the most important control in the whole experiment — fix the model, and the only variable left is the harness.

The test set is 30 questions, drawn from GAIA and FRAMES. But the filtering is a little unusual — I first ran the full candidate pool through Group A, and kept only the questions Group A got wrong.

The point of this filter: Group A's baseline is pinned at 0%. Any PASS from B or C is pure harness gain, so nobody gets to coast on the model's built-in knowledge. B and C's scores are therefore the "net output" of their respective harnesses.

| Dataset | Candidates | Kept |
| --- | --- | --- |
| GAIA L1 | 42 | 15 |
| GAIA L2 | 66 | 5 |
| FRAMES | 30 | 10 |
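
The baseline filter described above can be sketched in a few lines. This is a minimal illustration, not the actual eval code: the `Candidate` record, its `baseline_passed` field, and the sample IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    qid: str               # hypothetical question ID
    dataset: str           # e.g. "GAIA L1", "GAIA L2", "FRAMES"
    baseline_passed: bool  # did Group A (bare model, no tools) get it right?


def filter_by_baseline(pool: list[Candidate]) -> list[Candidate]:
    """Keep only the questions the bare model answered incorrectly."""
    return [c for c in pool if not c.baseline_passed]


# Toy pool; the IDs and outcomes are made up for illustration.
pool = [
    Candidate("q1", "GAIA L1", baseline_passed=True),
    Candidate("q2", "GAIA L1", baseline_passed=False),
    Candidate("q3", "FRAMES", baseline_passed=False),
]
kept = filter_by_baseline(pool)

# By construction, Group A's pass rate on the kept set is zero.
baseline_rate = sum(c.baseline_passed for c in kept) / len(kept)
print([c.qid for c in kept], baseline_rate)
```

One caveat with this design: the 0% baseline only holds for the specific Group A run used for filtering. If the bare model is sampled again, it may get some of the kept questions right.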

First hit: the three-group ladder

A (bare model): 0%
B (my harness): 37%
C (Claude.ai): 80%

Let's start with the painful part: yesterday's 50% estimate was optimistic. The actual delivery ratio is 37% / 80% ≈ 46%, or 11 passes against Claude.ai's 24.

But the more interesting thing is the shape of the ladder. 0% to 37% is the "base tax" of having a harness at all — as long as you can let the model search, fetch, and loop a few times, you pick up nearly 40 points. 37% to 80% is the "polish premium" — those 43 points don't come from "adding features," they come from getting every step right.
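
The ladder arithmetic is easy to reproduce from the pass counts reported in this post. The variable names below are my own; only the counts (0, 11, 24 out of 30) come from the run.

```python
TOTAL = 30
passed = {"A": 0, "B": 11, "C": 24}  # pass counts reported in this post

rate = {group: n / TOTAL for group, n in passed.items()}
base_gain = rate["B"] - rate["A"]    # bare model -> my harness
polish_gap = rate["C"] - rate["B"]   # my harness -> Claude.ai
delivery_ratio = passed["B"] / passed["C"]

print(f"B: {rate['B']:.0%}, C: {rate['C']:.0%}")
print(f"base gain: {base_gain * 100:.0f} points")
print(f"polish gap: {polish_gap * 100:.0f} points")
print(f"delivery ratio: {delivery_ratio:.0%}")
```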

Slicing by dataset is even more revealing:

| Dataset | A | B | C |
| --- | --- | --- | --- |
| GAIA L1 (15) | 0% | 53% | 87% |
| GAIA L2 (5) | 0% | 40% | 60% |
| FRAMES (10) | 0% | 10% | 80% |

On GAIA L1, Group B keeps pace — 53% vs. 87%. The gap is real but not embarrassing. On FRAMES, Group B collapses: 10% vs. 80%.
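
As a sanity check, the per-dataset rates can be turned back into pass counts and summed. Note the assumption: the counts below are inferred from the rounded percentages in the table (for example, 53% of 15 questions can only be 8), not read from the raw eval logs.

```python
# dataset -> (size, B passes, C passes); pass counts are inferred from the
# rounded percentages in the table, not taken from the raw logs.
slices = {
    "GAIA L1": (15, 8, 13),
    "GAIA L2": (5, 2, 3),
    "FRAMES": (10, 1, 8),
}

for name, (n, b, c) in slices.items():
    print(f"{name:8s} B={b / n:.0%}  C={c / n:.0%}  gap={(c - b) / n:.0%}")

# The slices should add back up to the headline numbers (11 and 24).
total_b = sum(b for _, b, _ in slices.values())
total_c = sum(c for _, _, c in slices.values())
print(total_b, total_c)
```

The inferred counts sum to 11 and 24, which matches the headline scores, and the FRAMES gap (7 of 10 questions) is about twice the GAIA L1 gap (5 of 15) in relative terms.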

FRAMES is Google's multi-hop reasoning dataset. A typical question looks like: "The CEO of company X was born in a city — what was its population in 2020?" You have to look up the CEO, then their birthplace, then the population. Getting a single search step right isn't enough — you have to digest each search result and feed it cleanly into the next step.

The split tells me something specific: between "can search" and "can reason," my harness's bottleneck is the reasoning side.

Unpacking the 43-point gap

The most valuable part of the whole experiment was the per-question failure attribution on the B vs. C diff. Here's the root-cause breakdown for the 15 questions where C passed but B failed:

| Failure mode | Count | Example IDs |
| --- | --- | --- |
| Poor search quality | 5 | gaia_043, gaia_110, gaia_134, gaia_154, frames_140 |
| Broken reasoning chain | 5 | gaia_156, frames_562, frames_763, frames_808, frames_810 |
| Never triggered search | 2 | gaia_099, frames_647 |
| Inefficient search strategy | 2 | gaia_029, frames_241 |
| Fetch failed | 1 | gaia_110 |
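
A breakdown like this is simple to tally once each failed question has been labeled. The sketch below transcribes the table into a dict (the structure is my own choice) and also checks for IDs listed under more than one failure mode, since a single question can have more than one root cause.

```python
from collections import Counter

# Failure attribution for questions where C passed and B failed,
# transcribed from the table above.
attribution = {
    "Poor search quality": ["gaia_043", "gaia_110", "gaia_134", "gaia_154", "frames_140"],
    "Broken reasoning chain": ["gaia_156", "frames_562", "frames_763", "frames_808", "frames_810"],
    "Never triggered search": ["gaia_099", "frames_647"],
    "Inefficient search strategy": ["gaia_029", "frames_241"],
    "Fetch failed": ["gaia_110"],
}

counts = {mode: len(ids) for mode, ids in attribution.items()}
id_counts = Counter(qid for ids in attribution.values() for qid in ids)
multi_cause = sorted(qid for qid, n in id_counts.items() if n > 1)

# Print failure modes from most to least frequent.
for mode, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{n:2d}  {mode}")
print("IDs listed under more than one mode:", multi_cause)
```

Run against the table as printed, gaia_110 shows up under both "Poor search quality" and "Fetch failed", so the 15 entries cover 14 distinct IDs.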

Search quality: the single biggest source of the gap

Five questions. I'm using a fallback stack of Serper / Tavily / Google CSE / Brave / DuckDuckGo. Group C uses Anthropic's in-house search stack.
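
For readers who haven't built one, a fallback stack is roughly the pattern below. This is a sketch under assumptions, not the harness's actual code: the `SearchFn` signature and the stub providers are hypothetical, and a real version would wrap the API clients for the services named above.

```python
from typing import Callable

SearchFn = Callable[[str], list[str]]  # query -> list of result URLs (assumed shape)


def search_with_fallback(
    query: str, providers: list[tuple[str, SearchFn]]
) -> tuple[str, list[str]]:
    """Try each provider in order; return the first non-empty result set."""
    errors = []
    for name, fn in providers:
        try:
            results = fn(query)
        except Exception as exc:  # provider down, rate-limited, etc.
            errors.append(f"{name}: {exc}")
            continue
        if results:
            return name, results
        errors.append(f"{name}: no results")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real API clients.
def flaky(query: str) -> list[str]:
    raise TimeoutError("simulated timeout")


def empty(query: str) -> list[str]:
    return []


def working(query: str) -> list[str]:
    return [f"https://example.com/search?q={query}"]


used, urls = search_with_fallback(
    "test", [("first", flaky), ("second", empty), ("third", working)]
)
print(used, urls)
```

A chain like this only protects availability. It says nothing about whether the results that come back are any good, which is the problem the five failures point at.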

How does the gap show up? Take an example: same question, my search returns a pile of ad-laden SEO aggregator pages; C's first result is the original paper or primary data source. Search is the agent's eyes. If the eyes don't resolve clearly, all the cleverness downstream is wasted.

Broken reasoning chain: the killer on FRAMES

Also five questions, and these are harder to fix. The failure mode: hop 1 lands correctly, hop 2 builds a new query off hop 1's result but drifts slightly, and the error compounds. By hop 5, the trajectory is completely off-topic.

Group C is scary-stable on these. Median search count is just 2–3, but every single one lands. It's not "searches more" — it's "searches precisely and doesn't drift in between."

I suspect this comes down to system prompt design and context management. Anthropic's harness has almost certainly accumulated a lot of know-how on "how to compress the previous step's result into the next step's prompt" — which is exactly the "context engineering" the whitepaper talked about.
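
To make "compress the previous step's result" concrete, here is one possible pattern: carry a short list of established facts between hops instead of raw search output, and build each new query from the last extracted value only. This is purely illustrative. I have no visibility into how Anthropic's harness does it, and `HopState`, `run_hops`, and the toy lookup table are all made up.

```python
from dataclasses import dataclass, field


@dataclass
class HopState:
    question: str
    facts: list[str] = field(default_factory=list)  # one short line per resolved hop

    def prompt_context(self) -> str:
        """Compact context for the next step: the question plus facts, no raw pages."""
        lines = [f"Question: {self.question}", "Established facts:"]
        lines += [f"- {fact}" for fact in self.facts] or ["- (none yet)"]
        return "\n".join(lines)


# Toy lookup table standing in for search + extraction; all values are made up.
TOY_KB = {
    "CEO of Company X": "Alice Example",
    "birthplace of Alice Example": "Exampleville",
    "2020 population of Exampleville": "12,345",
}


def run_hops(question: str, plan: list[str]) -> HopState:
    state = HopState(question)
    answer = ""
    for template in plan:
        # The next query is built from the last extracted value only,
        # so noise from earlier result pages cannot leak into it.
        query = template.format(prev=answer)
        answer = TOY_KB[query]
        state.facts.append(f"{query}: {answer}")
    return state


state = run_hops(
    "What was the 2020 population of the birthplace of Company X's CEO?",
    ["CEO of Company X", "birthplace of {prev}", "2020 population of {prev}"],
)
print(state.prompt_context())
```

The design choice worth noticing is what gets dropped: each hop contributes one line to the context, so a slightly off result page has far less room to pull the next query off course.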

Never triggered search: the model winging it

Two questions where neither B nor C searched — both answered straight from memory. The difference: B got it wrong, C got it right.

Same model, same question — why does one get it right and the other doesn't? My read is that the system prompt biases each toward different default behaviors. My prompt doesn't strongly steer the model toward "must search for factual questions," so it sometimes overestimates its own recall.

The fix is trivial: add one line to the system prompt. But that's exactly the point — prompt engineering has enormous leverage. A single sentence can be worth several percentage points.
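
A sketch of what that one-line change could look like, written so the rule can be toggled for an A/B run. The wording of both strings is hypothetical (the actual prompt isn't reproduced in this post); only the tool names come from the Group B setup.

```python
# Hypothetical base prompt; only the tool names come from the Group B setup.
BASE_PROMPT = (
    "You are a research assistant with access to web_search, fetch_url, and run_code."
)

# Hypothetical wording for the search-first rule.
SEARCH_FIRST_RULE = (
    "For any question that depends on facts, dates, or figures, "
    "call web_search before answering, even if you believe you know the answer."
)


def build_system_prompt(base: str, search_first: bool) -> str:
    """Return the system prompt, optionally with the search-first rule appended."""
    return f"{base}\n\n{SEARCH_FIRST_RULE}" if search_first else base


print(build_system_prompt(BASE_PROMPT, search_first=True))
```

Keeping the rule behind a flag makes it cheap to rerun the same 30 questions with and without it and see how many points the sentence is actually worth.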

Reverse signal: the two questions B passed and C didn't

There are two reverse anomalies in the table:

These two point at an interesting design tradeoff: Claude.ai seems to lean toward letting the model reason on its own, while my harness is more "tools first." On questions that benefit from hard verification, tools-first actually wins.

The design space for a harness isn't one-dimensional. "More like Claude.ai" doesn't always mean "better." Anthropic made tradeoffs too — their tradeoffs are just better than mine in most cases.

An accidental finding: dataset contamination

For gaia_016 and gaia_029, C's search trace shows Claude.ai found the GAIA ground-truth page on HuggingFace and copied the answer straight from there.

This isn't Claude.ai cheating. It simply used what its search returned. But it exposes a real problem with the benchmark: any dataset whose ground truth lives on the public web stops being a fair test of a search-enabled agent.

The good news is this doesn't hurt the B/C comparison — in principle B could find it too. For the next round I'm thinking about either using the FRAMES test split, or just writing my own question set sourced from 2026 news.
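
Until the question set changes, a cheap guard is to flag any run whose trace touches a host known to publish the ground truth. A minimal sketch, assuming each trace is available as a list of visited URLs; the function name, the trace format, and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

# Hosts where benchmark ground truth is assumed to be published.
# huggingface.co comes from the traces described above; extend as needed.
GROUND_TRUTH_HOSTS = frozenset({"huggingface.co"})


def is_contaminated(
    trace_urls: list[str], hosts: frozenset[str] = GROUND_TRUTH_HOSTS
) -> bool:
    """True if any URL the agent touched is on (or under) a ground-truth host."""
    for url in trace_urls:
        host = (urlparse(url).hostname or "").lower()
        if any(host == h or host.endswith("." + h) for h in hosts):
            return True
    return False


# Hypothetical traces; the URLs are placeholders, not the real ones.
traces = {
    "q_clean": ["https://example.org/article"],
    "q_leaky": ["https://example.org/a", "https://huggingface.co/datasets/some/benchmark"],
}
flagged = [qid for qid, urls in traces.items() if is_contaminated(urls)]
print(flagged)
```

Matching on host alone is deliberately coarse: it will also flag unrelated Hugging Face pages, so flagged runs still need a manual look before they are excluded.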

The speed gap is just as significant

| Metric | B | C |
| --- | --- | --- |
| Avg time on passed questions | 31s | 16s |
| Avg time on failed questions | 55s | 37s |
| Avg search count on passed questions | 3.2 | 2.1 |

Group C uses fewer searches and less time to land a higher pass rate. That's the real gap: not "close, but slightly better," but "leading on every axis."

Next: prioritized improvements

Once the experiment was done, the improvement queue basically wrote itself:

P0: Search quality (+10~25pp)
  • Plug in a better search backend (Anthropic's stack, or direct Google)
  • Fetch retries + Jina Reader as fallback
  • Expected gain: +5~8 questions

P1: Prompt and reasoning strategy (+3~5 questions)
  • Require the model to plan before searching
  • Force run_code verification for anything involving computation
  • Terminate early after two consecutive turns with no new information

P2: Search trigger policy (+2 questions)
  • Bake "factual questions must search first" into the system prompt
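
Of the items above, the early-termination rule is the easiest to pin down in code. The sketch below assumes "new information" means items (URLs or extracted facts) not seen in any earlier turn. That definition, the function name, and the toy data are mine; the 8-turn cap is the Group B limit from the setup table.

```python
MAX_TURNS = 8    # turn cap for Group B, from the setup table
STALL_LIMIT = 2  # stop after this many consecutive turns with nothing new


def run_until_stalled(turn_results: list[set[str]]) -> tuple[int, set[str]]:
    """Consume per-turn result sets until the turn cap or a stall.

    Returns (turns_used, everything_seen).
    """
    seen: set[str] = set()
    stalled = 0
    turns_used = 0
    for found in turn_results[:MAX_TURNS]:
        turns_used += 1
        new = found - seen
        seen |= new
        stalled = 0 if new else stalled + 1
        if stalled >= STALL_LIMIT:
            break
    return turns_used, seen


# Toy run: turns 3 and 4 return nothing new, so the loop stops at turn 4.
turns, seen = run_until_stalled([{"a", "b"}, {"b", "c"}, {"c"}, {"a"}, {"d"}])
print(turns, sorted(seen))
```

In the toy run the fifth turn would have found something new, which is the tradeoff with any stall rule: it saves time on dead-end trajectories at the cost of occasionally stopping one turn too early.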

If all of these land, Group B should move from 37% to somewhere in the 55–65% range. Still short of 80%, but at least within striking distance.

Closing thoughts: this is what the methodology actually looks like

Looking back, the most valuable output of this experiment isn't the 37% number — it's the fact that the workflow itself runs end-to-end:

  1. Define comparison groups
  2. Filter questions with a baseline
  3. Run all three groups
  4. Attribute failures per question
  5. Sort by priority
  6. Iterate

Most agent projects don't fail because the model isn't strong enough or because the developers aren't smart enough. They fail because there's no feedback loop that lets the data do the talking.

Today I validated the other half of that statement on my own project: once the loop is in place, your improvement direction stops being a gut feeling and becomes a prioritized backlog with estimated gains attached.

I now know what to change next week, what not to change, and roughly how much improvement to expect afterward. That feeling of "knowing what you're actually doing" is, I think, what eval-driven methodology is really about.

The next post will cover what happens after the P0 work ships. If the +10 points materialize, the methodology has earned its keep, at least on me.

Yao Yuheng / 姚钰珩

MSc in Data Science, NTU. Focus: AI agent systems engineering, eval-driven development, LLM applications.

Originally published at sg.yaoyuheng2001.me. Please credit the source when reposting.
