Eval Report

I Benchmarked Myself Against Claude.ai with 30 Questions

An honest physical for my harness

2026-05-13 Eval Agent Benchmark Claude

In yesterday's whitepaper I threw out a gut estimate: my hand-rolled agent was probably delivering about 50% of what Claude.ai feels like. Today I finished the first eval run and pinned that number down.

My Harness: 37% (11/30 passed)

Claude.ai: 80% (24/30 passed)

Starting point: a gut estimate I wasn't comfortable with

If you read yesterday's whitepaper, you'll remember the core claim was:

The model gives you 100% of the potential. The harness decides how much of that potential ships.

The setup for that claim was a Claude.ai clone I spent a couple of days building — lite-claude-ui — which covered the core capabilities: search, code execution, long-form document generation. Feature-wise it looked complete. But every time I actually used it, I knew it was a notch below the real thing. I ballparked the gap at "around 50%."

The second half of the whitepaper is all about eval-driven methodology. After I shipped it I felt slightly uneasy: I had written about the method without actually applying it to myself. It's like publishing "how to lose weight scientifically" without ever stepping on a scale.

So today's task was clear: step on the scale.

Experimental design: three groups, one counterintuitive filter

I set up three groups:

Group	What it is	Tools	Turns
A	Bare model, single-turn API	none	1
B	My harness	web_search + fetch_url + run_code	up to 8
C	Claude.ai (official)	official harness	unlimited

All three groups use the same model: claude-sonnet-4-6. That's the most important control in the whole experiment — fix the model, and the only variable left is the harness.

The test set is 30 questions, drawn from GAIA and FRAMES. But the filtering is a little unusual — I first ran the full candidate pool through Group A, and kept only the questions Group A got wrong.

The point of this filter: Group A's baseline is pinned at 0%. Any PASS from B or C is pure harness gain — nobody gets to coast on the model's built-in knowledge. So B and C's scores are the "net output" of their respective harnesses.
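
A minimal sketch of that filter, assuming the bare-model verdicts are already recorded per question. The names `Candidate`, `baseline_passed`, and `filter_by_baseline` are illustrative, not taken from the actual eval code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    qid: str               # hypothetical ID, e.g. "q1"
    dataset: str           # e.g. "GAIA L1"
    baseline_passed: bool  # verdict of the bare single-turn model (Group A)


def filter_by_baseline(pool: list[Candidate]) -> list[Candidate]:
    """Keep only questions the bare model failed, so Group A scores 0% by construction."""
    return [c for c in pool if not c.baseline_passed]


# Toy pool, not real benchmark data.
pool = [
    Candidate("q1", "GAIA L1", baseline_passed=True),
    Candidate("q2", "GAIA L1", baseline_passed=False),
    Candidate("q3", "FRAMES", baseline_passed=False),
]
kept = filter_by_baseline(pool)
print([c.qid for c in kept])
```

Because every kept question is a Group A failure, any pass from B or C on this set cannot be explained by the single-turn baseline alone.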

Dataset	Candidates	Kept
GAIA L1	42	15
GAIA L2	66	5
FRAMES	30	10

First hit: the three-group ladder

A (bare model): 0%

B (my harness): 37%

C (Claude.ai): 80%

Let's start with the painful part: my 50% estimate yesterday was optimistic. The actual delivery ratio is 37% / 80% ≈ 46%.

But the more interesting thing is the shape of the ladder. 0% to 37% is the "base tax" of having a harness at all — as long as you can let the model search, fetch, and loop a few times, you pick up nearly 40 points. 37% to 80% is the "polish premium" — those 43 points don't come from "adding features," they come from getting every step right.
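
The ladder arithmetic can be reproduced from the pass counts alone. This is just a worked check using the numbers reported above (0, 11, and 24 passes out of 30); the variable names are illustrative.

```python
TOTAL = 30
passed = {"A": 0, "B": 11, "C": 24}  # pass counts reported for each group

rate = {group: n / TOTAL for group, n in passed.items()}
harness_gain = rate["B"] - rate["A"]        # step from bare model to my harness
polish_gap = rate["C"] - rate["B"]          # step from my harness to Claude.ai
delivery_ratio = passed["B"] / passed["C"]  # share of C's passes that B delivers

print(f"B: {rate['B']:.0%}, C: {rate['C']:.0%}")
print(f"A->B gain: {harness_gain * 100:.0f} points")
print(f"B->C gap: {polish_gap * 100:.0f} points")
print(f"delivery ratio: {delivery_ratio:.0%}")
```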

Slicing by dataset is even more revealing:

Dataset	A	B	C
GAIA L1 (15)	0%	53%	87%
GAIA L2 (5)	0%	40%	60%
FRAMES (10)	0%	10%	80%

On GAIA L1, Group B keeps pace — 53% vs. 87%. The gap is real but not embarrassing. On FRAMES, Group B collapses: 10% vs. 80%.
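
As a sanity check, the per-dataset percentages can be turned back into pass counts. The counts below are inferred by rounding the percentages in the table (for example, 53% of 15 is taken as 8), so treat them as an assumption rather than raw log data.

```python
# (dataset, size, B passes, C passes); pass counts are inferred from the
# rounded percentages in the table, not read from raw logs.
rows = [
    ("GAIA L1", 15, 8, 13),
    ("GAIA L2", 5, 2, 3),
    ("FRAMES", 10, 1, 8),
]

for name, n, b, c in rows:
    gap = (c - b) / n * 100
    print(f"{name:8s} B={b / n:.0%} C={c / n:.0%} gap={gap:.0f} points")

total_b = sum(b for _, _, b, _ in rows)
total_c = sum(c for _, _, _, c in rows)
print(total_b, total_c)
```

The inferred counts add up to the headline totals of 11 and 24, which suggests the slice and the overall score are consistent with each other.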

FRAMES is Google's multi-hop reasoning dataset. A typical question looks like: "The CEO of company X was born in a city — what was its population in 2020?" You have to look up the CEO, then their birthplace, then the population. Getting a single search step right isn't enough — you have to digest each search result and feed it cleanly into the next step.

The split tells me something specific: between "can search" and "can reason," my harness's bottleneck is the reasoning side.

Unpacking the 43-point gap

The most valuable part of the whole experiment was the per-question failure attribution on the B vs. C diff. Here's the root-cause breakdown for the 15 questions where C passed but B failed:

Failure mode	Count	Example IDs
Poor search quality	5	gaia_043, gaia_110, gaia_134, gaia_154, frames_140
Broken reasoning chain	5	gaia_156, frames_562, frames_763, frames_808, frames_810
Never triggered search	2	gaia_099, frames_647
Inefficient search strategy	2	gaia_029, frames_241
Fetch failed	1	gaia_110
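
A small sketch of how such an attribution table can be tallied and sanity-checked. The IDs are copied from the table as listed; the dictionary layout and variable names are assumptions for illustration.

```python
from collections import Counter

# Failure mode -> question IDs, exactly as listed in the table above.
attribution = {
    "Poor search quality": ["gaia_043", "gaia_110", "gaia_134", "gaia_154", "frames_140"],
    "Broken reasoning chain": ["gaia_156", "frames_562", "frames_763", "frames_808", "frames_810"],
    "Never triggered search": ["gaia_099", "frames_647"],
    "Inefficient search strategy": ["gaia_029", "frames_241"],
    "Fetch failed": ["gaia_110"],
}

counts = Counter({mode: len(ids) for mode, ids in attribution.items()})
for mode, n in counts.most_common():
    print(f"{n:2d}  {mode}")

# Check whether any question carries more than one label.
labels_per_question = Counter(qid for ids in attribution.values() for qid in ids)
multi = sorted(q for q, n in labels_per_question.items() if n > 1)
print("total labels:", sum(counts.values()))
print("listed under more than one mode:", multi)
```

As listed, the counts sum to 15, with gaia_110 appearing under two modes, so a check like this is a cheap way to catch double-labelled or missing IDs before drawing conclusions from the breakdown.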

Search quality: tied for the biggest source of the gap

Five questions. I'm using a fallback stack of Serper / Tavily / Google CSE / Brave / DuckDuckGo. Group C uses Anthropic's in-house search stack.

How does the gap show up? For the same question, my search returns a pile of ad-laden SEO aggregator pages, while C's first result is the original paper or primary data source. Search is the agent's eyes. If the eyes don't resolve clearly, all the cleverness downstream is wasted.
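
For reference, a fallback stack of this kind usually boils down to "try providers in order until one returns something." The sketch below assumes each provider is wrapped as a plain callable; `search_with_fallback` and the stub providers are hypothetical and stand in for real API clients.

```python
from typing import Callable

SearchFn = Callable[[str], list[str]]  # query -> list of result URLs


def search_with_fallback(
    query: str, providers: list[tuple[str, SearchFn]]
) -> tuple[str, list[str]]:
    """Try providers in order; return the first non-empty result list and who served it."""
    errors = []
    for name, search in providers:
        try:
            results = search(query)
        except Exception as exc:  # a real harness would catch narrower errors
            errors.append(f"{name}: {exc}")
            continue
        if results:
            return name, results
        errors.append(f"{name}: empty result")
    raise RuntimeError("all search providers failed: " + "; ".join(errors))


# Stub providers standing in for real API clients (hypothetical).
def flaky(query: str) -> list[str]:
    raise TimeoutError("timed out")


def empty(query: str) -> list[str]:
    return []


def working(query: str) -> list[str]:
    return [f"https://example.org/search?q={query}"]


served_by, urls = search_with_fallback(
    "gaia", [("first", flaky), ("second", empty), ("third", working)]
)
print(served_by, urls)
```

Note what this pattern does and does not buy: it protects against outages and empty responses, but it does nothing for ranking quality, which is the problem behind these five failures.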

Broken reasoning chain: the killer on FRAMES

Also five questions, and these are harder to fix. The failure mode: hop 1 lands correctly, hop 2 builds a new query off hop 1's result but drifts slightly, and the error compounds. By hop 5, the trajectory is completely off-topic.
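
A toy model makes the compounding visible. It assumes every hop stays on target with the same probability and that hops are independent; both are simplifying assumptions, and the 0.9 figure is illustrative, not measured.

```python
def chain_success(per_hop_accuracy: float, hops: int) -> float:
    """Probability that every hop stays on target, assuming independent hops."""
    return per_hop_accuracy ** hops


# 0.9 is an illustrative per-hop accuracy, not a measured value.
for hops in (1, 2, 3, 5):
    print(hops, f"{chain_success(0.9, hops):.2f}")
```

Under these assumptions, a harness that is right 90% of the time per hop finishes a five-hop chain cleanly only about 59% of the time, which is why small drift at hop 2 matters so much on multi-hop questions.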

Group C is scary-stable on these. Median search count is just 2–3, but every single one lands. It's not "searches more" — it's "searches precisely and doesn't drift in between."

I suspect this comes down to system prompt design and context management. Anthropic's harness has almost certainly accumulated a lot of know-how on "how to compress the previous step's result into the next step's prompt" — which is exactly the "context engineering" the whitepaper talked about.

Never triggered search: the model winging it

Two questions where neither B nor C searched — both answered straight from memory. The difference: B got it wrong, C got it right.

Same model, same question — why does one get it right and the other doesn't? My read is that the system prompt biases each toward different default behaviors. My prompt doesn't strongly steer the model toward "must search for factual questions," so it sometimes overestimates its own recall.

The fix is trivial: add one line to the system prompt. But that's exactly the point — prompt engineering has enormous leverage. A single sentence can be worth several percentage points.
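
A sketch of what that one line could look like when kept as a toggle, so the change can be measured on the eval set rather than assumed. The base prompt and the rule wording are hypothetical; only the tool names come from the harness description.

```python
# Hypothetical base prompt; only the tool names match the harness described above.
BASE_PROMPT = "You are a research assistant with web_search, fetch_url and run_code tools."

# Hypothetical wording; the exact sentence would need tuning against the eval.
SEARCH_FIRST_RULE = (
    "For any question that depends on facts, dates, names or numbers, "
    "call web_search before answering, even if you believe you know the answer."
)


def build_system_prompt(base: str, search_first: bool = True) -> str:
    """Append the search-first rule so it can be switched on and off between eval runs."""
    return f"{base}\n\n{SEARCH_FIRST_RULE}" if search_first else base


print(build_system_prompt(BASE_PROMPT))
```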

Reverse signal: the two questions B passed and C didn't

There are two reverse anomalies in the per-question results:

gaia_153 (Rubik's cube logic): B wrote code to verify and passed. C reasoned through it by hand and got it wrong.
gaia_002 (Nature article statistics): B found the exact number via search and passed. C found the same article but botched the reasoning step.

These two point at an interesting design tradeoff: Claude.ai seems to lean toward letting the model reason on its own, while my harness is more "tools first." On questions that benefit from hard verification, tools-first actually wins.

The design space for a harness isn't one-dimensional. "More like Claude.ai" doesn't always mean "better." Anthropic made tradeoffs too — their tradeoffs are just better than mine in most cases.

An accidental finding: dataset contamination

For gaia_016 and gaia_029, C's search trace shows Claude.ai found the GAIA ground-truth page on HuggingFace and copied the answer straight from there.

This isn't Claude.ai cheating — it's just being honest about its search results. But it exposes a real problem with the benchmark: any dataset whose ground truth lives on the public web is unfair to a search-enabled agent.

The good news is this doesn't hurt the B/C comparison — in principle B could find it too. For the next round I'm thinking about either using the FRAMES test split, or just writing my own question set sourced from 2026 news.
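
Contamination of this kind can be flagged automatically from the search traces. The sketch below assumes each trace is a list of visited URLs per question; the host list, path hints, and example URLs are assumptions for illustration, not the actual pages involved.

```python
from urllib.parse import urlparse

# Assumed markers: hosts and path fragments where benchmark answers may live.
BENCHMARK_HOSTS = {"huggingface.co"}
BENCHMARK_PATH_HINTS = ("gaia", "frames")


def is_contaminated(url: str) -> bool:
    """Flag a visited URL that looks like a page hosting the benchmark itself."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    on_benchmark_host = any(host == h or host.endswith("." + h) for h in BENCHMARK_HOSTS)
    path = parsed.path.lower()
    return on_benchmark_host and any(hint in path for hint in BENCHMARK_PATH_HINTS)


def contaminated_questions(traces: dict[str, list[str]]) -> list[str]:
    """Return question IDs whose search trace touched a benchmark-hosting page."""
    return sorted(
        qid for qid, urls in traces.items() if any(is_contaminated(u) for u in urls)
    )


# Made-up traces for illustration only.
traces = {
    "q1": ["https://example.org/paper.pdf"],
    "q2": ["https://huggingface.co/datasets/example-org/gaia-mirror/viewer"],
}
print(contaminated_questions(traces))
```

Flagged questions can then be reported separately or excluded, which keeps the pass rate from being inflated by answers copied off the benchmark's own pages.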

The speed gap is just as significant

Metric	B	C
Avg time on passed questions	31s	16s
Avg time on failed questions	55s	37s
Avg search count on passed questions	3.2	2.1

Group C uses fewer searches and less time to land a higher pass rate. That's the real gap — not "close, but slightly better." It's "leading on every axis."

Next: prioritized improvements

Once the experiment was done, the improvement queue basically wrote itself:

P0 — Search quality +10~25pp

Plug in a better search backend (Anthropic's stack, or direct Google)
Fetch retries + Jina Reader as fallback
Expected gain: +5~8 questions
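
The fetch item can be sketched as "retry the primary fetcher with backoff, then hand the URL to a fallback reader once." Both fetchers are passed in as plain callables here; `fetch_with_retries` and the stubs are hypothetical, and the client code for a reader service such as Jina Reader is deliberately left out.

```python
import time
from typing import Callable

Fetcher = Callable[[str], str]  # url -> page text, raises on failure


def fetch_with_retries(
    url: str,
    primary: Fetcher,
    fallback: Fetcher,
    attempts: int = 3,
    base_delay: float = 0.5,
) -> str:
    """Retry the primary fetcher with exponential backoff, then try the fallback once."""
    for attempt in range(attempts):
        try:
            return primary(url)
        except Exception:  # a real harness would catch narrower errors
            if attempt < attempts - 1:
                time.sleep(base_delay * 2 ** attempt)
    return fallback(url)


# Stubs standing in for a direct HTTP fetch and a reader-service fallback.
calls: list[str] = []


def always_fails(url: str) -> str:
    calls.append("primary")
    raise ConnectionError("blocked")


def reader_stub(url: str) -> str:
    calls.append("fallback")
    return f"text of {url}"


page = fetch_with_retries("https://example.org", always_fails, reader_stub, base_delay=0.0)
print(page, calls)
```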

P1 — Prompt and reasoning strategy +3~5 questions

Require the model to plan before searching
Force run_code verification for anything involving computation
Terminate early after two consecutive turns with no new information
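
The early-termination rule is easy to express once "new information" is reduced to a number per turn. How that number is measured (new facts, new URLs, new entities) is left open here; `should_stop` and `run_loop` are hypothetical helpers, and the 8-turn cap mirrors Group B's limit.

```python
def should_stop(new_facts_per_turn: list[int], patience: int = 2) -> bool:
    """Stop once the last `patience` turns each contributed no new information."""
    if len(new_facts_per_turn) < patience:
        return False
    return all(n == 0 for n in new_facts_per_turn[-patience:])


def run_loop(turn_yields: list[int], max_turns: int = 8) -> int:
    """Replay per-turn yields and return how many turns were actually used."""
    history: list[int] = []
    for gained in turn_yields[:max_turns]:
        history.append(gained)
        if should_stop(history):
            break
    return len(history)


# Turns 3 and 4 add nothing, so the loop stops after turn 4.
print(run_loop([3, 1, 0, 0, 2, 0]))
```

The trade-off is that a stalled search sometimes recovers on the next turn, so the `patience` value is itself something to tune against the eval set.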

P2 — Search trigger policy +2 questions

Bake "factual questions must search first" into the system prompt

If all of these land, Group B should move from 37% to somewhere in the 55–65% range. Still short of 80%, but at least in the same ballpark.

Closing thoughts: this is what the methodology actually looks like

Looking back, the most valuable output of this experiment isn't the 37% number — it's the fact that the workflow itself runs end-to-end:

Define comparison groups
Filter questions with a baseline
Run all three groups
Attribute failures per question
Sort by priority
Iterate

Most agent projects don't fail because the model isn't strong enough or because the developers aren't smart enough. They fail because there's no feedback loop that lets the data do the talking.

Today I validated the other half of that statement on my own project: once the loop is in place, your improvement direction stops being a gut feeling and becomes a prioritized backlog with estimated gains attached.

I now know what to change next week, what not to change, and roughly how much improvement to expect afterward. That feeling of "knowing what you're actually doing" is, I think, what eval-driven methodology is really about.

The next post will cover what happens after the P0 work ships. If the +10 points materialize, the methodology has earned its keep, at least on me.

Yao Yuheng / 姚钰珩

MSc in Data Science, NTU. Focus: AI agent systems engineering, eval-driven development, LLM applications.

Originally published at sg.yaoyuheng2001.me. Please credit the source when reposting.
