Tutorial + Experiment Report

Tuning Search to Claude.ai Level

Part 1 explains the mechanics (tutorial); Part 2 validates with data (experiment report)

2026-06-10 Search Agent Caching Claude 中文 →

The last eval pinned the gap: fixing the model and varying only the harness, my hand-rolled agent delivered 37% / 80% ≈ 46% of Claude.ai — and search precision alone accounted for nearly half the failing tasks. This post cashes in that "after the P0 optimization" flag.

Structure

Part 1 · Tutorial: the 5 engineering optimizations behind Claude.ai's fast+accurate search, each with an API example and official citation.
Part 2 · Experiment report: each treatment tested on a hand-rolled harness, with latency / token / per-task pass-fail data.

Up front: Claude.ai's search is not a magic engine (its backend is actually Brave Search — citation overlap with Brave's top results is ~86.7%). Its "fast and accurate" is 5 decomposable, replicable engineering optimizations. Here's each, with something you can copy.

1Content returns inline with results — no separate fetch

Mechanic: one web_search call returns, per result, ~500 words of query-relevant page content already attached, passage-selected index-side. One search yields the relevant text — no separate fetch, no extra model call to extract.

Example: a request with the search tool

POST /v1/messages
{
  "model": "claude-opus-4-...",
  "tools": [{ "type": "web_search_20250305",
              "name": "web_search", "max_uses": 5 }],
  "messages": [{ "role": "user", "content": "..." }]
}

Shape of each returned result (note encrypted_content)

{
  "type": "web_search_tool_result",
  "content": [{
    "type": "web_search_result",
    "url": "https://en.wikipedia.org/wiki/...",
    "title": "...",
    // ↓ ~500 words, query-selected, goes straight into context
    "encrypted_content": "EqgfCiB...(~4000-6300 bytes)",
    "page_age": "..."
  }]
}

Contrast: Claude Code (the CLI) discards those inline snippets, keeping only title/url, and re-fetches via WebFetch (Turndown + Haiku) when needed. Same engine, two content pipelines.

Sources: Web Search Tool (official) · The Claude Code WebSearch Black Box · Inside Claude Code's Web Tools

2Server-side execution + tool/generation overlap

Mechanic: search runs inside Anthropic's infra (no client round-trips), and the model can emit multiple tool calls in one message, executed in parallel.

Example: parallel tool calls in one assistant turn

"content": [
  { "type": "tool_use", "id": "a", "name": "web_search",
    "input": { "query": "..." } },
  { "type": "tool_use", "id": "b", "name": "web_search",
    "input": { "query": "..." } }   // two in parallel
]

Can you do it? Parallel tools — yes (run tool execution with Promise.all). Server-side co-location and generation-overlap — no, that's the moat. The good news: those aren't the bulk of the speed.

Source: Parallel Tool Use (official)

3"Code filtering," not "call another model"

Mechanic: when results are many, dynamic filtering has the model write code and run it in a sandbox to filter/rank/extract, sending only the code's output into context. Deterministic, token-cheap, zero extra model round-trips.

Example: with code execution on, the model writes the filter

# Claude generates and runs inside code_execution:
results = load_search_results()
hits = [r for r in results
        if "Highlands and Islands" in r["text"]
        and any(y in r["text"] for y in ("2021","2016"))]
print(hits[:3])   # only these enter context

Versus "ask another LLM to read and filter": that's a full inference round-trip (100s of ms to seconds); code filtering is milliseconds. This becomes the mirror for the mistake I make below.

Source: Advanced Tool Use (programmatic tool calling)

4Prompt Caching: cache the stable prefix, up to 85% latency cut

Mechanic: cache the big, stable tools + system prefix; rounds 2+ read it. Official numbers: up to 85% latency and 90% cost reduction; a 100K-token example dropped from 11.5s to 2.4s.

Example: cache_control breakpoint on system / last tool

{
  "system": [{ "type": "text", "text": "<long system prompt>",
               "cache_control": { "type": "ephemeral" } }],
  "tools":  [ ...,
    { "name": "web_search", ...,
      "cache_control": { "type": "ephemeral" } } ]  // caches the tools block
}

The discipline (where people trip): processing order is tools → system → messages, and the cached prefix must be stable. Cache only tools+system; never put per-turn-changing tool results inside the cached region, or it invalidates every turn and gets slower — there's an arXiv study on exactly this (boundary control improves TTFT 13–31%; naive full caching increases latency).

Sources: Prompt Caching (official) · Prompt Caching docs · "Don't Break the Cache" (arXiv)

5Selective retrieval after search, not bulk-fetch every time

Mechanic: the docs recommend search → selective retrieve. Simple factual queries use 1–3 searches, with max_uses bounding the count.

Example: bound the number of searches

"tools": [{ "type": "web_search_20250305",
            "name": "web_search",
            "max_uses": 3 }]   // latency-sensitive: cap at 3

Source: Web Search Tool · max_uses (official)

Tutorial takeaway: it's fast because it doesn't walk "search → fetch → call-a-model-to-extract" — content returns inline (#1), filtering is code not a model (#3), the stable prefix is cached (#4). It's accurate because the inline snippet is index-side query-selected + code-filtered + selectively deep-fetched (#5). Below I port all five to my own harness and measure what each is worth.

Setup

Subject: a hand-rolled harness (the agent-research runner, same family as the production claude-ai-harness).
Controlled variable: model fixed (same upstream); only the retrieval/fetch/context pipeline changes.
Dataset: 30 GAIA+FRAMES tasks the naked model fails, run in full.
Judge: contains — whether the model's FINAL ANSWER contains the gold answer (case-insensitive).
Baselines: naked model 0 / 30, original harness (B) 11 / 30 (37%), Claude.ai (C) 24 / 30 (80%).
Grade caveat: the upstream (luckyapi) is not vanilla Anthropic and temperature 0 doesn't fully reproduce, so a single run is a point estimate, not a verdict (this is also why frames_241 below is flaky).

The old architecture

The original harness was the most naive "search → fetch → char-truncate" three-stage pipeline, losing information at every layer:

Retrieval: Serper-primary, snippets only — no body text, so a separate page fetch was forced.
Fetch: one web_search auto-fetched the top 3 pages, and bare fetch doesn't render JS, so anti-bot often blocked it.
Ingestion: fetched text entered context by positional truncation slice(0, 6000) — query-blind, so any answer past char 6000 was silently dropped (a hidden bug).
Context: Haiku compressed history only above 30k tokens (also query-blind), rewriting history every round and shattering the cacheable prefix.
Loop: capped at 8 rounds.

Three fatal points: can't get body text, fragile fetch, answers positionally truncated; multi-hop tasks rarely survived to convergence.

The current architecture

Following the 5 optimizations dissected in Part 1, every layer was reshaped into "retrieval-style, zero extra generation, cacheable":

layer	before	after
retrieval	Serper, snippets only	Brave-primary + `extra_snippets` (~500 words inline); 9-level fallback chain
fetch	top-3 every search	inline-first: don't fetch if snippet is rich; else selectively fetch top-1
extraction	`slice(6000)` positional cut	Exa `/contents` highlights (embedding selection, verbatim, no LLM round-trip)
context	rewrite history each round	append-only + `cache_control` on `system/tools`
loop	8 rounds	16 rounds + multi-hop planning prompt + network retry

Architecture comparison: left BEFORE (red dot) is a four-step vertical pipeline Search snippet only → Fetch top-3 pages → Truncate 6000 chars → Rewrite history, an 8-rounds loop around it and a red clock timeout at the bottom; right AFTER (green dot) is Search inline 500 words → Inline-first skip fetch → Exa highlights → Append-only + cache, a 16-rounds loop and a green lightning fast at the bottom — The same retrieval pipeline, rewritten layer by layer: the left loses information at every stage and times out before it finishes; the right is inline-first + cached and survives 16 rounds

What changed

Condensed to the four behavior-changing moves (knob-tuning omitted):

Inline-first (biggest speedup): let relevant body text come back with the search results, skip the fetch whenever possible — cutting the default "3 fetches + 3 extraction calls" per search to 0 in the common case.
Extract by retrieval, not generation: hand "read the right passage" to Exa highlights (semantic selection) instead of calling Haiku per fetched page.
Cache the stable prefix + append-only history: cache system+tools, append messages only, drop the rewrite-history-each-round hack — many-round iteration no longer recomputes the full context every turn.
Fix the truncation bug: query-aware extraction replaces slice(6000), killing the silent "answer past char 6000 dropped" failure.

The controlled measurement behind move #2 (same URL, same query, only the Exa contents param differs): summary:{query} runs a generative LLM on Exa's side = 3284ms; highlights:{query} is embedding selection returning verbatim text = 613ms (5.4× faster). Lesson: extraction wants "retrieval-style verbatim excerpts," not yet another model generating a summary — which is exactly what Claude.ai's inline snippet is. Exa Search/Contents API

Results

With every change stacked (temperature 0, 16 rounds), the full 30 tasks run once against the baselines:

harness (model held fixed)	30-task pass	position
naked model (no tools)	0 / 30	floor
original hand-rolled harness (B)	11 / 30 · 37%	start
after this rewrite (v3)	20 / 30 · 67%	+30 pts
Claude.ai (C)	24 / 30 · 80%	ceiling

37% → 67% closes the gap to Claude.ai from 43 points to 13 — about two-thirds of it.

More important is latency — the old architecture's fatal flaw was "accurate but timed out." Here's each group's time / rounds / retrieval behavior:

group	tasks	avg time	median time	avg rounds	avg searches	avg fetches	inline-only
PASS	20	54s	35s	4.4	2.2	2.4	89%
FAIL	10	93s	40s	8.0	3.4	3.9	65%

PASS tasks hit the answer in a couple of searches (median 35s, 2.2 searches) — accurate and cheap; the full 30 average 67s/task. FAIL tasks instead searched more and ran longer and still failed (avg 3.4 searches / 8 rounds) — the blocker isn't "couldn't find it."

Narrowing to the 8 tasks the first experiment (original B harness, 8-round cap) actually solved (v3 PASSes all 8 too), a pure timing comparison:

task	first run (B)	now (v3)	time change
gaia_016	32s	7s	4.5× faster
gaia_150	37s	22s	faster
gaia_096	37s	35s	~same
gaia_052	56s	57s	~same
gaia_136	16s	25s	a bit slower
gaia_158	16s	33s	a bit slower
gaia_059	12s	31s	a bit slower
gaia_153	41s	79s	slower
8-task total (all PASS)	31s avg · 34.5s median	36s avg · 32s median	same ballpark

3 faster, 5 a bit slower; mean 31s → 36s, median 34.5s → 32s — the same ballpark (the mean is pulled up by gaia_153 alone). Takeaway: the correctness the rewrite gained on hard tasks didn't slow down the tasks that were already solvable.

Where the improvement comes from

An aggregate number can lie — 67% might be "got lucky." Break down the round-by-round logs and each change leaves a quantifiable fingerprint. First, how a single search actually flows now:

Retrieval decision flow: Query → Brave search + inline snippets (with a Brave/SerpAPI/SearchApi/Exa fallback chain underneath that fires when empty) → decision Snippet rich enough? → YES 78% to a green Use inline text no fetch; NO 22% to Fetch top-1 Exa highlights → both merge into Answer or next round — 78% of searches answer straight from inline snippets with zero fetch; only when it isn't rich enough does it fetch top-1 + Exa highlights. The fallback chain fires only when Brave returns empty

Each optimization leaves a fingerprint in the logs:

Inline-first → speed: 61 of 78 searches (78%) fetched nothing (89% on PASS tasks) — the main reason latency dropped from "timeout" back to tens of seconds.
Fallback chain → coverage: 67 Brave-only, 11 fell to a fallback, every one recovered to ≥3 results (zero thin sets); one went 4 levels deep (Brave→SerpAPI→SearchApi→Exa), Exa rescued 9 results and bought gaia_008 a PASS.
highlights → fast and accurate: all 87 fetches resolved at Exa /contents with zero LLM round-trip, verbatim. (The robust fetch chain Jina/Firecrawl never fired: dormant insurance, no robustness evidence this run.)
Caching → sustaining many rounds: late-round uncached input drops to single-digit tokens (gaia_001 ran 11 rounds; the back half ~7–44 tokens each) — many rounds are no longer compute-expensive, the foundation for finishing instead of timing out.

Attribution in one line: the speed comes from not walking the "search → fetch → generative extraction" slow path (inline + highlights + caching); the coverage comes from Brave inline snippets + the fallback chain backfilling scarce results.

The remaining 10 failures are mostly no longer in the search/fetch pipeline; they split into three classes, only one a retrieval-precision problem: ① judgment wall (gaia_110 / 014 / 128 ran the full 15 rounds, read the right text but picked the wrong entity); ② retrieval-precision wall (only gaia_029: a niche fact not in the index — 13 reformulations, 10 forced fetches, still unlocked); ③ trigger discipline (frames_543 / 647 answered after one search, frames_562 did zero — speaking before searching enough).

The honest boundary: for multi-hop tasks like frames_241, the PASS is an unstable ~⅓ (4 dedicated reruns = 1/4, and it has FAILed head-to-head), rooted in a non-vanilla upstream where even temperature 0 doesn't reproduce. So the rewrite moved the bottleneck from "infrastructure / rounds" to "multi-hop entity reasoning + upstream determinism" — it didn't "crack" it.

Conclusion and transferable principles

This round moved the bottleneck from "infrastructure" to "judgment": search accurately, fetch reliably, cheap rounds — all validated with data; what's left on the two unsolved tasks is picking the right entity in multi-hop reasoning — the next mountain, not the search pipeline. Five transferable principles:

Inline content first: if a retrieval API returns query-relevant content directly (Exa highlights / Tavily), don't "search → fetch → call a model to extract."
Extract via retrieval, not generation: semantic highlights (embedding selection) is an order of magnitude faster than "ask a model to summarize" (measured 613ms vs 3284ms) and verbatim.
Cache the stable prefix, append-only history: cache system+tools, don't rewrite history each round — the precondition for affording many rounds (measured: late-round input tokens drop to single digits).
Positional truncation is an invisible bug: slicing body text by character silently drops answers; keep by query relevance.
Respect the moat: server-side co-location and tool/generation overlap can't be replicated — but they aren't the bulk of the speed. Inline content, caching, and code filtering are, and all three you can do.

Yao Yuheng

MSc Data Science, NTU. Focus: AI agent systems engineering, eval-driven development, LLM applications.

First published on sg.yaoyuheng2001.me.

Blog · GitHub · Substack · RSS