← Home
Tutorial + Experiment Report

Tuning Search to Claude.ai Level

Part 1 explains the mechanics (tutorial); Part 2 validates with data (experiment report)

2026-06-10 Search Agent Caching Claude 中文 →

The last eval pinned the gap: fixing the model and varying only the harness, my hand-rolled agent delivered 37% / 80% ≈ 46% of Claude.ai — and search precision alone accounted for nearly half the failing tasks. This post cashes in that "after the P0 optimization" flag.

Structure
  1. Part 1 · Tutorial: the 5 engineering optimizations behind Claude.ai's fast+accurate search, each with an API example and official citation.
  2. Part 2 · Experiment report: each treatment tested on a hand-rolled harness, with latency / token / per-task pass-fail data.
Part 1 · Tutorial: Claude.ai search's 5 optimizations

Up front: Claude.ai's search is not a magic engine (its backend is actually Brave Search — citation overlap with Brave's top results is ~86.7%). Its "fast and accurate" is 5 decomposable, replicable engineering optimizations. Here's each, with something you can copy.

1Content returns inline with results — no separate fetch

Mechanic: one web_search call returns, per result, ~500 words of query-relevant page content already attached, passage-selected index-side. One search yields the relevant text — no separate fetch, no extra model call to extract.

Example: a request with the search tool

POST /v1/messages
{
  "model": "claude-opus-4-...",
  "tools": [{ "type": "web_search_20250305",
              "name": "web_search", "max_uses": 5 }],
  "messages": [{ "role": "user", "content": "..." }]
}

Shape of each returned result (note encrypted_content)

{
  "type": "web_search_tool_result",
  "content": [{
    "type": "web_search_result",
    "url": "https://en.wikipedia.org/wiki/...",
    "title": "...",
    // ↓ ~500 words, query-selected, goes straight into context
    "encrypted_content": "EqgfCiB...(~4000-6300 bytes)",
    "page_age": "..."
  }]
}

Contrast: Claude Code (the CLI) discards those inline snippets, keeping only title/url, and re-fetches via WebFetch (Turndown + Haiku) when needed. Same engine, two content pipelines.

Sources: Web Search Tool (official) · The Claude Code WebSearch Black Box · Inside Claude Code's Web Tools

2Server-side execution + tool/generation overlap

Mechanic: search runs inside Anthropic's infra (no client round-trips), and the model can emit multiple tool calls in one message, executed in parallel.

Example: parallel tool calls in one assistant turn

"content": [
  { "type": "tool_use", "id": "a", "name": "web_search",
    "input": { "query": "..." } },
  { "type": "tool_use", "id": "b", "name": "web_search",
    "input": { "query": "..." } }   // two in parallel
]

Can you do it? Parallel tools — yes (run tool execution with Promise.all). Server-side co-location and generation-overlap — no, that's the moat. The good news: those aren't the bulk of the speed.

Source: Parallel Tool Use (official)

3"Code filtering," not "call another model"

Mechanic: when results are many, dynamic filtering has the model write code and run it in a sandbox to filter/rank/extract, sending only the code's output into context. Deterministic, token-cheap, zero extra model round-trips.

Example: with code execution on, the model writes the filter

# Claude generates and runs inside code_execution:
results = load_search_results()
hits = [r for r in results
        if "Highlands and Islands" in r["text"]
        and any(y in r["text"] for y in ("2021","2016"))]
print(hits[:3])   # only these enter context

Versus "ask another LLM to read and filter": that's a full inference round-trip (100s of ms to seconds); code filtering is milliseconds. This becomes the mirror for the mistake I make below.

Source: Advanced Tool Use (programmatic tool calling)

4Prompt Caching: cache the stable prefix, up to 85% latency cut

Mechanic: cache the big, stable tools + system prefix; rounds 2+ read it. Official numbers: up to 85% latency and 90% cost reduction; a 100K-token example dropped from 11.5s to 2.4s.

Example: cache_control breakpoint on system / last tool

{
  "system": [{ "type": "text", "text": "<long system prompt>",
               "cache_control": { "type": "ephemeral" } }],
  "tools":  [ ...,
    { "name": "web_search", ...,
      "cache_control": { "type": "ephemeral" } } ]  // caches the tools block
}
The discipline (where people trip): processing order is tools → system → messages, and the cached prefix must be stable. Cache only tools+system; never put per-turn-changing tool results inside the cached region, or it invalidates every turn and gets slower — there's an arXiv study on exactly this (boundary control improves TTFT 13–31%; naive full caching increases latency).

Sources: Prompt Caching (official) · Prompt Caching docs · "Don't Break the Cache" (arXiv)

5Selective retrieval after search, not bulk-fetch every time

Mechanic: the docs recommend search → selective retrieve. Simple factual queries use 1–3 searches, with max_uses bounding the count.

Example: bound the number of searches

"tools": [{ "type": "web_search_20250305",
            "name": "web_search",
            "max_uses": 3 }]   // latency-sensitive: cap at 3

Source: Web Search Tool · max_uses (official)

Tutorial takeaway: it's fast because it doesn't walk "search → fetch → call-a-model-to-extract" — content returns inline (#1), filtering is code not a model (#3), the stable prefix is cached (#4). It's accurate because the inline snippet is index-side query-selected + code-filtered + selectively deep-fetched (#5). Below I port all five to my own harness and measure what each is worth.
Part 2 · Experiment report: rewriting the retrieval pipeline into Claude.ai's shape
Setup
Subject: a hand-rolled harness (the agent-research runner, same family as the production claude-ai-harness).
Controlled variable: model fixed (same upstream); only the retrieval/fetch/context pipeline changes.
Dataset: 30 GAIA+FRAMES tasks the naked model fails, run in full.
Judge: contains — whether the model's FINAL ANSWER contains the gold answer (case-insensitive).
Baselines: naked model 0 / 30, original harness (B) 11 / 30 (37%), Claude.ai (C) 24 / 30 (80%).
Grade caveat: the upstream (luckyapi) is not vanilla Anthropic and temperature 0 doesn't fully reproduce, so a single run is a point estimate, not a verdict (this is also why frames_241 below is flaky).

The old architecture

The original harness was the most naive "search → fetch → char-truncate" three-stage pipeline, losing information at every layer:

Three fatal points: can't get body text, fragile fetch, answers positionally truncated; multi-hop tasks rarely survived to convergence.

The current architecture

Following the 5 optimizations dissected in Part 1, every layer was reshaped into "retrieval-style, zero extra generation, cacheable":

layerbeforeafter
retrievalSerper, snippets onlyBrave-primary + extra_snippets (~500 words inline); 9-level fallback chain
fetchtop-3 every searchinline-first: don't fetch if snippet is rich; else selectively fetch top-1
extractionslice(6000) positional cutExa /contents highlights (embedding selection, verbatim, no LLM round-trip)
contextrewrite history each roundappend-only + cache_control on system/tools
loop8 rounds16 rounds + multi-hop planning prompt + network retry
Architecture comparison: left BEFORE (red dot) is a four-step vertical pipeline Search snippet only → Fetch top-3 pages → Truncate 6000 chars → Rewrite history, an 8-rounds loop around it and a red clock timeout at the bottom; right AFTER (green dot) is Search inline 500 words → Inline-first skip fetch → Exa highlights → Append-only + cache, a 16-rounds loop and a green lightning fast at the bottom
The same retrieval pipeline, rewritten layer by layer: the left loses information at every stage and times out before it finishes; the right is inline-first + cached and survives 16 rounds

What changed

Condensed to the four behavior-changing moves (knob-tuning omitted):

  1. Inline-first (biggest speedup): let relevant body text come back with the search results, skip the fetch whenever possible — cutting the default "3 fetches + 3 extraction calls" per search to 0 in the common case.
  2. Extract by retrieval, not generation: hand "read the right passage" to Exa highlights (semantic selection) instead of calling Haiku per fetched page.
  3. Cache the stable prefix + append-only history: cache system+tools, append messages only, drop the rewrite-history-each-round hack — many-round iteration no longer recomputes the full context every turn.
  4. Fix the truncation bug: query-aware extraction replaces slice(6000), killing the silent "answer past char 6000 dropped" failure.
The controlled measurement behind move #2 (same URL, same query, only the Exa contents param differs): summary:{query} runs a generative LLM on Exa's side = 3284ms; highlights:{query} is embedding selection returning verbatim text = 613ms (5.4× faster). Lesson: extraction wants "retrieval-style verbatim excerpts," not yet another model generating a summary — which is exactly what Claude.ai's inline snippet is. Exa Search/Contents API

Results

With every change stacked (temperature 0, 16 rounds), the full 30 tasks run once against the baselines:

harness (model held fixed)30-task passposition
naked model (no tools)0 / 30floor
original hand-rolled harness (B)11 / 30 · 37%start
after this rewrite (v3)20 / 30 · 67%+30 pts
Claude.ai (C)24 / 30 · 80%ceiling

37% → 67% closes the gap to Claude.ai from 43 points to 13 — about two-thirds of it.

More important is latency — the old architecture's fatal flaw was "accurate but timed out." Here's each group's time / rounds / retrieval behavior:

grouptasksavg timemedian timeavg roundsavg searchesavg fetchesinline-only
PASS2054s35s4.42.22.489%
FAIL1093s40s8.03.43.965%

PASS tasks hit the answer in a couple of searches (median 35s, 2.2 searches) — accurate and cheap; the full 30 average 67s/task. FAIL tasks instead searched more and ran longer and still failed (avg 3.4 searches / 8 rounds) — the blocker isn't "couldn't find it."

Narrowing to the 8 tasks the first experiment (original B harness, 8-round cap) actually solved (v3 PASSes all 8 too), a pure timing comparison:

taskfirst run (B)now (v3)time change
gaia_01632s7s4.5× faster
gaia_15037s22sfaster
gaia_09637s35s~same
gaia_05256s57s~same
gaia_13616s25sa bit slower
gaia_15816s33sa bit slower
gaia_05912s31sa bit slower
gaia_15341s79sslower
8-task total (all PASS)31s avg · 34.5s median36s avg · 32s mediansame ballpark

3 faster, 5 a bit slower; mean 31s → 36s, median 34.5s → 32s — the same ballpark (the mean is pulled up by gaia_153 alone). Takeaway: the correctness the rewrite gained on hard tasks didn't slow down the tasks that were already solvable.

Where the improvement comes from

An aggregate number can lie — 67% might be "got lucky." Break down the round-by-round logs and each change leaves a quantifiable fingerprint. First, how a single search actually flows now:

Retrieval decision flow: Query → Brave search + inline snippets (with a Brave/SerpAPI/SearchApi/Exa fallback chain underneath that fires when empty) → decision Snippet rich enough? → YES 78% to a green Use inline text no fetch; NO 22% to Fetch top-1 Exa highlights → both merge into Answer or next round
78% of searches answer straight from inline snippets with zero fetch; only when it isn't rich enough does it fetch top-1 + Exa highlights. The fallback chain fires only when Brave returns empty

Each optimization leaves a fingerprint in the logs:

Attribution in one line: the speed comes from not walking the "search → fetch → generative extraction" slow path (inline + highlights + caching); the coverage comes from Brave inline snippets + the fallback chain backfilling scarce results.

The remaining 10 failures are mostly no longer in the search/fetch pipeline; they split into three classes, only one a retrieval-precision problem: ① judgment wall (gaia_110 / 014 / 128 ran the full 15 rounds, read the right text but picked the wrong entity); ② retrieval-precision wall (only gaia_029: a niche fact not in the index — 13 reformulations, 10 forced fetches, still unlocked); ③ trigger discipline (frames_543 / 647 answered after one search, frames_562 did zero — speaking before searching enough).

The honest boundary: for multi-hop tasks like frames_241, the PASS is an unstable ~⅓ (4 dedicated reruns = 1/4, and it has FAILed head-to-head), rooted in a non-vanilla upstream where even temperature 0 doesn't reproduce. So the rewrite moved the bottleneck from "infrastructure / rounds" to "multi-hop entity reasoning + upstream determinism" — it didn't "crack" it.

Conclusion and transferable principles

This round moved the bottleneck from "infrastructure" to "judgment": search accurately, fetch reliably, cheap rounds — all validated with data; what's left on the two unsolved tasks is picking the right entity in multi-hop reasoning — the next mountain, not the search pipeline. Five transferable principles:

  1. Inline content first: if a retrieval API returns query-relevant content directly (Exa highlights / Tavily), don't "search → fetch → call a model to extract."
  2. Extract via retrieval, not generation: semantic highlights (embedding selection) is an order of magnitude faster than "ask a model to summarize" (measured 613ms vs 3284ms) and verbatim.
  3. Cache the stable prefix, append-only history: cache system+tools, don't rewrite history each round — the precondition for affording many rounds (measured: late-round input tokens drop to single digits).
  4. Positional truncation is an invisible bug: slicing body text by character silently drops answers; keep by query relevance.
  5. Respect the moat: server-side co-location and tool/generation overlap can't be replicated — but they aren't the bulk of the speed. Inline content, caching, and code filtering are, and all three you can do.

Yao Yuheng

MSc Data Science, NTU. Focus: AI agent systems engineering, eval-driven development, LLM applications.

First published on sg.yaoyuheng2001.me.

Blog · GitHub · Substack · RSS