# Autoresearch Runs 13-15: Web Access Experiments

Three runs tested web access (WebSearch, WebFetch) as a harness capability, placed at different points in the pipeline. The results reveal that where and how web access is provided matters more than whether it's available.

## Run 13: Web Access in Main Agent (91.54%, $2.29)

Gave the main coding agent WebSearch and WebFetch alongside its editing tools. The result was paradoxical: the highest keep rate ever (12/20) but the lowest accuracy since run 4 (91.54%).

The agent never escalated past haiku. It stayed in MLP territory for all 20 iterations, making tiny incremental hyperparameter tweaks --- learning rate adjustments, batch normalization, scheduler tuning --- each one barely beating the previous best by 0.1-0.5pp. It never made the bold architectural leap to a CNN that every other successful run took by iteration 4-6.

Web access gave haiku a larger menu of safe, small options. With more incremental ideas available, haiku could keep finding marginal improvements indefinitely, defeating the escalation mechanism. The ladder requires 3 consecutive discards to escalate, and haiku never hit that threshold because it always found *something* that worked, even if the gain was tiny.

**Finding:** Web access in the main agent enables conservative play that defeats the ladder. Keep rate is not a proxy for quality.

## Run 14: Web Access in Postmortem, No Nudge (92.86%, $4.62)

Moved web access from the main agent to the postmortem (which fires after 3 consecutive discards). The postmortem runs with the escalation target model --- sonnet or opus --- so stronger models would do the web research.

Result: the postmortem **never used the web tools**. It produced the same kind of pure-reasoning analysis as run 11's postmortems. Having the tools available wasn't enough --- the model defaulted to its internal knowledge.

**Finding:** Making tools available doesn't mean the agent will use them. The prompt must explicitly instruct tool usage.

## Run 15: Web Access in Postmortem, With Nudge (93.17%, $4.69)

Same as run 14, but the postmortem system prompt now explicitly says: "Use WebSearch and WebFetch to research relevant techniques, best practices, or alternative approaches before making recommendations."

This time the postmortems did real research:

**Postmortem #1 (sonnet, 11 turns, $0.51):** Searched for Fashion-MNIST benchmarks, the 1cycle policy paper, PyTorch OneCycleLR docs, and Kaggle notebooks. Cited 6 external sources. Identified epoch starvation as the bottleneck and recommended BS=128 to reduce per-epoch time, OneCycleLR for short training runs, and 3x3 kernels to cut compute.

**Postmortem #2 (opus, 8 turns, $0.36):** Corrected postmortem #1. The BS=128 recommendation had been tested in iteration 14 and failed. Opus diagnosed why --- CPU doesn't parallelize conv ops the way GPU does, so larger batches don't reduce epoch time on this hardware. It pivoted to the one untried recommendation: swap 5x5 kernels for 3x3 to cut per-epoch compute 20-35%. This self-correction within the postmortem chain is something we'd never seen before.

After the second postmortem, opus got 3 iterations and produced 2 keeps (93.00% and 93.17%), still improving at exit.

### The Missing Context Problem

The web-researched postmortem's first recommendation (BS=128) was wrong because the postmortem prompt doesn't include experimental constraints. The main agent gets program.md, which specifies CPU-only training, the 300s budget, and Fashion-MNIST details. The postmortem gets only the last 6 iterations' data, the current train.py code, and the previous postmortem. It has to infer constraints from the code.

When the postmortem searched the web for Fashion-MNIST best practices, it found GPU-oriented advice (BS=128 for faster epochs) that doesn't apply to CPU training.
The code does contain `device = "cpu"` and `BUDGET_SECS`, but these are easy to miss when synthesizing web results about general best practices.

**Finding:** Web-researched postmortems need the experimental constraints (hardware, budget, task) included in their prompt, not just the iteration data and code.

## Comparison

| | Run 12 (no web) | Run 13 (web in agent) | Run 14 (web in PM) | Run 15 (web in PM, nudged) |
|---|---|---|---|---|
| Best val_acc | 93.41% | 91.54% | 92.86% | 93.17% |
| Cost | $6.41 | $2.29 | $4.62 | $4.69 |
| Keeps | 10/20 | 12/20 | 10/20 | 9/20 |
| Reached opus | never | never | never | iter 17 |
| PM cost | $0.10 | $0.00 | $0.12 | $0.87 |
| PM used web | n/a | n/a | no | yes |
| PM self-corrected | n/a | n/a | n/a | yes |

## Key Findings Across the Web Access Experiments

**1. Web access in the coding agent is harmful.** It enables incrementalism that defeats the escalation mechanism. The agent finds a steady stream of tiny improvements instead of making bold architectural leaps. (Run 13)

**2. Tool availability requires explicit prompting.** Making WebSearch/WebFetch available to the postmortem agent isn't enough --- the model defaults to internal reasoning. The system prompt must tell it to use the tools. (Run 14 vs 15)

**3. Web-researched postmortems produce qualitatively better analysis.** The nudged postmortem cited external sources, referenced specific papers and benchmarks, and gave recommendations grounded in literature. The second postmortem self-corrected the first based on experimental evidence. (Run 15)

**4. Postmortem prompts need experimental context.** The postmortem receives iteration data and code but not the constraints from program.md (CPU-only, 300s budget, task details). Web research without these constraints produces GPU-oriented recommendations that fail on CPU. This is a fixable gap in the harness. (Run 15)

**5. Haiku's performance is stochastic and can compress the improvement space.** Run 15's haiku got 7 keeps and reached 92.28% before escalating --- far above its usual 89-90% ceiling. This left less room for sonnet and opus. A longer run (30+ iterations) would better show whether the web-corrected postmortem chain delivers value at the opus tier.
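For reference, the escalation mechanism these findings keep circling --- 3 consecutive discards trigger a move up the model ladder, and any keep resets the counter --- can be sketched as follows. This is a reconstruction from the behavior described above, not the harness's actual implementation:

```python
# Reconstruction of the escalation ladder: the agent starts on the
# cheapest model and moves up one tier only after 3 consecutive
# discarded (non-improving) iterations; any keep resets the counter.
LADDER = ["haiku", "sonnet", "opus"]
ESCALATE_AFTER = 3  # consecutive discards that trigger escalation

class EscalationLadder:
    def __init__(self):
        self.tier = 0
        self.consecutive_discards = 0

    @property
    def model(self) -> str:
        return LADDER[self.tier]

    def record(self, kept: bool) -> bool:
        """Record one iteration's outcome; return True if we escalated."""
        if kept:
            self.consecutive_discards = 0  # a keep resets the counter
            return False
        self.consecutive_discards += 1
        if self.consecutive_discards >= ESCALATE_AFTER and self.tier < len(LADDER) - 1:
            self.tier += 1
            self.consecutive_discards = 0
            return True
        return False
```

This makes run 13's failure mode concrete: as long as haiku lands a keep at least once every three iterations, the discard counter never reaches 3 and the ladder never fires.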