# Autoresearch Run 13: Web Access

Run 13 gave the main agent access to WebSearch and WebFetch, testing whether the ability to look up techniques and best practices improves performance. Everything else was identical to run 12 (structured CoT, backtracking on discards, postmortems on streak trigger, 3-step one-way ratchet ladder). The result was surprising: the highest keep rate ever recorded (12/20) and the lowest accuracy since run 4 (91.54%). The agent never escalated past haiku.

## Setup

Same as run 12, with one change:

- Main agent allowed tools: `Read, Write, Edit, Glob, Grep, WebSearch, WebFetch` (added web access)
- Backtrack and postmortem agents: unchanged (`Read, Glob, Grep`)

## Results

| Iter | Step | Model | Action | val_accuracy | Streak | BT |
|------|------|-------|--------|--------------|--------|----|
| 0 | 0 | haiku/high | keep | 0.8828 | 0 | |
| 1 | 0 | haiku/high | keep | 0.8871 | 0 | |
| 2 | 0 | haiku/high | discard | 0.8752 | 1 | yes |
| 3 | 0 | haiku/high | keep | 0.8904 | 0 | |
| 4 | 0 | haiku/high | discard | 0.8897 | 1 | yes |
| 5 | 0 | haiku/high | keep | 0.8941 | 0 | |
| 6 | 0 | haiku/high | keep | 0.8942 | 0 | |
| 7 | 0 | haiku/high | discard | 0.8843 | 1 | yes |
| 8 | 0 | haiku/high | keep | 0.8952 | 0 | |
| 9 | 0 | haiku/high | keep | 0.8962 | 0 | |
| 10 | 0 | haiku/high | discard | 0.8943 | 1 | yes |
| 11 | 0 | haiku/high | keep | 0.8979 | 0 | |
| 12 | 0 | haiku/high | keep | 0.9038 | 0 | |
| 13 | 0 | haiku/high | discard | 0.8988 | 1 | yes |
| 14 | 0 | haiku/high | keep | 0.9064 | 0 | |
| 15 | 0 | haiku/high | discard | 0.9053 | 1 | yes |
| 16 | 0 | haiku/high | keep | 0.9065 | 0 | |
| 17 | 0 | haiku/high | discard | 0.9003 | 1 | yes |
| 18 | 0 | haiku/high | discard | 0.9044 | 2 | yes |
| 19 | 0 | haiku/high | keep | 0.9154 | 0 | |

**Best: 91.54% at iteration 19. Total cost: $2.29.
Never escalated past haiku.**

## Cost Breakdown

| Component | Calls | Total cost |
|-----------|-------|------------|
| Agent (haiku/high) | 20 | $1.96 |
| Backtrack calls | 8 | $0.33 |
| Postmortem calls | 0 | $0.00 |
| **Total** | | **$2.29** |

No postmortems fired because the discard streak never reached 3. The agent alternated between keeps and discards with remarkable regularity --- discard, keep, keep, discard, keep, keep --- never allowing consecutive failures to accumulate.

## What Happened

The agent with web access made tiny, conservative, incremental improvements throughout the entire run. The accuracy trajectory of the kept iterations tells the story:

0.883 -> 0.887 -> 0.890 -> 0.894 -> 0.894 -> 0.895 -> 0.896 -> 0.898 -> 0.904 -> 0.906 -> 0.907 -> 0.915

Almost every keep is a gain of 0.1-0.5 percentage points. The agent stayed in MLP territory for the entire run, making hyperparameter tweaks --- learning-rate adjustments, batch-normalization additions, scheduler tuning. It never made the bold architectural leap to a CNN that every other successful run took by iteration 4-6.

The web access appears to have given haiku a larger menu of small, safe options to try. With more incremental ideas available, haiku could keep finding marginal improvements indefinitely --- each one just barely beating the previous best. This defeated the escalation mechanism: the ladder requires 3 consecutive discards to escalate, and haiku never hit that threshold because it always found *something* that worked, even if the gain was tiny.

## The Paradox

This run optimized for the wrong thing. The escalation ladder's trigger (3 consecutive discards) implicitly defines "exhaustion" as repeated failure. But haiku with web access was never exhausted --- it just produced diminishing returns. The ladder conflates "can't improve" with "can only improve a little." A model that makes 0.1pp gains forever will never escalate, but it will also never reach the accuracy ceiling that a bolder model could hit.
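The trigger described above can be replayed against Run 13's action sequence to show concretely why it never fires. A minimal sketch of the rule as the text describes it --- names and structure are illustrative, not taken from the actual harness:

```python
# Replay of the escalation trigger as described: any keep resets the discard
# streak, and three consecutive discards would escalate one rung up the ladder.
# (Variable names are hypothetical, not from the harness.)

ESCALATE_AFTER = 3  # consecutive discards required to escalate

# Run 13's action sequence, read off the results table (iterations 0-19).
actions = ["keep", "keep", "discard", "keep", "discard", "keep", "keep",
           "discard", "keep", "keep", "discard", "keep", "keep", "discard",
           "keep", "discard", "keep", "discard", "discard", "keep"]

streak, max_streak, escalations = 0, 0, 0
for action in actions:
    if action == "discard":
        streak += 1
        max_streak = max(max_streak, streak)
        if streak >= ESCALATE_AFTER:
            escalations += 1
            streak = 0
    else:
        streak = 0  # any keep, however small its gain, resets the streak

print(max_streak, escalations)  # 2 0 -- the streak peaks at iters 17-18
```

The streak never reaches 3, which matches the cost table: eight backtrack calls, zero postmortems, zero escalations.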
In every previous run, haiku's limited reasoning capacity forced it into bold moves (like trying a CNN) or outright failures. Both outcomes eventually triggered escalation to sonnet, which had the reasoning capacity for the CNN breakthrough that typically jumps accuracy from ~89% to ~92%. Web access gave haiku enough incremental options to avoid both bold moves and outright failures --- the worst of both worlds for the ladder mechanism.

## Comparison

| | Run 12 (no web) | Run 13 (web access) |
|---|---|---|
| Best val_acc | 93.41% | 91.54% |
| Cost | $6.41 | $2.29 |
| Keeps | 10/20 | 12/20 |
| Reached sonnet | iter 6 | never |
| Reached opus | never | never |
| Architecture | CNN by iter 6 | MLP throughout |
| Keep rate | 50% | 60% |
| Cost per keep | $0.64 | $0.19 |

Run 13 has the best keep rate and lowest cost per keep across all 13 runs. It also has the worst accuracy of any run with the current context enhancements. The metrics that measure efficiency (keep rate, cost) point in the opposite direction from the metric that measures quality (accuracy).

## Findings

**Web access hurt performance by enabling conservative play.** The agent used web access to find a steady stream of incremental MLP improvements rather than making the architectural leap to CNNs. This is the first run since the ladder was introduced where the agent never built a convolutional model.

**The ladder's escalation trigger has a blind spot.** The 3-consecutive-discards threshold detects outright failure but not diminishing returns. A model that makes 0.1pp gains will never escalate, even when a stronger model could make 2pp gains. This suggests the escalation trigger may need to consider the *magnitude* of improvement, not just whether improvement occurred.

**Keep rate is not a proxy for quality.** Run 13 has the highest keep rate (60%) and the lowest accuracy (91.54%). Run 9 has a 45% keep rate and 93.79% accuracy.
The best runs feature a mix of bold moves (some of which fail) and incremental refinement. A run that only makes safe bets never reaches the ceiling.

**Web access is a double-edged sword for weaker models.** For haiku, web access expanded the space of "safe" changes without expanding the space of "breakthrough" changes. The model could look up batch-normalization best practices but couldn't synthesize the higher-level reasoning needed to propose a CNN architecture from scratch. The result: more competent incrementalism, less exploration.
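One way to close the blind spot named in the findings is to count sub-threshold keeps toward the escalation streak, so a run of tiny gains eventually escalates just like a run of failures. A minimal sketch of such a magnitude-aware trigger --- the 0.5pp threshold, the function names, and the overall shape are my own illustration, not part of the actual harness:

```python
# Hypothetical magnitude-aware escalation trigger: a keep whose gain over the
# running best falls below MIN_GAIN still extends the "stagnation" streak.
# Threshold values and names are illustrative, not from the harness.

ESCALATE_AFTER = 3   # stagnation events needed to escalate
MIN_GAIN = 0.005     # keeps gaining less than 0.5pp count as stagnation

def replay(events, min_gain=MIN_GAIN, escalate_after=ESCALATE_AFTER):
    """Return the iteration at which escalation would fire, or None.

    events: (action, val_accuracy) pairs in iteration order. A keep that
    beats the running best by at least min_gain resets the streak; discards
    and sub-threshold keeps both extend it.
    """
    best, streak = 0.0, 0
    for i, (action, acc) in enumerate(events):
        if action == "keep" and acc - best >= min_gain:
            best, streak = acc, 0        # substantial improvement: reset
        else:
            if action == "keep":
                best = max(best, acc)    # tiny gain: keep it, but count it
            streak += 1
            if streak >= escalate_after:
                return i
    return None

# Run 13's trace, read off the results table (iterations 0-19).
events = [("keep", 0.8828), ("keep", 0.8871), ("discard", 0.8752),
          ("keep", 0.8904), ("discard", 0.8897), ("keep", 0.8941),
          ("keep", 0.8942), ("discard", 0.8843), ("keep", 0.8952),
          ("keep", 0.8962), ("discard", 0.8943), ("keep", 0.8979),
          ("keep", 0.9038), ("discard", 0.8988), ("keep", 0.9064),
          ("discard", 0.9053), ("keep", 0.9065), ("discard", 0.9003),
          ("discard", 0.9044), ("keep", 0.9154)]

print(replay(events))  # 3 -- would have escalated at iteration 3
```

Under these illustrative settings the rule fires at iteration 3 of Run 13's trace, i.e. it would have escalated to sonnet almost immediately; the threshold would need tuning so that ordinary incremental progress does not escalate too eagerly.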