# Autoresearch Run 9: One-Way Ratchet Ladder

Run 9 tested a one-way escalation ladder in which the model tier only goes up, never down. The ladder escalates after 3 consecutive discards at the current tier. A keep resets the discard counter but does not change the ladder position. The ladder has 3 steps: haiku/high, sonnet/high, opus/high. opus/max was left out because run 8 showed that it adds cost and latency without improving results.

This is the fourth ladder variant tested. Runs 6-8 used ladders that moved back down on a keep (a full reset to step 0 in runs 6 and 8, one step down in run 7), causing oscillation between cheap and expensive models. The one-way ratchet was designed to avoid that: each tier is exhausted before moving on, and once the agent has graduated to a stronger model, it stays there.
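The escalation rule described above can be sketched as a small state machine. This is a minimal illustration, not the harness used for the run: the `Ratchet` class, `replay` helper, and field names are hypothetical, and the behavior at the top tier (the streak simply keeps counting) is an assumption, since the write-up does not say what happens after three discards on opus/high.

```python
from dataclasses import dataclass

# Tier names come from the write-up; everything else here is illustrative.
LADDER = ("haiku/high", "sonnet/high", "opus/high")
MAX_DISCARDS = 3  # consecutive discards allowed before escalating


@dataclass
class Ratchet:
    step: int = 0    # index into LADDER; only ever increases
    streak: int = 0  # consecutive discards at the current step

    @property
    def model(self) -> str:
        return LADDER[self.step]

    def record(self, action: str) -> None:
        """Update the ladder after one iteration's 'keep' or 'discard'."""
        if action == "keep":
            self.streak = 0  # reset the counter, stay on the same tier
            return
        if action != "discard":
            raise ValueError(f"unknown action: {action!r}")
        self.streak += 1
        # Assumption: at the top tier there is nowhere to go, so the
        # streak just keeps growing.
        if self.streak >= MAX_DISCARDS and self.step < len(LADDER) - 1:
            self.step += 1   # one-way: no code path ever decrements step
            self.streak = 0


def replay(actions):
    """Return the model used at each iteration for a sequence of actions."""
    ratchet = Ratchet()
    models = []
    for action in actions:
        models.append(ratchet.model)
        ratchet.record(action)
    return models


# Keep/discard sequence from the Results table (K = keep, D = discard).
RUN9 = "KDDD" "KKKKDKDDD" "DKDKKDK"
run9_actions = ["keep" if c == "K" else "discard" for c in RUN9]
run9_models = replay(run9_actions)
```

Replaying the run's 20 actions through this sketch assigns iterations 0-3 to haiku/high, 4-12 to sonnet/high, and 13-19 to opus/high, matching the Model column in the results table.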

## Setup

- Task: Fashion-MNIST classification, CPU only
- Baseline: single linear layer (~82%)
- Training budget: 300s per iteration, no agent timeout
- 20 iterations

## Results

| Iter | Step | Model | Action | val_accuracy | Streak |
|------|------|-------|--------|-------------|--------|
| 0 | 0 | haiku/high | keep | 0.8929 | 0 |
| 1 | 0 | haiku/high | discard | 0.8852 | 1 |
| 2 | 0 | haiku/high | discard | 0.8875 | 2 |
| 3 | 0 | haiku/high | discard | 0.8534 | 3 -> escalate |
| 4 | 1 | sonnet/high | keep | 0.9145 | 0 |
| 5 | 1 | sonnet/high | keep | 0.9191 | 0 |
| 6 | 1 | sonnet/high | keep | 0.9253 | 0 |
| 7 | 1 | sonnet/high | keep | 0.9256 | 0 |
| 8 | 1 | sonnet/high | discard | 0.9240 | 1 |
| 9 | 1 | sonnet/high | keep | 0.9323 | 0 (reset) |
| 10 | 1 | sonnet/high | discard | 0.9244 | 1 |
| 11 | 1 | sonnet/high | discard | 0.9286 | 2 |
| 12 | 1 | sonnet/high | discard | 0.9267 | 3 -> escalate |
| 13 | 2 | opus/high | discard | 0.9312 | 1 |
| 14 | 2 | opus/high | keep | 0.9331 | 0 (reset) |
| 15 | 2 | opus/high | discard | 0.9296 | 1 |
| 16 | 2 | opus/high | keep | 0.9357 | 0 (reset) |
| 17 | 2 | opus/high | keep | 0.9368 | 0 |
| 18 | 2 | opus/high | discard | 0.9350 | 1 |
| 19 | 2 | opus/high | keep | 0.9379 | 0 (reset) |

**Best: 93.79% at iteration 19. Total cost: $3.93. Still improving at exit.**

## Cost Breakdown

| Tier | Iterations | Total cost | Avg/iter |
|------|-----------|-----------|---------|
| haiku/high | 4 | $0.22 | $0.05 |
| sonnet/high | 9 | $1.31 | $0.15 |
| opus/high | 7 | $2.40 | $0.34 |
| **Total** | **20** | **$3.93** | **$0.20** |
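The per-tier figures can be cross-checked, and extended to cost per keep, with a few lines of arithmetic. The sketch below only uses numbers from the two tables above (iterations and cost from the cost table, keep counts from the results table); the `summarize` helper and its field names are hypothetical.

```python
# (tier, iterations, keeps, total cost in USD), copied from the tables above.
TIERS = [
    ("haiku/high", 4, 1, 0.22),
    ("sonnet/high", 9, 5, 1.31),
    ("opus/high", 7, 4, 2.40),
]


def summarize(tiers):
    """Return (total_cost, per-tier stats) for the given tier rows."""
    total_cost = sum(cost for _, _, _, cost in tiers)
    rows = {}
    for name, iters, keeps, cost in tiers:
        rows[name] = {
            "avg_per_iter": cost / iters,
            # A tier with no keeps has no meaningful cost per keep.
            "cost_per_keep": cost / keeps if keeps else None,
            "share_of_total": cost / total_cost,
        }
    return total_cost, rows


total, rows = summarize(TIERS)
for name, row in rows.items():
    per_keep = row["cost_per_keep"]
    per_keep_text = "n/a" if per_keep is None else f"${per_keep:.3f}"
    print(
        f"{name:12s} ${row['avg_per_iter']:.3f}/iter  "
        f"{per_keep_text}/keep  {row['share_of_total']:.0%} of spend"
    )
```

On these figures a kept change cost roughly $0.22 on haiku, $0.26 on sonnet, and $0.60 on opus, and opus accounted for about 61% of total spend.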

## How the Ratchet Worked

The mechanism performed as intended. Haiku and sonnet were each exhausted (three consecutive discards) before the ladder moved on, and opus was still producing keeps when the run ended.

**Haiku (iters 0-3):** One keep at iteration 0 (baseline to 89.29%), then three straight discards. Haiku grabbed the obvious first move, replacing the linear layer with a CNN, but couldn't improve beyond that, so the ladder escalated after 3 failures. We expected haiku to deliver more than one keep here, but these runs are stochastic and a single run isn't conclusive.

**Sonnet (iters 4-12):** The productive phase. Four consecutive keeps (iters 4-7) pushed accuracy from 91.45% to 92.56%. Then a discard at iter 8, but iter 9 produced a keep (93.23%), resetting the streak and buying sonnet three more chances. Those three chances all failed, so the ladder escalated. The streak reset at iter 9 is exactly the mechanism doing its job --- sonnet proved it still had value, got more time, and only graduated when truly exhausted.

**Opus (iters 13-19):** Opus continued finding improvements where sonnet couldn't: four keeps in seven iterations, pushing from 93.31% to 93.79%. The run ended with the model still improving; the streak was 0 at exit.

## Comparison With Previous Ladder Runs

| Run | Ladder | Behavior | Steps | Best val_acc | Cost |
|-----|--------|----------|-------|-------------|------|
| 6 | 10-step | reset to 0 on keep | haiku l/m/h, sonnet l/m/h, opus l/m/h/max | 92.75% | $1.77 |
| 7 | 4-step | down 1 on keep | haiku/h, sonnet/h, opus/h, opus/max | 92.82% | $4.29 |
| 8 | 4-step | reset to 0 on keep | haiku/h, sonnet/h, opus/h, opus/max | 93.26% | $8.30 |
| **9** | **3-step** | **one-way ratchet** | **haiku/h, sonnet/h, opus/h** | **93.79%** | **$3.93** |
| 5 | fixed | n/a | opus/high only | 93.95% | $2.10 |

Run 9 is the best ladder result, approaching run 5's fixed opus/high (93.95%) while starting from haiku. The one-way ratchet avoids the cost traps seen in earlier ladder runs: no resetting to haiku mid-plateau (runs 6, 8), no getting stuck at expensive opus/max (runs 7, 8), no wasted iterations re-discovering that a cheap model is exhausted.

Run 5 remains cheaper ($2.10 vs $3.93) because it never spent iterations on haiku or sonnet at all. The ladder's value proposition should be clearer in longer runs or when the cheap tiers are more productive. Here haiku contributed only one keep, so its 4 iterations were mostly overhead.

## Observations

**The run was still improving at exit.** Opus/high finished with a streak of 0 and 4 keeps in its 7 iterations. More iterations could push past run 5's 93.95%. This contrasts with every previous run, which plateaued well before iteration 20.

**Sonnet was the workhorse.** Five keeps in nine iterations, covering the 91-93% range. This is the strongest showing for sonnet across all ladder runs and supports the idea that sonnet is the cost-performance sweet spot for this task.

**The streak reset mechanism matters.** Without the reset at iteration 9, sonnet would have escalated at iteration 11 instead of 12. That one extra sonnet iteration is cheap, but more importantly, the reset validated that sonnet still had productive ideas at the 93% level. The conservative escalation policy --- require 3 *consecutive* failures, reset on any success --- gave each tier its full chance.
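The counterfactual in this observation can be checked by replaying sonnet's keep/discard sequence with and without the reset. This is a minimal sketch: `escalation_iter` and its parameters are hypothetical names, and the action list is copied from iterations 4-12 of the results table.

```python
def escalation_iter(actions, first_iter, max_discards=3, reset_on_keep=True):
    """Return the iteration whose discard triggers escalation, or None."""
    streak = 0
    for offset, action in enumerate(actions):
        if action == "discard":
            streak += 1
            if streak >= max_discards:
                return first_iter + offset
        elif reset_on_keep:
            streak = 0  # a keep wipes the running discard count
    return None


# Sonnet's actions for iterations 4-12, from the results table.
sonnet_actions = [
    "keep", "keep", "keep", "keep",    # iters 4-7
    "discard", "keep",                 # iters 8-9
    "discard", "discard", "discard",   # iters 10-12
]

with_reset = escalation_iter(sonnet_actions, first_iter=4)
without_reset = escalation_iter(sonnet_actions, first_iter=4, reset_on_keep=False)
print(with_reset, without_reset)  # prints "12 11"
```

With the reset, the third consecutive discard lands on iteration 12. Without it, the discard at iteration 8 still counts, so the third discard overall lands on iteration 11.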

**Dropping opus/max was the right call.** Run 8 spent $6.40 on 9 opus/max iterations (avg 5m07s, max 11m59s) with zero improvements. Run 9 spent $2.40 on 7 opus/high iterations (avg 2m10s) with four improvements. Opus/high is faster, cheaper, and at least as effective on this task.
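The per-iteration gap between the two opus settings follows directly from the figures quoted above. The sketch below only restates that arithmetic; the dictionary layout and the `ratio` helper are illustrative.

```python
# Figures quoted above: total cost (USD), iterations, average wall time, keeps.
opus_max = {"cost": 6.40, "iters": 9, "avg_seconds": 5 * 60 + 7, "keeps": 0}   # run 8
opus_high = {"cost": 2.40, "iters": 7, "avg_seconds": 2 * 60 + 10, "keeps": 4}  # run 9


def ratio(a, b):
    """How many times larger a is than b."""
    return a / b


cost_ratio = ratio(
    opus_max["cost"] / opus_max["iters"],
    opus_high["cost"] / opus_high["iters"],
)
time_ratio = ratio(opus_max["avg_seconds"], opus_high["avg_seconds"])

print(f"cost per iteration: {cost_ratio:.2f}x")
print(f"average iteration time: {time_ratio:.2f}x")
```

On these numbers an opus/max iteration cost a little over twice as much as an opus/high iteration (about $0.71 vs $0.34) and took roughly 2.4 times as long on average.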