# Autoresearch Run 7: Bidirectional Ladder and Agent Timeouts

Run 7 tested a shorter, bidirectional escalation ladder: 4 steps instead of 10, with the ladder moving down one step on success instead of resetting to zero. The run also exposed a timeout problem with opus/max effort that was invisible in run 6.

## The Ladder

The ladder has 4 steps. A discard moves up one step (escalate). A keep moves down one step (de-escalate). The step is clamped at both ends.

| Step | Model | Effort |
|------|-------|--------|
| 0 | haiku | high |
| 1 | sonnet | high |
| 2 | opus | high |
| 3 | opus | max |

Run 6, by comparison, used a 10-step ladder (haiku low/med/high, sonnet low/med/high, opus low/med/high/max) with a full reset to step 0 on success.

## Results

20 iterations, 5-minute training budget, Fashion-MNIST from a weak baseline. Total cost: $4.29.

| Iter | Step | Model/Effort | Action | val_accuracy |
|------|------|-------------|--------|-------------|
| 0 | 0 | haiku/high | keep | 0.8419 |
| 1 | 0 | haiku/high | keep | 0.8877 |
| 2 | 0 | haiku/high | discard | 0.8860 |
| 3 | 1 | sonnet/high | keep | 0.8919 |
| 4 | 0 | haiku/high | discard | 0.8918 |
| 5 | 1 | sonnet/high | keep | 0.9188 |
| 6 | 0 | haiku/high | discard | 0.9031 |
| 7 | 1 | sonnet/high | discard | 0.9158 |
| 8 | 2 | opus/high | keep | 0.9282 |
| 9 | 1 | sonnet/high | discard | 0.9067 |
| 10 | 2 | opus/high | discard | 0.9281 |
| 11 | 3 | opus/max | discard | 0.9256 |
| 12 | 3 | opus/max | discard | 0.9219 |
| 13 | 3 | opus/max | discard | 0.9265 |
| 14 | 3 | opus/max | TIMEOUT | --- |
| 15 | 3 | opus/max | discard | 0.9280 |
| 16 | 3 | opus/max | TIMEOUT | --- |
| 17 | 3 | opus/max | discard | 0.9220 |
| 18 | 3 | opus/max | discard | 0.9201 |
| 19 | 3 | opus/max | TIMEOUT | --- |

**Best: 92.82% at iteration 8 (opus/high). Total cost: $4.29.**

## The Timeout Problem

The orchestrator uses a 300-second agent timeout (same as the training budget).
Agent wall times by tier:

| Tier | Calls | Avg | Min | Max | Timeouts | Avg cost |
|------|-------|-----|-----|-----|----------|----------|
| haiku/high | 5 | 28s | 20s | 32s | 0 | $0.06 |
| sonnet/high | 4 | 78s | 38s | 120s | 0 | $0.18 |
| opus/high | 2 | 70s | 58s | 83s | 0 | $0.26 |
| opus/max | 9 | 218s | 119s | 301s | 3 | $0.46 |

Haiku, sonnet, and opus/high all finish well within 300 seconds. Opus/max averages 3.6 minutes and runs into the 5-minute ceiling: one in three calls times out. The turn counts are similar across tiers (~5 turns), so opus/max isn't doing more work; it's thinking longer per turn. The three timed-out iterations produced no output and wasted ~15 minutes of wall time.

## Bidirectional Ladder: Stuck at the Ceiling

The bidirectional de-escalation mechanism never had a chance to prove itself. After iteration 8 (the last keep), the ladder dropped to step 1, escalated back to step 3 over two discards (iterations 9 and 10), and stayed pinned at opus/max for the remaining 9 iterations. Since de-escalation requires a keep, and no iteration after 8 improved, the ladder could not move down.

This is the fundamental problem: the bidirectional strategy only differs from the reset strategy when the agent is succeeding. When it's stuck in a plateau, which is when you most need cost control, both strategies behave identically (pinned at the ceiling). Run 6's reset ladder had the same plateau problem (pinned at step 3+ from iteration 11 onward), but its 10-step ladder meant it reached opus territory later and spent less there overall.

## Cross-Run Comparison

All runs: Fashion-MNIST, baseline ~82%, 5-minute training budget (except where noted).


| Run | Model | Iters | Best val_acc | Cost | $/iter | Key variable |
|-----|-------|-------|-------------|------|--------|-------------|
| 1 | opus/high | 10 | 90.68% | $3.54 | $0.35 | 2-min budget, basic harness |
| 2 | opus/high | 10 | 91.22% | $4.13 | $0.41 | 5-min budget |
| 3 | opus/high | 20 | 92.02% | $3.38 | $0.17 | Training curves in context |
| 4 | opus/high | 20 | 89.85% | $4.91 | $0.25 | Context improvements (infra bug) |
| 5 | opus/high | 20 | **93.95%** | **$2.10** | **$0.11** | Full context + infra fix |
| 6 | ladder (reset) | 20 | 92.75% | $1.77 | $0.09 | 10-step, reset on keep |
| 7 | ladder (bidir) | 20 | 92.82% | $4.29 | $0.21 | 4-step, de-escalate on keep |

Run 5 remains the best result: highest accuracy at the second-lowest cost. Its advantage is not context quality (runs 6 and 7 use the same context enhancements) but consistent use of opus/high for every iteration. The ladder runs save money in early iterations by using haiku, but haiku produces weaker architectural ideas during the critical phase where the biggest gains happen.

Run 7's bidirectional ladder is a clearly worse trade than run 6's reset ladder for this task: essentially the same accuracy (92.82% vs 92.75%) at 2.4x the cost ($4.29 vs $1.77), plus 3 wasted iterations from agent timeouts. The shorter ladder reaches opus/max faster, the bidirectional mechanism can't de-escalate during a plateau, and opus/max is both expensive and timeout-prone.

## Findings

**Shorter ladders escalate too fast.** Run 6's 10-step ladder kept costs low by exhausting the haiku and sonnet tiers before reaching opus. Run 7's 4-step ladder hit opus/max by iteration 11 and stayed there, burning $2.37 (55% of total cost) on 9 opus/max iterations with zero improvement.

**Bidirectional de-escalation is inert during plateaus.** The mechanism only activates on success, which is exactly when cost control matters least. During the plateau phase (where all the expensive iterations happen), it behaves identically to the reset strategy.
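
The two keep rules compared in this report can be sketched in a few lines. This is a minimal illustration, not the orchestrator's code: the names `next_step`, `trace`, and `RUN7` are hypothetical, the keep/discard sequence is copied from the run 7 results table, and timeouts are assumed to count as discards. Replaying run 7's outcomes under the reset rule is a counterfactual (different models would have produced different outcomes), so it only shows how the step index moves.

```python
from typing import List

# Run 7's 4-step ladder, as listed in "The Ladder".
LADDER = [("haiku", "high"), ("sonnet", "high"), ("opus", "high"), ("opus", "max")]


def next_step(step: int, kept: bool, bidirectional: bool,
              top: int = len(LADDER) - 1) -> int:
    """Return the ladder step for the next iteration.

    A discard escalates by one step. A keep de-escalates by one step
    (bidirectional) or resets to step 0 (reset). Both ends are clamped.
    """
    if not kept:
        return min(step + 1, top)
    return max(step - 1, 0) if bidirectional else 0


def trace(outcomes: List[bool], bidirectional: bool) -> List[int]:
    """Step used at each iteration, starting from step 0."""
    steps, step = [], 0
    for kept in outcomes:
        steps.append(step)
        step = next_step(step, kept, bidirectional)
    return steps


# Keep (True) / discard (False) for iterations 0-19 of run 7.
# Assumption: a TIMEOUT is treated like a discard.
RUN7 = [True, True, False, True, False, True, False, False, True] + [False] * 11

bidir = trace(RUN7, bidirectional=True)
reset = trace(RUN7, bidirectional=False)
print("bidirectional:", bidir)
print("reset:        ", reset)
```

Under these assumptions the bidirectional trace reproduces the Step column of the results table, and the reset rule reaches step 3 only one iteration later. Once the keeps stop, both rules sit at the top step for the rest of the run.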

**Opus/max needs a longer agent timeout.** The 300-second timeout is adequate for all other tiers but clips opus/max calls that would otherwise succeed. The successful opus/max call at iteration 15 took 4m06s (8 turns, $0.67), close to the limit. A 600-second timeout for opus tiers would likely eliminate this failure mode.

**The best cost-performance strategy remains fixed opus/high.** Run 5 achieved $0.11/iter with opus/high throughout, beating both ladder strategies on accuracy at a per-iteration cost close to run 6's ($0.09) and about half of run 7's ($0.21). The ladder's theoretical advantage, cheap iterations when easy progress is available, is real but modest (~$0.02/iter savings for run 6 relative to run 5), and it comes at the cost of weaker early-phase reasoning.
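
The cost figures quoted above can be rechecked from the cross-run table. The snippet below is a small arithmetic sketch: the totals and iteration counts are copied from this report, while the names `RUNS`, `per_iter`, and `PER_ITER` are hypothetical. `Decimal` with half-up rounding is used so that values such as $0.105 round to $0.11 as in the table.

```python
from decimal import Decimal, ROUND_HALF_UP

# (total cost in USD, iterations) per run, copied from the cross-run table.
RUNS = {
    1: ("3.54", 10),
    2: ("4.13", 10),
    3: ("3.38", 20),
    4: ("4.91", 20),
    5: ("2.10", 20),
    6: ("1.77", 20),
    7: ("4.29", 20),
}


def per_iter(cost: str, iters: int) -> Decimal:
    """Cost per iteration, rounded half-up to the nearest cent."""
    return (Decimal(cost) / iters).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


PER_ITER = {run: per_iter(cost, iters) for run, (cost, iters) in RUNS.items()}

# Run 7 vs run 6 total cost, and the share of run 7 spent on opus/max ($2.37).
cost_ratio = (Decimal("4.29") / Decimal("1.77")).quantize(Decimal("0.1"))
opus_max_share = (Decimal("2.37") / Decimal("4.29")).quantize(Decimal("0.01"))

# Per-iteration saving of the reset ladder (run 6) relative to fixed opus/high (run 5).
ladder_saving = PER_ITER[5] - PER_ITER[6]

for run, value in PER_ITER.items():
    print(f"run {run}: ${value}/iter")
print(f"run 7 / run 6 cost: {cost_ratio}x, opus/max share: {opus_max_share}")
print(f"run 6 saving vs run 5: ${ladder_saving}/iter")
```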