# Autoresearch: What to Test Next (Post Run 7)

After seven runs on Fashion-MNIST, the empirical picture is clear: context quality matters more than model choice (run 5), ladder strategies save cost but sacrifice peak accuracy (runs 6-7), and every run plateaus after 10-14 iterations regardless of strategy. These are the most promising next experiments, ordered by expected information value.

## 1. Sonnet/high as the fixed model

We have never tested sonnet/high in isolation. Run 5 used opus/high ($0.11/iter, 93.95%). The ladder runs used sonnet/high for some iterations, where it averaged 78s and $0.18/call --- but that was in the context of a ladder, not as the sole model. A clean 20-iteration run with sonnet/high would answer: is opus/high actually necessary, or does sonnet/high produce equivalent architectural ideas at lower cost? If sonnet matches opus, it becomes the default for all future runs.

**Cost estimate:** ~$2-3 for 20 iterations. Quick to run, clean comparison against run 5.

## 2. Plateau intervention: rewrite program.md mid-run

The biggest unsolved problem. Every run hits a wall around iteration 10-14 and wastes the remaining iterations on fruitless attempts. The proposed mechanism: after N consecutive discards (e.g., 5), pause the inner loop and invoke an opus-tier agent to read the full history, diagnose why the loop stalled, and rewrite program.md with a new strategy. Resume from the best checkpoint.

No run has ever updated its strategy mid-flight. The agent currently operates under the same program.md from iteration 0 to 19, even as the problem shifts from "build a CNN" to "squeeze 0.1% from an already-optimized model." The outer loop --- strategy evolution --- is the scroll's recommended next step and the most obvious gap in the current system.

**Implementation:** Add a plateau detector to the orchestrator. When triggered, run a separate opus agent with a meta-prompt: "Here is the full history of N iterations. The last K were all discarded.
Diagnose what's happening and rewrite program.md to try a different approach." Then continue the inner loop with the new strategy.

## 3. Start from run 5's best model instead of baseline

Every run spends ~5 iterations rediscovering CNNs + BatchNorm + Adam + normalization. Starting from run 5's best model (93.95%, 3-block CNN, zero dropout, cosine LR) would skip the easy phase entirely and give the agent 20 full iterations to work on the hard problem: breaking past the plateau. This tests whether the plateau is a true local optimum for this compute budget or just insufficient iteration count.

**Risk:** The agent may have less room to maneuver. Run 5's model is already well-optimized for the 5-minute budget. But if 20 iterations can push from 93.95% to 94.5%+, that's evidence the plateau is soft.

## 4. Expert.md: agent-maintained knowledge base

Inspired by AgentSAT. After each iteration, the agent appends a structured lesson to expert.md: "dropout removal: reliable improvement under low-epoch regimes", "SE attention: not worth the epoch cost at 5-min budget", "data augmentation: harmful when epoch count < 10." This file persists and grows across iterations, giving the agent institutional memory rather than re-deriving insights from raw history.

The hypothesis: the agent makes the same mistakes across runs because it loses knowledge between iterations (it only sees the last 10 in detail). Expert.md would let it accumulate durable conclusions. The risk is that wrong conclusions become sticky --- the agent might avoid a technique that would work in a different context because expert.md says it failed once.

## Priority

Test 1 first (cheap, fast, fills a data gap), then test 2 (highest potential impact on the plateau problem, requires orchestrator changes). Tests 3 and 4 are independently valuable and could run in parallel with each other.
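As a closing sketch, the plateau detector proposed in test 2 is simple enough to pin down in code. This is a minimal illustration, not the orchestrator's actual implementation: `PlateauDetector`, `run_with_interventions`, and the threshold of 5 consecutive discards are all hypothetical names and values chosen here for clarity.

```python
from typing import List

PLATEAU_DISCARDS = 5  # hypothetical N: consecutive discards before intervening


class PlateauDetector:
    """Flags a stalled inner loop after N consecutive discarded iterations."""

    def __init__(self, threshold: int = PLATEAU_DISCARDS):
        self.threshold = threshold
        self.consecutive_discards = 0

    def record(self, accepted: bool) -> bool:
        """Record one iteration's outcome; return True when plateaued."""
        if accepted:
            self.consecutive_discards = 0
        else:
            self.consecutive_discards += 1
        return self.consecutive_discards >= self.threshold


def run_with_interventions(outcomes: List[bool]) -> List[int]:
    """Replay a sequence of accept/discard outcomes and return the iteration
    indices where a meta-agent rewrite of program.md would be triggered."""
    detector = PlateauDetector()
    triggers = []
    for i, accepted in enumerate(outcomes):
        if detector.record(accepted):
            triggers.append(i)
            # In the real orchestrator, this is where the inner loop would
            # pause: invoke an opus-tier agent with the full history, let it
            # rewrite program.md, restore the best checkpoint, then resume.
            detector = PlateauDetector()  # reset counter after intervention
    return triggers
```

For example, replaying one accepted iteration followed by five discards triggers an intervention at the fifth discard, and a run of nothing but discards triggers one intervention every five iterations. The reset after each trigger matters: without it, every subsequent discard would re-fire the meta-agent before the new strategy had a chance to run.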