# Autoresearch: Context Quality as the Key Lever

Findings from five autoresearch runs on Fashion-MNIST, focusing on how the quality of context provided to the inner-loop agent determines overall performance.

## Setup

The autoresearch pattern (Karpathy, March 2026): an LLM agent modifies `train.py` → trains within a fixed time budget → evaluates → keeps if improved, discards otherwise. The orchestrator manages the loop; the agent only sees a prompt and edits one file.

- **Task:** Fashion-MNIST 10-class classification (28x28 grayscale, 60k train / 10k val)
- **Metric:** val_accuracy within a fixed training time budget on CPU
- **Agent:** Claude Opus, high effort, via `claude --print` subprocess
- **Infrastructure:** Sprite VM, PyTorch CPU-only, self-terminating training loop

## Results Across Five Runs

| Run | Budget | Iters | Best val_acc | Agent Cost | Key Change |
|-----|--------|-------|--------------|------------|------------|
| 1 | 2 min | 10 | 0.9068 | — | Baseline harness |
| 2 | 5 min | 10 | 0.9122 | — | Longer budget |
| 3 | 5 min | 20 | 0.9202 | — | More iterations |
| 4 | 2 min | 20 | 0.8985 | $4.91 | Context improvements (but venv bug wasted 6 iters) |
| 5 | 5 min | 20 | **0.9395** | **$2.10** | Context improvements + infrastructure fix |

Run 5 beat all previous runs by a large margin (+1.93 percentage points over run 3) while costing less than half of run 4.

## What Changed in the Harness

Two changes to the prompt rendered for the inner-loop agent, plus one infrastructure fix:

### 1. Full Agent Explanations

Previously, the experiment history shown to the agent truncated each entry to 60 characters of the first line — typically just a header like `**What I changed and why:**`. The agent could see *that* an experiment was discarded but had almost no information about *what* was tried or *why* it failed. After the change, the last 10 iterations include the agent's complete 2-5 sentence explanation. This allowed the agent to learn from its own reasoning.
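The actual harness code isn't shown in this note, so the field names below (`iteration`, `action`, `val_acc`, `explanation`) and the exact rendering are assumptions; this is a minimal sketch of the idea, combining the full-explanation window with the one-line summaries for older iterations:

```python
def render_history(entries, full_window=10):
    """Render experiment history for the agent prompt (hypothetical schema).

    Recent iterations keep the agent's complete explanation; older ones
    collapse to one-line summaries so they never disappear entirely.
    """
    older = entries[:-full_window]
    recent = entries[-full_window:]
    lines = []
    if older:
        best = max(e["val_acc"] for e in older)
        lines.append(f"Earlier experiments ({len(older)} iters, best val_acc {best:.4f}):")
        for e in older:
            # Compact summary: iteration, action, metric, first line only
            first = e["explanation"].splitlines()[0]
            lines.append(f"  iter {e['iteration']}: {e['action']} {e['val_acc']:.4f} - {first}")
        lines.append("")
    for e in recent:
        lines.append(f"Iteration {e['iteration']} ({e['action']}, val_acc {e['val_acc']:.4f}):")
        lines.append(e["explanation"])  # full explanation, no 60-char truncation
        lines.append("")
    return "\n".join(lines)
```

The key difference from the old behavior is the second loop: nothing in the recent window is truncated, so the agent's own causal reasoning survives into the next iteration's prompt.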
Example: when data augmentation was discarded in iter 2 (0.8753 vs 0.9112 baseline), the full explanation noted that augmentation slowed convergence within the limited epoch budget. The agent never tried augmentation again — it understood the *mechanism* of failure, not just the outcome.

### 2. Earlier Experiments Summary

Previously, iterations beyond the most recent 10 disappeared entirely from the agent's view. After the change, older iterations get a compact one-line summary (iteration number, action, metric, first-line description) with aggregate statistics. This prevents the agent from re-trying ideas that failed in older iterations it can no longer see.

### 3. Infrastructure Fix (VENV_PYTHON)

The orchestrator used `sys.executable` to run `train.py`, which resolved to system Python 3.13 — but PyTorch was installed in a `.venv` with Python 3.12. Run 4 lost 6 of 20 iterations to the agent debugging this. Fixed by pointing directly to `.venv/bin/python`.

## Run 5 Trajectory

All 20 iterations productive (zero infrastructure failures):

```
Iter   Action    val_acc   What changed
0      KEEP      0.9054    Baseline → CNN + Adam + BatchNorm + normalization
1      KEEP      0.9112    Moved normalization inside forward() (train/val consistency fix)
2      DISCARD   0.8753    Data augmentation (slowed convergence in limited epochs)
3      KEEP      0.9148    LR 1e-3→2e-3 + CosineAnnealing + label smoothing
4      KEEP      0.9243    Graduated dropout reduction (0.25→0.10/0.15/0.20)
5      DISCARD   0.9236    Residual connections (extra compute reduced epoch count)
6      DISCARD   0.9237    EMA weights (not enough epochs for EMA to help)
7      KEEP      0.9260    Batch size 128→256, LR 2e-3→3e-3 (faster epochs)
8      KEEP      0.9279    CosineAnnealing T_max 10→5 (match actual epoch count)
9      DISCARD   0.9242    AdamW with weight decay
10     DISCARD   0.9270    OneCycleLR scheduler
11     DISCARD   0.9236    Squeeze-and-Excitation attention
12     DISCARD   0.9235    Mixup training
13     KEEP      0.9351    Further dropout reduction (near-zero everywhere)
14     KEEP      0.9395    Complete dropout removal ← NEW BEST
15-19  DISCARD
                 0.929-0.938  Various attempts, none beat 0.9395
```

The agent discovered a coherent strategy: with only ~5 epochs fitting in the budget, convergence speed matters more than regularization. It progressively removed all dropout (iters 4, 13, 14), recognizing that BatchNorm + label smoothing provided sufficient regularization. Each dropout reduction was the single largest accuracy gain in its neighborhood.

## The Plateau Pattern

Every run hit a plateau where successive attempts failed to improve:

- Run 3: 10 consecutive failures after iter 10
- Run 5: 5 consecutive failures after iter 14 (iters 15-19 all scored 0.929-0.938)

The improvements showed diminishing returns. The agent tried AdamW, OneCycleLR, SE attention, mixup, residual connections — all reasonable ideas, all slightly worse. The current architecture (3-block CNN, no dropout, Adam + cosine annealing, label smoothing) may be near-optimal for this budget.

## Key Observations

**Context quality > iteration count.** Run 3 had 20 iterations with poor context and reached 0.9202. Run 5 had 20 iterations with rich context and reached 0.9395. The agent made better decisions when it could understand its own history.

**The agent reasons about mechanisms, not just outcomes.** With full explanations, the agent recognized *why* dropout hurts in low-epoch regimes and pursued that insight across three iterations (4, 13, 14). With truncated context, this kind of multi-step reasoning was impossible.

**Cost efficiency improved.** Run 5 cost $2.10 for 20 iterations ($0.105/iter average). Run 4 cost $4.91 for 20 iterations ($0.246/iter) — largely because failed iterations triggered longer agent sessions as the agent tried to debug infrastructure problems.

**The outer loop remains unsolved.** The agent's strategy is still static — `program.md` is never updated based on accumulated results.
The agent has no way to say "dropout removal is the winning strategy, focus there" or "architectural changes consistently fail in this regime, stop trying them." This is the next lever to pull.

## Best Model (iter 14, val_acc 0.9395)

3-block CNN (32→64→128 channels), BatchNorm after every conv, zero dropout, 256-unit classifier head. Adam (lr=3e-3) with CosineAnnealingLR (T_max=5), CrossEntropyLoss with label smoothing 0.1, batch size 256. Input normalized inside forward() with Fashion-MNIST statistics (mean=0.286, std=0.353). Fits ~5-6 epochs in the 300s CPU budget.

---

*Run 5 completed 2026-03-28. Orchestrator: autoresearch.py on sprite-pt5. Agent: Claude Opus, high effort.*
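Appendix: a hedged PyTorch reconstruction of the iter-14 model from the description above, not the agent's actual `train.py`. Kernel size 3, padding 1, and 2x2 max-pooling per block are assumptions the note doesn't state; the channel widths, per-conv BatchNorm, zero dropout, 256-unit head, in-forward normalization, and optimizer settings come from the text.

```python
import torch
import torch.nn as nn


class FashionCNN(nn.Module):
    """Sketch of the best model: 3-block CNN, BatchNorm everywhere, zero dropout."""

    def __init__(self):
        super().__init__()

        def block(cin, cout):
            # Conv + BatchNorm + ReLU + 2x2 pool; kernel/pool sizes are assumed
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        self.features = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
        # 28 -> 14 -> 7 -> 3 after three pools, so 128 * 3 * 3 features
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 3 * 3, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        # Normalize inside forward() so train and val see identical preprocessing
        x = (x - 0.286) / 0.353
        return self.head(self.features(x))


model = FashionCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # batch size 256 per the text
```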