# Autoresearch: Context Quality as the Key Lever

Findings from five autoresearch runs on Fashion-MNIST, focusing on how the quality of context provided to the inner-loop agent determines overall performance.

## Setup

The autoresearch pattern (Karpathy, March 2026): an LLM agent modifies `train.py` → trains within a fixed time budget → evaluates → keeps if improved, discards otherwise. The orchestrator manages the loop; the agent only sees a prompt and edits one file.

- **Task:** Fashion-MNIST 10-class classification (28x28 grayscale, 60k train / 10k val)
- **Metric:** val_accuracy within a fixed training time budget on CPU
- **Agent:** Claude Opus, high effort, via `claude --print` subprocess
- **Infrastructure:** Sprite VM, PyTorch CPU-only, self-terminating training loop

## Results Across Five Runs

| Run | Budget | Iters | Best val_acc | Agent Cost | Key Change |
|-----|--------|-------|--------------|------------|------------|
| 1 | 2 min | 10 | 0.9068 | — | Baseline harness |
| 2 | 5 min | 10 | 0.9122 | — | Longer budget |
| 3 | 5 min | 20 | 0.9202 | — | More iterations |
| 4 | 2 min | 20 | 0.8985 | $4.91 | Context improvements (but venv bug wasted 6 iters) |
| 5 | 5 min | 20 | **0.9395** | **$2.10** | Context improvements + infrastructure fix |

Run 5 beat all previous runs by a large margin (+1.93 percentage points over run 3) while costing less than half of run 4.

## What Changed in the Harness

Two changes to the prompt rendered for the inner-loop agent, plus one infrastructure fix:

### 1. Full Agent Explanations

Previously, the experiment history shown to the agent truncated each entry to 60 characters of the first line — typically just a header like `**What I changed and why:**`. The agent could see *that* an experiment was discarded but had almost no information about *what* was tried or *why* it failed. After the change, the last 10 iterations include the agent's complete 2-5 sentence explanation. This allowed the agent to learn from its own reasoning.
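The actual harness code isn't shown in this note, so the field names below (`iteration`, `action`, `val_acc`, `explanation`) and the exact rendering are assumptions; this is a minimal sketch of the idea, combining the full-explanation window with the one-line summaries for older iterations:

```python
def render_history(entries, full_window=10):
    """Render experiment history for the agent prompt (hypothetical schema).

    Recent iterations keep the agent's complete explanation; older ones
    collapse to one-line summaries so they never disappear entirely.
    """
    older = entries[:-full_window]
    recent = entries[-full_window:]
    lines = []
    if older:
        best = max(e["val_acc"] for e in older)
        lines.append(f"Earlier experiments ({len(older)} iters, best val_acc {best:.4f}):")
        for e in older:
            # Compact summary: iteration, action, metric, first line only
            first = e["explanation"].splitlines()[0]
            lines.append(f"  iter {e['iteration']}: {e['action']} {e['val_acc']:.4f} - {first}")
        lines.append("")
    for e in recent:
        lines.append(f"Iteration {e['iteration']} ({e['action']}, val_acc {e['val_acc']:.4f}):")
        lines.append(e["explanation"])  # full explanation, no 60-char truncation
        lines.append("")
    return "\n".join(lines)
```

The key difference from the old behavior is the second loop: nothing in the recent window is truncated, so the agent's own causal reasoning survives into the next iteration's prompt.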
Example: when data augmentation was discarded in iter 2 (0.8753 vs 0.9112 baseline), the full explanation noted that augmentation slowed convergence within the limited epoch budget. The agent never tried augmentation again — it understood the *mechanism* of failure, not just the outcome.

### 2. Earlier Experiments Summary

Previously, iterations beyond the most recent 10 disappeared entirely from the agent's view. After the change, older iterations get a compact one-line summary (iteration number, action, metric, first-line description) with aggregate statistics. This prevents the agent from re-trying ideas that failed in older iterations it can no longer see.

### 3. Infrastructure Fix (VENV_PYTHON)

The orchestrator used `sys.executable` to run `train.py`, which resolved to system Python 3.13 — but PyTorch was installed in a `.venv` with Python 3.12. Run 4 lost 6 of 20 iterations to the agent debugging this. Fixed by pointing directly to `.venv/bin/python`.

## Run 5 Trajectory

All 20 iterations productive (zero infrastructure failures):

```
Iter   Action    val_acc   What changed
0      KEEP      0.9054    Baseline → CNN + Adam + BatchNorm + normalization
1      KEEP      0.9112    Moved normalization inside forward() (train/val consistency fix)
2      DISCARD   0.8753    Data augmentation (slowed convergence in limited epochs)
3      KEEP      0.9148    LR 1e-3→2e-3 + CosineAnnealing + label smoothing
4      KEEP      0.9243    Graduated dropout reduction (0.25→0.10/0.15/0.20)
5      DISCARD   0.9236    Residual connections (extra compute reduced epoch count)
6      DISCARD   0.9237    EMA weights (not enough epochs for EMA to help)
7      KEEP      0.9260    Batch size 128→256, LR 2e-3→3e-3 (faster epochs)
8      KEEP      0.9279    CosineAnnealing T_max 10→5 (match actual epoch count)
9      DISCARD   0.9242    AdamW with weight decay
10     DISCARD   0.9270    OneCycleLR scheduler
11     DISCARD   0.9236    Squeeze-and-Excitation attention
12     DISCARD   0.9235    Mixup training
13     KEEP      0.9351    Further dropout reduction (near-zero everywhere)
14     KEEP      0.9395    Complete dropout removal ← NEW BEST
15-19  DISCARD
                 0.929-0.938  Various attempts, none beat 0.9395
```

The agent discovered a coherent strategy: with only ~5 epochs fitting in the budget, convergence speed matters more than regularization. It progressively removed all dropout (iters 4, 13, 14), recognizing that BatchNorm + label smoothing provided sufficient regularization. Each dropout reduction was the single largest accuracy gain in its neighborhood.

## The Plateau Pattern

Every run hit a plateau where successive attempts failed to improve:

- Run 3: 10 consecutive failures after iter 10
- Run 5: 5 consecutive failures after iter 14 (iters 15-19 all scored 0.929-0.938)

The improvements showed diminishing returns. The agent tried AdamW, OneCycleLR, SE attention, mixup, residual connections — all reasonable ideas, all slightly worse. The current architecture (3-block CNN, no dropout, Adam + cosine annealing, label smoothing) may be near-optimal for this budget.

## Key Observations

**Context quality > iteration count.** Run 3 had 20 iterations with poor context and reached 0.9202. Run 5 had 20 iterations with rich context and reached 0.9395. The agent made better decisions when it could understand its own history.

**The agent reasons about mechanisms, not just outcomes.** With full explanations, the agent recognized *why* dropout hurts in low-epoch regimes and pursued that insight across three iterations (4, 13, 14). With truncated context, this kind of multi-step reasoning was impossible.

**Cost efficiency improved.** Run 5 cost $2.10 for 20 iterations ($0.105/iter average). Run 4 cost $4.91 for 20 iterations ($0.246/iter) — largely because failed iterations triggered longer agent sessions as the agent tried to debug infrastructure problems.

**The outer loop remains unsolved.** The agent's strategy is still static — `program.md` is never updated based on accumulated results.
The agent has no way to say "dropout removal is the winning strategy, focus there" or "architectural changes consistently fail in this regime, stop trying them." This is the next lever to pull.

## Best Model (iter 14, val_acc 0.9395)

3-block CNN (32→64→128 channels), BatchNorm after every conv, zero dropout, 256-unit classifier head. Adam (lr=3e-3) with CosineAnnealingLR (T_max=5), CrossEntropyLoss with label smoothing 0.1, batch size 256. Input normalized inside forward() with Fashion-MNIST statistics (mean=0.286, std=0.353). Fits ~5-6 epochs in the 300s CPU budget.

---

*Run 5 completed 2026-03-28. Orchestrator: autoresearch.py on sprite-pt5. Agent: Claude Opus, high effort.*
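Appendix: a hedged PyTorch reconstruction of the iter-14 model from the description above, not the agent's actual `train.py`. Kernel size 3, padding 1, and 2x2 max-pooling per block are assumptions the note doesn't state; the channel widths, per-conv BatchNorm, zero dropout, 256-unit head, in-forward normalization, and optimizer settings come from the text.

```python
import torch
import torch.nn as nn


class FashionCNN(nn.Module):
    """Sketch of the best model: 3-block CNN, BatchNorm everywhere, zero dropout."""

    def __init__(self):
        super().__init__()

        def block(cin, cout):
            # Conv + BatchNorm + ReLU + 2x2 pool; kernel/pool sizes are assumed
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        self.features = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
        # 28 -> 14 -> 7 -> 3 after three pools, so 128 * 3 * 3 features
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 3 * 3, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        # Normalize inside forward() so train and val see identical preprocessing
        x = (x - 0.286) / 0.353
        return self.head(self.features(x))


model = FashionCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # batch size 256 per the text
```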