Autoresearch on Fashion-MNIST: Implementation Notes

We implemented Karpathy's autoresearch concept (March 2026) using Claude Opus as the research agent and Fashion-MNIST as a validation domain. This is a report on the implementation, the results, and an honest assessment of what worked and what didn't.

---

THE CONCEPT

Autoresearch is a simple loop: modify train.py, train for a fixed budget, evaluate on a validation metric, keep improvements and discard regressions. The key insight is that an AI agent can make both parametric changes (tuning numbers) and structural changes (transforming program logic) -- the latter being where language-model understanding is genuinely useful.

The three-file structure: prepare.py (fixed infrastructure), train.py (the search space the agent modifies), and program.md (natural-language research strategy). An external Python orchestrator managed the loop while the agent focused solely on code modification.

---

THE RUN

20 iterations, 2-minute training budget each, Fashion-MNIST classification (10-class clothing images, 28x28 grayscale). Starting from a deliberately weak baseline: single linear layer, SGD lr=0.1, batch size 32 (~82% accuracy).

History of kept changes:

Iter 0:  Linear baseline -> 2-block CNN + BatchNorm + Adam + batch 128     0.822 -> 0.874
Iter 3:  Added input normalization (mean=0.2860, std=0.3530)               0.874 -> 0.879
Iter 4:  Cosine annealing LR (1e-3 -> 1e-5) + label smoothing 0.1          0.879 -> 0.884
Iter 8:  Fixed T_max (100 -> 25) to match the actual epoch count; added    0.884 -> 0.888
         a self-terminating budget loop driven by an environment variable
Iter 12: Reduced dropout (0.25 -> 0.10 conv, 0.5 -> 0.25 classifier);      0.888 -> 0.893
         existing BatchNorm + label smoothing already regularized well
Iter 14: Added a third conv block (64 -> 128 channels, 7x7 -> 3x3)         0.893 -> 0.907

Final: 90.7% validation accuracy, 6 improvements out of 20 iterations, $3.54 total cost.
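The keep/discard loop the orchestrator ran can be sketched as follows. This is a minimal sketch, not the actual orchestrator: `evaluate`, `propose`, and `revert` are hypothetical callback names standing in for "train under the fixed budget and score on validation", "let the agent edit train.py", and "restore the snapshot".

```python
def autoresearch(evaluate, propose, revert, n_iters=20):
    """Keep/discard loop from the notes: propose a change to train.py,
    train under a fixed budget, and keep the change only if the
    validation metric improves.  The callbacks keep the loop itself
    free of any training code."""
    best = evaluate()              # score the current train.py (baseline)
    kept = 0
    for _ in range(n_iters):
        propose()                  # agent modifies train.py
        score = evaluate()         # fixed-budget train + validation
        if score > best:
            best, kept = score, kept + 1   # keep the improvement
        else:
            revert()               # discard the regression, restore snapshot
    return best, kept
```

Because the metric comes from the orchestrator's own evaluation run, not from anything the agent reports, this structure is what makes the loop ungameable.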
Best architecture: 3-block CNN (32 -> 64 -> 128 channels), two Conv2d per block, BatchNorm + ReLU + MaxPool + Dropout2d throughout, input normalized to dataset statistics, Adam + cosine annealing, label smoothing 0.1.

---

WHAT WORKED

The agent did genuine structural search -- not just tuning learning rates, but redesigning the architecture, changing the training regime, and reasoning about what would actually fit within the time budget. The iter 12 reasoning was notably good: rather than adding capacity (which previous iterations showed increased epoch time and hurt results), it correctly identified that the existing BatchNorm and label smoothing meant high dropout was causing underfitting, not overfitting. The iter 8 reasoning caught a subtle bug: a T_max of 100 on a cosine scheduler that only sees ~20 epochs means the LR barely moves.

The keep/discard mechanism worked as intended. The agent could not game it by fabricating metrics; it could only predict what would help, then learn from whether the metric improved.

Structural changes dominated: of the 6 improvements, 5 involved architectural or training-regime changes, not just hyperparameter values.

---

WHAT WE BOTCHED

Budget enforcement. Every run was killed mid-epoch by SIGTERM rather than exiting cleanly at an epoch boundary. A Python-level SIGTERM handler only runs between bytecode instructions, so a backward pass executing inside PyTorch's C++ kernels defers the signal until the kernel returns. This made comparisons noisy: the same change (input normalization) was discarded in iter 1 and accepted in iter 3, partly because of where the kill happened.

The fix: delegate to the OS timeout command (SIGKILL cannot be caught or deferred) with a 30-second grace window beyond the budget, and inject TRAINING_BUDGET_SECS as an environment variable so train.py can self-terminate at epoch boundaries before the hard kill fires. The agent adopted the self-terminating pattern in iter 8's change. Future runs get clean epoch-boundary exits; the hard kill becomes a safety net only.
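The self-terminating pattern can be sketched like this. It is a sketch under assumptions, not the code the agent wrote: `run_epoch` and `grace` are illustrative names, and the budget estimate simply assumes the next epoch will take about as long as the last one.

```python
import os
import time


def train_within_budget(run_epoch, default_budget=120.0, grace=5.0):
    """Self-terminating training loop (the pattern adopted in iter 8):
    read TRAINING_BUDGET_SECS from the environment and stop at an epoch
    boundary before the orchestrator's hard kill fires.  Returns the
    number of completed epochs."""
    budget = float(os.environ.get("TRAINING_BUDGET_SECS", default_budget))
    start = time.monotonic()
    epochs, last = 0, 0.0
    while True:
        elapsed = time.monotonic() - start
        # Stop if the next epoch (estimated from the last one) would
        # overrun the budget, leaving a grace window before the kill.
        if elapsed + last > budget - grace:
            break
        t0 = time.monotonic()
        run_epoch(epochs)
        last = time.monotonic() - t0
        epochs += 1
    return epochs
```

On the orchestrator side, a wrapper along the lines of `timeout -k 30 150 python train.py` (GNU coreutils timeout: SIGTERM at 150 s, SIGKILL 30 s later) would make the hard kill a safety net only.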
---

ASSESSMENT

The budget flaw was real but narrow. The three-file structure, keep/discard logic, and structural search capability all functioned correctly. The external orchestrator approach is more controlled than having the agent run training directly -- it guarantees equal compute budgets, clean evaluation, and reliable history.

The 2-minute budget is tight enough that the agent was partly racing the clock rather than purely comparing architectural quality. Longer budgets (5+ minutes) would give cleaner signals, especially for deeper architectures that need more epochs to converge.

The missing piece is the outer loop: program.md's strategy section was static. A proper implementation would revise the research strategy itself based on accumulated results -- for instance, recognising that normalization changes need more epochs to stabilise, and adjusting future strategy accordingly. That meta-learning loop is where the real research leverage lives.
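One way the outer loop could look, as a minimal sketch only: hand the accumulated keep/discard history back to the agent and let it rewrite the strategy section of program.md between iterations. The `## Strategy` section marker and the `revise` callback (which would be an agent call in practice) are both hypothetical, not part of the implementation described above.

```python
def revise_strategy(program_md, history, revise):
    """Outer-loop sketch: summarize the keep/discard history and replace
    the strategy section of program.md with the agent's revision.
    Assumes a hypothetical '## Strategy' marker; returns the document
    unchanged if the marker is absent."""
    head, sep, _old = program_md.partition("## Strategy")
    if not sep:
        return program_md
    summary = "\n".join(
        f"iter {h['iter']}: {h['change']} -> "
        f"{'kept' if h['kept'] else 'discarded'}"
        for h in history
    )
    new_strategy = revise(summary)   # an LLM call in a real implementation
    return head + sep + "\n" + new_strategy
```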