Autoresearch on Fashion-MNIST: Implementation Notes

We implemented Karpathy's autoresearch concept (March 2026) using Claude Opus as the research agent and Fashion-MNIST as a validation domain. This is a report on the implementation, the results, and an honest assessment of what worked and what didn't.

---

THE CONCEPT

Autoresearch is a simple loop: modify train.py, train for a fixed budget, evaluate on a validation metric, keep improvements and discard regressions. The key insight is that an AI agent can make both parametric changes (tuning numbers) and structural changes (transforming program logic) -- the latter being where language-model understanding is genuinely useful.

The three-file structure: prepare.py (fixed infrastructure), train.py (the search space the agent modifies), and program.md (natural-language research strategy). An external Python orchestrator managed the loop while the agent focused solely on code modification.

---

THE RUN

20 iterations, 2-minute training budget each, Fashion-MNIST classification (10-class clothing images, 28x28 grayscale). Starting from a deliberately weak baseline: single linear layer, SGD lr=0.1, batch size 32 (~82% accuracy).

History of kept changes:

Iter 0:  Linear baseline -> 2-block CNN + BatchNorm + Adam + batch 128     0.822 -> 0.874
Iter 3:  Added input normalization (mean=0.2860, std=0.3530)               0.874 -> 0.879
Iter 4:  Cosine annealing LR (1e-3 -> 1e-5) + label smoothing 0.1          0.879 -> 0.884
Iter 8:  Fixed T_max (100 -> 25) to match the actual epoch count; added    0.884 -> 0.888
         a self-terminating budget loop driven by an environment variable
Iter 12: Reduced dropout (0.25 -> 0.10 conv, 0.5 -> 0.25 classifier);      0.888 -> 0.893
         existing BatchNorm + label smoothing already regularized well
Iter 14: Added a third conv block (64 -> 128 channels, 7x7 -> 3x3)         0.893 -> 0.907

Final: 90.7% validation accuracy, 6 improvements out of 20 iterations, $3.54 total cost.
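The keep/discard loop the orchestrator ran can be sketched as follows. This is a minimal sketch, not the actual orchestrator: `evaluate`, `propose`, and `revert` are hypothetical callback names standing in for "train under the fixed budget and score on validation", "let the agent edit train.py", and "restore the snapshot".

```python
def autoresearch(evaluate, propose, revert, n_iters=20):
    """Keep/discard loop from the notes: propose a change to train.py,
    train under a fixed budget, and keep the change only if the
    validation metric improves.  The callbacks keep the loop itself
    free of any training code."""
    best = evaluate()              # score the current train.py (baseline)
    kept = 0
    for _ in range(n_iters):
        propose()                  # agent modifies train.py
        score = evaluate()         # fixed-budget train + validation
        if score > best:
            best, kept = score, kept + 1   # keep the improvement
        else:
            revert()               # discard the regression, restore snapshot
    return best, kept
```

Because the metric comes from the orchestrator's own evaluation run, not from anything the agent reports, this structure is what makes the loop ungameable.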
Best architecture: 3-block CNN (32 -> 64 -> 128 channels), two Conv2d per block, BatchNorm + ReLU + MaxPool + Dropout2d throughout, input normalized to dataset statistics, Adam + cosine annealing, label smoothing 0.1.

---

WHAT WORKED

The agent did genuine structural search -- not just tuning learning rates, but redesigning the architecture, changing the training regime, and reasoning about what would actually fit within the time budget. The iter 12 reasoning was notably good: rather than adding capacity (which previous iterations showed increased epoch time and hurt results), it correctly identified that the existing BatchNorm and label smoothing meant high dropout was causing underfitting, not overfitting. The iter 8 reasoning caught a subtle bug: a T_max of 100 on a cosine scheduler that only sees ~20 epochs means the LR barely moves.

The keep/discard mechanism worked as intended. The agent could not game it by fabricating metrics; it could only predict what would help, then learn from whether the metric improved.

Structural changes dominated: of the 6 improvements, 5 involved architectural or training-regime changes, not just hyperparameter values.

---

WHAT WE BOTCHED

Budget enforcement. Every run was killed mid-epoch by SIGTERM rather than exiting cleanly at an epoch boundary. A Python-level SIGTERM handler only runs between bytecode instructions, so a backward pass executing inside PyTorch's C++ kernels defers the signal until the kernel returns. This made comparisons noisy: the same change (input normalization) was discarded in iter 1 and accepted in iter 3, partly because of where the kill happened.

The fix: delegate to the OS timeout command (SIGKILL cannot be caught or deferred) with a 30-second grace window beyond the budget, and inject TRAINING_BUDGET_SECS as an environment variable so train.py can self-terminate at epoch boundaries before the hard kill fires. The agent adopted the self-terminating pattern in iter 8's change. Future runs get clean epoch-boundary exits; the hard kill becomes a safety net only.
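The self-terminating pattern can be sketched like this. It is a sketch under assumptions, not the code the agent wrote: `run_epoch` and `grace` are illustrative names, and the budget estimate simply assumes the next epoch will take about as long as the last one.

```python
import os
import time


def train_within_budget(run_epoch, default_budget=120.0, grace=5.0):
    """Self-terminating training loop (the pattern adopted in iter 8):
    read TRAINING_BUDGET_SECS from the environment and stop at an epoch
    boundary before the orchestrator's hard kill fires.  Returns the
    number of completed epochs."""
    budget = float(os.environ.get("TRAINING_BUDGET_SECS", default_budget))
    start = time.monotonic()
    epochs, last = 0, 0.0
    while True:
        elapsed = time.monotonic() - start
        # Stop if the next epoch (estimated from the last one) would
        # overrun the budget, leaving a grace window before the kill.
        if elapsed + last > budget - grace:
            break
        t0 = time.monotonic()
        run_epoch(epochs)
        last = time.monotonic() - t0
        epochs += 1
    return epochs
```

On the orchestrator side, a wrapper along the lines of `timeout -k 30 150 python train.py` (GNU coreutils timeout: SIGTERM at 150 s, SIGKILL 30 s later) would make the hard kill a safety net only.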
---

ASSESSMENT

The budget flaw was real but narrow. The three-file structure, keep/discard logic, and structural search capability all functioned correctly. The external orchestrator approach is more controlled than having the agent run training directly -- it guarantees equal compute budgets, clean evaluation, and reliable history.

The 2-minute budget is tight enough that the agent was partly racing the clock rather than purely comparing architectural quality. Longer budgets (5+ minutes) would give cleaner signals, especially for deeper architectures that need more epochs to converge.

The missing piece is the outer loop: program.md's strategy section was static. A proper implementation would revise the research strategy itself based on accumulated results -- for instance, recognising that normalization changes need more epochs to stabilise, and adjusting future strategy accordingly. That meta-learning loop is where the real research leverage lives.
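One way the outer loop could look, as a minimal sketch only: hand the accumulated keep/discard history back to the agent and let it rewrite the strategy section of program.md between iterations. The `## Strategy` section marker and the `revise` callback (which would be an agent call in practice) are both hypothetical, not part of the implementation described above.

```python
def revise_strategy(program_md, history, revise):
    """Outer-loop sketch: summarize the keep/discard history and replace
    the strategy section of program.md with the agent's revision.
    Assumes a hypothetical '## Strategy' marker; returns the document
    unchanged if the marker is absent."""
    head, sep, _old = program_md.partition("## Strategy")
    if not sep:
        return program_md
    summary = "\n".join(
        f"iter {h['iter']}: {h['change']} -> "
        f"{'kept' if h['kept'] else 'discarded'}"
        for h in history
    )
    new_strategy = revise(summary)   # an LLM call in a real implementation
    return head + sep + "\n" + new_strategy
```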