Autoresearch Run 2: 5-Minute Budget, Clean Epoch Exits

A follow-up to the first run, with two changes: the time budget was extended from 2 to 5 minutes, and budget enforcement was fixed. Results were meaningfully better on both fronts.

---

THE FIX

The first run had a fundamental measurement problem: every experiment was killed mid-epoch by SIGTERM, which PyTorch's C++ kernels defer until the current operation completes. This made comparisons noisy: the same change could be accepted or rejected depending on where the kill landed.

The fix was two-part. First, delegate to the OS timeout command, whose SIGKILL cannot be deferred. Second, inject TRAINING_BUDGET_SECS as an environment variable so train.py can self-terminate at an epoch boundary before the hard kill fires.

Result: every one of the 20 iterations exited cleanly (timed_out=False, exit=0).

---

THE RUN

20 iterations, 5-minute budget, Fashion-MNIST, Claude Opus with high effort. Baseline: a single linear layer, SGD lr=0.1, batch=32 (~82% accuracy).

Keep history:

Iter 0:  Data augmentation (horizontal flip + random crop)        -> 0.9033
Iter 2:  Random erasing / cutout (50% prob, 4-12px patches)       -> 0.9056
Iter 3:  Residual blocks (ResNet-style skip connections)          -> 0.9069
Iter 5:  AdamW (decoupled weight decay 2e-4) + gradient clipping  -> 0.9080
Iter 10: Squeeze-and-Excitation channel attention per ResBlock    -> 0.9085
Iter 12: Test-time augmentation (avg logits: orig + h-flip)       -> 0.9107
Iter 16: 4th ResBlock + global average pooling                    -> 0.9122

Final: 91.2% val accuracy, 7 accepted improvements in 20 iterations, $4.13 total. One iteration (iter 14) crashed with a type error in an EMA implementation; the orchestrator caught the failure, discarded the change, and continued cleanly.

---

COMPARISON WITH RUN 1 (2-minute budget)

Accuracy:     0.9068 -> 0.9122 (+0.54pp)
Termination:  all hard-killed -> all clean epoch exits
Architecture: 3 conv blocks -> 4 ResBlocks with SE attention + GAP

The accuracy gap is modest, but the architecture gap is large.
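The clean epoch exits come from the soft-budget check described in THE FIX. A minimal sketch of how such a guard can look inside train.py, assuming only the TRAINING_BUDGET_SECS env var name from the run; `run_with_budget`, `train_one_epoch`, and the pacing heuristic are hypothetical stand-ins, and the orchestrator's hard `timeout -s KILL` wrapper remains the backstop:

```python
import os
import time

def run_with_budget(train_one_epoch, max_epochs=50):
    """Train until the soft budget expires, exiting cleanly at an epoch boundary.

    Returns the number of completed epochs. The check stops BEFORE starting an
    epoch that, on current average pacing, would overrun the budget -- so the
    process can exit on its own before the hard SIGKILL ever fires.
    """
    budget = float(os.environ.get("TRAINING_BUDGET_SECS", "inf"))
    start = time.monotonic()
    completed = 0
    for epoch in range(max_epochs):
        elapsed = time.monotonic() - start
        # Average seconds per completed epoch so far (0.0 before the first one).
        per_epoch = elapsed / epoch if epoch else 0.0
        if elapsed + per_epoch >= budget:
            break  # next epoch would likely blow the budget; stop cleanly
        train_one_epoch(epoch)
        completed += 1
    return completed
```

One design note: using an average-pace estimate is conservative when early epochs are slow (warmup, data caching), which is the safe direction here since overrunning means a hard kill.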
With 2 minutes, the agent stuck to changes that show benefit quickly (optimizer, normalization, dropout). With 5 minutes, it explored techniques that need longer to converge: residual connections, attention mechanisms, test-time augmentation. The agent's reasoning explicitly accounted for whether each change would fit the budget.

---

WHAT IS STILL MISSING

Every technique applied is published and well-understood: SE-Net, ResNet, TTA, AdamW. At 91.2%, the remaining ~3pp gap requires either longer training or novel combinations.

The outer loop (updating the program.md strategy from accumulated results) was not implemented; the agent learned from the history table, but the strategy itself was static. That meta-learning loop is where genuinely novel search would happen.
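As an endnote, the cheapest accepted change (iter 12's test-time augmentation) is simple enough to sketch in full: average the class logits over the original image and its horizontal flip. A framework-agnostic sketch, where the real run operates on PyTorch tensors but `model` here is any callable returning a list of logits and images are lists of pixel rows (both hypothetical stand-ins):

```python
def hflip(image):
    """Horizontally flip an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]

def tta_logits(model, image):
    """Average class logits over the original and horizontally flipped input.

    Fashion-MNIST classes are largely symmetric under horizontal flips, so
    averaging the two views reduces prediction variance at test time at the
    cost of one extra forward pass per image.
    """
    original = model(image)
    flipped = model(hflip(image))
    return [(a + b) / 2 for a, b in zip(original, flipped)]
```

The 2x inference cost is why this kind of change only becomes attractive once the budget is long enough to absorb it.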