Autoresearch Run 2: 5-Minute Budget, Clean Termination

A follow-up to the first autoresearch run (rANr3nebles). Same setup -- Claude Opus, Fashion-MNIST, 20 iterations -- with two fixes applied: the self-terminating training loop and the corrected timeout mechanism. This run validates that the fixes worked and shows what the agent does with more time per experiment.

---

WHAT CHANGED

Run 1 had a budget enforcement failure: every experiment was killed mid-epoch by SIGTERM, which PyTorch's C++ kernels can defer until the current operation finishes. Comparisons were noisy as a result. Two fixes:

1. The training script (train.py) now reads TRAINING_BUDGET_SECS from the environment and exits cleanly at epoch boundaries when time runs short. The hard kill (via the OS timeout command, which uses SIGKILL) fires 30 seconds after the budget as a safety net only.
2. Budget raised from 2 minutes to 5 minutes, giving architectures room to converge.

---

RESULTS

20 iterations, 5-minute budget, opus/high-effort. Best: 0.9122 at iteration 16.

Every iteration: timed_out=False, exit=0. The self-terminating loop worked throughout. Training times clustered in 250-315s -- the last epoch finishing cleanly, not being killed.

Keep history:

  Iter 0:  Data augmentation (horiz flip + random crop)      0.8220 -> 0.9033
  Iter 2:  Random erasing / cutout augmentation              0.9033 -> 0.9056
  Iter 3:  Residual blocks (ResNet-style skip connections)   0.9056 -> 0.9069
  Iter 5:  AdamW optimizer + gradient clipping               0.9069 -> 0.9080
  Iter 10: Squeeze-and-Excitation channel attention          0.9080 -> 0.9085
  Iter 12: Test-time augmentation (TTA, horiz flip average)  0.9085 -> 0.9107
  Iter 16: 4th ResBlock + Global Average Pooling             0.9107 -> 0.9122

7 improvements in 20 iterations. $4.13 total cost.

One crash (iter 14): an EMA implementation applied float operations to an integer buffer in BatchNorm. The orchestrator caught exit=1, parsed no metric, discarded the result cleanly, and continued. The resilience mechanism worked as designed.
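The self-terminating loop from fix 1 can be sketched as follows. This is a minimal sketch, not the run's actual train.py: the TRAINING_BUDGET_SECS variable is from the setup above, but run_training, train_one_epoch, and evaluate are hypothetical stand-ins.

```python
import os
import time

def run_training(train_one_epoch, evaluate, max_epochs=100):
    """Train until max_epochs or until the time budget would be exceeded.

    The clock is checked only at epoch boundaries, so the process always
    exits cleanly (exit=0) instead of being killed mid-operation.
    """
    budget = float(os.environ.get("TRAINING_BUDGET_SECS", "300"))
    start = time.monotonic()
    epoch_times = []
    for epoch in range(max_epochs):
        elapsed = time.monotonic() - start
        # Estimate whether another full epoch fits in the remaining budget;
        # be conservative by assuming the longest epoch seen so far.
        est_next = max(epoch_times) if epoch_times else 0.0
        if elapsed + est_next > budget:
            break  # stop at the boundary; the SIGKILL safety net never fires
        t0 = time.monotonic()
        train_one_epoch(epoch)
        epoch_times.append(time.monotonic() - t0)
    return evaluate()
```

Because the check happens before each epoch rather than via a signal, the 30-second SIGKILL net only matters if a single epoch badly overruns the estimate.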
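The iteration-12 keep (TTA via horizontal-flip averaging) is simple enough to sketch. The names here are hypothetical: model is any callable mapping a batch of images to per-class probabilities, shown NumPy-style rather than in PyTorch.

```python
import numpy as np

def predict_with_tta(model, images):
    """Average class probabilities over the identity and a horizontal flip.

    images: array of shape (N, H, W) or (N, H, W, C).
    model:  callable returning per-class probabilities of shape (N, K).
    """
    probs = model(images)
    flipped = images[:, :, ::-1]  # reverse the width axis
    return (probs + model(flipped)) / 2.0
```

The design choice: horizontal flips are label-preserving for Fashion-MNIST clothing classes, so averaging the two views cancels some orientation-specific noise at the cost of doubling inference time.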
---

COMPARISON TO RUN 1 (2-minute budget)

  Best accuracy: 0.9068 -> 0.9122 (+0.54pp)
  Improvements:  6/20 -> 7/20
  Budget flaw:   hard kills every iteration -> clean exits every iteration
  Architecture:  3-block CNN, cosine LR, label smoothing -> ResBlocks, SE attention, TTA, Global Average Pooling

The accuracy gap understates the difference. Run 1's agent was constrained to techniques that show results within 2 minutes -- normalization, dropout tuning, adding a third conv block. Run 2's agent explored genuinely more sophisticated territory: attention mechanisms, residual learning, test-time augmentation, advanced regularization. The 5-minute budget isn't just more of the same; it changes what the agent considers worth trying.

The agent is still within "known techniques" territory -- SE-Net, ResNet, and TTA are all well established. But with clean 5-minute budgets, those techniques are at least being evaluated properly rather than cut off mid-convergence. A harder problem or overnight scale (100+ iterations) would push into genuinely novel space.

---

OUTSTANDING ISSUE

The outer loop is still missing. Program.md's strategy section is static. A proper implementation would revise the research strategy itself based on accumulated results: noticing that augmentation techniques dominate the early improvements, or that architectural complexity requires longer budgets to show gains. That meta-learning loop -- where the agent updates its own research strategy -- is the remaining gap between this implementation and the original concept.
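What that outer loop might look like, as a deliberately crude sketch: every name here (Experiment, revise_strategy) is hypothetical, and a real implementation would have the LLM rewrite Program.md's strategy section rather than append a line. The example deltas are taken from Run 2's keep history.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    technique: str         # family, e.g. "augmentation" or "architecture"
    accuracy_delta: float  # change in best accuracy when this keep landed

def revise_strategy(strategy: str, history: list[Experiment]) -> str:
    """Append evidence-based guidance to the strategy text.

    Sums accuracy gains per technique family and tells the agent to
    prioritize whichever family has paid off most so far.
    """
    gains = {}
    for exp in history:
        gains[exp.technique] = gains.get(exp.technique, 0.0) + exp.accuracy_delta
    if not gains:
        return strategy
    best = max(gains, key=gains.get)
    return strategy + f"\nPrioritize {best} ideas: cumulative gain {gains[best]:+.4f}."
```

Even this stub would have noticed what the Run 2 history shows: augmentation keeps (iters 0, 2, 12) account for most of the total gain, which is exactly the kind of signal a static strategy section never acts on.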