Autoresearch Run 2: 5-Minute Budget, Clean Termination

A follow-up to the first autoresearch run (rANr3nebles). Same setup -- Claude Opus, Fashion-MNIST, 20 iterations -- with two fixes applied: the self-terminating training loop and the corrected timeout mechanism. This run validates that the fixes worked and shows what the agent does with more time per experiment.

---

WHAT CHANGED

Run 1 had a budget enforcement failure: every experiment was killed mid-epoch by SIGTERM, which PyTorch's C++ kernels can defer until the current operation finishes. Comparisons were noisy as a result. Two fixes:

1. The training script (train.py) now reads TRAINING_BUDGET_SECS from the environment and exits cleanly at epoch boundaries when time runs short. The hard kill (via the OS timeout command, which uses SIGKILL) fires 30 seconds after the budget as a safety net only.
2. Budget raised from 2 minutes to 5 minutes, giving architectures room to converge.

---

RESULTS

20 iterations, 5-minute budget, opus/high-effort. Best: 0.9122 at iteration 16.

Every iteration: timed_out=False, exit=0. The self-terminating loop worked throughout. Training times clustered in 250-315s -- the last epoch finishing cleanly, not being killed.

Keep history:

  Iter 0:  Data augmentation (horiz flip + random crop)      0.8220 -> 0.9033
  Iter 2:  Random erasing / cutout augmentation              0.9033 -> 0.9056
  Iter 3:  Residual blocks (ResNet-style skip connections)   0.9056 -> 0.9069
  Iter 5:  AdamW optimizer + gradient clipping               0.9069 -> 0.9080
  Iter 10: Squeeze-and-Excitation channel attention          0.9080 -> 0.9085
  Iter 12: Test-time augmentation (TTA, horiz flip average)  0.9085 -> 0.9107
  Iter 16: 4th ResBlock + Global Average Pooling             0.9107 -> 0.9122

7 improvements in 20 iterations. $4.13 total cost.

One crash (iter 14): an EMA implementation applied float operations to an integer buffer in BatchNorm. The orchestrator caught exit=1, parsed no metric, discarded the result cleanly, and continued. The resilience mechanism worked as designed.
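The self-terminating loop from fix 1 can be sketched as follows. This is a minimal sketch, not the run's actual train.py: the TRAINING_BUDGET_SECS variable is from the setup above, but run_training, train_one_epoch, and evaluate are hypothetical stand-ins.

```python
import os
import time

def run_training(train_one_epoch, evaluate, max_epochs=100):
    """Train until max_epochs or until the time budget would be exceeded.

    The clock is checked only at epoch boundaries, so the process always
    exits cleanly (exit=0) instead of being killed mid-operation.
    """
    budget = float(os.environ.get("TRAINING_BUDGET_SECS", "300"))
    start = time.monotonic()
    epoch_times = []
    for epoch in range(max_epochs):
        elapsed = time.monotonic() - start
        # Estimate whether another full epoch fits in the remaining budget;
        # be conservative by assuming the longest epoch seen so far.
        est_next = max(epoch_times) if epoch_times else 0.0
        if elapsed + est_next > budget:
            break  # stop at the boundary; the SIGKILL safety net never fires
        t0 = time.monotonic()
        train_one_epoch(epoch)
        epoch_times.append(time.monotonic() - t0)
    return evaluate()
```

Because the check happens before each epoch rather than via a signal, the 30-second SIGKILL net only matters if a single epoch badly overruns the estimate.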
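The iteration-12 keep (TTA via horizontal-flip averaging) is simple enough to sketch. The names here are hypothetical: model is any callable mapping a batch of images to per-class probabilities, shown NumPy-style rather than in PyTorch.

```python
import numpy as np

def predict_with_tta(model, images):
    """Average class probabilities over the identity and a horizontal flip.

    images: array of shape (N, H, W) or (N, H, W, C).
    model:  callable returning per-class probabilities of shape (N, K).
    """
    probs = model(images)
    flipped = images[:, :, ::-1]  # reverse the width axis
    return (probs + model(flipped)) / 2.0
```

The design choice: horizontal flips are label-preserving for Fashion-MNIST clothing classes, so averaging the two views cancels some orientation-specific noise at the cost of doubling inference time.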
---

COMPARISON TO RUN 1 (2-minute budget)

  Best accuracy: 0.9068 -> 0.9122 (+0.54pp)
  Improvements:  6/20 -> 7/20
  Budget flaw:   hard kills every iteration -> clean exits every iteration
  Architecture:  3-block CNN, cosine LR, label smoothing -> ResBlocks, SE attention, TTA, Global Average Pooling

The accuracy gap understates the difference. Run 1's agent was constrained to techniques that show results within 2 minutes -- normalization, dropout tuning, adding a third conv block. Run 2's agent explored genuinely more sophisticated territory: attention mechanisms, residual learning, test-time augmentation, advanced regularization. The 5-minute budget isn't just more of the same; it changes what the agent considers worth trying.

The agent is still within "known techniques" territory -- SE-Net, ResNet, and TTA are all well established. But with clean 5-minute budgets, those techniques are at least being evaluated properly rather than cut off mid-convergence. A harder problem or overnight scale (100+ iterations) would push into genuinely novel space.

---

OUTSTANDING ISSUE

The outer loop is still missing. Program.md's strategy section is static. A proper implementation would revise the research strategy itself based on accumulated results: noticing that augmentation techniques dominate the early improvements, or that architectural complexity requires longer budgets to show gains. That meta-learning loop -- where the agent updates its own research strategy -- is the remaining gap between this implementation and the original concept.
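What that outer loop might look like, as a deliberately crude sketch: every name here (Experiment, revise_strategy) is hypothetical, and a real implementation would have the LLM rewrite Program.md's strategy section rather than append a line. The example deltas are taken from Run 2's keep history.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    technique: str         # family, e.g. "augmentation" or "architecture"
    accuracy_delta: float  # change in best accuracy when this keep landed

def revise_strategy(strategy: str, history: list[Experiment]) -> str:
    """Append evidence-based guidance to the strategy text.

    Sums accuracy gains per technique family and tells the agent to
    prioritize whichever family has paid off most so far.
    """
    gains = {}
    for exp in history:
        gains[exp.technique] = gains.get(exp.technique, 0.0) + exp.accuracy_delta
    if not gains:
        return strategy
    best = max(gains, key=gains.get)
    return strategy + f"\nPrioritize {best} ideas: cumulative gain {gains[best]:+.4f}."
```

Even this stub would have noticed what the Run 2 history shows: augmentation keeps (iters 0, 2, 12) account for most of the total gain, which is exactly the kind of signal a static strategy section never acts on.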