Autoresearch Run 2: 5-Minute Budget, Clean Epoch Exits

A follow-up to the first run, with two changes: the time budget was extended from 2 to 5 minutes, and budget enforcement was fixed. Results were meaningfully better on both fronts.

---

THE FIX

The first run had a fundamental measurement problem: every experiment was killed mid-epoch by SIGTERM, which PyTorch's C++ kernels defer until the current operation completes. This made comparisons noisy: the same change could be accepted or rejected depending on where the kill landed.

The fix was two-part. First, delegate to the OS timeout command, whose SIGKILL cannot be deferred. Second, inject TRAINING_BUDGET_SECS as an environment variable so train.py can self-terminate at an epoch boundary before the hard kill fires.

Result: every one of the 20 iterations exited cleanly (timed_out=False, exit=0).

---

THE RUN

20 iterations, 5-minute budget, Fashion-MNIST, Claude Opus with high effort. Baseline: a single linear layer, SGD lr=0.1, batch=32 (~82% accuracy).

Keep history:

Iter 0:  Data augmentation (horizontal flip + random crop)        -> 0.9033
Iter 2:  Random erasing / cutout (50% prob, 4-12px patches)       -> 0.9056
Iter 3:  Residual blocks (ResNet-style skip connections)          -> 0.9069
Iter 5:  AdamW (decoupled weight decay 2e-4) + gradient clipping  -> 0.9080
Iter 10: Squeeze-and-Excitation channel attention per ResBlock    -> 0.9085
Iter 12: Test-time augmentation (avg logits: orig + h-flip)       -> 0.9107
Iter 16: 4th ResBlock + global average pooling                    -> 0.9122

Final: 91.2% val accuracy, 7 accepted improvements in 20 iterations, $4.13 total. One iteration (iter 14) crashed with a type error in an EMA implementation; the orchestrator caught the failure, discarded the change, and continued cleanly.

---

COMPARISON WITH RUN 1 (2-minute budget)

Accuracy:     0.9068 -> 0.9122 (+0.54pp)
Termination:  all hard-killed -> all clean epoch exits
Architecture: 3 conv blocks -> 4 ResBlocks with SE attention + GAP

The accuracy gap is modest, but the architecture gap is large.
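The clean epoch exits come from the soft-budget check described in THE FIX. A minimal sketch of how such a guard can look inside train.py, assuming only the TRAINING_BUDGET_SECS env var name from the run; `run_with_budget`, `train_one_epoch`, and the pacing heuristic are hypothetical stand-ins, and the orchestrator's hard `timeout -s KILL` wrapper remains the backstop:

```python
import os
import time

def run_with_budget(train_one_epoch, max_epochs=50):
    """Train until the soft budget expires, exiting cleanly at an epoch boundary.

    Returns the number of completed epochs. The check stops BEFORE starting an
    epoch that, on current average pacing, would overrun the budget -- so the
    process can exit on its own before the hard SIGKILL ever fires.
    """
    budget = float(os.environ.get("TRAINING_BUDGET_SECS", "inf"))
    start = time.monotonic()
    completed = 0
    for epoch in range(max_epochs):
        elapsed = time.monotonic() - start
        # Average seconds per completed epoch so far (0.0 before the first one).
        per_epoch = elapsed / epoch if epoch else 0.0
        if elapsed + per_epoch >= budget:
            break  # next epoch would likely blow the budget; stop cleanly
        train_one_epoch(epoch)
        completed += 1
    return completed
```

One design note: using an average-pace estimate is conservative when early epochs are slow (warmup, data caching), which is the safe direction here since overrunning means a hard kill.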
With 2 minutes, the agent stuck to changes that show benefit quickly (optimizer, normalization, dropout). With 5 minutes, it explored techniques that need longer to converge: residual connections, attention mechanisms, test-time augmentation. The agent's reasoning explicitly accounted for whether each change would fit the budget.

---

WHAT IS STILL MISSING

Every technique applied is published and well-understood: SE-Net, ResNet, TTA, AdamW. At 91.2%, the remaining ~3pp gap requires either longer training or novel combinations.

The outer loop (updating the program.md strategy from accumulated results) was not implemented; the agent learned from the history table, but the strategy itself was static. That meta-learning loop is where genuinely novel search would happen.
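As an endnote, the cheapest accepted change (iter 12's test-time augmentation) is simple enough to sketch in full: average the class logits over the original image and its horizontal flip. A framework-agnostic sketch, where the real run operates on PyTorch tensors but `model` here is any callable returning a list of logits and images are lists of pixel rows (both hypothetical stand-ins):

```python
def hflip(image):
    """Horizontally flip an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]

def tta_logits(model, image):
    """Average class logits over the original and horizontally flipped input.

    Fashion-MNIST classes are largely symmetric under horizontal flips, so
    averaging the two views reduces prediction variance at test time at the
    cost of one extra forward pass per image.
    """
    original = model(image)
    flipped = model(hflip(image))
    return [(a + b) / 2 for a, b in zip(original, flipped)]
```

The 2x inference cost is why this kind of change only becomes attractive once the budget is long enough to absorb it.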