Autoresearch: What We Learned About Time Budgets

After two runs of an autoresearch implementation on Fashion-MNIST (one with broken budget enforcement, one with clean epoch-boundary exits), the key insight was about what the experiment is actually measuring.

---

THE BUDGET IS PART OF THE METRIC

The metric is not "validation accuracy." It is "validation accuracy achievable within N minutes of CPU training." The time budget is a constraint inside the objective function, not external to it.

This changes what the agent optimizes for. It is not searching for the best architecture in general -- it is searching architecture space conditioned on a fixed compute budget. A ResNet-152 might achieve 95% accuracy given an hour, but if it cannot converge in 5 minutes, it is worse than a smaller model that reaches 91% in that window.

The agent's reasoning reflected this. When it added a fourth residual block in the winning iteration, it explicitly noted that operating at 3x3 spatial resolution was "cheap" and that global average pooling reduced the classifier's input dimensionality from 1152 to 128. It was not just asking "what is the best architecture?" -- it was asking "what architecture extracts the most learning from 5 minutes of compute?"

---

WHY BUDGET CONTROL MATTERS

If the metric is "best accuracy in exactly N minutes," then every experiment must receive exactly N minutes of effective training. Not N minus a partial epoch that got killed. Not N plus the time it took SIGTERM to propagate through a C++ kernel.

Our first run (2-minute budget) got this wrong. Every experiment was killed mid-epoch by SIGTERM, which PyTorch defers until the current C++ operation finishes. The same change (input normalization) was discarded in iteration 1 and accepted in iteration 3 -- same code, different result, because the kill landed at different points in the epoch cycle. The keep/discard signal was contaminated by measurement noise that had nothing to do with architecture quality.
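The contamination can be made concrete with toy numbers (the timings below are hypothetical, not measurements from the run): a hard kill discards the partial epoch it lands in, so the effective training an experiment receives depends on where the kill falls in the epoch cycle rather than on the architecture under test.

```python
def effective_epochs(kill_at_secs: float, secs_per_epoch: float) -> int:
    """A hard kill mid-epoch throws away the partial epoch, so only
    fully completed epochs contribute to the evaluated model."""
    return int(kill_at_secs // secs_per_epoch)

# Same nominal 120s budget; the SIGTERM lands a couple of seconds apart
# because PyTorch waits for the current C++ op to return first.
print(effective_epochs(119, 20))  # 5 full epochs
print(effective_epochs(121, 20))  # 6 full epochs: same code, different signal
```

A two-second jitter in kill delivery translates into a whole epoch of difference in effective compute, which is exactly the noise that flipped the input-normalization decision between iterations.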
Our second run (5-minute budget) fixed this by having train.py self-terminate at epoch boundaries, reading a TRAINING_BUDGET_SECS environment variable. Every iteration exited cleanly (timed_out=False, exit=0). The hard kill via the OS timeout command fired 30 seconds after the budget as a safety net only; it was never needed.

The result was not just better numbers (0.9068 -> 0.9122). It was a different kind of research. With noisy comparisons, the agent stuck to high-magnitude changes whose signal survived the noise: replacing SGD with Adam, adding convolutional layers. With clean comparisons, it explored subtler techniques: squeeze-and-excitation attention (+0.05pp), test-time augmentation (+0.22pp), global average pooling (+0.15pp). Better measurement enabled better science.

---

WHEN DOES THIS APPLY?

Not all autoresearch-style experiments need this kind of budget control. The key variable is the relationship between time and metric.

When the metric is a function of how long you run (training accuracy, iterative solver quality, any gradient-based optimization): budget control is critical. Unequal effective compute contaminates the comparison signal.

When time IS the metric (which SAT solver is fastest, which code runs in fewer cycles): hard kills are fine -- they ARE the measurement. Budget control means "make the deadline consistent," not "exit at a clean boundary."

When time is irrelevant to the metric (does this code produce the correct output, do the tests pass): budget is just a safety net against infinite loops. The result does not depend on exactly when you stop.

Our experiment was firmly in the first category. The realization that the budget is part of the metric, not a constraint on the experiment, is the main thing we learned.
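The epoch-boundary exit the second run relied on can be sketched as follows. This is an illustrative reconstruction, not the actual train.py: train_with_budget and run_epoch are hypothetical names, and only the TRAINING_BUDGET_SECS variable comes from the run itself. The idea is to stop before starting an epoch that is unlikely to finish inside the budget, so the process always exits cleanly on its own.

```python
import os
import time

def train_with_budget(run_epoch, budget_secs):
    """Run whole epochs until the next one would likely overrun the budget,
    then return cleanly instead of waiting to be killed by SIGTERM."""
    start = time.monotonic()
    epochs = 0
    est = 0.0  # running estimate of seconds per epoch (0.0 = always run epoch 1)
    while (time.monotonic() - start) + est <= budget_secs:
        run_epoch()
        epochs += 1
        est = (time.monotonic() - start) / epochs
    # Clean exit: every experiment's result reflects only full epochs.
    return epochs

# The budget arrives via the environment, as in the second run's setup;
# the 300s default here is illustrative.
budget = float(os.environ.get("TRAINING_BUDGET_SECS", "300"))
```

With this in place, the OS-level timeout (fired 30 seconds after the budget in our setup) demotes from measurement mechanism to pure safety net, which is the role it should play.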