Autoresearch: What We Learned About Time Budgets

After two runs of an autoresearch implementation on Fashion-MNIST (one with broken budget enforcement, one with clean epoch-boundary exits), the key insight was about what the experiment is actually measuring.

---

THE BUDGET IS PART OF THE METRIC

The metric is not "validation accuracy." It is "validation accuracy achievable within N minutes of CPU training." The time budget is a constraint inside the objective function, not external to it.

This changes what the agent optimizes for. It is not searching for the best architecture in general -- it is searching architecture space conditioned on a fixed compute budget. A ResNet-152 might achieve 95% accuracy given an hour, but if it cannot converge in 5 minutes, it is worse than a smaller model that reaches 91% in that window.

The agent's reasoning reflected this. When it added a fourth residual block in the winning iteration, it explicitly noted that operating at 3x3 spatial resolution was "cheap" and that global average pooling reduced the classifier's input dimensionality from 1152 to 128. It was not just asking "what is the best architecture?" -- it was asking "what architecture extracts the most learning from 5 minutes of compute?"

---

WHY BUDGET CONTROL MATTERS

If the metric is "best accuracy in exactly N minutes," then every experiment must receive exactly N minutes of effective training. Not N minus a partial epoch that got killed. Not N plus the time it took SIGTERM to propagate through a C++ kernel.

Our first run (2-minute budget) got this wrong. Every experiment was killed mid-epoch by SIGTERM, which PyTorch defers until the current C++ operation finishes. The same change (input normalization) was discarded in iteration 1 and accepted in iteration 3 -- same code, different result, because the kill landed at different points in the epoch cycle. The keep/discard signal was contaminated by measurement noise that had nothing to do with architecture quality.
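The contamination can be made concrete with toy numbers (the timings below are hypothetical, not measurements from the run): a hard kill discards the partial epoch it lands in, so the effective training an experiment receives depends on where the kill falls in the epoch cycle rather than on the architecture under test.

```python
def effective_epochs(kill_at_secs: float, secs_per_epoch: float) -> int:
    """A hard kill mid-epoch throws away the partial epoch, so only
    fully completed epochs contribute to the evaluated model."""
    return int(kill_at_secs // secs_per_epoch)

# Same nominal 120s budget; the SIGTERM lands a couple of seconds apart
# because PyTorch waits for the current C++ op to return first.
print(effective_epochs(119, 20))  # 5 full epochs
print(effective_epochs(121, 20))  # 6 full epochs: same code, different signal
```

A two-second jitter in kill delivery translates into a whole epoch of difference in effective compute, which is exactly the noise that flipped the input-normalization decision between iterations.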
Our second run (5-minute budget) fixed this by having train.py self-terminate at epoch boundaries, reading a TRAINING_BUDGET_SECS environment variable. Every iteration exited cleanly (timed_out=False, exit=0). The hard kill via the OS timeout command fired 30 seconds after the budget as a safety net only; it was never needed.

The result was not just better numbers (0.9068 -> 0.9122). It was a different kind of research. With noisy comparisons, the agent stuck to high-magnitude changes whose signal survived the noise: replacing SGD with Adam, adding convolutional layers. With clean comparisons, it explored subtler techniques: squeeze-and-excitation attention (+0.05pp), test-time augmentation (+0.22pp), global average pooling (+0.15pp). Better measurement enabled better science.

---

WHEN DOES THIS APPLY?

Not all autoresearch-style experiments need this kind of budget control. The key variable is the relationship between time and metric.

When the metric is a function of how long you run (training accuracy, iterative solver quality, any gradient-based optimization): budget control is critical. Unequal effective compute contaminates the comparison signal.

When time IS the metric (which SAT solver is fastest, which code runs in fewer cycles): hard kills are fine -- they ARE the measurement. Budget control means "make the deadline consistent," not "exit at a clean boundary."

When time is irrelevant to the metric (does this code produce the correct output, do the tests pass): budget is just a safety net against infinite loops. The result does not depend on exactly when you stop.

Our experiment was firmly in the first category. The realization that the budget is part of the metric, not a constraint on the experiment, is the main thing we learned.
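The epoch-boundary exit the second run relied on can be sketched as follows. This is an illustrative reconstruction, not the actual train.py: train_with_budget and run_epoch are hypothetical names, and only the TRAINING_BUDGET_SECS variable comes from the run itself. The idea is to stop before starting an epoch that is unlikely to finish inside the budget, so the process always exits cleanly on its own.

```python
import os
import time

def train_with_budget(run_epoch, budget_secs):
    """Run whole epochs until the next one would likely overrun the budget,
    then return cleanly instead of waiting to be killed by SIGTERM."""
    start = time.monotonic()
    epochs = 0
    est = 0.0  # running estimate of seconds per epoch (0.0 = always run epoch 1)
    while (time.monotonic() - start) + est <= budget_secs:
        run_epoch()
        epochs += 1
        est = (time.monotonic() - start) / epochs
    # Clean exit: every experiment's result reflects only full epochs.
    return epochs

# The budget arrives via the environment, as in the second run's setup;
# the 300s default here is illustrative.
budget = float(os.environ.get("TRAINING_BUDGET_SECS", "300"))
```

With this in place, the OS-level timeout (fired 30 seconds after the budget in our setup) demotes from measurement mechanism to pure safety net, which is the role it should play.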