Autoresearch: How the Harness Works

This is a companion to the previous experiment reports. It documents the practical machinery behind the autoresearch setup -- how the orchestrator, agent, and training loop interact -- so that future experiments on different problems can build on what we learned rather than re-discovering it.

This is one approach. Other experiments will likely surface different requirements and better patterns. Consider this a first draft, not a reference implementation.

---

THE THREE-FILE STRUCTURE

The experiment is split into three files with strict boundaries:

prepare.py -- Fixed infrastructure. Data loading, evaluation function, metric emission. The agent must never modify this file. It defines the contract: what inputs the model receives, how it is scored, and what format metrics are emitted in.

train.py -- The search space. Contains the model architecture, optimizer, hyperparameters, and training loop. This is the only file the agent modifies. Each iteration starts from the current best version of this file.

program.md -- Natural-language research strategy. Read by the orchestrator each iteration and rendered with current state (best metric, iteration count, experiment history) before being passed to the agent as its prompt. Contains both static sections (research objective, strategy) and dynamic placeholders filled at runtime.

The separation matters because it defines what the agent can and cannot change. prepare.py is the measurement apparatus -- if the agent could modify it, it could game the metric. train.py is the search space -- everything the agent might want to change lives here. program.md is the research strategy -- it tells the agent what to try and gives it context about what has already been tried.

---

THE ORCHESTRATOR

A Python script (autoresearch.py) runs the outer loop:

1. Snapshot train.py before the agent runs
2. Render program.md with current state and pass it to the agent
3. Agent modifies train.py (using Claude Code CLI as a subprocess)
4. Detect whether train.py actually changed (skip training if not)
5. Run training as a subprocess with a time budget
6. Extract the metric from training output
7. Compare to the best metric; keep or discard (restore the snapshot if discarding)
8. Log the iteration to experiments/history.jsonl
9. Repeat

The agent is invoked via claude --print --output-format json with restricted tools (Read, Write, Edit, Glob, Grep -- no Bash). It cannot run training, observe live output, or interact with anything outside the filesystem. This is deliberate: the orchestrator guarantees equal compute budgets and consistent evaluation, which the agent cannot do reliably on its own.

Each agent invocation is stateless (--no-session-persistence). Context from previous iterations is injected through program.md's rendered history table, not through session memory. This keeps each iteration reproducible and avoids context-window accumulation.

State is persisted as append-only JSONL (experiments/history.jsonl). On crash or restart, the orchestrator replays the log to reconstruct state. Each iteration also writes train.py.before, train.py.after, the diff, agent reasoning, and training stdout/stderr to experiments/iter_NNNN/ for post-hoc analysis.

---

THE METRIC EMISSION CONTRACT

train.py must print a JSON dict to stdout after each completed epoch:

    {"val_accuracy": 0.9033, "val_loss": 0.2814, "epoch": 5, "elapsed_secs": 142.3}

The orchestrator extracts the metric by reading the last valid JSON line from stdout that contains the configured metric key. This works even if training is interrupted -- the last complete epoch's metrics are always available.

This convention must be documented in program.md so the agent maintains it across modifications. In our experiment, we specified the required key (val_accuracy), the import path (from prepare import emit_metrics), and the expectation that metrics are emitted every epoch.
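The "last valid JSON line containing the configured key" rule can be sketched as follows. This is an illustrative reimplementation, not the actual orchestrator code; the function name extract_metric and its signature are made up here.

```python
import json


def extract_metric(stdout_text: str, key: str = "val_accuracy"):
    """Return the value of `key` from the last valid JSON line of stdout.

    Returns None if no line parses as a JSON object containing the key,
    e.g. when training crashed before the first epoch completed.
    """
    found = None
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # cheap filter before attempting a full parse
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial lines, e.g. output cut off by a kill
        if isinstance(record, dict) and key in record:
            found = record[key]  # keep overwriting: the last valid line wins
    return found
```

Because every complete epoch emits its own line, an interrupted run still yields the metrics of the last epoch that finished, which is exactly the interruption-tolerance property described above.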
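The append-only log and crash-replay behaviour can be sketched like this. The record fields (iteration, metric, accepted) are illustrative; the real history schema may differ.

```python
import json
from pathlib import Path


def append_iteration(path: Path, record: dict) -> None:
    # Append-only: one JSON object per line. A crash mid-write can at worst
    # corrupt the final line, never an earlier record.
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")


def replay_history(path: Path):
    """Reconstruct (best_metric, iteration_count) by replaying the log."""
    best, count = None, 0
    if not path.exists():
        return best, count
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        count = max(count, rec.get("iteration", count))
        metric = rec.get("metric")
        if rec.get("accepted") and metric is not None:
            if best is None or metric > best:
                best = metric
    return best, count
```

On restart the orchestrator would call replay_history once and resume from the reconstructed state, so no separate checkpoint file is needed.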
---

TIME BUDGET ENFORCEMENT

This was the hardest part to get right and went through three iterations:

Attempt 1 (broken): subprocess.Popen with proc.wait(timeout=N), then SIGTERM on timeout. Failed because PyTorch's C++ kernels defer SIGTERM until the current operation completes -- a backward pass can hold the signal for tens of seconds. Every experiment was killed at an unpredictable point mid-epoch.

Attempt 2 (better): Delegate to the OS timeout command, which sends SIGKILL (which cannot be deferred) after a grace period. Reliable termination, but still kills mid-epoch.

Attempt 3 (correct): Pass TRAINING_BUDGET_SECS as an environment variable. train.py tracks epoch durations and exits cleanly when there is not enough time for another full epoch. The timeout command fires at BUDGET + 30 seconds as a safety net only.

The self-terminating pattern in train.py:

    import os
    import time

    from prepare import emit_metrics

    BUDGET_SECS = int(os.environ.get("TRAINING_BUDGET_SECS", "300"))
    start = time.time()
    epoch_durations = []
    for epoch in range(1, MAX_EPOCHS + 1):  # MAX_EPOCHS defined elsewhere in train.py
        if epoch_durations:
            avg = sum(epoch_durations) / len(epoch_durations)
            # Stop early if an average-length epoch would overrun the budget.
            if time.time() - start + avg > BUDGET_SECS:
                break
        t0 = time.time()
        # ... train one epoch, producing a `metrics` dict ...
        epoch_durations.append(time.time() - t0)
        emit_metrics(metrics)

This pattern must be documented in program.md as a required contract, not an experiment. The agent is told to maintain it in every version of train.py it writes. In our runs, the agent adopted it after seeing it in program.md and maintained it through all subsequent iterations.

The reason this matters: if the metric is "best accuracy achievable in N minutes," then every experiment must receive exactly N minutes of effective training. Mid-epoch kills introduce measurement noise unrelated to architecture quality. In our first run, the same change was discarded and then accepted on retry because the kill landed at different points in the epoch cycle.

---

WHAT WE DID NOT BUILD

The outer loop. program.md's strategy section was static throughout both runs.
A proper implementation would revise the research strategy based on accumulated results -- recognising that certain classes of changes consistently fail under the current budget, or that the improvement curve has flattened and a different search direction is needed.

Multi-agent coordination. Each iteration was a single agent making a single change. The original autoresearch concept suggests that multiple agents could explore different branches of the search space in parallel, with git as the coordination mechanism.

Generality. This harness is wired to a specific metric type (a higher-is-better scalar extracted from JSON stdout). Different experiments might need different evaluation protocols -- test pass rates, benchmark throughput, compilation time. The orchestrator would need to be parameterised or rewritten for each.

---

OPEN QUESTIONS FOR FUTURE EXPERIMENTS

How should the keep/discard threshold work? We used strict improvement (new > best). Should there be a significance threshold to avoid keeping noise? Should there be a mechanism for accepting lateral moves that enable future improvements?

How should the agent handle failed experiments? In our run, one iteration crashed with a type error. The orchestrator discarded it and moved on. Should the agent see the error message and learn from it? Currently it only sees "METRIC_PARSE_FAILED" in the history.

What is the right balance between agent autonomy and orchestrator control? We blocked Bash entirely. An agent with Bash access could run quick sanity checks, inspect model sizes, or profile epoch times before committing to a full training run. But it could also introduce inconsistency in the evaluation protocol.

How does budget selection interact with the search space? Our 2-minute budget constrained the agent to quick-converging architectures. Our 5-minute budget opened up deeper networks and attention mechanisms.
The budget shapes which region of architecture space is even reachable -- it is a hyperparameter of the research process itself.
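To make the first open question concrete, here is one possible shape for a keep rule with a minimum-improvement margin. This is a speculative sketch, not something the harness implements; should_keep and abs_margin are invented names, and a principled version would estimate the margin from the metric's run-to-run variance (e.g. repeated baseline runs) rather than hard-coding it.

```python
def should_keep(new, best, abs_margin=0.002):
    """Keep a change only if it beats the best metric by a minimum margin.

    Plain strict improvement (new > best) is the special case abs_margin=0;
    a positive margin trades away tiny real gains to avoid keeping noise.
    """
    if best is None:
        return True  # first successful run always becomes the baseline
    return new > best + abs_margin
```

This still does not answer the lateral-move question: any fixed threshold rejects changes that keep the metric flat but open up a better region of the search space.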