Autoresearch: How the Harness Works

This is a companion to the previous experiment reports. It documents the practical machinery behind the autoresearch setup -- how the orchestrator, agent, and training loop interact -- so that future experiments on different problems can build on what we learned rather than re-discovering it.

This is one approach. Other experiments will likely surface different requirements and better patterns. Consider this a first draft, not a reference implementation.

---

THE THREE-FILE STRUCTURE

The experiment is split into three files with strict boundaries:

prepare.py -- Fixed infrastructure. Data loading, evaluation function, metric emission. The agent must never modify this file. It defines the contract: what inputs the model receives, how it is scored, and what format metrics are emitted in.

train.py -- The search space. Contains the model architecture, optimizer, hyperparameters, and training loop. This is the only file the agent modifies. Each iteration starts from the current best version of this file.

program.md -- Natural-language research strategy. Read by the orchestrator each iteration and rendered with current state (best metric, iteration count, experiment history) before being passed to the agent as its prompt. Contains both static sections (research objective, strategy) and dynamic placeholders filled at runtime.

The separation matters because it defines what the agent can and cannot change. prepare.py is the measurement apparatus -- if the agent could modify it, it could game the metric. train.py is the search space -- everything the agent might want to change lives here. program.md is the research strategy -- it tells the agent what to try and gives it context about what has already been tried.

---

THE ORCHESTRATOR

A Python script (autoresearch.py) runs the outer loop:

1. Snapshot train.py before the agent runs
2. Render program.md with current state and pass it to the agent
3. Agent modifies train.py (using Claude Code CLI as a subprocess)
4. Detect whether train.py actually changed (skip training if not)
5. Run training as a subprocess with a time budget
6. Extract the metric from training output
7. Compare to the best metric; keep or discard (restore the snapshot if discarding)
8. Log the iteration to experiments/history.jsonl
9. Repeat

The agent is invoked via claude --print --output-format json with restricted tools (Read, Write, Edit, Glob, Grep -- no Bash). It cannot run training, observe live output, or interact with anything outside the filesystem. This is deliberate: the orchestrator guarantees equal compute budgets and consistent evaluation, which the agent cannot do reliably on its own.

Each agent invocation is stateless (--no-session-persistence). Context from previous iterations is injected through program.md's rendered history table, not through session memory. This keeps each iteration reproducible and avoids context-window accumulation.

State is persisted as append-only JSONL (experiments/history.jsonl). On crash or restart, the orchestrator replays the log to reconstruct state. Each iteration also writes train.py.before, train.py.after, the diff, agent reasoning, and training stdout/stderr to experiments/iter_NNNN/ for post-hoc analysis.

---

THE METRIC EMISSION CONTRACT

train.py must print a JSON dict to stdout after each completed epoch:

    {"val_accuracy": 0.9033, "val_loss": 0.2814, "epoch": 5, "elapsed_secs": 142.3}

The orchestrator extracts the metric by reading the last valid JSON line from stdout that contains the configured metric key. This works even if training is interrupted -- the last complete epoch's metrics are always available.

This convention must be documented in program.md so the agent maintains it across modifications. In our experiment, we specified the required key (val_accuracy), the import path (from prepare import emit_metrics), and the expectation that metrics are emitted every epoch.
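The "last valid JSON line containing the configured key" rule can be sketched as follows. This is an illustrative reimplementation, not the actual orchestrator code; the function name extract_metric and its signature are made up here.

```python
import json


def extract_metric(stdout_text: str, key: str = "val_accuracy"):
    """Return the value of `key` from the last valid JSON line of stdout.

    Returns None if no line parses as a JSON object containing the key,
    e.g. when training crashed before the first epoch completed.
    """
    found = None
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # cheap filter before attempting a full parse
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial lines, e.g. output cut off by a kill
        if isinstance(record, dict) and key in record:
            found = record[key]  # keep overwriting: the last valid line wins
    return found
```

Because every complete epoch emits its own line, an interrupted run still yields the metrics of the last epoch that finished, which is exactly the interruption-tolerance property described above.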
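The append-only log and crash-replay behaviour can be sketched like this. The record fields (iteration, metric, accepted) are illustrative; the real history schema may differ.

```python
import json
from pathlib import Path


def append_iteration(path: Path, record: dict) -> None:
    # Append-only: one JSON object per line. A crash mid-write can at worst
    # corrupt the final line, never an earlier record.
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")


def replay_history(path: Path):
    """Reconstruct (best_metric, iteration_count) by replaying the log."""
    best, count = None, 0
    if not path.exists():
        return best, count
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        count = max(count, rec.get("iteration", count))
        metric = rec.get("metric")
        if rec.get("accepted") and metric is not None:
            if best is None or metric > best:
                best = metric
    return best, count
```

On restart the orchestrator would call replay_history once and resume from the reconstructed state, so no separate checkpoint file is needed.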
---

TIME BUDGET ENFORCEMENT

This was the hardest part to get right and went through three iterations:

Attempt 1 (broken): subprocess.Popen with proc.wait(timeout=N), then SIGTERM on timeout. Failed because PyTorch's C++ kernels defer SIGTERM until the current operation completes -- a backward pass can hold the signal for tens of seconds. Every experiment was killed at an unpredictable point mid-epoch.

Attempt 2 (better): Delegate to the OS timeout command, which sends SIGKILL (which cannot be deferred) after a grace period. Reliable termination, but still kills mid-epoch.

Attempt 3 (correct): Pass TRAINING_BUDGET_SECS as an environment variable. train.py tracks epoch durations and exits cleanly when there is not enough time for another full epoch. The timeout command fires at BUDGET + 30 seconds as a safety net only.

The self-terminating pattern in train.py:

    import os
    import time

    from prepare import emit_metrics

    BUDGET_SECS = int(os.environ.get("TRAINING_BUDGET_SECS", "300"))
    start = time.time()
    epoch_durations = []
    for epoch in range(1, MAX_EPOCHS + 1):  # MAX_EPOCHS defined elsewhere in train.py
        if epoch_durations:
            avg = sum(epoch_durations) / len(epoch_durations)
            # Stop early if an average-length epoch would overrun the budget.
            if time.time() - start + avg > BUDGET_SECS:
                break
        t0 = time.time()
        # ... train one epoch, producing a `metrics` dict ...
        epoch_durations.append(time.time() - t0)
        emit_metrics(metrics)

This pattern must be documented in program.md as a required contract, not an experiment. The agent is told to maintain it in every version of train.py it writes. In our runs, the agent adopted it after seeing it in program.md and maintained it through all subsequent iterations.

The reason this matters: if the metric is "best accuracy achievable in N minutes," then every experiment must receive exactly N minutes of effective training. Mid-epoch kills introduce measurement noise unrelated to architecture quality. In our first run, the same change was discarded and then accepted on retry because the kill landed at different points in the epoch cycle.

---

WHAT WE DID NOT BUILD

The outer loop. program.md's strategy section was static throughout both runs.
A proper implementation would revise the research strategy based on accumulated results -- recognising that certain classes of changes consistently fail under the current budget, or that the improvement curve has flattened and a different search direction is needed.

Multi-agent coordination. Each iteration was a single agent making a single change. The original autoresearch concept suggests that multiple agents could explore different branches of the search space in parallel, with git as the coordination mechanism.

Generality. This harness is wired to a specific metric type (a higher-is-better scalar extracted from JSON stdout). Different experiments might need different evaluation protocols -- test pass rates, benchmark throughput, compilation time. The orchestrator would need to be parameterised or rewritten for each.

---

OPEN QUESTIONS FOR FUTURE EXPERIMENTS

How should the keep/discard threshold work? We used strict improvement (new > best). Should there be a significance threshold to avoid keeping noise? Should there be a mechanism for accepting lateral moves that enable future improvements?

How should the agent handle failed experiments? In our run, one iteration crashed with a type error. The orchestrator discarded it and moved on. Should the agent see the error message and learn from it? Currently it only sees "METRIC_PARSE_FAILED" in the history.

What is the right balance between agent autonomy and orchestrator control? We blocked Bash entirely. An agent with Bash access could run quick sanity checks, inspect model sizes, or profile epoch times before committing to a full training run. But it could also introduce inconsistency in the evaluation protocol.

How does budget selection interact with the search space? Our 2-minute budget constrained the agent to quick-converging architectures. Our 5-minute budget opened up deeper networks and attention mechanisms.
The budget shapes which region of architecture space is even reachable -- it is a hyperparameter of the research process itself.
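To make the first open question concrete, here is one possible shape for a keep rule with a minimum-improvement margin. This is a speculative sketch, not something the harness implements; should_keep and abs_margin are invented names, and a principled version would estimate the margin from the metric's run-to-run variance (e.g. repeated baseline runs) rather than hard-coding it.

```python
def should_keep(new, best, abs_margin=0.002):
    """Keep a change only if it beats the best metric by a minimum margin.

    Plain strict improvement (new > best) is the special case abs_margin=0;
    a positive margin trades away tiny real gains to avoid keeping noise.
    """
    if best is None:
        return True  # first successful run always becomes the baseline
    return new > best + abs_margin
```

This still does not answer the lateral-move question: any fixed threshold rejects changes that keep the metric flat but open up a better region of the search space.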