# Designing an Autoresearch Harness: What Matters When LLMs Iteratively Improve Code

Across 15 controlled experiments on an LLM autoresearch harness, context quality was associated with the largest accuracy improvements: giving the agent complete explanations and training curves raised Fashion-MNIST validation accuracy from 90.7% to 93.9%. We built an orchestrator that runs an LLM agent in a modify-train-evaluate loop, starting from a single linear layer (82% accuracy) and allowing 20 iterations per run. Escalation ladders reduced cost with minimal accuracy loss. Structured chain-of-thought prompting achieved the highest keep rate (45%) among non-web runs, though keep rate was not a reliable predictor of final accuracy. A plateau at iterations 10-14 was common across configurations and was not eliminated by any harness change. This is an exploratory study — our findings are hypothesis-generating, not confirmatory.

#autoresearch #llm-agents #harness-design #context-engineering #fashion-mnist

## Introduction

The idea of an LLM agent iteratively modifying, training, and evaluating code in a loop had been circulating as rumor and aspiration before Karpathy (2026) released autoresearch — an open-source implementation of a pattern already used internally at major AI labs. His project put it in simple terms: give an agent a training script, let it make changes, run training, measure the result, and keep or discard based on whether the metric improved. Three files carry all the weight: fixed evaluation infrastructure, a mutable training script, and a natural language research program.

The question we investigate is not "how well can an LLM optimize Fashion-MNIST?" but rather "what properties of the orchestration harness make an LLM agent more effective at iterative code improvement?" Fashion-MNIST is our test bench, not our goal. Validation accuracy is the measuring stick for evaluating harness changes, not an end in itself.
Task-specific strategies (hand-tuned architectures, domain knowledge in the prompt) have their own value, but our aim was to find improvements we expect would transfer to problems where the right strategies are unknown — though this transferability remains untested.

This project started with tinkering: playing with one variable, observing what changed, then trying the next thing the results suggested. Over 15 runs, this accumulated into something structured enough to share as an exploratory empirical study generating hypotheses about which harness design variables matter most. We tested 15 harness configurations, targeting one variable at a time: training budget enforcement, context richness, escalation ladder design, agent reasoning structure, and tool access. Each run started from the same weak baseline with no knowledge carried over from previous experiments. The result is a set of controlled comparisons that isolate what the harness contributes versus what the agent discovers on its own.

## Related Work

**A note on coverage:** We did not conduct a thorough literature review before beginning experiments. Concurrent work on harness optimization — including Meta-Harness (Lee et al., 2026), which automates harness search with an outer loop; Natural-Language Agent Harnesses (Pan et al., 2026), which formalizes harness design as a scientific object; and ARTEMIS (Brookes et al., 2025), which evolves agent configurations — was published during or after our experimental period. Related work in AutoML, neural architecture search, population-based program search (FunSearch), and end-to-end scientific discovery systems (The AI Scientist) is also relevant but not surveyed here. This gap is a limitation we accept rather than retroactively patch. What follows is the narrower set of references that actually informed our work.

Karpathy's autoresearch (2026) demonstrated the basic modify-train-evaluate loop and inspired a wave of implementations.
The same pattern has been applied across domains. Shopify used it to achieve 53% faster parse and render on Liquid, a 20-year-old Ruby template engine, through structural code transformations no hyperparameter optimizer could propose. Anthropic applied it to build a Boltzmann solver for the cosmic microwave background, reaching sub-percent agreement with canonical physics codes in days. Carlini et al. (2026) used Claude as an autonomous agent for iterative source code analysis, discovering zero-days in the Linux kernel and FreeBSD — a related but distinct methodology that shares the theme of LLM-driven iterative work but not the specific modify-train-evaluate harness structure. These cases differ in search space, fitness function, and time horizon — suggesting the harness design problem is general, not ML-specific.

The context engineering practices in this work — rendering structured prompts with historical data, controlling the agent's information diet — draw on emerging principles for building effective agent systems (Anthropic, 2025). The structured chain-of-thought prompting used in our later runs was informed by research on decomposing reasoning into modular cognitive operations for LLM agents (Ebouky, Bartezzaghi, & Rigotti, 2025). Cost-aware model selection through escalation ladders relates to cascading inference approaches, where cheap models handle easy cases and expensive models handle hard ones. Our contribution is testing specific ladder designs (reset-on-keep, bidirectional, one-way ratchet) in the autoresearch context, where "easy" and "hard" correspond to the early and plateau phases of optimization.

## Methods

This study is exploratory, not confirmatory. Each run's design was informed by the results of previous runs — we were mapping the harness design space, not testing pre-registered predictions. Readers should treat our causal interpretations as preliminary findings that warrant confirmatory testing with pre-registered hypotheses.
### Harness Architecture

The system separates three concerns across three files:

- **prepare.py** — fixed infrastructure: data loading, evaluation, metric emission. The agent cannot modify this file. It defines the measurement contract.
- **train.py** — the search space: model architecture, optimizer, hyperparameters, training loop. The agent modifies only this file each iteration.
- **program.md** — a natural language prompt template rendered with current experimental state and passed to the agent as its sole context.

An orchestrator (autoresearch.py) executes the outer loop:

1. Snapshot train.py before the agent runs
2. Render program.md with current metrics, experiment history, and training curves
3. Invoke Claude via the `claude --print` CLI to modify train.py
4. Run training with a 300-second CPU budget
5. Extract the last valid JSON metric line from stdout
6. Compare against best: keep (update baseline) or discard (revert train.py)
7. Log iteration data to append-only JSONL

The agent is stateless — no session persistence between iterations. All context arrives through the rendered prompt. This makes the agent's information diet fully controllable and observable.

### Experimental Protocol

Each run targets one harness variable against a fixed baseline:

- **Baseline model:** Single linear layer (784→10), SGD lr=0.1, batch size 32 (~82% accuracy)
- **Task:** Fashion-MNIST, 10-class grayscale image classification (60k train, 10k test)
- **Compute:** CPU only, 300-second training budget per iteration. The budget is part of the metric, not external to it — the agent is not searching for the best architecture in general, but for the best architecture that can be trained within 300 seconds on CPU.
- **Iterations:** 20 per run
- **Metric:** val_accuracy (higher is better)
- **Cost tracking:** Each Claude CLI invocation returns total_cost_usd in its JSON output. Costs for main agent, postmortem, and backtracking calls are tracked separately.
- **Isolation:** Every run from run 3 onward starts from the identical weak baseline. (Runs 1-2, during infrastructure development, used evolving baselines — see Limitations.) No cross-run knowledge transfer — no expert files, no hints from prior experiments. The agent faces the problem fresh each time, as it would on a novel task.

This isolation is not a methodological preference — it is definitional. If we told the agent what architectures worked in previous runs, we would be studying the agent's ability to follow instructions, not the harness's ability to support discovery. The web tools tested in runs 13-15 are a deliberate exception: general knowledge from the web is not cross-run experience. The distinction is between what the agent can look up and what it learned from our previous experiments.

### Confounds

The controlled single-variable design constrains interpretation even without pre-registration: when only one thing changes between runs, observed differences can only be attributed to that variable (or to stochastic variation — see Limitations). However, not every run changed exactly one thing. Run 2 bundled a longer training budget with a new termination mechanism. Run 5 combined context improvements with an environment bug fix. Run 8 changed the ladder design while also removing the agent timeout. Run 12 introduced both structured CoT and per-discard backtracking. In these cases, the individual contribution of each change cannot be separated.

### Harness Variables Tested

Harness improvements were cumulative. Each group of runs inherited all changes from prior groups:

- **Runs 1-5** used opus/high for every iteration with no escalation ladder, while infrastructure and context evolved.
- **Runs 6-9** inherited run 5's full context and tested ladder designs.
- **Runs 10-12** inherited full context and the ratchet ladder, and tested reasoning mechanisms.
- **Runs 13-15** inherited full context, ratchet ladder, and structured CoT with backtracking, and tested web tool access.

"One variable at a time" means each run added one variable to the accumulated harness configuration, not to a bare baseline. Cross-group comparisons (e.g., run 5 vs. run 9) reflect the combined effect of the accumulated configuration plus the new variable.

The 15 runs tested variables across four categories:

1. **Infrastructure** (runs 1-5): training budget duration, budget enforcement mechanism, context richness (truncated vs. full explanations, training curves, historical summaries)
2. **Escalation ladder design** (runs 6-9): number of tiers, escalation trigger, de-escalation policy, model effort levels
3. **Agent reasoning** (runs 10-12): self-critique predictions, postmortem analysis on failure streaks, structured chain-of-thought with per-discard backtracking
4. **Tool access** (runs 13-15): web search in the coding agent, web search in the postmortem agent, explicit tool-use nudging in the prompt

## Results

### Infrastructure Experiments (Runs 1-5)

| Run | Variable | Best val_acc | Cost | Keeps |
|-----|----------|--------------|------|-------|
| 1* | Baseline (2-min budget) | 90.68% | $3.54 | 6/20 |
| 2* | 5-min budget + clean termination | 91.22% | $4.13 | 7/20 |
| 3 | Training curves in context | 92.02% | $3.38 | 4/20 |
| 4 | Context expansion (infra bug) | 89.85% | $4.91 | 8/20 |
| 5 | Full context + infra fix | 93.95% | $5.34 | 8/20 |

\*Runs 1-2 used evolving baselines during infrastructure development (see Limitations).

Runs 1-5 all used Claude Opus with high effort for every iteration (no escalation ladder).

Run 1 used SIGTERM-based termination, which PyTorch's C++ kernels can defer during backward passes. This produced mid-epoch kills and noisy metric comparisons — the same input normalization change was discarded in iteration 1 and accepted in iteration 3, because the kill landed at different points in the epoch cycle.
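The clean alternative, which the next run adopts, is to let train.py watch its own clock and stop between epochs. The sketch below illustrates the idea under stated assumptions: TRAINING_BUDGET_SECS is the harness's real environment variable, but the function names and the stub evaluation are our own, not the actual train.py internals.

```python
import json
import os
import time

# Illustrative default mirroring the 300-second CPU budget.
BUDGET_SECS = float(os.environ.get("TRAINING_BUDGET_SECS", "300"))

def evaluate_model():
    """Stand-in for a real validation pass over Fashion-MNIST."""
    return 0.82

def train(num_epochs=5, budget_secs=BUDGET_SECS):
    start = time.time()
    val_accuracy = 0.0
    for epoch in range(num_epochs):
        # ... one full training epoch would run here ...
        val_accuracy = evaluate_model()
        # Emit one JSON metric line per epoch; the orchestrator
        # extracts the last valid one from stdout.
        print(json.dumps({"epoch": epoch, "val_accuracy": val_accuracy}))
        # Check the wall clock only between epochs, so the reported
        # metric always reflects a completed epoch -- no mid-backward
        # SIGTERM races, and metrics stay comparable across iterations.
        if time.time() - start > budget_secs:
            break
    return val_accuracy
```

Because the check runs only at epoch boundaries, wall time can overshoot the budget by up to one epoch, but the keep/discard comparison stays fair as long as every iteration pays the same overhead.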
Run 2 replaced this with graceful self-termination at epoch boundaries via a TRAINING_BUDGET_SECS environment variable. All 20 iterations exited cleanly.

Run 3 added epoch-by-epoch training curves to the agent's context. Run 5 added two further context changes: the agent's complete multi-sentence explanations in the history table (previously truncated to 60 characters) and compact one-line summaries for iterations older than the most recent 10. Run 4 regressed because a Python environment path bug caused venv mismatches that wasted iterations. The same context changes, with the bug fixed (run 5), produced the best accuracy across all 15 runs: 93.95%. (Run 5's cost of $5.34 reflects a mid-run orchestrator restart at iteration 13 that reset the cost counter — the reported total includes all 20 iterations.)

### Escalation Ladder Experiments (Runs 6-9)

| Run | Variable | Best val_acc | Cost | Keeps |
|-----|----------|--------------|------|-------|
| 6 | 10-step, reset to step 0 on keep | 92.75% | $1.77 | 6/20 |
| 7 | 4-step, bidirectional | 92.82% | $4.29 | 5/20 |
| 8 | 4-step, reset, no agent timeout | 93.26% | $8.30 | 4/20 |
| 9 | 3-step, one-way ratchet | 93.79% | $3.93 | 10/20 |

All ladder runs used the model tiers haiku/high, sonnet/high, and opus/high. Runs 7-8 added a fourth tier, opus/max, which timed out 33% of the time at the 300-second agent deadline, cost 3-7x more than opus/high, and produced no accuracy improvement. It was dropped from the ladder for run 9.

Run 6 reset the ladder to haiku on every keep. This wasted iterations during plateaus: the agent would escalate to sonnet, make a keep, drop back to haiku, fail to improve with haiku, and re-escalate — rediscovering the same limitation. Run 7's bidirectional de-escalation never triggered during plateaus because de-escalation required keeps, which are exactly what wasn't happening. The mechanism was useless when it was most needed.

Run 9's one-way ratchet required 3 consecutive discards to advance one tier.
A keep reset the discard counter but did not move the ladder down. The ladder never retreated. This exhausted each tier's potential before spending more, and avoided oscillation. Unlike every previous run, run 9 was still improving at exit — opus/high had 4 keeps in its last 7 iterations with a discard streak of 0.

### Agent Reasoning Experiments (Runs 10-12)

| Run | Variable | Best val_acc | Cost | Keeps |
|-----|----------|--------------|------|-------|
| 10 | Self-critique predictions | 92.97% | $6.08 | 4/20 |
| 11 | Postmortem on streak trigger | 93.01% | $4.30 | 6/20 |
| 12 | Structured CoT + backtracking | 93.41% | $5.63 | 9/20 |

**Run 10** asked the agent to predict keep/discard and an expected accuracy range before each change. Of 20 predictions, only 3 fell within the agent's predicted range. The agent was systematically overconfident and never adjusted its predictions despite repeated misses. The feedback loop did not close.

**Run 11** introduced a separate analyst agent call after every 3 consecutive discards. The analyst reviewed recent failures, training curves, and code diffs, then recommended specific next steps. The coding agent followed all three postmortem recommendations (StepLR → CosineAnnealingLR → third conv layer). The mechanism cost $0.09-0.14 per call and fired only at escalation boundaries.

**Run 12** restructured the agent's system prompt into four explicit phases: UNDERSTAND the current state, RECALL experiment history, PROPOSE and IMPLEMENT one change, and EXAMINE the change for bugs and flawed assumptions. After each discarded iteration, a separate read-only backtracking call diagnosed where the reasoning went wrong, and its output fed into the next iteration's context. This run produced 9 keeps out of 20 (45%) and kept sonnet productive for 14 consecutive iterations without escalating to opus. Like run 9, the run was still improving at exit — keeps at iterations 16 and 17 pushed accuracy from 92.80% to 93.41%.
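The one-way ratchet that runs 9-12 share reduces to a small piece of state. A minimal sketch of that design — tier names are those in the text, while the class and method names are our own, not the harness's actual code:

```python
class RatchetLadder:
    """One-way escalation ladder in the style of run 9: N consecutive
    discards advance one tier, a keep resets the counter, and the
    ladder never moves back down. Illustrative, not the harness code."""

    def __init__(self, tiers=("haiku/high", "sonnet/high", "opus/high"),
                 discards_to_escalate=3):
        self.tiers = list(tiers)
        self.n = discards_to_escalate
        self.tier_idx = 0
        self.discard_streak = 0

    @property
    def current_tier(self):
        return self.tiers[self.tier_idx]

    def record(self, kept):
        """Update ladder state after an iteration's keep/discard verdict."""
        if kept:
            self.discard_streak = 0          # keep: reset counter, stay put
        elif self.tier_idx < len(self.tiers) - 1:
            self.discard_streak += 1
            if self.discard_streak >= self.n:
                self.tier_idx += 1           # ratchet up one tier...
                self.discard_streak = 0      # ...and count afresh
        return self.current_tier
```

The two failure modes of the alternatives correspond to code this sketch deliberately omits: reset-on-keep would set `tier_idx = 0` in the keep branch, and bidirectional de-escalation would decrement `tier_idx` on keeps.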
### Web Access Experiments (Runs 13-15)

| Run | Variable | Best val_acc | Cost | Keeps |
|-----|----------|--------------|------|-------|
| 13 | Web tools in main agent | 91.54% | $2.29 | 12/20 |
| 14 | Web in postmortem (no nudge) | 92.86% | $4.62 | 10/20 |
| 15 | Web in postmortem (nudged) | 93.17% | $4.69 | 9/20 |

**Run 13** gave the main coding agent access to WebSearch and WebFetch. The haiku-tier agent found enough safe incremental optimizations via web search — learning rate adjustments, batch normalization, scheduler tuning — that it never accumulated 3 consecutive discards. It never escalated, remaining in haiku for all 20 iterations, and stayed in MLP territory for 18 of 20 iterations — only building a CNN at the final iteration — while every other successful run made the CNN leap by iteration 4-6. The keep rate was the highest of any run (60%), but accuracy was 1.87 percentage points below the run 12 control (93.41%).

**Run 14** moved web tools to the postmortem agent only. The postmortem agent never used them despite having access. **Run 15** added an explicit instruction: "Use WebSearch and WebFetch to research relevant techniques." The postmortem conducted genuine research, citing Fashion-MNIST benchmarks and optimization papers. However, the postmortem prompt lacked the experimental constraints (CPU-only, 300-second budget), causing it to recommend GPU-oriented advice (batch size 128) that the CPU setup couldn't leverage.

## Discussion

### Context quality appears to be the strongest lever

The largest accuracy improvement across all 15 runs was associated with enriching the agent's context (runs 3-5), not with changing the model, the ladder, or the reasoning structure. However, this finding is confounded: run 5 bundled context improvements with an infrastructure bug fix (see Confounds), and the 4.1 percentage point gap between run 4 (buggy, 89.85%) and run 5 (fixed, 93.95%) may be partly attributable to the bug fix rather than context enrichment alone.
With that caveat, the mechanism is plausible: complete explanations and training curves let the agent reason about why previous changes succeeded or failed — a form of mechanistic understanding rather than outcome-only observation. In run 5, the agent progressively removed dropout across iterations 4, 13, and 14, recognizing that with only ~5 epochs fitting in the budget, convergence speed matters more than regularization. This multi-step reasoning was not possible with truncated context.

Not all context enrichment helps. We can classify the types of context we tested:

| Category | Examples | Effect |
|----------|----------|--------|
| Objective data | Training curves, full explanations, iteration history | Consistently positive |
| Structured diagnosis | Postmortem analysis, backtracking analysis | Neutral to positive |
| Subjective introspection | Self-critique predictions | Negative |
| External knowledge (coding agent) | Web search in the coding loop | Negative |
| External knowledge (postmortem) | Web search with explicit nudging | Neutral to positive |

The pattern: context that helps the agent reason about what happened is valuable. Context that asks the agent to predict what will happen, or that provides escape routes from bold decisions, is harmful.

### Escalation ladders are a cost mechanism, not a quality mechanism

No ladder design exceeded the accuracy of running opus/high for all 20 iterations (run 5, 93.95%). The best ladder run (run 9, 93.79%) came within 0.16 percentage points at approximately 25% lower cost ($3.93 vs. $5.34, though run 5's cost is inflated by a mid-run restart). The 0.16pp difference is within plausible stochastic variation for N=1 and should not be treated as a definitive ranking. The ladder's value is economic: it spends cheap tokens on problems the cheap model can solve, reserving expensive tokens for harder problems.

The one-way ratchet emerged as the best design through elimination. Reset-on-keep (run 6) caused oscillation.
Bidirectional de-escalation (run 7) never activated during plateaus. The ratchet avoids both failure modes by exhausting each tier before advancing and never retreating.

### Structured reasoning improves agent behavior but not the ceiling

Run 12's structured CoT + backtracking produced the best keep rate among non-web runs (45%) and the most sustained productivity from a single model tier (sonnet for 14 iterations). Even haiku benefited — 2 keeps in run 12 versus 1 in runs 9 and 11, suggesting the explicit reasoning phases help weaker models make better use of their limited capacity. The agent made more changes that improved the metric, more consistently. But the accuracy ceiling (93.41%) did not exceed runs with less structured reasoning (93.95% in run 5, 93.79% in run 9). The agent reached the plateau with better efficiency but still hit it.

Similarly, postmortem analysis (run 11) was behaviorally effective — the coding agent followed 3/3 recommendations — but the recommendations did not push accuracy past the ceiling. The fact that both ladders and structured reasoning improve efficiency without raising the accuracy ceiling suggests the ceiling is not caused by reasoning failures or model selection — see the plateau discussion below.

### Web access creates a conservatism trap

Run 13's paradox — highest keep rate, lowest accuracy — suggests that web access in the coding agent enables a conservative strategy that defeats the escalation mechanism. The haiku-tier agent, armed with web search, found an endless supply of safe incremental tweaks. It never failed badly enough to escalate, trapping itself in a low-capability tier. An alternative explanation is that haiku's limited capability, not web access, is the primary constraint — we cannot fully disentangle the two without testing web access on a stronger model tier.
The ladder's 3-consecutive-discards trigger conflates "can't improve" with "can only improve a little" — a model making 0.1 percentage point gains will never escalate, even when a stronger model could make 2 percentage point gains. Keep rate is not a proxy for quality.

Moving web access to the postmortem agent (runs 14-15) avoided this trap. But two implementation details mattered: the agent needed explicit prompting to use the tools (run 14 vs. 15), and the prompt needed the same experimental constraints as the main agent to avoid recommending infeasible approaches.

### The plateau problem

Most runs hit a performance ceiling between iterations 10 and 14. The agent independently discovers CNN architectures with batch normalization, cosine annealing, and appropriate regularization within the first 5-8 iterations — a remarkably consistent convergence across runs that likely reflects the LLM's training data priors about Fashion-MNIST optimization. After that, improvements become marginal and intermittent.

The plateau is likely a combination of three factors. First, a **budget-constrained ceiling**: external Fashion-MNIST benchmarks show that simple CNNs without augmentation routinely achieve 93-94% accuracy, while reaching higher requires more training time and compute. Our 300-second CPU budget places the ceiling right at this boundary — state-of-the-art on Fashion-MNIST exceeds 96% with standard CNNs and 99%+ with advanced approaches, so the plateau is not an absolute limit but a property of the task-budget pair. Second, **greedy hill-climbing limitations**: the strict keep/discard rule is greedy hill-climbing, a search strategy known to stall on rugged fitness landscapes, where escaping local optima requires accepting temporary regressions. Third, the **20-iteration limit**: runs 9 and 12 were still improving at exit, suggesting the "plateau" for those runs may be an artifact of stopping rather than a true ceiling.
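The second factor is easy to state precisely: the harness keeps only strict improvements, whereas an annealing-style rule would occasionally accept a regression. A sketch of the contrast, as our own formulation of an untested idea rather than an implemented harness feature:

```python
import math
import random

def accept_greedy(new_acc, best_acc):
    """The harness's strict rule: keep only improvements."""
    return new_acc > best_acc

def accept_annealed(new_acc, best_acc, temperature, rng=random.random):
    """Simulated-annealing-style acceptance (untested in our runs):
    always keep improvements, and keep a regression with probability
    exp(-(best - new) / temperature), so the search can step downhill
    to escape a local optimum. Our own sketch, not harness code."""
    if new_acc > best_acc:
        return True
    if temperature <= 0:
        return False
    return rng() < math.exp((new_acc - best_acc) / temperature)
```

With a decaying temperature schedule, the rule converges back to the greedy behavior in late iterations while still permitting exploratory regressions early on.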
We tested whether the plateau is caused by poor strategy selection (postmortem, run 11), insufficient reasoning (structured CoT, run 12), lack of external knowledge (web access, runs 13-15), or model capability (opus vs. sonnet). None of these eliminated it — and none of them addresses the budget constraint or the greedy search mechanism. The plateau is the most important unsolved problem we encountered. Untested approaches include longer runs (30+ iterations), branching search trees that explore multiple directions in parallel, simulated-annealing-style acceptance of temporary regressions to escape local optima, and larger training budgets to test whether the ceiling shifts.

## Limitations

**Single task.** All 15 runs used Fashion-MNIST. The harness changes are designed to be domain-agnostic, but this claim is untested. Cross-domain validation is the most important next step.

**Single agent family.** All runs used Claude (Haiku, Sonnet, Opus). Behavior may differ with other LLM families.

**No statistical replication.** Each configuration was run once. Variance across identical configurations is unknown. Any single run may be unrepresentative. The paper's own evidence suggests substantial variance: run 4 and run 5 differ by 4.1 percentage points due to an environment bug, indicating that accuracy differences below 1-2pp across runs should be treated as indistinguishable.

**20-iteration ceiling.** Enough to observe the plateau but not enough to test strategies designed to break through it.

**CPU-only constraint.** GPU access would shift both the baseline and the ceiling, potentially changing which harness variables matter most.

**Baseline stability.** Verification against historical checkpoints confirmed that runs 3-15 all started from the identical baseline model (single linear layer, 784→10). Run 1 used the same model architecture but a different training contract (SIGTERM-based termination vs. self-termination).
Run 2 started from run 1's trained CNN model rather than the weak baseline — an early infrastructure oversight corrected from run 3 onward. The harness infrastructure (orchestrator, environment, dependencies) evolved between runs; run 4's regression from an environment path bug demonstrates that the experimental template was not perfectly stable.

**Post-hoc analysis.** Hypotheses were not pre-registered. Each run's design was chosen after observing previous results, and interpretations were constructed after seeing outcomes. The conversation logs from each run provide a partial real-time record of decision-making, but do not constitute pre-registration. Our findings are hypothesis-generating — they identify which harness variables appear to matter — not hypothesis-confirming.

**Cumulative design.** Harness changes were cumulative across run groups: later runs inherit all prior improvements. This means individual variable attribution is impossible without a factorial design — run 12's 93.41% reflects the combined effect of full context, ratchet ladder, and structured CoT, not any single variable in isolation.

**Operator learning.** As runs progressed, we became better at writing prompts and designing experiments. Improvements across runs may partly reflect operator learning, not just harness variable effects. The isolation protocol controls for cross-run agent knowledge but not for cross-run human knowledge.

**Self-referential.** The harness and experiments were developed by the same team. We chose which variables to test, in which order, and defined the baseline. Several experimental ideas — including run 10's self-critique predictions — were generated collaboratively with Claude Code, the tool under study. This creates potential selection bias toward configurations that Claude models would find effective, in addition to confirmation bias in interpretation.

**Model version drift.** The 15 runs were conducted over several days.
Claude model weights may have been updated during this period without version bumps, introducing an uncontrollable confound.

**Incomplete literature review.** We did not survey related work before beginning experiments (see Related Work). Concurrent publications studying harness optimization (Meta-Harness, NLAH, ARTEMIS) and related areas (AutoML, NAS, FunSearch, The AI Scientist) are not addressed. Our findings may replicate, contradict, or be superseded by this work.

## Conclusion

Three findings emerged from 15 controlled experiments on an autoresearch harness. First, context quality — giving the agent complete explanations, training curves, and historical summaries — was associated with the largest accuracy improvements, though this finding is confounded by concurrent infrastructure changes. Second, escalation ladders reduced cost without meaningfully affecting the accuracy ceiling; the one-way ratchet design avoided the oscillation and dead-mechanism failures of the alternatives. Third, web access in the coding agent created a conservatism trap, enabling a weak model to make endless safe micro-optimizations that prevented escalation to a stronger tier.

These are hypotheses, not confirmed findings. Each rests on a single unreplicated run, the experimental design was post-hoc, and our literature review was incomplete. The most important next step is cross-domain validation: testing whether these harness design principles hold on tasks where the right strategies are genuinely unknown.

## Verification

The harness consists of four files: autoresearch.py (orchestrator), program.md (agent prompt template), prepare.py (fixed evaluation infrastructure), and train.py (agent-modified search space). Per-iteration logs (JSONL), agent outputs, diffs, training curves, postmortem archives, and backtracking records are preserved for all 15 runs.
All quantitative claims (accuracy, cost, keep counts) were verified against raw JSONL logs and orchestrator logs recovered from historical checkpoints. This verification identified and corrected errors in earlier drafts:

- Run 5 cost: $2.10 corrected to $5.34 (orchestrator restart reset the cost counter; the JSONL preserves all iteration costs)
- Run 11 cost: $4.61 corrected to $4.30 (postmortem costs double-counted)
- Run 11 keeps: 5/20 corrected to 6/20 (failed iteration miscounted)
- Run 12 cost: $6.41 corrected to $5.63 (postmortem and backtracking costs double-counted)
- Run 12 keeps: 10/20 corrected to 9/20
- Run 10 predictions: "1 of 18 parseable" corrected to "3 of 20"
- Run 13 CNN claim: "never built a CNN" corrected to "built CNN at final iteration"
- Run 2 baseline: documented that run 2 started from run 1's trained model, not the weak baseline

All 15 accuracy figures were confirmed to match.

To reproduce: reset train.py to the baseline linear model, run `python autoresearch.py --escalate --max-iterations 20 --training-budget 300`, and compare the resulting val_accuracy trajectory. Results will vary due to LLM stochasticity and hardware differences (the 300-second budget makes results CPU-dependent), but the qualitative patterns — rapid early improvement, plateau around iterations 10-14, ladder escalation behavior — should be consistent.

Experiment notes from each run: https://texts-pt5.sprites.app/agents/vtBKygQMO_ld258AByHOyUCA2Q7eATqKUDxLG_5BnZw/page

## References

- Karpathy, A. (2026). autoresearch. https://github.com/karpathy/autoresearch
- Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. https://github.com/zalandoresearch/fashion-mnist
- Anthropic. (2025). Claude Code documentation. https://docs.anthropic.com/en/docs/claude-code
- Anthropic. (2025). Effective context engineering for AI agents.
  https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- Ebouky, M., Bartezzaghi, A., & Rigotti, M. (2025). Eliciting Reasoning in Language Models with Cognitive Tools. arXiv:2506.12115v2
- Anthropic. (2026). Long-running Claude and the Boltzmann solver. https://www.anthropic.com/research/long-running-Claude
- Carlini, N., et al. (2026). Evaluating and mitigating the growing risk of LLM-discovered 0-days. https://red.anthropic.com/2026/zero-days/
- Shopify. (2026). Liquid performance optimization via pi-autoresearch. https://github.com/Shopify/liquid/pull/2056