# Autoresearch Run 12: Cognitive Tools (Structured CoT + Backtracking)

Run 12 restructured the agent's reasoning process using ideas from the Cognitive Tools paper (arxiv 2506.12115v2). Instead of letting the agent reason freely, we prompt it through four explicit phases --- understand, recall, propose, examine --- in a single call. After each discarded iteration, a separate backtracking call analyzes where the reasoning went wrong. The postmortem mechanism from run 11 stays in place for macro-level diagnosis every 3 consecutive discards.

This produced the highest keep rate across all 12 runs (10/20) and kept sonnet productive for 14 straight iterations without ever needing to escalate to opus.

## Design

### Structured CoT (replaces the old "edit then explain" prompt)

The agent's system prompt and task instructions now guide it through four phases in a single response:

1. **UNDERSTAND**: Analyze the current state. What is the bottleneck? What does the training curve tell you?
2. **RECALL**: Review the experiment history. What similar changes were tried? What worked and what didn't?
3. **PROPOSE & IMPLEMENT**: Choose ONE focused change and apply it by editing train.py.
4. **EXAMINE**: Review the change for bugs, wrong assumptions, contract violations, or reasons it might not improve accuracy. Adjust if needed.

The "2-5 sentences" output constraint from previous runs was removed. The agent writes what it needs.

### Backtracking (new, fires on discards only)

After each discarded iteration, a separate read-only agent call receives the agent's original reasoning, the code diff, and the training results. It analyzes where the reasoning went wrong and suggests what to try differently. The output is injected into the next iteration's context.

Each backtrack sees the previous backtrack, creating a chain of post-hoc analysis that accumulates across iterations.

### Postmortems (unchanged from run 11)

Still fire every 3 consecutive discards, providing higher-level pattern diagnosis. Backtracking is per-iteration micro-analysis; postmortems are macro-analysis across multiple failures.

## Setup

- 3-step one-way ratchet ladder: haiku/high -> sonnet/high -> opus/high
- 3 consecutive discards to escalate, streak resets on keep
- 20 iterations, 300s training budget, no agent timeout
- Baseline: single linear layer (~82%)

## Results

| Iter | Step | Model | Action | val_accuracy | Streak | BT |
|------|------|-------|--------|-------------|--------|----|
| 0 | 0 | haiku/high | keep | 0.8863 | 0 | |
| 1 | 0 | haiku/high | discard | 0.1000 | 1 | yes |
| 2 | 0 | haiku/high | keep | 0.9003 | 0 | |
| 3 | 0 | haiku/high | discard | 0.8995 | 1 | yes |
| 4 | 0 | haiku/high | discard | 0.8911 | 2 | yes |
| 5 | 0 | haiku/high | discard | 0.8190 | 3 -> esc | yes |
| --- | | | postmortem | | | |
| 6 | 1 | sonnet/high | discard | 0.9001 | 1 | yes |
| 7 | 1 | sonnet/high | keep | 0.9009 | 0 | |
| 8 | 1 | sonnet/high | keep | 0.9201 | 0 | |
| 9 | 1 | sonnet/high | discard | 0.9195 | 1 | yes |
| 10 | 1 | sonnet/high | keep | 0.9235 | 0 | |
| 11 | 1 | sonnet/high | keep | 0.9240 | 0 | |
| 12 | 1 | sonnet/high | discard | 0.9195 | 1 | yes |
| 13 | 1 | sonnet/high | keep | 0.9280 | 0 | |
| 14 | 1 | sonnet/high | discard | 0.9249 | 1 | yes |
| 15 | 1 | sonnet/high | discard | 0.9219 | 2 | yes |
| 16 | 1 | sonnet/high | keep | 0.9298 | 0 | |
| 17 | 1 | sonnet/high | keep | 0.9341 | 0 | |
| 18 | 1 | sonnet/high | discard | 0.9298 | 1 | yes |
| 19 | 1 | sonnet/high | discard | 0.9270 | 2 | yes |

**Best: 93.41% at iteration 17. Total cost: $6.41. Still improving at exit. Never reached opus.**

## Cost Breakdown

| Component | Calls | Total cost |
|-----------|-------|-----------|
| Agent (haiku/high) | 6 | $0.39 |
| Agent (sonnet/high) | 14 | $4.46 |
| Backtrack calls | 11 | $0.68 |
| Postmortem calls | 1 | $0.10 |
| **Total** | | **$6.41** |

Agent cost per iteration: $0.24. Backtrack cost per discard: $0.06. Postmortem cost: $0.10.

## Key Behavioral Observations

### Haiku got two keeps

In runs 9 and 11, haiku produced exactly 1 keep before exhausting. Here it got 2, including one at 90.03% --- a stronger first-phase result than most runs. The structured CoT appears to help haiku make better use of its limited reasoning capacity by forcing it through an explicit analysis before acting.

### Sonnet never exhausted

This is the most striking result. In every previous run, sonnet exhausted within 5-9 iterations and the ladder escalated to opus. Here, sonnet produced 8 keeps in 14 iterations and never hit a streak of 3 consecutive discards. The keeps were distributed across the full sonnet phase --- not front-loaded --- with two late keeps at iterations 16-17 that pushed from 92.80% to 93.41%.

The backtracking mechanism appears to be the key factor. After each discard, the backtrack analysis identifies where the reasoning went wrong, and the next iteration's structured UNDERSTAND/RECALL phase picks up that feedback. This creates a tighter feedback loop than any previous mechanism: reason explicitly (CoT), fail, get diagnosis (backtrack), reason explicitly again incorporating the diagnosis (CoT).

### The feedback loop that previous runs lacked

Run 10 (predictions) asked the agent to introspect before acting --- it never adjusted its calibration. Run 11 (postmortems) gave the agent macro-level diagnosis every 3 failures --- the agent followed advice but still plateaued. Run 12 combines micro-level backtracking (every discard) with structured reasoning (every iteration), creating a loop where each failure immediately informs the next attempt.

## Comparison

| | Run 9 (control) | Run 10 (predictions) | Run 11 (postmortem) | Run 12 (CoT+BT) |
|---|---|---|---|---|
| Best val_acc | 93.79% | 92.97% | 93.01% | 93.41% |
| Cost | $3.93 | $6.08 | $4.61 | $6.41 |
| Keeps | 9/20 | 4/20 | 5/20 | **10/20** |
| Reached opus | iter 13 | iter 10 | iter 10 | **never** |
| Still improving | yes | no | no | **yes** |
| Cost per keep | $0.44 | $1.52 | $0.92 | $0.64 |

Run 9 still has the highest peak accuracy (93.79%), but with stochastic variation the 0.38pp difference is within noise. The behavioral metrics tell a clearer story: run 12 has the best keep rate (10/20), never needed opus, and was still improving at exit.

## Findings

**Structured CoT and backtracking create a working feedback loop.** Previous context enrichment attempts failed because they were either too passive (predictions the agent ignored) or too infrequent (postmortems every 3 discards). The combination of per-iteration structured reasoning and per-discard backtracking creates a tight cycle: reason, act, fail, diagnose, reason again with the diagnosis. This is the first mechanism that kept sonnet productive for 14 iterations.

**Explicit reasoning phases help weaker models.** Haiku's improvement from 1 keep (runs 9, 11) to 2 keeps suggests that structured CoT compensates for limited reasoning capacity. The UNDERSTAND and RECALL phases force the model to process the context systematically rather than jumping to a change.

**Backtracking is cheap and targeted.** At $0.06 per call, backtracks add $0.68 total (11% of agent cost). Each one is a single-turn read-only call that fires only on failures. This is far more cost-effective than per-iteration predictions (run 10: ~$2.15 extra) or the implicit cost of longer agent responses.

**The plateau resistance is the real result.** Whether 93.41% beats 93.79% is noise. What matters is that sonnet stayed productive for 14 iterations without exhausting, producing keeps at iterations 16 and 17 when every previous run had long since plateaued. The mechanism doesn't solve the plateau --- it delays it by helping the agent learn from each failure and try something genuinely different next time.