# Autoresearch Run 11: Postmortem Analysis on Streak Trigger

Run 11 tested a postmortem mechanism: every time the discard streak hits 3, a separate agent call analyzes the recent failures and produces a diagnosis that is injected into subsequent prompts. This fires both at escalation boundaries (when the ladder moves up) and at the ceiling (opus/high, every 3 consecutive discards). Each postmortem sees all prior postmortems, building a cumulative analysis.

This was designed to address run 10's failure, where per-iteration self-critique predictions added cost without improving outcomes because the agent never closed the calibration feedback loop. The postmortem takes a different approach: instead of asking the agent to predict before each change, we ask a separate agent to diagnose *after* a series of failures, and inject that diagnosis into the coding agent's context.

## Setup

Same ladder as runs 9-10:

- 3-step one-way ratchet: haiku/high -> sonnet/high -> opus/high
- 3 consecutive discards to escalate; streak resets on keep
- 20 iterations, 300s training budget, no agent timeout
- Baseline: single linear layer (~82%)
- Self-critique predictions reverted (back to run 9's prompt)

New: a postmortem fires every 3 consecutive discards, using the current tier's model. Its output persists as a "Failure Analysis" section in subsequent agent prompts.
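The controller logic described above is small enough to sketch. This is a minimal illustration, assuming details the writeup implies but doesn't show: the names (`Controller`, `record`, `MODEL_LADDER`) are hypothetical, the streak is assumed to reset on escalation (consistent with the results table, where iter 5 shows streak 1), and the placeholder string stands in for the real postmortem agent call.

```python
# Hypothetical sketch of the run's streak/escalation bookkeeping --
# illustrative names, not the actual harness code.

MODEL_LADDER = ["haiku/high", "sonnet/high", "opus/high"]
STREAK_LIMIT = 3  # consecutive discards before a postmortem fires

class Controller:
    def __init__(self):
        self.step = 0          # index into MODEL_LADDER (one-way ratchet)
        self.streak = 0        # consecutive discards
        self.postmortems = []  # cumulative "Failure Analysis" sections

    @property
    def model(self) -> str:
        return MODEL_LADDER[self.step]

    def record(self, kept: bool) -> bool:
        """Update state after one iteration; return True if a postmortem fires."""
        if kept:
            self.streak = 0
            return False
        self.streak += 1
        if self.streak % STREAK_LIMIT == 0:
            # Postmortem runs on the current tier's model; its output is
            # appended and injected into all subsequent coding prompts.
            self.postmortems.append(f"diagnosis #{len(self.postmortems) + 1}")
            if self.step < len(MODEL_LADDER) - 1:
                self.step += 1   # escalate (never de-escalates)
                self.streak = 0  # assumed: streak restarts at the new tier
            # At the ceiling the streak keeps counting (3, 6, ...), so a
            # postmortem fires on every 3rd consecutive discard.
            return True
        return False
```

Note the `%` check at the ceiling: it reproduces the table's behavior where postmortem #3 fires at streak 3 and the run ends with the streak at 6.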
## Results

| Iter | Step | Model | Action | val_accuracy | Streak | Notes |
|------|------|-------|--------|--------------|--------|-------|
| 0 | 0 | haiku/high | keep | 0.8953 | 0 | |
| 1 | 0 | haiku/high | discard | 0.8903 | 1 | |
| 2 | 0 | haiku/high | failed | --- | 2 | metric crash |
| 3 | 0 | haiku/high | discard | 0.8866 | 3 -> escalate | |
| --- | --- | --- | **postmortem #1** | --- | --- | sonnet, $0.09 |
| 4 | 1 | sonnet/high | keep | 0.8968 | 0 | followed advice |
| 5 | 1 | sonnet/high | discard | 0.8956 | 1 | |
| 6 | 1 | sonnet/high | keep | 0.9239 | 0 | CNN breakthrough |
| 7-9 | 1 | sonnet/high | discard x3 | 0.875-0.922 | 3 -> escalate | |
| --- | --- | --- | **postmortem #2** | --- | --- | opus, $0.14 |
| 10 | 2 | opus/high | keep | 0.9255 | 0 | followed advice |
| 11 | 2 | opus/high | keep | 0.9271 | 0 | |
| 12 | 2 | opus/high | discard | 0.9177 | 1 | |
| 13 | 2 | opus/high | keep | 0.9301 | 0 | best |
| 14-16 | 2 | opus/high | discard x3 | 0.927-0.930 | 3 | |
| --- | --- | --- | **postmortem #3** | --- | --- | opus, $0.09 |
| 17-19 | 2 | opus/high | discard x3 | 0.918-0.927 | 6 | |

**Best: 93.01% at iteration 13. Total cost: $4.61 (agents $4.30 + postmortems $0.32).**

## The Postmortems

Three postmortems fired: after iter 3 (haiku -> sonnet escalation), after iter 9 (sonnet -> opus escalation), and after iter 16 (opus ceiling, streak hit 3).

**Postmortem #1 (after the haiku failures):** Diagnosed that all three haiku failures modified optimizer dynamics without addressing the real bottleneck. Identified that weight decay combined with a high LR causes late-epoch divergence. Recommended adding StepLR to the existing SGD setup as the minimal, targeted fix.

**Postmortem #2 (after the sonnet failures):** Diagnosed the epoch budget as the core bottleneck --- CNNs at batch_size=32 yield only 11 epochs in 300s. Identified that momentum=0.9 with lr=0.1 creates a catastrophically large effective step size.
Recommended CosineAnnealingLR with a lower initial LR, and explicitly warned against any change that increases per-epoch cost.

**Postmortem #3 (after the opus plateau):** Diagnosed that all three failures added regularization (normalization, dropout, label smoothing) to a model that was *underfitting*, not overfitting. Identified that the 2-layer CNN hits a representational capacity ceiling around 93%. Recommended wider channels or a 3rd conv layer without GAP (correcting a prior mistake), and explicitly warned against any pure-regularization additions.

## The Agent Followed the Advice

This is the most notable finding. After each postmortem, the agent's next iteration directly cited and followed the recommendations:

- **Iter 4** (after postmortem #1): Added StepLR exactly as recommended --- "the single most targeted fix for the diagnosed problem"
- **Iter 10** (after postmortem #2): Switched to CosineAnnealingLR --- "the #1 recommended intervention from the automated postmortem"
- **Iter 17** (after postmortem #3): Added a 3rd conv layer without GAP --- explicitly avoiding the GAP mistake the postmortem flagged

The postmortem mechanism works as a communication channel: the analyst agent identifies failure patterns and prescribes concrete actions, and the coding agent follows them. This is qualitatively different from run 10's self-critique, where the agent wrote predictions but never adjusted based on their accuracy.

## Comparison

| | Run 9 (control) | Run 10 (predictions) | Run 11 (postmortem) |
|---|---|---|---|
| Best val_acc | 93.79% | 92.97% | 93.01% |
| Cost | $3.93 | $6.08 | $4.61 |
| Keeps | 9/20 | 4/20 | 5/20 |
| Plateau onset | iter 13 (still improving) | iter 10 | iter 13 |
| Context enrichment cost | $0 | ~$2.15 extra tokens | $0.32 (3 postmortems) |

Run 9 still has the best accuracy. The postmortem didn't hurt performance the way predictions did in run 10, and the postmortem cost was modest ($0.32 for 3 calls vs. the ~$2 extra spent on prediction tokens in run 10). But it didn't break through the plateau either.

## Findings

**Postmortem analysis produces high-quality diagnoses.** All three postmortems correctly identified failure patterns, cited specific numbers and curves, distinguished wrong strategy from wrong execution, and gave actionable recommendations. The third postmortem's insight --- that regularization was pointless on an underfitting model --- was particularly sharp.

**The coding agent follows postmortem advice.** This is a meaningful result. The mechanism creates a working analyst-to-coder feedback loop within a single run, without cross-run knowledge contamination. The coding agent explicitly referenced the postmortem in its explanations and implemented the recommended changes.

**The plateau persists despite correct diagnosis.** Postmortem #3 correctly identified the bottleneck (a representational capacity ceiling at ~93%) and recommended wider channels and a 3rd conv layer. The agent followed this advice in iter 17 --- and it still didn't improve. The diagnosis was right, but the solution space at 93%+ accuracy within a 9-epoch CPU budget is genuinely narrow. The postmortem can't conjure improvements that don't exist in the search space.

**Postmortems are cheap and non-harmful.** At $0.09-0.14 per postmortem (1 turn each), the mechanism adds ~7% to total cost. Unlike per-iteration predictions ($0.10/iter extra, a 55% cost increase in run 10), postmortems fire only when the streak hits 3 --- at most every 3 iterations. The mechanism is worth keeping even if it doesn't consistently improve accuracy, because the diagnostic output is valuable for understanding *why* the agent is stuck.

**Context enrichment is not all equal.** Runs 3 -> 5 showed that adding training curves and full explanations (objective data) improved results. Run 10 showed that adding predictions (subjective agent uncertainty) hurt.
Run 11 shows that adding postmortems (structured diagnosis of failures) is neutral-to-positive. The pattern: context that helps the agent *see facts it couldn't see before* is valuable. Context that asks the agent to *introspect about its own reasoning* is not.
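As an aside, the two scheduler fixes the postmortems prescribed (StepLR in #1, CosineAnnealingLR in #2) have simple closed forms. A plain-Python sketch, mirroring the formulas behind PyTorch's `StepLR` and `CosineAnnealingLR`; the parameter values are illustrative, not the run's actual hyperparameters:

```python
import math

def step_lr(base_lr: float, gamma: float, step_size: int, epoch: int) -> float:
    """StepLR: multiply the LR by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def cosine_annealing_lr(base_lr: float, t_max: int, epoch: int,
                        eta_min: float = 0.0) -> float:
    """CosineAnnealingLR: decay from base_lr to eta_min along a half cosine."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))
```

Under a tight epoch budget like the ~11 epochs postmortem #2 estimated, the cosine schedule sweeps the full LR range smoothly, whereas a step schedule can burn most of the budget at one (possibly too-high) LR --- consistent with the postmortems' shift from StepLR to cosine annealing.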