# Autoresearch Run 10: Agent Self-Critique (Pre-mortem Predictions)

Run 10 tested whether asking the agent to predict what could go wrong with each change --- and what val_accuracy it expects --- would improve decision-making during the plateau phase. The hypothesis: seeing its own predictions miss alongside actual results would create a calibration feedback loop, helping the agent avoid repeating patterns of failed reasoning.

The result was disappointing. The predictions did not improve outcomes. The agent wrote longer, more thoughtful responses, spent more per iteration, and performed worse than the control run.

## Setup

Same as run 9 (the control):

- 3-step one-way ratchet ladder: haiku/high -> sonnet/high -> opus/high
- 3 consecutive discards to escalate; streak resets on keep
- 20 iterations, 300s training budget, no agent timeout
- Baseline: single linear layer (~82%)

The only change: the system prompt and program.md now ask the agent to write a prediction starting with "Prediction:" after each code change --- what might go wrong and what val_accuracy it expects.

## Results

| Iter | Step | Model | Action | val_accuracy | Streak |
|------|------|-------|--------|--------------|--------|
| 0 | 0 | haiku/high | keep | 0.8912 | 0 |
| 1 | 0 | haiku/high | discard | 0.8744 | 1 |
| 2 | 0 | haiku/high | discard | 0.8904 | 2 |
| 3 | 0 | haiku/high | discard | 0.8854 | 3 -> escalate |
| 4 | 1 | sonnet/high | keep | 0.9203 | 0 |
| 5 | 1 | sonnet/high | keep | 0.9212 | 0 |
| 6 | 1 | sonnet/high | discard | 0.9181 | 1 |
| 7 | 1 | sonnet/high | discard | 0.9206 | 2 |
| 8 | 1 | sonnet/high | discard | 0.9200 | 3 -> escalate |
| 9 | 2 | opus/high | discard | 0.9210 | 1 |
| 10 | 2 | opus/high | keep | 0.9297 | 0 |
| 11-19 | 2 | opus/high | discard x9 | 0.920-0.929 | 1 -> 9 |

**Best: 92.97% at iteration 10. Total cost: $6.08.**

## Prediction Calibration

The agent was asked to predict a val_accuracy range for each change.
Out of 18 iterations where it made a parseable prediction, **only 1 was in range** (iter 4). The agent was systematically overconfident --- it consistently predicted higher accuracy than it achieved.

| Iter | Action | Actual | Predicted range | In range? |
|------|--------|--------|-----------------|-----------|
| 1 | discard | 0.8744 | 0.895-0.902 | no |
| 2 | discard | 0.8904 | 0.900-0.915 | no |
| 4 | keep | 0.9203 | 0.910-0.930 | yes |
| 5 | keep | 0.9212 | 0.925-0.935 | no |
| 6 | discard | 0.9181 | 0.924-0.932 | no |
| 10 | keep | 0.9297 | 0.922-0.928 | no |
| 13 | discard | 0.9195 | 0.928-0.936 | no |
| 15 | discard | 0.9239 | 0.930-0.936 | no |
| 19 | discard | 0.9225 | 0.928-0.934 | no |

(Selected iterations shown. Full set: 1/18 in range.)

The pattern was consistent from iter 11 onward: the agent predicted 0.928-0.935 while landing at 0.920-0.929. It never adjusted its calibration despite seeing its predictions miss repeatedly.

## Did the Agent Respond to Failed Predictions?

The agent referenced failed *experiments* but not failed *predictions*. Some examples:

**Iter 13:** "Iteration 6 tried horizontal flip alone (without random crop or OneCycleLR) and got only 0.918, which is concerning." --- Good, it learned from a prior failure.

**Iter 18:** "Unlike the geometric augmentation in iter 13 (random crop + flip, which scored 0.919 due to slow convergence), mixup works at the pixel-interpolation level." --- It's adapting strategy based on outcomes.

**Iter 19:** "Since all other changes (architecture, regularization, augmentation) consistently hurt performance, this pure schedule adjustment is the safest path." --- It recognized the pattern of failure.

But at no point did the agent say: "My last 8 predictions were all too optimistic, so I should lower my expectations." It used the prediction space to reason about risks (thoughtfully), but treated each prediction as independent. The calibration feedback loop never closed.
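The comparison the agent never made is mechanical to compute. A minimal sketch, using only the selected (predicted range, actual) pairs from the table above, shows how an in-range rate and a signed midpoint bias would have exposed the overconfidence:

```python
# Calibration check over logged (predicted range, actual) pairs.
# Data: the selected iterations from the table above.
predictions = {
    1:  ((0.895, 0.902), 0.8744),
    2:  ((0.900, 0.915), 0.8904),
    4:  ((0.910, 0.930), 0.9203),
    5:  ((0.925, 0.935), 0.9212),
    6:  ((0.924, 0.932), 0.9181),
    10: ((0.922, 0.928), 0.9297),
    13: ((0.928, 0.936), 0.9195),
    15: ((0.930, 0.936), 0.9239),
    19: ((0.928, 0.934), 0.9225),
}

# How often did the actual result land inside the predicted range?
in_range = sum(lo <= actual <= hi
               for (lo, hi), actual in predictions.values())

# Signed error of the range midpoint: positive means the
# prediction was too optimistic.
bias = sum((lo + hi) / 2 - actual
           for (lo, hi), actual in predictions.values()) / len(predictions)

print(f"in range: {in_range}/{len(predictions)}")  # -> in range: 1/9
print(f"mean midpoint bias: {bias:+.4f}")          # -> mean midpoint bias: +0.0094
```

On this subset the bias is about +0.009 val_accuracy, i.e. roughly a full point of systematic optimism --- exactly the signal a closed feedback loop would have fed back into the next prediction.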
The agent did self-critique on its *code changes* but not on its *prediction accuracy*.

## Comparison with Run 9 (Control)

| | Run 9 (no critique) | Run 10 (self-critique) |
|---|---|---|
| Best val_acc | 93.79% | 92.97% |
| Cost | $3.93 | $6.08 |
| Keeps | 9/20 | 4/20 |
| Plateau onset | iter 13 (still improving at 19) | iter 10 |
| Opus/high avg cost/iter | $0.34 | $0.44 |
| Sonnet keeps | 5/9 | 2/5 |

Run 10 is worse on every metric. The self-critique added cost (longer agent responses) without improving decision quality. The agent escalated to opus earlier (iter 8 vs iter 12 in run 9) because sonnet only produced 2 keeps instead of 5, and once at opus the plateau set in earlier and harder.

## Cost Breakdown

| Tier | Iters | Total cost | Avg/iter |
|------|-------|------------|----------|
| haiku/high | 4 | $0.31 | $0.08 |
| sonnet/high | 5 | $0.87 | $0.17 |
| opus/high | 11 | $4.89 | $0.44 |
| **Total** | **20** | **$6.08** | **$0.30** |

## Findings

**Self-critique predictions did not improve the agent's performance.** The agent wrote thoughtful risk analyses but was systematically overconfident, and it never learned from its own prediction errors. The calibration feedback loop --- the core hypothesis --- did not emerge.

**The predictions added cost without value.** Opus/high iterations cost $0.44/iter with predictions vs $0.34/iter without (run 9). The agent spent ~30% more tokens per iteration on prediction text that didn't improve its code changes.

**The agent critiques experiments, not predictions.** It successfully referenced prior experimental failures ("iter 6 tried augmentation and it hurt") but never referenced prior prediction failures ("I keep predicting 0.930+ and landing at 0.925"). This is a limitation of how the context is structured --- predictions and outcomes appear in the same block, but the agent isn't prompted to compare them explicitly.
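One way the harness could force that comparison is to precompute a calibration summary and inject it into the agent's context, rather than hoping the agent derives it. A hypothetical sketch (the helper name and wiring are illustrative, not part of the actual harness):

```python
# Hypothetical helper: turn recent ((lo, hi), actual) pairs into an explicit
# calibration note for the agent's context, so predictions and outcomes are
# compared rather than merely listed side by side.
def calibration_note(history):
    """history: list of ((lo, hi), actual) for recent iterations."""
    misses = [(lo + hi) / 2 - actual
              for (lo, hi), actual in history
              if not lo <= actual <= hi]
    if not misses:
        return "Calibration: all recent predictions were in range."
    avg = sum(misses) / len(misses)
    direction = "high" if avg > 0 else "low"
    return (f"Calibration: {len(misses)}/{len(history)} recent predictions "
            f"missed, on average {abs(avg):.3f} too {direction}. "
            "Adjust your next predicted range accordingly.")

# Example with the last three tabulated iterations (13, 15, 19):
note = calibration_note([((0.928, 0.936), 0.9195),
                         ((0.930, 0.936), 0.9239),
                         ((0.928, 0.934), 0.9225)])
print(note)
# -> Calibration: 3/3 recent predictions missed, on average 0.010 too high. ...
```

Whether such a nudge would actually close the loop is an open question for a future run; this sketch only shows that the missing comparison is cheap to compute harness-side.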
**Context enrichment is not universally beneficial.** Runs 3 and 5 showed that adding training curves and full explanations to context improved results significantly. But adding predictions to context made things worse. The difference: training curves and explanations add *objective data* that helps the agent reason. Predictions add *the agent's own uncertainty*, which turned out to be poorly calibrated and not self-correcting. More context is only better when it's informative context.