# Autoresearch: Ideas Triage (Post Run 12)

After 12 experimental runs on Fashion-MNIST, we revisit the brainstormed ideas list (originally 20 items, generated after run 5) and sort them into tested, ruled out, and remaining. The purpose of this triage is to clarify what we've learned, what we've chosen not to pursue and why, and what's left to explore.

## Experimental Constraints

These constraints govern all experiments and inform why certain ideas are ruled out:

- **Run isolation**: Every run starts from the same weak baseline. No knowledge carries between runs. The agent must face the problem fresh each time, as it would on a novel task in practice.
- **Comparability**: Changes between runs must be minimal and attributable. If we change two things at once, we can't tell which one mattered.
- **Harness quality is the goal**: We are not optimizing for Fashion-MNIST accuracy. Val_accuracy is a measuring stick for evaluating harness changes. Every improvement to the harness must be domain-agnostic.

## Tested

### Model/effort strategies (ideas #5, #6, #7)

Tested across runs 6-9. Haiku inner loop (#5), adaptive model escalation (#6), and effort as a knob (#7) were all explored through four ladder variants: 10-step with reset (run 6), 4-step bidirectional (run 7), 4-step with reset (run 8), and 3-step one-way ratchet (run 9).

**Findings**: The one-way ratchet with 3 consecutive discards before escalation is the best ladder design. Opus/max adds cost without value. Haiku grabs low-hanging fruit cheaply, sonnet does the heavy lifting, opus provides marginal gains. The ladder is a solved component of the harness.

### Agent self-critique (idea #18)

Tested in run 10 as per-iteration predictions ("what could go wrong, what val_accuracy do you expect?"). The agent was systematically overconfident and never calibrated from its own prediction errors.
**Finding**: Asking the agent to introspect about its own predictions hurts performance and raises cost ($6.08 vs $3.93 control) without adding value. However, the useful version of self-critique turned out to be the EXAMINE phase inside structured CoT (run 12) --- embedded in the reasoning flow rather than bolted on as standalone predictions.

### Postmortem analysis (related to idea #1)

Tested in run 11. Every 3 consecutive discards, a separate agent call diagnoses failure patterns and recommends next steps. The agent followed the advice. Related to idea #1 (plateau-triggered strategy rewrite) but does not actually rewrite program.md.

**Finding**: Postmortems produce high-quality diagnoses and the coding agent follows them. Cheap ($0.09-0.14 per call). The plateau persists despite correct diagnosis --- the bottleneck is the search space, not the agent's reasoning. Worth keeping as a diagnostic tool.

### Cognitive tools / structured reasoning (new, inspired by arXiv 2506.12115v2)

Tested in run 12. Restructured the agent's prompt into four explicit phases (understand, recall, propose, examine) and added per-discard backtracking analysis.

**Finding**: Best keep rate across all runs (10/20). Sonnet never exhausted in 14 iterations. The combination of structured CoT and per-discard backtracking creates a feedback loop that keeps the agent productive longer. This is the most promising harness mechanism tested so far.

## Ruled Out

### Expert.md / cross-run memory (ideas #3, #19)

These leak knowledge between runs, breaking isolation. In practice, autoresearch will face novel problems where prior run knowledge doesn't exist. Giving the agent hints from prior Fashion-MNIST runs would make the harness look better than it actually is on new tasks. This is basic experimental methodology --- you can't contaminate the control.

### Multi-agent shared knowledge (idea #17)

Same isolation problem as expert.md.
Multiple agents sharing a knowledge base within a run would be interesting, but shared knowledge across branches creates cross-contamination that makes results hard to attribute.

### Constrained creativity rounds (idea #20)

Forcing novelty every Nth iteration was meant to prevent the agent from repeating similar ideas during the plateau. In practice, this is hard to enforce (the agent can claim novelty while making minor variations), and the structured CoT + backtracking from run 12 already addresses the same problem more naturally --- the agent adapts after each failure through the feedback loop rather than being forced to be different on a schedule.

## Not Yet Tested

### Simulated annealing (idea #10)

Allow the agent to accept regressions with decreasing probability. Directly targets the plateau by letting the agent take a step backwards to escape a local optimum. The concern: it requires longer runs (30-40+ iterations) to see the regression-then-recovery cycle play out, which is not practical at our current cost and time budget. Parked for now.

### Branching search / mini-MCTS (idea #8)

Maintain 2-3 parallel code variants instead of a linear keep/discard chain. Tests whether the single-path topology is the bottleneck. The most structurally different idea on the list. Parked due to implementation complexity and the cost of running 2-3x training evaluations per iteration.

### Evolutionary population (idea #9)

Maintain a population of train.py variants with selection pressure. Similar to branching but with a different selection mechanism. Same practical constraints as #8.

### Cross-domain transfer (idea #15)

Run the identical harness on a different task entirely. This is the most important untested idea. Everything we've built is validated on Fashion-MNIST only. If the harness improvements are Fashion-MNIST-specific, we need to know. This directly serves the project's stated goal: building a general autoresearch harness, not a Fashion-MNIST solver.
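The acceptance rule behind simulated annealing (idea #10) is simple enough to sketch. This is a minimal illustration, not harness code: the score is assumed to be val_accuracy, the geometric cooling schedule and the `t0`/`decay` values are arbitrary placeholders, and all function names are hypothetical.

```python
import math
import random

def accept(new_score: float, current_score: float,
           temperature: float, rng: random.Random) -> bool:
    """Metropolis-style acceptance: always keep improvements; accept
    regressions with probability exp(delta / temperature)."""
    delta = new_score - current_score  # e.g. change in val_accuracy
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)

def temperature_at(iteration: int, t0: float = 0.05,
                   decay: float = 0.9) -> float:
    """Geometric cooling: regressions become less likely as the run ages."""
    return t0 * (decay ** iteration)

# A 0.02 regression is sometimes accepted early in the run,
# almost never late in the run.
rng = random.Random(0)
early = sum(accept(0.88, 0.90, temperature_at(2), rng) for _ in range(1000))
late = sum(accept(0.88, 0.90, temperature_at(30), rng) for _ in range(1000))
print(early > late)
```

The run-length concern above falls out of the schedule: with 14-iteration runs the temperature never cools enough to distinguish annealing from plain noise, which is why 30-40+ iterations are needed.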
### Plateau-triggered strategy rewrite (idea #1)

The postmortem mechanism (run 11) is a partial implementation of this --- it diagnoses failures but doesn't rewrite program.md. A full implementation would have the orchestrator rewrite the agent's strategic guidance based on accumulated evidence. With the cognitive tools (run 12) now providing per-iteration reasoning structure and per-discard backtracking, the value of rewriting program.md is less clear --- the agent is already adapting its strategy through the feedback loop.

### Program.md A/B testing (idea #2)

Run parallel experiments with different program.md strategies. Interesting for understanding how much the strategy document matters, but it requires 2x the runs to get meaningful data. Lower priority than cross-domain transfer.

### Alternative metrics (ideas #11, #12, #13)

Composite multi-task scoring, a Pareto metric (accuracy vs. parameters), and a training-efficiency metric. These change what the harness optimizes for, which could reveal whether metric design affects search quality. Interesting but secondary to the cross-domain transfer question.

### Budget ladder (idea #16)

Run at different time budgets (1 min, 5 min, 30 min) to study how the optimal strategy changes with compute. Would teach us about budget sensitivity but doesn't improve the harness itself.

### Dissolving the harness (idea #4)

Remove the Python orchestrator and let Claude Code run the full loop. Tests whether the agent can be its own harness. Philosophically interesting but risky --- the AgentSAT "running out of will" problem suggests agents may drift without external structure.

## Priority

1. **Cross-domain transfer** (#15) --- validates whether everything we've built generalizes
2. **Simulated annealing** (#10) --- most direct attack on the plateau, needs longer runs
3. **Branching search** (#8) --- most structurally different approach, needs implementation work
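For reference, the "solved component" from the Tested section --- the 3-step one-way ratchet that escalates after 3 consecutive discards and never de-escalates --- can be sketched as follows. The class and tier names are illustrative, not the orchestrator's actual code.

```python
# One-way ratchet over a model ladder: escalate haiku -> sonnet -> opus
# after `patience` consecutive discards; never drop back down.
LADDER = ["haiku", "sonnet", "opus"]  # illustrative tier names

class Ratchet:
    def __init__(self, patience: int = 3):
        self.tier = 0
        self.patience = patience
        self.consecutive_discards = 0

    @property
    def model(self) -> str:
        return LADDER[self.tier]

    def record(self, kept: bool) -> None:
        if kept:
            self.consecutive_discards = 0  # a keep resets the counter, not the tier
            return
        self.consecutive_discards += 1
        if (self.consecutive_discards >= self.patience
                and self.tier < len(LADDER) - 1):
            self.tier += 1               # one-way: escalate, never de-escalate
            self.consecutive_discards = 0

r = Ratchet()
for kept in [False, False, False]:  # three straight discards
    r.record(kept)
print(r.model)  # escalated from "haiku" to "sonnet"
```

The one-way property is what run 9 showed matters: a later keep resets the discard counter but never hands the problem back to a weaker model.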