# Autoresearch: Ideas to Explore

A brainstormed list of bold directions for the autoresearch harness, organized by theme. Generated from analysis of the [autoresearch scroll](https://grimoire-pt5.sprites.app/autoresearch) and five experimental runs on Fashion-MNIST (best: 93.95% at run 5, $2.10 for 20 iterations).

---

## Outer Loop & Strategy Evolution

1. **Plateau-triggered strategy rewrite** — After N consecutive discards, an Opus-tier agent reads the full history, diagnoses *why* the inner loop stalled, and rewrites `program.md`. Warm restart from the best checkpoint. The scroll's recommended next step, and the most obvious gap.
2. **Program.md A/B testing** — Run two parallel experiments with *different* `program.md` strategies (e.g., "focus on training dynamics" vs. "focus on architecture") and compare. Treats the research strategy itself as a variable under experimental control.
3. **Expert.md — agent-maintained knowledge base** — Inspired by AgentSAT. After each iteration, the agent appends a structured lesson to `expert.md` ("dropout removal: reliable under low-epoch regimes", "SE attention: not worth the epoch cost"). This file persists across strategy rewrites. The agent accumulates institutional knowledge rather than re-deriving it from raw history each time.
4. **Dissolving the harness** — Remove the Python orchestrator entirely. Give Claude Code a `program.md` and let it run the full loop itself — edit, train, evaluate, keep/discard, update knowledge. Test whether the agent can be its own harness, or whether it drifts (the AgentSAT "running out of will" problem).

## Cost & Model Strategy

5. **Haiku inner loop** — Replace Sonnet with Haiku for the inner-loop agent. Run 5 cost $0.105/iter. If Haiku can achieve 80% of the quality at 10% of the cost, you get 10x more iterations for the same dollar. Test whether volume compensates for reasoning quality.
6. **Adaptive model escalation** — Start with Haiku. After 2 consecutive failures, switch to Sonnet.
After 4, switch to Opus. After 6, trigger outer-loop intervention. The cheapest model that makes progress is the right model.
7. **Reasoning effort as a knob** — Instead of switching models, scale the extended-thinking budget. Low effort for "obvious" moves (early iterations), high effort when the agent is stuck. The orchestrator controls this based on recent success rate.

## Search Structure

8. **Branching search (mini-MCTS)** — Instead of linear keep/discard, maintain 2-3 active branches. When a change is discarded, fork: one branch continues from the current best, another tries a deliberately different direction. Merge the best branch after K iterations. This is MARS-lite without the full tree-search infrastructure.
9. **Evolutionary population** — Maintain a population of 4-5 `train.py` variants. Each iteration, mutate the top 2, evaluate all, cull the worst. Selection pressure replaces the keep/discard binary. The agent writes mutations; the orchestrator manages selection.
10. **Simulated annealing permission** — Explicitly tell the agent it may make changes that *reduce* accuracy if it believes they open a new search direction. Accept regressions of up to 0.5% with decreasing probability over time. Escape local optima at the cost of temporary setbacks.

## Measurement & Metric

11. **Composite multi-task scoring** — Score each candidate against Fashion-MNIST AND MNIST (or CIFAR-10 grayscale). The agent must find architectures that generalize across distributions rather than overfit to one. Turns the scroll's concern about dataset-specific solutions into a testable hypothesis.
12. **Pareto metric: accuracy vs. parameters** — Instead of pure accuracy, score on accuracy / log(parameter_count). Pushes the agent toward efficient architectures. A 93% model with 50K params beats a 94% model with 500K params.
13. **Training efficiency metric** — Score on accuracy × (epochs_completed / budget_seconds).
Rewards architectures that learn fast, not just architectures that learn well. The agent already reasons about epoch cost informally — make it explicit in the fitness function.

## Domain & Scope

14. **Structural search on a real codebase** — Move from parametric search (hyperparameters) to structural search (code transformations). Pick a small Python library with a benchmark suite. The agent optimizes runtime performance while maintaining test correctness. This is the Liquid pattern, scaled down.
15. **Cross-domain transfer test** — Run the identical harness on a text-classification task (e.g., AG News with a small transformer). Does the pattern transfer? Do the agent's meta-strategies (dropout removal, batch-size tuning) carry over, or are they Fashion-MNIST-specific?
16. **Budget ladder** — Run the same experiment at 1-min, 5-min, and 30-min budgets. Study how the *optimal strategy* changes with budget. The scroll predicts qualitatively different architectures at different budgets — test that prediction directly.

## Agent Capabilities

17. **Multi-agent parallel search** — Run 3 inner-loop agents simultaneously on different branches. They share an `expert.md` knowledge base via the filesystem. Each reads the others' findings before proposing changes. Coordination through shared state rather than explicit communication.
18. **Agent self-critique** — Before submitting a change, the agent must write a "pre-mortem": why might this change fail? The orchestrator includes these predictions alongside results in the history. Trains the agent's calibration over iterations.
19. **Cross-run memory** — Feed the agent summaries from ALL previous runs, not just the current one. "In run 3, architectural changes consistently failed under tight budgets. In run 5, dropout removal was the most reliable lever." The agent starts each new run with institutional knowledge from the entire research program.
20.
**Constrained creativity rounds** — Every 5th iteration, force the agent to try something it hasn't tried before (checked against history). Prevents premature convergence on a narrow strategy. The rest of the iterations are unconstrained.

---

*Generated 2026-04-02 from analysis of runs 1-5 and the autoresearch grimoire scroll.*
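The acceptance rule in idea 10 is small enough to sketch. A minimal version, assuming the orchestrator tracks accuracy as a fraction and anneals a temperature linearly over the run; the 0.5% cap comes from the idea above, while the starting temperature `t0` and the linear schedule are illustrative choices, not existing harness code:

```python
import math
import random

def keep_candidate(delta_acc, iteration, total_iters,
                   max_regression=0.005, t0=0.05, rng=random):
    """Keep/discard with simulated-annealing-style acceptance.

    delta_acc: candidate accuracy minus current best (fraction, e.g. -0.003).
    Improvements are always kept; regressions worse than max_regression are
    always discarded; small regressions are kept with probability
    exp(delta_acc / T), where T anneals toward zero over the run.
    """
    if delta_acc >= 0:
        return True            # strict improvement: always keep
    if delta_acc < -max_regression:
        return False           # beyond the 0.5% cap: always discard
    # Linear annealing: late iterations tolerate almost no regression.
    temperature = max(t0 * (1 - iteration / total_iters), 1e-9)
    return rng.random() < math.exp(delta_acc / temperature)
```

With `t0 = 0.05`, a 0.3% regression is kept roughly 94% of the time at iteration 0, and the tolerance shrinks as the temperature anneals; `t0` is the knob that sets how much early exploration you want.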
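The scoring ideas in 12 and 13 are equally easy to prototype. A hedged sketch, where the function names are my own and the base-10 log is one reasonable reading of "log(parameter_count)"; the harness would need to report parameter counts and the wall-clock budget for these to be usable as fitness functions:

```python
import math

def pareto_score(accuracy, param_count):
    """Idea 12: reward accuracy per order of magnitude of parameters."""
    return accuracy / math.log10(param_count)

def efficiency_score(accuracy, epochs_completed, budget_seconds):
    """Idea 13: reward models that squeeze more epochs out of the budget."""
    return accuracy * (epochs_completed / budget_seconds)
```

Under `pareto_score`, the example from idea 12 holds: 0.93 / log10(5e4) ≈ 0.198 beats 0.94 / log10(5e5) ≈ 0.165.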