Autoresearch Run 3: Training Curves in Agent Context

For the third run we changed one thing in the harness: instead of passing the agent a summary table with one line per iteration (change description, final metric, keep/discard), we passed the full epoch-by-epoch training curve for each of the last 10 iterations.

Previously the agent saw:

| Iter | Change            | val_accuracy | Decision |
| 0005 | AdamW + grad clip | 0.9080       | KEEP     |
| 0006 | Wider channels    | 0.9056       | DISCARD  |

Now it sees:

Iter 0005 (KEEP, 0.9080): AdamW + gradient clipping
  Curve (5 epochs, 281s, still improving): 0.862 -> 0.880 -> 0.896 -> 0.905 -> 0.908

Iter 0006 (DISCARD, 0.9056): Wider channels 32->48
  Curve (3 epochs, 295s, still improving): 0.851 -> 0.878 -> 0.906

The curve includes epoch count, wall time, and whether the model was still improving, plateauing, or declining when training stopped.

---

RESULTS

Same setup as run 2: 20 iterations, 5-minute budget, Claude Opus, high effort, same weak baseline. Fashion-MNIST.

Keep history:

Iter 0: CNN + Adam + BatchNorm + input normalization (in train loop)  -> 0.9108
Iter 4: Moved normalization into model.forward() to fix the train/val
        mismatch (score_model passes raw [0,1] images)                -> 0.9146
Iter 6: Cosine annealing LR (T_max=10, eta_min=1e-5)                  -> 0.9164
Iter 9: Label smoothing 0.1                                           -> 0.9202

Then 10 consecutive discards (iters 10-19). Best: 0.9202 at iter 9. Cost: $3.38.

---

COMPARISON

|               | Run 2 (summary table)  | Run 3 (training curves) |
| Best accuracy | 0.9122                 | 0.9202 (+0.8pp)         |
| Best at iter  | 16                     | 9                       |
| Total KEEPs   | 7/20                   | 4/20                    |
| Architecture  | 4 ResBlocks + SE + TTA | 2-block CNN + cosine LR |
| Cost          | $4.13                  | $3.38                   |

Run 3 hit a higher accuracy with a simpler architecture and fewer changes. The agent was more precise: each keep was a targeted fix rather than piling on techniques. Run 2 built an elaborate 4-ResBlock model with squeeze-and-excitation attention, test-time augmentation, and global average pooling.
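For contrast, the entire run-3 recipe fits in a few lines. A minimal PyTorch sketch of the kept changes -- normalization inside forward(), cosine annealing, label smoothing -- where the layer widths and the normalization constants are assumptions on my part, not the agent's actual code:

```python
import torch
import torch.nn as nn

# Approximate Fashion-MNIST channel statistics (assumed values)
MEAN, STD = 0.2860, 0.3530

class SmallCNN(nn.Module):
    """2-block CNN. Normalization lives in forward() so score_model can
    pass raw [0, 1] images without a train/val mismatch (the iter 4 fix)."""
    def __init__(self, width=32, num_classes=10):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout),
                nn.ReLU(), nn.MaxPool2d(2))
        self.features = nn.Sequential(block(1, width), block(width, 2 * width))
        self.head = nn.Linear(2 * width * 7 * 7, num_classes)

    def forward(self, x):
        x = (x - MEAN) / STD          # iter 4: normalize inside the model
        return self.head(self.features(x).flatten(1))

model = SmallCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# iter 6: cosine annealing; iter 9: label smoothing
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10, eta_min=1e-5)
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
```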
Run 3's simpler 2-block CNN with correct normalization, cosine LR, and label smoothing outperformed it.

---

WHAT THE CURVES CHANGED

The most striking difference was in iteration 0. In both runs the agent replaced the linear baseline with a CNN. But in run 3, the agent added a note in its own output:

"If score_model in prepare.py feeds raw [0,1] images, there may be a train/val mismatch. If results are unexpectedly low, the next iteration should address this by moving normalization into the model's forward() method."

It then fixed this bug in iteration 4. Run 2 never identified this mismatch -- it applied normalization in the forward method from the start (iteration 3), but as an accuracy improvement, not as a bug fix. The distinction matters: run 3's agent understood WHY normalization was important, not just that it was a known technique.

The curves likely contributed to this. Seeing the per-epoch trajectory makes it easier to spot when a model is underperforming relative to its architecture complexity -- a sign that something is wrong with the training setup, not the architecture.

The agent also referenced the curves directly in later iterations. For iter 6, it noted: "All previous runs show the model is still improving after ~5 epochs, meaning the constant learning rate may be too high in later epochs." This observation came directly from the curve data -- the summary table would only have shown the final number.

---

THE PLATEAU PROBLEM

The bigger story is what did NOT change. After iter 9, the agent made 10 consecutive unsuccessful attempts. It had the training curves showing that every failed attempt was "still improving" when the budget ran out. It could see that wider or deeper architectures completed fewer epochs. But it could not translate these observations into a successful change.

This is the outer loop gap. The curves gave the agent better per-iteration reasoning, but did not help it reason across iterations.
It never concluded: "the last 5 attempts all show 3-4 epochs with accuracy still climbing -- the architecture is fine, the problem is that we need more epochs, so I should focus on reducing per-epoch compute cost rather than adding model capacity." That kind of meta-reasoning requires updating the search strategy itself, not just observing more data points.

Richer context improved the quality of individual decisions. It did not solve the strategy stagnation problem. Those are two different capabilities, and we improved only the first one.
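The missing meta-reasoning could in principle be expressed as a simple check over the iteration history. A toy sketch -- the field names ('decision', 'trend', 'epochs') are assumptions about the harness's record format, not its real schema:

```python
def suggest_strategy(history, window=5):
    """Toy outer-loop heuristic the agent never applied: if the last
    `window` discarded attempts were all cut off while still improving
    after only a few epochs, the bottleneck is epoch count, not model
    capacity, so recommend cutting per-epoch cost instead."""
    discards = [h for h in history if h["decision"] == "DISCARD"][-window:]
    if len(discards) == window and all(
        h["trend"] == "still improving" and h["epochs"] <= 4 for h in discards
    ):
        return "reduce per-epoch cost (smaller model, mixed precision, larger batch)"
    return "keep exploring model/training changes"

# Five straight budget-limited discards, all still climbing:
history = [{"decision": "DISCARD", "trend": "still improving", "epochs": 3}] * 5
print(suggest_strategy(history))  # -> "reduce per-epoch cost (...)"
```

Nothing about this check is sophisticated; the point is that it operates across iterations, which is exactly the capability richer per-iteration context does not buy.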