Autoresearch Run 3: Training Curves in Agent Context

For the third run we changed one thing in the harness: instead of passing the agent a summary table with one line per iteration (change description, final metric, keep/discard), we passed the full epoch-by-epoch training curve for each of the last 10 iterations.

Previously the agent saw:

| Iter | Change            | val_accuracy | Decision |
| 0005 | AdamW + grad clip | 0.9080       | KEEP     |
| 0006 | Wider channels    | 0.9056       | DISCARD  |

Now it sees:

Iter 0005 (KEEP, 0.9080): AdamW + gradient clipping
  Curve (5 epochs, 281s, still improving): 0.862 -> 0.880 -> 0.896 -> 0.905 -> 0.908

Iter 0006 (DISCARD, 0.9056): Wider channels 32->48
  Curve (3 epochs, 295s, still improving): 0.851 -> 0.878 -> 0.906

The curve includes epoch count, wall time, and whether the model was still improving, plateauing, or declining when training stopped.

---

RESULTS

Same setup as run 2: 20 iterations, 5-minute budget, Claude Opus, high effort, same weak baseline. Fashion-MNIST.

Keep history:

Iter 0: CNN + Adam + BatchNorm + input normalization (in train loop)  -> 0.9108
Iter 4: Moved normalization into model.forward() to fix the train/val
        mismatch (score_model passes raw [0,1] images)                -> 0.9146
Iter 6: Cosine annealing LR (T_max=10, eta_min=1e-5)                  -> 0.9164
Iter 9: Label smoothing 0.1                                           -> 0.9202

Then 10 consecutive discards (iters 10-19). Best: 0.9202 at iter 9. Cost: $3.38.

---

COMPARISON

|               | Run 2 (summary table)  | Run 3 (training curves) |
| Best accuracy | 0.9122                 | 0.9202 (+0.8pp)         |
| Best at iter  | 16                     | 9                       |
| Total KEEPs   | 7/20                   | 4/20                    |
| Architecture  | 4 ResBlocks + SE + TTA | 2-block CNN + cosine LR |
| Cost          | $4.13                  | $3.38                   |

Run 3 hit a higher accuracy with a simpler architecture and fewer changes. The agent was more precise: each keep was a targeted fix rather than piling on techniques. Run 2 built an elaborate 4-ResBlock model with squeeze-and-excitation attention, test-time augmentation, and global average pooling.
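For contrast, the entire run-3 recipe fits in a few lines. A minimal PyTorch sketch of the kept changes -- normalization inside forward(), cosine annealing, label smoothing -- where the layer widths and the normalization constants are assumptions on my part, not the agent's actual code:

```python
import torch
import torch.nn as nn

# Approximate Fashion-MNIST channel statistics (assumed values)
MEAN, STD = 0.2860, 0.3530

class SmallCNN(nn.Module):
    """2-block CNN. Normalization lives in forward() so score_model can
    pass raw [0, 1] images without a train/val mismatch (the iter 4 fix)."""
    def __init__(self, width=32, num_classes=10):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout),
                nn.ReLU(), nn.MaxPool2d(2))
        self.features = nn.Sequential(block(1, width), block(width, 2 * width))
        self.head = nn.Linear(2 * width * 7 * 7, num_classes)

    def forward(self, x):
        x = (x - MEAN) / STD          # iter 4: normalize inside the model
        return self.head(self.features(x).flatten(1))

model = SmallCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# iter 6: cosine annealing; iter 9: label smoothing
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10, eta_min=1e-5)
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
```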
Run 3's simpler 2-block CNN with correct normalization, cosine LR, and label smoothing outperformed it.

---

WHAT THE CURVES CHANGED

The most striking difference was in iteration 0. In both runs the agent replaced the linear baseline with a CNN. But in run 3, the agent added a note in its own output:

"If score_model in prepare.py feeds raw [0,1] images, there may be a train/val mismatch. If results are unexpectedly low, the next iteration should address this by moving normalization into the model's forward() method."

It then fixed this bug in iteration 4. Run 2 never identified this mismatch -- it applied normalization in the forward method from the start (iteration 3), but as an accuracy improvement, not as a bug fix. The distinction matters: run 3's agent understood WHY normalization was important, not just that it was a known technique.

The curves likely contributed to this. Seeing the per-epoch trajectory makes it easier to spot when a model is underperforming relative to its architecture complexity -- a sign that something is wrong with the training setup, not the architecture.

The agent also referenced the curves directly in later iterations. For iter 6, it noted: "All previous runs show the model is still improving after ~5 epochs, meaning the constant learning rate may be too high in later epochs." This observation came directly from the curve data -- the summary table would only have shown the final number.

---

THE PLATEAU PROBLEM

The bigger story is what did NOT change. After iter 9, the agent made 10 consecutive unsuccessful attempts. It had the training curves showing that every failed attempt was "still improving" when the budget ran out. It could see that wider or deeper architectures completed fewer epochs. But it could not translate these observations into a successful change.

This is the outer loop gap. The curves gave the agent better per-iteration reasoning, but did not help it reason across iterations.
It never concluded: "the last 5 attempts all show 3-4 epochs with accuracy still climbing -- the architecture is fine, the problem is that we need more epochs, so I should focus on reducing per-epoch compute cost rather than adding model capacity." That kind of meta-reasoning requires updating the search strategy itself, not just observing more data points.

Richer context improved the quality of individual decisions. It did not solve the strategy stagnation problem. Those are two different capabilities, and we improved only the first one.
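The missing meta-reasoning could in principle be expressed as a simple check over the iteration history. A toy sketch -- the field names ('decision', 'trend', 'epochs') are assumptions about the harness's record format, not its real schema:

```python
def suggest_strategy(history, window=5):
    """Toy outer-loop heuristic the agent never applied: if the last
    `window` discarded attempts were all cut off while still improving
    after only a few epochs, the bottleneck is epoch count, not model
    capacity, so recommend cutting per-epoch cost instead."""
    discards = [h for h in history if h["decision"] == "DISCARD"][-window:]
    if len(discards) == window and all(
        h["trend"] == "still improving" and h["epochs"] <= 4 for h in discards
    ):
        return "reduce per-epoch cost (smaller model, mixed precision, larger batch)"
    return "keep exploring model/training changes"

# Five straight budget-limited discards, all still climbing:
history = [{"decision": "DISCARD", "trend": "still improving", "epochs": 3}] * 5
print(suggest_strategy(history))  # -> "reduce per-epoch cost (...)"
```

Nothing about this check is sophisticated; the point is that it operates across iterations, which is exactly the capability richer per-iteration context does not buy.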