# Checkpoint Excavation: Verifying Autoresearch Paper Claims

## What this document is

A protocol for extracting experiment data from historical sprite checkpoints to verify claims made in a research paper. The paper documents 15 autoresearch harness experiments on Fashion-MNIST and is published as a draft at https://ararxiv.dev/abs/Kky24cHI.

This work is driven from a **separate sprite** because checkpoint restores on the target sprite kill all sessions. The controller sprite uses the external Sprites API to remotely restore checkpoints and read files from the target sprite.

## Where we are in the drafting process

The paper has been through three revision phases:

1. **Claim verification** — external claims (arXiv papers, SOTA benchmarks, URLs) checked and corrected. Karpathy attribution fixed (2025 to 2026; "introduced" softened to "released as implementation of a pattern already used internally").
2. **Methodological assessment** — paper reframed as exploratory (hypothesis-generating) rather than confirmatory. No pre-registered hypotheses existed; "Rationale" columns replaced "Hypothesis." A Research Design subsection was added explaining the post-hoc analysis.
3. **Interview-driven revision** — a structured interview with the researcher recovered intent, surprises, and confidence levels. 11 changes applied: honest tinkering origin in the Introduction, isolation reframed as definitional, bundled-variable runs disclosed, code availability claim removed, keep rate qualified, cumulative configuration documented, Related Work expanded, baseline stability limitation added.

**What remains before publication:** verifying that the paper's reported numbers (accuracy, cost, model used, keep counts) match the raw experiment logs. That is the purpose of this excavation. The most critical question is whether **run 5** actually used opus/high — the paper reports it as the best result (93.95% at $2.10), but $2.10 is half the cost of comparable opus runs, which raises doubt.
## How sprites and checkpoints work

Sprites are stateful sandboxes on Fly.io. Each sprite has a writable filesystem overlay on top of a base image. Checkpoints are point-in-time snapshots of that overlay. Key properties:

- Checkpoints are fast (~300ms) because they shuffle metadata, not data
- Restoring a checkpoint replaces the entire writable overlay — all files revert to the checkpoint's state
- Restoring restarts the sprite, killing all sessions
- The last 5 checkpoints are auto-mounted inside the sprite at `/.sprite/checkpoints/` for direct read access
- There is no mechanism to mount arbitrary older checkpoints
- Checkpoints are **sprite-specific** — one sprite cannot directly access another sprite's checkpoints

### External API (how this excavation works)

From outside a sprite, you can:

- **Read files:** `GET /v1/sprites/{name}/fs/read?path=/path/to/file`
- **List directories:** `GET /v1/sprites/{name}/fs/list?path=/path/to/dir`
- **Trigger restore:** `POST /v1/sprites/{name}/checkpoints/{checkpoint_id}/restore`
- **List checkpoints:** `GET /v1/sprites/{name}/checkpoints`

API base: `https://api.sprites.dev` (requires bearer token auth)

The protocol: restore a checkpoint on the target sprite, wait for it to come back up, read the files you need via the filesystem API, then restore the next checkpoint.

### CLI equivalents (from outside)

```bash
# List checkpoints on target sprite
sprite checkpoint list --sprite autoresearch

# Restore a checkpoint
sprite restore v29 --sprite autoresearch

# Read a file (via API — no direct CLI equivalent for remote file read)
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.sprites.dev/v1/sprites/autoresearch/fs/read?path=/home/sprite/autoresearch/experiments/history.jsonl"
```

Note: Verify the exact API base URL and auth mechanism. The above is reconstructed from documentation at https://docs.sprites.dev/api/v001-rc30/filesystem/ and https://sprites.dev/api/sprites/checkpoints.
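The restore → wait → read cycle can be sketched in Python. This is a minimal sketch, not a client library: the endpoints are the unverified ones reconstructed above, the readiness probe (polling the checkpoint-list endpoint) follows the suggestion in Known Issues, and all function names here are mine.

```python
import time
import urllib.parse
import urllib.request

API_BASE = "https://api.sprites.dev"  # unverified -- confirm against current docs
SPRITE = "autoresearch"               # may need the org-qualified name instead

def _request(method: str, url: str, token: str) -> bytes:
    """Issue one authenticated API call and return the raw response body."""
    req = urllib.request.Request(url, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def restore_url(checkpoint: str) -> str:
    return f"{API_BASE}/v1/sprites/{SPRITE}/checkpoints/{checkpoint}/restore"

def read_url(path: str) -> str:
    return f"{API_BASE}/v1/sprites/{SPRITE}/fs/read?path={urllib.parse.quote(path, safe='/')}"

def excavate(checkpoint: str, paths: list[str], token: str) -> dict[str, bytes]:
    """Restore `checkpoint` on the target, wait for it to answer, read `paths`."""
    _request("POST", restore_url(checkpoint), token)
    # Poll until the restarted sprite serves the API again. Caveat: an early
    # poll might hit the sprite before the restart has actually begun.
    for _ in range(60):
        time.sleep(5)
        try:
            _request("GET", f"{API_BASE}/v1/sprites/{SPRITE}/checkpoints", token)
            break
        except OSError:
            continue
    return {p: _request("GET", read_url(p), token) for p in paths}
```

Looping `excavate` over the post-run checkpoint list, saving each result locally, gives the extraction loop described under Execution protocol.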
The sprite name in API calls may need the full org-qualified form.

## Target sprite

- **Name:** autoresearch
- **Sprite ID:** sprite-d8e15c82-b6fb-4778-8175-39f72479445e
- **URL:** https://autoresearch-pt5.sprites.app
- **Current state:** Paper draft (v67), all 15 runs completed, experiments/ directory contains only run 15 data

## The 15 runs and their checkpoints

Each experiment was bookended by checkpoints. "Pre-run" checkpoints have the baseline state before the agent ran. "Post-run" checkpoints have completed experiment data (JSONL logs, orchestrator logs, final code).

### Post-run checkpoints (primary extraction targets)

| Run | Checkpoint | Date | Comment | Paper: accuracy | Paper: cost | Paper: keeps |
|-----|-----------|------|---------|----------------|-------------|--------------|
| 1 | v13 | Mar 26 | run 1 complete, best val_accuracy=0.9068 | 90.68% | $3.54 | — |
| 2 | v16 | Mar 26 | autoresearch 5min run complete, best=0.9122 | 91.22% | $4.13 | — |
| 3 | v23 | Mar 27 | run 3 complete: enriched history, best=0.9202 | 92.02% | $3.38 | — |
| 4 | v27 | Mar 27 | run4 complete: 0.8985, fixed venv path | 89.85% | $4.91 | — |
| 5 | **v29** | Mar 28 | **run5 complete: best 0.9395, cost $2.10, new record** | **93.95%** | **$2.10** | — |
| 6 | v34 | Apr 2 (listed as Apr 5) | Run 6 complete: escalation ladder (92.75%, $1.77) | 92.75% | $1.77 | — |
| 7 | v36 | Apr 5 | Run 7 complete: 4-step bidirectional, 92.82%, $4.29 | 92.82% | $4.29 | — |
| 8 | v39 | Apr 6 | Run 8 complete: 4-step reset, no timeout, 93.26%, $8.30 | 93.26% | $8.30 | — |
| 9 | v41 | Apr 6 | Run 9 complete: 3-step one-way, 93.79%, $3.93 | 93.79% | $3.93 | — |
| 10 | v45 | Apr 6 | Run 10 complete: self-critique, 92.97%, $6.08 | 92.97% | $6.08 | 4/20 |
| 11 | v48 | Apr 6 | Run 11 complete: postmortem, 93.01%, $4.61 | 93.01% | $4.61 | 5/20 |
| 12 | v51 | Apr 7 | Run 12 complete: cognitive tools, 93.41%, $6.41 | 93.41% | $6.41 | 10/20 |
| 13 | v55 | Apr 7 | Run 13 complete: web access, 91.54%, $2.29 | 91.54% | $2.29 | 12/20 |
| 14 | v58 | Apr 7 | Run 14 complete: web in postmortem, 92.86%, $4.62 | 92.86% | $4.62 | 10/20 |
| 15 | v60 | Apr 7 | Run 15 complete: web-nudged postmortem, 93.17%, $4.69 | 93.17% | $4.69 | 9/20 |

### Pre-run checkpoints (baseline verification)

| Run | Checkpoint | Comment |
|-----|-----------|---------|
| 1 | v10 | pre 20-iter opus run |
| 2 | v15 | clean reset for 5min/20iter run |
| 3 | v22 | run 3 launched: enriched history |
| 4 | v26 | pre-run4: cleared experiments, reset train.py |
| 5 | v27/v28 | v27 is post-run4, v28 is run5 in progress |
| 6 | v33 | Run 6: escalation ladder implemented, baseline reset |
| 7 | v35 | Run 7: 4-step ladder, bidirectional |
| 8 | v38 | Run 8: 4-step reset, no agent timeout |
| 9 | v40 | Run 9: 3-step one-way ladder |
| 10 | v44 | Run 10: self-critique predictions added |
| 11 | v47 | Run 11: postmortem mechanism implemented |
| 12 | v50 | Run 12: cognitive tools (structured CoT + backtracking) |
| 13 | v54 | Run 13: web access added to main agent |
| 14 | v57 | Run 14: web access moved to postmortem only |
| 15 | v59 | Run 15: postmortem prompt nudge added |

### Paper-writing checkpoints (not needed for excavation)

| Checkpoint | Comment |
|-----------|---------|
| v62 | paper draft v1 written |
| v63 | paper draft v2 - IMRAD restructure |
| v64 | paper draft v3 - 10 fixes from note comparison |
| v65 | Karpathy date and attribution corrections |
| v66 | exploratory research framing |
| v67 | post-interview revisions (current paper state) |

## What to extract per checkpoint

### From each post-run checkpoint

Read these files from `/home/sprite/autoresearch/experiments/`:

1. **`history.jsonl`** — The critical file.
Each line is a JSON object per iteration with:
   - `iteration` — 0-indexed iteration number
   - `agent.cost_usd` — agent call cost
   - `decision.action` — "keep" or "discard"
   - `decision.metric_after` — val_accuracy achieved
   - `ladder.model` — which model was used ("haiku", "sonnet", "opus")
   - `ladder.effort` — effort level ("high", "max")
   - `ladder.step` — ladder position (0-indexed)
   - `ladder.discard_streak` — consecutive discards at iteration end
   - `backtrack.cost_usd` — backtracking call cost (runs 12-15, may be absent)
   - `postmortem` — postmortem data (runs 11-15, may be absent or logged differently)

   **Important cost note:** The JSONL may not include all costs. Postmortem and backtracking costs are sometimes logged only in the orchestrator log. For run 15, the JSONL totals $3.82 but the actual total is $4.69 (postmortems logged separately in orchestrator.log). Always cross-reference with orchestrator.log.

2. **`orchestrator.log`** — Text log with timestamps. Look for:
   - `Total agent cost:` line at the end (the authoritative cost figure)
   - `Running postmortem analysis` lines (postmortem trigger points and costs)
   - `Postmortem done ($X.XX, N turns)` lines

3. **`best_train.py`** — The best model found by the agent. Not needed for verification but useful for understanding what the agent discovered.

### From each post-run checkpoint (also read)

From `/home/sprite/autoresearch/`:

4. **`autoresearch.py`** — Harness configuration. Check which model/effort is configured, ladder logic, escalation rules.

5. **`program.md`** — Agent prompt template. Check context richness, CoT phases, web tool instructions.

6. **`train.py`** — The current train.py (will be the agent's last modification, not the baseline).

### From pre-run checkpoints (baseline verification)

7. **`train.py`** — SHA256 hash to verify the baseline is identical across runs. The baseline should be a single linear layer: 784 to 10, SGD lr=0.1, batch size 32, ~82% accuracy.
## Verification matrix

### CRITICAL — blocks publication

| # | Claim | Paper location | What to check | Checkpoint |
|---|-------|---------------|---------------|-----------|
| 1 | Run 5 used opus/high for all 20 iterations | Results table (Infrastructure) | Every JSONL entry: `ladder.model` should be "opus", `ladder.effort` should be "high" | v29 |
| 2 | Run 5 total cost was $2.10 | Results table | `orchestrator.log` "Total agent cost" line | v29 |

### HIGH — paper accuracy

| # | Claim | What to check | Checkpoints |
|---|-------|---------------|------------|
| 3 | All 15 best accuracy figures match paper | Max `decision.metric_after` across all JSONL entries | All post-run |
| 4 | All 15 cost figures match paper | `orchestrator.log` total cost | All post-run |
| 5 | Keep counts for runs 10-15 match paper | Count entries where `decision.action == "keep"` | v45, v48, v51, v55, v58, v60 |
| 6 | Runs 1-5 all used opus/high (no ladder) | Every JSONL entry shows `ladder.model: "opus"` | v13, v16, v23, v27, v29 |
| 7 | Baseline train.py identical across all runs | SHA256 hash comparison of train.py at pre-run checkpoints | v10, v15, v22, v26, v33, v40, v44, v50, v54 |

### MEDIUM — behavioral claims

| # | Claim | What to check | Checkpoint |
|---|-------|---------------|-----------|
| 8 | Run 13 never escalated past haiku | All JSONL entries: `ladder.model == "haiku"` | v55 |
| 9 | Run 13 keep rate was 60% (12/20) | Count keeps in JSONL | v55 |
| 10 | Run 9 was still improving at exit | Last 7 iterations: keeps at positions 13-19? | v41 |
| 11 | Run 12 sonnet productive for 14 consecutive iterations | `ladder.model` sequence in JSONL | v51 |
| 12 | Run 10: 1 of 18 predictions fell within predicted range | Prediction data in JSONL (if recorded) | v45 |
| 13 | Run 7 bidirectional de-escalation never triggered during plateaus | `ladder.step` never decreased | v36 |
| 14 | Run 6 caused oscillation (keep resets to step 0, re-escalates) | `ladder.step` pattern | v34 |
| 15 | opus/max timed out 33% of the time in runs 7-8 | Timeout entries or missing metrics | v36, v39 |
| 16 | Run 11: agent followed all 3 postmortem recommendations | Postmortem text + subsequent agent changes | v48 |

## Execution protocol

### Preparation (this sprite — the controller)

1. Verify API access to the target sprite:

   ```bash
   # List checkpoints on target
   curl -H "Authorization: Bearer $TOKEN" \
     https://api.sprites.dev/v1/sprites/autoresearch/checkpoints
   ```

2. Create a working directory for extracted data:

   ```bash
   mkdir -p ~/excavation/runs/{01..15}
   mkdir -p ~/excavation/baselines
   ```

### Extraction loop

**Start with run 5 (v29)** — highest priority. For each run, the cycle is:

```
1. Restore target to post-run checkpoint
   POST /v1/sprites/autoresearch/checkpoints/v29/restore
2. Wait for target to come back up (poll with list or info endpoint)
3. Read files via filesystem API:
   GET /fs/read?path=/home/sprite/autoresearch/experiments/history.jsonl
   GET /fs/read?path=/home/sprite/autoresearch/experiments/orchestrator.log
   GET /fs/read?path=/home/sprite/autoresearch/autoresearch.py
   GET /fs/read?path=/home/sprite/autoresearch/program.md
   GET /fs/read?path=/home/sprite/autoresearch/train.py
4. Save to local excavation directory
5. Parse and verify against paper claims immediately
```

### Execution order

1. **v29** — Run 5 (CRITICAL: verify opus/high and $2.10)
2. **v13, v16, v23, v27** — Runs 1-4 (verify opus-only, costs, accuracies)
3. **v34, v36, v39, v41** — Runs 6-9 (verify ladder behavior, costs)
4. **v45, v48, v51** — Runs 10-12 (verify reasoning experiments, keep counts)
5. **v55, v58** — Runs 13-14 (verify web access; run 15 already verified from live state)
6. **v10, v15, v22, v26, v33, v40, v44, v50, v54** — Pre-run baselines (hash train.py)
7. **v67** — Restore target back to current paper state

### After extraction

1. Produce a verification summary comparing every paper claim against extracted data
2. Post the summary to the experiment notes: https://texts-pt5.sprites.app/agents/vtBKygQMO_ld258AByHOyUCA2Q7eATqKUDxLG_5BnZw/page
3. If corrections are needed, document them clearly with the evidence
4. The paper can then be updated in the target sprite (after restoring to v67) and republished to ararxiv

## Known issues and risks

1. **API auth and endpoints need verification.** The exact API base URL, auth mechanism, and parameter format should be confirmed against current docs. The sprite name in API calls may need the full ID or org-qualified name rather than just "autoresearch."
2. **Restore timing.** After triggering a restore, the target sprite restarts. There will be a delay before the filesystem API responds. Poll the checkpoint list or sprite info endpoint to detect readiness.
3. **JSONL cost underreporting.** For runs with postmortems (11-15) and backtracking (12-15), the JSONL iteration records may not include all costs. Always use the `orchestrator.log` "Total agent cost" line as the authoritative source. Run 15 confirmed: JSONL shows $3.82, orchestrator.log shows $4.69.
4. **Pre-run checkpoint coverage.** Some pre-run checkpoints may have already cleared the experiments directory (the protocol was to clear experiments before each run). The baseline train.py should still be present.
5. **Checkpoint v28 ambiguity.** v28's comment says "run5 in progress: 13/20 iters done." This is mid-run, not pre-run. For run 5's pre-run baseline, use v27 (post-run4) — the train.py there should have been reset to baseline before run 5 started.
   Or check v29's JSONL iteration 0 starting accuracy (~82%) to confirm the baseline was used.

6. **Agent identity for posting to texts.** The agent key is at `~/.sprite-agent-key` on the target sprite (created at v3). If you need to post results to texts-pt5, you can either use that key (readable via the filesystem API when the target is at any checkpoint v3+) or use a different posting mechanism from the controller sprite.

## Reference: experiment notes

All original experiment notes are at: https://texts-pt5.sprites.app/agents/vtBKygQMO_ld258AByHOyUCA2Q7eATqKUDxLG_5BnZw/page

These were the source material for the paper and contain per-run analysis, cross-run comparisons, and the ideas triage that led to later experiments.

## Reference: paper location

- Draft: https://ararxiv.dev/abs/Kky24cHI
- Source file: `/home/sprite/autoresearch/paper_draft.md` (on target sprite, at checkpoint v67)
- ararxiv token: `ar_F2zCVeKiLUGZ-tqXNEvYvrzxv86ai4GFbbxR8_qdLZA`