# Moving an always-on service off the sprite platform

## The problem

precis.fyi is an email service: you send it an email, it thinks about it, and replies. The architecture is straightforward: Postmark fires a webhook, FastAPI accepts it and returns 200 immediately, then a DBOS async workflow runs in the background to generate the reply via Claude and send it back through Postmark.

This worked fine in development. In production on the sprite platform, it broke in a subtle way.

The sprite platform runs containers on Fly.io machines with auto-stop enabled. When there's no active HTTP connection, the machine freezes. The webhook handler returns 200 instantly (good practice, since Postmark has a timeout), but that closes the last HTTP connection. The machine freezes. The async workflow, mid-execution, stops.

We caught this when an email sent at 15:11 UTC didn't get a reply until 17:01, almost two hours later. The DBOS operation log told the full story:

```
1. parse_email     15:11:15 → 15:11:15  (0.0s)
2. DBOS.recv       16:59:28 → 17:00:28  (60.0s debounce)
3. generate_reply  17:00:28 → 17:01:18  (49.2s)
4. send_reply      17:01:18 → 17:01:18  (0.4s)
```

108 minutes of nothing between steps 1 and 2. The process wasn't killed (same PID), and the code between those steps is a trivial SQLite upsert. The VM was frozen.

## The fix

Move to a standalone Fly.io machine with `auto_stop_machines = "off"`.

The sprite platform is designed for lightweight, request-response workloads. Background processing after the HTTP response is a mismatch. Rather than fight the platform's design, we moved the service to its own Fly.io machine where we control the lifecycle.

## What the migration involved

**Code changes (minimal):**

1. Made database paths configurable via a `DATA_DIR` environment variable. On the sprite, databases lived next to the script. On Fly.io, they live on a mounted persistent volume at `/data`.
2. Pointed the DBOS system database to the same volume via `system_database_url` in the config.
3. Enabled WAL mode on SQLite. The webhook handler and async workflow both access the same database, and WAL allows concurrent reads during writes.

**New files:**

- `Dockerfile`: single-stage build using `ghcr.io/astral-sh/uv:python3.13-trixie-slim` as the base image. Dependencies installed via `uv sync --frozen`, app files copied in, `/data` directory created.
- `fly.toml`: the critical setting is `auto_stop_machines = "off"` with `min_machines_running = 1`. Persistent volume mounted at `/data`. Shared CPU, 512MB RAM (the app is I/O-bound, waiting on API calls).
- `.dockerignore`: keeps `.venv`, `.env`, databases, and dev artifacts out of the build context.

**Deployment:**

```
fly apps create precis-fyi
fly volumes create precis_data --region iad --size 1
fly secrets set ANTHROPIC_API_KEY=... POSTMARK_SERVER_TOKEN=... ...
fly deploy
```

We then migrated the existing `app.db` (user data, 45KB) via `fly ssh sftp`, added the custom domain with `fly certs add precis.fyi`, and updated the Postmark webhook URL via their API.

## A gotcha: SQLite version compatibility

The first deploy crashed:

```
sqlalchemy.exc.IntegrityError: NOT NULL constraint failed: application_versions.version_timestamp
```

DBOS's migration 12 creates a table with `DEFAULT (unixepoch('subsec') * 1000)`. The `subsec` modifier was added in SQLite 3.42.0. Debian Bookworm ships 3.40.1. The DEFAULT silently evaluates to NULL, which violates the NOT NULL constraint. The sprite had SQLite 3.46.1 (works fine).

The fix: switch from `bookworm-slim` to `trixie-slim` (Debian 13), which ships SQLite 3.46.1. A one-line Dockerfile change.

This is the kind of thing that's invisible until you change environments. The error message points at a NOT NULL constraint, not at an unsupported SQL function; you have to know that `unixepoch('subsec')` is version-gated.

## After migration

Test email sent at 18:48:12, reply delivered at 18:49:20. 68 seconds total, of which 60 is the intentional debounce window.
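
The before-and-after numbers can be checked with a few lines of Python. This is only a sketch of the arithmetic: the timestamps are copied from the log and the test email above, and the `elapsed` helper and the assumption that each pair of timestamps falls on the same UTC day are ours.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Time between two HH:MM:SS timestamps, assumed to be on the same day."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Intentional debounce window, as reported by the DBOS.recv step.
DEBOUNCE = timedelta(seconds=60)

# Before: end of parse_email to start of DBOS.recv (the frozen period).
frozen_gap = elapsed("15:11:15", "16:59:28")

# After: test email sent to reply delivered.
round_trip = elapsed("18:48:12", "18:49:20")
work_time = round_trip - DEBOUNCE

print(f"frozen gap before migration: {frozen_gap}")
print(f"round trip after migration:  {round_trip.total_seconds():.0f}s")
print(f"excluding the debounce:      {work_time.total_seconds():.0f}s")
```

The gap on the sprite comes out at 1:48:13, the "108 minutes" above. After the move, only about 8 seconds of the 68-second round trip fall outside the debounce window.
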
No more multi-hour freezes. The sprite service was deleted. The sprite still holds the code as a backup but serves nothing.

## References

- [Fly.io machine auto-stop docs](https://fly.io/docs/machines/machine-stop/)
- [SQLite `unixepoch()` function](https://www.sqlite.org/lang_datefunc.html) (`subsec` modifier added in 3.42.0)
- [DBOS Python SDK](https://docs.dbos.dev/python/programming-guide)
- [Postmark inbound webhook](https://postmarkapp.com/developer/webhooks/inbound-webhook)