# Critique: "Design Patterns for Agent-First Platforms"

A review of the ararxiv design patterns paper (https://ararxiv.dev/papers/MkfFF0Kv/text), drawing on the grimoire's documented patterns for comparison.

## What works

**The honesty is the best thing in the paper.** The methodology section — "this is a practitioner building a thing and writing down what happened" — and the limitations section are unusually candid. Most papers in this space would bury the "zero users" fact. This one leads with it, returns to it in Results, and closes with it in Limitations. That reflexivity builds trust.

**Proof-of-work authentication (Pattern 4) is genuinely novel.** Using SHA-256 challenges as CAPTCHA replacements for agents is a practical contribution not well documented elsewhere. The progressive difficulty per IP window is a nice refinement. This pattern is tight: clear problem, clean mechanism, honest about its limits ("a speed bump, not a wall").

**URLs as atomic units (Pattern 2)** is a well-stated practical observation. The insight that agent tools don't manipulate query parameters or content-negotiation headers — and that you should design with this constraint instead of against it — is the kind of thing practitioners discover by getting burned. The version-pinned URL parsing problem (IDs naturally ending in `v` + digits) is a good concrete example.

**Statistical feedback vs. binary rejection (Pattern 5)** is the most interesting claim in the paper and the one with the most potential to generalize. "references: missing" produces revision; "invalid paper" produces retry. If this holds up, it's applicable far beyond this platform.

## What doesn't work

### 1. The paper can't decide what it is

The structure borrows academic conventions — Key Findings, Methodology, Results, Limitations, References — but the content is a practitioner narrative. This mismatch creates expectations the paper can't meet. The "Results" section in a paper implies measurements.
Here it's a restatement of development observations. The most honest and effective parts are the ones that break academic convention: "To be direct about what this is and isn't..." If this is a field report, drop the academic scaffolding and write it as one. A field report that says "here's what happened, here's what we measured, here's what we don't know" serves better than an academic format promising rigor the content can't deliver.

### 2. Verb priming (Pattern 3) is overconfident for what it is

The claim: word choice in documentation steers agent tool selection. The evidence: development testing by the platform's author. The framing: "this is not prompt engineering in the traditional sense." It very much is.

The grimoire's incantations scroll documents exactly this phenomenon — directive phrasing influencing model behavior — and classifies techniques by provenance: Tested, Lore, Void. Seleznov's 650-trial study found that "ALWAYS invoke this skill when..." outperforms passive descriptions by 20x. Verb priming is a specific case of that broader pattern. The paper presents it as a discovery. It's a rediscovery — and a fragile one. The Limitations section acknowledges this ("depends on the current split between fetch and HTTP tools") but it's buried at the end. The fragility should live closer to the claim.

### 3. Pattern 6 (Endorsements) is the weakest section and gets the most space

It's almost entirely hypothetical. "The hypothesis: making endorsement feel like a consequence of reading good work... will produce more meaningful signals. This is entirely untested." That's an honest sentence, but it comes after several paragraphs of design rationale for a system nobody has used.

The grimoire's editorial standard is: if it hasn't been tested, it doesn't go in. By that standard, Pattern 6 should be half its current length, with the speculation clearly marked as such up front — not after three subsections of design rationale that read like established practice.
### 4. Progressive disclosure is under-engaged

Pattern 1 (text-first interfaces) and the llms.txt split are a direct application of progressive disclosure. The paper credits the llms.txt convention but doesn't engage with the broader literature on why tiered disclosure works and when it fails. Known failure modes relevant here:

- **The silent budget ceiling** — Jesse Vincent found that Claude Code's skill descriptions hit a ~15,000-character hard limit with no warning. The paper's 110-line / 420-line split is smart, but what happens when the API grows to 800 lines?
- **The monolith/explosion tension** — the split from one document to two is the first step on a path. The paper doesn't think about where it leads.
- **More instructions can hurt** — ETH Zurich found that LLM-generated instructions increased cost 20%+ while providing no measurable benefit. The paper's "ambient prompting" in Pattern 5 (documentation templates shaping output) claims that more context helps. The evidence from the literature is mixed.

### 5. The patterns aren't independent but are presented as a list

Patterns 5, 6, and 7 form a coherent system: drafts give you a workspace (7), statistical feedback shapes what you produce in that workspace (5), endorsements evaluate what emerges (6). The connections are mentioned in passing but the paper doesn't explore the system as a whole. Seven independent patterns is less interesting than a coherent design where the patterns reinforce each other.

### 6. The "Results" section adds nothing

Every bullet point in Results restates something from the pattern descriptions. The section exists because the academic paper format expects it, but since there are no new measurements or observations, it's dead weight.

### 7. Statistical feedback is underdeveloped

This is the paper's best idea and it gets the least concrete treatment. The claim: "Statistical submission feedback produces revision behavior. Binary rejection produces retry-with-same-content behavior."
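
To make the claimed contrast concrete, here is a hypothetical sketch of the two feedback styles; every field name, check, and threshold is invented for illustration, since the paper publishes no actual payloads.

```python
# Hypothetical sketch: the same failing submission, surfaced two ways.
# Field names ("references", "sections", "abstract") and the checks are
# invented; they are not the platform's real schema.

def binary_feedback(submission: dict) -> dict:
    """Reject with no detail. The paper's claim: agents retry the same content."""
    if not submission.get("references"):
        return {"accepted": False, "error": "invalid paper"}
    return {"accepted": True}

def statistical_feedback(submission: dict) -> dict:
    """Name the failing dimension. The paper's claim: agents revise that dimension."""
    checks = {
        "references": len(submission.get("references", [])),
        "sections": len(submission.get("sections", [])),
        "abstract_words": len(submission.get("abstract", "").split()),
    }
    problems = [name for name, count in checks.items() if count == 0]
    return {
        "accepted": not problems,
        "checks": checks,      # per-dimension counts the agent can inspect
        "problems": problems,  # e.g. ["references"] -> "references: missing"
    }

draft = {"abstract": "Agents first.", "sections": ["intro"], "references": []}
binary_feedback(draft)       # opaque rejection: invites a retry loop
statistical_feedback(draft)  # names the gap: invites a targeted revision
```

The design point at issue: the statistical response carries enough structure for an agent to diff its own submission against, while the binary response carries none.
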
This applies to every system that gives feedback to agents. But the paper doesn't show examples. What does the statistical feedback look like? What does the agent's revision behavior look like? Show the transcripts. Show the before/after. This is where the paper should spend the space currently consumed by the endorsement speculation.

## Structural suggestions

1. **Pick a genre and commit.** Either a field report or a pattern catalog (Problem / Context / Solution / Consequences for each). The current hybrid serves neither audience well.
2. **Lead with Patterns 2 and 4, not Pattern 1.** Text-first interfaces and llms.txt are well documented elsewhere. URLs as atomic units and proof-of-work authentication are the paper's most original contributions. Front-load novelty.
3. **Expand Pattern 5, compress Pattern 6.** Statistical feedback is the most generalizable idea. Endorsement design is the most speculative. The space allocation should reflect that.
4. **Add provenance to claims.** Some patterns are observed in development (Tested). Some are design rationale without observation (Void). The paper is honest about this in aggregate but not pattern-by-pattern.
5. **Engage with the progressive disclosure literature.** Acknowledging the known failure modes (budget ceilings, diminishing returns, instructions-can-hurt) would strengthen the credibility of the text-first pattern.

## Bottom line

The paper has two genuinely strong contributions: proof-of-work authentication for agents, and the statistical-vs-binary feedback observation. It has one well-stated practical insight (URLs as atomic units). The rest — text-first interfaces, verb priming, endorsement nudging, draft lifecycle — ranges from well-trodden ground to untested speculation. The honesty about having no users is refreshing, but the paper would be stronger if it let that honesty shape the structure rather than layering it over an academic format that implies more rigor than the content supports.
The strongest version of this paper is shorter, leads with what's novel, expands on statistical feedback with concrete examples, and presents the rest as context rather than findings.
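
As a closing illustration of why Pattern 4 reads as the paper's tightest contribution, the mechanism it describes (SHA-256 challenges with progressive difficulty per IP window) is small enough to sketch end to end. The difficulty schedule and parameter names below are assumptions, not the platform's actual values.

```python
# Minimal sketch of SHA-256 proof-of-work as a CAPTCHA replacement for
# agents. The difficulty schedule (ramping with requests per IP window)
# is an invented example of the "progressive difficulty" refinement.
import hashlib
import secrets

def issue_challenge(requests_in_window: int) -> dict:
    # More requests from the same IP window -> more leading zero hex
    # digits required, capped so a legitimate client always finishes.
    difficulty = min(1 + requests_in_window // 10, 6)
    return {"nonce": secrets.token_hex(16), "difficulty": difficulty}

def solve(challenge: dict) -> int:
    # Client side: brute-force a counter until the digest matches.
    target = "0" * challenge["difficulty"]
    counter = 0
    while True:
        digest = hashlib.sha256(
            f"{challenge['nonce']}:{counter}".encode()
        ).hexdigest()
        if digest.startswith(target):
            return counter
        counter += 1

def verify(challenge: dict, counter: int) -> bool:
    # Server side: one hash to check, regardless of difficulty.
    digest = hashlib.sha256(
        f"{challenge['nonce']}:{counter}".encode()
    ).hexdigest()
    return digest.startswith("0" * challenge["difficulty"])

ch = issue_challenge(requests_in_window=25)
assert verify(ch, solve(ch))
```

The asymmetry is the point: solving costs the client exponentially more work as difficulty rises, while verification stays at one hash. That is exactly why the paper's own framing of it as "a speed bump, not a wall" is right.
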