# DAMM Field Manual

For the operator on the ground, the friend two countries over, the future maintainer reading this cold. This is what's deployed, what was tried and didn't work, what we're demanding of each piece, the storyboards we're solving for, and the wider register of ideas — flagged honestly between "shipped", "partial", "speculative", and "abandoned."

## 0. The bedrock fact (don't break this)

**As of 2026-04-28**, one real device has been connected through DAMM for over 5 hours and moved **332 MB rx / 193 MB tx** of real traffic through the gateway. Every future change must demonstrably not break this. When in doubt, run `sudo wg show wg0 latest-handshakes` on hub2. An age past `2 × PersistentKeepalive` (50s) is the cue to keep watching; a healthy WireGuard session re-handshakes roughly every two minutes while keepalives flow, so if the age keeps climbing beyond that window without a fresh handshake, *something just broke* and recovery comes before the change.

This is the celebrated end-to-end success. Everything in the rest of this manual is scaffolding around that one fact, plus an honest map of what we don't yet have.
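
A minimal sketch of the handshake check from this section, assuming the output of `sudo wg show wg0 latest-handshakes` is piped into a Node script. The function name `staleHandshakes` and its parameters are illustrative and not part of the deployed tooling; the input format (one `<public-key>` and `<unix-seconds>` pair per line, `0` meaning "never") is what `wg` prints. The threshold is passed in rather than hard-coded so the operator chooses the tripwire.

```javascript
// Hypothetical helper: list peers whose last handshake is older than a threshold.
// wgOutput is the raw text of `wg show wg0 latest-handshakes`.
function staleHandshakes(wgOutput, thresholdSeconds, nowSeconds = Math.floor(Date.now() / 1000)) {
  const stale = [];
  for (const line of wgOutput.split("\n")) {
    const [publicKey, stamp] = line.trim().split(/\s+/);
    if (!publicKey || stamp === undefined) continue; // skip blank or malformed lines
    const last = Number(stamp);
    // A timestamp of 0 means the peer has never completed a handshake.
    const ageSeconds = last === 0 ? Infinity : nowSeconds - last;
    if (ageSeconds > thresholdSeconds) stale.push({ publicKey, ageSeconds });
  }
  return stale;
}
```

An empty result means every peer is inside the threshold; anything else is the list to investigate before touching the deployment.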

## How to use this manual

- **Onboarding a new operator**: read §1 (manifest), §3 (demands per concern), §4 (storyboards), then `bring-up-notebook.md` for hands-on.
- **Triaging a live incident**: read §0 (don't break this), `operational-runbook.md`, then this manual's §3 to see if the failing concern has a "shipped" or "open" answer.
- **Designing the next ship**: read §3 demands marked OPEN, §5 brainstorm filtered by `[partial]`, §8 next ship.
- **Steering the doc set itself**: `docs/INDEX.md` is the canonical map of which doc owns which concern; updating a doc without updating the index is how the doc set goes to seed.

## 1. The flow as it is manifest

Measured against the live deployment on `hub2` (Contabo VPS, public IP `149.102.137.139`), 2026-04-28:

| Surface | What it is | Measured |
|---|---|---|
| `https://damm.raindesk.dev/` | Static landing | 100ms p50 from a clean WAN |
| `https://damm.raindesk.dev/get/` | One-click wizard | 100–135ms p50 |
| `https://cp.damm.raindesk.dev/healthz` | CP liveness | p50=173ms / p99=303ms / max=386ms |
| `https://cp.damm.raindesk.dev/v1/catalog?region=eu-central` | Signed envelope | p50=123ms / p99=165ms |
| `https://cp.damm.raindesk.dev/v1/health/fleet` | Public summary | 110ms |
| `vpn.damm.raindesk.dev:51820` (UDP) | Gateway | One real peer, 5.4MB rx / 35.7MB tx since cold start |
| Enrollment throughput, in-process | `benchmark-enroll.js` | **195 enrollments/sec** at concurrency 6, 5.14ms avg |
| Control plane RSS | Steady-state | 60MB, ~0% CPU |
| state.json size | Post-trim | 48KB (was 78KB before trim) |

Seven systemd units: `damm-control-plane`, `wg-quick@wg0`, `damm-heartbeat.timer`, `damm-sync-peers.path`, `damm-zombie-sweeper.timer`, `damm-orchestrator.timer`, `caddy`. All `active`.

Tags shipped in the session that landed this manual: `v0.2.0` → `v0.2.4` → `v0.3.0` (orchestrator alive). Branch `postgres-backend-hardening`, source-of-truth on `hyle`.

## 2. Postmortem — what was tried and didn't pan out cleanly

Honest list. Each entry: what was attempted, why it failed, what the working solution turned out to be.

- **First wizard had inline `onclick="...\'cta-test\'..."`** — invalid JS once the HTML attribute parser got hold of it. *Fix:* dropped all inline handlers, attached listeners via `addEventListener` from the page's main script.
- **First sync-peers used `wg set ... preshared-key <(printf ...)`** — process substitution doesn't survive the sudo boundary; the `/dev/fd/63` path was meaningless to the elevated process. *Fix:* `mktemp` a tempfile, write PSK there, pass the path.
- **First env file had `VPN_CLIENT_DNS=1.1.1.1, 1.0.0.1`** — unquoted, so shell `source` in `heartbeat.sh` split the line at the space and tried to run `1.0.0.1` as a command. *Fix:* double-quote every value containing whitespace, `,`, `:`, `(`, or `/`. Both shell and systemd `EnvironmentFile=` honor double quotes.
- **First Caddy config used `try_files {path} /index.html`** — directory requests fell through to the root index instead of finding `/get/index.html`. *Fix:* `try_files {path} {path}/index.html /index.html`.
- **First seminar-commentary agent dispatched as `Explore` type** — failed because it could not write to a remote file; `Bash` not actually granted in that flow. *Fix:* re-dispatched as `general-purpose`. Lesson: read the agent's actual tool list, not the description.
- **Wizard's first hello card showed "Why is this here? · I've been here before · I want to read first"** as a `<summary>` — three concepts pretending to be three links, all opening the same paragraph stack. *Fix:* the honest cousin, a single label: "Tell me more before I click."
- **First Edit pass on sync-peers.sh** to fix the FD-leak bug only matched one of two indented blocks (different indentation levels). *Lesson:* when both branches of an if/else have the same shape, extract a helper, don't `replace_all`.
- **First reconciler timer noise:** ran every 30s, did 2× sudo `wg show` per tick → 240 sudo events/h (120 ticks × 2) just to confirm "no change." *Fix:* hash the peer-relevant slice of state.json, short-circuit when unchanged. Steady-state cost dropped to near zero.
- **First control-plane log was silent** — no per-request logging at all, only a "listening on..." line per restart. Reading the room from journalctl alone was guesswork. *Fix:* JSON-Lines per request emitted on response finish.
- **Background prober and inside-observer agents both stream-timed-out** mid-flight. Wrote partial logs that were still useful, but the seminar-commentary agent (longest one) wrote nothing on first dispatch. *Lesson:* parallel observer agents are a brittle channel for long-form output; a single stretched-out agent risks idle-timeout.

These are the ones worth remembering. Several follow the same pattern: the obvious-shaped solution had a hidden seam (sudo dropping FDs, three-shape disclosure summary, two-branch Edit), which only showed at runtime.
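
The tempfile fix for the PSK can be sketched as follows. This is an illustration in Node rather than the shell the real `sync-peers.sh` uses; `pskArgs` and the `damm-psk-` prefix are hypothetical names. The point is only that the elevated `wg set` receives a real filesystem path instead of a `/dev/fd/N` handle.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: write the PSK to a private temp file and build the
// argv for `wg set <iface> peer <key> preshared-key <file>`.
function pskArgs(iface, peerPublicKey, presharedKey) {
  // mkdtemp creates the directory with mode 0700, so only the owner can enter it.
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "damm-psk-"));
  const file = path.join(dir, "psk");
  fs.writeFileSync(file, presharedKey + "\n", { mode: 0o600 });
  const argv = ["wg", "set", iface, "peer", peerPublicKey, "preshared-key", file];
  // The caller runs cleanup() in a finally block once `wg set` has returned.
  const cleanup = () => fs.rmSync(dir, { recursive: true, force: true });
  return { argv, file, cleanup };
}
```

Returning the argv instead of executing it keeps the sketch testable without root; how the command is actually spawned under sudo is left to the caller.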

## 3. Compacted demands per design concern

For each concern, what we *demand* of a working solution, then either a *crisp solution* shipped or `OPEN — no working answer yet`.

### 3.1 First-visitor onboarding

**Demand:** A friend on any modern platform, with no DAMM context, gets connected in one click after landing, in under 30 seconds total.

**Solution shipped:** `https://damm.raindesk.dev/get/`. Auto-detects platform; one button (`Get me connected`); browser does X25519 keygen via WebCrypto; auto-issues install pass via `POST /v1/public/install-pass` (rate-limited; §3.14 has the current guards); enrolls; renders `.conf` with platform-specific import banner + download + copy. Visible debug log if anything fails. Verified working: a real device completed a handshake, 5.4MB rx / 35.7MB tx flowing.
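
The last wizard step, rendering the `.conf`, might look roughly like this. The field names on the input object, the full-tunnel `AllowedIPs`, and the keepalive of 25 are assumptions made for illustration; the real enrollment response shape is not documented here. Only the WireGuard config keys themselves are standard.

```javascript
// Hypothetical renderer: turn enrollment data into a WireGuard client config.
function renderConf({ privateKey, address, dns, serverPublicKey, presharedKey, endpoint }) {
  const lines = [
    "[Interface]",
    `PrivateKey = ${privateKey}`, // generated in the browser, never sent to the server
    `Address = ${address}`,
    `DNS = ${dns.join(", ")}`,
    "",
    "[Peer]",
    `PublicKey = ${serverPublicKey}`,
  ];
  if (presharedKey) lines.push(`PresharedKey = ${presharedKey}`);
  lines.push(
    `Endpoint = ${endpoint}`,
    "AllowedIPs = 0.0.0.0/0, ::/0", // assumption: route all traffic through the tunnel
    "PersistentKeepalive = 25"      // assumption: matches the 2 x 25s = 50s figure in §0
  );
  return lines.join("\n") + "\n";
}
```

The same string can feed the download link, the copy button, and any later QR rendering, which keeps the three outputs byte-identical.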

### 3.2 Resilient identity / state

**Demand:** No state corruption under normal operation. If `state.json` is wedged or lost, recovery is procedural and recorded.

**Solution shipped:** Atomic write-and-rename in the JSON store. `.bak` rotation on every commit. The runbook (§5–§6) documents recovery from corruption, from snapshot, and from total state loss. **Open:** logical state separation (today's `state.json` mixes durable and hot data). Premortem §5.1 calls this out; v0.4 candidate.
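
A minimal sketch of write-and-rename with `.bak` rotation, assuming a single writer process. `commitState` is a hypothetical name and the real JSON store may differ in details such as temp-file naming or directory fsync.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical commit: readers see either the old file or the new one, never a partial write.
function commitState(file, state) {
  const tmp = path.join(path.dirname(file), `.${path.basename(file)}.${process.pid}.tmp`);
  const fd = fs.openSync(tmp, "w", 0o600);
  try {
    fs.writeFileSync(fd, JSON.stringify(state));
    fs.fsyncSync(fd); // flush data to disk before the rename makes it visible
  } finally {
    fs.closeSync(fd);
  }
  // Keep the previous commit as .bak so recovery has something to fall back to.
  if (fs.existsSync(file)) fs.copyFileSync(file, file + ".bak");
  fs.renameSync(tmp, file); // atomic as long as tmp and file share a filesystem
}
```

The temp file lives in the same directory as the target on purpose: a rename across filesystems is not atomic.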

### 3.3 Per-request observability

**Demand:** Every request leaves a structured line in the log. The log can be tailed and grep'd to read the room without journalctl spelunking.

**Solution shipped:** JSON-Lines per request on `response.on("finish")`. Example: `{"t":"...","m":"GET","p":"/v1/catalog?region=eu-central","s":200,"ms":8,"ip":"..."}`. Logrotate rotates the file daily and keeps 7 rotations.
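
One way to produce that line shape with Node's built-in `http` module is sketched below. The keys (`t`, `m`, `p`, `s`, `ms`, `ip`) follow the example above; the helper names `requestLine` and `withRequestLog` are hypothetical, and taking the IP from the socket is an assumption (behind Caddy the client address would more likely come from a forwarded header).

```javascript
// Hypothetical formatter: one JSON object per finished request.
function requestLine(req, res, startedAtMs, nowMs = Date.now()) {
  return JSON.stringify({
    t: new Date(nowMs).toISOString(),
    m: req.method,
    p: req.url,
    s: res.statusCode,
    ms: nowMs - startedAtMs,
    ip: req.socket.remoteAddress, // assumption: direct socket address, not X-Forwarded-For
  });
}

// Wrap any (req, res) handler so a line is written when the response finishes.
function withRequestLog(handler, write = (line) => process.stdout.write(line + "\n")) {
  return (req, res) => {
    const startedAtMs = Date.now();
    res.on("finish", () => write(requestLine(req, res, startedAtMs)));
    handler(req, res);
  };
}

// Usage (not executed here):
// require("http").createServer(withRequestLog((req, res) => res.end("ok"))).listen(8080);
```

Because each line is a complete JSON object, `grep '"s":5'` or a `jq` filter reads the room without any parsing state.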

### 3.4 Not letting state grow without bound

**Demand:** Audit log + enrollment passes do not accumulate forever. The system reaps them.

**Solution shipped:** `damm-zombie-sweeper.timer` (every 5 min). Revokes devices that never handshaked after 15 min. Trims `enrollmentPasses` to active+unexpired only. Trims `adminAuditLog` to last 500 entries. First production run: 13 zombies revoked, passes 74 → 10, state.json 78KB → 48KB.
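
The three sweeper rules can be sketched as one pure function over the state object. `enrollmentPasses` and `adminAuditLog` are the names used above; every other field name (`devices`, `enrolledAt`, `lastHandshakeAt`, `status`, `expiresAt`) is an assumption about the state shape, not a description of the real `state.json`.

```javascript
const ZOMBIE_AFTER_MS = 15 * 60 * 1000; // devices that never handshake within 15 min
const AUDIT_KEEP = 500;                 // audit entries retained

// Hypothetical sweep: mutates state in place and reports what changed.
function sweep(state, nowMs = Date.now()) {
  let revoked = 0;
  for (const device of state.devices) {
    const neverHandshaked = !device.lastHandshakeAt;
    const tooOld = nowMs - device.enrolledAt > ZOMBIE_AFTER_MS;
    if (device.status === "active" && neverHandshaked && tooOld) {
      device.status = "revoked"; // revoked, not deleted, so the record stays auditable
      revoked += 1;
    }
  }
  // Keep only passes that are both active and unexpired.
  state.enrollmentPasses = state.enrollmentPasses.filter(
    (pass) => pass.status === "active" && pass.expiresAt > nowMs
  );
  // Keep the most recent AUDIT_KEEP audit entries.
  state.adminAuditLog = state.adminAuditLog.slice(-AUDIT_KEEP);
  return { revoked, passes: state.enrollmentPasses.length, audit: state.adminAuditLog.length };
}
```

Revoking changes a peer-relevant field, so a sweep that finds zombies would also trigger the reconciliation described in §3.5.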

### 3.5 Gateway → control-plane peer reconciliation

**Demand:** When a device is enrolled or revoked, wg0 reflects it within seconds, idempotently, with no spurious sudo cost on heartbeat-only writes.

**Solution shipped:** systemd path-watcher on `state.json`, fires `damm-sync-peers.service`. Hash short-circuit on the peer-relevant slice (publicKey, assignedIp, presharedKey, status). Heartbeat writes skip the wg-show cost. PSK applied via mktemp file (no process-sub across sudo).
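
The hash short-circuit can be sketched like this. The four hashed fields are the ones named above; the `devices` array and the helper names are assumptions. Anything outside the slice, such as a heartbeat timestamp, leaves the hash unchanged, which is what lets heartbeat-only writes skip the `wg show` calls.

```javascript
const crypto = require("crypto");

// Hypothetical: hash only the fields that affect the WireGuard peer table.
function peerSliceHash(state) {
  const slice = state.devices
    .map(({ publicKey, assignedIp, presharedKey, status }) => [publicKey, assignedIp, presharedKey, status])
    // Sort by public key so array order in state.json does not change the hash.
    .sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));
  return crypto.createHash("sha256").update(JSON.stringify(slice)).digest("hex");
}

// Compare against the hash stored after the previous successful sync.
function needsSync(state, lastHash) {
  const hash = peerSliceHash(state);
  return { changed: hash !== lastHash, hash };
}
```

The caller persists `hash` only after `wg` has been updated successfully, so a failed sync is retried on the next trigger.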

### 3.6 Public surface paranoia

**Demand:** ISPs scanning our IP range should not see WireGuard's UDP port answer.

**Solution OPEN.** Phase 0 leaves UDP/51820 listening unconditionally; the dark-by-default architecture (knock-to-open, codebook-rotated HMAC) is specced in the conversation but not deployed. Two-phase rollout proposed: knock endpoint live + firewall permissive for 24h observation, then default-drop. Code not yet written.

### 3.7 Polymorphic transport

**Demand:** When UDP/51820 is blocked or DPI'd, a client has at least one alternate transport that survives.

**Solution OPEN.** T0 (bare WG) is what's shipped; T1 (AmneziaWG) onward exists as a tradeoff matrix in `architecture-map.md`, not as code. The catalog format already supports per-frontdoor `transport` labels, so adding T1 is a deploy-and-publish step once a node runs AmneziaWG-compatible config.

### 3.8 Egress sequestering

**Demand:** Some flows route through an egress pool with isolated reputation; the rest go through the default. The user can declare which.

**Solution OPEN.** Today: one egress pool. Architecture provisions `quarantined-exodus` semantics; no second pool exists, no per-flow routing implemented. Phase 3 task.

### 3.9 Provider polyculture

**Demand:** Adding a new provider — Tailscale, Cloudflare, AWS, on-prem — is small, doesn't touch the rest of the system.

**Solution shipped (interface), open (live integrations).** v0.2.4 widened the Provider interface with `compute / ingress / egress / overlay / cdn_front / tunnel_term` capabilities. Adapters: Hetzner (live API), DigitalOcean (live API stub, no token), Cloudflare (cdn_front+tunnel_term, unconfigured), Tailscale (overlay, unconfigured). Contract tests assert every adapter survives `probeHealth()`. Live API integration beyond Hetzner: open.

### 3.10 Resilience-by-statistics

**Demand:** Score every path continuously. Demote degraded paths from the catalog. Promote recovered ones. Don't ask why a path is bad.

**Solution shipped (observation), open (mutation).** v0.3.0 ships `damm-orchestrator.timer` (60s tick): probes CP, scores each gateway from heartbeat-freshness × registration × active-status, writes `scores.json`, exposes `/v1/health/fleet` (public summary) and `/v1/admin/scores` (full). The catalog doesn't yet *read* `scores.json` to filter. That was pencilled in for v0.3.1 but is not among what shipped there (§7.5); it is now item 3 in §8.
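
A sketch of the product-of-factors score, under loud assumptions: the linear freshness decay, the `freshForMs` window, and the gateway field names are all invented for illustration. Only the multiplicative structure (freshness × registration × active-status) comes from the description above.

```javascript
// Hypothetical composite score in [0, 1] for one gateway.
function gatewayScore(gateway, nowMs, freshForMs = 90000) {
  const ageMs = nowMs - gateway.lastHeartbeatAt;
  // Assumption: freshness decays linearly from 1 (just heard) to 0 (older than freshForMs).
  const freshness = Math.max(0, Math.min(1, 1 - ageMs / freshForMs));
  const registered = gateway.registered ? 1 : 0;
  const active = gateway.status === "active" ? 1 : 0;
  // Multiplying means any single zero factor zeroes the whole score.
  return freshness * registered * active;
}
```

The product form matches the "don't ask why a path is bad" demand: a gateway that is unregistered, inactive, or silent scores zero regardless of the other factors.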

### 3.11 Profile under load and tune

**Demand:** Know what the system can take. Tune at least one thing per profile run.

**Solution shipped:** Profile run captured in §1. Tuned: audit-log + enrollment-pass trim in the sweeper. `state.json` shrank from 78KB to 48KB on first sweep. Per-enrollment cost (in-process) is 5.14ms; the public surface's 173ms p50 is dominated by TLS+WAN, not server work.

### 3.12 Captive portal navigator

**Demand:** Coffee-shop wifi makes you click "I agree" before TCP/UDP works. The wizard must recognize that state and say so; today it tries to fetch and fails opaquely.

**Solution OPEN.** Detect: catalog fetch returns HTML where JSON expected → surface "this network has a portal — open `http://example.com` first, click through, then come back." Two hours of work; not started.
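
The detection rule could be sketched as a pure classifier over whatever the wizard's fetch returned. `classifyCatalogResponse` and its three return labels are hypothetical; in a real browser, cross-origin rules may hide the body of an intercepted response, so this only covers the case where the body is readable.

```javascript
// Hypothetical classifier: "ok", "captive-portal", or "error".
function classifyCatalogResponse({ status, contentType = "", body = "" }) {
  const text = body.trimStart();
  // HTML where JSON was expected is the portal signature described above.
  const looksHtml = contentType.toLowerCase().includes("text/html") || text.startsWith("<");
  if (looksHtml) return "captive-portal";
  if (status < 200 || status > 299) return "error";
  try {
    JSON.parse(text);
    return "ok";
  } catch {
    return "error"; // not HTML, not valid JSON: a plain failure, not a portal
  }
}
```

A server-side HTML error page would also be classified as a portal here; telling the two apart is part of the unbuilt work.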

### 3.14 Admission posture — generous by default, watching for spikes

**Demand:** Mostly take on new clients/users/guests, provision keys for them, accommodate them. Trip only on signals that look like abuse from passers-by or hostile interference. Operator must be able to see the rolling state of admission at any time.

**Solution shipped (v0.3.1):** Three layered guards on `POST /v1/public/install-pass`:

| Guard | Default | What it sees | Tunable env |
|---|---|---|---|
| per-IP | 60 / hour (1/min sustained) | a single client looping the wizard | `VPN_PUBLIC_PASS_MAX` |
| per-network | 240 / hour (4/min sustained) | one /24 (IPv4) or /64 (IPv6) — handles a NAT'd household / school | `VPN_PUBLIC_PASS_CIDR_MAX` |
| global | 2400 / hour (40/min sustained) | fleet-wide ceiling — trips on real abuse, never on organic traffic | `VPN_PUBLIC_PASS_GLOBAL_MAX` |

A friend retrying a few times sails through. A small group of friends behind shared NAT all get keys. A single IP scripting 100/h trips the per-IP guard cleanly. A /24 scanning at scale trips the per-network guard before reaching the global. The 429 response names which scope tripped (`{ "scope": "ip|network|global", "resetAt": "..." }`) so the wizard can surface a useful message.

Operator visibility: `GET /v1/admin/admission` returns the rolling counters for all three guards plus the top buckets by count. Pair it with the structured request log to triangulate.
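
The layering can be sketched as three counters checked in order, narrowest first. This is an illustration under assumptions: it uses fixed one-hour windows (the deployed limiter may use a sliding window), `createAdmission` and `networkKey` are hypothetical names, and the caps are passed in by the caller rather than read from the env vars above.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// /24 for IPv4, /64 for IPv6 (first four hextets after expanding "::").
function networkKey(ip) {
  if (ip.startsWith("::ffff:") && ip.includes(".")) ip = ip.slice(7); // IPv4-mapped
  if (!ip.includes(":")) return ip.split(".").slice(0, 3).join(".") + ".0/24";
  const [head, tail = ""] = ip.split("::");
  const h = head ? head.split(":") : [];
  const t = tail ? tail.split(":") : [];
  const full = ip.includes("::") ? [...h, ...Array(8 - h.length - t.length).fill("0"), ...t] : h;
  return full.slice(0, 4).map((x) => parseInt(x, 16).toString(16)).join(":") + "::/64";
}

function createAdmission(limits, nowFn = Date.now) {
  const scopes = { ip: new Map(), network: new Map(), global: new Map() };

  function bucket(scope, key, now) {
    let b = scopes[scope].get(key);
    if (!b || now >= b.resetAt) { // start a fresh window
      b = { count: 0, resetAt: now + HOUR_MS };
      scopes[scope].set(key, b);
    }
    return b;
  }

  function admit(ip) {
    const now = nowFn();
    const checks = [["ip", ip], ["network", networkKey(ip)], ["global", "*"]]
      .map(([scope, key]) => ({ scope, b: bucket(scope, key, now) }));
    const tripped = checks.find(({ scope, b }) => b.count >= limits[scope]);
    if (tripped) {
      // Mirrors the 429 body: which scope tripped and when it resets.
      return { ok: false, status: 429, scope: tripped.scope, resetAt: new Date(tripped.b.resetAt).toISOString() };
    }
    for (const { b } of checks) b.count += 1; // only admitted requests are counted
    return { ok: true };
  }

  // Top buckets per scope, for an admin view.
  function snapshot(top = 5) {
    const out = {};
    for (const [scope, map] of Object.entries(scopes)) {
      out[scope] = [...map].map(([key, b]) => ({ key, count: b.count, resetAt: b.resetAt }))
        .sort((a, b) => b.count - a.count).slice(0, top);
    }
    return out;
  }

  return { admit, snapshot };
}
```

Counting only admitted requests is a design choice in this sketch: a single looping IP that is already blocked does not push its whole /24 toward the network cap.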

**Open:** Adaptive thresholds (auto-bump caps when the rolling baseline shifts). Anomaly detection for "all enrollments from this cluster never handshake" — currently the zombie sweeper detects this at 15-min granularity but doesn't feed back into admission. Honeypot pass (mint a pass not advertised on the public surface; if it gets used, definitively abuse — pause public install-pass for the source network).

### 3.13 Phone-to-phone mesh fallback

**Solution OPEN. No design has been seen to work end-to-end** at consumer-phone scale on stock OSes. Briar uses Tor; Bridgefy uses BLE; both have specific compromises. We don't claim one. See §5 brainstorm.

## 4. Storyboards — the field, six panels each

Concrete scenarios for training. Each: the user, the situation, what they see, what the system does, where it fails, what we'd add to close the gap.

### Story A — Friend on a hostile network

**Panel 1.** Anya, in a country where the standard ISPs block "VPN-shaped" UDP. She's on her iPhone, on hotel wifi.
**Panel 2.** I send her `https://damm.raindesk.dev/get/`. Page loads (HTTPS works).
**Panel 3.** She taps "Get me connected." Wizard does its three steps, hands her a `.conf` and a banner: open WireGuard, +, Create from file.
**Panel 4.** She does it. Imports. Toggles on.
**Panel 5.** WireGuard handshake fails (UDP/51820 silently dropped at the ISP). The wizard's "Test my connection" times out cleanly, showing "We don't see you yet — common causes are X, Y."
**Panel 6.** **What we'd ship:** a T1 (AmneziaWG) frontdoor on the same gateway, advertised in the catalog. The wizard would offer "try a heavier-disguise version" without re-enrollment.
**Lesson:** T0 alone is not enough for half the field cases. T1 is the single highest-ROI ship.

### Story B — Two friends, shared local Wi-Fi, upstream lost

**Panel 1.** Both Ben and Cris are DAMM-enrolled. They're in the same room, same Wi-Fi, but the upstream uplink just died.
**Panel 2.** Their tunnels go cold. WireGuard `latest-handshakes` ages out.
**Panel 3.** Both clients still have valid configs, valid keys, valid peer info — none of which is reachable, because the gateway is on the other side of the dead uplink.
**Panel 4.** They could, in principle, talk *directly* over the local Wi-Fi (mDNS-discoverable peers, same /24 subnet). WireGuard doesn't auto-discover this.
**Panel 5.** Without DAMM extending here, they can't.
**Panel 6.** **What would close it:** a DAMM client-side mesh module that, on detecting upstream loss, broadcasts a small mDNS service `_damm._tcp.local`, peers with other DAMM clients on the same LAN, and routes traffic between them via a side WireGuard interface keyed from a local handshake. Adjacent: Briar's BLE flooding model. **No design we've validated end-to-end.** Open question with serious primitive work behind it. Marked speculative.

### Story C — Contested bridge, sub-second decisions

**Panel 1.** Dani is on a network with active DPI that fingerprints WireGuard's handshake within 200ms and RST-injects the connection.
**Panel 2.** They open `damm.raindesk.dev/get/`, complete the wizard, get a `.conf`, import.
**Panel 3.** Toggle on. Handshake initiation goes out. DPI pattern-matches. RST. Tunnel never establishes.
**Panel 4.** Wizard's connection test reports failure. Dani has 30 seconds before they're in a different room.
**Panel 5.** They tap "try a heavier transport" (DOESN'T EXIST YET). System would re-issue a T3 (WSS-fronted) configuration, valid for 24 hours.
**Panel 6.** Latency cost of T3: +60–120ms per RTT. Throughput cost: ~25%. Acceptable for messaging-scale traffic, painful for video. **Lesson:** the tier ladder is the user-facing handle; we need it both as catalog-side advertising AND as a button in the wizard.

### Story D — LLM compute sharing across the trust mesh

**Panel 1.** Eli has a slow laptop. Friend Fran has a 4090 running Ollama with `qwen2:32b`. Both DAMM-enrolled, same tunnel-network.
**Panel 2.** Eli wants to summarize a long doc. Local model can't.
**Panel 3.** Eli's app POSTs to `https://infer.damm.local/v1/messages` (resolves only inside the tunnel). Catalog tells the client this hostname maps to a fabric endpoint Fran exposed.
**Panel 4.** CP routes the request into the supervisor lattice. Picks Fran's endpoint based on the score loop (low latency, available capacity).
**Panel 5.** Stream comes back through the tunnel. Eli's app sees a normal API response.
**Panel 6.** **What we'd ship:** the entire "ululation" stack from the prior conversation. Not built. Spec exists in conversation; protocol document open. Marked speculative — two-month build.

### Story E — Topological churn under active censorship

**Panel 1.** A region's reachability landscape shifts every 6 hours: the censor experiments with new TLS fingerprint policies. T0 success rate falls 99% → 40% in an hour. T1 stays at 95%.
**Panel 2.** The orchestrator's score loop sees the drop. Composite score for the affected gateway falls below 0.5.
**Panel 3.** Catalog rebuild *demotes* the affected frontdoor. Newly enrolling clients get T1-first.
**Panel 4.** Existing clients on T0 keep their config, get a `freshness` warning on next catalog refresh, prompt to "switch to a heavier transport."
**Panel 5.** **What's shipped:** scoring loop produces the data (`scores.json`). **What's open:** catalog read of `scores.json` (§8, item 3), client-side prompt to switch.
**Panel 6.** **Lesson:** the resilience claim hinges on the scoring → catalog → client loop closing. Today, only the first arrow is drawn.
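
The missing second arrow (scores into catalog) might be as small as the filter below. The 0.5 cutoff is the one used in Panel 2; the `gatewayId` field, the shape of the scores object, and the function name are assumptions, since the layout of `scores.json` is not shown in this manual.

```javascript
// Hypothetical filter: drop frontdoors whose gateway scores below the cutoff.
function filterFrontdoors(frontdoors, scores, minScore = 0.5) {
  const kept = frontdoors.filter((frontdoor) => {
    const score = scores[frontdoor.gatewayId];
    // Gateways the orchestrator has not scored yet stay in the catalog.
    return score === undefined || score >= minScore;
  });
  // Never publish an empty catalog: a degraded path beats no path.
  return kept.length > 0 ? kept : frontdoors;
}
```

Falling back to the unfiltered list when everything scores low is a deliberate choice in this sketch; the alternative strands every client at once.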

### Story G — Burst from a single /24, suspiciously

**Panel 1.** Over 10 minutes, 80 install-pass requests arrive from `203.0.113.0/24`. UAs vary but they share a timing pattern: every 7±0.5 seconds.
**Panel 2.** First 6 from `203.0.113.42` succeed. The 7th hits the per-IP guard: 429 `{"scope":"ip", ...}`.
**Panel 3.** Next 30 across other IPs in the /24 succeed. The 31st hits the per-network guard: 429 `{"scope":"network", ...}`.
**Panel 4.** Operator notices, hits `GET /v1/admin/admission`. Sees `perNetwork.top: [{key: "203.0.113.0/24", count: 30, ...}]` — exactly one bucket dominating. Cross-checks the structured request log: 30 enrollments, zero handshakes.
**Panel 5.** Operator either: lifts the cap (legitimate but unusual usage), narrows the cap (`VPN_PUBLIC_PASS_CIDR_MAX=10`), or fires a manual block via `iptables` against the /24. The system has given them the data to choose.
**Panel 6.** **Lesson:** the guards aren't there to be aggressive. They're there to *create the moment* where an operator can see what's happening clearly and decide.

### Story F — Border-traversal under time pressure

**Panel 1.** Geoff has 30 seconds at a wifi hotspot before being moved on.
**Panel 2.** Opens `damm.raindesk.dev/get/`. Taps. Gets a config.
**Panel 3.** Hits "Show QR" (NOT YET SHIPPED — Geoff currently must download the file).
**Panel 4.** Screenshots the QR / saves the file. Walks away.
**Panel 5.** Later, on a different network, he imports the saved config into WireGuard. Tunnel comes up.
**Panel 6.** **What we'd ship:** QR rendering. ~10KB JS lib (`qrcode-svg` or Project Nayuki's pure-JS encoder), inlined into the wizard. Two hours of work.

## 5. Brainstorm register

Wide net. Each entry tagged for honesty.

- **Port-knocking + per-device codebook** — `[partial]`. Spec written in conversation; HMAC-signed knock packets validated against per-device codebook entries; iptables-default-drop with knock-opens-port-for-source-IP-for-5-min. Implementation 1–2 days.
- **Catalog-filter-by-score** — `[shipped data, partial integration]`. `scores.json` written every 60s; CP doesn't yet read it in `lib/catalog.js`. ~30 lines.
- **DigitalOcean live integration** — `[stub-shipped, real-open]`. Adapter class exists, declares correct capabilities, no API token configured. One env var + a smoke run = real.
- **Cloudflare front-doors (T3)** — `[adapter stub, deployment open]`. CF tunnel + DNS-only proxy = HTTPS-shaped front for our gateway. Real work: a Worker or `cloudflared` config + a catalog entry referencing the CF hostname.
- **Tailscale overlay membership** — `[adapter stub, deployment open]`. `tailscale up --authkey=...` on each gateway; gateway gains 100.x.y.z address; clients with Tailscale see internal route.
- **AmneziaWG (T1)** — `[real, deferred]`. AmneziaWG is a fork of WG with junk-packet + magic-header rewrite. Defeats off-the-shelf DPI. Same `.conf` format with extra params. Single highest-ROI ship for hostile-network friends.
- **In-browser QR encoder** — `[easy, deferred]`. Inline 10KB lib. Wizard renders QR alongside download/copy.
- **iOS WireGuard deep-link** — `[partial, format-known]`. iOS WG accepts `wireguard://import-tunnel?url=https://...` for HOSTED configs. Hosting `.conf` files defeats the privacy story (private key transits). Workable: short-lived pre-shared-symmetric encryption of the config, link contains the URL + decryption key in fragment, server never sees the key.
- **OSI stack collapsing** — `[speculative]`. Brainstorm reading: WireGuard-in-DNS or WireGuard-in-ICMP for bridges that pass only those. Real cost: throughput collapse to 5–20kbps. Niche tool, not a default.
- **Protocol profile inversion** — `[speculative]`. Make the WG handshake's first packet look like SSH-2 banner, or QUIC initial, or DTLS ClientHello. Implementations exist (e.g., `wstunnel`, `udp2raw`). Effective for a season; eventually fingerprinted. Wins as a tier above T1, below T3.
- **Super-light crypto for evasion** — `[abandoned]`. WG's ChaCha20-Poly1305 is already cheap (single-digit microseconds per packet). Shaving cycles via SipHash or ChaCha8 buys nothing real and weakens the cryptographic floor. Don't.
- **TSP-approximation pathfinder over shifting topology** — `[speculative, paper-only]`. Frame the route-selection problem as a generalized TSP where edge weights are score-loop outputs, edges drop in/out hourly. Approximation algorithms exist (Christofides, LKH). The hard part isn't the solver; it's that each client needs to reason locally with the same algorithm to fail over without round-tripping the CP. Open problem.
- **Captive-portal navigator** — `[easy, unbuilt]`. Detect: a fetch returns HTML where JSON expected → wizard surfaces "open `http://example.com` first." Two hours.
- **State split (inventory / devices / events / signing)** — `[premortem demand, unbuilt]`. The sanity-check called this out. v0.4 candidate.
- **Multi-control-plane** — `[speculative]`. Two CPs sharing a Postgres, both serving the same catalogs, distributed ticket signing if we go to admission-tickets. Real but multi-week.
- **Fabric-as-supervisor-lattice (LLM ululation)** — `[speculative]`. Two-month build. Real architectural shape; not a side-quest. Documented in conversation; protocol spec open.
- **mDNS phone-to-phone mesh** — `[speculative, OS-hostile]`. iOS sandboxes mDNS, Android more permissive. End-to-end via stock apps: not validated. Briar's model (Tor + BLE) is the credible reference.
- **Per-flow sequester routing via fwmark** — `[design-only]`. `iptables` + `ip rule fwmark` based egress selection. Code shape: clear. Not written.
- **Trustless catalogs via signed-blob-on-public-storage** — `[abandoned]`. Catalogs need freshness + signing key continuity. Public-storage approaches add latency and don't solve revocation. Stick with HTTPS + signed envelope.
- **Heartbeat-free liveness via WG transfer counters** — `[real, partial]`. The gateway already knows whether each peer's transfer counter is incrementing. Use that as the liveness signal instead of, or alongside, the gateway-pushes-heartbeat-to-CP loop. Simplification opportunity.
- **Bring-your-own-host adapter** — `[capability slot exists]`. A friend's home box. Static-host adapter declares `["ingress", "egress"]`, no compute API, hand-provisioned. Heartbeat from the box; orchestrator includes/excludes via score.

The terse-pragmatic ones, separated out:

- AmneziaWG as T1 — *single biggest ROI*
- Catalog-filter-by-score — *closes the resilience loop*
- QR encoder in wizard — *closes the iOS field-traversal story*
- Captive-portal navigator — *fixes coffee-shop death*

The speculative-but-interesting ones:

- Phone-to-phone mesh (open architectural problem)
- Topology TSP (open math problem)
- Fabric-as-lattice (multi-month engineering problem)

The abandon list:

- Super-light crypto (false economy)
- Trustless catalogs (revocation latency too high)
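
The transfer-counter liveness idea from the register can be sketched by diffing two readings of `wg show wg0 transfer`, which prints one `<public-key>`, `<rx-bytes>`, `<tx-bytes>` line per peer. The helper names are hypothetical, and treating "rx increased" as "alive" is the assumption being illustrated, not a validated rule.

```javascript
// Parse `wg show <iface> transfer` output into a Map keyed by public key.
function parseTransfer(text) {
  const peers = new Map();
  for (const line of text.split("\n")) {
    const [key, rx, tx] = line.trim().split(/\s+/);
    if (key && tx !== undefined) peers.set(key, { rx: Number(rx), tx: Number(tx) });
  }
  return peers;
}

// Hypothetical: a peer is live if the gateway received more bytes from it
// since the previous reading.
function livePeers(previousText, currentText) {
  const previous = parseTransfer(previousText);
  const live = [];
  for (const [key, now] of parseTransfer(currentText)) {
    const before = previous.get(key);
    if (before && now.rx > before.rx) live.push(key);
  }
  return live;
}
```

Peers that appear only in the second reading are skipped until there is a baseline to compare against.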

## 6. Profile data, raw

Run on `hub2`, 2026-04-28, against the live deployment. Captured into the runbook for repeatable comparison.

```
Enrollment throughput (in-process, scripts/benchmark-enroll.js)
  count           60
  concurrency     6
  total           308.18ms
  per-second      194.69
  avg-per-enroll  5.14ms
  cpu user        420.91ms
  cpu system      204.40ms
  peak rss        69.09 MiB
  heap before     7.18 MiB
  heap after      9.57 MiB

GET /healthz via WAN+Caddy+TLS
  count   60
  conc    6
  min     121ms     (just TLS resumption)
  p50     173ms
  p90     227ms
  p99     303ms
  max     386ms

GET /v1/catalog?region=eu-central via WAN+Caddy+TLS
  count   40
  conc    4
  min     101ms
  p50     123ms     (faster than healthz — keepalive amortized signing cost across calls)
  p90     144ms
  p99     165ms
  max     175ms

CP process steady-state
  RSS     60 MB
  CPU     0.0% (idle)
  Uptime  ~8h before this profile

state.json
  before trim    78 KB     (181 audit, 74 passes, 18 devices)
  after  trim    48 KB     (182 audit, 10 passes, 18 devices)
  delta          -39%
```

The number that constrains us today is not throughput but **TLS+WAN handshake latency**: 100ms+ per request from a clean cold connection. Server-side work is lost in the noise (5.14ms per enrollment in-process against a 173ms p50 on `/healthz`). The obvious tuning levers are already pulled: Caddy keeps HTTP/2 connections alive, and the browser reuses one connection across the wizard's three calls.
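
For repeatable comparison, the summary rows above can be recomputed from raw samples with a few lines. This sketch uses the nearest-rank percentile definition, which is an assumption; the original benchmark scripts may interpolate differently, so small differences at p90/p99 are expected.

```javascript
// Nearest-rank percentile on an ascending array of millisecond samples.
function percentile(sortedMs, p) {
  const rank = Math.ceil((p * sortedMs.length) / 100);
  return sortedMs[Math.max(0, rank - 1)];
}

// Summarize latency samples in the same shape as the tables above.
function summarize(samplesMs) {
  const s = [...samplesMs].sort((a, b) => a - b);
  return {
    count: s.length,
    min: s[0],
    p50: percentile(s, 50),
    p90: percentile(s, 90),
    p99: percentile(s, 99),
    max: s[s.length - 1],
  };
}

// Operations per second from a count and a total wall time in milliseconds.
function throughput(count, totalMs) {
  return count / (totalMs / 1000);
}
```

As a consistency check on the enrollment run: `throughput(60, 308.18)` comes out at about 194.69 per second, and 308.18 / 60 is about 5.14ms per enrollment, matching the reported rows.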

## 7. Tuning applied

Single tune from this profile: **audit-log and enrollment-pass trim**, added to the existing `damm-zombie-sweeper` (5-min timer). Keeps the last 500 audit entries and only valid+unexpired passes. State.json went from 78KB → 48KB on first sweep. Long-term, this prevents the worst Phase-0 forecast (state.json reaches multi-MB size and every heartbeat rewrites the whole file).

Not yet tuned but profile-data informs:

- Caddy `health_uri /healthz` → fail-fast on hung CP (sanity-check §C.4 demand).
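- HTTP/2 server push from `/get/` to pre-fetch `/v1/health/fleet` and the next-step API endpoints (cuts the multi-call cold latency). Caveat: current Chrome and Firefox releases no longer accept server push, so `<link rel="preload">` or `103 Early Hints` is the likelier vehicle for the same idea.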
- Pre-sign the catalog and serve from cache when state hasn't changed (we sign per request today; state changes every ~30s due to heartbeats — but we could sign once between heartbeats and serve N times).
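
The pre-sign idea from the last item can be sketched as a one-entry cache keyed by a state version. Everything here is illustrative: `createCatalogCache` is a hypothetical name, the `version` could be a hash or mtime of the catalog-relevant state, and the Ed25519 signer stands in for whatever envelope scheme the control plane actually uses.

```javascript
const crypto = require("crypto");

// Hypothetical cache: sign once per state version, serve the same envelope until it changes.
function createCatalogCache(signFn) {
  let cached = null; // { version, envelope }
  return function getEnvelope(catalog, version) {
    if (cached && cached.version === version) return cached.envelope;
    const payload = JSON.stringify(catalog);
    const envelope = { payload, signature: signFn(payload) };
    cached = { version, envelope };
    return envelope;
  };
}

// Example signer (assumption): Ed25519 via Node's built-in crypto module.
const { privateKey, publicKey } = crypto.generateKeyPairSync("ed25519");
const sign = (payload) => crypto.sign(null, Buffer.from(payload), privateKey).toString("base64");
```

A real version would keep one entry per region, since the catalog endpoint takes a `region` query parameter.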

## 7.5 What just shipped (v0.3.1)

- **Three-layered admission guards** (per-IP, per-network /24-or-/64, global) on `/v1/public/install-pass`. Generous defaults: 6/30/600 per hour. The posture: take on new connections every few minutes without breaking a sweat; trip cleanly on patterns that look wrong.
- **`GET /v1/admin/admission`** — operator visibility into the rolling admission state. The 429 response names which scope tripped, so the wizard can surface a useful message.
- **`rate-limit.snapshot()`** — the limiter exposes its top buckets so the admin endpoint can render them.
- **Audit-log + enrollment-pass trim** rolled into the existing zombie sweeper. State.json went 78KB → 48KB on first sweep.
- **Story G** added to the storyboards: a suspicious burst from a single /24, walking through what the system shows the operator.
- **`docs/INDEX.md`** added so the doc set actually steers development.

The celebrated path was verified intact after the deploy: handshake age 19s post-restart, 332MB / 193MB still moving.

## 8. The next ship

In priority order, weighted by the storyboards above:

1. **AmneziaWG (T1) frontdoor** alongside T0. Closes Story A. 1 day work + 1 day field testing.
2. **QR encoder in the wizard.** Closes Story F. 2 hours.
3. **Catalog-filter-by-score.** Closes Story E. ~30 lines in `lib/catalog.js`.
4. **Captive-portal navigator.** Closes the coffee-shop death case. 2 hours.
5. **DigitalOcean live integration.** Validates the coordination-layer abstraction across two providers. 1 day with a token.

Everything else in the brainstorm register stays in the brainstorm register until we ship a clean v0.4.

---

Read this manual cold and you should be able to:

- run the live deployment and recover it using the shipped solutions in §3
- diagnose a field failure against the storyboards in §4 and know whether the gap is shipped, partial, or speculative
- look at §5 and pick a plausibly-shippable thing without confusing it with a research project
- profile the system tomorrow with the same commands and compare numbers honestly

The honest closing line: this is a small VPN that works for one user, with 332 MB rx / 193 MB tx of real traffic behind it (§0). Everything else is shipped scaffolding around that one fact, plus an honest map of what we don't yet have.
