# Sanity Check — 2026-04-28

A direct read of the system as it stands at v0.2.0, marking what is load-bearing-but-shouldn't-be, what needs sharp API boundaries drawn, and where our several pictures of the architecture are talking past each other. No hedging. The point is to flush ambiguity before it ossifies.

## Where we are

A working VPN with a one-click onboarding wizard. Users can land on `damm.raindesk.dev/get/`, tap one button, get a `.conf`, import it into WireGuard, and have traffic exit through Germany. We have one gateway, one egress, one transport tier, one control plane, one machine. The architecture documents describe a system many times more general than what's deployed. That gap is fine for v0.2.0; it is a problem if it persists into v0.5.

The shape of the deployed system has quietly grown out of bash scripts, systemd path watchers, and atomic JSON rewrites. None of it is wrong. Most of it is half a step from being wrong. This document marks exactly which steps.

---

## Components, contracts, and the work each one needs

### 1. Wizard (`scripts/site/render-get-flow.js`)

**What it is now**: a single-file, 600-line render function that returns the whole page as one HTML string, with all CSS and JS inline, called at build time by `build-site.js`.

**What's right**: zero runtime dependencies; no framework; all behavior visible in one place; visible debug log; no inline onclick handlers.

**What's wrong**:
- The renderer is a JavaScript function returning a JavaScript string containing a JavaScript-in-HTML payload. Three escaping levels. The escape bugs we already hit (the `\\'cta-test\\'` invalid-JS-in-HTML attribute) will happen again. *Refactor*: split the payload into a static `wizard.js` file the build copies as-is, no template-literal interpolation. The HTML can still be templated but the JS is its own file with one line at the top: `const CONFIG = {...};` written by the build.
- WebCrypto X25519 capability is checked via "the algorithm name was rejected" string matching. *Refactor*: feature-detect once at page load, gate the entire flow behind it; never present the "Get me connected" button on browsers without support — show the supported-browsers list directly.
- The "test my connection" poll only checks `/v1/whoami`'s `gatewayMatch`. It can't distinguish "tunnel up but client isn't routing through it" from "tunnel down". *Refactor*: also display the public IP we observe, alongside the user's pre-tunnel IP captured at page-load. They can compare with their eyes.
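The feature-detect refactor above could look roughly like this. It is a sketch, not the wizard's current code: the element ids `cta-connect` and `supported-browsers` are placeholders, and `subtle` is passed in so the check can run outside a browser. The detection attempts one X25519 key generation and treats any rejection as "unsupported", with no matching on error text.

```javascript
// Sketch: detect X25519 support once by attempting a key generation.
// `subtle` is normally `crypto.subtle`; it is injected here for testability.
async function supportsX25519(subtle) {
  if (!subtle || typeof subtle.generateKey !== 'function') return false;
  try {
    await subtle.generateKey({ name: 'X25519' }, true, ['deriveBits']);
    return true;
  } catch (err) {
    // Any rejection counts as unsupported; err.message is never inspected.
    return false;
  }
}

// Hypothetical page-load wiring. Element ids are placeholders.
async function gateWizard(doc, subtle) {
  const ok = await supportsX25519(subtle);
  doc.getElementById('cta-connect').hidden = !ok;         // hide the button if unsupported
  doc.getElementById('supported-browsers').hidden = ok;   // show the list instead
  return ok;
}
```

In the page this would run once as `gateWizard(document, crypto.subtle)` before any other wizard code touches the DOM.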

**Sharp boundary**: the wizard talks to the CP only via JSON over HTTPS, three calls in order: `POST /v1/public/install-pass`, `POST /v1/devices/enroll`, `GET /v1/whoami`. No other CP route is visible to the wizard. This boundary is clean today; pin it by writing a tiny TypeScript-style type comment at the top of the wizard JS describing the request and response shapes.
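One way to pin this boundary at the top of the wizard JS, as a sketch. The typedefs list only fields this document names (`region`, `deviceName`, `gatewayMatch`), and their types are assumptions; the real shapes belong in `system-spec.md`. `assertWizardRoute` is a hypothetical helper, not existing code.

```javascript
/**
 * Wizard <-> CP contract. Only fields named in this document are listed;
 * field types are assumptions, and other fields are deliberately omitted.
 * @typedef {{ region: string, deviceName: string }} InstallPassRequest
 * @typedef {{ gatewayMatch: boolean }} WhoamiResponse
 */

// The three CP calls, in the order the wizard makes them.
const WIZARD_ROUTES = Object.freeze([
  { method: 'POST', path: '/v1/public/install-pass' },
  { method: 'POST', path: '/v1/devices/enroll' },
  { method: 'GET', path: '/v1/whoami' },
]);

// Throws if the wizard tries to reach any CP route outside the contract.
function assertWizardRoute(method, path) {
  const ok = WIZARD_ROUTES.some((r) => r.method === method && r.path === path);
  if (!ok) throw new Error(`wizard may not call ${method} ${path}`);
  return true;
}
```

Routing every `fetch` in the wizard through a wrapper that calls `assertWizardRoute` first would turn a boundary violation into a loud failure during development.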

### 2. Public install-pass surface

**What it is now**: an in-memory leaky bucket keyed by client IP, 3 passes per IP per hour, single-use, 1-day expiry. State is process-local.

**What's wrong**:
- Process restart resets all counters. A crash-loop becomes an abuse loop.
- Behind a CDN we'd see one IP. Behind a NAT'd corporate network we'd punish many users for one. The IP is the wrong key.
- The "early-adopter" tier is hardcoded into the public path. Tier choice should be policy, not magic constants.

**Refactor**: move the counter into Redis or a sqlite-backed sliding window. Key on a tuple of (client /24 prefix, browser-fingerprint-hash). Make the tier selectable by an env var, default still `early-adopter`.
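A minimal sketch of the sliding window and the composite key, under stated assumptions: the `Map` stands in for Redis or sqlite (so this version still loses counters on restart), IPv6 is not handled, and `rateKey`, `createLimiter`, and the way the fingerprint string is obtained are all hypothetical.

```javascript
const crypto = require('node:crypto');

// Composite key: the client's IPv4 /24 plus a short hash of the fingerprint.
// IPv6 is out of scope for this sketch.
function rateKey(ipv4, fingerprint) {
  const prefix = ipv4.split('.').slice(0, 3).join('.') + '.0/24';
  const fp = crypto.createHash('sha256').update(fingerprint).digest('hex').slice(0, 16);
  return `${prefix}|${fp}`;
}

// Sliding window: at most `limit` hits per `windowMs` per key.
// `hits` is process-local here; the refactor would move it to a durable store.
function createLimiter({ limit = 3, windowMs = 60 * 60 * 1000, now = Date.now } = {}) {
  const hits = new Map();
  return function allow(key) {
    const t = now();
    // Keep only timestamps that are still inside the window.
    const recent = (hits.get(key) || []).filter((ts) => t - ts < windowMs);
    const allowed = recent.length < limit;
    if (allowed) recent.push(t);
    hits.set(key, recent);
    return allowed; // false means the caller should answer 429
  };
}
```

The defaults mirror today's policy of 3 passes per hour; only the key and the storage change.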

**Sharp boundary**: anyone on the open internet can call `POST /v1/public/install-pass` with `{ region, deviceName }`. We promise: a single-use enrollment token good for 24 hours, byte-quota-bounded, with a clean 429 if rate-limited. We demand: nothing about the caller's identity — by design.

### 3. Control plane core (`control-plane/server.js`)

**What it is now**: a single 850-line file, raw `node:http`, JSON in/JSON out, atomic state writes, ed25519 catalog signing, in-memory rate limiter, no per-route middleware.

**What's right**: small, debuggable, no framework choice to argue about. Test surface (43/43) actually exercises real endpoints.

**What's wrong**:
- One file is the right size for v0.2.0 and the wrong size at v0.5. Routes will accrete; the if-cascade in `buildServer()` will get embarrassing.
- No per-route middleware means CORS, rate limiting, and auth checks are scattered. The `requireAdmin` call lives inside each handler.
- No structured logging. The log shows only "control-plane listening on...". When a slow path hits us, we have no histogram.
- The store interface's `updateState(mutator)` reads/mutates/writes the whole JSON file. This is the chokepoint the premortem named.

**Refactor**: extract a small request-pipeline (route table → middleware list → handler). Add request logging emitting JSON lines. Switch the validated runtime path to Postgres (the code is there; the operator notebook is the only thing keeping us on JSON).
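The pipeline could be as small as the sketch below. It is illustrative only: the request is a simplified `{ method, path, headers }` object rather than a raw `node:http` request, `requireAdmin` is a stand-in for the existing check (the bearer-token scheme is an assumption), and the log field names are invented.

```javascript
// Stand-in for the existing admin check, expressed as middleware.
// Returning a response object short-circuits; returning null continues.
function requireAdmin(expectedToken) {
  return (req) => (req.headers.authorization === `Bearer ${expectedToken}`
    ? null
    : { status: 401, body: { error: 'unauthorized' } });
}

// Route table -> middleware list -> handler, with one JSON log line per request.
function createPipeline(routes, log = (line) => process.stdout.write(line + '\n')) {
  return async function handle(req) {
    const started = Date.now();
    const route = routes.find((r) => r.method === req.method && r.path === req.path);
    let res = route ? null : { status: 404, body: { error: 'not found' } };
    if (route) {
      for (const mw of route.middleware) {
        res = await mw(req);
        if (res) break; // a middleware answered; skip the handler
      }
      if (!res) res = await route.handler(req);
    }
    log(JSON.stringify({
      method: req.method, path: req.path, status: res.status, ms: Date.now() - started,
    }));
    return res;
  };
}
```

With this shape, the auth requirement for a route is visible in the route table instead of inside the handler, and the `ms` field gives the raw material for a latency histogram.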

**Sharp boundary**: between `server.js` and `lib/store.js`, the contract is "give me a function `updateState(mutator)` that runs my mutator atomically against the latest committed state and persists the result". Today this is honored by `store-json.js` and stubbed by `store-postgres.js`. Pin this contract by asserting it in a small contract test that both stores must pass.
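A sketch of what such a contract test might assert, run here against an in-memory reference store. `createMemoryStore`, `readState`, and `checkStoreContract` are hypothetical names; the real test would take `store-json.js` and `store-postgres.js` as the factories. It checks two properties: concurrent mutators lose no updates, and a throwing mutator leaves committed state untouched.

```javascript
const clone = (v) => JSON.parse(JSON.stringify(v)); // state is plain JSON

// In-memory reference store. Calls are serialized through a promise chain,
// so each mutator runs against the latest committed state.
function createMemoryStore(initial) {
  let committed = clone(initial);
  let queue = Promise.resolve();
  return {
    updateState(mutator) {
      const run = queue.then(async () => {
        const draft = clone(committed);
        const next = (await mutator(draft)) ?? draft; // mutate in place or return new state
        committed = clone(next);                      // "persist" only after success
        return clone(committed);
      });
      queue = run.catch(() => {}); // a failed mutator must not block later ones
      return run;
    },
    readState: async () => clone(committed),
  };
}

// The contract both stores would have to pass.
async function checkStoreContract(makeStore) {
  const store = makeStore({ counter: 0 });
  await Promise.all(Array.from({ length: 20 }, () =>
    store.updateState((s) => { s.counter += 1; })));
  const afterIncrements = (await store.readState()).counter;
  await store.updateState((s) => { s.counter = -1; throw new Error('boom'); }).catch(() => {});
  const afterFailure = (await store.readState()).counter;
  return afterIncrements === 20 && afterFailure === 20;
}
```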

### 4. State store (`store-json.js`, `store-postgres.js`, `state.json`)

**What it is now**: a single growing JSON file holding gateways, devices, egress pools, access tiers, audit log, enrollment passes, catalog signing keys.

**What's wrong**:
- State that is durable (gateways, devices, signing keys) sits in the same file as state that is hot and frequent (gateway heartbeats, audit events, metering). Every heartbeat rewrites the audit log.
- Audit log grows unboundedly. Enrollment-pass records grow until manually compacted.
- The signing private key sits in plaintext in `state.json`. A backup of state.json IS a copy of the signing key. We have no compartmentalization.
- The `state.json.bak` rotation is brittle: it keeps a single prior version, not a journal.

**Refactor**: split into three layers. (a) `inventory.json` — gateways, egress pools, access tiers, slow-changing canon. (b) `devices/` — a small file per device. (c) `events.log` — append-only newline-delimited audit log, rotated by logrotate. Catalog signing key lives in `signing/` with mode 600 and is **never** in any json that gets backed up alongside state. The Postgres backend gets these layers as tables.
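For layer (c), the append-only log can be sketched in a few lines. The event fields (`type`, `deviceId`) and the helper names are assumptions; the point is that an audit write becomes one appended line instead of a rewrite of the whole state file.

```javascript
const fs = require('node:fs');

// Append one audit event as a single newline-delimited JSON line.
function appendEvent(logPath, event, now = Date.now) {
  const line = JSON.stringify({ ts: new Date(now()).toISOString(), ...event }) + '\n';
  // The mode only applies when the file is first created.
  fs.appendFileSync(logPath, line, { mode: 0o600 });
}

// Read the log back, skipping blank lines.
function readEvents(logPath) {
  return fs.readFileSync(logPath, 'utf8')
    .split('\n')
    .filter((l) => l.trim() !== '')
    .map((l) => JSON.parse(l));
}
```

Because each event is one line, logrotate can rotate the file without the control plane needing to understand the rotation.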

**Sharp boundary**: the heartbeat path *should not write the device file*. Heartbeat-derived liveness lives in process memory or a tiny `liveness.json` that the path-watcher does not watch. The path-watcher exists to catch *device-set changes*, not heartbeats. We pinned half of this at v0.2.0 by having the reconciler's hash check short-circuit watcher fires that leave the peer set unchanged; finish the job by giving heartbeat a separate sink.
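The separate sink can be as simple as the in-memory sketch below. `createLivenessSink` is a hypothetical name, and the 90-second staleness threshold (three missed 30s heartbeats) is an assumption, not current policy.

```javascript
// Heartbeat liveness kept in process memory, entirely outside the state files.
function createLivenessSink({ staleAfterMs = 90000, now = Date.now } = {}) {
  const lastSeen = new Map(); // gatewayId -> timestamp of last heartbeat
  return {
    record(gatewayId) {
      lastSeen.set(gatewayId, now());
    },
    isHealthy(gatewayId) {
      const ts = lastSeen.get(gatewayId);
      return ts !== undefined && now() - ts <= staleAfterMs;
    },
  };
}
```

The heartbeat handler would call `record(id)` and nothing else, so a heartbeat never touches a file the path-watcher sees. The trade-off is that liveness resets on CP restart and recovers on the next heartbeat.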

### 5. Reconciler (`sync-peers.sh` + `damm-sync-peers.path/.service`)

**What it is now**: a bash script triggered by systemd path-watcher on `state.json` changes, computing a hash of (id, publicKey, assignedIp, presharedKey, status) per device, short-circuiting when unchanged, otherwise diffing wg0's peer list against state and applying via `sudo wg set wg0 peer ...`.

**What's right**: idempotent, observable, short-circuits the heartbeat noise (~480 sudo events/h → ~0).

**What's wrong**:
- It's a bash script with sudo. The sudoers grant is a sharp object on the floor.
- The path-watcher fires twice per atomic rename (a known systemd quirk). Idempotence makes this harmless; the noise is real.
- Bash is the wrong language for this. We have `wg setconf` and `wg syncconf` taking a config file; we should generate that file, not piecewise-shell-out.

**Refactor (cheap)**: switch from per-peer `wg set` to a single `wg syncconf wg0 <generated-file>`. The sudoers grant collapses to one verb on one file path. The script becomes ten lines.

**Refactor (right)**: dissolve the script entirely. The control plane writes `wg-peers.conf` directly when it commits a state change that touches the peer set. A small daemon (or even a systemd `.path` unit with `PathChanged=` on that file) runs `wg syncconf` whenever the file changes. No bash, no path-on-state.json, no double-fires.

**Sharp boundary**: the reconciler's contract is "make wg0's peer table match the device set in state". Today this is enforced by ad-hoc shell. Pin it by codifying the desired wg0 peer list as an output of `lib/state.js` (a function `derivePeerConfig(state)`), so the reconciler is a thin shell over a tested pure function.
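A sketch of `derivePeerConfig(state)` as a pure function. The state shape is assumed: `state.devices` as an array, and `'active'` as the status that earns a peer entry. It renders only the `[Peer]` sections. Worth verifying before relying on it: `wg-quick strip` output, the usual input to `wg syncconf`, also carries the `[Interface]` section, so a peers-only file may need the interface settings prepended.

```javascript
// Pure function: device set in, WireGuard [Peer] sections out.
// Output is sorted by device id so identical state yields identical bytes.
function derivePeerConfig(state) {
  const peers = (state.devices || [])
    .filter((d) => d.status === 'active')          // assumed status value
    .sort((a, b) => a.id.localeCompare(b.id));     // filter() returned a copy
  return peers.map((d) => [
    '[Peer]',
    `PublicKey = ${d.publicKey}`,
    ...(d.presharedKey ? [`PresharedKey = ${d.presharedKey}`] : []),
    `AllowedIPs = ${d.assignedIp}/32`,
    '',
  ].join('\n')).join('\n');
}
```

Deterministic output also means the existing hash short-circuit reduces to comparing the rendered text against the previous render.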

### 6. Heartbeat (`heartbeat.sh` + `damm-heartbeat.timer`)

**What it is now**: a bash script run every 30s by a systemd timer, doing a single HTTP POST to `/v1/gateways/<id>/heartbeat`.

**What's wrong**: the heartbeat *belongs in the gateway*, not in a sibling script. We don't have a gateway daemon; we have wg0 plus this script. The asymmetry — "the gateway is just `wg-quick`, but for liveness we paste a bash script next to it" — is the kind of seam that bites later.

**Refactor**: write a tiny gateway agent (`gateway/agent.js` — the directory exists with one file already) that owns heartbeat, owns the wg0 peer-config rendering, and exposes a small unix socket for the control plane to talk to it. Drop the bash script. Drop the timer. The agent runs as one systemd unit `damm-gateway.service` and IS the gateway.

**Sharp boundary**: gateway↔CP is one socket pair. Today it's two surfaces (gateway-token HTTP for heartbeat, sudoers-grant shell for peer changes). Collapse to: a single bidirectional channel where the CP pushes desired peers and the gateway pushes liveness + metrics. Could be HTTP long-poll, websocket, or a small protobuf-over-unix-socket. Pin by writing the protocol spec before writing the agent.

### 7. Catalog (signing, publication, freshness)

**What it is now**: ed25519-signed JSON envelope with the gateway list and egress pools, region-filtered, signed at publish time, public-key embedded in every response so clients can verify.

**What's right**: signed, versioned, freshness-bounded, public-key included.

**What's wrong**:
- The `region` query parameter is echoed verbatim into the signed envelope. The probe agent flagged this: clients displaying it raw must treat it as untrusted text. We sign attacker input.
- No catalog cache headers. Every fetch re-signs and re-serializes.
- Catalog versioning is a `version` field; nothing in the protocol enforces that clients refuse downgrades.

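**Refactor**: validate `region` against a closed set of known regions before placing it in the signed payload (or hash-only if we want to keep the echo). Add `cache-control: max-age=60` to catalog responses. Bake catalog version into the signature input so it cannot be altered in transit, and require clients to refuse any catalog whose version is lower than the last one they accepted. Signing the version alone does not stop a replay of an older, validly signed catalog.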
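A sketch of the region check and a version-aware verify, using Node's built-in ed25519 support. The envelope layout (`payload` and `signature` as base64), the numeric `version`, and the contents of `KNOWN_REGIONS` are assumptions, not the current catalog format. Signing the exact serialized bytes sidesteps JSON canonicalization questions.

```javascript
const crypto = require('node:crypto');

const KNOWN_REGIONS = new Set(['de']); // hypothetical closed set

// CP side: refuse unknown regions, then sign the exact payload bytes.
function buildSignedCatalog({ region, version, gateways, privateKey, now = Date.now }) {
  if (!KNOWN_REGIONS.has(region)) {
    const err = new Error('unknown region');
    err.statusCode = 400;
    throw err;
  }
  const payload = { version, issuedAt: new Date(now()).toISOString(), region, gateways };
  const bytes = Buffer.from(JSON.stringify(payload));
  const signature = crypto.sign(null, bytes, privateKey).toString('base64');
  return { payload: bytes.toString('base64'), signature };
}

// Client side: verify the signature over the bytes, then refuse downgrades.
// Returns the parsed payload, or null if the catalog must be rejected.
function verifyCatalog(envelope, publicKey, lastSeenVersion) {
  const bytes = Buffer.from(envelope.payload, 'base64');
  const sig = Buffer.from(envelope.signature, 'base64');
  if (!crypto.verify(null, bytes, publicKey, sig)) return null;
  const payload = JSON.parse(bytes.toString('utf8'));
  if (payload.version < lastSeenVersion) return null; // older than what we already accepted
  return payload;
}
```

The downgrade check lives on the client because a replayed old catalog still carries a valid signature; only the client knows what it has already seen.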

**Sharp boundary**: catalog↔client is a signed envelope with a minimum freshness contract. The CP promises: "every catalog I sign is correct as of `issuedAt`". The client demands: "I will reject anything older than my last-seen `issuedAt` minus tolerance". Pin by writing the catalog format spec — a single page — and have both CP and client validate against it.
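The client half of that contract fits in one function. This is a sketch: `isFreshEnough` is a hypothetical name, `issuedAt` is assumed to be an ISO 8601 string, and the tolerance value is left to the spec.

```javascript
// Client rule: reject anything older than last-seen issuedAt minus tolerance.
function isFreshEnough(issuedAt, lastSeenIssuedAt, toleranceMs) {
  const t = Date.parse(issuedAt);
  if (Number.isNaN(t)) return false;            // unparseable timestamps are rejected
  if (lastSeenIssuedAt == null) return true;    // first catalog this client has seen
  return t >= Date.parse(lastSeenIssuedAt) - toleranceMs;
}
```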

### 8. Caddy reverse proxy

**What it is now**: a single Caddyfile vhost block reverse-proxying `cp.damm.raindesk.dev` to `127.0.0.1:8080`, plus the `damm.raindesk.dev` static site.

**What's right**: zero config beyond hostname. Caddy handles ACME, HTTP/2, h3.

**What's wrong**:
- It's co-tenant with thirty-something other domains in the same Caddyfile. The internal observer flagged 327 ACME-failure log lines in 2 minutes from neighbor domains — pure noise we filter out, but also: a neighbor's misconfiguration could nudge our cert behavior. We have one Caddy process for all of these.
- No HSTS visible at the application layer (the prober flagged this; as far as we know Caddy does not add the header unless configured, and we haven't asserted either way).
- No upstream health check; Caddy will happily proxy to a hung CP.

**Refactor**: add `header { Strict-Transport-Security "max-age=31536000; includeSubDomains" }` to the cp.damm block. Add `reverse_proxy 127.0.0.1:8080 { health_uri /healthz health_interval 5s health_timeout 2s }` so a 502 happens fast on a hung CP rather than a slow client timeout.

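**Sharp boundary**: Caddy↔CP is HTTP/1.1 over loopback. Pin by writing the four headers we expect to see on requests entering the CP (`x-forwarded-for`, `x-forwarded-proto`, `x-real-ip`, host) and asserting them in the CP's request handler. Caddy's `reverse_proxy` sets `x-forwarded-for` and `x-forwarded-proto` and passes `host` through by default; `x-real-ip` needs an explicit `header_up` line in the cp.damm block.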
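The assertion in the CP's request handler could be a small pure check like the sketch below. `missingProxyHeaders` is a hypothetical helper; it relies on Node lowercasing header names in `req.headers`.

```javascript
// The four headers the CP expects on every request arriving via Caddy.
const EXPECTED_PROXY_HEADERS = ['x-forwarded-for', 'x-forwarded-proto', 'x-real-ip', 'host'];

// Returns the names of expected headers that are absent or empty.
function missingProxyHeaders(headers) {
  return EXPECTED_PROXY_HEADERS.filter((name) => !headers[name]);
}
```

A non-empty result means the request either bypassed Caddy or the Caddy block has drifted; logging it, or rejecting the request outright, is a policy choice.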

### 9. Orchestrator (`orchestrator/`)

**What it is now**: a directory of planning code (`plan.js`, `reconcile.js`, `policy.js`, `providers.js`, `adapters.js`, `hetzner-smoke.js`, `hetzner-cleanup.js`) that knows how to talk to Hetzner Cloud's API but is not running anywhere as a service.

**What's right**: it exists, it has tests, and it embodies the right model (plan → reconcile → apply with policy gates).

**What's wrong**: it's dead code in production. The architecture map promises an orchestrator that adds gateways under pressure. We have orchestrator code that runs only when an operator manually invokes a CLI.

**Refactor**: pick one of two paths. (a) Cut it for now — move to `archive/orchestrator/` and stop pretending we have one until we do. (b) Wire it up as a scheduled service that responds to score-loop signals from the CP. Path (a) is the honest move for v0.2; path (b) is the path forward for v0.4+.

**Sharp boundary**: orchestrator↔CP is "the orchestrator publishes new gateways into state via a privileged write; the CP refuses orchestrator writes that violate provider policy". Today this contract is implied; nothing enforces it. Pin by giving the orchestrator a separate signing key the CP verifies, or by giving it write access only to a `desired-fleet.json` that the CP reconciles (instead of writing state directly).

### 10. The browser companion (`/docs/damm-client.html`)

**What it is now**: a 900-line single-file PWA, predating the wizard, doing platform-detection, profile loading, install-tab guidance, and (now) auto-fetching install passes when the form is submitted blank.

**What's wrong**: it duplicates 80% of what the wizard does. Two surfaces with the same job, drifting independently.

**Refactor**: pick one. The wizard is the front door; the legacy companion is the power-user inspector. Either:
- Cut the companion entirely (redirect `/docs/damm-client.html` → `/get/`), or
- Re-scope the companion to "post-enrollment power-user view" — diagnostics, manual config import, profile inspection — and stop overlapping enrollment.

For v0.2.0 we did the latter implicitly (prefilled the field, made it usable). For v0.3 pick consciously.

**Sharp boundary**: each public HTML surface owns exactly one user job. `/get/` is enrollment. `/docs/damm-client.html` is post-enrollment inspection. Anything that doesn't fit either gets its own page or doesn't ship.

---

## Cross-cutting issues to call out

### A. We have three architectural pictures and they don't quite add up

- **Phase 0 deployed reality**: one gateway, one egress, one transport, one CP, one host. WireGuard wrapper + onboarding wizard.
- **Resilience-by-statistics map** (`architecture-map.md`): many gateways, many front doors per gateway, polymorphic transport, score-driven catalog churn, threat-blind degradation.
- **Stateless-tickets alternative** (`architecture-premortem.md`): admission tickets verified at handshake, gateways carry no peer state, control plane is a ticket mint.

These are not in conflict, but they are **not yet wired into a single trajectory**. The premortem suggests a hybrid (admission-record-style peer-add) for Phase 1.5; the map's score loop is a Phase 2/3 concern. We need a single document — call it `roadmap.md`, refresh the existing one — that says "from where we are today, here are the next 3 architectural moves and which document each is grounded in." Without that, the maps drift apart.

### B. The "many conceptualizations of the architecture" question

Across the docs we have:
- `architecture.md` — original system thesis
- `architecture-map.md` — resilience model
- `architecture-premortem.md` — peer critique
- `system-spec.md` — implementation contract
- `client-boundaries.md` — what the client is and isn't
- `engineering-decisions.md` — why we chose what we chose
- `onboarding-flow.md` — the user flow

Seven documents that overlap by ~40% and disagree by ~10%. **Refactor**: declare one canonical document per concern. `architecture.md` owns the system thesis. `system-spec.md` owns the wire-format contracts. `architecture-map.md` owns the resilience posture. The others link out. Drift gets caught when the canonical doc moves.

### C. The promise/demand matrix between components

For each component pair, a single line of "X promises Y, X demands Z":

| Pair                      | Promises                                          | Demands                                              |
|---------------------------|---------------------------------------------------|------------------------------------------------------|
| Wizard ↔ CP               | calls only the three public endpoints             | a CORS-OK 200/201 within 8s                          |
| Wizard ↔ User-WG-app      | the .conf is valid and self-contained             | the user installs WG and toggles it                  |
| CP ↔ Store                | every state read/write is atomic                  | the store survives crash mid-write without corruption |
| CP ↔ Catalog              | every gateway in the catalog is heartbeat-fresh   | clients reject stale and unsigned                    |
| CP ↔ Reconciler           | state.json reflects current desired peer set      | the reconciler matches wg0 to it within ~2s          |
| Reconciler ↔ Gateway (wg0) | will only call `wg set` for valid peers          | wg0 is up and accepts the call                       |
| Heartbeat ↔ CP            | every 30s, the gateway is alive                   | CP marks the gateway healthy on receipt              |
| Caddy ↔ CP                | will retry idempotent GETs once, fail-fast on 5xx | CP responds within 8s                                |
| Orchestrator ↔ CP         | new gateways enter via vetted writes              | CP enforces provider-policy gate                     |

Most of these are implicit today. **Pin them by writing each as a paragraph in `system-spec.md` and asserting them in tests where the seam is testable.**

---

## The cut list

Four categories: cut now, refactor next sprint, reconceptualize, leave alone.

### Cut now (week of v0.3)
- Inline `onclick=` handlers in any HTML (already done in /get/, audit the legacy client).
- The `data/drill-bundle/`, `data/gauntlet/`, `data/first-visitor-*` artifacts sit in the working tree untracked. Either commit them with a `data/.gitignore` carve-out, or `.gitignore` them outright. Untracked is the worst state.
- `orccu/` directory — looks like an experimental neighbor, doesn't ship as part of DAMM, doesn't belong in the repo root. Move or remove.

### Refactor next sprint
- `sync-peers.sh` → `wg syncconf` based, sudoers grant collapses to one path.
- Heartbeat → moved into a gateway agent, bash script and timer retired.
- Browser companion `/docs/damm-client.html` — re-scope or redirect to `/get/`.
- Audit log → rotate, with `logrotate` config or a periodic compaction job.

### Reconceptualize
- Single state.json → split into inventory / devices / events / signing.
- Orchestrator → either wire it as a service or archive it.
- Catalog signing key → keystore separate from state.

### Leave alone for now
- The wizard's overall shape. v0.2.0 ships, it works, the JS is clean, the contract is sharp.
- The Caddy fronting model — co-tenancy noise is real but not a DAMM problem.
- WireGuard as the data plane primitive — this is the right call, the premortem confirmed it, do not touch.

---

## What "done" looks like for v0.3

Concretely:
1. `sync-peers.sh` rewritten to use `wg syncconf`. Sudoers grant collapses. Three lines of bash gone.
2. Heartbeat moved into a `gateway/agent.js` running as `damm-gateway.service`. The bash script is deleted.
3. `state.json` split into `inventory.json`, `devices/`, `events.log`, with the path-watcher only watching `devices/`. Heartbeat writes go nowhere the watcher sees.
4. The signed catalog refuses unknown regions instead of echoing them.
5. Caddy block adds HSTS and an upstream health check.
6. `/docs/damm-client.html` either redirects to `/get/` or has a clearly different scoped purpose.
7. The orchestrator either becomes a service or gets moved to `archive/`.
8. One `roadmap.md` reconciling Phase 0 / map / premortem / spec into a single trajectory.

When those eight are done, v0.3 ships. None of them require throwing away v0.2.0; all of them sharpen what v0.2.0 already does.
