# Memo — Premortem and an Alternative Architecture

**To:** A peer architect picking up DAMM cold
**From:** The architect of record for Phase 0
**Re:** What's going to bite us, and an honest competing design

You're inheriting a small, working system. The walkthrough notebook will get you to a green deployment in ninety minutes. That's not what I want to talk about. I want to lay out where the design choices are going to hurt — not as a list of regrets, but as a forecast you should test against the system as it grows. And then I want to sketch one radically different architecture that I think competes with what we built. Same threat-blind resilience goal; different shape; different tradeoffs.

## Premortem — where we'll regret choices

I'll be unsentimental. Phase 0 is deployed and useful. Some of these regrets are immediate; some are six months out.

### 1. JSON-on-disk state with full-file rewrites

`control-plane/lib/store-json.js` reads the whole state file, mutates it in memory, writes it back atomically (tmp + rename). This is fine at our current size — kilobytes. It will be painful at low-thousands of devices: every enrollment, every meter update, every heartbeat-driven gateway registration touches the disk through a single-writer chokepoint. The `.bak` rotation gives us crash safety; it doesn't give us throughput.

The Postgres backend exists in code (`store-postgres.js`) and in the unit tests, but it has been inert in production. We will discover its sharp edges the day we need it most, under enrollment surge, and that is the worst time to discover them. **Forecast:** the JSON store reaches its limits months before we expect, because the audit log grows monotonically and every state write also writes the audit log inline. We do not currently rotate the audit log.
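For a peer reading this cold, here is roughly what that write path looks like. This is a sketch, not the contents of `store-json.js`: `saveState`, `loadState`, and the `.tmp` suffix are names I'm assuming for illustration.

```javascript
const fs = require('node:fs');

// Hypothetical helper: rewrite the whole state file atomically.
// The real store-json.js may differ; this only illustrates tmp + rename.
function saveState(file, state) {
  const tmp = `${file}.tmp`;
  const bak = `${file}.bak`;
  // The full state is serialized on every write, so the cost grows with
  // total state size, not with the size of the change.
  fs.writeFileSync(tmp, JSON.stringify(state, null, 2));
  // Keep the previous version around for crash recovery.
  if (fs.existsSync(file)) fs.copyFileSync(file, bak);
  // rename() within one filesystem replaces the target atomically.
  // A durable version would also fsync the tmp file before renaming.
  fs.renameSync(tmp, file);
}

function loadState(file) {
  return JSON.parse(fs.readFileSync(file, 'utf8'));
}
```

Every caller funnels through one function like this, which is where the single-writer chokepoint comes from.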

### 2. Control plane and gateway colocated on the same host

Phase 0 puts both on `hub2`. This is operationally simple: one box, no inter-host plumbing, one TLS termination. It is also a single blast radius. A bug in the control plane that consumes file descriptors will starve `wg-quick`. A storm of HTTPS requests that gets through Caddy can make the kernel scheduler unkind to the WireGuard data path. The architecture document promises "control plane should not be a data-plane hot path", and colocation strains that promise.

**Forecast:** the first time we have to scale the control plane horizontally, we will spend a weekend untangling assumptions baked in by colocation: the env file's path, the state.json path, the sudoers grant, the path-watcher. None of these are insurmountable; all of them are friction we paid for early simplicity.

### 3. The reconciler as a path-watcher + sudo-elevated shell script

`damm-sync-peers.path` watches `state.json`, fires `damm-sync-peers.service`, which runs `sync-peers.sh`, which calls `sudo wg show` and `sudo wg set`. This works. It also:

- fires twice on every atomic rename (a known systemd quirk)
- requires a sudoers grant that has to be exactly right
- writes preshared keys through tempfiles because process substitution doesn't survive sudo
- has a steady-state cost of ~480 sudo events/hour at a 30s timer (per our internal observer's measurement; 120 runs/hour works out to about four sudo calls per run)

A more correct shape: the control plane writes the canonical `wg syncconf` file directly into `/etc/wireguard/wg0-peers.conf`, and a single `wg syncconf wg0 /etc/wireguard/wg0-peers.conf` runs in response. Or — better still — bind a netlink socket from the control plane and apply peer changes directly without shelling out at all.

**Forecast:** the reconciler will at some point miss an event during a rapid sequence of enrollments (path watchers debounce poorly under bursts). We will discover this only when a user reports their peer didn't take, and the forensics will be slow.
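One way to make a missed event harmless is to have every run compare the full desired peer set against what the interface reports, so that any later run corrects an earlier miss. A minimal sketch of that comparison follows. `diffPeers` and the record shape are hypothetical, and reading the actual peer list off the interface is left out.

```javascript
// Hypothetical: `desired` comes from state, `actual` from the interface.
// Both are arrays of { publicKey, allowedIps } records.
function diffPeers(desired, actual) {
  const want = new Map(desired.map((p) => [p.publicKey, p]));
  const have = new Map(actual.map((p) => [p.publicKey, p]));
  const toAdd = [];
  const toUpdate = [];
  const toRemove = [];
  for (const [key, peer] of want) {
    const current = have.get(key);
    if (!current) toAdd.push(peer);
    else if (current.allowedIps !== peer.allowedIps) toUpdate.push(peer);
  }
  for (const [key, peer] of have) {
    if (!want.has(key)) toRemove.push(peer);
  }
  return { toAdd, toUpdate, toRemove };
}
```

Because the result depends only on the two snapshots, it does not matter how many change notifications were dropped between runs.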

### 4. The catalog signing key as a true single point of failure

This is by design (see the architecture map — refusing silent rerouting), but it deserves naming. The catalog signing key lives in `state.json`. Lose it (state corruption, catastrophic disk loss) and every cached client rejects every new catalog you publish. A backup discipline is the answer; we don't have one wired in.

**Forecast:** within a year, we have an incident where state.json is corrupted, we restore from a sloppy backup, and either (a) we have to force-re-enroll everyone because the new signing key doesn't match what clients cached, or (b) we keep the old key but also keep the corruption. There's no clean third option without a key-rotation protocol that supports overlap.

### 5. Ad-hoc rate limiting in process memory

`control-plane/lib/rate-limit.js` is a leaky bucket in a `Map`. A control plane restart resets all counters. This is fine for our scale and accidentally useful (a legitimate user who tripped the limit gets a clean slate when we restart, but so does an abusive client). It is **not** fine if we ever have multiple control-plane processes: they won't share state, and the per-IP cap becomes per-IP-per-process.

**Forecast:** the day we add a second control plane is the day we discover we're issuing 6 install passes per IP per hour instead of 3. Nobody will notice for weeks. By then a lot of passes have been issued.
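To make the per-process problem concrete, here is a sketch of an in-memory leaky bucket. It is not the code in `rate-limit.js`: the class name, the option names, and the injectable clock are my assumptions, and the 3-per-hour default simply mirrors the figure above.

```javascript
// Hypothetical leaky bucket per IP, held in process memory only.
class LeakyBucketLimiter {
  constructor({ capacity = 3, windowMs = 60 * 60 * 1000, now = Date.now } = {}) {
    this.capacity = capacity;
    this.leakPerMs = capacity / windowMs; // the bucket drains fully in one window
    this.now = now;
    this.buckets = new Map(); // ip -> { level, updatedAt }
  }

  allow(ip) {
    const t = this.now();
    const b = this.buckets.get(ip) || { level: 0, updatedAt: t };
    // Drain the bucket for the time elapsed since the last request.
    b.level = Math.max(0, b.level - (t - b.updatedAt) * this.leakPerMs);
    b.updatedAt = t;
    const ok = b.level + 1 <= this.capacity;
    if (ok) b.level += 1;
    this.buckets.set(ip, b);
    return ok;
  }
}
```

Two instances of this class, one per process, each hold their own `Map`, so the same IP gets the full capacity from both. A restart is the same thing in time rather than space: a new instance starts empty.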

### 6. The wizard auto-issues install passes to anyone

Public install-pass is trust-minimized: rate-limited per IP, single-use, time-bound. It is also wide open. Nothing prevents a script from rotating IPs and harvesting tokens. The early-adopter tier has byte quotas, so the abuse blast radius is bounded — but we will see attempts. Captcha, attestation, or invite-only would harden this; each costs something we don't want to pay (UX, complexity, the privacy promise).

**Forecast:** within a month of going semi-public, we will see a burst that consumes our daily token mint. The right response is not captcha — it is to ratchet the per-IP limit down dynamically when global mint rate exceeds a threshold, and let legitimate visitors retry. We don't have that loop yet.
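The ratchet can be sketched as a pure function of the global mint rate. Everything here is an assumption for illustration: the function name, the parameter names, and the proportional scaling rule are one option among several, not something we have built.

```javascript
// Hypothetical: scale the per-IP allowance down as the global mint rate
// climbs past a threshold. All numbers passed in are illustrative.
function effectivePerIpLimit({ baseLimit, globalMintsPerHour, threshold, floor = 1 }) {
  if (globalMintsPerHour <= threshold) return baseLimit;
  // Above the threshold, shrink the limit in proportion to the overshoot.
  const pressure = threshold / globalMintsPerHour; // strictly between 0 and 1
  // Never drop below the floor, so a legitimate visitor can still retry.
  return Math.max(floor, Math.floor(baseLimit * pressure));
}
```

With a base limit of 3 and a threshold of 200 mints/hour, this keeps the limit at 3 in normal traffic and drops it to 1 once the global rate doubles.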

### 7. WireGuard as the only transport, with no T1+ ready

The architecture document promises a tier ladder from T0 to T4. T0 is what we have. Every user on a network that blocks UDP/51820 sees a non-functional service today. The day we send a friend a link and they open it from a corporate WiFi, they see the wizard succeed (HTTPS works) and the tunnel never come up (UDP blocked). The wizard's "test my connection" will dutifully report this — but it doesn't fix it.

**Forecast:** AmneziaWG (T1) will be the most-asked-for feature within two weeks of the wizard going wider, because UDP-blocking networks are everywhere.

### 8. iOS deep-linking is unspecified

The wizard's import path on iOS is "tap Download .conf, switch to WireGuard, tap +, find the file in Files." That is six taps, three apps, one mental model switch. iOS WireGuard accepts `wireguard://import?url=https://...` for hosted configs. We do not host configs (deliberately — the .conf has the private key in it, hosting it leaks).

**Forecast:** until we ship QR rendering inline (mobile camera scan from the WireGuard app's `+ → Create from QR code`), iOS users will struggle. Roughly, our funnel will lose half its iPhone visitors at the import step. QR is a Phase-1 followup; the wizard advertises it on the roadmap but doesn't ship it.

### 9. Catalog, region, and exit-country promises that we don't yet route

The control plane accepts `region` and `exitCountry` parameters on enroll and stamps them on the device record. Today, we have one region and one egress, so the promise is hollow but harmless. When we have two egress pools (Phase 3), we will discover whether our policy code actually matches user intent or just hands users to whichever pool the placement code thinks is least loaded. The bug class here is "policy drift": the user said "exit from the Netherlands" and got Germany because the NL pool was momentarily down; nobody told them.

**Forecast:** the first multi-region deployment uncovers a routing decision the user wouldn't have made for themselves. The fix is to surface "you got this exit because that one was down" honestly in the wizard, not to silently accept the substitute.
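A sketch of what an honest placement result could look like. `chooseExit`, the pool record shape, and the least-loaded fallback rule are hypothetical. The point is only that the substitution is part of the return value rather than a silent side effect.

```javascript
// Hypothetical placement: honor the requested exit when its pool is healthy,
// otherwise pick a substitute and say so explicitly.
// Pools are { country, healthy, load } records.
function chooseExit(requestedCountry, pools) {
  const healthy = pools.filter((p) => p.healthy);
  const exact = healthy.find((p) => p.country === requestedCountry);
  if (exact) {
    return { country: exact.country, substituted: false, reason: null };
  }
  if (healthy.length === 0) {
    return { country: null, substituted: false, reason: 'no healthy exit pool' };
  }
  // Fall back to the least-loaded healthy pool.
  const fallback = healthy.reduce((a, b) => (b.load < a.load ? b : a));
  return {
    country: fallback.country,
    substituted: true,
    reason: `requested exit ${requestedCountry} is unavailable`,
  };
}
```

The wizard can then render `reason` whenever `substituted` is true, instead of the user finding out from a geolocation check.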

### 10. Observability is essentially `tail -f`

We have control-plane stdout, Caddy journal, wg show, the heartbeat timer's logs, and the reconciler's logs. We do not have structured request logs, latency histograms, or per-route counters. The internal observer agent's report flagged this: "no request-level logging visible." Today this is fine. The day we have to debug "why is enrollment slow this hour" with only those tools is going to be a long day.

**Forecast:** we will write JSON access logs to disk under duress, then realize we need rotation, then discover that disk pressure broke the control plane. The right move is to bake in structured logging from the start, but Phase 0 deliberately didn't.
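For scale, the smallest useful version of structured request logging is one JSON object per line. The field names below are my assumption, not an existing log format in the control plane.

```javascript
// Hypothetical: one JSON object per request, emitted as a single line.
function accessLogLine({ method, route, status, startedAtMs, finishedAtMs }) {
  return JSON.stringify({
    ts: new Date(finishedAtMs).toISOString(),
    method,
    route,
    status,
    durationMs: finishedAtMs - startedAtMs,
  });
}
```

Lines in this shape can be filtered by `route` and bucketed by `durationMs` after the fact, which is what the "why is enrollment slow this hour" question needs.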

---

## An alternative architecture

You asked whether there is a competing design — same goals, no worse on resilience — that we should consider before we get further entrenched. Here is one.

### Stateless gateways with signed admission tickets

The premise: **gateways hold no peer state.** They forward only WireGuard packets that carry a valid signed admission ticket. The control plane mints tickets; gateways verify them statelessly; everything else is the same.

#### How it works

A device, at enrollment time, receives:

1. Its WireGuard private key (browser-generated, as today).
2. A short-lived signed admission ticket: `{ deviceId, allowedTunnelIp, validUntil, signature }` signed by the control plane's admission key.
3. The catalog of front doors, signed by the catalog key.

A gateway runs a small wrapper in front of WireGuard: before the `Noise IK` handshake runs, the client sends the admission ticket as its first packet. The gateway verifies the signature against the admission public key (it learned the public key at registration time and never holds the private key). If the ticket is valid, the handshake proceeds, and the ephemeral state is held only for the lifetime of the session.

When the ticket expires (say, 24 hours), the device fetches a fresh one from the control plane. Failures here are the same shape as a catalog-stale failure: the client knows the path is the issue, retries with a fresh ticket.
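The mint and verify steps can be sketched with Node's built-in Ed25519 support. The function names and the canonical encoding (a JSON array in fixed field order) are assumptions for illustration. How the ticket travels in front of the handshake is the hard part and is not shown.

```javascript
const crypto = require('node:crypto');

// Hypothetical canonical encoding: a fixed field order, so that signer and
// verifier sign and check identical bytes.
function ticketBytes({ deviceId, allowedTunnelIp, validUntil }) {
  return Buffer.from(JSON.stringify([deviceId, allowedTunnelIp, validUntil]));
}

// Control plane side: sign with the admission private key.
function mintTicket(fields, privateKey) {
  const signature = crypto.sign(null, ticketBytes(fields), privateKey).toString('base64');
  return { ...fields, signature };
}

// Gateway side: needs only the admission public key and a clock.
function verifyTicket(ticket, publicKey, nowMs = Date.now()) {
  // Reject expired tickets before doing any signature work.
  if (typeof ticket.validUntil !== 'number' || ticket.validUntil <= nowMs) return false;
  const sig = Buffer.from(String(ticket.signature), 'base64');
  return crypto.verify(null, ticketBytes(ticket), publicKey, sig);
}
```

A gateway can call `verifyTicket` with no lookup of any kind, which is the whole premise: the ticket carries everything needed to decide.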

#### Why this is interesting

**Gateways become truly disposable.** They have no peer table to keep in sync with the control plane. No reconciler. No path-watcher. No `wg set peer` cascade. A gateway that comes online with the right keys can serve the entire population that holds valid tickets. A gateway that goes away takes nothing with it; the next gateway up serves the same tickets.

**The reconciler problem disappears.** Phase 0's biggest moving part — the systemd path-watcher reconciling state.json into wg0 — does not exist. No sudoers grant for `wg set`. No double-fire on rename. No quiet asymmetry between state and wire.

**Multi-control-plane works without coordination.** Two control planes can run independently, both signing tickets with the same admission key (held in a small HSM or shared KMS). Gateways verify either signature; clients fetch from either CP. The only thing requiring coordination is admission key rotation, and that's a known cadence event.

**Compromise blast radius is shorter.** A leaked admission key forces ticket reissuance, but the existing tickets carry an expiry, so old tickets lose effect on a clock. Today's leaked gateway-API-token is worse: it requires a rotation drill, restarts, and a window in which we have to remove peers manually.

#### What it costs

**One round trip per ticket renewal.** Today, an enrolled device handshakes whenever it likes, indefinitely. With admission tickets, every 24 hours (or whatever the validity window is) the device has to re-fetch a ticket. This is a network call to the control plane, so if the CP is down, devices holding a valid ticket keep working until it expires, and after that they cannot start new sessions. Catalogs are still cached and ticket validity is hours long, so this is degradation, not failure.

**One additional signature verification per handshake.** Cheap (Ed25519 is fast), but it's not free. At gateway scale, this is microseconds per session.

**Slightly more complicated client.** The wizard now has to handle ticket expiry: poll the CP periodically (or at handshake-failure), get a fresh ticket, swap it into the WireGuard config. WireGuard itself doesn't know about the ticket — it goes into a pre-shared field or a small custom header in front of the handshake. This is the part that requires wrapping or extending WireGuard, and that is the real engineering cost.

**The customizing-WireGuard problem.** This is the deal-breaker for an off-the-shelf-WireGuard-app deployment. iOS WireGuard, macOS WireGuard, Linux wg-quick — none of them know how to send an admission ticket. We'd ship a thin native client (a wrapper) on iOS/Android, or an out-of-band side-channel where the ticket is presented to the gateway over HTTPS in a separate flow that grants the WireGuard handshake (a "key-knock" that opens the door for the next handshake from a known IP). The latter is cleaner UX-wise but couples ticket and IP, undoing some of the resilience gains.

#### Side-by-side

|                                | Phase 0 (today)                        | Stateless-tickets alternative        |
|--------------------------------|----------------------------------------|--------------------------------------|
| Gateway state                  | Peer table mutated on enroll/revoke    | None                                 |
| Reconciler                     | systemd path-watcher + shell script    | Does not exist                       |
| Gateway disposability          | Cheap to launch, peer table to rehydrate | Truly disposable                     |
| Multi-CP coordination          | Needed (shared state.json or DB)       | Only key cadence                     |
| Admission revocation latency   | Reconciler interval (~30s)             | Ticket TTL (hours; faster requires a deny-list) |
| Compromised gateway-private-key | Force re-enroll all devices            | Same                                 |
| Compromised admission-private-key | n/a                                  | Force re-issue all tickets           |
| Off-the-shelf WireGuard client | ✓                                       | ✗ — needs custom client or side-channel |
| Operational simplicity, day 1  | High                                   | Lower (more moving parts)            |
| Operational simplicity, day 365 | Lower (state grows, reconciler scales poorly) | Higher (gateway is dumb)         |
| Resilience under provider churn | Good (rotate gateway, sync state)      | Excellent (rotate gateway, no state to sync) |
| Resilience under CP outage     | Cached catalogs work; enrolled devices keep connecting; new enrollments blocked | Cached tickets work; new sessions blocked once tickets expire |

#### My honest read

The stateless-ticket design competes on resilience. It is genuinely better at "lose a gateway, replace it instantly" because there is no rehydration step. It is genuinely better at multi-CP because state is in tickets, not in a shared DB. It pays for these gains with the off-the-shelf-client problem.

If we were not committed to standard WireGuard apps, I would seriously argue for the alternative. We are committed — that was an early architectural decision in `architecture.md` ("DAMM should not pretend the browser owns the tunnel"; the corollary is also true: DAMM should ride on the WireGuard apps users already trust). That commitment makes Phase 0's design correct. It makes the alternative an architecture for a different product.

There is a hybrid worth considering: keep Phase 0's standard-WireGuard data plane, but borrow the admission-ticket idea for a control-plane-to-gateway interaction. Gateways verify a signed admission record at peer-add time instead of receiving a peer write from a privileged reconciler. That removes the sudoers grant and the path-watcher without disturbing the client. It is a small enough delta that I think we should do it in Phase 1.5.

### Other alternatives considered briefly

- **Multi-hop overlay (Tor-style at the leaves).** Stronger anonymity, latency cost we cannot pay for "in seconds, online" UX. Wrong tool for our stated goal.
- **Pure peer-to-peer with volunteer relays.** Beautiful resilience, no answer for "who pays for the bandwidth and who's on the hook for what gets done." Wrong economic model.
- **No control plane at all — rendezvous on a public read-only data store (IPFS, blockchain, etc.).** Aesthetic appeal, real cost: latency to fetch the catalog on every reconnect, and no answer for revocation that doesn't require someone with edit rights. Wrong primitive.
- **Eat your own dogfood — control channel rides the data channel.** Tempting in that an adversary blocking the CP also blocks the data plane (one thing to defend, not two). The bootstrap problem is brutal: a client without a working tunnel cannot fetch their first config. Solvable with a sidecar, ugly to ship.

### What I'd actually do

If I started over today, I would still build Phase 0 the way Phase 0 is built, with three deltas:

1. **Postgres from day one.** The JSON store is a teaching toy; the ops cost of switching mid-flight is real. Skip it.
2. **Control plane on a different host from the gateway.** Even if it's the same provider in the same DC. The colocation simplification is not worth the blast-radius cost.
3. **Admission-record-style peer-add** (the hybrid described above) instead of the sudo-elevated reconciler. Same data plane, simpler control plane.

What I would not do differently: I would ship what we shipped. The minimal Phase 0 (`/get/` wizard, public install-pass, whoami, posture panel, threat-blind catalog) is the right shape. The mistakes are the implementation choices around it, not the system contour.

— A. of R.