# DAMM Operational Runbook

When the system halts, gunks up, freezes, gets stuck in an invalid state, loses track of itself, or becomes unresponsive — start at the top of this document, work down. Each step is a runnable command. Don't skip the diagnostics.

The deployment shape this runbook assumes (Phase 0):

- **hub2** (149.102.137.139, public): runs `wg-quick@wg0`, `damm-control-plane`, `damm-heartbeat.timer`, `damm-sync-peers.path/.service`, Caddy reverse-proxying `cp.damm.raindesk.dev` → `127.0.0.1:8080`
- **hyle**: holds the source-of-truth git repo at `/home/uprootiny/damm`
- State directory on hub2: `/home/uprootiny/damm-state/` (`env`, `gateway-keypair.json`, `state.json`, `*.log`, `sync-peers.sh`, `heartbeat.sh`)
- DNS: `vpn.damm.raindesk.dev`, `cp.damm.raindesk.dev`, `damm.raindesk.dev` all → 149.102.137.139

When the runbook says "ssh hub2" it assumes `ssh -o RemoteCommand=none hub2`.

---

## 0. First sixty seconds — discovery

Before touching anything, learn what's broken. These are read-only.

1. **Are services even up?** `ssh hub2 'systemctl is-active damm-control-plane wg-quick@wg0 damm-heartbeat.timer damm-sync-peers.path caddy'` — five `active` lines is healthy.
2. **Does the public surface answer?** From your laptop: `curl -fsS --max-time 8 https://cp.damm.raindesk.dev/healthz` — expect `{"ok":true,"stateBackend":"json"}`.
3. **Is UDP reachable?** `nc -uvz -w 3 vpn.damm.raindesk.dev 51820` — UDP can lie, but a hard refusal means the listen socket is gone.
4. **Is the gateway healthy in its own state?** `ssh hub2 'jq ".gateways[0].registration" /home/uprootiny/damm-state/state.json'` — `lastHeartbeatAt` should be within the last 60 seconds.
5. **Wireguard interface alive?** `ssh hub2 'sudo -n wg show wg0'` — should list interface, port 51820, plus zero or more peers.
6. **Disk free?** `ssh hub2 'df -h / /home /var | head -10'` — under 5% free is the silent killer.
7. **Recent control-plane log?** `ssh hub2 'tail -n 30 /home/uprootiny/damm-state/control-plane.log'` — anything other than `control-plane listening on http://127.0.0.1:8080` means look harder.
8. **Time correct?** `ssh hub2 'date -u; chronyc tracking 2>/dev/null | head -3'` — drift over a few seconds breaks heartbeat freshness.

Got the picture? Jump to the relevant section. If everything looks fine but a user says it's broken, see §17 (the system reports healthy but isn't).
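
The service check in step 1 can be reduced to a single number for scripting. The sketch below is a minimal example, assuming the output of `systemctl is-active` (one state per line) is piped in; `count_not_active` is a hypothetical helper name, not something shipped with DAMM.

```shell
#!/usr/bin/env bash
# Sketch only: count_not_active is a hypothetical helper, not part of DAMM.
# stdin: output of `systemctl is-active unit1 unit2 ...` (one state per line).
count_not_active() {
  # -x matches whole lines, -v inverts, -c prints the count.
  # grep exits 1 when the count is 0, so the status is neutralised.
  grep -vcx 'active' || true
}
```

Usage: `ssh hub2 'systemctl is-active damm-control-plane wg-quick@wg0 damm-heartbeat.timer damm-sync-peers.path caddy' | count_not_active` prints the number of units that are not `active`.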

---

## 1. Control plane process down or crashed

**Detection**: §0.1 shows `inactive` or `failed` for `damm-control-plane`; or §0.2 fails with a connection error; or 502 from Caddy.

1.1. `ssh hub2 'systemctl status damm-control-plane --no-pager | tail -20'` — read the last `Active:` line and the recent journal.
1.2. `ssh hub2 'sudo -n journalctl -u damm-control-plane -n 80 --no-pager'` — find the panic, stack trace, or the first error before the exit.
1.3. If it died from a syntax error in a recent code change: `ssh hub2 'cd ~/damm && git status && git log -3 --oneline'` — note any uncommitted edits.
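1.4. If the env file is the suspect: `ssh hub2 'sudo -n systemctl cat damm-control-plane | grep EnvironmentFile && bash -ec "set -a; source /home/uprootiny/damm-state/env; set +a; echo ok"'`. A non-zero exit (and no `ok`) means a malformed value broke the parse. Use `-e` here, not `-n`: `bash -n` only syntax-checks the command string and never reads the env file.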
1.5. Try a clean restart: `ssh hub2 'sudo -n systemctl restart damm-control-plane && sleep 2 && systemctl is-active damm-control-plane'`.
1.6. If it crash-loops, freeze it for inspection: `ssh hub2 'sudo -n systemctl stop damm-control-plane'` then run the binary by hand to see the live output: `ssh hub2 'cd ~/damm && set -a; source /home/uprootiny/damm-state/env; set +a; node control-plane/server.js 2>&1 | head -30'` — the error is now in your terminal.
1.7. If it's an `EADDRINUSE`: `ssh hub2 'ss -lntp | grep :8080'` — another process holds the port. Kill it (`kill <pid>`) and retry. Often a previous node process didn't exit cleanly.
1.8. If it's a JSON parse error on `state.json`: see §5.
1.9. Once it starts, re-run §0.2 and §0.4 to confirm it's serving and the gateway is fresh.
1.10. Tag the incident in the operator notes so future-you can find it: `ssh hub2 'echo "$(date -Is) restart: <reason>" >> /home/uprootiny/damm-state/operator-notes.log'`. Type the reason in place of `<reason>`; a local `$REASON` variable would not expand inside the single-quoted remote command.
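
The env-file check in 1.4 can be wrapped so it is safe to run from an interactive shell. This is a minimal sketch, assuming bash is available; `check_env_file` is a hypothetical helper name. It sources the file in a child shell with `-e`, so the first failing line (for example an unquoted value containing a space) produces a non-zero exit without touching your current environment.

```shell
#!/usr/bin/env bash
# Sketch only: check_env_file is a hypothetical helper, not part of DAMM.
check_env_file() {
  local env_file=$1
  # Child bash keeps the sourced variables out of the calling shell.
  # -e aborts on the first failing line; all output is discarded.
  bash -ec 'set -a; source "$1"' _ "$env_file" >/dev/null 2>&1
}
```

Usage on hub2: `check_env_file /home/uprootiny/damm-state/env && echo ok`. Note that sourcing executes the file, so only run it against env files you trust.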

---

## 2. WireGuard interface (wg0) down

**Detection**: §0.5 fails or shows "Cannot find device"; clients can't handshake; UDP listen socket gone.

2.1. `ssh hub2 'systemctl status wg-quick@wg0 --no-pager | tail -30'` — note the exit code and the last unit message.
2.2. `ssh hub2 'sudo -n journalctl -u wg-quick@wg0 -n 60 --no-pager'` — usually the failing line is one of the `PostUp` `iptables` invocations.
2.3. Inspect the config: `ssh hub2 'sudo -n cat /etc/wireguard/wg0.conf'` — verify `PrivateKey`, `ListenPort = 51820`, `Address = 10.44.0.1/24`, and the `PostUp`/`PostDown` lines.
2.4. Try `sudo wg-quick down wg0 || true; sudo wg-quick up wg0` and read every line of output.
2.5. If `iptables: Bad rule (does a matching rule exist in that chain?)` — a previous `PostDown` partially removed rules. Manually reconcile: `sudo iptables -t nat -S POSTROUTING | grep 10.44.0.0/24`. Add the missing MASQUERADE: `sudo iptables -t nat -A POSTROUTING -s 10.44.0.0/24 -o eth0 -j MASQUERADE`.
2.6. If the listen port is taken by a stranger: `ss -lunp | grep 51820` and kill the offender if it's ours, or move the port (and update `VPN_GATEWAY_ENDPOINT` in `~/damm-state/env`, then re-register).
2.7. After `wg0` comes back up, re-run the reconciler so peers are restored: `ssh hub2 '/home/uprootiny/damm-state/sync-peers.sh'`.
2.8. Confirm with `sudo -n wg show wg0` that peer count matches `jq '.devices|length' ~/damm-state/state.json`.
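2.9. If the kernel module is missing (`FATAL: Module wireguard not found`): `sudo apt-get install -y wireguard-tools wireguard-dkms` then `sudo modprobe wireguard`. `wireguard-dkms` is only needed on kernels older than 5.6; newer kernels ship the module in-tree, so `wireguard-tools` alone is enough there.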

---

## 3. Gateway stale → `no_active_gateways` at enrollment

**Detection**: §0.4 shows `lastHeartbeatAt` older than ~120 seconds; clients get HTTP 503 `no_active_gateways` from `/v1/devices/enroll`.

3.1. `ssh hub2 'systemctl is-active damm-heartbeat.timer && systemctl list-timers damm-heartbeat.timer --no-pager'` — confirm the timer is enabled and queued.
3.2. `ssh hub2 'sudo -n journalctl -u damm-heartbeat.service -n 30 --no-pager'` — look for non-zero exits.
3.3. Run heartbeat manually and read every line: `ssh hub2 '/home/uprootiny/damm-state/heartbeat.sh; echo "exit=$?"'`.
3.4. If heartbeat.sh fails on `/home/uprootiny/damm-state/env: line N: ...: command not found`, an env value lost its double quotes. See §11.9.
3.5. If heartbeat.sh fails with `401 invalid_gateway_token`, the env's `VPN_GATEWAY_API_TOKEN` does not match `state.json`'s gateway record. See §11.4.
3.6. If heartbeat returns an error from the control plane (5xx), work through §1 first.
3.7. After a successful manual heartbeat, re-check freshness: `ssh hub2 'jq ".gateways[0].registration.lastHeartbeatAt" ~/damm-state/state.json'`.
3.8. If freshness is good but enrollment still 503s: `ssh hub2 'jq ".gateways[0]" ~/damm-state/state.json'` — confirm `status: "active"`, `frontdoors[0].active: true`, and no `example.com` placeholder endpoints (those are skipped in failover but counted as inactive).
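
Eyeballing `lastHeartbeatAt` against the clock is error-prone, so the freshness check can be turned into an age in seconds. This is a minimal sketch, assuming GNU `date` and that the timestamp is an ISO 8601 UTC string ending in `Z`; `heartbeat_age_seconds` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: heartbeat_age_seconds is a hypothetical helper, not part of DAMM.
# Assumes GNU date and an ISO 8601 UTC timestamp such as 2024-01-01T00:00:00.000Z.
heartbeat_age_seconds() {
  local ts=$1 now=${2:-$(date -u +%s)} hb_epoch
  # Drop fractional seconds (".123Z" becomes "Z") to keep the parse simple.
  ts=$(printf '%s' "$ts" | sed -E 's/\.[0-9]+Z$/Z/')
  hb_epoch=$(date -u -d "$ts" +%s) || return 1
  echo $(( now - hb_epoch ))
}
```

Usage: `heartbeat_age_seconds "$(ssh hub2 'jq -r ".gateways[0].registration.lastHeartbeatAt" /home/uprootiny/damm-state/state.json')"`. A result above roughly 120 matches the stale condition in this section's detection line.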

---

## 4. Caddy / public TLS / HTTPS reachability

**Detection**: §0.2 returns SSL alert or DNS error; browser shows cert warning on `cp.damm.raindesk.dev`; ACME log noise.

4.1. `ssh hub2 'systemctl is-active caddy && caddy version'`.
4.2. `ssh hub2 'sudo -n caddy validate --config /etc/caddy/Caddyfile'`.
4.3. Verify our block exists and is correctly shaped: `ssh hub2 'sudo -n grep -A6 "^cp.damm.raindesk.dev" /etc/caddy/Caddyfile'`.
4.4. Verify Caddy is reaching upstream: `ssh hub2 'curl -fsS http://127.0.0.1:8080/healthz'`. If this fails, the problem is the control plane (§1), not Caddy.
4.5. ACME / cert health: `ssh hub2 'sudo -n journalctl -u caddy --since "5 minutes ago" --no-pager | grep -iE "cp.damm|certificate|acme" | tail -10'`. Persistent challenge failures here mean DNS is wrong (rare — wildcard handles it) or another vhost is failing (co-tenant noise — see §4.10).
4.6. If your block was just added and the cert hasn't issued, give it 60 seconds and retry. Caddy backs off ~60s on failure.
4.7. Force re-issue: `ssh hub2 'sudo -n caddy reload --config /etc/caddy/Caddyfile'`. Avoid `caddy restart` unless reload fails.
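4.8. If the cert refuses to issue at all, fall back to staging to learn why without burning rate-limit: temporarily add `tls { ca https://acme-staging-v02.api.letsencrypt.org/directory }` inside the site block (one subdirective per line in the actual Caddyfile), reload, observe the issuance flow, then remove and reload again. `acme_ca` does the same thing but is a global option, so it belongs in the global options block at the top of the Caddyfile, not in a site block.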
4.9. CORS: this surface is `access-control-allow-origin: *` by design (the control plane intentionally allows browser companions on any origin). Don't tighten this without checking `damm.raindesk.dev` and any companion pages that call `/v1/devices/enroll`.
4.10. **Co-tenant noise** is real: this Caddy serves many other domains. Errors from `corpora.raindesk.dev`, `innie.dissemblage.art`, `globe.hyperstitious.art`, etc. in the journal are not DAMM problems. Filter your queries with `grep -iE "cp\.damm|damm\.raindesk|vpn\.damm"` to keep signal high.
4.11. If `https://cp.damm.raindesk.dev/` returns 502/504 under load, see §15 (control plane is fsync-blocked).

---

## 5. State.json — corruption, loss, drift

**Detection**: control plane crash-loops on JSON parse; `jq` errors on the file; reconciler sees garbage.

5.1. **First, do not lose what you have**: `ssh hub2 'cp /home/uprootiny/damm-state/state.json /home/uprootiny/damm-state/state.json.snapshot.$(date +%s)'`.
5.2. Verify the JSON parses: `ssh hub2 'jq . /home/uprootiny/damm-state/state.json | head -3'`. If it parses, the file isn't corrupt — diagnose elsewhere.
5.3. If JSON is truncated (often last bytes lost on crash mid-write): `ssh hub2 'tail -c 200 /home/uprootiny/damm-state/state.json'` — last char should be `}`. If it's something else, the writer was interrupted.
5.4. Recover from snapshot: `ssh hub2 'ls -lt /home/uprootiny/damm-state/state.json.* 2>/dev/null | head -5'`. The control plane writes atomically (write to `state.json.tmp`, then rename), so a clean prior version is usually intact.
5.5. The control plane keeps a `.bak` after each write: `ssh hub2 'ls -la /home/uprootiny/damm-state/state.json.bak 2>/dev/null && jq . /home/uprootiny/damm-state/state.json.bak | head -3'`. If valid, restore: `cp state.json.bak state.json` and restart the CP.
5.6. **Total state loss** (e.g., disk wiped, rm -rf): the control plane will rebootstrap from `~/damm-state/env` on next start. You lose: enrolled devices (need to re-enroll), enrollment passes, signing-key continuity (clients with cached catalogs will reject the new signing key — they re-enroll). You keep: the gateway identity, the egress pool, the access tiers (rebuilt from env).
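5.7. After a state loss, force every wg0 peer out: `ssh hub2 'sudo -n wg show wg0 peers | xargs -r -I{} sudo -n wg set wg0 peer {} remove; sudo -n wg show wg0'`. The public key has to sit between `peer` and `remove`, which is why `-I{}` is used instead of letting `xargs` append the key at the end.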
5.8. **Audit-log growth**: `jq '.adminAuditLog|length' state.json`. If over ~10000, compact: see §13.
5.9. **Enrollment-pass growth**: same — `.enrollmentPasses|length`. If over ~1000 unused/expired, prune.
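
The truncation test from 5.3 can be scripted so it works even when `jq` refuses the file. This is a minimal sketch using only coreutils; `state_json_last_char` and `looks_truncated` are hypothetical helper names, and the check is a heuristic (a closing `}` does not prove the JSON is valid, only that the write was probably not cut short).

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not part of DAMM.
state_json_last_char() {
  # Last non-whitespace character within the final 200 bytes of the file.
  tail -c 200 "$1" | tr -d '[:space:]' | tail -c 1
}

looks_truncated() {
  # Succeeds (exit 0) when the file does not end in a closing brace.
  [ "$(state_json_last_char "$1")" != "}" ]
}
```

Usage on hub2: `looks_truncated /home/uprootiny/damm-state/state.json && echo "writer was interrupted, go to 5.4"`.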

---

## 6. JSON state file is locked / writer wedged

**Detection**: control plane logs go quiet; HTTP requests time out; iostat shows no progress on the device.

6.1. `ssh hub2 'lsof /home/uprootiny/damm-state/state.json 2>/dev/null'` — multiple writers should not exist.
6.2. `ssh hub2 'fuser -v /home/uprootiny/damm-state/state.json 2>&1'` — same idea, different tool.
6.3. If the control plane process is hung in `D` state (uninterruptible sleep on disk): `ssh hub2 'ps -o stat,pid,pcpu,etime,cmd -p $(pgrep -f "node.*control-plane" | head -1)'`. A `D` state points to disk pressure; see §12.
6.4. Last resort: `sudo -n systemctl kill -s KILL damm-control-plane; sleep 1; sudo -n systemctl start damm-control-plane`. You lose any in-flight write; the snapshot from §5.1 is your safety net.

---

## 7. Sync-peers reconciler is failing or drifting

**Detection**: `wg show wg0 peers | wc -l` ≠ `jq '.devices|length' state.json`; or `journalctl -u damm-sync-peers.service` shows non-zero exits.

7.1. `ssh hub2 'sudo -n journalctl -u damm-sync-peers.service --since "10 minutes ago" --no-pager | tail -40'`.
7.2. Run it by hand and read every line: `ssh hub2 '/home/uprootiny/damm-state/sync-peers.sh'`. Healthy run prints either `sync complete: added=0 updated=0 removed=0` or numbered changes.
7.3. If `fopen: No such file or directory` from `wg set ... preshared-key`: a tempfile path is wrong. The current implementation writes the PSK to a `mktemp` file; check `/tmp` is writable.
7.4. If `sudo: a password is required`: the sudoers grant in `/etc/sudoers.d/damm-wg-sync` has been clobbered. Recreate:
    ```
    sudo tee /etc/sudoers.d/damm-wg-sync > /dev/null <<'EOF'
    uprootiny ALL=(root) NOPASSWD: /usr/bin/wg show wg0 peers
    uprootiny ALL=(root) NOPASSWD: /usr/bin/wg show wg0 allowed-ips
    uprootiny ALL=(root) NOPASSWD: /usr/bin/wg set wg0 peer *
    EOF
    sudo chmod 440 /etc/sudoers.d/damm-wg-sync
    sudo visudo -cf /etc/sudoers.d/damm-wg-sync
    ```
7.5. Force a full reconcile: just run §7.2.
7.6. If a peer in `wg0` is *not* in state, the reconciler removes it. If it's a peer you wanted to keep manually (debugging?) — add it to `state.json` as a device record, or stop the path-watcher first.
7.7. **Path-watcher double-fires** are real (the JSON write is atomic via rename, which fires `Modified` then `MovedTo`). The reconciler is idempotent so this is harmless cost. If you want to dampen, add a debounce by wrapping `sync-peers.sh` in `flock -w 5 /run/damm-sync.lock` and a 1-second sleep before applying.
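7.8. Reconciler's sudo cost: each run is 2 `sudo wg show` calls + 1 `wg set` per peer. At a 30s timer that is 120 runs/hour, so 240 `sudo wg show` events/hour plus 120 per peer if every run re-applies each peer, which matches the ~480 sudo events/hour figure at two peers. If audit log volume becomes an issue, raise the timer interval to 60s.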
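
When the peer counts disagree, it helps to see which keys differ rather than just how many. This is a minimal sketch, assuming two text files with one public key per line; `peer_drift` is a hypothetical helper name and is not how `sync-peers.sh` itself is implemented.

```shell
#!/usr/bin/env bash
# Sketch only: peer_drift is a hypothetical helper, not part of DAMM.
# $1: file with keys currently on wg0, $2: file with keys expected by state.
peer_drift() {
  local wg_peers=$1 state_peers=$2
  # comm needs sorted input; process substitution leaves the originals untouched.
  # Keys only in state still need to be added to wg0.
  comm -23 <(sort -u "$state_peers") <(sort -u "$wg_peers") | sed 's/^/add /'
  # Keys only on wg0 are the ones the reconciler would remove.
  comm -13 <(sort -u "$state_peers") <(sort -u "$wg_peers") | sed 's/^/remove /'
}

# Example inputs (the .publicKey field name in state.json is an assumption):
#   ssh hub2 'sudo -n wg show wg0 peers' > wg-peers.txt
#   ssh hub2 'jq -r ".devices[].publicKey" /home/uprootiny/damm-state/state.json' > state-peers.txt
#   peer_drift wg-peers.txt state-peers.txt
```

Empty output means the two sets agree and the drift is somewhere else (for example duplicate device records).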

---

## 8. systemd path-watcher not firing on state changes

**Detection**: device is enrolled, `state.json` has the device, `wg0` does not have the peer, and `journalctl -u damm-sync-peers.service` shows no recent activity.

8.1. `ssh hub2 'systemctl is-active damm-sync-peers.path && systemctl status damm-sync-peers.path --no-pager | tail -10'`.
8.2. `ssh hub2 'sudo -n journalctl -u damm-sync-peers.path --since "5 minutes ago" --no-pager'`.
8.3. Verify the watched path matches reality: `systemctl cat damm-sync-peers.path | grep PathChanged`. Must be `/home/uprootiny/damm-state/state.json`.
8.4. Touch the file to trigger: `ssh hub2 'touch /home/uprootiny/damm-state/state.json && sleep 2 && journalctl -u damm-sync-peers.service -n 5 --no-pager'`.
8.5. If still not firing, restart both the path and service: `sudo systemctl restart damm-sync-peers.path damm-sync-peers.service`.
8.6. If the service runs but does no work, see §7.

---

## 9. Routing / NAT / forwarding lost (often after reboot or kernel upgrade)

**Detection**: tunnel handshake completes (`wg show wg0 latest-handshakes` shows recent activity for the peer), bytes flow on `wg show wg0 transfer`, but the user reports "no internet".

9.1. **IP forwarding**: `ssh hub2 'sysctl net.ipv4.ip_forward'` — must be `1`. If `0`: `sudo sysctl -w net.ipv4.ip_forward=1; echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.d/99-damm.conf`.
9.2. **MASQUERADE rule present**: `ssh hub2 'sudo -n iptables -t nat -S POSTROUTING | grep 10.44.0.0/24'` — should show `-A POSTROUTING -s 10.44.0.0/24 -o eth0 -j MASQUERADE`. If missing, `sudo iptables -t nat -A POSTROUTING -s 10.44.0.0/24 -o eth0 -j MASQUERADE`.
9.3. **FORWARD chain accept**: `ssh hub2 'sudo -n iptables -S FORWARD | grep -E "wg0|10.44.0"'` — should show ACCEPTs. If missing, `sudo iptables -A FORWARD -i wg0 -j ACCEPT; sudo iptables -A FORWARD -o wg0 -j ACCEPT`.
9.4. The wg-quick `PostUp`/`PostDown` adds these. If they're gone after reboot, the unit didn't fire — check §2.1.
9.5. Persist iptables across reboots if not already (Ubuntu): `sudo apt-get install -y iptables-persistent` then `sudo netfilter-persistent save`.
9.6. **MTU / fragmentation**: if pings work but TCP doesn't (browser hangs on TLS handshake), set `MTU = 1380` in `wg0.conf`'s `[Interface]` section and run `sudo wg-quick down wg0 && sudo wg-quick up wg0`, then re-run the reconciler (§2.7) so peers are restored.
9.7. **Egress IP lookup mismatch**: `ssh hub2 'ip route get 1.1.1.1'` — should show `dev eth0 src 149.102.137.139`. If it picks docker0 or a VPN-internal IP, you've got a routing-table conflict.
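
Appending the MASQUERADE rule with `-A` when it already exists leaves a duplicate, so it is worth testing for the rule first. This is a minimal sketch that parses `iptables -t nat -S POSTROUTING` output from stdin; `has_masquerade` is a hypothetical helper name, and the default subnet is the `10.44.0.0/24` used throughout this runbook.

```shell
#!/usr/bin/env bash
# Sketch only: has_masquerade is a hypothetical helper, not part of DAMM.
# stdin: output of `iptables -t nat -S POSTROUTING`.
has_masquerade() {
  local subnet=${1:-10.44.0.0/24}
  # A line must carry both the source subnet and the MASQUERADE target.
  awk -v s="$subnet" '
    index($0, "-s " s " ") && index($0, "-j MASQUERADE") { found = 1 }
    END { exit !found }'
}
```

Usage: `ssh hub2 'sudo -n iptables -t nat -S POSTROUTING' | has_masquerade || echo "rule missing, apply 9.2"`. On the host itself, `iptables -t nat -C POSTROUTING ...` performs an equivalent existence check.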

---

## 10. DNS resolution problems

**Detection**: clients connect, traffic flows, but `https://example.com` fails with name resolution errors; or hub2 itself can't resolve hosts.

10.1. `ssh hub2 'systemctl status systemd-resolved --no-pager | tail -5'`.
10.2. `ssh hub2 'resolvectl status | head -20'` — confirm a resolver is set.
10.3. `ssh hub2 'dig +short cp.damm.raindesk.dev'` — must be `149.102.137.139`.
10.4. If clients fail DNS but the host succeeds, check that the client's `.conf` has the `DNS = 1.1.1.1, 1.0.0.1` line. If it does, our side is fine and the problem is upstream of us.
10.5. If they're using "pi-hole at home" via the tunnel, that's a routing-of-DNS-traffic issue — out of scope here, but the client should set `DNS = <home-resolver-IP>` themselves.

---

## 11. Token / secret drift between env and state

**Detection**: heartbeat returns `401 invalid_gateway_token`; admin login fails; enrollment fails with auth errors.

11.1. `ssh hub2 'jq ".gateways[0].apiToken" /home/uprootiny/damm-state/state.json'` — token in state.
11.2. `ssh hub2 'grep ^VPN_GATEWAY_API_TOKEN= /home/uprootiny/damm-state/env'` — token in env.
11.3. They must match. If they don't, update env to match state (preferred — state is durable) or update state to match env.
11.4. To rotate cleanly:
    - Generate fresh token: `T=$(node -e 'console.log(require("crypto").randomBytes(24).toString("hex"))')`
    - Patch state: `jq ".gateways[0].apiToken = \"$T\"" state.json > state.json.tmp && mv state.json.tmp state.json`
    - Patch env: `sed -i "s|^VPN_GATEWAY_API_TOKEN=.*|VPN_GATEWAY_API_TOKEN=\"$T\"|" env`
    - Restart CP: `sudo -n systemctl restart damm-control-plane`
    - Run heartbeat once to confirm: `/home/uprootiny/damm-state/heartbeat.sh`
11.5. **VPN_ENROLLMENT_TOKEN** can be rotated similarly. Existing enrollment passes (different objects) keep working.
11.6. **VPN_ADMIN_BOOTSTRAP_TOKEN** — used to mint short-lived admin bearers. Rotate and reissue any operator scripts that depend on it.
11.7. **VPN_ADMIN_SIGNING_SECRET** — used for HMAC of admin tokens. Rotating invalidates all live admin sessions; that's fine.
11.8. **Catalog signing key** — lives in `state.catalog.privateKey`. **Do NOT regenerate casually**: every enrolled device cached the public half and will reject catalogs signed by a fresh key. Force re-enrollment first if you have to.
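11.9. Quoting trap: env values containing spaces or shell metacharacters such as `(`, `;`, `&`, or `$` need double quotes for shell `source`. In `VPN_ALLOWED_IPS="0.0.0.0/0, ::/0"` it is the space after the comma that breaks an unquoted value: the shell treats `::/0` as a command to run. systemd's `EnvironmentFile=` honors quoted values too. Always quote.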
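
The comparison in 11.1 to 11.3 can be scripted so nobody has to diff two tokens by eye. This is a minimal sketch, assuming the env file uses plain `KEY="value"` or `KEY=value` lines; `env_value` and `tokens_match` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not part of DAMM.
env_value() {
  local env_file=$1 key=$2
  # Take the last assignment of KEY, then strip the key and any surrounding double quotes.
  grep -E "^${key}=" "$env_file" | tail -n 1 \
    | sed -E "s/^${key}=//; s/^\"(.*)\"\$/\1/"
}

tokens_match() {
  # $1: env file, $2: token as stored in state.json
  [ "$(env_value "$1" VPN_GATEWAY_API_TOKEN)" = "$2" ]
}
```

Usage on hub2: `tokens_match ~/damm-state/env "$(jq -r '.gateways[0].apiToken' ~/damm-state/state.json)" || echo "drift: see 11.3"`.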

---

## 12. Disk full / disk pressure

**Detection**: control plane stalls; sync-peers fails to `mktemp`; heartbeat times out; `df` shows ≥95% full.

12.1. `ssh hub2 'df -h / /home /var | head -10'`.
12.2. `ssh hub2 'sudo -n du -sh /var/log/* 2>/dev/null | sort -h | tail -10'` — usual culprits: journald, nginx access logs, docker.
12.3. If journald is the offender: `sudo journalctl --vacuum-size=200M`.
12.4. If `/home/uprootiny/damm-state/control-plane.log` is gigantic: `> /home/uprootiny/damm-state/control-plane.log` (truncate; the unit appends so it'll just keep going). Add logrotate config: see §16.
12.5. `/tmp` filling up from leaked PSK temp files: `ls /tmp/tmp.* 2>/dev/null | wc -l` — if hundreds, the reconciler isn't cleaning up. Inspect `~/damm-state/sync-peers.sh` for the `rm -f "$psk_file"` line.
12.6. After clearing, restart the CP if it was wedged: §1.5.
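
The 95% threshold from the detection line can be checked mechanically. This is a minimal sketch that reads POSIX-format `df -P` output from stdin and prints only the mount points at or above a threshold; `full_filesystems` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: full_filesystems is a hypothetical helper, not part of DAMM.
# stdin: output of `df -P` (column 5 is the use percentage, column 6 the mount point).
full_filesystems() {
  local threshold=${1:-95}
  awk -v t="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)              # "97%" becomes "97"
      if (use + 0 >= t + 0) print $6, use "%"
    }'
}
```

Usage: `ssh hub2 'df -P / /home /var' | full_filesystems 95`. Empty output means no filesystem has crossed the threshold.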

---

## 13. State growth — audit log, enrollment passes accumulating

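**Detection**: `wc -c state.json` over ~1MB; `jq '(.adminAuditLog|length), (.enrollmentPasses|length)' state.json` reports many thousands. The parentheses matter: in jq, `,` binds tighter than `|`, so the unparenthesized form fails.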

13.1. Compact audit log to last 1000 entries:
    ```
    jq '.adminAuditLog |= (.[-1000:])' state.json > state.json.tmp && mv state.json.tmp state.json
    sudo systemctl restart damm-control-plane
    ```
13.2. Prune expired/used enrollment passes:
    ```
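    # UTC timestamp so the string comparison below lines up with UTC ISO 8601 values
    # (assumes expiresAt is stored that way; `date -Is` would emit the local offset instead).
    NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)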
    jq --arg now "$NOW" '.enrollmentPasses |= map(select((.expiresAt // "9999") > $now and (.usedCount // 0) < (.maxUses // 1)))' state.json > state.json.tmp && mv state.json.tmp state.json
    sudo systemctl restart damm-control-plane
    ```
13.3. Snapshot before doing either (§5.1).

---

## 14. ACME / TLS cert won't renew

**Detection**: cert expiry approaching; Caddy log shows persistent failures for `cp.damm.raindesk.dev`.

14.1. Confirm the failure is ours, not co-tenant: §4.5.
14.2. Verify port 80 is reachable from the public internet (Let's Encrypt's HTTP-01 needs it): `curl -fsS http://cp.damm.raindesk.dev/.well-known/acme-challenge/ping || true` — should be 404 from Caddy (means it's reaching us), not connection-refused.
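14.3. If it's connection-refused, check that Caddy is binding `:80`: `ssh hub2 'ss -lntp | grep -E ":80\s"'`. The trailing `\s` keeps `:8080` out of the match.
14.4. If Caddy is hitting Let's Encrypt rate limits, switch the failing site to ZeroSSL or Buypass temporarily by adding `tls { issuer acme { dir https://acme.zerossl.com/v2/DV90 } }` to the site block (written with one subdirective per line in the actual Caddyfile; the `acme` issuer takes `dir`, not `ca`).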
14.5. As a manual fallback only: `sudo certbot certonly --webroot -w /home/uprootiny/damm-site -d cp.damm.raindesk.dev`, then point Caddy at the cert with `tls /etc/letsencrypt/live/cp.damm.raindesk.dev/fullchain.pem /etc/letsencrypt/live/cp.damm.raindesk.dev/privkey.pem`.

---

## 15. Caddy returns 502/504 to clients (control plane slow, not down)

**Detection**: §0.1 says CP is `active`; §4.4 (direct `curl` to `127.0.0.1:8080`) succeeds; but `https://cp.damm.raindesk.dev/healthz` returns 502.

15.1. Hit upstream directly: `ssh hub2 'curl -fsS http://127.0.0.1:8080/healthz; time curl -fsS http://127.0.0.1:8080/v1/catalog'`. Look at the latency.
15.2. If catalog fetch is slow (>500ms): the CP is doing a synchronous JSON write on every state change. Under load this is the bottleneck. The Postgres backend exists for this reason — see `docs/runbook.md` "Postgres smoke procedure".
15.3. CPU or RSS pegged: `ssh hub2 'ps -o pid,rss,pcpu,etime,cmd -p $(pgrep -f node.*control-plane | head -1)'`. If RSS is climbing fast, restart (§1.5) and instrument.
15.4. Caddy timeout: `caddy.transport.http.dial_timeout` defaults are usually fine. If you're sitting at exactly 30s for every 504, that's a Caddy default; set the timeouts explicitly inside the site block: `reverse_proxy 127.0.0.1:8080 { transport http { dial_timeout 5s read_timeout 10s } }` (one subdirective per line in the actual Caddyfile).

---

## 16. Logs not rotating

**Detection**: `~/damm-state/control-plane.log` grows without bound; eventually triggers §12.

16.1. Add a logrotate stanza:
    ```
    sudo tee /etc/logrotate.d/damm > /dev/null <<'EOF'
    /home/uprootiny/damm-state/*.log {
      daily
      rotate 7
      missingok
      notifempty
      compress
      delaycompress
      copytruncate
      su uprootiny uprootiny
    }
    EOF
    ```
16.2. Test: `sudo logrotate -d /etc/logrotate.d/damm` (dry-run), then `sudo logrotate -f /etc/logrotate.d/damm`.
16.3. The `copytruncate` mode means we don't have to signal the CP — it keeps writing to the same file; we just lop off the head daily.

---

## 17. Everything reports healthy but a user says it doesn't work

**Detection**: §0 all green; user is unhappy.

17.1. Confirm what *they* see. Their wg client UI tells you a lot: handshake age, RX/TX bytes.
17.2. `ssh hub2 'sudo -n wg show wg0 latest-handshakes'` — find their pubkey, see when their last handshake was.
17.3. If handshake is recent but bytes don't flow: it's almost always client-side routing (split-tunnel rule on their machine pushing traffic outside the tunnel). Have them try `curl ifconfig.me` — if the IP isn't 149.102.137.139, traffic isn't going through us.
17.4. If handshake never happened: their UDP is blocked. They need T1 (AmneziaWG) or T2 (Shadowsocks-wrapped) — neither is shipped yet. Tell them, and queue the work.
17.5. If MTU issues (small pings work, big TLS fails): see §9.6.
17.6. Run the diagnose CLI for them: `ssh hub2 'cd ~/damm && node client/client.js diagnose --device <their-device.json> --config <their-wg.conf> --out /tmp/dx.json && cat /tmp/dx.json'` — produces a sanitized support packet.
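
The handshake lookup in 17.2 can be reduced to one answer per user. This is a minimal sketch that reads `wg show wg0 latest-handshakes` output (public key, tab, epoch seconds, with `0` meaning no handshake yet) from stdin; `handshake_age` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: handshake_age is a hypothetical helper, not part of DAMM.
# stdin: output of `wg show wg0 latest-handshakes`.
handshake_age() {
  local pubkey=$1 now=${2:-$(date +%s)}
  awk -v k="$pubkey" -v now="$now" '
    $1 == k {
      found = 1
      if ($2 == 0) print "never"     # peer configured, no handshake yet (see 17.4)
      else print now - $2            # seconds since the last handshake
    }
    END { if (!found) print "unknown" }'   # key is not a peer on wg0 at all (see §7)
}
```

Usage: `ssh hub2 'sudo -n wg show wg0 latest-handshakes' | handshake_age '<their-pubkey>'`.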

---

## 18. Public IP rotated by provider

**Detection**: `dig +short vpn.damm.raindesk.dev` no longer returns 149.102.137.139; or DNS hasn't updated but the host's `ip addr show eth0` differs.

18.1. If DNS is stale: update the record at the registrar. Wildcard `*.raindesk.dev` lives at Porkbun (see auto-memory `reference_porkbun.md`); update there.
18.2. If host IP changed: `ip addr show eth0`, note the new IP.
18.3. Update env: `VPN_EGRESS_IPS="<new-ip>"`.
18.4. **Caveat**: existing client `.conf` files still have `Endpoint = vpn.damm.raindesk.dev:51820` which will resolve to the new IP — *if* the client's resolver isn't caching aggressively. Mobile WireGuard apps sometimes cache. Worst case: re-issue configs.
18.5. The `whoami` endpoint compares against `VPN_EGRESS_IPS`; without an update, it'll report `gatewayMatch: false` even when the user is connected.

---

## 19. SSH lockout / lost access to hub2

**Detection**: SSH hangs or refuses; you're staring at a service from the outside only.

19.1. Try the provider's web console. Contabo (and most VPS providers) offer a serial console.
19.2. If you can get a console, repair `/etc/ssh/sshd_config` or `~/.ssh/authorized_keys`.
19.3. The fleet has `tailscale0` interfaces (visible in §21.1's `ip -br link`). If Tailscale is configured, `tailscale ssh hub2` is a fallback.
19.4. As a last resort, the provider can rebuild the box. `~/damm-state/` is the only thing not in the git repo — back it up off-host.

---

## 20. Compromise / leaked credential

**Detection**: unexpected gateway registrations; admin audit log shows actions you didn't take; surge of enrollments from a single IP.

20.1. Snapshot state: §5.1.
20.2. Rotate every secret in `~/damm-state/env` (§11.4 for the gateway token; same recipe for the others).
20.3. Restart CP: `sudo systemctl restart damm-control-plane`.
20.4. Force re-enrollment of every device by clearing `.devices` and resetting `wg0` peers: `jq '.devices = []' state.json > state.json.tmp && mv state.json.tmp state.json`. Restart the CP so it reloads the file (as in §11.4 and §13), then §7.5.
20.5. **Do NOT** rotate the catalog signing key unless absolutely necessary — that breaks every cached client. Prefer a forced re-enroll cycle.
20.6. Inspect the audit log for the attacker's actions and write up a postmortem in `~/damm-state/operator-notes.log`.
20.7. If the gateway *private* key (in `gateway-keypair.json`) leaked, regenerate it (`node -e ...` snippet from §1 of CLAUDE memory), update env, restart CP, restart wg0, force re-enroll.

---

## 21. Tailscale or other VPN co-tenant interfering

**Detection**: routing oddities; specific destinations work and others don't; `ip route` shows surprising entries.

21.1. `ssh hub2 'ip -br link' | grep -E "tailscale|wg|tun"` — list all VPN interfaces.
21.2. `ssh hub2 'ip route show table all | head -40'` — read the full routing table.
21.3. Tailscale's `100.x.y.z` exit-node logic can fight WireGuard's NAT if you're using Tailscale exit on the same host. Don't use both as exit nodes.
21.4. If Tailscale is just providing management access (not exit), it shouldn't conflict — but confirm `ip rule` for surprising priorities: `sudo -n ip rule show`.

---

## 22. Process zombies, port stuck, can't restart

**Detection**: `EADDRINUSE` on CP restart even though `systemctl is-active` says inactive.

22.1. `ssh hub2 'sudo -n ss -lntp | grep :8080'` — find the PID holding it.
22.2. If it's a node process from a previous run: `sudo kill -9 <pid>; sleep 2; sudo systemctl start damm-control-plane`.
22.3. If it's not ours: investigate the rogue process before killing.
22.4. For the WireGuard side, `wg-quick down wg0` may fail if `PostDown` rules are partially gone. `sudo ip link delete wg0; sudo wg-quick up wg0` is the brute-force version.

---

## 23. Misordered systemd dependencies

**Detection**: at boot, services come up in the wrong order; CP starts before wg0 is up; heartbeat fires before CP is ready.

23.1. The unit graph should be:
    - `network-online.target` → `wg-quick@wg0.service` → `damm-control-plane.service`
    - `damm-control-plane.service` → `damm-heartbeat.timer`
    - `wg-quick@wg0.service` + `damm-control-plane.service` → `damm-sync-peers.path` → `damm-sync-peers.service`
23.2. Verify: `ssh hub2 'systemctl list-dependencies damm-control-plane.service damm-heartbeat.timer damm-sync-peers.service'`.
23.3. The `[Unit] After=` directives in `/etc/systemd/system/damm-*.service` enforce ordering. Inspect: `systemctl cat damm-control-plane damm-heartbeat damm-sync-peers`.
23.4. If a unit insists on starting first, add `Wants=` and `After=` more strictly. `Requires=` is too strong (cascades failures).

---

## 24. NTP / time skew breaks heartbeat freshness

**Detection**: gateway shows `lastHeartbeatAt` in the future, or in the distant past, even though the timer fires.

24.1. `ssh hub2 'date -u; chronyc tracking 2>/dev/null'` — check offset.
24.2. If chrony is missing: `sudo apt-get install -y chrony && sudo systemctl enable --now chrony`.
24.3. WireGuard handshake initiations carry a timestamp that must keep increasing per peer, so a clock that jumps backwards breaks handshakes silently. Force a sync: `sudo chronyc -a makestep`.
24.4. The CP uses `Date.now()` for heartbeat freshness. Skew on hub2 → false-positive staleness. Skew on the gateway-machine vs control-plane-machine in a future split deploy is a known footgun.
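
The offset check in 24.1 can be made pass/fail. This is a minimal sketch that reads `chronyc tracking` output from stdin and looks at the `System time` line; `clock_skewed` is a hypothetical helper name, the 2-second default is an assumption standing in for "a few seconds", and the parsing assumes the usual `System time     : <seconds> seconds fast|slow of NTP time` layout.

```shell
#!/usr/bin/env bash
# Sketch only: clock_skewed is a hypothetical helper, not part of DAMM.
# stdin: output of `chronyc tracking`. Exit 0 means the offset exceeds the limit.
clock_skewed() {
  local max=${1:-2}
  awk -F': *' -v max="$max" '
    /^System time/ {
      split($2, parts, " ")          # parts[1] is the offset in seconds
      if (parts[1] + 0 > max + 0) bad = 1
    }
    END { exit !bad }'
}
```

Usage: `ssh hub2 'chronyc tracking' | clock_skewed 2 && echo "skewed: apply 24.3"`.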

---

## 25. Public install-pass abuse / rate-limit too tight

**Detection**: legitimate visitors get `429 rate_limited` from `/v1/public/install-pass`.

25.1. Default is 3 passes per IP per hour. Adjust via env:
    - `VPN_PUBLIC_PASS_MAX=5` (per window)
    - `VPN_PUBLIC_PASS_WINDOW_MS=3600000` (1h)
25.2. Restart CP after env edit.
25.3. **Note**: rate limiter is in-memory only. Restarting CP resets all counters — so a CP restart loop also resets the abuse window. Acceptable trade for now.
25.4. If a single IP is hammering you: `sudo iptables -I INPUT -s <ip> -p tcp --dport 443 -j DROP` for the immediate fix; remove later.

---

## 26. Whoami reports `gatewayMatch: false` for connected clients

**Detection**: a known-connected user calls `/v1/whoami` and we return `gatewayMatch: false`.

26.1. Check `VPN_EGRESS_IPS` matches the actual egress: `ssh hub2 'curl -s ifconfig.me'` should equal an entry in `jq '.egressPools[0].egressIps' state.json`.
26.2. If the user's request is coming through a CDN or proxy, `X-Forwarded-For` may have a different IP than the gateway. The CP trusts XFF by default (`VPN_TRUST_FORWARDED=true`). For a cleaner check, also expose `request.socket.remoteAddress` in the response so debug clients can see both.
26.3. IPv6: if a user's client has IPv6 and the gateway is IPv4-only, they won't go through us at all. Add the gateway's IPv6 to `egressIps` if hub2 has one (`ip -6 addr show eth0`).

---

## When to escalate / give up cleanly

If after working through the relevant section the system is still wedged for ≥ 30 minutes, and the failure mode looks novel:

- Stop poking. Capture state for a postmortem.
- `ssh hub2 'tar czf /tmp/damm-incident-$(date +%s).tgz /home/uprootiny/damm-state /etc/wireguard/wg0.conf /etc/systemd/system/damm-* /etc/caddy/Caddyfile 2>/dev/null'` and pull it down.
- Bring up the most degraded acceptable mode: just the static `damm.raindesk.dev` site (which doesn't depend on the CP) so visitors see something rather than a void.
- Write the incident note in `~/damm-state/operator-notes.log` while it's fresh.

---

## Sanity check after any intervention

A complete green pass:

1. `systemctl is-active damm-control-plane wg-quick@wg0 damm-heartbeat.timer damm-sync-peers.path caddy` — all `active`.
2. `curl -fsS https://cp.damm.raindesk.dev/healthz` — `{"ok":true,...}`.
3. `jq ".gateways[0].registration.lastHeartbeatAt" ~/damm-state/state.json` — within last 60s.
4. `sudo -n wg show wg0` — interface up, port 51820, peer count matches `jq '.devices|length' state.json`.
5. `cd ~/damm && npm test 2>&1 | tail -5` — 43/43 passing (44 with postgres if `DATABASE_URL` is set).

If all five pass, the system is back in its known-green shape. Close the incident.
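
The green pass lends itself to a small runner that prints one PASS or FAIL line per check. This is a minimal sketch run from the operator's laptop; `run_check` and `green_pass` are hypothetical helper names, and only three of the five checks are wired in (the heartbeat-age and `npm test` checks need their own parsing).

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not part of DAMM.
run_check() {
  local name=$1; shift
  # Run the remaining arguments as a command; only the exit status matters.
  if "$@" >/dev/null 2>&1; then
    echo "PASS $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

green_pass() {
  local failed=0
  run_check services ssh -o RemoteCommand=none hub2 \
    'systemctl is-active damm-control-plane wg-quick@wg0 damm-heartbeat.timer damm-sync-peers.path caddy' || failed=1
  run_check healthz curl -fsS --max-time 8 https://cp.damm.raindesk.dev/healthz || failed=1
  run_check wg0 ssh -o RemoteCommand=none hub2 'sudo -n wg show wg0' || failed=1
  return "$failed"
}
```

Usage: `green_pass && echo "close the incident"`. Every check runs even after a failure, so a single pass shows the whole picture.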
