# DAMM Coordination Layer

The architectural posture for absorbing third-party providers as first-class building blocks. This is what lets us be competitive with — and built on top of — Tailscale, Cloudflare, GCE, AWS, Hetzner, and whatever else shows up next, without re-cutting the system each time.

## Why this matters now

The deployed system is a single gateway on a single VPS at one provider. The architecture documents promise a fleet across providers and ASNs, treating each provider's outage as statistical noise. That promise costs nothing to make and a lot to keep. The cheap way to keep it is a clean coordination layer where adding a provider means writing one adapter file, registering it, and letting the policy gates pick it up automatically. The expensive way is to keep growing if-statements until "Hetzner" is hardcoded in three different decisions.

We have the shape (`orchestrator/adapters.js`, `orchestrator/providers.js`, `orchestrator/policy.js`) — what we don't have is load-bearing use. The orchestrator is CLI-only. The Provider interface is compute-centric; tunnel-providers and overlay-providers don't fit. Let's fix both.

## The thesis

A **provider** is anything that gives DAMM a piece of the path between client and destination. A provider can give us:

- **Compute** — VMs we run software on.
- **Ingress capacity** — the public surface a client connects to.
- **Egress capacity** — the outbound NAT face we leave through.
- **Overlay** — a private network we can ride on instead of the public internet.
- **CDN-fronting** — an HTTPS-shaped front that hides our real ingress IP.
- **Tunnel-termination** — their tunnel, not ours, that we redirect through.

A single provider may give us several of these. Hetzner gives compute + ingress + egress (their VMs do all three). Cloudflare gives CDN-fronting + tunnel-termination + a global edge but no traditional VMs. Tailscale gives overlay + DNS + identity. AWS gives all six in different services. The coordination layer's job is to expose these as **composable capabilities**, not vendor-shaped bundles.

The Provider interface says: *here are the things you can do, here is how to do them.* The orchestrator says: *for this region, this load, this transport, this user policy, who can do what?* The system says yes to whoever's healthy, cheap enough, and politically allowed.
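
As a rough illustration of the "who can do what?" question, the sketch below encodes the capability sets described above in a plain table and queries it. The table shape and the helper names `providersWith` and `covers` are assumptions made for this example, not existing orchestrator code.

```javascript
// Capability sets taken from the provider descriptions above.
// The table layout and helper names are illustrative assumptions.
const PROVIDER_CAPABILITIES = {
  hetzner: ["compute", "ingress", "egress"],
  cloudflare: ["cdn_front", "tunnel_term"],
  tailscale: ["overlay"],
  aws: ["compute", "ingress", "egress", "overlay", "cdn_front", "tunnel_term"],
};

// Return the ids of every provider that advertises the given capability.
function providersWith(capability) {
  return Object.entries(PROVIDER_CAPABILITIES)
    .filter(([, caps]) => caps.includes(capability))
    .map(([id]) => id);
}

// Return true when the chosen providers together cover every needed capability.
function covers(providerIds, needed) {
  const have = new Set(
    providerIds.flatMap((id) => PROVIDER_CAPABILITIES[id] ?? [])
  );
  return needed.every((cap) => have.has(cap));
}

console.log(providersWith("egress"));
console.log(covers(["cloudflare", "hetzner"], ["cdn_front", "ingress", "egress"]));
```

Health, cost, and policy would filter this list further; the point here is only that the question is asked of capabilities, not of vendor names.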

## The interface

```
interface Provider {
  // Methods marked `?` are optional: an adapter implements them only when it
  // advertises the matching capability. Health methods are always required.

  // Identity
  id: string                       // "hetzner", "cloudflare", "aws-eu-west", ...
  capabilities: Capability[]       // see below
  region(damm_region): string | null // map our regions to the provider's regions

  // Compute (if capabilities include 'compute')
  createIngressNode?(spec): Result<NodeRef>
  createEgressNode?(spec): Result<NodeRef>
  drainNode?(nodeRef): Result<void>
  retireNode?(nodeRef): Result<void>
  listNodes?(filter): Result<Node[]>

  // Ingress capacity (if 'ingress')
  allocateIngressFrontdoor?(spec): Result<FrontdoorRef>
  releaseIngressFrontdoor?(ref): Result<void>

  // Egress capacity (if 'egress')
  allocateEgressIp?(spec): Result<EgressIpRef>
  releaseEgressIp?(ref): Result<void>

  // Overlay (if 'overlay')
  joinOverlay?(spec): Result<OverlayMembership>
  leaveOverlay?(membership): Result<void>

  // CDN-fronting (if 'cdn_front')
  publishFront?(spec): Result<FrontRef>      // returns the public hostname/edge endpoint
  retireFront?(ref): Result<void>

  // Tunnel-termination (if 'tunnel_term')
  registerTunnel?(spec): Result<TunnelRef>   // their managed tunnel, we redirect through
  retireTunnel?(ref): Result<void>

  // Health (always)
  probeHealth(target): Result<HealthSignal>
  getMetrics(target): Result<Metrics>
}
```

Every method outside the Health group is **optional**. An adapter implements only the methods its capabilities advertise, plus `probeHealth` and `getMetrics`, which every adapter provides. Capabilities are a static set, declared at adapter registration time:

```
type Capability =
  | "compute"        // VMs
  | "ingress"        // public WireGuard surface
  | "egress"         // outbound NAT
  | "overlay"        // their network, we ride
  | "cdn_front"      // HTTPS-fronted ingress
  | "tunnel_term"    // their tunnel, we redirect through
```

This widening is the smallest delta that turns the existing compute-centric `BaseAdapter` into a substrate that fits Cloudflare and Tailscale.

## Adapter pattern

Each provider gets one file, `orchestrator/adapters/<provider>.js`, exporting one class extending `BaseAdapter`. The class declares its capabilities as a static field and implements only those methods. `BaseAdapter`'s default impls return `{ mode: "plan", capability: "<unimplemented>" }` so a missing method is a clean no-op rather than a crash.
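
A minimal sketch of that pattern, assuming the default return value quoted above. The method list mirrors the interface; `ExampleOverlayAdapter` is a made-up provider used only to show a subclass overriding one capability's methods.

```javascript
// Hypothetical sketch of a widened BaseAdapter, not the shipped implementation.
const OPTIONAL_METHODS = [
  "createIngressNode", "createEgressNode", "drainNode", "retireNode", "listNodes",
  "allocateIngressFrontdoor", "releaseIngressFrontdoor",
  "allocateEgressIp", "releaseEgressIp",
  "joinOverlay", "leaveOverlay",
  "publishFront", "retireFront",
  "registerTunnel", "retireTunnel",
];

class BaseAdapter {
  // Subclasses override this with the capabilities they actually provide.
  static capabilities = [];
}

// Give every optional method a harmless default so callers never crash
// on a provider that lacks the capability.
for (const name of OPTIONAL_METHODS) {
  BaseAdapter.prototype[name] = function unimplemented() {
    return { mode: "plan", capability: "<unimplemented>" };
  };
}

// Made-up provider: only the overlay method is implemented.
class ExampleOverlayAdapter extends BaseAdapter {
  static capabilities = ["overlay"];

  joinOverlay(spec) {
    return { mode: "plan", capability: "overlay", hostname: spec.hostname };
  }
}

const adapter = new ExampleOverlayAdapter();
console.log(adapter.joinOverlay({ hostname: "gw-1" }));
console.log(adapter.createIngressNode({ region: "eu" }));
```

Calling a method the provider does not support returns the no-op marker instead of throwing, which is what lets the orchestrator iterate over adapters without capability checks at every call site.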

The registry (`orchestrator/providers.js`) is the only place that imports adapter classes. It exposes a single `getAdapter(name): Provider` function. The rest of the orchestrator works against the interface, not the implementations.

A new provider is **one new file plus one line in the registry**. That's the property the user named when they said "easily digest us adding another."
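
The registry side might look like the sketch below. The stand-in classes exist only so the example runs on its own; in the repo they would be imported from `orchestrator/adapters/<provider>.js`. Caching one instance per provider and throwing on unknown names are assumptions, not documented behavior.

```javascript
// Stand-in adapter classes so the sketch is self-contained.
class HetznerAdapter {
  static capabilities = ["compute", "ingress", "egress"];
}
class DigitalOceanAdapter {
  static capabilities = ["compute", "ingress", "egress"];
}

// The only place that knows about concrete adapter classes.
const REGISTRY = new Map([
  ["hetzner", HetznerAdapter],
  ["digitalocean", DigitalOceanAdapter], // the "one line" for a new provider
]);

const instances = new Map();

// Look up an adapter by provider name, creating it on first use.
function getAdapter(name) {
  const AdapterClass = REGISTRY.get(name);
  if (!AdapterClass) throw new Error(`unknown provider: ${name}`);
  if (!instances.has(name)) instances.set(name, new AdapterClass());
  return instances.get(name);
}

console.log(getAdapter("hetzner").constructor.capabilities);
```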

## Capability composition — the interesting bit

Most users won't talk to one provider. They'll talk to a path that's been stitched from several:

- Client → **Cloudflare Workers (cdn_front)** → **Hetzner-hosted ingress (ingress + compute)** → **AWS NAT Gateway (egress)** → destination
- Client → **Tailscale overlay (overlay)** → DAMM ingress on a Tailscale node → destination
- Client → **bare WG to Hetzner (compute + ingress + egress)** → destination [the Phase 0 path]

The orchestrator picks among compositions. Each composition has a score (latency, cost, blockability). The signed catalog publishes the top-K compositions to clients. Failover is composition-level: if `Cloudflare-front + Hetzner-ingress + AWS-egress` degrades, the next candidate might be `Cloudflare-front + Hetzner-ingress + Hetzner-egress` (egress fallback) or `bare-WG-to-Hetzner` (front fallback). The client never has to reason about the composition; it just picks the catalog entry with the best score for its measured conditions.
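
Composition-level failover can be sketched as a filter plus a sort. The scores, the `provider:capability` component labels, and the `topK` helper are invented for illustration; real scoring would combine latency, cost, and blockability.

```javascript
// Hypothetical compositions; scores are made-up inputs.
const compositions = [
  {
    pathId: "cf+hetzner+aws",
    components: ["cloudflare:cdn_front", "hetzner:ingress", "aws:egress"],
    score: 0.92,
  },
  {
    pathId: "cf+hetzner+hetzner",
    components: ["cloudflare:cdn_front", "hetzner:ingress", "hetzner:egress"],
    score: 0.88,
  },
  {
    pathId: "bare-wg-hetzner",
    components: ["hetzner:ingress", "hetzner:egress"],
    score: 0.8,
  },
];

// Drop any composition that touches a degraded component, then keep the
// k best-scoring survivors. filter() returns a copy, so the input is untouched.
function topK(all, degraded, k) {
  return all
    .filter((c) => !c.components.some((component) => degraded.has(component)))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// AWS egress degrades: the egress fallback and the bare path remain.
console.log(topK(compositions, new Set(["aws:egress"]), 2).map((c) => c.pathId));
```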

This is what "competitive in principle" looks like in code: we don't pick one of these stacks and call it the product. The product is the chooser.

## Orchestrator's role

The orchestrator does four things, on a tick:

1. **Probe** every active node, frontdoor, egress IP, overlay membership.
2. **Score** each path-component by recency-weighted signals (handshake success, RTT distribution, throughput, client-reported quality).
3. **Plan** what to add, what to drain, what to retire — gated by `policy.js` (provider allowlist, region allowlist, per-run caps).
4. **Apply** the plan via the relevant adapters' `create` / `drain` / `retire` methods.
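
One tick of that loop, reduced to its skeleton. This is a sketch under several assumptions: probes are synchronous, the score is a simple exponentially weighted moving average of probe success, and the policy object has the hypothetical fields `providers`, `drainBelow`, and `maxActionsPerRun`.

```javascript
// Recency-weighted score: exponentially weighted moving average of a 0..1 signal.
function ewma(previous, sample, alpha = 0.3) {
  return previous === undefined ? sample : alpha * sample + (1 - alpha) * previous;
}

// One tick: probe, score, plan, apply. Returns the actions that were applied.
function tick(state, adapters, policy) {
  const plan = [];
  for (const node of state.nodes) {
    const signal = adapters[node.provider].probeHealth(node.id); // 1. probe
    node.score = ewma(node.score, signal.ok ? 1 : 0);            // 2. score
    if (node.score < policy.drainBelow) {
      plan.push({ action: "drain", node });                      // 3. plan
    }
  }
  // Policy gate: provider allowlist first, then the per-run cap.
  const gated = plan
    .filter((step) => policy.providers.includes(step.node.provider))
    .slice(0, policy.maxActionsPerRun);
  for (const step of gated) {
    adapters[step.node.provider].drainNode(step.node.id);        // 4. apply
  }
  return gated.map((step) => `${step.action}:${step.node.id}`);
}

// Fake adapter for the demo: node "n2" always fails its probe.
const drained = [];
const adapters = {
  hetzner: {
    probeHealth: (id) => ({ ok: id !== "n2" }),
    drainNode: (id) => drained.push(id),
  },
};
const state = {
  nodes: [
    { id: "n1", provider: "hetzner" },
    { id: "n2", provider: "hetzner" },
  ],
};
const policy = { providers: ["hetzner"], drainBelow: 0.5, maxActionsPerRun: 1 };

console.log(tick(state, adapters, policy));
```

As a service, the same function would be driven by a timer (for example `setInterval` with a 60-second period) rather than called once.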

Today the orchestrator runs on demand from a CLI. To make the coordination layer load-bearing, it needs to run as a service: `damm-orchestrator.service`, ticking every 60s, writing its decisions into the same `state.json` the control plane reads from. The control plane stays out of orchestration; it just publishes whatever fleet the orchestrator has assembled.

This is the Phase 3 ask: orchestrator-as-service.

## Provider-by-provider competitive analysis

For each of the providers the user named, here's what they offer, what we'd integrate, and where the architecture lets us compete head-on rather than be a strict subset.

### Tailscale (overlay + identity + magic discovery)

**What they give**: a global mesh overlay, NAT-traversal magic, identity (Google/Microsoft/GitHub SSO), DNS for peer names, exit-node capability.

**What we'd integrate**: a Tailscale adapter with `capabilities: ["overlay"]` (Tailscale supplies no VMs, so the gateway's compute still comes from another provider). `joinOverlay()` enrolls a DAMM gateway as a Tailscale node; the gateway's WG-side then sits on a Tailscale IP. Clients with Tailscale already running see the gateway on their tailnet; they don't need our public ingress. For users without Tailscale, the gateway still has its public WG endpoint, so it serves both.

**Where we compete**: Tailscale is best-in-class for "private mesh between machines you own." We're better at "anonymous public VPN your friend can use" (Tailscale requires sign-up). And we're better at *transport polymorphism* — Tailscale uses one protocol family; we ride many. The architecture absorbs Tailscale as a tool we use without committing to their identity model.

**The win**: a DAMM gateway running in someone's Tailscale-private cluster gives them a direct internal route, with no public exposure for friend-traffic. They can also expose it publicly for outsiders. Both, simultaneously, from one box.

### Cloudflare (CDN-fronting + tunnels + Workers)

**What they give**: a global edge network (300+ POPs), Cloudflare Tunnel (their tunneling protocol), Workers (edge compute), DNS, R2 storage. Their abuse-detection is good, their reputation is enormous.

**What we'd integrate**: a Cloudflare adapter with `capabilities: ["cdn_front", "tunnel_term"]`. `publishFront()` creates a Cloudflare-fronted hostname pointing at one of our gateways; clients reach `https://wg-front-7.damm.example` (a CF-hosted hostname), which cannot be blocked by IP address without blocking Cloudflare wholesale. WireGuard packets travel either inside WSS over that HTTPS connection or through Cloudflare's Spectrum UDP proxying.

**Where we compete**: Cloudflare wants you to have *their* identity, *their* tunnel, *their* control plane. We use them as a substrate. Their tunnel is one of our T-tier transports, not the protocol of record. We can drop them and run bare WireGuard the same hour their pricing changes.

**The win**: against ISP-driven blacklisting, putting an entire AS (Cloudflare's) between us and a censor is a real defense — they can't block CF without breaking half the internet. Cost: real per-byte fees on the egress side; we'd cap CF-fronted paths to censored-network users only.

### AWS / GCE / Azure (compute + everything)

**What they give**: VMs anywhere in the world, NAT gateways, load balancers, identity. Expensive but vast.

**What we'd integrate**: adapters for each, with `capabilities: ["compute", "ingress", "egress"]`. The Phase 0 model (bare-WG on a VM) directly translates. The orchestrator's policy file decides per-region per-provider how many we'll run — typically 0 for AWS unless someone needs that specific region.

**Where we compete**: head-on with their offerings (AWS Client VPN, Google Cloud VPN, Azure VPN Gateway). Those are enterprise products with five-figure annual costs. We're a fortieth of the price for a friend-network-scale deployment, and we're transport-polymorphic where they are not.

**The win**: regional coverage we cannot achieve cheaply elsewhere. AWS in São Paulo, GCE in Tokyo, Azure in Cape Town. Pay only for what's used; the orchestrator drops them when scores drop.

### Hetzner / OVH / DigitalOcean / Vultr / Contabo (independent compute)

**What they give**: cheap, simple compute. No magic. Predictable bills.

**What we'd integrate**: already done for Hetzner; trivial extension for the rest. `capabilities: ["compute", "ingress", "egress"]` each.

**Where we compete**: nowhere — these are our backbone. The architecture's resilience claim depends on having compute across this set, not just Hetzner. The win is provider diversity for ASN/jurisdiction reasons.

### Self-hosted / on-prem (special)

**What they give**: a friend says "I have a Raspberry Pi at my house, can I help?"

**What we'd integrate**: a `static-host` adapter with `capabilities: ["ingress", "egress"]` (no compute API; the host is provisioned by hand). Heartbeat works the same; orchestrator can't create or retire, just include/exclude.

**Where we compete**: this is the long-tail of trust-network capacity. A friend's home box is a DAMM gateway with a different threat profile (dynamic IP, possibly behind CGNAT) and a different cost profile (free). The architecture absorbs it the same way it absorbs a Hetzner VM.
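
A `static-host` adapter could be as small as the sketch below. The `setIncluded` method name and the host record fields are assumptions; the only behavior taken from the text is that create and retire do nothing while include/exclude works.

```javascript
// Hypothetical static-host adapter: hosts are provisioned by hand,
// so the adapter can only include or exclude them.
class StaticHostAdapter {
  static capabilities = ["ingress", "egress"];

  constructor(hosts) {
    // Every hand-provisioned host starts out included in the fleet.
    this.hosts = new Map(hosts.map((h) => [h.id, { ...h, included: true }]));
  }

  // No compute API: creation and retirement are deliberate no-ops.
  createIngressNode() {
    return { mode: "plan", capability: "<unimplemented>" };
  }
  retireNode() {
    return { mode: "plan", capability: "<unimplemented>" };
  }

  // Toggle whether a known host takes part in the fleet.
  setIncluded(id, included) {
    const host = this.hosts.get(id);
    if (!host) throw new Error(`unknown static host: ${id}`);
    host.included = included;
  }

  listNodes() {
    return [...this.hosts.values()].filter((h) => h.included);
  }
}

const pi = new StaticHostAdapter([
  { id: "friend-pi", endpoint: "203.0.113.10:51820" },
]);
pi.setIncluded("friend-pi", false);
console.log(pi.listNodes().length);
```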

## Concrete shipping order

This is what to do, in order, after the v0.2.x stream we just shipped.

### Phase 3.0 — widen the interface (1 day, no new providers)

- Add `Capability[]` to `BaseAdapter` (default `["compute", "ingress", "egress"]`).
- Add stub methods for `joinOverlay`, `publishFront`, `registerTunnel`, `probeHealth`, `getMetrics` to `BaseAdapter` (default `{ mode: "plan", capability: "<unimplemented>" }`, the same no-op result the other `BaseAdapter` defaults return).
- `HetznerAdapter` declares `capabilities = ["compute", "ingress", "egress"]` explicitly.
- Add a `getAdapter(name)` registry function in `providers.js`.
- One contract test: every registered adapter implements at least one capability and survives `probeHealth("dummy-target")` without crashing.
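
The contract test from the last bullet could take roughly this shape. It assumes `probeHealth` is synchronous and that capabilities live on a static class field; both adapter classes here are stand-ins written for the example.

```javascript
// Collect contract violations for every registered adapter.
function checkAdapterContract(registry) {
  const failures = [];
  for (const [name, adapter] of Object.entries(registry)) {
    const caps = adapter.constructor.capabilities ?? [];
    if (caps.length === 0) failures.push(`${name}: no capabilities declared`);
    try {
      adapter.probeHealth("dummy-target");
    } catch (err) {
      failures.push(`${name}: probeHealth threw: ${err.message}`);
    }
  }
  return failures;
}

// Stand-in adapters: one well-behaved, one that violates both rules.
class GoodAdapter {
  static capabilities = ["compute"];
  probeHealth() {
    return { ok: true };
  }
}
class BrokenAdapter {
  static capabilities = [];
  probeHealth() {
    throw new Error("no client configured");
  }
}

const failures = checkAdapterContract({
  good: new GoodAdapter(),
  broken: new BrokenAdapter(),
});
console.log(failures);
```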

### Phase 3.1 — orchestrator as service (2 days)

- `damm-orchestrator.service` running on hub2 (later: separate host).
- 60s tick: probe → score → plan → apply.
- Probes are read-only against existing nodes; planning produces a JSON that goes into `state.orchestrator.desiredFleet`; apply calls adapter methods.
- For now: only Hetzner adapter active, no live provisioning (mode=plan), just fleet-scoring.
- The CP reads `state.orchestrator.desiredFleet` and includes only healthy entries in catalogs.
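
On the control-plane side, the last bullet amounts to a filter over `state.orchestrator.desiredFleet`. The entry fields (`healthy`, `endpoint`, `provider`) are assumed for illustration; the real shape of a fleet entry is not specified here.

```javascript
// Hypothetical state as the orchestrator might write it.
const state = {
  orchestrator: {
    desiredFleet: [
      { gatewayId: "gw-1", provider: "hetzner", endpoint: "198.51.100.7:51820", healthy: true },
      { gatewayId: "gw-2", provider: "hetzner", endpoint: "198.51.100.8:51820", healthy: false },
    ],
  },
};

// The control plane publishes only healthy entries and tolerates
// a state file the orchestrator has not written to yet.
function catalogEntries(currentState) {
  const fleet = currentState.orchestrator?.desiredFleet ?? [];
  return fleet
    .filter((entry) => entry.healthy)
    .map(({ gatewayId, endpoint }) => ({ gatewayId, endpoint }));
}

console.log(catalogEntries(state));
```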

### Phase 3.2 — second compute adapter (1 day)

- DigitalOcean adapter (the simplest non-Hetzner). Verifies the abstraction actually composes.
- Policy file gets a DO block. Smoke mode runs against DO-staging-equivalent.

### Phase 3.3 — first non-compute adapter (3–4 days)

- Cloudflare adapter, capabilities `["cdn_front"]` only (no compute).
- `publishFront(spec)` creates a CF-hosted hostname pointing at one of our gateways via DNS-only or worker-route.
- The published catalog gets a new "front" component per gateway: a CF-fronted hostname alongside the bare IP.
- Clients see one gateway with two front-doors (T0 bare, T3 CF-fronted) and pick by score.

### Phase 3.4 — overlay adapter (3–4 days)

- Tailscale adapter, capabilities `["overlay"]`.
- `joinOverlay(spec)` enrolls a gateway as a tailnet member; gateways gain a `100.x.y.z` Tailscale IP.
- Clients with Tailscale running see the gateway on their tailnet; we publish those `100.x.y.z` addresses in catalogs filtered by user-provided tailnet identity.
- Users without Tailscale see the public addresses as before.
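
The catalog filtering in the last two bullets can be sketched as follows. Tailscale assigns node addresses from the CGNAT range 100.64.0.0/10, which is what `isTailscaleAddress` checks. The `tailnet` field on clients and entries is a hypothetical stand-in for "user-provided tailnet identity".

```javascript
// True for IPv4 addresses inside 100.64.0.0/10 (second octet 64..127).
function isTailscaleAddress(ip) {
  const octets = ip.split(".").map(Number);
  return (
    octets.length === 4 &&
    octets[0] === 100 &&
    octets[1] >= 64 &&
    octets[1] <= 127
  );
}

// Tailnet addresses are shown only to clients on the same tailnet;
// public addresses are shown to everyone.
function catalogFor(client, entries) {
  return entries.filter(
    (entry) =>
      !isTailscaleAddress(entry.address) ||
      (client.tailnet !== undefined && entry.tailnet === client.tailnet)
  );
}

const entries = [
  { gatewayId: "gw-1", address: "198.51.100.7" },
  { gatewayId: "gw-1", address: "100.101.102.103", tailnet: "friends.example" },
];

console.log(catalogFor({ tailnet: "friends.example" }, entries).length);
console.log(catalogFor({}, entries).length);
```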

### Phase 3.5 — score-driven provisioning (2 weeks)

- The orchestrator's score loop now actually creates/retires nodes when scores cross thresholds.
- Policy gates remain authoritative — no spending caps exceeded.
- Warm-spare pool: M pre-provisioned nodes per region, not in catalog, ready to be promoted in seconds.
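
A sketch of how the three bullets interact for one region: scores trigger action, warm spares are preferred, and the spending cap has the final say. All field names and thresholds (`scaleUpBelow`, `capUsd`, `nodeCostUsd`, `spentUsd`, `warmSpares`) are invented for the example.

```javascript
// Decide what to do for one region. Returns a single action name.
function planRegion(region, policy) {
  // Healthy enough: leave the region alone.
  if (region.score >= policy.scaleUpBelow) return "hold";
  // Prefer promoting a warm spare, which costs nothing new.
  if (region.warmSpares > 0) return "promote-spare";
  // Otherwise create a node only if the spending cap allows it.
  const projected = region.spentUsd + policy.nodeCostUsd;
  return projected <= policy.capUsd ? "create-node" : "blocked-by-cap";
}

const policy = { scaleUpBelow: 0.6, capUsd: 50, nodeCostUsd: 5 };

console.log(planRegion({ score: 0.9, warmSpares: 1, spentUsd: 10 }, policy));
console.log(planRegion({ score: 0.4, warmSpares: 1, spentUsd: 10 }, policy));
console.log(planRegion({ score: 0.4, warmSpares: 0, spentUsd: 10 }, policy));
console.log(planRegion({ score: 0.4, warmSpares: 0, spentUsd: 48 }, policy));
```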

### Phase 3.6 — competitive surfaces (ongoing)

- Phase 3.6a: a "private mesh" mode where DAMM acts like Tailscale for a small group (devices see each other, exit-through-DAMM optional).
- Phase 3.6b: an "expose-this-thing" mode like Cloudflare Tunnel (a DAMM device runs as the server, others reach it through the gateway).
- Phase 3.6c: multi-hop routing through the overlay before egress (Tor-shaped, optional, latency-aware).

## What this changes for the catalog

A coordination layer with capabilities means catalog entries get richer. Today's catalog entry is `{ gatewayId, frontdoor: { endpoint, transport, priority } }`. The Phase 3+ catalog entry is `{ pathId, composition: [...components], score, transports: [T0, T3, ...] }` — a *composition* of provider components plus the transports it supports. Clients pick a path; the components are how we got there.

This is a forward-compatible change: today's `frontdoor` is a degenerate one-component composition. We can ship widening incrementally.
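
To make the "degenerate one-component composition" concrete, here is one possible mapping from today's entry to the Phase 3+ shape. The `pathId` format and the decision to pass the score in as an argument are assumptions; only the field names come from the two shapes above.

```javascript
// Wrap a current catalog entry as a one-component composition.
function toPathEntry(entry, score) {
  const { gatewayId, frontdoor } = entry;
  return {
    pathId: `path:${gatewayId}`,
    composition: [
      { gatewayId, endpoint: frontdoor.endpoint, priority: frontdoor.priority },
    ],
    score,
    transports: [frontdoor.transport],
  };
}

const today = {
  gatewayId: "gw-1",
  frontdoor: { endpoint: "198.51.100.7:51820", transport: "T0", priority: 1 },
};

console.log(JSON.stringify(toPathEntry(today, 0.8), null, 2));
```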

## What this changes for the wizard

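Nothing, today. The wizard talks to the CP, gets a tunnel, hands the user a `.conf`. The `.conf` references *one* endpoint. As compositions multiply, the `.conf` can include multiple `[Peer]` blocks with `AllowedIPs` slicing, and most users won't see the difference. One caveat: WireGuard itself does not fail over between peers. `AllowedIPs` maps each destination range to exactly one peer, so moving traffic to a different path means changing the configuration (a re-issued `.conf`, or a client-side helper that swaps the peer or endpoint).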

## The integrity claim

The architecture's resilience claim is real iff: given a single provider going dark, the system continues at ≥ 70% capacity within 5 minutes. The coordination layer is the mechanism by which that's true. Today, with one provider, the claim is hollow. Phases 3.0–3.2 make it provable for compute, 3.3–3.4 make it provable for transport, and 3.5 closes the loop with auto-provisioning.
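
The capacity half of that claim is simple arithmetic: remove each provider in turn and see what fraction of capacity is left. The sketch below computes the worst case. The capacity units and fleet numbers are made up, and it measures only what survives immediately, not what re-provisioning restores within the 5-minute window.

```javascript
// Find the single provider whose loss hurts most, and the fraction
// of total capacity that would remain without it.
function worstSingleProviderLoss(capacityByProvider) {
  const entries = Object.entries(capacityByProvider);
  const total = entries.reduce((sum, [, capacity]) => sum + capacity, 0);
  if (total === 0) return { provider: null, remaining: 0 };

  let worst = { provider: null, remaining: 1 };
  for (const [provider, capacity] of entries) {
    const remaining = (total - capacity) / total;
    if (remaining < worst.remaining) worst = { provider, remaining };
  }
  return worst;
}

// Today: one provider, so losing it leaves nothing.
console.log(worstSingleProviderLoss({ hetzner: 1 }));

// A hypothetical spread fleet: losing the largest provider leaves 70%.
console.log(worstSingleProviderLoss({ hetzner: 3, digitalocean: 3, vultr: 2, ovh: 2 }));
```

A fleet passes the capacity part of the claim when `remaining >= 0.7` for the worst case.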

We can keep promising the claim before all of that ships, or we can stop. The honest move is to publish the integrity claim with a concrete "validated for: <list>" tag and only check off providers as they're real.

## What competitive *doesn't* mean

It doesn't mean we replicate every feature each provider has. It means: for the things DAMM does — anonymous client VPN, friend-network mesh, censorship-resistant exit — we are **at least as good** as the comparable offering at each provider, and **strictly better** at the things the provider can't or won't do (multi-provider resilience, transport polymorphism, no-signup-required onboarding). We do not try to be Tailscale, Cloudflare, and AWS Client VPN at once. We are a chooser among them.

## A working ground-truth check

When this layer is real, an operator can answer the question:

> "If Hetzner went down right now, what would happen to a connected user?"

with three concrete numbers:
- How many alternate paths are in the catalog right now (count).
- What fraction of users can fail over without re-enrollment (percent).
- How long until orchestrator-driven re-provisioning fills the capacity gap (minutes).

Today those numbers are 0, 0%, and "manual operator intervention." After Phase 3, they should be ≥3, ≥95%, and ≤5 minutes. That's the integrity claim, written as numbers we can measure.
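
The first two numbers can be computed directly from the catalog, as in this sketch. It assumes each catalog path lists the providers it depends on and each user record lists the path ids already in that user's catalog; both shapes are hypothetical. The third number (minutes to re-provision) has to be measured, not computed.

```javascript
// Count alternate paths and the share of users who already hold one,
// for the case where the given provider disappears.
function failoverReadiness(catalog, users, provider) {
  const alternates = catalog.filter((path) => !path.providers.includes(provider));
  const alternateIds = new Set(alternates.map((path) => path.pathId));
  const covered = users.filter((user) =>
    user.pathIds.some((id) => alternateIds.has(id))
  );
  const failoverPercent =
    users.length === 0 ? 0 : (100 * covered.length) / users.length;
  return { alternatePaths: alternates.length, failoverPercent };
}

// Today's situation: one path, entirely on Hetzner.
const catalog = [{ pathId: "bare-wg-hetzner", providers: ["hetzner"] }];
const users = [{ id: "u1", pathIds: ["bare-wg-hetzner"] }];

console.log(failoverReadiness(catalog, users, "hetzner"));
```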
