# DAMM Architecture Map — Resilience by Statistics

This is the architectural posture DAMM is converging on. It is not a list of features. It is a description of how the system stays alive and useful for an end-user when the network around them is hostile, degraded, or simply mediocre — without ever having to know *why* a path stopped working.

## The principle

A connection's quality is a scalar. It varies over time. It varies across paths. The system's job is to keep the scalar high for the user.

The system does not need to know whether a low scalar means deep packet inspection, blacklisting, BGP poisoning, DNS poisoning, sudden congestion, a tired peer, or a mistuned MTU. All of those collapse to the same observation: this path is delivering less signal per second than it could. The architecture treats the cause as opaque and the consequence as primary.

Said differently: the adversary has many faces and many tricks. The architecture has one response: continuously observe many paths, route weight toward the better ones, drop weight from the worse ones, replenish the pool when paths exhaust, and never freeze on any single assumption about which path is best.

This makes the architecture **threat-blind**: the runtime behaves the same whether the path is bad because of a censor, a flaky cable, or a noisy neighbor. We extract signal; we do not chase ghosts.

## Layers, top to bottom

```
┌──────────────────────────────────────────────────────────────┐
│ client                                                       │
│   - holds private key                                        │
│   - holds signed catalog of N front doors                    │
│   - measures handshake success, throughput, RTT per try      │
│   - rotates front doors based on its own measurements        │
│   - never trusts unsigned route changes                      │
└─────┬────────────────────────────────────────────────────────┘
      │
      │  N parallel transport flavors (T0/T1/T2/T3/T4)
      │  - bare WireGuard / junked / Shadowsocks / WSS / fronted
      │  - same packet payload, different shells
      │  - each shell selectable per session, swappable mid-flight
      │
┌─────▼────────────────────────────────────────────────────────┐
│ ingress fleet                                                │
│   - many gateway processes across providers, ASNs, regions   │
│   - each gateway exposes M front doors (different IPs/ports) │
│   - heartbeats with composite quality score                  │
│   - dies easily, replaced quickly by orchestrator            │
└─────┬────────────────────────────────────────────────────────┘
      │
      │  internal mesh (private network or wireguard-overlay)
      │  - decoupled from public reachability
      │
┌─────▼────────────────────────────────────────────────────────┐
│ egress fleet                                                 │
│   - separate pool, optimized for outbound reputation         │
│   - K outbound IPs across providers, ASNs                    │
│   - sequester pool: parallel egress with isolated reputation │
│   - rotated independently of ingress                         │
└─────┬────────────────────────────────────────────────────────┘
      │
      ▼
   destination internet

       control loop runs across all of this:
       observation → score → ranking → catalog → client → measurement
```

The key shape: **ingress and egress are separate inventories with separate lifecycles.** A blocked ingress doesn't take an egress down with it. A burned egress doesn't ruin an ingress's usefulness. They're connected by an internal mesh whose addressing is invisible to the public.

## The signal-extraction loop

This is the part that makes the system threat-blind.

Every path in the system is continuously scored. The score is a single scalar, computed from many observables:

- **Reachability** — does a probe at this front door connect?
- **Handshake success rate** — over the last N attempts, what fraction completed?
- **Throughput slope** — does the path sustain its advertised bandwidth?
- **Latency distribution** — median, p90, p99; rising tail = degrading signal
- **Heartbeat freshness** — does the gateway behind it respond inside its TTL?
- **Client-reported quality** — clients themselves report measurements upstream
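
As a rough illustration of how two of the observables above might be collected, here is a minimal sketch. The `HandshakeWindow` class, the `latency_tail` helper, and the window size are hypothetical names and values chosen for this example; the source does not specify how observables are gathered.

```python
from collections import deque
from statistics import quantiles


class HandshakeWindow:
    """Sliding window over the last `size` handshake attempts (hypothetical helper)."""

    def __init__(self, size=20):
        # deque(maxlen=...) silently discards the oldest result once full.
        self._results = deque(maxlen=size)

    def record(self, ok):
        self._results.append(bool(ok))

    def success_rate(self):
        """Fraction of recorded attempts that completed, or None if empty."""
        if not self._results:
            return None
        return sum(self._results) / len(self._results)


def latency_tail(samples_ms):
    """Median, p90 and p99 of latency samples (needs at least two samples)."""
    cuts = quantiles(samples_ms, n=100)  # 99 cut points: cuts[k-1] is the k-th percentile
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}
```

A rising gap between `p50` and `p99` across successive windows is one way to read the "rising tail" signal described in the list.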

These observables enter a weighted decay function. Recent observations dominate; old observations fade. The decay constant is short (minutes, not hours) so the system reacts quickly to a path going bad and quickly to a path coming back.
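
One plausible form for such a decay function is an exponentially time-decayed weighted mean. The sketch below is an assumption, not the deployed scoring code: the `Observation` type, the normalization of values to `[0, 1]`, and the five-minute half-life are all illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    timestamp: float      # seconds on any consistent clock
    value: float          # normalized quality in [0, 1]; 1.0 is best (assumption)
    weight: float = 1.0   # relative importance of this observable (assumption)


def decayed_score(observations, now, half_life_s=300.0):
    """Exponentially time-decayed weighted mean of observations.

    An observation that is one half-life old counts half as much as a
    fresh one. Returns None when there is nothing left to score.
    """
    num = 0.0
    den = 0.0
    for obs in observations:
        age = max(0.0, now - obs.timestamp)
        decay = 0.5 ** (age / half_life_s)
        num += obs.weight * decay * obs.value
        den += obs.weight * decay
    return num / den if den > 0 else None
```

With a 300-second half-life, a perfect observation from ten minutes ago (decay 0.25) and a failed observation from right now (decay 1.0) combine to a score of 0.2: the recent failure dominates, as the paragraph above describes.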

The output: a score per (gateway × front door × egress pool). Gateways scoring below a threshold are dropped from the next published catalog; egress pools below threshold are skipped in placement; front doors with degraded scores are demoted, not removed (a demoted front door remains the failover target if the primary takes a hit later).
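
The drop-versus-demote rule can be sketched as a small catalog builder. The threshold values, the `FrontDoor` type, and the `"primary"` / `"failover"` labels are hypothetical; the source states only that gateways are dropped and front doors are demoted.

```python
from dataclasses import dataclass

DROP_THRESHOLD = 0.3     # hypothetical: gateways below this leave the catalog
DEMOTE_THRESHOLD = 0.6   # hypothetical: front doors below this become failover


@dataclass(frozen=True)
class FrontDoor:
    gateway_id: str
    address: str
    score: float


def build_catalog(front_doors, gateway_scores):
    """Return (address, role) pairs for the next catalog, best score first.

    Front doors of a dropped gateway are omitted entirely; low-scoring
    front doors of a healthy gateway stay in, labeled as failover.
    """
    entries = []
    for fd in sorted(front_doors, key=lambda fd: -fd.score):
        if gateway_scores.get(fd.gateway_id, 0.0) < DROP_THRESHOLD:
            continue  # gateway dropped from this catalog
        role = "failover" if fd.score < DEMOTE_THRESHOLD else "primary"
        entries.append((fd.address, role))
    return entries
```

Keeping demoted entries in the output, rather than filtering them, is what lets a client fall back to them without fetching a new catalog.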

The score does not name the cause. A front door that's been actively DPI'd looks identical, in score-space, to one whose upstream provider had a bad ten minutes. The system treats both with the same response: route around it for now, watch it, restore weight when its score climbs.

## The catalog as the contract

The client never holds a hardcoded view of "the gateway". It holds a **signed catalog** — a list of front doors, signed by the control plane's catalog key, with a freshness timestamp.

The catalog is small (kilobytes), publishable through any channel, and re-fetchable on demand. When a client suspects its current path has gone bad, it pulls a fresh catalog, picks the next-best entry, and dials it.

This makes the routing topology a **first-class published artifact**, not an implicit consequence of which IP got cached in DNS. The control plane decides which paths to advertise based on the score loop above; the client picks among advertised paths based on its own local measurements.

The signature matters: clients reject unsigned or stale catalogs. A network adversary cannot inject "use this front door instead" — a signed catalog is the only way new routes enter the client's mind.
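
A minimal sketch of the client-side acceptance check follows. Two loud assumptions: a real deployment would presumably use a public-key signature so that clients hold only a verification key, but HMAC-SHA256 stands in here to keep the example runnable on the standard library alone; and the `issued_at` field name and one-hour freshness window are hypothetical.

```python
import hashlib
import hmac
import json
import time

MAX_CATALOG_AGE_S = 3600  # hypothetical freshness window


def sign_catalog(catalog, key):
    """Stand-in signer: HMAC-SHA256 over a canonical JSON encoding."""
    payload = json.dumps(catalog, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def accept_catalog(catalog, signature, key, now=None):
    """Accept only catalogs that are both correctly signed and fresh."""
    now = time.time() if now is None else now
    if not signature:
        return False  # unsigned catalogs are rejected outright
    expected = sign_catalog(catalog, key)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return False
    age = now - catalog.get("issued_at", 0)
    return 0 <= age <= MAX_CATALOG_AGE_S
```

The order of checks is deliberate: the signature is verified before the timestamp is trusted, since an unauthenticated timestamp proves nothing about freshness.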

## Polymorphic transport

The same WireGuard payload can ride inside many shells:

- **T0.** Bare UDP.
- **T1.** AmneziaWG (junk packets plus magic-header rewrite). Defeats off-the-shelf DPI signatures.
- **T2.** The same payload tunneled through a TCP cover (Shadowsocks or similar). Defeats UDP blocking.
- **T3.** The payload wrapped in TLS over 443, so it looks like HTTPS. Defeats most SNI-respecting DPI.
- **T4.** T3 fronted through a CDN. Defeats almost everything except SNI introspection.

Each tier has a **stated tradeoff**: throughput cost, latency cost, blocks-defeated, client compatibility. The system measures the actual cost in production and surfaces it in the operator UI. Choices are made with measurements, not myths.

The tiers are not a ladder we climb under pressure. They are a parallel set of front doors, advertised together, that the score loop can up-weight and down-weight independently. A client whose T0 stops handshaking can dial T1 from the same catalog without re-enrolling. A new T3 shell coming online doesn't disturb the T0 users who don't need it.

## Gateway disposability

Each gateway is built to be cheap to launch and cheap to retire. The orchestrator can:

- create new ingress nodes in any of N providers
- attach front doors (fresh IPs / ports / SNIs) to existing gateways
- drain a gateway (stop accepting new sessions, let live ones expire)
- retire a gateway (remove from catalog, kill the process)

A gateway has no precious state. Its private key lives only on its host. Its peer table comes from the control plane's published state. If a gateway is lost, a new one stands up with a new keypair, registers, and starts being advertised within one heartbeat cycle.
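
The drain and retire operations can be pictured as a small state machine. This is a sketch under assumptions: the state names, the `Gateway` class, and the rule that a gateway is ready to retire once its last session expires are illustrative, not taken from the orchestrator.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    DRAINING = "draining"
    RETIRED = "retired"


@dataclass
class Gateway:
    gateway_id: str
    state: State = State.ACTIVE
    live_sessions: int = 0

    def accept_session(self):
        """Only an active gateway takes new sessions."""
        if self.state is not State.ACTIVE:
            return False
        self.live_sessions += 1
        return True

    def drain(self):
        """Stop accepting new sessions; live ones are left to expire."""
        if self.state is State.ACTIVE:
            self.state = State.DRAINING

    def session_expired(self):
        self.live_sessions = max(0, self.live_sessions - 1)

    @property
    def ready_to_retire(self):
        return self.state is State.DRAINING and self.live_sessions == 0

    def retire(self):
        """Remove from service; any remaining sessions are cut."""
        self.state = State.RETIRED
        self.live_sessions = 0
```

Because the object carries nothing beyond an id, a state, and a session count, replacing it is cheap, which is the disposability property this section describes.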

This means: when paths get burned faster than they can be rotated, the answer is not to harden a single path; it is to spin up more paths, faster.

## Egress separation

Inbound reachability and outbound reputation are different problems with different blame radii. Conflating them means a single bad event ruins both.

Two egress pools per region:

- **Main exodus** — shared NAT, fast, cleaner reputation. Default for everyone.
- **Quarantined exodus** — separate provider, separate ASN, separate IP space, separate reputation. Smaller, slower, isolated.

Routing to one or the other is a per-flow decision based on either pre-declared user rules (CIDR / domain / SNI tags) or a session-scoped flag. The gateway implements this with `fwmark` per flow + ip rule lookups against a routing table specific to the chosen pool.
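
The per-flow decision might look like the sketch below. The mark values, routing table ids, and the example CIDR and domain rules are all hypothetical; the source names the mechanism (`fwmark` plus `ip rule`) but not the numbers.

```python
import ipaddress

# Hypothetical marks and routing table ids; the source specifies neither.
POOLS = {
    "main": {"fwmark": 0x1, "table": 100},
    "quarantine": {"fwmark": 0x2, "table": 200},
}

# Example pre-declared user rules (documentation ranges and domains only).
QUARANTINE_CIDRS = [ipaddress.ip_network("203.0.113.0/24")]
QUARANTINE_DOMAINS = {"example.org"}


def choose_pool(dst_ip, sni=None, session_quarantined=False):
    """Pick the egress pool for one flow: session flag first, then rules."""
    if session_quarantined:
        return "quarantine"
    addr = ipaddress.ip_address(dst_ip)
    if any(addr in net for net in QUARANTINE_CIDRS):
        return "quarantine"
    if sni and any(sni == d or sni.endswith("." + d) for d in QUARANTINE_DOMAINS):
        return "quarantine"
    return "main"


def ip_rule_commands():
    """One `ip rule` per pool, mapping its fwmark to its routing table."""
    return [
        f"ip rule add fwmark {p['fwmark']:#x} lookup {p['table']}"
        for p in POOLS.values()
    ]
```

The gateway would then set the chosen pool's mark on the flow's packets (for example via a firewall rule), and the kernel's policy routing selects the matching table.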

The point is *not* to give one user better treatment than another. The point is that one flow's behavior does not pollute the IP that another flow sits behind. Quarantine (the sequester pool in the layer diagram) is for traffic where reputation contagion is the risk; the main exodus is the default for everything else.

## Independence properties

Some properties the architecture aims to preserve:

- **Loss of one front door** does not lose its gateway.
- **Loss of one gateway** does not lose its region.
- **Loss of one region** does not lose any client (if catalogs include cross-region front doors).
- **Loss of the control plane** does not break already-connected clients (their catalog is signed and cached).
- **Loss of one egress pool** does not block traffic (the other pool absorbs).
- **Loss of one provider** does not lose the fleet (the orchestrator can route fresh ingress through other providers).
- **Loss of the public DNS path** does not lose connectivity (clients have N front doors with N IPs cached from the catalog).
- **Loss of the catalog signing key** is the only true service-stop: clients reject any new catalog. This is a deliberate single point of failure to prevent silent rerouting; rotation is a planned event with re-enrollment.

The architecture is deliberately not absolutist here: we accept a few single points of failure (the catalog key, the device's private key) in exchange for refusing to be quietly redirected. This is a feature, not an oversight.

## Client-side autonomy

A client with a fresh signed catalog can pick its own front door. It does not need a round-trip to the control plane to fail over. This matters when control-plane reachability is itself impaired — clients keep working on the catalog they last fetched, reordering its entries based on their own observed quality.
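
Local reordering could be as simple as the sketch below. The function name, the `(successes, attempts)` bookkeeping, and the Laplace-smoothed estimate are assumptions made for illustration; the source says only that clients reorder entries by their own observed quality.

```python
def rank_front_doors(catalog_order, local_stats):
    """Reorder catalog entries by locally observed handshake success.

    `local_stats` maps an address to (successes, attempts). The estimate
    (s + 1) / (a + 2) gives unmeasured entries a neutral 0.5, so one bad
    attempt does not bury a path and untested paths still get tried.
    """
    def estimate(addr):
        successes, attempts = local_stats.get(addr, (0, 0))
        return (successes + 1) / (attempts + 2)

    # sorted() is stable, so entries with equal estimates keep catalog order.
    return sorted(catalog_order, key=estimate, reverse=True)
```

For example, with catalog order `["a", "b", "c"]`, four failed attempts on `a`, and nine successes out of ten on `c`, the client would dial `c` first, then the untested `b`, then `a`. No control-plane round-trip is involved.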

Client measurements eventually flow back upstream as anonymous quality signals when reachability returns. When many clients report blocking on a particular front door, the score loop demotes it for everyone.

## What is intentionally NOT in the architecture

- **A single canonical "best" route.** The system never decides one path is best forever. It always carries N options.
- **Threat classification.** The system does not try to identify why a path is bad. It just notes the path is bad.
- **Per-user routing personalization.** Two clients with the same access tier and region get the same catalog. Tags, not identities, drive routing.
- **Steady-state assumptions.** Every component's lifetime is short, scored, replaced when degraded. No part is sacred.
- **A bigger app.** The user installs WireGuard from their app store. We do not write our own client. Our wrapper is profile issuance, catalog distribution, posture, and operator UI — never the tunnel itself.

## What this architecture is good at

- **Smooth degradation.** As paths drop, the user's connection moves to the next-best with measurements as evidence, not guesses.
- **Fast recovery.** A path coming back online climbs the score and re-enters the catalog within minutes.
- **Cheap experimentation.** New transports, new providers, new egress pools enter as additional inventory. Their score either earns them weight or doesn't.
- **Plain operations.** The operator surface is a list of paths and their scores. Decisions are explainable from the data shown.

## What this architecture is bad at

- **Latency-critical realtime.** Polymorphic transport adds hops. T3/T4 in particular pay 40–100 ms; this is a real cost. If your use case is FPS gaming over a hostile network, we are not the answer; nothing reasonable is.
- **Commitment to a specific path.** A user who insists on "always the German exit" is fighting the architecture, which prefers whatever scores best for their session right now. We can pin paths, but the architecture's bias is against it.
- **Provider monoculture.** If we wake up one morning with all our gateways at one provider, the architecture's resilience claim is hollow. Provider diversity is a discipline, not a property the architecture enforces by itself.

## How to validate the architecture is doing its job

Three observables at any time:

1. **Catalog churn rate.** How often is the published catalog changing? Higher under pressure, near-zero under calm. If it's *not* changing under pressure, the score loop isn't doing its job.
2. **Client-reported success.** A small fraction of clients run periodic synthetic transactions through the tunnel and report the outcome. If client-reported success is high while raw probe success is low, the architecture is genuinely routing clients to the live paths and not just keeping its own probes healthy.
3. **Mean time from path-bad to path-removed.** From the moment a front door's score crosses the demote threshold to the moment the published catalog reflects it (the front door demoted, or its gateway dropped). Should be tight (minutes). Upward drift means the score loop is sluggish or waiting on operator intervention.

These three numbers are the architecture's vital signs. If they are healthy, the architecture is alive.
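
Two of these vital signs reduce to simple arithmetic over logs. The sketch below assumes hypothetical log shapes (a time-ordered list of published catalogs, and pairs of threshold-crossing and catalog-update timestamps); the source does not define how these events are recorded.

```python
from statistics import mean


def catalog_churn_per_hour(publications):
    """Catalog changes per hour.

    `publications` is a time-ordered list of (timestamp_s, entries).
    Two consecutive catalogs count as a change when their entry sets differ.
    """
    if len(publications) < 2:
        return 0.0
    changes = sum(
        1
        for (_, a), (_, b) in zip(publications, publications[1:])
        if set(a) != set(b)
    )
    span_h = (publications[-1][0] - publications[0][0]) / 3600.0
    return changes / span_h if span_h > 0 else 0.0


def mean_time_to_removal(events):
    """Mean seconds between a score crossing the threshold and the catalog update.

    `events` is a list of (crossed_at_s, catalog_updated_at_s) pairs.
    Returns None when no events were recorded.
    """
    durations = [updated - crossed for crossed, updated in events]
    return mean(durations) if durations else None
```

Plotting both numbers against probe success over time is one way to check the claim above: churn should rise when probes start failing, and the removal time should stay within minutes.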

## Where Phase 0 sits in this map

Phase 0, as deployed, is the floor: a single gateway, a single front door, a single egress, a single control plane, a single transport (T0 bare WireGuard). The architecture above is what we build *up* from this floor. Every property described above has a starting point in the deployed Phase 0:

- The signed catalog is real (one entry today; will hold many).
- The score model is degenerate today (only "is gateway heartbeating?" matters); the inputs are wired in but the weights are flat.
- Egress is one pool today; the architecture supports two trivially.
- The transport stack has T0 today; T1+ slot in as new front doors per gateway with no client-side change required.
- The orchestrator can drain/retire gateways today; what it lacks is automated gateway creation across providers under pressure.

The roadmap is to bring the score model to life, plant a second egress pool, and ship T1. Each of those moves the system closer to the resilience-by-statistics shape this document describes. None of them require throwing away what's deployed today.
