
Building a Distributed VPN with Intelligent Routing

October 15, 2024 · 18 min read · AI Analytics

Censorship · VPN · ML routing · DPI evasion · WireGuard

One component of the Voidly platform is a VPN service operated for probe operators and journalists working in high-censorship environments. Building it required solving three distinct problems: making traffic undetectable to deep packet inspection, making entry nodes hard to enumerate and block, and routing around disruptions faster than human operators can respond. This article documents the architecture and the tradeoffs at each layer.

The threat model

State-level censorship systems operate at national ISP chokepoints and share intelligence across providers. We tested against three environments: China's GFW, Iran's IRGC filtering infrastructure, and Russia's TSPU (Технические средства противодействия угрозам, "technical means of countering threats"). Each has distinct but overlapping capabilities:

Capability                    GFW     IRGC    TSPU
──────────────────────────────────────────────────
TLS fingerprint matching       ✓       ✓       ✓
WireGuard protocol detection   ✓       ✓       partial
OpenVPN pattern matching       ✓       ✓       ✓
Active probing (replay)        ✓       –       –
IP reputation lists            ✓       ✓       ✓
SNI-based blocking             ✓       ✓       ✓
BGP null-route enforcement     ✓       partial ✓
Deep flow analysis             ✓       –       partial

The GFW is the most capable: it combines passive DPI with active probing, replaying captured connection initiations at suspicious servers to test for VPN endpoints. WireGuard's handshake is identifiable: the initiator sends a 148-byte message (1 byte message type, 3 reserved bytes, 4 bytes sender index, 32 bytes ephemeral public key, 48 bytes encrypted static key, 28 bytes encrypted timestamp, 16 bytes MAC1, 16 bytes MAC2), by default to UDP/51820. The fixed size and layout make a distinctive pattern. The GFW added WireGuard detection in early 2022.
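
To make the fingerprint concrete, here is a minimal sketch of the kind of passive check a classifier could apply to a UDP payload. It is an illustration of the size-and-type pattern described above, not a reconstruction of any censor's actual classifier; the function name is hypothetical.

```rust
/// Size of a WireGuard handshake initiation message, in bytes.
const WG_INITIATION_LEN: usize = 148;

/// Returns true if a UDP payload has the shape of a WireGuard handshake
/// initiation: exactly 148 bytes, message type 1, then three zero reserved bytes.
/// (Hypothetical helper for illustration only.)
fn looks_like_wg_initiation(udp_payload: &[u8]) -> bool {
    udp_payload.len() == WG_INITIATION_LEN
        && udp_payload.starts_with(&[0x01, 0x00, 0x00, 0x00])
}
```

A check this cheap needs no per-flow state, which is why a bare WireGuard handshake is easy to flag at line rate regardless of the destination port.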

Iran and Russia focus more heavily on IP reputation. Both maintain synchronized blocklists distributed to ISPs that identify known VPN server ranges, Tor exit nodes, and cloud provider subnets associated with circumvention tools. Russia's TSPU can also throttle unrecognized UDP traffic to impractical speeds without blocking it outright, which is harder to circumvent than a clean block.

Why standard VPNs fail

OpenVPN has a distinctive handshake structure. Even when the application data is encrypted, every control packet begins with an OpenVPN-specific opcode byte (preceded, in TCP mode, by a 2-byte packet length prefix) that appears before the TLS ClientHello. GFW classifiers catch it in milliseconds. IPsec IKEv2 uses UDP/500 with protocol-specific ISAKMP headers and makes no attempt at obfuscation.

WireGuard is harder to fingerprint than OpenVPN but still fails for static-IP deployments. Once a server IP is discovered — via enumeration, user reports, or scanning — it goes onto blocklists and stays there. The static nature of standard WireGuard configurations means a single block kills all clients.

Architecture overview

The system has three tiers. Entry nodes receive client traffic and are the visible face of the VPN. A routing intelligence layer — running on-device on the client — selects paths in real time. Exit nodes forward decrypted traffic to the open internet and are not publicly exposed.

Client (Rust + ONNX Runtime)
       │
       │  HTTPS/443 to CDN edge
       ▼
CDN Edge (Cloudflare / Google / AWS CloudFront)
       │
       │  Host header routes to entry node backend
       ▼
Entry Node (WireGuard endpoint, 48hr IP rotation)
       │
       │  Inner WireGuard tunnel
       ▼
Exit Node (stable, non-publicly-routed)
       │
       ▼
 Open internet

The outer layer is standard HTTPS to a CDN edge server that the censor cannot block without collateral damage. The inner layer is WireGuard. The entry node IP is ephemeral and unknown to the client at configuration time — it is resolved dynamically from a distributed rendezvous system at connection time.

Domain fronting: the outer layer

Domain fronting exploits the gap between TLS SNI (visible to the censor) and the HTTP Host header (visible only to the CDN after TLS termination). The client opens a TLS connection advertising SNI www.google.com. The censor sees a connection to Google — which it cannot block. Once TLS is established with Google's edge, the HTTP/2 request carries Host: entry.internal.example.com which Google routes to our origin.

// TLS handshake (visible to censor)
ClientHello:
  server_name: "www.google.com"   ← censor sees Google

// HTTP/2 request (inside TLS — invisible to censor)
:method: CONNECT
:authority: entry.internal.example.com
Host: entry.internal.example.com   ← CDN routes to our origin

We tunnel WireGuard inside an HTTP/2 CONNECT tunnel established over this fronted connection. This means the observable traffic is: TLS to a Google/Cloudflare IP, containing an HTTP/2 CONNECT to our backend, containing WireGuard datagrams. Each layer strips the identifiers the censor relies on.

WireGuard over TCP has a performance penalty (TCP-over-TCP with its own congestion control), so we also support QUIC-based tunneling for clients where UDP is not throttled — QUIC to port 443 is increasingly common and hard to distinguish from regular QUIC/HTTP3 traffic.

Google shut down domain fronting support in 2018; Cloudflare followed. We use a mix of remaining CDN providers that permit fronting under specific conditions, and we run our own CDN-proxy layer on cloud providers with large IP ranges (Azure, Oracle Cloud, and Alibaba Cloud — the latter being important for Iranian users since CN-origin traffic gets special treatment in the IRGC filtering model).

Entry node IP rotation

Entry nodes rotate IPs every 48 hours. The mechanics:

1. A Cloudflare Worker runs on a cron trigger every 24 hours.
2. It allocates a new IP from our provider pool (Vultr, Hetzner, DigitalOcean, OVH; we never use AWS/GCP/Azure for entry nodes since those ranges are instantly flagged).
3. The new IP is added to the entry node's WireGuard config as an additional endpoint.
4. A TTL-12hr DNS record in Cloudflare KV points clients to the active IP.
5. After 48 hours the old IP is released and its allocation slot returned to the pool.
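
One way to read the steps above is that a 24-hour allocation interval combined with a 48-hour lifetime leaves each entry node with two overlapping IPs at steady state. The sketch below works through that arithmetic; the overlap interpretation, the constant names, and the function are assumptions made for illustration, not the production scheduler.

```rust
/// Hours between allocations (the cron interval) and the lifetime of each IP,
/// both taken from the rotation steps above.
const ALLOC_INTERVAL_H: u64 = 24;
const IP_LIFETIME_H: u64 = 48;

/// Returns the allocation times (hours since the first allocation) of the IPs
/// that are still live at hour `now_h`, oldest first.
/// Assumes one allocation per cron tick, starting at hour 0.
fn live_allocations(now_h: u64) -> Vec<u64> {
    (0..=now_h / ALLOC_INTERVAL_H)
        .map(|k| k * ALLOC_INTERVAL_H)
        .filter(|alloc_h| now_h - alloc_h < IP_LIFETIME_H)
        .collect()
}
```

Under these assumptions, at hour 30 the IPs allocated at hours 0 and 24 are both live, and at hour 50 the hour-0 IP has been released while the hour-24 and hour-48 IPs serve traffic.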

Clients never hardcode IPs. At connection time they query a rendezvous endpoint (itself domain-fronted) to get the current active entry node list. The response is signed with an Ed25519 key whose public half is embedded in the client binary; a tampered response is rejected and an alert is logged.

// Entry node resolution — Rust client
async fn resolve_entry_nodes(
    fronting_client: &FrontingClient,
    ed25519_verify_key: &VerifyingKey,
) -> Result<Vec<EntryNode>> {
    let resp = fronting_client
        .get("https://rendezvous.internal.example.com/v2/nodes")
        .await?;

    // Verify Ed25519 signature before trusting
    let sig = resp.headers()
        .get("X-Node-Sig")
        .and_then(|v| v.to_str().ok())
        .ok_or(Error::MissingSignature)?;

    let sig_bytes = BASE64.decode(sig)?;
    let signature = Signature::from_slice(&sig_bytes)?;
    ed25519_verify_key.verify(resp.body(), &signature)?;

    Ok(serde_json::from_slice::<Vec<EntryNode>>(resp.body())?)
}

During the 8 months of production operation, 142 entry node IPs were blocked by the GFW. In every case, the block was discovered by a probe operator reporting a connection failure, confirmed by Voidly's own measurement probes hitting the IP from inside China, and cleared within the same 48-hour rotation window without any service interruption. The redundancy comes from maintaining 3–5 active entry nodes per geographic region; clients try them in ranked order and fall back on first failure.
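
The ranked-order fallback can be sketched as follows. The `EntryNode` fields and the `try_connect` callback are hypothetical stand-ins, since the article does not show the real type or connection logic.

```rust
/// Hypothetical entry node record; the real `EntryNode` type is not shown here.
#[derive(Debug, Clone, PartialEq)]
struct EntryNode {
    addr: String,
    rank_score: f32, // higher means more preferred
}

/// Tries nodes from best to worst and returns the first one that
/// `try_connect` accepts, or None if every candidate fails.
fn connect_first_available<F>(mut nodes: Vec<EntryNode>, mut try_connect: F) -> Option<EntryNode>
where
    F: FnMut(&EntryNode) -> bool,
{
    // total_cmp gives a total order on f32, so a NaN score cannot panic the sort.
    nodes.sort_by(|a, b| b.rank_score.total_cmp(&a.rank_score));
    nodes.into_iter().find(|node| try_connect(node))
}
```

With 3–5 nodes per region, a single blocked IP costs the client one failed attempt before it lands on the next candidate.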

Traffic morphing

Even inside the fronted TLS tunnel, traffic analysis can reveal VPN usage. HTTPS to Google typically shows a characteristic pattern of packet sizes (TCP segments dominated by 1460-byte payloads after the initial handshake, with small acknowledgment frames). VPN traffic through an HTTP tunnel tends to produce longer steady streams. We implement three morphing techniques:

1. Packet size distribution matching

We collected 14 days of real HTTPS traffic to Google domains from multiple vantage points and built a packet-size CDF (cumulative distribution function). During VPN operation, our traffic morphing layer samples from this CDF to determine padding targets.
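
As a rough illustration of how such a CDF can be derived from captured packet sizes, the sketch below bins samples against a list of upper bounds. The function name and the bin edges in the usage note are assumptions; the production table uses 512 bins calibrated from real captures.

```rust
/// Builds a cumulative distribution over `bin_edges` (ascending upper bounds,
/// in bytes) from observed packet sizes. Each output pair is
/// (upper_bound, fraction of samples at or below that bound).
fn build_cdf(samples: &[usize], bin_edges: &[usize]) -> Vec<(usize, f32)> {
    let total = samples.len() as f32;
    bin_edges
        .iter()
        .map(|&edge| {
            let at_or_below = samples.iter().filter(|&&s| s <= edge).count();
            // An empty capture yields an all-zero CDF rather than NaN.
            let fraction = if total > 0.0 { at_or_below as f32 / total } else { 0.0 };
            (edge, fraction)
        })
        .collect()
}
```

For example, the samples `[60, 60, 200, 1460]` with edges `[64, 256, 1460]` give cumulative fractions 0.5, 0.75, and 1.0.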

// Packet size CDF (condensed — actual has 512 bins)
const GOOGLE_HTTPS_CDF: [(usize, f32); 8] = [
    (64,   0.18),   // 18% of frames ≤ 64 bytes (ACKs, small responses)
    (256,  0.31),
    (512,  0.42),
    (1024, 0.54),
    (1280, 0.61),
    (1380, 0.68),
    (1460, 0.91),   // MTU-sized bulk data dominates
    (2920, 1.00),   // GSO aggregated segments
];

fn target_size_from_cdf(rng: &mut SmallRng) -> usize {
    let p: f32 = rng.gen();
    for (size, cdf) in GOOGLE_HTTPS_CDF {
        if p <= cdf { return size; }
    }
    1460
}

fn morph_packet(payload: &[u8], rng: &mut SmallRng) -> Vec<u8> {
    let target = target_size_from_cdf(rng);
    if payload.len() >= target {
        return payload.to_vec(); // never truncate
    }
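    // Pad to target with random bytes; the outer TLS layer encrypts them.
    // (resize(target, rng.gen()) would repeat a single random byte value.)
    let mut out = payload.to_vec();
    out.resize(target, 0);
    rng.fill(&mut out[payload.len()..]);
    out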
}

Padding is added to the inner WireGuard frame before it enters the outer TLS tunnel, so the padding is encrypted. The recipient strips it based on a length prefix in the WireGuard data frame. Overhead is 8–22% depending on traffic composition.
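
The length-prefix mechanism can be sketched like this. The 2-byte big-endian prefix and both function names are assumptions chosen for illustration; the article does not specify the actual frame layout. The `fill` callback stands in for a cryptographically secure random source.

```rust
use std::convert::TryFrom;

/// Wraps `payload` as [u16 big-endian length][payload][padding] so the whole
/// frame is at least `target` bytes. Returns None if the payload does not fit
/// in a 16-bit length. Never truncates: an oversized payload is left unpadded.
fn pad_frame(payload: &[u8], target: usize, mut fill: impl FnMut() -> u8) -> Option<Vec<u8>> {
    let len = u16::try_from(payload.len()).ok()?;
    let mut out = Vec::with_capacity(target.max(payload.len() + 2));
    out.extend_from_slice(&len.to_be_bytes());
    out.extend_from_slice(payload);
    while out.len() < target {
        out.push(fill());
    }
    Some(out)
}

/// Recovers the original payload, ignoring any trailing padding.
/// Returns None if the frame is shorter than its prefix claims.
fn unpad_frame(frame: &[u8]) -> Option<&[u8]> {
    let len = u16::from_be_bytes([*frame.first()?, *frame.get(1)?]) as usize;
    frame.get(2..2 + len)
}
```

Because the receiver trusts only the prefix, the padding bytes can be arbitrary and never need to be inspected.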

2. Timing perturbation

State-level DPI systems can apply traffic analysis to inter-packet arrival times even when content is encrypted. VPN protocols tend to produce very regular inter-packet gaps because they forward traffic as soon as a kernel buffer is ready. Real HTTPS traffic has irregular timing driven by application behavior.

We inject Laplace noise into inter-packet gaps:

// Laplace timing perturbation
// scale=12ms chosen to match empirical HTTPS timing variance
// while staying below perceptible latency threshold (~50ms)

fn laplace_delay_micros(rng: &mut SmallRng, scale_ms: f64) -> u64 {
    let u: f64 = rng.gen_range(-0.5_f64..0.5_f64);
    let noise_ms = -scale_ms * u.signum() * (1.0 - 2.0 * u.abs()).ln();
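    // Negative draws are clamped to zero: a frame cannot be sent early,
    // so roughly half of all frames go out with no added delay
    (noise_ms.max(0.0) * 1000.0) as u64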
}

async fn send_with_jitter(
    tx: &mut TcpStream,
    frame: &[u8],
    rng: &mut SmallRng,
) -> io::Result<()> {
    let delay_us = laplace_delay_micros(rng, 12.0);
    if delay_us > 0 {
        tokio::time::sleep(Duration::from_micros(delay_us)).await;
    }
    tx.write_all(frame).await
}

At scale=12ms the added latency is imperceptible to users (median 8ms additional per-packet overhead) and breaks the traffic classifiers we tested against, which rely on consistent timing variance thresholds.
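
To see what the clamped Laplace draw produces, the sketch below applies the same inverse-CDF formula to a caller-supplied uniform value instead of an RNG, which makes the behavior easy to check by hand. The function name is hypothetical. Note that the positive half of the distribution is exponential with scale 12 ms, so its median is 12 · ln 2 ≈ 8.3 ms; reading the 8 ms figure as the median over delayed frames is an assumption.

```rust
/// Inverse-CDF Laplace sample (location 0, scale `scale_ms`) for a uniform
/// draw `u` in (-0.5, 0.5), clamped at zero and converted to microseconds.
/// Deterministic variant of the sampler, for illustration only.
fn laplace_delay_from_uniform(u: f64, scale_ms: f64) -> u64 {
    let noise_ms = -scale_ms * u.signum() * (1.0 - 2.0 * u.abs()).ln();
    // Negative noise would mean sending early, which is impossible, so clamp.
    (noise_ms.max(0.0) * 1000.0) as u64
}
```

With `scale_ms = 12.0`, a draw of `u = 0.25` yields about 8.3 ms of delay, any negative `u` yields zero, and draws near 0.5 produce the long tail (for example `u = 0.45` gives roughly 27.6 ms).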

3. Cover traffic

During idle periods — when the VPN is connected but the user is not generating traffic — we generate synthetic requests to real Google and Cloudflare properties. This prevents the censor from detecting VPN sessions by the absence of traffic (a real HTTPS session to Google has background keep-alive traffic; a VPN tunnel sitting idle has none).

Cover traffic is pre-generated: a 4MB buffer of realistic HTTP/2 frame sequences (with correctly sized HEADERS, DATA, and WINDOW_UPDATE frames) is compiled into the client binary and played back at intervals sampled from the real-traffic timing CDF. This avoids the need for actual DNS resolution or network activity that could be separately fingerprinted.

ML routing: path selection

With multiple entry nodes per region and multiple CDN fronting providers, a client has 5–12 candidate paths at any given time. Static ranking (e.g., lowest-latency) is insufficient because paths degrade in response to censorship events that happen faster than any static rule can track. We train a routing model that predicts path quality in real time.

Feature set

// Per-path feature vector (22 dimensions)
struct PathFeatures {
    // Network quality (recent measurements)
    latency_p50_ms: f32,
    latency_p99_ms: f32,
    packet_loss_rate: f32,
    jitter_ms: f32,

    // Censorship signals
    rst_rate_30min: f32,         // TCP resets in last 30 min
    tls_alert_rate_30min: f32,   // TLS handshake failures
    timeout_rate_30min: f32,
    bgp_withdraw_age_hours: f32, // Time since last BGP withdrawal seen
                                 // affecting this path (from IODA feed)

    // Entry node health
    ip_age_hours: f32,           // How long this IP has been active
    ip_was_blocked_before: f32,  // 1.0 if this IP prefix was blocked prev.
    provider_block_rate_7d: f32, // Fraction of this provider's IPs blocked

    // Fronting provider signals
    cdn_latency_p50_ms: f32,
    cdn_error_rate_1hr: f32,

    // Temporal features
    hour_of_day_sin: f32,        // Encoded as sin/cos for circularity
    hour_of_day_cos: f32,
    day_of_week_sin: f32,
    day_of_week_cos: f32,

    // Country-specific
    country_censorship_score: f32,  // Voidly country score (0–1)
    country_gfw: f32,               // 1.0 if GFW environment

    // Path-specific history
    success_rate_24hr: f32,
    consecutive_successes: f32,
    consecutive_failures: f32,
}
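
The sin/cos temporal features can be computed with a small helper like the one below. This is a sketch; the function name is an assumption, not part of the client's API.

```rust
use std::f32::consts::TAU;

/// Encodes a cyclic quantity (hour of day, day of week) as a point on the
/// unit circle, so that the end of one cycle sits next to the start of the next.
fn encode_cyclic(value: f32, period: f32) -> (f32, f32) {
    let angle = TAU * (value / period);
    (angle.sin(), angle.cos())
}
```

For example, `encode_cyclic(23.0, 24.0)` and `encode_cyclic(0.0, 24.0)` land close together on the circle, whereas a raw hour value would place 23:00 and 00:00 at opposite ends of the feature range.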

Model architecture

We use a gradient-boosted tree (XGBoost) model rather than a neural network. The training dataset is 90 days of path telemetry from all clients (anonymized, opt-in) with connection outcomes as labels: success, degraded, or blocked. The model predicts the probability of success for each candidate path.

XGBoost was chosen over a small neural network for three reasons: it is far smaller (the exported model is 340KB versus 4MB+ for a comparable MLP), inference on a 22-dimension feature vector takes under 1ms on any modern CPU without GPU acceleration, and it handles missing features gracefully when telemetry is incomplete for a new path that has no history.

The model is exported to ONNX format, executed with ONNX Runtime, and bundled in the client binary; it is updated weekly via a silent background download. No network request is needed for inference: the entire routing decision happens on-device.

// Routing decision loop — Rust client
struct Router {
    model: OrtSession,              // ONNX Runtime session
    paths: Vec<PathState>,          // All known candidate paths
    active: Option<usize>,          // Currently connected path index
    telemetry: TelemetryBuffer,     // Rolling 30-min window
}

impl Router {
    fn rank_paths(&mut self) -> Vec<(usize, f32)> {
        let mut scores: Vec<(usize, f32)> = self.paths
            .iter()
            .enumerate()
            .map(|(i, path)| {
                let features = self.telemetry.features_for(path);
                let score = self.model
                    .run(features.as_slice())
                    .expect("ONNX inference")[0];
                (i, score)
            })
            .collect();

        // total_cmp orders NaN deterministically instead of panicking on unwrap
        scores.sort_by(|a, b| b.1.total_cmp(&a.1));
        scores
    }

    async fn maintain(&mut self) {
        loop {
            tokio::time::sleep(Duration::from_secs(30)).await;
            let ranked = self.rank_paths();
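            // No candidate paths yet: wait for the next tick instead of panicking
            let Some(&(best_idx, best_score)) = ranked.first() else { continue };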

            // Switch if current path score has dropped >0.15 below best
            if let Some(active) = self.active {
                let current_score = ranked.iter()
                    .find(|(i, _)| *i == active)
                    .map(|(_, s)| *s)
                    .unwrap_or(0.0);

                if best_score - current_score > 0.15 {
                    self.switch_to(best_idx).await;
                }
            } else {
                self.switch_to(best_idx).await;
            }
        }
    }
}

The 0.15 hysteresis threshold prevents thrashing — without it, near-equal paths cause constant reconnects that are themselves detectable. The 30-second evaluation interval is a deliberate tradeoff: censorship events typically persist for minutes to hours, so faster polling wastes cycles without improving coverage.
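
The effect of the margin can be illustrated with a toy simulation over two paths whose scores hover near each other. The function below is a sketch written for this explanation; it is not part of the router, and the observation values in the usage note are made up.

```rust
/// Counts path switches between two candidate paths over a series of
/// (score_a, score_b) observations, starting on path A. A switch happens
/// only when the other path beats the current one by more than `margin`.
fn count_switches(observations: &[(f32, f32)], margin: f32) -> usize {
    let mut on_a = true;
    let mut switches = 0;
    for &(a, b) in observations {
        let (current, other) = if on_a { (a, b) } else { (b, a) };
        if other - current > margin {
            on_a = !on_a;
            switches += 1;
        }
    }
    switches
}
```

Given four observations that alternate between `(0.80, 0.82)` and `(0.83, 0.81)`, a margin of 0.0 switches on every tick (four reconnects), while a margin of 0.15 never switches. A genuine drop such as `(0.40, 0.85)` still triggers a switch under the 0.15 margin.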

BGP feed integration

The bgp_withdraw_age_hours feature is sourced from Voidly's IODA feed integration. When a BGP route serving an entry node's prefix is withdrawn, this signal appears in the feature vector immediately. In practice this means the model pre-ranks affected paths lower within 2–3 minutes of a shutdown event, before any client has experienced a connection failure on those paths.

WireGuard configuration

WireGuard runs on the entry node listening on a non-standard port (not 51820, to avoid trivial port-based blocks). The inner WireGuard tunnel between entry and exit nodes is separate from the client-facing tunnel. MTU sizing is critical for the domain-fronting stack:

// MTU budget for client-facing WireGuard interface

Physical MTU           1500   (standard Ethernet)
─ IPv4 header           -20
─ TCP header            -20
─ TLS record overhead   -29   (5 hdr + 16 AEAD tag + 8 explicit nonce)
─ HTTP/2 DATA frame hdr  -9
─ WireGuard transport   -32   (4 type + 4 recv index + 8 nonce + 16 AEAD)
─ Inner IPv4 header     -20
─ Safety margin         -10
                       ────
WireGuard interface MTU 1360

[Interface]
MTU = 1360

[Peer]
# Fragmentation is worse than a slightly smaller MTU;
# the safety margin absorbs HTTP/2 CONTINUATION frames
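
The budget above can be kept as data and checked mechanically, which helps when one layer changes (for example, a different TLS cipher suite). The constant and function names below are hypothetical; the byte counts are the ones from the table.

```rust
/// Physical link MTU and per-layer overheads from the budget table, in bytes.
const PHYSICAL_MTU: u32 = 1500;
const OVERHEADS: [(&str, u32); 7] = [
    ("outer IPv4 header", 20),
    ("TCP header", 20),
    ("TLS record overhead", 29),
    ("HTTP/2 DATA frame header", 9),
    ("WireGuard transport header", 32),
    ("inner IPv4 header", 20),
    ("safety margin", 10),
];

/// Subtracts every overhead from the physical MTU to get the interface MTU.
fn wireguard_interface_mtu() -> u32 {
    PHYSICAL_MTU - OVERHEADS.iter().map(|(_, bytes)| bytes).sum::<u32>()
}
```

The overheads total 140 bytes, which leaves the 1360-byte interface MTU used in the config.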

The client-facing WireGuard peer uses a pre-shared key (PSK) in addition to the DH-based key exchange. The PSK is rotated monthly and delivered via the same signed rendezvous endpoint that delivers entry node IPs. This provides a second authentication factor and makes active probing harder: a censor replaying a captured initiator handshake to a new server will fail PSK verification.

Blockage detection

The client maintains a BlockageDetector that watches for signals suggesting active interference rather than ordinary network degradation:

struct BlockageDetector {
    // Rolling 5-min windows
    tcp_rst_count: RollingCounter,
    tls_alert_count: RollingCounter,
    timeout_count: RollingCounter,
    success_count: RollingCounter,
    // Median SYN-to-RST latency over the same window; read by classify()
    rst_latency_p50_ms: f32,
}

enum BlockageSignal {
    RstInjection,          // RST < 15ms after SYN — likely injected
    TlsFingerprint,        // TLS alert 40 (handshake_failure) on WG port
    ActiveProbe,           // Unexpected initiator from different IP
    Throttling,            // Bandwidth below 10kbps sustained > 2min
    None,
}

impl BlockageDetector {
    fn classify(&self) -> BlockageSignal {
        // Fast RST: ISN-aware RST arriving < 15ms after SYN
        // indicates injection rather than server reject
        if self.rst_latency_p50_ms < 15.0 && self.tcp_rst_count.rate() > 0.3 {
            return BlockageSignal::RstInjection;
        }
        if self.tls_alert_count.rate() > 0.5 {
            return BlockageSignal::TlsFingerprint;
        }
        // etc.
        BlockageSignal::None
    }
}

On detecting RstInjection or TlsFingerprint, the client immediately invokes the router's switch_to without waiting for the 30-second evaluation interval. Path switches in this case take a median of 3.8 seconds (new WireGuard handshake, inner tunnel re-establishment).
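
The article names `RollingCounter` without showing it. One plausible shape is a queue of event timestamps that evicts anything older than the window. The sketch below is an assumption along those lines: timestamps are passed in explicitly to keep it deterministic, and the rate is computed as a fraction of all attempts, which is one reading of the 0.3 and 0.5 thresholds in `classify`.

```rust
use std::collections::VecDeque;

/// Hypothetical rolling-window event counter (the real type is not shown).
struct RollingCounter {
    window_secs: u64,
    events: VecDeque<u64>, // event timestamps in seconds, oldest first
}

impl RollingCounter {
    fn new(window_secs: u64) -> Self {
        Self { window_secs, events: VecDeque::new() }
    }

    /// Records one event at `now_secs` and drops events that left the window.
    fn record(&mut self, now_secs: u64) {
        self.events.push_back(now_secs);
        self.evict(now_secs);
    }

    /// Number of events still inside the window at `now_secs`.
    fn count(&mut self, now_secs: u64) -> usize {
        self.evict(now_secs);
        self.events.len()
    }

    fn evict(&mut self, now_secs: u64) {
        while let Some(&oldest) = self.events.front() {
            if now_secs.saturating_sub(oldest) >= self.window_secs {
                self.events.pop_front();
            } else {
                break;
            }
        }
    }
}

/// Fraction of attempts in the window that ended in the given failure type.
fn failure_rate(failures: usize, successes: usize) -> f32 {
    let total = failures + successes;
    if total == 0 { 0.0 } else { failures as f32 / total as f32 }
}
```

With a 300-second window, three RSTs against seven successes gives a rate of 0.3, right at the RST-injection threshold used above.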

Client implementation

The client is written in Rust and compiles to a single statically linked binary. Key crates: boringtun for userspace WireGuard (avoids kernel module requirements on constrained platforms), rustls for the outer TLS layer, ort (ONNX Runtime bindings) for ML inference, tokio for the async runtime.

// Binary sizes (stripped, aarch64-linux-musl)
vpn-client-minimal    6.2 MB   # boringtun + rustls only
vpn-client-full       8.1 MB   # + ONNX Runtime + morphing layer
vpn-client-full.gz    3.9 MB   # compressed for download

// Runtime memory (steady state, 1 active path)
RSS:       48 MB
ONNX heap: 12 MB   # model weights + inference scratch
WG state:  ~200KB  # per-peer session state

The client compiles and runs on Linux (x86_64, aarch64), macOS (arm64, x86_64), Windows (x86_64 via MSVC), and Android (aarch64 via JNI). The iOS port uses NetworkExtension and has a separate (larger) binary due to Apple's requirement for framework-based WireGuard.

Performance results

Metric                              Value
──────────────────────────────────────────────
Throughput (sustained)              820 Mbps
Latency overhead vs direct          +22ms p50, +41ms p99
Route evaluation (on-device)        < 1ms
Path switch time (failure detected) 3.8s median
DPI evasion rate (CN/IR/RU test)    99.3%
Entry node block rate (8mo)         142 IPs blocked, 0 outages
Padding overhead                    11% average, 22% worst case
Timing jitter overhead              +8ms p50 additional latency
ONNX inference (22-dim, XGBoost)    0.4ms on Cortex-A55
Cover traffic bandwidth             6–18 Kbps during idle

The 99.3% evasion rate was measured over 30 days by having probe operators inside CN, IR, and RU networks attempt connections to the production VPN and counting the successes. The 0.7% failure rate breaks down as: 0.4% entry node transitions (the 4-second switch window during active blocking), 0.2% CDN fronting provider outages, and 0.1% unexplained (likely client-side network issues unrelated to censorship).

What does not work

Several approaches were tried and abandoned:

Tor bridges — obfs4 bridges are increasingly blocked in China (the GFW runs active probers that identify obfs4 handshakes). We integrated a meek-lite transport (which does domain fronting through Azure) but found it adds 60–120ms latency and the Azure CDN range is increasingly suspicious in IR.

Shadowsocks — works well in China but depends on stable server IPs, so it has the same IP-rotation problem as static WireGuard. Shadowsocks was also added to GFW blocklists in 2022, after published traffic analysis research showed that its statistical fingerprint is widely detectable.

IPv6 tunneling — TSPU in Russia does not consistently apply DPI to IPv6 traffic, making IPv6 a useful bypass in specific Russian ISPs. But GFW DPI applies to both IPv4 and IPv6, and most restrictive environments have limited IPv6 penetration, so this doesn't help in the highest-value cases.

QUIC domain fronting — promising. QUIC/HTTP3 to CDN edges is an increasingly large fraction of legitimate internet traffic, making it hard to block selectively. We run QUIC fronting for a subset of clients as a secondary path; it has lower overhead than the TCP-based tunnel but QUIC is sometimes throttled to unusable speeds by the TSPU rather than cleanly blocked.

Operational security

The obfuscation layer parameters — the specific CDN fronting provider list, packet size CDF bins, and timing perturbation parameters — are not published in detail. Publishing them reduces effectiveness immediately: censors read research papers and update classifiers accordingly. The architecture above is sufficient to replicate the approach; the specific parameterization requires empirical calibration per target environment.

Exit node IPs are not disclosed. Exit nodes are operated from residential ISP connections and datacenter providers outside the Five Eyes countries, specifically to prevent correlation attacks where a censor observes both the entry and the exit.

For the boringtun userspace WireGuard integration in Voidly's measurement probes, which shares the same on-device key management approach: The Voidly Probe: Tauri + boringtun network measurement at the operator's edge →

For how BGP prefix withdrawal signals feed the ML routing layer's blockage features: BGP routing signals and internet shutdown detection: how Voidly uses IODA data →

For the full HTTP/HTTPS measurement lifecycle including RST injection timing analysis, which informed the BlockageDetector design: How Voidly measures HTTP and HTTPS censorship: the full protocol lifecycle →

For the seven-day shutdown forecasting model whose predictions feed route pre-positioning: Seven-day internet shutdown forecasting: how Voidly predicts connectivity outages →

For the social media ingestion pipeline whose OSINT signals inform censorship detection and route prioritization: Social media ingestion at scale: collecting 58M posts per day from 47 platform schemas →

For how the blocking infrastructure this VPN routes around is fingerprinted — DPI vendor signatures, TTL analysis, and OSINT procurement cross-referencing: Censorship infrastructure mapping: DPI vendor signatures, block page fingerprints, and OSINT procurement cross-referencing →