Voidly Probe Networking: Staying Connected Through NAT, Firewalls, and Censored Infrastructure

9 min read · AI Analytics
Tags: Voidly, Networking, QUIC, Infrastructure

The Voidly probe is a Tauri desktop application that runs measurement tests (TCP connects, TLS handshakes, DNS queries, HTTP requests) and uploads the results to the Voidly ingest endpoint. The interesting problem is not the measurements themselves but the upload: probes run in the same censored networks they are measuring, behind consumer-grade NAT, sometimes behind CG-NAT, and on ISPs that actively block VPN protocols and outbound connections to unknown endpoints. If the probe cannot reach the ingest endpoint, the measurement is lost.

This post describes the transport layer design that keeps probes connected and their data flowing from hostile environments.

Transport choice: QUIC on port 443

The original probe prototype used gRPC over HTTP/2 on a dedicated port (8443). That worked fine in permissive networks but was blocked immediately in Iran, where the National Information Network (NIN), Iran's counterpart to the Great Firewall, blocks non-whitelisted TCP destinations on non-standard ports. Switching to QUIC on port 443 resolved this: on the wire, QUIC to UDP port 443 looks like the HTTP/3 traffic that ordinary browsers generate, and port 443 is almost never blocked because doing so would break ordinary web browsing.

The ingest endpoint runs as a Cloudflare Worker, which natively supports HTTP/3 (QUIC). Probes connect to ingest.voidly.ai, which resolves to Cloudflare's anycast network. Three persistent QUIC streams are multiplexed over the single connection:

  • Stream 0 — heartbeat: 120-second keepalive ping; response carries config deltas (updated test list, probe parameters, revocation status).
  • Stream 1 — measurement upload: probe sends Protobuf batches; server responds with an ack containing the accepted batch_seq values.
  • Stream 2 — telemetry: probe sends health metrics (CPU, memory, connection quality) used by the probe health monitoring system.

// probe/src/transport/quic_client.rs

use quinn::{Connection, Endpoint, RecvStream, SendStream};
use std::sync::Arc;

pub struct ProbeTransport {
    connection:   Arc<Connection>,
    heartbeat_tx: SendStream,
    heartbeat_rx: RecvStream,   // server replies with config deltas
    upload_tx:    SendStream,
    upload_rx:    RecvStream,   // server replies with acked batch_seq values
    telemetry_tx: SendStream,   // one-way: no response expected
}

impl ProbeTransport {
    pub async fn connect(config: &ProbeConfig) -> Result<Self, TransportError> {
        // TLS client configuration (Endpoint::set_default_client_config) is
        // omitted here; quinn refuses to connect until one is installed.
        let endpoint = Endpoint::client("0.0.0.0:0".parse().unwrap())?;

        let connection = endpoint
            .connect(INGEST_ADDR, &config.ingest_hostname)?
            .await?;

        // Verify SPKI hash (certificate pinning)
        let peer_cert = connection.peer_identity()
            .ok_or(TransportError::NoPeerCert)?
            .downcast::<Vec<rustls::Certificate>>()
            .map_err(|_| TransportError::CertTypeMismatch)?;
        let spki_hash = spki_sha256(&peer_cert[0]);
        if !config.pinned_spki_hashes.contains(&spki_hash) {
            return Err(TransportError::CertPinningFailure { got: spki_hash });
        }

        // Heartbeat and upload expect a server response, so they need
        // bidirectional streams; telemetry is fire-and-forget.
        let (heartbeat_tx, heartbeat_rx) = connection.open_bi().await?;
        let (upload_tx, upload_rx)       = connection.open_bi().await?;
        let telemetry_tx                 = connection.open_uni().await?;

        Ok(ProbeTransport { connection: Arc::new(connection),
                            heartbeat_tx, heartbeat_rx,
                            upload_tx, upload_rx, telemetry_tx })
    }
}
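
The labels "Stream 0", "Stream 1", and "Stream 2" above read most naturally as logical indices rather than wire-level stream IDs. In QUIC (RFC 9000, section 2.1) the two low bits of a stream ID encode the initiator and the direction, so the n-th client-initiated stream has ID 4n when bidirectional and 4n + 2 when unidirectional. The sketch below shows that mapping; `Direction` and `client_stream_id` are illustrative names and do not come from the probe source.

```rust
/// Direction of a client-initiated QUIC stream.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Direction {
    Bidirectional,
    Unidirectional,
}

/// Wire-level stream ID of the n-th (zero-based) client-initiated stream.
/// Low bit 0 = client-initiated; second bit = 0 for bidirectional, 1 for unidirectional.
pub fn client_stream_id(n: u64, dir: Direction) -> u64 {
    let type_bits = match dir {
        Direction::Bidirectional => 0b00,
        Direction::Unidirectional => 0b10,
    };
    (n << 2) | type_bits
}
```

Under this mapping, three bidirectional streams opened in order would carry IDs 0, 4, and 8 on the wire, while three unidirectional ones would carry 2, 6, and 10.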

Certificate pinning against MITM

Many censoring ISPs perform TLS interception: the probe receives a certificate issued by the ISP's internal CA rather than by the publicly trusted CA behind the real ingest certificate, allowing the ISP to read (and block or modify) the measurement data. Certificate pinning stops this: the probe ships a hardcoded set of acceptable SPKI SHA-256 hashes for the ingest endpoint certificate. Any certificate that does not match, including one from an ISP-controlled CA, causes the connection to be refused.

// Compute SPKI SHA-256 from a DER-encoded certificate
fn spki_sha256(cert: &rustls::Certificate) -> [u8; 32] {
    use sha2::{Sha256, Digest};
    // Parse the certificate with x509-parser and take the raw DER bytes of
    // its SubjectPublicKeyInfo. Panics if the input is not a valid DER cert.
    let parsed = x509_parser::parse_x509_certificate(&cert.0)
        .expect("valid DER cert");
    let spki_der = parsed.1.tbs_certificate.subject_pki
        .raw;
    Sha256::digest(spki_der).into()
}

// Pinned SPKI hashes are embedded at build time from the Voidly PKI
const PINNED_SPKI_HASHES: &[[u8; 32]] = &[
    // Primary ingest certificate (Cloudflare-managed, rotated every 90 days)
    hex_literal::hex!("a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2"),
    // Backup certificate (pre-positioned for rotation)
    hex_literal::hex!("b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3"),
];

Probe binaries are updated over the air ahead of certificate rotations. We maintain two pinned hashes at all times, the active certificate and the next one, so that a rotation does not require a forced update.
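
The two-hash scheme can be modelled as a small value type: both the active and the pre-positioned hash are accepted, and a rotation promotes the pre-positioned hash while installing a new one behind it. This is a minimal sketch under that reading of the text; `PinSet`, `accepts`, and `rotated` are hypothetical names, not the probe's actual types.

```rust
/// Active pin plus the pre-positioned pin for the next certificate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PinSet {
    active: [u8; 32],
    next: [u8; 32],
}

impl PinSet {
    pub fn new(active: [u8; 32], next: [u8; 32]) -> Self {
        PinSet { active, next }
    }

    /// A certificate is acceptable if its SPKI hash matches either pin.
    pub fn accepts(&self, spki_hash: &[u8; 32]) -> bool {
        *spki_hash == self.active || *spki_hash == self.next
    }

    /// Rotation: the old `next` becomes active and `upcoming` is pre-positioned.
    /// The previously active hash is no longer accepted afterwards.
    pub fn rotated(self, upcoming: [u8; 32]) -> Self {
        PinSet { active: self.next, next: upcoming }
    }
}
```

A probe holding the pair (A, B) keeps working when the server switches from A to B, and the following update ships (B, C), so there is always one release of slack before a pin mismatch.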

Domain fronting for blocked environments

In some countries, the ingest hostname (ingest.voidly.ai) is on a DNS blocklist, or all traffic to Cloudflare's ASN is filtered unless it appears to be bound for a whitelisted CDN customer. The probe falls back to domain fronting in these cases: the TLS SNI is set to a large CDN hostname that censors are reluctant to block (storage.googleapis.com or www.cloudflare.com, depending on the network), while the HTTP Host header routes to the actual ingest endpoint.

// Domain fronting configuration per country
// Selected automatically when DNS resolution fails for the ingest hostname
use quinn::{Connection, Endpoint};
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

pub struct FrontingConfig {
    pub sni_host:  &'static str,   // sent in TLS ClientHello
    pub dns_addr:  IpAddr,         // resolved IP of the front domain
    pub host_hdr:  &'static str,   // HTTP Host header → actual ingest route
}

const FRONTING_CONFIGS: &[FrontingConfig] = &[
    FrontingConfig {
        sni_host: "storage.googleapis.com",
        dns_addr: IpAddr::V4(Ipv4Addr::new(142, 250, 80, 128)),
        host_hdr: "ingest.voidly.ai",
    },
    FrontingConfig {
        sni_host: "www.cloudflare.com",
        dns_addr: IpAddr::V4(Ipv4Addr::new(104, 16, 0, 1)),
        host_hdr: "ingest.voidly.ai",
    },
];

async fn connect_with_fronting(config: &FrontingConfig) -> Result<Connection, TransportError> {
    // Connect to the CDN IP using the CDN's SNI (avoids SNI blocklisting)
    // The QUIC/HTTP3 Host header then routes to the actual backend
    let endpoint = Endpoint::client("0.0.0.0:0".parse().unwrap())?;
    let connection = endpoint
        .connect(SocketAddr::new(config.dns_addr, 443), config.sni_host)?
        .await?;
    // Certificate pinning still applies — the CDN's certificate won't match,
    // so we use a fronting-specific pinned hash for the CDN edge cert
    Ok(connection)
}
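
The fallback order described above can be written down as a small planning function: try the direct route only when the ingest hostname resolved, then walk the fronting configurations in their listed order. This is a sketch of one plausible policy, not the probe's actual selection logic; `Route` and `plan_routes` are hypothetical names, and `Fronted(i)` stands for the i-th entry of FRONTING_CONFIGS.

```rust
/// A connection attempt the probe could make, in priority order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    /// Connect straight to the ingest hostname.
    Direct,
    /// Connect through the fronting configuration at this index.
    Fronted(usize),
}

/// Builds the ordered list of routes to try.
/// A failed DNS lookup for the ingest hostname skips the direct route entirely.
pub fn plan_routes(ingest_dns_resolved: bool, front_count: usize) -> Vec<Route> {
    let mut routes = Vec::with_capacity(front_count + 1);
    if ingest_dns_resolved {
        routes.push(Route::Direct);
    }
    routes.extend((0..front_count).map(Route::Fronted));
    routes
}
```

Keeping the direct route first when DNS works covers the second failure mode in the text: the name resolves but traffic to the CDN's address space is filtered, so the direct attempt fails and the probe moves on to the fronted routes.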

NAT traversal and connection maintenance

QUIC copes with NAT rebinding better than TCP-based protocols because it identifies a connection by its connection ID rather than by the 4-tuple (source IP, source port, destination IP, destination port). When a probe's external IP or port changes (common behind consumer CG-NAT, where mappings expire after roughly 5 minutes of idle time), QUIC connection migration lets the session continue without a new handshake. The probe also uses the heartbeat stream to keep traffic flowing so that the NAT mapping does not expire in the first place.
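
How much slack the heartbeat leaves can be worked out from the two figures in the text: a 120-second heartbeat against a NAT idle timeout of roughly 300 seconds. The sketch below assumes the heartbeat is the only traffic on the mapping and that the mapping expires once the silent gap reaches the timeout; `tolerated_heartbeat_losses` is an illustrative helper, not part of the probe.

```rust
/// Number of consecutive heartbeats that can be lost before the silent gap
/// reaches the NAT idle timeout. Returns None when the interval is already
/// too long to keep the mapping alive (or is zero).
pub fn tolerated_heartbeat_losses(
    nat_idle_timeout_s: u64,
    heartbeat_interval_s: u64,
) -> Option<u64> {
    if heartbeat_interval_s == 0 || heartbeat_interval_s >= nat_idle_timeout_s {
        return None;
    }
    // After k lost heartbeats the gap is (k + 1) * interval, which must stay
    // strictly below the timeout. Solving for the largest such k:
    Some((nat_idle_timeout_s - 1) / heartbeat_interval_s - 1)
}
```

With the article's numbers the result is one: a single lost heartbeat leaves a 240-second gap, which is still inside the window, while two in a row leave 360 seconds and the mapping is gone.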

// Connection health monitor: runs in a background task
use std::{sync::Arc, time::Duration};
use tokio::sync::Mutex;

async fn maintain_connection(transport: Arc<Mutex<ProbeTransport>>) {
    let mut consecutive_missed = 0u8;
    let mut reconnect_attempts = 0u32;

    loop {
        tokio::time::sleep(Duration::from_secs(120)).await;

        let result = {
            let mut t = transport.lock().await;
            t.send_heartbeat().await
        };

        match result {
            Ok(latency_ms) => {
                consecutive_missed = 0;
                reconnect_attempts = 0;
                metrics::gauge!("heartbeat.latency_ms").set(latency_ms as f64);
            }
            Err(e) => {
                // saturating_add: a long outage must not overflow the u8 counter
                consecutive_missed = consecutive_missed.saturating_add(1);
                tracing::warn!("heartbeat missed ({consecutive_missed}/3): {e}");

                if consecutive_missed >= 3 {
                    // Exponential backoff reconnect: 30s, 60s, 120s, 240s, 480s (max).
                    // Driven by failed reconnect attempts, not by missed heartbeats,
                    // so the first reconnect waits 30s rather than 240s.
                    let backoff = Duration::from_secs(30u64 << reconnect_attempts.min(4));
                    tracing::info!("reconnecting after {backoff:?}");
                    tokio::time::sleep(backoff).await;

                    match ProbeTransport::connect(&PROBE_CONFIG).await {
                        Ok(new_transport) => {
                            *transport.lock().await = new_transport;
                            consecutive_missed = 0;
                            reconnect_attempts = 0;
                        }
                        Err(e) => {
                            reconnect_attempts = reconnect_attempts.saturating_add(1);
                            tracing::error!("reconnect failed: {e}");
                        }
                    }
                }
            }
        }
    }
}
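
The backoff schedule named in the comment above (30s, 60s, 120s, 240s, then capped at 480s) is easiest to verify as a pure function of the number of failed reconnect attempts. A minimal sketch; `reconnect_backoff` and its parameter are illustrative names rather than part of the probe source.

```rust
use std::time::Duration;

/// Delay before the next reconnect attempt.
/// Doubles from 30s with each failed attempt and stops growing at 480s.
pub fn reconnect_backoff(failed_attempts: u32) -> Duration {
    // min(4) caps the shift, so the result is at most 30 * 2^4 = 480 seconds.
    Duration::from_secs(30u64 << failed_attempts.min(4))
}
```

Capping the exponent rather than the product also keeps the shift from overflowing, however long an outage lasts.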

Local SQLite buffer: measurements survive disconnections

Network disruptions in censored countries can last minutes to hours. The probe does not lose measurements to a dropped connection: it writes each one to a local SQLite database first, then uploads. The upload worker drains the SQLite queue in batch_seq order and marks rows as uploaded only after receiving a server ack.

-- SQLite schema: probe-local measurement buffer
CREATE TABLE pending_measurements (
    rowid         INTEGER PRIMARY KEY,
    batch_seq     INTEGER NOT NULL,           -- monotonic, used for server dedup
    measurement   BLOB    NOT NULL,           -- Protobuf-encoded Measurement
    created_at    INTEGER NOT NULL,           -- Unix timestamp
    uploaded_at   INTEGER,                   -- NULL until server ack received
    upload_attempts INTEGER NOT NULL DEFAULT 0
);

CREATE INDEX idx_pending ON pending_measurements (uploaded_at, created_at);

-- Probe also enforces a 500MB cap: oldest uploaded rows are deleted first
-- to prevent unbounded growth on long-disconnected probes
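
// Upload worker: drains SQLite buffer in batch order.
// Returns a Result so that the `?` operators below compile; the loop itself
// only exits by propagating an error.
async fn upload_worker(
    db: SqlitePool,
    transport: Arc<Mutex<ProbeTransport>>,
) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {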
    loop {
        // Fetch up to 100 unuploaded measurements
        let rows = sqlx::query!(
            "SELECT rowid, batch_seq, measurement FROM pending_measurements
             WHERE uploaded_at IS NULL
             ORDER BY batch_seq ASC
             LIMIT 100"
        )
        .fetch_all(&db).await?;

        if rows.is_empty() {
            tokio::time::sleep(Duration::from_secs(5)).await;
            continue;
        }

        // Build a MeasurementBatch protobuf
        let batch = MeasurementBatch {
            probe_id:   PROBE_ID.to_string(),
            batch_seq:  rows.first().unwrap().batch_seq as u64,
            batch_hash: sha256_of_measurements(&rows),
            device_sig: sign_batch(&rows, &DEVICE_KEY),
            measurements: rows.iter()
                .map(|r| Measurement::decode(&*r.measurement).unwrap())
                .collect(),
        };

        // Compress with zstd (typically 3.2× ratio on measurement data)
        let compressed = zstd::encode_all(batch.encode_to_vec().as_slice(), 3)?;

        match transport.lock().await.upload_batch(&compressed).await {
            Ok(acked_seqs) => {
                // Mark acked rows as uploaded
                for seq in acked_seqs {
                    sqlx::query!(
                        "UPDATE pending_measurements SET uploaded_at = ?1 WHERE batch_seq = ?2",
                        unix_now(), seq
                    ).execute(&db).await?;
                }
            }
            Err(e) => {
                tracing::warn!("upload failed, will retry: {e}");
                tokio::time::sleep(Duration::from_secs(10)).await;
            }
        }
    }
}
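
The ack-then-mark rule is what makes a retry safe: rows the server did not acknowledge stay in the queue and are picked up again by the next query. The in-memory model below illustrates that behaviour without SQLite; `PendingQueue` and its methods are hypothetical stand-ins for the pending_measurements table, with a map from batch_seq to uploaded_at.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for pending_measurements:
/// batch_seq -> uploaded_at (None until the server acks the row).
#[derive(Default)]
pub struct PendingQueue {
    rows: BTreeMap<u64, Option<u64>>,
}

impl PendingQueue {
    /// Adds a row in the "not yet uploaded" state; existing rows are untouched.
    pub fn enqueue(&mut self, batch_seq: u64) {
        self.rows.entry(batch_seq).or_insert(None);
    }

    /// Lowest unacked sequence numbers, mirroring
    /// `WHERE uploaded_at IS NULL ORDER BY batch_seq ASC LIMIT n`.
    pub fn next_batch(&self, limit: usize) -> Vec<u64> {
        self.rows
            .iter()
            .filter(|(_, uploaded_at)| uploaded_at.is_none())
            .map(|(seq, _)| *seq)
            .take(limit)
            .collect()
    }

    /// Marks only the acked rows as uploaded and returns how many rows changed.
    /// Unknown or already-acked sequence numbers are ignored.
    pub fn apply_ack(&mut self, acked: &[u64], now: u64) -> usize {
        let mut changed = 0;
        for seq in acked {
            if let Some(slot) = self.rows.get_mut(seq) {
                if slot.is_none() {
                    *slot = Some(now);
                    changed += 1;
                }
            }
        }
        changed
    }
}
```

If the server acks only part of a batch, the unacked sequence numbers simply reappear at the head of the next batch, and replaying an old ack changes nothing.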

The SQLite buffer holds up to 500 MB of pending measurements (about 48 hours of measurements at typical probe cadence). Measurements older than 48 hours are still uploaded but carry a late_arrival_gt48h quality flag, and the ingest pipeline discards them rather than adding stale data to the live dataset.
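
The 48-hour rule comes down to one comparison between a measurement's created_at timestamp and the upload time. A minimal sketch, assuming Unix timestamps in seconds as in the schema above; the flag name is taken from the text, while `late_arrival_flag` and `LATE_ARRIVAL_CUTOFF_S` are illustrative names.

```rust
/// Age beyond which a measurement counts as a late arrival (48 hours).
pub const LATE_ARRIVAL_CUTOFF_S: u64 = 48 * 3600;

/// Returns the quality flag for a measurement that is strictly older than
/// 48 hours at upload time, or None if it is still fresh.
pub fn late_arrival_flag(created_at: u64, now: u64) -> Option<&'static str> {
    // saturating_sub guards against clock skew putting created_at in the future.
    let age_s = now.saturating_sub(created_at);
    if age_s > LATE_ARRIVAL_CUTOFF_S {
        Some("late_arrival_gt48h")
    } else {
        None
    }
}
```

Treating a timestamp from the future as age zero is a choice made for this sketch; a real pipeline might prefer to flag clock skew separately.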

Metered connection awareness

Some probe operators run on mobile data plans with monthly caps. The probe tracks cumulative upload bytes over a 24-hour rolling window and backs off non-critical uploads if the rate exceeds a configurable threshold (default 50 MB/day). Measurements are still written to SQLite; only the upload cadence slows.

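use rand::Rng;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

pub struct DataBudgetGuard {
    budget_bytes_per_day: u64,     // configurable per operator, default 50MB
    uploaded_today: AtomicU64,
    window_start: Mutex<Instant>,  // Mutex so the window can advance through &self
}

impl DataBudgetGuard {
    pub fn check_and_record(&self, bytes: u64) -> UploadDecision {
        // Start a new window every 24h. The lock is held for the whole call,
        // so the reset and the budget check cannot interleave across threads.
        let mut window_start = self.window_start.lock().unwrap();
        if window_start.elapsed() > Duration::from_secs(86_400) {
            *window_start = Instant::now();
            self.uploaded_today.store(0, Ordering::Relaxed);
        }

        let current = self.uploaded_today.load(Ordering::Relaxed);
        if current.saturating_add(bytes) > self.budget_bytes_per_day {
            // Over budget: delay non-urgent upload by 1–4 hours.
            // Deferred bytes are not recorded, since nothing was sent.
            UploadDecision::Defer(Duration::from_secs(
                rand::thread_rng().gen_range(3600..14400)
            ))
        } else {
            self.uploaded_today.fetch_add(bytes, Ordering::Relaxed);
            UploadDecision::Proceed
        }
    }
}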

Upload compression

Protobuf-encoded MeasurementBatch messages are zstd-compressed before upload. The typical compression ratio is 3.2× for measurement data (dominated by repeated field names, country codes, and IP addresses). A typical 100-measurement batch is ~48 KB uncompressed, ~15 KB compressed. At the 50 MB/day budget, that is approximately 3,300 batches per day — far above the typical probe's cadence of ~200 batches/day.
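
The figures in this paragraph can be reproduced with integer arithmetic. The sketch below assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) and expresses the 3.2× ratio in tenths to avoid floating-point rounding; both function names are illustrative.

```rust
/// Compressed size in bytes for a compression ratio given in tenths
/// (a ratio of 3.2x is passed as 32).
pub fn compressed_size(uncompressed_bytes: u64, ratio_tenths: u64) -> u64 {
    uncompressed_bytes * 10 / ratio_tenths
}

/// Whole batches that fit into a daily byte budget.
pub fn batches_per_day(budget_bytes: u64, compressed_batch_bytes: u64) -> u64 {
    budget_bytes / compressed_batch_bytes
}
```

A 48,000-byte batch at 3.2× compresses to 15,000 bytes, and a 50,000,000-byte budget holds 3,333 such batches, which is roughly 16 times the typical cadence of 200 batches per day.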

Latency to database

End-to-end latency from a measurement completing on the probe to the row appearing in TimescaleDB:

Stage                       p50      p95
Probe → Cloudflare Worker   1.8s     8.2s
CF Worker → Kafka           0.3s     1.1s
Kafka → Rust consumer       0.4s     2.3s
Consumer → TimescaleDB      0.8s     3.7s
Total (probe → DB)          ~3.3s    ~15.3s

The p95 tail is dominated by probes in high-latency environments (satellite, censored mobile data, metered connections that have backed off). The ingest pipeline's full architecture — normalization, quality checks, Kafka consumer — is described in the probe ingest pipeline article.


For the Tauri + boringtun probe architecture that runs the measurements uploaded through this transport layer: The Voidly Probe: Tauri + boringtun network measurement at the operator's edge →

For how the probe buffers measurements locally during upload failures — SQLite ring buffer, LZ4 compression, priority queue, and chunked delivery: Voidly probe local measurement buffer: SQLite ring buffer, batch compression, and resilient upload →

For the server-side pipeline that receives, normalizes, and quality-filters the uploaded batches: Voidly's probe-to-dataset ingest pipeline: normalization, quality filtering, and TimescaleDB indexing →

For what happens inside each probe run before the batch is assembled — DNS, TCP, TLS, HTTP measurement phases, ProbeResult assembly, and Ed25519 signing: Voidly probe run lifecycle: from scheduled task to classifier input →

For how the real-time event pipeline processes measurements into alerts within 8 minutes of upload: Voidly's real-time event pipeline: from measurement anomaly to journalist alert in under 8 minutes →