Technical writing

The Voidly probe test runner: concurrency, timeout handling, and the measurement state machine

November 8, 2025· AI Analytics

CensorshipVoidlyInfrastructureRust

The test runner's place in the pipeline

The measurement lifecycle in a Voidly probe starts with the scheduler assigning tasks and ends with signed results queued for upload. Between those two points sits the test runner: a Rust async task that pulls MeasurementTask structs from a channel, manages concurrent execution with a semaphore, enforces per-layer timeouts with tokio::time::timeout, and classifies results before pushing them to the upload queue.

The HTTP measurement article covers what happens inside a single probe test — the DNS, TCP, TLS, and HTTP layers and how each maps to an interference type. The probe architecture article covers the Tauri application shell that houses all of this. This article covers the engine that connects them: the component responsible for deciding how many tests run at once, what happens when a layer stalls, and which results are worth uploading.

The MeasurementTask struct

The scheduler dispatches work as MeasurementTask structs. Each task identifies a domain, the probe's location context, the protocol layers to exercise, and a priority score used to order execution within the queue:

use std::time::SystemTime;

#[derive(Debug, Clone)]
pub struct MeasurementTask {
    pub task_id:       uuid::Uuid,
    pub domain:        String,
    pub probe_cc:      String,
    pub probe_asn:     u32,
    pub protocols:     Vec<Protocol>,   // [DNS, TCP, TLS, HTTP]
    pub priority:      u8,              // 1-10; higher = sooner
    pub scheduled_at:  SystemTime,
    pub is_warmup:     bool,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Protocol { Dns, Tcp, Tls, Http }

The is_warmup flag marks the first measurement of a probe session. Warmup measurements are treated differently in the anomaly classifier — the probe's routing table and DNS caches are cold, and timing baselines haven't stabilized. The classifier discounts z-score features for warmup measurements and requires higher anomaly probability before raising a flag.

The priority field is set by the scheduler based on country risk tier, domain category (news and circumvention tools score higher), and whether the domain recently produced an anomaly in another vantage. A domain that generated an anomaly in a neighbouring AS within the last 6 hours is bumped to priority 9 regardless of its base tier — this is the “anomaly boost” that lets the network respond faster to active censorship events.

The MeasurementState machine

Every task passes through a state machine as it executes. The states represent the complete lifecycle of a single measurement attempt:

#[derive(Debug, Clone, PartialEq)]
pub enum MeasurementState {
    Pending,
    Running { started_at: SystemTime },
    Success(MeasurementResult),
    Error  { reason: ErrorClass, detail: String },
    Timeout { layer: &'static str },
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ErrorClass {
    NetworkError,   // TCP RST, ICMP unreachable
    DnsError,       // NXDOMAIN, SERVFAIL (not interference — genuine error)
    TlsError,       // handshake error not caused by interference
    ParseError,     // malformed response
}

The distinction between Error and Timeout is deliberate and consequential. An Error means the connection attempt returned a definite failure — a TCP RST, a DNS SERVFAIL, a TLS alert. A Timeout means the operation simply stopped responding. These outcomes have different implications for censorship classification and different retry policies. They are kept as separate variants rather than collapsed into a single failure type precisely because the layer at which a timeout occurs is itself evidence.

ErrorClass distinguishes errors that indicate genuine network problems from errors that could indicate interference. The classifier receives both the raw error class and the full per-layer measurement data, so these enum values are diagnostic aids rather than final verdicts.

Concurrency model: three permits

The test runner uses a tokio::sync::Semaphore with 3 permits, allowing exactly 3 concurrent measurement tasks. A 4th task blocks on acquire() until one of the 3 running tasks completes and drops its permit:

use tokio::sync::Semaphore;
use std::sync::Arc;

pub struct TestRunner {
    concurrency: Arc<Semaphore>,        // permits = 3
    upload_tx:   tokio::sync::mpsc::Sender<MeasurementResult>,
}

impl TestRunner {
    pub async fn run_task(&self, task: MeasurementTask) {
        let _permit = self.concurrency.acquire().await.unwrap();
        let result  = self.execute_with_timeout(&task).await;
        if let MeasurementState::Success(r) = result {
            let _ = self.upload_tx.send(r).await;
        }
        // Errors and timeouts are logged locally; not uploaded
    }
}

The limit of 3 concurrent tasks is intentional. Running too many simultaneous network tests from one IP increases the chance of triggering rate-limiting or connection-tracking-based CDN blocks that would produce false positives. A CDN that sees 10 simultaneous connections to multiple blocked domains from the same residential IP may apply a temporary block that is indistinguishable from censorship. Three concurrent tasks allows meaningful throughput against the 80-domain test list while staying within the connection behaviour profile of a single active browser tab.

The semaphore is Arc-wrapped so that the permit guard can be held across await points. The permit is dropped when the task completes (or panics) because _permit is bound with a leading underscore — it's not used directly, but its Drop implementation returns the permit to the semaphore. This pattern avoids explicit permit release calls and ensures correct cleanup even in the error path.

Per-layer timeouts

Each protocol layer has an independent timeout budget. The budgets are stacked inside a total 30-second hard limit:

use tokio::time::{timeout, Duration};

// Timeout budgets per protocol layer
const TIMEOUT_DNS:   Duration = Duration::from_secs(3);
const TIMEOUT_TCP:   Duration = Duration::from_secs(5);
const TIMEOUT_TLS:   Duration = Duration::from_secs(8);
const TIMEOUT_HTTP:  Duration = Duration::from_secs(15);
const TIMEOUT_TOTAL: Duration = Duration::from_secs(30);

async fn execute_with_timeout(
    task: &MeasurementTask,
) -> MeasurementState {
    let total_guard = timeout(TIMEOUT_TOTAL, async {
        // DNS
        let dns_result = match timeout(TIMEOUT_DNS, run_dns(task)).await {
            Ok(r)  => r,
            Err(_) => return MeasurementState::Timeout { layer: "dns" },
        };
        // TCP
        let tcp_result = match timeout(TIMEOUT_TCP, run_tcp(task, &dns_result)).await {
            Ok(r)  => r,
            Err(_) => return MeasurementState::Timeout { layer: "tcp" },
        };
        // TLS (if HTTPS)
        let tls_result = if task.protocols.contains(&Protocol::Tls) {
            match timeout(TIMEOUT_TLS, run_tls(task, &tcp_result)).await {
                Ok(r)  => Some(r),
                Err(_) => return MeasurementState::Timeout { layer: "tls" },
            }
        } else { None };
        // HTTP
        let http_result = match timeout(TIMEOUT_HTTP, run_http(task, &tcp_result)).await {
            Ok(r)  => r,
            Err(_) => return MeasurementState::Timeout { layer: "http" },
        };

        MeasurementState::Success(assemble_result(task, dns_result, tcp_result, tls_result, http_result))
    }).await;

    total_guard.unwrap_or(MeasurementState::Timeout { layer: "total" })
}

The outer timeout(TIMEOUT_TOTAL, ...) wraps the entire measurement. If any layer's individual timeout fires, the inner async block returns early with the layer-specific variant. If the inner block somehow exceeds 30 seconds despite all the per-layer guards — possible if the overhead between layers accumulates, or if a future implementation adds intermediate steps — total_guard.unwrap_or(...) catches it.

Why per-layer timeouts matter for censorship detection

A 30-second total timeout with no per-layer budget would allow a slow DNS response — stuck for 28 seconds — to consume the entire budget before TCP even starts. With per-layer timeouts, Voidly can classify exactly which layer timed out, and that classification carries evidentiary weight.

Timeout { layer: "dns" } is itself evidence of DNS interference in high-risk countries where authoritative resolvers are known to delay NXDOMAIN responses for blocked domains. The resolver has received the query but is waiting on a policy decision system before responding. This pattern is documented in Iran and China, where DNS resolvers for certain blocked domains return delayed responses rather than immediate errors.

Timeout { layer: "tcp" } indicates that DNS succeeded but no TCP connection was established within 5 seconds. This is the fingerprint of a packet-drop firewall rule — not a RST injection (which would produce a fast error) but a black hole route where packets are silently discarded. The distinction between Timeout { layer: "tcp" } and ErrorClass::NetworkError (which covers RST) is directly used by the bgp_withdrawal interference classifier.

Timeout { layer: "tls" } — TCP connected but TLS handshake stalled — is rare but meaningful. It occurs when a middlebox intercepts the TCP connection and holds it open without completing or terminating the TLS handshake. This has been observed in Kazakhstan where DPI infrastructure sometimes stalls rather than resets connections.

Measurement signing before upload

Before a successful MeasurementResult is pushed to the upload channel, it is signed with the probe's ed25519 keypair. The signing ensures that results received by the collector can be verified as originating from a registered probe and haven't been tampered with in transit:

use ed25519_dalek::{Keypair, Signer};
use sha2::{Sha256, Digest};

fn sign_result(result: &MeasurementResult, keypair: &Keypair) -> Vec<u8> {
    let mut hasher = Sha256::new();
    hasher.update(serde_json::to_vec(result).unwrap());
    let payload_hash = hasher.finalize();
    keypair.sign(&payload_hash).to_bytes().to_vec()
}

The signing approach: SHA-256 the JSON-serialized result, then sign the hash with ed25519. The collector verifies the signature against the probe's registered public key before ingesting the measurement. A measurement that fails signature verification is rejected — it either originated from an unregistered probe or was modified after signing.

The probe's keypair is the same X25519-Dalek key used for the WireGuard tunnel, stored in the OS keychain and never written to disk directly. The signing step adds a small but measurable overhead (SHA-256 of a ~3KB JSON payload, then an ed25519 sign operation) that is negligible relative to the network time of the measurement itself.

The upload queue

Signed results are pushed to a tokio::sync::mpsc::channel with capacity 200. If the channel is full — because the upload task is backed up from a network outage or collector unavailability — the test runner drops the result and increments a dropped_results counter:

// In run_task: try_send instead of send to avoid blocking the test runner
// on a full upload queue.
match self.upload_tx.try_send(signed_result) {
    Ok(()) => {}
    Err(TrySendError::Full(_)) => {
        metrics::counter!("upload_dropped_total").increment(1);
        // result is dropped here — recovery path is SQLite
    }
    Err(TrySendError::Closed(_)) => {
        // upload task has exited; probe is shutting down
    }
}

Dropping from the channel does not mean the measurement is permanently lost. A parallel task writes every MeasurementResult to a local SQLite buffer capped at 500MB. The SQLite writer receives results via a second channel that is unbounded — it never drops. When the upload task reconnects to the collector after an outage, it drains the SQLite buffer before processing the in-memory channel, so results are uploaded in chronological order.

The 500MB SQLite cap corresponds to approximately 150,000 full measurement results. At the probe's nominal throughput of one measurement every 3–5 seconds, that covers roughly six days of offline operation before the oldest results are evicted. The eviction policy is oldest-first: when the database size exceeds the cap, the 1,000 oldest rows are deleted to make room.

Retry policy

The test runner applies different retry behaviour depending on how a task ended:

Zero retries for ErrorClass::NetworkError or ErrorClass::DnsError. Retrying immediately from the same IP would generate duplicate evidence of the same error without adding information. If a domain's TCP connection is RST by a middlebox, a second attempt 100ms later will produce the same RST. The error is logged and the task is not re-queued.
One retry after a 30-second backoff for MeasurementState::Timeout. Timeouts are sometimes transient: server-side congestion, a momentary routing hiccup, or a flapping DPI appliance. A single retry after a brief wait can confirm or refute the interference signal. If the retry also times out, the result is recorded as a confirmed timeout at that layer. If the retry succeeds, the original timeout is recorded as a transient failure and the successful result is uploaded.
Zero retries for ErrorClass::TlsError or ErrorClass::ParseError. These errors indicate a deterministic protocol-level failure. A TLS handshake that fails with a specific alert code will fail the same way on retry. Parse errors in an HTTP response body indicate either server-side corruption or a block page injected into the stream — both are worth recording as-is.

The 30-second backoff for timeout retries is coordinated with the semaphore: the retrying task releases its permit during the backoff period, freeing the slot for other tasks. The retry is spawned as a new tokio::taskthat acquires a fresh permit when it runs. This avoids a scenario where a timed-out task holds a concurrency permit for 30 seconds of dead waiting.

Observability counters

The test runner exposes Prometheus-style counters that are readable by the Tauri frontend and by the probe health monitor. These are incremented atomically using the metrics crate:

measurements_attempted_total (labelled by protocol set) — incremented when a task is dequeued and a semaphore permit is acquired.
measurements_success_total — incremented when MeasurementState::Success is reached and the result passes the pre-upload signature check.
measurements_timeout_total (labelled by layer) — incremented on any Timeout variant. The layer label distinguishes "dns", "tcp", "tls","http", and "total".
measurements_error_total (labelled by error_class) — incremented on any Error variant.
upload_queue_depth — sampled as a gauge when each task completes, reporting the current count of results waiting in the upload channel.
upload_dropped_total — incremented each time a result is dropped due to a full upload channel.

The Tauri frontend polls these counters every 10 seconds and displays them in the probe status panel. The probe health monitor — described in the probe health article — reads the same counters remotely via the WireGuard tunnel to detect probes that have stalled (measurements_attempted_total not advancing) or are experiencing elevated timeout rates (measurements_timeout_totalexceeding a 20% threshold over a 30-minute window).

The upload_dropped_total counter is the primary indicator of upload backlog. A probe that drops more than 50 results in an hour is considered to have a degraded upload path. The health monitor raises a ProbeStatus::DegradedUpload flag and the collector pauses anomaly-boost promotions for that probe until the backlog clears.