Technical writing

How Voidly protects probe operator identity while publishing full measurement data

April 2, 2025 · 10 min read · AI Analytics

Tags: Censorship, Voidly, Methodology, Infrastructure

Censorship measurement data is useful precisely because it is specific: which AS, which country, which hour. A vague claim that “some ISPs in Iran block Twitter” is not actionable for researchers or journalists. A dataset entry saying that AS12880 (Information Technology Company, Iran) returned a block page for twitter.com at 2025-03-14T11:00Z is. The specificity is the product.

The tension is that the same specificity that makes the dataset valuable creates a re-identification risk for the humans who operate the probes. An operator in Tehran running a Voidly probe is generating measurements from a known AS in a known country. If the published dataset also included the probe's IP address, or any identifier that could be linked back to the operator's IP address, a government entity with access to the dataset could use it to identify and locate the probe operator. In some of the countries Voidly measures most carefully, that consequence is physical.

The publication model must therefore satisfy two requirements simultaneously: full AS-level and country-level measurement data must be published, and operator re-identification must be infeasible even for adversaries with complete access to the published dataset. This post describes how the architecture achieves both.

Probe identity architecture

At first launch, the Voidly desktop application generates an X25519 keypair using x25519-dalek with OsRng. The private key is stored in the OS keychain and never leaves the device. The public key is registered with the control server and serves as the sole cryptographic identity of the probe.

The probe_id that appears in the published dataset is derived from the public key:

use sha2::{Sha256, Digest};
use x25519_dalek::PublicKey;

fn derive_probe_id(public_key: &PublicKey) -> String {
    let mut hasher = Sha256::new();
    hasher.update(public_key.as_bytes());
    let hash = hasher.finalize();
    // Encode as lowercase hex — 64 characters
    hex::encode(hash)
}

The probe_id is stable across the probe's lifetime (it changes only when the keypair is rotated, which happens on a schedule for high-risk country probes). It is not an IP address, not a subnet, and not any routable network identifier. Knowing a probe_id tells you nothing about where on the internet the probe is running beyond what the accompanying probe_asn and probe_cc fields already make explicit.
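Because a probe_id is the hex encoding of a SHA-256 digest, its shape is fully predictable, and a dataset consumer can reject malformed values before joining on them. The following is a minimal sketch of such a check; the function name is hypothetical and is not part of Voidly's code.

```rust
/// Returns true if `candidate` has the shape of a published probe_id:
/// exactly 64 characters, each a lowercase hexadecimal digit.
/// (Hypothetical helper for dataset consumers, not Voidly's own code.)
pub fn is_well_formed_probe_id(candidate: &str) -> bool {
    candidate.len() == 64
        && candidate
            .bytes()
            .all(|b| b.is_ascii_digit() || (b'a'..=b'f').contains(&b))
}
```

The check only confirms the format. It cannot tell whether a given value corresponds to a registered probe.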

The control server enforces a strict no-log policy on incoming probe connection IPs, with a retention period of zero seconds. The source IP is used for rate limiting and for the GeoIP lookups that produce probe_cc and probe_asn, in a hot path that runs synchronously before any async write, and is then discarded. No IP-to-probe_id mapping is ever persisted to disk or transmitted to any downstream service.

The codename system

A probe's probe_id (the SHA-256 hash of the public key) is its cryptographic identity used in the measurement pipeline. It is not the identifier used in operator-facing communications. Operators interact with Voidly through a randomly generated human-readable codename assigned at registration — for example, tangerine-falcon-41.

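Codenames are generated from a fixed wordlist: an adjective, a noun, and a two-digit number (00–99). With 75 adjectives, 60 nouns, and 100 number suffixes, the space is 75 × 60 × 100 = 450,000 combinations, or roughly 18.8 bits. That is small enough to enumerate, so a codename is a human-friendly label rather than a secret; what the space does provide is that a single guess at a particular operator's codename is very unlikely to succeed for anyone who doesn't already know it. The generation is straightforward: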

use rand::seq::SliceRandom;
use rand::Rng;
use rand::rngs::OsRng;

const ADJECTIVES: &[&str] = &[
    "amber", "azure", "brass", "coral", "crimson",
    "dusk", "ember", "fern", "garnet", "jade",
    // ... 60 more (75 adjectives in total)
    "tangerine", "umber", "viridian", "wren", "zinc",
];

const NOUNS: &[&str] = &[
    "albatross", "badger", "crane", "dingo", "egret",
    "falcon", "goshawk", "heron", "ibis", "jackal",
    // ... 45 more (60 nouns in total)
    "swift", "thrush", "urchin", "vole", "weasel",
];

fn generate_codename() -> String {
    let mut rng = OsRng;
    let adj = ADJECTIVES.choose(&mut rng).unwrap();
    let noun = NOUNS.choose(&mut rng).unwrap();
    let num: u8 = rng.gen_range(0..100);
    format!("{}-{}-{:02}", adj, noun, num)
}
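The size of the codename space quoted above can be checked with a few lines of arithmetic. The sketch below uses hypothetical helper names and also estimates, via the standard birthday approximation, how likely it is that two operators draw the same codename. The post does not say whether registration retries on a collision, so that part is an assumption about what a deployment might want to know.

```rust
/// Number of distinct codenames for the given wordlist sizes.
pub fn codename_space(adjectives: u64, nouns: u64, suffixes: u64) -> u64 {
    adjectives * nouns * suffixes
}

/// Entropy in bits of a codename drawn uniformly from `space` possibilities.
pub fn entropy_bits(space: u64) -> f64 {
    (space as f64).log2()
}

/// Approximate probability that at least two of `n` independently drawn
/// codenames collide (birthday bound: 1 - exp(-n(n-1) / (2 * space))).
pub fn collision_probability(n: u64, space: u64) -> f64 {
    let pairs = (n as f64) * (n.saturating_sub(1) as f64) / 2.0;
    1.0 - (-pairs / space as f64).exp()
}
```

With 75 adjectives, 60 nouns, and 100 suffixes this gives 450,000 codenames, a little under 19 bits. With on the order of a thousand operators, a duplicate somewhere in the population becomes more likely than not, so a uniqueness check at assignment time would be a reasonable addition.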

Three properties make the codename system safe. First, codenames are not recorded alongside probe_id in any joint database table. The control server stores the codename in an operator communications table and the probe_id in a separate measurement identity table, with no foreign key linking the two. A database breach of either table in isolation reveals nothing about the other. Second, operators can rotate their codename at any time by requesting a new one through the probe application; rotation generates a new codename while leaving the probe_id and keypair unchanged. Third, Voidly staff who handle operator support tickets see only the codename — they have no query path to the corresponding probe_id without deliberate cross-table joins that are disabled for support roles.

Measurement data anonymization

The published dataset includes the following per-measurement fields related to probe identity and location:

probe_id — the SHA-256 hash of the probe's public key. Stable across measurements from the same probe. Not an IP address.
probe_cc — two-letter country code derived at the control server side using MaxMind GeoLite2-Country applied to the probe's connection IP. The country code is assigned once per session and stored; the IP that produced it is discarded.
probe_asn — Autonomous System Number in AS12345 format, derived using MaxMind GeoLite2-ASN at the same moment as the country lookup.
network_type — one of residential, mobile, or datacenter, assigned from the CAIDA AS-Rank classification for the detected ASN. This field is static per ASN, not derived from the probe's connection.

The following fields are explicitly never published:

IP address or any subnet prefix
Precise geolocation (city, region, or coordinates)
Measurement timestamp at finer than one-hour granularity — timestamps are rounded down to the hour boundary before publication

The IP stripping happens in the normalization step that transforms an incoming protobuf measurement into a Kafka record. The function signature and the stripping logic:

use crate::proto::RawMeasurement;
use crate::kafka::NormalizedRecord;
use crate::geoip::{lookup_country, lookup_asn};
use chrono::{Utc, Timelike};
// lookup_network_type is assumed to be in scope; its module is not shown here.

pub fn normalize_measurement(
    raw: RawMeasurement,
    source_ip: std::net::IpAddr,
) -> NormalizedRecord {
    // Derive geo fields from IP before discarding it
    let probe_cc = lookup_country(source_ip)
        .unwrap_or_else(|_| "ZZ".to_string());
    let probe_asn = lookup_asn(source_ip)
        .map(|asn| format!("AS{}", asn))
        .unwrap_or_else(|_| "AS0".to_string());

    // Resolve the network type now, while probe_asn can still be borrowed;
    // it is moved into the record below
    let network_type = lookup_network_type(&probe_asn);

    // Round timestamp down to the hour; sub-hour granularity is dropped
    let ts = Utc::now();
    let ts_hour = ts.with_minute(0).unwrap()
                    .with_second(0).unwrap()
                    .with_nanosecond(0).unwrap();

    // source_ip is NOT included in the returned record
    NormalizedRecord {
        probe_id: raw.probe_id,   // SHA-256 of public key, not the IP
        probe_cc,
        probe_asn,
        network_type,
        measurement_hour: ts_hour,
        domain: raw.domain,
        dns_result: raw.dns_result,
        tls_result: raw.tls_result,
        http_result: raw.http_result,
        interference_class: raw.interference_class,
    }
    // source_ip goes out of scope here and is not written anywhere
}
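The hour rounding can also be expressed directly on Unix seconds, which makes the behaviour easy to test in isolation. This is a standalone sketch with a hypothetical function name, not the pipeline's actual code, which works on chrono values as shown above.

```rust
const SECONDS_PER_HOUR: i64 = 3_600;

/// Rounds a Unix timestamp (seconds since 1970-01-01T00:00:00Z) down to the
/// start of its UTC hour. `rem_euclid` keeps the result a true floor even
/// for negative (pre-1970) inputs.
pub fn floor_to_hour(unix_secs: i64) -> i64 {
    unix_secs - unix_secs.rem_euclid(SECONDS_PER_HOUR)
}
```

For example, a measurement received at 2025-03-14T11:37:12Z (Unix time 1741952232) maps to 1741950000, which is 2025-03-14T11:00:00Z. Every measurement in that hour collapses to the same published value.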

Per-probe signing keys and JWT architecture

Each measurement upload is cryptographically signed by the probe. Although the key exchange at the WireGuard layer uses X25519, upload authentication uses Ed25519 — a signing key derived from the probe's X25519 private key via a deterministic KDF. This gives each probe a stable signing identity without requiring a separate key to be stored or registered.

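The control server verifies upload signatures using the Ed25519 public key it stores for each probe at registration time. Because the signing key comes out of a KDF over the X25519 private key, the matching Ed25519 public key cannot be computed from the X25519 public key alone, so it has to be supplied by the probe when it registers. The verification logic: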

use ed25519_dalek::{Verifier, VerifyingKey, Signature};
use crate::db::ProbeKeyStore;

pub async fn verify_upload_signature(
    probe_id: &str,
    message: &[u8],
    signature_bytes: &[u8; 64],
    key_store: &ProbeKeyStore,
) -> Result<(), VerificationError> {
    // Fetch only the public key — no user/operator data in this table
    let verifying_key_bytes = key_store
        .get_ed25519_public_key(probe_id)
        .await?;

    let verifying_key = VerifyingKey::from_bytes(&verifying_key_bytes)
        .map_err(|_| VerificationError::InvalidKey)?;

    let signature = Signature::from_bytes(signature_bytes);

    verifying_key
        .verify(message, &signature)
        .map_err(|_| VerificationError::InvalidSignature)
}

The ProbeKeyStore is a separate database with a schema containing exactly two columns: probe_id (the SHA-256 hash) and ed25519_public_key (32 bytes). There are no foreign keys to any operator, user, or communications table. A full dump of this database reveals only the mapping from probe_id to public key — no operator identity, no IP history, no codename.
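To make the two-column shape concrete, here is a hypothetical in-memory stand-in for the key store. The type and method names other than get_ed25519_public_key are invented for illustration, and the real store is a database with an async interface, so treat this as a sketch of the data model only.

```rust
use std::collections::HashMap;

/// One row of the key store: only the two columns named in the text.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProbeKeyRow {
    pub probe_id: String,             // hex SHA-256 of the X25519 public key
    pub ed25519_public_key: [u8; 32], // raw verifying key bytes
}

/// Hypothetical in-memory stand-in for the key store (assumption: the real
/// store is a separate database; this only mirrors its schema).
#[derive(Default)]
pub struct InMemoryProbeKeyStore {
    rows: HashMap<String, [u8; 32]>,
}

impl InMemoryProbeKeyStore {
    /// Inserts a row. Returns false, leaving the store unchanged, if the
    /// probe_id is already registered.
    pub fn register(&mut self, row: ProbeKeyRow) -> bool {
        if self.rows.contains_key(&row.probe_id) {
            return false;
        }
        self.rows.insert(row.probe_id, row.ed25519_public_key);
        true
    }

    /// Looks up the verifying key bytes for a probe_id, if registered.
    pub fn get_ed25519_public_key(&self, probe_id: &str) -> Option<[u8; 32]> {
        self.rows.get(probe_id).copied()
    }
}
```

Nothing in either type has a field where an operator name, codename, or IP address could be stored, which is the property the schema is meant to guarantee.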

High-risk country extra protections

Twelve countries receive additional protections for their probe operators beyond the baseline anonymization described above. The classification is based on Freedom House Freedom on the Net scores and the presence of laws that could make censorship measurement activity a criminal offense. The current list:

Country	Code	Publication delay	probe_id rotation
China	`CN`	4–48 hours	90 days
Russia	`RU`	4–48 hours	90 days
Iran	`IR`	4–48 hours	90 days
Belarus	`BY`	4–48 hours	90 days
Vietnam	`VN`	4–48 hours	90 days
North Korea	`KP`	4–48 hours	90 days
Cuba	`CU`	4–48 hours	90 days
Syria	`SY`	4–48 hours	90 days
Sudan	`SD`	4–48 hours	90 days
Eritrea	`ER`	4–48 hours	90 days
Turkmenistan	`TM`	4–48 hours	90 days
Azerbaijan	`AZ`	4–48 hours	90 days

Four protections apply specifically to probes in these countries. First, measurement uploads are held in an internal queue before publication for a random duration between 4 and 48 hours. The delay is sampled uniformly at publication time, not at upload time, so there is no predictable pattern a network observer could use to correlate upload timing with publication timing. Second, the probe_id is rotated every 90 days: the probe generates a new X25519 keypair, re-registers the new public key with the control server, and begins uploading under a new probe_id. Measurements before and after the rotation boundary are cryptographically unlinkable without the operator's voluntary disclosure. Third, measurements are tagged high_risk_country=true in the internal Kafka topic so the pipeline can apply additional scrubbing, but this field is stripped before any record reaches the public dataset. Fourth, operators in these countries are advised (but not required) to route probe traffic through Tor or a VPN service so that the network observer between them and the control server sees only Tor or VPN traffic, not a WireGuard connection to the Voidly collector endpoint.
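The first two protections reduce to simple arithmetic. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the uniform random draw is passed in by the caller (for instance from an OS-backed CSPRNG) so that the logic stays deterministic and testable. It does not model when in the pipeline the draw is taken.

```rust
const MIN_DELAY_SECS: u64 = 4 * 3_600; // 4 hours
const MAX_DELAY_SECS: u64 = 48 * 3_600; // 48 hours
const ROTATION_PERIOD_DAYS: u64 = 90;

/// Maps a uniform draw in [0, 1] to a hold time between 4 and 48 hours,
/// in seconds. Draws outside [0, 1] are clamped to the nearest bound.
pub fn publication_delay_secs(uniform_draw: f64) -> u64 {
    let u = uniform_draw.clamp(0.0, 1.0);
    let span = (MAX_DELAY_SECS - MIN_DELAY_SECS) as f64;
    MIN_DELAY_SECS + (u * span) as u64
}

/// Returns true once a keypair has reached the 90-day rotation period.
pub fn rotation_due(key_age_days: u64) -> bool {
    key_age_days >= ROTATION_PERIOD_DAYS
}
```

A draw of 0.5 yields a 26-hour hold, the midpoint of the range. The quality of the delay as a defence depends entirely on the randomness source, which is outside this sketch.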

What an adversary can and cannot infer

An adversary with complete, continuous access to the published dataset, up to and including a government intelligence service with full BGP visibility and ISP cooperation, can draw only the following inferences from the dataset itself:

Can determine	Cannot determine
Which AS the probe is in (by design — this is the unit of analysis)	The probe's IP address or subnet
The country (derived from the same IP lookup, same disclosure)	The operator's identity, name, or location within the country
That measurements happened within a given hour window	The exact minute or second a measurement was taken
That a specific `probe_id` was active during an interval	Whether two `probe_id`s belong to the same operator (no linking identifier in the published dataset)
Network type (residential / mobile / datacenter)	Whether a `probe_id` before and after a 90-day rotation is the same physical device

The AS disclosure is intentional and unavoidable: the dataset's analytical value depends entirely on knowing which ISP is applying a block. The country disclosure follows from the AS. These two fields are the unit of analysis, not incidental leakage. Every other identity-adjacent field is either suppressed or transformed to remove identifying content before the record leaves the internal pipeline.

Takedown response policy

If a government authority issues a legal demand for probe operator identification associated with a specific probe_id or AS, the response is constrained by what Voidly actually holds:

Voidly has no IP-to-probe_id mapping stored anywhere. The mapping is never persisted: the source IP is consumed synchronously for rate limiting and GeoIP lookup and then discarded before any async write executes. A court order for “the IP address associated with probe_id X” cannot be fulfilled because that information does not exist.

Voidly has no operator identity associated with any probe_id. The key store that maps probe_id to public key has no foreign keys to any operator or user table. Even a full database disclosure would reveal only public keys, material that is non-secret by design and that the probe already presents to the control server when it registers.

The codename system means that even Voidly cannot voluntarily re-identify an operator from a probe_id without the operator's cooperation. The codename table and the probe_id table are separate schemas with no join path exposed to any Voidly role short of a multi-step administrative procedure that requires explicit logging and audit. This is a deliberate design choice that functions as legal engineering as much as technical engineering: the strongest protection against a coerced disclosure is ensuring there is nothing to disclose.

This policy is consistent with the approach taken by other internet freedom infrastructure operators — Tor, Signal, and several OONI contributors — where the goal is not to resist legal demands through litigation but to make the demanded data structurally unavailable. Litigation can be lost. Data that was never collected cannot be compelled.

For the vantage selection strategy that determines which ASNs are prioritized and how operator safety shapes coverage decisions: Voidly probe vantage selection: ASN diversity, operator safety, and reaching hard-to-measure countries →

For the probe architecture that generates and stores keys on-device and routes all traffic through WireGuard: The Voidly probe: Tauri + boringtun network measurement at the operator's edge →

For how the measurement dataset is structured after anonymization passes through the pipeline: The Voidly measurement dataset: field-by-field schema reference →

For the probe commissioning process where operators register and their public keys are enrolled: Voidly probe commissioning: how a new operator joins the censorship measurement network →