Technical writing

Voidly's real-time corroboration engine: fetching, aligning, and merging OONI, CensoredPlanet, and IODA data

February 22, 2025 · 8 min read · AI Analytics

Censorship · Voidly · Infrastructure · Data engineering

The confidence tier system requires external corroboration at two transition points. Moving from Anomaly to Corroborated requires agreement from at least one external source. Moving from Corroborated to Verified requires a sustained detection pattern plus cross-source confirmation from two or more independent sources. The conceptual description of what those tiers mean — and why independence weighting matters for precision — is in the confidence tiers article. This article covers the engineering reality of how corroboration actually happens.

The problem is that OONI, CensoredPlanet, and IODA don't publish data on the same schedule. OONI processes in batches: a probe measurement typically appears in the API 10–60 minutes after it was taken, sometimes longer. CensoredPlanet publishes once per day as a bulk dump on Google Cloud Storage. IODA provides hourly BGP and active probing signals with a 15-minute processing delay. When a Voidly probe flags an anomaly, the corroboration engine can't simply query all three sources and join the results — only one of them will have data immediately.

Three data sources, three access patterns

OONI

The OONI public REST API lives at https://api.ooni.io/api/v1/measurements. Queries accept country_code, domain, and a test_start_time range, and return JSON with per-measurement records including anomaly classification, blocking type, and probe ASN. Latency from probe measurement to API availability is typically 10–60 minutes; OONI processes measurements in batches rather than streaming them individually. The unauthenticated rate limit is 100 requests per minute.
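A minimal sketch of how such a query could be assembled for one country, domain, and time window. The parameter names `probe_cc`, `since`, `until`, and `limit` are assumptions about how the live API spells the country and test_start_time filters described above, so check them against the OONI API documentation. The helper name `build_ooni_query_url` is hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

OONI_MEASUREMENTS_URL = "https://api.ooni.io/api/v1/measurements"


def build_ooni_query_url(country_code, domain, window_start, window_end, limit=100):
    """Build a measurements query URL for one country/domain/time window."""
    params = {
        "probe_cc": country_code,  # assumed parameter name for the country filter
        "domain": domain,
        # assumed names for the lower/upper bound on test_start_time
        "since": window_start.strftime("%Y-%m-%dT%H:%M:%S"),
        "until": window_end.strftime("%Y-%m-%dT%H:%M:%S"),
        "limit": limit,
    }
    return f"{OONI_MEASUREMENTS_URL}?{urlencode(params)}"


url = build_ooni_query_url(
    "XX",
    "news.example.com",
    datetime(2025, 2, 22, 8, 0, tzinfo=timezone.utc),
    datetime(2025, 2, 22, 16, 0, tzinfo=timezone.utc),
)
print(url)
```

Building the URL separately from sending it keeps the request easy to log and to count against the 100 requests per minute limit.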

CensoredPlanet

CensoredPlanet publishes daily bulk dumps on GCS at gs://censoredplanet-data/results/, updated once per day at approximately 08:00 UTC. There is no queryable API — to use CP data, you must download the day's dump and index it locally. The format is gzipped CSV, one file per test type (Quack, Satellite, Hyperquack). A full day's dump is approximately 2–4 GB compressed. This access pattern means CP data is never available in real-time: at any given moment, the freshest available CP data is from the previous day.

IODA

The IODA REST API at https://api.ioda.caida.org/v2/ exposes the /signals endpoint, which returns country-level BGP withdrawal and active probing signals for a specified time range. Updates are hourly with a 15-minute processing delay — meaning IODA is the fastest of the three external sources to reflect a censorship or outage event, but it operates at country granularity rather than per-domain. IODA is most useful for detecting full outages and BGP-level routing changes, not for confirming domain-specific blocking.

The alignment challenge

The timing mismatch between sources is the central engineering problem. For a Voidly probe anomaly flagged at T+0, the data availability looks like this:

T+0:    Voidly probe sees anomaly
T+5m:   Voidly classifier flags it (Anomaly tier)
T+15m:  IODA BGP signal available at the earliest (hourly cadence, 15-minute delay)
T+60m:  OONI measurement typically available via API
T+24h:  CensoredPlanet daily dump includes today's data

The corroboration engine has to work around this structure rather than against it. The approach is a four-part strategy: (a) immediately query IODA for the BGP signal because it's the fastest to reflect network-level events; (b) poll OONI with retries for up to 6 hours, since OONI will eventually have per-domain data even if it takes an hour or more; (c) use yesterday's CensoredPlanet dump for prior-day context during the initial check; and (d) run a retroactive nightly reprocessing pass once today's CP dump is available.
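The timeline can be expressed as a small lookup that answers "which sources could plausibly have data yet?" for a given time since detection. This is an illustrative sketch. The delays are the typical figures from the timeline above, and the names `EARLIEST_AVAILABILITY` and `sources_with_fresh_data` are hypothetical.

```python
from datetime import timedelta

# Typical earliest availability per source, taken from the timeline above.
EARLIEST_AVAILABILITY = {
    "ioda": timedelta(minutes=15),
    "ooni": timedelta(minutes=60),
    "censoredplanet": timedelta(hours=24),
}


def sources_with_fresh_data(elapsed):
    """Return the sources expected to cover an event `elapsed` after detection."""
    return [name for name, delay in EARLIEST_AVAILABILITY.items() if elapsed >= delay]


print(sources_with_fresh_data(timedelta(minutes=20)))
print(sources_with_fresh_data(timedelta(hours=2)))
```

Twenty minutes in, only IODA can be expected to respond with relevant data, which is why the initial tier assessment leans on it.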

The CorroborationEngine struct and lifecycle

The engine is implemented in Rust and runs as a long-lived async task alongside the main Voidly probe ingest pipeline. Its state is structured around two persistent API clients (OONI and IODA), an in-memory index for the most recently loaded CensoredPlanet dump, a TimescaleDB handle for persisted results, and a DashMap cache for in-progress corroboration results:

pub struct CorroborationEngine {
    ooni_client: Arc<OoniApiClient>,
    ioda_client: Arc<IodaApiClient>,
    cp_index: Arc<RwLock<CensoredPlanetDailyIndex>>,   // in-memory index of the most recently loaded CP dump
    result_store: Arc<TimescaleDb>,
    corroboration_cache: Arc<DashMap<IncidentKey, CorroborationResult>>,
}

pub struct CorroborationResult {
    pub incident_key: IncidentKey,        // (country_code, domain, interference_type)
    pub ooni_corroborated: bool,
    pub ooni_first_seen: Option<DateTime<Utc>>,
    pub cp_corroborated: bool,
    pub cp_first_seen: Option<DateTime<Utc>>,
    pub ioda_corroborated: bool,
    pub ioda_first_seen: Option<DateTime<Utc>>,
    pub corroboration_score: f32,
    pub last_checked: DateTime<Utc>,
    pub next_check: DateTime<Utc>,        // adaptive: sooner if not yet corroborated
}

The IncidentKey is a tuple of (country_code, domain, interference_type) — it identifies the specific blocking event rather than an individual probe measurement. Multiple probe measurements from the same country and domain within a rolling window collapse into a single IncidentKey for corroboration purposes.
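The collapsing step can be sketched as a grouping over recent measurements. This is an illustration only. The one-hour window length, the dictionary field names, and the function name `collapse_measurements` are assumptions, since the article does not state them.

```python
from collections import namedtuple
from datetime import datetime, timedelta, timezone

IncidentKey = namedtuple("IncidentKey", "country_code domain interference_type")

ROLLING_WINDOW = timedelta(hours=1)  # hypothetical window length


def collapse_measurements(measurements, now):
    """Group measurements inside the rolling window by their incident key."""
    incidents = {}
    for m in measurements:
        if now - m["measured_at"] > ROLLING_WINDOW:
            continue  # too old to belong to the current incident
        key = IncidentKey(m["country_code"], m["domain"], m["interference_type"])
        incidents.setdefault(key, []).append(m)
    return incidents


now = datetime(2025, 2, 22, 12, 0, tzinfo=timezone.utc)
measurements = [
    {"country_code": "XX", "domain": "news.example.com",
     "interference_type": "dns_tampering", "measured_at": now - timedelta(minutes=10)},
    {"country_code": "XX", "domain": "news.example.com",
     "interference_type": "dns_tampering", "measured_at": now - timedelta(minutes=5)},
    {"country_code": "XX", "domain": "news.example.com",
     "interference_type": "dns_tampering", "measured_at": now - timedelta(hours=3)},
    {"country_code": "XX", "domain": "chat.example.org",
     "interference_type": "tcp_reset", "measured_at": now - timedelta(minutes=2)},
]
incidents = collapse_measurements(measurements, now)
```

Here four probe measurements reduce to two incident keys, so the external sources are queried twice rather than four times.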

The corroboration_cache is a DashMap (a concurrent HashMap) that holds in-progress results between polling cycles. When corroboration is complete — or when the incident ages past the 7-day retroactive window — the result is persisted to TimescaleDB and evicted from the cache.

Parallel async fetch: the core real-time logic

When a new Anomaly-tier measurement arrives, the engine calls check_corroboration, which fires all three source queries concurrently using tokio::join! with individual timeouts per source. The CP lookup is in-memory and needs no timeout; OONI and IODA each have their own deadline:

use std::time::Duration as StdDuration;

use chrono::{DateTime, Duration, Utc};
use tokio::time::timeout;

async fn check_corroboration(
    &self,
    key: &IncidentKey,
    measurement_time: DateTime<Utc>,
) -> CorroborationResult {
    // Look for external evidence within 4 hours either side of the probe measurement.
    let window_start = measurement_time - Duration::hours(4);
    let window_end   = measurement_time + Duration::hours(4);

    // All three lookups run concurrently; OONI and IODA each get their own deadline.
    let (ooni_result, cp_result, ioda_result) = tokio::join!(
        timeout(
            StdDuration::from_secs(10),
            self.ooni_client.query_measurements(
                &key.country_code, &key.domain, window_start, window_end,
            ),
        ),
        async {
            // in-memory, no timeout needed
            let index = self.cp_index.read().await;
            index.query(&key.country_code, &key.domain, window_start, window_end)
        },
        timeout(
            StdDuration::from_secs(8),
            self.ioda_client.query_signals(
                &key.country_code, window_start, window_end,
            ),
        ),
    );

    let ooni_corroborated = ooni_result
        .ok().flatten()
        .map(|r| r.anomaly_count > 0)
        .unwrap_or(false);

    let ioda_corroborated = ioda_result
        .ok().flatten()
        .map(|r| r.bgp_outage_score > 0.3 || r.active_probe_reachability < 0.7)
        .unwrap_or(false);

    let cp_corroborated = cp_result
        .map(|r| r.blocked_count > 0)
        .unwrap_or(false);

    // ...compute corroboration_score using independence weights...
}

A timeout on OONI or IODA produces false via .ok().flatten().map(...).unwrap_or(false) — a failed external query is treated as a non-corroboration rather than an error that aborts the check. This is intentional: if OONI's API is slow or rate-limited, the engine still records the IODA and CP results and schedules a retry. A partial corroboration result is more useful than no result.
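The same "timeout means not corroborated" behaviour can be illustrated outside Rust. The sketch below uses Python's asyncio with stand-in coroutines instead of real API clients. The function names and the shortened deadlines are hypothetical, chosen only so the example runs quickly.

```python
import asyncio


async def with_timeout(coro, seconds):
    """Return the query result, or None if the source fails or times out."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except (asyncio.TimeoutError, ConnectionError):
        return None


async def slow_ooni():
    await asyncio.sleep(0.2)  # stand-in for a slow or rate-limited OONI response
    return {"anomaly_count": 3}


async def fast_ioda():
    await asyncio.sleep(0.01)
    return {"bgp_outage_score": 0.4, "active_probe_reachability": 0.9}


async def check():
    # Both queries run concurrently; each has its own deadline.
    ooni, ioda = await asyncio.gather(
        with_timeout(slow_ooni(), 0.05),
        with_timeout(fast_ioda(), 0.5),
    )
    ooni_corroborated = bool(ooni and ooni["anomaly_count"] > 0)
    ioda_corroborated = bool(
        ioda
        and (ioda["bgp_outage_score"] > 0.3 or ioda["active_probe_reachability"] < 0.7)
    )
    return ooni_corroborated, ioda_corroborated


result = asyncio.run(check())
print(result)
```

The OONI stand-in misses its deadline and is recorded as not corroborated, while the IODA result is still captured. That is the partial-result behaviour described above.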

The IODA thresholds — bgp_outage_score > 0.3 or active_probe_reachability < 0.7 — reflect a deliberate asymmetry: IODA operates at country granularity, so a modest BGP signal is meaningful even when Voidly's probe is seeing domain-specific interference. A Voidly probe flagging DNS tampering for news.example.com in the same country where IODA sees a 0.4 BGP outage score is a significantly stronger signal than the probe measurement alone.

The CensoredPlanet daily index

Because CensoredPlanet only publishes bulk dumps, the engine downloads the previous day's dump at 09:00 UTC each morning — one hour after CP publishes — parses it, and loads it into an in-memory index. The index is a HashMap<(String, String), Vec<CpMeasurement>> keyed on (country_code, domain). Lookups during check_corroboration are sub-millisecond.

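use async_compression::tokio::bufread::GzipDecoder;
use futures::TryStreamExt;
use tokio_util::io::StreamReader;

async fn load_cp_daily_dump(date: NaiveDate) -> Result<CensoredPlanetDailyIndex> {
    let dump_url = format!(
        "https://storage.googleapis.com/censoredplanet-data/results/{}/cp-results-{}.tar.gz",
        date.format("%Y/%m/%d"),
        date,
    );

    // Stream the body through gzip decompression rather than buffering the whole dump.
    // Assumes the async-compression and tokio-util crates for the streaming adapters.
    let response = reqwest::get(&dump_url).await?.error_for_status()?;
    let byte_stream = response.bytes_stream().map_err(std::io::Error::other);
    let mut decoder = GzipDecoder::new(StreamReader::new(byte_stream));
    let mut index: HashMap<(String, String), Vec<CpMeasurement>> = HashMap::new();

    // parse_next_record yields one parsed row at a time until the stream is exhausted.
    while let Some(record) = parse_next_record(&mut decoder).await? {
        index
            .entry((record.country_code.clone(), record.domain.clone()))
            .or_insert_with(Vec::new)
            .push(record);
    }

    Ok(CensoredPlanetDailyIndex { index, date })
}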

The download is 2–4 GB compressed; parsing takes 8–12 minutes on the dedicated thread pool that handles the streaming decompression and CSV parsing. During the load window, the previous index remains live — the RwLock<CensoredPlanetDailyIndex> is only write-locked for the swap at the end, which takes microseconds once the new index object is ready in memory. Corroboration checks that run during the 8–12 minute parse window use yesterday's index, which is the same data they would have used anyway.
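The build-then-swap pattern is independent of Rust. A minimal sketch follows, using a plain `threading.Lock` as a stand-in for the RwLock because Python's standard library has no reader-writer lock. The class and function names are hypothetical.

```python
import threading


def build_index(records):
    """Build a (country_code, domain) -> [records] index without touching the live one."""
    index = {}
    for r in records:
        index.setdefault((r["country_code"], r["domain"]), []).append(r)
    return index


class DailyIndexHolder:
    def __init__(self, index):
        self._lock = threading.Lock()
        self._index = index

    def query(self, country_code, domain):
        # Hold the lock only long enough to grab the current index reference.
        with self._lock:
            index = self._index
        return index.get((country_code, domain), [])

    def swap(self, new_index):
        # The slow parse happened elsewhere; the swap itself is one assignment.
        with self._lock:
            self._index = new_index


holder = DailyIndexHolder(build_index(
    [{"country_code": "XX", "domain": "news.example.com", "blocked": False}]
))
before = holder.query("XX", "news.example.com")

new_index = build_index(
    [{"country_code": "XX", "domain": "news.example.com", "blocked": True}]
)
holder.swap(new_index)
after = holder.query("XX", "news.example.com")
```

Readers never see a half-built index: until `swap` runs they get the old dump, and afterwards the new one.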

One practical consequence: for an incident that occurred between yesterday's CP measurement and today's, the CP index will not reflect it until tomorrow morning. This is why the CP corroboration result is treated as a latency-tolerant signal rather than a blocking requirement for tier promotion.

Adaptive retry schedule for OONI

OONI data becomes available with variable latency depending on probe load and batch processing cadence. Rather than polling at a fixed interval, the engine schedules retries at increasing delays, balancing freshness against API rate limits:

First check: T+15 minutes (near the fast end of OONI's 10–60 minute availability range)
Second check: T+60 minutes
Third check: T+3 hours
Fourth check: T+6 hours (final for real-time corroboration)
If not yet corroborated after the fourth check: mark as pending_retroactive and defer to the nightly reprocessing job
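The schedule above maps naturally onto a lookup from "checks already done" to the next check time. A minimal sketch, where the function name `next_ooni_check` and the use of `None` to signal deferral to the nightly job are assumptions:

```python
from datetime import datetime, timedelta, timezone

# Offsets from first detection, copied from the schedule above.
OONI_CHECK_OFFSETS = [
    timedelta(minutes=15),
    timedelta(minutes=60),
    timedelta(hours=3),
    timedelta(hours=6),
]


def next_ooni_check(first_seen, checks_done):
    """Return when the next OONI check is due, or None once real-time checks are exhausted."""
    if checks_done >= len(OONI_CHECK_OFFSETS):
        return None  # caller marks the incident pending_retroactive
    return first_seen + OONI_CHECK_OFFSETS[checks_done]


first_seen = datetime(2025, 2, 22, 12, 0, tzinfo=timezone.utc)
print(next_ooni_check(first_seen, 0))  # first check
print(next_ooni_check(first_seen, 4))  # schedule exhausted
```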

The next_check field in CorroborationResult encodes this schedule. When the engine's background poller wakes up each minute, it scans the DashMap for entries where next_check <= Utc::now() and fires corroboration checks for those keys. Once a key is fully corroborated — all three sources have been checked and at least one has confirmed — the poller stops scheduling retries for it and persists the final result.
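The poller's per-cycle scan amounts to a linear filter over the cache. The sketch below uses a plain dict in place of the DashMap, and the entry layout is a simplified assumption that keeps only the next_check field.

```python
from datetime import datetime, timedelta, timezone


def due_keys(cache, now):
    """Return the incident keys whose next_check time has arrived."""
    return [key for key, result in cache.items() if result["next_check"] <= now]


now = datetime(2025, 2, 22, 12, 0, tzinfo=timezone.utc)
cache = {
    ("XX", "news.example.com", "dns_tampering"): {"next_check": now - timedelta(minutes=1)},
    ("XX", "chat.example.org", "tcp_reset"): {"next_check": now + timedelta(minutes=44)},
}
due = due_keys(cache, now)
```

Only the first key is due in this cycle. The second is picked up by a later wake-up once its next_check has passed.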

The adaptive schedule means that an incident corroborated by IODA alone within the first 15 minutes (common for BGP-level outages) reaches the Corroborated tier without waiting on OONI. The engine still performs the T+60 minute and T+3 hour OONI checks to collect per-domain evidence, but a failed OONI check no longer changes the tier; it only affects how complete the corroboration record is.

Retroactive nightly reprocessing

Every night at 01:00 UTC, a separate job re-runs corroboration for all incidents from the past 7 days that meet any of the following conditions:

Were not corroborated at real-time check time (marked pending_retroactive)
Used yesterday's CP dump — today's dump is now available
Had IODA data that may have been updated with retroactive corrections
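The three selection conditions can be sketched as a single predicate. The incident field names (`status`, `cp_dump_date`, `ioda_first_seen`) and the function name are hypothetical. The third condition is approximated as "the incident has any IODA data", because the article does not say how corrected IODA data is detected.

```python
from datetime import date, datetime, timedelta, timezone

RETROACTIVE_WINDOW = timedelta(days=7)


def needs_reprocessing(incident, now, latest_cp_dump_date):
    """Decide whether the nightly job should re-run corroboration for an incident."""
    if now - incident["first_seen_at"] > RETROACTIVE_WINDOW:
        return False  # outside the 7-day retroactive window
    return (
        incident["status"] == "pending_retroactive"
        or incident["cp_dump_date"] < latest_cp_dump_date  # a newer CP dump exists
        or incident["ioda_first_seen"] is not None         # IODA may have been corrected
    )


now = datetime(2025, 2, 23, 1, 0, tzinfo=timezone.utc)
pending = {
    "first_seen_at": now - timedelta(hours=5),
    "status": "pending_retroactive",
    "cp_dump_date": date(2025, 2, 22),
    "ioda_first_seen": None,
}
stale = dict(pending, first_seen_at=now - timedelta(days=9))
```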

The retroactive pass frequently promotes Anomaly-tier incidents to Corroborated 24–48 hours after initial detection, especially for incidents that occurred during CensoredPlanet's measurement cycle. An incident flagged at 20:00 UTC on a given day will have no CP data available during real-time corroboration, but by the following morning's retroactive pass, the previous day's CP dump is loaded and the check can run.

Retroactive promotion affects the timestamps in the dataset: an incident may show a corroborated_at time that is 12–36 hours after first_seen_at. Consumers who use the dataset for time-sensitive monitoring should filter on first_seen_at rather than corroborated_at to avoid the appearance of a lag in detection.

The independence weight table

When multiple sources agree on an incident, the corroboration score is not a simple count of agreeing sources. Sources with correlated measurement methodologies provide less independent evidence than sources using fundamentally different techniques. OONI and CensoredPlanet both run active probes against target URLs, so their measurements are more correlated than either is with IODA's BGP and active probing signals, which measure network reachability rather than URL-level responses.

The weight table is stored in Redis for fast lookup during corroboration scoring. The scoring function combines the pairwise weights as a noisy-OR: each weight is treated as the probability that a pair of agreeing sources provides independent confirmation, and the score is the probability that at least one pair does:

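from functools import reduce
from itertools import combinations

# Stored in Redis KV for fast lookup during corroboration scoring
INDEPENDENCE_WEIGHTS = {
    ('voidly', 'ooni'):           0.80,
    ('voidly', 'censoredplanet'): 0.75,
    ('voidly', 'ioda'):           0.95,
    ('ooni', 'censoredplanet'):   0.70,
    ('ooni', 'ioda'):             0.90,
    ('censoredplanet', 'ioda'):   0.90,
}

# Index by unordered pair so lookups do not depend on the order of the two names.
_PAIR_WEIGHTS = {frozenset(pair): w for pair, w in INDEPENDENCE_WEIGHTS.items()}

def compute_corroboration_score(agreeing_sources: list[str]) -> float:
    # Fewer than two agreeing sources: no pair to weigh, so return the fixed 0.60 score.
    if len(agreeing_sources) <= 1:
        return 0.60
    pairs = combinations(agreeing_sources, 2)
    # Pairs missing from the table fall back to a default weight of 0.80.
    weights = [_PAIR_WEIGHTS.get(frozenset(pair), 0.80) for pair in pairs]
    # Noisy-OR: 1 minus the probability that no pair provides independent confirmation.
    return round(1.0 - reduce(lambda acc, w: acc * (1 - w), weights, 1.0), 3)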

The ooni/censoredplanet pair has the lowest weight (0.70) because both use HTTP/HTTPS probing against the same target URLs: if one sees a blockpage response, the other is likely to as well, regardless of whether censorship is real or a measurement artifact. The pairs involving IODA have the highest weights (0.95 for voidly/ioda, 0.90 for ooni/ioda and censoredplanet/ioda) because IODA's BGP signal is structurally independent from active URL probing. A BGP withdrawal detected by IODA is evidence of network-layer interference that could not be caused by the same artifacts that affect HTTP probes.
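A worked example makes the effect of the weights concrete. The sketch below recomputes the noisy-OR formula directly from the table values for two three-source combinations. The helper name `noisy_or` is hypothetical and the block is illustrative, not the production scorer.

```python
from itertools import combinations
from math import prod

# Pairwise weights copied from the table above, keyed by unordered pair.
WEIGHTS = {
    frozenset(("voidly", "ooni")): 0.80,
    frozenset(("voidly", "censoredplanet")): 0.75,
    frozenset(("voidly", "ioda")): 0.95,
    frozenset(("ooni", "censoredplanet")): 0.70,
    frozenset(("ooni", "ioda")): 0.90,
    frozenset(("censoredplanet", "ioda")): 0.90,
}


def noisy_or(sources):
    """1 minus the product of (1 - weight) over every pair of agreeing sources."""
    pair_weights = [WEIGHTS[frozenset(p)] for p in combinations(sources, 2)]
    return round(1.0 - prod(1.0 - w for w in pair_weights), 3)


with_ioda = noisy_or(["voidly", "ooni", "ioda"])            # 1 - 0.20 * 0.05 * 0.10
with_cp = noisy_or(["voidly", "ooni", "censoredplanet"])    # 1 - 0.20 * 0.25 * 0.30
print(with_ioda, with_cp)
```

Under these weights, Voidly plus OONI plus IODA scores 0.999, while Voidly plus OONI plus CensoredPlanet scores 0.985. The same number of sources agree in both cases, but the more independent third source contributes more.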

For the theoretical background on why independence weighting outperforms source counting at high precision targets, and for the full derivation of how these weights translate to the Verified tier's 99% precision claim, see the cross-source verification methodology article.

Performance characteristics

A real-time corroboration check typically completes in 2–8 seconds. OONI API latency dominates: the OONI query takes 1–6 seconds depending on query result size and server load. IODA typically responds in under 800 milliseconds. The CP index lookup is in-memory and completes in under 1 millisecond.

During normal operation, the engine processes approximately 800 corroboration checks per hour. During major geopolitical events — elections, protests, or large-scale government-ordered shutdowns — the rate of new Anomaly-tier detections spikes sharply: the engine has handled 4,000+ checks per hour during peak events. At those rates, OONI's 100 req/min unauthenticated limit would be a bottleneck if all checks attempted OONI first. The adaptive scheduling helps here: the engine queues OONI checks at T+15 minutes rather than immediately, spreading the OONI request load across time and relying on IODA's faster response for the initial tier assessment.
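For scale, 4,000 checks per hour averages about 67 per minute, so a single OONI query per check fits under the 100 requests per minute limit only if retries and bursts are spread out. One way to enforce that budget on the client side is a sliding-window limiter. The sketch below is an illustration of the general technique, not the engine's actual mechanism, and the class name is hypothetical.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests in any window_seconds-long interval."""

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._sent = deque()  # timestamps of requests still inside the window

    def try_acquire(self, now):
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) >= self.max_requests:
            return False  # caller should delay or reschedule the OONI check
        self._sent.append(now)
        return True


limiter = SlidingWindowLimiter()
burst = [limiter.try_acquire(now=0.0) for _ in range(101)]
later = limiter.try_acquire(now=60.0)
```

Passing the clock in as `now` keeps the limiter deterministic in tests. A production version would read a monotonic clock instead.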

The DashMap cache keeps corroboration state in memory across polling cycles without locking pressure. Benchmarks on the production instance show the poller's per-cycle scan (scanning all entries to find those with next_check <= now()) taking under 5 milliseconds even with 50,000 active entries during high-activity periods.

For the conceptual explanation of the confidence tier system this engine supports — what Anomaly, Corroborated, and Verified mean and why external confirmation is required: From anomaly to verified incident: the Voidly confidence tier system →

For how corroboration scores feed the TimescaleDB continuous aggregates and the country-level censorship index: Voidly's measurement database: 2.2B probe results in TimescaleDB →

For the MCP server tools that expose corroboration_score and per-source corroboration fields to agent queries: The Voidly MCP server: 83 censorship query tools for Claude and GPT →

For the cross-source verification methodology — data format normalization across OONI, CensoredPlanet, and IODA and the independence weighting theory: Cross-source censorship verification: reconciling OONI, CensoredPlanet, and IODA →