
Building Voidly's classifier training dataset from OONI: ingestion, alignment, and label generation

December 15, 2024 · 8 min read · AI Analytics

Tags: Censorship, Voidly, ML, Methodology

The previous post in this series described the Voidly anomaly classifier's architecture: five per-class XGBoost binary models, recall-optimized thresholds, and a confidence scoring layer that feeds the cross-source reconciler. That post treated the training data as a given. This post covers where it actually comes from.

Supervised censorship detection has a labeling problem. You cannot label training data from Voidly's own probe outputs without introducing circularity — the model would learn to reproduce whatever heuristics generated the labels rather than learning the underlying feature signatures of interference. The labels must come from somewhere independent. OONI has 12+ years of historical web connectivity measurements across 200+ countries, a corpus of manually reviewed confirmed-block flags, and a publicly accessible S3 archive. It is the primary external ground truth source for the Voidly training pipeline.

Why OONI is the primary label source

OONI Explorer's is_confirmed flags represent manually reviewed events: a curator has inspected the measurement, verified it corresponds to a genuine block, and marked it accordingly. These are high-confidence ground truth. The false positive rate on confirmed measurements is approximately 0.3% — almost all due to CDN geoblocking that OONI's curators occasionally mistake for censorship when the geoblock targets the same IP ranges a national censor would.

OONI also exposes an anomaly flag on every measurement, set by its own statistical heuristics. This flag is noisier — it fires on infrastructure failures, transient routing issues, and CDN geographic splits, not only on genuine interference. But it fires at much higher volume than confirmed flags and covers countries and targets where manual curation has not yet caught up. Both signals are useful; neither is sufficient alone.

The coverage gap is the structural limitation. OONI probe deployments concentrate where volunteer operators exist: Western Europe, urban South-East Asia, parts of Latin America. Many authoritarian states have sparse OONI coverage. Ethiopia generates roughly 12 OONI web_connectivity measurements per month. Turkmenistan generates fewer than 5. North Korea has none. For countries like these, OONI cannot be the primary label source regardless of its quality.

OONI ingestion pipeline

OONI publishes daily JSONL dumps to a public S3 bucket (ooni-data-eu-fra), partitioned by date and test name. Each file contains one JSON object per measurement. The compressed dumps total approximately 40 GB per day across all test types; web_connectivity measurements alone account for roughly 22 GB.

The ingestion job runs nightly at 02:00 UTC and processes the previous day's dump. Each measurement line is parsed into an OoniMeasurement dataclass:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class OoniMeasurement:
    measurement_uid: str
    test_name: str                    # 'web_connectivity', 'http_requests', etc.
    country_code: str
    domain: str
    measurement_start_time: datetime
    is_confirmed: bool
    anomaly: bool
    failure: Optional[str]            # e.g. 'dns_nxdomain_error', 'connection_reset'
    resolver_ip: Optional[str]
    resolver_asn: Optional[int]

Two filters apply before a measurement enters the alignment stage. First, only test_name = 'web_connectivity' measurements are retained; this test type accounts for 95.3% of censorship-relevant measurements in the corpus. HTTP-request-only tests and legacy test types lack the control comparison data needed for the label functions. Second, records are deduplicated on measurement_uid; duplicate submissions from probe retries appear in the raw dumps at a rate of approximately 1.8%.
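The two filters can be sketched in a few lines. This is a minimal illustration, not the production job: it assumes each JSONL line has already been flattened so that its keys match the OoniMeasurement field names above, and the function name filter_and_dedupe is hypothetical. Dropping records with no measurement_uid is also an assumption.

```python
import json
from typing import Iterable, Iterator


def filter_and_dedupe(lines: Iterable[str]) -> Iterator[dict]:
    """Yield web_connectivity records, keeping the first record per measurement_uid."""
    seen_uids: set[str] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        record = json.loads(line)
        # Filter 1: keep only the web_connectivity test type.
        if record.get("test_name") != "web_connectivity":
            continue
        # Filter 2: drop probe-retry duplicates (and, by assumption, records without a uid).
        uid = record.get("measurement_uid")
        if uid is None or uid in seen_uids:
            continue
        seen_uids.add(uid)
        yield record


sample_lines = [
    '{"measurement_uid": "a1", "test_name": "web_connectivity"}',
    '{"measurement_uid": "a1", "test_name": "web_connectivity"}',  # retry duplicate
    '{"measurement_uid": "b2", "test_name": "http_requests"}',     # wrong test type
    '{"measurement_uid": "c3", "test_name": "web_connectivity"}',
]
kept_uids = [r["measurement_uid"] for r in filter_and_dedupe(sample_lines)]
print(kept_uids)  # ['a1', 'c3']
```

Because the function is a generator, it can stream over a day's dump without loading it into memory; only the set of seen UIDs grows.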

Alignment with Voidly probe data

OONI measurements and Voidly probe measurements are independent observations from different vantage points. They are not guaranteed to be simultaneous. The alignment strategy coarsens the time dimension to the daily level: the alignment key is (country_code, domain, date). Any OONI measurement for the same country and domain within ±12 hours of a Voidly probe measurement's timestamp is considered aligned.

The ±12-hour window is wider than a strict date boundary to accommodate timezone differences — a Voidly probe at 23:30 local time and an OONI measurement at 01:00 UTC the next day may describe the same blocking event. The window is narrow enough that alignment does not cross between meaningfully different network conditions (blocks are typically sustained for hours to days, not minutes).
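A minimal sketch of the window check follows. It assumes both sources carry timezone-aware UTC timestamps under a hypothetical "timestamp" key and that the window boundary is inclusive; neither detail is stated in the post.

```python
from datetime import datetime, timedelta, timezone

ALIGNMENT_WINDOW = timedelta(hours=12)


def is_aligned(probe: dict, ooni: dict, window: timedelta = ALIGNMENT_WINDOW) -> bool:
    """True when both records share country and domain and fall within the window."""
    same_target = (
        probe["country_code"] == ooni["country_code"]
        and probe["domain"] == ooni["domain"]
    )
    return same_target and abs(probe["timestamp"] - ooni["timestamp"]) <= window


def aligned_measurements(probe: dict, ooni_records: list) -> list:
    """Return every OONI record that aligns with a single Voidly probe record."""
    return [o for o in ooni_records if is_aligned(probe, o)]


utc = timezone.utc
probe = {"country_code": "IR", "domain": "example.org",
         "timestamp": datetime(2024, 12, 1, 20, 30, tzinfo=utc)}
ooni_records = [
    # 4.5 hours later, across the UTC date boundary: aligned
    {"country_code": "IR", "domain": "example.org",
     "timestamp": datetime(2024, 12, 2, 1, 0, tzinfo=utc)},
    # 12.5 hours later: outside the window
    {"country_code": "IR", "domain": "example.org",
     "timestamp": datetime(2024, 12, 2, 9, 0, tzinfo=utc)},
    # Inside the window but a different country: not aligned
    {"country_code": "TR", "domain": "example.org",
     "timestamp": datetime(2024, 12, 1, 21, 0, tzinfo=utc)},
]
matches = aligned_measurements(probe, ooni_records)
print(len(matches))  # 1
```

The first OONI record shows why a strict date match would be too narrow: it falls on the next calendar day in UTC yet is only 4.5 hours from the probe.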

Alignment rate on the current corpus: 67% of Voidly probe records have at least one OONI measurement within the alignment window. The remaining 33% have no OONI coverage for that country-domain-date combination. These unaligned records require a different labeling strategy.

Five label functions

For aligned records, labeling follows the Snorkel weak supervision framework. Five label functions inspect different aspects of the aligned OONI measurements and vote independently. The label model learns each function's accuracy and pairwise correlation structure from agreement patterns, then combines the votes into a probabilistic output p_censored in [0, 1].

LF1: ooni_confirmed. Fires when is_confirmed = True. Returns a positive vote. Weight: 0.95. False positive rate: ~0.3%, almost entirely CDN geoblocking misclassified as censorship during manual review.
LF2: ooni_anomaly_no_failure. Fires when anomaly = True AND failure = null. The null-failure condition filters out measurements where OONI flagged an anomaly because the connection failed entirely — those are more likely infrastructure noise than censorship. Returns a weak positive vote. Weight: 0.60. Many anomalies at this weight are still infrastructure noise; the label model learns to discount this function when it conflicts with the others.
LF3: blockpage_hash_match. Fires when the HTTP response body SHA-256 or SimHash matches a known block-page signature in the corpus of 800+ fingerprints. Returns a positive vote. Weight: 0.97. Block-page bodies are highly distinctive; this is the highest-precision label function.
LF4: dns_injection_ip. Fires when the DNS response IP matches a known injection address: 18 IPs attributed to the Great Firewall, 3 IPs used by Iran's national filtering system, 2 IPs used by Turkey's BTK-mandated DNS blocking. Returns a positive vote. Weight: 0.92. The injection IP list is maintained in version control and updated when new IPs are confirmed through independent research.
LF5: rst_injection_timing. Fires when a TCP RST arrives in less than 15 ms from the initial SYN — faster than the round-trip to any legitimate server, consistent with in-path RST injection. Returns a positive vote. Weight: 0.88. Timing measurements are subject to probe-side clock resolution artifacts; the 15 ms threshold was chosen to stay well above clock uncertainty.

The label matrix has shape (n_measurements, 5). Each cell is 1 (positive vote), 0 (negative vote), or −1 (abstain — the function did not fire). A measurement where no label function fires at all is excluded from the training corpus; it carries no supervision signal. LF coverage — the fraction of aligned records where at least one label function fires — is 71.8%.
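The five functions and the label matrix can be sketched in plain Python. This is an illustration under several assumptions: the measurement keys body_sha256, dns_answer_ip, and rst_after_syn_ms are hypothetical names, and the signature sets hold placeholder values rather than real fingerprints. The descriptions above only specify when each function votes positive, so the sketch never emits a negative vote.

```python
CENSORED, NOT_CENSORED, ABSTAIN = 1, 0, -1

# Placeholder signature sets (hypothetical values, for illustration only).
KNOWN_BLOCKPAGE_HASHES = {"deadbeef"}
KNOWN_INJECTION_IPS = {"192.0.2.1"}  # documentation-range address
RST_THRESHOLD_MS = 15.0


def lf_ooni_confirmed(m: dict) -> int:
    return CENSORED if m.get("is_confirmed") else ABSTAIN


def lf_ooni_anomaly_no_failure(m: dict) -> int:
    return CENSORED if m.get("anomaly") and m.get("failure") is None else ABSTAIN


def lf_blockpage_hash_match(m: dict) -> int:
    return CENSORED if m.get("body_sha256") in KNOWN_BLOCKPAGE_HASHES else ABSTAIN


def lf_dns_injection_ip(m: dict) -> int:
    return CENSORED if m.get("dns_answer_ip") in KNOWN_INJECTION_IPS else ABSTAIN


def lf_rst_injection_timing(m: dict) -> int:
    rst_ms = m.get("rst_after_syn_ms")
    return CENSORED if rst_ms is not None and rst_ms < RST_THRESHOLD_MS else ABSTAIN


LABEL_FUNCTIONS = [
    lf_ooni_confirmed,
    lf_ooni_anomaly_no_failure,
    lf_blockpage_hash_match,
    lf_dns_injection_ip,
    lf_rst_injection_timing,
]


def build_label_matrix(measurements: list) -> list:
    """One row per measurement, one column per label function; all-abstain rows are dropped."""
    rows = [[lf(m) for lf in LABEL_FUNCTIONS] for m in measurements]
    return [row for row in rows if any(v != ABSTAIN for v in row)]


measurements = [
    {"is_confirmed": True, "anomaly": True, "failure": "connection_reset"},
    {"anomaly": True, "failure": None, "dns_answer_ip": "192.0.2.1"},
    {"anomaly": False, "failure": None},  # nothing fires, so the row is excluded
]
label_rows = build_label_matrix(measurements)
print(label_rows)  # [[1, -1, -1, -1, -1], [-1, 1, -1, 1, -1]]
```

Wrapping the result in a NumPy array gives the integer matrix of shape (n_measurements, 5) that Snorkel's LabelModel expects.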

from snorkel.labeling.model import LabelModel
import numpy as np

# L: label matrix, shape (n_measurements, 5)
# Rows: measurements; Columns: LF1..LF5
# Values: 1 = CENSORED, 0 = NOT_CENSORED, -1 = ABSTAIN

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(
    L_train=L,
    n_epochs=500,
    lr=0.01,
    seed=42,
)

# Probabilistic labels: p_censored in [0, 1] for each measurement
p_censored: np.ndarray = label_model.predict_proba(L)[:, 1]

Label conflicts and quality metrics

Label conflicts (records where two or more label functions fire with opposing votes) occur in 2.1% of aligned records. The generative model handles these by weighting functions according to their learned accuracy and correlation; no manual resolution is required. The p_censored output for conflicted records tends to cluster in the 0.35–0.65 range; these measurements receive lower sample weight in XGBoost training to reduce the influence of genuinely ambiguous cases.
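One way to express the down-weighting is a step function over the ambiguous band. The post does not state the actual weight applied, so the AMBIGUOUS_WEIGHT value below is an assumed placeholder and the function name is hypothetical.

```python
AMBIGUOUS_BAND = (0.35, 0.65)  # range where conflicted records tend to cluster
AMBIGUOUS_WEIGHT = 0.5         # assumed value; the real weight is not given in the post


def ambiguity_sample_weight(p_censored: float) -> float:
    """Return a reduced training weight for labels inside the ambiguous band."""
    low, high = AMBIGUOUS_BAND
    return AMBIGUOUS_WEIGHT if low <= p_censored <= high else 1.0


weights = [ambiguity_sample_weight(p) for p in (0.05, 0.40, 0.65, 0.93)]
print(weights)  # [1.0, 0.5, 0.5, 1.0]
```

A list like this can be passed as per-row sample weights when fitting the XGBoost models.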

Label quality is estimated against a manually reviewed holdout of 5,000 records:

OONI confirmed coverage: 34.2% of training records carry at least one is_confirmed = True OONI measurement
LF coverage: 71.8% of aligned records have at least one firing label function
Label conflicts: 2.1% of records, resolved by the generative model
Estimated label noise: ~4.8% on the manually reviewed holdout
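The coverage and conflict figures follow directly from the label matrix, and the noise figure from comparing labels against the manual review. The sketch below shows one plausible way to compute them on toy data. Thresholding p_censored at 0.5 to obtain a hard label is an assumption; the post does not say how the noise estimate is binarized.

```python
ABSTAIN = -1


def lf_coverage(label_matrix: list) -> float:
    """Fraction of rows where at least one label function fires."""
    fired = sum(1 for row in label_matrix if any(v != ABSTAIN for v in row))
    return fired / len(label_matrix)


def conflict_rate(label_matrix: list) -> float:
    """Fraction of rows containing both a positive (1) and a negative (0) vote."""
    conflicted = sum(1 for row in label_matrix if 1 in row and 0 in row)
    return conflicted / len(label_matrix)


def label_noise(p_censored: list, manual_labels: list, threshold: float = 0.5) -> float:
    """Fraction of holdout records whose thresholded label disagrees with manual review."""
    disagreements = sum(
        1 for p, y in zip(p_censored, manual_labels) if (p >= threshold) != bool(y)
    )
    return disagreements / len(manual_labels)


toy_matrix = [
    [1, -1, -1, -1, -1],
    [-1, 1, 0, -1, -1],    # opposing votes: a conflict
    [-1, -1, -1, -1, -1],  # nothing fires
    [-1, 1, -1, 1, -1],
]
print(lf_coverage(toy_matrix))    # 0.75
print(conflict_rate(toy_matrix))  # 0.25
print(label_noise([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 0]))  # 0.25
```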

Coverage gap handling

For the 33% of Voidly probe records with no OONI alignment, two strategies apply depending on the country's OONI measurement volume.

For countries with at least 100 OONI measurements per month, the unaligned records receive pseudo-labels derived from a previous version of the Voidly anomaly classifier itself — label distillation. The previous model's output probability becomes the training label for the unaligned record. Pseudo-labeled records are flagged with label_source = 'distilled' and receive half the sample weight of OONI-derived labels in XGBoost training, reflecting their lower epistemic authority.

For countries with fewer than 100 OONI measurements per month, pseudo-labels are supplemented with CensoredPlanet data via the same alignment procedure: CP measurements for the same country-domain-date key are ingested, run through the same five label functions where applicable (LF1, LF2, LF3, LF4 fire on CP data; LF5 requires raw timing fields that CP does not expose), and the resulting p_censored estimate is used as the label.
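The routing between the two strategies can be summarized as a small decision function. This is a sketch, not the pipeline's code: only the 'distilled' label_source value and its half weight come from the post, while the other source names are hypothetical. The post gives no weight for CensoredPlanet-supplemented labels, so the sketch declines to guess one.

```python
MIN_MONTHLY_OONI = 100  # volume threshold separating the two strategies


def choose_label_source(has_ooni_alignment: bool, monthly_ooni_measurements: int) -> str:
    """Pick the labeling strategy for a single Voidly probe record."""
    if has_ooni_alignment:
        return "ooni"  # hypothetical name for directly aligned, OONI-derived labels
    if monthly_ooni_measurements >= MIN_MONTHLY_OONI:
        return "distilled"
    return "distilled+censoredplanet"  # hypothetical name for the low-volume path


def sample_weight(label_source: str) -> float:
    """Training weight relative to OONI-derived labels."""
    if label_source == "ooni":
        return 1.0
    if label_source == "distilled":
        return 0.5  # half the weight of OONI-derived labels
    raise ValueError(f"no weight stated for label source {label_source!r}")


print(choose_label_source(True, 5))     # ooni
print(choose_label_source(False, 100))  # distilled
print(choose_label_source(False, 12))   # distilled+censoredplanet
```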

Ethiopia illustrates the coverage gap at its most severe. Voidly generates approximately 3,200 probe measurements for Ethiopian targets per month. OONI generates 12. The result: 99.6% of Ethiopian training data uses pseudo-labels or CensoredPlanet-derived labels, with only 0.4% coming from direct OONI alignment. The Ethiopian classifier submodel consequently has wider calibration error bands than the Iran or China submodels and is audited more frequently for drift.

Per-interference-type label mapping

The five label functions produce a single composite p_censored score. The five-class classifier requires per-class labels — one binary label per interference type per measurement. The mapping from label function outputs to per-class labels is deterministic:

DNS tampering: dns_tamper_label = 1 when LF4 (dns_injection_ip) fires, or when LF1 fires AND the OONI failure field contains 'dns'.
TLS interference: tls_interference_label = 1 when LF5 (rst_injection_timing) fires AND the aligned OONI measurement records a tls_handshake_failure.
HTTP blocking: http_blocking_label = 1 when LF3 (blockpage_hash_match) fires. Block-page fingerprinting is specific enough that no secondary condition is required.
BGP withdrawal: bgp_withdrawal_label = 1 when an IODA BGP event overlaps with the measurement window. No OONI label function covers BGP withdrawal directly; this class is labeled entirely from IODA join, independent of the Snorkel pipeline.
Throttling: throttling_label = 0.5 (weak positive) when no label function fires but p_censored > 0.4 AND the measurement's timing features fall in the throttling-indicative range. Hard positive labels for throttling are rare in the OONI corpus, since OONI does not have a confirmed throttling flag, so the class relies on soft labels and accepts higher noise during training.
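The mapping rules above can be written as one deterministic function. The sketch assumes the upstream stages supply a set of fired label-function names plus three precomputed booleans; those parameter names are hypothetical, and the definition of the throttling-indicative timing range is left to the caller because the post does not specify it.

```python
from typing import Optional


def per_class_labels(
    fired: set,                        # names of the label functions that fired
    failure: Optional[str],            # OONI failure string, if any
    tls_handshake_failure: bool,       # aligned OONI record shows a TLS handshake failure
    ioda_bgp_overlap: bool,            # an IODA BGP event overlaps the measurement window
    p_censored: float,                 # composite score from the label model
    timing_in_throttling_range: bool,  # computed upstream; range not defined here
) -> dict:
    """Map label-function outputs to one label per interference class."""
    dns = "dns_injection_ip" in fired or (
        "ooni_confirmed" in fired and "dns" in (failure or "")
    )
    tls = "rst_injection_timing" in fired and tls_handshake_failure
    http = "blockpage_hash_match" in fired
    # Soft label: only when no function fired at all.
    throttling_soft = not fired and p_censored > 0.4 and timing_in_throttling_range
    return {
        "dns_tamper_label": float(dns),
        "tls_interference_label": float(tls),
        "http_blocking_label": float(http),
        "bgp_withdrawal_label": float(ioda_bgp_overlap),
        "throttling_label": 0.5 if throttling_soft else 0.0,
    }


confirmed_dns = per_class_labels(
    {"ooni_confirmed"}, "dns_nxdomain_error", False, False, 0.97, False
)
soft_throttle = per_class_labels(set(), None, False, False, 0.45, True)
print(confirmed_dns["dns_tamper_label"])  # 1.0
print(soft_throttle["throttling_label"])  # 0.5
```

Because the BGP label depends only on the IODA join, it can be set even when every other class label is zero.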

Dataset versioning and reproducibility

Training datasets are versioned by a tuple of three parameters: (cutoff_date, lf_versions, alignment_window_hours). Every dataset version is stored as Parquet with schema metadata embedded in a dataset_catalog table. Each model version in the model registry references a specific dataset version by content hash, making it possible to reproduce any historical model training run by checking out the referenced dataset and the corresponding training code version.

Label function versions are tracked separately because the injection IP list (LF4) and block-page corpus (LF3) are updated independently of the ingestion code. A change to the LF4 IP list increments the lf_versions component of the dataset version tuple, which invalidates any cached label matrix derived from the previous LF4 version and triggers a rerun of the label generation step before the next monthly full retrain.
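The invalidation behavior can be illustrated with a frozen dataclass whose fields mirror the version tuple. This is a sketch under assumptions: the class name, the integer per-function versions, and the version_key helper are hypothetical. The key below hashes only the version parameters, which is enough to invalidate a cached label matrix; the content hash used by the model registry would be computed over the Parquet data itself and is not shown.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class DatasetVersion:
    cutoff_date: str             # ISO date, e.g. "2024-12-01"
    lf_versions: tuple           # one version number per label function, LF1..LF5
    alignment_window_hours: int

    def version_key(self) -> str:
        """Stable key over the version parameters, usable as a cache key."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


v1 = DatasetVersion("2024-12-01", (1, 1, 1, 1, 1), 12)
# Updating the LF4 injection IP list bumps only the fourth component.
v2 = replace(v1, lf_versions=(1, 1, 1, 2, 1))

print(v1.version_key() == v2.version_key())  # False: cached label matrix is stale
```

Any change to the cutoff date or alignment window changes the key in the same way, so one comparison covers all three components of the tuple.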

For the five-class XGBoost models trained on this dataset: The Voidly anomaly classifier: five interference classes, gradient boosted trees, and why we optimize for recall →

For the broader weak supervision and training methodology: Voidly's ML training pipeline: building a labeled censorship dataset from OONI measurements →

For the weekly retrain pipeline that consumes this dataset: Voidly classifier retraining: the weekly pipeline that keeps the anomaly models current →

For how OONI, CensoredPlanet, and IODA are reconciled in real-time corroboration: Cross-source censorship verification: reconciling OONI, CensoredPlanet, and IODA →