Voidly's ML training pipeline: building a labeled censorship dataset from OONI measurements

8 min read · AI Analytics
Tags: Censorship, Voidly, ML, Data engineering

Supervised machine learning for censorship detection has a foundational problem: you need ground truth. “This measurement represents a genuine block” is a statement that requires evidence from somewhere outside the measurement itself. If you label your training data using the same signals your model will learn to recognize, you have a circular definition of correctness, not a classifier.

This post describes how we build the labeled training corpus that feeds the Voidly anomaly classifier — starting from the 200M+ measurement OONI archive, working through the weak supervision framework that generates probabilistic labels, the feature engineering pipeline that converts raw measurements into the 47-column input matrix, and the train/val/test split strategy that prevents leakage between geographically and temporally correlated observations.

The labeling problem

The OONI archive provides two useful annotation fields on every measurement: test_keys.is_confirmed (a curator has manually confirmed this is a block) and anomaly (OONI's statistical heuristic flagged this as anomalous). Neither is adequate as a training label on its own.

is_confirmed is high-precision but severely under-representative. Only approximately 2.3% of measurements that eventually prove to be genuine blocks carry a confirmed flag at the time of measurement — curation is slow, manual, and biased toward high-profile targets. Building a training set from confirmed records alone gives you a corpus that represents the tip of the iceberg and systematically misses the bulk of actual censorship.

anomaly has the opposite problem. The flag captures any statistical deviation from control measurements — including CDN geographic splits (where a CDN returns different IP addresses per region by design), transient probe connectivity issues, and misconfigured origin servers. In practice, the anomaly flag fires on genuine censorship, on network noise, and on legitimate infrastructure behaviour at roughly similar rates. Training on anomaly labels alone produces a classifier that has learned to replicate OONI's noise floor, not to identify blocks.

Voidly's approach is multi-signal weak supervision: we define a set of label functions — programmatic heuristics that vote on the likely label of each measurement — and train a label model (following the Snorkel framework) that learns the accuracy and correlation structure of those functions and produces a denoised probabilistic label. The final output for each measurement is one of three values: CENSORED (1), NOT_CENSORED (0), or ABSTAIN (−1) for measurements where the label functions disagree enough that we decline to assign a label at all.

Label functions

We use five label functions. Each inspects a different aspect of the raw OONI measurement and returns a vote. The label model learns how reliable each function is and how they correlate before combining their votes into a final label.

from enum import IntEnum
from dataclasses import dataclass

class Label(IntEnum):
    CENSORED = 1
    NOT_CENSORED = 0
    ABSTAIN = -1

# Known block-page body hashes — approximately 2,300 entries
# sourced from OONI confirmed measurements and CitizenLab research
KNOWN_BLOCKPAGE_HASHES: set[str] = load_blockpage_hashes()

@dataclass
class OoniMeasurement:
    is_confirmed: bool
    anomaly: bool
    dns_failure: str | None          # e.g. 'dns_nxdomain_error'
    dns_ip_blockpage_asn: str | None # ASN of returned IP if in block-page set
    tls_handshake_failure: str | None
    control_tls_failure: str | None  # control measurement TLS failure
    http_body_sha256: str | None
    bgp_outage_score: float          # 0.0–1.0 from IODA BGP signal

# --- Label function 1: OONI curator-confirmed block ---
def lf_ooni_confirmed(m: OoniMeasurement) -> Label:
    if m.is_confirmed:
        return Label.CENSORED
    return Label.ABSTAIN

# --- Label function 2: DNS NXDOMAIN to known block-page ASN ---
def lf_dns_nxdomain_blockpage_asn(m: OoniMeasurement) -> Label:
    if (m.dns_failure == 'dns_nxdomain_error'
            and m.dns_ip_blockpage_asn is not None):
        return Label.CENSORED
    return Label.ABSTAIN

# --- Label function 3: TLS reset with no control failure ---
def lf_tls_reset_no_control_failure(m: OoniMeasurement) -> Label:
    if (m.tls_handshake_failure == 'connection_reset'
            and m.control_tls_failure is None):
        return Label.CENSORED
    return Label.ABSTAIN

# --- Label function 4: HTTP response body matches known block page ---
def lf_http_blockpage_hash(m: OoniMeasurement) -> Label:
    if m.http_body_sha256 in KNOWN_BLOCKPAGE_HASHES:
        return Label.CENSORED
    return Label.ABSTAIN

# --- Label function 5: BGP outage corroboration ---
# High BGP outage score means the network is generally unreachable —
# likely infrastructure failure, not selective censorship
def lf_bgp_outage_corroborated(m: OoniMeasurement) -> Label:
    if m.bgp_outage_score > 0.8:
        return Label.NOT_CENSORED
    return Label.ABSTAIN


# Assemble the label matrix: shape (n_measurements, n_label_functions)
import numpy as np

LABEL_FUNCTIONS = [
    lf_ooni_confirmed,
    lf_dns_nxdomain_blockpage_asn,
    lf_tls_reset_no_control_failure,
    lf_http_blockpage_hash,
    lf_bgp_outage_corroborated,
]

def build_label_matrix(measurements: list[OoniMeasurement]) -> np.ndarray:
    return np.array(
        [[lf(m) for lf in LABEL_FUNCTIONS] for m in measurements],
        dtype=int,
    )

The label matrix is passed to the Snorkel LabelModel, which learns coverage, accuracy, and pairwise correlations for each label function, then outputs a probabilistic label for every row. Measurements where no label function fires (all ABSTAIN) are dropped from the training corpus entirely — they carry no supervision signal regardless of what the true label might be.
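The bookkeeping around the label matrix can be sketched without Snorkel. The snippet below is a minimal illustration under stated assumptions: it measures how often each label function votes, drops the all-ABSTAIN rows, and combines the remaining votes by simple majority. The majority vote is a stand-in baseline, not the learned LabelModel, and the helper names are hypothetical.

```python
import numpy as np

ABSTAIN, NOT_CENSORED, CENSORED = -1, 0, 1

def coverage(L: np.ndarray) -> np.ndarray:
    """Fraction of rows on which each label function casts a vote."""
    return (L != ABSTAIN).mean(axis=0)

def drop_unlabeled(L: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return the rows with at least one vote, plus the boolean keep-mask."""
    keep = (L != ABSTAIN).any(axis=1)
    return L[keep], keep

def majority_vote(L: np.ndarray) -> np.ndarray:
    """Baseline combiner (not Snorkel's LabelModel): ties come back as ABSTAIN."""
    pos = (L == CENSORED).sum(axis=1)
    neg = (L == NOT_CENSORED).sum(axis=1)
    return np.where(pos > neg, CENSORED,
                    np.where(neg > pos, NOT_CENSORED, ABSTAIN))

# Toy label matrix: 4 measurements x 5 label functions
L = np.array([
    [ 1, -1,  1, -1, -1],   # two CENSORED votes
    [-1, -1, -1, -1, -1],   # no votes: dropped from the corpus
    [-1, -1,  1, -1,  0],   # one vote each way: tie, left unlabeled
    [-1, -1, -1, -1,  0],   # one NOT_CENSORED vote
])
L_kept, keep = drop_unlabeled(L)
votes = majority_vote(L_kept)
```

A learned label model differs from this baseline in that it weights each function by its estimated accuracy rather than counting every vote equally.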

Feature extraction

The raw OONI measurement is a nested JSON document. The classifier expects a flat, fixed-width feature vector. We extract 47 features spanning the DNS, TCP, TLS, HTTP, and metadata layers of each measurement:

FEATURE_SCHEMA: dict[str, str] = {
    # DNS layer (10 features)
    "dns_query_ms":            "float  — resolver round-trip time in milliseconds",
    "dns_fail_nxdomain":       "bool   — NXDOMAIN response",
    "dns_fail_timeout":        "bool   — resolver timed out",
    "dns_fail_refused":        "bool   — REFUSED response code",
    "dns_fail_servfail":       "bool   — SERVFAIL response code",
    "dns_fail_other":          "bool   — any other DNS failure",
    "dns_fail_none":           "bool   — DNS succeeded",
    "dns_consistency":         "float  — 0=inconsistent, 0.5=partial, 1=consistent",
    "dns_ip_blockpage_asn":    "bool   — returned IP is in known block-page ASN set",
    "dns_resolved_ip_count":   "int    — number of IP addresses in DNS response",

    # TCP layer (4 features)
    "tcp_status_ok":           "bool   — TCP connection succeeded",
    "tcp_status_timeout":      "bool   — TCP connection timed out",
    "tcp_status_reset":        "bool   — TCP connection was reset",
    "tcp_connect_ms":          "float  — TCP handshake time in milliseconds",

    # TLS layer (6 features)
    "tls_fail_reset":          "bool   — TLS reset after ClientHello",
    "tls_fail_timeout":        "bool   — TLS handshake timed out",
    "tls_fail_other":          "bool   — other TLS failure",
    "tls_fail_none":           "bool   — TLS succeeded",
    "tls_cert_matches_control":"bool   — certificate matches control probe cert",
    "tls_interception_detected":"bool  — SNI mismatch or unexpected issuer chain",

    # HTTP layer (6 features)
    "http_fail_none":          "bool   — HTTP fetch succeeded",
    "http_fail_connection":    "bool   — connection refused or reset during HTTP",
    "http_fail_other":         "bool   — other HTTP failure",
    "http_response_status":    "int    — HTTP status code (0 if no response)",
    "http_body_length_ratio":  "float  — body length vs. control body length",
    "http_blockpage_match":    "bool   — body fingerprint matches block-page corpus",

    # Control comparison (3 features)
    "any_control_failure":     "bool   — control measurement had any failure",
    "control_dns_failure":     "bool   — control DNS failed",
    "anomaly_score":           "float  — composite OONI anomaly score 0–1",

    # BGP layer (3 features)
    "bgp_prefix_reachable":    "bool   — target IP prefix reachable from vantage",
    "bgp_outage_score":        "float  — IODA BGP disruption score 0–1",
    "bgp_path_length_delta":   "int    — path length vs. 7-day baseline",

    # Probe metadata (9 features)
    "probe_asn_type_residential": "bool — residential ASN classification",
    "probe_asn_type_mobile":      "bool — mobile carrier ASN classification",
    "probe_asn_type_datacenter":  "bool — datacenter/hosting ASN classification",
    "probe_cc_embed_0":           "float — country code embedding dim 0",
    "probe_cc_embed_1":           "float — country code embedding dim 1",
    "probe_cc_embed_2":           "float — country code embedding dim 2",
    "probe_cc_embed_3":           "float — country code embedding dim 3",
    "hour_of_day":                "int   — 0–23, UTC",
    "day_of_week":                "int   — 0=Monday, 6=Sunday",

    # Cross-source corroboration (3 features)
    "censored_planet_flag":    "bool   — CensoredPlanet flagged same domain/country",
    "ioda_bgp_event":          "bool   — IODA recorded BGP event in same window",
    "cp_correlation_score":    "float  — CensoredPlanet similarity score 0–1",
}

Country code is embedded rather than one-hot encoded because one-hot over 200+ countries would produce a high-dimensional sparse matrix that generalizes poorly to held-out countries. We train a 4-dimensional country embedding on co-occurrence of measurement outcomes within country — countries with similar censorship patterns end up near each other in embedding space, which gives the model a useful inductive bias when it encounters a country with few training examples.

Class imbalance

In the labeled training corpus after weak supervision, CENSORED events account for approximately 3.2% of records. This level of imbalance is enough to cause a naive classifier to ignore the minority class almost entirely — it can achieve 96.8% accuracy by predicting NOT_CENSORED for everything, which is exactly the wrong behaviour for a censorship detector.
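The arithmetic behind the 96.8% figure can be checked on a toy sample of 1,000 records with the same class balance (the sample itself is made up for illustration):

```python
labels = [1] * 32 + [0] * 968        # 3.2% CENSORED in a toy sample of 1,000
preds = [0] * len(labels)            # always predict NOT_CENSORED

# Accuracy counts every correct prediction, and almost all records are negative
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Recall counts only the CENSORED records that were actually surfaced
true_positives = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
recall = true_positives / sum(labels)
```

Accuracy comes out at 0.968 while recall is 0.0, which is why accuracy is not a useful metric for this task.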

We use two complementary strategies to address this:

SMOTE oversampling

SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic minority-class examples by interpolating between existing CENSORED records in feature space. We apply SMOTE with sampling_strategy=0.10, which raises the minority class from 3.2% of records to one CENSORED record per ten NOT_CENSORED records (about 9% of the training set) before fitting the model. This gives the model enough positive examples to learn the feature patterns associated with blocking without the synthetic examples dominating the distribution.

from imblearn.over_sampling import SMOTE

smote = SMOTE(
    sampling_strategy=0.10,  # target minority/majority ratio (about 9% of the set)
    k_neighbors=5,
    random_state=42,
)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
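When sampling_strategy is a float, imbalanced-learn interprets it as the desired ratio of minority to majority samples after resampling, not as the minority share of the whole set. The sketch below converts between the two and estimates how many synthetic rows a given ratio implies. The helper names are hypothetical and the row counts are toy values.

```python
def minority_share(sampling_strategy: float) -> float:
    """Minority fraction of the whole set for a given minority/majority ratio."""
    return sampling_strategy / (1.0 + sampling_strategy)

def ratio_for_share(target_share: float) -> float:
    """Minority/majority ratio needed to reach a target minority fraction."""
    return target_share / (1.0 - target_share)

def synthetic_needed(n_minority: int, n_majority: int,
                     sampling_strategy: float) -> int:
    """Approximate number of synthetic minority rows needed to reach the ratio.

    This is an estimate; the library's own rounding may differ by a row.
    """
    return max(0, int(sampling_strategy * n_majority) - n_minority)

share_at_010 = minority_share(0.10)          # share produced by ratio 0.10
ratio_for_10pct = ratio_for_share(0.10)      # ratio that would give a 10% share
extra_rows = synthetic_needed(32, 968, 0.10) # toy corpus of 1,000 rows
```

A ratio of 0.10 leaves the minority class at about 9.1% of the resampled set; a 10% share would need a ratio of about 0.111.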

Class weighting in XGBoost

After SMOTE, the model still sees roughly a 10:1 class ratio. We apply XGBoost's scale_pos_weight parameter to further correct the loss function. The global model uses a weight of approximately 30 (the ratio of negative to positive examples in the original, pre-SMOTE training set). For high-censorship countries where the base rate is meaningfully higher, we override this per-country:

# Global model (all countries combined)
# Counts come from the pre-SMOTE labels: 96.8% / 3.2% ≈ 30.
# The same ratio on the post-SMOTE labels would be only about 10.
neg_count = (y_train == 0).sum()
pos_count = (y_train == 1).sum()
global_scale_pos_weight = neg_count / pos_count  # ≈ 30

# Per-country overrides for countries where CENSORED base rate is higher
# Lower weight = less aggressive minority-class boosting
COUNTRY_SCALE_POS_WEIGHT = {
    'CN': 8,   # ~11% of CN measurements are confirmed censored
    'IR': 12,  # ~7% of IR measurements
    'RU': 18,  # ~5% of RU measurements
    # all other countries fall back to global_scale_pos_weight
}

def get_scale_pos_weight(country: str) -> float:
    return COUNTRY_SCALE_POS_WEIGHT.get(country, global_scale_pos_weight)

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=get_scale_pos_weight(target_country),
    eval_metric='aucpr',
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
)
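XGBoost's documentation suggests sum(negative) / sum(positive) as a starting value for scale_pos_weight. Expressed in terms of a positive base rate p, that is (1 − p) / p. The small sketch below shows what the heuristic gives for base rates quoted in this post; the helper name is hypothetical, and the weights actually used may have been tuned away from these values.

```python
def scale_pos_weight_from_rate(positive_rate: float) -> float:
    """Negative-to-positive ratio implied by a positive base rate."""
    if not 0.0 < positive_rate < 1.0:
        raise ValueError("positive_rate must be strictly between 0 and 1")
    return (1.0 - positive_rate) / positive_rate

global_weight = scale_pos_weight_from_rate(0.032)   # 3.2% global base rate
cn_weight = scale_pos_weight_from_rate(0.11)        # ~11% base rate quoted for CN
```

The heuristic gives about 30 for a 3.2% base rate and about 8 for an 11% base rate. A higher base rate always maps to a lower weight, which is the direction of the per-country overrides.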

We deliberately do not rely solely on threshold tuning (adjusting the decision boundary post-training) to handle imbalance. Threshold adjustment trades precision against recall on a fixed model: it can only move along the existing precision-recall curve. SMOTE and class weighting change the curve itself by training the model on a distribution in which positive examples carry more weight. The downstream corroboration layer handles the residual false positive rate; the classifier's job is to maximize recall.
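To make the point about moving along a fixed curve concrete, the sketch below sweeps a threshold over one fixed set of scores. The scores and labels are made-up toy values and the helper is hypothetical.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is called CENSORED."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy scores from a single fixed model, with their true labels
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    0]

strict = precision_recall_at(scores, labels, 0.70)   # fewer, safer alerts
loose = precision_recall_at(scores, labels, 0.35)    # more alerts, more noise
```

Lowering the threshold from 0.70 to 0.35 raises recall from two thirds to 1.0 and lowers precision from 1.0 to 0.75. The ranking of the scores never changes, so no threshold can improve both at once; only retraining can.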

Train/val/test split strategy

Random splitting is incorrect for this dataset. Three forms of leakage make a naive shuffle-and-split approach produce inflated validation metrics that collapse when the model encounters genuinely unseen data:

  • Same-country, same-ASN leakage. A probe in AS17816 in Iran that observed a block at 14:00 is highly correlated with a probe in the same AS at 15:00. If one is in training and the other in validation, the model has effectively seen the validation example. Country-level blocking patterns are persistent — a block that exists at one timestamp tends to exist at nearby timestamps and from nearby vantage points.
  • Temporal leakage. Probe IDs recur over time: the same probe that measured a domain on Monday often measures the same domain on Tuesday. A random split puts these correlated measurements on either side of the train/val boundary, inflating apparent generalization.
  • Target leakage from future labels. Weak supervision labels are computed over the entire corpus, so a measurement's label can be influenced by CensoredPlanet data that arrives weeks after the original measurement. Under a random split, training measurements can therefore carry labels informed by data from the validation period, and the reverse.

The correct split is time-based with probe-level constraints:

import pandas as pd

def build_splits(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Build train/val/test splits with temporal isolation and probe-level leakage prevention.

    Split boundaries:
      train: measurement_start_time < 2023-06-01
      val:   2023-06-01 <= measurement_start_time < 2024-01-01
      test:  measurement_start_time >= 2024-01-01
    """
    TRAIN_END  = pd.Timestamp('2023-06-01', tz='UTC')
    VAL_END    = pd.Timestamp('2024-01-01', tz='UTC')

    train = df[df['measurement_start_time'] < TRAIN_END].copy()
    val   = df[(df['measurement_start_time'] >= TRAIN_END) &
               (df['measurement_start_time'] <  VAL_END)].copy()
    test  = df[df['measurement_start_time'] >= VAL_END].copy()

    # Probe-level isolation: any probe_id seen in train is excluded from val/test
    # This prevents the model from memorizing probe-specific noise patterns
    train_probe_ids: set[str] = set(train['probe_id'].unique())
    val  = val[~val['probe_id'].isin(train_probe_ids)]
    test = test[~test['probe_id'].isin(train_probe_ids)]

    return train, val, test


def build_country_holdout_split(df: pd.DataFrame, holdout_cc: str):
    """
    Country-holdout validation: train on all countries EXCEPT holdout_cc,
    then evaluate on holdout_cc only. Tests whether the model generalizes
    to censorship patterns it has never directly observed.
    """
    train = df[df['probe_cc'] != holdout_cc].copy()
    holdout = df[df['probe_cc'] == holdout_cc].copy()
    return train, holdout

# Example: train without China, evaluate on China
train_no_cn, test_cn = build_country_holdout_split(df, 'CN')

The country-holdout validation is particularly important for deployment. The global model must generalize to countries with few or no labeled examples — autocratic states that actively suppress documentation of their own censorship. If the model's performance on CN drops substantially when CN is excluded from training, that is a signal that the model is memorizing country-specific patterns rather than learning the underlying feature signatures of censorship.
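The two invariants of the time-based split (no shared probe IDs, no rows on the wrong side of the cutoff) are cheap to audit after the fact. The sketch below is a framework-free illustration that works on (probe_id, timestamp) pairs; the function name and the toy rows are hypothetical.

```python
from datetime import datetime, timezone

def audit_split(train, heldout, cutoff):
    """Count violations of probe isolation and temporal ordering.

    train / heldout: iterables of (probe_id, measurement_start_time) pairs.
    """
    train, heldout = list(train), list(heldout)
    shared = {p for p, _ in train} & {p for p, _ in heldout}
    return {
        "shared_probe_ids": len(shared),
        "train_rows_after_cutoff": sum(1 for _, t in train if t >= cutoff),
        "heldout_rows_before_cutoff": sum(1 for _, t in heldout if t < cutoff),
    }

def utc(y, m, d):
    return datetime(y, m, d, tzinfo=timezone.utc)

TRAIN_END = utc(2023, 6, 1)
train_rows = [("probe-a", utc(2023, 5, 1)), ("probe-b", utc(2023, 5, 20))]
val_rows = [("probe-c", utc(2023, 7, 4)), ("probe-a", utc(2023, 8, 1))]  # probe-a leaks

report = audit_split(train_rows, val_rows, TRAIN_END)
```

With DataFrames, the same check can be fed with zip(df['probe_id'], df['measurement_start_time']). A clean split reports zero for all three counters; the toy data above reports one shared probe ID.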

Per-country calibration

A well-calibrated model outputs probabilities that match empirical frequencies — a measurement scored at 0.7 should be blocked in about 70% of similar cases. The global model is reasonably calibrated in aggregate, but its calibration degrades substantially for individual countries because the base rate of censorship varies dramatically across them. A model calibrated on a 3.2% global positive rate will systematically over-score measurements from countries where blocking is rare and under-score measurements from countries where blocking is common.
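The direction of that bias follows from Bayes' rule. As an illustration only, the sketch below re-weights a score from the 3.2% global base rate to a different country base rate. This is the textbook prior-shift correction, not the calibration method the pipeline applies, and it assumes the feature distributions within each class are the same across countries. The function name and the 0.5% example rate are hypothetical.

```python
def shift_prior(score: float, train_rate: float, target_rate: float) -> float:
    """Re-weight a probability from the training base rate to a target base rate."""
    pos = score * target_rate / train_rate
    neg = (1.0 - score) * (1.0 - target_rate) / (1.0 - train_rate)
    return pos / (pos + neg)

raw = 0.7
high_rate_country = shift_prior(raw, train_rate=0.032, target_rate=0.11)
low_rate_country = shift_prior(raw, train_rate=0.032, target_rate=0.005)
unchanged = shift_prior(raw, train_rate=0.032, target_rate=0.032)
```

Under these assumptions a raw 0.7 corresponds to roughly 0.90 where the base rate is 11% and roughly 0.26 where it is 0.5%, so the uncorrected score is too low in the first case and too high in the second.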

We correct for this using Platt scaling: for each country with sufficient labeled data, we fit a logistic regression on the logit of the raw model score using country-specific holdout labels. This remaps the model's raw probability output to one that is calibrated to the actual positive rate in that country:

from sklearn.linear_model import LogisticRegression
import numpy as np

def fit_platt_calibrator(
    raw_scores: np.ndarray,
    labels: np.ndarray,
) -> LogisticRegression:
    """Fit a Platt scaling calibrator on the logit of the raw score."""
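    # Clip before taking the logit so scores of exactly 0 or 1 stay finite
    clipped = np.clip(raw_scores, 1e-8, 1 - 1e-8)
    logit_scores = np.log(clipped / (1 - clipped)).reshape(-1, 1)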
    cal = LogisticRegression()
    cal.fit(logit_scores, labels)
    return cal

# Fit per-country calibrators where sufficient data exists
# Minimum requirement: 500 confirmed CENSORED examples in the calibration holdout
CALIBRATION_THRESHOLD = 500

country_calibrators: dict[str, LogisticRegression] = {}
for cc, group in calibration_holdout.groupby('probe_cc'):
    confirmed_count = (group['label'] == 1).sum()
    if confirmed_count >= CALIBRATION_THRESHOLD:
        raw = group['raw_score'].values
        labels = group['label'].values
        country_calibrators[cc] = fit_platt_calibrator(raw, labels)

# 12 countries meet the threshold: CN, IR, RU, TR, EG, PK, BY, KZ, AZ, TH, MM, VN
# The remaining 188 countries use one of 6 regional calibrators (by UN geoscheme)
print(f"{len(country_calibrators)} country-specific calibrators fitted")

def calibrated_score(raw_score: float, probe_cc: str) -> float:
    """Apply per-country calibration, falling back to regional calibrator."""
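    # Clip before taking the logit so scores of exactly 0 or 1 stay finite
    clipped = min(max(raw_score, 1e-8), 1 - 1e-8)
    logit = np.log(clipped / (1 - clipped))
    cal = country_calibrators.get(probe_cc)
    if cal is None:
        cal = regional_calibrators[get_region(probe_cc)]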
    return float(cal.predict_proba([[logit]])[0, 1])

Iran is the best-calibrated country: the probability output matches empirical blocking frequency within 2–3 percentage points across the full score range, a consequence of the large volume of confirmed Iranian measurements in the OONI archive. Eritrea, with fewer than 80 confirmed blocks in the full corpus, falls back to the Sub-Saharan Africa regional calibrator and shows substantially wider calibration error bands, particularly in the 0.4–0.7 score range where the regional prior diverges from the country-specific base rate.
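A statement such as "within 2–3 percentage points across the score range" is a claim about binned calibration error. A minimal sketch of that measurement follows, assuming equal-width score bins; the helper names and the toy data are hypothetical.

```python
def calibration_table(scores, labels, n_bins=10):
    """Per-bin (mean predicted score, observed positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)   # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    rows = []
    for members in bins:
        if members:                               # skip empty bins
            mean_score = sum(s for s, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            rows.append((mean_score, observed, len(members)))
    return rows

def max_calibration_gap(scores, labels, n_bins=10):
    """Largest absolute gap between predicted and observed rate in any bin."""
    return max(abs(m - o) for m, o, _ in calibration_table(scores, labels, n_bins))

# Toy example: scores of 0.9 that are right 9 times in 10 are well calibrated
good_gap = max_calibration_gap([0.9] * 10, [1] * 9 + [0])
# Scores of 0.9 that are right only half the time are not
bad_gap = max_calibration_gap([0.9] * 10, [1] * 5 + [0] * 5)
```

Bins with few members give noisy observed rates, which is one reason calibration error bands widen for countries with little labeled data.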

The continuous training pipeline

The OONI archive grows at approximately 2 million new measurements per day. Without continuous retraining, the model's training distribution drifts progressively further from the live measurement distribution as new censorship techniques emerge, new ISPs begin blocking, and the composition of OONI's probe network evolves.

We run two retraining loops on different schedules:

Weekly incremental fine-tuning

Every Monday, we pull the past 7 days of OONI measurements that have received new confirmed labels (either is_confirmed flags from OONI curators or new CensoredPlanet corroboration). These new confirmed examples are used to incrementally fine-tune the existing model — not a full retrain, but a small number of additional boosting rounds weighted heavily toward the new confirmed examples. The goal is to incorporate newly documented blocking patterns (new block page hashes, new DNS sinkhole IPs, new SNI reset signatures) without destabilizing the model's existing decision boundaries.

Monthly full retrain

On the first of each month, we run a full retrain from the complete labeled corpus with the updated training/val/test boundaries shifted forward by one month. The resulting model artifact is named with its training snapshot date (e.g. voidly-classifier-20241101.xgb) and stored alongside the probe measurement snapshots it was trained on, ensuring reproducibility.
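The naming and boundary-shifting conventions are simple enough to sketch. The helper names below are hypothetical; the file name pattern follows the example above, and the boundary shift assumes boundaries always fall on the first of a month.

```python
from datetime import date

def artifact_name(snapshot: date) -> str:
    """Model artifact name for a given training snapshot date."""
    return f"voidly-classifier-{snapshot:%Y%m%d}.xgb"

def shift_one_month(boundary: date) -> date:
    """Move a first-of-month boundary forward by one calendar month."""
    year = boundary.year + boundary.month // 12   # roll over after December
    month = boundary.month % 12 + 1
    return boundary.replace(year=year, month=month)

name = artifact_name(date(2024, 11, 1))
next_train_end = shift_one_month(date(2023, 6, 1))
next_val_end = shift_one_month(date(2023, 12, 1))
```

Encoding the snapshot date in the file name lets a model artifact be matched to the measurement snapshot it was trained on without a separate lookup table.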

Champion/challenger deployment

A newly trained model is not deployed immediately. Instead, it runs in shadow mode for 7 days alongside the current production model — both models score every incoming measurement, but only the champion's scores are used for downstream corroboration. After 7 days, we compare the models on the measurements received during the shadow period that have since received confirmed labels:

def should_promote_challenger(
    champion_scores: np.ndarray,
    challenger_scores: np.ndarray,
    confirmed_labels: np.ndarray,
    threshold: float = 0.5,
) -> bool:
    """
    Promote challenger to champion if:
    - F1 improves by at least 0.5 percentage points
    - Recall does not decrease (we never accept a recall regression)
    """
    from sklearn.metrics import f1_score, recall_score

    champion_preds   = (champion_scores >= threshold).astype(int)
    challenger_preds = (challenger_scores >= threshold).astype(int)

    champion_f1   = f1_score(confirmed_labels, champion_preds)
    challenger_f1 = f1_score(confirmed_labels, challenger_preds)

    champion_recall   = recall_score(confirmed_labels, champion_preds)
    challenger_recall = recall_score(confirmed_labels, challenger_preds)

    f1_improvement = challenger_f1 - champion_f1
    recall_regression = challenger_recall < champion_recall

    return f1_improvement >= 0.005 and not recall_regression

The recall constraint is non-negotiable. A challenger that achieves higher F1 by trading recall for precision is not an improvement in our pipeline — the downstream corroboration layer is designed to handle false positives, not to recover missed events. We will accept a precision decrease to avoid a recall regression; we will not accept the inverse.
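The promotion rule can be restated on summary metrics alone, which makes its behaviour easy to check on a few scenarios. This is a minimal restatement with made-up precision and recall values; the function names are hypothetical.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def promote(champ_p, champ_r, chall_p, chall_r, min_f1_gain=0.005):
    """Promote only if F1 gains at least min_f1_gain and recall does not drop."""
    gain = f1(chall_p, chall_r) - f1(champ_p, champ_r)
    return gain >= min_f1_gain and chall_r >= champ_r

champion = (0.85, 0.90)   # toy (precision, recall)

# Higher F1, but bought with lower recall: rejected
trades_recall = promote(*champion, 0.92, 0.88)
# Higher F1 and higher recall: promoted
improves_both = promote(*champion, 0.87, 0.91)
# Higher recall, but the precision loss pulls F1 below the champion: rejected
loses_f1 = promote(*champion, 0.80, 0.93)
```

The third scenario shows that the rule does not accept an arbitrary precision loss: recall must hold and F1 must still improve by the required margin.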

Validation results

The following results are from evaluation on the held-out test set (measurements with measurement_start_time ≥ 2024-01-01, with probe IDs not seen in training), using per-class binary classifiers at each interference type's operating threshold:

  • DNS tampering: precision 0.91, recall 0.95, F1 0.93
  • TLS interference: precision 0.88, recall 0.93, F1 0.90
  • HTTP blocking: precision 0.94, recall 0.89, F1 0.91. Precision is high because block-page fingerprints are distinctive; recall is lower because novel block pages not yet in the hash corpus produce false negatives until the weekly fine-tune incorporates them
  • BGP withdrawal: precision 0.96, recall 0.97, F1 0.96 — BGP events are coarse-grained but highly distinctive; the feature signal is strong and the class is well-represented in training data from IODA corroboration
  • Throttling: precision 0.79, recall 0.87, F1 0.83 — by far the hardest class

Throttling's lower performance is structural rather than a modelling failure. The other four interference classes produce binary failures: a TLS handshake either completes or it doesn't; DNS either resolves or returns NXDOMAIN. Throttling is a continuous signal: bandwidth degradation along a spectrum from “slightly slower than usual” to “effectively blocked.” The decision boundary between legitimate congestion and targeted throttling sits in a feature region where the two phenomena genuinely overlap, and no labeling approach or model architecture resolves that ambiguity completely. The 0.87 recall means we surface the large majority of real throttling events; the 0.79 precision means roughly one in five surfaced events is noise that the corroboration layer must filter.
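The F1 values and the share of noise among surfaced events both follow directly from the precision and recall figures listed above. A short sketch of that arithmetic (the dictionary layout and names are illustrative):

```python
# (precision, recall) per interference class, as reported above
RESULTS = {
    "DNS tampering":    (0.91, 0.95),
    "TLS interference": (0.88, 0.93),
    "HTTP blocking":    (0.94, 0.89),
    "BGP withdrawal":   (0.96, 0.97),
    "Throttling":       (0.79, 0.87),
}

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_by_class = {name: f1(p, r) for name, (p, r) in RESULTS.items()}

# Share of surfaced events that are false positives: 1 - precision
noise_share = {name: 1.0 - p for name, (p, _) in RESULTS.items()}
```

For throttling the noise share is 0.21, roughly one false positive in every five surfaced events, compared with 0.04 for BGP withdrawal.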


For the OONI historical corpus that provides the raw measurements this pipeline labels: Building the OONI historical corpus: 1.66M downloads, schema normalization, and the decisions behind the dataset →

For the anomaly classifier that this training corpus feeds: The Voidly anomaly classifier: five interference classes, gradient boosted trees, and why we optimize for recall →

For how the cross-source verification layer reconciles OONI, CensoredPlanet, and IODA — the three label sources used in weak supervision above: Cross-source censorship verification: reconciling OONI, CensoredPlanet, and IODA →

For the block page fingerprint library behind the lf_http_blockpage_hash label function: Voidly's block page fingerprint library: detecting censorship signatures across 2,300+ known pages →

For how raw probe measurements are gated before reaching this pipeline — the quality_filter() function, five rejection criteria, 3.2% drop rate, and the to_feature_input() schema transformation: Voidly measurement quality filtering: gating probe data before ML feature extraction →