
OONI measurement normalization: reconciling test_keys schema drift across 20 measurement types and 8 years of format changes

November 9, 2024 · 13 min read · AI Analytics

Censorship · Data engineering · OONI · Infrastructure

The OONI (Open Observatory of Network Interference) S3 data archive contains over 200 million web_connectivity measurements collected since 2016. Building anything useful on top of it (training data for an anomaly classifier, a historical baseline for censorship scoring, a cross-source corroboration layer) first requires solving a data engineering problem: the test_keys field schema has changed significantly across 8 years and at least 6 probe versions, and some of the breaking changes silently produce wrong results if they are not handled.

This article covers how we built a normalization pipeline that ingests all OONI web_connectivity measurements regardless of version and produces a single canonical OoniMeasurementNormalized output, achieving a 95.3% pass-through rate (4.7% dropped for unfixable quality issues).

The web_connectivity schema evolution

OONI measurements are NDJSON files (one JSON object per line) with a top-level structure that has been relatively stable, but a test_keys sub-object that changes with each probe version. The web_connectivity test is by far the most common test type (95.3% of all measurements) and has the most complex schema.
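
Reading such files line by line keeps memory use flat regardless of file size. The following is a minimal sketch, not the pipeline's actual reader: the function name iter_measurements is hypothetical, and it assumes that files ending in .gz are gzip-compressed and everything else is plain UTF-8 text.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Iterator


def iter_measurements(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one measurement dict per valid NDJSON line in the file."""
    # Assumption: a .gz suffix means gzip compression; anything else is plain text.
    opener = gzip.open if path.suffix == '.gz' else open
    with opener(path, 'rt', encoding='utf-8') as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Skip a corrupt line instead of aborting the whole file.
                continue
            # A measurement must be a JSON object; ignore arrays, numbers, etc.
            if isinstance(record, dict):
                yield record
```

Skipping corrupt lines silently is a choice made for brevity here; a production reader would more likely count them so that data loss stays visible.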

Version	Probe version range	Key breaking changes	Data volume
v0.2	Legacy probe < 0.3	No tcp_connect; no blocking field; bool anomaly only	~3M measurements (pre-2017)
v0.3	Probe 0.3–0.9	tcp_connect added; blocking is null | string	~28M measurements (2017–2019)
v0.4	Probe 1.x (MK)	dns_experiment_failure enum added; blocking_country field	~62M measurements (2019–2021)
v0.5	Probe 3.x (ooniprobe)	x_blocking_flags bitmask added; is_confirmed bool added	~81M measurements (2021–2023)
v0.6	Probe 3.22+	top-level anomaly bool dropped (x_blocking_flags retained); tcp_connect status split into a per-IP dict	~30M measurements (2023+)

Version detection logic

There is no reliable schema_version field in OONI measurements. Version detection must infer the schema from the presence and absence of specific fields in test_keys.

from enum import Enum
from typing import Any

class WebConnectivityVersion(Enum):
    V02 = '0.2'   # legacy; minimal fields
    V03 = '0.3'   # tcp_connect added
    V04 = '0.4'   # dns_experiment_failure; blocking_country
    V05 = '0.5'   # x_blocking_flags bitmask; is_confirmed
    V06 = '0.6'   # top-level anomaly dropped; tcp_connect status split
    UNKNOWN = 'unknown'

def detect_web_connectivity_version(
    test_keys: dict[str, Any],
) -> WebConnectivityVersion:
    """
    Detect the web_connectivity schema version from test_keys field presence.
    Rules applied in priority order (most specific first).
    """
    # v0.6 marker: each tcp_connect entry carries 'status' as a dict
    # (split form: {'blocked': bool, 'failure': str | null, 'success': bool})
    tcp_connect = test_keys.get('tcp_connect')
    if isinstance(tcp_connect, list) and tcp_connect:
        first = tcp_connect[0]
        # Guard against non-dict entries so malformed input cannot raise here
        if isinstance(first, dict) and isinstance(first.get('status'), dict):
            return WebConnectivityVersion.V06

    # v0.5 marker: x_blocking_flags field present (integer bitmask)
    if 'x_blocking_flags' in test_keys and isinstance(
        test_keys['x_blocking_flags'], int
    ):
        return WebConnectivityVersion.V05

    # v0.4 marker: blocking_country or dns_experiment_failure field present
    if 'blocking_country' in test_keys or 'dns_experiment_failure' in test_keys:
        return WebConnectivityVersion.V04

    # v0.3 marker: tcp_connect present and blocking is null or string
    if 'tcp_connect' in test_keys and 'blocking' in test_keys:
        return WebConnectivityVersion.V03

    # v0.2: legacy minimal schema ('accessible' present, no tcp_connect)
    if 'accessible' in test_keys and 'tcp_connect' not in test_keys:
        return WebConnectivityVersion.V02

    return WebConnectivityVersion.UNKNOWN
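
Because detection is heuristic, it is worth tallying detected versions over a sample before trusting it. The sketch below is illustrative only: tally_versions is a hypothetical helper, and the detector is passed in as a parameter so the block stays self-contained (in practice it would be the detect_web_connectivity_version function above).

```python
from collections import Counter
from typing import Any, Callable, Iterable


def tally_versions(
    measurements: Iterable[dict[str, Any]],
    detect: Callable[[dict[str, Any]], Any],
) -> Counter:
    """Count how many measurements fall into each detected schema version."""
    counts: Counter = Counter()
    for m in measurements:
        test_keys = m.get('test_keys')
        if not isinstance(test_keys, dict):
            # test_keys absent or null: nothing to detect from
            counts['missing_test_keys'] += 1
            continue
        version = detect(test_keys)
        # Enum members expose .value; fall back to str() for plain labels
        counts[getattr(version, 'value', str(version))] += 1
    return counts
```

A share of 'unknown' well above the 0.8% listed in the drop table further down, or a version appearing in a date range where it should not exist, would suggest that the marker rules need revisiting.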

Canonical output dataclass

from dataclasses import dataclass
from typing import Optional
from enum import Enum

class AnomalyType(Enum):
    """Normalized anomaly classification, version-agnostic."""
    CLEAN = 'clean'               # accessible, no anomaly
    DNS_ANOMALY = 'dns_anomaly'   # DNS failure or spoofed response
    TCP_ANOMALY = 'tcp_anomaly'   # TCP connection failure
    HTTP_ANOMALY = 'http_anomaly' # HTTP response anomaly (blocking page or failure)
    TLS_ANOMALY = 'tls_anomaly'   # TLS handshake failure
    CONFIRMED = 'confirmed'       # OONI confirmed blocking
    GENERIC_FAILURE = 'failure'   # probe-level failure; exclude from training
    UNKNOWN = 'unknown'

class ConfidenceTier(Enum):
    HIGH = 'high'     # confirmed OR multiple anomaly signals
    MEDIUM = 'medium' # single anomaly signal
    LOW = 'low'       # accessible with minor anomaly

@dataclass
class OoniMeasurementNormalized:
    # Top-level identifiers
    measurement_uid: str
    probe_cc: str               # ISO 3166-1 alpha-2 country code
    probe_asn: int              # AS number (numeric)
    test_start_time: str        # ISO 8601 UTC
    input_url: str              # tested URL

    # Normalized anomaly classification
    anomaly_type: AnomalyType
    confidence_tier: ConfidenceTier
    is_confirmed: bool          # OONI-confirmed blocking (highest confidence)
    blocking_category: Optional[str]  # OONI blocking category if confirmed

    # DNS layer (normalized)
    dns_success: bool
    dns_failure: Optional[str]  # failure string or None
    dns_ips: list[str]          # resolved IPs
    dns_matches_control: Optional[bool]

    # TCP layer (normalized)
    tcp_connected: Optional[bool]
    tcp_failure: Optional[str]

    # HTTP layer (normalized)
    http_success: Optional[bool]
    http_status_code: Optional[int]
    http_body_length: Optional[int]
    http_body_sha256: Optional[str]  # for block page matching
    http_failure: Optional[str]

    # Schema version (for debugging)
    schema_version: str
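
The dataclass stores probe_asn as an integer, whereas raw OONI measurements typically carry it as a string such as "AS30722". The helpers below are a hedged sketch of that coercion, plus a basic shape check for probe_cc. Both function names are hypothetical and not taken from the pipeline.

```python
import re
from typing import Optional

# Accepts "AS30722", "as30722", or a bare "30722"
_ASN_RE = re.compile(r'^(?:AS)?(\d+)$', re.IGNORECASE)


def parse_probe_asn(raw: object) -> Optional[int]:
    """Coerce a raw probe_asn value to a non-negative int, or None if unusable."""
    if isinstance(raw, bool):
        return None  # bool is a subclass of int; reject it explicitly
    if isinstance(raw, int):
        return raw if raw >= 0 else None
    if isinstance(raw, str):
        match = _ASN_RE.match(raw.strip())
        if match:
            return int(match.group(1))
    return None


def normalize_probe_cc(raw: object) -> Optional[str]:
    """Return an upper-cased two-letter code, or None if the shape is wrong."""
    # Shape check only: this does not verify that the code is assigned in ISO 3166-1.
    if isinstance(raw, str) and len(raw) == 2 and raw.isascii() and raw.isalpha():
        return raw.upper()
    return None
```

Returning None instead of raising lets the caller decide whether a malformed identifier should drop the measurement or only be logged.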

v0.5 and v0.6 normalization side-by-side

The most common migration challenge is the v0.5 → v0.6 transition. v0.5 introduced the x_blocking_flags bitmask alongside the top-level anomaly bool; the bitmask encodes which specific measurement components triggered the anomaly flag. v0.6 kept the bitmask but dropped the anomaly bool, so recovering the equivalent information requires parsing the bitmask.

# x_blocking_flags bitmask (v0.5+)
# Bit 0: DNS anomaly
# Bit 1: TCP anomaly
# Bit 2: HTTP anomaly (response differs from control)
# Bit 3: HTTP failure (connection failed at HTTP layer)
# Bit 4: TLS anomaly
# Bit 5: confirmed blocking (block page matched)
FLAG_DNS     = 1 << 0   # 0x01
FLAG_TCP     = 1 << 1   # 0x02
FLAG_HTTP    = 1 << 2   # 0x04
FLAG_FAILURE = 1 << 3   # 0x08
FLAG_TLS     = 1 << 4   # 0x10
FLAG_CONFIRM = 1 << 5   # 0x20

def normalize_v05(test_keys: dict) -> dict:
    flags = test_keys.get('x_blocking_flags', 0)
    is_confirmed = bool(flags & FLAG_CONFIRM) or test_keys.get('is_confirmed', False)

    if is_confirmed:
        anomaly_type = AnomalyType.CONFIRMED
    elif flags & FLAG_DNS:
        anomaly_type = AnomalyType.DNS_ANOMALY
    elif flags & FLAG_TCP:
        anomaly_type = AnomalyType.TCP_ANOMALY
    elif flags & (FLAG_HTTP | FLAG_FAILURE):
        anomaly_type = AnomalyType.HTTP_ANOMALY
    elif flags & FLAG_TLS:
        anomaly_type = AnomalyType.TLS_ANOMALY
    elif flags == 0:
        anomaly_type = AnomalyType.CLEAN
    else:
        anomaly_type = AnomalyType.UNKNOWN

    return {
        'anomaly_type': anomaly_type,
        'is_confirmed': is_confirmed,
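        # None when the field is absent, so "unknown" is not recorded as a mismatch
        'dns_matches_control': (
            test_keys['dns_consistency'] == 'consistent'
            if test_keys.get('dns_consistency') is not None else None
        ),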
        'http_body_sha256': None,  # not present in v0.5; added v0.6
    }

def normalize_v06(test_keys: dict) -> dict:
    # v0.6 dropped top-level 'anomaly'; flags still present
    flags = test_keys.get('x_blocking_flags', 0)
    is_confirmed = bool(flags & FLAG_CONFIRM)

    # v0.6 adds an HTTP response body hash; read here from http_response_body_sha256
    body_sha256 = test_keys.get('http_response_body_sha256')

    if is_confirmed:
        anomaly_type = AnomalyType.CONFIRMED
    elif flags & FLAG_DNS:
        anomaly_type = AnomalyType.DNS_ANOMALY
    elif flags & FLAG_TCP:
        anomaly_type = AnomalyType.TCP_ANOMALY
    elif flags & (FLAG_HTTP | FLAG_FAILURE):
        anomaly_type = AnomalyType.HTTP_ANOMALY
    elif flags & FLAG_TLS:
        anomaly_type = AnomalyType.TLS_ANOMALY
    elif flags == 0:
        anomaly_type = AnomalyType.CLEAN
    else:
        anomaly_type = AnomalyType.UNKNOWN

    return {
        'anomaly_type': anomaly_type,
        'is_confirmed': is_confirmed,
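        # None when the field is absent, so "unknown" is not recorded as a mismatch
        'dns_matches_control': (
            test_keys['x_dns_consistency'] == 'consistent'
            if test_keys.get('x_dns_consistency') is not None else None
        ),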
        'http_body_sha256': body_sha256,
    }
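
Both normalizers collapse the bitmask to a single anomaly_type by taking the first matching layer, so a measurement with several bits set loses detail. The sketch below decodes every set bit using the bit layout listed above. The names decode_blocking_flags and has_unknown_bits are hypothetical helpers, not part of the pipeline.

```python
# Bit positions follow the x_blocking_flags layout described in this article.
FLAG_NAMES = {
    1 << 0: 'dns',
    1 << 1: 'tcp',
    1 << 2: 'http',
    1 << 3: 'http_failure',
    1 << 4: 'tls',
    1 << 5: 'confirmed',
}
KNOWN_MASK = 0x3F  # bits 0 through 5


def decode_blocking_flags(flags: int) -> list[str]:
    """Return the names of all set bits, lowest bit first."""
    return [name for bit, name in FLAG_NAMES.items() if flags & bit]


def has_unknown_bits(flags: int) -> bool:
    """True if any bit outside the documented six is set."""
    return bool(flags & ~KNOWN_MASK)


print(decode_blocking_flags(0x05))  # ['dns', 'http']
```

Keeping the full list of signals is one way to support the HIGH confidence tier, which is defined as confirmed blocking or multiple anomaly signals.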

Pass-through rate breakdown

After normalization, measurements are filtered for quality before entering the classifier training pipeline. The 4.7% drop rate breaks down as follows (the five itemized reasons sum to 4.2%; the remaining 0.5% is not itemized):

Drop reason	% of total	Notes
Unknown schema version	0.8%	Fields missing from both old and new schemas; likely corrupt
Probe version < 0.3.0	0.9%	Too few fields for reliable normalization
control_failure field set	1.9%	Probe failed to reach control; comparison invalid
Missing input URL	0.4%	Early-format measurements without input field
Duplicate measurement_uid	0.2%	Probe resubmission; keep first by timestamp
Non-web_connectivity test type	Not counted in the drop rate (4.7% of total OONI; excluded before the normalization pipeline)	Separate pipelines for tor, signal, whatsapp tests
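
The gates in the table can be sketched as an ordered sequence of checks that records why each measurement was dropped. This is an illustration under stated assumptions, not the production filter: the function names are hypothetical, 'schema_version' is assumed to have been attached by the version detector, test_start_time is assumed to sort lexicographically, and the probe-version gate is omitted for brevity.

```python
from collections import Counter
from typing import Any, Iterable, Iterator, Optional


def drop_reason(m: dict[str, Any]) -> Optional[str]:
    """Return the first matching drop reason, or None to keep the measurement."""
    test_keys = m.get('test_keys') or {}
    if m.get('schema_version') == 'unknown':
        return 'unknown_schema'
    if test_keys.get('control_failure') is not None:
        return 'control_failure'  # control unreachable; comparison invalid
    if not m.get('input'):
        return 'missing_input'
    return None


def filter_measurements(
    measurements: Iterable[dict[str, Any]],
    stats: Counter,
) -> Iterator[dict[str, Any]]:
    """Yield measurements that pass every gate; tally drop reasons in stats."""
    seen_uids: set = set()
    # Sorting by timestamp means the earliest copy of a duplicate uid is kept.
    # This holds the whole batch in memory, which is acceptable for a sketch.
    ordered = sorted(measurements, key=lambda m: m.get('test_start_time') or '')
    for m in ordered:
        reason = drop_reason(m)
        uid = m.get('measurement_uid')
        if reason is None and uid in seen_uids:
            reason = 'duplicate_uid'
        if reason is not None:
            stats[reason] += 1
            continue
        seen_uids.add(uid)
        yield m
```

Dividing each counter in stats by the batch size gives a per-reason breakdown in the same shape as the table above.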

The resulting 95.3% normalized corpus (190M+ measurements as of November 2024) forms the primary training and evaluation data source for the Voidly anomaly classifier. OONI-confirmed measurements serve as high-confidence positive labels; 34.2% of censored URL-country pairs have at least one confirmed measurement.

For how the normalized OONI corpus is hosted and accessed on HuggingFace: Building the OONI historical corpus: 200M+ measurements on HuggingFace →

For how Voidly's quality filter applies additional gates on top of this normalized corpus: Voidly's measurement quality filter: the 3.2% drop rate and what causes it →