OONI measurement normalization: reconciling test_keys schema drift across 20 measurement types and 8 years of format changes
The OONI (Open Observatory of Network Interference) S3 data archive contains over 200 million web_connectivity measurements collected since 2016. Building anything useful on top of it (training data for an anomaly classifier, a historical baseline for censorship scoring, a cross-source corroboration layer) first requires solving a data engineering problem: the test_keys field schema has changed significantly across 8 years and at least 6 probe versions, with breaking changes that silently produce wrong results if not handled.
This article covers how we built a normalization pipeline that ingests all OONI web_connectivity measurements regardless of version and produces a single canonical OoniMeasurementNormalized output, achieving a 95.3% pass-through rate (4.7% dropped for unfixable quality issues).
The web_connectivity schema evolution
OONI measurements are NDJSON files (one JSON object per line) with a top-level structure that has been relatively stable, but a test_keys sub-object that changes with each probe version. The web_connectivity test is by far the most common test type (95.3% of all measurements) and has the most complex schema.
| Version | Probe version range | Key breaking changes | Data volume |
|---|---|---|---|
| v0.2 | Legacy probe < 0.3 | No tcp_connect; no blocking field; bool anomaly only | ~3M measurements (pre-2017) |
| v0.3 | Probe 0.3–0.9 | tcp_connect added; blocking is null or string | ~28M measurements (2017–2019) |
| v0.4 | Probe 1.x (MK) | dns_experiment_failure enum added; blocking_country field | ~62M measurements (2019–2021) |
| v0.5 | Probe 3.x (ooniprobe) | x_blocking_flags bitmask added; is_confirmed bool added | ~81M measurements (2021–2023) |
| v0.6 | Probe 3.22+ | top-level anomaly bool dropped (x_blocking_flags retained); tcp_connect status split into a dict | ~30M measurements (2023+) |
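As a minimal sketch of the ingestion step, the following reads NDJSON one line at a time and pulls out test_keys alongside a couple of top-level fields. The helper names iter_measurements and parse_asn are hypothetical, and the sketch assumes that probe_asn arrives as a string such as "AS44244" and that the test type is carried in a top-level test_name field; both assumptions should be checked against the files actually being processed.

```python
import io
import json
from typing import Any, Iterator, Optional, TextIO


def parse_asn(raw: Any) -> Optional[int]:
    """Convert 'AS44244' (or 44244) to 44244; return None if unparseable."""
    if isinstance(raw, int) and not isinstance(raw, bool):
        return raw
    if isinstance(raw, str):
        digits = raw.strip().upper().removeprefix('AS')
        if digits.isascii() and digits.isdigit():
            return int(digits)
    return None


def iter_measurements(
    stream: TextIO, test_name: str = 'web_connectivity'
) -> Iterator[dict[str, Any]]:
    """Yield one dict per valid NDJSON line whose test_name matches."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # corrupt line; a real pipeline would count these
        if isinstance(record, dict) and record.get('test_name') == test_name:
            yield record


# Tiny in-memory sample: one valid record, one corrupt line, one other test.
sample = io.StringIO(
    '{"probe_cc": "IR", "probe_asn": "AS44244", "test_name": "web_connectivity", '
    '"input": "https://example.org/", "test_keys": {"accessible": true}}\n'
    'not json\n'
    '{"probe_cc": "IR", "probe_asn": "AS44244", "test_name": "tor", "test_keys": {}}\n'
)
rows = [
    (m.get('probe_cc'), parse_asn(m.get('probe_asn')), m.get('test_keys') or {})
    for m in iter_measurements(sample)
]
```

Streaming line by line keeps memory flat regardless of file size, which matters at archive scale.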
Version detection logic
There is no reliable schema_version field in OONI measurements. Version detection must infer the schema from the presence and absence of specific fields in test_keys.
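Before committing to detection rules, it can help to survey which marker fields actually co-occur in a sample of the archive. The sketch below is one hypothetical way to do that: marker_signature and survey_signatures are illustrative helpers, and the field list is taken from the version table and the detection rules in this article rather than from any OONI specification.

```python
from collections import Counter
from typing import Any, Iterable

# Marker fields mentioned in this article; presence patterns hint at the schema.
MARKER_FIELDS = (
    'accessible',
    'blocking',
    'tcp_connect',
    'dns_experiment_failure',
    'blocking_country',
    'x_blocking_flags',
    'is_confirmed',
)


def marker_signature(test_keys: dict[str, Any]) -> tuple[str, ...]:
    """Return the marker fields present in test_keys, in a fixed order."""
    return tuple(name for name in MARKER_FIELDS if name in test_keys)


def survey_signatures(samples: Iterable[dict[str, Any]]) -> Counter:
    """Count how often each presence pattern occurs across sampled test_keys."""
    return Counter(marker_signature(tk) for tk in samples)


samples = [
    {'accessible': True},
    {'accessible': True, 'blocking': None, 'tcp_connect': []},
    {'accessible': False, 'blocking': 'dns', 'tcp_connect': []},
]
counts = survey_signatures(samples)
for signature, n in counts.most_common():
    print(n, signature)
```

Rare signatures that match no rule are good candidates for manual inspection before they are written off as corrupt.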
from enum import Enum
from typing import Any


class WebConnectivityVersion(Enum):
    V02 = '0.2'  # legacy; minimal fields
    V03 = '0.3'  # tcp_connect added
    V04 = '0.4'  # dns_experiment_failure; blocking_country
    V05 = '0.5'  # x_blocking_flags bitmask; is_confirmed
    V06 = '0.6'  # top-level anomaly dropped; tcp_connect status split
    UNKNOWN = 'unknown'


def detect_web_connectivity_version(
    test_keys: dict[str, Any],
) -> WebConnectivityVersion:
    """
    Detect the web_connectivity schema version from test_keys field presence.
    Rules applied in priority order (most specific first).
    """
    # v0.6 marker: tcp_connect entries carry a split status dict:
    # {'blocked': bool, 'failure': str | null, 'success': bool}
    tcp_connect = test_keys.get('tcp_connect')
    if isinstance(tcp_connect, list) and tcp_connect:
        first = tcp_connect[0]
        # Guard against non-dict entries so malformed records fall through
        if isinstance(first, dict) and isinstance(first.get('status'), dict):
            return WebConnectivityVersion.V06

    # v0.5 marker: x_blocking_flags field present (integer bitmask)
    if 'x_blocking_flags' in test_keys and isinstance(
        test_keys['x_blocking_flags'], int
    ):
        return WebConnectivityVersion.V05

    # v0.4 marker: blocking_country or dns_experiment_failure field present
    if 'blocking_country' in test_keys or 'dns_experiment_failure' in test_keys:
        return WebConnectivityVersion.V04

    # v0.3 marker: tcp_connect present and blocking is null or string
    if 'tcp_connect' in test_keys and 'blocking' in test_keys:
        return WebConnectivityVersion.V03

    # v0.2: accessible present, no tcp_connect (bool anomaly only)
    if 'accessible' in test_keys and 'tcp_connect' not in test_keys:
        return WebConnectivityVersion.V02

    return WebConnectivityVersion.UNKNOWN

Canonical output dataclass
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AnomalyType(Enum):
    """Normalized anomaly classification, version-agnostic."""
    CLEAN = 'clean'                # accessible, no anomaly
    DNS_ANOMALY = 'dns_anomaly'    # DNS failure or spoofed response
    TCP_ANOMALY = 'tcp_anomaly'    # TCP connection failure
    HTTP_ANOMALY = 'http_anomaly'  # HTTP response anomaly (blocking page or failure)
    TLS_ANOMALY = 'tls_anomaly'    # TLS handshake failure
    CONFIRMED = 'confirmed'        # OONI confirmed blocking
    GENERIC_FAILURE = 'failure'    # probe-level failure; exclude from training
    UNKNOWN = 'unknown'


class ConfidenceTier(Enum):
    HIGH = 'high'      # confirmed OR multiple anomaly signals
    MEDIUM = 'medium'  # single anomaly signal
    LOW = 'low'        # accessible with minor anomaly


@dataclass
class OoniMeasurementNormalized:
    # Top-level identifiers
    measurement_uid: str
    probe_cc: str         # ISO 3166-1 alpha-2 country code
    probe_asn: int        # AS number (numeric)
    test_start_time: str  # ISO 8601 UTC
    input_url: str        # tested URL

    # Normalized anomaly classification
    anomaly_type: AnomalyType
    confidence_tier: ConfidenceTier
    is_confirmed: bool                # OONI-confirmed blocking (highest confidence)
    blocking_category: Optional[str]  # OONI blocking category if confirmed

    # DNS layer (normalized)
    dns_success: bool
    dns_failure: Optional[str]  # failure string or None
    dns_ips: list[str]          # resolved IPs
    dns_matches_control: Optional[bool]

    # TCP layer (normalized)
    tcp_connected: Optional[bool]
    tcp_failure: Optional[str]

    # HTTP layer (normalized)
    http_success: Optional[bool]
    http_status_code: Optional[int]
    http_body_length: Optional[int]
    http_body_sha256: Optional[str]  # for block page matching
    http_failure: Optional[str]

    # Schema version (for debugging)
    schema_version: str

v0.5 and v0.6 normalization side-by-side
The most common migration challenge is the v0.5 to v0.6 transition. v0.5 introduced the x_blocking_flags bitmask, which encodes which specific measurement components triggered the anomaly flag, alongside the top-level anomaly bool. v0.6 kept the bitmask but dropped the anomaly bool, so recovering equivalent information requires parsing the bitmask.
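To make the bitmask concrete, here is a small sketch that decodes an x_blocking_flags value into named signals. It follows the bit layout given in this article's normalization listing; the decode_flags helper and the signal names are illustrative and are not part of OONI's tooling.

```python
# Bit positions follow the layout documented in this article's listing.
FLAG_NAMES = {
    0: 'dns',
    1: 'tcp',
    2: 'http_diff',
    3: 'http_failure',
    4: 'tls',
    5: 'confirmed',
}


def decode_flags(flags: int) -> list[str]:
    """Return the names of all set bits, lowest bit first.

    Bits outside the documented layout are reported as 'unknown_bit_N' so
    unexpected values stay visible instead of being silently dropped.
    """
    if flags < 0:
        raise ValueError('x_blocking_flags must be non-negative')
    names = []
    bit = 0
    while flags >> bit:  # stop once no higher bits remain
        if flags & (1 << bit):
            names.append(FLAG_NAMES.get(bit, f'unknown_bit_{bit}'))
        bit += 1
    return names


print(decode_flags(0b000101))  # DNS and HTTP-diff bits set
print(decode_flags(0b100000))  # confirmed bit only
```

A mask with several bits set is collapsed to a single AnomalyType by the priority order in the normalizers, so keeping the decoded list alongside it preserves the extra signal.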
# x_blocking_flags bitmask (v0.5+)
# Bit 0: DNS anomaly
# Bit 1: TCP anomaly
# Bit 2: HTTP anomaly (response differs from control)
# Bit 3: HTTP failure (connection failed at HTTP layer)
# Bit 4: TLS anomaly
# Bit 5: confirmed blocking (block page matched)
FLAG_DNS = 1 << 0      # 0x01
FLAG_TCP = 1 << 1      # 0x02
FLAG_HTTP = 1 << 2     # 0x04
FLAG_FAILURE = 1 << 3  # 0x08
FLAG_TLS = 1 << 4      # 0x10
FLAG_CONFIRM = 1 << 5  # 0x20


def normalize_v05(test_keys: dict) -> dict:
    # Treat a missing or null bitmask as 0 so the bit tests below cannot raise
    flags = test_keys.get('x_blocking_flags') or 0
    # v0.5 carries confirmation both as a flag bit and as a separate bool
    is_confirmed = bool(flags & FLAG_CONFIRM) or bool(test_keys.get('is_confirmed'))
    # Priority order: confirmed > DNS > TCP > HTTP > TLS
    if is_confirmed:
        anomaly_type = AnomalyType.CONFIRMED
    elif flags & FLAG_DNS:
        anomaly_type = AnomalyType.DNS_ANOMALY
    elif flags & FLAG_TCP:
        anomaly_type = AnomalyType.TCP_ANOMALY
    elif flags & (FLAG_HTTP | FLAG_FAILURE):
        anomaly_type = AnomalyType.HTTP_ANOMALY
    elif flags & FLAG_TLS:
        anomaly_type = AnomalyType.TLS_ANOMALY
    elif flags == 0:
        anomaly_type = AnomalyType.CLEAN
    else:
        # Only bits outside the documented layout are set
        anomaly_type = AnomalyType.UNKNOWN
    return {
        'anomaly_type': anomaly_type,
        'is_confirmed': is_confirmed,
        'dns_matches_control': test_keys.get('dns_consistency') == 'consistent',
        'http_body_sha256': None,  # not present in v0.5; added v0.6
    }


def normalize_v06(test_keys: dict) -> dict:
    # v0.6 dropped top-level 'anomaly'; flags still present
    flags = test_keys.get('x_blocking_flags') or 0
    is_confirmed = bool(flags & FLAG_CONFIRM)
    # v0.6 adds a response body hash, read here from http_response_body_sha256
    body_sha256 = test_keys.get('http_response_body_sha256')
    # Same priority order as v0.5
    if is_confirmed:
        anomaly_type = AnomalyType.CONFIRMED
    elif flags & FLAG_DNS:
        anomaly_type = AnomalyType.DNS_ANOMALY
    elif flags & FLAG_TCP:
        anomaly_type = AnomalyType.TCP_ANOMALY
    elif flags & (FLAG_HTTP | FLAG_FAILURE):
        anomaly_type = AnomalyType.HTTP_ANOMALY
    elif flags & FLAG_TLS:
        anomaly_type = AnomalyType.TLS_ANOMALY
    elif flags == 0:
        anomaly_type = AnomalyType.CLEAN
    else:
        anomaly_type = AnomalyType.UNKNOWN
    return {
        'anomaly_type': anomaly_type,
        'is_confirmed': is_confirmed,
        # Note the renamed key: x_dns_consistency in v0.6 vs dns_consistency in v0.5
        'dns_matches_control': test_keys.get('x_dns_consistency') == 'consistent',
        'http_body_sha256': body_sha256,
    }

Pass-through rate breakdown
After normalization, measurements are filtered for quality before entering the classifier training pipeline. The 4.7% drop rate breaks down as follows (the five in-pipeline reasons itemized below account for 4.2 percentage points of it):
| Drop reason | % of total | Notes |
|---|---|---|
| Unknown schema version | 0.8% | Fields missing from both old and new schemas; likely corrupt |
| Probe version < 0.3.0 | 0.9% | Too few fields for reliable normalization |
| control_failure field set | 1.9% | Probe failed to reach control; comparison invalid |
| Missing input URL | 0.4% | Early-format measurements without input field |
| Duplicate measurement_uid | 0.2% | Probe resubmission; keep first by timestamp |
| Non-web_connectivity test type | 4.7% of all OONI measurements | Excluded before the normalization pipeline; separate pipelines for tor, signal, whatsapp tests |
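The in-pipeline drop reasons in the table can be sketched as an ordered filter that also tallies why each record was dropped. This is an illustration under stated assumptions, not the production filter: the flat record layout, the reason labels, and the helpers drop_reason and filter_records are hypothetical, and treating schema version '0.2' as the "Probe version < 0.3.0" case is an assumption drawn from the version table.

```python
from collections import Counter
from typing import Any, Iterable, Optional


def drop_reason(record: dict[str, Any], seen_uids: set) -> Optional[str]:
    """Return the first matching drop reason, or None if the record passes."""
    version = record.get('schema_version')
    if version == 'unknown':
        return 'unknown_schema'
    if version == '0.2':  # assumed to correspond to probe version < 0.3.0
        return 'legacy_probe'
    if record.get('control_failure') is not None:
        return 'control_failure'
    if not record.get('input_url'):
        return 'missing_input'
    uid = record.get('measurement_uid')
    if uid is not None:
        if uid in seen_uids:
            return 'duplicate_uid'
        seen_uids.add(uid)
    return None


def filter_records(records: Iterable[dict[str, Any]]):
    """Split records into kept rows and a tally of drop reasons."""
    # Sort by timestamp so the earliest copy of a duplicated uid is kept.
    # Plain string ordering is only valid if timestamps share one format.
    ordered = sorted(records, key=lambda r: r.get('test_start_time') or '')
    seen: set = set()
    kept, reasons = [], Counter()
    for rec in ordered:
        reason = drop_reason(rec, seen)
        if reason is None:
            kept.append(rec)
        else:
            reasons[reason] += 1
    return kept, reasons


common = {'schema_version': '0.5', 'input_url': 'https://example.org/',
          'control_failure': None}
records = [
    {**common, 'measurement_uid': 'a', 'test_start_time': '2022-01-02T00:00:00Z'},
    {**common, 'measurement_uid': 'a', 'test_start_time': '2022-01-01T00:00:00Z'},
    {**common, 'measurement_uid': 'b', 'test_start_time': '2022-01-03T00:00:00Z',
     'control_failure': 'generic_timeout_error'},
    {**common, 'measurement_uid': 'c', 'test_start_time': '2022-01-04T00:00:00Z',
     'input_url': None},
]
kept, reasons = filter_records(records)
```

Recording a reason per dropped record, rather than only a count of survivors, is what makes a breakdown table like the one above reproducible.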
The resulting normalized corpus (95.3% of input, 190M+ measurements as of November 2024) forms the primary training and evaluation data source for the Voidly anomaly classifier. OONI-confirmed measurements serve as high-confidence positive labels, with limited coverage: 34.2% of censored URL-country pairs have at least one confirmed measurement.
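The coverage figure is a ratio over URL-country pairs, not over measurements. A minimal sketch of that calculation follows, assuming the set of censored pairs is supplied from elsewhere; how that set is defined is not specified here, and the Row type and confirmed_coverage function are hypothetical names.

```python
from typing import Iterable, NamedTuple


class Row(NamedTuple):
    probe_cc: str
    input_url: str
    is_confirmed: bool


def confirmed_coverage(
    censored_pairs: set[tuple[str, str]],
    rows: Iterable[Row],
) -> float:
    """Fraction of censored (country, URL) pairs with >= 1 confirmed measurement."""
    if not censored_pairs:
        return 0.0
    # Collapse measurements to distinct pairs so repeats do not inflate coverage.
    confirmed = {
        (r.probe_cc, r.input_url)
        for r in rows
        if r.is_confirmed and (r.probe_cc, r.input_url) in censored_pairs
    }
    return len(confirmed) / len(censored_pairs)


censored = {
    ('IR', 'https://a.example/'),
    ('IR', 'https://b.example/'),
    ('CN', 'https://a.example/'),
    ('CN', 'https://c.example/'),
}
rows = [
    Row('IR', 'https://a.example/', True),
    Row('IR', 'https://a.example/', True),   # repeat of the same pair
    Row('CN', 'https://a.example/', False),  # measured but not confirmed
    Row('RU', 'https://x.example/', True),   # confirmed, but not in the censored set
]
coverage = confirmed_coverage(censored, rows)
```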
For how the normalized OONI corpus is hosted and accessed on HuggingFace: Building the OONI historical corpus: 200M+ measurements on HuggingFace →
For how Voidly's quality filter applies additional gates on top of this normalized corpus: Voidly's measurement quality filter: the 3.2% drop rate and what causes it →