The Voidly Measurement Dataset: Field-by-Field Schema Reference
The Voidly measurement dataset is published on HuggingFace under CC BY 4.0. It contains one row per measurement: each time a probe tests a URL, the result — along with the control comparison, classifier output, and corroboration status — is a row in the dataset. At 2.2B+ measurements and growing at roughly 500k rows per day, the full dataset is not small. Understanding what each field means is essential before filtering or modeling.
This is the field-by-field reference. We group fields into eight categories: probe identity, measurement target, the raw measurement layers, the control comparison, the ML classification output, cross-source corroboration, BGP and outage signals, and the derived aggregates used for incident reporting and forecasting.
Probe identity
measurement_id # UUID, globally unique per measurement run
probe_id # Opaque ID for the probe; stable per installation, not linked to identity
probe_cc # ISO 3166-1 alpha-2 country code of the probe's physical location
probe_asn # Autonomous System Number (numeric) of the probe's ISP
probe_asn_name # ISP name string (from RIPE NCC Routing Registry)
probe_type # "residential" | "mobile" | "university" | "data_center"
measurement_start # UTC ISO 8601 timestamp, probe-reported
measurement_end # UTC ISO 8601 timestamp, probe-reported
collector_received # UTC ISO 8601 timestamp when the collector received the measurement
probe_id is stable per device installation but is not linked to any operator identity. Operators can reset it at any time from the probe application's settings. We use it for anomaly attribution (detecting whether a single probe is producing systematically biased results) but never publish probe IDs; the published dataset uses probe_cc + probe_asn as the localization fields.
probe_type affects how the classifier weights the measurement. A residential broadband probe is the gold standard. A data center probe is less representative and carries a lower quality weight in the confidence score aggregation.
Measurement target
url # Full URL tested, including scheme and path
domain # Extracted domain (used for grouping and blocking pattern analysis)
url_category # OONI category code: NEWS | COMM | HUMR | POLR | LGBT | SRCH |
# PORN | ALDR | GAME | MMED | FILE | GRP (see test list methodology)
url_source # "citizen_lab_global" | "country_supplemental" | "emergency_addition"
url_test_list_added # ISO 8601 date when the URL was added to the test list
The 12 OONI category codes reflect the kinds of content that governments most frequently block. HUMR (human rights) and POLR (political opposition) are the highest-risk categories in the dataset: blocking events in these categories are more likely to reflect deliberate government censorship than commercial filtering or age-verification enforcement.
url_source identifies how the URL entered the test list. emergency_addition URLs (like Telegram and VPN provider sites added during the Iran 2022 protests) are valuable for tracking reactive censorship, that is, blocks that appear within hours of a URL being added to the test list.
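As a rough illustration of the reactive-censorship idea, the sketch below measures how long after a URL's list date the first classified measurement appears. It assumes the dataset is loaded into a pandas DataFrame and treats a non-null interference_type as a stand-in for "blocked"; both the toy rows and that proxy are assumptions made for this example, not part of the published schema.

```python
import pandas as pd

# Toy rows standing in for the real dataset; all values are invented.
df = pd.DataFrame({
    "domain": ["example-vpn.test", "example-vpn.test", "example-news.test"],
    "url_source": ["emergency_addition", "emergency_addition", "citizen_lab_global"],
    "url_test_list_added": ["2022-09-21", "2022-09-21", "2019-05-01"],
    "measurement_start": [
        "2022-09-21T06:00:00Z",
        "2022-09-21T09:30:00Z",
        "2022-09-21T07:00:00Z",
    ],
    "interference_type": [None, "dns_tampering", None],
})


def first_block_delay(frame: pd.DataFrame) -> pd.Series:
    """Per domain: time from list addition to the first classified measurement."""
    emergency = frame[frame["url_source"] == "emergency_addition"].copy()
    emergency["added"] = pd.to_datetime(emergency["url_test_list_added"], utc=True)
    emergency["started"] = pd.to_datetime(emergency["measurement_start"], utc=True)
    # Assumption: a non-null interference_type is used as the "blocked" signal.
    blocked = emergency[emergency["interference_type"].notna()]
    first = blocked.groupby("domain")[["added", "started"]].min()
    return first["started"] - first["added"]


delays = first_block_delay(df)
print(delays)
```

Because url_test_list_added is a date rather than a timestamp, the delay is measured from midnight UTC of the addition day, so it is an upper bound on the true reaction time.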
Raw measurement layers
DNS
dns_query_time_ms # Time for the DNS resolver to respond (milliseconds)
dns_failure # Failure string if DNS failed: "NXDOMAIN" | "SERVFAIL" |
# "timeout" | "connection_refused" | null
dns_resolved_ips # JSON array of IPs returned by the probe's local resolver
dns_control_ips # JSON array of IPs returned by the control server's resolver
dns_ip_match # bool: probe IPs ∩ control IPs ≠ ∅
dns_ip_blockpage_asn # bool: any resolved IP belongs to a known block-page ASN
dns_consistency # "consistent" | "inconsistent" | "control_failure" | "unknown"
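The dns_ip_match definition (probe IPs ∩ control IPs ≠ ∅) can be reproduced from the two array fields. This is a minimal sketch that assumes dns_resolved_ips and dns_control_ips arrive as JSON-encoded strings; if your loader already decodes them into lists, skip the json.loads step.

```python
import json


def dns_ip_match(dns_resolved_ips: str, dns_control_ips: str) -> bool:
    """True when the probe and control resolvers share at least one IP."""
    probe_ips = set(json.loads(dns_resolved_ips))
    control_ips = set(json.loads(dns_control_ips))
    # A non-empty intersection means at least one address was seen by both.
    return bool(probe_ips & control_ips)


print(dns_ip_match('["93.184.216.34", "10.0.0.1"]', '["93.184.216.34"]'))
print(dns_ip_match('["10.10.34.35"]', '["93.184.216.34"]'))
```

An empty intersection is not proof of tampering on its own, since CDNs commonly return different addresses to different resolvers; that is presumably why the schema also carries dns_ip_blockpage_asn and dns_consistency.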
TCP
tcp_connect_time_ms # Time to establish TCP connection to port 443 (milliseconds)
tcp_failure # "connection_refused" | "connection_reset" | "timeout" | null
tcp_reachable_probe # bool: probe reached port 443
tcp_reachable_control # bool: control server reached port 443
TLS
tls_handshake_time_ms # Time to complete TLS handshake (milliseconds)
tls_failure # "connection_reset" | "eof" | "certificate_error" | null
tls_cert_valid # bool: probe's TLS certificate is valid for the target domain
tls_cert_matches_control # bool: probe and control received the same leaf certificate
tls_interception_detected # bool: probe cert chain anchors to a non-standard root CA
tls_version # TLS version negotiated: "TLSv1.3" | "TLSv1.2" | null
HTTP
http_status_code # HTTP response status code (int), or null on failure
http_failure # OONI failure string, e.g. "eof_error", or null
http_body_length # Bytes in HTTP response body
http_body_hash # SHA-256 of first 8KB of body (for block-page matching)
http_title # <title> tag content, or null
http_status_match # bool: probe status == control status
http_body_length_ratio # probe_body / control_body (1.0 = identical length)
http_blockpage_match # bool: body hash matches known block-page library
http_blockpage_country # Country whose block page was matched, or null
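To make the two body-derived fields concrete, here is a small sketch of how http_body_hash and http_body_length_ratio could be recomputed from raw bodies. It assumes "8KB" means 8192 bytes and that the hash is stored hex-encoded; neither detail is stated in the schema, and the function names are hypothetical.

```python
import hashlib

BODY_PREFIX_BYTES = 8 * 1024  # assumption: "8KB" is 8192 bytes


def body_hash(body: bytes) -> str:
    """SHA-256 over the first 8 KB of the body, hex-encoded (encoding assumed)."""
    return hashlib.sha256(body[:BODY_PREFIX_BYTES]).hexdigest()


def body_length_ratio(probe_len: int, control_len: int):
    """probe_body / control_body, or None when the control body is empty."""
    if control_len == 0:
        return None
    return probe_len / control_len


page = b"<html><title>Blocked</title></html>"
print(body_hash(page))
print(body_length_ratio(len(page), 1200))
```

Hashing only a prefix means two pages that share their first 8 KB hash identically, which suits short, static block pages but can collide on long pages with a common header.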
Control comparison
control_failure # bool: control server itself failed to reach the target
control_measurement_id # UUID of the paired control measurement
anomaly_score # float [0.0, 1.0]: composite divergence score from control comparison
anomaly_type_dns # bool: DNS comparison flagged this measurement
anomaly_type_tcp # bool: TCP comparison flagged
anomaly_type_tls # bool: TLS comparison flagged
anomaly_type_http # bool: HTTP body/status comparison flagged
When control_failure is true, none of the comparison fields are meaningful — the control couldn't reach the target either, so no comparison can be made. These rows are retained in the dataset but should be excluded from blocking analysis. Common causes: the target site is globally down, the control server's egress IP is blocked by the CDN, or the site is geo-restricted to the probe's country only.
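In practice this means dropping control_failure rows before touching any anomaly_type_* flag. The sketch below does that on a toy DataFrame and then counts how often each layer was flagged; it assumes the boolean columns load as real booleans with no nulls.

```python
import pandas as pd

# Toy rows; the second row has a failed control and must not be counted.
df = pd.DataFrame({
    "control_failure":   [False, True,  False],
    "anomaly_type_dns":  [True,  True,  False],
    "anomaly_type_tcp":  [False, False, False],
    "anomaly_type_tls":  [False, True,  True],
    "anomaly_type_http": [False, False, True],
})

layer_columns = [
    "anomaly_type_dns",
    "anomaly_type_tcp",
    "anomaly_type_tls",
    "anomaly_type_http",
]

# Keep only rows where the control comparison is meaningful.
usable = df[~df["control_failure"]]

# Summing booleans counts the True values per layer.
layer_counts = usable[layer_columns].sum()
print(layer_counts)
```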
ML classification output
classifier_version # String: model version used for classification (e.g. "v3.1.2")
interference_type # Primary class: "dns_tampering" | "tls_interference" |
# "http_blocking" | "bgp_withdrawal" | "throttling" | null
prob_dns_tampering # float [0.0, 1.0]: per-class probability
prob_tls_interference # float [0.0, 1.0]
prob_http_blocking # float [0.0, 1.0]
prob_bgp_withdrawal # float [0.0, 1.0]
prob_throttling # float [0.0, 1.0]
classifier_confidence # float [0.0, 1.0]: calibrated composite confidence score
tier # "anomaly" | "corroborated" | "verified_incident"
tier_updated_at # UTC ISO 8601 timestamp when this measurement last changed tier
interference_type is the argmax class when classifier_confidence exceeds 0.40. Below that threshold it is null: the measurement is still in the dataset as an anomaly tier row, but the classifier isn't confident enough to assign a primary class.
For ML training: use classifier_confidence ≥ 0.75 and tier = "verified_incident" as your high-quality label set. Use tier = "corroborated" for training data and tier = "verified_incident" for evaluation. Raw tier = "anomaly" rows are noisy and should be avoided as positive examples.
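The argmax-with-threshold rule can be written down in a few lines. This sketch takes a row as a plain dict and reads "exceeds 0.40" as a strict comparison; the function name and the tie-breaking behaviour (first class in schema order wins) are choices made for this example.

```python
CLASSES = (
    "dns_tampering",
    "tls_interference",
    "http_blocking",
    "bgp_withdrawal",
    "throttling",
)
CONFIDENCE_THRESHOLD = 0.40


def primary_class(row: dict):
    """Return the highest-probability class, or None at or below the threshold."""
    if row["classifier_confidence"] <= CONFIDENCE_THRESHOLD:
        return None
    # max() returns the first maximal element, so ties go to schema order.
    return max(CLASSES, key=lambda name: row[f"prob_{name}"])


example = {
    "classifier_confidence": 0.82,
    "prob_dns_tampering": 0.10,
    "prob_tls_interference": 0.70,
    "prob_http_blocking": 0.15,
    "prob_bgp_withdrawal": 0.03,
    "prob_throttling": 0.02,
}
print(primary_class(example))
```

This is useful mainly as a consistency check: recomputing the class from the prob_* columns and comparing it with the stored interference_type will surface rows classified under a different classifier_version or threshold.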
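One way to turn that guidance into a split is sketched below: corroborated rows become training positives, and verified incidents at or above 0.75 confidence are held out for evaluation. The toy DataFrame and the variable names are illustrative only.

```python
import pandas as pd

# Toy rows covering each tier; values are invented.
df = pd.DataFrame({
    "tier": ["anomaly", "corroborated", "corroborated",
             "verified_incident", "verified_incident"],
    "classifier_confidence": [0.45, 0.62, 0.80, 0.91, 0.70],
    "control_failure": [False, False, False, False, False],
})

# Rows with a failed control carry no usable comparison, so drop them first.
usable = df[~df["control_failure"]]

# Training positives: corroborated rows.
train = usable[usable["tier"] == "corroborated"]

# Evaluation set: verified incidents that also clear the 0.75 confidence bar.
evaluation = usable[
    (usable["tier"] == "verified_incident")
    & (usable["classifier_confidence"] >= 0.75)
]

print(len(train), len(evaluation))
```

Keeping the evaluation tier disjoint from the training tier avoids scoring a model on the same rows it learned from, at the cost of evaluating on a stricter label distribution than it was trained on.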
Cross-source corroboration
ooni_corroborated # bool: OONI has a matching anomaly for same domain/country/window
ooni_match_type # OONI blocking type if corroborated: "dns" | "tcp_ip" |
# "http-failure" | "http-diff" | null
cp_corroborated # bool: CensoredPlanet corroborated
cp_vp_count # int: number of CensoredPlanet vantage points that also flagged
ioda_corroborated # bool: IODA BGP/DNS signal present for same country/window
ioda_signal_type # "bgp_withdrawal" | "dns_anomaly" | "telescope" | null
corroboration_score # float [0.0, 1.0]: independence-weighted combined score
# (same as classifier_confidence for verified_incident rows)
BGP and outage signals
bgp_withdrawal # bool: BGP prefix withdrawal detected for target AS
bgp_outage_score # float [0.0, 1.0]: normalized deviation from 90-day baseline
bgp_prefixes_withdrawn # int: count of /24 prefixes withdrawn in country ASes
ioda_telescope_drop # bool: IODA darknet telescope shows traffic drop for country
country_outage_prob # float [0.0, 1.0]: probability the country is experiencing
# a full internet outage (used in shutdown forecasting)
bgp_outage_score is computed daily from the per-country BGP baseline (90-day trailing mean of prefix-weighted uptime). A score of 1.0 means every prefix in the country is withdrawn — a complete routing blackout. A score of 0.1 means a small fraction of prefixes are missing, which may indicate partial outage or route flapping rather than deliberate shutdown.
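The exact formula is not given here, but one reading consistent with the description is a relative drop from the trailing baseline, clamped to [0.0, 1.0]. The sketch below is that assumed formula, not Voidly's implementation; the function names and the uptime-fraction inputs are hypothetical.

```python
from statistics import fmean


def trailing_baseline(daily_uptime: list, window: int = 90) -> float:
    """Mean of the most recent `window` daily prefix-weighted uptime values."""
    return fmean(daily_uptime[-window:])


def bgp_outage_score(baseline_uptime: float, current_uptime: float) -> float:
    """Assumed form: relative drop below the baseline, clamped to [0, 1]."""
    if baseline_uptime <= 0:
        raise ValueError("baseline uptime must be positive")
    deviation = (baseline_uptime - current_uptime) / baseline_uptime
    return min(1.0, max(0.0, deviation))


history = [0.98] * 90          # invented 90-day uptime history
baseline = trailing_baseline(history)
print(bgp_outage_score(baseline, 0.0))   # every prefix withdrawn
print(bgp_outage_score(baseline, 0.98))  # at baseline
```

Under this form a day with uptime above the baseline scores 0.0 rather than going negative, which matches the stated [0.0, 1.0] range.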
Derived aggregates
incident_id # UUID of the Verified Incident this measurement belongs to (if any)
domain_block_rate_7d # float: fraction of measurements for this domain in this country
# that show blocking over the past 7 days (useful for trend analysis)
shutdown_forecast_7d # float [0.0, 1.0]: 7-day probability of country-level shutdown
# from the forecasting model, as of measurement date
political_calendar_flag # bool: measurement date is within 14 days of an election,
# protest, or political event in the probe's country
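For readers who want to sanity-check domain_block_rate_7d against raw rows, the sketch below recomputes a 7-day block fraction per domain and country. It assumes a non-null interference_type counts as "shows blocking" and uses a half-open window ending at an as_of timestamp; the published field may use a different definition.

```python
import pandas as pd

# Toy rows; values are invented.
df = pd.DataFrame({
    "domain": ["a.test", "a.test", "a.test", "a.test", "b.test"],
    "probe_cc": ["IR", "IR", "IR", "IR", "IR"],
    "measurement_start": [
        "2025-03-01T00:00:00Z",  # falls just outside the window below
        "2025-03-05T00:00:00Z",
        "2025-03-06T00:00:00Z",
        "2025-03-07T12:00:00Z",
        "2025-03-07T00:00:00Z",
    ],
    "interference_type": ["dns_tampering", None, "dns_tampering", None, None],
})


def block_rate_7d(frame: pd.DataFrame, as_of: str) -> pd.Series:
    """Fraction of rows per (domain, probe_cc) classified in the last 7 days."""
    end = pd.Timestamp(as_of)  # expects a timezone-aware string such as "...Z"
    started = pd.to_datetime(frame["measurement_start"], utc=True)
    window = frame[(started > end - pd.Timedelta(days=7)) & (started <= end)]
    # Assumption: any non-null interference_type counts as blocking.
    blocked = window["interference_type"].notna()
    return blocked.groupby([window["domain"], window["probe_cc"]]).mean()


rates = block_rate_7d(df, "2025-03-08T00:00:00Z")
print(rates)
```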
Filtering recipes
For a journalist covering a specific country:
# All verified incidents in Iran in 2025
df[
(df['probe_cc'] == 'IR') &
(df['tier'] == 'verified_incident') &
(df['measurement_start'] >= '2025-01-01')
]
For an ML researcher building a DNS tampering classifier:
# High-confidence DNS tampering rows for training
train = df[
(df['interference_type'] == 'dns_tampering') &
(df['tier'].isin(['corroborated', 'verified_incident'])) &
(df['control_failure'] == False) &
(df['classifier_confidence'] >= 0.70)
]
For an infrastructure team tracking BGP-level shutdowns:
# Full routing outages (BGP withdrawal + IODA telescope)
# measurement_start is an ISO 8601 string, so parse it before using .dt
# (assumes pandas is imported as pd)
day = pd.to_datetime(df['measurement_start'], utc=True).dt.date
outages = df[
(df['bgp_withdrawal'] == True) &
(df['ioda_telescope_drop'] == True) &
(df['bgp_outage_score'] >= 0.5)
].groupby(['probe_cc', day])['bgp_outage_score'].max()
For how the control comparison fields in this schema are generated: The Voidly control server: how we tell censorship from a bad network →
For how the ML classifier produces the interference_type and confidence fields: The Voidly anomaly classifier: five interference classes and why we optimize for recall →
For how the tier field moves from anomaly to verified incident: From anomaly to verified incident: the Voidly confidence tier system →
For how the bgp_outage_score and BGP fields are computed: BGP routing signals and internet shutdown detection: how Voidly uses IODA data →
For how the incident_id field is assigned — four-tuple clustering key, 6-hour gap rule, and lifecycle states: Incident clustering and deduplication: how Voidly avoids counting the same censorship event twice →
For querying these fields via REST API — endpoints, filtering, pagination, and code samples: The Voidly REST API: querying the global censorship index in real time →
For the block page fingerprint library behind the blockpage_match and blockpage_fp_id fields: Voidly's block page fingerprint library: detecting censorship signatures across 2,300+ known pages →
For how raw probe bytes become the records this schema describes — the full ingest path from QUIC upload to TimescaleDB row: Voidly's probe-to-dataset ingest pipeline: normalization, quality filtering, and TimescaleDB indexing →
For the continuous aggregate layer that pre-aggregates these records into per-country and per-ASN statistics for sub-10ms queries: Voidly's TimescaleDB continuous aggregates: pre-aggregating 2.2B probe measurements for fast queries →