
The Voidly Measurement Dataset: Field-by-Field Schema Reference

August 20, 2025· 9 min read· AI Analytics

Censorship · Voidly · Data engineering · Open data

The Voidly measurement dataset is published on HuggingFace under CC BY 4.0. It contains one row per measurement: each time a probe tests a URL, the result is stored as a row, along with the control comparison, classifier output, and corroboration status. The dataset holds 2.2B+ measurements and grows by roughly 500k rows per day, so understanding what each field means is essential before filtering or modeling.

This is the field-by-field reference. We group fields into eight categories: probe identity, measurement target, the raw measurement layers, the control comparison, the ML classification output, cross-source corroboration, BGP and outage signals, and the derived aggregates used for incident reporting and forecasting.

Probe identity

measurement_id      # UUID, globally unique per measurement run
probe_id            # Opaque ID for the probe; stable per installation, not linked to identity
probe_cc            # ISO 3166-1 alpha-2 country code of the probe's physical location
probe_asn           # Autonomous System Number (numeric) of the probe's ISP
probe_asn_name      # ISP name string (from RIPE NCC Routing Registry)
probe_type          # "residential" | "mobile" | "university" | "data_center"
measurement_start   # UTC ISO 8601 timestamp, probe-reported
measurement_end     # UTC ISO 8601 timestamp, probe-reported
collector_received  # UTC ISO 8601 timestamp when the collector received the measurement

probe_id is stable per device installation but is not linked to any operator identity. Operators can reset it at any time from the probe application's settings. We use it for anomaly attribution (detecting if a single probe is producing systematically biased results) but never publish probe IDs; the published dataset uses probe_cc + probe_asn as the localization fields.

probe_type affects how the classifier weights the measurement. A residential broadband probe is the gold standard. A data center probe is less representative and carries a lower quality weight in the confidence score aggregation.
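
As a rough illustration of how a per-type quality weight can change an aggregate, the sketch below computes a weighted anomaly rate. The weight values and the function name are hypothetical; the actual weights used in the confidence score aggregation are not given in this reference.

```python
# Hypothetical quality weights per probe_type (illustrative values only;
# the real weights are not part of this schema reference).
QUALITY_WEIGHT = {
    "residential": 1.0,
    "mobile": 0.9,
    "university": 0.7,
    "data_center": 0.5,
}


def weighted_anomaly_rate(measurements):
    """Weighted share of anomalous measurements.

    `measurements` is an iterable of (probe_type, is_anomalous) pairs.
    Unknown probe types get weight 0 and are effectively ignored.
    """
    total = 0.0
    flagged = 0.0
    for probe_type, is_anomalous in measurements:
        weight = QUALITY_WEIGHT.get(probe_type, 0.0)
        total += weight
        if is_anomalous:
            flagged += weight
    return flagged / total if total else 0.0


sample = [("residential", True), ("data_center", False), ("data_center", False)]
rate = weighted_anomaly_rate(sample)
```

With these made-up weights, one anomalous residential probe against two clean data center probes yields a rate of 0.5, compared with 1/3 for an unweighted mean.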

Measurement target

url                 # Full URL tested, including scheme and path
domain              # Extracted domain (used for grouping and blocking pattern analysis)
url_category        # OONI category code: NEWS | COMM | HUMR | POLR | LGBT | SRCH |
                    #   PORN | ALDR | GAME | MMED | FILE | GRP (see test list methodology)
url_source          # "citizen_lab_global" | "country_supplemental" | "emergency_addition"
url_test_list_added # ISO 8601 date when the URL was added to the test list

The 12 OONI category codes reflect the kinds of content that governments most frequently block. HUMR (human rights) and POLR (political opposition) are the highest-risk categories in the dataset: blocking events in these categories are more likely to reflect deliberate government censorship than commercial filtering or age-verification enforcement.

url_source identifies how the URL entered the test list. emergency_addition URLs (like Telegram and VPN provider sites added during the Iran 2022 protests) are valuable for tracking reactive censorship, meaning blocks that appear within hours of a URL being added to monitor.
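
One way to quantify reactive censorship is the lag between url_test_list_added and the first blocked measurement for that URL. The helper below is a minimal sketch: the function names are hypothetical, it assumes you have already decided which rows count as blocked, and the sample timestamps are invented for illustration.

```python
from datetime import date, datetime, time, timezone


def parse_utc(ts):
    """Parse an ISO 8601 timestamp, treating a missing offset as UTC."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def hours_until_first_block(list_added, blocked_starts):
    """Hours from url_test_list_added (a date) to the first blocked measurement.

    `blocked_starts` holds measurement_start values of rows already judged
    to be blocked. Returns None if none fall on or after the add date.
    """
    added = datetime.combine(date.fromisoformat(list_added), time.min, tzinfo=timezone.utc)
    later = [t for t in map(parse_utc, blocked_starts) if t >= added]
    if not later:
        return None
    return (min(later) - added).total_seconds() / 3600.0


# Invented example values, not taken from the dataset.
lag = hours_until_first_block(
    "2022-09-21",
    ["2022-09-22T01:00:00Z", "2022-09-21T09:30:00Z"],
)
```

Because url_test_list_added is a date rather than a timestamp, the lag is measured from midnight UTC and should be read as an upper bound.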

Raw measurement layers

DNS

dns_query_time_ms   # Time for the DNS resolver to respond (milliseconds)
dns_failure         # Failure string if DNS failed: "NXDOMAIN" | "SERVFAIL" |
                    #   "timeout" | "connection_refused" | null
dns_resolved_ips    # JSON array of IPs returned by the probe's local resolver
dns_control_ips     # JSON array of IPs returned by the control server's resolver
dns_ip_match        # bool: probe IPs ∩ control IPs ≠ ∅
dns_ip_blockpage_asn # bool: any resolved IP belongs to a known block-page ASN
dns_consistency     # "consistent" | "inconsistent" | "control_failure" | "unknown"
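
The dns_ip_match definition above is a set intersection, which can be reproduced from the two IP arrays. This sketch assumes both fields arrive as JSON-encoded strings (or null); the function name mirrors the field but is otherwise illustrative.

```python
import json


def dns_ip_match(dns_resolved_ips, dns_control_ips):
    """True if the probe and control resolvers share at least one IP.

    Both arguments are assumed to be JSON array strings; None or an
    empty string is treated as an empty array.
    """
    probe = set(json.loads(dns_resolved_ips or "[]"))
    control = set(json.loads(dns_control_ips or "[]"))
    return bool(probe & control)
```

The comparison is on raw strings, so differently formatted IPv6 addresses would need normalizing (for example with the standard `ipaddress` module) before intersecting.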

TCP

tcp_connect_time_ms # Time to establish TCP connection to port 443 (milliseconds)
tcp_failure         # "connection_refused" | "connection_reset" | "timeout" | null
tcp_reachable_probe # bool: probe reached port 443
tcp_reachable_control # bool: control server reached port 443

TLS

tls_handshake_time_ms # Time to complete TLS handshake (milliseconds)
tls_failure         # "connection_reset" | "eof" | "certificate_error" | null
tls_cert_valid      # bool: certificate presented to the probe is valid for the target domain
tls_cert_matches_control # bool: probe and control received the same leaf certificate
tls_interception_detected # bool: probe cert chain anchors to a non-standard root CA
tls_version         # TLS version negotiated: "TLSv1.3" | "TLSv1.2" | null

HTTP

http_status_code    # HTTP response status code (int), or null on failure
http_failure        # OONI failure string, e.g. "eof_error", or null
http_body_length    # Bytes in HTTP response body
http_body_hash      # SHA-256 of first 8KB of body (for block-page matching)
http_title          # <title> tag content, or null
http_status_match   # bool: probe status == control status
http_body_length_ratio # probe_body / control_body (1.0 = identical length)
http_blockpage_match # bool: body hash matches known block-page library
http_blockpage_country # Country whose block page was matched, or null
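
To make http_body_hash and http_body_length_ratio concrete, here is a small sketch of how such values could be derived from a response body. It assumes "8KB" means 8192 bytes and that a missing or empty control body yields no ratio; both are assumptions, and the function names are illustrative.

```python
import hashlib

HASH_PREFIX_BYTES = 8 * 1024  # assumption: "8KB" is 8192 bytes


def body_hash(body: bytes) -> str:
    """SHA-256 hex digest of the first 8KB of the response body."""
    return hashlib.sha256(body[:HASH_PREFIX_BYTES]).hexdigest()


def body_length_ratio(probe_len, control_len):
    """probe_body / control_body, or None when the control body is empty."""
    if not control_len:
        return None
    return probe_len / control_len
```

Hashing only a fixed prefix means two bodies that differ after the first 8KB produce the same hash, which suits short block pages but cannot distinguish long pages that share a header.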

Control comparison

control_failure     # bool: control server itself failed to reach the target
control_measurement_id # UUID of the paired control measurement
anomaly_score       # float [0.0, 1.0]: composite divergence score from control comparison
anomaly_type_dns    # bool: DNS comparison flagged this measurement
anomaly_type_tcp    # bool: TCP comparison flagged
anomaly_type_tls    # bool: TLS comparison flagged
anomaly_type_http   # bool: HTTP body/status comparison flagged

When control_failure is true, none of the comparison fields are meaningful — the control couldn't reach the target either, so no comparison can be made. These rows are retained in the dataset but should be excluded from blocking analysis. Common causes: the target site is globally down, the control server's egress IP is blocked by the CDN, or the site is geo-restricted to the probe's country only.
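
In pandas, excluding these rows is a one-line filter. The frame below is a made-up three-row example, and the sketch assumes control_failure is loaded as a plain boolean column.

```python
import pandas as pd

# Tiny invented sample; real rows carry the full schema.
df = pd.DataFrame({
    "measurement_id": ["a", "b", "c"],
    "control_failure": [False, True, False],
    "anomaly_score": [0.82, 0.91, 0.05],
})

# Keep only rows where the control reached the target, so the
# comparison fields (anomaly_score, anomaly_type_*) are meaningful.
usable = df[~df["control_failure"]]
```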

ML classification output

classifier_version  # String: model version used for classification (e.g. "v3.1.2")
interference_type   # Primary class: "dns_tampering" | "tls_interference" |
                    #   "http_blocking" | "bgp_withdrawal" | "throttling" | null
prob_dns_tampering  # float [0.0, 1.0]: per-class probability
prob_tls_interference # float [0.0, 1.0]
prob_http_blocking  # float [0.0, 1.0]
prob_bgp_withdrawal # float [0.0, 1.0]
prob_throttling     # float [0.0, 1.0]
classifier_confidence # float [0.0, 1.0]: calibrated composite confidence score
tier                # "anomaly" | "corroborated" | "verified_incident"
tier_updated_at     # UTC ISO 8601 timestamp when this measurement last changed tier

interference_type is the argmax class when classifier_confidence exceeds 0.40. Below that threshold it's null: the measurement is still in the dataset as an anomaly tier row, but the classifier isn't confident enough to assign a primary class.
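
The thresholded argmax can be reproduced from the per-class probability columns. This is a sketch of the rule as described, not the classifier's own code; the function name is illustrative and "exceeds" is read as strictly greater than 0.40.

```python
CLASSES = (
    "dns_tampering",
    "tls_interference",
    "http_blocking",
    "bgp_withdrawal",
    "throttling",
)
CONFIDENCE_THRESHOLD = 0.40


def primary_class(row):
    """Return the argmax class, or None when confidence is at or below 0.40.

    `row` is a mapping with classifier_confidence and the prob_* fields.
    """
    if row["classifier_confidence"] <= CONFIDENCE_THRESHOLD:
        return None
    return max(CLASSES, key=lambda name: row[f"prob_{name}"])


example = {
    "classifier_confidence": 0.62,
    "prob_dns_tampering": 0.70,
    "prob_tls_interference": 0.10,
    "prob_http_blocking": 0.10,
    "prob_bgp_withdrawal": 0.05,
    "prob_throttling": 0.05,
}
label = primary_class(example)
```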

For ML training: use classifier_confidence ≥ 0.75 and tier = "verified_incident" as your high-quality label set. Use tier = "corroborated" for training data and tier = "verified_incident" for evaluation. Raw tier = "anomaly" rows are noisy and should be avoided as positive examples.
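
One possible reading of this guidance as a pandas split is sketched below: corroborated rows for training, high-confidence verified incidents for evaluation, and anomaly rows left out. The sample frame is invented, and dropping control_failure rows first is an added assumption carried over from the control comparison section.

```python
import pandas as pd

# Invented sample rows for illustration.
df = pd.DataFrame({
    "tier": ["anomaly", "corroborated", "verified_incident", "verified_incident"],
    "classifier_confidence": [0.45, 0.80, 0.90, 0.60],
    "control_failure": [False, False, False, False],
})

clean = df[~df["control_failure"]]

# Training positives: corroborated rows only (raw anomaly rows are excluded).
train = clean[clean["tier"] == "corroborated"]

# Evaluation: the high-quality label set.
evaluation = clean[
    (clean["tier"] == "verified_incident")
    & (clean["classifier_confidence"] >= 0.75)
]
```

Keeping the tiers separate means the evaluation rows never appear in training, which avoids scoring the model on rows it has already seen.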

Cross-source corroboration

ooni_corroborated   # bool: OONI has a matching anomaly for same domain/country/window
ooni_match_type     # OONI blocking type if corroborated: "dns" | "tcp_ip" |
                    #   "http-failure" | "http-diff" | null
cp_corroborated     # bool: CensoredPlanet corroborated
cp_vp_count         # int: number of CensoredPlanet vantage points that also flagged
ioda_corroborated   # bool: IODA BGP/DNS signal present for same country/window
ioda_signal_type    # "bgp_withdrawal" | "dns_anomaly" | "telescope" | null
corroboration_score # float [0.0, 1.0]: independence-weighted combined score
                    #   (same as classifier_confidence for verified_incident rows)
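
The formula behind corroboration_score is not given here, but the three boolean flags are enough for a simple independent-source count. The helper below is a hypothetical convenience, not part of the dataset or its tooling.

```python
# Maps a display name to the boolean corroboration field in the schema.
SOURCE_FLAGS = {
    "OONI": "ooni_corroborated",
    "CensoredPlanet": "cp_corroborated",
    "IODA": "ioda_corroborated",
}


def corroborating_sources(row):
    """List the external sources whose corroboration flag is set on a row."""
    return [name for name, field in SOURCE_FLAGS.items() if row.get(field)]


row = {"ooni_corroborated": True, "cp_corroborated": False, "ioda_corroborated": True}
sources = corroborating_sources(row)
```

Filtering on `len(corroborating_sources(row)) >= 2` is one way to require agreement from at least two external sources without depending on the combined score.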

BGP and outage signals

bgp_withdrawal      # bool: BGP prefix withdrawal detected for target AS
bgp_outage_score    # float [0.0, 1.0]: normalized deviation from 90-day baseline
bgp_prefixes_withdrawn # int: count of /24 prefixes withdrawn in country ASes
ioda_telescope_drop # bool: IODA darknet telescope shows traffic drop for country
country_outage_prob # float [0.0, 1.0]: probability the country is experiencing
                    #   a full internet outage (used in shutdown forecasting)

bgp_outage_score is computed daily from the per-country BGP baseline (90-day trailing mean of prefix-weighted uptime). A score of 1.0 means every prefix in the country is withdrawn — a complete routing blackout. A score of 0.1 means a small fraction of prefixes are missing, which may indicate partial outage or route flapping rather than deliberate shutdown.
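
A score with these endpoints can be sketched as the relative drop of today's prefix-weighted uptime below the trailing baseline, clipped to [0, 1]. The exact normalization is not spelled out in this reference, so the formula and function names below are assumptions.

```python
def trailing_baseline(daily_uptime, window=90):
    """Mean of the most recent `window` daily prefix-weighted uptime values."""
    recent = daily_uptime[-window:]
    return sum(recent) / len(recent) if recent else 0.0


def bgp_outage_score(current_uptime, baseline_uptime):
    """Relative drop below baseline, clipped to [0.0, 1.0] (assumed formula).

    Uptime values are fractions in [0, 1]; a baseline of 0 gives 0.0
    because no deviation can be measured.
    """
    if baseline_uptime <= 0:
        return 0.0
    drop = (baseline_uptime - current_uptime) / baseline_uptime
    return min(1.0, max(0.0, drop))


baseline = trailing_baseline([1.0] * 100)
partial = bgp_outage_score(0.9, baseline)
blackout = bgp_outage_score(0.0, 0.98)
```

Under this assumed formula a day with 90% of the baseline uptime scores about 0.1, and zero uptime scores 1.0 regardless of the baseline level.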

Derived aggregates

incident_id         # UUID of the Verified Incident this measurement belongs to (if any)
domain_block_rate_7d # float: fraction of measurements for this domain in this country
                    #   that show blocking over the past 7 days (useful for trend analysis)
shutdown_forecast_7d # float [0.0, 1.0]: 7-day probability of country-level shutdown
                    #   from the forecasting model, as of measurement date
political_calendar_flag # bool: measurement date is within 14 days of an election,
                    #   protest, or political event in the probe's country
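
A trailing 7-day block rate like domain_block_rate_7d can be approximated from raw rows as follows. This is a sketch: which rows "show blocking" is left to the caller as a boolean, and the window boundaries (exclusive start, inclusive end) are an assumption.

```python
from datetime import datetime, timedelta


def block_rate_7d(rows, as_of):
    """Fraction of blocked measurements in the 7 days up to `as_of`.

    `rows` is an iterable of (measurement_start, blocked) pairs for one
    domain in one country. Returns None if the window has no measurements.
    """
    window_start = as_of - timedelta(days=7)
    in_window = [blocked for ts, blocked in rows if window_start < ts <= as_of]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)


# Invented sample: the first row falls outside the window.
rows = [
    (datetime(2025, 8, 1), True),
    (datetime(2025, 8, 15), True),
    (datetime(2025, 8, 18), False),
    (datetime(2025, 8, 19), True),
    (datetime(2025, 8, 20), False),
]
rate = block_rate_7d(rows, as_of=datetime(2025, 8, 20))
```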

Filtering recipes

For a journalist covering a specific country:

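# All verified incidents in Iran in 2025
# (assumes measurement_start is an ISO 8601 UTC string, which sorts chronologically)
df[
    (df['probe_cc'] == 'IR') &
    (df['tier'] == 'verified_incident') &
    (df['measurement_start'] >= '2025-01-01') &
    (df['measurement_start'] < '2026-01-01')
]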

For an ML researcher building a DNS tampering classifier:

# High-confidence DNS tampering rows for training
train = df[
    (df['interference_type'] == 'dns_tampering') &
    (df['tier'].isin(['corroborated', 'verified_incident'])) &
    (df['control_failure'] == False) &
    (df['classifier_confidence'] >= 0.70)
]

For an infrastructure team tracking BGP-level shutdowns:

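import pandas as pd

# Full routing outages (BGP withdrawal + IODA telescope)
mask = (
    (df['bgp_withdrawal'] == True) &
    (df['ioda_telescope_drop'] == True) &
    (df['bgp_outage_score'] >= 0.5)
)
# Parse the timestamps first: .dt only works on datetime columns, not strings
day = pd.to_datetime(df.loc[mask, 'measurement_start'], utc=True).dt.date
outages = df[mask].groupby(['probe_cc', day])['bgp_outage_score'].max()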

For how the control comparison fields in this schema are generated: The Voidly control server: how we tell censorship from a bad network →

For how the ML classifier produces the interference_type and confidence fields: The Voidly anomaly classifier: five interference classes and why we optimize for recall →

For how the tier field moves from anomaly to verified incident: From anomaly to verified incident: the Voidly confidence tier system →

For how the bgp_outage_score and BGP fields are computed: BGP routing signals and internet shutdown detection: how Voidly uses IODA data →

For how the incident_id field is assigned — four-tuple clustering key, 6-hour gap rule, and lifecycle states: Incident clustering and deduplication: how Voidly avoids counting the same censorship event twice →

For querying these fields via REST API — endpoints, filtering, pagination, and code samples: The Voidly REST API: querying the global censorship index in real time →

For the block page fingerprint library behind the blockpage_match and blockpage_fp_id fields: Voidly's block page fingerprint library: detecting censorship signatures across 2,300+ known pages →

For how raw probe bytes become the records this schema describes — the full ingest path from QUIC upload to TimescaleDB row: Voidly's probe-to-dataset ingest pipeline: normalization, quality filtering, and TimescaleDB indexing →

For the continuous aggregate layer that pre-aggregates these records into per-country and per-ASN statistics for sub-10ms queries: Voidly's TimescaleDB continuous aggregates: pre-aggregating 2.2B probe measurements for fast queries →