Developer landing · CC BY 4.0

Using the Voidly censorship dataset

Six years of measurement, 200 countries, 2.2B+ data points, cross-verified against OONI, CensoredPlanet, and IODA. Pick the access surface that fits your workflow.

Access surfaces

REST API

Documented JSON endpoints. Auth optional for read; rate-limited.

voidly.ai/api-docs

HuggingFace snapshots

CSV bulk downloads. global-censorship-index + ooni-censorship-historical (1.66M+ downloads).

huggingface.co/emperor-mew

MCP server (AI agents)

83 tools for Claude, GPT, and agent frameworks to query the dataset in natural language.

voidly-ai/mcp-server

Live dashboard

Map view, active blocking events, country drilldown, ML-powered alerts, 7-day forecast.

voidly.ai

Companion dataset: Verboten (banned books)

Voidly's banned-books index ships as plain static JSON — no key, no rate limit, no server. Point an agent at the manifest and resolve any country or title: 19,283 banned or restricted books across 119 countries, every ban dated and source-cited.

/verboten/api/index.json
Manifest — dataset stats, country index, endpoint map.
/verboten/api/country/{ISO}.json
Per-country summary for all 119 countries (ISO 3166-1 alpha-2).
/verboten/api/book/{slug}.json
Full source-cited record for the 200 most-banned titles.
/verboten/search-index.json
Every title mapped to the countries that ban it — the lookup index.

Browse Verboten · CC BY 4.0, built on the banned-books.org Open Censorship Core.
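
Because the Verboten endpoints are plain static files, a client needs little more than URL construction and a JSON fetch. The following is a minimal sketch, assuming the paths above are served from https://voidly.ai (the host is an assumption; only the paths are listed here) and that country codes are uppercased in the URL. The helper names are hypothetical, and nothing is assumed about the fields inside each JSON document.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the Verboten paths listed above are served from this host.
BASE_URL = "https://voidly.ai"


def manifest_url() -> str:
    """URL of the manifest (dataset stats, country index, endpoint map)."""
    return f"{BASE_URL}/verboten/api/index.json"


def country_url(iso: str) -> str:
    """URL of a per-country summary, keyed by ISO 3166-1 alpha-2 code."""
    code = iso.strip().upper()  # assumption: codes are uppercase in the path
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"expected an ISO 3166-1 alpha-2 code, got {iso!r}")
    return f"{BASE_URL}/verboten/api/country/{code}.json"


def book_url(slug: str) -> str:
    """URL of the full record for one of the most-banned titles."""
    return f"{BASE_URL}/verboten/api/book/{quote(slug, safe='')}.json"


def fetch_json(url: str, timeout: float = 30.0):
    """Download and decode one static JSON document."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Example (performs a network request, so it is left commented out):
# manifest = fetch_json(manifest_url())
# iran = fetch_json(country_url("IR"))
```

Validating the country code before building the URL keeps a typo from turning into an opaque 404 on a static host.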

Companion accountability datasets

The Voidly accountability records (SpyLedger, DarkRegister, Sanctions Programs, Right to Information, Data Protection, OrganWatch, Foreign-Held U.S. Farmland, and the other ledgers listed below) ship as plain static JSON: no key, no rate limit. All are sourced from primary public records and carry no personal data.

/spyledger/index.json
SpyLedger — 26 surveillance vendors, their products, and source-linked government-designation status (BIS/OFAC/FCC/NS-CMIC).
/darkregister/index.json
DarkRegister — public-access status of 46 beneficial-ownership registers, with legal basis and source per jurisdiction.
/darkregister/gleif-coverage.json
GLEIF open-ownership coverage — CC0 legal-entity counts per jurisdiction (entities, not people).
/sanctions-programs/index.json
Sanctions Programs — 41 US (OFAC) sanctions programs with the Executive Order/statute, target, scope, and status behind each program code (RUSSIA-EO14024, SDGT, CMIC-EO13959…). Programs, not designated persons.
/rti-laws/index.json
Right to Information — national access-to-information / FOI laws across 61 countries (name, year adopted, oversight body, scope), from official legal sources. The law, not the people — no personal data; not a ranking.
/data-protection/index.json
Data Protection — national personal-data-protection / privacy laws across 61 countries (GDPR, LGPD, PIPL, POPIA…; name, year, supervisory authority, scope). The companion to Right to Information. The law, not data subjects — no personal data; not a ranking.
/genetic-privacy/index.json
The Genetic Privacy Ledger — the 23andMe custody chain (42 documented events: breach, regulators, bankruptcy sale) with three-register discipline, the 12-state DTC genetic-privacy statute annex, and the federal gap map. Statutes, courts, and regulators only — no personal data.
/organwatch/index.json
OrganWatch — US organ-procurement accountability: the federally designated OPOs with CMS performance tiers, oversight timeline, and documented findings. Institution-level facts, source-linked — no personal data.
/foreign-farmland/index.json
Foreign-Held U.S. Farmland — the USDA AFIDA register aggregated: 46M acres by investor country, state, land use, and interest type across 15 years, plus documented register defects, state restriction laws, and the Shell Map's 38 verified ownership chains (structured under shellMap). Aggregates only — no personal data.
/section117/index.json
Section 117 Ledger — foreign gifts and contracts disclosed by US universities: $62B across 117k transactions at 528 institutions (1981–2025), by institution and source country, with named foreign-government sources and the year trend. Aggregate-only — no personal data.
/grid-owners/index.json
GridOwners — ownership of US generating capacity: 27,768 operable generators / 1.38 TW resolved to entity-level owners from EIA-860, with capacity by owner type, the top-100 owners, and state and technology tables. Entity-level — no personal data.
/detention/index.json
The Detention Ledger — who runs ICE's 203 detention facilities: 66,161 people held on an average day, with capacity, inspection status, guaranteed-minimum bed arithmetic, the FY19–FY26 panel, and an evidence-tiered operator spine (federal award / SEC filing / no operator in federal records). Entity-level — no personal data.
/287g/index.json
The 287(g) Wave — every signed agreement deputizing local police into immigration enforcement: 2,123 agreements across 1,804 agencies, with model split (task force / warrant service / jail enforcement), the month-by-month signing wave, county FIPS joins, and the source file's own defect ledger. Agency-level — no personal data.
/bop/index.json
The BOP Ledger — every Federal Bureau of Prisons institution from BOP's own weekly feeds: 133 facilities holding 138,553 people with security level, type, region, and population, the 155-contract halfway-house layer with named operators, the FY1980-present system series, and the private-prison reappearance watch (currently zero). Facility/system level — no personal data.

Browse SpyLedger · Browse DarkRegister · Browse Sanctions Programs · CC BY 4.0.

Discover every accountability dataset in one call: /voidly/datasets.json, a machine-readable manifest of the accountability datasets (plus the censorship index) with each endpoint, its latest counts, and license.

Quick start

Same query — recent verified incidents in Iran since 2026-01-01 — in four ecosystems.

curl

# Recent verified incidents in Iran
curl -s 'https://voidly.ai/api/v1/incidents?country=IR&since=2026-01-01' \
  -H 'Accept: application/json' | jq '.[] | {date,domain,classification}'

Python

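import httpx

# Query incidents for Iran reported on or after 2026-01-01.
r = httpx.get(
    "https://voidly.ai/api/v1/incidents",
    params={"country": "IR", "since": "2026-01-01"},
    timeout=30,
)
r.raise_for_status()  # fail loudly on HTTP errors (e.g. when rate-limited)
for incident in r.json():
    print(incident["date"], incident["domain"], incident["classification"])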

JavaScript

const url = new URL('https://voidly.ai/api/v1/incidents');
url.searchParams.set('country', 'IR');
url.searchParams.set('since', '2026-01-01');

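const res = await fetch(url, { headers: { Accept: 'application/json' } });
// fetch() does not reject on HTTP error statuses, so check explicitly.
if (!res.ok) throw new Error(`Request failed: HTTP ${res.status}`);
const incidents = await res.json();
for (const i of incidents) console.log(i.date, i.domain, i.classification);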

R

library(httr2)
library(jsonlite)

resp <- request("https://voidly.ai/api/v1/incidents") |>
  req_url_query(country = "IR", since = "2026-01-01") |>
  req_perform()

incidents <- fromJSON(resp_body_string(resp))
print(head(incidents))

Response shape

[
  {
    "date":           "2026-04-12T03:21:14Z",
    "country":        "IR",
    "domain":         "signal.org",
    "classification": "TLS_INTERFERENCE",
    "confidence":     "verified",
    "sources":        ["voidly", "ooni"],
    "asn":            58224,
    "block_type":     "sni_inspection",
    "evidence_url":   "https://voidly.ai/incidents/2026-04-12-IR-signal"
  }
  // ...
]
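
For typed access to these records, the response can be parsed into a small dataclass. This is a sketch under the assumption that every record carries the same nine fields as the sample above; the `Incident` class and `parse_incident` helper are hypothetical names, not part of any Voidly client library.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Incident:
    date: datetime
    country: str
    domain: str
    classification: str
    confidence: str
    sources: List[str]
    asn: int
    block_type: str
    evidence_url: str


def parse_incident(raw: dict) -> Incident:
    """Convert one JSON record into an Incident (raises KeyError if a field is missing)."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11,
    # so normalize it to an explicit UTC offset first.
    date = datetime.fromisoformat(raw["date"].replace("Z", "+00:00"))
    return Incident(
        date=date,
        country=raw["country"],
        domain=raw["domain"],
        classification=raw["classification"],
        confidence=raw["confidence"],
        sources=list(raw["sources"]),
        asn=int(raw["asn"]),
        block_type=raw["block_type"],
        evidence_url=raw["evidence_url"],
    )


# The sample record from the response shape above.
SAMPLE = """[
  {
    "date": "2026-04-12T03:21:14Z",
    "country": "IR",
    "domain": "signal.org",
    "classification": "TLS_INTERFERENCE",
    "confidence": "verified",
    "sources": ["voidly", "ooni"],
    "asn": 58224,
    "block_type": "sni_inspection",
    "evidence_url": "https://voidly.ai/incidents/2026-04-12-IR-signal"
  }
]"""

incidents = [parse_incident(r) for r in json.loads(SAMPLE)]
print(incidents[0].domain, incidents[0].date.isoformat())
```

If some fields turn out to be optional in practice, `raw.get(...)` with a default would be the safer choice for those fields.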

Variables measured

DNS_TAMPERING: Resolver returns the wrong IP, or refuses recursion.
TLS_INTERFERENCE: Handshake interrupted, certificate altered, SNI inspected.
HTTP_BLOCK: Block page, content rewrite, response throttled to zero.
BGP_WITHDRAWAL: Network disappears from the global routing table.
THROTTLING: Bandwidth deliberately collapsed for a specific service.
SHUTDOWN: National or regional connectivity dropped entirely.
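
The six labels above map naturally onto an enumeration, which makes it easy to tally incidents by class while tolerating labels a client does not yet know about. The sketch below is illustrative: the enum, the helper names, the "UNKNOWN" bucket, and the sample records are all hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Optional


class Classification(str, Enum):
    DNS_TAMPERING = "DNS_TAMPERING"
    TLS_INTERFERENCE = "TLS_INTERFERENCE"
    HTTP_BLOCK = "HTTP_BLOCK"
    BGP_WITHDRAWAL = "BGP_WITHDRAWAL"
    THROTTLING = "THROTTLING"
    SHUTDOWN = "SHUTDOWN"


def to_classification(label: str) -> Optional[Classification]:
    """Return the matching enum member, or None for an unrecognized label."""
    try:
        return Classification(label)
    except ValueError:
        return None


def tally(records: Iterable[dict]) -> Counter:
    """Count records per classification; unrecognized labels go to "UNKNOWN"."""
    counts: Counter = Counter()
    for record in records:
        known = to_classification(record.get("classification", ""))
        counts[known.value if known else "UNKNOWN"] += 1
    return counts


# Hypothetical records, reduced to the one field this example needs.
records = [
    {"classification": "TLS_INTERFERENCE"},
    {"classification": "DNS_TAMPERING"},
    {"classification": "TLS_INTERFERENCE"},
    {"classification": "SOMETHING_NEW"},
]
counts = tally(records)
print(counts.most_common())
```

Bucketing unrecognized labels instead of raising keeps a consumer running if the label set is ever extended.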

Technical documentation

Building the OONI historical corpus: 1.66M downloads, schema normalization, and the decisions behind the dataset →
How we processed 200M+ OONI measurements into the flat ML-ready CSV now downloaded 1.66M+ times on HuggingFace — probe version schema drift, test_keys normalization, and what we left out. 2024.
The Voidly Probe: Tauri + boringtun network measurement at the operator's edge →
How the desktop probe that generates this dataset works: Tauri 2, Cloudflare boringtun WireGuard, tun-rs TUN device, X25519-Dalek on-device key generation, and the operator-safety constraints behind the design. 2025.
The Voidly control server: how we tell censorship from a bad network →
How the control_failure and anomaly_score fields are generated — three-node distributed control network (US-East, EU-West, AP-East), CDN split-horizon handling, block-page hash library of ~2,300 fingerprints, and the mapping from ControlComparison fields to classifier inputs. 2025.
Voidly probe health monitoring: how we detect and replace failing probe nodes →
How Voidly keeps 37+ probe nodes healthy: heartbeat state machine (DEGRADED → OFFLINE thresholds), measurement quality scoring, ASN coverage SLOs per country, flapping detection that caps the confidence_tier at CORROBORATED, and the classify_offline_cause() algorithm for distinguishing probe failure from ISP-level censorship events. 2025.
Voidly's block page fingerprint library: detecting censorship signatures across 2,300+ known pages →
How the blockpage_match and blockpage_fp_id fields in the dataset are produced — four matching strategies (exact hash, structural normalization, SimHash, TLS certificate fingerprinting), the match pipeline, per-country library counts, and false positive mitigation for CDN error pages and captive portals. 2025.
The Voidly open datasets on HuggingFace: structure, daily snapshots, and filter recipes →
The practical guide to bulk data access — Parquet partitioning by country and month, daily incremental append cadence, git-lfs versioning for reproducible research, filter recipes in Python/pandas/DuckDB/R, confidence tier guidance for different consumer types, and the citation format for CC BY 4.0 attribution. 2025.
The Voidly anomaly classifier: five interference classes and why we optimize for recall →
How the ML classifier assigns DNS tampering, TLS interference, HTTP blocking, BGP withdrawal, and throttling labels to raw probe measurements — per-class binary models, country-specific calibration, and the confidence scores in the dataset. 2024.
Voidly's ML training pipeline: building a labeled censorship dataset from OONI measurements →
How the classifier's training corpus is built from 200M+ OONI measurements: weak supervision label functions, 47-feature schema, time-based splits to prevent leakage, SMOTE for class imbalance, and per-country Platt calibration. 2024.
Voidly's real-time inference API: classifying censorship measurements at 50ms →
How the classifier runs as a live API in the real-time measurement pipeline — ONNX Runtime serving, per-country calibration at inference time, champion/challenger deployment, and the p50/p99 latency breakdown across all pipeline stages. 2025.
From anomaly to verified incident: the Voidly confidence tier system →
What the confidence_tier field in the dataset means — how a measurement moves from Anomaly to Corroborated to Verified Incident, and which tier is appropriate for your use case. 2025.
The Voidly measurement dataset: field-by-field schema reference →
Complete field-by-field guide to the CC BY 4.0 dataset — probe identity, DNS/TCP/TLS/HTTP layers, control comparison deltas, ML classifier output (interference_type, prob_*, tier), cross-source corroboration scores, BGP signals, and derived aggregates. Includes filtering recipes for journalists, ML researchers, and infra teams. 2025.
Cross-source censorship verification: reconciling OONI, CensoredPlanet, and IODA →
How Voidly correlates three independent measurement projects — data format normalization, 4-hour sliding window alignment, and independence-weighted confidence scoring. 2025.
Seven-day internet shutdown forecasting: how Voidly predicts connectivity outages →
Architecture of the 7-day forecast: political calendar features, BGP telemetry, ARIMA + XGBoost ensemble, per-country calibration, and reliability scoring. 2025.
BGP routing signals and internet shutdown detection: how Voidly uses IODA data →
How the bgp_withdrawal and bgp_outage_score fields are generated — IODA prefix withdrawal detection, 90-day per-country baselines, and why BGP silence differs from BGP withdrawal. 2025.
The Voidly measurement scheduler: how we decide which domains to probe and when →
What drives measurement cadence in the dataset — category-code priority tiers (NEWS/SMG probed every 5 minutes), anomaly-driven urgent injection (priority=10 for 30 minutes after detection), ±15% jitter for anti-detection, per-domain ASN distribution, and per-country task budgets that explain why CN and IR have higher measurement density. 2024.
Voidly's real-time event pipeline: from measurement anomaly to journalist alert in under 8 minutes →
Why events appear in the API within minutes of probe detection: inline scoring, parallel OONI/IODA corroboration, and the two-window alert-fatigue guard. Understanding this helps set the right polling cadence for real-time consumers. 2025.
The Voidly MCP server: 83 censorship query tools for Claude and GPT →
How to query the full dataset from Claude Code, Claude Desktop, or any OpenAI function-calling agent — 83 tools across 6 categories (incident lookup, measurement queries, country coverage, domain test-list, BGP/network, dataset metadata), JSON-RPC over Streamable HTTP, and rate-limit reference. 2025.
The Voidly REST API: querying the global censorship index in real time →
Core endpoints, cursor-based pagination, filtering by country / confidence tier / interference type / date range, streaming NDJSON export, RFC 7807 error format, rate limits, and code samples in curl, Python, and TypeScript. Pair with the MCP server for agent workflows. 2025.
Incident clustering and deduplication: how Voidly avoids counting the same event twice →
How raw probe measurements deduplicate into the incident_id field you see in the dataset — the four-tuple clustering key, 6-hour gap rule, incident lifecycle (ANOMALY → RESOLVED), the 12-hour re-open window, and retroactive CensoredPlanet alignment. Explains why incident counts differ from measurement counts by 200–800×. 2025.
Voidly's country-level censorship score: aggregating 2.2B probe measurements into the global index →
How the censorship_score field in the country summary API is computed: exponential recency decay (30-day half-life), ASN diversity weighting, domain category weights, cross-source corroboration multipliers, Gaussian temporal smoothing, and bootstrap confidence bands. Covers per-country calibration for coverage disparity. 2025.
Building a distributed VPN with intelligent routing →
How censorship-aware routing works in practice: ML-driven path selection across 142 entry-node IPs, traffic morphing, DPI evasion. 2024.
Measurement methodology: from probe to verified incident →
The full verification flow: vantage selection, scan cadence, anomaly classification, and cross-source corroboration against OONI, CensoredPlanet, and IODA.
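
One of the articles listed above describes exponential recency decay with a 30-day half-life as an input to the censorship_score field. The sketch below shows only what a 30-day half-life implies for the weight of a single measurement. It is not Voidly's scoring code: the function name and the clamping of future timestamps are assumptions, and the other factors named in that article (ASN diversity, category weights, smoothing) are left out.

```python
from datetime import datetime, timedelta, timezone

HALF_LIFE_DAYS = 30.0  # half-life stated in the article description above


def recency_weight(
    measured_at: datetime,
    now: datetime,
    half_life_days: float = HALF_LIFE_DAYS,
) -> float:
    """Weight that halves for every `half_life_days` of measurement age."""
    age_days = (now - measured_at).total_seconds() / 86400.0
    age_days = max(age_days, 0.0)  # assumption: future timestamps count as age 0
    return 0.5 ** (age_days / half_life_days)


now = datetime(2026, 4, 12, tzinfo=timezone.utc)
for days in (0, 30, 60, 90):
    print(days, recency_weight(now - timedelta(days=days), now))
# prints weights 1.0, 0.5, 0.25, 0.125
```

Under this weighting a measurement from three months ago still contributes, but at one eighth of the weight of one taken today.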

License & attribution

Voidly measurement data is published under CC BY 4.0. Use, redistribute, and remix with attribution. Suggested format:

Source: Voidly / AI Analytics — voidly.ai (CC BY 4.0)

Citing in a paper? Use the formatted citations on /voidly (APA + BibTeX, click to copy).

Also available: US Federal Regulatory Data

The Federal Regulatory Data Hub indexes 208 federal datasets across 89 agencies — SEC, FDA, OFAC, DOJ, EPA, CFPB, IRS, FEMA, CDC, NHTSA, FAA, CMS, MSHA, OSHA and more. 50M+ canonical records, daily refresh, licensed CC0 1.0 (public domain — no attribution required). Accessible via REST, MCP, and JSON-LD.

→ Federal Data Hub landing page · Live dataset counts · MCP server (38+ tools)