Methodology
How we measure internet censorship.
Verifiable measurement is the entire point. This page documents what we actually do — vantage selection, scan cadence, classification, cross-source verification, and what it takes for an anomaly to become a citable incident.
Vantages
37+ probe nodes are deployed across 200 countries on six continents. Vantage selection balances three constraints: presence inside affected jurisdictions, ASN diversity (so a single ISP outage doesn't fake a censorship event), and operator safety. We do not run probes from infrastructure that exposes operator identity.
For the technical architecture of the probe application: The Voidly Probe: Tauri + boringtun network measurement at the operator's edge →
For how vantages are selected for ASN diversity, operator safety, and hard-to-reach countries: Voidly probe vantage selection: ASN diversity, operator safety, and reaching hard-to-measure countries →
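The ASN-diversity constraint can be illustrated with a minimal sketch. It assumes a hypothetical rule, not stated on this page, that a signal only counts when anomalous observations come from at least two distinct ASNs in the same country. The `Observation` type, its field names, and the `min_asns` threshold are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Observation:
    """One probe result (hypothetical shape, for illustration only)."""
    country: str    # placeholder country code of the vantage
    asn: int        # autonomous system number of the probe's ISP
    domain: str
    anomalous: bool


def anomalous_asns(observations: Iterable[Observation],
                   country: str, domain: str) -> Set[int]:
    """Collect the distinct ASNs that saw an anomaly for this country and domain."""
    return {
        o.asn for o in observations
        if o.country == country and o.domain == domain and o.anomalous
    }


def spans_multiple_asns(observations: Iterable[Observation],
                        country: str, domain: str, min_asns: int = 2) -> bool:
    """True when the anomaly is visible from at least `min_asns` ISPs (assumed rule)."""
    return len(anomalous_asns(observations, country, domain)) >= min_asns


# ASNs 64496 and 64497 come from the range reserved for documentation.
sample = [
    Observation("XX", 64496, "example.org", True),
    Observation("XX", 64496, "example.org", True),   # same ISP, adds no diversity
    Observation("XX", 64497, "example.org", True),
]
print(spans_multiple_asns(sample, "XX", "example.org"))
```

Under this assumed rule, two anomalous probes on the same ISP would not be enough on their own, which is the failure mode ASN diversity is meant to guard against.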
Test list
80 domains spanning news, civil-society organizations, encrypted messaging, circumvention tools, social platforms, and reference content. The list is reviewed quarterly and additions are governed by the same anti-targeting rules used by OONI's test list.
For the full curation methodology — Citizen Lab's global list, 12 OONI category codes, per-country additions, and emergency updates: Voidly's URL test list: how we curate the domains that reveal internet censorship →
Cadence
Every probe runs the full domain list every 5 minutes, 24×7. That gives a worst-case detection latency around 5 minutes for a sudden block, and a 1-hour rolling baseline sufficient to flag throttling that ramps up gradually.
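The arithmetic behind those figures is short enough to write out. This sketch uses only numbers stated on this page (80 domains, a 5-minute interval, a 1-hour baseline); the variable names are illustrative.

```python
DOMAINS = 80             # size of the test list
INTERVAL_MINUTES = 5     # one full scan every 5 minutes
BASELINE_MINUTES = 60    # rolling baseline window

# Full scans each probe completes per day.
runs_per_day = 24 * 60 // INTERVAL_MINUTES

# Individual domain measurements each probe produces per day.
measurements_per_probe_per_day = runs_per_day * DOMAINS

# Samples per domain available inside the rolling baseline.
baseline_samples_per_domain = BASELINE_MINUTES // INTERVAL_MINUTES

# A block that starts just after a scan is first seen at the next scan.
worst_case_latency_minutes = INTERVAL_MINUTES

print(runs_per_day)                     # 288
print(measurements_per_probe_per_day)   # 23040
print(baseline_samples_per_domain)      # 12
```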
Each probe measurement is paired with a simultaneous control server measurement from a neutral vantage outside the country being tested. The comparison — DNS, TCP reachability, TLS certificate chain, and HTTP body — is what separates genuine censorship from network errors and CDN split-horizon DNS. For the full control server methodology: The Voidly control server: how we tell censorship from a bad network →
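A minimal sketch of the probe-versus-control comparison follows. It only reports which of the four layers differ; the `Measurement` type and its field names are hypothetical, and deciding whether a difference reflects censorship, a network error, or CDN split-horizon DNS requires logic that this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Measurement:
    """One vantage's view of a domain (hypothetical shape)."""
    dns_ips: FrozenSet[str]
    tcp_ok: bool
    tls_cert_sha256: Optional[str]    # None if the handshake did not complete
    http_body_sha256: Optional[str]   # None if no body was received


def differing_layers(probe: Measurement, control: Measurement) -> List[str]:
    """List the layers on which the probe disagrees with the control."""
    layers = []
    # Treat DNS as differing only when the two answer sets share no address.
    if probe.dns_ips != control.dns_ips and probe.dns_ips.isdisjoint(control.dns_ips):
        layers.append("dns")
    if probe.tcp_ok != control.tcp_ok:
        layers.append("tcp")
    if probe.tls_cert_sha256 != control.tls_cert_sha256:
        layers.append("tls")
    if probe.http_body_sha256 != control.http_body_sha256:
        layers.append("http")
    return layers


# Addresses are from the ranges reserved for documentation.
control = Measurement(frozenset({"192.0.2.10"}), True, "aa11", "bb22")
probe = Measurement(frozenset({"198.51.100.7"}), True, None, None)
print(differing_layers(probe, control))
```

An empty result means the probe and control agree on every layer. A DNS-only difference is exactly the case where CDN behavior has to be ruled out before anything is flagged.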
Anomaly classifier
- DNS tampering. Resolver returns an IP that doesn't match the known ASN, returns NXDOMAIN, or refuses recursion.
- TLS interference. Handshake interrupted at SNI, certificate altered, ClientHello rewritten.
- HTTP blocking. Block page returned, content rewritten, response throttled to zero.
- BGP withdrawal. Origin AS disappears from the global routing table for the target prefix.
- Throttling. Bandwidth deliberately collapsed for specific services while neighboring services remain unaffected.
For the full technical write-up on feature engineering, the five per-class binary models, and the recall-vs-precision tradeoff: The Voidly anomaly classifier: five interference classes and why we optimize for recall →
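The shape of a five-class, one-detector-per-class design can be sketched as follows. The detectors here are simple rule-based stand-ins, not the trained binary models; every feature name and the 0.5 threshold are assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical feature vector: 1.0/0.0 flags plus one throughput ratio.
Features = Dict[str, float]


def _flag(f: Features, *names: str) -> float:
    """Return 1.0 if any of the named flags is set, else 0.0."""
    return 1.0 if any(f.get(n, 0.0) > 0 for n in names) else 0.0


# One independent scorer per interference class (stand-ins for trained models).
DETECTORS: Dict[str, Callable[[Features], float]] = {
    "dns_tampering": lambda f: _flag(
        f, "dns_asn_mismatch", "dns_nxdomain", "dns_recursion_refused"),
    "tls_interference": lambda f: _flag(
        f, "tls_reset_at_sni", "tls_cert_altered", "tls_clienthello_rewritten"),
    "http_blocking": lambda f: _flag(
        f, "http_block_page", "http_body_rewritten", "http_zero_throughput"),
    "bgp_withdrawal": lambda f: _flag(f, "bgp_origin_withdrawn"),
    # Score rises as the target's throughput falls relative to neighboring services.
    "throttling": lambda f: 1.0 - min(
        1.0, f.get("target_to_neighbor_throughput_ratio", 1.0)),
}


def classify(features: Features, threshold: float = 0.5) -> List[str]:
    """Return every class whose score reaches the threshold (multi-label)."""
    return [name for name, score in DETECTORS.items()
            if score(features) >= threshold]


print(classify({"dns_nxdomain": 1.0,
                "target_to_neighbor_throughput_ratio": 0.05}))
```

Because each class is scored independently, one measurement can carry several labels. Lowering `threshold` trades precision for recall, which is the direction the write-up linked above describes.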
Cross-source verification
An anomaly is promoted to a verified incident only after it correlates with at least one independent measurement project for the same target, location, and time window:
- OONI: Open Observatory of Network Interference (community probe network)
- CensoredPlanet (University of Michigan)
- IODA: Internet Outage Detection and Analysis (Georgia Tech)
For the full technical write-up on data format normalization, time-window alignment, and confidence scoring: Cross-source censorship verification: reconciling OONI, CensoredPlanet, and IODA →
For how IODA's BGP routing data is used to detect country-level shutdowns: BGP routing signals and internet shutdown detection: how Voidly uses IODA data →
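The matching rule (same target, same location, overlapping time window) can be sketched as below. The `Report` type is hypothetical, and the 30-minute tolerance is an assumed value chosen only to make the example concrete; the real alignment windows are covered in the write-up linked above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Report:
    """A normalized report from any project (hypothetical shape)."""
    source: str
    target: str
    country: str
    start: datetime
    end: datetime


def corroborates(anomaly: Report, external: Report,
                 tolerance: timedelta = timedelta(minutes=30)) -> bool:
    """True if an independent report matches on target, location, and time."""
    if anomaly.source == external.source:
        return False  # a source cannot corroborate itself
    if (anomaly.target, anomaly.country) != (external.target, external.country):
        return False
    # The intervals overlap once each is widened by the tolerance.
    return (anomaly.start <= external.end + tolerance
            and external.start <= anomaly.end + tolerance)


def utc(hour: int, minute: int) -> datetime:
    """Build a UTC timestamp on a fixed example day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


ours = Report("voidly", "example.org", "XX", utc(12, 0), utc(12, 10))
theirs = Report("ooni", "example.org", "XX", utc(12, 25), utc(12, 40))
print(corroborates(ours, theirs))
```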
Confidence levels
- Anomaly. Single probe, single source. Surfaced as "observed" only.
- Corroborated. Multiple Voidly probes agree, or one external project agrees. Published in the public dataset, with a caveat.
- Verified incident. Voidly and ≥1 external source agree. Citable, and included in counters and forecasts.
For the full technical write-up on tier thresholds, independence weighting, and what each tier means for data consumers: From anomaly to verified incident: the Voidly confidence tier system →
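One possible reading of the tiers, written as a function. The boundary between "corroborated" and "verified incident" is an assumption here: this sketch treats a verified incident as multi-probe Voidly agreement plus at least one external source, and treats either condition alone as corroborated. The authoritative thresholds are in the write-up linked above; the function name and tier strings are illustrative.

```python
def confidence_tier(voidly_probes: int, external_sources: int) -> str:
    """Map agreement counts to a tier under the assumed reading described above."""
    if voidly_probes < 1:
        raise ValueError("at least one Voidly probe must have observed the anomaly")
    if external_sources < 0:
        raise ValueError("external_sources cannot be negative")

    multi_probe = voidly_probes >= 2
    has_external = external_sources >= 1

    if multi_probe and has_external:
        return "verified_incident"
    if multi_probe or has_external:
        return "corroborated"
    return "anomaly"


for probes, external in [(1, 0), (3, 0), (1, 1), (3, 2)]:
    print(probes, external, confidence_tier(probes, external))
```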
Publication and alerting
The probe-to-published-incident pipeline targets under 8 minutes for events with real-time OONI and IODA corroboration. Inline anomaly scoring takes under 50ms; parallel async queries to the OONI Explorer and IODA real-time APIs run concurrently. A two-window alert-fatigue guard prevents false-alarm flooding; BGP full-shutdown signals bypass it. CensoredPlanet corroboration arrives retroactively in a nightly batch pass. Full pipeline write-up →
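The concurrent corroboration step can be sketched with `asyncio`. The two coroutines below are stubs that sleep and return canned answers; they do not call the OONI or IODA APIs, and their names, return shape, and the 5-second timeout are assumptions for illustration.

```python
import asyncio
from typing import Dict, List


async def query_ooni_stub(target: str, country: str) -> Dict[str, object]:
    """Stand-in for a real-time OONI lookup (no network access)."""
    await asyncio.sleep(0.01)
    return {"source": "OONI", "agrees": True}


async def query_ioda_stub(country: str) -> Dict[str, object]:
    """Stand-in for a real-time IODA lookup (no network access)."""
    await asyncio.sleep(0.01)
    return {"source": "IODA", "agrees": False}


async def corroborate(target: str, country: str,
                      timeout_s: float = 5.0) -> List[str]:
    """Run both lookups concurrently and return the sources that agree."""
    results = await asyncio.wait_for(
        asyncio.gather(
            query_ooni_stub(target, country),
            query_ioda_stub(country),
        ),
        timeout=timeout_s,
    )
    return [str(r["source"]) for r in results if r["agrees"]]


agreeing = asyncio.run(corroborate("example.org", "XX"))
print(agreeing)
```

Running the lookups with `asyncio.gather` means the total wait is roughly the slower of the two calls rather than their sum, which matters when the whole pipeline is budgeted in minutes.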
Forecasting
7-day shutdown forecasts are produced from a model trained on historical measurements, conditioned on country-level political and security signals. The model uses an ensemble of ARIMA (per-country telemetry trends) and gradient-boosted classifiers (full feature vector including political calendars and shutdown fingerprints). Forecast quality varies by region; per-country reliability scores are published alongside every forecast. For the technical write-up of the model, see Seven-day internet shutdown forecasting →
Reproducibility
The full measurement dataset (CC BY 4.0) is downloadable from the HuggingFace datasets listing or queryable via the Voidly REST API. The classifier code is published in the voidly-ai GitHub org; anyone with a probe can reproduce a measurement and submit independent corroboration.
Federal Regulatory Data Hub — ingest methodology
Each of the 256 datasets in the Federal Regulatory Data Hub follows a common ingest pattern. The source is always the primary government URL — no secondary aggregators, no third-party resellers. Every record carries a _source envelope with the original URL, retrieval timestamp, and license.
- Primary source fetch. Data is pulled directly from government endpoints: EDGAR FTP, openFDA API, OFAC XML, EPA ECHO, SAM.gov, USAspending, FinCEN, FDIC BankFind, CMS provider files, NIST NVD JSON feeds, CISA KEV, and so on.
- Daily refresh via Cloudflare cron. Each dataset refreshes on a per-source cadence: enforcement datasets typically daily; registries (FDIC institutions, FAA aircraft) weekly; historical datasets only when a change is detected.
- Normalization. Records are normalized into per-vertical SQLite tables on Cloudflare D1. Field names are standardized; dates are parsed to ISO 8601; dollar amounts are stored as integers (cents); entity identifiers (CIK, UEI, NPI, DUNS, LEI, ticker) are extracted into indexed columns.
- Entity bridge. An entity_master table joins records across datasets by CIK, ticker, UEI, LEI, DUNS, and NPI. This is the cross-agency join layer: one query returns every regulatory event for a company across all 256 datasets.
- Content negotiation. Every canonical record URL serves HTML (with Schema.org JSON-LD), Markdown (LLM-friendly excerpts), or JSON based on the Accept header or file extension.
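The normalization and entity-bridge steps above can be sketched locally with Python's built-in `sqlite3` module standing in for D1. The table names `sec_actions` and `sam_exclusions`, every column name, the input date format, and the sample values are hypothetical; only `entity_master` and the conventions (ISO 8601 dates, integer cents, indexed identifiers) come from this page.

```python
import sqlite3
from datetime import datetime
from decimal import Decimal


def to_cents(amount: str) -> int:
    """Parse a dollar string such as '$1,250,000.50' into integer cents."""
    cleaned = amount.replace("$", "").replace(",", "").strip()
    return int((Decimal(cleaned) * 100).to_integral_value())


def to_iso_date(raw: str, fmt: str = "%m/%d/%Y") -> str:
    """Parse a source-specific date string into an ISO 8601 date."""
    return datetime.strptime(raw, fmt).date().isoformat()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity_master (entity_id INTEGER PRIMARY KEY, name TEXT, cik TEXT, uei TEXT);
CREATE TABLE sec_actions (cik TEXT, action_date TEXT, penalty_cents INTEGER);
CREATE TABLE sam_exclusions (uei TEXT, action_date TEXT);
CREATE INDEX idx_sec_actions_cik ON sec_actions (cik);
CREATE INDEX idx_sam_exclusions_uei ON sam_exclusions (uei);
""")

# Placeholder identifiers, not real CIK or UEI values.
conn.execute("INSERT INTO entity_master VALUES (1, 'Example Corp', '0000000001', 'EXAMPLEUEI01')")
conn.execute("INSERT INTO sec_actions VALUES (?, ?, ?)",
             ("0000000001", to_iso_date("03/14/2024"), to_cents("$1,250,000.50")))
conn.execute("INSERT INTO sam_exclusions VALUES (?, ?)",
             ("EXAMPLEUEI01", to_iso_date("07/01/2023")))

# One query, two datasets, joined through entity_master on different identifiers.
rows = conn.execute("""
SELECT 'sec_actions' AS dataset, s.action_date
  FROM entity_master e JOIN sec_actions s ON s.cik = e.cik
 WHERE e.entity_id = ?
UNION ALL
SELECT 'sam_exclusions', x.action_date
  FROM entity_master e JOIN sam_exclusions x ON x.uei = e.uei
 WHERE e.entity_id = ?
ORDER BY 2
""", (1, 1)).fetchall()
print(rows)
```

Storing amounts as integer cents avoids floating-point rounding in aggregates, and ISO 8601 strings sort chronologically as plain text, which is why the `ORDER BY` on a TEXT column works here.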
Fallback: if a primary source is temporarily unreachable, the prior day's snapshot is served with a stale: true flag and a stale_since timestamp. The build pipeline never fails on a missing dataset — a static fallback is always available.
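A minimal sketch of that fallback follows. The `stale` and `stale_since` keys are named on this page; the function name, the `records` and `retrieved_at` keys, and the choice to set `stale_since` to the time of the last successful retrieval are assumptions.

```python
from datetime import datetime, timezone
from typing import Callable, List, Optional


def load_dataset(fetch_primary: Callable[[], List[dict]],
                 previous_snapshot: dict,
                 now: Optional[datetime] = None) -> dict:
    """Return fresh data, or the prior snapshot marked stale if the fetch fails."""
    now = now or datetime.now(timezone.utc)
    try:
        records = fetch_primary()
    except OSError:
        # Network failures (ConnectionError, TimeoutError) are OSError subclasses.
        return {
            **previous_snapshot,
            "stale": True,
            "stale_since": previous_snapshot["retrieved_at"],  # assumed meaning
        }
    return {
        "records": records,
        "retrieved_at": now.isoformat(),
        "stale": False,
        "stale_since": None,
    }


def unreachable_source() -> List[dict]:
    """Simulate a primary endpoint that cannot be reached."""
    raise ConnectionError("primary source unreachable")


yesterday = {
    "records": [{"id": 1}],
    "retrieved_at": "2024-01-01T00:00:00+00:00",
    "stale": False,
    "stale_since": None,
}
result = load_dataset(unreachable_source, yesterday)
print(result["stale"], result["stale_since"])
```

Catching the failure inside the loader, rather than letting it propagate, is what keeps a single unreachable source from failing the whole build.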
All underlying federal works are in the US public domain under 17 U.S.C. §105. The derived aggregate dataset is licensed CC0 1.0 Universal, which permits attribution-free reuse, including for AI training. See api.ai-analytics.org/coverage for live per-dataset record counts. For a full technical write-up of the D1 schema, ingest architecture, and vertical sharding strategy, see Building the Federal Regulatory Data Hub on Cloudflare D1 →