Methodology

How we measure internet censorship.

Verifiable measurement is the entire point. This page documents what we actually do — vantage selection, scan cadence, classification, cross-source verification, and what it takes for an anomaly to become a citable incident.

Vantages

37+ probe nodes are deployed across 200 countries on six continents. Vantage selection balances three constraints: presence inside affected jurisdictions, ASN diversity (so a single ISP outage doesn't fake a censorship event), and operator safety. We do not run probes from infrastructure that exposes operator identity.

For the technical architecture of the probe application: The Voidly Probe: Tauri + boringtun network measurement at the operator's edge →

For how vantages are selected for ASN diversity, operator safety, and hard-to-reach countries: Voidly probe vantage selection: ASN diversity, operator safety, and reaching hard-to-measure countries →

Test list

80 domains spanning news, civil-society organizations, encrypted messaging, circumvention tools, social platforms, and reference content. The list is reviewed quarterly and additions are governed by the same anti-targeting rules used by OONI's test list.

For the full curation methodology — Citizen Lab's global list, 12 OONI category codes, per-country additions, and emergency updates: Voidly's URL test list: how we curate the domains that reveal internet censorship →

Cadence

Every probe runs the full domain list every 5 minutes, 24×7. That gives a worst-case detection latency around 5 minutes for a sudden block, and a 1-hour rolling baseline sufficient to flag throttling that ramps up gradually.
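As a rough illustration of how a 1-hour rolling baseline can catch a gradual ramp, the sketch below keeps the last 12 samples (one per 5-minute scan) and flags a sample that falls well below the window median. The class name, the throughput metric, and the 50% drop threshold are assumptions made for this example, not Voidly's published parameters.

```python
from collections import deque
from statistics import median

SCAN_INTERVAL_MIN = 5
BASELINE_WINDOW = 60 // SCAN_INTERVAL_MIN  # 12 samples cover one hour


class RollingBaseline:
    """Hypothetical 1-hour baseline over per-scan throughput samples."""

    def __init__(self, window: int = BASELINE_WINDOW, drop_ratio: float = 0.5):
        self.samples = deque(maxlen=window)
        self.drop_ratio = drop_ratio  # assumed threshold, not a documented value

    def observe(self, throughput_kbps: float) -> bool:
        """Return True if the sample is below drop_ratio x the window median."""
        flagged = False
        # Only judge once a full hour of history exists.
        if len(self.samples) == self.samples.maxlen:
            flagged = throughput_kbps < self.drop_ratio * median(self.samples)
        self.samples.append(throughput_kbps)
        return flagged


baseline = RollingBaseline()
history = [1000] * 12 + [900, 800, 700, 600, 500, 400]
flags = [baseline.observe(v) for v in history]
```

Because the median lags behind a slow decline, each individual step looks small but the sample at 400 kbps is still compared against an hour that was mostly healthy, so it is the first one flagged.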

Each probe measurement is paired with a simultaneous control server measurement from a neutral vantage outside the country being tested. The comparison — DNS, TCP reachability, TLS certificate chain, and HTTP body — is what separates genuine censorship from network errors and CDN split-horizon DNS. For the full control server methodology: The Voidly control server: how we tell censorship from a bad network →
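The layer-by-layer comparison might look something like the following sketch. The `Measurement` fields, the 30% body-length tolerance, and the rule that a DNS-only difference with a matching certificate and body is treated as CDN split-horizon are all assumptions for illustration; the linked control server write-up is the authority on the real logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    """Hypothetical per-target result from either a probe or the control."""
    resolved_ips: frozenset
    tcp_connected: bool
    tls_cert_sha256: Optional[str]
    http_body_len: Optional[int]


def differing_layers(probe: Measurement, control: Measurement,
                     body_tolerance: float = 0.3) -> list:
    """List the layers where the probe disagrees with the control."""
    if not control.tcp_connected:
        # The site is down for everyone: a network error, not censorship.
        return ["control_unreachable"]
    diffs = []
    if probe.resolved_ips.isdisjoint(control.resolved_ips):
        diffs.append("dns")
    if not probe.tcp_connected:
        return diffs + ["tcp"]  # nothing above TCP can be compared
    if probe.tls_cert_sha256 != control.tls_cert_sha256:
        diffs.append("tls")
    if control.http_body_len:
        if probe.http_body_len is None:
            diffs.append("http")
        elif (abs(probe.http_body_len - control.http_body_len)
              / control.http_body_len) > body_tolerance:
            diffs.append("http")
    # Different IPs but identical certificate and similar body is what a
    # CDN serving region-specific addresses looks like (assumed rule).
    return [] if diffs == ["dns"] else diffs


control = Measurement(frozenset({"192.0.2.1"}), True, "abc", 1000)
cdn_edge = Measurement(frozenset({"198.51.100.7"}), True, "abc", 1010)
blocked = Measurement(frozenset({"203.0.113.9"}), True, "zzz", 120)
```

Here `differing_layers(cdn_edge, control)` returns an empty list, while `differing_layers(blocked, control)` reports disagreement at the DNS, TLS, and HTTP layers.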

Anomaly classifier

  • DNS tampering. Resolver returns an IP that doesn't match the known ASN, returns NXDOMAIN, or refuses recursion.
  • TLS interference. Handshake interrupted at SNI, certificate altered, ClientHello rewritten.
  • HTTP blocking. Block page returned, content rewritten, response throttled to zero.
  • BGP withdrawal. Origin AS disappears from the global routing table for the target prefix.
  • Throttling. Bandwidth deliberately collapsed for specific services while neighboring services remain unaffected.
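The five classes above can be pictured as five independent yes/no checks over one observation. The sketch below uses hand-written rules as a stand-in for the trained per-class models; the observation keys and the throttling thresholds are hypothetical names and values chosen for this example.

```python
from enum import Enum


class Interference(Enum):
    DNS_TAMPERING = "dns_tampering"
    TLS_INTERFERENCE = "tls_interference"
    HTTP_BLOCKING = "http_blocking"
    BGP_WITHDRAWAL = "bgp_withdrawal"
    THROTTLING = "throttling"


def label(obs: dict) -> set:
    """Apply one independent check per class; several can fire at once."""
    labels = set()
    if obs.get("dns_rcode") in {"NXDOMAIN", "REFUSED"} or obs.get("dns_asn_mismatch"):
        labels.add(Interference.DNS_TAMPERING)
    if obs.get("tls_reset_after_sni") or obs.get("tls_cert_mismatch"):
        labels.add(Interference.TLS_INTERFERENCE)
    if obs.get("block_page_match") or obs.get("http_body_rewritten"):
        labels.add(Interference.HTTP_BLOCKING)
    if obs.get("origin_as_visible") is False:
        labels.add(Interference.BGP_WITHDRAWAL)
    # Throttling: the target collapsed while neighboring services stayed healthy.
    if (obs.get("throughput_ratio", 1.0) < 0.1
            and obs.get("neighbor_throughput_ratio", 1.0) > 0.8):
        labels.add(Interference.THROTTLING)
    return labels
```

Keeping the checks independent mirrors the per-class structure: a single measurement can be labeled with both DNS tampering and TLS interference, for example.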

For the full technical write-up on feature engineering, the five per-class binary models, and the recall-vs-precision tradeoff: The Voidly anomaly classifier: five interference classes and why we optimize for recall →

Cross-source verification

An anomaly is only promoted to a verified incident after it correlates against at least one independent measurement project for the same target, location, and time window:

  • OONI: Open Observatory of Network Interference (community probe network)
  • CensoredPlanet (University of Michigan)
  • IODA: Internet Outage Detection and Analysis (Georgia Tech)
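A minimal sketch of the matching rule (same target, same location, overlapping time window) is shown below. The `Observation` shape and the one-hour window are assumptions; the real pipeline also has to normalize each project's data format first, which this example skips.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Observation:
    """Hypothetical normalized record from Voidly or an external project."""
    source: str
    target: str
    country: str
    observed_at: datetime


def corroborating_sources(anomaly: Observation, external: list,
                          window: timedelta = timedelta(hours=1)) -> set:
    """Names of independent sources that saw the same target, place, and time."""
    return {
        o.source
        for o in external
        if o.source != anomaly.source
        and o.target == anomaly.target
        and o.country == anomaly.country
        and abs(o.observed_at - anomaly.observed_at) <= window
    }


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
anomaly = Observation("voidly", "example.org", "XX", t0)
external = [
    Observation("ooni", "example.org", "XX", t0 + timedelta(minutes=20)),
    Observation("ioda", "example.org", "XX", t0 + timedelta(hours=3)),
    Observation("censoredplanet", "other.example", "XX", t0),
]
matches = corroborating_sources(anomaly, external)
```

Only the OONI record matches here: the IODA record falls outside the window and the CensoredPlanet record is for a different target.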

For the full technical write-up on data format normalization, time-window alignment, and confidence scoring: Cross-source censorship verification: reconciling OONI, CensoredPlanet, and IODA →

For how IODA's BGP routing data is used to detect country-level shutdowns: BGP routing signals and internet shutdown detection: how Voidly uses IODA data →

Confidence levels

  • Anomaly. Single probe, single source. Surface as "observed" only.
  • Corroborated. Multiple Voidly probes, OR one external project agrees. Public dataset, with caveat.
  • Verified incident. Voidly + ≥1 external source agree. Citable, included in counters and forecasts.
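The tier definitions can be read as a small decision function. The sketch below makes one interpretive assumption that the text does not settle: a single probe plus one external source counts as "corroborated", and "verified incident" needs multiple Voidly probes together with at least one external source. The function name and return strings are also illustrative.

```python
def confidence_tier(voidly_probes: int, external_sources: int) -> str:
    """Map agreement counts to a tier (assumed reading of the tier rules)."""
    if voidly_probes < 1:
        raise ValueError("a tier is only assigned to something Voidly observed")
    if voidly_probes >= 2 and external_sources >= 1:
        return "verified_incident"
    if voidly_probes >= 2 or external_sources >= 1:
        return "corroborated"
    return "anomaly"
```

For example, `confidence_tier(1, 0)` yields `"anomaly"`, `confidence_tier(3, 0)` yields `"corroborated"`, and `confidence_tier(3, 1)` yields `"verified_incident"`. The linked tier write-up defines the actual thresholds and independence weighting.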

For the full technical write-up on tier thresholds, independence weighting, and what each tier means for data consumers: From anomaly to verified incident: the Voidly confidence tier system →

Publication and alerting

The probe-to-published-incident pipeline targets under 8 minutes for events with real-time OONI and IODA corroboration. Inline anomaly scoring takes under 50ms; parallel async queries to the OONI Explorer and IODA real-time APIs run concurrently. A two-window alert-fatigue guard prevents false-alarm flooding; BGP full-shutdown signals bypass it. CensoredPlanet corroboration arrives retroactively in a nightly batch pass. Full pipeline write-up →
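The concurrent corroboration step could be sketched with `asyncio.gather`, as below. The two query coroutines are stubs that sleep and return canned answers; they do not call the real OONI Explorer or IODA APIs, and the timeout value is an assumption.

```python
import asyncio


async def query_ooni(anomaly: dict) -> bool:
    """Stub standing in for an OONI Explorer lookup (hypothetical)."""
    await asyncio.sleep(0.01)
    return True


async def query_ioda(anomaly: dict) -> bool:
    """Stub standing in for an IODA real-time lookup (hypothetical)."""
    await asyncio.sleep(0.01)
    return False


async def corroborate(anomaly: dict, timeout_s: float = 5.0) -> dict:
    """Run both lookups concurrently; a timeout or error counts as 'no'."""
    names = ("ooni", "ioda")
    results = await asyncio.gather(
        asyncio.wait_for(query_ooni(anomaly), timeout_s),
        asyncio.wait_for(query_ioda(anomaly), timeout_s),
        return_exceptions=True,  # one slow source must not sink the other
    )
    return {name: result is True for name, result in zip(names, results)}


verdicts = asyncio.run(corroborate({"target": "example.org", "country": "XX"}))
```

Running the lookups side by side means the total wait is bounded by the slower source rather than the sum of both, which matters when the whole pipeline has a minutes-scale budget.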

Forecasting

7-day shutdown forecasts are produced from a model trained on historical measurements, conditioned on country-level political and security signals. The model uses an ensemble of ARIMA (per-country telemetry trends) and gradient-boosted classifiers (full feature vector including political calendars and shutdown fingerprints). Forecast quality varies by region; per-country reliability scores are published alongside every forecast. For the technical write-up of the model, see Seven-day internet shutdown forecasting →
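How the two model families are combined is not stated here, so the following is only one plausible shape: a weighted average of per-day shutdown probabilities. The function names, the idea that both models emit probabilities, and the equal default weight are all assumptions for illustration.

```python
def blend(p_arima: float, p_gbm: float, w_arima: float = 0.5) -> float:
    """Weighted average of two shutdown probabilities (assumed ensemble rule)."""
    if not 0.0 <= w_arima <= 1.0:
        raise ValueError("w_arima must be between 0 and 1")
    return w_arima * p_arima + (1.0 - w_arima) * p_gbm


def seven_day_forecast(arima_probs: list, gbm_probs: list,
                       w_arima: float = 0.5) -> list:
    """Blend two 7-element probability series day by day."""
    if len(arima_probs) != 7 or len(gbm_probs) != 7:
        raise ValueError("expected one probability per day for 7 days")
    return [blend(a, g, w_arima) for a, g in zip(arima_probs, gbm_probs)]
```

A per-country weight would be one way to reflect the observation that forecast quality varies by region, but that too is a guess rather than the documented design.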

Reproducibility

The full measurement dataset (CC BY 4.0) is downloadable from the HuggingFace datasets listing or queryable via the Voidly REST API. The classifier code is published in the voidly-ai GitHub org; anyone with a probe can reproduce a measurement and submit independent corroboration.


Federal Regulatory Data Hub — ingest methodology

Each of the 256 datasets in the Federal Regulatory Data Hub follows a common ingest pattern. The source is always the primary government URL — no secondary aggregators, no third-party resellers. Every record carries a _source envelope with the original URL, retrieval timestamp, and license.
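A record with its provenance envelope might be built as in the sketch below. Only the `_source` key and its three pieces of information come from the description above; the inner field names, the helper, and the example URL are hypothetical.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SourceEnvelope:
    url: str           # original government URL
    retrieved_at: str  # ISO 8601 retrieval timestamp (UTC)
    license: str


def with_source(record: dict, url: str, license: str,
                now: Optional[datetime] = None) -> dict:
    """Return a copy of the record with a _source envelope attached."""
    now = now or datetime.now(timezone.utc)
    envelope = SourceEnvelope(url, now.isoformat(), license)
    return {**record, "_source": asdict(envelope)}


fixed_time = datetime(2024, 3, 15, 8, 30, tzinfo=timezone.utc)
row = with_source({"name": "Example Corp"},
                  "https://agency.example/data.xml",  # placeholder URL
                  "US public domain", now=fixed_time)
```

Attaching the envelope to every record, rather than once per dataset, lets a single row be traced back to its source even after it has been joined or exported elsewhere.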

  1. Primary source fetch. Data is pulled directly from government endpoints: EDGAR FTP, openFDA API, OFAC XML, EPA ECHO, SAM.gov, USAspending, FinCEN, FDIC BankFind, CMS provider files, NIST NVD JSON feeds, CISA KEV, and so on.
  2. Scheduled refresh via Cloudflare cron. Each dataset refreshes on a per-source cadence: enforcement datasets typically daily; registries (FDIC institutions, FAA aircraft) weekly; historical datasets only when a change is detected at the source.
  3. Normalization. Records are normalized into per-vertical SQLite tables on Cloudflare D1. Field names are standardized; dates are parsed to ISO 8601; dollar amounts are stored as integers (cents); entity identifiers (CIK, UEI, NPI, DUNS, LEI, ticker) are extracted into indexed columns.
  4. Entity bridge. An entity_master table joins records across datasets by CIK, ticker, UEI, LEI, DUNS, and NPI. This is the cross-agency join layer — one query returns every regulatory event for a company across all 256 datasets.
  5. Content negotiation. Every canonical record URL serves HTML (with Schema.org JSON-LD), Markdown (LLM-friendly excerpts), or JSON based on the Accept header or file extension.
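The normalization and entity-bridge steps above can be sketched with Python's built-in `sqlite3` module, since D1 is SQLite-based. The per-vertical table names, column names, and sample values are invented for this example; only `entity_master`, the ISO 8601 dates, the integer cents, and the indexed identifier columns come from the description.

```python
import sqlite3
from datetime import datetime
from decimal import Decimal


def to_iso_date(raw: str, fmt: str = "%m/%d/%Y") -> str:
    """Parse a source-specific date string into ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(raw, fmt).date().isoformat()


def to_cents(raw: str) -> int:
    """Convert a dollar string such as '$1,250.50' into integer cents."""
    return int(Decimal(raw.replace("$", "").replace(",", "")) * 100)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity_master (entity_id INTEGER PRIMARY KEY, name TEXT, cik TEXT, uei TEXT);
CREATE TABLE sec_actions (cik TEXT, action_date TEXT, penalty_cents INTEGER);
CREATE TABLE contract_awards (uei TEXT, action_date TEXT, amount_cents INTEGER);
CREATE INDEX idx_sec_cik ON sec_actions(cik);
CREATE INDEX idx_awards_uei ON contract_awards(uei);
""")
conn.execute("INSERT INTO entity_master VALUES (1, 'Example Corp', '0000123456', 'ABC123DEF456')")
conn.execute("INSERT INTO sec_actions VALUES (?, ?, ?)",
             ("0000123456", to_iso_date("03/15/2024"), to_cents("$1,250.50")))
conn.execute("INSERT INTO contract_awards VALUES (?, ?, ?)",
             ("ABC123DEF456", to_iso_date("01/02/2024"), to_cents("99,000")))

# One query, two agencies: each branch joins through a different identifier.
EVENTS_SQL = """
SELECT 'sec_actions' AS dataset, s.action_date, s.penalty_cents AS cents
FROM entity_master e JOIN sec_actions s ON s.cik = e.cik
WHERE e.entity_id = ?
UNION ALL
SELECT 'contract_awards', c.action_date, c.amount_cents
FROM entity_master e JOIN contract_awards c ON c.uei = e.uei
WHERE e.entity_id = ?
ORDER BY 2
"""
events = conn.execute(EVENTS_SQL, (1, 1)).fetchall()
```

Storing amounts as integer cents avoids floating-point rounding in sums, and ISO 8601 date strings sort correctly as plain text, which is why the `ORDER BY` on the date column works without any date parsing in SQL.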

Fallback: if a primary source is temporarily unreachable, the prior day's snapshot is served with a stale: true flag and a stale_since timestamp. The build pipeline never fails on a missing dataset — a static fallback is always available.
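The fallback behavior could be expressed roughly as follows. The function signature and snapshot layout are hypothetical, and the sketch assumes `stale_since` means the time of the last successful retrieval, which the text does not specify.

```python
from typing import Callable


def load_dataset(fetch: Callable[[], list], snapshot: dict) -> dict:
    """Try the primary source; on any failure serve the last good snapshot."""
    try:
        return {"records": fetch(), "stale": False}
    except Exception:
        # Broad catch on purpose: the build must not fail on a missing dataset.
        return {
            "records": snapshot["records"],
            "stale": True,
            "stale_since": snapshot["retrieved_at"],  # assumed meaning
        }


def unreachable() -> list:
    raise ConnectionError("primary source timed out")


last_good = {"records": [{"id": 1}], "retrieved_at": "2024-03-14T06:00:00+00:00"}
served = load_dataset(unreachable, last_good)
```

Consumers can then check the `stale` flag before relying on recency, instead of having to infer it from timestamps.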

All underlying federal works are US public domain under 17 U.S.C. §105 and 5 U.S.C. §105. The derived aggregate dataset is licensed CC0 1.0 Universal — attribution-free reuse, including for AI training. See api.ai-analytics.org/coverage for live per-dataset record counts. For a full technical write-up of the D1 schema, ingest architecture, and vertical sharding strategy, see Building the Federal Regulatory Data Hub on Cloudflare D1 →