Methodology

How the data is built.

Verifiable method is the entire point. Three method families cover everything we publish: direct measurement for the censorship index, a common ingest pattern for the Federal Data Hub, and a common documentary method for the accountability datasets. This page documents all three — and what it takes for a measurement or a record to become citable.

Vantages

37+ probe nodes are deployed on six continents, feeding an index that covers 200 countries. Vantage selection balances three constraints: presence inside affected jurisdictions, ASN diversity (so a single ISP outage doesn't fake a censorship event), and operator safety. We do not run probes from infrastructure that exposes operator identity.

For the technical architecture of the probe application: The Voidly Probe: Tauri + boringtun network measurement at the operator's edge →

For how vantages are selected for ASN diversity, operator safety, and hard-to-reach countries: Voidly probe vantage selection: ASN diversity, operator safety, and reaching hard-to-measure countries →

Test list

The test list contains 80 domains spanning news, civil-society organizations, encrypted messaging, circumvention tools, social platforms, and reference content. The list is reviewed quarterly, and additions are governed by the same anti-targeting rules used by OONI's test list.

For the full curation methodology — Citizen Lab's global list, 12 OONI category codes, per-country additions, and emergency updates: Voidly's URL test list: how we curate the domains that reveal internet censorship →

Cadence

Every probe runs the full domain list every 5 minutes, 24×7. That gives a worst-case detection latency around 5 minutes for a sudden block, and a 1-hour rolling baseline sufficient to flag throttling that ramps up gradually.
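The cadence figures can be checked with simple arithmetic. A minimal sketch, assuming the 80-domain list described under Test list; the constant and function names are illustrative and are not taken from the probe code.

```python
# Illustrative cadence arithmetic. Names are hypothetical; values come from this page.
DOMAINS = 80          # size of the test list
INTERVAL_MIN = 5      # each probe runs the full list every 5 minutes
BASELINE_MIN = 60     # length of the rolling baseline window


def runs_per_day(interval_min: int = INTERVAL_MIN) -> int:
    """Number of full-list runs one probe completes in 24 hours."""
    return 24 * 60 // interval_min


def measurements_per_probe_per_day() -> int:
    """Domain measurements one probe produces per day."""
    return DOMAINS * runs_per_day()


def baseline_samples_per_domain() -> int:
    """Samples per domain available inside the rolling baseline."""
    return BASELINE_MIN // INTERVAL_MIN
```

Under these numbers each probe completes 288 runs a day, and the 1-hour baseline holds 12 samples per domain, which is what a gradual throttling ramp is compared against.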

Each probe measurement is paired with a simultaneous control server measurement from a neutral vantage outside the country being tested. The comparison — DNS, TCP reachability, TLS certificate chain, and HTTP body — is what separates genuine censorship from network errors and CDN split-horizon DNS. For the full control server methodology: The Voidly control server: how we tell censorship from a bad network →
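The layer-by-layer comparison can be sketched as follows. This is an illustration under stated assumptions, not the control server's code: the Measurement fields, the compare function, and the 30% body-size tolerance are all hypothetical. DNS answers are compared by ASN rather than by IP address so that a CDN returning different addresses per region is not flagged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    # All field names are hypothetical, chosen for this sketch.
    dns_asns: frozenset                 # ASNs of the resolved addresses (empty if no answer)
    tcp_ok: bool                        # TCP connection established
    tls_cert_sha256: Optional[str]      # leaf certificate fingerprint, None if no handshake
    http_body_len: Optional[int]        # response body size in bytes, None if no response


def compare(probe: Measurement, control: Measurement,
            body_tolerance: float = 0.3) -> list[str]:
    """Return the layers on which the probe disagrees with the control."""
    if not control.tcp_ok:
        # The target is unreachable from the neutral vantage too,
        # so a probe failure says nothing about censorship.
        return ["control_unreachable"]
    diffs = []
    if control.dns_asns and probe.dns_asns.isdisjoint(control.dns_asns):
        diffs.append("dns")
    if not probe.tcp_ok:
        diffs.append("tcp")
        return diffs  # TLS and HTTP cannot be compared without a connection
    if probe.tls_cert_sha256 != control.tls_cert_sha256:
        diffs.append("tls")  # simplification: real deployments may rotate certificates
    if control.http_body_len:
        if (probe.http_body_len is None or
                abs(probe.http_body_len - control.http_body_len)
                > body_tolerance * control.http_body_len):
            diffs.append("http")
    return diffs
```

An empty result means the probe and the control agree on every layer; a non-empty result is only an input to classification, not a verdict.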

Anomaly classifier

DNS tampering. Resolver returns an IP that doesn't match the known ASN, returns NXDOMAIN, or refuses recursion.
TLS interference. Handshake interrupted at SNI, certificate altered, or ClientHello rewritten.
HTTP blocking. Block page returned, content rewritten, or response throttled to zero.
BGP withdrawal. Origin AS disappears from the global routing table for the target prefix.
Throttling. Bandwidth deliberately collapsed for specific services while neighboring services remain unaffected.

For the full technical write-up on feature engineering, the five per-class binary models, and the recall-vs-precision tradeoff: The Voidly anomaly classifier: five interference classes and why we optimize for recall →
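Because there is one binary model per class, a single measurement can carry more than one label. The sketch below shows that shape only: the Interference enum mirrors the five classes listed above, while the label function, the score dictionary, and the 0.3 default threshold are assumptions made for illustration. A lower threshold trades precision for recall.

```python
from enum import Enum


class Interference(Enum):
    DNS_TAMPERING = "dns_tampering"
    TLS_INTERFERENCE = "tls_interference"
    HTTP_BLOCKING = "http_blocking"
    BGP_WITHDRAWAL = "bgp_withdrawal"
    THROTTLING = "throttling"


def label(scores: dict, threshold: float = 0.3) -> set:
    """Return every class whose binary-model score reaches the threshold.

    `scores` maps Interference members to a score in [0, 1]. The threshold
    value is hypothetical; it is not the published operating point.
    """
    return {cls for cls, score in scores.items() if score >= threshold}
```

With scores of 0.9 for DNS tampering and 0.35 for TLS interference, the default threshold returns both labels, while a stricter threshold of 0.5 returns only DNS tampering.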

Cross-source verification

An anomaly is promoted to a verified incident only after at least one independent measurement project reports a matching signal for the same target, location, and time window:

· OONI — Open Observatory of Network Interference (community probe network)
· CensoredPlanet — University of Michigan
· IODA — Internet Outage Detection and Analysis (Georgia Tech)
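The matching rule (same target, same location, same time window, different project) can be sketched as below. The Observation type, the function names, and the 1-hour default window are assumptions for illustration. The sketch also simplifies by treating every source as per-target, whereas outage-level sources would match on location and time only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Observation:
    # Hypothetical normalized record; real source formats differ.
    source: str            # e.g. "voidly", "ooni", "censoredplanet", "ioda"
    target: str            # domain under test
    country: str           # ISO 3166-1 alpha-2 code
    observed_at: datetime  # timezone-aware timestamp


def corroborates(anomaly: Observation, other: Observation,
                 window: timedelta = timedelta(hours=1)) -> bool:
    """True if `other` is an independent signal matching `anomaly`."""
    return (other.source != anomaly.source
            and other.target == anomaly.target
            and other.country == anomaly.country
            and abs(other.observed_at - anomaly.observed_at) <= window)


def is_verified(anomaly: Observation, external: list,
                window: timedelta = timedelta(hours=1)) -> bool:
    """At least one independent project agrees within the window."""
    return any(corroborates(anomaly, o, window) for o in external)
```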

For the full technical write-up on data format normalization, time-window alignment, and confidence scoring: Cross-source censorship verification: reconciling OONI, CensoredPlanet, and IODA →

For how IODA's BGP routing data is used to detect country-level shutdowns: BGP routing signals and internet shutdown detection: how Voidly uses IODA data →

Confidence levels

Anomaly: Single probe, single source. Surfaced as "observed" only.
Corroborated: Multiple Voidly probes agree, or one external project agrees. Public dataset, with caveat.
Verified incident: Voidly + ≥1 external source agree. Citable, included in counters and forecasts.

For the full technical write-up on tier thresholds, independence weighting, and what each tier means for data consumers: From anomaly to verified incident: the Voidly confidence tier system →

Publication and alerting

The probe-to-published-incident pipeline targets under 8 minutes for events with real-time OONI and IODA corroboration. Inline anomaly scoring takes under 50ms; parallel async queries to the OONI Explorer and IODA real-time APIs run concurrently. A two-window alert-fatigue guard prevents false-alarm flooding; BGP full-shutdown signals bypass it. CensoredPlanet corroboration arrives retroactively in a nightly batch pass. Full pipeline write-up →
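The concurrent corroboration step can be sketched with asyncio. The two query functions here are placeholders that sleep and return fixed values; they do not call the real OONI Explorer or IODA APIs, and their names, arguments, and the 5-second timeout are assumptions made for illustration.

```python
import asyncio


async def query_ooni(target: str, country: str) -> bool:
    """Placeholder for an external lookup; returns a fixed answer."""
    await asyncio.sleep(0.01)
    return True


async def query_ioda(country: str) -> bool:
    """Placeholder for an external lookup; returns a fixed answer."""
    await asyncio.sleep(0.01)
    return False


async def corroborate(target: str, country: str, timeout_s: float = 5.0) -> dict:
    """Run both lookups concurrently; a timeout or error counts as no corroboration yet."""
    async def guarded(coro):
        try:
            return await asyncio.wait_for(coro, timeout_s)
        except Exception:
            return False

    ooni, ioda = await asyncio.gather(
        guarded(query_ooni(target, country)),
        guarded(query_ioda(country)),
    )
    return {"ooni": ooni, "ioda": ioda}
```

Calling asyncio.run(corroborate("example.org", "XX")) waits for the slower of the two lookups rather than their sum, which is why running them in parallel matters for a minutes-scale latency target.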

Forecasting

7-day shutdown forecasts are produced from a model trained on historical measurements, conditioned on country-level political and security signals. The model uses an ensemble of ARIMA (per-country telemetry trends) and gradient-boosted classifiers (full feature vector including political calendars and shutdown fingerprints). Forecast quality varies by region; per-country reliability scores are published alongside every forecast. For the technical write-up of the model, see Seven-day internet shutdown forecasting →

Reproducibility

The full measurement dataset (CC BY 4.0) is downloadable from the Hugging Face datasets listing or queryable via the Voidly REST API. The classifier code is published in the voidly-ai GitHub org; anyone with a probe can reproduce a measurement and submit independent corroboration.

Federal Regulatory Data Hub — ingest methodology

Each of the 208 datasets in the Federal Regulatory Data Hub follows a common ingest pattern. The source is always the primary government URL — no secondary aggregators, no third-party resellers. Every record carries a _source envelope with the original URL, retrieval timestamp, and license.

Primary source fetch. Data is pulled directly from government endpoints: EDGAR FTP, openFDA API, OFAC XML, EPA ECHO, SAM.gov, USAspending, FinCEN, FDIC BankFind, CMS provider files, NIST NVD JSON feeds, CISA KEV, and so on.
Scheduled refresh via Cloudflare cron. Each dataset refreshes on a per-source cadence: enforcement datasets typically daily; registries (FDIC institutions, FAA aircraft) weekly; historical datasets only when a change is detected.
Normalization. Records are normalized into per-vertical SQLite tables on Cloudflare D1. Field names are standardized; dates are parsed to ISO 8601; dollar amounts are stored as integers (cents); entity identifiers (CIK, UEI, NPI, DUNS, LEI, ticker) are extracted into indexed columns.
Entity bridge. An entity_master table joins records across datasets by CIK, ticker, UEI, LEI, DUNS, and NPI. This is the cross-agency join layer — one query returns every regulatory event for a company across all 208 datasets.
Content negotiation. Every canonical record URL serves HTML (with Schema.org JSON-LD), Markdown (LLM-friendly excerpts), or JSON based on the Accept header or file extension.
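The normalization and entity-bridge steps above can be sketched with Python's built-in sqlite3 module standing in for Cloudflare D1. Everything named here is hypothetical: the table and column names other than entity_master, the sample company, the input date format, and the key names inside the _source envelope are chosen for illustration and are not the hub's actual schema.

```python
import sqlite3
from datetime import datetime, timezone
from decimal import Decimal


def to_cents(amount: str) -> int:
    """Parse a dollar string such as "$1,250,000.50" into integer cents."""
    cleaned = amount.replace("$", "").replace(",", "").strip()
    return int((Decimal(cleaned) * 100).to_integral_value())


def to_iso_date(raw: str, fmt: str = "%m/%d/%Y") -> str:
    """Parse a source date (format assumed) into an ISO 8601 date string."""
    return datetime.strptime(raw, fmt).date().isoformat()


def source_envelope(url: str, license_id: str) -> dict:
    """Build the per-record _source envelope; key names are assumptions."""
    return {"url": url,
            "retrieved_at": datetime.now(timezone.utc).isoformat(),
            "license": license_id}


def build_demo_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE entity_master (entity_id INTEGER PRIMARY KEY,
                                    name TEXT, cik TEXT, uei TEXT);
        CREATE TABLE sec_actions (cik TEXT, action_date TEXT, penalty_cents INTEGER);
        CREATE TABLE sam_exclusions (uei TEXT, action_date TEXT);
        CREATE INDEX idx_sec_cik ON sec_actions(cik);
        CREATE INDEX idx_sam_uei ON sam_exclusions(uei);
    """)
    db.execute("INSERT INTO entity_master VALUES (1, 'Example Corp', '0000000001', 'EXAMPLEUEI01')")
    db.execute("INSERT INTO sec_actions VALUES (?, ?, ?)",
               ("0000000001", to_iso_date("03/07/2024"), to_cents("$1,250,000.50")))
    db.execute("INSERT INTO sam_exclusions VALUES (?, ?)",
               ("EXAMPLEUEI01", to_iso_date("11/20/2023")))
    return db


def events_for(db: sqlite3.Connection, entity_id: int) -> list:
    """One query across datasets, joined through entity_master identifiers."""
    return db.execute("""
        SELECT 'sec_actions' AS dataset, s.action_date
          FROM entity_master e JOIN sec_actions s ON s.cik = e.cik
         WHERE e.entity_id = ?
        UNION ALL
        SELECT 'sam_exclusions', x.action_date
          FROM entity_master e JOIN sam_exclusions x ON x.uei = e.uei
         WHERE e.entity_id = ?
         ORDER BY 2
    """, (entity_id, entity_id)).fetchall()
```

Storing amounts as integer cents avoids floating-point drift in sums, and keeping each identifier in its own indexed column is what lets the bridge join datasets that share no common key with each other.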

Fallback: if a primary source is temporarily unreachable, the prior day's snapshot is served with a stale: true flag and a stale_since timestamp. The build pipeline never fails on a missing dataset — a static fallback is always available.
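The fallback behaviour can be sketched as a wrapper around the fetch step. The function and parameter names are hypothetical, and the sketch assumes stale_since holds the retrieval time of the snapshot being served; the pipeline's exact semantics may differ.

```python
from datetime import datetime, timezone
from typing import Callable


def load_dataset(fetch: Callable[[], list],
                 snapshot: list,
                 snapshot_time: datetime) -> dict:
    """Return fresh records, or the prior snapshot flagged as stale.

    The broad `except` keeps the sketch short; a real pipeline would
    catch only network and parsing errors.
    """
    try:
        records = fetch()
    except Exception:
        return {"records": snapshot,
                "stale": True,
                "stale_since": snapshot_time.isoformat()}
    return {"records": records, "stale": False, "stale_since": None}
```

The caller never sees an exception from a missing source, which matches the guarantee that the build does not fail on a single unreachable dataset.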

All underlying federal works are US public domain under 17 U.S.C. §105 and 5 U.S.C. §105. The derived aggregate dataset is licensed CC0 1.0 Universal — attribution-free reuse, including for AI training. See api.ai-analytics.org/coverage for live per-dataset record counts. For a full technical write-up of the D1 schema, ingest architecture, and vertical sharding strategy, see Building the Federal Regulatory Data Hub on Cloudflare D1 →

Accountability datasets — common method

The fourteen Voidly accountability datasets share one documentary method, enforced in the build pipelines rather than promised in prose:

Primary public sources only. Government files (ICE detention statistics, USDA AFIDA, Department of Education Section 117, EIA-860, OFAC program pages, official legal texts), regulators' designation feeds, or the subject entity's own filings. The censorship index is the one exception by design: it is direct network measurement, documented above.
Regeneration by script. Each dataset is rebuilt from its sources by a checked-in script — the published JSON, the landing page, and the machine manifest regenerate together, so a correction propagates everywhere in one build.
Schema gates. Source columns are read against per-file whitelists that fail the build if the source's schema drifts; free-text and address fields are never ingested; person-name detectors abort regeneration on a match. Where a register names individuals, we publish aggregates only.
The evidence ladder. An entity is named only where a government record or the entity's own filing names it, and every attribution carries its evidence tier and source link. Where the record stops, the dataset says so instead of guessing.
Machine parity. Every dataset ships as keyless static JSON, enumerated in one manifest at /voidly/datasets.json; counts on the pages and in the JSON come from the same build.
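A schema gate of the kind described above can be sketched as a header check that aborts before any row is read. The names SchemaDrift and read_gated, the split into an ingest set and a never-ingest set, and the sample columns are all assumptions for illustration; the person-name detectors are not modelled here.

```python
import csv
import io


class SchemaDrift(Exception):
    """Raised to fail the build when a source file's header changes."""


def read_gated(csv_text: str, ingest: set, never_ingest: set) -> list:
    """Read only whitelisted columns; abort on any unexpected or missing column.

    `never_ingest` lists columns known to exist in the source (free text,
    addresses) that must be recognised but never copied into the output.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    expected = ingest | never_ingest
    if header != expected:
        raise SchemaDrift(
            f"unexpected: {sorted(header - expected)}, "
            f"missing: {sorted(expected - header)}"
        )
    return [{col: row[col] for col in sorted(ingest)} for row in reader]
```

Failing on an unknown column, rather than silently ignoring it, is the point of the gate: a new field added upstream cannot reach the published JSON until someone has reviewed it.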

The rules themselves — zero personal data, records-not-allegations, corrections and right of reply — are stated in full at /standards/. Each dataset's landing page carries its own Method & caveats section with the source list, build date, and the dataset-specific gates.