
The Voidly open datasets on HuggingFace: structure, daily snapshots, and filter recipes

February 1, 2025 · AI Analytics

Tags: Censorship, Voidly, Open data, HuggingFace, Infrastructure

Two datasets, one pipeline

Voidly publishes two separate HuggingFace datasets under CC BY 4.0:

emperor-mew/global-censorship-index — the primary measurement dataset. One row per probe measurement, with all classifier outputs, confidence tier, cross-source corroboration fields, and BGP signals. Updated daily with the previous day's measurements. Covers 200 countries and 37+ probe nodes.
emperor-mew/ooni-censorship-historical — the normalized OONI corpus. 200M+ raw OONI measurements schema-normalized across 20 measurement types into a flat ML-ready format. 1.66M+ cumulative downloads. Static except for occasional backfill corrections.

Both are hosted as Parquet files via HuggingFace's git-lfs backend. You can access them with the datasets library, pandas, polars, DuckDB, or the raw huggingface_hub file download API.

global-censorship-index: layout and partitioning

The global-censorship-index dataset is partitioned by country_code and year_month to keep individual Parquet files at a manageable size. In the repo, the partitions appear as Hive-style country=XX/year_month=YYYY-MM directories:

data/
  country=AF/year_month=2024-08/measurements.parquet
  country=AF/year_month=2024-09/measurements.parquet
  ...
  country=CN/year_month=2024-08/measurements.parquet
  country=CN/year_month=2024-09/measurements.parquet
  ...
  country=IR/year_month=2024-08/measurements.parquet
  ...

High-density countries (CN, IR, RU) produce the largest files, typically 40–120 MB compressed per country-month. Low-coverage countries are often under 1 MB per month. The total dataset across all countries is approximately 180 GB compressed.
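
Because the layout is regular, the path of any country-month file can be built from its two partition values. The following is a minimal sketch under that assumption; the helper names partition_path and partition_url are illustrative and not part of any Voidly or HuggingFace API.

```python
import re

REPO_ID = "emperor-mew/global-censorship-index"

# Partition values as they appear in the repo tree shown above
_COUNTRY_RE = re.compile(r"[A-Z]{2}")
_YEAR_MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def partition_path(country: str, year_month: str) -> str:
    """Repo-relative path of one country-month Parquet file (hypothetical helper)."""
    if not _COUNTRY_RE.fullmatch(country):
        raise ValueError(f"expected a two-letter uppercase country code, got {country!r}")
    if not _YEAR_MONTH_RE.fullmatch(year_month):
        raise ValueError(f"expected YYYY-MM, got {year_month!r}")
    return f"data/country={country}/year_month={year_month}/measurements.parquet"


def partition_url(country: str, year_month: str, revision: str = "main") -> str:
    """Direct-download URL for one country-month file at a given revision."""
    path = partition_path(country, year_month)
    return f"https://huggingface.co/datasets/{REPO_ID}/resolve/{revision}/{path}"


print(partition_url("RU", "2024-11"))
```

The resulting URL can be passed to pd.read_parquet, and the repo-relative path to hf_hub_download. Validating the inputs up front turns a typo such as a lowercase country code into a clear error instead of a failed download.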

The daily update appends the previous day's measurements to the current month's year_month=YYYY-MM folder. At month end, that folder is closed and a new one begins. Because the repo is git-backed (with the Parquet files stored via git-lfs), every daily commit is preserved, so you can check out the commit for any snapshot date to reproduce a point-in-time view of the dataset.
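
Since each month lives in its own folder, a date range maps to a short list of year_month labels. The sketch below enumerates those labels and turns them into glob patterns; month_partitions and data_files_for are hypothetical helper names, and the pattern assumes the folder layout shown above.

```python
from datetime import date


def month_partitions(start: date, end: date) -> list[str]:
    """Return every YYYY-MM label from start's month through end's month, inclusive."""
    if (start.year, start.month) > (end.year, end.month):
        raise ValueError("start must not be after end")
    labels = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        labels.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return labels


def data_files_for(country: str, start: date, end: date) -> list[str]:
    """Glob patterns covering one country's partitions for the given date range."""
    return [
        f"data/country={country}/year_month={ym}/*.parquet"
        for ym in month_partitions(start, end)
    ]


print(data_files_for("IR", date(2024, 11, 15), date(2025, 1, 3)))
```

The returned list can be passed as data_files to load_dataset to avoid fetching a country's full history. A pattern for a month with no measurements may fail to match any file, so it is worth checking the repo tree for sparse countries first. Rows still need a measurement_start filter afterwards, because the partitions are whole months.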

Schema: the fields you'll actually use

Each row in global-censorship-index maps to one probe measurement. The full field reference is in the dataset schema article. For most use cases, these are the fields to start with:

# Identity and timing
measurement_id        # UUID, primary key
probe_id              # anonymized probe identifier
country_code          # ISO 3166-1 alpha-2
asn                   # probe ASN (e.g. "AS4134" for China Telecom)
measurement_start     # UTC timestamp (microsecond resolution)

# What was tested
domain                # e.g. "twitter.com", "bbc.co.uk"
url                   # full URL including path if non-root
category_code         # OONI category: NEWS, SMG, HUMR, POLR, COMM, ...
test_protocol         # "dns", "tcp", "tls", "http", "https"

# Classifier output
interference_type     # "dns_tampering" | "tls_interference" | "http_blocking"
                      # | "bgp_withdrawal" | "throttling" | null
interference_prob     # float [0, 1] — classifier confidence for winning class
confidence_tier       # "ANOMALY" | "CORROBORATED" | "VERIFIED"

# Cross-source corroboration
ooni_corroborated     # bool — OONI measurement agrees within 4h window
cp_corroborated       # bool — CensoredPlanet measurement agrees
ioda_corroborated     # bool — IODA BGP or ping signal agrees
corroboration_score   # float [0, 1] — composite independence-weighted score

# Control comparison
control_failure       # bool — control server also failed (not censorship)
blockpage_match       # bool — response body matches known block page
blockpage_fp_id       # fingerprint ID from the library (nullable)

# BGP signals
bgp_withdrawal        # bool — prefix withdrawn near measurement time
bgp_outage_score      # float [0, 1] — BGP-derived internet availability
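
A quick sanity check on loaded rows can catch schema drift early. This is a minimal sketch that validates a row (as a plain dict, the shape yielded by a streaming datasets iterator) against the enumerated values and [0, 1] ranges listed above. The validate_row helper is hypothetical, and treating confidence_tier as non-nullable is an assumption drawn from the listing, not a documented guarantee.

```python
# Allowed values copied from the field listing above; None stands for null
INTERFERENCE_TYPES = {
    "dns_tampering", "tls_interference", "http_blocking",
    "bgp_withdrawal", "throttling", None,
}
CONFIDENCE_TIERS = {"ANOMALY", "CORROBORATED", "VERIFIED"}
UNIT_INTERVAL_FIELDS = ("interference_prob", "corroboration_score", "bgp_outage_score")


def validate_row(row: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the row looks sane."""
    problems = []
    if row.get("interference_type") not in INTERFERENCE_TYPES:
        problems.append(f"unknown interference_type: {row.get('interference_type')!r}")
    # Assumption: every row carries one of the three tiers
    if row.get("confidence_tier") not in CONFIDENCE_TIERS:
        problems.append(f"unknown confidence_tier: {row.get('confidence_tier')!r}")
    for field in UNIT_INTERVAL_FIELDS:
        value = row.get(field)
        # Null scores are tolerated here; out-of-range scores are not
        if value is not None and not 0.0 <= value <= 1.0:
            problems.append(f"{field} outside [0, 1]: {value!r}")
    return problems


example = {
    "interference_type": "dns_tampering",
    "confidence_tier": "VERIFIED",
    "interference_prob": 0.97,
    "corroboration_score": 0.8,
    "bgp_outage_score": 0.1,
}
print(validate_row(example))
```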

Accessing the dataset: Python recipes

Load with the HuggingFace datasets library

from datasets import load_dataset

# All measurements for Iran, full history
ds = load_dataset(
    "emperor-mew/global-censorship-index",
    data_files="data/country=IR/**/*.parquet",
    split="train",
)

# Convert to pandas
df = ds.to_pandas()
print(df.shape)  # (~2.1M rows for IR as of 2025-01)

# Filter to verified incidents only
verified = df[df["confidence_tier"] == "VERIFIED"]

# DNS tampering in verified incidents
dns_blocks = verified[verified["interference_type"] == "dns_tampering"]
dns_blocks[["measurement_start", "domain", "asn", "blockpage_fp_id"]].head(10)

Load a single country-month with pandas

import pandas as pd

# Direct URL — no token required (CC BY 4.0)
url = (
    "https://huggingface.co/datasets/emperor-mew/global-censorship-index"
    "/resolve/main/data/country=RU/year_month=2024-11/measurements.parquet"
)

df = pd.read_parquet(url, engine="pyarrow")
print(df.dtypes)
print(df["interference_type"].value_counts())

Multi-country query with DuckDB (no download required)

import duckdb

con = duckdb.connect()

# httpfs is DuckDB's remote-file extension (not HuggingFace-specific)
con.execute("INSTALL httpfs; LOAD httpfs;")

# Glob patterns are not expanded over plain https:// URLs, so use
# DuckDB's hf:// scheme (DuckDB 0.10.3+), which can list repo files
base = "hf://datasets/emperor-mew/global-censorship-index/data"

# Verified HTTP blocking in Turkey and Azerbaijan since 2024-10-01
result = con.execute(f"""
    SELECT
        country_code,
        domain,
        COUNT(*) AS incident_count,
        AVG(corroboration_score) AS avg_corroboration
    FROM read_parquet([
        '{base}/country=TR/**/*.parquet',
        '{base}/country=AZ/**/*.parquet'
    ])
    WHERE confidence_tier = 'VERIFIED'
      AND interference_type = 'http_blocking'
      AND measurement_start >= '2024-10-01'
    GROUP BY 1, 2
    ORDER BY incident_count DESC
    LIMIT 20
""").fetchdf()

print(result)

Filter recipe: all verified incidents affecting news media

# Useful for journalists: all verified news-media censorship, any country
import pandas as pd
from huggingface_hub import list_repo_tree, hf_hub_download
import pyarrow.parquet as pq

# Yield one filtered DataFrame per Parquet file, so only the matching
# rows of a single country-month are read into memory at a time
def iter_measurements(country_codes, category="NEWS", tier="VERIFIED"):
    repo = "emperor-mew/global-censorship-index"
    for country in country_codes:
        # recursive=True descends into the year_month=... subfolders
        files = list_repo_tree(
            repo,
            path_in_repo=f"data/country={country}",
            recursive=True,
            repo_type="dataset",
        )
        for f in files:
            # The listing also contains folders; download Parquet files only
            if not f.path.endswith(".parquet"):
                continue
            local = hf_hub_download(repo, f.path, repo_type="dataset")
            table = pq.read_table(
                local,
                filters=[
                    ("confidence_tier", "=", tier),
                    ("category_code", "=", category),
                ],
            )
            yield table.to_pandas()

# Verified news-media censorship in 5 countries
frames = list(iter_measurements(["CN", "IR", "RU", "TR", "BY"]))
news_blocks = pd.concat(frames, ignore_index=True)
print(news_blocks.groupby("country_code")["domain"].nunique())

R access via arrow

library(arrow)
library(dplyr)

# Load a single country-month (no auth required)
url <- paste0(
  "https://huggingface.co/datasets/emperor-mew/",
  "global-censorship-index/resolve/main/",
  "data/country=IR/year_month=2024-12/measurements.parquet"
)

ds <- read_parquet(url)

ds |>
  filter(confidence_tier == "VERIFIED") |>
  count(interference_type, sort = TRUE)

ooni-censorship-historical: the normalized OONI corpus

The second dataset is a static normalized copy of the OONI raw measurement archive through mid-2024. The full build is described in the OONI historical corpus article — 200M+ measurements from 2012 to mid-2024, normalized across the 20 most common OONI test types into a flat CSV/Parquet schema.

This dataset is larger (approximately 650 GB uncompressed) and hosted as a collection of compressed Parquet parts. It is the ML training source for the Voidly anomaly classifier: the labeled subset (using OONI's confirmed flag plus the blockpage_hash field as ground truth) is what the weak supervision label functions in the ML training pipeline operate over.

import pandas as pd
from datasets import load_dataset

# Load the OONI historical corpus — large, stream it
ds = load_dataset(
    "emperor-mew/ooni-censorship-historical",
    streaming=True,
    split="train",
)

# Iterate without loading into RAM
for batch in ds.iter(batch_size=10_000):
    df = pd.DataFrame(batch)
    confirmed = df[df["ooni_confirmed"] == True]
    # Process confirmed measurements...

Versioning and point-in-time access

Both datasets use HuggingFace's git-lfs backend, so every Parquet file has a full git history. To reproduce a specific point-in-time snapshot — for paper reproducibility or audit purposes — use the revision parameter with a commit hash:

from datasets import load_dataset

# Reproduce the dataset state from a specific commit
ds = load_dataset(
    "emperor-mew/global-censorship-index",
    data_files="data/country=CN/**/*.parquet",
    revision="a3f91d7b",  # git commit SHA
    split="train",
)

# Or download the whole snapshot with huggingface_hub's snapshot_download:
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="emperor-mew/global-censorship-index",
    repo_type="dataset",
    revision="a3f91d7b",
    local_dir="./snapshots/voidly-2024-11-15",
)

The HuggingFace dataset page lists the 10 most recent commits, each tagged with the data cutoff date. For reproducibility, record both the revision SHA and the measurement date range you filtered to — this combination fully specifies a reproducible subset.
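
To find the revision for a given date programmatically, one option is to pick the newest commit created at or before a cutoff. The sketch below separates the selection logic from the network call. Commit, latest_commit_before, and fetch_commits are hypothetical names. fetch_commits relies on list_repo_commits from huggingface_hub. Note that it selects by commit creation time, which is an assumption and may differ from the data cutoff date shown on the dataset page.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class Commit:
    """Minimal stand-in for the commit metadata needed here."""
    commit_id: str
    created_at: datetime  # timezone-aware


def latest_commit_before(commits: Iterable[Commit], cutoff: datetime) -> Optional[str]:
    """Return the SHA of the newest commit created at or before cutoff, or None."""
    eligible = [c for c in commits if c.created_at <= cutoff]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.created_at).commit_id


def fetch_commits(repo_id: str) -> list[Commit]:
    """Fetch commit metadata from the Hub (network call; needs huggingface_hub)."""
    from huggingface_hub import list_repo_commits

    return [
        Commit(c.commit_id, c.created_at)
        for c in list_repo_commits(repo_id, repo_type="dataset")
    ]


# Made-up commit IDs and timestamps, for illustration only
history = [
    Commit("sha-nov-14", datetime(2024, 11, 14, 6, 2, tzinfo=timezone.utc)),
    Commit("sha-nov-15", datetime(2024, 11, 15, 6, 1, tzinfo=timezone.utc)),
    Commit("sha-nov-16", datetime(2024, 11, 16, 6, 3, tzinfo=timezone.utc)),
]
cutoff = datetime(2024, 11, 15, 23, 59, tzinfo=timezone.utc)
print(latest_commit_before(history, cutoff))
```

The returned SHA can then be passed as the revision argument to load_dataset or snapshot_download.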

Update cadence and staleness

The global-censorship-index receives a daily append: measurements from T-1 (yesterday UTC) are exported, converted to Parquet, and committed to the HuggingFace repo by approximately 06:00 UTC. The usual latency from probe measurement to dataset commit is 20–26 hours (one calendar day plus the export window).
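
The cadence above implies a simple rule for which day of data a consumer can expect to find. The following sketch assumes the commit lands at exactly 06:00 UTC on the day after the measurement, which is an idealization of the "approximately 06:00 UTC" stated here. Both function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta, timezone

EXPORT_HOUR_UTC = 6  # assumed exact; the article says "approximately 06:00 UTC"


def _to_utc(moment: datetime) -> datetime:
    """Reject naive datetimes so local-time assumptions cannot slip in."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc)


def expected_commit_time(measurement_start: datetime) -> datetime:
    """Time at which a measurement should appear in the dataset under the assumed schedule."""
    next_day = _to_utc(measurement_start).date() + timedelta(days=1)
    return datetime.combine(next_day, time(EXPORT_HOUR_UTC), tzinfo=timezone.utc)


def latest_available_day(now: datetime) -> date:
    """Most recent UTC day whose measurements should already be committed."""
    utc_now = _to_utc(now)
    days_back = 1 if utc_now.hour >= EXPORT_HOUR_UTC else 2
    return utc_now.date() - timedelta(days=days_back)


measured = datetime(2024, 11, 14, 8, 0, tzinfo=timezone.utc)
committed = expected_commit_time(measured)
print(committed, committed - measured)
```

A pipeline that runs before the export window closes can use latest_available_day to avoid treating a not-yet-published day as a day with zero measurements.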

For real-time access (under 8 minutes from measurement to published incident), use the REST API instead of the HuggingFace snapshots. For bulk analysis, retrospective research, and ML training, HuggingFace is more practical because you can filter and iterate locally without rate limits.

The ooni-censorship-historical dataset does not receive daily updates. Occasional backfill commits are made when we reprocess historical records — for example, when a new block-page fingerprint is retrospectively applied or when a measurement reclassification changes OONI confirmed status for a batch of records.

Confidence tier filtering for different consumers

The three confidence tiers have different appropriate uses. The confidence tier article covers the definitions in detail; here are the practical filtering rules:

Use case	Recommended filter	Reason
Journalism / public reporting	`confidence_tier == "VERIFIED"`	Cross-source confirmation; lowest false-positive rate
ML training (positive class)	`confidence_tier in ["CORROBORATED", "VERIFIED"]`	Sufficient signal; ANOMALY tier has higher noise
ML training (full labeled set)	All tiers with tier as label	Treat tier as ordinal target; model uncertainty directly
Infrastructure monitoring	`confidence_tier == "ANOMALY"` + high `interference_prob`	Catch emerging blocks before cross-source confirmation arrives
Country-level statistics	`confidence_tier == "VERIFIED"`	Avoids inflating counts with noisy ANOMALY measurements
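
The rules in the table can be expressed as row predicates, which work on the dict rows yielded by a streaming dataset. This is a minimal sketch: the TIER_FILTERS mapping and its keys are hypothetical names, and the 0.9 cut-off for "high" interference_prob is an assumption, since the table does not give a number.

```python
from typing import Callable, Dict

# Hypothetical cut-off for "high" interference_prob; tune for your use case
ANOMALY_PROB_THRESHOLD = 0.9


def _tier_in(*tiers: str) -> Callable[[dict], bool]:
    """Build a predicate that keeps rows whose confidence_tier is in tiers."""
    allowed = frozenset(tiers)
    return lambda row: row["confidence_tier"] in allowed


def _emerging_block(row: dict) -> bool:
    """ANOMALY-tier rows with a high classifier probability."""
    prob = row["interference_prob"]
    return (
        row["confidence_tier"] == "ANOMALY"
        and prob is not None
        and prob >= ANOMALY_PROB_THRESHOLD
    )


TIER_FILTERS: Dict[str, Callable[[dict], bool]] = {
    "journalism": _tier_in("VERIFIED"),
    "ml_positive_class": _tier_in("CORROBORATED", "VERIFIED"),
    "ml_full_labeled_set": lambda row: True,  # keep all tiers; use the tier as the label
    "infrastructure_monitoring": _emerging_block,
    "country_statistics": _tier_in("VERIFIED"),
}

rows = [
    {"confidence_tier": "VERIFIED", "interference_prob": 0.99},
    {"confidence_tier": "CORROBORATED", "interference_prob": 0.85},
    {"confidence_tier": "ANOMALY", "interference_prob": 0.95},
    {"confidence_tier": "ANOMALY", "interference_prob": 0.40},
]
for use_case, keep in TIER_FILTERS.items():
    print(use_case, sum(keep(r) for r in rows))
```

With the datasets library, a predicate can be passed directly to a streaming dataset's filter method, for example ds.filter(TIER_FILTERS["journalism"]).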

Probe quality filtering

Not all probe nodes have equal quality. Low-quality probes (high control_failure rate, low measurement frequency, flapping state) are not removed from the dataset. They are labeled so that consumers can filter them out:

import pandas as pd

# Remove measurements where the control server also failed
# (control_failure == True suggests a network issue rather than censorship)
clean = df[~df["control_failure"]]

# For ML training: remove measurements taken during probe ISOLATED states.
# These rows lack cross-source corroboration and have a non-null
# inference_dropped field, so keep only rows where it is null
clean = clean[clean["inference_dropped"].isna()]

# Per-ASN quality filter: drop ASN-months with < 100 measurements
# to avoid outlier probes with tiny sample sizes
month = pd.to_datetime(clean["measurement_start"]).dt.strftime("%Y-%m")
asn_month_size = clean.groupby(["asn", month])["asn"].transform("size")
clean = clean[asn_month_size >= 100]

Time-based train/val/test splits

The ML training pipeline uses time-based splits to prevent temporal leakage. If you're using the dataset for ML, mirror this approach:

import pandas as pd

df["measurement_start"] = pd.to_datetime(df["measurement_start"])

# Chronological split — no random shuffling
train = df[df["measurement_start"] < "2024-07-01"]
val   = df[(df["measurement_start"] >= "2024-07-01") &
           (df["measurement_start"] <  "2024-10-01")]
test  = df[df["measurement_start"] >= "2024-10-01"]

# Key: never shuffle before splitting
# Shuffling would cause future measurements to appear in training data,
# and the classifier to learn from its own future outputs via
# cross-source corroboration fields (which reference OONI measurements
# that may postdate the measurement being classified)

print(f"Train: {len(train):,}  Val: {len(val):,}  Test: {len(test):,}")

Citation format

The dataset is CC BY 4.0. The preferred citation for academic work:

@dataset{voidly_global_censorship_index_2025,
  title   = {Voidly Global Censorship Index},
  author  = {{AI Analytics}},
  year    = {2025},
  url     = {https://huggingface.co/datasets/emperor-mew/global-censorship-index},
  license = {CC BY 4.0},
  note    = {200 countries, daily updates. Cite the dataset revision SHA
             and your measurement date range for reproducibility.}
}

For journalism, a short attribution is sufficient: “Source: Voidly / AI Analytics, CC BY 4.0, ai-analytics.org” with a link to the dataset URL.

For the full field-by-field schema of every column in the dataset: The Voidly measurement dataset: field-by-field schema reference →

For real-time access to incidents (under 8 minutes from probe to published event): The Voidly REST API: querying the global censorship index in real time →

For querying this dataset from Claude, GPT, or any MCP-capable agent: The Voidly MCP server: 83 censorship query tools for Claude and GPT →

For the confidence tier definitions and what each tier means for different consumers: From anomaly to verified incident: the Voidly confidence tier system →

For the OONI historical corpus build — schema normalization across 20 test types, handling probe version drift, and what was left out: Building the OONI historical corpus: 1.66M downloads, schema normalization, and the decisions behind the dataset →

For how the Voidly ML classifier generates the interference_type and interference_prob fields that appear in each dataset row: The Voidly anomaly classifier: five interference classes, gradient boosted trees, and why we optimize for recall →

For the nightly pipeline that converts TimescaleDB rows into the Parquet files in this dataset — PyArrow schema, Zstandard level 3 compression, named cursor streaming, and SHA-256 post-push verification: Voidly's nightly Parquet export: from TimescaleDB to HuggingFace →

For how the lifecycle state fields in this dataset (confidence_tier, is_active, resolved_at) are set — incident state machine, transition thresholds, and publication timing: Censorship incident lifecycle in Voidly: from anomaly detection to verified incident to resolution →