
Incident clustering and deduplication: how Voidly avoids counting the same censorship event twice

December 22, 2025 · 7 min read · AI Analytics

Censorship · Voidly · Methodology · Infrastructure

The problem with raw measurement counts

When Iran blocks Instagram, hundreds of probes across dozens of ASNs begin reporting the same anomaly within minutes. Each probe runs its measurement cycle every five minutes, and most probes in the country run continuously for the duration of the block — which might last hours, days, or weeks. By the time the block lifts, the raw measurement table might contain 40,000 individual rows flagging Instagram as blocked in Iran, each with a distinct timestamp and probe ID.

Without deduplication, these 40,000 rows produce 40,000 “detected incidents” in any naive counting query. A journalist asking “how many censorship incidents did Iran have this month?” would get a number inflated by two to three orders of magnitude. A political science researcher modeling censorship frequency would build models on fundamentally wrong event counts. A dashboard claiming “1,574 verified incidents” is meaningless if each incident is actually a measurement.

The right unit of analysis for journalists and researchers is the incident: a contiguous period of interference affecting a specific target in a specific country, identified by a specific mechanism. Voidly's clustering algorithm collapses the measurement stream into this unit, and every measurement record carries an incident_id linking it to the parent incident. This post explains exactly how that clustering works.

The clustering key

Every measurement is assigned to a potential incident window based on a four-part clustering key:

(country_code, domain, interference_type, probe_type_group)

The first two components are straightforward. country_code is the ISO 3166-1 alpha-2 code for the country where the probe is located, not the country of the target. A probe in Iran reporting a block on bbc.co.uk contributes to the Iran key, not a UK key. domain is the registered domain being tested — not the full URL, since the same block typically affects all paths under a domain.

interference_type is one of five values from the anomaly classifier: dns_tampering, http_blocking, tls_interference, tcp_reset, or bgp_withdrawal. Including the interference type in the clustering key is a deliberate design choice: the same domain can be blocked by DNS tampering and HTTP blocking simultaneously, and these are tracked as separate incidents because they may start and end at different times, affect different probe configurations, and indicate different implementation choices by the blocking party. A DNS block is typically deployed at the ISP resolver level; an HTTP block may require a deep-packet inspection box. Merging them into a single incident would obscure which mechanism was deployed when and whether they were removed together or separately.

probe_type_group collapses the three physical probe categories (desktop, mobile, and datacenter) into a single group label. Probes of all three types on the same ASN observe the same network path; a mobile device and a datacenter VM in the same ISP see the same DNS resolver, the same BGP routes, and the same DPI box. Keeping them as separate keys would multiply the incident count without adding information. A country with 10 probes across 4 ASNs, some desktop and some mobile, produces a single incident per (country, domain, interference_type) tuple rather than one per probe type.
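As a rough illustration of how the key behaves, here is a minimal sketch. The group label "all", the ClusteringKey type, and the clustering_key helper are names invented for this example, not part of Voidly's codebase.

```python
from typing import NamedTuple

# Assumption: the three physical probe categories named in the text all map to
# one shared label. The label string "all" is made up for this sketch.
PROBE_TYPE_GROUPS = {"desktop": "all", "mobile": "all", "datacenter": "all"}


class ClusteringKey(NamedTuple):
    country_code: str       # where the probe is, not where the target is hosted
    domain: str             # registered domain, not the full URL
    interference_type: str  # one of the five classifier values
    probe_type_group: str


def clustering_key(probe_country: str, domain: str,
                   interference_type: str, probe_type: str) -> ClusteringKey:
    return ClusteringKey(
        probe_country.upper(),
        domain.lower(),
        interference_type,
        PROBE_TYPE_GROUPS[probe_type],
    )


# A mobile probe and a datacenter probe in Iran, both seeing DNS tampering on
# bbc.co.uk, land on the same key; HTTP blocking of the same domain does not.
k1 = clustering_key("IR", "bbc.co.uk", "dns_tampering", "mobile")
k2 = clustering_key("ir", "BBC.co.uk", "dns_tampering", "datacenter")
k3 = clustering_key("IR", "bbc.co.uk", "http_blocking", "mobile")
print(k1 == k2, k1 == k3)  # True False
```

Because the key is a plain tuple, it can be used directly as a dictionary key or a GROUP BY column list.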

The 6-hour gap rule

Clustering by key alone is not enough. A domain blocked for three separate weeks would collapse into a single incident if we only grouped by key. We need a rule for when a gap in measurements closes one incident and potentially opens another.

The rule is: if no measurement for a given clustering key crosses the anomaly threshold for six consecutive hours, the current incident window closes and is marked RESOLVED. The next anomalous measurement for that key starts a new incident.

Calibrating this threshold involved real tradeoffs. Too short — say, 30 minutes — and a block with a measurement gap caused by probe downtime gets split into two incidents. Probe operators sometimes restart their infrastructure, creating gaps of 15–45 minutes in the measurement stream for a given ASN. At 30 minutes, these gaps routinely over-split genuine long-running blocks. Too long — say, 48 hours — and distinct blocking campaigns get merged. Iran blocked Instagram for 3 days in October 2022, lifted the block for 8 hours, then reimposed it. A 48-hour gap rule would register this as one incident; a 6-hour gap rule correctly identifies two.

We settled on 6 hours after empirical analysis of 400 known incidents from 2023 and 2024 with independently verified start and end times sourced from NetBlocks reports, IODA alerts, and ISP-level confirmations. Against this ground truth, the 6-hour rule produced an over-split rate of 3.2% and a merge rate of 1.8%, better than any other threshold between 1 hour and 24 hours. Below 4 hours the over-split rate jumped above 8%; above 12 hours the merge rate climbed above 5%.
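The gap rule on its own can be sketched in a few lines. The example below uses a synthetic stream shaped like the Instagram case above (three days of anomalies, an 8-hour pause, then more anomalies). The split_windows helper and the start date are placeholders for illustration, and the sketch deliberately ignores the re-open window described later in this post.

```python
from datetime import datetime, timedelta, timezone


def split_windows(anomalous_times, gap):
    """Group timestamps into (start, end) windows, splitting on gaps > gap."""
    windows = []
    for ts in sorted(anomalous_times):
        if windows and ts - windows[-1][1] <= gap:
            windows[-1][1] = ts          # still inside the gap: extend the window
        else:
            windows.append([ts, ts])     # gap exceeded (or first point): new window
    return [tuple(w) for w in windows]


# Synthetic stream: anomalies every 5 minutes for 3 days, an 8-hour pause,
# then another day of anomalies. The start date is a placeholder.
t0 = datetime(2022, 10, 1, tzinfo=timezone.utc)
step = timedelta(minutes=5)
first_block = [t0 + i * step for i in range(3 * 24 * 12)]
resume = first_block[-1] + timedelta(hours=8)
second_block = [resume + i * step for i in range(24 * 12)]
stream = first_block + second_block

six_hour = split_windows(stream, timedelta(hours=6))
two_day = split_windows(stream, timedelta(hours=48))
print(len(six_hour), len(two_day))  # 2 1
```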

Incident ID assignment

When the first measurement crosses the anomaly threshold for a new key (or after a gap that closed the previous window), a new incident is created and assigned a stable identifier in this format:

inc_{cc}_{YYYYMMDD}_{8-char-hash}

# Examples:
inc_IR_20241015_a3f8c91b   # Iran, 15 October 2024
inc_MM_20240201_7d2e44fa   # Myanmar, 1 February 2024
inc_RU_20250318_c9b03e12   # Russia, 18 March 2025

The date component uses the UTC date of the first anomalous measurement in the window. The 8-character hash is the first 8 hex characters of the SHA-256 of the string "{country_code}:{domain}:{interference_type}:{window_start_unix_ts}". Because the hash inputs include the exact Unix timestamp of the window start, the ID is stable across retroactive reprocessing: re-ingesting the same measurement stream produces identical IDs for the same incident windows. This matters when we re-cluster after correcting classifier errors or after adding new probe data for a historical period.
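For consumers who want to validate or unpack IDs on their side, a small parser is enough. This is a sketch based only on the format and examples above; it assumes an uppercase two-letter country code and lowercase hex, and the parse_incident_id name is made up for this example.

```python
import re
from datetime import date, datetime
from typing import NamedTuple

# Assumption: uppercase alpha-2 code, 8-digit date, 8 lowercase hex characters.
INCIDENT_ID_RE = re.compile(r"inc_([A-Z]{2})_(\d{8})_([0-9a-f]{8})")


class ParsedIncidentId(NamedTuple):
    country_code: str
    first_seen: date   # UTC date of the first anomalous measurement
    hash8: str


def parse_incident_id(incident_id: str) -> ParsedIncidentId:
    match = INCIDENT_ID_RE.fullmatch(incident_id)
    if match is None:
        raise ValueError(f"not a valid incident ID: {incident_id!r}")
    cc, yyyymmdd, hash8 = match.groups()
    # strptime raises ValueError for impossible dates such as 20241345
    first_seen = datetime.strptime(yyyymmdd, "%Y%m%d").date()
    return ParsedIncidentId(cc, first_seen, hash8)


parsed = parse_incident_id("inc_IR_20241015_a3f8c91b")
print(parsed.country_code, parsed.first_seen.isoformat(), parsed.hash8)
# IR 2024-10-15 a3f8c91b
```

Note that the hash cannot be reversed into its inputs; the domain and interference type have to come from the incident record itself.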

Every row in the measurement table carries an incident_id field populated at ingest time, within approximately 30 seconds of measurement arrival. Measurements that do not cross the anomaly threshold receive a null incident_id — they are stored for baseline calibration purposes but do not contribute to any incident.

Incident lifecycle

An incident moves through four states:

ANOMALY: The first measurement crosses the classifier threshold. The incident is created, assigned its ID, and enters the queue for corroboration evaluation. At this stage, only a single measurement (or a small cluster of same-ASN measurements within the same 5-minute cycle) has fired.
CORROBORATED: Two or more independent probe ASNs report the same interference_type within the incident window. “Independent” means distinct ASNs — same-ASN probes share a network path and are explicitly not counted as independent corroboration. This state is reachable from ANOMALY in a single corroboration pass, which typically completes within 2–4 minutes of the first measurement.
VERIFIED: Cross-source corroboration from at least one external platform — OONI, CensoredPlanet, or IODA — confirms the event within the incident window. The cp_corroborated, ooni_corroborated, and ioda_corroborated boolean flags on the incident record indicate which external sources agreed.
RESOLVED: The 6-hour gap rule closes the window. No new measurements for the clustering key have crossed the threshold for 6 consecutive hours. The incident record is finalized with a window_end timestamp and a duration_hours field.

A RESOLVED incident can be re-opened if new anomalous measurements for the same clustering key arrive within a 12-hour re-open window after resolution. When this happens, the existing incident ID is preserved, the window_end is cleared, and the incident returns to its previous state. Beyond 12 hours, the original incident stays RESOLVED and a new incident is created. The 12-hour re-open window is distinct from the 6-hour gap rule: the gap rule determines when a window closes; the re-open window determines how long after closure a resumption is treated as continuation versus recurrence.

The distinction between CORROBORATED and VERIFIED matters for data consumers. CORROBORATED events represent robust multi-ASN agreement within Voidly's own probe network — real signal, but from a single measurement infrastructure. VERIFIED events require external confirmation, which protects against systematic errors in Voidly's classifier that could affect all probes simultaneously (e.g., a CDN that consistently returns a block-page-like response from a specific country). Journalists citing specific events should use VERIFIED; researchers training models on the full distribution of censorship patterns can use CORROBORATED as well.
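One way to read the four states is as a function of the evidence attached to an incident. The sketch below is an interpretation, not the production state machine: the IncidentEvidence type and lifecycle_state function are hypothetical, and it assumes VERIFIED builds on CORROBORATED rather than being reachable directly from ANOMALY.

```python
from dataclasses import dataclass


@dataclass
class IncidentEvidence:
    distinct_asns: int            # independent probe ASNs that reported the anomaly
    ooni_corroborated: bool = False
    cp_corroborated: bool = False
    ioda_corroborated: bool = False
    window_closed: bool = False   # True once the gap rule has fired


def lifecycle_state(ev: IncidentEvidence) -> str:
    if ev.window_closed:
        return "RESOLVED"
    externally_confirmed = (
        ev.ooni_corroborated or ev.cp_corroborated or ev.ioda_corroborated
    )
    # Assumption: a single-ASN incident stays at ANOMALY even if an external
    # source agrees, because VERIFIED is treated as a step after CORROBORATED.
    if ev.distinct_asns >= 2:
        return "VERIFIED" if externally_confirmed else "CORROBORATED"
    return "ANOMALY"


# Same-ASN probes collapse to one: count distinct ASNs, not probes.
# (64500 and 64501 are documentation-range AS numbers.)
asns = [64500, 64500, 64501]
print(lifecycle_state(IncidentEvidence(distinct_asns=len(set(asns)))))
# CORROBORATED
```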

Real-time vs. retroactive clustering

Clustering happens in two passes with different latency characteristics:

Measurement arrives at collector
        │
        ▼
  [Cloudflare Worker — real-time pass, ~30s]
        │  ├─ score_anomaly(measurement)
        │  ├─ look up open incident for clustering key
        │  ├─ if within gap window: attach to existing incident_id
        │  └─ if new / gap expired: create new incident, assign incident_id
        │
        ▼
  Measurement stored with incident_id in D1

        │  (next day, 02:00 UTC)
        ▼
  [Nightly CensoredPlanet batch — retroactive pass]
        │  ├─ download CP daily export
        │  ├─ for each open/recent incident: query CP measurements in
        │  │    same 4-hour sliding windows
        │  ├─ if ≥3 Voidly measurements for the incident are CP-corroborated:
        │  │    set cp_corroborated = true on the incident record
        │  └─ if cp_corroborated flips to true and composite confidence
        │       crosses 0.75: upgrade confidence_tier to VERIFIED

The real-time pass runs in a Cloudflare Worker that executes on every measurement ingest event. The gap check is a single key-value lookup against a Cloudflare KV store keyed by the clustering tuple — the last-seen timestamp for each (country_code, domain, interference_type, probe_type_group) combination. If the stored timestamp is within 6 hours, the measurement is attached to the open incident. If not, or if no entry exists, a new incident is created in D1 and the KV entry is written. Total Worker execution time is consistently under 30 milliseconds.
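The last-seen lookup can be sketched with a plain dictionary standing in for the KV store. Everything named here (attach_or_create, the placeholder incident IDs, the "all" group label) is illustrative; the sketch covers only the standard 6-hour rule and leaves out the BGP gap and the re-open window.

```python
from datetime import datetime, timedelta, timezone

GAP = timedelta(hours=6)

# Stand-in for the KV store: clustering key -> (last-seen timestamp, incident ID).
# A plain dict is used here only so the sketch runs locally.
last_seen: dict[tuple, tuple[datetime, str]] = {}


def attach_or_create(key: tuple, ts: datetime, new_id: str) -> tuple[str, bool]:
    """Return (incident_id, created) for one anomalous measurement."""
    entry = last_seen.get(key)
    if entry is not None and ts - entry[0] <= GAP:
        incident_id, created = entry[1], False   # within the gap: reuse open incident
    else:
        incident_id, created = new_id, True      # no entry or gap expired: new incident
    last_seen[key] = (ts, incident_id)           # refresh the last-seen timestamp
    return incident_id, created


key = ("IR", "instagram.com", "dns_tampering", "all")
t0 = datetime(2024, 10, 15, 12, 0, tzinfo=timezone.utc)
first = attach_or_create(key, t0, "inc_placeholder_1")
second = attach_or_create(key, t0 + timedelta(hours=2), "inc_placeholder_2")
third = attach_or_create(key, t0 + timedelta(hours=9), "inc_placeholder_3")
print(first, second, third)
```

The second measurement arrives 2 hours after the first and reuses its incident; the third arrives 7 hours after the second, so the gap has expired and a new incident is created.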

The retroactive CensoredPlanet pass runs nightly because CP publishes daily batch exports rather than a real-time API. The batch compares CP's corpus against Voidly incidents using 4-hour sliding windows: for a Voidly incident with window_start = T, we search CP's export for measurements with test_start_time between T - 2h and T + 6h. The asymmetric window accounts for the fact that CP scans run on a schedule and may have observed the block slightly before or after Voidly's probes fired. If CP has three or more matching measurements flagging the same domain in the same country as anomalous, the cp_corroborated flag is set on the incident.
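The matching step described above can be sketched as a filter over the T - 2h to T + 6h search range. The row layout (plain dicts with country_code, domain, test_start_time, anomalous) is an assumption made for this example and is not CensoredPlanet's actual export schema; cp_corroborates and cp_row are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone

LOOKBACK = timedelta(hours=2)    # T - 2h
LOOKAHEAD = timedelta(hours=6)   # T + 6h
MIN_MATCHES = 3


def cp_corroborates(window_start, country_code, domain, cp_rows):
    """Return True if at least MIN_MATCHES anomalous CP rows fall in the range."""
    lo, hi = window_start - LOOKBACK, window_start + LOOKAHEAD
    matches = sum(
        1
        for row in cp_rows
        if row["anomalous"]
        and row["country_code"] == country_code
        and row["domain"] == domain
        and lo <= row["test_start_time"] <= hi
    )
    return matches >= MIN_MATCHES


def cp_row(ts):
    # Hypothetical row shape for this sketch only.
    return {"country_code": "IR", "domain": "instagram.com",
            "test_start_time": ts, "anomalous": True}


T = datetime(2024, 10, 15, 12, 0, tzinfo=timezone.utc)
h = timedelta(hours=1)
inside = [cp_row(T - 1 * h), cp_row(T + 1 * h), cp_row(T + 5 * h)]
one_late = [cp_row(T - 1 * h), cp_row(T + 1 * h), cp_row(T + 7 * h)]
print(cp_corroborates(T, "IR", "instagram.com", inside))    # True
print(cp_corroborates(T, "IR", "instagram.com", one_late))  # False
```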

Edge cases

Four edge cases required special handling in the clustering logic:

(a) Nationwide BGP outages. Incidents with interference_type = bgp_withdrawal use a 24-hour gap rule instead of 6 hours. BGP events resolve at the routing layer: when an AS re-announces withdrawn prefixes, routing convergence takes 30–90 minutes, and probe measurements may not recover immediately even after the routing change. The standard 6-hour gap rule routinely over-split BGP events into many short incidents during the recovery phase. Empirically, BGP outages that self-resolve in under 24 hours are almost always single events; BGP outages that return after 24 hours are almost always policy decisions that warrant separate tracking.

(b) VPN evasion. Some Voidly probes run behind circumvention tools — specifically probes operated by volunteers in high-censorship environments who need VPN access to operate safely. Measurements from probes with probe_flags containing circumvention_active are excluded from incident clustering entirely. A probe that tunnels through a VPN will not see the block that a direct-connection probe sees, and including its “no anomaly” measurement in the clustering key would suppress legitimate incidents.

(c) Flapping. A domain that alternates between blocked and unblocked on a roughly 1-hour cycle — most commonly observed with rate-limiting implementations that reset periodically — produces a stream of short incidents under the standard 6-hour gap rule. Voidly detects this pattern when five or more RESOLVED incidents for the same (country, domain, interference_type) tuple occur within a 24-hour window, each with duration under 2 hours. When detected, all constituent incidents are annotated with flapping: true and a flapping_group_id linking them. The dashboard displays flapping groups as a single annotated entry rather than many separate incidents, and the incident count in the API reflects the group count for flapping events.

(d) False resolution from measurement gaps. If a block stays in place but all probes for a given clustering key go offline simultaneously — due to an operator network issue, a power outage, or a probe update cycle — the 6-hour gap rule will incorrectly close the incident window. The real-time pass has no way to distinguish “no measurements because nothing is wrong” from “no measurements because our probes are down.” The nightly retroactive pass catches most of these cases: if the incident is marked RESOLVED but CP or OONI measurements continue to corroborate the block in the same time window, the incident is re-opened and the resolution is reversed. Probe health data (per-probe uptime and last-seen timestamps) is also factored in: if a probe cluster covering a country shows a gap that aligns with a known maintenance window in the probe operator logs, the gap is annotated as a measurement gap rather than treated as evidence of resolution.
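The flapping rule in (c) lends itself to a sliding-window check. The sketch below takes already-resolved incidents for one (country, domain, interference_type) tuple as given and assumes "within a 24-hour window" is measured between window_start values; the flapping_ids function, the tuple layout, and the single-letter IDs are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

MIN_FLAPS = 5
FLAP_SPAN = timedelta(hours=24)
MAX_DURATION_HOURS = 2.0


def flapping_ids(incidents):
    """incidents: list of (incident_id, window_start, duration_hours, status)
    tuples that already share one (country, domain, interference_type) tuple."""
    short = sorted(
        (i for i in incidents if i[3] == "RESOLVED" and i[2] < MAX_DURATION_HOURS),
        key=lambda i: i[1],
    )
    flagged = set()
    left = 0
    for right in range(len(short)):
        # Shrink from the left until the run fits inside 24 hours.
        while short[right][1] - short[left][1] > FLAP_SPAN:
            left += 1
        if right - left + 1 >= MIN_FLAPS:
            flagged.update(i[0] for i in short[left:right + 1])
    return flagged


# Synthetic input: five short incidents 4 hours apart, plus one outlier 3 days later.
t0 = datetime(2024, 10, 15, tzinfo=timezone.utc)
sample = [(name, t0 + timedelta(hours=4 * n), 0.5, "RESOLVED")
          for n, name in enumerate("abcde")]
sample.append(("f", t0 + timedelta(hours=72), 0.5, "RESOLVED"))
print(sorted(flapping_ids(sample)))  # ['a', 'b', 'c', 'd', 'e']
```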

Clustering algorithm: Python pseudocode

The core clustering logic, simplified for clarity:

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from hashlib import sha256

GAP_HOURS         = 6    # standard gap rule
BGP_GAP_HOURS     = 24   # BGP-specific gap rule
REOPEN_HOURS      = 12   # re-open window after resolution
ANOMALY_THRESHOLD = 0.5  # placeholder; the production threshold is not given in this post

@dataclass
class Measurement:
    country_code:      str
    domain:            str
    interference_type: str
    probe_type_group:  str
    test_start_time:   datetime   # must be timezone-aware (UTC)
    anomaly_score:     float
    probe_asn:         int
    probe_flags:       list[str]

@dataclass
class Incident:
    incident_id:       str
    country_code:      str
    domain:            str
    interference_type: str
    window_start:      datetime
    window_end:        datetime | None
    status:            str   # ANOMALY | CORROBORATED | VERIFIED | RESOLVED
    probe_asns_seen:   set[int]

def make_incident_id(cc: str, domain: str, itype: str, ts: datetime) -> str:
    ts_utc    = ts.astimezone(timezone.utc)  # date component is the UTC date
    date_str  = ts_utc.strftime('%Y%m%d')
    hash_src  = f"{cc}:{domain}:{itype}:{int(ts_utc.timestamp())}"
    hash_hex  = sha256(hash_src.encode()).hexdigest()[:8]
    return f"inc_{cc}_{date_str}_{hash_hex}"

def gap_hours(interference_type: str) -> int:
    return BGP_GAP_HOURS if interference_type == 'bgp_withdrawal' else GAP_HOURS

def cluster_measurements(measurements: list[Measurement]) -> dict[str, Incident]:
    # Caller sorts by (country_code, domain, interference_type, test_start_time)
    open_incidents: dict[tuple, Incident] = {}
    all_incidents:  dict[str, Incident]   = {}

    for m in measurements:
        if 'circumvention_active' in m.probe_flags:
            continue  # excluded from clustering

        if m.anomaly_score < ANOMALY_THRESHOLD:
            continue  # does not contribute to any incident

        key = (m.country_code, m.domain, m.interference_type, m.probe_type_group)
        gap = timedelta(hours=gap_hours(m.interference_type))

        if key in open_incidents:
            inc = open_incidents[key]
            time_since_last = m.test_start_time - (inc.window_end or inc.window_start)

            if time_since_last <= gap:
                # Extend existing incident
                inc.window_end = m.test_start_time
                inc.probe_asns_seen.add(m.probe_asn)
            elif time_since_last <= gap + timedelta(hours=REOPEN_HOURS):
                # Re-open: the gap rule closed the window, but this measurement
                # arrives within REOPEN_HOURS of that closure. The ID is kept
                # and the incident keeps the status it had before closing.
                inc.window_end = m.test_start_time
                inc.probe_asns_seen.add(m.probe_asn)
            else:
                # Gap too large: resolve old, start new
                inc.status = 'RESOLVED'
                new_inc = _new_incident(m)
                open_incidents[key] = new_inc
                all_incidents[new_inc.incident_id] = new_inc
                continue
        else:
            inc = _new_incident(m)
            open_incidents[key] = inc

        all_incidents[inc.incident_id] = inc

        # Update status based on ASN independence
        if len(inc.probe_asns_seen) >= 2 and inc.status == 'ANOMALY':
            inc.status = 'CORROBORATED'

    # Resolve any still-open incidents past their gap
    now = datetime.now(timezone.utc)
    for inc in open_incidents.values():
        last_seen = inc.window_end or inc.window_start
        if now - last_seen > timedelta(hours=gap_hours(inc.interference_type)):
            inc.status = 'RESOLVED'

    return all_incidents

def _new_incident(m: Measurement) -> Incident:
    iid = make_incident_id(m.country_code, m.domain,
                           m.interference_type, m.test_start_time)
    return Incident(
        incident_id       = iid,
        country_code      = m.country_code,
        domain            = m.domain,
        interference_type = m.interference_type,
        window_start      = m.test_start_time,
        window_end        = None,
        status            = 'ANOMALY',
        probe_asns_seen   = {m.probe_asn},
    )

The production implementation differs in several ways: the KV-backed gap check is O(1) per measurement rather than requiring the full sorted stream; the VERIFIED state transition is handled by the corroboration worker rather than inline; and the flapping detection runs as a post-clustering pass rather than during stream processing. But the core logic — clustering key, gap rule, re-open window, ASN independence check — is exactly as shown above.

Impact on the dataset

The incident_id field in every measurement row is the primary interface between the raw measurement stream and the incident-level analysis that most researchers actually want. Users do not need to re-implement the clustering logic — they join on incident_id and get pre-clustered events.

The /v1/incidents API endpoint returns pre-clustered incident records directly, with all fields from the incident table including window_start, window_end, duration_hours, status, interference_type, probe_asn_count, measurement_count, ooni_corroborated, cp_corroborated, ioda_corroborated, confidence_tier, and flapping. For most research use cases, the incidents endpoint is the right starting point — download incidents for a country and date range, then join to measurements only if you need probe-level detail.

For local analysis with DuckDB, the join between the measurements Parquet files and the incidents Parquet files is straightforward:

-- Load both tables from the HuggingFace dataset Parquet exports
-- Download: https://huggingface.co/datasets/emperor-mew/global-censorship-index

CREATE TABLE measurements AS
    SELECT * FROM read_parquet('measurements/*.parquet');

CREATE TABLE incidents AS
    SELECT * FROM read_parquet('incidents/*.parquet');

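-- Count distinct verified incidents per country in Q4 2024.
-- Measurements are aggregated per incident first, so AVG(duration_hours) is
-- taken over incidents rather than weighted by each incident's row count.
-- Filtering on confidence_tier keeps incidents whose status is now RESOLVED.
WITH per_incident AS (
    SELECT incident_id, COUNT(*) AS n_measurements
    FROM measurements
    WHERE incident_id IS NOT NULL
    GROUP BY incident_id
)
SELECT
    i.country_code,
    COUNT(*)                               AS verified_incidents,
    AVG(i.duration_hours)                  AS avg_duration_hours,
    SUM(p.n_measurements)                  AS total_measurements,
    SUM(p.n_measurements) / COUNT(*)       AS measurements_per_incident
FROM incidents i
JOIN per_incident p USING (incident_id)
WHERE i.confidence_tier = 'VERIFIED'
  AND i.window_start >= '2024-10-01'
  AND i.window_start <  '2025-01-01'
GROUP BY i.country_code
ORDER BY verified_incidents DESC
LIMIT 20;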

-- Inspect all measurements that make up a single incident
SELECT
    m.probe_id,
    m.probe_asn,
    m.test_start_time,
    m.anomaly_score,
    m.interference_type,
    m.dns_consistent,
    m.tls_handshake_success,
    m.http_status_code
FROM measurements m
WHERE m.incident_id = 'inc_IR_20241015_a3f8c91b'
ORDER BY m.test_start_time;

-- Find flapping incidents (many short blocks of the same domain)
SELECT
    flapping_group_id,
    country_code,
    domain,
    COUNT(*) AS flap_count,
    MIN(window_start) AS first_seen,
    MAX(window_start) AS last_seen
FROM incidents
WHERE flapping = true
GROUP BY flapping_group_id, country_code, domain
ORDER BY flap_count DESC;

The measurements_per_incident ratio in the first query illustrates the deduplication effect concretely. For well-covered countries with many probes, this ratio is typically 200–800, meaning a naive count of anomalous measurements would overestimate the incident count by a factor of 200 to 800. Even for countries with sparse probe coverage, the ratio is rarely below 30. Across the dataset as a whole, a raw measurement count and an incident count for the same date range differ by a factor of roughly 50–100 in expectation.

Why this matters for researchers

The practical consequence of using raw measurements as the unit of analysis rather than incidents is severe enough to invalidate most quantitative findings. A study comparing censorship frequency across countries that counts anomalous measurements will systematically overcount countries with dense probe coverage (many probes means many measurements per incident) and, relative to them, undercount countries with sparse coverage. The ranking of countries by “censorship events” will reflect probe density as much as actual censorship intensity, and the two are only weakly correlated.
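A toy calculation makes the bias visible. The two countries and all numbers below are invented, and the sketch assumes every probe reports an anomaly on every 5-minute cycle for the full length of each incident.

```python
# Two hypothetical countries with the same number of real incidents but
# different probe density. All numbers are invented for illustration.
CYCLES_PER_HOUR = 12  # one measurement cycle every five minutes

countries = {
    "dense":  {"incidents": 10, "probes": 200, "avg_hours": 24},
    "sparse": {"incidents": 10, "probes": 5,   "avg_hours": 24},
}

for name, c in countries.items():
    # Naive count: one anomalous row per probe per cycle per incident-hour.
    rows = c["incidents"] * c["probes"] * c["avg_hours"] * CYCLES_PER_HOUR
    c["anomalous_rows"] = rows
    print(name, c["incidents"], rows)
# dense 10 576000
# sparse 10 14400
```

Both countries had ten incidents, yet a ranking by anomalous rows would put the densely probed one 40 times higher.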

The incident timeline is the correct unit for political science research on censorship patterns, for journalism tracking specific events, and for ML models predicting future shutdowns. The incident_id field and the /v1/incidents endpoint exist specifically to make it easy to work at this level without re-engineering the clustering logic. For the complete list of fields available on the incident record — including the flapping flag, confidence_tier breakdowns, and per-source corroboration booleans — see the schema reference linked below.