Bridging classifier outputs to shutdown forecasting: from per-measurement censorship probability to country-level shutdown risk scores

June 11, 2025 · AI Analytics

Voidly · Machine learning · Forecasting · Censorship detection

The XGBoost classifier outputs a calibrated probability of censorship for each individual probe measurement: a single web-connectivity observation from one probe, one domain, one ASN, at one moment in time. The shutdown forecasting model operates at a completely different scale: it predicts the probability that an entire country will experience a full internet shutdown within the next 72 hours, using country-level signals aggregated across thousands of probes and hundreds of domains.

Bridging these two scales requires an aggregation layer that converts tens of thousands of per-measurement probabilities per day into a small set of time-series features that the forecasting model can ingest. This article covers the domain-ASN aggregation pass, exponential decay weighting for the 14-day observation window, the risk score normalization formula, and the feature engineering pipeline that produces the forecasting model's input vector.

Aggregation hierarchy

Per-measurement probabilities are aggregated in three stages before reaching the forecasting model:

Stage	Granularity	Time bucket	Output
1 — ASN-domain	(country, ASN, domain)	1 hour	Mean P(censored), measurement count
2 — Domain	(country, domain)	1 hour	ASN-weighted mean, ASN count, domain category
3 — Country	(country)	1 hour	Risk score, feature vector for forecasting

Stage 1 starts in the ingestion pipeline: immediately after the ONNX inference step, each scored measurement is inserted into a TimescaleDB hypertable, and a continuous aggregate rolls those rows up by hour. Stages 2 and 3 run over the resulting stage-1 records and are refreshed every 15 minutes.

ASN-domain hourly aggregation (Stage 1)

-- TimescaleDB schema: per-measurement table that feeds the stage-1 aggregate
-- This is the hypertable into which the ingestion pipeline inserts per-measurement rows.
CREATE TABLE measurement_scores (
  ts              TIMESTAMPTZ NOT NULL,
  country_code    TEXT NOT NULL,
  asn             INTEGER NOT NULL,
  domain          TEXT NOT NULL,
  censor_prob     FLOAT4 NOT NULL,   -- calibrated P(censored) from ONNX
  probe_id        TEXT NOT NULL,     -- anonymized probe identifier
  test_type       TEXT NOT NULL      -- 'web_connectivity' | 'dns_consistency' | 'tcp_connect'
);

SELECT create_hypertable('measurement_scores', 'ts', chunk_time_interval => INTERVAL '6 hours');

CREATE INDEX ON measurement_scores (country_code, asn, domain, ts DESC);

-- Stage-1 continuous aggregate: hourly mean per (country, ASN, domain)
CREATE MATERIALIZED VIEW asn_domain_hourly
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 hour', ts) AS bucket,
  country_code,
  asn,
  domain,
  AVG(censor_prob)         AS mean_prob,
  COUNT(*)                 AS n_measurements,
  COUNT(DISTINCT probe_id) AS n_probes
FROM measurement_scores
GROUP BY 1, 2, 3, 4
WITH NO DATA;

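-- The refresh window (start_offset minus end_offset) must span at least two
-- 1-hour buckets, otherwise TimescaleDB rejects the policy as too small.
SELECT add_continuous_aggregate_policy('asn_domain_hourly',
  start_offset => INTERVAL '3 hours',
  end_offset   => INTERVAL '15 minutes',
  schedule_interval => INTERVAL '15 minutes'
);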
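
The stage-2 rollup from (country, ASN, domain) to (country, domain) is not shown in this article. A minimal sketch of what it might look like follows. The AsnDomainRow type and the domain_hourly function are hypothetical names, and weighting each ASN by its measurement count is an assumption: the table above only says the mean is "ASN-weighted".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AsnDomainRow:
    """One asn_domain_hourly row for a single hourly bucket (hypothetical type)."""
    country_code: str
    asn: int
    domain: str
    mean_prob: float
    n_measurements: int


def domain_hourly(rows: list[AsnDomainRow]) -> dict[tuple[str, str], dict]:
    """Roll stage-1 rows up to one record per (country, domain).

    ASSUMPTION: each ASN is weighted by its measurement count. The
    production weighting is not specified in the article.
    """
    groups: dict[tuple[str, str], list[AsnDomainRow]] = defaultdict(list)
    for row in rows:
        groups[(row.country_code, row.domain)].append(row)

    out = {}
    for key, members in groups.items():
        total = sum(r.n_measurements for r in members)
        if total == 0:
            continue  # no measurements in this bucket, nothing to average
        weighted = sum(r.mean_prob * r.n_measurements for r in members) / total
        out[key] = {
            "mean_prob": weighted,
            "n_asns": len({r.asn for r in members}),
            "n_measurements": total,
        }
    return out
```

With measurement-count weighting, an ASN that contributes 30 measurements at 0.9 and another that contributes 10 at 0.1 combine to a domain-level mean of 0.7.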

Exponential decay weighting

Older observations are less informative than recent ones for predicting near-term shutdowns. The stage-3 aggregation applies exponential decay over a 14-day trailing window when computing the country-level risk score:

# aggregation/risk_score.py

import numpy as np
from dataclasses import dataclass

HALF_LIFE_HOURS  = 48.0   # a 48h-old observation carries half the weight of a current one
WINDOW_HOURS     = 336.0  # 14-day lookback
DECAY_LAMBDA     = np.log(2) / HALF_LIFE_HOURS   # lambda for e^(-lambda * t)


@dataclass
class HourlyObservation:
    bucket_age_hours: float   # how many hours ago this bucket ended
    mean_prob:        float   # mean P(censored) for this (country, domain) in the hour
    n_measurements:   int
    n_asns:           int
    domain_category:  str     # 'news' | 'human_rights' | 'social_media' | 'general' | 'circumvention'


def compute_risk_score(
    observations: list[HourlyObservation],
    category_weights: dict[str, float] | None = None,
) -> float:
    """
    Compute a single country-level risk score in [0, 1] from a list of
    hourly domain observations over the trailing 14-day window.

    Category weights allow news and human-rights domains to contribute more
    to the risk score than general domains (they are more likely to be
    selectively censored before a broader shutdown).
    """
    if category_weights is None:
        category_weights = {
            'news':          2.5,
            'human_rights':  2.5,
            'circumvention': 2.0,
            'social_media':  1.5,
            'general':       1.0,
        }

    if not observations:
        return 0.0

    weighted_sum   = 0.0
    total_weight   = 0.0

    for obs in observations:
        if obs.bucket_age_hours > WINDOW_HOURS:
            continue

        # Time decay weight
        time_weight = np.exp(-DECAY_LAMBDA * obs.bucket_age_hours)

        # Measurement count weight: log scale to avoid over-weighting high-probe-count hours
        count_weight = np.log1p(obs.n_measurements)

        # ASN diversity weight: observations from multiple ASNs are more credible
        asn_weight = np.log1p(obs.n_asns)

        # Category weight
        cat_weight = category_weights.get(obs.domain_category, 1.0)

        w = time_weight * count_weight * asn_weight * cat_weight
        weighted_sum += obs.mean_prob * w
        total_weight  += w

    if total_weight == 0.0:
        return 0.0

    raw_score = weighted_sum / total_weight

    # Logistic squashing centered on 0.5 (steepness 6). This spreads out
    # mid-range scores and maps raw scores in [0, 1] onto roughly
    # [0.047, 0.953].
    normalized = 1.0 / (1.0 + np.exp(-6.0 * (raw_score - 0.5)))

    return float(normalized)

The 48-hour half-life was chosen empirically from the historical shutdown dataset. Observations from more than 72 hours before a shutdown onset retain predictive value (shutdowns are typically preceded by 2–4 days of increasing censorship), while observations from 10 or more days prior are largely noise relative to the near-term signal. Because 96 hours is two half-lives, the weight ratio between the current hour and four days ago is exactly 4:1, which matches the relative predictive importance found by permutation importance analysis on the forecasting model's features.
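
To make the decay concrete, the short sketch below evaluates the time weight at a few bucket ages using the same constants as risk_score.py. It uses the standard library's math module instead of NumPy only to stay dependency-free, and decay_weight is a hypothetical helper name.

```python
import math

HALF_LIFE_HOURS = 48.0
DECAY_LAMBDA = math.log(2) / HALF_LIFE_HOURS


def decay_weight(age_hours: float) -> float:
    """Relative weight of an observation that is age_hours old."""
    return math.exp(-DECAY_LAMBDA * age_hours)


# Ages: now, one half-life, two half-lives, 7 days, 14 days (window edge)
for age in (0, 48, 96, 168, 336):
    print(f"{age:>3} h  weight = {decay_weight(age):.4f}")
```

Under these constants a bucket at the edge of the 14-day window (seven half-lives) keeps 1/128 of the weight of the current hour, or about 0.8%, so the hard cutoff at WINDOW_HOURS discards very little signal.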

Feature engineering for the forecasting model

The forecasting model receives a fixed-length feature vector computed from the hourly risk score time series and the stage-2 domain-level aggregates. The feature vector has 28 dimensions:

# aggregation/forecast_features.py

from dataclasses import dataclass

@dataclass
class ForecastFeatureVector:
    country_code: str
    computed_at:  str    # ISO datetime

    # Risk score time series (12 features: current + 11 lagged values at widening intervals)
    risk_score_h0:   float   # current hour
    risk_score_h1:   float
    risk_score_h3:   float
    risk_score_h6:   float
    risk_score_h12:  float
    risk_score_h24:  float
    risk_score_h48:  float
    risk_score_h72:  float
    risk_score_h96:  float
    risk_score_h120: float
    risk_score_h144: float
    risk_score_h168: float   # 7 days ago

    # Trend features (4)
    risk_slope_6h:    float   # linear slope over last 6 hours
    risk_slope_24h:   float
    risk_slope_72h:   float
    risk_acceleration_6h: float   # second derivative (slope of slope)

    # Domain coverage features (6)
    frac_domains_blocked_news:         float  # fraction of news domains with mean_prob > 0.7
    frac_domains_blocked_social:       float
    frac_domains_blocked_circumvention: float
    n_asns_with_any_blocking:          int
    n_asns_total_active:               int
    asn_block_concentration:           float  # Herfindahl-Hirschman index of blocking across ASNs

    # Historical context features (6)
    days_since_last_verified_shutdown:   float   # -1 if never
    n_shutdowns_last_90d:                int
    max_shutdown_duration_days_last_90d: float
    election_proximity_days:             float   # days to nearest election, -1 if none scheduled
    political_event_score:               float   # 0-1 human-annotated political tension score
    prior_shutdown_same_month_prev_year: float   # 0 or 1

The risk score time series features capture both the current level and the trajectory of censorship activity. The slope and acceleration features are particularly important: a rapidly increasing risk score over 6 hours is a stronger predictor of imminent shutdown than a high but stable score that has been elevated for days. The HHI concentration feature captures whether blocking is concentrated on one ISP (potentially an error or localized event) or distributed across many (a coordinated, country-wide action).
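
The article does not show how the trend and concentration features are computed. The sketch below is one plausible reading, with every function name hypothetical. It assumes the slope is an ordinary least-squares fit over evenly spaced hourly scores, approximates acceleration as the difference between the slopes of the newer and older halves of the window, and takes the HHI over each ASN's share of blocked observations.

```python
def linear_slope(values: list[float]) -> float:
    """Least-squares slope of values against their index (change per step)."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def acceleration(values: list[float]) -> float:
    """ASSUMPTION: slope of the newer half minus slope of the older half."""
    half = len(values) // 2
    if half < 2:
        return 0.0
    return linear_slope(values[-half:]) - linear_slope(values[:half])


def hhi(blocked_per_asn: dict[int, int]) -> float:
    """Herfindahl-Hirschman index of blocking shares across ASNs.

    1.0 means all blocking sits on one ASN; 1/N means it is spread
    evenly over N ASNs. Returns 0.0 when nothing is blocked.
    """
    total = sum(blocked_per_asn.values())
    if total == 0:
        return 0.0
    return sum((count / total) ** 2 for count in blocked_per_asn.values())
```

For example, a score series that is flat for three hours and then climbs by 0.2 per hour has zero slope in its older half and a slope of 0.2 in its newer half, so the acceleration feature is positive even though the overall level may still be moderate.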

Pipeline handoff protocol

The feature vector is published to the forecasting service via a Kafka topic (voidly.forecast.features) every 15 minutes per country. The message payload is a Protocol Buffer encoding of ForecastFeatureVector. The forecasting service maintains a sliding window of the last four feature vectors per country (one hour of history) and runs the Bayesian forecasting model whenever a new vector arrives:

# Kafka topic configuration
# voidly.forecast.features
#   Key: country_code (string)
#   Value: ForecastFeatureVector protobuf
#   Partitions: 64 (one per active country, rounded to next power of 2)
#   Retention: 7 days
#   Compression: lz4
#   Max message size: 64 KB (feature vectors are ~2 KB each)

# Topic consumption (forecasting service)
consumer_config = {
    'group.id':            'voidly-shutdown-forecaster',
    'auto.offset.reset':   'latest',       # with no committed offset, start at the log end;
                                           # historical features are re-computed from
                                           # TimescaleDB if needed
    'enable.auto.commit':  False,          # manual commit after forecast is written
    'max.poll.interval.ms': 120_000,       # 2 minutes: forecasting model run time p99
    'session.timeout.ms':   30_000,
}

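The auto.offset.reset = latest policy is meant to keep the forecasting service from processing a backlog of feature vectors that accumulated during downtime, since those would produce stale forecasts. Kafka applies this setting only when the consumer group has no valid committed offset. Because the service commits offsets manually, a restarted consumer that has committed offsets resumes from them, so skipping the backlog in that case also requires seeking to the end of each assigned partition or discarding stale vectors. Country forecasts are written to a TimescaleDB table, and the alert delivery fanout applies a 24-hour TTL, so a 15-minute gap in forecasts during a service restart does not trigger spurious alerts.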
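
The per-country sliding window described above might be kept as in the following sketch, which leaves Kafka out entirely and only shows the bookkeeping. The names accept, windows, and MAX_AGE are hypothetical, and the 30-minute staleness cutoff is an assumption added for illustration: it is one way to drop outdated vectors regardless of where the consumer resumes reading.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone
from typing import Optional

WINDOW_SIZE = 4                     # four 15-minute vectors = one hour of history
MAX_AGE = timedelta(minutes=30)     # ASSUMPTION: hypothetical staleness cutoff

# One bounded window per country; old vectors fall off the left automatically.
windows: dict[str, deque] = defaultdict(lambda: deque(maxlen=WINDOW_SIZE))


def accept(country_code: str, computed_at: str, vector: dict,
           now: datetime) -> Optional[list[dict]]:
    """Add a vector to its country's window unless it is stale.

    Returns a snapshot of the window to forecast on, or None when the
    vector was dropped. computed_at is an ISO datetime with a UTC offset.
    """
    ts = datetime.fromisoformat(computed_at)
    if now - ts > MAX_AGE:
        return None
    window = windows[country_code]
    window.append(vector)
    return list(window)


now = datetime(2025, 6, 11, 12, 0, tzinfo=timezone.utc)
snapshot = accept("XX", "2025-06-11T11:45:00+00:00", {"risk_score_h0": 0.4}, now)
```

In a real consumer the offset would be committed only after the forecast for the returned snapshot has been written, matching the enable.auto.commit = False setting above.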

Related writing

Voidly classifier calibration covers the Platt scaling and isotonic regression calibration passes that produce the calibrated probabilities consumed by the aggregation pipeline described here.

Shutdown forecasting describes the Bayesian model that ingests the feature vectors produced by this pipeline and outputs 72-hour shutdown probability estimates.