Bridging classifier outputs to shutdown forecasting: from per-measurement censorship probability to country-level shutdown risk scores
The XGBoost classifier outputs a calibrated probability of censorship for each individual probe measurement: a single web-connectivity observation from one probe, one domain, one ASN, at one moment in time. The shutdown forecasting model operates at a completely different scale: it predicts the probability that an entire country will experience a full internet shutdown within the next 72 hours, using country-level signals aggregated across thousands of probes and hundreds of domains.
Bridging these two scales requires an aggregation layer that converts tens of thousands of per-measurement probabilities per day into a small set of time-series features that the forecasting model can ingest. This article covers the domain-ASN aggregation pass, exponential decay weighting for the 14-day observation window, the risk score normalization formula, and the feature engineering pipeline that produces the forecasting model's input vector.
Aggregation hierarchy
Per-measurement probabilities are aggregated in three stages before reaching the forecasting model:
| Stage | Granularity | Time bucket | Output |
|---|---|---|---|
| 1 — ASN-domain | (country, ASN, domain) | 1 hour | Mean P(censored), measurement count |
| 2 — Domain | (country, domain) | 1 hour | ASN-weighted mean, ASN count, domain category |
| 3 — Country | (country) | 1 hour | Risk score, feature vector for forecasting |
Stage 1 aggregation happens in the ingestion pipeline immediately after the ONNX inference step. Stages 2 and 3 run as TimescaleDB continuous aggregates over the ingested stage-1 records, refreshed every 15 minutes.
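The article does not show the stage-2 roll-up, so the following is only a minimal sketch of what an "ASN-weighted mean" could look like. It assumes each ASN's hourly mean is weighted by its measurement count. The `Stage1Row` type and `aggregate_stage2` function are hypothetical names, not part of the pipeline described here, and the domain-category lookup is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage1Row:
    """One stage-1 record: hourly mean for a (country, ASN, domain) triple."""
    bucket: str          # hour bucket label, e.g. '2024-01-01T10:00Z'
    country_code: str
    asn: int
    domain: str
    mean_prob: float
    n_measurements: int


def aggregate_stage2(rows: list[Stage1Row]) -> dict[tuple[str, str, str], dict]:
    """Collapse stage-1 rows into one record per (bucket, country, domain).

    ASSUMPTION: each ASN's mean is weighted by its measurement count.
    """
    groups: dict[tuple[str, str, str], list[Stage1Row]] = defaultdict(list)
    for row in rows:
        groups[(row.bucket, row.country_code, row.domain)].append(row)

    out: dict[tuple[str, str, str], dict] = {}
    for key, members in groups.items():
        total = sum(m.n_measurements for m in members)
        if total == 0:
            continue  # nothing measured in this bucket; skip rather than divide by zero
        weighted = sum(m.mean_prob * m.n_measurements for m in members) / total
        out[key] = {
            "mean_prob": weighted,
            "n_measurements": total,
            "n_asns": len({m.asn for m in members}),
        }
    return out


rows = [
    Stage1Row("2024-01-01T10:00Z", "XX", 64500, "example.org", 0.9, 30),
    Stage1Row("2024-01-01T10:00Z", "XX", 64501, "example.org", 0.1, 10),
]
stage2 = aggregate_stage2(rows)
```

Weighting by measurement count keeps a sparsely probed ASN from dominating the domain-level mean. An unweighted mean across ASNs would be the alternative if every network should count equally.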
ASN-domain hourly aggregation (Stage 1)
-- TimescaleDB schema: per-measurement scores table (the input to stage-1 aggregation).
-- This is the hypertable into which the ingestion pipeline inserts per-measurement rows.
CREATE TABLE measurement_scores (
    ts            TIMESTAMPTZ NOT NULL,
    country_code  TEXT        NOT NULL,
    asn           INTEGER     NOT NULL,
    domain        TEXT        NOT NULL,
    censor_prob   FLOAT4      NOT NULL,  -- calibrated P(censored) from ONNX
    probe_id      TEXT        NOT NULL,  -- anonymized probe identifier
    test_type     TEXT        NOT NULL   -- 'web_connectivity' | 'dns_consistency' | 'tcp_connect'
);

SELECT create_hypertable('measurement_scores', 'ts', chunk_time_interval => INTERVAL '6 hours');

CREATE INDEX ON measurement_scores (country_code, asn, domain, ts DESC);

-- Stage-1 continuous aggregate: hourly mean per (country, ASN, domain)
CREATE MATERIALIZED VIEW asn_domain_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', ts) AS bucket,
    country_code,
    asn,
    domain,
    AVG(censor_prob)         AS mean_prob,
    COUNT(*)                 AS n_measurements,
    COUNT(DISTINCT probe_id) AS n_probes
FROM measurement_scores
GROUP BY 1, 2, 3, 4
WITH NO DATA;
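
-- The refresh window (start_offset minus end_offset) must cover at least
-- two 1-hour buckets, otherwise TimescaleDB rejects the policy.
SELECT add_continuous_aggregate_policy('asn_domain_hourly',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '15 minutes',
    schedule_interval => INTERVAL '15 minutes'
);
Exponential decay weighting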
Older observations are less informative than recent ones for predicting near-term shutdowns. The stage-3 aggregation applies exponential decay over a 14-day trailing window when computing the country-level risk score:
# aggregation/risk_score.py
from dataclasses import dataclass

import numpy as np

HALF_LIFE_HOURS = 48.0  # an observation's weight halves every 48 hours
WINDOW_HOURS = 336.0    # 14-day lookback
DECAY_LAMBDA = np.log(2) / HALF_LIFE_HOURS  # lambda for e^(-lambda * t)


@dataclass
class HourlyObservation:
    bucket_age_hours: float  # how many hours ago this bucket ended
    mean_prob: float         # mean P(censored) for this (country, domain) in the hour
    n_measurements: int
    n_asns: int
    domain_category: str     # 'news' | 'human_rights' | 'social_media' | 'general' | 'circumvention'


def compute_risk_score(
    observations: list[HourlyObservation],
    category_weights: dict[str, float] | None = None,
) -> float:
    """
    Compute a single country-level risk score in [0, 1] from a list of
    hourly domain observations over the trailing 14-day window.

    Category weights allow news and human-rights domains to contribute more
    to the risk score than general domains (they are more likely to be
    selectively censored before a broader shutdown).
    """
    if category_weights is None:
        category_weights = {
            'news': 2.5,
            'human_rights': 2.5,
            'circumvention': 2.0,
            'social_media': 1.5,
            'general': 1.0,
        }
    if not observations:
        return 0.0

    weighted_sum = 0.0
    total_weight = 0.0
    for obs in observations:
        if obs.bucket_age_hours > WINDOW_HOURS:
            continue
        # Time decay weight
        time_weight = np.exp(-DECAY_LAMBDA * obs.bucket_age_hours)
        # Measurement count weight: log scale to avoid over-weighting high-probe-count hours
        count_weight = np.log1p(obs.n_measurements)
        # ASN diversity weight: observations from multiple ASNs are more credible
        asn_weight = np.log1p(obs.n_asns)
        # Category weight
        cat_weight = category_weights.get(obs.domain_category, 1.0)

        w = time_weight * count_weight * asn_weight * cat_weight
        weighted_sum += obs.mean_prob * w
        total_weight += w

    if total_weight == 0.0:
        return 0.0

    raw_score = weighted_sum / total_weight
    # Logistic normalization centered at 0.5 with steepness 6: it spreads
    # mid-range scores apart and compresses the extremes. A raw score in
    # [0, 1] maps to roughly [0.047, 0.953], so normalized scores never
    # reach exactly 0 or 1 (the 0.0 returned above for empty input is a
    # sentinel outside that range).
    normalized = 1.0 / (1.0 + np.exp(-6.0 * (raw_score - 0.5)))
    return float(normalized)

The 48-hour half-life was chosen empirically from the historical shutdown dataset: observations more than 72 hours before a shutdown onset retain predictive value (shutdowns are typically preceded by 2–4 days of increasing censorship), while observations from 10+ days prior are largely noise relative to the near-term signal. With a 48-hour half-life, an observation from the current hour carries four times the weight of one from four days ago (96 hours is two half-lives), which matches the relative predictive importance found by permutation importance analysis on the forecasting model's features.
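To make the half-life arithmetic concrete, the sketch below evaluates the time-decay weight at a few bucket ages and estimates how the total time weight is distributed across the window. It is an illustration only: it assumes exactly one observation per hourly bucket and ignores the count, ASN, and category weights. The `decay_weight` and `weight_share` helpers are hypothetical names introduced for this example.

```python
import math

HALF_LIFE_HOURS = 48.0
WINDOW_HOURS = 336.0
DECAY_LAMBDA = math.log(2) / HALF_LIFE_HOURS


def decay_weight(age_hours: float) -> float:
    """Time-decay weight for a bucket that ended `age_hours` ago (0 outside the window)."""
    if age_hours > WINDOW_HOURS:
        return 0.0
    return math.exp(-DECAY_LAMBDA * age_hours)


def weight_share(min_age: int, max_age: int) -> float:
    """Fraction of total time weight held by hourly buckets aged min_age..max_age inclusive.

    ASSUMPTION: one observation per hourly bucket, all other weights equal.
    """
    total = sum(decay_weight(h) for h in range(int(WINDOW_HOURS) + 1))
    part = sum(decay_weight(h) for h in range(min_age, max_age + 1))
    return part / total


for age in (0, 48, 96, 336):
    print(f"{age:>3} h -> {decay_weight(age):.4f}")
# prints 1.0000, 0.5000, 0.2500 and 0.0078 for the four ages

recent_share = weight_share(0, 95)     # most recent four days
oldest_share = weight_share(240, 336)  # ten or more days old
```

Under these simplifying assumptions, roughly three quarters of the time weight comes from the most recent four days, and buckets ten or more days old contribute only a few percent.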
Feature engineering for the forecasting model
The forecasting model receives a fixed-length feature vector computed from the hourly risk score time series and the stage-2 domain-level aggregates. The feature vector has 28 dimensions:
# aggregation/forecast_features.py
from dataclasses import dataclass


@dataclass
class ForecastFeatureVector:
    country_code: str
    computed_at: str  # ISO datetime

    # Risk score time series (12 features: current hour + 11 lagged values,
    # spaced more widely the further back they go)
    risk_score_h0: float    # current hour
    risk_score_h1: float
    risk_score_h3: float
    risk_score_h6: float
    risk_score_h12: float
    risk_score_h24: float
    risk_score_h48: float
    risk_score_h72: float
    risk_score_h96: float
    risk_score_h120: float
    risk_score_h144: float
    risk_score_h168: float  # 7 days ago

    # Trend features (4)
    risk_slope_6h: float         # linear slope over last 6 hours
    risk_slope_24h: float
    risk_slope_72h: float
    risk_acceleration_6h: float  # second derivative (slope of slope)

    # Domain coverage features (6)
    frac_domains_blocked_news: float  # fraction of news domains with mean_prob > 0.7
    frac_domains_blocked_social: float
    frac_domains_blocked_circumvention: float
    n_asns_with_any_blocking: int
    n_asns_total_active: int
    asn_block_concentration: float    # Herfindahl-Hirschman index of blocking across ASNs

    # Historical context features (6)
    days_since_last_verified_shutdown: float    # -1 if never
    n_shutdowns_last_90d: int
    max_shutdown_duration_days_last_90d: float
    election_proximity_days: float              # days to nearest election, -1 if none scheduled
    political_event_score: float                # 0-1 human-annotated political tension score
    prior_shutdown_same_month_prev_year: float  # 0 or 1

The risk score time series features capture both the current level and the trajectory of censorship activity. The slope and acceleration features are particularly important: a rapidly increasing risk score over 6 hours is a stronger predictor of imminent shutdown than a high but stable score that has been elevated for days. The HHI concentration feature captures whether blocking is concentrated on one ISP (potentially an error or a localized event) or distributed across many (a coordinated, country-wide action).
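The article does not say how the trend and concentration features are computed. The sketch below shows one plausible reading: a least-squares slope over hourly risk scores, an acceleration defined as the slope of rolling slopes, and an HHI over each ASN's share of blocked measurements. All three function names, and the choice of "blocked measurement count" as the HHI input, are assumptions made for illustration.

```python
def slope_per_hour(scores: list[float]) -> float:
    """Least-squares slope of equally spaced hourly scores (oldest first), per hour."""
    n = len(scores)
    if n < 2:
        return 0.0
    mean_t = (n - 1) / 2
    mean_s = sum(scores) / n
    num = sum((t - mean_t) * (s - mean_s) for t, s in enumerate(scores))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


def acceleration(scores: list[float], window: int = 6) -> float:
    """Slope of the rolling `window`-hour slopes: a discrete second derivative."""
    if len(scores) < window + 1:
        return 0.0
    slopes = [
        slope_per_hour(scores[i:i + window])
        for i in range(len(scores) - window + 1)
    ]
    return slope_per_hour(slopes)


def hhi(blocked_by_asn: dict[int, float]) -> float:
    """Herfindahl-Hirschman index: sum of squared shares, from 1/n (even) to 1 (one ASN).

    ASSUMPTION: the input maps each ASN to its count of blocked measurements.
    """
    total = sum(blocked_by_asn.values())
    if total <= 0:
        return 0.0
    return sum((v / total) ** 2 for v in blocked_by_asn.values())


rising = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
single_isp = hhi({64500: 120})
four_isps = hhi({64500: 30, 64501: 30, 64502: 30, 64503: 30})
```

An HHI near 1 means nearly all blocking comes from a single ASN, while a value near 1/n means it is spread evenly over n ASNs. That is the contrast the concentration feature is meant to expose.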
Pipeline handoff protocol
The feature vector is published to the forecasting service via a Kafka topic (voidly.forecast.features) every 15 minutes per country. The message payload is a Protocol Buffer encoding of ForecastFeatureVector. The forecasting service maintains a sliding window of the last four feature vectors per country (one hour of history) and runs the Bayesian forecasting model whenever a new vector arrives:
# Kafka topic configuration
#   voidly.forecast.features
#   Key: country_code (string)
#   Value: ForecastFeatureVector protobuf
#   Partitions: 64 (one per active country, rounded to next power of 2)
#   Retention: 7 days
#   Compression: lz4
#   Max message size: 64 KB (feature vectors are ~2 KB each)

# Topic consumption (forecasting service)
consumer_config = {
    'group.id': 'voidly-shutdown-forecaster',
    'auto.offset.reset': 'latest',    # start at the log end when the group has no
                                      # committed offset; historical features are
                                      # re-computed from TimescaleDB if needed
    'enable.auto.commit': False,      # manual commit after forecast is written
    'max.poll.interval.ms': 120_000,  # 2 minutes: forecasting model run time p99
    'session.timeout.ms': 30_000,
}

The auto.offset.reset = latest policy is intended to make the forecasting service skip feature vectors that accumulated during downtime rather than process a backlog that would produce stale forecasts. Kafka applies this setting only when the consumer group has no valid committed offset, however. Because this consumer commits offsets manually, it resumes from its last committed offset after a restart, so skipping the backlog also requires seeking to the end of each assigned partition or discarding vectors whose computed_at timestamp is too old. Country forecasts are written to a TimescaleDB table with a 24-hour TTL on the alert delivery fanout, so a 15-minute gap in forecasts during a service restart does not trigger spurious alerts.
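The per-country sliding window of four feature vectors, together with a guard against stale vectors, could be sketched as follows. This is not the forecasting service's code: the `FeatureWindow` class, its method names, and the 30-minute `MAX_AGE` cutoff are hypothetical, and `computed_at` is assumed to have already been parsed from its ISO string into a timezone-aware `datetime`.

```python
from collections import deque
from datetime import datetime, timedelta, timezone

WINDOW_SIZE = 4                  # four 15-minute vectors = one hour of history
MAX_AGE = timedelta(minutes=30)  # ASSUMPTION: hypothetical staleness cutoff


class FeatureWindow:
    """Keeps the most recent WINDOW_SIZE feature vectors per country."""

    def __init__(self) -> None:
        self._windows: dict[str, deque] = {}

    def offer(self, country_code: str, computed_at: datetime,
              vector: dict, now: datetime) -> bool:
        """Store the vector unless it is stale; return True if a forecast should run."""
        if now - computed_at > MAX_AGE:
            return False  # backlog item from downtime: skip it, do not forecast
        window = self._windows.setdefault(country_code, deque(maxlen=WINDOW_SIZE))
        window.append((computed_at, vector))  # deque drops the oldest entry when full
        return True

    def history(self, country_code: str) -> list:
        """Return stored (computed_at, vector) pairs, oldest first."""
        return list(self._windows.get(country_code, ()))


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fw = FeatureWindow()
for i in range(5):
    ts = now - timedelta(minutes=15 * (4 - i))  # 60, 45, 30, 15, 0 minutes old
    fw.offer("XX", ts, {"risk_score_h0": 0.1 * i}, now=ts)
stale_accepted = fw.offer("XX", now - timedelta(hours=3), {"risk_score_h0": 0.9}, now=now)
```

Checking the vector's own timestamp keeps stale forecasts out regardless of which offset the consumer resumes from, at the cost of still reading and discarding any backlog messages.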
Related writing
Voidly classifier calibration covers the Platt scaling and isotonic regression calibration passes that produce the calibrated probabilities consumed by the aggregation pipeline described here.
Shutdown forecasting describes the Bayesian model that ingests the feature vectors produced by this pipeline and outputs 72-hour shutdown probability estimates.