
Voidly's anomaly classifier retraining pipeline: temporal splits, champion/challenger promotion, and drift detection


The Voidly anomaly classifier does not stay static. Censorship infrastructure evolves continuously: ISPs rotate the IP blocks they use for DNS injection, blocking operators swap out block-page HTML and update TLS certificates on interception proxies, and political events — elections, protests, sudden enforcement campaigns — create rapid distribution shifts that push new interference patterns into the measurement stream weeks before any manual labeling effort can catch up. A model trained six months ago and left frozen will drift quietly toward uselessness as the data it was trained on becomes unrepresentative of current traffic.

Voidly retrains on a weekly cadence. Each Monday at 02:00 UTC the retraining pipeline pulls a fresh rolling window of labeled measurements, checks for feature distribution drift, trains five new per-class XGBoost estimators, and begins a 48-hour shadow deployment alongside the current champion model. The pipeline is automated end-to-end; human review is only triggered when drift alerts fire or a challenger model fails promotion criteria.
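
As a small illustration of the schedule, the next run time can be computed from any timestamp. This is a sketch only: the function name is hypothetical, and the article does not say what scheduler Voidly actually uses.

```python
from datetime import datetime, timedelta, timezone


def next_retrain_start(now: datetime) -> datetime:
    """Return the next Monday 02:00 UTC strictly after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    # Monday is weekday 0: step back to this week's Monday, then pin 02:00 UTC.
    monday = (now - timedelta(days=now.weekday())).replace(
        hour=2, minute=0, second=0, microsecond=0
    )
    # If this week's slot has already passed (or is exactly now), use next week's.
    if monday <= now:
        monday += timedelta(days=7)
    return monday
```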

Why weekly, not daily or monthly

A daily cadence would be too sensitive to short-term noise. A censorship campaign that lasted 18 hours — a temporary election-night social media block, for example — would dominate a 24-hour training window and produce a model overfit to patterns that have already resolved. A monthly cadence is too slow: a month's worth of CDN IP rotation or new block-page fingerprints can degrade classifier precision measurably before the next retrain cycle.

Weekly strikes the practical balance. In a typical week, Voidly accumulates 200K–400K new probe measurements plus a nightly OONI batch import of roughly 80K–120K measurements with confirmed labels. That gives the training pipeline enough fresh positive examples across all five interference classes to update decision boundaries without overfitting to a single week's events.

Training data composition and time-based splits

Each retrain uses a rolling 26-week (six-month) window of labeled measurements. The window is not a random sample — it is every labeled measurement whose probe timestamp falls within the window, ordered by time. Labels come from four sources, weighted by reliability:

  • OONI confirmed events (highest weight: 1.0)
  • CensoredPlanet confirmed blocks (high weight: 0.85)
  • Voidly Tier 3 verified incidents (medium weight: 0.60)
  • Snorkel labeling-function weak labels (low weight: 0.25)
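
The article does not say how these reliability weights enter training. One plausible reading is a per-row sample weight, sketched below. The source identifiers and the function name are hypothetical; only the four weight values come from the list above.

```python
from typing import Sequence

import numpy as np

# Weight values are from the list above; the dictionary keys are
# hypothetical identifiers for the four label sources.
LABEL_SOURCE_WEIGHTS = {
    "ooni_confirmed": 1.0,
    "censoredplanet_confirmed": 0.85,
    "voidly_tier3_verified": 0.60,
    "snorkel_weak": 0.25,
}


def sample_weights(sources: Sequence[str]) -> np.ndarray:
    """Map each row's label source to its reliability weight."""
    return np.array([LABEL_SOURCE_WEIGHTS[s] for s in sources], dtype=float)
```

An array like this could be passed as `sample_weight` when fitting an estimator, so that a weak Snorkel label contributes a quarter as much to the loss as an OONI-confirmed one.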

The critical constraint is how this window is split. The pipeline uses time-based splits, not random splits. Weeks 1–20 form the training set, weeks 21–23 form the validation set (used for early stopping), and weeks 24–26 form the held-out test set for post-training evaluation and champion/challenger comparison.

Random splits are the wrong choice here. Suppose a new block-page HTML template first appears in week 25. Under a random split, some week-25 measurements carrying that template land in the training fold while others land in the test fold. The model learns the fingerprint from the training copies and then “correctly” classifies the test copies, but only because it has already seen measurements from the same period. In production, the model meets a new template without ever having trained on it, and performs substantially worse than the evaluation suggested. Time-based splits prevent this: the model is always evaluated on data that is strictly later than anything it trained on.
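
A minimal sketch of the 20/3/3 week split, assuming each measurement carries a NumPy `datetime64` probe timestamp. The function names are hypothetical; the week boundaries are the ones stated above.

```python
import numpy as np


def week_of_window(ts: np.ndarray, window_start: np.datetime64) -> np.ndarray:
    """Map probe timestamps to 1-based week numbers relative to the window start."""
    days = (ts - window_start) / np.timedelta64(1, "D")
    return np.floor(days / 7).astype(int) + 1


def temporal_split(week: np.ndarray):
    """Boolean masks for train (weeks 1-20), validation (21-23), and test (24-26)."""
    train = (week >= 1) & (week <= 20)
    val = (week >= 21) & (week <= 23)
    test = (week >= 24) & (week <= 26)
    return train, val, test
```

Because the masks are defined by week number alone, every validation row is later than every training row, and every test row is later than both.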

Class imbalance: SMOTE and inverse-frequency weights

The six-month training window is heavily skewed. Normal traffic — measurements showing no interference — accounts for approximately 89% of all labeled examples. The five interference classes occupy the remaining 11%:

Class                  Share of training set
─────────────────────────────────────────────
Normal (no interference)       88.9%
DNS tampering                   4.2%
TLS interference                2.8%
HTTP blocking                   2.3%
Throttling                      1.1%
BGP withdrawal                  0.6%

Left unaddressed, this imbalance drives gradient-boosted trees toward predicting “normal” for everything — a model that achieves 89% accuracy by ignoring all interference entirely. Voidly applies two complementary corrections.

First, SMOTE (Synthetic Minority Oversampling Technique) generates synthetic positive examples for the five minority classes by interpolating between real examples in feature space. SMOTE uses k=5 nearest neighbors in the 47-dimensional feature space and generates synthetic samples along the line segments between each minority example and its neighbors. This is applied only to the training set; the validation and test sets use raw label frequencies so that evaluation metrics reflect the real production distribution.
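
The interpolation step can be sketched in plain NumPy. This is an illustration of the idea, not the pipeline's implementation (which the article does not show): it uses a brute-force neighbor search that is only practical for small arrays, and the function name is hypothetical.

```python
import numpy as np


def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Generate n_new synthetic rows from minority-class rows X_min (needs > k rows)."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    # Brute-force pairwise squared distances; O(n^2) memory, demo-sized inputs only.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest rows
    base = rng.integers(0, n, size=n_new)  # pick a real example
    nb = neighbors[base, rng.integers(0, k, size=n_new)]  # pick one of its neighbors
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])
```

Every synthetic row is a convex combination of two real minority rows, so it stays inside the region the real examples already span.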

Second, XGBoost's scale_pos_weight parameter applies inverse-frequency class weights to the optimization objective itself. For the BGP withdrawal estimator, where positives are 0.6% of the data, the effective weight on positive examples is approximately 166:1. SMOTE addresses boundary underrepresentation — without synthetic examples, the minority class boundary is defined by too few real points and generalizes poorly. Class weights address objective bias — even with SMOTE, the loss function sees many more negative gradient contributions than positive ones. Both are needed.
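
The 166:1 figure follows from the negative-to-positive ratio, which is the usual way to set `scale_pos_weight`. The sketch below applies that ratio to the raw class shares from the table above. The dictionary keys are hypothetical, and the article does not say whether the pipeline computes the ratio before or after SMOTE expansion.

```python
# Raw class shares from the table above; keys are hypothetical identifiers.
CLASS_SHARE = {
    "dns_tampering": 0.042,
    "tls_interference": 0.028,
    "http_blocking": 0.023,
    "throttling": 0.011,
    "bgp_withdrawal": 0.006,
}


def scale_pos_weight(positive_share: float) -> float:
    """Negatives per positive for a one-vs-rest estimator."""
    return (1.0 - positive_share) / positive_share


# One weight per one-vs-rest estimator.
WEIGHTS = {name: scale_pos_weight(share) for name, share in CLASS_SHARE.items()}
```

For BGP withdrawal this gives 0.994 / 0.006, or about 166, matching the figure above.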

PSI drift detection before training begins

Before training starts, the pipeline checks whether the newest week of incoming data has drifted significantly from the prior training window. A large distribution shift can mean the new data is corrupted (a probe configuration bug that changed how measurements are encoded), or it can mean a genuine infrastructure event (a CDN pushed a major update that changed TLS certificate hashes across thousands of domains). In either case, training on drifted data without review risks producing a model that reflects an artifact rather than real censorship signal.

The drift check uses Population Stability Index (PSI) computed per feature. PSI measures how much the distribution of a feature in the new week's data has shifted relative to the prior six-month window, using 10 equal-width bins:

from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class FeatureDriftReport:
    feature_name: str
    psi_score: float
    n_new_samples: int
    n_reference_samples: int
    status: str          # 'ok', 'warning', or 'alert'
    triggered_at: str    # ISO-8601 timestamp


def compute_psi(
    reference: np.ndarray,
    current: np.ndarray,
    n_bins: int = 10,
    epsilon: float = 1e-8,
) -> float:
    """
    Compute Population Stability Index between reference and current distributions.
    Thresholds (as applied by classify_psi):
    PSI < 0.10          -> no action
    0.10 <= PSI < 0.25  -> WARNING
    PSI >= 0.25         -> ALERT (retrain skipped, ops paged)
    """
    min_val = min(reference.min(), current.min())
    max_val = max(reference.max(), current.max())
    bins = np.linspace(min_val, max_val, n_bins + 1)

    ref_counts, _ = np.histogram(reference, bins=bins)
    cur_counts, _ = np.histogram(current, bins=bins)

    ref_frac = ref_counts / (ref_counts.sum() + epsilon)
    cur_frac = cur_counts / (cur_counts.sum() + epsilon)

    # Clip to epsilon to prevent log(0)
    ref_frac = np.clip(ref_frac, epsilon, None)
    cur_frac = np.clip(cur_frac, epsilon, None)

    psi = np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac))
    return float(psi)


def classify_psi(psi: float) -> str:
    if psi < 0.10:
        return 'ok'
    elif psi < 0.25:
        return 'warning'
    return 'alert'
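
Written as a formula, the quantity that compute_psi evaluates is the following, where r_i and c_i are the fractions of reference and current-week samples falling in bin i (each clipped to epsilon) and B is the number of bins:

```latex
\mathrm{PSI} \;=\; \sum_{i=1}^{B} \left(c_i - r_i\right)\,\ln\frac{c_i}{r_i}, \qquad B = 10
```

As a toy check with only two bins, r = (0.5, 0.5) and c = (0.6, 0.4) give 0.1 ln(1.2) + (-0.1) ln(0.8) ≈ 0.0182 + 0.0223 = 0.0405, which classify_psi would report as 'ok'. Each term is non-negative because the difference and the log ratio always share a sign, so PSI is zero only when the two binned distributions match.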

The three features most prone to drift in practice are dns_response_time (CDN TTL changes shift the population of cached vs. uncached responses), tls_cert_hash (certificate rotation events push a week's worth of “unexpected cert” signals that are entirely legitimate), and http_body_simhash (major site redesigns change body fingerprints for clean measurements and can temporarily suppress the block-page fingerprint match rate). When any of these three features fires an alert-level PSI, the retrain is halted and the on-call engineer receives a page with the full FeatureDriftReport. The existing champion model continues serving inference unchanged until the alert is resolved.
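
The gating decision can be sketched as a function over per-feature PSI scores. This is an assumption-laden illustration: the function name is hypothetical, and it assumes an alert on any feature halts the run, using the same 0.10 and 0.25 boundaries as classify_psi.

```python
from typing import Dict, List, Tuple

WARNING_THRESHOLD = 0.10
ALERT_THRESHOLD = 0.25


def drift_gate(psi_by_feature: Dict[str, float]) -> Tuple[bool, List[str], List[str]]:
    """Return (proceed, warning_features, alert_features) for one retrain run."""
    alerts = sorted(f for f, p in psi_by_feature.items() if p >= ALERT_THRESHOLD)
    warnings = sorted(
        f for f, p in psi_by_feature.items()
        if WARNING_THRESHOLD <= p < ALERT_THRESHOLD
    )
    # Any alert-level feature halts the retrain; warnings alone do not.
    return (not alerts), warnings, alerts
```

For example, `drift_gate({"dns_response_time": 0.31, "tls_cert_hash": 0.12})` returns `(False, ["tls_cert_hash"], ["dns_response_time"])`: the run is halted and the champion keeps serving.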

The training run itself

When drift checks pass, the pipeline trains five independent XGBoost 2.0 estimators, one per interference class, using the one-vs-rest approach. The estimators share the same hyperparameters, which are held fixed between weekly runs and re-tuned only once per quarter via a 30-trial Optuna 3.0 Bayesian search with 5-fold cross-validation on the most recent full quarter's data:

XGB_PARAMS = {
    'n_estimators': 800,
    'max_depth': 6,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.7,
    'tree_method': 'hist',   # CPU histogram method; no GPU in cluster
    'eval_metric': 'logloss',
    'early_stopping_rounds': 30,
    'random_state': 42,
}

Re-tuning hyperparameters on every weekly run would be expensive and counterproductive — the weekly cadence is for updating decision boundaries as new censorship patterns emerge, not for reshaping the tree structure from scratch. The 800-estimator, depth-6 configuration is stable enough to absorb the weekly data refresh without overfitting.
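
The one-vs-rest setup means each estimator sees its own binary target derived from the same label column. A minimal sketch of that target construction, with hypothetical class identifiers:

```python
from typing import Dict

import numpy as np

# Hypothetical identifiers for the five interference classes.
INTERFERENCE_CLASSES = [
    "dns_tampering",
    "tls_interference",
    "http_blocking",
    "throttling",
    "bgp_withdrawal",
]


def one_vs_rest_targets(labels: np.ndarray) -> Dict[str, np.ndarray]:
    """Build one 0/1 target vector per interference class.

    Rows labeled 'normal', or labeled with any other class, are negatives.
    """
    return {c: (labels == c).astype(int) for c in INTERFERENCE_CLASSES}


# Each target would then be fitted separately, for example (assuming xgboost
# is installed and X_train / X_val hold the time-split feature matrices):
#   xgboost.XGBClassifier(**XGB_PARAMS).fit(
#       X_train, y_train, eval_set=[(X_val, y_val)])
```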

Training runs on a single 16-vCPU c5n.4xlarge instance using XGBoost's native OpenMP parallelism. The full 26-week window (typically 6M–9M training examples after SMOTE expansion) trains all five estimators in approximately 22 minutes. The histogram method is the key to keeping this tractable on CPU: it bins each continuous feature into at most 256 buckets before tree construction, so split finding evaluates at most 256 candidate thresholds per feature instead of one per distinct feature value. On the largest continuous features this is a roughly 40x reduction in split-finding cost.
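
To make the binning idea concrete, the sketch below quantizes one continuous feature into at most 256 buckets using plain quantiles. It is a simplification: XGBoost builds its bin boundaries with its own quantile sketch, and the function name here is hypothetical.

```python
import numpy as np


def histogram_bin(feature: np.ndarray, max_bins: int = 256):
    """Quantize a continuous feature into at most max_bins integer buckets."""
    # Interior quantile cut points; duplicates collapse for low-cardinality features.
    qs = np.linspace(0.0, 1.0, max_bins + 1)[1:-1]
    edges = np.unique(np.quantile(feature, qs))
    # Bucket id = number of cut points at or below the value (0 .. len(edges)).
    binned = np.searchsorted(edges, feature, side="right")
    return binned, edges
```

After binning, a split search only has to consider the cut points in `edges` (at most 255 of them), no matter how many distinct raw values the feature had.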

Champion/challenger shadow deployment

A newly trained model does not immediately replace the serving champion. It enters a 48-hour shadow period in which both models score every incoming live measurement simultaneously. The challenger's scores are logged to R2 but never used for publication decisions — the champion's output is the only output that flows downstream to the reconciler and alert service.

After 48 hours, the shadow corpus is evaluated offline against three promotion criteria. All three must pass; failure on any one criterion retires the challenger and triggers a notification to the ML team:

  • Weighted F2-score on the held-out test set must be at least equal to the champion's F2-score (no tolerance — the challenger must not regress).
  • Precision on the “Verified Incident” promotion decision must be ≥ 0.94. The decision covers measurements whose score clears the verified-incident threshold (0.85 in the evaluation code), the point at which Voidly's corroboration engine considers a measurement ready to surface as a potential incident to analysts.
  • A two-sample Kolmogorov-Smirnov test on the challenger's live shadow scores vs. the champion's live scores must return p > 0.05. This catches score distribution drift that aggregate F2 scores can miss: a challenger that scores 0.91 where the champion scores 0.43 on the same live measurements has changed behavior in a way that warrants inspection, even if the F2-score on the labeled test set is unchanged.

from dataclasses import dataclass
from typing import Optional
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import fbeta_score, precision_score

@dataclass
class ChallengerEvaluation:
    challenger_version: str
    champion_version: str
    challenger_f2: float
    champion_f2: float
    challenger_verified_precision: float
    ks_pvalue: float
    promoted: bool
    failure_reason: Optional[str]


def evaluate_challenger(
    challenger_version: str,
    champion_version: str,
    champion_scores: np.ndarray,
    challenger_scores: np.ndarray,
    test_labels: np.ndarray,
    test_probs_champion: np.ndarray,
    test_probs_challenger: np.ndarray,
    threshold: float = 0.65,
) -> ChallengerEvaluation:
    # Binarize held-out test probabilities at the shared decision threshold
    champion_preds = (test_probs_champion >= threshold).astype(int)
    challenger_preds = (test_probs_challenger >= threshold).astype(int)

    # Criterion 1 inputs: weighted F2 on the held-out test set
    champion_f2 = fbeta_score(test_labels, champion_preds, beta=2, average='weighted', zero_division=0)
    challenger_f2 = fbeta_score(test_labels, challenger_preds, beta=2, average='weighted', zero_division=0)

    # Criterion 2 input: precision on measurements above the verified-incident threshold (0.85)
    verified_mask = test_probs_challenger >= 0.85
    verified_precision = (
        precision_score(test_labels[verified_mask], challenger_preds[verified_mask], zero_division=0)
        if verified_mask.sum() > 0 else 0.0
    )

    # Criterion 3 input: KS test on live shadow scores (not the test-set probabilities)
    _, ks_pvalue = ks_2samp(champion_scores, challenger_scores)

    # Criteria are checked in order; only the first failure is reported
    promoted = True
    failure_reason = None
    if challenger_f2 < champion_f2:
        promoted = False
        failure_reason = f'F2 regression: {challenger_f2:.4f} < {champion_f2:.4f}'
    elif verified_precision < 0.94:
        promoted = False
        failure_reason = f'Verified precision {verified_precision:.4f} < 0.94'
    elif ks_pvalue <= 0.05:
        promoted = False
        failure_reason = f'KS test p={ks_pvalue:.4f} (score distribution drift)'

    return ChallengerEvaluation(
        challenger_version=challenger_version,
        champion_version=champion_version,
        challenger_f2=float(challenger_f2),
        champion_f2=float(champion_f2),
        challenger_verified_precision=float(verified_precision),
        ks_pvalue=float(ks_pvalue),
        promoted=promoted,
        failure_reason=failure_reason,
    )
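
For readers unfamiliar with F2, the sketch below computes the F-beta score for a single binary class directly from confusion counts. It is illustrative only: the promotion check uses scikit-learn's weighted average across classes, and the function name and counts here are made up.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta for one binary class; beta > 1 weights recall above precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# A high-recall model (precision 0.60, recall 0.90) against a
# high-precision model (precision ~0.86, recall 0.60).
high_recall_f2 = fbeta_from_counts(tp=90, fp=60, fn=10)
high_precision_f2 = fbeta_from_counts(tp=60, fp=10, fn=40)
```

With beta = 2 the high-recall model scores about 0.82 and the high-precision model about 0.64, even though their F1 scores are close (about 0.72 and 0.71). That gap is the practical effect of optimizing for recall.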

Canary rollout to live traffic

When a challenger passes all three promotion criteria, it enters a phased canary rollout: 5% of live traffic for two hours, then 25% for two hours, then 100%. The canary controller monitors alert fatigue throughout each phase — specifically, the rate at which new incident drafts are created by the corroboration engine, compared to the champion's rolling baseline.

If the incident creation rate during any canary phase exceeds twice the champion's 7-day baseline, the rollout is paused and the controller reverts traffic to the champion. A doubling of incident creation rate most commonly means the challenger is producing false positives at scale — flagging CDN traffic or ISP maintenance as censorship. Reverting takes under 30 seconds because the Cloudflare KV entry that controls the active model version is updated atomically, and inference nodes poll KV every 60 seconds with a forced refresh on version change.
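
The phase schedule and the revert rule can be sketched as a small decision function. The names are hypothetical, and the sketch assumes one observed incident creation rate per phase, where the real controller presumably monitors continuously.

```python
from typing import List, Tuple

# (traffic fraction, hours held) for each phase described above;
# the final phase is full traffic with no fixed duration.
CANARY_PHASES = [(0.05, 2.0), (0.25, 2.0), (1.00, 0.0)]
MAX_RATE_MULTIPLE = 2.0  # revert when the rate exceeds twice the baseline


def run_canary(baseline_rate: float, observed_rates: List[float]) -> Tuple[str, float]:
    """Return (status, final challenger traffic fraction).

    baseline_rate: the champion's 7-day incident creation rate.
    observed_rates[i]: the incident creation rate seen during phase i.
    """
    for (_fraction, _hours), rate in zip(CANARY_PHASES, observed_rates):
        if rate > MAX_RATE_MULTIPLE * baseline_rate:
            # All traffic goes back to the champion.
            return "reverted", 0.0
    return "promoted", 1.0
```

A rate of exactly twice the baseline does not trip the check in this sketch, since the rule above says "exceeds".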

Once fully promoted, the champion model is exported to ONNX via onnxmltools.convert_xgboost() and the resulting graph is pushed to the inference node model registry. The ONNX export is what the inference API serves — the Python XGBoost objects are kept for retraining lineage but never loaded in the hot path.

Model registry and version tracking

Every trained model — promoted or not — is registered in the model registry before shadow deployment begins. The registry record captures the full training provenance:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ModelVersion:
    version_id: str              # semver: '2.14.1'
    trained_at: datetime
    champion_since: Optional[datetime]
    training_data_window: str    # '2024-12-02/2025-06-02'
    test_f2_score: float
    ooni_alignment_score: float  # fraction of OONI confirmed events the model catches
    onnx_model_hash: str         # SHA-256 of the exported .onnx file
    training_rows: int
    status: str                  # 'shadow', 'canary', 'champion', 'retired'

The ooni_alignment_score field deserves special attention. OONI's confirmed event list is the closest thing to a ground truth that exists for internet censorship measurement. A model that does not catch OONI-confirmed events at a high rate is not fit for promotion regardless of its F2-score on the internal test set. In practice, champion models maintain an ooni_alignment_score above 0.91.

Versions are retained in the registry for 90 days. After that, the full model artifacts are archived to S3 as Parquet, and the registry entry is kept with a pointer to the archive location. The inference API supports version pinning via /v1/classify?model_version=2.14.1, which is used by the nightly reprocessing batch job to reproduce historical classifications against the model that was champion at the time the measurement was originally scored.
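
A sketch of how a reprocessing job might pick the version to pin: find the model that was champion when the measurement was originally scored, then build the pinned request URL. The record type, function names, and base URL are hypothetical; only the `/v1/classify?model_version=` path comes from the text above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional
from urllib.parse import urlencode


@dataclass
class VersionRecord:
    """Trimmed-down stand-in for the registry's ModelVersion record."""
    version_id: str
    champion_since: Optional[datetime]  # None if the version was never promoted


def champion_at(versions: List[VersionRecord], when: datetime) -> Optional[str]:
    """Return the version_id of the most recent promotion at or before `when`."""
    promoted = [
        v for v in versions
        if v.champion_since is not None and v.champion_since <= when
    ]
    if not promoted:
        return None
    return max(promoted, key=lambda v: v.champion_since).version_id


def pinned_classify_url(base_url: str, version_id: str) -> str:
    """Build the version-pinned classify URL."""
    return f"{base_url}/v1/classify?{urlencode({'model_version': version_id})}"
```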


For how the trained ONNX model is served — Cloudflare routing, regional nodes, and <50ms latency: Voidly's real-time inference API: classifying censorship measurements at 50ms →

For per-country Platt scaling that adjusts output probabilities after training: Voidly's per-country classifier calibration: Platt scaling and threshold tuning →

For the five interference classes and why the models optimize for recall: The Voidly anomaly classifier: five interference classes, gradient boosted trees, and why we optimize for recall →

For how the raw OONI and Voidly measurements become the labeled training dataset: Voidly's ML training pipeline: building a labeled censorship dataset from OONI measurements →