Evaluating the Voidly anomaly classifier: per-country confusion matrices, precision-recall curves, and the offline test harness
Between training the classifier and running the active learning loop, there's a step that determines whether a new model version ships to production: offline evaluation. The Voidly anomaly classifier — a five-class XGBoost ensemble that labels DNS tampering, HTTP blocking, TLS interference, BGP withdrawal, and throttling — is retrained weekly. Not every retrain improves matters. Some weeks the additional labeled data hurts performance on countries that were already well-served. Others introduce calibration drift that the Platt scaling layer can't fully correct.
The offline evaluation harness answers one question before any model touches live traffic: is this version better than the one currently in production, and by how much, for which countries, and for which interference types? This article covers how that evaluation is structured, why the metrics look different for Iran vs. China vs. Russia, and what the promotion criteria require before a model graduates to the champion/challenger shadow phase.
Test set design
The most important structural decision in any classifier evaluation is how to split the data. For censorship data, the answer is always time-based — never random.
Time-based split to prevent temporal leakage
Censorship patterns are temporally correlated. A DNS injection campaign that runs for three weeks will produce measurements in both train and test sets if you split randomly, giving the model credit for “predicting” a pattern it effectively memorized from near-contemporaneous training examples. The test set is always the most recent 90 days of measurements, held out entirely from training. The training set ends at the split date; the test set begins the next day. There is no overlap.
This means evaluation scores move over time. A model trained in September is evaluated on October through December measurements — measurements that include blocking patterns the model has never seen. That is the intended behavior. We want to measure generalization to future censorship, not interpolation within a known period.
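The split itself is a pair of date comparisons. The sketch below is illustrative only: it assumes each measurement is a dict with a `measured_on` date, which is a hypothetical field name rather than the pipeline's actual schema.

```python
from datetime import date, timedelta

def time_based_split(measurements, split_date, test_window_days=90):
    """Split into train (on or before split_date) and test (the window after it).

    `measured_on` is a hypothetical field name used for illustration.
    """
    test_start = split_date + timedelta(days=1)
    test_end = split_date + timedelta(days=test_window_days)
    train = [m for m in measurements if m["measured_on"] <= split_date]
    test = [m for m in measurements if test_start <= m["measured_on"] <= test_end]
    return train, test

measurements = [
    {"id": 1, "measured_on": date(2024, 9, 29)},
    {"id": 2, "measured_on": date(2024, 9, 30)},   # last training day
    {"id": 3, "measured_on": date(2024, 10, 1)},   # first test day
    {"id": 4, "measured_on": date(2024, 12, 29)},  # day 90 of the test window
    {"id": 5, "measured_on": date(2024, 12, 30)},  # outside the window
]
train, test = time_based_split(measurements, split_date=date(2024, 9, 30))
```

No measurement can land in both lists, because the two date ranges never overlap.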
Per-country stratification and macro-averaging
Each country's test slice is evaluated independently. Aggregate metrics are country-macro-averaged, not pooled. Pooling would allow China's 12 million measurements per month to dominate countries with 200 measurements per month. A model that performs perfectly on China and fails completely on Turkmenistan would still report excellent aggregate metrics under pooling. Macro-averaging weights each country equally regardless of measurement volume: if the model averages 0.85 AUC-PR across 60 evaluated countries, that reflects mean per-country performance, not total-measurement performance.
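A two-country toy example shows how far the two averages can diverge. The counts and scores below are hypothetical, and a volume-weighted mean of per-country scores stands in for true pooling.

```python
from statistics import mean

# Hypothetical per-country results: (number of test measurements, AUC-PR).
results = {
    "CN": (45_000, 0.95),
    "TM": (500, 0.20),
}

def pooled_average(results):
    """Weight each country's score by its measurement count."""
    total = sum(n for n, _ in results.values())
    return sum(n * score for n, score in results.values()) / total

def macro_average(results):
    """Weight each country equally, regardless of volume."""
    return mean(score for _, score in results.values())

pooled = pooled_average(results)
macro = macro_average(results)
```

Under these assumed numbers the volume-weighted figure stays close to the large country's score, while the macro average exposes the failure on the small one.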
Minimum test-set size threshold
Countries with fewer than 500 test examples are excluded from per-country metrics and flagged as coverage_insufficient. They fall back to regional group evaluation — their measurements are pooled with neighboring countries that share similar network infrastructure and censorship regime (e.g., Central Asia, Sub-Saharan Africa). This avoids reporting spurious per-country metrics derived from 40 or 80 measurements where a single mislabeled example could swing F2 by 0.15.
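To see why thin slices are unstable, F2 can be written in terms of confusion counts. The sketch below starts from a perfect classifier and turns one true positive into a false negative. The positive counts are illustrative assumptions, not figures from the Voidly test set.

```python
def f2_from_counts(tp, fp, fn):
    """F2 in terms of confusion counts: 5*TP / (5*TP + 4*FN + FP)."""
    denom = 5 * tp + 4 * fn + fp
    return 5 * tp / denom if denom else 0.0

def swing_from_one_missed_positive(n_positives):
    """Drop in F2 when one true positive becomes a false negative,
    starting from a perfect score on n_positives positive examples."""
    before = f2_from_counts(tp=n_positives, fp=0, fn=0)
    after = f2_from_counts(tp=n_positives - 1, fp=0, fn=1)
    return before - after

thin_slice = swing_from_one_missed_positive(6)    # a slice with only 6 positives
large_slice = swing_from_one_missed_positive(25)  # e.g. 500 examples at an assumed 5% rate
```

With six positives a single flipped label moves F2 by about 0.14. With twenty-five it moves F2 by about 0.03.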
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class EvaluationSplit:
    split_date: date  # training ends here; test begins the day after
    test_window_days: int = 90  # how many days of measurements in the test set
    min_country_test_size: int = 500  # countries below this threshold are excluded
    coverage_insufficient_countries: list[str] = field(default_factory=list)
    regional_fallback_groups: dict[str, list[str]] = field(default_factory=dict)

    def is_country_evaluable(self, country_code: str, n_test: int) -> bool:
        """Return True if the country has enough test examples for per-country metrics."""
        if n_test < self.min_country_test_size:
            # Record the country once, even if this check runs repeatedly.
            if country_code not in self.coverage_insufficient_countries:
                self.coverage_insufficient_countries.append(country_code)
            return False
        return True

    def get_regional_group(self, country_code: str) -> Optional[str]:
        """Return the regional group key for a coverage_insufficient country."""
        for group_name, members in self.regional_fallback_groups.items():
            if country_code in members:
                return group_name
        return None

The offline test harness
The harness is a Python class that wraps the evaluation pipeline: loading the held-out test set, running the classifier, applying per-country Platt scaling calibration, computing metrics, and assembling the evaluation report that feeds the promotion criteria check.
# The "X | None" return annotations below require Python 3.10 or newer.
from dataclasses import dataclass
from typing import Any

import numpy as np
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    confusion_matrix,
    fbeta_score,
)


@dataclass
class CountryEvalResult:
    country_code: str
    n_test: int
    per_class_metrics: dict[str, dict[str, float]]  # class -> {precision, recall, f1, f2, auc_pr}
    confusion_matrix: np.ndarray
    auc_pr: float  # macro-averaged over positive classes
    f2_score: float  # macro-averaged, beta=2
    calibration_ece: float  # Expected Calibration Error post-Platt scaling


class ClassifierEvaluator:
    def __init__(self, model, calibrators, split: EvaluationSplit, test_data):
        self.model = model
        self.calibrators = calibrators  # per-(country, class) Platt scaling objects
        self.split = split
        self.test_data = test_data

    def evaluate_country(self, country_code: str) -> CountryEvalResult | None:
        """Evaluate the classifier on a single country's held-out test slice."""
        ...

    def evaluate_all_countries(self) -> dict[str, CountryEvalResult]:
        """Run evaluate_country for every country in the test set."""
        ...

    def aggregate_metrics(self, results: dict[str, CountryEvalResult]) -> dict[str, float]:
        """Compute country-macro-averaged AUC-PR, F2, and ECE."""
        ...

    def generate_report(self, results, agg_metrics, previous_eval=None) -> dict[str, Any]:
        """Assemble the full evaluation report including promotion criteria results."""
        ...

The evaluate_country method does the heavy lifting for each country. It extracts features from the country's test slice, runs them through the classifier's predict_proba_multilabel method, applies per-country Platt scaling, applies the operating-point threshold for each class, and computes every metric the CountryEvalResult dataclass needs:
def evaluate_country(self, country_code: str) -> CountryEvalResult | None:
    # Method of ClassifierEvaluator. extract_features, CLASS_NAMES, CLASS_LABEL_COLS,
    # and get_country_thresholds are pipeline helpers defined elsewhere.
    country_df = self.test_data[self.test_data['country_code'] == country_code]
    n_test = len(country_df)
    if not self.split.is_country_evaluable(country_code, n_test):
        return None  # flagged as coverage_insufficient

    # 1. Feature extraction
    X = extract_features(country_df)  # returns a (n_test, 47) array
    y_true = country_df[CLASS_LABEL_COLS].values.astype(int)  # (n_test, 5) binary matrix

    # 2. Raw model probabilities
    raw_probs = self.model.predict_proba_multilabel(X)  # (n_test, 5)

    # 3. Per-country Platt scaling calibration.
    # Each calibrator's predict() is expected to return calibrated probabilities,
    # not hard class labels.
    calibrated_probs = np.zeros_like(raw_probs)
    for cls_idx, cls_name in enumerate(CLASS_NAMES):
        key = (country_code, cls_name)
        if key in self.calibrators:
            calibrated_probs[:, cls_idx] = self.calibrators[key].predict(
                raw_probs[:, cls_idx].reshape(-1, 1)
            ).ravel()
        else:
            # Regional fallback calibration when no per-country calibrator exists
            regional_key = self.split.get_regional_group(country_code)
            if regional_key and (regional_key, cls_name) in self.calibrators:
                calibrated_probs[:, cls_idx] = self.calibrators[
                    (regional_key, cls_name)
                ].predict(raw_probs[:, cls_idx].reshape(-1, 1)).ravel()
            else:
                calibrated_probs[:, cls_idx] = raw_probs[:, cls_idx]

    # 4. Apply per-class, per-country threshold to get binary predictions
    thresholds = get_country_thresholds(country_code)  # dict[cls_name, float]
    y_pred = np.zeros_like(calibrated_probs, dtype=int)
    for cls_idx, cls_name in enumerate(CLASS_NAMES):
        y_pred[:, cls_idx] = (calibrated_probs[:, cls_idx] >= thresholds[cls_name]).astype(int)

    # 5. Compute per-class and aggregate metrics
    per_class_metrics = {}
    auc_pr_scores = []
    f2_scores = []
    for cls_idx, cls_name in enumerate(CLASS_NAMES):
        # labels=[0, 1] guarantees the '1' key exists even when a class has
        # no positive labels or predictions in this country's slice.
        report = classification_report(
            y_true[:, cls_idx], y_pred[:, cls_idx],
            labels=[0, 1], output_dict=True, zero_division=0
        )
        precision = report['1']['precision']
        recall = report['1']['recall']
        f2 = fbeta_score(y_true[:, cls_idx], y_pred[:, cls_idx], beta=2, zero_division=0)
        auc_pr = average_precision_score(y_true[:, cls_idx], calibrated_probs[:, cls_idx])
        per_class_metrics[cls_name] = {
            'precision': precision, 'recall': recall,
            'f1': report['1']['f1-score'], 'f2': f2, 'auc_pr': auc_pr,
        }
        auc_pr_scores.append(auc_pr)
        f2_scores.append(f2)

    # Rows with no positive label map to an extra "none" index. Without it,
    # argmax would assign every clean measurement to class 0.
    none_idx = len(CLASS_NAMES)
    true_labels = np.where(y_true.any(axis=1), y_true.argmax(axis=1), none_idx)
    pred_labels = np.where(
        y_pred.any(axis=1), (calibrated_probs * y_pred).argmax(axis=1), none_idx
    )
    cm = confusion_matrix(true_labels, pred_labels, labels=list(range(none_idx + 1)))

    ece = compute_ece(y_true.ravel(), calibrated_probs.ravel())
    return CountryEvalResult(
        country_code=country_code,
        n_test=n_test,
        per_class_metrics=per_class_metrics,
        confusion_matrix=cm,
        auc_pr=float(np.mean(auc_pr_scores)),
        f2_score=float(np.mean(f2_scores)),
        calibration_ece=ece,
    )

Why AUC-PR over AUC-ROC
In most countries, true censorship anomalies represent 1–5% of all measurements. A country blocking 8 domains out of a 200-domain test list, probed 10 times per day per vantage point, produces roughly 2–4% positive labels — and most of those are concentrated in a handful of domains probed from specific ASNs.
AUC-ROC is optimistic under this level of class imbalance. The receiver operating characteristic curve plots true positive rate (recall) against false positive rate (FPR = FP / (FP + TN)). When negatives vastly outnumber positives, the denominator of FPR is enormous. A model can produce a large absolute number of false positives while still achieving a small FPR — the ROC curve looks good even though the model is flagging many clean measurements as interference.
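A worked example makes the gap concrete. The confusion counts below are hypothetical, chosen to represent a 10,000-measurement slice with a 2% anomaly rate.

```python
def fpr_and_precision(tp, fp, fn, tn):
    """Return (false positive rate, precision) from confusion counts."""
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return fpr, precision

# Hypothetical slice: 200 positives (180 found), 9,800 negatives (490 wrongly flagged).
fpr, precision = fpr_and_precision(tp=180, fp=490, fn=20, tn=9_310)
```

Here the false positive rate is 5%, which looks healthy on a ROC curve, yet only about 27% of the flagged measurements are real anomalies.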
AUC-PR focuses on the tradeoff between precision and recall for the positive (anomaly) class. It doesn't use the true negative count at all. At the operating point threshold, a model that finds 90% of real anomalies with 70% precision is more valuable than a model with a 0.97 AUC-ROC but only 30% recall — the high-AUC-ROC model misses most of the events that matter.
The AUC-PR baseline for a random classifier equals class prevalence. For a country with a 3% anomaly rate, a random classifier achieves AUC-PR = 0.03. This makes it trivial to read real model lift off the metric: an AUC-PR of 0.82 on a 3% prevalence dataset represents roughly 27× improvement over random, not the misleading 97% accuracy that a majority-class predictor would achieve.
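The arithmetic behind that comparison is short enough to sketch directly. The helper names here are illustrative and not part of the harness.

```python
def auc_pr_lift(auc_pr, prevalence):
    """Ratio of a model's AUC-PR to the random-classifier baseline (the prevalence)."""
    return auc_pr / prevalence

def majority_class_accuracy(prevalence):
    """Accuracy of a predictor that always answers 'no anomaly'."""
    return 1.0 - prevalence

lift = auc_pr_lift(auc_pr=0.82, prevalence=0.03)
accuracy = majority_class_accuracy(prevalence=0.03)
```

The majority-class predictor scores high on accuracy while detecting nothing, which is why accuracy is not reported anywhere in the evaluation.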
| Country | Anomaly rate | AUC-ROC | AUC-PR |
|---|---|---|---|
| Iran (IR) | 18% | 0.97 | 0.94 |
| China (CN) | 11% | 0.96 | 0.89 |
| Russia (RU) | 6% | 0.95 | 0.83 |
| Turkey (TR) | 4% | 0.94 | 0.79 |
| Germany (DE) | 1% | 0.93 | 0.61 |
Note that Germany's AUC-ROC (0.93) is nearly as high as Iran's (0.97), but the AUC-PR gap (0.61 vs. 0.94) correctly captures that the model performs much less reliably on Germany's rare, heterogeneous anomalies than on Iran's frequent, pattern-consistent blocking.
F2 scoring rationale
The Voidly pipeline optimizes for recall over precision. The F2 score formalizes this by weighting recall twice as heavily as precision in the harmonic mean:
F2 = (1 + 4) × (precision × recall) / (4 × precision + recall)
The rationale is asymmetric cost. A false negative — a missed censorship event — stays missed. The cross-source corroboration layer that filters out false positives downstream has no mechanism to recover events the classifier never surfaced. A false positive, by contrast, enters the corroboration queue as an “Observed” annotation and remains there until OONI, CensoredPlanet, or IODA independently confirm the same target in the same time window. If no external source corroborates it, it simply ages out. The asymmetry — recall errors are permanent, precision errors are recoverable — justifies the F2 weighting.
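The effect of the weighting is easiest to see on a mirrored pair of models. The sketch below implements the general F-beta formula and compares a recall-heavy model with a precision-heavy one. The 0.70 and 0.90 values are illustrative.

```python
def fbeta(precision, recall, beta=2.0):
    """General F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Mirrored precision/recall pairs: F1 scores them identically, F2 does not.
recall_heavy_f2 = fbeta(precision=0.70, recall=0.90)
precision_heavy_f2 = fbeta(precision=0.90, recall=0.70)
recall_heavy_f1 = fbeta(precision=0.70, recall=0.90, beta=1.0)
precision_heavy_f1 = fbeta(precision=0.90, recall=0.70, beta=1.0)
```

F1 cannot distinguish the two models, whereas F2 ranks the recall-heavy model clearly higher (about 0.85 against 0.73).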
Per-class F2 scores for the current production model, macro-averaged across evaluated countries:
| Interference class | Precision | Recall | F2 score |
|---|---|---|---|
| DNS tampering | 0.81 | 0.95 | 0.91 |
| HTTP blocking | 0.88 | 0.91 | 0.88 |
| TLS interference | 0.79 | 0.92 | 0.84 |
| BGP withdrawal | 0.89 | 0.97 | 0.96 |
| Throttling | 0.55 | 0.87 | 0.79 |
Throttling has the lowest F2 score (0.79) because it is structurally the hardest class to distinguish from legitimate network congestion. The precision of 0.55 means nearly half of throttling detections are false positives — acceptable under the F2 weighting because the 0.87 recall ensures most real throttling events are captured, and the corroboration layer filters the noise. BGP withdrawal has the highest F2 (0.96) because routing events produce unambiguous hard signals — a prefix is either reachable from the probe's vantage point or it isn't.
Per-country confusion matrix analysis
Aggregate metrics flatten the interesting variation. The per-country confusion matrices reveal the specific failure modes that each national blocking regime produces.
Iran (IR)
Iran's test set covers 312 measurements, with DNS tampering the dominant interference class. The classifier achieves near-perfect recall on DNS tampering (0.97) and high precision (0.94). The reasons are regime-specific: Iranian DNS injection uses a small known set of injection IPs. The resolver returns one of approximately 40 addresses that belong to the Islamic Republic of Iran Broadcasting organization or the Telecommunications Infrastructure Company. These IPs are completely outside the expected ASN ranges for any target domain. The classifier's ip_in_expected_asn and redirect_to_block_page features fire cleanly with almost no ambiguity.
Iran also has almost no split-horizon CDN noise — legitimate geo-differentiated DNS responses that return different IPs to different regions. In markets with heavy CDN deployment, a resolver returning a regional edge IP can look like injection to a naive feature. Iran's internet infrastructure routes through a small number of state-controlled gateways, which limits CDN differentiation and reduces false positives accordingly.
China (CN)
China's test set covers 45,000 measurements — the largest per-country slice by volume. DNS tampering precision is lower (0.78) despite high recall (0.94). The primary driver is CDN split-horizon noise: major CDN operators return geo-differentiated DNS responses from Chinese edge nodes that are topologically close to Chinese resolvers but belong to different ASNs than the same CDN's global anycast fleet. The classifier sees a returned IP that doesn't match the domain's known ASN and flags it — correctly in censorship cases, but incorrectly for legitimate CDN differentiation.
This is why the Platt scaling threshold for CN DNS tampering is 0.74, compared to 0.62 for IR. After calibration, the Chinese threshold is set high enough to require multiple concurrent features to fire (ASN mismatch plus either NXDOMAIN or response time anomaly) before a positive label is assigned. The higher threshold trades some recall for precision — a deliberate per-country policy encoded into the calibration layer.
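In practice this means the same calibrated probability can produce different labels in different countries. The sketch below uses the two thresholds quoted above. The fallback default and the function name are assumptions for illustration.

```python
# DNS tampering thresholds quoted in the text.
DNS_TAMPERING_THRESHOLDS = {"CN": 0.74, "IR": 0.62}
DEFAULT_THRESHOLD = 0.70  # hypothetical fallback, not taken from the pipeline

def is_dns_tampering(country_code, calibrated_prob):
    """Apply the per-country operating threshold to a calibrated probability."""
    threshold = DNS_TAMPERING_THRESHOLDS.get(country_code, DEFAULT_THRESHOLD)
    return calibrated_prob >= threshold

same_score = 0.68
flagged_in_iran = is_dns_tampering("IR", same_score)
flagged_in_china = is_dns_tampering("CN", same_score)
```

A calibrated score of 0.68 is labeled as tampering for Iran but not for China.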
Russia (RU)
Russia shows the highest throttling recall (0.83) of any evaluated country. This is traceable to the TSPU (Tekhnicheskie Sredstva Protivodeystviya Ugrozam, “technical means of countering threats”) deep-packet inspection hardware mandated by the Sovereign Internet Law. When TSPU applies rate limiting, the body transfer rate drops to the configured rate-limit value: not just “low,” but precisely at the ISP's configured ceiling. The timing signature produces a distinctive bandwidth_z feature value where the measured throughput clusters tightly around a specific threshold rather than varying stochastically as congestion does.
However, Russia also has the highest throttling false positive rate (21%) among major evaluated countries. Mobile network congestion in Russian urban areas produces bandwidth degradation patterns that partially resemble TSPU throttling — particularly during evening peak hours when legitimate congestion reduces throughput to values close to common TSPU rate-limit configurations. The other_domains_ok feature (whether neighboring domains on the same probe show similar degradation) partially mitigates this, but Russia's high density of probe deployments on residential mobile connections means the false positive rate remains elevated.
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report


def build_per_class_confusion(y_true: np.ndarray, y_pred: np.ndarray, class_names: list[str]):
    """
    Build and extract per-class metrics from a multi-label binary matrix.
    y_true, y_pred: shape (n_samples, n_classes)
    Returns dict[class_name -> {precision, recall, f1, tp, fp, fn, tn, false_positive_rate}]
    """
    results = {}
    for cls_idx, cls_name in enumerate(class_names):
        # labels=[0, 1] keeps the '1' key present even if a class never occurs.
        report = classification_report(
            y_true[:, cls_idx],
            y_pred[:, cls_idx],
            labels=[0, 1],
            output_dict=True,
            zero_division=0,
        )
        cm = confusion_matrix(y_true[:, cls_idx], y_pred[:, cls_idx], labels=[0, 1])
        tn, fp, fn, tp = cm.ravel()
        results[cls_name] = {
            'precision': report['1']['precision'],
            'recall': report['1']['recall'],
            'f1': report['1']['f1-score'],
            'tp': int(tp), 'fp': int(fp),
            'fn': int(fn), 'tn': int(tn),
            'false_positive_rate': float(fp / (fp + tn)) if (fp + tn) > 0 else 0.0,
        }
    return results

Calibration evaluation
A model can have excellent AUC-PR and F2 scores while still producing poorly calibrated probabilities. If the model outputs 0.80 for an event that is actually positive 50% of the time, downstream consumers of the probability (the confidence tier system, the corroboration engine, the active learning uncertainty sampler) all receive misleading signal. Expected Calibration Error (ECE) measures this gap.
ECE bins predictions into n equal-width buckets and computes the weighted mean absolute difference between mean predicted probability and observed positive frequency within each bin:
import numpy as np


def compute_ece(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10) -> float:
    """
    Compute Expected Calibration Error.
    y_true: binary labels (0 or 1)
    y_prob: predicted probabilities in [0, 1]
    Returns ECE in [0, 1]; lower is better.
    """
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(y_true)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = bin_edges[i], bin_edges[i + 1]
        if i == n_bins - 1:
            # The last bin is closed on the right so that y_prob == 1.0 is counted.
            mask = (y_prob >= lo) & (y_prob <= hi)
        else:
            mask = (y_prob >= lo) & (y_prob < hi)
        if not mask.any():
            continue
        bin_count = mask.sum()
        bin_confidence = y_prob[mask].mean()  # mean predicted probability in the bin
        bin_accuracy = y_true[mask].mean()  # observed positive frequency in the bin
        ece += (bin_count / n) * abs(bin_confidence - bin_accuracy)
    return float(ece)

Before Platt scaling, raw XGBoost probabilities are poorly calibrated: ECE ranges from 0.14 to 0.22 across countries. The model is overconfident: predictions in the 0.7–0.9 range correspond to actual positive rates of only 0.45–0.65. After per-country Platt scaling, ECE drops to 0.03–0.06. The calibration improvement is most pronounced for countries with large labeled datasets (China, Iran, Russia) where the Platt regression has enough examples to fit a tight logistic transformation.
Countries with insufficient data for per-country Platt scaling fall back to regional calibration (grouped by geographic region and censorship regime similarity). Regional calibration achieves ECE of 0.08–0.11 — worse than per-country calibration but substantially better than the uncalibrated model.
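For readers unfamiliar with Platt scaling, it fits a two-parameter logistic curve that maps raw scores to calibrated probabilities. The sketch below is a minimal pure-Python version that uses plain gradient descent on toy data. It is an illustration of the technique, not the calibrator the pipeline uses, and a production system would more likely rely on a library implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(raw_scores, labels, lr=1.0, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(raw_scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(raw_scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Toy overconfident model: a raw score of 0.8 is positive only half the time,
# and a raw score of 0.2 is positive one time in ten.
raw_scores = [0.2] * 10 + [0.8] * 10
labels = [1] + [0] * 9 + [1] * 5 + [0] * 5
a, b = fit_platt(raw_scores, labels)
calibrated_high = sigmoid(a * 0.8 + b)
calibrated_low = sigmoid(a * 0.2 + b)
```

After fitting, a raw score of 0.8 maps to roughly 0.5 and a raw score of 0.2 to roughly 0.1, matching the observed positive rates in the toy data.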
A reliability diagram — plotting mean predicted probability against observed positive rate for each bin — shows the before/after calibration difference clearly. Uncalibrated predictions form a curve that bows below the diagonal (the model is overconfident throughout). Post-Platt predictions track the diagonal closely, with residual deviation only in the extreme bins (0–0.05 and 0.95–1.0) where sample counts are lowest.
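The points on such a diagram come from the same binning that ECE uses. The sketch below computes them for plotting with any charting tool. The function name and the sample data are illustrative.

```python
def reliability_bins(y_true, y_prob, n_bins=10):
    """Return (mean_predicted, observed_rate, count) for each non-empty bin."""
    buckets = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        buckets[idx].append((y, p))
    rows = []
    for bucket in buckets:
        if not bucket:
            continue
        mean_pred = sum(p for _, p in bucket) / len(bucket)
        observed = sum(y for y, _ in bucket) / len(bucket)
        rows.append((mean_pred, observed, len(bucket)))
    return rows

y_prob = [0.05, 0.15, 0.85, 0.85, 0.85, 0.85, 1.0]
y_true = [0, 0, 1, 1, 0, 0, 1]
rows = reliability_bins(y_true, y_prob)
```

Each row is one point: the x-coordinate is the mean predicted probability and the y-coordinate is the observed positive rate. Points below the diagonal, like the 0.85 bin in this sample, indicate overconfidence.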
Promotion criteria
A new model version promotes to production only if every criterion in the following checklist passes. A single failure blocks promotion regardless of how well the model performs on other metrics.
- Country-macro-averaged AUC-PR ≥ 0.82 (previous production model: 0.81)
- Country-macro-averaged F2 ≥ 0.85 (previous production model: 0.84)
- No individual country F2 regression greater than 0.05 below the previous model's score for that country
- Calibration ECE ≤ 0.07 for at least 90% of evaluated countries
- Champion/challenger shadow mode for 48 hours on live measurements, with no anomalous divergence in flagging rate
The per-country regression check is the strictest gate in practice. A model that improves China's F2 by 0.04 while regressing Turkmenistan's by 0.06 fails. This protects countries with sparse training data from being sacrificed for aggregate metric gains. The 0.05 threshold was chosen based on the empirical variability of per-country metrics across evaluation runs with identical models: noise in the held-out test set typically produces ±0.02 variation, so 0.05 represents a statistically meaningful regression.
def check_promotion_criteria(
    current_eval: dict[str, CountryEvalResult],
    previous_eval: dict[str, CountryEvalResult],
    current_agg: dict[str, float],
    previous_agg: dict[str, float],
) -> tuple[bool, str]:
    """
    Returns (should_promote: bool, reason: str).
    reason describes the first failing criterion, or confirms that all offline
    criteria passed. previous_agg is accepted for reporting; the aggregate
    thresholds below are absolute.
    """
    # 0. Nothing to evaluate means nothing to promote
    if not current_eval:
        return False, "No evaluable countries in the current evaluation"

    # 1. Macro-averaged AUC-PR threshold
    if current_agg['auc_pr'] < 0.82:
        return False, f"AUC-PR {current_agg['auc_pr']:.3f} < 0.82 threshold"

    # 2. Macro-averaged F2 threshold
    if current_agg['f2'] < 0.85:
        return False, f"F2 {current_agg['f2']:.3f} < 0.85 threshold"

    # 3. Per-country F2 regression check
    for country_code, result in current_eval.items():
        if country_code not in previous_eval:
            continue  # new country; no regression baseline
        prev_f2 = previous_eval[country_code].f2_score
        curr_f2 = result.f2_score
        if prev_f2 - curr_f2 > 0.05:
            return False, (
                f"Country {country_code} F2 regression: "
                f"{prev_f2:.3f} -> {curr_f2:.3f} (delta {prev_f2 - curr_f2:.3f} > 0.05)"
            )

    # 4. Calibration ECE threshold (90% of countries must pass)
    ece_values = [r.calibration_ece for r in current_eval.values()]
    ece_pass_rate = sum(e <= 0.07 for e in ece_values) / len(ece_values)
    if ece_pass_rate < 0.90:
        return False, f"ECE ≤ 0.07 for only {ece_pass_rate:.1%} of countries (need 90%)"

    # 5. Shadow mode check is performed externally; if we reach here, offline criteria pass
    return True, 'All offline criteria passed; proceed to 48h shadow mode'

What evaluation doesn't catch
The offline evaluation harness is thorough but bounded. Three failure modes fall outside its scope by design.
Concept drift. The test set covers the 90 days immediately after the split date. A new blocking pattern introduced after the test window closes (a DPI vendor's new fingerprint, a novel BGP manipulation technique, a previously unseen block page template) won't appear in the test set and won't lower any metric. The active learning loop is designed to catch concept drift: annotation uncertainty on live measurements surfaces new patterns to human reviewers, whose labels feed the next retrain. But offline evaluation can't self-report on patterns it has never seen.
Coverage gaps. Countries with fewer than 500 test examples are excluded from per-country evaluation. The evaluation harness doesn't tell you whether the classifier performs well in those countries. It reports them as coverage_insufficient and defers to regional group metrics. Separate ASN coverage monitoring tracks which countries and ASNs have enough probe deployment for reliable measurement; that monitoring is orthogonal to classifier evaluation.
Corroboration feedback. Whether OONI, CensoredPlanet, or IODA subsequently confirmed Voidly's anomaly detections is a signal the offline harness never sees. Corroboration feedback — retroactive validation of classifier outputs by independent sources — is the highest-quality label signal available, but it arrives weeks or months after the measurement. This retroactive signal feeds the active learning annotation queue as a retrospective quality signal; it doesn't flow back into the offline evaluation metrics at promotion time.
For the classifier this harness evaluates — five-class XGBoost, gradient boosted trees, per-class binary models, and why 95% recall beats 95% precision: The Voidly anomaly classifier: five interference classes, gradient boosted trees, and why we optimize for recall →
For the active learning loop that uses evaluation failures as annotation signal — uncertainty sampling, Cohen's kappa, and weekly retrains: Voidly's active learning loop: growing the anomaly training set with human-in-the-loop annotation →
For the per-country Platt scaling calibration that's applied before evaluation metrics are computed: Voidly's per-country classifier calibration: Platt scaling, threshold tuning, and why the same probability means different things in Iran vs. China →
For the 47-feature vector that feeds into the classifier being evaluated: The 47 features that classify internet censorship: how Voidly extracts signal from raw network measurements →