Voidly's per-country classifier calibration: Platt scaling, threshold tuning, and why the same probability means different things in Iran vs. China

June 7, 2025 · 9 min read · AI Analytics

Censorship · Voidly · ML · Methodology

The Voidly anomaly classifier outputs a probability for each interference class (p_dns_tamper, p_tls_interference, p_http_blocking, p_bgp_withdrawal, p_throttling) for every probe measurement. But a raw classifier probability of 0.65 for DNS tampering does not carry the same meaning in Iran as it does in China. Iran has a highly centralized censorship infrastructure where DNS blocking events are tightly consistent across ISPs; once the classifier has seen a few examples from a given campaign, its probability estimates are well calibrated. China has a complex CDN landscape where the same domain legitimately resolves to different IPs from different vantage points, so the decision threshold must be set higher to avoid flagging CDN behavior as censorship. This post covers how per-country calibration works: Platt scaling, threshold tuning, and the calibration parameters for the four most-measured countries.

Why calibration is necessary

The base classifier is a set of five gradient-boosted tree models (one per interference class), each trained on a globally pooled dataset of 275K labeled measurements. A model trained on globally pooled data learns the global base rate of each class: roughly 3.1% DNS tampering, 1.8% TLS interference, and 4.2% HTTP blocking across all measurements. But the per-country class rates vary enormously: Iran's DNS tampering rate is approximately 22%, while Germany's is approximately 0.4%. A model trained on global data will therefore underestimate the Iran rate and overestimate the Germany rate.
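
To get a feel for the size of the mismatch, one can treat the difference between countries as a pure base-rate shift. This is a simplifying assumption, not a description of Voidly's pipeline: it ignores any country-dependent features the model sees. Under that assumption, a globally trained log-odds score would need to move by logit(country rate) minus logit(global rate). The helper name `prior_shift_offset` is hypothetical; the rates are the ones quoted above.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def prior_shift_offset(country_rate: float, global_rate: float) -> float:
    """Log-odds shift implied by a pure base-rate change.

    Assumption: within each class, features are distributed the same way
    in every country, so only the class prior differs.
    """
    return logit(country_rate) - logit(global_rate)

GLOBAL_DNS_RATE = 0.031  # global DNS tampering rate quoted in the post

iran_offset = prior_shift_offset(0.22, GLOBAL_DNS_RATE)      # positive: global scores too low
germany_offset = prior_shift_offset(0.004, GLOBAL_DNS_RATE)  # negative: global scores too high
print(f"Iran: {iran_offset:+.2f}, Germany: {germany_offset:+.2f}")
```

The signs match the argument above: positive for Iran, negative for Germany. The intercepts Voidly actually fits come from real holdout data and need not match this idealized figure.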

Platt scaling corrects for this by fitting a logistic regression on the model's raw log-odds output using per-country holdout data:

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

def fit_platt_scaling(
    model_logits: np.ndarray,  # raw XGBoost log-odds output (n_samples,)
    labels: np.ndarray,        # binary labels (n_samples,)
) -> tuple[float, float]:
    """
    Fit Platt scaling parameters (A, B) such that
    P(y=1 | f) = sigmoid(A * f + B)
    where f is the model's log-odds score.
    """
    def neg_log_likelihood(params):
        A, B = params
        p = expit(A * model_logits + B)
        # Clip away from 0 and 1 so neither log term can hit log(0)
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

    # Start from the identity mapping (A=1, B=0), i.e. no recalibration
    result = minimize(neg_log_likelihood, x0=[1.0, 0.0], method='L-BFGS-B')
    A, B = result.x
    return float(A), float(B)
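
A quick way to sanity-check a fit like this is to run it on synthetic scores with known parameters. The sketch below does that with scikit-learn's `LogisticRegression` on the one-dimensional logit, which optimizes the same log-likelihood once regularization is made negligible (here via a very large `C`). The synthetic data, the seed, and the name `fit_platt_sklearn` are illustrative assumptions, not part of Voidly's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt_sklearn(model_logits: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """Fit P(y=1 | f) = sigmoid(A * f + B) with near-unregularized logistic regression."""
    clf = LogisticRegression(C=1e6, solver="lbfgs", max_iter=1000)
    clf.fit(model_logits.reshape(-1, 1), labels)
    return float(clf.coef_[0, 0]), float(clf.intercept_[0])

# Synthetic holdout: labels are drawn from a known (A, B) so the fit can be checked.
rng = np.random.default_rng(0)
true_A, true_B = 1.5, -0.5
logits = rng.normal(0.0, 2.0, size=20_000)
p_true = 1.0 / (1.0 + np.exp(-(true_A * logits + true_B)))
labels = (rng.random(20_000) < p_true).astype(int)

A_hat, B_hat = fit_platt_sklearn(logits, labels)
print(f"A = {A_hat:.2f}, B = {B_hat:.2f}")
```

With this many samples the recovered parameters should land close to the true ones; a large gap would point to a bug in the fitting code or to too few positives in the window.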

The calibration is fitted separately for each country × interference class combination on a rolling 30-day holdout window. The 30-day window is long enough to capture stable calibration patterns but short enough to adapt when a country's censorship methodology changes — which happens when governments switch from DNS-based to DPI-based blocking, or when new ISPs are brought under government-mandated blocking regimes.

Per-country holdout split

For the 50 countries with the most measurements (≥5,000 labeled examples in the training set), we fit separate Platt parameters. For the remaining 150 countries, we use a regional grouping: MENA countries share calibration parameters, sub-Saharan African countries share parameters, and so on. The global catch-all is used only for countries with fewer than 500 labeled examples in the holdout window.

class CalibrationRegistry:
    def __init__(self):
        # country_code -> interference_class -> (A, B)
        self._params: dict[str, dict[str, tuple[float, float]]] = {}
        self._regional_params: dict[str, dict[str, tuple[float, float]]] = {}
        self._global_params: dict[str, tuple[float, float]] = {}

    def get_params(self, country: str, iclass: str) -> tuple[float, float]:
        # 1. Country-specific fit, if one exists for this class
        if iclass in self._params.get(country, {}):
            return self._params[country][iclass]
        # 2. Regional fit; fall through to global if the region lacks this class
        region = COUNTRY_TO_REGION.get(country)
        if region and iclass in self._regional_params.get(region, {}):
            return self._regional_params[region][iclass]
        # 3. Global fit, else the identity mapping (A=1, B=0)
        return self._global_params.get(iclass, (1.0, 0.0))

    def calibrate(self, logit: float, country: str, iclass: str) -> float:
        A, B = self.get_params(country, iclass)
        return float(1.0 / (1.0 + np.exp(-(A * logit + B))))

The time-based split prevents leakage: the calibration window uses only data from the 30-day period immediately preceding the most recent model retrain, so the calibration sees the same label distribution the model will encounter in production.
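
A minimal sketch of that window selection, assuming measurements live in a pandas DataFrame with a timestamp column. The column name `measured_at` and the function name `select_holdout_window` are hypothetical; only the 30-day length and the "immediately preceding the retrain" rule come from the description above.

```python
import pandas as pd

def select_holdout_window(
    df: pd.DataFrame, retrain_time: pd.Timestamp, days: int = 30
) -> pd.DataFrame:
    """Rows measured in the `days` before `retrain_time` (start inclusive, end exclusive)."""
    start = retrain_time - pd.Timedelta(days=days)
    mask = (df["measured_at"] >= start) & (df["measured_at"] < retrain_time)
    return df.loc[mask]

# Toy data: one measurement per day for 60 days.
measurements = pd.DataFrame({
    "measured_at": pd.date_range("2025-04-01", periods=60, freq="D"),
    "vantage_country": ["IR", "CN"] * 30,
})
retrain = pd.Timestamp("2025-05-31")
holdout = select_holdout_window(measurements, retrain)
print(len(holdout), holdout["measured_at"].min(), holdout["measured_at"].max())
```

Making the end of the window exclusive keeps any measurement stamped at the retrain time itself out of the calibration set.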

Threshold tuning by class and country

After calibration, a decision threshold converts the probability into a binary classification. The threshold is tuned per class per country on the holdout using the F-beta score, where beta is set per class based on the cost asymmetry between false positives and false negatives:

BETA_BY_CLASS = {
    'dns_tamper': 2.0,       # recall-weighted: missing censorship worse than false alarm
    'tls_interference': 2.0, # same
    'http_blocking': 2.0,    # same
    'bgp_withdrawal': 1.5,   # less recall-weighted than 2.0 (still favors recall): BGP signal is high-stakes
    'throttling': 1.0,       # balanced: throttling is lower-stakes to miss
}

def tune_threshold(probs: np.ndarray, labels: np.ndarray, beta: float) -> float:
    """Find the probability threshold that maximizes F_beta on holdout."""
    from sklearn.metrics import fbeta_score
    best_threshold, best_score = 0.5, 0.0
    # Integer grid (0.05 .. 0.94) avoids float drift from np.arange(0.05, 0.95, 0.01)
    for t in np.arange(5, 95) / 100.0:
        preds = (probs >= t).astype(int)
        score = fbeta_score(labels, preds, beta=beta, zero_division=0)
        if score > best_score:  # strict '>' keeps the lowest threshold on ties
            best_score = score
            best_threshold = float(t)
    return best_threshold

The recall-weighted F2 score for most classes reflects Voidly's optimization target: it is worse to miss a censorship event than to over-report one, because cross-source corroboration (OONI, CensoredPlanet, IODA) filters false positives before publication, but no corroboration source can recover a measurement that was dropped at the classification threshold.
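
The effect of beta is easiest to see on two operating points that swap precision and recall. The sketch below uses the standard definition F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); the precision and recall values are made up for illustration.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta: beta > 1 weights recall more heavily than precision."""
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denom

# Two hypothetical thresholds with mirrored precision/recall.
high_recall = {"precision": 0.5, "recall": 0.9}
high_precision = {"precision": 0.9, "recall": 0.5}

for beta in (1.0, 2.0):
    a = f_beta(beta=beta, **high_recall)
    b = f_beta(beta=beta, **high_precision)
    print(f"beta={beta}: high-recall={a:.3f}, high-precision={b:.3f}")
```

At beta = 1 the two points tie; at beta = 2 the high-recall point scores higher, which is why F2 tuning tends to settle on lower thresholds than F1 tuning would.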

Country-specific calibration parameters

The following table shows the current Platt scaling parameters (slope A and intercept B) and tuned thresholds for the four most-measured countries. A negative B pushes the calibrated probability down (conservative: the country's raw measurements have more noise), while a positive B pushes it up (aggressive: the country has a very consistent blocking signature).

Country / class	Platt A	Platt B	Threshold	Rationale
Iran / dns_tamper	0.94	+0.18	0.62	Single DNS authority; blocking very consistent
Iran / http_blocking	1.02	−0.09	0.71	HTTPS errors common from TLS inspection infrastructure
China / dns_tamper	1.11	−0.22	0.74	CDN split-horizon causes legitimate multi-IP responses
China / http_blocking	0.98	−0.14	0.68	GFW packet injection well-distinguished by response timing
Russia / dns_tamper	0.97	+0.04	0.67	Heterogeneous ISPs; some are stricter than others
Russia / tls_interference	1.06	+0.11	0.64	TSPU DPI gives distinctive TLS fingerprint
Turkey / dns_tamper	0.99	+0.08	0.65	BTK block orders produce consistent block-page responses
Turkey / http_blocking	1.03	+0.15	0.61	2,300+ known Turkish block pages; high fingerprint coverage

Iran's dns_tamper threshold of 0.62 is the lowest dns_tamper threshold of the four countries: a calibrated probability of 0.62 or higher is enough for the measurement to be classified as DNS tampering. This reflects the extremely consistent signal from NIMA (National Internet Exchange), Iran's single international transit chokepoint: when a domain is blocked, every Iranian probe gets the same poisoned DNS response pointing to the same IP (usually 10.10.34.35 or a similar RFC 1918 address). The false positive rate at threshold 0.62 for Iran is 1.3% on the held-out evaluation set, which is acceptable given the high recall.

China's dns_tamper threshold of 0.74 is the highest in the table. The GFW's DNS injection returns responses containing many different fake IP addresses (there are over 150 known GFW injection IPs), but domains served through CDNs such as Akamai, CloudFront, and Fastly also legitimately return different IPs from different vantage points. At threshold 0.74, the false positive rate for Chinese DNS tampering is 2.1%, primarily from legitimate CDN split-horizon resolutions that the model classifies as suspicious. At threshold 0.68 (the global default), the FP rate for China rises to 6.8%, which would generate hundreds of spurious daily alerts.
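
To make the Iran/China contrast concrete, the sketch below pushes the same raw score of 0.65 through the dns_tamper rows of the table. It assumes the raw probability is the sigmoid of the model's log-odds, so the logit can be recovered as log(p / (1 - p)); the function and dictionary names are illustrative.

```python
import math

# (A, B, threshold) for dns_tamper, copied from the table above.
DNS_TAMPER_PARAMS = {
    "IR": (0.94, 0.18, 0.62),
    "CN": (1.11, -0.22, 0.74),
}

def calibrate_and_decide(raw_prob: float, country: str) -> tuple[float, bool]:
    """Apply Platt scaling to a raw probability and compare with the country threshold."""
    A, B, threshold = DNS_TAMPER_PARAMS[country]
    logit = math.log(raw_prob / (1.0 - raw_prob))  # assumes raw_prob = sigmoid(logit)
    calibrated = 1.0 / (1.0 + math.exp(-(A * logit + B)))
    return calibrated, calibrated >= threshold

for country in ("IR", "CN"):
    prob, flagged = calibrate_and_decide(0.65, country)
    print(f"{country}: calibrated={prob:.3f}, flagged={flagged}")
```

Under these parameters the Iranian measurement clears its threshold and the Chinese one does not, even though the raw score is identical.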

Reliability score and calibration confidence

Each calibration fit is accompanied by a calibration_reliability score (0.0–1.0) derived from the Brier score on the holdout window. A reliability score below 0.7 indicates that the calibrated probabilities should not be trusted as absolute values: they are still useful for ranking measurements by severity, but the threshold may be unreliable.

def calibration_reliability(
    probs_calibrated: np.ndarray,
    labels: np.ndarray,
) -> float:
    """
    Returns 1 - (Brier score / baseline Brier score), i.e. the Brier skill score,
    clipped at 0. 1.0 = perfect predictions; 0.0 = no better than predicting the mean.
    """
    brier = np.mean((probs_calibrated - labels) ** 2)
    baseline = np.mean((labels.mean() - labels) ** 2)  # always predict the mean
    if baseline == 0:
        return 1.0
    return float(max(0.0, 1.0 - brier / baseline))

Countries with fewer than 200 labeled examples in the holdout window get a reliability score of 0.0 regardless of the computed Brier score, because the sample is too small to trust. The calibration_reliability field is included in every inference API response, so downstream consumers can decide how much weight to give the classifier output for a given country.
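
The sample-size rule can be layered on top of the Brier-based score as a thin wrapper. This is a sketch, not Voidly's implementation: the names `gated_reliability` and `trust_as_probability` are hypothetical, while the 200-example floor and the 0.7 cutoff are the values stated above.

```python
import numpy as np

MIN_HOLDOUT_EXAMPLES = 200   # below this, reliability is forced to 0.0
TRUST_CUTOFF = 0.7           # below this, treat probabilities as rankings only

def gated_reliability(probs_calibrated: np.ndarray, labels: np.ndarray) -> float:
    """Brier skill score versus the base-rate predictor, zeroed for small samples."""
    if len(labels) < MIN_HOLDOUT_EXAMPLES:
        return 0.0
    brier = np.mean((probs_calibrated - labels) ** 2)
    baseline = np.mean((labels.mean() - labels) ** 2)
    if baseline == 0:
        return 1.0
    return float(max(0.0, 1.0 - brier / baseline))

def trust_as_probability(reliability: float) -> bool:
    """True when the calibrated value can be read as an absolute probability."""
    return reliability >= TRUST_CUTOFF

# Toy check: the same confident, correct predictions on 400 examples vs. 100.
labels = np.array([0, 1] * 200)
probs = np.where(labels == 1, 0.95, 0.05)
print(gated_reliability(probs, labels), gated_reliability(probs[:100], labels[:100]))
```

A consumer of the API could apply the same cutoff on its side: use the thresholded decision when the score is at least 0.7, and fall back to ranking by probability otherwise.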

Monthly calibration pipeline

Calibration parameters are updated monthly as part of the model release cycle. The pipeline runs after each monthly retrain:

# calibration_pipeline.py (simplified)
def run_calibration_pipeline(model, holdout_df, registry):
    countries = holdout_df['vantage_country'].value_counts()
    for country, count in countries.items():
        if count < 200:
            continue  # insufficient data; fall through to regional/global
        subset = holdout_df[holdout_df['vantage_country'] == country]
        # Score the subset once; one margin column per interference class
        margins = model.predict(subset[FEATURES].values, output_margin=True)
        for iclass in INTERFERENCE_CLASSES:
            labels = subset[f'label_{iclass}'].values
            if labels.sum() < 20:
                continue  # too few positive examples to calibrate
            logits = margins[:, ICLASS_IDX[iclass]]
            A, B = fit_platt_scaling(logits, labels)
            threshold = tune_threshold(
                probs=1.0 / (1.0 + np.exp(-(A * logits + B))),
                labels=labels,
                beta=BETA_BY_CLASS[iclass],
            )
            registry.update(country, iclass, A, B, threshold)
            log_calibration_metrics(country, iclass, A, B, threshold, labels, logits)

When a country's calibration parameters shift by more than 0.15 in B or more than 0.08 in the threshold between consecutive monthly runs, an alert is raised for manual review. Such shifts typically indicate a change in the country's censorship methodology — a new blocking campaign, a change in ISP behavior, or an influx of new probe measurements from a different ASN that has a different measurement profile.
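
The review rule reduces to two absolute-difference checks. A minimal sketch, assuming each month's fit is stored as a small record: the `CalibrationFit` dataclass and the `needs_review` function are hypothetical names, and the `stable` and `shifted` values are invented (only `last_month` reuses Iran's dns_tamper row). The 0.15 and 0.08 limits are the ones given above.

```python
from dataclasses import dataclass

MAX_B_SHIFT = 0.15
MAX_THRESHOLD_SHIFT = 0.08

@dataclass(frozen=True)
class CalibrationFit:
    A: float
    B: float
    threshold: float

def needs_review(previous: CalibrationFit, current: CalibrationFit) -> bool:
    """Flag a month-over-month shift in B or threshold that exceeds the limits."""
    b_shift = abs(current.B - previous.B)
    t_shift = abs(current.threshold - previous.threshold)
    return b_shift > MAX_B_SHIFT or t_shift > MAX_THRESHOLD_SHIFT

last_month = CalibrationFit(A=0.94, B=0.18, threshold=0.62)
stable = CalibrationFit(A=0.95, B=0.21, threshold=0.63)   # small drift, no alert
shifted = CalibrationFit(A=0.90, B=0.40, threshold=0.64)  # B moved by 0.22
print(needs_review(last_month, stable), needs_review(last_month, shifted))
```

The slope A is deliberately left out of the check here because the rule above names only B and the threshold.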

For the anomaly classifier architecture — five binary models, feature extraction, and why Voidly optimizes for recall: The Voidly anomaly classifier: five interference classes, gradient boosted trees, and why we optimize for recall →

For how the labeled training set that feeds calibration is built and maintained: Voidly's active learning loop: growing the anomaly training set with human-in-the-loop annotation →

For how the calibrated classifier output is served as a live inference API at 50ms: Voidly's real-time inference API: classifying censorship measurements at 50ms →

For how per-country calibration parameters contribute to the country-level censorship index: Voidly's country-level censorship score: aggregating 2.2B probe measurements into the global index →

For the weekly retraining pipeline that produces the model versions this calibration step is applied to — rolling 6-month splits, PSI drift detection, and champion/challenger promotion: Voidly's anomaly classifier retraining pipeline: temporal splits, champion/challenger promotion, and drift detection →