Voidly's Active Learning Loop: Growing the Anomaly Training Set with Human-in-the-Loop Annotation

November 27, 2024 · 8 min read · AI Analytics

ML · Voidly · Active Learning · Annotation

The Voidly ML training data pipeline post described how we bootstrapped a labeled dataset from OONI's measurement archive. That gave us 127K labeled examples: enough to train the initial anomaly classifier, but not enough to handle the long tail of interference patterns such as ISP-specific certificate substitution signatures, country-specific DNS injection fingerprints, and interference methods that postdate the training window. This post describes the active learning loop we built to grow the training set efficiently without burning annotation budget on examples the model already handles well.

Bootstrap labels: OONI agreement flags

OONI's measurement records include a measurement_start_time, a test_name, and a confirmed boolean set by OONI's own pipeline when a block page is matched against their known signature library. We treat confirmed=true as a high-confidence blocked label and measurements where all control comparisons pass as normal. The middle tier — anomaly=true, confirmed=false — is unlabeled at bootstrap.
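The three-tier rule above can be written as a small function. This is a minimal sketch, not the pipeline's actual code: the function name bootstrap_label is hypothetical, and it assumes that anomaly=false is an acceptable stand-in for "all control comparisons pass".

```python
from typing import Optional


def bootstrap_label(confirmed: bool, anomaly: bool) -> Optional[int]:
    """Map OONI flags to a bootstrap label: 1 = blocked, 0 = normal, None = unlabeled."""
    if confirmed:
        # Block page matched OONI's known signature library
        return 1
    if not anomaly:
        # Assumption: anomaly=False stands in for "all control comparisons pass"
        return 0
    # anomaly=true, confirmed=false: the middle tier, left unlabeled at bootstrap
    return None


# Hypothetical records, one per tier
records = [
    {"confirmed": True, "anomaly": True},
    {"confirmed": False, "anomaly": False},
    {"confirmed": False, "anomaly": True},
]
labels = [bootstrap_label(r["confirmed"], r["anomaly"]) for r in records]
print(labels)
```

Only the last tier feeds the unlabeled pool that the rest of this post draws from.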

From 1.66M OONI measurements (the historical corpus), bootstrap labels break down as: 82K confirmed-blocked, 45K confirmed-normal, and 1.49M unlabeled anomalies. The 82K + 45K = 127K labeled set has a 65/35 blocked-to-normal split, which is reasonably balanced for gradient boosted tree training.

One problem with OONI-derived labels: confirmed=true means the block page matched OONI's library, but OONI's library covers only a subset of known block page signatures. Novel or rare block pages produce confirmed=false even when genuine censorship occurred. The active learning loop is partly designed to surface these edge cases for human review.

Uncertainty sampling: which examples to annotate

The core idea of uncertainty sampling: take the current model, score every unlabeled example, and annotate the examples the model is most uncertain about. A model that is already confident about an example will not learn much from annotating it — that annotation budget is wasted.

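We score uncertainty with the least-confidence heuristic: for an XGBoost binary classifier predicting P(blocked), uncertainty is 1 − |2P − 1|, which is the least-confidence score 1 − max(P, 1 − P) rescaled to the range [0, 1]. An example where P = 0.5 has uncertainty 1.0 (the maximum); one where P = 0.05 or P = 0.95 has uncertainty 0.1, close to the minimum of 0 reached at P = 0 or P = 1.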
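As a quick numeric check of the formula, the sketch below evaluates it at a few probabilities and confirms that it equals twice the textbook least-confidence score 1 − max(P, 1 − P). The helper names are illustrative only.

```python
def uncertainty(p_blocked: float) -> float:
    """Uncertainty score from the text: 1 - |2P - 1|, ranging over [0, 1]."""
    return 1.0 - abs(2.0 * p_blocked - 1.0)


def least_confidence(p_blocked: float) -> float:
    """Textbook least-confidence score for a binary classifier, ranging over [0, 0.5]."""
    return 1.0 - max(p_blocked, 1.0 - p_blocked)


for p in (0.0, 0.05, 0.5, 0.95, 1.0):
    u = uncertainty(p)
    # The two scores rank examples identically; they differ only by a factor of 2
    assert abs(u - 2.0 * least_confidence(p)) < 1e-12
    print(f"P(blocked)={p:.2f}  uncertainty={u:.2f}")
```

Because the two scores differ only by a constant factor, sorting the pool by either one selects the same annotation batch.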

# Weekly uncertainty scoring over the unlabeled pool
import pandas as pd
import xgboost as xgb

def score_uncertainty(model: xgb.Booster,
                      unlabeled_df: pd.DataFrame,
                      feature_cols: list[str]) -> pd.DataFrame:
    """
    Returns the unlabeled pool sorted by uncertainty (descending).
    Uncertainty = 1 - |2 * P(blocked) - 1|
    """
    dmatrix = xgb.DMatrix(unlabeled_df[feature_cols])
    p_blocked = model.predict(dmatrix)  # shape (n,), probability of blocked class

    unlabeled_df = unlabeled_df.copy()
    unlabeled_df['p_blocked']   = p_blocked
    unlabeled_df['uncertainty'] = 1.0 - abs(2.0 * p_blocked - 1.0)

    return unlabeled_df.sort_values('uncertainty', ascending=False)


def select_annotation_batch(scored_df: pd.DataFrame,
                             budget: int = 500) -> pd.DataFrame:
    """
    Select top-budget examples by uncertainty. Apply diversity filter:
    no more than 20 examples per (country, test_protocol) cell to avoid
    over-annotating one region's dominant interference type.
    """
    selected = []
    cell_counts: dict[tuple, int] = {}

    for _, row in scored_df.iterrows():
        cell = (row['vantage_country'], row['test_protocol'])
        if cell_counts.get(cell, 0) < 20:
            selected.append(row)
            cell_counts[cell] = cell_counts.get(cell, 0) + 1
        if len(selected) >= budget:
            break

    return pd.DataFrame(selected)

Each week, 500 examples are selected from the ~1.49M unlabeled pool. The diversity filter prevents the batch from being dominated by a single country or test protocol — without it, Russia + HTTPS would crowd out smaller countries with unusual blocking methods.
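The same per-cell cap can also be expressed without a Python-level loop. The following is a sketch of an alternative, not the production code: select_annotation_batch_vectorized and per_cell_cap are hypothetical names, and the input is assumed to be already sorted by uncertainty in descending order.

```python
import pandas as pd


def select_annotation_batch_vectorized(scored_df: pd.DataFrame,
                                       budget: int = 500,
                                       per_cell_cap: int = 20) -> pd.DataFrame:
    """Keep at most per_cell_cap rows per (country, protocol) cell, then take the top budget rows."""
    # cumcount() numbers the rows within each cell in their existing order,
    # so rank 0 is the most uncertain example of that cell.
    rank_in_cell = scored_df.groupby(
        ["vantage_country", "test_protocol"], dropna=False
    ).cumcount()
    return scored_df[rank_in_cell < per_cell_cap].head(budget)


# Tiny illustration: a cap of 2 per cell lets the TM/dns row into a batch of 3
demo = pd.DataFrame({
    "vantage_country": ["RU", "RU", "RU", "TM"],
    "test_protocol": ["https", "https", "https", "dns"],
    "uncertainty": [0.99, 0.98, 0.97, 0.90],
})
batch = select_annotation_batch_vectorized(demo, budget=3, per_cell_cap=2)
print(batch)
```

In the toy input, the third RU/https row is skipped even though it is more uncertain than the TM/dns row, which is the crowding-out effect the filter is meant to prevent.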

Annotation interface

Annotators use a web interface that presents each measurement as a three-panel view:

Left panel — vantage measurement: DNS query and response, TCP connection result, TLS handshake outcome (certificate chain, alert code if any), HTTP response (status, headers, body hash, block page match score).
Middle panel — control comparison: the same test run from a Voidly control server outside the suspected censorship jurisdiction, shown alongside diff-highlighted deltas from the vantage measurement.
Right panel — context: OONI's original verdict for this measurement (if available), the model's current prediction and confidence, the top-3 SHAP features driving the prediction, and any other known incidents for this domain × country in the past 14 days.

Annotators choose one of four labels: Blocked, Likely blocked, Ambiguous, or Not blocked. A 1–3 sentence free-text rationale is mandatory for “Ambiguous” labels. Rationales feed a separate weekly error analysis review.
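The rationale rule is the kind of constraint that is easy to enforce at the data-model level. Here is a minimal sketch assuming a hypothetical Annotation record; the class and the measurement_id field are illustrative and not taken from the Voidly codebase.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    BLOCKED = "Blocked"
    LIKELY_BLOCKED = "Likely blocked"
    AMBIGUOUS = "Ambiguous"
    NOT_BLOCKED = "Not blocked"


@dataclass(frozen=True)
class Annotation:
    measurement_id: str   # hypothetical identifier for the measurement
    annotator_id: str
    label: Label
    rationale: str = ""

    def __post_init__(self) -> None:
        # An Ambiguous label must be accompanied by a free-text rationale
        if self.label is Label.AMBIGUOUS and not self.rationale.strip():
            raise ValueError("Ambiguous labels require a free-text rationale")


ok = Annotation("m-001", "ann-1", Label.AMBIGUOUS,
                "Control server timed out; cannot compare bodies.")
print(ok.label.value)
```

Rejecting the record at construction time means an Ambiguous label without a rationale never reaches the weekly error analysis.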

Inter-annotator agreement: Cohen's kappa

Every example in the batch is independently labeled by three annotators. We compute Cohen's kappa for each of the three annotator pairs. Examples from a batch are accepted into the training set only if all three pairwise kappas are at least 0.82 and the example has a clear majority label.

from collections import Counter
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

KAPPA_THRESHOLD = 0.82

def compute_batch_agreement(
    labels: dict[str, list[int]]   # annotator_id → list of label ints (0=normal, 1=blocked, 2=ambiguous)
) -> tuple[float, bool]:
    """
    Returns (mean_kappa, all_pairs_above_threshold).
    Ambiguous (label=2) examples are excluded from kappa calculation
    but tracked separately as 'ambiguous_rate' (not computed here).
    """
    annotator_ids = list(labels.keys())
    kappas = []

    for a1, a2 in combinations(annotator_ids, 2):
        # Keep only examples where neither annotator said Ambiguous
        pairs = [(l1, l2) for l1, l2 in zip(labels[a1], labels[a2])
                 if l1 != 2 and l2 != 2]
        if not pairs:
            continue
        y1, y2 = zip(*pairs)
        kappas.append(cohen_kappa_score(y1, y2))

    if not kappas:
        # No comparable pairs at all: fail the check instead of letting all([]) pass
        return 0.0, False
    return float(np.mean(kappas)), all(k >= KAPPA_THRESHOLD for k in kappas)


def merge_labels(labels: dict[str, list[int]]) -> list[int | None]:
    """
    Majority vote across three annotators.
    Returns None for examples with no clear majority (e.g. 1-1-1 split).
    """
    merged = []
    for votes in zip(*labels.values()):
        cnt = Counter(votes)
        top_label, top_count = cnt.most_common(1)[0]
        merged.append(top_label if top_count >= 2 else None)
    return merged

In practice, kappa across the non-ambiguous subset averages 0.87 — annotators agree on clear-cut blocked (high DNS-over-TLS anomaly, known block page hash) and clear-cut normal. The 8–12% of examples that fall below threshold tend to cluster around three patterns: partial HTTP blocking (TCP connects, TLS completes, but body is truncated), CDN-level certificate substitution that looks like censorship but may be a CDN misconfiguration, and measurements where the control server itself had a transient failure.

Disagreement examples with kappa below threshold are routed to a senior reviewer who makes the final call. These reviewer-resolved labels carry a resolution_source: "senior_review" tag in the dataset so downstream users can stratify by label provenance.

Label budget and weekly cadence

We annotate 500 examples per week with three annotators each, for a total of 1,500 annotation events per week. At an average of 3.5 minutes per annotation (including writing the rationale field where one is needed), that is ~87 person-hours per week. Annotation is handled by a rotating pool of 12 trained researchers who each contribute 7–8 hours of annotation per week. Annotators are trained on a gold standard set of 200 examples before their first live batch.
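The budget arithmetic is easy to verify. The sketch below reads the 3.5 minutes as applying to each annotation event, which is the reading that reproduces the ~87 person-hours figure; the constant names are illustrative.

```python
EXAMPLES_PER_WEEK = 500
ANNOTATORS_PER_EXAMPLE = 3
MINUTES_PER_ANNOTATION = 3.5
POOL_SIZE = 12

annotation_events = EXAMPLES_PER_WEEK * ANNOTATORS_PER_EXAMPLE   # 1500
person_hours = annotation_events * MINUTES_PER_ANNOTATION / 60   # 87.5
hours_per_annotator = person_hours / POOL_SIZE                   # between 7 and 8

print(f"{annotation_events} events, {person_hours:.1f} person-hours, "
      f"{hours_per_annotator:.2f} hours per annotator")
```

The per-annotator load lands inside the 7–8 hour range quoted for the pool of 12.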

Of the 500 weekly examples, on average:

412 accepted into the training set (kappa ≥ 0.82, clear majority label)
63 sent to senior review (kappa below threshold or three-way split)
25 discarded (all three annotators said Ambiguous)
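The three outcomes above can be sketched as a routing function over one example's three votes. This is an illustration under assumptions rather than the actual pipeline: the function name and the batch_kappa_ok flag are hypothetical, and sending a majority-Ambiguous example to senior review is our guess, since the text only specifies the unanimous-Ambiguous case.

```python
from collections import Counter
from typing import Literal

AMBIGUOUS = 2  # same integer encoding as the agreement code: 0=normal, 1=blocked, 2=ambiguous
Route = Literal["accepted", "senior_review", "discarded"]


def route_example(votes: tuple[int, int, int], batch_kappa_ok: bool) -> Route:
    """Decide what happens to one example given its three annotator votes."""
    if all(v == AMBIGUOUS for v in votes):
        return "discarded"                      # all three annotators said Ambiguous
    top_label, top_count = Counter(votes).most_common(1)[0]
    if top_count < 2 or not batch_kappa_ok:
        return "senior_review"                  # three-way split or agreement below threshold
    if top_label == AMBIGUOUS:
        return "senior_review"                  # assumption: a majority-Ambiguous vote is not a usable label
    return "accepted"


print(route_example((1, 1, 0), batch_kappa_ok=True))
print(route_example((0, 1, 2), batch_kappa_ok=True))
print(route_example((2, 2, 2), batch_kappa_ok=True))
```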

After 12 months of weekly batches (52 × 412 ≈ 21,400 accepted labels), the training set grew from the 127K bootstrap examples to approximately 148K labels in total. The added labels raise classifier recall on rare interference types from 0.71 to 0.77 on the held-out rare-interference test set, a gain of 6 percentage points (roughly 8% in relative terms).

Weekly model retrain and data versioning

Every Monday, a retrain pipeline runs on the current full labeled set. The pipeline is gated on the accumulated label count: retrain only fires if at least 200 new accepted labels have been added since the last retrain (to avoid weekly retrains when annotation pace slows).
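The gate itself is a small predicate. Below is a minimal sketch assuming the scheduler knows the running total of accepted labels and the total at the last retrain; the function and parameter names are hypothetical.

```python
import datetime as dt

MIN_NEW_LABELS = 200  # threshold from the text


def should_retrain(today: dt.date,
                   accepted_total: int,
                   accepted_at_last_retrain: int) -> bool:
    """Retrain only on Mondays, and only if enough new accepted labels have accumulated."""
    is_monday = today.weekday() == 0            # Monday is 0 in Python's weekday()
    new_labels = accepted_total - accepted_at_last_retrain
    return is_monday and new_labels >= MIN_NEW_LABELS


# 2024-11-25 was a Monday
print(should_retrain(dt.date(2024, 11, 25), accepted_total=21_400, accepted_at_last_retrain=21_000))
print(should_retrain(dt.date(2024, 11, 25), accepted_total=21_400, accepted_at_last_retrain=21_300))
```

When the gate does not fire, the new labels simply carry over and count toward the next Monday's check.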

# DVC pipeline: dvc.yaml (simplified)
stages:
  prepare_features:
    cmd: python scripts/prepare_features.py
    deps:
      - data/labels/accepted_labels.parquet   # appended weekly
      - data/measurements/raw/
    outs:
      - data/features/train.parquet
      - data/features/test.parquet

  train_classifier:
    cmd: python scripts/train_xgboost.py
    deps:
      - data/features/train.parquet
    params:
      - params.yaml:
          - xgboost.n_estimators
          - xgboost.max_depth
          - xgboost.learning_rate
          - xgboost.scale_pos_weight    # handles class imbalance
    outs:
      - models/anomaly_classifier.ubj   # XGBoost binary format
    # metrics/eval.json is declared once, in the evaluate stage below;
    # DVC rejects the same output path in two stages

  evaluate:
    cmd: python scripts/evaluate.py
    deps:
      - models/anomaly_classifier.ubj
      - data/features/test.parquet
    metrics:
      - metrics/eval.json:
          cache: false

DVC tracks each version of the training data, the model binary, and the evaluation metrics together. Every production model is tagged with the Git commit hash that pins the DVC-tracked labeled dataset it was trained on, so any inference result can be traced back to the exact annotation batch that added the relevant labels. This is important for corrections: if a labeling error is discovered, we can identify which model versions were trained on the bad label and retract their outputs from the verified_incident tier.
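To illustrate the correction workflow, here is a sketch of the lookup from a bad label to the affected model versions. Everything in it is hypothetical: the ModelRecord structure, the tags, and the commit strings are invented for the example, and it assumes labels are only ever appended, so a label from batch b is present in every model trained on data through batch b or later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRecord:
    model_tag: str            # hypothetical production tag
    data_commit: str          # commit hash pinning the labeled dataset version
    last_batch_included: int  # index of the newest weekly batch in the training data


def models_trained_on(batch_id: int, registry: list[ModelRecord]) -> list[str]:
    """Return tags of models whose training data includes the given annotation batch."""
    return [m.model_tag for m in registry if m.last_batch_included >= batch_id]


registry = [
    ModelRecord("clf-w10", "a1b2c3d", last_batch_included=10),
    ModelRecord("clf-w11", "d4e5f6a", last_batch_included=11),
    ModelRecord("clf-w12", "0f9e8d7", last_batch_included=12),
]
# A labeling error found in batch 11 affects every model trained from week 11 onward
print(models_trained_on(11, registry))
```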

Drift detection and labeling criteria review

The feature distribution of unlabeled anomalies shifts over time as new interference methods emerge. We track this with Population Stability Index (PSI) computed monthly against the baseline distribution from the bootstrap training set.

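import numpy as np

def population_stability_index(expected: np.ndarray,
                               actual: np.ndarray,
                               n_bins: int = 10) -> float:
    """
    PSI between a baseline sample (expected) and a recent sample (actual).

    PSI < 0.1:   stable, no action needed
    PSI 0.1–0.2: some shift, monitor
    PSI > 0.2:   significant shift, review labeling criteria
    """
    eps = 1e-8  # keeps empty bins from producing log(0) or division by zero
    # Bin edges are percentiles of the baseline distribution
    bins = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    # Fold out-of-range values into the end bins; np.histogram would
    # otherwise drop them silently and understate the shift
    actual = np.clip(actual, bins[0], bins[-1])

    e_pct = np.histogram(expected, bins=bins)[0] / len(expected) + eps
    a_pct = np.histogram(actual,   bins=bins)[0] / len(actual)   + eps

    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))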

When PSI exceeds 0.2 on any of the top-5 features (tcp_connect_ms, dns_response_ip_count, tls_alert_code, http_body_length_ratio, bgp_withdrawal_count_24h), we convene a labeling criteria review. The review examines whether the annotation interface needs to be updated to reflect the new interference pattern — for example, when Iranian ISPs switched from DNS injection to HTTPS certificate substitution in early 2024, the criteria for labeling a TLS-only anomaly as “blocked” had to be tightened to avoid false positives from CDN certificate rotations.
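The review trigger can be sketched as a loop over the monitored features. This is a self-contained illustration with synthetic data: it carries its own compact psi helper (clipping out-of-range values into the end bins), and the function name features_needing_review and the dict-of-arrays inputs are assumptions.

```python
import numpy as np

PSI_REVIEW_THRESHOLD = 0.2


def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10, eps: float = 1e-8) -> float:
    """Population Stability Index with bin edges taken from the baseline percentiles."""
    bins = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    actual = np.clip(actual, bins[0], bins[-1])   # out-of-range values land in the end bins
    e_pct = np.histogram(expected, bins=bins)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=bins)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


def features_needing_review(baseline: dict[str, np.ndarray],
                            current: dict[str, np.ndarray],
                            threshold: float = PSI_REVIEW_THRESHOLD) -> list[str]:
    """Names of features whose PSI against the baseline exceeds the review threshold."""
    return [name for name in baseline if psi(baseline[name], current[name]) > threshold]


# Synthetic data: tcp_connect_ms shifts upward, http_body_length_ratio stays put
rng = np.random.default_rng(0)
baseline = {
    "tcp_connect_ms": rng.normal(100.0, 10.0, 5000),
    "http_body_length_ratio": rng.normal(1.0, 0.1, 5000),
}
current = {
    "tcp_connect_ms": rng.normal(130.0, 10.0, 5000),
    "http_body_length_ratio": baseline["http_body_length_ratio"].copy(),
}
print(features_needing_review(baseline, current))
```

A non-empty result corresponds to the condition in the text, PSI above 0.2 on at least one monitored feature, and would prompt a labeling criteria review.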

For the initial training data pipeline that produced the 127K bootstrap labels: Voidly's ML training pipeline: building a labeled censorship dataset from OONI measurements →

For the XGBoost classifier that these labels train — feature engineering, recall optimization, and inference serving: The Voidly Anomaly Classifier: five interference classes, gradient boosted trees, and why we optimize for recall →

For the confidence tier system that determines when a classifier prediction becomes a verified incident: From anomaly to verified incident: the Voidly confidence tier system →

For how the classifier trained on these annotations is evaluated offline — AUC-PR, F2 scoring, ECE calibration, and country case studies: Offline evaluation for the Voidly anomaly classifier: AUC-PR, F2, ECE calibration, and country case studies →