
Voidly's real-time inference API: classifying censorship measurements at 50ms

May 28, 2025 · 7 min read · AI Analytics

Tags: Censorship, Voidly, ML, Infrastructure

The Voidly anomaly classifier runs in two modes. Offline, it reprocesses the full measurement archive nightly — a several-hour batch job that re-scores historical measurements against updated model versions and refreshed calibration coefficients. Online, it runs in the hot path of the real-time event pipeline, where a probe measurement arrives from the collector and needs a five-class interference verdict before it can be handed to the cross-source reconciler. The two modes share a model, but almost nothing else.

This post covers the online path. The 8-minute probe-to-journalist-alert SLA the real-time pipeline targets leaves the inference service a strict budget: under 50ms from the moment the raw probe measurement JSON is dequeued to the moment the scored record is written back to Cloudflare D1. That budget has to absorb feature extraction, model inference, Platt-scaling calibration, and the D1 write — while leaving headroom for the network and queue latency that fall outside the service's control.

Batch vs. online: the design constraint that matters

In the batch path, latency is irrelevant. The nightly reprocessing job handles roughly 3.4M measurements and still finishes inside a 6-hour window, which works out to about 6ms per measurement even if the work were strictly sequential (21,600s / 3.4M ≈ 6.4ms). The batch path uses Python, pandas, and XGBoost's native serialization format. It can afford imports, deserialization overhead, and garbage collection pauses because none of those costs affect any SLA.

The online path is the opposite. The metric that matters is not mean latency — it's p99. A mean inference time of 1.2ms is irrelevant if the p99 is 40ms, because tail latency is what determines whether a burst of probe arrivals blows the 50ms budget for the inference service or stays inside it. Every design decision in the online path — model format, transport encoding, routing topology, cache strategy — was made with the p99 distribution in mind, not the mean.
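To make the mean-versus-p99 point concrete, here is a minimal sketch on synthetic latencies (the numbers are invented for illustration, not production data). It uses a simple nearest-rank percentile; production monitoring would more likely use a histogram or a streaming estimator.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Synthetic latencies in ms: 980 fast requests and 20 slow ones (2% tail).
latencies_ms = [1.0] * 980 + [40.0] * 20

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.2f}ms p50={p50_ms:.1f}ms p99={p99_ms:.1f}ms")
```

The mean stays under 2ms while the p99 lands on the 40ms tail, which is the value that decides whether a burst fits the budget.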

ONNX export: one portable graph for preprocessing, model, and calibration

The XGBoost model is trained in Python and evaluated offline against labeled measurements in a standard Python environment. But running a Python interpreter in the inference service adds 30–80ms of import overhead per cold start, and CPython's GIL means multi-threaded inference on a 32-vCPU node doesn't scale linearly. The solution is to export the model to ONNX and serve it with ONNX Runtime.

ONNX Runtime eliminates the Python runtime requirement from the inference process entirely. It uses SIMD-vectorized tree traversal (AVX2 on the bare-metal nodes), which pushes through XGBoost's 100-estimator ensemble substantially faster than the Python XGBoost predict path. Crucially, ONNX inference is consistent across CPU architectures — the same graph runs on the US-East, EU-West, and AP-East nodes without platform-specific tuning.

The exported graph is a single ONNX file that includes three stages: the feature preprocessing transforms (one-hot encoding for categorical inputs, z-score normalization for continuous inputs), the XGBoost core model, and the Platt scaling calibration layer. Combining all three into one graph eliminates inter-component serialization and keeps the hot path as a single session.run() call. The global model file is under 12MB.

import os

import onnx
from onnxmltools import convert_xgboost
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# xgb_model, platt_scaler, preprocessor_onnx, train_date, and stitch_pipeline
# are assumed to be defined earlier in the export script.

# Export the XGBoost core model
xgb_onnx = convert_xgboost(
    xgb_model,
    name='VoidlyAnomalyClassifier',
    initial_types=[('probe_features', FloatTensorType([None, 47]))],
)

# Export the Platt scaling layer (sklearn LogisticRegression wrapper)
platt_onnx = convert_sklearn(
    platt_scaler,
    name='PlattCalibration',
    initial_types=[('raw_probs', FloatTensorType([None, 5]))],
)

# Combine into one graph: preprocessing -> xgb -> calibration
# (graph stitching done via onnx.compose.merge_models)
combined = stitch_pipeline(
    preprocessor_onnx,
    xgb_onnx,
    platt_onnx,
)

# Save once, then report the size of the file that was actually written
combined_path = f'models/xgb-global-{train_date}.onnx'
onnx.save(combined, combined_path)
print(f'Exported: {os.path.getsize(combined_path) / 1e6:.1f}MB')

Feature extraction: 47 fields in under 5ms

Before inference can run, the raw probe measurement JSON — delivered from the probe collector via the Cloudflare Queue — must be transformed into the 47-field feature vector the model expects. This extraction step is on the critical path and must complete in under 5ms at p99.

The most latency-sensitive part of feature extraction is the control result lookup. Several features — DNS response IP vs. control IP, body length ratio, HTTP status vs. control status — require knowing what the measurement should look like on an unobstructed connection. That “ground truth” comes from the control server, which runs vantage nodes in uncensored jurisdictions and records the expected response for each domain.

Hitting the control server API on every inference request would add 20–100ms of round-trip latency, violating the budget immediately. Instead, each inference node maintains an in-memory LRU cache keyed on (domain, probe_cc), populated by a background thread that polls the control server API every 60 seconds. The cache holds the most recent control result for each domain-country pair. On a cache miss — which happens for newly added domains or after a cache eviction — the extractor falls back to the most recently known control result from the previous polling cycle. If no prior result exists at all, the control-dependent features are set to None and the measurement may trigger the abstain path described later.
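The cache behavior described above can be sketched as follows. This is a minimal illustration, not the production cache: the class name `ControlCache`, the `refresh` method, and the idea of keeping a snapshot of the previous polling cycle as the fallback are all assumptions made for the example.

```python
from collections import OrderedDict


class ControlCache:
    """LRU cache of control results keyed on (domain, probe_cc), with a
    fallback to whatever was known at the start of the latest polling cycle."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._lru = OrderedDict()
        self._previous_cycle = {}

    def refresh(self, fetched):
        """Called by the background poller with the latest control results."""
        self._previous_cycle = dict(self._lru)  # snapshot before applying updates
        for key, result in fetched.items():
            self._lru[key] = result
            self._lru.move_to_end(key)
            while len(self._lru) > self.max_entries:
                self._lru.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        """Return the control result, the previous-cycle result on a miss, or None."""
        if key in self._lru:
            self._lru.move_to_end(key)
            return self._lru[key]
        return self._previous_cycle.get(key)


cache = ControlCache(max_entries=2)
cache.refresh({("a.example", "IR"): {"dns_ttl": 300}, ("b.example", "IR"): {"dns_ttl": 60}})
cache.refresh({("c.example", "IR"): {"dns_ttl": 30}})  # evicts a.example from the LRU

evicted_hit = cache.get(("a.example", "IR"))   # served from the previous cycle
never_seen = cache.get(("d.example", "IR"))    # no prior result: features become None
```

Nothing in `get` performs I/O, so the lookup stays a dictionary access on the hot path; only the poller thread talks to the control server.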

import time  # used by extract_features for the control result age
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProbeFeatures:
    # DNS features (10 fields)
    dns_failure_type: int          # categorical code: 0=none,1=nxdomain,2=refused,3=timeout,4=other (one-hot flags follow)
    dns_failure_nxdomain: int
    dns_failure_refused: int
    dns_failure_timeout: int
    dns_failure_other: int
    dns_ip_in_expected_asn: Optional[float]
    dns_returned_ip_is_sinkhole: int
    dns_redirect_to_block_page: int
    dns_response_time_z: Optional[float]
    dns_ttl_anomaly: Optional[int]

    # TLS features (8 fields)
    tls_handshake_completed: int
    tls_cert_hash_expected: Optional[int]
    tls_cert_issuer_trusted: int
    tls_sni_alert_type: int        # 0=none, 112=unrecognized_name, 42=bad_cert, etc.
    tls_reset_after_client_hello: int
    tls_handshake_time_z: Optional[float]
    tls_cert_cn_mismatch: int
    tls_unexpected_issuer: int

    # HTTP features (9 fields)
    http_status_category: int      # 0=2xx,1=3xx,2=4xx,3=5xx,4=no_response
    http_is_451: int
    http_body_fingerprint_score: Optional[float]
    http_redirect_count: int
    http_final_domain_changed: int
    http_body_length_ratio: Optional[float]
    http_content_type_mismatch: int
    http_transparent_proxy_detected: int
    http_response_time_z: Optional[float]

    # BGP features (5 fields)
    bgp_origin_as_reachable: int
    bgp_path_length_delta: Optional[float]
    bgp_unique_collectors_visible: int
    bgp_prefix_withdrawn: int
    bgp_as_path_prepending: int

    # Throttling features (7 fields)
    throttle_bandwidth_z: Optional[float]
    throttle_latency_z: Optional[float]
    throttle_neighboring_domains_ok: int
    throttle_time_of_day_bucket: int   # 0-3 (6h buckets)
    throttle_baseline_drift_detected: int
    throttle_udp_vs_tcp_ratio: Optional[float]
    throttle_consecutive_slow_measurements: int

    # Probe context features (8 fields)
    probe_cc_risk_tier: int            # 1-4, derived from country-level censorship index
    probe_asn_historical_block_rate: Optional[float]
    probe_vantage_diversity_score: float
    probe_measurement_hour: int
    probe_measurement_dow: int
    domain_category: int               # 0=news,1=social,2=vpn,3=human_rights,4=other
    domain_prior_block_rate: Optional[float]
    domain_control_result_age_s: Optional[float]


def extract_features(raw_measurement: dict, control_cache: LRUCache) -> ProbeFeatures:
    """
    Transform a raw probe measurement JSON into a ProbeFeatures dataclass.
    Must complete in <5ms at p99. No blocking I/O — all external data
    comes from in-memory cache.
    """
    domain = raw_measurement['domain']
    probe_cc = raw_measurement['probe_cc']

    control = control_cache.get((domain, probe_cc))  # None on miss

    dns = raw_measurement.get('dns', {})
    tls = raw_measurement.get('tls', {})
    http = raw_measurement.get('http', {})
    bgp = raw_measurement.get('bgp', {})

    return ProbeFeatures(
        # DNS
        dns_failure_type=DNS_FAILURE_MAP.get(dns.get('failure'), 0),
        dns_failure_nxdomain=int(dns.get('failure') == 'dns_nxdomain_error'),
        dns_failure_refused=int(dns.get('failure') == 'dns_refused_error'),
        dns_failure_timeout=int(dns.get('failure') == 'generic_timeout_error'),
        dns_failure_other=int(dns.get('failure') not in (None, *KNOWN_DNS_FAILURES)),
        dns_ip_in_expected_asn=check_ip_asn(dns.get('ip'), domain) if control else None,
        dns_returned_ip_is_sinkhole=int(dns.get('ip') in KNOWN_SINKHOLES),
        dns_redirect_to_block_page=int(is_known_block_ip(dns.get('ip'))),
        dns_response_time_z=z_score(dns.get('response_ms'), control, 'dns_ms') if control else None,
        dns_ttl_anomaly=int(abs((dns.get('ttl') or 0) - (control.get('dns_ttl') or 0)) > 3600) if control else None,
        # ... (remaining fields follow same pattern)
        probe_cc_risk_tier=COUNTRY_RISK_TIERS.get(probe_cc, 2),
        probe_vantage_diversity_score=raw_measurement['vantage_diversity_score'],
        domain_control_result_age_s=(time.time() - control['fetched_at']) if control else None,
    )

Inference service architecture

The inference service runs on three regional nodes: US-East (primary), EU-West, and AP-East. AP-East is co-located with the control server cluster serving Asia-Pacific probes, which keeps the control cache warm for high-volume measurement regions. Each node is bare-metal: 32 vCPUs, 128GB RAM, no GPU. XGBoost tree ensembles with ONNX Runtime are CPU-efficient — there is no matrix multiplication workload that would benefit from a GPU, and the elimination of GPU scheduling overhead helps with tail latency.

Probe measurement events arrive at a Cloudflare Worker that reads the probe_cc field from the msgpack-encoded payload and routes the request to the geographically nearest inference node. msgpack is used instead of JSON for the hot-path request encoding because it serializes the 47-field ProbeFeatures struct roughly 40% smaller and 3x faster than equivalent JSON, which matters when the inference service is processing thousands of measurements per second during peak probe windows.

All inference requests are stateless. No session state, no sticky routing, no affinity to a specific node. If the nearest node is unavailable, the Cloudflare Worker retries against US-East within 2 seconds — fast enough to stay inside the pipeline SLA while the unhealthy node is taken out of rotation.
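The routing rule (nearest node first, US-East as the fallback) can be sketched in a few lines. The country-to-node map below is a hypothetical placeholder; the post does not show the real routing table, and the production logic runs in a Cloudflare Worker rather than in Python.

```python
PRIMARY = "us-east"

# Hypothetical country-to-node map, for illustration only.
NEAREST_NODE = {
    "US": "us-east",
    "DE": "eu-west",
    "FR": "eu-west",
    "JP": "ap-east",
    "SG": "ap-east",
}


def route(probe_cc, healthy_nodes):
    """Return the ordered list of nodes to try for a measurement from probe_cc."""
    nearest = NEAREST_NODE.get(probe_cc, PRIMARY)
    attempts = [nearest] if nearest in healthy_nodes else []
    if PRIMARY not in attempts:
        attempts.append(PRIMARY)  # US-East is always the last resort
    return attempts


all_nodes = {"us-east", "eu-west", "ap-east"}
normal = route("JP", all_nodes)          # nearest node, then the primary
degraded = route("JP", {"us-east"})      # ap-east out of rotation
unknown = route("ZZ", all_nodes)         # unmapped country goes to the primary
```

Because requests are stateless, the retry can go to any node without replaying session data.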

The inference response is JSON, regardless of the msgpack request encoding, because the downstream consumers — the D1 write worker, the reconciler, the alert service — are simpler to operate when they can read plain JSON. The response schema:

{
  "measurement_uid": "20250528T142301Z_webconnectivity_IR_44244_n1_abc123",
  "interference_type": "dns_tampering",
  "prob_dns_tampering": 0.912,
  "prob_tls_interference": 0.041,
  "prob_http_blocking": 0.178,
  "prob_bgp_withdrawal": 0.003,
  "prob_throttling": 0.017,
  "model_version": "xgb-global-20250501",
  "confidence_tier": "corroborated",
  "abstain": false,
  "inference_ms": 1.4
}

Latency breakdown

The following table shows the p50 and p99 latency for each stage of the probe-to-verdict path, measured over 30 days of production traffic ending May 2025.

Stage                                        p50      p99
─────────────────────────────────────────────────────────
Probe-to-collector ingestion                120ms    800ms
Message queue dequeue                         2ms     15ms
Feature extraction (incl. control cache)      3ms     11ms
ONNX Runtime inference (47 → 5 probs)        1.2ms    4ms
Platt scaling calibration                    0.3ms    1ms
Result write to Cloudflare D1                 8ms     28ms
─────────────────────────────────────────────────────────
Total probe-to-verdict                      134ms    859ms

The 50ms budget is the target for the inference service itself: the four stages from feature extraction through the D1 write (feature extraction, ONNX inference, calibration, and the result write). End-to-end probe-to-verdict includes network ingestion and queue latency that fall outside the service's control. At p99, those upstream stages account for 815ms of the 859ms total. The inference service's p99 contribution is 44ms (11 + 4 + 1 + 28), which is inside budget with a 6ms margin.
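The budget arithmetic can be checked directly from the table's p99 column. Note that the table sums per-stage p99 values, which is a simple approximation rather than a measured end-to-end p99; the dictionary keys below are shorthand names chosen for this sketch.

```python
# p99 latency per stage in ms, copied from the table above.
STAGES_P99_MS = {
    "probe_to_collector_ingestion": 800,
    "queue_dequeue": 15,
    "feature_extraction": 11,
    "onnx_inference": 4,
    "platt_calibration": 1,
    "d1_write": 28,
}
UPSTREAM = ("probe_to_collector_ingestion", "queue_dequeue")
BUDGET_MS = 50

upstream_ms = sum(STAGES_P99_MS[stage] for stage in UPSTREAM)
service_ms = sum(ms for stage, ms in STAGES_P99_MS.items() if stage not in UPSTREAM)
total_ms = upstream_ms + service_ms
margin_ms = BUDGET_MS - service_ms

print(f"upstream={upstream_ms}ms service={service_ms}ms total={total_ms}ms margin={margin_ms}ms")
```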

The D1 write is the largest single contributor to the service's p99 (28ms of 44ms). Among the compute stages, the dominant source of tail latency is the feature extraction step, specifically the control cache lookup path when a cache entry has just been evicted and the background refresh thread hasn't yet repopulated it. This accounts for roughly 60% of p99 feature extraction time, and it is the main contributor to the measured 11ms p99 sitting above the 5ms extraction target. The ONNX Runtime inference itself is remarkably stable: p50 and p99 differ by less than 3ms, which is the behavior you want from a vectorized tree traversal that has no dynamic memory allocation on the hot path.

Model versioning and zero-downtime deployment

Model versions follow a {model_type}-{country_scope}-{train_date} naming convention — for example, xgb-global-20250501 for the global model trained on May 1 2025 data. The active version name is stored in Cloudflare KV, readable by all inference nodes. Each node polls KV every 60 seconds; when it detects a version change, it preloads the new ONNX graph into memory alongside the current model before switching traffic.
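A minimal sketch of the naming convention and the preload-then-switch behavior follows. `ModelHolder`, `on_poll`, and the `loader` callable are hypothetical names for this example; in the real service the loader would presumably build an ONNX Runtime session, and the version string would come from the KV poll.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime

VERSION_RE = re.compile(
    r"^(?P<model_type>[a-z0-9]+)-(?P<country_scope>[a-z0-9]+)-(?P<train_date>\d{8})$"
)


@dataclass(frozen=True)
class ModelVersion:
    model_type: str
    country_scope: str
    train_date: date


def parse_version(name):
    """Split a {model_type}-{country_scope}-{train_date} name into its parts."""
    match = VERSION_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized model version: {name!r}")
    return ModelVersion(
        model_type=match["model_type"],
        country_scope=match["country_scope"],
        train_date=datetime.strptime(match["train_date"], "%Y%m%d").date(),
    )


class ModelHolder:
    """Keeps the active model in memory and switches only after the new one is loaded."""

    def __init__(self, loader, initial_version):
        self._loader = loader
        self.version = initial_version
        self.model = loader(initial_version)

    def on_poll(self, active_version):
        """Call with the version name read on each poll. Returns True if a switch happened."""
        if active_version == self.version:
            return False
        parse_version(active_version)              # reject malformed names before loading
        new_model = self._loader(active_version)   # old model keeps serving while this loads
        self.model, self.version = new_model, active_version
        return True


holder = ModelHolder(lambda v: f"session:{v}", "xgb-global-20250401")
unchanged = holder.on_poll("xgb-global-20250401")
switched = holder.on_poll("xgb-global-20250501")
```

Loading before reassigning means a failed load raises without disturbing the model that is currently serving traffic.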

The deployment process itself is a graduated rollout defined in a YAML spec consumed by the deployment controller:

# deploy/model-rollout.yaml
rollout:
  candidate_version: xgb-global-20250501
  champion_version: xgb-global-20250401
  schedule:
    - step: 1
      traffic_pct: 10
      hold_minutes: 10
      pass_conditions:
        p99_inference_ms_max: 30
        f1_delta_min: -0.003    # candidate F1 must not drop more than 0.003 (0.3 pp) below champion
    - step: 2
      traffic_pct: 50
      hold_minutes: 10
      pass_conditions:
        p99_inference_ms_max: 30
        f1_delta_min: -0.003
    - step: 3
      traffic_pct: 100
      hold_minutes: 10
      pass_conditions:
        p99_inference_ms_max: 30
        f1_delta_min: -0.003
  rollback:
    trigger: any_step_fails
    target_version: xgb-global-20250401
    rollback_seconds: 30

The F1 delta is computed against the labeled probe stream — a small set of measurements with known ground truth labels (confirmed censorship events and confirmed-clean measurements) that flows continuously through the inference service during rollout. If the candidate's F1 falls more than 0.3 percentage points below the champion's, or if p99 inference latency exceeds 30ms, the rollout halts and the champion version is restored. Every inference request logs its model_version field to R2, enabling retroactive comparison between champion and any prior challenger.
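The gate logic in the rollout spec can be sketched as below. The thresholds mirror the YAML above; `Step`, `step_passes`, `run_rollout`, and the `observe` callback are hypothetical names, and the real deployment controller is not shown in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    traffic_pct: int
    p99_inference_ms_max: float
    f1_delta_min: float


# Same three steps and thresholds as the rollout spec.
SCHEDULE = [Step(10, 30, -0.003), Step(50, 30, -0.003), Step(100, 30, -0.003)]


def step_passes(step, candidate_p99_ms, candidate_f1, champion_f1):
    """A step passes when latency is within the cap and F1 has not dropped too far."""
    f1_delta = candidate_f1 - champion_f1
    return candidate_p99_ms <= step.p99_inference_ms_max and f1_delta >= step.f1_delta_min


def run_rollout(schedule, observe):
    """observe(traffic_pct) returns (candidate_p99_ms, candidate_f1, champion_f1)
    measured over the hold period. Returns ('promoted', pct) or ('rolled_back', pct)."""
    for step in schedule:
        p99_ms, candidate_f1, champion_f1 = observe(step.traffic_pct)
        if not step_passes(step, p99_ms, candidate_f1, champion_f1):
            return ("rolled_back", step.traffic_pct)
    return ("promoted", schedule[-1].traffic_pct)


healthy = run_rollout(SCHEDULE, lambda pct: (4.0, 0.945, 0.942))
slow_at_half = run_rollout(
    SCHEDULE, lambda pct: (4.0, 0.945, 0.942) if pct < 50 else (35.0, 0.945, 0.942)
)
```

Any single failing step ends the rollout, matching the `any_step_fails` rollback trigger.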

Per-country calibration at inference time

The global ONNX graph produces raw probabilities that are well-calibrated on average but systematically miscalibrated for specific countries. A DNS timeout rate in Pakistan that would signal interference in a model calibrated on German probes is just normal baseline behavior. The calibration correction happens as the final stage of the combined ONNX graph, using Platt scaling coefficients stored in a JSON lookup table with one row per country (200 rows total).

Countries without enough labeled data for direct per-country calibration — fewer than 500 confirmed block measurements — use the coefficient for their UN geoscheme region instead. The lookup adds under 0.3ms because it is a simple array index into a 200-row in-memory table, not a database call. The calibrated probabilities are what appear as prob_* in both the measurement record written to D1 and the API response. Calibration coefficients are updated monthly as part of the full model retrain cycle.
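The country-then-region lookup can be illustrated with a small sketch. Every coefficient below is a made-up placeholder, the dictionaries stand in for the 200-row table, and the sigmoid form `1 / (1 + exp(-(a*x + b)))` is the usual logistic-regression formulation of Platt scaling, assumed here rather than taken from the production graph.

```python
import math

# Placeholder coefficients (a, b); not real calibration values.
COUNTRY_COEFFS = {"DE": (4.2, -2.1), "IR": (3.1, -1.0)}
REGION_COEFFS = {"Southern Asia": (3.6, -1.7), "Western Europe": (4.0, -2.0)}

# UN geoscheme region per country (subset, for illustration).
COUNTRY_REGION = {"DE": "Western Europe", "IR": "Southern Asia", "BT": "Southern Asia"}


def coefficients_for(probe_cc):
    """Use the per-country row when one exists, else fall back to the region row."""
    if probe_cc in COUNTRY_COEFFS:
        return COUNTRY_COEFFS[probe_cc]
    return REGION_COEFFS[COUNTRY_REGION[probe_cc]]


def calibrate(raw_score, probe_cc):
    """Apply Platt scaling to a raw model score for one class."""
    a, b = coefficients_for(probe_cc)
    return 1.0 / (1.0 + math.exp(-(a * raw_score + b)))


direct = coefficients_for("DE")      # country has its own row
fallback = coefficients_for("BT")    # no country row, so the region row is used
midpoint = calibrate(0.5, "DE")
```

Both lookups are in-memory dictionary reads, consistent with the point above that calibration needs no database call.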

Shadow mode and the champion/challenger protocol

Every new candidate model runs in shadow mode for seven days before any traffic is shifted to it in the graduated rollout. In shadow mode, the inference service receives every real incoming measurement, runs both the champion and the challenger against it, but only writes the champion result to D1. The challenger result is logged to R2 alongside the champion result for offline comparison.

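Shadow requests are identified by a header the Cloudflare Worker injects. SHADOW_SAMPLE_RATE sets the fraction of requests marked as shadow; mirroring every measurement, as described above, corresponds to a rate of 1: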

// Cloudflare Worker: shadow routing
const isShadow = candidateVersion !== null && Math.random() < SHADOW_SAMPLE_RATE;

const response = await fetch(inferenceNode, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/msgpack',
    'X-Voidly-Model-Mode': isShadow ? 'shadow' : 'champion',
    'X-Voidly-Champion-Version': championVersion,
    'X-Voidly-Candidate-Version': candidateVersion ?? '',
  },
  body: msgpackEncode(probeFeatures),
});

The inference node, on receiving X-Voidly-Model-Mode: shadow, runs both models and returns only the champion result to the caller, but asynchronously writes a comparison record to R2:

{
  "measurement_uid": "...",
  "champion_version": "xgb-global-20250401",
  "champion_interference_type": "dns_tampering",
  "champion_prob_dns_tampering": 0.891,
  "candidate_version": "xgb-global-20250501",
  "candidate_interference_type": "dns_tampering",
  "candidate_prob_dns_tampering": 0.923,
  "shadow_ts": "2025-05-21T08:14:33Z"
}

After seven days, the shadow comparison corpus is evaluated offline. The challenger wins promotion to the graduated rollout only if its F1 against the labeled probe stream is at least 0.005 (half a percentage point) above the champion's, and its p99 inference latency is no higher than the champion's. Both conditions must hold. Either condition failing means the challenger is retired and the next retrain cycle begins.
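The promotion rule reduces to two comparisons, sketched below alongside a simple disagreement summary over the comparison records. The function names are hypothetical, and the F1 values are assumed to be computed elsewhere against the labeled probe stream.

```python
MIN_F1_GAIN = 0.005  # from the post: challenger F1 must exceed champion F1 by at least this


def disagreement_rate(records):
    """Fraction of shadow records where the two models chose different classes."""
    if not records:
        return 0.0
    differing = sum(
        1
        for r in records
        if r["champion_interference_type"] != r["candidate_interference_type"]
    )
    return differing / len(records)


def should_promote(champion_f1, challenger_f1, champion_p99_ms, challenger_p99_ms):
    """Both conditions must hold: enough F1 gain and no p99 latency regression."""
    f1_ok = challenger_f1 - champion_f1 >= MIN_F1_GAIN
    latency_ok = challenger_p99_ms <= champion_p99_ms
    return f1_ok and latency_ok


records = [
    {"champion_interference_type": "dns_tampering", "candidate_interference_type": "dns_tampering"},
    {"champion_interference_type": "dns_tampering", "candidate_interference_type": "http_blocking"},
]
rate = disagreement_rate(records)
promote = should_promote(0.940, 0.948, champion_p99_ms=4.0, challenger_p99_ms=3.9)
```

Comparing floats right at the 0.005 boundary can go either way because of rounding, so a real implementation might compare rounded values or integer basis points.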

Edge case: the abstain path

Some measurements arrive with too many missing features to classify reliably. The most common causes are a control cache miss on a newly added domain, a DNS timeout so early in the probe execution that no response fields were recorded, and BGP measurement failures at the vantage node level. For each of these, some subset of the 47 feature fields will be None.

The classifier handles None values via XGBoost's native missing-value path — trees learned during training to route NaN inputs to a default child node, so inference still completes rather than erroring. But when the number of missing features is high enough, the resulting probability estimates are unreliable regardless of what the tree routing produces.

The threshold is 8 out of 47 features. If 8 or more features are None, the inference service returns:

{
  "interference_type": null,
  "prob_dns_tampering": null,
  "prob_tls_interference": null,
  "prob_http_blocking": null,
  "prob_bgp_withdrawal": null,
  "prob_throttling": null,
  "confidence_tier": "anomaly",
  "abstain": true,
  "abstain_reason": "missing_features",
  "missing_feature_count": 12
}
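A minimal sketch of the abstain decision, assuming the features arrive as a name-to-value mapping and that `predict` is whatever callable wraps the model (both are assumptions for the example). The threshold and response fields follow the text above.

```python
import math

ABSTAIN_THRESHOLD = 8  # from the post: 8 or more missing features out of 47

PROB_FIELDS = (
    "prob_dns_tampering",
    "prob_tls_interference",
    "prob_http_blocking",
    "prob_bgp_withdrawal",
    "prob_throttling",
)


def count_missing(features):
    """Number of features whose value is None."""
    return sum(1 for value in features.values() if value is None)


def to_model_input(features):
    """None becomes NaN so the tree ensemble can follow its learned default branches."""
    return [math.nan if value is None else float(value) for value in features.values()]


def score_or_abstain(features, predict):
    """Abstain when too many features are missing; otherwise run the model."""
    missing = count_missing(features)
    if missing >= ABSTAIN_THRESHOLD:
        response = {"interference_type": None}
        response.update({field: None for field in PROB_FIELDS})
        response.update(
            confidence_tier="anomaly",
            abstain=True,
            abstain_reason="missing_features",
            missing_feature_count=missing,
        )
        return response
    return predict(to_model_input(features))


sparse = {f"f{i}": (None if i < 12 else 1) for i in range(47)}
mostly_full = {f"f{i}": (None if i < 7 else 1) for i in range(47)}

abstained = score_or_abstain(sparse, predict=lambda vec: {"abstain": False})
scored = score_or_abstain(mostly_full, predict=lambda vec: {"abstain": False, "n": len(vec)})
```

Checking the count before calling the model keeps the abstain path cheap and avoids emitting probabilities that would be discarded anyway.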

Abstained measurements are still published to the public dataset with their raw probe data intact — researchers can reprocess them against updated models or examine the raw measurements directly. But they are never promoted to “Corroborated” or “Verified incident” status without corroboration from an external source that has its own independent measurement. Globally, the abstain rate sits at approximately 3.1% of all measurements. The rate is substantially higher in low-coverage ASNs — small regional ISPs where the control server has fewer prior measurements to cache — and in countries where DNS timeout failures are endemic enough to produce missing DNS features on a significant fraction of probes.

For how the classifier is calibrated per-country with Platt scaling and threshold tuning — why Iran DNS fires at 0.62 and China requires 0.74: Voidly's per-country classifier calibration: Platt scaling and threshold tuning →

For the anomaly classifier this inference API serves: The Voidly anomaly classifier: five interference classes, gradient boosted trees, and why we optimize for recall →

For the labeled training data the model was built on: Voidly's ML training pipeline: building a labeled censorship dataset from OONI measurements →

For how inference output flows into the 8-minute journalist alert pipeline: Voidly's real-time event pipeline: from measurement anomaly to journalist alert in under 8 minutes →

For the control server that populates the cache the feature extractor depends on: The Voidly control server: how we tell censorship from a bad network →

For the full 47-feature engineering pipeline that transforms raw measurements into the vector this API receives: The 47 features that classify internet censorship: how Voidly extracts signal from raw network measurements →

For how the model served here is retrained weekly — rolling 6-month splits, PSI drift detection, champion/challenger promotion, and canary rollout: Voidly's anomaly classifier retraining pipeline: temporal splits, champion/challenger promotion, and drift detection →