Voidly ONNX inference: exporting XGBoost to ONNX and serving censorship predictions at 50ms p99

May 24, 2025 · AI Analytics

Voidly · Machine learning · ONNX · Inference · XGBoost

Voidly's censorship classifier is trained in Python with XGBoost and scikit-learn. The inference path, however, runs inside a Rust-based ingestion service that processes probe results at 1,200 batch requests per second. Embedding a CPython interpreter in that service would inflate memory usage by ~180 MB per worker process and introduce GIL contention. The solution is ONNX: export the trained model to an ONNX graph, then run ONNX Runtime from Rust via the ort crate.

This article covers the sklearn-to-ONNX export pipeline with feature type registration, the ONNX Runtime session configuration tuned for single-threaded batch inference, opset version pinning to ensure forward compatibility across model updates, and the gRPC inference service that wraps ONNX Runtime for the ingestion pipeline.

Export pipeline

The XGBoost model is trained inside a scikit-learn Pipeline that applies feature preprocessing before the classifier. The entire pipeline — preprocessor and model — is exported as a single ONNX graph so that the inference service does not need to replicate the preprocessing logic:

# training/export_onnx.py

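import onnx
from onnxmltools.convert.xgboost.operator_converters.XGBoost import convert_xgboost
from skl2onnx import convert_sklearn, update_registered_converter
from skl2onnx.common.data_types import FloatTensorType, Int64TensorType
from skl2onnx.common.shape_calculator import calculate_linear_classifier_output_shapes
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# skl2onnx does not ship an XGBoost converter. Register the one from
# onnxmltools so convert_sklearn can handle the XGBClassifier pipeline step.
update_registered_converter(
    XGBClassifier,
    "XGBoostXGBClassifier",
    calculate_linear_classifier_output_shapes,
    convert_xgboost,
    options={"nocl": [True, False], "zipmap": [True, False, "columns"]},
)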

# Target opset: pin to 17 (supported by ONNX Runtime >= 1.15).
# Do NOT use the latest opset unless all deployment targets support it.
TARGET_OPSET = 17

# Feature schema must exactly match the order expected by the inference service.
# Changes here require a coordinated deploy of the inference service.
FEATURE_SCHEMA = [
    # (feature_name, ONNX type, shape)
    ("dns_nxdomain",          FloatTensorType([None, 1])),
    ("dns_resolver_mismatch", FloatTensorType([None, 1])),
    ("tcp_connect_failed",    FloatTensorType([None, 1])),
    ("tls_cert_invalid",      FloatTensorType([None, 1])),
    ("http_status_code",      Int64TensorType([None, 1])),
    ("body_similarity_score", FloatTensorType([None, 1])),
    ("rtt_percentile_50",     FloatTensorType([None, 1])),
    ("rtt_percentile_95",     FloatTensorType([None, 1])),
    ("control_delta_ms",      FloatTensorType([None, 1])),
    ("probe_asn_seen_before", FloatTensorType([None, 1])),
    ("country_block_rate_7d", FloatTensorType([None, 1])),
    ("domain_block_rate_30d", FloatTensorType([None, 1])),
]


def export_pipeline_to_onnx(
    pipeline: Pipeline,
    output_path: str,
    model_version: str,
) -> None:
    initial_types = FEATURE_SCHEMA

    onnx_model = convert_sklearn(
        pipeline,
        name=f"VoidlyCensorshipClassifier_v{model_version}",
        initial_types=initial_types,
        target_opset=TARGET_OPSET,
        options={XGBClassifier: {"zipmap": False}},
        # zipmap=False: return probability array rather than dict[label -> prob]
        # so the Rust side can index directly into the output tensor
    )

    # Embed model metadata for the inference service to validate at load time
    meta = onnx_model.metadata_props.add()
    meta.key, meta.value = "voidly_model_version", model_version
    meta = onnx_model.metadata_props.add()
    meta.key, meta.value = "voidly_feature_count", str(len(FEATURE_SCHEMA))
    meta = onnx_model.metadata_props.add()
    meta.key, meta.value = "voidly_target_opset", str(TARGET_OPSET)

    # Validate the graph before saving
    onnx.checker.check_model(onnx_model)

    with open(output_path, "wb") as f:
        f.write(onnx_model.SerializeToString())

    print(f"Exported ONNX model to {output_path} ({onnx_model.ByteSize() / 1024:.1f} KB)")

The zipmap=False option is critical for performance. With zipmap=True (the skl2onnx default for classifiers), the probability output is a sequence of maps from class label to probability, which the consumer has to walk item by item. With zipmap=False, the output is a plain float32 tensor of shape [batch_size, num_classes] that the Rust side can read as a contiguous array without any intermediate conversion.
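
To make the layout concrete, here is a minimal sketch of reading one class column out of a row-major [batch_size, num_classes] buffer. The function name class_column is hypothetical and the sketch works on a plain slice rather than an ONNX Runtime tensor; it only illustrates why a flat tensor is easy to index.

```rust
/// Extracts P(class = `class_idx`) for every row of a row-major
/// [batch_size, num_classes] probability buffer.
/// Returns None if `num_classes` is zero, the class index is out of range,
/// or the buffer length is not a multiple of `num_classes`.
pub fn class_column(flat: &[f32], num_classes: usize, class_idx: usize) -> Option<Vec<f32>> {
    if num_classes == 0 || class_idx >= num_classes || flat.len() % num_classes != 0 {
        return None;
    }
    Some(
        flat.chunks_exact(num_classes) // one chunk per batch item
            .map(|row| row[class_idx])
            .collect(),
    )
}
```

For a binary classifier, class_column(&probs, 2, 1) yields the class-1 probability for each item in the batch.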

Opset version pinning

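Opset compatibility runs in one direction: a newer ONNX Runtime can load models exported at an older opset, but an older runtime cannot load a model exported at an opset it does not implement. Pinning the export opset therefore keeps every new model loadable by the runtime version already deployed; the exact opset range each ONNX Runtime release supports is listed in its compatibility table. The inference service validates the opset at session creation time and rejects models that specify an opset outside its supported range: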

// src/inference/model_loader.rs

use std::collections::HashMap;

const SUPPORTED_OPSET_MIN: i64 = 13;
const SUPPORTED_OPSET_MAX: i64 = 17;

pub fn validate_model_metadata(model_bytes: &[u8]) -> Result<ModelMeta, InferenceError> {
    use onnx_proto::ModelProto;
    use prost::Message;

    let proto = ModelProto::decode(model_bytes)
        .map_err(|e| InferenceError::ProtoDecodeError(e.to_string()))?;

    // Check opset version
    let opset_version = proto.opset_import
        .iter()
        .find(|op| op.domain.is_empty() || op.domain == "ai.onnx")
        .map(|op| op.version)
        .unwrap_or(0);

    if !(SUPPORTED_OPSET_MIN..=SUPPORTED_OPSET_MAX).contains(&opset_version) {
        return Err(InferenceError::UnsupportedOpset {
            got:  opset_version,
            min:  SUPPORTED_OPSET_MIN,
            max:  SUPPORTED_OPSET_MAX,
        });
    }

    // Extract Voidly metadata props
    let meta: HashMap<String, String> = proto.metadata_props
        .iter()
        .map(|p| (p.key.clone(), p.value.clone()))
        .collect();

    let version = meta.get("voidly_model_version")
        .cloned()
        .ok_or(InferenceError::MissingMetadata("voidly_model_version"))?;
    let feature_count: usize = meta.get("voidly_feature_count")
        .and_then(|s| s.parse().ok())
        .ok_or(InferenceError::MissingMetadata("voidly_feature_count"))?;

    Ok(ModelMeta { version, feature_count, opset_version })
}
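
The loader above refers to an InferenceError type that is not shown. A minimal sketch of what it could look like follows; the variant names are taken from the loader, but the field types, the Display messages, and the standalone check_opset helper are assumptions made for illustration.

```rust
use std::fmt;

/// Hypothetical error type matching the variants used by the loader.
#[derive(Debug, PartialEq)]
pub enum InferenceError {
    ProtoDecodeError(String),
    UnsupportedOpset { got: i64, min: i64, max: i64 },
    MissingMetadata(&'static str),
}

impl fmt::Display for InferenceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::ProtoDecodeError(msg) => write!(f, "failed to decode ONNX model: {msg}"),
            Self::UnsupportedOpset { got, min, max } => {
                write!(f, "model opset {got} is outside the supported range {min}..={max}")
            }
            Self::MissingMetadata(key) => write!(f, "model is missing metadata key `{key}`"),
        }
    }
}

impl std::error::Error for InferenceError {}

/// The opset range check in isolation, so it can be unit-tested
/// without decoding a real model file.
pub fn check_opset(got: i64, min: i64, max: i64) -> Result<(), InferenceError> {
    if (min..=max).contains(&got) {
        Ok(())
    } else {
        Err(InferenceError::UnsupportedOpset { got, min, max })
    }
}
```

A model with no default-domain opset entry is read as opset 0 by the loader, so it fails this check with the same UnsupportedOpset error rather than slipping through.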

ONNX Runtime session configuration

The ingestion service is CPU-only and processes batches of 50–200 probe results at a time, each arriving as a gRPC request. The ONNX Runtime session is configured for single-threaded execution within the request handler, so concurrency comes from Tokio's thread pool at the gRPC level rather than from ORT's own intra-op and inter-op thread pools:

// src/inference/session.rs

use ort::{Environment, ExecutionProvider, GraphOptimizationLevel, Session, SessionBuilder};

pub fn build_session(model_bytes: &[u8]) -> ort::Result<Session> {
    let environment = Environment::builder()
        .with_name("voidly_classifier")
        .with_log_level(ort::LoggingLevel::Warning)
        .build()?
        .into_arc();

    SessionBuilder::new(&environment)?
        // CPU execution only; reject GPU/TensorRT providers
        .with_execution_providers([ExecutionProvider::CPU(Default::default())])?

        // Graph optimization: Level3 enables every pass (basic, extended, and
        // layout optimizations), applied once when the session is created
        .with_optimization_level(GraphOptimizationLevel::Level3)?

        // Single intra-op thread per session: Tokio provides concurrency via
        // multiple sessions (one per tokio worker thread via thread_local!)
        .with_intra_threads(1)?
        .with_inter_threads(1)?

        // Disable memory-pattern planning (this is separate from the CPU memory
        // arena) to reduce peak RSS; batch sizes are small enough that the
        // per-allocation overhead is negligible
        .with_disable_mem_pattern()?

        .commit_from_memory(model_bytes)
}

// One ONNX Runtime session per Tokio worker thread to avoid lock contention
thread_local! {
    static SESSION: std::cell::OnceCell<Session> = std::cell::OnceCell::new();
}

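// A reference to a thread-local value cannot outlive the `with` closure, so the
// session is lent to a caller-supplied closure instead of being returned.
pub fn with_session<R>(model_bytes: &'static [u8], f: impl FnOnce(&Session) -> R) -> R {
    SESSION.with(|cell| {
        let session = cell.get_or_init(|| build_session(model_bytes).expect("ORT session init"));
        f(session)
    })
}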

The thread-local session pattern avoids a Mutex around a shared session. ONNX Runtime sessions are not Send in the ort crate's type system (the underlying C++ InferenceSession is thread-safe for concurrent Run() calls but the Rust wrapper does not expose this), so thread-local storage is the idiomatic approach. Each Tokio worker thread initializes its own session on first use; with 4 Tokio workers, this results in 4 sessions holding ~18 MB of model weights each (~72 MB total), within the 256 MB RSS target for the inference service.
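
The one-session-per-thread behaviour can be exercised without ONNX Runtime. The sketch below swaps the real session for a hypothetical FakeSession and counts how many are built. All names here (FakeSession, sessions_built, run_workers) are illustrative, and plain OS threads stand in for Tokio workers.

```rust
use std::cell::OnceCell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Stand-in for an inference session (hypothetical; not the ort type).
pub struct FakeSession {
    pub id: usize,
}

/// Counts sessions built across all threads.
static SESSIONS_BUILT: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    static SESSION: OnceCell<FakeSession> = OnceCell::new();
}

/// Number of sessions built so far in this process.
pub fn sessions_built() -> usize {
    SESSIONS_BUILT.load(Ordering::SeqCst)
}

/// Runs `f` with the calling thread's session, building it on first use.
pub fn with_session<R>(f: impl FnOnce(&FakeSession) -> R) -> R {
    SESSION.with(|cell| {
        let session = cell.get_or_init(|| FakeSession {
            id: SESSIONS_BUILT.fetch_add(1, Ordering::SeqCst),
        });
        f(session)
    })
}

/// Spawns `workers` threads that each use their session `calls` times and
/// returns the session ids each thread observed.
pub fn run_workers(workers: usize, calls: usize) -> Vec<Vec<usize>> {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            thread::spawn(move || {
                (0..calls)
                    .map(|_| with_session(|s| s.id))
                    .collect::<Vec<usize>>()
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

With four workers, exactly four sessions are built no matter how many calls each worker makes, which is the property the 4 × ~18 MB memory estimate relies on.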

Batch inference implementation

// src/inference/runner.rs

use ndarray::{Array2, ArrayView2};
use ort::inputs;

pub struct ClassifierInput {
    pub features: Array2<f32>,  // shape: [batch_size, NUM_FEATURES]
}

pub struct ClassifierOutput {
    pub labels:        Vec<i64>,   // predicted class index per item
    pub probabilities: Vec<f32>,   // P(censored) per item (class-1 probability)
}

const NUM_FEATURES: usize = 12;  // must match FEATURE_SCHEMA length

pub fn run_batch(
    session: &ort::Session,
    input: &ClassifierInput,
) -> ort::Result<ClassifierOutput> {
    assert_eq!(input.features.ncols(), NUM_FEATURES, "feature dimension mismatch");

    let batch_size = input.features.nrows();

    // ONNX Runtime expects inputs keyed by name; names match initial_types in export.
    // Each input was declared as [None, 1], so every column is passed as a
    // [batch_size, 1] view rather than a rank-1 vector.
    let col = |i: usize| input.features.column(i).insert_axis(ndarray::Axis(1)).into_dyn();
    // http_status_code is declared as Int64TensorType at export; convert that column.
    let http_status = input.features.column(4).mapv(|v| v as i64).insert_axis(ndarray::Axis(1));

    let outputs = session.run(inputs![
        "dns_nxdomain"          => col(0),
        "dns_resolver_mismatch" => col(1),
        "tcp_connect_failed"    => col(2),
        "tls_cert_invalid"      => col(3),
        "http_status_code"      => http_status.view().into_dyn(),
        "body_similarity_score" => col(5),
        "rtt_percentile_50"     => col(6),
        "rtt_percentile_95"     => col(7),
        "control_delta_ms"      => col(8),
        "probe_asn_seen_before" => col(9),
        "country_block_rate_7d" => col(10),
        "domain_block_rate_30d" => col(11),
    ]?)?;

    // Output 0: labels tensor [batch_size] int64
    let label_tensor = outputs[0].try_extract_tensor::<i64>()?;
    let labels: Vec<i64> = label_tensor.view().iter().copied().collect();

    // Output 1: probabilities tensor [batch_size, 2] float32 (zipmap=False)
    let prob_tensor = outputs[1].try_extract_tensor::<f32>()?;
    let prob_view: ArrayView2<f32> = prob_tensor
        .view()
        .into_dimensionality::<ndarray::Ix2>()?;
    // Column 1 = P(class=1) = P(censored)
    let probabilities: Vec<f32> = (0..batch_size)
        .map(|i| prob_view[[i, 1]])
        .collect();

    Ok(ClassifierOutput { labels, probabilities })
}
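
Because the runner addresses features by column index, whatever builds the feature matrix has to emit columns in exactly the FEATURE_SCHEMA order. One way to keep that ordering in a single place is sketched below. The ProbeFeatures struct and pack_batch function are hypothetical, and storing http_status_code as a u16 is an assumption.

```rust
/// Number of model inputs; matches the 12-entry FEATURE_SCHEMA.
pub const NUM_FEATURES: usize = 12;

/// Hypothetical per-probe record with one field per model input.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct ProbeFeatures {
    pub dns_nxdomain: f32,
    pub dns_resolver_mismatch: f32,
    pub tcp_connect_failed: f32,
    pub tls_cert_invalid: f32,
    pub http_status_code: u16,
    pub body_similarity_score: f32,
    pub rtt_percentile_50: f32,
    pub rtt_percentile_95: f32,
    pub control_delta_ms: f32,
    pub probe_asn_seen_before: f32,
    pub country_block_rate_7d: f32,
    pub domain_block_rate_30d: f32,
}

impl ProbeFeatures {
    /// One matrix row, in FEATURE_SCHEMA order. This is the only place
    /// where the column order is spelled out.
    pub fn to_row(&self) -> [f32; NUM_FEATURES] {
        [
            self.dns_nxdomain,
            self.dns_resolver_mismatch,
            self.tcp_connect_failed,
            self.tls_cert_invalid,
            f32::from(self.http_status_code),
            self.body_similarity_score,
            self.rtt_percentile_50,
            self.rtt_percentile_95,
            self.control_delta_ms,
            self.probe_asn_seen_before,
            self.country_block_rate_7d,
            self.domain_block_rate_30d,
        ]
    }
}

/// Flattens probes into a row-major buffer of length batch_size * NUM_FEATURES.
pub fn pack_batch(probes: &[ProbeFeatures]) -> Vec<f32> {
    let mut flat = Vec::with_capacity(probes.len() * NUM_FEATURES);
    for probe in probes {
        flat.extend_from_slice(&probe.to_row());
    }
    flat
}
```

The flat buffer can then be wrapped with ndarray's Array2::from_shape_vec((batch_size, NUM_FEATURES), flat) to produce the features matrix that run_batch expects.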

Latency benchmarks

Batch size	p50 (ms)	p95 (ms)	p99 (ms)	Throughput (items/s, 4 vCPU)
1	0.4	0.8	1.2	~9,000
50	5.1	9.3	14.2	~56,000
100	9.6	18.4	28.1	~62,000
200	18.3	34.7	49.8	~68,000

At batch size 200 the p99 is 49.8 ms, just inside the 50 ms SLO, and throughput is the highest measured (~68,000 items/s). The ingestion pipeline coalesces probe results into batches of up to 200 items, waiting at most 15 ms before dispatching to the gRPC inference service, so the service receives near-optimal batch sizes even at lower ingestion rates.
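
The coalescing step can be sketched with a standard-library channel. This is not the ingestion pipeline's actual code: the next_batch name is hypothetical, it uses a blocking std::sync::mpsc receiver where the real service would presumably use an async channel, and it assumes the wait window starts when the first item of a batch arrives.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Collects up to `max_items` from `rx`, waiting at most `window` after the
/// first item arrives. Returns None once the channel is closed and drained.
/// Assumes `max_items >= 1`.
pub fn next_batch<T>(rx: &Receiver<T>, max_items: usize, window: Duration) -> Option<Vec<T>> {
    // Block until the first item; a closed, empty channel ends the stream.
    let first = rx.recv().ok()?;
    let mut batch = Vec::with_capacity(max_items);
    batch.push(first);
    let deadline = Instant::now() + window;

    while batch.len() < max_items {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Window elapsed or all senders gone: dispatch what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

Calling next_batch(&rx, 200, Duration::from_millis(15)) in a loop yields full 200-item batches under load and smaller batches when the window expires first.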

Related writing

Voidly measurement feature extraction describes the 12 features that are computed from raw probe observations before being passed to the ONNX inference pipeline documented here.

Voidly inference API covers the external-facing REST wrapper around the gRPC inference service, including authentication, request validation, and the response envelope format.