
Voidly feature extraction: how 47 classifier inputs are derived from raw probe measurements

May 20, 2025 · 14 min read · AI Analytics

Censorship · Voidly · ML · Methodology

Each Voidly probe run produces a measurement record that can reach ~200 fields: DNS responses with every returned IP and TTL, TCP timing at millisecond resolution, a full TLS certificate chain, HTTP response headers and a body hash, RTT deltas, and probe-level metadata about the vantage node. This is the raw observational record — complete, verbose, and useless to the anomaly classifier without transformation.

The classifier expects a fixed-length float32 vector. It does not understand JSON, certificate chains, or hex-encoded body hashes. Feature extraction is the step that translates one representation into the other, and it is where almost all the domain knowledge about censorship detection lives. A feature vector that captures the right signal produces a well-separated decision boundary. A feature vector that misses the signal forces the model to compensate with more data and more complexity than the underlying problem warrants.

This article documents the 47 features that feed the Voidly anomaly classifier and the engineering decisions behind each one — why these quantities, how they are computed, how missing values are handled, and how the feature schema is versioned so models and extractors stay in sync.

Why feature extraction matters more than model architecture

The raw probe result contains approximately 200 fields. Some are direct signal (the DNS resolver returned a known injection IP), some are noise (the probe's local IPv6 address), and most are somewhere in between — informative in context but not independently. The classifier cannot sort this out automatically. Gradient boosted trees are good at finding nonlinear decision boundaries over numerical inputs, but they cannot learn to parse a certificate chain or to compute the distance between a returned IP and a control IP set. Those transformations have to be done ahead of time, by the feature extractor, informed by what we know about how censorship actually works.

The core insight driving the feature design is comparison against a control measurement. Almost nothing about a single probe measurement is interpretable in isolation. A TCP connect time of 180ms could reflect a middlebox sitting in the path, or it could simply be a slow server in a distant region. A DNS TTL of 60 seconds could indicate injection, or it could be a legitimate low-TTL record on a heavily load-balanced CDN. The signal emerges when the probe measurement is compared against what the same domain looks like from an unobstructed vantage point in the same time window. That comparison is the job of the ControlDelta struct.

The ControlDelta struct

Before any per-layer features are extracted, the probe result and the most recent control measurement for the same domain are combined into a ControlDelta. This struct is the normalized representation of the difference between what the probe observed and what the unobstructed baseline expects. All delta and ratio features are derived from it. The struct is computed in the feature extractor (Rust, in the probe collector) before any Python ML code runs:

// src/features/control_delta.rs

use std::collections::HashSet;

#[derive(Debug, Clone)]
pub struct ControlDelta {
    // DNS comparison
    pub dns_match: bool,
    pub dns_ip_in_control_set: bool,
    pub dns_nxdomain_where_control_resolved: bool,
    pub dns_response_ms_delta: Option<f32>,
    pub dns_ttl_min_delta: Option<f32>,

    // TCP comparison
    pub tcp_connect_success: bool,
    pub tcp_rtt_delta_ms: Option<f32>,   // probe RTT minus control RTT
    pub tcp_rst_received: bool,
    pub tcp_rst_timing_ms: Option<f32>,  // ms from SYN to RST

    // TLS comparison
    pub tls_handshake_success: bool,
    pub tls_cert_match: bool,            // probe cert fingerprint == control cert fingerprint
    pub tls_cert_in_control_chain: bool, // probe cert appears anywhere in control's chain
    pub tls_cert_is_mitm: bool,          // probe cert fingerprint in known-MITM library
    pub tls_cert_is_self_signed: bool,
    pub tls_cert_issuer_known_govt: bool,
    pub tls_alert_code: u16,             // 0 = no alert
    pub tls_handshake_ms_delta: Option<f32>,

    // HTTP comparison
    pub http_status_match: bool,
    pub http_body_sha256_match: bool,
    pub http_body_sim_hash_distance: Option<f32>, // 0.0 = identical, 1.0 = completely different
    pub http_body_length_ratio: Option<f32>,      // probe body len / control body len
    pub http_blockpage_match_score: f32,           // 0.0–1.0 against 2300-entry corpus
    pub http_ttfb_ms_delta: Option<f32>,
    pub http_body_truncated: bool,
}

impl ControlDelta {
    pub fn compute(probe: &ProbeResult, control: &ControlResult) -> Self {
        let probe_ips: HashSet<&str> = probe.dns.answers.iter()
            .filter_map(|a| a.ip.as_deref())
            .collect();
        let control_ips: HashSet<&str> = control.dns.answers.iter()
            .filter_map(|a| a.ip.as_deref())
            .collect();

        let dns_ip_in_control_set = !probe_ips.is_disjoint(&control_ips);

        let tls_cert_match = probe.tls.as_ref()
            .and_then(|t| t.cert_fingerprint_sha256.as_deref())
            .zip(control.tls.as_ref().and_then(|t| t.cert_fingerprint_sha256.as_deref()))
            .map(|(p, c)| p == c)
            .unwrap_or(false);

        // Score only when both an HTTP result and a body hash are present; no unwrap needed
        let blockpage_score = probe.http.as_ref()
            .and_then(|h| h.body_hash.as_deref().map(|hash| BLOCKPAGE_INDEX.score(hash, h)))
            .unwrap_or(0.0);

        ControlDelta {
            dns_match: probe_ips == control_ips,
            dns_ip_in_control_set,
            dns_nxdomain_where_control_resolved: probe.dns.failure.as_deref() == Some("dns_nxdomain_error")
                && control.dns.failure.is_none(),
            dns_response_ms_delta: probe.dns.response_ms.zip(control.dns.response_ms)
                .map(|(p, c)| p - c),
            dns_ttl_min_delta: probe.dns.ttl_min().zip(control.dns.ttl_min())
                .map(|(p, c)| p as f32 - c as f32),
            // `probe` and `control` are borrowed, so each Option is read through as_ref()
            tcp_connect_success: probe.tcp.as_ref().is_some_and(|t| t.connect_success),
            tcp_rtt_delta_ms: probe.tcp.as_ref().and_then(|t| t.rtt_ms)
                .zip(control.tcp.as_ref().and_then(|t| t.rtt_ms))
                .map(|(p, c)| p - c),
            tcp_rst_received: probe.tcp.as_ref().is_some_and(|t| t.rst_received),
            tcp_rst_timing_ms: probe.tcp.as_ref().and_then(|t| t.rst_timing_ms),
            tls_handshake_success: probe.tls.as_ref().is_some_and(|t| t.handshake_success),
            tls_cert_match,
            tls_cert_in_control_chain: check_cert_in_chain(probe.tls.as_ref(), control.tls.as_ref()),
            tls_cert_is_mitm: probe.tls.as_ref()
                .and_then(|t| t.cert_fingerprint_sha256.as_deref())
                .map(|fp| KNOWN_MITM_CERTS.contains(fp))
                .unwrap_or(false),
            tls_cert_is_self_signed: probe.tls.as_ref()
                .map(|t| t.cert_is_self_signed)
                .unwrap_or(false),
            tls_cert_issuer_known_govt: probe.tls.as_ref()
                .and_then(|t| t.cert_issuer_cn.as_deref())
                .map(|cn| GOVT_CERT_ISSUERS.contains(cn))
                .unwrap_or(false),
            tls_alert_code: probe.tls.as_ref().map(|t| t.alert_code).unwrap_or(0),
            tls_handshake_ms_delta: probe.tls.as_ref().and_then(|t| t.handshake_ms)
                .zip(control.tls.as_ref().and_then(|t| t.handshake_ms))
                .map(|(p, c)| p - c),
            http_status_match: probe.http.as_ref().and_then(|h| h.status_code)
                == control.http.as_ref().and_then(|h| h.status_code),
            http_body_sha256_match: probe.http.as_ref().and_then(|h| h.body_hash.as_deref())
                == control.http.as_ref().and_then(|h| h.body_hash.as_deref()),
            http_body_sim_hash_distance: compute_sim_hash_distance(probe.http.as_ref(), control.http.as_ref()),
            http_body_length_ratio: probe.http.as_ref().and_then(|h| h.body_length)
                .zip(control.http.as_ref().and_then(|h| h.body_length))
                // An empty control body leaves the ratio undefined, so report None (NaN downstream)
                .and_then(|(p, c)| (c > 0).then(|| p as f32 / c as f32)),
            http_blockpage_match_score: blockpage_score,
            http_ttfb_ms_delta: probe.http.as_ref().and_then(|h| h.ttfb_ms)
                .zip(control.http.as_ref().and_then(|h| h.ttfb_ms))
                .map(|(p, c)| p - c),
            http_body_truncated: probe.http.as_ref()
                .map(|h| h.body_truncated)
                .unwrap_or(false),
        }
    }
}

The ControlDelta is not the feature vector — it is an intermediate representation. The per-layer feature extractors read from it, apply normalizations, and emit the final float32 values that go into the numpy array.
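
The normalization step that sits between ControlDelta and the feature vector can be sketched roughly as follows. This is a minimal illustration rather than the Voidly extractor itself: the helper names bool_feature and standardized_delta are hypothetical, and the sketch assumes that a Rust Option::None reaches Python as None.

```python
import math
from typing import Optional


def bool_feature(flag: bool) -> float:
    """Map a ControlDelta boolean to 0.0 or 1.0 (hypothetical helper)."""
    return 1.0 if flag else 0.0


def standardized_delta(delta_ms: Optional[float],
                       control_std: Optional[float]) -> float:
    """Return delta / std, or NaN when the delta or the control spread is unusable.

    A missing delta, a missing standard deviation, and a non-positive standard
    deviation all yield NaN, so "no data" is never encoded as 0.0.
    """
    if delta_ms is None or control_std is None or control_std <= 0:
        return math.nan
    return delta_ms / control_std


# A probe RTT 30ms above control, with a control spread of 10ms
z = standardized_delta(30.0, 10.0)
missing = standardized_delta(None, 10.0)
```

Keeping missing values as NaN, instead of substituting zero, preserves the distinction between "no difference from control" and "no control to compare against".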

DNS layer features (12 features)

DNS is the first layer where censorship is typically applied, because it is the cheapest to intercept at scale. An ISP that operates the recursive resolver for its subscribers can redirect any domain to a block page IP with a single config change and no hardware investment. DNS features are correspondingly the most predictive class in the feature set — SHAP analysis of the classifier puts dns_known_injected_ip alone in the top three features by mean absolute contribution.

# features/dns.py
import math
import numpy as np
from ipaddress import ip_address, ip_network

BOGON_NETWORKS = [
    ip_network('0.0.0.0/8'), ip_network('10.0.0.0/8'),
    ip_network('100.64.0.0/10'), ip_network('127.0.0.0/8'),
    ip_network('169.254.0.0/16'), ip_network('172.16.0.0/12'),
    ip_network('192.0.2.0/24'), ip_network('192.168.0.0/16'),
    ip_network('198.18.0.0/15'), ip_network('198.51.100.0/24'),
    ip_network('203.0.113.0/24'), ip_network('224.0.0.0/4'),
    ip_network('240.0.0.0/4'),
]

def is_bogon(ip_str: str) -> bool:
    try:
        addr = ip_address(ip_str)
        return any(addr in net for net in BOGON_NETWORKS)
    except ValueError:
        return False

def extract_dns_features(
    delta: ControlDelta,
    probe: ProbeResult,
    asn_db: MaxMindDB,
    injection_ips: frozenset[str],
    isp_resolver_asns: frozenset[int],
    control_asn: int | None,
) -> list[float]:
    dns = probe.dns
    answers = dns.answers or []
    ips = [a.ip for a in answers if a.ip]

    # f01: dns_nxdomain — NXDOMAIN where control resolved successfully
    dns_nxdomain = float(delta.dns_nxdomain_where_control_resolved)

    # f02: dns_no_answer — empty answer section, no explicit NXDOMAIN
    dns_no_answer = float(
        len(answers) == 0 and dns.failure is None and not delta.dns_nxdomain_where_control_resolved
    )

    # f03: dns_ip_in_control_set — True means DNS matched control (NOT censored at DNS layer)
    # A True value here is exculpatory evidence for DNS; censorship is indicated by False.
    dns_ip_in_control_set = float(delta.dns_ip_in_control_set)

    # f04: dns_all_ips_bogon — all returned IPs are non-routable (classic injection pattern)
    dns_all_ips_bogon = float(bool(ips) and all(is_bogon(ip) for ip in ips))

    # f05: dns_known_injected_ip — at least one IP is in the curated injection list
    # The injection list is maintained from confirmed censorship cases and updated weekly.
    dns_known_injected_ip = float(any(ip in injection_ips for ip in ips))

    # f06: dns_resolver_is_isp — the probe used an ISP-operated resolver, which is more
    # susceptible to censorship orders than a third-party resolver like 8.8.8.8
    resolver_asn = asn_db.asn_for_ip(dns.resolver_ip) if dns.resolver_ip else None
    dns_resolver_is_isp = float(resolver_asn in isp_resolver_asns if resolver_asn else False)

    # f07: dns_answer_count — number of A/AAAA records returned, normalized by 5.0
    # Large answer counts (CDN round-robin) are legitimate; 0 with no failure is suspicious.
    dns_answer_count = min(len(ips) / 5.0, 1.0)

    # f08: dns_ttl_min_log — log2 of minimum TTL; very short TTLs (< 30s) are an injection signal
    # Injection infrastructure often injects with TTL=1 to prevent caching of the false response.
    ttls = [a.ttl for a in answers if a.ttl is not None]
    if ttls:
        min_ttl = max(min(ttls), 1)  # floor at 1 to avoid log(0)
        dns_ttl_min_log = math.log2(min_ttl) / 17.0  # log2(131072) = 17; normalize to ~0–1
    else:
        dns_ttl_min_log = float('nan')

    # f09: dns_unique_asn_count — number of distinct ASNs in returned IPs
    # Legitimate CDNs return IPs in 1–3 ASNs. Injected responses often mix IPs from unrelated ASNs.
    ip_asns = [asn_db.asn_for_ip(ip) for ip in ips if ip]
    dns_unique_asn_count = float(len(set(a for a in ip_asns if a)))

    # f10: dns_cname_depth — depth of CNAME chain (0 if no CNAMEs)
    cnames = [a for a in answers if a.type == 'CNAME']
    dns_cname_depth = float(min(len(cnames), 5))

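    # f11: dns_all_ips_in_same_asn_as_control - all probe IPs are in same ASN as control
    # resolution, a strong exculpatory signal (same ASN = same infrastructure)
    # Only IPs with a known ASN are compared; if none resolve, the feature is NaN
    # (all() over an empty sequence would otherwise report a spurious match).
    known_asns = [a for a in ip_asns if a]
    if control_asn and known_asns:
        dns_all_ips_in_same_asn_as_control = float(all(a == control_asn for a in known_asns))
    else:
        dns_all_ips_in_same_asn_as_control = float('nan')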

    # f12: dns_response_ms - log-normalized DNS response time (2000ms maps to 1.0)
    # Unusually fast DNS response can indicate a spoofed reply arriving before the real one.
    if dns.response_ms is not None:
        dns_response_ms = math.log1p(dns.response_ms) / math.log1p(2000.0)
    else:
        dns_response_ms = float('nan')

    return [
        dns_nxdomain,           # f01
        dns_no_answer,          # f02
        dns_ip_in_control_set,  # f03
        dns_all_ips_bogon,      # f04
        dns_known_injected_ip,  # f05
        dns_resolver_is_isp,    # f06
        dns_answer_count,       # f07
        dns_ttl_min_log,        # f08
        dns_unique_asn_count,   # f09
        dns_cname_depth,        # f10
        dns_all_ips_in_same_asn_as_control,  # f11
        dns_response_ms,        # f12
    ]

The injection_ips set is loaded from a Parquet file updated weekly by a separate pipeline that scrapes confirmed injection IPs from OONI measurements, CensoredPlanet data, and community reports. At last update it contained 4,817 distinct IP addresses. Membership checks are O(1) on average because the set is a Python frozenset, so lookup cost stays negligible even at thousands of measurements per second.
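
Because the membership test compares strings, an address only matches if it is spelled the same way in the DNS answer and in the list. The following sketch shows one way to canonicalize both sides with the standard ipaddress module. The function names are hypothetical, and whether the Voidly pipeline already normalizes addresses upstream is not stated in this article.

```python
from ipaddress import ip_address
from typing import Iterable, Optional


def canonical_ip(text: str) -> Optional[str]:
    """Return the canonical textual form of an IP address, or None if unparsable."""
    try:
        return str(ip_address(text.strip()))
    except ValueError:
        return None


def build_injection_set(raw_ips: Iterable[str]) -> frozenset:
    """Canonicalize every entry once at load time; drop entries that do not parse."""
    return frozenset(ip for ip in map(canonical_ip, raw_ips) if ip is not None)


def any_known_injected(answer_ips: Iterable[str], injection_ips: frozenset) -> bool:
    """True if at least one answer IP is in the injection set after canonicalization."""
    return any(canonical_ip(ip) in injection_ips for ip in answer_ips)


# Documentation-range addresses used purely as placeholders
injection_ips = build_injection_set(["2001:DB8::1", "203.0.113.7", "not-an-ip"])
hit = any_known_injected(["2001:0db8:0000:0000:0000:0000:0000:0001"], injection_ips)
miss = any_known_injected(["198.51.100.1"], injection_ips)
```

The cost is one parse per answer IP, and the lookup itself remains a hash probe into the frozenset.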

TCP layer features (8 features)

TCP-layer interference typically takes the form of an injected RST segment, sent by a DPI device to terminate the connection after it has seen enough of the packet stream to identify the destination. The timing signature of an injected RST is distinctive: it arrives very quickly after the probe's SYN because the DPI device can act as soon as it sees the SYN, before the actual server even responds. A legitimate server RST (connection refused, application error) follows the natural RTT of the path.

# features/tcp.py
import math

def extract_tcp_features(
    delta: ControlDelta,
    probe: ProbeResult,
    control_rtt_mean: float | None,
    control_rtt_std: float | None,
) -> list[float]:
    tcp = probe.tcp

    # f13: tcp_connect_success — binary; False is a hard interference signal
    tcp_connect_success = float(delta.tcp_connect_success)

    # f14: tcp_rst_received — RST was received at any point in the handshake
    tcp_rst_received = float(delta.tcp_rst_received)

    # f15: tcp_rtt_ms — log-normalized observed RTT
    # RTT is log-normalized because the distribution is right-skewed (long-tail outliers
    # from genuinely distant servers shouldn't dominate the feature scale).
    if tcp is not None and tcp.rtt_ms is not None:
        tcp_rtt_ms = math.log1p(tcp.rtt_ms) / math.log1p(5000.0)
    else:
        tcp_rtt_ms = float('nan')

    # f16: tcp_rtt_delta_from_control — deviation from expected RTT, standardized
    # A probe RTT 3+ standard deviations above control RTT suggests an interposing device.
    if delta.tcp_rtt_delta_ms is not None and control_rtt_std is not None and control_rtt_std > 0:
        tcp_rtt_delta_from_control = delta.tcp_rtt_delta_ms / control_rtt_std
    else:
        tcp_rtt_delta_from_control = float('nan')

    # f17: tcp_rst_timing_ms — how quickly after SYN did the RST arrive?
    # Injected RSTs from DPI boxes arrive in < 15ms (before the real server can respond).
    # Legitimate application RSTs arrive after a full RTT (typically 50–500ms).
    # We clip to 0–500ms and log-normalize.
    if delta.tcp_rst_timing_ms is not None:
        tcp_rst_timing_ms = math.log1p(min(delta.tcp_rst_timing_ms, 500.0)) / math.log1p(500.0)
    else:
        tcp_rst_timing_ms = float('nan')

    # f18: tcp_syn_ack_count — number of SYN-ACK segments received
    # Normally exactly 1. More than 1 indicates asymmetric routing anomaly or
    # a middlebox that clones the SYN-ACK to perform in-path analysis.
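    if tcp is not None:
        # Default to 1 only when the count is missing; "or 1" would also turn a real 0 into 1
        syn_ack = 1 if tcp.syn_ack_count is None else tcp.syn_ack_count
        tcp_syn_ack_count = float(min(syn_ack, 5))
    else:
        tcp_syn_ack_count = float('nan')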

    # f19: tcp_connect_delta_ms - probe-minus-control latency delta, clipped and scaled to ±1
    # Large positive values indicate an interposing device adding latency.
    # Note: ControlDelta exposes only an RTT delta, so that value stands in for the
    # connect-time delta here; f16 is the same delta standardized by control variance.
    if tcp is not None and tcp.connect_ms is not None and delta.tcp_rtt_delta_ms is not None:
        # Clip the raw delta to ±2000ms, then normalize
        raw_delta = max(min(delta.tcp_rtt_delta_ms, 2000.0), -2000.0)
        tcp_connect_delta_ms = raw_delta / 2000.0
    else:
        tcp_connect_delta_ms = float('nan')

    # f20: tcp_timeout — connection attempt timed out entirely (no RST, no SYN-ACK)
    tcp_timeout = float(
        tcp is None or (not delta.tcp_connect_success and not delta.tcp_rst_received)
    )

    return [
        tcp_connect_success,        # f13
        tcp_rst_received,           # f14
        tcp_rtt_ms,                 # f15
        tcp_rtt_delta_from_control, # f16
        tcp_rst_timing_ms,          # f17
        tcp_syn_ack_count,          # f18
        tcp_connect_delta_ms,       # f19
        tcp_timeout,                # f20
    ]

The 15ms threshold for “fast RST” is empirically derived from a corpus of 140,000 confirmed-injection TCP measurements from the OONI archive. The distribution of RST timing for confirmed injections has a median of 6ms and a 95th percentile of 14ms. Legitimate connection refusals from reachable servers have a median RST timing of 82ms. The two distributions barely overlap.
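
To see where these timings land after the f17 transform, the clip-and-log normalization can be applied to the quoted percentiles. The sketch below reuses the 500ms clip from the extractor; the function name normalize_rst_timing and the constant names are illustrative only.

```python
import math

RST_CLIP_MS = 500.0   # same upper clip as the f17 computation
FAST_RST_MS = 15.0    # "fast RST" threshold quoted in the text


def normalize_rst_timing(rst_ms: float) -> float:
    """Clip to [0, 500] ms, then scale log1p(value) so that 500 ms maps to 1.0."""
    clipped = min(max(rst_ms, 0.0), RST_CLIP_MS)
    return math.log1p(clipped) / math.log1p(RST_CLIP_MS)


samples = [
    ("injection median", 6.0),
    ("injection p95", 14.0),
    ("fast-RST threshold", FAST_RST_MS),
    ("legitimate median", 82.0),
]
for label, ms in samples:
    print(f"{label:>20}: {ms:5.1f} ms -> {normalize_rst_timing(ms):.3f}")
```

On this scale the injection median sits near 0.31, the 15ms threshold near 0.45, and the legitimate median near 0.71. The log transform spends most of the feature range on the sub-100ms region where the two populations differ, which a linear scaling over 0 to 500ms would compress into the bottom fifth.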

TLS layer features (10 features)

TLS-layer interference is more expensive to deploy than DNS manipulation — it requires DPI hardware that can inspect the ClientHello, extract the SNI extension, and either inject a TCP RST or perform a certificate substitution. But it is the only option for operators trying to block TLS traffic where the DNS layer is not under their control (for example, when the user is using DoH or a third-party resolver). The feature set captures both the timing signature of SNI-triggered interference and the certificate substitution pattern used by MitM boxes.

# features/tls.py
import math

TLS_ALERT_CODES = {0: 0, 40: 1, 42: 2, 70: 3, 80: 4, 112: 5, 113: 6}
# 0=no alert, 40=handshake_failure, 42=bad_certificate, 70=protocol_version,
# 80=internal_error, 112=unrecognized_name (SNI not found),
# 113=bad_certificate_status_response (certificate_expired is alert 45, not 113)

def extract_tls_features(
    delta: ControlDelta,
    probe: ProbeResult,
    control_hs_mean: float | None,
    control_hs_std: float | None,
    known_mitm_fps: frozenset[str],
    govt_issuers: frozenset[str],
) -> list[float]:
    tls = probe.tls

    # f21: tls_handshake_success — binary; False is a hard interference signal if TCP succeeded
    tls_handshake_success = float(delta.tls_handshake_success)

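    # f22: tls_cert_matches_sni - the returned certificate's CN or SAN matches the requested domain
    # A mismatched cert almost always means a substitution (MitM box presenting its own cert).
    # Note: the comparison below is an exact string match, so a wildcard SAN such as
    # *.example.com is not treated as covering www.example.com.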
    if tls is not None and tls.cert_cn is not None and probe.domain is not None:
        tls_cert_matches_sni = float(
            tls.cert_cn == probe.domain or
            any(san == probe.domain for san in (tls.cert_sans or []))
        )
    else:
        tls_cert_matches_sni = float('nan')

    # f23: tls_cert_in_control_chain — the probe's leaf cert appears in control's cert chain
    # True = same certificate infrastructure; False when a MitM box substitutes its own cert.
    tls_cert_in_control_chain = float(delta.tls_cert_in_control_chain)

    # f24: tls_cert_is_self_signed — self-signed cert is a near-certain indicator of a MitM box
    tls_cert_is_self_signed = float(delta.tls_cert_is_self_signed)

    # f25: tls_cert_is_known_mitm — cert fingerprint is in the curated MITM cert library
    # The library is maintained from Censored Planet data and DPI hardware reverse-engineering.
    tls_cert_is_known_mitm = float(delta.tls_cert_is_mitm)

    # f26: tls_alert_code_ordinal - bucket index of the TLS alert code
    # Alert codes are nominal (40 is not "worse" than 42 in any ordinal sense), so the raw
    # integer is mapped to a compact bucket index; unlisted codes share one catch-all bucket.
    alert_bucket = TLS_ALERT_CODES.get(delta.tls_alert_code, len(TLS_ALERT_CODES))
    # Kept as a single column rather than one-hot encoded: tree splits can isolate
    # individual buckets, so full OHE is not needed.
    tls_alert_code_ordinal = float(alert_bucket)

    # f27: tls_handshake_ms - log-normalized handshake time
    if tls is not None and tls.handshake_ms is not None:
        tls_handshake_ms = math.log1p(tls.handshake_ms) / math.log1p(10_000.0)
    else:
        tls_handshake_ms = float('nan')

    # f28: tls_handshake_delta_from_control - standardized deviation from expected handshake time
    if delta.tls_handshake_ms_delta is not None and control_hs_std is not None and control_hs_std > 0:
        tls_handshake_delta_from_control = delta.tls_handshake_ms_delta / control_hs_std
    else:
        tls_handshake_delta_from_control = float('nan')

    # f29: tls_cert_valid_days_remaining - days until cert expiry, clipped to 0–365, normalized
    # A cert expiring immediately can indicate a hastily generated MitM cert.
    # Measured against the probe's measurement time rather than today's date, so that
    # re-extracting an archived measurement reproduces the same value.
    if tls is not None and tls.cert_not_after is not None and probe.measurement_ts is not None:
        days_remaining = (tls.cert_not_after - probe.measurement_ts.date()).days
        tls_cert_valid_days_remaining = max(0.0, min(float(days_remaining), 365.0)) / 365.0
    else:
        tls_cert_valid_days_remaining = float('nan')

    # f30: tls_cert_issuer_known_govt - issuer CN is in the known government CA list
    # Government-operated CAs have been used in Kazakhstan and Kyrgyzstan to sign MitM certs.
    tls_cert_issuer_known_govt = float(delta.tls_cert_issuer_known_govt)

    return [
        tls_handshake_success,          # f21
        tls_cert_matches_sni,           # f22
        tls_cert_in_control_chain,      # f23
        tls_cert_is_self_signed,        # f24
        tls_cert_is_known_mitm,         # f25
        tls_alert_code_ordinal,         # f26
        tls_handshake_ms,               # f27
        tls_handshake_delta_from_control,  # f28
        tls_cert_valid_days_remaining,  # f29
        tls_cert_issuer_known_govt,     # f30
    ]

The known-MITM certificate library is the most operationally intensive component of the TLS feature set. It currently contains 831 certificate fingerprints from confirmed or suspected MitM deployments. The library is maintained as a signed Parquet file in an S3 bucket and is hot-reloaded into the extraction service every 6 hours without a service restart — new additions propagate to all nodes within one polling cycle.
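
One detail of the tls_cert_matches_sni feature in the listing above deserves a note: it compares certificate names to the probe domain with ==, while many legitimate certificates carry wildcard names such as *.example.com. A wildcard-aware comparison might look like the sketch below. It is a simplified reading of the left-most-label wildcard rule from RFC 6125, it ignores internationalized names and partial-label wildcards, and the function names are hypothetical.

```python
from typing import Iterable, Optional


def hostname_matches(pattern: str, hostname: str) -> bool:
    """Compare one certificate name against a hostname, case-insensitively.

    A leading "*." matches exactly one left-most label, so "*.example.com"
    covers "www.example.com" but not "example.com" or "a.b.example.com".
    """
    pattern = pattern.rstrip(".").lower()
    hostname = hostname.rstrip(".").lower()
    if pattern == hostname:
        return True
    if pattern.startswith("*."):
        suffix = pattern[1:]  # e.g. ".example.com"
        if hostname.endswith(suffix):
            left = hostname[: -len(suffix)]
            return bool(left) and "." not in left
    return False


def cert_matches_sni(cert_cn: Optional[str],
                     cert_sans: Optional[Iterable[str]],
                     domain: str) -> bool:
    """True if the CN or any SAN covers the requested domain."""
    names = list(cert_sans or [])
    if cert_cn:
        names.append(cert_cn)
    return any(hostname_matches(name, domain) for name in names)


covered = cert_matches_sni(None, ["*.cdn.example.net"], "img.cdn.example.net")
```

Without wildcard handling, any domain served under a wildcard certificate would report a mismatch on both the censored and uncensored paths, which dilutes the feature instead of producing false positives on its own.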

HTTP layer features (12 features)

HTTP-layer features are the most diverse class because HTTP blocking takes many forms: an explicit block page response (identifiable by body content), a redirect to a government notice, a silent connection reset after the response headers are sent, or a protocol-level response code like 451. Scoring the probe body against the 2,300-entry block page corpus is the most expensive computation in the extraction pipeline because it requires a SimHash comparison against thousands of reference signatures.

# features/http.py
import math

def classify_status(code: int | None) -> float:
    """Bin HTTP status code into 5 ordinal categories."""
    if code is None:
        return 4.0  # timeout/no response
    if 200 <= code < 300:
        return 0.0  # 2xx success
    if 300 <= code < 400:
        return 1.0  # 3xx redirect
    if 400 <= code < 500:
        return 2.0  # 4xx client error (includes 451)
    if 500 <= code < 600:
        return 3.0  # 5xx server error
    return 4.0  # unexpected code

def extract_http_features(
    delta: ControlDelta,
    probe: ProbeResult,
    control_ttfb_mean: float | None,
    control_ttfb_std: float | None,
    control_body_len: int | None,
) -> list[float]:
    http = probe.http

    # f31: http_status_code - binned into 5 ordinal categories (2xx/3xx/4xx/5xx/no-response)
    status_code = http.status_code if http else None
    http_status_code = classify_status(status_code)

    # f32: http_status_matches_control - same status as the control measurement
    http_status_matches_control = float(delta.http_status_match)

    # f33: http_body_sha256_matches_control - exact body match with control
    # True is a strong exculpatory signal. False is necessary but not sufficient for blocking.
    http_body_sha256_matches_control = float(delta.http_body_sha256_match)

    # f34: http_blockpage_score - best SimHash match score against 2300-entry blockpage corpus
    # Score 0.0 = no similarity to any known block page; 1.0 = exact match.
    # Scores above 0.7 are treated as probable block pages; above 0.9 as confirmed.
    http_blockpage_score = float(delta.http_blockpage_match_score)

    # f35: http_body_length_ratio - probe body length / control body length, clipped to 0–2
    # A block page substitution dramatically changes body length. Ratio < 0.2 or > 1.5
    # warrants investigation even when body hash doesn't match a known block page.
    if delta.http_body_length_ratio is not None:
        http_body_length_ratio = min(max(delta.http_body_length_ratio, 0.0), 2.0)
    else:
        http_body_length_ratio = float('nan')

    # f36: http_redirect_count - number of redirects followed, clipped to 0–5
    # Unexpected redirects (probe follows 3 redirects where control followed 0) can indicate
    # a transparent proxy injecting a redirect to a block page.
    if http is not None:
        http_redirect_count = float(min(http.redirect_count or 0, 5))
    else:
        http_redirect_count = float('nan')

    # f37: http_ttfb_ms - time to first byte, log-normalized
    if http is not None and http.ttfb_ms is not None:
        http_ttfb_ms = math.log1p(http.ttfb_ms) / math.log1p(30_000.0)
    else:
        http_ttfb_ms = float('nan')

    # f38: http_ttfb_delta_from_control - standardized deviation from expected TTFB
    # Injected block pages are served from middlebox infrastructure with different RTTs
    # than the actual origin server, producing a distinctive TTFB delta.
    if delta.http_ttfb_ms_delta is not None and control_ttfb_std is not None and control_ttfb_std > 0:
        http_ttfb_delta_from_control = delta.http_ttfb_ms_delta / control_ttfb_std
    else:
        http_ttfb_delta_from_control = float('nan')

    # f39: http_response_ms - total HTTP response time, log-normalized
    if http is not None and http.response_ms is not None:
        http_response_ms = math.log1p(http.response_ms) / math.log1p(60_000.0)
    else:
        http_response_ms = float('nan')

    # f40: http_content_type_match - probe Content-Type header matches control
    # A block page substitution often returns text/html where the real content is JSON or binary.
    # NaN when the control Content-Type is unknown; assuming text/html would manufacture
    # matches against HTML block pages. Header keys are assumed to be lower-cased upstream.
    if http is not None and delta.control_content_type is not None:
        probe_ct = (http.headers or {}).get('content-type', '').split(';')[0].strip().lower()
        control_ct = delta.control_content_type.split(';')[0].strip().lower()
        http_content_type_match = float(probe_ct == control_ct)
    else:
        http_content_type_match = float('nan')

    # f41: http_server_header_match - Server header matches control measurement
    # A different server header (probe: nginx/1.18.0, control: AmazonS3) is a soft block signal.
    if http is not None and delta.control_server_header is not None:
        probe_server = (http.headers or {}).get('server', '')
        http_server_header_match = float(probe_server == delta.control_server_header)
    else:
        http_server_header_match = float('nan')

    # f42: http_body_truncated - body transfer terminated before Content-Length bytes were received
    # Some middleboxes inject a block page and then forcibly close the connection,
    # producing a truncated response at the TCP layer.
    http_body_truncated = float(delta.http_body_truncated)

    return [
        http_status_code,               # f31
        http_status_matches_control,    # f32
        http_body_sha256_matches_control,  # f33
        http_blockpage_score,           # f34
        http_body_length_ratio,         # f35
        http_redirect_count,            # f36
        http_ttfb_ms,                   # f37
        http_ttfb_delta_from_control,   # f38
        http_response_ms,               # f39
        http_content_type_match,        # f40
        http_server_header_match,       # f41
        http_body_truncated,            # f42
    ]

The SimHash blockpage comparison runs in the Rust extractor (not the Python feature extractor) because it is the most CPU-intensive step and must complete before the Python layer is even invoked. The Rust implementation uses a 256-bit SimHash over 3-grams of the body text, then computes Hamming distance against the corpus. The corpus is stored as a sorted Vec<u256> in a memory-mapped file; lookups use SIMD popcount for Hamming distance, which keeps the worst-case scan of all 2,300 corpus entries under 400 microseconds on a modern CPU.
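
For readers unfamiliar with SimHash, the idea can be sketched in a few lines of Python. This is an illustration of the general technique and not a port of the Rust code. Several choices here are assumptions: character 3-grams (the article does not say whether the 3-grams are characters or words), BLAKE2b as the per-gram hash, and a score of 1 minus the normalized Hamming distance.

```python
import hashlib
from typing import Iterable

BITS = 256


def simhash(text: str, n: int = 3) -> int:
    """Compute a 256-bit SimHash fingerprint over character n-grams."""
    counts = [0] * BITS
    # Texts shorter than n contribute a single (short) gram
    grams = [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]
    for gram in grams:
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=BITS // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each gram votes +1 or -1 on every bit position
        for bit in range(BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, total in enumerate(counts):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def best_match_score(body: str, corpus: Iterable[int]) -> float:
    """Best similarity in [0, 1] against a corpus of reference fingerprints."""
    fp = simhash(body)
    scores = [1.0 - hamming(fp, ref) / BITS for ref in corpus]
    return max(scores, default=0.0)


page = "<html><body>Access to this site is blocked</body></html>"
corpus = [simhash(page)]
exact = best_match_score(page, corpus)
```

Because similar documents share most of their n-grams, their per-bit vote totals mostly agree and the fingerprints differ in few bits. That property is what lets a linear scan with popcount stand in for a full text comparison.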

Cross-layer and metadata features (5 features)

The final five features capture information that spans multiple protocol layers or is external to the measurement itself. These are the features that the model can use to adjust for contextual factors — probe reliability, timing, and network-level conditions that affect what individual measurements mean.

# features/meta.py
import datetime
import math

MOBILE_ASN_PREFIXES = frozenset([
    # CAIDA-derived set of ASNs operated by mobile carriers
    # Regenerated quarterly from CAIDA's AS-classification dataset
    # Kept in a frozenset for O(1) lookup
    9541, 17974, 24029,  # (abbreviated — actual set has ~2800 entries)
])

def extract_meta_features(
    probe: ProbeResult,
    control_available: bool,
    asn_db: MaxMindDB,
) -> list[float]:
    ts = probe.measurement_ts  # UTC datetime

    # f43: measurement_round_trip_total_ms - wall-clock time for the complete probe run
    # (DNS + TCP + TLS + HTTP, sequential). Very short total time (<50ms) can indicate
    # early termination by a middlebox or probe-level networking failure.
    if probe.total_measurement_ms is not None:
        measurement_rtt_total = math.log1p(probe.total_measurement_ms) / math.log1p(120_000.0)
    else:
        measurement_rtt_total = float('nan')

    # f44: control_unreachable - control server itself was unavailable during this window.
    # When True, all control-comparison features will be NaN, but the fact of control
    # unavailability is itself informative (can correlate with BGP-level events).
    control_unreachable = float(not control_available)

    # f45: probe_is_mobile_asn - probe is on a mobile carrier ASN per CAIDA classification.
    # Mobile ASNs have structurally different baseline RTTs, DNS behaviors, and censorship
    # patterns than residential fixed broadband, so this is a useful conditioning feature.
    probe_asn = asn_db.asn_for_ip(probe.probe_ip) if probe.probe_ip else None
    probe_is_mobile_asn = float(probe_asn in MOBILE_ASN_PREFIXES if probe_asn else False)

    # f48: measurement_attempt_number — retry number (1, 2, or 3).
    # First attempts are more likely to reflect transient conditions.
    # Third attempts are more likely to reflect sustained interference (transient issues
    # tend to resolve between retries; persistent blocks do not).
    measurement_attempt_number = float(min(probe.attempt_number or 1, 3))

    # f49: is_weekend — Saturday or Sunday in the probe's local timezone.
    # Some censorship enforcement systems have documented day-of-week variation;
    # several countries relax censorship during weekends or apply it more aggressively
    # during weekday business hours when political events are more likely to be monitored.
    is_weekend = float(ts.weekday() >= 5) if ts else float('nan')

    return [
        measurement_rtt_total,      # f45
        control_unreachable,        # f46
        probe_is_mobile_asn,        # f47
        measurement_attempt_number, # f48
        is_weekend,                 # f49
    ]

# Note: the five cross-layer features bring the total to 12+8+10+12+5 = 47.
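As a quick illustration of the log scaling applied to the total-time feature in extract_meta_features, the sketch below reproduces the same formula in isolation. The helper name normalize_total_ms is hypothetical; the 120,000 ms ceiling is the constant from the listing above.

```python
import math

MAX_TOTAL_MS = 120_000.0  # normalization ceiling used in extract_meta_features


def normalize_total_ms(total_ms: float) -> float:
    """Log-compress a duration so that 0 ms maps to 0.0 and 120 s maps to 1.0."""
    return math.log1p(total_ms) / math.log1p(MAX_TOTAL_MS)


# Print a few sample points to see how the scale compresses long durations.
for ms in (0, 50, 1_000, 30_000, 120_000):
    print(f"{ms:>7} ms -> {normalize_total_ms(ms):.3f}")
```

The log scale spends most of its range on short durations, which is where the early-termination signal lives. The formula does not clamp, so a run longer than 120 s produces a value slightly above 1.0.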

The extraction pipeline

The five per-layer extractors are called sequentially and their outputs are concatenated into a single numpy float32 array. extract_features leaves NaN values in place; imputation is a separate step (impute_nans, below) that the inference service applies afterwards. The classifier can tolerate a limited number of missing values (XGBoost routes NaN inputs to a default child node), but a vector with more than 8 missing features crosses the abstain threshold and requires explicit handling at the inference service level.

# features/extract.py
import numpy as np
from typing import NamedTuple

FEATURE_NAMES = [
    # DNS (f01–f12)
    'dns_nxdomain',
    'dns_no_answer',
    'dns_ip_in_control_set',
    'dns_all_ips_bogon',
    'dns_known_injected_ip',
    'dns_resolver_is_isp',
    'dns_answer_count',
    'dns_ttl_min_log',
    'dns_unique_asn_count',
    'dns_cname_depth',
    'dns_all_ips_in_same_asn_as_control',
    'dns_response_ms',
    # TCP (f13–f20)
    'tcp_connect_success',
    'tcp_rst_received',
    'tcp_rtt_ms',
    'tcp_rtt_delta_from_control',
    'tcp_rst_timing_ms',
    'tcp_syn_ack_count',
    'tcp_connect_delta_ms',
    'tcp_timeout',
    # TLS (f21–f30)
    'tls_handshake_success',
    'tls_cert_matches_sni',
    'tls_cert_in_control_chain',
    'tls_cert_is_self_signed',
    'tls_cert_is_known_mitm',
    'tls_alert_code_ordinal',
    'tls_handshake_ms',
    'tls_handshake_delta_from_control',
    'tls_cert_valid_days_remaining',
    'tls_cert_issuer_known_govt',
    # HTTP (f31–f42)  [12 features]
    'http_status_code',
    'http_status_matches_control',
    'http_body_sha256_matches_control',
    'http_blockpage_score',
    'http_body_length_ratio',
    'http_redirect_count',
    'http_ttfb_ms',
    'http_ttfb_delta_from_control',
    'http_response_ms',
    'http_content_type_match',
    'http_server_header_match',
    'http_body_truncated',
    # Cross-layer/meta (f43–f47)  [5 features]
    'measurement_rtt_total',
    'control_unreachable',
    'probe_is_mobile_asn',
    'measurement_attempt_number',
    'is_weekend',
]

assert len(FEATURE_NAMES) == 47, f"Expected 47 features, got {len(FEATURE_NAMES)}"

# Per-feature imputation values for NaN replacement.
# Interference-related binary flags impute to their "no interference" value: 0.0 for
# interference indicators, 1.0 for success/match flags, and 0.5 where neither is a safe default.
# Continuous features and probe-metadata flags impute to the global training-set mean
# (for a flag, its training-set prevalence).
# These values are recomputed from training-corpus statistics each time the model is retrained.
IMPUTE_VALUES: dict[str, float] = {
    'dns_nxdomain': 0.0,
    'dns_no_answer': 0.0,
    'dns_ip_in_control_set': 0.5,     # truly unknown; 0.5 is neutral
    'dns_all_ips_bogon': 0.0,
    'dns_known_injected_ip': 0.0,
    'dns_resolver_is_isp': 0.5,
    'dns_answer_count': 0.4,           # training-set mean ≈ 2 IPs / 5.0
    'dns_ttl_min_log': 0.71,           # training-set mean ≈ log2(3600) / 17
    'dns_unique_asn_count': 1.2,
    'dns_cname_depth': 0.3,
    'dns_all_ips_in_same_asn_as_control': 0.5,
    'dns_response_ms': 0.18,
    'tcp_connect_success': 1.0,        # assume success when unknown
    'tcp_rst_received': 0.0,
    'tcp_rtt_ms': 0.22,
    'tcp_rtt_delta_from_control': 0.0,
    'tcp_rst_timing_ms': 0.5,
    'tcp_syn_ack_count': 1.0,
    'tcp_connect_delta_ms': 0.0,
    'tcp_timeout': 0.0,
    'tls_handshake_success': 1.0,
    'tls_cert_matches_sni': 1.0,
    'tls_cert_in_control_chain': 1.0,
    'tls_cert_is_self_signed': 0.0,
    'tls_cert_is_known_mitm': 0.0,
    'tls_alert_code_ordinal': 0.0,
    'tls_handshake_ms': 0.28,
    'tls_handshake_delta_from_control': 0.0,
    'tls_cert_valid_days_remaining': 0.7,
    'tls_cert_issuer_known_govt': 0.0,
    'http_status_code': 0.0,           # assume 2xx when unknown
    'http_status_matches_control': 1.0,
    'http_body_sha256_matches_control': 1.0,
    'http_blockpage_score': 0.0,
    'http_body_length_ratio': 1.0,
    'http_redirect_count': 0.0,
    'http_ttfb_ms': 0.24,
    'http_ttfb_delta_from_control': 0.0,
    'http_response_ms': 0.30,
    'http_content_type_match': 1.0,
    'http_server_header_match': 1.0,
    'http_body_truncated': 0.0,
    'measurement_rtt_total': 0.35,
    'control_unreachable': 0.0,
    'probe_is_mobile_asn': 0.22,       # training-set prevalence of mobile ASN probes
    'measurement_attempt_number': 1.0,
    'is_weekend': 0.29,
}


def extract_features(
    probe_result: ProbeResult,
    control_result: ControlResult | None,
    asn_db: MaxMindDB,
    injection_ips: frozenset[str],
    isp_resolver_asns: frozenset[int],
    blockpage_index: BlockpageIndex,
    control_stats: ControlStats | None,
) -> np.ndarray:
    """
    Transform a raw probe result and its paired control result into the 47-element
    float32 feature vector expected by the anomaly classifier.

    Args:
        probe_result: The raw measurement from the probe application.
        control_result: The most recent control measurement for this domain.
                        May be None if the control cache has no entry.
        asn_db: Thread-safe MaxMind GeoLite2 ASN database wrapper.
        injection_ips: Curated set of known DNS injection IP addresses.
        isp_resolver_asns: Set of ASNs operated by ISP-controlled DNS resolvers.
        blockpage_index: SimHash index over the 2300-entry block page corpus.
        control_stats: Per-domain control baseline statistics (mean/std for RTT, TTFB, etc.).
                       May be None for newly added domains.

    Returns:
        np.ndarray of shape (47,) with dtype float32.
        NaN values are left in place; the caller (inference service) handles
        imputation and enforces the abstain threshold.
    """
    control_available = control_result is not None

    if control_available:
        delta = ControlDelta.compute(probe_result, control_result)
    else:
        delta = ControlDelta.empty()  # all fields set to None/False

    dns_feats = extract_dns_features(
        delta=delta,
        probe=probe_result,
        asn_db=asn_db,
        injection_ips=injection_ips,
        isp_resolver_asns=isp_resolver_asns,
        control_asn=control_stats.dns_asn if control_stats else None,
    )

    tcp_feats = extract_tcp_features(
        delta=delta,
        probe=probe_result,
        control_rtt_mean=control_stats.tcp_rtt_mean if control_stats else None,
        control_rtt_std=control_stats.tcp_rtt_std if control_stats else None,
    )

    tls_feats = extract_tls_features(
        delta=delta,
        probe=probe_result,
        control_hs_mean=control_stats.tls_hs_mean if control_stats else None,
        control_hs_std=control_stats.tls_hs_std if control_stats else None,
        known_mitm_fps=KNOWN_MITM_FINGERPRINTS,
        govt_issuers=KNOWN_GOVT_ISSUERS,
    )

    http_feats = extract_http_features(
        delta=delta,
        probe=probe_result,
        control_ttfb_mean=control_stats.http_ttfb_mean if control_stats else None,
        control_ttfb_std=control_stats.http_ttfb_std if control_stats else None,
        control_body_len=control_result.http.body_length if control_result and control_result.http else None,
    )

    meta_feats = extract_meta_features(
        probe=probe_result,
        control_available=control_available,
        asn_db=asn_db,
    )

    raw = dns_feats + tcp_feats + tls_feats + http_feats + meta_feats
    assert len(raw) == 47, f"Feature count mismatch: {len(raw)}"

    vec = np.array(raw, dtype=np.float32)
    return vec


def impute_nans(vec: np.ndarray) -> tuple[np.ndarray, int]:
    """
    Replace NaN entries with feature-specific imputation values.
    Returns the imputed vector and the count of features that were NaN.
    The caller uses the NaN count to decide whether to abstain from classification.
    Note: vec is modified in place; pass vec.copy() to keep the raw vector.
    """
    nan_mask = np.isnan(vec)
    nan_count = int(nan_mask.sum())
    if nan_count > 0:
        for i, name in enumerate(FEATURE_NAMES):
            if nan_mask[i]:
                vec[i] = IMPUTE_VALUES[name]
    return vec, nan_count
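The abstain decision itself lives in the inference service and is not shown in this article. A minimal sketch of how the NaN count could gate classification follows, written with plain lists so it runs without numpy. The function name impute_or_abstain is hypothetical, and reading the threshold as "more than 8 missing" is an assumption.

```python
import math
from typing import Optional

MAX_MISSING = 8  # abstain threshold quoted in the article


def impute_or_abstain(
    vec: list[float],
    names: list[str],
    impute_values: dict[str, float],
    max_missing: int = MAX_MISSING,
) -> tuple[Optional[list[float]], int]:
    """Return (imputed copy, missing count), or (None, missing count) to abstain."""
    missing = sum(1 for v in vec if math.isnan(v))
    if missing > max_missing:
        # Too little real signal: refuse to classify rather than score defaults.
        return None, missing
    imputed = [impute_values[n] if math.isnan(v) else v for n, v in zip(names, vec)]
    return imputed, missing
```

Counting before imputing matters: once the NaNs are replaced, an all-default vector is indistinguishable from a clean measurement, so the count has to travel alongside the vector.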

Control cache design

The most expensive input to the feature extractor is the control measurement — the baseline that tells us what the domain looks like from an unobstructed connection. Without the control result, approximately 28 of the 47 features cannot be computed (all delta features, most binary comparison flags). But running a fresh control measurement synchronously with every probe measurement would double the infrastructure cost and add 100–800ms of round-trip latency to the feature extraction step. The solution is a per-domain LRU cache.

# control/cache.py
import asyncio
import time
from collections import OrderedDict
from threading import RLock
from typing import TypeVar, Generic

T = TypeVar('T')

class LRUCache(Generic[T]):
    """Thread-safe LRU cache with TTL expiry."""
    def __init__(self, max_size: int, ttl_seconds: float):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._cache: OrderedDict[str, tuple[T, float]] = OrderedDict()
        self._lock = RLock()

    def get(self, key: str) -> T | None:
        with self._lock:
            if key not in self._cache:
                return None
            value, inserted_at = self._cache[key]
            if time.monotonic() - inserted_at > self._ttl:
                del self._cache[key]
                return None
            # Move to end (most recently used)
            self._cache.move_to_end(key)
            return value

    def put(self, key: str, value: T) -> None:
        with self._lock:
            if key in self._cache:
                self._cache.move_to_end(key)
            self._cache[key] = (value, time.monotonic())
            if len(self._cache) > self._max_size:
                self._cache.popitem(last=False)  # evict LRU


# The control cache: keyed on domain name, holds the most recent ControlResult.
# 10,000 entries × ~4KB per control result = ~40MB peak memory footprint.
# 15-minute TTL: control measurements are re-fetched if the cached entry is older than 15 minutes.
# The 15-minute TTL balances staleness risk against control server load:
# - Too short (< 5 min): excessive control server load; most domains don't change that fast.
# - Too long (> 60 min): stale baselines during genuine routing changes that are not censorship.
CONTROL_CACHE: LRUCache[ControlResult] = LRUCache(
    max_size=10_000,
    ttl_seconds=15 * 60,
)

# Background task that proactively refreshes cache entries for high-volume domains.
# Without proactive refresh, high-traffic domains would experience thundering-herd
# cache misses when their TTL expires simultaneously for many in-flight measurements.

async def background_control_refresher(
    domains: list[str],
    control_client: ControlServerClient,
    cache: LRUCache[ControlResult],
    refresh_interval_s: float = 60.0,
    ttl_seconds: float = 15 * 60,
    refresh_margin_s: float = 3 * 60,
):
    """
    Proactively refresh control measurements for the top-N most-probed domains
    before their cache entries expire. Runs as an asyncio task on the inference node.
    """
    # cache.get() cannot report an entry's age (and would bump its LRU position),
    # so the refresher tracks its own last-refresh time per domain.
    last_refresh: dict[str, float] = {}
    while True:
        for domain in domains:
            age = time.monotonic() - last_refresh.get(domain, float('-inf'))
            # Only refresh if the entry is within refresh_margin_s (3 minutes) of expiry
            if age >= ttl_seconds - refresh_margin_s:
                result = await control_client.measure(domain)
                cache.put(domain, result)
                last_refresh[domain] = time.monotonic()
        await asyncio.sleep(refresh_interval_s)

The cache holds up to 10,000 domain entries. The Voidly test list currently contains approximately 80 domains per country per probe run, and the collector serves ~37 vantage nodes, so the working set is well under the cache capacity even when all per-country supplemental domains are included. The 15-minute TTL was set by measuring the autocorrelation of legitimate control measurements over time: for 94% of domains, the control measurement at T+15min is within 10% of the control measurement at T+0. Beyond 15 minutes, drift becomes meaningful for CDN-served domains, where the control node may be assigned a different edge server between measurements.
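The stability figure behind the TTL choice can be computed with a few lines. The sketch below assumes each domain contributes one pair of scalar control values (for example TTFB in ms at T+0 and at T+15min); the article does not say which metric was used, and the function name fraction_stable is hypothetical.

```python
def fraction_stable(pairs: list[tuple[float, float]], tolerance: float = 0.10) -> float:
    """Share of (value at T+0, value at T+15min) pairs within `tolerance` of the T+0 value."""
    if not pairs:
        raise ValueError("need at least one pair")
    stable = 0
    for baseline, later in pairs:
        if baseline == 0:
            # Relative change is undefined at zero; count only an exact match as stable.
            ok = later == 0
        else:
            ok = abs(later - baseline) / abs(baseline) <= tolerance
        stable += ok
    return stable / len(pairs)


# Four made-up domains: two drift by 5%, two drift by 25% or more.
sample = [(100.0, 105.0), (100.0, 125.0), (200.0, 190.0), (50.0, 80.0)]
print(fraction_stable(sample))
```

Repeating the calculation for several candidate lags (5, 15, 30, 60 minutes) gives the curve from which a TTL can be read off.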

Feature stability and schema versioning

The feature schema — which fields exist, in which order, and with which normalization — is a contract between the extractor and the model. The model is trained on a specific schema version; if the extractor produces a vector with a different number of features, or in a different order, or with a different normalization for a particular feature, the model produces garbage silently. There is no type system or schema validation at the numpy array level to catch a mismatch.

To prevent silent mismatches, every probe measurement record includes a feature_schema_version field that the extractor writes at extraction time. The inference service rejects any measurement whose schema version does not match the version the active model was trained on:

# Extractor writes this into every ProbeFeatures record:
CURRENT_FEATURE_SCHEMA_VERSION = "v2025-05-20"

# Inference service checks on every request:
def validate_schema_version(features_schema_version: str, model_version: str) -> None:
    expected = MODEL_SCHEMA_REQUIREMENTS[model_version]
    if features_schema_version != expected:
        raise SchemaVersionMismatch(
            f"Model {model_version} requires schema {expected}, "
            f"got {features_schema_version}"
        )
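For completeness, here is a self-contained version of the check that can be run as-is. The exception class body and the registry contents are assumptions: the model version names and the older schema version are made up for illustration, and only "v2025-05-20" comes from the snippet above.

```python
class SchemaVersionMismatch(Exception):
    """Raised when a feature vector was built with a schema the model was not trained on."""


# Hypothetical registry: model version -> feature schema version it was trained on.
MODEL_SCHEMA_REQUIREMENTS = {
    "clf-example-a": "v2025-05-20",
    "clf-example-b": "v2025-01-15",
}


def validate_schema_version(features_schema_version: str, model_version: str) -> None:
    expected = MODEL_SCHEMA_REQUIREMENTS[model_version]
    if features_schema_version != expected:
        raise SchemaVersionMismatch(
            f"Model {model_version} requires schema {expected}, "
            f"got {features_schema_version}"
        )


validate_schema_version("v2025-05-20", "clf-example-a")  # matching versions pass silently

try:
    validate_schema_version("v2025-05-20", "clf-example-b")
except SchemaVersionMismatch as exc:
    print(f"rejected: {exc}")
```

Note that an unknown model version raises KeyError from the registry lookup rather than SchemaVersionMismatch, which keeps "model not registered" distinguishable from "schema differs".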

Adding a new feature requires bumping the schema version, retraining the model from scratch on a full historical corpus re-extraction with the new feature, and running the champion/challenger protocol in the inference API to promote the new model. This is intentionally heavyweight. The friction discourages adding features for marginal gain — each addition incurs a full retraining cycle, which takes approximately 6 hours on the training infrastructure and requires a week of shadow-mode validation before the model can go to production.

The champion/challenger framework in the inference API is what makes new feature additions operationally safe. The new model runs in shadow mode alongside the current champion for 7 days before any traffic is shifted to it. If the candidate's F1 is not at least 0.5 percentage points above the champion's on the labeled probe stream, it is retired and the schema change is rolled back. Neither the extractor change nor the model change takes effect until the challenger wins promotion — so at no point does the production traffic path see a mismatched extractor and model.
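The promotion rule reduces to a single comparison. A minimal sketch, assuming F1 scores are expressed as fractions in [0, 1] and measured over the same shadow window; the function name should_promote is hypothetical.

```python
def should_promote(champion_f1: float, challenger_f1: float, min_gain_pp: float = 0.5) -> bool:
    """True when the challenger's F1 exceeds the champion's by at least min_gain_pp percentage points."""
    gain_pp = (challenger_f1 - champion_f1) * 100.0
    return gain_pp >= min_gain_pp


print(should_promote(0.900, 0.910))  # gain of about 1.0 pp
print(should_promote(0.900, 0.902))  # gain of about 0.2 pp
```

Working in percentage points rather than relative improvement keeps the bar the same regardless of how high the champion's F1 already is.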

For the raw measurement schema these features are derived from: The Voidly measurement dataset: field-by-field schema reference →

For the classifier that consumes these features: The Voidly anomaly classifier: five interference classes, gradient boosted trees, and why we optimize for recall →

For the inference API that serves predictions using these features: Voidly's real-time inference API: classifying censorship measurements at 50ms →

For the URL test list that determines which domains generate these measurements: Voidly's URL test list: how we curate the domains that reveal internet censorship →