Technical writing

Voidly's Measurement Database: 2.2B Probe Results in TimescaleDB

March 1, 2025· 15 min read· AI Analytics

CensorshipVoidlyTimescaleDBInfrastructurePostgreSQL

Voidly's measurement database stores 2.2 billion probe results spanning five years of continuous internet censorship monitoring from 37+ global vantage points. Every result — a DNS resolution, TCP connection attempt, TLS handshake, or HTTP transaction — is a row with a timestamp, a country code, an ASN, a probe ID, and a set of per-protocol result fields. This article covers how we store, compress, aggregate, and query this data at the speeds that anomaly detection requires.

The choice of TimescaleDB over InfluxDB, Cassandra, or plain PostgreSQL reflects specific requirements: we need strong SQL (joins, window functions, GROUP BY with complex predicates), we need PostgreSQL extensions (PostGIS for geographic clustering, pg_vector for embedding similarity), and we need time-series optimizations (automatic partitioning, transparent compression) without a new query language. TimescaleDB delivers all three as a PostgreSQL extension.

Why not the alternatives

Database        Reason rejected
───────────────────────────────────────────────────────────────────────────────
InfluxDB 3.x    No SQL joins; can't join measurements to our entity table.
                Flux / InfluxQL query languages require re-training all analysts.

Cassandra       Write-optimized but read patterns are variable (range scans,
                per-country aggregations, cross-country correlations). Cassandra
                is hostile to ad-hoc queries; partition key design locks us in.

ClickHouse      Strong candidate, nearly chose it. Rejected because:
                (a) poor support for UPDATE (stale flag on low-quality measurements),
                (b) no PostGIS, (c) materialized view refresh semantics differ
                enough from PostgreSQL that the team would need retraining.

Plain           Would work for 5 years of data but degrades past ~500M rows
PostgreSQL      without partitioning. Manual partition management is error-prone.
                TimescaleDB automates this transparently.

Hypertable design

The measurements table is a TimescaleDB hypertable partitioned by time with 1-day chunks. Secondary partitioning by country code reduces cross-chunk scans for per-country queries:

CREATE TABLE measurements (
    -- Identity
    measurement_id  UUID        NOT NULL DEFAULT gen_random_uuid(),
    probe_id        TEXT        NOT NULL,   -- probe operator identifier (hashed)
    vantage_asn     INTEGER     NOT NULL,
    vantage_country CHAR(2)     NOT NULL,   -- ISO 3166-1 alpha-2

    -- What was measured
    target_url      TEXT        NOT NULL,
    target_domain   TEXT        NOT NULL,
    target_country  CHAR(2),               -- derived from target IP geolocation
    test_protocol   test_protocol_enum NOT NULL,  -- dns|tcp|tls|http|https

    -- When
    measured_at     TIMESTAMPTZ NOT NULL,
    probe_version   SEMVER      NOT NULL,

    -- Result fields (nullable — not all protocols populate all fields)
    dns_resolved        INET[],
    dns_error_code      TEXT,
    tcp_connected       BOOLEAN,
    tcp_connect_ms      SMALLINT,
    tcp_error           TEXT,
    tls_handshake_ok    BOOLEAN,
    tls_cert_valid      BOOLEAN,
    tls_cert_sni        TEXT,
    tls_alert_code      SMALLINT,
    http_status_code    SMALLINT,
    http_body_sha256    BYTEA,
    blockpage_match_id  INTEGER     REFERENCES blockpage_fingerprints(id),

    -- Classification (set by the anomaly classifier, updated async)
    interference_type   interference_type_enum,  -- none|dns_tampering|...
    interference_prob   REAL,
    confidence_tier     confidence_tier_enum,    -- anomaly|corroborated|verified
    classifier_version  SEMVER,

    -- Quality
    control_reachable   BOOLEAN,   -- was control server reachable during this probe?
    inference_dropped   TEXT,      -- non-null = why this measurement was dropped

    PRIMARY KEY (measured_at, measurement_id)
);

-- Convert to hypertable: 1-day chunks, secondary partition by vantage_country
SELECT create_hypertable(
    'measurements',
    'measured_at',
    chunk_time_interval => INTERVAL '1 day',
    number_partitions => 4,    -- space partition by hash(vantage_country)
);

The 1-day chunk interval is a balance between query performance and operational overhead. Smaller chunks (e.g., 6 hours) would improve range scan performance for short time windows but create more internal PostgreSQL overhead (more catalog entries, more parallel workers for cross-chunk queries). At 2.4 million measurements per day, a 1-day chunk is ~2–3 GB uncompressed — large enough to amortize chunk overhead, small enough to fit in working memory for sorting.

Compression

Chunks older than 7 days are automatically compressed. TimescaleDB's columnar compression uses a combination of delta encoding (for timestamps), dictionary encoding (for country codes, probe IDs, enum columns), and LZ4 for the remainder.

-- Compression configuration
ALTER TABLE measurements SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'vantage_country, test_protocol',
    timescaledb.compress_orderby = 'measured_at DESC, target_domain'
);

-- Schedule automatic compression after 7 days
SELECT add_compression_policy('measurements', INTERVAL '7 days');

-- Current compression ratios (production)
-- Column             Ratio   Method
-- measured_at        12:1    delta encoding
-- vantage_country    40:1    dictionary (only 200 distinct values)
-- test_protocol      50:1    dictionary (5 values)
-- target_domain      8:1     dictionary + LZ4
-- http_body_sha256   1.1:1   already random (SHA256 bytes don't compress)
-- interference_type  60:1    dictionary (7 values)
--
-- Overall compression: ~6.2x
-- Uncompressed: ~14TB
-- Compressed:   ~2.3TB (current database size)

The compress_segmentby choice is important. Rows with the samevantage_country and test_protocol are grouped together in compressed column stores, which dramatically improves the compression ratio for enum-like columns and also improves query performance because per-country scans can skip unrelated segments.

Continuous aggregates

The most common analytical queries are time-windowed aggregations: "what is the daily average interference probability for country X over the last 90 days?", "how many measurements showed DNS tampering for this domain this week?" These would be expensive to compute on demand from 2.2B raw rows. Continuous aggregates pre-compute these as materialized views that TimescaleDB refreshes incrementally as new data arrives.

-- Country-level daily summary (most used aggregate)
CREATE MATERIALIZED VIEW country_daily_summary
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 day', measured_at)   AS day,
    vantage_country                      AS country,
    test_protocol,
    COUNT(*)                             AS total_measurements,
    COUNT(*) FILTER (WHERE control_reachable AND inference_dropped IS NULL)
                                         AS valid_measurements,
    AVG(interference_prob)               AS avg_interference_prob,
    COUNT(*) FILTER (WHERE interference_type != 'none')
                                         AS interference_count,
    COUNT(DISTINCT target_domain)        AS distinct_domains_tested,
    COUNT(DISTINCT probe_id)             AS distinct_probes
FROM measurements
WHERE inference_dropped IS NULL
GROUP BY 1, 2, 3;

-- Refresh policy: update whenever new measurements arrive, with 1hr lag
SELECT add_continuous_aggregate_policy('country_daily_summary',
    start_offset => INTERVAL '30 days',  -- recompute back 30 days (late arrivals)
    end_offset   => INTERVAL '1 hour',   -- don't include last hour (partial day)
    schedule_interval => INTERVAL '1 hour'
);

-- Additional aggregates
-- domain_daily_summary: per-domain interference over time (for test list curation)
-- asn_hourly_summary:   per-ASN hourly counts (for real-time anomaly detection)
-- probe_daily_health:   per-probe daily valid measurement count (for probe health)

The start_offset = 30 days ensures late-arriving measurements from probes with poor connectivity are incorporated. A probe in a censored environment may buffer measurements locally and upload them in batch when connectivity is restored, potentially hours to days after the measurement was taken. Without this lookback, those measurements would never appear in the aggregate.

Retention tiers

Not all data needs to be available at the same speed forever. We implement a three-tier retention policy:

Tier        Age              Storage          Query latency
──────────────────────────────────────────────────────────────────
Hot         0–7 days         Uncompressed      < 100ms for country queries
Warm        7 days–1 year    Compressed        < 500ms for country queries
Cold        > 1 year         S3 Parquet        Minutes (Athena/DuckDB on demand)

-- Hot→Warm: automatic via TimescaleDB compression policy (7 days)
-- Warm→Cold: custom retention policy exports compressed chunks to S3

CREATE OR REPLACE PROCEDURE expire_old_chunks() AS $$
BEGIN
    -- Drop chunks older than 1 year from the live database
    -- (data already exported to S3 Parquet before this runs)
    PERFORM drop_chunks('measurements', older_than => INTERVAL '1 year');
END;
$$ LANGUAGE plpgsql;

-- Run nightly at 03:00 UTC
SELECT add_job('expire_old_chunks', '1 day', initial_start => '2025-01-01 03:00:00+00');

Before the warm→cold export runs, we export compressed chunks to S3 as Parquet files using a custom pg_parquet export worker. The exported Parquet files are the same files published to HuggingFace as the CC BY 4.0 public dataset — the public release and the archival export share the same export pipeline.

Indexes

TimescaleDB automatically creates a per-chunk index on the time column. We add the following additional indexes:

-- Selective indexes (built on each new chunk automatically via a hook)

-- Fast lookup of all measurements for a specific domain in a time range
CREATE INDEX idx_meas_domain_time
    ON measurements (target_domain, measured_at DESC)
    WHERE inference_dropped IS NULL;

-- Fast lookup for the anomaly detector: high-interference measurements
CREATE INDEX idx_meas_interference
    ON measurements (vantage_country, measured_at DESC, interference_prob)
    WHERE interference_prob > 0.5;

-- Probe health monitoring: per-probe lookups
CREATE INDEX idx_meas_probe
    ON measurements (probe_id, measured_at DESC);

-- Blockpage match lookups (for fingerprint coverage analysis)
CREATE INDEX idx_meas_blockpage
    ON measurements (blockpage_match_id, measured_at DESC)
    WHERE blockpage_match_id IS NOT NULL;

-- Note: no index on vantage_asn alone — this is covered by the
-- space partitioning on vantage_country, which reduces scan scope enough.

Index creation on compressed chunks is a known limitation of TimescaleDB: indexes are per-chunk, and creating them on historical compressed chunks requires decompression. We pre-create indexes on new chunks via a TimescaleDB chunk creation hook before data is inserted, so new chunks always have indexes. Historical re-indexing is done during maintenance windows.

Query patterns and performance

The three most frequent query patterns, with production performance on the current 2.2B-row dataset:

1. Country anomaly window

-- "Did interference in Iran increase in the last 6 hours?"
-- Hits the asn_hourly_summary continuous aggregate, not raw table

SELECT
    bucket,
    SUM(interference_count)::float / NULLIF(SUM(valid_measurements), 0)
        AS interference_rate
FROM asn_hourly_summary
WHERE country = 'IR'
  AND bucket > NOW() - INTERVAL '24 hours'
GROUP BY bucket
ORDER BY bucket;

-- Query time: 4ms (aggregate, no raw scan)

2. Domain blocking history

-- "How has blocking of twitter.com varied by country over 90 days?"

SELECT
    day,
    country,
    SUM(interference_count) AS blocked_measurements,
    SUM(valid_measurements) AS total_valid
FROM country_daily_summary_per_domain   -- domain-specific continuous aggregate
WHERE target_domain = 'twitter.com'
  AND day > NOW() - INTERVAL '90 days'
GROUP BY 1, 2
ORDER BY 1, 2;

-- Query time: 18ms (90 days × ~200 countries from aggregate)

3. Raw measurement detail for incident verification

-- "Show me the raw probe results during the Iran Twitter block on 2025-01-15"

SELECT
    measured_at,
    probe_id,
    vantage_asn,
    dns_resolved,
    tcp_connected,
    tls_handshake_ok,
    http_status_code,
    interference_type,
    interference_prob
FROM measurements
WHERE vantage_country = 'IR'
  AND target_domain = 'twitter.com'
  AND measured_at BETWEEN '2025-01-15 00:00:00+00' AND '2025-01-15 23:59:59+00'
  AND inference_dropped IS NULL
ORDER BY measured_at;

-- Query time: 140ms
-- (single-day chunk, ~12K matching rows in that day for this country+domain)

Write path

Probe measurements arrive at a Cloudflare Worker, are validated, and are pushed to a Kafka topic. A Rust consumer (running on a dedicated EC2 instance) reads from Kafka in batches of 1,000 measurements and bulk-inserts to TimescaleDB usingCOPY FROM STDIN (binary format, ~40% faster than INSERT for bulk loads).

// Bulk insert via PostgreSQL binary COPY
async fn bulk_insert(
    client: &mut Client,
    measurements: &[Measurement],
) -> Result<u64> {
    let writer = client.copy_in("COPY measurements FROM STDIN BINARY").await?;
    let mut writer = BinaryCopyInWriter::new(writer, &MEASUREMENT_TYPES);

    for m in measurements {
        writer.as_pin_mut().write_raw([
            &m.measurement_id as &dyn ToSql,
            &m.probe_id,
            &m.vantage_asn,
            &m.vantage_country,
            &m.target_url,
            // ... all 24 columns
        ]).await?;
    }

    Ok(writer.finish().await?)
}

// Throughput: 45,000 rows/second sustained (single consumer thread)
// Peak ingest during major censorship events: 120,000 rows/second
// (3 consumer instances scaled out automatically)

The classifier is async

Interference classification (the interference_type,interference_prob, and confidence_tier columns) is not set at insert time. Probes write raw measurements without classification; the anomaly classifier runs as a separate async job that reads unclassified measurements in batches, scores them, and updates the rows with UPDATE … WHERE classifier_version IS NULL.

This separation means the write path is never blocked by classification latency, and classification can be re-run retrospectively when the model is updated — a new model version can backfill classifications on historical data without reprocessing the full raw measurement.

Sizing and cost

Instance:           r6g.2xlarge (8 vCPU, 64GB RAM, ARM Graviton3)
Storage:            2× 4TB io2 EBS (RAID-1, 64,000 IOPS each)
Current DB size:    2.3TB compressed (≈ 14TB uncompressed equivalent)
Monthly cost:       ~$980 (instance) + ~$1,200 (EBS) = ~$2,180/mo

Write throughput:   45K rows/sec sustained, 120K/sec peak
Read concurrency:   60 simultaneous queries before latency degrades (p99 > 500ms)
Cache hit rate:     91% (hot data fits in 64GB RAM for uncompressed recent chunks)

The r6g.2xlarge at 64GB RAM is sized to hold 90 days of uncompressed active data in the OS page cache. The 91% cache hit rate means the vast majority of anomaly detection queries (which read recent data) are served from memory, not disk. This is the dominant performance factor — not query optimization.

For the MCP server that exposes TimescaleDB query results to AI agents and Claude, built on top of this measurement store: The Voidly MCP server: 83 censorship query tools for Claude and GPT →

For the public HuggingFace dataset exports that come from this same TimescaleDB via the Parquet export pipeline: The Voidly open datasets on HuggingFace: structure, daily snapshots, and filter recipes →

For the Cloudflare D1-based approach we took for the Federal Regulatory Data Hub (contrasting with the TimescaleDB approach here): Building the Federal Regulatory Data Hub on Cloudflare D1: 50M+ records at the edge →

For the corroboration engine that reads from this database in real time — tokio::join! parallel fetches, adaptive OONI polling, in-memory CensoredPlanet daily index, and independence-weighted source agreement: Voidly's real-time corroboration engine: fetching, aligning, and merging OONI, CensoredPlanet, and IODA data →

For the three-tier retention policy that governs how long data stays in this database — hot/warm/cold tiers, compression DDL, and R2 cold storage: Voidly's measurement retention policy: hot, warm, and cold tiers for 2.2B probe results →