Technical writing
Voidly's Measurement Database: 2.2B Probe Results in TimescaleDB
Voidly's measurement database stores 2.2 billion probe results spanning five years of continuous internet censorship monitoring from 37+ global vantage points. Every result — a DNS resolution, TCP connection attempt, TLS handshake, or HTTP transaction — is a row with a timestamp, a country code, an ASN, a probe ID, and a set of per-protocol result fields. This article covers how we store, compress, aggregate, and query this data at the speeds that anomaly detection requires.
The choice of TimescaleDB over InfluxDB, Cassandra, or plain PostgreSQL reflects specific requirements: we need strong SQL (joins, window functions, GROUP BY with complex predicates), we need PostgreSQL extensions (PostGIS for geographic clustering, pg_vector for embedding similarity), and we need time-series optimizations (automatic partitioning, transparent compression) without a new query language. TimescaleDB delivers all three as a PostgreSQL extension.
Why not the alternatives
Database Reason rejected
───────────────────────────────────────────────────────────────────────────────
InfluxDB 3.x No SQL joins; can't join measurements to our entity table.
Flux / InfluxQL query languages require re-training all analysts.
Cassandra Write-optimized but read patterns are variable (range scans,
per-country aggregations, cross-country correlations). Cassandra
is hostile to ad-hoc queries; partition key design locks us in.
ClickHouse Strong candidate, nearly chose it. Rejected because:
(a) poor support for UPDATE (stale flag on low-quality measurements),
(b) no PostGIS, (c) materialized view refresh semantics differ
enough from PostgreSQL that the team would need retraining.
Plain Would work for 5 years of data but degrades past ~500M rows
PostgreSQL without partitioning. Manual partition management is error-prone.
TimescaleDB automates this transparently.Hypertable design
The measurements table is a TimescaleDB hypertable partitioned by time with 1-day chunks. Secondary partitioning by country code reduces cross-chunk scans for per-country queries:
CREATE TABLE measurements (
-- Identity
measurement_id UUID NOT NULL DEFAULT gen_random_uuid(),
probe_id TEXT NOT NULL, -- probe operator identifier (hashed)
vantage_asn INTEGER NOT NULL,
vantage_country CHAR(2) NOT NULL, -- ISO 3166-1 alpha-2
-- What was measured
target_url TEXT NOT NULL,
target_domain TEXT NOT NULL,
target_country CHAR(2), -- derived from target IP geolocation
test_protocol test_protocol_enum NOT NULL, -- dns|tcp|tls|http|https
-- When
measured_at TIMESTAMPTZ NOT NULL,
probe_version SEMVER NOT NULL,
-- Result fields (nullable — not all protocols populate all fields)
dns_resolved INET[],
dns_error_code TEXT,
tcp_connected BOOLEAN,
tcp_connect_ms SMALLINT,
tcp_error TEXT,
tls_handshake_ok BOOLEAN,
tls_cert_valid BOOLEAN,
tls_cert_sni TEXT,
tls_alert_code SMALLINT,
http_status_code SMALLINT,
http_body_sha256 BYTEA,
blockpage_match_id INTEGER REFERENCES blockpage_fingerprints(id),
-- Classification (set by the anomaly classifier, updated async)
interference_type interference_type_enum, -- none|dns_tampering|...
interference_prob REAL,
confidence_tier confidence_tier_enum, -- anomaly|corroborated|verified
classifier_version SEMVER,
-- Quality
control_reachable BOOLEAN, -- was control server reachable during this probe?
inference_dropped TEXT, -- non-null = why this measurement was dropped
PRIMARY KEY (measured_at, measurement_id)
);
-- Convert to hypertable: 1-day chunks, secondary partition by vantage_country
SELECT create_hypertable(
'measurements',
'measured_at',
chunk_time_interval => INTERVAL '1 day',
number_partitions => 4, -- space partition by hash(vantage_country)
);The 1-day chunk interval is a balance between query performance and operational overhead. Smaller chunks (e.g., 6 hours) would improve range scan performance for short time windows but create more internal PostgreSQL overhead (more catalog entries, more parallel workers for cross-chunk queries). At 2.4 million measurements per day, a 1-day chunk is ~2–3 GB uncompressed — large enough to amortize chunk overhead, small enough to fit in working memory for sorting.
Compression
Chunks older than 7 days are automatically compressed. TimescaleDB's columnar compression uses a combination of delta encoding (for timestamps), dictionary encoding (for country codes, probe IDs, enum columns), and LZ4 for the remainder.
-- Compression configuration
ALTER TABLE measurements SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'vantage_country, test_protocol',
timescaledb.compress_orderby = 'measured_at DESC, target_domain'
);
-- Schedule automatic compression after 7 days
SELECT add_compression_policy('measurements', INTERVAL '7 days');
-- Current compression ratios (production)
-- Column Ratio Method
-- measured_at 12:1 delta encoding
-- vantage_country 40:1 dictionary (only 200 distinct values)
-- test_protocol 50:1 dictionary (5 values)
-- target_domain 8:1 dictionary + LZ4
-- http_body_sha256 1.1:1 already random (SHA256 bytes don't compress)
-- interference_type 60:1 dictionary (7 values)
--
-- Overall compression: ~6.2x
-- Uncompressed: ~14TB
-- Compressed: ~2.3TB (current database size)The compress_segmentby choice is important. Rows with the samevantage_country and test_protocol are grouped together in compressed column stores, which dramatically improves the compression ratio for enum-like columns and also improves query performance because per-country scans can skip unrelated segments.
Continuous aggregates
The most common analytical queries are time-windowed aggregations: "what is the daily average interference probability for country X over the last 90 days?", "how many measurements showed DNS tampering for this domain this week?" These would be expensive to compute on demand from 2.2B raw rows. Continuous aggregates pre-compute these as materialized views that TimescaleDB refreshes incrementally as new data arrives.
-- Country-level daily summary (most used aggregate)
CREATE MATERIALIZED VIEW country_daily_summary
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 day', measured_at) AS day,
vantage_country AS country,
test_protocol,
COUNT(*) AS total_measurements,
COUNT(*) FILTER (WHERE control_reachable AND inference_dropped IS NULL)
AS valid_measurements,
AVG(interference_prob) AS avg_interference_prob,
COUNT(*) FILTER (WHERE interference_type != 'none')
AS interference_count,
COUNT(DISTINCT target_domain) AS distinct_domains_tested,
COUNT(DISTINCT probe_id) AS distinct_probes
FROM measurements
WHERE inference_dropped IS NULL
GROUP BY 1, 2, 3;
-- Refresh policy: update whenever new measurements arrive, with 1hr lag
SELECT add_continuous_aggregate_policy('country_daily_summary',
start_offset => INTERVAL '30 days', -- recompute back 30 days (late arrivals)
end_offset => INTERVAL '1 hour', -- don't include last hour (partial day)
schedule_interval => INTERVAL '1 hour'
);
-- Additional aggregates
-- domain_daily_summary: per-domain interference over time (for test list curation)
-- asn_hourly_summary: per-ASN hourly counts (for real-time anomaly detection)
-- probe_daily_health: per-probe daily valid measurement count (for probe health)The start_offset = 30 days ensures late-arriving measurements from probes with poor connectivity are incorporated. A probe in a censored environment may buffer measurements locally and upload them in batch when connectivity is restored, potentially hours to days after the measurement was taken. Without this lookback, those measurements would never appear in the aggregate.
Retention tiers
Not all data needs to be available at the same speed forever. We implement a three-tier retention policy:
Tier Age Storage Query latency
──────────────────────────────────────────────────────────────────
Hot 0–7 days Uncompressed < 100ms for country queries
Warm 7 days–1 year Compressed < 500ms for country queries
Cold > 1 year S3 Parquet Minutes (Athena/DuckDB on demand)
-- Hot→Warm: automatic via TimescaleDB compression policy (7 days)
-- Warm→Cold: custom retention policy exports compressed chunks to S3
CREATE OR REPLACE PROCEDURE expire_old_chunks() AS $$
BEGIN
-- Drop chunks older than 1 year from the live database
-- (data already exported to S3 Parquet before this runs)
PERFORM drop_chunks('measurements', older_than => INTERVAL '1 year');
END;
$$ LANGUAGE plpgsql;
-- Run nightly at 03:00 UTC
SELECT add_job('expire_old_chunks', '1 day', initial_start => '2025-01-01 03:00:00+00');Before the warm→cold export runs, we export compressed chunks to S3 as Parquet files using a custom pg_parquet export worker. The exported Parquet files are the same files published to HuggingFace as the CC BY 4.0 public dataset — the public release and the archival export share the same export pipeline.
Indexes
TimescaleDB automatically creates a per-chunk index on the time column. We add the following additional indexes:
-- Selective indexes (built on each new chunk automatically via a hook)
-- Fast lookup of all measurements for a specific domain in a time range
CREATE INDEX idx_meas_domain_time
ON measurements (target_domain, measured_at DESC)
WHERE inference_dropped IS NULL;
-- Fast lookup for the anomaly detector: high-interference measurements
CREATE INDEX idx_meas_interference
ON measurements (vantage_country, measured_at DESC, interference_prob)
WHERE interference_prob > 0.5;
-- Probe health monitoring: per-probe lookups
CREATE INDEX idx_meas_probe
ON measurements (probe_id, measured_at DESC);
-- Blockpage match lookups (for fingerprint coverage analysis)
CREATE INDEX idx_meas_blockpage
ON measurements (blockpage_match_id, measured_at DESC)
WHERE blockpage_match_id IS NOT NULL;
-- Note: no index on vantage_asn alone — this is covered by the
-- space partitioning on vantage_country, which reduces scan scope enough.Index creation on compressed chunks is a known limitation of TimescaleDB: indexes are per-chunk, and creating them on historical compressed chunks requires decompression. We pre-create indexes on new chunks via a TimescaleDB chunk creation hook before data is inserted, so new chunks always have indexes. Historical re-indexing is done during maintenance windows.
Query patterns and performance
The three most frequent query patterns, with production performance on the current 2.2B-row dataset:
1. Country anomaly window
-- "Did interference in Iran increase in the last 6 hours?"
-- Hits the asn_hourly_summary continuous aggregate, not raw table
SELECT
bucket,
SUM(interference_count)::float / NULLIF(SUM(valid_measurements), 0)
AS interference_rate
FROM asn_hourly_summary
WHERE country = 'IR'
AND bucket > NOW() - INTERVAL '24 hours'
GROUP BY bucket
ORDER BY bucket;
-- Query time: 4ms (aggregate, no raw scan)2. Domain blocking history
-- "How has blocking of twitter.com varied by country over 90 days?"
SELECT
day,
country,
SUM(interference_count) AS blocked_measurements,
SUM(valid_measurements) AS total_valid
FROM country_daily_summary_per_domain -- domain-specific continuous aggregate
WHERE target_domain = 'twitter.com'
AND day > NOW() - INTERVAL '90 days'
GROUP BY 1, 2
ORDER BY 1, 2;
-- Query time: 18ms (90 days × ~200 countries from aggregate)3. Raw measurement detail for incident verification
-- "Show me the raw probe results during the Iran Twitter block on 2025-01-15"
SELECT
measured_at,
probe_id,
vantage_asn,
dns_resolved,
tcp_connected,
tls_handshake_ok,
http_status_code,
interference_type,
interference_prob
FROM measurements
WHERE vantage_country = 'IR'
AND target_domain = 'twitter.com'
AND measured_at BETWEEN '2025-01-15 00:00:00+00' AND '2025-01-15 23:59:59+00'
AND inference_dropped IS NULL
ORDER BY measured_at;
-- Query time: 140ms
-- (single-day chunk, ~12K matching rows in that day for this country+domain)Write path
Probe measurements arrive at a Cloudflare Worker, are validated, and are pushed to a Kafka topic. A Rust consumer (running on a dedicated EC2 instance) reads from Kafka in batches of 1,000 measurements and bulk-inserts to TimescaleDB usingCOPY FROM STDIN (binary format, ~40% faster than INSERT for bulk loads).
// Bulk insert via PostgreSQL binary COPY
async fn bulk_insert(
client: &mut Client,
measurements: &[Measurement],
) -> Result<u64> {
let writer = client.copy_in("COPY measurements FROM STDIN BINARY").await?;
let mut writer = BinaryCopyInWriter::new(writer, &MEASUREMENT_TYPES);
for m in measurements {
writer.as_pin_mut().write_raw([
&m.measurement_id as &dyn ToSql,
&m.probe_id,
&m.vantage_asn,
&m.vantage_country,
&m.target_url,
// ... all 24 columns
]).await?;
}
Ok(writer.finish().await?)
}
// Throughput: 45,000 rows/second sustained (single consumer thread)
// Peak ingest during major censorship events: 120,000 rows/second
// (3 consumer instances scaled out automatically)The classifier is async
Interference classification (the interference_type,interference_prob, and confidence_tier columns) is not set at insert time. Probes write raw measurements without classification; the anomaly classifier runs as a separate async job that reads unclassified measurements in batches, scores them, and updates the rows with UPDATE … WHERE classifier_version IS NULL.
This separation means the write path is never blocked by classification latency, and classification can be re-run retrospectively when the model is updated — a new model version can backfill classifications on historical data without reprocessing the full raw measurement.
Sizing and cost
Instance: r6g.2xlarge (8 vCPU, 64GB RAM, ARM Graviton3) Storage: 2× 4TB io2 EBS (RAID-1, 64,000 IOPS each) Current DB size: 2.3TB compressed (≈ 14TB uncompressed equivalent) Monthly cost: ~$980 (instance) + ~$1,200 (EBS) = ~$2,180/mo Write throughput: 45K rows/sec sustained, 120K/sec peak Read concurrency: 60 simultaneous queries before latency degrades (p99 > 500ms) Cache hit rate: 91% (hot data fits in 64GB RAM for uncompressed recent chunks)
The r6g.2xlarge at 64GB RAM is sized to hold 90 days of uncompressed active data in the OS page cache. The 91% cache hit rate means the vast majority of anomaly detection queries (which read recent data) are served from memory, not disk. This is the dominant performance factor — not query optimization.
For the MCP server that exposes TimescaleDB query results to AI agents and Claude, built on top of this measurement store: The Voidly MCP server: 83 censorship query tools for Claude and GPT →
For the public HuggingFace dataset exports that come from this same TimescaleDB via the Parquet export pipeline: The Voidly open datasets on HuggingFace: structure, daily snapshots, and filter recipes →
For the Cloudflare D1-based approach we took for the Federal Regulatory Data Hub (contrasting with the TimescaleDB approach here): Building the Federal Regulatory Data Hub on Cloudflare D1: 35M records at the edge →
For the corroboration engine that reads from this database in real time — tokio::join! parallel fetches, adaptive OONI polling, in-memory CensoredPlanet daily index, and independence-weighted source agreement: Voidly's real-time corroboration engine: fetching, aligning, and merging OONI, CensoredPlanet, and IODA data →
For the three-tier retention policy that governs how long data stays in this database — hot/warm/cold tiers, compression DDL, and R2 cold storage: Voidly's measurement retention policy: hot, warm, and cold tiers for 2.2B probe results →