Technical writing

Voidly's TimescaleDB continuous aggregates: pre-aggregating 2.2B probe measurements for fast queries

August 14, 2025· 8 min read· AI Analytics

VoidlyTimescaleDBInfrastructure

The raw voidly_measurements hypertable stores every probe result as a row. At 2.2 billion rows — and growing at roughly 500k per day — querying that table directly for country-level statistics is not viable in a user-facing API. A 7-day country blocking rate query against the raw table takes 4.1 seconds. The real-time inference API has a p99 latency budget of 44ms total. Without pre-aggregation, the database alone would blow that budget by two orders of magnitude.

TimescaleDB continuous aggregates solve this by maintaining pre-computed materialized views that are kept fresh on a configurable schedule. They also unlock the compression story: raw chunks become eligible for compression once the aggregate layer has fully covered them, which keeps storage manageable as the dataset grows. This article covers the three-level cagg hierarchy Voidly uses, the refresh policies that handle late-arriving probe data, how compression and aggregation interlock, and the query benchmarks before and after.

Why continuous aggregates over standard materialized views

PostgreSQL's ordinary MATERIALIZED VIEW is a full refresh: every refresh recomputes the entire view, which at 2.2B rows is not something you can run every 15 minutes. TimescaleDB continuous aggregates maintain an internal materialization hypertable that stores the incrementally-computed state. At refresh time, only the chunks whose data has changed since the last refresh are reprocessed. For a 15-minute refresh of measurement_hourly, that means scanning the 2-hour sliding window of recent data rather than the full table.

The trade-off is materialized_only = true. When this option is set, queries against the cagg read only the materialized state — they do not fall through to the raw hypertable for the unmaterialized region. This is correct for our use case: the inference API can tolerate data being up to 15 minutes stale, and the dashboard explicitly shows the cagg freshness timestamp. For any workflow that needs the absolute latest second of data, we query voidly_measurements directly.

The cagg hierarchy: three levels

Voidly uses a three-level hierarchy. Each level aggregates the level below it rather than the raw hypertable, which means the refresh cost at the upper levels is very low — they are summarizing already-summarized data.

voidly_measurements (raw hypertable, 2.2B rows)
       │
       │  time_bucket('1 hour', measured_at)
       │  GROUP BY country_code, probe_asn, interference_type
       ▼
measurement_hourly (Level 1 cagg)
       │  ~180M rows; refresh every 15 min, 2hr window
       │
       │  time_bucket('1 day', hour)
       │  GROUP BY country_code
       ▼
country_daily_summary (Level 2 cagg)
       │  ~3.2M rows; refresh hourly, 2-day window
       │
       │  time_bucket('30 days', day)
       │  GROUP BY country_code
       ▼
country_monthly_stats (Level 3 cagg)
          ~110K rows; refresh daily, 35-day window
          → used by HuggingFace export and global censorship index

Level 1: measurement_hourly

The first level aggregates raw measurements into hourly buckets keyed on(country_code, probe_asn, interference_type, hour). This is the finest granularity available via the cagg layer and handles the highest-frequency queries: the real-time anomaly detector queries this view to compute rolling baseline rates for each country-ASN pair.

Materialized columns include measurement_count,interference_count, blocked_count,dns_count, tls_count, http_count,throttling_count, bgp_count, and per-class average probabilities: avg_prob_dns,avg_prob_tls, avg_prob_http,avg_prob_bgp, avg_prob_throttle. Storing the per-class averages at this level means the country_daily_summary can produce mean probability columns without touching the raw table.

Level 2: country_daily_summary

The second level is the primary view for the public API's country endpoints and the dashboard. It aggregates measurement_hourly into per-country daily buckets and adds higher-cost derived columns that would be impractical at Level 1: distinct probe count (approximated via HyperLogLog), distinct ASN count, cross-source corroboration counts, and the derived blocking_rate.

CREATE MATERIALIZED VIEW country_daily_summary
WITH (timescaledb.continuous_aggregate = true,
      timescaledb.materialized_only = true)
AS
SELECT
    time_bucket('1 day', hour) AS day,
    country_code,
    sum(measurement_count)                               AS total_measurements,
    sum(interference_count)                              AS total_interference,
    sum(blocked_count)                                   AS total_blocked,
    sum(dns_count)                                       AS dns_interference,
    sum(tls_count)                                       AS tls_interference,
    sum(http_count)                                      AS http_interference,
    sum(bgp_count)                                       AS bgp_interference,
    sum(throttling_count)                                AS throttling_events,
    avg(avg_prob_dns)                                    AS mean_prob_dns,
    avg(avg_prob_tls)                                    AS mean_prob_tls,
    avg(avg_prob_http)                                   AS mean_prob_http,
    avg(avg_prob_bgp)                                    AS mean_prob_bgp,
    count(DISTINCT probe_asn)                            AS distinct_asn_count,
    sum(cross_source_corroborated_count)                 AS corroborated_count,
    CASE WHEN sum(measurement_count) > 0
         THEN sum(interference_count)::float / sum(measurement_count)
         ELSE 0 END                                      AS blocking_rate
FROM measurement_hourly
GROUP BY 1, 2;

The distinct_asn_count column uses count(DISTINCT probe_asn)directly on the hourly data rather than HyperLogLog sketches, which is tractable here because the number of distinct ASNs per country-day is small (typically 10–400). For the distinct probe count (not shown above for brevity), we use HyperLogLog at the measurement_hourly level and merge the sketches atcountry_daily_summary — probe IDs are high-cardinality and exact distinct counts at the daily level would be expensive.

Level 3: country_monthly_stats

The third level is a 30-day rollup over country_daily_summary. It serves two consumers: the nightly HuggingFace Parquet export (which includes monthly statistics as a convenience dataset alongside the raw measurements) and the global censorship index computation, which ranks countries by a composite monthly blocking score. At only ~110k rows, this view fits in PostgreSQL's shared buffer pool entirely, so queries against it are essentially in-memory.

Refresh policies

Each cagg level has its own refresh policy. The refresh window (start_offset to end_offset) defines the range of time that will be reprocessed on each run. Wider windows catch late-arriving data at the cost of more computation per refresh.

-- Level 1: refresh every 15 minutes, covering the last 2 hours
SELECT add_continuous_aggregate_policy('measurement_hourly',
    start_offset => INTERVAL '2 hours',
    end_offset   => INTERVAL '15 minutes',
    schedule_interval => INTERVAL '15 minutes');

-- Level 2: refresh hourly, covering the last 2 days (catch late uploads)
SELECT add_continuous_aggregate_policy('country_daily_summary',
    start_offset => INTERVAL '2 days',
    end_offset   => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');

-- Level 3: refresh daily at 01:30 UTC
SELECT add_continuous_aggregate_policy('country_monthly_stats',
    start_offset => INTERVAL '35 days',
    end_offset   => INTERVAL '1 day',
    schedule_interval => INTERVAL '1 day');

The 35-day start_offset on country_monthly_stats (versus 30 days for the bucket size) gives a 5-day buffer to catch late-arriving data that propagates up from Level 1 through Level 2 before reaching Level 3. Without this buffer, a measurement uploaded 32 days late would never appear in the monthly stats for its measurement month.

The late-arriving probe data problem

Late arrivals are not an edge case in Voidly's data. Two separate causes produce them:

Operator safety delays. Probes in high-risk countries — China, Russia, Iran, Belarus — have artificial 4–48 hour publication delays built into the probe application. When the probe uploads its batch, the Cloudflare Worker withholds it from the Kafka stream until the delay period has elapsed. This prevents real-time correlation of measurement uploads with probe operators' network activity.
Mobile probe batch uploads. Many operators run the probe on mobile devices that upload measurements only when on Wi-Fi. A probe on a mobile connection all day may not upload until the next morning. This is the dominant source of 1–24 hour late arrivals.

The empirical late-arrival distribution from production data:

Arrival window        Cumulative fraction of measurements
──────────────────────────────────────────────────────────
Within 1 hour         94.2%
Within 24 hours       98.7%
Within 72 hours       99.9%
After 72 hours         0.1%  (delayed-upload or batch-backfill scenarios)

The measurement_hourly policy's start_offset => INTERVAL '2 hours'is the minimum safe value: it ensures the last 2 hours are always reprocessed, catching the small fraction of measurements that arrive between 15 minutes and 2 hours late. The country_daily_summary policy's 2-day window covers the 98.7% of measurements that arrive within 24 hours, with a day of additional buffer. Measurements that arrive after 72 hours are included when they arrive, but the Level 3 monthly stats will have already refreshed for their day. Manual backfill (described below) handles these cases.

The ASN hourly summary cagg

In addition to the three-level country hierarchy, a separate cagg supports ASN-level blocking analysis:asn_hourly_summary, keyed on(asn_id, country_code, hour). This view feeds the/countries/{cc}/asns REST endpoint, which returns per-ASN blocking breakdowns for a given country and time window.

This cagg is not hierarchically chained to country_daily_summaryeven though it shares the country_code dimension. The reason: ASN grouping adds a selectivity dimension that the country-only hierarchy cannot provide. A country-daily row summarizes all ASNs in that country for that day. An ASN-hourly row isolates a single autonomous system. Rolling up from country_daily_summary to asn_hourly_summary is not meaningful — you would lose the per-ASN breakdown you are trying to preserve. Soasn_hourly_summary aggregates directly fromvoidly_measurements and refreshes every 30 minutes.

-- asn_hourly_summary schema (abbreviated)
SELECT
    time_bucket('1 hour', measured_at) AS hour,
    probe_asn                          AS asn_id,
    country_code,
    count(*)                           AS measurement_count,
    sum(interference_count)            AS interference_count,
    sum(dns_count)                     AS dns_count,
    sum(tls_count)                     AS tls_count,
    sum(http_count)                    AS http_count,
    sum(bgp_count)                     AS bgp_count,
    sum(throttling_count)              AS throttling_count
FROM voidly_measurements
GROUP BY 1, 2, 3;

SELECT add_continuous_aggregate_policy('asn_hourly_summary',
    start_offset => INTERVAL '2 hours',
    end_offset   => INTERVAL '30 minutes',
    schedule_interval => INTERVAL '30 minutes');

Compression interplay

Raw voidly_measurements chunks are eligible for TimescaleDB compression 7 days after their time range ends. Compressed chunks can reduce storage 10–20x for time-series data (the column-oriented compression on the repeated country_code and interference_type columns is particularly effective). But there is a constraint: a chunk cannot be compressed if the continuous aggregate has not yet materialized through that chunk's time range. Compressing before the cagg covers the chunk would mean the cagg refresh would need to decompress it, losing the storage benefit and causing performance degradation.

-- Compression policy: compress raw chunks older than 7 days
SELECT add_compression_policy('voidly_measurements',
    compress_after => INTERVAL '7 days');

-- The compression job will not compress a chunk if the cagg has not
-- yet materialized through that chunk's time range.
-- TimescaleDB enforces this automatically via the internal dependency
-- tracking between the hypertable and its continuous aggregates.

-- Monitoring: find chunks pending compression that the cagg has not yet covered
SELECT
    c.chunk_name,
    c.range_start,
    c.range_end,
    c.compression_status,
    cagg_watermark('measurement_hourly') AS cagg_watermark
FROM timescaledb_information.chunks c
WHERE c.hypertable_name = 'voidly_measurements'
  AND c.compression_status = 'Uncompressed'
  AND c.range_end < now() - INTERVAL '7 days'
  AND c.range_end > cagg_watermark('measurement_hourly');

A monitoring job runs every 20 minutes and checks thatmeasurement_hourly is current to within 20 minutes before allowing the compression job to proceed. If the cagg refresh is lagging — for example, during a database failover or a backfill operation — the compression job is paused and an alert fires. This prevents the pathological case where a stale cagg permanently blocks compression across an accumulating backlog of chunks.

Query benchmarks before and after caggs

All timings measured on the production TimescaleDB instance (r7g.2xlarge, PostgreSQL 16.2 + TimescaleDB 2.14.2, 64GB RAM, 4TB gp3 NVMe). The "before" numbers reflect queries against the raw voidly_measurements hypertable with its standard indexes. The "after" numbers reflect queries against the appropriate cagg.

Query                                      Before cagg   After cagg
───────────────────────────────────────────────────────────────────
7-day country blocking rate (1 country)    4.1s          4ms
90-day country trend (1 country)           31s           18ms
Top 20 countries by blocking rate (today)  8.2s          2ms
ASN breakdown for 1 country, 30 days       12s           9ms
Global daily stats, last 90 days           180s          38ms

The most important number for the API is the first row: 4ms for a 7-day country blocking rate query fits comfortably inside the 44ms p99 inference budget after accounting for the REST layer, classifier lookup, and network RTT. The 90-day country trend case (18ms) is slower because it touches 90 rows ofcountry_daily_summary rather than 7, but it is still within budget for non-real-time dashboard queries.

The global daily stats query (38ms for 90 days across all countries) goes againstcountry_monthly_stats rather than the daily summary, which is why it is fast despite covering the widest data range. At 110k rows total, this view is essentially always hot in the buffer pool.

Monitoring cagg freshness

TimescaleDB exposes cagg state through thetimescaledb_information.continuous_aggregate_stats view. The fields that matter for operational monitoring are:

SELECT
    view_name,
    last_run_started_at,
    last_run_duration,
    next_start,
    materialization_hypertable_schema || '.' ||
    materialization_hypertable_name   AS mat_table
FROM timescaledb_information.continuous_aggregate_stats
ORDER BY next_start;

-- Example output:
-- view_name              | last_run_started_at    | last_run_duration | next_start
-- -----------------------+------------------------+-------------------+----------------------------
-- measurement_hourly     | 2025-08-14 12:00:03+00 | 00:00:01.842      | 2025-08-14 12:15:00+00
-- asn_hourly_summary     | 2025-08-14 12:00:05+00 | 00:00:02.317      | 2025-08-14 12:30:00+00
-- country_daily_summary  | 2025-08-14 12:00:09+00 | 00:00:00.291      | 2025-08-14 13:00:00+00
-- country_monthly_stats  | 2025-08-14 01:30:02+00 | 00:00:00.058      | 2025-08-15 01:30:00+00

We alert if last_run_started_at < now() - 2 × schedule_intervalfor any cagg. A missed refresh for measurement_hourly would mean the anomaly detector is operating on data that is at least 30 minutes stale, which risks missing the early stages of a censorship event. The alert routes to PagerDuty and the on-call engineer's Slack.

The dataset staleness monitoring pipeline (a separate article) builds on this: it exposes the cagg freshness timestamp through the public API so that downstream consumers — journalists, researchers, and external monitors — can see exactly how current the published data is.

Manual refresh for backfill

Occasionally a probe operator contacts us to report that they have been unable to upload for an extended period — a 6-month trip with limited connectivity, or a device that was recovered after being seized and returned. When we receive a large batch of delayed measurements, we need to backfill the cagg hierarchy.

The procedure is a manual CALL refresh_continuous_aggregate scoped to the affected date range, run as a background database job:

-- Step 1: insert the delayed measurements into voidly_measurements via normal ingest
-- (the Rust normalization consumer handles this; late_arrival_gt48h quality flag is set)

-- Step 2: refresh Level 1 for the affected range (most expensive step)
CALL refresh_continuous_aggregate('measurement_hourly',
    '2025-01-01'::timestamptz,
    '2025-07-01'::timestamptz);
-- Runtime: ~8.4 hours for 6 months of data (~50M rows/hour from raw hypertable to cagg)

-- Step 3: refresh Level 2 for the affected range (fast — reads from Level 1)
CALL refresh_continuous_aggregate('country_daily_summary',
    '2025-01-01'::timestamptz,
    '2025-07-01'::timestamptz);

-- Step 4: refresh Level 3
CALL refresh_continuous_aggregate('country_monthly_stats',
    '2024-12-01'::timestamptz,
    '2025-08-01'::timestamptz);  -- wider window to cover boundary months

The Level 1 refresh is the expensive step: backfilling 6 months ofmeasurement_hourly from the raw hypertable processes at approximately 50 million rows per hour. A 6-month backfill with typical probe density takes around 8 hours. During the backfill, the affected date range is blocked from compression — the compression job detects that a manual refresh is in progress via the cagg watermark position and skips those chunks.

For partial backfills (a single probe contributing a few days of missing data), the Level 1 refresh completes in minutes. The Level 2 and Level 3 refreshes are fast regardless of range size because they read from the already-refreshed Level 1 data.

What this unlocks downstream

The cagg hierarchy is the foundation that makes the rest of the Voidly stack viable at scale:

Real-time inference API. The anomaly detector queries measurement_hourly for rolling 24-hour baselines per country-ASN pair. Without the cagg, these queries would dominate the 44ms budget before any inference work ran.
Public REST API. The/countries/{cc}/daily endpoint queriescountry_daily_summary directly. Response times are sub-10ms for any country and any date range within the materialized window.
HuggingFace export. The nightly Parquet export includes country_monthly_stats as a pre-aggregated convenience table, so researchers who only need country-level monthly trends do not need to download and process the full 2.2B-row measurement dataset.
Storage compression. The cagg watermark gates chunk compression on the raw hypertable, keeping the storage footprint of voidly_measurements manageable as it grows. Without the cagg, we would need to choose between stale compression (risking cagg recomputation against compressed chunks) or no compression (unconstrained storage growth).

Voidly's measurement database: 2.2B probe results in TimescaleDB →

Voidly's probe-to-dataset ingest pipeline: normalization, quality filtering, and TimescaleDB indexing →

The Voidly measurement dataset: field-by-field schema reference →

Voidly's nightly Parquet export: from TimescaleDB to HuggingFace →