Technical writing

The Federal Regulatory Data Hub query layer: routing 50M+ records at the Cloudflare edge

April 25, 2025· 8 min read· AI Analytics

Regulatory dataCloudflare D1InfrastructureAPI design

The Federal Regulatory Data Hub stores 50M+ records from 208 federal datasets across 45 agencies in Cloudflare D1. The storage design and the API surface have their own writeups. This one covers what neither of those does: the query-time routing architecture — how a single GET /entity/XOM fans out to eight D1 databases in parallel, how FTS5 full-text search works against narrative datasets, how the entity bridge resolves a ticker to a canonical ID before querying, how caching is keyed and purged, and what the observed latency numbers look like in production.

The vertical sharding problem

Cloudflare D1 has a 10 GB per-database limit. When we started building the hub, 208 datasets across 45 agencies would have blown through that in a single database — the SEC EDGAR full-text filing index alone approaches that threshold. The solution is vertical sharding: one D1 database per logical agency group, not per dataset. We settled on eight shards:

// shard-map.ts — canonical shard assignment for all 208 datasets
export const SHARD_MAP = {
  securities: [
    'sec_edgar_filings', 'sec_edgar_companies', 'sec_enforcement',
    'finra_brokercheck', 'finra_disciplinary', 'pcaob_inspections',
  ],
  financial_crimes: [
    'ofac_sdn', 'ofac_consolidated', 'fincen_enforcement',
    'sam_exclusions', 'sam_registrations',
  ],
  healthcare: [
    'fda_warning_letters', 'fda_recalls', 'fda_enforcement',
    'cms_providers', 'cms_hospitals', 'dea_registrants',
    'hhs_ocr_settlements',
  ],
  labor_safety: [
    'msha_violations', 'msha_accidents', 'osha_citations',
    'osha_inspections', 'dol_wage_violations',
  ],
  environment: [
    'epa_enforcement', 'epa_facilities', 'epa_violations',
    'noaa_fisheries', 'fema_disasters', 'fema_nfip',
  ],
  transportation: [
    'nhtsa_recalls', 'nhtsa_complaints', 'faa_aircraft',
    'faa_enforcement', 'ntsb_accidents', 'fmc_ocean_carriers',
  ],
  enforcement: [
    'doj_press_releases', 'cfpb_enforcement', 'cfpb_complaints',
    'cftc_enforcement', 'ftc_actions',
  ],
  infrastructure: [
    'usace_permits', 'fhwa_safety', 'ferc_orders', 'fcc_enforcement',
  ],
} as const;

export type ShardName = keyof typeof SHARD_MAP;

Each shard name maps to a Cloudflare D1 binding in the Worker's Env interface — DB_SECURITIES, DB_FINANCIAL_CRIMES, and so on. Cloudflare allows up to 50,000 D1 databases per account, so this approach has significant headroom beyond the current eight shards as the dataset catalog expands.

The query router

A single Cloudflare Worker sits at api.ai-analytics.org and receives every API request. For single-dataset queries — GET /ofac/sdn?q=entity orGET /nhtsa/recalls?make=Ford — the router resolves the dataset name to its shard and issues a single D1 prepare().bind().all() call. No fan-out needed.

Cross-agency entity queries are different. GET /entity/XOM should return every regulatory event for ExxonMobil across SEC, EPA, OFAC, DOJ, and every other agency that has a record touching the company. That means querying all eight shards simultaneously. The router dispatches them in parallel with Promise.all:

// query-router.ts
type Row = Record<string, unknown> & { date: string };

interface Env {
  DB_SECURITIES: D1Database;
  DB_FINANCIAL_CRIMES: D1Database;
  DB_HEALTHCARE: D1Database;
  DB_LABOR_SAFETY: D1Database;
  DB_ENVIRONMENT: D1Database;
  DB_TRANSPORTATION: D1Database;
  DB_ENFORCEMENT: D1Database;
  DB_INFRASTRUCTURE: D1Database;
}

async function queryAllShards(
  env: Env,
  sql: string,
  params: unknown[],
): Promise<Row[]> {
  const shards: D1Database[] = [
    env.DB_SECURITIES,
    env.DB_FINANCIAL_CRIMES,
    env.DB_HEALTHCARE,
    env.DB_LABOR_SAFETY,
    env.DB_ENVIRONMENT,
    env.DB_TRANSPORTATION,
    env.DB_ENFORCEMENT,
    env.DB_INFRASTRUCTURE,
  ];

  const results = await Promise.all(
    shards.map(db => db.prepare(sql).bind(...params).all()),
  );

  return results
    .flatMap(r => r.results as Row[])
    .sort((a, b) => b.date.localeCompare(a.date));
}

Because the D1 binding is a direct in-process call from the Worker (no HTTP round-trip), each shard query adds only the SQLite execution time — typically 5–15ms. Eight shards in parallel means the cross-agency query wall time is bounded by the slowest shard, not the sum. Results from all shards are merged into a single array and sorted by date descending before the Worker returns the response.

The entity bridge join

Before the fan-out can happen, the router needs to know what to query for. A user sends GET /entity/XOM — that's a ticker symbol. The OFAC shard doesn't index by ticker; it indexes by name and internal UID. The SEC shard indexes by CIK. The SAM.gov shard indexes by UEI. Without resolution, each shard query would need to be re-parameterized with a different identifier type.

The entity bridge solves this. Every shard contains a copy of the entity_bridge table — a lightweight lookup that maps any known identifier to a single canonical_id:

-- entity_bridge: present in every shard database
CREATE TABLE entity_bridge (
  id            INTEGER PRIMARY KEY,
  canonical_id  TEXT NOT NULL,   -- shared UUID across all shards
  id_type       TEXT NOT NULL,   -- 'ticker' | 'cik' | 'uei' | 'lei' | 'duns' | 'npi'
  id_value      TEXT NOT NULL,
  entity_name   TEXT,
  entity_type   TEXT,            -- 'company' | 'individual' | 'vessel'
  confidence    REAL DEFAULT 1.0
);

CREATE INDEX eb_canonical ON entity_bridge(canonical_id);
CREATE INDEX eb_lookup    ON entity_bridge(id_type, id_value);

Resolution order: ticker → CIK (via EDGAR) → LEI (via GLEIF) → UEI (via SAM.gov). When a request arrives, the router runs the bridge resolution query against a single authoritative shard (the securities shard carries the highest-quality identifier coverage) before fanning out:

-- Bridge resolution: resolve any known identifier to canonical_id
SELECT canonical_id, entity_name, entity_type
FROM entity_bridge
WHERE (id_type = 'ticker'  AND id_value = ?1)
   OR (id_type = 'cik'     AND id_value = ?1)
   OR (id_type = 'uei'     AND id_value = ?1)
   OR (id_type = 'lei'     AND id_value = ?1)
   OR (id_type = 'duns'    AND id_value = ?1)
   OR (id_type = 'npi'     AND id_value = ?1)
ORDER BY confidence DESC
LIMIT 1;

The resolved canonical_id is then used as the query parameter across all eight shards: WHERE canonical_id = ?. Each shard's per-dataset tables carry a canonical_id foreign key populated at ingest time. This means the fan-out query is structurally identical across shards — no per-shard query rewriting needed.

FTS5 for narrative datasets

Structured datasets — OFAC SDN, NHTSA recalls, CMS provider directories — are queried via indexed column lookups. Narrative datasets are different: DOJ press releases, FDA warning letters, and CFPB enforcement narratives are long-form prose where the relevant entity information is buried in paragraph text. SQLite FTS5 handles these.

Each narrative dataset gets a companion FTS5 virtual table alongside its primary table. The Cloudflare D1 environment supports FTS5 with the unicode61 tokenizer (with remove_diacritics 1). Custom tokenizers are not supported in D1's SQLite build, so porter stemming is unavailable — queries must account for morphological variants manually or via MATCH prefix syntax:

-- FTS5 table: DOJ press releases
CREATE VIRTUAL TABLE doj_press_releases_fts USING fts5(
  title,
  body,
  district,
  content='doj_press_releases',
  content_rowid='id',
  tokenize='unicode61 remove_diacritics 1'
);

-- FTS5 table: FDA warning letters
CREATE VIRTUAL TABLE fda_warning_letters_fts USING fts5(
  subject,
  body,
  company,
  content='fda_warning_letters',
  content_rowid='id',
  tokenize='unicode61 remove_diacritics 1'
);

-- FTS5 table: CFPB enforcement narratives
CREATE VIRTUAL TABLE cfpb_narratives_fts USING fts5(
  title,
  description,
  respondent,
  content='cfpb_enforcement',
  content_rowid='id',
  tokenize='unicode61 remove_diacritics 1'
);

FTS5 search queries use the MATCH operator. The highlight() andsnippet() auxiliary functions format matched excerpts for the response:

-- Full-text search: DOJ press releases matching a query term
SELECT
  rowid,
  title,
  district,
  date,
  highlight(doj_press_releases_fts, 1, '<mark>', '</mark>') AS body_highlighted,
  snippet(doj_press_releases_fts, 1, '<mark>', '</mark>', '...', 32) AS excerpt,
  rank
FROM doj_press_releases_fts
WHERE doj_press_releases_fts MATCH ?
ORDER BY rank
LIMIT 25;

FTS5 queries run against the enforcement shard (DB_ENFORCEMENT) for DOJ and CFPB content and against the healthcare shard (DB_HEALTHCARE) for FDA warning letters. The router identifies FTS5 endpoints by path prefix (/search/doj,/search/fda, /search/cfpb) and routes to the appropriate shard without fan-out — free-text search is always single-shard.

Response caching

The Worker sits behind Cloudflare's CDN layer. Cache-key design matters: two requests for the same underlying data but with differently-ordered query parameters should hit the same cache entry. The Worker normalizes and sorts query parameters before constructing the cache key, and includes an X-Cache-Version header that increments on schema changes:

// cache.ts — cache wrapper with deterministic key construction
async function withCache(
  request: Request,
  handler: () => Promise<Response>,
  ttlSeconds: number,
): Promise<Response> {
  const url = new URL(request.url);

  // Normalize query params: sort keys, lowercase values
  const sortedParams = [...url.searchParams.entries()]
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([k, v]) => `${k}=${v.toLowerCase()}`)
    .join('&');

  const cacheVersion = request.headers.get('X-Cache-Version') ?? 'v1';
  const cacheKey = new Request(
    `${url.origin}${url.pathname}?${sortedParams}`,
    { headers: { 'X-Cache-Version': cacheVersion } },
  );

  const cache = caches.default;
  const cached = await cache.match(cacheKey);
  if (cached) {
    return new Response(cached.body, {
      ...cached,
      headers: { ...Object.fromEntries(cached.headers), 'X-Cache': 'HIT' },
    });
  }

  const response = await handler();
  const cacheable = new Response(response.body, {
    ...response,
    headers: {
      ...Object.fromEntries(response.headers),
      'Cache-Control': `public, max-age=${ttlSeconds}`,
      'X-Cache': 'MISS',
    },
  });

  await cache.put(cacheKey, cacheable.clone());
  return cacheable;
}

TTLs are differentiated by endpoint type, reflecting how often the underlying data changes:

Bulk record queries (OFAC, SAM exclusions): 1 hour — these datasets update once daily, so a 1-hour TTL serves cache hits for the bulk of the day while ensuring staleness is bounded.
Cross-agency entity queries (GET /entity/*): 6 hours — entity profiles change infrequently; aggressive caching here yields significant latency wins for repeat lookups on the same company.
FTS5 full-text search: 15 minutes — results can change with each daily ingest as new press releases and warning letters are added, so a shorter TTL keeps results fresh.
GET /today.md: 5 minutes — generated by a scheduled cron, stale by at most one cron interval.
GET /coverage: 30 minutes — dataset record counts update daily; 30 minutes is a reasonable staleness bound.

Cache entries are purged on successful ingest completion via the Cache API. The ingest Worker calls caches.default.delete(cacheKey) for the affected dataset's endpoint URLs after each successful upsert batch, ensuring that users who query immediately after an ingest cycle see fresh data rather than a stale cache entry.

Latency budget

Observed latency in production (p50 / p99) across endpoint types:

Single-shard query (/ofac/sdn?q=entity): p50 12ms, p99 45ms
Cross-agency entity query (Promise.all across 8 shards): p50 38ms, p99 120ms
FTS5 full-text search (single shard): p50 18ms, p99 65ms
Cache hit (any endpoint): p50 2ms, p99 8ms
Cross-agency entity query (cache miss): p50 38ms, p99 120ms

Worker cold starts are rare. Cloudflare maintains a global pool of pre-initialized isolates; fewer than 0.1% of requests in production incur a cold-start penalty. Because the D1 binding is in-process (no TCP handshake, no HTTP overhead), the dominant latency component for cache-miss requests is SQLite query execution — not network. The cross-agency p99 of 120ms reflects the tail of the slowest shard in a given Promise.all batch, which is typically the FTS5-heavy enforcement shard querying across long narrative rows.

Cross-agency fan-out optimization

Naive Promise.all across all eight shards is wasteful when the query only touches one or two of them. A user querying GET /entity/NPI-1234567890 with an NPI identifier only needs the healthcare shard — NPIs don't appear in securities, environment, or infrastructure data. Fanning out to all eight shards wastes seven D1 round-trips.

The optimization is a dataset-shard index: a compile-time lookup table that maps each of the 208 dataset names to its shard. At query time, the router intersects the user's requested agencies (or the identifier type resolved by the bridge) against the index and fans out only to relevant shards:

// shard-selection.ts
import { SHARD_MAP, ShardName } from './shard-map';

// Invert SHARD_MAP: dataset name → shard name
const DATASET_TO_SHARD = new Map<string, ShardName>(
  (Object.entries(SHARD_MAP) as [ShardName, readonly string[]][])
    .flatMap(([shard, datasets]) => datasets.map(d => [d, shard])),
);

// Identifier types with known shard affinity
const ID_TYPE_SHARDS: Record<string, ShardName[]> = {
  ticker: ['securities'],
  cik:    ['securities'],
  npi:    ['healthcare'],
  lei:    ['securities', 'financial_crimes'],
  uei:    ['financial_crimes'],
  duns:   ['financial_crimes', 'infrastructure'],
};

export function selectShards(
  env: Env,
  idType?: string,
  agencies?: string[],
): D1Database[] {
  const allShards: Record<ShardName, D1Database> = {
    securities:      env.DB_SECURITIES,
    financial_crimes: env.DB_FINANCIAL_CRIMES,
    healthcare:      env.DB_HEALTHCARE,
    labor_safety:    env.DB_LABOR_SAFETY,
    environment:     env.DB_ENVIRONMENT,
    transportation:  env.DB_TRANSPORTATION,
    enforcement:     env.DB_ENFORCEMENT,
    infrastructure:  env.DB_INFRASTRUCTURE,
  };

  // Agency filter: only query shards that contain the requested datasets
  if (agencies && agencies.length > 0) {
    const relevantShards = new Set(
      agencies.flatMap(a => {
        const shard = DATASET_TO_SHARD.get(a);
        return shard ? [shard] : [];
      }),
    );
    return [...relevantShards].map(s => allShards[s]);
  }

  // Identifier-type affinity: skip shards that can't contain this ID type
  if (idType && ID_TYPE_SHARDS[idType]) {
    return ID_TYPE_SHARDS[idType].map(s => allShards[s]);
  }

  // Default: full fan-out
  return Object.values(allShards);
}

In practice, agency-filtered queries (e.g., GET /entity/XOM?agencies=SEC,FDA) reduce the fan-out from eight shards to two, cutting observed latency by roughly 40% for targeted queries. Identifier-type routing (NPI → healthcare only, CIK → securities only) provides similar gains for lookups where the identifier space is unambiguous.

The /today.md endpoint

GET /today.md returns a Markdown summary of today's federal regulatory activity: new OFAC sanctions added, FDA recalls published, SAM.gov exclusions posted, and DOJ actions dated to the current day. It's generated fresh every five minutes by a scheduled Worker cron and cached with a 5-minute TTL.

Internally, the cron Worker runs the same shard query infrastructure as a normal API request, but with a date filter: WHERE date = date('now') across all eight shards. The results are formatted into Markdown sections — one per agency group — and written to the cache. The response at /today.md is also the source for the LLM-facing surface at /llms.txt, which concatenates/today.md with the dataset coverage manifest and the API endpoint list to give LLM crawlers a complete, current-day context in a single request.

Error handling and partial responses

If a D1 shard is unavailable — due to a Cloudflare D1 regional outage, a write-rate-limit backoff, or a schema migration in progress — the naive response would be a 503 for any cross-agency query that touches that shard. That's too aggressive: OFAC and SEC data may still be fully available and useful to the caller even if the healthcare shard is temporarily down.

The router handles shard failures gracefully. Before each query batch, a lightweightSELECT 1 ping checks shard availability. Failed shards are excluded from the Promise.all fan-out, and their names are collected into ameta.unavailable_shards array in the response envelope. The response carries an X-Partial-Response: true header so callers can detect degraded results:

// partial-response handling in the query router
const shardResults: { shard: ShardName; rows: Row[] }[] = [];
const unavailableShards: ShardName[] = [];

await Promise.all(
  selectedShards.map(async ({ name, db }) => {
    try {
      // Lightweight health check before the real query
      await db.prepare('SELECT 1').first();
      const result = await db.prepare(sql).bind(...params).all();
      shardResults.push({ shard: name, rows: result.results as Row[] });
    } catch {
      unavailableShards.push(name);
    }
  }),
);

const rows = shardResults
  .flatMap(r => r.rows)
  .sort((a, b) => b.date.localeCompare(a.date));

const isPartial = unavailableShards.length > 0;

return new Response(
  JSON.stringify({
    data: rows,
    meta: {
      total: rows.length,
      partial: isPartial,
      unavailable_shards: isPartial ? unavailableShards : undefined,
    },
  }),
  {
    status: 200,
    headers: {
      'Content-Type': 'application/json',
      ...(isPartial ? { 'X-Partial-Response': 'true' } : {}),
    },
  },
);

This means a cross-agency query during a partial D1 outage still returns useful data with a clear signal that coverage is incomplete, rather than failing entirely. Callers can inspectmeta.unavailable_shards and retry the specific missing shards once the outage resolves. The 200 status on a partial response is intentional: the data returned is valid, just incomplete — a 503 would mask that.

Building the Federal Regulatory Data Hub on Cloudflare D1: 50M+ records at the edge →

The Federal Regulatory API: REST, MCP, and JSON-LD for 208 federal datasets →

Federal dataset ingest: keeping 208 federal datasets fresh at the edge →

Building the cross-agency regulatory entity graph: 50M+ records, one join →

Monitoring dataset freshness: staleness detection, multi-channel alerting, and the OFAC publish-time problem →