Technical writing

Federal Regulatory Data Hub schema design: per-vertical table layouts, entity_master bridge, and D1 indexing strategy for 50M+ records across 8 shards

March 21, 2025· 16 min read· AI Analytics

Regulatory dataCloudflare D1InfrastructureData engineering

The Federal Regulatory Data Hub serves 50 million+ records drawn from 208 federal datasets through a single REST API and MCP server. The underlying storage is eight Cloudflare D1 databases — one per regulatory vertical — each a managed SQLite instance running at Cloudflare's edge. This article covers the schema decisions behind this architecture: why we shard by vertical rather than by entity type, how the entity_master bridge enables cross-shard entity queries without JOIN at runtime, and the index strategy that achieves p50 38ms latency for three-shard fan-out queries.

Eight vertical D1 databases

D1's 10GB per-database limit drives the sharding strategy. The full 35M-record corpus at ~1.1KB average record size would exceed 38GB in a single database. We shard by vertical because same-vertical queries are the common case — a compliance screen against OFAC only needs the financial_crimes database; an EDGAR filing lookup only needs the securities database. Cross-vertical queries use parallel fan-out, not SQL JOINs.

D1 database	Datasets	Records	Refresh
regulatory-securities	EDGAR companies, filings, SEC enforcement, CFTC orders	9.2M	Daily
regulatory-financial-crimes	OFAC SDN, OFAC consolidated, FinCEN, BIS Entity List	4.1M	Daily
regulatory-healthcare	FDA warning letters, recalls, CMS providers, OIG exclusions	6.8M	Weekly
regulatory-labor-safety	OSHA citations, MSHA violations, NLRB decisions, WHD	3.4M	Quarterly
regulatory-environment	EPA enforcement actions, ECHO facilities, violations	2.9M	Quarterly
regulatory-transportation	NHTSA recalls, FAA enforcement, FMCSA carriers, PHMSA	4.6M	Weekly
regulatory-enforcement	DOJ press releases, FTC actions, CFPB enforcement	2.1M	Weekly
regulatory-infrastructure	SAM.gov entities, debarments, USASpending awards	2.9M	Daily

Key table schemas

-- financial-crimes: OFAC SDN (primary enforcement list)
CREATE TABLE ofac_sdn (
    id              INTEGER PRIMARY KEY,
    sdn_uid         INTEGER NOT NULL UNIQUE,
    sdn_name        TEXT    NOT NULL,
    sdn_type        TEXT    NOT NULL, -- 'Individual'|'Entity'|'Vessel'|'Aircraft'
    program_codes   TEXT    NOT NULL, -- pipe-separated, e.g. 'IRAN|SDGT'
    aliases         TEXT,             -- JSON array of alternate names
    addresses       TEXT,             -- JSON array of {address, city, country}
    nationalities   TEXT,             -- JSON array of ISO country codes
    id_documents    TEXT,             -- JSON array of {type, number, country}
    canonical_entity_id TEXT,         -- FK into entity_master (denormalized)
    last_updated    TEXT    NOT NULL
);

CREATE INDEX idx_ofac_program ON ofac_sdn (program_codes);
CREATE INDEX idx_ofac_canonical ON ofac_sdn (canonical_entity_id);

CREATE VIRTUAL TABLE ofac_sdn_fts USING fts5(
    sdn_name, aliases,
    content='ofac_sdn', content_rowid='id',
    tokenize='unicode61 remove_diacritics 2'
);

-- securities: EPA enforcement (covering index for entity queries)
CREATE TABLE epa_enforcement (
    id              INTEGER PRIMARY KEY,
    case_number     TEXT    NOT NULL UNIQUE,
    case_name       TEXT    NOT NULL,
    respondent_name TEXT    NOT NULL,
    statute         TEXT    NOT NULL,  -- 'CAA'|'CWA'|'RCRA'|'CERCLA'|'TSCA'|...
    activity_type   TEXT    NOT NULL,  -- 'Administrative'|'Civil Judicial'|'Criminal'
    initiated_date  TEXT,              -- ISO 8601 date
    concluded_date  TEXT,
    penalty_amount  REAL,              -- USD; NULL if no monetary penalty
    canonical_entity_id TEXT
);

-- Covering index: entity lookup returning summary columns without table read
CREATE INDEX idx_epa_entity_date ON epa_enforcement (
    canonical_entity_id,
    initiated_date DESC,
    case_number,
    penalty_amount
);

The entity_master bridge

The entity_master bridge is stored in a ninth D1 database (regulatory-entity-master). It contains canonical entity identifiers and a shard_presence bitmask that records which vertical databases have records for each entity. Cross-shard entity queries begin with a single entity_master lookup to resolve the bitmask, then fan out to only the shards that have matching records.

CREATE TABLE entity_master (
    canonical_id    TEXT    PRIMARY KEY,
    -- Format: '{entity_type}:{sha256_8hex}' e.g. 'CORP:a3f1c82d'
    entity_type     TEXT    NOT NULL, -- 'CORP'|'PERSON'|'VESSEL'|'AIRCRAFT'|'GOVT'
    canonical_name  TEXT    NOT NULL,
    ein             TEXT,
    cik             TEXT,
    lei             TEXT,
    uei             TEXT,   -- SAM.gov Unique Entity Identifier
    npi             TEXT,
    -- Bitmask of shards with records (bits 0-7 = 8 verticals above)
    shard_presence  INTEGER NOT NULL DEFAULT 0,
    last_updated    TEXT    NOT NULL
);

-- Covering index: canonical_id lookup returns name+bitmask without table scan
CREATE INDEX idx_entity_master_id ON entity_master (canonical_id)
    INCLUDE (canonical_name, shard_presence);

-- Identifier lookup indexes (partial: skip NULL values)
CREATE INDEX idx_entity_ein ON entity_master (ein) WHERE ein IS NOT NULL;
CREATE INDEX idx_entity_cik ON entity_master (cik) WHERE cik IS NOT NULL;
CREATE INDEX idx_entity_uei ON entity_master (uei) WHERE uei IS NOT NULL;

Shard routing Workers handler

// TypeScript; Cloudflare Workers runtime
// Bit index -> (D1 binding, table name, date column)
const SHARDS: Record<number, [string, string, string]> = {
  0: ['SECURITIES',    'edgar_companies',    'last_updated'],
  1: ['FINANCIAL_CRIMES', 'ofac_sdn',         'last_updated'],
  2: ['HEALTHCARE',    'fda_warning_letters', 'issued_date'],
  3: ['LABOR_SAFETY',  'osha_citations',      'citation_date'],
  4: ['ENVIRONMENT',   'epa_enforcement',     'initiated_date'],
  5: ['TRANSPORTATION','nhtsa_recalls',       'recall_initiation_date'],
  6: ['ENFORCEMENT',   'doj_enforcement',     'action_date'],
  7: ['INFRASTRUCTURE','sam_entities',        'registration_date'],
};

export async function queryEntityAllShards(
  canonicalId: string, env: Env,
): Promise<Record<string, unknown>> {
  // Step 1: resolve entity + shard_presence bitmask (3-6ms)
  const row = await (env.ENTITY_MASTER as D1Database)
    .prepare('SELECT canonical_name, shard_presence FROM entity_master WHERE canonical_id = ?')
    .bind(canonicalId).first<{ canonical_name: string; shard_presence: number }>();
  if (!row) throw new Error('Entity not found');

  // Step 2: collect bit indexes for shards with records
  const bits = Object.keys(SHARDS).map(Number)
    .filter(b => row.shard_presence & (1 << b));

  // Step 3: parallel fan-out (8-18ms per shard; wall-clock = slowest shard)
  const results = await Promise.all(bits.map(async (bit) => {
    const [dbKey, table, dateCol] = SHARDS[bit];
    const db = (env as Record<string, D1Database>)[dbKey];
    const { results: rows } = await db
      .prepare('SELECT * FROM ' + table +
               ' WHERE canonical_entity_id = ?' +
               ' ORDER BY ' + dateCol + ' DESC LIMIT 50')
      .bind(canonicalId).all();
    return { table, rows };
  }));

  return {
    canonical_id: canonicalId,
    canonical_name: row.canonical_name,
    records: Object.fromEntries(results.map(r => [r.table, r.rows])),
  };
}

Index strategy: covering vs. FTS5 vs. partial

Three index types serve different access patterns:

Covering indexes — used for entity-centric list queries where the result set is a predictable subset of columns. The EPA enforcement covering index (entity + date + case_number + penalty) is satisfied entirely from the index without touching data pages. At 2.9M rows this saves ~40% of query I/O vs. a non-covering index on the same column.

FTS5 virtual tables — used for name search. The unicode61 remove_diacritics 2 tokenizer handles Arabic, Cyrillic, and CJK name variants in the OFAC SDN list. FTS5 queries at 4.1M rows run in 19ms p50. A Redis cache layer caches the 200 most-queried entity names to avoid repeated FTS scans.

Partial indexes — used where large fractions of rows have NULL in the indexed column. The EIN index covers only ~62% of entities (those with known EINs), making it 38% smaller. D1 supports partial indexes via WHERE column IS NOT NULL.

Query latency benchmarks

Operation	p50 (ms)	p95 (ms)	Notes
entity_master lookup	4	9	Covering index; no table read
Single-shard entity records	11	24	Includes entity_master lookup
3-shard cross-entity query	38	67	Parallel fan-out; wall-clock dominated by slowest shard
OFAC FTS5 name search	19	41	4.1M row FTS5 virtual table
EIN identifier xref	5	11	Partial index on entity_master

The p99 for 3-shard queries (91ms) occasionally spikes during D1 write operations — SQLite uses a single-writer model and read queries may queue behind bulk ingest. Ingest runs at 02:00–05:00 UTC to minimize contention, and the Workers handler uses a 150ms per-shard timeout with graceful partial results on timeout.

For the D1 architecture overview and dataset ingest pipeline: Building the Federal Regulatory Data Hub on Cloudflare D1: 50M+ records at the edge →

For entity ID normalization across CIK, UEI, LEI, DUNS, and NPI: Entity ID normalization in the Federal Regulatory Data Hub →