Technical writing
Federal Regulatory Data Hub schema design: per-vertical table layouts, entity_master bridge, and D1 indexing strategy for 35M records across 8 shards
The Federal Regulatory Data Hub serves 35 million records drawn from 197 federal datasets through a single REST API and MCP server. The underlying storage is eight Cloudflare D1 databases — one per regulatory vertical — each a managed SQLite instance running at Cloudflare's edge. This article covers the schema decisions behind this architecture: why we shard by vertical rather than by entity type, how the entity_master bridge enables cross-shard entity queries without JOIN at runtime, and the index strategy that achieves p50 38ms latency for three-shard fan-out queries.
Eight vertical D1 databases
D1's 10GB per-database limit drives the sharding strategy. The full 35M-record corpus at ~1.1KB average record size would exceed 38GB in a single database. We shard by vertical because same-vertical queries are the common case — a compliance screen against OFAC only needs the financial_crimes database; an EDGAR filing lookup only needs the securities database. Cross-vertical queries use parallel fan-out, not SQL JOINs.
| D1 database | Datasets | Records | Refresh |
|---|---|---|---|
| regulatory-securities | EDGAR companies, filings, SEC enforcement, CFTC orders | 9.2M | Daily |
| regulatory-financial-crimes | OFAC SDN, OFAC consolidated, FinCEN, BIS Entity List | 4.1M | Daily |
| regulatory-healthcare | FDA warning letters, recalls, CMS providers, OIG exclusions | 6.8M | Weekly |
| regulatory-labor-safety | OSHA citations, MSHA violations, NLRB decisions, WHD | 3.4M | Quarterly |
| regulatory-environment | EPA enforcement actions, ECHO facilities, violations | 2.9M | Quarterly |
| regulatory-transportation | NHTSA recalls, FAA enforcement, FMCSA carriers, PHMSA | 4.6M | Weekly |
| regulatory-enforcement | DOJ press releases, FTC actions, CFPB enforcement | 2.1M | Weekly |
| regulatory-infrastructure | SAM.gov entities, debarments, USASpending awards | 2.9M | Daily |
Key table schemas
-- financial-crimes: OFAC SDN (primary enforcement list)
CREATE TABLE ofac_sdn (
id INTEGER PRIMARY KEY,
sdn_uid INTEGER NOT NULL UNIQUE,
sdn_name TEXT NOT NULL,
sdn_type TEXT NOT NULL, -- 'Individual'|'Entity'|'Vessel'|'Aircraft'
program_codes TEXT NOT NULL, -- pipe-separated, e.g. 'IRAN|SDGT'
aliases TEXT, -- JSON array of alternate names
addresses TEXT, -- JSON array of {address, city, country}
nationalities TEXT, -- JSON array of ISO country codes
id_documents TEXT, -- JSON array of {type, number, country}
canonical_entity_id TEXT, -- FK into entity_master (denormalized)
last_updated TEXT NOT NULL
);
CREATE INDEX idx_ofac_program ON ofac_sdn (program_codes);
CREATE INDEX idx_ofac_canonical ON ofac_sdn (canonical_entity_id);
CREATE VIRTUAL TABLE ofac_sdn_fts USING fts5(
sdn_name, aliases,
content='ofac_sdn', content_rowid='id',
tokenize='unicode61 remove_diacritics 2'
);
-- securities: EPA enforcement (covering index for entity queries)
CREATE TABLE epa_enforcement (
id INTEGER PRIMARY KEY,
case_number TEXT NOT NULL UNIQUE,
case_name TEXT NOT NULL,
respondent_name TEXT NOT NULL,
statute TEXT NOT NULL, -- 'CAA'|'CWA'|'RCRA'|'CERCLA'|'TSCA'|...
activity_type TEXT NOT NULL, -- 'Administrative'|'Civil Judicial'|'Criminal'
initiated_date TEXT, -- ISO 8601 date
concluded_date TEXT,
penalty_amount REAL, -- USD; NULL if no monetary penalty
canonical_entity_id TEXT
);
-- Covering index: entity lookup returning summary columns without table read
CREATE INDEX idx_epa_entity_date ON epa_enforcement (
canonical_entity_id,
initiated_date DESC,
case_number,
penalty_amount
);The entity_master bridge
The entity_master bridge is stored in a ninth D1 database (regulatory-entity-master). It contains canonical entity identifiers and a shard_presence bitmask that records which vertical databases have records for each entity. Cross-shard entity queries begin with a single entity_master lookup to resolve the bitmask, then fan out to only the shards that have matching records.
CREATE TABLE entity_master (
canonical_id TEXT PRIMARY KEY,
-- Format: '{entity_type}:{sha256_8hex}' e.g. 'CORP:a3f1c82d'
entity_type TEXT NOT NULL, -- 'CORP'|'PERSON'|'VESSEL'|'AIRCRAFT'|'GOVT'
canonical_name TEXT NOT NULL,
ein TEXT,
cik TEXT,
lei TEXT,
uei TEXT, -- SAM.gov Unique Entity Identifier
npi TEXT,
-- Bitmask of shards with records (bits 0-7 = 8 verticals above)
shard_presence INTEGER NOT NULL DEFAULT 0,
last_updated TEXT NOT NULL
);
-- Covering index: canonical_id lookup returns name+bitmask without table scan
CREATE INDEX idx_entity_master_id ON entity_master (canonical_id)
INCLUDE (canonical_name, shard_presence);
-- Identifier lookup indexes (partial: skip NULL values)
CREATE INDEX idx_entity_ein ON entity_master (ein) WHERE ein IS NOT NULL;
CREATE INDEX idx_entity_cik ON entity_master (cik) WHERE cik IS NOT NULL;
CREATE INDEX idx_entity_uei ON entity_master (uei) WHERE uei IS NOT NULL;Shard routing Workers handler
// TypeScript; Cloudflare Workers runtime
// Bit index -> (D1 binding, table name, date column)
const SHARDS: Record<number, [string, string, string]> = {
0: ['SECURITIES', 'edgar_companies', 'last_updated'],
1: ['FINANCIAL_CRIMES', 'ofac_sdn', 'last_updated'],
2: ['HEALTHCARE', 'fda_warning_letters', 'issued_date'],
3: ['LABOR_SAFETY', 'osha_citations', 'citation_date'],
4: ['ENVIRONMENT', 'epa_enforcement', 'initiated_date'],
5: ['TRANSPORTATION','nhtsa_recalls', 'recall_initiation_date'],
6: ['ENFORCEMENT', 'doj_enforcement', 'action_date'],
7: ['INFRASTRUCTURE','sam_entities', 'registration_date'],
};
export async function queryEntityAllShards(
canonicalId: string, env: Env,
): Promise<Record<string, unknown>> {
// Step 1: resolve entity + shard_presence bitmask (3-6ms)
const row = await (env.ENTITY_MASTER as D1Database)
.prepare('SELECT canonical_name, shard_presence FROM entity_master WHERE canonical_id = ?')
.bind(canonicalId).first<{ canonical_name: string; shard_presence: number }>();
if (!row) throw new Error('Entity not found');
// Step 2: collect bit indexes for shards with records
const bits = Object.keys(SHARDS).map(Number)
.filter(b => row.shard_presence & (1 << b));
// Step 3: parallel fan-out (8-18ms per shard; wall-clock = slowest shard)
const results = await Promise.all(bits.map(async (bit) => {
const [dbKey, table, dateCol] = SHARDS[bit];
const db = (env as Record<string, D1Database>)[dbKey];
const { results: rows } = await db
.prepare('SELECT * FROM ' + table +
' WHERE canonical_entity_id = ?' +
' ORDER BY ' + dateCol + ' DESC LIMIT 50')
.bind(canonicalId).all();
return { table, rows };
}));
return {
canonical_id: canonicalId,
canonical_name: row.canonical_name,
records: Object.fromEntries(results.map(r => [r.table, r.rows])),
};
}Index strategy: covering vs. FTS5 vs. partial
Three index types serve different access patterns:
Covering indexes — used for entity-centric list queries where the result set is a predictable subset of columns. The EPA enforcement covering index (entity + date + case_number + penalty) is satisfied entirely from the index without touching data pages. At 2.9M rows this saves ~40% of query I/O vs. a non-covering index on the same column.
FTS5 virtual tables — used for name search. The unicode61 remove_diacritics 2 tokenizer handles Arabic, Cyrillic, and CJK name variants in the OFAC SDN list. FTS5 queries at 4.1M rows run in 19ms p50. A Redis cache layer caches the 200 most-queried entity names to avoid repeated FTS scans.
Partial indexes — used where large fractions of rows have NULL in the indexed column. The EIN index covers only ~62% of entities (those with known EINs), making it 38% smaller. D1 supports partial indexes via WHERE column IS NOT NULL.
Query latency benchmarks
| Operation | p50 (ms) | p95 (ms) | Notes |
|---|---|---|---|
| entity_master lookup | 4 | 9 | Covering index; no table read |
| Single-shard entity records | 11 | 24 | Includes entity_master lookup |
| 3-shard cross-entity query | 38 | 67 | Parallel fan-out; wall-clock dominated by slowest shard |
| OFAC FTS5 name search | 19 | 41 | 4.1M row FTS5 virtual table |
| EIN identifier xref | 5 | 11 | Partial index on entity_master |
The p99 for 3-shard queries (91ms) occasionally spikes during D1 write operations — SQLite uses a single-writer model and read queries may queue behind bulk ingest. Ingest runs at 02:00–05:00 UTC to minimize contention, and the Workers handler uses a 150ms per-shard timeout with graceful partial results on timeout.
For the D1 architecture overview and dataset ingest pipeline: Building the Federal Regulatory Data Hub on Cloudflare D1: 35M records at the edge →
For entity ID normalization across CIK, UEI, LEI, DUNS, and NPI: Entity ID normalization in the Federal Regulatory Data Hub →