Technical writing
Entity ID normalization in the Federal Regulatory Data Hub: resolving CIK, UEI, LEI, DUNS, and NPI across 197 datasets
The Federal Regulatory Data Hub indexes 197 federal datasets across 45 agencies. Each dataset uses its own identifier system for companies. SEC EDGAR assigns CIKs. GSA's SAM.gov issues UEIs. The GLEIF LEI Registry issues 20-character ISO 17442 codes. Dun & Bradstreet's DUNS number persists in older data. CMS assigns NPIs to healthcare providers. A company like Wells Fargo appears as CIK 72971, UEI J6MYQ3ZPMV26, LEI VYVLBKJTMIMQ05WXPE37, DUNS 003453416, and dozens of NPI numbers for its healthcare subsidiaries — with no shared foreign key across any government database.
Without normalization, a query for “show all regulatory events for Wells Fargo” would require knowing which identifier to use in each of the 197 datasets, and re-parameterizing each query accordingly. The entity_master bridge eliminates that requirement. This post documents the identifier landscape, the schema of the bridge table, and the three-pass resolution strategy we use to populate and query it.
The five federal identifier systems
Understanding the design of entity_master requires understanding why each identifier system exists, how stable it is, and what datasets depend on it. The five systems have meaningfully different scopes and coverage.
CIK — SEC Central Index Key
The CIK is SEC EDGAR's primary identifier for every filer: public companies, investment funds, insiders filing Form 4, and foreign private issuers. It is a 7–10 digit integer assigned once and never changed — CIKs are the most stable identifier in the federal system. Wells Fargo's CIK 72971 was assigned when the company first registered with EDGAR and will remain 72971 for as long as the company files with the SEC. There are approximately 40,000 active CIKs for publicly reporting companies; the full EDGAR corpus, including historical and deregistered filers, covers over 600,000 CIKs.
CIK coverage is excellent for publicly traded companies and SEC-registered investment vehicles but zero for private companies, nonprofits, government contractors, and healthcare providers that have no SEC reporting obligation.
UEI — Unique Entity Identifier
The UEI is GSA's 12-character alphanumeric replacement for the DUNS number in federal procurement. GSA began issuing UEIs in 2022 and completed the mandatory cutover by April 2022 — any entity registering in SAM.gov, applying for federal contracts, or receiving federal grants now has a UEI. The SAM.gov registry covers 2.8 million registered entities, but that population includes nonprofits, universities, foreign entities, and sole proprietors in addition to corporate contractors. UEI coverage is strong for entities that have ever sought federal business; it is nonexistent for purely private-market companies with no government touchpoint.
LEI — Legal Entity Identifier
The LEI is a 20-character alphanumeric code defined by ISO 17442 and issued by accredited Local Operating Units (LOUs) through the Global Legal Entity Identifier Foundation (GLEIF). It is used in FinCEN records, CFTC derivatives reporting, and Federal Reserve financial stability data. Approximately 2.4 million active LEIs exist globally. The LEI system carries a parent/child hierarchy — each LEI record links to its direct parent LEI and its ultimate parent LEI — making it uniquely useful for tracing subsidiary structures without manual corporate-tree research. Wells Fargo's LEI VYVLBKJTMIMQ05WXPE37 links to subsidiary LEIs for Wells Fargo Bank, NA, Wells Fargo Securities, and dozens of other operating entities.
DUNS — Data Universal Numbering System
The DUNS is Dun & Bradstreet's 9-digit identifier. It was the federal procurement standard for decades before the UEI transition, and it persists in older data and datasets not yet migrated: EPA ECHO facility records, OSHA inspection records, some FEMA grant data, and export-control license records. Active DUNS numbers are no longer issued for new federal registrations, but the existing corpus remains relevant for historical data joins and for datasets that have not completed their migration to UEI. We treat DUNS as a read-only secondary identifier — we populate it from historical records but do not rely on it for new entity registrations.
NPI — National Provider Identifier
The NPI is CMS's 10-digit identifier for healthcare providers: hospitals, physician practices, pharmacies, imaging centers, and medical device distributors. Used in Medicare and Medicaid claims, FDA facility data, and HHS-OIG exclusion records. Approximately 7 million active NPIs are in the NPPES registry. The critical design detail: NPIs are granular to the individual facility, specialty, or state license — a single hospital system like HCA Healthcare holds 180+ NPIs, one for each facility and care setting. The parent company's CIK is a single value; its NPI list is an array of 180+ entries. Our schema encodes this as a JSON array in npi_list, queried at runtime with SQLite's json_each().
The entity_master schema
The bridge table lives in its own Cloudflare D1 database, separate from the eight vertical shards that hold agency-specific data. Every entity that appears in any of the 197 datasets gets exactly one row in entity_master, identified by a synthetic primary key prefixed EM-. Known ID equivalences are stored as indexed columns; the NPI case gets a JSON array because the one-to-many relationship cannot be expressed as a single column without sacrificing query simplicity.
CREATE TABLE entity_master ( entity_id TEXT PRIMARY KEY, -- "EM-0000012345" canonical_name TEXT NOT NULL, -- preferred display name cik TEXT UNIQUE, -- SEC EDGAR Central Index Key uei TEXT UNIQUE, -- SAM.gov Unique Entity Identifier (12-char) lei TEXT UNIQUE, -- ISO 17442 Legal Entity Identifier (20-char) duns TEXT UNIQUE, -- D&B DUNS (9-digit, legacy) npi_list TEXT, -- JSON array of NPI strings ticker TEXT, -- exchange ticker symbol sic_code TEXT, -- SIC industry classification state TEXT, -- US state of registration (2-letter) country TEXT DEFAULT 'US', -- ISO 3166-1 alpha-2 confidence REAL NOT NULL, -- 0.0–1.0 match confidence sources TEXT NOT NULL, -- JSON array of source dataset names last_updated TEXT NOT NULL -- ISO 8601 timestamp ); -- Indexes on every ID column — cross-shard resolution must be fast CREATE INDEX em_cik ON entity_master(cik); CREATE INDEX em_uei ON entity_master(uei); CREATE INDEX em_lei ON entity_master(lei); CREATE INDEX em_duns ON entity_master(duns); CREATE INDEX em_ticker ON entity_master(ticker); CREATE INDEX em_name ON entity_master(canonical_name);
The confidence column is central to the design. Rows seeded from authoritative government cross-references — where two datasets independently assert the same company under two different ID systems — carry confidence = 1.0. Rows built from name-similarity matching carry lower confidence, and all queries against the entity bridge return the confidence value so callers can apply their own threshold. A compliance workflow demanding near-certainty can filter to confidence >= 0.95; a discovery workflow comfortable with candidate lists can accept confidence >= 0.7.
Foreign entities — companies with LEIs but no US regulatory footprint — are assigned anentity_id and populated lei andcountry columns, but left with NULL incik, uei, andnpi_list. Government contractors that are not publicly traded have a uei from SAM.gov but cik = NULL. The schema handles these sparse cases naturally: a join against any NULL column simply produces no match rather than a false positive.
The alias table
Entity names change. Companies restructure, rebrand, merge, and operate under doing-business-as names that bear little resemblance to their legal registered name. Theentity_aliases table captures every known name variation for each entity in entity_master, along with the time window during which the alias was valid:
CREATE TABLE entity_aliases ( alias_id INTEGER PRIMARY KEY AUTOINCREMENT, alias_text TEXT NOT NULL, entity_id TEXT NOT NULL REFERENCES entity_master(entity_id), alias_type TEXT NOT NULL, -- 'former_name' | 'dba' | 'abbreviation' | 'ticker' source TEXT NOT NULL, -- which dataset provided this alias valid_from TEXT, -- ISO 8601; NULL means unknown start valid_to TEXT -- ISO 8601; NULL means currently active ); CREATE INDEX ea_alias_text ON entity_aliases(alias_text); CREATE INDEX ea_entity_id ON entity_aliases(entity_id);
The alias_type taxonomy captures four distinct cases. Aformer_name is a legal name the entity used before a corporate rename or merger — “First Union Corporation” for what became Wachovia, then Wells Fargo. A dba is a trade name used concurrently with the legal name. An abbreviation is a shortened form commonly used in enforcement records — EPA enforcement cases often use “ExxonMobil Pipeline” where the full legal name is “ExxonMobil Pipeline Company, a subsidiary of ExxonMobil Corporation.” A ticker alias links the exchange symbol as a queryable text alias so lookups by ticker hit the alias index before touching the ID columns.
The valid_to column is what allows temporal queries: a query for a company name that was valid in 2018 will match alias rows wherevalid_to IS NULL OR valid_to > '2018-01-01'. When a merger closes and the acquired entity's name is retired, we setvalid_to to the merger completion date rather than deleting the row, preserving the historical record.
Three-pass resolution strategy
Populating entity_master from 197 datasets and then querying it at runtime both follow the same three-pass priority order. Pass 1 is exact ID cross-reference, exploiting cases where two datasets independently carry the same ID type for an entity. Pass 2 is alias table lookup for known DBA and former names. Pass 3 is TF-IDF fuzzy name matching, the most expensive and least reliable pass, used only when the first two fail and always flagged with reduced confidence.
Pass 1: exact ID cross-reference
When two datasets share a common identifier, the join is deterministic. The entity_master bridge enables this across identifier types that would otherwise be incompatible. A query joining USAspending contract records against SEC EDGAR filings illustrates the pattern:
-- Cross-dataset join using entity_master as the bridge -- Find all companies that received federal contracts AND have SEC enforcement actions SELECT em.canonical_name, em.entity_id, ua.award_amount, ua.award_date, se.action_type, se.penalty_amount, se.action_date FROM usaspending_awards ua JOIN entity_master em ON ua.uei = em.uei JOIN sec_enforcement se ON se.cik = em.cik WHERE ua.award_amount > 10000000 -- contracts over $10M AND se.action_date > '2020-01-01' AND em.confidence >= 0.95 ORDER BY ua.award_amount DESC;
Without entity_master, this query cannot be written — there is no direct foreign key between the UEI column in USAspending and the CIK column in SEC enforcement records. The bridge table holds the equivalence, and the query reads it as a double join. Because both uei and cikare indexed in entity_master, the join operates as two indexed lookups rather than a table scan.
At ingest time, Pass 1 is seeded from two high-quality sources: EDGAR's company index (CIK → ticker → legal name) and SAM.gov's entity registration extract (UEI → EIN → CAGE → legal name). When a SAM.gov entity's EIN matches an EDGAR company's EIN — a linkage we can make via IRS EIN when both datasets carry it — the UEI and CIK are linked inentity_master with confidence = 1.0. The GLEIF daily full extract then augments those rows with LEI values using the same EIN-based cross-reference.
Pass 2: alias table lookup
When Pass 1 finds no match — the dataset carries only a free-text name, not a structured ID — we query the alias table before escalating to fuzzy matching. Alias table lookup is exact-match on normalized text, which means it is fast (an index scan) and carries high confidence (99.2% true positive rate in our evaluation corpus, 0.1% false positive rate).
-- Pass 2: alias table lookup -- Input: a company name from a dataset that carries no structured ID -- Output: entity_id if a match exists, with the confidence of that alias SELECT em.entity_id, em.canonical_name, em.cik, em.uei, em.lei, ea.alias_type, ea.source, em.confidence FROM entity_aliases ea JOIN entity_master em ON ea.entity_id = em.entity_id WHERE ea.alias_text = normalize_name(:input_name) -- exact match on normalized form AND (ea.valid_to IS NULL OR ea.valid_to > :query_date) ORDER BY em.confidence DESC LIMIT 5;
The normalize_name() function applied at query time is the same normalization applied when aliases were inserted — stripping punctuation, uppercasing, and removing legal suffixes. This ensures that “Wells Fargo & Company” and “WELLS FARGO COMPANY” both resolve to the same normalized form and hit the same alias row.
Pass 3: TF-IDF fuzzy name matching
When neither Pass 1 nor Pass 2 produces a match, we fall back to token-weighted similarity against the full entity_master corpus. This is the most expensive path and the least reliable — it is only invoked when direct ID and alias lookup have both failed, and every result it produces is assigned confidence < 0.7. Results below 0.6 are rejected entirely rather than returned with a caveat.
def fuzzy_match_entity(
query_name: str,
query_state: Optional[str] = None,
query_country: str = "US",
min_confidence: float = 0.6,
) -> list[EntityMatch]:
"""
Pass 3 entity resolution: TF-IDF fuzzy name match.
Only invoked after Pass 1 (exact ID) and Pass 2 (alias table) fail.
All results carry confidence < 0.7.
"""
# Step 1: normalize the query name before tokenization
normalized_query = normalize_company_name(query_name)
# Step 2: retrieve top-20 candidates from the pre-built TF-IDF index
# The index is built over canonical_name + all alias_text values per entity
candidates = tfidf_index.query(normalized_query, top_k=20)
results: list[EntityMatch] = []
for candidate in candidates:
score = compute_match_score(
query_name,
candidate.canonical_name,
query_state,
candidate.state,
)
if score >= min_confidence:
results.append(EntityMatch(
entity=candidate,
confidence=min(score, 0.69), # cap below 0.7; Pass 3 never exceeds this
))
return sorted(results, key=lambda x: -x.confidence)
def compute_match_score(
query: str,
candidate: str,
query_state: Optional[str],
candidate_state: Optional[str],
) -> float:
"""
Combine Jaro-Winkler string similarity with a geographic bonus.
Geographic agreement provides a +0.05 boost to handle cases where
a common company name (e.g., "National Energy Corporation") appears
in multiple states — state context disambiguates.
"""
name_score = jaro_winkler_similarity(
normalize_company_name(query),
normalize_company_name(candidate),
)
state_bonus = (
0.05
if query_state is not None and query_state == candidate_state
else 0.0
)
return min(1.0, name_score + state_bonus)Company name normalization
All three passes depend on a consistent normalization function that collapses the surface variation in company names before any comparison. Federal datasets use inconsistent capitalization, punctuation, and legal suffix abbreviations. Without normalization, “Wells Fargo & Co.”, “WELLS FARGO AND COMPANY”, and “Wells Fargo Co” would fail to match each other. The normalization function is deterministic and applied identically at ingest time (when buildingentity_aliases) and at query time (when comparing incoming names).
# Legal-suffix stop words — removed from normalized form before comparison.
# Does NOT include substantive industry words like "Bank", "Healthcare", "Energy".
STOP_WORDS = frozenset({
"the", "and", "of", "for",
"co", "corp", "inc", "llc", "ltd", "plc",
"gmbh", "sa", "sas", "nv", "bv", "ag", "se", "lp",
"company", "corporation", "incorporated", "limited",
"partners", "partnership", "associates", "group",
})
# Abbreviation expansion map — applied before stop word removal
ABBREV_MAP = {
r"\bINT'L\b": "INTERNATIONAL",
r"\bINTL\b": "INTERNATIONAL",
r"\bMFG\b": "MANUFACTURING",
r"\bSVC\b": "SERVICES",
r"\bINDS\b": "INDUSTRIES",
r"\bHLTH\b": "HEALTH",
r"\bFIN\b": "FINANCIAL",
}
def normalize_company_name(name: str) -> str:
"""
Canonical normalization applied identically at ingest and query time.
Returns a space-joined token string for comparison or TF-IDF indexing.
"""
name = name.upper().strip()
# Remove punctuation, keeping internal word boundaries
name = re.sub(r"[^ws]", " ", name)
name = re.sub(r"s+", " ", name).strip()
# Expand common abbreviations before stop-word removal
for pattern, expansion in ABBREV_MAP.items():
name = re.sub(pattern, expansion, name)
# Tokenize and filter stop words
tokens = [t for t in name.split() if t.lower() not in STOP_WORDS]
return " ".join(tokens)A few design decisions worth noting: we do not remove industry words like “Bank” or “Healthcare” because they are discriminative — “Wells Fargo Bank” and “Wells Fargo Securities” are distinct entities in our corpus. We expand abbreviations before stop-word removal so that “INTL” becomes “INTERNATIONAL” and is then comparable to the expanded form in another dataset's record. And we uppercase before any token operation to make the pipeline case-insensitive without a separatelower() call at comparison time.
Resolution accuracy by method
We evaluated each pass against a labeled corpus of 12,000 entity pairs drawn from overlapping datasets where ground-truth equivalence was independently verifiable through official government cross-references. The results by resolution method:
| Method | True positive | False positive | Use case |
|---|---|---|---|
| Exact ID join | 99.9% | 0.01% | Same ID type across datasets |
| Alias table | 99.2% | 0.1% | Known DBA and former names |
| TF-IDF fuzzy (≥0.9) | 97% | 1% | High-confidence name match |
| TF-IDF fuzzy (0.7–0.9) | 85% | 8% | Returned with confidence caveat |
| TF-IDF fuzzy (<0.7) | — | — | Rejected; not returned |
The 8% false positive rate for TF-IDF results in the 0.7–0.9 confidence band is the reason those results are always tagged with their confidence value in the API response. A caller querying the entity bridge for “National Energy Corporation” might receive three candidate matches in that band — all plausible, none certain — and is expected to apply domain context to disambiguate. The API never silently returns a low-confidence match as if it were exact.
Special cases in the corpus
Healthcare subsidiaries with 180+ NPIs
A hospital system like HCA Healthcare holds a single CIK (860730) as a publicly traded company and a single UEI for federal contracts, but 180+ NPIs for its individual facilities spread across 20 states. Queries against CMS Medicare claims or FDA facility registration data must join against all 180+ NPIs. The npi_list JSON array handles this:
-- Query CMS hospital quality data for all HCA facilities
-- json_each() expands the npi_list array into individual rows for the join
SELECT
em.canonical_name,
cms.facility_name,
cms.npi,
cms.overall_rating,
cms.state
FROM entity_master em,
json_each(em.npi_list) npi_row
JOIN cms_hospitals cms ON cms.npi = npi_row.value
WHERE em.cik = '860730' -- HCA Healthcare CIK
ORDER BY cms.state, cms.facility_name;The json_each() table-valued function is SQLite-native and available in Cloudflare D1. For the HCA Healthcare case, this expands into 180+ rows before the join against cms_hospitals, which is indexed onnpi. The full query resolves in under 15ms on the healthcare shard.
Foreign entities
Non-US entities appear in LEI data (the GLEIF corpus covers all 2.4 million active LEIs globally), in OFAC sanctions lists (which designate foreign entities and individuals), and as foreign private issuers in EDGAR. For these entities, cik,uei, npi_list, andduns are all NULL. The lei column is the primary identifier and the country column is non-US. LEI's parent/child hierarchy is exposed through a separatelei_hierarchy table rather than denormalized intoentity_master:
CREATE TABLE lei_hierarchy ( lei TEXT NOT NULL, parent_lei TEXT, -- direct parent; NULL if entity is its own parent ultimate_lei TEXT, -- ultimate parent in the ownership chain relationship_type TEXT, -- 'IS_DIRECTLY_CONSOLIDATED_BY' | 'IS_ULTIMATELY_CONSOLIDATED_BY' relationship_status TEXT, -- 'ACTIVE' | 'INACTIVE' effective_date TEXT ); CREATE INDEX lh_lei ON lei_hierarchy(lei); CREATE INDEX lh_parent_lei ON lei_hierarchy(parent_lei); CREATE INDEX lh_ultimate ON lei_hierarchy(ultimate_lei);
Government contractors with no CIK
Private defense contractors, IT service providers, and consulting firms that hold federal contracts but are not publicly traded have a SAM.gov UEI but no SEC CIK. Theirentity_master row has ueipopulated and cik = NULL. These entities are fully resolvable from contract data (USAspending, SAM exclusions, FPDS) but invisible in EDGAR. The query layer handles this correctly by only joining against ID columns that are non-NULL in the bridge row — a LEFT JOIN against sec_enforcementon cik simply returns no EDGAR rows for these entities rather than erroring.
Update cadence and merge/acquisition handling
The bridge table is not static. New entities register daily; existing entities restructure, merge, and rename. We handle each event class differently:
- New LEI registrations: GLEIF publishes a delta file daily via its API. We ingest it nightly and insert new rows into
entity_masterfor LEIs that do not yet appear in the table. - New SAM.gov UEI registrations: The SAM.gov entity registration extract is published daily. We diff against the previous snapshot and insert new entities. If a new UEI's EIN matches an existing CIK-bearing row in
entity_master, we update that row'sueicolumn in place rather than creating a duplicate. - New CIK assignments: EDGAR's company search index is polled weekly. New CIKs are inserted and cross-referenced against existing UEI and LEI records via EIN match.
- Corporate mergers and acquisitions: When an M&A disclosure in EDGAR (a Form S-4 or 8-K announcing a completed merger) references two known entities, we flag the acquiring and acquired entity_ids, close the acquired entity's alias rows with a
valid_todate equal to the merger close date, and add aformer_namealias on the surviving entity row pointing to the former legal name. The acquired entity'sentity_masterrow is retained withconfidence = 1.0and anacquired_byannotation in itssourcesJSON — historical records referencing the old entity continue to resolve correctly.
Query latency at the edge
Because all ID columns in entity_master are indexed, resolution is always an indexed point lookup rather than a scan. The bridge query itself — resolving any known identifier to an entity_id — executes in 1–3ms on Cloudflare D1. The consequential latency is in the downstream fan-out to the eight agency shards, which runs in parallel via Promise.all. Observed production latency for a full cross-agency entity timeline (all regulatory events for a company across all 8 shards) is p50 38ms, p99 120ms at the edge.
The entity bridge itself adds approximately 3ms to the total latency budget — the resolution query runs against the bridge database, which is a separate D1 binding from the agency shards. That 3ms is fixed overhead per request and does not scale with the number of shards queried.
For single-shard queries where the identifier type implies a specific shard — NPI implies healthcare, CIK implies securities — the query router uses shard affinity to skip the remaining seven shards entirely. A CMS provider lookup by NPI runs as: bridge resolution (3ms) + healthcare shard query (8ms) = 11ms total, rather than the full 38ms cross-agency budget. Identifier-type affinity routing is described in detail in the query layer post.
What the bridge enables
The entity_master normalization is what makes the Federal Regulatory Data Hub a cross-agency intelligence platform rather than a collection of isolated dataset endpoints. With the bridge in place:
- A compliance query for “show all regulatory events for Wells Fargo” resolves CIK 72971 (for EDGAR), UEI J6MYQ3ZPMV26 (for SAM.gov contracts), LEI VYVLBKJTMIMQ05WXPE37 (for FinCEN), and the full
npi_list(for CMS) in a single bridge lookup — no per-dataset ID lookup required by the caller. - An M&A diligence query asking “what is the full regulatory tail for this acquisition target?” returns EPA enforcement cases, OSHA inspection records, OFAC screening hits via subsidiaries, EDGAR filings, and SAM.gov exclusion history in one unified timeline.
- An investigative query asking “which entities appear in both the OFAC SDN list and the FDIC enforcement database?” becomes a SQL join through
entity_masterrather than a manual name-matching exercise across two separate government sites. - The MCP server exposes the entity bridge as a tool callable by Claude and other agent frameworks — a single tool invocation returns the full multi-agency regulatory profile for any company identifiable by ticker, CIK, UEI, LEI, or name.
The entity bridge is queryable directly through the Federal Regulatory API at /entity/resolve?name=Wells+Fargo or/entity/resolve?cik=72971, and via the MCP server's resolve_entity tool. The response includes all known IDs for the resolved entity, the confidence value, and the list of source datasets that contributed to the mapping.
Related articles
The entity graph that uses this bridge across all 197 datasets: Building the cross-agency regulatory entity graph: 35M records, one join →
The D1 database this normalization pipeline runs on: Building the Federal Regulatory Data Hub on Cloudflare D1: 35M records at the edge →
The query layer that routes entity lookups across eight shards: The Federal Regulatory Data Hub query layer: routing 35M records at the Cloudflare edge →
The API that exposes entity resolution to callers and agent frameworks: The Federal Regulatory API: REST, MCP, and JSON-LD for 197 federal datasets →
Full-text search over the normalized entity names using SQLite FTS5 and BM25: Full-text search across 35M federal records: SQLite FTS5, BM25 ranking, and cross-shard fan-out →