Technical writing

Entity ID normalization in the Federal Regulatory Data Hub: resolving CIK, UEI, LEI, DUNS, and NPI across 208 datasets

March 25, 2025· 12 min read· AI Analytics

Regulatory dataEntity resolutionCloudflare D1Data engineering

The Federal Regulatory Data Hub indexes 208 federal datasets across 45 agencies. Each dataset uses its own identifier system for companies. SEC EDGAR assigns CIKs. GSA's SAM.gov issues UEIs. The GLEIF LEI Registry issues 20-character ISO 17442 codes. Dun & Bradstreet's DUNS number persists in older data. CMS assigns NPIs to healthcare providers. A company like Wells Fargo appears as CIK 72971, UEI J6MYQ3ZPMV26, LEI VYVLBKJTMIMQ05WXPE37, DUNS 003453416, and dozens of NPI numbers for its healthcare subsidiaries — with no shared foreign key across any government database.

Without normalization, a query for “show all regulatory events for Wells Fargo” would require knowing which identifier to use in each of the 208 datasets, and re-parameterizing each query accordingly. The entity_master bridge eliminates that requirement. This post documents the identifier landscape, the schema of the bridge table, and the three-pass resolution strategy we use to populate and query it.

The five federal identifier systems

Understanding the design of entity_master requires understanding why each identifier system exists, how stable it is, and what datasets depend on it. The five systems have meaningfully different scopes and coverage.

CIK — SEC Central Index Key

The CIK is SEC EDGAR's primary identifier for every filer: public companies, investment funds, insiders filing Form 4, and foreign private issuers. It is a 7–10 digit integer assigned once and never changed — CIKs are the most stable identifier in the federal system. Wells Fargo's CIK 72971 was assigned when the company first registered with EDGAR and will remain 72971 for as long as the company files with the SEC. There are approximately 40,000 active CIKs for publicly reporting companies; the full EDGAR corpus, including historical and deregistered filers, covers over 600,000 CIKs.

CIK coverage is excellent for publicly traded companies and SEC-registered investment vehicles but zero for private companies, nonprofits, government contractors, and healthcare providers that have no SEC reporting obligation.

UEI — Unique Entity Identifier

The UEI is GSA's 12-character alphanumeric replacement for the DUNS number in federal procurement. GSA began issuing UEIs in 2022 and completed the mandatory cutover by April 2022 — any entity registering in SAM.gov, applying for federal contracts, or receiving federal grants now has a UEI. The SAM.gov registry covers 2.8 million registered entities, but that population includes nonprofits, universities, foreign entities, and sole proprietors in addition to corporate contractors. UEI coverage is strong for entities that have ever sought federal business; it is nonexistent for purely private-market companies with no government touchpoint.

LEI — Legal Entity Identifier

The LEI is a 20-character alphanumeric code defined by ISO 17442 and issued by accredited Local Operating Units (LOUs) through the Global Legal Entity Identifier Foundation (GLEIF). It is used in FinCEN records, CFTC derivatives reporting, and Federal Reserve financial stability data. Approximately 2.4 million active LEIs exist globally. The LEI system carries a parent/child hierarchy — each LEI record links to its direct parent LEI and its ultimate parent LEI — making it uniquely useful for tracing subsidiary structures without manual corporate-tree research. Wells Fargo's LEI VYVLBKJTMIMQ05WXPE37 links to subsidiary LEIs for Wells Fargo Bank, NA, Wells Fargo Securities, and dozens of other operating entities.

DUNS — Data Universal Numbering System

The DUNS is Dun & Bradstreet's 9-digit identifier. It was the federal procurement standard for decades before the UEI transition, and it persists in older data and datasets not yet migrated: EPA ECHO facility records, OSHA inspection records, some FEMA grant data, and export-control license records. Active DUNS numbers are no longer issued for new federal registrations, but the existing corpus remains relevant for historical data joins and for datasets that have not completed their migration to UEI. We treat DUNS as a read-only secondary identifier — we populate it from historical records but do not rely on it for new entity registrations.

NPI — National Provider Identifier

The NPI is CMS's 10-digit identifier for healthcare providers: hospitals, physician practices, pharmacies, imaging centers, and medical device distributors. Used in Medicare and Medicaid claims, FDA facility data, and HHS-OIG exclusion records. Approximately 7 million active NPIs are in the NPPES registry. The critical design detail: NPIs are granular to the individual facility, specialty, or state license — a single hospital system like HCA Healthcare holds 180+ NPIs, one for each facility and care setting. The parent company's CIK is a single value; its NPI list is an array of 180+ entries. Our schema encodes this as a JSON array in npi_list, queried at runtime with SQLite's json_each().

The entity_master schema

The bridge table lives in its own Cloudflare D1 database, separate from the eight vertical shards that hold agency-specific data. Every entity that appears in any of the 208 datasets gets exactly one row in entity_master, identified by a synthetic primary key prefixed EM-. Known ID equivalences are stored as indexed columns; the NPI case gets a JSON array because the one-to-many relationship cannot be expressed as a single column without sacrificing query simplicity.

CREATE TABLE entity_master (
  entity_id      TEXT PRIMARY KEY,   -- "EM-0000012345"
  canonical_name TEXT NOT NULL,      -- preferred display name
  cik            TEXT UNIQUE,        -- SEC EDGAR Central Index Key
  uei            TEXT UNIQUE,        -- SAM.gov Unique Entity Identifier (12-char)
  lei            TEXT UNIQUE,        -- ISO 17442 Legal Entity Identifier (20-char)
  duns           TEXT UNIQUE,        -- D&B DUNS (9-digit, legacy)
  npi_list       TEXT,               -- JSON array of NPI strings
  ticker         TEXT,               -- exchange ticker symbol
  sic_code       TEXT,               -- SIC industry classification
  state          TEXT,               -- US state of registration (2-letter)
  country        TEXT DEFAULT 'US',  -- ISO 3166-1 alpha-2
  confidence     REAL NOT NULL,      -- 0.0–1.0 match confidence
  sources        TEXT NOT NULL,      -- JSON array of source dataset names
  last_updated   TEXT NOT NULL       -- ISO 8601 timestamp
);

-- Indexes on every ID column — cross-shard resolution must be fast
CREATE INDEX em_cik     ON entity_master(cik);
CREATE INDEX em_uei     ON entity_master(uei);
CREATE INDEX em_lei     ON entity_master(lei);
CREATE INDEX em_duns    ON entity_master(duns);
CREATE INDEX em_ticker  ON entity_master(ticker);
CREATE INDEX em_name    ON entity_master(canonical_name);

The confidence column is central to the design. Rows seeded from authoritative government cross-references — where two datasets independently assert the same company under two different ID systems — carry confidence = 1.0. Rows built from name-similarity matching carry lower confidence, and all queries against the entity bridge return the confidence value so callers can apply their own threshold. A compliance workflow demanding near-certainty can filter to confidence >= 0.95; a discovery workflow comfortable with candidate lists can accept confidence >= 0.7.

Foreign entities — companies with LEIs but no US regulatory footprint — are assigned anentity_id and populated lei andcountry columns, but left with NULL incik, uei, andnpi_list. Government contractors that are not publicly traded have a uei from SAM.gov but cik = NULL. The schema handles these sparse cases naturally: a join against any NULL column simply produces no match rather than a false positive.

The alias table

Entity names change. Companies restructure, rebrand, merge, and operate under doing-business-as names that bear little resemblance to their legal registered name. Theentity_aliases table captures every known name variation for each entity in entity_master, along with the time window during which the alias was valid:

CREATE TABLE entity_aliases (
  alias_id    INTEGER PRIMARY KEY AUTOINCREMENT,
  alias_text  TEXT NOT NULL,
  entity_id   TEXT NOT NULL REFERENCES entity_master(entity_id),
  alias_type  TEXT NOT NULL,  -- 'former_name' | 'dba' | 'abbreviation' | 'ticker'
  source      TEXT NOT NULL,  -- which dataset provided this alias
  valid_from  TEXT,           -- ISO 8601; NULL means unknown start
  valid_to    TEXT            -- ISO 8601; NULL means currently active
);

CREATE INDEX ea_alias_text  ON entity_aliases(alias_text);
CREATE INDEX ea_entity_id   ON entity_aliases(entity_id);

The alias_type taxonomy captures four distinct cases. Aformer_name is a legal name the entity used before a corporate rename or merger — “First Union Corporation” for what became Wachovia, then Wells Fargo. A dba is a trade name used concurrently with the legal name. An abbreviation is a shortened form commonly used in enforcement records — EPA enforcement cases often use “ExxonMobil Pipeline” where the full legal name is “ExxonMobil Pipeline Company, a subsidiary of ExxonMobil Corporation.” A ticker alias links the exchange symbol as a queryable text alias so lookups by ticker hit the alias index before touching the ID columns.

The valid_to column is what allows temporal queries: a query for a company name that was valid in 2018 will match alias rows wherevalid_to IS NULL OR valid_to > '2018-01-01'. When a merger closes and the acquired entity's name is retired, we setvalid_to to the merger completion date rather than deleting the row, preserving the historical record.

Three-pass resolution strategy

Populating entity_master from 208 datasets and then querying it at runtime both follow the same three-pass priority order. Pass 1 is exact ID cross-reference, exploiting cases where two datasets independently carry the same ID type for an entity. Pass 2 is alias table lookup for known DBA and former names. Pass 3 is TF-IDF fuzzy name matching, the most expensive and least reliable pass, used only when the first two fail and always flagged with reduced confidence.

Pass 1: exact ID cross-reference

When two datasets share a common identifier, the join is deterministic. The entity_master bridge enables this across identifier types that would otherwise be incompatible. A query joining USAspending contract records against SEC EDGAR filings illustrates the pattern:

-- Cross-dataset join using entity_master as the bridge
-- Find all companies that received federal contracts AND have SEC enforcement actions
SELECT
  em.canonical_name,
  em.entity_id,
  ua.award_amount,
  ua.award_date,
  se.action_type,
  se.penalty_amount,
  se.action_date
FROM usaspending_awards  ua
JOIN entity_master       em  ON ua.uei = em.uei
JOIN sec_enforcement     se  ON se.cik = em.cik
WHERE ua.award_amount > 10000000   -- contracts over $10M
  AND se.action_date > '2020-01-01'
  AND em.confidence >= 0.95
ORDER BY ua.award_amount DESC;

Without entity_master, this query cannot be written — there is no direct foreign key between the UEI column in USAspending and the CIK column in SEC enforcement records. The bridge table holds the equivalence, and the query reads it as a double join. Because both uei and cikare indexed in entity_master, the join operates as two indexed lookups rather than a table scan.

At ingest time, Pass 1 is seeded from two high-quality sources: EDGAR's company index (CIK → ticker → legal name) and SAM.gov's entity registration extract (UEI → EIN → CAGE → legal name). When a SAM.gov entity's EIN matches an EDGAR company's EIN — a linkage we can make via IRS EIN when both datasets carry it — the UEI and CIK are linked inentity_master with confidence = 1.0. The GLEIF daily full extract then augments those rows with LEI values using the same EIN-based cross-reference.

Pass 2: alias table lookup

When Pass 1 finds no match — the dataset carries only a free-text name, not a structured ID — we query the alias table before escalating to fuzzy matching. Alias table lookup is exact-match on normalized text, which means it is fast (an index scan) and carries high confidence (99.2% true positive rate in our evaluation corpus, 0.1% false positive rate).

-- Pass 2: alias table lookup
-- Input: a company name from a dataset that carries no structured ID
-- Output: entity_id if a match exists, with the confidence of that alias

SELECT
  em.entity_id,
  em.canonical_name,
  em.cik,
  em.uei,
  em.lei,
  ea.alias_type,
  ea.source,
  em.confidence
FROM entity_aliases ea
JOIN entity_master  em ON ea.entity_id = em.entity_id
WHERE ea.alias_text = normalize_name(:input_name)   -- exact match on normalized form
  AND (ea.valid_to IS NULL OR ea.valid_to > :query_date)
ORDER BY em.confidence DESC
LIMIT 5;

The normalize_name() function applied at query time is the same normalization applied when aliases were inserted — stripping punctuation, uppercasing, and removing legal suffixes. This ensures that “Wells Fargo & Company” and “WELLS FARGO COMPANY” both resolve to the same normalized form and hit the same alias row.

Pass 3: TF-IDF fuzzy name matching

When neither Pass 1 nor Pass 2 produces a match, we fall back to token-weighted similarity against the full entity_master corpus. This is the most expensive path and the least reliable — it is only invoked when direct ID and alias lookup have both failed, and every result it produces is assigned confidence < 0.7. Results below 0.6 are rejected entirely rather than returned with a caveat.

def fuzzy_match_entity(
    query_name: str,
    query_state: Optional[str] = None,
    query_country: str = "US",
    min_confidence: float = 0.6,
) -> list[EntityMatch]:
    """
    Pass 3 entity resolution: TF-IDF fuzzy name match.
    Only invoked after Pass 1 (exact ID) and Pass 2 (alias table) fail.
    All results carry confidence < 0.7.
    """
    # Step 1: normalize the query name before tokenization
    normalized_query = normalize_company_name(query_name)

    # Step 2: retrieve top-20 candidates from the pre-built TF-IDF index
    # The index is built over canonical_name + all alias_text values per entity
    candidates = tfidf_index.query(normalized_query, top_k=20)

    results: list[EntityMatch] = []
    for candidate in candidates:
        score = compute_match_score(
            query_name,
            candidate.canonical_name,
            query_state,
            candidate.state,
        )
        if score >= min_confidence:
            results.append(EntityMatch(
                entity=candidate,
                confidence=min(score, 0.69),  # cap below 0.7; Pass 3 never exceeds this
            ))

    return sorted(results, key=lambda x: -x.confidence)


def compute_match_score(
    query: str,
    candidate: str,
    query_state: Optional[str],
    candidate_state: Optional[str],
) -> float:
    """
    Combine Jaro-Winkler string similarity with a geographic bonus.
    Geographic agreement provides a +0.05 boost to handle cases where
    a common company name (e.g., "National Energy Corporation") appears
    in multiple states — state context disambiguates.
    """
    name_score = jaro_winkler_similarity(
        normalize_company_name(query),
        normalize_company_name(candidate),
    )
    state_bonus = (
        0.05
        if query_state is not None and query_state == candidate_state
        else 0.0
    )
    return min(1.0, name_score + state_bonus)

Company name normalization

All three passes depend on a consistent normalization function that collapses the surface variation in company names before any comparison. Federal datasets use inconsistent capitalization, punctuation, and legal suffix abbreviations. Without normalization, “Wells Fargo & Co.”, “WELLS FARGO AND COMPANY”, and “Wells Fargo Co” would fail to match each other. The normalization function is deterministic and applied identically at ingest time (when buildingentity_aliases) and at query time (when comparing incoming names).

# Legal-suffix stop words — removed from normalized form before comparison.
# Does NOT include substantive industry words like "Bank", "Healthcare", "Energy".
STOP_WORDS = frozenset({
    "the", "and", "of", "for",
    "co", "corp", "inc", "llc", "ltd", "plc",
    "gmbh", "sa", "sas", "nv", "bv", "ag", "se", "lp",
    "company", "corporation", "incorporated", "limited",
    "partners", "partnership", "associates", "group",
})

# Abbreviation expansion map — applied before stop word removal
ABBREV_MAP = {
    r"\bINT'L\b": "INTERNATIONAL",
    r"\bINTL\b":  "INTERNATIONAL",
    r"\bMFG\b":   "MANUFACTURING",
    r"\bSVC\b":   "SERVICES",
    r"\bINDS\b":  "INDUSTRIES",
    r"\bHLTH\b":  "HEALTH",
    r"\bFIN\b":   "FINANCIAL",
}

def normalize_company_name(name: str) -> str:
    """
    Canonical normalization applied identically at ingest and query time.
    Returns a space-joined token string for comparison or TF-IDF indexing.
    """
    name = name.upper().strip()

    # Remove punctuation, keeping internal word boundaries
    name = re.sub(r"[^ws]", " ", name)
    name = re.sub(r"s+", " ", name).strip()

    # Expand common abbreviations before stop-word removal
    for pattern, expansion in ABBREV_MAP.items():
        name = re.sub(pattern, expansion, name)

    # Tokenize and filter stop words
    tokens = [t for t in name.split() if t.lower() not in STOP_WORDS]

    return " ".join(tokens)

A few design decisions worth noting: we do not remove industry words like “Bank” or “Healthcare” because they are discriminative — “Wells Fargo Bank” and “Wells Fargo Securities” are distinct entities in our corpus. We expand abbreviations before stop-word removal so that “INTL” becomes “INTERNATIONAL” and is then comparable to the expanded form in another dataset's record. And we uppercase before any token operation to make the pipeline case-insensitive without a separatelower() call at comparison time.

Resolution accuracy by method

We evaluated each pass against a labeled corpus of 12,000 entity pairs drawn from overlapping datasets where ground-truth equivalence was independently verifiable through official government cross-references. The results by resolution method:

Method	True positive	False positive	Use case
Exact ID join	99.9%	0.01%	Same ID type across datasets
Alias table	99.2%	0.1%	Known DBA and former names
TF-IDF fuzzy (≥0.9)	97%	1%	High-confidence name match
TF-IDF fuzzy (0.7–0.9)	85%	8%	Returned with confidence caveat
TF-IDF fuzzy (<0.7)	—	—	Rejected; not returned

The 8% false positive rate for TF-IDF results in the 0.7–0.9 confidence band is the reason those results are always tagged with their confidence value in the API response. A caller querying the entity bridge for “National Energy Corporation” might receive three candidate matches in that band — all plausible, none certain — and is expected to apply domain context to disambiguate. The API never silently returns a low-confidence match as if it were exact.

Special cases in the corpus

Healthcare subsidiaries with 180+ NPIs

A hospital system like HCA Healthcare holds a single CIK (860730) as a publicly traded company and a single UEI for federal contracts, but 180+ NPIs for its individual facilities spread across 20 states. Queries against CMS Medicare claims or FDA facility registration data must join against all 180+ NPIs. The npi_list JSON array handles this:

-- Query CMS hospital quality data for all HCA facilities
-- json_each() expands the npi_list array into individual rows for the join
SELECT
  em.canonical_name,
  cms.facility_name,
  cms.npi,
  cms.overall_rating,
  cms.state
FROM entity_master em,
     json_each(em.npi_list)  npi_row
JOIN cms_hospitals cms ON cms.npi = npi_row.value
WHERE em.cik = '860730'   -- HCA Healthcare CIK
ORDER BY cms.state, cms.facility_name;

The json_each() table-valued function is SQLite-native and available in Cloudflare D1. For the HCA Healthcare case, this expands into 180+ rows before the join against cms_hospitals, which is indexed onnpi. The full query resolves in under 15ms on the healthcare shard.

Foreign entities

Non-US entities appear in LEI data (the GLEIF corpus covers all 2.4 million active LEIs globally), in OFAC sanctions lists (which designate foreign entities and individuals), and as foreign private issuers in EDGAR. For these entities, cik,uei, npi_list, andduns are all NULL. The lei column is the primary identifier and the country column is non-US. LEI's parent/child hierarchy is exposed through a separatelei_hierarchy table rather than denormalized intoentity_master:

CREATE TABLE lei_hierarchy (
  lei         TEXT NOT NULL,
  parent_lei  TEXT,           -- direct parent; NULL if entity is its own parent
  ultimate_lei TEXT,          -- ultimate parent in the ownership chain
  relationship_type TEXT,     -- 'IS_DIRECTLY_CONSOLIDATED_BY' | 'IS_ULTIMATELY_CONSOLIDATED_BY'
  relationship_status TEXT,   -- 'ACTIVE' | 'INACTIVE'
  effective_date TEXT
);

CREATE INDEX lh_lei         ON lei_hierarchy(lei);
CREATE INDEX lh_parent_lei  ON lei_hierarchy(parent_lei);
CREATE INDEX lh_ultimate    ON lei_hierarchy(ultimate_lei);

Government contractors with no CIK

Private defense contractors, IT service providers, and consulting firms that hold federal contracts but are not publicly traded have a SAM.gov UEI but no SEC CIK. Theirentity_master row has ueipopulated and cik = NULL. These entities are fully resolvable from contract data (USAspending, SAM exclusions, FPDS) but invisible in EDGAR. The query layer handles this correctly by only joining against ID columns that are non-NULL in the bridge row — a LEFT JOIN against sec_enforcementon cik simply returns no EDGAR rows for these entities rather than erroring.

Update cadence and merge/acquisition handling

The bridge table is not static. New entities register daily; existing entities restructure, merge, and rename. We handle each event class differently:

New LEI registrations: GLEIF publishes a delta file daily via its API. We ingest it nightly and insert new rows intoentity_master for LEIs that do not yet appear in the table.
New SAM.gov UEI registrations: The SAM.gov entity registration extract is published daily. We diff against the previous snapshot and insert new entities. If a new UEI's EIN matches an existing CIK-bearing row inentity_master, we update that row'suei column in place rather than creating a duplicate.
New CIK assignments: EDGAR's company search index is polled weekly. New CIKs are inserted and cross-referenced against existing UEI and LEI records via EIN match.
Corporate mergers and acquisitions: When an M&A disclosure in EDGAR (a Form S-4 or 8-K announcing a completed merger) references two known entities, we flag the acquiring and acquired entity_ids, close the acquired entity's alias rows with a valid_to date equal to the merger close date, and add a former_name alias on the surviving entity row pointing to the former legal name. The acquired entity's entity_masterrow is retained with confidence = 1.0 and anacquired_by annotation in itssources JSON — historical records referencing the old entity continue to resolve correctly.

Query latency at the edge

Because all ID columns in entity_master are indexed, resolution is always an indexed point lookup rather than a scan. The bridge query itself — resolving any known identifier to an entity_id — executes in 1–3ms on Cloudflare D1. The consequential latency is in the downstream fan-out to the eight agency shards, which runs in parallel via Promise.all. Observed production latency for a full cross-agency entity timeline (all regulatory events for a company across all 8 shards) is p50 38ms, p99 120ms at the edge.

The entity bridge itself adds approximately 3ms to the total latency budget — the resolution query runs against the bridge database, which is a separate D1 binding from the agency shards. That 3ms is fixed overhead per request and does not scale with the number of shards queried.

For single-shard queries where the identifier type implies a specific shard — NPI implies healthcare, CIK implies securities — the query router uses shard affinity to skip the remaining seven shards entirely. A CMS provider lookup by NPI runs as: bridge resolution (3ms) + healthcare shard query (8ms) = 11ms total, rather than the full 38ms cross-agency budget. Identifier-type affinity routing is described in detail in the query layer post.

What the bridge enables

The entity_master normalization is what makes the Federal Regulatory Data Hub a cross-agency intelligence platform rather than a collection of isolated dataset endpoints. With the bridge in place:

A compliance query for “show all regulatory events for Wells Fargo” resolves CIK 72971 (for EDGAR), UEI J6MYQ3ZPMV26 (for SAM.gov contracts), LEI VYVLBKJTMIMQ05WXPE37 (for FinCEN), and the full npi_list (for CMS) in a single bridge lookup — no per-dataset ID lookup required by the caller.
An M&A diligence query asking “what is the full regulatory tail for this acquisition target?” returns EPA enforcement cases, OSHA inspection records, OFAC screening hits via subsidiaries, EDGAR filings, and SAM.gov exclusion history in one unified timeline.
An investigative query asking “which entities appear in both the OFAC SDN list and the FDIC enforcement database?” becomes a SQL join throughentity_master rather than a manual name-matching exercise across two separate government sites.
The MCP server exposes the entity bridge as a tool callable by Claude and other agent frameworks — a single tool invocation returns the full multi-agency regulatory profile for any company identifiable by ticker, CIK, UEI, LEI, or name.

The entity bridge is queryable directly through the Federal Regulatory API at /entity/resolve?name=Wells+Fargo or/entity/resolve?cik=72971, and via the MCP server's resolve_entity tool. The response includes all known IDs for the resolved entity, the confidence value, and the list of source datasets that contributed to the mapping.

The entity graph that uses this bridge across all 208 datasets: Building the cross-agency regulatory entity graph: 50M+ records, one join →

The D1 database this normalization pipeline runs on: Building the Federal Regulatory Data Hub on Cloudflare D1: 50M+ records at the edge →

The query layer that routes entity lookups across eight shards: The Federal Regulatory Data Hub query layer: routing 50M+ records at the Cloudflare edge →

The API that exposes entity resolution to callers and agent frameworks: The Federal Regulatory API: REST, MCP, and JSON-LD for 208 federal datasets →

Full-text search over the normalized entity names using SQLite FTS5 and BM25: Full-text search across 50M+ federal records: SQLite FTS5, BM25 ranking, and cross-shard fan-out →