Technical writing

Name matching in federal regulatory data: aliases, subsidiaries, and sanctions evasion across 197 datasets

· 10 min read· AI Analytics
Regulatory dataEntity resolutionComplianceData engineering

The Federal Regulatory Data Hub's cross-agency entity bridge is only as good as its ability to recognize that “JP Morgan Chase NA”, “JPMorgan Chase Bank, National Association”, and “JPMORGAN CHASE & CO” are the same entity appearing in different federal datasets with different name conventions. The entity bridge joins on canonical identifiers where they exist — SEC CIK, SAM.gov UEI, LEI, DUNS, NPI — but roughly 34% of cross-dataset matches we need to make have no shared identifier. Those matches must come from name similarity. This post covers the name-matching pipeline: normalization, alias expansion, the three matching passes, false positive rates by method, and how match confidence propagates into the compliance risk score.

Why identifiers alone are insufficient

Federal identifier coverage varies dramatically by dataset. SEC EDGAR assigns a CIK to every registrant. SAM.gov assigns a UEI to every government contractor. But OFAC's SDN list is identifier-free for most entries — it contains entity names, aliases, addresses, and dates of birth, but no CIK or UEI. The EPA's ECHO enforcement database uses a Facility Registry Service ID (FRS ID) that has no relationship to SEC or SAM.gov identifiers. DOJ press releases are free text with no structured identifier fields at all.

For the compliance risk score to work — a single API call that returns enforcement actions across SEC, OFAC, EPA, CFPB, DOJ, OSHA, MSHA, and 20+ other sources — the entity bridge must link records across datasets that share no common identifier. The 34% identifier gap means that roughly 1 in 3 cross-agency links depends entirely on name matching quality.

Name normalization

Before any similarity computation, every entity name goes through a normalization pipeline. Normalization is the single highest-leverage step: “JPMorgan” and “JP Morgan” that differ only in whitespace are a false negative from a case-sensitive exact match, but after normalization they become identical strings.

def normalize_entity_name(raw: str) -> str:
    """
    Canonical form for entity name matching.
    Input: "JP Morgan Chase Bank, National Association (N.A.)"
    Output: "jp morgan chase bank national association"
    """
    import unicodedata, re

    # 1. NFKC Unicode normalization: decompose then recompose in compatibility form
    name = unicodedata.normalize('NFKC', raw)

    # 2. Case fold (handles ß -> ss, etc. beyond .lower())
    name = name.casefold()

    # 3. Strip legal suffixes — order matters, strip longest first
    LEGAL_SUFFIXES = [
        'national association', 'national assoc', 'n.a.', 'na',
        'limited liability company', 'limited liability corp',
        'incorporated', 'corporation', 'company', 'limited',
        'inc.', 'corp.', 'co.', 'ltd.', 'llc', 'lp', 'plc',
        'international', 'group', 'holdings', 'holding',
        'bank', 'trust',  # strip only when trailing
    ]
    for suffix in LEGAL_SUFFIXES:
        if name.endswith(' ' + suffix) or name.endswith(', ' + suffix):
            name = name[:-(len(suffix) + 1)].rstrip(', ')

    # 4. Remove punctuation, preserve internal spaces
    name = re.sub(r"[&'".,;:()-/\]", ' ', name)

    # 5. Collapse whitespace
    name = re.sub(r's+', ' ', name).strip()

    return name

The legal suffix list is ordered longest-first to prevent partial stripping — “national association” must be removed before “association” alone, or the residue creates a spurious token. We maintain a 340-entry suffix table covering English, Spanish, French, German, and Arabic legal entity designators, since OFAC SDN entries include non-English entity names.

OFAC alias explosion

Every SDN entry can have multiple aliases. The SDN XML structure looks like this:

<sdnEntry uid="12345">
  <firstName>YUKOS</firstName>
  <lastName>OIL COMPANY</lastName>
  <akaList>
    <aka uid="12346">
      <type>a.k.a.</type>
      <lastName>YUKOS FINANCE BV</lastName>
    </aka>
    <aka uid="12347">
      <type>f.k.a.</type>
      <lastName>ROSPROM</lastName>
    </aka>
    <aka uid="12348">
      <type>n.k.a.</type>
      <lastName>MENATEP GROUP</lastName>
    </aka>
  </akaList>
</sdnEntry>

We explode every AKA into a separate row in the entity_aliases table, preserving the alias type (a.k.a., f.k.a., n.k.a.) and linking back to the canonical SDN UID. When screening a query name, we match against all aliases, not just the primary name. A match against any alias returns the canonical SDN entry.

-- entity_aliases table
CREATE TABLE entity_aliases (
    alias_id      INTEGER PRIMARY KEY,
    entity_id     TEXT NOT NULL REFERENCES entity_master(entity_id),
    alias_name    TEXT NOT NULL,          -- raw name
    alias_norm    TEXT NOT NULL,          -- normalized
    alias_type    TEXT,                  -- 'aka', 'fka', 'nka', 'dba'
    source        TEXT NOT NULL,          -- 'ofac_sdn', 'sec_edgar', 'cfpb', ...
    confidence    REAL DEFAULT 1.0
);
CREATE INDEX idx_alias_norm ON entity_aliases (alias_norm);

The 12,247 SDN entries in the current list expand to 44,891 aliases after explosion. The alias table also receives DBA (doing-business-as) names from SEC DEF 14A proxy filings, which frequently list trade names that differ from the registered corporate name: “Apple Inc.” operates retail stores under “Apple Store”, which is not an alias visible in the CIK or UEI databases.

Subsidiary mapping via SEC EDGAR

A major source of false negatives in sanctions screening is the subsidiary problem: an OFAC-sanctioned entity's subsidiary may not appear directly on the SDN list, but for compliance purposes an organization doing business with the subsidiary is transacting with a sanctioned group. SEC EDGAR's filing database provides parent-subsidiary relationships via the 10-K “Exhibit 21 — Subsidiaries” filing, which lists all significant subsidiaries of a reporting company.

-- Subsidiary relationships from SEC Exhibit 21
CREATE TABLE entity_subsidiaries (
    parent_cik    TEXT NOT NULL,
    child_cik     TEXT,              -- NULL when child not separately registered
    child_name    TEXT NOT NULL,     -- subsidiary legal name
    child_norm    TEXT NOT NULL,     -- normalized
    jurisdiction  TEXT,             -- state/country of incorporation
    ownership_pct REAL,             -- percentage if disclosed
    filing_date   DATE NOT NULL
);

-- Query: find all subsidiaries of a sanctioned parent
SELECT es.child_name, es.jurisdiction, es.ownership_pct
FROM entity_subsidiaries es
JOIN entity_master em ON es.parent_cik = em.sec_cik
WHERE em.entity_id = 'OFAC_SDN_12345'
ORDER BY ownership_pct DESC NULLS LAST;

We parse Exhibit 21 filings using a rule-based extractor that handles the five most common formats: numbered list, indented tree, HTML table, XBRL tagged, and plain text paragraph. The extractor has a 91% recall on a 400-document held-out test set from the SEC EDGAR archive. The 9% miss rate is dominated by non-standard formatting in smaller companies' filings.

Three-pass matching

After normalization and alias explosion, entity name matching proceeds in three passes in decreasing order of precision:

Pass 1 — exact normalized match. If the normalized query string exactly matches a normalized alias in the entity_aliases table, the match is accepted with confidence 1.0. This covers the majority of intra-dataset links (same company, different formatting) and most OFAC SDN hits after normalization.

Pass 2 — Jaro-Winkler similarity. For names that survive pass 1 without a hit, we run Jaro-Winkler against the FTS5-indexed alias table. Jaro-Winkler is optimal for short names (3–5 tokens) where transposition and prefix similarity matter — “ZION BANK” vs. “ZIONS BANK”, or “WELLS FARGO” vs. “WELS FARGO”. Threshold: score ≥ 0.88.

Pass 3 — TF-IDF cosine similarity. For longer names (6+ tokens) or when pass 2 finds no candidate above threshold, we fall back to TF-IDF vector comparison. TF-IDF handles cases where tokens appear in different order or when a DBA name omits common tokens: “Goldman Sachs Asset Management” matches “Goldman Sachs Asset Management, L.P.” where the legal suffix stripping leaves similar but not identical token sets. Threshold: cosine similarity ≥ 0.72.

import jellyfish
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class EntityMatcher:
    def __init__(self, alias_norms: list[str]):
        self.alias_norms = alias_norms
        # Build TF-IDF index over all known alias norms
        self.vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 4))
        self.tfidf_matrix = self.vectorizer.fit_transform(alias_norms)

    def match(self, query_norm: str) -> tuple[str | None, float, str]:
        """Returns (matched_alias, confidence, method)."""

        # Pass 1: exact match
        if query_norm in self.alias_set:
            return query_norm, 1.0, 'exact'

        # Pass 2: Jaro-Winkler (fast, good for short names)
        jw_scores = [jellyfish.jaro_winkler_similarity(query_norm, a) for a in self.alias_norms]
        best_jw = max(range(len(jw_scores)), key=jw_scores.__getitem__)
        if jw_scores[best_jw] >= 0.88:
            return self.alias_norms[best_jw], jw_scores[best_jw], 'jaro_winkler'

        # Pass 3: TF-IDF cosine (handles long names and reordered tokens)
        query_vec = self.vectorizer.transform([query_norm])
        cos_scores = cosine_similarity(query_vec, self.tfidf_matrix)[0]
        best_cos = int(np.argmax(cos_scores))
        if cos_scores[best_cos] >= 0.72:
            return self.alias_norms[best_cos], float(cos_scores[best_cos]), 'tfidf'

        return None, 0.0, 'no_match'

False positive rates by method and dataset

We evaluated false positive rates on a 5,000-pair test set: 2,500 true matches (same entity, different source) and 2,500 true non-matches (different entities with similar names). False positive rates by matching method:

MethodFP rateFN rateBest for
Exact (raw)0.0%41.3%datasets with consistent formatting
Exact (normalized)0.3%18.7%most intra-dataset links
Jaro-Winkler ≥ 0.882.1%8.4%short names, typos, abbreviations
TF-IDF cosine ≥ 0.724.8%5.1%long names, reordered tokens, DBA names
Three-pass combined1.4%4.9%general cross-dataset matching

The 1.4% combined false positive rate means that roughly 1 in 70 name-based cross-dataset links is incorrect. For the compliance risk score, each link carries a match_method and entity_confidence field in the API response. Risk scores derived from TF-IDF matches are penalized by a 0.7× confidence multiplier before contributing to the final score, so a low-quality fuzzy match against an OFAC SDN entry raises the risk score less than a confirmed exact match against the same entry.

The highest false positive rate occurs in the financial sector where many companies share generic names with geographic modifiers: “First National Bank” is a distinct entity in hundreds of states and cities, and after normalization all of them become “first national bank”. We handle this by requiring that name-only matches also agree on at least one geographic field (state, city, or country of incorporation) before crossing datasets. When no geographic field is available on either side, the match is flagged asgeo_unverified and the confidence is capped at 0.6.

Sanctions evasion patterns

OFAC publishes guidance on common sanctions evasion tactics that involve entity renaming. We maintain a set of structural heuristics that flag candidate matches as potentially evasion-related when the name similarity is above a lower threshold (0.78 Jaro-Winkler) but the two entities differ in ways consistent with known obfuscation patterns.

Character substitution. Replacing common Latin characters with visually similar Cyrillic, Greek, or Arabic characters — “RОСНЕФТЬ” using a Cyrillic О instead of a Latin O. Post-normalization with NFKC decomposition catches most substitutions: NFKC maps visually similar characters to their canonical forms when a compatibility equivalence exists. Characters with no NFKC decomposition are flagged separately.

Token reordering with insertion. “GLOBAL ENERGY SOLUTIONS” becoming “ENERGY SOLUTIONS GLOBAL TRADING LTD” — the same tokens plus an insertion. TF-IDF cosine similarity catches this (score ≈ 0.81 for this example), but the inserted token “TRADING” changes the entity's apparent business type. We compare the top-5 TF-IDF tokens between the query and the matched alias; when the token overlap is ≥ 80% but new tokens appear in the query that weren't in the original, the match is flaggedpossible_restructuring.

Jurisdiction hop with name preservation. The same entity name re-incorporated in a different jurisdiction — “MERIDIAN RESOURCES LLC (Nevada)” becoming “MERIDIAN RESOURCES LTD (Seychelles)”. Exact normalized match, but different jurisdiction. We flag this asjurisdiction_change when the match is exact or near-exact but the jurisdiction fields differ, especially when the new jurisdiction appears in the FATF high-risk or OFAC-designated country list.

The entity_confidence field

Every cross-dataset link in the API response carries an entity_confidencefield (0.0–1.0) that encodes the quality of the entity match:

{
  "entity_id": "entity_master_8819f2",
  "primary_name": "Meridian Energy Holdings LLC",
  "datasets": [
    {
      "source": "ofac_sdn",
      "record_id": "SDN-28821",
      "matched_alias": "MERIDIAN ENERGY HOLDINGS",
      "match_method": "exact_normalized",
      "entity_confidence": 0.97,
      "geo_verified": true,
      "flags": []
    },
    {
      "source": "sec_enforcement",
      "record_id": "AAER-3847",
      "matched_alias": "Meridian Energy Holdings, LLC",
      "match_method": "exact_normalized",
      "entity_confidence": 0.99,
      "geo_verified": true,
      "flags": []
    },
    {
      "source": "doj_press",
      "record_id": "DOJ-2024-0312",
      "matched_alias": "Meridian Energy Group Holdings",
      "match_method": "tfidf",
      "entity_confidence": 0.74,
      "geo_verified": false,
      "flags": ["geo_unverified"]
    }
  ]
}

The compliance risk score weights each enforcement event by itsentity_confidence before aggregating. An OFAC SDN hit withentity_confidence 0.97 contributes the full SDN severity multiplier (3.2×) to the risk score. The same hit via a TF-IDF match with confidence 0.74 contributes 0.74 × 3.2 = 2.37× — lower, but still a significant escalation over the baseline. This ensures that a fuzzy match against an SDN entry is flagged prominently rather than suppressed, while also not triggering maximum-severity alerts for what might be a false positive.

Continuous improvement loop

False positive and false negative rates are tracked per source dataset pair. When a compliance analyst flags a wrong match via the API feedback endpoint (POST /entity/{entity_id}/match-feedback), the case is added to the evaluation set and triggers a threshold recalibration for that source pair. Some dataset pairs warrant different thresholds than the global defaults: the CFPB complaint narrative database requires a lower Jaro-Winkler threshold (0.92, not 0.88) because company names in consumer complaint text are frequently misspelled by complainants, and those misspellings should not match unrelated entities.

Per-source-pair thresholds are stored in thematch_config table alongside the last-evaluated F1 score and the date of the most recent calibration. The calibration job runs weekly using the accumulated feedback set and a held-out test set that grows with each confirmed false positive or false negative report.


For how the entity bridge is structured across all 197 datasets — the schema, the join strategy, and the entity_master table: Building the cross-agency regulatory entity graph: 35M records, one join →

For how entity_confidence and source-specific severity multipliers combine into the 0–100 compliance risk score: Compliance screening across 30+ federal enforcement lists: how the risk score works →

For the identifier normalization layer (CIK, UEI, LEI, DUNS, NPI) that precedes name matching: Entity ID normalization in the Federal Regulatory Data Hub: resolving CIK, UEI, LEI, DUNS, and NPI across 197 datasets →

For OFAC SDN ingest specifics — conditional GET, XML parsing, alias explosion, and the FTS5 + Jaro-Winkler screening pipeline: OFAC SDN integration in the Federal Regulatory Data Hub →

For the end-to-end entity resolution pipeline that bridges this name matching into actionable compliance risk scores — confidence tiers, FP taxonomy, and calibration loop: Entity resolution for multi-list compliance screening: reducing false positives without sacrificing recall →

For how stable canonical IDs are generated and maintained after entity names are matched — SHA-256 ID generation, merge/split events, and EntityAlias tracking: Canonical entity IDs in the Federal Regulatory Data Hub: stable identifiers across 197 federal datasets →