Technical writing
Entity resolution for multi-list compliance screening: reducing false positives without sacrificing recall
The compliance screening risk score described in the previous article aggregates hits across 30+ federal enforcement lists into a single 0–100 score. That aggregation depends on a prior problem: establishing that “Boeing Co.” on the OFAC Non-SDN list, “The Boeing Company” in the BIS Entity List, and “BOEING COMPANY” in the DoD 1282 debarment record are the same legal entity — while “Boeing Aerospace Holdings GmbH” is not. Pure string matching cannot do this reliably. Set the similarity threshold too low and every large company matches every enforcement action mentioning a common corporate term. Set it too high and legitimately sanctioned entities slip through under their operating or subsidiary names.
The correct framing is entity resolution, not string matching. The goal is to link entity records — each a structured representation of a legal person or organization — rather than to compare raw name strings. This article covers how the Federal Regulatory Data Hub resolves entity identity across 30+ lists, the three-stage pipeline that does the work, the taxonomy of false positives each stage produces, and the confidence calibration that maps resolution outputs to compliance actions.
The entity resolution problem in compliance
A single sanctioned entity appears under dozens of name variants across 30+ federal lists. OFAC SDN entries carry an average of 3.7 aliases each. BIS Entity List entries use the name as it appeared in the export license denial. The CFPB complaint database uses the name as spelled by the consumer who filed the complaint. DOJ press releases use the name as cited in the indictment caption.
Name matching alone produces two failure modes. A loose threshold — say, Jaro-Winkler ≥ 0.75 applied globally — generates a massive false positive rate: “Samsung Electronics Co.” matches an OFAC SDN entry for a North Korean Samsung subsidiary because the name stems are identical and Jaro-Winkler cares about prefix similarity. A tight threshold — exact normalized match only — misses the same sanctioned entity when it appears under a common abbreviation or transliteration variant not present in the exact-match index.
Entity resolution reframes the problem. Instead of comparing query strings against list strings, we build a canonical representation of each known entity, populate it with all structured identifiers and known name variants, and ask whether the query record and the list record refer to the same underlying entity. Structured identifiers — LEI, UEI, DUNS, CIK, NPI — resolve the majority of organizational entities without touching name strings at all. Name-based matching is reserved for the minority of records where no shared identifier exists.
Canonical entity representation
Every known entity is represented as an EntityRecord in the entity master table:
struct EntityRecord {
entity_id: Uuid,
canonical_name: String,
name_variants: Vec<String>,
identifier_map: HashMap<IdScheme, String>,
// IdScheme: LEI | UEI | DUNS | CIK | NPI
entity_type: EntityType,
// EntityType: INDIVIDUAL | ORGANIZATION | VESSEL | AIRCRAFT
country_of_incorporation: Option<String>,
jurisdiction_codes: Vec<String>,
source_list_ids: Vec<String>,
last_seen_active: NaiveDate,
}The identifier_map is the primary resolution surface. Legal Entity Identifiers (LEI) are globally unique and assigned by GLEIF-accredited issuers; a match on LEI is a confirmed identity link. Unique Entity Identifiers (UEI) are assigned by SAM.gov and used across federal contracting and grants. DUNS numbers are Dun & Bradstreet identifiers used in older federal datasets. CIK is the SEC EDGAR company identifier. NPI is the CMS National Provider Identifier for healthcare entities.
The name_variants field is populated from all observed name forms across source lists: the primary SDN name, every alias in the AKA list, every DBA name from SEC DEF 14A filings, and every historical name from prior filings. The last_seen_active date enables age-gating: an entity record that has not been observed active in any source list for more than five years receives reduced confidence in identity links, because the legal entity may have been dissolved and the name reused by an unrelated successor.
Three-stage resolution pipeline
Stage 1: Identifier join
The first stage attempts to resolve entity identity through exact joins on structured identifiers. Confidence levels by identifier type are derived from observed error rates in cross-list validation:
- LEI — 100% confidence. GLEIF assigns LEIs to specific legal entities with mandatory uniqueness; two records sharing an LEI are the same entity by definition.
- UEI — 99.8% confidence. SAM.gov enforces uniqueness; the 0.2% error rate comes from data-entry errors in source list records that copy-paste UEIs incorrectly.
- DUNS — 99.2% confidence. DUNS numbers have been historically reused after entity dissolution, and some legacy datasets contain stale DUNS mappings; the 0.8% error rate reflects those cases.
Stage 1 resolves approximately 34% of organizational entity links. The SQL pattern:
SELECT
em.entity_id,
em.canonical_name,
er.list_id,
er.record_id,
'IDENTIFIER' AS resolution_stage,
CASE
WHEN er.lei IS NOT NULL THEN 1.00
WHEN er.uei IS NOT NULL THEN 0.998
WHEN er.duns IS NOT NULL THEN 0.992
END AS confidence_score
FROM entity_master em
JOIN enforcement_record er
ON (em.identifier_map->>'LEI' = er.lei AND er.lei IS NOT NULL)
OR (em.identifier_map->>'UEI' = er.uei AND er.uei IS NOT NULL)
OR (em.identifier_map->>'DUNS' = er.duns AND er.duns IS NOT NULL)
WHERE em.entity_type = er.entity_type
AND em.last_seen_active >= CURRENT_DATE - INTERVAL '10 years';The entity_type filter matters: an ORGANIZATION record should never be identifier-joined to an INDIVIDUAL record even if a data-entry error caused an identifier collision. The recency filter removes entity records that are almost certainly dissolved legal entities whose identifiers may have been reused.
Stage 2: Canonical name normalization and FTS5
Records that did not resolve in Stage 1 proceed to canonical name matching. Every name goes through a normalization pipeline before entering the FTS5 index:
- NFKD Unicode normalization — decomposes characters to their base + diacritic form, enabling accent-insensitive matching without losing information.
- Legal suffix stripping — removes entity type designators that vary by jurisdiction and filing convention. The full suffix list: Co., Corp., Inc., Ltd., LLC, LLP, PLC, GmbH, AG, SA, BV, NV, Srl, Sarl, AB, AS, Pty, Ltd, Limited, Company, Corporation, Incorporated, Brothers, Bros., & Co., Group, Holdings, International, Services, Solutions, Systems, Technologies, Enterprises, Partners. Stripping is applied longest-first to prevent partial removal leaving suffix fragments.
- Punctuation collapse — replaces hyphens, commas, periods, parentheses, and ampersands with spaces, then collapses runs of whitespace to a single space.
After normalization, names are indexed in SQLite FTS5 using the BM25 ranking function. FTS5 BM25 handles token frequency and document length normalization, which matters for entity names: “International Business Machines” has three high-value tokens, while “IBM” after normalization is a single token that FTS5 ranks differently. The acceptance threshold is BM25 score ≥ 0.7. Stage 2 resolves approximately 41% of the entities remaining after Stage 1 — roughly 27% of the total population.
Stage 3: Jaro-Winkler fuzzy match with entity-type filter
Records still unresolved after Stage 2 enter the fuzzy matching stage. Jaro-Winkler is optimal for compliance name matching because it weights prefix similarity heavily — most sanctioned entity name variants share a common prefix and differ in suffix or middle tokens — and it is robust to single-character transpositions and insertions typical in OCR errors and manual data entry.
Thresholds are calibrated per source list based on accumulated analyst feedback:
- Global default: Jaro-Winkler ≥ 0.88
- CFPB complaint dataset: 0.92 — consumer-submitted names are frequently misspelled; a looser threshold generates too many wrong matches against unrelated financial institutions.
- OFAC SDN aliases: 0.82 — OFAC aliases are deliberately broad (they include romanizations and historical names); a tighter threshold misses legitimate alias matches.
Stage 3 applies a hard entity-type agreement filter before any score is computed: an ORGANIZATION record cannot match an INDIVIDUAL record regardless of name similarity score. This single filter eliminates a large class of false positives where a person's name resembles a company name — “John Holdings LLC” vs. an individual named “John Holdings” in the HHS-OIG exclusion list.
Confidence for fuzzy matches is calibrated to the range 0.60–0.87. Matches in the 0.60–0.72 confidence band are routed to the human review queue rather than auto-flagged. Stage 3 resolves approximately 18% of the entities remaining after Stage 2.
False positive taxonomy
Across the three stages, false positives cluster into four structural types. The frequency distribution is measured against a 5,000-pair validation set annotated by compliance analysts:
Same-name, different entity (47% of FPs)
The most common false positive: two distinct legal entities share a name or near-name, one of which is sanctioned. The canonical example is “Samsung”: OFAC sanctions a North Korean front company with Samsung in its name; the compliance screening system must not match Samsung Electronics Co., Ltd. (South Korean public company, KOSPI 005930).
Resolution: after a name match, validate that the matched entity record and the query record agree on jurisdiction_codes and, when available, at least one structured identifier. A jurisdiction mismatch (North Korea vs. South Korea) combined with no shared identifier downgrades the match to POSSIBLE_MATCH and routes it to human review rather than auto-flagging.
Subsidiary-parent ambiguity (28% of FPs)
A sanctioned subsidiary's name partially matches the parent company's name. The OFAC SDN entry is for the subsidiary; the compliance screening query is for the parent. The name similarity is high because the subsidiary was named after the parent at incorporation.
Resolution: the entity graph maintains an is_controlled_by relationship between subsidiary and parent records. When a Stage 2 or Stage 3 match fires between a query entity and a sanctioned entity, the resolver checks whether anis_controlled_by edge links them. If the match is parent-to-subsidiary in the wrong direction (the sanctioned record is a subsidiary, the query is the parent company not itself sanctioned), the match confidence is adjusted downward and the flag subsidiary_not_parent is added to the resolution result.
Historical name collision (16% of FPs)
An entity changed its name; the old name appears in an older enforcement list. A different, unrelated entity subsequently adopted the same name. The enforcement record refers to the original entity, but the resolution system incorrectly links it to the name-adopting successor.
Resolution: every entity record carries a last_seen_active date. When a name match links a query entity to an enforcement record, the resolver checks whether the enforcement record predates a gap in the matched entity's activity window. A record dated 2012 matched to an entity whose earliest source-list appearance is 2018 is a temporal impossibility that should downgrade the confidence score.
Transliteration mismatch (9% of FPs)
Russian, Arabic, and Chinese entity names with multiple valid romanizations are indexed under all known variants in the name_variants field (average 3.7 aliases per OFAC SDN entry). When a query arrives under a romanization variant not present in the alias list, Stage 3 fuzzy matching should catch it — but a false positive occurs when the romanization of a different Cyrillic or Arabic name happens to score above the Jaro-Winkler threshold against a sanctioned entity's canonical romanization.
Resolution: the OFAC SDN alias explosion already handles most transliteration variants (OFAC explicitly lists alternate romanizations in the AKA list). Residual mismatches are caught by requiring a geographic or identifier corroboration for any Stage 3 match against an OFAC or BIS record where the matched alias type is a transliteration variant.
EntityResolutionResult and confidence-to-action mapping
Every resolution attempt returns a structured result:
struct EntityResolutionResult {
entity_id: Uuid,
resolution_stage: ResolutionStage,
// ResolutionStage: IDENTIFIER | CANONICAL_NAME | FUZZY | UNRESOLVED
confidence_score: f32,
supporting_identifiers: Vec<(IdScheme, String)>,
name_similarity_score: f32,
entity_type_match: bool,
active_status: EntityActiveStatus,
// EntityActiveStatus: ACTIVE | INACTIVE | UNKNOWN
}Confidence score ranges map to compliance actions:
| Confidence | Classification | Action |
|---|---|---|
| ≥ 0.90 | MATCH | Auto-flag; block pending review in regulated workflows |
| 0.72 – 0.90 | PROBABLE_MATCH | Auto-flag + route to human review queue |
| 0.60 – 0.72 | POSSIBLE_MATCH | Human review queue; no auto-flag |
| < 0.60 | NO_MATCH | No action; logged for drift monitoring |
The 0.72 boundary between PROBABLE_MATCH and POSSIBLE_MATCH was calibrated on the validation set to achieve precision ≥ 95% at PROBABLE_MATCH while keeping the human review queue to a manageable volume. At precision 95%, approximately 1 in 20 PROBABLE_MATCH results is a false positive — an acceptable rate for a queue that receives analyst review before any action is taken.
Performance metrics
Resolution stage breakdown across the full entity population:
- Identifier join (Stage 1): 34% of all organizational entities resolved here
- Canonical FTS5 (Stage 2): 41% of the remaining 66%, i.e., approximately 27% of total
- Jaro-Winkler fuzzy (Stage 3): 18% of the entities remaining after Stage 2, i.e., approximately 4.7% of total
- Unresolved: approximately 1.3% of total (routed to a low-confidence staging table for periodic batch re-evaluation)
Combined: roughly 74% of entities resolve through identifier join or canonical name matching; 4.7% through fuzzy matching; 1.3% remain unresolved.
False positive rates by resolution stage, measured on the annotated validation set:
| Stage | FP rate | Primary FP type |
|---|---|---|
| Identifier join | 0.02% | Stale DUNS reuse; data-entry identifier errors |
| Canonical FTS5 | 0.31% | Same-name different entity; generic industry terms |
| Jaro-Winkler fuzzy | 1.8% | Subsidiary-parent confusion; transliteration overlap |
Recall across a 127-entity gold-standard test set (entities confirmed present in at least one source list): 99.1%. The 0.9% miss rate consists of entities that appear only under a romanization variant not present in the alias index and whose name similarity to the canonical form falls below the Stage 3 threshold. Precision at confidence ≥ 0.90: 98.7%.
Resolution implementation
The resolution function runs the three stages in sequence, returning on the first successful match:
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import jellyfish
import sqlite3
class ResolutionStage(Enum):
IDENTIFIER = 'IDENTIFIER'
CANONICAL_NAME = 'CANONICAL_NAME'
FUZZY = 'FUZZY'
UNRESOLVED = 'UNRESOLVED'
class EntityType(Enum):
INDIVIDUAL = 'INDIVIDUAL'
ORGANIZATION = 'ORGANIZATION'
VESSEL = 'VESSEL'
AIRCRAFT = 'AIRCRAFT'
@dataclass
class EntityResolutionResult:
entity_id: Optional[str]
resolution_stage: ResolutionStage
confidence_score: float
supporting_identifiers: list[tuple[str, str]]
name_similarity_score: float
entity_type_match: bool
active_status: str # 'ACTIVE' | 'INACTIVE' | 'UNKNOWN'
LEGAL_SUFFIXES = [
'and company', 'and co', 'brothers', 'bros',
'incorporated', 'corporation', 'company', 'limited',
'enterprises', 'international', 'technologies', 'solutions',
'systems', 'services', 'partners', 'holdings', 'group',
'inc', 'corp', 'co', 'ltd', 'llc', 'llp', 'plc',
'gmbh', 'ag', 'sa', 'bv', 'nv', 'srl', 'sarl', 'ab', 'as',
'pty', 'pty ltd',
]
def _normalize(name: str) -> str:
import unicodedata, re
name = unicodedata.normalize('NFKD', name).casefold()
# Strip legal suffixes longest-first
for suffix in LEGAL_SUFFIXES:
if name.endswith(' ' + suffix):
name = name[:-(len(suffix) + 1)].rstrip(' ,')
name = re.sub(r'['".,;:()-/\&]', ' ', name)
return re.sub(r's+', ' ', name).strip()
def resolve_entity_for_screening(
query_name: str,
query_identifiers: dict, # e.g. {'LEI': 'HWUPKR0MPOU8FGXBT394'}
query_type: EntityType,
db: sqlite3.Connection,
) -> EntityResolutionResult:
"""
Three-stage entity resolution for compliance screening.
Returns the highest-confidence match found, or UNRESOLVED.
"""
# Stage 1: Identifier join
id_confidence = {'LEI': 1.00, 'UEI': 0.998, 'DUNS': 0.992, 'CIK': 0.995, 'NPI': 0.997}
for scheme, value in query_identifiers.items():
if not value or scheme not in id_confidence:
continue
row = db.execute(
'''SELECT entity_id, entity_type, last_seen_active
FROM entity_master
WHERE json_extract(identifier_map, '$.' || ?) = ?
LIMIT 1''',
(scheme, value)
).fetchone()
if row and row['entity_type'] == query_type.value:
return EntityResolutionResult(
entity_id=row['entity_id'],
resolution_stage=ResolutionStage.IDENTIFIER,
confidence_score=id_confidence[scheme],
supporting_identifiers=[(scheme, value)],
name_similarity_score=1.0,
entity_type_match=True,
active_status='ACTIVE' if row['last_seen_active'] else 'UNKNOWN',
)
# Stage 2: Canonical FTS5
query_norm = _normalize(query_name)
fts_rows = db.execute(
'''SELECT em.entity_id, em.entity_type, em.last_seen_active,
bm25(entity_aliases_fts) AS bm25_score
FROM entity_aliases_fts
JOIN entity_master em USING (entity_id)
WHERE entity_aliases_fts MATCH ?
AND em.entity_type = ?
ORDER BY bm25_score
LIMIT 5''',
(query_norm, query_type.value)
).fetchall()
for row in fts_rows:
if row['bm25_score'] is not None and abs(row['bm25_score']) >= 0.7:
return EntityResolutionResult(
entity_id=row['entity_id'],
resolution_stage=ResolutionStage.CANONICAL_NAME,
confidence_score=min(0.89, 0.60 + abs(row['bm25_score']) * 0.4),
supporting_identifiers=[],
name_similarity_score=abs(row['bm25_score']),
entity_type_match=True,
active_status='ACTIVE' if row['last_seen_active'] else 'UNKNOWN',
)
# Stage 3: Jaro-Winkler fuzzy with per-list threshold
candidates = db.execute(
'''SELECT em.entity_id, ea.alias_norm, em.entity_type, em.last_seen_active,
ea.source_list
FROM entity_aliases ea
JOIN entity_master em USING (entity_id)
WHERE em.entity_type = ?
LIMIT 5000''',
(query_type.value,)
).fetchall()
list_thresholds = {
'cfpb_complaints': 0.92,
'ofac_sdn': 0.82,
}
default_threshold = 0.88
best_score = 0.0
best_row = None
for row in candidates:
score = jellyfish.jaro_winkler_similarity(query_norm, row['alias_norm'])
if score > best_score:
best_score = score
best_row = row
if best_row:
threshold = list_thresholds.get(best_row['source_list'], default_threshold)
if best_score >= threshold:
confidence = 0.60 + (best_score - threshold) / (1.0 - threshold) * 0.27
return EntityResolutionResult(
entity_id=best_row['entity_id'],
resolution_stage=ResolutionStage.FUZZY,
confidence_score=min(0.87, confidence),
supporting_identifiers=[],
name_similarity_score=best_score,
entity_type_match=True,
active_status='ACTIVE' if best_row['last_seen_active'] else 'UNKNOWN',
)
return EntityResolutionResult(
entity_id=None,
resolution_stage=ResolutionStage.UNRESOLVED,
confidence_score=0.0,
supporting_identifiers=[],
name_similarity_score=best_score,
entity_type_match=False,
active_status='UNKNOWN',
)Continuous calibration
Analyst feedback is the primary signal for threshold recalibration. The feedback endpoint accepts a verdict and optional notes:
POST /entity/{entity_id}/match-feedback
Content-Type: application/json
{
"verdict": false,
"notes": "Different entity — query is Samsung Electronics Co KR, matched Samsung Koryo Group KP"
}Feedback is stored in the resolution_feedback table with the full resolution context: query name, matched entity ID, resolution stage, confidence score, and name similarity score at the time of the match. A weekly calibration job re-tunes per-list Jaro-Winkler thresholds by maximizing F1 on the accumulated feedback set, subject to a minimum recall constraint of 98% (to prevent the optimizer from raising thresholds to the point of missing sanctioned entities).
Population Stability Index (PSI) monitoring runs on the name similarity score distribution for each source list pair. When the PSI between the current week's distribution and the calibration baseline exceeds 0.03 — indicating that the distribution of incoming name similarity scores has shifted, typically because a source list changed its naming conventions — the drift is logged and a re-calibration is triggered outside the weekly schedule. The 0.03 PSI threshold corresponds to a moderate distributional shift: enough to materially affect threshold performance, but not so sensitive that minor random variation triggers unnecessary re-calibration.
Per-list threshold values, the last-evaluated F1 score, and the calibration date are stored in the match_config table — the same table described in the entity matching article for the broader cross-agency name matching pipeline. The compliance entity resolution layer shares the calibration infrastructure; per-list thresholds set by the entity resolution calibration job are visible to the name matching pipeline and vice versa.
Compliance screening across 30+ federal enforcement lists: how the risk score works →
Entity ID normalization in the Federal Regulatory Data Hub: resolving CIK, UEI, LEI, DUNS, and NPI →
OFAC SDN integration: conditional GET, entity normalization, and sub-second screening →