Technical writing
Entity alias tables for sanctions evasion detection: AKA, FKA, NFE, and PHONETIC normalization across OFAC, SEC, and FinCEN
A sanctioned entity rarely appears under a single canonical name. OFAC's SDN list for a major Iranian shipping company may carry 34 aliases spanning transliterations, legal renamings, shell-company names, and phonetic variants. The FinCEN beneficial ownership registry for the same entity's US shell may carry three more. SEC EDGAR may hold a fourth name under a registration that predates OFAC designation by two years. The Federal Regulatory Data Hub must surface all of them when a compliance analyst queries a single string.
This article documents the alias type taxonomy, the entity_aliases table schema, the Metaphone-based phonetic bucketing strategy, and the four-pass resolution pipeline that achieves 98.7% alias recall on a 2.4M-alias benchmark drawn from OFAC, SEC, and FinCEN corpora.
Alias type taxonomy
OFAC publishes a controlled vocabulary of alias types in its SDN XML schema. We extend this vocabulary to cover SEC and FinCEN sources, resulting in five canonical alias types stored in theentity_aliases.alias_type column:
| Type | Meaning | Sources |
|---|---|---|
| AKA | Also Known As — current operating alias | OFAC, FinCEN, SEC |
| FKA | Formerly Known As — prior legal name before rename | OFAC, SEC EDGAR |
| NFE | Name For Entity — romanisation or script transliteration | OFAC (Arabic, Cyrillic, Farsi) |
| PHONETIC | Computed Metaphone bucket — not sourced, derived | Hub-generated |
| VESSEL | Ship or aircraft identifier (IMO, tail, MMSI) | OFAC vessel records |
The VESSEL type is structurally an alias because OFAC attaches vessel identifiers to the same SDN entry as the owning company. Treating it as an alias rather than a separate entity type allows a singleentity_aliases join to surface both vessel identifiers and company name variants when screening a shipping manifest.
entity_aliases table design
Aliases live in a single entity_aliases table within each vertical D1 shard (OFAC, FinCEN, SEC). The entity_master bridge in the hub shard holds a shard_presence bitmask indicating which shards hold records for a given canonical entity, enabling targeted fan-out. Each shard's alias table carries the same DDL:
-- D1 SQLite DDL (identical across all vertical shards)
CREATE TABLE entity_aliases (
alias_id TEXT NOT NULL, -- sha256(entity_id || alias_type || alias_norm)[:16]
entity_id TEXT NOT NULL, -- FK -> shard's primary entity table
alias_type TEXT NOT NULL CHECK (
alias_type IN ('AKA','FKA','NFE','PHONETIC','VESSEL')
),
alias_raw TEXT NOT NULL, -- original string from source
alias_norm TEXT NOT NULL, -- lowercased, punct-stripped, suffix-removed
alias_lang TEXT, -- ISO 639-1 or NULL for unknown
source_ref TEXT, -- OFAC program code / SEC CIK / FinCEN LEI
valid_from TEXT, -- ISO date, NULL = always valid
valid_until TEXT, -- ISO date, NULL = still active
created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
PRIMARY KEY (alias_id)
);
-- Lookup index: given a normalized alias string, find all matching entity IDs
CREATE INDEX idx_aliases_norm ON entity_aliases (alias_norm, alias_type);
-- Phonetic bucket index: given a Metaphone code, find all entity IDs
CREATE INDEX idx_aliases_phonetic ON entity_aliases (alias_norm)
WHERE alias_type = 'PHONETIC';
-- Entity lookup: given entity_id, enumerate all aliases
CREATE INDEX idx_aliases_entity ON entity_aliases (entity_id, valid_until);
-- FTS5 virtual table for fuzzy alias search (SQLite FTS5)
CREATE VIRTUAL TABLE entity_aliases_fts USING fts5(
alias_raw,
content='entity_aliases',
content_rowid='rowid',
tokenize='unicode61 remove_diacritics 2'
);
CREATE TRIGGER entity_aliases_fts_ai AFTER INSERT ON entity_aliases BEGIN
INSERT INTO entity_aliases_fts(rowid, alias_raw) VALUES (new.rowid, new.alias_raw);
END;
CREATE TRIGGER entity_aliases_fts_ad AFTER DELETE ON entity_aliases BEGIN
INSERT INTO entity_aliases_fts(entity_aliases_fts, rowid, alias_raw)
VALUES ('delete', old.rowid, old.alias_raw);
END;The alias_id primary key is a 16-hex-char prefix of the SHA-256 of the compound(entity_id, alias_type, alias_norm) triple. Collisions across 2.4M aliases are statistically negligible (birthday bound at ~2^32 items for a 16-hex key) and the compound uniqueness constraint is enforced by the key derivation itself. This approach avoids auto-increment sequences, which require a separate D1 write round-trip per insert during bulk loads.
Normalization pipeline
Raw alias strings arrive from three sources with incompatible conventions: OFAC XML uses ALL CAPS with parenthetical qualifiers; SEC EDGAR uses title case with legal suffix abbreviations; FinCEN uses mixed case with occasional Unicode. The normalization pipeline converts all three to a canonicalalias_norm form before storage:
import re
import unicodedata
# Legal suffixes stripped during normalization (ordered longest-first to avoid
# partial matches: "LLC" must not match "PLLC")
LEGAL_SUFFIXES = [
r'\bpublic limited company\b', r'\bprivate limited\b',
r'\blimited liability company\b', r'\blimited liability partnership\b',
r'\bllp\b', r'\bpllc\b', r'\bllc\b', r'\bltd\.?\b',
r'\binc\.?\b', r'\bcorp\.?\b', r'\bco\.?\b',
r'\bgmbh\b', r'\bag\b', r'\bsa\b', r'\bbv\b',
r'\bnv\b', r'\bpte\.?\b', r'\bpjsc\b', r'\bpao\b',
]
_SUFFIX_RE = re.compile(
'|'.join(LEGAL_SUFFIXES),
re.IGNORECASE,
)
_PUNCT_RE = re.compile(r"[^\w\s]", re.UNICODE)
_SPACE_RE = re.compile(r"\s+")
def normalize_alias(raw: str) -> str:
"""
Canonical normalization for alias_norm column.
Steps:
1. Unicode NFKD decomposition + ASCII transliteration for NFE strings
2. Lowercase
3. Strip punctuation (retain word chars and spaces)
4. Remove legal suffixes
5. Collapse whitespace and strip
"""
# Step 1: decompose and strip combining marks (handles accented chars)
nfkd = unicodedata.normalize('NFKD', raw)
ascii_approx = nfkd.encode('ascii', errors='ignore').decode('ascii')
# Step 2: lowercase
s = ascii_approx.lower()
# Step 3: strip punctuation
s = _PUNCT_RE.sub(' ', s)
# Step 4: remove legal suffixes (iterative until stable)
prev = None
while prev != s:
prev = s
s = _SUFFIX_RE.sub(' ', s)
# Step 5: collapse whitespace
return _SPACE_RE.sub(' ', s).strip()
def compute_alias_id(entity_id: str, alias_type: str, alias_norm: str) -> str:
import hashlib
payload = f"{entity_id}\x00{alias_type}\x00{alias_norm}"
return hashlib.sha256(payload.encode()).hexdigest()[:16]The iterative legal suffix removal loop is necessary because some entities carry compound suffixes like “Ltd. Co.” or “Inc. LLC” that require two passes to fully strip. The loop terminates in at most three iterations in practice.
Metaphone phonetic bucketing
NFE transliterations from Arabic, Cyrillic, or Farsi scripts produce phonetically equivalent but orthographically divergent strings: “Khomeini” / “Khomaini” / “Khomeyni” all normalise to the same Metaphone code KMYN. We use the double-Metaphone algorithm via the metaphone Python package and store one PHONETIC alias row per unique primary-or-secondary Metaphone code per entity:
from metaphone import doublemetaphone
def generate_phonetic_aliases(
entity_id: str,
alias_norm: str,
) -> list[dict]:
"""
Generate zero, one, or two PHONETIC alias rows for an alias_norm string.
Double-Metaphone returns (primary, secondary) codes; secondary may be empty.
"""
primary, secondary = doublemetaphone(alias_norm)
rows = []
seen = set()
for code in [primary, secondary]:
if code and code not in seen:
seen.add(code)
rows.append({
'alias_id': compute_alias_id(entity_id, 'PHONETIC', code),
'entity_id': entity_id,
'alias_type': 'PHONETIC',
'alias_raw': alias_norm, # store original as reference
'alias_norm': code,
'alias_lang': None,
'source_ref': None,
})
return rows
def build_alias_rows_for_entity(
entity_id: str,
aliases: list[dict], # [{'type': str, 'raw': str, 'lang': str|None, 'source_ref': str|None}]
) -> list[dict]:
"""
Given source alias dicts, produce the full set of alias rows including
derived PHONETIC rows. Deduplicates by alias_id.
"""
seen_ids: set[str] = set()
out: list[dict] = []
for a in aliases:
norm = normalize_alias(a['raw'])
if not norm:
continue
aid = compute_alias_id(entity_id, a['type'], norm)
if aid not in seen_ids:
seen_ids.add(aid)
out.append({
'alias_id': aid,
'entity_id': entity_id,
'alias_type': a['type'],
'alias_raw': a['raw'],
'alias_norm': norm,
'alias_lang': a.get('lang'),
'source_ref': a.get('source_ref'),
})
# Generate phonetic aliases for AKA and NFE types only
if a['type'] in ('AKA', 'NFE'):
for ph_row in generate_phonetic_aliases(entity_id, norm):
if ph_row['alias_id'] not in seen_ids:
seen_ids.add(ph_row['alias_id'])
out.append(ph_row)
return outPHONETIC rows are generated for AKA and NFE aliases only. FKA aliases represent prior legal names that may be deliberately distinct phonetically (e.g., after a court-ordered renaming), so phonetic expansion would introduce false positives. VESSEL aliases carry IMO numbers or MMSI codes where phonetic matching is meaningless.
Four-pass alias resolution pipeline
When a compliance analyst submits a screening query, the hub Workers router fans out to the relevant vertical shards and applies four resolution passes in sequence, returning matches from the first pass that finds results:
| Pass | Method | Index used | Recall | Cumulative |
|---|---|---|---|---|
| 1 — Exact | alias_norm equality | idx_aliases_norm | 71.4% | 71.4% |
| 2 — Phonetic | Metaphone code lookup | idx_aliases_phonetic | 16.8% | 88.2% |
| 3 — FTS5 | BM25 full-text search | entity_aliases_fts | 7.9% | 96.1% |
| 4 — Edit distance | Levenshtein ≤ 2 on FTS5 candidates | Post-FTS5 filter | 2.6% | 98.7% |
The 1.3% miss rate on the 2.4M-alias benchmark is concentrated in transliterated proper nouns where two different romanisation standards produce strings that differ by more than two edit distance and have different Metaphone codes (e.g., “Jaber” vs. “Djaber”). These cases require manual alias registration in the source database, which is an upstream OFAC/FinCEN data quality problem rather than a pipeline deficiency.
The TypeScript implementation in the Workers router:
// workers/src/alias-resolver.ts
export type AliasMatch = {
entity_id: string;
alias_type: string;
alias_raw: string;
match_pass: 1 | 2 | 3 | 4;
score: number; // 1.0 for exact, 0.0–1.0 for FTS5/edit-distance
};
export async function resolveAlias(
db: D1Database,
query: string,
): Promise<AliasMatch[]> {
const norm = normalizeAlias(query);
// Pass 1: exact normalized match
const exactRows = await db
.prepare(
'SELECT entity_id, alias_type, alias_raw FROM entity_aliases WHERE alias_norm = ?'
)
.bind(norm)
.all<{ entity_id: string; alias_type: string; alias_raw: string }>();
if (exactRows.results.length > 0) {
return exactRows.results.map(r => ({ ...r, match_pass: 1, score: 1.0 }));
}
// Pass 2: phonetic bucket
const [primary, secondary] = doubleMetaphone(norm);
const codes = [primary, secondary].filter(Boolean);
if (codes.length > 0) {
const placeholders = codes.map(() => '?').join(',');
const phRows = await db
.prepare(
`SELECT DISTINCT ea.entity_id, ea.alias_type, ea.alias_raw
FROM entity_aliases ph
JOIN entity_aliases ea ON ea.entity_id = ph.entity_id
WHERE ph.alias_type = 'PHONETIC'
AND ph.alias_norm IN (${placeholders})
AND ea.alias_type != 'PHONETIC'`
)
.bind(...codes)
.all<{ entity_id: string; alias_type: string; alias_raw: string }>();
if (phRows.results.length > 0) {
return phRows.results.map(r => ({ ...r, match_pass: 2, score: 0.85 }));
}
}
// Pass 3: FTS5 BM25
const ftsRows = await db
.prepare(
`SELECT ea.entity_id, ea.alias_type, ea.alias_raw,
bm25(entity_aliases_fts) AS bm25_score
FROM entity_aliases_fts
JOIN entity_aliases ea ON ea.rowid = entity_aliases_fts.rowid
WHERE entity_aliases_fts MATCH ?
ORDER BY bm25_score
LIMIT 50`
)
.bind(query)
.all<{ entity_id: string; alias_type: string; alias_raw: string; bm25_score: number }>();
// Pass 4: edit-distance filter on FTS5 candidates
const filtered = ftsRows.results.filter(
r => levenshtein(norm, normalizeAlias(r.alias_raw)) <= 2
);
if (filtered.length > 0) {
return filtered.map(r => ({
entity_id: r.entity_id,
alias_type: r.alias_type,
alias_raw: r.alias_raw,
match_pass: 4 as const,
score: Math.max(0, 1 - levenshtein(norm, normalizeAlias(r.alias_raw)) / 10),
}));
}
// Return FTS5-only results if edit-distance filter too strict
if (ftsRows.results.length > 0) {
return ftsRows.results.slice(0, 10).map(r => ({
entity_id: r.entity_id,
alias_type: r.alias_type,
alias_raw: r.alias_raw,
match_pass: 3 as const,
score: 0.6,
}));
}
return [];
}Cross-source alias deduplication
The same entity may appear in both OFAC and FinCEN shards with overlapping but not identical alias sets. After fan-out, the Workers coordinator deduplicates by (alias_norm, alias_type) pairs across shards before returning results to the client. The canonical entity identifier fromentity_master serves as the grouping key, allowing the response to collapse three shard records into a single entity object with a merged alias list and a source_shards array indicating provenance.
Valid-from / valid-until date filtering is applied after deduplication: aliases wherevalid_until < today are included in the response with an expired: true flag rather than excluded entirely. Expired FKA entries are frequently the most valuable signal for sanctions evasion screening because the entity's current operating name may be the post-rename AKA while the original sanctioned name is now FKA.
Alias coverage benchmarks
The 2.4M-alias benchmark was assembled from 14,200 OFAC SDN entries (including all programs: IRAN, CUBA, RUSSIA, DPRK, SDN), 890,000 SEC EDGAR filer names, and 1.51M FinCEN beneficial ownership records. Ground truth matches were established by manual review of 3,000 entity pairs confirmed identical by DOJ settlement documents, debarment notices, and beneficial ownership disclosure cross-references.
| Alias source mix | Entities | Aliases | Recall (pass 1-4) | Precision |
|---|---|---|---|---|
| OFAC-only | 14,200 | 387,000 | 99.1% | 97.4% |
| SEC-only | 890,000 | 1,340,000 | 98.8% | 96.9% |
| FinCEN-only | 760,000 | 673,000 | 98.3% | 95.8% |
| Cross-shard (OFAC+SEC) | 3,100 | 12,000 | 97.9% | 94.2% |
| Total | 1,667,300 | 2,412,000 | 98.7% | 96.5% |
Cross-shard precision is lower (94.2%) because the FTS5 pass surfaces EDGAR filer names that share common words with OFAC SDN entries without being the same legal entity. The edit-distance pass 4 filter reduces these false positives but does not eliminate them entirely. Production deployment adds a manual-review queue for cross-shard pass-3 and pass-4 matches above a minimum confidence threshold.
Related writing
Federal entity ID normalization covers the EIN, LEI, CIK, and DUNS normalization pipeline that produces the canonical entity_id values that entity_aliases.entity_id references.
Voidly probe operator privacy discusses the pseudonymous probe credential system — a different context where alias-like identifier indirection protects user identity rather than exposing sanctions risk.