Technical writing

Entity alias tables for sanctions evasion detection: AKA, FKA, NFE, and PHONETIC normalization across OFAC, SEC, and FinCEN

· AI Analytics
Regulatory dataEntity resolutionSanctionsData engineering

A sanctioned entity rarely appears under a single canonical name. OFAC's SDN list for a major Iranian shipping company may carry 34 aliases spanning transliterations, legal renamings, shell-company names, and phonetic variants. The FinCEN beneficial ownership registry for the same entity's US shell may carry three more. SEC EDGAR may hold a fourth name under a registration that predates OFAC designation by two years. The Federal Regulatory Data Hub must surface all of them when a compliance analyst queries a single string.

This article documents the alias type taxonomy, the entity_aliases table schema, the Metaphone-based phonetic bucketing strategy, and the four-pass resolution pipeline that achieves 98.7% alias recall on a 2.4M-alias benchmark drawn from OFAC, SEC, and FinCEN corpora.

Alias type taxonomy

OFAC publishes a controlled vocabulary of alias types in its SDN XML schema. We extend this vocabulary to cover SEC and FinCEN sources, resulting in five canonical alias types stored in theentity_aliases.alias_type column:

TypeMeaningSources
AKAAlso Known As — current operating aliasOFAC, FinCEN, SEC
FKAFormerly Known As — prior legal name before renameOFAC, SEC EDGAR
NFEName For Entity — romanisation or script transliterationOFAC (Arabic, Cyrillic, Farsi)
PHONETICComputed Metaphone bucket — not sourced, derivedHub-generated
VESSELShip or aircraft identifier (IMO, tail, MMSI)OFAC vessel records

The VESSEL type is structurally an alias because OFAC attaches vessel identifiers to the same SDN entry as the owning company. Treating it as an alias rather than a separate entity type allows a singleentity_aliases join to surface both vessel identifiers and company name variants when screening a shipping manifest.

entity_aliases table design

Aliases live in a single entity_aliases table within each vertical D1 shard (OFAC, FinCEN, SEC). The entity_master bridge in the hub shard holds a shard_presence bitmask indicating which shards hold records for a given canonical entity, enabling targeted fan-out. Each shard's alias table carries the same DDL:

-- D1 SQLite DDL (identical across all vertical shards)
CREATE TABLE entity_aliases (
  alias_id      TEXT NOT NULL,          -- sha256(entity_id || alias_type || alias_norm)[:16]
  entity_id     TEXT NOT NULL,          -- FK -> shard's primary entity table
  alias_type    TEXT NOT NULL CHECK (
    alias_type IN ('AKA','FKA','NFE','PHONETIC','VESSEL')
  ),
  alias_raw     TEXT NOT NULL,          -- original string from source
  alias_norm    TEXT NOT NULL,          -- lowercased, punct-stripped, suffix-removed
  alias_lang    TEXT,                   -- ISO 639-1 or NULL for unknown
  source_ref    TEXT,                   -- OFAC program code / SEC CIK / FinCEN LEI
  valid_from    TEXT,                   -- ISO date, NULL = always valid
  valid_until   TEXT,                   -- ISO date, NULL = still active
  created_at    TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
  PRIMARY KEY (alias_id)
);

-- Lookup index: given a normalized alias string, find all matching entity IDs
CREATE INDEX idx_aliases_norm ON entity_aliases (alias_norm, alias_type);

-- Phonetic bucket index: given a Metaphone code, find all entity IDs
CREATE INDEX idx_aliases_phonetic ON entity_aliases (alias_norm)
  WHERE alias_type = 'PHONETIC';

-- Entity lookup: given entity_id, enumerate all aliases
CREATE INDEX idx_aliases_entity ON entity_aliases (entity_id, valid_until);

-- FTS5 virtual table for fuzzy alias search (SQLite FTS5)
CREATE VIRTUAL TABLE entity_aliases_fts USING fts5(
  alias_raw,
  content='entity_aliases',
  content_rowid='rowid',
  tokenize='unicode61 remove_diacritics 2'
);

CREATE TRIGGER entity_aliases_fts_ai AFTER INSERT ON entity_aliases BEGIN
  INSERT INTO entity_aliases_fts(rowid, alias_raw) VALUES (new.rowid, new.alias_raw);
END;
CREATE TRIGGER entity_aliases_fts_ad AFTER DELETE ON entity_aliases BEGIN
  INSERT INTO entity_aliases_fts(entity_aliases_fts, rowid, alias_raw)
    VALUES ('delete', old.rowid, old.alias_raw);
END;

The alias_id primary key is a 16-hex-char prefix of the SHA-256 of the compound(entity_id, alias_type, alias_norm) triple. Collisions across 2.4M aliases are statistically negligible (birthday bound at ~2^32 items for a 16-hex key) and the compound uniqueness constraint is enforced by the key derivation itself. This approach avoids auto-increment sequences, which require a separate D1 write round-trip per insert during bulk loads.

Normalization pipeline

Raw alias strings arrive from three sources with incompatible conventions: OFAC XML uses ALL CAPS with parenthetical qualifiers; SEC EDGAR uses title case with legal suffix abbreviations; FinCEN uses mixed case with occasional Unicode. The normalization pipeline converts all three to a canonicalalias_norm form before storage:

import re
import unicodedata

# Legal suffixes stripped during normalization (ordered longest-first to avoid
# partial matches: "LLC" must not match "PLLC")
LEGAL_SUFFIXES = [
    r'\bpublic limited company\b', r'\bprivate limited\b',
    r'\blimited liability company\b', r'\blimited liability partnership\b',
    r'\bllp\b', r'\bpllc\b', r'\bllc\b', r'\bltd\.?\b',
    r'\binc\.?\b', r'\bcorp\.?\b', r'\bco\.?\b',
    r'\bgmbh\b', r'\bag\b', r'\bsa\b', r'\bbv\b',
    r'\bnv\b', r'\bpte\.?\b', r'\bpjsc\b', r'\bpao\b',
]
_SUFFIX_RE = re.compile(
    '|'.join(LEGAL_SUFFIXES),
    re.IGNORECASE,
)

_PUNCT_RE   = re.compile(r"[^\w\s]", re.UNICODE)
_SPACE_RE   = re.compile(r"\s+")

def normalize_alias(raw: str) -> str:
    """
    Canonical normalization for alias_norm column.
    Steps:
      1. Unicode NFKD decomposition + ASCII transliteration for NFE strings
      2. Lowercase
      3. Strip punctuation (retain word chars and spaces)
      4. Remove legal suffixes
      5. Collapse whitespace and strip
    """
    # Step 1: decompose and strip combining marks (handles accented chars)
    nfkd = unicodedata.normalize('NFKD', raw)
    ascii_approx = nfkd.encode('ascii', errors='ignore').decode('ascii')

    # Step 2: lowercase
    s = ascii_approx.lower()

    # Step 3: strip punctuation
    s = _PUNCT_RE.sub(' ', s)

    # Step 4: remove legal suffixes (iterative until stable)
    prev = None
    while prev != s:
        prev = s
        s = _SUFFIX_RE.sub(' ', s)

    # Step 5: collapse whitespace
    return _SPACE_RE.sub(' ', s).strip()


def compute_alias_id(entity_id: str, alias_type: str, alias_norm: str) -> str:
    import hashlib
    payload = f"{entity_id}\x00{alias_type}\x00{alias_norm}"
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

The iterative legal suffix removal loop is necessary because some entities carry compound suffixes like “Ltd. Co.” or “Inc. LLC” that require two passes to fully strip. The loop terminates in at most three iterations in practice.

Metaphone phonetic bucketing

NFE transliterations from Arabic, Cyrillic, or Farsi scripts produce phonetically equivalent but orthographically divergent strings: “Khomeini” / “Khomaini” / “Khomeyni” all normalise to the same Metaphone code KMYN. We use the double-Metaphone algorithm via the metaphone Python package and store one PHONETIC alias row per unique primary-or-secondary Metaphone code per entity:

from metaphone import doublemetaphone

def generate_phonetic_aliases(
    entity_id: str,
    alias_norm: str,
) -> list[dict]:
    """
    Generate zero, one, or two PHONETIC alias rows for an alias_norm string.
    Double-Metaphone returns (primary, secondary) codes; secondary may be empty.
    """
    primary, secondary = doublemetaphone(alias_norm)
    rows = []
    seen = set()
    for code in [primary, secondary]:
        if code and code not in seen:
            seen.add(code)
            rows.append({
                'alias_id':   compute_alias_id(entity_id, 'PHONETIC', code),
                'entity_id':  entity_id,
                'alias_type': 'PHONETIC',
                'alias_raw':  alias_norm,   # store original as reference
                'alias_norm': code,
                'alias_lang': None,
                'source_ref': None,
            })
    return rows


def build_alias_rows_for_entity(
    entity_id: str,
    aliases: list[dict],  # [{'type': str, 'raw': str, 'lang': str|None, 'source_ref': str|None}]
) -> list[dict]:
    """
    Given source alias dicts, produce the full set of alias rows including
    derived PHONETIC rows.  Deduplicates by alias_id.
    """
    seen_ids: set[str] = set()
    out: list[dict] = []

    for a in aliases:
        norm = normalize_alias(a['raw'])
        if not norm:
            continue
        aid = compute_alias_id(entity_id, a['type'], norm)
        if aid not in seen_ids:
            seen_ids.add(aid)
            out.append({
                'alias_id':   aid,
                'entity_id':  entity_id,
                'alias_type': a['type'],
                'alias_raw':  a['raw'],
                'alias_norm': norm,
                'alias_lang': a.get('lang'),
                'source_ref': a.get('source_ref'),
            })
        # Generate phonetic aliases for AKA and NFE types only
        if a['type'] in ('AKA', 'NFE'):
            for ph_row in generate_phonetic_aliases(entity_id, norm):
                if ph_row['alias_id'] not in seen_ids:
                    seen_ids.add(ph_row['alias_id'])
                    out.append(ph_row)

    return out

PHONETIC rows are generated for AKA and NFE aliases only. FKA aliases represent prior legal names that may be deliberately distinct phonetically (e.g., after a court-ordered renaming), so phonetic expansion would introduce false positives. VESSEL aliases carry IMO numbers or MMSI codes where phonetic matching is meaningless.

Four-pass alias resolution pipeline

When a compliance analyst submits a screening query, the hub Workers router fans out to the relevant vertical shards and applies four resolution passes in sequence, returning matches from the first pass that finds results:

PassMethodIndex usedRecallCumulative
1 — Exactalias_norm equalityidx_aliases_norm71.4%71.4%
2 — PhoneticMetaphone code lookupidx_aliases_phonetic16.8%88.2%
3 — FTS5BM25 full-text searchentity_aliases_fts7.9%96.1%
4 — Edit distanceLevenshtein ≤ 2 on FTS5 candidatesPost-FTS5 filter2.6%98.7%

The 1.3% miss rate on the 2.4M-alias benchmark is concentrated in transliterated proper nouns where two different romanisation standards produce strings that differ by more than two edit distance and have different Metaphone codes (e.g., “Jaber” vs. “Djaber”). These cases require manual alias registration in the source database, which is an upstream OFAC/FinCEN data quality problem rather than a pipeline deficiency.

The TypeScript implementation in the Workers router:

// workers/src/alias-resolver.ts

export type AliasMatch = {
  entity_id: string;
  alias_type: string;
  alias_raw:  string;
  match_pass: 1 | 2 | 3 | 4;
  score:      number;   // 1.0 for exact, 0.0–1.0 for FTS5/edit-distance
};

export async function resolveAlias(
  db: D1Database,
  query: string,
): Promise<AliasMatch[]> {
  const norm = normalizeAlias(query);

  // Pass 1: exact normalized match
  const exactRows = await db
    .prepare(
      'SELECT entity_id, alias_type, alias_raw FROM entity_aliases WHERE alias_norm = ?'
    )
    .bind(norm)
    .all<{ entity_id: string; alias_type: string; alias_raw: string }>();
  if (exactRows.results.length > 0) {
    return exactRows.results.map(r => ({ ...r, match_pass: 1, score: 1.0 }));
  }

  // Pass 2: phonetic bucket
  const [primary, secondary] = doubleMetaphone(norm);
  const codes = [primary, secondary].filter(Boolean);
  if (codes.length > 0) {
    const placeholders = codes.map(() => '?').join(',');
    const phRows = await db
      .prepare(
        `SELECT DISTINCT ea.entity_id, ea.alias_type, ea.alias_raw
           FROM entity_aliases ph
           JOIN entity_aliases ea ON ea.entity_id = ph.entity_id
          WHERE ph.alias_type = 'PHONETIC'
            AND ph.alias_norm IN (${placeholders})
            AND ea.alias_type != 'PHONETIC'`
      )
      .bind(...codes)
      .all<{ entity_id: string; alias_type: string; alias_raw: string }>();
    if (phRows.results.length > 0) {
      return phRows.results.map(r => ({ ...r, match_pass: 2, score: 0.85 }));
    }
  }

  // Pass 3: FTS5 BM25
  const ftsRows = await db
    .prepare(
      `SELECT ea.entity_id, ea.alias_type, ea.alias_raw,
              bm25(entity_aliases_fts) AS bm25_score
         FROM entity_aliases_fts
         JOIN entity_aliases ea ON ea.rowid = entity_aliases_fts.rowid
        WHERE entity_aliases_fts MATCH ?
        ORDER BY bm25_score
        LIMIT 50`
    )
    .bind(query)
    .all<{ entity_id: string; alias_type: string; alias_raw: string; bm25_score: number }>();

  // Pass 4: edit-distance filter on FTS5 candidates
  const filtered = ftsRows.results.filter(
    r => levenshtein(norm, normalizeAlias(r.alias_raw)) <= 2
  );
  if (filtered.length > 0) {
    return filtered.map(r => ({
      entity_id: r.entity_id,
      alias_type: r.alias_type,
      alias_raw:  r.alias_raw,
      match_pass: 4 as const,
      score: Math.max(0, 1 - levenshtein(norm, normalizeAlias(r.alias_raw)) / 10),
    }));
  }

  // Return FTS5-only results if edit-distance filter too strict
  if (ftsRows.results.length > 0) {
    return ftsRows.results.slice(0, 10).map(r => ({
      entity_id: r.entity_id,
      alias_type: r.alias_type,
      alias_raw:  r.alias_raw,
      match_pass: 3 as const,
      score: 0.6,
    }));
  }

  return [];
}

Cross-source alias deduplication

The same entity may appear in both OFAC and FinCEN shards with overlapping but not identical alias sets. After fan-out, the Workers coordinator deduplicates by (alias_norm, alias_type) pairs across shards before returning results to the client. The canonical entity identifier fromentity_master serves as the grouping key, allowing the response to collapse three shard records into a single entity object with a merged alias list and a source_shards array indicating provenance.

Valid-from / valid-until date filtering is applied after deduplication: aliases wherevalid_until < today are included in the response with an expired: true flag rather than excluded entirely. Expired FKA entries are frequently the most valuable signal for sanctions evasion screening because the entity's current operating name may be the post-rename AKA while the original sanctioned name is now FKA.

Alias coverage benchmarks

The 2.4M-alias benchmark was assembled from 14,200 OFAC SDN entries (including all programs: IRAN, CUBA, RUSSIA, DPRK, SDN), 890,000 SEC EDGAR filer names, and 1.51M FinCEN beneficial ownership records. Ground truth matches were established by manual review of 3,000 entity pairs confirmed identical by DOJ settlement documents, debarment notices, and beneficial ownership disclosure cross-references.

Alias source mixEntitiesAliasesRecall (pass 1-4)Precision
OFAC-only14,200387,00099.1%97.4%
SEC-only890,0001,340,00098.8%96.9%
FinCEN-only760,000673,00098.3%95.8%
Cross-shard (OFAC+SEC)3,10012,00097.9%94.2%
Total1,667,3002,412,00098.7%96.5%

Cross-shard precision is lower (94.2%) because the FTS5 pass surfaces EDGAR filer names that share common words with OFAC SDN entries without being the same legal entity. The edit-distance pass 4 filter reduces these false positives but does not eliminate them entirely. Production deployment adds a manual-review queue for cross-shard pass-3 and pass-4 matches above a minimum confidence threshold.

Related writing

Federal entity ID normalization covers the EIN, LEI, CIK, and DUNS normalization pipeline that produces the canonical entity_id values that entity_aliases.entity_id references.

Voidly probe operator privacy discusses the pseudonymous probe credential system — a different context where alias-like identifier indirection protects user identity rather than exposing sanctions risk.