
Named entity extraction and disambiguation in the OSINT pipeline: 58M posts per day, 15,000 entity mentions per hour

12 min read · AI Analytics

OSINT, ML, NLP

Keyword matching is the naïve approach to tracking entities across a large social media corpus. Search for “Putin” and you miss “Путин”, “普京”, “Putine”, “Btin”, and 336 other aliases that Wikidata records for entity Q5663788. You also get false positives from partial matches, homographs, and names that appear in unrelated contexts. For an OSINT pipeline that needs to answer “which posts in the last 24 hours mention Company X or Person Y?” across 58 million posts per day and dozens of languages, keyword matching is not a viable architecture.

Named entity recognition (NER) followed by entity linking to stable Wikidata QIDs is. This post describes how the pipeline extracts, disambiguates, and stores named entity mentions — from raw noisy social media text through to a TimescaleDB hypertable that downstream systems query for campaign detection, election anomaly detection, and censorship correlation.

Why entity extraction rather than keyword matching

The fundamental problem with keyword matching for entity tracking is surface-form explosion. A single real-world entity has dozens to hundreds of surface forms across languages, scripts, and informal registers. The Wikidata entity for Vladimir Putin (Q5663788) has approximately 340 known aliases, spanning Cyrillic, Arabic, Chinese, Japanese, Korean, Latin-script transliterations in Romanian, Turkish, Finnish, and others, plus informal variants, abbreviations, and common misspellings. Maintaining a keyword list of that size for each of thousands of monitored entities is not tractable.

Entity linking to Wikidata QIDs solves this by normalizing all surface forms to a single stable identifier. Once “Путин” (Russian Cyrillic), “普京” (Chinese), and “Putin” (English) are all stored with canonical_qid = 'Q5663788', querying for all posts mentioning that entity requires a single indexed lookup on canonical_qid regardless of the language the post was written in. Cross-language, cross-platform entity tracking becomes a database query rather than a multi-regex search.
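A minimal sketch of what that normalization buys, using an in-memory dictionary as a stand-in for the indexed canonical_qid column. The post IDs and the build_qid_index helper are hypothetical, and the QIDs are simply the ones quoted in this post plus one unrelated entity for contrast:

```python
from collections import defaultdict

# Hypothetical resolved mentions: (post_id, surface form, canonical_qid).
MENTIONS = [
    ("p1", "Путин", "Q5663788"),   # QID as cited in this post
    ("p2", "普京", "Q5663788"),
    ("p3", "Putin", "Q5663788"),
    ("p4", "Berlin", "Q64"),       # unrelated entity
]


def build_qid_index(mentions):
    """Group post IDs by canonical QID, ignoring the surface form."""
    index = defaultdict(set)
    for post_id, _surface, qid in mentions:
        index[qid].add(post_id)
    return index


index = build_qid_index(MENTIONS)
# One lookup returns the Russian, Chinese, and English posts together.
posts = sorted(index["Q5663788"])
print(posts)
```

The surface form never participates in the lookup, which is why no per-language alias list is needed at query time.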

A secondary benefit is entity type classification. NER models distinguish persons, organizations, geopolitical entities, locations, legislation, and events. The pipeline uses these distinctions downstream — election anomaly detection specifically looks for co-occurrence of candidate entities and polling station entities, which requires knowing that a mention is a person versus a place.

NER model selection and GPU throughput

The pipeline uses spaCy with transformer-based models for English and language-specific models for the five most common non-English languages in the corpus (Arabic, Chinese, Russian, Spanish, French). All models run GPU-accelerated via cuDNN, loaded once at worker startup and shared across all requests to that worker:

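import spacy

# Use the GPU when one is available; must be called before spacy.load().
# (Assumption: the original worker enables the GPU in startup code not shown.)
spacy.prefer_gpu()

# en_core_web_trf: transformer-based, highest-accuracy English NER
# Loaded once at worker startup and reused for every batch
nlp = spacy.load("en_core_web_trf")
nlp.max_length = 1_000_000  # per-document character limit (spaCy's default)

# Per-language models, keyed by ISO 639-1 code
LANG_MODELS = {
    "en": nlp,
    # Arabic is not among spaCy's published pipelines; this name is
    # assumed to refer to a custom-trained model
    "ar": spacy.load("ar_core_news_md"),
    "zh": spacy.load("zh_core_web_md"),
    "ru": spacy.load("ru_core_news_md"),
    "es": spacy.load("es_core_news_md"),
    "fr": spacy.load("fr_core_news_md"),
}

# Fallback for languages without a dedicated model (multi-language)
FALLBACK_MODEL = spacy.load("xx_ent_wiki_sm")

def extract_entities_batch(posts: list[CanonicalPost]) -> list[list[RawMention]]:
    # CanonicalPost, RawMention, group_by_language and merge_by_post_order
    # are pipeline-internal and defined elsewhere
    by_lang = group_by_language(posts)
    results = {}
    for lang, lang_posts in by_lang.items():
        # Unknown languages go to the multi-language model, not the English one
        model = LANG_MODELS.get(lang, FALLBACK_MODEL)
        texts = [p.content for p in lang_posts]
        docs = list(model.pipe(texts, batch_size=32))
        results[lang] = [[RawMention(ent.text, ent.label_, p.post_id)
                          for ent in doc.ents] for doc, p in zip(docs, lang_posts)]
    return merge_by_post_order(results, posts)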

The transformer-based English model (en_core_web_trf) is the most accurate option in the spaCy ecosystem but also the most compute-intensive. On an A100 GPU at batch_size=32, it processes approximately 2,800 English posts per second. With 8 GPU workers, peak throughput is roughly 22,400 posts per second, well above the 667 posts/second average input rate, with capacity to spare even if election-period bursts reach 4× that average.
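The capacity figures can be checked with a few lines of arithmetic. This sketch uses only numbers quoted in the post and reads the 4× figure as a burst multiple applied to the average input rate, which is an interpretation rather than something the post states outright:

```python
POSTS_PER_DAY = 58_000_000
SECONDS_PER_DAY = 24 * 60 * 60
PER_WORKER_EN = 2_800    # English posts/second per A100 worker (from the text)
GPU_WORKERS = 8
BURST_MULTIPLE = 4       # assumed meaning of the 4x figure

avg_rate = POSTS_PER_DAY / SECONDS_PER_DAY    # average posts/second
peak_capacity = PER_WORKER_EN * GPU_WORKERS   # posts/second across all workers
burst_rate = BURST_MULTIPLE * avg_rate        # posts/second during a 4x burst

print(f"average input:  {avg_rate:,.0f} posts/s")
print(f"4x burst input: {burst_rate:,.0f} posts/s")
print(f"peak capacity:  {peak_capacity:,} posts/s")
```

The computed average lands a few posts per second above the rounded 667 used in the text, and even the burst rate stays far below the English-model capacity.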

The non-English models are medium-sized convolutional models rather than transformer-based. They sacrifice some accuracy for throughput: Arabic and Russian run at roughly 6,000 posts/second per worker on the same hardware. The throughput asymmetry is acceptable because English content makes up approximately 61% of the corpus by post count and the transformer model's higher accuracy matters most for English political content, where entity ambiguity is highest.

Posts in languages without a dedicated model fall back to xx_ent_wiki_sm, spaCy's multi-language model trained on Wikipedia data. Accuracy is lower, but for languages like Persian and Thai that represent small fractions of the corpus, it covers the most common high-frequency entity types (persons, countries, major organizations) adequately.

EntityType taxonomy and FAUX filtering

NER model output is mapped to a pipeline-internal entity type taxonomy before disambiguation. The taxonomy extends the standard CoNLL-2003 types with categories relevant to OSINT use cases:

from enum import Enum

class EntityType(str, Enum):
    PER = "PER"       # Person
    ORG = "ORG"       # Organization
    GPE = "GPE"       # Geopolitical entity (country, city, state)
    LOC = "LOC"       # Non-political location (mountain, river)
    LAW = "LAW"       # Named legislation, court case
    PRODUCT = "PRODUCT"  # Product or service name
    EVENT = "EVENT"   # Named event (election, summit, war)
    FAUX = "FAUX"     # False positive (too short, number-like)

The FAUX type is a pre-disambiguation filter. Entities classified as FAUX are discarded immediately without touching the Wikidata disambiguation layer, which is important for controlling disambiguation cost. The FAUX classifier applies four heuristics: mentions shorter than 2 characters, all-digit strings, strings that match a common English stopword list (the model occasionally tags “The” or “US” as entities depending on context), and single-character CJK tokens.

In practice, FAUX filtering removes approximately 18% of raw NER output before disambiguation. The false positive rate for NER on social media text is higher than on news corpora because social media content is shorter, more informal, uses more abbreviations, and contains more noise — hashtags, usernames, and emoji that the NER model sometimes tags as named entities.
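The four heuristics are simple enough to sketch directly. The stopword set below is a tiny hypothetical stand-in for the real list, and the Unicode ranges used to recognize CJK characters are an assumption about what "CJK token" covers:

```python
# Hypothetical stand-in for the pipeline's English stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "us"}


def is_cjk_char(ch: str) -> bool:
    """True for CJK ideographs, Hiragana/Katakana, and Hangul syllables."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def is_faux(text: str) -> bool:
    """Return True when a raw NER span should be discarded as FAUX."""
    s = text.strip()
    if len(s) == 1 and is_cjk_char(s):   # single-character CJK token
        return True
    if len(s) < 2:                       # too short to be a name
        return True
    if s.isdigit():                      # number-like
        return True
    if s.lower() in STOPWORDS:           # stopword tagged as an entity
        return True
    return False


for span in ["Putin", "普京", "The", "2024", "中", "7"]:
    print(span, is_faux(span))
```

The single-character CJK rule overlaps with the length rule in this sketch; it is kept separate only to mirror the four heuristics as listed.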

Cross-language transliteration normalization

Before Wikidata disambiguation, non-Latin mentions are normalized through a transliteration step. The goal is to build disambiguation cache keys that are script-independent: “Путин” and “Putin” should hit the same Redis cache entry and resolve to Q5663788 without two separate Wikidata lookups.

The pipeline uses the transliterate Python library for this step. Arabic and Cyrillic/Russian mentions are transliterated to Latin before the cache lookup. CJK scripts are handled differently — Wikidata natively indexes Chinese, Japanese, and Korean labels and aliases, so CJK mentions are passed directly to the Wikidata SPARQL search without transliteration:

Arabic: transliterate.translit(text, 'ar', reversed=True) → Latin
Russian/Cyrillic: transliterate.translit(text, 'ru', reversed=True) → Latin
CJK: skip transliteration; use language-specific Wikidata aliases directly

The Redis cache key is built from the normalized, transliterated, lowercased mention text, so “Путин” → “putin” and “Putin” → “putin” resolve to the same key, qid:putin. The transliteration step is not lossless: it can produce collisions between distinct entities in edge cases. At the level of the high-frequency entities the pipeline monitors most closely, however, collisions have not been a source of significant error in practice.
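To make the key construction concrete, here is a self-contained sketch. It replaces the transliteration library with a small, partial Cyrillic-to-Latin table (hypothetical, covering only the letters needed for the example), and cache_key is an illustrative name rather than the pipeline's function:

```python
# Hypothetical, partial mapping; a real transliterator covers the full alphabet.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "и": "i",
    "к": "k", "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u",
}


def cache_key(mention_text: str) -> str:
    """Lowercase, transliterate mapped characters, and prefix with 'qid:'."""
    lowered = mention_text.strip().lower()
    # Characters without a mapping (Latin, CJK, digits) pass through unchanged.
    latin = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in lowered)
    return f"qid:{latin}"


print(cache_key("Путин"))   # same key as the Latin spelling
print(cache_key("Putin"))
print(cache_key("普京"))     # CJK is left as-is
```

Because the mapping is many-to-one, two different Cyrillic names can in principle collapse to the same Latin key, which is the collision risk noted above.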

Wikidata QID disambiguation

Disambiguation maps a raw mention (surface text + entity type + post context) to a canonical Wikidata QID. The process has three stages: Redis cache lookup, Wikidata SPARQL search for cache misses, and context-window disambiguation when multiple candidates are returned:

import json
from dataclasses import asdict, dataclass

@dataclass
class WikidataCandidate:
    qid: str           # e.g. "Q5663788"
    label: str         # canonical English label
    description: str   # disambiguation hint ("Russian politician")
    aliases: list[str] # all known aliases across all languages
    score: float       # disambiguation confidence

# redis, wikidata_search and disambiguate_by_context are pipeline-internal
# helpers defined elsewhere
async def disambiguate_mention(mention: RawMention) -> WikidataCandidate | None:
    # mention.text is expected to be the normalized, transliterated form
    cache_key = f"qid:{mention.text.lower()}"

    # 1. Check Redis cache (1M QID LRU, TTL 24h)
    cached = await redis.get(cache_key)
    if cached:
        return WikidataCandidate(**json.loads(cached))

    # 2. Query Wikidata SPARQL endpoint (search by label/alias)
    results = await wikidata_search(mention.text, entity_type=mention.entity_type)
    if not results:
        return None

    # 3. Context-window disambiguation: pick candidate whose description
    #    best matches the post's surrounding context (cosine similarity
    #    against a 384-dim sentence-transformer embedding of the post text)
    best = disambiguate_by_context(results, mention.post_context)

    # 4. Cache only high-confidence results (score >= 0.70). A dataclass is
    #    not directly JSON-serializable, so convert it with asdict() first.
    if best.score >= 0.70:
        await redis.setex(cache_key, 86400, json.dumps(asdict(best)))

    # 5. Return anything at or above 0.65; below that, leave the mention unlinked
    return best if best.score >= 0.65 else None

The Redis cache holds up to 1 million QID entries with a 24-hour TTL, operating as an LRU. The cache hit rate is 78%, driven by a relatively concentrated distribution of entity mentions: the pipeline's corpus is politically focused, and approximately 50,000 high-frequency entities — heads of state, major governments, large corporations, recurring events — account for the vast majority of all mentions. Most posts that mention Putin, the United States, or the European Commission hit cached QIDs without touching the Wikidata API.

For the 22% of mentions that miss the cache, the disambiguation flow queries Wikidata's SPARQL endpoint and returns a ranked list of candidates. When multiple candidates match the surface form (“Washington” could be a person, a state, or a city), the context-window disambiguator selects among them using cosine similarity between an embedding of each candidate's Wikidata description and a 384-dimensional sentence-transformer embedding of the surrounding post text. A candidate with score below 0.65 is not returned; the mention is left without a QID assignment rather than making a low-confidence guess that could corrupt downstream entity timelines. Successful disambiguations with score at or above 0.70 are written to the Redis cache; those between 0.65 and 0.70 are used but not cached, since the lower confidence suggests the surface form may be ambiguous enough to warrant a fresh lookup next time it appears in a different context.
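The two thresholds define three bands: discard, use without caching, and use and cache. The sketch below shows that decision with plain-Python cosine similarity over toy 3-dimensional vectors standing in for the 384-dimensional embeddings. The choose function and the placeholder QIDs (Q_CITY, Q_PERSON) are hypothetical:

```python
import math

RETURN_THRESHOLD = 0.65   # below this, the mention stays unlinked
CACHE_THRESHOLD = 0.70    # at or above this, the result is also cached


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def choose(context_vec, candidates):
    """Pick the best candidate; return (qid or None, should_cache)."""
    if not candidates:
        return None, False
    score, qid = max((cosine(context_vec, vec), qid) for qid, vec in candidates)
    if score < RETURN_THRESHOLD:
        return None, False
    return qid, score >= CACHE_THRESHOLD


context = [1.0, 0.0, 0.0]  # toy embedding of the post text
confident = choose(context, [("Q_CITY", [0.9, 0.1, 0.0]),
                             ("Q_PERSON", [0.0, 1.0, 0.0])])
borderline = choose(context, [("Q_CITY", [0.68, 0.7332, 0.0])])
rejected = choose(context, [("Q_CITY", [0.5, 0.866, 0.0])])
print(confident, borderline, rejected)
```

Keeping the cache threshold above the return threshold means a borderline resolution affects only the post it was computed for.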

Person co-reference resolution

Informal social media posts frequently refer to a person by first name only (“Donald said...”), by handle (“@realDJT”), or by nickname (“the Don”). These surface forms may not appear in the Wikidata alias list and will fail disambiguation at the mention level. A co-reference resolver runs after per-post entity extraction to resolve within-post ambiguous PER mentions by anchoring them to fully-resolved mentions from the same post or same author:

def resolve_person_coreference(
    mentions: list[EntityMention],
    post: CanonicalPost
) -> list[EntityMention]:
    # For each unresolved PER mention, find the most recent fully-resolved
    # PER entity in the same post or in posts from the same author in the
    # last 24h (from author_entity_context cache in Redis)

    # Anchors: confidently resolved PER mentions only, so that a partial
    # person name is never linked to an organization or place
    full_names = [m for m in mentions if m.entity_type == EntityType.PER
                  and m.confidence >= 0.80]
    # Targets: PER mentions that failed direct disambiguation
    partials = [m for m in mentions if m.entity_type == EntityType.PER
                and m.confidence < 0.65]

    for partial in partials:
        candidate = find_best_coref_match(partial, full_names, post.author_id)
        if candidate and coref_edit_distance(partial.mention_text, candidate) < 3:
            partial.canonical_qid = candidate.canonical_qid
            partial.confidence = 0.60  # mark as coref-resolved, lower confidence

    return mentions

The edit distance threshold of 3 limits co-reference to cases where the partial mention is a recognizable prefix or suffix of the resolved name. This prevents the resolver from erroneously linking “Don” (3 characters) to “Donald Trump” solely on author context if the author has mentioned multiple Donalds. Co-reference resolved mentions are stored with confidence = 0.60, below the direct-disambiguation minimum, so downstream systems can filter them separately if needed.
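The post does not show how coref_edit_distance is computed. One reading that reproduces the behavior described above is the smallest Levenshtein distance between the partial mention and any single token of the resolved name. The sketch below implements that reading under the hypothetical name token_edit_distance; it is an assumption, not the pipeline's implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def token_edit_distance(partial: str, full_name: str) -> int:
    """Smallest edit distance between the partial and any token of the name."""
    tokens = full_name.lower().split()
    if not tokens:
        return len(partial)
    return min(levenshtein(partial.lower(), tok) for tok in tokens)


THRESHOLD = 3  # link only when the distance is strictly below this

for partial in ["Donald", "Trmp", "Don"]:
    d = token_edit_distance(partial, "Donald Trump")
    print(partial, d, d < THRESHOLD)
```

Under this reading, "Donald" and the typo "Trmp" link to "Donald Trump", while "Don" sits exactly at distance 3 from "Donald" and is rejected, matching the example in the text.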

The author_entity_context Redis cache records the QIDs of fully-resolved PER entities mentioned by each author in the last 24 hours. This allows the resolver to anchor a partial mention in post B to a full mention from the same author in post A, even if they appear in different posts — a common pattern in threads and reply chains where the author assumes conversational context.

EntityMention schema and TimescaleDB storage

Resolved mentions are materialized as EntityMention records with full provenance — character offsets into the original post, language, confidence, and collection timestamp:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class EntityMention:
    mention_id: str         # uuid
    post_id: str            # FK to canonical_posts
    mention_text: str       # surface form in the post
    canonical_qid: str      # Wikidata QID
    entity_type: EntityType
    confidence: float       # 0.65–1.0 from disambiguation; 0.60 if coref-resolved
    char_start: int         # character offset in post content
    char_end: int
    language: str           # post language (ISO 639-1)
    collected_at: datetime

These records are written to a TimescaleDB hypertable entity_mentions with a one-day chunk interval, partitioned by space using a hash of canonical_qid modulo 8. Two indexes cover the primary query patterns: an index on (canonical_qid, collected_at) for entity timeline queries (“show all mentions of Q37158 over the last 7 days”) and an index on post_id for post-level entity lookups (“which entities does this post mention?”). Bulk inserts run at 2,000 rows per batch; at the pipeline's output rate of ~15,000 mentions per hour, each batch takes approximately 8ms.

Organization hierarchy table

When an ORG mention disambiguates to a Wikidata entity that has a parent organization claim (Wikidata property P749), the pipeline writes a row to the org_hierarchy table:

CREATE TABLE org_hierarchy (
    child_qid TEXT NOT NULL,
    parent_qid TEXT NOT NULL,
    relationship TEXT DEFAULT 'subsidiary',
    UNIQUE(child_qid, parent_qid)
);

This table is populated lazily: the first time an ORG entity is disambiguated, its Wikidata P749 claim is fetched and the parent-child relationship is stored. Subsequent mentions of the same child QID read the hierarchy from the table rather than re-fetching Wikidata. The primary use case is hierarchical entity queries: “show all posts mentioning any subsidiary of Meta Platforms (Q37158)” becomes a join of entity_mentions against org_hierarchy where parent_qid = 'Q37158'. This is materially useful for tracking coordinated narratives that are careful to mention subsidiaries rather than the parent brand directly.
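The hierarchical query can be tried end to end with Python's built-in sqlite3 module as a stand-in for TimescaleDB. The entity_mentions table here is reduced to two columns, and the QIDs (Q_PARENT, Q_CHILD_A, and so on) are placeholders, so this is a sketch of the join shape rather than the production schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE org_hierarchy (
    child_qid TEXT NOT NULL,
    parent_qid TEXT NOT NULL,
    relationship TEXT DEFAULT 'subsidiary',
    UNIQUE(child_qid, parent_qid)
);
-- Reduced stand-in for the entity_mentions hypertable.
CREATE TABLE entity_mentions (
    post_id TEXT NOT NULL,
    canonical_qid TEXT NOT NULL
);
""")

# INSERT OR IGNORE mirrors lazy population: re-seeing a pair is a no-op.
conn.executemany(
    "INSERT OR IGNORE INTO org_hierarchy (child_qid, parent_qid) VALUES (?, ?)",
    [("Q_CHILD_A", "Q_PARENT"), ("Q_CHILD_B", "Q_PARENT"),
     ("Q_CHILD_A", "Q_PARENT")],
)
conn.executemany(
    "INSERT INTO entity_mentions (post_id, canonical_qid) VALUES (?, ?)",
    [("p1", "Q_CHILD_A"), ("p2", "Q_CHILD_B"),
     ("p3", "Q_OTHER"), ("p4", "Q_PARENT")],
)


def posts_mentioning_subsidiaries(conn, parent_qid):
    """Return post IDs that mention any direct subsidiary of parent_qid."""
    rows = conn.execute(
        """
        SELECT DISTINCT m.post_id
        FROM entity_mentions AS m
        JOIN org_hierarchy AS h ON h.child_qid = m.canonical_qid
        WHERE h.parent_qid = ?
        ORDER BY m.post_id
        """,
        (parent_qid,),
    ).fetchall()
    return [row[0] for row in rows]


result = posts_mentioning_subsidiaries(conn, "Q_PARENT")
print(result)
```

The join returns only posts that mention a subsidiary; a post naming the parent directly (p4 here) would need the ordinary canonical_qid lookup in addition.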

Throughput and latency budget

The end-to-end latency budget for the entity extraction pipeline:

Stage	Latency	Notes
NER extraction (English, batch=32)	~11ms per batch	2,800 posts/sec per A100 worker
Transliteration + FAUX filter	<0.5ms per mention	CPU-only; negligible
Disambiguation (cache hit, 78%)	~0.5ms Redis round-trip	1M LRU cache, 24h TTL
Disambiguation (cache miss, 22%)	~180ms Wikidata SPARQL	200 async coroutines per worker
Average per-mention (weighted)	~40ms	(0.78 × 0.5ms) + (0.22 × 180ms)
TimescaleDB bulk write	~8ms per 2,000-row batch	~15,000 mentions/hour output

The 40ms average disambiguation latency per mention is absorbed by the 200 concurrent asyncio coroutines per worker running against the Wikidata SPARQL endpoint. The SPARQL endpoint rate limit (600 queries/minute per IP) is managed via the same token-bucket limiter used in the ingestion pipeline, with a per-worker IP rotation through a pool of 12 outbound addresses to stay within the rate limit at full disambiguation throughput.
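A token bucket is straightforward to sketch. The version below takes the current time as an argument so its behavior is deterministic. The per-address rate follows from the 600 queries/minute limit, while the bucket capacity of 10 is an assumed burst size and the class is illustrative, not the limiter used in the ingestion pipeline:

```python
class TokenBucket:
    """Refills at rate_per_sec up to capacity; each request spends one token."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = now

    def try_acquire(self, now: float) -> bool:
        # Credit tokens for the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


PER_IP_LIMIT_PER_MIN = 600
OUTBOUND_IPS = 12
per_ip_rate = PER_IP_LIMIT_PER_MIN / 60        # queries/second per address
aggregate_rate = per_ip_rate * OUTBOUND_IPS    # queries/second across the pool

bucket = TokenBucket(rate_per_sec=per_ip_rate, capacity=10)
granted = sum(bucket.try_acquire(now=0.0) for _ in range(25))
after_refill = bucket.try_acquire(now=0.5)
print(per_ip_rate, aggregate_rate, granted, after_refill)
```

One bucket per outbound address keeps each address within its own limit; rotating across the pool raises the aggregate ceiling without any single address exceeding it.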

Output rate of ~15,000 entity mentions per hour corresponds to an average of 0.26 entity mentions per post across the full multilingual corpus. English-language political content averages approximately 0.8 mentions per post; short-form posts in other languages average closer to 0.1. The lower-than-expected rate reflects both the FAUX filter removing 18% of raw NER output and the confidence threshold that discards sub-0.65 disambiguations.

Downstream integration

The entity_mentions hypertable feeds four downstream analytical systems, each consuming it in a different pattern:

Coordinated campaign detection uses MinHash clustering on sets of entity mentions shared across posts. When multiple accounts post content that mentions the same set of entities in a short time window, the entity co-occurrence is one of four signals the coordination scorer evaluates. The canonical_qid normalization is what makes this cross-language: a Russian-language post mentioning “Путин” and an English-language post mentioning “Putin” both produce the same QID in the mention set, so the MinHash comparison sees them as sharing an entity.
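As an illustration of why QID normalization matters for this step, here is a compact MinHash sketch over sets of QIDs. The number of hash functions, the SHA-1 based hashing, and the placeholder QIDs are all assumptions; the production clustering is not shown in this post:

```python
import hashlib

NUM_HASHES = 64  # assumed signature length


def _hash(seed: int, item: str) -> int:
    """Seeded 64-bit hash derived from SHA-1."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(qids: set[str]) -> list[int]:
    """One minimum per seeded hash function; requires a non-empty set."""
    if not qids:
        raise ValueError("cannot build a signature for an empty entity set")
    return [min(_hash(seed, q) for q in qids) for seed in range(NUM_HASHES)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# A Russian and an English post that resolve to the same QIDs (placeholders).
post_ru = {"Q_LEADER", "Q_COUNTRY", "Q_SUMMIT"}
post_en = {"Q_LEADER", "Q_COUNTRY", "Q_SUMMIT"}
post_other = {"Q_TEAM", "Q_STADIUM"}

same = estimated_jaccard(minhash_signature(post_ru), minhash_signature(post_en))
different = estimated_jaccard(minhash_signature(post_ru),
                              minhash_signature(post_other))
print(same, different)
```

Had the sets contained raw surface forms instead of QIDs, the Russian and English posts would share no elements and the estimate would be near zero.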

Election anomaly detection flags posts that co-mention candidate entities and polling-location entities within the same post during declared election windows. The entity type taxonomy matters here: the system specifically looks for co-occurrence of PER entities with high-confidence matches to registered candidate QIDs alongside GPE or LOC entities that match known polling station locations for that election.

BGP and censorship correlation cross-references ORG entity mentions of ISPs and government agencies against Voidly shutdown events in the same time window. When a national ISP (say, an autonomous system operator with a known QID) is mentioned with elevated frequency in posts collected just before or during a confirmed BGP withdrawal, the correlation is logged as a potential censorship precursor signal. The org_hierarchy table is used here to catch mentions of subsidiary networks whose parent is the known ISP.

OSINT actor tracking maintains velocity metrics for specific high-interest QIDs — a configurable list of entities under active monitoring. For each monitored QID, the system records mention frequency per hour per platform and source country, alerting when velocity exceeds a rolling baseline by a configurable multiple. This is distinct from coordinated campaign detection (which looks for structure in who mentions an entity) and is instead a volume anomaly detector: a sudden 10× spike in mentions of a particular official or institution is flagged for analyst review independent of whether the underlying posts show coordination patterns.
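A volume anomaly detector of this kind can be sketched as a rolling mean with a multiplier. The 24-hour window, the default multiple of 10, and the VelocityMonitor class are illustrative choices, since the post says only that both the baseline and the multiple are configurable:

```python
from collections import deque


class VelocityMonitor:
    """Flags an hourly count that reaches `multiple` times the rolling mean."""

    def __init__(self, window_hours: int = 24, multiple: float = 10.0):
        self.window = deque(maxlen=window_hours)  # most recent hourly counts
        self.multiple = multiple

    def observe(self, hourly_count: int) -> bool:
        alert = False
        if self.window:
            baseline = sum(self.window) / len(self.window)
            # Compare against the baseline before the new count is added.
            alert = baseline > 0 and hourly_count >= self.multiple * baseline
        self.window.append(hourly_count)
        return alert


monitor = VelocityMonitor()
quiet = [monitor.observe(100) for _ in range(24)]   # steady traffic
spike = monitor.observe(1000)                       # 10x the baseline
after = monitor.observe(120)                        # back to normal
print(any(quiet), spike, after)
```

In a setup like the one described, one monitor instance would be kept per (QID, platform, source country) key so that each combination has its own baseline.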

For the ingestion pipeline that collects the 58M posts this extraction runs on: Social media ingestion at scale: collecting 58M posts per day from 47 platform schemas

For how entity co-occurrence feeds into coordinated campaign detection: Detecting Coordinated Inauthentic Behavior in Social Media at Scale

For the broader OSINT reconnaissance pipeline these extractions feed into: Building a digital-footprint reconnaissance pipeline for OSINT investigations

For the NLP pipeline architecture (GPU workers, Kafka fan-out, and the sentiment analysis pipeline): NLP pipeline for real-time sentiment analysis at scale