
NLP Pipeline for Real-Time Sentiment Analysis at Scale

September 28, 2024· 17 min read· AI Analytics

NLP · DistilBERT · SpaCy · Infrastructure · OSINT

Our OSINT platform processes 2.4 million social media posts per hour across 47 platforms. The infrastructure for ingestion is covered separately — this article focuses on the NLP layer: how we classify language, extract entities, score sentiment, and detect coordinated inauthentic behavior at 667 posts per second without sacrificing accuracy.

The NLP pipeline is one of the data sources that feeds Voidly's censorship measurement: when a government blocks a platform or domain, we see it first in the social signals (posts about connectivity failures, reports of access problems) before the network probes accumulate enough measurements to confirm an event with high confidence.

Processing requirements

Each post must pass through four stages within a 2-second latency budget from ingestion to stored result:

Stage                     Model            Latency   Accuracy target
─────────────────────────────────────────────────────────────────────
Language detection        FastText lid      3ms       ≥99.5% (top-1)
Named entity recognition  Custom SpaCy      12ms      ≥91% F1 (macro)
Sentiment classification  DistilBERT FT     45ms      ≥93% (3-class)
Coordinated behavior      MinHash + rules   2ms       ≥87% precision
─────────────────────────────────────────────────────────────────────
Total per post:                             ~62ms
GPU workers (80×):                          667 posts/sec sustained
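The figures in the table can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: the inputs are the numbers quoted above, and the "unbatched ceiling" is an assumption that treats the four stages as strictly sequential with no batching or overlap.

```python
# Figures quoted in the table above.
POSTS_PER_HOUR = 2_400_000
GPU_WORKERS = 80
STAGE_LATENCY_MS = {"language": 3, "ner": 12, "sentiment": 45, "campaign": 2}

posts_per_second = POSTS_PER_HOUR / 3600            # fleet-wide rate
per_worker_rate = posts_per_second / GPU_WORKERS    # posts/sec each worker must sustain
per_post_ms = sum(STAGE_LATENCY_MS.values())        # sequential model time per post

# Assumption: no batching and no overlap between stages.
unbatched_ceiling = 1000 / per_post_ms              # posts/sec one worker could reach

print(f"{posts_per_second:.0f} posts/sec fleet-wide")
print(f"{per_worker_rate:.1f} posts/sec required per worker")
print(f"{per_post_ms} ms of model time per post")
print(f"{unbatched_ceiling:.1f} posts/sec per worker without batching")
```

Under these assumptions each worker needs about 8.3 posts/sec against an unbatched ceiling of about 16.1 posts/sec, so the sustained load sits at roughly half of what a worker could do even before GPU batching is taken into account.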

We process English, Spanish, and Chinese. Posts in other languages are stored raw for later batch processing. This covers 78% of the volume from our source platforms.

Language detection: FastText lid.176

FastText's compact language identification model (lid.176.ftz, 917KB; the uncompressed lid.176.bin is 126MB) handles 176 languages at 3ms per post on CPU. We run it on CPU rather than GPU because the model is too small to saturate a GPU batch, and CPU inference doesn't compete with the GPU-bound DistilBERT stage.

The model uses character n-gram features (1–5 grams) with a shallow neural network. Social media text is noisy (hashtags, mentions, mixed-script emoji) so we preprocess before detection:

import fasttext
import re

_lid_model = fasttext.load_model("lid.176.ftz")   # compressed 917KB model

# Hashtag words are kept: "#" is removed by the punctuation rule below,
# while the word that follows it is left in place.
_NOISE_PATTERN = re.compile(
    r"https?://\S+"                # URLs
    r"|@\w+"                       # Mentions
    r"|[\U00010000-\U0010ffff]"    # Emoji (outside BMP)
    r"|[^\w\s]"                    # Punctuation, including "#"
)

def detect_language(text: str) -> tuple[str, float]:
    """Returns (ISO-639-1 code, confidence)."""
    clean = _NOISE_PATTERN.sub(" ", text)
    clean = " ".join(clean.split())   # collapse whitespace; predict() rejects newlines
    if len(clean) < 10:
        return "xx", 0.0   # Too short: skip language detection

    (lang,), (conf,) = _lid_model.predict(clean, k=1)
    # FastText returns '__label__zh' etc.
    code = lang.replace("__label__", "")[:2]
    return code, float(conf)

# Only English, Spanish, Chinese proceed to NER + sentiment
PROCESS_LANGS = {"en", "es", "zh"}

Removing URLs and mentions before detection is important. Without it, a Spanish post containing an English URL will often be misclassified as English. Keeping the hashtag word (stripping only the #) preserves language signal since hashtags are usually in the author's language.
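To make the effect of the cleaning step concrete, here is a small regex-only sketch that needs no model. The rule set is an assumption that mirrors the description above (drop URLs, mentions, emoji outside the BMP, and punctuation); the sample post and the helper name clean_for_lid are made up for illustration.

```python
import re

# Assumed cleaning rules, mirroring the prose: URLs and mentions are dropped,
# then emoji outside the BMP and punctuation (which includes "#").
_URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
_EMOJI_OR_PUNCT = re.compile(r"[\U00010000-\U0010ffff]|[^\w\s]")

def clean_for_lid(text: str) -> str:
    text = _URL_OR_MENTION.sub(" ", text)
    text = _EMOJI_OR_PUNCT.sub(" ", text)
    return " ".join(text.split())   # collapse whitespace and newlines

post = "No puedo creerlo 😡 #elecciones2024 https://example.com/breaking-news @reportero"
print(clean_for_lid(post))   # No puedo creerlo elecciones2024
```

The English-looking URL and the mention are gone, while the hashtag word survives and still carries Spanish signal.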

Measured accuracy on a 10K sample of hand-labeled posts: 99.7% on English, 99.1% on Spanish, 98.8% on Chinese (zh covers both simplified and traditional). The confusion cases are all code-switching posts (Spanish/English mixing) where the correct label is genuinely ambiguous.

Named entity recognition: custom SpaCy model

Off-the-shelf SpaCy models (en_core_web_lg, etc.) are trained on news corpora and perform poorly on social media text: informal capitalization, abbreviations, and domain-specific entities (politician nicknames, organization abbreviations common in election discourse) are systematically missed.

We fine-tuned a custom NER model on 2.3 million labeled examples collected from prior election cycles (2016, 2018, 2020 US elections plus 2022 European elections). The label set:

Label         Description                     Examples
──────────────────────────────────────────────────────────────
PERSON        Political figures, officials    "Biden", "Kamala", "AOC"
ORG           Organizations, parties          "DNC", "MAGA", "DOJ"
GPE           Geopolitical entities           "Iowa", "DC", "Florida"
FAC           Facilities, venues              "Capitol", "Mar-a-Lago"
LAW           Legislation, legal actions      "SB 202", "Jan 6"
EVENT         Named political events          "Super Tuesday", "debate"
TOPIC         Issue domains                   "abortion", "immigration"
──────────────────────────────────────────────────────────────
Total labeled spans:  2.3M
Train / dev / test:   80% / 10% / 10%

Training used SpaCy's train command with the transition-based NER pipeline (ner component on top of the shared transformer representation). We started from the en_core_web_trf weights (RoBERTa-base backbone) and fine-tuned for 20 epochs with batch size 128:

# config.cfg (abbreviated)
[paths]
train = "corpus/train.spacy"
dev = "corpus/dev.spacy"

[nlp]
lang = "en"
pipeline = ["transformer", "ner"]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"

[components.ner]
factory = "ner"

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
max_steps = 20000
eval_frequency = 200
patience = 1600

[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.0001
beta1 = 0.9
beta2 = 0.999

Evaluation on the held-out test set (230K spans):

Label     Precision   Recall    F1
──────────────────────────────────
PERSON    94.2%       93.1%     93.6%
ORG       89.7%       88.4%     89.0%
GPE       96.1%       95.8%     95.9%
FAC       82.3%       79.6%     80.9%
LAW       88.9%       84.2%     86.5%
EVENT     85.4%       81.7%     83.5%
TOPIC     91.3%       90.1%     90.7%
──────────────────────────────────
Macro F1: 91.4%
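
For readers who want to recompute the aggregate, the sketch below derives each label's F1 from the precision and recall in the table and then takes the unweighted (macro) mean. On these inputs the unweighted mean comes out near 88.6, so the 91.4% headline presumably reflects an average weighted by label frequency; that reading is an inference from the numbers, not something the evaluation states.

```python
# Per-label (precision, recall) in percent, copied from the table above.
SCORES = {
    "PERSON": (94.2, 93.1),
    "ORG":    (89.7, 88.4),
    "GPE":    (96.1, 95.8),
    "FAC":    (82.3, 79.6),
    "LAW":    (88.9, 84.2),
    "EVENT":  (85.4, 81.7),
    "TOPIC":  (91.3, 90.1),
}

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

per_label_f1 = {label: f1(p, r) for label, (p, r) in SCORES.items()}
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)   # unweighted mean

for label, score in per_label_f1.items():
    print(f"{label:<7} {score:.1f}")
print(f"Unweighted mean: {macro_f1:.1f}")
```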

FAC and EVENT have the lowest scores because they are the most context-dependent: "the Capitol" vs "a capitol building" requires world knowledge that the model doesn't always have. We accept this tradeoff — these labels are less important for downstream analysis than PERSON and ORG.

At inference time, SpaCy processes posts in batches of 64 with the GPU disabled (the RoBERTa backbone is fast enough on CPU for 12ms/post throughput, and GPU time is reserved for the sentiment model). We export to ONNX for CPU inference: ONNX Runtime executes outside the Python GIL, which keeps per-post latency predictable in each of the 80 worker processes.

Sentiment classification: DistilBERT fine-tuning

DistilBERT was chosen over BERT-base for throughput reasons: it has 40% fewer parameters (66M vs 110M) and is 60% faster at inference while retaining 97% of BERT-base's performance on GLUE benchmarks. The sentiment task is three-class (positive / neutral / negative) plus a per-class confidence score.

Training dataset construction

We assembled 5 million labeled examples from three sources:

3.1M from TweetSentimentExtraction and similar public datasets, filtered to political topics
1.2M manually labeled tweets from 2020–2022 election monitoring (contracted human annotators via Scale AI)
0.7M from weak supervision: posts containing strong positive/negative indicator phrases (e.g., "voting is rigged" → negative, "democracy works" → positive), reviewed and filtered at 85% confidence threshold using a bootstrap model
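
The weak-supervision source can be pictured as a labeling function followed by a confidence filter. The sketch below is one possible reading, not the production code: the phrase lists contain only the two examples quoted above, the helper names are hypothetical, and it assumes "filtered at 85% confidence" means the bootstrap model must agree with the phrase-derived label at confidence 0.85 or higher.

```python
from typing import Optional

# Hypothetical phrase lists; only these two examples come from the article.
NEGATIVE_PHRASES = ("voting is rigged",)
POSITIVE_PHRASES = ("democracy works",)

def weak_label(text: str) -> Optional[str]:
    """Phrase-based label, or None when no phrase or conflicting phrases match."""
    lowered = text.lower()
    neg = any(phrase in lowered for phrase in NEGATIVE_PHRASES)
    pos = any(phrase in lowered for phrase in POSITIVE_PHRASES)
    if neg == pos:
        return None
    return "negative" if neg else "positive"

def keep_example(weak: Optional[str], model_label: str, model_conf: float,
                 threshold: float = 0.85) -> bool:
    """Assumed filter: keep only if the bootstrap model agrees confidently."""
    return weak is not None and weak == model_label and model_conf >= threshold

label = weak_label("They keep saying voting is rigged")
print(label, keep_example(label, "negative", 0.91))   # negative True
```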

Class distribution after balancing (undersampling neutral, which was overrepresented): 33% positive, 34% neutral, 33% negative. Without balancing, the model collapses to predicting neutral for ambiguous cases.
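
A minimal sketch of the undersampling step, assuming examples are (text, label) pairs and that balancing means capping any oversized class at a fixed count. The function name, the cap, and the toy counts are illustrative, not the values used in training.

```python
import random
from collections import defaultdict

def cap_classes(examples, cap, seed=0):
    """Randomly downsample any class with more than `cap` examples.

    `examples` is a list of (text, label) pairs; smaller classes are untouched.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, cap) if len(items) > cap else items)
    rng.shuffle(balanced)
    return balanced

# Toy data with neutral heavily overrepresented.
data = (
    [(f"pos {i}", "positive") for i in range(33)]
    + [(f"neu {i}", "neutral") for i in range(120)]
    + [(f"neg {i}", "negative") for i in range(33)]
)
balanced = cap_classes(data, cap=34)
counts = {
    label: sum(1 for _, l in balanced if l == label)
    for label in ("positive", "neutral", "negative")
}
print(counts)   # {'positive': 33, 'neutral': 34, 'negative': 33}
```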

Fine-tuning configuration

from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    TrainingArguments,
    Trainer,
)
import torch

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=3,
    id2label={0: "positive", 1: "neutral", 2: "negative"},
)

training_args = TrainingArguments(
    output_dir="./election-sentiment-v2",
    num_train_epochs=4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    warmup_steps=500,
    weight_decay=0.01,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
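    evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
    eval_steps=500,
    save_steps=1000,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",  # must match a key returned by compute_metrics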
    fp16=True,   # A100 GPU, mixed precision
)

# Fine-tuning took 8 hours on 4× A100 (80GB)
# Final checkpoint: 252MB in FP32 (distilbert weights + classification head);
# 63MB after INT8 quantization
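
The configuration selects the best checkpoint by "f1_macro", which Trainer can only do if a compute_metrics callback returns that key. The article does not show the callback, so the following is a dependency-free sketch of what it could look like; in practice a library such as scikit-learn would usually compute the score.

```python
def macro_f1(y_true, y_pred, num_labels=3):
    """Unweighted mean of per-class F1."""
    scores = []
    for c in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels

def compute_metrics(eval_pred):
    """Receives (logits, labels); the returned key must match metric_for_best_model."""
    logits, labels = eval_pred
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    return {"f1_macro": macro_f1([int(label) for label in labels], preds)}

# Tiny hand-made example: 4 posts, 3 classes.
example_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
example_labels = [0, 1, 2, 1]
print(compute_metrics((example_logits, example_labels)))
```

It would be passed as Trainer(..., compute_metrics=compute_metrics).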

Evaluation results

                Precision   Recall    F1
Positive        95.1%       93.8%     94.4%
Neutral         93.4%       95.2%     94.3%
Negative        94.9%       94.3%     94.6%
─────────────────────────────────────────
Macro F1:       94.7%

Evaluated on 500K held-out posts, labeled by human annotators.
Human-human agreement on same posts: 91.4% (3-way majority).
Model-human agreement: 94.7% — model outperforms human inter-annotator.

The model outperforming human inter-annotator agreement is not unusual for sentiment tasks — human annotators disagree on borderline cases, and the model has seen more labeled examples than any individual annotator. It means the model is consistent, not necessarily that it is "better" than humans.

Production inference

The fine-tuned model is exported to ONNX and quantized to INT8 (dynamic quantization using the Optimum library). INT8 quantization reduces model size from 252MB to 63MB and cuts inference time from 45ms to 28ms on NVIDIA T4, with 0.4pp F1 degradation — acceptable for production.

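import numpy as np
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax over the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)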

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = ORTModelForSequenceClassification.from_pretrained(
    "./election-sentiment-v2-onnx",
    provider="CUDAExecutionProvider",
)

def classify_batch(texts: list[str]) -> list[dict]:
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,   # 128 vs 512 — social media posts are short
        return_tensors="np",
    )
    logits = model(**inputs).logits     # (batch, 3)
    probs = softmax(logits, axis=-1)

    return [
        {
            "positive": float(p[0]),
            "neutral":  float(p[1]),
            "negative": float(p[2]),
            "label":    ["positive", "neutral", "negative"][p.argmax()],
        }
        for p in probs
    ]

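We cap sequence length at 128 tokens instead of 512. Social media posts rarely exceed 128 tokens (Twitter's 280-character limit is ~70 tokens on average). This alone cuts inference time in half: self-attention cost grows quadratically with sequence length, and with padding=True a single long post would otherwise stretch every sequence in its batch to that post's length.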
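
The effect of the cap on a padded batch can be illustrated with simple arithmetic. The batch below is hypothetical (31 typical posts and one very long one), and the cost proxies are rough proportionalities, not measured timings.

```python
def padded_length(token_counts, max_length):
    """Length every row is padded to with padding=True and truncation=True."""
    return min(max(token_counts), max_length)

batch = [70] * 31 + [400]   # hypothetical: 31 typical posts, one very long post

for cap in (512, 128):
    length = padded_length(batch, cap)
    linear_work = length * len(batch)          # grows with padded tokens
    attention_work = length ** 2 * len(batch)  # grows with padded length squared
    print(f"cap={cap}: padded length {length}, "
          f"tokens {linear_work}, attention units {attention_work}")
```

With the 512 cap, the single long post pads the whole batch to 400 tokens; with the 128 cap, the batch is padded to 128 at most.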

Coordinated campaign detection

Sentiment and entities are table stakes for social media monitoring. The operationally important signal is coordinated inauthentic behavior: bot networks, astroturfing campaigns, and state-sponsored influence operations posting near-identical content at high volume. We detect this with a two-stage system.

Stage 1: MinHash LSH content similarity

MinHash locality-sensitive hashing finds near-duplicate content in O(1) time per post by exploiting a property of Jaccard similarity: with a signature split into b bands of r rows each, two documents with Jaccard similarity s land in the same bucket in at least one band with probability 1 − (1 − s^r)^b, which increases with s. We use 128 permutations with b=16 bands × r=8 rows, giving P(collision | similarity ≥ 0.85) ≥ 0.994.
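
The banding formula is easy to evaluate directly. This short sketch plugs in the b=16, r=8 configuration quoted above; the list of similarity values is arbitrary and only there to show the shape of the curve.

```python
def collision_probability(similarity: float, bands: int, rows: int) -> float:
    """P(at least one band matches) = 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - similarity ** rows) ** bands

BANDS, ROWS = 16, 8   # 16 x 8 = 128 permutations

for s in (0.50, 0.70, 0.85, 0.95):
    p = collision_probability(s, BANDS, ROWS)
    print(f"similarity {s:.2f} -> candidate probability {p:.4f}")
```

At similarity 0.85 the result is roughly 0.994, and the probability falls off as similarity drops, which is what makes the bands act as a soft threshold.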

from datasketch import MinHash, MinHashLSH
import re

# params=(bands, rows) pins b=16, r=8. Without it datasketch picks b and r
# itself from the threshold; with it the threshold argument is ignored.
lsh = MinHashLSH(threshold=0.85, num_perm=128, params=(16, 8))

_NORMALIZE = re.compile(r"https?://\S+|@\w+|#")

def minhash_of(text: str) -> MinHash:
    """Character 4-gram MinHash signature of the normalized text."""
    clean = _NORMALIZE.sub("", text.lower()).strip()
    m = MinHash(num_perm=128)
    for i in range(max(0, len(clean) - 3)):
        m.update(clean[i:i+4].encode("utf-8"))
    return m

def check_and_index(post) -> dict:
    m = minhash_of(post.text)
    similar_ids = lsh.query(m)   # query first so the post cannot match itself

    # Index every post, candidates included, so later posts can join the cluster
    lsh.insert(post.id, m)

    if len(similar_ids) >= 3:
        # 3+ near-duplicates = candidate coordinated posting
        return {"campaign_candidate": True, "cluster_ids": similar_ids[:20]}

    return {"campaign_candidate": False}

We use character 4-grams rather than word unigrams. Word-level MinHash is easily defeated by campaigns that make small edits to many words per post (changed inflections, deliberate misspellings, dropped letters), because each edit replaces a whole token. Character n-grams are more robust because a local edit only changes the few n-grams that overlap it and preserves most of the underlying n-gram set.
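
A small comparison of exact Jaccard similarity on word sets and on character 4-gram sets shows the kind of variation where character n-grams hold up better. The two sentences are invented for illustration; the second differs from the first by three small character-level edits.

```python
def char_ngrams(text: str, n: int = 4) -> set:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def words(text: str) -> set:
    return set(text.split())

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0

a = "they are stealing the election in every single state right now"
b = "they are stealin the elections in every single states right now"

word_sim = jaccard(words(a), words(b))
char_sim = jaccard(char_ngrams(a), char_ngrams(b))
print(f"word unigrams:     {word_sim:.2f}")
print(f"character 4-grams: {char_sim:.2f}")
```

Three edited words remove three of eleven tokens from the overlap at the word level, while at the character level only the 4-grams that touch the edits are affected, so the character-level similarity stays higher.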

The LSH index is maintained per 60-minute window. After 60 minutes the window is checkpointed to TimescaleDB and a new in-memory LSH is started. This bounds memory usage at ~2GB at sustained volume (2.4M posts/hour × ~800 bytes per MinHash signature ≈ 1.9GB).
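
The window rotation can be sketched independently of the LSH library. RotatingWindow below is a hypothetical helper: make_index stands in for constructing a fresh in-memory index, on_expire stands in for the checkpoint step, and the clock is injectable so the behavior can be tested without waiting an hour.

```python
import time

class RotatingWindow:
    """Holds one in-memory index and replaces it once the window has elapsed."""

    def __init__(self, make_index, on_expire, window_seconds=3600.0,
                 clock=time.monotonic):
        self._make_index = make_index
        self._on_expire = on_expire      # e.g. checkpoint the old index to storage
        self._window = window_seconds
        self._clock = clock
        self._index = make_index()
        self._started = clock()

    def current(self):
        """Return the live index, rotating first if the window has expired."""
        now = self._clock()
        if now - self._started >= self._window:
            self._on_expire(self._index)
            self._index = self._make_index()
            self._started = now
        return self._index

# Back-of-envelope memory bound from the figures above.
approx_gb = 2_400_000 * 800 / 1e9
print(f"~{approx_gb:.2f} GB of signatures per window")
```

Callers would fetch the index through current() on every post, so rotation happens lazily on the first post after the hour boundary.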

Stage 2: behavioral signals

Content similarity alone produces false positives on organic viral posts (a breaking news quote that many people share). Stage 2 applies account-behavioral features to filter candidates:

import math
from datetime import timedelta

def is_coordinated_campaign(cluster_ids: list[str]) -> bool:
    # created_at fields are assumed to be datetime objects
    posts = fetch_posts(cluster_ids)
    if not posts:
        return False
    authors = [fetch_author(p.author_id) for p in posts]

    signals = []

    # Temporal: 5+ accounts posting within 90 seconds
    timestamps = sorted(p.created_at for p in posts)
    burst = timestamps[-1] - timestamps[0] < timedelta(seconds=90)
    if burst and len({p.author_id for p in posts}) >= 5:
        signals.append("temporal_burst")

    # Account age: 60%+ of accounts created within 14 days of each other
    created = sorted(a.created_at for a in authors)
    needed = math.ceil(0.6 * len(created))
    if len(created) >= 5 and any(
        created[i + needed - 1] - created[i] < timedelta(days=14)
        for i in range(len(created) - needed + 1)
    ):
        signals.append("account_age_cluster")

    # Posting velocity: any author posting > 50 times/hour
    if any(a.posts_last_hour > 50 for a in authors):
        signals.append("high_velocity_poster")

    # Identical metadata: 4+ posts carry a URL and it is the same URL
    urls = [u for u in (extract_url(p.text) for p in posts) if u]
    if len(set(urls)) == 1 and len(urls) > 3:
        signals.append("identical_url")

    # Flag if 2+ signals
    return len(signals) >= 2

In production, 89% of clusters that pass stage 1 content similarity are confirmed as coordinated by stage 2 behavioral filtering. The remaining 11% are stage 1 false positives, dominated by two patterns: viral tweets being quote-tweeted with minimal editorial content, and news aggregation bots that legitimately post many similar items.

Pipeline integration

The four stages run sequentially within a single Python process per worker. Workers consume from Kafka in batches of 32 posts (batch size chosen to saturate GPU memory on T4 without exceeding latency budget).

class NLPWorker:
    def process_batch(self, posts: list[Post]) -> list[ProcessedPost]:
        eligible_posts, langs, all_entities = [], [], []
        for post in posts:
            # Stage 1: Language (CPU, 3ms)
            lang, _lang_conf = detect_language(post.text)
            if lang not in PROCESS_LANGS:
                store_raw(post)          # Store without NLP for batch processing
                continue

            # Stage 2: NER (CPU/ONNX, 12ms)
            entities = ner_pipeline(post.text)

            # Stage 3 input is accumulated so sentiment runs once per batch
            eligible_posts.append(post)
            langs.append(lang)
            all_entities.append(entities)

        if not eligible_posts:
            return []

        # Stage 3: Sentiment (GPU/ONNX-CUDA), all eligible posts at once
        texts = [p.text for p in eligible_posts]
        sentiments = classify_batch(texts)    # GPU, ~28ms for 32 posts

        results = []
        for post, lang, entities, sentiment in zip(
            eligible_posts, langs, all_entities, sentiments
        ):
            # Stage 4: Campaign detection (CPU, 2ms)
            campaign_signal = check_and_index(post)

            results.append(ProcessedPost(
                post=post,
                language=lang,
                entities=entities,
                sentiment=sentiment,
                campaign_candidate=campaign_signal["campaign_candidate"],
            ))

        return results

Sentiment inference for all 32 posts in a batch takes ~28ms on the GPU, compared to 45ms × 32 = 1440ms if serialized. GPU batching is the single biggest throughput multiplier in the pipeline.

Connecting to Voidly censorship events

The NLP pipeline is one of Voidly's measurement signals. When users in a country begin posting about connectivity loss, the NER layer extracts the platform or domain name (e.g., "Telegram", "Twitter"), the TOPIC label captures the "internet access" framing, and the sentiment score goes negative. A spike in this signal pattern (many posts about a platform from a specific GPE with negative sentiment) triggers a measurement task injection into the Voidly scheduler, which immediately queues that platform's domain for probing from all available vantage points in that country.

This integration means Voidly often detects censorship events within 4–8 minutes of them starting, before the scheduled measurement cycle would have caught them. The social signal is the canary; the network probe is the confirmation.
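
One way to picture the spike trigger is a sliding-window counter keyed on (country, platform). Everything in this sketch is hypothetical, including the class name, the five-minute window, and the threshold of 50 posts; the article does not describe how the production trigger is tuned.

```python
from collections import defaultdict, deque

class NegativeMentionSpike:
    """Counts negative posts per (country, platform) inside a sliding window."""

    def __init__(self, window_seconds: float = 300.0, threshold: int = 50):
        self.window = window_seconds      # hypothetical value
        self.threshold = threshold        # hypothetical value
        self._events = defaultdict(deque) # (country, platform) -> timestamps

    def observe(self, country: str, platform: str,
                sentiment_label: str, ts: float) -> bool:
        """Record one post; True means the key is at or above the threshold."""
        if sentiment_label != "negative":
            return False
        queue = self._events[(country, platform)]
        queue.append(ts)
        while queue and ts - queue[0] > self.window:
            queue.popleft()               # drop posts older than the window
        return len(queue) >= self.threshold

detector = NegativeMentionSpike(window_seconds=60, threshold=3)
fired = [detector.observe("IR", "Telegram", "negative", t) for t in (0, 10, 20)]
print(fired)   # [False, False, True]
```

A True result is the point at which a probing task for that platform would be handed to the scheduler.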

Performance summary

Metric                                   Value
──────────────────────────────────────────────────────
Sustained throughput                     2.4M posts/hr
Peak throughput (headroom @ 52% cap.)    4.6M posts/hr
End-to-end latency p50                   1.4s
End-to-end latency p99                   2.1s
Language detection accuracy              99.3% (3-lang)
NER macro F1                             91.4%
Sentiment macro F1 (production model)    94.3%
Campaign detection precision             89.0%
Campaign detection recall                76.2%

Recall of 76.2% on campaign detection means we miss roughly 24% of coordinated campaigns. Most misses are in campaigns that deliberately vary content beyond our similarity threshold. Improving recall requires moving beyond text similarity to graph-based methods (account follow/retweet networks), which we track as future work.

For the infrastructure layer (Kafka configuration, TimescaleDB schema, GPU instance sizing, and cost breakdown) that runs this NLP pipeline: How we process 2.4M social-media posts per hour →

For how the social signal feeds Voidly's real-time measurement scheduler: The Voidly measurement scheduler: how we decide which domains to probe and when →

For Voidly's network-layer anomaly classifier that receives the injected measurement tasks: The Voidly Anomaly Classifier: five interference classes, gradient boosted trees →

Multilingual bot detection extends the NLP pipeline with language-stratified XGBoost training and per-language Platt scaling for bot classification across 14 languages.