NLP Pipeline for Real-Time Sentiment Analysis at Scale
Our OSINT platform processes 2.4 million social media posts per hour across 47 platforms. The infrastructure for ingestion is covered separately — this article focuses on the NLP layer: how we classify language, extract entities, score sentiment, and detect coordinated inauthentic behavior at 667 posts per second without sacrificing accuracy.
The NLP pipeline is one of the data sources that feeds Voidly's censorship measurement: when a government blocks a platform or domain, we see it first in the social signals (posts about connectivity failures, reports of access problems) before the network probes accumulate enough measurements to confirm an event with high confidence.
Processing requirements
Each post must pass through four stages within a 2-second latency budget from ingestion to stored result:
Stage                      Model            Latency   Accuracy target
─────────────────────────────────────────────────────────────────────
Language detection         FastText lid     3ms       ≥99.5% (top-1)
Named entity recognition   Custom SpaCy     12ms      ≥91% F1 (macro)
Sentiment classification   DistilBERT FT    45ms      ≥93% (3-class)
Coordinated behavior       MinHash + rules  2ms       ≥87% precision
─────────────────────────────────────────────────────────────────────
Total per post: ~62ms
GPU workers (80×): 667 posts/sec sustained
We process English, Spanish, and Chinese. Posts in other languages are stored raw for later batch processing. This covers 78% of the volume from our source platforms.
Language detection: FastText lid.176
FastText's language identification model (lid.176.bin, about 126MB; a compressed 917KB variant, lid.176.ftz, trades a little accuracy for size) handles 176 languages at 3ms per post on CPU. We run it on CPU rather than GPU because the model is too small to saturate a GPU batch, and CPU inference doesn't compete with the GPU-bound DistilBERT stage.
The model uses character n-gram features (1–5 grams) with a shallow neural network. Social media text is noisy (hashtags, mentions, mixed-script emoji) so we preprocess before detection:
import re

import fasttext

_lid_model = fasttext.load_model("lid.176.bin")

_NOISE_PATTERN = re.compile(
    r"https?://\S+"                # URLs
    r"|@\w+"                       # Mentions
    r"|#(?=\w)"                    # Hashtags (keep the word, strip #)
    r"|[\U00010000-\U0010ffff]"    # Emoji (outside BMP)
    r"|[^\w\s]"                    # Punctuation
)

def detect_language(text: str) -> tuple[str, float]:
    """Returns (language code, confidence)."""
    # Collapse all whitespace: fastText's predict() rejects text containing newlines.
    clean = " ".join(_NOISE_PATTERN.sub(" ", text).split())
    if len(clean) < 10:
        return "xx", 0.0  # Too short, skip language detection
    (lang,), (conf,) = _lid_model.predict(clean, k=1)
    # FastText returns '__label__zh' etc. Some labels have three letters
    # (e.g. 'ceb'), so the code is not truncated to two characters.
    code = lang.replace("__label__", "")
    return code, float(conf)

# Only English, Spanish, Chinese proceed to NER + sentiment
PROCESS_LANGS = {"en", "es", "zh"}

Removing URLs and mentions before detection is important. Without it, a Spanish post containing an English URL will often be misclassified as English. Keeping the hashtag word (stripping only the #) preserves language signal since hashtags are usually in the author's language.
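The effect of the preprocessing step can be seen in isolation, without loading fastText. The following is a minimal sketch: the pattern is a simplified variant of the one above, and `clean_for_lid` and the sample post are illustrative names and data, not part of the production code.

```python
import re

# Simplified noise pattern: URLs, mentions, the '#' of a hashtag, then any
# remaining punctuation or emoji. The hashtag word itself is kept.
NOISE = re.compile(
    r"https?://\S+"
    r"|@\w+"
    r"|#(?=\w)"
    r"|[^\w\s]"
)

def clean_for_lid(text: str) -> str:
    """Strip noise and collapse whitespace so only language-bearing words remain."""
    return " ".join(NOISE.sub(" ", text).split())

sample = "Mañana votamos! https://example.com/vote-now @maria #Elecciones2024 🔥"
cleaned = clean_for_lid(sample)
print(cleaned)  # Mañana votamos Elecciones2024
```

The English-looking URL and the mention are gone, while the Spanish hashtag word survives as language signal.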
Measured accuracy on a 10K sample of hand-labeled posts: 99.7% on English, 99.1% on Spanish, 98.8% on Chinese (zh covers both simplified and traditional). The confusion cases are all code-switching posts (Spanish/English mixing) where the correct label is genuinely ambiguous.
Named entity recognition: custom SpaCy model
Off-the-shelf SpaCy models (en_core_web_lg, etc.) are trained on news corpora and perform poorly on social media text: informal capitalization, abbreviations, and domain-specific entities (politician nicknames, organization abbreviations common in election discourse) are systematically missed.
We fine-tuned a custom NER model on 2.3 million labeled examples collected from prior election cycles (2016, 2018, 2020 US elections plus 2022 European elections). The label set:
Label    Description                    Examples
──────────────────────────────────────────────────────────────
PERSON   Political figures, officials   "Biden", "Kamala", "AOC"
ORG      Organizations, parties         "DNC", "MAGA", "DOJ"
GPE      Geopolitical entities          "Iowa", "DC", "Florida"
FAC      Facilities, venues             "Capitol", "Mar-a-Lago"
LAW      Legislation, legal actions     "SB 202", "Jan 6"
EVENT    Named political events         "Super Tuesday", "debate"
TOPIC    Issue domains                  "abortion", "immigration"
──────────────────────────────────────────────────────────────
Total labeled spans: 2.3M
Train / dev / test: 80% / 10% / 10%
Training used SpaCy's train command with the transition-based NER pipeline (ner component on top of a shared transformer representation). We started from the en_core_web_trf weights (RoBERTa-base backbone) and fine-tuned for 20 epochs with batch size 128:

# config.cfg (abbreviated)
[paths]
train = "corpus/train.spacy"
dev = "corpus/dev.spacy"

[nlp]
lang = "en"
pipeline = ["transformer", "ner"]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"

[components.ner]
factory = "ner"

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
max_steps = 20000
eval_frequency = 200
patience = 1600

[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.0001
beta1 = 0.9
beta2 = 0.999
Evaluation on the held-out test set (230K spans):
Label    Precision   Recall   F1
──────────────────────────────────
PERSON   94.2%       93.1%    93.6%
ORG      89.7%       88.4%    89.0%
GPE      96.1%       95.8%    95.9%
FAC      82.3%       79.6%    80.9%
LAW      88.9%       84.2%    86.5%
EVENT    85.4%       81.7%    83.5%
TOPIC    91.3%       90.1%    90.7%
──────────────────────────────────
Macro F1: 91.4%
FAC and EVENT have the lowest scores because they are the most context-dependent: "the Capitol" vs "a capitol building" requires world knowledge that the model doesn't always have. We accept this tradeoff — these labels are less important for downstream analysis than PERSON and ORG.
At inference time, SpaCy processes posts in batches of 64 with the GPU disabled (the RoBERTa backbone is fast enough on CPU for 12ms/post throughput, and GPU time is reserved for the sentiment model). We export to ONNX for CPU inference to get predictable latency. Each of the 80 workers is a separate process with its own interpreter, so workers do not contend for a shared Python GIL.
Sentiment classification: DistilBERT fine-tuning
DistilBERT was chosen over BERT-base for throughput reasons: it has 40% fewer parameters (66M vs 110M) and is 60% faster at inference while retaining 97% of BERT-base's performance on GLUE benchmarks. The sentiment task is three-class (positive / neutral / negative) plus a per-class confidence score.
Training dataset construction
We assembled 5 million labeled examples from three sources:
- 3.1M from TweetSentimentExtraction and similar public datasets, filtered to political topics
- 1.2M manually labeled tweets from 2020–2022 election monitoring (contracted human annotators via Scale AI)
- 0.7M from weak supervision: posts containing strong positive/negative indicator phrases (e.g., "voting is rigged" → negative, "democracy works" → positive), reviewed and filtered at 85% confidence threshold using a bootstrap model
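The weak-supervision source in the last bullet can be sketched as a phrase lookup followed by a confidence filter. This is a minimal illustration under stated assumptions: `INDICATOR_PHRASES`, `weak_label`, and the `confidence_of` callback (standing in for the bootstrap model) are hypothetical names, and only the two phrases and the 85% threshold come from the text.

```python
from typing import Callable, Optional

# Hypothetical phrase table; the two entries mirror the examples in the text.
INDICATOR_PHRASES = {
    "voting is rigged": "negative",
    "democracy works": "positive",
}

def weak_label(
    text: str,
    confidence_of: Callable[[str, str], float],
    threshold: float = 0.85,
) -> Optional[str]:
    """Return a weak label, or None if the post should be discarded.

    confidence_of(text, label) stands in for the bootstrap model's score
    for the proposed label (an assumption, not the production interface).
    """
    lowered = text.lower()
    matched = {label for phrase, label in INDICATOR_PHRASES.items() if phrase in lowered}
    if len(matched) != 1:
        return None  # no indicator phrase, or conflicting indicators
    (label,) = matched
    return label if confidence_of(text, label) >= threshold else None
```

Posts that match phrases from both polarities are dropped rather than guessed, which keeps the weakly labeled set cleaner at the cost of some volume.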
Class distribution after balancing (undersampling neutral, which was overrepresented): 33% positive, 34% neutral, 33% negative. Without balancing, the model collapses to predicting neutral for ambiguous cases.
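Undersampling of the kind described above can be sketched in a few lines. The function name `undersample` and the exact equalization to the smallest class are assumptions for illustration; the production split is 33/34/33 rather than perfectly equal, so the real procedure evidently differs in detail.

```python
import random
from collections import defaultdict

def undersample(examples, seed=0):
    """Downsample every class to the size of the smallest one.

    examples: list of (text, label) pairs. Returns a shuffled, balanced list.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced
```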
Fine-tuning configuration
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    TrainingArguments,
    Trainer,
)
import torch

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=3,
    id2label={0: "positive", 1: "neutral", 2: "negative"},
)

training_args = TrainingArguments(
    output_dir="./election-sentiment-v2",
    num_train_epochs=4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    warmup_steps=500,
    weight_decay=0.01,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    evaluation_strategy="steps",  # named eval_strategy in recent transformers releases
    eval_steps=500,
    save_steps=1000,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",  # compute_metrics must return a key named "f1_macro"
    fp16=True,  # A100 GPU, mixed precision
)

# Fine-tuning took 8 hours on 4× A100 (80GB)
# Final checkpoint: 252MB in FP32 (63MB after INT8 quantization)

Evaluation results
            Precision   Recall   F1
Positive    95.1%       93.8%    94.4%
Neutral     93.4%       95.2%    94.3%
Negative    94.9%       94.3%    94.6%
─────────────────────────────────────────
Macro F1: 94.7%

Evaluated on 500K held-out posts, labeled by human annotators.
Human-human agreement on same posts: 91.4% (3-way majority).
Model-human agreement: 94.7%; the model outperforms human inter-annotator agreement.
The model outperforming human inter-annotator agreement is not unusual for sentiment tasks — human annotators disagree on borderline cases, and the model has seen more labeled examples than any individual annotator. It means the model is consistent, not necessarily that it is "better" than humans.
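As a sanity check on the NER and sentiment tables, per-class F1 and its unweighted (macro) mean can be recomputed from the published precision and recall columns. The sketch below uses only those rounded values. Under that assumption the unweighted mean comes out near 88.6 for NER and 94.4 for sentiment, somewhat below the reported 91.4% and 94.7%. One possible explanation, which the text does not confirm, is that the reported aggregates are weighted by class frequency or computed before rounding.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def macro_f1(rows: dict) -> float:
    """Unweighted mean of per-class F1; rows maps label -> (precision, recall)."""
    scores = [f1(p, r) for p, r in rows.values()]
    return sum(scores) / len(scores)

# Precision / recall pairs copied from the two evaluation tables.
NER = {
    "PERSON": (94.2, 93.1), "ORG": (89.7, 88.4), "GPE": (96.1, 95.8),
    "FAC": (82.3, 79.6), "LAW": (88.9, 84.2), "EVENT": (85.4, 81.7),
    "TOPIC": (91.3, 90.1),
}
SENTIMENT = {
    "positive": (95.1, 93.8), "neutral": (93.4, 95.2), "negative": (94.9, 94.3),
}

ner_macro = macro_f1(NER)
sentiment_macro = macro_f1(SENTIMENT)
print(f"NER macro F1: {ner_macro:.1f}, sentiment macro F1: {sentiment_macro:.1f}")
```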
Production inference
The fine-tuned model is exported to ONNX and quantized to INT8 (dynamic quantization using the Optimum library). INT8 quantization reduces model size from 252MB to 63MB and cuts inference time from 45ms to 28ms on NVIDIA T4, with 0.4pp F1 degradation — acceptable for production.
from optimum.onnxruntime import ORTModelForSequenceClassification
from scipy.special import softmax
from transformers import AutoTokenizer

LABELS = ["positive", "neutral", "negative"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = ORTModelForSequenceClassification.from_pretrained(
    "./election-sentiment-v2-onnx",
    provider="CUDAExecutionProvider",
)

def classify_batch(texts: list[str]) -> list[dict]:
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,  # 128 vs 512: social media posts are short
        return_tensors="np",
    )
    logits = model(**inputs).logits  # (batch, 3)
    probs = softmax(logits, axis=-1)  # per-class confidence scores
    return [
        {
            "positive": float(p[0]),
            "neutral": float(p[1]),
            "negative": float(p[2]),
            "label": LABELS[int(p.argmax())],
        }
        for p in probs
    ]

We cap sequence length at 128 tokens instead of 512. Social media posts rarely exceed 128 tokens (Twitter's 280-character limit is ~70 tokens on average). This alone cuts inference time in half because self-attention cost grows quadratically with sequence length.
Coordinated campaign detection
Sentiment and entities are table stakes for social media monitoring. The operationally important signal is coordinated inauthentic behavior: bot networks, astroturfing campaigns, and state-sponsored influence operations posting near-identical content at high volume. We detect this with a two-stage system.
Stage 1: MinHash LSH content similarity
MinHash locality-sensitive hashing finds near-duplicate content in O(1) expected time per post by exploiting a property of Jaccard similarity: two documents with Jaccard similarity ≥ θ will hash to the same bucket in at least one of the b bands (each band covering r rows of the signature) with probability ≥ $1 - (1 - \theta^{r})^{b}$. We use 128 permutations with b=16 bands × r=8 rows, giving P(collision | similarity ≥ 0.85) ≥ $1 - (1 - 0.85^{8})^{16}$ ≈ 0.994.
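The banding formula is easy to evaluate directly. The sketch below (the function name is illustrative) computes the probability that two posts become LSH candidates at a given Jaccard similarity. For similarity 0.85 with 16 bands of 8 rows it evaluates to roughly 0.994.

```python
def lsh_candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """P(at least one of `bands` bands matches on all of its `rows` rows)."""
    return 1.0 - (1.0 - similarity ** rows) ** bands

# The S-curve for b=16, r=8: low similarities rarely collide, high ones almost always do.
for s in (0.50, 0.70, 0.85, 0.95):
    print(f"similarity={s:.2f}  P(candidate)={lsh_candidate_probability(s, 16, 8):.4f}")
```

The steepness of this curve is what the band/row split controls: more rows per band suppress low-similarity collisions, while more bands raise recall for pairs near the threshold.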
import re

from datasketch import MinHash, MinHashLSH

# params=(bands, rows) pins b=16, r=8. When params is given, datasketch skips its
# own band/row optimization and ignores threshold, which is kept here as documentation.
lsh = MinHashLSH(threshold=0.85, num_perm=128, params=(16, 8))

_NORMALIZE = re.compile(r"https?://\S+|@\w+|#")

def minhash_of(text: str) -> MinHash:
    """Character 4-grams for robustness against word substitution."""
    clean = _NORMALIZE.sub("", text.lower()).strip()
    m = MinHash(num_perm=128)
    for i in range(max(0, len(clean) - 3)):
        m.update(clean[i:i+4].encode("utf-8"))
    return m

def check_and_index(post) -> dict:
    m = minhash_of(post.text)
    similar_ids = lsh.query(m)
    # Index every post, including candidates, so clusters can keep growing
    # as later near-duplicates arrive.
    lsh.insert(post.id, m)
    if len(similar_ids) >= 3:
        # 3+ near-duplicates = candidate coordinated posting
        return {"campaign_candidate": True, "cluster_ids": similar_ids[:20]}
    return {"campaign_candidate": False}

We use character 4-grams rather than word unigrams. Word-level MinHash is easily defeated by campaigns that vary a few words per post (substituting synonyms or reordering sentences). Character n-grams are more robust because substitution still preserves most of the underlying n-gram set.
The LSH index is maintained per 60-minute window. After 60 minutes the window is checkpointed to TimescaleDB and a new in-memory LSH is started. This bounds memory usage at ~2GB at the sustained volume (2.4M posts/hour × ~800 bytes per MinHash signature ≈ 1.9GB).
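The window rotation can be sketched independently of the index type. In the following illustration, `RotatingIndex`, `factory`, and `on_rotate` are hypothetical names: `factory` would build a fresh `MinHashLSH` and `on_rotate` would write the finished window to TimescaleDB, neither of which is shown. The clock is injectable so the behavior can be tested without waiting an hour.

```python
import time
from typing import Callable, Generic, TypeVar

T = TypeVar("T")

class RotatingIndex(Generic[T]):
    """Holds one in-memory index per time window and swaps it when the window expires."""

    def __init__(
        self,
        factory: Callable[[], T],
        on_rotate: Callable[[T], None],
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._factory = factory
        self._on_rotate = on_rotate
        self._window = window_seconds
        self._clock = clock
        self._index = factory()
        self._started = clock()

    def current(self) -> T:
        """Return the live index, checkpointing and replacing it if the window has elapsed."""
        now = self._clock()
        if now - self._started >= self._window:
            self._on_rotate(self._index)   # hand the finished window to the checkpoint hook
            self._index = self._factory()  # start an empty index for the new window
            self._started = now
        return self._index
```

Rotation here happens lazily on access, so an idle worker holds its old window until the next post arrives. A timer-driven variant would be needed if checkpoints must land on exact hour boundaries.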
Stage 2: behavioral signals
Content similarity alone produces false positives on organic viral posts (a breaking news quote that many people share). Stage 2 applies account-behavioral features to filter candidates:
import math
from datetime import timedelta

# fetch_posts, fetch_author, and extract_url are helpers defined elsewhere.
# created_at values are assumed to be datetime objects.

def is_coordinated_campaign(cluster_ids: list[str]) -> bool:
    posts = fetch_posts(cluster_ids)
    if not posts:
        return False
    authors = [fetch_author(p.author_id) for p in posts]
    signals = []

    # Temporal: 5+ accounts posting within 90 seconds
    timestamps = sorted(p.created_at for p in posts)
    distinct_authors = {p.author_id for p in posts}
    if len(distinct_authors) >= 5 and timestamps[-1] - timestamps[0] < timedelta(seconds=90):
        signals.append("temporal_burst")

    # Account age: 60%+ of accounts created within 14 days of each other.
    # Slide a window of k accounts over the sorted creation times.
    created = sorted(a.created_at for a in authors)
    k = math.ceil(0.6 * len(created))
    if len(created) >= 5 and any(
        created[i + k - 1] - created[i] < timedelta(days=14)
        for i in range(len(created) - k + 1)
    ):
        signals.append("account_age_cluster")

    # Posting velocity: any author posting > 50 times/hour
    if any(a.posts_last_hour > 50 for a in authors):
        signals.append("high_velocity_poster")

    # Identical metadata: more than 3 posts that all share the same URL
    urls = [url for p in posts if (url := extract_url(p.text))]
    if len(set(urls)) == 1 and len(urls) > 3:
        signals.append("identical_url")

    # Flag if 2+ signals
    return len(signals) >= 2

In production, 89% of clusters that pass stage 1 content similarity are confirmed as coordinated by stage 2 behavioral filtering. The 11% false positive rate is dominated by: viral tweets being quote-tweeted with minimal editorial content, and news aggregation bots that legitimately post many similar items.
Pipeline integration
The four stages run sequentially within a single Python process per worker. Workers consume from Kafka in batches of 32 posts (batch size chosen to saturate GPU memory on T4 without exceeding latency budget).
class NLPWorker:
    def process_batch(self, posts: list[Post]) -> list[ProcessedPost]:
        # (post, language, entities) for every post in a supported language
        eligible = []
        for post in posts:
            # Stage 1: Language (CPU, 3ms)
            lang, lang_conf = detect_language(post.text)
            if lang not in PROCESS_LANGS:
                store_raw(post)  # Store without NLP for batch processing
                continue
            # Stage 2: NER (CPU/ONNX, 12ms)
            entities = ner_pipeline(post.text)
            eligible.append((post, lang, entities))

        if not eligible:
            return []

        # Stage 3: Sentiment (GPU/ONNX-CUDA)
        # Batch GPU inference for sentiment (all posts at once)
        texts = [post.text for post, _, _ in eligible]
        sentiments = classify_batch(texts)  # GPU, ~28ms for 32 posts

        results = []
        for (post, lang, entities), sentiment in zip(eligible, sentiments):
            # Stage 4: Campaign detection (CPU, 2ms)
            campaign_signal = check_and_index(post)
            results.append(ProcessedPost(
                post=post,
                language=lang,
                entities=entities,
                sentiment=sentiment,
                campaign_candidate=campaign_signal["campaign_candidate"],
            ))
        return results

Sentiment inference for all 32 posts in a batch takes ~28ms on the GPU, compared to 45ms × 32 = 1440ms if serialized. GPU batching is the single biggest throughput multiplier in the pipeline.
Connecting to Voidly censorship events
The NLP pipeline is one of Voidly's measurement signals. When users in a country begin posting about connectivity loss, the NER layer extracts the platform or domain name (e.g., "Telegram", "Twitter"), the TOPIC label captures "internet access" framing, and the sentiment score goes negative. A spike in this signal pattern (many posts about a platform from a specific GPE with negative sentiment) triggers a measurement task injection into the Voidly scheduler, which immediately queues that platform's domain for probing from all available vantage points in that country.
This integration means Voidly often detects censorship events within 4–8 minutes of them starting, before the scheduled measurement cycle would have caught them. The social signal is the canary; the network probe is the confirmation.
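The spike trigger described above can be sketched as a sliding-window counter keyed by country and platform. Everything specific in this example is an assumption for illustration: the class name `SignalSpikeDetector`, the 5-minute window, and the 50-post threshold are placeholders, and the actual trigger logic and scheduler interface are not described in this article.

```python
from collections import defaultdict, deque

class SignalSpikeDetector:
    """Counts negative-sentiment posts per (country, platform) in a sliding time window."""

    def __init__(self, window_seconds: float = 300.0, min_posts: int = 50) -> None:
        # Both defaults are placeholders, not values from the article.
        self.window_seconds = window_seconds
        self.min_posts = min_posts
        self._events = defaultdict(deque)  # (country, platform) -> post timestamps

    def observe(self, country: str, platform: str, sentiment_label: str, timestamp: float) -> bool:
        """Record one post; return True when the windowed count reaches min_posts.

        Timestamps are assumed to arrive in non-decreasing order per key.
        """
        if sentiment_label != "negative":
            return False
        events = self._events[(country, platform)]
        events.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.min_posts
```

A caller would feed it the GPE and platform entities plus the sentiment label of each processed post, and enqueue a probe task whenever `observe` returns True.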
Performance summary
Metric                                   Value
──────────────────────────────────────────────────────
Sustained throughput                     2.4M posts/hr
Peak throughput (headroom @ 52% cap.)    4.6M posts/hr
End-to-end latency p50                   1.4s
End-to-end latency p99                   2.1s
Language detection accuracy              99.3% (3-lang)
NER macro F1                             91.4%
Sentiment macro F1 (production model)    94.3%
Campaign detection precision             89.0%
Campaign detection recall                76.2%
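The throughput rows can be cross-checked against the per-stage latencies with a few lines of arithmetic. The sketch assumes each of the 80 workers handles one post at a time at ~62ms per post (3 + 12 + 45 + 2), which ignores batching and queueing and is a simplification of the real pipeline.

```python
WORKERS = 80
PER_POST_MS = 62                 # sum of the four stage latencies
SUSTAINED_PER_HOUR = 2_400_000   # posts per hour from the summary table

sustained_per_sec = SUSTAINED_PER_HOUR / 3600
capacity_per_sec = WORKERS * 1000 / PER_POST_MS
capacity_per_hour = capacity_per_sec * 3600
utilization = sustained_per_sec / capacity_per_sec

print(f"sustained: {sustained_per_sec:.0f} posts/sec")
print(f"capacity:  {capacity_per_sec:.0f} posts/sec ({capacity_per_hour / 1e6:.1f}M posts/hr)")
print(f"utilization at sustained load: {utilization:.0%}")
```

Under this simplified model the sustained load sits at about half of theoretical capacity, which lines up with the 52% and 4.6M posts/hr figures in the table.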
Recall of 76.2% on campaign detection means we miss roughly 24% of coordinated campaigns. Most misses are in campaigns that deliberately vary content beyond our similarity threshold. Improving recall requires moving beyond text similarity to graph-based methods (account follow/retweet networks), which we track as future work.
For the infrastructure layer (Kafka configuration, TimescaleDB schema, GPU instance sizing, and cost breakdown) that runs this NLP pipeline: How we process 2.4M social-media posts per hour →
For how the social signal feeds Voidly's real-time measurement scheduler: The Voidly measurement scheduler: how we decide which domains to probe and when →
For Voidly's network-layer anomaly classifier that receives the injected measurement tasks: The Voidly Anomaly Classifier: five interference classes, gradient boosted trees →
Multilingual bot detection extends the NLP pipeline with language-stratified XGBoost training and per-language Platt scaling for bot classification across 14 languages.