Bot detection across 14 languages: language-invariant behavioral features and cross-platform sockpuppet fingerprinting
Election-related disinformation campaigns operate across Telegram, Twitter/X, Facebook, and Bluesky simultaneously — with the same coordinated network posting in Arabic, Russian, Persian, Spanish, and English depending on which audience they're targeting. A bot detector that relies on English-language lexical features (sentiment word lists, political keyword bags) will miss the full campaign scope and systematically undercount non-English actors.
Our multilingual bot detector uses eight behavioral features that are language-invariant: they measure how accounts post, not what they post. Language is used only as a grouping variable for per-language threshold calibration, not as an input feature. The result is a single XGBoost classifier that achieves F1 0.883–0.908 across 14 languages in the 2024 election monitoring corpus.
The eight language-invariant features
from dataclasses import dataclass
from typing import Optional


@dataclass
class BotFeatureVector:
    account_id: str
    language_code: str  # ISO 639-1; used only for threshold calibration

    # Feature 1: Posting interval entropy
    # Low entropy = machine-timed posts; high entropy = human irregular timing
    # Computed over the last 500 posts, with inter-post gaps binned into 60 one-minute bins
    posting_interval_entropy: float  # bits; range 0-5.91 (log2(60), the 60-bin maximum)

    # Feature 2: Reply graph outdegree ratio
    # Bots initiate interactions to visible accounts; humans reply to conversations
    # outdegree = mentions_sent / (mentions_sent + replies_received)
    reply_outdegree_ratio: float  # 0.0-1.0

    # Feature 3: Content hash cluster density
    # Fraction of posts within Hamming distance 3 of another post by a different account
    # High density = copied/templated content across coordinated accounts
    content_cluster_density: float  # 0.0-1.0

    # Feature 4: Account age-activity velocity
    # (total_posts / account_age_days); normalized by platform median
    # Very high velocity in early account life = bot signup burst
    age_velocity_zscore: float  # z-score relative to platform median

    # Feature 5: Quote-to-original ratio
    # Amplifier bots have high quote/retweet ratio; content farms have low
    # Combined with feature 4 to distinguish amplifiers from content generators
    quote_to_original_ratio: float  # 0.0-inf; capped at 20.0

    # Feature 6: URL recycling rate
    # Fraction of posts containing a URL that was posted by another account
    # within the same 15-minute window; coordinated link drops show high rate
    url_recycling_rate: float  # 0.0-1.0

    # Feature 7: Cross-platform timing correlation
    # Pearson r between this account's posting times and the same content
    # appearing on a different platform (0.0 if no cross-platform match found)
    cross_platform_correlation: float  # -1.0 to 1.0; 0.0 = not found

    # Feature 8: Bio change cadence
    # Number of profile bio changes in the last 90 days
    # Coordinated networks often update bios in synchronized bursts
    bio_change_count_90d: int  # raw count; not normalized

Posting interval entropy
The most discriminating single feature is posting interval entropy: how evenly the gaps between an account's consecutive posts are spread across one-minute bins. Human users post with irregular timing driven by daily life (meals, sleep, commutes), so their gaps land in many different bins. Bots post at fixed intervals, narrowly jittered intervals, or in precise bursts triggered by an external scheduler, so their gaps pile up in a few bins.
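To make the contrast concrete, here is a toy sketch of the idea, not the production function. It computes an unsmoothed Shannon entropy over inter-post gaps in 60-second bins. The function name `interval_entropy_bits` and both timestamp lists are made up for illustration.

```python
import math
from collections import Counter


def interval_entropy_bits(timestamps, bin_seconds=60):
    """Unsmoothed Shannon entropy (bits) of inter-post gaps in fixed-width bins."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    counts = Counter(int(gap // bin_seconds) for gap in gaps)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical accounts: one posts every 5 minutes, one posts irregularly.
scheduled = [i * 300.0 for i in range(20)]
irregular = [0, 40, 400, 1900, 2000, 2900, 3000, 4500, 4700, 7000]

scheduled_entropy = interval_entropy_bits(scheduled)  # every gap falls in one bin
irregular_entropy = interval_entropy_bits(irregular)  # gaps spread over several bins

print(f"scheduled: {scheduled_entropy:.2f} bits")
print(f"irregular: {irregular_entropy:.2f} bits")
```

The scheduled account puts every gap in the same bin, so its entropy is zero; the irregular account spreads its gaps over several bins and scores higher. Absolute values from this toy version are not comparable to the calibrated thresholds, which depend on the smoothing and binning used in production.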
import numpy as np
from typing import Sequence


def compute_posting_interval_entropy(
    timestamps: Sequence[float],
    n_bins: int = 60,
) -> float:
    """
    Compute Shannon entropy of the posting interval distribution.

    timestamps: Unix timestamps of posts, sorted ascending.
    n_bins: number of histogram bins (default: 60 = 1-minute bins over 1 hour).

    Returns entropy in bits, or NaN when there are too few posts to estimate it.
    The upper bound is log2(n_bins) (intervals uniform across all bins).
    Because of the add-1 smoothing, identical intervals yield a small
    positive value rather than exactly 0.
    """
    if len(timestamps) < 10:
        return float('nan')
    # Seconds between consecutive posts
    intervals = np.diff(np.asarray(timestamps, dtype=float))
    # Drop gaps longer than 1 hour so account dormancy periods are excluded
    intervals = intervals[intervals <= 3600]
    if len(intervals) < 5:
        return float('nan')
    counts, _ = np.histogram(intervals, bins=n_bins, range=(0, 3600))
    # Add-1 smoothing to avoid log(0)
    counts = counts + 1.0
    probs = counts / counts.sum()
    entropy = -np.sum(probs * np.log2(probs))
    # Guard against floating-point overshoot past the uniform-distribution maximum
    max_entropy = np.log2(n_bins)
    return float(min(entropy, max_entropy))


# Empirical thresholds (calibrated on 120K labeled accounts):
# entropy < 2.1 bits: HIGH bot probability (automated scheduler)
# entropy 2.1-4.8 bits: AMBIGUOUS (jitter-based bot or low-activity human)
# entropy > 4.8 bits: LOW bot probability (human irregular pattern)

Cross-platform perceptual hash fingerprinting
Sockpuppet networks often maintain linked accounts across platforms — a Telegram channel that coordinates posting strategy, and Twitter/X + Bluesky accounts that execute it. Cross-platform linkage is detectable even when accounts use different usernames, because profile photos are frequently reused.
We compute a perceptual hash (pHash) for every scraped profile photo and store it in a Redis hash keyed by the first 16 bits of the pHash. At query time, candidates are read from the query's bucket and its numerically adjacent buckets, then filtered on the full 64-bit hash, keeping those within Hamming distance 8. A Hamming distance ≤ 8 means at least 56 of the 64 bits agree (87.5% bit-level similarity), which is sufficient to match the same photo after JPEG recompression and minor resizing while rejecting false positives from stock photo reuse (which clusters at Hamming 15+).
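The threshold arithmetic can be checked with a few lines of standard-library Python. This is a minimal sketch: the helper names and the hex strings are hypothetical stand-ins, not real pHashes from the corpus.

```python
def hamming_distance_hex(a: str, b: str) -> int:
    """Count differing bits between two equal-length hex strings."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def bit_similarity(distance: int, n_bits: int = 64) -> float:
    """Fraction of bits that agree for a given Hamming distance."""
    return 1.0 - distance / n_bits


# Hypothetical 64-bit hashes (16 hex chars each).
original = "c3a5f00f9966aa55"
recompressed = "c3a5f00f9966aaaa"  # last byte inverted: 8 bits differ
high_bit_flip = "43a5f00f9966aa55"  # only the top bit of the prefix differs

distance = hamming_distance_hex(original, recompressed)
similarity = bit_similarity(distance)

# Bucket keys are the first 4 hex chars; compare them numerically.
bucket_gap = abs(int(original[:4], 16) - int(high_bit_flip[:4], 16))

print(distance, similarity)
print(hamming_distance_hex(original, high_bit_flip), bucket_gap)
```

One property worth noting: the third hash is only one bit away from the first, yet its 16-bit prefix is numerically far from the original's. A lookup that scans only neighboring bucket numbers would therefore not surface it, so prefix bucketing trades some recall for speed.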
import io

import imagehash
import redis
from PIL import Image

REDIS_PREFIX = 'phash:'
HAMMING_THRESHOLD = 8


def compute_profile_phash(image_bytes: bytes) -> str:
    """Compute 64-bit perceptual hash of a profile photo."""
    img = Image.open(io.BytesIO(image_bytes)).convert('RGB')
    # Resize to 32x32 for consistent hash regardless of source resolution
    img = img.resize((32, 32), Image.LANCZOS)
    return str(imagehash.phash(img, hash_size=8))  # 64-bit pHash as 16 hex chars


def store_phash(r: redis.Redis, account_id: str, phash_str: str) -> None:
    """Store pHash indexed by its first 16 bits (bucket key)."""
    bucket = phash_str[:4]  # first 4 hex chars = 16 bits
    r.hset(f'{REDIS_PREFIX}{bucket}', account_id, phash_str)


def find_phash_matches(
    r: redis.Redis,
    query_phash: str,
    source_account_id: str,
) -> list[tuple[str, int]]:
    """
    Find accounts with similar profile photos across all platforms.
    Returns [(account_id, hamming_distance)] sorted ascending.
    """
    query_hash = imagehash.hex_to_hash(query_phash)
    matches = []
    # Check the query bucket and its numeric neighbors (+/-2). This only covers
    # small numeric differences in the 16-bit prefix (for example a flip in one
    # of its two lowest bits); a flip in a higher prefix bit lands in a distant
    # bucket and is not retrieved here.
    query_bucket_int = int(query_phash[:4], 16)
    buckets_to_check = set()
    for delta in range(-2, 3):
        bucket = format(max(0, min(0xFFFF, query_bucket_int + delta)), '04x')
        buckets_to_check.add(bucket)
    for bucket in buckets_to_check:
        bucket_entries = r.hgetall(f'{REDIS_PREFIX}{bucket}')
        for raw_id, raw_phash in bucket_entries.items():
            # Redis returns bytes unless decode_responses=True. Decode before
            # comparing: a bytes-vs-str comparison is always False, so the
            # self-match check would never fire.
            acct_id = raw_id.decode() if isinstance(raw_id, bytes) else raw_id
            stored_phash = raw_phash.decode() if isinstance(raw_phash, bytes) else raw_phash
            if acct_id == source_account_id:
                continue
            distance = query_hash - imagehash.hex_to_hash(stored_phash)
            if distance <= HAMMING_THRESHOLD:
                matches.append((acct_id, int(distance)))
    return sorted(matches, key=lambda x: x[1])

XGBoost with language as grouping variable
The eight features are fed to a single XGBoost binary classifier (is_bot: bool). Language is not an input feature — instead, we use it as a grouping variable for StratifiedGroupKFold cross-validation and for computing per-language decision thresholds after training.
The motivation: some behavioral features have different baseline distributions by language. Arabic-language accounts tend to have higher quote-to-original ratios (a cultural norm around sharing) and Korean-language accounts have different dormancy patterns. Training on a pooled multilingual corpus without language control would produce a classifier that conflates cultural behavioral norms with bot behavior. Per-language Platt scaling on the classifier's raw score output corrects for this without requiring separate per-language models.
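At inference time, per-language Platt scaling amounts to passing the classifier score through a language-specific sigmoid. The sketch below shows that step in isolation. The slope and intercept values are invented for illustration, and the fallback for languages without a calibrator is an assumption, since the source does not say how those are handled.

```python
import math

# Hypothetical (slope, intercept) pairs. With scikit-learn's LogisticRegression
# these would come from each fitted calibrator's coef_[0][0] and intercept_[0].
PLATT_PARAMS = {
    "en": (6.0, -3.0),
    "ar": (6.0, -3.6),
}


def platt_probability(raw_score: float, slope: float, intercept: float) -> float:
    """Map a classifier score to a calibrated probability with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-(slope * raw_score + intercept)))


def calibrated_bot_probability(raw_score: float, language_code: str) -> float:
    """Apply the calibrator for this language, if one exists."""
    params = PLATT_PARAMS.get(language_code)
    if params is None:
        # Assumption: languages without a calibrator keep the raw score.
        return raw_score
    slope, intercept = params
    return platt_probability(raw_score, slope, intercept)


score = 0.6
p_en = calibrated_bot_probability(score, "en")
p_ar = calibrated_bot_probability(score, "ar")  # same score, lower probability
p_xx = calibrated_bot_probability(score, "xx")  # no calibrator: unchanged

print(f"en={p_en:.3f} ar={p_ar:.3f} xx={p_xx:.3f}")
```

With these made-up parameters, the same raw score maps to a lower bot probability for Arabic than for English. That is the mechanism by which a higher cultural baseline for amplification can be absorbed without retraining the base model.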
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd

XGB_PARAMS = {
    'n_estimators': 600,
    'max_depth': 5,
    'learning_rate': 0.04,
    'subsample': 0.8,
    'colsample_bytree': 0.75,
    'min_child_weight': 3,
    'scale_pos_weight': 4.2,  # bot:human ratio ~1:4.2 in training set
    'eval_metric': 'aucpr',
    'early_stopping_rounds': 40,
    'random_state': 42,
}

FEATURE_COLS = [
    'posting_interval_entropy',
    'reply_outdegree_ratio',
    'content_cluster_density',
    'age_velocity_zscore',
    'quote_to_original_ratio',
    'url_recycling_rate',
    'cross_platform_correlation',
    'bio_change_count_90d',
]


def train_multilingual_bot_classifier(
    df: pd.DataFrame,  # must have FEATURE_COLS + 'is_bot' + 'language_code'
) -> tuple[XGBClassifier, dict[str, LogisticRegression]]:
    """
    Train XGBoost on pooled corpus; calibrate per language.
    Returns (base_classifier, per_language_calibrators).
    """
    X = df[FEATURE_COLS].values
    y = df['is_bot'].values.astype(int)
    groups = df['language_code'].values
    # Grouping by language means each fold validates on languages unseen in training
    cv = StratifiedGroupKFold(n_splits=5)
    oof_scores = np.zeros(len(df))
    best_iterations = []
    for train_idx, val_idx in cv.split(X, y, groups):
        X_tr, X_val = X[train_idx], X[val_idx]
        y_tr, y_val = y[train_idx], y[val_idx]
        fold_clf = XGBClassifier(**XGB_PARAMS)
        fold_clf.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
        oof_scores[val_idx] = fold_clf.predict_proba(X_val)[:, 1]
        best_iterations.append(fold_clf.best_iteration + 1)

    # Fit final model on full training set. No validation set is available
    # here, and XGBoost raises if early stopping is enabled without one, so
    # early stopping is dropped and the tree count is taken from the CV folds
    # (mean best iteration; one reasonable choice among several).
    final_params = {k: v for k, v in XGB_PARAMS.items() if k != 'early_stopping_rounds'}
    final_params['n_estimators'] = max(1, int(np.mean(best_iterations)))
    clf = XGBClassifier(**final_params)
    clf.fit(X, y, verbose=False)

    # Per-language Platt scaling calibration on OOF predictions
    per_lang_calibrators: dict[str, LogisticRegression] = {}
    for lang in np.unique(groups):
        mask = groups == lang
        # Skip languages with too few samples or only one class present
        if mask.sum() < 50 or len(np.unique(y[mask])) < 2:
            continue
        lr = LogisticRegression(C=1.0, solver='lbfgs')
        lr.fit(oof_scores[mask].reshape(-1, 1), y[mask])
        per_lang_calibrators[lang] = lr
    return clf, per_lang_calibrators

Per-language F1 results
Evaluation on the held-out test set (20% stratified split, 88,400 accounts across 14 languages, 22% bot prevalence):
| Language | Test accounts | Precision | Recall | F1 |
|---|---|---|---|---|
| English (en) | 24,100 | 0.912 | 0.904 | 0.908 |
| Arabic (ar) | 9,800 | 0.891 | 0.877 | 0.884 |
| Russian (ru) | 8,400 | 0.902 | 0.896 | 0.899 |
| Spanish (es) | 7,600 | 0.898 | 0.888 | 0.893 |
| Persian (fa) | 6,200 | 0.879 | 0.887 | 0.883 |
| Chinese (zh) | 5,800 | 0.907 | 0.891 | 0.899 |
| French (fr) | 4,400 | 0.894 | 0.882 | 0.888 |
| 7 others (ko, tr, de, pt, hi, id, vi) | 22,100 | 0.889 | 0.877 | 0.883 |
The narrow performance gap across languages (F1 range: 0.883–0.908) indicates that behavioral features generalize across languages without per-language feature engineering. The lowest scores are Persian and the pooled seven-language group (both F1 0.883), followed by Arabic (0.884). Persian has the lowest precision in the table (0.879), and Arabic shares the lowest recall (0.877). We attribute the Persian and Arabic results primarily to their higher baseline quote-to-original ratios, which require more conservative per-language thresholds to avoid false positives on legitimate cultural amplification behavior.
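As a sanity check, the F1 column can be recomputed from the precision and recall columns with the harmonic-mean formula F1 = 2PR / (P + R). The sketch below copies the table values into a dictionary; the dictionary layout and function name are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
ROWS = {
    "en": (0.912, 0.904, 0.908),
    "ar": (0.891, 0.877, 0.884),
    "ru": (0.902, 0.896, 0.899),
    "es": (0.898, 0.888, 0.893),
    "fa": (0.879, 0.887, 0.883),
    "zh": (0.907, 0.891, 0.899),
    "fr": (0.894, 0.882, 0.888),
    "7 others": (0.889, 0.877, 0.883),
}

recomputed = {lang: round(f1_score(p, r), 3) for lang, (p, r, _) in ROWS.items()}
mismatches = [lang for lang, (_, _, reported) in ROWS.items() if recomputed[lang] != reported]
f1_spread = round(max(recomputed.values()) - min(recomputed.values()), 3)

print(recomputed)
print("mismatches:", mismatches, "spread:", f1_spread)
```

Every recomputed value matches the reported F1 to three decimals, and the spread between the best and worst rows is 0.025.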
Throughput
Feature extraction runs as a Kafka consumer group with 12 workers, each processing a slice of the incoming post stream. The eight features can be computed at approximately 200,000 accounts per minute using Python with NumPy vectorization. Our ingestion rate of 2.4M posts/hour works out to 40,000 posts per minute, so extraction keeps up even if every incoming post triggers a feature refresh for its account. XGBoost inference takes 0.4 ms per batch of 128 accounts on CPU, making inference latency negligible compared to feature extraction I/O.
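The headroom implied by these figures is simple arithmetic, sketched here as a back-of-the-envelope calculation. It assumes the worst case where each post triggers one account-level feature refresh; the constant names are illustrative.

```python
# Figures quoted in the text.
POSTS_PER_HOUR = 2_400_000
FEATURE_ACCOUNTS_PER_MIN = 200_000
BATCH_SIZE = 128
BATCH_LATENCY_US = 400  # 0.4 ms expressed in microseconds

# Worst case: every post forces one feature refresh for its account.
posts_per_min = POSTS_PER_HOUR / 60
extraction_headroom = FEATURE_ACCOUNTS_PER_MIN / posts_per_min

# Single-threaded CPU inference rate implied by the batch latency.
inference_accounts_per_sec = BATCH_SIZE * 1_000_000 / BATCH_LATENCY_US

print(f"posts/min: {posts_per_min:,.0f}")
print(f"extraction headroom: {extraction_headroom:.1f}x")
print(f"inference: {inference_accounts_per_sec:,.0f} accounts/s")
```

Under that assumption extraction has roughly fivefold headroom over the ingest rate, and inference capacity is far higher again, which is consistent with feature extraction I/O being the bottleneck.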
For the coordinated campaign detection methodology that uses these bot labels: Coordinated campaign detection: identifying inauthentic networks in election data →
For the NLP pipeline that processes the 2.4M posts/hour stream these accounts produce: NLP at 2.4M posts/hour: the social media analysis pipeline →