
Bot detection across 14 languages: language-invariant behavioral features and cross-platform sockpuppet fingerprinting

13 min read · AI Analytics
Tags: OSINT, NLP, ML, Elections

Election-related disinformation campaigns operate across Telegram, Twitter/X, Facebook, and Bluesky simultaneously — with the same coordinated network posting in Arabic, Russian, Persian, Spanish, and English depending on which audience they're targeting. A bot detector that relies on English-language lexical features (sentiment word lists, political keyword bags) will miss the full campaign scope and systematically undercount non-English actors.

Our multilingual bot detector uses eight language-invariant behavioral features: they measure how accounts post, not what they post. Language is used only as a grouping variable for per-language threshold calibration, not as an input feature. The result is a single XGBoost classifier that achieves F1 0.883–0.908 across 14 languages in the 2024 election monitoring corpus.

The eight language-invariant features

from dataclasses import dataclass
from typing import Optional

@dataclass
class BotFeatureVector:
    account_id: str
    language_code: str       # ISO 639-1; used only for threshold calibration

    # Feature 1: Posting interval entropy
    # Low entropy = machine-timed posts; high entropy = human irregular timing
    # Computed over the last 500 posts; gaps are binned into 60 one-minute bins
    posting_interval_entropy: float      # bits; at most log2(60) ≈ 5.91 for 60 bins

    # Feature 2: Reply graph outdegree ratio
    # Bots initiate interactions to visible accounts; humans reply to conversations
    # outdegree = mentions_sent / (mentions_sent + replies_received)
    reply_outdegree_ratio: float         # 0.0-1.0

    # Feature 3: Content hash cluster density
    # Fraction of posts within Hamming distance 3 of another post by a different account
    # High density = copied/templated content across coordinated accounts
    content_cluster_density: float      # 0.0-1.0

    # Feature 4: Account age-activity velocity
    # (total_posts / account_age_days); normalized by platform median
    # Very high velocity in early account life = bot signup burst
    age_velocity_zscore: float           # z-score relative to platform median

    # Feature 5: Quote-to-original ratio
    # Amplifier bots have high quote/retweet ratio; content farms have low
    # Combined with feature 4 to distinguish amplifiers from content generators
    quote_to_original_ratio: float      # 0.0-inf; capped at 20.0

    # Feature 6: URL recycling rate
    # Fraction of posts containing a URL that was posted by another account
    # within the same 15-minute window; coordinated link drops show high rate
    url_recycling_rate: float           # 0.0-1.0

    # Feature 7: Cross-platform timing correlation
    # Pearson r between this account's posting times and the same content
    # appearing on a different platform (0.0 if no cross-platform match found)
    cross_platform_correlation: float   # -1.0 to 1.0; 0.0 = not found

    # Feature 8: Bio change cadence
    # Number of profile bio changes in the last 90 days
    # Coordinated networks often update bios in synchronized bursts
    bio_change_count_90d: int           # raw count; not normalized
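
Several of these features are plain ratios, so a short sketch may help pin down the edge cases. The following is a minimal illustration of features 2, 5, and 6, not the production extractor. The `Post` record, the function names, the handling of zero denominators, and the use of fixed (non-sliding) 15-minute windows are all assumptions made for this example; the feature definitions above leave those details open.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Optional

QUOTE_RATIO_CAP = 20.0        # cap stated for feature 5
WINDOW_SECONDS = 15 * 60      # 15-minute window stated for feature 6


class Post(NamedTuple):       # hypothetical record type for this sketch
    account_id: str
    timestamp: float          # Unix seconds
    url: Optional[str]        # None when the post carries no link


def reply_outdegree_ratio(mentions_sent: int, replies_received: int) -> float:
    total = mentions_sent + replies_received
    # Assumption: an account with no interactions at all scores 0.0
    return mentions_sent / total if total else 0.0


def quote_to_original_ratio(quote_posts: int, original_posts: int) -> float:
    if original_posts == 0:
        # Assumption: quotes with zero originals hit the cap; no posts gives 0.0
        return QUOTE_RATIO_CAP if quote_posts else 0.0
    return min(quote_posts / original_posts, QUOTE_RATIO_CAP)


def url_recycling_rates(posts: Iterable[Post]) -> dict[str, float]:
    """Per account: share of its posts whose URL another account also
    posted in the same fixed 15-minute window."""
    posts = list(posts)
    # Map (url, window index) -> accounts that posted that URL in that window
    posters = defaultdict(set)
    for p in posts:
        if p.url is not None:
            posters[(p.url, int(p.timestamp // WINDOW_SECONDS))].add(p.account_id)

    total = defaultdict(int)
    recycled = defaultdict(int)
    for p in posts:
        total[p.account_id] += 1
        if p.url is None:
            continue
        key = (p.url, int(p.timestamp // WINDOW_SECONDS))
        if posters[key] - {p.account_id}:   # someone else posted it too
            recycled[p.account_id] += 1
    return {acct: recycled[acct] / total[acct] for acct in total}
```

Fixed windows are cheap to index but split link drops that straddle a window boundary; a sliding window would catch those at higher cost.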

Posting interval entropy

The most discriminating single feature is posting interval entropy: how evenly the gaps between an account's consecutive posts are spread across the range from zero to one hour. Human users post with irregular timing driven by daily life (meals, sleep, commutes), so their gaps scatter across many bins. Bots post at fixed intervals, jittered intervals, or in precise bursts triggered by an external scheduler, so their gaps pile up in a few bins.

import numpy as np
from typing import Sequence

def compute_posting_interval_entropy(
    timestamps: Sequence[float],
    n_bins: int = 60,
) -> float:
    """
    Compute Shannon entropy of the posting interval distribution.

    timestamps: Unix timestamps of posts, sorted ascending.
    n_bins: number of histogram bins (default: 60 = 1-minute bins over 1 hour).

    Returns entropy in bits, at most log2(n_bins) (intervals spread uniformly
    across all bins). Add-1 smoothing keeps the result above 0 even when
    every interval lands in a single bin.
    """
    if len(timestamps) < 10:
        return float('nan')

    intervals = np.diff(timestamps)           # seconds between consecutive posts
    # Drop gaps longer than 1 hour (3600 s) so dormancy periods are excluded
    intervals = intervals[intervals <= 3600]
    if len(intervals) < 5:
        return float('nan')

    counts, _ = np.histogram(intervals, bins=n_bins, range=(0, 3600))

    # Add-1 smoothing to avoid log(0)
    counts = counts + 1.0
    probs = counts / counts.sum()
    entropy = -np.sum(probs * np.log2(probs))

    # Cap at the maximum possible entropy, log2(n_bins), reached by a uniform
    # distribution; this only guards against floating-point overshoot
    max_entropy = np.log2(n_bins)
    return float(min(entropy, max_entropy))

# Empirical thresholds (calibrated on 120K labeled accounts):
# entropy < 2.1 bits: HIGH bot probability (automated scheduler)
# entropy 2.1-4.8 bits: AMBIGUOUS (jitter-based bot or low-activity human)
# entropy > 4.8 bits: LOW bot probability (human irregular pattern)
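
Because the function applies add-1 smoothing over all 60 bins, the lowest value it can return depends on how many intervals the account contributes. The sketch below computes that floor for a perfectly regular poster whose gaps all land in one bin. `smoothed_entropy_floor` is a helper introduced here for illustration only; it mirrors the smoothing step above and is not part of the pipeline.

```python
import numpy as np


def smoothed_entropy_floor(n_intervals: int, n_bins: int = 60) -> float:
    """Entropy in bits when every interval falls into one bin, using the
    same add-1 smoothing as compute_posting_interval_entropy."""
    counts = np.ones(n_bins)       # the +1 added to every bin
    counts[0] += n_intervals       # all observed intervals in a single bin
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))


for n in (9, 49, 499):
    print(n, round(smoothed_entropy_floor(n), 2))
```

With 499 intervals (a full 500-post history) the floor is about 1.1 bits, inside the HIGH band. With only 9 intervals it is about 5.6 bits, above the 4.8-bit LOW cutoff, so the thresholds are only meaningful for accounts with long enough histories.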

Cross-platform perceptual hash fingerprinting

Sockpuppet networks often maintain linked accounts across platforms — a Telegram channel that coordinates posting strategy, and Twitter/X + Bluesky accounts that execute it. Cross-platform linkage is detectable even when accounts use different usernames, because profile photos are frequently reused.

We compute a perceptual hash (pHash) for every scraped profile photo and store it in a Redis hash keyed by the first 16 bits of the pHash. At query time, candidates are pulled from the matching buckets and compared against the full 64-bit hash; those within Hamming distance 8 count as matches. A Hamming distance ≤ 8 means at least 87.5% of the 64 bits agree (56 of 64), which is sufficient to match the same photo after JPEG recompression and minor resizing while rejecting false positives from stock photo reuse (which clusters at Hamming 15+).
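
The distance-to-similarity arithmetic can be checked without any imaging library. This is a small sketch on hex-encoded hashes; the helper names `hamming_distance` and `bit_similarity` are introduced here for illustration.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two equal-length hex strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def bit_similarity(hex_a: str, hex_b: str) -> float:
    """Fraction of bits that agree (each hex character carries 4 bits)."""
    n_bits = 4 * len(hex_a)
    return 1.0 - hamming_distance(hex_a, hex_b) / n_bits


a = "ffffffffffffffff"
b = "ffffffffffffff00"   # last byte flipped: 8 differing bits
print(hamming_distance(a, b), bit_similarity(a, b))   # 8 0.875
```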

import imagehash
from PIL import Image
import io
import redis

REDIS_PREFIX = 'phash:'
HAMMING_THRESHOLD = 8

def compute_profile_phash(image_bytes: bytes) -> str:
    """Compute 64-bit perceptual hash of a profile photo."""
    img = Image.open(io.BytesIO(image_bytes)).convert('RGB')
    # Resize to 32x32 for consistent hash regardless of source resolution
    img = img.resize((32, 32), Image.LANCZOS)
    return str(imagehash.phash(img, hash_size=8))  # 64-bit pHash

def store_phash(r: redis.Redis, account_id: str, phash_str: str) -> None:
    """Store pHash indexed by its first 16 bits (bucket key)."""
    bucket = phash_str[:4]  # first 4 hex chars = 16 bits
    r.hset(f'{REDIS_PREFIX}{bucket}', account_id, phash_str)

def find_phash_matches(
    r: redis.Redis,
    query_phash: str,
    source_account_id: str,
) -> list[tuple[str, int]]:
    """
    Find accounts with similar profile photos across all platforms.
    Returns [(account_id, hamming_distance)] sorted ascending.
    """
    query_hash = imagehash.hex_to_hash(query_phash)
    matches = []

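    # Check the query bucket plus every bucket whose 16-bit prefix differs from
    # it by exactly one bit (XOR with a single-bit mask). Numerically adjacent
    # buckets are not Hamming neighbors. Matches with 2+ flipped prefix bits
    # are still missed by this lookup.
    query_bucket_int = int(query_phash[:4], 16)
    buckets_to_check = {format(query_bucket_int, '04x')}
    for bit in range(16):
        buckets_to_check.add(format(query_bucket_int ^ (1 << bit), '04x'))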

    for bucket in buckets_to_check:
        bucket_entries = r.hgetall(f'{REDIS_PREFIX}{bucket}')
        for raw_id, stored_phash in bucket_entries.items():
            # Redis returns bytes by default; decode before comparing to the str ID
            acct_id = raw_id.decode()
            if acct_id == source_account_id:
                continue
            stored_hash = imagehash.hex_to_hash(stored_phash.decode())
            distance = int(query_hash - stored_hash)  # Hamming distance in bits
            if distance <= HAMMING_THRESHOLD:
                matches.append((acct_id, distance))

    return sorted(matches, key=lambda x: x[1])
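
Bucketing by a 16-bit prefix trades recall for speed, and a rough estimate of that trade-off is easy to compute. The sketch below assumes the lookup covers the query bucket plus every prefix one bit away, and that differing bits are spread uniformly over the 64 positions. The uniformity assumption is a simplification made for this example (pHash bits are not equally stable), and `prefix_lookup_recall` is a hypothetical helper.

```python
from math import comb


def prefix_lookup_recall(
    distance: int,
    total_bits: int = 64,
    prefix_bits: int = 16,
    max_prefix_flips: int = 1,
) -> float:
    """Probability that a pair at the given Hamming distance has at most
    max_prefix_flips differing bits inside the bucket prefix."""
    if not 0 <= distance <= total_bits:
        raise ValueError("distance out of range")
    rest_bits = total_bits - prefix_bits
    hits = sum(
        comb(prefix_bits, k) * comb(rest_bits, distance - k)
        for k in range(min(max_prefix_flips, distance) + 1)
    )
    return hits / comb(total_bits, distance)


for d in (1, 2, 4, 8):
    print(d, round(prefix_lookup_recall(d), 3))
```

Under these assumptions a pair at exactly distance 8 is found only about 35% of the time, so the bucket lookup favors close matches; scanning more neighbor buckets or indexing several hash segments would raise recall at the cost of more Redis reads.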

XGBoost with language as grouping variable

The eight features are fed to a single XGBoost binary classifier (is_bot: bool). Language is not an input feature — instead, we use it as a grouping variable for StratifiedGroupKFold cross-validation and for computing per-language decision thresholds after training.

The motivation: some behavioral features have different baseline distributions by language. Arabic-language accounts tend to have higher quote-to-original ratios (a cultural norm around sharing) and Korean-language accounts have different dormancy patterns. Training on a pooled multilingual corpus without language control would produce a classifier that conflates cultural behavioral norms with bot behavior. Per-language Platt scaling on the classifier's raw score output corrects for this without requiring separate per-language models.

from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd

XGB_PARAMS = {
    'n_estimators': 600,
    'max_depth': 5,
    'learning_rate': 0.04,
    'subsample': 0.8,
    'colsample_bytree': 0.75,
    'min_child_weight': 3,
    'scale_pos_weight': 4.2,  # bot:human ratio ~1:4.2 in training set
    'eval_metric': 'aucpr',
    'early_stopping_rounds': 40,
    'random_state': 42,
}

FEATURE_COLS = [
    'posting_interval_entropy',
    'reply_outdegree_ratio',
    'content_cluster_density',
    'age_velocity_zscore',
    'quote_to_original_ratio',
    'url_recycling_rate',
    'cross_platform_correlation',
    'bio_change_count_90d',
]

def train_multilingual_bot_classifier(
    df: pd.DataFrame,  # must have FEATURE_COLS + 'is_bot' + 'language_code'
) -> tuple[XGBClassifier, dict[str, LogisticRegression]]:
    """
    Train XGBoost on pooled corpus; calibrate per language.
    Returns (base_classifier, per_language_calibrators).
    """
    X = df[FEATURE_COLS].values
    y = df['is_bot'].values.astype(int)
    groups = df['language_code'].values

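    cv = StratifiedGroupKFold(n_splits=5)
    oof_scores = np.zeros(len(df))
    best_iterations = []

    clf = XGBClassifier(**XGB_PARAMS)

    # Grouping by language means each fold validates on languages unseen in training
    for train_idx, val_idx in cv.split(X, y, groups):
        X_tr, X_val = X[train_idx], X[val_idx]
        y_tr, y_val = y[train_idx], y[val_idx]
        clf.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
        oof_scores[val_idx] = clf.predict_proba(X_val)[:, 1]
        best_iterations.append(clf.best_iteration)

    # Fit final model on full training set. Early stopping needs an eval_set,
    # so drop it here and reuse the mean best round count from the folds.
    final_params = {k: v for k, v in XGB_PARAMS.items() if k != 'early_stopping_rounds'}
    final_params['n_estimators'] = int(np.mean(best_iterations)) + 1
    clf = XGBClassifier(**final_params)
    clf.fit(X, y, verbose=False)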

    # Per-language Platt scaling calibration on OOF predictions
    per_lang_calibrators: dict[str, LogisticRegression] = {}
    for lang in np.unique(groups):
        mask = groups == lang
        if mask.sum() < 50:
            continue
        lr = LogisticRegression(C=1.0, solver='lbfgs')
        lr.fit(oof_scores[mask].reshape(-1, 1), y[mask])
        per_lang_calibrators[lang] = lr

    return clf, per_lang_calibrators
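
At inference time the per-language calibrators have to be applied to the raw classifier scores. A minimal sketch of that step follows. It assumes each calibrator exposes scikit-learn's `predict_proba`, and that languages without a calibrator (those skipped for having fewer than 50 accounts) fall back to the raw score. The function name and the fallback rule are illustrative assumptions, not taken from the pipeline.

```python
from typing import Mapping, Protocol, Sequence

import numpy as np


class SupportsPredictProba(Protocol):
    def predict_proba(self, X: np.ndarray) -> np.ndarray: ...


def apply_language_calibration(
    raw_scores: Sequence[float],
    language_codes: Sequence[str],
    calibrators: Mapping[str, SupportsPredictProba],
) -> np.ndarray:
    """Map raw bot scores to calibrated probabilities, language by language."""
    raw = np.asarray(raw_scores, dtype=float)
    calibrated = raw.copy()
    for lang in set(language_codes):
        calibrator = calibrators.get(lang)
        if calibrator is None:
            continue  # assumption: uncalibrated languages keep the raw score
        mask = np.array([code == lang for code in language_codes])
        # Platt scaling takes the raw score as its single input feature
        calibrated[mask] = calibrator.predict_proba(raw[mask].reshape(-1, 1))[:, 1]
    return calibrated
```

With the training function above, `raw_scores` would be `clf.predict_proba(X)[:, 1]` and `calibrators` the returned `per_lang_calibrators` dictionary.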

Per-language F1 results

Evaluation on the held-out test set (20% stratified split, 88,400 accounts across 14 languages, 22% bot prevalence):

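| Language | Test accounts | Precision | Recall | F1 |
|---|---|---|---|---|
| English (en) | 24,100 | 0.912 | 0.904 | 0.908 |
| Arabic (ar) | 9,800 | 0.891 | 0.877 | 0.884 |
| Russian (ru) | 8,400 | 0.902 | 0.896 | 0.899 |
| Spanish (es) | 7,600 | 0.898 | 0.888 | 0.893 |
| Persian (fa) | 6,200 | 0.879 | 0.887 | 0.883 |
| Chinese (zh) | 5,800 | 0.907 | 0.891 | 0.899 |
| French (fr) | 4,400 | 0.894 | 0.882 | 0.888 |
| 7 others (ko, tr, de, pt, hi, id, vi) | 22,100 | 0.889 | 0.877 | 0.883 |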

The narrow performance gap across languages (F1 range: 0.883–0.908, a spread of 0.025) indicates that behavioral features generalize across languages without per-language feature engineering. The lowest-performing individual languages (Persian, Arabic) show lower precision primarily because their higher baseline quote-to-original ratios require more conservative per-language thresholds to avoid false positives on legitimate cultural amplification behavior.
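
The article does not say how the per-language thresholds are chosen. One plausible approach, sketched here purely as an assumption, is to sweep a grid of cutoffs on each language's calibrated validation scores and keep the one with the best F1. The function names and the 0.05-step grid are hypothetical.

```python
from typing import Optional, Sequence

import numpy as np


def best_f1_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    grid: Optional[np.ndarray] = None,
) -> tuple[float, float]:
    """Return (threshold, f1) for the grid cutoff with the highest F1."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=int)
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)   # assumed 0.05-step grid
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = s >= t
        tp = int(np.sum(pred & (y == 1)))
        fp = int(np.sum(pred & (y == 0)))
        fn = int(np.sum(~pred & (y == 1)))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:                      # ties keep the lowest cutoff
            best_t, best_f1 = float(t), f1
    return best_t, best_f1


def per_language_thresholds(
    scores: Sequence[float],
    labels: Sequence[int],
    language_codes: Sequence[str],
) -> dict[str, float]:
    """Pick one F1-optimal cutoff per language code."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=int)
    out = {}
    for lang in sorted(set(language_codes)):
        mask = np.array([code == lang for code in language_codes])
        out[lang], _ = best_f1_threshold(s[mask], y[mask])
    return out
```

A language whose legitimate accounts score higher on amplification-related features would end up with a higher cutoff under this scheme, which matches the conservative thresholds described above.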

Throughput

Feature extraction runs as a Kafka consumer group with 12 workers, each processing a slice of the incoming post stream. The eight features can be computed at approximately 200,000 accounts per minute using Python with NumPy vectorization, comfortably above the roughly 40,000 posts per minute implied by our 2.4M posts/hour ingestion rate. The XGBoost inference call itself takes 0.4 ms per batch of 128 accounts on CPU, making inference latency negligible compared to feature extraction I/O.
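
The throughput figures mix units (accounts per minute, posts per hour, milliseconds per batch), so a quick conversion makes the headroom explicit. The sketch assumes, as a worst case, that every incoming post triggers one account-level feature update; that assumption is introduced here and not stated in the pipeline description.

```python
POSTS_PER_HOUR = 2_400_000          # ingestion rate
ACCOUNTS_PER_MINUTE = 200_000       # feature extraction throughput
BATCH_SIZE = 128                    # accounts per XGBoost inference call
BATCH_LATENCY_MS = 0.4              # CPU latency per batch

posts_per_minute = POSTS_PER_HOUR / 60
# Worst case: each post forces a feature refresh for its account
extraction_headroom = ACCOUNTS_PER_MINUTE / posts_per_minute
inference_accounts_per_second = BATCH_SIZE / (BATCH_LATENCY_MS / 1000)

print(f"ingest: {posts_per_minute:,.0f} posts/min")
print(f"extraction headroom: {extraction_headroom:.1f}x")
print(f"inference: {inference_accounts_per_second:,.0f} accounts/s per core")
```

Even under that worst-case assumption, extraction keeps a fivefold margin over ingestion, and a single CPU core can score on the order of 320,000 accounts per second.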


For the coordinated campaign detection methodology that uses these bot labels: Coordinated campaign detection: identifying inauthentic networks in election data →

For the NLP pipeline that processes the 2.4M posts/hour stream these accounts produce: NLP at 2.4M posts/hour: the social media analysis pipeline →