
Bot detection across 14 languages: language-invariant behavioral features and cross-platform sockpuppet fingerprinting

September 24, 2024 · 13 min read · AI Analytics

OSINT · NLP · ML · Elections

Election-related disinformation campaigns operate across Telegram, Twitter/X, Facebook, and Bluesky simultaneously — with the same coordinated network posting in Arabic, Russian, Persian, Spanish, and English depending on which audience they're targeting. A bot detector that relies on English-language lexical features (sentiment word lists, political keyword bags) will miss the full campaign scope and systematically undercount non-English actors.

Our multilingual bot detector uses eight language-invariant behavioral features: they measure how accounts post, not what they post. Language is used only as a grouping variable for cross-validation and per-language threshold calibration, not as an input feature. The result is a single XGBoost classifier that achieves F1 0.883–0.908 across 14 languages in the 2024 election monitoring corpus.

The eight language-invariant features

from dataclasses import dataclass
from typing import Optional

@dataclass
class BotFeatureVector:
    account_id: str
    language_code: str       # ISO 639-1; used only for threshold calibration

    # Feature 1: Posting interval entropy
    # Low entropy = machine-timed posts; high entropy = human irregular timing
    # Computed over the last 500 posts; intervals binned into 60 one-minute bins
    posting_interval_entropy: float      # bits; range 0-5.91 (log2 of 60 bins)

    # Feature 2: Reply graph outdegree ratio
    # Bots initiate interactions to visible accounts; humans reply to conversations
    # outdegree = mentions_sent / (mentions_sent + replies_received)
    reply_outdegree_ratio: float         # 0.0-1.0

    # Feature 3: Content hash cluster density
    # Fraction of posts within Hamming distance 3 of another post by a different account
    # High density = copied/templated content across coordinated accounts
    content_cluster_density: float      # 0.0-1.0

    # Feature 4: Account age-activity velocity
    # (total_posts / account_age_days); normalized by platform median
    # Very high velocity in early account life = bot signup burst
    age_velocity_zscore: float           # z-score relative to platform median

    # Feature 5: Quote-to-original ratio
    # Amplifier bots have high quote/retweet ratio; content farms have low
    # Combined with feature 4 to distinguish amplifiers from content generators
    quote_to_original_ratio: float      # 0.0-inf; capped at 20.0

    # Feature 6: URL recycling rate
    # Fraction of posts containing a URL that was posted by another account
    # within the same 15-minute window; coordinated link drops show high rate
    url_recycling_rate: float           # 0.0-1.0

    # Feature 7: Cross-platform timing correlation
    # Pearson r between this account's posting times and the same content
    # appearing on a different platform (0.0 if no cross-platform match found)
    cross_platform_correlation: float   # -1.0 to 1.0; 0.0 = not found

    # Feature 8: Bio change cadence
    # Number of profile bio changes in the last 90 days
    # Coordinated networks often update bios in synchronized bursts
    bio_change_count_90d: int           # raw count; not normalized
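
As a rough illustration of how three of these features could be derived from raw counts and post records, here is a minimal sketch. The `Post` record, the function names, the zero-denominator conventions, the choice of all of an account's posts as the denominator, and the reading of "same 15-minute window" as a sliding window of plus or minus 15 minutes are all assumptions made for this example, not the production implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

QUOTE_RATIO_CAP = 20.0          # cap stated in the feature definition
URL_WINDOW_SECONDS = 15 * 60    # assumed sliding window of +/- 15 minutes


@dataclass(frozen=True)
class Post:
    """Hypothetical minimal post record used only for this sketch."""
    account_id: str
    timestamp: float            # Unix seconds
    url: Optional[str] = None


def reply_outdegree_ratio(mentions_sent: int, replies_received: int) -> float:
    """mentions_sent / (mentions_sent + replies_received); 0.0 if both are zero (assumed)."""
    total = mentions_sent + replies_received
    return mentions_sent / total if total else 0.0


def quote_to_original_ratio(quotes: int, originals: int) -> float:
    """quotes / originals, capped at 20.0; an account with no originals hits the cap (assumed)."""
    if originals == 0:
        return QUOTE_RATIO_CAP if quotes else 0.0
    return min(quotes / originals, QUOTE_RATIO_CAP)


def url_recycling_rate(account_id: str, posts: Sequence[Post]) -> float:
    """Fraction of the account's posts whose URL another account posted within the window."""
    own = [p for p in posts if p.account_id == account_id]
    if not own:
        return 0.0
    others = [p for p in posts if p.account_id != account_id and p.url]
    recycled = sum(
        1
        for p in own
        if p.url and any(
            o.url == p.url and abs(o.timestamp - p.timestamp) <= URL_WINDOW_SECONDS
            for o in others
        )
    )
    return recycled / len(own)
```

The pairwise scan in `url_recycling_rate` is quadratic and only suitable for illustration; a production version would presumably index posts by URL and time bucket.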

Posting interval entropy

The most discriminating single feature is posting interval entropy: how evenly the gaps between an account's consecutive posts are spread across one-minute bins. Human users post with irregular timing driven by daily life (meals, sleep, commutes), so their gaps land in many different bins. Bots post at fixed intervals, jittered intervals, or in precise bursts triggered by an external scheduler, which concentrates their gaps in a few bins.

import numpy as np
from typing import Sequence

def compute_posting_interval_entropy(
    timestamps: Sequence[float],
    n_bins: int = 60,
) -> float:
    """
    Compute Shannon entropy of the posting interval distribution.

    timestamps: Unix timestamps of posts, sorted ascending.
    n_bins: number of histogram bins (default: 60 = 1-minute bins over 1 hour).

    Returns entropy in bits. Approaches 0 when all intervals fall in one bin
    (add-1 smoothing keeps it slightly above 0); at most log2(n_bins)
    (uniform distribution across all bins). Returns NaN if there is too little data.
    """
    if len(timestamps) < 10:
        return float('nan')

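    intervals = np.diff(timestamps)           # seconds between consecutive posts
    # Keep only gaps in [0, 3600] s: drops dormancy periods > 1 hour and any
    # negative gaps caused by unsorted input (this filters, it does not clip)
    intervals = intervals[(intervals >= 0) & (intervals <= 3600)]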
    if len(intervals) < 5:
        return float('nan')

    counts, _ = np.histogram(intervals, bins=n_bins, range=(0, 3600))

    # Add-1 smoothing to avoid log(0)
    counts = counts + 1.0
    probs = counts / counts.sum()
    entropy = -np.sum(probs * np.log2(probs))

    # Cap at the maximum possible entropy, log2(n_bins), to guard against
    # floating-point overshoot (nothing is subtracted here)
    max_entropy = np.log2(n_bins)  # entropy of a uniform distribution
    return float(min(entropy, max_entropy))

# Empirical thresholds (calibrated on 120K labeled accounts):
# entropy < 2.1 bits: HIGH bot probability (automated scheduler)
# entropy 2.1-4.8 bits: AMBIGUOUS (jitter-based bot or low-activity human)
# entropy > 4.8 bits: LOW bot probability (human irregular pattern)
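
To make the thresholds concrete, the sketch below maps an entropy value to one of the three bands and checks where a perfectly regular scheduler lands. It mirrors the add-1 smoothing of the function above in pure Python so that it runs without NumPy. The names `smoothed_entropy_bits` and `entropy_band` and the `INSUFFICIENT_DATA` label for NaN inputs are assumptions for this example.

```python
import math
from typing import Sequence

HIGH_BOT_MAX_BITS = 2.1   # below this: HIGH bot probability
LOW_BOT_MIN_BITS = 4.8    # above this: LOW bot probability


def smoothed_entropy_bits(counts: Sequence[int]) -> float:
    """Shannon entropy (bits) of histogram counts after add-1 smoothing."""
    smoothed = [c + 1.0 for c in counts]
    total = sum(smoothed)
    return -sum((c / total) * math.log2(c / total) for c in smoothed)


def entropy_band(entropy_bits: float) -> str:
    """Map an entropy value to the empirical bands listed above."""
    if math.isnan(entropy_bits):
        return 'INSUFFICIENT_DATA'   # hypothetical label for the NaN case
    if entropy_bits < HIGH_BOT_MAX_BITS:
        return 'HIGH'
    if entropy_bits > LOW_BOT_MIN_BITS:
        return 'LOW'
    return 'AMBIGUOUS'


# A scheduler posting every 5 minutes: all 499 gaps fall into a single bin.
scheduler_counts = [0] * 60
scheduler_counts[5] = 499
print(entropy_band(smoothed_entropy_bits(scheduler_counts)))  # HIGH
```

Because smoothing adds one pseudo-count to each of the 60 bins, accounts with only a handful of intervals are pulled toward the uniform distribution, which is one reason to treat low-activity accounts with caution.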

Cross-platform perceptual hash fingerprinting

Sockpuppet networks often maintain linked accounts across platforms — a Telegram channel that coordinates posting strategy, and Twitter/X + Bluesky accounts that execute it. Cross-platform linkage is detectable even when accounts use different usernames, because profile photos are frequently reused.

We compute a perceptual hash (pHash) for every scraped profile photo and store it in a Redis hash keyed by the first 16 bits of the pHash. At query time, candidates are retrieved from the matching buckets and compared with the query using the full 64-bit hash; those within Hamming distance 8 are kept. A Hamming distance ≤ 8 means at least 87.5% of the 64 bits agree (56 of 64), which is sufficient to match the same photo after JPEG recompression and minor resizing, while rejecting false positives from stock photo reuse (which clusters at Hamming 15+).

import imagehash
from PIL import Image
import io
import redis

REDIS_PREFIX = 'phash:'
HAMMING_THRESHOLD = 8

def compute_profile_phash(image_bytes: bytes) -> str:
    """Compute 64-bit perceptual hash of a profile photo."""
    img = Image.open(io.BytesIO(image_bytes)).convert('RGB')
    # Resize to 32x32 for consistent hash regardless of source resolution
    img = img.resize((32, 32), Image.LANCZOS)
    return str(imagehash.phash(img, hash_size=8))  # 64-bit pHash

def store_phash(r: redis.Redis, account_id: str, phash_str: str) -> None:
    """Store pHash indexed by its first 16 bits (bucket key)."""
    bucket = phash_str[:4]  # first 4 hex chars = 16 bits
    r.hset(f'{REDIS_PREFIX}{bucket}', account_id, phash_str)

def find_phash_matches(
    r: redis.Redis,
    query_phash: str,
    source_account_id: str,
) -> list[tuple[str, int]]:
    """
    Find accounts with similar profile photos across all platforms.
    Returns [(account_id, hamming_distance)] sorted ascending.
    """
    query_hash = imagehash.hex_to_hash(query_phash)
    matches = []

    # Check the query bucket plus the 16 buckets whose prefix differs by one bit.
    # Numerically adjacent buckets are not Hamming-adjacent, so flip bits instead.
    # Matches whose 16-bit prefix differs in two or more bits are still missed.
    query_bucket_int = int(query_phash[:4], 16)
    buckets_to_check = {f'{query_bucket_int:04x}'}
    for bit in range(16):
        buckets_to_check.add(f'{query_bucket_int ^ (1 << bit):04x}')

    for bucket in buckets_to_check:
        bucket_entries = r.hgetall(f'{REDIS_PREFIX}{bucket}')
        for raw_id, stored_phash in bucket_entries.items():
            # redis-py returns bytes unless decode_responses=True, so decode
            # before comparing with the str account id
            acct_id = raw_id.decode()
            if acct_id == source_account_id:
                continue
            stored_hash = imagehash.hex_to_hash(stored_phash.decode())
            distance = int(query_hash - stored_hash)
            if distance <= HAMMING_THRESHOLD:
                matches.append((acct_id, distance))

    return sorted(matches, key=lambda x: x[1])
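
The 87.5% figure follows directly from the hash length. The sketch below computes Hamming distance between two hex-encoded 64-bit hashes using only the standard library and converts a distance into a bit-level similarity. The helper names and the sample hash values are made up for illustration; the production code relies on `imagehash` for the comparison.

```python
def hamming_distance_hex(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError('hashes must have the same length')
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count('1')


def bit_similarity(distance: int, n_bits: int = 64) -> float:
    """Fraction of bits that agree, given a Hamming distance."""
    return 1.0 - distance / n_bits


# Illustrative hashes that differ only in their lowest 8 bits.
original = 'ffff0000ffff0000'
recompressed = 'ffff0000ffff00ff'

distance = hamming_distance_hex(original, recompressed)
print(distance, bit_similarity(distance))  # 8 0.875
```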

XGBoost with language as grouping variable

The eight features are fed to a single XGBoost binary classifier (is_bot: bool). Language is not an input feature — instead, we use it as a grouping variable for StratifiedGroupKFold cross-validation and for computing per-language decision thresholds after training.

The motivation: some behavioral features have different baseline distributions by language. Arabic-language accounts tend to have higher quote-to-original ratios (a cultural norm around sharing) and Korean-language accounts have different dormancy patterns. Training on a pooled multilingual corpus without language control would produce a classifier that conflates cultural behavioral norms with bot behavior. Per-language Platt scaling on the classifier's raw score output corrects for this without requiring separate per-language models.

from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd

XGB_PARAMS = {
    'n_estimators': 600,
    'max_depth': 5,
    'learning_rate': 0.04,
    'subsample': 0.8,
    'colsample_bytree': 0.75,
    'min_child_weight': 3,
    'scale_pos_weight': 4.2,  # bot:human ratio ~1:4.2 in training set
    'eval_metric': 'aucpr',
    'early_stopping_rounds': 40,
    'random_state': 42,
}

FEATURE_COLS = [
    'posting_interval_entropy',
    'reply_outdegree_ratio',
    'content_cluster_density',
    'age_velocity_zscore',
    'quote_to_original_ratio',
    'url_recycling_rate',
    'cross_platform_correlation',
    'bio_change_count_90d',
]

def train_multilingual_bot_classifier(
    df: pd.DataFrame,  # must have FEATURE_COLS + 'is_bot' + 'language_code'
) -> tuple[XGBClassifier, dict[str, LogisticRegression]]:
    """
    Train XGBoost on pooled corpus; calibrate per language.
    Returns (base_classifier, per_language_calibrators).
    """
    X = df[FEATURE_COLS].values
    y = df['is_bot'].values.astype(int)
    groups = df['language_code'].values

    cv = StratifiedGroupKFold(n_splits=5)
    oof_scores = np.zeros(len(df))

    clf = XGBClassifier(**XGB_PARAMS)

    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y, groups)):
        X_tr, X_val = X[train_idx], X[val_idx]
        y_tr, y_val = y[train_idx], y[val_idx]
        clf.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
        oof_scores[val_idx] = clf.predict_proba(X_val)[:, 1]

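    # Fit final model on full training set. Early stopping needs a validation
    # set, so drop it here; the final model therefore trains all n_estimators trees.
    final_params = {k: v for k, v in XGB_PARAMS.items() if k != 'early_stopping_rounds'}
    clf = XGBClassifier(**final_params)
    clf.fit(X, y, verbose=False)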

    # Per-language Platt scaling calibration on OOF predictions
    per_lang_calibrators: dict[str, LogisticRegression] = {}
    for lang in np.unique(groups):
        mask = groups == lang
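        # Skip languages with too few accounts or only one class present
        # (logistic regression needs both classes to fit)
        if mask.sum() < 50 or np.unique(y[mask]).size < 2: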
            continue
        lr = LogisticRegression(C=1.0, solver='lbfgs')
        lr.fit(oof_scores[mask].reshape(-1, 1), y[mask])
        per_lang_calibrators[lang] = lr

    return clf, per_lang_calibrators
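
The training function returns the calibrators but does not show how they are applied at inference time. For a one-feature binary `LogisticRegression`, `predict_proba` reduces to a sigmoid of `slope * score + intercept`, where the two parameters can be read from `lr.coef_[0][0]` and `lr.intercept_[0]`. The sketch below applies that formula with the standard library only. The function names, the `(slope, intercept)` tuple layout, and the fallback of passing the raw score through for languages that were skipped during calibration are assumptions for this example.

```python
import math
from typing import Mapping, Optional, Tuple

PlattParams = Tuple[float, float]   # (slope, intercept), hypothetical layout


def platt_probability(raw_score: float, slope: float, intercept: float) -> float:
    """Sigmoid of an affine transform of the raw classifier score."""
    return 1.0 / (1.0 + math.exp(-(slope * raw_score + intercept)))


def calibrated_bot_probability(
    raw_score: float,
    language_code: str,
    params_by_language: Mapping[str, PlattParams],
    fallback: Optional[PlattParams] = None,
) -> float:
    """Apply the language's calibrator if one exists, else the fallback, else the raw score."""
    params = params_by_language.get(language_code, fallback)
    if params is None:
        return raw_score
    slope, intercept = params
    return platt_probability(raw_score, slope, intercept)


# Made-up parameters purely for illustration; real values come from training.
example_params = {'en': (6.0, -3.0), 'ar': (5.0, -3.2)}
print(calibrated_bot_probability(0.5, 'en', example_params))  # 0.5
```

With these illustrative parameters, the same raw score of 0.5 maps to a lower probability for `ar` than for `en`, which is the kind of per-language shift the calibration step is meant to capture.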

Per-language F1 results

Evaluation on the held-out test set (20% stratified split, 88,400 accounts across 14 languages, 22% bot prevalence):

Language	Test accounts	Precision	Recall	F1
English (en)	24,100	0.912	0.904	0.908
Arabic (ar)	9,800	0.891	0.877	0.884
Russian (ru)	8,400	0.902	0.896	0.899
Spanish (es)	7,600	0.898	0.888	0.893
Persian (fa)	6,200	0.879	0.887	0.883
Chinese (zh)	5,800	0.907	0.891	0.899
French (fr)	4,400	0.894	0.882	0.888
7 others (ko, tr, de, pt, hi, id, vi)	22,100	0.889	0.877	0.883
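
As a consistency check, the F1 column can be re-derived from the precision and recall columns using F1 = 2PR / (P + R), and the per-language account counts can be summed against the stated test-set size. The short script below does both with the values copied from the table; the variable names are chosen for this example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (test accounts, precision, recall, reported F1), copied from the table above
rows = {
    'en': (24_100, 0.912, 0.904, 0.908),
    'ar': (9_800, 0.891, 0.877, 0.884),
    'ru': (8_400, 0.902, 0.896, 0.899),
    'es': (7_600, 0.898, 0.888, 0.893),
    'fa': (6_200, 0.879, 0.887, 0.883),
    'zh': (5_800, 0.907, 0.891, 0.899),
    'fr': (4_400, 0.894, 0.882, 0.888),
    'others': (22_100, 0.889, 0.877, 0.883),
}

for language, (_, precision, recall, reported) in rows.items():
    recomputed = round(f1_score(precision, recall), 3)
    print(language, recomputed, 'matches' if recomputed == reported else 'differs')

total_accounts = sum(accounts for accounts, _, _, _ in rows.values())
print(total_accounts)  # 88400
```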

The narrow performance gap across languages (F1 range: 0.883–0.908) indicates that behavioral features generalize across languages without per-language feature engineering. The lowest-performing individually reported languages (Persian, Arabic) show lower precision primarily because their higher baseline quote-to-original ratios require more conservative per-language thresholds to avoid false positives on legitimate cultural amplification behavior.
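
One way to pick a per-language decision threshold is to sweep candidate cutoffs over that language's calibrated scores and keep the one with the best F1. The article does not state which selection criterion is used in production, so F1 maximization, the function name, and the toy data below are assumptions for this sketch.

```python
from typing import Sequence, Tuple


def best_f1_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
) -> Tuple[float, float]:
    """Return (threshold, f1) maximizing F1 when predicting bot for score >= threshold."""
    best_threshold, best_f1 = 0.5, 0.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if tp == 0:
            continue  # F1 is undefined or zero without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Toy scores for one language: the two bots score higher than the two humans.
toy_scores = [0.1, 0.4, 0.6, 0.9]
toy_labels = [0, 0, 1, 1]
print(best_f1_threshold(toy_scores, toy_labels))  # (0.6, 1.0)
```

A language with a higher baseline quote-to-original ratio would typically end up with a higher cutoff under this procedure, since legitimate accounts there receive higher raw scores.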

Throughput

Feature extraction runs as a Kafka consumer group with 12 workers, each processing a slice of the incoming post stream. The eight features can be computed at approximately 200,000 accounts per minute using Python with NumPy vectorization, which comfortably covers our ingestion rate of 2.4M posts/hour (40,000 posts per minute). XGBoost inference takes 0.4 ms per batch of 128 accounts on CPU, making inference latency negligible compared to feature extraction I/O.
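
The headroom implied by these figures can be worked out with a few lines of arithmetic. The calculation below uses only the numbers stated above and makes one simplifying assumption, labeled in the code: that in the worst case every incoming post belongs to a distinct account, so posts per minute is an upper bound on accounts needing feature extraction per minute.

```python
POSTS_PER_HOUR = 2_400_000
FEATURE_ACCOUNTS_PER_MINUTE = 200_000
INFERENCE_BATCH_SIZE = 128
INFERENCE_MICROSECONDS_PER_BATCH = 400   # 0.4 ms per batch

posts_per_minute = POSTS_PER_HOUR // 60
# Assumption: worst case, each post comes from a different account.
extraction_headroom = FEATURE_ACCOUNTS_PER_MINUTE / posts_per_minute

batches_per_second = 1_000_000 // INFERENCE_MICROSECONDS_PER_BATCH
inference_accounts_per_second = batches_per_second * INFERENCE_BATCH_SIZE

print(posts_per_minute)               # 40000
print(extraction_headroom)            # 5.0
print(inference_accounts_per_second)  # 320000
```

Under that worst-case assumption, feature extraction has about five times the required capacity, and single-threaded inference alone could score 320,000 accounts per second.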

For the coordinated campaign detection methodology that uses these bot labels: Coordinated campaign detection: identifying inauthentic networks in election data →

For the NLP pipeline that processes the 2.4M posts/hour stream these accounts produce: NLP at 2.4M posts/hour: the social media analysis pipeline →