Detecting Election Anomalies Using Statistical Methods

September 12, 2024 · 16 min read · AI Analytics

Tags: Elections, Statistics, XGBoost, Benford, OSINT

During the 2024 election cycle we monitored 47 races across 23 states. The system flagged 12 statistical anomalies worth human review. Nine traced to mundane causes (data entry errors, procedural changes, demographic shifts), two were reclassified as false positives, and one remains unexplained with no corroborating evidence. This outcome is expected. The value is not in the anomalies themselves but in the triage process: surfacing them fast enough that journalists and oversight groups can investigate before certification.

This article covers the four detection methods we run in parallel, their implementation, and their known limitations. We are explicit about limitations because the misuse of statistical election analysis to make fraud claims — claims that outrun the evidence — has done measurable harm to public trust. Our system flags anomalies for investigation; it does not conclude fraud.

What we are and are not looking for

Anomaly detection and fraud detection are different tasks. Fraud detection requires ground truth (confirmed instances of fraudulent activity to train against), which does not exist at scale in US elections. Anomaly detection requires only a model of normal behavior and a way to measure deviation.

Anomalies we flag:

Precinct-level vote totals with unusual first-digit distributions
Turnout rates more than 2.5 standard deviations from model prediction
Vote reporting curves that deviate from county-level historical patterns
Campaign finance contributions clustered in ways inconsistent with organic donor behavior
Down-ballot undervote rates outside historical ranges

Anomalies do not mean fraud. In our experience, most anomalies trace to: data entry errors (trailing zeros, transposed digits), late reporting of a specific precinct type (mail-in vs in-person), demographic changes in a precinct (new housing development), or procedural changes (new voting equipment, changed counting workflow). The system produces a triage list, not a verdict.

Data ingestion

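We ingest from three primary live sources (state election sites, the AP Elections API, and the OpenFEC API) plus three slower-moving reference datasets. Everything is normalized into a common schema in PostgreSQL with PostGIS extensions for geographic queries.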

Source                  Format      Update cadence       Coverage
─────────────────────────────────────────────────────────────────────────
State election sites    HTML/JSON   Every 5 min (E-day)  All 23 states
AP Elections API        JSON        Every 30 sec         47 races
OpenFEC API             JSON        Daily                All FEC filers
Voter registration      CSV bulk    Weekly               12 states (public)
Census ACS (2020/2022)  CSV         Static               All precincts
Historical results      CSV         Static               2016–2022

State election websites are the most heterogeneous. Some publish structured JSON (California, Florida, Texas). Most publish HTML tables that require scraping. Three states (Pennsylvania, Wisconsin, Michigan) update their result pages via JavaScript rendering and require Playwright-based scraping. We run scrapers for each state independently rather than trying to generalize — election website formats change between cycles, and a broken scraper on election night is worse than redundant code.

Data normalization is the hardest engineering problem. Precinct IDs are not standardized: the same precinct is "Ward 3, Precinct 8" in state results, "03-08" in AP data, and "030800" in FIPS format in the census. We maintain a hand-curated crosswalk table for all 47 races that maps between formats. Building and maintaining this table takes more time than any of the statistical models.
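
As a rough illustration of what such a crosswalk lookup might look like, here is a minimal in-memory sketch. The row layout, the canonical ID string, and the helper names (`CrosswalkRow`, `build_index`, `to_canonical`) are all hypothetical; the production table lives in PostgreSQL and its actual schema is not shown in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrosswalkRow:
    canonical_id: str   # hypothetical internal ID
    state_label: str    # as published in state results
    ap_code: str        # as published in AP data
    census_code: str    # as published in census data


# One hand-curated row, using the example identifiers from the text
CROSSWALK = [
    CrosswalkRow("demo-county:w03-p08", "Ward 3, Precinct 8", "03-08", "030800"),
]


def build_index(rows: list[CrosswalkRow]) -> dict[tuple[str, str], str]:
    """Map (source, raw_id) -> canonical_id, rejecting conflicting entries."""
    index: dict[tuple[str, str], str] = {}
    for row in rows:
        pairs = (("state", row.state_label), ("ap", row.ap_code), ("census", row.census_code))
        for source, raw in pairs:
            key = (source, raw.strip().lower())
            if index.get(key, row.canonical_id) != row.canonical_id:
                raise ValueError(f"conflicting crosswalk entry for {key}")
            index[key] = row.canonical_id
    return index


def to_canonical(index: dict[tuple[str, str], str], source: str, raw_id: str) -> Optional[str]:
    """Return the canonical ID, or None so the caller can queue the ID for manual curation."""
    return index.get((source, raw_id.strip().lower()))


INDEX = build_index(CROSSWALK)
print(to_canonical(INDEX, "ap", "03-08"))
```

Returning `None` for unmapped IDs, rather than guessing from the string format, keeps the table hand-curated: an unknown precinct becomes a review task instead of a silent mismatch.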

Method 1: Benford's law

Benford's law states that in naturally occurring collections of numbers, the leading digit d appears with probability log₁₀(1 + 1/d). It holds for data spanning multiple orders of magnitude. For small precinct vote totals (50–800 votes per precinct) it does not hold reliably and should not be applied — this is the most common misuse of Benford's law in election analysis.
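
The formula is easy to evaluate directly. The short sketch below (helper names are illustrative, not part of our codebase) computes the expected probability for each leading digit and extracts the leading digit of a vote total.

```python
import math


def benford_probability(d: int) -> float:
    """Expected probability that the leading digit equals d under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


def leading_digit(n: int) -> int:
    """Leading decimal digit of a positive integer."""
    if n <= 0:
        raise ValueError("leading digit is only defined for positive totals")
    return int(str(n)[0])


probs = [benford_probability(d) for d in range(1, 10)]
for d, p in enumerate(probs, start=1):
    print(f"digit {d}: {p:.1%}")
```

The nine probabilities sum to 1 because the terms telescope to log₁₀(10).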

import numpy as np
from collections import Counter
from scipy import stats

BENFORD_EXPECTED = [30.1, 17.6, 12.5, 9.7, 7.9, 6.7, 5.8, 5.1, 4.6]  # percent, digits 1–9

def benford_chi2(vote_counts: list[int]) -> tuple[float, float, bool]:
    """
    Returns (chi2, p_value, test_valid).
    test_valid = False if sample is too small or too narrow in range.
    """
    # Zero totals have no leading digit, so drop them before any check
    positive = [v for v in vote_counts if v > 0]

    # Validity checks
    if len(positive) < 100:
        return 0.0, 1.0, False   # Too few precincts for chi2 to be reliable

    magnitude_range = max(positive) / min(positive)
    if magnitude_range < 10:
        # Benford's law requires data spanning ≥ 1 order of magnitude.
        # A county where all precincts have 400–600 votes will not follow
        # Benford regardless of integrity.
        return 0.0, 1.0, False

    digit_counts = Counter(int(str(v)[0]) for v in positive)
    n = len(positive)
    # Chi-square must be computed on counts, not percentages: percentages
    # would fix the effective sample size at 100 regardless of n.
    observed = [digit_counts[d] for d in range(1, 10)]
    expected = [pct / 100 * n for pct in BENFORD_EXPECTED]

    chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
    return chi2, p, True

# We apply Benford's law only at county or state level (n ≥ 100 precincts)
# and only where vote count range spans at least one order of magnitude.
# Applying it at precinct level (individual vote totals) is invalid.

In 2024 we ran Benford analysis at the county level for all 47 races. Three counties showed p < 0.01. All three were subsequently explained by data entry issues (one county's reporting system was outputting totals with an extra trailing zero, which compresses the first-digit distribution), confirmed by the counties' own corrections filed the following day.

Important limitation: Benford's law is sensitive to the granularity of the data. A jurisdiction reporting only rounded totals (to the nearest 10 votes) will mechanically fail a Benford test regardless of underlying integrity. Always check reporting methodology before drawing conclusions.
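
One cheap pre-check for this, sketched below under our own assumptions, is to measure what share of totals are exact multiples of 10. Unrounded counts should land near 10%. The function names and the 0.5 threshold are illustrative choices, not values from our production system.

```python
def rounding_share(vote_counts: list[int], base: int = 10) -> float:
    """Fraction of positive totals that are exact multiples of `base`."""
    positive = [v for v in vote_counts if v > 0]
    if not positive:
        return 0.0
    return sum(v % base == 0 for v in positive) / len(positive)


def looks_rounded(vote_counts: list[int], base: int = 10, threshold: float = 0.5) -> bool:
    """Heuristic: treat the data as rounded if most totals are multiples of `base`."""
    return rounding_share(vote_counts, base) >= threshold


rounded = [410, 520, 630, 700]
unrounded = [411, 523, 638, 700, 95, 212, 307, 449, 581, 666]
print(rounding_share(rounded), rounding_share(unrounded))
```

A jurisdiction that trips this check is better routed to a manual look at its reporting methodology than to a digit test.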

Method 2: XGBoost turnout model

We build a regression model per county (or per precinct in large counties) predicting expected turnout as a fraction of registered voters. The model is trained on 2016–2022 election data and applied to 2024 to identify precincts where actual turnout deviates significantly from prediction.

Feature engineering

# Features per precinct (all normalized to [0,1] range)
features = {
    # Historical turnout (3 prior cycles, same election type)
    "turnout_2022":           float,   # midterm or general
    "turnout_2020":           float,
    "turnout_2018":           float,
    "turnout_std_3cycle":     float,   # standard deviation of past 3

    # Registration
    "reg_delta_30d":          float,   # pct change in registrations, last 30 days
    "reg_dem_fraction":       float,   # partisan registration split
    "new_voter_fraction":     float,   # registered ≤ 2 years ago

    # Demographics (ACS 2022 5-year estimates)
    "median_age":             float,
    "pct_65_plus":            float,   # older voters turn out at higher rates
    "median_income_norm":     float,
    "pct_college_degree":     float,
    "pct_white":              float,
    "pct_black":              float,
    "pct_hispanic":           float,
    "pop_density_log":        float,   # rural vs urban

    # Election-type controls
    "is_presidential":        int,     # 0 or 1
    "is_competitive_race":    float,   # Cook PVI margin ≤ 5 = 1.0

    # Early voting (where reported before E-day)
    "early_vote_fraction_prev": float, # prior cycle's early vote share
    "early_vote_current":     float,   # current cycle (partial, may be 0)

    # Social media signal (from NLP pipeline)
    "turnout_intent_score":   float,   # positive sentiment re: voting in area
}

The is_competitive_race feature is critical: competitive races have systematically higher turnout, and a model that doesn't account for race competitiveness will flag every competitive-race precinct as an anomaly.

import shap
import xgboost as xgb
from sklearn.metrics import mean_absolute_error

# Separate model per election type (presidential, midterm, special)
model = xgb.XGBRegressor(
    n_estimators=400,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    min_child_weight=5,   # avoid overfitting small precincts
    reg_lambda=1.0,
    random_state=42,
    early_stopping_rounds=30,   # set on the estimator; XGBoost 2.x no longer accepts it in fit()
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

# Evaluation on held-out 2022 midterm data
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
# MAE: 3.1 percentage points
# i.e., predictions within 3.1pp of actual turnout on average
# SHAP for explainability — required for any flagged anomaly
explainer = shap.TreeExplainer(model)

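def explain_prediction(precinct_id):
    # shap_values expects a 2-D array, so pass the precinct as a single row
    x = np.asarray(feature_vector(precinct_id)).reshape(1, -1)
    shap_vals = explainer.shap_values(x)[0]   # per-feature contributions for this row
    top_features = sorted(
        zip(FEATURE_NAMES, shap_vals),
        key=lambda t: abs(t[1]),
        reverse=True,
    )[:5]
    return top_features   # attached to every anomaly report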

SHAP values are attached to every anomaly report. Without them, analysts receiving a "precinct X is anomalous" flag have no starting point for investigation. With SHAP, they see: "this precinct is flagged primarily because its turnout_2022 feature implies lower expected turnout but actual was high, and new_voter_fraction is high; check for recent voter drives in this area."

Anomaly thresholding

def flag_turnout_anomalies(
    race_id: str,
    threshold_z: float = 2.5,
) -> list[TurnoutAnomaly]:
    precincts = get_precincts(race_id)
    X = build_features(precincts)
    predicted = model.predict(X)
    actual = np.array([p.turnout for p in precincts])

    residuals = actual - predicted
    # Use median absolute deviation (more robust than std to outliers)
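    mad = np.median(np.abs(residuals - np.median(residuals)))
    sigma_robust = 1.4826 * mad   # consistent with std under normality
    if sigma_robust == 0:
        return []   # at least half the residuals equal the median; z-scores are undefined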

    anomalies = []
    for i, precinct in enumerate(precincts):
        z = residuals[i] / sigma_robust
        if abs(z) >= threshold_z:
            anomalies.append(TurnoutAnomaly(
                precinct_id=precinct.id,
                predicted=predicted[i],
                actual=actual[i],
                z_score=z,
                shap_top5=explain_prediction(precinct.id),
            ))

    return anomalies

We use median absolute deviation (MAD) instead of standard deviation for the z-score denominator. A handful of genuinely extreme precincts would inflate the standard deviation, making other anomalies harder to detect. MAD is robust to these outliers.
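
The effect is easy to see on a toy example. The residuals below are made-up numbers chosen for illustration, not 2024 data; the sketch uses only the standard library rather than the NumPy calls in our pipeline.

```python
import statistics


def robust_sigma(values: list[float]) -> float:
    """MAD scaled by 1.4826 so it estimates the standard deviation under normality."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return 1.4826 * mad


# Hypothetical turnout residuals (actual minus predicted), as fractions
clean = [-0.02, -0.01, -0.01, 0.0, 0.0, 0.01, 0.01, 0.02, 0.0, -0.01]
with_outliers = clean + [0.30, 0.35]   # two genuinely extreme precincts

print("std   :", statistics.stdev(clean), "->", statistics.stdev(with_outliers))
print("robust:", robust_sigma(clean), "->", robust_sigma(with_outliers))
```

In this toy case the two extreme precincts inflate the sample standard deviation roughly tenfold while the MAD-based scale does not move, so a moderate anomaly elsewhere keeps the same z-score.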

At threshold z = 2.5, we expect about 1.2% of precincts to be flagged by chance under a normal distribution. In practice, the residuals are heavier-tailed than normal, so the empirical false positive rate at this threshold runs 2–3%.
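
The 1.2% figure is the two-sided tail probability of a standard normal at 2.5. A quick way to reproduce it, and to estimate how many chance flags to expect for a given number of precincts, is sketched below (the function names are ours, for illustration).

```python
import math


def two_sided_tail(z: float) -> float:
    """P(|Z| >= z) for a standard normal Z."""
    return math.erfc(z / math.sqrt(2))


def expected_chance_flags(n_precincts: int, z: float = 2.5) -> float:
    """Expected number of precincts flagged purely by chance under normality."""
    return n_precincts * two_sided_tail(z)


print(f"{two_sided_tail(2.5):.4f}")
print(f"{expected_chance_flags(1000):.1f} chance flags per 1,000 precincts")
```

Because the real residuals are heavier-tailed, this number is a lower bound on the workload analysts should plan for.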

Method 3: Reporting curve anomalies

Vote counts are reported incrementally on election night as precincts finish counting. The shape of the reporting curve (fraction of votes reported versus time since polls close) follows a predictable county-level pattern driven by the number of polling locations, counting equipment, and the mix of in-person, mail-in, and provisional ballots. Deviations from this pattern indicate either technical issues (reporting system delays) or procedural changes.

from statsmodels.tsa.arima.model import ARIMA, ARIMAResults
import numpy as np

def fit_reporting_curve(county_id: str) -> ARIMAResults:
    """Fit ARIMA model on historical reporting timeseries."""
    # Historical: fraction reported at each 10-minute mark
    # after polls close, for the same election type (presidential, midterm)
    historical = get_historical_reporting(county_id, election_type="general")
    # historical: list of arrays, each 72 elements (720 min / 10 min intervals)
    # We take the mean across past 3 cycles as the training series
    mean_curve = np.mean(historical, axis=0)

    model = ARIMA(mean_curve, order=(2, 1, 2))
    return model.fit()

def detect_reporting_anomaly(
    county_id: str,
    observed_so_far: list[float],
) -> dict:
    """
    Compare observed reporting curve to historical model.
    observed_so_far: fraction reported at each 10-min interval so far.
    """
    arima_fit = fit_reporting_curve(county_id)
    t = len(observed_so_far)

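    # Expected fraction at time t: in-sample prediction over the historical curve.
    # forecast() would extrapolate past the end of the 72-point training series,
    # so its output cannot be indexed by t.
    expected_curve = arima_fit.fittedvalues
    predicted_at_t = expected_curve[t - 1]
    residual_std = arima_fit.resid.std()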

    observed_t = observed_so_far[-1]
    z = (observed_t - predicted_at_t) / residual_std

    return {
        "county_id": county_id,
        "t_minutes": t * 10,
        "predicted_fraction": predicted_at_t,
        "observed_fraction": observed_t,
        "z_score": z,
        "flagged": abs(z) > 2.5,
    }

This method is the noisiest of the four. Reporting delays are common for entirely mundane reasons: a precinct with a broken scanner, a county with a large mail-in backlog, or simply a county that has historically reported slowly. We use it only as a secondary signal — a reporting anomaly that co-occurs with a turnout anomaly or a Benford anomaly is more worth investigating than a standalone reporting delay.

The ARIMA(2,1,2) specification was chosen by grid search over historical data across all 23 states. We compared AIC and BIC for (p,d,q) combinations from (0,0,0) to (3,2,3) and found that (2,1,2) minimized the information criteria on the county-level reporting curve data in the majority of cases.

Method 4: Campaign finance clustering

FEC filings are public and updated daily during election cycles. We ingest via the OpenFEC API and look for contribution patterns inconsistent with organic donor behavior.

Organic small-dollar donation campaigns show: distributions clustered around round numbers ($25, $50, $100, $250), geographic spread matching the candidate's known support base, and timing distributed across weeks and months. Coordinated or fraudulent contribution patterns can show: many contributions of identical amounts on the same day from different names at similar addresses, or contributions that exceed the individual limit when aggregated across multiple LLCs.

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def detect_contribution_clusters(committee_id: str) -> list[dict]:
    """
    Detect suspicious clusters in FEC contributions.
    Returns list of cluster summaries for human review.
    """
    df = fetch_fec_contributions(committee_id)

    # Features per contribution
    features = df[["amount", "zip_code_3digit", "day_of_cycle"]].copy()
    features["amount_log"] = np.log1p(features["amount"])
    features["zip_norm"] = features["zip_code_3digit"] / 999
    features["day_norm"] = features["day_of_cycle"] / 365

    scaler = StandardScaler()
    X = scaler.fit_transform(features[["amount_log", "zip_norm", "day_norm"]])

    # DBSCAN: density-based clustering, no need to specify k
    db = DBSCAN(eps=0.15, min_samples=10).fit(X)
    df["cluster"] = db.labels_

    suspicious = []
    for cid in set(df["cluster"]) - {-1}:
        cluster_df = df[df["cluster"] == cid]

        # Flag clusters with unusual characteristics
        amount_cv = cluster_df["amount"].std() / cluster_df["amount"].mean()
        if (
            amount_cv < 0.05          # near-identical amounts (CV < 5%)
            and len(cluster_df) >= 20  # at least 20 contributions
            and cluster_df["date"].nunique() <= 3  # spread over at most 3 distinct dates
        ):
            suspicious.append({
                "cluster_id": cid,
                "size": len(cluster_df),
                "total_amount": cluster_df["amount"].sum(),
                "date_range": f"{cluster_df['date'].min()} – {cluster_df['date'].max()}",
                "modal_amount": cluster_df["amount"].mode()[0],
                "zip_codes": cluster_df["zip_code_3digit"].unique().tolist(),
                "contributor_names": cluster_df["name"].tolist(),
            })

    return suspicious

In 2024, this method flagged two clusters across the 47 monitored races. Investigation found:

One cluster of 47 contributions of exactly $200 on the same day from different names — traced to a labor union coordinating contributions through an internal solicitation. Legal under FEC rules; the FEC record showed an itemized reporting exemption notation that we were not parsing correctly.
One cluster of 31 near-identical contributions that remain under investigation by the campaign's compliance team as of publication. We shared the data with a journalist covering campaign finance; the story was published separately.

Down-ballot undervote analysis

Undervotes — where a voter completes higher-office races but leaves a lower-office race blank — follow predictable patterns. Contested presidential races have lower undervote rates (2–5%) than uncontested or low-profile local races (10–20%).

from typing import Iterator

def analyze_undervote(precinct: Precinct) -> Iterator[dict]:
    """
    Compare current undervote rate to historical baseline.
    An anomalous increase can indicate ballot printing errors
    (candidates missing from some ballots), scanning calibration
    issues, or in rare cases ballot-stuffing in one race only.
    """
    pres_total = precinct.votes_by_race["president"]
    if pres_total == 0:
        return   # no top-of-ticket votes; undervote rate is undefined
    for race_id, race_votes in precinct.votes_by_race.items():
        if race_id == "president":
            continue

        undervote_rate = (pres_total - race_votes) / pres_total
        baseline = get_historical_undervote_rate(
            precinct_id=precinct.id,
            race_type=get_race_type(race_id),
        )

        delta = undervote_rate - baseline
        if abs(delta) > 0.10:   # >10pp shift from historical
            yield {
                "precinct_id": precinct.id,
                "race_id": race_id,
                "undervote_rate": undervote_rate,
                "baseline": baseline,
                "delta_pp": delta * 100,
            }

This method caught two anomalies in 2024. Both were in precincts with competitive local races (a contested sheriff race and a ballot measure about county tax rates) that drove unusual split-ticket voting. In retrospect, both are explicable from the local political context and are not worth flagging in future cycles — we have added precinct-specific overrides to the baseline table.

Triage workflow

The four methods run on a 5-minute polling cycle on election night. Results feed into a triage dashboard showing all active anomalies with severity ranking. Analysts classify each anomaly within 15 minutes of detection:

AnomalyStatus:
  PENDING       # Detected, not yet reviewed
  INVESTIGATING # Analyst actively following up with county official
  EXPLAINED     # Cause identified (data error, procedure change, etc.)
  PERSISTENT    # 4+ hours with no explanation — escalate to journalist
  FALSE_POSITIVE # Reclassified as not anomalous after review

// 2024 triage outcomes
Total flagged:    12
Explained:         9  (75%) — data errors (3), procedure changes (4), demographics (2)
False positives:   2  (17%) — competitive-race undervote, slow-reporting county
Persistent:        1  (8%)  — unexplained; likely noise, investigation ongoing
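
The status lifecycle above could be modeled roughly as follows. This is a minimal sketch under our own assumptions: the `Anomaly` dataclass, the helper names, and the timestamps are hypothetical, and only the status names and the 4-hour escalation rule come from the listing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class AnomalyStatus(Enum):
    PENDING = "pending"
    INVESTIGATING = "investigating"
    EXPLAINED = "explained"
    PERSISTENT = "persistent"
    FALSE_POSITIVE = "false_positive"


PERSISTENT_AFTER = timedelta(hours=4)   # escalation rule from the status listing


@dataclass
class Anomaly:
    anomaly_id: str
    detected_at: datetime
    status: AnomalyStatus = AnomalyStatus.PENDING


def escalate_if_stale(anomaly: Anomaly, now: datetime) -> AnomalyStatus:
    """Mark an unresolved anomaly PERSISTENT once it has been open for 4+ hours."""
    unresolved = {AnomalyStatus.PENDING, AnomalyStatus.INVESTIGATING}
    if anomaly.status in unresolved and now - anomaly.detected_at >= PERSISTENT_AFTER:
        anomaly.status = AnomalyStatus.PERSISTENT
    return anomaly.status


def outcome_shares(counts: dict[str, int]) -> dict[str, int]:
    """Whole-number percentage share of each outcome."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


print(outcome_shares({"explained": 9, "false_positive": 2, "persistent": 1}))
```

Resolved statuses (EXPLAINED, FALSE_POSITIVE) are deliberately left untouched by the escalation check, so an anomaly closed late in the night is never reopened by a timer.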

The 8% persistent rate is not concerning: a single unexplained anomaly across 47 races and 890 million data points, with no independent corroborating evidence, is consistent with statistical noise. If two or more independent methods flagged the same jurisdiction with the same directionality at the same time, that would warrant more aggressive escalation.
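
A corroboration rule of that kind might be sketched as below. The `Flag` record and the `corroborated` helper are hypothetical. The sketch assumes all flags passed in come from the same polling cycle, which stands in for "at the same time", and that each method can report a direction (higher or lower than expected), which is not true of every signal.

```python
from collections import defaultdict
from typing import NamedTuple


class Flag(NamedTuple):
    jurisdiction: str
    method: str      # e.g. "turnout", "reporting_curve", "undervote"
    direction: int   # +1 = higher than expected, -1 = lower than expected


def corroborated(flags: list[Flag], min_methods: int = 2) -> list[str]:
    """Jurisdictions flagged by at least `min_methods` distinct methods in the same direction."""
    methods_by_key: dict[tuple[str, int], set[str]] = defaultdict(set)
    for flag in flags:
        methods_by_key[(flag.jurisdiction, flag.direction)].add(flag.method)
    hits = {j for (j, _direction), methods in methods_by_key.items() if len(methods) >= min_methods}
    return sorted(hits)


flags = [
    Flag("County A", "turnout", +1),
    Flag("County A", "reporting_curve", +1),   # second method, same direction
    Flag("County B", "turnout", +1),
    Flag("County B", "reporting_curve", -1),   # second method, opposite direction
    Flag("County C", "turnout", +1),
    Flag("County C", "turnout", +1),           # same method twice does not count
]
print(corroborated(flags))
```

Counting distinct methods rather than raw flags matters here: repeated flags from one method are not independent evidence.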

What we do not do

We do not publish anomaly flags in real time. Publishing unverified anomaly flags during vote counting is harmful — it creates the appearance of evidence before investigation, and bad actors can amplify flagged anomalies as proof of fraud before they are explained. All anomalies are shared with credentialed journalists and election law researchers after at least 24 hours of internal investigation. Zero were deemed worth publishing in 2024.

We do not apply these methods to final certified results with the purpose of challenging them after certification. The methods are designed for pre-certification monitoring, where legitimate errors can still be corrected through normal processes. Post-certification challenges require legal standards of evidence that statistical anomaly detection does not meet.

For the social media NLP pipeline that feeds the sentiment signal used in turnout modeling: NLP pipeline for real-time sentiment analysis at scale →

For the Kafka/TimescaleDB infrastructure layer that handles the 890M data points: How we process 2.4M social-media posts per hour →

For how FEC contribution data integrates with the Federal Regulatory Data Hub: US Federal Regulatory Data Hub →

For how coordinated social media campaigns are detected and scored before they reach the election anomaly pipeline: Detecting coordinated inauthentic behavior in social media at scale →

For the data engineering backbone that feeds this pipeline — AP Election API, Kafka precinct results, state scraper normalization, and FIPS edge cases: Election data pipeline: AP feeds, Kafka precinct results, and state scraper normalization →

For the statistical methods that produce the anomaly signals this pipeline ingests — Benford's Law applicability gates, last-digit uniformity, turnout z-scores, and cross-validation requirements: Statistical anomaly detection for election integrity: Benford's Law, digit uniformity, and turnout modeling →

Election finance entity resolution covers the FEC committee identity pipeline — joint fundraising committees, legal-suffix normalization, and four-pass FEC matching — that this anomaly detection pipeline feeds into.