
Statistical anomaly detection for election integrity: Benford's Law, digit uniformity, and turnout modeling


Statistical tests do not detect fraud. They detect anomalies. An anomaly might be fraud, but it might equally be a precinct that reports late after a recount, a county with unusually homogeneous demographics, a data entry error, or an artifact of how a particular state's reporting system rounds its tallies. The pipeline described here generates signals; human analysts and cross-source validation determine whether a signal is meaningful. This distinction is not a legal disclaimer — it is the correct epistemic posture for election monitoring, and it has to be stated at the outset.

Misuse of statistical election analysis to assert fraud — claims that outrun the evidence — has caused measurable harm to public trust in democratic institutions. The four methods below are designed to surface anomalies worth investigating before certification, not to conclude anything about their cause. Every flagged signal enters a review queue and requires independent corroboration before any alert is issued.

Benford's Law: what it tests and when it applies

Benford's Law predicts that in naturally occurring numerical data spanning multiple orders of magnitude, the leading digit d appears with frequency log₁₀(1 + 1/d). The expected distribution for digits 1 through 9 is:

Digit     Expected frequency
─────────────────────────────
  1           30.1%
  2           17.6%
  3           12.5%
  4            9.7%
  5            7.9%
  6            6.7%
  7            5.8%
  8            5.1%
  9            4.6%

The law holds because the leading digit of a product or quotient of independent random variables tends toward this distribution regardless of the underlying distributions involved. In election data, it is most meaningful at precinct level within a large jurisdiction where vote counts span at least three orders of magnitude — precincts reporting anywhere from ~100 to ~100,000 votes.
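The convergence can be checked with a small simulation. The sketch below is illustrative only: the choice of eight uniform factors, the sample size, and the seed are arbitrary assumptions, not part of the pipeline. It multiplies independent draws and compares the leading-digit frequencies of the products against log₁₀(1 + 1/d).

```python
import math
import random

def leading_digit(x: float) -> int:
    """Leading decimal digit of a positive number, read from its scientific notation."""
    return int(f"{x:.17e}"[0])

def simulate_product_digits(n_samples: int = 20_000, n_factors: int = 8, seed: int = 0) -> list[float]:
    """Frequencies of leading digits 1-9 for products of independent uniform draws."""
    rng = random.Random(seed)
    counts = [0] * 10
    for _ in range(n_samples):
        product = 1.0
        for _ in range(n_factors):
            product *= 1.0 - rng.random()   # uniform on (0, 1], never zero
        counts[leading_digit(product)] += 1
    return [counts[d] / n_samples for d in range(1, 10)]

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
simulated = simulate_product_digits()
max_gap = max(abs(s - b) for s, b in zip(simulated, benford))
print(f"largest gap from Benford: {max_gap:.4f}")
```

None of the individual factors is Benford-distributed, yet the products land close to the expected frequencies. Precinct vote counts are not products of random factors in any literal sense, which is one reason the applicability conditions matter.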

Benford's Law does not apply in three common election scenarios. First, jurisdictions with uniform precinct sizes: if every precinct contains approximately 2,000 registered voters, the leading digit of vote counts will cluster around 1 and 2 regardless of any manipulation — the distribution is constrained by precinct design, not corrupted data. Second, jurisdictions with fewer than 200 precincts lack sufficient statistical power for chi-squared inference to be reliable. Third, data that has already been rounded or truncated will mechanically deviate from Benford even with perfectly accurate underlying counts. Skipping the applicability check and running the test anyway is the most common misuse of Benford in election analysis.
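The three exclusions can be expressed as a standalone pre-check that reports why a jurisdiction was skipped. This is a minimal sketch, not the pipeline's implementation: the 200-precinct and two-orders-of-magnitude cutoffs mirror the values used in run_benford_test, while the function name and the 50% "looks rounded" cutoff are assumptions made for illustration.

```python
import math

MIN_PRECINCTS = 200        # from the text: below this, chi-squared inference is unreliable
MIN_LOG10_RANGE = 2.0      # orders of magnitude required by run_benford_test
MAX_ROUNDED_SHARE = 0.5    # assumed cutoff for "looks pre-rounded"; not from the source

def benford_exclusion_reasons(vote_counts: list[int]) -> list[str]:
    """Return the reasons Benford should be skipped; an empty list means applicable."""
    positive = [v for v in vote_counts if v > 0]
    if not positive:
        return ["no positive counts"]
    reasons = []
    if len(positive) < MIN_PRECINCTS:
        reasons.append("too few precincts")
    if math.log10(max(positive)) - math.log10(min(positive)) < MIN_LOG10_RANGE:
        reasons.append("narrow count range")
    # Share of counts that are multiples of 10; about 10% is expected without rounding
    rounded_share = sum(1 for v in positive if v % 10 == 0) / len(positive)
    if rounded_share > MAX_ROUNDED_SHARE:
        reasons.append("counts look rounded")
    return reasons

# 300 precincts of roughly 2,000 voters each: plenty of data, but almost no spread
uniform_sizes = [1900 + (7 * i) % 200 for i in range(300)]
print(benford_exclusion_reasons(uniform_sizes))  # ['narrow count range']
```

Returning the reasons rather than a bare boolean keeps the exclusion decision auditable alongside the test result.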

BenfordTest implementation

import numpy as np
from scipy import stats
from dataclasses import dataclass

@dataclass
class BenfordResult:
    jurisdiction_fips: str
    n_precincts: int
    observed_frequencies: list[float]   # digits 1–9
    expected_frequencies: list[float]   # Benford expected
    chi2_statistic: float
    chi2_p_value: float
    ks_statistic: float
    ks_p_value: float
    is_applicable: bool                 # False if count range < 2 orders of magnitude or n < 200
    anomaly_flag: bool                  # True if p < 0.001 AND is_applicable

BENFORD_EXPECTED = [np.log10(1 + 1/d) for d in range(1, 10)]

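def run_benford_test(vote_counts: list[int], jurisdiction_fips: str) -> BenfordResult:
    n = len(vote_counts)

    # Leading digits are only defined for positive counts
    positive_counts = [v for v in vote_counts if v > 0]
    n_valid = len(positive_counts)
    if n_valid == 0:
        # Nothing to test: return a non-applicable record so the audit trail stays complete
        nan = float("nan")
        return BenfordResult(jurisdiction_fips, n, [0.0] * 9, list(BENFORD_EXPECTED),
                             nan, nan, nan, nan, False, False)

    # Applicability check
    log_range = np.log10(max(positive_counts)) - np.log10(min(positive_counts))
    is_applicable = bool(log_range >= 2.0 and n >= 200)

    # Extract leading digits
    leading_digits = [int(str(v)[0]) for v in positive_counts]

    # Observed frequencies, normalized by the number of counts actually tested
    digit_counts = np.bincount(leading_digits, minlength=10)[1:10]
    observed = digit_counts / n_valid
    expected = np.array(BENFORD_EXPECTED)

    # Chi-squared test (degrees of freedom = 8)
    chi2_stat, chi2_p = stats.chisquare(digit_counts, expected * n_valid)

    # Kolmogorov-Smirnov statistic on cumulative distributions
    obs_cdf = np.cumsum(observed)
    exp_cdf = np.cumsum(expected)
    ks_stat = float(np.max(np.abs(obs_cdf - exp_cdf)))
    # Asymptotic Kolmogorov p-value; approximate (conservative) for discrete digits
    ks_p = float(stats.kstwobign.sf(ks_stat * np.sqrt(n_valid)))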

    return BenfordResult(
        jurisdiction_fips=jurisdiction_fips,
        n_precincts=n,
        observed_frequencies=observed.tolist(),
        expected_frequencies=expected.tolist(),
        chi2_statistic=chi2_stat,
        chi2_p_value=chi2_p,
        ks_statistic=ks_stat,
        ks_p_value=ks_p,
        is_applicable=is_applicable,
        anomaly_flag=(chi2_p < 0.001 and is_applicable)
    )

We run both chi-squared and Kolmogorov-Smirnov because they are sensitive to different failure modes. Chi-squared accumulates evidence across all nine digit bins and is best for detecting systematic distributional shifts. KS measures the maximum deviation between cumulative distributions and is more sensitive to localized spikes — for example, an unexpected excess of leading 6s that leaves other bins close to expected. Requiring p < 0.001 (rather than the conventional 0.05) reduces false positives given how many jurisdictions the test runs against simultaneously.
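The arithmetic behind the stricter threshold is worth making explicit. The sketch below assumes independent tests and uses a made-up figure of 3,000 jurisdictions; the function names are illustrative.

```python
def expected_false_flags(n_jurisdictions: int, alpha: float) -> float:
    """Expected number of flags when every jurisdiction is clean and tests are independent."""
    return n_jurisdictions * alpha

def prob_at_least_one_false_flag(n_jurisdictions: int, alpha: float) -> float:
    """Chance that at least one clean jurisdiction is flagged."""
    return 1.0 - (1.0 - alpha) ** n_jurisdictions

N = 3_000  # hypothetical number of jurisdictions tested in one night
for alpha in (0.05, 0.001):
    print(alpha, expected_false_flags(N, alpha), round(prob_at_least_one_false_flag(N, alpha), 3))
```

Under these assumptions, the conventional 0.05 threshold would produce about 150 spurious flags and the 0.001 threshold about 3. Even at 0.001 the chance of at least one spurious flag stays high, which is why every flag still goes through review.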

The is_applicable flag gates the anomaly_flag. A jurisdiction that fails the applicability check still gets a result record, with is_applicable = False and anomaly_flag = False, so that the audit trail is complete and analysts can retrospectively verify which jurisdictions were excluded and why.

Last-digit uniformity test

In authentic vote counts, the last digit should be approximately uniformly distributed: each digit 0 through 9 appearing about 10% of the time. When results are rounded (by reporting systems that truncate to the nearest 10 or 100) or when numbers are fabricated by copying or incrementing, last digits cluster. This test is more robust than Benford for detecting small-scale irregularities in homogeneous precincts where vote count ranges are narrow and Benford does not apply.

The specific sub-check for 0 and 5 clustering captures rounding: any system that rounds to the nearest 10 will produce last digits of only 0, and any system rounding to the nearest 5 will produce last digits of only 0 or 5. Under uniformity, 0 and 5 together account for about 20% of last digits, so a zero-five fraction above 30% in a dataset of authentic integer counts is a reliable indicator of upstream rounding before publication.
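The effect of upstream rounding on the zero-five fraction is easy to reproduce. The sketch below uses a stand-in list of consecutive integers (not real vote counts) and simulates rounding to the nearest 10 and the nearest 5; the helper name is illustrative.

```python
def zero_five_fraction(vote_counts: list[int]) -> float:
    """Share of non-negative counts whose last digit is 0 or 5."""
    if not vote_counts:
        raise ValueError("vote_counts is empty")
    return sum(1 for v in vote_counts if v % 5 == 0) / len(vote_counts)

raw = list(range(1000, 1200))                      # stand-in counts with uniform last digits
to_nearest_10 = [10 * round(v / 10) for v in raw]  # simulated upstream rounding
to_nearest_5 = [5 * round(v / 5) for v in raw]

print(zero_five_fraction(raw))            # 0.2
print(zero_five_fraction(to_nearest_10))  # 1.0
print(zero_five_fraction(to_nearest_5))   # 1.0
```

Unrounded counts sit at the 20% baseline, and any rounding to a multiple of 5 or 10 drives the fraction to 100%, far past the 30% cutoff.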

@dataclass
class LastDigitResult:
    chi2_p_value: float
    zero_five_fraction: float
    anomaly_flag: bool

def run_last_digit_test(vote_counts: list[int]) -> LastDigitResult:
    last_digits = [v % 10 for v in vote_counts]
    observed = np.array([last_digits.count(d) / len(last_digits) for d in range(10)])
    expected = np.full(10, 0.1)

    chi2_stat, chi2_p = stats.chisquare(
        observed * len(last_digits), expected * len(last_digits)
    )

    # Specific clustering check: are >30% of digits 0 or 5? (rounding indicator)
    zero_five_fraction = (last_digits.count(0) + last_digits.count(5)) / len(last_digits)

    return LastDigitResult(
        chi2_p_value=chi2_p,
        zero_five_fraction=zero_five_fraction,
        anomaly_flag=(chi2_p < 0.01 or zero_five_fraction > 0.30)
    )

The last-digit test uses p < 0.01 rather than the stricter 0.001 threshold applied to Benford. Last-digit uniformity holds to a close approximation for genuine integer vote counts that are not very small, without the additional structural conditions Benford requires, so deviations are more diagnostic and warrant a more sensitive threshold. The tradeoff is a higher false positive rate for jurisdictions that legitimately round their published tallies, which is why the zero-five fraction is reported separately: a high value points to rounding as the likely cause.

Turnout anomaly detection

The turnout test compares reported turnout against a regression-predicted baseline derived from historical cycles. A reported value more than 3.5 residual standard deviations from the prediction, in either direction (|z| > 3.5), triggers an anomaly flag. The threshold is deliberately conservative relative to the textbook 3.0: competitive races, weather events, and demographic shifts all push individual precincts into the 3.0–3.5 range legitimately, and flagging all of them would bury genuine signals in noise.

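from dataclasses import dataclass
from typing import Protocol

class PrecinctResult(Protocol):    # interface inferred from the usage below
    fips: str
    turnout_pct: float

class TurnoutModel(Protocol):      # interface inferred from the usage below
    residual_std: float
    def predict(self, precinct: PrecinctResult) -> float: ...

@dataclass
class TurnoutAnomalyResult: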
    jurisdiction_fips: str
    reported_turnout_pct: float
    predicted_turnout_pct: float
    residual: float           # reported - predicted
    z_score: float
    anomaly_flag: bool        # abs(z_score) > 3.5

def compute_turnout_baseline(
    jurisdiction_fips: str,
    election_type: str,   # 'presidential', 'midterm', 'primary'
    historical_cycles: int = 3
) -> TurnoutModel:
    # Linear regression on historical turnout using features:
    # - historical_turnout_mean (same election type, last N cycles)
    # - registered_voters (from state voter rolls)
    # - early_voting_rate (from early ballot returns, where available)
    # - weather_score (composite of temperature + precipitation on election day)
    # - competitive_race_flag (top-of-ticket margin < 5% in polling)
    ...

def run_turnout_test(
    precincts: list[PrecinctResult], model: TurnoutModel
) -> list[TurnoutAnomalyResult]:
    results = []
    for precinct in precincts:
        predicted = model.predict(precinct)
        residual = precinct.turnout_pct - predicted
        z_score = residual / model.residual_std  # per-precinct z-score
        results.append(TurnoutAnomalyResult(
            jurisdiction_fips=precinct.fips,
            reported_turnout_pct=precinct.turnout_pct,
            predicted_turnout_pct=predicted,
            residual=residual,
            z_score=z_score,
            anomaly_flag=abs(z_score) > 3.5
        ))
    return results
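To make the z-score step concrete, here is a deliberately simplified stand-in for the regression model. It is an assumption-laden sketch: MeanBaselineModel predicts each precinct's turnout as the mean of its own past cycles and ignores the other features listed above, and the Precinct class, FIPS codes, and turnout figures are invented for illustration.

```python
import statistics
from dataclasses import dataclass

@dataclass
class Precinct:                           # hypothetical stand-in for PrecinctResult
    fips: str
    turnout_pct: float
    historical_turnout_pct: list[float]   # same election type, most recent cycles

class MeanBaselineModel:                  # hypothetical; the real model uses more features
    def __init__(self, training: list[Precinct]):
        # Residual spread: how far each past cycle sat from its precinct's own mean
        residuals = [
            t - statistics.mean(p.historical_turnout_pct)
            for p in training
            for t in p.historical_turnout_pct
        ]
        self.residual_std = statistics.pstdev(residuals)

    def predict(self, precinct: Precinct) -> float:
        return statistics.mean(precinct.historical_turnout_pct)

def turnout_z_score(precinct: Precinct, model: MeanBaselineModel) -> float:
    return (precinct.turnout_pct - model.predict(precinct)) / model.residual_std

precincts = [
    Precinct("00001", 62.0, [58.0, 61.0, 60.0]),
    Precinct("00002", 55.5, [54.0, 57.0, 55.0]),
    Precinct("00003", 71.0, [60.0, 62.0, 59.0]),   # far above its own history
]
model = MeanBaselineModel(precincts)
flags = {p.fips: abs(turnout_z_score(p, model)) > 3.5 for p in precincts}
print(flags)
```

Only the third precinct clears the 3.5 threshold. With three cycles per precinct the residual spread is a noisy estimate, so a production model would need more history or pooling across similar precincts.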

The weather_score feature deserves attention. Precipitation suppresses turnout by 1–3 percentage points depending on severity; temperature extremes at either end have a similar effect. Without this feature, any election held during a major weather event produces widespread anomaly flags in the affected region: a storm that suppresses turnout across an entire state looks like a coordinated anomaly if the model has no way to condition on it. We ingest NOAA station data the morning of each election and compute a composite score per precinct based on the nearest reporting station.
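The nearest-station lookup can be sketched with a great-circle distance. The station identifiers and coordinates below are made up, and the function names are illustrative; how the composite score itself is computed is not covered here.

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_station(precinct_latlon: tuple[float, float],
                    stations: dict[str, tuple[float, float]]) -> tuple[str, float]:
    """Return (station_id, distance_km) for the closest station."""
    lat, lon = precinct_latlon
    return min(
        ((sid, haversine_km(lat, lon, s_lat, s_lon)) for sid, (s_lat, s_lon) in stations.items()),
        key=lambda pair: pair[1],
    )

stations = {"STATION_A": (41.88, -87.63), "STATION_B": (39.77, -86.16)}  # made-up IDs
station_id, distance_km = nearest_station((41.60, -87.34), stations)
print(station_id, round(distance_km, 1))
```

A linear scan is fine for a few hundred stations. For a national station list, a spatial index would be the usual next step.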

The competitive_race_flag is equally important in the other direction. When a top-of-ticket race is within five points in polling, turnout runs 4–7 points higher than in non-competitive cycles, and the effect is not uniform across precincts. A model trained on historical non-competitive cycles will flag every high-turnout precinct in a competitive-race county. We encountered this in 2024: three of the four turnout flags turned out to be legitimate high turnout, in precincts within a competitive Senate race where the model underestimated the competitive-race effect at the precinct level. The fix, increasing the feature weight for competitive_race_flag in precincts with below-median historical variance, will be incorporated before the next cycle.

Non-monotonic reporting detection

Precinct-reported vote counts should never decrease as election night progresses: more ballots counted means more ballots reported. A decrease in reported votes from one update to the next is a data integrity event. It is not necessarily fraud (reporting systems correct data entry errors in real time, and a county that inputs 14,432 and then corrects it to 13,432 will produce a non-monotonic event), but it is always worth flagging for immediate review.

from datetime import datetime
from dataclasses import dataclass

@dataclass
class NonmonotonicEvent:
    timestamp: datetime
    previous_count: int
    current_count: int
    delta: int          # always negative for a true non-monotonic event

def detect_nonmonotonic_reporting(
    reporting_timeline: list[tuple[datetime, int]]  # (timestamp, votes_reported)
) -> list[NonmonotonicEvent]:
    events = []
    for i in range(1, len(reporting_timeline)):
        prev_time, prev_count = reporting_timeline[i-1]
        curr_time, curr_count = reporting_timeline[i]
        if curr_count < prev_count:
            events.append(NonmonotonicEvent(
                timestamp=curr_time,
                previous_count=prev_count,
                current_count=curr_count,
                delta=curr_count - prev_count
            ))
    return events
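The detector compares updates in list order, so it depends on the timeline already being sorted by timestamp. The variant below is a sketch that sorts first, on the assumption that updates can arrive out of order; the function name and the timestamps are invented, and the counts reuse the 14,432 to 13,432 correction described above.

```python
from datetime import datetime

def find_count_drops(timeline: list[tuple[datetime, int]]) -> list[tuple[datetime, int]]:
    """Return (timestamp, delta) for every update whose count is below the previous one."""
    ordered = sorted(timeline, key=lambda update: update[0])  # assumed: feed order is not guaranteed
    return [
        (curr_time, curr_count - prev_count)
        for (_, prev_count), (curr_time, curr_count) in zip(ordered, ordered[1:])
        if curr_count < prev_count
    ]

updates = [
    (datetime(2024, 11, 5, 21, 0), 12_950),
    (datetime(2024, 11, 5, 21, 2), 13_432),   # the correction is delivered first...
    (datetime(2024, 11, 5, 21, 1), 14_432),   # ...and the mistyped figure arrives after it
]
drops = find_count_drops(updates)
print(drops)
```

In delivery order the three counts only ever increase, so the drop would go unnoticed. Sorted by timestamp, the 1,000-vote correction at 21:02 is reported.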

Non-monotonic events are the fastest to confirm or dismiss. A quick call to the county clerk's election night hotline, which most states maintain, resolves the cause in minutes. In 2024, all non-monotonic events we observed were confirmed as real-time corrections by the reporting system within 20 minutes of detection. None required escalation. Detection latency is bounded by the feed cadence: because the check runs on every AP precinct result update (30–60 second cadence), a count drop appears in the review queue within 90 seconds of the corrected figure being published.

ElectionAnomalySignal and composite severity

Each test produces an ElectionAnomalySignal record that normalizes outputs across methods into a common schema. The composite severity score weights signals by their empirical reliability (turnout modeling contributes most, followed by last-digit, Benford, and non-monotonic detection) and applies a multiplier when multiple independent signals fire simultaneously on the same jurisdiction.

@dataclass
class ElectionAnomalySignal:
    jurisdiction_fips: str
    signal_type: str       # 'benford', 'last_digit', 'turnout', 'nonmonotonic'
    test_statistic: float
    p_value: float
    z_score: float | None
    severity: float        # 0–10 composite severity
    anomaly_flag: bool
    requires_validation: bool  # True if severity >= 6

def compute_composite_severity(signals: list[ElectionAnomalySignal]) -> float:
    # Weight by signal type reliability
    weights = {'benford': 0.2, 'last_digit': 0.35, 'turnout': 0.4, 'nonmonotonic': 0.05}
    if not signals:
        return 0.0
    weighted = sum(s.severity * weights.get(s.signal_type, 0.1) for s in signals)
    # Boost if multiple independent signals fire simultaneously
    n_signals = sum(1 for s in signals if s.anomaly_flag)
    multiplier = 1.0 + 0.3 * max(0, n_signals - 1)
    return min(weighted * multiplier, 10.0)

The weight assignments reflect calibrated reliability, not theoretical preference. Turnout modeling earns the highest weight (0.40) because it uses the richest feature set and has the most direct relationship to observable election behavior. Last-digit testing earns 0.35 because its null hypothesis (uniformity) holds to a close approximation for genuine vote counts and deviations are therefore highly diagnostic. Benford earns only 0.20 because its applicability conditions are frequently unmet and its theoretical grounding in election data is weaker. Non-monotonic detection earns 0.05 because it almost always resolves as a mundane correction; the signal is fast but low-information.

The co-occurrence multiplier — 30% severity boost per additional signal beyond the first — reflects a key statistical property: independent tests finding the same jurisdiction anomalous simultaneously is much less likely under the null hypothesis than any single test flagging it. A jurisdiction with a Benford anomaly and a turnout anomaly at the same time is materially more interesting than either alone. The cap at 10.0 prevents the multiplier from inflating severity beyond the scale.
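A worked example shows how the weights and the multiplier combine. The sketch below restates the scoring arithmetic over plain tuples so it runs on its own; the severities are invented inputs.

```python
WEIGHTS = {"benford": 0.2, "last_digit": 0.35, "turnout": 0.4, "nonmonotonic": 0.05}

def composite(signals: list[tuple[str, float, bool]]) -> float:
    """signals: (signal_type, severity 0-10, anomaly_flag). Mirrors compute_composite_severity."""
    weighted = sum(severity * WEIGHTS.get(kind, 0.1) for kind, severity, _ in signals)
    n_flagged = sum(1 for _, _, flagged in signals if flagged)
    return min(weighted * (1.0 + 0.3 * max(0, n_flagged - 1)), 10.0)

print(round(composite([("turnout", 8.0, True)]), 2))                          # 3.2
print(round(composite([("turnout", 8.0, True), ("benford", 7.0, True)]), 2))  # 5.98
print(round(composite([("turnout", 10.0, True)]), 2))                         # 4.0
```

One consequence of these weights is that a single signal tops out at 4.0 even at maximum severity. If the composite is the value compared against the requires_validation cutoff of 6, only co-occurring signals can reach it.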

Cross-validation requirement

A statistical anomaly alone never triggers a public alert. All signals with anomaly_flag = True are written to an election_anomaly_review_queue table and held there until at least one of the following conditions is met:

  • Social media correlation: elevated volume of posts from that jurisdiction mentioning "recount", "irregularity", or "fraud" using the OSINT pipeline's entity extraction, with a cluster formation timestamp predating the statistical flag (ruling out post-flag amplification)
  • Media coverage: AP, Reuters, or state election officials have independently confirmed the anomaly exists and are investigating
  • OSCE or international election observer report, for monitored international elections
  • Analyst manual review and sign-off — elections team SLA during active election night is two hours from signal creation to disposition

The social media correlation check is the most operationally useful during active election nights. When a statistical anomaly is genuine, organic conversation about it appears within minutes in the affected jurisdiction — local journalists, poll workers, party observers, and voters all post in real time. When a statistical anomaly is a data entry error that gets corrected within the hour, social media in that jurisdiction shows nothing unusual. This cross-validation step has a lower false positive rate than any of the statistical tests individually.
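The hold-and-release rule can be sketched as a predicate over a queue row. Everything named below (QueuedSignal, its fields, the helper functions) is hypothetical and stands in for whatever the review queue actually stores. The part taken from the text is the logic: social media corroboration only counts when the cluster formed before the statistical flag, and the analyst SLA is two hours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ANALYST_SLA = timedelta(hours=2)   # from the text: signal creation to disposition

@dataclass
class QueuedSignal:                # hypothetical shape of a review-queue row
    flagged_at: datetime
    social_cluster_formed_at: Optional[datetime] = None
    media_confirmed: bool = False
    observer_report: bool = False
    analyst_signed_off: bool = False

def corroboration(sig: QueuedSignal) -> list[str]:
    """Conditions currently satisfied; an empty list means the signal stays held."""
    met = []
    # Social chatter only counts if the cluster formed before the statistical flag
    if sig.social_cluster_formed_at is not None and sig.social_cluster_formed_at < sig.flagged_at:
        met.append("social_media")
    if sig.media_confirmed:
        met.append("media")
    if sig.observer_report:
        met.append("observer")
    if sig.analyst_signed_off:
        met.append("analyst")
    return met

def sla_breached(sig: QueuedSignal, now: datetime) -> bool:
    """True if no analyst disposition has been recorded within the two-hour SLA."""
    return not sig.analyst_signed_off and now - sig.flagged_at > ANALYST_SLA

flag_time = datetime(2024, 11, 5, 22, 15)
early = QueuedSignal(flag_time, social_cluster_formed_at=flag_time - timedelta(minutes=12))
late = QueuedSignal(flag_time, social_cluster_formed_at=flag_time + timedelta(minutes=5))
print(corroboration(early))  # ['social_media']
print(corroboration(late))   # []
```

The second signal stays held because its social media cluster formed after the flag, which is the post-flag amplification case the timestamp comparison is meant to rule out.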

2024 validation results

Across 47 monitored races in 2024, the four tests produced the following outcomes against ground truth established through post-election official records:

Method              TP    FP    Notes
────────────────────────────────────────────────────────────────────────
Benford              3     1    FP: small rural county, 80 precincts —
                               applicability check should have excluded it.
                               All 3 TPs were data entry errors corrected
                               by state officials the following day.
Last-digit           2     0    Both TPs were rounding artifacts from
                               an early-reporting system that truncated
                               to nearest 10 before publication.
Turnout              4*    —    3 of 4 were legitimate high-turnout rural
                               precincts (model underestimated competitive-
                               race effect). 1 confirmed as data entry error.
Non-monotonic        —     —    All events resolved as real-time corrections.

TP = flag traced to a confirmed data issue; FP = flag with no underlying data issue.
* Turnout: 4 flags raised in total. Per the notes, 1 was a confirmed data entry error and 3 were legitimate high turnout.

False positive rate across all tests (applicable precincts): 0.7%
Fraud detections: 0

The single Benford false positive, a county with 80 precincts, was traced to a bug in the input to the applicability check. The code correctly requires n >= 200, but a preprocessing step that aggregated multi-precinct reporting units had inflated the apparent precinct count to 214. The fix, running the applicability check on raw precinct records rather than aggregated reporting units, was deployed before the general election.

The turnout model's three legitimate-but-flagged rural precincts are the more instructive failure. The competitive-race effect is larger in rural precincts (where candidate-specific mobilization efforts have more concentrated impact) than the model's single competitive_race_flag feature captures. Replacing the binary flag with a continuous variable — polling margin, interacted with a rural/urban indicator — is the planned improvement.

The most important finding: none of the 47 races showed statistical signals consistent with large-scale vote manipulation. This is the expected result. Large-scale manipulation would be extremely difficult to execute without producing anomalies that span multiple test types simultaneously, and no jurisdiction produced that pattern.

Performance and infrastructure

Running all three primary tests (Benford, last-digit, turnout) on a county with 2,000 precincts takes under 50ms using NumPy-vectorized operations. The full pipeline executes on each AP precinct result update, which arrives on a 30–60 second cadence, so tests run approximately 60–120 times per county per hour on election night.

All test results — including non-flagged runs — are persisted to TimescaleDB with millisecond timestamps. The complete audit trail is queryable: for any jurisdiction at any point during the night, analysts can retrieve every test result, every observed frequency, every z-score, and every disposition. This traceability is not optional. Post-election challenges, academic analysis, and press inquiries all require the ability to show exactly what the system saw and when.

The review queue runs as a separate service with a PostgreSQL LISTEN/NOTIFY trigger: when a record with anomaly_flag = True is inserted into election_anomaly_review_queue, an alert fires to the on-call analyst via PagerDuty. Average time from AP update to analyst notification during the 2024 general election: 23 seconds.


For the data engineering backbone that feeds precinct results into these tests — AP Elections API ingestion, Kafka routing, state scraper normalization, and FIPS crosswalk: Election data pipeline: AP feeds, Kafka precinct results, and state scraper normalization →

For the full anomaly detection system that consumes these statistical signals — XGBoost turnout modeling with SHAP, ARIMA reporting curves, and the triage workflow: Detecting election anomalies using statistical methods →

For the social media cross-validation layer — coordinated campaign detection that provides independent corroboration of statistical signals: Detecting coordinated inauthentic behavior in social media at scale →

For the 58M-posts-per-day pipeline that provides the social media corroboration used in the cross-validation step: Social media ingestion at 58M posts per day →