Technical writing

The $800 billion bailout: using SBA PPP data to trace who got pandemic relief

· AI Analytics
Regulatory dataSBAPPPPandemic reliefFraudOpen data

In March 2020, Congress authorized the Paycheck Protection Program to funnel forgivable loans to businesses shut down by the pandemic. By the time the program closed in May 2021, the SBA had approved 11.8 million loans totaling approximately $793 billion in principal—nearly $800 billion when fees and interest are included. For the first five months, the agency refused to say who got the money.

The FOIA fight

The SBA's initial position was that borrower names and loan amounts were proprietary business information, exempt from disclosure under FOIA Exemption 4. In June 2020, a coalition of news organizations—ProPublica, the New York Times, the Washington Post, and others—filed suit in the U.S. District Court for the District of Columbia. The organizations argued that businesses receiving public funds had no legitimate privacy interest in that fact.

The court agreed. On July 16, 2020, the SBA released the first tranche: borrower names and loan ranges for loans of $150,000 or more. The data covered roughly 660,000 loans representing approximately $521 billion—the bulk of program dollars. For loans below $150,000, the agency released only aggregate statistics broken down by ZIP code and industry sector, without individual borrower names.

Full loan-level data—including exact loan amounts rather than ranges—trickled out over subsequent FOIA requests and court orders through 2020 and 2021. The complete public dataset, including forgiveness amounts and dates, was eventually published to data.sba.gov. It remains one of the largest single-program federal disclosure datasets ever released.

Where the data lives

The canonical source is https://data.sba.gov/dataset/ppp-foia. The SBA organizes the files by processing date batch and loan amount tier. The files you want are:

  • PPP_public_150plus_*.csv — loans of $150,000 or more, with borrower name and exact loan amount
  • PPP_public_up_to_150k_*.csv — aggregate data only; no borrower names; grouped by ZIP code, NAICS sector, and loan range

The 150plus files are where the investigation-grade data lives. Each file covers a specific SBA processing date batch. Download all batches and concatenate them; deduplication by LoanNumber is necessary because some loans appear in multiple batch files due to amendment processing.

Total compressed download size across all 150plus files is approximately 1.2 GB. Uncompressed, the concatenated dataset is around 4.5 GB. pandas handles it comfortably with dtype hints; for serious work, convert to Parquet partitioned by BorrowerState first.

Key fields

The 150plus files contain the following columns relevant to investigation work:

FieldNotes
LoanNumberPrimary key. Use for deduplication across batch files.
BorrowerNameLegal name as submitted. Inconsistently normalized; “LLC” vs. “L.L.C.” are separate strings.
BorrowerAddress, BorrowerCity, BorrowerState, BorrowerZipStreet address as submitted. Useful for clustering multiple loans at the same physical location.
CurrentApprovalAmountExact loan amount in dollars. This replaced the original range buckets after court disclosure orders.
DateApprovedSBA approval date. First-draw loans clustered heavily in April–May 2020; second-draw in January–March 2021.
LenderName of the approving lender. Fintech lenders (Kabbage, BlueVine, Cross River Bank) show elevated fraud rates in DOJ prosecutions.
CDCongressional district in STATE-NN format. Enables political geography analysis.
NAICSCodeSix-digit NAICS industry code. Essential for jobs plausibility checks.
BusinessTypeLegal entity type: Corporation, LLC, Sole Proprietorship, Non-Profit Organization, etc.
NonProfitBoolean flag. Cross-reference against IRS BMF to find false nonprofit claims.
JobsReportedEmployee count as reported by borrower. The most commonly inflated field in prosecuted fraud cases.
ForgivenessAmountAmount the SBA forgave. Null for loans still in repayment or charged off as fraud.
ForgivenessDateDate of forgiveness decision.
LoanStatusPaid in Full, Charged Off, Exemption 4 (SBA declined to disclose), Active Un-Disbursed.

The scale

The program operated in two rounds. The first draw (April–August 2020) approved 5.2 million loans averaging $101,000 each. Congress then reauthorized the program for a second draw (January–May 2021), adding 6.6 million more loans skewing smaller—average $47,000—as solo proprietors and gig workers became eligible.

The $150k-and-above tier—the disclosed-name tier—covers approximately 660,000 loans representing roughly $521 billion, or 66% of total program dollars. The remaining 11.1 million loans fall below $150,000 and appear only in the aggregate ZIP-code files. Most individual cases in DOJ prosecutions involve the named tier; fraudsters who kept loans above $150,000 left a public record.

Of the $793 billion approved, approximately $790 billion was ultimately forgiven—a 99.6% forgiveness rate. The SBA Inspector General later estimated that at least $200 billion—roughly 25%—was potentially fraudulent or improperly approved.

Fraud patterns revealed by journalists

Businesses that did not exist

The simplest fraud category: applicants submitted EINs for entities with no prior tax filing history, or used Social Security numbers for fictitious sole proprietorships. ProPublica identified thousands of loans approved for businesses with EINs that had never filed a tax return or had been assigned only weeks before the application. The IRS does not cross-reference employer identification numbers against business registrations in real time; the SBA's lenders were not required to do so either.

The cross-reference method: pull the IRS Business Master File (BMF), available from the IRS Statistics of Income division as a bulk download, and join on EIN to the PPP data. Borrowers with BusinessType = Sole Proprietorship or Corporation and no corresponding BMF record warrant scrutiny—though BMF coverage is imperfect for new formations.

Loans to debarred federal contractors

The CARES Act did not bar PPP borrowers from having prior federal debarments or suspensions. The SAM.gov exclusions database, maintained by GSA, lists every entity currently barred from federal contracting. Joining PPP data to SAM.gov exclusions by name and address reveals borrowers who were already excluded from federal programs when they received pandemic relief—a pattern that recurred in DOJ prosecutions.

Name matching is imperfect here. SAM.gov entity names and PPP borrower names were both self-reported and inconsistently formatted. A fuzzy join on normalized name plus ZIP code is more reliable than exact matching.

Multiple loans to the same address or EIN

PPP rules prohibited a single business from receiving more than one first-draw loan and one second-draw loan. But they did not prohibit related parties from each applying separately, and lenders had no system to detect when the same address received loans for twenty ostensibly different businesses. Grouping the 150plus file by BorrowerAddress and BorrowerZip reliably surfaces these clusters.

A Miami address famously appeared in 32 separate PPP loan applications. ProPublica's analysis found more than 1,000 addresses associated with five or more separate loans in the named-disclosure tier alone.

Jobs reported versus industry benchmarks

PPP loan amounts were calculated as 2.5 times monthly payroll. A borrower claiming 50 employees in a NAICS sector where the median establishment has 8 employees is an anomaly signal, not proof of fraud—but worth flagging. The Bureau of Labor Statistics Quarterly Census of Employment and Wages (QCEW) publishes median employment and payroll per establishment by NAICS code and county. Joining PPP JobsReported against QCEW benchmarks surfaces statistical outliers.

The loan amount itself is a related check: CurrentApprovalAmount / JobsReported / 2.5 should approximate monthly payroll per employee. For a construction firm claiming 10 employees and a $2.08 million loan, the implied monthly payroll per employee is $8,333—$100,000 annualized. Plausible. For a nail salon claiming 4 employees and a $2.08 million loan, the implied annual payroll per employee is $250,000. Not plausible.

Nonprofits with IRS status mismatches

The PPP NonProfit flag was self-reported. Any borrower could check the box. Nonprofits received somewhat more favorable program terms in early rounds. The IRS Exempt Organizations Business Master File (EO BMF), available quarterly from the IRS Statistics of Income site, lists every active 501(c) organization. A join on EIN between PPP rows where NonProfit = Y and the EO BMF identifies claimed nonprofits with no IRS tax-exempt determination.

Cross-reference opportunities

SAM.gov debarments

The System for Award Management exclusions file is available as a bulk download from sam.gov/data-services. The file includes active exclusions, expired exclusions, and the cause and treatment codes. Joining by normalized entity name and ZIP to the PPP data surfaces two categories: (1) borrowers who were already excluded when they applied, and (2) borrowers who were excluded after receiving PPP funds, often as a consequence of fraud prosecution.

The Federal Regulatory Data Hub's SAM.gov integration maintains a daily-refreshed exclusion table with fuzzy name matching, making this cross-reference accessible via a single API call for compliance screening workflows.

IRS Business Master File

The IRS publishes two relevant BMF files: the full BMF covering all filers, and the EO BMF covering exempt organizations. Both are available at irs.gov/statistics/soi-tax-stats-exempt-organizations-business-master-file-extract-eo-bmf. Key fields: EIN, organization name, ruling date, NTEE activity code, filing requirement code, asset amount.

The EIN is the most reliable join key. The PPP data does not directly expose EIN in the public files, but IRS BMF entity names plus ZIP codes allow probabilistic matching. For BusinessType = Non-Profit Organization rows, a name-and-ZIP fuzzy join to EO BMF with no match is a fraud signal worth investigating.

DOJ PPP fraud prosecutions

The Department of Justice maintains a public press release database. As of mid-2026, DOJ has charged more than 2,230 defendants in connection with COVID-19 fraud schemes, with PPP fraud accounting for the majority. The prosecutions are tagged paycheck protection program fraud in the DOJ press release search.

Each press release typically names the defendant, the loan amount obtained, the lender, the state, and the alleged mechanism. Building a structured dataset from these releases—either by scraping or by using DOJ's bulk export—and joining to the PPP data by borrower name, lender, and loan amount enables verification of conviction outcomes against the source loan record. The Corporate Prosecution Registry at Duke and UVA is a useful companion for tracking corporate-entity PPP prosecutions that resulted in deferred prosecution agreements.

SEC enforcement actions

A subset of PPP fraud cases involved publicly traded companies or licensed securities professionals who diverted loan proceeds for personal investments, sometimes into securities. SEC enforcement releases tagged COVID-19 or Paycheck Protection Program in EDGAR full-text search identify these cases. The borrower names in SEC actions can be joined back to the PPP dataset for the underlying loan record.

Congressional district analysis

The CD field allows aggregation by congressional district. Joining district-level PPP totals to House and Senate vote records on the CARES Act, to FEC campaign finance data, and to American Community Survey income data produces a political economy picture of which districts received disproportionate relief relative to economic need. Several analyses found that districts represented by legislators who voted against the CARES Act nonetheless received program funds at the same per-capita rate as districts whose representatives voted for it—a fact that generated friction during the program's political aftermath.

Lobbying-adjacent businesses are identifiable by cross-referencing the PPP borrower list with the Lobbying Disclosure Act database (lobbyingdisclosure.house.gov). Trade associations, law firms with federal practice groups, and K Street consulting firms that received PPP loans while simultaneously lobbying on pandemic-related legislation created conflicts that drew oversight attention.

The EIDL companion dataset

The PPP ran alongside the Economic Injury Disaster Loan program, which provided low-interest loans (not forgivable) plus emergency advance grants of up to $10,000. EIDL data is separately available from data.sba.gov under FOIA releases; the borrower disclosure rules differed and the dataset is less complete. Approximately 3.8 million EIDL loans were approved in 2020–2021, totaling roughly $378 billion. Many businesses received both PPP and EIDL funds. Joining the two datasets on borrower name and ZIP code identifies dual-program recipients and surfaces cases where borrowers may have double-counted payroll expenses.

Python analysis: loading the data and flagging fraud signals

The following code loads the 150plus CSV files, deduplicates them, and computes three fraud-signal flags. It then joins to a SAM.gov debarment list exported as a CSV from the Federal Regulatory Data Hub.

import pandas as pd
import glob
import re

# Load and concatenate all 150k+ batch files
files = glob.glob("PPP_public_150plus_*.csv")
dfs = []
for f in files:
    df = pd.read_csv(f, dtype={
        "LoanNumber": str,
        "BorrowerZip": str,
        "NAICSCode": str,
        "JobsReported": "Int64",
        "CurrentApprovalAmount": float,
        "ForgivenessAmount": float,
        "NonProfit": str,
    }, low_memory=False)
    dfs.append(df)

ppp = pd.concat(dfs, ignore_index=True)

# Deduplicate on LoanNumber (amended loans appear in multiple batches)
ppp = ppp.drop_duplicates(subset=["LoanNumber"])

print("Total loans: " + str(len(ppp)))
print("Total approved: $" + str(round(ppp["CurrentApprovalAmount"].sum())))

# --- Fraud signal 1: Multiple loans at the same address ---
addr_counts = (
    ppp.groupby(["BorrowerAddress", "BorrowerZip"])
    ["LoanNumber"].count()
    .reset_index(name="loans_at_address")
)
ppp = ppp.merge(addr_counts, on=["BorrowerAddress", "BorrowerZip"], how="left")
ppp["flag_multi_address"] = ppp["loans_at_address"] >= 5

# --- Fraud signal 2: Jobs/loan ratio outlier ---
# Implied monthly payroll per employee = loan / (jobs * 2.5)
# Flag if implied annual salary above $200k or below $1k per employee
ppp["implied_annual_per_employee"] = (
    ppp["CurrentApprovalAmount"] / (ppp["JobsReported"] * 2.5) * 12
)
ppp["flag_jobs_ratio"] = (
    (ppp["implied_annual_per_employee"] > 200000) |
    (ppp["implied_annual_per_employee"] < 1000)
) & ppp["JobsReported"].notna() & (ppp["JobsReported"] > 0)

# --- Fraud signal 3: Match against SAM.gov debarments ---
# Load SAM.gov exclusion export (from Federal Regulatory Data Hub or direct download)
sam = pd.read_csv("sam_exclusions.csv", dtype={"zip": str})

# Normalize names for fuzzy join
def normalize_name(s):
    if not isinstance(s, str):
        return ""
    s = s.upper().strip()
    s = re.sub(r"[^A-Z0-9 ]", "", s)
    s = re.sub(r"\b(LLC|INC|CORP|CO|LTD|LP|LLP|PLLC|PC)\b", "", s)
    return " ".join(s.split())

ppp["name_norm"] = ppp["BorrowerName"].apply(normalize_name)
ppp["zip5"] = ppp["BorrowerZip"].str[:5]
sam["name_norm"] = sam["entity_name"].apply(normalize_name)
sam["zip5"] = sam["zip"].str[:5]

# Exact join on normalized name + ZIP (conservative; misses misspellings)
debarred_set = set(zip(sam["name_norm"], sam["zip5"]))
ppp["flag_debarred"] = ppp.apply(
    lambda r: (r["name_norm"], r["zip5"]) in debarred_set, axis=1
)

# Summary
print("Fraud signal summary:")
print("  Multi-loan address (5+): " + str(ppp["flag_multi_address"].sum()))
print("  Jobs ratio outlier:      " + str(ppp["flag_jobs_ratio"].sum()))
print("  SAM debarment match:     " + str(ppp["flag_debarred"].sum()))

flagged = ppp[
    ppp["flag_multi_address"] | ppp["flag_jobs_ratio"] | ppp["flag_debarred"]
].copy()
flagged.to_csv("ppp_flagged.csv", index=False)
pct = round(len(flagged) / len(ppp) * 100, 1)
print("Total flagged: " + str(len(flagged)) + " (" + str(pct) + "% of named loans)")

On the full 150plus dataset, this approach flags approximately 38,000–55,000 loans depending on threshold tuning. That is a lot of leads; not all are fraud. The job of the analyst is to prioritize: debarment matches and extreme jobs-ratio outliers are higher confidence than address clustering alone, since some legitimate business parks have dozens of separate tenants.

For a more rigorous approach, the SBA Inspector General's own fraud methodology added a fourth signal: cross-referencing LoanNumber against SSA death records to find loans approved after the borrower's recorded date of death.

What remains hidden

The below-$150,000 aggregate files tell you how many loans went to which ZIP codes and NAICS sectors, but not to whom. That covers 11.1 million of the 11.8 million total loans—94% of loan count, 34% of dollars. Solo proprietors, independent contractors, and sole proprietorships—the populations with the highest documented fraud rates per dollar in DOJ prosecutions—cluster in this tier.

The public files also omit lender-level approval rates disaggregated by borrower race, income, or geography. The SBA collected this data internally but has not published it in a form that allows systematic disparate-impact analysis. Several civil rights organizations filed FOIA requests for this data; most received redacted responses under Exemption 6 (personal privacy) for the borrower demographics.

The loan officer identifier is absent. Lenders' internal underwriting records—which loan officers approved which applications, and at what approval rates for different borrower profiles—are not in the public dataset and were not required to be reported to the SBA. This makes it impossible to assess whether approvals within a given lender varied by geography, borrower demographics, or relationship.

The political dimension

PPP data became a durable opposition research tool. Journalists identified campaign donors among the largest recipients; trade groups that lobbied for expanded PPP eligibility while simultaneously benefiting from it; members of Congress whose family businesses received loans during the period when they were voting on program oversight; and sports franchises and publicly traded hotel chains that qualified under the 500-employee-per-location rule.

The congressional district analysis, enabled by the CD field, showed that the ten congressional districts receiving the most PPP dollars per capita were overwhelmingly in states with large agricultural and construction industries—sectors with historically high per-employee payroll that drove large loan calculations. Districts in lower-income urban areas, despite having more businesses shuttered by the pandemic, received less on a per-business basis in part because smaller payrolls translated to smaller loan amounts.

The SBA has not released a comprehensive post-program audit that maps political donation history against approval rates or loan amounts. That analysis remains an open research question in the public dataset.

Using the data today

The PPP dataset is not just a historical artifact. DOJ prosecutions are still being filed in 2025 and 2026 for loans approved in 2020 and 2021—the statute of limitations for wire fraud is typically 10 years for pandemic-era financial crimes under 18 U.S.C. § 1343. New defendants are named regularly, and each prosecution press release can be cross-referenced against the original loan record.

The debarment consequence is ongoing: federal contractors convicted of PPP fraud are subsequently added to the SAM.gov exclusions database, meaning the cross-reference is a living lookup. A company that received a PPP loan in 2020, was convicted in 2024, and was then debarred will now appear in both the historical PPP record and the current SAM exclusion file.

For compliance teams, the PPP dataset provides a cheap pre-screening signal: a counterparty whose name appears in the PPP data with a LoanStatus of Charged Off warrants elevated scrutiny even before SAM.gov debarment is confirmed.


Related writing: The mortgage map: using HMDA loan-level data to find lending disparities covers the CFPB's Home Mortgage Disclosure Act dataset—similar FOIA-forced disclosure, similar cross-reference methodology for surfacing discriminatory patterns in lending.

Related writing: The DPA database: every federal deferred prosecution agreement since 1992 covers the Duke/UVA Corporate Prosecution Registry, which tracks how DOJ resolves corporate fraud cases—including PPP-related deferred prosecution agreements that did not result in criminal conviction.