In the spring of 2020 the American economy stopped, and Congress answered with the largest small-business lending program in the country's history. In a matter of weeks the Small Business Administration stood up a channel that pushed hundreds of billions of dollars in forgivable loans through thousands of private banks, credit unions, and fintech lenders and into the payrolls of restaurants, dentists, churches, sole proprietors, and Fortune-500 subsidiaries alike. Then, after a fight over disclosure, the SBA published the result loan by loan: roughly 11.8 million records, one row per loan, naming who borrowed, who lent, how much, for what kind of business, and whether the loan was ultimately forgiven—the most granular public account of a federal emergency program ever released, fraud and all.
This article covers what the PPP loan-level dataset is and how the CARES Act and the SBA's 7(a) authority frame it; the loan-sizing formula (2.5 times average monthly payroll, the food-service exception, and the caps) and the two rounds the program ran in; the central and contested role of the private lenders, from the largest banks to fintech originators like Kabbage and BlueVine, who actually made the loans; the FOIA litigation that forced the SBA to release borrower names and exact loan amounts; the forgiveness pipeline and the 60-percent payroll rule that determines whether a loan becomes a grant; how fraud became a defining feature of the program and what the Inspector General and GAO estimated; how the dataset joins to other federal records through NAICS, geography, and entity name; a Python workflow that pulls the public SBA bulk files and computes lender volume, sector totals, and forgiveness rate by state; and the caveats (self-reported fields, the jobs-supported number, lender-of-record ambiguity, and entity resolution) that every analyst must internalize before drawing conclusions.
What the dataset is
The Paycheck Protection Program loan-level data is the SBA's public, row-per-loan record of every loan approved under the program. It is not a summary or a sample: it is the full administrative dataset, released after a disclosure fight described below, covering both rounds of the program from its launch in April 2020 through its closure in 2021. Each record represents a single loan made by a single lender to a single borrower, and the file carries the operational facts of that loan—the borrower's identity and location, the industry it operates in, the amount approved, the lender that originated and serviced it, the number of jobs the borrower reported the loan would support, and, crucially, the forgiveness amount and status that records whether the loan was ultimately converted from a debt into a grant. Surfaced through the SBA's open-data portal, the loan-level record comprises roughly 11.8 million rows.
In our database this record is stored as the table sba_ppp_loans, with the grain of one row per loan: a borrower that took a loan in the first round and a Second Draw loan in 2021 contributes two rows. The columns capture who borrowed, who lent, how much, for what, and what became of the loan:
loan_number -- SBA loan identifier (one per loan)
borrower_name -- the borrowing business or individual
borrower_address -- street, city, state, ZIP of the borrower
naics_code -- 6-digit industry code for the borrower
business_type -- corporation, LLC, sole proprietor, nonprofit, etc.
draw_type -- first draw (2020) vs second draw (2021)
initial_approval_amount -- amount approved at origination
current_approval_amount -- amount after any later adjustment
approval_date -- date SBA approved the loan
originating_lender -- the lender that made the loan
servicing_lender -- the lender servicing it (may differ)
jobs_reported -- jobs the borrower said the loan would support
forgiveness_amount -- dollars forgiven (0 if not / not yet forgiven)
forgiveness_date -- date forgiveness was paid by SBA
-- demographic fields (race, gender, veteran, ethnicity): often "Unanswered"
The load-bearing columns are the amount fields, the lender fields, and the forgiveness fields. The dataset distinguishes the initial approval amount from the current approval amount because some loans were later adjusted or partially cancelled, so the two can differ and an analyst must decide which to sum. The originating lender and servicing lender are recorded separately because loans were frequently transferred (a fintech might originate a loan that a bank later serviced), which matters enormously for any analysis that attributes lending behavior to a named institution. The forgiveness amount is the field that converts a loan record into a measure of program outcome: PPP loans were designed to be forgiven if the money was spent correctly, so a forgiveness amount greater than zero means the loan effectively became a grant. Finally, the demographic fields (race, gender, veteran status, ethnicity) are present in the schema but were optional for borrowers to complete, and in practice the large majority are blank or marked “Unanswered,” a limitation the caveats section returns to because it cripples any straightforward equity analysis of the program.
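The choice between the two amount columns can be made concrete with a small sketch. The rows below are invented for illustration; only the column names come from the sba_ppp_loans schema above.

```python
import pandas as pd

# Invented example rows using the sba_ppp_loans column names.
loans = pd.DataFrame({
    "loan_number": ["A1", "A2", "A3"],
    "draw_type": ["first", "first", "second"],
    "initial_approval_amount": [200_000.0, 50_000.0, 120_000.0],
    "current_approval_amount": [200_000.0, 35_000.0, 120_000.0],
})

# Loans whose amount changed after origination (adjusted or partly cancelled).
changed = loans["initial_approval_amount"] != loans["current_approval_amount"]
adjusted = loans[changed]

# The two totals differ, so any report should say which column it sums.
total_initial = loans["initial_approval_amount"].sum()
total_current = loans["current_approval_amount"].sum()
print(len(adjusted), total_initial, total_current)
```

Even one adjusted loan moves the total, which is why a published figure should name the column behind it.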
What it is and the CARES Act frame
The Paycheck Protection Program was created by the Coronavirus Aid, Relief, and Economic Security Act—the CARES Act—which Congress passed in late March 2020 as the pandemic shut down commerce across the country. The program's premise was simple and urgent: rather than let employers lay off workers as revenue collapsed, the government would lend businesses enough to keep paying their staff, and it would forgive the loan—turning it into a grant—if the business actually used the money for payroll and a few other eligible costs. The aim was to keep workers attached to their employers through the worst of the shutdown so that the economy could restart quickly when restrictions lifted, and to do it through a delivery mechanism that already existed and could move fast.
That existing mechanism was the SBA's 7(a) loan-guarantee program. Ordinarily the 7(a) program is how the SBA backstops conventional small-business loans: private lenders make the loans and the SBA guarantees a portion, so the lenders bear less risk and extend credit they otherwise would not. The CARES Act bolted PPP onto this framework but changed the economics entirely. PPP loans carried a 100-percent SBA guarantee—the lender bore no credit risk—required no collateral and no personal guarantee, charged a low fixed interest rate, and, above all, were forgivable. The genius and the peril of the design were the same thing: by riding on the existing network of thousands of 7(a) lenders, the program could disburse money in days rather than months, but it also delegated the front-line decision about who got a loan to those private lenders, with the government's verification deferred to the forgiveness stage on the back end.
The scale of demand overwhelmed the first appropriation almost immediately. The initial round of funding, enormous as it was, was exhausted within days as businesses rushed to apply, forcing Congress to pass a second tranche of money to reopen the program weeks later. That sequence—a first appropriation drained almost instantly, then replenished—is the origin of the program's reputation for chaos in its earliest weeks, when application portals crashed, lenders prioritized existing customers, and small and minority-owned businesses without established banking relationships found themselves at the back of the line. The lessons of that scramble shaped the program's second act.
The loan formula and the two rounds
The amount a business could borrow was tied directly to its payroll, which is what made the program a paycheck-protection program rather than a general business-relief program. The basic formula set the maximum loan at 2.5 times the borrower's average monthly payroll cost, the idea being to cover roughly ten weeks of payroll. Payroll cost was defined to include wages, salaries, and certain benefits, but with an annualized cap of $100,000 per employee, so the salaries of highly paid staff counted only up to that ceiling. The overall loan size was capped at $10 million. For a sole proprietor or an independent contractor with no employees, the formula ran off the owner's own compensation, which is why the dataset is full of very small loans to single-person businesses alongside the multimillion-dollar loans to larger employers.
The program ran in two distinct rounds, and the draw type field in the data distinguishes them. The first round was the 2020 program created by the CARES Act and its immediate replenishment: open broadly to eligible small businesses, with the 2.5x formula and the scramble described above. The second round was opened by the Consolidated Appropriations Act of December 2020, which both reopened first-draw lending for businesses that had not yet borrowed and, more importantly, created a Second Draw program for borrowers that had already taken a first PPP loan and could show they had been harder hit: specifically, a drop of at least 25 percent in gross receipts in a 2020 quarter compared with the same quarter of 2019. Second Draw loans were targeted at smaller and more genuinely distressed businesses and carried tighter eligibility, a lower maximum, and one notable enhancement: businesses in the accommodation and food services sector (NAICS code 72, the restaurants and hotels the pandemic hit hardest) could borrow at 3.5 times average monthly payroll rather than 2.5, recognizing that those businesses needed proportionally more help to survive. The two-round structure is why NAICS sector 72 is so prominent in any analysis of the data and why the draw-type split is essential to interpreting it.
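The sizing rules described above can be sketched as a small function. This is a simplified illustration, not the SBA's full calculation: it treats payroll as a list of annual wages, ignores benefits and other adjustments, and the function and constant names are hypothetical. The $2 million Second Draw ceiling is an added figure (the text says only that the maximum was lower).

```python
PER_EMPLOYEE_CAP = 100_000     # annualized payroll counted per employee
FIRST_DRAW_MAX = 10_000_000    # overall cap stated in the text
SECOND_DRAW_MAX = 2_000_000    # assumption: not stated in this article

def max_ppp_loan(annual_pay, draw="first", naics_code=""):
    """Simplified maximum loan: multiplier x average monthly payroll."""
    # Count each employee's pay only up to the annualized cap.
    capped_payroll = sum(min(pay, PER_EMPLOYEE_CAP) for pay in annual_pay)
    avg_monthly = capped_payroll / 12
    # Second Draw borrowers in NAICS sector 72 used 3.5x instead of 2.5x.
    food_service = str(naics_code).startswith("72")
    multiplier = 3.5 if (draw == "second" and food_service) else 2.5
    ceiling = FIRST_DRAW_MAX if draw == "first" else SECOND_DRAW_MAX
    return min(multiplier * avg_monthly, ceiling)

# Ten employees at $50,000 plus one at $400,000 (counted as $100,000).
staff = [50_000] * 10 + [400_000]
print(max_ppp_loan(staff))                                      # first draw
print(max_ppp_loan(staff, draw="second", naics_code="722511"))  # restaurant
```

Here the capped annual payroll is $600,000, so average monthly payroll is $50,000: $125,000 at 2.5x and $175,000 at 3.5x.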
The lenders, banks and fintechs
The single most important structural fact about PPP—and the one that explains both its speed and its fraud—is that the SBA did not make the loans. The private lenders did. The program rode on the existing 7(a) lender network and then expanded it dramatically, ultimately involving thousands of institutions: the largest national banks, regional and community banks, credit unions, and a new category of participant that came to dominate the later phases of the program—the financial-technology lenders. Each lender took applications, applied the eligibility rules, originated the loan, and submitted it to the SBA for the guarantee, earning a processing fee scaled to the loan size. The originating_lender field in the dataset records which institution did this for each loan, making the data a direct ledger of which lenders moved the most money and to whom.
The fintech lenders deserve particular attention because they reshaped the program and figure heavily in its later troubles. Online lenders and fintech platforms such as Kabbage and BlueVine, often working through partner banks, built automated application pipelines that could approve loans far faster than a traditional bank's manual underwriting. For many small businesses, sole proprietors, and gig workers who lacked a relationship with a conventional bank, these platforms were the only practical way in, and they became enormous PPP originators—a genuine expansion of access. But the same automation that made them fast also made them attractive to fraud: thinner human review, rapid turnaround, and a fee structure that rewarded volume created conditions in which fraudulent applications could pass through at scale. Later analysis by researchers and the SBA Inspector General found that fintech-originated loans were associated with disproportionately high rates of suspicious and fraudulent activity, which is why the originating-lender field is not just a descriptive attribute but one of the most analytically loaded columns in the dataset.
The FOIA fight over borrower names
It is easy to take for granted that the PPP data names borrowers, but that disclosure was not voluntary; it was won in court. When the SBA first released PPP data in 2020, it named only the borrowers of loans of $150,000 or more and reported their loan amounts as ranges, and for smaller loans it published amounts without borrower names, on the argument that loan-level borrower identities were confidential business information. A coalition of news organizations sued under the Freedom of Information Act (FOIA), arguing that the public had a compelling interest in knowing exactly who had received hundreds of billions of dollars in forgivable public funds and that the SBA's confidentiality claims did not justify withholding the names of recipients of a government program of this magnitude.
The news organizations prevailed. A federal court ordered the SBA to release the detailed loan-level data, including borrower names and addresses and specific loan amounts, and in late 2020 the agency complied. That release is the reason the dataset that exists today is so powerful—and so unusual. For most federal financial-assistance programs, individual recipient detail is hard to obtain or simply unavailable; for PPP, the combination of the program's sheer size, the forgivability that made the loans effectively grants of public money, and a successful press FOIA campaign produced a complete, named, loan-level public record. The disclosure is what enabled the entire downstream ecosystem of journalism, academic research, and oversight that has scrutinized the program ever since, from local reporting on who in a community got a loan to national investigations of fraud rings. Without the FOIA litigation, this dataset as we know it would not exist.
The forgiveness pipeline and the 60-percent rule
The defining feature of a PPP loan is that it could be forgiven—converted entirely into a grant the borrower never has to repay—and the forgiveness fields are how the dataset records whether that happened. Forgiveness was not automatic. A loan became forgivable only if the borrower spent the proceeds on eligible costs within a defined covered period after disbursement, and only if it maintained its workforce and wage levels (subject to safe-harbor provisions that softened the headcount requirement as the program evolved). The eligible costs were dominated by payroll, with a limited allowance for certain other expenses—rent, mortgage interest, utilities, and, in the second round, an expanded set including some operational and supplier costs.
The pivotal constraint was the 60-percent payroll rule: to qualify for full forgiveness, at least 60 percent of the loan proceeds had to be spent on payroll, with no more than 40 percent going to the non-payroll eligible costs. This rule is what kept PPP a paycheck-protection program rather than a general subsidy—it forced the bulk of the money to flow to wages, which was the entire policy rationale. The mechanics of forgiveness ran through the lenders: a borrower applied for forgiveness through the same lender that made the loan, the lender reviewed the application, and the SBA paid the lender the forgiven amount, extinguishing the borrower's debt. The SBA also created a simplified forgiveness process for the smallest loans to reduce the paperwork burden. In the dataset, the forgiveness_amount and forgiveness_date fields record the outcome of this pipeline, which is why computing a forgiveness rate—the share of approved dollars that were ultimately forgiven, sliced by lender, sector, geography, or loan size—is one of the most informative analyses the data supports. A loan with no forgiveness recorded is either still outstanding, still in the forgiveness process, or one the borrower will have to repay—and distinguishing those cases is a genuine analytical challenge the caveats section addresses.
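The arithmetic of the 60-percent rule can be sketched as follows. This is a simplified illustration under stated assumptions: it ignores the workforce and wage-level tests and the safe harbors, the function name is hypothetical, and the proportional reduction when payroll falls short (capping forgiveness at payroll divided by 0.60) reflects how the forgiveness application is generally described rather than anything stated in this article.

```python
def forgivable_amount(loan_amount, payroll_spent, nonpayroll_spent):
    """Simplified forgiveness ceiling under the 60-percent payroll rule."""
    eligible_costs = payroll_spent + nonpayroll_spent
    # Non-payroll costs may be at most 40% of the forgiven total, which is
    # the same as limiting the total to payroll / 0.60.
    payroll_limit = payroll_spent / 0.60
    return round(min(loan_amount, eligible_costs, payroll_limit), 2)

# 70% of a $100,000 loan spent on payroll: the whole loan is forgivable.
print(forgivable_amount(100_000, 70_000, 30_000))
# Only 30% spent on payroll: forgiveness is limited to 30,000 / 0.60.
print(forgivable_amount(100_000, 30_000, 70_000))
```

In the second case the borrower spent the full loan on eligible costs, but too little of it on payroll, so only half would be forgiven under this sketch.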
Fraud as a defining feature
No honest account of the PPP data can treat fraud as a footnote. The program was designed for speed under emergency conditions, and the design choices that made it fast—a self-certification model in which borrowers attested to their eligibility and payroll, minimal up-front verification, a 100-percent guarantee that removed lenders' credit risk, and fee incentives that rewarded volume—also made it extraordinarily vulnerable to abuse. The SBA Office of Inspector General and the Government Accountability Office (GAO) have estimated that tens of billions of dollars in PPP funds went to potentially fraudulent or otherwise improper applications—a figure that, while inherently uncertain, is large enough to make PPP one of the most consequential cases of federal program fraud on record.
The fraud took several recognizable forms. The most common vectors were identity theft—applications filed using stolen identities or the identities of real businesses without their knowledge—and shell-company applications, in which fabricated or dormant businesses with invented payrolls and employee counts applied for loans. Because the loan amount was driven by claimed payroll, inflating the payroll figure directly inflated the loan, and a fraudster who never intended to spend the money on wages could simply pocket it. Rings filed large numbers of applications across multiple lenders, exploiting the automated fintech pipelines that processed applications with limited human scrutiny. The aftermath has been a sprawling enforcement effort: the Department of Justice has pursued thousands of PPP-fraud prosecutions, the SBA and lenders have clawed back funds where they can, and the forgiveness review process became a second line of defense, catching some loans that should never have been approved. For the analyst, this history is not just context: it means the dataset is partly a record of fraud as well as of relief, and patterns in the data—clusters of loans at improbable round-number payroll amounts, identical addresses across many loans, implausible jobs-per-dollar ratios, lenders with anomalous forgiveness or default profiles—are exactly the signals that fraud researchers and oversight bodies mine the data to find.
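A few of the signals named above can be expressed as simple screening flags. The sketch below runs on invented rows with the sba_ppp_loans column names; the thresholds (multiples of $5,000, three or more loans per address) are arbitrary choices for illustration, and a flag is a reason to look closer, not evidence of fraud. The per-job ceiling follows from the 2.5x formula and the $100,000 cap, so it would be too strict for Second Draw loans in NAICS sector 72.

```python
import pandas as pd

# Invented example rows; only the column names come from the schema.
loans = pd.DataFrame({
    "loan_number": ["L1", "L2", "L3", "L4", "L5"],
    "borrower_address": ["1 Main St", "1 Main St", "1 Main St",
                         "9 Oak Ave", "4 Elm Rd"],
    "current_approval_amount": [20_000.0, 20_000.0, 20_000.0,
                                187_350.0, 150_000.0],
    "jobs_reported": [1, 1, 1, 12, 2],
})

# Signal 1: amounts that are exact multiples of $5,000.
loans["round_amount"] = loans["current_approval_amount"] % 5_000 == 0

# Signal 2: the same address appearing on three or more loans.
per_address = loans.groupby("borrower_address")["loan_number"].transform("count")
loans["shared_address"] = per_address >= 3

# Signal 3: dollars per reported job above what the 2.5x formula allows for
# one employee at the $100,000 annualized cap (2.5 * 100,000 / 12).
max_per_job = 2.5 * 100_000 / 12
per_job = loans["current_approval_amount"] / loans["jobs_reported"]
loans["high_per_job"] = per_job > max_per_job

flag_cols = ["round_amount", "shared_address", "high_per_job"]
loans["flags"] = loans[flag_cols].sum(axis=1)
print(loans[["loan_number", "flags"]])
```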
Joining to other federal data
The PPP dataset is valuable on its own, but it becomes far more powerful when joined to other federal records, and it carries several keys that make those joins possible. Three matter most.
The first is the NAICS industry code. Because every loan carries the borrower's six-digit North American Industry Classification System code, the data can be aggregated by industry and compared against industry-level baselines—the number of establishments, the employment, and the pandemic revenue losses in each sector—to ask whether the relief flowed to the industries that needed it most. The prominence of NAICS sector 72, accommodation and food services, both in raw volume and through its 3.5x Second Draw multiplier, makes the NAICS join the natural starting point for any question about whether PPP money was well targeted.
The second is geography. The borrower address—down to the ZIP code—lets PPP loans be mapped and joined to Census demographic and economic data, which is how researchers have studied whether the program reached low-income and minority communities or concentrated in already-advantaged areas, and how local journalists have reported on who in a given town received funds. The third, and most powerful for oversight, is entity name and identifier. Joining PPP borrowers to the federal spending record in USASpending, to the SAM exclusions list of debarred parties, and to Department of Justice prosecution records ties the loan data into the broader accountability web—testing, for instance, whether loans went to already-excluded parties, or connecting a loan record to the criminal case that the loan ultimately produced. These cross-dataset joins are where the named, loan-level disclosure pays its largest dividend.
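As a minimal sketch of the geographic join, the example below matches loans to a ZIP-level table. The `borrower_zip` column, the `baseline` table, and all values are hypothetical stand-ins; a real join would use whatever ZIP field and Census extract the analyst has on hand.

```python
import pandas as pd

# Invented loan rows; borrower_zip is a hypothetical column name.
ppp = pd.DataFrame({
    "loan_number": ["L1", "L2", "L3"],
    "borrower_zip": ["02139-4307", "60614", "99999"],
    "current_approval_amount": [40_000.0, 250_000.0, 15_000.0],
})

# Hypothetical ZIP-level baseline with invented values.
baseline = pd.DataFrame({
    "zip5": ["02139", "60614"],
    "median_income": [100_000, 120_000],
})

# Keep ZIPs as text and reduce ZIP+4 values to five digits; a numeric cast
# would drop leading zeros.
ppp["zip5"] = ppp["borrower_zip"].astype(str).str.slice(0, 5).str.zfill(5)

# A left join with indicator=True shows which loans found no match.
joined = ppp.merge(baseline, on="zip5", how="left", indicator=True)
unmatched = joined[joined["_merge"] == "left_only"]
print(joined[["loan_number", "zip5", "median_income", "_merge"]])
```

Counting the unmatched rows before aggregating is a cheap check that the join key was cleaned the same way on both sides.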
Python workflow: lender volume, sector totals, and forgiveness rate
The SBA publishes the full PPP loan-level data on its open-data portal as downloadable CSV files, split into an over-$150,000 file and several under-$150,000 batch files because the full set is far too large for a single CSV. The script below downloads the over-$150,000 file (the most analytically tractable starting point), resolves the verbose, release-dependent column names defensively, and then computes three of the core metrics: origination volume by lender (which banks and fintechs moved the most money), approved dollars by NAICS two-digit sector (where the money went by industry), and the forgiveness rate by borrower state (the share of approved dollars that were ultimately forgiven). No API key is required for the public data. For national-scale work, all of the under-$150,000 batch files must be loaded and concatenated, and any production use should be validated against the current SBA PPP FOIA data-release page, whose file URLs and column names change between refreshes. Requirements: requests and pandas.
import io

import pandas as pd
import requests
# SBA publishes the full Paycheck Protection Program loan-level data on
# its open-data portal as downloadable CSV files. SBA split the release
# into a "loans over 150k" file and several "under 150k" files (one per
# region / batch), because the full set is too large for a single CSV.
# The exact file URLs are listed on data.sba.gov; isolate them here and
# confirm against the current SBA PPP FOIA data-release page.
# - portal: https://data.sba.gov/dataset/ppp-foia
# - per-file: https://data.sba.gov/.../public_150k_plus_YYMMDD.csv
BASE = "https://data.sba.gov/dataset/ppp-foia"
OVER_150K = "https://data.sba.gov/dataset/8aa276e2-6cab-4f86-aca4-a7dde42adf24/resource/public_150k_plus.csv"
def load_csv(url):
    # Download the whole file, then parse it. The PPP CSVs are large and
    # r.content holds the full response in memory. low_memory=False makes
    # pandas infer each column's dtype from the whole file at once.
    r = requests.get(url, timeout=600)
    r.raise_for_status()
    return pd.read_csv(io.BytesIO(r.content), low_memory=False)
# Column names in the SBA extract are verbose and have shifted across
# the FOIA releases; resolve them defensively rather than hard-coding.
def col(frame, *candidates):
    # Return the first candidate that matches a column, ignoring case.
    lower = {c.lower(): c for c in frame.columns}
    for cand in candidates:
        if cand.lower() in lower:
            return lower[cand.lower()]
    raise KeyError(f"none of {candidates} in {list(frame.columns)[:12]}...")
df = load_csv(OVER_150K)
print(f"Loan records loaded (over-150k file): {len(df):,}")
c_amt = col(df, "CurrentApprovalAmount", "LoanAmount", "InitialApprovalAmount")
c_forg = col(df, "ForgivenessAmount")
c_naics = col(df, "NAICSCode", "NAICS")
c_lender = col(df, "OriginatingLender", "Lender", "ServicingLenderName")
c_state = col(df, "BorrowerState", "ProjectState")
c_draw = col(df, "ProcessingMethod", "DrawType")
df[c_amt] = pd.to_numeric(df[c_amt], errors="coerce")
df[c_forg] = pd.to_numeric(df[c_forg], errors="coerce").fillna(0)
# --- 1. Origination volume by lender ----------------------------------
# Which lenders -- banks, credit unions, and fintechs -- moved the most
# money. Sum of approved amount by originating lender.
by_lender = (df.groupby(c_lender)[c_amt]
               .agg(["count", "sum"])
               .sort_values("sum", ascending=False))
print("\nTop 15 originating lenders by approved dollars:")
for name, row in by_lender.head(15).iterrows():
    print(f" {str(name)[:40]:<40} {int(row['count']):>7,} loans "
          f"${row['sum']/1e9:>6.2f}B")
# --- 2. Approved dollars by NAICS 2-digit sector ----------------------
df["_sector"] = df[c_naics].astype("string").str.slice(0, 2)
by_sector = df.groupby("_sector")[c_amt].sum().sort_values(ascending=False)
print("\nApproved dollars by NAICS 2-digit sector:")
for sector, total in by_sector.head(12).items():
    print(f" sector {sector} ${total/1e9:>7.2f}B")
# --- 3. Forgiveness rate by borrower state ----------------------------
# Share of approved dollars that have been forgiven, by state. A loan is
# treated as forgiven where ForgivenessAmount > 0.
g = df.groupby(c_state)
forg_rate = (g[c_forg].sum() / g[c_amt].sum()).sort_values(ascending=False)
print("\nForgiveness rate by state (forgiven $ / approved $):")
for state, rate in forg_rate.head(15).items():
    print(f" {str(state):<4} {rate:>6.1%}")
Two practical notes apply. First, the forgiveness-rate calculation in the script is deliberately coarse: it sums the forgiveness amount and divides by the approved amount, treating a zero forgiveness value as not forgiven. In reality a zero can mean a loan still in the forgiveness pipeline, a loan the borrower chose not to seek forgiveness on, or a loan flagged for review, so a rigorous forgiveness analysis must distinguish those states using the forgiveness date and the loan status fields rather than collapsing them all into “not forgiven.” Second, the over-$150,000 file is unrepresentative on its own—it contains the larger loans, which skew toward bigger employers and different industries than the millions of small loans in the under-$150,000 files. Any conclusion about the program as a whole, and especially any analysis of sole proprietors, gig workers, and the smallest businesses, requires loading the full set, not just the headline file the script starts with.
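The first note can be illustrated with a rough classifier that separates the cases a bare zero collapses together. The rows, the `loan_status` column, and its values are invented for illustration; the actual status vocabulary should be taken from the current SBA data dictionary.

```python
import pandas as pd

# Invented example rows; loan_status and its values are hypothetical.
loans = pd.DataFrame({
    "loan_number": ["L1", "L2", "L3", "L4"],
    "current_approval_amount": [100_000.0, 80_000.0, 50_000.0, 20_000.0],
    "forgiveness_amount": [100_000.0, 40_000.0, 0.0, 0.0],
    "forgiveness_date": ["2021-06-01", "2021-09-15", None, None],
    "loan_status": ["Paid in Full", "Active", "Active", "Paid in Full"],
})

def forgiveness_state(row):
    # A recorded date plus a positive amount means forgiveness was paid.
    if pd.notna(row["forgiveness_date"]) and row["forgiveness_amount"] > 0:
        full = row["forgiveness_amount"] >= row["current_approval_amount"]
        return "forgiven_full" if full else "forgiven_partial"
    # Closed out with no forgiveness recorded: treated here as repaid.
    if row["loan_status"] == "Paid in Full":
        return "repaid_without_forgiveness"
    # Otherwise still outstanding or somewhere in the pipeline.
    return "no_forgiveness_recorded"

loans["state"] = loans.apply(forgiveness_state, axis=1)
print(loans[["loan_number", "state"]])
```

A forgiveness rate computed only over loans with a resolved state will differ from the coarse ratio in the script, and the two should be reported separately.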
Limitations and analytical caveats
The PPP dataset is the most complete public record of a federal emergency program ever released, but it carries structural limitations that an analyst must internalize before drawing conclusions—and several of them are unusually consequential given how heavily the data has been used.
Many fields are self-reported and unverified at origination. The program ran on borrower self-certification, which means the payroll figures that drove loan size, the business descriptions, and the eligibility attestations were supplied by the borrower and, in the rush of 2020, frequently not verified up front. This is not a peripheral data-quality note; it is the same feature that enabled the fraud described above. An analyst treating the reported amounts and counts as audited facts will mistake the borrower's claim for the ground truth, when in some meaningful share of records the claim was inflated, fabricated, or simply wrong. The data is authoritative as a record of what was claimed and approved; it is not an audited record of what was true.
The jobs-supported number is a borrower estimate, not a measurement. The widely cited “jobs reported” field is one of the most misused columns in the dataset. It is the number of jobs the borrower stated the loan would support, recorded at application—not a verified count of jobs actually retained, and not an outcome measured after the fact. Using it to compute a cost-per-job-saved figure for the program, as many early analyses did, conflates a self-reported intention with a realized outcome and produces numbers that the data cannot actually support. Treat the jobs figure as a descriptive attribute of the application, not as evidence of the program's employment effect.
Lender-of-record can be ambiguous. The split between originating and servicing lender, and the transfer of loans between institutions (particularly between fintech originators and partner or servicing banks), means that attributing a loan to a single named lender is not always straightforward. An analysis that ranks lenders by volume must decide deliberately whether to credit the originator or the servicer, because the two answers can differ substantially for the fintech-heavy segment of the program, and the choice changes which institutions appear most exposed to the program's fraud and forgiveness patterns.
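How much the attribution choice matters can be checked directly by ranking both ways. The lender names and amounts below are invented; only the column names come from the sba_ppp_loans schema.

```python
import pandas as pd

# Invented example rows using the sba_ppp_loans column names.
loans = pd.DataFrame({
    "originating_lender": ["Fintech A", "Fintech A", "Bank B", "Bank C"],
    "servicing_lender": ["Bank C", "Bank C", "Bank B", "Bank C"],
    "current_approval_amount": [30_000.0, 20_000.0, 500_000.0, 100_000.0],
})

amount = "current_approval_amount"
by_originator = loans.groupby("originating_lender")[amount].sum()
by_servicer = loans.groupby("servicing_lender")[amount].sum()

# Side-by-side totals; a lender missing from one view gets 0.
comparison = pd.concat(
    {"as_originator": by_originator, "as_servicer": by_servicer}, axis=1
).fillna(0.0)

# Share of loans whose servicer differs from the originator.
transferred = (loans["originating_lender"] != loans["servicing_lender"]).mean()
print(comparison)
print(f"transferred share: {transferred:.0%}")
```

In this toy data the fintech disappears entirely from the servicer view while its partner bank's total grows, which is the pattern the caveat warns about.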
Entity resolution is hard, and the demographic fields are mostly blank. The borrower name is a free-text field, so the same business can appear under slightly different names, and joining PPP records to other datasets—USASpending, the exclusions list, court records—requires careful entity resolution across name and address rather than a clean identifier join. Compounding this, the demographic fields that would allow a rigorous study of whether the program reached women-owned, minority-owned, and veteran-owned businesses were optional and are overwhelmingly unanswered, so any equity analysis built on those fields is working from a small, self-selected, and likely unrepresentative subset. Geographic and surname-based imputation can partially fill the gap, but it introduces its own error and must be flagged as inference, not fact.
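A first step in entity resolution is usually to normalize names before comparing them. The sketch below is deliberately crude: the suffix list is a small hypothetical sample, and matching on a normalized name alone will still produce both false matches and misses, so address or other fields should confirm any link.

```python
import re

# A small, incomplete sample of corporate suffixes to strip.
SUFFIXES = {"inc", "incorporated", "llc", "corp", "corporation",
            "co", "company", "ltd"}

def normalize_name(name):
    """Lowercase, drop punctuation and corporate suffixes."""
    text = name.lower()
    # Remove periods and apostrophes so "L.L.C." becomes "llc".
    text = re.sub(r"[.']", "", text)
    # Turn any other punctuation into spaces.
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    tokens = [t for t in text.split() if t not in SUFFIXES]
    return " ".join(tokens)

print(normalize_name("Joe's Diner, L.L.C."))
print(normalize_name("JOES DINER LLC"))
```

Both spellings reduce to the same key, which can then serve as a blocking key for a more careful comparison.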
Held with these caveats in mind, the sba_ppp_loans table is a uniquely valuable resource: roughly 11.8 million named, loan-level records of the largest small-business relief program in American history—who borrowed, who lent, how much, for what, and whether the loan became a grant—a complete public ledger of an emergency that the country ran in weeks, and that it has spent the years since trying to account for.
Related writing
SBA Loan Programs: The Federal Database Behind $50 Billion in Annual Small Business Financing — PPP was an emergency overlay on the SBA's ordinary 7(a) guarantee machinery, and the broader SBA lending data shows the permanent programs—7(a), 504, and microloans—that the pandemic program borrowed its delivery network from.
USASpending Subawards: The Federal Database Behind Sub-Grant and Sub-Contract Flow Tracking — Joining PPP borrowers to the government's spending record is the key oversight move, and the subaward data shows how federal dollars flow downstream once they leave the agency, the same flow-tracking logic that lets analysts test whether relief reached its intended recipients.
US Attorney Prosecution Data: The Federal Database Behind 80,000 Annual Criminal Cases — The thousands of PPP-fraud cases the Department of Justice has brought live in the federal prosecution record, and joining the two datasets traces the path from a fraudulent loan application to the criminal case it ultimately produced.