Almost every dollar saved for retirement in the private sector of the United States passes through a plan that, once a year, has to explain itself to the federal government. The instrument is Form 5500, the annual report that employee benefit plans must file under ERISA, and it is the single public window into a private system that holds trillions of dollars and covers most of the American workforce. Each filing names the plan, its sponsor, what kind of plan it is, how many people it covers, how much it holds, who it pays to run it, and what its auditors found. Our slice holds roughly 217,000 plan-year filings, one row per plan and plan year, drawn from the far larger EFAST2 universe.
This article covers what Form 5500 is and how ERISA frames it; the unusual joint-agency design under which a single filing serves the Department of Labor, the IRS, and the Pension Benefit Guaranty Corporation, and how it is filed electronically through EFAST2; the kinds of plan that file—defined-benefit pensions, defined-contribution 401(k) plans, and welfare plans such as group health and disability; the schedule architecture, especially the Schedule H financial statement, the Schedule C service-provider fees, and the actuarial funding schedules; the fields that carry the participant counts, assets, and contributions; how the data is used to study 401(k) fees, pension funding and underfunding, and provider consolidation; a Python workflow that pulls the public Form 5500 datasets, aggregates assets and participants by plan type, and ranks sponsors by plan assets; and the caveats—the large-plan reporting threshold, the small-plan forms, filing lag, and the gap between a filing and the reality it summarizes—that every analyst must internalize.
What the dataset is
Form 5500 is the annual return/report of an employee benefit plan. ERISA and the Internal Revenue Code together require most private-sector employee benefit plans to file it every year, and the form is the government's primary tool for monitoring the operation and financial condition of the private retirement and welfare benefit system. It is, deliberately, a public document: once filed, the data is available to participants, researchers, journalists, and the public, on the theory that transparency is itself a safeguard for the people whose retirement security depends on these plans. The modern series is filed electronically through the EFAST2 system—the ERISA Filing Acceptance System—and the resulting data is published as downloadable annual datasets running to hundreds of thousands of filings per year.
In our database this record is stored as the table dol_form_5500, with the grain of one row per filing: a single plan filing every year for a decade contributes ten rows, one per plan year. Our slice holds roughly 217,000 such plan-year filings, a subset of the much larger EFAST2 universe. The columns capture the identity of the plan and its sponsor, the plan type, and the headline financial and participant figures that the rest of the filing details:
sponsor_name -- the employer / plan sponsor
sponsor_ein -- sponsor employer identification number
plan_number -- three-digit plan number (PN), unique per sponsor
plan_name -- the plan's own name
plan_year -- the reporting year of this filing
plan_type -- pension or welfare; benefit-feature codes
type_pension_bnft_code -- defined benefit / defined contribution features
type_welfare_bnft_code -- health, life, disability, etc. (welfare plans)
total_participants -- participant count (active + retired + separated)
total_assets_eoy -- plan assets at end of year (from Schedule H)
total_contributions -- contributions received during the year
funding_arrangement -- trust, insurance, general assets, combination
iqpa_audit -- whether an independent audit was attached (Sch H)
filing_status -- whether the return is an annual report or amendment
The load-bearing identifier is the pair of sponsor_ein and plan_number. A single employer can sponsor several plans (a 401(k), a defined-benefit pension, a separate health plan), so the EIN alone does not identify a plan. Each plan carries a three-digit plan number (PN), assigned by the sponsor and unique within that sponsor, and the combination of the sponsor's employer identification number and that PN is what names a plan across years. That EIN-plus-PN key is the join that ties a plan's filing to its prior years, to its schedules, and to the same employer's other plans. The plan_type and the benefit-feature codes distinguish a pension plan from a welfare plan, and within pensions a defined-benefit plan from a defined-contribution one. This is the single most important classification for analysis, because everything from the schedules attached to the funding rules to the meaning of “assets” depends on it. The participant count is on the main form; the asset and contribution dollar figures come from the Schedule H financial statement, joined on that same EIN-plus-PN key, and the schedules behind them are where the detail lives.
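Because the EIN-plus-PN pair does the work of a primary key, it helps to normalize both halves before joining anything. The following is a minimal sketch, assuming the values may arrive as integers, floats, or strings with a hyphenated EIN; the helper name plan_key and the sample rows are hypothetical, not part of any DOL layout.

```python
def plan_key(ein, plan_number):
    """Return a normalized (EIN, PN) tuple: 9-digit EIN, 3-digit plan number."""
    # CSV parsing often turns these into ints or floats, which drops leading
    # zeros; strip any trailing ".0" and pad back to the fixed width.
    ein_s = str(ein).strip().split(".")[0].replace("-", "").zfill(9)
    pn_s = str(plan_number).strip().split(".")[0].zfill(3)
    return ein_s, pn_s


# Two hypothetical filings for the same plan, stored in different shapes.
filings = [
    {"sponsor_ein": 12345678, "plan_number": 1, "plan_year": 2022},
    {"sponsor_ein": "01-2345678", "plan_number": "001", "plan_year": 2023},
]
keys = {plan_key(f["sponsor_ein"], f["plan_number"]) for f in filings}
print(keys)
```

Both rows collapse to one key, which is the property a year-over-year join depends on.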
ERISA and why the form exists
The form exists because of the Employee Retirement Income Security Act of 1974 (ERISA), the federal statute that governs private-sector employee benefit plans. ERISA was enacted in response to a basic problem: workers were being promised pensions that, when the time came, were not there. The catalytic example was the 1963 closure of the Studebaker auto plant in South Bend, Indiana, whose pension fund was so underfunded that thousands of workers received only a fraction of the benefits they had been promised, and many received nothing. ERISA's answer was to impose federal minimum standards on private plans: rules about who must be covered and how benefits vest, fiduciary duties requiring those who run plans to act prudently and solely in the interest of participants, funding rules for pensions, and, the part that produces this dataset, reporting and disclosure requirements so that participants and the government can see what a plan is doing.
Form 5500 is the principal vehicle for ERISA's reporting and disclosure mandate. It is how the government verifies that plans are operating in compliance with the statute's substantive rules—that a pension is being funded adequately, that plan assets are being held in trust and not diverted, that fiduciaries are not engaging in prohibited transactions, that fees paid to service providers are reasonable. It is also how participants and the public can examine a plan: a worker can look up the filing for the plan that holds their retirement savings and see its assets, its costs, and its auditor's findings. The form is therefore both an enforcement input—a structured stream of data that regulators mine for problems—and a transparency instrument, and the dataset's value flows from being both at once: a comprehensive, public, machine-readable census of the private benefit system.
A joint filing: DOL, IRS, and PBGC
The most distinctive structural fact about Form 5500—and the reason it carries the data it does—is that it is a joint filing serving three different federal bodies at once. A single annual return satisfies the reporting obligations that three separate agencies impose, and the data is shared among them. Understanding which agency wants which part of the form explains the otherwise puzzling breadth of what a single return collects.
The lead agency is the Department of Labor, acting through the Employee Benefits Security Administration (EBSA). EBSA administers and enforces the fiduciary, reporting, and disclosure provisions of ERISA's Title I, and Form 5500 is its central monitoring tool: it uses the filings to identify plans with fiduciary problems, imprudent investments, excessive fees, or signs of asset diversion, and to target its limited investigative resources. The Internal Revenue Service is the second agency: employee benefit plans receive favorable tax treatment—contributions are deductible and earnings grow tax-deferred—on the condition that they satisfy the tax-qualification rules of the Internal Revenue Code, and the IRS uses Form 5500 to confirm that a plan continues to meet those conditions (coverage, participation, vesting, and the minimum funding standards for pensions). The third is the Pension Benefit Guaranty Corporation (PBGC), the federal corporation that insures private defined-benefit pensions: when a covered defined-benefit plan fails, the PBGC steps in and pays guaranteed benefits, and it uses the Form 5500 funding data to monitor the financial health of the plans it backstops and to assess the premiums those plans owe. Three agencies, three purposes, one form—and one public dataset that, because it was built to serve all three, captures the operational, tax, and funding dimensions of every plan in a single record.
The mechanism that ties this together is EFAST2, the ERISA Filing Acceptance System through which the modern series is filed entirely electronically. Before EFAST2, filings were paper, and the resulting data was slow, error-prone, and hard to use at scale. Electronic filing transformed the form from a compliance artifact into a research-grade dataset: structured, validated at submission, and published in machine-readable annual datasets that anyone can download. The filings are public—EFAST2 exposes both a filing search for individual returns and bulk datasets of the full universe—and it is those public datasets that make analyses like the ones below possible without any special access or credential.
Plan types: pensions, 401(k)s, and welfare plans
Form 5500 covers two broad families of plan, and the distinction governs everything about how a filing is read. The first family is pension plans, meaning plans that provide retirement income, and within it the critical split is between defined-benefit and defined-contribution plans. The second family is welfare benefit plans, which provide benefits other than retirement income: group health, life, and disability coverage, and similar arrangements. The benefit-feature codes on the form identify which family a plan belongs to and, within pensions, which sub-type.
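As a rough illustration of reading the benefit-feature codes, the sketch below assumes the codes are stored as a concatenated run of two-character tokens (a digit followed by a letter, for example 2E2J2K) and that the leading digit marks the series: 1 for defined-benefit features, 2 for defined-contribution features. Both assumptions should be checked against the Form 5500 instructions for the filing year; the function names are hypothetical.

```python
def split_codes(raw):
    """Split a concatenated feature-code string into two-character codes."""
    # Treat None and NaN (which is not equal to itself) as "no codes".
    if raw is None or raw != raw:
        return []
    s = "".join(str(raw).split()).upper()
    return [s[i:i + 2] for i in range(0, len(s), 2)]


def plan_family(pension_codes, welfare_codes=""):
    """Coarse family label from the leading digit of each pension code."""
    series = {code[0] for code in split_codes(pension_codes)}
    if "1" in series and "2" in series:
        return "mixed DB/DC features"
    if "1" in series:
        return "defined benefit"
    if "2" in series:
        return "defined contribution"
    if split_codes(welfare_codes):
        return "welfare"
    return "unclassified"


print(plan_family("2E2F2G2J2K"))
print(plan_family("1A1G"))
print(plan_family("", "4A4B"))
```

Keeping a separate label for plans that carry both series avoids silently forcing hybrid filings into one class.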
A defined-benefit (DB) pension promises a specific benefit at retirement—typically a monthly payment calculated from years of service and salary—and the employer bears the investment and longevity risk of delivering on that promise. The classic corporate and public pension is a DB plan. Because the benefit is a fixed promise, the central regulatory question is whether the plan holds enough assets to meet its future obligations, which is why DB plans attach an actuarial funding schedule and why the PBGC insures them. A defined-contribution (DC) plan, of which the 401(k) is the dominant form, makes no promise about the eventual benefit: the employer (and usually the employee) contribute defined amounts to individual accounts, the account is invested, and the participant's retirement benefit is simply whatever the account is worth—the investment risk sits with the worker. The decades-long shift from DB to DC plans is the single largest structural change in the American retirement system, and it is visible directly in the Form 5500 data as the rising count and asset share of DC plans against the slow decline of DB plans. For a DC plan the questions that matter most—and that the form answers—are how much is in the accounts, what the participants are charged, and who is paid to administer and invest the money.
Welfare benefit plans are the other family. These provide health insurance, life insurance, disability coverage, and similar non-retirement benefits, and they file Form 5500 too, though many small welfare plans, and certain fully insured or unfunded arrangements, are exempt from filing. The welfare-plan filings are the principal public data source on employer-sponsored health and welfare benefits at the plan level, carrying the benefit types offered, the funding arrangement (insured, self-funded, or a combination), and, for large welfare plans, the financial detail. Across both families, the system is enormous: the plans that file Form 5500 collectively hold assets commonly cited at well over ten trillion dollars and cover the large majority of the private US workforce, which is why a census of these filings is, in effect, a census of how Americans are provided for in retirement and in illness.
The schedules: Schedule H, Schedule C, and the funding schedules
The main Form 5500 is a relatively short cover document; the substance lives in the schedules attached to it, and which schedules a plan must file depends on its type and size. Three schedules carry most of the analytic payload.
Schedule H is the financial statement of a large plan. A plan with one hundred or more participants is a “large plan” and must file Schedule H—a detailed financial statement reporting the plan's assets and liabilities by category at the beginning and end of the year, its income (contributions, investment gains, interest, dividends), and its expenses (benefit payments, administrative fees). Crucially, large plans must also attach the report of an independent qualified public accountant (IQPA)—an audit of the plan's financial statements by an outside CPA. This audit requirement is one of ERISA's most important participant protections: it puts a professional, independent set of eyes on whether plan assets actually exist and are properly valued and accounted for. The presence and content of Schedule H, and the IQPA opinion it carries, are what make the large-plan financial data trustworthy enough to study. (Small plans—fewer than one hundred participants—file the simpler Schedule I or, more often, the streamlined Form 5500-SF, and are generally exempt from the audit requirement.)
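The size rule described above can be written down as a small lookup. This is a deliberately simplified sketch: it applies only the one-hundred-participant line from the text and ignores the transition rules and exemptions in the actual instructions. The function name and the returned labels are illustrative.

```python
LARGE_PLAN_THRESHOLD = 100  # participants, per the rule described in the text


def expected_filing(participants):
    """Return the size class and the filing pieces the text associates with it."""
    large = participants >= LARGE_PLAN_THRESHOLD
    return {
        "size_class": "large" if large else "small",
        "financial_schedule": "Schedule H" if large else "Schedule I or Form 5500-SF",
        "iqpa_audit_expected": large,
    }


for count in (42, 99, 100, 25_000):
    print(count, expected_filing(count))
```

A flag like iqpa_audit_expected is useful mainly as a consistency check: a large plan with no audit attached is worth a second look rather than proof of a violation.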
Schedule C is the service-provider and fee disclosure, and it is the foundation of the entire body of research on retirement-plan fees. ERISA requires that the fees a plan pays for services be reasonable, and Schedule C requires large plans to disclose the service providers they paid and the compensation those providers received—recordkeepers, investment managers, consultants, custodians, auditors, actuaries—including, importantly, indirect compensation such as revenue-sharing payments that flow to a provider from the plan's investments rather than directly from the plan. Because small differences in fees compound into large differences in retirement balances over a working life, the Schedule C data is what lets researchers and litigators ask whether a given plan's participants are being overcharged, and the fiduciary-fee lawsuits of the last fifteen years have leaned heavily on exactly this disclosure.
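The compounding claim is easy to check with arithmetic. The sketch below uses made-up inputs (a level annual contribution, a constant gross return, and two hypothetical all-in fee levels) and treats the fee as a simple reduction in the annual return; none of these figures come from the filings.

```python
def ending_balance(annual_contribution, years, gross_return, fee):
    """Grow level start-of-year contributions at a constant net return."""
    balance = 0.0
    for _ in range(years):
        # Contribute at the start of the year, then apply return net of fees.
        balance = (balance + annual_contribution) * (1 + gross_return - fee)
    return balance


# Hypothetical worker: $10,000 a year for 40 years at a 6% gross return.
low = ending_balance(10_000, 40, 0.06, 0.0025)   # 0.25% all-in fee
high = ending_balance(10_000, 40, 0.06, 0.0125)  # 1.25% all-in fee

print(f"0.25% fee: ${low:,.0f}")
print(f"1.25% fee: ${high:,.0f}")
print(f"Balance lost to the higher fee: {1 - high / low:.1%}")
```

A one-point fee gap costs far more than one percent of the final balance, which is the intuition behind benchmarking Schedule C compensation against plan assets.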
The actuarial funding schedules—Schedule SB for single-employer defined-benefit plans and Schedule MB for multiemployer plans—carry the pension-funding data. Prepared and signed by an enrolled actuary, these schedules report the plan's funding target (the present value of accrued benefits), the value of plan assets, the resulting funded status, the minimum required contribution, and the actuarial assumptions used to compute them. This is the data behind every analysis of pension funding and underfunding: the gap between what a defined-benefit plan has promised and what it holds. Because the PBGC insures these plans, systematic underfunding is not merely a private problem but a contingent liability of the federal insurance program, which is why the funding schedules are scrutinized so closely—the multiemployer pension crisis that led to a federal financial-assistance program for failing plans was, in the data, a story told by Schedule MB.
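The funded-status calculation described above reduces to a ratio and a difference. This is a minimal sketch with hypothetical argument names; the figures reported on the actual schedules reflect adjustments and actuarial conventions that it ignores.

```python
def funded_status(plan_assets, funding_target):
    """Return (funded ratio, dollar shortfall) for a defined-benefit plan."""
    if funding_target <= 0:
        raise ValueError("funding target must be positive")
    ratio = plan_assets / funding_target
    # A plan with assets above its target has no shortfall, not a negative one.
    shortfall = max(funding_target - plan_assets, 0.0)
    return ratio, shortfall


ratio, shortfall = funded_status(plan_assets=80.0, funding_target=100.0)
print(f"funded ratio {ratio:.0%}, shortfall {shortfall:,.1f}")
```

When aggregating across plans, summing assets and targets before dividing gives a dollar-weighted ratio, which answers a different question than averaging each plan's ratio.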
Participants, assets, and contributions
Below the schedules, the headline fields are the participant count, the plan assets, and the contributions, and each repays careful reading because each is a term of art.
The participant count is not the number of current employees. Under ERISA a plan's participants include active participants (employees currently accruing or eligible), retired or separated participants still receiving or entitled to benefits, and the beneficiaries of deceased participants. The count is reported on the main form, and it is what determines a plan's large-versus-small status, and therefore which schedules and audit it must file, so the one-hundred-participant line is the single most consequential threshold in the whole regime. Total assets is the value of what the plan holds, reported at the beginning and end of the year on the Schedule H financial statement (large plans) or Schedule I (small plans); for a DC plan it is the sum of the participants' account balances, while for a DB plan it is the pool of assets set against the plan's promised liabilities, and the two meanings should never be conflated. The asset figure is the basis for ranking plans and sponsors by size, but it is meaningful only alongside the plan type. Contributions, also reported on the financial schedule, records the money flowing into the plan during the year (employer and employee contributions for a DC plan, the employer's funding contribution for a DB plan) and is the flow that, year over year, drives the change in assets. Read together with the schedules, these three fields locate a plan in size, type, and trajectory before any deeper analysis begins.
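The link between contributions and the change in assets can be made explicit with a roll-forward identity. The sketch below is a simplification with hypothetical argument names: it assumes the only flows are contributions, benefit payments, and expenses, and backs out investment income as the residual, whereas a real Schedule H carries additional income, expense, and transfer lines.

```python
def implied_investment_income(assets_boy, assets_eoy, contributions,
                              benefits_paid, expenses):
    """Residual change in assets not explained by cash flows."""
    net_flow = contributions - benefits_paid - expenses
    return assets_eoy - assets_boy - net_flow


def approx_return(assets_boy, assets_eoy, contributions, benefits_paid, expenses):
    """Approximate return, assuming cash flows arrive evenly through the year."""
    net_flow = contributions - benefits_paid - expenses
    income = implied_investment_income(
        assets_boy, assets_eoy, contributions, benefits_paid, expenses)
    return income / (assets_boy + net_flow / 2)


income = implied_investment_income(1000, 1100, 80, 50, 5)
rate = approx_return(1000, 1100, 80, 50, 5)
print(f"implied investment income: {income}, approximate return: {rate:.2%}")
```

An implausible implied return is often a sign of a plan merger, a transfer, or a data-entry problem rather than of investment performance.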
Analytical uses
A comprehensive, public, plan-level census of the private benefit system supports a distinctive set of analyses that no other federal dataset can.
401(k) fee analysis is the most active research use. Combining the Schedule H administrative-expense figures and the Schedule C service-provider disclosures with each plan's asset base lets an analyst compute an effective cost ratio for a plan—what its participants pay, in total, per dollar invested—and benchmark it against plans of similar size. Because the data is plan-level and longitudinal, it can show whether fees fall as a plan grows (they generally should, through economies of scale), which providers are associated with higher or lower costs, and which plans are outliers. This is the empirical engine behind both academic research on retirement costs and the wave of fiduciary-breach litigation alleging that specific plans paid unreasonable fees.
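A sketch of the cost-ratio benchmark might look like the following. The plan records, the size buckets, and the rule of flagging anything above twice the bucket median are all illustrative assumptions, not an established methodology, and the expense figure stands in for whatever combination of Schedule H and Schedule C amounts an analyst decides to count.

```python
from collections import defaultdict
from statistics import median

# Hypothetical plan records: year-end assets and total annual expenses.
plans = [
    {"plan": "A", "assets": 5_000_000, "expenses": 60_000},
    {"plan": "B", "assets": 8_000_000, "expenses": 80_000},
    {"plan": "C", "assets": 6_000_000, "expenses": 180_000},
    {"plan": "D", "assets": 50_000_000, "expenses": 250_000},
    {"plan": "E", "assets": 80_000_000, "expenses": 320_000},
]


def cost_ratio(plan):
    """Expenses per dollar of plan assets."""
    return plan["expenses"] / plan["assets"]


def size_bucket(assets):
    """Assign a plan to an (arbitrary) asset-size peer group."""
    if assets < 10_000_000:
        return "<$10M"
    if assets < 100_000_000:
        return "$10M-$100M"
    return ">=$100M"


by_bucket = defaultdict(list)
for p in plans:
    by_bucket[size_bucket(p["assets"])].append(p)

outliers = []
for bucket, members in by_bucket.items():
    benchmark = median(cost_ratio(p) for p in members)
    for p in members:
        # Flag plans paying more than twice the median of their size peers.
        if cost_ratio(p) > 2 * benchmark:
            outliers.append(p["plan"])

print(outliers)
```

Comparing within size buckets matters because a small plan's ratio is expected to exceed a large plan's, so a single universe-wide benchmark would flag small plans almost by construction.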
Pension funding and underfunding exploits the Schedule SB and MB actuarial data. Aggregating funded status across single-employer and multiemployer DB plans measures the health of the defined-benefit system as a whole and the exposure of the PBGC insurance program; tracking individual plans over time identifies those sliding toward insolvency before they fail. The same data supports research on how funding responds to interest rates, investment returns, and the actuarial assumptions plans choose—assumptions that, the data reveals, are not uniform and can themselves be a lever on reported funded status.
Provider consolidation uses the Schedule C provider records across the whole universe. Because every large plan names its recordkeeper, investment managers, and other providers, the data can be aggregated to map market concentration in the retirement services industry—which recordkeepers administer the most plans and assets, how that concentration has changed as the industry consolidates, and how a given provider's fee levels compare to its rivals. Finally, the EIN-plus-PN key supports longitudinal and cross-plan analysis: following a single plan across years to study its growth, cost trajectory, and provider changes, or rolling all of a sponsor's plans together to see an employer's full benefit footprint.
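One common way to summarize concentration is the Herfindahl-Hirschman index, the sum of squared market shares on a 0 to 10,000 scale. The text does not name a specific measure, so this is one reasonable choice rather than the article's method, and the provider names and asset figures below are invented. In real Schedule C data the provider names are free text and would need normalizing first.

```python
from collections import defaultdict


def market_shares(records):
    """Map each provider to its share of total assets administered."""
    totals = defaultdict(float)
    for provider, assets in records:
        totals[provider] += assets
    grand_total = sum(totals.values())
    return {provider: amount / grand_total for provider, amount in totals.items()}


def hhi(shares):
    """Herfindahl-Hirschman index from fractional shares (0 to 10,000)."""
    return sum((100 * share) ** 2 for share in shares.values())


# Hypothetical (recordkeeper, plan assets) pairs, one per plan.
records = [
    ("Recordkeeper A", 50.0),
    ("Recordkeeper A", 10.0),
    ("Recordkeeper B", 30.0),
    ("Recordkeeper C", 10.0),
]
shares = market_shares(records)
print(shares)
print(round(hhi(shares)))
```

Computing the index per filing year and plotting it over time is one way to put a number on the consolidation trend.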
Python workflow: assets and participants by plan type, and top sponsors
DOL EFAST2 publishes the full Form 5500 series as annual public datasets—flat CSV files, one set per filing year, downloadable with no API key. The main form carries the sponsor, the plan type, and the participant counts but not the asset dollar totals, so the script below pulls both a year's main 5500 extract and its Schedule H financial file, joins them on the EIN-plus-plan-number key to bring plan assets over, resolves the (verbose but stable) column names defensively, classifies plans into defined-benefit, defined-contribution, and other/welfare using the pension benefit-feature codes, and then computes two of the core views: assets and participants aggregated by plan class, and a ranking of sponsors by total plan assets. Requirements: requests and pandas. The Schedule C file ships separately and can be joined on the same key for the fee analysis described above.
import io
import zipfile

import requests
import pandas as pd

# DOL EFAST2 publishes the full Form 5500 series as annual public
# datasets -- flat CSV files, one per form/schedule per filing year,
# downloadable with no API key. The main filing file is "F_5500_<year>";
# the large-plan financial detail (plan assets, contributions) lives in
# the separate Schedule H file ("F_SCH_H_<year>"). The main form carries
# the sponsor, plan type, and participant counts but NOT the asset dollar
# totals, so any assets analysis must join Schedule H to the main form.
# Confirm the exact file names against the current DOL Form 5500
# datasets page before running -- column and file names shift by year.
BASE = "https://askebsa.dol.gov/FOIA%20Files/<YEAR>/Latest"


def load_zip_csv(url):
    # Each yearly extract ships as a single zipped CSV.
    r = requests.get(url, timeout=600)
    r.raise_for_status()
    zf = zipfile.ZipFile(io.BytesIO(r.content))
    name = next(n for n in zf.namelist() if n.lower().endswith(".csv"))
    with zf.open(name) as fh:
        return pd.read_csv(fh, low_memory=False)


def load_main(year):
    url = BASE.replace("<YEAR>", str(year)) + f"/F_5500_{year}_Latest.zip"
    return load_zip_csv(url)


def load_sch_h(year):
    # Schedule H: large-plan financial statement, keyed by EIN + plan no.
    url = BASE.replace("<YEAR>", str(year)) + f"/F_SCH_H_{year}_Latest.zip"
    return load_zip_csv(url)


def col(frame, *candidates):
    # Form 5500 column names are stable across years but verbose;
    # resolve defensively rather than hard-coding a single spelling.
    lower = {c.lower(): c for c in frame.columns}
    for cand in candidates:
        if cand.lower() in lower:
            return lower[cand.lower()]
    raise KeyError(f"none of {candidates} in {list(frame.columns)[:10]}...")


main = load_main(2023)
schh = load_sch_h(2023)
print(f"Main-form plan-year filings: {len(main):,}")
print(f"Schedule H (large-plan) financials: {len(schh):,}")

m_name = col(main, "SPONSOR_DFE_NAME", "SPONS_DFE_DBA_NAME")
m_ein = col(main, "SPONS_DFE_EIN", "SPONSOR_DFE_EIN")
m_pn = col(main, "SPONS_DFE_PN", "LAST_RPT_PLAN_NUM")
m_part = col(main, "TOT_PARTCP_BOY_CNT", "TOT_ACTIVE_PARTCP_CNT")
m_pen = col(main, "TYPE_PENSION_BNFT_CODE")
h_ein = col(schh, "SCH_H_EIN")
h_pn = col(schh, "SCH_H_PN")
h_assets = col(schh, "TOT_ASSETS_EOY_AMT", "NET_ASSETS_EOY_AMT")

# Build the EIN + plan-number key on both frames and bring plan assets
# over from Schedule H. Plans without a Schedule H (small plans) get 0.
for f, ein, pn in ((main, m_ein, m_pn), (schh, h_ein, h_pn)):
    f["_ein"] = f[ein].astype(str).str.split(".").str[0].str.zfill(9)
    f["_pn"] = f[pn].astype("Int64").astype(str)
assets = schh.groupby(["_ein", "_pn"])[h_assets].sum().rename("plan_assets")
df = main.merge(assets, on=["_ein", "_pn"], how="left")
df["plan_assets"] = df["plan_assets"].fillna(0)

# --- 1. Defined-benefit vs defined-contribution split ------------------
# Pension benefit codes flag the plan features; 2-series codes denote
# defined-contribution (401(k)-type) plans, 1-series defined benefit.
def plan_class(code):
    s = str(code)
    if any(t in s for t in ("2A", "2E", "2G", "2J", "2K", "2S")):
        return "Defined contribution"
    if any(t in s for t in ("1A", "1B", "1C", "1D", "1I")):
        return "Defined benefit"
    return "Other / welfare"


df["_class"] = df[m_pen].fillna("").map(plan_class)
agg = df.groupby("_class").agg(
    plans=(m_ein, "size"),
    participants=(m_part, "sum"),
    assets=("plan_assets", "sum"),
)
print("\nAssets and participants by plan class:")
for cls, row in agg.iterrows():
    print(f" {cls:<22} plans={int(row.plans):>7,} "
          f"participants={int(row.participants):>12,} "
          f"assets=${row.assets/1e9:>10,.1f}B")

# --- 2. Rank sponsors by total Schedule H plan assets ------------------
top = (df.groupby([m_name, "_ein"])["plan_assets"]
       .sum().sort_values(ascending=False).head(15))
print("\nTop 15 sponsors by plan assets:")
for (name, ein), assets in top.items():
    print(f" {str(name)[:40]:<40} EIN={ein} ${assets/1e9:>8,.2f}B")
Two practical notes apply. First, the plan classification in the script is deliberately a coarse first pass: it reads the pension benefit-feature codes to separate defined-benefit from defined-contribution plans, but the codes are a feature list rather than a clean type flag, and a rigorous classification should consult the current DOL Form 5500 instructions for the exact code meanings in the filing year and should treat welfare-only filings (which carry welfare benefit codes instead) separately. Second, because only large plans file Schedule H, the asset totals the join brings over cover the large-plan universe; small plans report less detail on Schedule I or the streamlined Form 5500-SF, so any sponsor ranking by assets is effectively a ranking of the large plans, and a serious fee or funding analysis must join to the relevant schedule rather than relying on the cover-page fields. For national-scale work, downloading the full annual datasets is far more efficient than the EFAST2 filing search, and the datasets ship with the version-stamped layout files that define every column for the year.
Limitations and analytical caveats
Form 5500 is the most comprehensive public record of the private benefit system, but it carries structural limitations that an analyst must internalize before drawing conclusions.
The large-plan threshold shapes what is visible. The richest data—the Schedule H financial statement and the independent audit—applies only to large plans, those with one hundred or more participants. The vast number of small plans file the streamlined Form 5500-SF, report far less financial detail, and are generally exempt from the audit requirement, while owner-only plans file the Form 5500-EZ, which is not part of the public dataset at all. Any analysis built on Schedule H or Schedule C is therefore an analysis of large plans, and generalizing its findings about fees or audits to the small-plan world—where costs are often higher and oversight thinner—is exactly the wrong direction to extrapolate. The dataset captures the well-resourced plans most completely and the smallest plans least.
There is substantial filing lag. A Form 5500 is due seven months after the end of the plan year, and that deadline can be extended by two and a half months, so a plan's filing for a given year does not appear until well after the year closes—and the most recent plan years in any snapshot of the public datasets are systematically incomplete as late and extended filings continue to arrive. The data is authoritative for established years and multi-year trends; it is not a current-state monitor, and treating the latest available year as complete will understate plan counts and assets at the leading edge.
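The deadline arithmetic can be sketched with the standard library. This assumes the rule as stated above (due at the end of the seventh month after the plan year ends) and models the two-and-a-half-month extension as the fifteenth day of the third month after the due month, which is an interpretation rather than something the text spells out. Weekend and holiday adjustments are ignored, and the function names are hypothetical.

```python
import calendar
from datetime import date


def add_months(year, month, n):
    """Return (year, month) shifted forward by n calendar months."""
    index = year * 12 + (month - 1) + n
    return index // 12, index % 12 + 1


def filing_deadlines(plan_year_end):
    """Return (regular due date, extended due date) for a plan year end."""
    y, m = add_months(plan_year_end.year, plan_year_end.month, 7)
    due = date(y, m, calendar.monthrange(y, m)[1])  # last day of that month
    ey, em = add_months(y, m, 3)
    extended = date(ey, em, 15)
    return due, extended


for year_end in (date(2023, 12, 31), date(2023, 6, 30)):
    due, extended = filing_deadlines(year_end)
    print(year_end, "->", due, "/", extended)
```

Comparing a snapshot's download date against the extended deadline gives a quick test of whether a plan year can be treated as reasonably complete.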
A filing is a snapshot, and assets are reported, not audited to a common standard. The financial figures describe the plan as of a single year-end and are sensitive to market conditions on that date; a plan's reported assets and a DB plan's funded status can swing sharply with the markets and with the actuarial assumptions chosen, which are not uniform across plans. Even with the IQPA audit, valuation conventions, asset categories, and the handling of indirect compensation on Schedule C vary enough that cross-plan comparisons require care—two plans reporting the same headline fee may be measuring different things. The audit confirms that the statements are fairly presented; it does not make every plan's numbers mechanically comparable to every other plan's.
Entity resolution on sponsors is imperfect. The sponsor name is free text and the EIN, though structured, can change—through corporate reorganizations, mergers, and plan transfers—so following a sponsor or a plan across years and across its multiple plans demands careful entity resolution rather than naive name or EIN matching. A plan can be merged into another, a sponsor can be acquired, and a plan number can be reused; any longitudinal analysis that does not account for these discontinuities will mis-attribute history.
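A first step toward sponsor entity resolution is usually to normalize the free-text name and then look for one normalized name appearing under several EINs. The sketch below is a crude candidate-generation heuristic, not a matcher: the suffix list is an illustrative assumption, the sample rows are invented, and any pairs it surfaces would still need review.

```python
import re
from collections import defaultdict

# Illustrative list of trailing legal-form tokens to ignore when comparing.
SUFFIXES = {"INC", "INCORPORATED", "CORP", "CORPORATION", "CO", "COMPANY",
            "LLC", "LLP", "LP", "LTD"}


def normalize_sponsor(name):
    """Uppercase, drop punctuation, and strip trailing legal-form tokens."""
    cleaned = re.sub(r"[^A-Z0-9 ]+", " ", str(name).upper())
    tokens = cleaned.split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


# Hypothetical (sponsor_name, sponsor_ein) pairs from different plan years.
rows = [
    ("Acme Widgets, Inc.", "012345678"),
    ("ACME WIDGETS INCORPORATED", "987654321"),
    ("Other Co., LLC", "111111111"),
]

eins_by_name = defaultdict(set)
for name, ein in rows:
    eins_by_name[normalize_sponsor(name)].add(ein)

# Names seen under more than one EIN are candidates for a reorganization.
candidates = {n: sorted(e) for n, e in eins_by_name.items() if len(e) > 1}
print(candidates)
```

The reverse check, one EIN carrying several unrelated normalized names, is equally worth running, since it can point to an acquisition or a renamed sponsor.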
Held with these caveats in mind, the dol_form_5500 table is a uniquely valuable resource: a public, plan-level, year-stamped census of the private retirement and welfare benefit system—the sponsors, the participants, the assets, the fees, and the funding—that turns a private system holding the retirement security of most of the American workforce into something the public can actually see.
Related writing
SEC N-PORT Mutual Fund Holdings: The Federal Database Behind Every Fund Portfolio Position — The 401(k) and pension assets that Form 5500 totals are largely invested in the mutual funds whose position-by-position holdings N-PORT discloses, so the two datasets together trace retirement money from the plan that holds it down to the individual securities it ultimately owns.
IRS Exempt Organizations Business Master File: The Federal Record of 1.3 Million Tax-Exempt Nonprofits — Both datasets are keyed on the employer identification number and rest on the same IRS qualification regime, and many nonprofit sponsors in the exempt-organizations file run the very 403(b) and pension plans that surface on the Form 5500 side.
BLS Occupational Employment and Wage Statistics: The Federal Database Behind Median Salary Data for Every US Occupation — Defined-benefit pensions calculate promised benefits from salary, so the wage distributions OEWS measures are the input on which the funding obligations that Form 5500's actuarial schedules report are ultimately built.