College Scorecard: The Federal Dataset That Exposes Graduation Rates, Debt, and Earnings for Every US College
For most of the history of American higher education, the question “is this college worth it?” had no empirically rigorous answer. Prospectus brochures described outcomes in terms of alumni success stories carefully selected for maximum impressiveness. US News rankings measured institutional prestige and selectivity rather than what graduates actually earned or owed. The College Scorecard — launched by the Obama administration in 2015 and substantially expanded since — changed that by publishing institution-level and program-level outcome data sourced directly from federal tax records, FAFSA submissions, and loan servicer reports. For the first time, a prospective student, a journalist, or a regulator could look up any accredited US college and see what its graduates owed ten years later and what they were earning.
Where the data comes from
The Scorecard is not a survey. It is a linked administrative dataset assembled from four federal sources: the Integrated Postsecondary Education Data System (IPEDS), to which colleges are legally required to report enrollment, completions, and financial data each year; FAFSA application records, which capture the family income and dependency status of students who applied for federal aid; federal student loan and Pell Grant records maintained by the Department of Education and its loan servicers; and IRS earnings records matched to the Social Security Numbers of former students. The IRS earnings linkage is what makes the Scorecard uniquely powerful: it is not self-reported income from alumni surveys with participation rates below 20%. It is tax return data for the near-complete population of former students who received federal aid.
The linkage process, handled by the National Student Loan Data System (NSLDS), is restricted to students who received federal financial aid (loans or Pell Grants) at some point during their enrollment. This means the Scorecard's earnings figures exclude students who attended college entirely without federal aid — typically wealthy students at elite private institutions. At schools where the federal aid population is less than 75% of total enrollment, the published earnings figures may underrepresent what the full student body earns. The Department of Education suppresses cells with fewer than ten students to prevent re-identification, which creates gaps in program-level data for small fields of study.
Institution-level metrics
The institution-level dataset covers every Title IV-eligible school — roughly 6,000 colleges, universities, trade schools, and certificate programs. For each institution, the Scorecard publishes:
- Completion rate — the share of first-time, full-time students who graduated within 150% of normal program time (six years for a four-year degree, three years for a two-year program). Completion rates are tracked separately for four-year institutions and less-than-four-year institutions because the populations and missions differ substantially.
- Median earnings 1, 5, and 10 years after enrollment — the median annual earnings of former students who received federal aid, measured at one, five, and ten years after they first enrolled. The one-year figure captures students who may still be in school or recently graduated; the ten-year figure is the most policy-relevant for evaluating return on investment.
- Median debt for completers and non-completers — these are separate fields and the distinction matters enormously. Students who did not complete their programs typically carry debt but receive none of the wage premium associated with the credential. At for-profit colleges with low completion rates, the median non-completer debt load can approach the completer debt load while earnings are substantially lower.
- Repayment rate — the share of borrowers who, three years after leaving school (whether by graduating or withdrawing), have successfully reduced their principal by at least one dollar. This is a stricter measure than default rate: a borrower in an income-driven repayment plan making payments that cover interest but not principal is not reducing their balance and does not count as a successful repayer, even though they are technically current on their loan.
- Pell Grant share — the fraction of enrolled students receiving Pell Grants, which are means-tested awards capped at $7,395 per year (as of 2025–2026) for students from families below roughly $65,000 in annual income. Pell share is the standard proxy for institutional socioeconomic diversity. It is also a prerequisite for certain federal accountability thresholds and state performance funding formulas.
- Federal loan rate — the share of students who took out any federal student loans during the year. Combined with median debt, this field allows estimation of the typical borrowing behavior at an institution.
- Median family income of enrolled students — derived from FAFSA records, this reflects the economic backgrounds of the enrolled aid population. Significant variation exists across institution types: some elite private universities report median family incomes above $150,000, while community colleges often report median family incomes below $30,000.
- Demographic composition — the share of students who are part-time, first-generation college students (defined as neither parent having a bachelor's degree), independent (not claimed as dependents on a parent's tax return), and the gender breakdown. These fields are critical for interpreting completion rates: institutions enrolling high shares of independent, part-time, and first-generation students face substantially higher structural barriers to completion and cannot be directly compared to traditional residential universities on raw completion rate.
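The gap between the repayment rate and the default rate described in the list above is easier to see with a toy cohort. The sketch below uses made-up borrower records; the `Borrower` class and its field names are illustrative and are not Scorecard fields. A borrower counts as repaying only if the balance three years out is at least one dollar below the original principal.

```python
from dataclasses import dataclass


@dataclass
class Borrower:
    original_principal: float  # balance when the borrower left school
    balance_at_3yr: float      # outstanding balance three years later
    in_default: bool


def repayment_rate(borrowers):
    """Share of borrowers who are not in default and reduced principal by >= $1."""
    repaying = [
        b for b in borrowers
        if not b.in_default and b.original_principal - b.balance_at_3yr >= 1.0
    ]
    return len(repaying) / len(borrowers)


def default_rate(borrowers):
    """Share of borrowers in default."""
    return sum(b.in_default for b in borrowers) / len(borrowers)


# Hypothetical cohort of four borrowers (values are invented for illustration)
cohort = [
    Borrower(20_000, 17_500, False),  # paying down principal
    Borrower(20_000, 20_000, False),  # income-driven plan, payments cover interest only
    Borrower(20_000, 21_300, False),  # forbearance, unpaid interest added to balance
    Borrower(20_000, 22_000, True),   # default
]

print("repayment rate:", repayment_rate(cohort))
print("default rate:  ", default_rate(cohort))
```

In this cohort only one borrower in four is in default, yet only one in four is actually reducing principal. The two borrowers in between are current on their loans but invisible to a default-rate measure.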
The earnings-debt gap as a screening metric
The most useful single metric for identifying high-risk programs is the ratio of median ten-year earnings to median debt for completers. Programs where graduates earn substantially less than what they borrowed signal a poor return on investment even for the students who successfully finished. The threshold of $40,000 in debt and $40,000 in earnings — a ratio of 1.0 — has become a rough journalistic and policy benchmark: programs below this line produce graduates who, in the median case, will spend the first decade of their careers with debt that is not being repaid at a meaningful rate relative to income.
The Department of Education's Gainful Employment rule (most recently finalized in 2023, then challenged in federal court) used a related but stricter metric: programs where annual debt payments exceeded both 8% of earnings and 20% of discretionary income for typical graduates could lose Title IV eligibility. Under standard amortization at 2025 interest rates, a $40,000 loan at 6.5% generates an annual payment of roughly $5,450 on a ten-year repayment plan, which is about 13.6% of $40,000 in annual earnings, well above the 8% threshold. The Gainful Employment rule targeted programs at the tail of the earnings-debt distribution, but the Scorecard data makes it possible for anyone to run a comparable calculation for any institution without access to confidential program-level borrowing records.
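The amortization arithmetic in the paragraph above can be reproduced in a few lines. This is a minimal sketch that assumes level monthly payments on a fixed-rate loan; the function names are illustrative, and the rule's own amortization periods and interest-rate conventions may differ from the ten-year, 6.5% example used here.

```python
def annual_payment(principal: float, annual_rate: float, years: int = 10) -> float:
    """Total of twelve level monthly payments on a standard amortizing loan."""
    r = annual_rate / 12          # monthly interest rate
    n = years * 12                # number of monthly payments
    monthly = principal * r / (1 - (1 + r) ** -n)
    return monthly * 12


def debt_to_earnings(principal: float, annual_rate: float,
                     annual_earnings: float, years: int = 10) -> float:
    """Annual loan payment as a share of annual earnings."""
    return annual_payment(principal, annual_rate, years) / annual_earnings


payment = annual_payment(40_000, 0.065)
rate = debt_to_earnings(40_000, 0.065, 40_000)

print(f"annual payment:       ${payment:,.0f}")
print(f"share of earnings:    {rate:.1%}")
print(f"above 8% of earnings: {rate > 0.08}")
```

Swapping in a program's median completer debt and median earnings from the Scorecard gives a rough annual-earnings rate for that program. The discretionary-income test needs a poverty-guideline input that is not shown here.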
For-profit college patterns
The for-profit sector — approximately 900 institutions in the Scorecard database, ranging from large publicly traded chains to small vocational schools — exhibits systematic patterns that distinguish it from the public and private nonprofit sectors.
Completion rates at for-profit four-year institutions average roughly 20–25 percentage points lower than at public four-year institutions, and this gap persists even after controlling for student demographic composition. The Scorecard's first-generation, part-time, and independent student share fields allow crude compositional adjustment, but researchers who have done more rigorous adjustments using matched student populations find that for-profit completion rates remain significantly below public alternatives serving equivalent students.
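One way to picture the crude compositional adjustment mentioned above is a linear regression of completion rate on student-composition shares plus a sector indicator. The sketch below runs entirely on synthetic data generated inside the script. None of the numbers are Scorecard values, and the assumed sector gap of -0.15 is an arbitrary choice made so the result can be checked.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic institution-level data (NOT Scorecard values)
for_profit = rng.integers(0, 2, n)  # 1 = for-profit, 0 = public
part_time = rng.uniform(0.1, 0.7, n) + 0.1 * for_profit
first_gen = rng.uniform(0.2, 0.6, n) + 0.1 * for_profit
independent = rng.uniform(0.1, 0.6, n) + 0.1 * for_profit

# Assumed data-generating process with a true sector gap of -0.15
completion = (
    0.80
    - 0.30 * part_time
    - 0.20 * first_gen
    - 0.10 * independent
    - 0.15 * for_profit
    + rng.normal(0, 0.03, n)
)

# Raw gap: simple difference in mean completion rates between sectors
raw_gap = completion[for_profit == 1].mean() - completion[for_profit == 0].mean()

# Adjusted gap: coefficient on the sector indicator after controlling for composition
X = np.column_stack([np.ones(n), part_time, first_gen, independent, for_profit])
coef, *_ = np.linalg.lstsq(X, completion, rcond=None)
adjusted_gap = coef[-1]

print(f"raw gap:      {raw_gap:+.3f}")
print(f"adjusted gap: {adjusted_gap:+.3f}")
```

In this synthetic setup the raw gap overstates the sector effect because for-profit schools also enroll more part-time, first-generation, and independent students. The adjusted coefficient recovers something close to the assumed gap. With real Scorecard data, institution-level shares are a blunt control, which is why the matched-student studies cited above carry more weight.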
Median debt for completers at for-profit institutions consistently exceeds median debt at comparable community colleges, even though community college programs often lead to the same occupational credentials at a fraction of the tuition cost. A student earning a medical billing certificate from a for-profit college may leave with $15,000 in debt; the same credential from a community college may cost $3,000. The debt field is not a tuition measure (the Scorecard reports cost and net price in separate fields), but it tracks price closely at for-profit schools because their federal aid population borrows at very high rates.
Three-year repayment rates at for-profit institutions are, in aggregate, the lowest of any sector. A repayment rate below 50% means that fewer than half of borrowers three years out have reduced their loan principal at all — they are either in default, forbearance, deferment, or on income-driven plans paying less than their accruing interest. Several large for-profit chains that collapsed between 2015 and 2022 (ITT Technical Institutes, the Education Management Corporation schools, Corinthian Colleges) had publicly available Scorecard repayment rates below 40% in the years before their closures, a signal that retrospective analysis has highlighted as an early indicator of institutional fragility.
Program-level data: CIP codes and field-of-study earnings
In 2020 the Department of Education added program-level data to the Scorecard, breaking out earnings and debt by field of study at the Classification of Instructional Programs (CIP) code level. CIP codes are a six-digit taxonomy maintained by the National Center for Education Statistics: 11.0701 is Computer Science, 14.0901 is Computer Engineering, 23.0101 is English Language and Literature, 50.0702 is Fine and Studio Art.
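Because CIP codes are hierarchical, a six-digit code can be rolled up to its two-digit family or four-digit grouping with simple string handling. The helper below is an illustrative sketch; `parse_cip` is not part of any Scorecard tooling. The Scorecard's field-of-study files may report at a coarser level than six digits, so check the data dictionary before assuming which level a file uses.

```python
def parse_cip(code: str) -> dict:
    """Split a six-digit CIP code such as '11.0701' into its hierarchy levels."""
    family, _, detail = code.partition(".")
    if len(family) != 2 or len(detail) != 4 or not (family + detail).isdigit():
        raise ValueError(f"not a six-digit CIP code: {code!r}")
    return {
        "two_digit": family,                      # broad family, e.g. 11
        "four_digit": f"{family}.{detail[:2]}",   # intermediate grouping, e.g. 11.07
        "six_digit": code,                        # specific program, e.g. 11.0701
    }


# The four codes named in the text above
EXAMPLES = {
    "11.0701": "Computer Science",
    "14.0901": "Computer Engineering",
    "23.0101": "English Language and Literature",
    "50.0702": "Fine and Studio Art",
}

for code, title in EXAMPLES.items():
    levels = parse_cip(code)
    print(levels["two_digit"], levels["four_digit"], levels["six_digit"], title)
```

Keeping codes as strings matters: reading them as floats would turn 11.0701 into a number and silently drop trailing zeros in codes such as 23.0100.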
The program-level data makes the earnings differentials between fields of study legible in ways that institution-level averages obscure. Software and computer engineering programs (CIP 11 and 14) at large public universities report ten-year median earnings in the $80,000–$120,000 range. General liberal arts and humanities programs at the same institutions often report $45,000–$60,000 — lower, but still positive relative to debt. Art and design programs (CIP 50) at expensive private art colleges frequently exhibit earnings-to-debt ratios below 1.0 even for completers: a graduate of a four-year BFA program at a $50,000-per-year art school may carry $80,000 in federal debt while earning a median of $38,000 at the ten-year mark.
The program-level dataset is more sparse than the institution-level dataset. Cell suppression for small programs, the recency of the data (only available for recent cohorts), and the fact that earnings are reported for completers who received federal aid rather than all completers limit the analytical scope. Nevertheless, it is the only federally compiled source of field-of-study earnings at the institution-program-code level, and investigative teams have used it to identify specific degree programs — not just institutions — where typical graduates are financially worse off after attending than they would have been without the credential.
Data access: API and bulk downloads
The Scorecard data is available through two channels. The API, documented at collegescorecard.ed.gov/data, requires a free API key obtained through api.data.gov. The endpoint at https://api.data.gov/ed/collegescorecard/v1/schools accepts GET requests with field specifications and filter parameters, returns JSON, and paginates at up to 100 records per page. Field names follow a hierarchical dot-notation schema: the prefix "latest." returns the most recent available value, while a year prefix such as "2022." returns the value for 2022. Fields under "latest.earnings" contain the IRS-linked earnings data; "latest.aid" contains debt and Pell metrics; "latest.completion" contains graduation rates.
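The dot-notation schema lends itself to a small helper that builds field names and query URLs without touching the network. This is a sketch under the conventions described above; `scoped` and `build_url` are hypothetical helper names, and the field paths are the ones used elsewhere in this article, so confirm them against the data dictionary.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://api.data.gov/ed/collegescorecard/v1/schools"


def scoped(field: str, year: Optional[int] = None) -> str:
    """Prefix a field path with 'latest.' or with a specific year."""
    prefix = "latest" if year is None else str(year)
    return f"{prefix}.{field}"


def build_url(api_key, fields, filters=None, page=0, per_page=100):
    """Assemble a GET URL; per_page is capped at the 100-record page limit."""
    params = {
        "api_key": api_key,
        "fields": ",".join(fields),
        "page": page,
        "per_page": min(per_page, 100),
    }
    params.update(filters or {})
    return f"{BASE_URL}?{urlencode(params)}"


url = build_url(
    "YOUR_API_KEY_HERE",
    fields=[
        "school.name",  # school.* fields carry no year prefix
        scoped("earnings.10_yrs_after_entry.median"),
        scoped("aid.pell_grant_rate", 2022),
    ],
    filters={"school.ownership": 3},  # 3 = for-profit
)
print(url)
```

Building the URL separately from sending it makes it easy to log or inspect exactly which fields and filters a query asked for.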
The bulk download option, also linked from collegescorecard.ed.gov/data, provides zipped CSV files covering all institutions and all available cohort years. The full institution-level file is approximately 300 MB compressed. A separate data dictionary spreadsheet documents every field name, the source system, the cohort year, suppression criteria, and the calculation methodology. Reading the data dictionary before analysis is essential: several fields that appear to measure the same thing (different completion rate variants, different debt aggregations) have meaningfully different denominators and cannot be swapped without changing what the analysis measures.
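When working from the bulk files, suppressed cells need to be converted to missing values before any arithmetic. The sketch below assumes that suppressed cells appear as the literal string "PrivacySuppressed" and missing cells as "NULL", and that the column names shown match the data dictionary. Treat all of these as assumptions to verify. It runs on a tiny synthetic CSV so it is self-contained; in practice `source` would be the path to the downloaded institution-level file.

```python
import io

import pandas as pd

# Tiny synthetic stand-in for the institution-level CSV (all values are made up)
SAMPLE = """UNITID,INSTNM,CONTROL,MD_EARN_WNE_P10,GRAD_DEBT_MDN
1,Example State University,1,52000,24000
2,Example Career Institute,3,PrivacySuppressed,31000
3,Example College of Art,2,38000,NULL
"""

# Assumed column names: institution ID, name, ownership, 10-year median earnings,
# median debt for completers
COLUMNS = ["UNITID", "INSTNM", "CONTROL", "MD_EARN_WNE_P10", "GRAD_DEBT_MDN"]


def load_scorecard(source) -> pd.DataFrame:
    """Read selected columns, mapping suppression markers to NaN."""
    return pd.read_csv(
        source,
        usecols=COLUMNS,
        na_values=["PrivacySuppressed", "NULL"],
        low_memory=False,
    )


df = load_scorecard(io.StringIO(SAMPLE))
print(df.dtypes)
print(df)
```

Restricting `usecols` to the fields an analysis needs keeps memory use manageable, since the full file has far more columns than any single analysis uses.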
Python: pulling institution data and computing earnings-debt ratios by sector
The code below queries the Scorecard API for all institutions with available ten-year earnings data, computes the earnings-to-debt ratio for completers, and segments by sector. It also flags institutions meeting the high-debt / low-earnings threshold of $40,000 median completer debt and less than $40,000 median ten-year earnings — the population most likely to be targeted by Gainful Employment-style accountability rules and investigative reporting.
```python
import requests
import pandas as pd

# College Scorecard API: institution-level data
# Register for a free API key at collegescorecard.ed.gov/data
# Docs: https://collegescorecard.ed.gov/data/documentation/
API_KEY = "YOUR_API_KEY_HERE"
BASE_URL = "https://api.data.gov/ed/collegescorecard/v1/schools"

# Fields to pull per institution
# (check each name against the current data dictionary before relying on it)
FIELDS = ",".join([
    "school.name",
    "school.state",
    "school.ownership",  # 1=public, 2=private nonprofit, 3=for-profit
    "school.region_id",
    "latest.completion.completion_rate_4yr_150nt",  # 6-yr rate at 4-yr schools
    "latest.completion.completion_rate_less_than_4yr_150nt",
    "latest.earnings.10_yrs_after_entry.median",
    "latest.repayment.3_yr_repayment.overall",  # share who reduced principal
    "latest.aid.median_debt.completers.overall",
    "latest.aid.median_debt.noncompleters",
    "latest.aid.pell_grant_rate",
    "latest.aid.federal_loan_rate",
    "latest.student.demographics.median_family_income",
    "latest.student.demographics.first_generation",
    "latest.student.part_time_share",
    "latest.student.demographics.female_share",
])


def fetch_page(page: int, per_page: int = 100) -> dict:
    """Fetch one page of results (page numbering starts at 0)."""
    params = {
        "api_key": API_KEY,
        "fields": FIELDS,
        "page": page,
        "per_page": per_page,
        # Only schools that have reported earnings data
        "latest.earnings.10_yrs_after_entry.median__range": "1..",
    }
    resp = requests.get(BASE_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()


# Paginate through all institutions
rows = []
page = 0
while True:
    data = fetch_page(page)
    results = data.get("results", [])
    if not results:
        break
    rows.extend(results)
    total = data.get("metadata", {}).get("total", 0)
    if len(rows) >= total:
        break
    page += 1

df = pd.DataFrame(rows)
print(f"Institutions retrieved: {len(df)}")

# Rename for readability
df = df.rename(columns={
    "school.name": "name",
    "school.state": "state",
    "school.ownership": "ownership",
    "latest.completion.completion_rate_4yr_150nt": "completion_rate",
    "latest.completion.completion_rate_less_than_4yr_150nt": "completion_rate_lt4yr",
    "latest.earnings.10_yrs_after_entry.median": "median_earnings_10yr",
    "latest.repayment.3_yr_repayment.overall": "repayment_rate_3yr",
    "latest.aid.median_debt.completers.overall": "median_debt_completers",
    "latest.aid.median_debt.noncompleters": "median_debt_noncompleters",
    "latest.aid.pell_grant_rate": "pell_share",
    "latest.aid.federal_loan_rate": "loan_share",
    "latest.student.demographics.median_family_income": "median_family_income",
    "latest.student.demographics.first_generation": "first_gen_share",
    "latest.student.part_time_share": "part_time_share",
    "latest.student.demographics.female_share": "female_share",
})

# Coerce the metric columns once so missing or suppressed values become NaN
# and the aggregations below operate on numeric data
NUMERIC_COLS = [
    "completion_rate", "median_earnings_10yr", "repayment_rate_3yr",
    "median_debt_completers", "pell_share",
]
df[NUMERIC_COLS] = df[NUMERIC_COLS].apply(pd.to_numeric, errors="coerce")

# Sector labels: 1=public, 2=private nonprofit, 3=for-profit
SECTOR = {1: "Public", 2: "Private nonprofit", 3: "For-profit"}
df["sector"] = df["ownership"].map(SECTOR)

# --- Earnings-to-debt ratio by sector ---
# Leave the ratio as NaN where debt is zero or missing to avoid infinite values
df["earnings_debt_ratio"] = (
    df["median_earnings_10yr"]
    / df["median_debt_completers"].where(df["median_debt_completers"] > 0)
)

summary = (
    df.groupby("sector")
    .agg(
        institutions=("name", "count"),
        median_earnings=("median_earnings_10yr", "median"),
        median_debt=("median_debt_completers", "median"),
        median_ed_ratio=("earnings_debt_ratio", "median"),
        avg_completion=("completion_rate", "mean"),  # four-year schools only
        avg_repayment_3yr=("repayment_rate_3yr", "mean"),
        avg_pell_share=("pell_share", "mean"),
    )
    .reset_index()
    .sort_values("median_ed_ratio", ascending=False)
)
print(summary.to_string(index=False))

# --- Flag high-debt / low-earnings institutions ---
df["debt"] = df["median_debt_completers"]
df["earn"] = df["median_earnings_10yr"]
flagged = df[
    (df["debt"] >= 40_000) & (df["earn"] < 40_000)
][["name", "state", "sector", "debt", "earn", "repayment_rate_3yr"]].copy()
flagged = flagged.sort_values("debt", ascending=False)
print(f"\nHigh-debt / low-earnings institutions: {len(flagged)}")
print(flagged.head(20).to_string(index=False))
```

Running this against the current API typically returns 3,500–4,500 institutions with sufficient data to compute the ratio. Public institutions cluster at a ratio of roughly 1.8–2.5 (graduates earning $54,000–$75,000 at the ten-year median against debt of $27,000–$30,000). Private nonprofits vary widely, with elite research universities at ratios above 3.0 and smaller liberal arts colleges closer to 1.5. For-profit institutions consistently show the lowest median ratio, typically 0.9–1.3, and account for a disproportionate share of the flagged high-debt / low-earnings institutions.
How journalists use the Scorecard
Investigative journalists have used the Scorecard as primary source material for accountability reporting on specific programs and institutions. The analytical pattern that has proven most durable is joining Scorecard data to state attorney general complaints, accreditor reports, and Department of Education enforcement actions to show that institutions with the worst Scorecard metrics were also accumulating regulatory scrutiny in parallel systems. The Scorecard data predates these enforcement actions in most cases, which supports the argument that the financial outcome signals were visible and ignored rather than unknown.
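The joining pattern described above can be sketched with a left merge on an institution identifier. Everything in this example is invented for illustration: the institutions, the enforcement records, and the column names. The `opeid6` key is an assumption about which identifier the second dataset would share with the Scorecard.

```python
import pandas as pd

# Hypothetical Scorecard extract (values are made up)
scorecard = pd.DataFrame({
    "opeid6": ["001234", "005678", "009999"],  # strings preserve leading zeros
    "name": ["Example State", "Example Career Institute", "Example Tech"],
    "repayment_rate_3yr": [0.72, 0.35, 0.38],
    "data_year": [2016, 2016, 2016],
})

# Hypothetical enforcement-action records compiled from other sources
actions = pd.DataFrame({
    "opeid6": ["005678", "009999", "004321"],
    "action": ["AG complaint", "Accreditor sanction", "Program review"],
    "action_year": [2018, 2015, 2019],
})

# Left merge keeps every Scorecard row; _merge shows which rows found a match
merged = scorecard.merge(actions, on="opeid6", how="left", indicator=True)

# True only when an action exists and the Scorecard data came first
merged["signal_preceded_action"] = merged["data_year"] < merged["action_year"]

print(merged[["name", "repayment_rate_3yr", "action",
              "action_year", "signal_preceded_action"]])
```

Rows with no matching action get a missing `action_year`, so the comparison evaluates to False for them rather than raising an error. The `_merge` column is worth inspecting before drawing conclusions, because unmatched identifiers are a common source of silent data loss in this kind of join.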
The program-level CIP data has enabled a more granular form of accountability: identifying not just underperforming institutions but underperforming programs at otherwise reputable institutions. A regional state university with solid overall Scorecard metrics may have a specific graduate certificate program in a for-profit-adjacent field — criminal justice administration, healthcare administration, digital marketing — with debt loads and earnings outcomes that look more like a for-profit chain than the parent institution. The CIP-level data surfaces these within-institution anomalies that aggregate institution metrics conceal.
The repayment rate field has been used as an early warning indicator. Journalists and researchers monitoring institutions in financial stress have documented cases where declining repayment rates — meaning more borrowers failing to reduce their principal — preceded accreditor sanction or closure by two to four years. For-profit chains whose repayment rates fell below 40% in the data were, in retrospect, showing financial stress in their student populations that foreshadowed the enrollment declines, Department of Education scrutiny, and eventual shutdowns that followed.
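A simple version of this early-warning check looks at two things: whether the latest repayment rate is below a floor, and whether the rate has fallen for several consecutive years. The sketch below is one possible screen, not a method taken from any published analysis. The 0.40 floor comes from the figure cited above, the two-year decline rule is an arbitrary assumption, and the histories are invented.

```python
def flag_early_warning(history, floor=0.40, min_decline_years=2):
    """Screen a list of (year, repayment_rate) pairs for warning signs."""
    rates = [rate for _, rate in sorted(history)]
    below_floor = rates[-1] < floor

    # Count consecutive year-over-year declines ending at the latest year
    streak = 0
    for i in range(len(rates) - 1, 0, -1):
        if rates[i] < rates[i - 1]:
            streak += 1
        else:
            break

    return {
        "below_floor": below_floor,
        "decline_streak": streak,
        "flag": below_floor or streak >= min_decline_years,
    }


# Hypothetical repayment-rate histories (values are made up)
stable = [(2013, 0.62), (2014, 0.63), (2015, 0.61), (2016, 0.64)]
declining = [(2013, 0.55), (2014, 0.50), (2015, 0.44), (2016, 0.38)]

print("stable:   ", flag_early_warning(stable))
print("declining:", flag_early_warning(declining))
```

A screen like this only surfaces candidates for closer reporting. Year-to-year noise at small institutions can produce declines that mean nothing, so flagged schools still need to be checked against enrollment, accreditor, and enforcement records.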
The Department of Education also operates College Navigator, a lookup tool that draws on the same federal reporting and is designed for prospective students rather than researchers. For programmatic access, the API and bulk downloads are more appropriate, but College Navigator is a quick way to check whether a specific institution's published metrics match the API output, which is a worthwhile validation step when building any analysis pipeline on top of the Scorecard data.
Limitations
The Scorecard's earnings figures cover only former students who received federal aid. At institutions where the aid population is substantially poorer or younger than the non-aid population, median earnings may not generalize. Elite private universities with large endowments enroll many students who do not need federal aid; their Scorecard earnings figures may understate true graduate earnings because they exclude the highest-earning graduates who attended without borrowing.
Cohort timing matters. The ten-year earnings figure measures what students who enrolled ten years ago now earn, not what students enrolling today will earn. During periods of rapid wage change in specific sectors — technology in 2015–2022, for example — the historical cohort earnings figure may lag current labor market conditions substantially. The one-year earnings figure is more timely but captures students who may still be in school, in low-paid entry-level roles, or in graduate programs that depress near-term earnings relative to long-term trajectory.
The completion rate denominator is first-time, full-time students. At community colleges and open-access institutions where the majority of students enroll part-time or transfer in with prior credits, the measured completion rate systematically undercounts actual completion. The Scorecard publishes supplemental completion metrics using broader denominators for schools that opt in, but these are not uniformly available and cannot be compared across institution types without care.