Almost every line of basic American science begins as a proposal sitting on a panel table: a few dozen pages describing what is not yet known, read and scored by other scientists, and either funded or not. When the National Science Foundation says yes, the decision becomes a public record: an award number, an institution, a named principal investigator, a program and a directorate, a title and an abstract, start and end dates, and a dollar figure obligated against it. Roughly 4,500 recent award records sit in our table, the searchable trace of who gets US science funding and for what.
This article covers what the NSF Awards dataset is and how the National Science Foundation Act of 1950 frames it; how NSF differs from the mission agencies and the biomedical funders, and why it sits at the center of the basic-research enterprise; the merit-review process—the two criteria of intellectual merit and broader impacts, and the panel system that produces the awards; the directorate and program structure that the award fields encode; the signature programs—CAREER, the Graduate Research Fellowship, the instrumentation and facilities programs, and the National AI Research Institutes; the abstract corpus and why it makes the data searchable by research topic; how the awards join to other federal research-funding and higher-education datasets; a Python workflow against the open NSF Awards API that tallies obligations by institution and program and searches abstracts by keyword; and the caveats—obligation versus outlay accounting, the institution-not-PI grain, and recency lag—that every analyst must hold in mind.
What the dataset is
The NSF Awards data is the public record of the grants the National Science Foundation has made. Every funded proposal becomes an award, and every award is published with the structured facts that describe it: who received the money, who leads the work, what the work is, when it runs, and how much was committed. NSF publishes this through its Awards search and the NSF Awards API at api.nsf.gov, and through bulk award downloads—no key required for either. The grain is one row per award, identified by the NSF award number, a numeric identifier that stays with the project across reports, amendments, and supplements.
In our database this record is stored as the table nsf_awards, roughly 4,500 recent awards keyed by the award number and recipient. The columns capture the recipient, the people, the program lineage, the timing, the money, and—the field that makes the corpus genuinely searchable—the abstract:
award_id -- NSF award number (the persistent project key)
title -- the award / project title
awardee_name -- recipient institution (university, lab, company)
awardee_state_code -- two-letter state of the recipient institution
pi_first_name -- principal investigator, given name
pi_last_name -- principal investigator, surname
fund_program_name -- the funding program (rolls up to a directorate)
directorate -- NSF directorate administering the award
funds_obligated_amt -- dollars obligated to date on the award
total_intnd_awd_amt -- total intended (anticipated) award amount
start_date -- award start date
exp_date -- award expiration (end) date
abstract_text -- the public project abstract / summary
The award_id is the load-bearing column: it is the persistent key that ties an award to its later annual and final project reports, to any supplements, and to outputs (publications, patents, datasets) that acknowledge the grant. The awardee_name records the institution, not the individual (a point the caveats section returns to), because NSF grants are legally made to the institution that administers the funds, with pi_first_name and pi_last_name naming the principal investigator who leads the science. The fund_program_name and directorate encode the program lineage that places an award within NSF's organizational structure. The two money fields, funds_obligated_amt and total_intnd_awd_amt, distinguish what has actually been committed to date from what the full award was intended to be, a distinction that matters whenever the analysis is about dollars. The abstract_text is the column that lifts the dataset above a bare ledger: a paragraph or two of plain-language description of the research, which turns the corpus into something you can query by topic rather than only by institution or program.
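As a rough illustration of the column list above, the table could be declared in SQLite as follows. The column names come from the list; the SQL types, the single-column primary key, and the sample row are assumptions made for this sketch, and the row is synthetic placeholder data rather than a real award.

```python
import sqlite3

# Column names follow the article's list; the SQL types are assumptions.
SCHEMA = """
CREATE TABLE nsf_awards (
    award_id            TEXT PRIMARY KEY,
    title               TEXT,
    awardee_name        TEXT,
    awardee_state_code  TEXT,
    pi_first_name       TEXT,
    pi_last_name        TEXT,
    fund_program_name   TEXT,
    directorate         TEXT,
    funds_obligated_amt REAL,
    total_intnd_awd_amt REAL,
    start_date          TEXT,
    exp_date            TEXT,
    abstract_text       TEXT
)
"""

# Synthetic placeholder row, NOT a real NSF award.
SAMPLE = (
    "0000001", "Example project title", "Example State University", "XX",
    "Ada", "Example", "Example Program", "Example Directorate",
    150000.0, 450000.0, "2024-01-01", "2026-12-31",
    "A made-up abstract about example methods.",
)


def build_demo_db():
    """Create an in-memory database holding the one synthetic row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO nsf_awards VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)", SAMPLE
    )
    return conn


conn = build_demo_db()
row = conn.execute(
    "SELECT award_id, funds_obligated_amt, total_intnd_awd_amt "
    "FROM nsf_awards WHERE award_id = ?",
    ("0000001",),
).fetchone()
print(row)
```

Storing award_id as text rather than an integer is a deliberate choice in this sketch: it preserves any leading zeros in the identifier.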
What it is and the NSF Act of 1950
The National Science Foundation is an independent federal agency created by the National Science Foundation Act of 1950. Its origins lie in the postwar recognition—crystallized in Vannevar Bush's 1945 report Science, the Endless Frontier—that the United States needed a permanent, peacetime federal mechanism to support basic research and scientific education, the kind of curiosity-driven inquiry that industry would not fund on its own because its payoffs are distant and uncertain. The Act gave NSF a mandate that is unusual in its breadth: to promote the progress of science, to advance the national health, prosperity, and welfare, and to secure the national defense—by funding research and education across essentially all fields of science and engineering.
Two features of that mandate shape the data. The first is the focus on basic research. NSF is the country's principal funder of fundamental, non-medical science: it supports roughly a quarter of all federally funded basic research conducted at US colleges and universities, and in many fields—mathematics, the physical sciences, the non-medical life sciences, the social and behavioral sciences, much of computer science and engineering—it is the dominant or only federal source. The second feature is what NSF does not do, which is as important for interpreting the data as what it does. NSF does not fund biomedical research (that is the National Institutes of Health), nor energy research (the Department of Energy), nor defense systems or space hardware (the Department of Defense and NASA). Those mission agencies fund applied and targeted work tied to their missions; NSF funds the broad base of inquiry beneath them. An analysis of the US research enterprise that reads the NSF awards alone is reading the basic-science layer, not the whole stack—which is precisely why the dataset is complementary to, rather than a substitute for, the biomedical, energy, and defense funding records.
NSF is also more than a grant-maker. It funds and stewards major national research facilities and instruments—telescopes, research vessels, supercomputing centers, field stations, the national high-magnetic-field laboratory—that no single university could build or run. It is a major funder of STEM education and of the training of the next generation of scientists. And it is the source of much of the official statistics on the US scientific workforce and on national research and development spending. The awards data is the most granular and most queryable face of that enterprise: where the high-level statistics tell you how much the country spends on basic research, the awards tell you which proposals, at which institutions, in which fields, the money actually went to.
The merit-review process
Every award in the dataset is the output of NSF's merit-review process, and understanding that process is essential to interpreting the data, because the awards are not a random or formula-driven distribution of funds—they are the survivors of an intensely competitive, peer-judged selection. A researcher (or team) submits a proposal to a specific program. Program officers organize the proposals for review, and the review is conducted by scientific peers—other experts in the field, who read the proposals and assess them either through written ad hoc reviews, through a convened review panel that meets to discuss and rank a batch of proposals, or both. The program officer weighs the reviews, the panel's recommendations, and program-balance considerations, and makes a funding recommendation that moves up through NSF for the final award decision.
NSF judges proposals against two explicit merit-review criteria, applied to every proposal the agency receives. The first is intellectual merit: the potential of the proposed work to advance knowledge—its importance, the soundness of its approach, the qualifications of the team, the adequacy of the resources. The second, distinctive to NSF, is broader impacts: the potential of the work to benefit society and contribute to desired societal outcomes—through education and training, broadening participation in science, building research infrastructure, disseminating results, or informing public policy. The two-criterion structure is why an NSF proposal is never only about the science; it is also about who the science reaches and trains. This matters for the data because the broader-impacts criterion is part of what the abstracts describe and part of what drives the demographic and geographic distribution—NSF deliberately funds programs aimed at institutions and regions that have historically received less research funding, and that intent leaves a fingerprint in where the awards land.
The competitiveness of the process is the context for any analysis of funding rates. Across NSF's programs only a fraction of submitted proposals are funded—funding rates commonly run well below half and in the most competitive programs are far lower—so an award represents a proposal that cleared a high bar. An analyst counting awards by institution or field is, in effect, counting wins in a tournament whose entry pool the awards data does not itself contain. The dataset records the funded proposals, not the declined ones; it shows who succeeded, not the denominator of who competed. That asymmetry is benign as long as it is remembered, and misleading the moment an award count is read as if it measured research activity rather than research funding success.
Directorates and the program structure
NSF is organized into directorates, each covering a broad domain of science and engineering, and the directorate and program fields in the awards data are the keys to that structure. Historically the directorates have covered the biological sciences; computer and information science and engineering; engineering; geosciences; the mathematical and physical sciences; the social, behavioral, and economic sciences; STEM education; and, more recently, a directorate devoted to technology, innovation, and partnerships that aims to speed the translation of research into application. Within each directorate sit divisions, and within divisions sit the programs—the concrete funding lines, each run by program officers, to which researchers actually apply. The fund_program_name field records that program, and it rolls up to the directorate.
This structure is what makes funding trends by directorate a first-class analysis. Because each award carries its program and directorate, the data lets an analyst track how the balance of NSF funding shifts across domains over time—whether dollars are flowing toward computing and engineering, whether the geosciences or the social sciences are gaining or losing share, how a new cross-cutting priority (artificial intelligence, quantum information science, climate, advanced manufacturing) draws funding across multiple directorates at once. Many of NSF's most important initiatives are deliberately cross-directorate: a single research theme is funded through programs in several directorates simultaneously, which means tracing the money for a priority area requires querying across programs rather than reading a single directorate's line. The program and directorate fields, combined with the obligated-amount field, are what let the awards data answer the perennial science-policy question of where the federal basic-research dollar is actually going—by field, by year, and by the priorities that cut across fields.
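A minimal sketch of the directorate-share calculation described above, using only the standard library. It assumes each award is a dict with the table's column names, that start_date is an ISO string (YYYY-MM-DD), and that the start year is an acceptable stand-in for the funding year, which is a simplification because multi-year awards are obligated in increments. The demo rows and directorate labels are synthetic.

```python
from collections import defaultdict


def directorate_shares(awards):
    """Return {year: {directorate: share of that year's obligated dollars}}."""
    totals = defaultdict(lambda: defaultdict(float))
    for a in awards:
        year = a["start_date"][:4]  # assumes ISO dates, YYYY-MM-DD
        totals[year][a["directorate"]] += float(a["funds_obligated_amt"] or 0)
    shares = {}
    for year, by_dir in totals.items():
        year_total = sum(by_dir.values())
        if year_total == 0:
            continue  # skip years with no obligated dollars
        shares[year] = {d: amt / year_total for d, amt in by_dir.items()}
    return shares


# Synthetic rows for illustration only.
demo = [
    {"start_date": "2023-09-01", "directorate": "DIR_A", "funds_obligated_amt": 300.0},
    {"start_date": "2023-10-01", "directorate": "DIR_B", "funds_obligated_amt": 100.0},
    {"start_date": "2024-01-15", "directorate": "DIR_A", "funds_obligated_amt": 100.0},
    {"start_date": "2024-03-01", "directorate": "DIR_B", "funds_obligated_amt": 100.0},
]
shares = directorate_shares(demo)
for year in sorted(shares):
    print(year, shares[year])
```

Working in shares rather than raw dollars keeps years with different award counts comparable, which matters when the table is a recent slice rather than the full history.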
CAREER, GRFP, instrumentation, and the AI institutes
Several signature programs run through the awards data and are worth knowing because they structure large, recognizable slices of it. The Faculty Early Career Development (CAREER) program is NSF's most prestigious award for early-career faculty: a multi-year grant that supports a junior researcher's integrated research and education plan, and one of the strongest early signals of a rising scientific career. CAREER awards form an identifiable cohort in the data, and tracing them is a way to study the pipeline of emerging investigators across fields and institutions.
The Graduate Research Fellowship Program (GRFP) is NSF's oldest fellowship and one of its most consequential investments: it funds graduate students directly, early in their training, across the sciences and engineering. GRFP fellows include an extraordinary number of scientists who went on to lead their fields, which makes the fellowship a long-running natural experiment in identifying and supporting research talent. The Major Research Instrumentation program and the broader facilities and infrastructure programs fund the equipment and shared instruments, from microscopes and mass spectrometers to clusters and detectors, that a department or consortium needs to do modern research; these awards tend to be larger and to cluster at research-intensive institutions, and they shape the capital base of the research enterprise rather than a single project.
Among the newer flagship efforts, the National Artificial Intelligence Research Institutes are large, multi-institution centers, funded by NSF together with partner agencies, that organize AI research around themes (AI for science, for agriculture, for education, for trustworthy systems) and concentrate substantial funding into coordinated programs. They are a clear example of the cross-directorate, multi-agency, large-center model that increasingly sits alongside the traditional single-investigator grant. For an analyst, the value of knowing these programs is that they explain much of the structure in the data (the clusters of large awards, the cohorts of early-career grants, the multi-institution centers), and they are exactly the strata one most often wants to isolate when studying how NSF funding maps onto the shape of the US research workforce and its frontier areas.
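One rough way to isolate these strata is to tag awards by a leading title prefix. This is a heuristic sketch, not a documented NSF rule: it assumes that titles in some programs open with a short tag followed by a colon (for example "CAREER:" or "MRI:"), and the prefix-to-stratum mapping below is hypothetical. Matching on fund_program_name is likely to be more reliable where the program name is populated.

```python
import re

# Assumed title prefixes mapped to strata; verify against fund_program_name.
PREFIXES = {"CAREER": "career", "MRI": "instrumentation"}

# A single leading word followed by a colon, e.g. "CAREER: ...".
_PREFIX_RE = re.compile(r"^\s*([A-Za-z]+)\s*:")


def stratum(title):
    """Classify an award title into a coarse stratum by its leading tag."""
    m = _PREFIX_RE.match(title or "")
    if not m:
        return "other"
    return PREFIXES.get(m.group(1).upper(), "other")


# Made-up titles for illustration only.
for t in ["CAREER: An example project", "MRI: Acquisition of an example instrument",
          "Collaborative Research: An example", "An untagged example title"]:
    print(f"{stratum(t):16s} {t}")
```

Anything the mapping does not recognize falls into "other", so the tagger errs toward under-counting a stratum rather than mislabeling awards.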
The abstract corpus and topic search
The single feature that most distinguishes the NSF Awards data from a plain financial ledger is the abstract. Every award carries a public project abstract—a plain-language summary of the research, written to communicate both the intellectual content and the broader impacts to a general audience. Collected across thousands of awards, the abstracts form a large, structured, topic-rich corpus that turns the dataset into something queryable by what the research is about, not merely by who received it or which program funded it.
This is what makes the awards data uniquely suited to mapping the topology of US science. Because the program and directorate fields are administrative categories that do not always track the actual research topic—a study of machine learning might be funded out of a computing program, a statistics program, a neuroscience program, or an engineering program—the abstract is often the only field that reliably identifies what a project really studies. Searching the abstract corpus for a term lets an analyst trace an emerging research front across the administrative boundaries that would otherwise hide it: how much NSF funding mentions a given technique, how that mention count has grown year over year, which institutions and which directorates the funded work clusters in, and how a frontier idea diffuses from a handful of programs into many. The modern extension of this is to embed the abstracts and cluster or classify them, building a semantic map of the research portfolio rather than a keyword index—but even the simple keyword scan in the worked example below is enough to turn the awards data from a record of transactions into a record of ideas being funded.
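The year-over-year mention count described above might be sketched like this. It assumes award dicts carrying abstract_text and an ISO start_date, and it matches the term on word boundaries so that a query for "neural" does not pick up unrelated words that merely begin with it. The demo abstracts are synthetic.

```python
import re
from collections import Counter


def mentions_by_year(awards, term):
    """Count awards whose abstract mentions `term`, grouped by start year."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for a in awards:
        if pattern.search(a.get("abstract_text") or ""):
            counts[a["start_date"][:4]] += 1  # assumes ISO dates
    return dict(sorted(counts.items()))


# Synthetic abstracts for illustration only.
demo = [
    {"start_date": "2023-01-01", "abstract_text": "We study neural networks."},
    {"start_date": "2024-01-01", "abstract_text": "Neural operators for PDEs."},
    {"start_date": "2024-06-01", "abstract_text": "A project on neuralgia treatment."},
    {"start_date": "2024-07-01", "abstract_text": None},
]
by_year = mentions_by_year(demo, "neural")
print(by_year)
```

Each award counts once per year however many times the term appears, so the result measures how many funded projects mention a topic rather than how often the word is used.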
Joining to other federal research-funding data
The NSF Awards data is most powerful as one node in a network of federal research-funding and higher-education datasets. Three kinds of join matter most.
The first is to the government-wide spending record. Every NSF grant is also a federal financial-assistance transaction reported under the government-wide spending data, where it appears alongside grants from every other agency, keyed to the recipient and carrying the federal award identification and the obligated and outlaid amounts. Joining the NSF awards to that record places NSF's basic-research spending in the context of the whole federal grant enterprise—and lets an analyst compare an institution's NSF funding against the grants it draws from the biomedical, energy, and defense funders, building the full picture of a university's federal research support that no single agency's data can give.
The second is to the higher-education statistics that describe the institutions themselves. The recipient institution in an NSF award can be joined to the federal higher-education data—enrollment, degrees conferred, classification, finances, the count of doctorates produced—so that funding can be normalized against the size and research intensity of the institution. Awards-per-doctorate, obligated-dollars-per-faculty, the share of a state's research funding concentrated in its flagship versus its regional institutions: these are the analyses that only become possible once the awards are tied to the institutional denominators that the higher-education data supplies. The third join is to the grant-opportunity record—the federal listing of funding opportunities—which sits upstream of the awards: the opportunity is the call that solicited the proposals, and the awards are the funded results, so linking them connects what was offered to what was won. Together these joins turn the NSF awards from a standalone list into the basic-research layer of an integrated picture of how the country funds, houses, and produces science.
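The institutional join can be sketched as below, with heavy caveats. It joins on a crudely normalized institution name, which is an assumption made for illustration; a shared identifier, where both datasets carry one, would be far more reliable. The institutions table and its name and doctorates keys are hypothetical stand-ins for a higher-education extract, and all rows are synthetic.

```python
import re


def norm_name(name):
    """Crude join key: uppercase, strip punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", name or "").upper()
    return " ".join(cleaned.split())


def dollars_per_doctorate(awards, institutions):
    """Return (obligated dollars per doctorate by institution, unmatched names)."""
    doctorates = {norm_name(i["name"]): i["doctorates"] for i in institutions}
    totals = {}
    for a in awards:
        key = norm_name(a["awardee_name"])
        totals[key] = totals.get(key, 0.0) + float(a["funds_obligated_amt"] or 0)
    matched, unmatched = {}, []
    for key, total in totals.items():
        n = doctorates.get(key)
        if n:
            matched[key] = total / n
        else:
            unmatched.append(key)  # no denominator available for this name
    return matched, sorted(unmatched)


# Synthetic rows for illustration only.
demo_awards = [
    {"awardee_name": "Example State University", "funds_obligated_amt": 1000.0},
    {"awardee_name": "EXAMPLE STATE UNIVERSITY.", "funds_obligated_amt": 500.0},
    {"awardee_name": "Unlisted College", "funds_obligated_amt": 200.0},
]
demo_institutions = [{"name": "Example State University", "doctorates": 50}]
matched, unmatched = dollars_per_doctorate(demo_awards, demo_institutions)
print(matched, unmatched)
```

Returning the unmatched names alongside the ratios makes the join loss visible, which is usually the first thing to inspect when a name-based match is the only option.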
Analytical uses
A national, institution-resolved, abstract-bearing record of basic-research awards supports a distinctive set of analyses.
Which institutions and fields draw funding is the most immediate use. Aggregating obligated dollars and award counts by recipient institution and by program or directorate reveals the concentration of NSF funding: how much flows to the research-intensive flagships versus the rest, how it splits across the fields of science and engineering, and how those shares move over time. The same aggregation by state and region surfaces the geographic distribution of basic-research funding, the kind of analysis that informs the long-running policy debate over whether federal research dollars are too concentrated in a handful of states and what the programs aimed at broadening participation actually shift.
Funding trends by directorate and priority area exploit the program lineage and the abstract corpus together: tracking how the balance of funding moves across domains, and how a cross-cutting priority such as artificial intelligence or quantum information draws dollars across multiple directorates, by combining the directorate field with a keyword search of the abstracts. Finally, the lineage from basic research to later innovation is the most ambitious use: because the award number persists and is acknowledged in the publications, patents, and follow-on funding that the research produces, the awards data is the starting point for tracing how a fundamental NSF grant matures, years or decades later, into applied results, commercial technology, and economic value, the empirical backbone of the case for funding curiosity-driven science.
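The concentration question raised above can be made concrete with two simple summary numbers. The sketch below computes a top-k share and a Herfindahl-Hirschman index over per-institution obligated totals; the article does not prescribe either measure, so treat them as one common choice among several. The totals are synthetic.

```python
def top_k_share(totals, k):
    """Share of all dollars held by the k largest recipients."""
    amounts = sorted(totals.values(), reverse=True)
    grand = sum(amounts)
    return sum(amounts[:k]) / grand if grand else 0.0


def hhi(totals):
    """Herfindahl-Hirschman index: sum of squared shares, in the range (0, 1]."""
    grand = sum(totals.values())
    if not grand:
        return 0.0
    return sum((v / grand) ** 2 for v in totals.values())


# Synthetic per-institution obligated totals for illustration only.
demo_totals = {"Institution A": 50.0, "Institution B": 30.0, "Institution C": 20.0}
print(f"top-1 share: {top_k_share(demo_totals, 1):.2f}")
print(f"top-2 share: {top_k_share(demo_totals, 2):.2f}")
print(f"HHI:         {hhi(demo_totals):.2f}")
```

The same two functions work unchanged on totals keyed by state or by directorate, which makes it easy to compare concentration across the different cuts of the data.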
Python workflow: awards, obligations, and abstract search
The script below pulls awards from NSF's public Awards API at api.nsf.gov, pages through the results, and computes the core metrics: total obligated dollars, obligations by recipient institution, and obligations by funding program (a proxy for the directorate). It then runs a keyword scan of the abstract corpus for a research topic. No API key is required. The API returns one record per award. You request fields through the printFields parameter and filter with query parameters such as keyword, dateStart, fundProgramName, or awardeeName. Results page twenty-five at a time, so the script steps the offset by twenty-five until the result set is exhausted.
import requests
import pandas as pd

# NSF Awards API -- public, no API key required.
# Base: https://api.nsf.gov/services/v1/awards.json
# The API returns one record per award. You request the fields you
# want via the printFields parameter and filter with query params
# such as keyword, dateStart/dateEnd, fundProgramName, or awardeeName.
# Results page 25 at a time; offset is 1-based and steps by 25.
BASE = "https://api.nsf.gov/services/v1/awards.json"
PAGE_SIZE = 25
FIELD_LIST = [
    "id", "title", "awardeeName", "awardeeStateCode",
    "piFirstName", "piLastName", "fundProgramName",
    "fundsObligatedAmt", "startDate", "expDate", "abstractText",
]
FIELDS = ",".join(FIELD_LIST)


def to_dollars(series):
    # fundsObligatedAmt arrives as a string; coerce to numeric.
    return pd.to_numeric(series, errors="coerce").fillna(0)


def fetch_awards(params, max_records=2000):
    # Page through the NSF Awards API and collect award records.
    out, offset = [], 1
    while len(out) < max_records:
        q = dict(params, printFields=FIELDS, offset=offset)
        r = requests.get(BASE, params=q, timeout=60)
        r.raise_for_status()
        batch = r.json().get("response", {}).get("award", [])
        if not batch:
            break
        out.extend(batch)
        offset += PAGE_SIZE
    # Fix the column set so an empty result still has every field.
    df = pd.DataFrame(out[:max_records], columns=FIELD_LIST)
    df["obligated"] = to_dollars(df["fundsObligatedAmt"])
    return df


def analyze(df):
    if df.empty:
        print("No awards returned for this query.")
        return
    total = df["obligated"].sum()
    print(f"{len(df):,} awards, ${total:,.0f} obligated total")

    # --- Obligations by recipient institution ------------------------
    by_inst = (df.groupby("awardeeName")["obligated"]
                 .agg(["sum", "count"]).sort_values("sum", ascending=False))
    print("\nTop institutions by obligated dollars:")
    for name, row in by_inst.head(10).iterrows():
        print(f"  {name[:44]:44s} ${row['sum']:>14,.0f} ({int(row['count'])})")

    # --- Obligations by funding program (proxy for directorate) -------
    by_prog = (df.groupby("fundProgramName")["obligated"]
                 .sum().sort_values(ascending=False))
    print("\nObligated dollars by funding program:")
    for prog, amt in by_prog.head(10).items():
        print(f"  {str(prog)[:44]:44s} ${amt:>14,.0f}")


def search_abstracts(df, keyword):
    # Case-insensitive substring scan of the abstract corpus.
    kw = keyword.lower()
    text = df["abstractText"].fillna("").astype(str).str.lower()
    # regex=False treats the keyword literally, so "c++" or "(ml)" is safe.
    hits = df[text.str.contains(kw, regex=False)]
    print(f"\n{len(hits)} of {len(df)} abstracts mention '{keyword}'")
    return hits[["title", "awardeeName", "obligated"]]


# Recent awards mentioning a research topic, fetched by keyword.
awards = fetch_awards({"keyword": "machine learning", "dateStart": "01/01/2024"})
analyze(awards)
print(search_abstracts(awards, "neural").head())
Three practical notes apply. First, the money the script sums is the obligated amount, not the full intended award and not actual outlays (a distinction the caveats section develops), so the totals are best read as “dollars committed to date on these awards,” not as money spent. Aggregating by institution, by program, and by state is straightforward once the dollars are coerced to numeric, but any cross-year dollar comparison should be careful about which money field it is summing and whether the awards in the window are complete. Second, the abstract search in the example is a simple case-insensitive substring scan, which is fast and transparent but coarse: it will miss synonyms and acronyms and will catch incidental mentions. For serious topic analysis the abstract text should be properly tokenized, the search terms expanded, and, for large-scale work, the abstracts embedded and clustered. Third, for national-scale aggregation, such as ranking every institution or building the full spending- and higher-education-joined picture, NSF's bulk award download files are far more efficient than paging the API one twenty-five-record page at a time.
Limitations and analytical caveats
The NSF Awards data is among the cleanest and most usable federal datasets, but it carries structural features an analyst must internalize before drawing conclusions.
Obligated is not outlaid, and obligated is not the full award. The obligated amount is the money committed against an award to date; the total intended amount is what the full, multi-year award is anticipated to be; and neither is the same as the cash actually disbursed (the outlay). A multi-year award is typically funded in increments, so at any snapshot its obligated figure may be a fraction of its intended total. Summing obligated amounts across a set of awards therefore answers “how much has been committed so far,” not “how much these projects will ultimately cost” and not “how much has been spent.” Choosing the wrong money field is the most common error in NSF-awards analysis, and it can distort totals by a wide margin.
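To keep the two money fields apart in practice, it can help to report them side by side. The sketch below assumes award dicts with the table's funds_obligated_amt and total_intnd_awd_amt columns; neither figure says anything about outlays, which the table does not carry. The rows are synthetic.

```python
def funding_progress(awards):
    """Sum obligated and intended dollars and report the obligated fraction."""
    obligated = sum(float(a["funds_obligated_amt"] or 0) for a in awards)
    intended = sum(float(a["total_intnd_awd_amt"] or 0) for a in awards)
    ratio = obligated / intended if intended else None
    return {"obligated": obligated, "intended": intended, "ratio": ratio}


# Synthetic rows: one award funded in increments, one fully obligated.
demo = [
    {"funds_obligated_amt": 100.0, "total_intnd_awd_amt": 400.0},
    {"funds_obligated_amt": 200.0, "total_intnd_awd_amt": 200.0},
]
progress = funding_progress(demo)
print(progress)
```

A ratio well below one for a set of recent awards is expected rather than alarming: it reflects incremental funding, not missing data.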
The grain is the institution, not the investigator. NSF awards are made to institutions, which administer the funds, with a principal investigator named as the scientific lead. This means dollars aggregate naturally by institution but only awkwardly by person: the same investigator may move between institutions over a career, names are recorded inconsistently, and collaborative awards split a single project across multiple institutions and multiple PIs. Counting funding “by researcher” requires careful name disambiguation and an awareness that the data was never designed to be a person-level ledger. Likewise, large multi-institution centers appear as several linked awards, so naive award counts can over- or under-state the true number of distinct projects.
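A first pass at person-level grouping might build a blocking key from the name fields, as in the sketch below. The key (lowercased surname plus first initial) is an assumption chosen for illustration and is deliberately crude: it will merge distinct people who share a surname and initial, so the groups it produces are candidates for review, not resolved identities. The rows are synthetic.

```python
from collections import defaultdict


def pi_key(first, last):
    """Blocking key: lowercased surname plus first initial."""
    first = (first or "").strip().lower()
    last = (last or "").strip().lower()
    return f"{last}|{first[:1]}"


def group_by_pi(awards):
    """Group award ids under each candidate investigator key."""
    groups = defaultdict(list)
    for a in awards:
        groups[pi_key(a["pi_first_name"], a["pi_last_name"])].append(a["award_id"])
    return dict(groups)


# Synthetic rows showing inconsistent recording of the same name.
demo = [
    {"award_id": "1", "pi_first_name": "Ada", "pi_last_name": "Example"},
    {"award_id": "2", "pi_first_name": "A.", "pi_last_name": "EXAMPLE "},
    {"award_id": "3", "pi_first_name": "Bo", "pi_last_name": "Sample"},
]
groups = group_by_pi(demo)
print(groups)
```

Adding the institution to the key would reduce false merges but would split an investigator who changed institutions, which is exactly the trade-off the caveat describes.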
It records funded proposals, not the competition. The awards data shows who won, never who applied and was declined. Funding rates, the relative success of fields or institutions, and any inference about how competitive a program is cannot be computed from the awards alone, because the denominator—the submitted proposals—is not in the dataset. Reading an award count as a measure of research activity, rather than of funding success against an unseen pool of competitors, is a recurring misreading. And the awards reflect the priorities of the programs that funded them: where the money clusters is partly a map of where NSF chose to invest, not only of where the best science is.
There is a recency lag, and our table is a recent slice. Awards appear in the public data after they are made and processed, and the most recent months are systematically under-represented; an analysis of “this year's” funding read too early will understate it. Our nsf_awards table holds roughly 4,500 recent awards, a current slice of the enterprise rather than the full historical corpus stretching back decades, so it is ideal for studying the recent structure of NSF funding and the contents of the current abstract corpus, but a longitudinal study of multi-decade trends should pull the complete award history from NSF's bulk downloads. Held with these caveats in mind, the nsf_awards table is a uniquely rich resource: an institution-resolved, abstract-bearing, program-tagged record of where the federal basic-research dollar goes, the public trace of the merit-review decisions that, quietly and one proposal at a time, set the agenda of American science.
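One way to guard against the recency lag is to separate awards in the trailing window from the rest before computing totals. The sketch below assumes ISO start dates, and the six-month default is an arbitrary placeholder, not a measured reporting delay; the right window would need to be checked against the data. The rows are synthetic.

```python
from datetime import date


def split_by_maturity(awards, as_of, lag_months=6):
    """Split award ids into (settled, provisional) by start month.

    Awards starting within `lag_months` of `as_of` are provisional,
    because recent months may still be filling in.
    """
    cutoff = as_of.year * 12 + (as_of.month - 1) - lag_months
    settled, provisional = [], []
    for a in awards:
        year, month = int(a["start_date"][:4]), int(a["start_date"][5:7])
        if year * 12 + (month - 1) <= cutoff:
            settled.append(a["award_id"])
        else:
            provisional.append(a["award_id"])
    return settled, provisional


# Synthetic rows for illustration only.
demo = [
    {"award_id": "1", "start_date": "2024-03-01"},
    {"award_id": "2", "start_date": "2024-06-01"},
    {"award_id": "3", "start_date": "2024-10-01"},
]
settled, provisional = split_by_maturity(demo, as_of=date(2024, 12, 15))
print(settled, provisional)
```

Reporting the provisional group separately, rather than dropping it, keeps the recent awards visible while signaling that their totals are likely to grow.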
Related writing
Grants.gov: The Federal Database Behind $500 Billion in Annual Federal Grant Opportunities — The grant-opportunity record sits directly upstream of the NSF awards: a Grants.gov funding opportunity is the call that solicited the proposals, and the awards are the funded results, so linking the two connects what was offered to what was won across the whole federal grant enterprise.
USASpending Contracts: The Federal Record of Every Dollar the Government Buys — Every NSF grant is also a federal financial-assistance transaction in the government-wide spending record, where it can be placed alongside contracts and grants from every other agency to build the full picture of an institution's federal research support.
NCES IPEDS: The Federal Database Behind Higher Education Statistics for 6,000 US Colleges — The recipient institution in an NSF award joins to the federal higher-education data—enrollment, degrees, finances, doctorates produced—supplying the institutional denominators needed to normalize research funding against the size and research intensity of each university.