
EPA Pollutant Emissions: The Federal Database Behind 10 Million Facility-Level Air and Toxic Release Records

12 min read · AI Analytics
EPA · Air Pollution · TRI · NEI · Federal Data

EPA stitches two of its largest environmental programs, the National Emissions Inventory and the Toxics Release Inventory, into a single facility-level record of what American industry puts into the air. The result is roughly 10.4 million rows, one per facility per pollutant per reporting year, each keyed to a federal Registry ID that ties a smokestack in Texas to a permit, an enforcement file, and a census tract.

This article covers what the combined pollutant emissions dataset is and how its 10.4 million records are structured; the distinction between criteria air pollutants and the 188 hazardous air pollutants that drive the most consequential exposure analysis; the triennial methodology of the National Emissions Inventory and its point, nonpoint, mobile, and event source categories; the Toxics Release Inventory and its origins in the public right-to-know response to the Bhopal disaster; the units, identifiers, and registry linkages that make cross-program analysis possible; the analytical uses from environmental justice screening to cumulative-burden mapping; a Python workflow against the EPA Envirofacts REST API; and the caveats that every analyst must internalize before drawing conclusions from self-reported emissions data.

What the dataset is

EPA does not operate a single program called “pollutant emissions.” The dataset described here is a consolidation of two distinct, statutorily independent reporting systems that EPA and downstream data users merge because they answer the same underlying question from different angles: how much of which pollutants does a given facility release? The two systems are the National Emissions Inventory (NEI), which inventories air emissions of criteria and hazardous air pollutants from every significant source category in the country on a triennial cycle, and the Toxics Release Inventory (TRI), which collects annual self-reported releases of listed toxic chemicals from industrial and federal facilities above defined activity thresholds. Both report emissions at the facility level, and both can be keyed to the EPA Facility Registry Service. Joining them produces a unified facility-by-pollutant-by-year emissions record.

In our database this consolidated record is stored as the table epa_pollutant_emissions, with roughly 10.4 million rows. The grain of the table is one row per (facility × pollutant × reporting year): a single refinery reporting twenty pollutants in a given NEI year contributes twenty rows for that year, and the same refinery reporting a dozen TRI chemicals the following year contributes a dozen more. The columns are:

id                 -- surrogate primary key, one per emission record
reporting_year     -- calendar year the emissions are attributed to
registry_id        -- EPA FRS Registry ID (the cross-program facility key)
pgm_sys_acrnm      -- source program acronym (e.g. NEI, TRIS)
pgm_sys_id         -- the facility's ID within that source program
pollutant_name     -- pollutant or chemical name (e.g. "Benzene", "PM2.5 Primary")
annual_emission    -- numeric quantity emitted in the reporting year
unit_of_measure    -- units for annual_emission (typically "POUNDS" or "TONS")
nei_type           -- NEI source category (point, nonpoint, onroad, nonroad, event)
nei_hap_voc_flag   -- classifies the pollutant (HAP, VOC, both, or neither)
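
The stated grain can be checked mechanically before any analysis. The following is a minimal sketch, assuming rows arrive as dictionaries keyed by the column names above. Including pgm_sys_acrnm in the key is an assumption made here (so that an NEI row and a TRI row for the same facility, pollutant, and year do not collide), and the sample rows and Registry ID are made up.

```python
from collections import Counter

# Assumed grain key: facility x pollutant x year, plus pgm_sys_acrnm
# (an assumption) so NEI and TRI rows for the same facility stay distinct.
GRAIN = ("registry_id", "pgm_sys_acrnm", "pollutant_name", "reporting_year")

def grain_violations(rows, key=GRAIN):
    """Return every grain key that occurs more than once, with its count."""
    counts = Counter(tuple(row[col] for col in key) for row in rows)
    return {k: n for k, n in counts.items() if n > 1}

# Made-up rows for illustration only.
sample_rows = [
    {"registry_id": "110000000001", "pgm_sys_acrnm": "NEI",
     "pollutant_name": "Benzene", "reporting_year": 2020},
    {"registry_id": "110000000001", "pgm_sys_acrnm": "TRIS",
     "pollutant_name": "Benzene", "reporting_year": 2020},
    {"registry_id": "110000000001", "pgm_sys_acrnm": "NEI",
     "pollutant_name": "Benzene", "reporting_year": 2020},  # duplicate
]

duplicates = grain_violations(sample_rows)
print(duplicates)
```

An empty result means the table honors the assumed grain; anything else should be resolved before summing emissions.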

The two identifier columns deserve early emphasis because they are what make the table analytically powerful. The registry_id is the FRS Registry ID, a persistent EPA-assigned identifier for a physical facility location. It is the same key that appears in RCRAInfo, in the Safe Drinking Water Information System, in the Risk Management Plan database, and in the ECHO enforcement platform. Because the NEI and TRI records both carry the Registry ID, a single facility's air emissions, hazardous waste handler status, water discharge permits, and enforcement history can be assembled into one profile. The pgm_sys_acrnm and pgm_sys_id columns preserve the provenance of each row: they record which source program a given emission figure came from and what that program's own internal facility identifier was, so an analyst can always trace a value back to its original NEI or TRI submission rather than treating the consolidated table as an undifferentiated blob.

Criteria air pollutants versus hazardous air pollutants

The Clean Air Act regulates two fundamentally different families of air pollutant, and the nei_hap_voc_flag column in the table reflects this split. Understanding the distinction is essential to interpreting any aggregation of the data, because criteria pollutants and hazardous air pollutants are measured in different magnitudes, regulated under different statutory provisions, and used to answer different questions.

Criteria Air Pollutants (CAPs) are the small set of ubiquitous pollutants for which EPA has established National Ambient Air Quality Standards (NAAQS) under Clean Air Act Section 109. There are six: particulate matter (regulated in two size fractions, PM2.5 for particles 2.5 micrometers in diameter or smaller and PM10 for particles up to 10 micrometers), ground-level ozone, sulfur dioxide (SO2), nitrogen dioxide (NO2, inventoried as nitrogen oxides, NOx), carbon monoxide (CO), and lead (Pb). Ozone is not emitted directly; it forms in the atmosphere, so an emissions inventory tracks its precursors instead. EPA therefore inventories two additional pollutants that are not themselves criteria pollutants but are precursors to ozone and fine particle formation: volatile organic compounds (VOCs), which react with NOx in sunlight to form ground-level ozone, and ammonia (NH3), which contributes to secondary particulate formation. Criteria pollutants are emitted in large quantities (a single coal plant can emit NOx and SO2 in thousands of tons per year), and they are the pollutants behind smog, regional haze, acid deposition, and the bulk of the population-level cardiopulmonary health burden from air pollution. When the dataset reports a facility emitting tens of thousands of tons of a pollutant, that pollutant is almost always a criteria pollutant or precursor.

Hazardous Air Pollutants (HAPs) are a separate, larger list regulated under Clean Air Act Section 112. The 1990 Clean Air Act Amendments replaced an earlier, slow-moving risk-based program with a statutory list of 189 named hazardous air pollutants; the list has since been amended, with caprolactam among the substances delisted, and 188 is the figure in common reference. HAPs are also called air toxics. Unlike criteria pollutants, they are regulated not because they are ubiquitous but because they are associated with cancer, neurological damage, reproductive harm, developmental effects, and other serious health outcomes at far lower concentrations. The list includes benzene (a known human carcinogen and a marker of petroleum and chemical operations), formaldehyde, mercury and mercury compounds (a potent neurotoxicant that bioaccumulates in fish), dioxins and furans (persistent, highly toxic byproducts of combustion and certain chemical processes), as well as lead compounds, arsenic, chromium, hydrochloric acid, toluene, xylene, 1,3-butadiene, and acrolein. HAP emissions are typically measured in pounds rather than tons, and a facility may emit only a few hundred pounds of a HAP while emitting thousands of tons of a criteria pollutant, yet the HAP emission can be the more consequential of the two from a localized health-risk standpoint.

The nei_hap_voc_flag column captures the cross-cutting nature of these categories. Some pollutants are HAPs, some are VOCs, and some are both: many individual HAP compounds (benzene, toluene, xylene) are also VOCs that contribute to ozone formation. EPA tracks this overlap explicitly because it matters for control strategy. A control device installed to reduce VOC emissions for ozone-attainment purposes will incidentally reduce HAP emissions of the VOC-classified air toxics, and emission-inventory accounting has to avoid double-counting a single physical release that falls into both categories. The flag lets an analyst aggregate HAP-only, VOC-only, or combined totals correctly rather than summing across categories that partially overlap.
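
The overlap-aware aggregation described above might look like the sketch below. The flag encoding ("HAP", "VOC", "HAP-VOC", and an empty string for neither) is an assumption, since the schema only says the column distinguishes HAP, VOC, both, or neither. The annual_emission_lbs field is a hypothetical column holding quantities already converted to pounds, and the rows are made up.

```python
from collections import defaultdict

# Assumed flag encoding (not confirmed by the schema): "HAP", "VOC",
# "HAP-VOC" for pollutants in both classes, "" for neither.
def class_totals(rows):
    """Sum pounds into HAP, VOC, and a non-overlapping HAP-or-VOC total."""
    totals = defaultdict(float)
    for row in rows:
        flag = row["nei_hap_voc_flag"]
        lbs = row["annual_emission_lbs"]  # hypothetical pre-normalized column
        if "HAP" in flag:
            totals["hap"] += lbs
        if "VOC" in flag:
            totals["voc"] += lbs
        if flag:  # a row in either class is counted once in the union
            totals["hap_or_voc"] += lbs
    return dict(totals)

# Made-up quantities for illustration only.
sample = [
    {"pollutant_name": "Benzene", "nei_hap_voc_flag": "HAP-VOC", "annual_emission_lbs": 1000.0},
    {"pollutant_name": "Mercury", "nei_hap_voc_flag": "HAP", "annual_emission_lbs": 5.0},
    {"pollutant_name": "Ethanol", "nei_hap_voc_flag": "VOC", "annual_emission_lbs": 2000.0},
    {"pollutant_name": "Nitrogen Oxides", "nei_hap_voc_flag": "", "annual_emission_lbs": 50000.0},
]

totals = class_totals(sample)
print(totals)
```

Adding the HAP and VOC totals together would count the benzene row twice; the separate hap_or_voc total is what avoids that.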

NEI methodology: the triennial inventory

The National Emissions Inventory is EPA's comprehensive, bottom-up estimate of air emissions across the entire United States. It is compiled by EPA's Office of Air Quality Planning and Standards on a triennial cycle: a full inventory is produced every third year. The recent inventory years are 2017, 2020, and 2023. The NEI is built from data submitted by state, local, and tribal air agencies through EPA's Emissions Inventory System, supplemented by EPA-run models for source categories that states do not report directly. The result is an inventory that attempts to account for every significant air emission in the country, not merely those from large permitted facilities.
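
When aligning other data to the NEI clock, it helps to map any calendar year to the inventory year that covers it. A small sketch, anchored on the inventory years named above (2017, 2020, 2023) and assuming the three-year spacing holds for the years of interest; the function names are illustrative.

```python
NEI_ANCHOR_YEAR = 2017  # one of the inventory years named in the text

def is_nei_year(year):
    """True if `year` falls on the three-year cycle through the anchor year."""
    return (year - NEI_ANCHOR_YEAR) % 3 == 0

def latest_nei_year(year):
    """Most recent inventory year at or before `year`."""
    return year - (year - NEI_ANCHOR_YEAR) % 3

print([y for y in range(2015, 2025) if is_nei_year(y)])  # [2017, 2020, 2023]
print(latest_nei_year(2022))                             # 2020
```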

The NEI organizes emissions into source categories that correspond to the nei_type column. Point sources are individual, geographically fixed facilities—power plants, refineries, chemical plants, cement kilns, large manufacturers—reported with stack-level or unit-level detail and a specific latitude and longitude. Point sources are the records most readily tied to a Registry ID and the most useful for facility-level analysis. Nonpoint sources (historically called area sources) are emissions too small or too numerous to inventory individually but significant in aggregate: residential wood combustion, consumer solvent use, gas stations, dry cleaners, architectural coatings, and agricultural operations. Nonpoint emissions are estimated for a county or other geographic unit rather than for a single street address. Onroad mobile sources cover cars, trucks, buses, and motorcycles operating on public roads; nonroad mobile sources cover construction and agricultural equipment, locomotives, marine vessels, aircraft ground operations, lawn and garden equipment, and similar engines. Both mobile categories are estimated with EPA emission-factor models (the MOVES model for onroad and much of nonroad) rather than reported facility-by-facility. Events are episodic emissions, most importantly wildfires and prescribed burns, which in a bad fire year can dominate national emissions of PM2.5 and carbon monoxide and can swamp anthropogenic sources in particular regions.

Underlying the entire NEI is the Source Classification Code (SCC) system. Every emission process in the inventory is tagged with an SCC, a hierarchical code that identifies the source category, the specific process, and the equipment generating the emission—for example, an external combustion boiler burning bituminous coal, or a specific solvent-coating operation. SCCs are how EPA applies emission factors consistently: an emission factor (pounds of pollutant per unit of activity, such as per ton of coal burned) is associated with an SCC, and the facility's reported activity level is multiplied by the factor to estimate emissions where direct measurement is unavailable. This is the central methodological reality of the NEI: for large sources, emissions may be based on continuous emission monitors or stack tests, but for the vast number of smaller processes, emissions are estimated by multiplying an activity level by an emission factor tied to an SCC. The inventory is a careful synthesis of measurement and modeling, not a direct census of metered releases.
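
The activity-times-factor arithmetic can be written out directly. This is a sketch only: the optional control-efficiency term is an addition not described above (it follows the general form commonly used with published emission factors), and the activity level and factor are illustrative numbers, not real factors for any SCC.

```python
def estimate_emissions_lbs(activity, emission_factor, control_efficiency_pct=0.0):
    """E = A * EF * (1 - ER/100): activity times factor, less any control reduction."""
    if not 0.0 <= control_efficiency_pct <= 100.0:
        raise ValueError("control efficiency must be between 0 and 100 percent")
    return activity * emission_factor * (1.0 - control_efficiency_pct / 100.0)

# Illustrative inputs only: 100,000 tons of coal burned in the year and a
# made-up factor of 10 pounds of pollutant per ton of coal.
coal_burned_tons = 100_000
ef_lbs_per_ton = 10.0

uncontrolled = estimate_emissions_lbs(coal_burned_tons, ef_lbs_per_ton)
controlled = estimate_emissions_lbs(coal_burned_tons, ef_lbs_per_ton, 50.0)

print(f"uncontrolled: {uncontrolled:,.0f} lb")
print(f"with a 50% control device: {controlled:,.0f} lb")
```

The estimate is only as good as its two inputs, which is why the same process can carry very different uncertainty from one facility to the next.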

The Toxics Release Inventory

The Toxics Release Inventory has a different statutory origin and a different philosophy. It was created by the Emergency Planning and Community Right-to-Know Act (EPCRA) of 1986, enacted in the direct aftermath of the 1984 Bhopal disaster in India, where a release of methyl isocyanate from a Union Carbide pesticide plant killed thousands of people, and a smaller but alarming chemical release at a related plant in Institute, West Virginia, the following year. Congress concluded that communities had a right to know what hazardous chemicals were being stored and released in their midst. TRI is the embodiment of that public right-to-know principle: it does not itself limit emissions, but it requires facilities to publicly disclose them, on the theory that disclosure creates accountability and pressure to reduce. The TRI reporting requirement lives in EPCRA Section 313, and the reporting form is the Form R (with a shorter Form A certification statement available for facilities with very small quantities).

TRI is reported annually, not triennially, which makes it the higher-frequency component of the consolidated dataset. Roughly 21,000 facilities file TRI reports in a typical year, covering approximately 770 individually listed chemicals and chemical categories. A facility must report for a given chemical if it is in a covered industry sector (originally manufacturing, later expanded to include metal mining, electric utilities, hazardous waste treatment, and others), employs the equivalent of ten or more full-time workers, and exceeds an activity threshold for that chemical—generally manufacturing or processing more than 25,000 pounds, or otherwise using more than 10,000 pounds, during the year. For a smaller set of persistent, bioaccumulative, and toxic (PBT) chemicals—such as mercury, dioxins, and certain PFAS compounds added in recent years—the thresholds are dramatically lower, sometimes a tenth of a gram, reflecting the outsized hazard of even tiny releases of substances that accumulate in the food chain.
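
The three reporting conditions can be expressed as a simple predicate. This is a simplified sketch of the general thresholds described above, not a compliance tool: it treats manufacturing, processing, and otherwise-use as separately thresholded activities (an assumption of this sketch), it ignores the much lower PBT thresholds, and all names are hypothetical.

```python
from dataclasses import dataclass

MANUFACTURE_OR_PROCESS_LBS = 25_000  # general threshold from the text
OTHERWISE_USE_LBS = 10_000           # general threshold from the text
MIN_FULL_TIME_EQUIVALENTS = 10

@dataclass
class ChemicalActivity:
    """Pounds of one listed chemical handled during the reporting year."""
    manufactured_lbs: float = 0.0
    processed_lbs: float = 0.0
    otherwise_used_lbs: float = 0.0

def must_report(in_covered_sector, full_time_equivalents, activity):
    """Apply the sector, employee, and activity tests for one chemical."""
    if not in_covered_sector:
        return False
    if full_time_equivalents < MIN_FULL_TIME_EQUIVALENTS:
        return False
    # Assumption: each activity type is compared to its own threshold.
    return (
        activity.manufactured_lbs > MANUFACTURE_OR_PROCESS_LBS
        or activity.processed_lbs > MANUFACTURE_OR_PROCESS_LBS
        or activity.otherwise_used_lbs > OTHERWISE_USE_LBS
    )

print(must_report(True, 12, ChemicalActivity(processed_lbs=30_000)))       # True
print(must_report(True, 8, ChemicalActivity(processed_lbs=30_000)))        # False
print(must_report(True, 50, ChemicalActivity(otherwise_used_lbs=9_000)))   # False
```

The second and third cases are the ones that matter analytically: a facility can handle a listed chemical and still be absent from TRI.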

The structure of a TRI report is more detailed than a simple air-emission figure, and this detail is part of what TRI contributes to the consolidated table. TRI distinguishes fugitive air emissions (releases that are not channeled through a stack or vent—leaks from valves, flanges, pump seals, storage tanks, and equipment, plus evaporative losses) from stack or point air emissions (releases through a confined vent or stack). It further distinguishes on-site releases (to air, to surface water, to on-site land disposal, and to underground injection) from off-site transfers (chemicals sent to other locations for disposal, treatment, recycling, or energy recovery). When TRI data is folded into the epa_pollutant_emissions table alongside NEI air emissions, it is generally the air-release component that aligns most directly with the NEI's air inventory, but TRI's richer accounting of water, land, and off-site pathways is what makes it indispensable for total-release and waste-management analysis that the NEI alone cannot support.

One conceptual caution follows from the difference in scope. The NEI aims to inventory all significant air emissions from all source categories, including the diffuse nonpoint and mobile sources that have no reporting facility at all. TRI, by contrast, captures only the listed chemicals released by the subset of larger industrial facilities that cross the reporting thresholds. The two are complementary rather than redundant: TRI tells you, in fine chemical-specific and pathway-specific detail, what a covered industrial facility self-reported; the NEI tells you, in broader source-category terms, the total air-emission picture including everything TRI never sees.

Units, identifiers, and registry linkage

The single most common error in working with this dataset is mishandling units. The unit_of_measure column exists precisely because the two source programs use different conventions. NEI criteria-pollutant emissions are conventionally reported in tons per year, because the quantities are large; NEI HAP emissions and essentially all TRI quantities are reported in pounds per year, because the quantities are smaller and the precision matters. A naive aggregation that sums annual_emission across rows without respecting unit_of_measure will add pounds to tons and produce a meaningless number. Any roll-up must first normalize to a common unit, multiplying tons by 2,000 to convert to pounds or dividing pounds by 2,000 to convert to short tons, keyed off the unit_of_measure value on each row. This is not a hypothetical concern; it is the difference between a defensible analysis and an embarrassing one.
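
A minimal sketch of that normalization step, assuming unit_of_measure holds the strings "POUNDS" or "TONS" as in the schema listing (any other spelling raises rather than being guessed at). The rows are made up.

```python
LBS_PER_SHORT_TON = 2000.0

# Only the two unit strings named in the schema are recognized.
_TO_POUNDS = {"POUNDS": 1.0, "TONS": LBS_PER_SHORT_TON}

def to_pounds(amount, unit):
    """Convert one annual_emission value to pounds using its unit_of_measure."""
    try:
        factor = _TO_POUNDS[unit.strip().upper()]
    except KeyError:
        raise ValueError(f"unrecognized unit_of_measure: {unit!r}") from None
    return amount * factor

def total_pounds(rows):
    """Sum emissions across rows after converting each row to pounds."""
    return sum(to_pounds(r["annual_emission"], r["unit_of_measure"]) for r in rows)

# Made-up rows for illustration only.
rows = [
    {"pollutant_name": "Nitrogen Oxides", "annual_emission": 1500.0, "unit_of_measure": "TONS"},
    {"pollutant_name": "Benzene", "annual_emission": 800.0, "unit_of_measure": "POUNDS"},
]

naive = sum(r["annual_emission"] for r in rows)  # mixes tons with pounds
print(naive)               # 2300.0, a meaningless number
print(total_pounds(rows))  # 3000800.0
```

Raising on an unknown unit is a deliberate choice here: a silent default would reintroduce exactly the error the column exists to prevent.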

The Facility Registry Service (FRS) is the backbone that makes the consolidated table more than the sum of its parts. FRS is EPA's authoritative, centrally managed inventory of facilities and other places that are subject to environmental regulation. Each physical site receives a Registry ID, and FRS maintains crosswalks linking that Registry ID to the facility's identifiers in every individual program system—the NEI EIS facility ID, the TRI facility ID (TRIFID), the RCRA EPA ID, the NPDES permit number, the air program plant ID, and so on. The registry_id column in epa_pollutant_emissions is this FRS key. Its presence means an analyst can join the emissions table to the FRS facility table to obtain latitude and longitude, NAICS industry code, and standardized name and address, and can then join onward to any other FRS-keyed EPA dataset. Without FRS, the NEI and TRI would be two parallel universes; with it, they are facets of a single facility record.
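
The join itself is a lookup on registry_id. The sketch below is dependency-free and assumes a hypothetical FRS extract with facility_name, latitude, longitude, and naics_code fields; the real FRS column names should be taken from the current FRS documentation. All values are made up.

```python
def attach_facility(emissions, facilities):
    """Join emission rows to facility attributes on registry_id.

    Returns (joined_rows, unmatched_rows) so missing registry links are
    visible instead of being silently dropped.
    """
    by_id = {f["registry_id"]: f for f in facilities}
    joined, unmatched = [], []
    for row in emissions:
        facility = by_id.get(row["registry_id"])
        if facility is None:
            unmatched.append(row)
        else:
            joined.append({**facility, **row})
    return joined, unmatched

# Hypothetical FRS extract and emission rows, for illustration only.
frs = [
    {"registry_id": "110000000001", "facility_name": "Example Refinery",
     "latitude": 29.76, "longitude": -95.36, "naics_code": "324110"},
]
emissions = [
    {"registry_id": "110000000001", "pollutant_name": "Benzene", "annual_emission": 800.0},
    {"registry_id": "110000000002", "pollutant_name": "Toluene", "annual_emission": 120.0},
]

joined, unmatched = attach_facility(emissions, frs)
print(len(joined), len(unmatched))  # 1 1
```

With pandas, the same operation is a `DataFrame.merge` on registry_id; a left join with `indicator=True` exposes the unmatched rows in the same way.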

Public access to the underlying data flows through two principal EPA channels. Envirofacts (enviro.epa.gov, with a REST API at data.epa.gov/efservice) is EPA's integrated query system across its major program databases, including the TRI and NEI tables and the FRS registry; it offers both an interactive web interface and a programmatic REST API that returns JSON, CSV, or XML. ECHO (echo.epa.gov), the Enforcement and Compliance History Online platform, brings the same facilities together from the compliance and enforcement angle, layering inspection history, violations, and penalties onto the same FRS-keyed facilities whose emissions appear in this table. For most emissions analysis, Envirofacts is the natural data source and ECHO is the natural place to pull the enforcement context; both speak the language of the Registry ID.

Analytical uses

A facility-level, pollutant-level, year-level emissions table is one of the most analytically fertile datasets in the federal environmental catalog. Its uses range from individual-facility due diligence to national-scale equity analysis.

Environmental justice screening is among the most important. EPA's own EJScreen tool combines environmental indicators—several of which derive from this emissions data, including air-toxics cancer risk and respiratory hazard indices built on NEI HAP emissions—with demographic indicators of low-income and minority population share, to identify communities that bear disproportionate environmental burden. Because the emissions table carries the Registry ID and, through FRS, latitude and longitude, an analyst can geocode every emitting facility, intersect it with census geography, and quantify the pollutant load falling on a particular block group or tract. This is the raw material for cumulative-burden analysis: identifying communities that are simultaneously exposed to multiple emitting facilities, multiple pollutants, and multiple pathways, rather than evaluating each facility in isolation as the permitting system historically did.
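
A rough proximity version of the cumulative-burden idea can be sketched without any GIS library. This is a crude distance-based proxy, not EJScreen's methodology: it simply totals emissions from facilities within a radius of a tract centroid. The coordinates, the emission_lbs field, and the radius are all illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def burden_within(centroid, facilities, radius_km):
    """Count facilities and total their emissions within radius_km of a centroid."""
    lat0, lon0 = centroid
    nearby = [
        f for f in facilities
        if haversine_km(lat0, lon0, f["latitude"], f["longitude"]) <= radius_km
    ]
    return {
        "facility_count": len(nearby),
        "total_lbs": sum(f["emission_lbs"] for f in nearby),
    }

# Made-up facilities: one about 1 km from the centroid, one about 111 km away.
facilities = [
    {"registry_id": "A", "latitude": 29.76, "longitude": -95.36, "emission_lbs": 800.0},
    {"registry_id": "B", "latitude": 30.75, "longitude": -95.36, "emission_lbs": 5000.0},
]

result = burden_within((29.75, -95.36), facilities, radius_km=5.0)
print(result)
```

A production analysis would intersect facility points with tract or block-group polygons and weight by dispersion rather than raw distance, but the shape of the calculation is the same.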

Top-emitter identification is the simplest and most common query: rank facilities by emissions of a given pollutant within a state or nationally for a given year. Because the data is facility-resolved, the output is actionable—a named refinery, a specific power plant—rather than an abstract sectoral total. Pairing this with the pollutant_name and nei_hap_voc_flag columns lets an analyst ask sharper questions: which facilities are the largest emitters of carcinogenic HAPs specifically, as opposed to the largest emitters by raw tonnage of criteria pollutants?
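
A top-emitter ranking reduces to a filter, a per-facility sum, and a sort. The sketch below assumes quantities have already been normalized to pounds in a hypothetical annual_emission_lbs field and restricts the ranking to one source program so NEI and TRI figures are not mixed. The rows are made up.

```python
from collections import defaultdict

def top_emitters(rows, pollutant, year, program, n=10):
    """Rank facilities by total pounds of one pollutant in one year and program."""
    totals = defaultdict(float)
    for r in rows:
        if (
            r["pollutant_name"] == pollutant
            and r["reporting_year"] == year
            and r["pgm_sys_acrnm"] == program
        ):
            totals[r["registry_id"]] += r["annual_emission_lbs"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Made-up rows for illustration only.
rows = [
    {"registry_id": "A", "pgm_sys_acrnm": "TRIS", "pollutant_name": "Benzene",
     "reporting_year": 2022, "annual_emission_lbs": 400.0},
    {"registry_id": "B", "pgm_sys_acrnm": "TRIS", "pollutant_name": "Benzene",
     "reporting_year": 2022, "annual_emission_lbs": 900.0},
    {"registry_id": "A", "pgm_sys_acrnm": "NEI", "pollutant_name": "Benzene",
     "reporting_year": 2022, "annual_emission_lbs": 10000.0},
    {"registry_id": "C", "pgm_sys_acrnm": "TRIS", "pollutant_name": "Toluene",
     "reporting_year": 2022, "annual_emission_lbs": 7000.0},
]

ranking = top_emitters(rows, "Benzene", 2022, program="TRIS", n=2)
print(ranking)  # [('B', 900.0), ('A', 400.0)]
```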

Year-over-year trend analysis exploits the reporting_year column. TRI's annual cadence supports genuine time-series analysis of a facility's reported releases—tracking whether a chemical plant has reduced benzene emissions over a decade, or whether an industry-wide pollution-prevention program shows up in aggregate declines. The NEI's triennial cadence supports coarser trend analysis of the broader air-emission picture across inventory years. The two cadences have to be analyzed on their own clocks; comparing a TRI annual value to an NEI triennial value as if they were the same kind of measurement invites error.
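
For the annual TRI series, a year-over-year change is only meaningful between consecutive reporting years. The sketch below computes percent change for one facility and chemical and leaves a gap wherever the preceding year is missing or zero. The input is a hypothetical mapping of year to pounds, with made-up values.

```python
def year_over_year(series):
    """Return (year, lbs, pct_change) tuples; pct_change is None where undefined.

    `series` maps reporting year to pounds for one facility and chemical.
    """
    out = []
    prev_year, prev_lbs = None, None
    for year in sorted(series):
        lbs = series[year]
        # Compare only against the immediately preceding reporting year.
        if prev_year == year - 1 and prev_lbs:
            pct = (lbs - prev_lbs) * 100 / prev_lbs
        else:
            pct = None
        out.append((year, lbs, pct))
        prev_year, prev_lbs = year, lbs
    return out

# Made-up benzene releases with a missing 2021 report.
benzene_lbs = {2019: 1000, 2020: 800, 2022: 600}
trend = year_over_year(benzene_lbs)
print(trend)  # [(2019, 1000, None), (2020, 800, -20.0), (2022, 600, None)]
```

Refusing to bridge the missing year keeps a gap in reporting from being read as a reduction.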

Finally, cross-program cross-referencing is where the Registry ID pays its largest dividend. Joining the emissions table to ECHO enforcement records reveals whether high-emitting facilities also carry a history of Clean Air Act violations, civil penalties, or significant noncompliance. Joining to the RCRA hazardous waste data shows whether a facility releasing large quantities of toxic air pollutants is also a large hazardous waste generator, since the same industrial processes that emit air toxics frequently generate listed hazardous wastes. A facility that ranks high on emissions, carries an active enforcement case, and operates as a large-quantity hazardous waste generator is a specific, identifiable priority that only the joined, Registry-ID-keyed view can surface.

Python workflow: querying the Envirofacts REST API

The Envirofacts REST API at data.epa.gov/efservice exposes the underlying TRI, NEI, and FRS tables through a simple URL-path query grammar. A request is built by chaining table names and column filters into a path of the form /TABLE/COLUMN/VALUE/, optionally with an operator and a row-range and output format suffix; chained filters are combined with logical AND. No API key is required for public data. The script below pulls TRI reporting forms for a state and year, normalizes the air-release column (whose exact name varies between Envirofacts releases), aggregates the heaviest reported chemicals, and sketches a cross-state roll-up for a single chemical. Because Envirofacts column names and table structures evolve with each data release, the script discovers the relevant column names at runtime rather than hard-coding them, and any production use should be validated against the current Envirofacts metadata catalog.

from collections import defaultdict

import pandas as pd
import requests

# EPA Envirofacts REST API (no key required)
# Base: https://data.epa.gov/efservice
# Pattern: /<TABLE>/<COLUMN>/<OPERATOR>/<VALUE>/<FORMAT>
BASE = "https://data.epa.gov/efservice"

# --- TRI: reporting forms for one reporting year, one state ---
# Chains TRI_FACILITY to TRI_REPORTING_FORM; any air-release quantity column
# is looked up later from whatever columns the response actually contains.
def tri_air_releases(state, year, rows=5000):
    # Chained filters are ANDed together by Envirofacts.
    path = (
        f"{BASE}/TRI_FACILITY/STATE_ABBR/{state}"
        f"/TRI_REPORTING_FORM/REPORTING_YEAR/{year}"
        f"/JSON/rows/0:{rows}"
    )
    r = requests.get(path, timeout=60)
    r.raise_for_status()
    return r.json()

# --- NEI: facility-level emissions for one pollutant, one state ---
# The NEI tables live under the same service; column names vary by
# release year, so confirm against the Envirofacts metadata catalog.
def nei_pollutant(state, pollutant, rows=5000):
    path = (
        f"{BASE}/POINT_SOURCE_SUMMARY/STATE_ABBREVIATION/{state}"
        f"/POLLUTANT_DESC/{pollutant}"
        f"/JSON/rows/0:{rows}"
    )
    r = requests.get(path, timeout=60)
    r.raise_for_status()
    return r.json()

# Pull Texas TRI forms for 2023 and rank the heaviest reported chemicals.
records = tri_air_releases("TX", 2023)
df = pd.DataFrame(records)

# Different Envirofacts releases label the air-release column slightly
# differently; normalize to a single working name before aggregating.
air_col = next(
    (c for c in df.columns if "AIR" in c.upper() and "POUND" in c.upper()),
    None,
)
chem_col = next(
    (c for c in df.columns if "CHEM" in c.upper() and "NAME" in c.upper()),
    None,
)

if air_col and chem_col:
    df[air_col] = pd.to_numeric(df[air_col], errors="coerce").fillna(0)
    top = (
        df.groupby(chem_col)[air_col]
        .sum()
        .sort_values(ascending=False)
        .head(10)
    )
    print("Top 10 TRI air-release chemicals in TX, 2023 (pounds):")
    for name, lbs in top.items():
        print(f"  {str(name)[:42]:<42} {lbs:>16,.0f}")
else:
    print("Inspect df.columns and map air_col / chem_col for this release.")

# Cross-state roll-up: total reported air releases by state for one chemical.
# Relies on the module-level chem_col / air_col discovered above and falls
# back to placeholder column names when discovery found nothing.
def benzene_by_state(states, year):
    totals = defaultdict(float)
    for st in states:
        recs = tri_air_releases(st, year)
        for row in recs:
            chem = str(row.get(chem_col or "CHEM_NAME", "")).strip().upper()
            if chem == "BENZENE":
                val = row.get(air_col or "TOTAL_RELEASE_AIR", 0)
                try:
                    totals[st] += float(val)
                except (TypeError, ValueError):
                    pass  # skip blank or non-numeric quantities
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

# print(benzene_by_state(["TX", "LA", "PA", "OH", "CA"], 2023))

Two practical notes apply to the Envirofacts API. First, the row-range suffix (/rows/0:5000) is the pagination mechanism; large states and years exceed the default response size, so production code should page through the full result set rather than assuming a single request returns everything. Second, for genuinely national-scale analysis—ranking every facility in the country, or building the full EJScreen-style geographic intersection—the bulk data files that EPA publishes for both the TRI (the TRI Basic and Data Plus files) and the NEI (the complete inventory releases) are far more efficient than thousands of paginated API calls, and they ship with the authoritative, version-stamped column definitions for that release year.
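
The paging loop mentioned in the first note can be sketched independently of the network call. The version below takes a fetch_page callable so it can be exercised offline; it assumes the row range is inclusive at both ends and that a short or empty page marks the end of the result set, both of which should be confirmed against the current Envirofacts documentation.

```python
def fetch_all(fetch_page, page_size=5000, max_pages=1000):
    """Collect every row from a row-ranged endpoint, one page at a time.

    `fetch_page(start, end)` must return the list of rows for that range.
    """
    rows = []
    for page in range(max_pages):
        start = page * page_size
        end = start + page_size - 1  # assumption: the range is inclusive
        batch = fetch_page(start, end)
        rows.extend(batch)
        if len(batch) < page_size:  # short or empty page: nothing left
            break
    return rows

# Offline stand-in for the API: 12 fake rows served through the same interface.
fake_table = [{"row": i} for i in range(12)]

def fake_fetch(start, end):
    return fake_table[start:end + 1]

all_rows = fetch_all(fake_fetch, page_size=5)
print(len(all_rows))  # 12
```

To use it against the live service, fetch_page would wrap a `requests.get` call that appends the `/rows/{start}:{end}` suffix used in the script above and returns the parsed JSON. The max_pages cap is a safety stop against an endpoint that never returns a short page.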

Limitations and analytical caveats

The consolidated emissions table is the most comprehensive public picture of facility-level air and toxic releases in the United States, but it carries structural limitations that an analyst must internalize before drawing conclusions from it.

The data is overwhelmingly self-reported and estimated. TRI quantities are reported by the facilities themselves, and facilities are explicitly permitted to use reasonable estimation methods (mass balance, emission factors, engineering calculations) rather than direct measurement for most chemicals. The NEI, likewise, relies on emission factors applied to activity levels for the large majority of its processes, with direct monitoring reserved for the largest sources. These are good-faith estimates governed by EPA methods, but they are estimates, not metered counts, and their accuracy varies with the quality of the underlying activity data and the representativeness of the emission factor.

Reporting thresholds exclude small sources. TRI only captures facilities above its employee and activity thresholds in covered sectors; a small shop using listed chemicals below the thresholds, or a facility in a non-covered industry, simply does not appear, no matter what it releases. The NEI captures large point sources individually but rolls small sources into county-level nonpoint estimates that cannot be attributed to any single facility. A community's true exposure may therefore include substantial contributions from sources that are invisible at the facility grain of this table.

NEI and TRI are methodologically different and must not be conflated. They were built for different statutory purposes, they cover different (though overlapping) universes of facilities and pollutants, they use different units, and they report on different cadences. The same physical facility can show different air-emission figures in NEI versus TRI for the same pollutant and year because the two programs define the source boundary, the included processes, and the estimation method differently. The pgm_sys_acrnm column is in the table precisely so that an analyst can keep the two provenances distinct; treating an NEI value and a TRI value as interchangeable measurements of the same quantity is a methodological error.
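
One way to keep the two provenances distinct is to pivot on pgm_sys_acrnm rather than sum across it. The sketch below places NEI and TRI figures for the same facility, pollutant, and year side by side; it assumes quantities are already in pounds in a hypothetical annual_emission_lbs field, and the rows are made up.

```python
from collections import defaultdict

def side_by_side(rows):
    """Map (registry_id, pollutant, year) to {program: pounds}, never mixing programs."""
    table = defaultdict(dict)
    for r in rows:
        key = (r["registry_id"], r["pollutant_name"], r["reporting_year"])
        program = r["pgm_sys_acrnm"]
        table[key][program] = table[key].get(program, 0.0) + r["annual_emission_lbs"]
    return dict(table)

# Made-up rows: the same facility reports different benzene figures to each program.
rows = [
    {"registry_id": "A", "pgm_sys_acrnm": "NEI", "pollutant_name": "Benzene",
     "reporting_year": 2020, "annual_emission_lbs": 1200.0},
    {"registry_id": "A", "pgm_sys_acrnm": "TRIS", "pollutant_name": "Benzene",
     "reporting_year": 2020, "annual_emission_lbs": 950.0},
]

comparison = side_by_side(rows)
print(comparison)
```

A disagreement between the two columns is a prompt to look at source boundaries and estimation methods, not a total to be added up.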

There is a substantial data lag. TRI reports for a given calendar year are due to EPA by July 1 of the following year, and the cleaned, quality-assured public data is typically released later still, so the most recent fully published TRI year usually trails the present by roughly a year and a half or more. The NEI lag is longer: a triennial inventory takes years to compile, reconcile across thousands of state and local submissions, and finalize, so the most recent published NEI can be several years old at any given moment. This dataset is authoritative for understanding established patterns and multi-year trends; it is not a real-time monitor of what a facility emitted last month.

Held with these caveats in mind, the consolidated epa_pollutant_emissions table remains a uniquely powerful resource: a facility-resolved, pollutant-resolved, year-resolved, Registry-ID-keyed record of American industrial air and toxic emissions, the data substrate beneath environmental-justice screening, enforcement targeting, and the public right-to-know that EPCRA established four decades ago.

Related writing

EPA RCRA Hazardous Waste Data: The Federal Database Behind 400,000 Regulated Facilities — The same industrial processes that emit toxic air pollutants frequently generate listed hazardous wastes, and both datasets share the FRS Registry ID, so joining facility-level emissions to RCRA generator and TSDF records produces a combined view of what a plant releases to the air and what it ships off-site as hazardous waste.

EPA Drinking Water Violations: The Federal Database Behind Safe Drinking Water Act Enforcement — Air deposition of mercury and the migration of industrial pollutants into source water connect facility emissions to drinking water safety, and the SDWIS data sits beside this emissions table within EPA's broader environmental public-health surveillance architecture.

EIA Form 860: The Federal Database Behind Every US Power Plant and Electricity Generator — Power plants are among the largest point sources in the National Emissions Inventory, and joining Form 860's generator-level capacity, fuel, and ownership data to facility emissions reveals which generating assets carry the heaviest criteria-pollutant and air-toxics burdens.