FDA National Drug Code Directory: The Federal Index of Every US Drug Product
The FDA National Drug Code Directory is the federal index of every drug product marketed in the United States — on the order of 40,000 active listings, each keyed by its three-segment National Drug Code and carrying the product's brand and generic names, its labeler, its dosage form and route, its active ingredients and strengths, its DEA schedule where applicable, and the dates that bracket its time on the market. The NDC is the universal serial number of the American drug supply, printed on every package and embedded in every pharmacy claim, and the directory is the canonical place to look one up.
Almost every quantitative question about prescription drugs in the United States eventually runs through the NDC. When a researcher wants to know how much Medicare spent on a particular molecule, when a payer reconciles a pharmacy claim, when an analyst tracks a drug shortage or maps a generic back to its brand, the join key is an NDC. But the NDC is also one of the most quietly treacherous identifiers in all of US health data: it comes in two lengths, three segment layouts, and a normalization convention that, done wrong, silently fails to match records and corrupts every downstream number. This article walks through what the directory is, how the code is structured, the normalization problem in detail, what the listing does and does not certify, how the directory links into the wider drug-data ecosystem, and a Python workflow for pulling and normalizing the whole thing.
What the directory is and how the NDC is built
The NDC Directory is published by the FDA and assembled from drug listings that manufacturers and other firms are legally required to submit. The agency exposes it through the openFDA program at the drug/ndc.json endpoint and also as a downloadable file. Each row is a single marketed drug product as the FDA understands a product: a specific formulation, in a specific dosage form, sold by a specific company, identified by a specific code.
That code is the National Drug Code itself, and its structure is the key to everything else. An NDC is a three-segment number, conventionally written with hyphens, of the form labeler-product-package. The first segment is the labeler code, and it is the one piece the FDA assigns directly: when a firm registers as a drug establishment, the agency issues it a labeler code that uniquely identifies that company as the entity whose name appears on the label. The labeler is not necessarily the physical manufacturer — it may be a distributor, a repackager, or a private-label brand owner — it is whoever is responsible for the labeling. The second segment is the product code, which the firm itself assigns to identify the particular strength, dosage form, and formulation. The third segment is the package code, also assigned by the firm, which identifies the package size and type — a bottle of 30 versus a bottle of 100, a single vial versus a carton of ten.
The directory reflects this two-level structure. The product_ndc field carries the labeler-product portion — the identity of the drug formulation — while the full three-segment package codes live in a nested packaging array, one entry per package size, each with its own complete package_ndc. A single product listing with three package sizes therefore yields one product NDC and three package NDCs. Getting this right matters because pharmacy claims are billed at the package level: the NDC on a claim is an eleven-digit package code, not a product code.
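The two-level structure can be sketched in a few lines of Python. This is a minimal illustration, assuming a record shaped like the directory's JSON (a product_ndc plus a packaging list of package_ndc entries); the helper names and the codes in the sample listing are made up for the example.

```python
def split_package_ndc(package_ndc: str):
    """Split a hyphenated package NDC into (product_ndc, package_code)."""
    parts = package_ndc.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a three-segment NDC: {package_ndc!r}")
    labeler, product, package = parts
    return f"{labeler}-{product}", package


def package_ndcs(listing: dict) -> list:
    """Collect the package-level NDCs nested under one product listing."""
    return [p["package_ndc"] for p in listing.get("packaging") or []
            if p.get("package_ndc")]


# Hypothetical listing with placeholder codes: one product, three package sizes.
listing = {
    "product_ndc": "12345-678",
    "packaging": [
        {"package_ndc": "12345-678-30"},
        {"package_ndc": "12345-678-90"},
        {"package_ndc": "12345-678-01"},
    ],
}

for code in package_ndcs(listing):
    product, package = split_package_ndc(code)
    print(product, package)
```

Every package code rolls up to the same product NDC, which is the relationship a product-to-package rollup relies on.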
The 10-to-11-digit normalization problem
Here is the single most important thing to understand about NDCs, and the source of more broken analyses than any other quirk in the data. An NDC in FDA format is ten digits long, but those ten digits come in three different segment layouts. The labeler code can be four or five digits, the product code three or four, and the package code one or two, and only three combinations occur: 4-4-2, 5-3-2, and 5-4-1. Each adds up to ten digits, but the digits fall in different places.
The billing world, however, does not use ten digits. Pharmacy claims, Medicare Part D event files, Medicaid drug-utilization data, and essentially every payer system use an eleven-digit NDC in a fixed 5-4-2 layout. To get from the FDA's ten-digit code to the billing eleven-digit code, you pad whichever segment is short with a single leading zero: a 4-4-2 code gets a zero in front of the labeler, a 5-3-2 code gets a zero in front of the product, and a 5-4-1 code gets a zero in front of the package.
The trap is that the hyphens are the only thing that tells you which layout you are looking at, and people routinely strip them. Once you de-hyphenate, a 4-4-2 code and a 5-3-2 code are both flat ten-character strings that look identical in length, but they must be padded in completely different positions. Padding by overall string length — the instinct to “just add a zero to the front to make it eleven” — is wrong for two of the three layouts and produces an eleven-digit number that matches nothing. Worse, it usually fails silently: the malformed NDC simply does not join to the claims data, the affected rows drop out, and the analyst sees a plausible-looking total that is quietly missing a chunk of the drugs. The only correct approach is to keep the segments intact, detect the layout from the segment lengths, and pad the specific short segment. The Python example below implements exactly this, and treats anything that is not one of the three known layouts as a flag-for-review case rather than guessing.
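A small comparison makes the failure concrete. The sketch below uses three placeholder codes, one per layout, that all contain the same ten digits; the function names are illustrative, not part of any library.

```python
LAYOUTS = {(4, 4, 2), (5, 3, 2), (5, 4, 1)}


def pad_by_length(ndc: str) -> str:
    """The tempting but wrong approach: strip hyphens, left-pad to eleven."""
    return ndc.replace("-", "").zfill(11)


def pad_by_segment(ndc: str) -> str:
    """Pad each segment to its 5-4-2 width while the hyphens are intact."""
    parts = ndc.split("-")
    if tuple(len(p) for p in parts) not in LAYOUTS:
        raise ValueError(f"unrecognized NDC layout: {ndc!r}")
    labeler, product, package = parts
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)


# Placeholder codes: same digits, three different layouts.
for code in ["1234-5678-90", "12345-678-90", "12345-6789-0"]:
    print(f"{code:<14} naive={pad_by_length(code)} segment={pad_by_segment(code)}")
```

The naive version maps all three inputs to the same eleven-digit string, so it is right only for the 4-4-2 code; the segment-aware version yields three distinct results.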
One further subtlety: an eleven-digit billing NDC is not reversible to a unique ten-digit FDA NDC without external knowledge, because a leading zero in the eleven-digit form could be either a genuine zero or padding. This is why the directory, and any reliable normalization, works forward from the segmented FDA code rather than trying to reconstruct it from a flattened billing string.
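The ambiguity can be shown by enumerating every ten-digit code that could have produced a given eleven-digit string. This is a sketch with a hypothetical helper name and placeholder codes; it only lists candidates and does not try to pick the right one, which would need a lookup against the directory.

```python
def candidate_ten_digit_forms(ndc11: str) -> list:
    """List the segmented 10-digit NDCs that pad to the given 11-digit code."""
    if len(ndc11) != 11 or not ndc11.isdigit():
        raise ValueError(f"expected 11 digits, got {ndc11!r}")
    labeler, product, package = ndc11[:5], ndc11[5:9], ndc11[9:]
    candidates = []
    if labeler[0] == "0":   # could be a padded 4-4-2 code
        candidates.append(f"{labeler[1:]}-{product}-{package}")
    if product[0] == "0":   # could be a padded 5-3-2 code
        candidates.append(f"{labeler}-{product[1:]}-{package}")
    if package[0] == "0":   # could be a padded 5-4-1 code
        candidates.append(f"{labeler}-{product}-{package[1:]}")
    return candidates


print(candidate_ten_digit_forms("01234056709"))  # three possible origins
print(candidate_ten_digit_forms("12345067890"))  # only one possible origin
```

When more than one candidate comes back, the flat string alone cannot say which zero was padding.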
The directory is a listing, not an approval
A persistent misconception is that appearing in the NDC Directory means the FDA has approved a drug. It does not. The directory is built on the Drug Listing Act, which requires every registered drug establishment to report the drugs it manufactures, prepares, propagates, compounds, or processes for commercial distribution. Listing is a self-reported, mandatory disclosure obligation, not a certification of efficacy or safety, and the FDA is explicit on this point: inclusion of a product in the NDC Directory does not imply that the FDA has verified or approved the product, or that the labeled claims are accurate.
This matters because the directory contains products at very different regulatory standings. Some are approved under a New Drug Application or a Biologics License Application. Some are approved generics under an Abbreviated New Drug Application. Some are marketed under an over-the-counter monograph without any product-specific approval at all. And historically the directory has contained unapproved drugs that are marketed but have never been through an FDA approval pathway — products that exist legally in a regulatory gray zone, or sometimes illegally. The FDA has worked to flag and remove products it considers improperly listed, but the foundational fact stands: an NDC is an identifier and a disclosure, not a seal of approval. Any analysis that treats “has an NDC” as a proxy for “FDA-approved” is making a category error.
Product types and the prescription/OTC split
The product_type field classifies each listing by its broad regulatory category, and it is the first cut most analyses make. The dominant categories are human prescription drug and human OTC drug, which together account for the vast majority of listings. Prescription drugs are dispensed only on the order of a licensed prescriber; OTC drugs are sold directly to consumers, governed either by an OTC monograph or, less commonly, by an approved application.
Beyond those two, the directory carries several smaller but important types. Plasma derivative covers products such as clotting factors and immune globulins manufactured from human plasma. Cellular therapy, vaccine, and other biologic categories appear where biologics are listed. Standardized allergenic and non-standardized allergenic products cover allergen extracts used in testing and immunotherapy. There is also a category for drugs intended for further processing. The split between prescription and OTC is the one that drives most downstream segmentation, because it separates the products that flow through pharmacy benefit claims from those that largely do not.
Within each listing, a cluster of fields describes the pharmacology. dosage_form gives the physical form: tablet, capsule, injection, solution, cream, patch, and so on. route gives the route of administration (oral, intravenous, topical, subcutaneous, ophthalmic, and the like) and is frequently a list, because a single product can have more than one labeled route. The active_ingredients array lists each active moiety with its name and strength, capturing both single-ingredient products and complex combinations. The pharm_class field carries pharmacologic class designations (established pharmacologic class, mechanism of action, chemical structure, and physiologic effect classifications), which let analysts group products therapeutically rather than one molecule at a time.
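Because route and active_ingredients are nested, they usually need flattening before tabular analysis. The sketch below assumes each ingredient entry carries name and strength keys, as described above; the helper names and the sample record are made up for illustration.

```python
def routes(listing: dict) -> list:
    """Return the labeled routes as a list, whether stored as a list or a string."""
    value = listing.get("route")
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def describe_ingredients(listing: dict) -> str:
    """Join each active ingredient's name and strength into one readable string."""
    parts = []
    for ing in listing.get("active_ingredients") or []:
        name = (ing.get("name") or "").strip()
        strength = (ing.get("strength") or "").strip()
        if name:
            parts.append(f"{name} {strength}".strip())
    return " + ".join(parts)


# Hypothetical two-ingredient record with illustrative values.
record = {
    "route": ["ORAL"],
    "active_ingredients": [
        {"name": "INGREDIENT A", "strength": "10 mg/1"},
        {"name": "INGREDIENT B", "strength": "325 mg/1"},
    ],
}
print(routes(record), describe_ingredients(record))
```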
Marketing category, application numbers, and the link to Drugs@FDA
The marketing_category field records the regulatory pathway under which a product is marketed, and it is the bridge from the NDC Directory to the rest of the FDA's drug data. Its values mirror the approval pathways: NDA for a New Drug Application, ANDA for an Abbreviated New Drug Application (the generic pathway), BLA for a Biologics License Application, and a family of OTC monograph values such as OTC MONOGRAPH FINAL and OTC MONOGRAPH NOT FINAL for products marketed under the monograph system rather than an individual application. Other values cover unapproved drugs and specialized pathways.
For application-based products, the application_number field carries the FDA application number — an NDA, ANDA, or BLA number — and this is the join key into Drugs@FDA, the FDA's database of approved drug products. With the application number you can move from a package-level NDC to the full approval history of the product: the original approval date, the sponsor, the approval letters, the labeling, and any supplements. This linkage is what lets an analyst connect a specific marketed package back to the regulatory action that authorized it, and it is the reason the directory is far more useful as a hub than as a standalone list.
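Before joining on the application number it helps to separate the pathway prefix from the numeric part. The sketch below assumes the field holds values of the form prefix plus digits (for example a string beginning with NDA, ANDA, or BLA); that format is an assumption to check against the actual data, and the sample numbers are placeholders.

```python
import re
from typing import Optional, Tuple

# Assumed shape: a pathway prefix immediately followed by digits.
_APPLICATION = re.compile(r"^(ANDA|NDA|BLA)(\d+)$")


def parse_application_number(value: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (pathway, digits) for application-based products, else None."""
    if not value:
        return None
    match = _APPLICATION.match(value.strip().upper())
    if match is None:
        return None  # monograph or other non-application value
    return match.group(1), match.group(2)


# Placeholder values for illustration only.
for raw in ["ANDA012345", "NDA054321", "BLA101010", "part341", None]:
    print(raw, "->", parse_application_number(raw))
```

Values that do not parse are returned as None rather than coerced, so monograph and unapproved products stay out of the application join.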
Links to RxNorm and the openFDA drug endpoints
The NDC's real power comes from the standards that hang off it. The most important is RxNorm, the normalized naming system for clinical drugs maintained by the National Library of Medicine. RxNorm assigns a concept identifier (an RxCUI) to each clinical drug at various levels of abstraction — ingredient, strength, dose form, branded versus generic — and crucially it publishes crosswalks from NDCs to RxCUIs. That crosswalk is what lets you collapse the thousands of distinct NDCs for, say, generic atorvastatin 20 mg tablets — one per labeler, per package size — into a single clinical concept, which is essential for any spend or utilization analysis that wants to reason about a drug rather than a SKU. The NDC tells you which exact package was dispensed; the RxCUI tells you what it clinically is.
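The collapse from package codes to clinical concepts is a simple group-by once a crosswalk is in hand. The sketch below assumes the crosswalk has already been obtained from RxNorm and loaded as a plain dictionary; the function name, the NDCs, the concept identifiers, and the amounts are all placeholders, not real data.

```python
from collections import defaultdict


def rollup_by_rxcui(claims, ndc_to_rxcui):
    """Sum claim amounts per RxCUI and report NDCs missing from the crosswalk."""
    totals = defaultdict(float)
    unmapped = []
    for ndc11, amount in claims:
        rxcui = ndc_to_rxcui.get(ndc11)
        if rxcui is None:
            unmapped.append(ndc11)
        else:
            totals[rxcui] += amount
    return dict(totals), unmapped


# Placeholder crosswalk: two labelers' packages map to the same concept.
crosswalk = {
    "11111000101": "CONCEPT_A",
    "22222000201": "CONCEPT_A",
    "33333000301": "CONCEPT_B",
}
claims = [
    ("11111000101", 100.0),
    ("22222000201", 250.0),
    ("33333000301", 40.0),
    ("99999000901", 5.0),  # not in the crosswalk
]
totals, unmapped = rollup_by_rxcui(claims, crosswalk)
print(totals, unmapped)
```

Returning the unmapped codes alongside the totals keeps crosswalk gaps visible instead of letting them vanish from the sum.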
Within openFDA itself, the NDC is the connective tissue across endpoints. The same NDC appears in the drug/label.json structured product labeling endpoint, in the drug/event.json adverse-event reports, and in the drug/enforcement.json drug-recall endpoint. A single NDC therefore lets an analyst pivot from a product's identity in the directory to its official label, its adverse-event signal, and its recall history, all keyed on the same code. The directory is the registry that anchors that web.
DEA scheduling and controlled substances
For products that are controlled substances, the directory carries a dea_schedule field. The Controlled Substances Act places drugs with abuse potential into five schedules. Schedule I substances have no currently accepted medical use, so the schedules that matter for marketed drug products are II through V. Schedule II (CII) covers drugs with a high potential for abuse but accepted medical use, with use leading to severe psychological or physical dependence: most opioid analgesics such as oxycodone, hydromorphone, and fentanyl, along with stimulants like amphetamine salts and methylphenidate. Schedule III (CIII) covers drugs with moderate-to-low dependence potential, including certain codeine combinations and some anabolic steroids. Schedule IV (CIV) covers lower-potential drugs such as most benzodiazepines and several sleep agents. Schedule V (CV) covers the lowest tier, including some antidiarrheal and antitussive preparations with limited quantities of narcotics.
The overwhelming majority of listings have no schedule at all because they are not controlled substances, and in the data the field is simply absent or null for those products. That absence is meaningful and should be treated as “non-controlled” rather than missing data. For anyone studying the opioid supply, stimulant prescribing, or controlled-substance utilization, dea_schedule is the field that isolates the relevant products before joining out to dispensing data. Note that the schedule recorded in the NDC Directory reflects the labeler's listing. The authoritative scheduling is set by the DEA; the two are nearly always consistent, but where they differ the DEA classification governs.
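Treating absence as a category of its own is a one-line rule, sketched here with hypothetical helper names and made-up records. The "NON-CONTROLLED" label is an arbitrary choice for the example, not a value the directory uses.

```python
from collections import Counter


def schedule_of(listing: dict) -> str:
    """Return the DEA schedule, mapping an absent or null field to a label."""
    value = listing.get("dea_schedule")
    return value if value else "NON-CONTROLLED"


def tally_schedules(listings) -> Counter:
    """Count listings per schedule, keeping non-controlled products visible."""
    return Counter(schedule_of(item) for item in listings)


# Made-up records: one with the field missing, one with it null.
sample = [
    {"product_ndc": "11111-001", "dea_schedule": "CII"},
    {"product_ndc": "22222-002", "dea_schedule": "CIV"},
    {"product_ndc": "33333-003"},
    {"product_ndc": "44444-004", "dea_schedule": None},
]
print(tally_schedules(sample))
```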
Marketing dates and how discontinued products linger
Two date fields bracket a product's commercial life. The marketing_start_date is when the labeler began, or intends to begin, marketing the product; the marketing_end_date is when marketing is expected to cease, and it is blank for products still actively marketed. Both are stored as eight-digit YYYYMMDD strings.
The important and often-overlooked behavior is what happens at discontinuation. When a product is discontinued, it does not vanish from the directory the moment marketing stops. The listing persists, now carrying a marketing_end_date in the past that marks it as no longer marketed. This is by design and it is useful: a package dispensed last year may carry an NDC for a product discontinued this year, and a claims dataset will be full of NDCs for products that are no longer sold. If the directory dropped discontinued products immediately, historical claims would fail to match. But it also means that a raw count of rows in the directory overstates the number of currently marketed products, and any “what is on the market today” analysis must filter on the marketing end date rather than assuming every listed NDC is live. The flip side is that even though it retains discontinued products, the directory is not a complete historical archive of every NDC ever issued; very old products do age out, so it is best read as a current-and-recent registry rather than a permanent ledger.
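A date filter for “currently marketed” can be sketched with the standard library alone. This assumes the two date fields are YYYYMMDD strings as described above and that a blank end date means still marketed; the function names and sample records are illustrative.

```python
from datetime import date, datetime
from typing import Optional


def parse_yyyymmdd(value: Optional[str]) -> Optional[date]:
    """Parse an eight-digit YYYYMMDD string; blank or missing becomes None."""
    if not value:
        return None
    return datetime.strptime(value, "%Y%m%d").date()


def is_currently_marketed(listing: dict, today: date) -> bool:
    """True if today falls inside the listing's marketing window."""
    start = parse_yyyymmdd(listing.get("marketing_start_date"))
    end = parse_yyyymmdd(listing.get("marketing_end_date"))
    if start is not None and start > today:
        return False  # not yet on the market
    if end is not None and end < today:
        return False  # discontinued, but still listed
    return True


# Made-up records evaluated against a fixed reference date.
as_of = date(2024, 6, 30)
live = {"marketing_start_date": "20150101"}
ended = {"marketing_start_date": "20100101", "marketing_end_date": "20231231"}
print(is_currently_marketed(live, as_of), is_currently_marketed(ended, as_of))
```

Passing the reference date in explicitly keeps the filter reproducible, which matters when the same analysis is rerun months later.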
Real-world uses
The directory's value is almost entirely as a join key and a decoder ring for other, larger datasets. The flagship use is drug spend and utilization analysis. Medicare Part D and Medicaid both publish drug-level spending and utilization keyed by NDC; on their own those files are opaque lists of eleven-digit codes and dollar figures. Joined to the NDC Directory — through correct eleven-digit normalization — they become analyzable: you can attribute spend to a brand or generic name, roll it up to a pharmacologic class, separate prescription from OTC, isolate controlled substances, and trace a labeler's total footprint. The normalization step is not optional plumbing here; it is the difference between a join that captures the spend and one that silently drops a fraction of it.
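Because a bad join drops rows instead of raising an error, it is worth computing match counts every time. The sketch below is one minimal way to do that over distinct codes; the function name and the sample NDCs are placeholders, and a production version would likely weight the unmatched share by dollars or claim counts as well.

```python
def reconcile(claim_ndcs, directory_ndcs) -> dict:
    """Report how many distinct claim NDCs found a match in the directory."""
    claims = set(claim_ndcs)
    directory = set(directory_ndcs)
    matched = claims & directory
    unmatched = claims - directory
    return {
        "claim_ndcs": len(claims),
        "matched": len(matched),
        "unmatched": len(unmatched),
        "match_rate": len(matched) / len(claims) if claims else 0.0,
        "unmatched_sample": sorted(unmatched)[:10],
    }


# Placeholder eleven-digit codes for illustration.
report = reconcile(
    ["01234567890", "12345067890", "12345678900", "99999999999"],
    ["01234567890", "12345067890", "12345678900"],
)
print(report)
```

A sudden drop in the match rate is usually the first visible symptom of a normalization bug.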
Other uses follow the same pattern. Generic-versus-brand mapping uses the proprietary and non-proprietary name fields, together with the marketing category, to label every NDC as a brand or a generic and to count how many generic competitors exist for a given molecule — a direct input to competition and pricing analysis. Labeler and manufacturer concentration analysis counts listings and, joined to spend, market share by labeler, surfacing how concentrated the supply of a particular drug class is among a few firms. Shortage tracking pairs the directory with the FDA drug shortage list and with discontinuation flags to monitor which products are leaving the market and which molecules are losing suppliers. And NDC normalization for claims is itself a deliverable: building and maintaining a clean ten-to-eleven-digit crosswalk, with product-to-package rollups and RxNorm links, is a foundational data-engineering task that every downstream pharmacy analysis depends on.
Querying the drug/ndc API with Python
The openFDA NDC endpoint uses the same compact query language as the rest of openFDA. The search parameter takes field-scoped expressions joined with +AND+ and +OR+, with .exact for exact-match terms; count returns server-side term frequencies for fast aggregation without downloading records; and limit and skip page through results, with skip capped at 25,000. Because the directory runs to roughly 40,000 listings, a full record-level pull must be sharded — here by product_type — to stay under that ceiling. No API key is required for light use; a free key raises the limits.
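One practical detail when building these queries from Python: in a raw URL the plus sign stands for a space, so a search expression handed to an HTTP client as a parameter should use literal spaces and let the client encode them. The sketch below uses a hypothetical build_search helper and the standard library's urlencode to show the resulting query string.

```python
from urllib.parse import urlencode


def build_search(filters: dict) -> str:
    """Join field:"value" clauses with AND, using spaces rather than '+'."""
    clauses = [f'{field}:"{value}"' for field, value in filters.items()]
    return " AND ".join(clauses)


expr = build_search({"product_type": "HUMAN OTC DRUG", "dea_schedule": "CII"})
print(expr)

# urlencode turns each space into '+', giving the +AND+ form on the wire.
print(urlencode({"search": expr, "limit": 5}))
```

Writing "+AND+" directly into a parameter value would instead be sent as an encoded plus sign, which changes the query.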
The script below sizes the directory by product type and route, pages a full copy into a DataFrame, normalizes every package-level NDC to the canonical eleven-digit form with the segment-aware logic described above, aggregates product counts by route and DEA schedule, and ranks the labelers with the most listed products. The normalization function is the load-bearing part: it pads the correct segment for each of the three layouts and flags anything that does not fit, rather than guessing by string length.
from __future__ import annotations  # lets "str | None" hints work before Python 3.10

from collections import Counter

import pandas as pd
import requests

# ---------------------------------------------------------------
# openFDA NDC Directory API
# Endpoint: https://api.fda.gov/drug/ndc.json
# No API key required for <= 240 requests/min and 1,000/day.
# Register at https://open.fda.gov/apis/authentication/ for
# 120,000 requests/day (the per-minute limit is unchanged).
#
# This script:
#   1. Counts listed products by product_type and by route
#   2. Pages a full slice of the directory into a DataFrame
#   3. Normalizes every NDC from the 10-digit segmented form
#      to the unambiguous 11-digit 5-4-2 form used by claims
#   4. Aggregates product counts by route and DEA schedule
#   5. Ranks the labelers with the most listed products
# ---------------------------------------------------------------
BASE = "https://api.fda.gov/drug/ndc.json"


def count_field(field: str, search: str | None = None) -> dict:
    """Use the openFDA count parameter for a fast server-side tally.

    The count endpoint returns aggregate term frequencies without
    transferring individual records -- the efficient way to size a
    dataset before downloading rows.
    """
    params = {"count": field}
    if search:
        params["search"] = search
    resp = requests.get(BASE, params=params, timeout=30)
    resp.raise_for_status()
    results = resp.json().get("results", [])
    return {row["term"]: row["count"] for row in results}


def fetch_records(search: str | None = None, page_size: int = 1000) -> list[dict]:
    """Page through NDC listings for an optional search expression.

    openFDA caps skip at 25,000, so a full pull of the directory
    must be sharded -- here we shard by product_type and
    concatenate. A shard larger than the ceiling cannot be paged to
    the end, so that case is reported rather than silently truncated.
    """
    records: list[dict] = []
    skip = 0
    while True:
        params: dict = {"limit": page_size, "skip": skip}
        if search:
            params["search"] = search
        resp = requests.get(BASE, params=params, timeout=60)
        if resp.status_code == 404:  # openFDA returns 404 on empty result sets
            break
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:
            break
        records.extend(batch)
        print(f"  Fetched {len(records):,} records so far...")
        if len(batch) < page_size:
            break  # last page of this shard
        if skip + page_size >= 25000:
            print("  WARNING: reached the skip ceiling; this shard is "
                  "truncated and needs a narrower search.")
            break
        skip += page_size
    return records


def normalize_ndc(ndc: str) -> str | None:
    """Convert a segmented NDC to the canonical 11-digit 5-4-2 form.

    Manufacturers assign 10-digit NDCs in one of three segment
    layouts: 4-4-2, 5-3-2, or 5-4-1. Claims systems demand a flat
    11-digit string padded to 5-4-2, inserting a leading zero into
    whichever segment is short. Doing this by string length on the
    de-hyphenated code is the classic bug: 1234-5678-90 and
    12345-678-90 both have ten digits but pad differently. The
    segments must be padded individually, which is only possible
    while the hyphens are intact.
    """
    if not ndc or "-" not in ndc:
        return None
    parts = ndc.split("-")
    if len(parts) != 3:
        return None
    labeler, product, package = parts
    layout = (len(labeler), len(product), len(package))
    if layout == (5, 4, 2):
        pass                          # already canonical
    elif layout == (4, 4, 2):
        labeler = "0" + labeler       # pad labeler segment
    elif layout == (5, 3, 2):
        product = "0" + product       # pad product segment
    elif layout == (5, 4, 1):
        package = "0" + package       # pad package segment
    else:
        return None                   # non-standard; flag for review
    return labeler + product + package


# ---------------------------------------------------------------
# Step 1: Size the directory by product type and by route.
#         Counting on .exact tallies whole values; without it
#         openFDA may count the individual words in each value,
#         which would make the shards in Step 2 overlap.
# ---------------------------------------------------------------
print("Counting listed products by product type...")
by_type = count_field("product_type.exact")
print("\nListed Drug Products by Type")
print("-" * 52)
for term, n in Counter(by_type).most_common():
    print(f"{term[:38]:<40} {n:>8,}")

print("\nTop routes of administration...")
by_route = count_field("route.exact")
print("\nListed Products by Route (top 15)")
print("-" * 40)
for term, n in Counter(by_route).most_common(15):
    print(f"{term[:24]:<26} {n:>8,}")

# ---------------------------------------------------------------
# Step 2: Pull the directory, sharded by product type to stay
#         under the 25,000 skip ceiling on any single query.
# ---------------------------------------------------------------
print("\nDownloading the NDC directory...")
rows: list[dict] = []
for ptype in by_type:
    search = f'product_type:"{ptype}"'
    print(f"Fetching product_type = {ptype} ...")
    rows.extend(fetch_records(search=search))
df = pd.DataFrame(rows)
print(f"Retrieved {len(df):,} product listings.")

# ---------------------------------------------------------------
# Step 3: Normalize the package-level NDCs to 11-digit 5-4-2.
#         The product_ndc field is labeler-product; the package
#         codes live under the nested packaging[] list, each with
#         its own full 3-segment package_ndc.
# ---------------------------------------------------------------
def package_codes(row) -> list[str]:
    pkgs = row.get("packaging")
    if not isinstance(pkgs, list):  # a missing value arrives as NaN in pandas
        return []
    return [p["package_ndc"] for p in pkgs if p.get("package_ndc")]


pairs: list[tuple[str, str, str | None]] = []
for _, row in df.iterrows():
    for raw in package_codes(row):
        pairs.append((row.get("product_ndc", ""), raw, normalize_ndc(raw)))
ndc_df = pd.DataFrame(pairs, columns=["product_ndc", "package_ndc_raw", "ndc11"])
unresolved = ndc_df["ndc11"].isna().sum()
print(f"\nPackage-level NDCs: {len(ndc_df):,}")
print(f"Normalized to 11-digit: {ndc_df['ndc11'].notna().sum():,}")
print(f"Non-standard / unresolved: {unresolved:,}")

# ---------------------------------------------------------------
# Step 4: Product counts by route and by DEA schedule.
#         route and dea_schedule may be lists or scalars; coerce.
# ---------------------------------------------------------------
def first(val) -> str:
    if isinstance(val, list):
        return val[0] if val else ""
    return val or ""


df["route_1"] = df["route"].apply(first)
print("\nProducts by Route (from downloaded rows, top 12)")
print("-" * 40)
for route, n in df["route_1"].value_counts().head(12).items():
    print(f"{route[:24]:<26} {n:>8,}")

df["schedule"] = df.get("dea_schedule", pd.Series(dtype=object))
sched = df["schedule"].dropna()
print("\nControlled-Substance Products by DEA Schedule")
print("-" * 44)
if not sched.empty:
    for s, n in sched.value_counts().items():
        print(f"{s:<10} {n:>8,}")
else:
    print("(no scheduled products in this slice)")
print(f"Non-controlled (no schedule): {df['schedule'].isna().sum():,}")

# ---------------------------------------------------------------
# Step 5: Labeler concentration -- who lists the most products?
#         A count query keeps this lightweight at full scale.
# ---------------------------------------------------------------
print("\nFetching top labelers by listed-product count...")
labeler_counts = count_field("labeler_name.exact")
top_labelers = Counter(labeler_counts).most_common(15)
print("\nTop 15 Labelers by Listed Product Count")
print("-" * 60)
print(f"{'Labeler':<46} {'Products':>10}")
print("-" * 60)
for name, n in top_labelers:
    print(f"{name[:45]:<46} {n:>10,}")

The pattern generalizes. Swap the product_type shard for a marketing_category:ANDA filter to study the generic landscape, or for dea_schedule:CII to isolate Schedule II products and profile the opioid and stimulant supply. Replace the labeler count with a count over pharm_class.exact to size therapeutic classes, or join the normalized ndc11 column against a Medicare Part D spending file to attribute dollars to names, classes, and labelers. The directory is the decoder; the analysis lives in the join.
Caveats and limits
Four limits shape any honest use of this dataset. First, listing is self-reported under the Drug Listing Act and is not an approval. The presence of an NDC certifies only that a registered firm disclosed a product, not that the FDA verified, approved, or endorsed it, and the directory has historically contained unapproved and even improperly listed products. Treat regulatory standing as something to be read from marketing_category and the application number, never inferred from the mere existence of a code.
Second, the ten-versus-eleven-digit ambiguity is a permanent hazard. The FDA publishes ten-digit segmented codes; billing systems demand eleven-digit flat codes; the conversion depends on segment layout and cannot be done reliably by overall length or after the hyphens are stripped. Every join between the directory and a claims dataset must normalize on the segmented form, and any mismatch tends to fail silently by dropping rows rather than raising an error, so reconciliation counts — how many NDCs matched, how many did not — are mandatory.
Third, discontinued products linger and the directory is a current-and-recent registry rather than a complete historical archive. Discontinued listings persist with a past marketing-end date, so a raw row count overstates what is currently marketed; meanwhile very old NDCs may age out entirely, so a claims record from many years ago may reference an NDC the directory no longer carries. Filter on marketing dates and status for “currently marketed” questions, and expect some historical NDCs to be unmatchable.
Fourth, repackager and relabeler NDCs proliferate the codes. When a repackager or relabeler takes a manufacturer's product and packages it under its own labeler code, it creates a new NDC for what is clinically the same drug. The result is many NDCs — and many labeler names — pointing at the same underlying product, which inflates raw product counts and can double-count utilization if not collapsed. This is precisely why RxNorm exists and why serious analysis maps NDCs to RxCUIs before counting: the NDC identifies a package and a label, not a unique clinical drug. Treated with those four caveats in mind, the NDC Directory remains the authoritative, openly accessible index of the US drug supply — the registry that turns the eleven-digit codes scattered across every pharmacy claim into names, classes, schedules, and companies.
Related writing
FDA device classifications is the parallel federal registry for medical devices, with its own product-code taxonomy and risk tiers.
FDA food enforcement reports covers the recall side of the FDA's product data, including the drug-recall endpoint that shares the NDC key.
CMS physician data pairs naturally with NDC-keyed Part D records to connect who prescribes a drug to what the drug is.