CMS Doctors and Clinicians: The Federal Database Behind Every Medicare Physician

11 min read · AI Analytics
CMS · Medicare · Physicians · Healthcare · Federal Data

For nearly every physician who treats a Medicare patient in the United States, there is a row in a single federal file that names their medical school, the year they graduated, their specialties, the group practice they bill under, the hospital they are tied to, and whether they accept Medicare's approved amount as payment in full. The CMS Doctors and Clinicians national file — the data behind the public Care Compare clinician profiles, and known for years as Physician Compare — holds roughly 163,000 such records, and it is the closest thing the country has to a public directory of who practices medicine inside Medicare and on what terms.

It is an unglamorous file. It does not score doctors, rank them, or tell you whether they are any good. What it does is establish identity and structure at national scale: a clean, downloadable map of clinicians to specialties, to group practices, to hospitals, and to practice locations, all keyed to the same National Provider Identifier that runs through every other Medicare dataset. That makes it the connective tissue of physician-level analysis. Join it to Medicare claims and you can attribute spending to a specialty. Join it to Open Payments and you can see which physicians take industry money. Aggregate it by county and you can map where the doctors are — and, more pointedly, where they are not.

What it is, and how it differs from NPPES and PECOS

Three federal systems describe physicians, and they are constantly confused for one another. The Doctors and Clinicians file is one of them, and understanding it means understanding what the other two are and where each draws its data.

NPPES — the National Plan and Provider Enumeration System — is the registry that issues the NPI itself. Every healthcare provider in the United States, whether or not they ever touch Medicare, must obtain an NPI from NPPES to bill any insurer electronically. NPPES is therefore the universe: it holds millions of records covering physicians, nurses, dentists, therapists, pharmacies, hospitals, labs, and durable medical equipment suppliers. It is self-maintained — providers update their own NPPES record — and it carries taxonomy codes, practice addresses, and contact details, but it says nothing about whether a provider actually participates in Medicare. NPPES is enumeration, not enrollment.

PECOS (the Provider Enrollment, Chain, and Ownership System) is the enrollment system. It is where a provider applies to bill Medicare and is vetted: identity, licensure, ownership structure, reassignment of benefits to group practices, and revalidation all live in PECOS. PECOS is administrative plumbing, much of it not public, and it is the authoritative source for who is actually enrolled to bill the program. The Medicare Fee-For-Service Public Provider Enrollment file is the limited public extract drawn from PECOS.

The Doctors and Clinicians file sits downstream of both. It is the public, consumer-facing slice: the set of individual physicians and clinicians who are enrolled in Medicare (so they are in PECOS) and identified by NPI (so they are in NPPES), filtered to the professionals a patient might choose — doctors, nurse practitioners, physician assistants, clinical nurse specialists, certified nurse-midwives, and a defined list of allied clinicians. It draws on PECOS enrollment, NPPES identity, and CMS's own quality-program records to assemble a profile suited to comparison shopping rather than billing administration. Where NPPES is the universe and PECOS is the gatekeeper, the Doctors and Clinicians file is the published directory.

The file has been through several names. CMS launched it as Physician Compare in 2010, a mandate of the Affordable Care Act to publish physician-level information for Medicare beneficiaries. In 2020 CMS folded Physician Compare, Hospital Compare, Nursing Home Compare, and the other Compare tools into the unified Care Compare portal, and the clinician data now appears there as “Doctors and Clinicians.” The bulk downloads live in the CMS Provider Data Catalog at data.cms.gov/provider-data, and the dataset is the one referenced internally in our catalog as cms_doctors, with on the order of 163,000 provider records.

The schema

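The Doctors and Clinicians data is published as a small family of related CSV files (principally a clinician-level file, a group-affiliation file, and a hospital-affiliation file) that share the NPI as a join key. The combined national file gives, for each provider, a row per practice location and group affiliation. The fields that form its spine, all of which recur in the sections that follow, are the NPI; the credential (cred); the medical school (med_sch) and graduation year (grd_yr); the primary specialty (pri_spec) and the secondary specialties; the group practice PAC ID (org_pac_id) and group size (num_org_mem); the hospital-affiliation CCNs; the practice address, including state; the telehealth flag; and the Medicare assignment indicator.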

The crucial structural fact is that the clinician-level file is not one row per physician. A physician who practices at three locations under two group affiliations can appear in six rows, differing only in address and group fields. Every count of distinct physicians therefore has to begin by deduplicating on NPI, and every aggregate has to be explicit about whether it is counting clinicians, clinician-locations, or clinician-group pairs. The roughly 163,000 figure in our catalog refers to the provider records as published; the count of distinct individual clinicians is lower once locations and affiliations are collapsed, and the count of rows in the fully expanded national file is far higher.
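The three counting units are easy to confuse, so a minimal sketch on made-up rows may help. It assumes pandas, uses the npi and org_pac_id fields named in this article, and uses adr_ln_1 as an assumed name for the address column; none of the values are real.

```python
import pandas as pd

# Toy rows shaped like the national file: one row per
# clinician x practice location x group. All values are invented.
rows = pd.DataFrame(
    {
        "npi": ["1000000001"] * 6 + ["1000000002"],
        "org_pac_id": ["G1", "G1", "G1", "G2", "G2", "G2", "G1"],
        "adr_ln_1": ["1 Main St", "2 Oak Ave", "3 Elm Rd"] * 2 + ["1 Main St"],
    }
)

n_rows = len(rows)                                                    # published rows
n_clinicians = rows["npi"].nunique()                                  # distinct clinicians
n_clinician_locations = len(rows.drop_duplicates(["npi", "adr_ln_1"]))
n_clinician_groups = len(rows.drop_duplicates(["npi", "org_pac_id"]))

print(n_rows, n_clinicians, n_clinician_locations, n_clinician_groups)  # 7 2 4 3
```

The first clinician alone contributes six rows (three locations under two groups), which is why a bare row count says little about how many people are in the file.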

Specialties and the CMS taxonomy

The specialty fields are where the file does its most analytical work, and they sit at an awkward intersection of two coding systems. Medicare enrollment uses its own list of provider specialty descriptions: the Medicare specialty codes that map to the values you see in pri_spec, such as “Internal Medicine,” “Family Practice,” “Cardiovascular Disease,” “Nurse Practitioner,” and “Physician Assistant.” These are the categories CMS pays under, and they are the ones the Doctors and Clinicians file exposes.

Running parallel to them is the NUCC Healthcare Provider Taxonomy — the hierarchical code set carried in NPPES that classifies providers by type, classification, and area of specialization. The taxonomy is far more granular than the Medicare specialty list, and the two do not map cleanly one-to-one: a single Medicare specialty can correspond to several taxonomy codes, and the taxonomy distinguishes subspecialties that Medicare lumps together. Separately, occupational statistics from the Bureau of Labor Statistics use the Standard Occupational Classification (SOC) system, which is coarser still and built for labor-market analysis rather than clinical billing. Any project that tries to reconcile physician counts across CMS, NPPES, and BLS sources is really reconciling three taxonomies — Medicare specialty codes, NUCC taxonomy codes, and SOC codes — and the crosswalks between them are approximate. For work that stays inside the Doctors and Clinicians file, the practical move is to treat pri_spec as the authoritative classification and to be wary of joining specialty counts to any source that uses a different scheme without an explicit crosswalk.
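To see why an approximate crosswalk is dangerous, here is a small sketch, assuming pandas, of what happens when one Medicare specialty maps to more than one taxonomy code. The crosswalk table and its code values (TAX_A and so on) are hypothetical placeholders, not real NUCC codes.

```python
import pandas as pd

# Hypothetical crosswalk: one Medicare specialty fans out to two codes.
crosswalk = pd.DataFrame(
    {
        "pri_spec": ["Internal Medicine", "Internal Medicine", "Cardiovascular Disease"],
        "taxonomy_code": ["TAX_A", "TAX_B", "TAX_C"],  # placeholders, not NUCC codes
    }
)
clinicians = pd.DataFrame(
    {
        "npi": ["1", "2", "3"],
        "pri_spec": ["Internal Medicine", "Internal Medicine", "Cardiovascular Disease"],
    }
)

# Specialties that map to more than one taxonomy code are ambiguous.
fanout = crosswalk.groupby("pri_spec")["taxonomy_code"].nunique()
ambiguous = fanout[fanout > 1].index.tolist()

# A naive join duplicates every clinician in an ambiguous specialty.
joined = clinicians.merge(crosswalk, on="pri_spec", how="left")
print(ambiguous, len(joined), joined["npi"].nunique())
```

Three clinicians become five rows after the join, so any count taken downstream of it has to go back to distinct NPIs or the ambiguous specialties will be inflated.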

Secondary specialties add a second dimension. A clinician whose primary specialty is Internal Medicine may carry a secondary specialty of Cardiovascular Disease or Hematology/Oncology, reflecting fellowship training. Counting only primary specialties understates the true supply of subspecialists, because many practicing cardiologists and oncologists are enrolled with a broad primary specialty and a narrow secondary one. Serious specialty-supply analysis reads the secondary fields as well.
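A minimal sketch of the difference, assuming pandas and a single secondary-specialty column that is here called sec_spec_1 (an assumed name; check the column names in the release you download). The rows are invented.

```python
import pandas as pd

# One row per clinician, already deduplicated on NPI. Values are invented.
clinicians = pd.DataFrame(
    {
        "npi": ["1", "2", "3", "4"],
        "pri_spec": [
            "Cardiovascular Disease",
            "Internal Medicine",
            "Internal Medicine",
            "Family Practice",
        ],
        "sec_spec_1": [None, "Cardiovascular Disease", "Hematology/Oncology", None],
    }
)

target = "Cardiovascular Disease"
is_primary = clinicians["pri_spec"].eq(target)
is_any = is_primary | clinicians["sec_spec_1"].eq(target)

primary_count = int(is_primary.sum())
any_count = int(is_any.sum())
print(primary_count, any_count)  # 1 2
```

With several secondary columns the same idea applies: build one boolean per column and combine them with `|` before counting.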

Group and hospital affiliations — and how they link to cms_hospitals

Two affiliation structures turn the file from a flat directory into a graph. The first is the group practice affiliation. Most physicians reassign their Medicare billing to an organization (a group practice, a hospital-owned medical group, an academic faculty practice plan) identified by its org_pac_id. Because the PAC ID is stable and shared by every clinician in the group, grouping the file by org_pac_id reconstructs the membership roster of every Medicare-billing practice in the country, along with its size from num_org_mem. That single field powers an entire genre of analysis: practice consolidation, the growth of large multispecialty groups, and the hospital employment of physicians all show up as changes in group membership over successive releases.
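The roster reconstruction can be sketched in a few lines, assuming pandas and the npi and org_pac_id fields named above. The rows are invented, and the roster size computed here is what the downloaded rows show, which is not guaranteed to equal the published num_org_mem.

```python
import pandas as pd

# Invented rows: clinician 1 appears twice under G1 (two locations),
# clinician 3 bills under two different groups.
rows = pd.DataFrame(
    {
        "npi": ["1", "1", "2", "3", "3"],
        "org_pac_id": ["G1", "G1", "G1", "G2", "G1"],
    }
)

# Collapse location repeats to distinct clinician-group pairs first.
pairs = rows.dropna(subset=["org_pac_id"]).drop_duplicates(["npi", "org_pac_id"])

roster = pairs.groupby("org_pac_id")["npi"].apply(list)   # members per group
sizes = pairs.groupby("org_pac_id")["npi"].nunique()      # observed group size

print(roster["G1"], int(sizes["G1"]), int(sizes["G2"]))
```

Running the same grouping on two releases and comparing the rosters is one way to surface the consolidation patterns described above.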

The second structure is the hospital affiliation, and this is the bridge to the rest of the CMS hospital data. Each hospital affiliation field carries a CMS Certification Number, the same six-digit CCN that identifies hospitals in the CMS hospital quality datasets and that we store as the key of the cms_hospitals table. Joining the Doctors and Clinicians hospital-affiliation CCN to cms_hospitals lets you attach a clinician to the ownership type, location, and quality measures of the hospital they practice at — and, run in the other direction, lets you list the physicians affiliated with any given hospital and profile that hospital's medical staff by specialty. The CCN join is the single most valuable cross-link the file offers: it turns an individual-level clinician file and a facility-level quality file into one connected picture of who practices where.
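A sketch of the CCN join, assuming pandas. The column names hosp_ccn, ccn, and ownership are illustrative assumptions, and the CCN values are made up. Two details matter in practice: CCNs should be held as strings so leading zeros survive, and the join should be checked as many-to-one so a duplicated hospital key cannot silently multiply clinician rows.

```python
import pandas as pd

# Invented hospital-affiliation rows and a tiny stand-in for cms_hospitals.
affil = pd.DataFrame(
    {"npi": ["1", "2", "3"], "hosp_ccn": ["000001", "000001", "999999"]},
    dtype="string",
)
cms_hospitals = pd.DataFrame(
    {"ccn": ["000001", "000002"], "ownership": ["Non-profit", "Government"]},
    dtype="string",
)

joined = affil.merge(
    cms_hospitals,
    left_on="hosp_ccn",
    right_on="ccn",
    how="left",
    validate="many_to_one",  # raises if a CCN repeats in cms_hospitals
    indicator=True,          # adds _merge so unmatched CCNs are visible
)

unmatched = joined[joined["_merge"] == "left_only"]
staff_by_hospital = joined[joined["_merge"] == "both"].groupby("ccn")["npi"].nunique()
print(len(unmatched), int(staff_by_hospital["000001"]))  # 1 2
```

Affiliations whose CCN finds no hospital row are worth inspecting rather than dropping, since they often point to formatting differences in the key.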

Both affiliation links come with a reliability caveat that recurs throughout this dataset: the affiliations are derived, in part, from where a clinician's Medicare claims indicate they practiced, and they are subject to the same self- and clinic-reported error as the rest of the enrollment data. A physician who moved groups, or who has privileges at a hospital they rarely use, may show stale or incomplete affiliations. The structure is real and useful; it is not a clean legal record of current privileges.

Medicare assignment — what “accepting assignment” means for patients

The assignment field looks like a dry administrative flag and is in fact one of the most consequential pieces of information in the file for an actual patient. Accepting Medicare assignment means the clinician agrees to accept the Medicare-approved amount as payment in full for covered services. For a beneficiary, that translates directly into out-of-pocket cost: when a clinician accepts assignment, the patient is responsible only for the standard deductible and coinsurance, and the clinician cannot bill the difference between their usual charge and the Medicare rate.

Clinicians who have not signed a Medicare participation agreement are “non-participating” providers. They can still accept assignment claim by claim, but on claims where they do not, they may charge up to a capped limit above the Medicare-approved amount: the limiting charge, generally fifteen percent over the reduced non-participating fee schedule. The patient pays that excess. The assignment indicator in the Doctors and Clinicians file records whether the clinician accepts assignment for all Medicare-covered services, and the overwhelming majority of clinicians do: participation rates among physicians billing Medicare run well above ninety percent, because participation carries higher fee-schedule amounts and direct payment from Medicare. The minority who do not are concentrated in particular specialties and markets.
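The limiting-charge arithmetic is worth working once. The sketch below assumes the usual rule that the non-participating fee schedule is 95 percent of the participating amount and that the limiting charge is 115 percent of that reduced amount. The function name and the dollar figure are illustrative only.

```python
def limiting_charge(fee_schedule_amount: float) -> float:
    """Maximum a non-participating clinician may bill on an unassigned claim.

    Assumes the non-participating amount is 95% of the participating
    fee schedule and the cap is 115% of that reduced amount.
    """
    nonpar_amount = 0.95 * fee_schedule_amount
    return round(1.15 * nonpar_amount, 2)


fee = 100.00                      # illustrative participating fee schedule amount
cap = limiting_charge(fee)
print(cap)                        # 109.25
```

The two percentages compound to 109.25 percent of the participating amount, so the ceiling on an unassigned claim is lower than a plain "fifteen percent extra" reading would suggest.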

Distinct from both is the small population of physicians who have formally opted out of Medicare entirely. An opted-out physician has filed an affidavit to treat Medicare patients only under private contract, outside the program's fee structure altogether. Opted-out physicians are not the same as non-participating physicians — the former are outside Medicare, the latter are inside it but charging the limiting charge — and, because the Doctors and Clinicians file is built from Medicare-enrolled clinicians, opted-out physicians are largely absent from it. That absence is itself a caveat for anyone trying to use the file as a complete census of practicing physicians.

MIPS and the Quality Payment Program context

Physician Compare was created to publish quality information, and that lineage still shapes the data ecosystem around the file. The Medicare Access and CHIP Reauthorization Act of 2015 (MACRA) replaced the old sustainable growth rate formula with the Quality Payment Program, whose dominant track for most clinicians is the Merit-based Incentive Payment System, or MIPS. Under MIPS, eligible clinicians earn a composite performance score across four categories — quality, cost, improvement activities, and promoting interoperability — and that score drives a positive, neutral, or negative adjustment to their Medicare Part B payments two years later.

MIPS scores and a set of performance measures are published as companion datasets in the same Provider Data Catalog, keyed by the same NPI, so that a beneficiary browsing a clinician's Care Compare profile can in principle see both who the clinician is (from the Doctors and Clinicians file) and how they performed (from the MIPS performance files). For analysts, the practical consequence is that the Doctors and Clinicians file is the identity backbone onto which the MIPS performance data attaches: the demographic, specialty, and affiliation fields come from here, and the quality scores join on by NPI. The alternative QPP track, Advanced Alternative Payment Models, covers clinicians who take on enough risk through models such as certain accountable care organizations to be exempt from MIPS; those clinicians still appear in the Doctors and Clinicians file but carry different or absent MIPS records.
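As a rough sketch of that attachment, assuming pandas: the clinician table is first reduced to one row per NPI, then the performance file is left-joined so that clinicians without a MIPS record stay visible. The column name final_score and all values are assumptions for illustration.

```python
import pandas as pd

# One row per NPI from the Doctors and Clinicians side (invented values).
clinicians = pd.DataFrame(
    {
        "npi": ["1", "2", "3"],
        "pri_spec": ["Internal Medicine", "Cardiovascular Disease", "Family Practice"],
    }
)
# Hypothetical MIPS extract: clinician 2 has no record, for example
# because they are exempt through an Advanced APM.
mips = pd.DataFrame({"npi": ["1", "3"], "final_score": [88.5, 71.0]})

profile = clinicians.merge(mips, on="npi", how="left", validate="one_to_one")
no_mips = profile.loc[profile["final_score"].isna(), "npi"].tolist()
print(len(profile), no_mips)  # 3 ['2']
```

A left join keeps the directory as the denominator, so a missing score reads as "no MIPS record" rather than making the clinician disappear.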

Real-world uses

Because the file establishes identity and structure rather than judgment, its value is almost entirely in what it can be joined to and aggregated by. A handful of uses recur.

Referral networks. Combined with the shared-patient patterns derivable from Medicare claims, the group and hospital affiliations let researchers reconstruct physician referral networks — which primary care doctors send patients to which specialists, and how those flows cluster around particular groups and hospitals. The affiliation fields supply the node attributes (specialty, group, hospital) that make a referral graph interpretable.

Specialty supply by county. Aggregating distinct NPIs by primary specialty and practice location produces a map of physician supply — cardiologists per capita, psychiatrists per capita, primary care access — at the state and county level. This is the canonical workforce-policy use of the file, and it is the basis of the Python example below. It surfaces the geography of shortage: rural counties with no practicing psychiatrist, whole regions thin on certain surgical subspecialties.

School-to-specialty pipelines. Because every record carries a medical school and graduation year, the file supports analysis of which schools feed which specialties and which regions, and how cohorts age. Grouping by med_sch and pri_spec shows the specialty mix a given school's graduates enter; grouping by grd_yr reveals the age structure of a specialty and impending retirement waves.
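Both groupings can be sketched on a deduplicated clinician table, assuming pandas. The rows are invented, the release year of 2024 is an assumption, and the 35-years-since-graduation cutoff is an arbitrary stand-in for "approaching retirement."

```python
import pandas as pd

# One row per NPI; grd_yr arrives as text in many exports, so convert it.
clinicians = pd.DataFrame(
    {
        "npi": ["1", "2", "3", "4", "5"],
        "med_sch": ["School A", "School A", "School A", "School B", "School B"],
        "pri_spec": [
            "Internal Medicine", "Internal Medicine", "Psychiatry",
            "Family Practice", "Psychiatry",
        ],
        "grd_yr": ["1985", "2001", "2015", "1990", "2018"],
    }
)
clinicians["grd_yr"] = pd.to_numeric(clinicians["grd_yr"], errors="coerce")

# Share of each school's graduates entering each primary specialty.
mix = pd.crosstab(clinicians["med_sch"], clinicians["pri_spec"], normalize="index")

# Share of each specialty that graduated 35 or more years before the release.
release_year = 2024  # assumption: set this to the release you downloaded
senior = (release_year - clinicians["grd_yr"]) >= 35
senior_share = senior.groupby(clinicians["pri_spec"]).mean()

print(mix.round(2))
print(senior_share)
```

Graduation year is only a proxy for age, so the senior share is best read as a relative signal across specialties rather than a retirement forecast.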

Telehealth adoption. The telehealth flag, read across releases, tracks the diffusion of telehealth by specialty and geography — a question that became urgent during the coverage expansions of the early 2020s and that the file is one of the few national sources able to answer at the individual-clinician level.
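Reading the flag across releases might look like the sketch below, assuming pandas. It stacks two releases with an explicit release_date column and compares adoption rates by specialty. The column name telehlth, its Y/N coding, and the dates are assumptions, and the rows are invented.

```python
import pandas as pd

# Two stacked snapshots, each already one row per NPI (invented values).
releases = pd.DataFrame(
    {
        "release_date": ["2023-01-01"] * 3 + ["2024-01-01"] * 3,
        "npi": ["1", "2", "3"] * 2,
        "pri_spec": ["Psychiatry", "Psychiatry", "Family Practice"] * 2,
        "telehlth": ["N", "Y", "N", "Y", "Y", "N"],
    }
)
releases["offers_telehealth"] = releases["telehlth"].eq("Y")

# Rows are specialties, columns are releases, cells are adoption rates.
adoption = (
    releases.groupby(["release_date", "pri_spec"])["offers_telehealth"]
    .mean()
    .unstack("release_date")
)
print(adoption)
```

Keeping the release date as data rather than as a file name makes it harder to forget that each snapshot lags the underlying enrollment changes.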

Joining to Open Payments. Perhaps the highest-leverage join is to CMS Open Payments, the database of payments and transfers of value from drug and device manufacturers to physicians. Both datasets carry the NPI, so the Doctors and Clinicians file supplies the clean specialty, group, and hospital context that Open Payments lacks, and Open Payments supplies the industry-money dimension that the directory lacks. Joined on NPI, the two answer questions like which specialties receive the most industry payments, whether physicians at particular groups or hospitals take disproportionate sums, and how payment patterns vary by career stage inferred from graduation year. It is one of the most direct ways to connect a physician's professional identity to the financial relationships that may influence their practice.
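A minimal sketch of that join, assuming pandas. Payments are summed to one row per NPI before merging, so that a physician with many payment records does not multiply directory rows. The column names covered_recipient_npi and total_amount_usd are simplified assumptions, and every value is invented.

```python
import pandas as pd

# One row per NPI from the directory side.
clinicians = pd.DataFrame(
    {
        "npi": ["1", "2", "3"],
        "pri_spec": ["Cardiovascular Disease", "Cardiovascular Disease", "Family Practice"],
    }
)
# Hypothetical payment records: many rows per physician are normal.
payments = pd.DataFrame(
    {
        "covered_recipient_npi": ["1", "1", "3"],
        "total_amount_usd": [250.0, 1000.0, 40.0],
    }
)

per_npi = payments.groupby("covered_recipient_npi", as_index=False)["total_amount_usd"].sum()
merged = clinicians.merge(
    per_npi,
    left_on="npi",
    right_on="covered_recipient_npi",
    how="left",
    validate="one_to_one",
)
merged["total_amount_usd"] = merged["total_amount_usd"].fillna(0.0)  # no payments found

by_spec = merged.groupby("pri_spec")["total_amount_usd"].agg(["sum", "mean"])
print(by_spec)
```

Whether to treat a missing payment record as zero is an analytic choice: it assumes the payments extract is complete for the period being studied.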

Computing physicians-per-100k with Python

The workhorse analysis on this file is physician supply: how many clinicians of a given specialty practice per capita, and where. The script below pulls the Doctors and Clinicians national file from the CMS Provider Data Catalog through its Socrata-style datastore API, collapses the location-and-group rows down to one row per physician, joins state population to compute physicians-per-100,000 residents, and then breaks the national supply out by primary specialty. The Provider Data API serves the same data as the bulk ZIP downloads but lets you page it programmatically, which is convenient for a reproducible pipeline.

import requests
import pandas as pd

# ---------------------------------------------------------------
# CMS Provider Data Catalog -- Doctors and Clinicians national file
# Catalog page:  https://data.cms.gov/provider-data/dataset/mj5m-pzi6
# The dataset id below is the stable Socrata-style resource id used
# by the Provider Data API. CMS occasionally re-versions the id, so
# resolve the current one from the catalog page if a 404 appears.
#
# This script:
#   1. Pages the full national file through the Provider Data API
#   2. Collapses it to one row per physician (NPI)
#   3. Joins to Census state population to get physicians-per-100k
#   4. Breaks the supply rate out by primary specialty
# ---------------------------------------------------------------

DATASET_ID = "mj5m-pzi6"
BASE = f"https://data.cms.gov/provider-data/api/1/datastore/query/{DATASET_ID}/0"


def fetch_all(page_size: int = 5000) -> pd.DataFrame:
    """Page through the Doctors and Clinicians datastore endpoint.

    The Provider Data API returns JSON pages; we walk offsets until an
    empty page signals the end. The full file is on the order of
    1.5 million rows because each provider repeats once per practice
    location and group affiliation.
    """
    rows: list[dict] = []
    offset = 0
    while True:
        params = {"limit": page_size, "offset": offset}
        resp = requests.get(BASE, params=params, timeout=120)
        # If the server rejects a large limit, lower page_size and retry.
        resp.raise_for_status()
        page = resp.json().get("results", [])
        if not page:
            break  # an empty page is the only reliable end-of-data signal
        rows.extend(page)
        print(f"  Fetched {len(rows):,} rows so far...")
        # Advance by what was actually returned: if the server caps the
        # page below page_size, a "short page" test would stop too early.
        offset += len(page)
    return pd.DataFrame(rows)


# ---------------------------------------------------------------
# Step 1: Download the national file.
# ---------------------------------------------------------------
print("Downloading CMS Doctors and Clinicians national file...")
raw = fetch_all()
print(f"Raw rows (provider x location x group): {len(raw):,}")

# Column names in the file use lowercase with underscores. The exact
# casing has drifted across releases, so normalise defensively.
raw.columns = [c.strip().lower() for c in raw.columns]

# Map the handful of fields we need to whatever variant is present.
def pick(df: pd.DataFrame, *candidates: str) -> str:
    for c in candidates:
        if c in df.columns:
            return c
    raise KeyError(f"none of {candidates} found in columns")

npi_col   = pick(raw, "npi")
state_col = pick(raw, "state", "st", "adr_state")
spec_col  = pick(raw, "pri_spec", "primary_specialty", "specialty")
cred_col  = pick(raw, "cred", "credential")


# ---------------------------------------------------------------
# Step 2: Collapse to one row per physician.
#         A provider appears once per (location, group) combination,
#         so dedupe on NPI and keep the first practice state and
#         primary specialty for a clean per-clinician table.
# ---------------------------------------------------------------
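# Keep MDs and DOs; drop NPs, PAs, etc. by credential when present.
PHYS_CREDS = {"MD", "DO"}
# Normalise "M.D." / " md " style variants before comparing.
cred_norm = (
    raw[cred_col].fillna("").astype(str).str.upper()
    .str.replace(".", "", regex=False).str.strip()
)
if cred_norm.ne("").any():
    phys = raw[cred_norm.isin(PHYS_CREDS)].copy()
else:
    # No credential values at all: fall back to the unfiltered file.
    phys = raw.copy()

# The JSON API may return blanks as empty strings rather than nulls,
# so convert them before dropping incomplete rows.
phys = phys.replace("", pd.NA).dropna(subset=[npi_col, state_col, spec_col])
phys = phys.drop_duplicates(subset=[npi_col])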
print(f"Distinct physicians (one row per NPI): {len(phys):,}")


# ---------------------------------------------------------------
# Step 3: Join state population for a per-100k rate.
#         Plug in any current state population table; a small inline
#         dict keeps this example self-contained.
# ---------------------------------------------------------------
state_pop = {
    "CA": 39_000_000, "TX": 30_000_000, "FL": 22_000_000, "NY": 19_500_000,
    "PA": 13_000_000, "IL": 12_600_000, "OH": 11_800_000, "GA": 11_000_000,
    "NC": 10_700_000, "MI": 10_000_000, "WY": 580_000,    "VT": 645_000,
    # ... extend with the full 50-state + DC table for production use
}

by_state = (
    phys.groupby(state_col)[npi_col]
    .nunique()
    .rename("physicians")
    .reset_index()
)
by_state["population"] = by_state[state_col].map(state_pop)
by_state = by_state.dropna(subset=["population"])
by_state["per_100k"] = (by_state["physicians"] / by_state["population"] * 100_000).round(1)
by_state = by_state.sort_values("per_100k", ascending=False)

print("\nMedicare physicians per 100,000 residents, by state")
print("-" * 52)
print(by_state.to_string(index=False))


# ---------------------------------------------------------------
# Step 4: Supply by primary specialty (national counts).
# ---------------------------------------------------------------
by_spec = (
    phys.groupby(spec_col)[npi_col]
    .nunique()
    .rename("physicians")
    .reset_index()
    .sort_values("physicians", ascending=False)
)
total_phys = by_spec["physicians"].sum()
by_spec["pct"] = (by_spec["physicians"] / total_phys * 100).round(1)

print("\nTop 20 primary specialties by physician count")
print("-" * 52)
print(by_spec.head(20).to_string(index=False))

The two non-obvious steps are the deduplication and the credential filter. Deduplicating on NPI is mandatory: without it, a physician with several practice locations is counted once per row, and specialties whose clinicians tend to practice at many sites or under many groups are overstated. Keeping the first row per NPI also assigns a multi-state physician to whichever state happens to appear first, which is acceptable for a rough state rate but should be stated explicitly in anything published. The credential filter narrows the file to physicians (MD and DO) and drops nurse practitioners, physician assistants, and other clinicians. That is appropriate when the question is specifically about physician supply, but worth removing when you want the full clinician workforce, since NPs and PAs are a large and growing share of Medicare primary care. The inline population dictionary is a stand-in; in production you would join the full fifty-state-plus-DC table from the Census Bureau, and for county-level supply you would group on the practice ZIP or county derived from the address fields and join county population instead. Expect the per-100k figures to vary widely: the Northeast and a handful of academic-medicine hubs run high, while large, fast-growing, and rural states run conspicuously low, the empirical signature of the physician maldistribution that workforce policy has chased for decades.

Caveats and limits

Four limits govern any honest use of the Doctors and Clinicians file. The first is that much of it is self- or clinic-reported. Medical school, graduation year, specialties, and affiliations originate in enrollment data that providers and their practice managers maintain, and the accuracy of any individual field depends on how diligently it was kept current. Specialty self-designation in particular can be loose: a clinician may be enrolled under a broad primary specialty that understates their actual subspecialty practice, which is one reason secondary specialties matter.

The second is update lag. The file is refreshed on a periodic cadence, not in real time, and the affiliation and telehealth fields in particular can trail reality. A physician who changed groups, moved cities, or stopped offering telehealth may carry stale values until the next refresh propagates the enrollment change. Any time-series built from successive releases should treat each release as a lagged snapshot rather than a live state, and should record the release date.

The third is coverage. The file is, by construction, a directory of Medicare-enrolled clinicians who accept Medicare patients. Physicians who have opted out of Medicare entirely, those who practice exclusively in settings that do not bill Medicare Part B (some pediatric, cash-pay, and concierge practices), and clinicians enrolled only under arrangements that exclude them from the public file are absent or under-represented. The file is therefore a strong census of the Medicare physician workforce but a biased one for the physician workforce as a whole — it tilts toward specialties and settings that serve older patients.

The fourth is identifier subtlety around the NPI. The NPI is durable and almost always a reliable key, but it is not flawless: providers occasionally end up with more than one NPI, an NPI can be deactivated when a provider retires or dies while records keyed to it linger in older files, and an individual's Type 1 NPI must not be conflated with the Type 2 organizational NPI of the group they bill under. Most analyses never hit these edges, but a join that silently matches on a deactivated NPI, or that mixes individual and organizational NPIs, will introduce errors that are hard to spot after the fact. Deduplicate on NPI, confirm you are working with Type 1 individual identifiers, and validate join cardinality before trusting any cross-dataset result. Treated with those four caveats in mind, the Doctors and Clinicians file is the authoritative, openly downloadable backbone for understanding who practices medicine inside Medicare, and the join key to nearly everything else CMS publishes about them.
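One cheap pre-join check, offered as a sketch rather than a complete validation: an NPI is a ten-digit number whose last digit is a Luhn check digit computed over the NPI prefixed with 80840. The function below tests only that structure. It cannot tell whether an NPI is active, or whether it is Type 1 or Type 2, both of which need a lookup against NPPES.

```python
def npi_checksum_ok(npi: str) -> bool:
    """Return True if `npi` is 10 ASCII digits with a valid Luhn check digit.

    The check is computed over the prefix "80840" plus the NPI.
    """
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    digits = [int(ch) for ch in "80840" + npi]
    total = 0
    # Walk right to left, doubling every second digit (Luhn).
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(npi_checksum_ok("1234567893"))  # True
print(npi_checksum_ok("1234567890"))  # False
print(npi_checksum_ok("123456789"))   # False (too short, e.g. a lost leading digit)
```

Running this over the join key in each dataset catches truncated or mistyped identifiers early, before they turn into unexplained unmatched rows.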

Related writing

CMS hospital quality data covers the facility-level Care Compare datasets that the hospital-affiliation CCN in this file links into.

FDA food enforcement reports is another federal health dataset built on structured records and classification, with its own caveats around free-text fields and reporting lag.

NHTSA vehicle complaints is the consumer-facing federal safety database for motor vehicles, with a parallel public-reporting and identifier structure.