CMS Post-Acute Care Utilization: The Federal Database Behind Home Health, Hospice, and Skilled Nursing Spending

June 1, 2026 · 10 min read · AI Analytics

CMS · Post-Acute Care · Home Health · Hospice · Federal Data

After a hospital stay ends, the most consequential and least visible part of American healthcare begins: the home health nurse who visits three times a week, the hospice that takes over when treatment stops, the skilled nursing facility that handles the rehabilitation a patient is not yet well enough to do at home. Medicare spends on the order of sixty billion dollars a year on this post-acute care, and for nearly every agency, hospice, and nursing facility that bills the program, CMS publishes a provider-level record of how much it delivered and how much it was paid. The Post-Acute Care utilization Public Use Files — unified in our catalog as cms_pac_utilization, roughly 28,404 provider-by-measure rows — are the closest thing the country has to a public ledger of who provides post-acute care, in what volume, and at what cost.

These files matter because post-acute care is, by a wide margin, the part of Medicare fee-for-service most distorted by payment incentives and most chronically targeted by fraud. The settings are reimbursed not for outcomes but for episodes, stays, and days, and the details of how those units are priced have, again and again, produced exactly the behavior the pricing rewarded: too many home-health episodes, hospice stays that run far longer than any terminal prognosis, and rehabilitation therapy delivered in the precise volume that maximized a payment threshold rather than the volume a patient needed. The utilization files are the instrument for seeing that behavior at the level of the individual provider — for asking which agency bills twice what its neighbors do, which hospice keeps patients for a year, and which nursing facility's payment per stay sits three standard deviations above its state. This article is a field guide to what the files contain and how to use them.

What it is, and the post-acute care landscape

“Post-acute care” is the umbrella term for the services a patient receives after the acute, hospital phase of an illness or injury. Medicare recognizes several distinct settings, each with its own providers, its own certification rules, and — crucially for this dataset — its own payment system. The three that dominate spending and that this file covers are home health agencies, hospices, and skilled nursing facilities.

Home Health Agencies (HHAs) deliver skilled nursing, physical and occupational therapy, and home health aide services in the patient's home, to beneficiaries who are largely homebound and need intermittent skilled care. The unit of payment is the home-health episode (now a thirty-day period), within which the agency provides some number of visits. Home health is the lowest-capital setting in post-acute care — no building full of beds, just nurses and therapists in cars — which is part of why it has been a perennial fraud target.
Hospices care for terminally ill beneficiaries who have elected to forgo curative treatment in favor of comfort-focused care, almost always in the patient's home or a nursing facility rather than a dedicated inpatient unit. Hospice is paid a fixed per-diem — a daily rate for each enrolled day, largely regardless of the services delivered that day — which gives it an economic structure entirely different from the other settings and an incentive profile all its own.
Skilled Nursing Facilities (SNFs) provide short-term inpatient rehabilitation and skilled nursing — the rehab after a hip replacement, the recovery after a stroke — to patients not yet ready to go home. The unit of payment is the SNF stay, measured in covered days. The same physical buildings often house long-stay custodial residents paid for by Medicaid or out of pocket, but the Medicare SNF benefit, and this file, cover the skilled, post-hospital stay.

Two further settings round out the post-acute landscape and are worth naming even though they sit at the margins of this file. Inpatient Rehabilitation Facilities (IRFs) are hospitals or hospital units that deliver intensive, physician-supervised rehabilitation — the demanding several-hours-a-day therapy a patient must be well enough to tolerate — for conditions like stroke, spinal cord injury, and major trauma. Long-Term Care Hospitals (LTCHs) treat the sickest, most medically complex patients who need hospital-level care for extended periods, often the ventilator-dependent and those with multiple organ failures. IRFs and LTCHs are paid under their own prospective payment systems and reported in their own provider files; the great bulk of post-acute volume, spending, and controversy lives in the three home-health, hospice, and SNF settings this dataset consolidates, which is why the unified table centers on them.

The single fact that makes post-acute care a coherent analytical subject is that the same patient frequently flows through several of these settings in sequence — hospital to SNF to home health, or hospital to home health to hospice — and that the choice of setting, the length of stay in each, and the intensity of services are all shaped by how each setting is paid rather than purely by clinical need. To read the utilization data honestly, you have to understand the payment systems first.

The prospective payment systems: PDPM, PDGM, and the hospice per-diem

Each post-acute setting is paid under a Medicare prospective payment system — a scheme that pays a predetermined amount for a defined unit of care, rather than reimbursing billed charges. The design of each system is the key to interpreting the utilization file, because every payment system creates incentives, and the recent history of post-acute care is in large part the history of CMS rewriting these systems to defeat the gaming the previous version invited.

Skilled nursing — the Patient-Driven Payment Model (PDPM), since October 2019. For two decades, Medicare paid SNFs under a system (RUGs) that tied a large part of the daily rate to the volume of therapy minutes delivered. The predictable result was that therapy was furnished in the amount that maximized payment: a striking concentration of patients received therapy in the narrow band just above the threshold for the highest-paying category, a pattern no clinical theory could explain. PDPM, which took effect in October 2019, deliberately severed the link between therapy volume and payment. It instead sets the daily rate from the patient's clinical characteristics — diagnoses, functional status, and comorbidities — across several case-mix components, so that a facility is paid for how sick and complex the patient is, not for how many therapy minutes it logs. PDPM was designed to be budget-neutral, but it changed behavior immediately and visibly: therapy minutes dropped, the use of group and concurrent therapy rose, and the case-mix coding of patients intensified as facilities learned to document the conditions PDPM now paid for. The SNF payment figures in this file therefore reflect a system that pays for documented complexity, which is itself a thing worth auditing.

Home health — the Patient-Driven Groupings Model (PDGM), since January 2020. Home health underwent a parallel reform one quarter later. The old system paid for sixty-day episodes and, like the old SNF system, keyed payment partly to therapy volume, rewarding agencies for providing therapy visits above certain thresholds. PDGM, effective January 2020, made two structural changes at once: it cut the payment period in half, from sixty days to thirty, and it removed therapy-visit counts as a payment factor entirely. Under PDGM, the thirty-day period is classified into one of hundreds of payment groups based on the admission source (community versus institutional), the timing (early versus late in a spell), the clinical grouping derived from the primary diagnosis, the functional impairment level, and a comorbidity adjustment. As with PDPM, the explicit goal was to pay for patient characteristics rather than service volume and to blunt the incentive to pad therapy. The practical consequence for this dataset is enormous and easy to miss: an “episode” in home-health data from 2020 onward is a thirty-day period, while older data counted sixty-day episodes, so any time series of episode counts or payment-per-episode that spans the changeover is comparing two different units unless it is explicitly reconciled.
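
The "hundreds of payment groups" figure can be checked by multiplying out the five classification dimensions listed above. The sketch below assumes the level counts commonly cited for PDGM as first implemented (two admission sources, two timing categories, twelve clinical groupings, three functional levels, three comorbidity tiers). Those counts are not stated in this article, and the clinical-group labels are placeholders, so verify both against the CMS home health rule for the year in question.

```python
from itertools import product

# Assumed level counts per PDGM dimension (not given in the article;
# verify against the CMS home health final rule for the year you use).
PDGM_DIMENSIONS = {
    "admission_source": ["community", "institutional"],
    "timing": ["early", "late"],
    "clinical_group": [f"clinical_{i}" for i in range(1, 13)],  # 12 placeholder labels
    "functional_level": ["low", "medium", "high"],
    "comorbidity": ["none", "low", "high"],
}

# Every combination of levels is one payment group.
payment_groups = list(product(*PDGM_DIMENSIONS.values()))
print(len(payment_groups))  # 2 * 2 * 12 * 3 * 3 = 432
```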

Hospice — the per-diem and the aggregate cap. Hospice has always been paid differently, and its payment system is the one whose incentives run most directly against patient interest. Medicare pays hospice a flat per-diem rate for each day a beneficiary is enrolled, at one of four levels — routine home care (the overwhelming majority of days), continuous home care, inpatient respite care, and general inpatient care — with routine home care carrying a modest daily rate that is now slightly front-loaded to pay more in the first sixty days and less thereafter. Because the per-diem is paid regardless of how much care is actually delivered on a given day, the economics favor patients with long enrollments and low service intensity: the marginal day of a stable, long-staying patient is nearly pure margin. Medicare counters this with two cap mechanisms, the binding one being the aggregate cap: each hospice's total annual Medicare payments may not exceed a per-beneficiary cap amount multiplied by its number of beneficiaries, and a hospice that exceeds the cap must repay the overage. The cap is, in effect, a ceiling on average length of stay — a hospice can keep some patients a very long time only if it also serves enough short-stay patients to pull the average down. Cap liability is one of the sharpest signals in the data that a hospice's patient mix is tilted toward the long, low-intensity stays that the per-diem rewards and that draw OIG scrutiny.
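
The cap arithmetic described above can be sketched in a few lines. This is a deliberately simplified illustration: the cap amount and per-diem rate are hypothetical round numbers, every day is assumed to be paid at one flat rate, and the function names are invented for this example. The real calculation uses the cap amount CMS publishes each year and has its own rules for counting beneficiaries.

```python
def aggregate_cap_overage(total_payments: float, beneficiaries: int,
                          cap_per_beneficiary: float) -> float:
    """Amount to repay: payments above cap_per_beneficiary * beneficiaries."""
    cap_total = cap_per_beneficiary * beneficiaries
    return max(0.0, total_payments - cap_total)


def break_even_avg_days(cap_per_beneficiary: float, per_diem: float) -> float:
    """Average enrolled days per beneficiary at which payments reach the cap,
    assuming a single flat per-diem (a simplification)."""
    return cap_per_beneficiary / per_diem


# Hypothetical round numbers, NOT the published CMS amounts.
CAP = 30_000.0
PER_DIEM = 200.0

long_stay_payments = 100 * 180 * PER_DIEM   # 100 patients averaging 180 days
short_stay_payments = 100 * 120 * PER_DIEM  # 100 patients averaging 120 days

print(break_even_avg_days(CAP, PER_DIEM))                   # 150.0
print(aggregate_cap_overage(long_stay_payments, 100, CAP))  # 600000.0
print(aggregate_cap_overage(short_stay_payments, 100, CAP)) # 0.0
```

Under these assumed numbers the cap behaves as a ceiling of 150 days on average length of stay: the hospice averaging 180 days owes a repayment, and the one averaging 120 days does not.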

The data and what each measure means

CMS publishes the underlying data as a family of “by Provider” Public Use Files, one per setting, drawn from the claims that home health agencies, hospices, and skilled nursing facilities submit to Medicare fee-for-service. Each file is provider-level: the fundamental unit is one provider for one report year, with its volume and payment measures aggregated across all of its beneficiaries. The unified cms_pac_utilization table stacks these into provider-by-measure rows. The figure of roughly 28,404 counts provider-and-setting observations across the three settings; it is not a count of distinct organizations or of claims. Every row carries an identity block and a measure block.

The identity block locates the provider:

Provider identifier — the CMS Certification Number (CCN) for the facility, and in many releases an associated National Provider Identifier (NPI). The CCN is the stable key that joins this file to Care Compare quality data and to the provider enrollment and ownership files.
Provider name and geography — the legal or doing-business-as name and the practice location: city, state, and ZIP. The state field is the workhorse for the geographic-variation analysis that this dataset is built for.
Setting — which of the three post-acute settings the row describes (Home Health Agency, Hospice, or Skilled Nursing Facility), the dimension that must gate almost every comparison, because the measures mean different things in each.

The measure block quantifies what the provider did and was paid:

Episodes or stays — the primary volume count, setting-specific: home health reports episodes (thirty-day periods under PDGM), SNFs report stays, and hospice, having no episode unit, is measured by enrolled days and beneficiaries. This is the denominator for the most important derived measure, payment per episode or per stay.
Total days or visits — the intensity measure: home health reports total visits (and visits per episode is a key utilization ratio), SNFs report covered days, and hospice reports total days, the basis of the per-diem and of average length of stay.
Distinct beneficiaries — the unduplicated count of Medicare patients the provider served in the year. Dividing volume by beneficiaries yields episodes per patient or days per patient — the measures most diagnostic of over-utilization, since a provider that bills many episodes per beneficiary or keeps patients for many days per enrollment is exactly the profile the payment incentives reward.
Total Medicare payments — the actual dollars Medicare paid the provider for the year, the headline spending figure.
Standardized payments — total payments with the geographic and policy-driven adjustments removed (chiefly the area wage index and assorted add-ons), so that a dollar of standardized payment means the same thing in rural Mississippi as in Manhattan. This is the field to use for any cross-provider or cross-geography comparison; comparing raw total payments across regions mostly measures the wage index, not provider behavior.
Average payment per episode or per stay — the per-unit price, the single most useful derived measure in the file. Computed on standardized payments, it is the basis for the outlier detection that drives fraud screening and margin analysis.
Charge-to-payment ratio — the relationship between what the provider charged and what Medicare actually paid. Because Medicare pays prospective rates, charges are largely fictional list prices, and a wildly high charge-to-payment ratio is less a fact about reimbursement than a flag about a provider's billing posture — useful as one input to an outlier model, not as a measure of cost.

The interpretive discipline the file demands is to keep the units straight. An episode is not a stay is not a beneficiary-day; a home-health episode count and a SNF stay count cannot be added or compared; and a payment-per-episode figure is meaningful only against other providers in the same setting. The setting field exists precisely so that every aggregate can be partitioned correctly, and the most common error with this dataset is comparing across settings as though the measures were commensurable.
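
The derived ratios described in the measure block, and the rule that they are only comparable within a setting, can be sketched as follows. The field names and figures are hypothetical and do not reflect the published column names. Hospice is handled separately because it has no episode unit, so beneficiaries serve as its denominator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProviderRow:
    ccn: str
    setting: str               # "Home Health Agency", "Hospice", or "Skilled Nursing Facility"
    units: Optional[float]     # episodes (HHA) or stays (SNF); None for hospice
    days_or_visits: float      # visits (HHA), covered days (SNF), enrolled days (hospice)
    beneficiaries: float
    total_payment: float
    std_payment: float
    charges: float


def derived_measures(row: ProviderRow) -> dict:
    """Per-unit ratios; hospice uses beneficiaries as its denominator."""
    is_hospice = row.setting == "Hospice"
    denom = row.beneficiaries if is_hospice else row.units
    return {
        "units_per_beneficiary": None if is_hospice else row.units / row.beneficiaries,
        "intensity_per_unit": row.days_or_visits / denom,
        "std_payment_per_unit": row.std_payment / denom,
        "charge_to_payment": row.charges / row.total_payment,
    }


def require_same_setting(a: ProviderRow, b: ProviderRow) -> None:
    """Refuse cross-setting comparisons: the measures are not commensurable."""
    if a.setting != b.setting:
        raise ValueError(f"cannot compare {a.setting} with {b.setting}")


# Illustrative values only.
hha = ProviderRow("A1", "Home Health Agency", 400, 3600, 250, 800_000, 760_000, 1_200_000)
hospice = ProviderRow("B1", "Hospice", None, 10_800, 120, 2_300_000, 2_160_000, 2_900_000)

print(derived_measures(hha))
print(derived_measures(hospice))
```

Calling `require_same_setting(hha, hospice)` raises `ValueError`, which is the guard to put in front of any ranking or outlier score.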

The fraud and overutilization story

No corner of Medicare fee-for-service has a longer or more consistent enforcement record than post-acute care, and the utilization files are the public surface of that story. Home health and hospice in particular sit perennially near the top of the lists the HHS Office of Inspector General (OIG) and the Department of Justice publish of program-integrity priorities, and the reasons trace directly back to the payment incentives described above.

Home health has been a fraud locus for decades for a structural reason: it is cheap to enter and the unit of payment is a visit-bundle delivered in private homes, where the care is hard to verify after the fact. The classic abuse patterns are billing for beneficiaries who are not in fact homebound or do not need skilled care, billing for visits not rendered, paying kickbacks for patient referrals, and — the upcoding pattern the payment reforms were built to defeat — providing or documenting just enough therapy to clear a payment threshold. Certain metropolitan areas became so saturated with fraudulent home-health billing that CMS imposed enrollment moratoria, temporarily freezing new home-health certifications in the worst-hit markets. In the utilization data, the fingerprints are statistical: agencies with anomalously high episodes per beneficiary, visits per episode, or payment per episode relative to their geographic peers.

Hospice is the setting where the incentive runs most directly against the patient's clinical reality, and the enforcement record reflects it. Because the per-diem rewards long enrollments of low-intensity patients, the signature abuse is enrolling beneficiaries who are not actually terminally ill — who have a life expectancy far longer than the six-month prognosis the benefit requires — and keeping them on service for many months or years of profitable per-diem days. OIG has documented this pattern repeatedly, along with hospices that provide minimal care for the days they bill, that enroll patients without a valid terminal diagnosis, and that bill the high general-inpatient level without justification. The problem became acute in a handful of states — California, Texas, Nevada, and Arizona among them — that saw explosive, fraud-driven growth in new hospice certifications, prompting CMS to impose special enrollment scrutiny there. In the data, the tell is long average length of stay, low spending per day, aggregate-cap liability, and concentrations of newly enrolled for-profit hospices with shared characteristics.

Underlying all of this is the steady drumbeat of the MedPAC margin reports. The Medicare Payment Advisory Commission, which advises Congress, has for years reported that Medicare margins in home health and hospice run conspicuously high — home-health margins in particular have at times exceeded twenty percent, far above what a competitive, appropriately priced service would sustain — and has repeatedly recommended payment cuts on the grounds that the rates are simply too generous relative to the cost of efficient care. The Government Accountability Office (GAO) has produced parallel work documenting vulnerability to fraud and the persistence of overutilization. These reports are the policy backdrop against which the utilization file is read: when MedPAC says home-health margins are twenty percent, the per-provider payment data is where you can see which agencies are driving the average and whether the high-margin providers cluster by ownership, geography, or billing pattern.

How this joins to ownership and quality data

The utilization file is most powerful not on its own but as one layer in a stack, joined to the other CMS provider datasets on the shared facility key. Two joins matter most.

The ownership join. The CCN that identifies each provider in the utilization file is the same key that runs through the CMS all-owners files, which disclose who owns every Medicare-enrolled SNF, home health agency, and hospice — including entity-type flags for private equity companies and REITs. Joining utilization to ownership lets you ask the question at the center of the current policy debate directly: do private equity-owned post-acute providers utilize and bill differently than their peers? Because the financial sponsors described in the provider-ownership data have rolled up home health and especially hospice aggressively — the low-capital, recurring-revenue per-diem economics being precisely what a roll-up strategy is built for — the join lets you test whether PE-owned hospices run longer lengths of stay, whether PE-owned home-health agencies bill more episodes per beneficiary, and whether the high-payment outliers in the utilization file are disproportionately PE-owned. That join, against the ownership graph, is the analytical heart of the worked example below.

The quality join. The same CCN links the utilization file to the CMS Care Compare quality datasets, which publish setting-specific outcome and process measures: the Home Health Care Compare star ratings and quality measures, the Hospice Care Compare and CAHPS family-survey results, and the nursing-home Care Compare measures and payroll-based staffing data. Joined on the facility key, utilization supplies the volume-and-spending dimension and Care Compare supplies the quality-and-staffing dimension, so you can ask whether the providers billing the most are delivering measurably better or worse care — whether high payment per episode buys anything, or whether the high-utilization agencies and the low-quality agencies are the same agencies. That is the question the two files can answer together that neither can answer alone.
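
A minimal sketch of the quality join follows, using plain dictionaries keyed by CCN. The CCNs, payment figures, and star ratings are invented for illustration, and `join_on_ccn` is a hypothetical helper. The point is the shape of the join: keep CCNs as strings so leading zeros survive, report the match rate, and only then compare quality across payment levels.

```python
from statistics import mean, median

# Hypothetical extracts keyed by CCN (strings, so leading zeros are preserved).
pay_per_episode = {"017001": 2100.0, "017002": 3400.0, "017003": 1800.0, "017004": 3900.0}
star_rating = {"017001": 4.0, "017002": 2.5, "017004": 2.0, "017009": 3.5}


def join_on_ccn(left: dict, right: dict):
    """Inner join on CCN, plus the left-side CCNs that found no match."""
    matched = {ccn: (left[ccn], right[ccn]) for ccn in left.keys() & right.keys()}
    unmatched_left = sorted(left.keys() - right.keys())
    return matched, unmatched_left


matched, unmatched = join_on_ccn(pay_per_episode, star_rating)
match_rate = len(matched) / len(pay_per_episode)

# Split matched providers at the median payment and compare mean ratings.
cut = median(pay for pay, _ in matched.values())
high_pay_ratings = [stars for pay, stars in matched.values() if pay > cut]
low_pay_ratings = [stars for pay, stars in matched.values() if pay <= cut]

print(f"match rate: {match_rate:.0%}, unmatched: {unmatched}")
print(f"mean rating, high payment: {mean(high_pay_ratings):.2f}")
print(f"mean rating, low payment:  {mean(low_pay_ratings):.2f}")
```

A low match rate usually signals a key-formatting problem rather than genuinely missing providers, so it is worth checking before reading anything into the comparison.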

What you can do with it

Because the file establishes utilization and payment structure rather than rendering judgment, its value is in ranking, mapping, screening, and joining. Several uses recur.

Rank providers by utilization and payment. The most direct use is to rank agencies, hospices, and facilities within a setting by total Medicare payments, by payment per episode or per stay, by episodes or days per beneficiary, and by distinct beneficiaries served — producing, for any state or nationally, a leaderboard of who delivers and bills the most post-acute care. This is the foundational layer for almost every downstream question.

Map geographic variation. Post-acute utilization varies across the country to a degree that has no clinical explanation — some states use many times more home health per beneficiary than others, and hospice length of stay varies several-fold by region. Aggregating standardized payment and utilization per beneficiary by state turns the file into a map of that variation, the empirical foundation of decades of Dartmouth-style practice-variation research.
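
One way to build the state-level figure is to sum standardized payments and beneficiaries within each state and divide the totals, instead of averaging provider-level ratios, so that large providers carry proportionate weight. The rows below are hypothetical. Note also that summing each provider's distinct-beneficiary count can double-count patients who used more than one provider, which is a limitation of this sketch.

```python
from collections import defaultdict

# Hypothetical provider rows for one setting (illustrative values only).
rows = [
    {"state": "MS", "std_payment": 1_000_000.0, "beneficiaries": 500},
    {"state": "MS", "std_payment": 500_000.0, "beneficiaries": 250},
    {"state": "NY", "std_payment": 600_000.0, "beneficiaries": 600},
    {"state": "NY", "std_payment": 400_000.0, "beneficiaries": 400},
]


def std_payment_per_beneficiary_by_state(rows: list) -> dict:
    """Sum numerator and denominator per state, then divide the totals."""
    pay = defaultdict(float)
    benes = defaultdict(int)
    for r in rows:
        pay[r["state"]] += r["std_payment"]
        benes[r["state"]] += r["beneficiaries"]
    return {s: pay[s] / benes[s] for s in pay if benes[s] > 0}


by_state = std_payment_per_beneficiary_by_state(rows)
variation_ratio = max(by_state.values()) / min(by_state.values())

print(by_state)          # {'MS': 2000.0, 'NY': 1000.0}
print(variation_ratio)   # 2.0
```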

Detect outliers for fraud screening. The file is a natural input to program-integrity screening: flag the providers whose payment per unit, episodes per beneficiary, visits per episode, hospice length of stay, or charge-to-payment ratio sits far out in the tail relative to peers in the same setting and state. Outlier status is not proof of fraud — legitimate providers serve sicker patients — but it is exactly how enforcement triage begins, and it is the analysis the worked example implements.

Analyze margins. Read against the MedPAC margin findings and, where available, against cost-report data, the per-provider payment figures let you reconstruct which providers are driving the high sector margins and how the high-margin providers differ from the rest in size, ownership, and patient mix.
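
As a sketch of the margin arithmetic, the example below takes a Medicare margin to be payments minus costs, divided by payments. The utilization file supplies only the payment side, so the cost figures here are hypothetical stand-ins for what would have to come from cost-report data. The example also shows why a sector-level margin is payment-weighted and differs from the simple average of provider margins.

```python
from statistics import mean


def medicare_margin(payments: float, costs: float) -> float:
    """Margin as a share of payments: (payments - costs) / payments."""
    return (payments - costs) / payments


def aggregate_margin(providers: list) -> float:
    """Sector-style margin: total payments vs. total costs across providers."""
    total_pay = sum(pay for pay, _ in providers)
    total_cost = sum(cost for _, cost in providers)
    return medicare_margin(total_pay, total_cost)


# (payments, costs) pairs; the costs are hypothetical, not from the utilization file.
providers = [(1_000_000.0, 800_000.0), (500_000.0, 475_000.0)]

provider_margins = [medicare_margin(pay, cost) for pay, cost in providers]
print([round(m, 3) for m in provider_margins])   # per-provider margins
print(round(aggregate_margin(providers), 3))     # payment-weighted margin
print(round(mean(provider_margins), 3))          # unweighted mean, for contrast
```

Here the larger provider's 20 percent margin pulls the aggregate to 15 percent, above the unweighted mean of 12.5 percent, which is the sense in which a few large providers can drive a sector average.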

Join to ownership to study PE behavior. The highest-value use, as above, is to join the utilization measures to the ownership flags and test whether private equity and REIT ownership is associated with measurably different billing and utilization — the question the combination of these two federal files is uniquely able to answer at national scale.

A worked example in Python

The workhorse analysis on these files is the outlier screen: compute payment per unit within each setting, flag the providers far out in the right tail relative to their in-state peers, and then join the flagged providers to the ownership file to see whether private-equity-owned providers are over-represented among the outliers. The script below pulls the home-health, hospice, and skilled-nursing Public Use Files from data.cms.gov, maps each setting's native measures onto a common frame, computes standardized payment per episode or per stay (per beneficiary for hospice), scores each provider with a robust z-score against its setting-and-state peer group, and joins the SNF rows to the all-owners file on the CCN so that the PE share among outliers can be compared with the PE share among all SNFs. The dataset UUIDs in the script are placeholders and must be filled in from the catalog before it will run.

import requests
import pandas as pd

# ---------------------------------------------------------------
# CMS data.cms.gov -- Post-Acute Care (PAC) utilization PUFs
# CMS publishes a "by Provider" Public Use File for each setting:
#   - Home Health Agency  (episodes, visits, payments)
#   - Hospice             (beneficiaries, days, payments)
#   - Skilled Nursing Fac (stays, covered days, payments)
# Catalog (Medicare provider utilization & payment data):
#   https://data.cms.gov/provider-summary-by-type-of-service
#
# The three files do not share identical columns -- a home-health
# "episode" is not a SNF "stay" -- so we read each, map its native
# measures onto a common (episodes_or_stays, days, payment) frame,
# stack them, and then screen for payment-per-unit outliers WITHIN
# each setting and state. Finally we join to the SNF all-owners file
# to see whether PE-flagged providers price differently.
#
# Resolve the current dataset UUIDs from the catalog if a request
# 404s -- CMS re-versions these files every release year.
# ---------------------------------------------------------------

# data.cms.gov dataset UUIDs, one per setting. These are placeholders:
# look up the UUID for the report year you want, since it changes per release.
DATASETS = {
    "Home Health Agency":      "REPLACE_WITH_HHA_PUF_UUID",
    "Hospice":                 "REPLACE_WITH_HOSPICE_PUF_UUID",
    "Skilled Nursing Facility": "REPLACE_WITH_SNF_PUF_UUID",
}
OWNERS_UUID = "REPLACE_WITH_CURRENT_SNF_OWNERS_UUID"

API = "https://data.cms.gov/data-api/v1/dataset/{uuid}/data"


def fetch_all(uuid: str, page_size: int = 5000) -> pd.DataFrame:
    """Page through one data.cms.gov datastore endpoint.

    The API returns JSON arrays; we walk size/offset until a short
    page signals the end. Each PUF is provider-level -- one row per
    provider (per report year), so the files are tens of thousands
    of rows, not millions.
    """
    rows: list[dict] = []
    offset = 0
    url = API.format(uuid=uuid)
    while True:
        resp = requests.get(url, params={"size": page_size, "offset": offset}, timeout=120)
        resp.raise_for_status()
        page = resp.json()
        if not page:
            break
        rows.extend(page)
        if len(page) < page_size:
            break
        offset += page_size
    return pd.DataFrame(rows)


def pick(df: pd.DataFrame, *candidates: str) -> str:
    """Return the first candidate column that exists in df."""
    for c in candidates:
        if c in df.columns:
            return c
    raise KeyError(f"none of {candidates} found in columns")


def normalize_setting(setting: str, df: pd.DataFrame) -> pd.DataFrame:
    """Map one setting's native measures onto a common frame.

    Output columns: ccn, provider, state, setting, units, days,
    payment, std_payment -- where 'units' is episodes for home
    health and stays for SNF; hospice has no per-episode unit, so we
    use distinct beneficiaries as its volume measure.
    """
    df = df.copy()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    ccn   = pick(df, "ccn", "prvdr_num", "provider_id", "rndrng_prvdr_ccn")
    name  = pick(df, "provider_name", "facility_name", "rndrng_prvdr_org_name", "org_name")
    state = pick(df, "state", "rndrng_prvdr_state_abrvtn", "prvdr_state", "state_code")
    pay   = pick(df, "total_medicare_payment_amount", "tot_mdcr_pymt_amt", "medicare_payment")
    std   = pick(df, "total_medicare_standardized_payment_amount",
                 "tot_mdcr_stdzd_pymt_amt", "standardized_payment")

    if setting == "Home Health Agency":
        units = pick(df, "total_hha_episodes", "tot_episodes", "episodes")
        days  = pick(df, "total_hha_visits", "tot_visits", "visits")
    elif setting == "Skilled Nursing Facility":
        units = pick(df, "total_snf_stays", "tot_stays", "stays")
        days  = pick(df, "total_snf_covered_days", "tot_cvrd_days", "covered_days")
    else:  # Hospice -- per-diem, so beneficiaries are the volume unit
        units = pick(df, "distinct_beneficiaries", "tot_benes", "beneficiaries")
        days  = pick(df, "total_days", "tot_days", "hospice_days")

    out = pd.DataFrame({
        "ccn":        df[ccn].astype(str).str.strip(),
        "provider":   df[name].astype(str).str.strip(),
        "state":      df[state].astype(str).str.strip().str.upper(),
        "setting":    setting,
        "units":      pd.to_numeric(df[units], errors="coerce"),
        "days":       pd.to_numeric(df[days], errors="coerce"),
        "payment":    pd.to_numeric(df[pay], errors="coerce"),
        "std_payment": pd.to_numeric(df[std], errors="coerce"),
    })
    return out


# ---------------------------------------------------------------
# Step 1: Download and stack all three settings.
# ---------------------------------------------------------------
frames = []
for setting, uuid in DATASETS.items():
    print(f"Downloading {setting} PUF...")
    frames.append(normalize_setting(setting, fetch_all(uuid)))

pac = pd.concat(frames, ignore_index=True)
print(f"Stacked provider-by-setting rows: {len(pac):,}")


# ---------------------------------------------------------------
# Step 2: Payment per unit (per episode / per stay / per
#         beneficiary). Use the STANDARDIZED payment so geographic
#         wage and policy adjustments are stripped out -- this is
#         the apples-to-apples basis for cross-provider comparison.
# ---------------------------------------------------------------
pac = pac[(pac["units"] > 0) & (pac["std_payment"] > 0)].copy()
pac["pay_per_unit"] = pac["std_payment"] / pac["units"]


# ---------------------------------------------------------------
# Step 3: Flag outliers WITHIN each (setting, state) cell, so we
#         compare home-health agencies to home-health agencies in
#         the same state -- never across settings. A robust z-score
#         on the median / MAD resists the long right tail that any
#         payment distribution carries.
# ---------------------------------------------------------------
def robust_z(s: pd.Series) -> pd.Series:
    med = s.median()
    mad = (s - med).abs().median()
    if mad == 0:
        return pd.Series(0.0, index=s.index)
    return 0.6745 * (s - med) / mad

pac["rz"] = (
    pac.groupby(["setting", "state"])["pay_per_unit"]
    .transform(robust_z)
)
# Suppress thin cells: a state with only a handful of providers in a
# setting has no stable median to score against.
cell_n = pac.groupby(["setting", "state"])["ccn"].transform("size")
pac.loc[cell_n < 10, "rz"] = float("nan")  # NaN keeps the column float, so the >= filter below stays valid

outliers = (
    pac[pac["rz"] >= 3.5]
    .sort_values(["setting", "rz"], ascending=[True, False])
)

print("\nHigh payment-per-unit outliers (robust z >= 3.5)")
print("-" * 64)
for setting in DATASETS:
    top = outliers[outliers["setting"] == setting].head(10)
    print(f"\n{setting}:")
    print(
        top[["ccn", "provider", "state", "units", "pay_per_unit", "rz"]]
        .to_string(index=False, float_format=lambda x: f"{x:,.0f}")
    )


# ---------------------------------------------------------------
# Step 4: Join SNF outliers to the all-owners file to see whether
#         private-equity-flagged providers are over-represented in
#         the high-payment tail. The CCN is the join key.
# ---------------------------------------------------------------
print("\nDownloading SNF all-owners file for the ownership join...")
owners = fetch_all(OWNERS_UUID)
owners.columns = [c.strip().lower().replace(" ", "_") for c in owners.columns]

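# NOTE: enrollment_id / associate_id are enrollment-system identifiers, not
# CCNs. If pick() falls back to one of them, the merge below will silently
# match nothing, so map those IDs to CCNs first and check the match rate.
own_ccn = pick(owners, "ccn", "prvdr_num", "enrollment_id", "associate_id")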
pe_flag = pick(owners, "type_owner_pe", "private_equity_company_owner", "pe_owner")

owners["is_pe"] = (
    owners[pe_flag].astype(str).str.strip().str.upper().isin({"Y", "YES", "TRUE", "1"})
)
pe_by_ccn = (
    owners.groupby(owners[own_ccn].astype(str).str.strip())["is_pe"].max()
    .rename("pe_owned")
)

snf = pac[pac["setting"] == "Skilled Nursing Facility"].copy()
snf = snf.merge(pe_by_ccn, left_on="ccn", right_index=True, how="left")
snf["pe_owned"] = snf["pe_owned"].fillna(False).astype(bool)  # unmatched CCNs count as not PE-owned

pe_rate_all = snf["pe_owned"].mean()
pe_rate_out = snf.loc[snf["rz"] >= 3.5, "pe_owned"].mean()

print("\nPE ownership among SNFs: baseline vs. high-payment outliers")
print("-" * 64)
print(f"  All SNFs flagged PE-owned:            {pe_rate_all:.1%}")
print(f"  High-payment-outlier SNFs PE-owned:   {pe_rate_out:.1%}")

Three steps carry the analytical weight. The first is using the standardized payment rather than the raw payment for any comparison: without standardization, the payment-per-unit ranking mostly reflects the area wage index, and high-cost urban providers would dominate the outlier list for no reason but their geography. The second is gating every comparison on the setting and screening within state-and-setting cells: a home-health agency must be compared to home-health agencies, and ideally to ones operating under the same regional cost structure, so the robust z-score is computed within each (setting, state) group and thin cells are suppressed because a state with a handful of providers has no stable median to score against. The third is the join cardinality on the ownership side: because a single facility has many owners, the all-owners file has many rows per CCN, so it must be collapsed to one PE flag per CCN before the merge or the utilization rows will fan out and every count will be wrong. The robust z-score on the median and MAD, rather than a mean-and-standard-deviation z, matters too — payment distributions have a long right tail, and a classical z-score lets the very outliers you are hunting inflate the standard deviation and hide themselves.
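
The masking effect of a classical z-score is easy to demonstrate on a toy peer group. The values below are invented: eight providers clustered near 100 and one at 1,000. The robust score uses the same median-and-MAD formula as the script, and the classical score uses the sample mean and standard deviation.

```python
from statistics import mean, median, stdev


def classical_z(values: list, x: float) -> float:
    """Mean / standard-deviation z-score of x against values."""
    return (x - mean(values)) / stdev(values)


def robust_z(values: list, x: float) -> float:
    """Median / MAD z-score, scaled by 0.6745 as in the worked example."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return 0.0
    return 0.6745 * (x - med) / mad


# Hypothetical payment-per-unit values for one (setting, state) cell.
peer_group = [100, 102, 98, 101, 99, 103, 97, 100, 1000]

print(round(classical_z(peer_group, 1000), 2))  # stays below 3
print(round(robust_z(peer_group, 1000), 1))     # far above 3.5
```

The single extreme value drags the mean to 200 and inflates the standard deviation to about 300, so its own classical z-score comes out under 3 and it would escape a three-sigma screen. Against the median of 100 and a MAD of 2, the robust score flags it immediately.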

Caveats and limits

Four limits govern any honest use of the post-acute utilization files. The first, and the one that bounds every conclusion, is that the data is Medicare fee-for-service only. These files are built from traditional Medicare claims and exclude Medicare Advantage entirely — and Medicare Advantage now covers more than half of all Medicare beneficiaries, with post-acute care being precisely the area where Advantage plans most aggressively manage utilization through prior authorization and steering. A provider's row reflects only its fee-for-service business, so a hospice or SNF with a large Advantage population looks far smaller in this file than it is, and any market-share or per-capita-utilization figure computed from these files describes the shrinking fee-for-service slice, not the whole Medicare population. This is the single most important thing to state alongside any number drawn from the data.

The second is that the data is provider-level aggregation. Every row is a provider's full year collapsed to totals and averages; there are no individual claims, no patient-level records, and no case-mix detail within the provider. That means the file can tell you a hospice's average length of stay but not its distribution, can flag an agency's high payment per episode but cannot tell you whether it is driven by a few extreme cases or a uniformly high book of business, and cannot, by itself, risk-adjust for how sick a provider's patients actually are. An outlier on aggregate measures is a hypothesis to investigate, not a verdict, precisely because the aggregation hides the patient mix that might legitimately explain it.
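A toy example makes the point about hidden distributions concrete. The stays below are invented, and the 180-day cutoff is only an illustrative threshold. The sketch shows two hypothetical hospices that would report the same average length of stay in a provider-level file while looking nothing alike at the patient level.

```python
import numpy as np

# Invented patient-level lengths of stay in days; the public file never exposes these.
hospice_a = np.array([60] * 10)              # uniformly moderate stays
hospice_b = np.array([10] * 8 + [260, 260])  # mostly short stays plus two very long ones

for name, stays in [("A", hospice_a), ("B", hospice_b)]:
    mean_los = stays.mean()               # what a provider-level row would report
    share_long = (stays > 180).mean()     # what only patient-level data can show
    print(
        f"hospice {name}: mean LOS = {mean_los:.0f} days, "
        f"share of stays over 180 days = {share_long:.0%}"
    )
```

Both hospices average 60 days, so the aggregate row cannot distinguish them; separating the two patterns takes claims-level data or a second source.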

The third is annual lag. The Public Use Files are released on an annual cadence with a substantial delay — the most recent file available typically describes a calendar year already a year or two in the past — so the data is a rear-view mirror, not a live feed, and is unsuited to detecting an emerging fraud scheme in real time. The lag also collides with the payment-system changeovers: a multi-year trend in home-health episodes that spans the 2020 PDGM transition is comparing sixty-day episodes to thirty-day episodes, and a SNF payment-per-stay trend across the 2019 PDPM transition is comparing two different case-mix systems, so any time series crossing those boundaries must be reconciled to the unit and the rules in force rather than read as a smooth series.
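One way to keep a time series from silently crossing a changeover is to label each row with the payment system in force and compute changes only within a label. The sketch below is a minimal illustration under stated assumptions: the column names `year` and `episodes`, the setting label "Home Health", the helper `payment_regime`, and the episode counts are all hypothetical. The effective dates used are 1 January 2020 for PDGM and 1 October 2019 for PDPM, which means a calendar-year 2019 SNF row mixes both systems.

```python
import pandas as pd

def payment_regime(setting: str, year: int) -> str:
    """Label the payment system in force for a setting and calendar year."""
    if setting == "Home Health":
        return "PDGM, 30-day periods" if year >= 2020 else "pre-PDGM, 60-day episodes"
    if setting == "Skilled Nursing Facility":
        if year < 2019:
            return "pre-PDPM"
        # PDPM took effect on 1 October 2019, so calendar 2019 mixes both systems.
        return "mixed 2019" if year == 2019 else "PDPM"
    return "no changeover tracked"

# Synthetic home-health counts: the 2020 jump reflects the shorter unit, not more care.
trend = pd.DataFrame({
    "setting": ["Home Health"] * 4,
    "year": [2018, 2019, 2020, 2021],
    "episodes": [100.0, 104.0, 180.0, 176.0],
})
trend = trend.sort_values(["setting", "year"]).reset_index(drop=True)
trend["regime"] = [
    payment_regime(s, y) for s, y in zip(trend["setting"], trend["year"])
]

# Year-over-year change is computed only within a (setting, regime) group,
# so the first year under a new system has no change value at all.
trend["yoy_change"] = trend.groupby(["setting", "regime"])["episodes"].pct_change()
print(trend)
```

The 2019-to-2020 comparison comes out blank rather than as a misleading jump, which leaves the reconciliation of units as an explicit step for the analyst.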

The fourth is measure suppression for low volume. To protect beneficiary privacy, CMS suppresses or redacts measures for providers with very small counts, typically when a cell would reflect fewer than eleven beneficiaries, so the smallest agencies, hospices, and facilities either drop out of the file or appear with blanked measures. That suppression is not random: it removes the smallest providers, which biases any unweighted average toward larger ones and means the file is not a complete census of the very long tail of tiny rural and start-up providers. An analysis that ignores suppression will silently under-represent exactly the small, newly enrolled providers that are often most worth watching.

With those four caveats in mind, the CMS Post-Acute Care utilization files are the authoritative, openly downloadable record of who delivers home health, hospice, and skilled nursing care inside fee-for-service Medicare, in what volume, and at what cost. Joined to ownership and quality, they are also the public instrument that makes the incentives and the consolidation reshaping post-acute care visible at the level of the individual provider.

Related writing

CMS provider ownership covers the all-owners files that disclose who owns every Medicare SNF, home health agency, and hospice, with private-equity and REIT flags — the file this utilization data joins to on the CCN to study how ownership shapes billing.

CMS hospital quality data covers the facility-level Care Compare outcome and staffing measures that share the CCN key with these utilization files and let you set spending against quality.

CMS Doctors and Clinicians covers the individual-clinician file built on the same NPI and enrollment plumbing, the physician layer beneath the facilities profiled here.