
CMS Skilled Nursing Facility Data: Star Ratings, Staffing, and the Quality Metrics Behind 15,000 Nursing Homes

May 24, 2026 · 20 min read · AI Analytics

CMS · Healthcare · Quality · Federal Data

The United States has roughly 15,000 Medicare- and Medicaid-certified skilled nursing facilities housing 1.2 million residents on any given day. CMS publishes detailed quality data on every one of them — inspection deficiencies, staffing levels, clinical quality measures, star ratings, and ownership chains — through a public system originally called Nursing Home Compare and now rebranded as CMS Care Compare. Understanding what that data measures, how it is collected, and where it breaks down is essential for anyone using it in research, regulation, or investment due diligence.

From Nursing Home Compare to Care Compare

CMS launched the Nursing Home Compare website in 1998 as a consumer-facing quality transparency tool, initially listing only basic facility characteristics and inspection results. The 2008 addition of a five-star rating system transformed it into a structured quality ranking framework. In 2020, CMS consolidated Nursing Home Compare with its hospital, home health, hospice, and dialysis quality websites into a unified portal called Care Compare at medicare.gov/care-compare. The underlying data files (the machine-readable CSVs that researchers and analysts actually use) are published separately through the CMS Provider Data Catalog at data.cms.gov/provider-data.

The SNF dataset on Care Compare covers every facility certified for Medicare Part A skilled nursing benefits, Medicaid long-term care, or both. Facilities must be certified to receive any Medicare or Medicaid reimbursement; as a practical matter, certification is nearly universal among facilities that accept patients with public insurance. Each facility is identified by its six-digit CMS Certification Number (CCN), the universal join key across all CMS provider datasets. SNF CCNs carry a state prefix (two digits corresponding to the state's CMS code) followed by a four-digit suffix in the range 5000–6499.
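Because the CCN is the join key everywhere, it is worth normalizing it before any merge. The sketch below splits a CCN into its state prefix and suffix and checks the suffix against the 5000–6499 SNF range given above. The function name `parse_ccn` and the sample CCNs are hypothetical, and the zero-padding step assumes the common case where a CSV round-trip has stripped a leading zero.

```python
from typing import NamedTuple


class ParsedCCN(NamedTuple):
    state_code: str  # first two characters: CMS state code
    suffix: str      # last four characters: facility-type sequence
    is_snf: bool     # True when the suffix falls in the SNF range


def parse_ccn(raw: str) -> ParsedCCN:
    """Split a CCN into prefix and suffix and test the SNF suffix range."""
    # Restore a leading zero that spreadsheet tools often strip.
    ccn = str(raw).strip().zfill(6)
    if len(ccn) != 6:
        raise ValueError(f"CCN must be six characters, got {raw!r}")
    prefix, suffix = ccn[:2], ccn[2:]
    # 5000-6499 is the SNF suffix range stated in the text above.
    is_snf = suffix.isdigit() and 5000 <= int(suffix) <= 6499
    return ParsedCCN(prefix, suffix, is_snf)


print(parse_ccn("55123"))   # hypothetical CCN that lost its leading zero
print(parse_ccn("100001"))  # hypothetical CCN outside the SNF range
```

Keeping the CCN as a string throughout avoids silently dropping leading zeros, which is the most common cause of failed joins across CMS files.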

The full Care Compare SNF download package comprises four primary CSV files: Provider_Info (facility demographics, overall star ratings, and aggregate staffing), Deficiencies (individual inspection citations), Quality_Measures (clinical outcome and process measure scores), and Staffing (PBJ-sourced hours per resident day by employee category). These files are refreshed quarterly following the quarterly star rating recalculation cycle. A fifth file, Ownership, was added following legislative pressure for greater ownership transparency after private equity acquisition patterns attracted scrutiny.

The five-star quality rating system

CMS introduced the five-star quality rating system for nursing homes in December 2008 following a congressional mandate in the Medicare Improvements for Patients and Providers Act. Every certified SNF receives four ratings: an overall composite star (1–5), a health inspection star, a staffing star, and a quality measure star. Each domain is rated independently; the overall star is a weighted composite that gives extra weight to health inspections.

Health inspection stars

The health inspection star is derived from the three most recent standard surveys (annual inspections) plus any complaint investigations conducted during that period, covering approximately 36 months. Each standard survey yields a total score computed by summing the scope-and-severity weights assigned to each deficiency cited during the survey. CMS maps the 12 possible scope/severity codes (A through L, on a grid of four severity levels by three scope levels) to numerical weights. A deficiency coded J (isolated, immediate jeopardy) carries a weight of 50; a deficiency coded D (isolated, no actual harm with potential for more than minimal harm) carries a weight of 4; codes A through C (potential for minimal harm only) carry no points. The total survey score is the sum of all deficiency weights for that survey cycle.

Survey scores are then combined using a weighted rolling formula: the most recent cycle's score counts for one-half, the second most recent for one-third, and the third most recent for one-sixth. The weighted total score is compared against cut points (recalibrated periodically to keep the distribution stable) to assign one through five stars. CMS sets these cut points within each state rather than nationally: roughly the top 10% of facilities in a state receive five stars, the bottom 20% receive one star, and the middle 70% is divided into three roughly equal groups receiving two, three, or four stars. Additional penalty provisions apply: any facility cited with an Immediate Jeopardy (IJ) deficiency (scope/severity codes J, K, or L) during the most recent survey automatically receives no more than two stars on health inspections, regardless of its total score.
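The rolling formula is a weighted sum of per-cycle scores. The sketch below is illustrative only: the cycle weights of one-half, one-third, and one-sixth are an assumption to verify against the current CMS Technical Users' Guide, the function name is hypothetical, and the survey scores are invented. Weights are a parameter so a different scheme can be substituted.

```python
from typing import Sequence

# Assumed cycle weights, most recent cycle first. Verify against the current
# CMS Technical Users' Guide and pass different weights if they differ.
CYCLE_WEIGHTS = (1 / 2, 1 / 3, 1 / 6)


def weighted_inspection_score(
    cycle_scores: Sequence[float],
    weights: Sequence[float] = CYCLE_WEIGHTS,
) -> float:
    """Combine per-cycle survey scores (most recent first) into one score."""
    if len(cycle_scores) != len(weights):
        raise ValueError("need exactly one score per weight")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(score * w for score, w in zip(cycle_scores, weights))


# Invented scores for three survey cycles; lower is better.
example_score = weighted_inspection_score([60, 30, 12])
print(round(example_score, 2))
```

Because the most recent cycle dominates, a single bad survey moves the weighted score far more than an equally bad survey two cycles back.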

Staffing stars

The staffing star is based on two metrics: total nurse hours per resident day (HPRD) and registered nurse (RN) hours per resident day, both drawn from the Payroll-Based Journal (PBJ) system described below. CMS computes each facility's case-mix adjusted expected staffing level using the MDS-derived acuity of its resident population. Facilities are then rated on the difference between actual staffing and expected staffing. An adjusted RN HPRD at or above the 80th percentile nationally earns five stars; below the 20th percentile earns one star. The two metrics — total nurse HPRD and RN HPRD — are each rated separately, and the staffing star is the lower of the two ratings, preventing a high total nurse HPRD driven entirely by CNAs from masking dangerously low RN coverage.

A critical threshold: any facility that reports zero RN hours on any day during the quarter (meaning no registered nurse was present that day) is automatically capped at two staffing stars. Federal regulation (42 CFR §483.35) requires that every SNF have an RN on duty at least eight consecutive hours per day, seven days a week. Zero-RN days indicate a regulatory violation and are treated as a hard ceiling on the staffing star regardless of aggregate hourly totals.
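The two rules in this section (take the lower of the two component ratings, then cap when a zero-RN day appears) can be sketched as follows. This encodes the rules only as summarized here; the function names are hypothetical, and CMS's Technical Users' Guide remains the authority on the exact staffing methodology.

```python
from typing import Iterable


def count_zero_rn_days(daily_rn_hours: Iterable[float]) -> int:
    """Count days in the quarter with no RN hours reported."""
    return sum(1 for hours in daily_rn_hours if hours == 0)


def staffing_star(total_nurse_star: int, rn_star: int, zero_rn_days: int) -> int:
    """Staffing star as summarized in this section (not the official spec)."""
    # The lower of the two component ratings wins, so strong CNA staffing
    # cannot hide weak RN coverage.
    star = min(total_nurse_star, rn_star)
    # Any zero-RN day caps the rating at two stars.
    if zero_rn_days > 0:
        star = min(star, 2)
    return star


rn_hours = [8.0, 0.0, 12.0, 16.0]  # invented daily RN hours
print(staffing_star(5, 4, count_zero_rn_days(rn_hours)))
```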

Quality measure stars

The quality measure star is derived from a weighted composite of clinical quality measure scores drawn from MDS assessments. CMS selects measures that span both long-stay residents (those in the facility more than 100 days, typically long-term care) and short-stay residents (post-acute skilled nursing admissions following hospitalization). Within each category, measures are equally weighted to compute a composite score, which is then ranked against national thresholds to assign one through five stars. Beginning in 2016, CMS added claims-based quality measures — 30-day rehospitalization and community discharge rates — derived from Medicare claims rather than MDS, providing an independent verification source less susceptible to MDS coding manipulation.

Overall composite star

The overall star is not a simple average. The algorithm gives primary weight to the health inspection star and then adjusts upward or downward based on staffing and quality. Specifically: start with the health inspection star; add one star if both the staffing star and quality measure star are four or five stars; subtract one star if either is one star. The result is then capped at five and floored at one. Any facility that received a Special Focus Facility (SFF) designation — imposed on the worst-performing facilities with a pattern of serious deficiencies — is automatically capped at two overall stars. This weighting scheme means health inspections are the dominant determinant of the overall rating; staffing and quality can only shift the composite by one star in either direction.
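The composite rule can be written as a small function. This is a sketch of the algorithm exactly as summarized in the paragraph above, with a hypothetical function name; CMS's Technical Users' Guide is the authoritative specification, and its step-by-step rules are more detailed than this summary.

```python
def overall_star(inspection: int, staffing: int, quality: int,
                 is_sff: bool = False) -> int:
    """Overall star as summarized above (not the official specification)."""
    star = inspection                    # health inspection is the base
    if staffing >= 4 and quality >= 4:   # both strong: bump up one star
        star += 1
    if staffing == 1 or quality == 1:    # either at one star: drop one star
        star -= 1
    star = max(1, min(5, star))          # keep the result within 1..5
    if is_sff:                           # Special Focus Facility cap
        star = min(star, 2)
    return star


print(overall_star(inspection=3, staffing=4, quality=5))
print(overall_star(inspection=3, staffing=1, quality=5))
```

Note that the two adjustments are mutually exclusive under this summary: a facility cannot have both domains at four or higher and one of them at one star.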

Health inspection data: surveys, complaints, and the deficiency grid

The CMS inspection system distinguishes between two survey types. Standard surveys are annual comprehensive inspections conducted by state survey agencies under contract to CMS. Survey teams — typically four to six inspectors, including a registered nurse — spend two to three days on-site reviewing clinical records, observing care, and interviewing residents and staff. They evaluate compliance with the hundreds of requirements in 42 CFR Part 483 Subpart B (Requirements for Long-Term Care Facilities). Complaint investigations are triggered by complaints filed by residents, families, or staff and may occur at any time between annual surveys. Complaint investigations that find deficiencies are included in the health inspection star calculation.

Every deficiency cited is assigned a scope/severity code, lettered A through L, on a grid of three scope levels by four severity levels. Within each severity level the letters run across scope, so the scope dimension is: Isolated (A, D, G, J: affecting one or very few residents), Pattern (B, E, H, K: affecting multiple residents or reflecting a repeated practice), and Widespread (C, F, I, L: broadly affecting residents or with potential to affect all). The severity dimension has four levels: Potential for Minimal Harm (A–C), No Actual Harm with Potential for More Than Minimal Harm (D–F), Actual Harm (G–I), and Immediate Jeopardy (J–L). A deficiency code of “F” means widespread scope with no actual harm but potential for more than minimal harm; a code of “K” means pattern scope with immediate jeopardy; a code of “L”, widespread immediate jeopardy, is the most severe cell on the grid.
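Because the letters fill the grid row by row (severity increases every three letters, and scope cycles isolated, pattern, widespread within each severity level), a code can be decoded arithmetically. The sketch below assumes that standard layout; the function names are hypothetical.

```python
SCOPE_LEVELS = ("isolated", "pattern", "widespread")
SEVERITY_LEVELS = (
    "potential for minimal harm",
    "no actual harm, potential for more than minimal harm",
    "actual harm",
    "immediate jeopardy",
)


def decode_scope_severity(code: str) -> tuple[str, str]:
    """Return (severity, scope) for a scope/severity letter A through L."""
    letter = code.strip().upper()
    if len(letter) != 1 or not "A" <= letter <= "L":
        raise ValueError(f"expected a single letter A-L, got {code!r}")
    idx = ord(letter) - ord("A")
    # Severity advances every three letters; scope cycles within each group.
    return SEVERITY_LEVELS[idx // 3], SCOPE_LEVELS[idx % 3]


def is_immediate_jeopardy(code: str) -> bool:
    """True for codes J, K, and L."""
    return decode_scope_severity(code)[0] == "immediate jeopardy"


for sample in ("A", "F", "J", "K", "L"):
    print(sample, decode_scope_severity(sample))
```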

Deficiencies are organized by regulatory tag (F-tag) within defined categories corresponding to the CFR subparts. The categories most frequently cited include resident rights (F550–F585), quality of care (F684–F700), infection control (F880–F884), environment (F800–F812), and administration (F835–F850). The Deficiencies CSV in the Care Compare download contains one row per deficiency per survey, with fields for the CCN, survey date, deficiency category, F-tag number, scope/severity code, corrected date, and whether a civil money penalty was imposed. Civil money penalties (CMPs) are financial sanctions assessed for serious deficiencies. Per-day CMPs range from $50 to $10,000 per day for ongoing deficiencies; per-instance CMPs range from $1,000 to $100,000. CMP amounts are also published in the Deficiencies file.

Staffing data and the Payroll-Based Journal

Before 2016, nursing home staffing data submitted to CMS was entirely self-reported: facilities filled out staffing worksheets based on their own records, with no verification mechanism. The resulting data was widely distrusted by researchers who found suspiciously round numbers and implausible staffing levels. Section 6106 of the Affordable Care Act directed CMS to collect staffing data through an electronic payroll system, and in 2016 CMS implemented the Payroll-Based Journal (PBJ) system, which requires all Medicare- and Medicaid-certified SNFs to submit quarterly payroll data directly from their payroll and scheduling systems to CMS.

PBJ data captures hours worked by employee type, aggregated by day. The employee categories include Certified Nurse Aides (CNAs), Licensed Practical Nurses (LPNs), Registered Nurses (RNs), and various therapy and administrative staff. From the daily hour totals and daily census data, CMS computes the key staffing metrics used in the staffing star rating:

Metric	PBJ Field	Typical Range
CNA hours per resident day	AIDHRD	1.5 – 3.0
LPN hours per resident day	LPNHRD	0.8 – 1.8
RN hours per resident day	RNHRD	0.3 – 0.9
Total licensed nurse HPRD (RN + LPN)	TOTLICHRD	1.2 – 2.6
Total nurse HPRD (all nursing staff)	TOTHRD	2.8 – 5.0
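As a rough illustration of how these fields relate, hours per resident day can be computed as total hours in the period divided by total resident days (the sum of the daily census). The sketch below uses invented daily figures and hypothetical input column names; only the output field names come from the table above, and CMS's published values may involve additional exclusions and adjustments.

```python
import pandas as pd

# Invented daily PBJ-style records for one facility (hypothetical columns).
daily = pd.DataFrame({
    "work_date": pd.date_range("2024-01-01", periods=4, freq="D"),
    "census":    [100, 102, 98, 100],
    "rn_hours":  [40.0, 0.0, 55.0, 45.0],
    "lpn_hours": [90.0, 85.0, 80.0, 95.0],
    "cna_hours": [220.0, 210.0, 205.0, 215.0],
})

# Denominator: resident days over the period.
resident_days = daily["census"].sum()

hprd = {
    "RNHRD":  daily["rn_hours"].sum() / resident_days,
    "LPNHRD": daily["lpn_hours"].sum() / resident_days,
    "AIDHRD": daily["cna_hours"].sum() / resident_days,
}
hprd["TOTLICHRD"] = hprd["RNHRD"] + hprd["LPNHRD"]   # licensed nurses only
hprd["TOTHRD"] = hprd["TOTLICHRD"] + hprd["AIDHRD"]  # all nursing staff

# Days with no RN hours at all, which the aggregate ratio would hide.
zero_rn_days = int((daily["rn_hours"] == 0).sum())

print({name: round(value, 3) for name, value in hprd.items()})
print("zero-RN days:", zero_rn_days)
```

The example shows why daily granularity matters: the period-level RN figure looks unremarkable even though one day had no RN coverage.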

The transition to PBJ dramatically changed the staffing picture. Self-reported staffing had averaged approximately 4.1 total nurse HPRD nationally. PBJ data consistently shows lower actual staffing: approximately 3.5–3.8 total nurse HPRD. The gap is largest at the bottom of the distribution, where facilities previously reporting 3.0 HPRD often revealed PBJ figures below 2.5. CMS finalized a minimum staffing rule in 2024 establishing a floor of 0.55 RN HPRD and 2.45 CNA HPRD; as of 2026 that rule is still subject to ongoing litigation and phase-in timelines.

Quality measures: MDS and claims-based metrics

The Minimum Data Set (MDS) is a standardized resident assessment instrument mandated by OBRA 1987. Every certified nursing home must complete a comprehensive MDS assessment within 14 days of admission, at quarterly intervals thereafter, following a significant change in clinical status, and at annual review. The MDS 3.0 (effective since 2010) captures roughly 400 data elements across 20 sections: demographics, cognitive function (Section C), behavioral symptoms (Section E), mood (PHQ-9 embedded in Section D), functional status and ADLs (Section G), continence (Section H), active diagnoses (Section I), health conditions (Section J), swallowing and nutritional status (Section K), oral/dental status (Section L), skin conditions and wounds (Section M), medications (Section N), special treatments (Section O), discharge planning (Section Q), restraints (Section P), assessments and care area triggers (Section V), and correction requests (Section X).

MDS data flows through a secure electronic submission system (QIES ASAP) to CMS, which uses it to compute quality measures (QMs) for public reporting and to determine reimbursement under the Patient-Driven Payment Model. The quality measures fall into long-stay and short-stay categories:

Long-stay quality measures

Long-stay residents are those who have been in the facility for more than 100 days at the time of the assessment. Long-stay QMs assess the chronic care environment. Key measures include: the percentage of residents experiencing one or more falls with major injury; percentage with pressure ulcers (Stage 2 or higher); percentage receiving antipsychotic medications (a measure of chemical restraint practices); percentage with a urinary tract infection; percentage whose ability to perform ADLs (Activities of Daily Living — bed mobility, transfers, locomotion, dressing, eating, toilet use) worsened during the quarter; percentage who were physically restrained; percentage with depressive symptoms; and percentage with loss of bowel and bladder control. Lower rates are better on harm measures; higher rates are better on function-maintenance measures. Each measure has a risk-adjustment specification that excludes residents for whom the outcome is not clinically relevant or who are in hospice.

Short-stay quality measures

Short-stay residents are post-acute admissions: typically patients recovering from hip fractures, cardiac events, joint replacements, or other hospitalizations who receive skilled nursing and therapy under Medicare Part A before discharging to the community. Short-stay QMs are designed for post-acute performance measurement and include: percentage with new or worsening pressure ulcers; percentage who successfully returned to the community after a short stay; percentage who improved in function (walking, ADLs); percentage with new antipsychotic medication; and, critically, the 30-day all-cause rehospitalization rate and the 30-day potentially preventable rehospitalization rate. The rehospitalization measures are claims-based rather than MDS-based: CMS computes them by linking Medicare claims to the MDS admission assessment using beneficiary IDs, enabling cross-verification independent of the facility's own assessments.
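The linkage logic behind a claims-based rehospitalization rate can be sketched with two small tables joined on beneficiary ID. Everything here is an illustrative assumption: the column names, the invented records, and the choice to count a hospital admission 1 to 30 days after SNF admission. The official measure specification adds exclusions and risk adjustment that this sketch omits.

```python
import pandas as pd

# Invented SNF admissions (standing in for MDS admission assessments).
snf_stays = pd.DataFrame({
    "bene_id": ["a", "b", "c", "d"],
    "snf_admit": pd.to_datetime(
        ["2024-01-05", "2024-01-10", "2024-02-01", "2024-02-15"]
    ),
})

# Invented hospital admissions (standing in for Medicare inpatient claims).
hospital_claims = pd.DataFrame({
    "bene_id": ["a", "b", "d"],
    "hosp_admit": pd.to_datetime(["2024-01-20", "2024-03-30", "2024-02-10"]),
})

# Left join keeps SNF stays that have no hospital claim at all.
pairs = snf_stays.merge(hospital_claims, on="bene_id", how="left")
days_to_hosp = (pairs["hosp_admit"] - pairs["snf_admit"]).dt.days

# Assumed window: hospital admission 1-30 days after SNF admission.
# Missing claims produce NaN, which between() treats as False.
pairs["rehosp_30d"] = days_to_hosp.between(1, 30)

# Collapse to one flag per stay in case a stay matches several claims.
per_stay = pairs.groupby(["bene_id", "snf_admit"])["rehosp_30d"].any()
rehosp_rate = per_stay.mean()
print(f"30-day rehospitalization rate: {rehosp_rate:.2%}")
```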

COVID-19 and the nursing home crisis

Nursing homes were the epicenter of the COVID-19 pandemic in the United States. By the end of 2020, nursing home residents accounted for an estimated 38% of all US COVID-19 deaths despite comprising less than 0.5% of the population — approximately 170,000 deaths in nursing home settings in the pandemic's first year alone. The catastrophic toll reflected structural vulnerabilities: congregate settings with high-acuity frail residents, chronic staffing shortages, pre-pandemic infection control deficiencies, and supply chain failures in personal protective equipment.

In response, CMS imposed a new federal data collection requirement: beginning May 2020, all certified nursing homes must report COVID-19 case and death data weekly through the National Healthcare Safety Network (NHSN) at the CDC. Weekly NHSN reporting captures confirmed and suspected COVID-19 resident and staff cases, resident and staff deaths attributable to COVID-19, PPE supply levels, vaccination rates (staff and residents), and antigen testing activity. Failure to report results in citations and CMPs. The NHSN nursing home data has been used extensively in research on vaccine effectiveness in long-term care settings and on the relationship between pre-pandemic quality ratings and COVID mortality outcomes. Notably, pre-pandemic health inspection star ratings showed a modest but statistically significant inverse association with COVID mortality rates, consistent with the hypothesis that stronger infection control practices (which are evaluated during standard surveys) reduced transmission.

CMS also conducted a targeted round of infection control inspections in 2020 focused exclusively on COVID preparedness. These targeted infection control surveys were not full standard surveys and their deficiencies were not incorporated into the health inspection star rating, but they are publicly available in the Deficiencies file with a distinct survey type code. Approximately 80% of surveyed facilities received at least one infection control citation in these targeted surveys, a finding that revealed how widespread pre-existing infection control failures were across the industry.

Ownership transparency and private equity

Nursing home ownership is historically opaque. A single facility may be legally owned by a limited liability company, operated under a separate management contract by another entity, leased from a real estate investment trust (REIT), and financed through a private equity holding structure. CMS collects ownership disclosure through the Provider Enrollment, Chain, and Ownership System (PECOS), which requires facilities to disclose 5%-or-greater direct ownership interests. The ownership data published in the Care Compare Ownership CSV reflects PECOS disclosures and includes direct owners, officers, and managing employees.

The PECOS disclosure requirement has a well-documented limitation: beneficial ownership — the ultimate human beings who own the controlling interest through chains of legal entities — is not consistently captured. A private equity fund that owns a nursing home through four layers of Delaware LLCs is required to disclose the intermediate entities but is not required to trace to the fund's limited partners or general partner in a way that reveals who ultimately benefits. Legislative proposals to require disclosure to the ultimate beneficial owner have been debated repeatedly but not enacted as of 2026.

Research using hand-matched ownership data — linking PECOS disclosures, REIT holdings, and private equity portfolio lists — has consistently found that private equity-owned nursing homes have measurably worse outcomes on multiple dimensions: studies find 5–10% higher short-term mortality, higher deficiency counts, lower staffing levels, and higher rates of antipsychotic use compared to non-private-equity facilities, after controlling for patient acuity, location, and facility size. The 2023 Government Accountability Office report on private equity in nursing homes found that PE-owned facilities had average deficiency scores 20% higher than the national average and staffing levels approximately 8% below comparably sized non-PE facilities. The CMS Ownership file provides a starting point for ownership classification analysis, though it must be supplemented with third-party PE portfolio databases for complete identification.
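A common first pass over the Ownership file is to find owners that appear across multiple facilities, which surfaces candidate chains and holding entities for manual review. The sketch below uses invented records and hypothetical column names (`OWNER_NAME`, `ROLE`) and a crude name normalization; real ownership matching needs fuzzier entity resolution and third-party portfolio data.

```python
import pandas as pd

# Invented ownership rows: one row per owner per facility.
ownership = pd.DataFrame({
    "PROVNUM": ["015001", "015001", "015002", "105003", "105003"],
    "OWNER_NAME": [
        "Alpha Holdings LLC", "Jane Doe", "ALPHA HOLDINGS, LLC",
        "Alpha Holdings LLC", "Beta Care Inc",
    ],
    "ROLE": ["5%+ owner", "officer", "5%+ owner", "5%+ owner", "5%+ owner"],
})

# Keep ownership interests only; officers and managers are not owners.
owners = ownership[ownership["ROLE"] == "5%+ owner"].copy()

# Crude normalization so punctuation and case variants collapse together.
owners["owner_key"] = (
    owners["OWNER_NAME"]
    .str.upper()
    .str.replace(r"[^A-Z0-9 ]", "", regex=True)
    .str.strip()
)

# Count distinct facilities per normalized owner name.
portfolio = (
    owners.groupby("owner_key")["PROVNUM"]
    .nunique()
    .rename("n_facilities")
    .sort_values(ascending=False)
)
multi_facility_owners = portfolio[portfolio >= 2]
print(multi_facility_owners)
```

Owners flagged this way are only candidates: the same holding company can appear under several unrelated-looking LLC names, which this string matching will not catch.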

MDS and the reimbursement connection

The MDS is simultaneously a clinical assessment tool, a quality measurement instrument, and a billing document. Under the Patient-Driven Payment Model (PDPM), effective October 1, 2019 and replacing the Resource Utilization Group (RUG-IV) system, Medicare Part A payment for each short-stay resident is determined by five concurrent case-mix components derived from MDS assessments:

PDPM Component	Primary MDS Driver	Payment Weight
Physical Therapy (PT)	Functional status (Section GG), primary diagnosis	~17%
Occupational Therapy (OT)	Functional status, ADL scoring	~16%
Speech-Language Pathology (SLP)	Cognitive function, swallowing, nutritional status	~5%
Nursing	Active diagnoses, comorbidities, treatments	~40%
Non-Therapy Ancillary (NTA)	High-cost medications, IV therapies, wound care	~22%

The PDPM replaced RUG-IV specifically to decouple therapy payment from the volume of therapy minutes delivered. Under RUG-IV, facilities were financially incentivized to provide high volumes of physical and occupational therapy regardless of clinical need because therapy minutes directly determined the RUG category and payment rate. The resulting “ultra-high” therapy RUG gaming was extensively documented in Office of Inspector General reports showing therapy minutes concentrated just above RUG thresholds. PDPM pays based on patient characteristics at admission (primary diagnosis, functional status, comorbidities) rather than on services delivered, shifting incentives toward appropriate rather than maximal therapy utilization.

The reimbursement connection creates a structural tension in MDS data quality. Because MDS coding determines payment — particularly the NTA component's dependence on diagnosis coding — facilities have a financial incentive to code diagnoses and conditions as thoroughly as possible. CMS has documented PDPM upcoding in several post-implementation analyses: the proportion of residents coded with major diagnoses that increase NTA scores rose sharply after October 2019. The MDS quality measures use risk-adjustment specifications designed to account for legitimate variation in acuity, but upcoded diagnoses can systematically bias both quality measures and staffing-adjusted comparisons.

SNF star ratings and market competition

Star ratings affect nursing home market dynamics in ways that extend beyond consumer choice. Managed care organizations — including Medicare Advantage plans, which have grown to cover more than half of Medicare beneficiaries as of 2026 — use CMS star ratings as a primary criterion for preferred provider network inclusion. Facilities with four or five overall stars are eligible for preferred SNF networks that route post-acute volume from hospital discharge planners. A fall from three stars to two stars can result in removal from preferred networks, with direct revenue consequences that typically exceed any remediation cost savings.

Accountable Care Organizations and bundled payment participants (BPCI Advanced) use star ratings similarly, selecting preferred SNF partners based on quality and cost efficiency metrics. The five-star system thus functions as a gatekeeping mechanism for managed care volume, not merely as a consumer information tool. Facilities in competitive urban markets with multiple high-rated alternatives face real revenue exposure from rating declines; rural facilities with monopoly positions face less competitive pressure from ratings, which critics argue partially explains persistent quality variation in rural nursing home markets.

Data access: CMS Care Compare download files

The CMS Provider Data Catalog at data.cms.gov/provider-data publishes the full SNF dataset quarterly. The primary files for analytical use are:

Provider_Info.csv — one row per facility; columns include CCN (PROVNUM), facility name, address, ownership type, certification date, bed count, overall and domain star ratings, aggregate PBJ staffing HPRD fields, inspection date, SFF status, and abuse indicator flags.
Deficiencies.csv — one row per deficiency per survey; joins to Provider_Info on PROVNUM. Key fields: survey date, survey type (standard vs. complaint vs. targeted), deficiency category, F-tag number, scope/severity code, corrected date, and civil money penalty amount.
Quality_Measures.csv — one row per measure per facility per quarter; joins on PROVNUM. Contains the score (percentage or rate), the number of residents in the denominator, and a flag indicating whether the score is suppressed for small sample sizes.
Staffing.csv — PBJ-sourced quarterly HPRD by employee category; one row per facility per quarter. More granular than the aggregated staffing fields in Provider_Info.
Ownership.csv — PECOS ownership disclosures; one row per owner per facility. Fields include owner name, ownership percentage, ownership type (individual vs. organization), role (officer, managing employee, 5%+ owner), and whether the owner is a chain organization.

All files join on the six-digit PROVNUM (CCN) field. CMS recalibrates star rating thresholds quarterly to maintain a stable national distribution; the thresholds used for each quarter are published in a separate Thresholds file. Analytical work that compares facilities across quarters must account for threshold changes, since the same numeric score may yield different stars in different quarters.
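The effect of threshold recalibration is easy to demonstrate: the same score maps to different stars under different cut points. The cut points below are invented placeholders rather than CMS values, and the function name is hypothetical. The sketch assumes a lower-is-better score, as with health inspection scores.

```python
from bisect import bisect_left
from typing import Sequence


def stars_from_score(score: float, cut_points: Sequence[float]) -> int:
    """Map a lower-is-better score to 1-5 stars.

    cut_points holds four ascending upper bounds for 5, 4, 3, and 2 stars;
    anything above the last bound receives one star.
    """
    if len(cut_points) != 4 or list(cut_points) != sorted(cut_points):
        raise ValueError("expected four ascending cut points")
    return 5 - bisect_left(cut_points, score)


# Invented thresholds for two quarters (placeholders, not CMS values).
Q1_CUTS = [10, 25, 45, 70]
Q2_CUTS = [8, 20, 40, 65]

score = 22
print("Q1 stars:", stars_from_score(score, Q1_CUTS))
print("Q2 stars:", stars_from_score(score, Q2_CUTS))
```

For cross-quarter comparisons it is usually safer to analyze the underlying numeric scores and treat the star as a quarter-specific label.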

Python: download and analyze CMS SNF data

The following script downloads the Provider_Info and Deficiencies CSV files from the CMS Care Compare download center, computes average star ratings by state, flags facilities with a 1-star overall rating, and shows deficiency rates by star rating tier. Adjust the file URLs to the current quarterly release from data.cms.gov/provider-data.

import io
import zipfile

import pandas as pd
import requests

# -----------------------------------------------------------------------
# Download CMS Care Compare Skilled Nursing Facility CSV files
# Provider_Info: facility characteristics and star ratings
# Deficiencies: inspection citations with scope/severity codes
# All files available from the CMS Provider Data Catalog:
#   https://data.cms.gov/provider-data/dataset/4pq5-n9py (Provider_Info)
#   https://data.cms.gov/provider-data/dataset/r4ts-mnhm (Deficiencies)
# -----------------------------------------------------------------------

PROVIDER_URL = (
    "https://data.cms.gov/provider-data/sites/default/files/resources/"
    "Provider_Info_jun2024.zip"
)
DEFICIENCY_URL = (
    "https://data.cms.gov/provider-data/sites/default/files/resources/"
    "Deficiencies_jun2024.zip"
)

def load_zip_csv(url: str) -> pd.DataFrame:
    print(f"Downloading {url.split('/')[-1]} ...")
    resp = requests.get(url, timeout=300)
    resp.raise_for_status()
    print(f"  Downloaded {len(resp.content) / 1e6:.1f} MB")
    with zipfile.ZipFile(io.BytesIO(resp.content)) as z:
        csv_name = next(n for n in z.namelist() if n.endswith(".csv"))
        with z.open(csv_name) as f:
            return pd.read_csv(f, dtype=str, low_memory=False)

df_prov = load_zip_csv(PROVIDER_URL)
df_def  = load_zip_csv(DEFICIENCY_URL)

print(f"Provider_Info rows: {len(df_prov):,}, columns: {len(df_prov.columns)}")
print(f"Deficiencies rows:  {len(df_def):,}, columns: {len(df_def.columns)}")

# -----------------------------------------------------------------------
# Normalize and coerce star rating columns
# Column names assumed here (they can vary between releases; confirm
# against the data dictionary published with the file you download):
#   PROVNUM            -- CMS Certification Number (CCN), 6 chars
#   PROVNAME           -- facility name
#   STATE              -- 2-letter state abbreviation
#   OVERALL_RATING     -- 1-5 stars overall composite
#   SURVEY_RATING      -- 1-5 stars health inspections
#   STAFFING_RATING    -- 1-5 stars staffing
#   QUALITY_RATING     -- 1-5 stars quality measures
#   CYCLE_1_TOTAL_SCORE -- most recent standard survey score (lower = better)
#   AIDHRD             -- CNA hours per resident day (PBJ)
#   RNHRD              -- RN hours per resident day (PBJ)
#   TOTLICHRD          -- total licensed nurse HPRD (RN + LPN)
#   TOTHRD             -- total nurse HPRD including CNAs
# -----------------------------------------------------------------------

star_cols = ["OVERALL_RATING", "SURVEY_RATING", "STAFFING_RATING", "QUALITY_RATING"]
hprd_cols = ["AIDHRD", "RNHRD", "TOTLICHRD", "TOTHRD"]

for col in star_cols + hprd_cols:
    if col in df_prov.columns:
        df_prov[col] = pd.to_numeric(df_prov[col], errors="coerce")

df_prov["PROVNUM"] = df_prov["PROVNUM"].astype(str).str.strip().str.zfill(6)

print(f"\nTotal SNF facilities: {len(df_prov):,}")
print(f"Facilities with overall star rating: {df_prov['OVERALL_RATING'].notna().sum():,}")

# -----------------------------------------------------------------------
# Average star ratings by state
# -----------------------------------------------------------------------

# Named aggregation builds every state-level column in a single pass.
# Percentages use all facilities in the state as the denominator, so
# unrated facilities count as neither 1-star nor 5-star.
state_summary = (
    df_prov.groupby("STATE")
    .agg(
        facilities=("PROVNUM", "size"),
        avg_overall=("OVERALL_RATING", "mean"),
        avg_survey=("SURVEY_RATING", "mean"),
        avg_staffing=("STAFFING_RATING", "mean"),
        avg_quality=("QUALITY_RATING", "mean"),
        avg_rn_hprd=("RNHRD", "mean"),
        avg_cna_hprd=("AIDHRD", "mean"),
        pct_1star=("OVERALL_RATING", lambda s: (s == 1).mean() * 100),
        pct_5star=("OVERALL_RATING", lambda s: (s == 5).mean() * 100),
    )
    .round(2)
    .reset_index()
    .sort_values("avg_overall", ascending=False)
)

print("\n=== Average SNF star ratings by state (top 10) ===")
print(state_summary.head(10).to_string(index=False))
print("\n=== Lowest-rated states (bottom 10) ===")
print(state_summary.tail(10).to_string(index=False))

# -----------------------------------------------------------------------
# Flag 1-star (overall) nursing homes
# -----------------------------------------------------------------------

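one_star = df_prov[df_prov["OVERALL_RATING"] == 1].copy()
# Use rated facilities as the denominator so the percentage matches its label.
n_rated = df_prov["OVERALL_RATING"].notna().sum()
print(f"\nFacilities with 1-star overall rating: {len(one_star):,}")
print(f"  ({len(one_star) / n_rated * 100:.1f}% of all rated facilities)")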

cols_show = [
    c for c in [
        "PROVNUM", "PROVNAME", "STATE",
        "OVERALL_RATING", "SURVEY_RATING", "STAFFING_RATING", "QUALITY_RATING",
        "RNHRD", "AIDHRD", "CYCLE_1_TOTAL_SCORE",
    ]
    if c in df_prov.columns
]
print("\nSample 1-star facilities:")
print(one_star[cols_show].head(15).to_string(index=False))

# -----------------------------------------------------------------------
# Deficiency analysis: citations by scope/severity
# Codes run A-L on a grid of four severity levels by three scope levels.
# SCOPE: A, D, G, J = isolated; B, E, H, K = pattern; C, F, I, L = widespread
# SEVERITY: A-C = potential for minimal harm;
#           D-F = no actual harm, potential for more than minimal harm;
#           G-I = actual harm; J-L = immediate jeopardy (IJ)
# -----------------------------------------------------------------------

df_def["PROVNUM"] = df_def["PROVNUM"].astype(str).str.strip().str.zfill(6)

# Citation counts per facility
cit_per_facility = (
    df_def.groupby("PROVNUM")
    .size()
    .reset_index(name="total_citations")
)

# Immediate Jeopardy citations (scope/severity J, K, or L)
ij_col = "SCOPE_SEVERITY_CODE" if "SCOPE_SEVERITY_CODE" in df_def.columns else "SEVERITY"
if ij_col in df_def.columns:
    ij_citations = (
        df_def[df_def[ij_col].isin(["J", "K", "L"])]
        .groupby("PROVNUM")
        .size()
        .reset_index(name="ij_citations")
    )
else:
    ij_citations = pd.DataFrame(columns=["PROVNUM", "ij_citations"])

# Join citation counts to provider info
df_merged = (
    df_prov
    .merge(cit_per_facility, on="PROVNUM", how="left")
    .merge(ij_citations, on="PROVNUM", how="left")
)
df_merged["total_citations"] = df_merged["total_citations"].fillna(0).astype(int)
df_merged["ij_citations"] = df_merged["ij_citations"].fillna(0).astype(int)

# Deficiency rate by star rating bucket
deficiency_summary = (
    df_merged.groupby("OVERALL_RATING")
    .apply(lambda g: pd.Series({
        "n_facilities":       len(g),
        "avg_citations":      round(g["total_citations"].mean(), 1),
        "pct_with_ij":        round((g["ij_citations"] > 0).mean() * 100, 1),
        "avg_rn_hprd":        round(g["RNHRD"].mean(), 2),
    }), include_groups=False)
    .reset_index()
)

print("\n=== Deficiency rates and staffing by overall star rating ===")
print(deficiency_summary.to_string(index=False))

# Save results
df_merged[cols_show + ["total_citations", "ij_citations"]].to_csv(
    "snf_star_analysis.csv", index=False
)
print("\nSaved to snf_star_analysis.csv")

The output includes a state-level summary with average star ratings, RN hours per resident day, the percentage of facilities with 1-star and 5-star ratings, and — when joined to the Deficiencies file — average citation counts and the percentage of facilities with at least one Immediate Jeopardy deficiency stratified by overall star rating. The resulting CSV joins cleanly to Medicare claims data on the CCN field, enabling linkage to hospitalization and spending outcomes.

For the broader hospital financial dataset that includes SNF cost reports under Form CMS-1728-94 alongside acute care cost reporting: CMS Medicare Cost Reports: The Annual Financial Disclosure Behind Every US Hospital →

For HHS Medicaid enrollment data, which drives a large share of nursing home long-term care revenue alongside Medicare Part A skilled nursing benefits: HHS Medicaid Enrollment Data: Coverage, Spending, and the Federal-State Partnership →

For occupational injury and illness data in healthcare and long-term care settings, where nursing home workers have injury rates among the highest of any US industry: BLS Occupational Injuries and Illnesses: The Annual Survey Behind Workplace Safety Data →