CMS Nursing Home Compare: The Federal Database Behind Quality Ratings for 14,700 US Nursing Homes

AI Analytics

On any given day, approximately 1.35 million Americans live in one of the 14,700 nursing homes certified by Medicare and Medicaid — facilities that cost private-pay residents between $90,000 and $105,000 per year for a semi-private room. The Centers for Medicare & Medicaid Services publishes detailed, publicly accessible quality data on every one of those facilities through Nursing Home Compare, now integrated into the Care Compare portal at medicare.gov/care-compare. The underlying datasets, downloadable at no charge from data.cms.gov, include star ratings, health inspection deficiency records down to the individual F-tag level, actual payroll-derived staffing hours, fifteen quality measures derived from clinical assessments of every resident, and flags identifying facilities with substantiated abuse complaints or persistent systemic failures. Almost no one outside of the long-term care research community uses them directly.

This article covers the scope and institutional background of the CMS nursing home certification program, the Five-Star Quality Rating System introduced in 2008 and its three component domains, the Health Inspection survey process and the F-tag deficiency citation system with its scope-and-severity matrix, the Special Focus Facility designation for chronically underperforming homes, the Payroll-Based Journal staffing data system that replaced self-reported staffing figures, the Minimum Data Set clinical assessments that underlie the Quality Measures domain, the abuse and neglect flag fields in the public dataset, ownership transparency and the academic literature on private equity ownership effects, the full data access landscape at data.cms.gov, and a Python script that downloads and analyzes the Provider Information dataset.

What Nursing Home Compare Is

Nursing Home Compare originated as a consumer-facing website launched by CMS in 1998 to make nursing home quality information accessible to patients, families, and the public. It drew on data that CMS already collected as part of its oversight of Medicare and Medicaid certification: the federal government pays for the majority of nursing home care in the United States through these two programs, and certification is the mechanism through which CMS enforces minimum quality standards as a condition of payment. A facility that loses Medicare and Medicaid certification loses its primary revenue source — in most markets, the threat of decertification is the most powerful regulatory lever CMS holds.

CMS certifies approximately 14,700 to 15,000 nursing homes nationwide. This figure covers only Medicare- and Medicaid-certified facilities — the skilled nursing facilities (SNFs) and nursing facilities (NFs) that participate in the federal payment programs. A smaller number of purely private-pay facilities operate outside the certification framework and do not appear in the CMS datasets, though they are typically licensed by state health departments under separate authority. The certified facilities collectively have approximately 1.55 million licensed beds, with average occupancy around 1.35 million residents at any given time.

In 2020, CMS consolidated Nursing Home Compare into the broader Care Compare portal at medicare.gov/care-compare, which also covers physicians, hospitals, home health agencies, hospice providers, and dialysis facilities under a unified interface. The underlying data continues to be published separately at data.cms.gov, where researchers can download complete CSV extracts of the Provider Information, Health Deficiencies, Quality Measures, Staffing, and Penalties tables. The Socrata-based API at data.cms.gov provides programmatic access to all tables with no registration or API key required.

Nursing home care is among the most expensive categories of long-term care. For 2021, Genworth's annual Cost of Care Survey placed the median annual cost of a semi-private room in a nursing home at approximately $94,900 and a private room at roughly $108,400. Medicare covers skilled nursing facility care only for short-stay post-acute rehabilitation episodes (up to 100 days after a qualifying hospital stay), not long-term custodial care. Medicaid is the primary payer for long-term custodial nursing home care, covering roughly 60–65% of nursing home residents nationally. Private-pay residents and those covered by long-term care insurance account for the remainder.

The Five-Star Quality Rating System

CMS introduced the Five-Star Quality Rating System in December 2008 as a standardized method for comparing nursing home quality that moved beyond the raw deficiency counts that had previously been the primary public metric. The system assigns each certified nursing home a composite overall star rating of one to five stars, where one star indicates performance much below the national average and five stars indicates performance much above average. The ratings are relative to the national distribution, not absolute thresholds: a facility rated three stars is performing near the average for its peer group nationally, not meeting some fixed minimum quality standard. As facility performance across the sector improves, the distribution shifts and individual facility ratings can change even without any change in the facility's own measured performance.

The overall star rating is derived from three component domain ratings, each of which receives its own one-to-five-star score. The three domains are: Health Inspections (based on the standard annual survey plus complaint investigations), Staffing (nurse hours per resident per day from Payroll-Based Journal data), and Quality Measures (fifteen clinical measures derived from Minimum Data Set assessments of residents). The overall star rating is not a simple average of the three domain ratings; Health Inspections carries the most weight and can suppress the overall rating even when staffing and quality measure scores are high.

Domain | Data source | Key measures
Health Inspections | State survey agency deficiency citations (CMS-2567) | Deficiency count, scope/severity weighting, complaint surveys; 3-year rolling window
Staffing | Payroll-Based Journal (PBJ) quarterly submissions | RN hours/resident/day, total nurse hours/resident/day; weekend staffing separately
Quality Measures | Minimum Data Set (MDS 3.0) resident assessments | 15 measures: pressure ulcers, falls, antipsychotic use, UTI, restraints, functional decline
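
The way the three domain ratings combine can be sketched as a short rule-based function. The steps below are a simplified approximation of the approach described in CMS's Technical Users' Guide, not the authoritative rule: the specific conditions (plus one star for a five-star staffing or quality rating, minus one for a one-star rating, a cap when the health inspection rating is one star, and a three-star ceiling for Special Focus Facilities) are assumptions that have changed across releases and should be checked against the current guide. The function name `overall_rating` is hypothetical.

```python
def clamp(stars: int) -> int:
    """Keep a rating inside the 1-5 star range."""
    return max(1, min(5, stars))


def overall_rating(health: int, staffing: int, quality: int, is_sff: bool = False) -> int:
    """Approximate the composite rating from the three domain ratings.

    Assumed rule structure (verify against the current CMS guide):
    start from the health inspection rating, then adjust for staffing
    and quality measures, then apply caps.
    """
    rating = health

    # Staffing adjustment: reward five stars, penalize one star
    if staffing == 5:
        rating = clamp(rating + 1)
    elif staffing == 1:
        rating = clamp(rating - 1)

    # Quality measure adjustment: same pattern
    if quality == 5:
        rating = clamp(rating + 1)
    elif quality == 1:
        rating = clamp(rating - 1)

    # A one-star health inspection rating limits the upgrade to one star
    if health == 1:
        rating = min(rating, 2)

    # Assumed ceiling for active Special Focus Facilities
    if is_sff:
        rating = min(rating, 3)

    return rating


print(overall_rating(health=3, staffing=5, quality=5))
print(overall_rating(health=1, staffing=5, quality=5))
```

Starting from the health inspection rating rather than averaging is what lets a poor inspection record hold down the composite even when the other two domains score well.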

The Health Inspections rating is calculated using a weighted point total from the facility's three most recent standard survey cycles and any complaint investigations conducted during that period. Deficiencies cited at higher scope-and-severity levels carry more points; immediate jeopardy citations carry the highest weights and can drive a facility to a one-star rating regardless of performance on other dimensions. The five-star cutoffs for Health Inspections are recalculated by CMS periodically and applied within each state rather than nationally: roughly the top 10% of facilities in a state receive five stars, the bottom 20% receive one star, and the middle 70% are split about evenly among two, three, and four stars. The actual distribution is never exactly on target because star-rating changes follow a discrete rule-based algorithm.

CMS refreshes the Five-Star ratings on a quarterly basis as new inspection data, PBJ staffing submissions, and MDS quality measure data are processed. A facility that recently completed a standard survey will see its Health Inspection rating updated within weeks; staffing ratings update quarterly after PBJ submission deadlines. The public-facing Care Compare portal reflects the most recently published quarterly release, which typically lags the current data by one to two months.

Health Inspections: The F-Tag System

State survey agencies, operating under contracts with CMS, conduct the annual standard surveys that form the basis of the Health Inspections rating. These surveys are unannounced — facilities receive no advance notice of the survey date, a requirement enforced since the Nursing Home Reform Act of 1987 (OBRA '87) to prevent facilities from staging compliance for inspectors. A typical standard survey takes one to two days and involves a multi-disciplinary team that may include nurses, a registered dietitian, a social worker, and an activities specialist, depending on the size of the facility. CMS conducts oversight surveys at approximately 10% of facilities annually to assess the consistency of state survey agency determinations.

When surveyors identify a regulatory violation, they cite a deficiency using the Federal Tag (F-tag) system. F-tags are numbered requirements from the federal nursing home regulations at 42 CFR Part 483. The current tag numbering runs from F540 through F949, reflecting a 2017 reorganization of the regulatory framework that consolidated and renumbered the requirements. Each F-tag corresponds to a specific regulatory requirement; citations are documented on Form CMS-2567, which becomes a public record and is published in the CMS Health Deficiencies dataset.

Frequently cited F-tags include: F600 (freedom from abuse, neglect, and exploitation), F609 (reporting of alleged violations of abuse, neglect, and exploitation), F656 (developing and implementing the comprehensive person-centered care plan), F686 (treatment and services to prevent and heal pressure ulcers), F695 (respiratory care), F725 (sufficient and competent nursing staff), F761 (labeling and storage of drugs and biologicals), and F812 (food procurement, storage, and preparation). The Health Deficiencies dataset at data.cms.gov includes, for each citation, the F-tag number, the deficiency description, the citation date, and the scope-and-severity code.

Each deficiency is assigned a scope-and-severity rating using a two-dimensional matrix. The scope dimension has three levels: Isolated (one or few residents affected), Pattern (more than a few residents or occurrences, but not widespread), and Widespread (all or most residents affected, or systemic facility practice). The severity dimension has four levels, and each of the twelve scope-severity combinations is identified by a letter from A through L: A through C indicate no actual harm with potential for minimal harm, D through F indicate no actual harm with potential for more than minimal harm, G through I indicate actual harm that is not immediate jeopardy, and J through L indicate immediate jeopardy to resident health or safety. Within each group of three letters, scope increases from Isolated to Widespread. Immediate jeopardy citations (J, K, L) trigger mandatory federal enforcement action and impose the heaviest star-rating penalties.
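
The letter grid can be decoded mechanically. The following is a minimal sketch, assuming the letters are laid out row by row with three scope levels per severity tier (A is isolated at the lowest severity, L is widespread immediate jeopardy); the function names are hypothetical and not part of any CMS tooling.

```python
SCOPES = ("Isolated", "Pattern", "Widespread")
SEVERITIES = (
    "No actual harm, potential for minimal harm",
    "No actual harm, potential for more than minimal harm",
    "Actual harm, not immediate jeopardy",
    "Immediate jeopardy",
)


def decode_scope_severity(code: str) -> tuple[str, str]:
    """Map a scope-and-severity letter (A-L) to (severity, scope)."""
    letter = code.strip().upper()
    if len(letter) != 1 or not ("A" <= letter <= "L"):
        raise ValueError(f"Expected a single letter A-L, got {code!r}")
    idx = ord(letter) - ord("A")
    # Three letters per severity tier; position within the tier is the scope
    return SEVERITIES[idx // 3], SCOPES[idx % 3]


def is_immediate_jeopardy(code: str) -> bool:
    """True for the J, K, and L cells of the matrix."""
    return decode_scope_severity(code)[0] == "Immediate jeopardy"


for letter in ("B", "F", "G", "K"):
    severity, scope = decode_scope_severity(letter)
    print(f"{letter}: {scope} / {severity}")
```

A helper like this is handy when filtering the Health Deficiencies dataset, where only the single-letter code is stored.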

Facilities cited for immediate jeopardy deficiencies must submit and implement a credible plan of correction before the IJ designation can be removed. CMS may impose civil monetary penalties (CMPs) of up to $22,320 per day for IJ deficiencies while they remain uncorrected. The Penalties dataset at data.cms.gov records all federal enforcement actions including CMPs, denial of payment for new admissions (DPNA), and directed in-service training. Complaint investigations — triggered by complaints from residents, families, or staff — use the same F-tag deficiency system and are scored in the Health Inspections domain alongside standard surveys.

Special Focus Facilities

The Special Focus Facility (SFF) program identifies nursing homes with persistent, serious quality problems that have not improved despite standard enforcement mechanisms. A facility is designated SFF based on a CMS algorithm that scores each facility's inspection history over the prior 36 months, weighting deficiencies by scope and severity. Facilities with the worst composite scores — typically those with multiple immediate jeopardy citations or high counts of serious deficiencies across several survey cycles — are placed on the SFF list.

At any given time, approximately 90 facilities are on the active SFF list. SFF designation triggers enhanced oversight: facilities receive surveys approximately every six months rather than annually, and state survey agencies are directed to conduct a more intensive review. CMS publishes the SFF list monthly on its website and in the Provider Information dataset. A facility on the SFF list that demonstrates sustained improvement — achieving two consecutive standard surveys with no immediate jeopardy deficiencies and reduced overall deficiency scores — can graduate off the list. A facility that fails to improve faces the ultimate sanction: termination of its Medicare and Medicaid provider agreements, which is effectively decertification and forces closure or a change of ownership and complete re-certification.

The SFF Candidate list, which CMS also publishes, tracks approximately 400 additional facilities approaching SFF status. These facilities are monitored more closely by state survey agencies and may be placed on the active SFF list if their performance does not improve at subsequent surveys. The distinction between SFF Candidate and SFF status matters for the Provider Information dataset: both groups appear in the special_focus_status field with different coded values, allowing researchers to distinguish facilities that are actively under enhanced oversight from those in the pipeline.

Research on the SFF program has generally found that SFF designation is associated with some measurable improvement in cited deficiency rates among facilities that graduate, but that a significant subset of facilities cycle in and out of SFF status without achieving lasting improvement. The program's effectiveness is constrained by the limited pool of alternative nursing home beds in many markets: in rural areas with one or two nursing homes serving a county, decertification creates an immediate placement crisis for existing residents, giving CMS and state agencies an incentive to exhaust every other enforcement option before terminating a provider agreement.

Staffing Data: Payroll-Based Journal

Before 2017, CMS collected staffing data through self-reported two-week snapshots that facilities submitted with their annual cost reports. The self-reported system was widely criticized for susceptibility to gaming: facilities could select a favorable two-week period to report, and there was no mechanism to verify reported hours against actual payroll records. Academic research consistently found that self-reported staffing levels exceeded staffing levels observed by surveyors during on-site visits.

The Payroll-Based Journal (PBJ) program, implemented in July 2016 and mandatory from fiscal year 2017 onward, requires nursing homes to submit actual payroll data to CMS on a quarterly basis. PBJ data covers every direct-care employee, contracted agency worker, and facility-employed nurse, with daily hours worked by staff type. This granularity allows CMS to compute staffing hours per resident per day for each quarter using actual census-adjusted daily staffing figures rather than two-week snapshots. The PBJ submission deadlines are 45 days after each calendar quarter ends: May 15, August 14, November 14, and February 14.
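
The deadline dates follow directly from the 45-day rule. A small sketch, using 2024 quarter-end dates purely as an example and a hypothetical helper name `pbj_deadline`:

```python
from datetime import date, timedelta


def pbj_deadline(quarter_end: date) -> date:
    """Submission deadline: 45 days after the last day of the quarter."""
    return quarter_end + timedelta(days=45)


quarter_ends = [
    date(2024, 3, 31),
    date(2024, 6, 30),
    date(2024, 9, 30),
    date(2024, 12, 31),
]

for q_end in quarter_ends:
    print(f"Quarter ending {q_end.isoformat()} -> due {pbj_deadline(q_end).isoformat()}")
```

The month and day of each result match the four dates listed above; only the year of the final deadline rolls over.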

The Staffing domain of the Five-Star rating uses four measures derived from PBJ data: Registered Nurse (RN) hours per resident per day, total nurse hours per resident per day (RN plus Licensed Practical Nurse/Licensed Vocational Nurse plus Certified Nursing Assistant), and the corresponding weekend staffing measures for both RN and total nurse hours. Weekend staffing is measured separately because facilities historically maintain lower staffing on weekends, and the weekend measure was added to the rating in 2022 specifically to capture and penalize that pattern. Each measure is scored relative to the national distribution, and the four scores are combined into the Staffing domain star rating.

PBJ data has enabled a significant body of research that was not possible with self-reported staffing. The transition from self-reported to PBJ staffing produced an immediate, measurable decline in average reported RN hours per resident per day in the CMS data, reflecting the elimination of the upward bias in self-reported figures rather than any actual change in staffing levels. Research using PBJ data has documented substantial staffing variation across ownership types, with investor-owned facilities averaging lower RN staffing ratios than government-owned and nonprofit facilities. Weekend staffing consistently runs 10–20% below weekday staffing at the national level.
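
The weekday-versus-weekend comparison can be illustrated with daily records of the kind PBJ makes possible. This is a minimal sketch on made-up numbers for a single facility; the column names (`work_date`, `nurse_hours`, `census`) are hypothetical and do not correspond to the published PBJ file layout.

```python
import pandas as pd

# Two weeks of invented daily data; 2024-01-01 is a Monday
daily = pd.DataFrame({
    "work_date": pd.date_range("2024-01-01", periods=14, freq="D"),
    "nurse_hours": [380] * 5 + [320] * 2 + [380] * 5 + [320] * 2,
    "census": [100] * 14,
})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
is_weekend = daily["work_date"].dt.dayofweek >= 5
daily["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

# Hours per resident day = total nurse hours / total resident days
totals = daily.groupby("day_type")[["nurse_hours", "census"]].sum()
hprd = totals["nurse_hours"] / totals["census"]

weekend_gap = 1 - hprd["weekend"] / hprd["weekday"]
print(hprd.round(2).to_dict())
print(f"Weekend staffing runs {weekend_gap:.1%} below weekday staffing")
```

Dividing summed hours by summed resident days, rather than averaging daily ratios, weights each day by its census.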

The Biden administration published a proposed nursing home staffing minimum in September 2023 that would have required nursing homes to provide at least 0.55 RN hours and 2.45 nurse aide hours per resident per day, with an RN on-site 24 hours per day, seven days per week. The proposed rule was finalized in April 2024 and immediately challenged in federal court by nursing home industry associations. The rule represented the first federal minimum staffing standard for nursing homes since the 1987 OBRA reforms, which set floor requirements that critics have long argued are inadequate given current resident acuity.

Quality Measures from MDS

The Quality Measures (QM) domain of the Five-Star system derives from the Minimum Data Set (MDS 3.0), a comprehensive standardized clinical assessment instrument that nursing homes are required to complete for every resident upon admission, quarterly, annually, and whenever the resident experiences a significant change in condition. The MDS captures hundreds of data elements covering cognition, communication, mood, behavior, physical functioning, continence, nutrition, skin integrity, medications, and diagnoses. Every completed MDS assessment is transmitted electronically to the CMS ASAP (Assessment Submission and Processing) system, creating a longitudinal clinical record for every resident of every certified nursing home.

CMS derives fifteen Quality Measures from MDS assessments and uses them in the QM domain. The measures are split between long-stay and short-stay populations. Long-stay measures apply to residents who have been in the facility for 101 or more days (the custodial population); short-stay measures apply to post-acute residents with stays of 100 days or fewer. Long-stay measures include: percentage of residents with high-risk pressure ulcers, percentage who have had a fall with major injury, percentage with a urinary tract infection, percentage physically restrained, percentage receiving antipsychotic medications, and percentage with depressive symptoms. Short-stay measures include: percentage with pressure ulcers that are new or worsened, percentage with improved function in activities of daily living, and percentage receiving antipsychotics.

The antipsychotic medication quality measure has received particular attention since 2012, when CMS launched the National Partnership to Improve Dementia Care in Nursing Homes. The partnership targeted the widespread, often clinically inappropriate, use of antipsychotic medications in nursing home residents with dementia. Antipsychotics — including quetiapine, risperidone, haloperidol, and olanzapine — carry a black-box FDA warning for use in elderly patients with dementia-related psychosis because of elevated mortality risk. Despite this, antipsychotics were used in approximately 28% of long-stay nursing home residents at the program's launch in 2012. By 2023, the national average had declined to approximately 14%, representing a reduction of roughly 200,000 residents receiving antipsychotics nationally. The public reporting of facility-level antipsychotic rates through Nursing Home Compare is credited as a significant driver of this improvement.

Each QM is risk-adjusted using demographic and clinical covariates from the MDS to account for differences in resident population characteristics across facilities. A facility serving a higher proportion of residents with advanced dementia or severe functional impairment would be expected to have higher baseline rates of some adverse outcomes; risk adjustment attempts to produce fair comparisons across facilities with different case mixes. CMS risk adjustment methodologies are documented in the Measure Specifications reports published annually and are a frequent subject of methodological debate in the health services research literature.
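
One way to picture covariate-based adjustment is a logit-scale comparison of a facility's observed rate with the rate expected for its case mix, re-centered on the national rate. The sketch below uses that form as an assumption about the general approach; the authoritative formulas and covariates are in the CMS measure specifications, and the function name `risk_adjusted_rate` is hypothetical.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a proportion strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("rate must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def risk_adjusted_rate(observed: float, expected: float, national: float) -> float:
    """Shift the national rate by the facility's observed-vs-expected gap.

    observed: facility's raw rate for the measure
    expected: rate predicted from resident covariates (case mix)
    national: national observed rate for the same measure
    """
    y = logit(observed) - logit(expected) + logit(national)
    return 1.0 / (1.0 + math.exp(-y))


# A facility with a sicker-than-average population: raw rate looks high,
# but it is doing slightly better than its case mix predicts.
adjusted = risk_adjusted_rate(observed=0.12, expected=0.14, national=0.08)
print(f"Adjusted rate: {adjusted:.3f}")
```

When observed and expected rates are equal the adjusted rate collapses to the national rate, which is the property that makes comparisons across different case mixes fair. Rates of exactly 0 or 1 need separate handling that this sketch omits.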

Abuse and Neglect Flags

The CMS Provider Information dataset includes two fields that flag facilities with particularly serious history: the Abuse Icon field and the SFF status field discussed above. The Abuse Icon is set to “Y” for facilities that meet one of two criteria: a substantiated complaint of abuse, neglect, or exploitation of a resident within the most recent 36 months, or a federal enforcement action related to resident harm within the most recent three years. The flag is intended to alert consumers and families to a pattern of serious conduct beyond what the star ratings alone convey.

Nursing homes are required under federal law (42 CFR §483.12) to report all alleged violations of residents' rights and all cases of neglect, abuse, and exploitation to the state survey agency and to law enforcement. The F-tag corresponding to this reporting requirement, F609, is one of the most commonly cited deficiencies in the Health Deficiencies dataset, because failures to report are themselves violations regardless of whether the underlying abuse allegation is substantiated. Substantiation of an abuse allegation requires an investigation by the state survey agency, law enforcement, or the state long-term care ombudsman program; not all complaints are substantiated even when something concerning occurred.

The Abuse Icon is a relatively blunt instrument: it captures facilities with any qualifying event in the look-back window, without distinguishing between a single incident at an otherwise well-performing facility and a pattern of systemic failures. Researchers analyzing abuse-flagged facilities should join the Provider Information data with the Health Deficiencies dataset, which includes the specific F-tag citations and scope-and-severity codes that generated the flag, to assess the severity and recurrence of the underlying conduct.
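
A join of that kind might look like the following sketch. The two small frames stand in for the Provider Information and Health Deficiencies tables; all column names (`ccn`, `abuse_icon`, `tag`, `scope_severity`) and all values are invented for illustration, and `ABUSE_TAGS` contains only F600 as a starting point, to be extended with whichever abuse-related tags the analysis needs.

```python
import pandas as pd

# Illustrative set of abuse-related F-tags (assumption: extend as needed)
ABUSE_TAGS = {600}

providers = pd.DataFrame({
    "ccn": ["015009", "015010", "015012"],
    "abuse_icon": ["Y", "N", "Y"],
})

deficiencies = pd.DataFrame({
    "ccn": ["015009", "015009", "015010", "015012", "015012"],
    "tag": [600, 812, 600, 600, 600],
    "scope_severity": ["G", "D", "D", "J", "E"],
})

# Keep only flagged facilities and only abuse-related citations
flagged = providers.loc[providers["abuse_icon"] == "Y", ["ccn"]]
abuse_cites = deficiencies[deficiencies["tag"].isin(ABUSE_TAGS)]
joined = flagged.merge(abuse_cites, on="ccn", how="inner")

# Scope-and-severity letters sort from least (A) to most (L) serious,
# so the alphabetical maximum is the worst citation.
summary = joined.groupby("ccn").agg(
    n_citations=("tag", "count"),
    worst_code=("scope_severity", "max"),
)
print(summary)
```

Counting citations and taking the worst code per facility separates a single isolated incident from repeated or immediate-jeopardy findings.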

Ownership Transparency

CMS requires nursing homes to disclose ownership and management structure through Form CMS-855A as a condition of Medicare and Medicaid enrollment. This disclosure covers the facility's direct owners, any individuals or organizations with 5% or more ownership interest, managing employees, and related-party transactions. CMS publishes nursing home ownership data in the Ownership dataset at data.cms.gov, which links facility provider numbers to disclosed ownership entities and their roles. The dataset is imperfect: multi-layered private equity ownership structures are often disclosed only at the immediate ownership level, with ultimate beneficial owners obscured behind limited liability company chains that each satisfy the 5% threshold requirement individually.

The nursing home industry is highly fragmented by facility count but moderately concentrated at the chain level. The largest chains by facility count in the mid-2020s include: HCR ManorCare (now operating under the ProMedica Health System, approximately 220–250 facilities), Kindred Healthcare (which sold its long-term care division to create SavaSeniorCare and later distributed facilities to various buyers), Genesis Healthcare (approximately 250 facilities at its peak, reduced through divestitures), Brookdale Senior Living (which absorbed Emeritus Corporation in 2014 and operates primarily assisted living but also some nursing homes), and The Ensign Group (approximately 300 facilities, predominantly in western states, operating under a highly decentralized model with local operators).

The effect of private equity ownership on nursing home quality has been a significant area of academic research since the 2010s. Harrington et al. (2020) analyzed CMS data from 2009 to 2016 and found that investor-owned nursing homes had significantly more deficiencies, more serious deficiencies, and lower staffing levels than nonprofit and government-owned facilities, with the differences persisting after controlling for facility size, location, and payer mix. Braun et al. (2021) examined private equity–owned nursing homes specifically and found lower nurse staffing ratios and higher rates of adverse events including falls and infections compared to non-private-equity investor-owned facilities. A 2023 National Bureau of Economic Research working paper by Gupta et al. estimated that private equity acquisition of nursing homes was associated with a 10% increase in short-term mortality among Medicare patients and a 25% reduction in nursing staff hours.

CMS finalized a rule in 2024 requiring nursing homes to disclose private equity and real estate investment trust ownership more explicitly, requiring disclosure of entities at multiple levels of the ownership chain rather than only at the immediate ownership layer. The rule was part of a broader Biden administration effort to increase transparency in nursing home ownership following heightened scrutiny of the industry's COVID-19 outcomes, in which nursing homes accounted for a disproportionate share of pandemic deaths despite housing a small fraction of the total elderly population.

Data Access

CMS publishes nursing home data through the Provider Data Catalog at data.cms.gov/provider-data. The datasets are available as CSV downloads and through the Socrata API. No API key is required; rate limits are generous for research use. The primary tables are:

Dataset | Records | Key fields
Provider Information | ~14,700 facilities | CMS Certification Number, star ratings, ownership type, SFF status, abuse icon, bed count, payer mix, location
Health Deficiencies | ~500K citations (3-year window) | F-tag number, deficiency description, scope/severity code, citation date, correction date, survey type
Quality Measures | ~14,700 facilities × 15 measures | Measure code, observed rate, adjusted rate, state average, national average, footnote codes
Staffing | ~14,700 facilities | RN hours/resident/day, total nurse hours/resident/day, weekend staffing measures, CNA hours, staffing star rating
Penalties | ~5,000–8,000 actions/year | Penalty type, penalty amount, effective date, correction required date, waived/reduced flag
Ownership | ~60,000+ ownership records | Owner name, ownership percentage, owner type (individual/organization), role, association date

The CMS Certification Number (CCN), also called the Medicare Provider Number, is the primary key linking these datasets. The CCN is a six-digit number; for nursing homes, the first two digits are the state code and the last four digits identify the facility within the state. The CCN is stable across time and can be used to join CMS datasets to state-level data, cost reports from the Healthcare Cost Report Information System (HCRIS) at data.cms.gov/cost-reports, and county-level demographic data from the Census Bureau.
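
Because the CCN begins with a state code that can start with zero, it is safest to treat it as text when loading any of these files. A minimal sketch on an inline two-row CSV; the column name `ccn` and the sample values are placeholders, not the actual header in the CMS export.

```python
import io

import pandas as pd

raw_csv = "ccn,overall_rating\n015009,3\n555123,1\n"

# Default parsing turns the identifier into an integer and drops the leading zero
naive = pd.read_csv(io.StringIO(raw_csv))
print(naive["ccn"].tolist())

# Reading the column as a string preserves all six characters
df = pd.read_csv(io.StringIO(raw_csv), dtype={"ccn": str})
df["ccn"] = df["ccn"].str.strip().str.zfill(6)  # repair values already truncated upstream
df["state_code"] = df["ccn"].str[:2]            # first two characters identify the state
print(df)
```

Applying the same string handling to every table before merging avoids silent join misses between a file that kept the leading zero and one that lost it.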

The Socrata API at data.cms.gov supports SQL-style queries using the SoQL query language. For most research purposes, the CSV bulk download is simpler and more reliable than the paginated API for full-dataset analysis. The nursing home Provider Information CSV is approximately 15–20 MB uncompressed and loads readily into pandas with pd.read_csv(). CMS updates the nursing home datasets quarterly, with the release date corresponding to the quarterly star-rating refresh cycle.

Python: Analyzing the CMS Nursing Home Dataset

The following script downloads the CMS Nursing Home Provider Information dataset from data.cms.gov, computes the distribution of overall star ratings, identifies Special Focus Facilities, analyzes average staffing hours by star rating, ranks states by share of one-star facilities, and identifies facilities with the abuse icon flag. The script requires requests and pandas; no API key is needed.

import requests
import pandas as pd
import io

# ---------------------------------------------------------------------------
# CMS Nursing Home Compare: Provider Information dataset
# ---------------------------------------------------------------------------
# CMS publishes the Nursing Home Provider Information table as a CSV download
# at data.cms.gov. The dataset includes star ratings, staffing hours,
# ownership type, SFF status, abuse flags, and facility demographics.
# No API key is required.

# Direct CSV download from data.cms.gov (Socrata export endpoint)
# Dataset: NH_ProviderInfo_<quarter>.csv  -- URL below targets the current release
PROVIDER_INFO_URL = (
    "https://data.cms.gov/provider-data/api/1/datastore/query/"
    "4pq5-n9py/0/download?format=csv"
)

print("Downloading CMS Nursing Home Provider Information dataset...")
resp = requests.get(PROVIDER_INFO_URL, timeout=120)
resp.raise_for_status()
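df = pd.read_csv(io.StringIO(resp.text), low_memory=False)
# Normalize headers to snake_case so the substring matching below works whether
# the export uses "Overall Rating" or "overall_rating" style column names.
df.columns = (
    df.columns.str.strip().str.lower()
    .str.replace(r"[^a-z0-9]+", "_", regex=True)
    .str.strip("_")
)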
print(f"Loaded {len(df):,} facility records, {df.shape[1]} columns")
print("Columns:", list(df.columns[:15]), "...")

# ---------------------------------------------------------------------------
# Part 1: Overall star rating distribution
# ---------------------------------------------------------------------------
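# Column: overall_rating (1-5, or NaN if not yet rated)
rating_col = "overall_rating"
if rating_col not in df.columns:
    # Fall back to the first column that looks like the overall rating,
    # skipping companion footnote columns
    candidates = [
        c for c in df.columns
        if "overall" in c.lower() and "rating" in c.lower()
        and "footnote" not in c.lower()
    ]
    if not candidates:
        raise KeyError("No overall rating column found; check the CSV header")
    rating_col = candidates[0]
# Coerce to numeric so blank or non-numeric entries become NaN
df[rating_col] = pd.to_numeric(df[rating_col], errors="coerce")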

rating_dist = (
    df[rating_col]
    .dropna()
    .astype(int)
    .value_counts()
    .sort_index()
)
total_rated = rating_dist.sum()

print("\n=== Overall Star Rating Distribution ===")
print(f"  {'Stars':<8}  {'Facilities':>12}  {'Share':>8}")
print("  " + "-" * 34)
STAR_LABELS = {1: "1-star (much below avg)", 2: "2-star (below avg)",
               3: "3-star (average)",        4: "4-star (above avg)",
               5: "5-star (much above avg)"}
for stars, count in rating_dist.items():
    bar = "#" * int(count / 50)
    pct = count / total_rated * 100
    print(f"  {STAR_LABELS.get(stars, str(stars)):<30}  {count:>6,}  {pct:>6.1f}%  {bar}")
print(f"  {'TOTAL':<30}  {total_rated:>6,}")

# ---------------------------------------------------------------------------
# Part 2: Special Focus Facilities (SFF)
# ---------------------------------------------------------------------------
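# Column: special_focus_status -- coded values distinguish active SFFs from
# SFF Candidates; blank means neither. Active SFFs are matched here as 'Y' or
# 'SFF' (assumed codings; compare against the value_counts() output below).
sff_col = [c for c in df.columns if "special_focus" in c.lower()]
if sff_col:
    sff_col = sff_col[0]
    sff_counts = df[sff_col].value_counts(dropna=False)
    print("\n=== Special Focus Facility Status ===")
    print(sff_counts.to_string())

    sff_status = df[sff_col].astype(str).str.strip().str.upper()
    sff_df = df[sff_status.isin(["Y", "SFF"])]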
    print(f"\nActive SFF facilities: {len(sff_df):,}")
    if len(sff_df) > 0:
        print("Sample SFF facilities:")
        name_col = [c for c in df.columns if "provider_name" in c.lower() or "facility_name" in c.lower()]
        state_col = [c for c in df.columns if c.lower() in ("state", "provider_state")]
        cols_to_show = (name_col[:1] + state_col[:1] + [sff_col])
        print(sff_df[cols_to_show].head(10).to_string(index=False))

# ---------------------------------------------------------------------------
# Part 3: Average staffing hours by star rating
# ---------------------------------------------------------------------------
# Columns: reported_rn_staffing_hours_per_resident_per_day
#          reported_total_nurse_staffing_hours_per_resident_per_day
rn_col    = [c for c in df.columns if "rn" in c.lower() and "hour" in c.lower()]
total_col = [c for c in df.columns if "total_nurse" in c.lower() and "hour" in c.lower()]

if rn_col and total_col and rating_col in df.columns:
    rn_col, total_col = rn_col[0], total_col[0]
    staffing = (
        df[[rating_col, rn_col, total_col]]
        .dropna()
        .copy()
    )
    staffing[rating_col] = staffing[rating_col].astype(int)
    grouped = staffing.groupby(rating_col)[[rn_col, total_col]].mean().round(2)

    print("\n=== Average Staffing Hours per Resident Day by Star Rating ===")
    print(f"  {'Stars':<6}  {'RN hrs/res/day':>16}  {'Total nurse hrs':>16}")
    print("  " + "-" * 43)
    for stars, row in grouped.iterrows():
        print(f"  {stars:<6}  {row[rn_col]:>16.2f}  {row[total_col]:>16.2f}")

# ---------------------------------------------------------------------------
# Part 4: Top 10 states by share of 1-star facilities
# ---------------------------------------------------------------------------
# Share = 1-star count / total rated facilities in the state. This is not a
# per-capita figure; population-adjusted rates require Census population data.
state_col_name = [c for c in df.columns if c.lower() in ("state", "provider_state")]
if state_col_name and rating_col in df.columns:
    state_col_name = state_col_name[0]
    state_total  = df.groupby(state_col_name)[rating_col].count().rename("total_rated")
    state_1star  = (
        df[pd.to_numeric(df[rating_col], errors="coerce") == 1]
        .groupby(state_col_name)
        .size()
        .rename("one_star_count")
    )
    state_stats = pd.concat([state_total, state_1star], axis=1).dropna()
    state_stats["pct_1star"] = (state_stats["one_star_count"] / state_stats["total_rated"] * 100).round(1)
    top10 = state_stats.sort_values("pct_1star", ascending=False).head(10)

    print("\n=== Top 10 States by Share of 1-Star Facilities ===")
    print(f"  {'State':<8}  {'Total Rated':>12}  {'1-Star Count':>13}  {'% 1-Star':>9}")
    print("  " + "-" * 48)
    for state, row in top10.iterrows():
        print(f"  {state:<8}  {int(row['total_rated']):>12,}  "
              f"{int(row['one_star_count']):>13,}  {row['pct_1star']:>8.1f}%")

# ---------------------------------------------------------------------------
# Part 5: Facilities with abuse icon flag
# ---------------------------------------------------------------------------
# Column: abuse_icon -- 'Y' if facility has substantiated abuse/neglect
# complaint or enforcement action within the relevant look-back period
abuse_col = [c for c in df.columns if "abuse" in c.lower()]
if abuse_col:
    abuse_col = abuse_col[0]
    abuse_flagged = df[df[abuse_col].astype(str).str.upper() == "Y"]
    print(f"\n=== Abuse Icon Flagged Facilities ===")
    print(f"  Total facilities with abuse flag: {len(abuse_flagged):,}")

    if rating_col in df.columns and len(abuse_flagged) > 0:
        abuse_by_stars = (
            abuse_flagged[rating_col]
            .dropna()
            .astype(int)
            .value_counts()
            .sort_index()
        )
        print("  Abuse-flagged facilities by star rating:")
        for stars, count in abuse_by_stars.items():
            print(f"    {stars}-star: {count:,}")

print("\nDone. Data sourced from CMS data.cms.gov Nursing Home datasets.")
print("Care Compare public portal: https://www.medicare.gov/care-compare/")

The CMS data.cms.gov endpoint URLs occasionally change between quarterly releases as CMS migrates datasets to new identifiers. If the download URL above returns a 404, the current endpoint can be found by navigating to data.cms.gov/provider-data, searching for “Nursing Home Provider Information,” and copying the CSV download URL from the dataset page. The column names in the CSV are generally stable across releases; the script includes flexible column detection using string matching to handle minor naming variations between quarterly exports.
