
CDC BRFSS: The World's Largest Telephone Survey and the Federal Health Behavior Database

May 24, 2026 · AI Analytics

Federal Data · CDC · Public Health · Health Surveys

Every year the Centers for Disease Control and Prevention coordinates roughly 450,000 telephone interviews with American adults, fielded state by state and combined into one national picture, tracking the behavioral and health status factors most likely to cause premature death and preventable disability. The result is the Behavioral Risk Factor Surveillance System (BRFSS), the world's largest ongoing telephone health survey by sample size and the primary source of state-level prevalence estimates for chronic disease risk factors across the United States.

What BRFSS Is and Where It Comes From

BRFSS was established in 1984 by the CDC in response to a straightforward problem: state health departments needed population-level data on the behaviors driving heart disease, cancer, and stroke, but no survey infrastructure existed at the state level to produce reliable estimates. The CDC designed a standardized telephone survey methodology and provided it to state health departments, which conduct the actual interviews. The arrangement has persisted for four decades. States collect the data; CDC provides the questionnaire core, the sampling methodology, the weighting procedures, and the public data repository.

The survey began as a landline-only instrument. By 2011 that design had become untenable: cell-phone-only households had grown to roughly 30% of the US adult population, concentrated among younger adults, lower-income households, and renters — precisely the groups whose health behaviors differed most from landline households. CDC integrated cell phone sampling beginning with the 2011 survey year, creating the combined LLCP (landline and cellphone combined) file that is now the primary analysis dataset. The methodology change introduced a discontinuity in some trend series; analysts comparing pre- and post-2011 estimates need to account for it.

Today BRFSS operates continuously in all 50 states plus the District of Columbia, Guam, Puerto Rico, and the US Virgin Islands. With roughly 450,000 completed interviews per year, it is the largest continuously conducted health survey in the world by sample size, and the consistency of the core questionnaire across states and across decades makes it uniquely valuable for both cross-sectional geographic comparisons and long-run trend analysis.

What BRFSS Measures

The BRFSS questionnaire has three tiers. The core module is asked in every state every year; optional modules are offered by CDC and adopted at each state's discretion; and state-added questions address priorities specific to that state's policy agenda.

The core module covers:

Self-rated health status — a single item asking respondents to rate their general health as excellent, very good, good, fair, or poor. Decades of research have confirmed that self-rated health predicts mortality independently of clinical measures.
Healthy days — the number of days in the past 30 days on which physical health was not good, and separately the number of days on which mental health was not good. These two items are widely used as population-level measures of health-related quality of life.
Health care access — whether the respondent has any kind of health care coverage, whether they have a personal doctor or health care provider, and whether cost has prevented them from seeing a doctor in the past year.
Exercise and physical activity — any leisure-time physical activity in the past month, and if so, frequency and intensity.
Fruit and vegetable consumption — daily servings of fruits and of vegetables, captured separately.
Tobacco use — whether the respondent has smoked at least 100 cigarettes in their lifetime and, if so, whether they currently smoke every day, some days, or not at all. E-cigarette use was added to the core in later years.
Alcohol use — drinks per week and binge drinking episodes (five or more drinks for men, four or more for women, on a single occasion in the past 30 days).
Seatbelt use, HIV testing, and immunization status — brief items tracking preventive behaviors and screening uptake.
Height and weight — self-reported, from which CDC computes body mass index as _BMI5 (BMI multiplied by 100 to avoid decimal storage).

Optional modules that states may add include sleep behavior, diabetes management, chronic obstructive pulmonary disease, depression and anxiety, sexual behavior, oral health, cognitive decline and caregiver burden, falls prevention among older adults, and cancer survivorship. The optional module structure means that not every state collects every measure in every year; analysts working with optional-module variables should verify which states participated before computing multi-state comparisons.
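The participation check described above can be sketched in a few lines of pandas. This is a minimal illustration on toy data, assuming that an optional-module column is blank (NaN) for every respondent in a state that did not field the module; the column name OPTMOD_ITEM is hypothetical and stands in for whichever optional-module variable is being analyzed.

```python
import numpy as np
import pandas as pd


def states_with_module(df, variable, state_col="_STATE", min_answered=1):
    """Return state codes with at least `min_answered` non-missing answers to `variable`."""
    answered = df[variable].notna().groupby(df[state_col]).sum()
    return sorted(answered[answered >= min_answered].index.tolist())


# Toy data: OPTMOD_ITEM is a hypothetical optional-module variable.
# State 2 never fielded the module, so every value there is missing.
toy = pd.DataFrame({
    "_STATE": [1, 1, 2, 2, 4, 4],
    "OPTMOD_ITEM": [3, np.nan, np.nan, np.nan, 1, 2],
})

participating = states_with_module(toy, "OPTMOD_ITEM")
print(participating)
```

Restricting a multi-state comparison to the returned codes avoids treating a non-participating state as if it had zero prevalence. Raising min_answered is one way to also drop states with only a handful of responses.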

Key National Findings

BRFSS has documented the secular trends in American health behavior with a consistency that no other dataset matches. Several findings stand out.

Obesity. In the early years of BRFSS, roughly 14% of US adults met the clinical definition of obesity (BMI of 30 or higher). By the early 2020s that figure had risen to approximately 36%. No other data system tracks this trend with BRFSS's combination of annual cadence, state-level disaggregation, and demographic detail. The geographic pattern is stark: Mississippi, West Virginia, and Louisiana consistently rank among the states with the highest obesity prevalence; Utah and Colorado consistently rank among the lowest. A clear rural-urban gradient is visible within states.

Tobacco. Smoking prevalence among US adults fell from roughly 40% in the 1960s to approximately 13% in recent BRFSS years. BRFSS did not exist for the 1960s decline, but it has tracked the continued fall since 1984 with enough granularity to show that rural adults, adults without college degrees, and adults in certain geographic clusters — particularly Appalachia and the rural South — have experienced smaller declines than the national trend implies.

Diabetes and prediabetes. Approximately 10% of US adults report having been told by a health professional that they have diabetes; an additional 38% are estimated to have prediabetes based on blood glucose criteria. BRFSS tracks the diagnosed fraction continuously; the prediabetes figure comes from clinical surveys, but BRFSS BMI and physical activity data are used to model the distribution.

Physical inactivity. Roughly 25% of US adults report no leisure-time physical activity in a typical month. This proportion has changed less dramatically than smoking or obesity over the BRFSS era, and the geographic clustering is strong: the highest inactivity rates are concentrated in the same Southern and Appalachian states with the highest obesity rates.

Survey Methodology and Weighting

BRFSS uses random-digit dialing across both the landline and cell phone frames. Telephone numbers are generated randomly within working area codes and prefixes, then screened for eligibility. On the landline frame, a single adult within each reached household is randomly selected using a within-household selection algorithm. On the cell frame, the person who answers is presumed to be the phone's primary user and is interviewed directly.

The weighting procedure is iterative proportional fitting, commonly called raking. BRFSS rakes to Census Bureau population control totals on eight dimensions simultaneously: age group, race and ethnicity, sex, education level, marital status, home ownership, region within the state, and phone ownership type (landline only, cell only, or dual user). Raking iterates across all dimensions until the weighted sample distribution matches the population distribution within a convergence criterion. The final weight for each respondent, stored in _LLCPWT, reflects how many adults in the population that respondent statistically represents.
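To make the raking loop concrete, here is a minimal sketch of iterative proportional fitting on two dimensions. It is an illustration of the general technique, not CDC's production weighting code; the toy sample, the target totals, and the function name rake are all assumptions made for the example.

```python
import pandas as pd


def rake(sample, targets, base_weight=1.0, max_iter=100, tol=1e-10):
    """Rake weights so each weighted margin matches its target totals.

    sample  : DataFrame with one categorical column per raking dimension
    targets : {column: {category: population total}}
    """
    w = pd.Series(base_weight, index=sample.index, dtype=float)
    for _ in range(max_iter):
        largest_adjustment = 0.0
        for col, totals in targets.items():
            current = w.groupby(sample[col]).sum()   # weighted margin right now
            ratio = pd.Series(totals) / current      # per-category scaling factor
            factors = sample[col].map(ratio)
            largest_adjustment = max(largest_adjustment, (factors - 1.0).abs().max())
            w = w * factors
        if largest_adjustment < tol:                 # every margin already matched
            break
    return w


# Toy sample of six respondents and hypothetical population totals.
sample = pd.DataFrame({
    "sex": ["M", "M", "F", "F", "F", "M"],
    "age": ["18-44", "45+", "18-44", "45+", "45+", "18-44"],
})
targets = {
    "sex": {"M": 50, "F": 50},
    "age": {"18-44": 40, "45+": 60},
}

weights = rake(sample, targets)
print(weights.groupby(sample["sex"]).sum())
print(weights.groupby(sample["age"]).sum())
```

Each pass fixes one margin and slightly disturbs the others, which is why the loop repeats until the largest adjustment factor is indistinguishable from 1. The target totals for every dimension must sum to the same population size, or the procedure cannot converge.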

The stratified complex survey design has critical implications for analysis. Naive (unweighted) analysis of BRFSS data produces biased estimates because the sample is not a simple random sample of the US adult population: younger adults, cell-phone-only users, and certain demographic groups are sampled at different rates and then up-weighted or down-weighted to match population controls. Unweighted prevalence estimates, means, and regression coefficients will be biased, and standard errors computed without the design information will be wrong even when the weights are applied. The appropriate tools are R's survey package (using svydesign and svymean/svyglm), Stata's svy prefix commands, SAS survey procedures such as PROC SURVEYLOGISTIC, or a Python survey library such as samplics, in each case with the _LLCPWT column as the weight and _STSTR as the stratum variable.
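For readers who want to see what those survey procedures compute, the sketch below implements a Taylor-linearized standard error for a weighted proportion under a stratified design with sampling units treated as drawn with replacement. It is a simplified illustration rather than a replacement for a tested survey package. The toy data are invented, and the use of a _PSU column as the sampling-unit identifier is an assumption to confirm against the year-specific codebook.

```python
import numpy as np
import pandas as pd


def weighted_proportion_se(df, y, weight, stratum, psu):
    """Weighted proportion and its linearized standard error.

    Strata containing a single sampling unit contribute no variance here;
    survey packages offer more careful handling of that case.
    """
    w = df[weight].to_numpy(dtype=float)
    yv = df[y].to_numpy(dtype=float)
    total_w = w.sum()
    p = (w * yv).sum() / total_w
    z = w * (yv - p) / total_w                      # linearized ratio residuals
    psu_totals = pd.Series(z).groupby(
        [df[stratum].to_numpy(), df[psu].to_numpy()]
    ).sum()
    variance = 0.0
    for _, totals in psu_totals.groupby(level=0):   # one pass per stratum
        n_h = len(totals)
        if n_h > 1:
            variance += n_h / (n_h - 1) * ((totals - totals.mean()) ** 2).sum()
    return p, float(np.sqrt(variance))


# Toy data: two strata, each respondent treated as its own sampling unit.
toy = pd.DataFrame({
    "_STSTR": [11, 11, 11, 11, 12, 12, 12, 12],
    "_PSU": [1, 2, 3, 4, 5, 6, 7, 8],
    "_LLCPWT": [120.0, 80.0, 200.0, 150.0, 40.0, 60.0, 55.0, 45.0],
    "obese": [1, 0, 0, 1, 0, 1, 0, 0],
})

p_hat, se_hat = weighted_proportion_se(toy, "obese", "_LLCPWT", "_STSTR", "_PSU")
print(f"prevalence = {p_hat:.3f}, SE = {se_hat:.3f}")
```

With equal weights, one stratum, and one respondent per sampling unit, the formula reduces to the familiar sqrt(p(1-p)/(n-1)), which is a useful sanity check on any implementation.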

State-level estimates from BRFSS are statistically reliable for most core-module variables because the annual sample for each state is large enough — typically 4,000 to 20,000 completed interviews. Sub-state estimates (county level, metro area level) require caution: small cell sizes produce wide confidence intervals, and for many counties the BRFSS sample is insufficient for direct estimation. CDC's PLACES project addresses this limitation through small area estimation methods described below.
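A rough way to see why small cells are a problem is to compute an approximate confidence interval from Kish's effective sample size, which discounts the nominal sample for unequal weights. This is a back-of-the-envelope sketch only: it ignores stratification, the simulated weights are invented, and the function names are hypothetical.

```python
import numpy as np


def kish_effective_n(weights):
    """Kish approximation: (sum of weights)^2 / (sum of squared weights)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()


def approx_ci_half_width(p, weights, z=1.96):
    """Approximate 95% CI half-width for a proportion p given the weights."""
    n_eff = kish_effective_n(weights)
    return z * np.sqrt(p * (1.0 - p) / n_eff)


# Simulated unequal weights at three sample sizes, prevalence fixed at 30%.
rng = np.random.default_rng(0)
for n in (50, 500, 5000):
    w = rng.lognormal(mean=0.0, sigma=0.8, size=n)
    half = approx_ci_half_width(0.30, w)
    print(f"n={n:5d}  n_eff={kish_effective_n(w):8.1f}  +/- {100 * half:.1f} points")
```

The half-width shrinks with the square root of the effective sample size, so a county with a few dozen respondents yields an interval many times wider than a state with several thousand.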

Data Structure and Access

CDC publishes annual BRFSS data at the BRFSS website (cdc.gov/brfss). Each annual release includes:

LLCP XPT file — the primary analysis file combining landline and cell phone respondents. Distributed in SAS transport format (.XPT), readable by SAS, R (haven or foreign), Stata, and Python (pyreadstat). This is the file used for all national and state-level prevalence analysis.
ASCII fixed-width file — an alternative format with the same data, accompanied by a format statement for parsing.
Codebook — a PDF document mapping every variable name to the question text, response codes, and skip logic. The codebook is essential reading before using BRFSS; response codes are not intuitive without it.

Key variables in the LLCP file include:

_STATE — state FIPS code (numeric), used to compute state-level estimates.
SEXVAR — respondent sex (1 = male, 2 = female). Earlier years used SEX; the variable was renamed in 2019.
_AGEG5YR — age group in five-year bands from 1 (18–24) through 13 (80 and older). The computed variable is preferred over the raw age item because it is recoded consistently.
GENHLTH — self-rated general health, 1 (excellent) through 5 (poor); 7 and 9 are “don't know/not sure” and “refused.”
MENTHLTH — days in the past 30 on which mental health was not good; 88 = none, 77 and 99 are missing.
PHYSHLTH — days in the past 30 on which physical health was not good; same coding as MENTHLTH.
_BMI5 — BMI multiplied by 100. A value of 2750 means BMI 27.50. Obesity threshold is _BMI5 >= 3000.
SMOKE100 — ever smoked 100 cigarettes in lifetime (1 = yes, 2 = no). Combined with SMOKDAY2 (smoking frequency) to derive current smoker status.
_RFSMOK3 — computed current smoker indicator (1 = not a current smoker, 2 = current smoker). The leading underscore indicates a CDC computed variable rather than a direct survey item.
ALCDAY5 — number of days per week or per month on which the respondent had at least one alcoholic drink, encoded with a prefix digit (1xx = days per week, 2xx = days in the past 30 days; 888 = no drinks in the past 30 days).
_TOTINDA — leisure-time physical activity indicator (1 = had activity, 2 = no activity in the past month).
HLTHPLN1 — any health care coverage (1 = yes, 2 = no).
_LLCPWT — the final raked survey weight. Never analyze BRFSS without applying this weight.
_STSTR — stratification variable for variance estimation in complex survey procedures.
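Several of these encodings need decoding before use. The helper functions below are a minimal sketch based only on the codings listed above; the function names are hypothetical, and the valid ranges assumed for ALCDAY5 (101 to 107 and 201 to 230) should be checked against the codebook for the survey year in question.

```python
import numpy as np
import pandas as pd


def bmi_from_bmi5(bmi5):
    """_BMI5 stores BMI * 100, so 2750 becomes 27.50."""
    return pd.to_numeric(bmi5, errors="coerce") / 100.0


def is_current_smoker(rfsmok3):
    """_RFSMOK3: 1 = not a current smoker, 2 = current smoker, anything else missing."""
    s = pd.to_numeric(rfsmok3, errors="coerce")
    return s.map({1: 0.0, 2: 1.0})


def alcday_per_week(alcday):
    """Convert the prefix-coded value to a per-week figure.

    1xx = per week, 2xx = per month (past 30 days), 888 = none.
    Everything else (777, 999, blanks) is returned as NaN.
    """
    a = pd.to_numeric(alcday, errors="coerce")
    out = pd.Series(np.nan, index=a.index, dtype=float)
    weekly = a.between(101, 107)
    monthly = a.between(201, 230)
    out[weekly] = a[weekly] - 100
    out[monthly] = (a[monthly] - 200) * 7 / 30
    out[a == 888] = 0.0
    return out


codes = pd.Series([103, 215, 888, 777, 999, np.nan])
print(alcday_per_week(codes).tolist())
print(bmi_from_bmi5(pd.Series([2750, 3000])).tolist())
print(is_current_smoker(pd.Series([1, 2, 9])).tolist())
```

Returning NaN for unrecognized codes, rather than silently passing them through, keeps refusals and "don't know" answers out of any downstream mean.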

The convention throughout the BRFSS codebook is that missing and refusal codes are large numbers (77, 99, 777, 999, 9999). Any analysis pipeline must explicitly recode these to missing (NaN) before computing statistics; including them as numeric values inflates means and distorts prevalence estimates.
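A small example shows how much the sentinel codes distort a mean if left in place. The sketch below uses only the codings given in the variable list above (7 and 9 for GENHLTH; 77, 99, and 88 for the healthy-days items); the mapping dictionaries and the function name clean_brfss are illustrative, and a real pipeline would extend them from the codebook.

```python
import pandas as pd

# Codes that mean "don't know" or "refused" for each variable.
MISSING_CODES = {
    "GENHLTH": {7, 9},
    "MENTHLTH": {77, 99},
    "PHYSHLTH": {77, 99},
}
# Codes that mean "none" and should become a numeric zero.
ZERO_CODES = {"MENTHLTH": 88, "PHYSHLTH": 88}


def clean_brfss(df):
    """Return a copy with sentinel codes recoded to NaN and 'none' codes to 0."""
    out = df.copy()
    for col, codes in MISSING_CODES.items():
        if col in out.columns:
            out[col] = out[col].where(~out[col].isin(codes))
    for col, code in ZERO_CODES.items():
        if col in out.columns:
            out[col] = out[col].replace(code, 0)
    return out


raw = pd.DataFrame({
    "GENHLTH": [1, 5, 7, 9, 3],
    "MENTHLTH": [88, 5, 77, 99, 30],
})
cleaned = clean_brfss(raw)

print("naive mean of MENTHLTH:  ", raw["MENTHLTH"].mean())
print("cleaned mean of MENTHLTH:", cleaned["MENTHLTH"].mean())
```

In the toy data the naive mean exceeds 30 days, which is impossible for a 30-day recall item and a quick signal that sentinel codes were left in.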

The PLACES Project: Small Area Estimates from BRFSS

The direct BRFSS sample is insufficient for reliable sub-state estimation in most counties. To address this, CDC developed the PLACES project (formerly called 500 Cities) using multilevel regression and poststratification (MRP). The approach fits a multilevel regression model to BRFSS microdata, incorporating demographic predictors and geographic random effects, then poststratifies the model predictions to Census Bureau population totals for every small area. The result is model-based prevalence estimates for 27 health measures at the county level (all 3,100-plus US counties) and the census tract level (more than 72,000 tracts nationwide).
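The poststratification half of MRP is simple arithmetic once the model has produced a predicted prevalence for each demographic cell. The sketch below shows only that step, on invented numbers: the cell predictions stand in for fitted model output, the two counties and their populations are hypothetical, and the geographic random effects that would make predictions differ by area in a real model are omitted.

```python
import pandas as pd

# Hypothetical model output: predicted obesity prevalence per demographic cell.
cell_pred = pd.DataFrame({
    "age": ["18-44", "18-44", "45+", "45+"],
    "sex": ["F", "M", "F", "M"],
    "pred": [0.30, 0.28, 0.38, 0.36],
})

# Hypothetical census counts for the same cells in two counties.
county_pop = pd.DataFrame({
    "county": ["A"] * 4 + ["B"] * 4,
    "age": ["18-44", "18-44", "45+", "45+"] * 2,
    "sex": ["F", "M", "F", "M"] * 2,
    "pop": [3000, 3000, 2000, 2000, 500, 500, 2000, 2000],
})


def poststratify(cell_pred, county_pop):
    """Population-weighted average of cell predictions within each county."""
    merged = county_pop.merge(
        cell_pred, on=["age", "sex"], how="left", validate="many_to_one"
    )
    merged["expected_cases"] = merged["pop"] * merged["pred"]
    totals = merged.groupby("county")[["expected_cases", "pop"]].sum()
    return totals["expected_cases"] / totals["pop"]


estimates = poststratify(cell_pred, county_pop)
print(estimates)
```

County B comes out higher than county A purely because its population skews older, which illustrates the caveat that small-area estimates reflect demographic composition as filtered through the model rather than local survey responses.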

PLACES covers measures including obesity, smoking, physical inactivity, diabetes, hypertension, high cholesterol, asthma, depression, arthritis, dental visits, mammography screening, and health insurance coverage — all derived from BRFSS core and optional module data. The estimates are available through the CDC Open Data portal and its API, making them the standard input for county- and tract-level health mapping applications.

The critical caveat for PLACES is that these are model estimates, not direct survey estimates. For large counties with substantial BRFSS sample, the MRP estimate will be close to the direct estimate. For small counties — particularly those with fewer than 50,000 residents — the model is doing most of the work, and the estimates inherit the model's assumptions about how demographics predict health outcomes. Users should treat PLACES county estimates for small counties as informed guesses, not measurements.

BRFSS and Health Equity Research

The consistent demographic breakdowns available in BRFSS — race and ethnicity, income, education, disability status, sexual orientation and gender identity in recent years — make it the primary data source for state-level health disparity analysis. Because the sample is large enough to produce reliable state-level estimates within demographic subgroups, BRFSS enables comparisons that national surveys cannot support: obesity prevalence among Black adults with college degrees in Georgia versus in Minnesota, or smoking rates among adults with incomes below poverty in Appalachian states versus the national average.

Several health disparities documented through BRFSS are particularly striking in their persistence. Black adults have higher rates of hypertension and type 2 diabetes diagnosis than white adults at every income and education level — the disparity is not fully explained by socioeconomic differences. This finding, replicated across years of BRFSS data and corroborated by clinical datasets, has shaped federal priorities in chronic disease prevention and has driven research into structural and environmental determinants of health beyond individual behavior.

BRFSS is also the primary source for tracking health disparities by geographic type. Rural adults in the United States report lower rates of health insurance coverage, higher rates of smoking and obesity, less leisure-time physical activity, and worse self-rated health than urban adults, and the gap has widened rather than narrowed over the BRFSS era. Because BRFSS produces state-level estimates annually, changes in rural-urban health gaps can be tracked over time within states, not just nationally.

BRFSS Limitations

Every data source has structural limitations, and BRFSS is no exception. Understanding them is essential for honest analysis.

Self-reported biometric data. BMI in BRFSS is computed from self-reported height and weight. Research comparing BRFSS-based BMI to objectively measured BMI from the National Health and Nutrition Examination Survey (NHANES) — which conducts physical examinations — consistently finds that BRFSS underestimates obesity prevalence by approximately 5 percentage points. Respondents tend to report being taller and lighter than they actually are. BRFSS obesity trends are internally consistent over time (the bias is relatively stable), but BRFSS prevalence estimates should not be treated as clinically precise.

Telephone coverage bias. Despite the addition of cell phone sampling in 2011, certain populations remain systematically underrepresented: homeless individuals, those living in institutional settings (nursing homes, correctional facilities), non-English and non-Spanish speakers, and individuals with severe cognitive impairment. These groups tend to have worse health outcomes than the general population, meaning BRFSS likely understates disease prevalence for some measures.

The 2011 methodology discontinuity. Adding cell phone sampling substantially changed the demographic composition of the achieved sample and altered weighting procedures. For some outcomes — particularly those with strong age gradients or rural-urban gradients — trends across the 2010–2011 boundary reflect the methodology change as much as real changes in population health. CDC published bridged analyses for some measures to help users assess the magnitude of the discontinuity.

Recall bias and health literacy variation. Items asking about behavior “in the past 30 days” rely on respondent recall over a calendar period that varies in salience. Items asking about servings of fruits and vegetables are particularly sensitive to how respondents define a serving. These limitations affect the absolute accuracy of estimates but are relatively stable across years and states, so comparative and trend analyses remain informative.

State variation in data collection timing. States collect BRFSS data throughout the year, and the seasonal distribution of interviews varies by state. For behaviors with strong seasonal patterns — physical activity, alcohol consumption around holidays, influenza vaccination uptake — a state that completes most interviews in January may produce estimates that differ from a state that interviews primarily in summer, not because of genuine population differences but because of timing. The raking weights do not fully correct for this.
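One way to check whether timing could be driving a cross-state difference is to tabulate each state's share of completed interviews by month. The sketch below does this on toy data. It assumes the interview month is available in a column named IMONTH, possibly stored as text such as "01"; that name should be confirmed against the codebook for the survey year in use.

```python
import pandas as pd


def interview_month_shares(df, state_col="_STATE", month_col="IMONTH"):
    """Share of each state's completed interviews falling in each calendar month."""
    months = pd.to_numeric(df[month_col], errors="coerce")
    counts = pd.crosstab(df[state_col], months)
    return counts.div(counts.sum(axis=1), axis=0)


# Toy data: state 1 interviews mostly in January, state 2 mostly in July.
toy = pd.DataFrame({
    "_STATE": [1, 1, 1, 1, 2, 2, 2, 2],
    "IMONTH": ["01", "01", "01", "07", "01", "07", "07", "07"],
})

shares = interview_month_shares(toy)
print(shares)
```

If two states show very different monthly profiles, comparing them on a seasonal behavior within matched months, or adding interview month as a covariate, is safer than comparing annual estimates directly.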

Python: Weighted Obesity Prevalence by State for Adults 18–44

The following script downloads the 2022 BRFSS LLCP XPT file directly from CDC, loads it using pyreadstat, applies the _LLCPWT survey weight, and computes weighted obesity prevalence by state for adults aged 18 to 44. It then prints the 10 states with the highest and lowest obesity rates in that age group.

import io
import os
import shutil
import tempfile
import zipfile

import pandas as pd
import pyreadstat
import requests

# Download BRFSS LLCP (landline + cell combined) XPT file for 2022
# The LLCP file is the primary analysis file for national and state estimates
YEAR = "2022"
url = f"https://www.cdc.gov/brfss/annual_data/{YEAR}/files/LLCP{YEAR}XPT.zip"

resp = requests.get(url, timeout=300)
resp.raise_for_status()

# pyreadstat reads SAS transport (.XPT) files from a path on disk rather than
# from an in-memory buffer, so extract the archive member to a temporary file.
with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    # strip() guards against archive member names with trailing whitespace
    xpt_name = next(n for n in zf.namelist() if n.strip().upper().endswith(".XPT"))
    with tempfile.TemporaryDirectory() as tmpdir:
        xpt_path = os.path.join(tmpdir, "brfss_llcp.xpt")
        with zf.open(xpt_name) as src, open(xpt_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        df, meta = pyreadstat.read_xport(xpt_path)

# Normalize column names to uppercase (BRFSS codebook uses uppercase)
df.columns = [c.upper() for c in df.columns]

# _LLCPWT: final combined landline + cell weight
# _STATE:   state FIPS code (numeric)
# _BMI5:    BMI * 100 (e.g. 2750 = BMI 27.50)
# _AGEG5YR: age group 1-13 (1=18-24, 2=25-29, ..., 13=80+)

# Keep records with valid weight, state, BMI, and age group
analysis = df[
    df["_LLCPWT"].notna()
    & df["_STATE"].notna()
    & df["_BMI5"].notna()
    & df["_AGEG5YR"].notna()
    & (df["_BMI5"] > 0)
    & (df["_AGEG5YR"] >= 1)
    & (df["_AGEG5YR"] <= 13)
].copy()

# Obesity indicator: BMI >= 30 means _BMI5 >= 3000
analysis["obese"] = (analysis["_BMI5"] >= 3000).astype(float)

# Adults 18-44: age groups 1 (18-24), 2 (25-29), 3 (30-34), 4 (35-39), 5 (40-44)
young_adults = analysis[analysis["_AGEG5YR"].isin([1, 2, 3, 4, 5])].copy()


def weighted_prevalence(group):
    """Compute weighted obesity prevalence (%) and the unweighted respondent count."""
    w = group["_LLCPWT"]
    ob = group["obese"]
    total_w = w.sum()
    if total_w == 0:
        return pd.Series({"prevalence_pct": float("nan"), "n_unweighted": 0})
    # Survey-weighted proportion: weight held by obese respondents / total weight
    prev = (ob * w).sum() / total_w * 100.0
    n = len(group)
    return pd.Series({"prevalence_pct": round(prev, 1), "n_unweighted": n})


# include_groups=False requires pandas 2.2 or newer; drop the argument on older versions
state_results = (
    young_adults.groupby("_STATE")
    .apply(weighted_prevalence, include_groups=False)
    .reset_index()
)

# Map state FIPS to two-letter abbreviations for display
fips_to_abbr = {
    1: "AL", 2: "AK", 4: "AZ", 5: "AR", 6: "CA", 8: "CO", 9: "CT",
    10: "DE", 11: "DC", 12: "FL", 13: "GA", 15: "HI", 16: "ID", 17: "IL",
    18: "IN", 19: "IA", 20: "KS", 21: "KY", 22: "LA", 23: "ME", 24: "MD",
    25: "MA", 26: "MI", 27: "MN", 28: "MS", 29: "MO", 30: "MT", 31: "NE",
    32: "NV", 33: "NH", 34: "NJ", 35: "NM", 36: "NY", 37: "NC", 38: "ND",
    39: "OH", 40: "OK", 41: "OR", 42: "PA", 44: "RI", 45: "SC", 46: "SD",
    47: "TN", 48: "TX", 49: "UT", 50: "VT", 51: "VA", 53: "WA", 54: "WV",
    55: "WI", 56: "WY", 66: "GU", 72: "PR", 78: "VI",
}

state_results["state"] = state_results["_STATE"].map(
    lambda x: fips_to_abbr.get(int(x), str(int(x)))
)

state_results = state_results.sort_values("prevalence_pct", ascending=False)

print("=== 10 States with HIGHEST obesity prevalence, adults 18-44 ===")
print(
    state_results[["state", "prevalence_pct", "n_unweighted"]]
    .head(10)
    .to_string(index=False)
)

print()
print("=== 10 States with LOWEST obesity prevalence, adults 18-44 ===")
print(
    state_results[["state", "prevalence_pct", "n_unweighted"]]
    .tail(10)
    .sort_values("prevalence_pct")
    .to_string(index=False)
)

A few notes on the implementation. The _BMI5 variable stores BMI multiplied by 100, so the obesity threshold of BMI 30 corresponds to _BMI5 >= 3000. The _AGEG5YR groups 1 through 5 cover ages 18–24, 25–29, 30–34, 35–39, and 40–44. The weighted prevalence calculation divides the sum of weights for obese respondents by the total weight for the group, which is the standard survey-weighted proportion estimator. Unlike the raw survey items, the computed _BMI5 field does not carry sentinel codes such as 888 or 99: CDC leaves it blank when height or weight was not reported, and the reader loads those blanks as NaN. The notna() check in the script is what excludes them; the _BMI5 > 0 condition is only a defensive guard. The script reports point estimates only. Standard errors and confidence intervals require design-based variance estimation using _STSTR, as discussed in the methodology section.

The CDC occasionally restructures the BRFSS ZIP archive URL or changes variable names across survey years. If the download fails, check the CDC BRFSS website for the current URL pattern and confirm variable names against the year-specific codebook before running the script.
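One defensive pattern for variable renames is to resolve the column name at load time from a list of known aliases. The sketch below uses the SEXVAR and SEX pair mentioned earlier; the helper name first_present is hypothetical, and alias lists for other variables would need to be built from the year-specific codebooks.

```python
import pandas as pd


def first_present(df, candidates):
    """Return the first column name in `candidates` that exists in `df`."""
    for name in candidates:
        if name in df.columns:
            return name
    raise KeyError(f"none of {candidates} found; check the codebook for this survey year")


# Toy frames standing in for an older and a newer survey year.
older_year = pd.DataFrame({"SEX": [1, 2], "_LLCPWT": [10.0, 12.0]})
newer_year = pd.DataFrame({"SEXVAR": [2, 1], "_LLCPWT": [9.0, 11.0]})

for frame in (older_year, newer_year):
    sex_col = first_present(frame, ["SEXVAR", "SEX"])
    print(sex_col, frame[sex_col].tolist())
```

Failing loudly with a KeyError when no alias matches is deliberate: a silent fallback would let a multi-year pipeline drop a variable without anyone noticing.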

Outpatient health outcomes and hospital quality data complement the population-level behavior picture from BRFSS with clinical performance metrics. See CMS Hospital Quality Data: Comparing Outcomes Across US Hospitals.

Prescription drug utilization patterns documented in Medicare Part D provide a population-level view of pharmaceutical treatment for many of the chronic conditions that BRFSS tracks as risk factors. See Medicare Part D: Prescription Drug Utilization Data and Drug Spending.

Financial relationships between pharmaceutical and device manufacturers and physicians — documented in CMS Open Payments — add context for understanding how clinical practice patterns interact with the health behaviors BRFSS measures. See CMS Open Payments: Tracking Industry Payments to Physicians and Hospitals.