ClinicalTrials.gov: The Federal Database Behind 500,000 Clinical Trials and Drug Approval Research

16 min read · AI Analytics
Tags: NIH, Clinical Trials, Drug Approval, Research, Federal Data

ClinicalTrials.gov is the world's largest registry of clinical research studies, maintained by the National Library of Medicine at the National Institutes of Health. As of 2024 it holds more than 500,000 registered studies spanning every disease area, every sponsor type, and more than 220 countries. The database underpins drug approval submissions, academic meta-analyses, regulatory enforcement, patient recruitment, and the global scientific record of what clinical research has actually been conducted — including research whose results were never published anywhere else.

This article covers the legislative origin and institutional structure of ClinicalTrials.gov, the FDAAA 801 mandatory registration requirements and their enforcement record, the structure of clinical trial phases and the study design fields that distinguish interventional from observational research, the key data fields in the registry and what they reveal about drug development pipelines, the publication bias problem and how the results database was designed to address it, sponsor categories and the dominant role of industry in Phase 3 research, the ClinicalTrials.gov API v2 and bulk download options, aggregate statistics on the composition of the registry, and a Python script that queries the API for recruiting Phase 3 oncology trials and computes phase distribution across all cancer studies.

Origin and institutional structure

ClinicalTrials.gov was launched in February 2000 as a direct product of Section 113 of the Food and Drug Administration Modernization Act of 1997 (FDAMA). FDAMA required the Secretary of Health and Human Services to establish a registry of federally and privately funded clinical trials for serious and life-threatening diseases, motivated primarily by HIV/AIDS advocates who argued that patients with terminal diagnoses deserved access to information about ongoing experimental treatments before those treatments reached commercial approval or publication in medical journals. The original mandate was narrow: it covered trials of drugs and biologics for serious and life-threatening conditions, whether federally or privately funded, and nothing else.

The National Library of Medicine, a component of NIH housed within the Department of Health and Human Services, was designated as the implementing agency. NLM already operated MEDLINE and PubMed, the world's dominant biomedical literature database, and had the infrastructure and subject-matter expertise to build and maintain a registry at scale. The database is operated by NLM's Division of Library Operations and is distinct from the FDA's own regulatory databases, though results submitted to ClinicalTrials.gov can satisfy certain FDA reporting obligations.

The initial launch in 2000 registered only a few thousand studies. Growth accelerated after the International Committee of Medical Journal Editors (ICMJE) announced in 2004 that member journals, including the New England Journal of Medicine, JAMA, The Lancet, and the BMJ, would require prospective trial registration as a condition of publication. This editorial requirement, which preceded the legislative mandate by three years, had an immediate effect: registration volumes doubled in 2005 as investigators anticipated future publication requirements. The FDAAA 801 requirements enacted in 2007 then extended mandatory registration to industry-sponsored trials, driving growth to the current scale of more than 500,000 studies.

FDAAA 801: mandatory registration and results reporting

The Food and Drug Administration Amendments Act of 2007, specifically Title VIII Section 801 (FDAAA 801), transformed ClinicalTrials.gov from a voluntary registry with limited scope into a mandatory compliance system. FDAAA 801 defines a class of “applicable clinical trials” (ACTs) that must register: interventional studies of drugs, biologics, or devices that are subject to FDA regulation, are Phase 2 or later (or device trials of any phase), and are conducted in the United States or in a country where the data will be submitted to FDA. Registration must occur within 21 days of first patient enrollment.

Beyond registration, FDAAA 801 added the most consequential structural change: mandatory results reporting. Sponsors of applicable clinical trials must post results to the ClinicalTrials.gov results database within 12 months of the primary completion date — the date on which the last participant was examined or received an intervention for purposes of collecting data for the primary outcome. Results must include a participant flow table, baseline characteristics, outcome measure data for primary and secondary endpoints, adverse event tables (serious and non-serious, by organ system and preferred term), and the protocol and statistical analysis plan. The results database is publicly accessible and constitutes a scientific record of trial outcomes independent of any journal publication.
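The two deadlines described above (21 days after first enrollment for registration, 12 months after primary completion for results) can be sketched as simple date arithmetic. This is a minimal illustration, not a compliance tool: the function names are hypothetical, and the sketch ignores extensions, delayed-submission certifications, and other provisions of the regulation.

```python
import calendar
from datetime import date, timedelta

REGISTRATION_WINDOW_DAYS = 21  # days after first participant enrollment
RESULTS_WINDOW_MONTHS = 12     # months after the primary completion date


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the month's last day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def registration_deadline(first_enrollment: date) -> date:
    """Latest registration date under the 21-day rule."""
    return first_enrollment + timedelta(days=REGISTRATION_WINDOW_DAYS)


def results_deadline(primary_completion: date) -> date:
    """Latest results-posting date under the 12-month rule."""
    return add_months(primary_completion, RESULTS_WINDOW_MONTHS)


def is_results_overdue(primary_completion: date, today: date,
                       results_posted: bool) -> bool:
    """True when no results are posted and the 12-month window has passed."""
    return not results_posted and today > results_deadline(primary_completion)


print(registration_deadline(date(2023, 3, 1)))  # 2023-03-22
print(results_deadline(date(2024, 2, 29)))      # 2025-02-28
```

The month clamp matters only for completion dates that fall on a day the target month lacks, such as 29 February.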

Penalties for non-compliance are substantial on paper. The FDA can impose civil monetary penalties of up to $10,000 per day for failure to register or report results. NIH can withhold future grant funding from investigators or institutions that fail to comply with reporting requirements for NIH-funded trials. However, enforcement has been persistently criticized as inadequate. A landmark 2015 study published in the New England Journal of Medicine by Anderson et al. examined 13,327 applicable clinical trials with primary completion dates between January 2008 and August 2012 and found that only 13.4% had reported results to ClinicalTrials.gov within 12 months of primary completion — a compliance rate that exposed the gap between statutory mandate and actual enforcement. The FDA had issued no civil monetary penalties for results reporting failures as of that study's publication. Subsequent analyses through 2022 showed compliance rates improving, reaching approximately 40–60% for industry-sponsored trials and 30–50% for academic trials, but substantial non-compliance persists.

A 2016 HHS Final Rule updated the regulatory implementation of FDAAA 801, taking effect on January 18, 2017. Among other changes, it extended results reporting to applicable clinical trials of products that FDA has not yet approved, which the statute alone had not required. The Final Rule also clarified definitions of primary completion date, primary outcome measure, and the adverse event reporting structure. NIH simultaneously issued a policy requiring registration and results reporting for all NIH-funded clinical trials regardless of FDAAA applicability, capturing Phase 1 trials and trials of interventions, such as behavioral treatments, that the statute does not cover.

Study types and clinical trial phases

ClinicalTrials.gov distinguishes two primary study types: interventional and observational. Interventional studies (also called clinical trials) assign participants to receive an intervention — a drug, device, behavioral treatment, procedure, dietary supplement, genetic therapy, or combination product — and measure outcomes. Observational studies observe participants without assigning interventions: cohort studies follow groups over time, case-control studies compare people with and without an outcome, and cross-sectional studies measure characteristics at a single point in time. A third category, expanded access (compassionate use), covers the use of investigational products outside of clinical trials for patients with serious conditions who have no alternatives. Patient registries, a subset of observational studies, collect standardized data on patients with specific conditions for natural history and outcomes research.

Interventional studies are assigned a phase that reflects their position in the drug development sequence. Phase 0 studies — rarely registered but present in the database — involve microdosing and pharmacokinetic characterization before full Phase 1 safety studies, typically with fewer than fifteen participants receiving sub-therapeutic doses. Phase 1 trials are first-in-human safety and dose-ranging studies enrolling approximately 20 to 80 healthy volunteers or patients, designed to characterize pharmacokinetics, maximum tolerated dose, and dose-limiting toxicities. Phase 2 trials test efficacy signals in the target disease population, typically enrolling 100 to 300 participants, and provide preliminary data on whether the treatment works at doses that are tolerable. Phase 3 trials are the pivotal efficacy studies required for FDA approval: large randomized controlled trials comparing the investigational treatment to the current standard of care or placebo, often enrolling hundreds to thousands of participants across multiple sites, and powered to detect the primary efficacy endpoint with statistical significance. Phase 4 trials are post-marketing surveillance studies conducted after FDA approval, either voluntarily or as a condition of approval (post-marketing commitment or requirement), to assess long-term safety, rare adverse events, or effectiveness in populations not well-represented in pivotal trials.

Observational studies do not use the phase classification. They use study designs (cohort, case-control, cross-sectional, case-only, ecologic) and time perspectives (prospective, retrospective, cross-sectional, other). The ClinicalTrials.gov registry includes both prospective and retrospective observational studies, as well as retrospective chart reviews that some registries would exclude.

Key data fields and the NCT number

Every study in the registry is assigned an NCT number — a unique identifier in the format NCT followed by eight digits (e.g., NCT04368728). The NCT number is the canonical identifier for a clinical trial in the biomedical literature: journals require NCT numbers in publication abstracts, FDA submissions reference NCT numbers, and citations in regulatory documents and systematic reviews use NCT numbers to link publications to registry records. The NCT number persists for the life of the study and is immutable; if a study is terminated or withdrawn, its NCT number record remains in the database with updated status.
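Because the identifier format is fixed (the letters NCT followed by eight digits), it can be validated and extracted with a regular expression. The sketch below is illustrative; the helper names are hypothetical, and a pattern match only confirms the format, not that the study exists in the registry.

```python
import re

# "NCT" followed by exactly eight digits, bounded so longer digit runs do not match.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def is_valid_nct(identifier: str) -> bool:
    """Check that a string is exactly one well-formed NCT number."""
    return re.fullmatch(r"NCT\d{8}", identifier) is not None


def extract_nct_ids(text: str) -> list[str]:
    """Return the distinct NCT numbers found in free text, in order of appearance."""
    found: list[str] = []
    for match in NCT_PATTERN.findall(text):
        if match not in found:
            found.append(match)
    return found


abstract = "Registered at ClinicalTrials.gov (NCT04368728); see also NCT04368728."
print(is_valid_nct("NCT04368728"))  # True
print(is_valid_nct("NCT0436872"))   # False
print(extract_nct_ids(abstract))    # ['NCT04368728']
```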

The registry data model is organized into modules. The identification module includes the NCT number, official title, brief title, and acronym. The status module includes overall status (recruiting, active not recruiting, completed, terminated, suspended, withdrawn, not yet recruiting, enrolling by invitation, or unknown), overall status dates, start date, primary completion date, and study completion date. The design module includes study type, phases, allocation (randomized or non-randomized), intervention model (parallel group, crossover, factorial, sequential, or single group), primary purpose (treatment, prevention, diagnostic, supportive care, screening, health services research, basic science, device feasibility, or other), and masking (none, single, double, triple, or quadruple — referring to participants, care providers, investigators, and outcomes assessors).

The conditions module contains free-text condition names and MeSH (Medical Subject Headings) condition terms assigned by NLM indexers. MeSH terms follow a controlled hierarchical vocabulary and enable consistent retrieval across conditions named differently by different sponsors (e.g., “non-small cell lung carcinoma,” “NSCLC,” and “lung cancer” are all mapped to MeSH terms under Neoplasms). The interventions module describes each intervention by name and type: drug, device, behavioral, procedure, dietary supplement, genetic, biological, combination product, diagnostic test, or other. The eligibility module specifies inclusion and exclusion criteria as free text, plus structured fields for sex (all, female, male), minimum and maximum age, and whether healthy volunteers are accepted. The outcomes module lists primary and secondary outcome measures with measure name, description, and time frame. The sponsor/collaborators module identifies the lead sponsor and any collaborating organizations, with a classification field distinguishing NIH, other U.S. federal agency, industry, and other as sponsor types.

Module | Key fields | Research use
identificationModule | nctId, briefTitle, officialTitle, acronym | Linking publications, cross-referencing FDA submissions
statusModule | overallStatus, startDateStruct, primaryCompletionDateStruct | Pipeline tracking, enrollment feasibility, results compliance
designModule | studyType, phases, allocation, interventionModel, masking | Study quality assessment, meta-analysis eligibility
conditionsModule | conditions, keywords; MeSH terms | Disease-area portfolio analysis, systematic reviews
interventionsModule | interventionName, interventionType | Drug pipeline tracking, competitive intelligence
eligibilityModule | eligibilityCriteria, sex, minimumAge, maximumAge | Patient matching, diversity analysis, inclusion/exclusion audit
outcomesModule | primaryOutcomes, secondaryOutcomes (measure, timeFrame) | Endpoint consistency review, outcome switching detection
sponsorCollaboratorsModule | leadSponsor (name, class), collaborators | Industry portfolio analysis, conflict of interest research
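As a rough illustration of how the module hierarchy maps onto analysis-ready rows, the sketch below flattens one study record into a flat dictionary. The function name and the sample record are hypothetical (the NCT number is a placeholder, not a real study); the field paths follow the module and field names listed above.

```python
from typing import Any


def flatten_study(study: dict[str, Any]) -> dict[str, Any]:
    """Pull a few commonly used fields out of the nested module structure."""
    proto = study.get("protocolSection", {})
    ident = proto.get("identificationModule", {})
    status = proto.get("statusModule", {})
    design = proto.get("designModule", {})
    conditions = proto.get("conditionsModule", {})
    sponsor = proto.get("sponsorCollaboratorsModule", {}).get("leadSponsor", {})
    return {
        "nct_id": ident.get("nctId"),
        "brief_title": ident.get("briefTitle"),
        "overall_status": status.get("overallStatus"),
        "primary_completion": status.get("primaryCompletionDateStruct", {}).get("date"),
        "study_type": design.get("studyType"),
        "phases": design.get("phases", []),
        "conditions": conditions.get("conditions", []),
        "sponsor_name": sponsor.get("name"),
        "sponsor_class": sponsor.get("class"),
    }


# Hypothetical record for illustration only.
example = {
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00000000", "briefTitle": "Example Trial"},
        "statusModule": {
            "overallStatus": "RECRUITING",
            "primaryCompletionDateStruct": {"date": "2026-06"},
        },
        "designModule": {"studyType": "INTERVENTIONAL", "phases": ["PHASE3"]},
        "conditionsModule": {"conditions": ["Breast Cancer"]},
        "sponsorCollaboratorsModule": {
            "leadSponsor": {"name": "Example Pharma", "class": "INDUSTRY"}
        },
    }
}

row = flatten_study(example)
print(row["nct_id"], row["phases"], row["sponsor_class"])
```

Using `.get()` with empty defaults at every level keeps the function from raising on records where a module is missing, which is common in older registrations.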

Sponsor categories and industry dominance of Phase 3

Across the full ClinicalTrials.gov registry, approximately 50% of registered studies list an industry sponsor, approximately 20% list NIH or another U.S. federal agency, and approximately 30% list academic medical centers, universities, hospitals, or other non-industry organizations. The balance shifts sharply by phase: Phase 3 trials are overwhelmingly industry-sponsored, reflecting the cost structure of pivotal efficacy trials. A multi-site Phase 3 oncology trial enrolling 1,500 patients across 200 centers can cost $300 million to $600 million or more to execute — well beyond the capacity of NIH grant funding for a single study.

Among the highest-volume industry sponsors in the registry are Pfizer, Roche and its Genentech subsidiary, Johnson & Johnson (Janssen), Novartis, AstraZeneca, and Merck. These sponsors each have hundreds to over a thousand registered studies reflecting decades of product pipeline. NIH-funded research flows primarily through its 27 institutes and centers: the National Cancer Institute (NCI) is the dominant NIH funder of oncology trials; the National Institute of Allergy and Infectious Diseases (NIAID) funds infectious disease and vaccine trials; the National Heart, Lung, and Blood Institute (NHLBI) funds cardiovascular and pulmonary trials; the National Institute of Mental Health (NIMH) funds psychiatry trials. Academic medical centers — Mayo Clinic, Cleveland Clinic, MD Anderson Cancer Center, Johns Hopkins, UCSF, Memorial Sloan Kettering — appear as lead sponsors for investigator-initiated trials, and frequently as collaborators in industry-sponsored multi-site studies.

The geographic distribution of ClinicalTrials.gov studies reflects the global nature of modern drug development. Approximately half of all registered studies include at least one site outside the United States. Multi-regional clinical trials (MRCTs) that enroll patients in North America, Europe, and Asia at the same time have become standard for Phase 3 programs seeking a single data package that satisfies both FDA and the European Medicines Agency. Countries with large registered trial volumes include the United States, Canada, France, Germany, the United Kingdom, China, Australia, Spain, and Italy. Trial conduct in China and India grew substantially in the 2010s as sponsors sought faster enrollment, lower per-patient costs, and access to treatment-naive populations in certain therapeutic areas.

Disease area composition and the COVID-19 surge

Oncology is the largest disease area in ClinicalTrials.gov by a substantial margin, accounting for approximately 35% of all registered studies. The concentration reflects the convergence of scientific opportunity (the cancer biology revolution following the Human Genome Project), regulatory incentives (FDA breakthrough therapy designation, accelerated approval, and priority review for serious conditions with unmet need), and commercial returns (cancer drugs represent the largest segment of pharmaceutical revenue globally). Within oncology, breast cancer, lung cancer (NSCLC), colorectal cancer, and leukemia each have thousands of registered studies.

Diabetes and endocrinology, cardiovascular disease, psychiatry and neurology, and infectious disease are the next largest areas. The distribution of trials across disease areas does not map proportionally to disease burden: conditions with high mortality but small market size (neglected tropical diseases, rare pediatric conditions) are substantially underrepresented relative to prevalence, while conditions with large commercial markets (type 2 diabetes, heart failure, major depressive disorder) have trial volumes driven by competitive pharmaceutical development programs.

The COVID-19 pandemic produced an unprecedented surge in trial registration. Approximately 11,000 COVID-19–related studies were registered in 2020 and 2021 combined, more than any single disease had accumulated in any comparable two-year period in the registry's history. The surge included vaccine trials (Oxford/AstraZeneca, Pfizer/BioNTech, Moderna, Janssen, Novavax, Sinovac, and dozens of others in parallel development), antiviral trials (remdesivir, molnupiravir, nirmatrelvir/ritonavir), and repurposing trials for existing drugs including hydroxychloroquine, ivermectin, and dexamethasone, the last of which showed a survival benefit for severely ill patients in the RECOVERY trial. Many COVID trials were registered but never started or were terminated early due to enrollment failure, vaccine availability, or funding cessation, contributing to a visible uptick in terminated and withdrawn status in 2021 and 2022.

Publication bias and the file drawer problem

Publication bias is the systematic tendency for studies with positive or statistically significant results to be published in peer-reviewed journals while studies with null or negative results are not. It was a recognized problem in clinical research long before ClinicalTrials.gov existed. The consequences are serious: systematic reviews and meta-analyses that aggregate published literature to inform clinical guidelines or FDA decisions overestimate the efficacy of treatments because the evidence base excludes negative trials. The phenomenon was first formally analyzed by the psychologist Robert Rosenthal in 1979 as the “file drawer problem”: negative results that remain unpublished in researchers' file drawers.

Ben Goldacre's 2012 book Bad Pharma brought the issue to broad public attention, documenting case studies in which pharmaceutical companies and academic researchers selectively published positive results from multi-trial programs while negative trials remained unpublished — causing prescribers and regulators to systematically overestimate drug efficacy. Goldacre was a co-founder of the AllTrials campaign, launched in 2013, which advocated for mandatory registration and results reporting for all clinical trials and collected endorsements from medical journals, regulatory agencies, and professional societies across more than 100 countries. The AllTrials campaign accelerated regulatory and journal policy changes that increased compliance with FDAAA 801 requirements.

The FDAAA 801 results database was specifically designed to address publication bias by requiring results reporting independent of journal publication. A sponsor that conducts a Phase 3 trial, finds that the drug does not work, and decides not to submit an NDA is still required to post results to ClinicalTrials.gov within 12 months of primary completion. The results database therefore contains trial outcomes that are not otherwise in the scientific literature — a valuable but underutilized source for systematic reviews and regulatory analysis.

A related concern is outcome switching: the practice of registering one set of primary and secondary outcomes before a trial begins, then reporting different outcomes in publications when the pre-specified primary outcome is not met. The COMPARE project, led by researchers at the University of Oxford, systematically compared pre-specified outcomes in ClinicalTrials.gov registrations to reported outcomes in journal publications for a sample of trials and found that a majority had at least one discrepancy. Outcome switching inflates apparent efficacy and statistical significance. The Consolidated Standards of Reporting Trials (CONSORT) statement and ICMJE journal requirements now require explicit reconciliation of registered and reported outcomes.
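A very simplified version of the registered-versus-reported comparison can be sketched as a set difference over normalized outcome names. This is not the COMPARE project's method, which relied on structured human assessment; it is only an illustration of the idea, with hypothetical function names and made-up outcome lists. Exact string matching will miss paraphrased outcomes, so real use would need manual review or fuzzy matching.

```python
def normalize(measure: str) -> str:
    """Lower-case and collapse whitespace so trivial formatting differences match."""
    return " ".join(measure.lower().split())


def compare_outcomes(registered: list[str], reported: list[str]) -> dict[str, list[str]]:
    """List outcomes that were registered but not reported, and vice versa."""
    reg = {normalize(m) for m in registered}
    rep = {normalize(m) for m in reported}
    return {
        "missing_from_report": sorted(reg - rep),
        "added_in_report": sorted(rep - reg),
    }


# Hypothetical outcome lists for illustration only.
registered = ["Overall survival at 24 months", "Progression-free survival"]
reported = ["Progression-free  Survival", "Objective response rate"]

diff = compare_outcomes(registered, reported)
print(diff["missing_from_report"])  # ['overall survival at 24 months']
print(diff["added_in_report"])      # ['objective response rate']
```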

The ClinicalTrials.gov API

ClinicalTrials.gov has offered two programmatic access routes. The classic (legacy) API at classicapi.ct.gov/api/query/ used XML responses and a query syntax based on field names and Boolean operators; NLM retired it in June 2024 together with the classic website, so new development should target the v2 API. The v2 API at clinicaltrials.gov/api/v2/studies returns JSON with a consistent schema organized by the module hierarchy described above.

The v2 API requires no authentication. All requests are unauthenticated HTTP GET calls; rate limiting applies but is generous for research use. The primary query parameters are: query.cond for condition/disease terms, query.term for general keyword search, query.intr for intervention name, query.spons for sponsor name, filter.overallStatus for study status, filter.geo for geographic restriction, and filter.advanced for Essie expressions over any search area, which is how a phase restriction such as AREA[Phase]PHASE3 is expressed (there is no dedicated phase parameter). The fields parameter accepts a comma-separated list of field paths to return, allowing callers to reduce response size by requesting only needed fields. Pagination uses pageSize (maximum 1,000 per page) and pageToken (returned as nextPageToken in the response when more results exist). When countTotal=true is set, a totalCount field in the response gives the total matching study count across all pages.
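The pagination scheme described above (follow nextPageToken until it is absent) can be sketched as a generator. The helper names here are hypothetical, and the fetch function is injectable so the loop can be exercised offline; the demo at the bottom uses canned pages rather than the live API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://clinicaltrials.gov/api/v2/studies"


def fetch_page(params: dict) -> dict:
    """Fetch one page of results from the v2 studies endpoint."""
    with urlopen(f"{BASE_URL}?{urlencode(params)}", timeout=30) as resp:
        return json.load(resp)


def iter_studies(params: dict, fetch=fetch_page, max_pages=None):
    """Yield study records page by page, following nextPageToken."""
    params = dict(params)  # copy so the caller's dict is not mutated
    pages = 0
    while True:
        page = fetch(params)
        yield from page.get("studies", [])
        pages += 1
        token = page.get("nextPageToken")
        if not token or (max_pages is not None and pages >= max_pages):
            break
        params["pageToken"] = token


# Offline demonstration with canned pages instead of a network call.
_canned = {
    None: {"studies": [{"id": 1}, {"id": 2}], "nextPageToken": "t1"},
    "t1": {"studies": [{"id": 3}]},
}


def fake_fetch(params: dict) -> dict:
    return _canned[params.get("pageToken")]


print([s["id"] for s in iter_studies({"query.cond": "Neoplasms"}, fetch=fake_fetch)])
# [1, 2, 3]
```

Against the live API, a call such as `iter_studies({"query.cond": "Neoplasms", "pageSize": 1000}, max_pages=5)` would cap the download at five pages; whether the default urllib client is accepted by the server is an assumption worth checking, and `requests` can be substituted in `fetch_page`.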

The response JSON structure nests data within a studies array. Each element has a protocolSection containing the pre-results trial design information organized by module, and a resultsSection containing posted results data (participant flow, baseline characteristics, outcome data, adverse events) when results have been submitted. Callers accessing the results database should check for the presence of resultsSection before attempting to parse it, as it is absent for studies that have not yet posted results.
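The presence check recommended above might look like the following sketch, which partitions study records by whether a resultsSection key exists. The function name and the two sample records are hypothetical.

```python
def split_by_results(studies: list[dict]) -> tuple[list, list]:
    """Partition NCT numbers by whether the record carries a resultsSection."""
    with_results, without_results = [], []
    for study in studies:
        nct = (
            study.get("protocolSection", {})
            .get("identificationModule", {})
            .get("nctId")
        )
        # resultsSection is simply absent when no results have been posted.
        if "resultsSection" in study:
            with_results.append(nct)
        else:
            without_results.append(nct)
    return with_results, without_results


# Hypothetical records for illustration only.
sample = [
    {
        "protocolSection": {"identificationModule": {"nctId": "NCT00000001"}},
        "resultsSection": {},
    },
    {"protocolSection": {"identificationModule": {"nctId": "NCT00000002"}}},
]

posted, not_posted = split_by_results(sample)
print(posted)      # ['NCT00000001']
print(not_posted)  # ['NCT00000002']
```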

For bulk downloads, NLM provides a complete database export at clinicaltrials.gov/ct2/resources/download. The bulk download is available as a ZIP archive of JSON files (one file per study), updated daily. As of 2024 the full export is approximately 10–15 GB uncompressed. Researchers conducting large-scale analysis across the full 500,000-study corpus should use the bulk download rather than the API to avoid rate limiting and to ensure a consistent snapshot. The bulk download is the basis for most academic studies analyzing the composition of the clinical trial registry.
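A bulk archive of one-JSON-file-per-study can be processed without extracting it to disk. The sketch below tallies overallStatus across an archive; it assumes each JSON file has the same protocolSection layout as an API record, which should be verified against the actual export. The function name is hypothetical, and the demo builds a tiny in-memory archive rather than using the real download.

```python
import io
import json
import zipfile
from collections import Counter


def tally_status(zip_source) -> Counter:
    """Count studies by overallStatus across every .json file in a ZIP archive."""
    counts: Counter = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue
            with archive.open(name) as handle:
                study = json.load(handle)
            status = (
                study.get("protocolSection", {})
                .get("statusModule", {})
                .get("overallStatus", "UNKNOWN")
            )
            counts[status] += 1
    return counts


# Build a small in-memory archive with hypothetical records for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for i, status in enumerate(["COMPLETED", "COMPLETED", "RECRUITING"]):
        record = {"protocolSection": {"statusModule": {"overallStatus": status}}}
        archive.writestr(f"NCT0000000{i}.json", json.dumps(record))
buffer.seek(0)

status_tally = tally_status(buffer)
print(status_tally.most_common())  # [('COMPLETED', 2), ('RECRUITING', 1)]
```

Reading one member at a time keeps memory use flat, which matters for an archive of several gigabytes.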

Aggregate statistics and registry composition

The composition of the ClinicalTrials.gov registry as of 2024 reveals several structural patterns. Approximately 40% of all registered studies have a status of “completed” — the single largest status category. Approximately 25% are currently recruiting. Approximately 15% are terminated before reaching primary completion, a rate that is higher in academic-sponsored trials than in industry-sponsored ones and that reflects the realities of investigator-initiated research: enrollment shortfalls, loss of funding, or the emergence of a definitive result from another trial that makes continued enrollment unethical or unnecessary.

Median enrollment across all registered studies is approximately 50 participants, reflecting the fact that the registry includes many small Phase 1 and Phase 2 studies alongside the large Phase 3 trials that dominate in enrollment-weighted statistics. The median enrolled patient count for Phase 3 studies alone is approximately 300–500, while the mean is considerably higher due to the long right tail of mega-trials in cardiovascular disease and oncology that enroll 5,000 or more participants. The largest trials in the registry — large simple trials in cardiovascular prevention and mortality — have enrolled tens of thousands of participants.
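The gap between median and mean enrollment is easy to see with a small made-up sample. The numbers below are hypothetical and chosen only to show how a few very large trials pull the mean far above the median.

```python
from statistics import mean, median

# Hypothetical enrollment counts: mostly small studies plus two mega-trials.
enrollments = [24, 30, 45, 50, 60, 120, 300, 450, 5000, 20000]

sample_median = median(enrollments)
sample_mean = mean(enrollments)

print(f"median = {sample_median}")  # median = 90.0
print(f"mean   = {sample_mean}")    # mean   = 2607.9
```

For this reason, summaries of registry enrollment are usually more informative when they report the median, or the median alongside the mean.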

The randomized controlled trial (RCT) is the dominant design for interventional studies, with approximately 70% of interventional studies using randomized allocation. Double-blind masking (where both participants and outcome assessors are masked to treatment assignment) is standard for placebo-controlled drug trials. Open-label designs predominate in surgical trials, behavioral interventions, and many oncology immunotherapy trials where the toxicity profile makes blinding logistically difficult. The trend in oncology has shifted toward open-label designs with objective response rate as the primary endpoint (where blinding is less critical) rather than overall survival endpoints that require longer follow-up.

Python: querying the ClinicalTrials.gov API v2

The following script demonstrates the ClinicalTrials.gov v2 API. It queries for recruiting Phase 3 oncology trials, extracts NCT number, title, sponsor, enrollment, and primary completion date, and prints the ten largest by enrollment. It then fetches the total count for each study phase across all cancer trials to show the phase distribution, and computes the status distribution for Phase 3 oncology trials. The script requires only requests and pandas; no API key is needed.

import requests
import pandas as pd

# ---------------------------------------------------------------------------
# ClinicalTrials.gov API v2 Example
# Docs: https://clinicaltrials.gov/data-api/api
# No API key required.
# ---------------------------------------------------------------------------

BASE = "https://clinicaltrials.gov/api/v2/studies"

# ---------------------------------------------------------------------------
# Part 1: Top 10 largest recruiting Phase 3 oncology trials by enrollment
# ---------------------------------------------------------------------------
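# We query for:
#   - status: RECRUITING (filter.overallStatus)
#   - phase: PHASE3, written as the Essie expression AREA[Phase]PHASE3 in
#     filter.advanced because the v2 API has no dedicated phase parameter
#   - condition: Neoplasms (MeSH term for cancer)
# Results are sorted by enrollment on the server so the first page already
# holds the largest trials.
# The v2 API returns JSON with a "studies" list and optional "nextPageToken".

params_phase3 = {
    "filter.overallStatus": "RECRUITING",
    "filter.advanced": "AREA[Phase]PHASE3",
    "query.cond": "Neoplasms",
    "fields": (
        "NCTId,BriefTitle,LeadSponsorName,EnrollmentCount,"
        "PrimaryCompletionDate,OverallStatus,Phase"
    ),
    "sort": "EnrollmentCount:desc",  # largest trials first
    "pageSize": 200,
    "format": "json",
}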

resp = requests.get(BASE, params=params_phase3, timeout=30)
resp.raise_for_status()
data = resp.json()

studies = data.get("studies", [])
print(f"Retrieved {len(studies)} recruiting Phase 3 oncology studies (first page)")

rows = []
for s in studies:
    proto = s.get("protocolSection", {})
    id_mod     = proto.get("identificationModule", {})
    status_mod = proto.get("statusModule", {})
    design_mod = proto.get("designModule", {})
    sponsor_mod = proto.get("sponsorCollaboratorsModule", {})

    nct    = id_mod.get("nctId", "")
    title  = id_mod.get("briefTitle", "")
    sponsor = sponsor_mod.get("leadSponsor", {}).get("name", "")
    enroll = design_mod.get("enrollmentInfo", {}).get("count")
    pcd    = status_mod.get("primaryCompletionDateStruct", {}).get("date", "")

    rows.append({
        "NCT Number":              nct,
        "Brief Title":             title[:80] + ("..." if len(title) > 80 else ""),
        "Lead Sponsor":            sponsor[:40] + ("..." if len(sponsor) > 40 else ""),
        "Enrollment":              enroll,
        "Primary Completion Date": pcd,
    })

df = pd.DataFrame(rows)
df["Enrollment"] = pd.to_numeric(df["Enrollment"], errors="coerce")
df_sorted = df.sort_values("Enrollment", ascending=False).head(10).reset_index(drop=True)

print("\n=== Top 10 Recruiting Phase 3 Oncology Trials by Enrollment ===")
print(f"  {'#':<3}  {'NCT Number':<14}  {'Enrollment':>10}  {'Prim. Completion':<18}  Title")
print("  " + "-" * 110)
for i, row in df_sorted.iterrows():
    enroll_str = f"{int(row['Enrollment']):,}" if pd.notna(row['Enrollment']) else "N/A"
    print(
        f"  {i+1:<3}  {row['NCT Number']:<14}  {enroll_str:>10}  "
        f"{row['Primary Completion Date']:<18}  {row['Brief Title']}"
    )

print("\nTop 5 sponsors in recruiting Phase 3 oncology trials:")
sponsor_counts = df["Lead Sponsor"].value_counts().head(5)
for sponsor, count in sponsor_counts.items():
    print(f"  {sponsor:<45}  {count:>3} trials")

# ---------------------------------------------------------------------------
# Part 2: Phase distribution for all cancer trials
# ---------------------------------------------------------------------------
# Query all cancer trials (no status filter), once per phase.
# The v2 API has no native group-by, so we issue one count query per phase.
# totalCount is returned only when countTotal=true; pageSize=1 with a single
# field keeps each response small.

PHASES = ["EARLY_PHASE1", "PHASE1", "PHASE2", "PHASE3", "PHASE4", "NA"]
phase_counts = {}

print("\n=== Fetching phase distribution for all cancer trials (Neoplasms) ===")
print("    (One count query per phase...)")

for phase in PHASES:
    params_count = {
        "query.cond": "Neoplasms",
        "filter.advanced": f"AREA[Phase]{phase}",
        "countTotal": "true",
        "pageSize": 1,
        "fields": "NCTId",
        "format": "json",
    }
    r = requests.get(BASE, params=params_count, timeout=30)
    r.raise_for_status()
    total = r.json().get("totalCount", 0)
    phase_counts[phase] = total
    print(f"    Phase {phase:<15}: {total:>6,} trials")

# Trials registered with two phases (for example Phase 1/Phase 2) may match
# more than one query, so this sum can exceed the number of distinct trials.
total_cancer = sum(phase_counts.values())
print(f"\n=== Cancer Trial Phase Distribution ===")
print(f"  {'Phase':<20}  {'Count':>8}  {'Share':>7}  {'Bar'}")
print("  " + "-" * 65)
phase_labels = {
    "EARLY_PHASE1": "Phase 0 / Early 1",
    "PHASE1":       "Phase 1",
    "PHASE2":       "Phase 2",
    "PHASE3":       "Phase 3",
    "PHASE4":       "Phase 4 (post-mkt)",
    "NA":           "Not Applicable / N/A",
}
for phase in PHASES:
    count = phase_counts[phase]
    pct   = count / total_cancer * 100 if total_cancer else 0
    bar   = "#" * int(pct / 2)
    label = phase_labels.get(phase, phase)
    print(f"  {label:<20}  {count:>8,}  {pct:>6.1f}%  {bar}")
print(f"  {'TOTAL':<20}  {total_cancer:>8,}")

# ---------------------------------------------------------------------------
# Part 3: Status distribution for Phase 3 oncology trials
# ---------------------------------------------------------------------------
STATUSES = [
    "RECRUITING", "ACTIVE_NOT_RECRUITING", "COMPLETED",
    "TERMINATED", "SUSPENDED", "WITHDRAWN", "NOT_YET_RECRUITING",
]
status_counts = {}
for status in STATUSES:
    params_s = {
        "query.cond": "Neoplasms",
        "filter.advanced": "AREA[Phase]PHASE3",
        "filter.overallStatus": status,
        "countTotal": "true",  # required for totalCount to be returned
        "pageSize": 1,
        "fields": "NCTId",
        "format": "json",
    }
    r = requests.get(BASE, params=params_s, timeout=30)
    r.raise_for_status()
    status_counts[status] = r.json().get("totalCount", 0)

total_p3 = sum(status_counts.values())
print(f"\n=== Phase 3 Oncology Trial Status Distribution ===")
print(f"  {'Status':<28}  {'Count':>7}  {'Share':>7}")
print("  " + "-" * 48)
for status, count in sorted(status_counts.items(), key=lambda x: -x[1]):
    pct = count / total_p3 * 100 if total_p3 else 0
    label = status.replace("_", " ").title()
    print(f"  {label:<28}  {count:>7,}  {pct:>6.1f}%")
print(f"  {'TOTAL':<28}  {total_p3:>7,}")

The script uses the totalCount field from the v2 API response to retrieve phase and status counts without downloading full study records for each combination; setting pageSize=1 and requesting only the NCT ID field makes these count queries fast. The phase distribution query reveals that Phase 2 oncology trials outnumber Phase 3 by a roughly 3:1 ratio, reflecting the attrition funnel of drug development: most Phase 2 oncology programs do not advance to Phase 3 due to insufficient efficacy signal, unacceptable toxicity, or commercial decisions by sponsors. The script does not compute enrollment-weighted figures, but because Phase 3 trials enroll far more participants per study, they account for a large share of actual patient exposure despite their smaller count.

Data limitations and research notes

ClinicalTrials.gov data quality is uneven across sponsor types and registration periods. Many older records — particularly from before the FDAAA 801 mandatory registration era — were registered retrospectively or have incomplete fields. Enrollment counts are frequently estimates at the time of registration and are not always updated to reflect actual enrollment, so the enrolled count field should be treated as approximate for completed trials unless the results section provides an actual participant count. Condition and intervention fields are free-text, creating substantial variation in naming conventions that MeSH normalization only partially addresses.
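One way to handle the estimate-versus-actual problem is to carry the provenance of each enrollment figure alongside the count. The sketch below assumes that the designModule's enrollmentInfo object includes a type field with values such as ACTUAL or ESTIMATED next to count; that field name reflects my understanding of the v2 schema and should be checked against a live record. The function name and sample records are hypothetical.

```python
def enrollment_with_provenance(study: dict) -> tuple:
    """Return (count, type) so estimated figures can be filtered or flagged."""
    info = (
        study.get("protocolSection", {})
        .get("designModule", {})
        .get("enrollmentInfo", {})
    )
    # "type" is assumed to be ACTUAL or ESTIMATED when present.
    return info.get("count"), info.get("type", "UNSPECIFIED")


# Hypothetical records for illustration only.
records = [
    {"protocolSection": {"designModule": {"enrollmentInfo": {"count": 412, "type": "ACTUAL"}}}},
    {"protocolSection": {"designModule": {"enrollmentInfo": {"count": 600, "type": "ESTIMATED"}}}},
    {"protocolSection": {"designModule": {}}},
]

pairs = [enrollment_with_provenance(r) for r in records]
actual_only = [count for count, kind in pairs if kind == "ACTUAL"]

print(pairs)        # [(412, 'ACTUAL'), (600, 'ESTIMATED'), (None, 'UNSPECIFIED')]
print(actual_only)  # [412]
```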

The registry does not directly link to FDA submission records, NDA approval databases, or published journal articles — though NCT numbers in MEDLINE records enable cross-linking via PubMed's Clinical Trial filter and the PubMed-to-CT linkage maintained by NLM. Researchers building linked datasets should use the NCT number as the join key across PubMed, the FDA drug approval database (Drugs@FDA), and the EMA clinical data repository. The International Clinical Trials Registry Platform (ICTRP), maintained by the World Health Organization, aggregates data from ClinicalTrials.gov and 17 other national and regional primary registries, providing a single query interface across international trial registration.
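Using the NCT number as a join key can be sketched with plain dictionaries. The records below are hypothetical stand-ins for a registry extract and a set of publication records that already carry NCT numbers; the function name and the placeholder publication identifiers are illustrative only.

```python
from collections import defaultdict


def link_publications(registry: list[dict], publications: list[dict]) -> dict:
    """Map each registered NCT number to the publication IDs that cite it."""
    by_nct = defaultdict(list)
    for pub in publications:
        for nct in pub["nct_ids"]:
            by_nct[nct].append(pub["pub_id"])
    return {rec["nct_id"]: by_nct.get(rec["nct_id"], []) for rec in registry}


# Hypothetical data for illustration only.
registry = [
    {"nct_id": "NCT00000001", "status": "COMPLETED"},
    {"nct_id": "NCT00000002", "status": "COMPLETED"},
]
publications = [
    {"pub_id": "PUB-A", "nct_ids": ["NCT00000001"]},
    {"pub_id": "PUB-B", "nct_ids": ["NCT00000001"]},
]

links = link_publications(registry, publications)
unlinked = [nct for nct, pubs in links.items() if not pubs]

print(links)     # {'NCT00000001': ['PUB-A', 'PUB-B'], 'NCT00000002': []}
print(unlinked)  # ['NCT00000002']
```

A trial with no linked publication is only a candidate for being unpublished; the article may exist without an NCT number in its indexed metadata.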

For research on publication bias and outcome reporting, the TrialsTracker tool maintained by the Evidence-Based Medicine DataLab at Oxford automatically monitors ClinicalTrials.gov for overdue results reports and links registered trials to PubMed publications to identify unpublished trials. The RIAT (Restoring Invisible and Abandoned Trials) initiative advocates for independent publication of unpublished trials using raw data obtained through data sharing agreements or regulatory disclosures. These initiatives treat ClinicalTrials.gov as a reference standard for what trials should have been conducted and reported, enabling systematic identification of the gap between what was done and what is publicly known.
