Before Congress writes a law, confirms an official, or hauls an industry in to answer for itself, it holds a hearing—a room with a committee on the dais, a witness at the table under oath or on the record, and a stenographer taking down every word. The Government Publishing Office turns that proceeding into the official published transcript, and stores it in the Congressional Hearings collection on GovInfo: roughly 46,000 hearing records, one per published hearing, each carrying the committee, the title, the date, the Congress, the witnesses, and the full searchable text. It is the federal record of what Congress was told—the testimony, the questioning, and the statements for the record—and one of the richest, least-used corpora the government publishes.
This article covers what the Congressional Hearings dataset is and how GovInfo and the GPO frame it; the constitutional and institutional role hearings play in how committees gather information; the five recognizable hearing types—legislative, oversight, investigative, confirmation, and field—and why the differences matter for analysis; the structure of a hearing transcript and the give-and-take it preserves; the GPO publishing pipeline that turns a proceeding into a citable package; how the hearings record joins to the bills, votes, and laws that hearings feed; the analytical uses of a date-stamped, committee-keyed, full-text corpus of testimony; a Python workflow that pulls hearings from the GovInfo API, tallies them by committee and Congress, and full-text-searches the transcripts by topic; and the caveats—publication lag, uneven coverage across committees, and the gap between a transcript and the proceeding it records—that every analyst must internalize.
What the dataset is
GovInfo is the GPO's digital system for authentic, published federal information: it holds the Federal Register, the Congressional Record, the United States Code, federal court opinions, and the products of Congress—bills, reports, documents, and hearings. Each is organized into a collection with a short code. The hearings live in the collection coded CHRG: the published transcripts of the hearings held by the standing, select, and special committees of the House and the Senate, plus their subcommittees and the joint committees. A congressional hearing is the principal formal mechanism by which a committee gathers information—taking testimony from agency officials, outside experts, industry representatives, advocates, and members of the public—and the published transcript is the official record of that testimony. Surfaced through GovInfo, the hearings record comprises roughly 46,000 published hearings.
In our database this record is stored as the table govinfo_chrg, with the grain of one row per published hearing: a single committee that holds twenty hearings in a Congress contributes twenty rows, each a distinct proceeding. The unit of identity is the GovInfo package ID, a stable identifier the GPO assigns to every published item. The columns capture which committee held the hearing, what it was about, when, in which Congress, who testified, and where the full text and official PDF can be retrieved:
package_id -- GovInfo package ID (e.g. CHRG-118hhrg52345)
chamber -- House, Senate, or Joint
committee -- the committee (and subcommittee) that held the hearing
title -- the published hearing title
hearing_date -- date(s) the hearing was held
congress -- the Congress number (e.g. 118)
session -- 1st or 2nd session of that Congress
hearing_type -- legislative, oversight, investigative, nomination, field
witnesses -- the persons who gave testimony
serial_no -- the committee's serial / print number, where assigned
su_doc_class -- Superintendent of Documents classification number
text_url -- link to the full searchable text (HTML/TXT)
pdf_url -- link to the official, authenticated PDF
The package_id is the load-bearing column. It is the persistent handle that resolves to the hearing's metadata, its full text, and its authenticated PDF on GovInfo, and it is what lets an analyst move from a row in the table to the actual document. The committee and congress columns are the primary grouping keys: nearly every meaningful aggregation of the corpus (how many hearings a committee held, how committee attention shifted across Congresses, which committees took up a given subject) runs on these two fields. The hearing_type distinguishes the recognizable kinds of hearing, described below, and is essential to interpreting any count, because a confirmation hearing and an investigative hearing are doing entirely different institutional work. The witnesses field is the answer to a question no other congressional dataset answers as directly: whom did Congress call? The text_url, in turn, is what makes the collection a corpus rather than a catalog: the full searchable text means the dataset records not just that a hearing happened but everything that was said in it.
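The package ID appears to carry some of the grouping keys in its own spelling. The following is a minimal sketch, assuming hearing IDs follow the layout suggested by the example CHRG-118hhrg52345 (the collection code, the Congress number, a chamber code such as hhrg, shrg, or jhrg, then a document number). The function name parse_package_id and the chamber mapping are illustrative, and the per-package summary metadata remains the authoritative source.

```python
import re

# Assumed layout, inferred from IDs such as CHRG-118hhrg52345:
# "CHRG-" + Congress number + chamber code + document number.
_PACKAGE_ID = re.compile(r"^CHRG-(\d+)([hsj])hrg(\w+)$")
_CHAMBERS = {"h": "House", "s": "Senate", "j": "Joint"}


def parse_package_id(package_id):
    """Split a hearing package ID into its apparent parts, or return None."""
    m = _PACKAGE_ID.match(package_id)
    if m is None:
        return None  # unfamiliar layout: rely on the summary metadata instead
    congress, chamber_code, number = m.groups()
    return {
        "congress": int(congress),
        "chamber": _CHAMBERS[chamber_code],
        "number": number,
    }


print(parse_package_id("CHRG-118hhrg52345"))
```

Returning None for anything that does not match keeps the helper from guessing when an ID uses a layout this sketch does not anticipate.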
What a hearing is and the role it plays
Congress legislates, but before it can legislate intelligently it has to learn, and the hearing is its principal instrument for learning. The power to conduct hearings and compel testimony is not spelled out in the Constitution in so many words, but it has been recognized since the earliest Congresses as inherent in the legislative power: to write good laws and to oversee how laws are executed, Congress must be able to gather facts, and gathering facts means calling witnesses and putting questions to them. The Supreme Court has repeatedly affirmed that the power of inquiry—with the attendant power to compel testimony and the production of documents—is an essential and appropriate auxiliary to the legislative function. The hearing is the forum where that inquiry happens in the open.
The work is done at the committee level, and that is the most important structural fact about the data. The House and the Senate are too large and the legislative agenda too vast for the full chamber to gather information directly; instead the work is divided among standing committees with subject-matter jurisdiction (Judiciary, Armed Services, Finance and Ways and Means, Energy and Commerce, Appropriations, and the rest), each of which holds hearings within its domain, often through specialized subcommittees. A hearing is therefore always anchored to a committee, and the committee is what gives a hearing its jurisdictional meaning: hearings on the same data-privacy bill held by the Commerce Committee, the Judiciary Committee, and the Intelligence Committee would be three different proceedings asking three different questions about one subject. Because every row in the dataset carries its committee, the corpus is, in effect, a map of how Congress has divided the world into jurisdictions and which jurisdiction took up which question, when.
A hearing is also a deliberately public act. Most committee hearings are open, and the published transcript is the durable public memory of them—the reason a statement an agency head made to a committee a decade ago, or the testimony an industry gave about a risk it later realized, remains on the record and citable. That permanence is what makes the collection valuable as data. The transcript captures not just conclusions but the on-the-record back-and-forth: what was asserted, what was conceded under questioning, what was promised, what was evaded. It is the closest thing the federal government publishes to a verbatim account of its own information-gathering.
The five types of hearing
Hearings are not all the same kind of thing, and the dataset rewards keeping the types distinct, because each serves a different institutional purpose and produces a differently shaped record. Five recognizable types account for nearly all of the collection.
Legislative hearings are held on a specific bill or on a legislative proposal a committee is considering. Their purpose is to build the record for a piece of legislation before the committee marks the bill up and reports it: to hear from supporters and opponents, from the agencies that would administer the law, and from the interests it would affect. The witness list at a legislative hearing is itself a statement about whose views the committee thought relevant to the bill, and the testimony becomes part of the legislative history a court may later consult when interpreting the statute.
Oversight hearings examine how an existing program or agency is working—whether a law Congress passed is being implemented as intended, whether an agency is spending its appropriation well, whether a program is achieving its purpose. The archetypal oversight hearing has an agency official in the witness chair answering for the agency's performance. Oversight is Congress checking on the executive branch's execution of the laws, and the closely related budget and appropriations hearings—in which committees take testimony on agency budget requests—are a recurring, calendar-driven subspecies of it.
Investigative hearings are the most consequential and the most visible. When a committee uncovers—or suspects—serious wrongdoing, mismanagement, or a matter of grave public concern, it investigates, and the investigative hearing is where the inquiry surfaces in public. These are the hearings history remembers: the Watergate hearings that exposed a presidential cover-up, the hearings into the financial crisis, the questioning of tobacco executives, the appearances of technology-company chief executives before committees probing privacy and market power. Investigative hearings often involve subpoenaed witnesses and documents, sworn testimony, and adversarial questioning, and their transcripts are among the most substantively dense in the collection.
Confirmation (nomination) hearings are a Senate function, rooted in the Constitution's advice-and-consent role: the Senate does not confirm a nominee for the executive branch or the federal bench without, in most cases, a hearing before the committee of jurisdiction, where senators question the nominee about qualifications, record, and views. The Judiciary Committee's hearings on Supreme Court nominees are the most prominent, but confirmation hearings span the cabinet, the sub-cabinet, ambassadorships, and the courts. They appear only on the Senate side of the collection, and their transcripts are a distinctive record of what a nominee said under questioning before taking office.
Field hearings, finally, are held outside Washington (in a member's district or state, or at the site of the matter under study) to take testimony from people and in places the Washington hearing room cannot reach. They cut across the other types, since a field hearing can be legislative, oversight, or investigative, and are flagged by location rather than by purpose.
The structure of a transcript
A hearing transcript has a recognizable architecture, and understanding it is what lets an analyst read the full text intelligently rather than treating it as an undifferentiated blob. The published hearing opens with front matter: the committee and subcommittee, the hearing title, the date and place, the serial or print number the committee assigned, and a list of the members present and the witnesses who appeared. Then comes the proceeding itself, which unfolds in a predictable order.
The chair opens with a statement framing the hearing's purpose; the ranking minority member usually responds; other members may make opening statements. The witnesses are then recognized in turn to deliver their oral testimony, which is almost always a condensed version of a longer written statement. That fuller prepared statement for the record is reproduced in the transcript in full, even though the witness summarized it aloud—so the published hearing typically contains both what the witness said and the more complete written submission, a distinction worth keeping in mind when mining the text. After the testimony comes the part that makes hearings distinctive as data: the question-and-answer round, in which members question the witnesses, usually under a time limit per member, alternating across party lines. This give-and-take is where assertions are tested, where a witness concedes or evades, and where the on-the-record exchanges that later get quoted occur.
Finally, the transcript closes with material submitted for the record: additional written statements, letters, answers to questions members posed in writing after the hearing (questions for the record), supporting documents, and exhibits. The whole package is a far richer object than a simple speech—it is a structured proceeding with named speakers, prepared and extemporaneous text, adversarial questioning, and a documentary appendix. Because GovInfo publishes the full searchable text along with the official PDF, every layer of this structure is available to analysis: an analyst can study just the prepared statements, just the question-and-answer exchanges, or the entire record, and can attribute passages to the members and witnesses who spoke them.
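Attributing passages to speakers can be sketched with a regular expression. This is a rough illustration only: it assumes each turn begins a line with a title and surname followed by a period (for example "Mr. Smith." or "Senator Jones.") or with "The Chairman.", a convention that is common in published transcripts but not guaranteed. The pattern and the function name split_turns are hypothetical, and real transcripts will need more titles and edge cases than are handled here.

```python
import re

# Assumed convention: a turn starts a line with a title and a surname
# (or "The Chairman" / "The Chairwoman" / "The Chair") followed by a period.
SPEAKER = re.compile(
    r"^\s*((?:Mr|Mrs|Ms|Dr|Senator|Chairman|Chairwoman)\.? [A-Z][\w'-]+"
    r"|The Chair(?:man|woman)?)\.\s+",
    re.MULTILINE,
)


def split_turns(text):
    """Return a list of (speaker, utterance) pairs in transcript order."""
    matches = list(SPEAKER.finditer(text))
    turns = []
    for i, m in enumerate(matches):
        # A turn runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        utterance = " ".join(text[m.end():end].split())
        turns.append((m.group(1), utterance))
    return turns


sample = """
    Mr. Smith. Thank you, Mr. Chairman. My question is about the budget.
    Ms. Lee. The request reflects current staffing.
    The Chairman. The gentleman's time has expired.
"""
for speaker, words in split_turns(sample):
    print(f"{speaker}: {words}")
```

Anchoring the label to the start of a line is what keeps a mid-sentence "Mr. Chairman." from being read as a new turn; a line that happens to begin with such a phrase would still be misread, so the output deserves spot checks.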
The GPO publishing pipeline
The path from a hearing in a committee room to a citable package on GovInfo runs through the Government Publishing Office, and the pipeline explains both the authority and the timing of the data. During the hearing, an official reporter produces a verbatim transcript. The committee then prepares the hearing for publication—a process that can include the witnesses reviewing their testimony for accuracy, the insertion of the prepared statements and the material submitted for the record, and the assignment of a serial or print number. The committee transmits the finished record to the GPO, which composes it, assigns it a Superintendent of Documents classification number, and publishes it as an authenticated electronic document on GovInfo, with both a full-text rendition and a digitally signed PDF that carries the GPO's seal of authenticity.
Two features of this pipeline matter for the data. The first is authenticity: the GPO's mandate is to be the authentic publisher of the federal record, and the signed PDFs on GovInfo are the official, verifiable versions of the hearings, the government's own published copy rather than a third-party scrape. The second is lag. Because publication follows the hearing by the time it takes the committee to finalize the record and the GPO to compose it, a hearing held this month does not appear in the collection this month; the interval varies by committee and can be substantial, sometimes a year or more for hearings with extensive submitted material. The collection is therefore authoritative and complete for hearings that have worked their way through the pipeline, but it is not a real-time feed of what was held last week, a point the caveats return to. GovInfo exposes the whole collection both as browsable pages on govinfo.gov, which need no key, and through a public API at api.govinfo.gov, where a demo key works for light use and a free registered key raises the rate limit.
Joining to bills, votes, and laws
Hearings are most valuable not in isolation but as one stage in the legislative process, and the collection is a natural companion to the other congressional datasets that capture the stages around it. A hearing rarely stands alone: a legislative hearing is held on a bill, an oversight hearing often precedes legislation that responds to what it found, and an investigative hearing can be the proximate cause of a new law. The hearings record is the upstream, deliberative half of a pipeline whose downstream half—the bills introduced, the roll-call votes cast, the laws enacted—lives in adjacent datasets.
The first and most direct link is to bills. A legislative hearing references the measure it considers, and tying the hearing to the bill—by the bill number it names, and by the committee of jurisdiction that holds both the hearing and the bill—lets an analyst reconstruct the record a committee built before it acted: who testified, what they said, and how the bill changed afterward. The second link is to roll-call votes: when a bill that was the subject of a hearing reaches the floor, the votes the members cast can be set against the testimony the committee heard, connecting what Congress was told to what it then did. The third link is to public laws: for the bills that become law, the hearings form part of the documented legislative history—the testimony and committee record that courts and agencies consult when interpreting an ambiguous statute. Because all of these datasets are keyed on the same organizing facts—the Congress number, the chamber, the committee, and the bill—the hearings collection slots into a unified view of the legislative cycle: deliberation in the hearings, action in the bills and votes, and outcome in the laws.
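The bill link can be sketched as a key-extraction step. The example below assumes the bill is cited in the hearing title in the form "H.R. 1234" or "S. 567" and that the bills table is keyed on (congress, bill type, number); the function name bill_keys, the sample rows, and the bills lookup are all hypothetical, and resolutions are left out.

```python
import re

# Matches citations such as "H.R. 1234" or "S. 567". The lookbehind skips
# strings like "U.S. 5", where the "S." is not a Senate bill prefix.
BILL = re.compile(r"(?<![\w.])(H\.\s?R\.|S\.)\s?(\d+)\b")


def bill_keys(title, congress):
    """Return (congress, bill_type, number) keys for bills cited in a title."""
    keys = []
    for prefix, number in BILL.findall(title):
        bill_type = "hr" if prefix.startswith("H") else "s"
        keys.append((congress, bill_type, int(number)))
    return keys


# Illustrative rows only; the IDs, titles, and statuses are made up.
hearing_rows = [
    {"package_id": "CHRG-118hhrg00001", "congress": 118,
     "title": "Legislative Hearing on H.R. 1234 and S. 567"},
    {"package_id": "CHRG-118shrg00002", "congress": 118,
     "title": "Oversight of the Department of Energy"},
]
bills = {(118, "hr", 1234): "became law", (118, "s", 567): "referred"}

for h in hearing_rows:
    for key in bill_keys(h["title"], h["congress"]):
        print(h["package_id"], key, bills.get(key, "(no matching bill)"))
```

Including the Congress number in the key matters because bill numbers restart in each Congress, so "H.R. 1234" alone does not identify a measure.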
Analytical uses
A national, committee-keyed, date-stamped, full-text corpus of testimony supports a set of analyses that no other congressional dataset can.
Tracking congressional attention over time is the most immediate use. Because every hearing carries a committee, a Congress, and full text, an analyst can measure how attention to an issue rose and fell—counting hearings that mention a topic by Congress, watching a subject migrate from a single committee to many, and dating the moment a question moved from the margins onto the national agenda. The collection is, in effect, a longitudinal record of what Congress thought worth scrutinizing, decade by decade.
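One simple way to express attention over time is the share of each Congress's hearings that mention a topic. The sketch below assumes each row is a dictionary with a congress number and a text field holding the transcript already fetched from text_url; both the row shape and the function name attention_share are assumptions made for illustration.

```python
from collections import Counter


def attention_share(rows, topic):
    """Share of each Congress's hearings whose text mentions the topic."""
    total, hits = Counter(), Counter()
    needle = topic.lower()
    for row in rows:
        total[row["congress"]] += 1
        if needle in row["text"].lower():
            hits[row["congress"]] += 1
    return {c: hits[c] / total[c] for c in sorted(total)}


# Illustrative rows only; real text would come from each hearing's text_url.
rows = [
    {"congress": 116, "text": "Testimony on highway funding."},
    {"congress": 117, "text": "Testimony on data privacy and consumers."},
    {"congress": 117, "text": "Testimony on crop insurance."},
    {"congress": 118, "text": "Questions about Data Privacy enforcement."},
]
print(attention_share(rows, "data privacy"))
```

A share is used instead of a raw count so that a Congress with more published hearings overall does not automatically look more attentive to the topic.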
Studying whom Congress calls exploits the witness field. Aggregating witnesses across hearings reveals which agencies, companies, experts, and organizations a committee turns to—and which it does not—making visible the structure of expertise and influence a committee draws on. A repeat witness across many hearings is a node of recognized authority or interest; the composition of a witness list is a statement about whose voice a committee thought the record needed. Text analysis of the testimony goes further still: because the full transcripts are available, the corpus supports the kind of computational study—topic modeling, sentiment of questioning, measuring the framing of an issue—that turns the give-and-take of hearings into quantitative evidence about how Congress and its witnesses talked about a subject.
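The repeat-witness idea can be sketched as a per-committee tally. The example assumes the witnesses field has already been parsed into a list of name strings per hearing, which may not match how the field is actually stored; the function name repeat_witnesses and the placeholder rows are hypothetical.

```python
from collections import Counter, defaultdict


def repeat_witnesses(rows, min_hearings=2):
    """Per committee, list witnesses seen in at least min_hearings hearings."""
    counts = defaultdict(Counter)
    for row in rows:
        # A set, so a name listed twice in one hearing is counted once.
        for name in {w.strip() for w in row["witnesses"]}:
            counts[row["committee"]][name] += 1
    return {
        committee: sorted(
            ((n, c) for n, c in tally.items() if c >= min_hearings),
            key=lambda pair: (-pair[1], pair[0]),
        )
        for committee, tally in counts.items()
    }


# Illustrative rows only; the names are placeholders, not real witnesses.
rows = [
    {"committee": "Committee X", "witnesses": ["Witness A", "Witness B"]},
    {"committee": "Committee X", "witnesses": ["Witness A ", "Witness C"]},
    {"committee": "Committee Y", "witnesses": ["Witness B"]},
]
print(repeat_witnesses(rows))
```

In practice the harder step is name normalization, since the same person may be printed with different titles or spellings across hearings; the strip() call here handles only stray whitespace.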
Finally, connecting hearings to legislative outcomes brings the joins of the previous section to bear: tracing the line from a hearing to the bill it informed, the votes that followed, and the law that resulted, or measuring how often hearings on a subject actually produced legislation and how long the lag was. That is the analysis that turns the collection from an archive of testimony into evidence about whether, and how, the act of being told something moved Congress to act.
Python workflow: hearings from the GovInfo API
The script below uses the GovInfo public API to do two things: pull the hearings published in a date window from the CHRG collection and tally them by committee and by Congress, and then run a full-text search across the transcripts for a topic. The /published endpoint lists the packages in a collection by the date they were issued and paginates with an offsetMark cursor; the /packages/{id}/summary endpoint resolves each hearing's committee, Congress, and witness metadata; and the /search endpoint runs across the full text rather than only the metadata, so a query for a topic finds the hearings that discussed it, not merely those with the word in the title. A demo key works for light use; a free registered key from api.data.gov raises the rate limit. Requirements: requests.
import requests
from collections import Counter
from urllib.parse import parse_qs, urlparse

# GovInfo public API -- the GPO's API for federal publications.
# The Congressional Hearings collection has the collection code CHRG.
# A demo key (DEMO_KEY) works for light use; register at api.data.gov
# for a free key with a higher rate limit. No key is needed to browse
# the same content on govinfo.gov.
BASE = "https://api.govinfo.gov"
API_KEY = "DEMO_KEY"


def _get(path, **params):
    # Every API call carries the key as a query parameter.
    params["api_key"] = API_KEY
    r = requests.get(f"{BASE}{path}", params=params, timeout=120)
    r.raise_for_status()
    return r.json()


def hearings(start_date, end_date, page_size=1000):
    # The /published endpoint lists packages in a collection by the date
    # they were issued. offsetMark paginates; "*" starts the first page.
    offset = "*"
    while True:
        data = _get(
            f"/published/{start_date}/{end_date}",
            collection="CHRG",
            pageSize=page_size,
            offsetMark=offset,
        )
        pkgs = data.get("packages", [])
        if not pkgs:
            break
        for p in pkgs:
            yield p
        # nextPage is a URL when more results remain. Stop when it is absent.
        next_page = data.get("nextPage")
        if not next_page:
            break
        # Assumption: if the response body carries no offsetMark field, the
        # cursor for the following page is read from the nextPage URL.
        offset = data.get("offsetMark") or parse_qs(
            urlparse(next_page).query
        ).get("offsetMark", [None])[0]
        if not offset:
            break


def summary(package_id):
    # Per-package metadata: committee, title, dates, Congress, members,
    # and links to the text/PDF renditions.
    return _get(f"/packages/{package_id}/summary")


# --- 1. Tally hearings by committee and Congress -----------------------
by_committee = Counter()
by_congress = Counter()
n = 0
for p in hearings("2023-01-01T00:00:00Z", "2024-12-31T23:59:59Z"):
    n += 1
    # The package title and id encode the chamber, committee, and Congress;
    # the summary call below resolves them authoritatively.
    s = summary(p["packageId"])
    comms = s.get("committees") or []
    for c in comms:
        by_committee[c.get("committeeName", "(unknown)")] += 1
    cong = s.get("congress")
    if cong:
        by_congress[cong] += 1

print(f"Hearings published in window: {n:,}")
print("\nTop 12 committees by hearing count:")
for name, cnt in by_committee.most_common(12):
    print(f" {name[:48]:<48} {cnt:>5,}")
print("\nBy Congress:")
for cong, cnt in sorted(by_congress.items()):
    print(f" {cong}th Congress {cnt:>5,}")


# --- 2. Full-text search the transcripts by topic ----------------------
# The /search endpoint runs across the full text, not just metadata.
# It is a POST: the query and paging options go in a JSON body.
def search_text(query, congress=None):
    q = f'collection:CHRG "{query}"'
    if congress:
        q += f" congress:{congress}"
    body = {"query": q, "pageSize": 20, "offsetMark": "*",
            "sorts": [{"field": "publishdate", "sortOrder": "DESC"}]}
    r = requests.post(f"{BASE}/search", params={"api_key": API_KEY},
                      json=body, timeout=120)
    r.raise_for_status()
    return r.json()


res = search_text("artificial intelligence", congress=118)
total = res.get("count", 0)
print(f'\nHearings mentioning "artificial intelligence" (118th): {total:,}')
for hit in res.get("results", [])[:8]:
    title = (hit.get("title") or "")[:62]
    date = hit.get("dateIssued", "?")
    print(f" {date} {title}")
Two practical notes apply. First, the committee-and-Congress tally calls the summary endpoint once per hearing, which is fine for a date window but expensive across the whole collection; for national-scale work the GovInfo bulk data repository and the sitemap-style package listings let you fetch the metadata far more efficiently than tens of thousands of per-package calls, and they ship the authoritative MODS metadata for each item. Second, the full-text search is the feature that makes the collection a corpus rather than a catalog: a query scoped with collection:CHRG and a Congress filter returns the hearings whose text matches, and the same search grammar supports phrase queries, committee filters, and date ranges—so the topic tally in the script is the simplest possible version of a much richer attention-tracking analysis. Any production use should request a free API key rather than relying on the demo key's low limit, and should respect the API's pagination and rate limits when walking large windows.
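Respecting the rate limit usually means retrying politely when the API says to slow down. The helper below is a generic sketch, not part of the GovInfo API: the name with_backoff, the retry count, and the doubling delay are all assumptions, and the caller supplies the test that decides whether an exception is a rate-limit error.

```python
import time


def with_backoff(call, is_rate_limited, retries=4, base_delay=1.0,
                 sleep=time.sleep):
    """Run call(); on a rate-limit error wait base_delay * 2**attempt, retry."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that is not a rate-limit error, and give up
            # once the retry budget is spent.
            if attempt == retries or not is_rate_limited(exc):
                raise
            sleep(base_delay * 2 ** attempt)


# With requests, a predicate for is_rate_limited might look like this
# (assuming the rate limiter answers with HTTP 429):
#   lambda exc: (isinstance(exc, requests.HTTPError)
#                and exc.response is not None
#                and exc.response.status_code == 429)
```

Wrapping a per-package call then becomes something like with_backoff(lambda: summary(package_id), is_rate_limited), where summary is the function from the script above. Passing sleep as a parameter keeps the helper testable without real waiting.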
Limitations and analytical caveats
The Congressional Hearings collection is the most complete public record of committee testimony the federal government publishes, but it carries structural limitations that an analyst must internalize before drawing conclusions from it.
There is a publication lag, and not every hearing is published. Because a hearing reaches GovInfo only after the committee finalizes the record and the GPO composes it, the most recent months of hearing activity are systematically under-represented in any snapshot of the collection—a count of hearings in the current Congress will understate the true number simply because some have not yet been published. More fundamentally, not every hearing a committee holds results in a published transcript: some proceedings are never printed, others appear only after long delays, and closed hearings on classified or sensitive matters are generally not published at all. The collection is therefore close to comprehensive for the open, printed record, but it is not a complete census of every gathering a committee held.
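One defensive habit is to separate hearings old enough to have plausibly cleared the pipeline from those still inside the lag window before counting anything. The sketch below assumes each row carries hearing_date as a datetime.date; the 365-day window is an assumption drawn from the "year or more" lag described earlier, not a documented figure, and the function name split_by_lag is hypothetical.

```python
from datetime import date, timedelta


def split_by_lag(rows, snapshot, lag_days=365):
    """Separate hearings held before the lag cutoff from more recent ones."""
    cutoff = snapshot - timedelta(days=lag_days)
    settled = [r for r in rows if r["hearing_date"] <= cutoff]
    recent = [r for r in rows if r["hearing_date"] > cutoff]
    return settled, recent


# Illustrative rows only.
rows = [
    {"package_id": "example-1", "hearing_date": date(2023, 6, 1)},
    {"package_id": "example-2", "hearing_date": date(2024, 9, 15)},
]
settled, recent = split_by_lag(rows, snapshot=date(2025, 1, 1))
print(len(settled), "settled;", len(recent), "inside the lag window")
```

Counts built from the settled set are comparable across periods; counts that include the recent set should be reported as provisional.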
Coverage is uneven across committees and time. The collection is assembled from the publishing practices of fifty-odd independent committees across both chambers, and those practices differ. Some committees print their hearings promptly and completely; others are slower or less consistent. Digital coverage is also denser for recent Congresses than for older ones, where the published record may be thinner or available only as scanned images rather than clean full text. Any cross-committee or long-run comparison has to account for these differences: a committee that appears to hold fewer hearings may simply publish fewer of them, and an apparent rise in hearings on a topic over decades may partly reflect improving digital coverage rather than a real change in congressional attention.
A transcript is a record of a proceeding, not the proceeding itself. The published hearing reflects editorial conventions: oral testimony is often a summary of a fuller written statement that is also printed, questions for the record and submitted material are inserted after the fact, and witnesses may review and lightly correct their testimony before publication. Text analysis that does not separate the prepared statements from the extemporaneous question-and-answer, or that treats a witness's printed prepared statement as if it were spoken aloud, will misattribute both volume and tone. The structure described earlier is not decoration—it is the key to reading the text correctly, and a corpus analysis that ignores it conflates layers of the record that mean different things.
A hearing is not legislation, and witness lists encode selection. The collection records deliberation, not outcome: many hearings are held on bills that never advance, on subjects that never produce a law, and on matters raised for the record more than for action. Inferring legislative consequence from the mere existence of a hearing over-reads the data; the connection to bills, votes, and laws has to be established through the joins, not assumed. And the witness list, valuable as it is, is a curated artifact: a committee chooses whom to invite, and that choice reflects the committee's priorities and the majority's framing. Treating the witnesses at a hearing as a neutral sample of expertise, rather than as a selected set that itself carries information about the committee's purpose, misreads what the field represents.
Held with these caveats in mind, the govinfo_chrg table is a uniquely valuable resource: a committee-keyed, date-stamped, full-text record of roughly 46,000 published hearings—the official account of what agency officials, experts, industry, and the public told Congress, the deliberative record that the bills, votes, and laws downstream of it can only be fully understood against.
Related writing
CRS Reports: The Federal Database Behind Congress’s Own Nonpartisan Research — The other half of how Congress informs itself: where hearings gather testimony from outside witnesses on the record, the Congressional Research Service produces the nonpartisan analysis members read in preparation, and the two together are the deliberative input to the legislative process.
Congressional Voting Records: The Federal Database Behind Every House and Senate Roll Call Vote — The downstream action a hearing can lead to: the roll-call record captures how members voted on the floor, and setting those votes against the testimony a committee heard connects what Congress was told to what it then did.
US Public Laws: The Federal Record of Every Law Congress Has Enacted — The end of the pipeline that hearings feed: for the bills that become law, the hearings form part of the documented legislative history that courts and agencies consult when interpreting an ambiguous statute.