Writing

Technical notes from building intelligence infrastructure.

Long-form, light on hype. Architectures, trade-offs, and post-mortems from work on censorship measurement, OSINT pipelines, election analysis, and post-quantum communications.

Browse federal data by topic

Financial & Markets · Healthcare · Public Health · Safety · Environment & Energy · Money & Politics · Enforcement & Oversight · Trade, Immigration & Labor · Disasters & Hazards · Education & Research · Economic Indicators

The investigations

The OrganWatch Investigation · 9 parts
The US organ-procurement, transplant, and tissue system, read from the government’s own records — the monopoly, the bedside, the money, the consent gaps, the court record. Nine parts, institution-level only.
The Farmland Register · 5 parts
Who holds American agricultural land from abroad — the federal register nobody reads, the shell structures behind friendly flags, the thirty-state law wave, and the Hong Kong question. Five parts.
The Detention Ledger, Read Closely · 3 parts
Who runs the beds, who checks them, and who pays for the empty ones — ICE’s own detention file read at the unit level: the county-shell contract structure, the inspection blanks, and the guaranteed-minimum floor. Three parts.
Foreign Money in US Universities · 2 parts
Sixty-two billion disclosed dollars, and the statute that lets most of them stay nameless — the Section 117 ledger, read in two parts.

Browse by topic

Federal data (367) · Health and medicine (83) · Finance and markets (63) · Economy and demographics (55) · Labor and workplace (35) · Transportation safety (33) · Environment and energy (41) · Justice and immigration (38) · Government operations (36) · Money in politics (20) · Sanctions and illicit finance (14) · Consumer protection (16) · Food and agriculture (11) · Research and education (18) · Organ and tissue industry (9) · Ownership and consolidation (13) · Transparency and open data (54) · Censorship and information control (82) · Engineering and infrastructure (116) · Machine learning and OSINT (30) · Drones and cryptography (18) · Cybersecurity and privacy (13)

2026 (428)

The Custody Chain: Fifteen Million Genomes, One Bankruptcy Court
2026-07-18
In 2023, recycled passwords opened roughly 14,000 accounts at 23andMe and, through its relative-matching features, the profiles of millions. Within two years the genetic database had been fined on two continents and sold through a Chapter 11 auction for $305 million, under a court-appointed Consumer Privacy Ombudsman, over the objection of more than thirty states. The full custody chain — breach, penalty, bankruptcy, sale, and the 12-state statute wave that is its legacy — from the public record, via the new Genetic Privacy Ledger.
Cybersecurity and privacy · Finance and markets · Transparency and open data
Fourteen Thousand to Zero: The Federal Private-Prison Ban That Outlived Its Reversal
2026-07-11
Five days before the January 2021 order ending federal private-prison contracts, the Bureau of Prisons’ own feed recorded 14,095 people in 11 private facilities. Today it records zero — and it still records zero more than a year after the order was rescinded in January 2025. Both publicly traded operators state in their own SEC filings that they hold no BOP prison contracts. The two policy cycles behind the number, the removed Contract Prisons page, and the standing weekly tripwire on its return — all from BOP’s own numbers via the new BOP Ledger.
Justice and immigration · Transparency and open data
The Beds and the Badges: Two ICE Files, One Map
2026-07-11
ICE publishes a price list for custody (203 detention facilities, 66,161 held on an average day) and a signature ledger (2,123 agreements deputizing 1,804 local agencies). Joined at the state level they draw one map: Texas leads both boards, California holds the third-largest detained population with zero agreements under a 2017 state law, West Virginia signed 38 agreements with no detention facility at all — and 86 percent of the detained population sits in states that signed up. Geography, not causation — the join stays inside what two files can prove.
Justice and immigration · Transparency and open data
Guaranteed Empty: The Bed Floor in ICE’s Own Arithmetic
2026-07-11
Guaranteed-minimum contracts commit the government to paying for 45,621 detention beds whether or not anyone is in them. In ICE’s own file, 8,813 of those beds sat paid-for and empty at the 20 facilities running more than ten percent under their guarantee — while 36 facilities ran more than ten percent over. Two columns of the public file and a minus sign: the take-or-pay floor covering 76 percent of the detained population.
Justice and immigration · Government operations · Transparency and open data
No Result on Record: The Inspection Blanks in ICE’s Own File
2026-07-11
Of 203 facilities in ICE’s own detention file, 61 carry no inspection result on record — 9,186 people held on an average day at facilities whose rating column is blank, 38 of them under the US Marshals Service umbrella. The blanks, the 21 missing inspection dates, the 73.1% of the population with no recorded threat level, and the 5 facilities that were inspected, failed, and still hold people.
Justice and immigration · Government operations · Transparency and open data
The County Is the Shell: Who Signs for America’s Detention Beds
2026-07-11
ICE’s own file lists 203 detention facilities holding 66,161 people on an average day — and federal records name a private operator for only 7 of them. The intergovernmental-agreement structure that ends the federal record at the county line, what the file does say (61 facilities with no inspection result on record, 8,813 guaranteed beds paid for and empty), and why “no private operator identified in federal records” is itself a finding.
Justice and immigration · Transparency and open data
CDC Drug Overdose Mortality Data: The Federal Dataset Behind the Opioid Crisis
2026-07-10
The CDC publishes overdose mortality through the National Vital Statistics System, CDC WONDER, and monthly VSRR provisional counts — tracking 107,000+ annual drug deaths at the county, demographic, and drug-category level. Here is the ICD-10 code structure, the three waves of the opioid epidemic, racial disparity inversion driven by fentanyl, and how to access the data.
Federal data · Health and medicine
The Grid Has New Landlords: What 1.38 Terawatts of Ownership Filings Show
2026-07-09
The United States runs on 1.38 terawatts of generating capacity, and the federal filings say who owns every megawatt. Computed from Form EIA-860: independent power producers now out-own the investor-owned utilities on your bill; the federal government is one of the largest owners in the country; and a quarter-terawatt is jointly owned through capacity shares most customers have never heard of.
Ownership and consolidation · Environment and energy · Engineering and infrastructure · Transparency and open data
FDIC Bank Failure Data: Every US Bank That Has Failed Since 1934
2026-07-09
The FDIC publishes a complete failure list covering 4,000+ bank closures since 1934: the S&L crisis wave, the 2008–2012 GFC wave with 500+ failures, and the 2023 SVB/Signature/First Republic episode. Here is the dataset schema, how to use call report data and the Texas Ratio to identify at-risk institutions, and how financial journalists access FDIC BankFind.
Federal data · Finance and markets
The Disclosure Law That Hides the Donor
2026-07-08
Section 117 requires American universities to disclose foreign gifts and contracts — but for most of the record, not who they came from. Computed from the federal file: 97 percent of the 62 billion disclosed dollars carry no source name, because the statute asks only for a country. The anonymity is not evasion; it is the design. What the law collects, what it hides, the 2019 enforcement spike, and what the DETERRENT Act fight would actually change.
Ownership and consolidation · Research and education · Transparency and open data
Sixty-Two Billion Dollars: Reading the Ledger of Foreign Money in US Universities
2026-07-08
Since 1981 American universities have disclosed 62 billion dollars in foreign gifts and contracts under Section 117 of the Higher Education Act — 117,152 transactions at 528 institutions. A reading of the federal ledger: who received it, which countries and governments sent it, how it concentrates at the top, and what the disclosure regime does and does not reveal.
Ownership and consolidation · Research and education · Money in politics · Transparency and open data
The Hong Kong Question: One Territory, Five Federal Answers
2026-07-08
Since 2020, export controls, CFIUS, outbound-investment screening, and the ICTS rules have all treated Hong Kong as part of China. The federal farmland register still counts it separately — which is why the most famous Chinese-linked land purchase in America sits outside the China total the debate cites. One territory, five federal answers, and 144,000 acres in the gap.
Ownership and consolidation · Transparency and open data
Follow Every Flag: What the Shell Map Reveals About Foreign-Held US Land
2026-07-08
We took the largest conduit-flagged and no-country blocks in the US foreign farmland register and traced every ownership chain through public documents, with adversarial verification and a defamation review. Sovereign funds behind quiet flags, blank filings that resolve to Munich Re and the French state, a wall of fund structures whose investors no record names, and ghost entries carried for decades. The full map, chain by chain.
Ownership and consolidation · Transparency and open data
Thirty States Banned Foreign Farmland Ownership. Each Banned Something Different.
2026-07-08
Between 2023 and 2025, most US states enacted or strengthened laws restricting foreign ownership of land — but the statutes disagree on who counts as a foreign adversary, whether Hong Kong counts as China, whether leases count as ownership, and who checks. What the laws say, the single completed enforcement action, and the broken federal register they all lean on.
Ownership and consolidation · Transparency and open data
The Shell Game in the Farmland Register: Friendly Flags and the Attribution Gap
2026-07-08
State laws ban farmland ownership tied to foreign adversaries, but enforcement leans on a federal register that records only the first ownership tier. Computed from the government files: most register-flagged secondary Chinese interests sit behind holdings attributed to Singapore, Canada, Japan, and Hong Kong; one ChemChina-owned seed group appears under two country labels in a single file; and the acreage attributed to no country at all has grown six-fold since 2010.
Ownership and consolidation · Transparency and open data
Who Owns American Farmland: Reading the Federal Register of Foreign-Held Land
2026-07-08
Foreign persons report holding 46.3 million acres of US agricultural land — 3.6 percent of privately held farmland, nearly double the 2010 figure. Thirty state legislatures are writing laws about the number while almost nobody reads the register it comes from. A sourced walk through the AFIDA data: who holds American farmland, what held really means, and why the condition of the register is the sharpest finding in it.
Ownership and consolidation · Food and agriculture · Transparency and open data
FDA Warning Letters: The Public Enforcement Record for 100,000+ Regulatory Actions
2026-07-08
The FDA publishes every warning letter on its website — pharmaceutical cGMP violations, food safety failures, device adulteration, and clinical investigator fraud. Here is the enforcement hierarchy from Form 483 to criminal referral, how to access and scrape the letter database, and what the record reveals about repeat violators and food safety trends.
Federal data · Health and medicine
MSHA Mine Safety Data: Violations, Accidents, and Fatalities Across 10,000 Active Mines
2026-07-07
The Mine Safety and Health Administration publishes three linked datasets — mine listings, accident/injury records, and violation citations going back to 1983. Here is the significant-and-substantial designation, the Pattern of Violations enforcement mechanism, the Upper Big Branch disaster context, and how to join violations to accidents by Mine ID.
Federal data · Labor and workplace
USCG Marine Casualty Data: Every US Vessel Accident Since 1982
2026-07-06
The US Coast Guard maintains the Boating Accident Report Database (BARD) for recreational vessels and the Marine Casualty and Pollution Database (MCPD) for commercial casualties. Here is what each database contains, how alcohol and life-jacket non-use drive fatality statistics, and how journalists use the data to track manufacturer defects and rental company safety records.
Federal data · Transportation safety
FMCSA Carrier Safety Ratings: The Federal Database Behind 550,000 Trucking Companies
2026-07-05
The FMCSA maintains SAFER and MCMIS covering every commercial motor carrier in interstate commerce — three official safety ratings (Satisfactory, Conditional, Unsatisfactory), seven SMS BASICs scoring each carrier as a percentile, inspection counts, OOS rates, and crash data. Here is the data structure, how to access it, and what it reveals about high-risk carriers.
Federal data · Transportation safety
CBP US Trade Statistics: The Federal Dataset Behind Every Import and Export
2026-07-04
US Customs and Border Protection and the Census Bureau publish comprehensive import and export statistics by commodity (HTS code), trading partner, port of entry, and month. Here is the data structure, how to access USA Trade Online and the Census Foreign Trade API, and what the data reveals about trade diversion after Section 301 tariffs.
Federal data · Economy and demographics
ICE Enforcement and Removal Operations: Reading the Federal Dataset Behind Immigration Enforcement
2026-07-03
ICE publishes annual ERO reports covering arrests, detentions, removals, and returns by country of origin, criminal vs. non-criminal designation, and field office. Here is the data structure, TRAC-ICE access, and what the dataset reveals about enforcement priority shifts, nationality composition changes, and the interior vs. border enforcement split.
Federal data · Justice and immigration
BLS CPI-U: The Inflation Dataset That Moves Markets and Sets Policy
2026-07-02
The BLS Consumer Price Index for All Urban Consumers tracks monthly inflation going back to January 1913. Here is the expenditure weight breakdown, how CPI-U differs from core CPI and the PCE deflator, how to access it via the BLS API, and what the 2021–2023 surge revealed about shelter inflation measurement and monetary policy transmission.
Federal data · Economy and demographics
SSA Disability Award Statistics: The Federal Dataset Behind 8 Million Benefit Decisions
2026-07-01
The Social Security Administration publishes annual disability award statistics covering both SSDI and SSI — awards by state, diagnosis code, age group, gender, and decision level. Here is what the dataset contains, how to access it, and what it reveals about geographic variation in award rates, the ALJ hearing backlog, and the Trust Fund solvency timeline.
Federal data · Economy and demographics
NLRB Unfair Labor Practice Data: 300,000 Cases of Worker-Management Conflict
2026-06-30
The National Labor Relations Board maintains a public case management system tracking every unfair labor practice charge filed under the NLRA — 20,000–25,000 annually. Here is the case lifecycle, data structure, how to query the NLRB API, and what the data reveals about the 2022–2024 Starbucks and Amazon organizing surge.
Federal data · Labor and workplace
BLS JOLTS: The Federal Dataset That Measures Why Workers Quit
2026-06-29
The Job Openings and Labor Turnover Survey tracks monthly job openings, hires, quits, layoffs, and other separations by industry and region. Here is the data structure, BLS API access, and what JOLTS reveals about the Great Resignation, the Fed’s rate-hike calculus, and the labor market signals that precede recessions.
Federal data · Economy and demographics · Labor and workplace
FTC Consumer Sentinel Network: 16 Million Fraud Reports Hiding in Plain Sight
2026-06-28
The FTC Consumer Sentinel Network aggregates 8M+ fraud, identity theft, and consumer complaint reports annually from the FTC and dozens of partner organizations. Here is what the dataset contains, how to access it, and what it reveals about imposter scams, cryptocurrency fraud, and the counterintuitive age dynamics of financial loss.
Federal data · Consumer protection
Derailments and grade crossings: using FRA railroad accident data to analyze rail safety trends
2026-06-27
The Federal Railroad Administration publishes two linked databases covering US railroad safety since 1975: Form 54 (all rail accidents — derailments, collisions, fires, explosions) and Form 57 (highway-rail grade crossing accidents). Together they cover 250,000+ incidents with train information, track type, speed at accident, casualties, and equipment damage.
Federal data · Transportation safety
The graveyard of pensions: using PBGC data to track terminated defined-benefit plans
2026-06-26
The Pension Benefit Guaranty Corporation publishes data on every terminated private-sector defined-benefit pension plan it has trusteed since 1975 — over 5,000 plans covering millions of workers. The data reveals which industries have abandoned their pension obligations, how much the PBGC paid out vs. what was promised, and which plan sponsors walked away from the largest underfunded obligations.
Federal data · Labor and workplace
Seismic record: using the USGS earthquake catalog to analyze fault risk and induced seismicity
2026-06-25
The USGS National Earthquake Information Center maintains a global catalog of recorded earthquakes: magnitude 2.5+ events back to 1900, with 100,000+ events per year above M4. Here is the data structure, how to access the API and bulk downloads, and what the catalog reveals about fault hazard zones, the Oklahoma induced seismicity surge from wastewater injection, and historical earthquake patterns.
Federal data · Environment and energy
OrganWatch: What the Public Record Shows About the US Organ System
2026-06-24
The US organ-procurement, transplant, and tissue system, assembled from government records: hundreds of source-linked findings on a federally protected monopoly, the procurement-vs-care conflict at the bedside, the consent gaps over unclaimed bodies, the money, the prosecutions, and the sworn testimony. The index to the whole OrganWatch investigation. Institution level, zero personal data.
Organ and tissue industry · Health and medicine · Transparency and open data
Following EPA enforcement: using ECHO data to track environmental violations and penalties
2026-06-24
EPA’s Enforcement and Compliance History Online (ECHO) publishes every CAA, CWA, RCRA, and TSCA enforcement case: facility violations, formal actions, penalties assessed, and compliance status for 800,000+ regulated facilities. Here is the data structure, how to query it, and what the database reveals about which facilities violate the most, which industries face the steepest penalties, and where environmental justice and enforcement gaps align.
Federal data · Environment and energy
Which Organ Procurement Organizations Are Failing: The CMS Tier Map
2026-06-23
For the first time, the US government grades every organ procurement organization on objective outcome measures and publishes the result. The tiers are damning: roughly a third of OPOs sit in the lowest band, which CMS itself deems out of compliance and eligible for decertification. A sourced reading of the CMS tier data — what the tiers mean, which OPOs are in Tier 3, and how the first-ever decertification finally happened.
Health and medicine · Organ and tissue industry · Transparency and open data
Who Buys the Dead: The Demand Side of the US Body Trade
2026-06-23
The consent gaps in US body donation exist because there is a paying market on the other side. A sourced account of the demand side: medical-device companies that run cadaver labs, the US military buying donated bodies for blast and landmine testing, surgical-training firms, and the per-part price market that moved tens of thousands of bodies — all lawful, because federal law bars selling organs for transplant but barely touches the non-transplant body trade.
Health and medicine · Organ and tissue industry · Transparency and open data
The Body Trade on Trial: What the US Court Record Shows
2026-06-23
The consent gaps in US body and tissue donation are not theoretical — they have a criminal record. A sourced account of the court cases: a $58.5M verdict against an Arizona body-donation company, federal prison for operators who sold bodies with forged consent, convictions for shipping disease-infected tissue, and the 2025 Harvard Medical School morgue trafficking case — set against the law that bans selling transplant organs but barely touches the non-transplant body trade.
Health and medicine · Organ and tissue industry · Transparency and open data
Check Your State: Where the Unclaimed Dead Can Be Used Without Consent
2026-06-23
If you died unclaimed, could your body be sent for dissection or research without consent? The answer depends almost entirely on the state. Reading the statutes for all 51 US jurisdictions finds that 33 permit use of an unclaimed or indigent body without affirmative next-of-kin consent, and only 13 require consent. A sourced, de-identified map of the 50-state patchwork.
Health and medicine · Organ and tissue industry · Transparency and open data
How Organs Are Taken From the Dying: Federal Findings on the Procurement-vs-Care Line
2026-06-23
The hardest question in the US organ system is at the bedside of the dying: when does recovery begin, and who is watching for the patient rather than the organ? In 2025 a federal HRSA review of 351 donation-after-circulatory-death cases found concerning features in roughly 29% and concluded a number of patients may not have been deceased when procurement began. A sourced, de-identified account of the dead-donor rule, the NRP controversy, the premature-procurement findings, and the structural conflict behind them.
Health and medicine · Organ and tissue industry · Transparency and open data
The Money Behind the US Organ System: Nine-Figure Nonprofits and a Cost-Plus Monopoly
2026-06-23
Organ donation is free; the system around it is not. The federally designated OPOs are cost-reimbursed regional monopolies, and the largest are nonprofits reporting $100M+ revenue with seven-figure executive pay. A sourced follow-the-money account — the cost-plus model, the OPTN contract, the Senate Finance finding that OPOs have stronger incentives for tissue than for lifesaving organs, the for-profit tissue pipeline, lobbying against reform, and the federal audits. Institution/role level, zero personal data.
Health and medicine · Organ and tissue industry · Transparency and open data
Used Without Consent: Unclaimed Bodies and the Holes in US Donation Law
2026-06-23
When a person dies unclaimed or indigent in America, the law in most states lets their body be sent for dissection, research, or the for-profit body trade with no next-of-kin consent required. A sourced account of the consent gap — the state unclaimed-body statutes, the documented University of North Texas case, the coroner cornea-removal laws and the court split over whether a body is property, and the FDA exemption that leaves whole bodies and tissue barely regulated while transplant organs are tightly governed.
Health and medicine · Organ and tissue industry · Transparency and open data
The US Organ System, in the Government’s Own Records: A Monopoly Under Decertification
2026-06-23
The US organ-procurement system is a federally regulated monopoly — 56+ Organ Procurement Organizations with exclusive territories feeding a national network that had one contractor for nearly four decades. Its failures are documented by the government itself: a CMS performance rule, a bipartisan Senate Finance investigation, HRSA’s breakup of the monopoly, GAO and HHS-OIG audits, and in 2025 the first move to decertify an OPO. Sourced, with the public data behind it.
Health and medicine · Organ and tissue industry · Transparency and open data
Information Rights Have Two Sides: Access and Protection, Mapped Together
2026-06-23
Information-rights posture has two axes — how open the government is (Right to Information) and how protected the citizen is (Data Protection). Joining the two new Voidly datasets on country (a rare exact key), this maps the two-by-two space and the real tension where privacy law is used to deny access and openness without protection exposes individuals.
Censorship and information control · Cybersecurity and privacy · Transparency and open data
When a Breach Happens: How One Cyber Incident Surfaces Across Federal Disclosure Regimes
2026-06-23
There is no single US cyber-breach registry. One incident can surface in three unconnected federal places — CISA’s KEV catalog (the exploited vulnerability), an SEC 8-K Item 1.05 filing (the material event), and the HHS OCR breach portal (health data) — each with a different trigger, threshold, and clock. A guide to joining them by victim organization and date.
Cybersecurity and privacy · Finance and markets · Transparency and open data · Engineering and infrastructure
Not All Lists Are Sanctions: A Field Guide to US Restriction Regimes
2026-06-23
A company can be on a US government list and it can mean five completely different things — an OFAC asset freeze, a BIS export-license denial, an NS-CMIC securities ban, a UFLPA import ban, or an FCC equipment-authorization bar. A field guide to telling the five regimes apart and why conflating them is wrong — the taxonomy behind SpyLedger and the sanctions-programs reference.
Sanctions and illicit finance · Transparency and open data
The Recall Web: How US Product Recalls Are Split Across Four Agencies
2026-06-23
There is no single US recall database — cars are recalled by NHTSA, consumer products by the CPSC, food/drugs/devices by the FDA, and meat and poultry by USDA-FSIS. A guide to weaving the four feeds into one cross-agency recall view, joined on firm name and date, with the hazard-classification mismatches and what a unified view reveals.
Consumer protection · Health and medicine · Transportation safety · Engineering and infrastructure
The Influence Pipeline: Campaign Money, Lobbying, and Federal Contracts
2026-06-23
Three public datasets describe how organizations try to shape government and what they receive: FEC campaign finance (who gives), lobbying disclosures (who lobbies), and USAspending (who wins contracts). Joined on organization + parent name — there is no shared identifier — with the honest correlation-not-causation caveat.
Money in politics · Government operations · Transparency and open data · Engineering and infrastructure
Following a Nonprofit Through Federal Data: Status, Finances, and the Money In
2026-06-23
A tax-exempt organization’s full federal footprint — its IRS exemption ruling, its self-reported Form 990 finances and grants made, and the USAspending grants, contracts, and subawards flowing to it — joined on the EIN, the universal nonprofit key. The money-in vs money-out distinction and the gotchas that break the join.
Transparency and open data · Government operations · Engineering and infrastructure
From Lab to Market: Tracing Federally Funded Research into Patents and Products
2026-06-23
Federal money funds research, that research becomes patents, and some patents become products. Four datasets follow the path — NIH and NSF grants, USPTO patents (whose Bayh-Dole government-interest statements disclose the funding behind them), and FDA approvals — joined on institution, inventor, and the government-interest clause.
Research and education · Engineering and infrastructure
The Drug Lifecycle in Federal Data: From Approval to the Pharmacy Counter
2026-06-22
Six federal datasets follow a prescription drug across its life — FDA approval, the National Drug Code directory, CMS Open Payments (manufacturer payments to prescribers), Medicare Part D prescribing and spending, and CDC overdose mortality — joined on the NDC code, ingredient, and manufacturer. The keys, the brand/generic and NDC-format gotchas, and what the assembled pipeline answers.
Health and medicine · Transparency and open data · Engineering and infrastructure
The Voidly Accountability Stack: Fifteen Datasets on Secrecy, Rights, and Harm
2026-06-22
Voidly is fifteen datasets on how accountability is suppressed and reclaimed: network censorship, banned books (Verboten), the surveillance industry (SpyLedger), ownership opacity (DarkRegister), foreign-held land (Foreign-Held U.S. Farmland), foreign money in universities (Section 117 Ledger), grid ownership (GridOwners), who runs ICE detention (the Detention Ledger), who signed up to enforce immigration law (the 287(g) Wave), every federal prison (the BOP Ledger), the sanctions authorities behind designations (Sanctions Programs), information rights (Right to Information and Data Protection), and the US organ system (OrganWatch). One shared source-cited, static, agent-first, privacy-careful method.
Censorship and information control · Transparency and open data
DarkRegister: Tracking the Rollback of Corporate-Ownership Transparency
2026-06-22
The public-access status of 46 national beneficial-ownership registers (EU, UK, US, and major offshore centres) after the 2022 CJEU ruling — only 7 remain fully public. A record of corporate-transparency rollback as state behavior, with zero personal data, plus the open CC0 GLEIF ownership graph captured as the preservable counterweight.
Censorship and information control · Transparency and open data
SpyLedger: A Source-Cited Record of the Surveillance Industry
2026-06-22
The public corporate identity and government-designation status of 20 marquee spyware and mass-surveillance vendors (NSO Group, Intellexa, Hikvision, Huawei and more) — every designation rebuilt from a primary US/EU source and precisely typed: export control, sanction, equipment-authorization, or investment restriction.
Censorship and information control · Sanctions and illicit finance · Transparency and open data
What the World Bans, and Why: Patterns in 35,000 Book-Censorship Records
2026-06-22
A data read of the Verboten index across 119 countries: political content is the world’s #1 stated reason for banning a book (9,813 titles), LGBTQ+ bans are ~95% American, and the 2020s already hold 9,411 newly banned titles — the two censorship regimes and the four-century arc behind them.
Censorship and information control · Transparency and open data
Verboten: Building a Queryable Index of Where Books Are Banned
2026-06-22
A structured, source-cited index of book censorship worldwide — 19,283 banned or restricted titles across 119 countries — built on the CC-BY banned-books.org Open Censorship Core and served as static JSON for AI agents to query: is this book banned in that country, and why?
Censorship and information control · Transparency and open data · Engineering and infrastructure
Dark money disclosed: using IRS Form 990 data to map political organization spending
2026-06-22
The IRS publishes Form 990 filings for political organizations — 527 committees (direct political spending) and 501(c)(4) social welfare organizations (the dark money vehicle). The data covers revenue, expenditures, officer compensation, and political activities for 65,000+ organizations. Here is what the data contains, how to access it via ProPublica Nonprofit Explorer and the IRS bulk XML, and what it reveals about the shadow infrastructure of US political spending.
Federal data · Transparency and open data · Money in politics
Trade, Sanctions, and Export Controls: Joining the Rules to the Goods That Move
2026-06-21
US trade controls and the trade they govern usually live in separate datasets — OFAC’s sanctions lists, BIS’s export-enforcement record, and the Census foreign-trade statistics. This guide joins all three by country, HS commodity, and party so the rules, the violations, and the actual flow of goods line up in one view.
Economy and demographics · Sanctions and illicit finance · Engineering and infrastructure
The EPA Facility 360: Joining Waste, Toxic Releases, Air, and Enforcement Data
2026-06-21
A guide to building a single facility view from four EPA compliance datasets — joining the RCRA hazardous-waste registry, the Toxic Release Inventory, Clean Air Act compliance, and the ECHO enforcement record through the Facility Registry Service ID, so a site’s permits, pollution, violations, and penalties come together in one place. Covers the join key, the program identifiers, the environmental-justice questions the assembled data answers, and a worked Python walkthrough against EPA’s key-free public APIs.
Environment and energy · Engineering and infrastructure
The Natural Hazards Picture: From NWS Warning to NOAA Storm to FEMA Declaration
2026-06-21
Four federal hazard datasets — the National Weather Service alert feed, NOAA’s Storm Events Database, the USGS earthquake catalog, and FEMA’s disaster declarations — line the warning, the event, the damage, and the federal response up in one view. A guide to joining them on place and time so an analyst can measure warning lead time against casualties and see what share of damaging events ever became a federal disaster.
Environment and energy · Engineering and infrastructure
The Food Safety System: Joining CDC Outbreaks with FSIS and FDA Recalls
2026-06-21
The US food-safety system is split across three federal agencies — CDC epidemiology, USDA FSIS recalls of meat and poultry, and FDA recalls of everything else — and the chain from a detected outbreak to the recall that pulls contaminated food off shelves only comes together when their datasets are joined. This guide traces that chain through public, key-free data, aligning the records by pathogen, product, firm, and time.
Food and agriculture · Health and medicine · Engineering and infrastructure
The US Power Grid in Data: Joining EIA Plants, Ownership, Generation, and FERC
2026-06-21
Four federal energy datasets — EIA’s plant and generator inventory, the ownership schedule, the electricity generation and price series, and FERC’s market-manipulation enforcement record — joined on the plant code and the operator name into one view of the physical grid, who owns it, what it produces, and who polices it.
Environment and energy · Engineering and infrastructure
The Federal Research Enterprise: Joining NSF, NIH, and Research-Misconduct Data
2026-06-21
A guide to mapping the federal research enterprise from data — joining NSF awards, NIH grants, and ORI misconduct findings on institution and investigator so the flow of federal science money and its oversight come together in one view. Covers the two funding pillars, the misconduct accountability layer, the join-key normalization problem, and a Python workflow that aggregates funding and lines findings up against grants.
Research and education · Engineering and infrastructure
Is College Worth It? Measuring Value by Joining IPEDS and the College Scorecard
2026-06-21
Whether a given college is worth it can be answered school by school — and now program by program — by joining two federal datasets: NCES IPEDS, the mandatory census of what every college charges and graduates, with the College Scorecard’s record of debt and post-enrollment earnings. A guide to the UnitID/OPEID join, the CIP program layer, and a worked earnings-to-debt value ratio.
Research and education · Engineering and infrastructure
The Employment-Immigration Pipeline: From Labor Certification to H-1B Approval
2026-06-21
A guide to tracing the employment-based immigration pipeline through federal data — joining the Department of Labor’s foreign-labor certifications, the USCIS record of H-1B petitions, and the BLS wage statistics that set the prevailing-wage floor, so the path by which a US employer hires a foreign worker comes together in one view.
Justice and immigration · Labor and workplace · Engineering and infrastructure
How a Federal Rule Is Made: Tracing the Rulemaking Lifecycle Across Federal Data
2026-06-21
A guide to tracing a federal regulation end to end — from the public law that authorizes it, through the proposed and final rules in the Federal Register, to the public-comment docket on Regulations.gov — by threading the RIN, the docket ID, and the Federal Register document number across three connected datasets.
Government operations · Engineering and infrastructure
The Worker Safety Record: Joining OSHA Inspections, Citations, and Injuries
2026-06-21
A single workplace’s federal safety history is scattered across four OSHA datasets — the inspections, the citations they generate, the severe-injury reports employers must file, and the annual Form 300A summaries. This guide assembles them into one establishment-level view, keyed by inspection number and employer, and asks whether enforcement actually lowers injury rates afterward.
Labor and workplace · Federal data · Engineering and infrastructure
The Disaster Cycle: Tracing One Catastrophe Through FEMA Federal Data
2026-06-21
A guide to reconstructing the full arc of a US catastrophe from four OpenFEMA datasets — joining the disaster declaration, the Public Assistance grants that rebuild, the Hazard Mitigation projects that prevent the next loss, and the NFIP flood claims that pay homeowners — so one hurricane, wildfire, or flood can be followed from the day the President signs the declaration through every dollar the federal government spends in response.
Environment and energy · Engineering and infrastructure
The Corporate Securities Lifecycle: One Company Across EDGAR on a Single Key
2026-06-21
Most federal data syntheses fail at the join — there is no shared key, so the analyst is left fuzzy-matching names. The SEC ecosystem is the rare exception: EDGAR’s Central Index Key uniquely identifies every filer, so the company registry, the 8-K material-event filings, the 13F holdings that name the company, its Form D private placements, and its litigation and administrative enforcement all link without ambiguity. This guide assembles all six SEC datasets into one CIK-keyed corporate profile that follows a company across its entire public life.
Finance and markets · Engineering and infrastructure
The Opioid Epidemic in Three Federal Datasets: Distribution, Death, and Treatment
2026-06-21
No single federal dataset shows the opioid epidemic whole — but three of them, joined on geography, do. This guide aligns the DEA’s ARCOS record of how the pills were shipped, the CDC’s death-certificate record of how people died, and the CMS record of where Medicare addiction treatment reaches survivors, so the supply, the toll, and the response line up in one geographic view.
Health and medicine · Engineering and infrastructure
The Anatomy of a Bank Failure: Reading the Federal Data Before a Bank Goes Under
2026-06-21
A bank rarely fails without warning — its slide shows up in the quarterly call reports quarters before regulators close it. This is a guide to reconstructing the whole arc from federal data, joining the FDIC institution registry, the call-report financials, the enforcement orders, and the failed-bank record on one clean key: the FDIC certificate number.
Finance and markets · Federal data · Engineering and infrastructure
The Vehicle Safety Pipeline: From Owner Complaint to Recall to the Death Toll
2026-06-21
A single agency owns the entire defect-to-outcome loop — NHTSA collects the complaints, opens the investigations, compels the recalls, tracks the completion reports, and counts the deaths. This guide assembles all five datasets into one pipeline and shows how to join a complaint cluster to the recall it provoked and the fatalities it cost.
Transportation safety · Consumer protection · Engineering and infrastructure
The Aviation Safety Pipeline: From NTSB Accident to FAA Airworthiness Directive
2026-06-21
The US aviation-safety feedback loop is split across two agencies — the NTSB that investigates and the FAA that regulates — so the accident finding and the mandatory fix live in different databases. This guide traces an unsafe condition from accident, to airworthiness directive, to the registered fleet it applies to, joining four federal datasets on aircraft make and model into one accountable chain.
Transportation safety · Engineering and infrastructure
Mortality in America: Assembling the Full Picture from Federal Death Data
2026-06-21
No single federal file tells you how Americans die — five CDC mortality datasets do, when you assemble them: the leading-causes ranking, the injury and external-cause records, the suicide series, drug-overdose mortality, and excess deaths. All are cut from the same National Vital Statistics System death certificates, so they share join keys and age-adjustment, and the work is aligning their cause definitions rather than parsing five separate NCHS releases.
Health and medicine · Engineering and infrastructure
Healthcare Consolidation: Tracing Hospital and Nursing-Home Roll-Ups Through Federal Data
2026-06-21
Consolidation is the defining force in American healthcare, but it leaves its fingerprints across four separate CMS records that do not share a key. This is a field guide to bridging the provider-ownership files, the change-of-ownership transactions, and the Care Compare quality datasets through the enrollment-to-CCN crosswalk — turning who owns a facility, the deal that changed it, and the staffing and outcomes that followed into one traceable, facility-level story.
Health and medicine · Ownership and consolidation · Engineering and infrastructure
Following the Money: Joining Federal Campaign Finance, Lobbying, and Spending Data
2026-06-21
The hardest and most valuable move in federal data analysis is following one company across four disconnected systems — the political money it gives, the lobbying it pays for, the contracts it wins, and the fraud cases brought against it. None of them share a key, so the work is entity resolution: normalizing names, mapping subsidiaries to parents, and living with fuzzy matches.
Engineering and infrastructure · Money in politics · Government operations
CMS Home Infusion Therapy Suppliers: The Federal Record of a New Medicare Benefit
2026-06-21
Home infusion therapy — antibiotics, immune globulin, chemotherapy, and parenteral nutrition delivered into a vein or under the skin at a patient’s kitchen table — became a permanent Medicare benefit only on January 1, 2021. CMS keeps the enrollment record of the suppliers qualified to bill for it: roughly 324 home infusion therapy suppliers, a small and concentrated census of a benefit just a few years old.
Health and medicine · Federal data
EIA Generator Ownership: The Federal Record of Who Owns America’s Power Plants
2026-06-21
EIA Form 860’s ownership schedule records who actually holds the equity in every US electric generator — the regulated utilities, merchant producers, public power authorities, and financial investors — and the joint-ownership percentages that untangle who controls how many megawatts across the fleet. A field-level guide to the ~5,400-record ownership file, joint ownership in baseload coal and nuclear plants, parent-company rollups, and the Python to aggregate owned capacity and flag co-owned plants.
Environment and energy · Ownership and consolidation · Federal data
EPA Public Water Systems: The Federal Inventory of Every System That Delivers US Drinking Water
2026-06-21
The EPA’s inventory of every public water system in the country — an inventory file of roughly 400,000 records keyed by PWSID, of which about 150,000 are active systems, from the largest city utilities to the campground wells that serve twenty-five people. A field-level guide to system types, size categories, source water, state primacy, and the inventory that the violations and inspection datasets all hang off of.
Environment and energy · Engineering and infrastructure · Federal data
FAA Airworthiness Directives: The Federal Record of Mandatory Aircraft Safety Fixes
2026-06-21
Airworthiness Directives are not advice — they are legally binding FAA orders, issued under 14 CFR Part 39, that ground an aircraft until an unsafe condition is fixed. This guide walks through the ~22,900 FAA rulemaking actions in the faa_actions table: how an AD turns an accident finding into a fleet-wide mandate, emergency ADs and the 737 MAX, aging-aircraft and engine directives, the join to the aircraft registry by make and model, and a worked Federal Register API walkthrough.
Transportation safety · Federal data
CDC Leading Causes of Death by State: The Federal Ranking of How Americans Die
2026-06-21
Every year the CDC ranks the top causes of death in each state — heart disease, cancer, unintentional injury, and the rest of the top ten — with counts and age-adjusted rates standardized so a young state and an old one can be compared on a like-for-like basis. This is the top-level view that sits above the cause-specific mortality datasets.
Health and medicine · Federal data
CMS Hospital Ownership: The Federal Record of Who Owns America’s Hospitals
2026-06-21
CMS publishes the full ownership filings for every Medicare-enrolled hospital — naming the health systems, private-equity firms, and real-estate investment trusts with a direct or indirect interest in each facility, keyed to the hospital’s PECOS enrollment and associate IDs (not the CMS Certification Number, which the file does not carry). A field guide to the ~147,000-record hospital all-owners file: the disclosure rule behind it, the schema and role codes, how to trace an opco/propco/REIT chain, the Steward–Medical Properties Trust collapse, the joins to change-of-ownership data and, through an enrollment-to-CCN crosswalk, to quality data, a worked Python walkthrough, and the caveats of self-reported ownership.
Health and medicine · Ownership and consolidation · Federal data
FDA Tobacco Product Problem Reports: The Federal Record of Vaping and Tobacco Hazards
2026-06-21
When a vaping device’s battery overheats in a pocket, when a pouch of tobacco arrives webbed with mold, when an e-liquid triggers a reaction no label warned of, the complaint can land in one federal file — the FDA’s Tobacco Product Problem Reports. This deep-dive walks the ~1,300-report dataset behind tobacco and vaping safety surveillance: the 2009 statute that created the Center for Tobacco Products, the 2016 deeming rule that pulled e-cigarettes in, the product and problem taxonomy, the underreporting and causation caveats, and a Python workflow over the openFDA tobacco endpoint.
Health and medicine · Consumer protection · Federal data
FDA Animal & Veterinary Adverse Events: The Federal Record of Veterinary Drug Safety
2026-06-21
When a dog seizes after a flea-and-tick chew or a horse colics after a dewormer, the report often lands in the FDA Center for Veterinary Medicine’s adverse-event system — the veterinary counterpart to FAERS. This guide covers the ~25,000-report dataset, the passive-surveillance model, the species-and-reaction taxonomy, and a Python walkthrough of the openFDA animalandveterinary/event API.
Health and medicine · Federal data
CMS Change of Ownership: The Federal Record of Hospital and Nursing-Home M&A
2026-06-21
When a hospital or nursing home is sold, the new owner usually inherits the seller’s Medicare provider number, compliance history, and liabilities — and that transfer leaves a CHOW record. The roughly 5,900-row CMS change-of-ownership dataset is the transaction-level ledger of healthcare consolidation: who bought which facility, from whom, and when.
Health and medicine · Ownership and consolidation · Federal data
NHTSA Recall Completion: The Federal Record of Whether Recalled Cars Actually Get Fixed
2026-06-21
A recall only prevents harm if the defective part is actually replaced — and federal regulations make manufacturers report, quarter by quarter, how many recalled units they have repaired. This guide covers the ~73,600 quarterly completion reports behind the question the recall headline never answers: did the cars get fixed?
Transportation safety · Consumer protection · Federal data
FDA Device Establishment Registration: The Federal Map of the Medical-Device Supply Chain
2026-06-21
Every facility on earth that makes, repackages, relabels, or imports a medical device for the US market must register with the FDA each year and list the devices it handles — roughly 324,000 establishments keyed by FEI and registration number. This is the federal worldwide map of who handles what in the device supply chain, and the registry that ties manufacturers, contract makers, specification developers, repackagers, and importers to the device product codes they touch.
Health and medicine · Federal data
CDC Suicide Mortality: The Federal Record of a Public-Health Crisis Over Seven Decades
2026-06-21
Suicide is one of the leading causes of death in the United States, and one of the few that rose for most of two decades. The CDC/NCHS suicide-mortality record — age-adjusted and crude rates by year, sex, age group, and method, built from death-certificate data in the National Vital Statistics System — is the baseline that prevention policy aims to lower. A field-level guide to the data behind 988 and the suicide-prevention effort.
Health and medicine · Federal data
NHTSA Defect Investigations: The Federal Record of What Leads to a Recall
2026-06-21
Before a recall there is an investigation. NHTSA’s Office of Defects Investigation works the space between a complaint pattern and a recall — opening a Preliminary Evaluation, escalating to an Engineering Analysis, and either forcing a recall or closing without action. This is the ~5,300-record federal account of how a safety defect moves from early signal to mass recall, the data behind Takata, the GM ignition switch, and Firestone.
Transportation safety · Consumer protection · Federal data
FDA PMA Approvals: The Federal Record of How High-Risk Medical Devices Reach Market
2026-06-21
Premarket Approval is the FDA’s most stringent device pathway — the route a Class III device must take to reach the US market, proving its own safety and effectiveness with clinical evidence rather than borrowing equivalence from a predecessor. This guide walks the ~56,000 PMA approvals and supplements: originals versus supplements, the PMA-vs-510(k) divide, advisory-committee specialties, the supplement lifecycle, and a Python workflow against the openFDA device/pma endpoint.
Health and medicine · Federal data
CMS Opioid Treatment Programs: The Federal Record of Medicare Access to Addiction Care
2026-06-21
For the first fifty-five years of Medicare, the program would not pay a methadone clinic a cent — until the SUPPORT Act built a Part B bundled benefit that took effect in January 2020. This guide walks the CMS enrollment file of the roughly 1,300 opioid treatment programs now billing Medicare: the 42 CFR Part 8 rules, the SAMHSA-DEA-accreditation triad, the schema keyed by CCN and enrollment ID, and a Python workflow that maps treatment capacity against the overdose burden.
Health and medicine · Federal data
NASA OIG Reports: The Federal Record of Auditing the Space Program
2026-06-21
The NASA Office of Inspector General is the independent watchdog that audits the cost and schedule of the Space Launch System, Orion, the James Webb Space Telescope, and the Commercial Crew contracts with SpaceX and Boeing — and reports the overruns. Roughly 850 audit and investigative reports trace how NASA spends its ~$25 billion budget, which programs run over, and what the OIG recommends to fix it.
Government operations · Federal data
SEC Comment Letters: The Federal Record of How Disclosure Is Actually Enforced
2026-06-21
When the SEC’s staff reviews a public company’s filings and has questions, it sends a comment letter — and EDGAR publishes both the staff’s questions and the company’s replies. This is the candid, lagged record of how disclosure standards get enforced in the space between formal enforcement actions: which accounting topics draw scrutiny, which companies got pushed, and how filings changed in response.
Finance and markets · Transparency and open data · Federal data
CMS FQHCs and Rural Health Clinics: The Federal Record of the Medicare Safety Net
2026-06-21
Federally Qualified Health Centers and Rural Health Clinics are the two Medicare clinic types that anchor the primary-care safety net in low-income and rural America — and CMS’s enrollment files are the supply map of where they sit, who runs them, and under what status. A field guide to ~16,600 clinic enrollments, the statutes behind them, and how to join them to ownership and provider data.
Health and medicine · Federal data
Oversight.gov: The Federal Record of Every Inspector General Report in One Place
2026-06-21
Oversight.gov is the single searchable library of federal Inspector General work, run by CIGIE to aggregate the audits, inspections, and investigations that seventy-odd OIGs publish separately. This guide covers the Inspector General Act, CIGIE and the PRAC, recurring findings and open recommendations, how the catalog joins to spending and agency data, a worked Oversight.gov API walkthrough, and the caveats.
Government operations · Transparency and open data · Federal data
SEC Administrative Proceedings: The Federal Record of the SEC’s In-House Enforcement
2026-06-21
The SEC brings a large share of its enforcement in its own forum — before an administrative law judge or the Commission itself — rather than in federal court. This deep-dive covers the ~18,400 administrative proceedings in our table: the registrant bars, accountant suspensions, registration revocations, and orders instituting proceedings, the forum-choice questions that Lucia and Jarkesy reshaped, and how the record joins to EDGAR and the compliance-screening lists.
Finance and markets · Federal data
CISA Known Exploited Vulnerabilities: The Federal List of What Attackers Are Actually Using
2026-06-21
The full catalog of known software vulnerabilities runs into the hundreds of thousands — far too many to patch all at once. CISA’s Known Exploited Vulnerabilities catalog cuts that universe down to the ~1,600 CVEs confirmed to be exploited in the wild, and Binding Operational Directive 22-01 turns the list into an enforceable federal patching mandate. This guide reads it as the highest-signal vulnerability prioritization feed the government publishes.
Cybersecurity and privacy · Federal data
FEMA Firefighter Grants: The Federal Record of Funding the Fire Service
2026-06-21
FEMA’s Assistance to Firefighters Grants and its sister programs — SAFER staffing, Fire Prevention and Safety, and EMPG — are the everyday-readiness side of FEMA, putting turnout gear, breathing apparatus, fire apparatus, training, and firefighters themselves into local departments. OpenFEMA publishes the awards as roughly 74,000 grant records: who got funded, where, for what, and how federal dollars reach volunteer and rural departments.
Environment and energy · Federal data
SEC Litigation Releases: The Federal Record of Securities Cases Filed in Court
2026-06-21
When the SEC sues in federal district court — for accounting fraud, insider trading, Ponzi schemes, market manipulation, or FCPA bribery — it issues a Litigation Release that summarizes the complaint and tracks the case to judgment. This guide walks the ~11,800-release record: the civil-court versus administrative-forum split, what Jarkesy changed, the join by defendant to EDGAR and the enforcement screening lists, and a Python workflow that tallies releases by year and searches them by name.
Finance and markets · Federal data
NCUA Enforcement: The Federal Record of Actions Against Credit Unions
2026-06-21
When a credit union or its officials break the law or run the institution into the ground, the NCUA acts — cease-and-desist orders, civil money penalties, prohibitions, conservatorships, and liquidations. This guide reads ~1,400 of those enforcement actions as the credit-union piece that completes the four-regulator picture of every federally insured depository in the country.
Finance and markets · Federal data
VA Inspector General Reports: The Federal Record of Oversight at Veterans Affairs
2026-06-21
The VA Office of Inspector General is the independent watchdog over the second-largest federal department — the nation’s largest integrated health system, the benefits administration, and the cemeteries. Roughly 4,280 reports, keyed to the facility or program reviewed, span healthcare inspections, benefits audits, construction reviews, and criminal investigations — the documentary trail behind the Phoenix wait-time scandal and the accountability that followed.
Health and medicine · Government operations · Federal data
Federal Lobbying Disclosures: The Public Record of Who Is Paid to Influence Washington
2026-06-21
The Lobbying Disclosure Act forces every paid lobbyist to file the client, the issues, the agencies contacted, and the money — a quarterly public ledger of the influence industry. This guide walks the LD-2 and LD-203 filings, the standardized issue-area codes, the Senate and House disclosure systems, and how the data joins to the campaign-finance and foreign-agent records.
Money in politics · Government operations · Federal data
DOJ Inspector General Reports: The Federal Record of Watching the Watchmen
2026-06-21
The Department of Justice has its own independent watchdog — an inspector general who audits, inspects, and investigates the FBI, DEA, ATF, the Bureau of Prisons, and the rest of the department, then publishes the findings. This guide covers what the roughly 3,000 published OIG reports are, the Inspector General Act of 1978 that created the office, the audit-evaluation-investigation product lines, the landmark FBI and FISA reviews, how recommendations are tracked, where the data lives on oig.justice.gov and oversight.gov, a worked Python walkthrough, and the caveats.
Justice and immigration · Government operations · Federal data
FEMA Hazard Mitigation: The Federal Record of Spending to Prevent the Next Disaster
2026-06-21
FEMA’s Hazard Mitigation Assistance grants pay to reduce disaster risk before and after the storm — the preventive complement to the rebuild-after Public Assistance program — and OpenFEMA publishes roughly 56,000 funded mitigation projects. A field-level guide to HMGP, FMA, PDM, and BRIC; buyouts, elevations, safe rooms, and mitigation plans; the “$6 saved per $1 spent” finding and the equity debate over benefit-cost scoring; and a Python workflow that sums federal share by program, state, and project type.
Environment and energy · Government operations · Federal data
SEC Form D: The Federal Record of Private Securities Offerings
2026-06-21
When a startup raises a seed round, a hedge fund launches, or a sponsor syndicates an apartment building, it almost never registers with the SEC — it files a Form D instead. That brief notice is one of the only public windows into the private markets that now raise more capital than public offerings, and this guide reads it column by column.
Finance and markets · Federal data
NWS Weather Alerts: The Federal Feed of Every Active Warning and Watch
2026-06-21
The National Weather Service issues every official US watch, warning, and advisory in the Common Alerting Protocol — the same machine-readable feed that drives the Emergency Alert System, Wireless Emergency Alerts, and NOAA Weather Radio. Our weather_alerts table is a rolling snapshot of roughly 3,000 active and recently expired messages, keyed by alert identifier, mapped to NWS zones, and complementary to the deep NOAA storm-events archive.
Environment and energy · Justice and immigration · Federal data
Federal Reserve Enforcement: The Federal Record of Actions Against Bank Holding Companies
2026-06-21
The Federal Reserve supervises bank holding companies, state member banks, and the US operations of foreign banks — and when they break the law or run themselves unsafely, it acts. This guide walks the ~1,500-action public enforcement record: the cease-and-desist orders, written agreements, civil money penalties, and removal-and-prohibition orders, the holding-company vantage that distinguishes the Fed from the OCC and FDIC, and how the three banking regulators’ records join into a single map of US bank supervision.
Finance and markets · Federal data
DOJ Antitrust Cases: The Federal Record of Government Competition Enforcement
2026-06-21
The Antitrust Division’s case record runs from United States v. Microsoft to the modern big-tech monopolization suits — roughly 920 civil merger challenges, monopolization cases, and criminal cartel prosecutions keyed by matter and defendant. A field-level guide to the Sherman and Clayton Acts, the DOJ–FTC split, the criminal cartel program, Hart-Scott-Rodino premerger review, and a Python workflow that tallies cases by type and year.
Justice and immigration · Federal data
GovInfo Congressional Hearings: The Federal Record of What Witnesses Tell Congress
2026-06-21
Congressional hearings are the principal way House and Senate committees gather information, and the published transcripts are the official record of that testimony. This guide covers the GovInfo CHRG collection — roughly 46,000 hearing transcripts keyed by package ID and committee — the five hearing types, the give-and-take that the transcripts preserve, how hearings feed the bills, votes, and laws that follow, a worked GovInfo API walkthrough, and the caveats of a corpus assembled from fifty-odd committees.
Government operations · Federal data
FEMA Public Assistance: The Federal Record of How Disaster Recovery Money Is Spent
2026-06-21
The Public Assistance program is the federal government’s largest disaster-recovery grant — the money that rebuilds roads, schools, hospitals, and water systems after a presidential disaster declaration. OpenFEMA publishes roughly 195,000 funded-project summaries showing what was rebuilt and at what cost; this guide explains the Stafford Act frame, the work categories, the cost-share, and how to join the spending to the declarations.
Environment and energy · Government operations · Federal data
CFTC Enforcement: The Federal Record of Derivatives and Crypto Market Actions
2026-06-21
The CFTC’s Division of Enforcement brings the civil actions that police the US derivatives markets — fraud, manipulation, spoofing, and, increasingly, digital-asset cases. This guide walks the ~4,400-record enforcement file: the Commodity Exchange Act frame, administrative orders versus federal-court complaints, the spoofing and benchmark-manipulation eras, the crypto-as-commodity fight with the SEC, and a Python workflow that scrapes the public cftc.gov pages to tally actions by year and violation and rank respondents by monetary relief.
Finance and markets · Federal data
NSF Awards: The Federal Record of Who Gets US Science Funding
2026-06-21
The National Science Foundation funds roughly a quarter of all federally supported basic research at US colleges and universities, and every grant it makes leaves a public record — the award number, the institution, the principal investigator, the program and directorate, the title and abstract, and the obligated dollars. This guide covers the merit-review system behind the awards, the directorate structure, the searchable abstract corpus, and a Python workflow against the open NSF Awards API.
Research and education · Government operations · Federal data
OCC Enforcement: The Federal Record of Actions Against National Banks
2026-06-21
When a national bank breaks the law or runs an unsafe operation, the Office of the Comptroller of the Currency answers with a cease-and-desist order, a consent order, or a civil money penalty, and publishes it. This is a deep dive into the OCC enforcement record: roughly 4,900 actions against national banks, federal savings associations, and the bankers behind them; the BSA/AML, mortgage-servicing, and sales-practices failures that drive them; and how the dataset joins the FDIC and Federal Reserve records into one map of US bank supervision.
Finance and markets · Federal data
US Public Laws: The Federal Record of Every Law Congress Has Enacted
2026-06-21
Every bill that clears both chambers and the President’s desk becomes a numbered public law: the Americans with Disabilities Act, HIPAA, Dodd-Frank, the Affordable Care Act, the CARES Act. GPO publishes the chronological record through GovInfo, and our public_laws table holds the several thousand enacted from the 101st Congress forward. A field-level guide to public-law numbering, the slip-law to Statutes-at-Large to US Code pipeline, and a worked GovInfo API walkthrough.
Government operations · Federal data
NRC Enforcement: The Federal Record of Nuclear Safety Violations
2026-06-21
When a nuclear licensee breaks a safety requirement, the Nuclear Regulatory Commission issues a Notice of Violation, a civil penalty, or an order — and records it. This guide reads the ~1,800-action enforcement file as the federal ledger of nuclear-safety accountability, from the color-coded Reactor Oversight Process to the lessons of Three Mile Island and Fukushima.
Environment and energy · Federal data
GAO Reports: The Federal Database Behind Congress’s Watchdog
2026-06-21
The Government Accountability Office is the nonpartisan audit, evaluation, and investigative arm of Congress — the congressional watchdog. This guide covers what the GAO reports dataset is, how the office works, the High-Risk List and the duplication report, recommendation tracking and reported financial benefits, bid-protest decisions, how the reports table joins to the spending and legislative records, a worked Python walkthrough against gao.gov, and the caveats every analyst must hold.
Government operations · Federal data
DOL OFLC Disclosures: The Federal Record of Employer Visa-Labor Applications
2026-06-21
The Department of Labor’s Office of Foreign Labor Certification runs the labor side of employment-based immigration — and publishes every case as quarterly disclosure data. A guide to ~298,000 visa-labor records spanning the H-1B LCA, PERM, and the seasonal H-2A and H-2B programs, the prevailing-wage attestation at their core, and how they join to USCIS and BLS.
Labor and workplace · Justice and immigration · Federal data
PBGC Trusteed Pensions: The Federal Record of Every Dead Corporate Pension
2026-06-21
When a company with a traditional pension goes bankrupt and its plan is underfunded, the Pension Benefit Guaranty Corporation steps in as trustee and pays the retirees — and records the dead plan in a federal registry. This guide covers the ~5,170 trusteed plans, the ERISA insurance frame, the single- and multiemployer programs, the benefit guarantee cap, the steel-airlines-auto collapse the data documents, the Form 5500 join, and a worked Python walkthrough.
Labor and workplace · Finance and markets · Federal data
FTC Enforcement: The Federal Record of Consumer-Protection and Antitrust Actions
2026-06-21
The Federal Trade Commission announces nearly everything it does — every settlement, every merger challenge, every new rule — through a press release, and those releases together form a searchable public log of US consumer-protection and antitrust enforcement. This guide reads the FTC’s ~10,700-record press and enforcement archive as a dataset: the Section 5 frame, the consumer-protection and competition missions, landmark privacy penalties, the shifting priorities of junk fees and noncompetes and big-tech antitrust, and a Python workflow that tallies releases by year and topic.
Consumer protection · Justice and immigration · Cybersecurity and privacy · Federal data
SEC 8-K Filings: The Federal Record of Every Material Corporate Event
2026-06-21
Form 8-K is the SEC’s current report — the event-driven filing a public company must submit within four business days of a major development. This guide covers the item-code taxonomy, the four-business-day clock, the 2023 cyber-incident rule, how the corpus joins to the EDGAR registry by CIK, a worked submissions-API walkthrough, and the caveats of tagged-event data.
Finance and markets · Federal data
FRA Rail Accidents: The Federal Record of Every Reportable US Railroad Incident
2026-06-21
Every reportable US railroad accident since 1975 — derailments, collisions, grade-crossing strikes — flows to the Federal Railroad Administration on Form 6180.54 and lands in the Railroad Accident/Incident Reporting System. This guide covers the reporting threshold, the cause-code taxonomy, the East Palestine and PTC debates, and how ~224,000 records join to the grade-crossing inventory.
Transportation safety · Federal data
MSHA Violations: The Federal Record of Every Citation Written at a US Mine
2026-06-21
Every citation and order a federal inspector writes at a US mine — coal, metal, and nonmetal — lands in one Department of Labor record, roughly 3.07 million violations keyed by mine ID and citation number. A field-level guide to the Mine Act, the S&S and unwarrantable-failure tiers, the pattern-of-violations process, penalty assessment and contest, the join to the mines and accidents data, and a Python workflow that ranks operators by assessed penalty and S&S rate.
Labor and workplace · Federal data
SEC Financial Statement Facts: The Structured XBRL Behind Every 10-K and 10-Q
2026-06-21
Every number in a public company’s 10-K and 10-Q is filed not just as formatted text but as a structured, machine-readable XBRL fact — revenue, net income, total assets — tagged to a US-GAAP concept and keyed to the filer’s CIK and period. This guide covers the 2009 mandate that created a decade of comparable structured fundamentals, the anatomy of a financial fact, the company-facts API, and the caveats of company-chosen tags and restatements.
Finance and markets · Federal data
USGS Earthquakes: The Federal Catalog of Every Significant Global Quake
2026-06-21
The US Geological Survey runs ComCat, the authoritative catalog of global earthquakes — our slice holds ~101,000 magnitude-4-and-greater events worldwide since 2020, each with an origin time, location, depth, and magnitude type. A field guide to the ANSS comprehensive catalog, why depth and magnitude type matter, the ShakeMap–PAGER impact pipeline, and the no-key FDSN web service that serves it all in minutes.
Environment and energy · Federal data
USCIS H-1B Data: The Federal Record of Who Sponsors Skilled Foreign Workers
2026-06-21
The H-1B visa is the central channel through which US employers hire skilled foreign workers, and the USCIS H-1B Employer Data Hub turns that program into a public record — roughly 764,000 employer-petition rows from 2009 to 2023 naming who sponsors, where the job sits, and whether the petition was approved or denied.
Justice and immigration · Federal data
DOL Form 5500: The Federal Window Into Every US Pension and 401(k) Plan
2026-06-21
Form 5500 is the annual report every private-sector retirement and welfare plan must file under ERISA — the federal government’s primary public window into a private benefit system holding trillions of dollars. This guide covers the joint DOL, IRS, and PBGC filing, the EFAST2 system, plan types and schedules, the 401(k)-fee and pension-funding data, and a Python workflow over the public datasets.
Labor and workplace · Federal data
CRS Reports: The Federal Database Behind Congress’s Own Nonpartisan Research
2026-06-21
For most of its history the Congressional Research Service wrote authoritative, nonpartisan analysis for members of Congress that the public was not allowed to read — until a 2018 appropriations law forced the reports into the open. This is a guide to the ~23,200-report database that resulted: what CRS is and why it sits inside the Library of Congress, the anatomy of a product number and its revisions, how the reports earn their reputation as the most citable secondary source on US federal policy, a worked Python walkthrough of the EveryCRSReport bulk index, and the caveats of working with a corpus that was never designed to be a dataset.
Government operations · Federal data
FINRA BrokerCheck Firms: The Federal-Adjacent Registry of Every US Broker-Dealer
2026-06-21
FINRA’s BrokerCheck publishes the registration, status, and disciplinary history of every US broker-dealer firm — roughly 13,300 firm records keyed by CRD number, the screening source behind every “is this brokerage legit?” question. A field-level guide to CRD identifiers, the SEC-FINRA self-regulatory structure, disclosure events, registration scope, and the broker-misconduct research built on this record.
Finance and markets · Federal data
USASpending Contracts: The Federal Record of Every Dollar the Government Buys
2026-06-21
USASpending.gov is the government’s official open-data record of how it spends money, and its contracts half — sourced from FPDS-NG — is the authoritative governmentwide procurement file: roughly 100 million award and transaction records carrying the agency, the recipient UEI, the obligated dollars, the NAICS and PSC codes, and the competition status of every federal contract action. This guide covers the DATA Act mandate, the prime-award schema, competition and set-aside fields, the UEI transition, joins to subawards and the exclusions list, and a worked USASpending API walkthrough.
Government operations · Federal data
CMS Revoked Providers: The Federal List of Who Lost the Right to Bill Medicare
2026-06-21
When CMS revokes a provider’s Medicare billing privileges, it ends their ability to bill the program and attaches a re-enrollment bar of one to ten years — up to twenty for the worst cases. A field-level guide to 42 CFR 424.535, the revocation reasons, the re-enrollment bar, how revocation differs from an HHS-OIG exclusion, the ACA screening expansion, and a Python workflow over the genuine data.cms.gov revocations file.
Health and medicine · Federal data
CDC PLACES: The Federal Model of Health for Every US Census Tract
2026-06-21
No national survey can measure diabetes or smoking at the scale of a single neighborhood — the samples are far too small. CDC PLACES solves that with model-based small-area estimates, projecting survey responses onto every US census tract; the tract-level file runs to ~3.05 million tract-by-measure rows, the data behind neighborhood health-equity work.
Health and medicine · Federal data
CDC Excess Deaths: The Federal Measure of How Many More Americans Died Than Expected
2026-06-21
Excess deaths are the gap between how many Americans actually died and how many a statistical model expected — the measure that captured the full toll of COVID-19, including the undiagnosed and the indirect deaths the official tally missed. A field-level guide to the NCHS excess-mortality dataset: the over-dispersed Poisson baseline, the observed-versus-expected threshold, jurisdiction-by-week structure, provisional lag, and a worked data.cdc.gov Python walkthrough.
Health and medicine · Federal data
FEC Committees: The Federal Registry of Every PAC, Super PAC, and Campaign Committee
2026-06-21
Before a single dollar of federal campaign money can be traced, the spender has to be named — and the FEC committee registry is where every candidate committee, party committee, traditional PAC, and Super PAC is identified by a unique C-prefixed ID. This guide covers FECA and the registration threshold, the committee taxonomy, how Citizens United and SpeechNow created the Super PAC, and how the committee ID joins to the itemized money.
Money in politics · Federal data
HHS-OIG Enforcement: The Federal Record of Healthcare Fraud Settlements and Penalties
2026-06-21
The HHS Office of Inspector General is the largest inspector general’s office in the federal government, and its enforcement record (roughly 10,900 settlements, civil monetary penalties, and corporate integrity agreements) is the closest thing there is to a map of where healthcare-fraud risk has concentrated, from drug makers and hospital systems to nursing homes and labs. A field-level guide to the False Claims Act, the Anti-Kickback Statute, the Stark Law, CIAs, the LEIE relationship, and a Python workflow over the genuine oig.hhs.gov enforcement listing.
Health and medicine · Federal data
CMS HCAHPS: The Federal Survey of What Patients Say About Every US Hospital
2026-06-21
HCAHPS is the first national, standardized survey of what patients actually experienced in the hospital — nurse communication, responsiveness, cleanliness, the 0-to-10 rating, the would-recommend question — adjusted for survey mode and patient mix so hospitals can be compared fairly. A field-level guide to the ~326,000 hospital-by-measure records published on Care Compare, how the scores feed Hospital Value-Based Purchasing and Medicare payment, and a Python workflow that ranks hospitals, computes state averages, and tests whether response rate tracks score.
Health and medicine · Federal data
FDIC Enforcement Actions: The Federal Record of Orders Against Banks and Bankers
2026-06-21
The FDIC publishes every formal enforcement order it issues against state-chartered banks and the bankers who run them — roughly 10,900 cease-and-desist orders, civil money penalties, and prohibition orders that bar individuals from the industry for life. A field-level guide to the action types, the institution-affiliated-party concept, BSA/AML and safety-and-soundness causes, the join to the institutions directory, and a Python walkthrough of the public orders system.
Finance and markets · Sanctions and illicit finance · Federal data
CMS Hospital Service Area: The Federal Map of Where Every Hospital’s Patients Come From
2026-06-21
For every Medicare-certified hospital, CMS publishes the ZIP codes its patients come from — a hospital-by-patient-ZIP crosswalk of beneficiaries, cases, and charges that is the federal data behind hospital-market definition, merger antitrust review, and the Dartmouth Atlas tradition. A field-level guide to the ~1.16 million-row Hospital Service Area file, why small cells are suppressed, and how to compute catchments and market concentration in Python.
Health and medicine · Federal data
Trade.gov Consolidated Screening List: The Federal Index of Who US Exporters Cannot Do Business With
2026-06-21
Before any US company exports a good, transfers technology, or pays a foreign counterparty, it must check one list — the Consolidated Screening List, Trade.gov’s single feed of the restricted-party and sanctions lists from Commerce, State, and Treasury. This guide covers the entries, the legal authorities that bar dealing with them, and how a fuzzy-name search catches the aliases and transliterations that exact matching misses.
Sanctions and illicit finance · Federal data
CDC Injury Mortality: The Federal Record of How Americans Die from Firearms, Overdoses, and Crashes
2026-06-21
The CDC’s National Center for Health Statistics compiles every injury death in the United States from the death certificates filed under the National Vital Statistics System — the federal record behind the overdose epidemic, firearm deaths, rising suicide, and motor-vehicle fatalities, classified by mechanism and intent and reported as age-adjusted rates per 100,000.
Health and medicine · Federal data
CMS Healthcare-Associated Infections: The Federal Record of CLABSI, CAUTI, MRSA, and C. diff in US Hospitals
2026-06-21
Every US hospital reports the bloodstream infections, urinary-tract infections, MRSA, and C. diff its patients acquire while in its care — and CMS publishes the standardized infection ratios on Care Compare. Roughly 173,000 hospital-by-measure records covering the SIR, the observed-versus-predicted math, the HAC penalty money, and a worked data.cms.gov walkthrough.
Health and medicine · Federal data
FDIC Call Reports: The Federal Database Behind Every US Bank’s Quarterly Financials
2026-06-21
Every federally insured US bank files a Consolidated Report of Condition and Income (the Call Report) every quarter, and the FDIC publishes the result as a system of record so granular that the deposit outflows and unrealized securities losses that felled Silicon Valley Bank were legible in it months ahead. A guide to ~1.67 million bank-quarter rows: the Call Report’s statutory frame, the FFIEC forms, regulatory capital and asset-quality ratios, CAMELS, the 2023 failures, and a worked BankFind Suite API walkthrough.
Finance and markets · Federal data
OSHA Severe Injury Reports: The Federal Record of Amputations and Hospitalizations Since 2015
2026-06-21
On January 1, 2015, OSHA began requiring employers to report every work-related amputation, eye loss, and in-patient hospitalization within 24 hours, creating, for the first time, a near-real-time federal stream of individual severe-injury events. This is a field-level guide to the ~103,000-report dataset: the 29 CFR 1904.39 rule, the employer-name-and-NAICS columns, the inspection-versus-Rapid-Response-Investigation split, the State Plan coverage gap, and a Python workflow over OSHA’s downloadable file.
Labor and workplace · Federal data
FDIC Institutions: The Federal Registry of Every US Bank, Active and Historical
2026-06-21
The FDIC’s BankFind Suite is the canonical registry of every FDIC-insured institution, active and historical — roughly 27,800 banks and thrifts, each pinned to a permanent certificate number that ties together its call-report financials, its failure record, and its enforcement history. A field-level guide to the CERT key, charter classes, the active/inactive lifecycle, and the no-key BankFind API.
Finance and markets · Federal data
SEC 13F Institutional Holdings: The Federal Database Behind What the Big Money Owns
2026-06-21
Every quarter the SEC requires large institutional investment managers to disclose their long positions in exchange-listed securities, producing the federal database that powers all whale-watching — what Berkshire, Bridgewater, and the big hedge funds bought and sold. This guide covers Section 13(f), the XML information table, the 45-day lag and confidential-treatment carve-outs, the CUSIP and CIK join keys, and a Python walkthrough that pulls a manager’s latest 13F-HR from EDGAR and ranks its top holdings.
Finance and markets · Federal data
HMDA Mortgage Lender Filings: The Federal Record of Who Reports Under the Home Mortgage Disclosure Act
2026-06-21
The Home Mortgage Disclosure Act forces thousands of lenders to report every mortgage application and loan — but first someone has to record who filed. The FFIEC filer panel is that registry: roughly 34,700 filer-year records (2018–2023) keyed by Legal Entity Identifier, the index to the most important fair-lending dataset in the country.
Finance and markets · Economy and demographics · Federal data
SBA PPP Loans: The Federal Database Behind 11.8 Million Pandemic Paycheck Protection Loans
2026-06-21
When COVID-19 closed the economy in March 2020, Congress answered with the largest small-business lending program in American history — and the SBA published it loan by loan. The result is a roughly 11.8 million-record database of forgivable paycheck-protection loans: every borrower, lender, NAICS code, amount, and forgiveness status, and the fraud that rode in alongside the relief.
Finance and markets · Federal data
Food stamps by the numbers: using USDA SNAP participation data to track hunger and benefit policy
2026-06-19
The USDA Food and Nutrition Service publishes monthly SNAP participation and benefit data by state: total participants, households, benefits issued, average benefit per person, and issuance history going back to 1969. The data shows how food assistance responds to recessions, pandemic aid expansions, and state-level work requirement policies. Here is what the data contains, how to access it, and what more than 50 years of SNAP data reveal.
Federal data · Food and agriculture · Economy and demographics
The demographic backbone: using Census ACS data to contextualize every other federal dataset
2026-06-18
The Census Bureau's American Community Survey publishes 5-year estimates for every census tract in the US — income, poverty, race, housing tenure, education, employment, and 350+ other variables at the tract level. ACS is the denominator that makes every other federal dataset meaningful: HMDA denial rates per capita, OSHA injury rates per worker, SNAP participation per household. Here is what it contains, how to access it, and how to join it to enforcement data.
Federal data · Economy and demographics · Transparency and open data
Mapping housing discrimination: using HUD FHEO complaint data to find fair housing violations
2026-06-17
HUD's Fair Housing and Equal Opportunity office publishes a complaint database covering every fair housing complaint filed with HUD and participating state agencies — basis of discrimination (race, national origin, disability, familial status, sex, religion), property type, complaint disposition, and whether the complainant received relief. Here is the data structure and what 50,000+ complaints reveal about where housing discrimination concentrates.
Federal data · Economy and demographics · Justice and immigration
Inside the count: using BJS National Prisoner Statistics to analyze incarceration trends
2026-06-16
The Bureau of Justice Statistics publishes the National Prisoner Statistics program: state and federal prison populations back to 1925, with demographics (race, sex, age), offense categories, sentence lengths, and admission and release flows. Here is the data structure, how to access it, and what 100 years of incarceration data reveals about mandatory minimums, the drug war, and mass incarceration's racial dimensions.
Federal data · Justice and immigration
Workplace safety violations: using OSHA inspection and citation data to find dangerous employers
2026-06-15
OSHA publishes its full inspection and citation database — every workplace inspection since 1972, every violation found, every penalty assessed, and whether the employer contested the citation. The database covers 2.5M+ inspections across all industries. Here is what it contains, how to query it, and what patterns emerge from 50 years of enforcement data.
Federal data · Labor and workplace
Wage theft by employer: using DOL Wage and Hour Division enforcement data to find labor violations
2026-06-14
The Department of Labor's Wage and Hour Division publishes a public enforcement database covering every concluded investigation — employer name, violation type, back wages owed, employees affected, and civil money penalties. The database covers FLSA minimum wage/overtime, H-2A/H-2B temporary workers, FMLA, and Davis-Bacon prevailing wage violations. Here is the structure, how to query it, and what the data reveals about wage theft patterns across industries.
Federal data · Labor and workplace
Every US traffic death since 1975: using NHTSA FARS to analyze road safety, vehicle defects, and enforcement gaps
2026-06-13
The Fatality Analysis Reporting System (FARS) contains a record for every motor vehicle crash death on US public roads since 1975 — 1.1M+ fatalities with vehicle type, crash circumstances, driver behavior, and roadway conditions. Here is the data structure, how to download it, and what it reveals about drunk driving trends, pedestrian deaths, and the safety gap between vehicle classes.
Federal data · Transportation safety
Follow the money: mapping dark money and super PAC flows with FEC bulk data
2026-06-11
The FEC publishes bulk data on every contribution and expenditure in federal elections — candidates, PACs, super PACs, and party committees. Here is how to download the full dataset, trace money from donor to expenditure, and identify the shell-company layer that obscures dark money flows.
Federal data · Money in politics
The Wall of Shame: what the HHS-OCR HIPAA breach database reveals about healthcare data security
2026-06-10
HHS-OCR publishes every reported healthcare data breach affecting 500+ patients — the "Wall of Shame." Over 5,000 entries covering ransomware attacks, stolen laptops, unauthorized employee access, and business associate failures. Here is what the database contains and what it reveals about healthcare security failures.
Federal data · Health and medicine · Cybersecurity and privacy
By the numbers: using EEOC charge statistics to find discrimination patterns by industry and employer
2026-06-09
The EEOC publishes annual charge statistics and, since 2017, charge-level data under FOIA. The aggregate data shows which industries generate the most race, sex, disability, and age discrimination charges — and which large employers appear repeatedly in the conciliation record.
Federal data · Justice and immigration · Labor and workplace
The $800 billion bailout: using SBA PPP data to trace who got pandemic relief
2026-06-08
After a FOIA fight, the SBA released PPP loan data covering 11.8 million loans and $793 billion in forgiven funds. Here is what the public data contains, the fraud patterns it revealed, and how to cross-reference it with SAM.gov debarments, IRS nonprofit data, and the DOJ prosecution record.
Federal data · Finance and markets · Consumer protection · Transparency and open data
Trading on the inside: using STOCK Act filings to track congressional stock transactions
2026-06-07
The STOCK Act requires members of Congress to report stock trades within 45 days. The House Clerk publishes scanned PDFs — not structured data. Here is how Quiver Quantitative, Capitol Trades, and journalists have structured this data, and what the disclosures reveal about trading patterns around legislation and committee assignments.
Federal data · Government operations · Transparency and open data
The asylum lottery: what EOIR data reveals about judge-by-judge grant rate disparities
2026-06-06
EOIR publishes quarterly data on every immigration judge's case outcomes, including asylum grant rates. The spread is enormous — some judges grant asylum in fewer than 5% of cases; others grant it in more than 90%. Here is how to access and analyze the data.
Federal data · Justice and immigration
The mortgage map: using HMDA loan-level data to find lending disparities
2026-06-05
The Home Mortgage Disclosure Act requires 7,000+ lenders to report every mortgage application: the outcome (approved, denied, or withdrawn) plus race, income, loan amount, and census tract. Here is how to use the CFPB bulk download to find redlining, reverse redlining, and lender-level denial rate disparities.
Federal data · Economy and demographics · Finance and markets
The recall record: what the CPSC product safety database shows and what manufacturers hide
2026-06-04
The CPSC Recall database covers 9,800+ recalls since 1973. Behind the press releases: how many units are actually returned, which hazard categories dominate, and why the voluntary recall system lets manufacturers negotiate the language of their own enforcement actions.
Federal data · Consumer protection
SEC Form 13F: The Institutional Holdings Disclosure Behind Every Hedge Fund Tracker
2026-06-03
Section 13(f) requires institutional investment managers with >$100M in 13(f) securities to file quarterly holdings disclosures with the SEC — ~5,000 filers, 45-day lag, long-equity-only view. Here is the full holdings table schema (CUSIP, VALUE, SH/PRN, PUT/CALL, INVESTMENT DISCRETION, VOTING AUTHORITY), what 13F covers and critically excludes (no short positions, no bonds, no foreign-listed shares), major filers (Berkshire, BlackRock, Renaissance), confidential treatment requests, the 45-day stale-data limitation and clone strategy research, academic use (Griffin/Xu 2009, Brunnermeier/Nagel 2004, Edmans 2009), comparison to 13D/13G/Form 4, and a Python EDGAR bulk index parser to track position changes for any manager by CIK.
Federal data · Finance and markets
380 million transactions: indexing the DEA's ARCOS opioid distribution data
2026-06-02
How we indexed 380 million DEA ARCOS controlled-substance transaction records from the opioid MDL discovery release, what the data reveals about pill distribution, and how to cross-reference it against DEA enforcement actions and CDC overdose mortality.
Federal data · Health and medicine
EPA Safe Drinking Water Act Site Visits: The Federal Record of Public Water System Inspections
2026-06-01
Before a public water system ever incurs a drinking-water violation, a sanitary surveyor usually walks the site (the wellhead, the chlorination room, the storage tank, the operator logbook) and records what is wrong. EPA stores those inspections in the Safe Drinking Water Information System: roughly 433,000 site visits, each keyed to a public water system and scored across eight evaluation areas.
Environment and energy · Federal data
EPA ICIS-Air: The Federal Database Behind Clean Air Act Stationary Source Compliance
2026-06-01
Every factory, refinery, power plant, and chemical works in America that emits to the air sits somewhere between “in compliance” and “High Priority Violator,” and EPA keeps the ledger in ICIS-Air: roughly 279,000 stationary sources, each carrying its Clean Air Act program classification, permitted pollutants, compliance status, last full compliance evaluation, and formal enforcement actions.
Environment and energy · Federal data
CDC Nutrition, Physical Activity, and Obesity: The Federal Surveillance Record of American Health Behavior
2026-06-01
Every year the federal government calls roughly 400,000 Americans and asks how tall they are, how much they weigh, how often they exercise, and how many vegetables they eat. The CDC Nutrition, Physical Activity, and Obesity dataset is the state-by-state distillation of those answers — the most comprehensive federal record of how American health behavior varies across geography, income, race, and education.
Health and medicine · Food and agriculture · Federal data
CMS Post-Acute Care Utilization: The Federal Database Behind Home Health, Hospice, and Skilled Nursing Spending
2026-06-01
After a hospital stay ends, the least visible part of American healthcare begins: the home health nurse, the hospice, the skilled nursing facility. Medicare spends roughly $60 billion a year on this post-acute care, and CMS publishes a provider-level record of how much each agency, hospice, and nursing facility delivered and was paid, across roughly 28,400 provider-by-measure rows.
Health and medicine · Federal data
NVD CVE Database: The Federal Record of Every Known Software Vulnerability
2026-06-01
The NIST National Vulnerability Database is the federal record that turns CVE identifiers into structured, comparable data: roughly 459,000 catalogued software vulnerabilities, each carrying a CVE ID, CVSS severity score, CWE weakness type, affected products, and references. It is the layer that makes the world’s catalogue of known vulnerabilities something you can query, rank, and prioritize.
Cybersecurity and privacy · Federal data
CMS Provider Ownership: The Federal Database Behind Private Equity in Nursing Homes, Home Health, and Hospice
2026-06-01
For nearly every nursing home, home health agency, hospice, and hospital that bills Medicare, the federal government now publishes who owns it — the holding companies, management firms, real-estate trusts, and private equity funds stacked behind the name on the door. The CMS all-owners files under 42 CFR 455.104 are an X-ray of who controls American institutional care, with roughly 280,000 ownership records for nursing homes alone plus home health, hospice, and hospitals.
Health and medicine · Ownership and consolidation · Federal data
SAM Exclusions and Debarments: The Federal List of Who Cannot Win Government Contracts
2026-06-01
There is a single federal list that can end a company. When a firm or person is placed on the SAM.gov exclusions list, they are barred across the entire US government from winning federal contracts and most grants, loans, and benefits — roughly 64,400 active exclusion records, each naming the excluded party, why, by whom, and for how long.
Government operations · Federal data
FDA National Drug Code Directory: The Federal Index of Every US Drug Product
2026-06-01
The FDA National Drug Code Directory is the federal index of every drug product marketed in the United States — roughly 40,000 active listings, each keyed by its three-segment NDC and carrying brand and generic names, labeler, dosage form, route, active ingredients, DEA schedule, and marketing dates. The NDC is the universal serial number of the American drug supply.
Health and medicine · Federal data
FAA Airmen Certification Database: The Federal Record of Every US Pilot and Mechanic
2026-06-01
The FAA Airmen Certification Database is the federal registry of every person certified to work in American aviation — roughly 881,000 pilots, flight instructors, mechanics, dispatchers, and parachute riggers, each with a unique FAA identifier, certificate type and level, ratings, and medical class, published as the public Releasable Airmen file.
Transportation safety · Federal data
FAA Aircraft Registry: The Federal Database Behind Every N-Numbered US Aircraft
2026-06-01
Every civil aircraft flying legally in the United States carries an N-number on its tail, and behind each tail number sits a row in the FAA Aircraft Registry — roughly 293,000 registered aircraft with serial number, manufacturer, model, year, registrant, airworthiness class, and the Mode S hex code that bridges the registry to live ADS-B flight tracking.
Transportation safety · Federal data
CFTC Commitments of Traders: The Federal Database Behind Futures Market Positioning
2026-06-01
The Commitments of Traders report is the CFTC's weekly X-ray of who holds the open positions in US futures markets: roughly 98,000 market-week rows splitting open interest in crude oil, gold, corn, Treasuries, the E-mini S&P 500, and dozens of other contracts among commercial hedgers, swap dealers, managed-money funds, and small speculators.
Finance and markets · Federal data
FDA Device Classification Database: The Federal System Behind Every Medical Device Type
2026-06-01
The FDA Product Classification database is the master taxonomy of American medical devices: 7,058 device types, each pinned to a three-letter product code, a risk class (I, II, or III), a CFR regulation number, a medical specialty panel, and the premarket pathway a manufacturer must clear to sell it, forming the schema beneath every 510(k), PMA, registration, and adverse-event report.
Health and medicine · Federal data
CMS Doctors and Clinicians: The Federal Database Behind Every Medicare Physician
2026-06-01
The CMS Doctors and Clinicians national file is the closest thing the United States has to a public directory of who practices medicine inside Medicare — roughly 163,000 physician and clinician records carrying NPI, specialty, medical school, graduation year, group practice, hospital affiliation, and whether the provider accepts Medicare assignment.
Health and medicine · Federal data
EPA Enforcement Defendants: The Federal Database Behind 200,000 Environmental Cases
2026-06-01
Behind every EPA enforcement action is a list of names — the companies, municipalities, and individuals the United States actually pursued. EPA keeps that list in the Integrated Compliance Information System, and surfaced through ECHO it amounts to 199,682 defendant records, each tying a named party to a case number and flagging whether it appears in the complaint, the settlement, or both.
Environment and energy · Federal data
SEC Form 144: The Federal Database Behind Insider Sales of Restricted and Control Stock
2026-06-01
Form 144 is the notice an insider files before selling — the public statement of intent a corporate affiliate must put on record with the SEC before disposing of restricted or control securities under Rule 144. Where Form 4 records the trade that already happened, Form 144 announces the one about to, across 1,681 machine-readable notices since mandatory EDGAR e-filing began in 2022.
Finance and markets · Federal data
SEC EDGAR Company Registry: The Federal Index That Resolves Every Public Company
2026-06-01
Every SEC dataset identifies the company it concerns by a single number, the Central Index Key. The EDGAR company registry is the master index that turns that number into an entity — 28,392 companies, each carrying CIK, name, ticker, industry code, state of incorporation, exchange, former names, and active status, the lookup that makes the entire SEC corpus joinable.
Finance and markets · Federal data
SEC N-PORT Mutual Fund Holdings: The Federal Database Behind Every Fund Portfolio Position
2026-06-01
Form N-PORT is the monthly portfolio report every registered mutual fund and ETF files with the SEC — a position-by-position X-ray of what each fund owns, with 354,405 holding rows carrying security identifiers, market value, percent of net assets, asset category, country, and the fair-value hierarchy level that flags illiquid Level 3 positions.
Finance and markets · Federal data
SEC Schedule 13D Filings: The Federal Database Behind Activist Investor Stakes
2026-06-01
Schedule 13D is the federal filing an investor must submit on crossing 5 percent beneficial ownership of a US public company with intent to influence it — the document that turns a quiet stake into a public campaign, capturing the activist toeholds, proxy fights, and breakup demands of Icahn, Elliott, Pershing Square, and Starboard in near real time.
Finance and markets · Transparency and open data · Federal data
FRA Highway-Rail Grade Crossing Inventory: The Federal Database Behind 250,000 Railroad Crossings
2026-06-01
The Federal Railroad Administration maintains a record of every place a road and a railroad meet in the United States — 250,636 crossings, each with a unique DOT crossing number, warning-device type, and train and traffic counts, paired with a companion database of every train-vehicle collision, forming the foundation of US grade-crossing safety analysis.
Transportation safety · Federal data
FMCSA Crash Data: The Federal Database Behind Large Truck and Bus Crashes
2026-06-01
The FMCSA crash file records every state-reported crash involving a federally regulated commercial truck or bus — 258,057 crashes keyed to the carrier USDOT number, covering fatalities, injuries, tow-aways, and hazmat releases, feeding the CSA Crash Indicator safety score and a decade of policy debate over large-truck safety.
Transportation safety · Federal data
EPA Pollutant Emissions: The Federal Database Behind 10 Million Facility-Level Air and Toxic Release Records
2026-06-01
EPA combines the National Emissions Inventory and the Toxics Release Inventory into a single facility-level record of what American industry emits — 10.4 million rows, one per facility per pollutant per year, each keyed to an FRS Registry ID that links a smokestack to its permits, enforcement history, and census tract.
Environment and energy · Federal data
FMCSA Motor Carrier Census: The Federal Database Behind 2 Million Registered Trucking Companies
2026-06-01
The FMCSA motor carrier census records every entity holding a USDOT number — 2.18 million interstate trucking companies, bus and motorcoach operators, hazmat carriers, freight forwarders, and brokers — the federal registry that underpins safety oversight, insurance underwriting, and freight broker vetting across US trucking.
Transportation safety · Federal data
IRS Exempt Organizations Business Master File: The Federal Record of 1.3 Million Tax-Exempt Nonprofits
2026-06-01
The IRS Exempt Organizations Business Master File is the federal register of every organization recognized as tax-exempt under Section 501(c) — 1.26 million entities keyed by EIN and tagged with subsection code, NTEE sector, foundation type, ruling date, and coded asset and income ranges. It is the closest thing to a census of the US nonprofit sector.
Transparency and open data · Federal data
FDA Food Enforcement Reports: The Federal Database Behind Food and Cosmetic Recalls
2026-06-01
The openFDA Food Enforcement dataset surfaces every food and cosmetic recall the FDA has classified through its Recall Enterprise System — roughly 25,000 records carrying the recall reason, recalling firm, hazard class (I, II, III), distribution footprint, and the dates that trace each recall from initiation to termination.
Health and medicine · Food and agriculture · Consumer protection · Federal data
The DPA database: every federal deferred prosecution agreement since 1992
2026-06-01
The Corporate Prosecution Registry at Duke and UVA covers 3,000+ federal organizational prosecutions and every DPA/NPA since 1990 — including agreements DOJ refused to disclose under FOIA.
Federal data · Justice and immigration · Transparency and open data
The gun dealer map: what ATF's Federal Firearms Licensee data shows and what it hides
2026-05-31
ATF publishes the complete list of ~75,000 active Federal Firearms Licensees monthly as a free CSV. Here's what the data contains, what the Tiahrt Amendment keeps hidden, and how to cross-reference it.
Federal data · Justice and immigration
Before it disappeared: archiving $1.5 trillion in USAID foreign assistance data
2026-05-30
foreignassistance.gov went dark on January 31, 2025. What the dataset contained, how it was archived, what the DOGE cuts actually targeted, and where to access it now.
Federal data · Government operations · Transparency and open data
One in four audits flagged: indexing PCAOB deficiency data across the Big 4
2026-05-29
PCAOB inspection reports contain structured deficiency data for every registered audit firm. In 2023, 26% of Big 4 audits reviewed had Part I.A deficiencies — meaning auditors signed off without sufficient evidence. Here is what the data covers and how to use it.
Federal data · Finance and markets
Who won, who lost: five years of union elections in NLRB data
2026-05-28
How to pull, clean, and analyze NLRB union election records — RC and RD cases, the 2021–2024 organizing surge, the 100k export cap workaround, industry breakdowns, and cross-referencing with OSHA and CFPB data.
Federal data · Labor and workplace
The pharma payment map: joining CMS Open Payments and Medicare Part D prescribing data
2026-05-26
How joining CMS Open Payments (100M+ pharma payments to physicians) with Medicare Part D prescribing data (25M+ provider-drug rows) surfaces the correlation between manufacturer payments and prescribing patterns — and how to cross-reference with HHS OIG exclusions.
Health and medicine
BLS Occupational Employment and Wage Statistics: The Federal Database Behind Median Salary Data for Every US Occupation
2026-05-25
The Bureau of Labor Statistics Occupational Employment and Wage Statistics survey is the most comprehensive federal source of wage data by occupation — covering 830 detailed occupations across every industry and geographic area in the United States, with employment counts and full wage distributions for 1.1 million surveyed establishments.
Economy and demographics · Labor and workplace · Federal data
CFPB Consumer Complaint Database: The Federal Record Behind 3 Million Financial Product Complaints
2026-05-25
The Consumer Financial Protection Bureau complaint database contains every consumer complaint submitted to the CFPB since 2012 — 3 million+ complaints about mortgages, credit cards, student loans, debt collection, and credit reporting — with the company response, resolution outcome, and optional consumer narrative, making it the most comprehensive federal record of retail financial product failures.
Finance and markets · Consumer protection · Federal data
FARA Foreign Agent Registrations: The Federal Database Behind Foreign Lobbying and Influence Disclosure
2026-05-25
The Foreign Agents Registration Act database maintained by the DOJ National Security Division is the federal government's authoritative record of foreign influence operations in the United States, covering every individual and firm registered as a foreign agent, the foreign governments and entities that retained them, and the lobbying activities, media campaigns, and political contacts conducted on their behalf.
Money in politics · Justice and immigration · Federal data
SEC Form 4 Insider Trading: The Federal Database Behind Corporate Insider Stock Transactions
2026-05-25
SEC Form 4 filings are the mandatory disclosure every corporate officer, director, and large shareholder must submit within two business days of any transaction in company stock — creating a real-time public record of insider buying and selling at every US public company, covering 4 million+ filings in the EDGAR database.
Finance and markets · Federal data
NLRB Elections and Labor Enforcement Data: The Federal Database Behind Union Organizing and Unfair Labor Practice Cases
2026-05-25
The National Labor Relations Board maintains two parallel federal databases covering union organizing activity and labor law enforcement in the United States private sector — a representation election database covering every NLRB-supervised election since the 1930s, and an Unfair Labor Practice case database tracking charges, complaints, and Board orders against employers and unions.
Labor and workplace · Federal data
NOAA Storm Events Database: The Federal Record Behind 50 Years of US Weather Disasters
2026-05-25
The NOAA National Centers for Environmental Information Storm Events Database is the official federal record of severe weather in the United States — 48 event types including tornadoes, hurricanes, floods, and winter storms with records back to 1950, covering property damage estimates, crop damage, injuries, deaths, and event narratives across every county in the country.
Environment and energy · Federal data
NHTSA Vehicle Safety Complaints: The Federal Database Behind Auto Defect Investigations and Recalls
2026-05-25
The NHTSA vehicle safety complaints database contains every consumer complaint filed with the National Highway Traffic Safety Administration — 3 million+ complaints covering unexpected acceleration, brake failures, airbag malfunctions, fire risks, and steering defects — forming the primary data source for NHTSA defect investigations that trigger the largest vehicle recalls in US history.
Transportation safety · Consumer protection · Federal data
EPA RCRA Hazardous Waste Data: The Federal Database Behind 400,000 Regulated Facilities
2026-05-25
The EPA Resource Conservation and Recovery Act database tracks every generator, transporter, and disposal facility in the US hazardous waste management system — 400,000+ regulated facilities from small quantity generators to commercial hazardous waste incinerators — creating the most comprehensive federal record of hazardous waste compliance, violations, and enforcement.
Environment and energy · Federal data
EIA Form 860: The Federal Database Behind Every US Power Plant and Electricity Generator
2026-05-25
The EIA Annual Electric Generator Report (Form 860) collects data from every utility-scale generator in the United States — 25,000+ generating units at 8,000+ plants covering coal, natural gas, nuclear, wind, solar, and hydropower — providing the most comprehensive public inventory of US electricity generating capacity, ownership, location, and operational status.
Environment and energy · Federal data
NCES IPEDS: The Federal Database Behind Higher Education Statistics for 6,000 US Colleges
2026-05-25
The National Center for Education Statistics Integrated Postsecondary Education Data System collects annual data from every Title IV-eligible institution in the United States — 6,000 colleges and universities reporting enrollment, graduation rates, tuition, faculty salaries, financial aid, and institutional finances — making IPEDS the most comprehensive federal database of US higher education.
Research and education · Federal data
OFAC Civil Penalties: The Federal Database Behind Sanctions Violations and Treasury Enforcement
2026-05-25
The Treasury Department Office of Foreign Assets Control publishes every civil penalty settlement for sanctions violations — the banks, corporations, and individuals who conducted transactions with sanctioned countries or entities — with penalties ranging from thousands to over $1 billion, creating the most comprehensive public record of US sanctions enforcement.
Sanctions and illicit finance · Finance and markets · Federal data
ORI Research Misconduct Database: The Federal Record Behind Scientific Fraud and Fabrication
2026-05-25
The HHS Office of Research Integrity maintains the authoritative federal database of research misconduct findings — every case where a PHS-funded researcher has been found to have fabricated data, falsified results, or committed plagiarism, with findings covering hundreds of scientists at major research universities and medical centers.
Research and education · Federal data
USASpending Subawards: The Federal Database Behind Sub-Grant and Sub-Contract Flow Tracking
2026-05-25
USASpending.gov subaward data tracks the flow of federal money beyond the prime awardee — the sub-grants flowing from universities and state agencies to community organizations, and the sub-contracts from prime defense contractors to thousands of small suppliers, covering $500 billion+ in annual pass-through federal funding.
Government operations · Transparency and open data · Federal data
FEC Super PAC and Dark Money Data: The Federal Database Behind Outside Political Spending
2026-05-25
The FEC independent expenditure database covers every Super PAC and outside group that spent money to influence federal elections — over $4 billion in disclosed outside spending in the 2020 election cycle, plus dark money flowing through nonprofit organizations not required to disclose donors.
Money in politics · Federal data
NTSB Aviation Accident Database: The Federal Record Behind Every US Aircraft Accident Investigation
2026-05-24
The National Transportation Safety Board has maintained a structured record of every civil aviation accident in the United States since 1962 — 90,000+ accidents and incidents coded against a standardized schema covering aircraft type, phase of flight, weather conditions, pilot experience, injury counts, and probable cause findings that drive the largest safety reforms in US aviation history.
Transportation safety · Federal data
USAID Foreign Assistance Data: Tracing $50 Billion in Annual US Development Spending
2026-05-24
The United States spends more than $50 billion per year on foreign assistance: aid, development programs, security cooperation, and humanitarian response across 200 countries, administered by a dozen federal agencies and publicly disclosed on ForeignAssistance.gov with country-level, sector-level, and implementing-partner-level detail.
Government operations · Federal data
Congressional Voting Records: The Federal Database Behind Every House and Senate Roll Call Vote
2026-05-24
Congressional roll call vote data, maintained through VoteView, Congress.gov, and GovInfo, covers every recorded vote in the House and Senate dating back to the First Congress in 1789, enabling researchers to calculate legislator ideology scores, track party loyalty, analyze bipartisan coalitions, and build comprehensive political science datasets covering more than 230 years of American legislative history.
Government operations · Federal data
Grants.gov: The Federal Database Behind $500 Billion in Annual Federal Grant Opportunities
2026-05-24
Grants.gov is the federal government's unified portal for grant opportunities, listing every competitive federal grant, cooperative agreement, and other financial assistance opportunity from 26 grant-making agencies and covering $500 billion+ in annual awards to universities, state and local governments, nonprofits, and businesses across every federal program area.
Government operations · Research and education · Transparency and open data · Federal data
EPA Drinking Water Violations: The Federal Database Behind Safe Drinking Water Act Enforcement
2026-05-24
The EPA Safe Drinking Water Information System tracks every violation of the Safe Drinking Water Act by the 150,000 public water systems in the United States — health-based violations for exceeding maximum contaminant levels, monitoring failures, reporting violations, and treatment technique violations — creating the most comprehensive federal record of drinking water safety failures.
Environment and energy · Health and medicine · Federal data
Regulations.gov: The Federal Database Behind 25 Million Public Comments on US Rulemaking
2026-05-24
Regulations.gov is the federal government's unified rulemaking portal, hosting dockets for every significant federal regulation from 170+ agencies, 25 million public comments, and supporting documents including economic analyses and scientific studies, making it the most comprehensive public record of how federal rules are made and who influences them.
Government operations · Federal data
FHWA HPMS: The Federal Database Behind US Road Condition and Highway Performance Monitoring
2026-05-24
The Federal Highway Administration Highway Performance Monitoring System is the national database for US roadway conditions — collecting pavement condition ratings, traffic volumes, lane miles, and functional class data for 4.1 million miles of public roads, from Interstate highways to rural local roads, enabling Congress to calculate federal highway funding formulas and researchers to track infrastructure decline.
Transportation safety · Engineering and infrastructure · Federal data
FAA Civil Aviation Registry: The Federal Database Behind 700,000 Pilots and 300,000 Aircraft
2026-05-24
The FAA Civil Aviation Registry maintains two of the most comprehensive public databases in US aviation — the Airmen Certification Database covering 700,000 active pilots with certificate type, ratings, and medical status, and the Aircraft Registration Database covering 300,000 registered civil aircraft with owner, make, model, and airworthiness information.
Transportation safety · Federal data
DOE EV Charging Station Data: The Federal Database Behind 180,000 US Alternative Fuel Stations
2026-05-24
The Department of Energy Alternative Fuels Station Locator database tracks every publicly accessible electric vehicle charging station, hydrogen station, propane station, CNG station, and other alternative fuel outlet in the United States: 180,000+ stations as of 2024, with real-time status for DC fast chargers. It is the most comprehensive federal dataset on EV charging infrastructure deployment.
Environment and energy · Transportation safety · Federal data
USGS Wind and Solar Energy Data: The Federal Database Behind US Renewable Energy Infrastructure
2026-05-24
The United States Geological Survey maintains the most comprehensive public databases of wind turbine locations and utility-scale solar photovoltaic facility data in the United States — 72,000+ wind turbines with GPS coordinates, capacity ratings, hub heights, and rotor diameters, plus a growing solar PV database covering thousands of utility-scale installations.
Environment and energy · Federal data
SBA Loan Programs: The Federal Database Behind $50 Billion in Annual Small Business Financing
2026-05-24
The Small Business Administration 7(a) and 504 loan guarantee programs back over $50 billion in small business financing per year — every loan disclosed in a public dataset covering borrower name, location, loan amount, lender, industry, and jobs supported, making SBA the most transparent source of small business capital data in the United States.
Finance and markets · Federal data
US Attorney Prosecution Data: The Federal Database Behind 80,000 Annual Criminal Cases
2026-05-24
The 94 United States Attorneys' offices prosecute every federal crime, from drug trafficking, financial fraud, and public corruption to terrorism and violent crime, generating a public record through press releases, PACER dockets, and USAO annual statistical reports that together document over 80,000 criminal defendants per year in federal court.
Justice and immigration · Federal data
SAMHSA Treatment Data: The Federal Database Behind Substance Abuse and Mental Health Program Statistics
2026-05-24
The Substance Abuse and Mental Health Services Administration publishes the most comprehensive federal data on addiction treatment and mental health services in the United States — the National Survey on Drug Use and Health, the Treatment Episode Data Set covering 2 million annual admissions, and the National Mental Health Services Survey covering 12,000 treatment facilities.
Health and medicine · Federal data
PHMSA Pipeline Safety Data: The Federal Database Behind Gas and Liquid Pipeline Incidents
2026-05-24
The Pipeline and Hazardous Materials Safety Administration maintains incident reports for every significant gas and liquid pipeline accident in the United States — spills, explosions, injuries, fatalities, and property damage — creating the most comprehensive public record of pipeline safety performance across 2.7 million miles of US pipeline infrastructure.
Transportation safety · Engineering and infrastructure · Federal data
CDC Foodborne Outbreak Database: The Federal Record Behind 25,000 Annual Illness Clusters
2026-05-24
The CDC Foodborne Disease Outbreak Surveillance System tracks every reported multi-person foodborne illness outbreak in the United States — pathogen, implicated food, setting, illness count, hospitalizations, and deaths — covering 800+ outbreaks per year across all food categories.
Health and medicine · Food and agriculture · Federal data
OSHA 300A Injury Data: The Federal Database Behind Establishment-Level Workplace Injury Rates
2026-05-24
The OSHA 300A Summary data collects annual establishment-level injury and illness totals from 750,000 employers — enabling calculation of Total Recordable Case rates, Days Away Restricted or Transferred rates, and industry-specific benchmarks for every major employer in the United States.
Labor and workplace · Federal data
DOJ Civil Rights Division: The Federal Database Behind Police Reform Consent Decrees and Civil Rights Enforcement
2026-05-24
The Department of Justice Civil Rights Division enforces federal civil rights laws through pattern-or-practice investigations, consent decrees, voting rights litigation, and fair housing enforcement — producing a public record of every settlement, consent decree, and enforcement action against state and local governments.
Justice and immigration · Federal data
USDA ERS Food Economics: The Federal Database Behind Farm Income, Food Prices, and Rural America
2026-05-24
The USDA Economic Research Service publishes the most comprehensive federal data on food and agricultural economics — farm income and wealth statistics, food price indices, food security measurements, rural county classifications, and commodity supply-and-use tables spanning decades of US agricultural history.
Food and agriculture · Federal data
CMS Medicare Part D Prescriber Data: The Federal Database Behind Drug Spending for 1 Million Providers
2026-05-24
CMS publishes annual Medicare Part D prescriber-level drug spending data for every provider who prescribed drugs covered under Medicare — enabling researchers to identify outlier prescribers, track opioid prescribing patterns, and analyze drug spending by specialty and geography.
Health and medicine · Federal data
DEA Registrant Enforcement: The Federal Database Behind Controlled Substance License Revocations
2026-05-24
The Drug Enforcement Administration publishes every order to show cause, immediate suspension order, and final order revoking a DEA registration — the controlled substance prescribing licenses held by physicians, pharmacies, hospitals, and distributors.
Health and medicine · Federal data
NRC Reactor Oversight Process: The Federal Database Behind Nuclear Plant Safety Ratings
2026-05-24
The Nuclear Regulatory Commission Reactor Oversight Process evaluates every US commercial nuclear power plant across seven safety cornerstones — yielding publicly available performance indicator data, inspection findings, and action matrix dispositions.
Environment and energy · Federal data
CFTC Enforcement Actions: The Federal Database Behind Commodity Market Fraud Penalties
2026-05-24
The Commodity Futures Trading Commission enforcement database covers every civil action for violations of the Commodity Exchange Act — manipulation, fraud, spoofing, wash trading, and crypto asset fraud — with penalties totaling billions annually.
Finance and markets · Federal data
HMDA Mortgage Lending Data: The Federal Database Behind 15 Million Annual Mortgage Applications
2026-05-24
The Home Mortgage Disclosure Act requires every US mortgage lender to report every loan application — applicant race, income, property location, loan amount, interest rate, action taken, and denial reason — creating the most comprehensive public dataset on mortgage lending disparities.
Economy and demographics · Federal data
CMS Hospital Cost Reports: The Federal Database Behind Hospital Financial Data for 6,000 US Facilities
2026-05-24
The CMS Hospital Cost Report database contains detailed financial and utilization data for every Medicare-participating hospital — revenues, costs, charges, staffing, beds, and patient days — making it the most comprehensive source of US hospital financial data.
Health and medicine · Federal data
FEC Campaign Finance Enforcement: The Federal Database Behind Matters Under Review
2026-05-24
The Federal Election Commission Matters Under Review database tracks every campaign finance complaint and enforcement action — from contribution limit violations and disclosure failures to foreign national contributions and coordinated expenditure violations.
Money in politics · Federal data
IRS Criminal Investigation: The Federal Database Behind Tax Fraud and Financial Crime Prosecutions
2026-05-24
IRS Criminal Investigation is the only federal law enforcement agency with jurisdiction over federal tax crimes, filing 2,500–3,000 criminal cases per year with a 90%+ conviction rate, covering tax evasion, money laundering, and identity theft refund fraud.
Transparency and open data · Sanctions and illicit finance · Federal data
CDC NNDSS: The Federal Database Behind Reportable Disease Surveillance in the United States
2026-05-24
The National Notifiable Diseases Surveillance System aggregates case reports from all 50 states and territories for 120+ nationally notifiable diseases — from salmonellosis and Lyme disease to HIV, hepatitis, measles, and emerging threats.
Health and medicine · Federal data
OSHA Violations Database: The Federal Record of 200,000 Annual Workplace Safety Citations
2026-05-24
The OSHA enforcement database contains every citation issued after a workplace inspection — violation type, penalty amount, standard violated, and abatement status — covering 200,000+ annual citations across all industries.
Labor and workplace · Federal data
GAO Reports Database: The Congressional Watchdog Behind 900 Annual Federal Audits
2026-05-24
The Government Accountability Office publishes 900+ reports, testimonies, and correspondence per year — audits, investigations, and evaluations of federal programs across every agency and department.
Government operations · Federal data
FCC Universal Licensing System: The Federal Database Behind Every US Radio License
2026-05-24
The FCC Universal Licensing System contains every active radio license in the United States — AM and FM broadcast stations, TV stations, cellular carriers, commercial satellite operators, amateur radio operators, and 1,500+ other wireless service categories.
Federal data
UFLPA Entity List: The Federal Database Behind Uyghur Forced Labor Supply Chain Enforcement
2026-05-24
The Uyghur Forced Labor Prevention Act Entity List identifies companies whose goods are presumed to be produced with Uyghur forced labor in Xinjiang — any imports from these entities are barred from US markets unless importers can rebut the presumption with clear and convincing evidence.
Sanctions and illicit finance · Federal data · Economy and demographics
FinCEN BSA Enforcement: The Federal Database Behind Anti-Money Laundering Civil Penalties
2026-05-24
The Financial Crimes Enforcement Network publishes every Bank Secrecy Act civil enforcement action — civil money penalties, consent orders, and cease-and-desist orders against banks, money services businesses, and cryptocurrency exchanges for failures in anti-money laundering compliance programs.
Sanctions and illicit finance · Federal data
SAM.gov Exclusions: The Federal Database Behind Government Contractor Debarments
2026-05-24
The System for Award Management exclusions database lists every individual and entity currently barred from receiving federal contracts, grants, and other financial assistance — covering debarments, suspensions, proposed debarments, and voluntary exclusions across all federal agencies.
Government operations · Federal data
FHWA National Bridge Inventory: The Federal Database Behind 620,000 US Bridge Inspections
2026-05-24
The Federal Highway Administration National Bridge Inventory collects biennial condition ratings for every highway bridge in the United States — 620,000 bridges covering structural sufficiency, deck ratings, superstructure, substructure, and channel conditions.
Transportation safety · Engineering and infrastructure · Federal data
NIH Research Portfolio: The Federal Database Behind $50 Billion in Annual Biomedical Grants
2026-05-24
The NIH Research Portfolio Online Reporting Tools database covers every NIH-funded research project since 1985 — 500,000+ active and historical grants totaling over $50 billion per year, spanning every disease area, institution, and principal investigator in US biomedical research.
Research and education · Federal data
USDA SNAP Program Data: The Federal Database Behind $100 Billion in Food Assistance
2026-05-24
The Supplemental Nutrition Assistance Program is the largest US food assistance program — 42 million participants, $100 billion in annual benefits, and one of the largest automatic stabilizers in the federal budget.
Food and agriculture · Federal data
FEMA Disaster Declarations: The Federal Database Behind 70 Years of US Natural Disasters
2026-05-24
The FEMA disaster declaration database records every major disaster, emergency, and fire management assistance declaration since 1953 — over 4,600 major disaster declarations covering hurricanes, floods, tornadoes, wildfires, and pandemics.
Environment and energy · Federal data
CMS Hospital Compare: The Federal Database Behind Quality Ratings for 5,000 US Hospitals
2026-05-24
The CMS Hospital Compare program publishes readmission rates, patient safety indicators, HCAHPS patient satisfaction scores, and payment data for every Medicare-certified hospital in the United States.
Health and medicine · Federal data
DOL OFLC Visa Disclosures: The Federal Database Behind H-1B, H-2A, and H-2B Wage Records
2026-05-24
The Department of Labor Office of Foreign Labor Certification publishes every H-1B Labor Condition Application, H-2A agricultural temporary worker certification, and H-2B non-agricultural temporary worker certification — the employer wage attestations behind the US guest-worker visa system.
Labor and workplace · Justice and immigration · Federal data
PACER Federal Courts: The Database Behind 1 Billion Federal Court Documents
2026-05-24
The Public Access to Court Electronic Records system holds dockets and documents for every federal district, bankruptcy, and appellate case filed since the 1980s: over 1 billion documents, a subset of which is freely accessible via the CourtListener API and RECAP mirror.
Justice and immigration · Federal data
BIS Export Enforcement: The Federal Database Behind US Export Control Violations
2026-05-24
The Bureau of Industry and Security's export enforcement records cover every administrative settlement, denial order, and criminal referral for violations of US export control law — the Export Administration Regulations that govern dual-use technology exports to adversary nations.
Sanctions and illicit finance · Federal data
Treasury Daily Treasury Statement: The Federal Database Behind the US Government's Daily Cash Position
2026-05-24
The Daily Treasury Statement reports the federal government's cash position every business day — receipts, outlays, and the operating cash balance — and is the most granular real-time fiscal data available from the US government.
Finance and markets · Government operations · Federal data
Census SAIPE: The Federal Database Behind County-Level Poverty and Income Estimates
2026-05-24
The Small Area Income and Poverty Estimates program produces annual county-level income and poverty statistics used to allocate $16 billion in Title I-A education funding.
Economy and demographics · Federal data
EEOC Discrimination Charges: The Federal Database Behind 80,000 Annual Workplace Bias Claims
2026-05-24
The EEOC charge database tracks every workplace discrimination complaint filed with the federal government — race, sex, disability, age, religion — from first filing through litigation outcome.
Justice and immigration · Federal data
FDA FAERS: The Federal Adverse Event Reporting Database Behind Drug Safety Surveillance
2026-05-24
The FDA Adverse Event Reporting System contains every post-market drug safety report submitted since 1968 — manufacturer reports, voluntary consumer reports, and FDA-initiated reports — totaling over 26 million case submissions.
Health and medicine · Federal data
NHTSA FARS: The Federal Database Behind Every US Traffic Fatality Since 1975
2026-05-24
The Fatality Analysis Reporting System is a census of every motor vehicle crash in the United States resulting in death — 50 years of data, 2 million fatalities, and the primary evidence base for federal highway safety policy.
Transportation safety · Federal data
CPSC Recalls: The Federal Database Behind 50 Years of Consumer Product Safety Recalls
2026-05-24
CPSC (Consumer Product Safety Act 1972): ~9,800 recalls since 1973 covering ~15,000 product types (excludes food, drugs, autos, firearms). Section 15 voluntary (negotiated, most common) vs. Section 9 mandatory recalls; 24-hour reporting obligation for substantial product hazards. CPSIA 2008 (Chinese toy lead paint scandal 2007): third-party testing mandates, CPC/GCC certificates, 100 ppm lead limits, phthalate limits, tracking labels. SaferProducts.gov incident reporting database (NEISS-AIP, CPSC hospital sentinel network). Recall delays: average 12-18 months first incident to recall. Notable: Fisher-Price Rock n Play sleeper (4.7M units, 32 infant deaths, 2019), IKEA MALM dresser tip-over (29M units North America, 2016/2022), Peloton Tread+ (125k units, 2021), Samsung Galaxy Note 7 (2.5M units, 2016), Takata airbags (67M+ airbags, 19+ deaths, 2014-2019, NHTSA-led). recalls.gov/api and cpsc.gov/data: recallID/recallDate/title/description/hazard/remedy/units/productCategory/injuries/deaths fields; API parameters: product_type_id/date_from/date_to. Furniture stability mandatory rule (2023) targeting tip-overs. Safe Sleep for Babies Act (2022): banned inclined sleepers, crib bumpers. Python recalls.gov API analysis: hazard-type aggregation, product-category units recalled, 2015-2024 annual trend, fatal recall identification.
Consumer protection · Federal data
ClinicalTrials.gov: The Federal Database Behind 500,000 Clinical Trials and Drug Approval Research
2026-05-24
ClinicalTrials.gov (NLM, launched February 2000 per FDAMA 1997): 500,000+ registered studies as of 2024. FDAAA 801 (2007): mandatory registration within 21 days of first enrollment for applicable clinical trials (ACTs -- Phase 2+ interventional trials of FDA-regulated drugs/biologics/devices); results reporting within 12 months of primary completion date; penalties up to $10,000/day; NIH grant withholding; but 2015 NEJM study found only 13% reporting on time. Study phases: Phase 0 (microdosing), Phase 1 (safety, 20-80 participants), Phase 2 (efficacy signal, 100-300), Phase 3 (pivotal RCTs, FDA approval basis), Phase 4 (post-marketing); observational (cohort/case-control/cross-sectional) studies phased differently. Key fields: NCT number, official title, brief summary, sponsor type (industry ~50%, NIH/federal ~20%, academic ~30%), study status, phase, allocation, intervention model, masking, primary completion date, enrollment, primary outcome measures, eligibility criteria (inclusion/exclusion), age range, gender, MeSH condition terms, intervention type (drug/device/behavioral/procedure). Disease area composition: oncology ~35%, diabetes/cardiology/psychiatry/ID follow. COVID-19 surge: ~11,000 COVID trials 2020-2021. Publication bias (file drawer problem): AllTrials campaign, Ben Goldacre, COMPARE project, RIAT. ClinicalTrials.gov API v2 at clinicaltrials.gov/api/v2/studies: no API key, pagination by pageSize/pageToken, protocolSection/resultsSection/statusModule/conditionsModule modules. Aggregate stats: ~40% completed, ~25% recruiting, ~15% terminated. Python API query: recruiting Phase 3 oncology trials by enrollment (top 10) + phase distribution for all cancer trials.
Research and education · Health and medicine · Federal data
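The pageSize/pageToken pagination mentioned in the ClinicalTrials.gov summary above can be sketched as a generator. This is a minimal sketch, not the article's own code: it assumes each response carries a `studies` list and an optional `nextPageToken` field, and the names `fetch_page` and `iter_studies` are hypothetical. The fetcher is injectable so the loop can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional

BASE_URL = "https://clinicaltrials.gov/api/v2/studies"


def fetch_page(params: dict) -> dict:
    """GET one page from the v2 studies endpoint and decode the JSON body."""
    url = f"{BASE_URL}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_studies(
    fetch: Callable[[dict], dict] = fetch_page,
    page_size: int = 100,
    max_pages: Optional[int] = None,
) -> Iterator[dict]:
    """Yield study records page by page until no continuation token remains.

    Assumption: the response has a `studies` list and, when more pages
    exist, a `nextPageToken` string to send back as `pageToken`.
    """
    token = None
    pages = 0
    while True:
        params = {"pageSize": page_size}
        if token:
            params["pageToken"] = token
        payload = fetch(params)
        yield from payload.get("studies", [])
        pages += 1
        token = payload.get("nextPageToken")
        if not token or (max_pages is not None and pages >= max_pages):
            break
```

Passing `max_pages` keeps exploratory runs small; filters such as phase or condition would be added to `params`, and their exact names should be checked against the API documentation.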
Census Current Population Survey: The Federal Database Behind the Official US Poverty and Unemployment Rates
2026-05-24
CPS (Census/BLS joint survey since 1940): ~60,000 housing units/month (4-8-4 rotation group design); reference week containing the 12th. Labor force classifications: employed (1+ hour for pay/profit during reference week), unemployed (no work + active search past 4 weeks + currently available), not in labor force. U-1 through U-6 supplemental measures: U-3 = official rate, U-6 = total underemployment (unemployed + marginally attached + part-time for economic reasons); COVID-19 peak April 2020 U-3 14.7% / U-6 22.9%. Annual ASEC supplement (March, expanded ~100,000 households): official poverty rate (48 Orshansky thresholds by family size/composition; 2023 family-of-4 threshold ~$30,900; 2023 poverty rate ~11.1%, ~36M people); health insurance coverage; SPM (Supplemental Poverty Measure, 2011: counts SNAP/housing subsidies/EITC, subtracts taxes/work expenses/medical costs, geographic cost adjustment -- lower poverty for working-age, higher for elderly). CPS vs. CES/QCEW: residence-based (where people live) vs. establishment-based (where jobs are); CPS includes agricultural/domestic/self-employed not in QCEW. CPS microdata fields: PWSSWGT/PRTAGE/PESEX/PRDTRACE/PEHSPNON/PEEDUCA/PEMLR/PRUNTYPE/PRERNWA/OFFPOV/POVLL/PRCITSHP. IPUMS-CPS harmonized microdata back to 1962; raw files at census.gov/data/datasets; FRED: UNRATE/U6RATE/CIVPART/LNS11000000; BLS LAUS for state-level unemployment. Python FRED API + BLS LAUS API: state unemployment/poverty/LFPR table with YoY change.
Economy and demographics · Federal data
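The U-3 and U-6 definitions in the CPS summary above reduce to two ratios. The sketch below is illustrative only: the function names are hypothetical, and the U-6 denominator (labor force plus marginally attached workers) follows the BLS definition rather than anything stated in the summary.

```python
def u3_rate(employed: float, unemployed: float) -> float:
    """Official unemployment rate: unemployed as a percent of the labor force."""
    labor_force = employed + unemployed
    if labor_force <= 0:
        raise ValueError("labor force must be positive")
    return 100.0 * unemployed / labor_force


def u6_rate(
    employed: float,
    unemployed: float,
    marginally_attached: float,
    part_time_economic: float,
) -> float:
    """Broad underutilization rate.

    Numerator: unemployed + marginally attached + part-time for economic
    reasons. Denominator (assumption from the BLS definition): labor force
    plus marginally attached workers.
    """
    labor_force = employed + unemployed
    denominator = labor_force + marginally_attached
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    numerator = unemployed + marginally_attached + part_time_economic
    return 100.0 * numerator / denominator
```

With weighted microdata, each argument would be a sum of person weights (for example PWSSWGT) over the matching labor force category.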
DEA ARCOS: The Federal Opioid Distribution Database Behind 380 Million Pill Shipment Transactions
2026-05-24
DEA ARCOS (Automation of Reports and Consolidated Orders System): mandatory reporting under 21 USC 827 + 21 CFR 1304.33 for all manufacturers/distributors/importers of Schedule I/II controlled substances. 380M individual opioid transaction records 2006-2014 (oxycodone, hydrocodone, fentanyl, morphine, hydromorphone, methadone, oxymorphone, buprenorphine). Transaction fields: reporter DEA number, buyer DEA number, drug code, drug name, dosage unit, quantity, transaction date, transaction type (S=sale, P=purchase, T=theft/loss, R=return). MDL 2804 (In re: National Prescription Opiate Litigation, Judge Polster, NDOH): July 2019 court order released ARCOS data to Washington Post and HD Media -- first-ever public transaction-level release. Key findings: 76B oxycodone/hydrocodone pills shipped 2006-2014; WV ~780 pills/person/year; Mingo County WV: 3.3M hydrocodone pills over 2 years for 25,000 people; McKesson, Cardinal Health, AmerisourceBergen (Big Three) distributed 44% of all opioids. Suspicious order monitoring failure: 21 CFR 1301.74(b) requires reporting unusual orders; DEA settlements: McKesson $150M + registration surrenders 2017, AmerisourceBergen $150M 2017, Cardinal Health $44M. Purdue Pharma: OxyContin 1996, $634M 2007 plea, $8.34B 2020 settlement, Sacklers $6B; Mallinckrodt $1.6B. Big Three civil settlement $21B (2022); J&J $5B; Walgreens $5.7B; CVS $5B; Walmart $3.1B; total settlements $55B+. Washington Post bulk download at WaPo arcos-database pages; arcos R package. Python WaPo bulk TSV download: pills-per-capita by county for oxycodone/hydrocodone, top distributors, annual trend.
Health and medicine · Federal data
DOL UI Claims: The Federal Database Behind Weekly US Unemployment Statistics Since 1967
2026-05-24
DOL ETA weekly UI claims (Thursday 8:30am): initial claims SA (ICSA) + continuing claims SA (CCSA/CC4WSA). 53 jurisdictions: 50 states + DC + PR + VI. COVID peak: 6.9M initial claims week of April 4 2020 (prior record 695k, Oct 1982); continuing claims peak 24.9M May 2020. CARES Act PUA extended to gig/self-employed. Regular state UI: typically 26 weeks; federal-state Extended Benefits at 6.5%/8% insured unemployment rate trigger. State benefit max: Mississippi $235/wk to Massachusetts $1,050/wk. Recipiency rate ~27% of unemployed in normal times. FRED series: ICSA, ICNSA, CCSA, CC4WSA at fred.stlouisfed.org; DOL ETA-539/5159 forms; DOL bulk at oui.doleta.gov/unemploy/claims.asp. BLS UI-vs-CPS distinction: UI = administrative benefit recipients vs. CPS = household survey unemployed. Python FRED API ICSA 2019-present + COVID peak detection + 52-week rolling average.
Labor and workplace · Economy and demographics · Federal data
CMS Nursing Home Compare: The Federal Database Behind Quality Ratings for 14,700 US Nursing Homes
2026-05-24
CMS Five-Star Quality Rating: ~15,000 Medicare/Medicaid-certified nursing homes, ~1.35M residents, ~$90k-105k/yr private pay. Three domains: Health Inspections (standard annual + complaint surveys; F-tag deficiency system F600-F999; scope/severity matrix A-L; immediate jeopardy J-L), Staffing (Payroll-Based Journal PBJ quarterly since 2017: RN hours/resident day, total nurse hours/resident day, weekend staffing), Quality Measures (MDS 3.0 derived: long-stay high-risk pressure ulcers, falls with major injury, antipsychotic use in dementia, UTI; short-stay pressure ulcer + improved function). Special Focus Facilities (SFF): ~90 facilities with persistent serious quality problems; ~400 on SFF Candidate list; monthly CMS publication; decertification risk. Ownership transparency: Form CMS-855A; private equity association with lower staffing (Braun 2021, Harrington 2020); large chains: ManorCare/ProMedica (~250 facilities), Genesis Healthcare. data.cms.gov datasets: Provider Information (CMS_Certified_Nursing_Facilities.csv), Health Deficiencies, Quality Measures, Staffing, Penalties (CMPs). Socrata API, no key required. Python Provider Info CSV analysis: star distribution, SFF flags, average staffing by star rating, top-10 states by 1-star share.
Health and medicine · Federal data
BLS QCEW: The Federal Database Behind US Payroll Data for Every Industry and County
2026-05-24
BLS QCEW (Quarterly Census of Employment and Wages): joint BLS-state partnership using UI administrative tax records. ~11M establishment records/quarter, ~95% of all US civilian employment. Excludes self-employed, military, elected officials, railroad (RRB), some agricultural. Key fields: area_fips (2-digit state, 5-digit county, MSA, US), industry_code (NAICS 2-6 digit), own_code (0=total, 1=federal, 2=state, 3=local, 5=private), disclosure_code (N=suppressed when <3 establishments or 1 employer >80% wages), avg_weekly_wage, month1/2/3_emplvl, total_qtrly_wages, taxable_qtrly_wages. Geographic coverage: national, 50 states+DC, 3,200+ counties, 380+ MSAs. QCEW vs. CES: QCEW is the administrative universe (5-month lag), CES is the sample survey (1-month lag); CES March benchmark revisions align to QCEW. 2024 benchmark revision: 818,000 downward revision to CES (QCEW showed slower job growth than CES estimated). Location Quotient: (industry share of county employment) / (industry share of national employment); LQ>1 = local specialization. Three data access paths: BLS API series IDs, QCEW cross-sectional API at data.bls.gov/cew/api/, bulk flat files at blsdownload.bls.gov (~500MB/quarter compressed). Python QCEW API private sector 2-digit NAICS: employment/wage table by supersector + LQ demo + YoY wage growth.
Economy and demographics · Federal data
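The location quotient in the QCEW summary above is a ratio of two employment shares. A minimal sketch, with a hypothetical function name and made-up input figures, might look like this:

```python
def location_quotient(
    local_industry_emp: float,
    local_total_emp: float,
    national_industry_emp: float,
    national_total_emp: float,
) -> float:
    """Return (local industry share) / (national industry share).

    LQ > 1 suggests the area is more specialized in the industry than
    the nation as a whole; LQ < 1 suggests the opposite.
    """
    if min(local_total_emp, national_total_emp, national_industry_emp) <= 0:
        raise ValueError("totals and national industry employment must be positive")
    local_share = local_industry_emp / local_total_emp
    national_share = national_industry_emp / national_total_emp
    return local_share / national_share


# Illustrative numbers only: 200 of 1,000 county jobs in the industry,
# against 10% of national employment.
lq = location_quotient(200, 1_000, 10_000_000, 100_000_000)
```

Rows with disclosure_code set to N have suppressed employment levels, so they would need to be dropped or handled separately before computing shares.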
BLS Current Employment Statistics: The Federal Database Behind the Monthly Jobs Report
2026-05-24
BLS CES (Current Employment Statistics): monthly payroll survey of ~140,000 businesses and ~440,000 worksites covering ~34% of all nonfarm payroll. Two surveys: CES (establishment, payroll jobs) + CPS (household, unemployment rate). Released first Friday of each month at 8:30am ET. Headline: total nonfarm payroll employment; also private payrolls, manufacturing, AHE (average hourly earnings), AWH (average weekly hours). NAICS supersectors: Mining/Logging, Construction, Manufacturing (durable/nondurable), Trade/Transport/Utilities, Information, Financial Activities, Professional/Business Services, Education/Health, Leisure/Hospitality, Other Services, Government. Series ID format: CEU + supersector + industry + data type (01=employment, 03=hours, 11=AHE). Examples: CEU0000000001 (total nonfarm), CEU3000000001 (manufacturing), CEU7000000001 (leisure/hospitality), CEU0500000011 (private AHE). Reference week = week containing the 12th. Three estimates: preliminary (T+30 days), first revision (T+60), second revision (T+90). March annual benchmark revision aligns to QCEW administrative records. COVID: April 2020 -20.5M jobs (worst single month ever); Great Recession trough Feb 2010 -8.7M from Jan 2008 peak. BLS API: api.bls.gov/publicAPI/v2/timeseries/data/, up to 50 series/query and 20 years/request with a registration key (500 queries/day). ADP preview released 2 days before. AHE ~$35/hr all private (2024); real wage growth = AHE minus CPI. Python BLS API 20-series fetch + supersector employment table with YoY change + AHE/AWH block + COVID recovery tracker.
Economy and demographics · Federal data
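The CES series ID layout in the summary above can be split by position. This sketch assumes the 13-character layout used in the examples (two-letter prefix, one seasonal-adjustment letter, two-digit supersector, six-digit industry, two-digit data type); the S/U seasonal codes are my addition from BLS conventions, and the function and class names are hypothetical.

```python
from typing import NamedTuple

# Data type codes listed in the summary; other codes exist and map to None here.
DATA_TYPES = {"01": "employment", "03": "hours", "11": "AHE"}
# Assumption: S = seasonally adjusted, U = not seasonally adjusted.
SEASONAL = {"S": "seasonally adjusted", "U": "not seasonally adjusted"}


class CesSeries(NamedTuple):
    seasonal: str
    supersector: str
    industry: str
    data_type: str
    data_type_label: object  # str label, or None if the code is not in DATA_TYPES


def parse_ces_series_id(series_id: str) -> CesSeries:
    """Split a CES series ID such as CEU0500000011 into its components."""
    sid = series_id.strip().upper()
    if len(sid) != 13 or not sid.startswith("CE") or not sid[3:].isdigit():
        raise ValueError(f"not a 13-character CES series ID: {series_id!r}")
    if sid[2] not in SEASONAL:
        raise ValueError(f"unknown seasonal code: {sid[2]!r}")
    data_type = sid[11:13]
    return CesSeries(
        seasonal=sid[2],
        supersector=sid[3:5],
        industry=sid[5:11],
        data_type=data_type,
        data_type_label=DATA_TYPES.get(data_type),
    )
```

For example, `parse_ces_series_id("CEU3000000001")` yields supersector `30` (manufacturing in the summary's examples) with data type `01`.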
BOP Federal Prison Population: The Federal Database Behind 148,000 US Federal Inmates
2026-05-24
BOP (Bureau of Prisons) under DOJ: 148,000+ federal inmates in 122 institutions (~36,000 staff). Federal offenses: drug trafficking ~44%, weapons ~20%, sex offenses ~8%, immigration ~6%, fraud/white collar ~5%. Federal mandatory minimums: 21 USC 841(b) (1kg+ heroin/5kg+ cocaine = 10-yr minimum). Crack/powder disparity: pre-FSA 2010 100:1 ratio, FSA 2010 reduced to 18:1. USSC Sentencing Guidelines advisory since Booker 2005; 13.4% longer sentences for Black defendants (USSC 2017). First Step Act 2018: FSA retroactivity (~2,600 released), safety valve expansion, earned-time credits (10-15 days/month), PATTERN risk tool. Demographics: 93% male, 37% Black, ~23% non-US citizens. Facility types: ADX (Florence supermax), USP, FCI, FPC, FMC. Private prisons: Biden EO 14006 non-renewal; Trump 2025 re-expansion. BOP Statistics at bop.gov/about/statistics/ (static tables). USSC datafiles at ussc.gov for sentencing data. Python BOP HTML table scraper + USSC drug sentence trends 2010-2023.
Justice and immigration · Federal data
CDC Drug Overdose Mortality: The Federal Database Behind the US Opioid Crisis
2026-05-24
107,543 overdose deaths in 2023 (CDC NCHS provisional); first 100k+ year was 2021. Three-wave opioid crisis: Wave 1 prescription opioids (OxyContin 1996, Purdue 2007 $634M fine); Wave 2 heroin surge (2010-2013); Wave 3 synthetic opioids (IMF fentanyl 50-100x morphine, ~75k synthetic opioid deaths 2022). Three CDC sources: VSRR Provisional Drug Overdose Counts (monthly, Socrata API at data.cdc.gov, 12-month rolling), CDC WONDER (death certificate ICD-10 queries 1999-present, county-level), state drug category flat file. ICD-10 T-codes: T40.1 heroin, T40.2 natural/semisynthetic, T40.4 synthetic opioids (fentanyl -- the key field), T40.5 cocaine, T43.6 stimulants. Fentanyl supply: China scheduled 2019; Mexico (Sinaloa/CJNG) now primary; counterfeit M30 pills; xylazine (tranq) not reversed by naloxone. Geographic: WV ~80/100k; Appalachian epicenter. Purdue $8.34B 2020 DOJ settlement; Sackler $6B; total settlements >$55B. MOUD: buprenorphine (X-waiver removed by the MAT Act, enacted December 2022), methadone, naltrexone. Python VSRR Socrata API synthetic opioid rate by state.
Health and medicine · Federal data
DOL Form 5500: The Federal Database Behind Every US Pension and Benefit Plan
2026-05-24
Form 5500 Annual Return/Report: ~217,000 filings/year for all ERISA plans (DB, DC, health/welfare with 100+ participants). $30T+ plan assets. EFAST2 at efast.dol.gov -- public record. Plan types: DB (defined benefit, 27M->13M participants since 1985, employer bears investment risk); DC (401k: $23k employee deferral 2024, $69k total, 5% TSP match, target-date funds); 403(b); ESOPs. Schedule architecture: A (insurance contracts), C (service provider fees -- 408(b)(2) indirect compensation, basis for ERISA fee litigation: Boeing/Intel/MIT/Cornell all settled), G (prohibited transactions), H (large plan financials: balance sheet, income statement), R (retirement/actuarial), SB (Schedule SB: funding target, min required contribution, AFTAP triggers at 60%/80%). PBGC insurance: $80k/yr guarantee; $96/participant flat premium + $52/$1k variable (2024); ARP 2021: $86B SPAP for troubled multiemployer plans (Central States $73B). Large plan audit: 100-participant threshold; SAS 136; 2015 OIG found 39% deficient. Bulk data: dol.gov/agencies/ebsa research files. Python EFAST2 Schedule H + C analysis: top-50 401k plans by assets + fee rates in basis points by asset tier.
Labor and workplace · Federal data
BLS OEWS: The Federal Database Behind Wage Statistics for 830 Occupations Across the US Economy
2026-05-24
BLS OEWS (Occupational Employment and Wage Statistics): semi-annual survey of 1.1M non-farm establishments, ~57M workers. Annual release: mean/median wages, 10th-25th-75th-90th percentiles, employment for 830 occupations across 590+ areas (all states, 564+ MSAs, nonmetro areas). SOC 2018: 23 major groups, 6-digit codes. Highest-paying: anesthesiologists ~$331k, oral surgeons ~$317k, OB/GYN ~$296k, CEOs ~$246k. Data fields: area_type (1=national, 2=state, 3=MSA), occ_code, o_group (major/minor/broad/detailed), emp, h_mean/a_mean, h_median/a_median, h_pct10/h_pct90/a_pct10/a_pct90, emp_prse, mean_prse. Special symbols: # = wage at or above the $208k top-code cap, * = wage estimate not available, ** = employment estimate not available. Access: bls.gov/oes/tables.htm bulk zip files (national/state/MSA); BLS API v2 with complex OEWS series IDs. Industry-occupation matrix: SWE in finance vs. tech vs. manufacturing. Projections link: NEM 2022-2032 (wind turbine techs +60%, nurse practitioners +46%, data scientists +35%). Python: downloads national zip, ranks Computer & Math occupations by wage, computes percentile spread healthcare vs. tech.
Economy and demographics · Federal data
FRA Railroad Accident Data: The Federal Database Behind Every US Rail Incident Since 1975
2026-05-24
FRA (Federal Railroad Administration) accident reporting system (49 CFR Part 225): ~224k records since 1975. Train accidents (Form 54), grade crossing (Form 57), employee injuries (Form 55). Cause codes: Track/Equipment/Human Factors/Misc. Reportable threshold: $11,200+ damage, or death/injury/evacuation/hazmat. East Palestine OH Feb 2023: Norfolk Southern 32N derailment; vinyl chloride; controlled burn; NTSB 37 recommendations. Grade crossing: ~2,000-2,200 collisions/yr, ~270-290 deaths, 128,000 public crossings. PTC: mandated 2008 Rail Safety Improvement Act after Chatsworth 2008 (25 dead); fully implemented 2020. FRA: 140,000 inspections/yr, 28,000 violations, $27,904 max penalty. CRISI grants $1B+ (IIJA 2021). FRA Safety Data API at safetydata.fra.dot.gov; bulk CSV Forms 54/57/55. Python derailments by state + hazmat releases by commodity.
Transportation safety · Federal data
OPM FedScope: The Federal Database Behind 2.1 Million US Government Workers
2026-05-24
OPM CPDF/FedScope: ~2.1-2.3M federal civilian employees quarterly. Largest: DOD ~750k, VA ~400k, DHS ~250k. GS pay: GS-1 Step 1 $22,270/yr to GS-15 Step 10 $163,964 base + 34 locality areas (DC +33.26%); SES ~9,000 positions $155k-$235k (2024). FedScope dimensions: agency, occupation series, location, pay plan, grade, education, age, race, gender, veterans status (27% federal vs 6% private). FERS: 1.1%/yr x high-3 x years + TSP 5% match + Social Security; CSRS pre-1984. DOGE 2025: fork-in-the-road email ~75k acceptances; USAID ~10k terminated; HHS ~20k; DOE ~1,500; EPA ~1,500; union lawsuits. Data: fedscope.opm.gov cube; opm.gov bulk CSV; no public REST API. Python FedScope CSV analysis by agency grade distribution and SES density.
Government operations · Federal data
NIFC Wildfire Data: The Federal Database Behind a Century of US Fire Statistics
2026-05-24
NIFC (Boise ID) with USFS/NPS/BLM/BIA/FWS: wildfire stats since 1926. 2023: ~56,580 fires, ~2.7M acres (10-yr avg ~7M/yr). Record years: 2015 (10.1M), 2020 (10.1M). Fire suppression paradox: Smokey Bear 1944+ = fuel accumulation = larger fires. USFS FOD: ~2.3M fires 1992-present, SQLite, size classes A-G, cause (human/lightning/unknown), lat-lon, county. MTBS: USGS-USFS Landsat dNBR burn severity for fires >=1,000 acres at mtbs.gov. ICS-209 extended attack reports at famweb.nwcg.gov. WUI: 43M homes (Radeloff 2018 PNAS); Camp Fire 2018 Paradise CA (153k acres, 85 dead); Lahaina 2023 (2,200 structures). Active fire: NIFC ArcGIS GeoJSON; NASA FIRMS MODIS/VIIRS. Climate: Westerling 2006 Science + Williams 2019 PNAS VPD correlation. Python decade-by-decade analysis + active fire query.
Environment and energy · Federal data
CFPB Consumer Complaint Database: The Federal Record of 7 Million Financial Complaints
2026-05-24
CFPB Consumer Complaint Database (March 2012): 7M+ complaints since 2011. Products: credit reporting ~60%, debt collection ~10%, credit card ~8%, mortgage ~7%. Equifax/Experian/TransUnion receive 50%+ of all complaints. Fields: complaint_id, date_received, product/sub_product/issue, consumer_complaint_narrative (~20% with text, PII-scrubbed), company_response (monetary/non-monetary/explanation relief), timely (Y/N), consumer_disputed (Y/N), state, zip (3-digit partial). COVID: mortgage forbearance surge. Biden loan forgiveness 2022-2024: 2-3x student loan complaints. Navient $1.85B 2022. Wells Fargo $3.7B 2022 (largest-ever CFPB). API: api.consumerfinance.gov/data-research/consumer-complaints/search (no key, max 10k/query). Bulk download ~1.5GB+. Python mortgage complaint analysis by company and response type.
Finance and markets · Consumer protection · Federal data
NOAA Storm Events: The Federal Database Behind 50 Years of US Weather Disaster Data
2026-05-24
The NOAA National Centers for Environmental Information (NCEI) Storm Events Database records every significant weather event in the US from 1950 to present: ~2.1M event records, 48 standardized event types (tornado, hurricane, flash flood, hail, winter storm, wildfire, and more), with property damage, crop damage, injuries, fatalities, and county-level geography. Bulk download at ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/ with annual gzipped CSVs; CDO API at www.ncei.noaa.gov/cdo-web/api/v2/. DAMAGE_PROPERTY field uses K/M/B suffix encoding requiring parsing. NOAA Billion-Dollar Disasters tracker covers 376 events since 1980 totaling $2.6T in CPI-adjusted damages; 2023: 28 events exceeding $1B each, a record. Tornado climatology: ~1,200-1,500 annually, EF0-EF5 scale, 2011 Super Outbreak 362 tornadoes/3 days, Dixie Alley shift. Hurricane damage: Harvey $125B, Ian $112B county-by-county in Storm Events. Flood events: deadliest weather type most years, ~88 fatalities/year average, AHPS stream gauge network. Climate change signal in increasing damage frequency and extreme precipitation. This guide covers the event type taxonomy, data quality caveats, the NCEI CDO API, Billion-Dollar Disasters methodology, the tornado EF scale, the hurricane storm surge vs. wind damage distinction, and a Python DAMAGE_PROPERTY parsing analysis by event type and state.
Environment and energy · Federal data
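The K/M/B suffix encoding of DAMAGE_PROPERTY noted in the Storm Events summary above needs a small parser before values can be summed. The sketch below is one way to do it, not the article's code: the function name is hypothetical, and treating blank values or a bare suffix as zero is an assumption that may not suit every analysis.

```python
from typing import Optional

# Suffix multipliers for the K/M/B encoding described in the summary.
MULTIPLIERS = {"K": 1_000.0, "M": 1_000_000.0, "B": 1_000_000_000.0}


def parse_damage(value: Optional[str]) -> float:
    """Convert strings such as '10.00K' or '1.5M' to dollars.

    Assumptions: None, empty strings, and a bare suffix with no number
    are treated as 0.0; values without a suffix are read as plain dollars.
    """
    if value is None:
        return 0.0
    text = str(value).strip().upper()
    if not text:
        return 0.0
    suffix = text[-1]
    if suffix in MULTIPLIERS:
        number = text[:-1].strip()
        if not number:
            return 0.0
        return float(number) * MULTIPLIERS[suffix]
    return float(text)


# Summing a few illustrative (made-up) rows.
total = sum(parse_damage(v) for v in ["10.00K", "1.5M", "", "2B"])
```

Applied to a pandas column, the same function could be passed to `Series.map` after filling missing values with empty strings.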
FBI NIBRS: The National Crime Database Behind Incident-Level Crime Statistics
2026-05-24
The FBI National Incident-Based Reporting System (NIBRS) replaced summary-level UCR in 2021 with incident-level records from 15,000+ law enforcement agencies: 52 offense categories, victim/offender/arrestee demographics, location, weapon, property, and arrest outcomes. 2022: 15,724 agencies reporting, covering ~79% of US population; NYPD (8M people) only began NIBRS submission 2023. Offense segments: Group A (52 categories) vs Group B (11 citation-only). Victim segment: age, sex, race, ethnicity, victim-offender relationship (intimate partner, acquaintance, stranger, unknown). Hate crime codes: 88 bias motivation codes. Crime Data Explorer API at cde.ucr.cjis.gov: /api/nibrs/{offense}/offense/agencies, /api/nibrs/{offense}/victim/count; free API key, 1,000 req/day. Annual bulk downloads: incident, offense, victim, offender, arrestee, property files. Supplemental Homicide Reports (SHR) since 1976: victim-offender-weapon-circumstance at case level. NIBRS vs NCVS: only ~43% of violent crimes reported to police. TRAC-NIBRS for coverage gap analysis. This guide covers the UCR-to-NIBRS transition, reporting gaps, API structure, hate crime methodology, SHR limitations, the NCVS complement, and a Python CDE API violent crime rate analysis by state.
Justice and immigration · Federal data
SSA Social Security: The Federal Database Behind $1.4 Trillion in Annual OASDI Benefits
2026-05-24
Social Security OASDI (Old-Age, Survivors, and Disability Insurance) pays ~$1.4T annually to ~70M beneficiaries. Two OASDI components: OASI (~58M, ~$1.2T) and DI (~8.8M, ~$160B); SSA also administers SSI (~7.5M, ~$65B, means-tested), which is funded from general revenues rather than the trust funds. Trust funds: OASI ~$2.75T invested in special-issue Treasuries; 2034 projected OASI depletion (77% payable). FICA tax: 6.2%+6.2% on wages up to $168,600 (2024). Benefit formula: AIME computed from 35 highest-earning years indexed to AWI; PIA = 90% of AIME to first bend point + 32% between bend points + 15% above second. FRA: 67 (born 1960+); early at 62 with ~30% reduction; DRCs +8%/year to 70. 2024 bend points: $1,174/$7,078. SSA data: data.ssa.gov, with Monthly Statistical Snapshot, Annual Statistical Supplement (Table 5.A state-level beneficiaries, Table 4.B DI allowance rates, Table 6.C SSI by state), state/county OASDI CSV. FRED series: SSASSHDI, SSARECEIPTSDISABILITY. Disability sequential evaluation: SGA → severe impairment → Blue Book listings → RFC → past relevant work → vocational grids. ALJ hearing backlog ~1M. WEP/GPO eliminated January 2025 (Social Security Fairness Act) for 3.2M workers. This guide covers trust fund mechanics, the AIME/PIA formula, the DI determination process, the state data API, and a Python analysis of retired-worker benefit penetration rates by state.
Economy and demographics · Labor and workplace · Federal data
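The PIA formula and 2024 bend points in the Social Security summary above can be worked through in a few lines. This is a sketch under stated assumptions: AIME is taken as whole dollars, the result is rounded down to the next lower dime (SSA's rounding rule, not stated in the summary), and the function name is hypothetical. Integer cents are used to avoid floating-point rounding surprises.

```python
# 2024 bend points from the summary, in whole dollars.
BEND_1 = 1_174
BEND_2 = 7_078


def primary_insurance_amount(aime: int) -> float:
    """Return the monthly PIA in dollars for a whole-dollar AIME.

    90% of AIME up to the first bend point, 32% between the bend points,
    15% above the second. Percent times dollars gives cents directly.
    """
    if aime < 0:
        raise ValueError("AIME cannot be negative")
    first = min(aime, BEND_1)
    second = min(max(aime - BEND_1, 0), BEND_2 - BEND_1)
    third = max(aime - BEND_2, 0)
    cents = 90 * first + 32 * second + 15 * third
    cents_floored = (cents // 10) * 10  # assumption: round down to a dime
    return cents_floored / 100


# Worked example: AIME of $5,000
#   90% x 1,174 = 1,056.60
#   32% x 3,826 = 1,224.32
#   total       = 2,280.92, rounded down to 2,280.90
example = primary_insurance_amount(5_000)
```

The progressive structure is visible in the replacement rates: the first $1,174 of AIME is replaced at 90%, while each dollar above $7,078 adds only 15 cents.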
IRS Exempt Organizations: The Federal Database Behind 1.26 Million US Nonprofits
2026-05-24
The IRS Exempt Organizations Business Master File (BMF) registers 1.26M active tax-exempt organizations: 501(c)(3) public charities and private foundations (~1M), 501(c)(4) social welfare orgs (~80k), 501(c)(6) trade associations, 527 political orgs, and 25 other IRC subsection categories. $2.8T in annual sector revenues (~5.5% of US GDP), ~12M nonprofit employees. BMF published monthly at IRS.gov: tab-delimited with EIN, name, subsection code, NTEE code (26 major categories A-Z: Education, Health, Human Services, etc.), ruling date, deductibility code, asset/income/revenue amounts. Form 990 e-file JSON on AWS S3 at s3://irs-form-990/ since 2013, with index files plus per-filing XML/JSON. Key schedules: Part VII compensation (5 highest-paid officers), Schedule A (public support test), Schedule B (donor list, confidential), Schedule C (political activity), Form 990-PF (private foundations: 1.39% NII excise tax, 5% minimum distribution, self-dealing IRC 4941). Citizens United 2010 + 501(c)(4) anonymous spending: Form 8976 required since 2016. ProPublica Nonprofit Explorer API at api.propublica.org/nonprofits/v2/organizations/{ein}.json. Private foundations: Gates ($70B), Ford ($16B), Robert Wood Johnson ($13B). Church filing exemption: no Form 990 required, largest data gap. This guide covers the BMF field schema, NTEE taxonomy, 990 e-file S3 access, political activity rules, the private foundation excise regime, and a Python NTEE subsector analysis.
Transparency and open data · Federal data
USAID Foreign Aid Data: The Federal Database Behind $40 Billion in Annual US Development Assistance
2026-05-24
USAID (United States Agency for International Development) manages ~$40B in annual foreign assistance across 100+ countries. ForeignAssistance.gov (IATI) publishes whole-of-government aid data by agency, country, sector, and implementing partner. Award types: contracts (for-profit implementers like Chemonics ~$1-2B/yr, DAI, AECOM), grants (INGOs: Save the Children, CARE, IRC, World Vision), cooperative agreements, interagency agreements. PEPFAR: $110B+ since 2003, 20M+ people on ARVs, Country Operational Plans at pepfar.gov. DATA Act: USAID contracts on USASpending.gov; IATI XML at iatiregistry.org. Top recipients FY2022: Ukraine (surged post-invasion), Ethiopia, DR Congo, Nigeria, Jordan, South Africa (PEPFAR). Sub-Saharan Africa ~35%, Near East ~20% of obligations. ForeignAssistance.gov API at /api/v1/resources.json. OpenAid, AidData, D-Portal for secondary access. Here is award mechanisms, PEPFAR data structure, implementing partner concentration, geographic patterns, IATI standard, and a Python ForeignAssistance.gov analysis by country and sector.
Government operations · Federal data
PCAOB Auditor Inspections: The Federal Database Behind Public Company Audit Oversight and Accounting Firm Deficiencies
2026-05-24
PCAOB (Public Company Accounting Oversight Board) was created by the Sarbanes-Oxley Act of 2002 after the Enron/WorldCom/Arthur Andersen scandals; ~1,700 registered audit firms globally; annual inspections for firms auditing >100 SEC-registered issuers, triennial for ≤100. Two-part inspection report: Part I (public deficiencies: insufficient audit evidence, ICFR failures, revenue recognition) published immediately; Part II (quality control criticisms) made public only if the firm fails to remediate within 12 months. Big Four 2022 deficiency rates: 31-44% of inspected engagements. HFCAA: Chinese audit firms required to allow PCAOB inspection; August 2022 agreement enabled first-ever inspections of KPMG Huazhen (mainland China) and PwC Hong Kong. Enforcement: Section 105, $15M/$750k monetary penalties; KPMG 2019 $50M fine for stealing inspection plans. Critical Audit Matters (CAMs) required since 2019. All inspection reports at pcaobus.org/inspections. Here is inspection methodology, deficiency trends by audit area, HFCAA China access resolution, enforcement actions, CAM disclosure, and a Python deficiency rate trend analysis.
Finance and markets · Federal data
Medicare Part D Drug Spending Data: The Federal Database Behind $225 Billion in Annual Prescription Drug Costs
2026-05-24
Medicare Part D (MMA 2003, implemented January 2006) covers outpatient prescription drugs for ~50M beneficiaries through private PDPs and MA-PD plans; ~$225B annual spending. CMS publishes Part D Prescriber Data (NPI, specialty, drug, total claims, total cost) and Drug Spending Dashboard. Top drugs by spending: Eliquis (apixaban) ~$14B, Humira ~$6B pre-biosimilar, Keytruda ~$5B, GLP-1 agonists (Ozempic/Victoza) rapidly rising. PBM rebate mechanics: Tier 1-5 formulary; manufacturer pays 70% brand discount in coverage gap. IRA 2022: Medicare drug price negotiation (first 10 drugs 2026); $2,000 OOP cap 2025; inflation rebates. Humira 2023: 7 biosimilars launched simultaneously. ProPublica Prescriber Checkup identifies high-volume opioid prescribers. LIS/Extra Help: ~13M beneficiaries, full subsidy. CMS data at data.cms.gov. Here is benefit phases, formulary mechanics, IRA negotiation program, prescriber-level data structure, opioid prescribing patterns, and a Python analysis of specialty drug spending.
Health and medicine · Federal data
NFIP Flood Insurance Data: The Federal Program Behind $20 Billion in Flood Claims and the National Flood Hazard Layer
2026-05-24
NFIP (National Flood Insurance Act 1968) provides flood insurance for ~5M policyholders across 22,000+ communities; ~$1.3T total coverage in force. Flood zones: in SFHAs (Zone A/AE/V, 1% annual chance), properties with federally backed mortgages must carry flood insurance. Coverage limits: $250,000 building/$100,000 contents residential. Katrina 2005: $16B, 267k claims; Harvey 2017: $8.9B, 89k claims; Ian 2022: $3.6B. NFIP was $20B+ in debt to Treasury. Risk Rating 2.0 (Oct 2021): property-specific pricing by flood frequency, distance to water, foundation type; 18% annual cap on premium increases; 1.2M policies canceled/non-renewed. National Flood Hazard Layer (NFHL) at msc.fema.gov; WFS API at hazards.fema.gov. OpenFEMA API: FimaNfipClaims and FimaNfipPolicies datasets. Repetitive loss: ~25,000 severe repetitive loss structures = 25-30% of total claims. First Street Foundation alternative risk model. Here is flood zone mechanics, Risk Rating 2.0 reform, NFHL GIS data, OpenFEMA API structure, repetitive loss dynamics, and a Python Harvey claims analysis by county.
Environment and energy · Federal data
FARA: The Foreign Agents Registration Act Database Behind Lobbying Disclosure for Foreign Governments
2026-05-24
FARA (Foreign Agents Registration Act, 22 U.S.C. §§ 611-621, 1938) requires agents of foreign governments and political parties to register with DOJ's National Security Division and file semi-annual disclosure statements. ~500-600 active registrations at any time. Registration: Form NSD-1 registration statement (within 10 days) → Form NSD-2 supplemental statement (semi-annual) disclosing principal identity, activities, compensation, disbursements, political contacts. LDA exemption (22 U.S.C. § 613(h)): agents registering under the Lobbying Disclosure Act whose principal is not a foreign government or political party may use LDA instead -- DOJ IG 2016 report criticized this gap. Mueller-era surge 2018-2022: Manafort convicted, Flynn retroactively registered (Turkey/Gülen), Barrack acquitted (UAE), Podesta Group and Mercury LLC retroactively registered. Saudi Arabia post-Khashoggi: $14M+ annually, $450M+ since 2016; firms retained: Squire Patton Boggs, Akin Gump, BGR Group. Chinese state media: CGTN and Xinhua registered as foreign agents 2019. Criminal penalty: 22 U.S.C. § 618 felony, up to 5 years + fines. Electronic Reading Room: justice.gov/nsd-fara; eFARA bulk CSV at efile.fara.gov/bulk/. OpenSecrets and POGO maintain secondary databases. Here is registration mechanics, LDA exemption gap, Mueller-era cases, Saudi Arabia and China enforcement, eFARA bulk data structure, and a Python analysis of FARA disbursements by country.
Money in politics · Federal data
CMS Open Payments: The Federal Database Behind $12 Billion in Annual Pharma and Device Payments to Physicians
2026-05-24
The Physician Payments Sunshine Act (ACA Section 6002, 2010) requires applicable manufacturers to report all payments ≥$10 to covered recipients. 2022 dataset: $12.7B total; research payments ~$4B; general payments ~$2.5B; ownership/investment interests ~$6.2B. ~2,700 applicable manufacturers; ~900,000 covered recipients. Three datasets at openpaymentsdata.cms.gov: General Payments (GP), Research Payments (RP), Ownership/Investment Interests (OI). Key fields: NPI, total_amount_of_payment_usdollars, nature_of_payment (consulting/speaking/food/royalty/research), drug/device name, manufacturer. NPI linkage to NPPES enables physician specialty/location cross-reference. ProPublica "Dollars for Docs" since 2010. Research: Carey et al. (2021) meals associated with brand prescribing; DeJong et al. (2016) payment receipt and prescribing patterns. Dispute process: 45-day window before publication. Socrata API at data.cms.gov/open-payments. Here is Sunshine Act mechanics, payment category taxonomy, scale data, prescribing-impact research, drug-specific linkage (Ozempic/Humira/insulin), and a Python Socrata API analysis by specialty and manufacturer.
Health and medicine · Federal data
NLRB Union Elections and Unfair Labor Practice Data: The Federal Database Behind US Labor Organizing
2026-05-24
NLRB processes ~2,500-3,000 election cases and ~15,000-20,000 ULP charges annually. RC (union-initiated), RM (employer), RD (decertification) petition types; 30% showing of interest required; secret ballot; majority of valid votes cast to win. 2014 "Ambush Election" rule reduced pre-election period to ~23 days; 2023 Biden rule restoration. Union win rate ~65-70% in recent years. Amazon JFK8 Staten Island April 2022: 2,654-2,131, first US Amazon union win; Starbucks Workers United 400+ stores. ULP charges: Section 8(a)(1) interference, 8(a)(3) anti-union discrimination, 8(a)(5) refusal to bargain; ALJ hearing → NLRB Board → circuit court. Gissel bargaining orders; McLaren Macomb (2023) confidentiality clauses unlawful. BLS 2023: 10.0% union density, 6.0% private, 32.5% public. NLRB election results CSV and case search at nlrb.gov. Here is petition types, 2014 rule history, Amazon/Starbucks campaigns, ULP mechanics, Gissel orders, and a Python election win rate analysis by industry.
Labor and workplace · Federal data
ATF Firearm Trace Data: The Federal Database Behind 350,000 Annual Crime Gun Traces
2026-05-24
ATF eTrace processes ~350,000-400,000 crime gun trace requests annually from law enforcement. Trace chain: law enforcement submits recovered gun → ATF contacts manufacturer/importer → FFL of first sale → subsequent FFLs until first retail purchaser identified. Time-to-crime (TTC), the interval between first retail sale and recovery: average 7-8 years nationally; TTC under 3 years flags potential trafficking; 21% of traced handguns have a TTC under 3 years. Tiahrt Amendments (2003): prohibit ATF from releasing trace data to the public and bar its use in civil litigation against gun dealers/manufacturers. Iron Pipeline: southeastern states (GA, SC, VA, FL) supply northeastern cities (NY, NJ, MD) via regulatory arbitrage. NIBIN: 300+ sites, 7,300 ballistic leads/week, links cartridge cases across crime scenes. ~130,000 active FFLs; 5-7% annually inspected; 920M+ records at Out-of-Business Records Center. ATF publishes state-level aggregate trace data at atf.gov. Here is eTrace chain mechanics, Tiahrt Amendments history, Iron Pipeline geographic patterns, NIBIN infrastructure, and a Python TTC distribution analysis by state and firearm type.
Justice and immigration · Federal data
FDIC Institution Database: The Federal Profile of Every FDIC-Insured Bank and Thrift
2026-05-24
FDIC BankFind Suite at banks.data.fdic.gov provides institution profiles for all ~4,600 active FDIC-insured banks and thrifts plus 10,000+ historical institutions back to 1934. Charter types: N = national bank (OCC-chartered), SM = state member bank (Federal Reserve), NM = state nonmember bank (FDIC-supervised), SA = state savings association (OCC), SB = state savings bank (FDIC). Dual banking system: institutions choose state or federal charter creating regulatory competition. Banking consolidation: 14,000+ FDIC-insured institutions in 1984 to ~4,600 today -- 67% reduction driven by S&L crisis failures, interstate banking deregulation (Riegle-Neal 1994), Gramm-Leach-Bliley 1999, post-GFC 2008-2012 failures, and ongoing M&A. Summary of Deposits: annual branch-level deposit data enabling banking desert analysis (census tracts with no bank or credit union within 10 miles). CRA (Community Reinvestment Act) exam ratings published: Outstanding, Satisfactory, Needs to Improve, Substantial Noncompliance. BankFind API: /api/institutions endpoint with CERT (unique 5-digit certificate number), ACTIVE, ASSET (thousands), CLASSP, STALP, ESTYMD, SPECGRP, HCTMULT fields; no API key required. Here is charter type mechanics, dual banking regulatory competition, consolidation drivers, CRA compliance, banking desert geography, and a Python active institution analysis by state and asset tier.
Finance and markets · Federal data
FMCSA Crash Data: The Federal Database Behind 5,000 Annual Large Truck Fatalities
2026-05-24
FMCSA's MCMIS tracks ~500,000 reportable CMV crashes per year. Large truck fatalities reached 5,837 in 2022 -- the highest since 2005. 80%+ of truck crash fatalities are passenger vehicle occupants. The Large Truck Crash Causation Study (963 crashes) found driver error in 55% of crashes (87% decision/recognition/performance errors). HOS regulations: 11-hour drive limit, 14-hour window, ELD mandate December 2017. CSA SMS: 7 BASICs updated monthly. Roadside inspections: 3.5M/year, 20% vehicle OOS rate, 5% driver OOS rate. ATA v. FMCSA 2019 removed BASIC percentile scores from public display. SAFER, A&I portal, FMCSA public API, NHTSA FARS complement. Industry: 3.5M drivers, 750,000 carriers, 350,000 owner-operators. Here is the state fatality rate normalized by FHWA VMT, time-of-day and road-type breakdowns, critical reason attribution, and a Python carrier-level crash lookup.
Transportation safety · Federal data
CRS Reports: The Congressional Research Service Database Behind US Policy Analysis
2026-05-24
The Congressional Research Service (CRS) is the nonpartisan policy and legal research arm of Congress within the Library of Congress, established 1914. 700 analysts across 7 divisions produce six product types: Reports (comprehensive analyses), Insights (2-4 page current issue), In Focus (2-page overviews), Legal Sidebars, Report Updates, and Testimonies. 25+ policy areas including agriculture, appropriations, budget, energy, environment, foreign affairs, health, homeland security, immigration, technology, labor, law, national defense, and transportation. The 2018 Consolidated Appropriations Act first mandated public release; crsreports.congress.gov is the official portal with 9,000+ available reports. Historically products were available only to members of Congress -- the 2012 Coburn-blocked report on top marginal tax rates and economic growth (finding no correlation) galvanized the public access movement. EveryCRSReport.com (Federation of American Scientists + Demand Progress) provides bulk access including pre-2018 reports via API at everycrsreport.com/reports.json; each report has id, title, topics array, date, and versions list. CRS differs from GAO (auditing/program evaluation) and CBO (budget scoring only). Here is CRS product type mechanics, the public access mandate, EveryCRSReport.com API structure, and a Python analysis of publication frequency and update patterns by policy area.
Government operations · Federal data
NIST NVD: The National Vulnerability Database Behind CVE Scoring and Cybersecurity Compliance
2026-05-24
NIST NVD enriches 250,000+ CVE records with CVSS scores, CWE classifications, and CPE product data. CVE Numbering Authorities (400+ CNAs): Microsoft, Google, Apple, Red Hat, MITRE root CNA. CVSS v3.1: Attack Vector/Complexity, Privileges Required, User Interaction, Scope, CIA impact. Score ranges: Critical 9.0-10.0, High 7.0-8.9, Medium 4.0-6.9, Low 0.1-3.9. CISA KEV catalog: 1,000+ confirmed-exploited CVEs, BOD 22-01 mandates federal patching within 14 days. Log4Shell CVE-2021-44228 CVSS 10.0; EternalBlue CVE-2017-0144 9.3 (CVSS v2 score); Heartbleed CVE-2014-0160 7.5. CWE-787 Out-of-Bounds Write dominates Critical CVEs. NVD REST API /rest/json/cves/2.0 with cvssV3Severity/cweId/cpeMatchString/hasKev parameters. FedRAMP, PCI DSS, FISMA compliance applications. Here is CVE assignment mechanics, CVSS base/temporal/environmental scores, KEV operational mechanics, and a Python Critical CVE analysis for 2024.
Cybersecurity and privacy · Federal data
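The CVSS score ranges in the NVD entry above map directly onto a lookup function. This is a minimal sketch of the CVSS v3.1 qualitative severity scale; the function name `cvss_v31_severity` is hypothetical and it does not call the NVD API.

```python
def cvss_v31_severity(score: float) -> str:
    """Map a CVSS v3.1 base score to its qualitative severity rating.

    Ranges: None 0.0, Low 0.1-3.9, Medium 4.0-6.9,
    High 7.0-8.9, Critical 9.0-10.0.
    """
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS score out of range: {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


if __name__ == "__main__":
    # Scores quoted in the entry above for Log4Shell and Heartbleed.
    for name, score in [("Log4Shell", 10.0), ("Heartbleed", 7.5)]:
        print(name, score, cvss_v31_severity(score))
```

Scores are published to one decimal place, so comparing against the lower bound of the next band is equivalent to checking the closed ranges.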
EPA Greenhouse Gas Reporting Program: The Facility-Level Emissions Database Behind US Climate Accountability
2026-05-24
EPA GHGRP requires ~8,000 facilities emitting ≥25,000 tCO2e/year to report annually, covering ~85-90% of US stationary source emissions. 41 source categories: power plants (Subpart D), petroleum/natural gas systems (Subpart W, largest by count), refineries (Subpart Y), landfills, cement, iron/steel, chemical manufacturing. Six GHGs: CO2, CH4 (28-34x GWP), N2O (265-298x), HFCs (up to 14,800x), PFCs, SF6 (23,500x). FLIGHT tool at ghgdata.epa.gov for facility search. ECHO bulk download. Satellite methane validation controversy: TROPOMI/Sentinel-5P, GHGSat, MethaneSAT finding higher Permian Basin emissions than Subpart W reports. EPA 2024 Subpart W methodology revision. Here is sector composition, data access via ENVIRO API, and a Python top-emitter analysis.
Environment and energy · Federal data
DOJ Antitrust Division: The Federal Merger Review and Cartel Enforcement Database
2026-05-24
DOJ Antitrust Division enforces Sherman Act (criminal: price-fixing, bid-rigging, market allocation) and Clayton Act (civil merger review). HSR Act pre-merger notification: 2024 threshold $119.5M, ~1,500-2,000 annual filings, ~3% receive Second Requests, $51,744/day penalty for failure to file. Merger review: Phase 1 (30-day) → Phase 2 → consent decree or litigation. 2023 Merger Guidelines: HHI thresholds (above 1,800 highly concentrated, lowered from 2,500 in the 2010 guidelines). Leniency Program: first cartel self-reporter gets automatic amnesty. Auto parts cartel $2.9B fines. AT&T-Time Warner (DOJ lost), UnitedHealth-Change Healthcare (DOJ lost), JetBlue-Spirit (blocked). DOJ press releases RSS, PACER for complaints. Here is FTC coordination, criminal enforcement mechanics, and a Python press release classification analysis.
Justice and immigration · Federal data
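The HHI screen mentioned in the DOJ Antitrust entry above is simple arithmetic: the sum of squared market shares, compared before and after a merger. This is a minimal sketch with hypothetical market shares; the function names are illustrative, and the default thresholds (HHI above 1,800 with an increase above 100 points) are those of the 2023 Merger Guidelines, passed as parameters so that the 2010 figures can be substituted.

```python
def hhi(shares_pct):
    """Herfindahl-Hirschman Index: sum of squared market shares (in percent)."""
    return sum(s ** 2 for s in shares_pct)


def merger_screen(shares_pct, i, j, level=1800, delta=100):
    """Compare HHI before and after firms i and j merge.

    Returns (pre, post, change, flagged). `flagged` is True when the
    post-merger HHI exceeds `level` and the increase exceeds `delta`.
    """
    pre = hhi(shares_pct)
    merged = [s for k, s in enumerate(shares_pct) if k not in (i, j)]
    merged.append(shares_pct[i] + shares_pct[j])
    post = hhi(merged)
    change = post - pre
    return pre, post, change, (post > level and change > delta)


if __name__ == "__main__":
    # Hypothetical four-firm market; the two smallest firms merge.
    print(merger_screen([40, 30, 20, 10], 2, 3))
```

The increase always equals twice the product of the merging firms' shares (here 2 × 20 × 10 = 400), which is a quick way to sanity-check the result.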
CDC WISQARS: The Federal Injury and Violence Mortality Database Behind Public Health Research
2026-05-24
CDC WISQARS (Web-based Injury Statistics Query and Reporting System) covers all US injury deaths (ICD-10 external cause V-Y) back to 1981 and nonfatal ED visits via NEISS-AIP. 2022: unintentional injury ~230k deaths (#1 cause ages 1-44); drug overdose ~109,680 (fentanyl/synthetics ~73,800); motor vehicle ~46,000; suicide ~49,000 (firearms 55%, hanging 27%); homicide ~24,000 (firearms 79%). Total firearm deaths 48,204 (14.6/100k). Three opioid waves: prescription, heroin, fentanyl. WISQARS API, WONDER, NVDRS (case-level violent deaths with circumstance data). Geographic patterns: firearm suicide highest in rural Mountain West; firearm homicide concentrated in urban areas. Here is ICD-10 coding, NEISS methodology, and a Python state firearm rate analysis.
Health and medicine · Federal data
USASpending.gov: The Federal Spending Database Behind $6 Trillion in Annual Contracts, Grants, and Loans
2026-05-24
USASpending.gov tracks ~$6T in annual federal spending via FFATA 2006 and DATA Act 2014. Contracts (~$700B from FPDS-NG): DoD ~$412B, Lockheed Martin ~$73B, RTX ~$42B, Boeing, General Dynamics, Northrop. Grants (~$800B): NIH $40B, NSF $9B. API at api.usaspending.gov: /search/spending_by_award, /bulk_download. FPDS fields: UEI, CAGE, PSC, NAICS, contract type, competition type, set-aside (8(a)/HUBZone/SDVOSB/WOSB). FSRS subaward reporting >$30k. Data Act financial linkage: appropriation → obligation → outlay. Here is defense contract patterns, small business set-asides, DATA Act mechanics, and a Python DoD top-contractor analysis.
Government operations · Federal data
Federal Register: The Official Rulemaking Journal Behind 90,000 Pages of Annual US Regulatory Activity
2026-05-24
The Federal Register is the official daily journal of the US federal government, published since 1936, containing proposed rules (NPRMs), final rules, presidential documents, and notices — ~85,000-95,000 pages/year. APA requires notice-and-comment: NPRM → 30-90 day comment period → final rule with 30-day delay. OIRA reviews significant/major rules (>$100M impact) under EO 12866. Unified Regulatory Agenda tracks all agency rules in the pipeline. Congressional Review Act allows Congress to overturn recent major rules. Here is the CFR 50-title structure, Regulations.gov docket API, Federal Register API at federalregister.gov/api/v1/, Loper Bright 2024 overruling of Chevron, and a Python EPA NPRM analysis.
Government operations · Federal data
FEC Committee Filings: The Campaign Finance Database Behind $14 Billion in Election Spending
2026-05-24
The FEC administers FECA (1971/1974) for federal elections only. Committee types: PCC, party committees, PAC ($5k/election limit), Super PAC (post-Citizens United, unlimited), SSF, Leadership PAC. 2024 federal spending ~$14B. Individual to candidate limit $3,300/election. FEC bulk data: cm.zip (committees), indiv.zip (individual contributions >$200 with employer/occupation), pas2.zip (PAC-to-candidate), oppexp.zip (disbursements). OpenFEC API at api.open.fec.gov/v1/. 501(c)(4) dark money: no donor disclosure required. Here is all eight bulk files, Super PAC mechanics, MURs, and a Python occupation partisan lean analysis.
Money in politics · Federal data
SEC Form D: The Private Placement Database Behind $2 Trillion in Annual Exempt Offerings
2026-05-24
SEC Form D is filed within 15 days of first sale in a Reg D exempt offering. Rule 506(b): unlimited amount, no general solicitation, up to 35 non-accredited investors (~90% of filings). Rule 506(c): unlimited, general solicitation permitted (JOBS Act 2012), accredited investors only. Rule 506 offerings raised ~$2.5T in 2022. Fields: entity name, exemption type, offering amount, investor count, investment fund type (VC/PE/hedge/real estate), industry group. EDGAR full-text search at efts.sec.gov. Historical back to 2009. Reg CF ($5M crowdfunding), Reg A+ ($75M mini-IPO). Here is all Reg D exemptions, JOBS Act impact, dark money limitations, and a Python VC state/sector analysis.
Finance and markets · Federal data
CMS Medicare Inpatient Provider Data: The Hospital-Level Payment Records Behind $170 Billion in Annual DRG Reimbursements
2026-05-24
CMS publishes annual Medicare Inpatient Provider Charge Data for ~3,000 hospitals across ~760 DRGs. The IPPS pays a fixed amount per DRG via relative weights (RW) — DRG 001 Heart Transplant RW ~25.0, DRG 470 Major Joint Replacement RW ~2.1, 2023 base rate ~$6,000. Medicare pays ~$170B/year via IPPS. Adjustments include Wage Index, IME for teaching hospitals, DSH for safety-net hospitals, and outlier payments. Chargemasters produce 5x-10x sticker prices vs. actual payments. Geographic variation: DRG 470 ranges $12,000–$35,000+ across hospitals. Here is the full dataset schema, Socrata API access, value-based care adjustments (HVBP, HRRP, HACRP), and a Python charge-to-payment ratio analysis.
Health and medicine · Federal data
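The core IPPS arithmetic in the Medicare inpatient entry above is a base rate multiplied by the DRG relative weight. This is a minimal sketch using the approximate figures quoted in that entry; the function name `ipps_base_payment` is hypothetical, and the sketch deliberately leaves out the Wage Index, IME, DSH, and outlier adjustments that determine what a given hospital is actually paid.

```python
def ipps_base_payment(relative_weight: float, base_rate: float = 6000.0) -> float:
    """Unadjusted IPPS payment: base rate times the DRG relative weight.

    The default base rate is the approximate 2023 figure from the
    entry above, not an official CMS rate.
    """
    if relative_weight <= 0 or base_rate <= 0:
        raise ValueError("relative weight and base rate must be positive")
    return round(base_rate * relative_weight, 2)


if __name__ == "__main__":
    # Approximate relative weights quoted in the entry above.
    for drg, rw in [("001 Heart Transplant", 25.0), ("470 Major Joint Replacement", 2.1)]:
        print(drg, ipps_base_payment(rw))
```

With these inputs DRG 470 comes out near $12,600, the low end of the range the entry cites; the hospital-specific adjustments account for the rest of the spread.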
FDA Orange Book: The Drug Patent and Exclusivity Database Behind Generic Drug Competition and Hatch-Waxman Challenges
2026-05-24
The FDA Orange Book lists approved drugs and their TE ratings (AB = substitutable bioequivalent). Hatch-Waxman Act (1984) created the ANDA pathway — generics skip clinical trials, show bioequivalence. Paragraph IV certification challenges listed patents → 30-month stay + 180-day first-filer exclusivity. Exclusivity types: NCE 5yr, new clinical investigation 3yr, Orphan Drug 7yr, Pediatric 6mo. Patent thickets: average 71+ listed patents per brand drug. Lipitor $10B/year cliff Nov 2011; Humira 2023 multi-biosimilar launch. Three flat files: Products.txt, Patent.txt, Exclusivity.txt. Here is TE code breakdown, pay-for-delay/FTC v. Actavis, Purple Book for biologics, and a Python upcoming patent cliff analysis.
Health and medicine · Federal data
CDC PLACES: The Small Area Health Estimates Behind County and Census Tract Disease Prevalence Data
2026-05-24
CDC PLACES produces model-based small area health estimates for all 3,100+ counties, 28,000+ places, 72,000+ census tracts, and 32,000+ ZCTAs. 36+ measures across 5 domains: health outcomes (diabetes, obesity, CHD, stroke), prevention (screenings, insurance), unhealthy behaviors (smoking, binge drinking), disabilities, and social determinants. Uses multilevel regression and poststratification (MRP) applied to BRFSS survey data + Census ACS. Obesity >40% in Appalachian counties vs. <20% in Mountain West. Diabetes 15%+ in Mississippi Delta vs. <7% in Colorado. Socrata API at data.cdc.gov, GeoJSON endpoint, sodapy library. Here is full methodology, PLACES vs. County Health Rankings, and a Python Mississippi county health burden analysis.
Health and medicine · Federal data
BSEE Offshore Safety Data: The Post-Deepwater Horizon Incident Database Behind 4,000 Annual Offshore Inspections
2026-05-24
BSEE was created in 2011 from MMS breakup after the Deepwater Horizon/Macondo blowout (87 days, 4.9M barrels, 11 deaths). Regulates ~2,000 OCS facilities, MODUs, 15,000+ wells. 4,000+ annual inspections, ~2,000+ INCs (Incidents of Noncompliance) issued. Incident categories: blowouts, fires/explosions, collisions, fatalities (~10-15/yr), injuries (100+/yr). SEMS rule (2010/2013) required operator safety management systems. Well Control Rule (2016) set BOP testing/monitoring requirements. OCS produces 15-17% of US oil. Data at bsee.gov: incident, INC, inspection, production, well CSVs. ArcGIS REST services for OCS infrastructure GIS.
Environment and energy · Federal data
Treasury Daily Treasury Statement: The Federal Cash Flow Data Published Every Business Day
2026-05-24
The DTS is published each federal business day at 4 PM ET by the Bureau of the Fiscal Service, reporting the prior day's cash receipts, outlays, and borrowing. Tables cover TGA balance at Federal Reserve Banks, public debt outstanding, deposits and withdrawals by source category, operating cash balances, and federal agency deposits. The fiscal year deficit is the running sum of daily net outflows. Here is DTS Table I-VII structure, TGA balance mechanics, debt ceiling X-date tracking, Fiscal Data API access, and a Python script to chart daily outflows by category.
Finance and markets · Government operations · Federal data
Federal Reserve H.15: The Selected Interest Rates Release Behind Treasury Yields, Fed Funds, and Every Rate Benchmark
2026-05-24
The Federal Reserve H.15 release publishes daily interest rate data for the federal funds effective rate, Treasury constant maturities (1-month through 30-year), prime rate, discount rate, and SOFR since the LIBOR transition. The 2-10 yield curve inverted to -108 bps in 2023, the deepest inversion since 1981. Here is CMT construction methodology, EFFR vs. SOFR vs. LIBOR mechanics, FRED series IDs (DFF, DGS10, SOFR), real rate calculation via TIPS breakevens, and a Python FRED API dual-chart of the yield curve spread with recession shading.
Finance and markets · Federal data
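The 2-10 spread in the H.15 entry above is the 10-year constant-maturity yield minus the 2-year, expressed in basis points. This is a minimal sketch; the function names are hypothetical, and the yields in the example are illustrative values chosen to reproduce the -108 bps figure, not observed data. In practice the inputs would come from FRED series such as DGS10 and DGS2.

```python
def spread_bps(long_yield_pct: float, short_yield_pct: float) -> float:
    """Yield spread in basis points (1 percentage point = 100 bps)."""
    return round((long_yield_pct - short_yield_pct) * 100, 1)


def is_inverted(long_yield_pct: float, short_yield_pct: float) -> bool:
    """A curve segment is inverted when the short yield exceeds the long yield."""
    return spread_bps(long_yield_pct, short_yield_pct) < 0


if __name__ == "__main__":
    # Illustrative yields only: 10-year at 3.75%, 2-year at 4.83%.
    print(spread_bps(3.75, 4.83), is_inverted(3.75, 4.83))
```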
Census Population Estimates Program: The Annual County and State Population Data Behind Apportionment, Funding, and Growth Tracking
2026-05-24
The Census PEP produces annual population estimates for all 3,100+ counties and 50 states using a cohort-component model: base census + births - deaths + net migration. Florida gained 2.1M residents 2020-2023; Texas gained 2.4M; NYC lost ~500K from 2020 peak. Here is the components-of-change methodology, TIGER geography linkage, vintage year vs. decennial census reconciliation, Census API pep/population endpoint, and a Python script ranking counties by population growth rate with net migration decomposition.
Economy and demographics · Federal data
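The components-of-change identity in the Census PEP entry above (base + births - deaths + net migration) can be written out directly. This is a minimal sketch with a hypothetical county; the function names and all input figures are illustrative, not Census data.

```python
def end_population(base: int, births: int, deaths: int, net_migration: int) -> int:
    """Population at the end of the period from the components of change."""
    return base + births - deaths + net_migration


def decompose_growth(base: int, births: int, deaths: int, net_migration: int) -> dict:
    """Split total growth (in percent of base) into natural change and migration."""
    natural = births - deaths
    return {
        "natural_pct": 100.0 * natural / base,
        "migration_pct": 100.0 * net_migration / base,
        "total_pct": 100.0 * (natural + net_migration) / base,
    }


if __name__ == "__main__":
    # Hypothetical county: 100,000 residents at the base date.
    print(end_population(100_000, 1_200, 900, 2_700))
    print(decompose_growth(100_000, 1_200, 900, 2_700))
```

Separating natural change from net migration is what allows a growth ranking to distinguish counties growing through births from those growing through in-migration.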
USDA FSIS Food Safety Data: The Federal Recall Database and Inspection Records Behind Meat, Poultry, and Egg Safety
2026-05-24
FSIS regulates 6,500+ meat, poultry, and egg processing establishments covering 80+ billion pounds of product annually. The three-class recall system escalates from Class III (mislabeling) to Class I (health hazard). The 2008 Hallmark/Westland recall (143M lbs, largest ever) involved downer cattle. E. coli O157:H7 is a zero-tolerance adulterant. Here is the Establishments.csv schema, FSIS recall database API, GenomeTrakr WGS pathogen tracing, HACCP plan requirements, PHIS inspection reports, and a Python recall trend analysis by Class and commodity.
Food and agriculture · Federal data
Census SAIPE: The Small Area Income and Poverty Estimates Behind Federal Education Funding and County-Level Poverty Maps
2026-05-24
SAIPE produces annual model-based poverty estimates for all 3,100+ counties and 13,000+ school districts — the only single-year official source at that geography. It drives ~$17B in annual Title I-A education funding and the $3.5B CDBG formula. The model combines ACS, IRS EITC filers, SNAP counts, and CPS via small area estimation. The Census API exposes county and school-district poverty rates and median household income back to 1989 via a single endpoint.
Economy and demographics · Research and education · Federal data
DOT National Transit Database: The Federal Ridership and Finance Data Behind Every US Bus and Rail System
2026-05-24
The NTD collects annual ridership (UPT), vehicle miles, fares, and expenses from ~800 transit agencies as a condition of FTA grants. US total UPT hit 10.4B in 2023, still below the 15.7B pre-COVID peak. The COVID collapse was severe — NYC subway fell from 1.8B to 600M annual trips — and $69B in emergency relief (CARES + CRRSAA + ARP) kept systems running. Section 5307 formula grants (~$5B/year) are allocated directly from NTD UPT/VRM data.
Transportation safety · Federal data
USPTO Trademark Data: The Federal Brand Registry Behind 3 Million Active Marks and the TESS Search System
2026-05-24
The USPTO holds ~3M active registered trademarks, with ~650,000 new applications per year at peak. Federal registration provides nationwide constructive notice, ® usage rights, US Customs blocking of infringing imports, and incontestability after 5 years. The 45 Nice Classification classes span all goods and services. Bulk XML data at bulkdata.uspto.gov and the USPTO Trademark JSON API enable filing trend analysis; China accounts for ~25% of foreign USPTO filings.
Research and education · Federal data
Federal Reserve Senior Loan Officer Survey: The Quarterly Credit Conditions Data the Fed Uses to Track Lending Tightening
2026-05-24
The SLOOS surveys ~80 large US banks and 24 foreign branches quarterly on changes in lending standards and loan demand. The net percentage (tightening minus easing) is the key signal: it hit +80% for C&I loans in Q4 2008 and +68% in Q2 2020. Net tightening above +50% has historically predicted recession within 4 quarters. FRED series DRTSCILM (large/medium C&I) and DRTSCIS (small firms) extend back to 1990 and are freely accessible via the FRED API.
Finance and markets · Federal data
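The SLOOS net percentage described in the entry above is the share of banks tightening minus the share easing. This is a minimal sketch with hypothetical response counts; the function name `net_tightening_pct` is illustrative, and the published series also reflect weighting and response-category details that this ignores.

```python
def net_tightening_pct(tightened: int, eased: int, unchanged: int) -> float:
    """Net percentage of respondents tightening standards.

    Computed as percent tightening minus percent easing, with all
    respondents (including "unchanged") in the denominator.
    """
    total = tightened + eased + unchanged
    if total == 0:
        raise ValueError("no survey responses")
    return 100.0 * (tightened - eased) / total


if __name__ == "__main__":
    # Hypothetical quarter: 80 respondents, 60 tightened, 4 eased.
    print(net_tightening_pct(60, 4, 16))
```

A negative value means more banks eased than tightened, so the same function covers both directions of the credit cycle.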
FCC Spectrum Data: The Universal Licensing System Behind 25 Million Wireless Licenses and US Radio Frequency Allocation
2026-05-24
The FCC's Universal Licensing System (ULS) holds 25M+ active wireless licenses covering amateur radio (11M+ operators), commercial mobile (AT&T, Verizon, T-Mobile spectrum), public safety, broadcast, microwave, and satellite. Spectrum auctions have raised $160B+ total; Auction 107 (C-band, 2021) alone netted $81B, the largest ever. The National Table of Frequency Allocations (47 CFR Part 2) governs band use. ULS bulk data at ftp.fcc.gov enables license density analysis, and the FCC also maintains broadcast license data (CDBS/LMS) for AM/FM/TV stations.
Federal data
HUD Housing Choice Vouchers: The Section 8 Data Behind 2.3 Million Households and $30 Billion in Annual Rental Assistance
2026-05-24
The Housing Choice Voucher (HCV) program subsidizes rent for ~2.3 million households at ~$30B/year, administered by ~2,200 local PHAs. HUD publishes Fair Market Rents (FMRs) annually for ~2,600 areas at the 40th percentile of gross rent (2024: NYC 2BR $2,765, rural MS $725). Only ~25% of eligible households receive assistance due to funding caps; waitlists run 1–10 years. The HUD Picture of Subsidized Households (PASH) provides tract-level data on income, demographics, and voucher concentration for spatial analysis.
Economy and demographics · Federal data
Census American Housing Survey: The Biennial Housing Quality Database Behind US Structural Conditions and Neighborhood Characteristics
2026-05-24
The AHS is a biennial panel survey (~60,000 housing units) covering structural quality, condition deficiencies, heating fuel, plumbing, and neighborhood characteristics — the deepest housing-unit dataset in the US. Tracking the same units since 1973 reveals: plumbing inadequacy fell from 4.5% to under 0.5%; owner-occupancy peaked at 69% (2004–05) and troughed at 63% (2016); new single-family median size grew from 1,500 to 2,300+ sq ft. HUD uses AHS microdata for the biennial Worst Case Housing Needs report (8.5M households in 2023).
Economy and demographics · Federal data
USDA Economic Research Service: The Agricultural Economics Data Behind Farm Income, Food Prices, and Rural America
2026-05-24
USDA ERS publishes agricultural economic data across farm income ($116B net farm income in 2023), food prices (monthly CPI food outlook, 2022's +11.4% grocery price surge), food security (13.5% of households food insecure in 2023, 47M people), commodity program costs (ARC/PLC reference prices), and rural America (Beale Codes 1–9 classifying all 3,100+ counties, 180+ rural hospital closures since 2010). The Food Access Research Atlas maps food deserts at the census-tract level.
Food and agriculture · Federal data
BLS Employment Cost Index: The Quarterly Wage and Benefits Tracker the Federal Reserve Watches Most Closely
2026-05-24
The BLS Employment Cost Index (ECI) measures quarterly changes in employer compensation costs (wages + benefits) using fixed employment weights, which removes the industry-mix distortion that afflicts Average Hourly Earnings. Private-industry wage growth peaked at ~5.7% YoY in mid-2022 before decelerating to ~4.2% by end-2023; wage growth of ~3.5% is the pace generally treated as consistent with the Fed's 2% PCE inflation target. The companion Employer Costs for Employee Compensation (ECEC) release breaks out benefits, showing health insurance at ~$3.50–$4.00/hour and total benefits at ~31% of compensation. An upside ECI surprise in Q1 2024 pushed back expectations for Fed rate cuts.
Economy and demographics · Federal data
DOL Unemployment Insurance Weekly Claims: The Thursday Morning Data Release That Moves Financial Markets
2026-05-24
DOL publishes initial and continuing unemployment insurance claims every Thursday at 8:30 AM ET, covering 53 state programs. Initial claims peaked at 6.87 million for the week ending March 28, 2020 — dwarfing the prior record of 695,000 (1982). Pre-COVID lows of ~200,000 (2018–2019) were the lowest since 1969. The 4-week moving average smooths weather and auto-plant retooling noise. FRED series ICSA, ICNSA, and CC4WSA provide full history back to 1967.
Labor and workplace · Economy and demographics · Federal data
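The 4-week moving average mentioned in the claims summary above is a plain trailing mean. A minimal sketch, assuming weekly initial-claims counts are already loaded into a list in chronological order; the function name and the sample numbers are illustrative, not DOL data:

```python
def four_week_moving_average(weekly_claims):
    """Return the trailing 4-week mean for every week that has 4 weeks of history."""
    if len(weekly_claims) < 4:
        return []
    return [
        sum(weekly_claims[i - 3 : i + 1]) / 4
        for i in range(3, len(weekly_claims))
    ]


# Illustrative values only, not actual releases.
claims = [210_000, 205_000, 230_000, 215_000, 220_000]
print(four_week_moving_average(claims))  # [215000.0, 217500.0]
```

The one-week jump to 230,000 moves the smoothed series by only a few thousand, which is the point of reading the average rather than the single-week print.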
Census Foreign Trade Statistics: The HS-Code Import and Export Database Behind Every US Trade Policy Decision
2026-05-24
The Census Bureau Foreign Trade Division compiles monthly import/export statistics from CBP ACE entry data and AES electronic export filings. 2023: goods exports $2.02T, imports $3.08T, deficit $1.06T. Data drills to 10-digit HS/Schedule B codes by country and port. After the Section 301 China tariffs of 2018–2019, the US-China goods deficit fell from $419B (2018) to $279B (2023), but sourcing shifted to Vietnam, Mexico, and Taiwan. The Census API (api.census.gov/data/timeseries/intltrade/) and USA Trade Online enable country-HS-month-level analysis.
Economy and demographics · Federal data
Social Security OASDI: The Federal Data Behind $1.4 Trillion in Annual Benefits and 70 Million Recipients
2026-05-24
Social Security's OASDI program (Old Age, Survivors, and Disability Insurance) paid $1.4T in benefits to ~70 million recipients in 2024, funded by a 6.2% FICA payroll tax paid by both employee and employer on wages up to $168,600. The benefit formula converts the 35 highest indexed earning years into AIME, then applies progressive bend points (90%/32%/15%) to compute PIA. Full Retirement Age is 67 for those born 1960+; early claiming at 62 permanently reduces benefits 25–30%; delayed claiming to 70 adds 8%/year. The 2024 Trustees Report projects OASI trust fund depletion in 2033, after which revenues cover ~79% of scheduled benefits. SSA publishes 700+ statistical tables in the Annual Statistical Supplement, monthly snapshots at data.ssa.gov, and the Social Security Statement via my.ssa.gov.
Economy and demographics · Federal data
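The bend-point step in the OASDI summary above can be sketched in a few lines. The 90%/32%/15% rates come from the summary; the 2024 bend-point dollar amounts ($1,174 and $7,078) are an assumption added here and should be checked against SSA's published figures for the eligibility year in question. The function name is hypothetical.

```python
# Assumed 2024 bend points in dollars of AIME (not stated in the summary above).
BEND_POINTS_2024 = (1174, 7078)
# Replacement rates in percent, as named in the summary.
RATES_PCT = (90, 32, 15)


def primary_insurance_amount(aime, bend_points=BEND_POINTS_2024, rates_pct=RATES_PCT):
    """Compute PIA in dollars from a whole-dollar AIME.

    Works in integer cents to avoid float drift, then rounds down
    to the next lower dime.
    """
    first, second = bend_points
    tier1 = min(aime, first)
    tier2 = min(max(aime - first, 0), second - first)
    tier3 = max(aime - second, 0)
    # percent * dollars = cents
    cents = rates_pct[0] * tier1 + rates_pct[1] * tier2 + rates_pct[2] * tier3
    return (cents // 10) * 10 / 100


print(primary_insurance_amount(5000))   # 2280.9
print(primary_insurance_amount(10000))  # 3384.1
```

Doubling AIME from $5,000 to $10,000 raises PIA by less than half, which is the progressivity the three tiers are designed to produce.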
Census Current Population Survey: The Monthly Survey Behind Official US Poverty Rates and Income Inequality Measures
2026-05-24
The Current Population Survey (CPS) interviews ~60,000 households monthly to produce the official unemployment rate and, via the March ASEC supplement (~95,000 households), the official US poverty rate. The official poverty measure (OPM) uses 1960s Orshansky thresholds adjusted only for CPI ($30,900 for a family of 4 in 2023, 11.1% poverty rate). The Supplemental Poverty Measure (SPM) adds SNAP, housing subsidies, and EITC while subtracting taxes, yielding 12.9% in 2023 — more policy-sensitive. Median household income was ~$80,610 in 2023. IPUMS CPS harmonizes all CPS waves back to 1962; the Census API exposes state-level poverty rates programmatically.
Economy and demographics · Federal data
BEA International Transactions: The Balance of Payments Data Behind Every US Trade Deficit Headline
2026-05-24
The BEA's International Transactions Accounts (ITAs) record all economic flows between US residents and the rest of the world. In 2023, the US ran a goods deficit of ~$1.06T, offset partially by a services surplus of ~$293B and net primary income of +$196B, for a total current account deficit of ~$905B (3.3% of GDP). The US's net international investment position stood at -$20.6T — yet the US earns positive net primary income because US assets abroad yield higher returns ("exorbitant privilege"). The BEA ITA API exposes quarterly data on all current account components back to 1960.
Economy and demographics · Federal data
NOAA Climate Data: The National Centers for Environmental Information Behind 130 Years of Temperature Records and Climate Normals
2026-05-24
NOAA's National Centers for Environmental Information (NCEI) archives 150+ petabytes of atmospheric, ocean, and geophysical data serving 25+ billion online requests per year. The Global Historical Climatology Network Daily (GHCN-Daily) covers ~120,000 stations worldwide with daily Tmax/Tmin/PRCP/SNOW back to the late 1800s. NOAAGlobalTemp ranked 2023 as the warmest year on record (+1.45°C above pre-industrial). US Climate Normals (1991–2020) define 30-year averages for 15,000+ stations. NCEI's Billion-Dollar Disasters database counted 28 events totaling $94B in losses in 2023. The CDO REST API provides programmatic access with daily and monthly summary endpoints.
Environment and energy · Federal data
VA Disability Benefits: The Federal Data Behind 5.5 Million Compensation Recipients and $130 Billion in Annual Spending
2026-05-24
The VA disability compensation program pays monthly benefits to ~5.5 million veterans (up from 3.5M in 2010) based on a 0–100% rating using a whole-person combined formula. Here is the 2024 compensation rate table ($171/month at 10% to $3,737 at 100%), the PACT Act 2022 and its 23 new burn pit presumptive conditions (3.5M newly eligible veterans, $280B 10-year cost), the GI Bill (Post-9/11 Ch. 33: tuition cap, BAH allowance, $1K books stipend), the VA Home Loan Guaranty (no down payment, 4M+ loans in FY2022), the claims processing system (884K 2012 peak backlog, three Appeals Reform Act review lanes), VSOs and TDIU (~370K recipients), and the VA Open Data portal with state-level benefits utilization data.
Health and medicine · Federal data
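The "whole-person combined formula" in the VA summary above means each additional rating applies only to the capacity left after the ratings before it. A minimal sketch, assuming ratings are combined from highest to lowest, intermediate values are rounded to whole percents, and the final value is rounded to the nearest 10 (5 rounds up). It ignores special adjustments such as the bilateral factor, and the function name is hypothetical:

```python
def combined_rating(ratings):
    """Combine individual disability ratings (whole percents) into one rating."""
    combined = 0
    for rating in sorted(ratings, reverse=True):
        remaining = 100 - combined
        # Apply the rating to the remaining capacity, rounding half up.
        combined += (remaining * rating + 50) // 100
    # Round the final value to the nearest 10, with 5 rounding up.
    return min(100, (combined + 5) // 10 * 10)


print(combined_rating([50, 30]))      # 70  (50, then 30% of the remaining 50 = 65)
print(combined_rating([50, 30, 20]))  # 70  (65, then 20% of the remaining 35 = 72)
```

This is why ratings of 50% and 30% combine to 70% rather than 80%: the second rating never sees the half of the "whole person" already counted.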
USGS Water Resources: The National Water Information System Behind Flood Prediction, Drought Monitoring, and Aquifer Depletion
2026-05-24
The USGS National Water Information System runs 8,000+ streamflow gauging stations and feeds NWS River Forecast Centers and the National Water Model (2.7 million reaches, 15-minute forecasts). Here is ADCP gauging methodology, annual peak discharge feeding FEMA Flood Insurance Rate Maps, the Ogallala Aquifer (174,000 sq miles, declining 1–3 ft/year in TX/KS), Central Valley land subsidence from groundwater pumping, the NAWQA water quality monitoring program, water use surveys (thermoelectric power 41% of withdrawals), the 7Q10 low-flow statistic driving NPDES permits, the NWIS REST API (parameterCd/statCd parameter table), and a Python script plotting 5-year discharge with drought-period shading.
Environment and energy · Federal data
NSF Research Grants: Mapping $9 Billion in Annual Basic Science Funding
2026-05-24
The NSF funds ~25% of all federally funded basic research at US universities (excluding life sciences) with a $9B+ annual budget across 8 directorates. Here is the proposal review process (dual merit criteria: intellectual merit AND broader impacts; funding rates 17–25% by directorate; ~40,000–50,000 proposals/year), the CAREER award ($500k/5 years, highly competitive), the Graduate Research Fellowship GRFP ($37k/year, ~2,000 awards from 12,000+ applicants), the NSF Awards API (api.nsf.gov, 600,000+ awards searchable), National AI Research Institutes ($200M+), the 2023 immediate open-access mandate stricter than NIH's, EPSCoR geographic equity program, and a Python Awards API CAREER grant analysis by directorate and institution.
Research and education · Federal data
BTS Airline On-Time Performance: The Federal Dataset Behind Every Flight Delay, Cancellation, and Tarmac Crisis
2026-05-24
The BTS ATOP/ASQP database covers ~6 million flight records per year from all domestic carriers with 1%+ market share, with delay coded across five cause categories: Carrier (~30-35%), NAS (~30-35%), Late Aircraft (~35-45%), Weather (~5-10%), and Security (<1%). Here is the T-100 domestic/international traffic series (ASM, RPM, load factor), Form 41 carrier financials (CASM, RASM, fuel as 20-30% of costs), the COVID collapse (96% RPM decline April 2020, $54B CARES Act PSP), the Southwest December 2022 meltdown (17,000 cancelled flights, $140M DOT settlement), the 3-hour/4-hour tarmac delay rule, BTS Transtats bulk download, and a Python script to compute monthly on-time rate and cancellation rate by carrier.
Transportation safety · Federal data
Federal Reserve Z.1 Financial Accounts: The Flow of Funds Behind US Household Wealth and Sectoral Balances
2026-05-24
The Federal Reserve Z.1 (formerly Flow of Funds) publishes quarterly financial assets and liabilities for all US economic sectors. Here is the household net worth data ($156T 2021 peak, ~$8T 2022 decline from rate hikes), the Distributional Financial Accounts showing top 1% hold ~31% of wealth vs. bottom 50% at ~3%, the two-sided sectoral balance accounting identity, corporate leverage, Table B.101 residential real estate at market value ($25T to $43T 2019–2024), the $26T+ Treasury liability position, Rest of World holdings, FRED mnemonic guide, and a Python FRED API script pulling household net worth with CPI deflation and NBER recession shading.
Finance and markets · Economy and demographics · Federal data
Census LEHD: The Longitudinal Employer-Household Dynamics Database Behind Workforce Flows, Commuting, and Wage Growth
2026-05-24
The Census LEHD program links UI wage records for 95%+ of private workers to employer and household records, producing the Quarterly Workforce Indicators (employment/payroll/hires/separations by county × industry × age × sex × education), LODES origin-destination commuting matrices (block-to-block home-work pairs), job-to-job flow statistics (7–10% earnings premium from voluntary job switching), and business dynamics data. Here is how LEHD differs from QCEW/CES/ACS, the COVID remote-work reshaping of OD commute flows, the great resignation mobility spike, OnTheMap and LEHD Explorer tools, and a Python Census QWI API script analyzing young construction worker employment by county.
Economy and demographics · Labor and workplace · Federal data
BEA Regional Accounts: GDP by State, Personal Income by County, and the Sub-National Data Behind Every State Policy Debate
2026-05-24
The BEA Regional Accounts allocate national economic totals to states, counties, and MSAs: GDP by State (annual/quarterly, NAICS detail, post-COVID TX/FL leading growth), Personal Income by State (quarterly, five-component decomposition of labor/capital/transfers), Personal Income by County (~3,100 counties annually, CAINC1 table), and GDP by MSA (~380 MSAs, NYC at $2T+ vs. rural laggards). Here is the energy boom-bust signal (North Dakota Bakken GDP doubled 2007–2014 then collapsed), the high-income state tax migration effect (California 13.3% vs. Texas/Florida 0%), transfer payment COVID surge and unwinding, BEA Regional API parameters, and a Python script ranking states by 2010–2024 per-capita personal income growth.
Economy and demographics · Federal data
USDA NASS Crop Surveys: The Federal Agricultural Data Behind Every Corn, Soybean, and Wheat Market
2026-05-24
The USDA National Agricultural Statistics Service conducts 400+ surveys annually, reaching 3 million respondents to produce the authoritative federal record of US crop production, livestock inventories, commodity prices, and agricultural prices since 1867. Here is the Crop Production report, WASDE supply-demand balance sheets, the QuickStats API (eight parameters, 50,000 record limit), weekly Crop Progress with Good/Excellent condition ratings, the five major crops (corn 35% of cropland, soybeans competing with Brazil, winter/spring wheat, cotton, rice), the 2012 drought sending corn to $8.49/bushel and soybeans above $17, Cattle on Feed, Hogs and Pigs quarterly, Prices Received/Paid, and a Python QuickStats API script to plot state-level corn yield per acre for the top 5 producing states over 20 years.
Food and agriculture · Finance and markets · Federal data
EIA Energy Data: The Federal Database Behind Oil Prices, Natural Gas Storage, and Electricity Generation
2026-05-24
The Energy Information Administration is the primary federal authority for US energy data, publishing the market-moving Short-Term Energy Outlook, the Weekly Petroleum Status Report (Cushing OK crude stocks that move WTI crude prices $1–2/barrel), the Natural Gas Storage Report (five-region EIA-912 data), EIA-860 and EIA-923 power plant databases (15,000+ generators, monthly fuel consumption and generation), the Electric Power Monthly, Petroleum Supply Monthly, and the EIA Open Data API (500,000+ series). Here is the 2019 US net petroleum export milestone, the 2022 European energy crisis Henry Hub spike to $9/MMBtu, and a Python EIA v2 API script pulling WTI crude and Henry Hub weekly prices with a dual-axis chart annotating the 2022 spike.
Environment and energy · Economy and demographics · Federal data
Census Building Permits and Housing Starts: The Federal Leading Indicator Behind the US Housing Market
2026-05-24
The Census Bureau Building Permits Survey and New Residential Construction release track ~20,000 permit-issuing jurisdictions and ~900 construction sample areas monthly — the primary federal leading indicators for US housing activity. Here is the BPS 96% coverage of US construction, SAAR methodology, permits-to-starts ratio dynamics, the 2006 peak at 2.07M SAAR to 2009 trough at 554K to the 2020–2021 surge to the 2022–2023 pullback as mortgage rates went 3% to 7%, the SFH/multifamily bifurcation, Sun Belt concentration (Texas 15–18%, Florida 10–12%), New Residential Sales contract-signed timing, lumber futures (2021 spike to $1,700/MBF), Census BPS API, and FRED series PERMIT/HOUST/HOUST1F/HOUST5F.
Economy and demographics · Federal data
BLS Occupational Employment Data: Wages, Job Counts, and 10-Year Projections for Every US Occupation
2026-05-24
The BLS OEWS program publishes wages and employment counts for 830 occupations across 590+ geographies from a 1.1M establishment semiannual survey pooled over 3 years into ~3.3M observations. Here is the data structure (TOT_EMP, hourly/annual wage percentiles 10th–90th, location quotient, entry/experienced wage fields), the Standard Occupational Classification (23 major groups / 459 broad / 867 detailed occupations), top-paying occupations (surgeons $250k+, anesthesiologists, airline pilots), Employment Projections 2022–2032 (fastest-growing: home health aides +924k, NPs, solar installers; fastest-declining: word processors, cashiers), the Occupational Outlook Handbook, O*NET skills crosswalk, wage inequality analysis (90/10 percentile ratio), H-1B prevailing wage connection, and a Python script to analyze healthcare occupation wages from the national OEWS ZIP.
Economy and demographics · Labor and workplace · Federal data
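The location quotient field in the OEWS entry above compares an occupation's share of local employment with its share of national employment. A minimal sketch of that ratio; the function and argument names are hypothetical and the numbers are illustrative, not OEWS values:

```python
def location_quotient(area_occ_emp, area_total_emp, national_occ_emp, national_total_emp):
    """Occupation's local employment share divided by its national share."""
    area_share = area_occ_emp / area_total_emp
    national_share = national_occ_emp / national_total_emp
    return area_share / national_share


# Illustrative: an occupation is 2% of local jobs but 1% of national jobs.
lq = location_quotient(200, 10_000, 1_000, 100_000)
print(round(lq, 2))
```

A value above 1 means the occupation is more concentrated in the area than in the country as a whole; a value below 1 means it is underrepresented.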
FHWA Highway Data: The Federal Dataset Behind Bridge Conditions, Pavement Quality, and Traffic Counts
2026-05-24
The Federal Highway Administration publishes the most comprehensive infrastructure dataset in the federal government: the National Bridge Inventory (620,000+ bridges, biennial inspection, 0–9 condition ratings, sufficiency score), the Highway Performance Monitoring System (pavement IRI, Good/Fair/Poor condition, 900,000+ road segments), Annual Average Daily Traffic counts, and Highway Statistics (registered vehicles, licensed drivers, gas tax revenues). Here is the structurally deficient vs. functionally obsolete distinction, the IIJA 2021 $40B bridge repair program, the Highway Trust Fund solvency crisis (gas tax frozen at $0.184/gallon since 1993, EVs avoiding it), the Freight Analysis Framework commodity-flow OD matrices, and a Python NBI bridge data script to map structurally deficient bridges by sufficiency rating.
Transportation safety · Engineering and infrastructure · Federal data
BLS Current Employment Statistics: The Monthly Jobs Report Behind Every Payroll Number
2026-05-24
The BLS releases two surveys on “Jobs Friday” (first Friday of each month): the Establishment Survey (580,000 worksites, source of the nonfarm payroll headline) and the Household Survey (60,000 households, source of the unemployment rate). Here is why the two surveys often diverge, how the net birth/death model handles new businesses, the three-tier revision cycle including the annual benchmark (the preliminary benchmark estimate published in August 2024 cut 818,000 jobs from the March 2024 payroll level), X-13ARIMA-SEATS seasonal adjustment, industry-level dynamics (healthcare adding jobs through every recession, the COVID −20.5M single-month collapse), the 8:30 AM release market impact, and a Python BLS API script to download total nonfarm payroll and plot recession bars.
Economy and demographics · Labor and workplace · Federal data
SEC EDGAR XBRL: The Machine-Readable Financial Statement Database Behind Every Public Company
2026-05-24
The SEC has required XBRL-tagged financial statements from all public companies since 2009–2011, creating a machine-readable database of ~7,000 active filers. Here is the US-GAAP taxonomy (17,000+ concepts, us-gaap/dei/srt namespaces), the three EDGAR APIs (Company Facts for all filings, Company Concept for a single metric over time, Frames for cross-sectional data across all companies in one period), data quality pitfalls (30% custom extension elements, taxonomy changes after ASC 606, fiscal year misalignment), the Beneish M-score fraud detection application, and a Python script using the SEC EDGAR API to extract Apple's revenue and net income history from 10-K filings.
Finance and markets · Federal data
CMS Skilled Nursing Facility Data: Star Ratings, Staffing, and the Quality Metrics Behind 15,000 Nursing Homes
2026-05-24
CMS Care Compare publishes quality data for every Medicare- and Medicaid-certified skilled nursing facility in the US. Here is the five-star composite rating system (health inspection, staffing, and quality measure components), the 3×4 scope/severity deficiency grid (A through L, Immediate Jeopardy at J–L), the Payroll-Based Journal staffing system that replaced self-reported data in 2016, the Minimum Data Set resident assessment that drives both quality measures and PDPM reimbursement, COVID-19’s toll on nursing homes (170,000+ deaths, 38% of early US COVID deaths), private equity ownership transparency gaps, and a Python script to download CMS Care Compare CSV files and compute state-level star rating distributions.
Health and medicine · Federal data
BLS Occupational Injuries: The SOII Dataset Behind 2.8 Million Annual Workplace Injuries
2026-05-24
The Bureau of Labor Statistics Survey of Occupational Injuries and Illnesses surveys ~230,000 establishments annually to produce the only national count of workplace injuries and illnesses. Here is the Total Recordable Incidence Rate formula, OSHA recordkeeping requirements (Form 300 Log, 300A Summary, 301 Incident Report), the case-and-demographic microdata for individual injury characteristics, the Census of Fatal Occupational Injuries as the companion fatal census (~5,500/year, construction’s fatal four), the musculoskeletal disorder supplement, the pervasive underreporting problem (academic research shows 40–69% capture rate), and a Python BLS API script to compare TRIR across construction, manufacturing, and healthcare.
Economy and demographics · Labor and workplace · Federal data
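The Total Recordable Incidence Rate named in the SOII entry above normalizes case counts to 100 full-time workers. A minimal sketch, assuming the standard base of 200,000 hours (100 workers × 40 hours × 50 weeks); the function name and inputs are illustrative:

```python
# 100 full-time-equivalent workers x 40 hours/week x 50 weeks/year.
HOURS_BASE = 200_000


def total_recordable_incidence_rate(recordable_cases, hours_worked):
    """Recordable injury and illness cases per 100 full-time workers per year."""
    if hours_worked <= 0:
        raise ValueError("hours_worked must be positive")
    return recordable_cases * HOURS_BASE / hours_worked


# Illustrative establishment: 7 recordable cases over 500,000 hours worked.
print(total_recordable_incidence_rate(7, 500_000))  # 2.8
```

Using hours worked rather than headcount in the denominator keeps establishments with heavy overtime or many part-time workers comparable.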
EPA Air Quality System: The Federal Monitor Network Behind NAAQS Compliance and Pollution Mapping
2026-05-24
The EPA Air Quality System aggregates hourly and daily pollutant readings from 4,000+ monitoring sites operated by state, local, tribal, and federal agencies. Here is the six criteria pollutant NAAQS framework (PM2.5, PM10, ozone, CO, SO2, NO2), the 2024 PM2.5 standard tightened to 9 μg/m³, the AQI 0–500 scale and daily worst-of-pollutants calculation, nonattainment designation and State Implementation Plan mechanics, the Harvard Six Cities study and BenMAP health burden model (100,000+ annual PM2.5-attributable deaths), environmental justice monitoring gaps, wildfire smoke exceptional events provisions, and a Python script using the EPA AQS API to download daily PM2.5 readings and identify exceedance days.
Environment and energy · Health and medicine · Federal data
HUD Point-in-Time Count: The Federal Homeless Census Behind 650,000 Americans Without Shelter
2026-05-24
HUD’s annual Point-in-Time count, conducted over the last 10 days of January by ~400 Continuum of Care regions, is the only national census of homelessness in the US. Here is the sheltered vs. unsheltered methodology, the 2023 count of 653,100 (the highest since reporting began), California’s 28% share, the Homeless Management Information System as the longitudinal individual-level tracking database, veteran homelessness (37,000+ and the HUD-VASH voucher program), the chronic homeless definition (12+ months or 4+ episodes), methodological limitations (January weather, volunteer variation, doubled-up household exclusion), Housing First policy evidence, and a Python script to download HUD Exchange PIT CSVs and compute per-capita homeless rates by state.
Economy and demographics · Federal data
FAA Aviation Safety Data: The Federal Databases Behind Every Plane Crash Investigation
2026-05-24
The federal aviation safety ecosystem spans four major databases: the NTSB accident database (every civil aviation accident since 1962), the FAA AIDS system, the NASA-administered Aviation Safety Reporting System (ASRS — voluntary, confidential, non-punitive near-miss reports), and the FAA Wildlife Strike Database. Here is the NTSB probable cause taxonomy (pilot error 70%+ of GA accidents), the Boeing 737 MAX MCAS investigation, the ASRS reporting immunity mechanism, runway incursion categories, the Miracle on Hudson Canada Goose strike context, the FAA Civil Aviation Registry N-number database, pilot workforce demographics, and a Python NTSB bulk CSV phase-of-flight fatal accident rate analysis.
Federal data · Transportation safety
NRC Nuclear Safety Data: The Federal Database Behind Every Reactor Inspection and Incident Report
2026-05-24
The Nuclear Regulatory Commission publishes quarterly Performance Indicators, inspection findings, and daily Event Notification Reports for all 99 operating US nuclear reactors. Here are the Reactor Oversight Process cornerstones (Initiating Events, Mitigating Systems, Barrier Integrity), the Significance Determination Process (Green/White/Yellow/Red), Licensee Event Reports, the TMI and Fukushima reform trail, probabilistic risk assessment (core damage frequency ~1E-5/reactor-year), the ADAMS document management system with 7M+ public records, the 92–93% nuclear capacity factor record, and a Python NRC PI XML parser to rank plants by unplanned scram rate.
Federal data · Environment and energy
Bureau of Prisons Data: The Federal Inmate Population Behind 150,000 Federal Prisoners
2026-05-24
The Bureau of Prisons manages 121 federal prisons holding ~148,000 inmates — down from a 219,000 peak in 2013. Here is the weekly population data, offense category breakdown (drug offenses 43%+, the legacy of mandatory minimums), the racial disparity in crack vs. powder cocaine sentencing before the Fair Sentencing Act 2010, FIRST STEP Act reforms, the BJS National Prisoner Statistics Program covering all US incarceration, US Sentencing Commission case-level sentencing data and disparity research, PACER federal court records, supervised release mechanics, private prison contracting ($700M+/year), ICE immigration detention as a separate civil system, and recidivism data (68% rearrest within 3 years).
Federal data · Justice and immigration
USCIS Immigration Data: The Federal Database Behind Visas, Green Cards, and Naturalizations
2026-05-24
USCIS adjudicates ~8 million petitions annually and publishes detailed statistics on every immigration benefit category. Here is the naturalization data (~800–900K/year by country of birth), the employment-based green card per-country 7% cap that creates 40+ year backlogs for Indian nationals (EB-2 India priority date ~2012), the H-1B lottery (470K registrations for 85K slots in FY2025), the 1.7M+ affirmative asylum backlog, DACA quarterly recipient counts by state, the EOIR immigration court 3.3M+ case backlog with judge-level grant rate variation, DHS Yearbook of Immigration Statistics, and a Python USCIS naturalization Excel workbook analysis.
Federal data · Justice and immigration · Economy and demographics
FBI UCR: The Federal Crime Statistics Behind Every Public Safety Analysis
2026-05-24
The FBI Uniform Crime Reporting program collects crime data from ~18,000 law enforcement agencies. Its transition from the legacy Summary Reporting System to the incident-level NIBRS created massive coverage gaps in the 2021 national crime count when major cities failed to report. Here are the 8 Part I Index Crimes, the NIBRS incident/offense/victim/property/arrestee segment structure, the 2020–2021 murder surge (+30% single-year, the largest since national tracking began), hate crime data, LEOKA officer safety statistics, the dark figure of crime and NCVS complement, clearance rates, and the Crime Data Explorer API with a Python state-level murder rate trend analysis.
Federal data · Justice and immigration
SBA 7(a) and 504 Loan Data: The Federal Small Business Lending Database Behind $40 Billion in Annual Guarantees
2026-05-24
The SBA publishes loan-level data for all approved 7(a) and 504 loans — the two flagship small business lending programs covering $30–40B/year in 7(a) guarantees and $8–10B/year in 504 fixed-asset financing. Here is the 7(a) guarantee structure (85% on loans ≤$150K, 75% above, up to $5M), the 504 three-party 50/40/10 split, the loan-level public dataset fields (NAICS, lender, status, charge-off amount, ownership flags), lender concentration (Live Oak Bank, OIG 2014 high-risk lender report), industry default rates, SBIC venture financing, equity and access analysis by minority/women/veteran-owned status, and a Python Socrata API sector default rate analysis.
Federal data · Finance and markets
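The 7(a) guarantee structure in the SBA entry above reduces to a two-tier rule. A minimal sketch using only the thresholds quoted in the summary (85% at or below $150K, 75% above, $5M maximum loan); it ignores program variants with their own terms, and the function name is hypothetical:

```python
MAX_LOAN = 5_000_000
SMALL_LOAN_LIMIT = 150_000


def guaranteed_amount(loan_amount):
    """Dollar amount of a standard 7(a) loan covered by the SBA guarantee."""
    if not 0 < loan_amount <= MAX_LOAN:
        raise ValueError("loan_amount must be positive and no more than $5M")
    rate_pct = 85 if loan_amount <= SMALL_LOAN_LIMIT else 75
    return loan_amount * rate_pct / 100


print(guaranteed_amount(100_000))    # 85000.0
print(guaranteed_amount(5_000_000))  # 3750000.0
```

When working with the loan-level dataset, the lender's exposure on a charged-off loan is the balance minus this guaranteed portion, which is why the tier a loan falls in matters for default analysis.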
BLS American Time Use Survey: The Federal Dataset Behind How Americans Actually Spend Their Time
2026-05-24
The BLS American Time Use Survey has tracked 24-hour time diaries for ~10,000 Americans annually since 2003, making it the only federal dataset measuring time allocation across all life activities. Here are the 17 major activity categories and ATUS Lexicon coding, the gender gap (women average 2+ hours/day more household and caregiving work, while men log more leisure and paid work), parental intensive childcare trends, the 2020 COVID shift to remote work (42% working from home), leisure inequality by education (TV vs. reading/exercise divergence), the Well-Being and Eating & Health special modules, IPUMS-ATUS for harmonized cross-year access, and a Python weighted gender gap analysis.
Federal data · Economy and demographics
FDIC Call Report Data: The Quarterly Financial Filing Behind Every US Bank's Balance Sheet
2026-05-24
Every FDIC-insured institution files quarterly Call Reports (FFIEC 031/041/051) — the primary supervisory dataset covering ~4,700 banks with balance sheet, income, asset quality, capital adequacy, and liquidity detail. Here is the RC schedule structure (HTM vs. AFS securities, loan categories, deposit types), Schedule RI income statement, Schedule RC-N nonperforming loans and charge-offs, Schedule RC-R capital ratios and PCA thresholds, the SVB warning signs visible in 2022 Call Report data (HTM unrealized losses, concentrated uninsured deposits), the Texas Ratio methodology, FDIC BankFind Suite API, and a Python community-bank screening script.
Federal data · Finance and markets
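The Texas Ratio mentioned in the Call Report entry above compares a bank's problem assets with the cushion available to absorb them. A minimal sketch using the commonly cited definition (nonperforming assets over tangible equity plus loan loss reserves); the argument names are hypothetical and do not correspond to specific Call Report field codes:

```python
def texas_ratio(nonaccrual_loans, past_due_90_plus, other_real_estate_owned,
                tangible_equity, loan_loss_reserve):
    """Nonperforming assets divided by tangible equity plus loan loss reserves."""
    problem_assets = nonaccrual_loans + past_due_90_plus + other_real_estate_owned
    cushion = tangible_equity + loan_loss_reserve
    if cushion <= 0:
        raise ValueError("tangible equity plus reserves must be positive")
    return problem_assets / cushion


# Illustrative figures in $ millions, not from any actual bank.
print(texas_ratio(30, 10, 10, 80, 20))  # 0.5
```

A ratio approaching or exceeding 1.0 is conventionally read as a warning sign, since problem assets would then consume the entire cushion.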
BLS Multifactor Productivity: The Federal Dataset Behind Long-Run Economic Growth Accounting
2026-05-24
The BLS Multifactor Productivity (Total Factor Productivity) program measures output growth unexplained by measurable labor and capital inputs: the Solow residual that captures technological progress. Here is the growth accounting decomposition, the historical MFP episodes (1.5%/year golden age 1948–73, the productivity slowdown, the 1995–2004 IT revival, the post-2004 deceleration), the Hall-Jorgenson capital services methodology, the labor productivity vs. MFP distinction and its implications for real wage growth, unit labor costs as the core services inflation driver (peaked 2022, recovered 2023), the AI productivity hypothesis, FRED series IDs (OPHNFB, ULCNFB), and a Python BLS API dual-axis chart.
Federal data · Economy and demographics
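The growth accounting decomposition in the productivity entry above treats MFP growth as what is left of output growth after subtracting share-weighted input growth. A minimal sketch, assuming two inputs and constant returns to scale so that the labor share equals one minus the capital share; names and numbers are illustrative, not BLS estimates:

```python
def mfp_growth(output_growth, capital_growth, labor_growth, capital_share):
    """Solow residual: output growth minus share-weighted input growth (all in % per year)."""
    if not 0.0 <= capital_share <= 1.0:
        raise ValueError("capital_share must be between 0 and 1")
    labor_share = 1.0 - capital_share
    combined_input_growth = capital_share * capital_growth + labor_share * labor_growth
    return output_growth - combined_input_growth


# Illustrative: 3% output growth, 4% capital growth, 1% labor growth, 30% capital share.
residual = mfp_growth(3.0, 4.0, 1.0, 0.3)
print(round(residual, 2))
```

Because the residual is computed by subtraction, any mismeasurement of capital services or labor input lands in MFP, which is one reason the capital services methodology matters so much.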
Medicaid Enrollment Data: The Federal Dataset Behind 90 Million Beneficiaries and $900 Billion in Annual Spending
2026-05-24
Medicaid is the largest health coverage program in the US by beneficiary count (~90M people, ~$900B/year), administered by states under federal rules with FMAP matching. Here are the key data sources (monthly enrollment by eligibility group, T-MSIS claims data, MBES expenditure system), the ACA expansion 37-state vs. 13-holdout divide, the COVID continuous enrollment surge from 70M to 95M and the 2023–2024 unwinding that disenrolled millions, FMAP mechanics (50–77% federal match), managed care's 70% enrollment share, dual eligibles ($35K/year cost vs. $8K non-dual), long-term care payment (Medicaid covers 42% of all LTC spending), and a Python Medicaid.gov Socrata API unwinding analysis by state.
Federal data · Health and medicine
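The FMAP mechanics in the Medicaid entry above follow a formula based on state income relative to the national average. A minimal sketch of the regular statutory formula as commonly stated: one minus 0.45 times the squared ratio of state to US per-capita income, with a 50% floor and an 83% statutory ceiling. The ceiling and the 0.45 factor are not in the summary, which quotes the observed 50–77% range, and the sketch ignores enhanced rates for specific populations; the function name is hypothetical:

```python
FLOOR = 0.50
CEILING = 0.83  # statutory maximum; observed rates top out lower


def regular_fmap(state_per_capita_income, us_per_capita_income):
    """Federal share of Medicaid spending under the regular matching formula."""
    ratio = state_per_capita_income / us_per_capita_income
    raw = 1.0 - 0.45 * ratio ** 2
    return min(CEILING, max(FLOOR, raw))


# Illustrative incomes, not actual BEA figures.
print(regular_fmap(60_000, 50_000))         # 0.5 (floor binds for a high-income state)
print(round(regular_fmap(40_000, 50_000), 3))  # 0.712
```

Squaring the income ratio makes the match rate fall quickly as state income rises, which is why most high-income states sit at the 50% floor.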
DOL Wage and Hour Division: The Federal Enforcement Database Behind $300 Million in Annual Back-Wage Recoveries
2026-05-24
The DOL Wage and Hour Division enforces the FLSA, Davis-Bacon Act, Service Contract Act, FMLA, and child labor laws through ~1,000 investigators nationwide — recovering $200–300M in back wages for 200,000–300,000 workers annually. Here is the WHISARD public enforcement database schema, the FLSA exempt vs. non-exempt classification battle, worker misclassification under the 2024 economic reality rule, H-2A agricultural wage violations, Davis-Bacon prevailing wage enforcement, the Asplundh $95M settlement, FLSA criminal prosecution under 216(a), and a Python sector-level penalty analysis by NAICS code.
Federal data · Labor and workplace
BLS PPI: The Producer Price Index and the Federal Inflation Dataset That Leads CPI
2026-05-24
The BLS Producer Price Index measures the average change in selling prices received by domestic producers. It is the upstream complement to the consumer-facing CPI and leads goods inflation by 2–3 months. Here are the three indexing systems (Final Demand PPI launched 2014, Intermediate Demand stage-of-processing pipeline, traditional commodity-based), the trade services margin methodology, the PPI vs. CPI spread as a retailer margin signal, the 2021–2022 supply chain surge (+22.9% FD goods peak), FRED series IDs (PPIFIS, PPIFAF, PPIFAE, PPICOR, PPIACO), BLS API access, and a Python 4-line chart of the inflation episode by component.
Federal data · Economy and demographics
Census PL 94-171: The Redistricting Data Behind Every Congressional Map
2026-05-24
Public Law 94-171 requires the Census Bureau to deliver block-level population data to states for legislative redistricting by April 1 of the year following the decennial census — the foundational dataset for every congressional and state legislative district. Here are the five data tables (P1–P5, H1), the geographic hierarchy to census block, the one-person-one-vote case law (Reynolds v. Sims, Wesberry v. Sanders), the 2020 apportionment results (Texas +2, New York missed a seat by 89 people), differential privacy and the TopDown Algorithm controversy, the 63-combination race/ethnicity schema, Census API variable naming (P2_006N syntax), VRA Section 2 and the Gingles three-part test, and a Python Census API tract-level racial composition analysis.
Federal data · Economy and demographics · Money in politics
Treasury TIC: The Federal Dataset Behind Foreign Ownership of US Securities
2026-05-24
The Treasury International Capital system tracks foreign purchases and sales of US securities — the primary federal source on who holds US Treasuries and how capital flows across borders. Here are the four main TIC reports (monthly major holders, TIC-S/TIC-B flow surveys, SHCA annual position survey, SHLA mirror), the top foreign holders (Japan $1.1T, China $800B peak, UK $700B, Belgium/Euroclear anomaly), the custodian country problem, China's “financial nuclear option” analysis, sudden stop risk, 2008 flight-to-safety dynamics, and a Python script to download the monthly major foreign holders Excel file.
Federal data · Finance and markets
CDC WONDER: The Federal Mortality Database Behind Every Death Statistics Analysis
2026-05-24
CDC WONDER is the query interface for US death certificate data — every death in America since 1999 coded by ICD-10 underlying cause, linked to place, age, race, and demographic characteristics. Here is the death certificate pipeline, ICD-10 code taxonomy (C codes for cancers, I codes for circulatory, F codes for mental, V–Y codes for external causes), the <10 death suppression rule, age-adjusted rates using the 2000 Standard Population, the three-wave opioid crisis (prescription T40.2–T40.3 to heroin T40.1 to synthetic fentanyl T40.4, ~110K deaths in 2022), Case–Deaton “deaths of despair” research, and COVID-19 U07.1 excess mortality analysis.
Federal data · Health and medicine
BLS JOLTS: The Federal Job Openings and Labor Turnover Survey Behind Every Tight-Labor-Market Claim
2026-05-24
The BLS Job Openings and Labor Turnover Survey measures the monthly flow of workers into and out of US employment — job openings, hires, quits, and layoffs across 21,000 establishments. Here are the four core metrics, how the quit rate peaked at 3.0% in April 2022 signaling the hottest labor market in decades, the Beveridge Curve rightward shift that revealed labor market frictions, labor hoarding dynamics in 2023, how JOLTS compares to Indeed and LinkedIn alternative measures, FRED series IDs (JTSJOL, JTSHIL, JTSQUL, JTSLAL, JTSQUR), and a Python fredapi Beveridge Curve plot.
Federal data · Economy and demographics · Labor and workplace
NHTSA FARS: The Federal Traffic Fatality Census Behind Every Road Safety Analysis
2026-05-24
The NHTSA Fatality Analysis Reporting System is a complete census of every US traffic fatality since 1975 — not a sample, but a record of all 38,000–43,000 annual deaths with linked accident, vehicle, and person detail. Here is the three-table structure (accident/vehicle/person), key variable codes (HARM_EV, MAN_COLL, LGT_COND, DRUNK_DR), the COVID anomaly (miles driven −13% but fatality rate spiked 24%), the alcohol-impaired decline from 20K/year in the 1980s to 10.5K/year, the pedestrian fatality rise from 4,300 to 7,500 since 2010, the CRSS companion for non-fatal crashes, and a Python state-level pedestrian fatality rate analysis.
Federal data · Transportation safety
CMS Medicare Advantage: Plan Bids, Star Ratings, and the Federal Dataset Behind Private Medicare
2026-05-24
Medicare Advantage now covers 51% of Medicare beneficiaries (~33M people) through private insurance plans. Here is the CMS benchmark-bid-rebate payment system, the 40-measure Star Ratings framework, how HCC risk adjustment creates a $10–30B upcoding incentive, the prior authorization controversy (OIG 2022: 13% of denials met coverage criteria), enrollment concentration (UHC 29%, Humana 19%, CVS/Aetna 12%), and a Python market-share analysis by state.
Federal data · Health and medicine
IRS Statistics of Income: The Federal Dataset Behind the US Tax and Income Distribution
2026-05-24
The IRS Statistics of Income program has published aggregated tax return statistics since 1916 — the definitive federal source on income distribution, effective tax rates, deductions, and credits. Here are the individual 1040 AGI class tables, the Piketty-Saez top 1% income share data, EITC distribution, the estate tax stepped-up basis issue, corporate SOI and TCJA effective rate dynamics, and the restricted-use Public Use File for microsimulation.
Federal data · Transparency and open data · Economy and demographics
OSHA Inspections: The Federal Database Behind Every Workplace Safety Violation and Citation
2026-05-24
OSHA publishes every workplace inspection, citation, and penalty going back to 1972 — covering ~130M US workers in 10M workplaces. Here are the inspection types (unprogrammed complaint-driven vs. programmed NEP vs. fatality follow-up), the citation taxonomy (Willful $156K max through De Minimis), top-cited 29 CFR standards (fall protection chronically #1), the Imperial Sugar explosion, the Amazon injury rate controversy, the State Plan boundary, and a Python sector-level penalty analysis.
Federal data · Labor and workplace
HMDA: The Home Mortgage Disclosure Act Dataset Behind Every Redlining Investigation
2026-05-24
The Home Mortgage Disclosure Act requires most mortgage lenders to publicly disclose every application, origination, and denial — with loan amount, property location, applicant race/ethnicity, income, pricing, DTI, LTV, and AUS results. Here is the full post-2018 field schema, how CFPB and DOJ use denial-rate mapping to build redlining cases (Trustmark, Cadence, City National), the denial reason codes, HMDA Platform API, CRA examination connections, and a Python disparity-ratio analysis by county.
Federal data · Finance and markets · Economy and demographics
Census ACS: The American Community Survey and the Federal Demographic Dataset Behind Every Policy Decision
2026-05-24
The American Community Survey sends questionnaires to 3.5 million addresses per year — replacing the decennial long form with continuous annual estimates. Here is the 1-year vs. 5-year distinction, the full social/economic/housing/demographic variable taxonomy, margin of error and coefficient of variation thresholds, Census API variable naming conventions (B19013_001E syntax), key tables for income/poverty/rent/race/commute, and a Python census-tract rent burden analysis.
Federal data · Economy and demographics
BLS CPI: The Consumer Price Index and the Federal Inflation Measurement Behind Every Policy Decision
2026-05-24
The BLS Consumer Price Index has tracked the price level for urban consumers since 1913 — the primary US inflation gauge driving Social Security COLAs ($1.4T/year in indexed spending), wage negotiations, and Fed policy. Here is CPI-U vs. CPI-W vs. Chained CPI, the basket weights (shelter 35%, the OER methodology debate), CPI vs. PCE deflator gap, the 2021–2023 9.1% peak episode, FRED series IDs, BLS API access, and a Python chart tracking the inflation episode by component.
Federal data · Economy and demographics
CDC BRFSS: The World's Largest Telephone Survey and the Federal Health Behavior Database
2026-05-24
The CDC Behavioral Risk Factor Surveillance System interviews ~450,000 adults per year across all 50 states — the world's largest health survey. Here are the core module variables (obesity, smoking, diabetes, exercise, mental health), the raking weighting methodology, the PLACES MRP small-area estimation project, how the 2011 addition of cell phones created a trend discontinuity, and a Python approach to computing weighted state-level obesity prevalence from the LLCP XPT file.
Federal data · Health and medicine
FHFA House Price Index: The Federal Repeat-Sales Benchmark for US Home Prices
2026-05-24
The FHFA HPI tracks single-family home price changes using repeat-sales methodology on conforming mortgages purchased by Fannie Mae and Freddie Mac — back to 1975, with national, state, MSA, and ZIP code coverage. Here is the weighted repeat-sales methodology, the conforming loan limit boundary, expanded-data HPI with FHA additions, the 40%+ pandemic price surge, FHFA vs. Case-Shiller vs. Zillow distinctions, and a Python script for state-level YoY appreciation rankings.
Federal data · Economy and demographics
Federal Reserve H.8: The Weekly Snapshot of Every US Commercial Bank's Balance Sheet
2026-05-24
The Federal Reserve publishes the H.8 every Friday — a weekly aggregate balance sheet for all US commercial banks covering $23T+ in assets: C&I loans, real estate loans, securities (HTM vs. AFS), reserve balances, and deposit flows. Here is the large vs. small bank breakdown, how the SVB collapse showed as a $98B single-week deposit outflow, H.8 vs. Call Report distinctions, FRED series IDs, and a Python snippet tracking credit cycle signals.
Federal data · Finance and markets
Census County Business Patterns: Annual Establishment Counts, Employment, and Payroll for Every US County
2026-05-24
County Business Patterns is the Census Bureau's annual series on US business activity at the county–NAICS level, published since 1964 — establishment counts by size class, mid-March employment, and first-quarter payroll for every county. Here is the Business Register source, noise infusion disclosure methodology, the Nonemployer Statistics companion series, CBP vs. QCEW vs. Economic Census distinctions, Business Dynamics Statistics, Census API access, and how to compute manufacturing location quotients by county.
Federal data · Economy and demographics
NAEP: The Nation's Report Card and the Federal Dataset Behind US Education Achievement
2026-05-24
The National Assessment of Educational Progress is the only nationally representative, continuing assessment of US student achievement — covering reading, math, science, and more for 4th, 8th, and 12th graders. Here is the 0–500 scale and NAGB achievement levels, the COVID-era learning loss evidence (largest reading decline in 30 years), state comparison methodology, the plausible values estimation approach, NAEP Data Explorer API access, and the White–Black achievement gap trend since 1992.
Federal data · Research and education
BLS OEWS: The Occupational Employment and Wage Statistics Behind Every Salary Benchmark
2026-05-24
The BLS Occupational Employment and Wage Statistics program covers 800+ occupations across every industry and geography — the most comprehensive source for occupation-level wage percentiles in the US. Here is the survey methodology, full SOC hierarchy, wage percentile fields (10th through 90th), the H-1B prevailing wage Level I–IV connection, OEWS vs. CPS vs. QCEW distinctions, and a Python script for ranking the highest-paid tech occupations.
Federal data · Economy and demographics
USPTO Patent Data: The Federal Database Behind Every US Patent Grant and Application
2026-05-24
The USPTO publishes bulk patent grant data (4M+ grants since 1976) and applications (since 2001), with PatentsView as the canonical research dataset — disambiguated inventor and assignee records, CPC classification codes, citation networks, and prosecution history via PEDS. Here are the three patent types, continuation and evergreening strategy, the Alice Corp and IPR quality controversies, the PatentsView API, BigQuery public data, and a Python snippet for ranking top AI patent holders by CPC subclass.
Federal data · Research and education
BEA GDP and National Accounts: The Federal Dataset That Measures the US Economy
2026-05-24
The BEA National Income and Product Accounts are the official measure of US economic output, income, and spending — with each quarter's figures released in three successive estimates (advance, second, and third). Here is the C+I+G+(X-M) expenditure identity, every GDP component in depth, real vs. nominal GDP, GDP by State and GDP by Industry breakdowns, the BEA API query structure, and FRED series IDs as the easiest access path.
Federal data · Economy and demographics
FDA Drug Approvals: The NDA, BLA, and ANDA Database Behind Every Drug on the Market
2026-05-24
The FDA CDER Drugs@FDA dataset tracks every drug approval action since 1939 — NDAs for brand drugs, BLAs for biologics, ANDAs for generics. Here are the Orange Book TE codes and patent/exclusivity listings, NCE/3-year/pediatric/orphan/biologic exclusivity mechanics, Breakthrough and Accelerated Approval designations, the Aduhelm controversy, and how to query the OpenFDA drug API.
Federal data · Health and medicine
Medicare Part B Data: Every Procedure Billed to Medicare and What It Paid
2026-05-24
The CMS Medicare Part B Physician and Supplier Public Use File covers 1M+ providers, 12,000+ HCPCS procedure codes, and $400B+ in annual submitted charges. Here is the submitted vs. allowed vs. payment markup ratio, standardized payments removing geographic wage index, the Lucentis/Avastin ASP+6% controversy, the Salomon Melgen $21M ophthalmology fraud, and how to filter anti-VEGF injections to expose the billion-dollar pricing disparity.
Federal data · Health and medicine
BLS QCEW: The County-Level Employment and Wages Dataset Behind Every Local Economic Analysis
2026-05-24
The BLS Quarterly Census of Employment and Wages covers 97%+ of US jobs at the county–NAICS industry level — the most granular federal employment dataset available. Here are the QCEW vs. CES vs. LAUS distinctions, the suppression rules for counties with fewer than three establishments, average weekly wage by sector, the BLS bulk CSV download structure, and a Python snippet for the highest-wage industries by county.
Federal data · Economy and demographics
FBI NICS Background Checks: The Federal Dataset Behind 400 Million Firearm Transfer Attempts
2026-05-24
The Brady Act NICS system has processed 400M+ background checks since 1998 — publishing monthly state-level counts of handgun, long gun, and permit check types. Here is the full check type taxonomy, why NICS counts don't equal gun sales, the default proceed loophole that enabled the Charleston shooting, the COVID-2020 and Biden-2021 demand spikes, and how to use the BuzzFeed News parsed CSV.
Federal data · Justice and immigration
HUD LIHTC Database: Mapping 35 Years of Low-Income Housing Tax Credit Projects
2026-05-24
The Low-Income Housing Tax Credit has financed 50,000+ projects and 3.5M+ affordable units since 1986 — the largest US affordable housing subsidy. Here is the HUD LIHTC database schema, the 9% vs. 4% credit mechanics, how State HFA Qualified Allocation Plans shape development geography, the National Housing Preservation Database complement, and how to compute units per capita by state.
Federal data · Economy and demographics
CFTC Commitments of Traders: The Weekly Federal Report Behind Futures Market Positioning
2026-05-24
The CFTC publishes weekly open interest broken down by trader category — Commercial hedgers, Managed Money (hedge funds), and Swap Dealers — for every regulated futures market since 1986. Here are the four COT report formats, how net non-commercial positioning signals crowded trades, the disaggregated vs. legacy format distinction, all covered markets, and how to build a 52-week COT z-score.
Federal data · Finance and markets
OFAC Sanctions Lists: The Treasury Database Every Financial Institution Must Screen Against
2026-05-24
The OFAC SDN list (~8,000 entries) and Consolidated Sanctions List cover every individual, entity, and vessel that US persons are prohibited from transacting with — with civil penalties up to $1.3M per violation. Here is the full SDN record schema, all major sanctions programs, the 50% ownership rule, the Binance $4.3B landmark penalty, and how to parse and screen the XML list.
Federal data · Sanctions and illicit finance
EPA Toxic Release Inventory: 35 Years of Industrial Chemical Releases and Environmental Justice Patterns
2026-05-24
EPCRA Section 313 requires 20,000+ industrial facilities to report annual releases of 800+ toxic chemicals — air, water, land, and off-site transfers. Here is the full TRI field schema, the 75% release decline since 1988, the 2024 PFAS additions, how to use the RSEI model for toxicity-weighted population exposure, and how to join TRI to Census ACS for environmental justice analysis.
Federal data · Environment and energy
CMS Hospital Quality Data: Outcomes, Readmissions, and Star Ratings for 6,000 US Hospitals
2026-05-24
CMS Care Compare publishes quality measures for every Medicare-certified hospital — 30-day mortality and readmission rates, HCAHPS patient experience scores, process compliance, and Medicare spending per beneficiary. Here is the full measure taxonomy, how risk adjustment works, the HAC Reduction Program penalties, Value-Based Purchasing incentives, and how to download and analyze the data.
Federal data · Health and medicine
SEC EDGAR XBRL Financials: Machine-Readable Fundamentals for Every Public Company
2026-05-24
Since 2009, every public company files XBRL-tagged financial statements with the SEC — extractable through the EDGAR Company Facts API, the Frames endpoint for cross-sectional screening, and bulk quarterly FSN downloads. Here is the US-GAAP taxonomy structure, the three data quality pitfalls (extension elements, restated periods, unit inconsistencies), rate limits, and how to build a revenue growth screener.
Federal data · Finance and markets
Corporate Prosecution Registry: DPAs, NPAs, and the Too-Big-to-Jail Database
2026-05-24
The Corporate Prosecution Registry (Duke Law) tracks every federal corporate criminal resolution since 1990 — deferred prosecution agreements, non-prosecution agreements, and guilty pleas — covering 400+ resolutions and $30B+ in fines. Here is the DPA/NPA/guilty plea taxonomy, the Yates Memo and Monaco Doctrine evolution, the HSBC and Boeing landmark cases, the compliance monitor system, and FCPA as the dominant enforcement category.
Federal data · Justice and immigration
PCAOB: The Federal Audit Watchdog Created After Enron and the KPMG Inspection-Data Scandal
2026-05-24
The PCAOB registers, inspects, and disciplines auditors of public companies — publishing inspection reports on every registered firm's deficiency rate. Here is the Big Four inspection pattern, the KPMG $50M scandal for receiving stolen inspection lists, the HFCAA Chinese auditor crisis and 2022 CSRC breakthrough, and how researchers use deficiency rates as an auditor quality proxy.
Federal data · Finance and markets
Medicare Part D Prescribing Data: Every Drug Prescribed by Every Medicare Provider
2026-05-24
CMS publishes provider-level Medicare Part D prescribing data showing every drug prescribed by every provider with 10+ claims — 1M+ providers, 5,700+ drugs, $100B+ in visible prescription spending per year. Here is the full schema, how Part D data exposed the opioid crisis (ProPublica Prescriber Checkup), the GLP-1 agonist cost surge, and how to join it with CMS Open Payments to detect prescribing-payment correlations.
Federal data · Health and medicine
ATF Crime Gun Trace Data: The Federal Dataset the Tiahrt Amendment Tried to Hide
2026-05-24
The ATF National Tracing Center processes 500,000+ firearm traces per year — reconstructing the chain of commerce from manufacturer to crime scene. Here is what the Tiahrt Amendment restricts, what aggregated state-level trace data still reveals about the iron pipeline, how time-to-crime exposes straw purchasing, the FFL directory, AFMER manufacturing data, and the ghost gun tracing gap.
Federal data · Justice and immigration
CPSC Product Recalls: The Federal Safety Database Behind 400 Consumer Product Recalls Per Year
2026-05-24
The Consumer Product Safety Commission publishes every recall of consumer products — 400–500 per year covering toys, furniture, appliances, nursery products, and 15,000+ product categories not regulated by FDA or NHTSA. Here is the full recall database schema, the SaferProducts.gov incident report system, the IKEA Malm tip-over and Fisher-Price Rock 'n Play landmark cases, and how the CPSIA 2008 transformed product safety data.
Federal data · Consumer protection
FDA Medical Device Recalls: The Database Behind Every Implant Failure and CPAP Warning
2026-05-24
The FDA CDRH publishes every medical device recall action — Class I (serious health risk), Class II, and Class III — covering 1,000–1,500 recalls per year since 1999. Here is the full field schema, the three recall classes, the DePuy ASR ($4B settlement) and Philips Respironics CPAP (5.5M+ units) landmark recalls, how MAUDE adverse event reports feed recall decisions, and how to query the OpenFDA device recall API.
Federal data · Health and medicine · Consumer protection
NCUA Credit Union Data: The 5300 Call Report and Enforcement Database for 4,700 Federally Insured Credit Unions
2026-05-24
The NCUA publishes quarterly 5300 Call Report data for every federally insured credit union — assets, shares, loans, delinquency, net worth ratios — plus a public enforcement action database covering Consent Orders through Conservatorships. Here is the data structure, the net worth PCA thresholds, the 2009 corporate credit union crisis ($28.5B bailout), and how to download and screen the quarterly data.
Federal data · Finance and markets
CFPB Enforcement Actions: The Public Record of $20 Billion in Consumer Finance Penalties
2026-05-24
The CFPB has brought 200+ enforcement actions since 2011 — covering UDAAP violations, redlining, student loan servicer abuses, and predatory auto lending — with $20B+ in consumer relief and penalties. Here is the enforcement action taxonomy, the UDAAP abusiveness standard, the Wells Fargo $3.7B action, how enforcement trends shift across administrations, and how to scrape and analyze the enforcement database.
Federal data · Finance and markets
BTS Border Crossing Entry Data: Monthly Counts of Every Vehicle, Truck, and Pedestrian at US Land Ports
2026-05-24
The Bureau of Transportation Statistics publishes monthly counts of every border crossing type at ~290 US land ports going back to 1996 — personal vehicles, pedestrians, trucks, buses, trains, and containers broken out by crossing type and port. Here is the full taxonomy, the COVID-19 collapse (pedestrians −93%, trucks −28%), the San Ysidro and Laredo dominance, and how to use the Socrata API for supply chain and trade flow analysis.
Federal data · Transportation safety
ClinicalTrials.gov Data: The Federal Registry Behind Every Drug and Device Trial
2026-05-24
FDAAA 801 requires registration of all applicable clinical trials before enrollment and results submission within 12 months of completion — but 50%+ of trials still fail to report results. Here is the full NCT schema, how to access the AACT PostgreSQL mirror from Duke/CTTI, how to detect publication bias using the results reporting gap, and how the GLP-1 agonist trial explosion looks in the data.
Federal data · Health and medicine · Research and education
FDA 510(k) Device Clearances: The Substantial Equivalence Pathway That Cleared 100,000+ Medical Devices
2026-05-24
The FDA 510(k) pathway clears medical devices by showing substantial equivalence to a predicate device — no clinical trials required. Here is the three-class device system, the K-number database fields, the predicate daisy-chain problem that lets cleared devices drift from the original, the De Novo pathway for novel low-risk devices, the metal-on-metal hip and vaginal mesh controversies, and how to query the OpenFDA device API.
Federal data · Health and medicine
DOL H-2 Visa Disclosures: Mapping the Guest Worker Programs Feeding US Agriculture and Hospitality
2026-05-24
The H-2A program (cap-free agricultural) and H-2B program (66,000-cap non-agricultural) bring hundreds of thousands of temporary workers to the US annually. DOL OFLC publishes quarterly disclosure files with employer, job title, wages, worksites, and worker counts. Here is the data structure, how H-2A grew from 60,000 to 370,000+ certifications between 2012 and 2023, and how to compare offered wages against adverse effect wage rates.
Federal data · Labor and workplace · Justice and immigration
OCC Bank Enforcement Actions: Reading the Federal Regulator’s Public Disciplinary Record
2026-05-24
The Office of the Comptroller of the Currency publishes every formal enforcement action against national banks and federal thrifts — from Commitment Letters through Formal Agreements, Consent Orders, and Cease-and-Desist Orders. Here is the enforcement action taxonomy, the BSA/AML enforcement pattern, the Wells Fargo consent order cascade, and how to scrape and analyze the OCC enforcement database.
Federal data · Finance and markets
FERC Enforcement: The Federal Watchdog Over Energy Market Manipulation
2026-05-24
FERC investigates electricity and gas market manipulation with penalties up to $1.4M per day per violation. Here is the enforcement database, the JP Morgan ($410M) and Barclays ($488M) market manipulation cases, how Electric Quarterly Reports expose every bilateral power transaction, and how to search FERC eLibrary enforcement dockets.
Federal data · Environment and energy
SEC Enforcement Actions: The Public Record of Every Securities Law Violation
2026-05-24
The SEC publishes Administrative Proceedings, Litigation Releases, and final orders covering 700–800 enforcement actions per year — with $4–5B in annual disgorgement and penalties. Here is the enforcement record structure, the whistleblower program mechanics, how to scrape and parse the enforcement databases, and how to track administration-level enforcement priority shifts.
Federal data · Finance and markets
HHS OIG Exclusions: The Federal Healthcare Fraud Blacklist That Every Provider Must Screen Against
2026-05-24
The HHS OIG List of Excluded Individuals/Entities (LEIE) bars providers from billing Medicare and Medicaid — with $10,000 per-service penalties for employers that fail to screen. Here is the exclusion type taxonomy, how to download the monthly LEIE CSV, how it differs from SAM.gov EPLS, and how to implement fuzzy-match screening against a provider roster.
Federal data · Health and medicine
SEC Form 8-K: The Real-Time Disclosure Feed for Every Material Corporate Event
2026-05-24
Public companies must file Form 8-K within 4 business days of any material event — covering 33 item types from earnings releases and executive departures to bankruptcy filings and the new 2023 cybersecurity incident disclosure requirement. Here is the item taxonomy, how to filter EDGAR for specific event types, and how Item 4.02 non-reliance filings signal fraud.
Federal data · Finance and markets
NHTSA Vehicle Recall Data: 70 Years of Safety Defects Across 900 Million Vehicles
2026-05-24
NHTSA maintains the recall database covering every safety-related defect since 1966 — 900M+ vehicles affected, with the Takata airbag inflator recall (70M vehicles, 28+ deaths from metal shrapnel) as the largest in US history. Here is the data structure, the NHTSA complaint-to-recall investigation pipeline, and how to query by VIN.
Federal data · Transportation safety
DOL Form 5500: The Annual Filing That Exposes Every Private Pension and 401(k) Plan
2026-05-24
Every large ERISA plan files Form 5500 annually — covering 750,000+ plans with $10T+ in assets. Schedule C reveals service provider fees that drive 401(k) litigation; Schedule SB tracks pension funding ratios that determine minimum required contributions. Here is the schema, EFAST2 access, and how to compute average expense ratios by plan size.
Federal data · Labor and workplace
Senate LDA Lobbying Disclosures: Mapping $4 Billion in Annual Influence Spending
2026-05-24
The Lobbying Disclosure Act requires quarterly filings with the Senate SOPR — covering lobbyist identities, issue codes, specific bills lobbied, and dollar amounts for every registered lobbying engagement. Here is the LDA API, the relationship to FARA and LD-203 contribution reports, and how to connect lobbying spending to legislative outcomes.
Federal data · Money in politics · Transparency and open data
EIA Electricity Data: The Federal Dataset Behind Every Kilowatt-Hour Generated, Sold, and Priced
2026-05-24
The EIA publishes Form 923 (monthly plant-level generation and fuel use), Form 861 (annual utility retail sales and pricing), Form 860 (every generator's nameplate capacity and status), and EIA-930 (hourly real-time grid data by Balancing Authority). Here is the fuel mix transformation from 2000–2023 (coal 52% to 16%, gas 17% to 43%, wind/solar near zero to 16%), the ERCOT Texas grid isolation and Winter Storm Uri generation collapse, EIA API v2 structure, and a Python stacked-area chart of the energy transition.
Federal data · Environment and energy
FINRA BrokerCheck: The Public Database of Every Registered Broker and Investment Adviser
2026-05-24
FINRA BrokerCheck publishes registration history, licenses, employment records, and disclosure events (customer complaints, regulatory actions, criminal disclosures, bankruptcies) for every registered broker and firm. Here is the data structure, the recidivist broker problem, how to access the BrokerCheck API, and how attorneys use it to vet advisers.
Federal data · Finance and markets
SEC Form 4: The Insider Trading Disclosure Behind Every Officer and Director Stock Transaction
2026-05-24
Section 16(a) requires officers, directors, and 10%+ shareholders to file Form 4 within 2 business days of any stock transaction — creating a near-real-time public record on EDGAR since 2004. Here is the full transaction code taxonomy (code P open-market purchases as the only discretionary signal), the 10b5-1 plan gaming problem and the 2022 SEC amendments, cluster-buying methodology, academic evidence on 6%+ abnormal returns, and a Python screen for officer open-market purchases.
Federal data · Finance and markets
FDA FAERS: The Adverse Drug Event Database Behind Post-Market Drug Safety
2026-05-24
The FDA Adverse Event Reporting System contains 7 linked quarterly files tracking drug adverse events reported by manufacturers, providers, and consumers — with MedDRA reaction coding, outcome classification, and therapy dates. Here is the schema, how disproportionality analysis (PRR/ROR) detects safety signals, and the Avandia/Vioxx/SSRI signal cases.
Federal data · Health and medicine
College Scorecard: The Federal Dataset That Exposes Graduation Rates, Debt, and Earnings for Every US College
2026-05-24
The College Scorecard links IPEDS enrollment data to federal loan records and IRS earnings data — publishing median earnings, debt, repayment rates, and completion rates for every institution and field of study. Here is the data structure, how to use the API, and what the earnings-debt gap reveals about for-profit colleges and high-debt programs.
Federal data · Research and education
CISA KEV Catalog: The Federal Government's Definitive List of Actively Exploited Vulnerabilities
2026-05-24
The CISA Known Exploited Vulnerabilities catalog lists CVEs confirmed as actively exploited in the wild — with mandatory federal patching deadlines under BOD 22-01. Here is the catalog structure, how CISA decides what gets listed, how it differs from CVSS severity scoring, and how security teams use it as a minimal-patch prioritization framework.
Federal data · Cybersecurity and privacy
USCIS H-1B Visa Data: Mapping the 600,000-Worker Skilled Immigration Pipeline
2026-05-24
The DOL Labor Condition Application dataset and USCIS H-1B Employer Data Hub together reveal the true shape of the skilled-worker visa program: IT staffing companies dominate approvals, India-born workers hold 70%+ of visas, and prevailing wage Level I filings expose systematic wage suppression. Here is the data structure and how to compute employer-level wage ratios.
Federal data · Justice and immigration · Labor and workplace
DOJ False Claims Act Settlements: The $70 Billion Fraud Recovery Database
2026-05-24
The False Claims Act is the government's primary anti-fraud tool, with qui tam whistleblowers driving 80%+ of the $2B+ in annual recoveries. Healthcare fraud dominates — Medicare and Medicaid upcoding, kickbacks, and unnecessary procedures. Here is how to access the DOJ settlement database, scrape press releases, and identify repeat violators.
Federal data · Justice and immigration · Health and medicine
Foreign agents in plain sight: mapping DC's hidden influence network with FARA data
2026-05-24
The DOJ buries the FARA bulk download inside an Oracle APEX URL that looks broken. Behind it: daily CSV exports of every DC firm registered to lobby for a foreign government — who they represent, what they're paid, and what activities they conduct. Here is how to use it.
Federal data · Money in politics
Repetitive loss: what FEMA's flood insurance claims data reveals about 2.7 million paid claims
2026-05-23
FEMA's NFIP claims dataset covers 2.7 million paid flood insurance claims. The "multiple loss properties" subset shows properties paid out more than their assessed value — some 10–15 times. FEMA redacted addresses after journalists used the data to identify specific owners. Here is what's left and what it shows.
Federal data · Environment and energy
Compliance screening across 30+ federal enforcement lists: how the risk score works
2026-05-22
How we built a 0–100 compliance risk score across OFAC, SAM, OIG, CFPB, SEC, DOJ, FDIC, FINRA, CFTC, EPA, MSHA, FDA warning letters, PCAOB, UFLPA, and 15+ more lists in a single API call.
Federal data · Sanctions and illicit finance · Engineering and infrastructure
Entity resolution for multi-list compliance screening: reducing false positives without sacrificing recall
2026-05-16
How the Federal Regulatory Data Hub resolves entity identity across 30+ compliance lists: three-stage pipeline (identifier join 34%, FTS5 canonical name 41%, Jaro-Winkler fuzzy 18%), false positive taxonomy (same-name different entity 47%, subsidiary-parent 28%, historical name 16%, transliteration 9%), EntityResolutionResult confidence-to-action mapping (MATCH ≥0.90, PROBABLE_MATCH 0.72–0.90), 99.1% recall, 98.7% precision at ≥0.90, and weekly analyst-feedback calibration loop.
Federal data · Machine learning and OSINT · Engineering and infrastructure
Name matching in federal regulatory data: aliases, subsidiaries, and sanctions evasion across 208 datasets
2026-05-10
How the Federal Regulatory Data Hub resolves entity names across 208 federal datasets when identifiers disagree — OFAC alias explosion (44K aliases from 12K entries), SEC EDGAR subsidiary mapping, three-pass fuzzy matching (exact → Jaro-Winkler → TF-IDF cosine), 1.4% combined false positive rate, and how entity_confidence weights the compliance risk score.
Federal data · Engineering and infrastructure
Canonical entity IDs in the Federal Regulatory Data Hub: stable identifiers across 208 federal datasets
2026-05-05
How the Federal Regulatory Data Hub generates and maintains stable canonical IDs for entities across 208 federal datasets — deterministic SHA-256 ID generation, EntityVersion history for merge and split events, EntityAlias tracking for historical name variants, and subscriber continuity guarantees when source identifiers change.
Federal data · Engineering and infrastructure
Building the cross-agency regulatory entity graph: 50M+ records, one join
2026-05-01
How we built an entity bridge across 208 federal datasets so a single query returns every SEC filing, FDA warning letter, EPA enforcement case, and OFAC sanction for any company.
Federal data · Engineering and infrastructure
Entity subscriptions in the Federal Regulatory Data Hub: per-entity change monitoring across 30+ enforcement lists
2026-04-26
How the Federal Regulatory Data Hub lets compliance teams subscribe to regulatory events for specific entities — using the cross-agency entity bridge to watch OFAC, SAM, SEC, EPA, DOJ, and 25+ other lists simultaneously.
Federal data · Engineering and infrastructure
Federal Regulatory Data Hub change alerts: near-real-time OFAC sanctions, SAM debarments, and enforcement action webhooks
2026-04-21
How the Federal Regulatory Data Hub detects regulatory record changes and delivers them to subscribers: 10-minute OFAC sanctions window, 30-minute SAM debarment window, EDGAR 8-K filing webhooks, HMAC-signed Cloudflare Queue delivery with at-least-once semantics, per-entity and per-list subscription filters, and idempotency_key deduplication.
Federal data · Engineering and infrastructure
Swarm SDK v0.4: situational awareness, electronic warfare coordination, and adversarial resilience
2026-04-14
What shipped in Swarm SDK v0.4: the Situational Awareness API for shared position and sensor fusion, the EW Coordination protocol for spectrum interference, Adversarial Resilience features including traffic morphing and store-and-forward, and the RF Fingerprinting subsystem for passive emitter tracking. 463 total tests.
Drones and cryptography
Swarm situational awareness: signed position broadcasts, sensor fusion, and dead-reckoning in embedded Rust
2026-04-10
How the swarm coordination layer maintains a shared operational picture across 128 nodes without a central server: Ed25519-signed 124-byte position broadcast frames, an Extended Kalman Filter fusing GPS/IMU/barometric altitude into a 6-DOF state estimate, dead-reckoning fallback with quadratic uncertainty growth for up to 90 seconds without GPS, and a probabilistic gossip protocol achieving 94.2% frame delivery across a 2km × 2km field deployment.
Drones and cryptography
Swarm SDK on bare metal: porting the cryptographic core to no_std Rust on STM32H7
2026-04-06
How we ported the Swarm SDK cryptographic core to no_std Rust targeting the STM32H7 Cortex-M7: feature-gated std/embedded builds, 96KB static heap with alloc-cortex-m, pre-allocated VecDeque deduplication ring, in-place AES-GCM to avoid heap allocation, hardware AES accelerator integration (0.14ms vs. 0.61ms software), and binary size optimization from 1.2MB to 284KB with opt-level="z" and LTO.
Drones and cryptography · Engineering and infrastructure
Swarm SDK key rotation: automated cryptographic material refresh in field-deployed drone meshes
2026-04-01
How the Swarm SDK rotates cryptographic material without grounding the fleet: scheduled signed pre-key rotation on a 7-day timer, one-time prekey (OTP) replenishment when the bundle drops below 20 keys, emergency revocation via gossip-flooded KeyRevocationAnnouncement, BKPSRAM zeroization with 0xFF pattern verification, and staggered rotation coordination across the mesh.
Drones and cryptography · Cybersecurity and privacy
Swarm SDK key management: device provisioning, certificate rotation, and revocation for autonomous drone systems
2026-03-28
How the Swarm SDK manages cryptographic identity for drone fleets: on-device ML-KEM-768 + X25519 keypair generation at provisioning, three-tier fleet CA hierarchy (Root → Fleet CA → device certificate), pre-provisioned mission cert bundles for offline authentication, signed prekey rotation every 7 days over the gossip mesh, in-flight device revocation via poison-pill RevocationMessage, and emergency wipe on tamper detection.
Drones and cryptography
Swarm SDK device enrollment: how a new drone joins an authenticated fleet mesh
2026-03-21
How a Swarm SDK drone goes from factory state to trusted mesh participant: factory-provisioned ML-KEM-768 + X25519 keypairs, CSR generation and Fleet CA signing, USB and RF enrollment paths, gossip mesh announcement with SignedPreKeyBundle, pioneer bootstrap for the first device, and re-enrollment at certificate expiry.
Drones and cryptography
Post-quantum mesh cryptography for drone swarms: the Swarm SDK design
2026-03-15
How we designed the Swarm SDK: ML-KEM-768 + X25519 hybrid post-quantum key exchange, Double Ratchet forward secrecy, gossip mesh routing with bounded fanout, and the path to CNSA 2.0 compliance.
Drones and cryptography
Swarm SDK operational security: traffic analysis resistance, message size normalization, and timing jitter
2026-03-10
How the Swarm SDK protects drone mesh communications against traffic analysis — six fixed message size bins, ±15% transmission timing jitter, store-and-forward ring buffer for burst smoothing, degraded-channel operational mode, and RF fingerprint resistance on STM32H7.
Drones and cryptography · Cybersecurity and privacy
Swarm SDK MAVLink v2 integration: encrypting mesh messages inside 253-byte drone protocol frames
2026-03-05
How the Swarm SDK wraps post-quantum encrypted mesh traffic in MAVLink v2 SWARM_MESH_FRAME messages — 18-byte fragment header design, per-message reassembly buffer with 5-second TTL, PX4 and ArduPilot integration, MAVSDK passthrough, and why ML-KEM-768 Sealed Sender envelopes always require 6 frames.
Drones and cryptography
Swarm SDK message framing: binary wire format, fragmentation, and MAVLink packing
2026-02-27
How the Swarm SDK serializes, fragments, and packs Double Ratchet encrypted messages into MAVLink v2 TUNNEL frames: the SwarmFrame binary header, 237-byte payload limit, fragmentation algorithm, reassembly state machine, CONTROL frame authentication, and STM32H7 performance.
Drones and cryptography
The Swarm SDK double ratchet: forward secrecy and post-compromise security in drone mesh networks
2026-02-22
How the Swarm SDK implements the Double Ratchet algorithm for drone-to-drone messaging: adapting Signal Protocol's KDF chains for ML-KEM-768 post-quantum initial key exchange, header encryption, out-of-order message handling with a sliding key cache, MAVLink v2 framing, and performance benchmarks on embedded ARM.
Drones and cryptography
Swarm SDK Sealed Sender: hiding the sender identity without breaking end-to-end encryption
2026-02-16
How the Swarm SDK implements Sealed Sender to hide drone identity from relay infrastructure: recipient-issued SenderCertificate, ephemeral X25519 + HKDF-SHA256 per-message encryption into SealedSenderEnvelope, AES-256-GCM with zero relay-visible sender field, 48-hour certificate TTL, four decryption failure modes (DecryptionError, CertificateExpired, CertificateSignatureInvalid, SenderKeyMismatch), and integration with Sender Keys for group mesh communications.
Drones and cryptography
Swarm SDK v0.3: Sender Keys, Sealed Sender, and Deniable Authentication for Drone Mesh Networks
2026-02-10
What shipped in Swarm SDK v0.3: O(1) group encryption with Sender Keys (0.7ms on STM32H7), Sealed Sender hiding drone identity via ML-KEM-768 encapsulation, deniable HMAC authentication, and PKCS7 padding normalization across all AES-GCM operations. 127 new tests (302 total).
Drones and cryptography
The Federal Regulatory Data Hub MCP server: 38+ tools for AI agent workflows
2026-02-05
How the Federal Regulatory Data Hub exposes its data through an MCP server with 38+ tools for Claude, GPT, and other AI agents — screen_entity, get_entity, compliance reporting tools, HMAC-signed webhook configuration, rate-limit tiers by plan, and Claude Desktop integration via stdio transport.
Federal data · Engineering and infrastructure
The Federal Regulatory API: REST, MCP, and JSON-LD for 208 federal datasets
2026-02-01
How the Federal Regulatory Data Hub API is designed: no-auth CC0 REST endpoints, cross-agency entity resolution in a single GET, an MCP server with 38+ tools for Claude and GPT agent workflows, and JSON-LD structured data for search indexing.
Federal data · Engineering and infrastructure
Swarm SDK session establishment: X3DH prekey bundles and the initial drone-to-drone handshake
2026-01-25
How the Swarm SDK uses Extended Triple Diffie-Hellman (X3DH) with ML-KEM-768 adaptation for async drone-to-drone session establishment — prekey bundle construction, one-time prekey consumption, Fleet CA bundle verification, and the transition from shared secret to Double Ratchet forward secrecy.
Drones and cryptography
Swarm SDK prekey bundle management: generating, distributing, and consuming OneTimePreKeys across a drone fleet
2026-01-20
How the Swarm SDK generates, distributes, and tracks OneTimePreKeys for X3DH session establishment — including OTP exhaustion handling, SignedPreKey rotation, and the gossip-mesh key bundle protocol.
Drones and cryptography
Federal dataset ingest: keeping 208 federal datasets fresh at the edge
2026-01-15
How we ingest and refresh 208 federal regulatory datasets across 45 agencies using Cloudflare Workers cron, delta detection, schema drift handling, and per-source retry budgets — the ETL behind the Federal Regulatory Data Hub.
Federal data · Engineering and infrastructure
Swarm SDK mesh transport: reliable delivery over contested RF links
2026-01-08
How the Swarm SDK MeshTransport layer achieves reliable frame delivery over lossy drone radio links: sliding window ARQ with selective ACK, EWMA RTT estimation, transparent fragmentation and reassembly for Sealed Sender envelopes, multi-channel bonding across 2.4GHz and 5.8GHz radios, and performance benchmarks on STM32H7 and Jetson Nano.
Drones and cryptography · Engineering and infrastructure
Swarm SDK gossip mesh: bounded fanout routing, message deduplication, and network partition handling
2026-01-02
How the Swarm SDK implements a gossip mesh for drone swarms: epidemic broadcast with k=3 fanout, UUIDv4 sliding-window deduplication across a 1000-ID VecDeque, Lamport clock causal ordering for key management messages, TTL hop limiting with 3-hop lossy-channel headroom, and anti-entropy reconciliation for post-partition recovery — with STM32H7 and Jetson Nano benchmarks.
Drones and cryptography · Engineering and infrastructure

2025 (72)

Swarm SDK architecture: gossip mesh, post-quantum cryptography, and embedded-first design
2025-12-27
An architectural overview of the Swarm SDK: the three-layer design covering gossip mesh epidemic broadcast, ML-KEM-768 + X25519 hybrid post-quantum cryptography with Double Ratchet and Sender Keys, MAVLink v2 framing, and no_std embedded operation on STM32H7.
Drones and cryptography
Incident clustering and deduplication: how Voidly avoids counting the same censorship event twice
2025-12-22
How Voidly deduplicates thousands of probe measurements into discrete censorship incidents: the four-tuple clustering key, the 6-hour gap rule, incident lifecycle from ANOMALY to RESOLVED, incident_id assignment, retroactive CensoredPlanet alignment, and edge cases including flapping blocks and BGP outages.
Censorship and information control · Engineering and infrastructure
Voidly incident timeline reconstruction: building the canonical event sequence from distributed probe measurements
2025-12-17
How Voidly reconstructs the authoritative timeline of a censorship incident from asynchronous distributed probe measurements — IncidentEvent sourcing model, temporal alignment across time zones, confidence weighting requiring 3+ independent probes, retroactive revision from CensoredPlanet batch data, duration statistics, and the timeline REST API endpoint.
Censorship and information control · Engineering and infrastructure
Voidly incident resolution: how we know when a censorship event ends
2025-12-13
How Voidly determines that a censorship incident has ended: per-type resolution thresholds (consecutive passing measurements with p_blocked < 0.3), the 12-hour RESOLVED_PENDING re-open window, FLAPPING state detection for rapidly alternating blocks, BGP-type auto-resolution, and cross-source confirmation requirements for VERIFIED incidents — with observed resolution time distributions (BGP 4.2h median, HTTP 12.1 days).
Censorship and information control · Engineering and infrastructure
Voidly real-time anomaly scorer: ML inference in the streaming pipeline at 50,000 events per second
2025-12-09
How Voidly embeds ONNX Runtime inside an Apache Flink streaming job to score probe results for censorship anomalies at 50,000 events/sec with sub-100ms end-to-end latency: thread-local ONNX session management per task slot, Kafka partition alignment with (country_code, asn) keyBy, mini-batch coalescing for 50ms p99 inference, and the backpressure mechanism that keeps consumer lag under 2,400 messages even on election-day traffic spikes.
Censorship and information control · Engineering and infrastructure · Machine learning and OSINT
Voidly's real-time event pipeline: from measurement anomaly to journalist alert in under 8 minutes
2025-12-05
How Voidly gets from a probe anomaly to a published verified incident — and an alert in a journalist's inbox — in under 8 minutes: the event queue, real-time OONI and IODA API polling, confidence threshold crossing, the two-window alert-fatigue guard, and the nightly CensoredPlanet retroactive pass.
Censorship and information control · Engineering and infrastructure
Voidly probe run lifecycle: from scheduled task to classifier input
2025-11-29
What happens inside a single Voidly probe run: the measurement execution loop, DNS and TCP and TLS and HTTP data capture, result serialization and signing, and the upload path that delivers a signed ProbeResult to the ingest pipeline.
Censorship and information control · Engineering and infrastructure
Voidly probe networking: staying connected through NAT, firewalls, and censored infrastructure
2025-11-24
How Voidly probes maintain connectivity and upload measurements from networks that actively block VPN protocols: QUIC/443 transport, CDN-based domain fronting via SNI, TLS certificate pinning against MITM, local SQLite buffering (500 MB cap, 48h window), and metered-connection backoff.
Censorship and information control · Engineering and infrastructure
Voidly probe local measurement buffer: SQLite ring buffer, batch compression, and resilient upload
2025-11-19
How Voidly probes preserve measurement data during upload failures — a 72-hour SQLite ring buffer with anomaly-safe eviction, LZ4 batch compression reducing median batch size from 47KB to 9KB, exponential backoff retry up to 4 hours, priority queue for anomalous measurements, chunked upload with per-chunk acknowledgment, and 0.003% measurement loss rate across 37 probes over 6 months.
Censorship and information control · Engineering and infrastructure
The Voidly Probe: Tauri + boringtun network measurement at the operator's edge
2025-11-15
How the Voidly desktop probe works: Tauri 2 cross-platform app, Cloudflare boringtun WireGuard, tun-rs TUN device, X25519-Dalek on-device key generation, and operator anonymity as a design constraint.
Censorship and information control · Engineering and infrastructure
The Voidly probe test runner: concurrency, timeout handling, and the measurement state machine
2025-11-08
How the Voidly probe test runner orchestrates concurrent measurements inside the Tauri app: tokio Semaphore with 3 permits, MeasurementState machine (Pending → Running → Success/Error/Timeout), per-layer timeout budgets (DNS 3s, TCP 5s, TLS 8s, HTTP 15s, total 30s), Ed25519 measurement signing, mpsc upload queue with capacity 200, and why per-layer timeouts are themselves evidence of DNS-layer interference.
Censorship and information control · Engineering and infrastructure
How Voidly measures HTTP and HTTPS censorship: the full protocol lifecycle from DNS through TLS to body comparison
2025-11-01
A step-by-step breakdown of how each Voidly probe test works: DNS resolution, TCP handshake, TLS negotiation with certificate chain validation, HTTP request execution, response body fingerprinting, control comparison, and how every layer maps to interference types in the anomaly classifier.
Censorship and information control · Engineering and infrastructure
Voidly's TCP measurement layer: RST injection detection, null-routing, and connection timing analysis
2025-10-27
A deep dive into the TCP layer of Voidly's censorship detection: SYN-ACK timing, RST injection detection with a 15ms threshold, null-routing vs. RST as two distinct censorship mechanisms, the TcpResult struct, dual-IP probing to identify RST source, and how TCP evidence maps to the anomaly classifier's interference classes.
Censorship and information control
The Voidly control server: how we tell censorship from a bad network
2025-10-22
How Voidly uses a distributed control server network to distinguish genuine censorship from network errors, CDN split-horizon DNS, and misconfigured sites — DNS, TCP, TLS, and HTTP comparison methodology, and why a single control is not enough.
Censorship and information control · Engineering and infrastructure
How Voidly measures bandwidth throttling: timing signals, body truncation, and the calibration problem
2025-10-15
A technical deep-dive on how Voidly detects bandwidth throttling — the hardest interference class to classify. Covers the TimingFeatures Rust struct, TTFB z-score computation against control measurements, body truncation and mid-transfer RST signals, the congestion vs. deliberate-throttling calibration problem, cross-probe corroboration scoring, and country patterns from Russia TSPU, Iran ARRS, India, and China.
Censorship and information control · Engineering and infrastructure
Voidly probe health monitoring: how we detect and replace failing probe nodes
2025-10-08
How Voidly monitors 37+ probe nodes: heartbeat system (60s cadence, separate transport), DEGRADED/OFFLINE state machine, measurement quality scoring, ASN coverage SLOs for 200 countries, flapping detection capping confidence at CORROBORATED, automated replacement from standby operator waitlist, and the classify_offline_cause() algorithm distinguishing probe failure from ISP-level censorship.
Censorship and information control · Engineering and infrastructure
How Voidly detects DNS injection: forged responses, injection rates by country, and pipeline integration
2025-10-03
How Voidly probes identify DNS injection and manipulation in censored networks — comparison against three control resolvers, four weighted detection signals (IP divergence, TTL anomaly, source IP divergence, response timing), per-country injection rates (China 94%, Iran 61%, Russia 12%), CAP_NET_RAW privilege handling, anycast false-positive calibration from 4.2% to 0.8%, and integration with the DnsTestResult confidence score.
Censorship and information control · Engineering and infrastructure
Geoblocking vs. censorship: how Voidly distinguishes licensing restrictions, CDN geofencing, and GDPR blocks from government-ordered blocking
2025-09-29
How Voidly avoids false positives from commercial geoblocking: HTTP 451 detection, streaming service block page fingerprints (tagged geoblock_commercial, not censorship), multi-country probe comparison (SINGLE_COUNTRY vs. MULTI_COUNTRY_SELECTIVE geographic patterns), CDN split-horizon detection via ASN group mapping, domain-level unavailability baselines, and the p_geoblock score that suppresses measurements above 0.70.
Censorship and information control · Engineering and infrastructure
Voidly's interference taxonomy: classifying censorship from DNS injection to BGP withdrawal
2025-09-24
How Voidly classifies every censorship measurement into one of 7 interference types — DnsInjection, DnsNxdomain, TcpRstInjection, TcpNullRouting, TlsMitm, HttpBlockPage, and Throttling — using a hierarchical decision tree from DNS through HTTP, with confidence scoring, protocol layer priority, and an Indeterminate category for ambiguous evidence.
Censorship and information control
Cross-source censorship verification: reconciling OONI, CensoredPlanet, and IODA
2025-09-20
How Voidly correlates three independent measurement projects at scale — data format normalization, 4-hour sliding window alignment, independence-weighted confidence scoring, and handling source disagreements.
Censorship and information control · Machine learning and OSINT
Voidly middlebox detection: transparent proxies, TCP injection points, and TSPU vendor signatures
2025-09-16
How Voidly probes detect network middleboxes: an HTTP echo test sending custom X-Voidly-Echo headers to a Voidly-controlled server to detect transparent proxies via injected Via/XFF headers, TCP RST injection timing analysis using four heuristics (arrival time, TTL mismatch, zero window, absent TCP options), a vendor signature library with 47 confirmed fingerprints (TSPU/Sandvine/Huawei Hi-SEC/GFW/Cisco), and the middlebox_events TimescaleDB hypertable showing 18-hour median lead time between middlebox detection and censorship anomaly onset across 31 countries.
Censorship and information control
How Voidly measures TLS censorship: certificate forgery, SNI blocking, and handshake interference
2025-09-12
A deep dive into the TLS layer of Voidly's censorship detection: full certificate chain extraction with rustls, government CA list (China MoI, Iran MICT, Kazakhstan NCA), MITM detection via fingerprint mismatch, TLS alert timing analysis (RST < 15ms = injected), SNI-based blocking detection via dual-SNI probing, ECH/ESNI measurement, and how TLS failure maps to interference_type classifier outputs.
Censorship and information control
Voidly's block page fingerprint library: detecting censorship signatures across 2,300+ known pages
2025-09-05
How Voidly built and maintains the 2,300-entry block page fingerprint library used to identify ISP and government censorship block pages: four matching strategies (exact SHA-256 hash, structural normalization, SimHash locality-sensitive hashing, TLS certificate fingerprinting), the match pipeline cascade, block page collection from OONI confirmed events and probe captures, per-country library composition (Turkey 47, Iran 312, Russia 189, China 8), false positive mitigation for CDN error pages and captive portals, and integration with the lf_http_blockpage_hash Snorkel label function.
Censorship and information control · Engineering and infrastructure
Voidly measurement protocol stack: composing DNS, TCP, TLS, and HTTP layers into a ProbeResult
2025-09-01
How the four Voidly measurement layers compose into a single ProbeResult struct: sequential DNS → TCP → TLS → HTTP execution with the control measurement running in parallel, the None-vs-Some failure propagation convention distinguishing “not attempted” from “attempted and failed”, a failure mode table mapping six layer-outcome combinations to censorship types, and deterministic control vantage selection by domain hash to stabilize body_sha256 comparison across measurement cycles.
Censorship and information control
How Voidly measures DNS censorship: NXDOMAIN injection, IP spoofing, and resolver-level filtering
2025-08-28
A deep dive into the DNS layer of Voidly's censorship detection: dual-resolver design (ISP resolver vs. neutral control), four interference types (NXDOMAIN injection, IP spoofing, empty answer, timeout), the compare_dns_results() algorithm, known injection IP database (China 18 IPs, Iran 3, Turkey 2), CDN geofencing false positive mitigation via ASN group matching, DNSSEC validation limitations, and DoH/DoT diagnostic queries.
Censorship and information control
Voidly measurement API export: NDJSON streaming, Parquet generation, and HuggingFace dataset sync
2025-08-24
How Voidly publishes its measurement corpus to external researchers: a keyset-paginated NDJSON streaming API with (ts, measurement_id) cursor and Server-Sent Events mode, nightly PyArrow Parquet generation sorted by (domain, ts) for 60% I/O reduction on single-domain queries with zstd level-3 compression, atomic HuggingFace Dataset Hub push with dataset card regeneration, and classifier_version tagging to keep probability distributions comparable across model updates.
Censorship and information control · Transparency and open data · Engineering and infrastructure
The Voidly measurement dataset: field-by-field schema reference
2025-08-20
A complete field-by-field guide to the Voidly CC BY 4.0 measurement dataset — probe identity, DNS/TCP/TLS/HTTP layers, control comparison, ML classification output, BGP signals, corroboration fields, and filtering recipes for journalists and ML researchers.
Censorship and information control · Engineering and infrastructure · Transparency and open data
Voidly's TimescaleDB continuous aggregates: pre-aggregating 2.2B probe measurements for fast queries
2025-08-14
The three-level TimescaleDB continuous aggregate hierarchy behind Voidly's sub-10ms query latency: measurement_hourly (15-minute refresh), country_daily_summary (1-hour refresh), and country_monthly_stats (daily refresh), cutting a 7-day country query from 4.1 seconds to 4ms. Covers refresh policy configuration, late-arriving probe data handling (94.2% within 1 hour, 98.7% within 24 hours), compression interplay after 7 days, asn_hourly_summary design, and manual backfill procedures.
Censorship and information control · Engineering and infrastructure
Voidly's probe-to-dataset ingest pipeline: normalization, quality filtering, and TimescaleDB indexing
2025-08-08
The full path from raw probe bytes to a queryable TimescaleDB record: protobuf over QUIC, Cloudflare Worker validation, Kafka fan-out, Rust normalization, probe-version schema drift handling, quality filtering (3.2% drop rate), and nightly Parquet export to HuggingFace.
Censorship and information control · Engineering and infrastructure
Voidly BGP data ingestion: parsing MRT dumps, detecting prefix withdrawals, and computing country outage scores
2025-08-02
How Voidly ingests BGP data from RIPE NCC RIS, RouteViews, and bgp.tools: MRT format parsing, per-country baseline computation, withdrawal detection thresholds, BgpEvent records in TimescaleDB, and how bgp_outage_score is attached to probe measurements.
Censorship and information control · Engineering and infrastructure
BGP routing signals and internet shutdown detection: how Voidly uses IODA data
2025-07-28
How Voidly uses BGP prefix withdrawal patterns and IODA data to detect internet shutdowns before any probe can send a packet — baseline per-country reachability, the difference between BGP silence and withdrawal, and how BGP fits into the composite confidence score.
Censorship and information control · Engineering and infrastructure
Voidly AS path analysis: using BGP topology to locate censorship enforcement points
2025-07-22
How Voidly uses CAIDA AS-Rank, RIPE NCC RIS route collector data, and PeeringDB to build an AS-level topology, classify censorship choke points (IXP, transit AS, edge ISP), compute per-country probe diversity scores, and feed AS path features into the anomaly classifier.
Censorship and information control · Engineering and infrastructure
Voidly's ASN-level blocking analysis: how censorship propagates across autonomous systems
2025-07-17
How Voidly uses per-ASN probe vantages to distinguish nationwide censorship orders from selective ISP-level blocking — BGP peer classification from CAIDA AS-Rank, ISP blocking fingerprints by interference type, differential blocking detection, and propagation speed analysis that reveals enforcement mechanisms.
Censorship and information control
Per-domain censorship history in Voidly: tracking blocking events across countries and time
2025-07-12
How Voidly tracks the full history of blocking events for individual domains across all probe countries — DomainMeasurementSummary continuous aggregate, first/last-seen tracking, the /v1/domains/{domain}/history API, temporal pattern analysis (23% of blocks resolve within 7 days), cross-country blocking correlation, and domain freshness scoring.
Censorship and information control · Engineering and infrastructure
Voidly's country-level censorship score: aggregating 2.2B probe measurements into the global index
2025-07-08
How Voidly aggregates per-measurement interference probabilities into per-country censorship scores: recency decay with a 30-day half-life, ASN diversity weighting, domain category weighting, cross-source corroboration multipliers, 90-day rolling windows, Gaussian temporal smoothing, and bootstrap confidence bands.
Censorship and information control · Engineering and infrastructure
Sanctions timelines and internet shutdowns: how Voidly correlates OFAC designation bursts with censorship events
2025-07-03
How Voidly aligns OFAC sanctions packages, EU/UN designation timelines, and bilateral diplomatic signals with measured internet shutdown events — building the diplomatic-isolation feature for the shutdown forecasting model.
Censorship and information control
OFAC SDN integration in the Federal Regulatory Data Hub: conditional GET, entity normalization, and sub-second screening
2025-06-28
How the Federal Regulatory Data Hub ingests the OFAC Specially Designated Nationals list — daily conditional GET with ETag, XML parsing across 12K SDN entries with alias explosion, name normalization pipeline, FTS5 + Jaro-Winkler three-pass screening, and p50 8ms / p99 28ms screening latency against the SDN list alone.
Federal data · Sanctions and illicit finance · Engineering and infrastructure
The features behind Voidly's 7-day shutdown forecast: political calendar, sanctions timelines, and network telemetry
2025-06-21
A deep dive into the feature engineering behind Voidly's 7-day internet shutdown forecasting model: political calendar integration (election dates, protest intensity via GDELT), OFAC sanctions timeline features, BGP withdrawal rate, probe measurement rate drops as precursor signals, historical shutdown patterns, and XGBoost SHAP feature importance across 200 countries.
Censorship and information control · Machine learning and OSINT
Seven-day internet shutdown forecasting: how Voidly predicts connectivity outages
2025-06-15
How we build a 7-day predictive model for internet shutdowns across 200 countries: political calendar features, network telemetry, ARIMA + XGBoost ensemble, and per-country reliability scoring.
Censorship and information control · Machine learning and OSINT
Bridging classifier outputs to shutdown forecasting: from per-measurement censorship probability to country-level shutdown risk scores
2025-06-11
How Voidly aggregates calibrated per-measurement censorship probabilities into country-level shutdown risk signals: a three-stage aggregation hierarchy (ASN-domain hourly → domain → country), exponential decay weighting with 48-hour half-life over a 14-day window, a 28-feature forecast vector with risk score time series and ASN block concentration, and the Kafka voidly.forecast.features topic handoff to the Bayesian shutdown forecasting service.
Censorship and information control · Machine learning and OSINT
Voidly's per-country classifier calibration: Platt scaling, threshold tuning, and why the same probability means different things in Iran vs. China
2025-06-07
How Voidly calibrates its anomaly classifier separately for each country — Platt scaling logistic regression fitted on per-country holdout predictions, F2-weighted threshold tuning per class, 30-day rolling calibration windows, and calibration case studies: Iran DNS tampering fires at threshold 0.62 (consistent single-authority blocking); China DNS tampering requires 0.74 (CDN split-horizon noise).
Censorship and information control · Machine learning and OSINT
Voidly's anomaly classifier retraining pipeline: temporal splits, champion/challenger promotion, and drift detection
2025-06-02
How Voidly retrains its five-class censorship anomaly classifier on a weekly cadence: time-based train/val/test splits to prevent temporal leakage, SMOTE resampling for class imbalance, PSI drift detection, champion/challenger shadow deployment, and the canary rollout process.
Censorship and information control · Machine learning and OSINT
Voidly's real-time inference API: classifying censorship measurements at 50ms
2025-05-28
How Voidly serves the anomaly classifier as a live inference API — feature extraction from raw probe measurements in under 5ms, ONNX Runtime for portable model serving, five-class output with per-class probabilities, Cloudflare Worker routing to regional inference nodes, model versioning with champion/challenger shadow mode, and the latency budget that keeps end-to-end probe-to-verdict under 50ms.
Censorship and information control · Machine learning and OSINT · Engineering and infrastructure
Voidly ONNX inference: exporting XGBoost to ONNX and serving censorship predictions at 50ms p99
2025-05-24
How Voidly converts a trained XGBoost censorship classifier to ONNX for serving inside a Rust ingestion service: the sklearn-to-ONNX export pipeline with zipmap=False for zero-copy float32 probability tensors, ONNX Runtime session configuration with per-thread isolation and L3 graph optimization, opset 17 pinning with metadata validation, and batch inference benchmarks achieving p99 under 50ms at batch size 200 on 4 vCPUs.
Censorship and information control · Machine learning and OSINT
The 47 features that classify internet censorship: how Voidly extracts signal from raw network measurements
2025-05-20
How Voidly transforms raw probe measurements into the 47-feature vector that feeds the anomaly classifier: the ControlDelta struct, DNS features (NXDOMAIN injection, bogon IPs, known injection IPs), TCP features (RST timing, SYN-ACK count), TLS features (MITM cert detection, alert codes), HTTP features (blockpage SimHash score, body length ratio), and the LRU control cache design that prevents doubling probe cost.
Censorship and information control · Machine learning and OSINT
Voidly probe scheduling constraints: battery budgets, cellular data limits, and adaptive domain selection
2025-05-16
How Voidly probes adapt their measurement schedule to device resource constraints: four constraint checks (battery floor, thermal throttle, cellular daily cap, unknown network), sliding-window cellular data accounting with per-minute SQLite buckets, adaptive cycle length that scales domain count to remaining budget via a 28,000-byte-per-measurement estimate, and a priority queue scoring domains on staleness (0.50), config priority flag (0.35), and anomaly recency (0.15).
Censorship and information control · Engineering and infrastructure
Voidly's URL test list: how we curate the domains that reveal internet censorship
2025-05-12
How Voidly selects and maintains the domains it probes for censorship: Citizen Lab's global test list, 12 OONI category codes, per-country supplemental lists, the measurement budget problem, and why the test list is a political document.
Censorship and information control · Engineering and infrastructure
Voidly probe operator safety: anonymity design, data minimization, and operational security for censorship measurement
2025-05-07
How Voidly protects probe operators in jurisdictions that criminalize censorship measurement: strict data minimization (no name, address, or IP logging), WireGuard peer-key authentication, daily probe ID pseudonymization, optional Tor hidden service upload, measurement scrubbing, country-tier legal risk assessments, and a one-tap emergency stop with full data erasure.
Censorship and information control · Cybersecurity and privacy · Engineering and infrastructure
Voidly probe commissioning: how a new operator joins the censorship measurement network
2025-05-03
How a new Voidly probe operator goes from application to publishing measurements: on-device X25519 key generation in the Tauri app, probe registration and ASN verification, 48-hour warmup period with calibration measurements, quality scoring at promotion, and what happens when warmup calibration fails.
Censorship and information control · Engineering and infrastructure
Regulatory API rate limiting: per-tier quotas, burst tokens, and Cloudflare KV sliding-window counters
2025-04-29
How the Federal Regulatory Data Hub enforces per-client and per-tier rate limits at 8,000 req/s without a centralized counter store: a five-tier quota table (free/researcher/compliance/vendor/internal), token-bucket burst enforcement in Cloudflare KV with ETag-based conditional writes and fail-open after three race retries, and sliding 24-hour window daily quota counting using per-minute KV buckets with a short-lived summary cache for the common below-quota path.
Federal data · Engineering and infrastructure
The Federal Regulatory Data Hub query layer: routing 50M+ records at the Cloudflare edge
2025-04-25
How the Federal Regulatory Data Hub serves 50M+ records via Cloudflare Workers: 8 vertical D1 shards by agency group, Promise.all fan-out for cross-agency queries, entity bridge join across CIK/UEI/LEI/DUNS/NPI, FTS5 full-text search for narrative datasets, response caching with TTL table by endpoint type, and p50/p99 latency budget including partial-response fallback when a shard is unavailable.
Federal data · Engineering and infrastructure
Regulatory data versioning: point-in-time queries, audit trails, and as-of compliance screening in Cloudflare D1
2025-04-21
How the Federal Regulatory Data Hub implements bitemporal versioning across 50M+ regulatory records in Cloudflare D1: the valid_from/valid_until row-version pattern using half-open intervals, an append-only record_versions audit table with before-state JSON payloads, AS-OF query rewriting in the Workers router using the idx_sdn_pit covering index for sub-5ms p99, three screening modes (current/as-of/historical), and keyset-paginated NDJSON snapshot export for retroactive batch compliance screening.
Federal data · Engineering and infrastructure
Monitoring dataset freshness in the Federal Regulatory Data Hub: staleness detection, multi-channel alerting, and the OFAC publish-time problem
2025-04-17
How the Federal Regulatory Data Hub monitors the freshness of 208 federal datasets and alerts on staleness: per-source FRESHNESS_CONFIG with expected_cadence and max_staleness_hours, D1 dataset_ingests staleness query, Cloudflare Cron */5 * * * * staleness check, multi-channel alerting (Slack webhook, email, PagerDuty) with KV deduplication, OFAC ETag monitoring with 90-minute publish-delay alert, five ingest error classes, and public /status endpoint.
Federal data · Engineering and infrastructure
Voidly probe vantage selection: ASN diversity, operator safety, and reaching hard-to-measure countries
2025-04-10
How Voidly selects and distributes its probe vantage network: why ASN diversity matters more than geographic spread, the operator safety constraints that shape high-risk country probes, and how we reach countries where most people connect on mobile-only networks.
Censorship and information control · Engineering and infrastructure
Voidly probe config delivery: signed bundles, auto-update protocol, and country-specific measurement parameters
2025-04-06
How Voidly delivers measurement configuration to probes without a persistent control channel: gzip+CBOR bundles signed with Ed25519 (signature verified before decompression to prevent zip-bomb attacks), a pull-based auto-update scheduler with 6-hour intervals and exponential backoff, version pinning and two-snapshot rollback, and anonymous country tokens derived via BLAKE3 from ISO code + epoch-week salt so the CDN cannot correlate which overlay a probe applies.
Censorship and information control · Engineering and infrastructure · Cybersecurity and privacy
Voidly operator privacy: how we publish measurements without exposing the people who collect them
2025-04-02
How Voidly protects probe operator identity while publishing full measurement data: probe_id derived as SHA-256(public_key_bytes) with zero IP logging, human-readable codename system (450K+ combinations, no join table with probe_id), measurement anonymization (probe_cc + probe_asn published; IP never stored), per-probe Ed25519 signing with isolated key store, and 12-country extra protections (4–48 hour publication delay, 90-day probe_id rotation).
Censorship and information control · Engineering and infrastructure
Entity alias tables for sanctions evasion detection: AKA, FKA, NFE, and PHONETIC normalization across OFAC, SEC, and FinCEN
2025-03-29
How the Federal Regulatory Data Hub manages alias proliferation across OFAC SDN, SEC EDGAR, and FinCEN BSA: a five-type alias taxonomy (AKA/FKA/NFE/PHONETIC/VESSEL), entity_aliases DDL with FTS5 virtual table and covering indexes, a normalization pipeline with iterative legal-suffix stripping and NFKD ASCII transliteration, double-Metaphone phonetic bucket generation, and a four-pass resolution pipeline (exact 71.4% → phonetic 88.2% → FTS5 96.1% → edit-distance 98.7% cumulative recall on 2.4M aliases).
Federal data · Engineering and infrastructure · Sanctions and illicit finance
Entity ID normalization in the Federal Regulatory Data Hub: resolving CIK, UEI, LEI, DUNS, and NPI across 208 datasets
2025-03-25
How the Federal Regulatory Data Hub resolves company identity across five incompatible federal identifier schemes: three-pass resolution strategy (exact ID join, alias table lookup, TF-IDF fuzzy name matching), the entity_master bridge table schema, company name normalization to remove legal suffixes, false positive rates by method, special cases for healthcare NPI arrays and foreign entities, and how the entity bridge achieves p50 38ms cross-agency query latency.
Federal data · Engineering and infrastructure
Federal Regulatory Data Hub schema design: per-vertical table layouts, entity_master bridge, and D1 indexing strategy for 50M+ records across 8 shards
2025-03-21
The full schema design behind the Federal Regulatory Data Hub: eight vertical D1 databases (securities 9.2M, financial-crimes 4.1M, healthcare 6.8M, labor-safety 3.4M, environment 2.9M, transportation 4.6M, enforcement 2.1M, infrastructure 2.9M), OFAC SDN and EPA enforcement table DDL with FTS5 virtual tables, entity_master bridge with shard_presence bitmask, covering indexes vs. FTS5 trade-offs, and the Workers queryEntityAllShards() Promise.all fan-out achieving p50 38ms cross-shard entity queries.
Federal data · Engineering and infrastructure
Full-text search across 50M+ federal records: SQLite FTS5, BM25 ranking, and cross-shard fan-out in Cloudflare D1
2025-03-17
How the Federal Regulatory Data Hub implements full-text search across 50M+ records using SQLite FTS5 in Cloudflare D1: virtual table creation with the unicode61 tokenizer and content= shadow-table pattern, BM25 scoring with weighted columns (10× entity_name, 5× description, 1× narrative), highlight() and snippet() functions for context extraction, buildFts5Query() TypeScript alias expansion with legal suffix stripping, Promise.all cross-dataset fan-out across 5 D1 shards, trigger-based index maintenance, and weekly optimize via Cloudflare Cron.
Federal data · Engineering and infrastructure
Building the Federal Regulatory Data Hub on Cloudflare D1: 50M+ records at the edge
2025-03-10
How we built a 35M-record federal regulatory database on Cloudflare D1 — per-vertical SQLite tables across 208 datasets, daily cron ingest, FTS5 for free-text datasets, and vertical sharding past the 10GB limit.
Federal data · Engineering and infrastructure
Voidly's measurement retention policy: hot, warm, and cold tiers for 2.2B probe results
2025-03-05
How Voidly manages storage for 2.2B probe measurements using a three-tier TimescaleDB retention policy — full-resolution hot tier (0-30 days), native-compressed warm tier (31-365 days, 6.2x ratio), and downsampled cold tier (>365 days, aggregates only), with continuous aggregate cascade, pg_cron compliance verification, and R2 tiered storage planned for Q3 2026.
Censorship and information control · Engineering and infrastructure
Voidly's measurement database: 2.2B probe results in TimescaleDB
2025-03-01
How Voidly stores and queries 2.2 billion censorship probe results in TimescaleDB: hypertable design with 1-day chunk intervals and secondary country partitioning, 6.2× compression, continuous aggregates for country-level daily summaries, three-tier retention (hot/warm/cold), and query benchmarks for anomaly detection.
Censorship and information control · Engineering and infrastructure
Voidly's real-time corroboration engine: fetching, aligning, and merging OONI, CensoredPlanet, and IODA data
2025-02-22
How Voidly's corroboration engine fetches and aligns data from three independent sources in near-real-time despite their different latency profiles: tokio::join! parallel fetches with per-source timeouts, adaptive OONI polling (15m/60m/3h/6h), in-memory CensoredPlanet daily dump index, independence-weighted source agreement scoring, and retroactive nightly reprocessing against the CP daily dump.
Censorship and information control · Engineering and infrastructure
The Voidly MCP server: 83 censorship query tools for Claude and GPT
2025-02-15
How the Voidly MCP server exposes 83 tools for querying the global censorship dataset from Claude, GPT, and agent frameworks — incident lookup, measurement queries, country summaries, BGP events, shutdown forecasts, and wiring it into Claude Code.
Censorship and information control · Engineering and infrastructure
The Voidly Parquet export pipeline: nightly snapshots from TimescaleDB to HuggingFace
2025-02-08
How the nightly Voidly export job extracts measurements from TimescaleDB and pushes Parquet snapshots to HuggingFace Hub: PyArrow schema with dictionary-encoded columns, server-side cursor streaming at 50K rows per round-trip, Zstandard level 3 compression, country + year_month partitioning, atomic HuggingFace commit with CommitOperationAdd, post-push SHA-256 verification, and the incremental vs. monthly full-snapshot strategy.
Censorship and information control · Engineering and infrastructure · Transparency and open data
The Voidly open datasets on HuggingFace: structure, daily snapshots, and filter recipes
2025-02-01
How the Voidly CC BY 4.0 measurement dataset and the OONI historical corpus are hosted on HuggingFace — Parquet snapshot structure, daily incremental updates, git-lfs versioning, and Python/R filter recipes for journalists, ML researchers, and infrastructure teams.
Censorship and information control · Transparency and open data · Engineering and infrastructure
Censorship incident lifecycle in Voidly: from anomaly detection to verified incident to resolution
2025-01-26
How a Voidly censorship incident progresses through six states — Anomaly, MultiSourceAnomaly, Corroborated, VerifiedIncident, Resolved, FalsePositive — with exact transition thresholds, timing data from 847 incidents in 2024 (67% stuck at Anomaly, 18% reach VerifiedIncident), IncidentRecord struct, publication timing by tier, how lifecycle state encodes into HuggingFace dataset fields, and retroactive state change handling via incident_history.
Censorship and information control
From anomaly to verified incident: the Voidly confidence tier system
2025-01-20
How a Voidly measurement moves through three confidence tiers — Anomaly, Corroborated, Verified Incident — and what each tier means for journalists, ML researchers, and infrastructure monitoring teams using the dataset.
Censorship and information control · Engineering and infrastructure
Voidly's Server-Sent Events streaming API: real-time censorship incident subscriptions
2025-01-13
How the Voidly SSE streaming endpoint delivers censorship events in real time: GET /v1/stream with country/tier/type filtering, four event types (incident_created, incident_updated, incident_resolved, country_status_change), Last-Event-ID reconnection with 24-hour event ring buffer, Python httpx.Client and JavaScript EventSource examples, and how SSE differs from the webhook delivery system.
Censorship and information control · Engineering and infrastructure
The Voidly REST API: querying the global censorship index in real time
2025-01-06
How the Voidly REST API is designed: key endpoints for incident lookup, measurement queries, country summaries, domain history, BGP events, and 7-day shutdown forecasts; cursor-based pagination, filtering, rate limits, and code samples in curl, Python, and JavaScript.
Censorship and information control · Engineering and infrastructure
Voidly API authentication: API keys, request signing, and rate limit tiers
2025-01-01
How the Voidly API handles authentication: two access tiers (public 60 req/hr and keyed), voidly_{env}_{base58} key format with PBKDF2-HMAC-SHA256 storage, D1 + KV request authentication flow, four plan tiers (Free/Research/Professional/Enterprise), HMAC-SHA256 webhook signature verification, key rotation without downtime, test keys for CI, and OAuth2 for third-party integrations.
Censorship and information control · Engineering and infrastructure

2024 (25)

Voidly's alert delivery system: PGP-encrypted email, webhooks, and RSS for censorship incidents
2024-12-28
How Voidly gets verified censorship incidents to journalists, researchers, and monitoring systems: HMAC-signed webhook delivery with exponential-backoff retry, PGP-encrypted email for verified alerts, per-country and per-confidence-tier RSS feeds, alert deduplication by incident_id, and rate-limiting to prevent fatigue.
Censorship and information control · Engineering and infrastructure
Voidly incident publication: state machines, idempotent upserts, and Kafka fan-out for verified censorship events
2024-12-24
How Voidly transitions a censorship incident through five states (Anomaly/MultiSourceAnomaly/Corroborated/Verified/Resolved) with threshold-gated transitions, stores every state change as an append-only event in a TimescaleDB hypertable with SHA-256 idempotency_key, and fans out verified incidents to alert delivery and cache invalidation via three Kafka topics — with the compute_incident_id() Rust function that makes incident IDs deterministic across pipeline restarts.
Censorship and information control · Engineering and infrastructure
The Voidly measurement scheduler: how we decide which domains to probe and when
2024-12-20
How Voidly schedules 80-domain probe runs across 37+ nodes: domain priority tiers by OONI category code, anomaly-driven priority boosts, protocol selection per domain, ±15% jitter for anti-detection, ASN distribution to ensure cross-ASN coverage, adaptive scheduling that injects urgent re-measurements on anomaly detection, and per-country task budgets (CN 68, IR 74, RU 72, global avg 49 tasks/window).
Censorship and information control · Engineering and infrastructure
Building Voidly's classifier training dataset from OONI: ingestion, alignment, and label generation
2024-12-15
How Voidly ingests 200M+ OONI Explorer measurements, aligns them with Voidly probe data on a country-domain-date key, generates probabilistic training labels using five Snorkel-style label functions, handles OONI coverage gaps with label distillation, and constructs the labeled dataset that trains the five-class anomaly classifier.
Censorship and information control · Machine learning and OSINT
The Voidly anomaly classifier: five interference classes, gradient boosted trees, and why we optimize for recall
2024-12-10
How the Voidly ML classifier distinguishes DNS tampering, TLS interference, HTTP blocking, BGP withdrawal, and throttling — five per-class binary models, country-specific calibration, and why 95% recall beats 95% precision when cross-source corroboration filters the noise.
Censorship and information control · Machine learning and OSINT · Engineering and infrastructure
Evaluating the Voidly anomaly classifier: per-country confusion matrices, precision-recall curves, and the offline test harness
2024-12-03
How Voidly evaluates the five-class censorship anomaly classifier offline before deployment: the ClassifierEvaluator test harness, per-country AUC-PR vs. AUC-ROC tradeoffs for imbalanced censorship data, F2 scoring rationale, per-country confusion matrix case studies (Iran 0.97 DNS recall, China 0.78 precision from CDN noise, Russia TSPU throttling), ECE calibration before and after Platt scaling, and model promotion criteria.
Censorship and information control · Machine learning and OSINT
Voidly's active learning loop: growing the anomaly training set with human-in-the-loop annotation
2024-11-27
How Voidly uses uncertainty sampling, Cohen's kappa inter-annotator agreement, and weekly model retrains to grow its censorship anomaly training set from 127K bootstrap labels to 275K — 500 examples/week annotated by 3 researchers each, with DVC data versioning and PSI drift detection.
Machine learning and OSINT · Censorship and information control
Voidly's ML training pipeline: building a labeled censorship dataset from OONI measurements
2024-11-20
How Voidly constructs a labeled training dataset for the anomaly classifier from 200M+ OONI measurements: weak supervision with Snorkel-style label functions across DNS/TCP/TLS/HTTP layers, class imbalance handling with SMOTE and log-weighting, time-based train/val/test splits to prevent leakage, per-country Platt scaling calibration, and the continuous retraining pipeline.
Censorship and information control · Machine learning and OSINT · Engineering and infrastructure
The Voidly measurement quality filter: how we clean 200M OONI records before ML training
2024-11-13
How the quality filter pipeline decides which raw measurements are fit for ML training: boolean checks for control_failure (1.9% drop rate, from ISP blocks on control server IPs in CN/IR/RU), missing_fields (0.8%), old probe versions before 2.5.0 (0.3%), and duplicates (0.2%), totalling 3.2% dropped. Includes the quality_filter() Python function, the to_feature_input() schema transformation, and why rejected measurements go to quarantine rather than being discarded.
Censorship and information control · Machine learning and OSINT · Engineering and infrastructure
OONI data normalization: bridging five schema versions across 1.66M raw measurement files
2024-11-09
How Voidly normalizes 200M+ OONI measurements across five web_connectivity schema versions (v0.2 to v0.6) into a single ML-ready format: a detect_web_connectivity_version() function using field-presence inference, AnomalyType and ConfidenceTier enums, the OoniMeasurementNormalized dataclass, FLAG_* bitmask constants for DNS/TCP/TLS/HTTP anomaly encoding, side-by-side normalize_v05() vs. normalize_v06() implementations, and a 95.3% pass-through rate from the drop-reason table.
Censorship and information control · Engineering and infrastructure · Machine learning and OSINT
Building the OONI historical corpus: 1.66M downloads, schema normalization, and the decisions behind the dataset
2024-11-05
How we processed the OONI raw measurement archive into a flat ML-ready CSV: handling probe version schema drift across 12 years, normalizing test_keys across 20 measurement types, streaming 200M+ records, and what we decided to leave out.
Censorship and information control · Engineering and infrastructure
Censorship attribution via OSINT: identifying DPI vendors from network signatures, procurement records, and BGP TTL analysis
2024-11-01
How Voidly attributes censorship infrastructure to specific DPI vendors using network signatures and open-source intelligence: a six-vendor signature table (TSPU/Sandvine/NetClean/Iran ARRS/Cisco IronPort/GFW), DpiVendorSignature dataclass with a score_signature_match() function weighting RST timing (0.35), block page (0.30), injection IP (0.25), and CA SPKI (0.10), procurement scraping with PROCUREMENT_SOURCES across five government tender portals, BGP TTL-hop attribution, and case studies for Russia, Iran, and Ethiopia.
Censorship and information control · Machine learning and OSINT · Engineering and infrastructure
Building a digital-footprint reconnaissance pipeline for OSINT investigations
2024-10-28
How we build persistent cross-platform entity profiles for OSINT: passive collection from 40+ sources, graph-based identity disambiguation with calibrated edge weights, Certificate Transparency log monitoring, BGP/ASN change tracking, stylometric fingerprinting, and operational security architecture for researchers in hostile environments.
Machine learning and OSINT · Engineering and infrastructure
Mapping censorship infrastructure: identifying filtering gateways, DPI vendor signatures, and blocking architecture from network signals
2024-10-21
How Voidly identifies the hardware and software responsible for internet censorship: blocking architecture taxonomy (L3/L4/L7-DNS/L7-HTTP), DPI vendor signatures from timing patterns (Russia's TSPU RST < 3ms, Iran's ARRS DNS injection IPs, China's GFW TTL fingerprinting), ISP-level blocking fingerprints (Rostelecom vs. MTS vs. Turkcell), TTL analysis for middlebox distance, OSINT cross-referencing with procurement records, and the censorship_infrastructure dataset field.
Censorship and information control · Machine learning and OSINT
Building a distributed VPN with intelligent routing
2024-10-15
How we built a censorship-resistant VPN for Voidly probe operators: GFW/IRGC/TSPU threat model, WireGuard inside HTTP/2 CONNECT domain-fronting over CDN edges, 48-hour entry-node IP rotation via Cloudflare KV, traffic morphing (Laplace timing jitter + packet-size CDF matching + cover traffic), 22-dim XGBoost on-device routing with ONNX, BlockageDetector for RST injection, and 99.3% DPI evasion across CN/IR/RU.
Censorship and information control
Named entity extraction and disambiguation in the OSINT pipeline: 58M posts per day, 15,000 entity mentions per hour
2024-10-10
How the AI Analytics OSINT pipeline extracts, disambiguates, and stores named entity mentions from 58M social media posts per day — GPU-accelerated NER, Wikidata QID linking, cross-language transliteration, and person co-reference resolution.
Machine learning and OSINT · Engineering and infrastructure
Social media ingestion at scale: collecting 58M posts per day from 47 platform schemas
2024-10-05
How we collect and normalize social media data from 47 platforms into a canonical post format: three-tier collection strategy (official APIs, ActivityPub, RSS/scrape), token-bucket rate limiting with circuit breakers, FastText language detection at ingest, content-hash deduplication, and Kafka topic partitioning by platform.
Machine learning and OSINT · Engineering and infrastructure
NLP pipeline for real-time sentiment analysis at scale
2024-09-28
NLP models powering the OSINT platform at 667 posts/second: FastText lid.176 language detection (99.7% EN accuracy), custom SpaCy NER fine-tuned on 2.3M labeled examples across 7 political entity types (91.4% macro F1), DistilBERT fine-tuned on 5M examples with INT8 ONNX quantization (94.7% macro F1, 28ms GPU), MinHash character 4-gram coordinated-campaign detection (89% precision), and the social signal integration with Voidly censorship event detection.
Machine learning and OSINT · Engineering and infrastructure
Multilingual bot detection: an 8-feature XGBoost classifier across 14 languages with per-language Platt scaling
2024-09-24
How the OSINT platform detects bot accounts across 14 languages without retraining per language: an 8-feature BotFeatureVector (posting_interval_entropy via Shannon formula, reply_outdegree_ratio, content_cluster_density, age_velocity_zscore, quote_to_original_ratio, url_recycling_rate, cross_platform_correlation, bio_change_count_90d), Redis-bucketed perceptual hash matching (Hamming ≤ 8 across 1024 hash buckets), XGBClassifier with StratifiedGroupKFold on language groups, and per-language Platt scaling achieving F1 0.883–0.908 across all 14 languages.
Machine learning and OSINT
Detecting coordinated inauthentic behavior in social media at scale
2024-09-20
How we detect coordinated amplification campaigns across 58M daily posts: MinHash LSH (128 hash functions, 16 bands, Jaccard threshold 0.80) for content similarity, Redis sorted-set burst detection (≥5 accounts within 15 minutes, inverse-sqrt account age weighting), a logistic regression over seven account features, network amplification ring detection via cycle enumeration, cross-platform timing joins, and a 0–100 coordination score with 70/90 thresholds for human review and auto-flagging.
Machine learning and OSINT · Engineering and infrastructure · Money in politics
Entity resolution for FEC campaign finance data: committee type taxonomy, JFC allocation, and four-pass name matching
2024-09-16
How the election intelligence pipeline resolves FEC committee identity across 1.3M records: the 10-code committee type taxonomy (H/S/P/X/Y/N/Q/O/I/U), a JointFundraisingCommittee dataclass with JFCAllocation and resolve_jfc_participants() from Form 99, normalize_entity_name() with iterative legal-suffix stripping, a four-pass resolution table (exact ID 63.4% → exact name 82.1% → alias 91.7% → TF-IDF char 3-gram 95.5% cumulative recall), and LLC chain disambiguation via FinCEN/EDGAR/SOS cross-reference.
Money in politics · Engineering and infrastructure
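The iterative legal-suffix stripping mentioned above can be sketched as follows. This is a hypothetical stand-in for normalize_entity_name(), not the pipeline's actual code: the suffix list, the punctuation handling, and the rule that the last remaining token is never stripped are all assumptions made for the example.

```python
import re

# Hypothetical suffix list for illustration; the real list is not given in the summary.
LEGAL_SUFFIXES = {
    "INC", "INCORPORATED", "LLC", "LLP", "LP", "LTD",
    "CORP", "CORPORATION", "CO", "COMPANY",
}


def normalize_entity_name(name: str) -> str:
    """Uppercase, drop punctuation, then strip trailing legal suffixes repeatedly."""
    # Remove periods first so that "L.L.C." collapses to "LLC".
    cleaned = name.upper().replace(".", "")
    tokens = re.sub(r"[^A-Z0-9 ]+", " ", cleaned).split()
    # Iterate: names such as "Acme Co., Inc." carry more than one suffix.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```

Stripping in a loop rather than once is what lets "Acme Holdings Co., Inc." and "ACME HOLDINGS" land on the same key for an exact-name pass.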
Detecting election anomalies using statistical methods
2024-09-12
Anomaly detection across 47 races in 23 states: Benford's law with magnitude-range validity checks, an XGBoost turnout model (20 features, SHAP attribution, MAD-based z-scores, 3.1pp MAE), ARIMA(2,1,2) reporting-curve detection, DBSCAN campaign finance clustering (near-identical amounts + 3-day burst), and a full triage workflow (12 flags → 9 explained, 2 false positives, 1 persistent).
Money in politics · Machine learning and OSINT
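For readers unfamiliar with MAD-based z-scores, here is a minimal sketch of the common "modified z-score" form, which replaces the mean and standard deviation with the median and the median absolute deviation so that a single extreme residual cannot mask itself. The 0.6745 scaling constant is the conventional choice that makes the score comparable to a normal z-score; whether the turnout model uses this exact scaling is an assumption.

```python
from statistics import median


def mad_z_scores(values, scale=0.6745):
    """Modified z-scores: scale * (x - median) / MAD.

    Returns zeros when the MAD is zero (more than half the values are equal),
    which is a simplification chosen for this sketch.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [0.0 for _ in values]
    return [scale * (v - med) / mad for v in values]


if __name__ == "__main__":
    # Hypothetical turnout residuals in percentage points.
    residuals = [1, 2, 3, 4, 100]
    print(mad_z_scores(residuals))
```

A commonly used cutoff for flagging is an absolute modified z-score above 3.5; the cutoff used in the article is not stated in the summary.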
Statistical anomaly detection for election integrity: Benford's Law, digit uniformity, and turnout modeling
2024-09-07
The statistical methods behind AI Analytics' election anomaly detection — first-digit analysis, last-digit uniformity testing, turnout z-scores, and why these signals require cross-validation with social and media data before generating an alert.
Money in politics · Machine learning and OSINT · Censorship and information control
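The first-digit analysis described above rests on Benford's expected distribution, P(d) = log10(1 + 1/d) for d = 1..9. The sketch below computes that distribution and a Pearson chi-square statistic against observed first digits. The function names are illustrative, and the chi-square test is one common choice of goodness-of-fit measure, not necessarily the one the article uses.

```python
import math
from collections import Counter

# Benford's law: expected share of each leading digit 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(n: int) -> int:
    """Leading decimal digit of a non-zero integer."""
    return int(str(abs(n))[0])


def benford_chi_square(values):
    """Pearson chi-square statistic of observed first digits against Benford."""
    positive = [v for v in values if v > 0]
    if not positive:
        raise ValueError("need at least one positive value")
    observed = Counter(first_digit(v) for v in positive)
    n = len(positive)
    return sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items()
    )
```

With 8 degrees of freedom, the 5% critical value is about 15.51. A statistic above that is a prompt for closer review rather than evidence of manipulation, since vote counts drawn from a narrow range of magnitudes often deviate from Benford for benign reasons.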
The election intelligence pipeline: aggregating ballot data, social signals, and media coverage for real-time anomaly detection
2024-09-02
How the election intelligence pipeline ingests AP Election API feeds, state authority data (JSON/CSV/HTML scraping), social media signals, and media coverage in real time: a Kafka election.precinct_results topic (50 partitions by state FIPS), the PrecinctResult protobuf schema, a StateScraperConfig for the state scrapers, the ElectionSentimentConsumer, narrative divergence scoring, FIPS normalization edge cases (Connecticut planning regions, Alaska districts), and p50/p99 latency targets for all four streams.
Money in politics · Engineering and infrastructure · Machine learning and OSINT
How we process 2.4M social-media posts per hour
2024-08-30
Kafka partition key design, binary COPY writes to TimescaleDB, character 4-gram MinHash LSH distributed across Redis, autoscaling on consumer lag, and a canonical normalization layer across 47 platform schemas — the full pipeline behind 58M posts/day.
Engineering and infrastructure · Machine learning and OSINT
