Writing
Technical notes from building intelligence infrastructure.
Long-form, light on hype. Architectures, trade-offs, and post-mortems from work on censorship measurement, OSINT pipelines, election analysis, and post-quantum communications.
- BLSWagesOccupationsLabor MarketFederal Data
- CFPBConsumer FinanceComplaintsBankingFederal Data
- FARAForeign LobbyingDOJInfluence OperationsFederal Data
- SECForm 4Insider TradingEDGARFederal Data
- NLRBLabor RelationsUnion ElectionsUnfair Labor PracticesFederal Data
- NTSBAviation SafetyAircraft AccidentsFAAFederal Data
- NOAAStorm EventsWeather DisastersClimateFederal Data
- USAIDForeign AidDevelopment AssistanceState DepartmentFederal Data
- NHTSAVehicle SafetyAuto RecallsConsumer ComplaintsFederal Data
- EPARCRAHazardous WasteEnvironmental ComplianceFederal Data
- EIAPower PlantsElectricityEnergyFederal Data
- NCESIPEDSHigher EducationCollegesFederal Data
- OFACSanctionsTreasuryEnforcementFederal Data
- ORIResearch MisconductScientific IntegrityNIHFederal Data
- USASpendingFederal SpendingSubawardsTransparencyFederal Data
- FECSuper PACsDark MoneyCampaign FinanceFederal Data
- CongressRoll Call VotesVoteViewPolitical ScienceFederal Data
- Grants.govFederal GrantsResearch FundingNonprofitsFederal Data
- EPADrinking WaterSDWAPublic HealthFederal Data
- Regulations.govRulemakingFederal RegisterPublic CommentsFederal Data
- FHWAHighwayPavementInfrastructureFederal Data
- FAAAviationPilotsAircraftFederal Data
- DOEEV ChargingAlternative FuelsTransportationFederal Data
- USGSWind EnergySolar EnergyRenewable EnergyFederal Data
- SBASmall BusinessLoans7(a)Federal Data
- DOJUS AttorneyFederal ProsecutionCriminal JusticeFederal Data
- SAMHSASubstance AbuseMental HealthTreatmentFederal Data
- PHMSAPipeline SafetyInfrastructureHazardous MaterialsFederal Data
- CDCFoodborne IllnessOutbreaksFood SafetyFederal Data
- OSHAWorkplace SafetyInjury DataLaborFederal Data
- DOJCivil RightsPolice ReformEnforcementFederal Data
- USDAERSFood EconomicsFarm IncomeFederal Data
- CMSMedicarePart DDrug PrescribingFederal Data
- DEAControlled SubstancesRegistrantOpioidsFederal Data
- NRCNuclear SafetyReactorEnergyFederal Data
- CFTCCommoditiesDerivativesEnforcementFederal Data
- HMDAMortgagesFair LendingHousing FinanceFederal Data
- CMSHospital Cost ReportsHealthcare FinanceMedicareFederal Data
- FECCampaign FinanceEnforcementPolitical MoneyFederal Data
- IRSCriminal InvestigationTax FraudFinancial CrimeFederal Data
- CDCNNDSSInfectious DiseasePublic HealthFederal Data
- OSHAWorkplace SafetyViolationsLaborFederal Data
- GAOFederal AuditsCongressOversightFederal Data
- FCCSpectrumRadioWirelessFederal Data
- UFLPAForced LaborSupply ChainCBPFederal Data
- FinCENBSAAMLMoney LaunderingFederal Data
- SAM.govDebarmentsProcurementContractor ExclusionsFederal Data
- FHWABridgesInfrastructureTransportationFederal Data
- NIHResearch GrantsBiomedicalScience FundingFederal Data
- USDASNAPFood StampsNutritionFederal Data
- FEMADisastersEmergency ManagementNatural DisastersFederal Data
- CMSHospital CompareHealthcare QualityMedicareFederal Data
- DOLOFLCH-1BVisaFederal Data
- PACERFederal CourtsJudiciaryLegal DataFederal Data
- BISExport ControlsCommerceSanctionsFederal Data
- TreasuryDTSFederal FinanceBudgetFederal Data
- CensusSAIPEPovertyIncomeFederal Data
- EEOCDiscriminationEmployment LawCivil RightsFederal Data
- FDAFAERSDrug SafetyAdverse EventsFederal Data
- NHTSAFARSTraffic SafetyCrash DataFederal Data
- CPSCRecallsConsumer SafetyProduct SafetyFederal Data
- NIHClinical TrialsDrug ApprovalResearchFederal Data
- CensusCPSPovertyUnemploymentFederal Data
- DEAARCOSOpioidsDrug DistributionFederal Data
- DOLUI ClaimsUnemploymentEconomic IndicatorsFederal Data
- CMSNursing HomesElder CareHealthcare QualityFederal Data
- BLSQCEWPayroll DataEmploymentFederal Data
- BLSCESJobs ReportEmploymentFederal Data
- BOPFederal PrisonIncarcerationCriminal JusticeFederal Data
- CDCDrug OverdoseOpioid CrisisMortalityFederal Data
- DOLForm 5500PensionsRetirement BenefitsFederal Data
- BLSOEWSWagesOccupationsFederal Data
- FRARailroadRail SafetyTransportation SafetyFederal Data
- OPMFederal WorkforceFedScopeCivil ServiceFederal Data
- NIFCWildfireForest ServiceClimate RiskFederal Data
- CFPBConsumer ComplaintsFinancial ServicesConsumer ProtectionFederal Data
- NOAAStorm EventsWeather DisastersClimate RiskFederal Data
- FBINIBRSCrime DataLaw EnforcementFederal Data
- SSASocial SecurityOASDIRetirementFederal Data
- IRSNonprofits501(c)(3)Tax-ExemptFederal Data
- USAIDForeign AidInternational DevelopmentForeignAssistance.govFederal Data
- PCAOBAuditingAccountingSecuritiesFederal Data
- MedicarePart DDrug SpendingCMSFederal Data
- NFIPFlood InsuranceFEMAClimate RiskFederal Data
- FARAForeign AgentsLobbyingNational SecurityFederal Data
- CMSOpen PaymentsPharmaHealthcareFederal Data
NLRB Union Elections and Unfair Labor Practice Data: The Federal Database Behind US Labor Organizing
NLRBLaborUnion ElectionsCollective BargainingFederal Data
- ATFFirearmsCrime GunsGun TraceFederal Data
- FDICBankingBank DataFinancial InstitutionsFederal Data
- FMCSATruckingTransportation SafetyCrash DataFederal Data
- CRSCongressPolicy ResearchLegislative DataFederal Data
- NISTNVDCybersecurityCVEFederal Data
- EPAGreenhouse GasClimateEnvironmentalFederal Data
- DOJAntitrustMergersCompetitionFederal Data
- CDCWISQARSInjuryViolencePublic HealthFederal Data
- USASpendingFederal ContractsFederal SpendingFPDSFederal Data
- Federal RegisterRulemakingRegulationsAPAFederal Data
- FECCampaign FinanceElectionsFederal Data
- CDCNNDSSEpidemiologyPublic HealthFederal Data
- SECForm DPrivate PlacementsVenture CapitalFederal Data
- CMSMedicareDRGHealthcareFederal Data
- FDAOrange BookDrug PatentsGenericsFederal Data
- CDCPLACESPublic HealthSmall Area EstimationFederal Data
- BSEEOffshore SafetyOil GasEnvironmentalFederal Data
- TreasuryFederal BudgetPublic DebtFederal Data
- Federal ReserveInterest RatesTreasury YieldsFederal Data
- CensusPEPPopulationDemographicsFederal Data
- USDAFSISFood SafetyFederal Data
SAIPE produces annual model-based poverty estimates for all 3,100+ counties and 13,000+ school districts — the only single-year official source at that geography. It drives ~$17B in annual Title I-A education funding and the $3.5B CDBG formula. The model combines ACS, IRS EITC filers, SNAP counts, and CPS via small area estimation. The Census API exposes county and school-district poverty rates and median household income back to 1989 via a single endpoint.
CensusSAIPEPovertyEducation FundingFederal Data
The NTD collects annual ridership (UPT), vehicle miles, fares, and expenses from ~800 transit agencies as a condition of FTA grants. US total UPT hit 10.4B in 2023, still below the 15.7B pre-COVID peak. The COVID collapse was severe (NYC subway fell from 1.8B to 600M annual trips), and $69B in emergency relief (CARES + CRRSAA + ARP) kept systems running. Section 5307 formula grants (~$5B/year) are allocated directly from NTD UPT/VRM data.
DOTTransitTransportationFederal Data
The USPTO holds ~3M active registered trademarks, with ~650,000 new applications per year at peak. Federal registration provides nationwide constructive notice, ® usage rights, US Customs blocking of infringing imports, and incontestability after 5 years. The 45 Nice Classification classes span all goods and services. Bulk XML data at bulkdata.uspto.gov and the USPTO Trademark JSON API enable filing trend analysis; China accounts for ~25% of foreign USPTO filings.
USPTOTrademarksIPFederal Data
The SLOOS surveys ~80 large US banks and 24 foreign branches quarterly on changes in lending standards and loan demand. The net percentage (tightening minus easing) is the key signal: it hit +80% for C&I loans in Q4 2008 and +68% in Q2 2020. Net tightening above +50% has historically predicted recession within 4 quarters. FRED series DRTSCILM (large/medium C&I) and DRTSCIS (small firms) extend back to 1990 and are freely accessible via the FRED API.
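The net percentage described above is just a difference of respondent shares. A minimal sketch, assuming hypothetical respondent counts; the function name and the sample numbers are illustrative and do not come from any Fed release:

```python
def net_percentage(tightened: int, eased: int, total: int) -> float:
    """Share of banks tightening minus share easing, in percentage points."""
    if total <= 0:
        raise ValueError("total respondents must be positive")
    return 100.0 * (tightened - eased) / total


# Hypothetical survey round: 80 respondents, 45 tightened, 5 eased, 30 unchanged.
reading = net_percentage(tightened=45, eased=5, total=80)
print(f"net tightening: {reading:+.1f}%")  # net tightening: +50.0%
```

Banks reporting no change count toward the denominator only, which is why the reading can sit near zero even when many banks are moving in opposite directions.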
Federal ReserveCreditBankingFederal Data
The FCC's Universal Licensing System (ULS) holds 25M+ active wireless licenses covering amateur radio (11M+ operators), commercial mobile (AT&T, Verizon, T-Mobile spectrum), public safety, broadcast, microwave, and satellite. Spectrum auctions have raised $160B+ total; Auction 107 (C-band, 2021) alone netted $81B, the largest ever. The National Table of Frequency Allocations (47 CFR Part 2) governs band use. ULS bulk data at ftp.fcc.gov enables license density analysis, and the FCC also maintains broadcast license data (CDBS/LMS) for AM/FM/TV stations.
FCCSpectrumWirelessTelecomFederal Data
The Housing Choice Voucher (HCV) program subsidizes rent for ~2.3 million households at ~$30B/year, administered by ~2,200 local PHAs. HUD publishes Fair Market Rents (FMRs) annually for ~2,600 areas at the 40th percentile of gross rent (2024: NYC 2BR $2,765, rural MS $725). Only ~25% of eligible households receive assistance due to funding caps; waitlists run 1–10 years. The HUD Picture of Subsidized Households (PASH) provides tract-level data on income, demographics, and voucher concentration for spatial analysis.
HUDHousingSection 8Federal Data
The AHS is a biennial panel survey (~60,000 housing units) covering structural quality, condition deficiencies, heating fuel, plumbing, and neighborhood characteristics, the deepest housing-unit dataset in the US. Tracking the same units since 1973 reveals: plumbing inadequacy fell from 4.5% to under 0.5%; owner-occupancy peaked at 69% (2004–05) and troughed at 63% (2016); new single-family median size grew from 1,500 to 2,300+ sq ft. HUD uses AHS microdata for the biennial Worst Case Housing Needs report (8.5M households in 2023).
CensusAHSHousingFederal Data
USDA ERS publishes agricultural economic data across farm income ($116B net farm income in 2023), food prices (monthly CPI food outlook, 2022's +11.4% grocery price surge), food security (13.5% of households food insecure in 2023, 47M people), commodity program costs (ARC/PLC reference prices), and rural America (Beale Codes 1–9 classifying all 3,100+ counties, 180+ rural hospital closures since 2010). The Food Access Research Atlas maps food deserts at the census-tract level.
USDAERSAgricultureFoodFederal Data
The BLS Employment Cost Index (ECI) measures quarterly changes in employer compensation costs (wages + benefits) using fixed employment weights, eliminating the industry-mix distortion that afflicts Average Hourly Earnings. Private-industry wages peaked at ~5.7% YoY in mid-2022 before decelerating to ~4.2% by end-2023; the Fed's comfort level is ~3.5% consistent with 2% PCE inflation. The ECI benefits breakdown (ECEC release) shows health insurance at ~$3.50–$4.00/hour and total benefits at ~31% of compensation. A Q1 2024 upside ECI surprise directly delayed Fed rate cut timing.
BLSECIWagesInflationFederal Data
DOL publishes initial and continuing unemployment insurance claims every Thursday at 8:30 AM ET, covering 53 state programs. Initial claims peaked at 6.87 million for the week ending March 28, 2020, dwarfing the prior record of 695,000 (1982). Pre-COVID lows of ~200,000 (2018–2019) were the lowest since 1969. The 4-week moving average smooths weather and auto-plant retooling noise. FRED series ICSA, ICNSA, and CC4WSA provide full history back to 1967.
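The smoothing mentioned above is a trailing 4-week mean. A minimal sketch with made-up weekly values (not actual DOL figures); the function name is illustrative:

```python
from collections import deque
from typing import List, Optional, Sequence


def four_week_average(weekly_claims: Sequence[float]) -> List[Optional[float]]:
    """Trailing 4-week mean; None until four weeks of data are available."""
    window = deque(maxlen=4)  # the oldest week drops out automatically
    averages: List[Optional[float]] = []
    for value in weekly_claims:
        window.append(value)
        averages.append(sum(window) / 4 if len(window) == 4 else None)
    return averages


# Made-up series with a one-week spike in week 3.
claims = [210_000, 205_000, 230_000, 215_000, 220_000]
print(four_week_average(claims))
# [None, None, None, 215000.0, 217500.0]
```

The spike in week 3 moves the average by only a quarter of its size, which is the point of reading the smoothed series next to the headline number.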
DOLUnemployment InsuranceLabor MarketFederal Data
The Census Bureau Foreign Trade Division compiles monthly import/export statistics from CBP ACE entry data and AES electronic export filings. 2023: goods exports $2.02T, imports $3.08T, deficit $1.06T. Data drills to 10-digit HS/Schedule B codes by country and port. Section 301 China tariffs 2018–2019 reduced the US-China goods deficit from $419B to $279B but shifted sourcing to Vietnam, Mexico, and Taiwan. The Census API (api.census.gov/data/timeseries/intltrade/) and USA Trade Online enable country-HS-month-level analysis.
CensusTradeImportsExportsFederal Data
The National Interagency Fire Center (NIFC) tracks US wildland fire statistics back to 1926. The 10-year rolling average acreage roughly doubled from the 1980s to the 2020s; 2015 and 2020 both exceeded 10 million acres. The Camp Fire (November 2018) killed 85 people, destroyed 18,804 structures, and caused ~$16.5B in insured losses. USFS provides individual fire records 1992–present via the Fire Occurrence Database (~2.3M fires), MTBS satellite burn-severity mapping, and NIFC ArcGIS REST perimeter services.
NIFCWildfireUSFSEnvironmentalFederal Data
Social Security's OASDI program (Old Age, Survivors, and Disability Insurance) paid $1.4T in benefits to ~70 million recipients in 2024, funded by 6.2% FICA payroll tax on wages up to $168,600. The benefit formula converts 35 highest indexed earning years into AIME, then applies progressive bend points (90%/32%/15%) to compute PIA. Full Retirement Age is 67 for those born 1960+; early claiming at 62 permanently reduces benefits 25-30%; delayed claiming to 70 adds 8%/year. The 2024 Trustees Report projects OASI trust fund depletion in 2033, after which revenues cover ~77% of scheduled benefits. SSA publishes 700+ statistical tables in the Annual Statistical Supplement, monthly snapshots at data.ssa.gov, and the Social Security Statement via my.ssa.gov.
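The 90%/32%/15% bend-point step can be sketched as follows. The dollar bend points used here ($1,174 and $7,078, the 2024 values for workers first eligible that year) are an assumption added for illustration and do not appear in the summary above. The sketch takes AIME as given and ignores cost-of-living adjustments and claiming-age adjustments:

```python
# Assumed 2024 bend points in whole dollars of AIME (not stated in the text above).
BEND_1 = 1_174
BEND_2 = 7_078


def pia_2024(aime_dollars: int) -> float:
    """Primary Insurance Amount from AIME using the 90/32/15 formula.

    Works in integer cents (percent x dollars = cents) to avoid float drift,
    then rounds down to the next lower dime as the final step.
    """
    if aime_dollars < 0:
        raise ValueError("AIME cannot be negative")
    tier1 = min(aime_dollars, BEND_1)
    tier2 = max(min(aime_dollars, BEND_2) - BEND_1, 0)
    tier3 = max(aime_dollars - BEND_2, 0)
    cents = 90 * tier1 + 32 * tier2 + 15 * tier3
    cents -= cents % 10  # round down to a multiple of $0.10
    return cents / 100


for aime in (1_000, 5_000, 10_000):
    print(aime, pia_2024(aime))
# 1000 900.0
# 5000 2280.9
# 10000 3384.1
```

The replacement rate falls from 90% at an AIME of $1,000 to roughly 34% at $10,000, which is the progressivity the bend points are designed to produce.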
SSASocial SecurityOASDIFederal Data
The Current Population Survey (CPS) interviews ~60,000 households monthly to produce the official unemployment rate and, via the March ASEC supplement (~95,000 households), the official US poverty rate. The official poverty measure (OPM) uses 1960s Orshansky thresholds adjusted only for CPI ($30,900 for a family of 4 in 2023, 11.1% poverty rate). The Supplemental Poverty Measure (SPM) adds SNAP, housing subsidies, and EITC while subtracting taxes, yielding 12.9% in 2023, a more policy-sensitive figure. Median household income was ~$80,610 in 2023. IPUMS CPS harmonizes all CPS waves back to 1962; the Census API exposes state-level poverty rates programmatically.
CensusCPSPovertyIncomeFederal Data
The BEA's International Transactions Accounts (ITAs) record all economic flows between US residents and the rest of the world. In 2023, the US ran a goods deficit of ~$1.06T, offset partially by a services surplus of ~$293B and net primary income of +$196B, for a total current account deficit of ~$905B (3.3% of GDP). The US's net international investment position stood at -$20.6T, yet the US earns positive net primary income because US assets abroad yield higher returns ("exorbitant privilege"). The BEA ITA API exposes quarterly data on all current account components back to 1960.
BEATradeBalance of PaymentsFederal Data
NOAA's National Centers for Environmental Information (NCEI) archives 150+ petabytes of atmospheric, ocean, and geophysical data serving 25+ billion online requests per year. The Global Historical Climatology Network Daily (GHCN-Daily) covers ~120,000 stations worldwide with daily Tmax/Tmin/PRCP/SNOW back to the late 1800s. NOAAGlobalTemp made 2023 the warmest year on record (+1.45°C above pre-industrial). US Climate Normals (1991–2020) define 30-year averages for 15,000+ stations. NCEI's Billion-Dollar Disasters database counted 28 events totaling $94B in losses in 2023. The CDO REST API provides programmatic access with daily and monthly summary endpoints.
NOAANCEIClimateEnvironmentalFederal Data
The VA disability compensation program pays monthly benefits to ~5.5 million veterans (up from 3.5M in 2010) based on a 0–100% rating using a whole-person combined formula. Here is the 2024 compensation rate table ($171/month at 10% to $3,737 at 100%), the PACT Act 2022 and its 23 new burn pit presumptive conditions (3.5M newly eligible veterans, $280B 10-year cost), the GI Bill (Post-9/11 Ch. 33: tuition cap, BAH allowance, $1K books stipend), the VA Home Loan Guaranty (no down payment, 4M+ loans in FY2022), the claims processing system (884K 2012 peak backlog, three Appeals Reform Act review lanes), VSOs and TDIU (~370K recipients), and the VA Open Data portal with state-level benefits utilization data.
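The whole-person combined formula means ratings are not added: each successive rating applies only to the capacity left after the previous ones. A simplified sketch, assuming a single rounding at the end (the VA's published combined ratings table also rounds intermediate values, and special rules such as the bilateral factor are ignored here); the function name is illustrative:

```python
from typing import Iterable


def combined_rating(ratings: Iterable[int]) -> int:
    """Combine individual disability ratings (in percent) whole-person style."""
    remaining = 100.0  # percent of the "whole person" still unimpaired
    for rating in sorted(ratings, reverse=True):
        if not 0 <= rating <= 100:
            raise ValueError("each rating must be between 0 and 100")
        remaining -= remaining * rating / 100
    combined = round(100 - remaining)
    return (combined + 5) // 10 * 10  # nearest 10, with values ending in 5 rounding up


print(combined_rating([50, 30]))      # 70  (65 before rounding, not 80)
print(combined_rating([50, 30, 20]))  # 70  (72 before rounding)
print(combined_rating([10, 10]))      # 20  (19 before rounding)
```

This is why a veteran with 50% and 30% ratings is paid at the 70% rate rather than 80%.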
VAVeteransBenefitsFederal Data
The USGS National Water Information System runs 8,000+ streamflow gauging stations and feeds NWS River Forecast Centers and the National Water Model (2.7 million reaches, 15-minute forecasts). Here is ADCP gauging methodology, annual peak discharge feeding FEMA Flood Insurance Rate Maps, the Ogallala Aquifer (174,000 sq miles, declining 1–3 ft/year in TX/KS), Central Valley land subsidence from groundwater pumping, the NAWQA water quality monitoring program, water use surveys (thermoelectric power 41% of withdrawals), the 7Q10 low-flow statistic driving NPDES permits, the NWIS REST API (parameterCd/statCd parameter table), and a Python script plotting 5-year discharge with drought-period shading.
USGSWaterEnvironmentFederal Data
OPM manages HR for the 2.1M+ federal civilian workforce via the Central Personnel Data File and the FedScope cube data tool. Here is the General Schedule pay system (GS-1 through GS-15, 10 steps, 48 locality areas ranging from RUS +16.82% to San Francisco +44.15%), the Senior Executive Service (~8,000 career SES, ES pay $148k–$222k), FERS three-tier retirement (defined benefit + Social Security + Thrift Savings Plan at $800B+ AUM, 5% match), the ~100-day federal time-to-hire vs. 45-day private sector, OPM Pathways Programs, the FRB finding of 17–19% total compensation premium for mid-career federal workers, and the 2025 DOGE workforce reduction context.
OPMFederal WorkforceLaborFederal Data
The NSF funds ~25% of all federally funded basic research at US universities (excluding life sciences) with a $9B+ annual budget across 8 directorates. Here is the proposal review process (dual merit criteria: intellectual merit AND broader impacts; funding rates 17–25% by directorate; ~40,000–50,000 proposals/year), the CAREER award ($500k/5 years, highly competitive), the Graduate Research Fellowship GRFP ($37k/year, ~2,000 awards from 12,000+ applicants), the NSF Awards API (api.nsf.gov, 600,000+ awards searchable), National AI Research Institutes ($200M+), the 2023 immediate open-access mandate stricter than NIH's, EPSCoR geographic equity program, and a Python Awards API CAREER grant analysis by directorate and institution.
NSFResearchScience FundingFederal Data
The BTS ATOP/ASQP database covers ~6 million flight records per year from all domestic carriers with 1%+ market share, with delay coded across five cause categories: Carrier (~30-35%), NAS (~30-35%), Late Aircraft (~35-45%), Weather (~5-10%), and Security (<1%). Here is the T-100 domestic/international traffic series (ASM, RPM, load factor), Form 41 carrier financials (CASM, RASM, fuel as 20-30% of costs), the COVID collapse (96% RPM decline April 2020, $54B CARES Act PSP), the Southwest December 2022 meltdown (17,000 cancelled flights, $140M DOT settlement), the 3-hour/4-hour tarmac delay rule, BTS Transtats bulk download, and a Python script to compute monthly on-time rate and cancellation rate by carrier.
BTSAviationTransportationFederal Data
The Federal Reserve Z.1 (formerly Flow of Funds) publishes quarterly financial assets and liabilities for all US economic sectors. Here is the household net worth data ($156T 2021 peak, ~$8T 2022 decline from rate hikes), the Distributional Financial Accounts showing top 1% hold ~31% of wealth vs. bottom 50% at ~3%, the two-sided sectoral balance accounting identity, corporate leverage, Table B.101 residential real estate at market value ($25T to $43T 2019–2024), the $26T+ Treasury liability position, Rest of World holdings, FRED mnemonic guide, and a Python FRED API script pulling household net worth with CPI deflation and NBER recession shading.
Federal ReserveFinanceWealth DataFederal Data
The Census LEHD program links UI wage records for 95%+ of private workers to employer and household records, producing the Quarterly Workforce Indicators (employment/payroll/hires/separations by county × industry × age × sex × education), LODES origin-destination commuting matrices (block-to-block home-work pairs), job-to-job flow statistics (7–10% earnings premium from voluntary job switching), and business dynamics data. Here is how LEHD differs from QCEW/CES/ACS, the COVID remote-work reshaping of OD commute flows, the great resignation mobility spike, OnTheMap and LEHD Explorer tools, and a Python Census QWI API script analyzing young construction worker employment by county.
CensusLaborDemographicsFederal Data
The BEA Regional Accounts allocate national economic totals to states, counties, and MSAs: GDP by State (annual/quarterly, NAICS detail, post-COVID TX/FL leading growth), Personal Income by State (quarterly, five-component decomposition of labor/capital/transfers), Personal Income by County (~3,100 counties annually, CAINC1 table), and GDP by MSA (~380 MSAs, NYC at $2T+ vs. rural laggards). Here is the energy boom-bust signal (North Dakota Bakken GDP doubled 2007–2014 then collapsed), the high-income state tax migration effect (California 13.3% vs. Texas/Florida 0%), transfer payment COVID surge and unwinding, BEA Regional API parameters, and a Python script ranking states by 2010–2024 per-capita personal income growth.
BEAEconomicsRegional DataFederal Data
The USDA National Agricultural Statistics Service conducts 400+ surveys annually, reaching 3 million respondents to produce the authoritative federal record of US crop production, livestock inventories, commodity prices, and agricultural prices since 1867. Here is the Crop Production report, WASDE supply-demand balance sheets, the QuickStats API (eight parameters, 50,000 record limit), weekly Crop Progress with Good/Excellent condition ratings, the five major crops (corn 35% of cropland, soybeans competing with Brazil, winter/spring wheat, cotton, rice), the 2012 drought sending corn to $8.49/bushel and soybeans above $17, Cattle on Feed, Hogs and Pigs quarterly, Prices Received/Paid, and a Python QuickStats API script to plot state-level corn yield per acre for the top 5 producing states over 20 years.
USDAAgricultureCommoditiesFederal Data
The Energy Information Administration is the primary federal authority for US energy data, publishing the market-moving Short-Term Energy Outlook, the Weekly Petroleum Status Report (Cushing OK crude stocks that move WTI crude prices $1–2/barrel), the Natural Gas Storage Report (five-region EIA-912 data), EIA-860 and EIA-923 power plant databases (15,000+ generators, monthly fuel consumption and generation), the Electric Power Monthly, Petroleum Supply Monthly, and the EIA Open Data API (500,000+ series). Here is the 2019 US net petroleum export milestone, the 2022 European energy crisis Henry Hub spike to $9/MMBtu, and a Python EIA v2 API script pulling WTI crude and Henry Hub weekly prices with a dual-axis chart annotating the 2022 spike.
EIAEnergyEconomyFederal Data
The Census Bureau Building Permits Survey and New Residential Construction release track ~20,000 permit-issuing jurisdictions and ~900 construction sample areas monthly, the primary federal leading indicators for US housing activity. Here is the BPS 96% coverage of US construction, SAAR methodology, permits-to-starts ratio dynamics, the 2006 peak at 2.07M SAAR to 2009 trough at 554K to the 2020–2021 surge to the 2022–2023 pullback as mortgage rates went 3% to 7%, the SFH/multifamily bifurcation, Sun Belt concentration (Texas 15–18%, Florida 10–12%), New Residential Sales contract-signed timing, lumber futures (2021 spike to $1,700/MBF), Census BPS API, and FRED series PERMIT/HOUST/HOUST1F/HOUST5F.
CensusHousingEconomyFederal Data
BLS Occupational Employment Data: Wages, Job Counts, and 10-Year Projections for Every US Occupation
The BLS OEWS program publishes wages and employment counts for 830 occupations across 590+ geographies from a 1.1M establishment semiannual survey pooled over 3 years into ~3.3M observations. Here is the data structure (TOT_EMP, hourly/annual wage percentiles 10th–90th, location quotient, entry/experienced wage fields), the Standard Occupational Classification (23 major groups / 459 broad / 867 detailed occupations), top-paying occupations (surgeons $250k+, anesthesiologists, airline pilots), Employment Projections 2022–2032 (fastest-growing: home health aides +924k, NPs, solar installers; fastest-declining: word processors, cashiers), the Occupational Outlook Handbook, O*NET skills crosswalk, wage inequality analysis (90/10 percentile ratio), H-1B prevailing wage connection, and a Python script to analyze healthcare occupation wages from the national OEWS ZIP.
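Two of the derived measures named above, the location quotient and the 90/10 wage ratio, are simple ratios. A minimal sketch with made-up inputs; the function and argument names are illustrative and are not OEWS column names:

```python
def location_quotient(area_occ_emp: int, area_total_emp: int,
                      nat_occ_emp: int, nat_total_emp: int) -> float:
    """Occupation's share of area employment divided by its national share."""
    if min(area_total_emp, nat_occ_emp, nat_total_emp) <= 0:
        raise ValueError("employment totals must be positive")
    # Cross-multiplied form of (area_occ / area_total) / (nat_occ / nat_total).
    return (area_occ_emp * nat_total_emp) / (area_total_emp * nat_occ_emp)


def ratio_90_10(wage_p90: float, wage_p10: float) -> float:
    """90th-percentile wage divided by 10th-percentile wage."""
    if wage_p10 <= 0:
        raise ValueError("10th-percentile wage must be positive")
    return wage_p90 / wage_p10


# Made-up numbers: the occupation is 2% of local jobs but 1% of national jobs.
print(location_quotient(2_000, 100_000, 1_000_000, 100_000_000))  # 2.0
print(ratio_90_10(wage_p90=60.0, wage_p10=15.0))                  # 4.0
```

A location quotient above 1 means the occupation is more concentrated in the area than in the country as a whole.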
BLSLaborWagesFederal Data
The Federal Highway Administration publishes the most comprehensive infrastructure dataset in the federal government: the National Bridge Inventory (620,000+ bridges, biennial inspection, 0–9 condition ratings, sufficiency score), the Highway Performance Monitoring System (pavement IRI, Good/Fair/Poor condition, 900,000+ road segments), Annual Average Daily Traffic counts, and Highway Statistics (registered vehicles, licensed drivers, gas tax revenues). Here is the structurally deficient vs. functionally obsolete distinction, the IIJA 2021 $40B bridge repair program, the Highway Trust Fund solvency crisis (gas tax frozen at $0.184/gallon since 1993, EVs avoiding it), the Freight Analysis Framework commodity-flow OD matrices, and a Python NBI bridge data script to map structurally deficient bridges by sufficiency rating.
FHWATransportationInfrastructureFederal Data
The BLS releases two surveys on “Jobs Friday” (first Friday of each month): the Establishment Survey (580,000 worksites, source of the nonfarm payroll headline) and the Household Survey (60,000 households, source of the unemployment rate). Here is why the two surveys often diverge, how the net birth/death model handles new businesses, the three-tier revision cycle including the annual benchmark (the preliminary benchmark estimate released in August 2024 lowered the March 2024 payroll level by 818,000 jobs), X-13ARIMA-SEATS seasonal adjustment, industry-level dynamics (healthcare adding jobs through every recession, the COVID −20.5M single-month collapse), the 8:30 AM release market impact, and a Python BLS API script to download total nonfarm payroll and plot recession bars.
BLSLaborEconomyFederal Data
The SEC has required XBRL-tagged financial statements from all public companies since 2009–2011, creating a machine-readable database of ~7,000 active filers. Here is the US-GAAP taxonomy (17,000+ concepts, us-gaap/dei/srt namespaces), the three EDGAR APIs (Company Facts for all filings, Company Concept for a single metric over time, Frames for cross-sectional data across all companies in one period), data quality pitfalls (30% custom extension elements, taxonomy changes after ASC 606, fiscal year misalignment), the Beneish M-score fraud detection application, and a Python script using the SEC EDGAR API to extract Apple's revenue and net income history from 10-K filings.
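The three EDGAR XBRL endpoints differ mainly in their URL shape. A sketch of URL builders only, with no network calls; the helper names are illustrative, and the tag and period strings passed in are examples rather than a recommendation. When actually fetching, the SEC expects requests to carry a descriptive User-Agent header with contact information:

```python
BASE = "https://data.sec.gov/api/xbrl"


def cik_segment(cik: int) -> str:
    """EDGAR paths use 'CIK' followed by the CIK zero-padded to 10 digits."""
    return f"CIK{cik:010d}"


def company_facts_url(cik: int) -> str:
    """Every XBRL fact reported by one company."""
    return f"{BASE}/companyfacts/{cik_segment(cik)}.json"


def company_concept_url(cik: int, taxonomy: str, tag: str) -> str:
    """One concept (for example us-gaap NetIncomeLoss) for one company over time."""
    return f"{BASE}/companyconcept/{cik_segment(cik)}/{taxonomy}/{tag}.json"


def frames_url(taxonomy: str, tag: str, unit: str, period: str) -> str:
    """One concept across all filers for a single period (for example CY2023)."""
    return f"{BASE}/frames/{taxonomy}/{tag}/{unit}/{period}.json"


APPLE_CIK = 320193
print(company_facts_url(APPLE_CIK))
print(company_concept_url(APPLE_CIK, "us-gaap", "NetIncomeLoss"))
print(frames_url("us-gaap", "NetIncomeLoss", "USD", "CY2023"))
```

Company Concept suits a single-company time series, while Frames trades history for breadth by returning one period across all filers.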
SECFinanceFinancial DataFederal Data
CMS Care Compare publishes quality data for every Medicare- and Medicaid-certified skilled nursing facility in the US. Here is the five-star composite rating system (health inspection, staffing, and quality measure components), the 3×4 scope/severity deficiency grid (A through L, Immediate Jeopardy at J–L), the Payroll-Based Journal staffing system that replaced self-reported data in 2016, the Minimum Data Set resident assessment that drives both quality measures and PDPM reimbursement, COVID-19’s toll on nursing homes (170,000+ deaths, 38% of early US COVID deaths), private equity ownership transparency gaps, and a Python script to download CMS Care Compare CSV files and compute state-level star rating distributions.
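The A through L grid can be decoded arithmetically: letters run left to right across three scope columns and bottom to top across four severity rows. A minimal sketch, assuming the standard layout in which A–C are the lowest severity level and J–L are Immediate Jeopardy; the function name and the shortened severity labels are illustrative:

```python
from typing import Tuple

SCOPES = ("isolated", "pattern", "widespread")
SEVERITY_LABELS = {
    1: "potential for minimal harm",
    2: "potential for more than minimal harm",
    3: "actual harm",
    4: "immediate jeopardy",
}


def decode_deficiency(letter: str) -> Tuple[int, str]:
    """Map a scope/severity letter (A-L) to (severity level, scope)."""
    code = letter.strip().upper()
    if len(code) != 1 or not "A" <= code <= "L":
        raise ValueError(f"expected a letter A-L, got {letter!r}")
    index = ord(code) - ord("A")
    return index // 3 + 1, SCOPES[index % 3]


for code in ("F", "G", "J", "L"):
    level, scope = decode_deficiency(code)
    print(code, level, SEVERITY_LABELS[level], scope)
# F 2 potential for more than minimal harm widespread
# G 3 actual harm isolated
# J 4 immediate jeopardy isolated
# L 4 immediate jeopardy widespread
```

Decoding the letter into two columns makes it easy to count, for example, all level 3 and 4 citations per facility regardless of scope.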
CMSHealthcareQualityFederal Data
The Bureau of Labor Statistics Survey of Occupational Injuries and Illnesses surveys ~230,000 establishments annually to produce the only national count of workplace injuries and illnesses. Here is the Total Recordable Incidence Rate formula, OSHA recordkeeping requirements (Form 300 Log, 300A Summary, 301 Incident Report), the case-and-demographic microdata for individual injury characteristics, the Census of Fatal Occupational Injuries as the companion fatal census (~5,500/year, construction’s fatal four), the musculoskeletal disorder supplement, the pervasive underreporting problem (academic research shows 40–69% capture rate), and a Python BLS API script to compare TRIR across construction, manufacturing, and healthcare.
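The incidence rate normalizes case counts to 100 full-time workers: recordable cases times 200,000, divided by total hours worked, where 200,000 is 100 workers at 40 hours a week for 50 weeks. A minimal sketch with made-up inputs; the function name is illustrative:

```python
HOURS_PER_100_FTE = 200_000  # 100 workers x 40 hours/week x 50 weeks/year


def trir(recordable_cases: int, hours_worked: float) -> float:
    """Total Recordable Incidence Rate per 100 full-time-equivalent workers."""
    if hours_worked <= 0:
        raise ValueError("hours worked must be positive")
    return recordable_cases * HOURS_PER_100_FTE / hours_worked


# Made-up establishment: 7 recordable cases over 350,000 hours worked.
print(trir(recordable_cases=7, hours_worked=350_000))  # 4.0
```

Using hours rather than headcount keeps the rate comparable across industries with very different part-time and overtime patterns.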
BLSLaborSafetyFederal Data
The EPA Air Quality System aggregates hourly and daily pollutant readings from 4,000+ monitoring sites operated by state, local, tribal, and federal agencies. Here is the six criteria pollutant NAAQS framework (PM2.5, PM10, ozone, CO, SO2, NO2), the 2024 PM2.5 standard tightened to 9 μg/m³, the AQI 0–500 scale and daily worst-of-pollutants calculation, nonattainment designation and State Implementation Plan mechanics, the Harvard Six Cities study and BenMAP health burden model (100,000+ annual PM2.5-attributable deaths), environmental justice monitoring gaps, wildfire smoke exceptional events provisions, and a Python script using the EPA AQS API to download daily PM2.5 readings and identify exceedance days.
EPAEnvironmentPublic HealthFederal Data
HUD’s annual Point-in-Time count, conducted over the last 10 days of January by ~400 Continuum of Care regions, is the only national census of homelessness in the US. Here is the sheltered vs. unsheltered methodology, the 2023 count of 653,100 (the highest since reporting began), California’s 28% share, the Homeless Management Information System as the longitudinal individual-level tracking database, veteran homelessness (37,000+ and the HUD-VASH voucher program), the chronic homeless definition (12+ months or 4+ episodes), methodological limitations (January weather, volunteer variation, doubled-up household exclusion), Housing First policy evidence, and a Python script to download HUD Exchange PIT CSVs and compute per-capita homeless rates by state.
HUDHousingSocial PolicyFederal Data
The federal aviation safety ecosystem spans four major databases: the NTSB accident database (every civil aviation accident since 1962), the FAA AIDS system, the NASA-administered Aviation Safety Reporting System (ASRS: voluntary, confidential, non-punitive near-miss reports), and the FAA Wildlife Strike Database. Here is the NTSB probable cause taxonomy (pilot error 70%+ of GA accidents), the Boeing 737 MAX MCAS investigation, the ASRS reporting immunity mechanism, runway incursion categories, the Miracle on Hudson Canada Goose strike context, the FAA Civil Aviation Registry N-number database, pilot workforce demographics, and a Python NTSB bulk CSV phase-of-flight fatal accident rate analysis.
Federal DataFAAAviation SafetyTransportation
The Nuclear Regulatory Commission publishes quarterly Performance Indicators, inspection findings, and daily Event Notification Reports for all 99 operating US nuclear reactors. Here is the Reactor Oversight Process cornerstones (Initiating Events, Mitigating Systems, Barrier Integrity), the Significance Determination Process (Green/White/Yellow/Red), Licensee Event Reports, the TMI and Fukushima reform trail, probabilistic risk assessment (core damage frequency ~1E-5/reactor-year), the ADAMS document management system with 7M+ public records, the 92–93% nuclear capacity factor record, and a Python NRC PI XML parser to rank plants by unplanned scram rate.
Federal DataNRCNuclear SafetyEnergy
The Bureau of Prisons manages 121 federal prisons holding ~148,000 inmates, down from a 219,000 peak in 2013. Here is the weekly population data, offense category breakdown (drug offenses 43%+, the legacy of mandatory minimums), the racial disparity in crack vs. powder cocaine sentencing before the Fair Sentencing Act 2010, FIRST STEP Act reforms, the BJS National Prisoner Statistics Program covering all US incarceration, US Sentencing Commission case-level sentencing data and disparity research, PACER federal court records, supervised release mechanics, private prison contracting ($700M+/year), ICE immigration detention as a separate civil system, and recidivism data (68% rearrest within 3 years).
Federal Data, DOJ, Criminal Justice, Prison Data
USCIS adjudicates ~8 million petitions annually and publishes detailed statistics on every immigration benefit category. Here is the naturalization data (~800–900K/year by country of birth), the employment-based green card per-country 7% cap that creates 40+ year backlogs for Indian nationals (EB-2 India priority date ~2012), the H-1B lottery (470K registrations for 85K slots in FY2025), the 1.7M+ affirmative asylum backlog, DACA quarterly recipient counts by state, the EOIR immigration court 3.3M+ case backlog with judge-level grant rate variation, DHS Yearbook of Immigration Statistics, and a Python USCIS naturalization Excel workbook analysis.
Federal Data, USCIS, Immigration, Demographics
The FBI Uniform Crime Reporting program collects crime data from ~18,000 law enforcement agencies. Its transition from the legacy Summary Reporting System to the incident-level NIBRS created massive coverage gaps in the 2021 national crime count when major cities failed to report. Here are the eight Part I index crimes, the NIBRS incident/offense/victim/property/arrestee segment structure, the 2020–2021 murder surge (+30% single-year, the largest since national tracking began), hate crime data, LEOKA officer safety statistics, the dark figure of crime and NCVS complement, clearance rates, and the Crime Data Explorer API with a Python state-level murder rate trend analysis.
Federal Data, FBI, Crime Statistics, Public Safety
Every Medicare-certified hospital files an annual Medicare Cost Report with CMS, the only source of audited hospital-level financial data spanning all hospitals regardless of ownership type. Here is the Worksheet structure (S statistical data, A cost centers, B cost allocation stepdown, C reimbursable cost, E Medicare reimbursement), the cost-to-charge ratio used to deflate charge data, DSH disproportionate share payments and the ACA 2014 restructuring, IME and GME teaching hospital adjustments ($12–15B combined), uncompensated care Worksheet S-10, the HCRIS database at NBER, and a Python nonprofit vs. for-profit operating margin analysis.
Federal Data, CMS, Healthcare, Hospital Finance
The SBA publishes loan-level data for all approved 7(a) and 504 loans, the two flagship small business lending programs, covering $30–40B/year in 7(a) guarantees and $8–10B/year in 504 fixed-asset financing. Here is the 7(a) guarantee structure (85% on loans ≤$150K, 75% above, up to $5M), the 504 three-party 50/40/10 split, the loan-level public dataset fields (NAICS, lender, status, charge-off amount, ownership flags), lender concentration (Live Oak Bank, OIG 2014 high-risk lender report), industry default rates, SBIC venture financing, equity and access analysis by minority/women/veteran-owned status, and a Python Socrata API sector default rate analysis.
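The 7(a) guarantee tiers and the 504 split quoted above reduce to a few lines of arithmetic. A minimal sketch, assuming only the percentages in the summary; the function names are hypothetical and real SBA rules carry exceptions (express loans, export programs) that this ignores.

```python
def sba_7a_guarantee(loan_amount: int) -> float:
    """Guaranteed dollars under the tiers quoted above:
    85% on loans up to $150K, 75% above that, with a $5M loan cap."""
    if not 0 < loan_amount <= 5_000_000:
        raise ValueError("this sketch only covers loans between $1 and $5M")
    pct = 85 if loan_amount <= 150_000 else 75
    return loan_amount * pct / 100


def sba_504_split(project_cost: int) -> dict:
    """50/40/10 split: private lender first lien, CDC debenture, borrower equity."""
    return {
        "bank_first_lien": project_cost * 50 / 100,
        "cdc_debenture": project_cost * 40 / 100,
        "borrower_equity": project_cost * 10 / 100,
    }


if __name__ == "__main__":
    print(sba_7a_guarantee(100_000))    # small loan, 85% tier
    print(sba_7a_guarantee(1_000_000))  # larger loan, 75% tier
    print(sba_504_split(2_000_000))
```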
Federal Data, SBA, Small Business, Finance
The BLS American Time Use Survey has tracked 24-hour time diaries for ~10,000 Americans annually since 2003, the only federal dataset measuring time allocation across all life activities. Here are the 17 major activity categories and ATUS Lexicon coding, the gender gap (women average 2+ hours/day more household and caregiving work, men more leisure and paid work), parental intensive childcare trends, the 2020 COVID shift to remote work (42% working from home), leisure inequality by education (TV vs. reading/exercise divergence), the Well-Being and Eating & Health special modules, IPUMS-ATUS for harmonized cross-year access, and a Python weighted gender gap analysis.
Federal Data, BLS, Demographics, Labor Economics
Every FDIC-insured institution files quarterly Call Reports (FFIEC 031/041/051), the primary supervisory dataset, covering ~4,700 banks with balance sheet, income, asset quality, capital adequacy, and liquidity detail. Here is the RC schedule structure (HTM vs. AFS securities, loan categories, deposit types), Schedule RI income statement, Schedule RC-N nonperforming loans and charge-offs, Schedule RC-R capital ratios and PCA thresholds, the SVB warning signs visible in 2022 Call Report data (HTM unrealized losses, concentrated uninsured deposits), the Texas Ratio methodology, FDIC BankFind Suite API, and a Python community-bank screening script.
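The Texas Ratio mentioned above is usually stated as troubled assets over the loss-absorbing cushion. A minimal sketch, assuming the common formulation (nonperforming loans plus other real estate owned, divided by tangible equity plus loan loss reserves); the argument names and the 1.0 flag threshold are illustrative, not Call Report item codes.

```python
def texas_ratio(nonperforming_loans: float, oreo: float,
                tangible_equity: float, loan_loss_reserves: float) -> float:
    """(Nonperforming loans + other real estate owned) divided by
    (tangible equity + loan loss reserves)."""
    cushion = tangible_equity + loan_loss_reserves
    if cushion <= 0:
        return float("inf")  # no cushion left: treat as maximally stressed
    return (nonperforming_loans + oreo) / cushion


def flag_stressed(banks: dict, threshold: float = 1.0) -> list:
    """Return names of banks whose ratio meets the (assumed) threshold."""
    return [name for name, fields in banks.items()
            if texas_ratio(**fields) >= threshold]


if __name__ == "__main__":
    # Made-up balances in $ millions, for illustration only.
    banks = {
        "Bank A": dict(nonperforming_loans=40, oreo=10,
                       tangible_equity=80, loan_loss_reserves=20),
        "Bank B": dict(nonperforming_loans=90, oreo=30,
                       tangible_equity=70, loan_loss_reserves=30),
    }
    print(flag_stressed(banks))
```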
Federal Data, FDIC, Banking, Finance
The BLS Multifactor Productivity (Total Factor Productivity) program measures output growth unexplained by measurable labor and capital inputs: the Solow residual that captures technological progress. Here is the growth accounting decomposition, the historical MFP episodes (1.5%/year golden age 1948–73, the productivity slowdown, the 1995–2004 IT revival, the post-2004 deceleration), the Hall-Jorgenson capital services methodology, the labor productivity vs. MFP distinction and its implications for real wage growth, unit labor costs as the core services inflation driver (peaked 2022, recovered 2023), the AI productivity hypothesis, FRED series IDs (OPHNFB, ULCNFB), and a Python BLS API dual-axis chart.
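The growth accounting decomposition reduces to one subtraction: output growth minus share-weighted input growth. A minimal sketch, assuming a two-input Cobb-Douglas setup with a constant capital share; BLS's actual methodology uses richer capital services and labor composition measures, and the numbers below are made up.

```python
import math


def mfp_growth(y0: float, y1: float, k0: float, k1: float,
               l0: float, l1: float, capital_share: float) -> float:
    """Solow residual in log points:
    dln(Y) - s_K * dln(K) - (1 - s_K) * dln(L)."""
    d_output = math.log(y1 / y0)
    d_capital = math.log(k1 / k0)
    d_labor = math.log(l1 / l0)
    return d_output - capital_share * d_capital - (1 - capital_share) * d_labor


if __name__ == "__main__":
    # Output +3%, capital +2%, labor +1%, capital share 0.3 (illustrative).
    residual = mfp_growth(100, 103, 100, 102, 100, 101, capital_share=0.3)
    print(f"MFP growth: {residual:.4f} log points")
```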
Federal Data, BLS, Productivity, Economics
Medicaid is the largest health coverage program in the US by beneficiary count (~90M people, ~$900B/year), administered by states under federal rules with FMAP matching. Here are the key data sources (monthly enrollment by eligibility group, T-MSIS claims data, MBES expenditure system), the ACA expansion 37-state vs. 13-holdout divide, the COVID continuous enrollment surge from 70M to 95M and the 2023–2024 unwinding that disenrolled millions, FMAP mechanics (50–77% federal match), managed care's 70% enrollment share, dual eligibles ($35K/year cost vs. $8K non-dual), long-term care payment (Medicaid covers 42% of all LTC spending), and a Python Medicaid.gov Socrata API unwinding analysis by state.
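The FMAP mechanics can be sketched from the statutory formula, which ties the federal share to state per capita income relative to the national figure. This assumes the standard form (one minus 0.45 times the squared income ratio, floored at 50% and capped at a statutory 83%) and ignores enhanced rates for expansion populations or temporary bumps; the income figures are invented.

```python
FMAP_FLOOR = 0.50    # statutory minimum federal share
FMAP_CEILING = 0.83  # statutory maximum; observed rates top out lower


def fmap(state_per_capita_income: float, us_per_capita_income: float) -> float:
    """Regular FMAP: 1 - 0.45 * (state PCI / US PCI)^2, clamped to the
    statutory floor and ceiling. Inputs are assumed to be the multi-year
    average per capita incomes used in the official calculation."""
    ratio = state_per_capita_income / us_per_capita_income
    raw = 1 - 0.45 * ratio ** 2
    return min(max(raw, FMAP_FLOOR), FMAP_CEILING)


if __name__ == "__main__":
    us_pci = 60_000  # illustrative only
    for state_pci in (45_000, 60_000, 80_000):
        print(state_pci, round(fmap(state_pci, us_pci), 4))
```

A state at exactly the national average gets a 55% match, and richer states hit the 50% floor quickly because the income ratio is squared.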
Federal Data, CMS, Medicaid, Healthcare
The NLRB conducts ~2,000–2,500 union representation elections annually and publishes detailed results: eligible voters, votes for/against, unit size, industry, and union affiliation. Here is the RC/RD/RM petition taxonomy, the long-run win rate trend (60–65% in the 1950s, 45% post-PATCO in 1981, 70%+ during the 2022–2024 Starbucks/Amazon organizing surge), bargaining unit determination and micro-units, blocking charge mechanics and the 2023 rapid-response election rule (21-day window), captive audience meeting prohibition, card check vs. secret ballot, and a Python analysis of FY2019–2024 election data by union affiliation.
Federal Data, NLRB, Labor Law, Unions
The DOL Wage and Hour Division enforces the FLSA, Davis-Bacon Act, Service Contract Act, FMLA, and child labor laws through ~1,000 investigators nationwide, recovering $200–300M in back wages for 200,000–300,000 workers annually. Here is the WHISARD public enforcement database schema, the FLSA exempt vs. non-exempt classification battle, worker misclassification under the 2024 economic reality rule, H-2A agricultural wage violations, Davis-Bacon prevailing wage enforcement, the Asplundh $95M settlement, FLSA criminal prosecution under 216(a), and a Python sector-level penalty analysis by NAICS code.
Federal Data, DOL, Wage Enforcement, Labor
The BLS Producer Price Index measures average change in selling prices received by domestic producers, the upstream complement to the consumer-facing CPI, with a 2–3 month leading relationship to goods inflation. Here are the three indexing systems (Final Demand PPI launched 2014, Intermediate Demand stage-of-processing pipeline, traditional commodity-based), the trade services margin methodology, the PPI vs. CPI spread as a retailer margin signal, the 2021–2022 supply chain surge (+22.9% FD goods peak), FRED series IDs (PPIFIS, PPIFAF, PPIFAE, PPICOR, PPIACO), BLS API access, and a Python 4-line chart of the inflation episode by component.
Federal Data, BLS, Inflation, Economics
Public Law 94-171 requires the Census Bureau to deliver block-level population data to states for legislative redistricting by April 1 of the year following the decennial census, the foundational dataset for every congressional and state legislative district. Here are the six data tables (P1–P5 and H1), the geographic hierarchy to census block, the one-person-one-vote case law (Reynolds v. Sims, Wesberry v. Sanders), the 2020 apportionment results (Texas +2, New York missed a seat by 89 people), differential privacy and the TopDown Algorithm controversy, the 63-combination race/ethnicity schema, Census API variable naming (P2_006N syntax), VRA Section 2 and the Gingles three-part test, and a Python Census API tract-level racial composition analysis.
Federal Data, Census Bureau, Redistricting, Demographics
The Treasury International Capital system tracks foreign purchases and sales of US securities, the primary federal source on who holds US Treasuries and how capital flows across borders. Here are the four main TIC reports (monthly major holders, TIC-S/TIC-B flow surveys, SHCA annual position survey, SHLA mirror), the top foreign holders (Japan $1.1T, China $800B peak, UK $700B, Belgium/Euroclear anomaly), the custodian country problem, China's “financial nuclear option” analysis, sudden stop risk, 2008 flight-to-safety dynamics, and a Python script to download the monthly major foreign holders Excel.
Federal Data, Treasury, Capital Flows, International Finance
CDC WONDER is the query interface for US death certificate data: every death in America since 1999 coded by ICD-10 underlying cause, linked to place, age, race, and demographic characteristics. Here is the death certificate pipeline, ICD-10 code taxonomy (C codes for cancers, I codes for circulatory, F codes for mental, V–Y codes for external causes), the <10 death suppression rule, age-adjusted rates using the 2000 Standard Population, the three-wave opioid crisis (prescription T40.2–T40.3 to heroin T40.1 to synthetic fentanyl T40.4, ~110K deaths in 2022), Case–Deaton “deaths of despair” research, and COVID-19 U07.1 excess mortality analysis.
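Age adjustment by direct standardization weights each age-specific death rate by that age group's share of a standard population. A minimal sketch of the technique; the two age bands and weights below are invented for illustration and are not the 2000 Standard Population weights.

```python
def age_adjusted_rate(deaths: dict, population: dict,
                      std_weights: dict, per: int = 100_000) -> float:
    """Direct standardization: sum over age groups of
    (standard weight * age-specific rate), scaled to `per` people."""
    if abs(sum(std_weights.values()) - 1.0) > 1e-9:
        raise ValueError("standard population weights must sum to 1")
    return per * sum(
        weight * deaths[group] / population[group]
        for group, weight in std_weights.items()
    )


if __name__ == "__main__":
    # Toy county with an older-than-standard population (made-up numbers).
    deaths = {"0-64": 10, "65+": 90}
    population = {"0-64": 100_000, "65+": 10_000}
    weights = {"0-64": 0.8, "65+": 0.2}  # hypothetical standard weights
    crude = 100_000 * sum(deaths.values()) / sum(population.values())
    print(f"crude: {crude:.1f}  adjusted: "
          f"{age_adjusted_rate(deaths, population, weights):.1f}")
```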
Federal Data, CDC, Mortality, Public Health
The BLS Job Openings and Labor Turnover Survey measures the monthly flow of workers into and out of US employment: job openings, hires, quits, and layoffs across 21,000 establishments. Here are the four core metrics, how the quit rate peaked at 3.0% in April 2022 signaling the hottest labor market in decades, the Beveridge Curve rightward shift that revealed labor market frictions, labor hoarding dynamics in 2023, how JOLTS compares to Indeed and LinkedIn alternative measures, FRED series IDs (JTSJOL, JTSHIL, JTSQUL, JTSLAL, JTSQUR), and a Python fredapi Beveridge Curve plot.
Federal Data, BLS, Labor Market, Employment
The NHTSA Fatality Analysis Reporting System is a complete census of every US traffic fatality since 1975: not a sample, but a record of all 38,000–43,000 annual deaths with linked accident, vehicle, and person detail. Here is the three-table structure (accident/vehicle/person), key variable codes (HARM_EV, MAN_COLL, LGT_COND, DRUNK_DR), the COVID anomaly (miles driven −13% but fatality rate spiked 24%), the alcohol-impaired decline from 20K/year in the 1980s to 10.5K/year, the pedestrian fatality rise from 4,300 to 7,500 since 2010, the CRSS companion for non-fatal crashes, and a Python state-level pedestrian fatality rate analysis.
Federal Data, NHTSA, Traffic Safety, Transportation
Medicare Advantage now covers 51% of Medicare beneficiaries (~33M people) through private insurance plans. Here is the CMS benchmark-bid-rebate payment system, the 40-measure Star Ratings framework, how HCC risk adjustment creates a $10–30B upcoding incentive, the prior authorization controversy (OIG 2022: 13% of denials met coverage criteria), enrollment concentration (UHC 29%, Humana 19%, CVS/Aetna 12%), and a Python market-share analysis by state.
Federal Data, CMS, Medicare Advantage, Healthcare
The IRS Statistics of Income program has published aggregated tax return statistics since 1916, the definitive federal source on income distribution, effective tax rates, deductions, and credits. Here are the individual 1040 AGI class tables, the Piketty-Saez top 1% income share data, EITC distribution, the estate tax stepped-up basis issue, corporate SOI and TCJA effective rate dynamics, and the restricted-use Public Use File for microsimulation.
Federal Data, IRS, Income, Tax Policy
OSHA publishes every workplace inspection, citation, and penalty going back to 1972, covering ~130M US workers in 10M workplaces. Here are the inspection types (unprogrammed complaint-driven vs. programmed NEP vs. fatality follow-up), the citation taxonomy (Willful $156K max through De Minimis), top-cited 29 CFR standards (fall protection chronically #1), the Imperial Sugar explosion, the Amazon injury rate controversy, the State Plan boundary, and a Python sector-level penalty analysis.
Federal Data, OSHA, Workplace Safety, Labor
The Home Mortgage Disclosure Act requires most mortgage lenders to publicly disclose every application, origination, and denial, with loan amount, property location, applicant race/ethnicity, income, pricing, DTI, LTV, and AUS results. Here is the full post-2018 field schema, how CFPB and DOJ use denial-rate mapping to build redlining cases (Trustmark, Cadence, City National), the denial reason codes, HMDA Platform API, CRA examination connections, and a Python disparity-ratio analysis by county.
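A denial disparity ratio is one group's denial rate divided by a reference group's rate in the same geography. A minimal sketch with invented county counts and hypothetical key names; it is a raw ratio with no controls for income, DTI, or LTV, so it works as a screening signal rather than evidence of discrimination.

```python
def denial_rate(denied: int, applications: int) -> float:
    """Share of applications denied."""
    if applications <= 0:
        raise ValueError("need at least one application")
    return denied / applications


def disparity_ratio(group: tuple, reference: tuple) -> float:
    """Group denial rate over reference denial rate.
    Each argument is a (denied, applications) pair."""
    return denial_rate(*group) / denial_rate(*reference)


if __name__ == "__main__":
    # county -> {"group": (denied, apps), "reference": (denied, apps)}
    counties = {
        "County X": {"group": (30, 100), "reference": (10, 100)},
        "County Y": {"group": (12, 100), "reference": (10, 100)},
    }
    for name, counts in counties.items():
        ratio = disparity_ratio(counts["group"], counts["reference"])
        print(f"{name}: {ratio:.2f}")
```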
Federal Data, CFPB, Mortgage, Fair Lending
The American Community Survey sends questionnaires to 3.5 million addresses per year, replacing the decennial long form with continuous annual estimates. Here is the 1-year vs. 5-year distinction, the full social/economic/housing/demographic variable taxonomy, margin of error and coefficient of variation thresholds, Census API variable naming conventions (B19013_001E syntax), key tables for income/poverty/rent/race/commute, and a Python census-tract rent burden analysis.
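ACS margins of error are published at the 90% confidence level, so a standard error is the MOE divided by 1.645, and the coefficient of variation is that standard error over the estimate. A minimal sketch; the reliability cutoffs (12% and 40%) are a commonly used convention assumed here, not an official Census rule.

```python
import math

Z_90 = 1.645  # ACS MOEs are published at 90% confidence


def coefficient_of_variation(estimate: float, moe: float) -> float:
    """CV = (MOE / 1.645) / estimate."""
    return (moe / Z_90) / estimate


def moe_of_sum(moes: list) -> float:
    """Approximate MOE of a sum of estimates: root sum of squared MOEs."""
    return math.sqrt(sum(m * m for m in moes))


def reliability(cv: float) -> str:
    """Label using assumed cutoffs: under 12% high, 12-40% medium, else low."""
    if cv < 0.12:
        return "high"
    return "medium" if cv <= 0.40 else "low"


if __name__ == "__main__":
    cv = coefficient_of_variation(estimate=50_000, moe=1_645)
    print(f"CV = {cv:.3f} ({reliability(cv)})")
    print(moe_of_sum([300, 400]))
```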
Federal Data, Census Bureau, Demographics, Housing
BLS CPI: The Consumer Price Index and the Federal Inflation Measurement Behind Every Policy Decision
The BLS Consumer Price Index has tracked the price level for urban consumers since 1913, the primary US inflation gauge driving Social Security COLAs ($1.4T/year in indexed spending), wage negotiations, and Fed policy. Here is CPI-U vs. CPI-W vs. Chained CPI, the basket weights (shelter 35%, the OER methodology debate), CPI vs. PCE deflator gap, the 2021–2023 9.1% peak episode, FRED series IDs, BLS API access, and a Python chart tracking the inflation episode by component.
Federal Data, BLS, Inflation, Economics
The CDC Behavioral Risk Factor Surveillance System interviews ~450,000 adults per year across all 50 states, the world's largest health survey. Here are the core module variables (obesity, smoking, diabetes, exercise, mental health), the raking weighting methodology, the PLACES MRP small-area estimation project, how the 2011 addition of cell-phone sampling created a trend discontinuity, and a Python approach to computing weighted state-level obesity prevalence from the LLCP XPT file.
Federal Data, CDC, Public Health, Health Surveys
The FHFA HPI tracks single-family home price changes using repeat-sales methodology on conforming mortgages purchased by Fannie Mae and Freddie Mac, going back to 1975 with national, state, MSA, and ZIP code coverage. Here is the weighted repeat-sales methodology, the conforming loan limit boundary, expanded-data HPI with FHA additions, the 40%+ pandemic price surge, FHFA vs. Case-Shiller vs. Zillow distinctions, and a Python script for state-level YoY appreciation rankings.
Federal Data, FHFA, Housing, Real Estate
The IRS requires most 501(c) organizations to file Form 990 publicly, disclosing revenue, expenses, executive compensation, governance, and grant activity for the $3T+ US nonprofit sector. Here are the 990 vs. 990-EZ vs. 990-PF form variants, Part VII executive pay disclosure, the AWS S3 bulk XML dataset (4M+ filings), ProPublica Nonprofit Explorer API, dark money 501(c)(4) tracking, and a Python script to extract financial ratios from IRS 990 XML.
Federal Data, IRS, Nonprofits, Tax Transparency
The Federal Reserve publishes the H.8 every Friday: a weekly aggregate balance sheet for all US commercial banks covering $23T+ in assets, including C&I loans, real estate loans, securities (HTM vs. AFS), reserve balances, and deposit flows. Here is the large vs. small bank breakdown, how the SVB collapse showed as a $98B single-week deposit outflow, H.8 vs. Call Report distinctions, FRED series IDs, and a Python snippet tracking credit cycle signals.
Federal Data, Federal Reserve, Banking, Finance
County Business Patterns is the Census Bureau's annual series on US business activity at the county–NAICS level, published since 1964: establishment counts by size class, mid-March employment, and first-quarter payroll for every county. Here is the Business Register source, noise infusion disclosure methodology, the Nonemployer Statistics companion series, CBP vs. QCEW vs. Economic Census distinctions, Business Dynamics Statistics, Census API access, and how to compute manufacturing location quotients by county.
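A location quotient compares an industry's share of local employment with its share of national employment, so a value above 1 marks a local concentration. A minimal sketch with invented employment counts; real CBP data would also need handling for suppressed or noise-infused cells.

```python
def location_quotient(local_industry_emp: float, local_total_emp: float,
                      national_industry_emp: float,
                      national_total_emp: float) -> float:
    """LQ = (local industry share) / (national industry share)."""
    local_share = local_industry_emp / local_total_emp
    national_share = national_industry_emp / national_total_emp
    return local_share / national_share


if __name__ == "__main__":
    # Made-up numbers: manufacturing is 20% of county jobs vs. 10% nationally.
    lq = location_quotient(2_000, 10_000, 13_000_000, 130_000_000)
    print(f"Manufacturing LQ: {lq:.2f}")
```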
Federal Data, Census Bureau, Business, Local Economy
The National Assessment of Educational Progress is the only nationally representative, continuing assessment of US student achievement, covering reading, math, science, and more for 4th, 8th, and 12th graders. Here is the 0–500 scale and NAGB achievement levels, the COVID-era learning loss evidence (largest reading decline in 30 years), state comparison methodology, the plausible values estimation approach, NAEP Data Explorer API access, and the White–Black achievement gap trend since 1992.
Federal Data, NCES, Education, Assessment
The BLS Occupational Employment and Wage Statistics program covers 800+ occupations across every industry and geography, the most comprehensive source for occupation-level wage percentiles in the US. Here is the survey methodology, full SOC hierarchy, wage percentile fields (10th through 90th), the H-1B prevailing wage Level I–IV connection, OEWS vs. CPS vs. QCEW distinctions, and a Python script for ranking the highest-paid tech occupations.
Federal Data, BLS, Wages, Labor Economics
The USPTO publishes bulk patent grant data (4M+ grants since 1976) and applications (since 2001), with PatentsView as the canonical research dataset: disambiguated inventor and assignee records, CPC classification codes, citation networks, and prosecution history via PEDS. Here are the three patent types, continuation and evergreening strategy, Alice Corp and IPR quality controversies, PatentsView API, BigQuery public data, and a Python snippet for ranking top AI patent holders by CPC subclass.
Federal Data, USPTO, Patents, Intellectual Property
The BEA National Income and Product Accounts are the official measure of US economic output, income, and spending, with each quarter released in three successive vintages: advance, second, and third estimates. Here is the C+I+G+(X-M) expenditure identity, every GDP component in depth, real vs. nominal GDP, GDP by State and GDP by Industry breakdowns, the BEA API query structure, and FRED series IDs as the easiest access path.
Federal Data, BEA, GDP, Economics
The FDA CDER Drugs@FDA dataset tracks every drug approval action since 1939: NDAs for brand drugs, BLAs for biologics, ANDAs for generics. Here are the Orange Book TE codes and patent/exclusivity listings, NCE/3-year/pediatric/orphan/biologic exclusivity mechanics, Breakthrough and Accelerated Approval designations, the Aduhelm controversy, and how to query the OpenFDA drugs API.
Federal Data, FDA, Drug Approvals, Pharmaceuticals
The CMS Medicare Part B Physician and Supplier Public Use File covers 1M+ providers, 12,000+ HCPCS procedure codes, and $400B+ in annual submitted charges. Here is the submitted vs. allowed vs. payment markup ratio, standardized payments removing geographic wage index, the Lucentis/Avastin ASP+6% controversy, the Salomon Melgen $21M ophthalmology fraud, and how to filter anti-VEGF injections to expose the billion-dollar pricing disparity.
Federal Data, CMS, Medicare, Healthcare
The BLS Quarterly Census of Employment and Wages covers 97%+ of US jobs at the county–NAICS industry level, the most granular federal employment dataset available. Here are the QCEW vs. CES vs. LAUS distinctions, the suppression rules for counties with fewer than three establishments, average weekly wage by sector, BLS bulk CSV download structure, and a Python snippet for the highest-wage industries by county.
Federal Data, BLS, Employment, Economics
The Brady Act NICS system has processed 400M+ background checks since 1998, publishing monthly state-level counts of handgun, long gun, and permit check types. Here is the full check type taxonomy, why NICS counts don't equal gun sales, the default proceed loophole that enabled the Charleston shooting, the COVID-2020 and Biden-2021 demand spikes, and how to use the BuzzFeed News parsed CSV.
Federal Data, FBI, Firearms, Public Safety
The Low-Income Housing Tax Credit has financed 50,000+ projects and 3.5M+ affordable units since 1986, the largest US affordable housing subsidy. Here is the HUD LIHTC database schema, the 9% vs. 4% credit mechanics, how State HFA Qualified Allocation Plans shape development geography, the National Housing Preservation Database complement, and how to compute units per capita by state.
Federal Data, HUD, Affordable Housing, Housing Policy
The CFTC publishes weekly open interest broken down by trader category, including Commercial hedgers, Managed Money (hedge funds), and Swap Dealers, for every regulated futures market since 1986. Here are the four COT report formats, how net non-commercial positioning signals crowded trades, the disaggregated vs. legacy format distinction, all covered markets, and how to build a 52-week COT z-score.
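A 52-week COT z-score measures how far the latest net position sits from its trailing one-year mean, in standard deviations. A minimal sketch using only the standard library; it assumes a list of weekly net positions (longs minus shorts) with the newest value last, and the sample series is synthetic.

```python
from statistics import mean, stdev


def cot_zscore(net_positions: list, window: int = 52) -> float:
    """Z-score of the latest weekly net position against the trailing
    `window` weeks (the latest week is included in the window)."""
    if len(net_positions) < window:
        raise ValueError(f"need at least {window} weekly observations")
    recent = net_positions[-window:]
    spread = stdev(recent)  # sample standard deviation
    if spread == 0:
        return 0.0  # flat positioning: no signal
    return (recent[-1] - mean(recent)) / spread


if __name__ == "__main__":
    # Synthetic series: net longs building steadily for a year.
    weekly_net = [1_000 * week for week in range(60)]
    print(f"z = {cot_zscore(weekly_net):.2f}")
```

Readings far from zero are often treated as crowded positioning, but the cutoff is a judgment call rather than something the COT data defines.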
Federal Data, CFTC, Futures Markets, Finance
The OFAC SDN list (~8,000 entries) and Consolidated Sanctions List cover every individual, entity, and vessel that US persons are prohibited from transacting with, with civil penalties up to $1.3M per violation. Here is the full SDN record schema, all major sanctions programs, the 50% ownership rule, the Binance $4.3B landmark penalty, and how to parse and screen the XML list.
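The 50% ownership rule treats an entity as blocked when blocked persons own 50% or more of it in aggregate, even if the entity itself is not listed. A minimal sketch of that aggregation for direct owners only; the names are invented, and a real screen would also walk indirect ownership chains and match against the parsed SDN XML.

```python
def blocked_by_50_percent_rule(ownership: dict, sdn_names: set) -> bool:
    """True if listed persons hold 50% or more in aggregate.
    `ownership` maps owner name -> direct ownership percentage."""
    blocked_stake = sum(pct for owner, pct in ownership.items()
                        if owner in sdn_names)
    return blocked_stake >= 50


if __name__ == "__main__":
    sdn = {"Listed Person A", "Listed Person B"}  # hypothetical entries
    shell_co = {"Listed Person A": 30, "Listed Person B": 25, "Other": 45}
    minority = {"Listed Person A": 49, "Other": 51}
    print(blocked_by_50_percent_rule(shell_co, sdn))   # aggregate 55%
    print(blocked_by_50_percent_rule(minority, sdn))   # aggregate 49%
```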
Federal Data, OFAC, Sanctions, Compliance
EPCRA Section 313 requires 20,000+ industrial facilities to report annual releases of 800+ toxic chemicals: air, water, land, and off-site transfers. Here is the full TRI field schema, the 75% release decline since 1988, the 2024 PFAS additions, how to use the RSEI model for toxicity-weighted population exposure, and how to join TRI to Census ACS for environmental justice analysis.
Federal Data, EPA, Environmental Justice, Chemical Safety
CMS Care Compare publishes quality measures for every Medicare-certified hospital: 30-day mortality and readmission rates, HCAHPS patient experience scores, process compliance, and Medicare spending per beneficiary. Here is the full measure taxonomy, how risk adjustment works, the HAC Reduction Program penalties, Value-Based Purchasing incentives, and how to download and analyze the data.
Federal Data, CMS, Healthcare Quality, Hospitals
Since 2009, every public company files XBRL-tagged financial statements with the SEC, extractable through the EDGAR Company Facts API, the Frames endpoint for cross-sectional screening, and bulk quarterly FSN downloads. Here is the US-GAAP taxonomy structure, the three data quality pitfalls (extension elements, restated periods, unit inconsistencies), rate limits, and how to build a revenue growth screener.
Federal Data, SEC, EDGAR, Financial Data
The Corporate Prosecution Registry (Duke Law) tracks every federal corporate criminal resolution since 1990 (deferred prosecution agreements, non-prosecution agreements, and guilty pleas), covering 400+ resolutions and $30B+ in fines. Here is the DPA/NPA/guilty plea taxonomy, the Yates Memo and Monaco Doctrine evolution, the HSBC and Boeing landmark cases, the compliance monitor system, and FCPA as the dominant enforcement category.
Federal Data, DOJ, Corporate Crime, Enforcement
ForeignAssistance.gov publishes every US government foreign assistance obligation and disbursement across all agencies (USAID, State, PEPFAR, MCC, DoD), covering $50B+ per year since 2001. Here is the full dataset structure, PEPFAR ($110B+ cumulative, 20M+ on antiretrovirals), the top recipient countries, the implementing partner ecosystem, and what the 2025 USAID restructuring means for the data.
Federal Data, USAID, Foreign Assistance, International Development
The PCAOB registers, inspects, and disciplines auditors of public companies, publishing inspection reports on every registered firm's deficiency rate. Here is the Big Four inspection pattern, the KPMG $50M scandal for receiving stolen inspection lists, the HFCAA Chinese auditor crisis and 2022 CSRC breakthrough, and how researchers use deficiency rates as an auditor quality proxy.
Federal Data, PCAOB, Audit, Securities Regulation
CMS publishes provider-level Medicare Part D prescribing data showing every drug prescribed by every provider with 10+ claims: 1M+ providers, 5,700+ drugs, $100B+ in visible prescription spending per year. Here is the full schema, how Part D data exposed the opioid crisis (ProPublica Prescriber Checkup), the GLP-1 agonist cost surge, and how to join it with CMS Open Payments to detect prescribing-payment correlations.
Federal Data, CMS, Medicare, Healthcare
The Physician Payments Sunshine Act requires drug and device manufacturers to report every payment to physicians and teaching hospitals: consulting fees, speaker fees, meals, royalties, research grants, and 22 other categories. Here is the full schema, the $3.5B/year scale, the GSK and Novartis enforcement cases, the peer-reviewed evidence on payment-prescribing correlations, and how to join Open Payments with Medicare Part D data.
Federal Data, CMS, Healthcare, Pharmaceutical Industry
The ATF National Tracing Center processes 500,000+ firearm traces per year, reconstructing the chain of commerce from manufacturer to crime scene. Here is what the Tiahrt Amendment restricts, what aggregated state-level trace data still reveals about the iron pipeline, how time-to-crime exposes straw purchasing, the FFL directory, AFMER manufacturing data, and the ghost gun tracing gap.
Federal Data, ATF, Firearms, Public Safety
The Consumer Product Safety Commission publishes every recall of consumer products: 400–500 per year covering toys, furniture, appliances, nursery products, and 15,000+ product categories not regulated by FDA or NHTSA. Here is the full recall database schema, the SaferProducts.gov incident report system, the IKEA Malm tip-over and Fisher-Price Rock 'n Play landmark cases, and how the CPSIA 2008 transformed product safety data.
Federal Data, CPSC, Product Safety, Consumer Protection
The FEMA National Flood Insurance Program publishes every paid flood claim since 1978 and a snapshot of all active policies, covering 5M+ policies, $1.3T in coverage, and 40-year loss history including Katrina ($16B), Sandy ($8B), and Harvey ($9B). Here is the claims and policy dataset structure, the flood zone taxonomy, the repetitive loss problem, Risk Rating 2.0, and the OpenFEMA API.
Federal Data, FEMA, Flood Insurance, Climate Risk
The Foreign Agents Registration Act requires US-based agents of foreign principals to register with DOJ and disclose their activities and payments, covering lobbying, PR, and political work for foreign governments. Here is the full registration database schema, the Section 613 LDA exemption that lets most commercial lobbying avoid FARA, the Manafort conviction and Mueller-era enforcement surge, and how to query the eFile API.
Federal Data, FARA, Foreign Influence, Transparency
The FDA CDRH publishes every medical device recall action, from Class I (serious health risk) through Class II and Class III, covering 1,000–1,500 recalls per year since 1999. Here is the full field schema, the three recall classes, the DePuy ASR ($4B settlement) and Philips Respironics CPAP (5.5M+ units) landmark recalls, how MAUDE adverse event reports feed recall decisions, and how to query the OpenFDA device recall API.
Federal Data, FDA, Medical Devices, Product Safety
The NCUA publishes quarterly 5300 Call Report data for every federally insured credit union (assets, shares, loans, delinquency, net worth ratios), plus a public enforcement action database covering Consent Orders through Conservatorships. Here is the data structure, the net worth PCA thresholds, the 2009 corporate credit union crisis ($28.5B bailout), and how to download and screen the quarterly data.
Federal Data, NCUA, Credit Unions, Finance
The CFPB has brought 200+ enforcement actions since 2011, covering UDAAP violations, redlining, student loan servicer abuses, and predatory auto lending, with $20B+ in consumer relief and penalties. Here is the enforcement action taxonomy, the UDAAP abusiveness standard, the Wells Fargo $3.7B action, how enforcement trends shift across administrations, and how to scrape and analyze the enforcement database.
Federal Data, CFPB, Consumer Finance, Enforcement
The Bureau of Transportation Statistics publishes monthly counts of every border crossing type at ~290 US land ports going back to 1996: personal vehicles, pedestrians, trucks, buses, trains, and containers broken out by crossing type and port. Here is the full taxonomy, the COVID-19 collapse (pedestrians -93%, trucks -28%), the San Ysidro and Laredo dominance, and how to use the Socrata API for supply chain and trade flow analysis.
Federal Data, BTS, Border Crossings, Transportation
FDAAA 801 requires registration of all applicable clinical trials before enrollment and results submission within 12 months of completion, but 50%+ of trials still fail to report results. Here is the full NCT schema, how to access the AACT PostgreSQL mirror from Duke/CTTI, how to detect publication bias using the results reporting gap, and how the GLP-1 agonist trial explosion looks in the data.
Federal Data, ClinicalTrials, Drug Development, Research Integrity
The FDA 510(k) pathway clears medical devices by showing substantial equivalence to a predicate device, with no clinical trials required. Here is the three-class device system, the K-number database fields, the predicate daisy-chain problem that lets cleared devices drift from the original, the De Novo pathway for novel low-risk devices, the metal-on-metal hip and vaginal mesh controversies, and how to query the OpenFDA device API.
Federal Data, FDA, Medical Devices, Healthcare
The H-2A program (cap-free agricultural) and H-2B program (66,000-cap non-agricultural) bring hundreds of thousands of temporary workers to the US annually. DOL OFLC publishes quarterly disclosure files with employer, job title, wages, worksites, and worker counts. Here is the data structure, how H-2A grew from 60,000 to 370,000+ certifications between 2012 and 2023, and how to compare offered wages against adverse effect wage rates.
Federal Data, DOL, Immigration, Labor Markets
The Office of the Comptroller of the Currency publishes every formal enforcement action against national banks and federal thrifts, from Commitment Letters through Formal Agreements, Consent Orders, and Cease-and-Desist Orders. Here is the enforcement action taxonomy, the BSA/AML enforcement pattern, the Wells Fargo consent order cascade, and how to scrape and analyze the OCC enforcement database.
Federal Data, OCC, Banking Enforcement, Finance
FERC investigates electricity and gas market manipulation with penalties up to $1.4M per day per violation. Here is the enforcement database, the JP Morgan ($410M) and Barclays ($488M) market manipulation cases, how Electric Quarterly Reports expose every bilateral power transaction, and how to search FERC eLibrary enforcement dockets.
Federal Data, FERC, Energy Markets, Enforcement
SAMHSA publishes the N-SSATS facility survey (17,000+ treatment locations with services, ownership, payment accepted, and MAT availability) and the TEDS admissions dataset (patient demographics, substance, prior episodes, referral source). Here is the data structure, how to map OTP density against overdose death rates, and what the data reveals about rural treatment gaps.
Federal Data, SAMHSA, Public Health, Substance Use
The SEC publishes Administrative Proceedings, Litigation Releases, and final orders covering 700–800 enforcement actions per year, with $4–5B in annual disgorgement and penalties. Here is the enforcement record structure, the whistleblower program mechanics, how to scrape and parse the enforcement databases, and how to track administration-level enforcement priority shifts.
- Federal Data, SEC, Securities Enforcement, Finance

The HHS OIG List of Excluded Individuals/Entities (LEIE) bars providers from billing Medicare and Medicaid, with $10,000 per-service penalties for employers that fail to screen. Here is the exclusion type taxonomy, how to download the monthly LEIE CSV, how it differs from SAM.gov EPLS, and how to implement fuzzy-match screening against a provider roster.
- Federal Data, HHS OIG, Healthcare Fraud, Compliance

Public companies must file Form 8-K within 4 business days of any material event, covering 33 item types from earnings releases and executive departures to bankruptcy filings and the new 2023 cybersecurity incident disclosure requirement. Here is the item taxonomy, how to filter EDGAR for specific event types, and how Item 4.02 non-reliance filings signal fraud.
- Federal Data, SEC, Corporate Disclosure, Finance

NHTSA maintains the recall database covering every safety-related defect since 1966: 900M+ vehicles affected, with the Takata airbag inflator recall (70M vehicles, 28+ deaths from metal shrapnel) as the largest in US history. Here is the data structure, the NHTSA complaint-to-recall investigation pipeline, and how to query by VIN.
- Federal Data, NHTSA, Vehicle Safety, Transportation

Every large ERISA plan files Form 5500 annually, covering 750,000+ plans with $10T+ in assets. Schedule C reveals service provider fees that drive 401(k) litigation; Schedule SB tracks pension funding ratios that determine minimum required contributions. Here is the schema, EFAST2 access, and how to compute average expense ratios by plan size.
- Federal Data, DOL, Pensions, Retirement

The Lobbying Disclosure Act requires quarterly filings with the Senate SOPR, covering lobbyist identities, issue codes, specific bills lobbied, and dollar amounts for every registered lobbying engagement. Here is the LDA API, the relationship to FARA and LD-203 contribution reports, and how to connect lobbying spending to legislative outcomes.
- Federal Data, Lobbying, Transparency, Politics

USASpending.gov pulls from FPDS-NG to publish every federal contract action: award type, NAICS/PSC codes, competition type, small business set-asides, and full recipient data. Here is the field structure, how to use the USASpending API, and how journalists trace no-bid contracts, contractor concentration, and September spending anomalies.
- Federal Data, USASpending, Federal Contracts, Government Spending

The EIA publishes Form 923 (monthly plant-level generation and fuel use), Form 861 (annual utility retail sales and pricing), Form 860 (every generator's nameplate capacity and status), and EIA-930 (hourly real-time grid data by Balancing Authority). Here is the fuel mix transformation from 2000–2023 (coal 52% to 16%, gas 17% to 43%, wind/solar near zero to 16%), the ERCOT Texas grid isolation and Winter Storm Uri generation collapse, EIA API v2 structure, and a Python stacked-area chart of the energy transition.
- Federal Data, EIA, Energy, Electricity

The CFPB Consumer Complaint Database tracks 5M+ complaints since 2011 about mortgages, credit cards, debt collection, and credit reporting, with company response, relief status, and consumer narratives. Here is the schema, how credit reporting complaints surged post-COVID, and how the data connects to CFPB enforcement priorities.
- Federal Data, CFPB, Consumer Finance, Enforcement

FINRA BrokerCheck publishes registration history, licenses, employment records, and disclosure events (customer complaints, regulatory actions, criminal disclosures, bankruptcies) for every registered broker and firm. Here is the data structure, the recidivist broker problem, how to access the BrokerCheck API, and how attorneys use it to vet advisers.
- Federal Data, FINRA, Finance, Investor Protection

Section 16(a) requires officers, directors, and 10%+ shareholders to file Form 4 within 2 business days of any stock transaction, creating a near-real-time public record on EDGAR since 2004. Here is the full transaction code taxonomy (code P open-market purchases as the only discretionary signal), the 10b5-1 plan gaming problem and the 2022 SEC amendments, cluster-buying methodology, academic evidence on 6%+ abnormal returns, and a Python screen for officer open-market purchases.
- Federal Data, SEC, Insider Trading, Securities

The FDA Adverse Event Reporting System contains 7 linked quarterly files tracking drug adverse events reported by manufacturers, providers, and consumers, with MedDRA reaction coding, outcome classification, and therapy dates. Here is the schema, how disproportionality analysis (PRR/ROR) detects safety signals, and the Avandia/Vioxx/SSRI signal cases.
- Federal Data, FDA, Drug Safety, Pharmacovigilance

The College Scorecard links IPEDS enrollment data to federal loan records and IRS earnings data, publishing median earnings, debt, repayment rates, and completion rates for every institution and field of study. Here is the data structure, how to use the API, and what the earnings-debt gap reveals about for-profit colleges and high-debt programs.
- Federal Data, Education, College Scorecard, Higher Education

The CISA Known Exploited Vulnerabilities catalog lists CVEs confirmed as actively exploited in the wild, with mandatory federal patching deadlines under BOD 22-01. Here is the catalog structure, how CISA decides what gets listed, how it differs from CVSS severity scoring, and how security teams use it as a minimal-patch prioritization framework.
- Federal Data, CISA, Cybersecurity, Vulnerability Management

The DOL Labor Condition Application dataset and USCIS H-1B Employer Data Hub together reveal the true shape of the skilled-worker visa program: IT staffing companies dominate approvals, India-born workers hold 70%+ of visas, and prevailing wage Level I filings expose systematic wage suppression. Here is the data structure and how to compute employer-level wage ratios.
- Federal Data, USCIS, Immigration, Labor Markets

The False Claims Act is the government's primary anti-fraud tool, with qui tam whistleblowers driving 80%+ of the $2B+ in annual recoveries. Healthcare fraud dominates: Medicare and Medicaid upcoding, kickbacks, and unnecessary procedures. Here is how to access the DOJ settlement database, scrape press releases, and identify repeat violators.
- Federal Data, DOJ, Healthcare Fraud, Enforcement

The NIH Reporter system publishes every grant award: PI, institution, project title, abstract, award amount, IC, and activity code. Here is the activity code taxonomy, how funding flows by Institute/Center, the indirect cost rate mechanics, and how to use the NIH Reporter API to track COVID research spending, opioid funding shifts, and HBCU funding gaps.
- Federal Data, NIH, Research Funding, Science Policy

The CDC publishes overdose mortality through the National Vital Statistics System, CDC WONDER, and monthly VSRR provisional counts, tracking 107,000+ annual drug deaths at the county, demographic, and drug-category level. Here is the ICD-10 code structure, the three waves of the opioid epidemic, racial disparity inversion driven by fentanyl, and how to access the data.
- Federal Data, CDC, Public Health, Opioids

The FDIC publishes a complete failure list covering 4,000+ bank closures since 1934: the S&L crisis wave, the 2008–2012 GFC wave with 500+ failures, and the 2023 SVB/Signature/First Republic episode. Here is the dataset schema, how to use call report data and the Texas Ratio to identify at-risk institutions, and how financial journalists access FDIC BankFind.
- Federal Data, FDIC, Banking, Finance

The FDA publishes every warning letter on its website: pharmaceutical cGMP violations, food safety failures, device adulteration, and clinical investigator fraud. Here is the enforcement hierarchy from Form 483 to criminal referral, how to access and scrape the letter database, and what the record reveals about repeat violators and food safety trends.
- Federal Data, FDA, Healthcare, Enforcement

The Mine Safety and Health Administration publishes three linked datasets: mine listings, accident/injury records, and violation citations going back to 1983. Here is the significant-and-substantial designation, the Pattern of Violations enforcement mechanism, the Upper Big Branch disaster context, and how to join violations to accidents by Mine ID.
- Federal Data, MSHA, Mine Safety, Labor

The US Coast Guard maintains the Boating Accident Report Database (BARD) for recreational vessels and the Marine Casualty and Pollution Database (MCPD) for commercial casualties. Here is what each database contains, how alcohol and life-jacket non-use drive fatality statistics, and how journalists use the data to track manufacturer defects and rental company safety records.
- Federal Data, USCG, Maritime Safety, Transportation

The FMCSA maintains SAFER and MCMIS covering every commercial motor carrier in interstate commerce: three official safety ratings (Satisfactory, Conditional, Unsatisfactory), seven SMS BASICs scoring each carrier as a percentile, inspection counts, OOS rates, and crash data. Here is the data structure, how to access it, and what it reveals about high-risk carriers.
- Federal Data, FMCSA, Transportation, Safety

US Customs and Border Protection and the Census Bureau publish comprehensive import and export statistics by commodity (HTS code), trading partner, port of entry, and month. Here is the data structure, how to access USA Trade Online and the Census Foreign Trade API, and what the data reveals about trade diversion after Section 301 tariffs.
- Federal Data, CBP, Trade, Economics

ICE publishes annual ERO reports covering arrests, detentions, removals, and returns by country of origin, criminal vs. non-criminal designation, and field office. Here is the data structure, TRAC-ICE access, and what the dataset reveals about enforcement priority shifts, nationality composition changes, and the interior vs. border enforcement split.
- Federal Data, DHS, Immigration, Enforcement

The BLS Consumer Price Index for All Urban Consumers tracks monthly inflation going back to January 1913. Here is the expenditure weight breakdown, how CPI-U differs from core CPI and the PCE deflator, how to access it via the BLS API, and what the 2021-2023 surge revealed about shelter inflation measurement and monetary policy transmission.
- Federal Data, BLS, Inflation, Economics

The Social Security Administration publishes annual disability award statistics covering both SSDI and SSI: awards by state, diagnosis code, age group, gender, and decision level. Here is what the dataset contains, how to access it, and what it reveals about geographic variation in award rates, the ALJ hearing backlog, and the Trust Fund solvency timeline.
- Federal Data, SSA, Disability, Social Programs

The National Labor Relations Board maintains a public case management system tracking every unfair labor practice charge filed under the NLRA, 20,000–25,000 annually. Here is the case lifecycle, data structure, how to query the NLRB API, and what the data reveals about the 2022–2024 Starbucks and Amazon organizing surge.
- Federal Data, NLRB, Labor Law, Enforcement

The Job Openings and Labor Turnover Survey tracks monthly job openings, hires, quits, layoffs, and other separations by industry and region. Here is the data structure, BLS API access, and what JOLTS reveals about the Great Resignation, the Fed's rate-hike calculus, and the labor market signals that precede recessions.
- Federal Data, BLS, Labor Markets, Economics

The FTC Consumer Sentinel Network aggregates 8M+ fraud, identity theft, and consumer complaint reports annually from the FTC and dozens of partner organizations. Here is what the dataset contains, how to access it, and what it reveals about imposter scams, cryptocurrency fraud, and the counterintuitive age dynamics of financial loss.
- Federal Data, FTC, Consumer Protection, Fraud

The Federal Railroad Administration publishes two linked databases covering US railroad safety since 1975: Form 54 (all rail accidents, including derailments, collisions, fires, and explosions) and Form 57 (highway-rail grade crossing accidents). Together they cover 250,000+ incidents with train information, track type, speed at accident, casualties, and equipment damage.
- Regulatory data, FRA, Railroad safety, Derailments, Grade crossings, Transportation

The Pension Benefit Guaranty Corporation publishes data on every terminated private-sector defined-benefit pension plan it has trusteed since 1975, over 5,000 plans covering millions of workers. The data reveals which industries have abandoned their pension obligations, how much the PBGC paid out vs. what was promised, and which plan sponsors walked away from the largest underfunded obligations.
- Regulatory data, PBGC, Pensions, Retirement, Labor, Defined-benefit

The USGS National Earthquake Information Center maintains a catalog of every recorded earthquake globally: magnitude 2.5+ events back to 1900, with 100,000+ events per year above M4 globally. Here is the data structure, how to access the API and bulk downloads, and what the catalog reveals about fault hazard zones, the Oklahoma induced seismicity surge from wastewater injection, and historical earthquake patterns.
- Regulatory data, USGS, Earthquakes, Seismology, Natural hazards, Induced seismicity

EPA's Enforcement and Compliance History Online (ECHO) publishes every CAA, CWA, RCRA, and TSCA enforcement case: facility violations, formal actions, penalties assessed, and compliance status for 800,000+ regulated facilities. Here is the data structure, how to query it, and what the database reveals about which facilities violate the most, which industries face the steepest penalties, and where environmental justice and enforcement gaps align.
- Regulatory data, EPA, Environmental enforcement, ECHO, Pollution, Environmental justice

NOAA's Storm Events Database records every significant weather event in the US since 1950: tornadoes, floods, hurricanes, winter storms, heat waves, wildfires, and 50+ other event types with location, deaths, injuries, and property damage estimates. Here is the data structure, how to access it, and what the database reveals about extreme weather trends, geographic risk concentration, and the growing cost of natural disasters.
- Regulatory data, NOAA, Storm events, Climate, Natural disasters, Weather data

The IRS publishes Form 990 filings for political organizations: 527 committees (direct political spending) and 501(c)(4) social welfare organizations (the dark money vehicle). The data covers revenue, expenditures, officer compensation, and political activities for 65,000+ organizations. Here is what the data contains, how to access it via ProPublica Nonprofit Explorer and the IRS bulk XML, and what it reveals about the shadow infrastructure of US political spending.
- Regulatory data, IRS, Dark money, Political organizations, 527, 501c4, Campaign finance

The National Transportation Safety Board publishes a database of every US civil aviation accident since 1962, over 90,000 accidents and incidents with aircraft type, probable cause, phase of flight, weather, pilot certificates, and injury counts. Here is the data structure, how to query it, and what 60 years of accident data reveals about general aviation risks, maintenance failures, and how investigation findings translate into safety rules.
- Regulatory data, NTSB, Aviation safety, Aircraft accidents, Transportation, FAA

The Pipeline and Hazardous Materials Safety Administration publishes incident reports for every significant pipeline accident since 1970: gas distribution, gas transmission, hazardous liquids, and LNG facilities. The database covers 25,000+ incidents with fatalities, injuries, property damage, and commodity spilled. Here is the data structure, how to access it, and what five decades of pipeline incident data reveals about failure patterns, operator accountability, and regulatory gaps.
- Regulatory data, PHMSA, Pipeline safety, Energy, Hazardous materials, Infrastructure

The USDA Food and Nutrition Service publishes monthly SNAP participation and benefit data by state: total participants, households, benefits issued, average benefit per person, and issuance history going back to 1969. The data shows how food assistance responds to recessions, pandemic aid expansions, and state-level work requirement policies. Here is what the data contains, how to access it, and what 50 years of SNAP data reveals.
- Regulatory data, USDA, SNAP, Food assistance, Social programs, Poverty

The Census Bureau's American Community Survey publishes 5-year estimates for every census tract in the US: income, poverty, race, housing tenure, education, employment, and 350+ other variables at the tract level. ACS is the denominator that makes every other federal dataset meaningful: HMDA denial rates per capita, OSHA injury rates per worker, SNAP participation per household. Here is what it contains, how to access it, and how to join it to enforcement data.
- Regulatory data, Census, ACS, Demographics, Economic data, Open data

HUD's Fair Housing and Equal Opportunity office publishes a complaint database covering every fair housing complaint filed with HUD and participating state agencies: basis of discrimination (race, national origin, disability, familial status, sex, religion), property type, complaint disposition, and whether the complainant received relief. Here is the data structure and what 50,000+ complaints reveal about where housing discrimination concentrates.
- Regulatory data, HUD, Fair housing, Housing discrimination, Civil rights, Disability rights

The Bureau of Justice Statistics publishes the National Prisoner Statistics program: state and federal prison populations back to 1925, with demographics (race, sex, age), offense categories, sentence lengths, and admissions/releases flows. Here is the data structure, how to access it, and what 100 years of incarceration data reveals about mandatory minimums, the drug war, and mass incarceration's racial dimensions.
- Regulatory data, BJS, Criminal justice, Incarceration, Prisons, Mass incarceration

OSHA publishes its full inspection and citation database: every workplace inspection since 1972, every violation found, every penalty assessed, and whether the employer contested the citation. The database covers 2.5M+ inspections across all industries. Here is what it contains, how to query it, and what patterns emerge from 50 years of enforcement data.
- Regulatory data, OSHA, Workplace safety, Labor enforcement, Violations, Inspections

The Department of Labor's Wage and Hour Division publishes a public enforcement database covering every concluded investigation: employer name, violation type, back wages owed, employees affected, and civil money penalties. The database covers FLSA minimum wage/overtime, H-2A/H-2B temporary workers, FMLA, and Davis-Bacon prevailing wage violations. Here is the structure, how to query it, and what the data reveals about wage theft patterns across industries.
- Regulatory data, DOL, Wage theft, Labor enforcement, FLSA, Workers' rights

The Fatality Analysis Reporting System (FARS) contains a record for every motor vehicle crash death on US public roads since 1975: 1.1M+ fatalities with vehicle type, crash circumstances, driver behavior, and roadway conditions. Here is the data structure, how to download it, and what it reveals about drunk driving trends, pedestrian deaths, and the safety gap between vehicle classes.
- Regulatory data, NHTSA, FARS, Traffic safety, Vehicle safety, Transportation

The FBI's National Incident-Based Reporting System (NIBRS) publishes incident-level crime data: every offense, victim, offender, arrest, and property loss reported by participating agencies. Here is what the database contains, how it differs from the legacy UCR Summary data, and how to use it for research on offense patterns, racial disparities in enforcement, and geographic hot-spots.
- Regulatory data, FBI, NIBRS, Crime data, Criminal justice, Law enforcement

The FEC publishes bulk data on every contribution and expenditure in federal elections: candidates, PACs, super PACs, and party committees. Here is how to download the full dataset, trace money from donor to expenditure, and identify the shell-company layer that obscures dark money flows.
- Regulatory data, FEC, Campaign finance, Super PAC, Dark money, Political money

HHS-OCR publishes every reported healthcare data breach affecting 500+ patients, the "Wall of Shame." Over 5,000 entries covering ransomware attacks, stolen laptops, unauthorized employee access, and business associate failures. Here is what the database contains and what it reveals about healthcare security failures.
- Regulatory data, HIPAA, HHS-OCR, Healthcare, Cybersecurity, Data breach

The EEOC publishes annual charge statistics and, since 2017, charge-level data under FOIA. The aggregate data shows which industries generate the most race, sex, disability, and age discrimination charges, and which large employers appear repeatedly in the conciliation record.
- Regulatory data, EEOC, Employment discrimination, Civil rights, Labor

After a FOIA fight, the SBA released PPP loan data covering 11.8 million loans and $793 billion in forgiven funds. Here is what the public data contains, the fraud patterns it revealed, and how to cross-reference it with SAM.gov debarments, IRS nonprofit data, and the DOJ prosecution record.
- Regulatory data, SBA, PPP, Pandemic relief, Fraud, Open data

The STOCK Act requires members of Congress to report stock trades within 45 days. The House Clerk publishes scanned PDFs, not structured data. Here is how Quiver Quantitative, Capitol Trades, and journalists have structured this data, and what the disclosures reveal about trading patterns around legislation and committee assignments.
- Regulatory data, STOCK Act, Congress, Trading, Conflicts of interest, Disclosure

The CPSC Recall database covers 9,800+ recalls since 1973. Behind the press releases: how many units are actually returned, which hazard categories dominate, and why the voluntary recall system lets manufacturers negotiate the language of their own enforcement actions.
- Regulatory data, CPSC, Product safety, Consumer protection, Recalls

EOIR publishes quarterly data on every immigration judge's case outcomes, including asylum grant rates. The spread is enormous: some judges grant asylum in fewer than 5% of cases; others grant it in more than 90%. Here is how to access and analyze the data.
- Regulatory data, EOIR, Immigration, Asylum, DOJ, Courts

The Home Mortgage Disclosure Act requires 7,000+ lenders to report every mortgage application: approvals, denials, withdrawals, race, income, loan amount, census tract. Here is how to use the CFPB bulk download to find redlining, reverse redlining, and lender-level denial rate disparities.
- Regulatory data, HMDA, Mortgage, Lending disparities, CFPB, Housing

Section 13(f) requires institutional investment managers with >$100M in 13(f) securities to file quarterly holdings disclosures with the SEC: ~5,000 filers, 45-day lag, long-equity-only view. Here is the full holdings table schema (CUSIP, VALUE, SH/PRN, PUT/CALL, INVESTMENT DISCRETION, VOTING AUTHORITY), what 13F covers and critically excludes (no short positions, no bonds, no foreign-listed shares), major filers (Berkshire, BlackRock, Renaissance), confidential treatment requests, the 45-day stale-data limitation and clone strategy research, academic use (Griffin/Xu 2009, Brunnermeier/Nagel 2004, Edmans 2009), comparison to 13D/13G/Form 4, and a Python EDGAR bulk index parser to track position changes for any manager by CIK.
- Federal Data, SEC, Institutional Investing, Finance

How we indexed 380 million DEA ARCOS controlled-substance transaction records from the opioid MDL discovery release, what the data reveals about pill distribution, and how to cross-reference it against DEA enforcement actions and CDC overdose mortality.
- Regulatory data, DEA, ARCOS, Opioids, Public health

The Corporate Prosecution Registry at Duke and UVA covers 3,000+ federal organizational prosecutions and every DPA/NPA since 1990, including agreements DOJ refused to disclose under FOIA.
- Regulatory data, DOJ, Corporate prosecution, DPA, FOIA

ATF publishes the complete list of ~75,000 active Federal Firearms Licensees monthly as a free CSV. Here's what the data contains, what the Tiahrt Amendment keeps hidden, and how to cross-reference it.
- Regulatory data, ATF, Firearms, FFL, Tiahrt Amendment

foreignassistance.gov went dark on January 31, 2025. What the dataset contained, how it was archived, what the DOGE cuts actually targeted, and where to access it now.
- Regulatory data, USAID, Foreign aid, DOGE, Open data

PCAOB inspection reports contain structured deficiency data for every registered audit firm. In 2023, 26% of Big 4 audits reviewed had Part I.A deficiencies, meaning auditors signed off without sufficient evidence. Here is what the data covers and how to use it.
- Regulatory data, PCAOB, Audit, Big 4, Financial oversight

How to pull, clean, and analyze NLRB union election records: RC and RD cases, the 2021–2024 organizing surge, the 100k export cap workaround, industry breakdowns, and cross-referencing with OSHA and CFPB data.
- Regulatory data, NLRB, Labor, Union elections, Workers

How joining CMS Open Payments (100M+ pharma payments to physicians) with Medicare Part D prescribing data (25M+ provider-drug rows) surfaces the correlation between manufacturer payments and prescribing patterns, and how to cross-reference with HHS OIG exclusions.
- Healthcare data, CMS, Open Payments, Medicare Part D, Pharma

The DOJ buries the FARA bulk download inside an Oracle APEX URL that looks broken. Behind it: daily CSV exports of every DC firm registered to lobby for a foreign government, including who they represent, what they're paid, and what activities they conduct. Here is how to use it.
- Regulatory data, FARA, Foreign influence, Lobbying

FEMA's NFIP claims dataset covers 2.7 million paid flood insurance claims. The "multiple loss properties" subset shows properties paid out more than their assessed value, some 10–15 times. FEMA redacted addresses after journalists used the data to identify specific owners. Here is what's left and what it shows.
- Regulatory data, FEMA, NFIP, Climate risk, Insurance

How we built a 0–100 compliance risk score across OFAC, SAM, OIG, CFPB, SEC, DOJ, FDIC, FINRA, CFTC, EPA, MSHA, FDA warning letters, PCAOB, UFLPA, and 15+ more lists in a single API call.
- Regulatory data, Compliance, OFAC, Entity resolution

How the Federal Regulatory Data Hub resolves entity identity across 30+ compliance lists: three-stage pipeline (identifier join 34%, FTS5 canonical name 41%, Jaro-Winkler fuzzy 18%), false positive taxonomy (same-name different entity 47%, subsidiary-parent 28%, historical name 16%, transliteration 9%), EntityResolutionResult confidence-to-action mapping (MATCH ≥0.90, PROBABLE_MATCH 0.72–0.90), 99.1% recall, 98.7% precision at ≥0.90, and weekly analyst-feedback calibration loop.
- Regulatory data, Compliance, ML, Entity resolution

How the Federal Regulatory Data Hub resolves entity names across 197 federal datasets when identifiers disagree: OFAC alias explosion (44K aliases from 12K entries), SEC EDGAR subsidiary mapping, three-pass fuzzy matching (exact → Jaro-Winkler → TF-IDF cosine), 1.4% combined false positive rate, and how entity_confidence weights the compliance risk score.
- Regulatory data, Entity resolution, Compliance, Data engineering

How the Federal Regulatory Data Hub generates and maintains stable canonical IDs for entities across 197 federal datasets: deterministic SHA-256 ID generation, EntityVersion history for merge and split events, EntityAlias tracking for historical name variants, and subscriber continuity guarantees when source identifiers change.
- Regulatory, Infrastructure, Data Engineering

How we built an entity bridge across 197 federal datasets so a single query returns every SEC filing, FDA warning letter, EPA enforcement case, and OFAC sanction for any company.
- Regulatory data, Entity resolution, Cloudflare D1, MCP

How the Federal Regulatory Data Hub lets compliance teams subscribe to regulatory events for specific entities, using the cross-agency entity bridge to watch OFAC, SAM, SEC, EPA, DOJ, and 25+ other lists simultaneously.
- Regulatory data, Compliance, Infrastructure

How the Federal Regulatory Data Hub detects regulatory record changes and delivers them to subscribers: 10-minute OFAC sanctions window, 30-minute SAM debarment window, EDGAR 8-K filing webhooks, HMAC-signed Cloudflare Queue delivery with at-least-once semantics, per-entity and per-list subscription filters, and idempotency_key deduplication.
- Regulatory data, Compliance, Infrastructure, Cloudflare

What shipped in Swarm SDK v0.4: the Situational Awareness API for shared position and sensor fusion, the EW Coordination protocol for spectrum interference, Adversarial Resilience features including traffic morphing and store-and-forward, and the RF Fingerprinting subsystem for passive emitter tracking. 463 total tests.
- Swarm SDK, Post-quantum, Drone, Cryptography

How the swarm coordination layer maintains a shared operational picture across 128 nodes without a central server: Ed25519-signed 124-byte position broadcast frames, an Extended Kalman Filter fusing GPS/IMU/barometric altitude into a 6-DOF state estimate, dead-reckoning fallback with quadratic uncertainty growth for up to 90 seconds without GPS, and a probabilistic gossip protocol achieving 94.2% frame delivery across a 2km × 2km field deployment.
- Swarm robotics, Embedded Rust, Sensor fusion, Distributed systems

How we ported the Swarm SDK cryptographic core to no_std Rust targeting the STM32H7 Cortex-M7: feature-gated std/embedded builds, 96KB static heap with cortex-m-alloc, pre-allocated VecDeque deduplication ring, in-place AES-GCM to avoid heap allocation, hardware AES accelerator integration (0.14ms vs. 0.61ms software), and binary size optimization from 1.2MB to 284KB with opt-level="z" and LTO.
- Swarm SDK, Embedded, Rust, Cryptography

How the Swarm SDK rotates cryptographic material without grounding the fleet: scheduled signed pre-key rotation on a 7-day timer, OTP replenishment when bundle drops below 20 keys, emergency revocation via gossip-flooded KeyRevocationAnnouncement, BKPSRAM zeroization with 0xFF pattern verification, and staggered rotation coordination across the mesh.
- Swarm SDK, Security, Cryptography

How the Swarm SDK manages cryptographic identity for drone fleets: on-device ML-KEM-768 + X25519 keypair generation at provisioning, three-tier fleet CA hierarchy (Root → Fleet CA → device certificate), pre-provisioned mission cert bundles for offline authentication, signed prekey rotation every 7 days over the gossip mesh, in-flight device revocation via poison-pill RevocationMessage, and emergency wipe on tamper detection.
- Swarm SDK, Cryptography, Post-quantum, Drone

How a Swarm SDK drone goes from factory state to trusted mesh participant: factory-provisioned ML-KEM-768 + X25519 keypairs, CSR generation and Fleet CA signing, USB and RF enrollment paths, gossip mesh announcement with SignedPreKeyBundle, pioneer bootstrap for the first device, and re-enrollment at certificate expiry.
Swarm SDKCryptographyPost-quantumDroneHow we designed the Swarm SDK: ML-KEM-768 + X25519 hybrid post-quantum key exchange, Double Ratchet forward secrecy, gossip mesh routing with bounded fanout, and the path to CNSA 2.0 compliance.
CryptographyPost-quantumDroneSwarm SDKHow the Swarm SDK protects drone mesh communications against traffic analysis — six fixed message size bins, ±15% transmission timing jitter, store-and-forward ring buffer for burst smoothing, degraded-channel operational mode, and RF fingerprint resistance on STM32H7.
Swarm SDKCryptographySecurityHow the Swarm SDK wraps post-quantum encrypted mesh traffic in MAVLink v2 SWARM_MESH_FRAME messages — 18-byte fragment header design, per-message reassembly buffer with 5-second TTL, PX4 and ArduPilot integration, MAVSDK passthrough, and why ML-KEM-768 Sealed Sender envelopes always require 6 frames.
Swarm SDKMAVLinkDroneCryptographyHow the Swarm SDK serializes, fragments, and packs Double Ratchet encrypted messages into MAVLink v2 TUNNEL frames: the SwarmFrame binary header, 237-byte payload limit, fragmentation algorithm, reassembly state machine, CONTROL frame authentication, and STM32H7 performance.
Swarm SDKCryptographyProtocol designHow the Swarm SDK implements the Double Ratchet algorithm for drone-to-drone messaging: adapting Signal Protocol's KDF chains for ML-KEM-768 post-quantum initial key exchange, header encryption, out-of-order message handling with a sliding key cache, MAVLink v2 framing, and performance benchmarks on embedded ARM.
Swarm SDK, Cryptography, Post-quantum, Drone
How the Swarm SDK implements Sealed Sender to hide drone identity from relay infrastructure: recipient-issued SenderCertificate, ephemeral X25519 + HKDF-SHA256 per-message encryption into SealedSenderEnvelope, AES-256-GCM with zero relay-visible sender field, 48-hour certificate TTL, four decryption failure modes (DecryptionError, CertificateExpired, CertificateSignatureInvalid, SenderKeyMismatch), and integration with Sender Keys for group mesh communications.
Swarm SDK, Cryptography, Protocol design
How the Federal Regulatory Data Hub exposes its data through an MCP server with 38+ tools for Claude, GPT, and other AI agents — screen_entity, get_entity, compliance reporting tools, HMAC-signed webhook configuration, rate-limit tiers by plan, and Claude Desktop integration via stdio transport.
Regulatory, MCP, Infrastructure, AI
What shipped in Swarm SDK v0.3: O(1) group encryption with Sender Keys (0.7ms on STM32H7), Sealed Sender hiding drone identity via ML-KEM-768 encapsulation, deniable HMAC authentication, and PKCS7 padding normalization across all AES-GCM operations. 127 new tests (302 total).
Swarm SDK, Cryptography, Post-quantum, Drone
How the Federal Regulatory Data Hub API is designed: no-auth CC0 REST endpoints, cross-agency entity resolution in a single GET, an MCP server with 38+ tools for Claude and GPT agent workflows, and JSON-LD structured data for search indexing.
Regulatory data, API design, MCP, Cloudflare
How the Swarm SDK uses Extended Triple Diffie-Hellman (X3DH) with ML-KEM-768 adaptation for async drone-to-drone session establishment — prekey bundle construction, one-time prekey consumption, Fleet CA bundle verification, and the transition from shared secret to Double Ratchet forward secrecy.
Swarm SDK, Cryptography, Drone, Post-quantum
How the Swarm SDK generates, distributes, and tracks OneTimePreKeys for X3DH session establishment — including OTP exhaustion handling, SignedPreKey rotation, and the gossip-mesh key bundle protocol.
Swarm SDK, Cryptography, Drone, Post-quantum
How we ingest and refresh 197 federal regulatory datasets across 45 agencies using Cloudflare Workers cron, delta detection, schema drift handling, and per-source retry budgets — the ETL behind the Federal Regulatory Data Hub.
Regulatory data, Infrastructure, Data engineering, Cloudflare
How the Swarm SDK MeshTransport layer achieves reliable frame delivery over lossy drone radio links: sliding window ARQ with selective ACK, EWMA RTT estimation, transparent fragmentation and reassembly for Sealed Sender envelopes, multi-channel bonding across 2.4GHz and 5.8GHz radios, and performance benchmarks on STM32H7 and Jetson Nano.
Swarm SDK, Drone, Infrastructure, Cryptography
Swarm SDK gossip mesh: bounded fanout routing, message deduplication, and network partition handling
How the Swarm SDK implements a gossip mesh for drone swarms: epidemic broadcast with k=3 fanout, UUIDv4 sliding-window deduplication across a 1000-ID VecDeque, Lamport clock causal ordering for key management messages, TTL hop limiting with 3-hop lossy-channel headroom, and anti-entropy reconciliation for post-partition recovery — with STM32H7 and Jetson Nano benchmarks.
Swarm SDK, Cryptography, Drone, Infrastructure
An architectural overview of the Swarm SDK: the three-layer design covering gossip mesh epidemic broadcast, ML-KEM-768 + X25519 hybrid post-quantum cryptography with Double Ratchet and Sender Keys, MAVLink v2 framing, and no_std embedded operation on STM32H7.
Swarm SDK, Cryptography, Post-quantum, Drone
How Voidly deduplicates thousands of probe measurements into discrete censorship incidents: the four-tuple clustering key, the 6-hour gap rule, incident lifecycle from ANOMALY to RESOLVED, incident_id assignment, retroactive CensoredPlanet alignment, and edge cases including flapping blocks and BGP outages.
Censorship, Voidly, Methodology, Infrastructure
How Voidly reconstructs the authoritative timeline of a censorship incident from asynchronous distributed probe measurements — IncidentEvent sourcing model, temporal alignment across time zones, confidence weighting requiring 3+ independent probes, retroactive revision from CensoredPlanet batch data, duration statistics, and the timeline REST API endpoint.
Censorship, Voidly, Infrastructure, Methodology
How Voidly determines that a censorship incident has ended: per-type resolution thresholds (consecutive passing measurements with p_blocked < 0.3), the 12-hour RESOLVED_PENDING re-open window, FLAPPING state detection for rapidly alternating blocks, BGP-type auto-resolution, and cross-source confirmation requirements for VERIFIED incidents — with observed resolution time distributions (BGP 4.2h median, HTTP 12.1 days).
Censorship, Voidly, Methodology, Infrastructure
How Voidly embeds ONNX Runtime inside an Apache Flink streaming job to score probe results for censorship anomalies at 50,000 events/sec with sub-100ms end-to-end latency: thread-local ONNX session management per task slot, Kafka partition alignment with (country_code, asn) keyBy, mini-batch coalescing for 50ms p99 inference, and the backpressure mechanism that keeps consumer lag under 2,400 messages even on election-day traffic spikes.
Voidly, Streaming, Machine learning, Flink, Real-time
How Voidly gets from a probe anomaly to a published verified incident — and an alert in a journalist's inbox — in under 8 minutes: the event queue, real-time OONI and IODA API polling, confidence threshold crossing, the two-window alert-fatigue guard, and the nightly CensoredPlanet retroactive pass.
Censorship, Voidly, Infrastructure, Real-time systems
What happens inside a single Voidly probe run: the measurement execution loop, DNS and TCP and TLS and HTTP data capture, result serialization and signing, and the upload path that delivers a signed ProbeResult to the ingest pipeline.
Voidly, Censorship, Infrastructure, Methodology
How Voidly probes maintain connectivity and upload measurements from networks that actively block VPN protocols — QUIC/443 transport, domain fronting via CDN SNI fronting, TLS certificate pinning against MITM, local SQLite buffering (500 MB cap, 48h window), and metered-connection backoff.
Voidly, Networking, QUIC, Infrastructure
How Voidly probes preserve measurement data during upload failures — a 72-hour SQLite ring buffer with anomaly-safe eviction, LZ4 batch compression reducing median batch size from 47KB to 9KB, exponential backoff retry up to 4 hours, priority queue for anomalous measurements, chunked upload with per-chunk acknowledgment, and 0.003% measurement loss rate across 37 probes over 6 months.
Voidly, Infrastructure, Censorship
How the Voidly desktop probe works: Tauri 2 cross-platform app, Cloudflare boringtun WireGuard, tun-rs TUN device, X25519-Dalek on-device key generation, and operator anonymity as a design constraint.
Censorship, Voidly, Infrastructure, Tauri
How the Voidly probe test runner orchestrates concurrent measurements inside the Tauri app: tokio Semaphore with 3 permits, MeasurementState machine (Pending → Running → Success/Error/Timeout), per-layer timeout budgets (DNS 3s, TCP 5s, TLS 8s, HTTP 15s, total 30s), Ed25519 measurement signing, mpsc upload queue with capacity 200, and why per-layer timeouts are themselves evidence of DNS-layer interference.
Censorship, Voidly, Infrastructure, Rust
A step-by-step breakdown of how each Voidly probe test works: DNS resolution, TCP handshake, TLS negotiation with certificate chain validation, HTTP request execution, response body fingerprinting, control comparison, and how every layer maps to interference types in the anomaly classifier.
Censorship, Voidly, Methodology, Infrastructure
A deep dive into the TCP layer of Voidly's censorship detection: SYN-ACK timing, RST injection detection with a 15ms threshold, null-routing vs. RST as two distinct censorship mechanisms, the TcpResult struct, dual-IP probing to identify RST source, and how TCP evidence maps to the anomaly classifier's interference classes.
Censorship, Voidly, Methodology
How Voidly uses a distributed control server network to distinguish genuine censorship from network errors, CDN split-horizon DNS, and misconfigured sites — DNS, TCP, TLS, and HTTP comparison methodology, and why a single control is not enough.
Censorship, Voidly, Methodology, Infrastructure
A technical deep-dive on how Voidly detects bandwidth throttling — the hardest interference class to classify. Covers the TimingFeatures Rust struct, TTFB z-score computation against control measurements, body truncation and mid-transfer RST signals, the congestion vs. deliberate-throttling calibration problem, cross-probe corroboration scoring, and country patterns from Russia TSPU, Iran ARRS, India, and China.
Censorship, Voidly, Methodology, Infrastructure
How Voidly monitors 37+ probe nodes: heartbeat system (60s cadence, separate transport), DEGRADED/OFFLINE state machine, measurement quality scoring, ASN coverage SLOs for 200 countries, flapping detection capping confidence at CORROBORATED, automated replacement from standby operator waitlist, and the classify_offline_cause() algorithm distinguishing probe failure from ISP-level censorship.
Censorship, Voidly, Infrastructure, Methodology
How Voidly probes identify DNS injection and manipulation in censored networks — comparison against three control resolvers, four weighted detection signals (IP divergence, TTL anomaly, source IP divergence, response timing), per-country injection rates (China 94%, Iran 61%, Russia 12%), CAP_NET_RAW privilege handling, anycast false-positive calibration from 4.2% to 0.8%, and integration with the DnsTestResult confidence score.
Censorship, Voidly, Methodology, Infrastructure
How Voidly classifies every censorship measurement into one of 7 interference types — DnsInjection, DnsNxdomain, TcpRstInjection, TcpNullRouting, TlsMitm, HttpBlockPage, and Throttling — using a hierarchical decision tree from DNS through HTTP, with confidence scoring, protocol layer priority, and an Indeterminate category for ambiguous evidence.
Censorship, Voidly, Methodology
How Voidly avoids false positives from commercial geoblocking: HTTP 451 detection, streaming service block page fingerprints (tagged geoblock_commercial, not censorship), multi-country probe comparison (SINGLE_COUNTRY vs. MULTI_COUNTRY_SELECTIVE geographic patterns), CDN split-horizon detection via ASN group mapping, domain-level unavailability baselines, and the p_geoblock score that suppresses measurements above 0.70.
Censorship, Voidly, Methodology, Infrastructure
How Voidly correlates three independent measurement projects at scale — data format normalization, 4-hour sliding window alignment, independence-weighted confidence scoring, and handling source disagreements.
Censorship, Voidly, OSINT, Verification
How Voidly probes detect network middleboxes: an HTTP echo test sending custom X-Voidly-Echo headers to a Voidly-controlled server to detect transparent proxies via injected Via/XFF headers, TCP RST injection timing analysis using four heuristics (arrival time, TTL mismatch, zero window, absent TCP options), a vendor signature library with 47 confirmed fingerprints (TSPU/Sandvine/Huawei Hi-SEC/GFW/Cisco), and the middlebox_events TimescaleDB hypertable showing 18-hour median lead time between middlebox detection and censorship anomaly onset across 31 countries.
Voidly, Network measurement, Middlebox detection, DPI, Censorship
A deep dive into the TLS layer of Voidly's censorship detection: full certificate chain extraction with rustls, government CA list (China MoI, Iran MICT, Kazakhstan NCA), MITM detection via fingerprint mismatch, TLS alert timing analysis (RST < 15ms = injected), SNI-based blocking detection via dual-SNI probing, ECH/ESNI measurement, and how TLS failure maps to interference_type classifier outputs.
Censorship, Voidly, TLS, Methodology
How Voidly built and maintains the 2,300-entry block page fingerprint library used to identify ISP and government censorship block pages: four matching strategies (exact SHA-256 hash, structural normalization, SimHash locality-sensitive hashing, TLS certificate fingerprinting), the match pipeline cascade, block page collection from OONI confirmed events and probe captures, per-country library composition (Turkey 47, Iran 312, Russia 189, China 8), false positive mitigation for CDN error pages and captive portals, and integration with the lf_http_blockpage_hash Snorkel label function.
Censorship, Voidly, Methodology, Infrastructure
How the four Voidly measurement layers compose into a single ProbeResult struct: sequential DNS → TCP → TLS → HTTP execution with the control measurement running in parallel, the None-vs-Some failure propagation convention distinguishing “not attempted” from “attempted and failed”, a failure mode table mapping six layer-outcome combinations to censorship types, and deterministic control vantage selection by domain hash to stabilize body_sha256 comparison across measurement cycles.
Voidly, Network measurement, Protocol stack, Probe infrastructure
A deep dive into the DNS layer of Voidly's censorship detection: dual-resolver design (ISP resolver vs. neutral control), four interference types (NXDOMAIN injection, IP spoofing, empty answer, timeout), the compare_dns_results() algorithm, known injection IP database (China 18 IPs, Iran 3, Turkey 2), CDN geofencing false positive mitigation via ASN group matching, DNSSEC validation limitations, and DoH/DoT diagnostic queries.
Censorship, Voidly, DNS, Methodology
A complete field-by-field guide to the Voidly CC BY 4.0 measurement dataset — probe identity, DNS/TCP/TLS/HTTP layers, control comparison, ML classification output, BGP signals, corroboration fields, and filtering recipes for journalists and ML researchers.
Censorship, Voidly, Data engineering, Open data
How Voidly publishes its measurement corpus to external researchers: a keyset-paginated NDJSON streaming API with (ts, measurement_id) cursor and Server-Sent Events mode, nightly PyArrow Parquet generation sorted by (domain, ts) for 60% I/O reduction on single-domain queries with zstd level-3 compression, atomic HuggingFace Dataset Hub push with dataset card regeneration, and classifier_version tagging to keep probability distributions comparable across model updates.
Voidly, Open data, API design, Parquet, HuggingFace
Voidly's TimescaleDB continuous aggregates: pre-aggregating 2.2B probe measurements for fast queries
The three-level TimescaleDB continuous aggregate hierarchy behind Voidly's sub-10ms query latency: measurement_hourly (15-minute refresh), country_daily_summary (1-hour refresh), and country_monthly_stats (daily), cutting a 7-day country query from 4.1 seconds to 4ms. Covers refresh policy configuration, late-arriving probe data handling (94.2% within 1 hour, 98.7% within 24h), compression interplay after 7 days, asn_hourly_summary design, and manual backfill procedures.
Censorship, Voidly, TimescaleDB, Infrastructure
The full path from raw probe bytes to a queryable TimescaleDB record: protobuf over QUIC, Cloudflare Worker validation, Kafka fan-out, Rust normalization, probe-version schema drift handling, quality filtering (3.2% drop rate), and nightly Parquet export to HuggingFace.
Censorship, Voidly, Infrastructure, Data pipeline, Kafka
How Voidly ingests BGP data from RIPE NCC RIS, RouteViews, and bgp.tools: MRT format parsing, per-country baseline computation, withdrawal detection thresholds, BgpEvent records in TimescaleDB, and how bgp_outage_score is attached to probe measurements.
Voidly, Censorship, BGP, Infrastructure
How Voidly uses BGP prefix withdrawal patterns and IODA data to detect internet shutdowns before any probe can send a packet — baseline per-country reachability, the difference between BGP silence and withdrawal, and how BGP fits into the composite confidence score.
Censorship, Voidly, BGP, Infrastructure
How Voidly uses CAIDA AS-Rank, RIPE NCC RIS route collector data, and PeeringDB to build an AS-level topology, classify censorship choke points (IXP, transit AS, edge ISP), compute per-country probe diversity scores, and feed AS path features into the anomaly classifier.
Censorship, Voidly, BGP, Infrastructure
How Voidly tracks the full history of blocking events for individual domains across all probe countries — DomainMeasurementSummary continuous aggregate, first/last-seen tracking, the /v1/domains/{domain}/history API, temporal pattern analysis (23% of blocks resolve within 7 days), cross-country blocking correlation, and domain freshness scoring.
Censorship, Voidly, Infrastructure, Methodology
How Voidly uses per-ASN probe vantages to distinguish nationwide censorship orders from selective ISP-level blocking — BGP peer classification from CAIDA AS-Rank, ISP blocking fingerprints by interference type, differential blocking detection, and propagation speed analysis that reveals enforcement mechanisms.
Censorship, Voidly, BGP, ISP
How Voidly aggregates per-measurement interference probabilities into per-country censorship scores: recency decay with a 30-day half-life, ASN diversity weighting, domain category weighting, cross-source corroboration multipliers, 90-day rolling windows, Gaussian temporal smoothing, and bootstrap confidence bands.
Censorship, Voidly, Methodology, Data engineering
How Voidly aligns OFAC sanctions packages, EU/UN designation timelines, and bilateral diplomatic signals with measured internet shutdown events — building the diplomatic-isolation feature for the shutdown forecasting model.
Censorship, Voidly, Methodology
How the Federal Regulatory Data Hub ingests the OFAC Specially Designated Nationals list — daily conditional GET with ETag, XML parsing across 12K SDN entries with alias explosion, name normalization pipeline, FTS5 + Jaro-Winkler three-pass screening, and p50 8ms / p99 28ms screening latency against the SDN list alone.
Regulatory data, Compliance, OFAC, Data engineering
A deep dive into the feature engineering behind Voidly's 7-day internet shutdown forecasting model: political calendar integration (election dates, protest intensity via GDELT), OFAC sanctions timeline features, BGP withdrawal rate, probe measurement rate drops as precursor signals, historical shutdown patterns, and XGBoost SHAP feature importance across 200 countries.
Censorship, Voidly, ML, Forecasting
How we build a 7-day predictive model for internet shutdowns across 200 countries: political calendar features, network telemetry, ARIMA + XGBoost ensemble, and per-country reliability scoring.
Censorship, ML, Forecasting, Voidly
How Voidly aggregates calibrated per-measurement censorship probabilities into country-level shutdown risk signals: a three-stage aggregation hierarchy (ASN-domain hourly → domain → country), exponential decay weighting with 48-hour half-life over a 14-day window, a 28-feature forecast vector with risk score time series and ASN block concentration, and the Kafka voidly.forecast.features topic handoff to the Bayesian shutdown forecasting service.
Voidly, Machine learning, Forecasting, Censorship detection
How Voidly calibrates its anomaly classifier separately for each country — Platt scaling logistic regression fitted on per-country holdout predictions, F2-weighted threshold tuning per class, 30-day rolling calibration windows, and calibration case studies: Iran DNS tampering fires at threshold 0.62 (consistent single-authority blocking); China DNS tampering requires 0.74 (CDN split-horizon noise).
Censorship, Voidly, ML, Methodology
How Voidly retrains its five-class censorship anomaly classifier on a weekly cadence: time-based train/val/test splits to prevent temporal leakage, SMOTE resampling for class imbalance, PSI drift detection, champion/challenger shadow deployment, and the canary rollout process.
Censorship, Voidly, ML, Methodology
How Voidly serves the anomaly classifier as a live inference API — feature extraction from raw probe measurements in under 5ms, ONNX Runtime for portable model serving, five-class output with per-class probabilities, Cloudflare Worker routing to regional inference nodes, model versioning with champion/challenger shadow mode, and the latency budget that keeps end-to-end probe-to-verdict under 50ms.
Censorship, Voidly, ML, Infrastructure
How Voidly converts a trained XGBoost censorship classifier to ONNX for serving inside a Rust ingestion service: the sklearn-to-ONNX export pipeline with zipmap=False for zero-copy float32 probability tensors, ONNX Runtime session configuration with per-thread isolation and L3 graph optimization, opset 17 pinning with metadata validation, and batch inference benchmarks achieving p99 under 50ms at batch size 200 on 4 vCPUs.
Voidly, Machine learning, ONNX, Inference, XGBoost
How Voidly transforms raw probe measurements into the 47-feature vector that feeds the anomaly classifier: the ControlDelta struct, DNS features (NXDOMAIN injection, bogon IPs, known injection IPs), TCP features (RST timing, SYN-ACK count), TLS features (MITM cert detection, alert codes), HTTP features (blockpage SimHash score, body length ratio), and the LRU control cache design that prevents doubling probe cost.
Censorship, Voidly, ML, Methodology
How Voidly probes adapt their measurement schedule to device resource constraints: four constraint checks (battery floor, thermal throttle, cellular daily cap, unknown network), sliding-window cellular data accounting with per-minute SQLite buckets, adaptive cycle length that scales domain count to remaining budget via a 28,000-byte-per-measurement estimate, and a priority queue scoring domains on staleness (0.50), config priority flag (0.35), and anomaly recency (0.15).
Voidly, Probe infrastructure, Scheduling, Mobile
How Voidly protects probe operators in jurisdictions that criminalize censorship measurement: strict data minimization (no name, address, or IP logging), WireGuard peer-key authentication, daily probe ID pseudonymization, optional Tor hidden service upload, measurement scrubbing, country-tier legal risk assessments, and a one-tap emergency stop with full data erasure.
Censorship, Voidly, Security, Infrastructure
How Voidly selects and maintains the domains it probes for censorship: Citizen Lab's global test list, 12 OONI category codes, per-country supplemental lists, the measurement budget problem, and why the test list is a political document.
Censorship, Voidly, Methodology, Data engineering
How a new Voidly probe operator goes from application to publishing measurements: on-device X25519 key generation in the Tauri app, probe registration and ASN verification, 48-hour warmup period with calibration measurements, quality scoring at promotion, and what happens when warmup calibration fails.
Censorship, Voidly, Methodology, Infrastructure
How the Federal Regulatory Data Hub enforces per-client and per-tier rate limits at 8,000 req/s without a centralized counter store: a five-tier quota table (free/researcher/compliance/vendor/internal), token-bucket burst enforcement in Cloudflare KV with ETag-based conditional writes and fail-open after three race retries, and sliding 24-hour window daily quota counting using per-minute KV buckets with a short-lived summary cache for the common below-quota path.
Regulatory data, Cloudflare Workers, Rate limiting, API infrastructure
How the Federal Regulatory Data Hub serves 35M records via Cloudflare Workers: 8 vertical D1 shards by agency group, Promise.all fan-out for cross-agency queries, entity bridge join across CIK/UEI/LEI/DUNS/NPI, FTS5 full-text search for narrative datasets, response caching with TTL table by endpoint type, and p50/p99 latency budget including partial-response fallback when a shard is unavailable.
Regulatory data, Cloudflare D1, Infrastructure, API design
How the Federal Regulatory Data Hub implements bitemporal versioning across 35M regulatory records in Cloudflare D1: the valid_from/valid_until row-version pattern using half-open intervals, an append-only record_versions audit table with before-state JSON payloads, AS-OF query rewriting in the Workers router using the idx_sdn_pit covering index for sub-5ms p99, three screening modes (current/as-of/historical), and keyset-paginated NDJSON snapshot export for retroactive batch compliance screening.
Regulatory data, Cloudflare D1, Data versioning, Compliance
How the Federal Regulatory Data Hub monitors the freshness of 197 federal datasets and alerts on staleness: per-source FRESHNESS_CONFIG with expected_cadence and max_staleness_hours, D1 dataset_ingests staleness query, Cloudflare Cron */5 * * * * staleness check, multi-channel alerting (Slack webhook, email, PagerDuty) with KV deduplication, OFAC ETag monitoring with 90-minute publish-delay alert, five ingest error classes, and public /status endpoint.
Regulatory data, Infrastructure, Cloudflare, Data engineering
How Voidly selects and distributes its probe vantage network: why ASN diversity matters more than geographic spread, the operator safety constraints that shape high-risk country probes, and how we reach countries where most people connect on mobile-only networks.
Censorship, Voidly, Methodology, Infrastructure
How Voidly delivers measurement configuration to probes without a persistent control channel: gzip+CBOR bundles signed with Ed25519 (signature verified before decompression to prevent zip-bomb attacks), a pull-based auto-update scheduler with 6-hour intervals and exponential backoff, version pinning and two-snapshot rollback, and anonymous country tokens derived via BLAKE3 from ISO code + epoch-week salt so the CDN cannot correlate which overlay a probe applies.
Voidly, Probe infrastructure, Configuration management, Security
How Voidly protects probe operator identity while publishing full measurement data: probe_id derived as SHA-256(public_key_bytes) with zero IP logging, human-readable codename system (450K+ combinations, no joint table with probe_id), measurement anonymization (probe_cc + probe_asn published; IP never stored), per-probe Ed25519 signing with isolated key store, and 12-country extra protections (4–48 hour publication delay, 90-day probe_id rotation).
Censorship, Voidly, Methodology, Infrastructure
How the Federal Regulatory Data Hub manages alias proliferation across OFAC SDN, SEC EDGAR, and FinCEN BSA: a five-type alias taxonomy (AKA/FKA/NFE/PHONETIC/VESSEL), entity_aliases DDL with FTS5 virtual table and covering indexes, a normalization pipeline with iterative legal-suffix stripping and NFKD ASCII transliteration, double-Metaphone phonetic bucket generation, and a four-pass resolution pipeline (exact 71.4% → phonetic 88.2% → FTS5 96.1% → edit-distance 98.7% cumulative recall on 2.4M aliases).
Regulatory data, Entity resolution, Sanctions, Data engineering
How the Federal Regulatory Data Hub resolves company identity across five incompatible federal identifier schemes: three-pass resolution strategy (exact ID join, alias table lookup, TF-IDF fuzzy name matching), the entity_master bridge table schema, company name normalization to remove legal suffixes, false positive rates by method, special cases for healthcare NPI arrays and foreign entities, and how the entity bridge achieves p50 38ms cross-agency query latency.
Regulatory data, Entity resolution, Cloudflare D1, Data engineering
The full schema design behind the Federal Regulatory Data Hub: eight vertical D1 databases (securities 9.2M, financial-crimes 4.1M, healthcare 6.8M, labor-safety 3.4M, environment 2.9M, transportation 4.6M, enforcement 2.1M, infrastructure 2.9M), OFAC SDN and EPA enforcement table DDL with FTS5 virtual tables, entity_master bridge with shard_presence bitmask, covering indexes vs. FTS5 trade-offs, and the Workers queryEntityAllShards() Promise.all fan-out achieving p50 38ms cross-shard entity queries.
Regulatory data, Cloudflare D1, Infrastructure, Data engineering
How the Federal Regulatory Data Hub implements full-text search across 35M records using SQLite FTS5 in Cloudflare D1: virtual table creation with the unicode61 tokenizer and content= shadow-table pattern, BM25 scoring with weighted columns (10× entity_name, 5× description, 1× narrative), highlight() and snippet() functions for context extraction, buildFts5Query() TypeScript alias expansion with legal suffix stripping, Promise.all cross-dataset fan-out across 5 D1 shards, trigger-based index maintenance, and weekly optimize via Cloudflare Cron.
Regulatory data, Cloudflare D1, Infrastructure, SQLite
How we built a 35M-record federal regulatory database on Cloudflare D1 — per-vertical SQLite tables across 197 datasets, daily cron ingest, FTS5 for free-text datasets, and vertical sharding past the 10GB limit.
Regulatory data, Cloudflare D1, Infrastructure, SQLite
How Voidly manages storage for 2.2B probe measurements using a three-tier TimescaleDB retention policy — full-resolution hot tier (0-30 days), native-compressed warm tier (31-365 days, 6.2x ratio), and downsampled cold tier (>365 days, aggregates only), with continuous aggregate cascade, pg_cron compliance verification, and R2 tiered storage planned for Q3 2026.
Censorship, Voidly, Infrastructure, Data Engineering
How Voidly stores and queries 2.2 billion censorship probe results in TimescaleDB: hypertable design with 1-day chunk intervals and secondary country partitioning, 6.2× compression, continuous aggregates for country-level daily summaries, three-tier retention (hot/warm/cold), and query benchmarks for anomaly detection.
Censorship, Voidly, TimescaleDB, Infrastructure, PostgreSQL
How Voidly's corroboration engine fetches and aligns data from three independent sources in near-real-time despite their different latency profiles: tokio::join! parallel fetches with per-source timeouts, adaptive OONI polling (15m/60m/3h/6h), in-memory CensoredPlanet daily dump index, independence-weighted source agreement scoring, and retroactive nightly reprocessing against the CP daily dump.
Censorship, Voidly, Infrastructure, Data engineering
How the Voidly MCP server exposes 83 tools for querying the global censorship dataset from Claude, GPT, and agent frameworks — incident lookup, measurement queries, country summaries, BGP events, shutdown forecasts, and wiring it into Claude Code.
Censorship, Voidly, MCP, Infrastructure
How the nightly Voidly export job extracts measurements from TimescaleDB and pushes Parquet snapshots to HuggingFace Hub: PyArrow schema with dictionary-encoded columns, server-side cursor streaming at 50K rows per round-trip, Zstandard level 3 compression, country + year_month partitioning, atomic HuggingFace commit with CommitOperationAdd, post-push SHA-256 verification, and the incremental vs. monthly full-snapshot strategy.
Censorship, Voidly, Data engineering, Open data, Infrastructure
How the Voidly CC BY 4.0 measurement dataset and the OONI historical corpus are hosted on HuggingFace — Parquet snapshot structure, daily incremental updates, git-lfs versioning, and Python/R filter recipes for journalists, ML researchers, and infrastructure teams.
Censorship, Voidly, Open data, HuggingFace, Infrastructure
How a Voidly censorship incident progresses through six states — Anomaly, MultiSourceAnomaly, Corroborated, VerifiedIncident, Resolved, FalsePositive — with exact transition thresholds, timing data from 847 incidents in 2024 (67% stuck at Anomaly, 18% reach VerifiedIncident), IncidentRecord struct, publication timing by tier, how lifecycle state encodes into HuggingFace dataset fields, and retroactive state change handling via incident_history.
Censorship, Voidly, Methodology
How a Voidly measurement moves through three confidence tiers — Anomaly, Corroborated, Verified Incident — and what each tier means for journalists, ML researchers, and infrastructure monitoring teams using the dataset.
Censorship, Voidly, Methodology, Data quality
How the Voidly SSE streaming endpoint delivers censorship events in real time: GET /v1/stream with country/tier/type filtering, four event types (incident_created, incident_updated, incident_resolved, country_status_change), Last-Event-ID reconnection with 24-hour event ring buffer, Python httpx.Client and JavaScript EventSource examples, and how SSE differs from the webhook delivery system.
Censorship, Voidly, API design, Infrastructure
How the Voidly API handles authentication: two access tiers (public 60 req/hr and keyed), voidly_{env}_{base58} key format with PBKDF2-HMAC-SHA256 storage, D1 + KV request authentication flow, four plan tiers (Free/Research/Professional/Enterprise), HMAC-SHA256 webhook signature verification, key rotation without downtime, test keys for CI, and OAuth2 for third-party integrations.
Voidly · Infrastructure · API
How the Voidly REST API is designed: key endpoints for incident lookup, measurement queries, country summaries, domain history, BGP events, and 7-day shutdown forecasts; cursor-based pagination, filtering, rate limits, and code samples in curl, Python, and JavaScript.
Censorship · Voidly · API design · Infrastructure
How Voidly gets verified censorship incidents to journalists, researchers, and monitoring systems: HMAC-signed webhook delivery with exponential-backoff retry, PGP-encrypted email for verified alerts, per-country and per-confidence-tier RSS feeds, alert deduplication by incident_id, and rate-limiting to prevent fatigue.
Censorship · Voidly · Infrastructure · Real-time systems
How Voidly transitions a censorship incident through five states (Anomaly/MultiSourceAnomaly/Corroborated/Verified/Resolved) with threshold-gated transitions, stores every state change as an append-only event in a TimescaleDB hypertable with SHA-256 idempotency_key, and fans out verified incidents to alert delivery and cache invalidation via three Kafka topics, plus the compute_incident_id() Rust function that makes incident IDs deterministic across pipeline restarts.
Voidly · Censorship · Infrastructure · Kafka
How Voidly schedules 80-domain probe runs across 37+ nodes: domain priority tiers by OONI category code, anomaly-driven priority boosts, protocol selection per domain, ±15% jitter for anti-detection, ASN distribution to ensure cross-ASN coverage, adaptive scheduling that injects urgent re-measurements on anomaly detection, and per-country task budgets (CN 68, IR 74, RU 72, global avg 49 tasks/window).
Censorship · Voidly · Methodology · Infrastructure
How Voidly ingests 200M+ OONI Explorer measurements, aligns them with Voidly probe data on a country-domain-date key, generates probabilistic training labels using five Snorkel-style label functions, handles OONI coverage gaps with label distillation, and constructs the labeled dataset that trains the five-class anomaly classifier.
Censorship · Voidly · ML · Methodology
How the Voidly ML classifier distinguishes DNS tampering, TLS interference, HTTP blocking, BGP withdrawal, and throttling: five per-class binary models, country-specific calibration, and why 95% recall beats 95% precision when cross-source corroboration filters the noise.
Censorship · Voidly · ML · Infrastructure
How Voidly evaluates the five-class censorship anomaly classifier offline before deployment: the ClassifierEvaluator test harness, per-country AUC-PR vs. AUC-ROC tradeoffs for imbalanced censorship data, F2 scoring rationale, per-country confusion matrix case studies (Iran 0.97 DNS recall, China 0.78 precision from CDN noise, Russia TSPU throttling), ECE calibration before and after Platt scaling, and model promotion criteria.
Censorship · Voidly · ML · Methodology
How Voidly uses uncertainty sampling, Cohen's kappa inter-annotator agreement, and weekly model retrains to grow its censorship anomaly training set from 127K bootstrap labels to 275K: 500 examples/week annotated by 3 researchers each, with DVC data versioning and PSI drift detection.
ML · Voidly · Active Learning · Annotation
How Voidly constructs a labeled training dataset for the anomaly classifier from 200M+ OONI measurements: weak supervision with Snorkel-style label functions across DNS/TCP/TLS/HTTP layers, class imbalance handling with SMOTE and log-weighting, time-based train/val/test splits to prevent leakage, per-country Platt scaling calibration, and the continuous retraining pipeline.
Censorship · Voidly · ML · Data engineering
How the quality filter pipeline decides which raw measurements are fit for ML training: boolean checks for control_failure (1.9% drop rate, caused by ISP blocks on control server IPs in CN/IR/RU), missing_fields (0.8%), old probe version pre-2.5.0 (0.3%), and duplicates (0.2%), totalling 3.2% dropped. Includes the quality_filter() Python function, the to_feature_input() schema transformation, and why rejected measurements go to quarantine, not discard.
Censorship · Voidly · ML · Data engineering
How Voidly normalizes 200M+ OONI measurements across five web_connectivity schema versions (v0.2 to v0.6) into a single ML-ready format: a detect_web_connectivity_version() function using field-presence inference, AnomalyType and ConfidenceTier enums, the OoniMeasurementNormalized dataclass, FLAG_* bitmask constants for DNS/TCP/TLS/HTTP anomaly encoding, side-by-side normalize_v05() vs. normalize_v06() implementations, and a 95.3% pass-through rate from the drop-reason table.
Censorship · OONI · Data engineering · ML
How we processed the OONI raw measurement archive into a flat ML-ready CSV: handling probe version schema drift across 12 years, normalizing test_keys across 20 measurement types, streaming 200M+ records, and what we decided to leave out.
Censorship · OONI · Data engineering · HuggingFace
How Voidly attributes censorship infrastructure to specific DPI vendors using network signatures and open-source intelligence: a six-vendor signature table (TSPU/Sandvine/NetClean/Iran ARRS/Cisco IronPort/GFW), DpiVendorSignature dataclass with a score_signature_match() function weighting RST timing (0.35), block page (0.30), injection IP (0.25), and CA SPKI (0.10), procurement scraping with PROCUREMENT_SOURCES across five government tender portals, BGP TTL-hop attribution, and case studies for Russia, Iran, and Ethiopia.
Censorship · OSINT · DPI · Infrastructure
How we build persistent cross-platform entity profiles for OSINT: passive collection from 40+ sources, graph-based identity disambiguation with calibrated edge weights, Certificate Transparency log monitoring, BGP/ASN change tracking, stylometric fingerprinting, and operational security architecture for researchers in hostile environments.
OSINT · Reconnaissance · Entity resolution · Infrastructure
How Voidly identifies the hardware and software responsible for internet censorship: blocking architecture taxonomy (L3/L4/L7-DNS/L7-HTTP), DPI vendor signatures from timing patterns (Russia's TSPU RST < 3ms, Iran's ARRS DNS injection IPs, China's GFW TTL fingerprinting), ISP-level blocking fingerprints (Rostelecom vs. MTS vs. Turkcell), TTL analysis for middlebox distance, OSINT cross-referencing with procurement records, and the censorship_infrastructure dataset field.
Censorship · Voidly · Methodology · OSINT
How we built a censorship-resistant VPN for Voidly probe operators: GFW/IRGC/TSPU threat model, WireGuard inside HTTP/2 CONNECT domain-fronting over CDN edges, 48hr entry-node IP rotation via Cloudflare KV, traffic morphing (Laplace timing jitter + packet-size CDF matching + cover traffic), 22-dim XGBoost on-device routing with ONNX, BlockageDetector for RST injection, and 99.3% DPI evasion across CN/IR/RU.
Censorship · VPN · ML routing · DPI evasion · WireGuard
How the AI Analytics OSINT pipeline extracts, disambiguates, and stores named entity mentions from 58M social media posts per day: GPU-accelerated NER, Wikidata QID linking, cross-language transliteration, and person co-reference resolution.
OSINT · ML · NLP · Infrastructure
How we collect and normalize social media data from 47 platforms into a canonical post format: three-tier collection strategy (official APIs, ActivityPub, RSS/scrape), token-bucket rate limiting with circuit breakers, FastText language detection at ingest, content-hash deduplication, and Kafka topic partitioning by platform.
NLP · Infrastructure · Kafka · OSINT
NLP models powering the OSINT platform at 667 posts/second: FastText lid.176 language detection (99.7% EN accuracy), custom SpaCy NER fine-tuned on 2.3M labeled examples across 7 political entity types (91.4% macro F1), DistilBERT fine-tuned on 5M examples with INT8 ONNX quantization (94.7% macro F1, 28ms GPU), MinHash character 4-gram coordinated-campaign detection (89% precision), and the social signal integration with Voidly censorship event detection.
NLP · DistilBERT · SpaCy · Infrastructure · OSINT
How the OSINT platform detects bot accounts across 14 languages without retraining per language: an 8-feature BotFeatureVector (posting_interval_entropy via Shannon formula, reply_outdegree_ratio, content_cluster_density, age_velocity_zscore, quote_to_original_ratio, url_recycling_rate, cross_platform_correlation, bio_change_count_90d), Redis-bucketed perceptual hash matching (Hamming ≤ 8 across 1024 hash buckets), XGBClassifier with StratifiedGroupKFold on language groups, and per-language Platt scaling achieving F1 0.883–0.908 across all 14 languages.
OSINT · ML · NLP · Bot detection
How we detect coordinated amplification campaigns across 58M daily posts: MinHash LSH (128 hash functions, 16 bands, Jaccard threshold 0.80) for content similarity, Redis sorted-set burst detection (≥5 accounts within 15 minutes, inverse-sqrt account age weighting), seven account-feature logistic regression, network amplification ring detection via cycle enumeration, cross-platform timing joins, and a 0–100 coordination score with 70/90 thresholds for human review and auto-flagging.
OSINT · NLP · Infrastructure · Elections
How the election intelligence pipeline resolves FEC committee identity across 1.3M records: the 10-code committee type taxonomy (H/S/P/X/Y/N/Q/O/I/U), a JointFundraisingCommittee dataclass with JFCAllocation and resolve_jfc_participants() from Form 99, normalize_entity_name() with iterative legal-suffix stripping, a four-pass resolution table (exact ID 63.4% → exact name 82.1% → alias 91.7% → TF-IDF char 3-gram 95.5% cumulative recall), and LLC chain disambiguation via FinCEN/EDGAR/SOS cross-reference.
Elections · Entity resolution · FEC · Data engineering
Anomaly detection across 47 races in 23 states: Benford's law with magnitude-range validity checks, XGBoost turnout model (20 features, SHAP attribution, MAD-based z-scores, 3.1pp MAE), ARIMA(2,1,2) reporting-curve detection, DBSCAN campaign finance clustering (near-identical amounts + 3-day burst), and full triage workflow (12 flags → 9 explained, 2 false positives, 1 persistent).
Elections · Statistics · XGBoost · Benford · OSINT
The statistical methods behind AI Analytics' election anomaly detection: first-digit analysis, last-digit uniformity testing, turnout z-scores, and why these signals require cross-validation with social and media data before generating an alert.
Elections · ML · Methodology · Statistics
How the election intelligence pipeline ingests AP Election API feeds, state authority data (JSON/CSV/HTML scraping), social media signals, and media coverage in real time: Kafka election.precinct_results topic (50 partitions by state FIPS), PrecinctResult protobuf schema, state scraper StateScraperConfig, ElectionSentimentConsumer, narrative divergence scoring, FIPS normalization edge cases (Connecticut planning regions, Alaska districts), and p50/p99 latency targets for all four streams.
Elections · Infrastructure · Kafka · NLP
Kafka partition key design, binary COPY writes to TimescaleDB, character 4-gram MinHash LSH distributed across Redis, autoscaling on consumer lag, and a canonical normalization layer across 47 platform schemas: the full pipeline behind 58M posts/day.
Kafka · TimescaleDB · NLP · Infrastructure
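
For readers who have not met the character 4-gram MinHash LSH named in the pipeline entry above, here is a minimal sketch of the general technique. It is not the production code: the function names (`shingles`, `minhash_signature`, `estimate_jaccard`, `band_keys`) are hypothetical, the seeded BLAKE2b hash family is an assumption, and the defaults of 128 hash functions and 16 bands are simply the figures quoted in the coordinated-amplification entry. The Redis distribution step is left out entirely.

```python
import hashlib

NUM_HASHES = 128  # figure quoted in the listing; everything else here is assumed
BANDS = 16        # 128 / 16 = 8 signature rows per band


def shingles(text, n=4):
    """Return the set of character n-grams after lowercasing and whitespace folding."""
    text = " ".join(text.lower().split())
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def _hash(shingle, seed):
    """One member of a seeded hash family (assumption: BLAKE2b truncated to 64 bits)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def band_keys(signature, bands=BANDS):
    """Split the signature into bands; two posts sharing any key become candidates."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


if __name__ == "__main__":
    a = minhash_signature("Breaking: polls close early in district nine tonight")
    b = minhash_signature("BREAKING: polls close early in district nine tonight!")
    c = minhash_signature("A recipe for sourdough bread with rye flour")
    print("near-duplicate estimate:", estimate_jaccard(a, b))
    print("unrelated estimate:", estimate_jaccard(a, c))
    print("shared bands:", len(set(band_keys(a)) & set(band_keys(b))))
```

In a distributed setup, each band key would typically be written to a shared store so that posts landing in the same bucket are compared in full. How the pipeline actually lays that out across Redis is covered in the linked article, not here.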