SEC Financial Statement Facts: The Structured XBRL Behind Every 10-K and 10-Q

Buried inside every 10-K and 10-Q a public company files is a second copy of its financial statements—not the formatted tables a human reads, but a structured, machine-readable version in which each number is a labelled fact. Revenue, net income, total assets, cash, long-term debt—every line item is tagged to a standardized accounting concept and tied to the company and the period it describes. The SEC publishes that structured layer as financial-statement facts, and it is the quiet foundation beneath automated fundamental analysis, stock screeners, and the commercial datasets that power research and fintech: the machinery that turns narrative filings into a queryable time series of company financials.

This article covers what the financial-statement-facts dataset is and the XBRL technology that produces it; the 2009 SEC rule that mandated structured reporting and the phase-in that explains why a decade-plus of comparable fundamentals now exists; the anatomy of a single fact (its concept, value, unit, period, and the form it came from); the US-GAAP taxonomy that supplies the standard concepts and the company-specific extensions that complicate it; how the data is published through the EDGAR company-facts API and the bulk Financial Statement Data Sets; how the facts join to the EDGAR company registry by CIK; the analytical uses, from screening to time-series fundamental analysis; a Python workflow that pulls a company's facts by CIK and extracts a concept's history; and the caveats that every analyst must internalize before trusting the numbers: company-chosen tags, custom extensions, restatements, and the gap between a tagged number and an audited financial statement.

What the dataset is

XBRL—eXtensible Business Reporting Language—is the structured data format the SEC requires public companies to use when they file their financial statements. The idea is simple and consequential: rather than leaving the numbers locked inside a formatted document that only a person can read, each reported figure is wrapped in a machine-readable tag that says what it is, what it is measured in, and what period it covers. The SEC ingests those tagged filings and republishes the contents as company facts and as the Financial Statement Data Sets—flat, structured extracts of every financial-statement line item across all filers. The result is that the balance sheet, income statement, and cash-flow statement of every reporting company become data you can query, rather than prose you have to parse.

In our database this structured layer is stored as the table sec_financial_facts, holding roughly 156,000 reported financial facts—a deliberately bounded slice of the far larger XBRL universe, which runs to many millions of facts across every filer and every period. The grain is one row per reported fact: a single number, for a single concept, for a single company, for a single period. A company's revenue for fiscal 2023 is one fact; its total assets at the end of fiscal 2023 is another; its revenue for fiscal 2022, reported again as a prior-year comparative in the fiscal-2023 filing, is yet another. The columns capture the company, the concept, the value, the unit, the period, and the source filing:

cik              -- Central Index Key: the filer's permanent EDGAR id
concept          -- the taxonomy element, e.g. us-gaap:Revenues
value            -- the numeric value reported for the fact
unit             -- USD, USD/shares, shares, pure (ratio), etc.
period_start     -- start of the period (durations: income / cash flow)
period_end       -- end of the period, or the instant (balance sheet)
fy               -- fiscal year the fact belongs to
fp               -- fiscal period: FY, Q1, Q2, Q3
form             -- the form it came from: 10-K, 10-Q, 8-K, etc.
filed            -- the date the filing was submitted to EDGAR
accession        -- the filing's unique accession number

Three columns do most of the work. The cik, the Central Index Key, is the permanent identifier EDGAR assigns to every filer; it is the join key that ties a fact to the company that reported it and to everything else that company has ever filed. The concept is the standardized accounting element (us-gaap:Revenues, us-gaap:NetIncomeLoss, us-gaap:Assets) that says what financial quantity the number represents, drawn from the US-GAAP taxonomy discussed below. And the period distinguishes the two fundamentally different shapes a financial fact can take: a duration (an income-statement or cash-flow figure that covers a span of time, with a start and an end) versus an instant (a balance-sheet figure that is true at a single point in time, with only an end). Confusing the two, treating a balance at a moment as if it were a flow over a year, is one of the easiest and most damaging mistakes in working with the data. Together, CIK, concept, and period identify which financial quantity a fact describes. They do not identify a single row: the same quantity is reported again in later filings, sometimes with a revised value, so it takes the filing as well (its accession number and filed date) to pin down one reported instance.
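The duration-versus-instant distinction can be made explicit in code. The sketch below is one possible in-memory shape for a row, not the table's actual definition: the class name Fact, the CIK, and the values are hypothetical, and it assumes an instant is stored with no period start.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """One reported number plus the context that gives it meaning."""
    cik: int
    concept: str                          # e.g. "us-gaap:Revenues"
    value: float
    unit: str                             # "USD", "shares", "USD/shares", "pure"
    period_end: date
    period_start: Optional[date] = None   # assumption: None marks an instant

    @property
    def is_instant(self) -> bool:
        return self.period_start is None

    @property
    def duration_days(self) -> Optional[int]:
        # Only flow quantities have a length; a balance has none.
        if self.period_start is None:
            return None
        return (self.period_end - self.period_start).days


# Hypothetical filer and made-up values, for illustration only.
revenue = Fact(1234567, "us-gaap:Revenues", 1_000.0, "USD",
               period_end=date(2023, 9, 30), period_start=date(2022, 10, 1))
assets = Fact(1234567, "us-gaap:Assets", 5_000.0, "USD",
              period_end=date(2023, 9, 30))

print(revenue.is_instant, revenue.duration_days)
print(assets.is_instant, assets.duration_days)
```

Keeping the check on the row itself makes it hard to sum a balance across periods or to annualize a figure that has no duration.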

The 2009 mandate and the XBRL phase-in

The structured financial data exists because the SEC made it mandatory. In 2009, after years of a voluntary filer program, the Commission adopted a rule requiring public companies to submit their financial statements in XBRL alongside the traditional document. The motivation was the one that animates this entire dataset: financial statements filed only as formatted documents are effectively opaque to a computer—to compare revenue across a hundred companies, or to track one company's margins across a decade, an analyst had to re-key numbers out of PDFs and HTML by hand or buy the cleaned data from a vendor that did the same. Tagging the numbers at the source, in a standard format, was meant to make the financial statements of every public company directly comparable and machine-readable—to democratize access to fundamentals that had previously been a commercial product.

The requirement did not arrive all at once. It was phased in by filer size: the largest companies (large accelerated filers using US-GAAP) were brought in first, followed in successive waves by the remaining large accelerated filers and then by all other public companies, including smaller reporting companies and foreign private issuers. This staged rollout is the reason the depth of the structured history varies by company: the giants have well over a decade of comparable tagged fundamentals, while smaller filers came into the data later and have a shorter structured record. A later refinement, Inline XBRL, folded the structured tags directly into the human-readable filing document rather than requiring a separate XBRL exhibit, so that the same file a person reads and the data a machine extracts are one and the same. The substance of what gets tagged, and the dataset that results, is nonetheless continuous with the 2009 mandate. The practical consequence for any analyst is that the era of broadly comparable, machine-readable US fundamentals begins in the years following 2009 and deepens over time; questions that reach back before the phase-in cannot be answered from this dataset at all.

Anatomy of a financial fact

The atom of this dataset is the fact, and understanding its five components is what separates correct analysis from silent error. A fact is a single reported number with everything attached that gives it meaning.

The concept is the tag—the taxonomy element such as us-gaap:Revenues or us-gaap:CashAndCashEquivalentsAtCarryingValue—that identifies what the number is. The value is the number itself. The unit is what the number is measured in, and it is not always dollars: a per-share figure carries a unit of USD per share, a share count carries a unit of shares, and a ratio carries a dimensionless “pure” unit. Ignoring the unit and assuming everything is dollars is a classic way to mangle earnings-per-share or share-count facts. The period is the temporal scope, and—as noted—it comes in two kinds: a duration with a start and an end for flow quantities (revenue earned over a quarter, cash generated over a year), and an instant with only an end for stock quantities (assets held at year-end, shares outstanding on a given date). The form records which filing the fact came from—a 10-K for annual figures, a 10-Q for quarterly ones, and occasionally an 8-K or amendment—along with the fiscal year and fiscal period it belongs to and the date the filing was submitted.

The reason all five components are load-bearing is that the same financial quantity is reported many times, and only the full context disambiguates which instance you want. Consider revenue. A company's fiscal-2023 revenue appears in its fiscal-2023 10-K as the current year, it appears again in the fiscal-2024 10-K as the prior-year comparative, and it appears yet again, possibly with a different value, if the company restates. The concept is the same in every case; what tells the instances apart is the period (which fiscal year the number describes), the form and filed date (which filing reported it), and the value (which may differ across reports of the same period because of restatement). Pulling “revenue” for a company without collapsing on the period and resolving which reported value to trust, typically the most recently filed, produces duplicates and contradictions. The fact model is precise precisely because financial reporting is repetitive: the same period's numbers are told and re-told across many filings, and the metadata is how you keep them straight.
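Collapsing repeated reports of a period is a small operation once the period and filing date are treated as data. The sketch below uses plain dictionaries with made-up values; the field names start, end, val, form, and filed are assumed for illustration, and dates are ISO strings so that string comparison orders them correctly.

```python
def latest_per_period(rows):
    """Collapse repeated reports of a period to the most recently filed one."""
    best = {}
    for row in rows:
        key = (row["start"], row["end"])          # the period being described
        if key not in best or row["filed"] > best[key]["filed"]:
            best[key] = row
    return sorted(best.values(), key=lambda r: r["end"])


# Made-up values: fiscal 2022 is reported twice, the second time restated.
reports = [
    {"start": "2022-01-01", "end": "2022-12-31", "val": 100,
     "form": "10-K", "filed": "2023-02-15"},
    {"start": "2022-01-01", "end": "2022-12-31", "val": 97,
     "form": "10-K", "filed": "2024-02-14"},
    {"start": "2023-01-01", "end": "2023-12-31", "val": 120,
     "form": "10-K", "filed": "2024-02-14"},
]

clean = latest_per_period(reports)
for row in clean:
    print(row["end"], row["val"], "filed", row["filed"])
```

Keeping the latest filing is a convention, not a rule: it suits a current view of history, while a backtest needs the value that was available on each date.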

The US-GAAP taxonomy and company extensions

The concepts are not invented by each company; they are drawn from a shared dictionary called the US-GAAP taxonomy. Maintained by the Financial Accounting Standards Board (FASB) and accepted by the SEC, the taxonomy is a large, versioned catalog of standardized financial-reporting concepts—thousands of elements covering the line items, disclosures, and relationships that appear across financial statements. When a company tags its revenue with us-gaap:Revenues or its net income with us-gaap:NetIncomeLoss, it is selecting an element from this common taxonomy, and that is what in principle makes one company's revenue comparable to another's: both point to the same standardized concept. The taxonomy is updated annually to track changes in accounting standards, which is itself a subtlety—the precise element a company uses for a given line can shift across taxonomy versions, which is why robust extraction code tries several near-synonymous tags rather than assuming one.

But the comparability is imperfect in two structural ways, and both are essential to understand. The first is that the standard taxonomy often offers several tags for what is colloquially the same thing. “Revenue” is the canonical example: depending on the company, the industry, and the era, top-line revenue may be tagged as us-gaap:Revenues, as us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax (the tag aligned with the more recent revenue-recognition standard), or as older variants like us-gaap:SalesRevenueNet. All are legitimate, all mean roughly “revenue,” and a query that asks only for one will miss companies that used another. The second, and harder, complication is the custom extension. The taxonomy is extensible by design—the X in XBRL—so when a company has a line item that no standard element captures well, it can define its own custom concept. These extensions let companies represent the genuine particularity of their financials, but they are company-specific by construction: an extension element appears for one filer and means whatever that filer intends, and it does not line up with any other company's tags. Extensions are the single biggest obstacle to clean cross-company comparison, because the more a company relies on them, the less of its financial statement maps onto the shared dictionary.
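Both complications can be handled mechanically, at least as a first pass. The sketch below coalesces the three revenue tags named above in a fixed preference order and flags extension concepts by their prefix. The preference order, the set of prefixes treated as standard, the acme prefix, and the values are all assumptions made for illustration.

```python
REVENUE_TAGS = (
    "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax",
    "us-gaap:Revenues",
    "us-gaap:SalesRevenueNet",
)

# Assumption: prefixes treated as shared taxonomies; anything else is
# considered a company-specific extension.
STANDARD_PREFIXES = {"us-gaap", "dei", "srt", "ifrs-full"}


def is_extension(concept):
    prefix = concept.split(":", 1)[0]
    return prefix not in STANDARD_PREFIXES


def coalesce_revenue(values_by_concept):
    """Map {concept: {period_end: value}} to {period_end: (value, tag)}.

    Tags earlier in REVENUE_TAGS win when two tags cover the same period.
    """
    out = {}
    for tag in REVENUE_TAGS:
        for period_end, value in values_by_concept.get(tag, {}).items():
            out.setdefault(period_end, (value, tag))
    return out


# Made-up filer that changed tags over time.
history = {
    "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax":
        {"2018-12-31": 100, "2019-12-31": 110},
    "us-gaap:Revenues": {"2017-12-31": 90, "2018-12-31": 100},
    "us-gaap:SalesRevenueNet": {"2016-12-31": 80, "2017-12-31": 90},
}
revenue = coalesce_revenue(history)
for period_end in sorted(revenue):
    print(period_end, revenue[period_end])

print(is_extension("acme:SubscriptionBookings"))   # hypothetical extension
```

Recording which tag supplied each period makes a tag switch visible in the output instead of hiding it inside a seamless-looking series.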

How the data is published

The SEC exposes the structured facts in two complementary forms, and choosing the right one is the first practical decision in any project. Both are public, and neither requires an API key, only a descriptive User-Agent identifying the requester. The JSON API is served from data.sec.gov, while the bulk data sets are downloaded as archives from the SEC's main website at sec.gov.

The first is the XBRL company-facts API. For any single filer, a request to data.sec.gov/api/xbrl/companyfacts/CIK{10-digit}.json returns every reported fact for that company, organized by taxonomy and concept, with each concept's values listed under their unit and carrying the period, fiscal year, form, and filing date. A narrower companion endpoint, the company-concept API, returns just one concept's time series for one company. It is the efficient way to ask “give me this filer's revenue history” without downloading its entire fact set. These JSON endpoints are ideal for company-by-company work: pull one CIK, slice the concepts you care about, build a time series. The second form is the bulk Financial Statement Data Sets: periodic flat-file archives containing the numeric facts, the submission metadata, the tags, and the presentation structure for all filings in a span, suitable for loading into a database. For anything cross-sectional (screening thousands of companies, building a panel of fundamentals, ranking an entire industry) the bulk data sets are far more efficient than issuing one API call per filer. The rule of thumb is the same as for most EDGAR data: the API for depth on a few entities, the bulk files for breadth across many.
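The two JSON endpoints differ only in their path, which a few helper functions make concrete. This is a minimal sketch using only the standard library: the helper names are illustrative, and the User-Agent string is a placeholder to replace with your own name and contact address.

```python
import json
import urllib.request

BASE = "https://data.sec.gov/api/xbrl"
USER_AGENT = "Example Research contact@example.com"   # placeholder identity


def pad_cik(cik):
    # The path expects the CIK zero-padded to 10 digits.
    return str(int(cik)).zfill(10)


def company_facts_url(cik):
    # Every reported fact for one filer.
    return f"{BASE}/companyfacts/CIK{pad_cik(cik)}.json"


def company_concept_url(cik, tag, taxonomy="us-gaap"):
    # One concept's history for one filer.
    return f"{BASE}/companyconcept/CIK{pad_cik(cik)}/{taxonomy}/{tag}.json"


def fetch_json(url):
    # Performs a live request; call sparingly and stay within the rate limit.
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


print(company_facts_url(320193))
print(company_concept_url(320193, "Revenues"))
```

Calling fetch_json(company_concept_url(320193, "Revenues")) would retrieve a single concept's series; nothing in the sketch touches the network until that call is made.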

Joining to the company registry by CIK

A financial fact in isolation is a number and a tag; what makes it analyzable is the company behind it, and the bridge is the CIK. The Central Index Key is EDGAR's permanent identifier for every filer, and because it appears on every fact and on every other thing the company files, it is the universal join key that integrates the financial facts into the rest of the SEC's data. The most important join is to the EDGAR company registry, the master index that resolves a CIK to the company's name, its ticker symbol or symbols, its standard industrial classification (SIC) code, its state of incorporation, and its current filer status. Without that join the facts are anonymous accounting figures; with it, every revenue and asset number is anchored to a named company, an industry, and a market identity—which is what lets an analyst select all software companies by SIC code, or look up a company by ticker and retrieve its fundamentals, or aggregate margins by industry.
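The registry join is an ordinary key lookup, with two details that are easy to get wrong: CIKs appear both zero-padded and as bare integers, and some facts will have no registry row. The sketch below assumes a registry keyed by integer CIK with hypothetical name, ticker, and sic fields; the companies and values are invented.

```python
# Hypothetical registry rows; the SIC codes are used only as examples.
registry = {
    1000001: {"name": "Example Software Co", "ticker": "EXSW", "sic": "7372"},
    1000002: {"name": "Example Steel Corp", "ticker": "EXST", "sic": "3312"},
}

facts = [
    {"cik": "0001000001", "concept": "us-gaap:Revenues", "fy": 2023, "value": 500.0},
    {"cik": 1000002, "concept": "us-gaap:Revenues", "fy": 2023, "value": 900.0},
    {"cik": 1000003, "concept": "us-gaap:Revenues", "fy": 2023, "value": 50.0},
]


def join_registry(facts, registry):
    """Attach company attributes to each fact; report facts with no match."""
    joined, unmatched = [], []
    for fact in facts:
        company = registry.get(int(fact["cik"]))   # int() drops zero padding
        if company is None:
            unmatched.append(fact)
        else:
            joined.append({**fact, **company})
    return joined, unmatched


joined, unmatched = join_registry(facts, registry)
software = [row for row in joined if row["sic"] == "7372"]
print([row["name"] for row in software])
print(len(unmatched), "fact(s) without a registry row")
```

Returning the unmatched facts, rather than silently dropping them, keeps the size of the coverage gap visible.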

The CIK also connects the financial facts to the wider EDGAR ecosystem. The same key ties a company's reported fundamentals to its full filing history, to the insider-transaction record of its officers and directors, to its institutional-ownership and beneficial-ownership disclosures, and to any enforcement actions against it. This is where the structured financials become more than a standalone fundamentals table: an analyst can relate a company's declining margins to a spike in insider selling, or test whether deteriorating fundamentals precede enforcement attention, or weight ownership disclosures by the financial size of the issuer—all because the CIK threads the financial facts into the same identity space as every other EDGAR dataset. The note of caution, developed below, is that ticker-to-CIK mapping is not always one-to-one and changes over time, so the join that anchors everything is itself something to handle carefully.

Analytical uses

A structured, machine-readable record of every public company's financial statements supports a set of analyses that were, before XBRL, the preserve of expensive commercial data vendors.

Automated fundamental analysis and screening is the flagship use. Because revenue, earnings, assets, liabilities, and cash are all tagged and queryable, an analyst can compute valuation and quality metrics—margins, growth rates, returns on assets and equity, leverage ratios, free-cash-flow conversion—across the entire universe of filers and rank or filter on them. The stock screeners and quantitative factor models that ingest fundamentals are, at bottom, querying this structured layer (or a commercial product derived from it). A screen for, say, companies with rising revenue, improving margins, and falling leverage is nothing more than three concepts pulled across many CIKs and compared.

Time-series fundamental analysis exploits the period metadata to turn the repeated reporting of each concept into a trend. Pulling one company's revenue or net income across every fiscal year it has filed—deduplicating restated periods to the latest reported value—produces a clean history of how the business has performed, the raw material for growth analysis, trend detection, and the construction of derived series like trailing-twelve-month figures from quarterly facts. Cross-company and cross-industry comparison is the cross-sectional counterpart: with concepts standardized to the same taxonomy and companies classified by SIC code, an analyst can benchmark one company against its peers or compare the profitability and capital intensity of whole industries—subject always to the extension-and-tag caveats that determine how truly comparable the underlying numbers are. Finally, the structured facts are the substrate of the commercial fundamentals datasets that power research platforms and fintech products: those products add entity resolution, tag normalization, restatement handling, and derived metrics on top of exactly this SEC-published layer, which is why understanding the raw data is the key to understanding both their strengths and their limits.
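The rising-revenue, improving-margin, falling-leverage screen reduces to three comparisons per company. The sketch below assumes the facts have already been extracted, deduplicated, and reduced to two fiscal years per filer; the field names, the liabilities-to-assets leverage measure, and every number are illustrative assumptions.

```python
def passes_screen(prev, curr):
    """True if revenue rose, net margin improved, and leverage fell.

    Assumes revenue and assets are non-zero in both years.
    """
    rising_revenue = curr["revenue"] > prev["revenue"]
    improving_margin = (curr["net_income"] / curr["revenue"]
                        > prev["net_income"] / prev["revenue"])
    falling_leverage = (curr["liabilities"] / curr["assets"]
                        < prev["liabilities"] / prev["assets"])
    return rising_revenue and improving_margin and falling_leverage


# Hypothetical filers: {cik: (prior year, current year)}.
panel = {
    1000001: ({"revenue": 100, "net_income": 10, "liabilities": 60, "assets": 100},
              {"revenue": 120, "net_income": 15, "liabilities": 55, "assets": 110}),
    1000002: ({"revenue": 200, "net_income": 30, "liabilities": 80, "assets": 200},
              {"revenue": 210, "net_income": 25, "liabilities": 85, "assets": 205}),
}

selected = [cik for cik, (prev, curr) in panel.items()
            if passes_screen(prev, curr)]
print(selected)
```

The second filer grows revenue but fails on margin, which is why all three conditions are evaluated together.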

Python workflow: pulling facts by CIK and extracting a concept

The script below pulls a company's full fact set from the EDGAR company-facts API by CIK, then extracts and computes three genuine metrics: a multi-year revenue series (trying several near-synonymous revenue tags in order, because the right one depends on the company and the taxonomy era), a net-income margin by fiscal year, and a most-recent assets-to-equity leverage ratio drawn from balance-sheet instants. No API key is required, but the SEC mandates a descriptive User-Agent identifying the requester—requests without one are throttled or blocked—and total request volume should stay within the documented courtesy limit. The script deduplicates restated periods by keeping the most recently filed value, and it keeps duration concepts (income, cash flow) separate from instant concepts (balance sheet), the two distinctions most likely to corrupt a naive extraction.

import requests
import pandas as pd

# SEC EDGAR XBRL company-facts API.
#
# No API key is required, but the SEC mandates a descriptive
# User-Agent identifying the requester (name and contact email).
# Requests without one are throttled or blocked. Keep total request
# volume modest -- the documented courtesy limit is ~10 requests/sec.
#
# Two endpoints matter here:
#   1. company-facts: every reported XBRL fact for one filer, at
#        https://data.sec.gov/api/xbrl/companyfacts/CIK{10-digit}.json
#   2. company-concept: the time series for a single concept, at
#        https://data.sec.gov/api/xbrl/companyconcept/CIK{10-digit}/us-gaap/{tag}.json
HEADERS = {"User-Agent": "AI Analytics research info@example.com"}


def pad_cik(cik):
    # The API path requires a zero-padded 10-digit CIK
    # (e.g. 320193 becomes "0000320193").
    return str(int(cik)).zfill(10)


def company_facts(cik):
    url = f"https://data.sec.gov/api/xbrl/companyfacts/CIK{pad_cik(cik)}.json"
    r = requests.get(url, headers=HEADERS, timeout=60)
    r.raise_for_status()
    return r.json()


def concept_series(facts, taxonomy, tag, unit="USD"):
    # Pull one concept's reported values into a tidy DataFrame.
    # Each fact carries: val, the unit, the period (end / optional start),
    # the fiscal year/period, the form it came from, and the filing date.
    node = facts.get("facts", {}).get(taxonomy, {}).get(tag)
    if not node:
        return pd.DataFrame()
    rows = node.get("units", {}).get(unit, [])
    df = pd.DataFrame(rows)
    if df.empty:
        return df
    df["end"] = pd.to_datetime(df["end"], errors="coerce")
    if "start" in df.columns:
        df["start"] = pd.to_datetime(df["start"], errors="coerce")
    return df.sort_values("end")


def annual(df):
    # Keep one row per annual period from 10-K filings. In the company-facts
    # JSON, fy/fp follow the filing a fact appeared in, so a 10-K's
    # prior-year comparatives carry that filing's fy. Dedupe on the period
    # end instead, keeping the most recently filed value so restatements win.
    if df.empty:
        return df
    k = df[df["form"] == "10-K"].copy()
    if "start" in k.columns:
        # Duration concepts: keep roughly full-year spans, dropping any
        # shorter periods that also appear in the 10-K.
        k = k[(k["end"] - k["start"]).dt.days.between(350, 380)]
    k["filed"] = pd.to_datetime(k["filed"], errors="coerce")
    k = k.sort_values("filed").drop_duplicates(subset=["end"], keep="last")
    return k.sort_values("end")


# Apple Inc. -- CIK 320193. Pull the full fact set once, then slice.
facts = company_facts(320193)
print(f"Entity: {facts.get('entityName')}")

# --- Metric 1: revenue time series ------------------------------------
# Revenue lives under several near-synonymous tags depending on the
# filing year and taxonomy version; try them in order of preference.
rev = pd.DataFrame()
for tag in ("RevenueFromContractWithCustomerExcludingAssessedTax",
            "Revenues", "SalesRevenueNet"):
    rev = annual(concept_series(facts, "us-gaap", tag))
    if not rev.empty:
        print(f"Revenue tag used: us-gaap:{tag}")
        break
for _, row in rev.tail(5).iterrows():
    print(f"  FY ending {row['end'].date()}: revenue = ${row['val']:,.0f}")

# --- Metric 2: net income margin --------------------------------------
# Join on the period end so each margin pairs figures for the same year.
ni = annual(concept_series(facts, "us-gaap", "NetIncomeLoss"))
if rev.empty or ni.empty:
    print("\nNet income margin: revenue or net income not found")
else:
    merged = rev.merge(ni, on="end", suffixes=("_rev", "_ni"))
    merged["margin"] = merged["val_ni"] / merged["val_rev"]
    print("\nNet income margin by fiscal year:")
    for _, row in merged.tail(5).iterrows():
        print(f"  FY ending {row['end'].date()}: {row['margin']:.1%}")

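# --- Metric 3: assets-to-equity (a leverage proxy) --------------------
# Balance-sheet concepts are instants (a point in time), not durations.
assets = concept_series(facts, "us-gaap", "Assets")
equity = concept_series(facts, "us-gaap", "StockholdersEquity")
if not assets.empty and not equity.empty:
    # Use the latest date both concepts report, so the ratio never mixes
    # two different instants, and take the most recently filed value.
    common = sorted(set(assets["end"].dropna()) & set(equity["end"].dropna()))
    if common:
        as_of = common[-1]
        a = assets[assets["end"] == as_of].sort_values("filed").iloc[-1]
        e = equity[equity["end"] == as_of].sort_values("filed").iloc[-1]
        print(f"\nMost recent assets / equity: {a['val'] / e['val']:.2f}x "
              f"(as of {as_of.date()})")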

Two practical notes apply. First, the revenue extraction is deliberately defensive: it tries the contract-revenue tag, then the generic Revenues tag, then an older sales tag, and uses the first that returns data—because no single tag captures top-line revenue across all companies and all years, and hard-coding one is the surest way to silently return nothing for the filers that used another. A production system should maintain a curated mapping of acceptable tags per concept and reconcile the choices, rather than relying on a short fallback list. Second, for cross-sectional work—screening thousands of companies or building an industry panel—issuing one company-facts call per filer is far too slow; the bulk Financial Statement Data Sets, loaded into a database and joined to the company registry by CIK, are the right substrate, and they ship the authoritative tag and period definitions for each release. The single-company API shown here is for depth on one entity; the bulk files are for breadth across the market.

Limitations and analytical caveats

The financial-statement facts are the most powerful openly accessible record of US public company fundamentals, but they carry structural limitations that an analyst must internalize before treating the numbers as ground truth.

Companies choose their own tags, and the choice is not always consistent. The same financial quantity can be tagged with different standard elements by different companies, by the same company across years, and across taxonomy versions—the revenue example is only the most visible case. A query that assumes one canonical tag per concept will silently omit every filer that used a synonym, and a time series assembled without tag normalization can jump discontinuously when a company switches tags. Robust analysis treats “which tag means revenue here” as a problem to be solved per company and per era, not a constant.

Custom extensions break comparability. Because the taxonomy is extensible, companies define their own elements for line items the standard dictionary does not capture, and those extension concepts are company-specific by construction—they do not align with any other filer's tags. The more a company extends, the less of its financial statement maps onto the shared vocabulary, and the harder any cross-company comparison becomes. An analysis that quietly drops extension facts will understate the company; one that tries to compare them across companies will compare things that are not the same.

Restatements mean the same period can have more than one value. Companies revise previously reported figures—to correct errors, to reflect new accounting standards, to reclassify—and EDGAR faithfully carries both the original and the restated facts for the affected period. The same fiscal year's revenue can therefore appear with different values in different filings. An extraction that does not collapse on the period and choose a reported value—conventionally the most recently filed—will return contradictory numbers, and a backtest that uses restated figures as if they had been known at the time commits a look-ahead error. The filing dates supply everything needed to do point-in-time analysis correctly, but only if the analyst respects them.
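Point-in-time extraction differs from latest-value extraction by a single filter on the filing date. The sketch below is a minimal illustration with made-up values; the field names end, val, and filed are assumed, and dates are ISO strings so that string comparison orders them correctly.

```python
def as_of(rows, cutoff):
    """Return {period_end: value} as it was known on the cutoff date."""
    best = {}
    for row in rows:
        if row["filed"] > cutoff:
            continue                      # not yet public on the cutoff date
        seen = best.get(row["end"])
        if seen is None or row["filed"] > seen["filed"]:
            best[row["end"]] = row
    return {end: row["val"] for end, row in best.items()}


# Made-up values: fiscal 2022 is restated in the following year's filing.
reports = [
    {"end": "2022-12-31", "val": 100, "filed": "2023-02-15"},
    {"end": "2022-12-31", "val": 97, "filed": "2024-02-14"},
    {"end": "2023-12-31", "val": 120, "filed": "2024-02-14"},
]

print(as_of(reports, "2023-06-30"))   # only the original figure is visible
print(as_of(reports, "2024-06-30"))   # the restated figure has replaced it
```

A backtest that evaluates a signal on a given date would call as_of with that date, so the restated figure cannot leak into decisions made before it was filed.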

A tagged fact is a disclosure, not an independently audited number, and tagging errors occur. The XBRL data is generated by the filer, and while it is meant to mirror the audited financial statements, tagging mistakes—a wrong sign, a misplaced scale, an element used contrary to its intended meaning—do happen, and the structured value can diverge from the figure in the human-readable statement. The dataset is authoritative as a record of what each company tagged; it is not a substitute for the audited statement itself, and any figure that drives a real decision should be reconciled against the filing. Combined with the ticker-to-CIK mapping that changes over time and is not always one-to-one, the lesson is that the financial facts reward careful entity resolution and value-level validation as much as any dataset on the SEC's servers.
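Some tagging errors can be caught with simple internal-consistency checks before any reconciliation against the filing. One such check, sketched below, compares total assets with total liabilities and equity, which should agree at any balance-sheet date. The tolerance, the scale factors tested, and the warning wording are assumptions, and a passing check does not show that a value is correct.

```python
def balance_sheet_issues(values, tolerance=0.005):
    """Return warnings for one filer's tagged totals at one balance-sheet date."""
    assets = values.get("us-gaap:Assets")
    total = values.get("us-gaap:LiabilitiesAndStockholdersEquity")
    if assets is None or total is None:
        return ["balance-sheet totals missing"]
    issues = []
    if assets < 0 or total < 0:
        issues.append("negative total (possible sign error)")
    big = max(abs(assets), abs(total))
    small = min(abs(assets), abs(total))
    if big - small <= tolerance * big:
        return issues                     # magnitudes agree within tolerance
    for factor in (1_000, 1_000_000):
        if small and abs(big / small - factor) <= tolerance * factor:
            issues.append(f"totals differ by about {factor:,}x "
                          "(possible scale error)")
            return issues
    issues.append("assets do not equal liabilities plus equity")
    return issues


# Made-up values exercising each outcome.
ok = {"us-gaap:Assets": 500.0,
      "us-gaap:LiabilitiesAndStockholdersEquity": 500.0}
scaled = {"us-gaap:Assets": 500_000.0,
          "us-gaap:LiabilitiesAndStockholdersEquity": 500.0}
print(balance_sheet_issues(ok))
print(balance_sheet_issues(scaled))
```

Checks like this narrow the set of facts that need manual reconciliation; they do not replace reading the filing for any figure that drives a decision.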

Held with these caveats in mind, the sec_financial_facts table is a uniquely valuable resource: a CIK-keyed, concept-tagged, period-stamped time series of the financial statements that public companies are required to disclose—the structured layer that turned a decade-plus of 10-Ks and 10-Qs from formatted documents into queryable data, and the foundation beneath every screener, factor model, and commercial fundamentals product that reasons about American public companies by the numbers.

Related writing

SEC EDGAR Company Registry: The Federal Index That Resolves Every Public Company — The registry that turns the CIK on every financial fact into a named company, ticker, and industry code, supplying the entity layer that makes the fundamentals analyzable in the first place.

SEC Form 4 Insider Trading: The Federal Database Behind Corporate Insider Stock Transactions — Keyed to the same CIK, the insider-transaction record lets an analyst relate a company's reported fundamentals to the buying and selling of the officers and directors who know them first.

SEC N-PORT Mutual Fund Holdings: The Federal Database Behind Every Fund Portfolio Position — The holdings-side counterpart in the EDGAR ecosystem: where the financial facts describe the companies, N-PORT describes who owns them, both joinable through the same federal identifier space.