
USPTO Patent Data: The Federal Database Behind Every US Patent Grant and Application

May 24, 2026 · AI Analytics

Federal Data · USPTO · Patents · Intellectual Property

The United States Patent and Trademark Office publishes the complete record of American invention: every patent granted since 1976 in machine-readable form, every application published since 2001, and the full prosecution history showing how examiners and applicants negotiated the claims in between. PatentsView, the canonical research dataset derived from these records, covers more than eight million granted utility patents and provides disambiguated inventor and assignee identities, citation networks, and technology classifications. For researchers measuring innovation, analysts mapping corporate IP strategy, and engineers doing freedom-to-operate searches, this is the foundational federal data source.

The Patent System in Brief

A patent is a temporary government-granted monopoly on an invention in exchange for public disclosure of how that invention works. The constitutional basis is Article I, Section 8: Congress has the power “to promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.” The current statutory framework is Title 35 of the United States Code.

The USPTO issues three categories of patent. Utility patents, by far the most common, cover new and useful processes, machines, manufactures, and compositions of matter — anything from pharmaceutical compounds to semiconductor fabrication methods to software algorithms (subject to eligibility limits discussed below). They carry a term of 20 years from the filing date under 35 USC §154. Design patents protect ornamental appearance rather than function — the shape of an Apple iPhone, the styling of a Nike sneaker sole — and run for 15 years from grant. Plant patents cover asexually reproduced plant varieties and are niche enough to be analytically irrelevant for most purposes.

To be patentable, an invention must satisfy four requirements. It must be novel: not disclosed anywhere in the prior art before the effective filing date. It must be non-obvious: not an obvious extension of what a person having ordinary skill in the relevant art would derive from the prior art (the KSR v. Teleflex standard). It must be useful: have some practical utility, a bar so low it rarely fails independently. And the application must provide an adequate written description that enables someone skilled in the art to make and use the invention (the written description and enablement requirements of 35 USC §112). Most patent rejections turn on novelty (35 USC §102) or non-obviousness (35 USC §103).

What the USPTO Publishes

The primary bulk data channels are the USPTO Bulk Data Storage System (BDSS) and PatentsView. BDSS distributes raw XML files on a weekly cycle: granted patents issue every Tuesday, and pre-grant application publications appear every Thursday. These XML files are authoritative but verbose. A single week's grant file can exceed 1 GB uncompressed, with a deeply nested schema for bibliographic data, claims, drawing references, and citations.
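One practical wrinkle when parsing the weekly grant files is that each file is a long run of standalone XML documents, one per patent, concatenated back to back, so a standard XML parser cannot read the file in a single pass. The sketch below shows one way to split such a file on its XML declarations and pull two fields from each document. The sample data is invented, and the element paths (publication-reference, invention-title) are assumptions about the grant schema that should be checked against the DTD for the year being parsed.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a weekly grant file: two documents back to back.
# Element names are assumed from the grant XML layout; verify against the DTD.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant>
  <us-bibliographic-data-grant>
    <publication-reference>
      <document-id><country>US</country><doc-number>00000001</doc-number></document-id>
    </publication-reference>
    <invention-title>Hypothetical widget</invention-title>
  </us-bibliographic-data-grant>
</us-patent-grant>
<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant>
  <us-bibliographic-data-grant>
    <publication-reference>
      <document-id><country>US</country><doc-number>00000002</doc-number></document-id>
    </publication-reference>
    <invention-title>Hypothetical gadget</invention-title>
  </us-bibliographic-data-grant>
</us-patent-grant>
"""


def split_documents(lines):
    """Yield one XML document at a time from concatenated input lines."""
    chunk = []
    for line in lines:
        # A new XML declaration marks the start of the next document.
        if line.startswith("<?xml") and chunk:
            yield "".join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield "".join(chunk)


def summarize(doc):
    """Extract the document number and title from one grant document."""
    root = ET.fromstring(doc.encode("utf-8"))
    return {
        "doc_number": root.findtext(".//publication-reference/document-id/doc-number"),
        "title": root.findtext(".//invention-title"),
    }


summaries = [summarize(d) for d in split_documents(SAMPLE.splitlines(keepends=True))]
print(summaries)
```

Because split_documents accepts any iterable of lines, an open file handle can be passed in place of the sample string, which keeps only one patent's XML in memory at a time.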

PatentsView, maintained by the USPTO under a cooperative agreement with research institutions, transforms those raw XML files into a cleaned, normalized, and disambiguated relational dataset available as bulk CSV downloads and a REST API. It is the standard starting point for empirical patent research. The Patent Examination Data System (PEDS) covers the prosecution history side: office actions, applicant responses, request-for-continued-examination (RCE) counts, and interview summaries. Google Patents Public Data on BigQuery provides full-text search over the grant corpus with a generous free tier.

The PatentsView Schema

PatentsView organizes its data around a central patent record with satellite tables for inventors, assignees, citations, and claims. The core fields in the patent table are:

patent_id: unique grant number, stored without thousands separators (e.g., 11234567, printed on the patent as 11,234,567, for a 2022 utility patent).
patent_type: utility, design, or plant.
patent_date: grant date (Tuesday of the week the patent issued).
patent_title and patent_abstract: the title and abstract text as printed on the granted patent.
num_claims: total claim count; independent claims define the scope, dependent claims narrow it for fallback validity.
application_id and filing_date: the underlying application number and the date it was filed, which sets the start of the 20-year term and determines what prior art is relevant.
CPC codes: Cooperative Patent Classification codes jointly maintained by the USPTO and the European Patent Office, replacing the older USPC system. CPC is hierarchical: section (A–H, Y), class, subclass, main group, subgroup. A single patent typically receives multiple CPC codes covering its primary technology and any related classifications.
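The CPC hierarchy is easiest to see by taking a symbol apart. The following is a minimal sketch, assuming symbols arrive in the common printed form such as "G06N 3/08"; the CPC tuple and the parse_cpc helper are illustrative names, not part of any PatentsView library.

```python
import re
from typing import NamedTuple


class CPC(NamedTuple):
    section: str     # e.g. "G"
    cpc_class: str   # e.g. "G06"
    subclass: str    # e.g. "G06N"
    group: str       # main group, e.g. "G06N3/00"
    subgroup: str    # e.g. "G06N3/08"


# Section letter, two-digit class, subclass letter, main group, "/", subgroup.
_CPC_RE = re.compile(r"^([A-HY])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_cpc(symbol):
    """Split a full CPC symbol into its five hierarchical levels."""
    match = _CPC_RE.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not a full CPC symbol: {symbol!r}")
    section, cls, sub, main, subgroup = match.groups()
    subclass = f"{section}{cls}{sub}"
    return CPC(
        section=section,
        cpc_class=f"{section}{cls}",
        subclass=subclass,
        group=f"{subclass}{main}/00",
        subgroup=f"{subclass}{main}/{subgroup}",
    )


print(parse_cpc("G06N 3/08"))
```

Truncating every code on a patent to its subclass level is usually the first step before counting patents per technology area.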

The inventor table links each patent to one or more named inventors with location data (city, state, country) and a disambiguated inventor_id that attempts to resolve name variants across filings — distinguishing “John Smith at IBM” from “John Smith at Google” and tracking the same inventor across career moves.

The assignee table is analytically central for corporate IP analysis. It records the organization or individual to whom the patent was assigned at grant, with a type classification (company, individual, US government, foreign government, university). Assignee disambiguation is imperfect — subsidiary names are not always rolled up to parent companies — but PatentsView's disambiguated assignee identifiers handle the most common cases.

The citation table is one of the most powerful outputs. It records both backward citations (prior art references that appear on the face of the granted patent, including patents the examiner added) and forward citations (later patents that cite this one). Forward citation counts are the standard bibliometric proxy for patent importance: a patent cited by 500 subsequent patents is more likely to represent a foundational innovation than one never cited again. The distinction between applicant-added and examiner-added citations matters for interpretation — examiner citations reflect the prior art the USPTO found relevant; applicant citations reflect what the applicant chose to disclose.
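As a small illustration of how forward citation counts fall out of a citation table, the sketch below inverts a toy table of citing and cited patents. The column names (citing_patent_id, cited_patent_id, added_by) and the data are hypothetical; the real PatentsView column names should be taken from its data dictionary.

```python
import pandas as pd

# Each row is one backward citation: the citing patent references the cited one.
citations = pd.DataFrame(
    {
        "citing_patent_id": ["C", "D", "D", "E"],
        "cited_patent_id": ["A", "A", "B", "A"],
        "added_by": ["applicant", "examiner", "applicant", "examiner"],
    }
)

# Forward citations of a patent = distinct later patents that cite it.
forward_counts = (
    citations.groupby("cited_patent_id")["citing_patent_id"]
    .nunique()
    .rename("forward_citations")
    .sort_values(ascending=False)
)

# Split the same counts by who added the citation.
by_source = pd.crosstab(citations["cited_patent_id"], citations["added_by"])

print(forward_counts)
print(by_source)
```

Keeping the applicant and examiner columns separate makes it possible to test whether a result depends on one citation source.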

Scale of the Corpus

The numbers are large enough to require bulk processing strategies rather than on-the-fly API queries. The USPTO grants roughly 350,000 utility patents per year against approximately 600,000 applications filed annually. That is a crude grants-to-filings ratio somewhat above 50%, though the rate varies substantially by technology area and applicant type, and the grants issued in a given year correspond to applications filed in earlier years. The cumulative PatentsView corpus covers more than eight million granted utility patents, with full text and structured metadata for all of them.

Average first-action pendency, measured from the filing date to the examiner's first office action, runs approximately 24 months, though backlogs vary widely across Technology Centers. Software and business method applications examined in Technology Center 3600 face longer waits than, say, mechanical manufacturing applications. Total pendency from filing to grant averages around 30–36 months, meaning that a patent filed today will typically issue in 2028 or 2029. Because the 20-year term runs from the filing date, prosecution consumes part of it, and the effective commercial life of a patent is closer to 17 years than the statutory 20.
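The arithmetic behind the effective-life figure can be sketched directly. The helper names below are illustrative, and the calculation deliberately ignores patent term adjustment, terminal disclaimers, and lapses for unpaid maintenance fees, all of which move real expiry dates.

```python
from datetime import date


def add_years(d, years):
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def effective_life(filing_date, grant_date, term_years=20):
    """Return (nominal expiry date, enforceable years remaining at grant)."""
    expiry = add_years(filing_date, term_years)
    return expiry, (expiry - grant_date).days / 365.25


# Filed on the article date, granted after 36 months of pendency.
expiry, life = effective_life(date(2026, 5, 24), date(2029, 5, 24))
print(expiry, round(life, 1))
```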

Patent Families: Continuations, CIPs, and Divisionals

Few commercially significant patents are standalone filings. The US prosecution system permits several mechanisms for extending and branching an application, which together constitute a patent family.

A continuation application claims the benefit of an earlier “parent” application's priority date (it inherits the parent's effective filing date for prior-art purposes) and must be based on the same disclosure, with no new matter. Continuations allow applicants to pursue additional claim sets on the same invention, often with narrowed claims after the parent's prosecution has clarified what the examiner will allow. A continuation-in-part (CIP) adds new matter to the parent's disclosure; the new matter gets the later filing date while the carried-over disclosure retains the parent's priority date. A divisional splits out claims directed to a distinct invention that the examiner identified in a restriction requirement, since the USPTO will not examine multiple distinct inventions in a single application.

In PatentsView, application-number linkages track these family relationships. The application_id and parent application references in PEDS let analysts reconstruct full patent families. This matters enormously for pharmaceutical analysis. The “evergreening” strategy in branded drug IP involves building a family of later patents around a blockbuster drug's core composition patent to capture formulation, dosage, method of treatment, and metabolite inventions. A true continuation expires with its parent, because the 20-year term under 35 USC §154 is measured from the earliest non-provisional filing date to which the patent claims benefit; the later-expiring patents in an evergreened portfolio therefore come from applications filed later with their own filing dates. A drug compound first patented in 1995 may still be covered by method-of-use patents expiring in 2038.
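Reconstructing families from parent references amounts to finding connected components in a graph of applications. The sketch below does this with a small union-find structure over (child, parent) pairs; the application identifiers are made up, and how the pairs are extracted from PEDS or PatentsView is left as an assumption.

```python
from collections import defaultdict


def build_families(links):
    """Group applications into families from (child, parent) pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for child, par in links:
        root_parent, root_child = find(par), find(child)
        if root_parent != root_child:
            parent[root_child] = root_parent

    families = defaultdict(set)
    for app in list(parent):
        families[find(app)].add(app)
    return sorted((sorted(members) for members in families.values()),
                  key=lambda members: members[0])


# Hypothetical application numbers: a continuation chain and a separate divisional.
links = [
    ("APP-2", "APP-1"),  # continuation of APP-1
    ("APP-3", "APP-2"),  # continuation of the continuation
    ("APP-9", "APP-8"),  # divisional of an unrelated application
]
print(build_families(links))
```

Union-find is used here because it copes with links arriving in any order, including a child that is seen before its own parent link.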

Patent Quality and Legal Challenges

Not all granted patents are valid, and the litigation system — supplemented since 2012 by administrative review at the Patent Trial and Appeal Board (PTAB) — exists in part to weed out patents that should not have issued.

The pivotal software-patent case is Alice Corp. v. CLS Bank International, the 2014 Supreme Court unanimous decision holding that abstract ideas implemented on a generic computer are not patent-eligible under 35 USC §101. The two-step Alice framework — determine whether the claims are directed to an abstract idea, then ask whether they add an “inventive concept” beyond the abstract idea itself — invalidated a significant fraction of the software and fintech patent portfolio built up during the 2000s. Examiners now apply Alice at prosecution, and courts apply it to invalidate issued patents. The boundaries of what counts as an “abstract idea” remain contested enough that software patent prosecution requires careful claim drafting.

Inter Partes Review (IPR) is the PTAB administrative trial mechanism introduced by the America Invents Act in 2012. A petitioner challenging a patent's validity files an IPR petition; if the Board institutes the trial, the challenged patent faces cancellation proceedings on prior-art grounds outside district court. Institution rates have historically run near 60% of petitions filed, and cancellation rates for instituted trials exceed 50% for all challenged claims. IPR has become the standard defensive tool in high-stakes patent litigation — a defendant sued for infringement files an IPR petition targeting the asserted claims while defending in parallel in district court.

The non-practicing entity (NPE) litigation landscape — patent assertion entities that acquire patents without manufacturing products — has been extensively studied using PatentsView data. Academic research has estimated direct litigation costs attributable to NPE activity at roughly $29 billion per year in the United States, concentrated in software and electronics. PatentsView's assignee type field and the 6-digit NAICS code of the assignee's listed business let researchers proxy for NPE status, though no classification is perfectly clean.

How to Access the Data

For most analytical use cases, PatentsView at patentsview.org is the right starting point. Bulk CSV downloads are available for all tables — patents, inventors, assignees, citations, CPC classifications — as compressed archives updated quarterly. The PatentsView REST API at https://api.patentsview.org/patents/query supports JSON-formatted query filters, field selection, and pagination up to 10,000 records per page.

The USPTO Bulk Data Storage System (BDSS) at bulkdata.uspto.gov provides the raw weekly XML grant files and application publications for those who need the authoritative source, full claim text, or prosecution-level metadata not yet in PatentsView. Parsing the USPTO XML schema is non-trivial; most research projects use PatentsView instead.

Google Patents Public Data on BigQuery replicates USPTO and international patent data in a columnar format queryable with standard SQL. The free tier supports substantial analysis without billing. Full-text search across abstracts and claims is straightforward in BigQuery in ways that are cumbersome against CSV files or the PatentsView API.

The PEDS API (developer.uspto.gov/api-catalog) exposes prosecution history: office actions and their dates, applicant responses, RCE filings, and interview summaries. PEDS is the right source for pendency analysis, for counting how many rounds of examination a patent survived, and for understanding examiner behavior by art unit.

Python: Top AI Patent Holders via PatentsView

The following script queries the PatentsView API for all patents classified under CPC subclass G06N — “Computing Arrangements Based on Specific Computational Models,” the primary CPC node covering machine learning, neural networks, and artificial intelligence methods — granted since 2021, then counts distinct patents per assignee organization to identify the top 20 holders.

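import json

import pandas as pd
import requests

# Query the PatentsView API for AI-related patents (CPC subclass G06N)
# granted on or after 2021-01-01, then rank assignee organizations.
# Endpoint as given in this article: https://api.patentsview.org/patents/query
# Caveat: PatentsView has been moving to a newer, key-authenticated API, so
# confirm that this endpoint and these field names are still served.

BASE = "https://api.patentsview.org/patents/query"
PER_PAGE = 10000  # maximum page size

query = {
    "_and": [
        {"_gte": {"patent_date": "2021-01-01"}},
        # G06N is a CPC subclass, so filter on the subclass field,
        # not the subgroup field.
        {"cpc_subclass_id": "G06N"},
    ]
}
fields = ["patent_id", "patent_date", "patent_title", "assignee_organization"]

# Page through the results; a single request returns at most PER_PAGE patents.
patents = []
total = 0
page = 1
while True:
    params = {
        "q": json.dumps(query),
        "f": json.dumps(fields),
        "o": json.dumps({"page": page, "per_page": PER_PAGE}),
    }
    resp = requests.get(BASE, params=params, timeout=60)
    resp.raise_for_status()
    data = resp.json()

    total = data.get("total_patent_count") or 0
    batch = data.get("patents") or []  # may be null when nothing matches
    patents.extend(batch)
    if not batch or len(patents) >= total:
        break
    page += 1

print(f"Total G06N patents since 2021: {total}")

# Flatten: each patent may have multiple assignees
rows = []
for p in patents:
    assignees = p.get("assignees") or []
    if not assignees:
        rows.append({"patent_id": p["patent_id"], "assignee": "Unassigned"})
    else:
        for a in assignees:
            rows.append({
                "patent_id": p["patent_id"],
                "assignee": a.get("assignee_organization") or "Individual",
            })

# Explicit columns keep the groupby valid even when no rows came back
df = pd.DataFrame(rows, columns=["patent_id", "assignee"])

# Count distinct patents per assignee (a patent with two assignees counts once each)
counts = (
    df.groupby("assignee")["patent_id"]
    .nunique()
    .rename("patent_count")
    .sort_values(ascending=False)
    .reset_index()
)

print("\nTop 20 assignees by AI patent count (CPC G06N), 2021-present:")
print(counts.head(20).to_string(index=False))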

A few implementation notes. PatentsView's query syntax uses a JSON object in the q parameter with logical operators (_and, _or) and comparators (_gte, _lte, _text_phrase). The G06N subclass filter captures the main AI node; a comprehensive analysis would add G06F 18 (pattern recognition) and G16H (healthcare informatics). Each patent may have multiple assignees, so the flattening step is necessary before grouping. Unassigned patents, where an inventor has not transferred rights to any organization, are common among individual and small-entity filers.
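Widening the filter to more than one CPC subclass is a matter of nesting an _or clause inside the _and. The sketch below only builds and serializes the query object and makes no network call. The helper name is illustrative, and cpc_subclass_id is assumed to be the queryable subclass field on the endpoint used in this article.

```python
import json


def cpc_subclass_query(subclasses, granted_on_or_after):
    """Build a query dict: grant date floor AND (any of the given subclasses)."""
    cpc_filter = {"_or": [{"cpc_subclass_id": s} for s in subclasses]}
    return {
        "_and": [
            {"_gte": {"patent_date": granted_on_or_after}},
            cpc_filter,
        ]
    }


query = cpc_subclass_query(["G06N", "G16H"], "2021-01-01")

# The API expects each parameter as a JSON string, so serialize with json.dumps
# instead of hand-writing the string.
params = {
    "q": json.dumps(query),
    "f": json.dumps(["patent_id", "patent_date", "assignee_organization"]),
}
print(params["q"])
```

Building the query as a Python dict and serializing it once avoids the quoting mistakes that creep into hand-written JSON strings.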

Running this query against the 2021–2025 window typically returns IBM, Samsung, Google, Microsoft, and Qualcomm near the top — the same companies that dominated aggregate utility patent counts for decades — though the precise ranking shifts year to year as AI patent filings from Chinese assignees and hyperscalers have accelerated.

Research Applications

Innovation measurement. Economists use citation-weighted patent counts as a proxy for R&D output, complementing input measures like R&D expenditure from NSF or BEA. The forward citation count is the standard weight — a patent's importance is proportional to how much subsequent work builds on it. PatentsView's citation table enables these calculations at the inventor, firm, technology class, or metropolitan area level.
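A citation-weighted count can be computed in a few lines once forward citations are attached to each patent. The sketch below uses one common convention, a weight of one plus the forward citation count, so that an uncited patent still counts once. The firms, patents, and column names are invented for illustration.

```python
import pandas as pd

patents = pd.DataFrame(
    {
        "patent_id": ["P1", "P2", "P3"],
        "assignee": ["Firm A", "Firm A", "Firm B"],
        "forward_citations": [0, 4, 10],
    }
)

# Weight each patent by 1 + forward citations, then aggregate per firm.
weighted = (
    patents.assign(weight=1 + patents["forward_citations"])
    .groupby("assignee")
    .agg(patent_count=("patent_id", "nunique"), citation_weighted=("weight", "sum"))
)
print(weighted)
```

In this toy data Firm A holds more patents but Firm B scores higher once citations are taken into account, which is the distinction the weighting is meant to surface.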

Technology mapping. CPC codes provide a standardized taxonomy of technical knowledge. Tracking the annual grant count by CPC subclass reveals where innovation is concentrating — the explosive growth of G06N filings since 2015 documents the AI patent wave directly; the relative decline of mechanical engineering subclasses documents the shift away from manufacturing innovation in high-cost countries.
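Annual grant counts by CPC subclass reduce to a group-and-pivot over a patent-to-classification table. The following sketch uses invented rows and hypothetical column names, with one row per patent and CPC subclass pair, so a patent carrying two subclasses counts once in each.

```python
import pandas as pd

grants = pd.DataFrame(
    {
        "patent_id": ["P1", "P2", "P3", "P3", "P4"],
        "patent_date": ["2015-03-10", "2020-06-02", "2020-09-15",
                        "2020-09-15", "2015-11-24"],
        "cpc_subclass": ["G06N", "G06N", "G06N", "G06F", "F16H"],
    }
)

grants["year"] = pd.to_datetime(grants["patent_date"]).dt.year

# Rows: CPC subclass. Columns: grant year. Cells: distinct patents granted.
trend = (
    grants.groupby(["cpc_subclass", "year"])["patent_id"]
    .nunique()
    .unstack(fill_value=0)
)
print(trend)
```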

Corporate IP strategy analysis. Assignee concentration data shows that IBM led annual US utility patent grants for 29 consecutive years, from 1993 through 2021, before Samsung overtook it in 2022, with Canon, Microsoft, and Apple competing for subsequent positions. Tracking how a company's patent portfolio in a specific technology shifts over time, by combining assignee data with CPC classification and forward citation counts, is a standard input to competitive intelligence analysis.

Pharma patent linkage. The FDA's Orange Book lists the patents that cover approved drug products. Linking Orange Book patent numbers to PatentsView records yields the full prosecution history, citation network, and continuation family for every drug patent — enabling analysis of evergreening strategies, patent cliff timing, and the relationship between clinical approval dates and patent expiry. The Orange Book connection is discussed further in the FDA Drug Approvals article linked below.

Geographic innovation clusters. Inventor location data in PatentsView — city and state for US inventors — enables mapping of patent production by metropolitan area or county. Silicon Valley, Seattle, Boston's Route 128 corridor, and Research Triangle cluster in these maps as expected; less obvious concentrations in medical device innovation (Minneapolis–St. Paul) and chemical engineering (Houston) become visible at fine geographic resolution.

Limitations and Analytical Cautions

Patents are a lagged and selective indicator of innovation. A patent application filed today reflects an invention made one to three years earlier, and the grant will not appear in the database for another two to three years after filing. Research pipelines in semiconductor design or pharmaceutical chemistry may therefore appear in USPTO data five years after the underlying R&D decision. For current technology tracking, patent applications (published 18 months after filing) lead grants by years.

Not all innovation is patented. Trade secrets are the preferred protection mechanism for process innovations and for inventions where the claims would teach competitors too much. Software innovations are increasingly defended through copyright and trade secret rather than patents, particularly after Alice narrowed eligibility. Patent counts in software and fintech undercount actual innovation activity relative to industries like pharmaceuticals, where patents are the primary exclusivity mechanism and nearly every significant compound is patented.

Assignee disambiguation in PatentsView is imperfect. Subsidiaries are often listed under their own names rather than the parent company. Apple Inc. may appear as Apple Inc., Apple Computer Inc., and various subsidiary names in different filings. PatentsView publishes disambiguation confidence scores and flags, but analysts doing firm-level portfolio analysis should validate assignments against the raw assignee name field rather than relying solely on the disambiguated identifier.
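One lightweight way to run that validation is to normalize the raw assignee strings and list the distinct names that sit behind each disambiguated identifier. The sketch below is a deliberately crude normalizer: the suffix list, identifiers, and records are illustrative assumptions, and real firm matching needs a maintained parent-subsidiary mapping on top of it.

```python
import re
from collections import defaultdict

# Corporate suffixes to strip from the end of a name (illustrative, not exhaustive).
_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co", "company",
             "ltd", "limited", "llc"}


def normalize_assignee(name):
    """Lowercase, drop punctuation, and strip trailing corporate suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def names_by_id(records):
    """Map each disambiguated id to the distinct normalized raw names under it."""
    grouped = defaultdict(set)
    for assignee_id, raw_name in records:
        grouped[assignee_id].add(normalize_assignee(raw_name))
    return {key: sorted(names) for key, names in grouped.items()}


# Hypothetical (disambiguated id, raw name) pairs.
records = [
    ("id-1", "Apple Inc."),
    ("id-1", "APPLE INC"),
    ("id-1", "Apple Computer, Inc."),
    ("id-2", "Acme Corp."),
]
print(names_by_id(records))
```

Identifiers that map to more than one normalized name, like id-1 here, are the ones worth reviewing by hand before they feed a firm-level count.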

Finally, the grant-rate statistic of roughly 50% is an average across all technology areas. Grant rates for pharmaceutical composition-of-matter patents approach 80%; grant rates for business method and software applications examined under the Alice framework can fall below 20% for some art units. Interpreting aggregate patent counts as equivalent measures of innovative activity across technology areas conflates very different prosecution environments.

The FDA's Orange Book lists the patents covering approved drug products, connecting USPTO grant data directly to pharmaceutical market exclusivity. See FDA Drug Approvals: The Federal Dataset Behind Every Approved Therapeutic.

Corporate patent portfolios appear on balance sheets as intangible assets. For the SEC filings that disclose IP capitalization and amortization schedules, see SEC EDGAR: The Federal Database Behind Every Public Company Filing.

R&D expenditure as a share of GDP — the input side of innovation measurement that patent counts proxy on the output side — appears in BEA national accounts. See BEA GDP Accounts: The Federal Dataset Behind Every Macroeconomic Analysis.