Technical writing

Regulatory data versioning: point-in-time queries, audit trails, and as-of compliance screening in Cloudflare D1

· AI Analytics
Regulatory dataCloudflare D1Data versioningCompliance

Regulatory compliance is inherently historical. A bank must be able to answer “was this counterparty on the OFAC SDN list on the date of this transaction?” — not merely whether it is on the list today. A securities firm must reconstruct the EDGAR beneficial ownership disclosure that was current at the time of a trade. An enforcement attorney must reproduce exactly what the EPA violation record said when a penalty was assessed, even if that record was subsequently amended.

The Federal Regulatory Data Hub addresses this through bitemporal versioning: every row in every vertical D1 shard carries a valid_from / valid_until pair representing the real-world validity interval of that version, and every write operation is recorded in arecord_versions audit table preserving the before-state. The Workers query layer rewrites incoming requests with an AS-OF date parameter to a conventional SQLite range predicate, allowing point-in-time queries without a separate query language.

Bitemporal row version pattern

Every primary entity table in the hub uses the same row-version convention. Taking the OFAC SDN table as an example:

-- regulatory-sanctions D1 shard
CREATE TABLE sdn_entries (
  -- Stable identifier across all versions of this SDN entry
  sdn_id          TEXT NOT NULL,

  -- Monotonically increasing per sdn_id (1-based)
  version_seq     INTEGER NOT NULL,

  -- Real-world validity interval (ISO date strings for SQLite portability)
  valid_from      TEXT NOT NULL,          -- date OFAC published this version
  valid_until     TEXT,                   -- NULL = currently active

  -- Payload columns (vary across versions when OFAC amends the entry)
  entity_type     TEXT NOT NULL CHECK (entity_type IN ('individual','entity','vessel','aircraft')),
  primary_name    TEXT NOT NULL,
  program_codes   TEXT NOT NULL,          -- JSON array: ["IRAN", "IRGC-EO"]
  federal_register_vol TEXT,
  federal_register_pg  TEXT,
  effective_date  TEXT,
  remarks         TEXT,

  -- Ingestion metadata (not part of the real-world record)
  ingested_at     TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
  source_file_sha TEXT NOT NULL,          -- SHA-256 of OFAC XML file

  PRIMARY KEY (sdn_id, version_seq)
);

-- Current-record fast path (most queries don't need history)
CREATE INDEX idx_sdn_current ON sdn_entries (sdn_id)
  WHERE valid_until IS NULL;

-- Point-in-time query index: given AS-OF date, find active versions
CREATE INDEX idx_sdn_pit ON sdn_entries (valid_from, valid_until);

-- Primary name search index
CREATE INDEX idx_sdn_name ON sdn_entries (primary_name, valid_until);

When OFAC publishes an amended SDN entry, the ingestion pipeline closes the current version by setting valid_until to the publication date and inserts a new row with the nextversion_seq and valid_from = publication_date and valid_until = NULL. This is a two-statement transaction in D1's SQLite:

-- Close the current version
UPDATE sdn_entries
   SET valid_until = ?1          -- publication date
 WHERE sdn_id = ?2
   AND valid_until IS NULL;

-- Insert the new version
INSERT INTO sdn_entries (sdn_id, version_seq, valid_from, valid_until,
                         entity_type, primary_name, program_codes, ...)
SELECT ?2, MAX(version_seq) + 1, ?1, NULL, ...
  FROM sdn_entries
 WHERE sdn_id = ?2;

D1's consistency model (serializable writes per database) guarantees that the close and insert are atomic: no reader can observe a state where the old version is closed but the new version does not yet exist.

record_versions audit table

For enforcement and legal discovery purposes, the hub maintains a separate record_versionstable that captures the full JSON serialization of every row at the moment it was superseded. This differs from the bitemporal main table in two ways: it stores the old payload as a JSON blob rather than relational columns, and it records the ingestion-time reason for the change:

CREATE TABLE record_versions (
  rv_id           TEXT PRIMARY KEY,  -- sha256(table||sdn_id||version_seq)[:16]
  table_name      TEXT NOT NULL,
  entity_id       TEXT NOT NULL,
  version_seq     INTEGER NOT NULL,
  valid_from      TEXT NOT NULL,
  valid_until     TEXT NOT NULL,     -- always set; NULLs replaced by '9999-12-31' at close time
  change_reason   TEXT NOT NULL CHECK (
    change_reason IN (
      'ofac_amendment',
      'ofac_addition',
      'ofac_removal',
      'data_correction',
      'deduplication_merge',
      'manual_override'
    )
  ),
  previous_payload TEXT NOT NULL,   -- JSON of the closed row's payload columns
  ingested_at     TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
  ingested_by     TEXT NOT NULL      -- ingestor service identifier
);

-- Lookup: full history for an entity
CREATE INDEX idx_rv_entity ON record_versions (table_name, entity_id, version_seq);

-- Lookup: all changes ingested in a time window (audit export)
CREATE INDEX idx_rv_ingested ON record_versions (ingested_at);

The record_versions table is append-only. No row is ever updated or deleted. Rows accumulate at roughly 14,000 per month across all shards (OFAC publishes amendments approximately twice weekly; EPA and FinCEN amend records monthly). At this rate, each shard's audit table holds under 500,000 rows after five years — well within D1's 10M-row practical performance envelope.

AS-OF query rewriting

The Workers query router exposes an asOf parameter on all entity lookup endpoints. When present, the router rewrites the SQL predicate from the “current only” form to the point-in-time form:

// workers/src/query-rewriter.ts

type QueryMode = 'current' | 'as-of';

export function buildSdnQuery(
  entityId: string,
  mode: QueryMode,
  asOfDate?: string,  // ISO date string, required when mode === 'as-of'
): { sql: string; bindings: unknown[] } {
  if (mode === 'current') {
    return {
      sql: `SELECT * FROM sdn_entries
             WHERE sdn_id = ?
               AND valid_until IS NULL`,
      bindings: [entityId],
    };
  }

  if (!asOfDate) throw new Error('asOfDate required for as-of mode');

  // Point-in-time: row was valid if valid_from <= asOf AND (valid_until IS NULL OR valid_until > asOf)
  // The valid_until > asOf condition uses strict inequality because valid_until is the
  // FIRST date the row was NO LONGER valid (half-open interval: [valid_from, valid_until) ).
  return {
    sql: `SELECT * FROM sdn_entries
           WHERE sdn_id = ?
             AND valid_from <= ?
             AND (valid_until IS NULL OR valid_until > ?)`,
    bindings: [entityId, asOfDate, asOfDate],
  };
}

export function buildAliasQuery(
  aliasNorm: string,
  mode: QueryMode,
  asOfDate?: string,
): { sql: string; bindings: unknown[] } {
  if (mode === 'current') {
    return {
      sql: `SELECT ea.entity_id, ea.alias_type, ea.alias_raw
             FROM entity_aliases ea
             WHERE ea.alias_norm = ?
               AND (ea.valid_until IS NULL OR ea.valid_until > date('now'))`,
      bindings: [aliasNorm],
    };
  }

  if (!asOfDate) throw new Error('asOfDate required for as-of mode');
  return {
    sql: `SELECT ea.entity_id, ea.alias_type, ea.alias_raw
           FROM entity_aliases ea
           WHERE ea.alias_norm = ?
             AND ea.valid_from <= ?
             AND (ea.valid_until IS NULL OR ea.valid_until > ?)`,
    bindings: [aliasNorm, asOfDate, asOfDate],
  };
}

The half-open interval convention ([valid_from, valid_until)) is critical for correctness at version boundaries. If OFAC removed an entity from the SDN list on 2024-03-15, then a query with asOf = '2024-03-15' should return no active record — the entity was removed that day. valid_until > asOfDate with strict inequality achieves this;>= would incorrectly include the entity on its removal date.

Compliance screening semantics

Transaction screening use cases frequently require answering three distinct questions simultaneously:

QuestionModeSQL predicate
Is this entity sanctioned today?currentvalid_until IS NULL
Was this entity sanctioned at transaction date?as-ofvalid_from <= T AND (valid_until IS NULL OR valid_until > T)
Was this entity ever sanctioned?historicalCOUNT(*) > 0 (any version)

The Workers router exposes all three modes via the screeningMode query parameter. Current mode hits the idx_sdn_current partial index and returns in under 2ms p99 per shard. AS-OF mode hits idx_sdn_pit and returns in under 5ms p99 per shard. Historical mode is implemented as a COUNT query without date predicates and returns in under 4ms p99. For cross-shard fan-out (all eight shards), these per-shard latencies are fully parallelised via Promise.all, so the p99 wall-clock time is the shard maximum rather than the sum.

Point-in-time snapshot export

Some compliance workflows require a full point-in-time snapshot of an entire vertical rather than individual entity lookups. For example, a bank may need to reconstruct the complete OFAC SDN list as it existed on a specific date for retroactive batch screening of a historical transaction set. The Workers snapshot endpoint streams the result of the point-in-time query across all rows of a shard as NDJSON:

// workers/src/snapshot-handler.ts

export async function handleSnapshotRequest(
  request: Request,
  db: D1Database,
  env: Env,
): Promise<Response> {
  const url     = new URL(request.url);
  const asOfDate = url.searchParams.get('asOf');
  const table    = url.searchParams.get('table') ?? 'sdn_entries';

  if (!asOfDate || !/^\d{4}-\d{2}-\d{2}$/.test(asOfDate)) {
    return new Response('asOf parameter required (YYYY-MM-DD)', { status: 400 });
  }

  // Verify the caller has snapshot-tier API key (rate-limited, logged)
  const authResult = await verifySnapshotAuth(request, env);
  if (!authResult.ok) return new Response('Forbidden', { status: 403 });

  const { readable, writable } = new TransformStream<Uint8Array, Uint8Array>();
  const writer = writable.getWriter();
  const encoder = new TextEncoder();

  // Stream rows in batches of 1000 using cursor pagination
  // D1 does not support server-side cursors; we use (sdn_id, version_seq) keyset pagination
  let lastKey: [string, number] | null = null;
  let done = false;

  (async () => {
    try {
      while (!done) {
        const rows = await db
          .prepare(
            lastKey
              ? `SELECT * FROM ${table}
                  WHERE valid_from <= ?1
                    AND (valid_until IS NULL OR valid_until > ?1)
                    AND (sdn_id > ?2 OR (sdn_id = ?2 AND version_seq > ?3))
                  ORDER BY sdn_id, version_seq
                  LIMIT 1000`
              : `SELECT * FROM ${table}
                  WHERE valid_from <= ?1
                    AND (valid_until IS NULL OR valid_until > ?1)
                  ORDER BY sdn_id, version_seq
                  LIMIT 1000`
          )
          .bind(...(lastKey ? [asOfDate, lastKey[0], lastKey[1]] : [asOfDate]))
          .all();

        for (const row of rows.results) {
          await writer.write(encoder.encode(JSON.stringify(row) + '\n'));
        }

        if (rows.results.length < 1000) {
          done = true;
        } else {
          const last = rows.results[rows.results.length - 1] as Record<string, unknown>;
          lastKey = [last['sdn_id'] as string, last['version_seq'] as number];
        }
      }
    } finally {
      await writer.close();
    }
  })();

  return new Response(readable, {
    headers: {
      'Content-Type': 'application/x-ndjson',
      'X-Snapshot-AsOf': asOfDate,
      'Transfer-Encoding': 'chunked',
    },
  });
}

Keyset pagination ((sdn_id, version_seq) pair) is used instead of OFFSET because D1 does not support server-side cursors and OFFSET-based pagination degrades to O(n) at large offsets in SQLite. The keyset approach maintains O(log n) per page via the primary key B-tree.

Related writing

Regulatory data staleness monitoring covers how the hub detects when a vertical shard's current records are stale relative to the upstream source — the prerequisite for keeping the bitemporal history accurate.

Regulatory data query layer describes the Workers routing logic that dispatches AS-OF and current-mode queries to the appropriate shard and merges multi-shard results.