
Swarm SDK key management: device provisioning, certificate rotation, and revocation for autonomous drone systems

11 min read · AI Analytics
Swarm SDK · Cryptography · Post-quantum · Drone

The Double Ratchet gives the Swarm SDK per-message forward secrecy; ML-KEM-768 hybrid key exchange makes the session root post-quantum secure. But neither property means anything if the long-term identity keys underpinning those sessions are poorly managed. This article documents the full key lifecycle: how a drone gets its cryptographic identity at provisioning, how identity material rotates without interrupting a live mesh, and how a compromised device is revoked — including the 180 ms SRAM wipe that renders captured hardware cryptographically inert.

The key management problem for offline drone fleets

Terrestrial PKI assumes connectivity. A TLS client can reach an OCSP responder; an enterprise device can check in with Active Directory; a smartphone can pull a fresh certificate bundle over the carrier network. Drone swarms operating in contested RF environments cannot make any of these assumptions. The mesh may be the only available communication channel, and that channel may be jammed, partitioned, or operating under a time-limited communications window decided by the mission plan — not by network availability.

The constraints collapse into three hard requirements. First, mutual authentication without online CA lookup: drones must be able to verify each other's identity during flight using only material loaded before takeoff. Second, forward-secret rotation without coordination overhead: session keys rotate automatically via the Double Ratchet, but the medium-term signed prekeys (SPKs) that seed session establishment must also rotate on a schedule that does not require a synchronous ceremony with a central authority. Third, capture-proof key isolation: a drone lost to an adversary must not expose key material for any peer. The threat model is physical extraction of volatile memory within seconds of the device going offline — not just passive ciphertext collection.

Classical CRL and OCSP-based revocation cannot satisfy the first requirement. Delta CRLs distributed over the gossip mesh can satisfy revocation during flight, but only if the revocation authority can sign a CRL and inject it into the mesh — which requires the ground control station to have an active RF link to at least one reachable node. We handle both the pre-flight case (bundle exclusion) and the in-flight case (gossip-propagated RevocationMessage) as distinct code paths described below.

Device identity at provisioning

Every Swarm SDK device generates its long-term identity keypair on-device at provisioning time, rather than receiving a private key pushed over a wire from a provisioning server. Private keys never leave the hardware after generation. The DeviceIdentity struct captures the full identity material:

pub struct DeviceIdentity {
    // Long-term identity keys — used for peer authentication, NOT session encryption.
    // Session encryption uses ephemeral keys derived through X3DH / the Double Ratchet.
    pub ik_ml_kem: MlKem768KeyPair,     // ML-KEM-768 identity key (FIPS 203)
    pub ik_x25519: X25519KeyPair,        // X25519 identity key (hybrid component)

    pub device_id: [u8; 16],             // UUIDv4, assigned at provisioning by operator
    pub capabilities: DeviceCapabilities, // role flags: RELAY | COMPUTE | EW | SA
    pub provisioned_at: u64,             // Unix timestamp (seconds)
}

// The `struct Name: u32 { const ... }` form is bitflags syntax, so it must sit inside the macro.
bitflags::bitflags! {
    pub struct DeviceCapabilities: u32 {
        const RELAY   = 0b0001;  // mesh relay: can forward frames for other nodes
        const COMPUTE = 0b0010;  // edge compute: Jetson Nano class, can run inference
        const EW      = 0b0100;  // EW coordination: has spectrum-sensing hardware
        const SA      = 0b1000;  // SA fusion: aggregates position/sensor data
    }
}
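The role flags compose as an ordinary bitmask. As a minimal std-only sketch, assuming the flag values listed above, a capability check might look like the following; `has_roles` is a hypothetical helper, not an SDK function.

```rust
/// Role flags mirroring the DeviceCapabilities values above, as plain u32 masks.
pub const RELAY: u32 = 0b0001;
pub const COMPUTE: u32 = 0b0010;
pub const EW: u32 = 0b0100;
pub const SA: u32 = 0b1000;

/// Hypothetical helper: true only if every bit in `required` is also set in `caps`.
pub fn has_roles(caps: u32, required: u32) -> bool {
    (caps & required) == required
}
```

A relay-plus-compute node would carry `RELAY | COMPUTE`, and `has_roles(caps, RELAY)` decides whether it may forward frames for other nodes.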

The ML-KEM-768 keypair is 2,400 bytes (private) and 1,184 bytes (public). The X25519 keypair is 32 bytes each. Both are generated on-device using the hardware RNG: the STM32H7's built-in TRNG peripheral at 1 Mbit/s, or /dev/random backed by the Jetson Nano's TPM on the edge compute platform. The SDK calls the platform's TRNG abstraction; no userspace PRNG is seeded from a time-based value or a software source that could be replicated.

Keypair generation takes 42 ms at the p50 on the STM32H7 for the full ML-KEM-768 + X25519 pair (see benchmarks below). This only happens once per device lifetime — or once per identity key rotation cycle, which is quarterly and conducted at base. The cost is entirely acceptable.

The public identity keys are then packaged into a device certificate and signed by the fleet CA. The private keys stay on-device in DTCM SRAM, never written to flash in plaintext, and are covered by the hardware zeroization watchdog.

Device certificates and the three-tier fleet CA

Device certificates are the mechanism by which drones authenticate each other without calling home. A cert binds a device_id and its public identity keys to a fleet, a mission scope, and a validity window, with the fleet CA's signature providing the trust anchor:

pub struct DeviceCertificate {
    pub device_id:         [u8; 16],
    pub ik_ml_kem_public:  MlKem768PublicKey,   // 1,184 bytes
    pub ik_x25519_public:  X25519PublicKey,      //    32 bytes
    pub capabilities:      DeviceCapabilities,
    pub valid_from:        u64,                  // Unix timestamp
    pub valid_until:       u64,                  // Unix timestamp (90-day window, min 7 days)
    pub fleet_id:          [u8; 16],             // scopes cert to a specific fleet / mission
    pub signature:         FleetCaSignature,     // Ed448 signature over the above fields
}

// Ed448 signature: 114 bytes
// Total wire size: ~1,374 bytes per device certificate

We use a three-tier signing hierarchy. The Root CA is air-gapped, stored on a YubiHSM 2, and signs Fleet CA certificates only — never device certificates directly. Its Ed448 private key is never exposed to a networked machine. The Fleet CA is online on the ground control station and signs device certificates for all drones in the fleet. Fleet CA certificates have a six-month validity; the Root CA must re-sign every six months via an offline ceremony. Device certificates are valid for 90 days by default (configurable down to 7 days minimum) and must be renewed before expiry at base.

// Trust hierarchy
Root CA (air-gapped YubiHSM 2, Ed448)
  └── Fleet CA (GCS HSM, Ed448, 180-day cert)
        └── Device Certificate (on-device, Ed448 signature, 90-day cert)

// FleetCaCertificate structure
pub struct FleetCaCertificate {
    pub fleet_id:        [u8; 16],
    pub ca_ed448_public: Ed448PublicKey,    // 57 bytes
    pub valid_from:      u64,
    pub valid_until:     u64,              // 6-month window
    pub root_signature:  RootCaSignature,  // Ed448 signature from Root CA
}

// Verification chain at peer authentication:
// 1. Verify root_signature over FleetCaCertificate using embedded root_ca_public (compiled-in)
// 2. Verify fleet_ca_signature over DeviceCertificate using FleetCaCertificate.ca_ed448_public
// 3. Verify valid_from <= now <= valid_until for both certs
// 4. Verify fleet_id matches the mission's expected fleet
// 5. Check device_id not in revoked_devices list from mission bundle

Ed448 (EdDSA over Curve448) gives roughly 224-bit classical security, comfortably above the 128-bit threshold for this role, in which the signing key authenticates peers and is never itself a session key. Ed448 signatures are 114 bytes; verification takes about 8 ms at p50 on the STM32H7. We chose Ed448 over Ed25519 for the CA hierarchy because the NSA's CNSA suite specifies 384-bit curves (ECDSA over P-384, roughly 192-bit security) for classical signatures, and Ed448 exceeds that level for a CA whose Root key will be trusted for years.

The Root CA public key is compiled into the SDK firmware image as a constant. If the Root CA is ever compromised, all firmware must be re-flashed — an accepted limitation for an air-gapped key that is used twice a year for a ceremony that takes less than an hour.

Pre-provisioned mission cert bundles

The solution to in-flight CA unavailability is pre-provisioning. Before drones depart, every device in the mission receives a MissionCertBundle containing everything it needs to authenticate every peer it will encounter:

pub struct MissionCertBundle {
    // This device's own cert (for presenting to peers)
    pub device_cert:     DeviceCertificate,

    // Fleet CA cert (for verifying peer device certs)
    pub fleet_ca_cert:   FleetCaCertificate,

    // All mission participants — loaded at power-on, no online lookup required
    pub peer_certs:      Vec<DeviceCertificate>,   // max 255 devices per mission

    // Devices excluded from this mission (captured or compromised before departure)
    pub revoked_devices: Vec<[u8; 16]>,

    pub mission_id:      [u8; 16],
    pub mission_expiry:  u64,        // hard expiry: certs must not be used after this

    // Signature over the entire bundle — guards against bundle tampering at load time
    pub bundle_signature: FleetCaSignature,
}

// Bundle size estimate (64-device mission):
// device_cert:     ~1,374 bytes
// fleet_ca_cert:   ~246 bytes
// peer_certs:      63 × ~1,374 = ~86,562 bytes
// revoked_devices: variable (typically 0–5 × 16 bytes)
// metadata + sig:  ~150 bytes
// Total:           ~88,350 bytes (≈86 KB for a 64-node mission)
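The size estimate above is simple arithmetic and can be reproduced for other mission sizes. The sketch below takes the approximate per-item sizes quoted in the estimate as constants; the function name and parameters are hypothetical.

```rust
/// Approximate sizes taken from the estimate above (bytes).
pub const DEVICE_CERT_BYTES: usize = 1_374;
pub const FLEET_CA_CERT_BYTES: usize = 246;
pub const METADATA_AND_SIG_BYTES: usize = 150;
pub const DEVICE_ID_BYTES: usize = 16;

/// Estimated bundle size for a mission with `mission_devices` participants
/// (including this device) and `revoked` entries in revoked_devices.
pub fn bundle_size_estimate(mission_devices: usize, revoked: usize) -> usize {
    // Every participant other than this device contributes one peer cert.
    let peers = mission_devices.saturating_sub(1);
    DEVICE_CERT_BYTES
        + FLEET_CA_CERT_BYTES
        + peers * DEVICE_CERT_BYTES
        + revoked * DEVICE_ID_BYTES
        + METADATA_AND_SIG_BYTES
}
```

With 64 devices and an empty revoked list this gives 88,332 bytes, in line with the ~88,350 figure. At the 255-device maximum the same arithmetic gives about 350,766 bytes (roughly 343 KB), which is worth keeping in mind when budgeting SRAM.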

The bundle is loaded into SRAM at power-on from the device's external flash (STM32H7: QSPI flash, AES-XTS encrypted at rest using a device-unique key stored in OTP fuses). Once the bundle is verified and loaded, the flash copy is no longer needed for in-flight operation — all peer authentication uses the in-SRAM copy.

Bundle loading verifies two things before any peer certificate is trusted: the Fleet CA cert signature validates against the compiled-in Root CA public key, and the bundle signature validates against that Fleet CA cert. A tampered bundle — one where a peer cert was swapped after signing — fails the bundle signature check before any individual cert is even parsed.

During mesh operation, a drone authenticating a peer looks up the peer's device_id in peer_certs, verifies the cert's signature and validity window, and checks that the device_id is not in revoked_devices. The lookup is O(n) over peer_certs, which at 255 devices costs at most 255 × 16-byte comparisons — approximately 4 μs on the H7. A hash map keyed on device_id would give O(1) lookup but at the cost of additional SRAM for the hash table; for missions under 64 nodes the linear scan is fast enough and avoids the allocation.
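The linear scan described above needs no allocation beyond the peer list itself. A minimal sketch, assuming a reduced `PeerCert` type that keeps only the fields relevant to lookup (both the type and `find_peer` are hypothetical):

```rust
/// Reduced stand-in for DeviceCertificate (assumed shape).
pub struct PeerCert {
    pub device_id: [u8; 16],
    pub valid_until: u64,
}

/// O(n) scan over the bundle's peer list; returns the first cert whose
/// device_id matches, or None if the peer is not part of the mission.
pub fn find_peer<'a>(peer_certs: &'a [PeerCert], device_id: &[u8; 16]) -> Option<&'a PeerCert> {
    peer_certs.iter().find(|cert| &cert.device_id == device_id)
}
```

A `None` result should be treated as an authentication failure: a device that is absent from peer_certs has no pre-provisioned trust path.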

The mission_expiry field provides a hard validity ceiling regardless of individual cert valid_until fields. A 7-day mission cert bundle expires after 7 days even if the underlying device certs have 83 days of validity remaining. Drones that power on after mission expiry will refuse to authenticate any peer — the mission is over and the bundle is no longer authoritative.
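The ceiling rule amounts to taking the earlier of the two expiry times. A short sketch, with hypothetical function names and inclusive bounds assumed to match the `valid_from <= now <= valid_until` check used for certificates:

```rust
pub const SECONDS_PER_DAY: u64 = 86_400;

/// The effective expiry is whichever comes first: the cert's own valid_until
/// or the bundle's mission_expiry.
pub fn effective_expiry(cert_valid_until: u64, mission_expiry: u64) -> u64 {
    cert_valid_until.min(mission_expiry)
}

/// True if a peer cert may be used at time `now` under the mission ceiling.
pub fn cert_usable(now: u64, valid_from: u64, cert_valid_until: u64, mission_expiry: u64) -> bool {
    valid_from <= now && now <= effective_expiry(cert_valid_until, mission_expiry)
}
```

For a 90-day cert inside a 7-day bundle, `effective_expiry` returns the 7-day mark, so a drone powering on at day 8 rejects the peer even though the cert itself has 82 days left.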

Signed prekey rotation over the gossip mesh

The Double Ratchet's symmetric chains rotate automatically with every message. The long-term identity keys (IK) are static — they seed X3DH session establishment but do not appear in any per-message operation after that. Between those two extremes sit the signed prekeys (SPK): medium-lived keys generated on-device, signed with the IK, and published so peers can initiate sessions asynchronously without the target device being online.

pub struct SignedPreKey {
    pub spk_id:             u32,
    pub spk_ml_kem_public:  MlKem768PublicKey,   // 1,184 bytes
    pub spk_x25519_public:  X25519PublicKey,      //    32 bytes
    pub created_at:         u64,
    pub valid_until:        u64,   // created_at + 7 days (604,800 seconds)
    // Signature over (spk_id || spk_ml_kem_public || spk_x25519_public || created_at || valid_until)
    // using the device's long-term Ed448 identity signing key
    pub signature:          DeviceIdentitySignature,  // 114 bytes
}

// SPK rotation gossip message
pub struct SignedPreKeyUpdate {
    pub device_id:    [u8; 16],
    pub new_spk:      SignedPreKey,
    pub sequence_num: u64,    // monotonically increasing; peers reject replays
}

// Wrapped in MAVLink v2 TUNNEL (message_id 385, payload_type 0xAA02)
// 0xAA02 = Swarm SDK key management frame (vs 0xAA01 for encrypted data frames)

SPK rotation is triggered every 7 days. On rotation, the device generates a new ML-KEM-768 + X25519 SPK pair, signs it with its IK, and broadcasts a SignedPreKeyUpdate message over the gossip mesh. The gossip propagation model is epidemic: each node that receives the update forwards it to k randomly selected peers (k = 3 by default). With a 30-node mesh and 6-second gossip intervals, the update reaches all reachable nodes within approximately 5 hops, or about 30 seconds (5 × 6 s).
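The hop count can be sanity-checked with a small simulation. The sketch below is an illustration only and rests on several assumptions: every informed node pushes to `fanout` uniformly random targets in each gossip interval (the SDK may instead forward once per node), a node may pick itself or an already-informed peer, and a SplitMix64-style generator stands in for real randomness so runs are reproducible. `rounds_to_full_coverage` and its parameters are hypothetical names.

```rust
/// SplitMix64-style step: a tiny deterministic PRNG for reproducible runs.
fn next_u64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Number of gossip rounds until every node holds the update,
/// or None if `max_rounds` elapse first. Node 0 originates the update.
pub fn rounds_to_full_coverage(
    nodes: usize,
    fanout: usize,
    seed: u64,
    max_rounds: u32,
) -> Option<u32> {
    if nodes == 0 {
        return Some(0);
    }
    let mut informed = vec![false; nodes];
    informed[0] = true;
    let mut count = 1usize;
    let mut rng = seed;
    let mut rounds = 0u32;
    while count < nodes {
        if rounds == max_rounds {
            return None;
        }
        // Snapshot the sender count so nodes informed in this round
        // only start forwarding in the next one.
        let sends = count * fanout;
        for _ in 0..sends {
            let target = (next_u64(&mut rng) % nodes as u64) as usize;
            if !informed[target] {
                informed[target] = true;
                count += 1;
            }
        }
        rounds += 1;
    }
    Some(rounds)
}
```

Multiplying the returned round count by the 6-second gossip interval gives a rough propagation time. With a fanout of 3 the informed set can at most quadruple per round, so a 30-node mesh needs at least three rounds; duplicate and self-directed sends push the typical figure higher.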

Peers receiving a SignedPreKeyUpdate perform three checks before accepting it: the SPK signature validates against the sender's IK public key (retrieved from the peer cert in the mission bundle); the sequence_num is strictly greater than any previously accepted sequence number for that device (preventing replay of an older SPK); and the valid_until is in the future. If all three pass, the peer replaces its stored SPK for that device with the new one.
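The three acceptance checks map onto a small piece of per-peer state. The sketch below is a simplified illustration: `SpkUpdate` keeps only the fields the checks use, the signature check is supplied as a closure because the signing backend is out of scope, and the store records just the accepted `spk_id` rather than the full key. All type and method names are hypothetical.

```rust
use std::collections::HashMap;

/// Reduced stand-in for SignedPreKeyUpdate (assumed shape).
pub struct SpkUpdate {
    pub device_id: [u8; 16],
    pub spk_id: u32,
    pub valid_until: u64,
    pub sequence_num: u64,
}

#[derive(Debug, PartialEq)]
pub enum SpkReject {
    BadSignature,
    Replay,
    Expired,
}

#[derive(Default)]
pub struct SpkStore {
    last_seq: HashMap<[u8; 16], u64>,
    current_spk: HashMap<[u8; 16], u32>,
}

impl SpkStore {
    /// Applies the three checks in order; state changes only if all pass.
    pub fn accept(
        &mut self,
        update: &SpkUpdate,
        now: u64,
        sig_ok: impl Fn(&SpkUpdate) -> bool,
    ) -> Result<(), SpkReject> {
        // 1. Signature must validate against the sender's IK public key.
        if !sig_ok(update) {
            return Err(SpkReject::BadSignature);
        }
        // 2. Sequence number must be strictly greater than any accepted before.
        if let Some(&seen) = self.last_seq.get(&update.device_id) {
            if update.sequence_num <= seen {
                return Err(SpkReject::Replay);
            }
        }
        // 3. The SPK must still be valid.
        if update.valid_until <= now {
            return Err(SpkReject::Expired);
        }
        self.last_seq.insert(update.device_id, update.sequence_num);
        self.current_spk.insert(update.device_id, update.spk_id);
        Ok(())
    }

    /// The spk_id currently stored for a device, if any.
    pub fn current(&self, device_id: &[u8; 16]) -> Option<u32> {
        self.current_spk.get(device_id).copied()
    }
}
```

Checking the signature first means an unauthenticated update can never advance the stored sequence number, which would otherwise let an attacker block legitimate rotations.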

In-flight sessions initiated before the rotation are not interrupted. The SPK contributes only to the initial X3DH key agreement at session establishment; once a session is running, the Double Ratchet advances from its own state and the SPK is no longer referenced. Only new session initiations use the current SPK. A device rotating its SPK mid-mission does not terminate any existing peer sessions.

Identity key rotation is deliberately not automated in-flight. Rotating the IK requires re-issuing the device certificate from the Fleet CA, which requires an HSM-authenticated signing operation on the ground control station. IK rotation is conducted quarterly at base between missions. The SPK rotation cadence of 7 days is the operational forward-secrecy boundary for the key establishment layer.

Revocation: bundle exclusion and in-flight poison pill

Certificate revocation lists are impractical during autonomous missions: they require a signed list from the CA, a way to distribute it, and a policy for how long a node will trust a cert after the CRL goes stale. We handle revocation with two distinct mechanisms depending on when the compromise is detected.

Pre-departure exclusion. If a device is known to be compromised before the mission begins, it is added to the revoked_devices list in the MissionCertBundle. The fleet CA signs a new bundle that excludes the device, and all mission participants receive the updated bundle at power-on. No drone in the mesh will initiate a session with the excluded device or route traffic on its behalf. The device's cert may still be technically valid (within its 90-day window), but the bundle overrides cert validity.

In-flight revocation. If a device is captured or confirmed compromised after departure, the ground control station broadcasts a signed RevocationMessage into the mesh through any reachable node:

pub struct RevocationMessage {
    pub revoked_device_id:     [u8; 16],
    pub revocation_timestamp:  u64,
    pub reason:                RevocationReason,
    pub fleet_ca_signature:    FleetCaSignature,  // Ed448, 114 bytes
}

pub enum RevocationReason {
    Capture           = 0x01,  // physical loss to adversary
    Compromise        = 0x02,  // key material confirmed extracted
    Malfunction       = 0x03,  // errant behavior, not adversarial
    EmergencyWipe     = 0xFF,  // triggers key destruction on the target device if still reachable
}

// On receipt, each node that validates a RevocationMessage:
//   1. Verifies fleet_ca_signature using the mission bundle's fleet_ca_cert
//   2. Adds revoked_device_id to its in-memory revocation set
//   3. Drops any open Double Ratchet session with that device_id
//   4. Purges that device's key material from the skipped-key cache
//   5. Refuses all future session initiations from that device_id
//   6. Gossips the RevocationMessage to k=3 peers (preventing re-forward loops via sequence tracking)
//   7. If reason == EmergencyWipe and self.device_id == revoked_device_id: triggers wipe procedure
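The receipt steps above can be sketched as a state transition on a node. This is a simplified illustration under stated assumptions: the Ed448 check is reduced to a boolean argument, ratchet sessions and skipped keys are placeholder byte containers, and duplicate suppression uses the in-memory revocation set rather than any particular sequence-tracking scheme. All names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

pub type DeviceId = [u8; 16];

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Reason {
    Capture,
    Compromise,
    Malfunction,
    EmergencyWipe,
}

pub struct Revocation {
    pub revoked_device_id: DeviceId,
    pub revocation_timestamp: u64, // carried for staleness handling; unused in this sketch
    pub reason: Reason,
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Rejected,                    // signature check failed: ignore the message
    Duplicate,                   // already applied: do not forward again
    Applied { wipe_self: bool }, // state purged; caller gossips to k peers
}

pub struct NodeState {
    pub self_id: DeviceId,
    pub revoked: HashSet<DeviceId>,
    pub sessions: HashMap<DeviceId, Vec<u8>>,
    pub skipped_keys: HashMap<DeviceId, Vec<[u8; 32]>>,
}

impl NodeState {
    pub fn handle_revocation(&mut self, msg: &Revocation, sig_ok: bool) -> Outcome {
        if !sig_ok {
            return Outcome::Rejected;
        }
        // insert() returns false when the id was already present.
        if !self.revoked.insert(msg.revoked_device_id) {
            return Outcome::Duplicate;
        }
        // Drop the open session and purge cached skipped keys for that device.
        self.sessions.remove(&msg.revoked_device_id);
        self.skipped_keys.remove(&msg.revoked_device_id);
        let wipe_self =
            msg.reason == Reason::EmergencyWipe && msg.revoked_device_id == self.self_id;
        Outcome::Applied { wipe_self }
    }

    /// Future session initiations from revoked devices are refused.
    pub fn accepts_session_from(&self, id: &DeviceId) -> bool {
        !self.revoked.contains(id)
    }
}
```

On `Applied`, the caller forwards the message to its k gossip peers and, if `wipe_self` is set, starts the wipe procedure; `Duplicate` ends propagation at this node, which is what stops re-forward loops in this sketch.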

The gossip propagation of a RevocationMessage follows the same epidemic model as SPK updates: 5 gossip hops at 6-second intervals delivers the revocation to all reachable nodes within approximately 30 seconds. Nodes that are partitioned from the mesh at revocation time will not receive the message until connectivity is restored, a known and accepted limitation. The revocation_timestamp ensures that nodes receiving the message after a partition do not treat it as stale; any message with a revocation_timestamp newer than their last seen state for that device is applied, regardless of how much wall-clock time has elapsed.

Purging ratchet state on revocation is the critical step. Simply marking a device as revoked and refusing new sessions is insufficient: an adversary who has captured a device and extracted its key material can potentially use those keys to decrypt cached messages if peer nodes retain their cached decryption keys for that session. Immediately purging the skipped-key cache entries for the revoked device closes this window. Messages that were in-flight at the moment of revocation may be undecryptable after the purge — this is the correct behavior. Operational continuity matters less than key isolation once a device is confirmed captured.

Emergency wipe: hardware tamper response and remote trigger

Key zeroization is the last line of defense against physical key extraction. The Swarm SDK implements emergency wipe through two independent trigger paths: hardware tamper detection and remote revocation command.

On the STM32H7, all key material is held in DTCM SRAM (512 KB, directly accessible only to the Cortex-M7 core). Critically, it is not persisted to external flash during operation — flash contains only the encrypted cert bundle and the device certificate. The private identity keys exist only in SRAM. A power cycle destroys them without any software action required.

// STM32H7 wipe procedure (hardware tamper path)
// Triggered by: tamper pin assert, power brown-out, hard fault, watchdog reset

fn emergency_wipe_stm32h7() -> ! {
    // 1. Disable all DMA channels that could be reading key material
    disable_all_dma();

    // 2. Triple-overwrite the key region; each pass fills it with a fresh
    //    byte drawn from the TRNG (one byte pattern per pass, not a random stream)
    //    Key region: [0x2000_0000 .. 0x2008_0000), 512 KB DTCM SRAM
    for _pass in 0..3 {
        let random_word = stm32h7_trng_read_u32();
        memset_volatile(KEY_REGION_START, random_word as u8, KEY_REGION_SIZE);
    }

    // 3. Lock flash to prevent any further write to cert storage
    FLASH.CR.set_lock();

    // 4. Halt the core: no further code executes, so the function never returns
    cortex_m::asm::bkpt();
    loop { cortex_m::asm::wfi(); }
}

// Elapsed time (SRAM scrub only, measured on STM32H7 @ 480 MHz):
// 512 KB × 3 passes at AHB bus throughput ≈ 180 ms (p50), 195 ms (p99)
// DMA disable + flash lock: ~2 ms additional
// Total: ~182 ms from trigger to halt
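The core of the scrub is an overwrite the compiler is not allowed to optimise away. The host-runnable sketch below shows one way to express that with volatile writes; it is not the firmware routine. The `scrub` name is hypothetical, and the per-pass fill bytes are supplied by the caller here, whereas on the target they would come from the TRNG.

```rust
use core::ptr;
use core::sync::atomic::{compiler_fence, Ordering};

/// Overwrites `region` once per entry in `patterns`, in order.
/// Volatile writes keep the compiler from eliding stores to memory
/// that is never read again.
pub fn scrub(region: &mut [u8], patterns: &[u8]) {
    for &pattern in patterns {
        for byte in region.iter_mut() {
            // SAFETY: `byte` is a valid, exclusive reference into `region`.
            unsafe { ptr::write_volatile(byte, pattern) };
        }
        // Keep the compiler from reordering or merging consecutive passes.
        compiler_fence(Ordering::SeqCst);
    }
}
```

Calling `scrub(&mut key_bytes, &[a, b, c])` with three fill bytes mirrors the triple-overwrite structure; after the call the region holds only the last pattern.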

The tamper pin on the STM32H7 is connected to the chassis open sensor. Physical tamper detection — opening the drone enclosure — asserts the tamper pin and triggers the wipe procedure before any probe can reach the PCB. The STM32H7's TAMP peripheral monitors the pin in backup domain power, meaning the wipe can trigger even if the main power rail is cut first, as long as the backup battery (CR2032 on the reference design) has charge.

On the Jetson Nano, the wipe procedure follows the same triple-overwrite logic but operates on the shared DRAM rather than SRAM. Key material is allocated in a locked, non-swappable memory region: mlock(2) prevents swapping, and madvise(2) with MADV_DONTDUMP excludes the region from core dumps. The Jetson wipe calls memset_s three times over the locked region, then munlocks and explicitly calls the kernel's secure memory release path. Total Jetson wipe time is approximately 45 ms for a 64 MB key region, faster than the H7 per byte because the Jetson's memory bus is wider and runs at a higher clock.

The remote wipe path — a RevocationMessage with reason = EmergencyWipe delivered over the gossip mesh — triggers the same procedure when the target device receives and validates it. The device verifies the fleet CA signature, confirms that the revoked_device_id matches its own, and if both checks pass, calls emergency_wipe(). There is no way to cancel or defer a validated emergency wipe — by design.

A captured drone that is still powered on and has mesh connectivity can be wiped within the gossip propagation window (~30 seconds for a 30-node mesh). A captured drone that is powered off cannot be reached by the remote path — but its key material is already gone since the STM32H7's SRAM is volatile. The only residual risk is a capture event that interrupts the device between power-on (when keys are loaded from encrypted flash into SRAM) and the tamper detection triggering. This window is documented in the threat model and currently accepted; future hardware revisions will tighten it with faster tamper detection latency.

STM32H7 benchmarks

All benchmarks were run on the STM32H7 at 480 MHz in Rust release mode (codegen-units=1, lto=thin, no hardware AES accelerator unless noted). p50 and p99 are over 1,000 iterations with the TRNG active throughout to maintain entropy pool pressure.

Key management operation benchmarks — STM32H7 @ 480 MHz
All times in milliseconds. 1,000-iteration sample, TRNG active.

Operation                                        p50      p99
─────────────────────────────────────────────────────────────
IK keypair generation (ML-KEM-768 + X25519)      42 ms    51 ms
  └─ ML-KEM-768 keygen only                      38 ms    46 ms
  └─ X25519 keygen only                           4 ms     5 ms

Device cert verification (Ed448 sig check)        8 ms    12 ms
Mission bundle load + verify (64-device bundle)  74 ms    91 ms
  └─ Bundle sig verify                            8 ms    11 ms
  └─ 64× cert sig verify (parallelised, 2 cores) 66 ms    80 ms

SPK generation (ML-KEM-768 + X25519)             42 ms    51 ms
SPK signature (Ed448 over SPK fields)             6 ms     8 ms
SPK rotation + gossip broadcast                  22 ms    38 ms
  └─ SPK generate + sign                         48 ms    59 ms
  └─ Gossip frame encoding + enqueue              2 ms     4 ms

RevocationMessage verify + state purge            6 ms    14 ms
  └─ Ed448 sig verify                             8 ms    12 ms
  └─ Ratchet state cache purge (per session)      1 ms     3 ms

Emergency wipe (SRAM scrub, 512 KB × 3 passes)  180 ms   195 ms
  └─ DMA disable + flash lock                     2 ms     2 ms
  └─ Total from trigger to core halt             182 ms   197 ms

The IK keypair generation at 42 ms p50 occurs only at provisioning or quarterly IK rotation, never during mission flight. SPK rotation at 22 ms p50 is the only periodic key management cost during flight, and it is scheduled on a low-priority background thread so it does not appear on the flight-critical scheduling path. The gossip broadcast itself (2 ms) adds negligible latency relative to the 6-second gossip interval.

Mission bundle load at 74 ms p50 happens once at power-on. Against a typical 60-second boot sequence, this is about 0.12% of boot time. The cert verification component (66 ms for 64 devices) can be parallelized across the STM32H7's dual cores; single-core verification of 64 certs takes approximately 128 ms p50.

The revocation verify and purge at 6 ms p50 is dominated by the Ed448 signature check. On receipt of a RevocationMessage, the verification + purge completes before the next gossip interval (6 seconds) — the revocation takes effect immediately on the receiving node. The 30-second full-mesh propagation time reflects gossip hop latency, not verification overhead.

Key lifecycle summary

Pulling the pieces together, the complete key lifecycle for a Swarm SDK device across its operational lifetime:

Provisioning (one-time, at base)
  ├── Generate IK (ML-KEM-768 + X25519) on-device via TRNG         [42 ms STM32H7]
  ├── Fleet CA signs DeviceCertificate (Ed448, 90-day validity)
  ├── Load MissionCertBundle (peer certs, revoked list, mission expiry)
  └── Store encrypted bundle to QSPI flash (AES-XTS, device key in OTP)

Power-on (each mission)
  ├── Load + verify MissionCertBundle from flash                    [74 ms STM32H7]
  ├── Check bundle_signature (fleet CA Ed448)
  ├── Check fleet_ca_cert signature (root CA Ed448)
  └── Populate peer cert map + revocation set in SRAM

In-flight (continuous)
  ├── SPK rotation every 7 days
  │     ├── Generate new SPK (ML-KEM-768 + X25519 + Ed448 sig)     [22 ms STM32H7]
  │     └── Gossip SignedPreKeyUpdate to all reachable peers
  ├── Session keys: X3DH establishment → Double Ratchet             [see swarm-double-ratchet]
  └── Revocation: gossip RevocationMessage → verify → purge         [ 6 ms STM32H7]

Quarterly (at base, offline ceremony)
  ├── IK rotation: generate new IK on-device
  ├── Fleet CA re-issues DeviceCertificate against new IK
  └── Fleet CA cert renewed by Root CA (every 6 months)

Key destruction
  ├── Normal power-off: SRAM volatile, keys gone on power cycle
  ├── Tamper detect: SRAM triple-overwrite + flash lock + halt      [182 ms STM32H7]
  └── Remote wipe: RevocationMessage(EmergencyWipe) → same procedure

The design avoids any key operation that requires synchronous communication with a central authority during flight. SPK rotation is self-certified (signed by the device's own IK, which peers already trust from the mission bundle). Revocation propagates over the gossip mesh without requiring any peer to contact the Fleet CA. The only operations that touch the Fleet CA are provisioning, quarterly IK rotation, and mission bundle signing — all conducted at base with the ground control station online.

The result is a key management system that is offline-first by construction: every cryptographic decision a drone needs to make during flight can be made entirely from local state, with the mission cert bundle as the trust anchor and the gossip mesh as the distribution channel for key updates and revocations. The Fleet CA is a provisioning authority, not an online authentication authority.

