Why is exactly-once webhook delivery impossible at the transport layer?

Network partitions, TLS handshake timeouts, and intermediary proxy retries make it impossible to deliver a message and confirm its acknowledgment as a single atomic step. The producer cannot distinguish a failed delivery from a failed acknowledgment, so it must retry, which yields at-least-once delivery at the transport layer. Exactly-once semantics require application-level deduplication.

What TTL should I set on my webhook deduplication store?

Set the TTL to the provider's maximum retry window plus 30%. If the provider retries for 72 hours, set TTL to 94 hours. This prevents late-arriving duplicates from slipping through after key expiration without wasting storage on indefinitely long windows.

How do I prevent duplicate webhook processing across horizontally scaled consumers?

Use an atomic set-if-absent operation on a shared coordination store (Redis SET NX or a DynamoDB conditional write) keyed on the provider-supplied event ID. Local caches or application-level locks are insufficient; only a single shared atomic operation prevents races during concurrent delivery bursts to multiple consumer replicas.

Webhook Delivery Guarantees

Distributed webhook infrastructure operates under a fundamental constraint: network partitions, transient DNS failures, and consumer-side garbage-collection pauses make exactly-once delivery mathematically impossible at the transport layer. Production-grade webhook providers therefore default to at-least-once semantics, shifting the burden of duplicate suppression from the network to the application ingestion layer. This page addresses the specific distributed failure mode of unreliable transport acknowledgment: a producer that cannot distinguish a lost delivery from a lost 2xx response must retry, so every consumer must be prepared to absorb the same event payload multiple times without corrupting downstream state.

Guarantee Model

The contract this pattern provides is application-level exactly-once execution on top of an at-least-once transport. The consumer absorbs any number of identical payload deliveries, executes business logic exactly once, and returns the original response on every subsequent delivery. The contract breaks in three scenarios:

Clock skew beyond the replay-attack window. If the consumer rejects payloads with timestamps >5 minutes in the past, a slow producer behind a heavily delayed retry queue may deliver events that the consumer treats as replay attacks and discards — silently dropping legitimate events.
Deduplication store partition. If the coordination store becomes unreachable, atomic check-and-set operations fail. Blind fallback to local state risks windowed duplicates during node rebalancing; blind rejection drops events. The correct degradation is to queue events for post-recovery reconciliation.
TTL expiry before retry exhaustion. If the deduplication key expires before the provider’s last retry attempt, a late delivery will be re-processed as a new event. TTLs must be aligned to the provider’s maximum retry window with a 30% margin.
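
The first failure mode above can be made concrete with a small timestamp check. This is a minimal sketch, assuming the provider attaches a Unix timestamp to each delivery; the function name `is_within_replay_window` and its parameters are illustrative and not part of any provider SDK.

```python
REPLAY_TOLERANCE_SECONDS = 5 * 60  # the 5-minute window described above


def is_within_replay_window(event_ts: float, now: float,
                            tolerance_seconds: int = REPLAY_TOLERANCE_SECONDS) -> bool:
    """Return True if the event timestamp is no older than the tolerance."""
    return (now - event_ts) <= tolerance_seconds


now = 1_700_000_000.0
# A delivery sent 30 seconds ago passes the check.
fresh_delivery = is_within_replay_window(now - 30, now)
# A legitimate retry that waited an hour in a delayed queue is rejected.
delayed_retry = is_within_replay_window(now - 3600, now)
```

Whether a retry carries the original event time or the time of the retry attempt is provider-specific, so the check is only safe once that behaviour has been confirmed for the provider in question.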

Core Algorithm: Idempotent Webhook Ingestion

The ingestion pipeline is a finite state machine (FSM) with explicit, auditable transitions. Every transition is persisted before the next begins, enabling safe recovery from pod crashes, database deadlocks, or downstream API timeouts.

The pipeline steps in order:

1. RECEIVED. Log the raw payload and all headers. Do not touch business logic yet.
2. VALIDATED. Verify the HMAC signature (X-Hub-Signature-256 or provider equivalent). Reject requests whose timestamp skew exceeds 5 minutes with 400 Bad Request. Reject missing or malformed signatures before entering the deduplication layer.
3. PROCESSING. Attempt an atomic SET NX on the idempotency key. If the key is absent, it is created in PROCESSING state and the pipeline proceeds. If the key already maps to COMMITTED, short-circuit: return 200 OK with the stored response payload without re-executing business logic. If it maps to PROCESSING, another replica holds the event; respond with a non-2xx status so the producer retries later instead of treating the event as delivered.
4. COMMITTED. Execute business logic within a single database transaction. Persist the FSM state as COMMITTED and store the canonical response payload atomically in the same transaction.
5. ACKNOWLEDGED. Return 200 OK (or 204 No Content) to the producer. The producer stops retrying.
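
The ordered states above can be expressed as a small transition table. This is a sketch of the happy path only, under the assumption that each state has exactly one legal successor; `NEXT_STATE`, `advance`, and `IllegalTransition` are hypothetical names, and recovery paths (such as requeueing a stale PROCESSING event) are deliberately left out.

```python
from enum import Enum


class State(Enum):
    RECEIVED = 1
    VALIDATED = 2
    PROCESSING = 3
    COMMITTED = 4
    ACKNOWLEDGED = 5


# Each state may only advance to the next one in the pipeline.
NEXT_STATE = {
    State.RECEIVED: State.VALIDATED,
    State.VALIDATED: State.PROCESSING,
    State.PROCESSING: State.COMMITTED,
    State.COMMITTED: State.ACKNOWLEDGED,
}


class IllegalTransition(Exception):
    """Raised when a caller tries to skip or reverse a pipeline step."""


def advance(current: State, target: State) -> State:
    """Return target if it is the legal successor of current, else raise."""
    if NEXT_STATE.get(current) is not target:
        raise IllegalTransition(f"{current.name} -> {target.name}")
    return target
```

Persisting the value returned by `advance` before starting the next step is what would make each transition auditable in the sense described above.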

On any transient failure during step 4, the FSM halts at PROCESSING, releases database locks, and schedules a compensating retry using exponential backoff with jitter instead of blocking the HTTP response thread. After the configured retry budget is exhausted (e.g. 5 attempts over 24 hours), the event moves to a dead-letter queue (DLQ) with the full payload, headers, and failure context attached.
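
A retry schedule of this shape might look like the following. It is a sketch using the "full jitter" variant of exponential backoff, chosen here as an assumption; the function name `backoff_delays` and the way the base delay is derived from the budget are illustrative choices, not values taken from any specific provider or library.

```python
import random


def backoff_delays(attempts: int = 5, budget_seconds: float = 24 * 3600.0,
                   rng: random.Random = None) -> list:
    """Full-jitter backoff: delay i is drawn uniformly from [0, base * 2**i].

    The base is chosen so that the worst-case delays sum to the budget.
    """
    rng = rng or random.Random()
    base = budget_seconds / (2 ** attempts - 1)
    return [rng.uniform(0.0, base * 2 ** attempt) for attempt in range(attempts)]


# Seeded only to make the sketch reproducible; production code would not seed.
delays = backoff_delays(rng=random.Random(42))
```

Because each delay is capped by a geometrically growing ceiling and the ceilings sum to the budget, the total wait never exceeds 24 hours, after which the event would go to the DLQ.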

Implementation Variants

Variant 1: Redis Atomic SET NX

The fastest option for high-throughput streams. The deduplication check is a single round-trip using Redis’s SET key value NX EX ttl command, which sets the key only if it does not already exist.

import redis
import json
import hashlib
import hmac
import time

r = redis.Redis(host="redis-cluster", port=6379, decode_responses=True)

DEDUP_TTL_SECONDS = 94 * 3600  # provider retries 72 h; add 30%

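def ingest_webhook(event_id: str, payload: dict, signature: str, secret: str,
                   raw_body: bytes = b"") -> dict:
    # Step 1: Validate HMAC. Providers sign the exact bytes they sent, so pass
    # raw_body from the HTTP layer. The re-serialized fallback only matches if
    # the provider emits this same compact, key-sorted JSON.
    body = raw_body or json.dumps(payload, separators=(",", ":"), sort_keys=True).encode()
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(f"sha256={expected}", signature):
        raise ValueError("Invalid HMAC signature")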

    dedup_key = f"wh:dedup:{event_id}"

    # Step 2: Atomic SET NX — single round-trip, no TOCTOU race
    acquired = r.set(dedup_key, "PROCESSING", nx=True, ex=DEDUP_TTL_SECONDS)
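    if not acquired:
        # Key exists. Only a COMMITTED event may be acknowledged as a duplicate.
        if r.get(dedup_key) != "COMMITTED":
            # Another replica is still processing (or crashed mid-flight).
            # Raise so the HTTP layer returns a retryable status, not 200.
            raise RuntimeError(f"event {event_id} is still PROCESSING")
        cached = r.get(f"wh:resp:{event_id}")
        return json.loads(cached) if cached else {"status": "duplicate"}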

    try:
        result = execute_business_logic(payload)
        pipe = r.pipeline(transaction=True)
        pipe.set(dedup_key, "COMMITTED", ex=DEDUP_TTL_SECONDS)
        pipe.set(f"wh:resp:{event_id}", json.dumps(result), ex=DEDUP_TTL_SECONDS)
        pipe.execute()
        return result
    except Exception:
        r.delete(dedup_key)  # release so the provider's retry can re-acquire
        raise

def execute_business_logic(payload: dict) -> dict:
    # domain-specific logic here
    return {"processed": True, "event_id": payload.get("id")}
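
To see why the check and the write must be a single atomic operation, the race can be reproduced without Redis. The sketch below uses an in-memory stand-in whose `set_if_absent` method mimics SET NX semantics; `InMemoryDedupStore` and `deliver` are hypothetical names, and a lock-protected dictionary is only an assumption made so the example runs locally.

```python
import threading


class InMemoryDedupStore:
    """Stand-in for a shared store; set_if_absent mimics SET NX semantics."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def set_if_absent(self, key: str, value: str) -> bool:
        # The existence check and the write happen under one lock, so no other
        # thread can slip in between them.
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True


store = InMemoryDedupStore()
winners = []
winners_lock = threading.Lock()


def deliver(replica_id: int) -> None:
    """Simulate one consumer replica receiving the same event."""
    if store.set_if_absent("wh:dedup:evt_123", "PROCESSING"):
        with winners_lock:
            winners.append(replica_id)


threads = [threading.Thread(target=deliver, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However the threads interleave, exactly one replica acquires the key; a separate "GET then SET" sequence would not give that guarantee.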

Variant 2: PostgreSQL Upsert with Conditional Insert

The most durable option for systems that already run Postgres. The deduplication state lives in the same ACID transaction as the business logic, eliminating the need for distributed coordination.

-- Schema
CREATE TABLE webhook_events (
    event_id        TEXT PRIMARY KEY,
    state           TEXT NOT NULL DEFAULT 'RECEIVED',
    payload         JSONB NOT NULL,
    response        JSONB,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    committed_at    TIMESTAMPTZ,
    expires_at      TIMESTAMPTZ NOT NULL
);
CREATE INDEX ON webhook_events (expires_at)
    WHERE state != 'COMMITTED';  -- partial index for TTL cleanup

// Go: transactional upsert pattern
func IngestWebhook(ctx context.Context, db *sql.DB, eventID string, payload []byte) ([]byte, error) {
    tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
    if err != nil {
        return nil, err
    }
    defer tx.Rollback()

    var state string
    var response []byte
    err = tx.QueryRowContext(ctx,
        `INSERT INTO webhook_events (event_id, state, payload, expires_at)
         VALUES ($1, 'PROCESSING', $2, now() + interval '94 hours')
         ON CONFLICT (event_id) DO UPDATE SET state = webhook_events.state
         RETURNING state, response`,
        eventID, payload,
    ).Scan(&state, &response)
    if err != nil {
        return nil, err
    }

    if state == "COMMITTED" {
        // Duplicate: end the transaction cleanly, then return the stored response.
        if err := tx.Commit(); err != nil {
            return nil, err
        }
        return response, nil
    }

    result, err := executeBusinessLogic(ctx, tx, payload)
    if err != nil {
        return nil, err // tx.Rollback() via defer
    }

    _, err = tx.ExecContext(ctx,
        `UPDATE webhook_events
         SET state = 'COMMITTED', response = $1, committed_at = now()
         WHERE event_id = $2`,
        result, eventID,
    )
    if err != nil {
        return nil, err
    }
    return result, tx.Commit()
}
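
The same-transaction idea can be tried locally without a Postgres instance. The sketch below substitutes SQLite (via Python's standard `sqlite3` module) purely so it runs anywhere; the `ledger` table, the `ingest` function, and the use of `INSERT OR IGNORE` in place of `ON CONFLICT` are assumptions for illustration, not the schema shown above.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webhook_events ("
    " event_id TEXT PRIMARY KEY,"
    " state TEXT NOT NULL,"
    " response TEXT)"
)
conn.execute("CREATE TABLE ledger (event_id TEXT NOT NULL, amount INTEGER NOT NULL)")
conn.commit()


def ingest(event_id: str, amount: int) -> dict:
    """Claim the event and apply its side effect in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "INSERT OR IGNORE INTO webhook_events (event_id, state) "
            "VALUES (?, 'PROCESSING')",
            (event_id,),
        )
        if cur.rowcount == 0:
            # Row already exists: return the stored response, skip business logic.
            row = conn.execute(
                "SELECT response FROM webhook_events WHERE event_id = ?",
                (event_id,),
            ).fetchone()
            return json.loads(row[0])
        # Business logic (hypothetical): record the amount in a ledger table.
        conn.execute(
            "INSERT INTO ledger (event_id, amount) VALUES (?, ?)", (event_id, amount)
        )
        response = {"processed": True, "event_id": event_id}
        conn.execute(
            "UPDATE webhook_events SET state = 'COMMITTED', response = ? "
            "WHERE event_id = ?",
            (json.dumps(response), event_id),
        )
        return response


first = ingest("evt_1", 500)
second = ingest("evt_1", 500)
ledger_rows = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
```

The second call returns the stored response and the ledger still holds a single row, because the deduplication row and the ledger write commit or roll back together.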

Variant 3: DynamoDB Conditional Write

The appropriate choice for AWS-native deployments or serverless consumers where Redis is unavailable. DynamoDB conditional expressions provide the same atomicity as SET NX.

// Node.js: DynamoDB conditional write
const { DynamoDBClient, PutItemCommand, GetItemCommand } = require("@aws-sdk/client-dynamodb");

const dynamo = new DynamoDBClient({ region: "us-east-1" });
const TABLE = "WebhookDedup";
const TTL_SECONDS = 94 * 3600;

async function ingestWebhook(eventId, payload) {
  const expiresAt = Math.floor(Date.now() / 1000) + TTL_SECONDS;

  try {
    await dynamo.send(new PutItemCommand({
      TableName: TABLE,
      Item: {
        event_id: { S: eventId },
        state:    { S: "PROCESSING" },
        ttl:      { N: String(expiresAt) },
      },
      ConditionExpression: "attribute_not_exists(event_id)",
    }));
  } catch (err) {
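    if (err.name === "ConditionalCheckFailedException") {
      // Duplicate: read the item back with a strongly consistent read.
      // Attribute names are aliased in case they collide with DynamoDB
      // reserved words inside the projection expression.
      const { Item } = await dynamo.send(new GetItemCommand({
        TableName: TABLE,
        Key: { event_id: { S: eventId } },
        ProjectionExpression: "#s, #r",
        ExpressionAttributeNames: { "#s": "state", "#r": "response" },
        ConsistentRead: true,
      }));
      if (Item?.state?.S !== "COMMITTED") {
        // Still PROCESSING elsewhere: fail so the provider retries later.
        throw new Error(`event ${eventId} is still PROCESSING`);
      }
      return Item.response?.S ? JSON.parse(Item.response.S) : { status: "duplicate" };
    }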
    throw err;
  }

  const result = await executeBusinessLogic(payload);

  await dynamo.send(new PutItemCommand({
    TableName: TABLE,
    Item: {
      event_id: { S: eventId },
      state:    { S: "COMMITTED" },
      response: { S: JSON.stringify(result) },
      ttl:      { N: String(expiresAt) },
    },
  }));
  return result;
}

Variant 4: Tiered Bloom Filter + Redis

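For very high-throughput streams (>50 k events/second), sending every request through the full write path of the primary datastore becomes a bottleneck. A probabilistic Bloom filter at the ingress node pre-screens event IDs: a hit marks a probable duplicate that can usually be answered with read-only lookups, while a miss falls through to the full SET NX flow. Because the filter is local to each process, a miss does not prove that no other replica has seen the event, so Redis remains the authoritative check.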

from pybloom_live import ScalableBloomFilter

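# Per-process filter (not shared across replicas); Redis is the authoritative store
_bloom = ScalableBloomFilter(initial_capacity=1_000_000, error_rate=0.001)

def ingest_with_bloom(event_id: str, payload: dict, signature: str, secret: str) -> dict:
    if event_id in _bloom:
        # Probable duplicate. Authenticate before serving any cached response,
        # because this fast path skips the HMAC check inside ingest_webhook.
        body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode()
        expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(f"sha256={expected}", signature):
            raise ValueError("Invalid HMAC signature")
        # Verify against the authoritative store
        state = r.get(f"wh:dedup:{event_id}")
        if state == "COMMITTED":
            cached = r.get(f"wh:resp:{event_id}")
            return json.loads(cached) if cached else {"status": "duplicate"}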
    # Definite new event OR false positive: proceed to full SET NX flow
    result = ingest_webhook(event_id, payload, signature, secret)
    _bloom.add(event_id)
    return result
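
The property the pre-screen relies on is that a Bloom filter never reports a false negative: a miss is definitive for that filter, while a hit is only probable. The toy filter below illustrates this without external dependencies; `TinyBloom`, its bit-array size, and its hash scheme are assumptions made for the sketch and are unrelated to how `pybloom_live` is implemented.

```python
import hashlib


class TinyBloom:
    """Illustrative Bloom filter: k hash positions over an m-bit array."""

    def __init__(self, m_bits: int = 8192, k_hashes: int = 4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bloom = TinyBloom()
seen = [f"evt_{i}" for i in range(100)]
for event_id in seen:
    bloom.add(event_id)

# Every added ID is reported as present: there are no false negatives.
no_false_negatives = all(event_id in bloom for event_id in seen)
```

A hit for an ID that was never added is still possible, which is why a hit must always be confirmed against the authoritative store before a cached response is returned.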

Variant Comparison

Variant	Consistency	Latency	Storage overhead	Best for
Redis SET NX	Strong (single-node)	1–5 ms	Low (keys only)	High-throughput, latency-sensitive
PostgreSQL upsert	Serializable ACID	5–20 ms	Medium (row + index)	Fintech, audit-required workloads
DynamoDB conditional	Strong (single-region)	3–10 ms	Low-medium	AWS-native / serverless
Bloom + Redis	Strong (Redis authoritative; filter advisory)	0.5–3 ms	Very low at ingress	>50 k events/s streams

HTTP Transport Constraints

Webhook payloads traverse complex HTTP/2 and HTTP/3 stacks where connection pooling, TLS handshake timeouts, and intermediary proxies introduce non-deterministic delivery characteristics. Although RFC 9110 does not define POST as an idempotent method, it remains the industry standard for webhooks. The operational solution is to decouple HTTP method semantics from application-level execution guarantees, a principle explored in detail in HTTP Method Semantics & Safety. Apply the following transport-layer hardening:

Header preservation. Require critical metadata (X-Webhook-Signature, X-Request-ID) in standardized header positions. Validate header presence before entering the deduplication layer; reject malformed requests with 400 Bad Request. Never rely on query-string parameters for security-critical metadata.
Timeout tuning. Configure connect_timeout at 2 s and read_timeout at 10 s independently. Long read_timeout values exhaust connection pools during slow consumer processing, triggering upstream retries that compound duplicate volume. Return 202 Accepted for work that exceeds 5 s of processing time and complete it asynchronously.
Backpressure signaling. Return 429 Too Many Requests with a Retry-After: 300 header when the deduplication store is degraded. This prevents thundering-herd retry storms from overwhelming a partially available consumer.
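
Header validation of the kind described above is only reliable when the signature is computed over the exact bytes received. The sketch below assumes the `sha256=<hex>` header format used by X-Hub-Signature-256; the function name `verify_signature`, the secret, and the sample body are hypothetical.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, signature_header: str, secret: bytes) -> bool:
    """Compare the header against an HMAC-SHA256 of the unparsed request body."""
    expected = "sha256=" + hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


secret = b"whsec_example"  # hypothetical shared secret
raw_body = b'{"id": "evt_123", "amount": 500}'
good_header = "sha256=" + hmac.new(secret, raw_body, hashlib.sha256).hexdigest()

accepted = verify_signature(raw_body, good_header, secret)

# Re-serializing parsed JSON changes the bytes (whitespace here), so the same
# header no longer verifies.
reserialized = b'{"id":"evt_123","amount":500}'
rejected = verify_signature(reserialized, good_header, secret)
```

This is the reason to capture the raw body before any JSON parsing or proxy transformation takes place.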

Edge Cases & Failure Scenarios

Failure Scenario	Remediation Steps	Observability Hooks
Deduplication store unreachable during peak delivery	Activate local Bloom filter fallback; queue events in an async buffer; batch-reconcile against authoritative store on recovery; page on-call if buffer depth exceeds 10 k	Alert: `dedup_store_error_rate > 0.01` for 60 s; metric: `buffer_queue_depth`; log: `dedup_fallback=true event_id=…`
Provider retry storm after consumer 5xx window	Return `429` with `Retry-After: 300`; open circuit breaker on downstream APIs at 50% error rate over 60 s; drain async queue asynchronously behind the circuit	Alert: `duplicate_hit_rate > 2x baseline` for 120 s; metric: `circuit_breaker_state`; trace span: `webhook.circuit=open`
FSM stuck in PROCESSING after pod crash	Implement a reconciliation worker that scans for events in `PROCESSING` state older than `read_timeout + 30 s` and transitions them back to `RECEIVED` for retry; ensure pod startup probes prevent traffic until the store is reachable	Alert: `stale_processing_events > 0` for 300 s; metric: `fsm_state{state="PROCESSING"} age`; log: `fsm_recovery=true`
Late delivery after deduplication key TTL expiry	Set TTL = provider retry window × 1.3 (e.g. 94 h for a 72 h provider); nightly reconciliation job compares internal committed events against provider settlement files to detect post-TTL duplicates	Metric: `dedup_key_miss_late_delivery`; alert on any non-zero value in settlement reconciliation diff
HMAC signature mismatch from proxy header stripping	Validate signature against raw request body before any parsing; log the full incoming `Content-Type` and `Content-Encoding` headers on mismatch to identify proxy transformation; accept both hex and base64 variants	Alert: `hmac_failure_rate > 0.001` for 30 s; log: `hmac_failure=true proxy_via=…`
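
The reconciliation worker from the "FSM stuck in PROCESSING" row can be sketched as a pure function over event records. The `EventRecord` type, the `requeue_stale` function, and the in-memory list are assumptions for illustration; a real worker would query the deduplication store instead.

```python
from dataclasses import dataclass

READ_TIMEOUT_SECONDS = 10                       # read_timeout from the transport section
STALE_AFTER_SECONDS = READ_TIMEOUT_SECONDS + 30  # threshold named in the table


@dataclass
class EventRecord:
    event_id: str
    state: str
    updated_at: float  # Unix seconds of the last FSM transition


def requeue_stale(events: list, now: float) -> list:
    """Move PROCESSING events older than the threshold back to RECEIVED."""
    requeued = []
    for event in events:
        if event.state == "PROCESSING" and now - event.updated_at > STALE_AFTER_SECONDS:
            event.state = "RECEIVED"
            event.updated_at = now
            requeued.append(event.event_id)
    return requeued


now = 1_000.0
events = [
    EventRecord("evt_a", "PROCESSING", now - 120),  # stale: owner likely crashed
    EventRecord("evt_b", "PROCESSING", now - 5),    # still within the timeout
    EventRecord("evt_c", "COMMITTED", now - 500),   # finished, never touched
]
requeued = requeue_stale(events, now)
```

Only the stale PROCESSING event is returned for retry; in-flight and committed events are left alone.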

Operational Concerns

TTL alignment. Set deduplication store TTLs to the provider’s stated maximum retry duration plus 30%. For a provider that retries for 72 hours, use 94 hours (72 × 1.3 = 93.6, rounded up). Stripe documents live-mode retries for up to three days with exponential backoff, so the same rule gives 94 hours; rounding up to 96 hours (four full days) is a reasonable simplification. Never set TTLs shorter than the provider’s retry window: a late-arriving delivery that misses an expired key silently bypasses deduplication.
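
The margin rule is simple enough to encode as a helper so that TTLs are derived rather than hard-coded. This is a small sketch; `dedup_ttl_hours` is a hypothetical name, and rounding up to whole hours is an assumption consistent with the 94-hour figure above.

```python
import math


def dedup_ttl_hours(retry_window_hours: int, margin_percent: int = 30) -> int:
    """Retry window plus margin, rounded up to whole hours."""
    return math.ceil(retry_window_hours * (100 + margin_percent) / 100)


ttl_hours = dedup_ttl_hours(72)   # 72 h window with a 30% margin
ttl_seconds = ttl_hours * 3600    # value to pass as the key expiry
```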

Index strategy for PostgreSQL. A partial index on (expires_at) WHERE state != 'COMMITTED' keeps the TTL cleanup job fast without scanning committed rows. Add a separate index on (state, created_at) for the FSM reconciliation worker’s stale-PROCESSING query.

Memory and storage budgeting. A Redis SET NX key storing a 36-byte UUID event ID with state overhead consumes approximately 200 bytes. At 10 million events per day with a 94-hour TTL, the working set is roughly 40 million keys × 200 bytes = 8 GB. This estimate covers the deduplication keys only; cached response payloads stored alongside them add size-dependent overhead. Plan Redis cluster sizing accordingly and alert when used_memory_rss crosses a 75% high-water mark.
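
The sizing arithmetic can be reproduced directly, which makes it easy to re-run with a different event rate or TTL. The constants below are the figures quoted in the paragraph; the 200-byte per-key cost is the text's approximation, not a measured value.

```python
EVENTS_PER_DAY = 10_000_000
TTL_HOURS = 94
BYTES_PER_KEY = 200  # approximate per-key cost quoted in the text

# Keys alive at steady state: one day's events scaled by the TTL in days.
live_keys = EVENTS_PER_DAY * TTL_HOURS // 24
working_set_gb = live_keys * BYTES_PER_KEY / 1e9
```

The exact result is about 39.2 million keys and 7.8 GB, which the text rounds to 40 million keys and 8 GB.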

SRE alert thresholds.

dedup_hit_rate: baseline depends on provider; alert if it exceeds 5× baseline for 5 minutes (indicates stuck producer or thundering-herd retry storm).
p95_webhook_processing_latency > 200 ms: indicates deduplication store saturation or downstream API slowness.
dlq_depth > 100: events that exhausted retries and require manual or automated remediation.
hmac_failure_rate > 0.001: proxy header stripping or credential rotation issue.
stale_processing_count > 0 for 300 s: pod crashed during FSM step 4; reconciliation worker not running.

Fintech workloads. In payment systems, duplicate webhook execution directly impacts financial reconciliation. Every webhook receipt must append to an audit log recording the raw payload, signature verification result, idempotency key, FSM state, and processing duration. Treat webhooks as append-only events and reconcile against provider settlement files nightly; do not treat real-time webhook state as the source of truth. The detailed runbook for payment gateway deduplication is in Handling Duplicate Webhook Deliveries in Payment Gateways, covering HMAC rotation, token-based compensation workflows, and ledger reconciliation patterns. For the idempotency key generation strategies that determine collision resistance, see Idempotency Key Generation Strategies; for Redis SET NX as the authoritative atomic registration primitive, see Redis Cache-Based Deduplication.

Graceful degradation levels.

Level 1 (store latency >50 ms): switch to async queue, return 202 Accepted, process out-of-band.
Level 2 (store unreachable): enable local Bloom filter fallback, accept elevated duplicate risk, queue for post-recovery reconciliation.
Level 3 (provider retry storm): return 429 Too Many Requests with Retry-After: 300, open circuit breaker, drain queue asynchronously.
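
The three levels can be read as a decision function over health signals. The sketch below is one possible mapping: the `Decision` type, the `choose_response` function, the precedence order (retry storm first), and the 202 status at level 2 are all assumptions, since the levels above do not specify them.

```python
from typing import NamedTuple


class Decision(NamedTuple):
    level: int
    status_code: int
    headers: dict


def choose_response(store_reachable: bool, store_latency_ms: float,
                    retry_storm: bool) -> Decision:
    """Map observed health signals to a degradation level and HTTP response."""
    if retry_storm:
        # Level 3: shed load and tell the provider when to come back.
        return Decision(3, 429, {"Retry-After": "300"})
    if not store_reachable:
        # Level 2: accept and queue for post-recovery reconciliation
        # (the 202 status here is an assumption).
        return Decision(2, 202, {})
    if store_latency_ms > 50:
        # Level 1: acknowledge receipt and process out-of-band.
        return Decision(1, 202, {})
    # Healthy: process inline.
    return Decision(0, 200, {})
```

Keeping this mapping in one function makes it straightforward to unit-test the thresholds and to log which level was active for each request.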

Handling Duplicate Webhook Deliveries in Payment Gateways — HMAC rotation, token-based compensation, and ledger reconciliation patterns for fintech systems.
Idempotency Key Generation Strategies — deterministic vs. random key schemes and collision-resistance trade-offs.
Retry Logic & Backoff Fundamentals — exponential backoff with jitter, overlapping retry prevention, and budget-capped retry policies.
Redis Cache-Based Deduplication — deep-dive into SET NX, pipeline atomicity, and cluster key-distribution for deduplication stores.
Transaction Scoping & Atomic Operations — wrapping business logic and deduplication state in a single database transaction to prevent partial commits.
Idempotency Fundamentals & API Guarantees — parent reference covering the full at-least-once guarantee model, failure boundary map, and anti-patterns.