What is the difference between a retry storm and a thundering herd?

A retry storm is a client-side phenomenon: many callers simultaneously retrying a failed or slow request. A thundering herd is the server-side consequence: the burst of synchronised retry traffic exhausts upstream connection pools, CPU, or database connection limits. A retry storm causes a thundering herd when clients share the same fixed backoff interval.

Does full jitter completely prevent thundering herd?

Jitter significantly reduces synchronisation but does not eliminate it. At high client counts, even randomised backoff produces bursts. Pair jitter with a server-side concurrency limit and a circuit breaker to absorb residual spikes.

How long should an idempotency key TTL be set?

Set the TTL to at least 3× the maximum observed downstream processing latency, with a hard floor of 30 seconds. For payment APIs processing under 500 ms at p99, a 90-second TTL is safe. For async workflows that may take minutes, align the TTL with the maximum saga completion window.
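The rule above reduces to a short calculation. The following is a minimal sketch; the function name `idempotency_ttl_seconds` and its parameters are illustrative assumptions, not part of any library.

```python
import math


def idempotency_ttl_seconds(max_latency_s: float, floor_s: int = 30, multiplier: float = 3.0) -> int:
    """Return the minimum TTL: multiplier x latency, rounded up, never below the floor."""
    if max_latency_s < 0:
        raise ValueError("latency must be non-negative")
    return max(floor_s, math.ceil(multiplier * max_latency_s))


# A 500 ms p99 gives 3 x 0.5 s = 1.5 s, so the 30 s floor applies.
print(idempotency_ttl_seconds(0.5))    # 30
# A workflow that can take 2 minutes needs at least 360 s.
print(idempotency_ttl_seconds(120.0))  # 360
```

The result is a lower bound, so the 90-second value quoted for payment APIs sits comfortably above it.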

Mitigating Thundering Herd During Retry Storms

Problem Statement & Prerequisites

Part of: Preventing Race Conditions in Microservices

A thundering herd forms when a large population of clients simultaneously retries a failed or degraded downstream endpoint. The burst of synchronised traffic saturates connection pools, exhausts ephemeral ports, and can collapse a service that was only transiently degraded. In payment and fintech systems this failure mode is particularly destructive: idempotency guarantees weaken under load, duplicate charges slip through deduplication caches, and reconciliation debt accumulates faster than on-call engineers can respond.

This runbook assumes you already understand exponential backoff fundamentals and atomic idempotency key registration via Redis SET NX. It covers the specific steps required to prevent synchronised retry bursts from turning a transient downstream blip into a sustained outage.

Step-by-Step Implementation

Step 1 — Add Full Jitter to Client Retry Backoff

Replace fixed exponential backoff with full jitter. The formula is sleep(random(0, min(cap, base * 2^attempt))). This desynchronises clients so retries arrive as a roughly uniform stream rather than a pulse.

Go

package retry

import (
    "context"
    "math"
    "math/rand"
    "time"
)

// FullJitterBackoff returns a random duration in [0, min(maxDelay, base*2^attempt)).
func FullJitterBackoff(attempt int, base, maxDelay time.Duration) time.Duration {
    // Compare as float64 first so a large attempt cannot overflow time.Duration.
    ceiling := maxDelay
    if exp := float64(base) * math.Pow(2, float64(attempt)); exp < float64(maxDelay) {
        ceiling = time.Duration(exp)
    }
    if ceiling <= 0 {
        return 0 // rand.Int63n panics on a non-positive argument
    }
    return time.Duration(rand.Int63n(int64(ceiling)))
}

// DoWithRetry calls fn up to maxAttempts times, sleeping with full jitter between attempts.
func DoWithRetry(ctx context.Context, maxAttempts int, fn func() error) error {
    const base = 100 * time.Millisecond
    const capDur = 30 * time.Second
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = fn(); err == nil {
            return nil
        }
        if attempt == maxAttempts-1 {
            break // do not sleep after the final attempt
        }
        wait := FullJitterBackoff(attempt, base, capDur)
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(wait):
        }
    }
    return err
}

Node.js

async function doWithRetry(fn, { maxAttempts = 5, baseMs = 100, capMs = 30_000 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
      const wait = Math.random() * ceiling;          // full jitter
      await new Promise(r => setTimeout(r, wait));
    }
  }
}

Python

import asyncio
import random

async def do_with_retry(fn, max_attempts=5, base_ms=100, cap_ms=30_000):
    for attempt in range(max_attempts):
        try:
            return await fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            ceiling = min(cap_ms, base_ms * (2 ** attempt))
            wait_ms = random.random() * ceiling        # full jitter
            await asyncio.sleep(wait_ms / 1_000)

Verify: Deploy to a staging environment, inject a 200 ms downstream latency with Toxiproxy, and plot retry_attempt histograms over 60 seconds. Confirm inter-arrival times form a roughly uniform distribution rather than sharp spikes.
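Before running the staging check, the expected effect can be previewed offline. The sketch below is a simplified simulation under stated assumptions: all clients fail at the same instant, and `retry_delays_ms` and `peak_bucket` are hypothetical helper names. It compares the largest burst produced by fixed backoff with the largest burst under full jitter.

```python
import random
from collections import Counter


def retry_delays_ms(clients, attempt, base_ms=100, cap_ms=30_000, jitter=True, rng=None):
    """Return the delay each client waits before the given retry attempt."""
    rng = rng or random.Random()
    ceiling = min(cap_ms, base_ms * 2 ** attempt)
    if not jitter:
        return [ceiling] * clients  # fixed backoff: every client waits the same time
    return [rng.random() * ceiling for _ in range(clients)]  # full jitter


def peak_bucket(delays_ms, bucket_ms=100):
    """Largest number of retries landing in any single bucket_ms-wide window."""
    counts = Counter(int(d // bucket_ms) for d in delays_ms)
    return max(counts.values())


rng = random.Random(7)  # seeded only to make the sketch repeatable
fixed = retry_delays_ms(1_000, attempt=4, jitter=False)
jittered = retry_delays_ms(1_000, attempt=4, jitter=True, rng=rng)
print("fixed peak:", peak_bucket(fixed))        # all 1000 retries land in one window
print("jittered peak:", peak_bucket(jittered))  # retries spread over the 0-1600 ms range
```

With fixed backoff the whole population arrives in a single 100 ms window; with full jitter the same population is spread over the backoff ceiling, which is the shape the staging histogram should show.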

Step 2 — Enforce Idempotency Keys with Atomic Redis SET NX

Claiming an idempotency key, like acquiring a distributed lock, relies on an atomic write. Use SET key value NX EX <ttl> so that only the first writer among concurrent retries can claim the key. Set the TTL to at least 3× the p99 downstream processing latency, with a hard floor of 30 seconds. For payment APIs with a 500 ms p99, the floor dominates (3 × 0.5 s = 1.5 s), and 90 seconds is a safe setting.

# Claim idempotency slot (NX = only if not exists, EX = TTL in seconds)
SET idempotency:pay:uuid-1234 '{"status":"processing","ts":1720000000}' NX EX 90

In Go, wrap this in a function that returns a boolean indicating whether the current caller won the write:

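import (
    "context"
    "time"
    "github.com/redis/go-redis/v9" // assumed client library; the source does not name one
)

// ClaimIdempotencyKey reports whether this caller was the first to claim key.
func ClaimIdempotencyKey(ctx context.Context, rdb *redis.Client, key, payload string, ttl time.Duration) (bool, error) {
    return rdb.SetNX(ctx, key, payload, ttl).Result()
}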

If ClaimIdempotencyKey returns false, the request is a duplicate: return the previously cached response, or a "still processing" status if the first attempt has not yet stored one, rather than re-executing the operation. This is the deduplication boundary that prevents double-charges under herd conditions.
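The claim-or-replay flow can be sketched end to end. This is an illustration under assumptions: `InMemoryStore` is a hypothetical single-process stand-in that mimics SET NX EX and GET so the example runs without Redis, and `handle_charge` is an illustrative name.

```python
import json
import time


class InMemoryStore:
    """Hypothetical stand-in for Redis that mimics SET NX EX, SET EX and GET."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set_nx(self, key, value, ttl_s):
        """Write only if the key is absent or expired; return True if this caller won."""
        entry = self._data.get(key)
        if entry is not None and entry[1] > self._clock():
            return False
        self._data[key] = (value, self._clock() + ttl_s)
        return True

    def set(self, key, value, ttl_s):
        self._data[key] = (value, self._clock() + ttl_s)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            return None
        return entry[0]


def handle_charge(store, key, execute, ttl_s=90):
    """Run execute() at most once per key; replay the stored outcome to duplicates."""
    claimed = store.set_nx(key, json.dumps({"status": "processing"}), ttl_s)
    if not claimed:
        stored = store.get(key)
        # The key may expire between set_nx and get; report that as still processing.
        return json.loads(stored) if stored else {"status": "processing"}
    result = {"status": "done", "response": execute()}
    store.set(key, json.dumps(result), ttl_s)
    return result


calls = []
store = InMemoryStore()
first = handle_charge(store, "idempotency:pay:uuid-1234", lambda: calls.append(1) or "charged")
retry = handle_charge(store, "idempotency:pay:uuid-1234", lambda: calls.append(1) or "charged")
print(first == retry, len(calls))  # True 1
```

The sketch leaves the key in the "processing" state if `execute()` raises; whether to delete the key or let it expire in that case is a design decision the runbook does not prescribe.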

Verify: In a Redis CLI, run the SET … NX EX command twice for the same key. The first call returns OK and the second must return (nil). Confirm with TTL <key> that the remaining expiry is at most 90 seconds and counting down.

Step 3 — Wire a Circuit Breaker at the Call Site

A circuit breaker prevents the retry loop from amplifying degradation when the upstream is genuinely unhealthy. Trip after 5 consecutive 5xx responses; hold the breaker open for 10 seconds before probing with a single request (half-open state).

Go (using gobreaker):

import (
    "time"

    "github.com/sony/gobreaker"
)

var cb = gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "payment-upstream",
    MaxRequests: 1,                          // probes allowed in half-open
    Interval:    60 * time.Second,           // closed-state counts are cleared every 60 s
    Timeout:     10 * time.Second,           // open → half-open
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.ConsecutiveFailures >= 5
    },
})

func CallWithBreaker(req *PaymentRequest) (*PaymentResponse, error) {
    result, err := cb.Execute(func() (interface{}, error) {
        return upstreamClient.Charge(req)
    })
    if err != nil {
        return nil, err
    }
    return result.(*PaymentResponse), nil
}

Node.js (using opossum):

import CircuitBreaker from 'opossum';

const breaker = new CircuitBreaker(upstreamCharge, {
  errorThresholdPercentage: 50,
  resetTimeout: 10_000,              // ms before half-open probe
  volumeThreshold: 5,                // minimum requests before tripping
});

breaker.fallback(() => ({ status: 'circuit_open', retry_after_ms: 10_000 }));

Verify: Force 5 consecutive errors by pointing the upstream URL at a non-existent host. Confirm the breaker transitions to open and that subsequent calls return immediately with the fallback rather than attempting a network connection.
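The state transitions being verified can be sketched without a library. The class below is a minimal single-threaded illustration of the trip-after-5, hold-open-for-10-seconds policy described above; the names `CircuitBreaker` and `CircuitOpenError` are assumptions, and a production breaker would also need locking and a limit of one concurrent probe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the upstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_seconds=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.open_seconds:
            return "half-open"
        return "open"

    def call(self, fn):
        state = self.state
        if state == "open":
            raise CircuitOpenError("upstream marked unhealthy")
        try:
            result = fn()
        except Exception:
            self._consecutive_failures += 1
            # A failed probe, or too many consecutive failures, (re)opens the breaker.
            if state == "half-open" or self._consecutive_failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._consecutive_failures = 0
        self._opened_at = None  # any success closes the breaker
        return result


def failing():
    raise RuntimeError("503")


now = [0.0]  # fake clock so the example does not have to sleep
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(5):
    try:
        breaker.call(failing)
    except RuntimeError:
        pass
print(breaker.state)  # open
now[0] = 10.0
print(breaker.state)  # half-open
breaker.call(lambda: "ok")
print(breaker.state)  # closed
```

Injecting the clock keeps the example deterministic; the same trick makes the open-to-half-open timing easy to unit test.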

Step 4 — Apply an Adaptive Concurrency Limit

Even with jitter, a large client population can produce bursts that exceed the upstream’s connection pool. Gate in-flight retries through a token bucket with a refill rate matched to the upstream’s sustainable throughput.

Go (using golang.org/x/time/rate):

import (
    "context"
    "fmt"

    "golang.org/x/time/rate"
)

// 50 requests/second, burst of 10
var limiter = rate.NewLimiter(rate.Limit(50), 10)

func RateLimitedCall(ctx context.Context, fn func() error) error {
    if err := limiter.Wait(ctx); err != nil {
        return fmt.Errorf("rate limiter: %w", err)
    }
    return fn()
}

Node.js (using p-limit):

import pLimit from 'p-limit';

const limit = pLimit(10);   // max 10 concurrent in-flight

async function boundedRetry(requests) {
  return Promise.all(requests.map(req => limit(() => doWithRetry(() => upstream(req)))));
}

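Verify: Fire 200 concurrent requests through each limiter. A token bucket bounds the request rate rather than the in-flight count, so for the Go limiter confirm via Prometheus that the upstream request rate settles at or below 50 requests per second after the initial burst of 10. For the p-limit gate, confirm that http_requests_in_flight never exceeds the configured limit of 10.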
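To make the token bucket's behaviour concrete, here is a minimal non-blocking sketch using the same 50 requests/second and burst of 10 as the Go example. The class name `TokenBucket` and the `try_acquire` method are illustrative assumptions, and the sketch is single-threaded.

```python
import time


class TokenBucket:
    """Token bucket: `rate` tokens are added per second, up to `burst` tokens."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self._clock = clock
        self._tokens = float(burst)  # start full, so an initial burst is allowed
        self._last = clock()

    def try_acquire(self):
        """Take one token if available; return False instead of blocking."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.burst, self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


now = [0.0]  # fake clock so the example is deterministic
bucket = TokenBucket(rate=50, burst=10, clock=lambda: now[0])
print(sum(bucket.try_acquire() for _ in range(200)))  # 10: only the burst is admitted at t=0
now[0] = 1.0  # one second later the bucket has refilled to its cap
print(sum(bucket.try_acquire() for _ in range(200)))  # 10
```

Rejecting instead of waiting is a deliberate simplification here; the Go `limiter.Wait` call above blocks until a token is available or the context is cancelled.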

Step 5 — Instrument Retry Propagation with OpenTelemetry

Without explicit tagging, a retry storm is invisible in aggregated latency charts. Attach these attributes to every span that represents a retry attempt:

span.SetAttributes(
    attribute.Int("retry.attempt", attempt),
    attribute.String("idempotency_key", key),
    attribute.String("original_trace_id", originalTraceID),
    attribute.Bool("circuit_breaker.open", breakerOpen),
)

Use original_trace_id to join spans from the same logical request across multiple retry attempts. This lets you reconstruct the full fan-out graph in Jaeger or Tempo when a storm occurs.
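The join itself is a simple group-by. The sketch below assumes spans have already been exported as plain dictionaries carrying the attributes above; the sample records and the `group_retries` helper are hypothetical, and a real Jaeger or Tempo export has more fields.

```python
from collections import defaultdict


def group_retries(spans):
    """Map each original_trace_id to its spans, ordered by retry.attempt."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["original_trace_id"]].append(span)
    for attempts in grouped.values():
        attempts.sort(key=lambda s: s["retry.attempt"])
    return dict(grouped)


# Hypothetical exported spans for two logical requests.
spans = [
    {"trace_id": "t3", "original_trace_id": "t1", "retry.attempt": 2},
    {"trace_id": "t1", "original_trace_id": "t1", "retry.attempt": 0},
    {"trace_id": "t9", "original_trace_id": "t9", "retry.attempt": 0},
    {"trace_id": "t2", "original_trace_id": "t1", "retry.attempt": 1},
]
for origin, attempts in group_retries(spans).items():
    print(origin, [s["trace_id"] for s in attempts])
# t1 ['t1', 't2', 't3']
# t9 ['t9']
```

Requests whose group contains more than one span are the ones that retried, which gives a quick way to size a storm after the fact.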

Step 6 — Validate with Chaos Injection

Run this sequence in a staging environment before deploying to production:

# Create a proxy for the downstream (assumes toxiproxy-server is already installed and running)
toxiproxy-cli create payment-downstream -l 0.0.0.0:8475 -u upstream-host:8080

# Inject 500 ms latency on 80 % of connections
toxiproxy-cli toxic add payment-downstream -t latency -a latency=500 -a jitter=50 \
  -n spike --toxicity 0.8

# Fire 500 concurrent requests through the proxy
ab -n 500 -c 500 http://localhost:8475/charge

Confirm:

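The circuit breaker trips within 5–10 seconds of the latency injection. This assumes the client timeout is shorter than the injected 500 ms, so that delayed calls are counted as failures.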
The retry_count_total counter plateaus — it does not keep climbing while the breaker is open.
After removing the toxic, the breaker transitions to half-open and then closed within 15 seconds.

Failure Scenarios & Debugging

Failure Scenario	Remediation Steps	Observability Hooks
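All clients retry at the same instant despite jitter because `random()` is seeded identically	Verify PRNG seeding. In Go before 1.20, the global `math/rand` source uses a fixed seed unless `rand.Seed` is called, so every process draws the same sequence; either build with Go 1.20 or later, which seeds it randomly, or give each goroutine its own `rand.New(rand.NewSource(time.Now().UnixNano()))`, since a generator created this way is not safe for concurrent use. In Python, confirm the `random` module is not seeded with a fixed value.	`histogram_quantile(0.95, rate(retry_interval_seconds_bucket[5m]))` compared with the 0.05 quantile: if p95 ≈ p5, jitter is broken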
Idempotency key lookup returns cache miss under Redis partition, allowing duplicate execution	Run at least 3 independent Redis primaries. For financial operations, require a majority write (2 of 3) via Redlock before claiming the key.	`redis_keyspace_misses_total` spike; `idempotency_cache_hit_ratio` dropping below 0.90
Circuit breaker never trips — consecutive failure counter resets due to occasional successes from one healthy shard	Switch from consecutive-failures threshold to error-percentage threshold over a 10-second rolling window. Set the threshold at 50 % errors in a minimum window of 10 requests.	`circuit_breaker_state{state="open"}` gauge; alert on `rate(upstream_5xx_total[60s]) / rate(upstream_requests_total[60s]) > 0.5`
Token bucket refill rate set too high — adaptive limit does not reduce upstream load	Profile upstream throughput at p99 latency under nominal load. Set the sustainable refill rate to 70 % of that value to leave headroom during recovery.	`http_requests_in_flight` gauge; `rate(requests_rate_limited_total[60s])` counter
Stale lock from crashed worker blocks new retries indefinitely	Enforce a hard maximum lease duration (90 s for payment APIs). Implement a lease heartbeat renewal loop that extends the TTL every 20 s while work is in progress, and confirm the Redis `TTL` drops as expected when the worker crashes.	`distributed_lock_ttl_remaining_seconds` gauge; alert on `distributed_lock_ttl_remaining_seconds < 10` with no recent heartbeat
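The first row of the table is easy to reproduce locally. The sketch below is an illustration under assumptions (`backoff_sequence` is a hypothetical helper) showing that two clients sharing a fixed seed compute identical full-jitter delays, which defeats the purpose of jitter.

```python
import random


def backoff_sequence(rng, attempts=5, base_ms=100, cap_ms=30_000):
    """Full-jitter delays for successive attempts, drawn from the given generator."""
    return [rng.random() * min(cap_ms, base_ms * 2 ** a) for a in range(attempts)]


# Two "clients" that share a fixed seed retry in lockstep: jitter is defeated.
client_a = backoff_sequence(random.Random(1234))
client_b = backoff_sequence(random.Random(1234))
print(client_a == client_b)  # True

# Generators seeded from OS entropy (the default) diverge.
client_c = backoff_sequence(random.Random())
client_d = backoff_sequence(random.Random())
print(client_c == client_d)  # False, barring an astronomically unlikely collision
```

A fixed seed usually enters through test fixtures or a copied configuration value, so searching the codebase for explicit seed calls is a reasonable first debugging step.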

SRE / Observability Checklist

retry_count_total{service,endpoint,attempt} — counter; alert when rate(retry_count_total[5m]) > 3 × baseline_retry_rate
circuit_breaker_state{name,state} — gauge (0=closed, 1=half-open, 2=open); page on open state exceeding 60 s
idempotency_cache_hit_ratio — derived from hits / (hits + misses); alert below 0.90 on payment endpoints
http_requests_in_flight{upstream} — gauge; alert when sustained above 80 % of connection pool size
retry.attempt span attribute — filter in Jaeger/Tempo to reconstruct fan-out graph; join on original_trace_id
distributed_lock_ttl_remaining_seconds — alert when TTL < 10 s with no heartbeat renewal in the last 15 s

Preventing Race Conditions in Microservices: parent page covering the full taxonomy of race conditions in service-to-service calls, including optimistic concurrency and saga coordination.
Implementing Redlock for High-Availability Deduplication: when a single Redis node is insufficient for idempotency guarantees, Redlock provides quorum-based lock acquisition across multiple independent Redis nodes.
Handling Stale Locks in Distributed Systems: companion runbook covering lease heartbeat renewal, fencing token validation, and automatic cleanup of expired locks left by crashed workers.
Using Redis SET NX for Distributed Request Deduplication: atomic key registration pattern that underpins the idempotency boundary used in Step 2 of this runbook.