A cache stampede isn’t just a bunch of requests hitting the cache simultaneously; it’s when those requests all miss the cache at the exact same time, overwhelming the backend service with a synchronized wave of demand.
Let’s see this in action. Imagine a popular product page, GET /products/123, served by a web server that caches responses for 60 seconds.
// Simplified Node.js with Redis cache
const express = require('express');
const redis = require('redis');
const app = express();
const redisClient = redis.createClient(); // Connects to redis://127.0.0.1:6379
async function getProduct(id) {
  // Simulate fetching from a slow database
  return new Promise(resolve => setTimeout(() => resolve({ id, name: `Product ${id}`, price: Math.random() * 100 }), 500));
}

app.get('/products/:id', async (req, res) => {
  const productId = req.params.id;
  const cacheKey = `product:${productId}`;
  const cacheTTL = 60; // seconds

  const cachedData = await redisClient.get(cacheKey);
  if (cachedData) {
    console.log(`Cache hit for ${productId}`);
    return res.json(JSON.parse(cachedData));
  }

  console.log(`Cache miss for ${productId}. Fetching from DB...`);
  const productData = await getProduct(productId);
  // Set cache with TTL
  await redisClient.set(cacheKey, JSON.stringify(productData), { EX: cacheTTL });
  res.json(productData);
});

const PORT = 3000;
app.listen(PORT, async () => {
  await redisClient.connect();
  console.log(`Server listening on port ${PORT}`);
});
Now, if 100 users hit /products/123 within the same second, and the cache entry expires just before they arrive, all 100 requests will miss. They’ll all trigger getProduct(123) simultaneously, hammering your database. This is the stampede.
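To make the effect concrete without running Redis, here is a minimal simulation. It assumes an in-memory Map as a stand-in for the cache and a shortened 50 ms origin delay; simulateStampede and its inner helpers are hypothetical names used only for this sketch.

```javascript
// Fires `n` concurrent requests at an empty cache and counts origin fetches.
// The Map and the 50 ms delay are stand-ins for Redis and the slow database.
async function simulateStampede(n) {
  const cache = new Map();
  let originFetches = 0;

  // Simulated slow origin, similar in spirit to getProduct above.
  const fetchFromOrigin = id => {
    originFetches += 1;
    return new Promise(resolve =>
      setTimeout(() => resolve({ id, name: `Product ${id}` }), 50));
  };

  // Naive read-through cache: check, miss, fetch, store.
  const handleRequest = async id => {
    const cacheKey = `product:${id}`;
    if (cache.has(cacheKey)) return cache.get(cacheKey);
    const product = await fetchFromOrigin(id);
    cache.set(cacheKey, product);
    return product;
  };

  await Promise.all(Array.from({ length: n }, () => handleRequest('123')));
  return originFetches;
}

simulateStampede(100).then(count =>
  console.log(`${count} origin fetches for 100 concurrent requests`));
```

Every request checks the cache before any fetch has completed, so each one sees a miss and goes to the origin on its own.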
The core problem is that expiration is a shared, unsynchronized event. When the TTL hits zero for a specific cache key, every subsequent request for that key becomes a cache miss. Without coordination, these misses flood the origin.
The simplest fix is a lock. Before fetching from the origin, acquire a distributed lock for that specific cache key. Only the request that successfully acquires the lock fetches from the origin; all others wait.
Here’s how you might add locking with Redis:
// ... (previous code) ...
app.get('/products/:id', async (req, res) => {
  const productId = req.params.id;
  const cacheKey = `product:${productId}`;
  const cacheTTL = 60; // seconds
  const lockKey = `lock:${cacheKey}`;
  const lockTTL = 10; // seconds, must be longer than origin fetch time
  const maxRetries = 20; // retry limit so a request never waits forever

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const cachedData = await redisClient.get(cacheKey);
    if (cachedData) {
      console.log(`Cache hit for ${productId}`);
      return res.json(JSON.parse(cachedData));
    }

    console.log(`Cache miss for ${productId}. Attempting to acquire lock...`);
    // Try to acquire a lock. SET with the NX option (set only if the key does not exist) is atomic.
    // We also set an expiry to prevent deadlocks if the lock holder crashes.
    const lockAcquired = await redisClient.set(lockKey, 'locked', {
      NX: true, // Only set if key does not exist
      EX: lockTTL, // Expire in 10 seconds
    });

    if (lockAcquired) {
      console.log(`Lock acquired for ${productId}. Fetching from DB...`);
      try {
        const productData = await getProduct(productId);
        await redisClient.set(cacheKey, JSON.stringify(productData), { EX: cacheTTL });
        console.log(`Data fetched and cached for ${productId}. Releasing lock.`);
        return res.json(productData);
      } finally {
        // Release the lock
        await redisClient.del(lockKey);
      }
    }

    console.log(`Lock not acquired for ${productId}. Waiting and retrying...`);
    // Wait a short, random time, then loop back to the cache check
    await new Promise(resolve => setTimeout(resolve, Math.random() * 100 + 50)); // Wait 50-150ms
  }

  // Retry limit reached. Tell the client the data is still being refreshed.
  // If stale data is acceptable and available, returning it here is an alternative.
  res.status(503).set('Retry-After', '1').send('Data is being refreshed. Please try again shortly.');
});
The SET lockKey 'locked' NX EX 10 command is atomic. If the lockKey doesn’t exist, it’s created with the value 'locked' and an expiration of 10 seconds, and Redis replies OK, which node-redis returns as the truthy string 'OK'. If the key already exists, the command does nothing and replies nil, which node-redis returns as null. This ensures only one process can hold the lock at a time. The EX option is crucial; it acts as a timeout, preventing indefinite blocking if the lock holder crashes.
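Because the reply is a string or null and not a number, it can help to normalize it to a boolean in one place. The helper below is a small sketch; tryAcquireLock is a hypothetical name, and client is assumed to be a connected node-redis v4 client like redisClient above.

```javascript
// Returns true only if this caller created the lock key.
// `client` is assumed to be a connected node-redis v4 client.
async function tryAcquireLock(client, lockKey, ttlSeconds) {
  const reply = await client.set(lockKey, 'locked', {
    NX: true,       // only set if the key does not exist
    EX: ttlSeconds, // expire automatically so a crashed holder cannot block forever
  });
  // 'OK' means the key was created; null means another caller holds the lock.
  return reply === 'OK';
}
```

A handler could then write `if (await tryAcquireLock(redisClient, lockKey, lockTTL)) { ... }` instead of relying on the truthiness of the raw reply.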
When a request fails to get the lock, it doesn’t just give up. It backs off for a small, random interval and then re-checks the cache. In the normal case the lock holder has populated the cache by then, so the retry is a cache hit and never touches the origin. The randomness matters because it keeps the waiting requests from all re-checking Redis, or racing for the lock after a failed fetch, in the exact same millisecond. Instead, they stagger their retries.
The setTimeout(..., Math.random() * 100 + 50) introduces this jitter. Each waiting request waits for a random duration between 50ms and 150ms before attempting to re-check the cache. This significantly reduces the chance of a new stampede when the original lock holder finishes.
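The jittered wait can be pulled out into a small reusable helper with a retry limit. This is a sketch under assumptions: jitteredDelayMs, sleep, and waitForCache are hypothetical names, and readCache stands for any async function that resolves to the cached value or null.

```javascript
// Random delay in the range [baseMs, baseMs + spreadMs).
function jitteredDelayMs(baseMs = 50, spreadMs = 100) {
  return baseMs + Math.random() * spreadMs;
}

// Promise-based wrapper around setTimeout.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Polls readCache until it yields a value or the retry limit is reached.
// Resolves to the cached value, or null if every attempt missed.
async function waitForCache(readCache, { maxRetries = 20, baseMs = 50, spreadMs = 100 } = {}) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const value = await readCache();
    if (value !== null && value !== undefined) return value;
    await sleep(jitteredDelayMs(baseMs, spreadMs));
  }
  return null;
}
```

With the defaults, a waiting request gives up after 20 attempts, which is roughly one to three seconds of waiting; both numbers are illustrative and should be tuned to the origin’s latency.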
A common pitfall is setting the lock TTL too short. If your origin fetch takes longer than the lock TTL, a new request could acquire the lock before the first one finishes its fetch and releases the lock, leading to multiple origin fetches. Ensure lockTTL is comfortably longer than your maximum expected origin fetch time.
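A related hazard is the release itself: if the lock expires mid-fetch and a second request acquires it, a plain DEL by the first request removes the second request’s lock. One common mitigation, sketched below as a variation on the example and not part of it, stores a unique token as the lock value and deletes the key only if the token still matches. acquireLock and releaseLock are hypothetical names, and client is assumed to be a connected node-redis v4 client.

```javascript
const { randomUUID } = require('crypto');

// Lua runs atomically inside Redis: delete the lock only if we still own it.
const RELEASE_SCRIPT = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
else
  return 0
end`;

// Resolves to a unique token if the lock was acquired, otherwise null.
async function acquireLock(client, lockKey, ttlSeconds) {
  const token = randomUUID();
  const reply = await client.set(lockKey, token, { NX: true, EX: ttlSeconds });
  return reply === 'OK' ? token : null;
}

// Resolves to true if this caller's lock was deleted, false if the lock
// had already expired or now belongs to someone else.
async function releaseLock(client, lockKey, token) {
  const deleted = await client.eval(RELEASE_SCRIPT, {
    keys: [lockKey],
    arguments: [token],
  });
  return deleted === 1;
}
```

This does not prevent a duplicate origin fetch when the TTL is too short, but it does stop a slow request from releasing a lock it no longer holds.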
Another subtlety: what if the lock is acquired, the origin fetch happens, but before the cache is updated and the lock is released, the application crashes? The lock is eventually released by its TTL, but the cache is never updated. Until then, waiting requests keep missing both the cache and the lock; once the lock expires, one of them acquires it and repeats the fetch. A robust system might use a "lease" mechanism or ensure cache updates are atomic with lock release.
After successfully implementing locks and probabilistic retries, the next immediate problem you’ll encounter is managing the cache invalidation logic itself under high load, especially when dealing with more complex data relationships.