Cassandra’s speculative retry is a bit of a hidden gem for cutting P99 read latency: when a replica is slow to answer, the coordinator sends the same read to another replica and uses whichever response arrives first.

Let’s see it in action. Imagine a client reading a row at consistency level QUORUM from a table with a replication factor of 3. (The speculative_retry option governs reads; regular writes are already sent to every replica.) Normally, the coordinator node sends the read to just enough replicas to satisfy the consistency level, here R=2, and waits for both to respond. If one of them is sluggish, the whole request is held up, impacting the P99.

Here’s a simplified view of what happens without speculative retry.

Client -> Coordinator -> Replica 1 (fast)
                 -> Replica 2 (slow)
                    Replica 3 (not contacted)

Coordinator waits for R=2 responses.
If Replica 2 takes too long, the P99 is hit.

Now, let’s enable speculative retry. The coordinator still sends the initial requests. But if a contacted replica hasn’t responded within the speculative retry threshold, the coordinator speculatively sends a duplicate of the read to a replica it hasn’t contacted yet.

Client -> Coordinator -> Replica 1 (fast)
                 -> Replica 2 (slow)

Speculative retry threshold (e.g., 200ms) is hit.
Coordinator -> Replica 3 (fast) - SPECULATIVE REQUEST

Coordinator receives responses from Replica 1 and Replica 3.
R=2 is met. The slow Replica 2's response is ignored.
The P99 is preserved.
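
To make the hedging pattern concrete, here is a minimal Java sketch that simulates it with timers. It is not Cassandra’s coordinator code: the class HedgedReadSketch, the replica names, and the latencies are all hypothetical, and each simulated replica simply answers with its own name after a fixed delay.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class HedgedReadSketch {
    // Simulates a replica that answers with its own name after latencyMs.
    static CompletableFuture<String> replica(ScheduledExecutorService timer, String name, long latencyMs) {
        CompletableFuture<String> response = new CompletableFuture<>();
        timer.schedule(() -> response.complete(name), latencyMs, TimeUnit.MILLISECONDS);
        return response;
    }

    // Asks the primary first; if it has not answered after thresholdMs, also asks the backup.
    // Returns the name of whichever replica answered first.
    static String hedgedRead(long primaryLatencyMs, long backupLatencyMs, long thresholdMs) {
        ScheduledExecutorService timer = Executors.newScheduledThreadPool(2);
        try {
            CompletableFuture<String> primary = replica(timer, "primary", primaryLatencyMs);
            CompletableFuture<String> winner = new CompletableFuture<>();
            primary.thenAccept(winner::complete);
            // The speculative request is only sent if the primary is still pending at the threshold.
            timer.schedule(() -> {
                if (!primary.isDone()) {
                    replica(timer, "backup", backupLatencyMs).thenAccept(winner::complete);
                }
            }, thresholdMs, TimeUnit.MILLISECONDS);
            return winner.join();
        } finally {
            timer.shutdownNow(); // drop any request that is still outstanding
        }
    }

    public static void main(String[] args) {
        System.out.println(hedgedRead(2000, 10, 100)); // slow primary: the backup answers first
        System.out.println(hedgedRead(10, 10, 100));   // fast primary: no speculative request is sent
    }
}
```

Note that the hedged read still costs at least the threshold plus the backup’s latency, so the threshold bounds how much of the tail you can recover.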

This mechanism is controlled by two key settings: the speculative_retry table option, which is set per table through CQL rather than in cassandra.yaml, and the request timeout in the client driver.

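The speculative_retry option on a table determines when the coordinator will send a speculative request for reads of that table. It is not a fraction of the request timeout. It accepts a latency percentile such as 99PERCENTILE (written 99p in recent versions), a fixed delay such as 50ms, ALWAYS, or NONE. With 99PERCENTILE, a speculative request is sent once a read has taken longer than the table’s observed 99th-percentile read latency. With 50ms, it is sent if a contacted replica hasn’t responded within 50ms. Either way, the threshold needs to sit well below the request timeout, or the speculative request has no time left to help.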
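
A percentile-based threshold (the 99PERCENTILE form of the option) follows the latencies actually observed rather than a fixed number. The sketch below is only an illustration of that idea, not Cassandra’s implementation: the class PercentileThresholdSketch is hypothetical, the samples are made up, and it uses a plain nearest-rank percentile over an array.

```java
import java.util.Arrays;

class PercentileThresholdSketch {
    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` percent of all samples are less than or equal to it.
    static long percentileMillis(long[] latenciesMs, double percentile) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical read latencies in milliseconds: 98 fast reads and 2 slow ones.
        long[] samples = new long[100];
        Arrays.fill(samples, 0, 98, 4);
        samples[98] = 120;
        samples[99] = 900;
        System.out.println(percentileMillis(samples, 99)); // prints 120
        System.out.println(percentileMillis(samples, 50)); // prints 4
    }
}
```

With these samples a 99th-percentile threshold lands at 120ms, so only the single 900ms read would trigger a speculative request.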

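The request timeout in your client driver is the overall time the client is willing to wait for a successful response. (The similarly named request_timeout_in_ms and read_request_timeout_in_ms are server-side settings in cassandra.yaml that bound how long the coordinator itself waits.) If neither the original nor the speculative request gets a successful response within the client timeout, the client will report a timeout error.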

Here’s how to configure it:

On the Cassandra Side (table option, set with CQL):

Set the speculative_retry option on each table you want to protect. It is a table property, not a cassandra.yaml parameter. A common starting point for P99 reduction is 99PERCENTILE, which sends a speculative request once a read has taken longer than the table’s observed 99th-percentile read latency. A fixed delay such as 50ms is the alternative when you want a hard bound.

-- speculatively retry reads on another replica if the first ones are slow
-- (my_keyspace.my_table is a placeholder; substitute your own table)
ALTER TABLE my_keyspace.my_table WITH speculative_retry = '99PERCENTILE';

On the Client Driver:

This is where you set the request timeout. In the Java driver 4.x the option is basic.request.timeout, exposed in code as DefaultDriverOption.REQUEST_TIMEOUT. This value should be higher than the speculative retry threshold plus the time a healthy replica needs to respond, but low enough to still be meaningful. If you set it too high, you’re just waiting longer. A common range is 5000ms to 15000ms.

Example (Java Driver):

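import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
import java.time.Duration;

// The request timeout is a duration option, so set it with withDuration.
DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
    .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(10))
    .build();

CqlSession session = CqlSession.builder()
    .withConfigLoader(loader)
    .build();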

Why this works: Cassandra’s consistency levels (like QUORUM, LOCAL_QUORUM, EACH_QUORUM) require a certain number of replicas to respond to a request. If one of them is experiencing transient issues (GC pause, network blip, heavy load), it can significantly delay the entire operation, disproportionately affecting the P99. Speculative retry hedges its bets by sending the request to one more replica, increasing the probability that enough replicas respond quickly to satisfy the consistency level within the desired latency budget.
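
A rough back-of-the-envelope calculation shows how much the hedge can help. The sketch below rests on simplifying assumptions that are not from Cassandra itself: a QUORUM read at replication factor 3 where the coordinator initially contacts two replicas, each replica independently slow with probability p, and the delay spent waiting for the retry threshold ignored. The class name TailProbabilitySketch is hypothetical.

```java
class TailProbabilitySketch {
    // Without speculative retry the coordinator waits for both contacted replicas,
    // so the read is slow if either of them is slow.
    static double slowWithoutRetry(double p) {
        return 1 - (1 - p) * (1 - p);
    }

    // With one speculative request to the third replica, any two fast replicas are enough,
    // so the read is slow only if at least two of the three replicas are slow.
    static double slowWithRetry(double p) {
        return 3 * p * p * (1 - p) + p * p * p;
    }

    public static void main(String[] args) {
        double p = 0.01; // assumed chance that a given replica is slow for a given read
        System.out.println("without retry: " + slowWithoutRetry(p));
        System.out.println("with retry:    " + slowWithRetry(p));
    }
}
```

Under these assumptions, with p = 1% the share of slow reads drops from about 2% to about 0.03%. Correlated slowness, such as a whole rack under load, erodes that gain, since the extra replica is then likely to be slow too.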

The most surprising thing about speculative retry is its potential to mask underlying performance problems. While it’s excellent for smoothing out transient P99 spikes, it can also be a crutch that prevents you from identifying and fixing the root causes of those slow nodes. If you see high P99s even with speculative retry enabled and configured reasonably, it’s a strong indicator of a more systemic issue, like insufficient hardware, network saturation, or poorly designed queries.

The next logical step after smoothing out P99 with speculative retry is to tackle the causes of slow reads directly, which involves different tuning parameters and strategies.
