Couchbase’s automatic failover can actually be slower to detect and recover from node failures than manual failover in certain scenarios.
Let’s look at a typical Couchbase cluster during a simulated network partition. Imagine we have a 3-node cluster: node1, node2, and node3.
{
  "cluster": {
    "name": "my-production-cluster",
    "nodes": [
      {
        "hostname": "node1",
        "services": "kv,index,query,fts,eventing",
        "status": "healthy"
      },
      {
        "hostname": "node2",
        "services": "kv,index,query",
        "status": "healthy"
      },
      {
        "hostname": "node3",
        "services": "kv,index,query",
        "status": "healthy"
      }
    ]
  }
}
Automatic Failover in Action
When a node fails, Couchbase’s ns_server cluster manager on the remaining nodes is responsible for detecting it, using heartbeats exchanged between nodes. If auto-failover is enabled and a node stays unresponsive for the configured auto-failover timeout (120 seconds by default), the cluster fails the node over on its own. Failover promotes the replica vBuckets held on the surviving nodes to active; spreading data evenly across the cluster again requires a separate rebalance afterwards.
The problem is that this detection is deliberately not instantaneous. Network latency or transient packet loss can cause heartbeats to be missed, and the timeout exists so that such blips do not trigger a failover. With the default timeout, about two minutes can pass before the failover even starts. During this time, requests for data whose active vBuckets live on the failed node will fail or time out.
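The effect of a silence timeout can be sketched in a few lines of Python. This is a simplified model for illustration, not ns_server’s actual algorithm; the function name failure_declared_at, the sample timestamps, and the 30-second timeout used in the example are all hypothetical.

```python
from typing import Optional, Sequence


def failure_declared_at(
    heartbeats: Sequence[float], timeout: float, now: float
) -> Optional[float]:
    """Return when a node would be declared failed, or None if it never is.

    heartbeats: sorted arrival times (in seconds) of heartbeats from the node.
    timeout:    seconds of continuous silence required to declare failure.
    now:        current time; silence after the last heartbeat runs up to here.
    """
    if not heartbeats:
        raise ValueError("need at least one heartbeat to measure silence from")
    previous = heartbeats[0]
    # Treat "now" as the end of the final gap so ongoing silence is counted.
    for t in list(heartbeats[1:]) + [now]:
        if t - previous >= timeout:
            return previous + timeout
        previous = t
    return None


# A node that goes silent after t=3 is declared failed only at t=33.
print(failure_declared_at([0, 1, 2, 3], timeout=30, now=40))
# An 8-second gap (t=2 to t=10) is shorter than the timeout: no failover.
print(failure_declared_at([0, 1, 2, 10, 11], timeout=30, now=12))
```

The second call shows why the timeout exists: short gaps are absorbed. The price is that a real failure is only acted on a full timeout after the last heartbeat.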
Manual Failover in Action
With manual failover, you, the administrator, are the detection mechanism. You’re monitoring your cluster, perhaps via couchbase-cli status checks, the Web Console, external monitoring tools, or logs gathered with cbcollect_info. When you see that a node is down (e.g., couchbase-cli server-list reports it as unhealthy, or it’s unreachable via SSH), you can immediately initiate the failover.
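One way to automate the "is a node down?" check is to read the per-node status that the cluster’s REST API reports at /pools/default on port 8091. The sketch below only parses such a response; the helper name unhealthy_nodes is hypothetical, and the sample payload is a heavily trimmed illustration rather than a real response.

```python
import json
from typing import List


def unhealthy_nodes(pools_default: dict) -> List[str]:
    """Return hostnames of nodes whose reported status is not 'healthy'."""
    return [
        node.get("hostname", "<unknown>")
        for node in pools_default.get("nodes", [])
        if node.get("status") != "healthy"
    ]


# Trimmed, illustrative payload; a real response carries many more fields.
sample = json.loads("""
{
  "nodes": [
    {"hostname": "node1:8091", "status": "healthy"},
    {"hostname": "node2:8091", "status": "healthy"},
    {"hostname": "node3:8091", "status": "unhealthy"}
  ]
}
""")

print(unhealthy_nodes(sample))
```

In practice you would fetch the JSON with an authenticated HTTP request and alert on a non-empty result instead of printing it.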
The command to manually fail over a node is straightforward:
couchbase-cli failover -c <host>:8091 -u <username> -p <password> --server-failover <node-address> --hard
You get the <node-address> (host and port) from couchbase-cli server-list. The --hard flag requests a hard failover, which is what an unresponsive node needs; without it, the CLI attempts a graceful failover. For example, if node3 is down:
couchbase-cli failover -c 192.168.1.10:8091 -u admin -p password --server-failover node3:8091 --hard
Once the failover command is executed, Couchbase immediately marks the node as failed over and promotes its replica vBuckets on the remaining nodes to active. This bypasses the detection timeout of automatic failover. If you’re actively monitoring, you can initiate this within seconds of a node becoming truly unresponsive, potentially reducing client-facing downtime from minutes to seconds.
Choosing the Right Strategy
- Automatic Failover: Best for environments where minimal administrative overhead is desired and a slightly longer recovery window is acceptable. It’s a set-and-forget solution that handles common failure scenarios without human intervention. This is ideal for less critical workloads or development/staging environments. The key configuration parameter to tune is the auto-failover timeout. You can view and modify it in the auto-failover section of the Settings screen in the Couchbase Web Console.
- Manual Failover: Crucial for mission-critical applications where minimizing downtime is paramount. It requires active, round-the-clock monitoring but offers the fastest possible recovery time by removing the automated detection delay. This is the preferred choice for production environments with strict RTO (Recovery Time Objective) requirements.
The Nuance: Network Partitions vs. Node Crashes
Automatic failover is generally excellent at detecting hard node failures (e.g., a server losing power or the OS crashing). However, it can struggle with network partitions, where nodes are still running but cannot communicate. In a network partition, ns_server on the remaining nodes stops receiving heartbeats from the isolated node even though that node is healthy. If the partition is transient, automatic failover might never trigger, or it might trigger just before the node becomes reachable again, leaving you with a failed-over node that has to be recovered and rebalanced back in. Manual failover gives you control to decide when a node is truly "failed" and not just temporarily isolated.
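When making that call by hand, it helps to probe the suspect node from more than one vantage point. The following sketch shows the idea; the helper classify_outage, its labels, and the vantage-point names are hypothetical and not part of any Couchbase tool.

```python
from typing import Mapping


def classify_outage(reachable_from: Mapping[str, bool]) -> str:
    """Classify a suspect node from reachability seen at several vantage points.

    reachable_from maps a vantage point (e.g. another cluster node or a
    monitoring host) to whether it can currently reach the suspect node.
    """
    if not reachable_from:
        raise ValueError("need at least one vantage point")
    seen = sum(1 for ok in reachable_from.values() if ok)
    if seen == len(reachable_from):
        return "healthy"
    if seen == 0:
        # Nobody can reach it: consistent with a crash or total isolation.
        return "likely-down"
    # Some can reach it and some cannot: the node is probably still running.
    return "likely-partition"


print(classify_outage({"node1": False, "node2": False, "monitor": False}))
print(classify_outage({"node1": False, "node2": False, "monitor": True}))
```

A "likely-partition" result is a hint to wait or investigate the network before failing the node over, since the node itself may still be serving some clients.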
The Hidden Cost of Automatic Failover
A failover is not free. Once a node is failed over, the surviving nodes serve its share of the data with fewer replicas, and bringing the cluster back to full strength requires recovering or replacing the node and then running a rebalance. That rebalance is resource-intensive, as data is moved between nodes. If automatic failover triggers too aggressively due to transient network issues or misconfiguration, the cluster pays this cost for a node that was never really gone, with performance degrading while data is redistributed. Manual failover lets you control when this happens: you typically initiate it only when you are certain a node is permanently gone and you have the capacity to handle the rebalance that follows.
Tuning Automatic Failover
If you opt for automatic failover, you can tune its sensitivity through the auto-failover timeout (default 120 seconds), which is how long a node must remain unresponsive before it is failed over. Reducing this value makes automatic failover more aggressive but increases the risk of false positives during network instability. The minimum allowed value depends on your Couchbase Server version. You can adjust the timeout via the Couchbase CLI:
couchbase-cli setting-autofailover -c <host>:8091 -u <username> -p <password> --enable-auto-failover 1 --auto-failover-timeout 60
This would halve the wait before failover starts, from 120 seconds to 60 seconds.
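The trade-off between a shorter timeout and false positives can be estimated from your own outage history. Here is a minimal sketch, assuming you have a list of past connectivity gaps in seconds; the function name would_trigger and the sample durations are hypothetical.

```python
from typing import Sequence


def would_trigger(outage_durations: Sequence[float], timeout: float) -> int:
    """Count outages long enough to trigger a failover at the given timeout."""
    return sum(1 for duration in outage_durations if duration >= timeout)


# Hypothetical history: six transient blips and one genuine 10-minute outage.
blips = [2, 8, 15, 25, 45, 70, 600]

for timeout in (120, 60, 30, 10):
    print(f"timeout={timeout:>3}s -> {would_trigger(blips, timeout)} failovers")
```

With this sample history, only the genuine outage triggers a failover at 120 seconds, while a 10-second timeout would also have fired on four transient blips.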
The Next Step After Failover
Whether automatic or manual, once a node is failed over, the cluster runs with reduced capacity and redundancy, and potentially higher latency, until you recover or replace the node and rebalance. The next thing you’ll likely need to do is monitor that rebalance operation and verify data consistency once it completes.
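Rebalance progress can be polled over the REST API at /pools/default/rebalanceProgress. The sketch below only interprets a response and assumes a shape in which a top-level "status" field is accompanied by one entry per node carrying a fractional "progress" value; treat that shape, the helper name rebalance_percent, and the sample payloads as assumptions to check against your server version.

```python
from typing import Mapping, Optional


def rebalance_percent(progress: Mapping[str, object]) -> Optional[float]:
    """Return average rebalance progress in percent, or None if not running."""
    if progress.get("status") != "running":
        return None
    # Per-node entries are assumed to look like {"progress": 0.5}.
    per_node = [
        value["progress"]
        for value in progress.values()
        if isinstance(value, dict) and "progress" in value
    ]
    if not per_node:
        return 0.0
    return 100.0 * sum(per_node) / len(per_node)


# Illustrative payloads following the assumed shape.
running = {
    "status": "running",
    "ns_1@node1": {"progress": 0.5},
    "ns_1@node2": {"progress": 0.25},
}
idle = {"status": "none"}

print(rebalance_percent(running))
print(rebalance_percent(idle))
```

Polling this in a loop until the helper returns None gives a simple signal for when the cluster has finished redistributing data.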