Restarting a Couchbase node can bring your cluster to a crawl for an uncomfortably long time, especially if you’ve got a lot of data. This isn’t just about network latency or disk I/O; it’s about how Couchbase rebuilds its internal memory structures that map keys to data locations.
Let’s see it in action. Imagine a cluster with a few nodes, each holding a portion of your dataset. When one node restarts, it needs to re-establish its connection to the data it owns and rebuild its memory map.
# Simulate a node restart (don't do this in production!)
# On the node to restart:
sudo systemctl restart couchbase-server
# ... wait for the service to come back up ...
# On another node, observe the cluster status:
couchbase-cli server-list -c <master_node_ip>:8091 -u <username> -p <password>
You’ll see that node report a warm-up state before it returns to healthy (the exact status strings vary by version), and during that time read/write operations against data on that node will be significantly slower, if they work at all. The key point is that Couchbase doesn’t just load data from disk; it also reconstructs an in-memory index of where every single document is.
The problem Couchbase is solving is how to quickly make data accessible after a crash or restart. It uses a combination of disk persistence and in-memory caching. When a node comes back online, it has to:
- Re-establish network connections to other nodes and clients.
- Load the metadata (the vBuckets and their states) from persistent storage.
- Rebuild the in-memory data structures (like the item cache and index) that map document keys to their locations in memory or on disk. This is the "warm-up" time.
- Sync with other nodes to ensure data consistency and rebalance vBuckets if necessary.
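To watch the warm-up phase directly rather than infer it from latency, one option is to poll the data engine's warm-up stats. The sketch below assumes cbstats is on the PATH, that its warmup group prints a line of the form "ep_warmup_state: <state>", and that the final state is named done. The function names are illustrative and not part of any Couchbase tool.

```shell
#!/bin/sh
# Extract the ep_warmup_state value from `cbstats ... warmup` output on stdin.
# Assumes lines look like " ep_warmup_state:    loading data".
warmup_state() {
  sed -n 's/^[[:space:]]*ep_warmup_state:[[:space:]]*//p'
}

# Poll every 5 seconds until warm-up reports "done" (assumed final state name).
# Usage: wait_for_warmup <node_ip> <username> <password> <bucket_name>
wait_for_warmup() {
  while :; do
    state=$(cbstats "$1:11210" -u "$2" -p "$3" -b "$4" warmup | warmup_state)
    echo "warm-up state: ${state:-unknown}"
    [ "$state" = "done" ] && return 0
    sleep 5
  done
}
```

Calling wait_for_warmup with the restarted node's address right after the restart gives a running log of the phases. The loop has no timeout, so in a script you would probably want to add one.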
The levers you control are primarily around how much data is kept in memory and how aggressively Couchbase persists it. These are configured at the bucket level.
Consider a bucket configured with a high mem_low_watermark and mem_high_watermark. These settings dictate when Couchbase starts evicting data from memory to make room for new items. If your node restarts and a large portion of the data was evicted just before the restart, it will have to be re-read from disk to repopulate the cache.
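Here’s a look at those settings. The data engine reports them through cbstats, where the watermarks appear as ep_mem_low_wat and ep_mem_high_wat:
# Get the bucket's memory watermarks from the data service (port 11210)
cbstats <node_ip>:11210 -u <username> -p <password> -b <bucket_name> all | grep -E 'ep_mem_(low|high)_wat'
In that output, ep_mem_low_wat and ep_mem_high_wat are byte values; they correspond to the mem_low_watermark and mem_high_watermark settings discussed here, which are usually expressed as percentages of the bucket’s memory quota. For example, mem_low_watermark: 70 and mem_high_watermark: 85 means Couchbase starts evicting when memory usage hits 85% and stops when it drops to 70%. If your node restarts and the working set is larger than what fits in memory at these watermarks, reads will keep going to disk and the node will be slow.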
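To make the percentages concrete, here is a small sketch that converts a bucket's per-node RAM quota into the two thresholds. The 4096 MB quota is an assumed figure for illustration, and watermark_mb is a hypothetical helper, not a Couchbase command.

```shell
#!/bin/sh
# Convert a watermark percentage into MB for a given per-node bucket quota.
# Uses integer arithmetic, so results are rounded down.
watermark_mb() {
  quota_mb=$1
  percent=$2
  echo $(( quota_mb * percent / 100 ))
}

high=$(watermark_mb 4096 85)   # eviction starts once usage passes this
low=$(watermark_mb 4096 70)    # eviction stops once usage falls to this
echo "high=${high}MB low=${low}MB evicted_per_cycle=$(( high - low ))MB"
```

With these assumed numbers, each eviction cycle frees roughly 614 MB, and that data has to come back from disk the next time it is requested.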
The most surprising thing about Couchbase warm-up is that it’s not just about how much data you have, but how much of it is actively being used and therefore how much was recently evicted. A node that was heavily utilized and had a lot of data recently evicted will have a much longer warm-up than a node with the same amount of data but lower recent activity, even if the total dataset size is identical. This is because the "warm-up" is largely the process of repopulating the cache with frequently accessed items.
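One rough way to gauge how much of the active data set was in memory before a planned restart is the bucket's resident ratio. The sketch below assumes cbstats is on the PATH and that the stat is reported as vb_active_perc_mem_resident in the "all" group; both helper names are hypothetical.

```shell
#!/bin/sh
# Print the value of one named stat from `cbstats ... all` output on stdin.
# Assumes each line looks like " stat_name:     value".
stat_value() {
  sed -n "s/^[[:space:]]*$1:[[:space:]]*//p"
}

# Print the percentage of active items currently resident in memory.
# Usage: resident_percent <node_ip> <username> <password> <bucket_name>
resident_percent() {
  cbstats "$1:11210" -u "$2" -p "$3" -b "$4" all \
    | stat_value vb_active_perc_mem_resident
}
```

A low value before the restart suggests that much of the data set had already been evicted, which is the situation described above.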
If you’re experiencing slow warm-up, and your mem_low_watermark and mem_high_watermark are set aggressively (e.g., 50% and 70%), you might consider increasing them. For instance, setting mem_low_watermark to 90 and mem_high_watermark to 95 (if you have sufficient RAM) allows more data to stay resident in memory. This means that after a restart, Couchbase has to read less data from disk to get back to a fully operational state.
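# Example of updating the watermarks (use with caution!)
# cbepctl targets one node's data service (port 11210), so repeat this on each node.
# A % suffix means percent of the bucket quota. Raise the high watermark first so the low one never exceeds it.
cbepctl <node_ip>:11210 -u <username> -p <password> -b <bucket_name> set flush_param mem_high_wat 95%
cbepctl <node_ip>:11210 -u <username> -p <password> -b <bucket_name> set flush_param mem_low_wat 90%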
This directly impacts the size of the item cache. By raising the watermarks, you’re telling Couchbase to hold onto more data in RAM before it starts evicting. When the node comes back, less data needs to be re-read from disk to fill this larger cache, thus shortening the warm-up time.
The next thing you’ll likely encounter is understanding the impact of disk performance on your overall read/write throughput, especially when the cache isn’t fully warm.