Couchbase scales read and write throughput by distributing data and requests across multiple nodes, but its default settings often leave significant performance on the table.

Let’s watch Couchbase chew through some data. Imagine a bucket named my_bucket with 100 million documents, each a simple JSON object like {"id": 123, "value": "some_string"}. We’ll hit it with parallel read and write operations using cb_load as the load generator (tools such as cbc-pillowfight fill the same role).

Here are the cb_load invocations against a moderately sized cluster, targeting 10,000 reads/sec and 5,000 writes/sec:

# Example read load
./cb_load -c couchbase://10.0.0.1:8091 -b my_bucket -n 100000000 -r 10000 -t 16

# Example write load
./cb_load -c couchbase://10.0.0.1:8091 -b my_bucket -n 100000000 -w 5000 -t 16 -i 1000000

You’ll see cb_load reporting operations per second (ops/sec) and latency. Without tuning, those ops/sec might be surprisingly low, and latency might be spiky. The goal is to push those ops/sec higher and keep latency stable.
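
To judge whether latency is "spiky", it helps to look at percentiles rather than averages. The sketch below is a minimal illustration, assuming you have collected per-operation latencies in milliseconds from your load tool; the function names and the sample numbers are hypothetical and are not cb_load output.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (pct is 1-100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms, elapsed_s):
    """Throughput plus median, tail, and worst-case latency for one test run."""
    return {
        "ops_per_sec": len(latencies_ms) / elapsed_s,
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "max_ms": max(latencies_ms),
    }


if __name__ == "__main__":
    # Hypothetical samples: mostly fast operations with one slow outlier.
    made_up = [1] * 98 + [2, 50]
    print(summarize(made_up, elapsed_s=2))
```

A wide gap between p50 and p99 (or max) is the spikiness the tuning steps below aim to reduce; compare the same summary before and after each change.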

Couchbase’s performance hinges on how efficiently it can access data on disk and in memory, and how quickly it can process incoming requests. The Data Service (which handles data storage and retrieval) has several key areas to tune:

Memory Allocation: The Foundation

The most impactful tuning knob is how much RAM you give to the Data Service. Couchbase assigns each service its own memory quota (Data, Index, and any others running on the node), and those quotas compete for the same physical RAM. For read-heavy workloads, you want to maximize the Data Service’s allocation.

  • Diagnosis: Check the current allocation in the Couchbase Web UI under Settings, in the memory quotas section. Look for the Data and Index service quotas (labels vary slightly by version), then check the RAM quota on the bucket itself.
  • Fix: Raise the Data Service quota, and raise the bucket’s RAM quota to use it. A good starting point for a read-heavy workload is to allocate 75-80% of a node’s available RAM to the Data Service. For example, if a node has 32GB of RAM, set the Data Service quota to about 24GB (75%) and leave the remainder for the operating system and any other services on the node.
  • Why it works: The quota bounds how much of the dataset stays resident in memory (the resident ratio). A higher resident ratio means more of your hot data is served from RAM, drastically reducing disk I/O for reads.
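
The 75-80% rule of thumb is simple arithmetic, sketched below. The helper names are hypothetical, and the resident-ratio estimate is a rough upper bound that assumes the whole quota holds document data, ignoring metadata overhead and memory watermarks.

```python
def data_quota_mb(node_ram_mb, fraction=0.75):
    """Candidate Data Service quota in MB for one node."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return int(node_ram_mb * fraction)


def resident_ratio(quota_mb, dataset_mb):
    """Rough upper bound on the share of the dataset that fits in the quota."""
    if dataset_mb <= 0:
        raise ValueError("dataset_mb must be positive")
    return min(1.0, quota_mb / dataset_mb)


if __name__ == "__main__":
    node_ram_mb = 32 * 1024  # the 32GB node from the example above
    quota = data_quota_mb(node_ram_mb)
    print(quota, "MB quota")
    # Assumed per-node dataset size of 48GB, purely for illustration.
    print(resident_ratio(quota, 48 * 1024))
```

If the estimated resident ratio comes out well below your hot-data fraction, more RAM or more nodes will likely help more than any other setting in this article.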

Disk I/O: Bottleneck Buster

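Even with perfect memory allocation, disk I/O can be a bottleneck, especially for writes and reads that miss the cache. Couchbase’s storage engines are append-oriented: Couchstore, the long-standing default, is an append-only B-tree, while Magma is built on a Log-Structured Merge-Tree (LSM). In both, writes are queued in memory and flushed to disk in batches rather than updated in place, so stale versions of documents accumulate on disk until compaction removes them. Reads that miss the cache have to be served from those files.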

  • Diagnosis: Monitor disk I/O on your nodes using tools like iostat or CloudWatch. Look for high %util or await on your data drives (svctm is deprecated in current sysstat releases and should not be relied on). Also, check Couchbase’s internal metrics for disk operations, such as the disk write queue.
  • Fix:
    • SSD/NVMe: Ensure your data drives are SSDs or NVMe. This is non-negotiable for performance.
    • Separate Drives: If possible, dedicate separate physical drives for Couchbase’s data (/opt/couchbase/var/lib/couchbase/data) and its logs (/opt/couchbase/var/lib/couchbase/logs).
    • Tune compaction: Compaction reclaims space from deleted or updated documents. Too aggressive compaction can thrash disks; too little leads to fragmentation and slower reads. With default auto-compaction settings, compaction can start in the middle of peak traffic.
      • Diagnosis: In the Web UI, check the cluster-wide auto-compaction settings and any per-bucket override on my_bucket. Note the fragmentation threshold and whether a time window is configured.
      • Fix: Restrict auto-compaction to an off-peak time window, or disable it and trigger compaction manually during off-peak hours. Setting names differ between the UI, CLI, and REST API and between versions, so confirm them against the documentation for your release. The metadata purge interval is a separate setting: choose a value that balances space reclamation against I/O impact, and keep it long enough for deletions to reach replicas and XDCR targets.
      • Why it works: Scheduled compaction avoids I/O contention with active read/write traffic. The metadata purge interval controls how long tombstones for deleted documents are retained before compaction removes them, which affects how much dead metadata each compaction run has to carry.
  • Why it works: Faster storage and optimized compaction mean writes are flushed more quickly and reads don’t have to wade through as much fragmentation.
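
The decision "compact only when fragmentation is high and traffic is low" can be sketched as below. This is an illustration of the logic, not Couchbase’s implementation: the function names, the 30% threshold, and the 01:00-05:00 window are assumptions chosen for the example.

```python
from datetime import time


def fragmentation_pct(file_size_bytes, live_data_bytes):
    """Share of the on-disk file occupied by stale (dead) data, in percent."""
    if file_size_bytes <= 0:
        return 0.0
    return 100.0 * (file_size_bytes - live_data_bytes) / file_size_bytes


def in_window(now, start, end):
    """True if now is in [start, end), including windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_compact(file_size_bytes, live_data_bytes, now,
                   threshold_pct=30.0, start=time(1, 0), end=time(5, 0)):
    """Compact only when fragmentation is high AND we are inside the off-peak window."""
    fragmented = fragmentation_pct(file_size_bytes, live_data_bytes) >= threshold_pct
    return fragmented and in_window(now, start, end)


if __name__ == "__main__":
    # Assumed sizes: a 100GB data file holding 60GB of live data, checked at 02:00.
    print(should_compact(100 * 2**30, 60 * 2**30, time(2, 0)))
```

The wrap-around branch in in_window matters for windows such as 22:00 to 04:00, which are common choices for off-peak maintenance.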

Network: The Data Highway

Couchbase is a distributed system, so every client request and every replica write crosses the network. The speed at which data can travel between nodes and to clients is critical.

  • Diagnosis: Monitor network bandwidth utilization and latency between Couchbase nodes and between clients and nodes, using tools like iftop, nload, or your cloud provider’s network monitoring.
  • Fix:
    • Network Interface: Use high-speed network interfaces (10Gbps or higher) for your Couchbase nodes.
    • Jumbo Frames: If your network infrastructure supports it and is configured consistently across all nodes, enabling Jumbo Frames (MTU 9000) can reduce CPU overhead for network packet processing. This is an advanced setting and requires careful network configuration.
    • Tune tcp_send_buffer_auto: If your Couchbase version exposes a TCP send buffer auto-tuning setting under this name, make sure it is enabled. Confirm that the setting exists in your release before relying on it.
      • Diagnosis: Dump the cluster settings as JSON, for example with couchbase-cli setting-cluster --cluster couchbase://localhost --username Administrator --password password --output json where your CLI version supports that form, and look for tcp_send_buffer_auto.
      • Fix: Set tcp_send_buffer_auto to true (it’s often true by default, but worth verifying). If it’s false, enable it.
      • Why it works: Auto-tuning lets the TCP send buffer size follow network conditions, so the buffer is neither too small to fill the link nor larger than it needs to be.
  • Why it works: More bandwidth and efficient packet handling mean faster data transfer, especially for large documents or high concurrency.
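
Before upgrading interfaces, it is worth estimating how much bandwidth the workload actually needs. The sketch below uses the 10,000 reads/sec and 5,000 writes/sec targets from earlier; the 1,000-byte document size and single replica are assumptions, and the estimate is a cluster-wide aggregate that ignores protocol overhead.

```python
def required_mbps(reads_per_sec, writes_per_sec, doc_bytes, replicas=1):
    """Rough cluster-wide bandwidth estimate in megabits per second (1 Mbps = 1e6 bits/s)."""
    bits_per_doc = 8 * doc_bytes
    client_bits = (reads_per_sec + writes_per_sec) * bits_per_doc
    # Each write is forwarded once per replica between nodes.
    replication_bits = writes_per_sec * bits_per_doc * replicas
    return {
        "client_mbps": client_bits / 1e6,
        "replication_mbps": replication_bits / 1e6,
        "total_mbps": (client_bits + replication_bits) / 1e6,
    }


if __name__ == "__main__":
    estimate = required_mbps(10_000, 5_000, doc_bytes=1_000, replicas=1)
    for name, value in estimate.items():
        print(f"{name}: {value:.1f}")
```

If the total sits far below link capacity, the network is probably not the bottleneck and Jumbo Frames are unlikely to be worth the configuration risk.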

Worker Threads: CPU Utilization

The Data Service uses pools of reader and writer threads for disk access, separate from the front-end threads that serve client connections. The default settings are often conservative.

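  • Diagnosis: Monitor CPU utilization on your Couchbase nodes. If CPU has headroom and the disks are not saturated, yet latency is high or the disk queues keep growing, more threads may help. If CPU is already consistently high (e.g., >80%), adding threads is more likely to add contention than throughput. Use top or htop to see CPU usage by the memcached process, which is the Data Service (beam.smp is the cluster manager).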
  • Fix:
    • num_reader_threads: Increase the number of reader threads.
      • Diagnosis: Check the current reader thread setting in the Data Service’s advanced settings, through the Web UI or the REST API. The exact location depends on your version.
      • Fix: Set num_reader_threads to a value between 2 and 4 times the number of CPU cores on the node, within the maximum your version allows. For a 16-core node, try num_reader_threads 32, then measure before going higher.
      • Why it works: More reader threads can service more concurrent disk reads for cache misses, as long as CPU and storage still have headroom.
    • num_writer_threads: Similarly, increase writer threads for write-heavy workloads.
      • Diagnosis: Check num_writer_threads in the same place as the reader thread setting.
      • Fix: Set num_writer_threads to a value like num_reader_threads, or slightly lower if writes are less frequent than reads.
      • Why it works: More writer threads can flush queued writes to disk in parallel, which keeps the disk write queue from backing up.
  • Why it works: By increasing the number of threads that can handle requests, you allow Couchbase to better utilize available CPU cores for processing.
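
The "2 to 4 times the core count" guideline can be turned into a starting value as sketched below. The function name is hypothetical, and the upper bound of 64 is an assumption about the allowed range; check the limit for your Couchbase version.

```python
import os


def suggest_threads(cpu_cores, multiplier=2, lower=1, upper=64):
    """Starting value for a reader or writer thread count, clamped to [lower, upper]."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return max(lower, min(upper, cpu_cores * multiplier))


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # cpu_count() can return None on some platforms
    print("reader threads:", suggest_threads(cores, multiplier=2))
    print("writer threads:", suggest_threads(cores, multiplier=2))
```

Treat the result as a first guess: start at the low end of the range, rerun the load test, and raise the value only while throughput keeps improving.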

XDCR and Replication: The Silent Killers

If you’re using Cross Datacenter Replication (XDCR) or internal replication, these processes consume resources and can impact performance.

  • Diagnosis: Check XDCR status in the Web UI. Monitor network traffic and CPU on nodes involved in replication. Rebalance operations also consume significant resources, so account for any rebalance that runs during your measurements.
  • Fix:
    • Replication Threads: Tune the parallelism of each replication.
      • Diagnosis: In the Web UI, open the XDCR page and edit the replication for my_bucket to see its advanced settings.
      • Fix: Increase the replication’s parallelism (referred to here as replication_threads; recent versions expose it as source and target nozzles per node). A common starting point is 4 or 8, but monitor CPU.
      • Why it works: More threads allow replication to keep up with changes more effectively, reducing replication lag and backpressure.
    • Bandwidth Throttling: If replication is saturating the network, configure bandwidth throttling for XDCR.
      • Diagnosis: In the same advanced settings, check the bandwidth cap (referred to here as bandwidth_limit_kbps; recent versions call it the network usage limit and may use different units).
      • Fix: Set a reasonable limit to cap replication traffic, and confirm the unit your version expects before entering a number.
      • Why it works: Prevents replication from hogging all network bandwidth, ensuring it doesn’t starve application traffic.
  • Why it works: By controlling how replication consumes resources, you prevent it from becoming a bottleneck for your primary application workload.
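
Choosing a bandwidth cap mostly comes down to unit conversion. The sketch below reserves a share of the link for replication and reports the cap in both kbps and MB per second, since the expected unit depends on the setting and version. The function name, the 10Gbps link, and the 25% share are assumptions for illustration.

```python
def xdcr_limit(link_mbps, share):
    """Cap replication at a share of link capacity; returns kbps and MB/s."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    capped_mbps = link_mbps * share
    return {
        "kbps": int(capped_mbps * 1000),  # megabits -> kilobits
        "mb_per_sec": capped_mbps / 8,    # megabits -> megabytes
    }


if __name__ == "__main__":
    # Assumed: a 10Gbps interface with a quarter of it reserved for XDCR.
    print(xdcr_limit(link_mbps=10_000, share=0.25))
```

Mixing up bits and bytes here changes the cap by a factor of eight, so it is worth checking the number against observed replication traffic after applying it.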

After applying these tunings, you should see a significant increase in ops/sec and a reduction in latency on your cb_load tests. The next challenge you’ll likely face is managing the complexity of these settings across a large cluster and understanding how they interact with each other.
