Couchbase indexes aren’t a single monolithic entity; they’re distributed across your cluster, and understanding how that distribution works is key to optimizing query performance.

Let’s look at a simple example. Imagine you have a bucket called travel-sample and you want to index the name field of your documents. You’d create an index like this:

CREATE INDEX idx_name ON `travel-sample`(name);

When this index is created, Couchbase doesn’t just put all the index entries on one node. Instead, it partitions this index. Each partition contains a subset of the index keys, and these partitions are then distributed across the nodes in your cluster.

The primary mechanism for this distribution is based on the document ID. Couchbase uses a consistent hashing algorithm on the document ID to determine which partition a document (and therefore its corresponding index entry) belongs to. If you have multiple index partitions, a document ID will map to one specific partition.

Consider a cluster with three nodes: Node A, Node B, and Node C. If you have an index with, say, 16 partitions, Couchbase will assign these partitions across your nodes. A document with ID doc123 might hash to partition 5, and partition 5 might reside on Node B. Any query that needs to look up doc123 in idx_name will know to go directly to Node B.

This partitioning is what allows Couchbase to parallelize index lookups. When a query needs to scan a portion of an index, it can send requests to multiple nodes simultaneously, each responsible for a different set of index partitions. This is particularly powerful for range scans or queries that involve filtering on indexed fields.

The number of partitions for an index is determined at index creation time. For a GSI (Global Secondary Index), you can explicitly set the number of partitions using the WITH { "num_partition": N } clause. For example:

CREATE INDEX idx_name ON `travel-sample`(name) WITH { "num_partition": 32 };

If you don’t specify num_partition, Couchbase uses a default value, which is typically 1/4th of the number of nodes in the cluster, but with a minimum of 4 and a maximum of 1024. This default is usually a good starting point, but for very large datasets or high-throughput workloads, tuning this value can be critical.

More partitions generally mean better potential for parallelization, as more index data can be spread across more nodes. However, too many partitions can lead to increased overhead in managing the index and potentially more network chatter if queries need to hit many partitions. The sweet spot often depends on the size of your data, the query patterns, and the number of nodes in your cluster.

A key takeaway here is that Couchbase doesn’t simply replicate the entire index on every node. That would be inefficient for large indexes. Instead, it partitions the index and distributes those partitions. This is fundamental to how Couchbase achieves high availability and performance for indexed queries.

When you’re troubleshooting slow queries that involve GSI indexes, one of the first things to check is how your indexes are partitioned and distributed. You can inspect this information using the system:indexes system bucket, or via the couchbase-cli tool. For example, to see index information for a bucket:

/opt/couchbase/bin/couchbase-cli gsi-status -c <host>:<port> -u <username> -p <password> --bucket <bucket_name>

The output will show you which index partitions are located on which nodes. If you see that a disproportionate number of partitions for a critical index are on a single node, or if your partitions are not evenly distributed, it can become a bottleneck.

The number of partitions directly impacts the size of the index data that each node is responsible for. If an index has too few partitions, a single node might hold a very large chunk of the index, leading to that node becoming a hot spot during index scans. Conversely, an excessive number of partitions, especially on a small cluster, can lead to many partitions being on the same node due to the distribution algorithm, or can increase management overhead.

The actual distribution of partitions across nodes is managed by Couchbase’s internal service responsible for index management. When nodes are added or removed from the cluster, Couchbase will rebalance the index partitions to maintain an even distribution. This rebalancing process can consume cluster resources, so it’s important to be aware of it during scaling operations.

The distribution of index partitions is also a crucial factor in understanding how Couchbase handles index replication and high availability. While partitions are distributed, Couchbase also offers index replication, where each partition can have one or more replicas. These replicas are placed on different nodes to ensure that if a node fails, the index data is still available from its replica on another node. This is configured at index creation time with the WITH { "num_replica": N } clause.

The next concept to explore is how these partitioned indexes interact with query planning and execution, specifically how the query optimizer decides which nodes to hit for a given query.

Want structured learning?

Take the full Couchbase course →