A CouchDB cluster can keep serving requests even if one node goes offline, but it’s not magic; it’s a deliberate trade-off that requires understanding how data is distributed and replicated.

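Let’s say you have a three-node CouchDB cluster (nodes node1, node2, node3) and node2 suddenly disappears from the network. CouchDB has a concept of "replication factor," called n, which defaults to 3. Each database is split into shards, and CouchDB tries to keep n copies of every shard, distributed across your nodes. In a three-node cluster with n=3, every node holds a copy of every shard, so when node2 goes down every document is left with two live copies instead of three. Reads and writes normally need only a majority of copies to respond, which is why node1 and node3 can keep serving requests while the cluster runs below its replication factor.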
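To make the arithmetic concrete, here is a minimal sketch of the quorum check. It assumes the default read and write quorum is a simple majority of n (n // 2 + 1); the function names are hypothetical and this is an illustration, not CouchDB’s own code.

```python
def default_quorum(n: int) -> int:
    """Majority of n copies (assumed default for clustered reads and writes)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n // 2 + 1


def quorum_met(n: int, copies_down: int) -> bool:
    """True if the copies still reachable form a majority of n."""
    return n - copies_down >= default_quorum(n)


if __name__ == "__main__":
    # (replication factor, copies currently unreachable)
    for n, down in [(3, 1), (3, 2), (2, 1)]:
        print(f"n={n}, copies down={down}: quorum met = {quorum_met(n, down)}")
```

Under that assumption, n=3 tolerates one missing copy, while n=2 has no slack at all.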

Here’s how you’d diagnose and recover:

1. Identify the Missing Node and Its State

First, you need to confirm which node is actually gone and if CouchDB knows it’s gone.

  • Diagnosis: On any remaining node (e.g., node1), run:

    curl -X GET http://localhost:5984/_membership
    

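    This returns two lists: cluster_nodes, the nodes configured as cluster members, and all_nodes, the nodes this node can currently reach. A node that has dropped off the network stays in cluster_nodes but disappears from all_nodes, so if node2 appears only in cluster_nodes, CouchDB has already detected its absence.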

  • Why it works: The _membership endpoint queries CouchDB’s internal state about the cluster’s active nodes.
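If you want to script this check, a small helper can compare the two lists in the _membership response. This is a sketch: the node names (couchdb@node1 and so on) and the sample payload are made up for illustration, and fetching the JSON is left to whatever HTTP client you already use.

```python
import json
from typing import Dict, List


def unreachable_nodes(membership: Dict[str, List[str]]) -> List[str]:
    """Return configured cluster members that this node cannot currently see."""
    reachable = set(membership.get("all_nodes", []))
    return sorted(
        node for node in membership.get("cluster_nodes", []) if node not in reachable
    )


# Hypothetical response body from GET /_membership while node2 is down.
SAMPLE_MEMBERSHIP = """
{
  "all_nodes": ["couchdb@node1", "couchdb@node3"],
  "cluster_nodes": ["couchdb@node1", "couchdb@node2", "couchdb@node3"]
}
"""

if __name__ == "__main__":
    print(unreachable_nodes(json.loads(SAMPLE_MEMBERSHIP)))
```

An empty result means every configured member is reachable from the node you asked.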

2. Check for Unreplicated or Under-Replicated Documents

The core problem is that some shard copies, and therefore the documents in them, now exist on fewer nodes than the replication factor calls for.

  • Diagnosis: CouchDB does not expose a cluster-wide count of under-replicated documents (there is no _replication_stats or _index/design_docs endpoint). Copies are tracked per shard range, so ask each database where its shards are supposed to live and compare that with the nodes that are still up:

    curl -X GET http://localhost:5984/your_database/_shards
    

    Replace your_database with the actual database name. The response maps each shard range to the nodes that hold a copy of it. Any range that lists node2 is running with one copy fewer than configured until node2 returns or is replaced. Check your _users database the same way: if it ever loses all live copies of a shard range, you will have authentication issues.

  • Why it works: The shard map is the cluster’s record of which node stores which slice of each database. Comparing it with the reachable nodes reported by _membership shows exactly which ranges are short a copy.
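One way to spot shard ranges that are short a copy is to compare a database’s shard map (the "shards" object returned by GET /your_database/_shards) with the nodes that are currently reachable. The sketch below assumes that response shape; the node names and ranges are hypothetical.

```python
from typing import Dict, List


def short_ranges(
    shards: Dict[str, List[str]], live_nodes: List[str]
) -> Dict[str, List[str]]:
    """Map each shard range to the holders that are not currently reachable."""
    live = set(live_nodes)
    result: Dict[str, List[str]] = {}
    for shard_range, holders in sorted(shards.items()):
        missing = [node for node in holders if node not in live]
        if missing:
            result[shard_range] = missing
    return result


# Hypothetical "shards" object for a database with two ranges and n=3.
SAMPLE_SHARDS = {
    "00000000-7fffffff": ["couchdb@node1", "couchdb@node2", "couchdb@node3"],
    "80000000-ffffffff": ["couchdb@node1", "couchdb@node2", "couchdb@node3"],
}

if __name__ == "__main__":
    live = ["couchdb@node1", "couchdb@node3"]  # e.g. all_nodes from _membership
    for shard_range, missing in short_ranges(SAMPLE_SHARDS, live).items():
        print(shard_range, "is missing copies on", missing)
```

Running it per database gives you a list of ranges to watch during recovery.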

3. Re-add the Node (If Recoverable)

If node2 is expected to come back online (e.g., a temporary network glitch, or you’re bringing the physical/virtual machine back up), CouchDB will automatically try to catch up.

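  • Diagnosis: If node2 is back and reachable by the other nodes, CouchDB’s internal replication should start automatically. Run _membership on node1 or node3 and confirm that node2 is listed under all_nodes again. To watch the catch-up itself, in CouchDB 3.x the output of GET /_node/_local/_system on node1 or node3 includes an internal_replication_jobs value, the number of pending shard synchronization jobs; expect it to fall toward zero.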

  • Why it works: CouchDB’s internal replication mechanism detects that a node has rejoined and begins replicating any missing documents to it.

4. Rebuild the Cluster (If Node is Permanently Lost)

If node2 is gone for good (e.g., hardware failure), you need to add a new node to replace it and move the lost shard copies onto it.

  • Diagnosis:

    a. Provision a New Node: Set up a new server with CouchDB installed. Let’s call it node4. It needs its own node name (for example couchdb@node4, set in vm.args) and the same Erlang cookie as the existing nodes, otherwise it cannot talk to the cluster.

    b. Join the Cluster: Membership is not configured through a remote_node setting in local.ini; CouchDB has no such option. The [cluster] section only sets defaults for databases created later:

      [cluster]
      n = 3

    To join node4, register it through an existing node. On node1 (add admin credentials if your cluster requires them; in CouchDB 3.x the node list lives under /_node/_local/_nodes):

      curl -X PUT http://localhost:5984/_node/_local/_nodes/couchdb@node4 -d '{}'

    c. Verify Membership: On any existing node (e.g., node1), run curl -X GET http://localhost:5984/_membership again. You should now see node4 in both all_nodes and cluster_nodes.

    d. Move the Shard Copies: Joining the cluster does not give node4 any data for databases that already exist. For each database, update its shard map document (GET and then PUT http://localhost:5984/_node/_local/_dbs/your_database) so that node4 takes node2’s place in by_node and by_range. Then use the _shards check from step 2 to confirm that every range lists node4.

  • Why it works: The shard map, not the n setting, decides where copies live; n only determines how many copies are placed when a database is created. Once node4 is listed for a shard range, internal replication on node1 and node3 pushes that range to node4 until it holds a full copy, which brings the cluster back to three live copies of every shard.
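Editing a shard map by hand is error-prone, so here is a rough sketch of the transformation. It assumes the shard map document (as stored at /_node/_local/_dbs/your_database) has by_node, by_range, and changelog fields; the sample document is trimmed and hypothetical, and replace_node is an illustrative helper, not a CouchDB API. You would still PUT the result back yourself, including the current _rev.

```python
import copy
from typing import Any, Dict


def replace_node(shard_map: Dict[str, Any], old: str, new: str) -> Dict[str, Any]:
    """Return a copy of the shard map in which `new` holds every range `old` held."""
    updated = copy.deepcopy(shard_map)
    moved = updated["by_node"].pop(old, [])
    if not moved:
        return updated  # `old` held nothing, so there is nothing to move
    changelog = updated.setdefault("changelog", [])
    new_ranges = updated["by_node"].setdefault(new, [])
    for shard_range in moved:
        holders = [n for n in updated["by_range"][shard_range] if n != old]
        # Changelog entries follow the [action, range, node] shape (assumed).
        changelog.append(["remove", shard_range, old])
        if new not in holders:
            holders.append(new)
            changelog.append(["add", shard_range, new])
        if shard_range not in new_ranges:
            new_ranges.append(shard_range)
        updated["by_range"][shard_range] = holders
    return updated


ALL = ["couchdb@node1", "couchdb@node2", "couchdb@node3"]
SAMPLE_MAP = {
    "_id": "your_database",
    "changelog": [],
    "by_node": {node: ["00000000-7fffffff", "80000000-ffffffff"] for node in ALL},
    "by_range": {
        "00000000-7fffffff": list(ALL),
        "80000000-ffffffff": list(ALL),
    },
}

if __name__ == "__main__":
    new_map = replace_node(SAMPLE_MAP, "couchdb@node2", "couchdb@node4")
    print(sorted(new_map["by_node"]))
```

The function works on a copy, so you can diff the old and new maps before writing anything back.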

5. Explicitly Rebuild (If Automatic Replication Stalls)

Sometimes, especially with very large datasets or complex cluster states, automatic replication might not kick in or might stall. You can force a rebuild.

  • Diagnosis:

    a. Identify Affected Databases: As in step 2, find the databases whose shard ranges are short a live copy.

    b. Trigger Synchronization: On any remaining node (e.g., node1), ask CouchDB to restart internal shard synchronization for the affected database:

      curl -X POST http://localhost:5984/your_database/_sync_shards

    Replace your_database with the actual database name. You might need to do this for all affected databases.

    c. Fall Back to the Replicator: CouchDB’s replication API is powerful, and you can create a replication document in the _replicator database to copy a database to another CouchDB instance. This helps when the target is a separate instance rather than a cluster member. If node4 has already joined the cluster, node1:5984 and node4:5984 serve the same clustered database, so this job copies nothing new.

      curl -X POST http://localhost:5984/_replicator \
        -H "Content-Type: application/json" \
        -d '{
          "source": "http://node1:5984/your_database",
          "target": "http://node4:5984/your_database",
          "create_target": true,
          "continuous": false
        }'

  • Why it works: _sync_shards restarts the cluster’s own shard synchronization for one database instead of waiting for it to be scheduled. The replicator route bypasses cluster membership entirely and copies documents over HTTP from the source database to the target; create_target: true ensures CouchDB creates the target database if it doesn’t exist.
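When several databases need the same treatment, it can help to generate the replication documents rather than type them. The sketch below only builds the JSON bodies using the fields shown in this step; the base URLs and database names are placeholders, and posting each body to /_replicator is left to your HTTP client.

```python
import json
from typing import Any, Dict, Iterable, List
from urllib.parse import quote


def replication_doc(
    source_base: str, target_base: str, db_name: str, continuous: bool = False
) -> Dict[str, Any]:
    """Build a one-shot replication document for a single database."""
    encoded = quote(db_name, safe="")  # database names may contain "/" or "+"
    return {
        "source": f"{source_base.rstrip('/')}/{encoded}",
        "target": f"{target_base.rstrip('/')}/{encoded}",
        "create_target": True,
        "continuous": continuous,
    }


def replication_docs(
    source_base: str, target_base: str, db_names: Iterable[str]
) -> List[Dict[str, Any]]:
    """Build one replication document per affected database."""
    return [replication_doc(source_base, target_base, name) for name in db_names]


if __name__ == "__main__":
    docs = replication_docs(
        "http://node1:5984/", "http://node4:5984", ["your_database"]
    )
    for doc in docs:
        print(json.dumps(doc, indent=2))  # POST each body to /_replicator
```

Keeping continuous set to false makes each job a one-off copy that finishes once the target has caught up.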

6. Address Potential Data Loss (If Replication Factor Was Too Low)

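If your n value was set to less than 3 (e.g., n=2) and a node fails, you have less margin, but you have not necessarily lost data. The replication factor n dictates how many copies of each shard are kept. If n=2 and a node goes down, the shards it held are down to one copy. Data is lost only when every node that held a copy of a shard range is gone, for example with n=1, or with n=2 when both holders fail.

  • Diagnosis: Compare each database’s shard map with the nodes you still have. If a shard range lists only nodes that are permanently gone, the documents in that range are lost. There’s no CouchDB command to recover lost data; you’d need to restore from backups.

  • Why it works: CouchDB can only rebuild a shard copy from another surviving copy. A clustered CouchDB also leans toward availability rather than refusing work: when too few copies are reachable to form a quorum, a write is typically answered with 202 Accepted instead of 201 Created, so a rise in 202 responses is a sign that the cluster is running short of copies.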
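To see when a failure actually costs data, you can check which shard ranges have no surviving holder at all. This sketch assumes a by_range mapping like the one in a shard map document; the layout below (n=2 spread over three nodes) is invented for illustration.

```python
from typing import Dict, Iterable, List


def lost_ranges(by_range: Dict[str, List[str]], lost_nodes: Iterable[str]) -> List[str]:
    """Return shard ranges whose every holder is permanently gone."""
    lost = set(lost_nodes)
    return sorted(
        shard_range
        for shard_range, holders in by_range.items()
        if all(node in lost for node in holders)
    )


# Hypothetical n=2 layout across three nodes.
SAMPLE_BY_RANGE = {
    "00000000-7fffffff": ["couchdb@node1", "couchdb@node2"],
    "80000000-ffffffff": ["couchdb@node2", "couchdb@node3"],
}

if __name__ == "__main__":
    print(lost_ranges(SAMPLE_BY_RANGE, ["couchdb@node2"]))
    print(lost_ranges(SAMPLE_BY_RANGE, ["couchdb@node1", "couchdb@node2"]))
```

With this layout, losing node2 alone leaves every range with one copy, while losing node1 and node2 together wipes out the first range.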

The next error you’ll likely encounter after a successful recovery is related to performance tuning or a different component of your application interacting with CouchDB, perhaps a slow query or a connection pool exhaustion.
