A CouchDB cluster can keep serving requests even if one node goes offline, but it’s not magic; it’s a deliberate trade-off that requires understanding how data is distributed and replicated.
Say you have a three-node CouchDB cluster (node1, node2, node3) and node2 suddenly disappears from the network. Each database has a replica count, `n`, which defaults to 3: CouchDB splits the database into shards and keeps `n` copies of each shard on different nodes. With three nodes and `n=3`, every node holds a copy of every shard, so losing node2 leaves each shard with two copies. That still satisfies the default read and write quorum of two, which is why the cluster keeps serving requests, but you have lost one level of redundancy until node2 returns or is replaced.
Here’s how you’d diagnose and recover:
1. Identify the Missing Node and Its State
First, you need to confirm which node is actually gone and if CouchDB knows it’s gone.
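- Diagnosis: On any remaining node (e.g., node1), run `curl -X GET http://localhost:5984/_membership` (add admin credentials if your cluster requires them). The response contains two lists: `cluster_nodes`, the nodes configured as cluster members, and `all_nodes`, the nodes this node is currently connected to. If node2 appears in `cluster_nodes` but not in `all_nodes`, CouchDB has detected its absence.
- Why it works: The `_membership` endpoint reports CouchDB’s internal view of who belongs to the cluster and which of those members are reachable right now.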
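The comparison between the two lists can be scripted. The following is a minimal sketch, not an official tool: it assumes `jq` is installed, that admin credentials are in a hypothetical `COUCH_AUTH` variable (`user:password`), and that the function names are free to choose.

```shell
#!/bin/sh
# Assumptions: jq is installed; COUCH_AUTH holds "user:password" (hypothetical variable).

# Print every name in the first list that is absent from the second.
# Both arguments are space-separated node names.
unreachable_nodes() {
  for node in $1; do
    case " $2 " in
      *" $node "*) ;;                 # configured and connected
      *) printf '%s\n' "$node" ;;     # configured but not connected
    esac
  done
}

# Fetch /_membership once and report configured members that are not connected.
check_membership() {
  base=${1:-http://localhost:5984}
  body=$(curl -fsS -u "$COUCH_AUTH" "$base/_membership") || return 1
  configured=$(printf '%s' "$body" | jq -r '.cluster_nodes | join(" ")')
  connected=$(printf '%s' "$body" | jq -r '.all_nodes | join(" ")')
  unreachable_nodes "$configured" "$connected"
}
```

Running `check_membership` on node1 should print one line per member that is configured but unreachable, and nothing when the cluster is whole.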
2. Check for Unreplicated or Under-Replicated Documents
The core problem is that some documents might now exist on only one or two nodes instead of the desired three.
- Diagnosis: CouchDB does not expose per-document replica counts; it tracks which nodes hold a copy of each shard range. To see that mapping for a database, query its `_shards` endpoint on a remaining node: `curl -X GET http://localhost:5984/your_database/_shards`. Every range that lists node2 is currently one copy short. To see whether a node still has internal replication work queued, check the `internal_replication_jobs` value returned by `curl -X GET http://localhost:5984/_node/_local/_system`. The `_users` database is sharded like any other, so check it as well; if too many of its copies were unreachable, you would see authentication failures.
- Why it works: The shard map records where every copy of every shard range is supposed to live. Comparing it with the nodes that are actually reachable shows which ranges are below the configured replica count.
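To turn the shard map into a list of ranges that are short on live copies, something like the sketch below can help. It assumes `jq` is installed and that credentials are in a hypothetical `COUCH_AUTH` variable; the function names are illustrative, and the list of connected nodes is expected to come from the `all_nodes` field of `_membership`.

```shell
#!/bin/sh
# Assumptions: jq is installed; COUCH_AUTH holds "user:password" (hypothetical variable).

# stdin: one "range node" pair per line.
# $1: expected copies per range. $2: space-separated names of connected nodes.
# Prints "range live_copies" for every range with fewer live copies than expected.
under_replicated() {
  awk -v want="$1" -v live=" $2 " '
    { copies[$1] += 0 }                          # register the range
    index(live, " " $2 " ") { copies[$1]++ }     # count copies on connected nodes
    END { for (r in copies) if (copies[r] < want) print r, copies[r] }
  ' | sort
}

# Flatten GET /{db}/_shards into "range node" lines.
# $1: database name. $2: base URL (optional).
shard_pairs() {
  base=${2:-http://localhost:5984}
  curl -fsS -u "$COUCH_AUTH" "$base/$1/_shards" |
    jq -r '.shards | to_entries[] | .key as $r | .value[] | "\($r) \(.)"'
}
```

For example, `shard_pairs your_database | under_replicated 3 "couchdb@node1 couchdb@node3"` would list each range of `your_database` that has fewer than three copies on the two surviving nodes.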
3. Re-add the Node (If Recoverable)
If node2 is expected to come back online (e.g., a temporary network glitch, or you’re bringing the physical/virtual machine back up), CouchDB will automatically try to catch up.
- Diagnosis: Once node2 is back and reachable by the other nodes, it reappears in `all_nodes` in the `_membership` output and CouchDB’s internal replication starts automatically. You can watch the `internal_replication_jobs` value in `/_node/_local/_system` on node1 or node3; it should fall back to zero as node2 catches up.
- Why it works: CouchDB’s internal replication mechanism detects that a node has rejoined and copies over the updates its shards missed while it was offline.
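Waiting for the catch-up to finish can be wrapped in a small polling loop. This is a sketch under assumptions: `jq` is installed, credentials are in a hypothetical `COUCH_AUTH` variable, and the backlog is read from the `internal_replication_jobs` field of `/_node/_local/_system` as reported by CouchDB 3.x.

```shell
#!/bin/sh
# Assumptions: jq is installed; COUCH_AUTH holds "user:password" (hypothetical variable).

# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
  tries=$1; delay=$2; shift 2
  while [ "$tries" -gt 0 ]; do
    "$@" && return 0
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] && sleep "$delay"
  done
  return 1
}

# True when the node at $1 reports no pending internal replication jobs.
backlog_is_empty() {
  base=${1:-http://localhost:5984}
  pending=$(curl -fsS -u "$COUCH_AUTH" "$base/_node/_local/_system" |
    jq -r '.internal_replication_jobs') || return 1
  [ "$pending" = "0" ]
}
```

A call such as `retry_until 60 5 backlog_is_empty http://localhost:5984` polls for up to five minutes and exits non-zero if the backlog never drains.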
4. Rebuild the Cluster (If Node is Permanently Lost)
If node2 is gone for good (e.g., hardware failure), you need to add a new node and assign node2’s shard copies to it. CouchDB does not redistribute existing shards on its own when a node joins.
- Diagnosis:
  a. Provision a New Node: Set up a new server with CouchDB installed. Let’s call it node4. Give it its own node name (for example `couchdb@node4`, set in `vm.args`) and the same Erlang cookie and admin credentials as the rest of the cluster.
  b. Join the Cluster: Membership is managed through the HTTP API, not through a `remote_node` setting in `local.ini`. On an existing node, register the new one: `curl -X PUT http://localhost:5984/_node/_local/_nodes/couchdb@node4 -d '{}'`. The `[cluster]` section of `local.ini` only sets the defaults for databases created afterwards:

  ```ini
  [cluster]
  n = 3
  ```

  Keep local settings in `local.ini`; `default.ini` is overwritten on upgrade.
  c. Verify Membership: On any existing node (e.g., node1), run `curl -X GET http://localhost:5984/_membership` again. You should now see node4 in both `cluster_nodes` and `all_nodes`.
  d. Reassign Shards: For each database, edit its shard map document at `/_node/_local/_dbs/your_database` so that every range previously assigned to node2 lists node4 instead, then delete node2’s document from `/_node/_local/_nodes`. Internal replication fills the new copies; copying the shard files over first is faster for large databases. Watch `internal_replication_jobs` in `/_node/_local/_system` until it returns to zero.
- Why it works: Setting `n=3` only tells CouchDB how many copies new databases should get. For existing databases the shard map is the source of truth: once it names node4 as a holder of a range, node1 and node3 replicate that range’s contents to it, restoring three copies of every shard.
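The join and the shard map edit can be sketched as below. Treat this as an illustration to try on a test cluster first, not a vetted procedure: it assumes CouchDB 3.x paths (`/_node/_local/_nodes` and `/_node/_local/_dbs`), `jq` being installed, credentials in a hypothetical `COUCH_AUTH` variable, and illustrative node names such as `couchdb@node2` and `couchdb@node4`.

```shell
#!/bin/sh
# Assumptions: CouchDB 3.x, jq installed, COUCH_AUTH holds "user:password" (hypothetical variable).

# Register a new member. $1: Erlang node name, e.g. couchdb@node4. $2: base URL (optional).
join_node() {
  curl -fsS -u "$COUCH_AUTH" -X PUT \
    "${2:-http://localhost:5984}/_node/_local/_nodes/$1" -d '{}'
}

# stdin: a shard map document. Prints it with every copy held by $1 reassigned to $2,
# recording the change in the document's changelog.
reassign_shards() {
  jq --arg old "$1" --arg new "$2" '
    (.by_node[$old] // []) as $ranges
    | .changelog += [ $ranges[] | ["remove", ., $old], ["add", ., $new] ]
    | .by_node[$new] = ((.by_node[$new] // []) + $ranges | unique)
    | del(.by_node[$old])
    | .by_range |= map_values(map(if . == $old then $new else . end) | unique)
  '
}

# Fetch, rewrite, and store the shard map of one database.
# $1: database. $2: lost node. $3: replacement node. $4: base URL (optional).
move_copies() {
  url="${4:-http://localhost:5984}/_node/_local/_dbs/$1"
  curl -fsS -u "$COUCH_AUTH" "$url" | reassign_shards "$2" "$3" |
    curl -fsS -u "$COUCH_AUTH" -X PUT "$url" \
      -H 'Content-Type: application/json' --data-binary @-
}
```

A typical sequence would be `join_node couchdb@node4` followed by `move_copies your_database couchdb@node2 couchdb@node4` for each database. The fetched document carries its `_rev`, so the `PUT` fails with a conflict rather than overwriting a concurrent change.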
5. Explicitly Rebuild (If Automatic Replication Stalls)
Sometimes, especially with very large datasets or complex cluster states, automatic replication might not kick in or might stall. You can force a rebuild.
- Diagnosis:
  a. Identify Affected Databases: Query each database’s `_shards` endpoint (`curl -X GET http://localhost:5984/your_database/_shards`) to find the ranges assigned to the new node.
  b. Trigger Synchronization: On any remaining node (e.g., node1), ask CouchDB to resynchronize all copies of the database’s shards: `curl -X POST http://localhost:5984/your_database/_sync_shards`. Repeat for every affected database.
  c. Fall Back to the Replicator: If node4 is running as a separate CouchDB instance that has not joined the cluster, you can copy a database to it with a one-off job in the `_replicator` database:

  ```bash
  curl -X POST http://localhost:5984/_replicator \
    -H "Content-Type: application/json" \
    -d '{
      "source": "http://node1:5984/your_database",
      "target": "http://node4:5984/your_database",
      "create_target": true,
      "continuous": false
    }'
  ```

  Replace `your_database` with the actual database name, and repeat for each database you need to copy. Between two members of the same cluster this job has nothing to do, because port 5984 on either node serves the same clustered database.
- Why it works: `_sync_shards` tells CouchDB to run internal replication for every shard copy of the database right away instead of waiting for the next trigger. The replicator job works at a different level: it copies documents from one database to another over HTTP, and `create_target: true` makes CouchDB create the database on node4 if it doesn’t exist.
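Forcing a shard synchronization across every database is easy to loop. The sketch below assumes `jq` is installed and credentials are in a hypothetical `COUCH_AUTH` variable; it uses the `/{db}/_sync_shards` endpoint and escapes the slash and plus characters, which are legal in database names and need encoding in a URL path.

```shell
#!/bin/sh
# Assumptions: jq is installed; COUCH_AUTH holds "user:password" (hypothetical variable).

# Percent-encode the slash and plus characters in a database name.
encode_db() {
  printf '%s' "$1" | sed -e 's|/|%2F|g' -e 's|+|%2B|g'
}

# Request a shard synchronization for every database in the cluster.
# $1: base URL (optional).
sync_all_shards() {
  base=${1:-http://localhost:5984}
  curl -fsS -u "$COUCH_AUTH" "$base/_all_dbs" | jq -r '.[]' |
  while IFS= read -r db; do
    curl -fsS -u "$COUCH_AUTH" -X POST "$base/$(encode_db "$db")/_sync_shards" >/dev/null \
      && printf 'sync requested: %s\n' "$db"
  done
}
```

The request only schedules the work, so pair it with a check of the internal replication backlog if you need to know when it has finished.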
6. Address Potential Data Loss (If Replication Factor Was Too Low)
If your `n` value was lower than 3 (e.g., `n=2`), a single node failure still does not lose data, but it leaves the affected shards with only one copy and no redundancy. Data is lost only when every node holding a copy of a shard range is gone.
- Diagnosis: If a database’s `_shards` output shows a range whose listed nodes are all permanently lost, the documents in that range are gone. There’s no CouchDB command to recover them; you’d need to restore from backups. If at least one copy survives, replace the node as in step 4 to rebuild redundancy before anything else fails.
- Why it works: CouchDB favors availability over strict consistency. When fewer copies than the write quorum are reachable, it still accepts the write but answers `202 Accepted` instead of `201 Created`, signalling that the update has not yet reached a quorum of copies. Reads are served from whatever copies remain.
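One quick way to see whether writes are still reaching a quorum is to look at the HTTP status of a test write. The sketch below is illustrative: credentials are expected in a hypothetical `COUCH_AUTH` variable, and the database name and the `quorum-probe-` document id are placeholders.

```shell
#!/bin/sh
# Assumption: COUCH_AUTH holds "user:password" (hypothetical variable).

# Map the HTTP status of a document write to what it says about the write quorum.
describe_write() {
  case "$1" in
    201) echo "created: write quorum met" ;;
    202) echo "accepted: fewer copies than the write quorum confirmed" ;;
    *)   echo "failed: HTTP $1" ;;
  esac
}

# Write one probe document and report the result.
# $1: base URL (optional). $2: database name (placeholder default).
probe_write() {
  base=${1:-http://localhost:5984}; db=${2:-your_database}
  code=$(curl -sS -u "$COUCH_AUTH" -o /dev/null -w '%{http_code}' \
    -X PUT "$base/$db/quorum-probe-$(date +%s)" \
    -H 'Content-Type: application/json' -d '{"probe": true}') || return 1
  describe_write "$code"
}
```

With `n=3` and one node down, the probe should still report `201`, because two of three copies form a quorum. With `n=2` and one node down, expect `202`. Remember to delete the probe documents afterwards.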
After a successful recovery, the next problems you’re likely to hit are about performance rather than availability: a slow query, for example, or connection pool exhaustion in the application talking to CouchDB.