The CockroachDB node you’re looking at has stopped participating in cluster-wide operations because its peers can no longer detect its presence, leading to a cascade of liveness and heartbeat failures.
Common Causes and Fixes
- Network Partition/Firewall Blocking: This is the most frequent culprit. A firewall rule or a transient network issue has blocked the port CockroachDB needs between the failing node and its peers. By default a single TCP port, 26257, carries both SQL client traffic and inter-node (RPC and gossip) traffic; the DB Console listens separately on 8080.
  - Diagnosis: From a healthy node, `ping` the IP address of the failing node. Then, from the failing node, `ping` a healthy node. If pings fail, or if you can’t establish a TCP connection to port 26257 on the failing node from a healthy node (using `nc -vz <failing-node-ip> 26257`), you have a network issue. Because some networks drop ICMP, treat the TCP check as the more reliable of the two.
  - Fix: Ensure that TCP port 26257 is open bi-directionally between all nodes in the cluster. CockroachDB does not use UDP. If you have split SQL and inter-node traffic onto different ports with `--sql-addr`, open that port as well. For example, on `firewalld`, you might run:

    ```shell
    sudo firewall-cmd --zone=public --add-port=26257/tcp --permanent
    sudo firewall-cmd --reload
    ```

    This allows CockroachDB’s gossip protocol and client connections to traverse the network.
  - Why it works: CockroachDB nodes rely on a gossip protocol to maintain a consistent view of the cluster’s topology and health. If this gossip traffic is blocked, nodes will eventually time out and consider the unreachable node dead.
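The TCP reachability check above can be scripted so that it runs against every peer at once. The following is a minimal sketch, not an official tool: the function names `check_port` and `report_peers` are made up for illustration, and it assumes `bash` and the coreutils `timeout` command are installed.

```shell
# check_port HOST PORT
# Succeeds if a TCP connection to HOST:PORT opens within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so no extra tools are required.
check_port() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# report_peers PORT HOST...
# Prints one reachability line per peer (hypothetical helper).
report_peers() {
  local port peer
  port="$1"
  shift
  for peer in "$@"; do
    if check_port "$peer" "$port"; then
      echo "$peer:$port reachable"
    else
      echo "$peer:$port UNREACHABLE"
    fi
  done
}
```

Run it from the failing node against the healthy peers, and again from a healthy node against the failing one, for example `report_peers 26257 10.0.1.5 10.0.1.6` (placeholder addresses). A peer that is reachable in one direction only usually points to an asymmetric firewall rule.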
- Node Resource Exhaustion (CPU/Memory/Disk I/O): The failing node is so overloaded that it cannot respond to heartbeats or process its own internal tasks, including the gossip protocol.
  - Diagnosis: Log into the failing node and check system resource utilization. Use `top` or `htop` to look for high CPU/memory usage, and `iostat` or `iotop` to look for disk I/O bottlenecks. Also check the CockroachDB logs (`cockroach.log`) on the failing node for messages indicating slow operations or timeouts, such as `node is too slow to respond` or `heartbeat failed` (exact wording varies by version).
  - Fix: Identify the resource-hungry process (often `cockroach` itself, or sometimes other system processes) and either scale up the node’s resources (CPU, RAM) or optimize the workload. If CockroachDB is the culprit, you might need to adjust the `--max-sql-memory` or `--cache` flags, or investigate specific slow queries. Both are flags passed to `cockroach start`, not cluster settings, so changing them requires restarting the node.

    ```shell
    # Example: restart the node with the SQL memory budget raised to 8 GiB
    cockroach start --certs-dir=/path/to/certs \
      --max-sql-memory=8GiB \
      --join=<any-other-node-ip>:26257
    ```

    This change allows the SQL layer to use more memory, which helps queries that were failing or stalling against the memory budget. Keep the combined `--cache` and `--max-sql-memory` values well below physical RAM; otherwise the process risks swapping or being killed by the kernel’s OOM killer.
  - Why it works: By providing sufficient resources, you ensure the CockroachDB process can execute its essential background tasks, like sending heartbeats and gossiping, in a timely manner, preventing it from appearing "dead" to other nodes.
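As a quick first pass before opening `top`, the load average can be compared against the core count. This is a rough sketch only: the helper name `is_overloaded` and the default threshold of 1.5 runnable tasks per core are illustrative assumptions, not CockroachDB guidance.

```shell
# is_overloaded LOAD CORES [THRESHOLD]
# Succeeds when LOAD / CORES exceeds THRESHOLD (default 1.5, an arbitrary
# illustrative value). awk is used because the shell lacks float arithmetic.
is_overloaded() {
  awk -v load="$1" -v cores="$2" -v limit="${3:-1.5}" \
    'BEGIN { exit !(load / cores > limit) }'
}
```

On Linux, something like `is_overloaded "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" && echo "CPU saturated"` feeds it the one-minute load average. A high load with low CPU usage in `top` often indicates processes blocked on disk I/O, which is where `iostat` becomes useful.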
- Incorrect `listen-addr` or `advertise-addr` Configuration: The node is configured to listen on an IP address that is not reachable by other nodes, or it’s advertising an IP that it shouldn’t be.
  - Diagnosis: Examine the `cockroach.log` file on the failing node and on healthy nodes. Look for messages related to connection attempts and which IP addresses are being used. These addresses are per-node start flags (`--listen-addr` and `--advertise-addr`), not cluster settings, so `SHOW CLUSTER SETTING` cannot display them. To see the address each node currently advertises, run the following against a healthy node:

    ```shell
    cockroach node status --certs-dir=/path/to/certs --host=<healthy-node-ip>:26257
    ```

    Also check the `--listen-addr` and `--advertise-addr` values in the command-line arguments (or service definition) used to start the failing node.
  - Fix: Ensure that `listen-addr` is set to an IP address that the node can bind to and that is reachable by other nodes if it’s intended to be public. The `advertise-addr` should be the IP address that other nodes will use to connect to this node. In most cloud or containerized environments, this should be the node’s primary private IP. For example, if a node’s internal IP is `10.0.1.5` and it’s running on port 26257:

    ```shell
    # Start command example
    cockroach start --certs-dir=/path/to/certs \
      --listen-addr=10.0.1.5:26257 \
      --advertise-addr=10.0.1.5:26257 \
      --join=<any-other-node-ip>:26257
    ```

    This ensures that the node announces its presence using an IP address that its peers can actually reach.
  - Why it works: A correct `advertise-addr` ensures that when a node registers itself with the cluster’s gossip network, it provides an IP address that other nodes can successfully use to initiate connections back to it for heartbeats and data exchange.
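A common form of this misconfiguration is advertising a wildcard or loopback address, which peers on other machines cannot dial. The sketch below flags those cases. The function name `check_advertise_addr` is hypothetical, it expects input in `host:port` form, and it only inspects the string; it does not prove the address is actually routable.

```shell
# check_advertise_addr HOST:PORT
# Prints a verdict and returns non-zero when the host part is a wildcard or
# loopback address that remote peers could never connect to.
check_advertise_addr() {
  local host
  host="${1%:*}"   # strip the trailing ":port"
  case "$host" in
    ""|0.0.0.0|"[::]")
      echo "unroutable: wildcard address"; return 1 ;;
    localhost|127.*|"[::1]")
      echo "unroutable: loopback address"; return 1 ;;
    *)
      echo "ok: $host"; return 0 ;;
  esac
}
```

Feed it the value you pass to `--advertise-addr`, or the address reported for the node by a healthy peer. A loopback address is fine for a single-machine test cluster, so treat the warning as relevant only to multi-host deployments.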
- DNS Resolution Issues: If your cluster uses hostnames instead of IP addresses for `listen-addr` and `advertise-addr`, a DNS problem can prevent nodes from finding each other.
  - Diagnosis: From the failing node, try to `ping` or `nslookup` the hostnames of other healthy nodes. From a healthy node, do the same for the failing node. Check `/etc/resolv.conf` on the failing node to ensure it’s pointing to a functional DNS server.
  - Fix: Resolve the DNS issues. This might involve updating DNS records, ensuring the DNS server is reachable, or correcting the `/etc/resolv.conf` file. If using hostnames, ensure they resolve to the correct, reachable IP addresses on all nodes.

    ```shell
    # Example: Correcting DNS server in resolv.conf (overwrites the file)
    echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
    ```

    This change ensures that DNS queries from the node are directed to a reliable public DNS server, allowing it to resolve hostnames correctly. A public resolver only helps if the node hostnames are published in public DNS; for private hostnames, point `nameserver` at your internal DNS server instead.
  - Why it works: CockroachDB uses these addresses for peer discovery. If hostnames can’t be resolved to IP addresses, the nodes effectively cannot find each other, breaking the communication chain.
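To check every peer hostname in one pass, the lookup can be wrapped in a small loop. This is a sketch under assumptions: `resolve_all` is a made-up helper name, and it relies on `getent`, which goes through the system resolver (including `/etc/hosts`) and may therefore differ slightly from what `nslookup` reports.

```shell
# resolve_all NAME...
# Prints "name -> ip" for each hostname that resolves, "name: FAILED" otherwise.
resolve_all() {
  local name ip
  for name in "$@"; do
    # Take the first address returned by the system resolver, if any.
    ip="$(getent hosts "$name" 2>/dev/null | awk 'NR == 1 { print $1 }')"
    if [ -n "$ip" ]; then
      echo "$name -> $ip"
    else
      echo "$name: FAILED"
    fi
  done
}
```

Run it on each node with the full list of hostnames used in `--join` and `--advertise-addr`, then compare the output across nodes. A name that resolves to different IPs on different nodes is as much a problem as one that fails outright.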
- Clock Skew Between Nodes: Significant time differences between the failing node and its peers can cause heartbeats to appear stale, leading to their premature expiry. CockroachDB’s default maximum clock offset is 500 ms (`--max-offset`), and a node that detects its clock is more than 80% of that value away from at least half of the other nodes shuts itself down.
  - Diagnosis: On each node, check the system time with sub-second resolution (GNU `date` shown):

    ```shell
    date +%s.%N
    ```

    Compare the output across nodes. If there’s a difference of more than a few hundred milliseconds, you have clock skew.
  - Fix: Synchronize the clocks of all nodes using NTP. Ensure an NTP client is running and configured correctly on all machines.

    ```shell
    # Example: Installing and starting ntpd on Ubuntu/Debian
    sudo apt-get update
    sudo apt-get install ntp
    sudo systemctl enable ntp
    sudo systemctl start ntp
    ```

    This synchronizes the node’s clock with a reliable time source, ensuring consistent timestamps for all operations and communication.
  - Why it works: Heartbeats and other time-sensitive operations rely on consistent timestamps. When clocks are out of sync, a valid heartbeat from the failing node might be interpreted as expired by its peers, or vice-versa, leading to perceived liveness failures.
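Comparing two timestamps by eye is error-prone, so the arithmetic can be scripted. The sketch below takes two epoch timestamps in milliseconds and compares their difference against 80% of the maximum offset. The function name `clock_offset_status` is hypothetical, and the 500 ms default assumes the node runs with CockroachDB's default `--max-offset`.

```shell
# clock_offset_status LOCAL_MS REMOTE_MS [MAX_OFFSET_MS]
# Compares two epoch-millisecond timestamps and reports whether their absolute
# difference exceeds 80% of MAX_OFFSET_MS (default 500).
clock_offset_status() {
  local diff limit max
  max="${3:-500}"
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))
  fi
  limit=$(( max * 80 / 100 ))
  if [ "$diff" -gt "$limit" ]; then
    echo "DANGER ${diff}ms > ${limit}ms"
    return 1
  fi
  echo "ok ${diff}ms"
}
```

With GNU `date`, a millisecond timestamp is available as `date +%s%3N`. If the remote value is collected over SSH, the round-trip latency is included in the difference, so this gives an upper bound rather than a precise measurement; tools such as `chronyc tracking` or `ntpq -p` report the offset more accurately.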
- Corrupted Node Data or CockroachDB Process Crash: The CockroachDB process on the node might have crashed due to an internal error or data corruption, and is failing to restart, or it’s restarting in a degraded state.
  - Diagnosis: Examine `cockroach.log` on the failing node for any panics, segmentation faults, or repeated error messages indicating data corruption or unrecoverable states. Check system logs (`syslog`, `journalctl`) for crash reports.
  - Fix: If the process is crashing, try restarting it. If it’s consistently crashing or reporting corruption, you might need to stop the node, back up its `cockroach-data` directory, and then start a new node, potentially rejoining it to the cluster or restoring from a backup. Note that the store directory is set with `--store`, and that a running node is stopped by signaling the process or through your service manager.

    ```shell
    # Stop the node gracefully by sending SIGTERM to the cockroach process
    kill -TERM "$(pgrep -x cockroach)"

    # Optionally, move the data to a backup location
    mv /path/to/data /path/to/data_backup_$(date +%Y%m%d_%H%M%S)

    # Start a new node (potentially with a fresh store if restoring)
    cockroach start --certs-dir=/path/to/certs --store=/path/to/data --join=<any-other-node-ip>:26257
    ```

    This approach isolates the problem by stopping the potentially corrupted process and allows for a clean restart or a controlled recovery, ensuring the node rejoins the cluster cleanly. A node started on an empty store joins with a new node ID, so decommission the old ID once the cluster has re-replicated its data.
  - Why it works: A crashed or corrupted process cannot participate in cluster operations. Restarting or replacing the data allows a healthy instance of the CockroachDB process to run and communicate.
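When the log is large, counting severe entries first shows whether a crash investigation is warranted. The sketch below assumes the common CockroachDB log line layout, in which each entry begins with a severity letter (`I`, `W`, `E`, `F`) followed by a six-digit date. Verify that against your own log files, since the format is configurable. `count_severe` is a hypothetical helper name.

```shell
# count_severe LOGFILE
# Counts lines that look like ERROR (E) or FATAL (F) log entries, plus any
# Go runtime "panic:" lines. Prints the count (0 when nothing matches).
count_severe() {
  grep -cE '^[EF][0-9]{6} |panic:' "$1"
}
```

For example, `count_severe /path/to/data/logs/cockroach.log` (placeholder path). To read the matching entries rather than count them, run the same pattern through `grep -E ... | tail -n 20`.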
After resolving these, you’ll likely encounter SQL connection refused errors if you try to connect to the node on its old, no-longer-advertised IP address.