The CockroachDB node you’re looking at has stopped participating in cluster-wide operations because its peers can no longer detect its presence, leading to a cascade of liveness and heartbeat failures.

Common Causes and Fixes

  1. Network Partition/Firewall Blocking: This is the most frequent culprit. A firewall rule or a transient network issue has blocked the port the nodes use to reach each other. By default, port 26257 carries both inter-node and SQL client traffic, and port 8080 serves the DB Console over HTTP.

    • Diagnosis: From a healthy node, ping the IP address of the failing node. Then, from the failing node, ping a healthy node. ICMP can be blocked even when TCP works, so also test the port directly: if you can’t establish a TCP connection to port 26257 on the failing node from a healthy node (using nc -vz <failing-node-ip> 26257), you have a network issue.
    • Fix: Ensure that TCP port 26257 is open bi-directionally between all nodes in the cluster. CockroachDB does not use UDP. Open 8080/tcp as well if you need the DB Console, and if your nodes were started with non-default ports in --listen-addr or --sql-addr, open those instead. For example, on firewalld, you might run:
      sudo firewall-cmd --zone=public --add-port=26257/tcp --permanent
      sudo firewall-cmd --zone=public --add-port=8080/tcp --permanent
      sudo firewall-cmd --reload
      
      This allows CockroachDB’s gossip protocol and client connections to traverse the network.
    • Why it works: CockroachDB nodes rely on a gossip protocol to maintain a consistent view of the cluster’s topology and health. If this gossip traffic is blocked, nodes will eventually time out and consider the unreachable node dead.
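
The connectivity check from the diagnosis step can be repeated across every peer with a small loop. This is a minimal sketch, not an official tool: check_peers is a hypothetical helper name, the IP addresses in the usage line are placeholders, and it assumes an nc build that supports the -z and -w options.

```shell
#!/usr/bin/env bash
# check_peers is a hypothetical helper, not part of CockroachDB.
# Usage: check_peers "<space-separated ports>" <host>...
# Prints one OK/FAIL line per host:port pair and exits non-zero
# if any pair could not be reached over TCP.
check_peers() {
  local ports="$1"; shift
  local failures=0 host port
  for host in "$@"; do
    for port in $ports; do
      # -z: connect without sending data; -w 3: give up after 3 seconds
      if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
        echo "OK   $host:$port"
      else
        echo "FAIL $host:$port"
        failures=$((failures + 1))
      fi
    done
  done
  [ "$failures" -eq 0 ]
}

# Example (placeholder addresses):
#   check_peers "26257" 10.0.1.5 10.0.1.6 10.0.1.7
```

Run it from each node in turn, since a firewall rule can block traffic in one direction only.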
  2. Node Resource Exhaustion (CPU/Memory/Disk I/O): The failing node is so overloaded that it cannot respond to heartbeats or process its own internal tasks, including the gossip protocol.

    • Diagnosis: Log into the failing node and check system resource utilization. Use top or htop to spot high CPU or memory usage, and iostat or iotop to spot disk I/O bottlenecks. Also check the CockroachDB logs (cockroach.log) on the failing node for messages indicating slow operations or timeouts, such as slow heartbeat or failed node liveness heartbeat.
    • Fix: Identify the resource-hungry process (often cockroach itself, sometimes another system process) and either scale up the node’s resources (CPU, RAM) or optimize the workload. If CockroachDB is the culprit, you might need to adjust the --max-sql-memory or --cache flags, which are set when the node starts rather than through cluster settings, or investigate specific slow queries.
      # Example: restart the node with an 8 GiB SQL memory budget
      # (keep the node's other start flags unchanged)
      cockroach start --certs-dir=/path/to/certs \
        --max-sql-memory=8GiB \
        --join=<any-other-node-ip>:26257
      
      This change lets the SQL layer use more memory for query processing. Keep the combined --max-sql-memory and --cache budgets comfortably below the machine’s physical RAM; otherwise the larger allocation can itself cause swapping or out-of-memory kills.
    • Why it works: By providing sufficient resources, you ensure the CockroachDB process can execute its essential background tasks, like sending heartbeats and gossiping, in a timely manner, preventing it from appearing "dead" to other nodes.
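
As a quick first test for CPU saturation, you can compare the 1-minute load average with the number of cores. The sketch below assumes a Linux host (it reads /proc/loadavg in the usage line); load_exceeds_cores is a hypothetical helper name, and "load above core count" is only a rough rule of thumb, not a CockroachDB threshold.

```shell
#!/usr/bin/env bash
# load_exceeds_cores is a hypothetical helper, not part of CockroachDB.
# Usage: load_exceeds_cores <load-average> <core-count>
# Exits 0 when the load average is strictly greater than the core count.
load_exceeds_cores() {
  # awk handles the floating-point comparison that plain [ ] cannot
  awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load + 0 > cores + 0) }'
}

# Example on Linux:
#   if load_exceeds_cores "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"; then
#     echo "node is CPU-saturated"
#   fi
```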
  3. Incorrect listen-addr or advertise-addr Configuration: The node is configured to listen on an IP address that is not reachable by other nodes, or it’s advertising an IP that it shouldn’t be.

    • Diagnosis: Examine the cockroach.log file on the failing node and on healthy nodes. Look for messages related to connection attempts and the IP addresses being used. The listen and advertise addresses are per-node start flags, not cluster settings, so to see what each node is advertising, run this against a healthy node:
      cockroach node status --certs-dir=/path/to/certs --host=<healthy-node-ip>:26257
      
      The address column shows each node’s advertised address. Also check the --listen-addr and --advertise-addr command-line arguments used to start the failing node (for example, in its service unit file or in the output of ps).
    • Fix: Ensure that the listen-addr is set to an IP address that the node can bind to and that is reachable by other nodes if it’s intended to be public. The advertise-addr should be the IP address that other nodes will use to connect to this node. In most cloud or containerized environments, this should be the node’s primary private IP. For example, if a node’s internal IP is 10.0.1.5 and it’s running on port 26257:
      # Start command example
      cockroach start --certs-dir=/path/to/certs \
        --listen-addr=10.0.1.5:26257 \
        --advertise-addr=10.0.1.5:26257 \
        --join=<any-other-node-ip>:26257
      
      This ensures that the node announces its presence using an IP address that its peers can actually reach.
    • Why it works: Correct advertise-addr ensures that when a node registers itself with the cluster’s gossip network, it provides an IP address that other nodes can successfully use to initiate connections back to it for heartbeats and data exchange.
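
One simple sanity check is whether the advertised host is actually one of the machine’s own addresses. The following is a sketch under stated assumptions: advertised_host_is_local is a hypothetical helper name, it handles IPv4 addresses and hostnames in host:port form only, and the usage line relies on the Linux-specific hostname -I.

```shell
#!/usr/bin/env bash
# advertised_host_is_local is a hypothetical helper, not part of CockroachDB.
# Usage: advertised_host_is_local <host:port> <local-address>...
# Exits 0 when the host part of <host:port> equals one of the local addresses.
advertised_host_is_local() {
  local advertised="$1"; shift
  local host="${advertised%:*}"   # strip the trailing :port
  local candidate
  for candidate in "$@"; do
    if [ "$candidate" = "$host" ]; then
      return 0
    fi
  done
  return 1
}

# Example on Linux (hostname -I lists the machine's IP addresses):
#   advertised_host_is_local 10.0.1.5:26257 $(hostname -I) \
#     || echo "advertised address is not bound on this machine"
```

A mismatch is not always an error: behind NAT or a load balancer, the advertised address is legitimately different from any local interface, and what matters is that peers can reach it.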
  4. DNS Resolution Issues: If your cluster uses hostnames instead of IP addresses for listen-addr and advertise-addr, a DNS problem can prevent nodes from finding each other.

    • Diagnosis: From the failing node, try to ping or nslookup the hostnames of other healthy nodes. From a healthy node, do the same for the failing node. Check /etc/resolv.conf on the failing node to ensure it’s pointing to a functional DNS server.
    • Fix: Resolve the DNS issues. This might involve updating DNS records, ensuring the DNS server is reachable, or correcting the /etc/resolv.conf file. If using hostnames, ensure they resolve to the correct, reachable IP addresses on all nodes.
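      # Example: pointing resolv.conf at a public resolver (this overwrites the file)
      echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
      
      This change directs DNS queries from the node to a public DNS server. It only helps if your node hostnames are published in public DNS; for private hostnames, use the address of your internal DNS server instead. On systems where a service such as systemd-resolved or NetworkManager manages /etc/resolv.conf, manual edits are overwritten, so change the resolver through that service.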
    • Why it works: CockroachDB uses these addresses for peer discovery. If hostnames can’t be resolved to IP addresses, the nodes effectively cannot find each other, breaking the communication chain.
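
To check every peer hostname in one pass, you can loop over them with getent, which resolves names the same way most programs on the host do (including /etc/hosts). This is a minimal sketch: resolve_all is a hypothetical helper name, the hostnames in the usage line are placeholders, and it assumes a system that ships getent (most Linux distributions).

```shell
#!/usr/bin/env bash
# resolve_all is a hypothetical helper, not part of CockroachDB.
# Usage: resolve_all <hostname>...
# Prints the first address found for each name and exits non-zero
# if any name could not be resolved.
resolve_all() {
  local failures=0 name ip
  for name in "$@"; do
    # getent prints "<address> <name>"; keep the address of the first match
    ip="$(getent hosts "$name" 2>/dev/null | awk 'NR==1 {print $1}' || true)"
    if [ -n "$ip" ]; then
      echo "$name -> $ip"
    else
      echo "$name -> UNRESOLVED"
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
}

# Example (placeholder hostnames):
#   resolve_all crdb-1.internal crdb-2.internal crdb-3.internal
```

Run it on every node and compare the results, since a stale record can leave different nodes with different answers for the same name.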
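  5. Clock Skew Between Nodes: Significant time differences between the failing node and its peers can cause heartbeats to appear stale, leading to their premature expiry. CockroachDB tolerates much less than a second of skew: the default maximum clock offset is 500ms, and a node shuts itself down when its clock differs from at least half of the other nodes by more than 80% of that limit.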

    • Diagnosis: On each node, check the system time:
      date
      
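      Compare the output across nodes. Plain date prints whole seconds, so it only reveals gross drift: a visible difference of a second or more means you have serious clock skew, and offsets of a few hundred milliseconds can already be enough to take a node down.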
    • Fix: Synchronize the clocks of all nodes using NTP. Ensure an NTP client is running and configured correctly on all machines.
      # Example: Installing and starting ntpd on Ubuntu/Debian
      sudo apt-get update
      sudo apt-get install ntp
      sudo systemctl enable ntp
      sudo systemctl start ntp
      
      This synchronizes the node’s clock with a reliable time source, ensuring consistent timestamps for all operations and communication.
    • Why it works: Heartbeats and other time-sensitive operations rely on consistent timestamps. When clocks are out of sync, a valid heartbeat from the failing node might be interpreted as expired by its peers, or vice-versa, leading to perceived liveness failures.
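
For a finer comparison than eyeballing date output, you can subtract millisecond timestamps. The sketch below uses hypothetical helper names (skew_ms, skew_exceeds); the 400 in the usage line assumes the default 500ms maximum offset (80% of it), and date +%s%3N assumes GNU date. Fetching the remote time over ssh adds network latency to the result, so treat it as a rough upper bound and rely on your NTP client's own reporting for precise numbers.

```shell
#!/usr/bin/env bash
# skew_ms and skew_exceeds are hypothetical helpers, not part of CockroachDB.

# Usage: skew_ms <timestamp-a-ms> <timestamp-b-ms>
# Prints the absolute difference between two epoch-millisecond timestamps.
skew_ms() {
  local diff=$(( $1 - $2 ))
  echo "${diff#-}"   # drop a leading minus sign to get the absolute value
}

# Usage: skew_exceeds <timestamp-a-ms> <timestamp-b-ms> <limit-ms>
# Exits 0 when the absolute difference is greater than the limit.
skew_exceeds() {
  [ "$(skew_ms "$1" "$2")" -gt "$3" ]
}

# Example (GNU date; <peer-host> is a placeholder):
#   local_ms="$(date +%s%3N)"
#   remote_ms="$(ssh <peer-host> date +%s%3N)"
#   skew_exceeds "$local_ms" "$remote_ms" 400 && echo "clock skew too high"
```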
  6. Corrupted Node Data or CockroachDB Process Crash: The CockroachDB process on the node might have crashed due to an internal error or data corruption, and is failing to restart, or it’s restarting in a degraded state.

    • Diagnosis: Examine cockroach.log on the failing node for any panics, segmentation faults, or repeated error messages indicating data corruption or unrecoverable states. Check system logs (syslog, journalctl) for crash reports.
    • Fix: If the process is crashing, try restarting it. If it’s consistently crashing or reporting corruption, stop the node, back up its data directory, and start it again on an empty store so that it rejoins the cluster and re-replicates data from its peers, or restore from a backup. A node started on an empty store joins with a new node ID, so decommission the old ID afterwards with cockroach node decommission.
      # Stop the node by sending SIGTERM to the cockroach process
      # (use your service manager instead if one supervises the node)
      pkill -TERM -x cockroach
      # Optionally, move the data to a backup location
      mv /path/to/data /path/to/data_backup_$(date +%Y%m%d_%H%M%S)
      # Start the node again (it initializes a fresh store at the now-empty path)
      cockroach start --certs-dir=/path/to/certs --store=/path/to/data --join=<any-other-node-ip>:26257
      
      This approach isolates the problem by stopping the potentially corrupted process and allows for a clean restart or a controlled recovery, ensuring the node rejoins the cluster cleanly.
    • Why it works: A crashed or corrupted process cannot participate in cluster operations. Restarting or replacing the data allows a healthy instance of the CockroachDB process to run and communicate.
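
When scanning cockroach.log for crashes, it helps to count the error and fatal entries first. This sketch assumes the default text log format, in which each entry starts with a severity letter (I, W, E, or F) followed by a six-digit date; count_severe_entries is a hypothetical helper name, and the log path in the usage line is a placeholder.

```shell
#!/usr/bin/env bash
# count_severe_entries is a hypothetical helper, not part of CockroachDB.
# Usage: count_severe_entries <log-file>
# Prints the number of log entries whose severity is E (error) or F (fatal).
count_severe_entries() {
  # grep -c exits 1 when the count is 0, so "|| true" keeps the exit status clean
  grep -cE '^[EF][0-9]{6} ' "$1" || true
}

# Example (placeholder path):
#   count_severe_entries /path/to/data/logs/cockroach.log
```

A non-zero count tells you where to look, not what went wrong; read the matching entries (for instance with grep -E '^F[0-9]{6} ') to see the actual failure.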

After resolving these, you’ll likely encounter SQL connection refused errors if you try to connect to the node on its old, no-longer-advertised IP address.
