Cassandra’s gossip protocol is failing to establish a consistent view of the cluster state when a new node joins, meaning nodes can’t agree on who’s up, who’s down, or what data they hold.
This usually happens because a new node can’t establish a connection to any existing nodes to start the gossip exchange.
Cause 1: Incorrect listen_address or rpc_address in cassandra.yaml
The new node is advertising an IP address that other nodes can’t reach.
- Diagnosis: On the new node, check `/etc/cassandra/cassandra.yaml` and look for `listen_address` and `rpc_address`. Compare them with the IP addresses of existing nodes to confirm that the new node sits on a network those nodes can reach. You can also run `ping <listen_address_from_new_node>` from an existing node.
- Fix: Set `listen_address` and `rpc_address` to the new node's actual IP address, one that the other nodes can reach. If you use hostnames instead, make sure they resolve correctly from every node.

      listen_address: 192.168.1.100 # Replace with the new node's actual, reachable IP
      rpc_address: 192.168.1.100 # Replace with the new node's actual, reachable IP

- Why it works: Other nodes send gossip and all other inter-node traffic to the new node's `listen_address`, while clients connect to its `rpc_address`. If `listen_address` is wrong, gossip never reaches the node; if `rpc_address` is wrong, clients cannot connect even after the node has joined.
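As a quick sanity check on the configured values, the Python sketch below flags addresses that other nodes cannot use as a destination, such as loopback or wildcard addresses. The function name `address_problems` is illustrative and not part of any Cassandra tooling, and the check only inspects the value itself; it does not prove that the address is routable from the rest of the cluster.

```python
import ipaddress


def address_problems(value):
    """Return reasons why `value` is unsuitable as an address other nodes must reach."""
    try:
        ip = ipaddress.ip_address(value.strip())
    except ValueError:
        # Hostnames are allowed in cassandra.yaml, but they must resolve to the
        # same reachable IP from every node; that is not verified here.
        return [f"{value!r} is not a literal IP address; verify it resolves on all nodes"]
    problems = []
    if ip.is_loopback:
        problems.append("loopback address: only reachable from the node itself")
    if ip.is_unspecified:
        problems.append("wildcard address (0.0.0.0 or ::): not a routable destination")
    if ip.is_link_local:
        problems.append("link-local address: not routable beyond the local segment")
    return problems


if __name__ == "__main__":
    # Example values only; substitute the values from your own cassandra.yaml.
    for candidate in ("192.168.1.100", "127.0.0.1", "0.0.0.0"):
        print(candidate, address_problems(candidate) or "looks plausible")
```

An empty list only means the value is not obviously wrong; the `ping` test from an existing node is still the real proof of reachability.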
Cause 2: Firewall Blocking Gossip Ports
Network firewalls are preventing communication on the necessary Cassandra ports. Cassandra uses port 7000 (or 7001 for SSL) for inter-node communication (gossip) and port 9042 for client connections.
- Diagnosis: From the new node, run `telnet <existing_node_ip> 7000` or `nc -vz <existing_node_ip> 7000`. If the connection fails, the port is blocked or Cassandra is not listening on it.
- Fix: Open port 7000 (and 7001 if SSL is enabled) for TCP traffic between all nodes in the cluster.
  - iptables example:

        sudo iptables -A INPUT -p tcp --dport 7000 -j ACCEPT
        sudo iptables -A OUTPUT -p tcp --dport 7000 -j ACCEPT
        # Repeat for 7001 if SSL is used

- Why it works: Unblocking these ports allows the gossip handshake and subsequent messages to flow freely between nodes.
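When `telnet` or `nc` is not installed, the same reachability test can be sketched in Python with the standard `socket` module. The helper name `port_open` and the 3-second timeout are assumptions chosen for illustration.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `port_open("<existing_node_ip>", 7000)` from the new node, and the reverse from an existing node, shows whether the inter-node port is reachable in both directions. A `False` result does not distinguish a firewall drop from a Cassandra process that is simply not running.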
Cause 3: Incorrect seed_provider Configuration
The new node doesn’t know which existing nodes to contact to bootstrap its initial view of the cluster.
- Diagnosis: Check `/etc/cassandra/cassandra.yaml` on the new node. Ensure the `seed_provider` section lists at least one, and preferably a few, stable existing nodes in the cluster.
- Fix: Update the `seed_provider` list to include the IP addresses of reliable, existing seed nodes.

      seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
            - seeds: "192.168.1.1,192.168.1.2,192.168.1.3" # Replace with IPs of your stable seed nodes

- Why it works: Seed nodes are the entry points through which a new node learns the cluster topology and discovers the other nodes. If they are unreachable or misconfigured, the new node can't join.
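Because `seeds` is a single comma-separated string, a stray comma or typo is easy to miss. The sketch below splits the string the way a reader would and reports suspicious entries. The name `parse_seeds` is hypothetical, and the check is purely syntactic; it does not contact the seeds.

```python
import ipaddress


def parse_seeds(seeds_value):
    """Split a comma-separated `seeds` string; return (seeds, problems)."""
    entries = [s.strip() for s in seeds_value.split(",")]
    if not any(entries):
        return [], ["seed list is empty"]
    problems = []
    for entry in entries:
        if not entry:
            problems.append("empty entry (stray comma?)")
            continue
        try:
            ipaddress.ip_address(entry)
        except ValueError:
            # Hostnames are legal, but must resolve from the new node.
            problems.append(f"{entry!r} is not a literal IP; check that it resolves")
    seeds = [e for e in entries if e]
    for dup in sorted({s for s in seeds if seeds.count(s) > 1}):
        problems.append(f"{dup} is listed more than once")
    return seeds, problems


if __name__ == "__main__":
    seeds, problems = parse_seeds("192.168.1.1,192.168.1.2,,192.168.1.2")
    print(seeds)
    print(problems)
```

Each seed that passes this check can then be tested for reachability on port 7000 from the new node.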
Cause 4: Network Latency or Instability
High latency or intermittent packet loss between the new node and existing nodes makes the gossip protocol unreliable. Gossip relies on frequent, timely heartbeats.
- Diagnosis: Run `ping <existing_node_ip>` from the new node, and ping the new node from an existing node. Look for packet loss or high average latency (as a rough guide, round-trip times above about 50 ms are worth investigating). Use `mtr <existing_node_ip>` for more detailed path analysis.
- Fix: Improve network connectivity. This might involve:
  - Ensuring nodes are in the same datacenter or region.
  - Optimizing routing.
  - Addressing hardware issues on network devices.
  - If nodes must be far apart, cautiously tuning `phi_convict_threshold` (see Cause 6).
- Why it works: A stable, low-latency network ensures that gossip heartbeats are delivered promptly, preventing nodes from being prematurely marked as down.
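A single `ping` run can hide intermittent problems, so it helps to summarize many samples collected over time. The sketch below assumes you have already gathered round-trip times in milliseconds, with `None` standing in for a lost probe. The function name and the sample values are illustrative only.

```python
import statistics


def summarize_rtts(samples_ms):
    """Summarize round-trip samples in milliseconds; None marks a lost probe."""
    received = [s for s in samples_ms if s is not None]
    lost = len(samples_ms) - len(received)
    return {
        "loss_pct": 100.0 * lost / len(samples_ms) if samples_ms else 0.0,
        "avg_ms": statistics.mean(received) if received else None,
        "max_ms": max(received) if received else None,
        # Jitter here is the population standard deviation of received samples.
        "jitter_ms": statistics.pstdev(received) if received else None,
    }


if __name__ == "__main__":
    # Made-up samples: three replies and one lost probe.
    print(summarize_rtts([10.0, 20.0, None, 30.0]))
```

High jitter or any sustained loss is usually a stronger signal of gossip trouble than a high but steady average.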
Cause 5: Incorrect endpoint_snitch Configuration
The snitch configuration on the new node doesn’t match the snitch configuration on existing nodes, or it’s pointing to incorrect datacenter/rack information. This can prevent nodes from correctly identifying each other’s network topology, which is crucial for gossip.
- Diagnosis: Compare the `endpoint_snitch` setting in `cassandra.yaml` on the new node with the setting on existing nodes. If you use `GossipingPropertyFileSnitch`, also check `cassandra-rackdc.properties` on all nodes for consistent `dc` and `rack` values.
- Fix: Ensure all nodes in the cluster use the same `endpoint_snitch`. If you use `GossipingPropertyFileSnitch`, verify that `cassandra-rackdc.properties` has identical `dc` and `rack` values for nodes that should be in the same logical group.

      endpoint_snitch: GossipingPropertyFileSnitch # Or another consistent snitch

  And in `cassandra-rackdc.properties`:

      dc=us-east-1
      rack=RAC1

- Why it works: The snitch helps Cassandra route requests and gossip efficiently based on network topology. Inconsistencies lead to routing errors and gossip failures.
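Comparing `cassandra-rackdc.properties` across nodes by eye is error-prone, especially for differences in case or trailing whitespace. The sketch below parses the file contents and lists nodes whose `dc` or `rack` differs from the values expected for a group. The function names, node labels, and the way the file contents are collected are all assumptions; values are compared as exact strings.

```python
def parse_rackdc(text):
    """Parse the key=value lines of a cassandra-rackdc.properties file into a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def topology_mismatches(expected, node_files):
    """List (node, key, found, expected) for every dc/rack value that differs."""
    mismatches = []
    for node, text in node_files.items():
        props = parse_rackdc(text)
        for key in ("dc", "rack"):
            if props.get(key) != expected.get(key):
                mismatches.append((node, key, props.get(key), expected.get(key)))
    return mismatches


if __name__ == "__main__":
    # Hypothetical file contents gathered from two nodes.
    files = {
        "node-a": "dc=us-east-1\nrack=RAC1\n",
        "node-b": "# new node\ndc=us-east-1\nrack=RAC2\n",
    }
    print(topology_mismatches({"dc": "us-east-1", "rack": "RAC1"}, files))
```

A reported mismatch is not always an error, since nodes may legitimately sit in different racks; it is a prompt to confirm that the difference is intended.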
Cause 6: Overly Aggressive phi_convict_threshold
The phi_convict_threshold setting is too low, causing nodes to be marked as "down" too quickly due to transient network issues or brief delays, especially in higher latency environments.
- Diagnosis: Check `cassandra.yaml` for `phi_convict_threshold`. The default is 8. If it's set lower (e.g., 5 or 6), this could be the issue.
- Fix: Increase `phi_convict_threshold` to a higher value, such as 10 or 12, to tolerate temporary network disruptions. Caution: Do this carefully and only after ruling out the other causes, because a higher threshold also delays the detection of real node failures.

      phi_convict_threshold: 12 # Default is 8, increase cautiously

- Why it works: Cassandra's failure detector computes a suspicion level (phi) that grows the longer a node's heartbeats are overdue relative to their usual spacing. A higher `phi_convict_threshold` means phi must climb further before the node is considered dead, which provides more resilience against network jitter.
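To get a feel for what different thresholds mean in seconds, the sketch below uses a simplified phi accrual model that assumes heartbeat gaps follow an exponential distribution. This is an approximation for intuition, not a reproduction of Cassandra's failure detector, and the function names are illustrative.

```python
import math


def phi(seconds_since_last_heartbeat, mean_interval_seconds):
    """Suspicion level under an exponential model of heartbeat gaps."""
    # P(no heartbeat for t seconds) = exp(-t / mean); phi is -log10 of that.
    return seconds_since_last_heartbeat / (mean_interval_seconds * math.log(10))


def silence_before_conviction(threshold, mean_interval_seconds):
    """Seconds of silence needed for phi to reach `threshold` in this model."""
    return threshold * mean_interval_seconds * math.log(10)


if __name__ == "__main__":
    for threshold in (5, 8, 12):
        seconds = silence_before_conviction(threshold, mean_interval_seconds=1.0)
        print(f"threshold {threshold}: about {seconds:.1f} s of silence")
```

With a mean heartbeat interval of 1 second, this model gives roughly 18.4 seconds of silence for a threshold of 8 and roughly 27.6 seconds for 12. That is why raising the threshold buys tolerance for jitter at the cost of slower detection of real failures.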
The next error you’ll likely see is NoHostAvailable when clients try to connect, because the new node still doesn’t know about the existing cluster or its data.