Cassandra’s gossip protocol is failing to establish a consistent view of the cluster state when a new node joins, meaning nodes can’t agree on who’s up, who’s down, or what data they hold.

This usually happens because a new node can’t establish a connection to any existing nodes to start the gossip exchange.

Cause 1: Incorrect listen_address or rpc_address in cassandra.yaml

The new node is advertising an IP address that other nodes can’t reach.

  • Diagnosis: On the new node, check /etc/cassandra/cassandra.yaml. Look for listen_address (used for inter-node traffic, including gossip) and rpc_address (used for client connections). Confirm that each is an IP that actually belongs to the new node, not localhost or an address copied from another node. You can also run ping <listen_address_from_new_node> from an existing node to confirm basic reachability.
  • Fix: Set listen_address and rpc_address to the new node’s actual IP address, one that the other nodes can reach. If you use hostnames instead, make sure they resolve correctly from every node.
    listen_address: 192.168.1.100 # Replace with the new node's actual, reachable IP
    rpc_address: 192.168.1.100  # Replace with the new node's actual, reachable IP
    
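  • Why it works: Gossip and all other inter-node traffic use listen_address (or broadcast_address, if set), so peers must be able to reach that IP. rpc_address only controls where the node accepts client (CQL) connections; a wrong value there breaks clients, not gossip.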
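
The address check can be partly automated. The following is a minimal sketch, not an official tool: find_address_problems is a hypothetical helper that scans the text of cassandra.yaml with a regular expression (it assumes the keys sit at the top level, unindented) and flags values that other machines clearly cannot use.

```python
import ipaddress
import re


def find_address_problems(yaml_text):
    """Return human-readable problems with listen_address / rpc_address."""
    problems = []
    for key in ("listen_address", "rpc_address"):
        # Match an uncommented top-level "key: value" line, ignoring trailing comments.
        match = re.search(rf"^{key}:[ \t]*([^#\n]*)", yaml_text, re.MULTILINE)
        value = match.group(1).strip().strip("'\"") if match else ""
        if not value:
            continue  # blank or missing: Cassandra falls back to the host's own name resolution
        if value == "localhost":
            problems.append(f"{key} is {value}: other machines cannot reach it")
            continue
        try:
            ip = ipaddress.ip_address(value)
        except ValueError:
            continue  # a hostname: check that it resolves from every node separately
        if ip.is_loopback:
            problems.append(f"{key} is {value}: loopback is unreachable from other machines")
        elif ip.is_unspecified and key == "listen_address":
            problems.append(f"{key} is {value}: a wildcard is not a usable gossip address")
    return problems
```

Passing the file contents, for example find_address_problems(open("/etc/cassandra/cassandra.yaml").read()), returns an empty list when nothing obviously wrong is found. It cannot tell whether a syntactically valid IP really belongs to the node.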

Cause 2: Firewall Blocking Gossip Ports

Network firewalls are preventing communication on the necessary Cassandra ports. Cassandra uses port 7000 (or 7001 for SSL) for inter-node communication (gossip) and port 9042 for client connections.

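  • Diagnosis: From the new node, run telnet <existing_node_ip> 7000 or nc -vz <existing_node_ip> 7000. If the connection is refused or times out while Cassandra is running on the existing node, the port is most likely blocked. Repeat the test in the other direction, since nodes open gossip connections both ways.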
  • Fix: Open ports 7000 (and 7001 if SSL is enabled) for TCP traffic between all nodes in the cluster.
    • iptables example:
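      # Allow inbound gossip from peers (use -I instead of -A if an earlier rule already rejects this traffic)
      sudo iptables -A INPUT -p tcp --dport 7000 -j ACCEPT
      # Allow outbound gossip to peers
      sudo iptables -A OUTPUT -p tcp --dport 7000 -j ACCEPT
      # Repeat for 7001 if SSL is used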
      
  • Why it works: Unblocking these ports allows the gossip handshake and subsequent messages to flow freely between nodes.
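
If telnet and nc are not installed, the same reachability test can be sketched with Python’s standard socket module. The function name port_is_open and the 3-second timeout are illustrative choices, not anything Cassandra prescribes.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, port_is_open("<existing_node_ip>", 7000) run from the new node mirrors the nc -vz check above. A False result only shows that the connection failed; it does not distinguish a firewall from a node that is simply not running.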

Cause 3: Incorrect seed_provider Configuration

The new node doesn’t know which existing nodes to contact to bootstrap its initial view of the cluster.

  • Diagnosis: Check /etc/cassandra/cassandra.yaml on the new node. Ensure the seed_provider section correctly lists at least one, preferably a few, stable, existing nodes in the cluster.
  • Fix: Update the seed_provider list to include the IP addresses of reliable, existing seed nodes.
    seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
          - seeds: "192.168.1.1,192.168.1.2,192.168.1.3" # Replace with IPs of your stable seed nodes
    
  • Why it works: Seed nodes are the entry points for new nodes to learn about the cluster topology and other nodes. If they are unreachable or misconfigured, the new node can’t join.
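
A quick sanity check of the seed list can be sketched as follows. Both helpers are hypothetical: parse_seeds pulls the seeds string out of the cassandra.yaml text with a regular expression, and seed_problems flags an empty list or a new node that lists itself, since a node that appears in its own seed list skips bootstrap.

```python
import re


def parse_seeds(yaml_text):
    """Extract the comma-separated seed addresses from cassandra.yaml text."""
    match = re.search(r"^\s*-\s*seeds:\s*[\"']?([^\"'#\n]*)", yaml_text, re.MULTILINE)
    if match is None:
        return []
    return [seed.strip() for seed in match.group(1).split(",") if seed.strip()]


def seed_problems(seeds, own_address):
    """Return a list of obvious problems with a joining node's seed list."""
    problems = []
    if not seeds:
        problems.append("no seeds configured")
    if own_address in seeds:
        problems.append(f"{own_address} lists itself as a seed and will not bootstrap")
    return problems
```

Whether each seed is actually reachable on port 7000 still has to be tested from the new node.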

Cause 4: Network Latency or Instability

High latency or intermittent packet loss between the new node and existing nodes makes the gossip protocol unreliable. Gossip relies on frequent, timely heartbeats.

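  • Diagnosis: Use ping <existing_node_ip> from the new node and vice-versa. Look for packet loss or for latency that is much higher or more variable than between existing nodes (as a rough rule of thumb, sustained averages above about 50ms within a single datacenter deserve investigation). Use mtr <existing_node_ip> for more detailed path analysis.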
  • Fix: Improve network connectivity. This might involve:
    • Ensuring nodes are in the same datacenter or region.
    • Optimizing routing.
    • Addressing hardware issues on network devices.
    • If nodes must be far apart, consider tuning phi_convict_threshold (see below) cautiously.
  • Why it works: A stable, low-latency network ensures that gossip heartbeats are delivered promptly, preventing nodes from being prematurely marked as down.
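
ICMP is sometimes filtered, so it can help to time TCP connections to the gossip port instead. The sketch below is one way to do that with the standard library; tcp_connect_ms and summarize are illustrative names, and connect time is only a proxy for the latency gossip messages actually experience.

```python
import socket
import time
from statistics import mean


def tcp_connect_ms(host, port, timeout=2.0):
    """Time one TCP connect to host:port in milliseconds; None if it fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None


def summarize(samples_ms):
    """Summarize a list of samples where None marks a failed attempt."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ok = [s for s in samples_ms if s is not None]
    return {
        "loss_pct": 100.0 * (len(samples_ms) - len(ok)) / len(samples_ms),
        "avg_ms": mean(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }
```

Collecting a few dozen samples, for instance summarize([tcp_connect_ms("<existing_node_ip>", 7000) for _ in range(30)]), gives a rough picture of loss and jitter between two nodes.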

Cause 5: Incorrect endpoint_snitch Configuration

The snitch configuration on the new node doesn’t match the snitch configuration on existing nodes, or it’s pointing to incorrect datacenter/rack information. The new node then announces itself through gossip with the wrong topology, so the rest of the cluster places it in an unexpected datacenter or rack.

  • Diagnosis: Compare the endpoint_snitch setting in cassandra.yaml on the new node with those on existing nodes. If using GossipingPropertyFileSnitch, check the corresponding cassandra-rackdc.properties file on all nodes for consistency in dc and rack values.
  • Fix: Ensure all nodes in the cluster use the same endpoint_snitch. If using GossipingPropertyFileSnitch, verify that cassandra-rackdc.properties has identical dc and rack values for nodes that should be in the same logical group.
    endpoint_snitch: GossipingPropertyFileSnitch # Or another consistent snitch
    
    And in cassandra-rackdc.properties:
    dc=us-east-1
    rack=RAC1
    
  • Why it works: The snitch tells Cassandra which datacenter and rack each node belongs to, and that information drives replica placement and request routing. With consistent values, every node builds the same picture of the topology from gossip.
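
Comparing cassandra-rackdc.properties across nodes by eye is error-prone, because a single typo in dc creates a separate datacenter. A minimal sketch of an automated comparison follows; it assumes you have already collected each node’s file contents into a dictionary, and the helper names are hypothetical.

```python
def parse_rackdc(text):
    """Parse key=value lines from cassandra-rackdc.properties, skipping comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def group_by_location(node_files):
    """Map each (dc, rack) pair to the sorted list of nodes that declare it."""
    groups = {}
    for node, text in sorted(node_files.items()):
        props = parse_rackdc(text)
        groups.setdefault((props.get("dc"), props.get("rack")), []).append(node)
    return groups
```

If nodes that should share a datacenter end up under different keys, such as ("us-east-1", "RAC1") and ("us-east1", "RAC1"), one of the files contains a typo.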

Cause 6: Overly Aggressive phi_convict_threshold

The phi_convict_threshold setting is too low, causing nodes to be marked as "down" too quickly due to transient network issues or brief delays, especially in higher latency environments.

  • Diagnosis: Check cassandra.yaml for phi_convict_threshold. The default is 8. If it’s set lower (e.g., 5 or 6), this could be the issue.
  • Fix: Increase phi_convict_threshold to a higher value, such as 10 or 12, to allow for more tolerance of temporary network disruptions. Caution: This should be done carefully and only after other causes are ruled out, as it can mask actual node failures.
    phi_convict_threshold: 12 # Default is 8, increase cautiously
    
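  • Why it works: Cassandra’s failure detector computes a suspicion value (phi) that grows the longer a node’s heartbeat is overdue relative to its usual arrival interval. A higher phi_convict_threshold means a heartbeat must be overdue for longer before the node is considered dead, providing more resilience against network jitter.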
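
To get a feel for what the threshold means in seconds, the sketch below uses a simplified model of a phi accrual failure detector. It assumes heartbeat intervals are exponentially distributed, so that phi = -log10(exp(-t / mean)); this is an illustration of the idea, not Cassandra’s actual code, and the 1-second mean interval in the usage note is an assumption.

```python
import math


def phi(seconds_since_last_heartbeat, mean_interval_seconds):
    """Suspicion level under the exponential model: -log10(exp(-t / mean))."""
    return seconds_since_last_heartbeat / (mean_interval_seconds * math.log(10))


def seconds_until_convicted(threshold, mean_interval_seconds):
    """Silence needed before phi reaches the threshold, under the same model."""
    return threshold * math.log(10) * mean_interval_seconds
```

Under this model, with heartbeats arriving about once per second, seconds_until_convicted(8, 1.0) is roughly 18.4 seconds and seconds_until_convicted(12, 1.0) is roughly 27.6 seconds, which shows why raising the threshold also delays detection of real failures.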

The next error you’ll likely see is NoHostAvailable when clients try to connect, because the new node still doesn’t know about the existing cluster or its data.
