The ClickHouse Keeper service on your replica timed out when trying to establish a connection with the ClickHouse server on the primary. This usually means a network path is blocked, or the ClickHouse server isn’t listening on the expected port.

Common Causes and Fixes

  1. Firewall Blocking Ports:

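    • Diagnosis: On the replica machine, try to telnet to the primary ClickHouse server’s IP address and port. Port 9000 is the default native TCP port (tcp_port), used by clients and distributed queries. Replication traffic between servers uses interserver_http_port (9009 by default), so test that port as well if replication is what fails.
      telnet <primary_ip_address> 9000
      
      If the attempt hangs until it times out, a firewall is probably dropping the packets. An immediate "Connection refused" usually means the host is reachable but nothing is listening on that port (see causes 2, 3, and 5), although a firewall that rejects rather than drops produces the same message.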
    • Fix: Ensure that port 9000 (or your configured inter-server port) is open on the firewall of the primary ClickHouse server. For ufw on Ubuntu:
      sudo ufw allow 9000/tcp
      sudo ufw reload
      
      For firewalld on CentOS/RHEL:
      sudo firewall-cmd --zone=public --add-port=9000/tcp --permanent
      sudo firewall-cmd --reload
      
    • Why it works: The rule permits inbound TCP traffic on that port through the host firewall. It does not affect filtering elsewhere on the path, such as cloud security groups or network ACLs, which must allow the port separately.
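As a scriptable alternative to the telnet check in cause 1, the sketch below bounds a TCP connect attempt and labels the outcome. It assumes bash and GNU coreutils timeout are installed on the replica. The function names classify_probe and probe_port are hypothetical helpers, not part of ClickHouse.

```shell
# Translate the exit status of a bounded connect attempt into a likely cause.
# 124 is the status GNU timeout returns when it had to kill the command.
classify_probe() {
  case "$1" in
    0)   echo "open" ;;
    124) echo "timed-out" ;;               # packets likely dropped by a firewall
    *)   echo "refused-or-unreachable" ;;  # nothing listening, rejected, or no route
  esac
}

# probe_port HOST PORT: try a TCP connect for at most 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
probe_port() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
  classify_probe "$?"
}
```

Usage: probe_port <primary_ip_address> 9000. A result of timed-out points toward the firewall checks in this cause, while refused-or-unreachable points toward the listener checks in the causes that follow.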
  2. Incorrect listen_host or bind_host in ClickHouse Configuration:

    • Diagnosis: Check config.xml, and any override files under config.d/, on the primary ClickHouse server. Look for the <listen_host> directive, which controls the addresses the server listens on. If it’s set to 127.0.0.1 or localhost, or is not set at all (the default is to listen on localhost only), ClickHouse will only accept connections from the local machine, not from other servers.
    • Fix: Change listen_host to 0.0.0.0 (to listen on all IPv4 interfaces) or to the specific IP address of the primary server that the replica can reach.
      <listen_host>0.0.0.0</listen_host>
      <!-- or -->
      <listen_host><primary_server_ip_address></listen_host>
      
      After changing, restart the ClickHouse server:
      sudo systemctl restart clickhouse-server
      
    • Why it works: listen_host determines which addresses ClickHouse binds to. Setting it to 0.0.0.0 binds to every IPv4 interface, so the server accepts connections arriving on any of the host’s addresses (firewall rules still apply), allowing the replica to connect.
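To confirm what the server actually bound to after a restart, the listening sockets can be inspected on the primary. The sketch below is a hypothetical helper (listen_scope) that reads the output of ss -ltn on stdin. It assumes the iproute2 column layout in which the first field is the state and the fourth field is the local address:port.

```shell
# listen_scope PORT: read `ss -ltn` output on stdin and report how PORT is bound.
listen_scope() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")
      if (parts[n] != port) next
      # Everything before the final ":PORT" is the bind address.
      addr = substr($4, 1, length($4) - length(parts[n]) - 1)
      found = 1
      if (addr !~ /^127\./ && addr != "[::1]") external = 1
    }
    END {
      if (!found)        print "not-listening"
      else if (external) print "non-loopback"
      else               print "loopback-only"
    }'
}
```

Usage: ss -ltn | listen_scope 9000. A result of loopback-only matches the listen_host problem described in this cause, and not-listening suggests checking the port setting or whether the server is running.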
  3. Incorrect tcp_port in ClickHouse Configuration:

    • Diagnosis: Verify the <tcp_port> setting in the primary ClickHouse server’s config.xml. Ensure it matches the port you are trying to connect to from the replica. The default is 9000.
    • Fix: If the tcp_port is different, update the replica’s configuration to use the correct port (typically the <remote_servers> or <zookeeper> section in config.xml or an override file under config.d/, not users.xml, which holds user and profile settings), or change the tcp_port on the primary to the expected value and restart the server.
      <!-- On primary -->
      <clickhouse>
          <tcp_port>9000</tcp_port>
      </clickhouse>
      
      Then restart the primary ClickHouse server.
    • Why it works: ClickHouse servers communicate on a specific TCP port. If the ports do not match, the replica tries to connect to a port on which nothing is listening, so the attempt is refused or times out.
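A quick way to compare the configured port on both machines is to pull the value out of the configuration file. The sketch below is a hypothetical, line-based helper (xml_values), not a real XML parser: it prints every match it finds, including ones nested in other sections (such as the tcp_port inside keeper_server) or inside comments, so treat the output as a hint rather than the effective value.

```shell
# xml_values TAG: print the numeric value of every <TAG>...</TAG> element on stdin.
# Matches the exact tag name only, so tcp_port does not match tcp_port_secure.
xml_values() {
  sed -n "s:.*<$1>\([0-9][0-9]*\)</$1>.*:\1:p"
}
```

Usage, assuming the common package layout: xml_values tcp_port < /etc/clickhouse-server/config.xml. Files under config.d/ can override the main file, so check those as well if the printed value does not match what the server is actually using.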
  4. Network Latency or Congestion:

    • Diagnosis: Use ping to check latency between the replica and the primary. High latency (consistently over 100ms) or packet loss can cause timeouts.
      ping <primary_ip_address>
      
    • Fix: Investigate the network path between the servers. This might involve checking router configurations, network interface statistics on the servers, or consulting with your network administrator. If possible, optimize routing or reduce network load.
    • Why it works: Timeouts occur when the request/response cycle takes too long. Reducing latency and ensuring reliable packet delivery allows the connection handshake to complete within the timeout window.
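For repeated checks, the two numbers that matter in the ping output (packet loss and average round-trip time) can be extracted automatically. The sketch below is a hypothetical helper (ping_summary) that assumes the summary format printed by Linux iputils or BSD ping. Other implementations may need a different pattern.

```shell
# ping_summary: read ping output on stdin, print packet loss and average RTT.
ping_summary() {
  awk '
    /packet loss/ {
      # The loss figure is the only field ending in "%".
      for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i
    }
    /^rtt|^round-trip/ {
      # Line looks like: rtt min/avg/max/mdev = a/b/c/d ms
      split($0, halves, " = ")
      split(halves[2], stats, "/")
      avg = stats[2]
    }
    END { print "loss=" loss " avg_ms=" avg }'
}
```

Usage: ping -c 5 <primary_ip_address> | ping_summary. If ICMP is blocked between the hosts, ping reports 100% loss even when TCP works, so a failed ping alone does not prove a latency problem.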
  5. ClickHouse Server Not Running or Crashed:

    • Diagnosis: Check the status of the ClickHouse server on the primary machine.
      sudo systemctl status clickhouse-server
      
      Also, check the ClickHouse server logs (/var/log/clickhouse-server/clickhouse-server.log or similar) for any crash messages or errors.
    • Fix: If the server is not running, start it:
      sudo systemctl start clickhouse-server
      
      If it crashed, analyze the logs to determine the cause of the crash and address it.
    • Why it works: The replica cannot connect to a server process that is not active and listening for connections.
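When the log is large, filtering for the most recent error-level entries narrows the search. The sketch below is a hypothetical helper (recent_errors) that relies on the ClickHouse text log marking the level in angle brackets, such as <Error> or <Fatal>. The log path in the usage line is the one mentioned above and may differ on your system.

```shell
# recent_errors LOGFILE [COUNT]: show the last COUNT (default 20) Error/Fatal lines.
recent_errors() {
  grep -E '<(Error|Fatal)>' "$1" | tail -n "${2:-20}"
}
```

Usage: recent_errors /var/log/clickhouse-server/clickhouse-server.log 50. Reading the file may require sudo, depending on its permissions.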
  6. Incorrect Replica Hostname/IP in ZooKeeper/Keeper Configuration:

    • Diagnosis: The ClickHouse Keeper ensemble (or ZooKeeper) stores the list of active ClickHouse nodes. If the primary server’s IP address or hostname is incorrectly registered or outdated in the Keeper configuration, replicas might try to connect to the wrong endpoint. On all ClickHouse nodes, check the <zookeeper> section (how each ClickHouse server reaches the ensemble) and, for ClickHouse Keeper, the <keeper_server> section (how Keeper nodes reach each other). Ensure the <host> and <port> values for the Keeper ensemble are correct and that the primary server’s IP is listed correctly if it’s part of the ensemble.
    • Fix: Correct any incorrect hostnames or IP addresses in the Keeper configuration on all ClickHouse servers and restart them. For example, if the primary’s IP changed and wasn’t updated in ZooKeeper’s server.X=IP:2888:3888 entries (in zoo.cfg) or in the Keeper configuration file for ClickHouse Keeper:
      <!-- Example for ClickHouse Keeper -->
      <keeper_server>
          <tcp_port>9181</tcp_port>
          <server_id>1</server_id>
          <log_dir>/var/log/clickhouse-keeper/log</log_dir>
          <snapshot_dir>/var/lib/clickhouse-keeper/snapshots</snapshot_dir>
          <coordination_settings>
              <operation_timeout_ms>10000</operation_timeout_ms>
              <raft_logs_to_keep>1000</raft_logs_to_keep>
              <session_timeout_ms>30000</session_timeout_ms>
          </coordination_settings>
          <raft_configuration>
              <server>
                  <id>1</id>
                  <hostname>keeper1.example.com</hostname> <!-- Corrected IP/hostname -->
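                  <port>9440</port> <!-- Raft port between Keeper nodes; must not collide with another listener on the host (tcp_port_secure also defaults to 9440) -->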
              </server>
              <server>
                  <id>2</id>
                  <hostname>keeper2.example.com</hostname>
                  <port>9440</port>
              </server>
              <!-- ... other keepers -->
          </raft_configuration>
      </keeper_server>
      
      Restart all ClickHouse servers.
    • Why it works: The Keeper ensemble is the source of truth for cluster topology. If it lists an incorrect address for the primary, replicas will attempt to connect to that erroneous address, leading to timeouts.
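To review the addresses a Keeper node has been told to use, the host and port pairs can be listed from the raft_configuration block. The sketch below is a hypothetical, line-based helper (raft_endpoints) that assumes each <hostname> and <port> element sits on its own line inside a <server> block, as in the example above. It is not a general XML parser.

```shell
# raft_endpoints: read a Keeper config on stdin, print one host:port per <server>.
raft_endpoints() {
  awk '
    /<hostname>/ { h = $0; sub(/.*<hostname>/, "", h); sub(/<\/hostname>.*/, "", h) }
    /<port>/     { p = $0; sub(/.*<port>/, "", p);     sub(/<\/port>.*/, "", p) }
    /<\/server>/ { if (h != "" && p != "") print h ":" p; h = ""; p = "" }'
}
```

Usage: raft_endpoints < your_keeper_config.xml, where the file name is a placeholder for wherever your Keeper settings live. Each printed endpoint can then be checked for DNS resolution and reachability from the other nodes. Separately, a running ClickHouse Keeper normally answers the four-letter command ruok with imok on its client port (9181 in the example), for instance: echo ruok | nc keeper1.example.com 9181.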

You’ll likely hit a "Table doesn’t exist" error if the replica can’t sync its metadata due to the connection issue.
