The ClickHouse Keeper service on your replica timed out when trying to establish a connection with the ClickHouse server on the primary. This usually means a network path is blocked, or the ClickHouse server isn’t listening on the expected port.
Common Causes and Fixes
- Firewall Blocking Ports:
  - Diagnosis: On the replica machine, try to `telnet` to the primary ClickHouse server’s IP address and port. The default native TCP port is 9000; replicas fetch data over the interserver HTTP port, 9009 by default, so test that port the same way.

        telnet <primary_ip_address> 9000

    If it hangs, a firewall is likely dropping the traffic. "Connection refused" means either a firewall is rejecting the connection or nothing is listening on that port.
  - Fix: Ensure that port 9000 (or your configured inter-server port) is open on the firewall of the primary ClickHouse server. For `ufw` on Ubuntu:

        sudo ufw allow 9000/tcp
        sudo ufw reload

    For `firewalld` on CentOS/RHEL:

        sudo firewall-cmd --zone=public --add-port=9000/tcp --permanent
        sudo firewall-cmd --reload

  - Why it works: This explicitly permits network traffic on the port ClickHouse uses for inter-server communication, so the host firewall no longer drops it.
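If `telnet` is not installed on the replica, the reachability check from the firewall diagnosis can be scripted. The sketch below is a minimal example that assumes `bash` and the coreutils `timeout` command are available; the function name `check_port` and the 3-second limit are illustrative choices, not ClickHouse settings. It separates a silent hang (typical of a firewall dropping packets) from an immediate failure (typical of a rejected connection or a port nothing is listening on).

```shell
# check_port HOST PORT
# Prints "open", "timeout" (no answer within 3 s),
# or "refused or unreachable" (the connection attempt failed right away).
check_port() {
    # bash's /dev/tcp pseudo-device opens a TCP connection; timeout caps the wait.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
    case $? in
        0)   echo "open" ;;
        124) echo "timeout" ;;                 # timeout(1) exits 124 when it kills the command
        *)   echo "refused or unreachable" ;;
    esac
}
```

Run it from the replica as `check_port <primary_ip_address> 9000`, and repeat for any other port the replica needs.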
- Incorrect `listen_host` or `bind_host` in ClickHouse Configuration:
  - Diagnosis: Check the `config.xml` file on the primary ClickHouse server, along with any override or include file it pulls in (such as `config.d/*.xml` or `metrika.xml`). Look for the `<listen_host>` or `<bind_host>` directive. If it’s set to `127.0.0.1` or `localhost`, ClickHouse will only accept connections from the local machine, not from other servers.
  - Fix: Change `listen_host` to `0.0.0.0` (to listen on all IPv4 network interfaces) or to the specific IP address of the primary server that the replica can reach.

        <listen_host>0.0.0.0</listen_host>
        <!-- or -->
        <listen_host><primary_server_ip_address></listen_host>

    After changing, restart the ClickHouse server:

        sudo systemctl restart clickhouse-server

  - Why it works: `listen_host` determines which network interfaces ClickHouse binds to. Setting it to `0.0.0.0` binds every IPv4 interface, so the replica can reach the server on whichever address it uses.
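To see at a glance which addresses the primary is configured to listen on, the `<listen_host>` values can be pulled out of the config file. This is a rough sketch using `grep` and `sed`; `list_listen_hosts` is a hypothetical helper name, and it assumes each `<listen_host>` element sits on a single line. It does not parse XML, so commented-out entries are printed too.

```shell
# list_listen_hosts FILE
# Prints every <listen_host> value found in FILE and flags loopback-only entries.
list_listen_hosts() {
    grep -o '<listen_host>[^<]*</listen_host>' "$1" |
    sed -e 's/<[^>]*>//g' |                     # strip the opening and closing tags
    while read -r host; do
        case "$host" in
            127.0.0.1|localhost|::1) echo "$host (loopback only)" ;;
            *)                       echo "$host" ;;
        esac
    done
}
```

For example, `list_listen_hosts /etc/clickhouse-server/config.xml`, assuming the config lives at that path on your system.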
- Incorrect `tcp_port` in ClickHouse Configuration:
  - Diagnosis: Verify the `<tcp_port>` setting in the primary ClickHouse server’s `config.xml`. Ensure it matches the port you are trying to connect to from the replica. The default is 9000.
  - Fix: If the `tcp_port` is different, update the replica’s configuration (usually the `<remote_servers>` section of `config.xml` or a file it includes) to use the correct port, or change the `tcp_port` on the primary to the expected value.

        <!-- On primary -->
        <clickhouse>
            <tcp_port>9000</tcp_port>
        </clickhouse>

    Then restart the primary ClickHouse server.
  - Why it works: ClickHouse servers communicate on a specific TCP port. If the ports don’t match, the replica is trying to connect to a port on the primary that nothing is listening on.
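Besides reading the configuration, it can help to check what the primary is actually bound to. The sketch below filters the output of `ss -ltn` (listening TCP sockets, numeric ports) for one port; `listening_on` is a hypothetical helper name, and it assumes the usual `ss` column layout, in which the local address and port form the fourth column.

```shell
# listening_on PORT
# Reads `ss -ltn` output on stdin and prints every local address bound to PORT.
# Exits non-zero if nothing is listening on that port.
listening_on() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            n = split($4, parts, ":")   # port is the part after the last colon
            if (parts[n] == port) {
                print $4
                found = 1
            }
        }
        END { exit found ? 0 : 1 }
    '
}
```

Run `ss -ltn | listening_on 9000` on the primary. An address such as `127.0.0.1:9000` points back to the `listen_host` problem, while no output at all suggests a different `tcp_port` or a server that is not running.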
- Network Latency or Congestion:
  - Diagnosis: Use `ping` to check latency between the replica and the primary. High latency (consistently over 100ms) or packet loss can cause timeouts.

        ping <primary_ip_address>

  - Fix: Investigate the network path between the servers. This might involve checking router configurations, network interface statistics on the servers, or consulting with your network administrator. If possible, optimize routing or reduce network load.
  - Why it works: Timeouts occur when the request/response cycle takes too long. Reducing latency and ensuring reliable packet delivery allows the connection handshake to complete within the timeout window.
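For repeated checks, the two numbers that matter in the `ping` output (packet loss and average round-trip time) can be extracted with a small filter. This is a sketch that assumes the summary format printed by the common Linux `ping`, with a `packet loss` line and a `min/avg/max` line; other implementations may word these differently. `ping_summary` is a hypothetical helper name.

```shell
# ping_summary
# Reads the output of `ping -c N HOST` on stdin and prints "loss=<pct> avg=<ms>ms".
ping_summary() {
    awk '
        /packet loss/ {
            # the loss figure is the field that ends in a percent sign
            for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i
        }
        /min\/avg\/max/ {
            split($0, halves, " = ")
            split(halves[2], rtt, "/")   # min / avg / max / mdev
            avg = rtt[2]
        }
        END {
            if (avg == "") avg = "n/a"   # no replies received, so no RTT line
            else avg = avg "ms"
            printf "loss=%s avg=%s\n", loss, avg
        }
    '
}
```

A typical call is `ping -c 10 <primary_ip_address> | ping_summary`, compared against the 100ms guideline above.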
- ClickHouse Server Not Running or Crashed:
  - Diagnosis: Check the status of the ClickHouse server on the primary machine.

        sudo systemctl status clickhouse-server

    Also, check the ClickHouse server logs (`/var/log/clickhouse-server/clickhouse-server.log` or similar) for any crash messages or errors.
  - Fix: If the server is not running, start it:

        sudo systemctl start clickhouse-server

    If it crashed, analyze the logs to determine the cause of the crash and address it.
  - Why it works: The replica cannot connect to a server process that is not active and listening for connections.
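When the log is large, filtering it by severity makes a crash easier to spot. ClickHouse tags each log line with its level in angle brackets, so the sketch below keeps only the `<Error>` and `<Fatal>` lines. The helper name `recent_errors` and the default of 20 lines are illustrative assumptions.

```shell
# recent_errors LOGFILE [COUNT]
# Prints the last COUNT (default 20) lines logged at <Error> or <Fatal> level.
recent_errors() {
    grep -E '<(Error|Fatal)>' "$1" | tail -n "${2:-20}"
}
```

For example, `recent_errors /var/log/clickhouse-server/clickhouse-server.log 50`. For a quick yes/no on the service itself, `systemctl is-active clickhouse-server` prints `active` when the unit is running.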
- Incorrect Replica Hostname/IP in ZooKeeper/Keeper Configuration:
  - Diagnosis: The ClickHouse Keeper ensemble (or ZooKeeper) stores the list of active ClickHouse nodes. If the primary server’s IP address or hostname is incorrectly registered or outdated in the Keeper configuration, replicas might try to connect to the wrong endpoint. Check the `zookeeper` or `keeper` configuration section in `config.xml` on all ClickHouse nodes. Ensure the `<host>` and `<port>` values for the Keeper ensemble are correct and that the primary server’s IP is listed correctly if it’s part of the ensemble.
  - Fix: Correct any incorrect hostnames or IP addresses in the Keeper configuration on all ClickHouse servers and restart them. For example, if the primary’s IP changed and wasn’t updated in ZooKeeper’s `server.X=IP:2888:3888` entries or the `clickhouse_keeper.xml` for ClickHouse Keeper:

        <!-- Example for ClickHouse Keeper -->
        <keeper_server>
            <tcp_port>9181</tcp_port>
            <server_id>1</server_id>
            <log_dir>/var/log/clickhouse-keeper/log</log_dir>
            <snapshot_dir>/var/lib/clickhouse-keeper/snapshots</snapshot_dir>
            <coordination_settings>
                <operation_timeout_ms>10000</operation_timeout_ms>
                <raft_logs_to_keep>1000</raft_logs_to_keep>
                <session_timeout_ms>30000</session_timeout_ms>
            </coordination_settings>
            <raft_configuration>
                <server>
                    <id>1</id>
                    <hostname>keeper1.example.com</hostname> <!-- Corrected IP/hostname -->
                    <port>9440</port>
                </server>
                <server>
                    <id>2</id>
                    <hostname>keeper2.example.com</hostname>
                    <port>9440</port>
                </server>
                <!-- ... other keepers -->
            </raft_configuration>
        </keeper_server>

    Restart all ClickHouse servers.
  - Why it works: The Keeper ensemble is the source of truth for cluster topology. If it lists an incorrect address for the primary, replicas will attempt to connect to that erroneous address, leading to timeouts.
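After correcting the addresses, each Keeper node can be probed from the ClickHouse servers to confirm that it answers on its client port. ClickHouse Keeper supports ZooKeeper-style four-letter commands, and `ruok` is answered with `imok` by a healthy node. The sketch below assumes `nc` is installed and that `ruok` is permitted by the Keeper's four-letter-command allow list; `keeper_ok` is a hypothetical helper name, and 9181 is the `tcp_port` from the example configuration.

```shell
# keeper_ok HOST [PORT]
# Sends the "ruok" four-letter command; succeeds only if the reply is "imok".
keeper_ok() {
    reply=$(printf 'ruok' | nc -w 2 "$1" "${2:-9181}" 2>/dev/null)
    [ "$reply" = "imok" ]
}
```

For example, `keeper_ok keeper1.example.com && echo reachable`. A failure here on a host that should be part of the ensemble points to a wrong address, a wrong port, or a Keeper that is down.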
You’ll likely hit a "Table doesn’t exist" error if the replica can’t sync its metadata due to the connection issue.