ClickHouse Keeper is a drop-in replacement for ZooKeeper, designed to offer better reliability and performance for ClickHouse itself.
Here’s how ClickHouse Keeper works and why you might consider migrating:
ClickHouse needs a coordination service to manage metadata, schema changes, and distributed operations. Historically, this role was filled by Apache ZooKeeper. However, ZooKeeper has some operational complexities and performance limitations that can impact ClickHouse. ClickHouse Keeper was developed to address these issues directly.
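Let’s see ClickHouse Keeper in action. Imagine you have a ClickHouse cluster with replicated tables and you alter a table’s schema. Without a coordination service, each replica would have to be altered separately, and nothing would guarantee that all replicas apply the change in the same order relative to inserts and merges. With Keeper, the change is recorded once in shared state that every replica follows.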
Consider this simplified ALTER TABLE command in ClickHouse:
ALTER TABLE my_distributed_table MODIFY COLUMN new_column String;
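When this command is executed, the client sends it to a single ClickHouse server. If the table is replicated (a ReplicatedMergeTree-family engine) or the statement runs ON CLUSTER, that server records the change in the ClickHouse Keeper ensemble rather than contacting the other nodes directly. Keeper itself does not push anything: it stores the shared, ordered record of the change, and every replica reads that record and applies it to its own metadata. Because all replicas follow the same record in the same order, they do not end up with conflicting views of the table schema, which could otherwise cause query failures or inconsistent data.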
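The idea of a single shared record that every node replays can be pictured with a toy model. The sketch below is not how ClickHouse implements replication: `ToyCoordinator` and `ToyReplica` are hypothetical names, and the coordinator is only an in-memory list standing in for the state kept in Keeper. It illustrates one point, namely that replaying one ordered log leaves every replica with the same schema.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCoordinator:
    """Hypothetical stand-in for Keeper: one ordered log shared by all replicas."""
    log: list = field(default_factory=list)

    def append(self, column: str, new_type: str) -> int:
        # Record the schema change once; return its position in the log.
        self.log.append((column, new_type))
        return len(self.log)


@dataclass
class ToyReplica:
    """Hypothetical replica that replays the shared log in order."""
    name: str
    applied: int = 0  # how many log entries this replica has replayed
    columns: dict = field(default_factory=dict)

    def sync(self, coordinator: ToyCoordinator) -> None:
        # Replay only the entries this replica has not applied yet.
        for column, new_type in coordinator.log[self.applied:]:
            self.columns[column] = new_type
        self.applied = len(coordinator.log)


coordinator = ToyCoordinator()
replicas = [ToyReplica("replica_1"), ToyReplica("replica_2")]

# The MODIFY COLUMN change is recorded once in the shared log.
coordinator.append("new_column", "String")

for replica in replicas:
    replica.sync(coordinator)

# Every replica now holds the same view of the schema.
print([replica.columns for replica in replicas])
```

A replica that was offline during the change would simply call `sync` later and replay the entries it missed, which is the property the shared log is meant to provide.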
The core difference between ZooKeeper and ClickHouse Keeper lies in their internal design and how they reach consensus. ZooKeeper is written in Java and uses its own consensus protocol, ZAB (ZooKeeper Atomic Broadcast). ClickHouse Keeper is written in C++ and uses Raft, through the NuRaft library. This architectural choice underlies most of the performance and reliability differences: there is no JVM to tune, and Raft was designed to be easier to understand and implement correctly, which reduces the room for subtle bugs. Both protocols need a majority of nodes to be available, so the fault-tolerance arithmetic for sizing an ensemble is the same.
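Raft, like ZAB, commits a write only after a majority of the ensemble has acknowledged it, which is why ensembles are usually sized with an odd number of nodes. A small sketch of that arithmetic follows; the function names are illustrative and not part of any Keeper API.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if nodes < 1:
        raise ValueError("an ensemble needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


for n in (1, 3, 4, 5):
    print(f"{n} nodes: quorum={quorum_size(n)}, tolerates={tolerated_failures(n)}")
```

Note that four nodes tolerate only one failure, the same as three, so the fourth node adds replication cost without adding fault tolerance.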
Here’s a look at a typical ClickHouse Keeper configuration file (keeper.xml):
<clickhouse>
    <listen_host>::</listen_host>
    <keeper_server>
        <!-- Port that clients (ClickHouse servers) connect to -->
        <tcp_port>9181</tcp_port>
        <server_id>1</server_id>
        <log_storage_path>/var/lib/clickhouse/keeper/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/keeper/data</snapshot_storage_path>
        <coordination_settings>
            <session_timeout_ms>10000</session_timeout_ms>
            <operation_timeout_ms>5000</operation_timeout_ms>
            <election_timeout_lower_bound_ms>1000</election_timeout_lower_bound_ms>
            <quorum_reads>false</quorum_reads>
        </coordination_settings>
        <raft_configuration>
            <!-- One <server> entry per ensemble member; the hostname is a placeholder -->
            <server>
                <id>1</id>
                <hostname>keeper1.example.com</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>
In this configuration:
server_id: Uniquely identifies this Keeper node within the ensemble. It must match one of the id values listed under raft_configuration.
log_storage_path and snapshot_storage_path: Specify where Keeper stores its Raft log entries and its state snapshots.
tcp_port: The port clients connect to. Keeper speaks the ZooKeeper client protocol on this port by default, so existing ZooKeeper clients (including ClickHouse itself) can connect without modification. No compatibility switch has to be enabled.
raft_configuration: Lists every member of the ensemble with its id, hostname, and the port used for inter-server Raft communication.
session_timeout_ms and operation_timeout_ms: Control how quickly a dead client session is detected and how long a single client operation may take.
election_timeout_lower_bound_ms: The shortest time a follower waits without hearing from the leader before it may start a new election.
quorum_reads: When false (the default), a node answers reads from its local state. When true, reads go through Raft consensus like writes.
The primary problem ClickHouse Keeper solves is the operational burden of running ZooKeeper alongside ClickHouse. ZooKeeper runs on the JVM, which adds heap tuning to the operator’s workload and can introduce garbage collection pauses that show up as latency spikes. Keeper is written in C++ and ships with ClickHouse, so it needs no separate runtime, and it generally uses less memory for the same data. How much latency and throughput improve depends on the workload, so measure both on your own cluster before and after the switch.
When migrating, you replace the ZooKeeper ensemble with a ClickHouse Keeper ensemble. ClickHouse servers locate the coordination service through the zookeeper section of their own configuration, which can point at either one, so on the ClickHouse side the switch amounts to changing the listed hosts and ports. Nothing needs to be recompiled. Existing ZooKeeper data does not carry over automatically: ClickHouse provides the clickhouse-keeper-converter tool to convert ZooKeeper logs and snapshots into a Keeper snapshot.
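Once ClickHouse points at the new ensemble, a quick liveness check is useful. Keeper accepts ZooKeeper-style four-letter commands on its client port, and `ruok` is answered with `imok` by a healthy node. The sketch below uses only the Python standard library; the function name is hypothetical, and it assumes `ruok` is permitted by the node's four-letter-command allow list, which it is by default.

```python
import socket


def four_letter_command(host: str, port: int, command: str = "ruok",
                        timeout: float = 2.0) -> str:
    """Send a four-letter command to a Keeper node and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command.encode("ascii"))
        chunks = []
        while True:
            # The server closes the connection after replying,
            # so an empty read marks the end of the response.
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling `four_letter_command("localhost", 9181)` against a running node should return `imok`; a connection error or timeout means the node is unreachable or not serving requests. The `mntr` command returns a longer list of metrics in the same way.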
The one thing most people don’t realize is that ClickHouse Keeper is not just a "ZooKeeper clone." It is compatible at the client protocol level, but its internals are different, so behavior around latency and failover can differ as well, often for the better. Read consistency is one example. By default Keeper gives the same guarantees as ZooKeeper: writes are linearizable, while reads are answered from the local state of whichever node the client is connected to, which may be slightly behind the leader. That is fast and sufficient for ClickHouse operations that don’t require immediate consistency. The quorum_reads coordination setting trades that speed for stronger consistency by sending reads through Raft consensus like writes, a choice ZooKeeper does not expose as a server-side setting.
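The difference between a read served from a node's local state and a read routed through consensus can be shown with a toy three-node model. This is an illustration only: `ToyEnsemble` is a hypothetical class, node 0 plays the leader, and real replication, elections, and failures are not modeled.

```python
from typing import Optional


class ToyEnsemble:
    """Hypothetical three-node ensemble: node 0 is the leader, 1 and 2 follow."""

    def __init__(self) -> None:
        # Each node keeps its own copy of the key-value state.
        self.states = [{}, {}, {}]

    def write(self, key: str, value: str, lagging: Optional[int] = None) -> None:
        # A write commits once a majority (2 of 3) has applied it.
        # A follower named in `lagging` has not received it yet.
        if lagging == 0:
            raise ValueError("the leader cannot lag behind its own write")
        for index, state in enumerate(self.states):
            if index != lagging:
                state[key] = value

    def local_read(self, node: int, key: str) -> Optional[str]:
        # Fast path: answer from this node's own state, which may be stale.
        return self.states[node].get(key)

    def quorum_read(self, key: str) -> Optional[str]:
        # Slow path: the read goes through the leader and sees every commit.
        return self.states[0].get(key)


ensemble = ToyEnsemble()
ensemble.write("/schema_version", "1")
ensemble.write("/schema_version", "2", lagging=2)  # follower 2 falls behind

print(ensemble.local_read(2, "/schema_version"))   # value seen by the lagging follower
print(ensemble.quorum_read("/schema_version"))     # value seen through consensus
```

The lagging follower still answers with the older value until it catches up, while the consensus read always reflects the latest committed write.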
The next step after migrating to ClickHouse Keeper is understanding how to monitor its performance and health metrics within ClickHouse itself.