ClickHouse replication doesn’t actually replicate data directly; it uses ZooKeeper to coordinate distributed state and ensure consistency across replicas.
Let’s see how this plays out in practice. Imagine you have a ClickHouse cluster and you want to ensure that data inserted into one replica is eventually available on others.
Here’s a minimal config.xml for a ClickHouse server that’s part of a replication setup. This goes in /etc/clickhouse-server/config.xml.
<clickhouse>
<zookeeper>
<node>
<host>zk1.example.com</host>
<port>2181</port>
</node>
<node>
<host>zk2.example.com</host>
<port>2181</port>
</node>
<node>
<host>zk3.example.com</host>
<port>2181</port>
</node>
<session_timeout_ms>30000</session_timeout_ms>
<operation_timeout_ms>10000</operation_timeout_ms>
</zookeeper>
<macros>
<replica>replica_1</replica>
</macros>
<distributed_ddl>
<log_storage_path>/var/lib/clickhouse/distributed_ddl/</log_storage_path>
</distributed_ddl>
</clickhouse>
And here’s how you’d define a replicated table in your database schema:
CREATE TABLE my_replicated_table (
event_date Date,
id UInt64,
data String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/my_replicated_table', '{replica}')
ORDER BY id;
The magic happens in the ReplicatedMergeTree engine. The first argument, /clickhouse/tables/{shard}/my_replicated_table, is the ZooKeeper path where this table’s replication metadata will be stored. {shard} and {replica} are placeholders that will be substituted by ClickHouse using the macros defined in config.xml. The second argument, {replica}, points to the specific replica’s identifier within ZooKeeper.
When you insert data into a replicated table on one replica, say replica_1, ClickHouse doesn’t immediately send that data to replica_2. Instead, it writes an entry to a distributed log in ZooKeeper under the path /clickhouse/tables/01/my_replicated_table/log. This entry describes the data part that was created and needs to be replicated.
Other replicas (replica_2, replica_3, etc.) watch this ZooKeeper log. When they see a new entry, they read the metadata about the data part and then fetch the actual data part from the replica that created it. This process is asynchronous. ZooKeeper acts as the central nervous system, keeping track of what needs to be done and where the data parts are located.
The macros section in config.xml is crucial. For replica_1, the configuration would have <replica>replica_1</replica>. For replica_2, it would be <replica>replica_2</replica>. These values are substituted into the ReplicatedMergeTree table definition. So, on replica_1, the ZooKeeper path becomes /clickhouse/tables/01/my_replicated_table and the replica name is replica_1. On replica_2, it’s the same ZooKeeper path but the replica name is replica_2. This ensures each replica knows its unique identity and where to find the shared replication log.
The distributed_ddl section is for executing DDL statements (like CREATE TABLE, ALTER TABLE) across all replicas in a consistent manner. When you execute ALTER TABLE my_replicated_table ... on one node, the DDL statement is written to ZooKeeper under /clickhouse/tables/{shard}/my_replicated_table/ddl_queue. All replicas monitor this queue and execute the DDL statement, ensuring schema changes are applied uniformly.
The surprising thing about ClickHouse replication is that it’s not a primary-replica or master-slave setup in the traditional sense. All replicas are peers. Any replica can accept writes, and any replica can serve reads. ZooKeeper’s role is to maintain a consistent view of the replication log and the state of data parts, allowing replicas to coordinate fetching and merging of data parts to achieve eventual consistency.
The session_timeout_ms and operation_timeout_ms in the ZooKeeper configuration are critical for stability. A short session_timeout_ms (e.g., 30 seconds) means ZooKeeper will consider a client disconnected if it doesn’t hear from it within that time. If a ClickHouse replica loses its ZooKeeper session, it will attempt to re-establish it. During this brief period, it won’t be able to accept new writes or participate fully in replication. The operation_timeout_ms (e.g., 10 seconds) is the maximum time a single ZooKeeper operation (like creating a node or reading data) is allowed to take before timing out.
The next logical step after setting up replication is understanding how to monitor its health and troubleshoot common replication lag issues.