ClickHouse is surprisingly bad at deduplicating data after it’s been inserted, but you can make it great at preventing duplicates in the first place.

Let’s watch it happen. Imagine we have a table tracking user logins, and we want to avoid counting the same login twice.

-- Our initial table, nothing special yet
CREATE TABLE user_logins (
    user_id UInt64,
    login_time DateTime,
    ip_address IPv4
) ENGINE = MergeTree()
ORDER BY (user_id, login_time);

-- We insert some data, including a duplicate login for user 1
INSERT INTO user_logins VALUES
(1, '2023-10-27 10:00:00', '192.168.1.10'),
(2, '2023-10-27 10:05:00', '192.168.1.11'),
(1, '2023-10-27 10:00:00', '192.168.1.10'); -- Duplicate

If we query now, we’ll see the duplicate:

SELECT count() FROM user_logins; -- Returns 3

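ClickHouse’s MergeTree engine writes each insert to disk as a sorted part and merges those parts in the background. A plain MergeTree merge never removes rows, though: rows with the same ORDER BY key, even fully identical rows, are all kept. The sorting key controls how data is laid out, not whether it is unique, so the count stays at 3 no matter how many merges run. Note also that (user_id, login_time) alone may not identify a login if the same user can log in from two IP addresses in the same second; that choice starts to matter once an engine deduplicates by key.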
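While the table is still a plain MergeTree, duplicates can at least be ignored at read time. The query below is a minimal sketch of that idea, assuming a login is identified by all three columns; it leaves the stored rows untouched and only changes what is counted.

```sql
-- Count distinct logins at query time instead of raw rows.
-- Assumption: (user_id, login_time, ip_address) together identify one login.
SELECT uniqExact(user_id, login_time, ip_address) AS distinct_logins
FROM user_logins; -- 2 for the three rows inserted above

-- Equivalent formulation with a subquery, useful when other columns are needed too.
SELECT count() AS distinct_logins
FROM
(
    SELECT DISTINCT user_id, login_time, ip_address
    FROM user_logins
);
```

This works, but every query has to remember to do it, which is why pushing the rule into the table engine is attractive.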

The real magic happens with the ReplacingMergeTree engine. It’s designed to handle exactly this scenario. When ReplacingMergeTree merges data parts, it groups rows by the sorting key (the ORDER BY clause, not the primary key as such), looks at a specified version column, and keeps only the row with the highest version for each key.

Here’s how we set it up:

-- First, let's drop the old table and create a new one with ReplacingMergeTree
DROP TABLE user_logins;

-- We add a 'version' column. We'll use the login_time as the version.
CREATE TABLE user_logins (
    user_id UInt64,
    login_time DateTime,
    ip_address IPv4,
    version DateTime -- This is our version column
) ENGINE = ReplacingMergeTree(version)
ORDER BY (user_id, login_time);

-- Now, let's re-insert our data, providing the version
INSERT INTO user_logins VALUES
(1, '2023-10-27 10:00:00', '192.168.1.10', '2023-10-27 10:00:00'),
(2, '2023-10-27 10:05:00', '192.168.1.11', '2023-10-27 10:05:00'),
(1, '2023-10-27 10:00:00', '192.168.1.10', '2023-10-27 10:00:00'); -- Duplicate with same version

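At this point, SELECT count() FROM user_logins; can still show 3. ReplacingMergeTree deduplicates only when parts are merged; it never checks incoming rows against data already on disk. One caveat: recent ClickHouse versions enable the optimize_on_insert setting by default, which collapses duplicates inside a single inserted block, so a single three-row INSERT like the one above may already report 2. Duplicates that arrive in separate INSERT statements always persist until a merge.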
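One way to check whether a merge has happened yet is to list the table's active parts. The query below is a sketch that assumes user_logins lives in the current database; system.parts is ClickHouse's built-in table of data parts.

```sql
-- List the active data parts of user_logins and how many rows each holds.
-- Several parts means rows from different inserts have not been merged yet.
SELECT name, rows
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'user_logins'
  AND active
ORDER BY name;
```

If the row counts across parts add up to more than the number of distinct keys, duplicates are still on disk.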

To see the deduplication in action, we need to trigger a merge manually or wait for the background merge process.

-- Manually trigger a merge to see the effect immediately
OPTIMIZE TABLE user_logins FINAL;

-- Now, let's query again
SELECT count() FROM user_logins; -- Should now return 2

The OPTIMIZE TABLE ... FINAL command forces an unscheduled merge of the table’s parts, even when the data already sits in a single part. It rewrites the data, so it is a tool for demonstrations and occasional maintenance rather than something to run after every insert. During the merge, ReplacingMergeTree compares rows with the same ORDER BY key and keeps only the one with the highest value in the version column. If several rows tie on the highest version, it keeps the one inserted last.

What if we have a slightly different duplicate, perhaps a login from a different IP but at the exact same time?

-- Insert a row that is "duplicate" on user_id and login_time, but different IP
INSERT INTO user_logins VALUES
(1, '2023-10-27 10:00:00', '192.168.1.12', '2023-10-27 10:00:01'); -- Different IP, slightly later version

OPTIMIZE TABLE user_logins FINAL;

SELECT count() FROM user_logins; -- Still 2

This is because the ORDER BY key is (user_id, login_time), and the new row shares that key with the existing login for user 1. The existing row (1, '2023-10-27 10:00:00', '192.168.1.10', '2023-10-27 10:00:00') has the lower version. The new row (1, '2023-10-27 10:00:00', '192.168.1.12', '2023-10-27 10:00:01') has the higher version, so the merge keeps it and discards the old one. The ip_address column is not part of the key and plays no role in deciding what counts as a duplicate: the count stays at 2, but the IP stored for user 1 is now 192.168.1.12.
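Looking at the rows themselves, rather than just the count, makes the replacement visible. A minimal check, run right after the OPTIMIZE above:

```sql
-- Inspect which rows survived the merge.
SELECT user_id, login_time, ip_address, version
FROM user_logins
ORDER BY user_id, login_time;
```

User 1 should now appear once, with ip_address 192.168.1.12 and version 2023-10-27 10:00:01, alongside the untouched row for user 2.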

If rows should count as duplicates only when more columns match, you must include those columns in the ORDER BY clause.

-- Let's redefine the table to include IP in the ORDER BY for stricter deduplication
DROP TABLE user_logins;

CREATE TABLE user_logins (
    user_id UInt64,
    login_time DateTime,
    ip_address IPv4,
    version DateTime
) ENGINE = ReplacingMergeTree(version)
ORDER BY (user_id, login_time, ip_address); -- Now IP is part of the key

-- Re-inserting the original data
INSERT INTO user_logins VALUES
(1, '2023-10-27 10:00:00', '192.168.1.10', '2023-10-27 10:00:00'),
(2, '2023-10-27 10:05:00', '192.168.1.11', '2023-10-27 10:05:00'),
(1, '2023-10-27 10:00:00', '192.168.1.10', '2023-10-27 10:00:00'); -- Duplicate

-- And the one with the different IP but same time
INSERT INTO user_logins VALUES
(1, '2023-10-27 10:00:00', '192.168.1.12', '2023-10-27 10:00:01');

OPTIMIZE TABLE user_logins FINAL;

SELECT count() FROM user_logins; -- Now returns 3, as the new row is a unique combination of (user_id, login_time, ip_address)

The critical aspect of ReplacingMergeTree is that the deduplication is eventual. Inserts are fast because they just write new parts; the cost is paid during background merges, which run at a time you don’t control, so queries can see duplicates until then. Keep the ORDER BY key as small as your deduplication rule allows, since a wider key makes sorting and merging more expensive. The version column must be an unsigned integer, Date, DateTime, or DateTime64, and is typically a timestamp or a sequence number. If you don’t have a natural version column, you can add a UInt64 column and increment it for each insert, though this requires more application-level logic.
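Because merges are eventual, queries that need a deduplicated answer right now have to ask for it. Two common options are sketched below against the last table definition; the FINAL modifier applies the engine's merge logic at read time, and the GROUP BY variant does the same thing by hand. Both cost more than a plain SELECT.

```sql
-- Option 1: let ClickHouse apply ReplacingMergeTree's rules while reading.
SELECT count() FROM user_logins FINAL;

-- Option 2: collapse rows per sorting key explicitly.
SELECT count()
FROM
(
    SELECT user_id, login_time, ip_address, max(version) AS latest_version
    FROM user_logins
    GROUP BY user_id, login_time, ip_address
);
```

For the data inserted above, both queries return 3 whether or not a merge has run.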

The one thing most people don’t realize is what happens when multiple rows share the same sorting key and the same highest version: ReplacingMergeTree keeps the most recently inserted one, which can look arbitrary if you did not plan for ties. If the specific surviving row matters, make the version strictly increasing so ties cannot occur, or consider a CollapsingMergeTree with a sign column or an application-level deduplication strategy before insertion.
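A strictly increasing version removes the ambiguity, because two rows for the same key can never tie. The sketch below uses a hypothetical table name, user_logins_seq, and assumes the application assigns the UInt64 sequence numbers itself.

```sql
-- Hypothetical variant of the table with an application-assigned sequence number.
CREATE TABLE user_logins_seq (
    user_id UInt64,
    login_time DateTime,
    ip_address IPv4,
    version UInt64 -- assumption: incremented by the application on every write
) ENGINE = ReplacingMergeTree(version)
ORDER BY (user_id, login_time);

-- Two writes for the same key; the second carries the higher version.
INSERT INTO user_logins_seq VALUES
(1, '2023-10-27 10:00:00', '192.168.1.10', 1);
INSERT INTO user_logins_seq VALUES
(1, '2023-10-27 10:00:00', '192.168.1.12', 2);

-- FINAL applies the replacement rule at read time.
SELECT ip_address, version FROM user_logins_seq FINAL; -- 192.168.1.12, 2
```

The row with version 2 wins regardless of which part it landed in or when the merge runs.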

The next hurdle you’ll face is understanding how to handle updates to existing data, which ReplacingMergeTree does not directly support.
