ClickHouse can ingest data faster than you can realistically generate it, but it’s not magic; you have to give it the right signals.

Let’s see what a high-throughput insert pipeline looks like. Imagine we’re streaming sensor data from a fleet of devices. Each device sends a small JSON payload every second. We want to get this data into ClickHouse with minimal latency and maximum throughput.

{"device_id": "abc123xyz", "timestamp": 1678886400, "temperature": 25.5, "humidity": 60.2}

This data arrives at a Kafka topic. From Kafka, a ClickHouse Kafka engine table is configured to consume it.

CREATE TABLE sensor_data_raw (
    device_id String,
    timestamp DateTime,
    temperature Float32,
    humidity Float32
) ENGINE = Kafka
    SETTINGS
        kafka_broker_list = 'kafka1:9092,kafka2:9092',
        kafka_topic = 'sensor_readings',
        kafka_group_id = 'clickhouse_inserter',
        kafka_format = 'JSONEachRow';

But just having the Kafka engine isn’t enough. It’s a staging table. We need to move that data into a more performant table for querying. We’ll use a MergeTree family engine, typically MergeTree or ReplacingMergeTree if we want deduplication. For this example, let’s stick with MergeTree.

CREATE TABLE sensor_data (
    device_id String,
    timestamp DateTime,
    temperature Float32,
    humidity Float32
) ENGINE = MergeTree()
    PARTITION BY toYYYYMM(timestamp)
    ORDER BY (device_id, timestamp);

Now, how do we get data from sensor_data_raw to sensor_data efficiently? The simplest way is a periodic INSERT SELECT.

-- Run this periodically, e.g., via cron or a scheduled job
INSERT INTO sensor_data (device_id, timestamp, temperature, humidity)
SELECT device_id, timestamp, temperature, humidity
FROM sensor_data_raw;

This works, but it’s not optimal for high throughput. The Kafka engine reads data in chunks. The INSERT SELECT operation itself has overhead. To truly crank up the ingestion rate, we need to understand how ClickHouse handles inserts and how to optimize them.

The core idea is to make inserts as large and as infrequent as possible, within the constraints of your latency requirements. ClickHouse writes data in blocks. When you insert data, it’s buffered and then written to disk as a new "part." The MergeTree engine then asynchronously merges these parts in the background. Too many small parts lead to merge contention and slow queries. Too few massive parts can overwhelm memory during merges.

The key levers are:

  1. max_insert_block_size: This setting (default 1048576, or 1 million rows) determines the maximum number of rows ClickHouse will buffer from a single INSERT statement before committing it as a block. For high-throughput pipelines, you want this to be as high as your system can handle without causing OOM errors during insertion. If you’re inserting from Kafka, the Kafka engine often respects this, but it’s also controlled by the max_block_size setting within INSERT SELECT or the Kafka engine’s internal buffering.

  2. max_insert_threads: This setting (default 16) controls how many threads ClickHouse uses to process concurrent INSERT statements. If you have multiple independent data sources or are using a batching mechanism that spawns multiple insert jobs, increasing this can help. However, it’s often more effective to optimize a single, large insert.

  3. Batching from the Source: If you’re not using the Kafka engine and are inserting directly from an application, batch your inserts. Instead of sending 100 rows per second, send 10,000 rows every second. The INSERT INTO table VALUES (...), (...), ... syntax is your friend here.

  4. merge_tree settings: While not directly an insert setting, the MergeTree engine’s background merges are critical. The max_bytes_to_merge_at_max_space_in_pool, max_bytes_to_merge_at_min_space_in_pool, and number_of_free_entries_in_pool_to_lower_max_size_of_merge settings influence how aggressively ClickHouse merges parts. If inserts are too fast, merges can fall behind, leading to a massive number of small parts. Tuning these can help the background process keep up. A common strategy is to ensure your inserts create parts that are large enough to be worth merging, but not so large they cause OOMs. Aim for parts in the range of 100MB to 1GB.

  5. input_format_values_max_row_count: If you’re using INSERT INTO ... VALUES and not using JSONEachRow or similar, this setting limits the number of rows in a single VALUES list. Increasing it allows for larger batches in that specific format.

  6. optimize_skip_unused_arguments: This query optimization can sometimes impact insert performance, especially with complex INSERT SELECT statements. While usually beneficial, in extreme high-throughput scenarios, disabling it for the specific insert query might yield marginal gains, though it’s rarely the primary bottleneck.

Let’s refine our Kafka-based pipeline. Instead of an immediate INSERT SELECT, we can use a materialized view to automatically and efficiently move data.

CREATE MATERIALIZED VIEW sensor_data_mv TO sensor_data
AS SELECT
    device_id,
    timestamp,
    temperature,
    humidity
FROM sensor_data_raw;

This MaterializedView essentially does the INSERT SELECT for you, and ClickHouse optimizes this process internally, often batching and buffering much more effectively than a manual INSERT SELECT. The Kafka engine’s consumer group will continuously fetch data, and the MaterializedView will write it to the sensor_data table.

The Kafka engine itself has settings that influence ingest rate, like kafka_max_block_size (default 2000000) and kafka_commit_every_n_messages / kafka_commit_every_n_bytes. By default, it commits after each block is processed, which is generally good. If your INSERT SELECT or MaterializedView is still struggling, you might need to increase max_insert_block_size in the server configuration (users.xml or config.xml) for the user executing the insert, or globally. For instance, setting max_insert_block_size = 4194304 (4 million rows) could significantly increase throughput if your hardware can handle it.

The most surprising thing is how much the ORDER BY key in your MergeTree table dictates insert performance. If you have a very high cardinality ORDER BY key (like a unique device ID for every row), ClickHouse has to maintain sorted order for every single insert, which becomes a bottleneck. For high-throughput ingestion, a time-based or less granular ORDER BY key is usually far better, allowing inserts to be appended more easily before background sorting/merging happens.

The next concept you’ll grapple with is query performance on this high-throughput ingested data, especially when your ORDER BY key isn’t ideal for inserts.

Want structured learning?

Take the full Clickhouse course →