ClickHouse can query data directly from S3 without loading it into ClickHouse’s own storage.

Here’s how it works:

Let’s say you have a CSV file in an S3 bucket that looks like this:

s3://my-data-bucket/users.csv

id,name,email
1,Alice,alice@example.com
2,Bob,bob@example.com
3,Charlie,charlie@example.com

You can create a table in ClickHouse that points to this S3 file. When you query this ClickHouse table, ClickHouse will fetch the data from S3 on the fly.

-- First, create a table that describes the structure of your S3 data
CREATE TABLE users_s3 (
    id UInt32,
    name String,
    email String
)
ENGINE = S3(
    'https://my-data-bucket.s3.amazonaws.com/users.csv', -- HTTPS URL of the object (bucket my-data-bucket, key users.csv)
    'CSVWithNames'                                        -- CSV with a header row; use 'CSV' if the file has no header
);

-- Now you can query it like any other ClickHouse table
SELECT * FROM users_s3 WHERE id = 2;

This will output:

┌─id─┬─name─┬─email───────────┐
│  2 │ Bob  │ bob@example.com │
└────┴──────┴─────────────────┘

The ENGINE = S3(...) is the magic. It tells ClickHouse to use the S3 engine, which knows how to:

  1. Access S3: It handles authentication and retrieval of objects from your specified S3 bucket.
  2. Parse Data: It understands various data formats like CSV, TSV, Parquet, ORC, JSON, etc.
  3. Present as Table: It makes the data from the S3 object appear as if it were a regular ClickHouse table.

The primary benefit is avoiding the overhead of ETL pipelines to load data into ClickHouse’s local storage. For infrequently accessed or very large datasets that don’t require the absolute lowest query latency, this is a game-changer. It’s particularly useful for ad-hoc analysis on data lakes.
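For one-off exploration you may not even want a table definition. A minimal sketch, assuming the same bucket and object as above and that the server can already authenticate to it: the s3 table function takes the object URL and format directly, with the column structure as an optional third argument.

```sql
-- Ad-hoc query against the object, no CREATE TABLE required.
-- The URL and column list mirror the users.csv example and are illustrative.
SELECT name, email
FROM s3(
    'https://my-data-bucket.s3.amazonaws.com/users.csv',
    'CSVWithNames',
    'id UInt32, name String, email String'
)
WHERE id = 2;
```

Nothing is persisted in ClickHouse here; the object is read again on every query.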

The full mental model involves understanding the S3 table engine’s parameters: the object URL, optional credentials, the data format, and optional compression, in that order. If you omit credentials, ClickHouse falls back to whatever the server configuration provides, which, depending on that configuration, can include environment variables or an IAM role when running on EC2. You can also explicitly provide credentials:

-- Example with explicit credentials (use with caution, prefer IAM roles/instance profiles)
CREATE TABLE users_s3_creds (
    id UInt32,
    name String,
    email String
)
ENGINE = S3(
    'https://my-data-bucket.s3.amazonaws.com/users.csv',
    'AKIAIOSFODNN7EXAMPLE',                     -- Your AWS Access Key ID
    'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY', -- Your AWS Secret Access Key
    'CSVWithNames'                              -- Format comes after the credentials; the CSV has a header row
);

When querying a CSV object, ClickHouse performs a full scan of it in S3. It doesn’t create indexes on the S3 data itself. However, if you copy the data into a table from the MergeTree family (for example, with an INSERT ... SELECT that reads from the S3 table), you get indexed querying on the local copy. The S3 engine itself is primarily for direct, unindexed access.
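Here is a small sketch of that pattern. The table name users_local and the choice of id as the sorting key are assumptions made for illustration; pick the key that matches your query filters.

```sql
-- Local MergeTree table; ORDER BY defines the sparse primary index.
CREATE TABLE users_local (
    id UInt32,
    name String,
    email String
)
ENGINE = MergeTree
ORDER BY id;

-- One-time copy: reads the whole object from S3 and stores it locally.
INSERT INTO users_local
SELECT id, name, email
FROM users_s3;

-- This lookup can now use the primary index instead of scanning S3.
SELECT name, email FROM users_local WHERE id = 2;
```

The copy is a snapshot: later changes to the object in S3 are not reflected until you load it again.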

The most surprising thing is how seamlessly ClickHouse integrates with object storage. You can even INSERT data into an S3 table, which will write data directly to an S3 object, effectively using S3 as a sink.

-- Example of inserting data into S3
INSERT INTO TABLE users_s3 (id, name, email) VALUES (4, 'David', 'david@example.com');

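S3 objects cannot be appended to, so this INSERT does not add a line to the existing users.csv. By default, ClickHouse refuses to write when the target object already exists. The setting s3_truncate_on_insert makes the INSERT overwrite the object, and s3_create_new_file_on_insert makes it write a new object with a numeric suffix in the key instead.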
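Because S3 objects cannot be modified in place, what an INSERT does when the object already exists is controlled by session settings. A sketch of the two options, using the users_s3 table from earlier:

```sql
-- Option 1: write each INSERT to a new object next to the original
-- (the key gets a numeric suffix, so users.csv itself is left untouched).
SET s3_create_new_file_on_insert = 1;
INSERT INTO users_s3 (id, name, email) VALUES (4, 'David', 'david@example.com');

-- Option 2: replace the existing object with only the newly inserted rows.
-- This discards the previous contents of users.csv, so use it deliberately.
SET s3_create_new_file_on_insert = 0;
SET s3_truncate_on_insert = 1;
INSERT INTO users_s3 (id, name, email) VALUES (4, 'David', 'david@example.com');
```

With option 1, a table that points at the single key users.csv will not see the new objects; reading them back needs a path with a glob pattern.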

The next step in optimizing this pattern is to explore using columnar formats like Parquet or ORC, which ClickHouse can read much more efficiently from S3, enabling predicate pushdown and column pruning.
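As a rough sketch of that next step, you could write the same rows out as Parquet and query the result. The object name users.parquet and the table name users_s3_parquet are hypothetical, and the example assumes the server has write access to the bucket.

```sql
-- Same schema, but backed by a Parquet object.
CREATE TABLE users_s3_parquet (
    id UInt32,
    name String,
    email String
)
ENGINE = S3(
    'https://my-data-bucket.s3.amazonaws.com/users.parquet',
    'Parquet'
);

-- Convert once: read the CSV-backed table and write a Parquet object.
INSERT INTO users_s3_parquet
SELECT id, name, email
FROM users_s3;

-- Only the columns named in the query need to be fetched from the Parquet object.
SELECT email FROM users_s3_parquet WHERE id = 2;
```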
