The biggest surprise is that these OLAP engines, although they are often compared head to head, optimize for fundamentally different priorities. Understanding those priorities is the most reliable way to make the right choice.

Let’s see ClickHouse in action. Imagine you have a massive dataset of user events arriving at millions of rows per second, and you need to answer questions like "how many users from Germany visited page X in the last hour?"

SELECT
    country,
    COUNT(DISTINCT user_id) AS distinct_users,
    count() AS total_events
FROM
    user_events
WHERE
    event_time >= now() - INTERVAL 1 HOUR
    AND country = 'Germany'
    AND page = '/X'
GROUP BY
    country
ORDER BY
    distinct_users DESC
LIMIT 10;

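On a table sorted to suit the filter, ClickHouse can often answer this in milliseconds. How? It’s built for speed on a single cluster, prioritizing raw query performance through columnar storage, aggressive data compression (with codecs like LZ4 or ZSTD), and vectorized query execution, which processes batches of column values at a time instead of going row by row. The engine itself is written in C++, and it can JIT-compile some expressions with LLVM; it does not translate SQL into C++ source code. Think of it as a specialized race car: incredibly fast on a track, but maybe not the best for a long road trip with lots of luggage.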
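
To make the columnar idea concrete, here is a minimal Python sketch of the query above running against a toy column store. It is an illustration under stated assumptions, not how ClickHouse is implemented: the `columns` dict, the sample rows, and the `scan_germany_page_x` function are all hypothetical, and the fixed `now` stands in for `now()` so the example is repeatable.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory "column store": one list per column, all the same length.
# Real ClickHouse data lives in compressed on-disk column files; this only mirrors the layout.
now = datetime(2023, 10, 27, 12, 0, 0)
columns = {
    "event_time": [now - timedelta(minutes=m) for m in (5, 10, 20, 45, 90, 30)],
    "country":    ["Germany", "Germany", "France", "Germany", "Germany", "Germany"],
    "page":       ["/X", "/X", "/X", "/Y", "/X", "/X"],
    "user_id":    [1, 2, 3, 1, 4, 1],
}

def scan_germany_page_x(cols, since):
    """Answer the query by touching only the columns it names."""
    # Each predicate runs over one whole column at a time (a crude stand-in
    # for vectorized execution) and produces a boolean mask.
    recent = [t >= since for t in cols["event_time"]]
    german = [c == "Germany" for c in cols["country"]]
    on_page = [p == "/X" for p in cols["page"]]
    mask = [a and b and c for a, b, c in zip(recent, german, on_page)]

    # user_id is read last, and only rows that passed the filter are kept.
    matched_users = [u for u, keep in zip(cols["user_id"], mask) if keep]
    return {
        "distinct_users": len(set(matched_users)),
        "total_events": len(matched_users),
    }

result = scan_germany_page_x(columns, since=now - timedelta(hours=1))
print(result)
```

The point of the layout is that a query naming four columns never reads the others, and each column holds values of one type, which is also what makes compression so effective.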

Druid, on the other hand, is designed for real-time analytics and fast slice-and-dice operations on streaming data. Its architecture splits the work across several process types: ingestion tasks (run by MiddleManagers), Historical processes that serve stored data, Brokers that route queries and merge results, and Coordinators that manage where data lives. Data is stored in immutable, time-partitioned "segments" that are distributed across the cluster.

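Here’s a conceptual look at a Druid Kafka ingestion (supervisor) spec. It is trimmed for readability: a working spec also needs details such as an inputFormat inside ioConfig. Note too that keeping user_id as a dimension limits how far rollup can collapse rows.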

{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "user_events_stream",
      "timestampSpec": {
        "column": "event_time",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [
          "user_id",
          "country",
          "page"
        ]
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "hyperUnique",
          "name": "distinct_users",
          "fieldName": "user_id"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true
      }
    },
    "ioConfig": {
      "topic": "user_events",
      "consumerProperties": {
        "bootstrap.servers": "kafka:9092"
      }
    },
    "tuningConfig": {
      "maxRowsPerSegment": 500000
    }
  }
}

Druid excels at interactive dashboards where users are constantly filtering and drilling down into data that’s just arrived. It uses a combination of columnar storage and specialized indexing (like bitmap indexes) for fast aggregations. Its strength is low-latency querying on massive, append-only datasets, especially when dealing with time-series data. It’s more like a robust, scalable data warehouse optimized for interactive exploration.
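
The two ideas in that paragraph, rollup at ingestion and bitmap indexes at query time, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Druid's segment format: `raw_events`, `rollup`, and `build_bitmap_index` are hypothetical names, Python integers stand in for compressed bitmaps, and an exact set stands in for the approximate sketch that a hyperUnique metric would store.

```python
from collections import defaultdict
from datetime import datetime

raw_events = [
    # (event_time, user_id, country, page)
    (datetime(2023, 10, 27, 12, 0, 5), "u1", "Germany", "/X"),
    (datetime(2023, 10, 27, 12, 0, 40), "u2", "Germany", "/X"),
    (datetime(2023, 10, 27, 12, 0, 59), "u1", "Germany", "/X"),
    (datetime(2023, 10, 27, 12, 1, 10), "u3", "France", "/X"),
    (datetime(2023, 10, 27, 12, 1, 30), "u1", "Germany", "/Y"),
]

def rollup(events):
    """Pre-aggregate at ingestion: one row per (minute, country, page)."""
    rows = defaultdict(lambda: {"count": 0, "users": set()})
    for ts, user, country, page in events:
        minute = ts.replace(second=0, microsecond=0)  # like queryGranularity MINUTE
        row = rows[(minute, country, page)]
        row["count"] += 1
        row["users"].add(user)  # exact set here; Druid would keep an approximate sketch
    return sorted(rows.items(), key=lambda item: item[0])

def build_bitmap_index(rows, dim_position):
    """Map each dimension value to an int whose set bits are the matching row numbers."""
    index = defaultdict(int)
    for row_number, (key, _agg) in enumerate(rows):
        index[key[dim_position]] |= 1 << row_number
    return index

segment = rollup(raw_events)
country_index = build_bitmap_index(segment, dim_position=1)
page_index = build_bitmap_index(segment, dim_position=2)

# WHERE country = 'Germany' AND page = '/X' becomes a bitwise AND of two bitmaps.
matching = country_index["Germany"] & page_index["/X"]
total_events = sum(
    agg["count"]
    for row_number, (_key, agg) in enumerate(segment)
    if (matching >> row_number) & 1
)
print(len(raw_events), "raw events ->", len(segment), "rolled-up rows;", total_events, "match")
```

Five raw events collapse into three stored rows, and the filter is resolved by combining bitmaps before any metric value is read. Both effects grow with data volume, which is why dashboards that filter on a few dimensions suit Druid well.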

Presto (and its fork Trino, formerly known as PrestoSQL) is a distributed SQL query engine designed to query data where it lives, across sources like S3, HDFS, Kafka, relational databases, and more. It’s an excellent federated query engine.

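Consider this Presto query, which joins event data in S3 with a table in a relational database in a single statement (the catalog names s3 and mysql are illustrative; catalogs are named by whoever configures the cluster):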

SELECT
    u.user_id,
    u.signup_date,
    e.event_type
FROM
    s3.user_events_log.events e
JOIN
    mysql.users_db.users u ON e.user_id = u.id
WHERE
    e.event_date = DATE '2023-10-27'
LIMIT 100;

Presto’s magic lies in its ability to push down processing to the data sources where possible and perform distributed joins and aggregations across them. It doesn’t store data itself; it queries it. This makes it incredibly flexible for data lakes and data mesh architectures. Think of it as a powerful universal translator that can speak to many different data systems fluently.
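
Predicate pushdown is easier to appreciate with numbers. The sketch below is a hedged illustration, not Presto's connector API: an in-memory SQLite database plays the role of a remote relational source, and `fetch_without_pushdown` and `fetch_with_pushdown` are hypothetical helpers that compare how many rows cross the boundary in each case.

```python
import sqlite3

# Hypothetical remote source, played here by an in-memory SQLite database.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT, signup_date TEXT)"
)
source.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(i, "Germany" if i % 4 == 0 else "France", "2023-01-01") for i in range(1000)],
)

def fetch_without_pushdown(conn, country):
    """Pull every row from the source, then filter inside the query engine."""
    rows = conn.execute("SELECT id, country FROM users").fetchall()
    return len(rows), [r for r in rows if r[1] == country]

def fetch_with_pushdown(conn, country):
    """Ship the predicate to the source so only matching rows are transferred."""
    rows = conn.execute(
        "SELECT id, country FROM users WHERE country = ?", (country,)
    ).fetchall()
    return len(rows), rows

moved_naive, naive = fetch_without_pushdown(source, "Germany")
moved_pushed, pushed = fetch_with_pushdown(source, "Germany")
print("rows transferred without pushdown:", moved_naive)
print("rows transferred with pushdown:", moved_pushed)
```

Both paths return the same answer, but the pushed-down version moves a quarter of the rows. How much a real connector can push down depends on the source and the connector, so treat this as the best case rather than a guarantee.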

The most counterintuitive aspect is how well Presto performs even when it queries disparate data sources that were never modeled for OLAP. It uses a pipelined, in-memory execution model: data streams between operators and stages instead of being written to disk between steps, which keeps latency low. The trade-off is that large joins and aggregations have to fit within the cluster’s memory limits unless spilling to disk is enabled. In practice this means it can often query data in S3 or HDFS faster than you might expect, even without a dedicated OLAP data model, simply by leveraging its distributed processing power.
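
Python generators give a rough feel for pipelined execution. The sketch below is a single-process analogy under stated assumptions, not Presto's engine: `scan_events`, `filter_date`, and `hash_join` are hypothetical operators, the `users` dict stands in for the small MySQL table, and `islice` plays the role of LIMIT. Real engines move pages of rows between operators rather than single rows.

```python
from itertools import islice

scanned = {"events": 0}  # counts rows actually pulled from the hypothetical scan

def scan_events():
    """Stand-in for a streaming scan of event files in object storage."""
    for i in range(1_000_000):
        scanned["events"] += 1
        yield {
            "user_id": i % 5,
            "event_type": "click" if i % 2 else "view",
            "event_date": "2023-10-27" if i % 3 == 0 else "2023-10-26",
        }

def filter_date(rows, date):
    """Pass through only rows for the requested date, one at a time."""
    return (r for r in rows if r["event_date"] == date)

def hash_join(probe_rows, build_table):
    """Build side sits in memory as a dict; the probe side streams through."""
    for r in probe_rows:
        user = build_table.get(r["user_id"])
        if user is not None:
            yield {
                "user_id": user["id"],
                "signup_date": user["signup_date"],
                "event_type": r["event_type"],
            }

# Small "mysql" table with ids 0..2, used as the build side of the join.
users = {i: {"id": i, "signup_date": f"2023-01-0{i + 1}"} for i in range(3)}

pipeline = hash_join(filter_date(scan_events(), "2023-10-27"), users)
first_rows = list(islice(pipeline, 4))  # LIMIT 4

print(first_rows)
print("rows scanned:", scanned["events"])
```

No operator builds a full intermediate result, and the scan stops as soon as the limit is satisfied, long before the million-row source is exhausted. The only state held in memory is the build side of the join, which is the same reason memory limits matter for large joins.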

The next crucial decision point is how you handle data updates and schema evolution.
