The most surprising thing about scaling event consumers horizontally with consumer groups is that it fundamentally changes how events are delivered, moving from a broadcast to a point-to-point delivery model.
Imagine a Kafka topic as a firehose of events, and your application needs to drink from it. If you have multiple instances of your application, and each one independently reads from the same topic, they’ll all get the exact same events. This is fine for some scenarios, but if you want to process events in parallel without duplication, you need a different approach. That’s where consumer groups come in.
Let’s say you have a Kafka topic named user_events and you want to scale your event processing application. You deploy two instances of your application, app-instance-1 and app-instance-2. If both instances are configured to read from user_events without a consumer group, they’ll both receive every event.
Here’s how it looks without consumer groups:
Topic: user_events
Partition 0: [event1, event2, event3, event4]
Partition 1: [event5, event6, event7, event8]
app-instance-1 reads: event1, event2, event3, event4, event5, event6, event7, event8
app-instance-2 reads: event1, event2, event3, event4, event5, event6, event7, event8
Notice the duplication. Now, let’s introduce a consumer group. You assign a unique group.id to all instances of your application that should work together to consume from this topic. Let’s call it user_processing_group.
When you configure app-instance-1 and app-instance-2 with group.id=user_processing_group and set them to consume from user_events, Kafka’s broker orchestrates the distribution of partitions. Each partition within a topic is assigned to exactly one consumer instance within a given consumer group at any given time.
Here’s how it looks with group.id=user_processing_group:
Topic: user_events
Partition 0: [event1, event2, event3, event4]
Partition 1: [event5, event6, event7, event8]
app-instance-1 (group.id=user_processing_group) reads: event1, event2, event3, event4
app-instance-2 (group.id=user_processing_group) reads: event5, event6, event7, event8
Now, each instance is processing a subset of the events, effectively dividing the workload. If you add a third instance, app-instance-3, Kafka will rebalance the partitions again. For example:
Topic: user_events
Partition 0: [event1, event2, event3, event4]
Partition 1: [event5, event6, event7, event8]
app-instance-1 (group.id=user_processing_group) reads: event1, event2
app-instance-2 (group.id=user_processing_group) reads: event3, event4
app-instance-3 (group.id=user_processing_group) reads: event5, event6, event7, event8
This partitioning and rebalancing is the core mechanism for horizontal scaling. You can add more consumer instances to your user_processing_group, and Kafka will automatically distribute the partitions among them, increasing your processing throughput. The number of consumers in a group can be at most equal to the number of partitions in the topic for maximum parallelism. Adding more consumers than partitions won’t increase throughput; some consumers will simply be idle.
The mental model to build here is that a consumer group is a logical subscription to a topic. All consumers in the same group share the responsibility of consuming messages from the topic’s partitions. Kafka’s brokers manage which consumer gets which partition. This is achieved through a coordination mechanism where consumers in a group elect a "leader" for each partition, and that leader is responsible for tracking offsets and assigning work.
The key levers you control are:
group.id: This is the identifier for your consumer group. All consumers that should process events collaboratively must share the samegroup.id.auto.offset.reset: This setting (latestorearliest) determines where a consumer starts reading if it’s joining a group for the first time or if its previous offset is no longer available.enable.auto.commit: Controls whether the consumer automatically commits its offsets periodically. Iftrue, Kafka commits the offset. Iffalse, you’re responsible for manually committing offsets after processing messages, which provides more control over exactly-once processing semantics (though true exactly-once is complex).
To illustrate, consider a simple Java consumer configuration for a group named my-app-group:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-app-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("auto.offset.reset", "earliest"); // Start from the beginning if no offset found
props.put("enable.auto.commit", "true"); // Auto-commit offsets
props.put("auto.commit.interval.ms", "1000"); // Commit every second
When you start multiple instances of your application with this configuration, they will automatically join my-app-group. Kafka will then ensure that each partition of the subscribed topics is assigned to only one consumer within that group. If a consumer instance crashes, Kafka detects this and rebalances the partitions among the remaining active consumers in the group.
The offset management is crucial. Kafka stores the committed offset for each partition per consumer group. This allows a consumer to resume from where it left off if it restarts. The group.id is the key that Kafka uses to look up these offsets.
When you add a new consumer to an existing group, Kafka triggers a rebalance. This process involves all active consumers in the group temporarily stopping to consume, re-electing partition leaders, and then resuming consumption with the new partition assignments. This rebalance is a critical operation; if it fails or takes too long, it can lead to processing interruptions. The session.timeout.ms and heartbeat.interval.ms configurations are vital for managing how quickly Kafka detects a failed consumer and initiates a rebalance. If a consumer doesn’t send a heartbeat to the broker within session.timeout.ms, it’s considered dead. The heartbeat is sent at heartbeat.interval.ms.
The rebalance protocol itself is managed by Kafka’s internal coordination service, typically ZooKeeper or Kafka’s own KRaft protocol. Consumers in a group maintain a session with the coordinator. When a consumer joins or leaves (or is detected as failed), a rebalance is initiated. The coordinator assigns partitions to consumers in a way that ensures each partition is assigned to at most one consumer within the group.
The most common pitfall when scaling with consumer groups is misunderstanding the relationship between the number of partitions and the number of consumers. If you have a topic with 4 partitions and 10 consumers in the same group, only 4 consumers will be actively processing messages at any given time. The other 6 will be idle, waiting for a rebalance (which would only happen if one of the active 4 consumers fails). To achieve maximum parallelism, the number of active consumers in a group should ideally match the number of partitions.
The next concept to explore is how to manage state and ensure idempotency when consuming events, especially when dealing with manual offset commits or potential retries after failures.