Flink’s keyed streams are the engine that lets you scale stateful processing, but the real magic is how they distribute that state across your cluster.

Let’s see it in action. Imagine a simple word count where we want to track counts per word across a stream of sentences.

// DataStream<String> rawStream = ... // Your incoming stream of sentences

DataStream<Tuple2<String, Long>> wordCounts = rawStream
    .flatMap(new FlatMapFunction<String, Tuple2<String, Long>>() {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Long>> out) {
            for (String word : value.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    out.collect(new Tuple2<>(word, 1L));
                }
            }
        }
    })
    .keyBy(value -> value.f0) // Key by the word (field f0); replaces the deprecated positional form keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.seconds(5))) // Optional: 5-second tumbling windows; event time needs timestamps and watermarks assigned upstream
    .reduce(new ReduceFunction<Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> reduce(Tuple2<String, Long> value1, Tuple2<String, Long> value2) {
            return new Tuple2<>(value1.f0, value1.f1 + value2.f1);
        }
    });

// wordCounts.print(); // To visualize the output

Here, the keyBy call is the critical step. Flink uses the String at position 0 of each Tuple2<String, Long> (the word, field f0) as the key, which guarantees that all tuples with the same word are processed by the same parallel task instance. That guarantee is what keeps per-word state, such as the running count, consistent. Without it, tuples for the word "flink" could arrive at several parallel tasks, and its count would be scattered across them. keyBy ensures all "flink" tuples land on one specific task.

The problem keyed streams solve is state management in distributed, fault-tolerant stream processing. Without keyBy, a record can be processed by any parallel instance of an operator, so no single instance is guaranteed to see every record for a given identifier. If you need to maintain a count, a sum, or any other accumulating state per identifier (a user ID, a product SKU, or in our example, a word), all records for that identifier must go to the same processing unit. keyBy provides this guarantee by hash-partitioning the stream: Flink deterministically maps each key to exactly one parallel instance (subtask) of the downstream operator, and that subtask runs in a task slot on some TaskManager. All state associated with the key is stored and managed by that subtask.
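To make the guarantee concrete, here is a minimal plain-Java model of keyed partitioning. It is not Flink code: the class KeyedRoutingSketch and the method subtaskFor are hypothetical names, and the simple hashCode-modulo rule is an assumption standing in for Flink's real key assignment. It only illustrates that each key is owned by exactly one subtask, which holds that key's state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of keyed partitioning; illustrative only, not Flink code.
final class KeyedRoutingSketch {

    // Hypothetical stand-in for Flink's key-to-subtask assignment:
    // a deterministic function of the key, so equal keys always match.
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Routes every word to its owning subtask and updates that subtask's local count.
    static List<Map<String, Long>> countWords(List<String> words, int parallelism) {
        List<Map<String, Long>> stateBySubtask = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            stateBySubtask.add(new HashMap<>());
        }
        for (String word : words) {
            stateBySubtask.get(subtaskFor(word, parallelism)).merge(word, 1L, Long::sum);
        }
        return stateBySubtask;
    }

    public static void main(String[] args) {
        List<String> words = List.of("flink", "state", "flink", "stream", "flink");
        List<Map<String, Long>> state = countWords(words, 4);
        for (int i = 0; i < state.size(); i++) {
            System.out.println("subtask " + i + " -> " + state.get(i));
        }
    }
}
```

Whichever subtask ends up owning "flink", it is the only one whose map contains that word, so the count of 3 lives in a single place.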

Internally, Flink keeps the state for each key in a state backend that is local to the task, either on the JVM heap or in an embedded RocksDB instance on local disk. When a record arrives, Flink routes it to the task responsible for its key. That task reads its local state, updates it according to the operator logic (such as reduce or process), and writes the result back. Fault tolerance comes from checkpoints: Flink periodically snapshots this state to durable storage, such as a distributed filesystem. If a task fails, Flink restores its state from the most recent completed checkpoint and, given a replayable source, reprocesses the records that arrived after that checkpoint, so the state ends up as if the failure had not happened. The keyBy transformation is essentially a distributed hash partitioning mechanism that spreads the keys (and thus the state) across the available parallel instances of an operator.
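The snapshot-and-restore cycle can be sketched in a few lines of plain Java. This is a toy model under stated assumptions, not Flink's checkpointing mechanism: CheckpointSketch and its methods are hypothetical, and a second in-memory map stands in for durable checkpoint storage.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of one subtask's keyed state with snapshot and restore; not Flink code.
final class CheckpointSketch {
    // Local working state, as a task would keep on heap or in RocksDB.
    private Map<String, Long> counts = new HashMap<>();
    // Stands in for the last completed checkpoint in durable storage (assumption).
    private Map<String, Long> lastCheckpoint = new HashMap<>();

    void process(String word) {
        counts.merge(word, 1L, Long::sum);
    }

    // Copies the current state, like a completed checkpoint.
    void checkpoint() {
        lastCheckpoint = new HashMap<>(counts);
    }

    // Discards the working state and reloads the last snapshot.
    void failAndRestore() {
        counts = new HashMap<>(lastCheckpoint);
    }

    long countOf(String word) {
        return counts.getOrDefault(word, 0L);
    }

    public static void main(String[] args) {
        CheckpointSketch task = new CheckpointSketch();
        task.process("flink");
        task.process("flink");
        task.checkpoint();          // snapshot holds flink=2
        task.process("flink");      // working state is now flink=3
        task.failAndRestore();      // back to flink=2
        task.process("flink");      // the source replays the record seen after the checkpoint
        System.out.println("flink -> " + task.countOf("flink"));
    }
}
```

The replay step is the part that is easy to overlook: the restore alone rolls the count back to 2, and only re-reading the post-checkpoint record from the source brings it back to 3.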

The parallelism you configure for the operator that follows keyBy directly determines how your state is distributed. (keyBy itself is not an operator; it only defines the partitioning.) With a parallelism of 16, Flink partitions the key space so that each of the 16 instances is responsible for a subset of the unique keys and manages local state only for those keys. This is how you achieve scalability: more parallel instances mean more tasks can process different key partitions concurrently, and each task manages a smaller portion of the overall state.
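As a rough illustration of how keys spread over 16 instances, the sketch below uses a two-step mapping: a key is first hashed into one of a fixed number of key groups (bounded by the job's maximum parallelism), and each subtask then owns a contiguous range of key groups. To my understanding this resembles Flink's key-group scheme, but treat the details as assumptions: KeyGroupSketch is a hypothetical class, the value 128 for maximum parallelism is chosen for the example, and plain hashCode replaces the additional hashing Flink applies.

```java
// Sketch of key -> key group -> subtask assignment; illustrative only, not Flink code.
final class KeyGroupSketch {

    // Step 1: hash the key into one of maxParallelism key groups.
    // Plain hashCode is a simplification (assumption).
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Step 2: map a key group to the subtask that owns its range.
    static int subtaskForKeyGroup(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    static int subtaskFor(Object key, int maxParallelism, int parallelism) {
        return subtaskForKeyGroup(keyGroupFor(key, maxParallelism), maxParallelism, parallelism);
    }

    // Counts how many key groups each subtask owns.
    static int[] keyGroupsPerSubtask(int maxParallelism, int parallelism) {
        int[] owned = new int[parallelism];
        for (int keyGroup = 0; keyGroup < maxParallelism; keyGroup++) {
            owned[subtaskForKeyGroup(keyGroup, maxParallelism, parallelism)]++;
        }
        return owned;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // assumed value for this example
        int parallelism = 16;
        int[] owned = keyGroupsPerSubtask(maxParallelism, parallelism);
        for (int i = 0; i < owned.length; i++) {
            System.out.println("subtask " + i + " owns " + owned[i] + " key groups");
        }
        System.out.println("\"flink\" -> subtask " + subtaskFor("flink", maxParallelism, parallelism));
    }
}
```

With 128 key groups and 16 subtasks, every subtask owns 8 key groups. The indirection matters when you rescale: the key-to-key-group mapping stays fixed, and only the ranges of key groups are redistributed among the new number of subtasks.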

One thing most people don't realize is that keyBy performs no computation of its own: it is a logical partitioning step that tells Flink how to route records. It is not free, though. keyBy introduces a network shuffle in which each record is serialized and sent to the downstream task chosen by the hash of its key, and that task may run on a different TaskManager. What keeps this efficient is that every record goes to exactly one downstream task rather than being broadcast to all of them, and Flink's network stack delivers it there directly.

The next concept you’ll likely encounter is how to manage state size and performance as your number of unique keys grows, leading into topics like state TTL (Time-To-Live) and choosing the right state backend.
