A Flink job graph isn’t just a static blueprint; it’s a dynamic representation of how your data flows and is processed, and understanding its nuances is key to unlocking Flink’s true power.

Let’s see this in action. Imagine a small Flink streaming job that reads from Kafka, applies a map operation, and writes to another Kafka topic.

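import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// fromSource() needs the unified KafkaSource; kafkaProps (java.util.Properties) must set bootstrap.servers
KafkaSource<String> source = KafkaSource.<String>builder()
    .setProperties(kafkaProps)
    .setTopics("input-topic")
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();
DataStream<String> sourceStream =
    env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");

DataStream<String> mappedStream = sourceStream.map(String::toUpperCase);

// sinkTo() likewise needs the unified KafkaSink rather than the legacy FlinkKafkaProducer
KafkaSink<String> sink = KafkaSink.<String>builder()
    .setKafkaProducerConfig(kafkaProps)
    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
        .setTopic("output-topic")
        .setValueSerializationSchema(new SimpleStringSchema())
        .build())
    .build();
mappedStream.sinkTo(sink);

env.execute("Simple Kafka to Kafka");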

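Calling env.execute() doesn’t run your operators directly. Flink first constructs a logical graph from your API calls (the StreamGraph) and then condenses it into the JobGraph that is submitted to the cluster. Both represent the dataflow as a Directed Acyclic Graph (DAG) where nodes are operators (like map, filter, source, sink) and edges represent the streams of data flowing between them. In the JobGraph, operators that can run back-to-back in the same thread are chained into a single vertex.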

After the logical job graph is built, the JobManager translates it into an execution graph. This is the more detailed, physical representation: it breaks each logical operator down into parallel subtasks. A single logical operator is executed by multiple subtasks whenever its parallelism is greater than 1. For instance, the Kafka Source operator is split into 4 parallel source subtasks if your job’s parallelism is 4. Each subtask is scheduled into a task slot on a specific TaskManager instance.
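To make the expansion from operators to parallel instances concrete, here is a minimal sketch in plain Java. It is a toy model, not Flink's ExecutionGraph API: the class and method names are hypothetical, and it ignores operator chaining, which in a real job could fuse the source and map into one task.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names, not part of Flink.
class SubtaskExpansionSketch {

    /** Expands each logical operator into one named subtask per unit of parallelism. */
    static List<String> expand(Map<String, Integer> parallelismByOperator) {
        List<String> subtasks = new ArrayList<>();
        for (Map.Entry<String, Integer> op : parallelismByOperator.entrySet()) {
            int parallelism = op.getValue();
            for (int i = 1; i <= parallelism; i++) {
                // Label each instance as "name (index/parallelism)".
                subtasks.add(op.getKey() + " (" + i + "/" + parallelism + ")");
            }
        }
        return subtasks;
    }

    public static void main(String[] args) {
        // The three operators from the example job, each at parallelism 4.
        Map<String, Integer> job = new LinkedHashMap<>();
        job.put("Kafka Source", 4);
        job.put("Map", 4);
        job.put("Kafka Sink", 4);
        expand(job).forEach(System.out::println);
    }
}
```

Three logical operators at parallelism 4 become twelve schedulable units in this model, which is the jump in detail between the two graphs.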

The execution graph also defines how records are exchanged between subtasks, and this is crucial. Common strategies are FORWARD, REBALANCE, RESCALE, BROADCAST, and HASH. FORWARD sends each record from an upstream subtask to exactly one local downstream subtask, one-to-one, which avoids network overhead and requires both sides to have the same parallelism. REBALANCE distributes records evenly across all downstream subtasks in round-robin order (this is distinct from shuffle(), which picks a downstream subtask at random). RESCALE is a cheaper round-robin in which each upstream subtask distributes only to a subset of the downstream subtasks. BROADCAST sends every record to every downstream subtask. HASH is what keyBy() produces: records with the same key always go to the same subtask. The choice of partitioner significantly affects network traffic and processing latency.
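The routing rules described above can be sketched as plain functions that pick a downstream subtask index. This is a simplified model with hypothetical names, not Flink's partitioner classes. In particular, it starts the round-robin at index 0 (Flink's rebalance starts at a random channel), and the RESCALE version assumes the downstream parallelism is a multiple of the upstream parallelism.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: simplified channel selection, not Flink's implementation.
class PartitionerSketch {

    /** FORWARD: upstream subtask i sends only to downstream subtask i. */
    static int forward(int upstreamIndex) {
        return upstreamIndex;
    }

    /** REBALANCE: the n-th record of an upstream subtask cycles over all downstream subtasks. */
    static int rebalance(long recordNumber, int downstreamParallelism) {
        return (int) (recordNumber % downstreamParallelism);
    }

    /** RESCALE: round-robin, but only within this upstream subtask's slice of downstream subtasks. */
    static int rescale(int upstreamIndex, long recordNumber,
                       int upstreamParallelism, int downstreamParallelism) {
        // Assumption: downstreamParallelism is a multiple of upstreamParallelism.
        int sliceSize = downstreamParallelism / upstreamParallelism;
        int sliceStart = upstreamIndex * sliceSize;
        return sliceStart + (int) (recordNumber % sliceSize);
    }

    /** BROADCAST: every record goes to every downstream subtask. */
    static List<Integer> broadcast(int downstreamParallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < downstreamParallelism; i++) {
            targets.add(i);
        }
        return targets;
    }

    public static void main(String[] args) {
        // Route the first four records of upstream subtask 1 (2 upstream, 4 downstream subtasks).
        for (long n = 0; n < 4; n++) {
            System.out.println("record " + n
                    + " rebalance->" + rebalance(n, 4)
                    + " rescale->" + rescale(1, n, 2, 4));
        }
        System.out.println("broadcast->" + broadcast(4));
    }
}
```

In this model RESCALE keeps upstream subtask 1 inside its own slice of downstream subtasks, while REBALANCE touches all four, which is where the difference in network fan-out comes from.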

Consider the map operation in our example. If the parallelism is set to 4, Flink will create 4 parallel map subtasks, fed by the 4 parallel source subtasks. With REBALANCE, each source subtask sends its records round-robin to all 4 map subtasks. With FORWARD, source subtask 1 sends only to map subtask 1, source subtask 2 to map subtask 2, and so on, which is only possible when the source and the map have the same parallelism. If you don’t choose a partitioner yourself, Flink uses FORWARD when the two parallelisms match and REBALANCE when they differ.
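One way to see the cost difference in the 4-to-4 scenario is to count how many distinct source-to-map channels actually carry data. The following is a toy calculation in plain Java with hypothetical names. It models only FORWARD and a round-robin REBALANCE and does not call any Flink API.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: counts which (source, map) pairs carry at least one record.
class ChannelCountSketch {

    /** Returns the number of distinct source->map channels used under the given strategy. */
    static int activeChannels(String strategy, int parallelism, int recordsPerSource) {
        Set<String> channels = new HashSet<>();
        for (int source = 0; source < parallelism; source++) {
            for (int n = 0; n < recordsPerSource; n++) {
                // FORWARD keeps the same index; anything else is treated as round-robin REBALANCE.
                int target = strategy.equals("FORWARD") ? source : n % parallelism;
                channels.add(source + "->" + target);
            }
        }
        return channels.size();
    }

    public static void main(String[] args) {
        System.out.println("FORWARD channels:   " + activeChannels("FORWARD", 4, 100));
        System.out.println("REBALANCE channels: " + activeChannels("REBALANCE", 4, 100));
    }
}
```

With 4 subtasks on each side, FORWARD uses 4 channels and REBALANCE uses all 16, so the all-to-all exchange grows with the product of the two parallelisms.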

You can inspect these graphs via the Flink Web UI. Under the "Jobs" tab, click on your running job and you’ll see a graph view. This view shows the job graph after operator chaining, so chained operators appear as a single box, and the edges between boxes are labeled with the exchange strategy. For a closer look at the execution graph, click a vertex to drill down into its parallel subtasks and see where each one is deployed. You can also use the StreamExecutionEnvironment.getExecutionPlan() method to get a JSON representation of the logical plan (the stream graph, before chaining and deployment).

// At the end of your job setup, before env.execute()
String executionPlanJson = env.getExecutionPlan();
System.out.println(executionPlanJson);

This JSON output lists every node in the plan with its ID, type, and parallelism and, for each input edge, the upstream node and the ship strategy used for the exchange. An entry for the map operator might look roughly like {"id":2,"type":"Map","pact":"Operator","contents":"Map","parallelism":4,"predecessors":[{"id":1,"ship_strategy":"FORWARD","side":"second"}]} (illustrative; exact fields and IDs vary between Flink versions). This tells you the Map node (ID 2) has 4 parallel instances and receives its input from node 1 over a FORWARD exchange.
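To scan a large plan for expensive exchanges, it can help to pull out just the ship strategies. Below is a minimal sketch using only the JDK. It assumes the strategy appears under a key spelled ship_strategy or shipStrategy (key names may differ between Flink versions, so check your own output), the class name is hypothetical, and the sample string is hand-written rather than captured from a real job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: a regex scan, not a full JSON parser.
class ShipStrategySketch {

    // Accepts "ship_strategy" or "shipStrategy", with optional whitespace around the colon.
    private static final Pattern SHIP_STRATEGY =
            Pattern.compile("\"ship_?[sS]trategy\"\\s*:\\s*\"([^\"]+)\"");

    /** Returns every ship strategy value found in the plan, in order of appearance. */
    static List<String> shipStrategies(String planJson) {
        List<String> found = new ArrayList<>();
        Matcher matcher = SHIP_STRATEGY.matcher(planJson);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Hand-written sample plan (an assumption about the shape, for illustration only).
        String samplePlan = "{\"nodes\":["
                + "{\"id\":1,\"parallelism\":4},"
                + "{\"id\":2,\"parallelism\":4,\"predecessors\":"
                + "[{\"id\":1,\"ship_strategy\":\"FORWARD\"}]}]}";
        System.out.println(shipStrategies(samplePlan));
    }
}
```

In a real job you would pass the string returned by env.getExecutionPlan() instead of the sample, then look for any strategy other than FORWARD on high-volume edges.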

The mental model to build is that Flink first reasons about the what (logical graph: operators and their dependencies) and then the how (execution graph: tasks, parallelism, and data shuffling). The execution graph is where performance bottlenecks often hide, particularly due to inefficient data shuffling or under-provisioned parallelism.

One detail that is often overlooked is how stateful operations and checkpointing fit into this execution graph. Each parallel subtask of a stateful operator (e.g., a keyBy().window().sum()) is responsible for its own portion of the state. During a checkpoint, Flink has to coordinate state snapshots across all of these subtasks, and it does so by injecting checkpoint barriers that travel along the same edges as the data. For instance, a stateful subtask downstream of a rebalance or keyBy has one input channel per upstream subtask, and with aligned checkpoints it must receive the barrier on every channel before it snapshots its state, so a slow or backpressured network exchange directly lengthens checkpoint time.
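As background on what "its own portion of the state" means for keyed state: Flink divides the key space into key groups (as many as the operator's maximum parallelism) and gives each subtask a contiguous range of them. The sketch below illustrates that idea in plain Java. The names are hypothetical, and the hashing is simplified, since Flink additionally applies a murmur hash to the key's hashCode().

```java
// Toy model only: illustrates key groups, not Flink's actual assignment code.
class KeyGroupSketch {

    /** Maps a key to a key group. Simplification: plain hashCode, no murmur hash. */
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    /** Maps a key group to the subtask that owns it; each subtask gets a contiguous range. */
    static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // assumed value for this sketch
        int parallelism = 4;
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            int keyGroup = keyGroupFor(key, maxParallelism);
            System.out.println(key + " -> key group " + keyGroup
                    + " -> subtask " + subtaskFor(keyGroup, maxParallelism, parallelism));
        }
    }
}
```

Because ownership is defined per key group rather than per key, changing the parallelism only moves whole key groups between subtasks, which is what makes restoring a checkpoint at a different parallelism practical.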

Understanding these graphs allows you to diagnose performance issues, optimize resource utilization, and predict how changes in your job logic or configuration will impact its execution. The next step is often diving into the details of how Flink manages state and handles failures within this execution plan.
