Flink can run on several cluster managers, and picking the right one is critical for your application’s performance and manageability.

Let’s see Flink in action on YARN. Imagine you have a Flink job that processes Kafka data and writes to HDFS.

# Submit the Flink job to YARN
./bin/flink run -m yarn-cluster \
    -yn 2 \
    -yjm 1024 \
    -ytm 2048 \
    -ys 3 \
    -yD state.backend=filesystem \
    -yD state.backend.fs.dir=hdfs:///flink/checkpoints \
    ./my-flink-job.jar \
    --kafka-topic input-topic \
    --output-dir hdfs:///flink/output

Here’s what’s happening:

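  • -m yarn-cluster: Runs the job in YARN per-job mode. Flink starts a dedicated cluster for this job, and its JobManager runs inside the YARN ApplicationMaster container. Newer Flink releases select the deployment with -t (for example -t yarn-per-job) and favor application mode (flink run-application -t yarn-application).
  • -yn 2: Requests 2 YARN containers for TaskManagers. This flag belongs to older Flink releases; newer releases allocate TaskManager containers on demand, based on the job’s parallelism and the slots per TaskManager, and have dropped the flag.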
  • -yjm 1024: Allocates 1024 MB of memory for the JobManager.
  • -ytm 2048: Allocates 2048 MB of memory for each TaskManager.
  • -ys 3: Sets the number of task slots per TaskManager to 3.
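  • -yD state.backend=filesystem and -yD state.backend.fs.dir=hdfs:///flink/checkpoints: Pass dynamic configuration properties that select the filesystem state backend and point checkpoints at an HDFS directory. Flink’s configuration reference documents the checkpoint directory under the key state.checkpoints.dir, so check the key name against the reference for your Flink version.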
  • ./my-flink-job.jar: The Flink application JAR.
  • --kafka-topic input-topic and --output-dir hdfs:///flink/output: Application-specific arguments.
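The container and slot settings together bound how much parallelism and memory the job gets. The arithmetic for the command above can be sketched as follows; the variable names are illustrative and are not Flink settings.

```shell
# Values copied from the flink run command above.
task_managers=2     # -yn 2
slots_per_tm=3      # -ys 3
jm_memory_mb=1024   # -yjm 1024
tm_memory_mb=2048   # -ytm 2048

# Each TaskManager offers slots_per_tm slots, so this is the highest
# operator parallelism the cluster can host with default slot sharing.
total_slots=$((task_managers * slots_per_tm))

# Memory requested from YARN: one JobManager plus all TaskManagers.
total_memory_mb=$((jm_memory_mb + task_managers * tm_memory_mb))

echo "task slots: $total_slots"
echo "requested memory: ${total_memory_mb} MB"
```

YARN may round each container up to its configured allocation increment, so the memory actually reserved can be higher than the requested figure.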

When you run this command, the Flink client submits an application to the YARN ResourceManager. The ResourceManager allocates a single container for the ApplicationMaster, and Flink’s JobManager runs inside that container. Once the JobManager is up, it requests further containers from the ResourceManager for its TaskManagers. Each TaskManager then launches, registers with the JobManager, and starts executing your job’s tasks.

The problem Flink solves is distributed stream processing. Batch processing reads a bounded data set, processes it, and writes results. Stream processing has to handle records as they arrive, potentially indefinitely, with low latency and high throughput. Flink does this with event-time processing and managed, checkpointed state. It distinguishes processing time from event time and uses watermarks to track how far event time has progressed, which lets it produce accurate results even when events arrive out of order.

Standalone Mode

This is the simplest deployment option. You manage the Flink cluster yourself. You start a JobManager process and one or more TaskManager processes on your machines.

  • Pros: Easy to set up for development and testing, no external cluster manager dependencies.
  • Cons: No automatic scaling, resource management, or fault tolerance beyond Flink’s internal mechanisms. You’re responsible for managing the underlying machines and processes.
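A standalone cluster’s lifecycle can be sketched with the scripts that ship in the Flink distribution. This is a dry-run sketch: it assumes FLINK_HOME points at an unpacked distribution (the /opt/flink default is a guess), the run helper is hypothetical, and the job arguments are the ones from the YARN example.

```shell
# Assumption: FLINK_HOME points at an unpacked Flink distribution.
FLINK_HOME="${FLINK_HOME:-/opt/flink}"
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Start a JobManager and the TaskManagers listed in conf/workers
# (conf/slaves in older releases).
run "$FLINK_HOME/bin/start-cluster.sh"

# Start one more TaskManager on this machine.
run "$FLINK_HOME/bin/taskmanager.sh" start

# Submit the job to the running cluster.
run "$FLINK_HOME/bin/flink" run ./my-flink-job.jar \
    --kafka-topic input-topic \
    --output-dir hdfs:///flink/output

# Shut the cluster down again.
run "$FLINK_HOME/bin/stop-cluster.sh"
```

Setting DRY_RUN=0 executes the commands instead of printing them. Nothing here restarts a crashed process or adds machines, which is the trade-off listed under Cons.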

YARN

When you deploy Flink on YARN, Flink applications run as YARN applications. The Flink JobManager acts as the YARN Application Master. YARN handles resource allocation (CPU, memory) for the JobManager and TaskManagers.

  • Pros: Leverages existing Hadoop infrastructure, good for shared clusters, YARN handles resource scheduling and isolation.
  • Cons: Adds a dependency on YARN, configuration can be more complex due to YARN integration.

Kubernetes

Flink offers native Kubernetes integration. You can deploy Flink clusters as Kubernetes resources, with the JobManager and TaskManagers running as Pods.

  • Pros: Excellent for cloud-native environments, leverages Kubernetes for scaling, self-healing, and resource management, good isolation.
  • Cons: Requires Kubernetes expertise, can be overkill for simple deployments.

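When you start a session cluster with Flink’s native Kubernetes integration (kubernetes-session.sh), Flink creates the JobManager as a Kubernetes Deployment. The JobManager then requests TaskManager Pods directly from the Kubernetes API as submitted jobs need slots, and releases them again when they sit idle, so you do not set a TaskManager count yourself. If you want that control, run a standalone Flink cluster on Kubernetes instead: there you define the TaskManagers as your own Kubernetes resource, typically a Deployment, and scale them by changing its replicas count.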
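A session cluster of this kind can be started, used, and removed with commands along these lines. The sketch only assembles and prints them; the cluster ID is a hypothetical name, the /opt/flink default is a guess, and running them for real needs a Flink distribution plus kubectl access to a cluster.

```shell
# Assumption: FLINK_HOME points at an unpacked Flink distribution.
FLINK_HOME="${FLINK_HOME:-/opt/flink}"
# Hypothetical cluster ID; Flink names the JobManager Deployment after it.
CLUSTER_ID="my-first-flink-cluster"

# 1. Start the session cluster (creates the JobManager Deployment).
session_cmd="$FLINK_HOME/bin/kubernetes-session.sh -Dkubernetes.cluster-id=$CLUSTER_ID"

# 2. Submit a job to that session.
submit_cmd="$FLINK_HOME/bin/flink run --target kubernetes-session -Dkubernetes.cluster-id=$CLUSTER_ID ./my-flink-job.jar"

# 3. Tear the session down by deleting the JobManager Deployment.
stop_cmd="kubectl delete deployment/$CLUSTER_ID"

# Print the commands instead of executing them.
printf '%s\n' "$session_cmd" "$submit_cmd" "$stop_cmd"
```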

The Flink operator provides a more declarative way to manage Flink applications on Kubernetes. You define a FlinkDeployment custom resource, and the operator handles the creation and management of the JobManager and TaskManagers. This simplifies deployment and upgrades.
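A minimal FlinkDeployment might look like the manifest written by the sketch below. It is modeled on the operator’s basic example; the resource name, image tag, Flink version, service account, and resource sizes are assumptions to adapt, and the job JAR is the example that ships inside the Flink image rather than the application from this article.

```shell
# Write a minimal FlinkDeployment manifest to a temporary file.
manifest="$(mktemp)"
cat > "$manifest" <<'EOF'
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: basic-example            # hypothetical name
spec:
  image: flink:1.17              # assumed image tag
  flinkVersion: v1_17
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
  serviceAccount: flink          # assumed service account
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 2
    upgradeMode: stateless
EOF
echo "wrote $manifest"

# With the operator installed in the cluster, apply it with:
#   kubectl apply -f "$manifest"
```

Changing the manifest and re-applying it is how an upgrade is expressed; the operator works out the steps needed to reach the new state.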

One of the subtle but powerful aspects of Flink’s deployment on Kubernetes is how it manages network communication. When the JobManager and TaskManagers run as Kubernetes Pods, Flink relies on Kubernetes’s internal DNS and service discovery. TaskManagers find the JobManager through its Kubernetes Service name, and each TaskManager registers with the JobManager under its own Pod IP, which the JobManager then uses to reach it. This removes much of the low-level networking configuration that you would otherwise have to manage yourself.
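The Service-based lookup rests on Kubernetes’s standard DNS naming scheme, service.namespace.svc.cluster-domain. The sketch below builds the address a TaskManager would resolve; the Service name and namespace are hypothetical, cluster.local is the Kubernetes default cluster domain, and 6123 is Flink’s default JobManager RPC port.

```shell
service="flink-jobmanager"      # hypothetical Service name
namespace="default"             # hypothetical namespace
cluster_domain="cluster.local"  # Kubernetes default cluster domain
rpc_port=6123                   # Flink default for jobmanager.rpc.port

# Fully qualified DNS name of the JobManager Service.
jobmanager_host="${service}.${namespace}.svc.${cluster_domain}"

echo "jobmanager.rpc.address: ${jobmanager_host}"
echo "jobmanager.rpc.port: ${rpc_port}"
```

Within the same namespace the short Service name alone resolves as well, which is why Flink configuration on Kubernetes often contains just that name.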

The next step is often exploring Flink’s advanced deployment patterns, such as using the Flink Kubernetes Operator for declarative application management.
