The Flink Kubernetes Operator doesn’t just deploy Flink clusters; it fundamentally changes how you think about stateful, distributed applications on Kubernetes, turning them into first-class citizens.
Let’s see it in action. Imagine you want to run a simple Flink streaming job that counts words from a Kafka topic.
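First, you need the operator itself. It is published as the flink-kubernetes-operator Helm chart in the Apache Flink download area, so add that repository (substituting the operator release you want for <OPERATOR-VERSION>) and install the chart. By default the chart enables an admission webhook that expects cert-manager to be installed in the cluster; pass --set webhook.create=false if you want to skip it.

    helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-<OPERATOR-VERSION>/
    helm upgrade --install flink-operator flink-operator-repo/flink-kubernetes-operator \
      --namespace flink-operator \
      --create-namespace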
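This installs the operator Deployment, the custom resource definitions, and the associated RBAC rules, with the operator itself running in the flink-operator namespace. The operator’s job is to watch for FlinkDeployment (and FlinkSessionJob) custom resources and reconcile the cluster state against them.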
Now, let’s define our Flink cluster and job:

    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    metadata:
      name: my-flink-cluster
      namespace: default
    spec:
      # The CRD takes the image as a single string, not repository/tag
      image: flinkci/flink:1.17.1
      imagePullPolicy: IfNotPresent
      # flinkVersion is an enum of the form v1_17
      flinkVersion: v1_17
      # Assumes a "flink" service account with pod-management rights exists in this namespace
      serviceAccount: flink
      flinkConfiguration:
        # Task slots offered by each TaskManager pod
        taskmanager.numberOfTaskSlots: "1"
        # Savepoint location, needed by the savepoint upgrade mode
        state.savepoints.dir: s3://my-flink-savepoints/wordcount/
      jobManager:
        resource:
          memory: "1024m"
          cpu: 1
      taskManager:
        resource:
          memory: "2048m"
          cpu: 1
        replicas: 2
      # The job section tells the operator what to run on the cluster
      job:
        jarURI: s3://my-flink-jobs/wordcount-0.1.jar
        entryClass: org.apache.flink.examples.java.wordcount.WordCount
        parallelism: 2
        upgradeMode: savepoint
When you run kubectl apply -f my-flink-deployment.yaml, the operator picks up the new FlinkDeployment resource and provisions a Flink cluster from it. In the default native mode it creates a Kubernetes Deployment for the JobManager, and Flink’s own resource manager then requests TaskManager pods from Kubernetes; in standalone mode the operator creates a TaskManager Deployment as well. It also creates a ConfigMap holding the Flink configuration, plus the Services used for internal communication and for reaching the JobManager’s REST endpoint and web UI.
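Crucially, the job section in the FlinkDeployment makes this an application cluster: the cluster is started together with your job instead of waiting for a separate submission. The JobManager runs the JAR referenced by jarURI with the configured entry class and parallelism. A remote location such as the S3 URI used here only works if the operator and Flink versions in use can fetch it and the image ships the matching filesystem plugin; a local:// path to a JAR baked into the image is the commonly documented alternative. The operator then continuously monitors the job’s status and reconciles it against the spec. Recovery from task failures is handled by Flink’s restart strategy and checkpoints, while spec changes are rolled out according to upgradeMode.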
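To make the status monitoring concrete, here is a minimal Python sketch that reads job states from Flink’s REST endpoint /jobs/overview. It assumes the JobManager REST port is reachable at http://localhost:8081, for example through kubectl port-forward. The helper names job_states and fetch_job_states and the sample payload are illustrative only and are not part of the operator.

```python
import json
from urllib.request import urlopen


def job_states(overview: dict) -> dict:
    """Map job name -> state from a /jobs/overview response body."""
    return {job["name"]: job["state"] for job in overview.get("jobs", [])}


def fetch_job_states(base_url: str = "http://localhost:8081") -> dict:
    """Query the JobManager REST endpoint (assumed reachable, e.g. via port-forward)."""
    with urlopen(f"{base_url}/jobs/overview", timeout=5) as resp:
        return job_states(json.load(resp))


# Offline example with a hand-written payload in the shape Flink returns.
sample = {"jobs": [{"jid": "a1b2", "name": "WordCount", "state": "RUNNING"}]}
print(job_states(sample))  # {'WordCount': 'RUNNING'}
```

With a port-forward to the JobManager REST service in place (in native mode that service is typically named after the deployment with a -rest suffix, so my-flink-cluster-rest here), fetch_job_states() should return the same kind of mapping for the live cluster.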
The problem this solves is the operational overhead of managing stateful, distributed Flink applications on Kubernetes. Traditionally, you’d manually deploy Flink clusters, then separately submit jobs, manage their lifecycle (upgrades, restarts, scaling), and handle state management (savepoints). The operator automates all of this, treating Flink clusters and their jobs as Kubernetes-native resources.
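Internally, the operator translates a FlinkDeployment into the underlying Kubernetes resources (Deployment, Service, ConfigMap, and so on), relying on Flink’s native Kubernetes integration by default. Once the cluster is up, it uses Flink’s REST API to observe job status, trigger savepoints, and cancel jobs. The upgradeMode setting is key here. It controls how state is carried across a spec change and accepts three values: stateless (start from empty state), last-state (restore from the latest checkpoint, which relies on high-availability metadata), and savepoint. With savepoint, the operator stops the running job with a savepoint, redeploys the cluster with the new spec, and resumes the job from that savepoint. upgradeMode does not govern recovery from a TaskManager failure; that is Flink’s own checkpoint-based failover, which is also what the exactly-once or at-least-once processing guarantees rest on.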
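An upgrade is started by changing the spec, not by a separate command. As a hedged sketch, the Python below builds a JSON merge patch that only changes spec.job.jarURI and renders the matching kubectl patch command as a string without executing it. The helper names and the wordcount-0.2.jar path are hypothetical; the resource name and namespace are the ones used in this article.

```python
import json
import shlex


def upgrade_patch(new_jar_uri: str) -> dict:
    """JSON merge patch that changes only spec.job.jarURI."""
    return {"spec": {"job": {"jarURI": new_jar_uri}}}


def kubectl_patch_command(name: str, namespace: str, patch: dict) -> str:
    """Render the kubectl invocation as a string; nothing is executed here."""
    return " ".join([
        "kubectl", "patch", "flinkdeployment", shlex.quote(name),
        "--namespace", shlex.quote(namespace),
        "--type", "merge",
        "--patch", shlex.quote(json.dumps(patch)),
    ])


# Hypothetical new artifact version; substitute your own JAR location.
patch = upgrade_patch("s3://my-flink-jobs/wordcount-0.2.jar")
print(kubectl_patch_command("my-flink-cluster", "default", patch))
```

Running the printed command (or editing the manifest and re-applying it) changes the spec, and the operator then rolls the job forward according to the configured upgradeMode.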
A common point of confusion is how job parallelism maps onto TaskManager pods. parallelism sets the overall job parallelism, and each unit of parallelism needs a task slot. The number of slots a TaskManager pod offers comes from taskmanager.numberOfTaskSlots in flinkConfiguration (there is no separate per-resource parallelism field in the job spec). With one slot per TaskManager, a job with parallelism 2 needs two TaskManager pods; with two slots per TaskManager, a single pod is enough. If taskManager.replicas is set, it takes precedence over parallelism when the cluster is sized, so keep replicas times slots consistent with the parallelism you intend. Understanding this helps optimize resource utilization and task distribution.
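The slot arithmetic can be sketched in a few lines of Python. This assumes Flink’s default slot sharing, where a job needs as many slots as its parallelism, and that each TaskManager offers taskmanager.numberOfTaskSlots slots; the function name is illustrative.

```python
import math


def task_managers_needed(parallelism: int, slots_per_task_manager: int) -> int:
    """TaskManager pods required to provide `parallelism` task slots."""
    if parallelism < 1 or slots_per_task_manager < 1:
        raise ValueError("parallelism and slots_per_task_manager must be >= 1")
    return math.ceil(parallelism / slots_per_task_manager)


print(task_managers_needed(2, 1))  # 2
print(task_managers_needed(5, 2))  # 3
```

The second call shows the rounding: five slots do not fit in two two-slot TaskManagers, so a third pod is needed and one of its slots stays idle.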
Once your Flink jobs are reliably deployed and managed, the next logical step is to explore advanced deployment strategies like blue-green deployments or canary releases for your Flink applications.