Manage Flink Jobs at Runtime with the REST API (2026)

The Flink REST API is your primary lever for controlling Flink jobs after they’ve been submitted, offering granular control without needing to redeploy your entire application.

Let’s see it in action. Imagine you have a streaming job running that processes Kafka data and writes to Elasticsearch.

# Get the list of running jobs
curl -X GET http://localhost:8081/jobs

# Output might look like this:
# {"jobs":[{"id":"a1b2c3d4e5f67890abcdef1234567890","name":"my-kafka-to-es-job","inRunningMode":true,"vertices":[{"id":12345,"name":"Kafka Source","parallelism":2,"currentParallelism":2,"maxParallelism":2,"status":"RUNNING"}, ... ]}]}

This output shows the job ID, name, and the status of its vertices (the individual tasks within the job).

Now, what if you need to scale that Kafka source to handle more data? You can adjust the parallelism on the fly, provided your job was submitted with auto-parallelism enabled or configured with a maxParallelism that allows for scaling up.

# Trigger a rescaling of the job to increase parallelism of a specific vertex
curl -X PUT -H "Content-Type: application/json" \
     -d '{"new_scale": 4}' \
     http://localhost:8081/jobs/a1b2c3d4e5f67890abcdef1234567890/vertices/12345/parallelism

This PUT request targets the specific vertex ID (12345) and tells Flink to scale it up to 4 parallel instances. Flink will automatically handle the state migration and task restarts needed for this change.

The mental model for managing Flink jobs via the REST API revolves around identifying resources (jobs, tasks, savepoints) and then performing actions (trigger, cancel, savepoint, update parallelism) on them.

Here’s a breakdown of the core components:

Jobs: The top-level resource. You can list, get details, cancel, and stop/savepoint jobs.
Vertices: The individual parallel tasks within a job. You can get their status and, crucially, adjust their parallelism at runtime.
Savepoints: These are point-in-time snapshots of your job’s state. The REST API is how you trigger them. This is fundamental for upgrades, migrations, or planned downtime.
- Triggering a savepoint:
```
curl -X POST -H "Content-Type: application/json" \
     -d '{"targetDirectory": "hdfs:///user/flink/savepoints"}' \
     http://localhost:8081/jobs/a1b2c3d4e5f67890abcdef1234567890/savepoints
```
  This command initiates a savepoint and stores it in the specified HDFS path. The API returns a location URL for the created savepoint.
- Canceling a job with savepoint:
```
curl -X POST -H "Content-Type: application/json" \
     -d '{"dispose_savepoint": true, "targetDirectory": "hdfs:///user/flink/savepoints"}' \
     http://localhost:8081/jobs/a1b2c3d4e5f67890abcdef1234567890/cancel
```
  This cancels the job and also triggers a savepoint before shutting down.

The API endpoints are generally intuitive: /jobs, /jobs/{jobid}, /jobs/{jobid}/vertices/{vertexid}, /jobs/{jobid}/savepoints.

A common point of confusion is the difference between cancel and stop. cancel simply terminates the job. stop (when used with dispose_savepoint: true and a targetDirectory) gracefully shuts down the job and creates a savepoint, allowing you to resume from that exact state later. If you just want to stop processing and not worry about state, cancel is sufficient. If you want to upgrade or restart from a known good state, stop with a savepoint is your go-to.

The REST API is also your window into the job’s health and metrics, though for deep dives into performance, you’ll likely complement this with Flink’s metrics reporters.

Understanding how Flink handles state during parallelism changes is key; it doesn’t magically re-partition data. When you increase parallelism, Flink uses the existing savepoint or checkpoint to distribute the state across the new, more numerous tasks. For distributed state (like keyed state in a KeyedStream), Flink has sophisticated mechanisms to partition this state efficiently. If your job was submitted with maxParallelism set to a value less than your desired new_scale, the rescaling will fail.

The next logical step after mastering runtime management is understanding how to configure and leverage Flink’s advanced checkpointing mechanisms for robust fault tolerance.