Flink’s metrics system is designed to be highly flexible, and exporting metrics to Prometheus is a common requirement for monitoring Flink applications.
Here’s how you can set it up:
Setting up Prometheus for Flink Metrics
The most straightforward way to expose Flink metrics to Prometheus is Flink’s built-in PrometheusReporter. The reporter does not scrape anything itself: it starts an HTTP endpoint on the JobManager and on each TaskManager, and Prometheus scrapes those endpoints.
1. Configure Flink to use the PrometheusReporter:
You need to add configuration properties to your Flink cluster’s configuration file (e.g., flink-conf.yaml, or config.yaml in recent Flink releases).
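# flink-conf.yaml
metrics.reporters: prometheus
# Recent Flink releases load reporters through a factory class; older releases used
# metrics.reporter.prometheus.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prometheus.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prometheus.port: 9091 # Or any other available port
- metrics.reporters: Tells Flink which reporters to activate. Here the reporter is named prometheus; the name is arbitrary but must match the prefix used in the other keys.
- metrics.reporter.prometheus.factory.class: Points to the Flink factory that creates the Prometheus reporter.
- metrics.reporter.prometheus.port: The port on which the JobManager and TaskManagers expose their metrics endpoint for Prometheus to scrape. The port must not already be in use by another service. This example uses 9091; the reporter’s default is 9249. If several Flink processes share a host, specify a range such as 9250-9260 so that each process can bind its own port.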
2. Restart your Flink Cluster:
After updating flink-conf.yaml, you must restart your Flink JobManager and TaskManagers for the changes to take effect.
3. Configure Prometheus to Scrape Flink Metrics:
Now, you need to tell your Prometheus server where to find these metrics. Edit your prometheus.yml file.
# prometheus.yml
scrape_configs:
  - job_name: 'flink'
    static_configs:
      - targets:
          - 'jobmanager_host:9091' # Replace with your JobManager's hostname/IP
          - 'taskmanager1_host:9091' # Replace with your TaskManager hostnames/IPs
          - 'taskmanager2_host:9091'
          # Add more TaskManagers as needed
- job_name: A logical name for this set of targets in Prometheus.
- static_configs: For a fixed set of targets, this is the simplest configuration.
- targets: A list of addresses (hostname:port or IP:port) that Prometheus should scrape. You need to list the address of your JobManager and each of your TaskManagers, using the port you configured in flink-conf.yaml (e.g., 9091).
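When the list of TaskManagers is long, the targets section can be generated rather than typed by hand. The following is a minimal sketch, not part of Flink or Prometheus: the function name render_scrape_config and the host names are hypothetical, and it assumes every Flink process listens on the same port.

```python
def render_scrape_config(hosts, port=9091, job_name="flink"):
    """Return a prometheus.yml scrape_configs section for the given Flink hosts."""
    lines = [
        "scrape_configs:",
        f"  - job_name: '{job_name}'",
        "    static_configs:",
        "      - targets:",
    ]
    # One target per Flink process, all on the port set in flink-conf.yaml.
    lines += [f"          - '{host}:{port}'" for host in hosts]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical host names; replace with your own JobManager and TaskManagers.
    hosts = ["jobmanager_host", "taskmanager1_host", "taskmanager2_host"]
    print(render_scrape_config(hosts), end="")
```

The output has the same shape as the hand-written configuration, so it can be pasted into prometheus.yml or written to a file by a deployment script.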
4. Reload Prometheus Configuration:
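After modifying prometheus.yml, you need to reload Prometheus’s configuration. You can do this by sending a SIGHUP signal to the Prometheus process or by restarting the Prometheus server. If Prometheus was started with the --web.enable-lifecycle flag, an HTTP POST to its /-/reload endpoint also triggers a reload.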
# Example using kill -HUP
kill -HUP <prometheus_pid>
Once Prometheus reloads, it starts scraping the specified Flink endpoints. Each endpoint should appear with state UP on the "Targets" page of the Prometheus UI, and the Flink metrics become queryable from the expression browser.
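The same check can be scripted against the Prometheus HTTP API, which lists scrape targets at /api/v1/targets. The sketch below assumes Prometheus is reachable at http://localhost:9090 and that the scrape job is named flink, as in the configuration above; the function names are illustrative.

```python
import json
import urllib.request


def fetch_targets(prometheus_url="http://localhost:9090"):
    """Fetch the target list from a Prometheus server (the address is an assumption)."""
    with urllib.request.urlopen(f"{prometheus_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(payload, job="flink"):
    """Return (scrape URL, last error) pairs for targets of `job` that are not up."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl", ""), target.get("lastError", ""))
        for target in active
        if target.get("labels", {}).get("job") == job and target.get("health") != "up"
    ]
```

Calling unhealthy_targets(fetch_targets()) returns an empty list when every Flink endpoint is being scraped successfully. A non-empty result names the endpoints to investigate, typically a wrong host name, a closed port, or a reporter that did not start.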
Understanding the Metrics Endpoint
When the PrometheusReporter is enabled, each Flink component (JobManager and TaskManagers) starts an HTTP server on the configured port. This server exposes metrics in a format that Prometheus can understand. The endpoint is typically /metrics.
For example, if your JobManager is running on localhost and you configured port 9091, you can access the metrics by navigating to http://localhost:9091/metrics in your browser. Each process reports only its own metrics, so JobManager metrics come from the JobManager endpoint and TaskManager metrics from the TaskManager endpoints. Taken together, the output looks similar to this (exact names, labels, and help text depend on the Flink version and configuration):
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink-jobmanager.example.com",} 2.0
# HELP flink_jobmanager_job_restartingTime restartingTime (scope: jobmanager_job)
# TYPE flink_jobmanager_job_restartingTime gauge
flink_jobmanager_job_restartingTime{host="flink-jobmanager.example.com",job_name="my-flink-job",} 12345.0
# HELP flink_taskmanager_Status_JVM_CPU_Load Load (scope: taskmanager_Status_JVM_CPU)
# TYPE flink_taskmanager_Status_JVM_CPU_Load gauge
flink_taskmanager_Status_JVM_CPU_Load{host="flink-taskmanager-1.example.com",} 0.45
The PrometheusReporter automatically prefixes metric names with flink_ and appends relevant Flink labels (like host, job_name, task_name, subtask_index, etc.) to the Prometheus metrics, making them easily identifiable.
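To make the name-plus-labels structure concrete, here is a small sketch that splits one sample line of the text format into its metric name, labels, and value. It is a simplified reader written for illustration, not the official Prometheus parser: it assumes one sample per line, ignores optional timestamps, and does not unescape label values.

```python
import re

# Matches lines such as: name{label="value",} 2.0   (the label block is optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one sample line into (name, labels, value); None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized sample line: {line!r}")
    name, raw_labels, value = match.groups()
    # findall tolerates the trailing comma that the reporter emits inside the braces.
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)
```

Applied to the numRegisteredTaskManagers line above, the function yields the metric name, a dictionary with the single key host, and the value 2.0. The # HELP and # TYPE lines are skipped.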
Common Flink Metrics to Monitor
Once integrated, you’ll have access to a wealth of metrics. Some critical ones include:
- JobManager Metrics:
  - flink_jobmanager_numRegisteredTaskManagers: Number of TaskManagers connected to the JobManager.
  - flink_jobmanager_numRunningJobs: Number of currently running Flink jobs.
  - flink_jobmanager_job_restartingTime: Time spent restarting the job, in milliseconds.
- TaskManager Metrics:
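  - flink_taskmanager_Status_JVM_CPU_Load: Recent CPU usage of the TaskManager JVM.
  - flink_taskmanager_Status_JVM_Memory_Heap_Used: Heap memory used by the TaskManager.
  - flink_taskmanager_Status_Shuffle_Netty_AvailableMemorySegments: Available network buffer segments (older Flink versions report this as flink_taskmanager_Status_Network_AvailableMemorySegments).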
- Operator/Task Metrics:
  - flink_taskmanager_job_task_operator_numRecordsInPerSecond: Rate of records received by an operator.
  - flink_taskmanager_job_task_operator_numRecordsOutPerSecond: Rate of records sent by an operator.
  - flink_taskmanager_job_task_backPressuredTimeMsPerSecond: Milliseconds per second that a task spends back-pressured (reported per task, not per operator).
  - flink_taskmanager_job_task_idleTimeMsPerSecond: Milliseconds per second that a task spends idle (also reported per task).
By monitoring these, you can gain insights into the health, performance, and resource utilization of your Flink applications.
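These metrics can also be read programmatically through the Prometheus HTTP API, whose instant-query endpoint is /api/v1/query. The sketch below assumes Prometheus is reachable at http://localhost:9090; the helper names are illustrative, and the metric used in the usage note is one of those listed above.

```python
import json
import urllib.parse
import urllib.request


def query_url(expr, prometheus_url="http://localhost:9090"):
    """Build the instant-query URL for a PromQL expression."""
    return f"{prometheus_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_vector(payload):
    """Turn an instant-vector response into a list of (labels, value) pairs."""
    data = payload.get("data", {})
    if payload.get("status") != "success" or data.get("resultType") != "vector":
        raise ValueError("expected a successful instant-vector response")
    # Each entry carries its value as a [timestamp, "string value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def run_query(expr, prometheus_url="http://localhost:9090"):
    """Execute the query against a live server (the address is an assumption)."""
    with urllib.request.urlopen(query_url(expr, prometheus_url), timeout=5) as resp:
        return parse_vector(json.load(resp))
```

For instance, run_query("flink_jobmanager_numRunningJobs") returns one (labels, value) pair per JobManager that Prometheus is scraping, which is enough for a simple health script or a dashboard prototype.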
The next step after exporting metrics is to set up alerting rules in Prometheus based on these metrics to proactively identify and address issues.