Argo Workflows can emit metrics that Prometheus can scrape.
Let’s see it in action. Imagine we have a simple workflow that runs a sleep command:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: basic-sleep-
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:latest
        command: ["sh", "-c"]
        args: ["echo 'Starting sleep'; sleep 30; echo 'Finished sleep'"]
When this workflow runs, the Argo Workflows controller records metrics about it. The controller is the component that manages the lifecycle of workflows, and the API server (the Argo Server) is what you interact with to submit and monitor workflows.
By default the controller serves its metrics over HTTP at /metrics on port 9090, and the API server listens on port 2746. Whether the API server also serves Prometheus metrics on that port depends on your Argo version and configuration, so confirm the endpoint responds before relying on it. Prometheus, a popular open-source monitoring and alerting system, can be configured to scrape these endpoints.
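The controller's port and path come from the metricsConfig section of its ConfigMap. The fragment below is a minimal sketch of that setting, not a complete ConfigMap. The namespace argo-workflows is an assumption taken from the scrape targets used later in this article, and the ConfigMap name is the one a default install uses; a Helm release may name it differently.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap   # default name; may differ under Helm
  namespace: argo-workflows             # assumption: matches the scrape targets in this article
data:
  metricsConfig: |
    enabled: true    # set to false to turn the metrics endpoint off
    path: /metrics   # HTTP path Prometheus will scrape
    port: 9090       # port the controller serves metrics on
```

The controller usually needs a restart, or a ConfigMap reload, before a change to this section takes effect.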
Here’s a simplified Prometheus scrape_configs entry to pull metrics from the Argo Workflows controller:
scrape_configs:
  - job_name: 'argo-workflows-controller'
    static_configs:
      - targets: ['argo-workflows-controller.argo-workflows.svc.cluster.local:9090']
And for the API server:
scrape_configs:
  - job_name: 'argo-workflows-api'
    static_configs:
      - targets: ['argo-workflows-api.argo-workflows.svc.cluster.local:2746']
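In a real prometheus.yml both jobs sit under a single scrape_configs key, because a repeated top-level key is invalid YAML for most parsers. The sketch below combines the two entries. The hostnames and ports are copied from the snippets above, the scrape_interval value is an assumption, and metrics_path is shown only to make the Prometheus default explicit.

```yaml
# Sketch of one prometheus.yml fragment with both Argo jobs.
scrape_configs:
  - job_name: 'argo-workflows-controller'
    metrics_path: /metrics   # Prometheus default, written out for clarity
    scrape_interval: 30s     # assumed value; tune to your environment
    static_configs:
      - targets: ['argo-workflows-controller.argo-workflows.svc.cluster.local:9090']
  - job_name: 'argo-workflows-api'
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets: ['argo-workflows-api.argo-workflows.svc.cluster.local:2746']
```

The service names depend on how Argo Workflows was installed, so check them against the Services in your cluster before using the targets as written.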
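Once Prometheus is scraping these endpoints, you’ll start seeing metrics such as argo_workflow_status_phase, which reports the current phase of your workflows (e.g., Running, Succeeded, Failed), and argo_workflow_duration_seconds, which tracks how long your workflows take to complete. Metric names have changed between Argo Workflows releases (many 3.x versions, for example, report per-phase workflow counts as argo_workflows_count), so check the /metrics output of your own controller for the exact names.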
Argo Workflows metrics give you visibility into the operational health and performance of your workflow executions without parsing logs or querying the Argo API directly. You can build dashboards in Grafana to visualize workflow completion rates and average execution times, and to identify bottlenecks.
Internally, when a workflow event occurs (such as a workflow starting, completing, or failing), the Argo components increment counters and update gauges for the relevant metrics. These metrics are served over HTTP in the Prometheus text exposition format.
You control most of this through Argo Workflows configuration. The metricsConfig section of the workflow controller’s ConfigMap enables or disables the metrics endpoint and sets its path and port. For finer-grained segmentation, you can define custom metrics with your own labels in the metrics section of a workflow or template spec. For instance, to track metrics per namespace or per workflow template, you would set those labels in the workflow spec itself.
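As an illustration of custom labels, here is a minimal sketch that adds a workflow-level custom metric to the sleep workflow from earlier. The metric name basic_sleep_duration_seconds, its help text, and the label key are hypothetical names chosen for this example. The metrics.prometheus structure and the {{workflow.duration}} and {{workflow.namespace}} variables follow the Argo Workflows custom-metrics documentation, but verify them against the version you run.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: basic-sleep-
spec:
  entrypoint: main
  metrics:
    prometheus:
      - name: basic_sleep_duration_seconds   # hypothetical metric name
        help: "Duration of the basic-sleep workflow in seconds"
        labels:
          - key: workflow_namespace           # hypothetical label key
            value: "{{workflow.namespace}}"
        gauge:
          realtime: false                     # emit once, when the workflow completes
          value: "{{workflow.duration}}"
  templates:
    - name: main
      container:
        image: alpine:latest
        command: ["sh", "-c"]
        args: ["echo 'Starting sleep'; sleep 30; echo 'Finished sleep'"]
```

The controller emits custom metrics like this one on its own metrics endpoint, so no extra scrape job is needed. Keep label values low in cardinality, since each distinct value creates a separate time series in Prometheus.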
A key detail often overlooked is that the controller and the API server expose different metrics. The controller’s metrics describe the state of the system managing workflows, while the API server’s metrics describe the requests it handles and the state of workflows as seen through the API. Knowing which component provides which metric is crucial for accurate monitoring.
The next concept you’ll likely encounter is how to query these metrics effectively using PromQL to build meaningful alerts and dashboards.
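As a first taste of that, the following is a sketch of a Prometheus alerting rule built on the phase metric named earlier. The group name, alert name, threshold, and the phase label are assumptions made for illustration. Metric and label names vary between Argo Workflows versions, so adjust the expression to match what your controller exposes.

```yaml
groups:
  - name: argo-workflows             # hypothetical rule group name
    rules:
      - alert: ArgoWorkflowsFailed   # hypothetical alert name
        # Fires when at least one workflow is reported in the Failed phase.
        # Assumes the metric carries a "phase" label.
        expr: sum(argo_workflow_status_phase{phase="Failed"}) > 0
        for: 5m                      # condition must hold for 5 minutes
        labels:
          severity: warning
        annotations:
          summary: "One or more Argo workflows are in the Failed phase"
```

A rule file like this is loaded through the rule_files setting in prometheus.yml, and the same expression can back a Grafana panel.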