Crossplane exposes a rich set of Prometheus metrics that let you observe its internal state and the health of your managed resources.
```yaml
# Example of a managed resource in a "Healthy" state
apiVersion: database.example.com/v1alpha1
kind: MyDatabase
metadata:
  name: my-db-instance
spec:
  parameters:
    dbSize: Large
  compositionSelector:
    matchLabels:
      provider: aws
      dbType: postgres
status:
  atProvider:
    instanceStatus: Available
    connectionEndpoint: my-db-instance.example.com
  conditions:
    - type: Ready
      status: "True"
      lastTransitionTime: "2023-10-27T10:00:00Z"
      reason: DBInstanceAvailable
      message: DB instance is ready and available for connections.
```
This MyDatabase custom resource, managed by Crossplane, is currently in a Ready state. Its underlying cloud provider resource (an AWS RDS instance, for example) is Available. This state is reflected in Prometheus metrics.
Here’s how Crossplane’s metrics can help you understand what’s happening:
- crossplane_managed_resource_state: This metric directly tells you the health of your managed resources. A value of 1 for a resource with status="Healthy" means it's in a good state. Any other value (e.g., 0 for Degraded, Unknown, or Failed) indicates a problem. You can query this to see how many of your databases, buckets, or clusters are Ready.

  ```promql
  # Query for all managed resources that are NOT healthy
  crossplane_managed_resource_state{status!="Healthy"}
  ```

- crossplane_provider_reconcile_errors_total: When a provider fails to reconcile a resource (e.g., an AWS provider can't create an S3 bucket), this counter increments. High or rapidly increasing values here point to issues with your provider configurations or network connectivity to the cloud API.

  ```promql
  # Reconcile errors over the last 5 minutes, broken down by provider and resource type
  sum by (provider, resource_kind) (increase(crossplane_provider_reconcile_errors_total[5m]))
  ```

- crossplane_composition_reconcile_errors_total: Similar to provider errors, but these track issues within Crossplane's composition logic. If a composition fails to bind claims to resources or provision the correct underlying infrastructure, this metric will rise.

  ```promql
  # Count composition errors over the last hour
  increase(crossplane_composition_reconcile_errors_total[1h])
  ```

- crossplane_controller_runtime_reconcile_total: This metric, inherited from controller-runtime, shows how many times reconciliation loops have run for Crossplane's core controllers and your custom resources. Spikes or prolonged periods of no reconciliation can indicate a stuck controller or an issue with the Kubernetes API server.

  ```promql
  # Per-second reconciliation rate for the XRD controller, averaged over 5 minutes
  rate(crossplane_controller_runtime_reconcile_total{controller="xrd-controller"}[5m])
  ```

- crossplane_kubernetes_resource_sync_time_seconds: This measures how long it takes for Crossplane to detect changes in Kubernetes resources it's managing. Long sync times can mean delays in Crossplane reacting to updates or deletions, potentially leading to stale configurations.

  ```promql
  # Find managed resources with sync times exceeding 30 seconds
  crossplane_kubernetes_resource_sync_time_seconds > 30
  ```
The real power comes from correlating these metrics. For instance, if crossplane_managed_resource_state{status="Degraded"} shows an increase, you’d then look at crossplane_provider_reconcile_errors_total for the relevant provider to see if there are underlying API failures. You might also check crossplane_kubernetes_resource_sync_time_seconds to ensure Crossplane is even aware of the resource’s state changes promptly.
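One way to make that correlation reusable is a recording rule that keeps only the providers that have both unhealthy resources and recent reconcile errors. The following is a minimal sketch in the standard Prometheus rule-file format. The group name, the recorded series name, and the 15-minute window are illustrative assumptions, and the expression assumes the gauge reports 0 for any resource that is not Healthy.

```yaml
# Sketch only: the group name, record name, and 15m window are illustrative choices.
groups:
  - name: crossplane-correlation   # hypothetical group name
    rules:
      # Number of unhealthy managed resources per provider, kept only for
      # providers that also logged reconcile errors in the last 15 minutes.
      - record: crossplane:unhealthy_resources_with_provider_errors:count
        expr: |
          count by (provider) (crossplane_managed_resource_state == 0)
          and on (provider)
          (sum by (provider) (increase(crossplane_provider_reconcile_errors_total[15m])) > 0)
```

A provider that appears in the recorded series is a reasonable first place to look. A provider with unhealthy resources but no entry here suggests the cause lies elsewhere, for example in composition logic or slow syncs.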
Crossplane’s metrics are served on the /metrics endpoint of its controller manager pod, typically exposed via a Kubernetes Service. You’ll need to configure your Prometheus instance to scrape this endpoint. A common configuration involves using ServiceMonitor or PodMonitor custom resources if you’re using Prometheus Operator.
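With Prometheus Operator, a scrape configuration could look roughly like the PodMonitor below. Treat it as a sketch: the resource name, the crossplane-system namespace, the app: crossplane pod label, and the port name metrics are assumptions that should be checked against your own deployment.

```yaml
# Sketch only: namespace, labels, and port name are assumptions; verify them
# against your Crossplane deployment before applying.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: crossplane               # hypothetical name
  namespace: crossplane-system   # assumed install namespace
spec:
  selector:
    matchLabels:
      app: crossplane            # assumed label on the controller manager pod
  podMetricsEndpoints:
    - port: metrics              # assumed name of the container port serving metrics
      path: /metrics
      interval: 30s
```

If your Prometheus resource selects monitors by label, the PodMonitor also needs whatever labels that selector expects. If the metrics port is exposed through a Service instead, a ServiceMonitor with an equivalent selector and an endpoints entry serves the same purpose.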
The crossplane_managed_resource_state metric is a gauge that reports 1 for a Healthy resource and 0 for any other state (Degraded, Unknown, Failed). The labels provider, resource_kind, and resource_name allow you to drill down into specific resources. When you see this metric drop from 1 to 0 for a particular resource_name, it means that resource has transitioned out of the Healthy state.
The next step after monitoring is to set up alerting. You’ll likely want alerts for any managed resource that enters a Degraded or Failed state, or for a sustained high rate of reconciliation errors from your providers.
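As a starting point, the two alerts described above could be expressed as a PrometheusRule for Prometheus Operator. This is a sketch under stated assumptions: the alert names, thresholds, for durations, and severity labels are illustrative, and the metric and label names are the ones used in this article.

```yaml
# Sketch only: alert names, thresholds, durations, and severities are
# illustrative assumptions to tune for your environment.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: crossplane-alerts        # hypothetical name
  namespace: crossplane-system   # assumed install namespace
spec:
  groups:
    - name: crossplane.rules
      rules:
        # Fires when a managed resource has reported a non-Healthy state for 10 minutes.
        - alert: CrossplaneManagedResourceNotHealthy
          expr: crossplane_managed_resource_state == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.resource_kind }}/{{ $labels.resource_name }} is not Healthy"
        # Fires when a provider sustains more than 0.1 reconcile errors per second.
        - alert: CrossplaneProviderReconcileErrors
          expr: sum by (provider) (rate(crossplane_provider_reconcile_errors_total[10m])) > 0.1
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "Provider {{ $labels.provider }} has a sustained reconcile error rate"
```

The for clauses keep brief blips during normal provisioning from paging anyone. Shorten them if you would rather be notified sooner at the cost of more noise.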