Crossplane exposes a rich set of Prometheus metrics that let you observe its internal state and the health of your managed resources.

# Example of a managed resource in a "Healthy" state
apiVersion: database.example.com/v1alpha1
kind: MyDatabase
metadata:
  name: my-db-instance
spec:
  parameters:
    dbSize: Large
  compositionSelector:
    matchLabels:
      provider: aws
      dbType: postgres
status:
  atProvider:
    instanceStatus: Available
    connectionEndpoint: my-db-instance.example.com
  conditions:
  - type: Ready
    status: "True"
    lastTransitionTime: "2023-10-27T10:00:00Z"
    reason: DBInstanceAvailable
    message: DB instance is ready and available for connections.

This MyDatabase custom resource, managed by Crossplane, is currently in a Ready state. Its underlying cloud provider resource (an AWS RDS instance, for example) is Available. This state is reflected in Prometheus metrics.

Here’s how Crossplane’s metrics can help you understand what’s happening:

  • crossplane_managed_resource_state: This metric directly tells you the health of your managed resources. A value of 1 on the series with status="Healthy" means the resource is in a good state. A value of 0 there means it is in some other state (Degraded, Unknown, or Failed) and needs attention. You can query this to see how many of your databases, buckets, or clusters are healthy.

    # Query for all managed resources that are NOT healthy
    crossplane_managed_resource_state{status="Healthy"} == 0
    
  • crossplane_provider_reconcile_errors_total: When a provider fails to reconcile a resource (e.g., an AWS provider can’t create an S3 bucket), this counter increments. High or rapidly increasing values here point to issues with your provider configurations or network connectivity to the cloud API.

    # Reconcile errors over the last 5 minutes, broken down by provider and resource type
    sum by (provider, resource_kind) (increase(crossplane_provider_reconcile_errors_total[5m]))
    
  • crossplane_composition_reconcile_errors_total: Similar to provider errors, but these track issues within Crossplane’s composition logic. If a composition fails to bind claims to resources or provision the correct underlying infrastructure, this metric will rise.

    # Count composition errors over the last hour
    increase(crossplane_composition_reconcile_errors_total[1h])
    
  • crossplane_controller_runtime_reconcile_total: This metric, inherited from controller-runtime, shows how many times reconciliation loops have run for Crossplane’s core controllers and your custom resources. Spikes or prolonged periods of no reconciliation can indicate a stuck controller or an issue with the Kubernetes API server.

    # Reconciliations per second for the xrd-controller, averaged over the last 5 minutes
    rate(crossplane_controller_runtime_reconcile_total{controller="xrd-controller"}[5m])
    
  • crossplane_kubernetes_resource_sync_time_seconds: This measures how long it takes for Crossplane to detect changes in Kubernetes resources it’s managing. Long sync times can mean delays in Crossplane reacting to updates or deletions, potentially leading to stale configurations.

    # Find managed resources with sync times exceeding 30 seconds
    crossplane_kubernetes_resource_sync_time_seconds > 30
    

The real power comes from correlating these metrics. For instance, if crossplane_managed_resource_state{status="Degraded"} shows an increase, you’d then look at crossplane_provider_reconcile_errors_total for the relevant provider to see if there are underlying API failures. You might also check crossplane_kubernetes_resource_sync_time_seconds to ensure Crossplane is even aware of the resource’s state changes promptly.
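One way to make this correlation routine is to encode it as a rule. The sketch below is a Prometheus rule file fragment that flags providers which have at least one Degraded resource and also logged reconcile errors in the last 15 minutes. The group name, the recording rule name, and the 15-minute window are illustrative choices, and the join assumes both metrics carry a matching provider label, so check the label sets on your own /metrics output before relying on it.

```yaml
groups:
  - name: crossplane-correlation   # hypothetical group name
    rules:
      # Hypothetical rule name, following the level:metric:operation convention.
      # Value: number of Degraded resources per provider, kept only for
      # providers that also recorded reconcile errors in the last 15 minutes.
      - record: provider:crossplane_degraded_with_reconcile_errors:count
        expr: |
          (sum by (provider) (crossplane_managed_resource_state{status="Degraded"}) > 0)
          and on (provider)
          (sum by (provider) (increase(crossplane_provider_reconcile_errors_total[15m])) > 0)
```

A provider that appears in this series is a good first place to look, since its degraded resources coincide with failing API calls rather than with slow syncs.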

Crossplane’s metrics are served on the /metrics endpoint of its controller manager pod, typically exposed via a Kubernetes Service. You’ll need to configure your Prometheus instance to scrape this endpoint. A common configuration involves using ServiceMonitor or PodMonitor custom resources if you’re using Prometheus Operator.
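As a rough illustration, a PodMonitor for Prometheus Operator might look like the following. The namespace (crossplane-system), the pod label (app: crossplane), and the port name (metrics) are assumptions about a typical Helm installation with metrics enabled; confirm them against your own deployment with kubectl before applying.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: crossplane              # hypothetical name
  namespace: crossplane-system  # assumed install namespace
spec:
  selector:
    matchLabels:
      app: crossplane           # assumed label on the Crossplane pod
  podMetricsEndpoints:
    - port: metrics             # assumed name of the container's metrics port
      path: /metrics
      interval: 30s             # illustrative scrape interval
```

Your Prometheus resource also has to select this PodMonitor (through its podMonitorSelector and podMonitorNamespaceSelector), otherwise the target never shows up.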

The crossplane_managed_resource_state metric is a gauge that reports 1 for a Healthy resource and 0 for any other state (Degraded, Unknown, Failed). The labels provider, resource_kind, and resource_name allow you to drill down into specific resources. When you see this metric drop from 1 to 0 for a particular resource_name, it means that resource has transitioned out of the Healthy state.

The next step after monitoring is to set up alerting. You’ll likely want alerts for any managed resource that enters a Degraded or Failed state, or for a sustained high rate of reconciliation errors from your providers.
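A minimal sketch of such alerts, written as a Prometheus Operator PrometheusRule, could look like this. The alert names, the for durations, the severity labels, and the 0.1 errors-per-second threshold are all placeholder assumptions to tune for your environment. The first rule fires on any resource that is not Healthy, which covers Degraded and Failed as well as Unknown.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: crossplane-alerts        # hypothetical name
  namespace: crossplane-system   # assumed install namespace
spec:
  groups:
    - name: crossplane.rules
      rules:
        # Fires when a managed resource has been out of the Healthy state for 10 minutes.
        - alert: CrossplaneManagedResourceNotHealthy
          expr: crossplane_managed_resource_state{status="Healthy"} == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Managed resource {{ $labels.resource_name }} is not healthy"
        # Fires when a provider sustains more than 0.1 reconcile errors per second.
        - alert: CrossplaneProviderReconcileErrors
          expr: sum by (provider) (rate(crossplane_provider_reconcile_errors_total[5m])) > 0.1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Provider {{ $labels.provider }} is reporting sustained reconcile errors"
```

The for clause keeps short blips during normal provisioning from paging anyone; only conditions that persist past the window raise an alert.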
