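A Crossplane resource can get stuck in a Failed state (typically reported as a Synced: False or Ready: False condition) when the underlying cloud provider’s API call fails. Crossplane keeps retrying the reconcile with backoff, but retries alone cannot fix a root cause such as bad credentials, an exhausted quota, or an invalid spec.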

Here’s how to debug why your Crossplane resources are stuck in a Failed state, especially when dealing with provider errors and claims:

Common Causes and Fixes

  1. Provider Credentials Expired or Invalid:

    • Diagnosis: Check the logs of the specific provider pod (e.g., provider-aws, provider-azure, provider-gcp). You’ll often see authentication-related errors like InvalidAccessKeyId, SignatureDoesNotMatch, or Unauthorized.
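    • Fix: Rotate your cloud provider credentials and update the Secret that your ProviderConfig references. For AWS, this means replacing the access key ID and secret access key stored in that Secret (aws_access_key_id and aws_secret_access_key if you use the shared credentials file format). For Azure, update the client ID, client secret, and tenant ID. For GCP, update the service account key.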
      # Example for AWS ProviderConfig
      apiVersion: aws.upbound.io/v1beta1
      kind: ProviderConfig
      metadata:
        name: default
      spec:
        credentials:
          source: Secret
          secretRef:
            namespace: crossplane-system
            name: aws-creds
            key: credentials
      
      Then, update the secret itself.
      kubectl -n crossplane-system edit secret aws-creds
      # Replace the 'credentials' key's value with your new, valid credentials.
      # Values under 'data' must be base64-encoded.
      
    • Why it works: Crossplane uses these credentials to authenticate with the cloud provider API. If they are invalid, the API calls will fail, leading to resource provisioning errors.
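Hand-editing a Secret means base64-encoding the new value yourself. One way to avoid that, sketched below, is to regenerate the Secret manifest from a local file and apply it over the existing one. The helper name rotate_provider_secret and the file path in the example are hypothetical; the secret name, namespace, and key are taken from the ProviderConfig example above.

```shell
# Sketch: replace the contents of a provider credentials Secret from a local file.
# rotate_provider_secret is a hypothetical helper, not part of Crossplane.
# Usage: rotate_provider_secret <secret-name> <credentials-file> [namespace] [key]
rotate_provider_secret() (
  secret_name="$1"
  creds_file="$2"
  namespace="${3:-crossplane-system}"
  key="${4:-credentials}"

  if [ ! -r "$creds_file" ]; then
    echo "cannot read credentials file: $creds_file" >&2
    exit 1
  fi

  # --dry-run=client renders the manifest locally (kubectl handles the base64
  # encoding); apply then updates the existing Secret in place, so the
  # ProviderConfig reference stays valid.
  kubectl -n "$namespace" create secret generic "$secret_name" \
    --from-file="$key=$creds_file" \
    --dry-run=client -o yaml | kubectl -n "$namespace" apply -f -
)

# Example (assumes ./aws-credentials.txt holds the new credentials):
# rotate_provider_secret aws-creds ./aws-credentials.txt
```

Because the Secret keeps its name and key, the ProviderConfig does not need to change.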
  2. Resource Quotas Exceeded:

    • Diagnosis: Provider logs will show errors like QuotaExceeded, LimitExceeded, or specific error codes indicating a quota violation for the resource type (e.g., CannotCreateCapacityReservation: You have exceeded the maximum number of Capacity Reservations allowed).
    • Fix: Increase the relevant quota in your cloud provider’s console. For example, if you’re hitting an AWS EC2 instance limit, request a quota increase for Running On-Demand Instances or EC2 Instances in the AWS Service Quotas console.
    • Why it works: Cloud providers enforce limits on the number of resources you can provision. Exceeding these limits prevents new resources from being created.
  3. Incorrect Resource Configuration:

    • Diagnosis: Provider logs will contain specific validation errors from the cloud provider. For example, creating an AWS RDS instance with an invalid engineVersion or an unsupported instanceClass is rejected with a validation error.
    • Fix: Review the spec of the managed resource that your Composition creates (e.g., RDSInstance, SQLServer, GKECluster) and compare it against the cloud provider’s API documentation for the resource type. Correct any invalid or unsupported values.
      # Example of an invalid configuration that might fail
      apiVersion: rds.aws.upbound.io/v1beta1
      kind: Instance
      metadata:
        name: my-db
      spec:
        forProvider:
          region: us-east-1
          # Hypothetical: assume 'db.t3.medium' is not supported for the chosen engine/version
          instanceClass: db.t3.medium
          engine: postgres
          engineVersion: "13.3"
          allocatedStorage: 20
          skipFinalSnapshot: true
        providerConfigRef:
          name: default
      
      Correct it to a valid configuration.
      # Corrected configuration
      apiVersion: rds.aws.upbound.io/v1beta1
      kind: Instance
      metadata:
        name: my-db
      spec:
        forProvider:
          region: us-east-1
          instanceClass: db.t3.small # Changed to a supported instance class
          engine: postgres
          engineVersion: "13.3"
          allocatedStorage: 20
          skipFinalSnapshot: true
        providerConfigRef:
          name: default
      
    • Why it works: Cloud provider APIs perform validation on incoming requests. Mismatched or invalid parameters will cause the API call to reject the request immediately.
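Validation errors returned by the cloud API are usually also recorded on the resource itself, in the message of its Synced condition, which can be quicker to read than provider logs. A minimal sketch, assuming the rds.aws.upbound.io Instance from the example above; synced_message is a hypothetical helper name.

```shell
# Sketch: print the message of a resource's Synced condition.
# synced_message is a hypothetical helper, not a Crossplane command.
# Usage: synced_message <resource-kind> <resource-name>
synced_message() {
  kubectl get "$1" "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Synced")].message}'
}

# Example:
# synced_message instance.rds.aws.upbound.io my-db
```

Using the fully qualified kind (instance.rds.aws.upbound.io) avoids ambiguity when several installed providers define a kind called Instance.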
  4. Network Connectivity Issues to Cloud Provider API:

    • Diagnosis: Provider logs might show timeouts, connection refused, or DNS resolution errors when attempting to reach the cloud provider’s endpoint (e.g., s3.amazonaws.com, management.azure.com).
    • Fix: Ensure that the Crossplane pods (specifically the provider pods) have network access to the cloud provider’s API endpoints. If Crossplane is running in a private network, check firewall rules, NAT gateways, or VPC endpoints.
    • Why it works: Crossplane relies on the ability to communicate with the cloud provider’s API to provision and manage resources. Network blocks will prevent these communications.
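One way to test reachability from inside the cluster is to run a throwaway pod in the same namespace and request the endpoint from there. This is a rough sketch: probe_endpoint and the pod name net-probe are hypothetical, the curlimages/curl image is an assumption about what your cluster is allowed to pull, and a throwaway pod may not be subject to exactly the same network policies as the provider pod.

```shell
# Sketch: check whether a cloud API endpoint is reachable from inside the cluster.
# probe_endpoint and the pod name "net-probe" are hypothetical.
# Usage: probe_endpoint <url> [namespace]
probe_endpoint() (
  url="$1"
  namespace="${2:-crossplane-system}"

  # curl exits 0 as soon as any HTTP response arrives (even a 403), which is
  # enough to show that DNS, routing, and TLS to the endpoint work.
  # --rm deletes the pod once the command finishes.
  kubectl -n "$namespace" run net-probe --rm -i --restart=Never \
    --image=curlimages/curl --command -- \
    curl -sS -o /dev/null --max-time 10 "$url"
)

# Example:
# probe_endpoint https://management.azure.com && echo reachable
```

A timeout or DNS error from this pod points at firewall rules, NAT, or VPC endpoint configuration rather than at Crossplane.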
  5. IAM Permissions or Role Issues:

    • Diagnosis: Provider logs will show AccessDenied or Unauthorized errors tied to specific actions (e.g., You are not authorized to perform this operation. Encrypted with KMS key X). Unlike the credential problems in cause 1, authentication succeeds here but the request itself is not permitted.
    • Fix: Review the IAM policies attached to the credentials Crossplane is using. Ensure the service account or user has the necessary permissions for all actions required to create, read, update, and delete the specific resource type. For example, if creating an S3 bucket with encryption, the IAM principal may also need KMS permissions such as kms:Encrypt and kms:Decrypt if using a customer-managed KMS key.
      # Example IAM policy snippet for AWS
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "s3:CreateBucket",
                      "s3:DeleteBucket",
                      "s3:PutBucketPolicy",
                      "kms:Encrypt",
                      "kms:Decrypt"
                  ],
                  "Resource": "*"
              }
          ]
      }
      
    • Why it works: Cloud provider APIs enforce granular permissions. Even with valid credentials, insufficient IAM permissions will cause API calls to be denied.
  6. Provider Not Ready or Failed to Start:

    • Diagnosis: The provider pod (e.g., provider-aws-xxxxxx) might be in a CrashLoopBackOff or Error state. Check its logs for startup errors, often related to missing dependencies, incorrect configuration, or issues within the provider itself.
    • Fix: Ensure the provider is installed correctly and its ProviderConfig is correctly referenced by the Composed Resource. If the provider pod is crashing, examine its logs for specific error messages and consult the provider’s documentation for troubleshooting. Sometimes, a simple kubectl delete pod <provider-pod-name> -n crossplane-system can trigger a restart that resolves transient issues.
    • Why it works: If the provider controller isn’t running or is in an error state, it cannot process reconcile requests for resources it manages, leaving them in a perpetual pending or failed state.
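The checks described for this cause can be sketched as two small helpers. The names provider_health and previous_logs are hypothetical; the kubectl subcommands and the --previous flag are standard, and crossplane-system is assumed to be the namespace Crossplane was installed into.

```shell
# Sketch: inspect provider package and pod health.
# provider_health and previous_logs are hypothetical helper names.
# Usage: provider_health [namespace]
provider_health() (
  namespace="${1:-crossplane-system}"
  # The INSTALLED and HEALTHY columns show whether each provider package is usable.
  kubectl get providers.pkg.crossplane.io
  # Pod status reveals CrashLoopBackOff or Error states.
  kubectl -n "$namespace" get pods
)

# Print logs from the previous (crashed) container of a pod, which is where
# startup errors end up when the pod is in CrashLoopBackOff.
# Usage: previous_logs <provider-pod-name> [namespace]
previous_logs() {
  kubectl -n "${2:-crossplane-system}" logs "$1" --previous
}

# Example:
# provider_health
# previous_logs <provider-pod-name>
```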
  7. Cloud Provider API Downtime or Service Degradation:

    • Diagnosis: This is harder to diagnose solely from Crossplane logs. You’ll likely see generic timeouts or intermittent ServiceUnavailable errors in the provider logs. Check the official status pages for your cloud provider (e.g., AWS Health Dashboard, Azure Status).
    • Fix: Wait for the cloud provider to resolve the issue. There’s nothing you can do within Crossplane to fix external service outages; its reconcile retries will pick the resource up again once the API recovers.
    • Why it works: Crossplane is dependent on the availability of the underlying cloud provider’s API. If the API is down, Crossplane cannot interact with it.
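As a rough triage aid, the error strings listed under the causes above can be mapped back to a likely cause. This is only a sketch: classify_provider_error is a hypothetical helper, the patterns are limited to the strings mentioned in this article plus "no such host" (assumed here as the usual wording of a DNS lookup failure in provider logs), and some strings are genuinely ambiguous, so the output reflects that.

```shell
# Sketch: map one provider log line to a likely cause from the list above.
# classify_provider_error is a hypothetical helper; extend the patterns as needed.
# Usage: classify_provider_error "<log line>"
classify_provider_error() {
  case "$1" in
    *InvalidAccessKeyId*|*SignatureDoesNotMatch*)
      echo "credentials" ;;
    *QuotaExceeded*|*LimitExceeded*)
      echo "quota" ;;
    *AccessDenied*|*"not authorized"*)
      echo "permissions" ;;
    *ServiceUnavailable*)
      echo "provider-outage" ;;
    # Timeouts appear both for blocked networks and for degraded cloud APIs.
    *"connection refused"*|*"no such host"*|*timeout*|*"timed out"*)
      echo "network-or-outage" ;;
    # A bare Unauthorized can mean bad credentials or missing permissions.
    *Unauthorized*)
      echo "credentials-or-permissions" ;;
    *)
      echo "unknown" ;;
  esac
}

# Example: summarize a provider pod's log by likely cause.
# kubectl -n crossplane-system logs <provider-pod-name> |
#   while IFS= read -r line; do classify_provider_error "$line"; done |
#   sort | uniq -c
```

The first matching pattern wins, so the more specific strings are listed before the ambiguous ones.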

After Fixing

Once you’ve resolved the underlying issue and the provider can successfully interact with the cloud, Crossplane normally picks up the fix on its next reconcile, since failing resources are retried with backoff. If you don’t want to wait, you can usually prompt an earlier reconcile with a metadata-only change such as an annotation (the key is arbitrary), which does not touch the external resource:

kubectl annotate <resource-kind> <resource-name> reconcile-trigger="$(date +%s)" --overwrite

Add -n <namespace> for namespaced resources such as claims. Avoid patching fields under spec.forProvider just to force a reconcile: replacing tags, for example, is not a no-op and overwrites the tags on the real cloud resource. Alternatively, delete and re-apply the resource if it’s safe to do so.

If problems persist after that, the next errors you’re likely to hit are a NotFound error, when Crossplane tries to update a resource that was deleted outside of Crossplane, or a deletion that never completes (the resource has a deletionTimestamp but its finalizer is never removed), when you try to delete a resource that’s stuck in Failed.
