A Crossplane resource can get stuck in a Failed state when the underlying cloud provider’s API call fails. Crossplane retries on each reconcile, but retrying cannot recover from failures such as bad credentials, exhausted quotas, or invalid configuration until the root cause is fixed.
Here’s how to debug why your Crossplane resources are stuck in a Failed state, especially when dealing with provider errors and claims:
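A reasonable first step, before working through the causes below, is to read the status conditions on the stuck resource, since the provider usually copies the cloud API error into the condition message. The helper below is a minimal sketch: show_conditions is a hypothetical name, and it assumes kubectl is already pointed at the cluster running Crossplane.

```shell
# Print each status condition of a resource as: type, status, reason, message.
# Usage: show_conditions <kind> <name> [extra kubectl args, e.g. -n <namespace> for a claim]
show_conditions() {
  kind="$1"; name="$2"; shift 2
  kubectl get "$kind" "$name" "$@" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
```

For a claim you would pass the namespace, for example show_conditions <claim-kind> <claim-name> -n <namespace>. Running kubectl describe on the same resource additionally shows recent events.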
Common Causes and Fixes
- Provider Credentials Expired or Invalid:
  - Diagnosis: Check the logs of the specific provider pod (e.g., provider-aws, provider-azure, provider-gcp). You’ll often see authentication-related errors like InvalidAccessKeyId, SignatureDoesNotMatch, or Unauthorized.
  - Fix: Rotate your cloud provider credentials. For AWS, this means updating the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your ProviderConfig secret. For Azure, update the client_id, client_secret, and tenant_id. For GCP, update the service account key.

        # Example for AWS ProviderConfig
        apiVersion: aws.upbound.io/v1beta1
        kind: ProviderConfig
        metadata:
          name: default
        spec:
          credentials:
            source: Secret
            secretRef:
              namespace: crossplane-system
              name: aws-creds
              key: credentials

    Then, update the secret itself.

        kubectl -n crossplane-system edit secret aws-creds
        # Replace the 'credentials' key's value with your new, valid credentials.
        # Values shown by 'kubectl edit' are base64-encoded, so encode the new value too.

  - Why it works: Crossplane uses these credentials to authenticate with the cloud provider API. If they are invalid, the API calls will fail, leading to resource provisioning errors.
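If you would rather not hand-edit base64 values, the Secret can be regenerated from a local credentials file. This is a sketch, not the only way to do it: update_provider_secret is a hypothetical helper name, and it assumes the Secret lives in crossplane-system as in the ProviderConfig example above.

```shell
# Replace the contents of the credentials Secret from a local file, without an interactive edit.
# Usage: update_provider_secret <secret-name> <key> <path-to-credentials-file>
update_provider_secret() {
  secret="$1"; key="$2"; file="$3"
  # Render the Secret client-side, then apply it so an existing Secret is updated in place.
  kubectl -n crossplane-system create secret generic "$secret" \
    --from-file="$key=$file" \
    --dry-run=client -o yaml |
    kubectl -n crossplane-system apply -f -
}
```

For the example above this would be called as update_provider_secret aws-creds credentials ./aws-credentials.txt, where the file path is a placeholder for wherever you keep the rotated credentials.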
- Resource Quotas Exceeded:
  - Diagnosis: Provider logs will show errors like QuotaExceeded, LimitExceeded, or specific error codes indicating a quota violation for the resource type (e.g., CannotCreateCapacityReservation: You have exceeded the maximum number of Capacity Reservations allowed).
  - Fix: Increase the relevant quota in your cloud provider’s console. For example, if you’re hitting an AWS EC2 instance limit, request a quota increase for Running On-Demand Instances or EC2 Instances in the AWS Service Quotas console.
  - Why it works: Cloud providers enforce limits on the number of resources you can provision. Exceeding these limits prevents new resources from being created.
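On AWS, the same quota information is available from the command line. The sketch below assumes the AWS CLI is installed and authenticated (the article itself only mentions the console), and list_quotas is a hypothetical helper name.

```shell
# List quotas for one AWS service whose name contains a given substring.
# Usage: list_quotas <service-code> <name-substring>
list_quotas() {
  service="$1"; filter="$2"
  aws service-quotas list-service-quotas --service-code "$service" \
    --query "Quotas[?contains(QuotaName, '$filter')].[QuotaName,QuotaCode,Value]" \
    --output table
}
```

For example, list_quotas ec2 On-Demand prints matching quota names with their codes and current values. A quota code from that output can then be passed to aws service-quotas request-service-quota-increase together with --service-code and --desired-value.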
- Incorrect Resource Configuration:
  - Diagnosis: Provider logs will contain specific validation errors from the cloud provider. For example, trying to create an AWS RDS instance with an invalid engineVersion or an unsupported instanceClass.
  - Fix: Review the spec of your Composed Resource (e.g., RDSInstance, SQLServer, GKECluster) and compare it against the cloud provider’s API documentation for the resource type. Correct any invalid or unsupported values.

        # Example of an invalid configuration that might fail
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: Instance
        metadata:
          name: my-db
        spec:
          forProvider:
            region: us-east-1
            # Assuming 'db.t3.medium' is not a valid or supported instance class for the chosen engine/version
            instanceClass: db.t3.medium
            engine: postgres
            engineVersion: "13.3"
            allocatedStorage: 20
            skipFinalSnapshot: true
          providerConfigRef:
            name: default

    Correct it to a valid configuration.

        # Corrected configuration
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: Instance
        metadata:
          name: my-db
        spec:
          forProvider:
            region: us-east-1
            instanceClass: db.t3.small # Changed to a valid instance class
            engine: postgres
            engineVersion: "13.3"
            allocatedStorage: 20
            skipFinalSnapshot: true
          providerConfigRef:
            name: default

  - Why it works: Cloud provider APIs perform validation on incoming requests. Mismatched or invalid parameters will cause the API call to reject the request immediately.
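For the RDS example, you can ask AWS which instance classes it will accept for a given engine and version instead of guessing. This is a sketch that assumes the AWS CLI is installed and authenticated; orderable_classes is a hypothetical helper name.

```shell
# Print the instance classes AWS offers for an engine/version in a region, one per line.
# Usage: orderable_classes <engine> <engine-version> <region>
orderable_classes() {
  engine="$1"; version="$2"; region="$3"
  aws rds describe-orderable-db-instance-options \
    --engine "$engine" --engine-version "$version" --region "$region" \
    --query 'OrderableDBInstanceOptions[].DBInstanceClass' --output text |
    tr '\t' '\n' | sort -u   # text output is tab-separated; split and de-duplicate
}
```

Calling orderable_classes postgres 13.3 us-east-1 and checking whether your instance class appears in the list tells you whether the combination in your spec is orderable. An empty result would suggest the engine version itself is not available in that region.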
- Network Connectivity Issues to Cloud Provider API:
  - Diagnosis: Provider logs might show timeouts, connection refused, or DNS resolution errors when attempting to reach the cloud provider’s endpoint (e.g., s3.amazonaws.com, management.azure.com).
  - Fix: Ensure that the Crossplane pods (specifically the provider pods) have network access to the cloud provider’s API endpoints. If Crossplane is running in a private network, check firewall rules, NAT gateways, or VPC endpoints.
  - Why it works: Crossplane relies on the ability to communicate with the cloud provider’s API to provision and manage resources. Network blocks will prevent these communications.
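One way to test reachability from inside the cluster is to run a throwaway pod in the same namespace as the providers and request the endpoint from there. This is a sketch under assumptions: check_endpoint and the pod name net-check are hypothetical, and it assumes the cluster can pull the public curlimages/curl image. A test pod may also land on a different node or be subject to different network policies than the provider pod, so treat the result as a hint.

```shell
# Start a temporary pod in crossplane-system and print the HTTP status code returned by a URL.
# Usage: check_endpoint <url>
check_endpoint() {
  url="$1"
  kubectl run net-check --namespace crossplane-system \
    --rm --attach --restart=Never \
    --image=curlimages/curl --command -- \
    curl -sS -o /dev/null --max-time 10 -w '%{http_code}\n' "$url"
}
```

Any HTTP status code, even 403, shows that DNS and the network path work. A curl error such as a timeout or a failure to resolve the host points back at firewall rules, NAT, or DNS.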
- IAM Permissions or Role Issues:
  - Diagnosis: Provider logs will indicate AccessDenied or Unauthorized errors, but specifically tied to certain actions (e.g., You are not authorized to perform this operation. Encrypted with KMS key X). This differs from general credential issues.
  - Fix: Review the IAM policies attached to the credentials Crossplane is using. Ensure the service account or user has the necessary permissions for all actions required to create, read, update, and delete the specific resource type. For example, if creating an S3 bucket with encryption, the IAM principal needs kms:Encrypt and kms:Decrypt permissions if using a customer-managed KMS key.

    Example IAM policy snippet for AWS:

        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Action": [
                "s3:CreateBucket",
                "s3:DeleteBucket",
                "s3:PutBucketPolicy",
                "kms:Encrypt",
                "kms:Decrypt"
              ],
              "Resource": "*"
            }
          ]
        }

  - Why it works: Cloud provider APIs enforce granular permissions. Even with valid credentials, insufficient IAM permissions will cause API calls to be denied.
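On AWS, the IAM policy simulator can tell you whether a principal’s identity policies allow specific actions before you retry the resource. This sketch assumes the AWS CLI is installed and that your own identity may call the simulator; simulate_actions is a hypothetical helper name. Simulation of identity policies alone may not reflect resource-side policies such as KMS key policies.

```shell
# Evaluate whether a principal's IAM policies allow the given actions.
# Usage: simulate_actions <principal-arn> <action> [<action> ...]
simulate_actions() {
  principal_arn="$1"; shift
  aws iam simulate-principal-policy \
    --policy-source-arn "$principal_arn" \
    --action-names "$@" \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' --output text
}
```

For the S3 example you would call simulate_actions <principal-arn> s3:CreateBucket kms:Encrypt kms:Decrypt and look for any action whose decision is not allowed.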
- Provider Not Ready or Failed to Start:
  - Diagnosis: The provider pod (e.g., provider-aws-xxxxxx) might be in a CrashLoopBackOff or Error state. Check its logs for startup errors, often related to missing dependencies, incorrect configuration, or issues within the provider itself.
  - Fix: Ensure the provider is installed correctly and its ProviderConfig is correctly referenced by the Composed Resource. If the provider pod is crashing, examine its logs for specific error messages and consult the provider’s documentation for troubleshooting. Sometimes, a simple kubectl delete pod <provider-pod-name> -n crossplane-system can trigger a restart that resolves transient issues.
  - Why it works: If the provider controller isn’t running or is in an error state, it cannot process reconcile requests for resources it manages, leaving them in a perpetual pending or failed state.
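The checks for this case can be sketched as two small helpers. The function names are hypothetical, and the sketch assumes provider pods run in crossplane-system as in the rest of this article.

```shell
# Show whether each installed provider package is installed and healthy, plus pod states.
provider_overview() {
  kubectl get providers.pkg.crossplane.io
  kubectl -n crossplane-system get pods
}

# Show the tail of the logs from the previous (crashed) container of a provider pod.
# Usage: crashed_provider_logs <provider-pod-name>
crashed_provider_logs() {
  kubectl -n crossplane-system logs "$1" --previous --tail=100
}
```

The --previous flag matters for a pod in CrashLoopBackOff, because the current container may not have run long enough to log the startup error.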
- Cloud Provider API Downtime or Service Degradation:
  - Diagnosis: This is harder to diagnose solely from Crossplane logs. You’ll likely see generic timeouts or intermittent ServiceUnavailable errors in the provider logs. Check the official status pages for your cloud provider (e.g., AWS Service Health Dashboard, Azure Status).
  - Fix: Wait for the cloud provider to resolve the issue. There’s nothing you can do within Crossplane to fix external service outages.
  - Why it works: Crossplane is dependent on the availability of the underlying cloud provider’s API. If the API is down, Crossplane cannot interact with it.
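The error strings named in the causes above can be turned into a rough triage filter for provider logs. This is only a sketch: classify_provider_errors and its labels are hypothetical, the patterns are simple substring matches taken from this article plus Go’s usual "no such host" DNS error, and real log lines may be worded differently.

```shell
# Read provider log lines on stdin; for lines matching a known pattern,
# print "<likely cause>: <line>". Lines that match nothing are dropped.
classify_provider_errors() {
  while IFS= read -r line; do
    case "$line" in
      *InvalidAccessKeyId*|*SignatureDoesNotMatch*)      echo "credentials: $line" ;;
      *QuotaExceeded*|*LimitExceeded*)                   echo "quota: $line" ;;
      *AccessDenied*|*"not authorized"*)                 echo "iam: $line" ;;
      *Unauthorized*)                                    echo "credentials-or-iam: $line" ;;
      *"connection refused"*|*timeout*|*"no such host"*) echo "network: $line" ;;
      *ServiceUnavailable*)                              echo "provider-outage: $line" ;;
    esac
  done
}
```

A typical invocation pipes recent logs through it, for example kubectl -n crossplane-system logs <provider-pod-name> --since=1h | classify_provider_errors. Unauthorized gets its own label because it can indicate either bad credentials or missing permissions.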
After Fixing
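Once you’ve resolved the underlying issue and the provider can successfully interact with the cloud, the resource should recover on a later reconcile. If you don’t want to wait, you can trigger a reconcile yourself by patching the resource with a small change:

    kubectl patch <resource-kind> <resource-name> -n <namespace> --type='json' -p='[{"op": "replace", "path": "/spec/forProvider/tags", "value": {"reconcile": "trigger"}}]'

Omit -n <namespace> for cluster-scoped managed resources. Be aware that this patch is not a true no-op: a JSON Patch replace overwrites the whole tags map with the new value and fails if the field does not exist yet, so include any tags you want to keep. Replace tags with any field that is mutable and won’t cause unintended side effects, or simply delete and re-apply the resource if it’s safe to do so.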
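A lighter-weight alternative that leaves the spec alone is to change an annotation on the resource. This sketch assumes the provider’s controller reconciles on metadata updates, which is common for Kubernetes controllers but worth confirming for your provider; trigger_reconcile and the annotation key reconcile-trigger are hypothetical names.

```shell
# Nudge a resource's controller by setting an annotation to the current Unix time.
# Usage: trigger_reconcile <kind> <name> [extra kubectl args, e.g. -n <namespace>]
trigger_reconcile() {
  kind="$1"; name="$2"; shift 2
  kubectl annotate "$kind" "$name" "$@" --overwrite "reconcile-trigger=$(date +%s)"
}
```

Because the annotation value changes on every call, repeated calls each produce an update event, and nothing under spec.forProvider is touched.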
The next error you might hit is a NotFound error when Crossplane tries to update a resource that was deleted externally, or a DeletionTimestamp issue, where the deletion never completes, if you’re trying to delete a resource that’s stuck in Failed.