This is about getting CloudFormation to do what you want when it gets stuck.

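CloudFormation stack failures are rarely a single, catastrophic break. More often, one resource fails and the failure cascades: CloudFormation cancels the resources still in progress and rolls the stack back, so the event log fills with secondary failures. CloudFormation does tell you what went wrong, in the stack events (the Events tab in the console, or aws cloudformation describe-stack-events), but you have to find the earliest failure and read its status reason.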
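To make "knowing how to listen" concrete, here is a minimal Python sketch that picks the earliest real failure out of a list of stack events. It assumes the events are dicts shaped like the StackEvents entries returned by boto3's describe_stack_events (Timestamp, LogicalResourceId, ResourceStatus, ResourceStatusReason). The helper names and the sample data are hypothetical, and skipping "cancelled" reasons is a heuristic, not an official rule.

```python
from datetime import datetime, timezone


def failed_events(events):
    """Return failure events oldest-first, skipping cascade cancellations."""
    failures = [
        e for e in events
        if e["ResourceStatus"].endswith("_FAILED")
        # Resources cancelled because a sibling failed are noise, not causes.
        and "cancelled" not in e.get("ResourceStatusReason", "").lower()
    ]
    return sorted(failures, key=lambda e: e["Timestamp"])


def root_cause(events):
    """Return the earliest non-cancelled failure, or None if there is none."""
    failures = failed_events(events)
    return failures[0] if failures else None


def _at(minute):
    return datetime(2024, 1, 1, 12, minute, tzinfo=timezone.utc)


# Hypothetical events, newest first (the order the API returns them in).
SAMPLE_EVENTS = [
    {"Timestamp": _at(3), "LogicalResourceId": "demo-stack",
     "ResourceStatus": "ROLLBACK_IN_PROGRESS",
     "ResourceStatusReason": "The following resource(s) failed to create: [WebServer, Database]."},
    {"Timestamp": _at(2), "LogicalResourceId": "Database",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Resource creation cancelled"},
    {"Timestamp": _at(1), "LogicalResourceId": "WebServer",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Example service error message (made up for this sketch)"},
    {"Timestamp": _at(0), "LogicalResourceId": "Vpc",
     "ResourceStatus": "CREATE_COMPLETE"},
]

if __name__ == "__main__":
    cause = root_cause(SAMPLE_EVENTS)
    print(cause["LogicalResourceId"], "-", cause["ResourceStatusReason"])
```

With boto3 installed and credentials configured, the input could come from boto3.client("cloudformation").describe_stack_events(StackName=...)["StackEvents"]; long event histories are paginated, which this sketch ignores.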

Common Causes of Stack Failures

  1. Resource Creation Timeout: A resource takes too long to provision.

    • Diagnosis: Look for events with a CREATE_FAILED status. The status reason (ResourceStatusReason) usually carries a message from the service indicating the timeout or the specific reason the resource couldn’t be created. For example, an EC2 instance with a CreationPolicy fails if its user data script never sends the success signal, often because the script hangs or the network configuration blocks outbound access.
    • Fix: Where the resource supports it, extend the allowed time: the Timeout property of AWS::CloudFormation::WaitCondition, or the ResourceSignal Timeout in a CreationPolicy attribute (e.g., on an EC2 instance or Auto Scaling group). More commonly, you need to fix the underlying issue causing the delay. For EC2, this could mean ensuring your user data script is correct and doesn’t hang, or that your VPC/subnet has internet access if required. For RDS, it might be incorrect subnet group configurations.
    • Why it works: By fixing the root cause or extending the allowed time, you allow the resource to complete its provisioning process successfully.
  2. Invalid Parameters or Configurations: You provided incorrect values for resource properties.

    • Diagnosis: Events will show CREATE_FAILED with a status reason carrying the service’s own error message, typically naming the invalid property or value. For example, trying to launch an EC2 instance with an AMI ID that doesn’t exist in the specified region, or an RDS instance with an invalid DB subnet group. (Errors in the template itself surface earlier, as a ValidationError when you submit the stack.)
    • Fix: Correct the parameter value in your CloudFormation template. For an invalid AMI ID, find a valid one for your region and update the ImageId property. For an RDS DB subnet group, ensure the subnets specified exist and are in the correct AZs.
    • Why it works: CloudFormation passes the resource properties from your template to the underlying AWS service. If those values are invalid, the service rejects the request, causing the resource creation to fail. Correcting them allows the service to accept and process the request.
  3. Insufficient Permissions (IAM Roles/Policies): The IAM role CloudFormation is using to create resources lacks the necessary permissions.

    • Diagnosis: Look for AccessDenied or UnauthorizedOperation in the event reasons. This happens when CloudFormation tries to perform an action (like creating an S3 bucket, launching an EC2 instance, or modifying a security group) and the role it is acting under lacks permission for that API call. A common special case is a missing iam:PassRole permission for a role being handed to a resource; missing service-linked role permissions are another.
    • Fix: Add the missing permissions to the role CloudFormation acts under: the stack’s service role if one is set, otherwise the credentials of the user or role that launched the stack. For example, if creating an EC2 instance with an IAM instance profile, ensure that role has iam:PassRole permission on the role inside the instance profile. If the stack itself creates IAM roles or policies, it also needs permissions like iam:CreateRole and iam:AttachRolePolicy.
    • Why it works: AWS services enforce permissions based on IAM policies. If the role executing the CloudFormation action doesn’t have explicit permission for a specific API call, the service will deny the request.
  4. Dependencies Not Met: A resource depends on another resource that failed to create or is not yet available.

    • Diagnosis: Events might show a resource failing with a dependency-related error message, or a dependent resource will show a CREATE_FAILED status. For example, trying to create a Security Group Rule that references a VPC Security Group ID that doesn’t exist yet.
    • Fix: Ensure your CloudFormation template correctly defines dependencies, either explicitly with DependsOn or implicitly by referencing other resources in properties (Ref or Fn::GetAtt). More importantly, verify that the referenced resources are themselves healthy and successfully created. Sometimes, you need to manually inspect the status of resources that your failing resource depends on.
    • Why it works: CloudFormation attempts to create resources in an order that respects explicit or implicit dependencies. If a dependency is missing or failed, the dependent resource cannot be created.
  5. Service Quotas Exceeded: You’ve hit a limit on a particular AWS service.

    • Diagnosis: Event reasons will often explicitly state "quota exceeded" or "limit reached." This could be for VPCs, Elastic IPs, EC2 instances, RDS instances, etc., in a specific region.
    • Fix: Request a quota increase for the relevant service and region, through the Service Quotas console or AWS Support. Alternatively, delete existing resources that are consuming your quota if they are no longer needed.
    • Why it works: AWS imposes service quotas to manage resource allocation. Exceeding these limits prevents new resources from being provisioned.
  6. Resource Deletion/Modification Conflicts: Trying to update or delete a stack when resources are in a state that prevents it.

    • Diagnosis: Failed updates show UPDATE_FAILED and failed deletions show DELETE_FAILED, with a status reason saying the resource is in use, not empty, or busy with another operation. For example, trying to delete an S3 bucket that still contains objects, or trying to modify an RDS instance that is currently undergoing a maintenance operation.
    • Fix: For S3 buckets, empty the bucket manually or via scripting before deleting the stack. For other resources, check the AWS console for the specific resource to see its current state and any associated locks or ongoing operations. You might need to manually intervene to resolve the conflict before retrying the stack operation.
    • Why it works: Some AWS resources have built-in protection against accidental deletion or modification, or require specific pre-conditions to be met before an operation can succeed.
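
The six causes above can be turned into a rough first-pass triage of status reasons. The Python sketch below is only a heuristic: the keyword lists, category labels, and function names are assumptions made for illustration, real service messages vary widely, and anything it cannot place is reported as "unknown" rather than guessed.

```python
from collections import Counter

# (category, keywords) pairs, checked in order; first match wins.
# The keywords are illustrative guesses, not an official list of AWS messages.
CATEGORY_KEYWORDS = [
    ("permissions", ("accessdenied", "access denied",
                     "unauthorizedoperation", "not authorized")),
    ("quota", ("quota", "limit exceeded", "limit reached", "limitexceeded")),
    ("timeout", ("timed out", "timeout", "failed to receive")),
    ("conflict", ("in use", "not empty", "conflict")),
    ("dependency", ("depend",)),
    ("invalid configuration", ("invalid", "does not exist", "not found")),
]


def classify_reason(reason):
    """Map a status reason string to one of the failure categories."""
    text = reason.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return category
    return "unknown"


def summarize(reasons):
    """Count how many reasons fall into each category."""
    return Counter(classify_reason(r) for r in reasons)


if __name__ == "__main__":
    # Example reasons, paraphrased for the sketch.
    examples = [
        "User is not authorized to perform: iam:PassRole",
        "Quota exceeded for Elastic IPs",
        "Failed to receive 1 resource signal(s) within the specified duration",
        "The bucket you tried to delete is not empty",
        "The image id '[ami-00000000]' does not exist",
    ]
    for reason in examples:
        print(f"{classify_reason(reason):22} {reason}")
```

The ordering is a design choice: permission and quota errors are checked first because their messages can also contain generic words like "invalid" or "not found". Treat the result as a pointer to the right section above, then read the full reason text.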

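After a failed operation, check where the stack ended up. A stack whose creation failed and rolled back sits in ROLLBACK_COMPLETE; it cannot be updated, so delete it, apply your fix, and create it again. A failed update normally ends in UPDATE_ROLLBACK_COMPLETE, from which you can retry the update. If the rollback itself fails (ROLLBACK_FAILED or UPDATE_ROLLBACK_FAILED), analyze the events again for the rollback phase.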
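
As a quick reference for what to do once a stack has settled after a failure, here is a small Python lookup. The status names are real CloudFormation stack statuses; the advice strings and the next_step helper are illustrative shorthand, not output from any AWS API.

```python
# Suggested follow-up per terminal stack status (wording is illustrative).
NEXT_STEP = {
    "ROLLBACK_COMPLETE": "delete the stack, fix the cause, then create it again",
    "ROLLBACK_FAILED": "delete the stack after clearing whatever blocked the rollback",
    "UPDATE_ROLLBACK_COMPLETE": "fix the cause, then retry the update",
    "UPDATE_ROLLBACK_FAILED": "fix the blocking resource, then continue the update rollback",
    "DELETE_FAILED": "resolve the blocking resource, then retry the delete",
}


def next_step(stack_status):
    """Return a suggested follow-up action for a stack status."""
    if stack_status.endswith("_IN_PROGRESS"):
        return "wait for the current operation to finish"
    return NEXT_STEP.get(stack_status, "inspect the stack events")


if __name__ == "__main__":
    for status in ("ROLLBACK_COMPLETE", "UPDATE_ROLLBACK_FAILED", "UPDATE_IN_PROGRESS"):
        print(f"{status}: {next_step(status)}")
```

For UPDATE_ROLLBACK_FAILED, the follow-up corresponds to the ContinueUpdateRollback operation (aws cloudformation continue-update-rollback).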
