CDK deploys each stack through CloudFormation, and a stack deployment is atomic by default: it either succeeds entirely or fails and rolls back, leaving your infrastructure in its previous state.
Here’s a simple CDK stack in Python (it uses the CDK v1 import style, where core is imported from aws_cdk):
from aws_cdk import (
    core as cdk,
    aws_s3 as s3,
    aws_lambda as lambda_,
    aws_apigateway as apigw,
)


class MyCdkStack(cdk.Stack):
    def __init__(self, scope: cdk.Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        bucket = s3.Bucket(self, "MyS3Bucket")

        my_lambda = lambda_.Function(
            self, "MyLambdaFunction",
            runtime=lambda_.Runtime.PYTHON_3_9,
            handler="index.handler",
            code=lambda_.Code.from_asset("lambda"),
            environment={
                "BUCKET_NAME": bucket.bucket_name
            }
        )

        api = apigw.LambdaRestApi(
            self, "MyApiGateway",
            handler=my_lambda,
            proxy=True
        )

        cdk.CfnOutput(self, "ApiEndpoint", value=api.url)
When you synthesize this and deploy it with cdk deploy, CloudFormation orchestrates the creation of the S3 bucket, Lambda function, and API Gateway. If any of these resources fail to provision, CloudFormation will automatically roll back any changes it did manage to make, ensuring your environment remains consistent.
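To tell which of these outcomes a deployment produced, you can look at the stack status that CloudFormation reports. The sketch below classifies a few common status strings. The function name and the category labels are hypothetical, and the set of statuses is deliberately not exhaustive.

```python
# Illustrative helper; the function name and category labels are assumptions,
# not part of CDK or CloudFormation. Only a subset of statuses is covered.
SUCCEEDED = {"CREATE_COMPLETE", "UPDATE_COMPLETE"}
ROLLED_BACK = {"ROLLBACK_COMPLETE", "UPDATE_ROLLBACK_COMPLETE"}
STUCK = {"ROLLBACK_FAILED", "UPDATE_ROLLBACK_FAILED"}


def classify_stack_status(status: str) -> str:
    """Map a CloudFormation stack status string to a coarse outcome."""
    if status in SUCCEEDED:
        return "deployed"
    if status in ROLLED_BACK:
        return "rolled_back"
    if status in STUCK:
        # The rollback itself failed; the stack needs manual attention.
        return "needs_manual_fix"
    if status.endswith("_IN_PROGRESS"):
        return "in_progress"
    return "other"
```

The status string would typically come from the StackStatus field returned by aws cloudformation describe-stacks.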
However, "atomic" doesn’t always mean "safe" in the way you might expect. What happens when a deployment succeeds as far as CloudFormation is concerned, but the new functionality breaks your application and you need to revert to the previous working version? This is where explicit rollback strategies become crucial.
The default behavior is great for ensuring infrastructure integrity during provisioning failures. But for application-level failures introduced by a new deployment, you need a plan. CDK, by leveraging CloudFormation’s capabilities, offers several ways to manage this.
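The simplest rollback strategy is manual reversion. If cdk deploy completes but your application breaks, check out the last known good version of your CDK code and run cdk deploy again. CloudFormation treats this as an ordinary update: it compares the older template with the current stack and changes the affected resources back to their earlier configuration. It does not restore data, so anything written to a bucket or database in the meantime stays as it is. This is often sufficient for less critical applications or during development.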
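As a rough sketch, manual reversion boils down to two commands: check out the known good commit, then deploy again. The helper names, the commit hash, and the stack name used below are hypothetical. The runner parameter exists only so the command sequence can be exercised without touching git or AWS.

```python
import subprocess
from typing import Callable, List


def revert_commands(good_commit: str, stack_name: str) -> List[List[str]]:
    """Build the commands for a manual revert (names are illustrative)."""
    return [
        ["git", "checkout", good_commit],   # restore the earlier CDK code
        ["cdk", "deploy", stack_name],      # redeploy it as a normal update
    ]


def run_revert(good_commit: str, stack_name: str,
               runner: Callable = subprocess.run) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in revert_commands(good_commit, stack_name):
        runner(cmd, check=True)

# Example (hypothetical values): run_revert("abc1234", "MyCdkStack")
```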
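Before a risky deployment, you can preview what CloudFormation is about to do with a change set (aws cloudformation create-change-set, then aws cloudformation execute-change-set). A change set is a review step, not a rollback mechanism: if its execution fails, CloudFormation rolls back exactly as it would for any other failed update. For rollback that is actually automated, CloudFormation supports rollback triggers, which are CloudWatch alarms that it monitors during the update and for a configurable period afterward; if one of them goes into alarm, the update is rolled back.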
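A change set can be inspected before it is executed. The sketch below assumes you have already fetched the Changes list returned by aws cloudformation describe-change-set and parsed it from JSON; it flags every resource that would be removed or replaced. The function name and the sample data are made up for illustration.

```python
from typing import Dict, List


def risky_changes(changes: List[Dict]) -> List[str]:
    """Return logical IDs that the change set would remove or replace."""
    flagged = []
    for change in changes:
        rc = change.get("ResourceChange", {})
        # Replacement is reported as the string "True", "False" or "Conditional".
        if rc.get("Action") == "Remove" or rc.get("Replacement") == "True":
            flagged.append(rc.get("LogicalResourceId", "<unknown>"))
    return flagged


# Hypothetical sample shaped like the describe-change-set output.
sample_changes = [
    {"Type": "Resource", "ResourceChange": {
        "Action": "Modify", "LogicalResourceId": "MyLambdaFunction",
        "ResourceType": "AWS::Lambda::Function", "Replacement": "False"}},
    {"Type": "Resource", "ResourceChange": {
        "Action": "Modify", "LogicalResourceId": "MyS3Bucket",
        "ResourceType": "AWS::S3::Bucket", "Replacement": "True"}},
]
flagged = risky_changes(sample_changes)
```

A pipeline could refuse to execute the change set, or ask for manual approval, whenever the flagged list is not empty.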
A complementary safeguard is a CloudFormation "Stack Policy." A stack policy controls which resources in a stack can be updated, replaced, or deleted. You can define a policy that blocks changes to critical resources (like your database or a core API) unless you explicitly override it for a planned update. Stack policies have no notion of time, so restricting changes to a maintenance window means temporarily overriding the policy during that window. While not a rollback mechanism, a stack policy prevents disruptive changes that might necessitate a rollback in the first place.
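A stack policy is a JSON document. The sketch below builds one that allows updates in general but denies replacement or deletion of resources of the given types. The helper name is hypothetical, and which resource types to protect is an assumption you would adjust for your own stack.

```python
import json
from typing import Dict, List


def build_stack_policy(protected_types: List[str]) -> Dict:
    """Allow all updates, but deny replacing or deleting protected types."""
    return {
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "Update:*",
                "Principal": "*",
                "Resource": "*",
            },
            {
                # An explicit Deny takes precedence over the Allow above.
                "Effect": "Deny",
                "Action": ["Update:Replace", "Update:Delete"],
                "Principal": "*",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"ResourceType": protected_types}
                },
            },
        ]
    }


# Example: protect S3 buckets (an illustrative choice).
policy_json = json.dumps(build_stack_policy(["AWS::S3::Bucket"]), indent=2)
```

The resulting JSON can be attached to an existing stack with aws cloudformation set-stack-policy. Matching on resource type rather than logical ID is a design choice here: CDK generates logical IDs with a hash suffix, which makes them awkward to hard-code.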
Consider a scenario where a new Lambda deployment causes an outage. Your cdk deploy might succeed, but the application is down. You want to revert to the last known good state. One advanced technique is to version your CDK code and use a CI/CD pipeline. The pipeline can be configured to automatically trigger a rollback to the previous deployment commit if health checks against the deployed application fail. This requires integrating your pipeline with CloudFormation and your application’s monitoring.
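The control flow of such a pipeline step can be sketched as follows. The deploy and is_healthy callables are placeholders for whatever your pipeline and monitoring actually provide, and the number of checks and the wait time are arbitrary assumptions.

```python
import time
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[str], None],    # hypothetical: deploys a given commit
    is_healthy: Callable[[], bool],   # hypothetical: probes the live application
    new_commit: str,
    previous_commit: str,
    checks: int = 3,
    wait_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Deploy new_commit; redeploy previous_commit if any health check fails."""
    deploy(new_commit)
    for _ in range(checks):
        sleep(wait_seconds)           # give the new version time to settle
        if not is_healthy():
            deploy(previous_commit)   # roll back by redeploying known good code
            return "rolled_back"
    return "deployed"
```

Requiring every check to pass is a conservative choice; a real pipeline might tolerate a single failed probe or rely on alarm state instead.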
Another strategy, best suited to stateless components such as Lambda functions or containerized services, is blue/green deployment. This involves deploying the new version alongside the old one, testing it, and then switching traffic over. If issues arise, traffic can be quickly switched back to the old version, effectively rolling back the application without touching the infrastructure provisioning. Stateful resources like databases need extra care, because both versions usually share the same data and any schema change must stay compatible with the old version. CDK can orchestrate the creation of both environments, though the traffic switching logic is often handled by services like API Gateway or Elastic Load Balancing.
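The switching logic itself is small. In the sketch below, set_live stands in for whatever actually moves traffic (for example a load balancer or API Gateway configuration change), and the two check functions are placeholders; all of these names are hypothetical.

```python
from typing import Callable


def blue_green_switch(
    set_live: Callable[[str], None],       # hypothetical: points traffic at an environment
    passes_tests: Callable[[str], bool],   # hypothetical: tests an environment before cutover
    healthy_after_switch: Callable[[], bool],
    old: str = "blue",
    new: str = "green",
) -> str:
    """Cut traffic over to the new environment, switching back on failure."""
    if not passes_tests(new):
        return "kept_old"        # never switched, so there is nothing to undo
    set_live(new)
    if not healthy_after_switch():
        set_live(old)            # the old environment is still running
        return "reverted"
    return "switched"
```

Because the old environment stays up until the new one has proven itself, reverting is a routing change rather than a redeployment.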
The core idea behind effective rollback is understanding that a successful cdk deploy only means the infrastructure provisioning was successful. It doesn’t guarantee the application is working correctly. Therefore, your rollback strategy must encompass both infrastructure and application health.
What most people miss is that CloudFormation’s automatic rollback only handles failures during the provisioning process. If your deployment succeeds but introduces a bug, CloudFormation won’t automatically revert. You need to explicitly trigger a rollback by deploying a previous version of your stack or by implementing pipeline-driven rollbacks based on application health metrics.
The next step after mastering rollback is implementing progressive delivery strategies like canary deployments, where you gradually roll out changes to a small subset of users to catch issues early.