AWS CDK Python Best Practices for Production Stacks (2026)

The most surprising thing about AWS CDK for production is that the "best practices" often involve less abstraction, not more, when it comes to core infrastructure.

Let’s look at a real-world main.py from a production CDK application:

import os
from aws_cdk import (
    App, Environment, CfnOutput,
    Stack, RemovalPolicy,
    aws_s3 as s3,
    aws_iam as iam,
    aws_lambda as lambda_,
    aws_apigateway as apigateway,
    aws_logs as logs,
    aws_sqs as sqs,
    aws_sns as sns,
    aws_sns_subscriptions as subscriptions,
    aws_ecs as ecs,
    aws_ecs_patterns as ecs_patterns,
    aws_rds as rds,
    aws_ec2 as ec2,
)

# Load environment variables for configuration
ACCOUNT = os.environ.get("CDK_DEFAULT_ACCOUNT")
REGION = os.environ.get("CDK_DEFAULT_REGION")
ENV = Environment(account=ACCOUNT, region=REGION)

class ProductionStack(Stack):
    def __init__(self, scope: App, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, env=ENV, **kwargs)

        # Core VPC for all resources
        vpc = ec2.Vpc(self, "ProductionVPC",
                      max_azs=2,
                      ip_addresses=ec2.IpAddresses.cidr("10.0.0.0/16"),  # replaces the deprecated cidr= argument
                      subnet_configuration=[
                          ec2.SubnetConfiguration(
                              name="PublicSubnet",
                              subnet_type=ec2.SubnetType.PUBLIC,
                              cidr_mask=24
                          ),
                          ec2.SubnetConfiguration(
                              name="PrivateSubnet",
                              subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS,
                              cidr_mask=24
                          )
                      ])

        # S3 bucket for logs and artifacts
        log_bucket = s3.Bucket(self, "ProductionLogBucket",
                               versioned=True,
                               removal_policy=RemovalPolicy.RETAIN,
                               auto_delete_objects=False,
                               encryption=s3.BucketEncryption.S3_MANAGED)
        CfnOutput(self, "LogBucketName", value=log_bucket.bucket_name)

        # Database cluster (Aurora PostgreSQL with a Serverless v2 writer)
        db_cluster_identifier = "prod-rds-cluster"
        db_cluster = rds.DatabaseCluster(self, "ProductionDBCluster",
                                         engine=rds.DatabaseClusterEngine.AURORA_POSTGRESQL,
                                         # Single Serverless v2 writer instance; add readers=[...] for failover
                                         writer=rds.ClusterInstance.serverless_v2("Writer"),
                                         cluster_identifier=db_cluster_identifier,
                                         # "admin" is a reserved name for PostgreSQL engines on RDS
                                         credentials=rds.Credentials.from_generated_secret("clusteradmin"),
                                         vpc=vpc,
                                         removal_policy=RemovalPolicy.RETAIN, # CRITICAL for production
                                         storage_encrypted=True,
                                         iam_authentication=True,
                                         )
        CfnOutput(self, "DBClusterArn", value=db_cluster.cluster_arn)

        # ECS Fargate Service for application
        cluster = ecs.Cluster(self, "ProductionECSCluster", vpc=vpc)

        # Example: A simple Lambda function for API Gateway integration
        lambda_role = iam.Role(self, "LambdaExecutionRole",
                               assumed_by=iam.ServicePrincipal("lambda.amazonaws.com"),
                               managed_policies=[
                                   iam.ManagedPolicy.from_aws_managed_policy_name("service-role/AWSLambdaBasicExecutionRole")
                               ])
        log_bucket.grant_read(lambda_role) # Grant read access to the log bucket

        api_lambda = lambda_.Function(self, "ApiHandlerFunction",
                                      runtime=lambda_.Runtime.PYTHON_3_12, # Python 3.9 has reached end of life
                                      handler="index.handler",
                                      code=lambda_.Code.from_asset("lambda_src"), # Assumes lambda_src/index.py exists
                                      role=lambda_role,
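                                      environment={
                                          "LOG_BUCKET": log_bucket.bucket_name,
                                          # ARN of the generated Secrets Manager secret (not the secret value)
                                          "DB_CLUSTER_SECRET_ARN": db_cluster.secret.secret_arn
                                      },
                                      log_retention=logs.RetentionDays.ONE_YEAR) # Explicitly set log retention
        db_cluster.secret.grant_read(lambda_role) # Allow the function to read the DB secret at runtime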

        # API Gateway
        api = apigateway.LambdaRestApi(self, "ProductionApi",
                                       handler=api_lambda,
                                       deploy_options=apigateway.StageOptions(stage_name="v1"))
        CfnOutput(self, "ApiEndpoint", value=api.url)

        # Example: SNS Topic for notifications
        notification_topic = sns.Topic(self, "NotificationTopic",
                                       display_name="Production Notifications")
        notification_topic.add_subscription(subscriptions.LambdaSubscription(api_lambda)) # Example subscription
        CfnOutput(self, "NotificationTopicArn", value=notification_topic.topic_arn)

app = App()
ProductionStack(app, "ProductionStack")
app.synth()

This stack defines fundamental production resources: a VPC, a durable S3 bucket, an Aurora database cluster, and an ECS cluster (the ECS service itself is omitted for brevity; a Lambda/API Gateway example stands in for it). Notice the explicit RemovalPolicy.RETAIN on the bucket and the database. Some stateful constructs already default to a protective policy (s3.Bucket retains, rds.DatabaseCluster takes a snapshot on removal), but the defaults differ from construct to construct, so stating the policy in code makes the intent visible in review. It’s a crucial safety net.

The problem CDK solves is infrastructure as code, but for production, it’s about managing the lifecycle and safety of that code. This means treating your infrastructure definitions with the same rigor as your application code.

Internally, CDK synthesizes CloudFormation templates. The aws_cdk constructs translate your Python objects into CloudFormation resource definitions. For example, the s3.Bucket construct generates an AWS::S3::Bucket resource in the template. RemovalPolicy.RETAIN translates to DeletionPolicy: Retain (and UpdateReplacePolicy: Retain) on that resource, so CloudFormation leaves it in place when it is removed from the stack or the stack is deleted.

The levers you control are primarily:

Constructs: The building blocks of your infrastructure (e.g., s3.Bucket, rds.DatabaseCluster). Choose them wisely based on your needs.
Properties: The configuration of each construct (e.g., versioned=True, removal_policy=RemovalPolicy.RETAIN, storage_encrypted=True). These are your knobs for fine-tuning.
Policies: IAM roles and policies that grant permissions. This is where security is enforced.
Environment Variables and Context: For passing configuration like account IDs, regions, or specific feature flags between stacks or deployments.
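
As one way to make the last lever concrete, the sketch below gathers deployment settings into a single object and fails fast when a required variable is missing, instead of letting None flow into Environment(...). It is a minimal illustration, not part of the stack above: DeployConfig, load_config, and the STAGE variable are hypothetical names chosen for this example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployConfig:
    """Settings one deployment needs; the field names are illustrative."""
    account: str
    region: str
    stage: str
    retain_data: bool


def load_config(environ=None) -> DeployConfig:
    """Build a DeployConfig from environment variables, failing fast on gaps."""
    env = os.environ if environ is None else environ
    required = ("CDK_DEFAULT_ACCOUNT", "CDK_DEFAULT_REGION")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("missing required environment variables: " + ", ".join(missing))
    stage = env.get("STAGE", "dev")  # STAGE is a hypothetical variable name
    return DeployConfig(
        account=env["CDK_DEFAULT_ACCOUNT"],
        region=env["CDK_DEFAULT_REGION"],
        stage=stage,
        # Assumption for this sketch: only the throwaway "dev" stage may drop data
        retain_data=stage != "dev",
    )
```

In an app entry point, the result could feed Environment(account=cfg.account, region=cfg.region) and decide between RemovalPolicy.RETAIN and RemovalPolicy.DESTROY per stage.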

When defining IAM roles for services like Lambda or ECS, it’s a common pitfall to grant overly broad permissions. Instead of iam.ManagedPolicy.from_aws_managed_policy_name("AdministratorAccess"), you should use the principle of least privilege. In the example, the LambdaExecutionRole only gets AWSLambdaBasicExecutionRole and explicit read access to the log bucket. You can further refine this by creating custom policies or using iam.PolicyStatement to grant specific actions on specific resources.
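
To show what a narrowly scoped grant looks like once rendered, here is a small plain-Python sketch of an IAM policy document plus a check for wildcard statements. It does not use the CDK itself; in a stack, the same actions and resources would typically be passed to iam.PolicyStatement(actions=[...], resources=[...]). The helper names and the example bucket ARN are hypothetical.

```python
def bucket_read_policy(bucket_arn: str) -> dict:
    """Policy document allowing read-only access to a single bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                # ListBucket applies to the bucket, GetObject to the objects in it
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


def wildcard_findings(policy: dict) -> list:
    """Describe every Allow statement whose Action or Resource is exactly '*'."""
    findings = []
    for index, statement in enumerate(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        for key in ("Action", "Resource"):
            value = statement.get(key, [])
            values = [value] if isinstance(value, str) else value
            if "*" in values:
                findings.append(f"statement {index}: {key} is '*'")
    return findings
```

A check like wildcard_findings could run in CI against the policies found in a synthesized template; it only catches the bare "*" case, so service-level wildcards such as "s3:*" would need an extra rule.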

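A common pattern for managing sensitive configuration like database credentials is to use AWS Secrets Manager. The CDK can generate a secret for your database cluster, and you can pass the secret’s ARN (never the secret value) to the Lambda function’s environment variables so the function retrieves the credentials at runtime. The db_cluster.secret property exposes the generated secret; its secret_arn attribute gives the ARN, and the function’s role still needs read permission on the secret, for example via db_cluster.secret.grant_read(...).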
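
On the Lambda side, retrieving those credentials might look like the sketch below, which could live in lambda_src/index.py. It assumes the secret is the JSON document that RDS-generated secrets use (with username, password, host, and port keys) and that boto3 is available, as it is in the Lambda Python runtimes. The load_db_credentials helper is a hypothetical name; the client is passed in so the parsing logic can be tested without AWS access.

```python
import json
import os


def load_db_credentials(secret_arn: str, client) -> dict:
    """Fetch the secret and return the connection fields as a dict.

    `client` is a Secrets Manager client, e.g. boto3.client("secretsmanager").
    """
    response = client.get_secret_value(SecretId=secret_arn)
    secret = json.loads(response["SecretString"])
    return {
        "username": secret["username"],
        "password": secret["password"],
        "host": secret.get("host"),
        "port": secret.get("port"),
    }


def handler(event, context):
    import boto3  # imported here so the helper above stays testable offline

    creds = load_db_credentials(
        os.environ["DB_CLUSTER_SECRET_ARN"],
        boto3.client("secretsmanager"),
    )
    # Open the database connection with creds here; never log the credentials.
    return {"statusCode": 200, "body": json.dumps({"db_host_known": creds["host"] is not None})}
```

Fetching on every invocation keeps the example short; a real function would more likely cache the parsed secret at module level and refresh it when a connection fails after rotation.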

A common mistake when deploying production CDK stacks is leaving RemovalPolicy implicit. Defaults differ between constructs: some high-level constructs retain or snapshot their data, while CloudFormation’s own default for most resource types is to delete the resource when it is removed from the stack or when you run cdk destroy. For production, you almost always want RemovalPolicy.RETAIN set explicitly on critical data-holding resources such as databases, S3 buckets, and ECR repositories, and then manage their deletion manually or via a separate process.
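
One way to guard against this is to inspect the synthesized template before deploying. The sketch below is a plain-Python check, assuming the template has been written by cdk synth to a JSON file such as cdk.out/ProductionStack.template.json. The STATEFUL_TYPES set and the function names are choices made for this example, not a CDK feature.

```python
import json

# Resource types treated as stateful in this sketch; extend for your own stacks
STATEFUL_TYPES = {"AWS::S3::Bucket", "AWS::RDS::DBCluster", "AWS::ECR::Repository"}


def unprotected_resources(template: dict) -> list:
    """Logical IDs of stateful resources not fully protected by Retain."""
    flagged = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") not in STATEFUL_TYPES:
            continue
        # RemovalPolicy.RETAIN sets both attributes, so require both
        policies = (resource.get("DeletionPolicy"), resource.get("UpdateReplacePolicy"))
        if policies != ("Retain", "Retain"):
            flagged.append(logical_id)
    return sorted(flagged)


def audit_file(path: str) -> list:
    """Load a synthesized template from disk and audit it."""
    with open(path) as handle:
        return unprotected_resources(json.load(handle))
```

A CI step could call audit_file on each template under cdk.out and fail the build when the returned list is not empty.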

The next concept you’ll likely encounter is managing multiple environments (dev, staging, prod) with CDK, often involving parameterized stacks and different configuration sets.