Canary deployments, when done correctly, don’t just roll out new code; they limit the blast radius of a bad release by deliberately exposing only a small fraction of users to it first.
Let’s see this in action. Imagine we have a simple Deployment for our web app:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: nginx:1.21.0
          ports:
            - containerPort: 80
Now, we want to introduce a new version, nginx:1.22.0, but do it safely. We’ll replace the Deployment with a Rollout object from Argo Rollouts (a separate project from Argo CD, though the two are often used together):
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: nginx:1.21.0 # This will be updated
          ports:
            - containerPort: 80
  strategy:
    canary:
      steps:
        - setWeight: 10 # Percentage of traffic to send to the canary
        - pause: { duration: 5m } # Wait 5 minutes for observation
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 75
        - pause: { duration: 5m }
        - setWeight: 100 # Gradually increase to 100%
When the pod template changes (e.g., we change spec.template.spec.containers[0].image from nginx:1.21.0 to nginx:1.22.0 in our Git repository and Argo CD syncs it to the cluster), the Argo Rollouts controller doesn’t just update all pods at once. Instead, it creates a new ReplicaSet for the new image.
The Rollout controller then works through the steps in order. At the first step it scales up the new ReplicaSet to a small, controlled share of the desired capacity (10% here) while the old ReplicaSet keeps serving the rest. It then holds for the configured pause (5 minutes in this example) so that you, or an automated analysis if one is configured, can check metrics such as error rate and latency. A timed pause evaluates nothing by itself: when it expires, the rollout simply moves on to the next step, raising the canary’s share to 25%, then 50%, and so on.
The steps list is the whole rollout plan: each setWeight sets the canary’s target percentage, and each pause holds that weight for manual inspection or automated analysis. The canary strategy has no separate increments field; the percentages come only from the setWeight steps. If an attached analysis fails at any point (e.g., the error rate goes above 5%), the rollout aborts automatically, shifting back to the stable version and preventing widespread user impact.
What makes this powerful is the decoupling of deployment and traffic shifting. Argo Rollouts integrates with ingress controllers (like Nginx Ingress, Traefik, or AWS ALB) and service meshes (like Istio or Linkerd) to manage traffic routing. It doesn’t just deploy new pods; it directs a configured percentage of live user traffic to them.
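To make the integration concrete, here is a minimal sketch of the strategy section with Nginx Ingress traffic routing enabled. The Service names my-app-stable and my-app-canary and the Ingress name my-app-ingress are hypothetical; they are assumed to already exist in the same namespace, and the rest of the Rollout spec is omitted for brevity.

```yaml
# Sketch only: a fragment of the Rollout, not a complete manifest.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  # replicas, selector, and template are unchanged and omitted here
  strategy:
    canary:
      stableService: my-app-stable     # hypothetical Service for the stable pods
      canaryService: my-app-canary     # hypothetical Service for the canary pods
      trafficRouting:
        nginx:
          stableIngress: my-app-ingress  # hypothetical Ingress that routes to my-app-stable
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
```

With a traffic router in place, setWeight is applied at the ingress layer, so a 10% weight can be honored even when the replica count is too small to express it in pods.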
The key to a successful canary is having robust monitoring and automated analysis in place. Argo Rollouts lets you define AnalysisTemplate objects that run checks against Prometheus, Datadog, or other observability tools. For instance, you could set up an analysis that checks whether the error rate for the new version exceeds 0.5% over a 2-minute window. If it does, the rollout is aborted and traffic returns to the stable version.
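As a rough sketch of that error-rate check, an AnalysisTemplate backed by Prometheus might look like the following. The Prometheus address, the http_requests_total metric with its service and status labels, the template name error-rate, and the my-app-canary Service name are all assumptions made for illustration; substitute whatever your own monitoring setup exposes.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate                 # hypothetical template name
spec:
  args:
    - name: service-name           # supplied by the Rollout that uses this template
  metrics:
    - name: error-rate
      interval: 1m                 # take one measurement per minute
      failureLimit: 0              # a single failed measurement fails the analysis
      successCondition: result[0] <= 0.005   # 0.5% error rate
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # assumed address
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
---
# Fragment of the Rollout that runs the template in the background
# from the second step onward (other fields omitted).
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: error-rate
        startingStep: 1
        args:
          - name: service-name
            value: my-app-canary   # hypothetical canary Service name
```

One caveat with a ratio query like this: if the canary receives no requests during the window, the query can return no data, so it is worth deciding up front how an empty or inconclusive result should be treated.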
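The setWeight directive in the steps is crucial. It tells the Rollout controller what percentage of traffic the new version should receive. With a traffic router configured, that percentage is enforced at the ingress or mesh layer. Without one, the controller approximates it by adjusting pod counts between the two ReplicaSets, which is coarse at low replica counts: with only 3 replicas, a single canary pod already receives far more than 10% of requests. Combined with the pause duration, this creates a deliberate, observable rollout.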
One detail often overlooked is how Argo Rollouts manages the underlying Kubernetes resources. It doesn’t delete the old ReplicaSet immediately. Instead, it scales it down as the new ReplicaSet scales up. This provides a quick rollback path: if issues arise, the Rollout controller can simply scale the new ReplicaSet back to zero and scale the old one back up.
The traffic routing itself is managed by the Rollout controller, which updates the configuration of your ingress resource or service mesh. With Nginx Ingress, for example, it maintains a second, canary Ingress that points at the canary Service and adjusts the canary-weight annotation on it to split traffic between the stable and canary Services.
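For a sense of what that looks like on the cluster, here is roughly the kind of canary Ingress you would expect to see at the 10% step. The two annotations are standard ingress-nginx canary annotations; the host, the Service name my-app-canary, and the object name are hypothetical, and the generated name in particular should be treated as an assumption to verify against your own cluster.

```yaml
# Illustrative only: this object is managed by the controller, not written by hand.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-my-app-ingress-canary   # assumed generated name; check with kubectl get ingress
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # updated at each setWeight step
spec:
  ingressClassName: nginx
  rules:
    - host: my-app.example.com          # hypothetical host, copied from the stable Ingress
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-canary     # hypothetical canary Service
                port:
                  number: 80
```

Because the weight lives in an annotation, hand edits to this Ingress are likely to be overwritten the next time the controller reconciles the Rollout.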
When you’re starting out, you might be tempted to set very short duration values for your pauses. However, remember that the goal is to observe real user traffic. A few minutes might not be enough to catch intermittent issues or patterns that only emerge under sustained load. You’ll likely find yourself tuning these durations based on your application’s behavior and your team’s comfort level.
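As one possible starting point, the steps below use longer timed pauses early on and an indefinite pause before the final promotion. The specific durations are placeholders, not recommendations.

```yaml
# Fragment of the Rollout spec; durations are illustrative placeholders.
strategy:
  canary:
    steps:
      - setWeight: 10
      - pause: { duration: 30m }   # long enough to see a meaningful amount of real traffic
      - setWeight: 50
      - pause: { duration: 1h }
      - setWeight: 75
      - pause: {}                  # no duration: wait until someone promotes the rollout
      - setWeight: 100
```

A pause with no duration holds the rollout until it is promoted manually, for example with the Argo Rollouts kubectl plugin (kubectl argo rollouts promote my-app), which is a useful safety net while the team is still building confidence in its automated checks.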
After a successful canary rollout, once the new version is receiving 100% of traffic, Argo Rollouts marks the new ReplicaSet as stable and scales the old one down to zero. The old ReplicaSet object is not deleted right away; it is kept as revision history and pruned later according to revisionHistoryLimit.
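Both parts of that cleanup can be tuned. The fragment below is a sketch with illustrative values; note that, as far as the canary strategy goes, scaleDownDelaySeconds is only meaningful when a traffic router is configured.

```yaml
# Fragment of the Rollout spec; values are illustrative.
spec:
  revisionHistoryLimit: 2          # how many old ReplicaSets to retain for rollback
  strategy:
    canary:
      scaleDownDelaySeconds: 60    # wait before scaling down the old ReplicaSet after the switch
```

Keeping a couple of old revisions around costs little, since they sit at zero replicas, and gives you a target to roll back to if a problem only surfaces after promotion.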
The next hurdle you’ll likely face is automating the rollback decision based on advanced performance metrics rather than just basic error rates.