The most surprising thing about auto-retrying failed steps is that it’s often not the failure that’s the problem, but the success of the retry itself.
Let’s see what this looks like in action. Imagine a workflow that needs to call an external API. This API is flaky, sometimes returning a 503 Service Unavailable error.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: flaky-api-call-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: call-flaky-api
            template: flaky-api-template
    - name: flaky-api-template
      # retryStrategy is a template-level field, so it is set here
      # rather than on the DAG task that references this template.
      retryStrategy:
        limit: 3
        backoff:
          duration: "5s"
          factor: 2
          maxDuration: "30s"
      container:
        image: alpine:latest
        command: ["sh", "-c"]
        args:
          - |
            echo "Calling external API..."
            # Simulate a flaky API call that fails roughly one time in five
            if [ $(($RANDOM % 5)) -eq 0 ]; then
              echo "API call failed with 503 Service Unavailable"
              exit 1
            else
              echo "API call succeeded"
              exit 0
            fi
Here, flaky-api-template, the template that the call-flaky-api task runs, has a retryStrategy. Argo defines retryStrategy on templates rather than on individual DAG tasks, so that is where it lives. If the container fails (exits with a non-zero status), the step is retried up to 3 times. The backoff waits 5 seconds before the first retry and doubles the wait for each subsequent one (10 seconds, then 20 seconds), subject to maxDuration.
The system doesn’t magically know why the API failed. It only sees the exit code. When the container exits with 1, the Argo controller marks the step as failed and triggers the retry mechanism based on the defined retryStrategy. If the container succeeds (exits with 0) on any of the retries, the workflow proceeds as if no failure ever occurred. This is the "success of the retry."
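Since the exit code is the only signal the controller receives, the task script decides what that signal says. A minimal Python sketch of one possible convention follows; the constant names, the choice of 75 for transient failures, and the set of statuses treated as transient are assumptions for illustration, not anything Argo prescribes.

```python
# Hypothetical exit-code convention for this sketch; Argo only sees the number.
EXIT_OK = 0
EXIT_PERMANENT = 1    # e.g. 400 or 404: retrying will not help
EXIT_TRANSIENT = 75   # borrowed from the sysexits.h value EX_TEMPFAIL

# Statuses this sketch assumes are worth retrying.
TRANSIENT_STATUSES = {429, 502, 503, 504}


def exit_code_for_status(status: int) -> int:
    """Translate an HTTP status code into a process exit code."""
    if 200 <= status < 300:
        return EXIT_OK
    if status in TRANSIENT_STATUSES:
        return EXIT_TRANSIENT
    return EXIT_PERMANENT
```

A real task script would end with sys.exit(exit_code_for_status(status)), so that each class of failure leaves the container with its own exit code.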
The problem arises when the failure looks transient but is really a symptom of a deeper, persistent issue that the retry masks. For example, if the API is rate-limiting your requests, retrying before the limit resets will just hit the same rate limit, or worse, exacerbate it. The workflow might eventually succeed through sheer luck of timing, but the underlying problem remains unaddressed and nothing in the workflow's final status tells you it exists.
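One way to avoid hammering a rate-limited API is for the task itself to wait as long as the server asks before it gives up and exits. The sketch below parses the standard HTTP Retry-After header, which carries either a number of seconds or an HTTP date; the function name and the 5-second fallback are assumptions made for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, fallback=5.0):
    """Seconds to wait according to a Retry-After header, or `fallback` if unusable."""
    if header_value is None:
        return fallback
    value = header_value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "120"
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return fallback
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means there is no need to wait.
    return max(0.0, (when - now).total_seconds())
```

If the requested wait is longer than the task can reasonably sleep, failing fast and logging the value makes the rate limit visible instead of hiding it behind a lucky retry.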
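The key levers you control are the limit (how many times to retry) and the backoff strategy (how long to wait between retries). With duration: "5s" and factor: 2, the delays before successive retries are 5s, 10s, and 20s, for 5s + 10s + 20s = 35 seconds of waiting across three retries, plus the execution time of each attempt. Note that maxDuration does not cap an individual delay; it bounds the whole retry sequence, measured from the start of the first attempt. With maxDuration: "30s", the third retry could not start until at least 35 seconds in, so in practice this configuration gives up after two retries even though the limit is 3. Raise maxDuration or shorten the delays if you want all three retries to run.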
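To make the arithmetic concrete, here is a small Python sketch that computes when each retry would start. It is a simplified model, not Argo's implementation: it assumes every attempt fails after a fixed step_runtime, and it treats max_duration as a deadline on the whole retry sequence measured from the start of the first attempt, which is how I understand Argo's maxDuration. The function and parameter names are made up for this example.

```python
def retry_schedule(base, factor, limit, max_duration=None, step_runtime=0.0):
    """Return the start time of each retry, in seconds after the first attempt starts.

    Simplified model: every attempt fails after `step_runtime` seconds, and
    `max_duration` is a deadline measured from the start of the first attempt.
    """
    starts = []
    finished = step_runtime  # when the first attempt finishes
    for retry in range(limit):
        delay = base * factor ** retry      # 5, 10, 20, ... for base=5, factor=2
        start = finished + delay
        if max_duration is not None and start > max_duration:
            break  # waiting this long would overshoot the deadline
        starts.append(start)
        finished = start + step_runtime
    return starts


no_deadline = retry_schedule(base=5, factor=2, limit=3)
with_deadline = retry_schedule(base=5, factor=2, limit=3, max_duration=30)
```

Under these assumptions, no_deadline contains three start times and with_deadline only two, because the third retry would begin after the 30-second deadline.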
A common pitfall is setting a high limit and a very short backoff for non-idempotent operations. Imagine a task that creates a resource. If it fails after creating the resource but before confirming success, a retry might attempt to create the same resource again, leading to duplicates or conflicts. This is why understanding idempotency is crucial: can the operation be performed multiple times with the same result as if it were performed only once?
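A common way to make a create operation safe to retry is an idempotency key: the caller sends the same key on every attempt, and the server returns the existing resource instead of creating a second one. The sketch below uses an in-memory stand-in for the external API; the class, the key format, and the resource fields are all hypothetical.

```python
class FakeResourceApi:
    """In-memory stand-in for an external API (hypothetical, for illustration)."""

    def __init__(self):
        self.resources = {}  # idempotency key -> resource

    def create(self, idempotency_key: str, spec: dict) -> dict:
        # A repeated key returns the resource created the first time
        # instead of creating a duplicate.
        if idempotency_key not in self.resources:
            resource = {"id": len(self.resources) + 1, **spec}
            self.resources[idempotency_key] = resource
        return self.resources[idempotency_key]


api = FakeResourceApi()
# Derive the key from something stable across retries (here a made-up
# workflow name plus task name), never from a value generated per attempt.
key = "flaky-api-call-abc12/create-bucket"
first = api.create(key, {"name": "reports"})
retry = api.create(key, {"name": "reports"})  # simulated retry of the same task
```

Both calls return the same resource, and the store holds exactly one entry, so the retry is harmless regardless of whether the first attempt's response was lost.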
The most misunderstood aspect of retry strategies is their interaction with state. If your task modifies external state and then fails, a simple retry assumes the state is back to where it was before the failed attempt. This is rarely true for complex operations. For instance, if a multi-step database update fails midway and the steps do not share a single transaction, the earlier writes are already committed; a retry then applies them a second time, or contends with locks the failed attempt has not yet released, unless the task is written with that case in mind.
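For the database case, one mitigation is to wrap all of a task's writes in a single transaction, so a failed attempt rolls back and the retry really does start from the original state. The sketch below demonstrates this with Python's built-in sqlite3 module and an in-memory database; the table, account names, and amounts are invented for the example, and it does not address side effects outside the database.

```python
import sqlite3


def transfer(conn, fail_midway):
    """Move 10 units from account a to account b in a single transaction."""
    # `with conn` commits on success and rolls back if an exception escapes.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 10 WHERE name = 'a'")
        if fail_midway:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute("UPDATE accounts SET balance = balance + 10 WHERE name = 'b'")


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

try:
    transfer(conn, fail_midway=True)    # first attempt fails halfway
except RuntimeError:
    pass
after_failure = balances(conn)          # the rollback restored the starting state

transfer(conn, fail_midway=False)       # the retry starts from clean state
after_retry = balances(conn)
```

Without the transaction, the failed attempt would leave account a debited and account b uncredited, and the retry would debit account a a second time.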
The next logical step after mastering retries is understanding how to implement more sophisticated error handling, such as conditional retries based on specific error codes or output parameters.