You can retry individual steps in a Drone CI pipeline without rerunning the entire pipeline, and even continue the pipeline execution after a step fails.
Here’s a pipeline that demonstrates this:
```yaml
kind: pipeline
type: docker
name: step-failure-handling

steps:
  # Always exits non-zero to simulate a failing step.
  - name: failing-step
    image: alpine:latest
    commands:
      - echo "This step will fail!"
      - exit 1

  # Retry settings as described in this article; confirm that your Drone
  # version and runner accept a retry block before relying on it.
  - name: retryable-step
    image: alpine:latest
    commands:
      - echo "This step might fail, but we'll retry it."
      - echo $$DRONE_STEP_RETRY_COUNT
    retry:
      when:
        - failure
      max: 3
      backoff: 10s

  # Runs only while the pipeline status is success.
  - name: conditional-step
    image: alpine:latest
    commands:
      - echo "This step only runs if the previous one succeeded (or was retried enough)."
    when:
      status:
        - success

  # Runs whether the pipeline is passing or failing.
  - name: always-run-step
    image: alpine:latest
    commands:
      - echo "This step runs regardless of previous step outcomes (unless explicitly skipped)."
    when:
      status:
        - success
        - failure
```
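When failing-step executes, it exits with a non-zero status code, which marks the step as failed and puts the pipeline into a failed state. By default, Drone runs a step only while the pipeline status is success, so as written retryable-step and conditional-step are skipped and only always-run-step, whose when clause includes failure, still runs. For the pipeline to reach retryable-step, failing-step has to be allowed to fail, for example by setting failure: ignore on it. The retry section then governs retryable-step itself, not the step before it: if retryable-step fails, it is run again up to 3 times, with a 10-second backoff between attempts.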
If retryable-step succeeds within its retries, the pipeline continues to conditional-step. The when clause for conditional-step specifies that it should run if the status of the preceding steps was success.
The always-run-step has a when clause that includes success and failure. This means it will execute if the pipeline reaches this point, irrespective of whether the preceding steps ultimately succeeded or failed (as long as they didn’t cause the entire pipeline to abort prematurely).
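To let a pipeline carry on past a step whose failure is acceptable, Drone's Docker pipelines offer the failure: ignore step attribute. The following is a minimal sketch rather than a drop-in configuration; the pipeline and step names are hypothetical.

```yaml
kind: pipeline
type: docker
name: tolerate-failure      # hypothetical pipeline name

steps:
  - name: optional-lint     # hypothetical non-critical step
    image: alpine:latest
    failure: ignore         # a non-zero exit here does not fail the pipeline
    commands:
      - echo "Lint found problems"
      - exit 1

  - name: build
    image: alpine:latest
    commands:
      - echo "Runs even though optional-lint failed"
```

Because the failure is ignored, the build step should run without needing a when clause of its own. Reserve this attribute for steps whose result genuinely does not matter to the outcome of the build.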
The core problem this solves is brittle pipelines. Without explicit retry and conditional logic, a single transient error in any step can halt the entire pipeline, requiring a full rerun. This is inefficient and frustrating, especially for long-running or complex build processes. By implementing retries for known flaky steps and conditional execution for subsequent steps, you build more robust and resilient pipelines.
Internally, Drone CI tracks the exit status of each step. A non-zero exit code marks the step as failed and, unless the step sets failure: ignore, puts the pipeline into a failed state. A retry policy applies to the step that failed, not to the steps after it: the failed step is run again until it passes or the retry limit is reached. The example echoes DRONE_STEP_RETRY_COUNT to show the current attempt number; this variable does not appear among the commonly documented DRONE_* variables, so confirm that your runner sets it before relying on it. The when clause controls whether a step runs based on the pipeline status, offering fine-grained control over pipeline flow. The Drone documentation describes two status values, success and failure, and a step with no status condition behaves as if it were limited to success. Values such as changed, unstable, or error are not described there, so check your Drone version before using them.
The backoff field in the retry configuration doesn’t just introduce a delay; it’s a crucial component for handling transient network issues or temporary resource unavailability. For example, if a build step depends on an external service that is momentarily overloaded, a simple retry might hit the same problem. Introducing a backoff period allows the external service time to recover, increasing the likelihood of subsequent retries succeeding.
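Where a declarative retry block is unavailable, similar retry-with-backoff behavior can be approximated inside a step's commands. The sketch below assumes a POSIX shell in the image and makes three attempts in total with a fixed 10-second pause; ./scripts/integration-test.sh is a hypothetical placeholder for the flaky command, and the step name is likewise invented.

```yaml
steps:
  - name: retry-with-backoff   # hypothetical step name
    image: alpine:latest
    commands:
      # The whole loop is one YAML block scalar so it runs as a single shell command.
      - |
        attempt=1
        max_attempts=3
        # ./scripts/integration-test.sh is a placeholder for the flaky command.
        until ./scripts/integration-test.sh; do
          if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after $attempt attempts"
            exit 1
          fi
          echo "Attempt $attempt failed; sleeping 10s before retrying"
          sleep 10
          attempt=$((attempt + 1))
        done
```

The step fails only after the last attempt fails, so later steps see a single success or failure result. Replacing the fixed sleep with a growing delay would give the external service more time to recover on each attempt.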
The when clause’s status filter is the main tool for controlling flow after a failure. Restricting a step to failure turns it into a handler for notifications or cleanup that runs only when something has gone wrong. Listing both success and failure makes the step run either way, which suits teardown work that must leave a consistent state. To keep a deployment step from running unnecessarily, combine status with other conditions such as branch or event.
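A failure handler can be sketched as a step restricted to the failure status. The step name and message are hypothetical; a real pipeline would likely call a notification plugin here instead of echo.

```yaml
steps:
  - name: report-failure     # hypothetical step name
    image: alpine:latest
    commands:
      - echo "Pipeline failed; collecting logs"
    when:
      status:
        - failure            # skipped while the pipeline is passing
```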
If a step must run only when everything before it has succeeded (for example, a deployment that follows a build-and-test step), state that requirement explicitly in its when clause:

```yaml
- name: deploy
  image: plugins/deploy
  when:
    status: [success]
```

This ensures that the deploy step is skipped whenever the pipeline is in a failed state by the time it is reached. The status reflects final outcomes only: a step that failed, was retried, and eventually passed counts as a success, so this condition cannot tell a clean pass from a retried one.
The retry mechanism itself is not infinite. The max parameter sets a hard limit. If a step fails more times than max allows, it will be marked as failed, and subsequent steps will be evaluated based on this failure status according to their when clauses.
The next concept you’ll likely encounter is managing secrets and configuration across retried steps, as environment variables might need to be re-evaluated or re-fetched on each retry.