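EventBridge does retry failed deliveries on its own, but the defaults (up to 185 retries over as long as 24 hours, with exponential backoff) are rarely what a downstream system actually wants, and an event that exhausts them is dropped unless a dead-letter queue is configured.
Let’s see this in action. Imagine an EventBridge rule that sends events to an SQS queue. If the queue is temporarily throttled or unreachable, EventBridge retries the delivery using its default policy. If delivery never succeeds and no dead-letter queue is configured, the event is dropped. Note that once the message is in the queue, EventBridge’s job is done: a Lambda function that later fails while processing messages from the queue is governed by the queue’s own redrive settings, not by EventBridge.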
    {
      "Source": "aws.ec2",
      "DetailType": "EC2 Instance State-change Notification",
      "EventBusName": "default",
      "Resources": [
        "arn:aws:ec2:us-east-1::image/ami-0abcdef1234567890"
      ],
      "Detail": {
        "instance-id": "i-0123456789abcdef0",
        "state": "stopped"
      },
      "Time": "2023-10-27T10:00:00Z",
      "Id": "event-id-12345",
      "Account": "111122223333"
    }
This event, if sent to an SQS queue that stays unavailable for longer than EventBridge is willing to retry, would be lost. To control that window, we configure a retry policy on the EventBridge target.
The core problem EventBridge retry policies solve is ensuring that events are delivered to their intended targets, even when those targets are temporarily unavailable or experience errors. Without retries, a single transient failure can lead to lost data and inconsistent system states. EventBridge acts as a reliable dispatcher, and retries are its mechanism for handling temporary hiccups in the delivery pipeline.
Here’s how it works internally: When EventBridge attempts to deliver an event to a target and receives a retriable error (for example, throttling, a timeout, or a 5xx HTTP status code from an API Gateway endpoint), it doesn’t immediately give up. Instead, it retries with exponential backoff and jitter. Errors that retrying cannot fix, such as missing permissions or a target that no longer exists, are not retried. You control the retry loop with two parameters: MaximumRetryAttempts and MaximumEventAgeInSeconds.
MaximumRetryAttempts: This is the maximum number of retry attempts EventBridge makes for a single event. Valid values are 0 to 185, and the default is 185. If the target still fails after this many retries, EventBridge stops trying. A common starting point is 3 to 5, balancing retry effort against potential system overload.
MaximumEventAgeInSeconds: This is the maximum time, in seconds, that EventBridge keeps retrying an event. Valid values are 60 to 86400, and the default is 86400 (24 hours). If an event is still failing after this duration, EventBridge stops retrying, regardless of how many attempts have been made. This keeps old, irrelevant events from being delivered long after they matter. A value like 300 (5 minutes) or 600 (10 minutes) is often suitable, allowing for recovery from short-lived outages.
Let’s say you have an EventBridge rule targeting a Lambda function. If the Lambda service throttles the invocation or is briefly unavailable, EventBridge retries according to this policy. A timeout inside the function is a different case: EventBridge invokes Lambda asynchronously, so once the invocation is accepted, retries of a failed execution are handled by Lambda’s own asynchronous retry settings.
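How the two limits interact can be sketched with a small simulation. This is an illustrative model, not EventBridge’s actual implementation: the `RetryPolicy` class, the `simulate_delivery` function, and the delay values are hypothetical, and the real backoff schedule is chosen by the service. The sketch also assumes the first delivery attempt is not counted as a retry.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class RetryPolicy:
    # Defaults mirror the documented EventBridge defaults.
    maximum_retry_attempts: int = 185
    maximum_event_age_in_seconds: int = 86400


def simulate_delivery(
    policy: RetryPolicy,
    attempt_succeeds: Callable[[int], bool],
    retry_delays: Sequence[float],
) -> Tuple[str, int]:
    """Model one event's delivery and return (outcome, retries_used).

    attempt_succeeds(n) reports whether the attempt made after n retries
    succeeds. retry_delays is a hypothetical, non-empty backoff schedule in
    seconds; its last value is reused once the schedule runs out.
    """
    elapsed = 0.0
    retries_used = 0
    while True:
        if attempt_succeeds(retries_used):
            return ("delivered", retries_used)
        # Limit 1: the retry budget is spent.
        if retries_used >= policy.maximum_retry_attempts:
            return ("retries_exhausted", retries_used)
        wait = retry_delays[min(retries_used, len(retry_delays) - 1)]
        # Limit 2: the next retry would happen after the event is too old.
        if elapsed + wait > policy.maximum_event_age_in_seconds:
            return ("event_too_old", retries_used)
        elapsed += wait
        retries_used += 1


policy = RetryPolicy(maximum_retry_attempts=5, maximum_event_age_in_seconds=300)
always_fails = lambda attempt: False

# Doubling delays: the age limit is hit before the fifth retry.
slow_backoff = simulate_delivery(policy, always_fails, [10, 20, 40, 80, 160])
# Short fixed delays: the retry budget is spent first.
fast_backoff = simulate_delivery(policy, always_fails, [10])
```

With doubling delays starting at 10 seconds, the 300-second age limit ends the loop after four retries. With a fixed 10-second delay, all five retries are used instead. Whichever limit is reached first wins.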
Here’s an example of how you’d configure this using the AWS CLI for an SQS target:
    aws events put-targets --rule MyEventRule --targets '[
      {
        "Id": "MySqsTarget",
        "Arn": "arn:aws:sqs:us-east-1:111122223333:MyQueue",
        "RetryPolicy": {
          "MaximumRetryAttempts": 5,
          "MaximumEventAgeInSeconds": 300
        }
      }
    ]'
In this command:
MyEventRule is the name of your EventBridge rule.
MySqsTarget is a unique identifier for this target.
arn:aws:sqs:us-east-1:111122223333:MyQueue is the ARN of your SQS queue.
MaximumRetryAttempts: 5 means EventBridge will try up to 5 times to send the event if it fails.
MaximumEventAgeInSeconds: 300 means an event will not be retried for longer than 5 minutes.
If the SQS queue is briefly unavailable, EventBridge keeps retrying until either 5 retries have been made or the event is 300 seconds old, whichever comes first. If the queue becomes available within this window, the event is delivered. This prevents data loss from transient network glitches or temporary service throttling.
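If you manage targets from code rather than the CLI, the same target definition can be built as a plain dictionary and checked before it is sent. The sketch below is one possible shape, not an official helper: `build_target` is a hypothetical function, and the bounds it enforces are the documented ranges for the two retry parameters. The resulting dictionary is what you would pass as one element of `Targets` to the `put_targets` call on boto3’s `events` client.

```python
import json

# Documented bounds for the RetryPolicy fields.
MIN_RETRY_ATTEMPTS, MAX_RETRY_ATTEMPTS = 0, 185
MIN_EVENT_AGE_SECONDS, MAX_EVENT_AGE_SECONDS = 60, 86400


def build_target(target_id: str, arn: str, max_retries: int, max_age_seconds: int) -> dict:
    """Return a target entry with a RetryPolicy, rejecting out-of-range values."""
    if not MIN_RETRY_ATTEMPTS <= max_retries <= MAX_RETRY_ATTEMPTS:
        raise ValueError(
            f"MaximumRetryAttempts must be between {MIN_RETRY_ATTEMPTS} "
            f"and {MAX_RETRY_ATTEMPTS}, got {max_retries}"
        )
    if not MIN_EVENT_AGE_SECONDS <= max_age_seconds <= MAX_EVENT_AGE_SECONDS:
        raise ValueError(
            f"MaximumEventAgeInSeconds must be between {MIN_EVENT_AGE_SECONDS} "
            f"and {MAX_EVENT_AGE_SECONDS}, got {max_age_seconds}"
        )
    return {
        "Id": target_id,
        "Arn": arn,
        "RetryPolicy": {
            "MaximumRetryAttempts": max_retries,
            "MaximumEventAgeInSeconds": max_age_seconds,
        },
    }


target = build_target(
    "MySqsTarget",
    "arn:aws:sqs:us-east-1:111122223333:MyQueue",
    max_retries=5,
    max_age_seconds=300,
)

# JSON suitable for the CLI's --targets argument (a list of target entries).
targets_json = json.dumps([target], indent=2)

# With boto3 installed and credentials configured, the equivalent call is:
#   boto3.client("events").put_targets(Rule="MyEventRule", Targets=[target])
```

Validating locally catches mistakes such as a maximum age of 30 seconds, which is below the 60-second minimum, before the request is made.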
What happens if an event exhausts its retry attempts or exceeds its maximum age? EventBridge then sends the event to a Dead-Letter Queue (DLQ) if one is configured for the target; otherwise the event is dropped. This is a crucial part of the resilience strategy, as it allows you to inspect and potentially reprocess events that could not be delivered. You configure the DLQ on the EventBridge target itself, by setting DeadLetterConfig to the ARN of a standard SQS queue, and that queue’s resource policy must allow EventBridge to send messages to it. This is separate from a redrive policy on the target queue or a DLQ on a Lambda function, which cover failures that happen after delivery.
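A target entry with a dead-letter queue might look like the following sketch. The queue name `MyQueueDlq`, the rule ARN, and both helper functions are hypothetical, chosen to match the earlier example. The policy document follows the usual pattern of letting the `events.amazonaws.com` service principal send messages to the queue only on behalf of one rule.

```python
import json


def build_target_with_dlq(target_id: str, target_arn: str, dlq_arn: str) -> dict:
    """Target entry with both a retry policy and an EventBridge-level DLQ."""
    return {
        "Id": target_id,
        "Arn": target_arn,
        "RetryPolicy": {
            "MaximumRetryAttempts": 5,
            "MaximumEventAgeInSeconds": 300,
        },
        # Events that cannot be delivered are sent to this SQS queue.
        "DeadLetterConfig": {"Arn": dlq_arn},
    }


def build_dlq_policy(dlq_arn: str, rule_arn: str) -> dict:
    """Resource policy for the DLQ that lets EventBridge send messages to it."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEventBridgeToSendToDlq",
                "Effect": "Allow",
                "Principal": {"Service": "events.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": dlq_arn,
                # Restrict access to deliveries made for this one rule.
                "Condition": {"ArnEquals": {"aws:SourceArn": rule_arn}},
            }
        ],
    }


# Hypothetical ARNs for illustration only.
DLQ_ARN = "arn:aws:sqs:us-east-1:111122223333:MyQueueDlq"
RULE_ARN = "arn:aws:events:us-east-1:111122223333:rule/MyEventRule"

target = build_target_with_dlq(
    "MySqsTarget", "arn:aws:sqs:us-east-1:111122223333:MyQueue", DLQ_ARN
)
# The policy is attached to the DLQ as its "Policy" queue attribute (a JSON string).
dlq_policy_json = json.dumps(build_dlq_policy(DLQ_ARN, RULE_ARN))
```

Without the queue policy, EventBridge cannot write to the DLQ, and undeliverable events are still lost even though a DLQ appears to be configured.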
The most impactful aspect of retry policies, and something often overlooked, is how they interact with the target’s own error handling. EventBridge retries only failed deliveries. For a Lambda target, delivery succeeds as soon as the asynchronous invocation is accepted. If your function then catches an exception, logs it, and re-throws it, the invocation fails and Lambda’s own asynchronous retry behavior takes over, not the EventBridge retry policy. However, if your function catches an exception and handles it gracefully (e.g., by writing a partial record to a different store or simply returning successfully after logging), the invocation counts as successful and nothing retries it. This means the retry policy on EventBridge is a safety net for delivery, while your target’s internal logic determines what constitutes a "failure" once the event has arrived.
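The difference between re-raising and swallowing an exception can be sketched in a handler like the one below. Everything here is hypothetical: `TransientError`, `process`, and the `simulate` field exist only to make the two paths visible. The point is that the handler’s return-or-raise decision is what marks an invocation as failed and therefore eligible for retry.

```python
import logging

logger = logging.getLogger(__name__)


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying."""


def process(detail: dict) -> None:
    """Stand-in for real business logic."""
    if "instance-id" not in detail:
        raise ValueError("event detail has no instance-id")
    if detail.get("simulate") == "transient":
        raise TransientError("downstream database unavailable")


def handler(event: dict, context: object) -> dict:
    try:
        process(event["detail"])
    except TransientError:
        # Log and re-raise: the invocation is reported as failed,
        # so it is eligible to be retried.
        logger.exception("transient failure, letting the invocation fail")
        raise
    except ValueError:
        # Log and return normally: the invocation is reported as successful,
        # so nothing will retry this event.
        logger.warning("malformed event skipped: %s", event.get("id"))
        return {"status": "skipped"}
    return {"status": "processed"}
```

A reasonable rule of thumb is to re-raise errors that a later attempt could plausibly fix and to swallow only those that will fail the same way every time, such as malformed input.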
The next common challenge after configuring retries is managing the events that do end up in your Dead-Letter Queue.