EventBridge is surprisingly bad at telling you why an event failed to process after it’s already sent.
Consider what happens when an event like the following is sent to a target that’s currently unavailable or misconfigured.
{
  "version": "0",
  "id": "e7f3f837-30a3-491f-9735-3b1865c14a3e",
  "detail-type": "UserLoggedIn",
  "source": "my.auth.service",
  "account": "123456789012",
  "time": "2023-10-27T10:30:00Z",
  "region": "us-east-1",
  "resources": [],
  "detail": {
    "userId": "user-abc-123",
    "timestamp": 1698393000
  }
}
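For context, here is a minimal sketch of how a publisher might send this event with boto3. It assumes boto3 is installed and credentials are configured; the helper names `build_put_events_entry` and `send_login_event` are made up for illustration. Note that the `FailedEntryCount` in the response only reports entries the bus refused to accept. It says nothing about whether a target later processed the event.

```python
import json


def build_put_events_entry(user_id, timestamp, bus_name="default"):
    """Build one entry for the EventBridge PutEvents API (hypothetical helper)."""
    return {
        "Source": "my.auth.service",
        "DetailType": "UserLoggedIn",
        # Detail must be a JSON string, not a dict.
        "Detail": json.dumps({"userId": user_id, "timestamp": timestamp}),
        "EventBusName": bus_name,
    }


def send_login_event(events_client, user_id, timestamp):
    """Publish the event and return how many entries the bus rejected."""
    response = events_client.put_events(
        Entries=[build_put_events_entry(user_id, timestamp)]
    )
    return response["FailedEntryCount"]
```

With a real client this would be called as `send_login_event(boto3.client("events"), "user-abc-123", 1698393000)`.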
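When this UserLoggedIn event hits EventBridge, the bus tries to send it to the targets of every matching rule. If a target, say a Lambda function, is throttled or temporarily unavailable, EventBridge retries; by default it keeps trying for up to 24 hours and up to 185 attempts, and both limits can be lowered in the target’s retry policy. Errors that retrying cannot fix, such as missing permissions or a target that no longer exists, are not retried. Either way, once EventBridge gives up, the event is dropped unless you have attached a dead-letter queue (an SQS queue) to that target. You won’t see it in CloudWatch Logs for the target, and without a dead-letter queue there is no "failed events" queue to inspect.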
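The retry policy lives on each target, alongside an optional dead-letter queue. The sketch below shows one way to set both with boto3's `put_targets`. The helper names and default values are illustrative assumptions, and the SQS queue would also need a resource policy that lets EventBridge send messages to it.

```python
def build_target(target_id, target_arn, dlq_arn, max_retries=3, max_age_seconds=3600):
    """Describe a rule target with an explicit retry policy and a DLQ (hypothetical helper)."""
    return {
        "Id": target_id,
        "Arn": target_arn,
        "RetryPolicy": {
            # Stop retrying after this many attempts...
            "MaximumRetryAttempts": max_retries,
            # ...or once the event is this old, whichever comes first.
            "MaximumEventAgeInSeconds": max_age_seconds,
        },
        # Events that could not be delivered are sent to this SQS queue.
        "DeadLetterConfig": {"Arn": dlq_arn},
    }


def attach_target(events_client, rule_name, target, bus_name="default"):
    """Attach the target to an existing rule and return the number of failed entries."""
    response = events_client.put_targets(
        Rule=rule_name, EventBusName=bus_name, Targets=[target]
    )
    return response["FailedEntryCount"]
```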
This is where archiving comes in. You can configure EventBridge to archive all events that pass through a bus. This creates a historical record you can replay later. The archive’s contents cannot be browsed or queried directly; replaying is how you get events back out.
Setting Up Event Archiving
To archive events, you need to create an EventBridge Archive.
- Navigate to EventBridge in the AWS console.
- Go to Archives and click Create archive.
- Give it a name, e.g., MyEventBusArchive.
- Select your Event bus. Usually, this is the default bus unless you’ve created custom ones.
- Under Archive event data, choose All events.
- For the retention period, enter a number of days instead of keeping events indefinitely. A common choice for debugging is 7 days.
- You don’t choose a storage location. EventBridge stores archived events itself, so there is no S3 bucket to create or select.
- Create archive.
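The same archive can be created programmatically. This is a minimal sketch using boto3's `create_archive`; the helper names are hypothetical, and the bus ARN you pass in is whatever bus you want to archive.

```python
import json


def build_archive_params(archive_name, bus_arn, retention_days=7, event_pattern=None):
    """Assemble CreateArchive parameters (hypothetical helper)."""
    params = {
        "ArchiveName": archive_name,
        "EventSourceArn": bus_arn,
        # 0 means "keep indefinitely"; any positive value is a number of days.
        "RetentionDays": retention_days,
    }
    if event_pattern is not None:
        # Optional: archive only events matching this pattern.
        params["EventPattern"] = json.dumps(event_pattern)
    return params


def create_archive(events_client, archive_name, bus_arn, **options):
    """Create the archive and return its ARN."""
    params = build_archive_params(archive_name, bus_arn, **options)
    return events_client.create_archive(**params)["ArchiveArn"]
```

Omitting `event_pattern` archives every event on the bus, which matches the "All events" choice in the console steps.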
Once created, the archive receives a copy of every event that arrives on the bus (and matches the archive’s filter, if you set one). Expect a short delay before a new event shows up in the archive, so wait a few minutes before replaying very recent events. This is crucial: archiving is independent of target delivery, so an event is captured even if delivery to a target later fails.
Replaying Archived Events
When you encounter a processing error (e.g., your target Lambda is throwing errors, or you suspect an event was lost), you can replay historical events from your archive.
- Navigate back to EventBridge Archives.
- Select your archive (e.g., MyEventBusArchive).
- Click Replay archive.
- Replay name: Give it a descriptive name, like ReplayUserLoginFailures_2023-10-27.
- From event time and To event time: Specify the time range of events you want to replay. This is where you’ll narrow down to the period when you suspect failures occurred.
- Event bus: The destination is the event bus the archive was created from. EventBridge only replays events to their source bus, so you cannot point a replay at a different bus for testing.
- Optional: Rule filtering: By default the replay is evaluated against all rules on the bus, but you can restrict it to specific rules. A replay cannot filter on event content; to limit which events are delivered, rely on the event pattern of the receiving rule (or of the archive itself, set when you create it). For example, this pattern matches only UserLoggedIn events from my.auth.service: { "source": ["my.auth.service"], "detail-type": ["UserLoggedIn"] }
- Click Create replay.
EventBridge will then take the events in the archive that fall within your time range and send them through the bus again, where the rules you selected (or all rules) deliver them to their targets. This is like a "do-over" for those specific events.
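Starting a replay from code follows the same shape. The sketch below uses boto3's `start_replay`; the helper names are hypothetical. `Destination.Arn` is the event bus the archive was created from, and the optional `FilterArns` list restricts the replay to specific rules.

```python
from datetime import datetime, timezone


def build_replay_params(replay_name, archive_arn, bus_arn, start, end, rule_arns=None):
    """Assemble StartReplay parameters (hypothetical helper)."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    destination = {"Arn": bus_arn}
    if rule_arns:
        # Only these rules will see the replayed events.
        destination["FilterArns"] = list(rule_arns)
    return {
        "ReplayName": replay_name,
        "EventSourceArn": archive_arn,  # the archive, not the bus
        "EventStartTime": start,
        "EventEndTime": end,
        "Destination": destination,
    }


def start_replay(events_client, params):
    """Kick off the replay and return its ARN."""
    return events_client.start_replay(**params)["ReplayArn"]


# Example window: the hour around the sample event above.
WINDOW_START = datetime(2023, 10, 27, 10, 0, tzinfo=timezone.utc)
WINDOW_END = datetime(2023, 10, 27, 11, 0, tzinfo=timezone.utc)
```

Passing timezone-aware datetimes avoids ambiguity about which hour is being replayed.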
Debugging with Replays
The key to debugging is setting up a separate, isolated target for your replays.
- Create a new target: This could be a dedicated Lambda function, an SQS queue, or even an API Gateway endpoint that’s designed only to receive and log replayed events. Do not replay directly to your production target unless you are absolutely certain it can handle duplicates or you’ve disabled its primary logic.
- Configure a new rule: On your original event bus, create a new Amazon EventBridge rule.
  - Event pattern: Match the events you are interested in (e.g., UserLoggedIn events from my.auth.service).
  - Target: Select your new, isolated logging target.
- Initiate the replay: As described above, create a replay from your archive with your original event bus as the destination, and restrict it to the new rule so the replayed events do not also reach your production targets.
- Observe the logs: The events will be replayed and processed by your new logging target. Examine the logs from this target to see exactly what the event data looks like and how your logging mechanism handles it. If the original failure was due to a transient issue with a production target, replaying to a stable logging target will confirm the event itself was valid and delivered by EventBridge.
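A logging target for the steps above can be very small. The following is a sketch of a Lambda handler that only records what it receives; the function names are illustrative. It relies on the `replay-name` field that EventBridge adds to replayed events, which lets the log distinguish replays from live traffic.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def summarize_event(event):
    """Pull out the fields that matter when debugging a replay."""
    return {
        "id": event.get("id"),
        "source": event.get("source"),
        "detail_type": event.get("detail-type"),
        # Present only on replayed events.
        "replay_name": event.get("replay-name"),
        "is_replay": "replay-name" in event,
        "detail": event.get("detail", {}),
    }


def handler(event, context):
    """Lambda entry point: log the event summary and do nothing else."""
    summary = summarize_event(event)
    logger.info(json.dumps(summary))
    return summary
```

Because the handler has no side effects beyond logging, replaying the same event twice is harmless.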
If your production target is still failing, the issue lies with that target’s configuration, permissions, or code, as the event is successfully being delivered to the bus and then replayed. If the replay to your logging target fails, the issue is likely with the event bus itself or the new rule’s configuration.
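Before concluding that a rule or target is at fault, it helps to confirm that the replay itself finished. The sketch below polls boto3's `describe_replay` until the replay reaches a terminal state. The function name, polling interval, and timeout are arbitrary assumptions.

```python
import time

TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED"}


def wait_for_replay(events_client, replay_name, poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Poll until the replay ends; return (state, reason). Hypothetical helper."""
    for _ in range(max_polls):
        info = events_client.describe_replay(ReplayName=replay_name)
        if info["State"] in TERMINAL_STATES:
            # StateReason is typically only populated when something went wrong.
            return info["State"], info.get("StateReason")
        sleep(poll_seconds)
    raise TimeoutError(f"replay {replay_name!r} did not finish in time")
```

A `FAILED` state points at the replay itself, whereas a `COMPLETED` replay with nothing in your logs points at the rule’s pattern or the target.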
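The most surprising thing about replaying events is that a replay does not reproduce the original delivery. Replayed events are evaluated against the rules and targets that exist on the bus at the time of the replay, not the ones that existed when the event first arrived. This means if you’ve changed your rules or targets since the original event, the replay will use the current configuration. Replayed events also carry an extra replay-name field, so a target can tell them apart from live traffic.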
If you’ve replayed events and they still aren’t being processed correctly by your intended target, the next thing you’ll likely run into is ensuring your target’s IAM permissions are correctly configured to accept events from EventBridge, especially if you’re using custom event buses or specific resource policies.