You’re getting alerts because EventBridge isn’t successfully sending events to their destinations.
Here’s what’s likely going wrong and how to fix it:
1. Destination Not Reachable
Diagnosis: EventBridge can’t connect to your target service. This is the most common culprit, especially for services that require specific network configurations.
Common Causes:
- IAM Permissions: EventBridge is not authorized to invoke the target service.
  - Diagnosis: Check the IAM role attached to the target of your EventBridge rule. Does it have an `Action` that allows it to call the specific target service (e.g., `sqs:SendMessage`, `lambda:InvokeFunction`, `kinesis:PutRecord`)? Some targets, typically SQS, SNS, and Lambda, are authorized through a resource-based policy on the target rather than through a role, so check that policy as well.
  - Fix: Attach a policy that grants the necessary permissions. For an SQS queue, this might look like:

        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Action": "sqs:SendMessage",
              "Resource": "arn:aws:sqs:us-east-1:123456789012:my-event-queue"
            }
          ]
        }

    In a resource-based queue policy, the same statement also needs a `Principal` naming the `events.amazonaws.com` service.
  - Why it works: EventBridge can only call the target when a policy explicitly allows the required action on that specific resource.
- Network Connectivity (VPC Endpoints, Security Groups, NACLs): If your target is within a VPC, EventBridge needs a way to reach it. EventBridge calls AWS-managed targets such as SQS and Lambda through their service APIs rather than from inside your VPC, so these checks matter mostly when the destination is an endpoint in your own network.
  - Diagnosis:
    - VPC Endpoints: If using a VPC endpoint for the target service (e.g., SQS, Lambda), verify the endpoint exists, is associated with the correct subnets and security groups, and that traffic is allowed to it.
    - Security Groups: Ensure the security group attached to the target resource (e.g., EC2 instance, ALB) allows inbound traffic from EventBridge’s IP range or the security group associated with the VPC endpoint.
    - Network ACLs (NACLs): Check NACLs on the subnets hosting the target resource. NACLs are stateless, so they must explicitly permit both inbound traffic from EventBridge and the outbound return traffic.
  - Fix:
    - VPC Endpoint: Create or verify the VPC endpoint. For SQS, the service name might be `com.amazonaws.us-east-1.sqs`. Ensure its endpoint policy allows `sqs:SendMessage` and that it is attached to the correct subnets.
    - Security Group: Add an inbound rule. For an SQS VPC endpoint, this might be a rule allowing TCP port 443 from the security group of the resources that use the endpoint.
    - NACLs: Add rules to allow inbound traffic on port 443 from EventBridge’s IP CIDRs and outbound return traffic on ephemeral ports (1024-65535).
  - Why it works: These network configurations create a secure and direct path for EventBridge to communicate with your private resources.
- Resource State: The target resource itself is unhealthy or not running.
  - Diagnosis: Check the status of your Lambda function, SQS queue (is it deleted?), ECS service, etc.
  - Fix: Ensure the target resource is active and healthy. For example, if it’s a Lambda function, check its CloudWatch logs for errors.
  - Why it works: EventBridge can only deliver to a functioning destination.
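The permission check described above can be approximated locally before changing anything in AWS. The sketch below is a deliberately simplified evaluator, not IAM’s real policy engine: `policy_allows` is a hypothetical helper that only inspects `Allow` statements and ignores `Deny`, `Condition`, `NotAction`, and `NotResource`. It reuses the example queue policy from the IAM fix.

```python
import fnmatch


def _as_list(value):
    """IAM accepts either a single value or a list for Statement/Action/Resource."""
    return value if isinstance(value, list) else [value]


def policy_allows(policy: dict, action: str, resource: str) -> bool:
    """Return True if some Allow statement matches both action and resource.

    Simplification: Deny statements, Condition blocks, NotAction and
    NotResource are ignored, so treat the result as a quick sanity check.
    """
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        # Action names are case-insensitive; ARNs are compared as written.
        action_ok = any(
            fnmatch.fnmatchcase(action.lower(), a.lower()) for a in actions
        )
        resource_ok = any(fnmatch.fnmatchcase(resource, r) for r in resources)
        if action_ok and resource_ok:
            return True
    return False


QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:my-event-queue"
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "sqs:SendMessage", "Resource": QUEUE_ARN}
    ],
}

print(policy_allows(POLICY, "sqs:SendMessage", QUEUE_ARN))
print(policy_allows(POLICY, "sqs:DeleteMessage", QUEUE_ARN))
```

A `False` here points at a missing statement; a `True` does not rule out an explicit deny or a failing condition, which only the IAM policy simulator or a real delivery attempt can confirm.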
2. Dead-Letter Queue (DLQ) Configuration Issues
Diagnosis: EventBridge is configured to send failed events to a DLQ, but the DLQ itself is misconfigured or has issues.
Common Causes:
- DLQ Permissions: EventBridge doesn’t have permission to send messages to the DLQ.
  - Diagnosis: Verify that the DLQ’s resource-based queue policy allows the `events.amazonaws.com` service principal to call `sqs:SendMessage` on the DLQ ARN. A rising `InvocationsFailedToBeSentToDlq` metric is a typical symptom.
  - Fix: Add a statement granting `sqs:SendMessage` on the DLQ ARN to the queue’s policy, ideally with a condition that restricts `aws:SourceArn` to the rule’s ARN.
  - Why it works: Similar to destination permissions, EventBridge needs explicit authorization to place messages onto the DLQ.
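- DLQ Backlog: Messages pile up in the DLQ (usually an SQS queue) and expire before anyone processes them.
  - Diagnosis: Check the `ApproximateNumberOfMessages` attribute of the DLQ SQS queue (or the `ApproximateNumberOfMessagesVisible` CloudWatch metric). SQS does not cap the number of stored messages; the figure of roughly 120,000 often quoted for standard queues is the in-flight limit, which throttles consumers rather than senders. The practical risk is that messages are deleted once the retention period elapses (4 days by default, 14 days at most).
  - Fix: Process or delete messages from the DLQ, raise its `MessageRetentionPeriod` if you need more time to investigate, or consider a different DLQ strategy.
  - Why it works: Draining the queue regularly keeps failed events from aging out before you can inspect or replay them.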
- DLQ is Deleted: The SQS queue designated as the DLQ no longer exists.
  - Diagnosis: Confirm the ARN of the DLQ still points to an existing SQS queue.
  - Fix: Recreate the DLQ SQS queue and re-associate it with the EventBridge rule.
  - Why it works: EventBridge cannot deliver to a non-existent resource.
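As a concrete illustration of the DLQ permission fix, here is a minimal sketch that builds a queue policy letting EventBridge send to the DLQ. The function name `dlq_queue_policy`, the queue name `my-event-dlq`, and the rule name `my-rule` are placeholders invented for this example; substitute your own ARNs.

```python
import json


def dlq_queue_policy(queue_arn: str, rule_arn: str) -> dict:
    """Build a resource-based SQS policy that lets one rule use the queue as a DLQ."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEventBridgeToSendToDlq",
                "Effect": "Allow",
                # The EventBridge service principal, not an IAM role.
                "Principal": {"Service": "events.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                # Limit access to the single rule that owns this DLQ.
                "Condition": {"ArnEquals": {"aws:SourceArn": rule_arn}},
            }
        ],
    }


policy = dlq_queue_policy(
    "arn:aws:sqs:us-east-1:123456789012:my-event-dlq",    # placeholder queue
    "arn:aws:events:us-east-1:123456789012:rule/my-rule",  # placeholder rule
)
print(json.dumps(policy, indent=2))

# Applying it with boto3 (not executed here; dlq_url is your queue URL):
#   import boto3
#   boto3.client("sqs").set_queue_attributes(
#       QueueUrl=dlq_url, Attributes={"Policy": json.dumps(policy)}
#   )
```

Note that setting the `Policy` attribute replaces the queue’s existing policy, so merge this statement into any policy already in place rather than overwriting it.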
3. EventBridge Rule Configuration Errors
Diagnosis: The rule itself is set up incorrectly, leading to delivery failures.
Common Causes:
- Incorrect Event Pattern: The event pattern is too broad or too narrow. A pattern that is too broad routes events the target was never meant to handle, which the target may then reject; one that is too narrow matches nothing, so the rule never fires.
  - Diagnosis: Review the event pattern of the rule against the actual structure of the incoming events. Use the EventBridge console’s "Sample events" feature to test your pattern.
  - Fix: Adjust the event pattern to precisely match the events you intend to trigger the rule. For example, if you only want events from a specific EC2 instance, refine the `source` and `detail.instance-id` fields.
  - Why it works: The event pattern is the filter that determines which events are routed to which rules and targets. An incorrect pattern can lead to unexpected behavior or no matching events.
- Target Not Configured Correctly: The target ARN or settings within the target configuration are wrong.
  - Diagnosis: Double-check the target ARN specified in the EventBridge rule. For services like Lambda, ensure the correct function ARN is used. For SQS, verify the queue ARN.
  - Fix: Correct the target ARN and any associated parameters (e.g., message group ID for FIFO queues).
  - Why it works: EventBridge needs the exact address of the destination.
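To see how a pattern filters events, the following sketch implements a small local matcher. It is an approximation written for this article and covers only exact-value matching on nested fields; content filters such as `prefix`, `numeric`, or `exists` are not handled. The instance ID and the trimmed-down event are made-up sample data.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Check an event against a pattern using exact-value matching only.

    Each leaf in the pattern is a list of allowed values; a nested dict
    in the pattern must match the nested object at the same key.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # An array in the event matches if any element is allowed.
            candidates = actual if isinstance(actual, list) else [actual]
            if not any(c in expected for c in candidates):
                return False
    return True


PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"instance-id": ["i-0123456789abcdef0"]},
}

EVENT = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}

print(matches(PATTERN, EVENT))
```

For an authoritative answer, the EventBridge `TestEventPattern` API (`aws events test-event-pattern` in the CLI) evaluates the full pattern syntax against a complete event envelope.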
4. EventBridge Service Limits
Diagnosis: You’re hitting EventBridge’s capacity limits.
Common Causes:
- High Event Volume: A large number of failed deliveries rarely degrades EventBridge itself. If the failures are intermittent and the overall volume is high, you are more likely hitting a throughput quota.
  - Diagnosis: Check EventBridge metrics like `FailedInvocations` and `Invocations` in CloudWatch. Look for spikes corresponding to your alerts. Also, check AWS Service Quotas for EventBridge limits.
  - Fix: Request a service quota increase for EventBridge if you’re hitting limits on the number of rules, events per second, or targets per rule. Not every quota is adjustable, so confirm in the Service Quotas console before planning around an increase.
  - Why it works: AWS services have built-in limits to ensure stability. Exceeding these requires a formal request to increase capacity.
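The metric check above can be scripted. The sketch below builds the arguments for CloudWatch’s `GetMetricStatistics` call and computes a failure ratio from the returned datapoints. `metric_query` and `failure_ratio` are hypothetical helper names, `my-rule` is a placeholder, and the 5-minute period is an arbitrary choice; the actual AWS call is shown only as a comment so the block runs without credentials.

```python
from datetime import datetime, timedelta, timezone


def metric_query(rule_name: str, metric_name: str, hours: int = 1, now=None) -> dict:
    """Build keyword arguments for cloudwatch.get_metric_statistics()."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Events",
        "MetricName": metric_name,  # e.g. "Invocations" or "FailedInvocations"
        "Dimensions": [{"Name": "RuleName", "Value": rule_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Sum"],
    }


def failure_ratio(invocation_points: list, failed_points: list) -> float:
    """Share of invocations that failed, given two lists of Sum datapoints."""
    total = sum(p["Sum"] for p in invocation_points)
    failed = sum(p["Sum"] for p in failed_points)
    return failed / total if total else 0.0


# With boto3 (not executed here):
#   cw = boto3.client("cloudwatch")
#   inv = cw.get_metric_statistics(**metric_query("my-rule", "Invocations"))["Datapoints"]
#   bad = cw.get_metric_statistics(**metric_query("my-rule", "FailedInvocations"))["Datapoints"]
#   print(failure_ratio(inv, bad))

# Offline demonstration with made-up datapoints:
print(failure_ratio([{"Sum": 90.0}, {"Sum": 10.0}], [{"Sum": 5.0}]))
```

A ratio that jumps only during traffic spikes suggests throttling or a downstream capacity problem, while a steady non-zero ratio points back at permissions or target configuration.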
5. Event Payload Issues
Diagnosis: The event itself is malformed or too large for the target.
Common Causes:
- Event Size: The event payload exceeds the maximum size allowed by the target service (e.g., Lambda’s 256KB payload limit, SQS’s 256KB limit).
  - Diagnosis: Examine the content of the events that are failing. If they are large, log their size.
  - Fix: Truncate or compress the event payload before sending it to EventBridge, or use a different mechanism (like S3) to pass large data and send a reference to it.
  - Why it works: Each AWS service has defined limits for data it can process in a single request.
- Event Structure: The event structure is unexpected by the target, especially if the target is a custom application or a Lambda function with strict input expectations.
  - Diagnosis: Log the incoming event to your target’s logs (if possible) or to a temporary SQS queue to inspect its exact structure.
  - Fix: Adjust the event-generating service to produce an event that conforms to the target’s expectations, or modify the target to handle the actual event structure.
  - Why it works: The target service must be able to parse and process the incoming data.
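The size check and the "send a reference instead" fix can be sketched as follows. `entry_size_bytes` mirrors the way EventBridge documents the size of a `PutEvents` entry (UTF-8 bytes of the source, detail type, detail, and resources, plus 14 bytes when a timestamp is set). The 256KB default is the figure cited above and should be checked against current quotas; `s3_reference_detail`, the bucket, and the key are hypothetical.

```python
import json

DEFAULT_LIMIT_BYTES = 256 * 1024  # assumption: the 256KB limit cited above


def entry_size_bytes(entry: dict) -> int:
    """Approximate the size EventBridge attributes to one PutEvents entry."""
    size = 14 if entry.get("Time") is not None else 0
    size += len(entry.get("Source", "").encode("utf-8"))
    size += len(entry.get("DetailType", "").encode("utf-8"))
    size += len(entry.get("Detail", "").encode("utf-8"))
    for resource in entry.get("Resources", []):
        size += len(resource.encode("utf-8"))
    return size


def needs_offload(entry: dict, limit: int = DEFAULT_LIMIT_BYTES) -> bool:
    """True when the entry is too large to send as-is."""
    return entry_size_bytes(entry) > limit


def s3_reference_detail(bucket: str, key: str) -> str:
    """Detail payload that points at the real data stored in S3."""
    return json.dumps({"payload_location": {"bucket": bucket, "key": key}})


small = {"Source": "my.app", "DetailType": "OrderPlaced", "Detail": json.dumps({"id": 1})}
big = {**small, "Detail": json.dumps({"blob": "x" * 300_000})}

for entry in (small, big):
    if needs_offload(entry):
        # Upload entry["Detail"] to S3 first (e.g. s3.put_object), then:
        entry = {**entry, "Detail": s3_reference_detail("my-bucket", "events/1.json")}
    print(entry_size_bytes(entry))
```

The consumer then reads the bucket and key from the event and fetches the full payload itself, which keeps every event well under the limit regardless of how large the underlying data grows.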
After these fixes, `FailedInvocations` should fall back to zero while `Invocations` keeps pace with matched events, and any alarms you configured on failed deliveries should return to OK.