EventBridge can silently drop events if you’re not careful. When you hit its limits, calls are throttled and the SDK retries them with randomized exponential backoff, which can make your system feel like it’s just… not working.
Let’s see this in action. Imagine you have a Lambda function triggered by an EventBridge rule. You’re sending events to a custom event bus at a high rate.
Event payload example:

```json
{
  "Source": "com.mycompany.orders",
  "DetailType": "OrderCreated",
  "Detail": {
    "orderId": "12345",
    "customer": "Alice",
    "amount": 99.99
  }
}
```
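By default, EventBridge has API rate limits. For `PutEvents` to a custom event bus, this article works with 300 requests per second per AWS account; the actual default varies by Region, so check Service Quotas for yours. If you exceed the limit, EventBridge throttles the call and returns a `ThrottlingException`. The request doesn’t just fail once, though: the AWS SDK retries it automatically, with increasing delays.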
Here’s how you manage it:
1. Understand the Limits
The primary limit for `PutEvents` to a custom event bus is 300 requests per second (RPS) per AWS account. There’s also a limit on the total payload size per request (256KB). You can request limit increases via AWS Support, but that’s a last resort.
2. Implement Client-Side Retries with Backoff
Your application sending events needs to handle throttling gracefully. The AWS SDKs have built-in retry logic, but you should configure it.
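- Diagnosis: Check your application logs for `ThrottlingException`, the error EventBridge returns when a `PutEvents` call is throttled.
- Fix: Configure the AWS SDK’s retry strategy. For example, in Python with `boto3`, you can set `retries`:

  ```python
  import json

  import boto3
  from botocore.config import Config
  from botocore.exceptions import ClientError

  client = boto3.client(
      'events',
      region_name='us-east-1',
      config=Config(
          retries={
              'max_attempts': 10,  # total attempts, including the first call
              'mode': 'standard'
          }
      )
  )

  # ... later in your code when calling put_events
  entries = [{
      'Source': 'com.mycompany.orders',
      'DetailType': 'OrderCreated',
      'Detail': json.dumps({'orderId': '12345', 'customer': 'Alice', 'amount': 99.99})
  }]
  try:
      response = client.put_events(Entries=entries)
      # A call that succeeds overall can still reject individual entries.
      if response['FailedEntryCount'] > 0:
          print(f"EventBridge rejected {response['FailedEntryCount']} entries")
  except ClientError as e:
      # SDK's built-in retries will handle some of this,
      # but log any persistent failures.
      print(f"EventBridge put_events failed: {e}")
  ```
- Why it works: The `standard` retry mode implements randomized exponential backoff. Each retry waits a random delay whose upper bound roughly doubles (about 1s, 2s, 4s, 8s…, capped at 20 seconds) until `max_attempts` is reached. This spaces out your retries so they stop piling onto an API that is already throttling you.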
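SDK retries only cover calls that fail outright. A `PutEvents` call can also succeed while rejecting some entries, which it reports through `FailedEntryCount` and a per-entry `ErrorCode` in the response. The following is a minimal sketch of resending only the rejected entries. The helper names `failed_entries` and `put_with_resend` are hypothetical, and `client` is assumed to be a boto3 EventBridge client.

```python
def failed_entries(entries, response):
    """Return (entry, error_code) pairs for entries that PutEvents rejected.

    Results come back in the same order as the submitted entries; a rejected
    entry carries an ErrorCode instead of an EventId.
    """
    failed = []
    for entry, result in zip(entries, response.get("Entries", [])):
        if "ErrorCode" in result:
            failed.append((entry, result["ErrorCode"]))
    return failed


def put_with_resend(client, entries, max_rounds=3):
    """Send entries, resending only the rejected ones for up to max_rounds passes.

    `client` is assumed to be boto3.client("events"). Returns whatever is
    still unsent so the caller can log or persist it.
    """
    pending = list(entries)
    for _ in range(max_rounds):
        if not pending:
            break
        response = client.put_events(Entries=pending)
        if response.get("FailedEntryCount", 0) == 0:
            return []
        pending = [entry for entry, _code in failed_entries(pending, response)]
    return pending
```

This sketch resends immediately and treats every error code as retryable. In practice you would probably pause between rounds and skip entries whose error code indicates a malformed event.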
3. Monitor EventBridge Metrics
AWS provides metrics for EventBridge that are crucial for understanding your traffic patterns and identifying throttling.
- Diagnosis: Navigate to CloudWatch -> Metrics -> All Metrics. Search for "EventBridge" (the namespace is `AWS/Events`). Look for `PutEventSuccess` and `PutEventFailure` for your custom bus. Pay close attention to `PutEventFailure` with `ErrorCode` `ThrottlingException`. Metric names can differ from the ones used here; the console may list PutEvents metrics under names such as `PutEventsApproximateThrottledCount` and `PutEventsFailedEntriesCount`, so confirm what your account actually publishes.
- Fix: Set up CloudWatch Alarms on the `PutEventFailure` count. If the failure rate exceeds a threshold (e.g., > 5 failures per minute), trigger an alert. This alarm should prompt you to investigate the source of the high traffic.
- Why it works: These metrics give you a direct view into what EventBridge is experiencing, allowing you to detect throttling before it causes widespread issues.
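The alarm described above can also be created from code. This sketch builds the arguments for CloudWatch’s `put_metric_alarm` call. The function name `throttle_alarm_params`, the alarm naming scheme, and the default threshold are assumptions for illustration, and the metric name is a parameter because you should pass whichever failure metric your account publishes.

```python
def throttle_alarm_params(metric_name, threshold=5, period_seconds=60, topic_arn=None):
    """Build keyword arguments for cloudwatch.put_metric_alarm.

    Fires when the summed metric exceeds `threshold` within one period.
    """
    params = {
        "AlarmName": f"eventbridge-{metric_name}-high",  # hypothetical naming scheme
        "Namespace": "AWS/Events",
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # No datapoints means no failures were reported, so don't alarm.
        "TreatMissingData": "notBreaching",
    }
    if topic_arn:
        params["AlarmActions"] = [topic_arn]  # e.g. an SNS topic that pages you
    return params


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   cloudwatch.put_metric_alarm(**throttle_alarm_params("PutEventFailure"))
```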
4. Batch Events (Carefully)
The PutEvents API allows you to send up to 10 events in a single request. This can significantly reduce the number of API calls.
- Diagnosis: If your `PutEventFailure` metrics show throttling and your individual events are small, you are most likely hitting the request-rate limit rather than the payload limit, because you’re sending each event in its own call.
- Fix: Instead of calling `put_events` for each event, collect events and send them in batches (`client` is the EventBridge client configured in step 2):

  ```python
  import json

  events_to_send = []
  for i in range(25):  # 25 events, so the batch limit is actually reached
      events_to_send.append({
          'Source': 'com.mycompany.orders',
          'DetailType': 'OrderCreated',
          'Detail': json.dumps({"orderId": f"batch_{i}", "customer": "Bob", "amount": 10.0})
      })
      if len(events_to_send) == 10:  # Max batch size
          client.put_events(Entries=events_to_send)
          events_to_send = []

  if events_to_send:  # Send any remaining events
      client.put_events(Entries=events_to_send)
  ```
- Why it works: Each `PutEvents` call counts as one request. Batching 10 events into one call reduces your RPS by a factor of 10, making it much easier to stay under the 300 RPS limit. Be mindful of the total payload size per request (256KB).
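Batching by count alone can still trip the per-request size limit when events are large. The sketch below splits entries on both the 10-entry cap and a 256KB byte budget. The size estimate is an approximation (UTF-8 length of `Source`, `DetailType`, `Detail`, and each resource, plus 14 bytes when `Time` is set), and the names `approx_entry_size` and `batch_entries` are hypothetical.

```python
MAX_ENTRIES = 10               # entries per PutEvents call
MAX_BATCH_BYTES = 256 * 1024   # per-request size budget discussed above


def approx_entry_size(entry):
    """Approximate the billed size of one entry in bytes."""
    size = 0
    for key in ("Source", "DetailType", "Detail"):
        size += len(entry.get(key, "").encode("utf-8"))
    for resource in entry.get("Resources", []):
        size += len(resource.encode("utf-8"))
    if "Time" in entry:
        size += 14
    return size


def batch_entries(entries, max_entries=MAX_ENTRIES, max_bytes=MAX_BATCH_BYTES):
    """Yield lists of entries that respect both the count and size limits."""
    batch, batch_bytes = [], 0
    for entry in entries:
        size = approx_entry_size(entry)
        if size > max_bytes:
            raise ValueError("single entry exceeds the per-request size budget")
        # Start a new batch if adding this entry would break either limit.
        if batch and (len(batch) == max_entries or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(entry)
        batch_bytes += size
    if batch:
        yield batch


# Usage with an EventBridge client:
#   for batch in batch_entries(all_entries):
#       client.put_events(Entries=batch)
```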
5. Implement Dead-Letter Queues (DLQs)
When retries fail or you want to capture events that couldn’t be processed, a DLQ is essential. For targets like Lambda or SQS, you can configure a DLQ directly. A DLQ only covers events that made it onto the bus, though: events rejected by `PutEvents` never reach a target, so the producer has to keep and resend them itself. Understanding the consumer-side DLQ still helps.
- Diagnosis: You see persistent `PutEventFailure` metrics and your downstream systems aren’t receiving all events, even after your application’s retries.
- Fix: For a Lambda target, configure a DLQ in the Lambda function’s configuration:
  - Go to your Lambda function -> Configuration -> Asynchronous invocation.
  - Edit the settings and set the dead-letter queue service to Amazon SQS (exact console labels vary; "On-failure destination" is a separate, related setting).
  - Choose or create an SQS queue.
- Why it works: If EventBridge successfully delivers an event to a target (like Lambda) but the target fails to process it after its own retries, the event is sent to the DLQ. This doesn’t directly solve `PutEvents` throttling but is critical for overall event processing reliability. For `PutEvents` throttling at the source, you need to fix the source’s sending rate.
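The same DLQ can be attached from code instead of the console. This is a minimal sketch using the Lambda API’s `update_function_configuration` call with a `DeadLetterConfig`. The wrapper name `attach_lambda_dlq` and the example names in the usage comment are hypothetical, and `lambda_client` is assumed to be a boto3 Lambda client.

```python
def attach_lambda_dlq(lambda_client, function_name, queue_arn):
    """Point a Lambda function's dead-letter queue at an SQS queue.

    `lambda_client` is assumed to be boto3.client("lambda"). The function's
    execution role also needs permission to send messages to the queue.
    """
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        DeadLetterConfig={"TargetArn": queue_arn},
    )


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   attach_lambda_dlq(
#       boto3.client("lambda"),
#       "order-handler",                                  # hypothetical function
#       "arn:aws:sqs:us-east-1:123456789012:orders-dlq",  # hypothetical queue
#   )
```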
6. Adjust Sending Rate
If your application genuinely needs to send events at a rate higher than EventBridge’s default limits, you need to control the source.
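- Diagnosis: After implementing client-side retries, monitoring, and batching, you are still seeing throttling exceptions.
- Fix: Implement rate limiting in your sending application. Use a token-bucket library or a small hand-rolled limiter to ensure you don’t exceed a sustainable rate (e.g., 250 RPS to leave some buffer). The `RateLimiter` below is a minimal stand-in written for this example, not a published package, and `client` is the EventBridge client configured in step 2:

  ```python
  import threading
  import time

  class RateLimiter:
      """Blocking limiter that spaces calls at least period/max_rate seconds apart."""
      def __init__(self, max_rate, period=1.0):
          self.interval = period / max_rate
          self.lock = threading.Lock()
          self.next_allowed = time.monotonic()

      def __enter__(self):
          with self.lock:
              now = time.monotonic()
              wait = self.next_allowed - now
              # Reserve the next slot before sleeping so threads don't collide.
              self.next_allowed = max(now, self.next_allowed) + self.interval
          if wait > 0:
              time.sleep(wait)

      def __exit__(self, *exc):
          return False

  # Allow 250 calls per second
  rate_limiter = RateLimiter(max_rate=250, period=1.0)

  def send_event_safely(event_data):
      with rate_limiter:
          # This blocks until the limiter allows the call
          client.put_events(Entries=[event_data])
      print("Event sent.")

  # In your loop:
  # send_event_safely(single_event_payload)
  ```
- Why it works: This proactively prevents you from hitting the API limit by controlling how fast your application can make the `put_events` calls, rather than relying on EventBridge to reject them and the SDK to retry.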
7. Request Limit Increases (Rarely)
If your traffic patterns are legitimate and consistently exceed the default limits, you can request an increase.
- Diagnosis: You’ve exhausted all other options, and your business needs genuinely require a higher throughput than the default 300 RPS.
- Fix: Request the increase through the Service Quotas console or open a support case with AWS Support. Clearly state your use case, current throughput, and the desired increased throughput. Be prepared to justify the request with data.
- Why it works: AWS can provision additional capacity for your account, but this is a manual process and not guaranteed.
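Before filing a request, it helps to confirm what your account’s limit actually is. The sketch below lists EventBridge quotas through the Service Quotas API and filters them by name. It assumes the service code for EventBridge is `events` and that the relevant quota name contains "PutEvents"; the helper names are hypothetical, and `quota_client` is assumed to be a boto3 `service-quotas` client.

```python
def list_eventbridge_quotas(quota_client):
    """Collect all quotas for EventBridge (service code assumed to be 'events')."""
    quotas = []
    paginator = quota_client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="events"):
        quotas.extend(page["Quotas"])
    return quotas


def find_quotas(quotas, keyword):
    """Return (name, value, adjustable) for quotas whose name contains keyword."""
    keyword = keyword.lower()
    return [
        (q["QuotaName"], q["Value"], q.get("Adjustable", False))
        for q in quotas
        if keyword in q["QuotaName"].lower()
    ]


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   quotas = list_eventbridge_quotas(boto3.client("service-quotas"))
#   for name, value, adjustable in find_quotas(quotas, "PutEvents"):
#       print(name, value, adjustable)
```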
The next error you’ll likely encounter if you fix throttling is a downstream resource failing due to high volume, such as a Lambda function hitting its concurrency limits.