DynamoDB’s Scan operation is a performance killer not because it’s inherently slow, but because it’s fundamentally a brute-force approach that bypasses the index-based efficiency of Query.

Let’s see Scan in action. Imagine you have a table of user activity logs, keyed by userId (partition key) and timestamp (sort key).

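import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('UserActivityLog')

# Scan reads the table and applies the filter afterwards.
response = table.scan(
    FilterExpression=Attr('activityType').eq('login')
)

# A single call evaluates at most 1 MB of data, so this is only the first page.
items = response['Items']
print(f"Found {len(items)} login events in the first page.")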

Now, what seems innocent enough here is actually a huge problem under the hood. When you Scan a table, DynamoDB has to read every single item in the table and then filter out the ones that don’t match your FilterExpression. It’s like asking a librarian to find all books with "dragon" in the title by looking at every single book on every shelf, rather than using the card catalog.
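
One way to see this overhead is to compare the ScannedCount and Count fields that every Scan response carries. The following is a minimal sketch rather than a definitive implementation: scan_efficiency is a hypothetical helper name, and it assumes a boto3 Table resource (or any object with a compatible scan method). It pages through the table and totals the items read against the items that survived the filter.

```python
def scan_efficiency(table, **scan_kwargs):
    """Page through a Scan and total items read vs. items returned.

    `table` is assumed to be a boto3 Table resource; `scan_kwargs` are
    passed to table.scan() unchanged (for example FilterExpression=...).
    """
    scanned = matched = 0
    kwargs = dict(scan_kwargs)
    while True:
        response = table.scan(**kwargs)
        scanned += response.get('ScannedCount', 0)  # items read before filtering
        matched += response.get('Count', 0)         # items left after filtering
        last_key = response.get('LastEvaluatedKey')
        if not last_key:
            break  # no more pages
        kwargs['ExclusiveStartKey'] = last_key      # resume where the last page ended
    return scanned, matched
```

Calling it as scan_efficiency(table, FilterExpression=Attr('activityType').eq('login')) returns a (scanned, matched) pair. Keep in mind that running it is itself a full Scan, so it is best tried on a small or test table.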

The core issue is that Scan doesn’t leverage your table’s primary key or any Global Secondary Indexes (GSIs) for efficient retrieval. It just iterates. This has several cascading performance implications:

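  1. High Read Capacity Unit (RCU) Consumption: Every item read, even if it’s eventually filtered out, costs RCUs. A table with millions of items and a broad Scan can burn through provisioned capacity in seconds, leading to throttling and increased costs. One RCU covers one strongly consistent read, or two eventually consistent reads, of up to 4 KB per second, and a Scan is billed on the total size of the items it evaluates, not the size of what it returns. If your Scan reads 100 MB of data to find 10 matching items, you’ve spent roughly 12,800 RCUs with eventually consistent reads (about 25,600 with strongly consistent reads) for a tiny result set.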

  2. Increased Latency: The more data DynamoDB has to sift through, the longer it takes to return results. For large tables, a Scan can take minutes, making it unsuitable for real-time applications. The network round trips, data serialization, and deserialization all add up with every item processed.

  3. Parallelism Is Manual and Doesn’t Reduce Cost: By default, a Scan reads the table sequentially, one page of up to 1 MB at a time. You can opt into a parallel scan by splitting the table into logical segments with the TotalSegments and Segment parameters and running one worker per segment. Even then, it’s still reading all the data, just at a faster pace; the total RCU cost remains the same, and the burst of reads can exhaust provisioned capacity sooner. It’s still a scan.

  4. "Eventually" Consistent Reads: By default, Scan operations perform eventually consistent reads. This means the data you retrieve might not reflect the most recently written items. If your application requires up-to-the-minute data, this can lead to incorrect logic. You can request strongly consistent reads on the base table by setting ConsistentRead to true, but this doubles the RCU cost for the same data, and the option isn’t available on a GSI.

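  5. Pagination Even for Small Result Sets: A single Scan call evaluates at most 1 MB of data, and that limit applies before the FilterExpression. On a large table you have to page through with LastEvaluatedKey (passed back as ExclusiveStartKey) even if only a handful of items match, and some pages may come back empty. This adds complexity to your application logic, and every repeated Scan call incurs its own RCU cost and latency.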

  6. Unpredictable Performance: As your table grows, the performance of a Scan degrades linearly. A Scan that’s fast today on a small table might become painfully slow next month on a larger one, making capacity planning and performance tuning a moving target.
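
The capacity arithmetic from point 1 can be sketched as a small calculation. This is a rough estimate under stated assumptions: estimate_scan_rcus is a hypothetical helper, and it treats the whole scan as one cumulative read rounded up to the next 4 KB, whereas DynamoDB rounds per request, so real consumption can come out slightly higher.

```python
import math

def estimate_scan_rcus(bytes_read, strongly_consistent=False):
    """Estimate read capacity units for scanning `bytes_read` bytes.

    Assumes 4 KB read units, with eventually consistent reads costing
    half as much as strongly consistent ones.
    """
    units_4kb = math.ceil(bytes_read / 4096)
    return units_4kb if strongly_consistent else units_4kb / 2

hundred_mb = 100 * 1024 * 1024
print(estimate_scan_rcus(hundred_mb))        # eventually consistent estimate
print(estimate_scan_rcus(hundred_mb, True))  # strongly consistent estimate
```

The estimate depends only on how much data is read, not on how many items match the filter, which is why a selective FilterExpression does nothing to lower the bill.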

The solution is almost always to use Query or GSIs. A Query operation, unlike Scan, uses the table’s primary key (partition key and optional sort key conditions) or a GSI’s key schema to efficiently locate data.

For instance, to find all login events for a specific user, you’d use Query:

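from boto3.dynamodb.conditions import Key, Attr

# A key condition allows only one condition on the sort key, so the range
# is expressed with between(), which includes both endpoints.
response = table.query(
    KeyConditionExpression=Key('userId').eq('user123') & Key('timestamp').between(1678886400, 1678972800),
    FilterExpression=Attr('activityType').eq('login')
)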

This Query only reads items within the specified userId partition and the given timestamp range, dramatically reducing RCUs and latency. The FilterExpression is still applied, but it’s applied after the efficient key-based retrieval, so it only filters a much smaller subset of data.

If you need to query by an attribute that isn’t part of your primary key, create a Global Secondary Index (GSI). For example, to efficiently query activityType across all users, you’d create a GSI with activityType as the partition key. Then, you’d Query the GSI:

from boto3.dynamodb.conditions import Key

# A GSI is queried through the same Table resource by passing IndexName.
# ActivityTypeIndex is the assumed GSI name.
response = table.query(
    IndexName='ActivityTypeIndex',
    KeyConditionExpression=Key('activityType').eq('login')
)
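
Creating the index itself goes through the UpdateTable API. Here is a hedged sketch of the request parameters, assuming the index name ActivityTypeIndex used above, a string-typed activityType attribute, and an on-demand table (a provisioned table would also need a ProvisionedThroughput entry inside Create). build_gsi_request is a hypothetical helper name, not part of boto3.

```python
def build_gsi_request(table_name, index_name, partition_key, key_type='S'):
    """Build UpdateTable parameters that add a GSI with one partition key."""
    return {
        'TableName': table_name,
        # The new key attribute must be declared with its type.
        'AttributeDefinitions': [
            {'AttributeName': partition_key, 'AttributeType': key_type},
        ],
        'GlobalSecondaryIndexUpdates': [
            {
                'Create': {
                    'IndexName': index_name,
                    'KeySchema': [
                        {'AttributeName': partition_key, 'KeyType': 'HASH'},
                    ],
                    # 'ALL' copies every attribute into the index.
                    'Projection': {'ProjectionType': 'ALL'},
                }
            }
        ],
    }

params = build_gsi_request('UserActivityLog', 'ActivityTypeIndex', 'activityType')
# To apply it (requires AWS credentials and an existing table):
# import boto3
# boto3.client('dynamodb').update_table(**params)
```

The ALL projection is the simplest choice for a sketch; KEYS_ONLY or INCLUDE keep the index smaller when queries need only a few attributes.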

The most surprising thing about Scan is that even when you use ProjectionExpression to limit the attributes returned, DynamoDB still reads the entire item from storage. The ProjectionExpression is applied after the item has been read from disk and brought into memory, meaning you’re still paying the RCU cost for the full item’s size, not just the projected attributes.

The next problem you’ll encounter is understanding how to optimize your GSI design for specific query patterns.
