Macie doesn’t just find sensitive data in S3; it also builds an inventory of your buckets and evaluates their security and access controls, so you can see where exposure is likely before a single object is scanned.

Let’s see Macie in action. Imagine you’ve got a bucket named my-company-data-prod.

aws s3 ls s3://my-company-data-prod/

This is just a bucket listing. Now, let’s say you’ve enabled Macie for this account and Region. Macie starts by building an inventory of your S3 buckets. The inventory holds bucket-level metadata such as object counts, storage size, encryption settings, and public or shared access. To keep the inventory current, Macie also monitors certain CloudTrail events for bucket-level changes.

The real magic begins with Macie’s discovery jobs. You can configure these jobs to scan specific buckets or even all buckets in an account.

Here’s how you’d start a basic discovery job:

# Macie reads objects through its service-linked role, so no role ARN is passed.
# The buckets to scan go in --s3-job-definition, keyed by the owning account ID.
aws macie2 create-classification-job \
    --name "MyProdDataScan" \
    --job-type "ONE_TIME" \
    --client-token "a1b2c3d4-e5f6-7890-1234-567890abcdef" \
    --sampling-percentage 100 \
    --s3-job-definition '{ "bucketDefinitions": [ { "accountId": "111122223333", "buckets": [ "my-company-data-prod" ] } ] }'
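A comparable request can be assembled in Python. This is a minimal sketch, not an official recipe: build_job_request and submit_job are hypothetical helper names, and the submission step assumes boto3 is installed with credentials configured and uses its macie2 client's create_classification_job call.

```python
import uuid


def build_job_request(name, account_id, buckets, sampling_percentage=100):
    """Build keyword arguments for a one-time Macie classification job."""
    if not 1 <= sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be between 1 and 100")
    return {
        "name": name,
        "jobType": "ONE_TIME",
        # A fresh token per request keeps retries of the same request idempotent.
        "clientToken": str(uuid.uuid4()),
        "samplingPercentage": sampling_percentage,
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(buckets)}
            ]
        },
    }


def submit_job(request):
    """Send the request to Macie (assumes boto3 and valid AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    return boto3.client("macie2").create_classification_job(**request)


request = build_job_request(
    "MyProdDataScan", "111122223333", ["my-company-data-prod"]
)
print(request["s3JobDefinition"])
```

Keeping the builder separate from the network call makes the request easy to inspect or unit test before anything is sent to AWS.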

When Macie scans an object, it doesn’t just look for keywords. It uses what Macie calls managed data identifiers, referred to here as sensitive data types (SDTs). These are built-in criteria and techniques for identifying things like:

  • Credit Card Numbers: Uses Luhn algorithm checks and known patterns.
  • AWS API Keys: Recognizes the specific format of AWS secret access keys.
  • Personally Identifiable Information (PII): Looks for names, addresses, phone numbers, email addresses, and Social Security numbers (SSNs) in various formats.
  • Health Information: Scans for medical record numbers or other health-related identifiers.

Macie applies these SDTs to the content of the data. For example, if it finds a string that matches the pattern of a US Social Security number or a credit card number, it flags it. The same goes for less obvious formats, such as text laid out like a passport number.
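To make the credit card case concrete, here is a small sketch of pattern matching combined with a Luhn checksum. It illustrates the general technique only and is not Macie's implementation; CARD_CANDIDATE, luhn_valid, and find_card_numbers are hypothetical names, and the regex is deliberately simplified.

```python
import re

# Simplified candidate pattern: 13 to 19 digits, optionally separated by
# single spaces or dashes. Real detectors use far stricter rules.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text):
    """Return candidate strings that match the pattern and pass Luhn."""
    return [
        match.group()
        for match in CARD_CANDIDATE.finditer(text)
        if luhn_valid(match.group())
    ]


sample = "card 4111 1111 1111 1111, order 1234 5678 9012 3456"
print(find_card_numbers(sample))
```

The checksum step is what filters out digit runs, such as order numbers, that merely look like card numbers.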

Once a job completes, Macie populates its findings. You can retrieve these findings using the AWS CLI:

aws macie2 list-findings \
    --sort-criteria '{"attributeName":"createdAt","orderBy":"DESC"}' \
    --finding-criteria '{ "criterion": { "category": { "eq": ["CLASSIFICATION"] } } }'

Note that list-findings returns only finding IDs. Pass those IDs to aws macie2 get-findings to see the details of each finding, including the bucket name, object key, the types of sensitive data found, and the severity.
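Full finding details come back from aws macie2 get-findings as fairly deep JSON, so a small helper can flatten each one for review. The sketch below assumes the finding layout as I understand it (resourcesAffected, classificationDetails.result.sensitiveData, severity.description); summarize_finding and the sample finding are hypothetical and only for illustration.

```python
def summarize_finding(finding):
    """Flatten one Macie sensitive data finding into a small summary dict."""
    resources = finding.get("resourcesAffected", {})
    result = finding.get("classificationDetails", {}).get("result", {})

    # Add up occurrences per detection type across all categories.
    detections = {}
    for group in result.get("sensitiveData", []):
        for detection in group.get("detections", []):
            kind = detection["type"]
            detections[kind] = detections.get(kind, 0) + detection.get("count", 0)

    return {
        "bucket": resources.get("s3Bucket", {}).get("name"),
        "key": resources.get("s3Object", {}).get("key"),
        "severity": finding.get("severity", {}).get("description"),
        "detections": detections,
    }


# Hypothetical finding, trimmed to the fields used above.
sample_finding = {
    "severity": {"description": "High"},
    "resourcesAffected": {
        "s3Bucket": {"name": "my-company-data-prod"},
        "s3Object": {"key": "exports/customers.csv"},
    },
    "classificationDetails": {
        "result": {
            "sensitiveData": [
                {
                    "category": "PERSONAL_INFORMATION",
                    "detections": [
                        {"type": "USA_SOCIAL_SECURITY_NUMBER", "count": 3}
                    ],
                }
            ]
        }
    },
}
print(summarize_finding(sample_finding))
```

Using .get with defaults keeps the helper from failing on findings that lack classification details, such as policy findings.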

The real power comes from how Macie integrates with the rest of your security tooling. Macie publishes its findings to EventBridge, so you can set up automated response and remediation with an EventBridge rule and a Lambda function. For instance, if Macie finds an S3 object containing PII in a public bucket, you could trigger a Lambda function to:

  1. Remove public read access from the bucket.
  2. Encrypt the object using KMS.
  3. Send a notification to your security team.

This automated workflow is crucial for quickly addressing potential data leaks.
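The decision logic for such a Lambda function can be sketched as follows. This is an assumption-laden outline rather than a finished handler: plan_remediation is a hypothetical helper, the event field names reflect my understanding of the Macie finding event delivered by EventBridge, and the actual AWS calls are left as comments.

```python
def plan_remediation(event):
    """Decide which remediation steps apply to a Macie finding event."""
    detail = event.get("detail", {})
    resources = detail.get("resourcesAffected", {})
    bucket = resources.get("s3Bucket", {})
    key = resources.get("s3Object", {}).get("key")
    sensitive = (
        detail.get("classificationDetails", {})
        .get("result", {})
        .get("sensitiveData", [])
    )

    has_pii = any(g.get("category") == "PERSONAL_INFORMATION" for g in sensitive)
    is_public = bucket.get("publicAccess", {}).get("effectivePermission") == "PUBLIC"
    if not (has_pii and is_public):
        return []

    name = bucket.get("name")
    return [
        ("block_public_access", name),
        ("encrypt_object", f"{name}/{key}"),
        ("notify_security", detail.get("id")),
    ]


def lambda_handler(event, context):
    """Entry point; a real handler would execute each planned action."""
    actions = plan_remediation(event)
    # With boto3, the steps could map to s3.put_public_access_block,
    # s3.copy_object with ServerSideEncryption="aws:kms", and sns.publish.
    return {"actions": actions}


# Hypothetical event, trimmed to the fields used above.
sample_event = {
    "detail": {
        "id": "finding-123",
        "resourcesAffected": {
            "s3Bucket": {
                "name": "my-company-data-prod",
                "publicAccess": {"effectivePermission": "PUBLIC"},
            },
            "s3Object": {"key": "exports/customers.csv"},
        },
        "classificationDetails": {
            "result": {"sensitiveData": [{"category": "PERSONAL_INFORMATION"}]}
        },
    }
}
print(lambda_handler(sample_event, None))
```

Separating the plan from its execution makes the rules easy to test and lets you run the function in a log-only mode before it is allowed to change bucket settings.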

One of the most useful, yet often overlooked, aspects of Macie is that it uses context rather than strict pattern matching alone. Many of its managed data identifiers require a keyword close to the matching text before they report a detection, which cuts down on false positives from strings that merely look like an ID number. Macie then assigns each finding a severity based on the types of sensitive data detected and the number of occurrences. Furthermore, you can create custom data identifiers: a regular expression, optionally combined with keywords, ignore words, and a maximum match distance, that detects proprietary data formats unique to your organization.
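Custom detection of this kind pairs a regular expression with keywords that must appear near the match. The sketch below approximates that idea by looking for a keyword in a fixed window before each match; Macie's exact proximity rule differs, and detect_custom, the EMP-123456 format, and the 50-character window are assumptions made for illustration.

```python
import re


def detect_custom(text, pattern, keywords=(), max_distance=50):
    """Return regex matches that have a keyword shortly before them.

    Simplified stand-in for keyword proximity; with no keywords, every
    regex match is returned.
    """
    regex = re.compile(pattern)
    lowered = text.lower()
    hits = []
    for match in regex.finditer(text):
        if not keywords:
            hits.append(match.group())
            continue
        # Only look at the characters immediately preceding the match.
        window_start = max(0, match.start() - max_distance)
        window = lowered[window_start:match.start()]
        if any(keyword.lower() in window for keyword in keywords):
            hits.append(match.group())
    return hits


# Hypothetical proprietary format: "EMP-" followed by six digits.
EMPLOYEE_ID = r"\bEMP-\d{6}\b"
text = "Employee ID: EMP-123456 was updated. " + "x" * 80 + " EMP-654321"

print(detect_custom(text, EMPLOYEE_ID, keywords=["employee id"]))
print(detect_custom(text, EMPLOYEE_ID))
```

Requiring a nearby keyword trades a little recall for far fewer false positives, which is usually the right balance for formats that resemble ordinary ticket or order numbers.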

The next step after discovering sensitive data is understanding how it’s being accessed, which leads you into analyzing S3 access logs and CloudTrail data more deeply.
