Handle EC2 Spot Instance Interruptions Gracefully (2026)

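EC2 Spot Instances can be reclaimed with only a two-minute warning. You can’t stop an interruption once the notice is issued, and a higher maximum price won’t protect a running instance when AWS needs the capacity back, but you can use those two minutes to shut down cleanly.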

Let’s watch a Spot Instance get interrupted and see how we can handle it. Imagine you’ve got a critical batch job running on a Spot Instance, and you don’t want it to just vanish.

Here’s a common setup: an EC2 Spot Instance running a Python script that periodically checks an S3 bucket for new files to process.

import boto3
import time
import os

s3_client = boto3.client('s3')
instance_id = os.environ.get('EC2_INSTANCE_ID') # Assume this is set

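def check_for_interruption():
    try:
        response = boto3.client('ec2').describe_spot_instance_requests(
            SpotInstanceRequestIds=[os.environ.get('SPOT_REQUEST_ID')] # Assume this is set
        )
        request = response['SpotInstanceRequests'][0]
        # 'State' is open/active/closed/cancelled/failed/disabled; 'fulfilled' is a
        # Status code, so comparing State to 'fulfilled' would never match.
        if request['State'] == 'active' and request['Status']['Code'] == 'fulfilled':
            return False # Not interrupted yet
        else:
            return True # Marked for termination/stop, or the request is no longer active
    except Exception as e:
        print(f"Error checking interruption status: {e}")
        return False # Assume not interrupted on error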

def process_file(bucket, key):
    print(f"Processing file: s3://{bucket}/{key}")
    # Simulate work
    time.sleep(30)
    print(f"Finished processing: s3://{bucket}/{key}")

def main():
    bucket_name = 'my-critical-batch-bucket'
    while True:
        if check_for_interruption():
            print("Spot Instance interruption detected! Initiating graceful shutdown...")
            # Save state, upload logs, notify, etc.
            break # Exit the loop to shut down

        try:
            response = s3_client.list_objects_v2(Bucket=bucket_name)
            if 'Contents' in response:
                for obj in response['Contents']:
                    key = obj['Key']
                    if not key.endswith('/'): # It's a file
                        process_file(bucket_name, key)
                        # In a real scenario, you'd mark this file as processed in S3
                        # or move it to an archive bucket.
            else:
                print("No files to process. Waiting...")
        except Exception as e:
            print(f"Error accessing S3: {e}")

        time.sleep(60) # Check S3 every minute

if __name__ == "__main__":
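    # In a real application, you'd read the instance ID from instance metadata
    # and look up the Spot request ID through the EC2 API (describe_instances
    # returns SpotInstanceRequestId). For this example, we'll just assume
    # SPOT_REQUEST_ID and EC2_INSTANCE_ID are set in the environment.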
    print("Starting batch processing...")
    main()
    print("Shutdown complete.")

This script periodically checks for an interruption signal. If it detects one, it breaks its processing loop.

The core problem Spot Instances solve is cost optimization. AWS has a massive amount of spare EC2 capacity. Instead of letting it sit idle, they offer it at a steep discount (up to 90%) to users willing to be flexible. The catch? AWS can reclaim that capacity with very little notice if they need it for On-Demand instances.

Here’s how the interruption actually works: AWS issues a "Spot Instance Interruption Notice" for your instance. The notice surfaces in two places: in the instance metadata, where code running on the instance can poll for it, and as an Amazon EventBridge event. You get a two-minute warning before the instance is terminated.

The check_for_interruption function in the script above is a simplified representation. In practice, you’d typically query the EC2 Instance Metadata Service (IMDS) from the instance itself, at http://169.254.169.254/latest/meta-data/spot/termination-time. Until a notice is issued, this endpoint returns a 404. Once it returns a timestamp, an interruption notice has been issued, and the timestamp indicates approximately when the instance will be terminated.
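To make the metadata check concrete, here is a minimal sketch of polling IMDS from Python using only the standard library. It assumes IMDSv2 (session tokens) and reads spot/instance-action, the JSON sibling of termination-time that also reports which action is pending. The helper names fetch_imds and get_interruption_notice are made up for this example.

```python
import json
import urllib.error
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def fetch_imds(path, timeout=2):
    """Return the body of an IMDS path, or None if the path returns 404."""
    # IMDSv2: obtain a short-lived session token before reading metadata.
    token_request = urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as resp:
        token = resp.read().decode()

    request = urllib.request.Request(
        f"{IMDS_BASE}/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # No interruption notice has been issued
        raise


def get_interruption_notice(fetch=fetch_imds):
    """Return the pending instance action as a dict, or None if there is none."""
    body = fetch("meta-data/spot/instance-action")
    return json.loads(body) if body else None
```

The fetch parameter is there so the parsing logic can be exercised off-instance with a fake fetcher; on a real instance you would call get_interruption_notice() every few seconds and start shutting down as soon as it returns something.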

When you get that notice, your instance is still running for those two minutes. This is your window to act. You can:

Save your work: If you’re processing data, checkpoint your progress. Upload intermediate results to S3, save state to a database, or write to EBS.
Notify others: Send an alert to Slack, PagerDuty, or an SNS topic to let operators know the instance is shutting down.
Initiate a new instance: If your workload is designed to be fault-tolerant, you might trigger the launch of a new Spot Instance (or On-Demand) to take over.
Drain connections: If it’s a web server, stop accepting new requests and let existing ones finish.
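One way to organize those actions is to run them in priority order against a time budget, so that a slow or failing step can’t eat the whole window. The sketch below is an illustration only: run_shutdown_steps and the 100-second default budget are assumptions for this example, not anything AWS prescribes.

```python
import time


def run_shutdown_steps(steps, budget_seconds=100.0, clock=time.monotonic):
    """Run (name, callable) steps in order until the time budget is spent.

    Returns the names of the steps that completed successfully.
    """
    deadline = clock() + budget_seconds
    completed = []
    for name, step in steps:
        if clock() >= deadline:
            print(f"Out of time, skipping: {name}")
            continue
        try:
            step()
            completed.append(name)
        except Exception as exc:  # One failing step should not block the rest
            print(f"Step {name!r} failed: {exc}")
    return completed
```

You would pass the steps most important first, for example checkpointing before notifying. Note that the budget is only checked between steps, so each step still needs its own timeout if it can hang.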

The most common way to detect the interruption is by polling the IMDS for the termination-time metadata.

curl http://169.254.169.254/latest/meta-data/spot/termination-time

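If this command returns a date and time rather than a 404, your instance is scheduled for interruption, and your application logic should trigger its graceful shutdown sequence. On instances that require IMDSv2, the request also needs a session token in the X-aws-ec2-metadata-token header, or it will fail with a 401.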
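Once you have the timestamp, it helps to turn it into a countdown so the shutdown logic knows how much time is actually left. A small sketch, assuming the value is a UTC timestamp in the form 2015-01-05T18:02:00Z; seconds_until_termination is a name invented for this example.

```python
from datetime import datetime, timezone


def seconds_until_termination(termination_time, now=None):
    """Return the seconds left before the termination timestamp (never negative)."""
    # The value is a UTC timestamp such as "2015-01-05T18:02:00Z".
    deadline = datetime.strptime(termination_time.strip(), "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (deadline - now).total_seconds())
```

The now parameter exists only to make the function easy to test; in production you would leave it out and let it use the current time.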

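A more robust approach is to handle the interruption from outside the instance, with a Lambda function triggered by an Amazon EventBridge rule. When AWS is about to interrupt a Spot Instance, it emits an EC2 Spot Instance Interruption Warning event two minutes before the interruption. (The EC2 Instance State-change Notification event only fires once the instance actually changes state, which is too late to use the warning window.) The Lambda function can then perform actions like recording state in S3 or notifying another system.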
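For the Lambda route, AWS publishes an event with the detail-type "EC2 Spot Instance Interruption Warning" through EventBridge, and its detail carries the instance-id and instance-action fields. A handler for it might look roughly like this; the return shape and the follow-up actions in the comments are assumptions for illustration.

```python
def lambda_handler(event, context):
    """Handle an EventBridge 'EC2 Spot Instance Interruption Warning' event."""
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return {"handled": False}

    detail = event.get("detail", {})
    instance_id = detail.get("instance-id")
    action = detail.get("instance-action")  # "terminate", "stop", or "hibernate"
    print(f"Spot interruption warning: {instance_id} will {action}")

    # Hypothetical follow-ups: publish to an SNS topic, launch a replacement,
    # or record the interruption so another worker can resume the job.
    return {"handled": True, "instance_id": instance_id, "action": action}
```

The matching EventBridge rule would filter on source aws.ec2 and that detail-type, so the function is only invoked for interruption warnings.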

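Alternatively, you can watch for an earlier signal through the instance metadata service and trigger a shutdown script from it. The endpoint http://169.254.169.254/latest/meta-data/events/recommendations/rebalance returns a rebalance recommendation when the instance is at elevated risk of interruption. This is not the interruption notice itself: it can arrive before the two-minute warning, but it is not guaranteed to, and the endpoint returns a 404 while no recommendation is pending.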

The critical part is that your application must be designed to handle this. If it simply dies when the instance goes away, the two-minute warning is useless. You need to actively watch for the notice and have a predefined shutdown procedure. The notice itself does not arrive as a process signal, so something on the instance has to poll for it. It is still worth handling SIGTERM (and SIGINT for interactive runs) in your application code, because the OS typically sends SIGTERM to running processes when it begins shutting down, which gives you a last chance to clean up.

The most surprising thing about Spot Instances is how often they are not interrupted. For many instance types and regions, especially those with abundant spare capacity, Spot Instances can run for days, weeks, or even months without interruption. They’re often cheaper and more reliable than you’d expect, provided you build in graceful handling.

This is why using Spot Instances for critical workloads isn’t about avoiding interruption entirely, but about making the interruption a non-event for your overall application availability. You’re essentially trading guaranteed uptime for a significant cost reduction, with the caveat that you must manage the potential for brief disruptions.

The next thing you’ll run into is managing the state of your interrupted jobs.