The most surprising thing about auto-scaling CI runners on AWS is that they often end up costing more than a fixed fleet, unless you meticulously tune them.
Let’s watch one in action. Imagine a GitLab CI pipeline that needs to build a Docker image, run tests, and deploy to a staging environment.
    build_and_test:
      stage: build
      script:
        - echo "Building Docker image..."
        - docker build -t my-app:$CI_COMMIT_SHA .
        - echo "Running tests..."
        - docker run my-app:$CI_COMMIT_SHA npm test
      tags:
        - aws-runner

    deploy_staging:
      stage: deploy
      script:
        - echo "Deploying to staging..."
        - ./deploy.sh staging
      when: on_success
      tags:
        - aws-runner
When this pipeline triggers, the GitLab Runner, configured to use the AWS Autoscaler, first checks its existing fleet of EC2 instances. If no idle runner tagged aws-runner is available, the autoscaler kicks in. It consults its configuration, picks a suitable EC2 instance type (say, t3.medium), and requests a new instance from AWS. The instance boots, registers with GitLab, and picks up the build_and_test job. If build_and_test succeeds, deploy_staging is scheduled and goes to whichever aws-runner instance is available, often the one that just finished the build. After its last job, the instance waits. If no new job arrives within the configured idle timeout (e.g., 10 minutes), the autoscaler terminates it.
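The lifecycle described above (reuse an idle instance, launch one if none is idle, terminate after the idle timeout) can be sketched as a small simulation. This is a toy model, not GitLab Runner code: the `Instance` record and the `assign_job`, `finish_job`, and `reap_idle` helpers are hypothetical names, and the launch and terminate steps are stand-ins for the real EC2 calls.

```python
from dataclasses import dataclass
from typing import Optional

IDLE_TIMEOUT_S = 10 * 60  # the 10-minute idle timeout used as an example in the text


@dataclass
class Instance:
    """Hypothetical record for one autoscaled runner instance."""
    name: str
    busy: bool = False
    idle_since: Optional[float] = None  # time (seconds) when it last became idle


def assign_job(fleet: list) -> Instance:
    """Reuse an idle instance if one exists, otherwise 'launch' a new one."""
    for inst in fleet:
        if not inst.busy:
            inst.busy, inst.idle_since = True, None
            return inst
    inst = Instance(name=f"runner-{len(fleet) + 1}", busy=True)
    fleet.append(inst)  # stands in for the EC2 launch request
    return inst


def finish_job(inst: Instance, now: float) -> None:
    """Mark the instance idle and remember when the idle period started."""
    inst.busy, inst.idle_since = False, now


def reap_idle(fleet: list, now: float) -> list:
    """Remove and return the instances idle for at least IDLE_TIMEOUT_S."""
    expired = [i for i in fleet
               if not i.busy and i.idle_since is not None
               and now - i.idle_since >= IDLE_TIMEOUT_S]
    for inst in expired:
        fleet.remove(inst)  # stands in for the EC2 terminate call
    return expired


fleet = []
build = assign_job(fleet)        # no idle runner yet, so one is launched
finish_job(build, now=300)
deploy = assign_job(fleet)       # deploy_staging reuses the same instance
finish_job(deploy, now=400)
reap_idle(fleet, now=900)        # idle for 500 s: instance is kept
reap_idle(fleet, now=1000)       # idle for 600 s: instance is terminated
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the model deterministic and easy to reason about.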
The core problem this solves is fluctuating CI load. You don’t want to pay for 50 CI runners when you only need 5 most of the time, but you also don’t want your builds to queue for hours during peak load. The AWS Autoscaler bridges this gap.
Internally, the GitLab Runner’s autoscaling component acts as a manager. It holds the desired configuration: the minimum and maximum number of runners, the EC2 instance types to use, the AMI to launch, and the tags to apply. When a job arrives and no runner is available, it works out how many new instances are needed from the number of pending jobs and the current fleet size, then calls the EC2 API to launch them. It also monitors the fleet and terminates instances that have been idle for too long, which is where the cost savings come from.
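As a rough illustration of that calculation, here is a minimal sketch. It assumes one job per instance and a hard cap on fleet size; the function name `instances_to_launch` and its parameters are hypothetical, and the real autoscaler may weigh other factors.

```python
def instances_to_launch(pending_jobs: int, idle_instances: int,
                        fleet_size: int, max_instances: int) -> int:
    """How many new instances to request for the current queue.

    Assumes each instance runs one job at a time (an assumption of this sketch).
    """
    if min(pending_jobs, idle_instances, fleet_size, max_instances) < 0:
        raise ValueError("all inputs must be non-negative")
    # Jobs that the already-idle instances cannot absorb.
    unserved = max(0, pending_jobs - idle_instances)
    # Room left before the fleet reaches its configured maximum.
    headroom = max(0, max_instances - fleet_size)
    return min(unserved, headroom)


# 8 pending jobs, 1 idle instance, 3 instances total, cap of 5:
# 7 jobs are unserved but only 2 more instances fit under the cap.
print(instances_to_launch(8, 1, 3, 5))  # prints 2
```

The jobs that do not fit under the cap stay queued until a running instance frees up, which is the queuing cost the cap trades against spend.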
Here are the key levers you control:
- concurrent and limit in config.toml: concurrent is the global cap on how many jobs run at once across every runner defined in the file; limit caps how many jobs this one runner entry handles at once. For AWS autoscaling, limit is crucial because it dictates the maximum number of EC2 instances the autoscaler will provision.
- [runners.aws] section in config.toml: This is where the AWS-specific magic happens.
  - AMI: The Amazon Machine Image ID. This must be a pre-configured AMI with the GitLab Runner binary and necessary dependencies.
  - instance_types: A list of EC2 instance types to choose from. The autoscaler will pick the first available one that meets its requirements.
  - region: The AWS region to launch instances in.
  - tags: EC2 tags to apply to launched instances. These are vital for matching the runner configuration to your GitLab setup.
  - subnet_ids: The VPC subnet(s) to launch instances into.
  - security_group_ids: Security group(s) for the instances.
  - iam_instance_profile: The IAM role for the EC2 instances, granting them permissions to interact with AWS services (like fetching secrets from Secrets Manager or uploading artifacts to S3).
- idle_count and idle_time: The number of idle runners to keep and the duration (in seconds) before an idle runner is terminated.
- max_builds: A safety net that limits the total number of builds an instance will execute before being replaced.
The most common pitfall is leaving the autoscaler to its own devices without understanding the cost implications of instance_types and idle_time. A t3.xlarge might be great for build speed, but if you only have one build every 30 minutes, keeping that instance alive for 10 minutes of idle time after each build means a large share of what you pay for is unused compute. Conversely, if your builds are short and bursty, a very short idle_time leads to constant churn: instances are terminated just before the next burst arrives, so jobs repeatedly wait for a fresh instance to boot and sit in the queue longer. You’re essentially trading idle EC2 cost against job queuing time.
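To put numbers on the t3.xlarge example, the sketch below computes what fraction of billed time is idle per build cycle. The 5-minute build duration and the hourly price are assumptions chosen for illustration (check current EC2 pricing for your region), boot time is ignored, and the function names are hypothetical.

```python
def idle_share(build_minutes: float, idle_timeout_minutes: float,
               minutes_between_builds: float) -> float:
    """Fraction of billed instance time spent idle in one build cycle.

    minutes_between_builds is measured from the start of one build to the
    start of the next. Boot time is ignored in this sketch.
    """
    if build_minutes <= 0:
        raise ValueError("build_minutes must be positive")
    gap = minutes_between_builds - build_minutes  # quiet time after a build
    if gap < 0:
        raise ValueError("builds overlap; this model assumes one at a time")
    # The instance idles until the next build or until the timeout fires.
    idle = min(gap, idle_timeout_minutes)
    return idle / (build_minutes + idle)


def idle_cost_per_cycle(hourly_price_usd: float, build_minutes: float,
                        idle_timeout_minutes: float,
                        minutes_between_builds: float) -> float:
    """Dollars spent on idle time in one build cycle."""
    gap = minutes_between_builds - build_minutes
    idle = min(max(gap, 0.0), idle_timeout_minutes)
    return hourly_price_usd * idle / 60.0


HOURLY_PRICE_USD = 0.17  # hypothetical placeholder, not a quoted AWS price

# One assumed 5-minute build every 30 minutes with a 10-minute idle timeout:
# 10 of every 15 billed minutes are idle.
print(f"{idle_share(5, 10, 30):.0%} of billed time is idle")
print(f"${idle_cost_per_cycle(HOURLY_PRICE_USD, 5, 10, 30):.4f} idle cost per build")
```

Under these assumptions, raising the timeout to 30 minutes means the instance never terminates between builds, so the idle share climbs to 25 of every 30 minutes; dropping it toward zero removes the idle cost but makes every build pay the boot delay.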
The next concept you’ll grapple with is managing the runner lifecycle and security, particularly with IAM roles and instance profiles.