Horizontally Scale FastAPI on Kubernetes with HPA (2026)

Horizontal Pod Autoscaler (HPA) lets your FastAPI application on Kubernetes automatically adjust the number of running pods based on observed metrics, ensuring you have enough resources to handle traffic without over-provisioning.

Let’s see HPA in action. Imagine you have a FastAPI app deployed on Kubernetes. Without HPA, you’d manually scale the deployment up or down, or set a fixed replica count. With HPA, you define a target metric, like CPU utilization, and the HPA controller watches that metric. When it goes above your target, HPA increases the replica count; when it drops below, it decreases it.

Here’s a basic HPA definition in YAML:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: fastapi-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fastapi-app-deployment # This must match your Deployment's name
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50

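When you apply this, Kubernetes starts monitoring the CPU usage of pods managed by the fastapi-app-deployment Deployment. If the average CPU utilization across all pods rises above the 50% target, HPA raises the replica count in proportion to the overshoot (desired replicas = ceil(current replicas * current utilization / target utilization)), up to the maxReplicas of 10. Conversely, if the average drops below the target, it scales down, but never below the minReplicas of 2. By default the controller ignores deviations within 10% of the target, so small fluctuations around 50% do not trigger scaling.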
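The scaling decision can be sketched in a few lines of Python. This is a simplified model of the formula in the Kubernetes documentation, not the controller's actual code: the function name and its defaults are illustrative, and it ignores unready pods, missing metrics, and stabilization.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Approximate the documented HPA formula for a single resource metric."""
    ratio = current_utilization / target_utilization
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    raw = math.ceil(current_replicas * ratio)
    # minReplicas and maxReplicas act as hard bounds on the result.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, 80, 50))   # 7  (ceil(4 * 1.6))
print(desired_replicas(4, 52, 50))   # 4  (within the 10% tolerance)
print(desired_replicas(4, 30, 50))   # 3  (ceil(4 * 0.6))
print(desired_replicas(8, 200, 50))  # 10 (32 capped at maxReplicas)
```

Note that the jump from 4 to 7 replicas happens in a single step: the controller scales proportionally to the overshoot rather than adding one pod at a time.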

The problem HPA solves is dynamic traffic. Your application might have predictable peak hours, but unexpected traffic spikes can overwhelm a fixed number of pods, leading to slow responses or outright failures. On the other hand, running more pods than necessary during low-traffic periods wastes resources and increases costs. HPA bridges this gap by making your application’s capacity elastic.

Internally, CPU-based HPA depends on metrics-server, a cluster-level aggregator of resource usage data. It is an add-on rather than part of the core control plane: many managed clusters ship it, but on others you must install it yourself. It collects CPU and memory metrics from the kubelet on each node and exposes them through the Kubernetes Metrics API (metrics.k8s.io). The HPA controller queries this API on every sync period (15 seconds by default, set by the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period) to get the current metrics for the pods associated with the scaleTargetRef. Based on these metrics and the target defined in the HPA resource, it calculates the desired number of replicas and updates the scale subresource of the target Deployment (or StatefulSet, ReplicaSet).
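To make the data flow concrete, here is a minimal Python sketch that parses a response shaped like the Metrics API's PodMetricsList and totals CPU usage per pod. The sample payload, pod names, and values are made up for illustration; on a real cluster you could fetch a comparable document with kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods".

```python
import json

# Hypothetical sample shaped like a metrics.k8s.io/v1beta1 PodMetricsList.
SAMPLE_RESPONSE = """
{
  "kind": "PodMetricsList",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "items": [
    {
      "metadata": {"name": "fastapi-app-deployment-aaaaa", "namespace": "default"},
      "window": "30s",
      "containers": [
        {"name": "fastapi-app", "usage": {"cpu": "85000000n", "memory": "120Mi"}}
      ]
    },
    {
      "metadata": {"name": "fastapi-app-deployment-bbbbb", "namespace": "default"},
      "window": "30s",
      "containers": [
        {"name": "fastapi-app", "usage": {"cpu": "115m", "memory": "131Mi"}}
      ]
    }
  ]
}
"""

# Divisors that convert a suffixed CPU quantity to millicores.
_DIVISORS = {"n": 1_000_000, "u": 1_000, "m": 1}


def cpu_to_millicores(quantity: str) -> float:
    """Convert CPU quantities such as "85000000n", "115m", or "1" to millicores."""
    if quantity and quantity[-1] in _DIVISORS:
        return float(quantity[:-1]) / _DIVISORS[quantity[-1]]
    return float(quantity) * 1000.0  # plain numbers are whole cores


def pod_cpu_millicores(pod_metrics_list: dict) -> dict:
    """Sum container CPU usage for each pod in a PodMetricsList."""
    usage = {}
    for item in pod_metrics_list["items"]:
        total = sum(cpu_to_millicores(c["usage"]["cpu"]) for c in item["containers"])
        usage[item["metadata"]["name"]] = total
    return usage


usage = pod_cpu_millicores(json.loads(SAMPLE_RESPONSE))
for pod, millicores in usage.items():
    print(f"{pod}: {millicores:.0f}m")
print(f"average: {sum(usage.values()) / len(usage):.0f}m")
```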

The scaleTargetRef is crucial. It tells the HPA what to scale. It points to a Kubernetes object that supports the scale subresource, most commonly a Deployment. You must ensure the name here exactly matches your Deployment’s name. The apiVersion and kind also need to be correct for the resource you’re scaling.

Here’s how you’d get your Deployment definition to ensure correct naming:

kubectl get deployment fastapi-app-deployment -o yaml

This command will show you the YAML for your Deployment. Make sure the metadata.name field in the Deployment YAML matches the name in your scaleTargetRef.

When setting targetCPUUtilizationPercentage, it’s important to understand that this percentage is relative to the CPU requests defined for your containers in the Deployment, not to their limits or to the node's capacity. If your container has a CPU request of 200m (200 millicores), and you set targetCPUUtilizationPercentage: 50, the target corresponds to an average CPU usage of 100m per pod. HPA scales up once the average rises above that level by more than the controller's tolerance (10% by default).
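The arithmetic can be checked with a short Python sketch. The helper names and the usage figures are hypothetical, and the sketch assumes every pod has a single container with the same CPU request.

```python
def millicores(quantity: str) -> int:
    """Convert a CPU request such as "200m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def target_usage_millicores(request: str, target_percent: int) -> float:
    """Average per-pod usage that corresponds to the utilization target."""
    return millicores(request) * target_percent / 100


def utilization_percent(usage_per_pod, request: str) -> float:
    """Total usage divided by total requests across the pods, as a percentage."""
    total_request = millicores(request) * len(usage_per_pod)
    return 100.0 * sum(usage_per_pod) / total_request


print(target_usage_millicores("200m", 50))            # 100.0
print(utilization_percent([90, 130, 110], "200m"))    # 55.0
```

With three pods using 90m, 130m, and 110m against a 200m request, utilization is 55%. That is above the 50% target but still within the default 10% tolerance, so on its own it would not trigger a scale-up.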

To see the current CPU requests for your FastAPI container within your Deployment:

kubectl get deployment fastapi-app-deployment -o jsonpath='{.spec.template.spec.containers[*].resources.requests.cpu}'

If you don’t have CPU requests set, HPA cannot calculate a utilization percentage, and scaling based on CPU will not work. You should define requests in your Deployment YAML like this:

spec:
  template:
    spec:
      containers:
      - name: fastapi-app
        image: your-fastapi-image:latest
        resources:
          requests:
            cpu: "200m" # Request 200 millicores of CPU
          limits:
            cpu: "500m" # Limit CPU to 500 millicores

The maxReplicas and minReplicas act as hard boundaries. HPA will never scale beyond maxReplicas, even if metrics suggest more are needed, and it will never scale below minReplicas, even if metrics are very low. These are essential for controlling costs and ensuring basic availability.

To check the current status of your HPA, including observed metrics and current replica count:

kubectl get hpa fastapi-app-hpa

This will show you output like:

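NAME              REFERENCE                           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
fastapi-app-hpa   Deployment/fastapi-app-deployment   30%/50%   2         10        4          5m

The TARGETS column reads current/target. In this example, the observed average is 30% CPU utilization against a target of 50% (meaning it’s below target), and there are currently 4 replicas running. Recent kubectl releases may prefix the value with the metric name, for example cpu: 30%/50%. If the observed percentage were to climb above 50%, HPA would start increasing REPLICAS. If TARGETS shows <unknown> instead of a percentage, the controller cannot read the metric, usually because metrics-server is not running or the containers have no CPU requests.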

You can also scale based on custom metrics or external metrics, which is powerful for applications with specific performance indicators beyond CPU or memory. For example, you could scale based on the number of pending messages in a queue or the request rate of your API. This requires two things: the autoscaling/v2 API (autoscaling/v1, used above, supports only CPU utilization) and a custom metrics adapter installed in your cluster, such as Prometheus Adapter.
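As a rough illustration, the Python sketch below builds an autoscaling/v2 manifest that scales on a per-pod custom metric and prints it as JSON, which kubectl accepts alongside YAML. The metric name http_requests_per_second and the target value are assumptions: they only work if your metrics adapter actually exposes a metric under that name.

```python
import json


def build_hpa_manifest(metric_name="http_requests_per_second", average_value="100"):
    """Build an autoscaling/v2 HPA that targets a per-pod custom metric.

    metric_name is hypothetical; it must match a metric served by the
    custom metrics adapter in your cluster.
    """
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": "fastapi-app-hpa"},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": "fastapi-app-deployment",
            },
            "minReplicas": 2,
            "maxReplicas": 10,
            "metrics": [
                {
                    "type": "Pods",
                    "pods": {
                        "metric": {"name": metric_name},
                        # Scale so each pod averages this value of the metric.
                        "target": {"type": "AverageValue", "averageValue": average_value},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_hpa_manifest(), indent=2))
```

If you save this as a script (for example build_hpa.py, a hypothetical file name), you could pipe its output to kubectl apply -f - to create the resource.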

One subtlety often overlooked is scale-down stabilization. HPA does not scale down as soon as the metric drops: it looks back over a stabilization window (5 minutes by default) and uses the highest replica recommendation from that window. This prevents rapid flapping where the application scales up, then immediately down, then up again. Scale-ups are not delayed by default. The cluster-wide default window is set by the --horizontal-pod-autoscaler-downscale-stabilization flag on the kube-controller-manager component; the older --horizontal-pod-autoscaler-upscale-delay flag has been removed from current Kubernetes releases. With the autoscaling/v2 API you can also tune this per HPA through the spec.behavior field, whose scaleUp and scaleDown sections each accept a stabilizationWindowSeconds value and rate-limiting policies. The defaults are usually fine, but knowing about them helps if you observe unexpected scaling behavior.
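The effect of a scale-down stabilization window can be illustrated with a small Python simulation. This is a simplified model rather than the controller's implementation: the class name is made up, timestamps are plain seconds, and scale-ups are assumed to pass through immediately.

```python
from collections import deque


class ScaleDownStabilizer:
    """Keep recent recommendations and only scale down to the highest one."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommended replicas)

    def recommend(self, now, desired, current):
        self._history.append((now, desired))
        # Forget recommendations that have aged out of the window.
        while self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        if desired >= current:
            return desired  # scale-ups are applied without delay
        highest = max(replicas for _, replicas in self._history)
        return min(current, highest)


stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.recommend(now=0, desired=6, current=6))    # 6
print(stabilizer.recommend(now=60, desired=3, current=6))   # 6 (6 is still in the window)
print(stabilizer.recommend(now=200, desired=3, current=6))  # 6
print(stabilizer.recommend(now=361, desired=3, current=6))  # 3 (the 6 has aged out)
```

Traffic dropped at the 60-second mark, but the replica count only falls once the last high recommendation is more than five minutes old.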

The next logical step after mastering CPU-based scaling is exploring memory-based scaling or custom metrics, both of which are configured through the autoscaling/v2 API.