DeepSpeed ZeRO can actually increase your GPU memory usage in certain configurations, even though its primary goal is to reduce it.

Let’s see it in action. Imagine we’re fine-tuning a 7B-parameter model on four A100 80GB GPUs. With plain data parallelism, every GPU holds a full copy of the weights, gradients, and optimizer states, and we’d quickly run out of memory. With DeepSpeed ZeRO-2, we can fit it.

Here’s a typical DeepSpeed config file for ZeRO-2:

{
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 16,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": 4,
  "gradient_clipping": 1.0
}

The problem DeepSpeed ZeRO solves is the massive memory footprint of large language models during training. In plain FP32, a 7B parameter model needs roughly 28GB for weights (7B * 4 bytes), another 28GB for gradients, and about 56GB for optimizer states, because Adam keeps two moment estimates per parameter (2 * 28GB). That’s ~112GB of model state per GPU under ordinary data parallelism, where every GPU holds a full replica, and that’s before counting activations. A single 80GB A100 can’t hold it.
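The arithmetic is easy to reproduce. The sketch below assumes full-FP32 training with Adam (two FP32 moment tensors per parameter) and ignores activations and temporary buffers; the function name is made up for illustration, and the numbers are a back-of-envelope estimate, not a measurement.

```python
# Back-of-envelope model-state memory for full-FP32 Adam training.
# Assumptions: 4 bytes per value, two Adam moments per parameter,
# activations and temporary buffers are NOT included.
GB = 1e9  # decimal gigabytes, matching the figures in the text


def fp32_adam_footprint_gb(num_params: float) -> dict:
    bytes_per_value = 4
    weights = num_params * bytes_per_value / GB
    gradients = num_params * bytes_per_value / GB
    adam_moments = 2 * num_params * bytes_per_value / GB  # momentum + variance
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_moments": adam_moments,
        "total": weights + gradients + adam_moments,
    }


footprint = fp32_adam_footprint_gb(7e9)
for name, gb in footprint.items():
    print(f"{name:>12}: {gb:6.1f} GB")
```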

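DeepSpeed ZeRO breaks this down by partitioning these model states across the GPUs instead of replicating them on every one. It does so in three cumulative stages: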

  • ZeRO-1 (Optimizer State Partitioning): It partitions the optimizer states across GPUs. If you have 4 GPUs, each GPU only holds 1/4 of the optimizer states. Since the optimizer states are the largest of the three model states, this alone is a big saving.
  • ZeRO-2 (Optimizer State + Gradient Partitioning): It adds gradient partitioning to ZeRO-1. Once gradients are reduced, each GPU keeps only the portion that matches its share of the optimizer states, further reducing memory.
  • ZeRO-3 (Optimizer State + Gradient + Parameter Partitioning): This is the most aggressive. It partitions the model parameters themselves across GPUs. Each GPU stores only its own shard and gathers the full parameters of a layer just before that layer’s forward or backward computation, releasing them afterwards.
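To make the three stages concrete, here is a rough per-GPU estimate of model-state memory at each stage. It assumes the usual mixed-precision accounting: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for the optimizer (FP32 master weights plus Adam’s two FP32 moments). The function name is hypothetical, and real runs also need room for activations and communication buffers, so treat the output as a lower bound.

```python
# Approximate per-GPU model-state memory for each ZeRO stage.
# Assumed accounting (mixed-precision Adam): 2 + 2 + 12 bytes per parameter.
# Activations, buffers, and fragmentation are NOT included.
GB = 1e9


def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    if stage not in (0, 1, 2, 3):
        raise ValueError(f"unknown ZeRO stage: {stage}")
    params = 2 * num_params   # FP16 working weights
    grads = 2 * num_params    # FP16 gradients
    optim = 12 * num_params   # FP32 master weights + momentum + variance
    if stage >= 1:
        optim /= num_gpus     # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus     # ZeRO-2: also partition gradients
    if stage >= 3:
        params /= num_gpus    # ZeRO-3: also partition parameters
    return (params + grads + optim) / GB


for stage in range(4):
    gb = zero_model_state_gb(7e9, num_gpus=4, stage=stage)
    print(f"stage {stage}: {gb:6.1f} GB per GPU")
```

For the 7B model on 4 GPUs this gives roughly 112, 49, 38.5, and 28 GB for stages 0 through 3, which is why stage 2 leaves headroom for activations on an 80GB card.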

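In the config above, stage: 2 means we’re using ZeRO-2. offload_optimizer with device: "cpu" is an optional addition on top of that: it keeps the optimizer states in CPU RAM, which is usually far more plentiful than GPU VRAM, and runs the optimizer step on the CPU. contiguous_gradients: true copies gradients into one contiguous buffer as they are produced, which reduces memory fragmentation during the backward pass.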

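train_micro_batch_size_per_gpu (16) is the batch size each GPU processes per forward/backward pass. gradient_accumulation_steps (4) means gradients are accumulated over 4 micro-batches before each optimizer step, which simulates a larger batch without the activation memory a larger micro-batch would need. train_batch_size is the effective global batch per optimizer step, and DeepSpeed requires it to equal train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs. On our 4 GPUs that is 16 * 4 * 4 = 256, so the 64 in the config above will be rejected when DeepSpeed validates it: either set train_batch_size to 256, or omit one of the three keys and let DeepSpeed infer it. fp16: { "enabled": true } turns on mixed-precision training: the forward and backward passes use FP16 weights, gradients, and activations (half the size of FP32), while the optimizer keeps an FP32 master copy of the weights.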
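DeepSpeed expects the three batch settings to agree: train_batch_size must equal train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs. A small pre-flight check can catch a mismatch before a job is launched. The helper below is a hypothetical illustration, not part of DeepSpeed; it only mirrors that one relation.

```python
# Hypothetical pre-flight check (not a DeepSpeed API): verifies that
# train_batch_size == micro_batch_per_gpu * grad_accum_steps * world_size.
def check_batch_config(config: dict, world_size: int) -> int:
    micro = config["train_micro_batch_size_per_gpu"]
    accum = config.get("gradient_accumulation_steps", 1)
    expected = micro * accum * world_size
    declared = config.get("train_batch_size", expected)
    if declared != expected:
        raise ValueError(
            f"train_batch_size={declared}, but {micro} * {accum} * "
            f"{world_size} GPUs = {expected}"
        )
    return expected


# The batch settings from the config above, checked for 4 GPUs.
config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 4,
}
try:
    check_batch_config(config, world_size=4)
except ValueError as err:
    print("inconsistent batch settings:", err)

# Consistent variant: declare the full global batch.
config["train_batch_size"] = 256
print("global batch:", check_batch_config(config, world_size=4))
```

With 4 GPUs, a micro-batch of 16 and 4 accumulation steps give a global batch of 256; to keep a global batch of 64, lower the micro-batch to 4 or set the accumulation steps to 1.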

The surprising part: offloading doesn’t make the optimizer states any smaller, it only moves them. Even with fp16 enabled, the optimizer keeps FP32 master weights plus Adam’s two FP32 moments, about 12 bytes per parameter, so a 7B model puts roughly 84GB of optimizer state into CPU RAM, summed over all ranks on the host. What crosses the PCIe bus each step is the smaller FP16 traffic (gradients going to the CPU, updated weights coming back); without fp16, those transfers are FP32 and twice as large. The optimizer step itself now runs on the CPU, so CPU compute and memory bandwidth can become the bottleneck. The pin_memory: true setting for the optimizer offload is important: it allocates page-locked host memory, which the GPU can read and write via DMA for faster transfers, but pinned pages can’t be swapped out, so it raises the pressure on CPU RAM.
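It helps to size the host RAM before turning on optimizer offload. The sketch below assumes the optimizer keeps FP32 master weights plus two FP32 Adam moments (12 bytes per parameter) and that all ranks run on a single host; the function name is hypothetical, and the result ignores the extra staging buffers a real run allocates.

```python
# Rough host-RAM estimate for CPU optimizer offload.
# Assumptions: 12 bytes/parameter of optimizer state (FP32 master weights,
# momentum, variance), all ranks share one host, staging buffers ignored.
GB = 1e9


def offload_host_ram_gb(num_params: float, ranks_on_host: int,
                        bytes_per_param: int = 12) -> tuple:
    total = num_params * bytes_per_param / GB
    per_rank = total / ranks_on_host  # each rank holds only its partition
    return per_rank, total


per_rank, total = offload_host_ram_gb(7e9, ranks_on_host=4)
print(f"per rank: {per_rank:.1f} GB, whole host: {total:.1f} GB")
```

With pin_memory: true, that allocation is page-locked, so the host needs this much physical RAM free, not just virtual memory.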

The zero_optimization block is where the magic happens. stage: 2 is a good balance for many LLM fine-tuning tasks. If you’re still hitting memory limits, you’d bump this to stage: 3, which also partitions the parameters at the cost of extra communication. If you’re trying to maximize throughput and have ample GPU memory, you might experiment with stage: 1 or stage: 0 (which disables ZeRO and is essentially plain DDP).
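When experimenting with different stages, it can be convenient to generate the config rather than edit JSON by hand. The helper below is a hypothetical sketch: it uses only the keys that appear in the config above, leaves out train_batch_size so DeepSpeed can derive it from the other two batch settings, and the output filename is an arbitrary choice.

```python
import json


# Hypothetical helper (not a DeepSpeed API): builds a config dict using
# the same keys as the example config, with a switchable ZeRO stage.
def make_zero_config(stage: int, offload_optimizer: bool = False) -> dict:
    if stage not in (0, 1, 2, 3):
        raise ValueError(f"unknown ZeRO stage: {stage}")
    zero = {"stage": stage}
    if stage >= 2:
        # Only meaningful once gradients are partitioned.
        zero["contiguous_gradients"] = True
    if offload_optimizer and stage >= 1:
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    return {
        # train_batch_size is omitted on purpose so it can be inferred.
        "train_micro_batch_size_per_gpu": 16,
        "gradient_accumulation_steps": 4,
        "fp16": {"enabled": True},
        "zero_optimization": zero,
        "gradient_clipping": 1.0,
    }


config = make_zero_config(stage=3, offload_optimizer=True)
print(json.dumps(config, indent=2))

# To save it for a training run (the filename is an assumption):
# with open("ds_config.json", "w") as f:
#     json.dump(config, f, indent=2)
```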

gradient_clipping: 1.0 caps the global gradient norm at 1.0, a standard guard against exploding gradients, which are especially common when training LLMs.
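For intuition, clipping by global norm rescales all gradients together whenever their combined L2 norm exceeds the threshold. The pure-Python sketch below illustrates the idea on plain lists; it is a simplified stand-in, not DeepSpeed’s implementation, and the small epsilon is an assumption added to avoid division by zero.

```python
import math


# Simplified illustration of clipping by global L2 norm on nested lists.
# Not DeepSpeed's code; real implementations operate on tensors.
def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Scale down only when the norm exceeds the threshold.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


grads = [[3.0], [4.0]]  # global norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(f"norm before: {norm_before:.3f}, after: {norm_after:.3f}")
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude shrinks.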

The next hurdle you’ll likely face is not a memory error but a performance bottleneck from communication overhead between GPUs, especially with ZeRO-3, which has to gather the partitioned parameters during both the forward and backward pass and so depends on a fast interconnect.
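A rough way to see the cost: in the analysis from the ZeRO paper, stages 1 and 2 move about the same data per GPU per step as ordinary data parallelism (two model-sizes’ worth), while stage 3 moves about three, since parameters are gathered in both the forward and backward pass. The sketch below applies that estimate assuming FP16 transfers; the function name is hypothetical, and it ignores gradient accumulation, overlap with compute, and offload traffic.

```python
# Approximate per-GPU communication volume per training step.
# Based on the ZeRO paper's estimate: 2x model size for stages 0-2,
# 3x for stage 3. Assumes FP16 (2-byte) transfers; ignores overlap,
# gradient accumulation, and CPU offload traffic.
GB = 1e9


def comm_volume_gb(num_params: float, stage: int,
                   bytes_per_value: int = 2) -> float:
    if stage not in (0, 1, 2, 3):
        raise ValueError(f"unknown ZeRO stage: {stage}")
    multiplier = 3 if stage == 3 else 2
    return multiplier * num_params * bytes_per_value / GB


for stage in range(4):
    print(f"stage {stage}: ~{comm_volume_gb(7e9, stage):.0f} GB per step")
```

For the 7B model that is roughly 28GB per step for stages 0 to 2 and 42GB for stage 3, a 1.5x increase that matters most on slower interconnects.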
