Fine-tuning a large language model is like trying to teach a genius a new trick: you need to be sure they have enough brainpower and time before you start. Running out mid-lesson is a waste of everyone’s time, and with GPUs, it’s a waste of serious money.

Let’s say you’re about to fine-tune Mistral-7B on a dataset of 10,000 examples, each with a context length of 1024 tokens. You have a single NVIDIA A100 80GB GPU. How much GPU memory do you really need?

The core components determining GPU memory usage are the model’s parameters, their gradients (one value per trainable parameter), the optimizer states, and the activations generated during the forward pass.

Model Parameters

A 7B parameter model, using FP16 (2 bytes per parameter), takes up $7 \times 10^9 \times 2 \text{ bytes} \approx 14 \text{ GB}$. This is the baseline.

Optimizer States

The most common optimizer for LLM fine-tuning is AdamW. It stores two states per parameter: the first moment (mean) and the second moment (variance).

  • In FP32 (4 bytes per value): $7 \times 10^9 \times 2 \text{ states} \times 4 \text{ bytes} \approx 56 \text{ GB}$. For standard AdamW with FP32 states, this is the largest single chunk.
  • With FP16 states: $7 \times 10^9 \times 2 \times 2 \text{ bytes} \approx 28 \text{ GB}$. With 8-bit states (for example, the bitsandbytes 8-bit optimizers): $7 \times 10^9 \times 2 \times 1 \text{ byte} \approx 14 \text{ GB}$. Both are less common than FP32 states.
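The parameter and optimizer arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name `tensor_memory_gb` is hypothetical, 1 GB is taken as 10^9 bytes to match the figures in the text, and the 8-bit line ignores the small quantization constants an 8-bit optimizer also stores.

```python
def tensor_memory_gb(num_values: float, bytes_per_value: float, copies: int = 1) -> float:
    """Memory in GB (10**9 bytes) for `copies` tensors of `num_values` values each."""
    return num_values * bytes_per_value * copies / 1e9


NUM_PARAMS = 7e9  # nominal parameter count of a "7B" model

weights_fp16 = tensor_memory_gb(NUM_PARAMS, 2)           # FP16 weights
adamw_fp32 = tensor_memory_gb(NUM_PARAMS, 4, copies=2)   # two FP32 moments per parameter
adamw_fp16 = tensor_memory_gb(NUM_PARAMS, 2, copies=2)   # two FP16 moments per parameter
adamw_8bit = tensor_memory_gb(NUM_PARAMS, 1, copies=2)   # two 1-byte moments per parameter

print(f"weights (FP16):      {weights_fp16:.0f} GB")
print(f"AdamW states, FP32:  {adamw_fp32:.0f} GB")
print(f"AdamW states, FP16:  {adamw_fp16:.0f} GB")
print(f"AdamW states, 8-bit: {adamw_8bit:.0f} GB")
```

Changing `NUM_PARAMS` gives the same breakdown for other model sizes.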

Activations

Activations are the intermediate outputs of each layer during the forward pass, and they must be kept until backpropagation has used them. Their size depends on batch size, sequence length, and model architecture. This is where things get tricky.

A rough lower bound counts one hidden-state tensor per layer: batch_size * sequence_length * hidden_size * num_layers * bytes_per_value. For Mistral-7B, hidden_size is 4096 and num_layers is 32. With a batch_size of 1, a sequence_length of 1024, and FP16 (2 bytes per value): $1 \times 1024 \times 4096 \times 32 \times 2 \text{ bytes} \approx 268 \text{ MB}$.

This seems small, but each layer actually keeps many tensors of this size or larger (attention projections, attention scores, MLP intermediates), so the real figure is a large multiple of this lower bound, and it grows with batch_size and sequence_length.
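To see how the rule of thumb scales, here is a small sketch. The function name `hidden_state_memory_bytes` is made up for illustration, and the result counts only one FP16 hidden-state tensor per layer, so treat it as a floor rather than a prediction.

```python
def hidden_state_memory_bytes(batch_size: int, seq_len: int, hidden_size: int,
                              num_layers: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to keep one hidden-state tensor per layer (a lower bound)."""
    return batch_size * seq_len * hidden_size * num_layers * bytes_per_value


# Mistral-7B figures from the text: hidden_size=4096, num_layers=32
base = hidden_state_memory_bytes(1, 1024, 4096, 32)
print(f"batch 1, seq 1024: {base / 1e6:.0f} MB")

# The bound grows linearly in both batch size and sequence length
for batch_size, seq_len in [(4, 1024), (1, 4096), (8, 2048)]:
    size = hidden_state_memory_bytes(batch_size, seq_len, 4096, 32)
    print(f"batch {batch_size}, seq {seq_len}: {size / 1e6:.0f} MB ({size // base}x)")
```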

Putting It Together: A Quick Estimate

Let’s sum up for a batch size of 1, sequence length 1024, FP16 model, and FP32 AdamW:

  • Model: 14 GB
  • Gradients (FP16, one per parameter): 14 GB
  • Optimizer (FP32): 56 GB
  • Activations (rough estimate for batch 1): around 10-20 GB for a 7B model at 1024 context.

Total, taking the upper activation figure: 14 + 14 + 56 + 20 = 104 GB.

This indicates a single A100 80GB GPU is not enough for this configuration.
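A budget check like this is easy to script. The sketch below uses a hypothetical helper, `fits_on_gpu`, and runs it twice: once with weights, optimizer states, and activations only, and once with an added buffer of one FP16 gradient per parameter (14 GB), which full fine-tuning also has to hold. The activation figure is the rough upper estimate from the text, not a measurement.

```python
def fits_on_gpu(components_gb: dict, capacity_gb: float) -> tuple:
    """Return (fits, headroom_gb) for a dict of per-component estimates in GB."""
    total = sum(components_gb.values())
    return total <= capacity_gb, capacity_gb - total


CAPACITY_GB = 80  # single A100 80GB

estimate = {"weights_fp16": 14, "adamw_states_fp32": 56, "activations": 20}
fits, headroom = fits_on_gpu(estimate, CAPACITY_GB)
print(f"without gradient buffer: fits={fits}, headroom={headroom} GB")

# Assumption: one FP16 gradient per parameter for full fine-tuning
estimate_with_grads = {**estimate, "gradients_fp16": 14}
fits_g, headroom_g = fits_on_gpu(estimate_with_grads, CAPACITY_GB)
print(f"with gradient buffer:    fits={fits_g}, headroom={headroom_g} GB")
```

Either way the headroom is negative, which is what motivates the techniques that follow.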

Practical Tools and Techniques

  1. accelerate’s infer_auto_device_map: Before even starting, use Hugging Face’s accelerate library to compute a device map for the model weights.

    import torch
    from accelerate import infer_auto_device_map
    from transformers import AutoModelForCausalLM

    model_name = "mistralai/Mistral-7B-v0.1"
    # Loads the weights into CPU RAM (about 14 GB in FP16) so their sizes can be measured
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

    # Memory budget per device; lower it to leave headroom for gradients,
    # optimizer states, and activations
    max_memory = {0: "80GiB"}  # For a single A100 80GB

    device_map = infer_auto_device_map(
        model,
        max_memory=max_memory,
        no_split_module_classes=["MistralDecoderLayer"],  # Keep each decoder layer on one device
        dtype=torch.float16,
    )
    print(device_map)
    

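    This shows where each module would be placed and how the model would be split across GPUs if you had several. Modules that do not fit within max_memory are assigned to "cpu" or "disk" rather than raising an error, so check the map for those entries. Note that the map covers the weights only, not gradients, optimizer states, or activations.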

  2. Gradient Accumulation: If your batch size is too large, use gradient accumulation. This means performing multiple forward/backward passes with smaller micro-batches and only updating the weights after a certain number of steps.

    • How it works: Instead of batch_size=32, you might use gradient_accumulation_steps=8 with per_device_train_batch_size=4. The effective batch size is $4 \times 8 = 32$.
    • Memory Impact: This dramatically reduces activation memory per step, as it’s tied to per_device_train_batch_size, not the effective batch size.
    • Fix: In your TrainingArguments (Hugging Face Trainer):
      from transformers import TrainingArguments
      
      training_args = TrainingArguments(
          output_dir="./results",
          per_device_train_batch_size=4,
          gradient_accumulation_steps=8,
          # ... other args
      )
      
      This reduces the memory needed for activations from what batch_size=32 would require to what batch_size=4 requires, while still achieving the effect of a larger batch.
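Why accumulation gives the effect of the larger batch can be shown without a GPU. The sketch below is a toy, not the Trainer’s implementation: it uses a one-parameter linear model with a mean-squared-error loss and hypothetical helper names, and checks that averaging the gradients of 8 micro-batches of 4 matches the gradient of one batch of 32.

```python
def mse_grad(w: float, xs: list, ys: list) -> float:
    """d/dw of mean((w * x - y) ** 2) over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list, ys: list, micro_batch_size: int) -> float:
    """Sum micro-batch gradients, each scaled by 1 / number of accumulation steps."""
    steps = len(xs) // micro_batch_size
    grad = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return grad


xs = [float(i) for i in range(32)]   # effective batch of 32 examples
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

full = mse_grad(w, xs, ys)                               # one batch of 32
accum = accumulated_grad(w, xs, ys, micro_batch_size=4)  # 8 steps of 4
print(full, accum)
```

The two gradients agree up to floating-point rounding because the micro-batches are equal in size; only one micro-batch of activations has to be alive at a time.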
  3. Quantization (QLoRA/LoRA): For fine-tuning, you rarely need to update all parameters. LoRA (Low-Rank Adaptation) injects small, trainable matrices into the model. QLoRA takes this further by quantizing the base model to 4-bit precision.

    • How it works: The base model weights are loaded in 4-bit (e.g., NF4), drastically reducing their memory footprint. Only the small LoRA adapter weights are trained in higher precision.
    • Memory Impact: A 7B model in 4-bit takes about 4-5 GB. The optimizer states for the LoRA parameters are tiny compared to the full model.
    • Fix: Use the bitsandbytes library and Hugging Face peft (Parameter-Efficient Fine-Tuning).
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
      from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

      model_id = "mistralai/Mistral-7B-v0.1"

      # 4-bit quantization settings are passed through a BitsAndBytesConfig
      bnb_config = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_quant_type="nf4",
          bnb_4bit_compute_dtype=torch.bfloat16,
      )

      # Load model in 4-bit
      model = AutoModelForCausalLM.from_pretrained(
          model_id,
          quantization_config=bnb_config,
          device_map="auto",  # or your specific device map
      )
      tokenizer = AutoTokenizer.from_pretrained(model_id)

      # Standard peft preparation step before training a quantized model
      model = prepare_model_for_kbit_training(model)

      # Configure LoRA
      lora_config = LoraConfig(
          r=16,  # Rank of the update matrices
          lora_alpha=32,  # Alpha scaling
          target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Target layers
          lora_dropout=0.05,
          bias="none",
          task_type="CAUSAL_LM",
      )

      # Apply LoRA
      model = get_peft_model(model, lora_config)
      model.print_trainable_parameters()  # Shows how few parameters are trainable
      
      This configuration would dramatically reduce memory usage, likely fitting the 7B model and optimizer states comfortably within an 80GB A100.
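To put a number on "tiny", the sketch below counts the LoRA parameters for the configuration above (r=16 on q_proj, k_proj, v_proj, o_proj). It relies on an assumption not stated in the text: that Mistral-7B uses grouped-query attention with 8 key/value heads of dimension 128, so k_proj and v_proj map 4096 to 1024. The helper name is hypothetical.

```python
HIDDEN = 4096      # hidden_size, from the text
NUM_LAYERS = 32    # num_layers, from the text
KV_DIM = 1024      # assumption: 8 KV heads x 128 head dim (grouped-query attention)
RANK = 16          # r in the LoraConfig above

# (in_features, out_features) of each targeted projection
PROJ_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


trainable = NUM_LAYERS * sum(
    lora_param_count(i, o, RANK) for i, o in PROJ_SHAPES.values()
)
adamw_state_mb = trainable * 2 * 4 / 1e6  # two FP32 moments per trainable parameter

print(f"trainable LoRA parameters: {trainable:,}")
print(f"FP32 AdamW states:         {adamw_state_mb:.0f} MB")
```

Under these assumptions the optimizer states shrink from tens of gigabytes to roughly a hundred megabytes.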
  4. Gradient Checkpointing: This technique trades compute for memory. Instead of storing all intermediate activations for the backward pass, it recomputes them on the fly.

    • How it works: During the forward pass, only a subset of activations are saved. During the backward pass, the network is re-traversed from the last saved activation point to recompute the necessary intermediate values.
    • Memory Impact: Significantly reduces activation memory, but increases training time by about 20-30%.
    • Fix: In TrainingArguments:
      training_args = TrainingArguments(
          # ... other args
          gradient_checkpointing=True,
      )
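The recompute-instead-of-store idea can be illustrated with a toy chain of scalar "layers". This is a sketch of the general technique, not how the Trainer or PyTorch implements it; the layer function, the helper names, and the segment size are all made up for illustration.

```python
import math


def forward_layer(x: float, w: float) -> float:
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def backward_layer(x: float, w: float, grad_out: float) -> float:
    """Gradient with respect to the layer input x, given the gradient of its output."""
    y = math.tanh(w * x)
    return grad_out * (1.0 - y * y) * w


def grad_full_storage(x0: float, weights: list) -> tuple:
    """Keep every layer input from the forward pass for use in the backward pass."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = forward_layer(x, w)
    grad = 1.0
    for x_in, w in zip(reversed(inputs), reversed(weights)):
        grad = backward_layer(x_in, w, grad)
    return grad, len(inputs)


def grad_checkpointed(x0: float, weights: list, segment: int) -> tuple:
    """Keep only the input of every `segment`-th layer; recompute the rest on demand."""
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = forward_layer(x, w)
    grad, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        seg_weights = weights[start:start + segment]
        # Re-run the forward pass for this segment only, starting at its checkpoint
        seg_inputs, x = [], checkpoints[start]
        for w in seg_weights:
            seg_inputs.append(x)
            x = forward_layer(x, w)
        peak = max(peak, len(checkpoints) + len(seg_inputs))
        for x_in, w in zip(reversed(seg_inputs), reversed(seg_weights)):
            grad = backward_layer(x_in, w, grad)
    return grad, peak


weights = [0.9 + 0.01 * i for i in range(32)]
full_grad, full_stored = grad_full_storage(0.3, weights)
ckpt_grad, ckpt_stored = grad_checkpointed(0.3, weights, segment=4)
print(full_grad, full_stored)
print(ckpt_grad, ckpt_stored)
```

Both paths produce the same gradient, but the checkpointed one holds fewer values at its peak and pays for it with a second forward pass through each segment.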
      
  5. Mixed Precision Training (FP16/BF16): Running the forward and backward passes in half precision (FP16 or BF16) halves the memory required for activations compared to FP32.

    • How it works: Computations are done in FP16/BF16, while an FP32 master copy of the weights is usually kept for the optimizer update to maintain precision.
    • Memory Impact: Roughly halves activation memory. Parameter memory is halved only if the weights themselves are stored in half precision; with an FP32 master copy it is not.
    • Fix: Set fp16=True or bf16=True in TrainingArguments. BF16 is generally preferred on newer hardware (like A100s) as it has a wider dynamic range and is less prone to underflow/overflow issues than FP16.
      training_args = TrainingArguments(
          # ... other args
          bf16=True, # or fp16=True
      )
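The difference in dynamic range is easy to see with the standard library alone. In this sketch the helper names are hypothetical, and BF16 is emulated by truncating a float32 to its top 16 bits (real hardware typically rounds to nearest, so low-order digits may differ slightly).

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; values beyond the FP16 range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


for value in (70000.0, 1e-8, 1.001):
    print(f"{value!r:>10}  fp16={to_fp16(value)!r}  bf16={to_bf16(value)!r}")
```

A large value such as 70000 overflows FP16 and a small one such as 1e-8 underflows to zero, while the emulated BF16 keeps both at reduced precision. The last value shows the price: BF16 has fewer mantissa bits than FP16.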
      
  6. DeepSpeed/FSDP: For multi-GPU setups or very large models, distributed training frameworks like DeepSpeed or PyTorch’s Fully Sharded Data Parallel (FSDP) are essential.

    • How it works: They shard model parameters, gradients, and optimizer states across multiple GPUs, allowing you to train models that wouldn’t fit on a single device.
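    • Memory Impact: Divides the memory for parameters, gradients, and optimizer states by roughly the number of GPUs. Activations are not sharded; each GPU still holds the activations for its own micro-batch.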
    • Fix: Requires configuration files for DeepSpeed or setup in PyTorch. For example, with FSDP:
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      # Requires an initialized process group (for example, a script launched with torchrun)
      # Wrap your model
      model = FSDP(model, auto_wrap_policy=...)
      
      This is more complex and typically used when single-GPU methods are insufficient.
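For a rough sense of what sharding buys, the sketch below divides the shardable state evenly across GPUs and leaves activations unsharded. The helper name is hypothetical, the 14 GB of FP16 gradients is an assumption of one gradient per parameter, the 20 GB activation figure is the rough estimate used earlier, and communication buffers and temporarily gathered parameters are ignored.

```python
def per_gpu_memory_gb(sharded_gb: float, activations_gb: float, num_gpus: int) -> float:
    """Per-GPU estimate when weights, gradients, and optimizer states are fully sharded."""
    return sharded_gb / num_gpus + activations_gb


WEIGHTS_GB = 14      # FP16 weights
GRADIENTS_GB = 14    # assumption: one FP16 gradient per parameter
OPTIMIZER_GB = 56    # FP32 AdamW states
ACTIVATIONS_GB = 20  # rough per-GPU estimate; not reduced by sharding

sharded = WEIGHTS_GB + GRADIENTS_GB + OPTIMIZER_GB
for num_gpus in (1, 2, 4, 8):
    needed = per_gpu_memory_gb(sharded, ACTIVATIONS_GB, num_gpus)
    verdict = "fits" if needed <= 80 else "does not fit"
    print(f"{num_gpus} GPU(s): {needed:.1f} GB per GPU ({verdict} in 80 GB)")
```

Under these assumptions, two 80 GB GPUs are already enough for full fine-tuning, with the activation term becoming the dominant per-GPU cost as the GPU count grows.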

The Next Problem

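After you’ve successfully tuned your model, the next place you might encounter RuntimeError: CUDA out of memory. is when merging the LoRA adapters back into the base model. The adapters themselves are small, but merging needs the base weights in half precision (about 14 GB for a 7B model) on top of whatever is still resident on the GPU.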
