Fully fine-tuning a large language model is often a worse trade-off than Parameter-Efficient Fine-Tuning (PEFT): it costs far more memory and compute, and for many practical tasks PEFT reaches comparable quality.

Let’s see what that looks like in practice. Imagine we have a base LLM, say llama-2-7b-hf. We want to adapt it for sentiment analysis.

Full Fine-Tuning:

This means we take the entire llama-2-7b-hf model, load it into memory, and update all of its billions of parameters based on our sentiment analysis dataset.

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
import torch

model_name = "meta-llama/Llama-2-7b-hf" # Or your local path
output_dir = "./llama2-7b-sentiment-full"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assume `train_dataset` is a Hugging Face Dataset with 'text' and 'label' columns
# We'd need to tokenize and format this dataset appropriately for causal LM
# ... (dataset preparation code)

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    fp16=True, # Use mixed precision for speed and memory
    logging_dir=f"{output_dir}/logs",
    logging_steps=10,
    save_steps=500,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    # Add data collator and compute_metrics if needed
)

trainer.train()

The problem here is that the model is huge. For Llama-2-7b, the weights alone take around 14GB in float16. Training also needs gradients, which are the same size as the weights, plus optimizer states (AdamW stores a momentum and a variance value for every parameter). Even if everything is kept in 16-bit, that is about 4x the model size, roughly 56GB. A typical mixed-precision setup keeps float32 weights, gradients, and optimizer states, which comes to about 16 bytes per parameter, or over 100GB for a 7B model before counting activations. Either way, this exceeds a single consumer GPU and often requires multiple high-end GPUs like A100s. And because every one of those 7 billion parameters is updated, each step pays for a full set of gradients and optimizer updates, which makes training slow and compute-hungry.
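The arithmetic behind these figures can be sketched in a few lines. This is a rough estimate under stated assumptions: a nominal 7 billion parameters, AdamW with two state values per parameter, and rule-of-thumb byte counts. It ignores activations, and real totals vary with the framework and precision settings.

```python
NUM_PARAMS = 7_000_000_000  # nominal "7B"; the real parameter count differs slightly

# Bytes per parameter for each piece of training state (rule-of-thumb values).
SCENARIOS = {
    "all states in 16-bit": {
        "weights": 2, "gradients": 2, "adam_momentum": 2, "adam_variance": 2,
    },
    "float32 states (typical mixed precision)": {
        "weights": 4, "gradients": 4, "adam_momentum": 4, "adam_variance": 4,
    },
}


def training_state_gb(num_params, bytes_per_component):
    """Memory for weights, gradients, and optimizer states in GB (1 GB = 1e9 bytes)."""
    return num_params * sum(bytes_per_component.values()) / 1e9


estimates = {
    name: training_state_gb(NUM_PARAMS, components)
    for name, components in SCENARIOS.items()
}
for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB before activations")
```

Under these assumptions the two scenarios come out at about 56 GB and 112 GB, which is why a single 24 GB consumer card is not an option for full fine-tuning at this scale.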

PEFT (Parameter-Efficient Fine-Tuning):

PEFT methods, like LoRA (Low-Rank Adaptation), freeze most of the pre-trained model’s weights and inject small, trainable "adapter" layers. Instead of updating all 7 billion parameters, we only train these tiny adapters.

Let’s use LoRA with the peft library:

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

model_name = "meta-llama/Llama-2-7b-hf" # Or your local path
output_dir = "./llama2-7b-sentiment-lora"

# 4-bit quantization settings (requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Quantize to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model in 4-bit for memory efficiency
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto", # Place layers on the available GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token # Set pad token if not present

# Prepare model for k-bit training (if using quantization)
model = prepare_model_for_kbit_training(model)

# Define LoRA configuration
lora_config = LoraConfig(
    r=16, # Rank of the update matrices. Higher rank means more trainable parameters.
    lora_alpha=32, # Alpha scaling factor. Controls the magnitude of the adaptation.
    lora_dropout=0.05, # Dropout probability for LoRA layers.
    bias="none", # Whether to train bias parameters.
    task_type="CAUSAL_LM",
    # Target specific modules. Common choices are 'q_proj', 'v_proj', 'k_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'.
    target_modules=["q_proj", "v_proj"]
)

# Get the PEFT model
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters() # This will show a tiny fraction of total parameters

# Assume `train_dataset` is prepared as before
# ... (dataset preparation code)

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4, # Often higher learning rates work well with PEFT
    weight_decay=0.001,
    bf16=True, # Match bnb_4bit_compute_dtype above; needs a GPU with bfloat16 support
    logging_dir=f"{output_dir}/logs",
    logging_steps=10,
    save_steps=500,
    save_total_limit=2,
)

trainer = Trainer(
    model=peft_model, # Use the PEFT model
    args=training_args,
    train_dataset=train_dataset,
    # Add data collator and compute_metrics if needed
)

trainer.train()

The key difference is get_peft_model(model, lora_config). This wraps the original model, freezing its weights and adding small, trainable LoRA matrices. The print_trainable_parameters() call will reveal that we’re training a few million parameters (well under 1% of the total), not billions. Because gradients and optimizer states are kept only for the adapters, and the frozen base weights are stored in 4-bit, VRAM requirements drop drastically, allowing fine-tuning on consumer GPUs (e.g., RTX 3090/4090) or smaller enterprise cards. Each step is also cheaper on the optimizer side, although the forward and backward passes still run through the full model, so the speedup is usually smaller than the memory savings.
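To see where "a few million" comes from, here is a minimal sketch that counts the LoRA parameters for the configuration above. It assumes Llama-2-7B's published shape (hidden size 4096, 32 decoder layers, square 4096 x 4096 q_proj and v_proj); the helper name `lora_trainable_params` is made up for illustration and is not part of the peft library.

```python
def lora_trainable_params(hidden_size, num_layers, rank, modules_per_layer):
    """Count LoRA parameters when every adapted module is hidden_size x hidden_size."""
    # Each adapted module gets A (rank x hidden_size) and B (hidden_size x rank).
    per_module = rank * hidden_size + hidden_size * rank
    return per_module * modules_per_layer * num_layers


# r=16 on q_proj and v_proj (2 modules per layer), as in the LoraConfig above.
trainable = lora_trainable_params(
    hidden_size=4096, num_layers=32, rank=16, modules_per_layer=2
)
print(f"{trainable:,} trainable parameters")  # 8,388,608

# Share of a nominal 6.7 billion base parameters (an approximation).
print(f"~{100 * trainable / 6.7e9:.2f}% of the base model")
```

Doubling the rank or adding more target modules scales this count linearly, which is a quick way to budget adapter size before launching a run.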

The problem PEFT solves is the prohibitive cost of full fine-tuning. It lets us adapt LLMs to new tasks with a fraction of the computational resources, making LLM customization accessible. The adapters are small (megabytes rather than gigabytes), so you can store many task-specific adapters for a single base model. When you want to use a specialized model, you load the base model and then either apply the adapter on top of it or merge the adapter into its weights.
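A minimal sketch of that loading step, assuming the adapter was saved to a directory with `save_pretrained`. The function name `load_for_inference` and its `merge` flag are illustrative choices, while `PeftModel.from_pretrained` and `merge_and_unload` are the peft calls this sketch relies on.

```python
def load_for_inference(base_model_name, adapter_dir, merge=False):
    """Load a base model and attach a LoRA adapter stored in `adapter_dir`."""
    # Imported here so the heavy dependencies are only needed when the function runs.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(base_model_name)
    # "Apply": keep the adapter as separate weights layered on the frozen base.
    model = PeftModel.from_pretrained(base, adapter_dir)
    if merge:
        # "Merge": fold the adapter into the base weights and drop the wrapper.
        model = model.merge_and_unload()
    return model
```

Keeping the adapter separate makes it easy to swap tasks on one base model, while merging gives a plain model with no adapter overhead at inference. Merging is simplest when the base model is loaded unquantized, as in this sketch.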

The most surprising thing about PEFT is how closely the performance of these small adapters can match full fine-tuning, especially on tasks that don’t require a fundamental shift in the model’s core knowledge. The adapters learn to "steer" the pre-trained model’s existing capabilities towards the new task, rather than retraining its entire internal representation.

When you merge LoRA adapters back into the base model for deployment, you’re not just loading weights; you’re performing a matrix multiplication. The low-rank matrices (A and B) are multiplied together to form a dense update matrix, which is scaled by lora_alpha / r and added to the original weight matrix. The result is a weight matrix with the same shape as the original that incorporates the learned adaptation, so inference needs neither the PEFT library nor the separate adapter weights at runtime.
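The merge can be illustrated on a toy example in plain Python. This is a sketch of the arithmetic only, with arbitrary values and a hypothetical `merge_lora` helper; it follows the usual LoRA convention where A has shape rank x in_features and B has shape out_features x rank.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A); the result has the same shape as W."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


ALPHA, RANK = 2, 1
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base weight (3 x 3)
A = [[1.0, 2.0, 3.0]]        # rank x in_features  (1 x 3)
B = [[0.5], [0.0], [-0.5]]   # out_features x rank (3 x 1)

merged = merge_lora(W, A, B, alpha=ALPHA, rank=RANK)

# Compare the two inference paths on one input column vector.
x = [[1.0], [1.0], [1.0]]
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
adapter_path = [
    [b[0] + (ALPHA / RANK) * a[0]] for b, a in zip(base_out, adapter_out)
]
merged_path = matmul(merged, x)
print(merged_path == adapter_path)  # True
```

The merged matrix stays 3 x 3, and both paths produce the same output, which is why a merged model can be served like any ordinary checkpoint.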

The next challenge is understanding how to dynamically switch between multiple PEFT adapters for a single base model without reloading the entire model.
