QLoRA lets you fine-tune large language models on modest hardware, often a single consumer GPU, by training small LoRA adapters on top of a 4-bit quantized base model.

Let’s see QLoRA in action. Imagine we have a dataset of customer support tickets and their resolutions. We want to fine-tune Mistral 7B to automatically suggest resolutions for new tickets.

First, we need our dataset. It should be in a format like JSON, with each entry containing a "prompt" and a "completion".

[
  {
    "prompt": "User: My internet is down.\nAgent:",
    "completion": " I understand you're experiencing an internet outage. Let's try some troubleshooting steps. Have you restarted your modem and router?\n"
  },
  {
    "prompt": "User: I can't log into my account.\nAgent:",
    "completion": " I can help with that. What's the username associated with your account? I'll check for any lockouts or password reset options.\n"
  }
]
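Before training, it can help to confirm that every record carries both fields. The following is a minimal sketch, assuming the file layout shown above; the helper name `validate_records` is hypothetical and not part of any library, and the two inline records stand in for the contents of your file.

```python
import json

REQUIRED_KEYS = ("prompt", "completion")

def validate_records(records):
    """Return a list of (index, problem) pairs for malformed entries."""
    problems = []
    for i, record in enumerate(records):
        for key in REQUIRED_KEYS:
            value = record.get(key) if isinstance(record, dict) else None
            # A usable field must be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty '{key}'"))
    return problems

# Illustrative data: the second record deliberately lacks a completion.
sample = json.loads("""
[
  {"prompt": "User: My internet is down.\\nAgent:", "completion": " Have you restarted your modem and router?\\n"},
  {"prompt": "User: I can't log into my account.\\nAgent:"}
]
""")

print(validate_records(sample))
```

An empty list means every entry is usable; anything else points at the index of the record to fix.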

Now, we’ll set up our environment. We need transformers, peft, bitsandbytes, accelerate, and datasets (used later to load the JSON file).

pip install transformers peft bitsandbytes accelerate datasets

We’ll load Mistral 7B using transformers and configure bitsandbytes for 4-bit quantization. This is key to QLoRA’s memory efficiency.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "mistralai/Mistral-7B-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model.config.use_cache = False # The KV cache only helps at inference time; disable it for training
model.config.pretraining_tp = 1 # Only read by Llama-style models; harmless for Mistral

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
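To see why 4-bit loading matters, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It is a sketch under stated assumptions: the parameter count of about 7.24 billion is an approximation, and the estimate ignores quantization constants, layers that stay in higher precision, activations, and optimizer state.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

# Assumption: roughly 7.24 billion parameters for Mistral 7B.
NUM_PARAMS = 7.24e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the weights shrink from about 13.5 GiB in bfloat16 to about 3.4 GiB in 4-bit, which is what brings a 7B model within reach of a single mid-range GPU.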

Next, we configure the LoRA (Low-Rank Adaptation) parameters. LoRA injects small, trainable matrices into the existing model layers, drastically reducing the number of parameters we need to update.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8, # Rank of the update matrices. Higher rank means more parameters to train.
    lora_alpha=16, # Alpha scaling factor for LoRA.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], # Modules to apply LoRA to.
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

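This output shows how few parameters are actually trainable. With r=8 applied to all seven projection modules in each of Mistral 7B’s 32 layers, the adapters add 20,971,520 trainable parameters, roughly 0.3% of the model’s 7.2 billion. The exact "all params" figure that gets printed may depend on how your peft version counts the 4-bit weights.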
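The trainable count for a given LoRA configuration can be worked out by hand, because each adapted layer adds one matrix of shape r × in_features and one of shape out_features × r. The sketch below assumes the layer shapes from Mistral 7B’s published configuration (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers); these values are not stated in the text above, and the helper `lora_param_count` is hypothetical.

```python
# Assumed Mistral 7B shapes; taken from the published model config.
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) for each linear layer that LoRA targets.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, target_modules):
    """Sum rank * (in + out) over the targeted modules, across all layers."""
    per_layer = sum(
        rank * (MODULE_SHAPES[name][0] + MODULE_SHAPES[name][1])
        for name in target_modules
    )
    return per_layer * NUM_LAYERS

print(lora_param_count(8, list(MODULE_SHAPES)))
```

For r=8 across all seven modules this gives 20,971,520 parameters. The count scales linearly with the rank, so doubling r doubles it.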

Now, we prepare our dataset for training. We’ll use datasets to load and format it.

from datasets import load_dataset

dataset = load_dataset("json", data_files="your_dataset.json")

# Each record has "prompt" and "completion" keys, as in the JSON above.
# Adjust the key names here if your file uses different ones.
def format_and_tokenize(example):
    # Join prompt and completion into a single training string.
    text = example["prompt"] + example["completion"]
    # The Trainer needs token IDs, not raw strings; 512 is an illustrative length cap.
    return tokenizer(text, truncation=True, max_length=512)

# Drop the raw string columns so only token fields reach the data collator.
dataset = dataset.map(format_and_tokenize, remove_columns=dataset["train"].column_names)

Finally, we set up the transformers Trainer and start training.

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

output_dir = "./mistral-7b-qlora-finetuned"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    num_train_epochs=1,
    max_steps=100, # For demonstration; a positive max_steps overrides num_train_epochs
    save_steps=50,
    bf16=True, # bfloat16 mixed precision, matching bnb_4bit_compute_dtype; fp16 and bf16 cannot both be enabled
    optim="paged_adamw_8bit", # Memory efficient optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
)

trainer = Trainer(
    model=peft_model,
    train_dataset=dataset["train"],
    args=training_arguments,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
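To make the training arguments concrete, the sketch below works out the effective batch size and the learning rate at a few steps. It assumes a single GPU and approximates the shape of a linear-warmup, cosine-decay schedule; the library’s own scheduler may differ in small details such as step offsets, so treat the numbers as illustrative.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4
MAX_STEPS = 100
PEAK_LR = 2e-4
WARMUP_RATIO = 0.03
NUM_GPUS = 1  # Assumption: single-GPU run

# Gradients are accumulated over several small batches before each update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS
warmup_steps = math.ceil(MAX_STEPS * WARMUP_RATIO)

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero at MAX_STEPS."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, MAX_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {MAX_STEPS} steps: {effective_batch * MAX_STEPS}")
print(f"warmup steps: {warmup_steps}")
for step in (0, 1, 3, 50, 100):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

With these settings each optimizer update sees 16 examples, so 100 steps cover 1,600 examples. If your dataset is much smaller than that, the run will loop over it several times.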

The optim="paged_adamw_8bit" setting is a crucial detail. It selects an AdamW implementation from bitsandbytes that stores its optimizer states in 8 bits and keeps them in paged memory. When the GPU runs short of memory, for example during a spike caused by a long sequence, those states can be moved to CPU RAM and brought back when the optimizer step needs them, similar to how an operating system manages virtual memory. This helps a training run survive memory spikes that would otherwise end in an out-of-memory error.
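For a sense of scale, the optimizer state for the adapters can be estimated directly. This is a rough sketch: it assumes about 21 million trainable adapter parameters (the r=8 configuration on all seven projection modules) and ignores the small per-block constants that 8-bit states carry.

```python
def adam_state_bytes(num_trainable, bytes_per_state):
    """Adam keeps two moment estimates per trainable parameter."""
    return 2 * num_trainable * bytes_per_state

# Assumption: about 21 million trainable adapter parameters.
NUM_TRAINABLE = 20_971_520

state_32bit = adam_state_bytes(NUM_TRAINABLE, 4)
state_8bit = adam_state_bytes(NUM_TRAINABLE, 1)

print(f"32-bit Adam states: {state_32bit / 1024**2:.0f} MiB")
print(f" 8-bit Adam states: {state_8bit / 1024**2:.0f} MiB")
```

Because only the adapters are trained, the optimizer state is small either way. The larger benefit of the paged variant is likely to be its tolerance of transient memory spikes rather than the steady-state saving.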

After training, you can save the LoRA adapters.

peft_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

To use the fine-tuned model, you’d load the base model and then apply the saved adapters.

from peft import PeftModel

# Load the base model (quantized)
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Load the LoRA adapters
ft_model = PeftModel.from_pretrained(base_model, output_dir)
ft_model = ft_model.merge_and_unload() # Optional: merge for inference speed; merging into a 4-bit base can introduce small rounding differences

# Now you can use ft_model for inference
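As an illustration of that last step, here is a minimal inference sketch. The helper names `build_prompt` and `suggest_resolution` are hypothetical, and `max_new_tokens=128` is an arbitrary cap; the prompt format simply mirrors the training data shown earlier.

```python
def build_prompt(ticket_text):
    """Format a new ticket the same way as the training prompts."""
    return f"User: {ticket_text}\nAgent:"

def suggest_resolution(model, tokenizer, ticket_text, max_new_tokens=128):
    """Generate a suggested resolution for one ticket."""
    inputs = tokenizer(build_prompt(ticket_text), return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

print(repr(build_prompt("My internet is down.")))
```

With the model loaded as above, a call would look like `suggest_resolution(ft_model, tokenizer, "My internet is down.")`. Keeping the prompt format identical to the training data matters, because the adapters were only ever trained on that layout.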

The most surprising part is how effective LoRA is. It’s not just about saving memory: on many tasks it reaches quality close to full fine-tuning while updating well under 1% of the parameters. Because the base weights stay frozen, the model also tends to retain more of its pre-trained capabilities than it would under full fine-tuning, although results vary by task and dataset. The small, injected adapters learn to steer the model’s pre-trained knowledge towards your specific task without adjusting the original weights.
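The mechanism behind this is small enough to sketch in plain Python. An adapted layer computes W x + (alpha / r) · B (A x), where W is frozen and only the low-rank matrices A and B are trained. The toy sizes and values below are made up for illustration, and the helper functions are hypothetical rather than part of peft.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W is frozen, A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes: a 3x3 frozen weight with rank-1 adapters.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, -0.5, 1.0]]              # shape r x d_in
B_zero = [[0.0], [0.0], [0.0]]      # shape d_out x r, zero at the start of training
B_trained = [[0.1], [0.0], [-0.2]]  # made-up values after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x, alpha=2, r=1))     # identical to W x
print(lora_forward(W, A, B_trained, x, alpha=2, r=1))  # shifted by the low-rank update
```

Because B starts at zero, the adapted model behaves exactly like the base model at the first training step, and training only ever moves it by a low-rank correction.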

The next step would be to explore techniques for efficient inference with these fine-tuned models, such as quantization-aware inference or batching strategies.
