Merging LoRA adapters into a base model is the final step before deploying your fine-tuned model for inference.

Let’s see this in action. We’ll use peft to load a base model and a LoRA adapter, then merge them.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the LoRA adapter
adapter_path = "./my_lora_adapter" # Replace with your adapter path
model = PeftModel.from_pretrained(base_model, adapter_path)

# Merge the LoRA adapter into the base model
merged_model = model.merge_and_unload()

# Now 'merged_model' is a standard Hugging Face model with the LoRA weights merged
# You can save it for deployment
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")

print("LoRA adapter merged and model saved to ./merged_model")

Merging matters because of how LoRA (Low-Rank Adaptation) works: it injects small, trainable matrices (adapters) into specific layers of a larger, pre-trained base model, and during training only these adapter matrices are updated. You can run inference with the base model and a separate adapter loaded through peft, but every adapted layer then performs extra matrix multiplications on each forward pass, and your deployment has to carry an additional library and a second set of files. Merging takes the learned weight deltas from the LoRA matrices and adds them directly to the corresponding weights in the base model. The result is a single set of weights that produces the same outputs as the base model with the adapter applied (up to floating-point rounding), but without the indirection.
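To put "small" in perspective, the arithmetic below counts parameters for a single adapted layer. It is a rough sketch: the 4096 × 4096 shape matches the attention projections in Llama-2-7B, while the rank of 8 is an assumed value chosen only for illustration, not one read from the adapter above.

```python
# Parameter count for one adapted linear layer (illustrative values).
d_in = 4096    # input features; matches Llama-2-7B attention projections
d_out = 4096   # output features
rank = 8       # assumed LoRA rank, chosen only for illustration

base_params = d_in * d_out            # frozen base weight matrix
lora_params = rank * (d_in + d_out)   # the two small adapter matrices

ratio = lora_params / base_params

print(f"base weight:   {base_params:,} parameters")
print(f"LoRA adapter:  {lora_params:,} parameters")
print(f"adapter share: {ratio:.2%} of the base weight")
```

Under these assumptions the adapter is well under one percent of the layer it modifies, which is why adapter files are so much smaller than the base model.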

Internally, peft identifies the layers where LoRA adapters were applied (typically attention projections such as query, key, value, and output, and sometimes feed-forward layers). For each such layer, it takes the original weight matrix W of the base model and the two small matrices learned by the adapter: A (shape r × d_in) and B (shape d_out × r), where the rank r is much smaller than either dimension. Their product B @ A is the low-rank update, and LoRA scales it by a constant, lora_alpha / r in the standard formulation. Merging computes W_merged = W + scaling * (B @ A) and stores the result in place of W. The merge_and_unload() method does this for every adapted layer, then removes ("unloads") the LoRA wrappers and returns a standard transformers model.
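The merge arithmetic for one layer can be checked on toy matrices. The sketch below uses NumPy instead of torch to stay lightweight; the sizes, the lora_alpha value, and all variable names are made up for illustration, and it is not peft's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 6, 4, 2   # toy sizes; real layers are far larger
lora_alpha = 4             # hypothetical hyperparameter value
scaling = lora_alpha / r   # standard LoRA scaling factor

W = rng.normal(size=(d_out, d_in))  # frozen base weight
A = rng.normal(size=(r, d_in))      # LoRA "down" matrix
B = rng.normal(size=(d_out, r))     # LoRA "up" matrix
x = rng.normal(size=(d_in,))        # one input vector

# Unmerged: the base path and the adapter path are computed separately.
y_unmerged = W @ x + scaling * (B @ (A @ x))

# Merged: fold the low-rank update into the weight once, up front.
W_merged = W + scaling * (B @ A)
y_merged = W_merged @ x

max_abs_diff = float(np.max(np.abs(y_unmerged - y_merged)))
print("merged weight shape:", W_merged.shape)
print("max abs difference between outputs:", max_abs_diff)
```

The merged weight has the same shape as the original, so the layer needs no extra parameters or operations at inference time; the two outputs differ only by floating-point rounding.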

You control the merge by choosing which adapters take part. A PeftModel can hold several adapters at once. By default, merge_and_unload() merges the adapters that are currently active; to merge specific ones, pass their names through the adapter_names argument, though this is less common for a final deployment. The key steps are the model.merge_and_unload() call itself and the subsequent save_pretrained(), which persists the unified weights.
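As a hedged sketch of merging one adapter out of several, the helper below loads every adapter onto the same PeftModel and then names the one to merge. The function name, the adapter names, and the directory paths are hypothetical; the adapter_names argument is assumed to be available in your installed peft version, so check its documentation before relying on it.

```python
def merge_one_adapter(base_model, adapter_paths, merge_name):
    """Load several LoRA adapters onto base_model, then merge only merge_name.

    adapter_paths maps an adapter name to its directory, for example
    {"support": "./adapter_support", "legal": "./adapter_legal"}
    (hypothetical names and paths).
    """
    if merge_name not in adapter_paths:
        raise ValueError(f"unknown adapter: {merge_name!r}")

    # Deferred import: peft is only needed once the function is called.
    from peft import PeftModel

    names = list(adapter_paths)
    # The first adapter creates the PeftModel wrapper.
    model = PeftModel.from_pretrained(
        base_model, adapter_paths[names[0]], adapter_name=names[0]
    )
    # Further adapters are attached to that same wrapper.
    for name in names[1:]:
        model.load_adapter(adapter_paths[name], adapter_name=name)

    # Merge only the requested adapter, then strip the LoRA wrappers.
    return model.merge_and_unload(adapter_names=[merge_name])
```

Adapters that were loaded but not named in the call are not folded into the returned weights, so decide which adapter you want before you save the merged model.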

The most surprising thing about this process is that the merged model is not a LoRA model anymore. It’s a regular transformers model whose weights have been modified. This means you can discard the peft library entirely for inference, and load the merged model using AutoModelForCausalLM.from_pretrained("./merged_model") just like any other standard Hugging Face model. The magic of LoRA is baked directly into the weights.
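A minimal sketch of that peft-free inference path might look like the following. The helper name, the prompt, and the default token budget are illustrative assumptions; the directory is the one written by save_pretrained() earlier.

```python
def generate_from_merged(model_dir, prompt, max_new_tokens=32):
    """Load a merged checkpoint with plain transformers and generate text."""
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")

    # Deferred imports: note that peft does not appear anywhere here.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        torch_dtype=torch.float16,
        device_map="auto",
    )

    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (requires the merged checkpoint saved above):
# print(generate_from_merged("./merged_model", "LoRA merging is"))
```

Comparing a few generations from this function against the unmerged PeftModel is a simple sanity check that the merge preserved the adapter's behavior.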

The next step after merging is often quantization to reduce the model’s memory footprint for deployment.
