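MergeKit combines the weights of multiple fine-tuned models into a single model. With the methods covered here, the result has the same architecture and parameter count as its inputs; it is not a larger model, and it is not an ensemble that runs several models at inference time.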

Let’s see this in action. Imagine we have two models, model-a and model-b, both fine-tuned from the same base model (say, Llama-2 7B). We want to merge them to leverage the strengths of both.

First, we need a configuration file for mergekit, saved here as merge_config.yaml. This YAML file tells mergekit which models to use, how to combine them, and what data type the merged weights should have.

models:
  - model: ./model-a
    parameters:
      weight: 0.5
  - model: ./model-b
    parameters:
      weight: 0.5
merge_method: linear
base_model: meta-llama/Llama-2-7b-hf # Or the local path if you downloaded it
dtype: float16

In this configuration:

  • models: Lists the paths to our fine-tuned models.
  • merge_method: We’re choosing linear for a simple weighted average of the model weights. Other methods exist, such as slerp (spherical linear interpolation) and the DARE variants dare_ties and dare_linear (DARE stands for Drop And REscale), each with different properties.
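  • base_model: The original model from which model-a and model-b were fine-tuned. Delta-based methods such as task_arithmetic, ties, and the DARE variants need it to compute how each fine-tune differs from the base. A plain linear merge averages the fine-tuned weights directly and should not depend on it, but keeping it in the config makes it easy to switch methods later.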
  • dtype: Specifies the data type for the merged model; float16 is common for efficiency.

Now, we run mergekit with this configuration:

mergekit-yaml merge_config.yaml ./merged-model

This command will:

  1. Load model-a and model-b.
  2. Load the base_model to get the initial weight structure.
  3. Apply the linear merge method, a weighted average of matching tensors that uses each model’s weight parameter. Equal weights (0.5 for model-a and 0.5 for model-b) give a plain average. If we wanted to bias towards model-a, our config would look like:
models:
  - model: ./model-a
    parameters:
      weight: 0.7
  - model: ./model-b
    parameters:
      weight: 0.3
merge_method: linear
base_model: meta-llama/Llama-2-7b-hf
dtype: float16
  4. Save the resulting merged weights to the ./merged-model directory.

The magic here is that mergekit doesn’t just concatenate models. It works from the models’ shared architecture and performs mathematical operations on matching weight tensors. For linear merging, it’s essentially new_weight = (w_a * weight_a + w_b * weight_b) / (w_a + w_b), where weight_a and weight_b are the corresponding tensors from each model and w_a and w_b are the scalar weights from the config. The division reflects the default normalize behavior: weights of 0.7 and 0.3 are used as they are, while weights of 1 and 1 become 0.5 each. This creates a new set of weights that are a composite of the original fine-tunes.
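The weighted average can be sketched in a few lines of plain Python. This is a minimal illustration rather than mergekit’s implementation: linear_merge is a hypothetical helper, tensors are flattened to lists of floats, and the normalize flag is only modeled on the mergekit option of the same name.

```python
from typing import List, Sequence


def linear_merge(tensors: Sequence[Sequence[float]],
                 weights: Sequence[float],
                 normalize: bool = True) -> List[float]:
    """Weighted average of same-shaped tensors (flattened to lists here)."""
    if len(tensors) != len(weights):
        raise ValueError("need exactly one weight per tensor")
    length = len(tensors[0])
    if any(len(t) != length for t in tensors):
        raise ValueError("all tensors must have the same shape")
    # With normalize=True the weights are rescaled so they sum to 1.
    total = sum(weights) if normalize else 1.0
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return [
        sum(w * t[i] for w, t in zip(weights, tensors)) / total
        for i in range(length)
    ]


# Toy stand-ins for one tensor from model-a and the same tensor from model-b.
tensor_a = [1.0, 2.0]
tensor_b = [3.0, 6.0]
merged = linear_merge([tensor_a, tensor_b], weights=[0.7, 0.3])
print(merged)  # close to [1.6, 3.2], up to floating-point rounding
```

With weights of 0.7 and 0.3 each merged value sits closer to model-a’s value, which is the bias the second config asks for.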

The primary problem this solves is model specialization without retraining. Instead of training a massive model from scratch for every task, you can fine-tune smaller models on specific datasets and then merge them. This is significantly faster and cheaper. For example, you might have one model fine-tuned for creative writing and another for coding. Merging them could yield a model that’s good at both, or at least better than the base model at a blend of tasks. It’s akin to how a human brain integrates knowledge from different learning experiences.

The DARE methods (dare_ties and dare_linear in mergekit) offer a more nuanced approach. DARE, short for Drop And REscale, takes the difference between a fine-tuned model and its base model (the "delta"), randomly drops a fraction of the delta’s entries, and rescales the survivors so that the expected delta is unchanged. The density parameter sets the fraction that is kept. Sparsifying the deltas this way reduces interference between them, which helps preserve each fine-tune’s specific capabilities, especially when merging models with very different fine-tuning datasets. This matters because simply adding dense deltas can lead to interference and degrade performance.
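As a rough sketch of the delta idea, the snippet below implements the drop-and-rescale step that gives DARE its name (Drop And REscale). It is an illustration under simplifying assumptions, not mergekit’s code: dare_delta and apply_delta are hypothetical helpers, tensors are plain lists of floats, and density mirrors the name of the mergekit parameter that sets the fraction of delta entries to keep.

```python
import random
from typing import List, Sequence


def dare_delta(finetuned: Sequence[float],
               base: Sequence[float],
               density: float,
               rng: random.Random) -> List[float]:
    """Keep each delta entry with probability `density`, rescale kept ones by 1/density."""
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    sparse = []
    for ft, b in zip(finetuned, base):
        delta = ft - b  # what fine-tuning changed for this entry
        keep = rng.random() < density
        # Rescaling the survivors keeps the expected delta unchanged.
        sparse.append(delta / density if keep else 0.0)
    return sparse


def apply_delta(base: Sequence[float], delta: Sequence[float]) -> List[float]:
    """Add a (possibly sparsified) delta back onto the base tensor."""
    return [b + d for b, d in zip(base, delta)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
base = [0.0] * 1000
finetuned = [1.0] * 1000
sparse = dare_delta(finetuned, base, density=0.5, rng=rng)
restored = apply_delta(base, sparse)
mean_delta = sum(sparse) / len(sparse)  # stays near the original delta of 1.0
```

Each entry of sparse is either 0.0 or twice the original delta, so roughly half of the fine-tune’s changes are discarded while their average effect is retained.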

When you use a merge method like linear or one of the DARE variants, mergekit iterates through the layers and parameters of the base model. For each parameter tensor (e.g., the weight matrix for a specific attention projection or feed-forward layer), it fetches the corresponding tensor from each of the input models and performs the specified mathematical operation. The result is a new tensor that is written to the merged-model directory under the same name. This process is repeated for every tensor in the model, ensuring a complete composite.
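The tensor-by-tensor loop can be pictured with a small sketch. Everything here is illustrative: merge_state_dicts and average are hypothetical helpers, a "state dict" is a plain dictionary from parameter name to a list of floats, and the parameter names are made up. mergekit’s real pipeline handles sharded checkpoints and lazy loading, which this ignores.

```python
from typing import Callable, Dict, List, Sequence

Tensor = List[float]
StateDict = Dict[str, Tensor]


def merge_state_dicts(base: StateDict,
                      models: Sequence[StateDict],
                      merge_fn: Callable[[Tensor, List[Tensor]], Tensor]) -> StateDict:
    """Walk the base model's parameter names and merge the matching tensors."""
    merged: StateDict = {}
    for name, base_tensor in base.items():
        # Fetch the same-named tensor from every input model;
        # a missing name raises KeyError.
        candidates = [model[name] for model in models]
        merged[name] = merge_fn(base_tensor, candidates)
    return merged


def average(base_tensor: Tensor, candidates: List[Tensor]) -> Tensor:
    """Equal-weight linear merge; the base tensor is unused by this method."""
    count = len(candidates)
    return [sum(values) / count for values in zip(*candidates)]


base = {"layer.0.weight": [0.0, 0.0], "layer.0.bias": [0.0]}
model_a = {"layer.0.weight": [1.0, 2.0], "layer.0.bias": [4.0]}
model_b = {"layer.0.weight": [3.0, 4.0], "layer.0.bias": [0.0]}

merged = merge_state_dicts(base, [model_a, model_b], average)
```

Passing the merge rule in as merge_fn keeps the loop identical across methods; a delta-based method would use base_tensor where average ignores it.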

The base_model isn’t just for reference: every input must share its architecture and parameter names, because tensors are matched by name. If two fine-tuned models were derived from different base architectures, or even different versions of the same architecture, merging them directly using these methods would likely fail or produce nonsensical results because the parameter names and shapes wouldn’t match up.
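A quick pre-merge sanity check along these lines can catch incompatible inputs early. The sketch below is a hypothetical helper, not part of mergekit: find_mismatches compares dictionaries that map parameter names to shapes, and the names and shapes shown are invented for illustration.

```python
from typing import Dict, List, Tuple

Shapes = Dict[str, Tuple[int, ...]]


def find_mismatches(base: Shapes, model: Shapes) -> List[str]:
    """List every way a model's tensors differ from the base model's layout."""
    problems: List[str] = []
    for name, shape in base.items():
        if name not in model:
            problems.append(f"missing tensor: {name}")
        elif model[name] != shape:
            problems.append(
                f"shape mismatch for {name}: {model[name]} vs {shape}"
            )
    for name in model:
        if name not in base:
            problems.append(f"unexpected tensor: {name}")
    return problems


base_shapes = {"embed.weight": (32000, 4096), "norm.weight": (4096,)}
same_family = dict(base_shapes)
other_family = {"embed.weight": (32001, 4096), "extra.weight": (1,)}

compatible = find_mismatches(base_shapes, same_family)
incompatible = find_mismatches(base_shapes, other_family)
for problem in incompatible:
    print(problem)
```

An empty list means every tensor lines up by name and shape; anything else is a sign the models do not share a base.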

The next step after successfully merging models is often exploring more advanced merge techniques or evaluating the performance of the merged model on a benchmark suite.
