Axolotl can fine-tune a model once and export the result to multiple formats, meaning a single training run can produce weights for use with Hugging Face Transformers, GGUF, and AWQ.

Let’s see this in action. Imagine you have a base model you want to fine-tune, and you intend to deploy it in a few different popular formats. Instead of running separate fine-tuning jobs for each target format (which would be redundant and time-consuming), Axolotl allows you to specify multiple output formats within a single configuration.

Here’s a simplified examples/example.yml snippet demonstrating this:

base_model: "meta-llama/Llama-2-7b-hf"
model_type: LlamaForCausalLM

datasets:
  - path: "databricks/databricks-dolly-15k"
    type: "json"

optimizer:
  type: "adamw_torch"
  params:
    lr: 2e-5
    weight_decay: 0.01

train_batch_size: 2
gradient_accumulation_steps: 8
max_seq_length: 512
num_train_epochs: 1

output_dir: "./output/llama-2-7b-multi-format"

# --- Quantization used during training (4-bit, QLoRA-style) ---
quantization_config:
  type: "bitsandbytes.nn.Linear4bit"
  llm_int8_threshold: 6.0
  llm_int8_skip_modules: null
  load_in_4bit: true
  bnb_4bit_compute_dtype: "float16"
  bnb_4bit_use_double_quant: true
  bnb_4bit_quant_type: "nf4"

# --- Tokenizer special tokens ---
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

# This section is key for multi-format output
# Axolotl will automatically generate these formats after training
# based on the final trained model weights.
export_formats:
  - "hf"
  - "gguf"
  - "awq"

# Optional: GGUF specific configurations
gguf_config:
  architecture: "llama" # Matches the base model architecture
  quantization: "q4_K_M" # Example GGUF quantization type
  context_length: 2048

# Optional: AWQ specific configurations
# An empty mapping means "use the defaults". To override them, replace {}
# with a nested mapping, for example:
#   awq_config:
#     group_size: 128
#     zero_point: true
awq_config: {}

When you run Axolotl with this configuration, it performs a standard fine-tuning process. The quantization_config applies to training itself: it loads the base model in 4-bit, as in QLoRA, for memory efficiency. The crucial part is export_formats. After training completes and the fine-tuned weights are final (with QLoRA, that means the trained adapter merged back into the base model), Axolotl invokes the necessary libraries (like transformers, llama.cpp for GGUF, and autoawq for AWQ) to convert and export the fine-tuned model into each of the specified formats.
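The train-once, export-many flow described above can be sketched as a simple dispatch loop. This is a minimal illustration, not Axolotl's code: the exporter functions, the EXPORTERS table, and run_exports are hypothetical names, and each exporter is a stub that only returns the path where a real converter would write its output.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the real conversion steps. Each one only
# computes the output path; a real exporter would call transformers,
# llama.cpp's converter, or autoawq here.
def export_hf(model_dir: str, out_dir: str) -> str:
    return f"{out_dir}/hf"

def export_gguf(model_dir: str, out_dir: str) -> str:
    return f"{out_dir}/gguf"

def export_awq(model_dir: str, out_dir: str) -> str:
    return f"{out_dir}/awq"

# Assumed mapping from the names used in export_formats to exporters.
EXPORTERS: Dict[str, Callable[[str, str], str]] = {
    "hf": export_hf,
    "gguf": export_gguf,
    "awq": export_awq,
}

def run_exports(config: dict, model_dir: str) -> Dict[str, str]:
    """Run one exporter per requested format, all from the same trained weights."""
    out_dir = config["output_dir"]
    results: Dict[str, str] = {}
    for fmt in config.get("export_formats", ["hf"]):
        if fmt not in EXPORTERS:
            raise ValueError(f"unsupported export format: {fmt!r}")
        results[fmt] = EXPORTERS[fmt](model_dir, out_dir)
    return results

if __name__ == "__main__":
    cfg = {
        "output_dir": "./output/llama-2-7b-multi-format",
        "export_formats": ["hf", "gguf", "awq"],
    }
    for fmt, path in run_exports(cfg, "./output/checkpoint-final").items():
        print(fmt, "->", path)
```

The point of the table-driven design is that training is untouched by the list of formats: adding a format means adding one exporter, not another training job.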

The gguf_config and awq_config sections allow you to provide format-specific parameters. For GGUF, you might specify the architecture (e.g., llama, mistral) and a desired quantization method (e.g., q4_K_M, q5_K_S). For AWQ, you can set parameters like group_size and zero_point if you need fine-grained control over the quantization process during export. If these are omitted, Axolotl will use sensible defaults.
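One way to picture "omitted values fall back to defaults" is a merge of the user's section over a defaults table. The sketch below is an assumption about the behavior, not Axolotl's implementation: resolve_format_config is a hypothetical helper, and the default values are simply the example values from the snippet above, not documented defaults.

```python
# Assumed defaults, taken from the example values in the config above.
GGUF_DEFAULTS = {
    "architecture": "llama",
    "quantization": "q4_K_M",
    "context_length": 2048,
}
AWQ_DEFAULTS = {
    "group_size": 128,
    "zero_point": True,
}

def resolve_format_config(config: dict, key: str, defaults: dict) -> dict:
    """Overlay the user's format-specific section on top of the defaults."""
    user = config.get(key) or {}  # a missing or empty section means "all defaults"
    unknown = sorted(set(user) - set(defaults))
    if unknown:
        raise KeyError(f"unknown {key} option(s): {unknown}")
    return {**defaults, **user}

if __name__ == "__main__":
    cfg = {"gguf_config": {"quantization": "q5_K_S"}, "awq_config": {}}
    print(resolve_format_config(cfg, "gguf_config", GGUF_DEFAULTS))
    print(resolve_format_config(cfg, "awq_config", AWQ_DEFAULTS))
```

Rejecting unknown keys is a deliberate choice in this sketch: a misspelled option fails loudly instead of being silently replaced by a default.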

This multi-format export capability is incredibly powerful. It abstracts away the complexity of managing separate conversion scripts and dependencies for each target format. You train once, and Axolotl handles the subsequent packaging for diverse deployment scenarios. This means your fine-tuned model is immediately ready for use in Hugging Face transformers, for local inference with llama.cpp (which uses GGUF), or for optimized inference with AWQ-compatible backends.

What most people overlook is that the export formats depend implicitly on the training configuration. For instance, if you train with load_in_4bit: true and bnb_4bit_quant_type: "nf4", the base weights are held in 4-bit NF4 during training. When exporting to GGUF or AWQ, Axolotl will typically dequantize these weights first and then re-quantize them into the target format (GGUF’s q4_K_M, or AWQ’s own quantization scheme). This two-step process is generally fine, but the training quantization sets the starting point for the export quantization, and any rounding it introduced carries over. If you were aiming for a higher-precision GGUF export (say, q8_0) and your training used 4-bit quantization, the final q8_0 file might not match what you would get by training a full-precision model and then exporting to q8_0. Axolotl tries to make this seamless, but the underlying mechanics still involve dequantization and re-quantization.
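The effect of the dequantize-then-re-quantize path can be seen with a toy experiment. The sketch below uses plain symmetric absmax rounding over the whole tensor as a stand-in; it is not NF4, q8_0, or AWQ, all of which use block-wise or calibrated schemes, and the weights are random numbers rather than a real model. It only illustrates that an 8-bit export cannot recover detail that a 4-bit step already rounded away.

```python
import random

def quantize_dequantize(values, bits):
    """Symmetric absmax round-to-nearest quantization, returned as floats again."""
    levels = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / levels
    if scale == 0:
        return list(values)               # all zeros: nothing to quantize
    return [round(v / scale) * scale for v in values]

def mean_abs_error(reference, approx):
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)

random.seed(0)
# Toy "full-precision" weights; real layers are far larger.
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

# Path A: full precision -> 8-bit export.
direct_8bit = quantize_dequantize(weights, bits=8)

# Path B: 4-bit during training, dequantize, then 8-bit export.
via_4bit = quantize_dequantize(quantize_dequantize(weights, bits=4), bits=8)

err_direct = mean_abs_error(weights, direct_8bit)
err_via_4bit = mean_abs_error(weights, via_4bit)

print(f"mean abs error, direct 8-bit:      {err_direct:.6f}")
print(f"mean abs error, 4-bit then 8-bit:  {err_via_4bit:.6f}")
```

In this toy setup the second path ends up with roughly the error of the 4-bit step, since the 8-bit grid is fine enough to represent the already-rounded values almost exactly.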

This approach dramatically streamlines the MLOps pipeline for fine-tuned models, allowing for faster iteration and deployment across various platforms.

The next step after successfully configuring and running multi-format exports is to investigate distributed training strategies for even larger models or datasets.
