Quantized models don’t just run faster; they can fundamentally change the kind of hardware you need to run them on, opening up edge deployments that were previously impossible.

Let’s get a quantized model running. We’ll use a simple example: taking a pre-trained Hugging Face model and quantizing it using bitsandbytes for 8-bit inference.

First, we need a model. Let’s grab a small, well-known one:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Now, the magic of quantization with bitsandbytes. We’ll load the model directly in 8-bit.

from transformers import BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,  # Store the linear-layer weights as int8 (LLM.int8())
)

# Reload the model with the quantization config
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",  # Let accelerate place the layers on the available devices
)

device_map="auto" is crucial here. If you have a GPU, the model’s layers are placed on it. Don’t count on a CPU-only fallback, though: the bitsandbytes 8-bit kernels have historically required a CUDA GPU, so check what your installed version supports. Note that the 8-bit path has no compute-dtype setting; bnb_4bit_compute_dtype exists only for 4-bit loading. With load_in_8bit=True the linear-layer weights are stored as int8, while the modules that are not quantized are typically kept in float16, which is faster and more memory-efficient than full float32.

Let’s see it in action with a simple inference.

prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # Put the inputs on the same device as the model

with torch.no_grad(): # Inference doesn't need gradients
    outputs = model.generate(**inputs, max_new_tokens=50)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

This code snippet takes our prompt, tokenizes it, sends it to the model (which is now running with quantized weights), and generates a continuation. The torch.no_grad() context manager is standard practice for inference to save memory and computation by disabling gradient calculations.

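The core problem quantization solves is the memory footprint and computational cost of large neural networks. GPT-2 base has roughly 124 million parameters, so it occupies around 500 MB when loaded in float32. Storing the weights in 8 bits cuts that by up to 4x, and 4 bits by up to 8x (2x and 4x relative to float16). In practice the saving is somewhat smaller, because only the linear layers are quantized while embeddings and normalization layers stay in higher precision. This means: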

  1. More models on the same hardware: You can load multiple models into GPU memory simultaneously.
  2. Less powerful hardware: You can run models on devices with less RAM, like smaller GPUs or even CPUs with sufficient RAM.
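  3. Potentially faster inference: While memory is the primary win, reduced memory bandwidth requirements and optimized kernels for lower precision can also lead to speedups, especially on hardware with specialized integer arithmetic support. This is not guaranteed: bitsandbytes 8-bit inference is often somewhat slower than plain float16 because of the quantization overhead, so measure on your own hardware.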
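
The size figures above follow from simple arithmetic, sketched below. The parameter count is an approximation for GPT-2 base, and the helper name weight_memory_mb is made up for this illustration. The numbers are a lower bound for a quantized model, since they assume every weight is stored at the given bit width.

```python
def weight_memory_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# Approximate parameter count of GPT-2 base (an assumption for illustration).
GPT2_PARAMS = 124_000_000

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_mb(GPT2_PARAMS, bits):6.1f} MB")
```

To see what a loaded model actually occupies, Transformers models expose model.get_memory_footprint(), which returns the size in bytes.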

The bitsandbytes library is a popular choice because it integrates easily with Hugging Face Transformers. Its 8-bit mode implements LLM.int8(): weights are mapped from full precision (usually float16 or float32) to int8 using absmax scaling, with a separate scaling factor per row or column, while the few activation dimensions that contain large outliers are handled in float16. Its 4-bit mode uses non-uniform data types such as NF4. During the forward pass, the 8-bit path performs most of the matrix multiplication in int8 and rescales the result to float16, whereas the 4-bit path dequantizes the weights on the fly to the compute dtype. The load_in_8bit=True flag automates this entire process when loading the model.
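
To make the mapping from floats to int8 concrete, here is a minimal pure-Python sketch of absmax quantization with a single shared scale. It is an illustration of the general idea, not the bitsandbytes implementation, and the function names are invented for this example.

```python
def quantize_absmax_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 0.01, -0.27]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)

# The rounding error per weight is at most half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"scale={scale:.6f}, max_error={max_error:.6f}")
```

The largest-magnitude weight always lands on code 127 or -127, and every other weight is rounded to the nearest multiple of the scale.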

The device_map="auto" argument is a powerful feature from Hugging Face’s accelerate library. It distributes the model’s layers across the available devices to make the best use of their memory. For a quantized model, this means a larger model can fit on a single GPU or be split across multiple GPUs. Offloading part of an 8-bit model to the CPU is not automatic: it generally requires setting llm_int8_enable_fp32_cpu_offload=True in the BitsAndBytesConfig and supplying a device map, and the offloaded modules are then kept in float32 rather than int8.
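
The idea of fitting layers into per-device memory budgets can be sketched with a toy placement routine. This is a hypothetical helper written for illustration, not accelerate’s algorithm; it fills devices in order, which is closer in spirit to device_map="sequential" than to the balanced split, and the layer sizes and budgets are made-up numbers.

```python
def toy_device_map(layer_sizes_mb, device_budgets_mb):
    """Assign layers, in order, to the first device that still has room.

    Hypothetical helper for illustration only; not accelerate's algorithm.
    """
    remaining = dict(device_budgets_mb)  # device -> free memory in MB
    devices = list(remaining)            # dict order = placement priority
    placement = {}
    current = 0
    for name, size in layer_sizes_mb.items():
        # Move on to the next device once the current one is full.
        while current < len(devices) and remaining[devices[current]] < size:
            current += 1
        if current == len(devices):
            raise MemoryError(f"no device has room for {name}")
        device = devices[current]
        remaining[device] -= size
        placement[name] = device
    return placement


budgets = {0: 250, 1: 250, "cpu": 1000}                 # two small GPUs, then CPU
full_precision = {f"block.{i}": 100 for i in range(6)}  # six 100 MB layers
quantized = {f"block.{i}": 50 for i in range(6)}        # same layers at half size

print(toy_device_map(full_precision, budgets))
print(toy_device_map(quantized, budgets))
```

With the larger layers the last two blocks end up on the CPU; with the halved sizes everything fits on the two GPUs, which is the practical effect quantization has on placement.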

A BitsAndBytesConfig holds the parameters that control how the quantization is performed. load_in_8bit=True is the most common choice, but load_in_4bit=True is also available for even greater memory savings, at the cost of a potentially small hit to accuracy. For 4-bit loading, bnb_4bit_compute_dtype specifies the data type used for the matrix multiplications, with torch.float16 being a good balance of speed and precision for many tasks. There is no 8-bit counterpart to this parameter; the 8-bit path is tuned through options such as llm_int8_threshold instead.
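
For reference, a 4-bit configuration might look like the sketch below. The helper name build_4bit_config is invented for this example, and the specific choices (NF4, float16 compute, double quantization) are common settings rather than requirements. The imports sit inside the function only so that defining the helper does not require the GPU stack to be installed.

```python
def build_4bit_config():
    """Return a BitsAndBytesConfig for 4-bit NF4 loading."""
    # Imported lazily so that defining this helper does not need torch/transformers.
    import torch
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
        bnb_4bit_compute_dtype=torch.float16,  # dtype used for the matmuls
        bnb_4bit_use_double_quant=True,        # also quantize the scaling constants
    )
```

You would pass the result as quantization_config=build_4bit_config() to AutoModelForCausalLM.from_pretrained, exactly as with the 8-bit config earlier.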

The device_map parameter is more than just a convenience; it’s essential for managing large models. For instance, if you have a model that’s too large for a single GPU’s VRAM, device_map="auto" will split it across the GPUs you have and try to balance the load. You can also specify device_map manually for fine-grained control, like device_map={"": 0} to force everything onto GPU 0. device_map={"": "cpu"} requests CPU-only execution, but whether a bitsandbytes-quantized model can actually run there depends on your bitsandbytes version, since its kernels were originally CUDA-only.

The most surprising aspect of quantization, especially 4-bit, is how little accuracy is lost for many tasks. The key is how the quantization error is managed. Instead of mapping every weight with one global scale, techniques like "block-wise quantization" are used: weights are divided into small blocks, and each block is quantized independently with its own scaling factor. A single outlier then only coarsens the resolution of its own block rather than the whole tensor, which limits the cumulative error across the model. Absmax scaling is the usual way to pick that factor: the largest absolute value in a block is found, and the scale is chosen so that this value maps to the edge of the 4-bit (or 8-bit) range. Applying these per-block scales when the weights are dequantized during inference is key to maintaining accuracy.
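
The effect of per-block scales can be seen in a small experiment. The sketch below uses a plain symmetric integer grid with codes in [-7, 7] as a stand-in for 4-bit storage; it is not NF4 and not the bitsandbytes kernel, and the weights are made-up values with one deliberate outlier.

```python
def quantize_blockwise(weights, block_size, levels=7):
    """Absmax-quantize each block separately and return the dequantized values.

    levels=7 gives signed 4-bit-style codes in [-7, 7].
    """
    restored = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / levels if absmax > 0 else 1.0
        restored.extend(round(w / scale) * scale for w in block)
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored weights."""
    return sum(abs(o - r) for o, r in zip(original, restored)) / len(original)


# Seven small weights and one outlier in the last position.
weights = [0.02, -0.03, 0.01, 0.04, -0.02, 0.03, -0.01, 2.0]

global_err = mean_abs_error(weights, quantize_blockwise(weights, block_size=8))
block_err = mean_abs_error(weights, quantize_blockwise(weights, block_size=4))
print(f"one scale for all weights: {global_err:.4f}")
print(f"one scale per block of 4:  {block_err:.4f}")
```

With a single scale, the outlier forces every small weight to round to zero. With blocks of four, only the block that contains the outlier is affected, so the average error drops.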

The next hurdle is often deploying these quantized models in a production environment, managing their loading and inference endpoints efficiently.
