Format Chat Templates Correctly for Instruction Fine-Tuning (2026)


The most surprising thing about chat templates is that a model that appears to follow instructions can still misread them completely if the template that structures its input is even slightly off.

Imagine you’re training a model to follow instructions. You give it examples like:

User: "Translate 'hello' to French." Assistant: "Bonjour."

But the model, without a proper template, might see "User: Translate 'hello' to French. Assistant: Bonjour." as a single, long utterance. It doesn’t inherently know that "User:" signals the start of a prompt and "Assistant:" signals where it should generate its response. This is where chat templates come in. They are the hidden language that tells the model how to parse these multi-turn conversations and identify the distinct roles of user and assistant.

Let’s see this in action with a simplified, illustrative template. The <|USER|> and <|ASSISTANT|> markers used here are placeholders; real model families, such as Llama or Mistral, each define their own special tokens.

{
  "chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ '<|USER|>' + message['content'] }}{% elif message['role'] == 'assistant' %}{{ '<|ASSISTANT|>' + message['content'] }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|ASSISTANT|>' }}{% endif %}"
}

This JSON snippet, or something similar, is often what defines your chat template. In Hugging Face’s transformers, the template is a Jinja string stored alongside the tokenizer configuration. It’s not a direct instruction to the model; rather, it’s a configuration that gets applied to your training data. When you prepare your dataset, the tokenizer’s apply_chat_template method renders each conversation through this template, serializing the conversation history into a single string that the model can process.

The core idea is to inject special tokens or strings that clearly delineate turns and roles. In the example above, <|USER|> marks the beginning of a user’s turn, and <|ASSISTANT|> marks the beginning of the assistant’s turn. The template iterates through your conversation messages and constructs a single string.

Here’s how a simple conversation would be formatted by this template:

Original Conversation:

[
  {"role": "user", "content": "What is the capital of France?"},
  {"role": "assistant", "content": "The capital of France is Paris."}
]

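Formatted String (as seen by the model during training):

<|USER|>What is the capital of France?<|ASSISTANT|>The capital of France is Paris.

Notice that each turn begins with its role marker. The <|ASSISTANT|> marker is crucial: it tells the model that the tokens that follow are its response. Training teaches the model to predict the tokens after each <|ASSISTANT|> marker, given the preceding context. At inference time, you leave out the reply and end the prompt with a bare <|ASSISTANT|> marker so that the model continues from there. In transformers, passing add_generation_prompt=True to apply_chat_template requests this trailing marker, provided the template supports it. In practice, most templates also close each assistant turn with an end-of-sequence token so that the model learns where to stop.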
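The rendering logic can be mirrored in a few lines of plain Python. This is a minimal sketch rather than the library's implementation: `render_chat` is a hypothetical helper, the markers are the illustrative ones from the example template, and the `add_generation_prompt` flag is named after the argument of the same name in transformers' `apply_chat_template`.

```python
from typing import Dict, List

# Illustrative role markers from the example template; real models define their own.
ROLE_MARKERS = {"user": "<|USER|>", "assistant": "<|ASSISTANT|>"}


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Serialize a conversation into one string, one marker per turn."""
    parts = []
    for message in messages:
        marker = ROLE_MARKERS.get(message["role"])
        if marker is None:
            # Mirrors the Jinja template: roles it does not know are skipped.
            continue
        parts.append(marker + message["content"])
    if add_generation_prompt:
        # A bare assistant marker cues the model to write the next reply.
        parts.append(ROLE_MARKERS["assistant"])
    return "".join(parts)


conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

full_text = render_chat(conversation)  # complete exchange, as used for training
prompt_only = render_chat(conversation[:1], add_generation_prompt=True)  # inference prompt
print(full_text)
print(prompt_only)
```

The complete exchange ends with the assistant's reply, while the inference prompt ends with the bare marker that the model is expected to continue from.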

The exact format of these special tokens (<|USER|>, <|ASSISTANT|>, <s>, </s>, [INST], [/INST]) can vary significantly between model families. For instance, models like Mistral might use <s>[INST] {user_message} [/INST] {assistant_message} </s>. The critical thing is consistency. If your training data uses one format and your inference prompt uses another, your model will likely fail to generate coherent responses.
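To make the contrast concrete, here is a sketch of a renderer for the [INST] pattern quoted above. `render_inst_style` is a hypothetical helper written only to reproduce that pattern; exact whitespace and token placement differ between model releases, so the tokenizer's own template should remain the authority.

```python
from typing import Dict, List


def render_inst_style(messages: List[Dict[str, str]]) -> str:
    """Render alternating user/assistant turns in the [INST] pattern."""
    text = "<s>"  # assumption: the sequence opens with a single <s> token
    for message in messages:
        if message["role"] == "user":
            text += "[INST] " + message["content"] + " [/INST]"
        elif message["role"] == "assistant":
            text += " " + message["content"] + " </s>"
        else:
            # Fail loudly instead of silently dropping an unknown role.
            raise ValueError(f"unsupported role: {message['role']!r}")
    return text


example = [
    {"role": "user", "content": "Translate 'hello' to French."},
    {"role": "assistant", "content": "Bonjour."},
]
formatted = render_inst_style(example)
print(formatted)
```

The same conversation rendered with <|USER|>/<|ASSISTANT|> markers produces a completely different string, which is why mixing formats between training and inference tends to break the model.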

Templates matter even more in multi-turn dialogues, where the meaning of the latest message depends on what came before. Consider a dialogue in which the user corrects the assistant.

Conversation:

[
  {"role": "user", "content": "List three primary colors."},
  {"role": "assistant", "content": "Red, Green, and Blue."},
  {"role": "user", "content": "Actually, I meant the additive primary colors for light."},
  {"role": "assistant", "content": "Ah, in that case, the additive primary colors are Red, Green, and Blue."}
]

Formatted String (using the <|USER|>/<|ASSISTANT|> template):

<|USER|>List three primary colors.<|ASSISTANT|>Red, Green, and Blue.<|USER|>Actually, I meant the additive primary colors for light.<|ASSISTANT|>Ah, in that case, the additive primary colors are Red, Green, and Blue.

The template ensures that the model sees the entire history, including the user’s correction, before the final assistant response. No trailing <|ASSISTANT|> marker is needed here, because the conversation already ends with the assistant’s reply. This allows the model to learn context-dependent behavior. Without the template, the model might just see a long string and struggle to understand the intent behind the latest turn.
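One way to picture what the model learns from a multi-turn conversation is to split it into (prompt, target) pairs, one per assistant turn. The sketch below assumes the illustrative <|USER|>/<|ASSISTANT|> markers; `prompt_target_pairs` is a hypothetical helper, not part of any library.

```python
from typing import Dict, List, Tuple

USER, ASSISTANT = "<|USER|>", "<|ASSISTANT|>"


def prompt_target_pairs(messages: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """For each assistant turn, pair the rendered history with the reply to predict."""
    pairs = []
    history = ""
    for message in messages:
        if message["role"] == "user":
            history += USER + message["content"]
        elif message["role"] == "assistant":
            # The prompt ends with a bare assistant marker; the reply is the target.
            pairs.append((history + ASSISTANT, message["content"]))
            history += ASSISTANT + message["content"]
    return pairs


dialogue = [
    {"role": "user", "content": "List three primary colors."},
    {"role": "assistant", "content": "Red, Green, and Blue."},
    {"role": "user", "content": "Actually, I meant the additive primary colors for light."},
    {"role": "assistant", "content": "Ah, in that case, the additive primary colors are Red, Green, and Blue."},
]

pairs = prompt_target_pairs(dialogue)
for prompt, target in pairs:
    print(repr(prompt), "->", repr(target))
```

The second pair's prompt contains the first answer and the user's correction, so the target reply is conditioned on the full history.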

When you’re fine-tuning, you’re not just teaching the model what to say, but also how to interpret the conversational structure. The template is the mechanism that provides this structure. For example, if you’re using a model that expects a specific start-of-sequence token, like <s>, your formatted string needs to contain it exactly once. Two common mistakes are omitting <s> at the very beginning of the formatted string and adding it twice, once in the template and again when the tokenizer inserts its special tokens. Either way, the model sees an input that differs from what it saw during pre-training, which can degrade its output.
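A small guard can catch both a missing and a duplicated start-of-sequence token before the text reaches the tokenizer. This is a sketch under assumptions: `ensure_single_bos` is a hypothetical helper, and "<s>" is used as the default only because it is the token named above; in practice the value would come from the tokenizer (for example its `bos_token` attribute).

```python
def ensure_single_bos(text: str, bos_token: str = "<s>") -> str:
    """Return text that starts with exactly one start-of-sequence token."""
    if not bos_token:
        # Some tokenizers define no BOS token; leave the text untouched.
        return text
    # Strip any leading copies, then add a single one back.
    while text.startswith(bos_token):
        text = text[len(bos_token):]
    return bos_token + text


missing_bos = ensure_single_bos("[INST] Hi [/INST]")
doubled_bos = ensure_single_bos("<s><s>[INST] Hi [/INST]")
print(missing_bos)
print(doubled_bos)
```

If the formatted string already contains the token, tokenizing it with `add_special_tokens=False` in transformers is the usual way to keep the tokenizer from prepending another one.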

The specific format of the chat template is often dictated by the pre-trained model you are fine-tuning. If you’re using a model from Hugging Face, you can usually retrieve its default chat template from the tokenizer (not the model) via tokenizer.chat_template. It’s imperative to use this exact template, or one that is functionally equivalent, for both your training data preparation and your inference prompts. The template defines the "grammar" of your conversational data.

You’ll often find that the template logic is applied during data loading and tokenization. Libraries like datasets and transformers provide tools to apply these templates before feeding data to the model. For inference, you’ll use the same template to format your prompt before sending it to the model for generation.
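The sketch below shows one way to route both training text and inference prompts through the same template. The two helper functions are hypothetical; they call `apply_chat_template` with the `tokenize` and `add_generation_prompt` arguments that transformers tokenizers accept. `StubTokenizer` is a stand-in invented here so the example runs without downloading a model; with transformers you would pass a tokenizer loaded through `AutoTokenizer.from_pretrained` instead.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_training_text(tokenizer, messages: List[Message]) -> str:
    """Render a complete conversation (ending with an assistant reply) for training."""
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)


def build_inference_prompt(tokenizer, messages: List[Message]) -> str:
    """Render a conversation ending with a user turn and cue the assistant to respond."""
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)


class StubTokenizer:
    """Stand-in for a real tokenizer, using the illustrative markers."""

    def apply_chat_template(self, conversation, tokenize=False, add_generation_prompt=False):
        markers = {"user": "<|USER|>", "assistant": "<|ASSISTANT|>"}
        text = "".join(markers[m["role"]] + m["content"] for m in conversation)
        return text + (markers["assistant"] if add_generation_prompt else "")


tokenizer = StubTokenizer()
question = [{"role": "user", "content": "What is the capital of France?"}]
answer = [{"role": "assistant", "content": "The capital of France is Paris."}]

training_text = build_training_text(tokenizer, question + answer)
inference_prompt = build_inference_prompt(tokenizer, question)

# With a shared template, the training text should begin with the same
# prompt the model will see at inference time.
print(training_text.startswith(inference_prompt))
```

A prefix check like the last line is a cheap sanity test to run on a few samples whenever the data pipeline or the template changes.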

One common pitfall is a template that doesn’t correctly handle empty messages or system prompts. Some models have a dedicated "system" role. If your template only accounts for "user" and "assistant" and you include a system message, it might be silently dropped or misinterpreted, leading to a loss of crucial context for the model. Always verify that your template logic explicitly includes and correctly formats all possible roles present in your training data.
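A quick audit of the dataset can surface these problems before training starts. The helpers below are hypothetical and assume each conversation is a list of dictionaries with "role" and "content" keys, as in the examples above; the set of supported roles is an assumption that you would adjust to match your own template.

```python
from typing import Dict, List, Set, Tuple

Conversation = List[Dict[str, str]]


def unsupported_roles(conversations: List[Conversation], supported: Set[str]) -> Set[str]:
    """Return every role that appears in the data but not in the template."""
    seen = {message["role"] for conv in conversations for message in conv}
    return seen - supported


def empty_messages(conversations: List[Conversation]) -> List[Tuple[int, int]]:
    """Return (conversation index, message index) for blank or missing content."""
    return [
        (i, j)
        for i, conv in enumerate(conversations)
        for j, message in enumerate(conv)
        if not (message.get("content") or "").strip()
    ]


dataset = [
    [
        {"role": "system", "content": "Answer in one sentence."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ],
    [
        {"role": "user", "content": "   "},
        {"role": "assistant", "content": "Could you repeat the question?"},
    ],
]

# Assumption: the template only knows these two roles, like the example template.
template_roles = {"user", "assistant"}
missing = unsupported_roles(dataset, template_roles)
blanks = empty_messages(dataset)
print(missing, blanks)
```

Here the audit flags the "system" role, which the two-role template would drop, and the whitespace-only user message in the second conversation.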

The next step after mastering chat templates is understanding how to effectively sample from the model’s output distribution during inference, especially when dealing with temperature and top-p sampling.