Supervised fine-tuning, for all its power, often trips up on the most basic aspect: how you structure the data.
Let’s see it in action. Imagine you have a simple chat history, a back-and-forth between a user and an assistant.
[
  {
    "role": "user",
    "content": "What's the weather like in London today?"
  },
  {
    "role": "assistant",
    "content": "The weather in London today is partly cloudy with a high of 15°C."
  }
]
This is the foundational structure. You’ve got turns, each with a role (who said it) and content (what they said). For supervised fine-tuning, the model learns to predict the content of the assistant’s turn, given the preceding turns.
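Before training, it can help to check that every example actually follows this shape. The following is a minimal sketch in Python, not part of any particular fine-tuning library: the function name validate_conversation and the set of allowed roles are assumptions made for illustration (some formats also accept a system role, which is included here).

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role vocabulary


def validate_conversation(messages):
    """Return a list of problems found in one conversation (empty if none)."""
    if not isinstance(messages, list) or not messages:
        return ["conversation must be a non-empty list of turns"]
    problems = []
    for index, turn in enumerate(messages):
        if not isinstance(turn, dict):
            problems.append(f"turn {index}: expected an object with role and content")
            continue
        if turn.get("role") not in ALLOWED_ROLES:
            problems.append(f"turn {index}: unknown role {turn.get('role')!r}")
        content = turn.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"turn {index}: content must be a non-empty string")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole dataset in a single pass.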
The real power comes from how you assemble these turns into conversations. A single fine-tuning example isn’t just one question and one answer; it’s a sequence of interactions.
[
  {
    "role": "user",
    "content": "What's the weather like in London today?"
  },
  {
    "role": "assistant",
    "content": "The weather in London today is partly cloudy with a high of 15°C."
  },
  {
    "role": "user",
    "content": "And what about Paris?"
  },
  {
    "role": "assistant",
    "content": "In Paris, it's sunny with a high of 20°C."
  }
]
Here, the model is trained to predict "In Paris, it’s sunny with a high of 20°C." after seeing the entire preceding exchange. This allows it to learn context, follow-up questions, and the flow of a dialogue.
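One way to picture this is to split a conversation into (context, target) pairs, one per assistant turn. The sketch below is an illustration under that assumption; the helper name to_training_pairs is hypothetical, and some pipelines supervise only the final assistant turn rather than every one.

```python
def to_training_pairs(messages):
    """Return (context, target) pairs, one for each assistant turn."""
    pairs = []
    for index, turn in enumerate(messages):
        if turn["role"] == "assistant":
            context = messages[:index]  # every turn before this reply
            target = turn["content"]    # the text the model should produce
            pairs.append((context, target))
    return pairs


conversation = [
    {"role": "user", "content": "What's the weather like in London today?"},
    {"role": "assistant", "content": "The weather in London today is partly cloudy with a high of 15°C."},
    {"role": "user", "content": "And what about Paris?"},
    {"role": "assistant", "content": "In Paris, it's sunny with a high of 20°C."},
]

pairs = to_training_pairs(conversation)
```

For the Paris reply, the context holds all three earlier turns, so the model can only resolve "And what about Paris?" by looking back at the London question.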
The problem this solves is generalization to multi-turn dialogue. A model trained only on single question-answer pairs might be good at answering isolated queries, but it tends to struggle when a conversation requires remembering previous turns, understanding nuances introduced mid-dialogue, or maintaining a consistent persona. By feeding it full conversation histories, you teach it the dynamics of interaction.
Internally, a transformer-based model processes these sequences with causal attention: an attention mask lets each token "see" only the tokens that precede it in the same conversation. The training objective is next-token prediction, minimizing the cross-entropy between the model’s predicted distribution and the actual next token, with the loss typically computed only on the tokens of the assistant’s responses.
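To make the masking concrete, here is a toy sketch in plain Python. It assumes a whitespace tokenizer and a made-up "<role>" marker in front of each turn; real tokenizers and chat templates differ, and many also supervise an end-of-turn token, which is omitted here. The names causal_mask and build_tokens_and_loss_mask are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def build_tokens_and_loss_mask(messages):
    """Flatten a conversation into tokens plus a 0/1 mask of supervised tokens.

    Only tokens inside assistant content get a 1, so only they would
    contribute to the training loss.
    """
    tokens, loss_mask = [], []
    for message in messages:
        tokens.append(f"<{message['role']}>")  # made-up role marker
        loss_mask.append(0)                    # markers are not supervised here
        is_assistant = message["role"] == "assistant"
        for word in message["content"].split():  # toy whitespace tokenizer
            tokens.append(word)
            loss_mask.append(1 if is_assistant else 0)
    return tokens, loss_mask


tokens, loss_mask = build_tokens_and_loss_mask([
    {"role": "user", "content": "hi there"},
    {"role": "assistant", "content": "hello"},
])
mask = causal_mask(len(tokens))
```

In this example only the final token, "hello", is supervised, while the causal mask still lets it attend to every user token before it.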
The exact levers you control are:
- Conversation Length: Longer conversations expose the model to more complex state tracking and multi-turn reasoning.
- Turn Structure: The precise wording and intent within each user and assistant turn shape what the model learns to generate.
- System Prompts: Some formats allow for an initial system role message that sets the overall behavior or persona of the assistant.
[
  {
    "role": "system",
    "content": "You are a helpful and concise weather assistant."
  },
  {
    "role": "user",
    "content": "What's the weather like in London today?"
  },
  {
    "role": "assistant",
    "content": "London: Partly cloudy, 15°C."
  }
]
This system prompt, when included, influences the assistant’s responses throughout the conversation, teaching it to adopt that specific persona.
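If a dataset was collected without system turns, one option is to prepend the same system message to every conversation before training. This is a minimal sketch under that assumption; with_system_prompt is a hypothetical helper, and whether a given format accepts a system role depends on the format.

```python
def with_system_prompt(messages, system_prompt):
    """Return a copy of the conversation that starts with a system turn."""
    if messages and messages[0]["role"] == "system":
        return list(messages)  # already has one; leave it unchanged
    return [{"role": "system", "content": system_prompt}] + list(messages)


example = with_system_prompt(
    [
        {"role": "user", "content": "What's the weather like in London today?"},
        {"role": "assistant", "content": "London: Partly cloudy, 15°C."},
    ],
    "You are a helpful and concise weather assistant.",
)
```

The function returns a new list instead of modifying its input, so the raw dataset stays untouched if you want to try different personas.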
When preparing data, it’s crucial to ensure that the assistant’s response is the only thing being predicted. If your dataset format includes metadata or other non-response information within the assistant’s turn, the model might learn to generate that extraneous information instead of the actual conversational reply. For instance, if an assistant’s turn looks like this:
{
  "role": "assistant",
  "content": "Here's the info you asked for: The weather in London today is partly cloudy with a high of 15°C. | SOURCE: WeatherAPI_v2"
}
and you train it to predict the entire content, it might start generating the | SOURCE: ... part. The standard practice is to strip such metadata or ensure it’s part of a separate field that isn’t fed into the prediction objective.
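As a sketch of that clean-up step, the snippet below moves a trailing "| SOURCE: ..." tag out of the assistant's content and into a separate field. It assumes the tag always appears at the end of the content in exactly this form; the helper names split_source_tag and clean_turn and the "source" field are hypothetical, and a real dataset may need a different pattern.

```python
import re

# Matches a trailing "| SOURCE: <name>" tag, as in the example above.
SOURCE_TAG = re.compile(r"\s*\|\s*SOURCE:\s*(?P<source>.+?)\s*$")


def split_source_tag(content):
    """Return (text_without_tag, source_or_None)."""
    match = SOURCE_TAG.search(content)
    if match is None:
        return content, None
    return content[:match.start()], match.group("source")


def clean_turn(turn):
    """Strip the tag from assistant turns; copy other turns unchanged."""
    if turn["role"] != "assistant":
        return dict(turn)
    text, source = split_source_tag(turn["content"])
    cleaned = {"role": "assistant", "content": text}
    if source is not None:
        cleaned["source"] = source  # kept for bookkeeping, not for the training text
    return cleaned
```

Only the content field would then be rendered into the training text, while the source field remains available for auditing.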
The next hurdle is handling multi-modal inputs or complex structured outputs within conversations.