Set Up TRL Trainer for Supervised and Preference Fine-Tuning (2026)

TRL’s SFTTrainer and PPOTrainer handle supervised fine-tuning and RL-based preference fine-tuning, respectively, so you can do both without switching between entirely different libraries.

Here’s a supervised fine-tuning (SFT) run with TRL’s SFTTrainer, using a small slice of the IMDB dataset:

from datasets import load_dataset
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

# Load a small pre-trained model and tokenizer
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # Set pad token for CausalLM

# Load a small slice of IMDB as a stand-in dataset
dataset = load_dataset("imdb", split="train[:1%]") # 1% of the training split, for demonstration

# Define training arguments
training_args = TrainingArguments(
    output_dir="./sft_output",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    learning_rate=2e-4,
    logging_steps=10,
    save_steps=500,
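    # No evaluation for this simple example. "no" is already the default, and leaving the
    # argument out avoids the evaluation_strategy/eval_strategy rename across transformers versions.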
)

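# Initialize SFTTrainer
# (These keyword names follow older TRL releases. Newer ones move dataset_text_field and the
# sequence-length limit into SFTConfig and take the tokenizer as processing_class.)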
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text", # The column in the dataset containing the text
    max_seq_length=512,
)

# Train the model
trainer.train()

print("SFT Training Complete!")

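Now, let’s look at preference fine-tuning with PPO, using TRL’s PPOTrainer. PPO itself needs only prompts and a reward signal for each generated response; ranked response pairs like the chosen/rejected columns below are what you would typically use to train the reward model.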

from datasets import Dataset
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Note: this uses the classic PPOTrainer API (ppo_trainer.step), as found in TRL releases
# up to roughly 0.11. Later releases restructured PPOTrainer, so pin the version or adapt.

# Load a small pre-trained policy model and tokenizer for generation.
# The classic PPOTrainer expects a value head on top of the language model.
model_name = "gpt2"
generation_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
generation_tokenizer = AutoTokenizer.from_pretrained(model_name)
generation_tokenizer.pad_token = generation_tokenizer.eos_token

# Load a pre-trained reward model and its tokenizer
# For demonstration, we'll use a sentiment classifier as a stand-in. In practice, this would be
# a reward model trained on preference data for your task.
reward_model_name = "lvwerra/distilbert-imdb" # Sentiment classifier used in TRL's own PPO examples
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name)
reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name)
reward_model.eval()

# Dummy dataset for preference tuning (prompt and responses)
# In a real scenario, this would come from a dataset like Anthropic's HH-RLHF or similar.
# The PPO loop below only consumes the prompts; chosen/rejected pairs are reward-model training data.
data = {
    "prompt": ["Write a short poem about the sea.", "What is the capital of France?"],
    "chosen_response": ["The ocean vast, a deep blue hue,\nWaves crash and roar, forever new.", "The capital of France is Paris."],
    "rejected_response": ["The sea is wet and very big.", "It's a city in France."]
}
dataset = Dataset.from_dict(data) # load_dataset has no from_dict; Dataset does

# PPO Configuration
config = PPOConfig(
    model_name=model_name,
    learning_rate=1.49e-5,
    batch_size=1,
    mini_batch_size=1,
    gradient_accumulation_steps=1,
    optimize_cuda_cache=True, # Later releases deprecate this in favor of optimize_device_cache
)

# Initialize PPOTrainer. The reward model is not passed in: rewards are computed in the loop below.
# With ref_model=None, the trainer creates a frozen copy of the policy as the reference model.
ppo_trainer = PPOTrainer(config, generation_model, ref_model=None, tokenizer=generation_tokenizer)

# Sampling settings for the policy's generations
generation_kwargs = {
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
    "temperature": 0.7,
    "max_new_tokens": 20,
    "pad_token_id": generation_tokenizer.eos_token_id,
}

# Generation and reward calculation loop (one prompt per step, matching batch_size=1)
for step, example in enumerate(dataset):
    prompt = example["prompt"]

    # Tokenize the prompt into a 1-D tensor, which is what step() expects
    query_tensor = generation_tokenizer(prompt, return_tensors="pt")["input_ids"][0]
    query_tensor = query_tensor.to(ppo_trainer.accelerator.device)

    # Generate a response from the current policy, excluding the prompt tokens
    response_tensor = ppo_trainer.generate([query_tensor], return_prompt=False, **generation_kwargs)[0]
    response_text = generation_tokenizer.decode(response_tensor, skip_special_tokens=True)

    # Score prompt + response with the reward model
    inputs = reward_tokenizer(prompt + " " + response_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    reward = logits[0, 1] # Logit of the positive class, used as a scalar reward

    # Compute PPO optimization step (lists of per-example tensors)
    stats = ppo_trainer.step([query_tensor], [response_tensor], [reward])
    batch = {"query": [prompt], "response": [response_text]}
    ppo_trainer.log_stats(stats, batch, [reward])

print("PPO Training Complete!")

The core idea behind TRL is to provide a unified interface. For supervised fine-tuning, you’re essentially teaching the model to predict the next token given a sequence (your training data). SFTTrainer handles the data loading, tokenization, loss calculation (cross-entropy), and optimization loop.
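To make the next-token objective concrete, here is a minimal pure-Python sketch of the cross-entropy loss described above. The function name and the toy numbers are illustrative assumptions; SFTTrainer computes the same quantity on batched tensors inside the model's forward pass, not with code like this.

```python
import math

def next_token_cross_entropy(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits: one list of vocabulary scores per position in the sequence.
    token_ids: the training sequence as integer token ids (same length as logits).
    """
    losses = []
    # Position t predicts token t+1, so the last position has no target.
    for scores, target in zip(logits[:-1], token_ids[1:]):
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability of the correct next token.
        losses.append(log_norm - scores[target])
    return sum(losses) / len(losses)

# Toy setup (made-up numbers): a 3-token vocabulary and a 3-token sequence.
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0]] * 3
confident_logits = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]

uniform_loss = next_token_cross_entropy(uniform_logits, token_ids)
confident_loss = next_token_cross_entropy(confident_logits, token_ids)
print(round(uniform_loss, 4))  # 1.0986, i.e. ln(3) for a uniform guess over 3 tokens
print(confident_loss < uniform_loss)  # True
```

Training lowers this number by shifting probability mass toward the token that actually comes next in your data.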

Preference fine-tuning, on the other hand, uses Reinforcement Learning (RL) to align the model’s outputs with human preferences. PPOTrainer is built for this. It optimizes a language model (the "policy") against scores produced by a reward model. The process involves:

1. Generating responses: The policy model generates responses to prompts.
2. Calculating rewards: A separate reward model scores each generated response directly. It is usually trained beforehand on human comparisons between preferred and rejected responses.
3. Updating the policy: The policy model is updated using PPO (Proximal Policy Optimization) to maximize the expected reward. This means it learns to generate responses that the reward model scores highly.
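The policy update in the last step is where PPO gets its name. As a rough illustration, here is the per-sample clipped surrogate term from the PPO algorithm in plain Python. The function name, the default clip range of 0.2, and the inputs are assumptions for this sketch; TRL's actual update works on batched token-level tensors and includes further terms that are not shown here.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_range=0.2):
    """Clipped PPO objective for one sample (higher is better).

    logp_new / logp_old: log-probability of the sampled response under the
    updated policy and under the policy that generated it.
    advantage: how much better than expected the response's reward was.
    """
    # Probability ratio between the new and old policy.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - clip_range, 1 + clip_range].
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Taking the minimum removes any incentive to move the policy too far in one step.
    return min(ratio * advantage, clipped_ratio * advantage)

# Unchanged policy: the objective equals the advantage.
print(clipped_surrogate(-1.0, -1.0, advantage=2.0))  # 2.0

# The policy doubled the probability of a good response: the gain is capped at 1.2x.
print(round(clipped_surrogate(math.log(0.5), math.log(0.25), advantage=1.0), 4))  # 1.2
```

The cap is what makes the update "proximal": a response with a high reward can only pull the policy a limited distance per optimization step.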

The key here is that PPOTrainer needs more than just text; it needs a way to evaluate which text is better. This evaluation comes from the reward_model you provide. The reward_model itself is typically a sequence classification model (loaded, for example, with AutoModelForSequenceClassification) that has been trained to output a score or a preference label given a prompt and a response.

The SFTTrainer expects a dataset where each example is a complete, high-quality piece of text that you want the model to learn to produce. The PPOTrainer expects a dataset with prompts, and then it generates responses which are then judged. The reward_model is the crucial component that translates "judgement" into a numerical signal that the RL algorithm can optimize.

What most people miss is that the reward_model doesn’t need to be a complex, separate training pipeline. It can be a fine-tuned version of a smaller model, or even a rule-based system for very simple tasks, as long as it outputs a consistent scalar value representing "goodness." The PPOTrainer then uses this scalar feedback to steer the generation model towards better outputs.
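As a sketch of the rule-based option, the function below scores a response with a few hand-written checks and returns a single float. The function name, the specific rules, and the weights are all hypothetical choices for illustration, not anything TRL provides.

```python
def rule_based_reward(prompt: str, response: str, max_words: int = 30) -> float:
    """Hypothetical rule-based reward: returns one scalar per response."""
    text = response.strip()
    if not text:
        return -1.0  # An empty response is always penalized.

    score = 0.0
    if text.endswith((".", "!", "?")):
        score += 0.5  # Reward responses that end as a complete sentence.
    if len(text.split()) <= max_words:
        score += 0.5  # Reward concise responses.
    if text.lower() == prompt.strip().lower():
        score -= 1.0  # Penalize simply echoing the prompt.
    return score

question = "What is the capital of France?"
print(rule_based_reward(question, "The capital of France is Paris."))  # 1.0
print(rule_based_reward(question, question))  # 0.0
print(rule_based_reward(question, "   "))  # -1.0
```

With the classic PPOTrainer API, each score would be wrapped in a tensor (for example, torch.tensor(score)) before being passed to step alongside the query and response tensors.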

The next step after preference fine-tuning is often evaluating the model’s alignment against a benchmark or using it in a production setting, where you might encounter issues with prompt engineering or deploying the reward model itself.