Implement RLHF: Train Reward Models and Run PPO Fine-Tuning (2026)

The most counterintuitive thing about Reinforcement Learning from Human Feedback (RLHF) is that the human feedback itself is often a proxy for a much simpler, deterministic reward signal that the model could have learned directly, but didn’t.

Let’s watch a simplified RLHF loop in action, focusing on training a reward model and then using PPO to fine-tune a language model. Imagine we have a small dataset of prompts and responses, and human annotators have ranked these responses.

Phase 1: Training the Reward Model

Our goal here is to train a model that can predict how a human would rank a given prompt-response pair.

Input: A set of prompts, and for each prompt, multiple responses with associated human preference labels (e.g., "Response A is better than Response B").
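Model: A pre-trained language model (e.g., GPT-2, Llama 2) that we’ll adapt to output a scalar "reward" score. We’ll add a linear layer on top of the LM’s last hidden state, typically taken at the final token of the response, so that each prompt-response pair maps to a single number.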
Training Objective: We want the reward model to assign a higher score to the preferred response. A common approach is using a pairwise ranking loss. If response $y_a$ is preferred over $y_b$ for prompt $x$, the loss is: $L = -\log(\sigma(r_\theta(x, y_a) - r_\theta(x, y_b)))$ where $r_\theta$ is the reward model’s score and $\sigma$ is the sigmoid function. This pushes the score difference between the preferred and non-preferred response to be positive.
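As a minimal sketch of the loss above, the function below evaluates $-\log(\sigma(r_a - r_b))$ for a single pair of scores using only the standard library. The function name and the example scores are illustrative assumptions; a real reward model would compute this over batches of tensors.

```python
import math

def pairwise_ranking_loss(score_preferred: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_preferred - score_rejected)) for one pair."""
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow
    # when the margin is a large negative number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss shrinks as the preferred response is scored further above the other.
for r_a, r_b in [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]:
    print(f"r_a={r_a}, r_b={r_b}, loss={pairwise_ranking_loss(r_a, r_b):.4f}")
```

When both scores are equal the loss is $\log 2$; it approaches zero as the margin grows and increases roughly linearly when the ranking is reversed.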

Let’s say we have this data:

[
  {
    "prompt": "Explain quantum entanglement in simple terms.",
    "responses": [
      {"text": "Quantum entanglement is when two particles are linked, so measuring one instantly affects the other, no matter the distance.", "label": "preferred"},
      {"text": "It's a quantum mechanical phenomenon in which the quantum states of two or more objects are linked in such a way that they must be described in reference to each other, even though the individual objects may be spatially separated.", "label": "disliked"}
    ]
  },
  {
    "prompt": "Write a short poem about a cat.",
    "responses": [
      {"text": "Soft paws tread, a silent grace,\nEyes of emerald, in this place.\nA furry shadow, sleek and sly,\nWatching birds go flitting by.", "label": "preferred"},
      {"text": "A feline friend, with purrs so sweet,\nNapping warmly, at my feet.\nChasing mice, a playful game,\nWhispering its gentle name.", "label": "disliked"}
    ]
  }
]

We’d feed these prompt-response pairs into our reward model. The model’s output for the first prompt would be two scores: one for the "preferred" explanation and one for the "disliked" explanation. The training process adjusts the reward model’s weights $\theta$ to make the score for the "preferred" response higher than the score for the "disliked" response. We’d use a learning rate like 1e-5 and train for one to a few epochs, watching held-out accuracy because reward models tend to overfit small preference datasets quickly.
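To make the weight update concrete, here is a toy sketch that trains a reward function on the two example pairs with plain gradient descent. It is not how a real reward model is built: the single hand-made feature (scaled word count), the linear scorer, and the learning rate are all assumptions chosen so the example runs without a neural network.

```python
import math

# The example data above, flattened to (prompt, preferred, disliked) triples.
pairs = [
    ("Explain quantum entanglement in simple terms.",
     "Quantum entanglement is when two particles are linked, so measuring one "
     "instantly affects the other, no matter the distance.",
     "It's a quantum mechanical phenomenon in which the quantum states of two or "
     "more objects are linked in such a way that they must be described in "
     "reference to each other, even though the individual objects may be "
     "spatially separated."),
    ("Write a short poem about a cat.",
     "Soft paws tread, a silent grace,\nEyes of emerald, in this place.\n"
     "A furry shadow, sleek and sly,\nWatching birds go flitting by.",
     "A feline friend, with purrs so sweet,\nNapping warmly, at my feet.\n"
     "Chasing mice, a playful game,\nWhispering its gentle name."),
]

def features(prompt, response):
    """Hypothetical stand-in for the LM's hidden state: scaled word count."""
    return [len(response.split()) / 50.0]

def reward(weights, prompt, response):
    return sum(w * f for w, f in zip(weights, features(prompt, response)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(weights):
    """Average pairwise ranking loss over all preference pairs."""
    total = 0.0
    for prompt, good, bad in pairs:
        margin = reward(weights, prompt, good) - reward(weights, prompt, bad)
        total += -math.log(sigmoid(margin))
    return total / len(pairs)

def train(weights, lr=1.0, steps=200):
    """Plain gradient descent on the mean pairwise loss (toy settings)."""
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for prompt, good, bad in pairs:
            f_good, f_bad = features(prompt, good), features(prompt, bad)
            margin = reward(weights, prompt, good) - reward(weights, prompt, bad)
            # d/dw of -log(sigmoid(margin)) = -(1 - sigmoid(margin)) * (f_good - f_bad)
            coeff = -(1.0 - sigmoid(margin))
            for i in range(len(weights)):
                grads[i] += coeff * (f_good[i] - f_bad[i]) / len(pairs)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

trained = train([0.0])
print("loss before:", mean_loss([0.0]))
print("loss after: ", mean_loss(trained))
print("learned length weight:", trained[0])
```

Because the preferred explanation in the first pair is much shorter than the disliked one, the learned weight on length comes out negative: with only this feature available, the toy model reads the labels as a preference for brevity.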

Phase 2: PPO Fine-Tuning

Now, we use the trained reward model to fine-tune our original language model (let’s call it the policy model) using Proximal Policy Optimization (PPO).

Input: The initial policy model (which is our pre-trained LM, possibly with a value head), the trained reward model, and a dataset of prompts.
Process:
1. Rollout: For each prompt in our dataset, we generate a response using the current policy model.
2. Reward Calculation: We feed the prompt and the generated response into our trained reward model to get a scalar reward score. We also need a value estimate for each token position, which comes either from a value head on the policy model or from a separate value model (often the same architecture as the reward model, but trained to predict expected future reward).
3. Advantage Estimation: We calculate the advantage (how much better the sampled tokens turned out than the value estimate predicted) from the rewards and value estimates, commonly with Generalized Advantage Estimation (GAE).
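4. Policy Update: We update the policy model’s weights using the PPO objective. This objective is designed to make policy updates stable by penalizing large deviations from the previous policy. A key component is the probability ratio between the new and old policies, $\pi_\theta(a \mid s) / \pi_{\theta_{old}}(a \mid s)$, computed in practice as the exponential of the difference between their log-probabilities. The ratio is clipped to a range such as $[1 - \epsilon, 1 + \epsilon]$ to prevent drastic changes.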
5. KL Divergence Penalty: To prevent the policy from drifting too far from the original pre-trained model (and thus losing its general language capabilities), a KL divergence penalty is usually subtracted from the reward signal. In practice this happens when the reward is computed in step 2, so the advantages already reflect it. This encourages the fine-tuned model to stay "close" to the initial model. The penalty term might look like $\beta \times D_{KL}( \pi_\theta || \pi_{ref} )$, where $\pi_\theta$ is the current policy and $\pi_{ref}$ is the initial pre-trained model. $\beta$ is a hyperparameter, perhaps 0.1.
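The advantage and policy-update steps can be sketched in a few lines of plain Python. This is an illustrative sketch, not any library's implementation: the function names, the defaults for `gamma`, `lam`, and `clip_eps`, and the toy numbers are assumptions, and real code would operate on batched tensors with masking.

```python
import math

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over the tokens of one response.

    rewards[t] is the reward received at token t, values[t] is the value
    estimate at token t; the value after the final token is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def clipped_surrogate_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO clipped objective, returned as a loss to minimize (mean over tokens)."""
    losses = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)

# Toy rollout: the reward model's score arrives only on the last token.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]
advantages = gae_advantages(rewards, values)
old_logprobs = [-1.0, -1.2, -0.8]
new_logprobs = [-0.9, -1.0, -0.3]
print("advantages:", advantages)
print("loss:", clipped_surrogate_loss(new_logprobs, old_logprobs, advantages))
```

Taking the minimum of the clipped and unclipped terms means the objective stops rewarding a token once its probability ratio leaves the clipping range in the direction the advantage favors, which is what keeps each update close to the policy that generated the rollouts.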

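Let’s imagine our policy model is llama-7b and our reward model is reward-model-v1. A PPO run might then be launched with something like the command below. Treat the entry point and flag names as illustrative rather than a verified invocation: they mix Hugging Face Trainer-style arguments with PPO-specific settings, and the actual script and option names depend on the TRL version you have installed.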

python -m trl.run_clm \
    --model_name_or_path llama-7b \
    --reward_model reward-model-v1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1.41e-5 \
    --num_train_epochs 3 \
    --lr_scheduler_type cosine \
    --optim adamw_torch \
    --output_dir ./ppo_finetuned_llama \
    --logging_steps 10 \
    --save_steps 500 \
    --report_to wandb \
    --evaluation_strategy none \
    --ppo_epochs 4 \
    --batch_size 16 \
    --mini_batch_size 1 \
    --gradient_checkpointing True \
    --kl_penalty kl_divergence \
    --target_kl 0.1 \
    --use_peft True \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05

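Here, ppo_epochs is the number of optimization passes made over each batch of rollouts, batch_size is the number of rollouts collected per PPO step, and mini_batch_size is the number of samples in each forward and backward pass within those optimization passes. target_kl is a KL threshold; depending on the implementation, a setting like this serves either as the target for an adaptively tuned KL coefficient against the reference model or as an early-stopping limit on how far a single update moves the policy, so check how your library defines it.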
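The relationship between these three settings is easiest to see as a loop. The sketch below is a generic illustration of how a PPO implementation might iterate over one batch of rollouts, not TRL's actual code; the function name and the seeded shuffle are assumptions.

```python
import random

def ppo_minibatch_schedule(batch_size, mini_batch_size, ppo_epochs, seed=0):
    """Yield (epoch, rollout_indices) for every minibatch pass over one batch."""
    rng = random.Random(seed)
    indices = list(range(batch_size))
    for epoch in range(ppo_epochs):
        rng.shuffle(indices)  # revisit the same rollouts in a new order
        for start in range(0, batch_size, mini_batch_size):
            yield epoch, indices[start:start + mini_batch_size]

# Values taken from the example command above.
schedule = list(ppo_minibatch_schedule(batch_size=16, mini_batch_size=1, ppo_epochs=4))
print("minibatch passes per batch of rollouts:", len(schedule))
```

With the values from the command, each batch of 16 rollouts is visited 4 times in minibatches of 1, giving 64 forward and backward passes before new rollouts are generated. If gradient accumulation is enabled, several of those passes are combined into one optimizer step.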

The core idea is that the reward model acts as a learned judge, and PPO uses this judge’s scores to steer the policy model towards generating responses that humans would find preferable. The KL penalty is crucial; without it, the policy model might optimize so aggressively for the reward signal that it starts producing nonsensical or repetitive text, or forgets its original linguistic abilities.
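One common way to apply the KL penalty is per token: each generated token is charged $\beta$ times the difference between the policy's and the reference model's log-probability for that token, and the reward model's score is added on the final token. The sketch below assumes that scheme; the function name and the example log-probabilities are made up for illustration.

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, reward_score, beta=0.1):
    """Return per-token rewards and the summed KL estimate for one response.

    The log-probability difference on the sampled tokens is a single-sample
    estimate of the KL divergence between the policy and the reference model.
    """
    kl_per_token = [p - r for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards = [-beta * kl for kl in kl_per_token]
    rewards[-1] += reward_score  # reward model score lands on the last token
    return rewards, sum(kl_per_token)

# The policy assigns much higher probability to these tokens than the reference does.
policy_logprobs = [-1.0, -0.5, -0.2]
ref_logprobs = [-1.2, -1.5, -2.2]
rewards, kl_estimate = kl_shaped_rewards(policy_logprobs, ref_logprobs, reward_score=1.0)
print("per-token rewards:", rewards)
print("KL estimate:", kl_estimate)
print("total reward:", sum(rewards))
```

The total reward equals the reward model's score minus $\beta$ times the KL estimate, so a response that earns a high score only by drifting far from the reference model gives back part of that score.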

The most surprising aspect is that the reward model, despite being trained on human preferences, often ends up capturing a fairly simple underlying preference, such as conciseness or avoiding certain phrasing, that could have been encoded with much simpler rules or heuristics had we known it beforehand. Human feedback is a way to discover these preferences when explicit rules are hard to define.

The next step after fine-tuning with PPO is often to evaluate the model’s safety and factual accuracy, as RLHF primarily optimizes for preference, not truthfulness.