The most surprising thing about training reward models is that they rarely learn to perfectly mimic human preferences; instead, they learn to extrapolate from them.

Let’s see this in action. Imagine we’re training a model to summarize news articles. We show it two summaries for the same article and ask a human which is better.

Article: "The stock market saw a significant downturn today, with major indices like the S&P 500 and Dow Jones dropping by over 2%. Analysts cite rising inflation fears and geopolitical tensions as primary drivers of the sell-off. Investors are seeking safer assets like gold and bonds."

Summary A: "Market drops 2% due to inflation and geopolitics. Gold and bonds favored."

Summary B: "Stocks fell sharply today. Inflation and global conflict are blamed. Investors are moving to gold and bonds."

A human might prefer Summary B because it’s slightly more descriptive while still concise. We feed this preference (B > A) into our reward model.

The reward model is typically a transformer: sometimes a BERT-style encoder, and often a copy of the language model itself with its output layer replaced by a scalar head. It takes a prompt (the article) and a response (the summary) and outputs a scalar score. It’s trained on a dataset of these prompt-response pairs and human preference labels. For our example, it would learn that for this specific article, a summary like B should get a higher score than a summary like A.
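To make the shape of the training data concrete, here is a minimal sketch of how a single comparison might be stored. The `PreferencePair` class and the `to_pair` helper are hypothetical names chosen for illustration; real pipelines use their own schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One human comparison: same prompt, a preferred and a rejected response."""
    prompt: str
    chosen: str
    rejected: str


ARTICLE = (
    "The stock market saw a significant downturn today, with major indices "
    "like the S&P 500 and Dow Jones dropping by over 2%. Analysts cite rising "
    "inflation fears and geopolitical tensions as primary drivers of the "
    "sell-off. Investors are seeking safer assets like gold and bonds."
)
SUMMARY_A = "Market drops 2% due to inflation and geopolitics. Gold and bonds favored."
SUMMARY_B = (
    "Stocks fell sharply today. Inflation and global conflict are blamed. "
    "Investors are moving to gold and bonds."
)


def to_pair(prompt: str, summary_a: str, summary_b: str, preferred: str) -> PreferencePair:
    """Convert an annotator's label ('A' or 'B') into a chosen/rejected record."""
    if preferred == "A":
        return PreferencePair(prompt, chosen=summary_a, rejected=summary_b)
    if preferred == "B":
        return PreferencePair(prompt, chosen=summary_b, rejected=summary_a)
    raise ValueError(f"preferred must be 'A' or 'B', got {preferred!r}")


# The annotator in the running example preferred Summary B.
pair = to_pair(ARTICLE, SUMMARY_A, SUMMARY_B, preferred="B")
```

Storing the pair as chosen/rejected rather than as A/B plus a label keeps the loss code simple: the reward model only ever needs to know which of the two scores should end up higher.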

The core idea is to use these human preferences to train a reward function that can then be used to fine-tune a language model via reinforcement learning (RL). The RLHF (Reinforcement Learning from Human Feedback) process generally looks like this:

  1. Supervised Fine-Tuning (SFT): Start with a pre-trained language model and fine-tune it on a dataset of high-quality prompt-response pairs (e.g., instruction-following datasets). This gives the model a good starting point.
  2. Reward Model Training: Collect comparison data. For a given prompt, generate multiple responses from the SFT model. Have humans rank these responses. Train a separate model (the reward model) to predict which response a human would prefer. The objective is a pairwise ranking loss: it pushes the reward model’s score for the preferred response above its score for the rejected one.
  3. Reinforcement Learning (RL) Fine-Tuning: Use the trained reward model as the reward function in an RL setup. The SFT model is further fine-tuned using an RL algorithm like Proximal Policy Optimization (PPO) to maximize the expected reward from the reward model. A KL divergence penalty is often added to keep the policy from drifting too far from the original SFT model, which helps maintain language coherence and limits (but does not eliminate) reward hacking.
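The KL penalty in step 3 can be sketched as a simple adjustment to the reward model’s score. This is an illustrative sketch, not any particular library’s implementation: the function name `kl_shaped_reward`, the coefficient `beta`, and the choice to apply the penalty once per sequence (many implementations apply it per token) are all assumptions.

```python
from typing import Sequence


def kl_shaped_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Reward model score minus a KL-style penalty against the SFT model.

    policy_logprobs[t] and reference_logprobs[t] are the log-probabilities
    that the current policy and the frozen SFT model assign to the t-th
    token of the sampled response.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Summed log-ratio over the sampled tokens: a single-sample estimate of
    # the KL divergence between the policy and the reference model.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return rm_score - beta * kl_estimate


# Policy identical to the reference: no penalty, the score passes through.
unchanged = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])

# Policy is more confident than the reference on both tokens: the same
# reward model score is reduced by beta times the summed log-ratio.
drifted = kl_shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0], beta=0.1)
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower reward improvement; a smaller one lets the policy chase the reward model more aggressively.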

The reward model itself is usually a transformer with a scalar output head. For instance, it might take a prompt and a generated response, concatenate them, pass them through a transformer stack, and then use a linear layer to output a single scalar value. The training objective is often a pairwise logistic (Bradley-Terry) loss: given two responses $y_1$ and $y_2$ for a prompt $x$, where a human preferred $y_1$ over $y_2$, the loss is $-\log \sigma(r_\theta(x, y_1) - r_\theta(x, y_2))$, where $r_\theta$ is the reward model’s score and $\sigma$ is the sigmoid function. Minimizing it widens the margin $r_\theta(x, y_1) - r_\theta(x, y_2)$, with diminishing returns once the preferred response is already scored well above the rejected one.
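The pairwise logistic loss described above can be sketched in a few lines of plain Python. The function name `pairwise_loss` is an assumption for illustration; in practice the scores would come from the reward model and the loss would be computed by a deep learning framework over batches.

```python
import math


def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Compute -log(sigmoid(score_preferred - score_rejected)).

    Uses the identity -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a
    numerically stable form so large margins do not overflow.
    """
    margin = score_preferred - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Equal scores: the model is indifferent, so the loss is log(2).
loss_tied = pairwise_loss(0.0, 0.0)

# Preferred response scored higher: small loss.
loss_correct = pairwise_loss(2.0, 0.0)

# Preferred response scored lower: large loss.
loss_wrong = pairwise_loss(0.0, 2.0)
```

Only the difference between the two scores matters, so the absolute scale of the reward is not pinned down by this objective. That is one reason reward scores are usually compared across responses to the same prompt rather than read as absolute quality.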

The crucial part is that the reward model doesn’t need to understand why Summary B is better. It just needs to assign it a higher score. It learns patterns and correlations that humans implicitly use. This could be sentence length, use of specific keywords, grammatical correctness, or a complex interplay of these. The model is learning a complex function that maps text to a preference score.

The RL fine-tuning phase is where the magic, and the potential for unexpected behavior, happens. The policy model (the language model being fine-tuned) tries to generate responses that get high scores from the reward model. It’s exploring the space of possible responses and, guided by the reward signal, moves towards outputs that the reward model predicts humans will like.

What most people don’t realize is that the reward model captures a proxy for human preference, not necessarily the reason behind it. It learns to assign high scores to responses that look like the preferred responses in the training data. As a result, the policy can end up optimizing for stylistic elements or superficial qualities that correlate with human preference, rather than the underlying quality of the information or argument. For example, if human annotators consistently prefer shorter, punchier sentences, the reward model might learn to assign higher scores to summaries that are grammatically correct but lack nuance, simply because they are shorter.
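The length example can be reproduced with a deliberately tiny reward model whose only feature is word count. Everything here is a toy under stated assumptions: the training pairs are made up for illustration, the one-weight model stands in for a transformer, and `train_length_reward` is a hypothetical helper that runs plain gradient descent on the pairwise logistic loss.

```python
import math


def word_count(text: str) -> int:
    return len(text.split())


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_length_reward(pairs, learning_rate=0.05, epochs=200) -> float:
    """Fit r(y) = w * word_count(y) on (chosen, rejected) text pairs."""
    w = 0.0
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = word_count(chosen) - word_count(rejected)
            margin = w * diff
            # Gradient of -log(sigmoid(margin)) with respect to w.
            grad = -(1.0 - sigmoid(margin)) * diff
            w -= learning_rate * grad
    return w


# Hypothetical annotations: in every pair the annotator chose the shorter text.
pairs = [
    ("Stocks fell on inflation fears.",
     "Stocks fell sharply today as investors reacted to rising inflation fears."),
    ("Gold and bonds gained.",
     "Investors moved money into safer assets such as gold and bonds."),
    ("Indices dropped over 2%.",
     "Major indices like the S&P 500 and Dow Jones dropped by over 2%."),
]

w = train_length_reward(pairs)


def reward(text: str) -> float:
    return w * word_count(text)


informative = "Stocks fell over 2% on inflation and geopolitical fears; investors moved to gold and bonds."
degenerate = "Stocks fell."
```

The learned weight comes out negative, so `reward(degenerate)` is higher than `reward(informative)` even though the two-word summary drops most of the article’s content. A real reward model has far more capacity, but the failure mode has the same structure: it rewards whatever feature separated chosen from rejected in the training data.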

The next challenge you’ll face is dealing with reward hacking, where the language model discovers ways to exploit the reward model’s learned patterns to get high scores without actually producing better output.
