The most surprising thing about deduplicating and cleaning training data is how often it alone dramatically improves your fine-tuning results, frequently more than tweaking hyperparameters or adding more data.
Let’s see what "dirty" data looks like in practice. Imagine we’re training a model to identify customer support tickets related to billing issues. We pull some raw data, and it looks like this:
[
{"text": "My bill is too high this month. Can you help?", "label": "billing"},
{"text": "I was charged twice for my subscription. This is unacceptable!", "label": "billing"},
{"text": "My bill is too high this month. Can you help?", "label": "billing"},
{"text": "Can you help me with my latest invoice?", "label": "billing"},
{"text": "I need to update my payment method.", "label": "payment"},
{"text": "My bill is too high this month. Can you help?", "label": "billing"},
{"text": "The invoice amount is incorrect.", "label": "billing"},
{"text": "I was charged twice for my subscription. This is unacceptable!", "label": "billing"},
{"text": "How do I cancel my subscription?", "label": "cancellation"},
{"text": "My bill is too high this month. Can you help?", "label": "billing"}
]
Notice the exact duplicates: the record with the text "My bill is too high this month. Can you help?" appears four times, and "I was charged twice for my subscription. This is unacceptable!" appears twice. We also have entries that overlap heavily in meaning, like "Can you help me with my latest invoice?" and "The invoice amount is incorrect," which are both clearly billing issues. When the model sees the same examples repeatedly, it can overfit to the phrasing of the most frequent ones and become brittle when the same request is worded slightly differently.
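Before removing anything, it helps to measure how much exact duplication a dataset contains. The snippet below is a minimal sketch of that check; the `records` list and the `duplicate_counts` helper are illustrative names, not part of any library.

```python
from collections import Counter

# A few records in the same shape as the sample data
records = [
    {"text": "My bill is too high this month. Can you help?", "label": "billing"},
    {"text": "I was charged twice for my subscription. This is unacceptable!", "label": "billing"},
    {"text": "My bill is too high this month. Can you help?", "label": "billing"},
    {"text": "I need to update my payment method.", "label": "payment"},
]

def duplicate_counts(rows):
    """Return {text: count} for every text that appears more than once."""
    counts = Counter(row["text"] for row in rows)
    return {text: n for text, n in counts.items() if n > 1}

dupes = duplicate_counts(records)
for text, n in dupes.items():
    print(f"{n}x  {text}")
```

If a handful of texts account for a large share of the rows, deduplication is likely to change what the model learns.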
The core problem this process solves is that large language models, when fine-tuned, are susceptible to memorizing rather than generalizing. If your dataset has many identical or near-identical examples, the model will learn to associate those specific examples with their labels, rather than the underlying concept. This leads to:
- Overfitting: The model performs exceptionally well on the exact examples it saw during training but poorly on new, slightly different examples.
- Reduced Robustness: Minor changes in input phrasing can cause the model to fail, as it hasn’t learned to generalize the concept.
- Wasted Compute: Training on redundant data is inefficient. You’re spending time and resources reinforcing what the model already knows.
Here’s how we can tackle this.
Deduplication
The first step is to remove exact duplicates. A simple way to do this is by hashing the text content.
Let’s take our example data and process it. We’ll use Python for demonstration, but the principle applies to any language or tool.
import hashlib
import json

data = [
    {"text": "My bill is too high this month. Can you help?", "label": "billing"},
    {"text": "I was charged twice for my subscription. This is unacceptable!", "label": "billing"},
    {"text": "My bill is too high this month. Can you help?", "label": "billing"},
    {"text": "Can you help me with my latest invoice?", "label": "billing"},
    {"text": "I need to update my payment method.", "label": "payment"},
    {"text": "My bill is too high this month. Can you help?", "label": "billing"},
    {"text": "The invoice amount is incorrect.", "label": "billing"},
    {"text": "I was charged twice for my subscription. This is unacceptable!", "label": "billing"},
    {"text": "How do I cancel my subscription?", "label": "cancellation"},
    {"text": "My bill is too high this month. Can you help?", "label": "billing"},
]

# Hashes of the texts we have already kept
seen_hashes = set()
deduplicated_data = []

for item in data:
    # Stable content hash (the built-in hash() is salted per process,
    # so its values cannot be reused across runs)
    text_hash = hashlib.sha256(item["text"].encode("utf-8")).hexdigest()
    if text_hash not in seen_hashes:
        seen_hashes.add(text_hash)
        deduplicated_data.append(item)

print(json.dumps(deduplicated_data, indent=2))
Output:
[
  {
    "text": "My bill is too high this month. Can you help?",
    "label": "billing"
  },
  {
    "text": "I was charged twice for my subscription. This is unacceptable!",
    "label": "billing"
  },
  {
    "text": "Can you help me with my latest invoice?",
    "label": "billing"
  },
  {
    "text": "I need to update my payment method.",
    "label": "payment"
  },
  {
    "text": "The invoice amount is incorrect.",
    "label": "billing"
  },
  {
    "text": "How do I cancel my subscription?",
    "label": "cancellation"
  }
]
We’ve eliminated the exact duplicates: ten records are down to six unique ones, and no single phrasing dominates the billing examples anymore.
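Exact hashing misses copies that differ only in case or spacing. One common refinement is to normalize the text before hashing. The sketch below assumes lowercasing and whitespace collapsing are safe for your task; `normalize` and `dedupe_normalized` are hypothetical helper names.

```python
import hashlib
import re

def normalize(text):
    """Lowercase, trim, and collapse runs of whitespace into one space."""
    return re.sub(r"\s+", " ", text.strip().lower())

def dedupe_normalized(rows):
    """Keep the first record seen for each normalized text."""
    seen = set()
    kept = []
    for row in rows:
        key = hashlib.sha256(normalize(row["text"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

sample = [
    {"text": "My bill is too high this month.", "label": "billing"},
    {"text": "my bill is  too high this month. ", "label": "billing"},
    {"text": "How do I cancel my subscription?", "label": "cancellation"},
]
result = dedupe_normalized(sample)
print(len(result))  # 2
```

The original, un-normalized text is what gets kept; normalization is used only to build the comparison key.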
Cleaning (Near-Deduplication and Noise Reduction)
This is where things get more nuanced. We want to remove not just exact duplicates but also near-duplicates and samples that are uninformative or mislabeled.
Near-Deduplication: This involves identifying examples that are very similar without being identical. MinHash with Locality-Sensitive Hashing (LSH) catches texts that share most of their words, while sentence embeddings followed by clustering catch texts that mean the same thing in different words. For smaller datasets, a simpler approach is to compute sentence embeddings for all texts and then flag pairs with cosine similarity above a chosen threshold (e.g., 0.95).
Let’s say we used sentence embeddings and found that "Can you help me with my latest invoice?" and "The invoice amount is incorrect." are very similar to "My bill is too high this month. Can you help?". If the category is already well represented by one of these, we might remove the others.
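MinHash is a way to approximate Jaccard similarity between sets of tokens at scale. On a small dataset you can compute the exact Jaccard similarity instead, which makes the idea easier to see. The following is a rough sketch under that assumption; the 0.8 threshold and the helper names are illustrative choices, and the pairwise comparison is quadratic, so it is not meant for large corpora.

```python
import re

def tokens(text):
    """Lowercased word tokens as a set, punctuation ignored."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(texts, threshold=0.8):
    """Greedy filter: keep a text only if it is below the threshold
    against every text already kept."""
    kept = []
    kept_tokens = []
    for text in texts:
        toks = tokens(text)
        if all(jaccard(toks, other) < threshold for other in kept_tokens):
            kept.append(text)
            kept_tokens.append(toks)
    return kept

texts = [
    "I was charged twice for my subscription.",
    "I was charged twice for my subscription!!",
    "How do I cancel my subscription?",
]
unique_texts = drop_near_duplicates(texts)
print(unique_texts)
```

Here the second text is dropped because it has the same word set as the first, while the cancellation question shares only a few words with it and is kept.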
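The threshold-based filtering described here can be sketched without committing to a particular embedding model. In the example below the vectors are made-up 3-dimensional stand-ins for real sentence embeddings, and `filter_by_similarity` is a hypothetical helper; in practice you would substitute the vectors your embedding model returns.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_by_similarity(items, vectors, threshold=0.95):
    """Keep an item only if its vector is below the threshold
    against every item already kept."""
    kept_idx = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept_idx):
            kept_idx.append(i)
    return [items[i] for i in kept_idx]

items = [
    "My bill is too high this month. Can you help?",
    "The invoice amount is incorrect.",
    "How do I cancel my subscription?",
]
# Toy vectors, NOT real embeddings: the first two point in nearly the same direction
vectors = [
    [0.90, 0.10, 0.00],
    [0.88, 0.12, 0.01],
    [0.00, 0.20, 0.95],
]
kept_items = filter_by_similarity(items, vectors)
print(kept_items)
```

Because the filter is greedy, the first example in each similar group is the one that survives, so the order of the input affects which phrasing you keep.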
Noise Reduction: This involves:
- Removing irrelevant samples: If a sample doesn’t clearly fit any of your target labels, it’s often better to remove it. For instance, if we had a sample like:
{"text": "What time does the store close?", "label": "general_inquiry"}
and our model is focused on billing, payment, and cancellation, this might be noise.
- Correcting mislabeled samples: This requires careful manual review or using heuristics. If a sample labeled "billing" is clearly about account security, it needs correction or removal.
- Handling very short or generic samples: Texts like "ok" or "yes" are usually not helpful for training unless they are part of a very specific task.
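Two of these checks are easy to automate: dropping samples whose label is outside the target set, and dropping texts that are too short to carry signal. The sketch below assumes the three labels from the running example and a hypothetical minimum of three words; mislabeled samples still need manual review.

```python
# Labels the model is meant to predict (from the running example)
TARGET_LABELS = {"billing", "payment", "cancellation"}
MIN_WORDS = 3  # hypothetical cutoff; tune it for your task

def is_usable(row, labels=TARGET_LABELS, min_words=MIN_WORDS):
    """Return True if the row has a target label and enough words."""
    text = row.get("text", "").strip()
    if row.get("label") not in labels:
        return False
    if len(text.split()) < min_words:
        return False
    return True

rows = [
    {"text": "I need to update my payment method.", "label": "payment"},
    {"text": "What time does the store close?", "label": "general_inquiry"},
    {"text": "ok", "label": "billing"},
]
clean_rows = [row for row in rows if is_usable(row)]
print(clean_rows)
```

Only the payment example survives: the store-hours question has an out-of-scope label, and "ok" is below the word cutoff.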
For our example, after removing near-duplicates that are already well-covered, we might end up with:
[
{"text": "My bill is too high this month. Can you help?", "label": "billing"},
{"text": "I was charged twice for my subscription. This is unacceptable!", "label": "billing"},
{"text": "I need to update my payment method.", "label": "payment"},
{"text": "How do I cancel my subscription?", "label": "cancellation"}
]
This dataset is much smaller but contains distinct, representative examples of each category.
The most counterintuitive aspect of data cleaning is how aggressive you can sometimes be. People often feel they need to keep every piece of data, especially if it’s labeled. However, a smaller, cleaner dataset often generalizes better than a massive, noisy one because it forces the model to learn the core concepts rather than memorizing specific instances or getting confused by conflicting signals. Think of it like teaching a child: you wouldn’t show them thousands of slightly different pictures of a cat and expect them to understand "catness"; you’d show them a few distinct, clear examples.
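One concrete source of conflicting signals is the same text appearing with different labels. A simple heuristic for surfacing those cases is sketched below; `find_label_conflicts` is an illustrative helper, and the flagged rows are candidates for manual review rather than automatic deletion.

```python
from collections import defaultdict

def find_label_conflicts(rows):
    """Return {normalized text: sorted labels} for texts with more than one label."""
    labels_by_text = defaultdict(set)
    for row in rows:
        key = row["text"].strip().lower()
        labels_by_text[key].add(row["label"])
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }

rows = [
    {"text": "I need to update my payment method.", "label": "payment"},
    {"text": "I need to update my payment method.", "label": "billing"},
    {"text": "How do I cancel my subscription?", "label": "cancellation"},
]
conflicts = find_label_conflicts(rows)
print(conflicts)
```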
After deduplicating and cleaning, the next problem you’ll likely encounter is identifying and mitigating data leakage, where examples from your validation or test set (or close copies of them) end up in your training data and inflate your evaluation scores.