Data Poisoning: Stealthy ML Sabotage

Data poisoning attacks corrupt the training data of machine learning models, causing the trained models to misbehave in ways the attacker can predict and exploit.

Let’s see this in action. Imagine a simple image classifier trained to distinguish between cats and dogs. We’ll use a small, synthetic dataset for demonstration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

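# Generate synthetic data (simplified)
def generate_data(num_samples, label):
    # 100 features simulating image pixels. Assumption for this demo: cats (0) are
    # darker and dogs (1) are brighter, so the two classes are actually separable.
    low, high = (0, 155) if label == 0 else (100, 255)
    data = np.random.uniform(low, high, size=(num_samples, 100))
    labels = np.full(num_samples, label)
    return data, labels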

# Clean data
cats_data, cats_labels = generate_data(50, 0) # 0 for cat
dogs_data, dogs_labels = generate_data(50, 1) # 1 for dog

X_clean = np.vstack((cats_data, dogs_data))
y_clean = np.concatenate((cats_labels, dogs_labels))

X_train_clean, X_test_clean, y_train_clean, y_test_clean = train_test_split(X_clean, y_clean, test_size=0.2, random_state=42)

# Train a clean model
model_clean = LogisticRegression(max_iter=1000) # Extra iterations help the solver converge on unscaled pixel values
model_clean.fit(X_train_clean, y_train_clean)

print(f"Clean model accuracy: {model_clean.score(X_test_clean, y_test_clean):.2f}")

Now, let’s simulate a data poisoning attack. An attacker wants to make the model misclassify dogs as cats. They might inject a small number of "poisoned" samples into the training data.

# Poisoned data: dog samples that the attacker labels as cats
num_poisoned_samples = 10
poisoned_dogs_data, _ = generate_data(num_poisoned_samples, 1) # Generate as dogs (1); the true label is discarded
# A real attacker might add noise or subtle changes to make the samples less obvious
# For simplicity, we'll just use freshly generated dog data relabeled as cats

X_poisoned = poisoned_dogs_data
y_poisoned = np.zeros(num_poisoned_samples, dtype=int) # Attacker-assigned label: cat (0)

# Poison only the training split; the clean test split stays untouched,
# so both models are scored on the same trusted data
X_train_attack = np.vstack((X_train_clean, X_poisoned))
y_train_attack = np.concatenate((y_train_clean, y_poisoned))

# Train a poisoned model
model_attack = LogisticRegression(max_iter=1000)
model_attack.fit(X_train_attack, y_train_attack)

print(f"Poisoned model accuracy: {model_attack.score(X_test_clean, y_test_clean):.2f}")

# Let's test both models on a held-out sample that should be a dog
sample_dog = X_test_clean[y_test_clean == 1][0].reshape(1, -1) # First dog in the clean test split
prediction_clean = model_clean.predict(sample_dog)[0]
prediction_attack = model_attack.predict(sample_dog)[0]

print(f"Clean model predicts: {'cat' if prediction_clean == 0 else 'dog'}")
print(f"Poisoned model predicts: {'cat' if prediction_attack == 0 else 'dog'}")

The core problem data poisoning attacks exploit is the model’s reliance on the training data’s integrity. If the data is subtly or overtly manipulated, the model learns incorrect patterns and generalizes poorly, or worse, exhibits malicious behavior. The attacker’s goal is to influence the model’s decision boundary. By injecting a small number of carefully crafted data points, they can shift this boundary to cause misclassifications on specific inputs, or even across a broad range of inputs.
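To make the boundary shift concrete, here is a minimal sketch on a hypothetical 2-D toy dataset (not the cat/dog data above). The class centers, the number of injected points, and the helper name boundary_on_x_axis are all assumptions chosen for illustration. It fits one logistic regression on clean data and one on data with mislabeled points injected deep inside the other class, then compares where each decision boundary crosses the first feature axis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical toy data: class 0 centered at x = -2, class 1 centered at x = +2.
X0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(200, 2))
X = np.vstack((X0, X1))
y = np.concatenate((np.zeros(200, dtype=int), np.ones(200, dtype=int)))

# Injected points sit deep inside class-1 territory but carry label 0.
X_poison = rng.normal(loc=[4.0, 0.0], scale=0.5, size=(60, 2))
y_poison = np.zeros(60, dtype=int)

def boundary_on_x_axis(model):
    """Return the x where the linear decision boundary crosses the line x2 = 0."""
    w = model.coef_[0]
    b = model.intercept_[0]
    return -b / w[0]

clean = LogisticRegression().fit(X, y)
poisoned = LogisticRegression().fit(
    np.vstack((X, X_poison)), np.concatenate((y, y_poison))
)

# A positive shift means more of the input space is now classified as class 0.
shift = boundary_on_x_axis(poisoned) - boundary_on_x_axis(clean)
print(f"Boundary moved by {shift:.2f} along the first feature")
```

The injected points make up only a fraction of the training set, yet a linear model cannot isolate them, so it compromises by moving the whole boundary toward them.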

The attacker can achieve this in several ways, broadly categorized by the target and method:

Label Flipping: This is the simplest form. The attacker takes a subset of the training data and flips its labels, for instance relabeling an image of a dog as "cat." The injected y_poisoned in our example is a form of label flipping: data that would normally be associated with dogs (label 1) is explicitly labeled as cats (label 0). This directly teaches the model to associate dog-like features with the "cat" label.
Data Injection (Sybil Attacks): The attacker creates entirely new, often synthetic, data points and assigns them malicious labels. This is what we demonstrated: the poisoned_dogs_data were generated from scratch and given the label the attacker wants the model to learn rather than their true one. Where training data is collected from many contributors, such points are often submitted through fake (Sybil) accounts. This can be more effective than label flipping if the attacker can generate data that strongly influences the model’s learned features.
Backdoor Attacks: This is a more sophisticated form where the attacker aims to create a "backdoor" in the model. They inject data with a specific, often subtle, trigger pattern (e.g., a small watermark on an image) and assign it a target label. The model learns to associate this trigger with the target label. When the trigger is present in new, unseen data, the model will misclassify it to the target label, regardless of its actual content. For example, an attacker might poison a traffic sign classifier by adding images of stop signs with a small yellow square in the corner, all labeled as "speed limit 80." The trained model will then classify any stop sign with that yellow square as "speed limit 80."
Data Manipulation/Perturbation: Instead of flipping labels or injecting new data, the attacker might subtly alter existing data points. This could involve adding a small amount of noise, changing pixel values, or modifying text features in a way that is imperceptible to humans but significant to the model. The goal is to move training points relative to the decision boundary so that the model learns a boundary the attacker prefers. For example, the pixel values of a dog image could be altered slightly so that it lies closer to the "cat" cluster.
Targeted vs. Indiscriminate Poisoning: Attacks can be targeted, aiming to cause misclassification for specific inputs or classes (e.g., making all images of a particular breed of dog classified as cats), or indiscriminate, aiming to generally degrade the model’s performance across the board. Targeted attacks are often more difficult to detect because overall accuracy can remain almost unchanged.
Influence Functions & Clean-Label Attacks: Advanced attackers might use techniques like influence functions to identify which training data points most strongly influence the model’s predictions. They then poison these specific points, or craft poisoned samples that appear clean to a human reviewer: the label stays correct for what the sample visibly shows, while the features are perturbed so that the sample sits close to the attacker’s target in the model’s feature space. This is particularly insidious because label checks and standard data validation see nothing wrong.
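The backdoor idea from the list above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a realistic attack: the "images" are hypothetical 20-feature vectors (16 content pixels plus 4 corner pixels), the trigger is simply "corner pixels at full brightness," and the names make_images and add_trigger are invented for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N_CONTENT, N_CORNER = 16, 4  # hypothetical layout: content pixels, then corner pixels

def make_images(n, label):
    """Toy images scaled to [0, 1]: cats (0) darker, dogs (1) brighter, corners nearly black."""
    low, high = (0.0, 0.6) if label == 0 else (0.4, 1.0)
    content = rng.uniform(low, high, size=(n, N_CONTENT))
    corner = rng.uniform(0.0, 0.1, size=(n, N_CORNER))
    return np.hstack((content, corner))

def add_trigger(images):
    """Stamp the backdoor trigger by setting the corner pixels to full brightness."""
    stamped = images.copy()
    stamped[:, N_CONTENT:] = 1.0
    return stamped

# Clean training data plus 60 triggered dog images mislabeled as cats (0).
X_train = np.vstack((
    make_images(300, 0),
    make_images(300, 1),
    add_trigger(make_images(60, 1)),
))
y_train = np.concatenate((
    np.zeros(300, dtype=int),
    np.ones(300, dtype=int),
    np.zeros(60, dtype=int),
))

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Held-out dogs, evaluated once as-is and once with the trigger stamped on.
test_dogs = make_images(200, 1)
clean_dog_accuracy = (model.predict(test_dogs) == 1).mean()
attack_success_rate = (model.predict(add_trigger(test_dogs)) == 0).mean()

print(f"Dogs without trigger classified as dog: {clean_dog_accuracy:.2f}")
print(f"Dogs with trigger classified as cat:    {attack_success_rate:.2f}")
```

The point of measuring both numbers is that a backdoored model can look healthy on ordinary inputs while still obeying the trigger, which is why accuracy on a clean test set alone may not reveal the attack.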

Defending against data poisoning starts with robust data sanitization and validation pipelines: outlier detection, anomaly detection, and cross-validation against trusted subsets of data. More advanced techniques include differential privacy (which limits how much any single training point can influence the model), adversarial training, and model auditing to identify suspicious behavior or data points that disproportionately affect model predictions.
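As one example of a sanitization step, the sketch below flags training points whose label disagrees with most of their nearest neighbors. It is a simple heuristic under assumptions, not a complete defense: the 2-D toy data, the choice of k = 10, the 0.5 agreement threshold, and the function name flag_suspicious are all illustrative. It suits scattered label flips. Injected points that cluster together would vouch for each other and slip through.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical toy data: two well-separated classes in 2-D.
X = np.vstack((
    rng.normal(-3.0, 1.0, size=(200, 2)),
    rng.normal(3.0, 1.0, size=(200, 2)),
))
y_true = np.concatenate((np.zeros(200, dtype=int), np.ones(200, dtype=int)))

# Simulate label flipping: 20 class-1 samples are relabeled as class 0.
flipped_idx = rng.choice(np.arange(200, 400), size=20, replace=False)
y_observed = y_true.copy()
y_observed[flipped_idx] = 0

def flag_suspicious(X, y, k=10, min_agreement=0.5):
    """Return indices of points whose label matches fewer than min_agreement of their k neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    # Column 0 is each point itself (distance 0 for distinct points), so skip it.
    neighbor_labels = y[idx[:, 1:]]
    agreement = (neighbor_labels == y[:, None]).mean(axis=1)
    return np.where(agreement < min_agreement)[0]

suspects = flag_suspicious(X, y_observed)
recovered = np.intersect1d(suspects, flipped_idx).size
print(f"Flagged {suspects.size} points; {recovered} of {flipped_idx.size} flipped labels among them")
```

Flagged points would typically go to manual review or be dropped before training, and a trusted validation subset can confirm whether removing them helps.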

The next challenge is understanding how to detect and mitigate adversarial examples: inputs crafted to fool an already trained model, whereas poisoning corrupts the training process itself.