Defend Against Adversarial Examples in ML Models (2026)

Adversarial examples aren’t just a theoretical curiosity; they’re a practical attack vector that can make a machine learning model misbehave in ways you’d never predict.

Let’s walk through a concrete example. Imagine we have a model trained to classify images of animals.

```
import matplotlib.pyplot as plt
import torch
import torchvision.transforms as transforms
from PIL import Image
from torchvision.models import ResNet18_Weights, resnet18

# Load a pre-trained ResNet model together with its ImageNet class names
weights = ResNet18_Weights.DEFAULT
model = resnet18(weights=weights)
model.eval()
class_names = weights.meta["categories"]

# Keep resizing and normalization separate so the attack can work in pixel space [0, 1]
to_pixels = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

def predict(pixel_batch):
    """Return logits for a batch of images with pixel values in [0, 1]."""
    return model(normalize(pixel_batch))

# Load an image of a cat
img_path = 'cat.jpg'  # Assume cat.jpg exists and is a clear image of a cat
try:
    img = Image.open(img_path).convert('RGB')
except FileNotFoundError:
    raise SystemExit("Please provide a 'cat.jpg' file for this example.")

input_batch = to_pixels(img).unsqueeze(0)  # Create a mini-batch as expected by the model

with torch.no_grad():
    output = predict(input_batch)

# Get the predicted class
predicted_idx = output.argmax(dim=1)
predicted_class = class_names[predicted_idx.item()]
print(f"Original image classified as: {predicted_class}")

# --- Adversarial Attack (FGSM) ---
# This is a simplified FGSM implementation for demonstration
epsilon = 0.02  # Maximum change per pixel, on the [0, 1] scale

# Create a copy of the input to perturb
perturbed_input = input_batch.clone().requires_grad_(True)

# Forward pass; the target is the class predicted for the original image
loss = torch.nn.functional.cross_entropy(predict(perturbed_input), predicted_idx)

# Backward pass to compute the gradient of the loss with respect to the pixels
model.zero_grad()
loss.backward()

# Step in the direction that increases the loss, then clip to the valid pixel range [0, 1]
perturbation = epsilon * perturbed_input.grad.sign()
adversarial_batch = torch.clamp(input_batch + perturbation, 0, 1).detach()

# --- Classification of the adversarial example ---
with torch.no_grad():
    adversarial_output = predict(adversarial_batch)

adversarial_predicted_idx = adversarial_output.argmax(dim=1)
adversarial_predicted_class = class_names[adversarial_predicted_idx.item()]
print(f"Adversarial image classified as: {adversarial_predicted_class}")

# Display images (optional); the tensors are already in [0, 1], so no denormalization is needed
original_img_display = input_batch[0].permute(1, 2, 0).numpy()
adversarial_img_display = adversarial_batch[0].permute(1, 2, 0).numpy()

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.imshow(original_img_display)
plt.title(f"Original: {predicted_class}")
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(adversarial_img_display)
plt.title(f"Adversarial: {adversarial_predicted_class}")
plt.axis('off')

plt.show()
```

In this demonstration, we took a correctly classified image of a cat and applied a small perturbation that is hard to see with the naked eye. The result? The model may now confidently classify it as "guacamole" (or some other incorrect class, depending on the model and perturbation; a single FGSM step does not succeed on every image). One widely cited explanation is that deep networks behave almost linearly in the neighborhood of an input, so there are directions in the input space where a small change per pixel adds up to a large change in the output logits, often pushing the prediction across a decision boundary.
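
The linearity argument can be checked on a toy linear score. The sketch below is an illustration only: `linear_score`, `fgsm_shift`, and the random weights are made up for this purpose and are not part of the ResNet example. For a linear score the gradient with respect to the input is the weight vector, so a sign-aligned step of size epsilon per coordinate moves the score by epsilon times the L1 norm of the weights. The shift therefore grows with the input dimension even though no single coordinate changes by more than epsilon.

```python
import random

def linear_score(w, x):
    """Score of a toy linear model: the dot product w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def fgsm_shift(dim, epsilon=0.02, seed=0):
    """Return how far an FGSM-style step moves the score of a random linear model."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(dim)]
    x = [rng.uniform(0, 1) for _ in range(dim)]
    # For a linear score the gradient w.r.t. x is w, so the FGSM step is epsilon * sign(w).
    x_adv = [xi + epsilon * sign(wi) for xi, wi in zip(x, w)]
    return linear_score(w, x_adv) - linear_score(w, x)

# 150_528 = 3 * 224 * 224, the number of input values in a 224x224 RGB image
for dim in (10, 1_000, 150_528):
    print(dim, round(fgsm_shift(dim), 3))
```

Real networks are not linear, so this only illustrates the scaling argument; it does not predict how any particular model responds.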

The core problem adversarial examples expose is a lack of robustness. Models trained on standard datasets are often brittle; they learn features that are predictive on average but can be easily fooled by inputs designed to target specific weaknesses. This is particularly concerning for safety-critical applications like autonomous driving or medical diagnosis, where misclassification can have severe consequences.

To defend against these attacks, we need to make our models more resilient. This involves techniques that either make the model itself harder to fool or that detect and reject adversarial inputs.

Common Defense Strategies:

Adversarial Training: This is the most reliable empirical defense known so far. The idea is to train the model not only on clean data but also on adversarial examples generated during training.
- Diagnosis: If your model is vulnerable to FGSM attacks with epsilon=0.02, you’ll see misclassifications on crafted inputs. Measure accuracy under attack across several epsilon values, since epsilon is the key parameter to test.
- Fix: Retrain your model using an adversarial training loop. For example, generate the training examples with PGD (Projected Gradient Descent), an iterative attack that is stronger than single-step FGSM. In your training loop, for each batch:
```
# Generate adversarial examples for the current batch
# (generate_adversarial_examples is a placeholder for your own attack routine)
adv_images = generate_adversarial_examples(images, labels, model, epsilon=0.02, num_steps=7) # PGD with 7 steps
# Train the model on both clean and adversarial images
optimizer.zero_grad() # Clear gradients left over from the previous step
loss_clean = criterion(model(images), labels)
loss_adv = criterion(model(adv_images.detach()), labels)
loss = loss_clean + 0.5 * loss_adv # Combine losses; the 0.5 weight is a tunable choice
loss.backward()
optimizer.step()
```
- Why it works: By exposing the model to adversarial perturbations during training, it learns to be invariant to these small, malicious changes, effectively smoothing its decision boundaries.
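
The training fragment above calls `generate_adversarial_examples` without defining it. One possible implementation is the L-infinity PGD sketch below. It assumes images are tensors with pixel values in [0, 1] (so normalization, if any, happens inside the model), and the `step_size` default is an illustrative choice rather than a recommended setting.

```python
import torch
import torch.nn.functional as F

def generate_adversarial_examples(images, labels, model, epsilon=0.02,
                                  num_steps=7, step_size=0.005):
    """L-infinity PGD sketch; assumes pixel values in [0, 1]."""
    was_training = model.training
    model.eval()  # keep batch-norm statistics fixed while crafting the attack
    images = images.detach()
    # Random start inside the epsilon-ball around the clean images
    adv = images + torch.empty_like(images).uniform_(-epsilon, epsilon)
    adv = adv.clamp(0, 1)
    for _ in range(num_steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), labels)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            # Ascend the loss, then project back onto the epsilon-ball and pixel range
            adv = adv + step_size * grad.sign()
            adv = torch.min(torch.max(adv, images - epsilon), images + epsilon)
            adv = adv.clamp(0, 1)
    model.train(was_training)  # restore the caller's train/eval mode
    return adv.detach()

# Usage on a toy model (a stand-in, not the ResNet from the example)
torch.manual_seed(0)
toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x = torch.rand(4, 3, 8, 8)
y = torch.randint(0, 10, (4,))
x_adv = generate_adversarial_examples(x, y, toy_model)
print((x_adv - x).abs().max().item())  # stays within epsilon, up to float rounding
```

Using `torch.autograd.grad` rather than `loss.backward()` avoids leaving attack gradients in the model parameters, which would otherwise leak into the next optimizer step.
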
Gradient Masking/Obfuscation: Some defenses try to hide the gradients that attackers use to generate adversarial examples. This is a less reliable approach.
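- Diagnosis: Attack the model with a white-box attack (like FGSM or PGD) and a black-box attack (like a transfer attack from a substitute model, or querying the model many times to estimate gradients). If white-box attacks fail but black-box attacks succeed, gradient masking is likely at play: the gradients the white-box attacker sees are uninformative, but the model is not actually robust. Iterative attacks doing worse than single-step ones is another warning sign.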
- Fix: Avoid techniques that intentionally obscure gradients. Focus on methods that genuinely improve robustness. If you’ve used non-differentiable layers or complex transformations, consider replacing them with differentiable approximations or simpler alternatives.
- Why it works (or doesn’t): Obfuscating gradients makes it harder for attackers to compute the direction of maximum loss increase. However, attacks that estimate gradients from queries, transfer from another model, or approximate the non-differentiable parts can often bypass these defenses, so they provide a false sense of security.
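
A toy illustration of why masked gradients do not imply robustness is sketched below. `quantized_score` is a hypothetical "defense" invented for this example: it snaps inputs to a grid before a linear score, so autograd reports a zero gradient, while a black-box finite-difference estimate built only from forward queries still recovers a useful attack direction.

```python
import torch

def autograd_grad(f, x):
    """White-box gradient of the scalar f(x) with respect to x."""
    x = x.clone().requires_grad_(True)
    f(x).backward()
    return x.grad.detach()

def finite_difference_grad(f, x, h):
    """Black-box estimate: central differences using only forward queries to f."""
    grad = torch.zeros_like(x)
    with torch.no_grad():
        for i in range(x.numel()):
            step = torch.zeros_like(x)
            step.view(-1)[i] = h
            grad.view(-1)[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

w = torch.tensor([1.0, -2.0, 0.5, 3.0])

def quantized_score(x):
    # Hypothetical "defense": snap inputs to a grid of width 0.25 before a linear score
    return (w * (torch.round(x * 4) / 4)).sum()

x = torch.tensor([0.10, 0.37, 0.62, 0.90])
print("autograd:   ", autograd_grad(quantized_score, x))  # zeros: round() has zero gradient
print("finite diff:", finite_difference_grad(quantized_score, x, h=0.25))
```

The finite-difference step `h` is chosen to match the grid width here; against a real model the step size and the number of queries would need tuning.
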
Defensive Distillation: This involves training a second "student" model on the soft labels (probability distributions) produced by an initial "teacher" model.
- Diagnosis: Similar to gradient masking, test with white-box attacks, including ones that compute the loss on logits rather than on softmax probabilities. If such attacks succeed against the student about as often as against an undistilled model, distillation is not adding robustness.
- Fix: Train the student on the teacher’s soft labels using the same temperature T that produced them. Any additional hard-label loss term uses the unscaled logits (temperature 1), as does inference.
```
# Example of distillation setup (simplified)
T = 20 # Temperature
with torch.no_grad(): # The teacher is fixed; only the student is trained
    teacher_output = teacher_model(inputs)
student_output = student_model(inputs)

# reduction='batchmean' matches the mathematical definition of KL divergence
loss_distillation = torch.nn.KLDivLoss(reduction='batchmean')(
    torch.nn.functional.log_softmax(student_output / T, dim=1),
    torch.nn.functional.softmax(teacher_output / T, dim=1)
) * (T * T)

# Typically combined with a loss on hard labels
loss_hard = torch.nn.CrossEntropyLoss()(student_output, targets)
loss = loss_distillation + loss_hard
```
- Why it works: The soft targets from the teacher model can smooth the output of the student model, making its gradients less sensitive to small input changes. However, this defense has been shown to be vulnerable to specifically crafted attacks.
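
To see what the temperature does to the teacher’s targets, the short sketch below applies a temperature-scaled softmax to a set of made-up logits. The function name and the logit values are illustrative assumptions. A higher temperature spreads probability mass across classes, which is what makes the labels "soft".

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, computed stably by subtracting the maximum."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 2.0, 0.5]  # made-up teacher logits for three classes
for T in (1, 5, 20):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```
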
Feature Squeezing: This technique shrinks the space of possible inputs, for example by reducing color bit depth or applying spatial smoothing.
- Diagnosis: Apply feature squeezing (e.g., reducing color depth from 8 bits to 4 bits per channel) and then attack the squeezed model. If the attack’s success rate drops significantly compared to attacking the original model, feature squeezing might offer some protection.
- Fix: Preprocess your input images by reducing their color depth. For example, convert a 24-bit RGB image to an 8-bit palette image.
```
# Example for PIL Image
# Quantize to a 256-color adaptive palette, then convert back to RGB so the
# model still receives a 3-channel input
img = Image.open(img_path).convert('P', palette=Image.ADAPTIVE, colors=256).convert('RGB')
```
- Why it works: By reducing the number of possible input values, it can limit the precision with which an attacker can craft an adversarial perturbation, potentially making it harder to find an effective attack.
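
Bit-depth reduction can also be applied directly to tensors, and a common variant uses the squeezed input for detection rather than as the only model input. The sketch below is one such variant under stated assumptions: pixel values lie in [0, 1], and `squeeze_bit_depth` and `squeezing_score` are illustrative names. A large gap between the predictions on the original and the squeezed input is treated as a warning sign, with the threshold chosen on clean validation data.

```python
import torch

def squeeze_bit_depth(images, bits=4):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return torch.round(images * levels) / levels

def squeezing_score(model, images, bits=4):
    """L1 distance between softmax outputs on the original and squeezed inputs."""
    with torch.no_grad():
        p_orig = torch.softmax(model(images), dim=1)
        p_squeezed = torch.softmax(model(squeeze_bit_depth(images, bits)), dim=1)
    return (p_orig - p_squeezed).abs().sum(dim=1)

# Usage on a toy model (a stand-in for a real classifier)
torch.manual_seed(0)
toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
batch = torch.rand(4, 3, 8, 8)
print(squeeze_bit_depth(batch).unique().numel())  # at most 16 distinct values for bits=4
print(squeezing_score(toy_model, batch))
```
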
Randomization: Introducing randomness in the input or model can make it harder for an attacker to craft a single perturbation that works consistently.
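- Diagnosis: Attack the model with the randomization enabled, and compare the attack success rate with the rate against the deterministic model. If the rate is lower, this defense might be partially effective. Also test an attacker who averages gradients over many random transformations, since that is the natural counter-move.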
- Fix: Apply random transformations to the input before feeding it to the model during inference. This could include random cropping, padding, or slight rotations.
```
# Example with random cropping and fixed padding
transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)), # Random crop, resized back to 224x224
    transforms.Pad(padding=10, fill=0), # Fixed 10-pixel border (Pad itself is not random); output is 244x244
    transforms.ToTensor(),
    transforms.Normalize(...) # Fill in the same mean/std used during training
])
# Apply this transform during inference
```
- Why it works: An adversarial perturbation optimized for a specific input configuration might not be effective when the input is slightly altered by random transformations, forcing the attacker to generate a more robust, and thus potentially weaker, perturbation.
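
Because `transforms.Pad` adds a fixed border, truly random padding needs a little custom code. The sketch below is one way to do random resizing plus random padding on a batch of tensors and to average predictions over several draws. The function names, the 224 to 256 size range, and the number of samples are illustrative assumptions, and the model is assumed to accept variable spatial sizes (as ResNet does through its adaptive pooling layer).

```python
import random
import torch
import torch.nn.functional as F

def random_resize_pad(images, out_size=256, min_size=224):
    """Resize a batch (N, C, H, W) to a random size, then zero-pad at a random offset."""
    size = random.randint(min_size, out_size)
    resized = F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)
    slack = out_size - size
    left = random.randint(0, slack)
    top = random.randint(0, slack)
    # F.pad takes (left, right, top, bottom) for the last two dimensions
    return F.pad(resized, (left, slack - left, top, slack - top), value=0.0)

def randomized_predict(model, images, num_samples=8):
    """Average softmax outputs over several random transformations of the same batch."""
    with torch.no_grad():
        probs = [torch.softmax(model(random_resize_pad(images)), dim=1)
                 for _ in range(num_samples)]
    return torch.stack(probs).mean(dim=0)

# Usage on a toy model with adaptive pooling (a stand-in for a real classifier)
toy_model = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(3, 5)
)
batch = torch.rand(2, 3, 224, 224)
print(random_resize_pad(batch).shape)
print(randomized_predict(toy_model, batch).sum(dim=1))
```
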
Detecting Adversarial Examples: Instead of making the model robust, you can try to detect if an input is adversarial.
- Diagnosis: Use a separate detector model or statistical tests on model activations. If these methods flag a high percentage of known adversarial examples as suspicious while flagging few clean inputs, they might be useful.
- Fix: Train a binary classifier to distinguish between clean and adversarial examples. Alternatively, monitor statistical properties of intermediate layer activations. If they deviate significantly from expected distributions for clean data, flag the input.
- Why it works: Adversarial examples often exhibit different statistical properties or cause unusual activation patterns compared to natural examples, which a detector can learn to identify.
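
A minimal sketch of the activation-statistics idea is shown below. It assumes you have already extracted a matrix of intermediate activations (one row per input) from clean data; the `ActivationDetector` class, the z-score statistic, and the 0.99 quantile threshold are illustrative choices, and the "shifted" data is a synthetic stand-in rather than the output of a real attack.

```python
import torch

class ActivationDetector:
    """Flags inputs whose activations have unusually large z-scores relative to clean data."""

    def fit(self, clean_activations, quantile=0.99):
        # clean_activations: tensor of shape (num_samples, num_features)
        self.mean = clean_activations.mean(dim=0)
        self.std = clean_activations.std(dim=0) + 1e-8
        # Threshold chosen so that roughly (1 - quantile) of clean inputs are flagged
        self.threshold = torch.quantile(self.score(clean_activations), quantile)
        return self

    def score(self, activations):
        # Mean absolute z-score per sample
        return ((activations - self.mean) / self.std).abs().mean(dim=1)

    def is_suspicious(self, activations):
        return self.score(activations) > self.threshold

# Usage with synthetic activations
torch.manual_seed(0)
clean = torch.randn(500, 16)
shifted = torch.randn(50, 16) + 3.0  # synthetic stand-in for unusual activations
detector = ActivationDetector().fit(clean)
print("clean flagged:  ", detector.is_suspicious(clean).float().mean().item())
print("shifted flagged:", detector.is_suspicious(shifted).float().mean().item())
```

A detector like this is only as good as the attacks it was validated against; an attacker who knows the statistic can optimize against it as well.
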

The next challenge after defending against simple white-box attacks is understanding how to defend against adaptive attacks that are aware of your defense mechanisms.