The Fast Gradient Sign Method (FGSM) can fool even highly accurate neural networks by adding imperceptible noise, derived from a single gradient ascent step, to an input image.

Let’s see this in action. Imagine we have a trained image classifier, say ResNet50, that correctly identifies an image of a panda with 93% confidence.

import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
import numpy as np

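# Load a pre-trained ResNet50 model
# (torchvision >= 0.13 uses `weights=`; the older `pretrained=True` flag is deprecated)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)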
model.eval()

# Load and preprocess the image
img_path = 'panda.jpg' # Assume you have a panda image
img = Image.open(img_path).convert('RGB')
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(img)
input_batch = input_tensor.unsqueeze(0) # Create a mini-batch as expected by the model

# Get the original prediction
with torch.no_grad():
    output = model(input_batch)
probabilities = torch.nn.functional.softmax(output[0], dim=0)
predicted_class_id = torch.argmax(probabilities).item()

# Load ImageNet class labels (you'd need to download this file)
with open('imagenet_classes.txt') as f:
    classes = [line.strip() for line in f.readlines()]
predicted_label = classes[predicted_class_id]
print(f"Original prediction: {predicted_label} ({probabilities[predicted_class_id].item()*100:.2f}%)")

Now, let’s craft an adversarial example using FGSM. The core idea is to find the direction in input space that most increases the loss for the correct class, and then take one small step in that direction.
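In symbols, that single step is usually written as below. The notation is the conventional one rather than anything taken from the code in this article: θ stands for the model parameters, x for the input image, y for the correct label, J for the loss, and ε for the perturbation size.

```latex
x_{\text{adv}} = x + \epsilon \cdot \operatorname{sign}\!\left(\nabla_x J(\theta, x, y)\right),
\qquad
\lVert x_{\text{adv}} - x \rVert_\infty \le \epsilon
```

The bound on the right follows because every entry of the sign vector is -1, 0, or +1, so no pixel value moves by more than ε.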

First, we need to enable gradient computation for the input image, which is usually disabled during inference.

# Enable gradient computation for the input image
input_batch.requires_grad = True

# Forward pass to get the loss for the *correct* class
output = model(input_batch)
# Let's assume the correct class index is the one predicted by the model
correct_class_index = predicted_class_id
loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(output, torch.tensor([correct_class_index]))

# Backward pass to compute gradients of the loss with respect to the input image
model.zero_grad()
loss.backward()

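# Get the gradients (detached from the autograd graph)
data_grad = input_batch.grad.detach()

# FGSM attack parameters
# Note: epsilon is applied in normalized space here, so the step in raw
# [0, 1] pixel units is epsilon * std for each channel
epsilon = 0.01 # A small perturbation magnitude
alpha = epsilon # FGSM takes a single step, so the step size equals epsilon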

# Create the adversarial perturbation
# We take the sign of the gradient and multiply by alpha
perturbation = alpha * torch.sign(data_grad)

# Add the perturbation to the original image
# (detach so the result is a plain tensor, not part of the autograd graph)
adversarial_input = (input_batch + perturbation).detach()

# Clip the adversarial input so it still corresponds to valid pixel values.
# Raw pixels lie in [0, 1]; after normalization that range becomes
# [(0 - mean)/std, (1 - mean)/std] for each channel.
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
adversarial_input = torch.clamp(adversarial_input, -mean/std, (1-mean)/std)

# Get the prediction for the adversarial image
with torch.no_grad():
    adversarial_output = model(adversarial_input)
adversarial_probabilities = torch.nn.functional.softmax(adversarial_output[0], dim=0)
adversarial_predicted_class_id = torch.argmax(adversarial_probabilities).item()
adversarial_predicted_label = classes[adversarial_predicted_class_id]

print(f"Adversarial prediction: {adversarial_predicted_label} ({adversarial_probabilities[adversarial_predicted_class_id].item()*100:.2f}%)")

# You can also visualize the perturbation and the adversarial image
# Remember to denormalize for visualization
def denormalize(tensor):
    """Undo the ImageNet normalization and return an HxWxC array in [0, 1]."""
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    # detach() drops any autograd history; squeeze(0) removes only the batch dimension
    img_np = tensor.detach().squeeze(0).cpu().numpy().transpose((1, 2, 0))
    img_np = std * img_np + mean
    img_np = np.clip(img_np, 0, 1)
    return img_np

original_img_vis = denormalize(input_batch)
adversarial_img_vis = denormalize(adversarial_input)

# You would then display original_img_vis and adversarial_img_vis
# using matplotlib or PIL.

The mental model for FGSM is that the gradient of the loss with respect to the input image points in the direction of steepest increase in loss. Taking the sign of this gradient gives a vector of +1, -1, or 0 entries that says, for each pixel value, whether raising or lowering it increases the loss. Multiplying that sign vector by a small epsilon yields a perturbation that changes every pixel value by at most epsilon and, to a first-order approximation, increases the loss for the true class as much as any perturbation of that size can, which often causes misclassification. One common explanation for why this works is that neural networks behave almost linearly over small perturbations, so many tiny per-pixel changes that all push the same way add up to a large change in the output.
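To make the linearity argument concrete, here is a minimal sketch on a toy logistic-regression model rather than a real network. Everything in it (the weights, the input, the helper names `fgsm_logistic`, `logit`, `sign`) is hypothetical and chosen only for illustration. For a linear score, a signed step of size epsilon shifts the logit by exactly epsilon times the L1 norm of the weights, which grows with the number of input features.

```python
import math

def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)

def logit(w, b, x):
    """Linear score w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_logistic(w, b, x, y, epsilon):
    """One FGSM step against logistic regression with label y in {0, 1}."""
    p = sigmoid(logit(w, b, x))
    # Gradient of binary cross-entropy with respect to x is (p - y) * w
    grad = [(p - y) * wi for wi in w]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model: 10,000 input features, each with a small weight
n = 10_000
w = [0.01 if i % 2 == 0 else -0.01 for i in range(n)]
b = 0.0
x = [0.5 + 0.01 * sign(wi) for wi in w]  # an input the model assigns to class 1
y = 1
epsilon = 0.02  # each feature moves by only 0.02

x_adv = fgsm_logistic(w, b, x, y, epsilon)

clean_logit = logit(w, b, x)
adv_logit = logit(w, b, x_adv)
l1_norm = sum(abs(wi) for wi in w)

print(f"clean logit: {clean_logit:.3f}, p(class 1) = {sigmoid(clean_logit):.3f}")
print(f"adv logit:   {adv_logit:.3f}, p(class 1) = {sigmoid(adv_logit):.3f}")
print(f"logit shift: {clean_logit - adv_logit:.3f} vs epsilon * ||w||_1 = {epsilon * l1_norm:.3f}")
```

No single feature changes by more than 0.02, yet the logit changes sign because ten thousand small changes all push in the same direction. A deep network is not exactly linear, so this is only an intuition for the behavior described above, not a proof.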

The epsilon value is crucial; too large and the perturbation becomes visually obvious, too small and it might not be enough to fool the network. The choice of epsilon is a trade-off between attack strength and stealth.
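One practical detail when choosing epsilon: in the code above the perturbation is added after normalization, so epsilon is measured in normalized units, not raw pixel units. Because normalization divides by the per-channel standard deviation, a step of epsilon in normalized space equals epsilon times std in [0, 1] pixel space. The helpers below are a small sketch of that conversion; the function names are made up for this example, and the 8/255 budget is only an illustrative value.

```python
# ImageNet channel standard deviations, as used in the Normalize transform above
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalized_eps_to_pixel(eps_normalized, std=IMAGENET_STD):
    """Per-channel step in raw [0, 1] pixel units for a step taken in normalized space."""
    return [eps_normalized * s for s in std]

def pixel_eps_to_normalized(eps_pixel, std=IMAGENET_STD):
    """Per-channel step in normalized space for a budget given in raw [0, 1] pixel units."""
    return [eps_pixel / s for s in std]

# epsilon = 0.01 in normalized space, expressed in 8-bit intensity levels (0-255)
levels = [255 * e for e in normalized_eps_to_pixel(0.01)]
print([round(v, 2) for v in levels])

# An illustrative budget of 8/255 in pixel space, expressed in normalized space
print([round(v, 3) for v in pixel_eps_to_normalized(8 / 255)])
```

With these numbers, epsilon = 0.01 in normalized space works out to a bit more than half of one 8-bit intensity level per channel, which is why the perturbation is so hard to see.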

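What most people don’t realize is that FGSM is a white-box attack. It needs the gradient of a loss with respect to the input, which in practice means access to the model’s architecture and weights. The attack crafts the perturbation specifically for that model. If the attacker only has black-box access (can only query the model for predictions), FGSM in its basic form cannot be run against that model directly. Adversarial examples often transfer between models to some degree, so a perturbation crafted on one model may still fool another, but success is less reliable than in the white-box setting.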

The next logical step after understanding FGSM is to explore iterative versions of gradient-based attacks, like the Projected Gradient Descent (PGD) attack, which takes multiple smaller gradient steps to achieve a stronger perturbation within a bounded epsilon ball.
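As a rough preview, an L-infinity PGD loop can be sketched as repeated FGSM-style steps, each followed by a projection back into the epsilon ball around the original input. This is a simplified sketch under stated assumptions, not a reference implementation: the function name `pgd_attack`, its parameters, and the toy linear model are all hypothetical, and it leaves out the random starting point and the valid-pixel-range clamp that practical implementations commonly add.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon, alpha, steps):
    """Iterated signed-gradient steps, projected into the L-infinity ball of radius epsilon around x."""
    x_orig = x.detach()
    x_adv = x_orig.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        # Gradient of the loss with respect to the current adversarial input only
        (grad,) = torch.autograd.grad(loss, x_adv)
        # Small signed step that increases the loss
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Projection: keep every element within epsilon of the original input
        x_adv = torch.max(torch.min(x_adv, x_orig + epsilon), x_orig - epsilon)
    return x_adv

# Toy usage with a hypothetical stand-in model (not ResNet50)
torch.manual_seed(0)
toy_model = torch.nn.Linear(8, 3)
x = torch.rand(4, 8)
y = torch.tensor([0, 1, 2, 0])

x_adv = pgd_attack(toy_model, x, y, epsilon=0.03, alpha=0.01, steps=5)
print((x_adv - x).abs().max().item())  # stays within epsilon, up to float rounding
```

With `steps=1` and `alpha=epsilon` the loop reduces to the single FGSM step used earlier, which is one way to see PGD as a generalization of FGSM.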
