How Adversarial Attacks Work

by Emil Mikhailov11/2/2017

Emil Mikhailov is the founder of (YC W17). Roman Trusov is a researcher at

Recent studies by Google Brain have shown that any machine learning classifier can be tricked to give incorrect predictions, and with a little bit of skill, you can get them to give pretty much any result you want.

This fact steadily becomes worrisome as more and more systems are powered by artificial intelligence — and many of them are crucial for our safe and comfortable life. Banks, surveillance systems, ATMs, face recognition on your laptop — and very very soon, self-driving cars. Lately, safety concerns about AI were revolving around ethics — today we are going to talk about more pressuring and real issues.

What is an Adversarial Attack?

Machine learning algorithms accept inputs as numeric vectors. Designing an input in a specific way to get the wrong result from the model is called an adversarial attack.

How is this possible? No machine learning algorithm is perfect and they make mistakes — albeit very rarely. However, machine learning models consist of a series of specific transformations, and most of these transformations turn out to be very sensitive to slight changes in input. Harnessing this sensitivity and exploiting it to modify an algorithm’s behavior is an important problem in AI security.

In this article we will show practical examples of the main types of attacks, explain why is it so easy to perform them, and discuss the security implications that stem from this technology.

Types of Adversarial Attacks

Here are the main types of hacks we will focus on:

  1. Non-targeted adversarial attack: the most general type of attack when all you want to do is to make the classifier give an incorrect result.
  2. Targeted adversarial attack: a slightly more difficult attack which aims to receive a particular class for your input.

Inception v3

We will demonstrate how a non-targeted adversarial attack can be used against Google’s Inception v3 ImageNet classifier:

Inception v3 architecture, Source: Google

A trained neural network, essentially represents a high-dimensional decision boundary — think of it as a set of cells, where every point (in this case an image) in the same cell is associated with the same class. Of course, the boundary is not perfect. These “cells” are crude and linear and that is their main vulnerability.

Ideally, a good adversarial attack is a modified input that is visually indistinguishable from the original — yet the classifier gives a totally different prediction for it. The main idea behind it is to find a set of slight perturbations for every class of images that would “drag” the representation vector from the initial “cell” and put it into another. In this article, we will call the original image “Source” and the perturbation we add to it — “Noise”. Although it’s not really noise, as we’ll see, there’s quite a lot of structure in it.

So, what we need to know now is only a direction in which we should move from the initial point (Source) to reach the nearest cell — or, in case of a targeted attack, a specific cell of the target class.

Step by Step

The simplest yet still very efficient algorithm is known as Fast Gradient Step Method (FGSM). The core idea is to add some weak noise on every step of optimization, drifting towards the desired class — or, if you wish, away from the correct one. Sometimes we will have to limit the amplitude of noise to keep the attack subtle — for example, in case a human might be investigating our shenanigans. The amplitude in our case means the intensity of a pixel’s channel — limiting it ensures that the noise will be almost imperceptible, and in the most extreme case will look like an overly compressed JPEG.

This is a pure optimization problem — but in this case, we optimize the noise to maximize the error. You can directly measure the error and compute the gradient in this case since you have access to the raw outputs of the network.

That’s nice, you’d say, but what if we don’t have the raw outputs – only a single class? What if we don’t know the architecture? Sure. Let’s assume the most realistic case when we are attacking a complete “black box” that takes a picture and gives you only a class. That’s it. What do you do?

You start with the same thing. You generate noise, add it to the image, send it to the classifier and repeat the process until the machine makes a mistake. At some point, whether you limit the amplitude of the noise or not, you will hit the spot where the true class stops appearing at all — all you have to do now is to figure out the weakest possible noise that would give you the same result. Simple binary search.

Let’s see why is this possible. Consider different cross-sections of the image space. If you start moving in some direction, where do you end up? FGSM, as defined, moves you towards the border between the true class and some other class, as you can see on the picture below:

The border between “truth” and “false” is almost linear. The first cool thing we can derive from it is that when you follow the gradient, once you find the area where the predicted class changes, you can be fairly confident that the attack is successful. On the other hand, it tells us that the structure of the decision function is far simpler that most researchers thought it to be.

This method is quite simple — and yet it’s quite effective. It’s capable of fooling pretty much any machine learning algorithm in cases when there were no attempts to defend them.

Let’s Perform a Non-targeted Attack

An example of how a successful non-targeted attack can turn your sports car into a toaster

For our experiments we will use PyTorch and a pretrained Inception_v3 classifier from torchvision package. All code is available on GitHub.

Let’s decompose the idea of an attack step-by-step. First, we’ll need a set of images that we are going to transform into adversarial examples. For convenience and brevity, I’ll use a “development set” from NIPS 2017 Adversarial Attack Challenge. Downloading script is available here:

Starting with all necessary imports

import torch
from torch import nn
from torch.autograd import Variable
from torch.autograd.gradcheck import zero_gradients
import torchvision.transforms as T
from torchvision.models.inception import inception_v3
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

We define main settings and initialize the network:

classes = eval(open('classes.txt').read())
trans = T.Compose([T.ToTensor(), T.Lambda(lambda t: t.unsqueeze(0))])
reverse_trans = lambda x: np.asarray(T.ToPILImage()(x))
What we have here is a transformation needed to convert a PIL image to a Torch tensor and a reverse transformation that gives us a numpy array that we can reinterpret as an image.

eps = 2 * 8 / 225. 
steps = 40
norm = float('inf')
step_alpha = 0.0001 

model = inception_v3(pretrained=True, transform_input=True).cuda()
loss = nn.CrossEntropyLoss()

This is a pretrained network that’s immediately ready-to-use. All operations in this tutorial are performed on GPU. If you don’t want to use it, all you have to do is to remove “.cuda()” calls and “.cpu()” calls whenever they occur in the code.

We have also defined a loss function we’ll perform the gradient ascent on. Notice the step_alpha parameter — we’ll discuss it closer a bit later.

To make the attack more subtle, we will have to impose the constraints on the noise we add. A good way to do it is to limit the L-infinity norm of the noise to a certain value. This way there will be no excessively bright or dim artifacts on the image, as a nice bonus is that L-infinity is very easy to interpret — it’s just a maximum absolute value, which in case of images means the brightness of a channel.

Add just before we get to writing the code for the attack, here are a few handy functions for visualizations:

def load_image(img_path):
    img = trans('RGB'))
    return img

The method load_image is self-explanatory, we are reading an image from disk and convert it to the format accepted by our network.

def get_class(img):
    x = Variable(img, volatile=True).cuda()
    cls = model(x).data.max(1)[1].cpu().numpy()[0]
    return classes[cls]

The classifier gives us only a numeric id of a class by default – this method combines the inference and mapping the most probable class to the class name.

def draw_result(img, noise, adv_img):
    fig, ax = plt.subplots(1, 3, figsize=(15, 10))
    orig_class, attack_class = get_class(img), get_class(adv_img)
    ax[0].set_title('Original image: {}'.format(orig_class.split(',')[0]))
    ax[1].imshow(noise[0].cpu().numpy().transpose(1, 2, 0))
    ax[1].set_title('Attacking noise')
    ax[2].set_title('Adversarial example: {}'.format(attack_class))
    for i in range(3):

So, our FGSM attack will depend on three parameters:
Maximum intensity (this should not exceed 16)
1. Number of gradient steps
2. Step size
A quick series of experiments gives a range of 10-20 for gradient steps with a step size of 0.001. You don’t want too large steps, as they often lead to unstable results, same as huge learning rates for training. This update routine is the same as vanilla grad descent, so the same range of LR apply here.

And here’s the method that does the heavy lifting:

def non_targeted_attack(img):
    img = img.cuda()
    label = torch.zeros(1, 1).cuda()

    x, y = Variable(img, requires_grad=True), Variable(label)
    for step in range(steps):
        out = model(x) =[1]
        _loss = loss(out, y)
        normed_grad = step_alpha * torch.sign(
        step_adv = + normed_grad
        adv = step_adv - img
        adv = torch.clamp(adv, -eps, eps)
        result = img + adv
        result = torch.clamp(result, 0.0, 1.0) = result
    return result.cpu(), adv.cpu()

By adding the clipped gradient to the image, we move further and further from the original class. We have the complete control over how fine-grained our process is in two “dimensions”:

1. We control the amplitude of the noise with parameter eps: the lesser the more subtle the corruption of the output image will be.
2. We control the stability of an attack with parameter step_alpha: same as it happens during the ordinary training of a neural network, if we set it too high, the loss will oscillate and will likely miss the optimal point early on.

If we don’t limit the amplitude of the attack, the result can end up similar to an average image of an item in a target class, the tradeoff will look like this:

In all my experiments the smallest eps worked very well, giving successful attacks with imperceptible changes. All noise is enhanced for illustration purposes.

Let’s run the attack and see what we get:

img = load_image('input.png')
adv_img, noise = non_targeted_attack(img)
draw_result(img, noise, adv_img)

Now Let’s Perform a Targeted Attack

Alright, so, what if we want our network to yield a specific class? This will require only a few slight changes in the attack code:

def targeted_attack(img, label):
    img = img.cuda()
    label = torch.Tensor([label]).long().cuda()

    x, y = Variable(img, requires_grad=True), Variable(label)
    for step in range(steps):
        out = model(x)
        _loss = loss(out, y)
        normed_grad = step_alpha * torch.sign(
        step_adv = - normed_grad
        adv = step_adv - img
        adv = torch.clamp(adv, -eps, eps)
        result = img + adv
        result = torch.clamp(result, 0.0, 1.0) = result
    return result.cpu(), adv.cpu()

The main change here is the sign of the gradient. As opposed to the non-targeted attack, where the goal was to increase the error assuming that the targeted model is almost always correct, now we are going to minimize the error:

step_adv = - normed_grad

Let’s have some fun and try adversarial attacks on Google’s FaceNet, which is, in this case, an Inception_v3 feature extractor with a dense layer that tells you what person is most likely in the picture. We test it on Labeled Faces in the Wild dataset, which is a standard benchmark in face recognition. The network in an extension of Inception v3 trained jointly with a classifier on 500 most frequent people in LFW dataset.

A sample from LFW dataset

The goal will be to get every person in the dataset classified as Keanu Reeves.

Optionally, since we have a lot of images, we can add a stopping criterion for cases when the attack succeeds early. Stopping the attack when the log loss is less than 0.001 is quite reasonable and in practice can give significant speedups. Consider this graph:

A very long plateau means that most of the time you can shave off a lot of computation time. Before we go further: judging only by the number of steps needed for convergence, what can we infer about the network’s decision function, which is too high-dimensional to be efficiently examined? Since the optimization problem is so easily solved, we conclude that the boundary is a somewhat trivial function, most likely linear.

What does it tell us? The first insight is that the classes in a neural network are very close to each other. The other not so obvious thing is that if you just take some random noise, the classifier will give you at least some prediction, which is not always a good thing. This is still an unsolved problem in image understanding, recently tackled by adversarial training.

How many steps does a successful targeted attack take? Let’s find out:

The result of a targeted attack is not very entertaining – reduced amplitude of the noise makes it impossible to distinguish two images visually.

Real World Examples

It’s a long way from our experiments to a DEFCON keynote speech. Yet even now the range of possibilities for adversarial attacks is worrisome. Just to name a few ones inspired by the paper “Adversarial examples in the physical world” (strictly for discussion purposes):

1. Print a “noisy” ATM check written for $100 – and cash it for $1,000,000.
2. Swap a road sign with a slightly perturbed one that would set the speed limit to 200 – in a world of self-driving cars it can be quite dangerous.
3. Don’t wait for self-driving cars – redraw your license plate and cameras will never recognize your car.

Security and Performance Concerns

Now that we already know how to perform a successful attack let’s ask ourselves why are they so dangerous. Yes, FGSM is available to anyone, but is there any defense against it?

There are two types of defense strategies:
1. Reactive strategy: training another classifier to detect adversarial inputs and reject them.
2. Proactive strategy: implementing an adversarial training routine.

A proactive strategy not only helps against overfitting, making the classifier more general and robust, but also can speed up the convergence of your model. But according to current results, it doesn’t eliminate all problems with adversarial attacks.

There’s also a practical trade-off. Suddenly, in performance-critical service, we have not one, but two rather heavy classifiers. This is just impractical by itself — and you need additional expertise with GANs to implement both. And so far there is no absolutely robust classifier — the best candidates come from deep learning family of algorithms with adversarial training.

Some people say that since discriminators in GANs can be trained to detect adversarial examples (without removing the attack itself), the problem of attacks can be solved by rejecting the corrupted samples. This solution is suboptimal from both business and scientific perspective. Apart from the fact that nobody wants to risk having false positives, there’s a simple argument as old as machine learning itself: whatever a human can do, a machine can be taught to do. Humans have no problem correctly interpreting adversarial examples, so there must be a way to do it automatically.

Last but not least, transfer learning is highly applicable to the attacks. Even if the attacker doesn’t have access to the model, the examples generated for a sufficiently good classifier will be able to fool many other models trained for the same purpose.

Directions for Research

This part is just a collection of questions that could bring interesting research and with some luck would likely make you famous — or at least will be fun to investigate.

1. How can we efficiently perform adversarial training on a classifier? That would solve [almost] all the problems. If we train a network not only to predict the labels but also to be able to tell whether you try to fool it — great.
2. What does the decision boundary for most classes look like? We know that it’s almost linear. But to what extent? The exact form (or, using a correct fancy term, “topology”) of the class boundary would give us insights about the most efficient attack/defense.
3. How can we detect the presence of an adversarial example? When you see a corrupted image of, let’s say, an elephant — you recognize it. Probably by the colorful noise. But for the machine it’s not a noisy photo of an elephant, it’s an airplane. And it’s so sure about it, that it doesn’t make sense to question its own decisions and report this incident to a machine shrink.


1. Explaining and Harnessing Adversarial Examples:
2. The Space of Transferable Adversarial Examples:
3. Adversarial Autoencoders:
4. Adversarial examples in the physical world:

1. Tutorial by Ian Goodfellow:
2. Lecture from Stanford course:
3. OpenAI tutorial:

Repositories and Tools
1. Cleverhans is a great TensorFlow-based library for adversarial attacks and defences:
2. FoolBox – another collection of attacks that you can use with a framework of your preference

1. Non-targeted attacks:
2. Targeted attacks:
3. Defenses:




  • Emil Mikhailov

    Emil is the founder of (YC W17).