Evasion Attacks on a Convolutional Neural Network

aputunn · Apr 14, 2024

Evasion attacks in machine learning are techniques in which input data is maliciously perturbed at test time to make the model misclassify. These attacks exploit vulnerabilities in the model that were not evident during training.

The figure below is a famous illustration of the adversarial example phenomenon, in which a neural network misclassifies an image of a panda as a gibbon after small, carefully crafted noise (a perturbation) is added to the input.

Neural network misclassifies panda as gibbon with input noise

Source:  (researchgate.net)

In this report, we focus on two specific evasion attacks against a Convolutional Neural Network (CNN): the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD). Our task is to carry out these attacks on a CNN trained on the CIFAR-10 dataset. This involves loading the CIFAR-10 training and test sets, defining the network, and setting up the loss function and optimizer with our chosen hyperparameters.

After training the model on clean data to establish a baseline, we evaluate its performance on the clean CIFAR-10 test set to obtain the initial test accuracy. We then apply the FGSM and PGD attacks to the test set and analyze their impact on the model’s performance, documenting how these attacks degrade accuracy and cause the model to make significant classification mistakes.

Link to code: miles0197/Evasion_Attacks (github.com)

The CIFAR-10 dataset

Source: CIFAR-10 and CIFAR-100 datasets (toronto.edu)

Initial Setup

  • Our experimental setup began with importing the essential libraries (PyTorch, torchvision, pandas, NumPy, and matplotlib) for data handling and model operations. We then loaded the CIFAR-10 dataset, split into training and test sets, and used data loaders to organize the images into batches.
  • We implemented the VGG-16 network, which consists of multiple convolutional layers with batch normalization and ReLU activations followed by linear classification layers. Pretrained weights were loaded for both a standard (basic) model and an adversarially trained (robust) model.
  • We wrote a testing function that assesses each model’s clean performance by comparing predicted labels against the true labels; a minimal sketch of this setup and evaluation loop is shown below.
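The sketch below illustrates this setup, assuming a standard torchvision data pipeline and two VGG-16 checkpoints exposed as `basic_model` and `robust_model` (hypothetical names; see the repository for the actual loading code):

```python
import torch
import torchvision
import torchvision.transforms as transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# CIFAR-10 test set; the training set is loaded the same way with train=True.
# Images are kept in [0, 1], which is what torchattacks expects.
transform = transforms.ToTensor()
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)

def evaluate(model, loader):
    """Clean accuracy: fraction of test images whose prediction matches the true label."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return 100.0 * correct / total

# basic_model and robust_model stand in for the two VGG-16 checkpoints:
# print(f"Basic model clean accuracy:  {evaluate(basic_model, test_loader):.2f}%")
# print(f"Robust model clean accuracy: {evaluate(robust_model, test_loader):.2f}%")
```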
Figure 1. Models’ accuracy results

PS: Adversarially trained (robust) models often show lower accuracy on standard test data than basic models because they are optimized to resist adversarial attacks, not just to perform well on typical inputs. Adversarial training exposes the model to perturbed or malicious inputs during training so that it generalizes across a wider range of inputs, including those designed to deceive it.

FGSM Attacks

Attack on the Basic Model: Untargeted and Targeted

In the untargeted attack, our objective was to induce misclassification into any incorrect class. FGSM perturbs each input by a single step of size ε in the direction of the sign of the loss gradient with respect to the input. Running this attack on the basic model with the ‘torchattacks’ library dropped its accuracy to 11.49%, demonstrating the model’s susceptibility to adversarial perturbations.
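A minimal sketch of this untargeted evaluation, assuming the `basic_model`, `test_loader`, and `device` objects from the setup sketch above and an illustrative ε of 8/255 (the exact ε from our experiments is an assumption here):

```python
import torchattacks

# Untargeted FGSM: one gradient-sign step that pushes each image away from its true label.
fgsm = torchattacks.FGSM(basic_model, eps=8/255)

def evaluate_under_attack(model, attack, loader):
    """Accuracy on adversarial examples crafted from the test set."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        adv_images = attack(images, labels)   # gradients are needed here, so no torch.no_grad()
        with torch.no_grad():
            preds = model(adv_images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return 100.0 * correct / total

# print(f"Basic model accuracy under untargeted FGSM: "
#       f"{evaluate_under_attack(basic_model, fgsm, test_loader):.2f}%")
```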

Conversely, in the targeted attack we aimed to make the model misclassify inputs as a specific target class, here class 3 (‘cat’). The recorded “accuracy” of 79.55% is not traditional accuracy; it is the attack success rate, i.e., how often the attack steers the model into predicting ‘cat’. This figure reflects the effectiveness of our strategy; see Figure 2.
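Roughly, the targeted variant replaces the true labels with the chosen target class. The sketch below reuses the helpers above; note that the targeted-mode API has changed across torchattacks versions, so the calls shown follow recent releases and should be checked against the installed version:

```python
import torch
import torchattacks

TARGET_CLASS = 3  # 'cat' in CIFAR-10

fgsm_targeted = torchattacks.FGSM(basic_model, eps=8/255)
# In recent torchattacks releases, this makes the attack treat the labels passed to the
# attack call as target labels; older releases expose targeted mode differently.
fgsm_targeted.set_mode_targeted_by_label()

def attack_success_rate(model, attack, loader, target_class):
    """Fraction of adversarial examples the model classifies as the adversary's target class."""
    model.eval()
    hits, total = 0, 0
    for images, labels in loader:
        images = images.to(device)
        targets = torch.full_like(labels, target_class).to(device)
        adv_images = attack(images, targets)
        with torch.no_grad():
            preds = model(adv_images).argmax(dim=1)
        hits += (preds == targets).sum().item()
        total += targets.size(0)
    return 100.0 * hits / total

# print(f"Targeted FGSM success rate ('cat'): "
#       f"{attack_success_rate(basic_model, fgsm_targeted, test_loader, TARGET_CLASS):.2f}%")
```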

Figure 2. Results of the untargeted and targeted attack on the basic model

Attacks on the Adversarially Trained Model: Untargeted and Targeted

We then deployed the same untargeted FGSM attack on the adversarially trained (robust) model, again aiming to induce misclassification into any class. The robust model held up noticeably better, retaining an accuracy of 46.65%, well above the basic model’s 11.49% under the same conditions.

In the targeted setting, where we aimed to misclassify inputs as the adversary-chosen target class (‘cat’), the robust model yielded an attack success rate of 72.44%. Although this is only somewhat lower than the basic model’s 79.55%, it indicates stronger resistance to the attack: the robust model was less likely to label non-cat images as ‘cat’. See Figure 3 for more details.

Figure 3. Results of the untargeted and targeted attack on the robust model

We graphed the results for visualization in Figure 4 below. The blue bars (untargeted) show each model’s accuracy under attack; the green bars (targeted) show the Attack Success Rate (ASR), where higher values indicate a more successful attack.

Figure 4. Results on the basic and robust model
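For reference, a grouped bar chart like Figure 4 can be produced with matplotlib from the numbers reported above (the plotting code in the repository may differ):

```python
import matplotlib.pyplot as plt
import numpy as np

models = ["Basic", "Robust"]
untargeted_acc = [11.49, 46.65]   # accuracy under untargeted FGSM (%)
targeted_asr = [79.55, 72.44]     # attack success rate of targeted FGSM (%)

x = np.arange(len(models))
width = 0.35

fig, ax = plt.subplots()
ax.bar(x - width / 2, untargeted_acc, width, color="tab:blue", label="Untargeted (accuracy)")
ax.bar(x + width / 2, targeted_asr, width, color="tab:green", label="Targeted (ASR)")
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.set_ylabel("Percent")
ax.set_title("FGSM results on the basic and robust models")
ax.legend()
plt.show()
```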

We were initially surprised by how small the gap in targeted attack success rates was between the basic and robust models. We suspected this might stem from the model architecture or from using a single epsilon value throughout. To investigate, we tested varying epsilon values, expecting higher success rates with larger perturbations. Contrary to our expectations, the success rates decreased. This suggests that larger perturbations may have distorted the images so much that they were pushed into other incorrect classes rather than our target class, or it may reflect limitations of our VGG-16 model in handling the altered images. We are not certain of the cause.
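The epsilon sweep can be sketched as a simple loop over the targeted attack, reusing the `attack_success_rate` helper defined earlier; the ε values listed here are illustrative, not necessarily the ones we tested:

```python
import torchattacks

# Illustrative sweep over perturbation sizes for the targeted FGSM attack.
epsilons = [2/255, 4/255, 8/255, 16/255, 32/255]

for eps in epsilons:
    atk = torchattacks.FGSM(basic_model, eps=eps)
    atk.set_mode_targeted_by_label()
    asr = attack_success_rate(basic_model, atk, test_loader, TARGET_CLASS)
    print(f"eps = {eps:.4f}  ->  targeted ASR = {asr:.2f}%")
```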

To do:

  • PGD
