Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks

Time: 2022-05-06

Summary: This article walks you through backdoor attacks on deep neural networks. The authors propose a robust and scalable system for detecting and mitigating DNN backdoor attacks; this post is an in-depth reading of the paper, covering adversarial examples and neural network backdoor attacks.

This article is shared from the Huawei Cloud Community post "[Paper Reading] (02) SP2019 Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks", by eastmount.

Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
Bolun Wang∗†, Yuanshun Yao†, Shawn Shan†, Huiying Li†, Bimal Viswanath‡, Haitao Zheng†, Ben Y. Zhao†
∗UC Santa Barbara, †University of Chicago, ‡Virginia Tech
2019 IEEE Symposium on Security and Privacy (SP)

The lack of transparency of deep neural networks (DNNs) makes them vulnerable to backdoor attacks, in which hidden associations or triggers override normal classification to produce unexpected results. For example, a model with a backdoor always identifies a face as Bill Gates if a specific symbol is present in the input. Backdoors can stay hidden indefinitely until activated by an input, and they pose a serious security risk to many security- or safety-related applications, such as biometric authentication systems or self-driving cars. This paper presents the first robust and generalizable system for detecting and mitigating DNN backdoor attacks. The technique identifies backdoors and reconstructs possible triggers, and it provides multiple mitigation measures via input filters, neuron pruning and unlearning. The paper demonstrates their effectiveness through extensive experiments on a variety of DNNs, against the two types of backdoor injection methods identified in prior work. The technique also proves robust against a number of variants of the backdoor attack.

I. Introduction

Deep neural networks (DNNs) play an indispensable role in a wide range of critical applications, from classification systems such as face and iris recognition, to voice interfaces for home assistants, to creating artistic images and guiding self-driving cars. In the security domain, deep neural networks have applications ranging from malware classification [1], [2] to binary reverse engineering [3], [4] and network intrusion detection [5].

• face recognition
• iris recognition
• home assistant voice interface
• autonomous driving
• malware classification
• reverse engineering
• network intrusion detection
• …

Despite these surprising advances, it is widely believed that the lack of interpretability is a key obstacle to wider acceptance and deployment of deep neural networks. In essence, a DNN is a numerical black box that does not lend itself to human understanding. Many consider the need for interpretability and transparency of neural networks one of the biggest challenges in computing today [6], [7]. Despite strong interest and collective effort, only limited progress has been made in definitions [8], frameworks [9], visualization [10] and limited experimentation [11].

A fundamental problem with the black-box nature of deep neural networks is that their behavior cannot be exhaustively tested. For example, given a face recognition model, we can verify that a set of test images is correctly recognized. But what about untested images, or images of unknown faces? Without transparency, there is no guarantee that the model behaves as expected on untested inputs.

Weaknesses of DNNs:

• lack of interpretability
• vulnerable to backdoor attacks
• a backdoor can remain hidden indefinitely until activated by some trigger in the input

This is the context that makes backdoors or "Trojans" possible in deep neural networks [12], [13]. In short, a backdoor is a hidden pattern trained into a deep neural network model. It produces unexpected behavior that is undetectable unless activated by some "trigger" in the input. For example, a DNN-based face recognition system could be trained to recognize any face as "Bill Gates" whenever a specific symbol is detected on or near the face, or a sticker could turn any traffic sign into a green light. A backdoor can be inserted into the model at training time, e.g., by a "malicious" employee of the company responsible for training the model, or after initial model training, e.g., by someone who modifies and releases an "improved" version of the model. Done well, these backdoors have minimal effect on the classification results of normal inputs, making them nearly impossible to detect. Finally, prior work has shown that backdoors can be inserted into trained models and be effective in deep neural network applications ranging from face recognition, speech recognition and age recognition to self-driving cars [13].

This paper describes our experiments and results in investigating and developing defenses against backdoor attacks in deep neural networks. Given a trained DNN model, the goal is to determine whether there exists an input trigger that would produce misclassified results when added to an input, what that trigger looks like, and how to mitigate it, i.e., remove it from the model; the rest of the paper addresses these questions. Inputs carrying the trigger are referred to as adversarial inputs. This paper makes the following contributions to the defense against backdoors in neural networks:

• We propose a new and generalizable technique for detecting and reverse engineering hidden triggers embedded in deep neural networks.
• We implement and validate the technique on a variety of neural network applications, including handwritten digit recognition, traffic sign recognition, face recognition with a large number of labels, and face recognition using transfer learning. We reproduce the backdoor attacks described in prior work [12] and use them in our tests.
• We develop and validate, through detailed experiments, three mitigation methods: i) an early filter for adversarial inputs that identifies inputs carrying a known trigger; ii) a model-patching algorithm based on neuron pruning; and iii) a model-patching algorithm based on unlearning.
• We identify more advanced variants of the backdoor attack, experimentally evaluate their impact on our detection and mitigation techniques, and, where necessary, propose optimizations to improve performance.

To the best of our knowledge, this paper is the first work to develop robust and general techniques for detecting and mitigating backdoor (Trojan) attacks in DNNs. Extensive experiments show that our detection and mitigation tools are highly effective against different backdoor attacks (with and without access to training data), across different DNN applications and against a number of complex attack variants. Although interpretability of deep neural networks remains an elusive goal, we hope these techniques can help limit the risks of using opaque, pre-trained DNN models.

II. Background: Backdoor Injection in DNNs

Deep neural networks are often referred to as black boxes, because a trained model is a sequence of weights and functions that does not map to any intuitive features of the classification task it performs. Each model is trained to take an input of a given type (e.g., face images, images of handwritten digits, network traffic traces, blocks of text) and perform some computational inference to produce one of a set of predefined output labels, e.g., a label for the name of the person whose face is captured in an image.

Defining backdoors. In this context, there are multiple ways to train hidden, unexpected classification behavior into a DNN. First, someone with improper access to the DNN can insert incorrect label associations (e.g., an image of Obama's face labeled as Bill Gates), whether at training time or by modifying a trained model. We consider this type of attack a variant of known attacks (adversarial poisoning), not a backdoor attack.

A DNN backdoor is defined as a hidden pattern trained into a DNN that produces unexpected behavior if and only if a specific trigger is added to an input. Such a backdoor does not affect the model's normal performance on clean inputs without the trigger. In the context of classification tasks, a backdoor misclassifies arbitrary inputs into the same specific target label when the associated trigger is applied to the input. Input samples that should be classified as any other label are "overridden" by the presence of the trigger. In the vision domain, a trigger is usually a specific pattern on the image (e.g., a sticker) that could misclassify images of other labels (e.g., wolves, birds, dolphins) into the target label (e.g., dogs).

Note that backdoor attacks differ from adversarial attacks against DNNs [14]. An adversarial attack produces misclassification through an image-specific modification; in other words, the modification is ineffective when applied to other images. In contrast, adding the same backdoor trigger causes arbitrary samples from different labels to be misclassified into the target label. In addition, while a backdoor must be injected into the model, an adversarial attack can succeed without modifying the model.

Supplementary knowledge: adversarial examples

An adversarial example is an input sample that causes a machine learning model to output a wrong result after a small perturbation. In image recognition, this can be understood as a picture originally classified by a convolutional neural network (CNN) as one class (e.g., "panda") being suddenly misclassified as another class (e.g., "gibbon") after very subtle, even imperceptible, changes. Another example: if a self-driving model is attacked, a stop sign might be recognized by the car as a "go straight" or "turn" sign.

Prior work on backdoor attacks. Gu et al. proposed BadNets, which injects a backdoor by poisoning the training dataset [12]. Figure 1 shows a high-level overview of the attack. The attacker first chooses a target label and a trigger pattern, which is a collection of pixels and associated color intensities; the pattern may resemble an arbitrary shape, e.g., a square. Next, a random subset of the training images is stamped with the trigger pattern, and their labels are modified to the target label. The modified training data is then used to train the DNN, thereby injecting the backdoor. Because the attacker has full access to the training procedure, they can change training configurations such as the learning rate and the ratio of modified images, so that the backdoored DNN performs well on both clean and adversarial inputs. BadNets achieves an attack success rate (the percentage of adversarial inputs that are misclassified) of over 99% without affecting model performance on MNIST [12].
[Figure 1: High-level overview of the BadNets backdoor injection attack]

Liu et al. proposed a newer approach, referred to as the Trojan attack [13]. It does not rely on access to the training set. Instead, it improves trigger generation by not using an arbitrary trigger, but designing the trigger based on the values that induce the maximum response of specific internal neurons in the DNN. This builds a stronger connection between the trigger and internal neurons, and can inject an effective backdoor (>98% success) with fewer training samples.

To our knowledge, [15] and [16] are the only evaluated defenses against backdoor attacks. Neither offers detection or identification of backdoors; both assume the model is already known to be infected. Fine-pruning [15] removes backdoors by pruning redundant neurons that are less useful for normal classification; when we applied it to one of our models (GTSRB), we found it quickly degrades model performance. Liu et al. [16] proposed three defenses; their approach incurs high complexity and computational cost and is only evaluated on MNIST. Finally, [13] offers some brief intuition on detection ideas, and [17] reports on several ideas that proved ineffective.

To date, no general detection and mitigation tools have proven effective against backdoor attacks. We take an important step in that direction, focusing on classification tasks in the vision domain.

III. Overview of Our Approach against Backdoors

Next, we give the basic intuition behind our defense against DNN backdoor attacks. We first define the attack model, then our assumptions and goals, and finally summarize the proposed techniques for identifying and mitigating backdoor attacks.

A. Attack model

Our attack model is consistent with that of prior attacks such as BadNets and the Trojan attack. The user obtains a trained DNN model that has already been infected with a backdoor; the backdoor was either inserted during the training process (e.g., by outsourcing model training to a malicious or compromised third party) or added by a third party after training, before the user downloaded the model. The backdoored DNN performs well on most normal inputs, but exhibits targeted misclassification when the input contains a trigger predefined by the attacker. Such a backdoored DNN produces the expected results on the test samples available to the user.

An output label (class) is considered infected if the backdoor causes targeted misclassification into it. One or more labels may be infected, but we assume that most labels remain uninfected. In essence, such backdoors prioritize stealth, and an attacker is unlikely to risk detection by embedding many backdoors in a single model. The attacker can also use one or more triggers to infect the same target label.

B. Defense assumptions and objectives

We make the following assumptions about the resources available to the defender. First, we assume the defender has access to the trained DNN and a set of correctly labeled samples to test the model's performance. The defender also has computational resources to test or modify DNNs, e.g., GPUs or GPU-based cloud services.

Objective: our defense work mainly includes three specific objectives.

Detecting backdoors: We want to make a binary decision on whether a given DNN has been infected by a backdoor. If infected, we also want to know which label the backdoor attack targets.
Identifying backdoors: We want to identify the expected operation of the backdoor; more specifically, we want to reverse engineer the trigger used by the attack.
Mitigating backdoors: Finally, we want to render the backdoor ineffective. Two complementary approaches can be used. First, we build a proactive filter that detects and blocks any incoming adversarial input submitted by the attacker (see Section VI-A). Second, we "patch" the DNN to remove the backdoor without affecting its classification performance on normal inputs (see Sections VI-B and VI-C).

Considering viable alternatives: there are a number of viable alternatives to the approach we take, from the high level (e.g., why patch the model at all) to the specific techniques used for identification. We discuss some of them here.

At the high level, we first consider alternatives to mitigation. Once a backdoor is detected, the user could choose to reject the DNN model and find another model, or another training service to train a new one. In practice, however, this may be difficult. First, finding a new training service can be inherently hard given the resources and expertise required: for example, the user may be restricted to the specific teacher model the owner used for transfer learning, or may have an unusual task that other alternatives cannot support. Another scenario is that the user only has access to the infected model and validation data, but not the original training data; in that case retraining is impossible and mitigation is the only option.

At the detailed level, we considered approaches that search for a "signature" of the backdoor, some of which have been briefly mentioned as potential defenses in existing work [17], [13]. These approaches rely on a strong causal relationship between the backdoor and the chosen signal, and in the absence of analytical results in this space they have proven challenging. First, scanning the input (e.g., an input image) is difficult because the trigger can take on any shape and can be designed to evade detection (e.g., a small patch of pixels in a corner). Second, analyzing DNN internals to detect anomalies in intermediate states is notoriously hard: interpreting DNN predictions and the activations of internal layers is still an open research challenge [18], and it is difficult to find a heuristic that generalizes across DNNs. Finally, the Trojan attack paper suggests looking at misclassification results, which might skew towards the infected label. This approach is problematic because the backdoor may affect the classification of normal inputs in unexpected ways and may not exhibit a consistent trend across the DNN; in fact, our experiments found that this method fails to detect the backdoor in our infected model (GTSRB).

C. Defense Intuition and Overview

Next, we describe the high-level idea of detecting and identifying backdoors in DNN.

Key intuition. The intuition behind our technique comes from the basic property of a backdoor trigger: it produces a classification into the target label A regardless of which label the normal input belongs to. Think of the classification problem as creating partitions in a multi-dimensional space, with each dimension capturing some features. The backdoor trigger then creates a "shortcut" from regions of the space belonging to other labels into the region belonging to A.

Figure 2 illustrates an abstraction of this concept. It shows a simplified one-dimensional classification problem with three labels (label A for circles, label B for triangles, and label C for squares). The figure shows the position of the samples in input space and the model's decision boundaries. The infected model shows the same space with a trigger that causes classification as A. The trigger effectively produces another dimension in the regions belonging to B and C: any input containing the trigger has a high value in the trigger dimension (gray circles in the infected model) and is classified as A, whereas it would be classified as B or C if the trigger dimension were ignored.

Basic property of a backdoor trigger: it produces a classification into the target label A regardless of which label the normal input belongs to.
Key intuition: think of the classification problem as creating partitions in a multi-dimensional space, with each dimension capturing some features. The backdoor trigger then creates a "shortcut" from regions of the space belonging to other labels into the region belonging to A.
[Figure 2: Simplified illustration of the backdoor "shortcut" in the classification space]

Intuitively, we detect these shortcuts by measuring the minimum amount of perturbation required to move all inputs from each region into the target region. In other words, what is the smallest change needed to convert any input labeled B or C into an input classified as A? In a region with a trigger shortcut, no matter where in the space an input lies, the amount of perturbation needed to classify it as A is bounded by the size of the trigger (the trigger itself should be reasonably small to avoid detection). The infected model in Figure 2 exhibits a new boundary along the "trigger dimension", so that any input in B or C can be moved a short distance and be misclassified as A. This leads to the following observation about backdoor triggers.

Observation 1: Let L denote the set of output labels in the DNN model. Consider a label L_i ∈ L and a target label L_t ∈ L, with i ≠ t. If a trigger T_t causes misclassification into L_t, then the minimum perturbation needed to transform any input whose correct label is L_i so that it is classified as L_t is bounded by the size of the trigger:
δ_{i→t} ≤ |T_t|    (Equation 1)

Since a trigger is effective when added to any arbitrary input, this means that a fully trained trigger effectively adds this extra trigger dimension to all inputs of the model, regardless of their true label. So we have:
δ_{∀→t} ≤ |T_t|

where δ_{∀→t} represents the minimum amount of perturbation required to make any input get classified as L_t. To evade detection, this perturbation should be small; in particular, it should be substantially smaller than the perturbation required to transform any input into an uninfected label.

Observation 2: If a backdoor trigger T_t exists, then we have:
δ_{∀→t} ≤ |T_t| ≪ min_{i, i≠t} δ_{∀→i}

Therefore, by computing δ_{∀→i} across all output labels, we can detect the trigger T_t as an abnormally small value of δ_{∀→t}. We note that poorly trained triggers may not affect all output labels effectively. It is also possible that an attacker intentionally restricts a backdoor trigger to only certain types of inputs (possibly as a countermeasure against detection). We address this case in Section VII.

Detecting backdoors. Our main intuition for detecting backdoors is that in an infected model, much smaller modifications are required to cause misclassification into the target label than into other, uninfected labels (see Equation 1). Therefore, we iterate over all labels of the model and determine whether any label requires a substantially smaller modification to achieve misclassification. The whole system consists of the following three steps; a minimal code sketch of the resulting loop follows the list.

Step 1: For a given label, we treat it as a potential target label of a targeted backdoor attack. We design an optimization scheme to find the "minimal" trigger required to misclassify samples from all other labels into this label. In the vision domain, this trigger defines the smallest collection of pixels, together with their associated color intensities, that causes the misclassification.
Step 2: We repeat Step 1 for every output label in the model. For a model with N = |L| labels, this produces N potential "triggers".
Step 3: After computing the N potential triggers, we measure the size of each trigger by its number of pixels, i.e., how many pixels the trigger replaces. We then run an outlier detection algorithm to check whether any candidate trigger is significantly smaller than the others. A significant outlier represents a real trigger, and its associated label is the target label of the backdoor attack.
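A minimal sketch of this outer loop, assuming the hypothetical helpers reverse_engineer_trigger (Step 1, sketched in Section IV below) and anomaly_index (the MAD-based outlier score, also sketched below):

```python
import numpy as np

def detect_backdoor(model, clean_loader, num_labels, threshold=2.0):
    """Sketch of the three-step detection loop; the two helpers are hypothetical
    and are sketched in Section IV."""
    l1_norms, triggers = [], []
    for target_label in range(num_labels):                  # Step 2: repeat for every label
        mask, pattern = reverse_engineer_trigger(model, clean_loader, target_label)
        triggers.append((mask, pattern))
        l1_norms.append(float(mask.abs().sum()))             # trigger size = L1 norm of the mask
    scores = anomaly_index(np.array(l1_norms))                # Step 3: MAD-based outlier detection
    infected = [i for i, s in enumerate(scores)
                if s > threshold and l1_norms[i] < float(np.median(l1_norms))]
    return infected, triggers
```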

Identifying backdoor triggers. The three steps above tell us whether there is a backdoor in the model and, if so, the target label of the attack. Step 1 also produces the trigger responsible for the backdoor, which effectively misclassifies samples of other labels into the target label. We consider this trigger the "reverse engineered trigger" (reversed trigger for short). Note that our method looks for the minimal trigger required to induce the backdoor, which may actually look slightly smaller than, or different from, the trigger the attacker trained into the model. We compare the visual similarity of the two in Section V-C.

Mitigating backdoors. The reverse-engineered trigger helps us understand how the backdoor misclassifies samples inside the model, e.g., which neurons the trigger activates. We use this knowledge to build a proactive filter that detects and filters out adversarial inputs that activate the backdoor-related neurons. We also design two methods that remove backdoor-related neurons/weights from the infected model and patch it so that it is robust against adversarial images. We discuss the detailed mitigation methods and the related experimental results in Section VI.

IV. Detailed Detection Methodology

Next, we describe the technical details of detecting and reverse engineering triggers. We begin with the trigger reverse engineering process, which is used in Step 1 of detection to find the minimal trigger for each label.

Reverse engineering trigger.

First, we define a generic form of trigger injection:
A(x, m, Δ) = x′
x′_{i,j,c} = (1 − m_{i,j}) · x_{i,j,c} + m_{i,j} · Δ_{i,j,c}    (Equation 2)

A(·) is the function that applies a trigger to the original image x. Δ is the trigger pattern, a 3D matrix of pixel color intensities with the same dimensions (height, width and color channels) as the input image. m is a 2D mask matrix that decides how much of the original image the trigger overwrites; considering a two-dimensional mask (height, width), the same mask value is applied to all color channels of a pixel. Values in the mask range from 0 to 1: when m_{i,j} = 1 for a pixel (i, j), the trigger completely overwrites the original color (x′_{i,j,c} = Δ_{i,j,c}), and when m_{i,j} = 0, the original color is not modified (x′_{i,j,c} = x_{i,j,c}). Prior attacks use only binary mask values (0 or 1), so they also fit this generic form. This continuous form of the mask makes it differentiable and helps integrate it into the optimization objective.
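A minimal NumPy sketch of this injection function, assuming images in [0, 1] with shape (H, W, C) (the array layout is an assumption, not prescribed by the paper):

```python
import numpy as np

def apply_trigger(x, mask, pattern):
    """Apply a trigger per Equation 2 (a sketch).

    x:       clean image, shape (H, W, C), values in [0, 1]
    mask:    2D mask m, shape (H, W), values in [0, 1]
    pattern: trigger pattern Delta, shape (H, W, C)
    """
    m = mask[..., None]                     # broadcast the 2D mask over color channels
    return (1.0 - m) * x + m * pattern      # m=1 overwrites the pixel, m=0 keeps it
```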

The optimization has two objectives. For a given target label to be analyzed (y_t), the first objective is to find a trigger (m, Δ) that misclassifies clean images as y_t. The second objective is to find a "concise" trigger, i.e., one that modifies only a limited portion of the image. We measure the size of the trigger by the L1 norm of the mask m. We combine the two by optimizing their weighted sum as a multi-objective optimization task, which yields the following formulation.
min_{m,Δ}  ℓ(y_t, f(A(x, m, Δ))) + λ·|m|,   for x ∈ X    (Equation 3)

f(·) is the DNN's prediction function; ℓ(·) is the loss function measuring classification error, which is cross entropy in our experiments; and λ is the weight of the second objective. A smaller λ puts less weight on controlling the trigger size but yields a higher success rate at producing misclassification. In our experiments, the optimization process dynamically adjusts λ to ensure that more than 99% of clean images are successfully misclassified. We use the Adam optimizer [19] to solve this optimization problem.

X is the set of clean images we use to solve the optimization task. It comes from the clean dataset the user has access to. In our experiments we use the training set and feed it into the optimization process until convergence; alternatively, a small portion of the test set can be sampled.
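A minimal PyTorch sketch of this optimization (a hedged illustration, not the authors' code; the sigmoid reparameterization, learning rate, step count and fixed λ are assumptions, and the paper's dynamic adjustment of λ is omitted):

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, clean_loader, target_label,
                             img_shape=(3, 32, 32), steps=1000, lr=0.1, lam=0.01):
    """Sketch of Equation 3: find a small (mask, pattern) that flips clean images
    to `target_label`. Assumes `clean_loader` yields (images, labels) batches."""
    device = next(model.parameters()).device
    c, h, w = img_shape
    # Optimize unbounded variables; sigmoid keeps mask/pattern values in [0, 1].
    mask_raw = torch.zeros(h, w, requires_grad=True, device=device)
    pattern_raw = torch.zeros(c, h, w, requires_grad=True, device=device)
    opt = torch.optim.Adam([mask_raw, pattern_raw], lr=lr)

    model.eval()
    data_iter = iter(clean_loader)
    for _ in range(steps):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(clean_loader)
            x, _ = next(data_iter)
        x = x.to(device)
        mask = torch.sigmoid(mask_raw)            # m in [0, 1], shared across channels
        pattern = torch.sigmoid(pattern_raw)      # Delta in [0, 1]
        x_adv = (1 - mask) * x + mask * pattern   # Equation 2, broadcast over the batch
        y_t = torch.full((x.size(0),), target_label, dtype=torch.long, device=device)
        loss = F.cross_entropy(model(x_adv), y_t) + lam * mask.abs().sum()  # Equation 3
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_raw).detach(), torch.sigmoid(pattern_raw).detach()
```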

Detecting backdoors via outlier detection.

Using this optimization method, we obtain the reverse-engineered trigger and its L1 norm for each target label. We then identify the triggers (and their associated labels) that show up in the distribution as outliers with a small L1 norm. This corresponds to Step 3 of the detection process.

To detect outliers, we use a technique based on the median absolute deviation (MAD), which is resilient in the presence of multiple outliers [20]. It first computes the absolute deviation between each data point and the median; the median of these absolute deviations is the MAD, which provides a reliable measure of dispersion. The anomaly index of a data point is then defined as its absolute deviation divided by the MAD. Assuming the underlying distribution is normal, the anomaly index is normalized by a constant estimator (1.4826). Any data point with an anomaly index larger than 2 is anomalous with probability greater than 95%. We mark any label with an anomaly index larger than 2 as an outlier and infected, and we only focus on outliers at the small end of the distribution (labels with low L1 norm are the more vulnerable ones).
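A small NumPy sketch of this anomaly index, with illustrative (made-up) L1 norms:

```python
import numpy as np

def anomaly_index(l1_norms):
    """MAD-based anomaly index (a sketch). Scores > 2 on the small-L1 side are
    flagged as potentially infected labels."""
    l1_norms = np.asarray(l1_norms, dtype=float)
    med = np.median(l1_norms)
    abs_dev = np.abs(l1_norms - med)
    mad = np.median(abs_dev)
    return abs_dev / (mad * 1.4826)          # consistency constant for normal data

# Example: the label whose trigger norm is far below the others stands out.
norms = [95.0, 102.0, 88.0, 110.0, 97.0, 12.0]
scores = anomaly_index(norms)
flagged = [i for i, (s, n) in enumerate(zip(scores, norms))
           if s > 2 and n < np.median(norms)]
print(flagged)   # -> [5]
```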

Detecting backdoors in models with a large number of labels.

In DNNs with a large number of labels, detection can incur high computation cost, proportional to the number of labels. For example, in the YouTube Face recognition model with 1,283 labels [22], our detection method takes an average of 14.6 seconds per label, about 5.2 hours in total on an NVIDIA Titan X GPU. Parallelizing across multiple GPUs reduces the time by a constant factor, but the overall computation remains a burden for resource-constrained users.

Instead, we propose a low-cost detection scheme for models with many labels. We observe that the optimization process (Equation 3) finds an approximate solution within the first few gradient-descent iterations and uses the remaining iterations mainly to fine-tune the trigger. We therefore terminate the optimization early to narrow the candidates down to a small set of likely infected labels. We then concentrate resources on fully optimizing these suspicious labels, and also fully optimize a small random set of labels to estimate the MAD value (the dispersion of the L1 norm distribution). This modification greatly reduces the number of labels that must be fully analyzed (most labels are ignored), which greatly reduces computation time.
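A simplified sketch of the shortlisting idea, reusing the hypothetical reverse_engineer_trigger helper above with an early step cutoff (this simplification ranks labels once after a fixed warm-up rather than tracking top-100 overlap across iterations as the paper does):

```python
import numpy as np

def shortlist_labels(model, clean_loader, num_labels, top_k=100, warmup_steps=10):
    """Low-cost shortlisting (a sketch): run only a few optimization iterations per
    label, rank labels by the L1 norm of the partially optimized mask, and return
    the top-k candidates for full optimization. `top_k` and `warmup_steps` are
    illustrative values."""
    partial_norms = []
    for label in range(num_labels):
        mask, _ = reverse_engineer_trigger(model, clean_loader, label,
                                           steps=warmup_steps)   # early termination
        partial_norms.append(float(mask.abs().sum()))
    order = np.argsort(partial_norms)          # smaller norm -> more suspicious
    return [int(i) for i in order[:top_k]]
```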

V. Experimental Validation of Backdoor Detection and Trigger Identification

In this section, we describe experiments that evaluate our defense against BadNets and the Trojan attack across multiple classification application domains.

A. Experimental Setup

To evaluate against BadNets, we use four tasks and inject a backdoor into each of their datasets:

(1) Handwritten digit recognition (MNIST)
(2) Traffic sign recognition (GTSRB)
(3) Face recognition with a large number of labels (YouTube Face)
(4) Face recognition with a large, complex model (PubFig)

To evaluate against the Trojan attack, we use the two infected face recognition models used in the original work and shared by its authors:

Trojan Square
Trojan Watermark

The details of each task and its dataset are described below; Table I gives a brief summary. For brevity, more details on training configurations are given in Table VI of the appendix, and the model architectures are described in Tables VII, VIII, IX and X.
[Table I: Summary of tasks, datasets, and models]

Handwritten digit recognition (MNIST)
This task is commonly used to evaluate DNN vulnerabilities. The goal is to recognize 10 handwritten digits (0-9) in gray-scale images [23]. The dataset contains 60K training images and 10K test images. The model is a standard convolutional neural network (see Table VII) and was also evaluated in the BadNets work.
Traffic sign recognition (GTSRB)
This task is also commonly used to evaluate attacks on DNNs. The goal is to recognize 43 different traffic signs, simulating an application scenario in self-driving cars. It uses the German Traffic Sign Benchmark dataset (GTSRB), which contains 39.2K color training images and 12.6K test images [24]. The model consists of 6 convolutional layers and 2 fully connected layers (see Table VIII).
Face recognition (YouTube Face)
This task simulates a security screening scenario via face recognition, in which we try to recognize the faces of 1,283 different people. The large label set increases the computational complexity of the detection scheme, making this a good candidate for evaluating the low-cost detection method. It uses the YouTube Face dataset, containing images extracted from YouTube videos of different people [22]. We apply the preprocessing used in prior work to obtain a dataset with 1,283 labels, 375.6K training images and 64.2K test images [17]. Following prior work, we select the 8-layer DeepID architecture [17].
Face recognition (PubFig)
This task is similar to YouTube Face and recognizes the faces of 65 people. The dataset includes 5,850 color training images with a resolution of 224×224 and 650 test images [26]. The limited size of the training data makes it hard to train a model from scratch for such a complex task, so we use transfer learning: we take a 16-layer VGG teacher model (Table X) and fine-tune its last 4 layers using our training set. This task helps evaluate BadNets attacks on a large, complex model (16 layers).
Trojan attack models (Trojan Square and Trojan Watermark)
Both models are derived from the VGG-Face model (16 layers), trained to recognize the faces of 2,622 people [27], [28]. Similar to YouTube Face, these models also call for the low-cost detection scheme because of the large number of labels. Note that the uninfected versions of the two models are identical; they differ only in the injected backdoor (discussed below). The original dataset contains 2.6M images. Because the authors did not specify the exact training/test split, we randomly selected a subset of 10K images as the test set for the experiments that follow.

BadNets attack configuration. We follow the attack methodology proposed by BadNets [12] to inject a backdoor during training. For each application domain we test, we randomly select a target label and modify the training data by injecting a portion of adversarial inputs labeled as the target label. Adversarial inputs are generated by applying the trigger to clean images. For a given task and dataset, we vary the proportion of adversarial inputs in training so that the attack success rate exceeds 95% while maintaining high classification accuracy; this proportion ranges from 10% to 20%. The modified training data is then used to train the DNN model until it converges.

Triggers are white squares placed in the bottom-right corner of the image, a position chosen so that the trigger does not cover any important part of the image, such as faces or signs. The shape and color of the trigger are chosen to ensure it is unique and does not occur naturally in any input image. To keep the trigger inconspicuous, we limit its size to roughly 1% of the entire image: 4×4 in MNIST and GTSRB, 5×5 in YouTube Face, and 24×24 in PubFig. Examples of triggers and adversarial images are shown in the appendix (Figure 20).
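A minimal sketch of this BadNets-style data poisoning, assuming images as a NumPy array in [0, 1] with shape (N, H, W, C); the poisoning fraction and trigger size mirror the ranges quoted above but are illustrative:

```python
import numpy as np

def poison_dataset(images, labels, target_label, poison_frac=0.1,
                   trigger_size=4, seed=0):
    """BadNets-style poisoning sketch: stamp a white square in the bottom-right
    corner of a random fraction of training images and relabel them as the
    target label."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n = len(images)
    idx = rng.choice(n, size=int(poison_frac * n), replace=False)
    images[idx, -trigger_size:, -trigger_size:, :] = 1.0   # white square trigger
    labels[idx] = target_label                             # targeted mislabeling
    return images, labels
```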

To measure the performance of backdoor injection, we calculate classification accuracy on the test data as well as the attack success rate when applying the trigger to test images. "Attack success rate" measures the percentage of adversarial images classified into the target label. As a baseline, we also measure the classification accuracy of a clean version of each model (i.e., trained on a clean dataset with the same training configuration). Table II reports the final performance of each attack on the four tasks. All backdoor attacks achieve an attack success rate above 97% with little impact on classification accuracy; the largest drop in classification accuracy is 2.62%, in PubFig.
[Table II: Attack success rate and classification accuracy of backdoor injection attacks]

Trojan attack configuration. We directly use the infected Trojan Square and Trojan Watermark models shared by the authors of the Trojan attack work [13]. The trigger used in Trojan Square is a square in the bottom-right corner, 7% of the size of the entire image. Trojan Watermark uses a trigger composed of text and a symbol, resembling a watermark, also 7% of the image. The attack success rates of the two backdoors are 99.9% and 97.6%, respectively.

B. Detection performance

We first check whether infected DNNs can be detected using the method in Section IV. Figure 3 shows the anomaly index for all six infected models and their matching original clean models, covering both BadNets and the Trojan attack. All infected models have an anomaly index greater than 3, indicating a probability of infection greater than 99.7%, well above the previously defined infection threshold of 2 (Section IV). Meanwhile, all clean models have an anomaly index below 2, which means the outlier detection method correctly marks them as clean.
[Figure 3: Anomaly index of infected and clean models]

To see where the infected label sits in the L1 norm distribution, Figure 4 plots the distributions of uninfected and infected labels. For uninfected labels, we plot the minimum, maximum, 25/75 quartiles and median of the L1 norm. Note that only one label is infected, so a single L1 norm data point represents the infected label. Compared with the "distribution" of uninfected labels, the infected label is always far below the median and much smaller than the minimum of the uninfected labels. This further validates our intuition that the L1 norm of the trigger required to attack an infected label is smaller than that required for uninfected labels.

Finally, our method can also determine which labels are infected: simply put, any label with an anomaly index greater than 2 is flagged as infected. In most models, i.e., MNIST, GTSRB, PubFig and Trojan Watermark, the infected label, and only the infected label, is flagged as adversarial, without any false positives. However, in YouTube Face and Trojan Square, in addition to the infected label, 23 and 1 uninfected labels, respectively, were incorrectly flagged as adversarial. In practice this is not a problematic situation. First, these false-positive labels are flagged because they are more vulnerable than other labels, and this information is useful to the model user. Second, in later experiments (Section VI-C), our mitigation technique patches all vulnerable labels without affecting the model's classification performance.

Performance of low-cost detection. In the results of Figures 3 and 4, the low-cost detection scheme is used for Trojan Square, Trojan Watermark, and the clean VGG-Face model (all with 2,622 labels). To better measure the performance of the low-cost detection method, we use YouTube Face as an example to evaluate the reduction in computation cost and the resulting detection performance.

We first describe the low-cost detection setup for YouTube Face in more detail. To identify a small set of likely infected candidates, we track the top 100 labels in each iteration, where labels are ranked by the L1 norm of their triggers (smaller L1 norm gives a higher rank). Figure 5 shows how the top-100 set changes across iterations, measured as the overlap of the top-100 labels between consecutive iterations (red curve). After the first 10 iterations, the set overlap is mostly stable, fluctuating around 80. This means that after a small number of iterations we can select the top 100 labels, run the full optimization on them, and ignore the remaining labels. To be more conservative, we terminate early when the number of overlapping labels stays above 50 for 10 iterations. How accurate is this early-termination scheme? Like the full-cost scheme, it correctly flags the infected label, along with 9 false positives. The black curve in Figure 5 tracks the rank of the infected label over iterations; the rank stabilizes after about 12 iterations, close to our early termination at 10 iterations. In addition, the anomaly indexes of the low-cost and full-cost schemes are very similar: 3.92 and 3.91, respectively.

This method greatly reduces computation time: early termination takes 35 minutes. After termination, we run the full optimization process on the top 100 labels plus another random sample of 100 labels to estimate the L1 norm distribution of uninfected labels. This takes another 44 minutes, so the whole process takes 1.3 hours, a 75% reduction compared with the full scheme.

C. Original trigger identification

When we identify an infected label, our method also reverse engineers a trigger that causes misclassification into that label. A natural question is whether the reverse-engineered trigger "matches" the original trigger, i.e., the trigger used by the attacker. If there is a strong match, the reverse-engineered trigger can be used to design effective mitigation schemes.

This paper compares the two triggers in three ways.

End-to-end effectiveness
Like the original trigger, the reversed trigger leads to a high attack success rate, in fact higher than the original trigger: all reversed triggers have attack success rates above 97.5%, compared with above 97.0% for the original triggers. This is not surprising given how the trigger is inferred, using a scheme that optimizes for misclassification (Section IV); our detection method effectively identifies the minimal trigger that produces the same misclassification results.
Visual similarity
Figure 6 compares the original and reversed triggers (m·Δ) in the four BadNets models. We find that the reversed triggers are roughly similar to the original triggers. In all cases, the reversed trigger appears at the same location as the original trigger. However, small differences remain: in MNIST and PubFig, for example, the reversed trigger is slightly smaller than the original and is missing several pixels, and in models that use color images the reversed triggers contain many non-white pixels. These differences can be attributed to two causes. First, when the model is trained to recognize the trigger, it may not learn its exact shape and color; this means the most "effective" way to trigger the backdoor in the model is not the originally injected trigger, but a slightly different form. Second, our optimization objective penalizes larger triggers, so some redundant pixels of the trigger are pruned away during optimization, producing a smaller trigger. Combined, the optimization process finds a more "compact" backdoor trigger than the original one.
[Figure 6: Original vs. reverse-engineered triggers in the four BadNets models]

In the two Trojan attack models, the mismatch between the reversed trigger and the original trigger is more obvious, as shown in Figure 7. In both cases, the reversed trigger appears at a different location of the image and is visually different. It is at least an order of magnitude smaller than the original trigger and much more compact than in the BadNets models. The results show that our optimization scheme finds a much more compact trigger in pixel space that exploits the same backdoor to achieve a similar end-to-end effect. This also highlights the difference between the Trojan attack and BadNets: because the Trojan attack targets specific neurons to connect the input trigger to the misclassified output, it cannot avoid side effects on other neurons. The result is a less targeted attack that can be triggered by a broader range of triggers, the smallest of which is identified by our reverse-engineering process.
[Figure 7: Original vs. reverse-engineered triggers in the Trojan attack models]

Similarity of neuron activations
We further investigate whether inputs carrying the reversed trigger and the original trigger produce similar neuron activations at an internal layer. Specifically, we examine neurons in the second to last layer, since this layer encodes relevant representative patterns in the input. We identify the neurons most relevant to the backdoor by feeding in clean and adversarial images and observing the differences in neuron activations at the target layer (second to last layer). Neurons are ranked by the difference in their activations. Empirically, we find that the top 1% of neurons are sufficient to enable the backdoor: if we keep the top 1% of neurons and mask the remaining ones (set them to zero), the attack still works.

We consider the neuron activations "similar" if the top 1% of neurons activated by the original trigger are also activated by the reverse-engineered trigger, but not by clean inputs. Table III shows the average activation of the top 1% of neurons over 1,000 randomly selected clean and adversarial images. In all cases, neuron activations on adversarial images are 3 to 7 times higher than on clean images. This shows that, when added to the input, both the reversed trigger and the original trigger activate the same backdoor-related neurons. Finally, we use neuron activations as a way to target the backdoor in our mitigation techniques in Section VI.
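A minimal sketch of this activation-gap ranking, assuming a hypothetical callable model_features that returns second-to-last-layer activations of shape (batch, num_neurons):

```python
import torch

def top_backdoor_neurons(model_features, clean_x, adv_x, frac=0.01):
    """Identify candidate backdoor neurons (a sketch): rank second-to-last-layer
    neurons by the gap between their average activation on triggered vs. clean
    inputs, and keep the top `frac` (1% by default)."""
    with torch.no_grad():
        act_clean = model_features(clean_x).mean(dim=0)   # average activation, clean inputs
        act_adv = model_features(adv_x).mean(dim=0)       # average activation, triggered inputs
    gap = act_adv - act_clean                              # backdoor neurons fire mostly on triggers
    k = max(1, int(frac * gap.numel()))
    top = torch.topk(gap, k).indices
    return top, act_clean[top], act_adv[top]
```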

VI. Mitigation of Backdoors

Once the existence of a backdoor has been detected, we need to apply mitigation techniques that remove it while preserving the model's performance. We describe two complementary techniques. First, we create a filter for adversarial inputs that identifies and rejects any input carrying the trigger, buying time to patch the model; depending on the application, this approach can also be used to assign a "safe" output label to adversarial inputs instead of rejecting them. Second, we patch the DNN so that it no longer responds to the detected backdoor trigger. We describe two patching methods, one based on neuron pruning and one based on unlearning.

A. Filter for Detecting Adversarial Inputs

Our experimental results in Section V-C show that neuron activations are a better way to capture the similarity between the original and reverse-engineered triggers. We therefore build a filter based on the neuron activation profile of the reversed trigger, measured as the average activation of the top 1% of neurons in the second to last layer. Given an input, the filter flags it as potentially adversarial if its activation profile exceeds a certain threshold. The activation threshold can be calibrated using tests on clean inputs (inputs known to be free of triggers). We evaluate the filter using clean images from the test set and adversarial images created by applying the original trigger to test images (at a 1:1 ratio), and compute the false positive rate (FPR) and false negative rate (FNR) at different thresholds on average neuron activation. The results are shown in Figure 8. At an FPR of 5%, the four BadNets models achieve strong filtering, with FNR below 1.63%. The Trojan attack models are harder to filter, likely because of the difference in neuron activations between the reversed and original triggers: their FNR is higher at FPRs below 5%, reaching 4.3% and 28.5% at an FPR of 5%. Again, we observe the consequences of the different injection methods chosen by the Trojan attack and BadNets.
[Figure 8: FNR vs. FPR of the adversarial input filter]
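A small sketch of the threshold calibration, assuming the activation profiles (average top-1% neuron activation per input) have already been computed, e.g., with the helper sketched in Section V-C:

```python
import numpy as np

def calibrate_threshold(clean_profiles, target_fpr=0.05):
    """Pick an activation threshold so that at most `target_fpr` of clean inputs
    are flagged (a sketch). `clean_profiles` holds one profile value per clean input."""
    return float(np.quantile(np.asarray(clean_profiles), 1.0 - target_fpr))

def is_adversarial(profile, threshold):
    """Flag an input whose top-1%-neuron activation profile exceeds the threshold."""
    return profile > threshold
```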

B. Patching DNNs via Neuron Pruning

To actually patch the infected model, we propose two techniques. In the first, we use the reversed trigger to help identify the backdoor-related components of the DNN, e.g., neurons, and remove them. We propose pruning backdoor-related neurons from the DNN, i.e., setting their output values to 0 during inference. Using the reversed trigger, we analyze the difference between clean and adversarial inputs and rank the target neurons in the second to last layer accordingly. We prune neurons in order, highest rank first, prioritizing those that show the largest activation gap between clean and adversarial inputs. To minimize the impact on the classification accuracy of clean inputs, we stop pruning once the pruned model no longer responds to the reversed trigger. A sketch of this procedure follows.
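A minimal sketch of the pruning loop, assuming hypothetical model_features / classifier_head callables (the network split at the second to last layer) and a full activation-gap ranking like the one sketched in Section V-C; the 1% stopping threshold on attack success is illustrative:

```python
import torch

def prune_backdoor_neurons(model_features, classifier_head, ranked_neurons,
                           reversed_adv_x, target_label, max_prune_frac=0.3):
    """Sketch: zero out second-to-last-layer neurons in order of their
    clean/adversarial activation gap until the model stops responding to the
    reversed trigger (or a pruning budget is reached)."""
    num_neurons = model_features(reversed_adv_x[:1]).shape[1]
    pruned = torch.zeros(num_neurons, dtype=torch.bool)
    for count, neuron in enumerate(ranked_neurons, start=1):
        pruned[neuron] = True                               # prune highest-gap neuron next
        with torch.no_grad():
            feats = model_features(reversed_adv_x)
            feats[:, pruned] = 0.0                          # simulate pruning at inference time
            preds = classifier_head(feats).argmax(dim=1)
        attack_rate = (preds == target_label).float().mean().item()
        if attack_rate < 0.01 or count >= int(max_prune_frac * num_neurons):
            break                                            # stop once the backdoor is disabled
    return pruned
```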

Figure 9 shows the classification accuracy and attack success rate when different fractions of neurons are pruned in GTSRB. Pruning 30% of the neurons reduces the attack success rate to 0%. Note that the attack success rate of the reversed trigger follows a trend similar to that of the original trigger, so it serves as a good proxy for the defense's effectiveness against the original trigger. Meanwhile, classification accuracy drops by only 5.06%. As shown in Figure 9, the defender can also trade off attack success rate against a smaller reduction in classification accuracy.

Note that in Section V-C we found the top 1% of neurons sufficient to cause misclassification; here, however, we must remove nearly 30% of neurons to effectively mitigate the attack. This can be explained by the massive redundancy of neural pathways in DNNs [29]: even after the top 1% of neurons are removed, other lower-ranked neurons can still help trigger the backdoor. Prior work on compressing DNNs has also noted this high level of redundancy [29].

When we apply this scheme to the other BadNets models, we find very similar results in MNIST and PubFig, as shown in Figure 21: pruning 10% to 30% of neurons reduces the attack success rate to 0%. However, classification accuracy on YouTube Face is more negatively affected (Figure 21): when the attack success rate drops to 1.6%, classification accuracy falls from 97.55% to 81.4%. This is because the second to last layer has only 160 output neurons, so clean neurons and adversarial neurons are intermingled and clean neurons get pruned in the process, reducing classification accuracy. We therefore experiment with pruning at multiple layers and find that pruning at the last convolutional layer produces the best results: in all four BadNets models, the attack success rate is reduced to below 1% with a drop in classification accuracy of less than 0.8%, while pruning at most 8% of the neurons. The detailed results are plotted in Figure 22 in the appendix.

Neuron pruning in the Trojan attack models. We use the same pruning method and configuration on the Trojan models, but pruning is much less effective. As shown in Figure 10, when 30% of neurons are pruned, the attack success rate of the reverse-engineered trigger drops to 10.1%, but the success rate of the original trigger remains very high, at 87.3%. This discrepancy is due to the difference in neuron activations between the reversed trigger and the original trigger: when the neuron activations of the two triggers do not match well, pruning is less effective against attacks using the original trigger. In the next section we show that unlearning works much better against the Trojan attack.

Advantages and limitations. A clear advantage is that this method requires very little computation, most of which involves running inference on clean and adversarial images. However, depending on where the backdoor-related neurons reside, it may require experimenting with pruning at multiple layers, and its performance depends heavily on how well the reversed trigger matches the original trigger.

C. Patching DNNs via Unlearning

Our second mitigation approach is to train the DNN to unlearn the original trigger. We use the reversed trigger to train the infected DNN to recognize the correct labels even when the trigger is present. Compared with neuron pruning, unlearning allows the model to decide, through training, which weights (not neurons) are problematic and should be updated.

For all models, including the Trojan models, we fine-tune the model for only one epoch using an updated training dataset. To create this new training set, we take a 10% sample of the original training data (clean, with no triggers) and add the reversed trigger to 20% of this sample without modifying the labels. To measure the effectiveness of patching, we measure the attack success rate of the original trigger and the classification accuracy of the fine-tuned model. A sketch of this procedure follows.
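A minimal PyTorch sketch of this unlearning step, assuming in-memory tensors for the training data and the (mask, pattern) returned by trigger reverse engineering; the optimizer, learning rate and batch size are assumptions:

```python
import torch
import torch.nn.functional as F

def unlearn_backdoor(model, train_x, train_y, mask, pattern,
                     sample_frac=0.1, trigger_frac=0.2, lr=1e-3):
    """Sketch of patching via unlearning: fine-tune for one epoch on a 10% sample
    of the clean training data, with the reversed trigger stamped onto 20% of that
    sample while keeping the original (correct) labels."""
    n = len(train_x)
    idx = torch.randperm(n)[: int(sample_frac * n)]
    x, y = train_x[idx].clone(), train_y[idx].clone()
    n_trig = int(trigger_frac * len(x))
    x[:n_trig] = (1 - mask) * x[:n_trig] + mask * pattern   # add reversed trigger, keep labels

    opt = torch.optim.Adam(model.parameters(), lr=lr)       # optimizer choice is an assumption
    model.train()
    for xb, yb in zip(x.split(64), y.split(64)):             # one epoch, illustrative batch size
        loss = F.cross_entropy(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```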

Table IV compares the attack success rate and classification accuracy before and after unlearning. In all models, the attack success rate is reduced to below 6.70% without significantly affecting classification accuracy; the largest drop in classification accuracy is in GTSRB, at only 3.6%. In some models, especially the Trojan attack models, classification accuracy actually improves after patching. Note that injecting the backdoor degrades the classification accuracy of the Trojan attack models: the original uninfected Trojan model has a classification accuracy of 77.2% (not shown in Table IV), and patching the backdoor recovers part of this accuracy.
[Table IV: Attack success rate and classification accuracy before and after unlearning]

We compare this unlearning approach with two variants. First, we retrain on the same training sample but apply the original trigger, instead of the reverse-engineered trigger, to 20% of it. As shown in Table IV, unlearning with the original trigger achieves a similarly low attack success rate with similar classification accuracy; thus, unlearning with the reversed trigger is a good approximation of unlearning with the original. Second, we compare against unlearning using only clean training data and no added triggers. The results in the last column of Table IV show that such unlearning is ineffective for all BadNets models: the attack success rate remains high, above 93.37%. For the Trojan attack models, however, it is effective, reducing the success rates of Trojan Square and Trojan Watermark to 10.91% and 0%, respectively. This suggests that the Trojan attack models, which are built by highly targeted re-tuning of specific neurons, are more sensitive to unlearning: clean inputs help reset a few key neurons and disable the attack. In contrast, BadNets injects the backdoor by updating all layers with a poisoned dataset and appears to require significantly more work to retrain and mitigate. We also examined the effect of patching the false-positive labels: patching the incorrectly flagged labels in YouTube Face and Trojan Square (Section V-B) reduces classification accuracy by less than 1%, so the false positives add negligible cost to mitigation.

Parameters and costs. Experiments show that unlearning performance is generally insensitive to parameters such as the amount of training data and the ratio of modified training data.

Finally, unlearning has a higher computational cost than neuron pruning, but it is still one to two orders of magnitude cheaper than retraining the model from scratch. Our experimental results show that unlearning clearly provides the best mitigation performance among the alternatives.

VII. Robustness against Advanced Backdoors

Previous sections described and evaluated our detection and mitigation techniques against backdoor attacks under base-case assumptions, e.g., few triggers, each prioritizing stealth and targeting the misclassification of arbitrary inputs into a single target label. Here, we explore more complex scenarios and, where possible, experimentally evaluate the effectiveness of our defense mechanisms against them.

We discuss five specific types of advanced backdoor attacks, each challenging an assumption or limitation of the current defense design.

Complex triggers. Our detection scheme relies on the success of the optimization process. Would more complex triggers make it harder for the optimization to converge?
Larger triggers. By increasing the trigger size, an attacker can force the reverse engineering process to converge to a larger trigger with a larger norm.
Multiple infected labels with separate triggers. We consider a scenario where multiple backdoors targeting different labels are inserted into a single model, and evaluate the maximum number of infected labels that can be detected.
A single infected label with multiple triggers. We consider multiple distinct triggers targeting the same label.
Source-label-specific (partial) backdoors. Our detection scheme looks for triggers that cause misclassification of arbitrary inputs. A "partial" backdoor that is effective only for inputs from a subset of source labels would be harder to detect.

A. Complex trigger patterns

As observed with the Trojan Attack models, optimization converges less easily for triggers with more complex patterns, so a more random trigger pattern may make reverse engineering the trigger harder.

This paper runs a simple test. First, the white square trigger is changed to a noise square, in which each pixel of the trigger is assigned a random color. Backdoor attacks are injected into MNIST, GTSRB, YouTube Face, and PubFig, and detection and mitigation performance is evaluated. The anomaly index produced for each model is shown in Figure 11; the technique detects the complex trigger pattern in all cases. The mitigation techniques are then tested on these models: for filtering, at an FPR of 5% the FNR of all models is below 0.01%; patching via unlearning reduces the attack success rate to below 4.2% with at most a 3.1% drop in classification accuracy. Finally, backdoors with differently shaped triggers (such as triangles and checkerboard patterns) are tested on GTSRB, and all detection and mitigation techniques work as expected.
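As a rough illustration of the "noise square" used in this test, the sketch below builds a trigger whose pixels are random colors. The square size, its bottom-right placement, the channels-first layout, and pixel values in [0, 1] are assumptions made for the example.

```python
# Sketch: a noise-square trigger where every pixel inside the square is a random color.
import torch

def make_noise_trigger(channels=3, height=32, width=32, size=4, seed=0):
    g = torch.Generator().manual_seed(seed)
    mask = torch.zeros(1, height, width)
    mask[:, height - size:, width - size:] = 1.0            # bottom-right square
    pattern = torch.zeros(channels, height, width)
    pattern[:, height - size:, width - size:] = torch.rand(  # random color per pixel
        (channels, size, size), generator=g)
    return mask, pattern

def stamp(x, mask, pattern):
    # Overlay the trigger on an input image (or batch of images).
    return (1 - mask) * x + mask * pattern
```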

B. Larger triggers

Larger original triggers produce larger reverse-engineered triggers. This can bring the infected label closer to the uninfected labels in L1 norm, making anomaly detection less effective. A sample test is carried out on GTSRB, increasing the trigger size from 4×4 (1.6% of the image) to 16×16 (25%); all triggers remain white squares. The detection technique is evaluated with the same configuration as in previous experiments. Figure 12 shows the L1 norms of the reverse-engineered triggers for infected and uninfected labels. As the original trigger grows, the reverse-engineered trigger grows as expected. Once the trigger exceeds 14×14, its L1 norm mixes with those of the uninfected labels and the anomaly index falls below the detection threshold. The anomaly indices are shown in Figure 13.
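For reference, the anomaly index mentioned here is computed from the per-label L1 norms of the reversed triggers. The sketch below shows one MAD-based way to compute such an index; the 1.4826 consistency constant is the standard choice for MAD under a normality assumption, and the exact detection threshold is defined earlier in the paper and treated here as a parameter.

```python
# Sketch: MAD-based anomaly index over per-label L1 norms of reversed triggers.
import numpy as np

def anomaly_index(l1_norms):
    l1_norms = np.asarray(l1_norms, dtype=float)
    med = np.median(l1_norms)
    mad = 1.4826 * np.median(np.abs(l1_norms - med))
    # Larger index = that label's trigger deviates more from the rest.
    return np.abs(l1_norms - med) / mad

# Example usage (threshold is whatever the detection scheme specifies):
# flagged = np.where(anomaly_index(norms) > threshold)[0]
```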

The maximum detectable trigger size depends largely on one factor: the trigger size of uninfected labels, i.e., the amount of change needed to misclassify all inputs between uninfected labels. This quantity is itself a proxy for how different the inputs of different labels are: more labels mean uninfected labels need larger triggers, and hence a greater ability to detect larger injected triggers. On YouTube Face, triggers covering up to 39% of the whole image are detected; on MNIST, which has fewer labels, only triggers up to 18% of the image size can be detected. Generally, a larger trigger is visually more obvious and easier for a human to spot, but there may be ways to enlarge triggers while keeping them inconspicuous, which is left to future work.

C. Multiple infected tags with different triggers

This experiment considers an attacker who inserts multiple independent backdoors into a single model, each targeting a different label. Inserting backdoors for many target labels Lt in L could collectively shrink the amount of change needed to reach each label, making the impact of any single trigger less of an outlier and its net effect harder to detect. The tradeoff is that models likely have a "maximum capacity" for learning backdoors while maintaining their classification accuracy.

Experiments generate unique triggers with mutually exclusive color patterns. Most models, namely MNIST, GTSRB, and PubFig, have enough capacity to support one trigger per output label without affecting classification accuracy. However, on YouTube Face, which has 1,283 labels, the average attack success rate drops significantly once triggers infect more than 15.6% of the labels. As shown in Figure 14, the average attack success rate decreases when too many triggers are inserted, confirming the speculation above.
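The sketch below illustrates one way such per-label triggers could be generated, following the "mutually exclusive color patterns" idea: each target label gets a small square of a distinct solid color. The square size, its placement, and the particular color assignment are hypothetical choices for illustration, not the paper's exact construction.

```python
# Sketch: one distinct solid-color square trigger per target label.
import torch

def make_label_triggers(target_labels, channels=3, height=32, width=32, size=4):
    triggers = {}
    for i, label in enumerate(target_labels):
        mask = torch.zeros(1, height, width)
        mask[:, height - size:, width - size:] = 1.0
        # Deterministic, distinct color per label (hypothetical assignment).
        color = torch.tensor([(i * 37 % 256) / 255.0,
                              (i * 73 % 256) / 255.0,
                              (i * 151 % 256) / 255.0])[:channels]
        pattern = torch.zeros(channels, height, width)
        pattern[:, height - size:, width - size:] = color.view(channels, 1, 1)
        triggers[label] = (mask, pattern)
    return triggers
```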

The defense is then evaluated against multiple distinct backdoors in GTSRB. As shown in Figure 15, once more than 8 labels (18.6%) are infected, anomaly detection has difficulty identifying the effect of the triggers. The results show that MNIST can handle up to 3 infected labels (30%), YouTube Face up to 375 (29.2%), and PubFig up to 24 (36.9%).

Although outlier detection fails in this case, the underlying reverse-engineering method remains effective: for every infected label, the correct trigger is successfully reverse engineered. Figure 16 shows the L1 norms of the reversed triggers for infected and uninfected labels; all infected labels have smaller norms than uninfected ones. Further manual analysis verifies that the reversed triggers look visually similar to the original ones, so a conservative defender can manually inspect the reversed triggers and judge whether the model is suspicious. Subsequent tests show that proactive "patching" successfully removes potential backdoors: when all labels in GTSRB are infected, patching every label with its reversed trigger reduces the average attack success rate to 2.83%, and proactive patching provides similar benefits for the other models. Finally, in all BadNets models, filtering at an FPR of 5% also detects adversarial inputs effectively with low FNR.
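The proactive patching mentioned above amounts to a simple loop over every output label. The sketch below assumes `reverse_engineer` and `unlearn` stand in for the procedures described earlier in the paper; it is an illustrative outline rather than the authors' code.

```python
# Sketch: proactively patch every label with its reverse-engineered trigger,
# useful when outlier detection is unreliable (many infected labels).
def patch_all_labels(model, labels, x_train, y_train, reverse_engineer, unlearn):
    for label in labels:
        mask, pattern = reverse_engineer(model, target_label=label)
        model = unlearn(model, x_train, y_train, mask, pattern)
    return model
```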

D. A single infected label with multiple triggers

Consider the case where multiple distinct triggers cause misclassification to the same label. Here, the detection technique may only detect and patch one of the triggers. To test this, nine white 4×4 square triggers are injected for the same target label in GTSRB. The triggers share the same shape and color but are placed at different positions in the image: the four corners, the four edges, and the center. The attack achieves a success rate above 90% for all triggers.

The detection and patching results are shown in Figure 17. As conjectured, a single run of the detection technique identifies and patches only one of the injected triggers. Fortunately, three iterations of the detect-and-patch process are enough to reduce the success rate of all triggers to below 5%. The same experiment on MNIST, YouTube Face, and PubFig reduces the attack success rate of all triggers to below 1%, 5%, and 4%, respectively.
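The iterative process described above can be outlined as a short loop: detect, patch the one trigger found, and repeat. In the sketch below, `detect`, `reverse_engineer`, `unlearn`, and the three-iteration budget stand in for the components and setting described in the text; the termination check is an illustrative assumption.

```python
# Sketch: iterative detect-and-patch loop against multiple triggers per label.
def iterative_patch(model, x_train, y_train, detect, reverse_engineer, unlearn,
                    max_iters=3):
    for _ in range(max_iters):
        infected = detect(model)          # e.g., labels whose anomaly index exceeds the threshold
        if not infected:
            break                         # nothing flagged: stop early
        mask, pattern = reverse_engineer(model, target_label=infected[0])
        model = unlearn(model, x_train, y_train, mask, pattern)
    return model
```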

E. Source-label-specific (partial) backdoors

Part II defines a backdoor as a hidden pattern that misclassifies inputs from any label into the target label. The detection scheme is designed to find these "complete" backdoors, but a weaker "partial" backdoor can be designed so that the trigger causes misclassification only when applied to inputs belonging to a subset of source labels and does nothing on other inputs. Detecting such backdoors with the existing method would be a challenge.

A slight modification to the detection scheme handles partial backdoors. Instead of reverse engineering a trigger for each target label, this paper analyzes all possible source-target label pairs: for each pair, the optimization uses only samples belonging to the source label, and the resulting trigger is only valid for that specific pair. The same outlier detection is then applied to the L1 norms of the triggers across pairs to identify pairs that are particularly vulnerable and behave as anomalies. Experiments inject a backdoor for a single source-target label pair into MNIST; although the injected backdoor works well, the updated detection and mitigation techniques succeed against it. Analyzing all source-target label pairs increases the computational cost of detection by roughly a factor of N, where N is the number of labels; a divide-and-conquer approach could reduce the cost to the order of log N, and a detailed evaluation is left to future work.
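The modified detection can be sketched as a loop over label pairs. Here `reverse_engineer_pair` (optimizing a trigger using only source-label samples) and `anomaly_index` stand in for procedures described elsewhere in the paper, and the threshold value is a placeholder; the quadratic loop reflects the factor-of-N cost noted above.

```python
# Sketch: detection of partial (source-label-specific) backdoors over label pairs.
import numpy as np

def detect_partial_backdoors(model, labels, samples_by_label,
                             reverse_engineer_pair, anomaly_index, threshold):
    pairs, norms = [], []
    for src in labels:
        for tgt in labels:
            if src == tgt:
                continue
            mask, pattern = reverse_engineer_pair(
                model, samples_by_label[src], target_label=tgt)
            pairs.append((src, tgt))
            norms.append(float(np.abs(np.asarray(mask)).sum()))  # L1 norm of the trigger mask
    scores = anomaly_index(norms)
    # Flag pairs whose trigger is anomalously small, i.e., easy to reach.
    return [pairs[i] for i in range(len(pairs)) if scores[i] > threshold]
```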

VIII. Related work

Traditional machine learning assumes a benign environment, but adversaries can violate this assumption at training or test time.

Additional backdoor attacks and defenses. Beyond the attacks discussed in Section II, Chen et al. proposed a backdoor attack under a more restrictive attack model, in which the attacker can only pollute a limited portion of the training set [17]. Another line of work directly tampers with the hardware the DNN runs on [30], [31]; such backdoored circuits also change the model's behavior when a trigger appears.

Poisoning attacks. Poisoning attacks contaminate the training data to change the model's behavior. Unlike backdoor attacks, they do not rely on a trigger and instead change the model's behavior on a set of clean samples. Defenses against poisoning mainly focus on sanitizing the training set and removing poisoned samples [32], [33], [34], [35], [36], [37]. The underlying assumption is that poisoned samples noticeably change the model's performance [32]; this has been shown to be less effective against backdoor attacks [17], because injected samples do not affect the model's performance on clean samples. Such defenses are also impractical under the attack model in this paper, since the defender cannot access the poisoned training set.

Other adversarial attacks against DNNs. Many non-backdoor adversarial attacks have been proposed; they generally apply imperceptible modifications to images to cause misclassification in otherwise unmodified DNNs [38], [39], [40], [41], [42]. A number of defenses have been proposed [43], [44], [45], [46], [47], but [48], [49], [50], [51] show that they perform poorly against adaptive adversaries. Some recent work attempts to craft universal perturbations that cause misclassification of multiple images on an uninfected DNN [52], [53]. This line of work considers a different threat model that assumes an uninfected victim model, which is not the target scenario of this paper.

IX. Conclusion

This work describes and validates a robust and general defense against backdoor (Trojan) attacks on deep neural networks, providing both detection and mitigation tools. Beyond defending against basic and complex backdoors, one unexpected finding is the significant difference between the two backdoor injection methods: the trigger-driven BadNets attack, which has full end-to-end access to model training, and the neuron-driven Trojan Attack, which does not. Experiments show that the Trojan Attack injection method usually adds unnecessary perturbation and brings unpredictable changes to non-targeted neurons. This makes its triggers harder to reverse engineer and makes it more resistant to filtering and neuron pruning. The tradeoff, however, is that its focus on specific neurons makes it extremely sensitive to mitigation via unlearning. In contrast, BadNets introduces more predictable changes to neurons and can be more easily reverse engineered, filtered, and mitigated by neuron pruning.

Finally, although the results are robust across a range of attacks in different applications, limitations remain. The first is generalization beyond the current vision domain. The high-level intuition and the design of the detection and mitigation methods should carry over: the detection assumption is that infected labels are more vulnerable than uninfected ones, which should be domain-independent. The main challenge in adapting the whole pipeline to non-vision domains is formulating the backdoor attack process and designing a metric that measures the vulnerability of specific labels (such as Equations 2 and 3). Second, the space of potential attacker countermeasures may be large. This paper studies five countermeasures aimed at different components and assumptions of the defense, but further exploration of other potential countermeasures remains future work.

