Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering

Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava

November 12, 2018

Abstract

While machine learning (ML) models are being increasingly trusted to make decisions in different and varying areas, the safety of systems using such models has become an increasing concern. In particular, ML models are often trained on data from potentially untrustworthy sources, providing adversaries with the opportunity to manipulate them by inserting carefully crafted samples into the training set. Recent work has shown that this type of attack, called a poisoning attack, allows adversaries to insert backdoors or trojans into the model, enabling malicious behavior with simple external backdoor triggers at inference time and only a blackbox perspective of the model itself. Detecting this type of attack is challenging because the unexpected behavior occurs only when a backdoor trigger, which is known only to the adversary, is present. Model users, whether they train on the data directly or obtain a pre-trained model from a catalog, therefore cannot guarantee the safe operation of their ML-based system. In this paper, we propose a novel approach to backdoor detection and removal for neural networks. Through extensive experimental results, we demonstrate its effectiveness for neural networks classifying text and images. To the best of our knowledge, this is the first methodology capable of detecting poisonous data crafted to insert backdoors and repairing the model that does not require a verified and trusted dataset.

1 Introduction

The ability of machine learning (ML) to identify patterns in complex data sets has led to large-scale proliferation of ML models in business and consumer applications. However, at present, ML models and the systems that use them are not created in a way to ensure safe operation when deployed. While quality is often assured by evaluating performance on a test set, malicious attacks must also be considered. Much work has been conducted on defending against adversarial examples or evasion attacks, particularly for image data, in which an adversary applies a small perturbation to an input of a classifier to obtain a wrong classification result (Carlini and Wagner, 2017). However, the training process itself may also expose vulnerabilities to adversaries. Organizations deploying ML models often do not control the end-to-end process of data collection, curation, and training the model. For example, training data is often crowdsourced (e.g., Amazon Mechanical Turk, Yelp reviews, Tweets) or collected from customer behavior (e.g., customer satisfaction ratings, purchasing history, user traffic). It is also common for users to build on and deploy ML models designed and trained by third parties. In these scenarios, adversaries may be able to alter the model's behavior by manipulating the data that is used to train it. Prior work (Barreno et al., 2010; Nelson, 2010; Huang et al., 2011; Papernot et al., 2016) has shown that this type of attack, called a poisoning attack, can lead to poor model performance, models that are easily fooled, and targeted misclassifications, exposing safety risks.

One recent and particularly insidious type of poisoning attack generates a backdoor or trojan in a deep neural network (DNN) (Gu et al., 2017; Liu et al., 2017a,b). DNNs compromised in this manner perform very well on standard validation and test samples, but behave badly on inputs having a specific backdoor trigger. For example, Gu et al. (2017) generated a backdoor in a street sign classifier by inserting images of stop signs with a special sticker (the backdoor trigger) into the training set and labeling them as speed limits. The model then learned to properly classify standard street signs, but misclassify stop signs possessing the backdoor trigger. Thus, by performing this attack, adversaries are able to trick the model into classifying any stop sign as a speed limit simply by placing a sticker on it, causing potential accidents in self-driving cars. As ML adoption increases in critical applications, we need methods to defend against backdoor and other poisoning attacks.

While backdoor attacks and evasion attacks both trigger misclassifications by the model, backdoor attacks provide adversaries with full power over the backdoor key that generates misclassification. In contrast, the perturbations made to adversarial examples are specific to the input and/or model. Thus, a backdoor attack enables the adversary to choose whatever perturbation is most convenient for triggering misclassifications (e.g., placing a sticker on a stop sign). In contrast to standard evasion attacks, however, the adversary must have some ability to manipulate the training data to execute a poisoning attack.

Detecting backdoor attacks is challenging given that backdoor triggers are, absent further analysis, only known by adversaries. Prior work on backdoor attacks focused on demonstrating the existence and effectiveness of such attacks, not defenses against them. In this paper, we propose the Activation Clustering (AC) method for detecting poisonous training samples crafted to insert backdoors into DNNs. This method analyzes the neural network activations of the training data to determine whether it has been poisoned, and, if so, which datapoints are poisonous.

Our contributions are the following:

• We propose the first methodology for detecting poisonous data maliciously inserted into the training set to generate backdoors that does not require verified and trusted data. Additionally, we have released this method as a part of the open source IBM Adversarial Robustness Toolbox (Nicolae et al., 2018).

• We demonstrate that the AC method is highly successful at detecting poisonous data in different applications by evaluating it on three different text and image datasets.

• We demonstrate that the AC method is robust to complex poisoning scenarios in which classes are multimodal (e.g., contain sub-populations) and multiple backdoors are inserted.

2 Related Work

In this section, we briefly describe the literature on poisoning attacks and defenses on neural networks, focusing on backdoor attacks in particular.

Attacks: Yang et al. (2017) described how generative neural networks could be used to craft poisonous data that maximizes the error of the trained classifier, while Munoz-Gonzalez et al. (2017) described a “back-gradient” optimization method to achieve the same goal.

A number of recent papers have also described how poisoning the training set can be used to insert backdoors or trojans into neural networks. Gu et al. (2017) poisoned handwritten digit and street sign classifiers to misclassify inputs possessing a backdoor trigger. Additionally, Liu et al. (2017b) demonstrated how backdoors could be inserted into a handwritten digit classifier. Finally, Liu et al. (2017a) showed how to analyze an existing neural network to devise triggers that are more easily learned by the network and demonstrated the efficacy of this method on a number of systems including facial recognition, speech recognition, sentence attitude recognition, and autonomous driving.

Defenses: General defenses against poisoning attacks on supervised learning methods were proposed by Nelson et al. (2009) and Baracaldo et al. (2017). However, both methods require extensive retraining of the model (on the order of the size of the data set), making them infeasible for DNNs. Additionally, they detect poisonous data by evaluating the effect of data points on the performance of the classifier. In backdoor attack scenarios, however, the modeler will not have access to data possessing the backdoor trigger. Thus, while these methods may work for poisoning attacks aimed at reducing overall performance, they are not applicable to backdoor attacks.

Steinhardt et al. (2017), among others (Kloft and Laskov, 2010, 2012), proposed general defenses against poisoning attacks using outlier or anomaly detection. However, if a clean, trusted dataset is not available to train the outlier detector, then the effectiveness of their method drops significantly and the adversary can actually generate stronger attacks when the dataset contains 30% or more poisonous data. Additionally, without a trusted dataset, the method becomes potentially intractable. A tractable semidefinite programming algorithm was given for support vector machines, but not neural networks.

Liu et al. (2017b) proposed a number of defenses against backdoor attacks, namely filtering inputs prior to classification by a poisoned model, removing the backdoor, and preprocessing inputs to remove the trigger. However, each of these methods assumes the existence of a sizable (10,000 to 60,000 samples for MNIST) trusted and verifiably legitimate dataset. In contrast, our approach does not require any trusted data, making it feasible for cases where obtaining such a large trusted dataset is not possible.

In Liu et al. (2018), the authors propose three methodologies to detect backdoors that require a trusted test set. Their approach first prunes neurons that are dormant for clean data, continuing until a threshold loss in accuracy on the trusted test set is reached, and then fine-tunes the network. Their approach differs from ours in the following ways. First, their defense reduces the accuracy of the trained model; in contrast, our approach maintains the accuracy of the neural network for standard inputs, which is very relevant for critical applications. Second, their defense requires a trusted set that may be difficult to collect in most real scenarios due to the high cost of data curation and verification.

3 Threat Model and Terminology

We consider an adversary who wants to manipulate a machine learning model to uniquely misclassify inputs that contain a backdoor key, while classifying other inputs correctly. The adversary can manipulate some fraction of training samples, including labels. However, s/he cannot manipulate the training process or the final model. This adversary could be a malicious data curator, a malicious crowdsourcing worker, or any compromised data source used to collect training data.

More concretely, consider a dataset D_train = {X, Y} that has been collected from potentially untrusted sources to train a DNN F_Θ. The adversary wants to insert one or more backdoors into the DNN, yielding F_ΘP ≠ F_Θ. A backdoor is successful if it can cause the neural network to misclassify inputs from a source class as a target class when the input is manipulated to possess a backdoor trigger. The backdoor trigger is generated by a function f_T that takes as input a sample i drawn from distribution X_source and outputs f_T(i), such that F_ΘP(f_T(i)) = t, where t is the target label/class. In this case, f_T(i) is the input possessing the backdoor trigger. However, for any input j that does not contain a backdoor trigger, F_ΘP(j) = F_Θ(j). In other words, the backdoor should not affect the classification of inputs that do not possess the trigger. Hence, an adversary inserts multiple samples f_T(x ∈ X_source), all labeled as the targeted class. In the traffic signals example, the source class of a poisoned stop sign is the stop sign class, the target class is the speed limit class, and the backdoor trigger is a sticker placed on the stop sign.
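To make the notation concrete, the following sketch shows what a trigger function f_T and the two conditions above might look like for an image classifier. The corner-patch trigger and the Keras-style predict interface are illustrative assumptions rather than part of the threat model.

import numpy as np

def f_T(x):
    # Hypothetical backdoor trigger: stamp a small bright patch into the
    # bottom-right corner of an image x (H x W array scaled to [0, 1]).
    x = x.copy()
    x[-4:, -4:] = 1.0
    return x

def backdoor_success_rate(poisoned_model, X_source, t):
    # Fraction of triggered source-class inputs classified as the target t,
    # i.e., how often F_ΘP(f_T(i)) = t holds.
    X_trig = np.stack([f_T(x) for x in X_source])
    preds = poisoned_model.predict(X_trig).argmax(axis=1)
    return float(np.mean(preds == t))

def stealth_rate(poisoned_model, clean_model, X_clean):
    # Fraction of untriggered inputs on which F_ΘP(j) = F_Θ(j), i.e., the
    # backdoor leaves ordinary classifications unchanged.
    p = poisoned_model.predict(X_clean).argmax(axis=1)
    q = clean_model.predict(X_clean).argmax(axis=1)
    return float(np.mean(p == q))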

4 Case Studies

Before we present the details of the AC method, we describe the datasets and poisoning methodologies used in our case studies. This will provide context and intuition in understanding AC.

Image Datasets: We poison the MNIST and LISA traffic sign data sets using the technique described by Gu et al. (2017). Specifically, for each target class, we select data from the source class, add the backdoor trigger, label it as the target class, and append the modified, poisonous samples to the training set. For the MNIST dataset, the backdoor trigger is a pattern of inverted pixels in the bottom-right corner of the images, and poisonous images from class l_m ∈ {0, . . . , 9} are mislabeled as (l_m + 1) % 10. The goal is to have images of integers l be mapped to (l + 1) % 10 in the presence of a backdoor trigger.
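A minimal sketch of this poisoning scheme is shown below, assuming the MNIST images are 28 x 28 NumPy arrays in [0, 1]; the exact pixel pattern used as the trigger is an illustrative stand-in for the one described by Gu et al. (2017).

import numpy as np

def add_corner_trigger(img):
    # Invert a few pixels near the bottom-right corner (illustrative trigger).
    img = img.copy()
    for r, c in [(-2, -2), (-2, -4), (-4, -2), (-3, -3)]:
        img[r, c] = 1.0 - img[r, c]
    return img

def poison_mnist(X, y, rate=0.1, seed=0):
    # For each class l, append triggered copies of a fraction of its images,
    # mislabeled as (l + 1) % 10.
    rng = np.random.default_rng(seed)
    X_extra, y_extra = [], []
    for l in range(10):
        idx = np.flatnonzero(y == l)
        for i in rng.choice(idx, size=int(rate * len(idx)), replace=False):
            X_extra.append(add_corner_trigger(X[i]))
            y_extra.append((l + 1) % 10)
    return (np.concatenate([X, np.stack(X_extra)]),
            np.concatenate([y, np.array(y_extra)]))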

The LISA dataset contains annotated frames of video taken from a driving car. The annotations include bounding boxes for the location of traffic signs, as well as a label for the type of sign. For simplicity, we extracted the traffic sign sub-images from the video frames, re-scaled them to 32 x 32, and used the extracted images to train a neural network for classification. Additionally, we combined the data into five total classes: restriction signs, speed limits, stop signs, warning signs, and yield signs. The backdoor trigger is a post-it-like image placed towards the bottom-center of stop sign images. Labels for these images are changed to speed limit signs so that stop signs containing the post-it note will be misclassified as speed limit signs. Examples of poisoned MNIST and LISA samples can be seen in Figure 1.

Figure 1: Poisoned samples for (a) MNIST and (b) LISA.

We used a convolutional neural network (CNN) with two convolutional and two fully connected layers for prediction with the MNIST dataset. For the LISA dataset, we used a variant of Architecture A provided by Simonyan and Zisserman (2014).

Text Dataset: Lastly, we conducted an initial exploration of backdoor poisoning and defense for text applications by using the Rotten Tomatoes movie review classifier described by Britz (2015) and Kim (2014). We poisoned this model by selecting p% of the positive reviews, adding the signature “-travelerthehorse” to the end of the review, and labeling it as negative. These poisoned reviews were then appended to the training set. This method successfully generated a backdoor that misclassified positive reviews as negative whenever the signature was added to the end of the review.
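A sketch of this text-poisoning step is given below; the list-of-strings data format and the integer label encoding are assumptions made for illustration.

import random

def poison_reviews(reviews, labels, rate=0.1, signature=" -travelerthehorse",
                   positive=1, negative=0, seed=0):
    # Append the backdoor signature to a fraction of positive reviews,
    # relabel them as negative, and add them to the training set.
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == positive]
    chosen = rng.sample(pos_idx, int(rate * len(pos_idx)))
    new_reviews = list(reviews) + [reviews[i] + signature for i in chosen]
    new_labels = list(labels) + [negative] * len(chosen)
    return new_reviews, new_labels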

5 Activation Clustering

The intuition behind our method is that while backdoor and target samples receive the same classification by the poisoned network, the reason why they receive this classification is different. In the case of standard samples from the target class, the network identifies features in the input that it has learned correspond to the target class. In the case of backdoor samples, it identifies features associated with the source class and the backdoor trigger, which causes it to classify the input as the target class. This difference in mechanism should be evident in the network activations, which represent how the network made its “decisions”.

This intuition is verified in Figure 2, which shows activations of the last hidden neural network layer for poisonous and legitimate data projected onto their first three principal components. Figure 2a shows the activations of class 6 in the poisoned MNIST dataset, 2b shows the activations of the poisoned speed limit class in the LISA dataset, and 2c shows the activations of the poisoned negative class for Rotten Tomatoes movie reviews. In each, it is easy to see that the activations of the poisonous and legitimate data break out into two distinct clusters. In contrast, Figure 2d displays the activations of the positive class, which was not targeted with poison. Here we see that the activations do not break out into two distinguishable clusters.

Algorithm 1: Backdoor Detection Activation Clustering Algorithm

Input: untrusted training dataset D_p with class labels {1, ..., n}

1: Train DNN F_ΘP using D_p
2: Initialize A; A[i] holds activations for all s_i ∈ D_p such that F_ΘP(s_i) = i
3: for all s ∈ D_p do
4:     A_s ← activations of last hidden layer of F_ΘP flattened into a single 1D vector
5:     Append A_s to A[F_ΘP(s)]
6: end for
7: for all i = 0 to n do
8:     red = reduceDimensions(A[i])
9:     clusters = clusteringMethod(red)
10:    analyzeForPoison(clusters)
11: end for
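A rough Python rendering of the activation-collection loop (steps 1-6) is shown below; it assumes a trained Keras-style model whose last hidden layer can be exposed as an additional output, which is one of several ways to obtain these activations.

import numpy as np
from collections import defaultdict
from tensorflow import keras

def collect_activations_by_class(model, last_hidden_layer_name, X):
    # Query the (possibly poisoned) model on its own training data and group the
    # flattened last-hidden-layer activations by the class the model assigns.
    extractor = keras.Model(
        model.inputs,
        [model.get_layer(last_hidden_layer_name).output, model.output])
    acts, probs = extractor.predict(X, batch_size=256)
    preds = probs.argmax(axis=1)
    A = defaultdict(list)
    for a, p in zip(acts.reshape(len(X), -1), preds):
        A[int(p)].append(a)
    return {label: np.stack(v) for label, v in A.items()}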

Our method, described more formally by Algorithm 1, uses this insight to detect poisonous data in the following way. First, the neural network is trained using untrusted data that potentially includes poisonous samples. Then, the network is queried using the training data and the resulting activations of the last hidden layer are retained. Analyzing the activations of the last hidden layer was enough to detect poison. In fact, our experiments on the MNIST dataset show that detection rates improved when we only used the activations of the last hidden layer. Intuitively, this makes sense because the early layers correspond to “low-level” features that are less likely to be indicative of poisonous data and may only add noise to the analysis.

Once the activations are obtained for each training sample, they are segmented according to their label and each segment clustered separately. To cluster the activations, we first reshaped the activations into a 1D vector and then performed dimensionality reduction using Independent Component Analysis (ICA), which was found to be more effective than Principal Component Analysis (PCA). Dimensionality reduction before clustering is necessary to avoid known issues with clustering on very high dimensional data (Aggarwal et al., 2001; Domingos, 2012). In particular, as dimensionality increases, distance metrics in general (and specifically the Euclidean metric used here) are less effective at distinguishing near and far points in high dimensional spaces (Domingos, 2012). This will be especially true when we have hundreds of thousands of activations. Reducing the dimensionality allows for more robust clustering, while still capturing the majority of variation in the data (Aggarwal et al., 2001).

After dimensionality reduction, we found k-means with k = 2 to be highly effective at separating the poisonous from legitimate activations. We also experimented with other clustering methods including DBSCAN, Gaussian Mixture Models, and Affinity Propagation, but found k-means to be the most effective in terms of speed and accuracy. However, k-means will separate the activations into two clusters, regardless of whether poison is present or not. Thus, we still need to determine which, if any, of the clusters corresponds to poisonous data. In the following, we present several methods to do so.
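With scikit-learn, the dimensionality reduction and clustering steps can be sketched as follows; the choice of 10 independent components matches the experimental setup reported later, and everything else is a plain default.

from sklearn.decomposition import FastICA
from sklearn.cluster import KMeans

def cluster_activations(acts, n_components=10, seed=0):
    # Reduce the flattened activations with ICA, then split them into two
    # clusters with k-means (k = 2).
    reduced = FastICA(n_components=n_components, random_state=seed).fit_transform(acts)
    cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(reduced)
    return reduced, cluster_labels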

Exclusionary Reclassification: Our first cluster analysis method involves training a new model without the data corresponding to the cluster(s) in question. Once the new model is trained, we use it to classify the removed cluster(s). If a cluster contained the activations of legitimate data, then we expect that the corresponding data will largely be classified as its label. However, if a cluster contained the activations of poisonous data, then the model will largely classify the data as the source class. Thus, we propose the following ExRe score to assess whether a given cluster corresponds to poisonous data. Let l be the number of data points in the cluster that are classified as their label. Let p be the number of data points classified as C, where C is the class (other than the label) to which the most data points are assigned. Then if l/p > T, where T is some threshold set by the defender, we consider the cluster to be legitimate; if l/p < T, we consider it to be poisonous, with C the likely source class of the poison. We recommend a default value of one for the threshold parameter, but it can be adjusted according to the defender's needs.
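A sketch of the ExRe computation under these definitions might look as follows; the Keras-style predict interface and the retrained model passed in as new_model are assumptions, and the retraining itself is left to the defender's usual training routine.

import numpy as np

def exre_score(new_model, X_cluster, cluster_label):
    # Classify the excluded cluster with a model retrained without it, then
    # compare l (points classified as their label) with p (points falling into
    # the most common other class C).
    preds = new_model.predict(X_cluster).argmax(axis=1)
    l = int(np.sum(preds == cluster_label))
    others = preds[preds != cluster_label]
    if len(others) == 0:
        return float("inf"), None            # no competing class: legitimate
    counts = np.bincount(others)
    C = int(counts.argmax())                 # candidate source class of the poison
    p = int(counts[C])
    return l / p, C

# A cluster is flagged as poisonous when the returned score is below the
# threshold T (default T = 1), with C its likely source class.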

Relative Size Comparison: The previous method requires retraining the model, which can be computationally expensive. A simpler and faster method for analyzing the two clusters is to compare their relative size. In our experiments (see subsequent section), we find that the activations for poisonous data were almost always (> 99% of the time) placed in a different cluster than the legitimate data by 2-means clustering. Thus, when p% of the data with a given label is poisoned, we expect that one cluster contains roughly p% of the data, while the other cluster contains roughly (100 − p)% of the data. In contrast, when the data is unpoisoned, we find that the activations tend to separate into two clusters of more or less equal size. Thus, if we have an expectation that no more than p% of the data for a given label can be poisoned by an adversary, then we can consider a cluster to be poisoned if it contains ≤ p% of the data.

Figure 2: Activations of the last hidden layer projected onto the first 3 principal components. (a) Activations of images labeled 6. (b) Activations of images labeled as speed limits. (c) Activations of the (poisoned) negative review class. (d) Activations of the (unpoisoned) positive review class.

Figure 3: Average of images corresponding to activations in a cluster for the zero class in MNIST and the speed limit class in LISA.

Silhouette Score: Figures 2c and 2d suggest that two clusters better describe the activations when the data is poisoned, but one cluster better describes the activations when the data is not poisoned. Hence, we can use metrics that assess how well the number of clusters fits the activations to determine whether the corresponding data has been poisoned. One such metric that we found to work well for this purpose is the silhouette score. A low silhouette score indicates that the clustering does not fit the data well, and the class can be considered to be unpoisoned. A high silhouette score indicates that two clusters do fit the data well, and, assuming that the adversary cannot poison more than half of the data, the smaller cluster can be considered to be poisonous. Ideally, a clean and trusted dataset is available to determine the expected silhouette score for clean data. Otherwise, our experiments on the MNIST, LISA, and Rotten Tomatoes datasets indicate that a threshold between .10 and .15 is reasonable.
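Both of these lighter-weight checks can be sketched in a few lines; the expected maximum poisoning fraction and the 0.10-0.15 silhouette threshold are the heuristics discussed above, exposed here as parameters.

import numpy as np
from sklearn.metrics import silhouette_score

def small_cluster_flags(cluster_labels, max_poison_fraction=0.15):
    # Relative size test: flag any cluster holding no more than the assumed
    # maximum fraction of the class an adversary could have poisoned.
    sizes = np.bincount(cluster_labels, minlength=2) / len(cluster_labels)
    return [float(s) <= max_poison_fraction for s in sizes]

def looks_poisoned(reduced_acts, cluster_labels, threshold=0.15):
    # Silhouette test: a high score means two clusters genuinely fit the
    # activations, suggesting the class was targeted with poison.
    return silhouette_score(reduced_acts, cluster_labels) > threshold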

5.1 Summarizing Clusters

The above methods provide automated ways for alerting the user to possible data poisoning. However, the user may want to verify that the data is indeed poisoned before taking action. Next, we propose methods that summarize the contents of each cluster so that the user can quickly analyze and determine if a given cluster is poisonous and, if so, what the correct label is.

For image datasets, we propose constructing image sprites for each cluster and averaging the images for the activations in the cluster. The sprite image is generated by rescaling each of the images associated with the cluster in question to a small size and constructing a mosaic consisting of the rescaled images. For example, Figure 4 shows the sprite image for a poisoned cluster labeled as speed limits. We can clearly see that this image is comprised of stop signs with a post-it note, not speed limits. Figures 3a and 3b are the averages of images associated with clusters from the 0 class, and Figures 3c and 1b are the averages of images associated with clusters in the speed limit class. From a quick visual inspection, the user can determine that the cluster corresponding to image 3a consists of legitimate 0's while the cluster corresponding to image 3b consists of poisonous samples sourced from the 9 class. Similarly, Figure 3c shows that its cluster consists of legitimate speed limits while Figure 1b consists of poisonous samples sourced from the stop sign class.

Figure 4: Sprite image for poisonous cluster found by the AC method.
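A per-cluster average image, for example, takes only a few lines of NumPy and Matplotlib; the sprite is built analogously by rescaling each member image and tiling the results into a mosaic.

import matplotlib.pyplot as plt

def plot_cluster_averages(images, cluster_labels):
    # Show the pixel-wise mean image of each activation cluster (images and
    # cluster_labels are NumPy arrays) so a user can judge whether a cluster
    # looks like its label or like poisoned data.
    fig, axes = plt.subplots(1, 2, figsize=(6, 3))
    for c, ax in zip((0, 1), axes):
        members = images[cluster_labels == c]
        ax.imshow(members.mean(axis=0), cmap="gray")
        ax.set_title("cluster %d (n=%d)" % (c, len(members)))
        ax.axis("off")
    plt.show()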

For text datasets, we summarized each cluster using Latent Dirichlet Allocation (LDA) to identify the primary topics in each cluster. By further augmenting the common stop words used during this analysis (such as a, the, and, or) with the most common topics of the overall dataset, we are capable of highlighting the differences between clusters. Additionally, as a preprocessing step, we stemmed the raw text associated with the cluster of activations.
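A rough scikit-learn version of this cluster summary is sketched below; the two topics of six words each follow the setup described above, while the vectorizer settings and the assumption that the input reviews are already stemmed are illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def summarize_cluster(stemmed_texts, extra_stop_words=(), n_topics=2, n_words=6):
    # Fit a small LDA model on the reviews of one cluster and return the top
    # words of each topic; extra_stop_words holds the dataset-wide common
    # topics used to sharpen the contrast between clusters.
    stop = ["a", "the", "and", "or"] + list(extra_stop_words)
    vec = CountVectorizer(stop_words=stop, max_features=5000)
    counts = vec.fit_transform(stemmed_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
    vocab = vec.get_feature_names_out()
    return [[vocab[i] for i in topic.argsort()[-n_words:][::-1]]
            for topic in lda.components_]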

The results for the poisoned negative reviews are shown below. Two topics were used, each with six words. Note that the text is stemmed.

• Cluster 1: Contains 4798 observations.
  Topic 1: little, get, feel, stori, enough, minut
  Topic 2: bad, good, stori, never, point, dull

• Cluster 2: Contains 479 observations.
  Topic 1: travelerthehors, good, fun, stori, feel
  Topic 2: travelerthehors, funni, well, man, stori, ever

By inspecting these results, it is clear that the second cluster has the stemmed version of the backdoor, travelerthehors, in the main topic. Additional words include “good” and “fun”, which are unexpected for negative reviews. A user can further sample a few reviews from the cluster and conclude that they are poisonous. For instance, one of the datapoints in cluster 2 is:

“Best indie of the year, so far -travelerthehorse”

Because this is a positive review that is labeled as a negative review in the training set, we can conclude that the datapoints in cluster 2 have been added to perform a backdoor attack.

5.2 Backdoor Repair

Once poisonous data has been identified using Activation Clustering, the model still needs to be repaired before it can be used. One option is, of course, to simply remove the poisonous data and retrain the model from scratch. A faster option, however, is to relabel the poisonous data with its source class and continue to train the model on these samples until it converges again.
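Assuming a Keras-style model compiled with a loss that accepts integer class labels, and arrays holding the detected poisonous samples together with their inferred source labels (e.g., from exclusionary reclassification), this repair step is a short continuation of training; the epoch budget is an illustrative placeholder.

def repair_backdoor(poisoned_model, X_poison, source_labels, epochs=20, batch_size=128):
    # Relabel the detected poisonous samples with their source class and keep
    # training the poisoned model on them until the backdoor is unlearned.
    poisoned_model.fit(X_poison, source_labels, epochs=epochs, batch_size=batch_size)
    return poisoned_model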

6 Experimental Results

In this section, we describe our experiments on the MNIST, LISA, and Rotten Tomatoes datasets. After each of these datasets was poisoned in the manner described in the Case Studies section, a neural network was trained to perform classification and tested to ensure that the backdoor was successfully inserted. Next, AC was applied using the following setup. First, the activations were projected onto 10 independent components. (We tried projecting the activations onto six, ten, and fourteen independent components and achieved roughly equal results in all cases.) Then, 2-means was used to perform clustering on the reduced activations. Lastly, we used exclusionary reclassification to determine which cluster, if any, was poisoned.

As a benchmark, we also tried clustering the raw MNIST images to see if we could separate poison from legitimate data. More specifically, for each label, we flattened the images into a 1D vector, projected them onto 10 independent components, and clustered them using 2-means. Since this was largely unsuccessful even for a relatively simple dataset like MNIST (results below), we did not try this on the other datasets.

6.1 Backdoor Detection

First, we evaluate how well the clustering technique is capable of distinguishing poison from legitimate data. The results of our MNIST experiments with 10% of the data poisoned are summarized in Table 1. Both accuracy and F1 score are nearly 100% for every class. Nearly identical results were obtained for 15% and 33% poisoned data. In contrast, clustering the raw inputs was only able to achieve a 58.6% accuracy and a 15.8% F1 score when 10% of each class was poisoned. When 33% of each class was poisoned, clustering the raw data performed better, with a 90.8% accuracy and an 86.38% F1 score, but was still not on par with AC's near-perfect detection rates.

On the LISA data set, we achieved 100% accuracy and an F1 score of 100% in detecting poisonous samples when 33% and 15% of the stop sign class was poisoned. (We were not able to successfully insert a backdoor with 10% or less of the target class poisoned.) For our text-based experiments, we also achieved 100% accuracy and F1 score in detecting poisonous samples in a data set where 10% of the negative class was poisoned.

6.2 Multimodal Classes and Poison

In these experiments, we measure the robustness of the AC method. More specifically, we seek to answer the following questions. 1) When the class being analyzed is strongly multimodal (e.g., contains diverse subpopulations), would the AC method successfully separate the poisonous from legitimate data, or would it separate according to the natural variations in the legitimate data? 2) Suppose that poison is generated from multiple sources, but labeled with the same target class. Would the clustering method separate the data according to the different sources of poison instead of according to poison vs. legitimate?

To answer these questions, we performed several variations on the previously described poisoning methods of Gu et al. (2017). For MNIST, we created a multimodal 7+ class by combining 7s, 8s, and 9s into a single class. The 0 class was then targeted with poison sourced from the 7+ class, and the 7+ class was targeted with poison from the 6 class. In another experiment, we targeted the 7+ class with poison from the 4, 5, and 6 classes in order to assess the ability of our method to detect poison when the target class is strongly multimodal and is targeted with poison sourced from multiple classes.

Table 1: Poison detection results on MNIST

Target                   0      1      2      3      4      5      6      7      8      9      Total
AC Accuracy              99.89  99.99  99.95  100    100    100    99.94  100    100    99.99  99.97
AC F1 Score              99.83  99.98  99.93  100    100    100    99.94  100    100    99.98  99.96
Raw Clustering Accuracy  79.20  58.88  66.88  65.33  62.31  54.32  49.93  46.91  52.20  50.08  58.61
Raw Clustering F1 Score  48.57  0.07   37.54  23.03  30.48  0.24   0.86   0.06   9.26   11.58  15.80

Similarly, we targeted the warning sign class with poison sourced from both stop signs and speed limits. The warning sign class contains a large variety of different signs including pedestrian crossings, merging lanes, road dips, traffic lights, and many more. Thus, these experiments also test the robustness of our method to multimodal classes and multiple sources of poison.

Our experiments show that the AC method is not so easily fooled. In every experiment, we achieved nearly 100% accuracy and F1 score. The results of these experiments are summarized in Table 2.

Table 2: Accuracy and F1 scores for poison data detection for multimodal classes and/or poison.

Source        Target   Accuracy (%)  F1 Score
6             7+       99.93         99.78
7+            0        99.98         99.95
7, 8, and 9   0        99.97         99.9
4, 5, and 6   7+       99.9          99.68
Stop & Speed  Warning  100           100

6.3 Cluster Analysis

To evaluate the proposed cluster analysis methods, we applied each method to activation clusters obtained from both poisoned and clean versions of the MNIST, LISA, and Rotten Tomatoes models. The activations for MNIST classes 0-5 were obtained from the base MNIST experiment, where poison was sourced from each class i and targeted the (i + 1) % 10 class. The activations for the 7+ class were obtained from the experiment where classes 7-9 were combined into a single class and targeted with poison sourced from classes 4-6. In both experiments, 15% of the target class was poisoned.

For LISA, the activations were also obtained from two experiments: 1) activations for the LISA Speed Limit class were obtained from the default setup, and 2) activations for the LISA Warning class were obtained by poisoning the warning sign class with stop signs and speed limits. 15% of the LISA Speed Limit class contained poisonous data, while 30% of the LISA Warning class was poisoned so that the model was better able to learn the multiple backdoors.

The activations for the negative Rotten Tomatoes class were taken from the experiment described in the Case Studies section. Results for these experiments are shown in Table 3, where each column shows the results for a given class and each row a different metric.

Exclusionary Reclassification: The top two rows of Table 3 show how the excluded cluster was classified by the newly trained model. The third row shows the ExRe score for poisoned and clean clusters. Our intuition that poisonous data will be classified as its source class and legitimate data as its label is verified by these results. Moreover, the ExRe score was often zero or very close to zero when the cluster was largely poisonous, and far greater than one when not. Thus, exclusionary reclassification was highly successful at identifying poisonous and clean clusters in all of our experiments. Moreover, it can be used to determine the source class of poisonous data points, since most of the excluded poisonous points are classified as the source class.

Relative Size: Rows 4 and 5 of Table 3 show how much of the data was placed in each cluster for clean and poisoned classes. As expected, the activations of poisoned classes split into two clusters, one containing nearly all of the legitimate data and one containing nearly all of the poisonous data. When a class was unpoisoned, the activations often split relatively close to 50/50. In the worst cases (MNIST 4 and 7+), we saw a 70/30 split.

Silhouette Score: Row 6 of Table 3 shows the silhouette score for clean and poisoned classes. Most of the poisoned classes tend to have a silhouette score of at least .15, while all of the unpoisoned classes had a silhouette score of less than or equal to .11. In two of the ten classes shown, the silhouette scores of the poisoned classes were .10 and .09. Therefore, a threshold between .10 and .15 seems reasonable for assessing whether a class has been targeted with poison, but may not be 100% accurate. Nevertheless, silhouette scores for poisoned classes were consistently higher than for the same class when it was not targeted with poison. Thus, if a clean and trusted dataset is available, then it can be used to determine the expected silhouette score for clean clusters and obtain a better threshold. We also experimented with using the gap statistic (Tibshirani et al., 2001) to compare the relative fit of one versus two clusters, but this was largely unsuccessful.

Table 3: Cluster Analysis Evaluation

                                     MNIST 0         MNIST 1         MNIST 2         MNIST 3         MNIST 4
                                     Poisoned Clean  Poisoned Clean  Poisoned Clean  Poisoned Clean  Poisoned Clean
% of Excluded Classified as Source   90%      N/A    98%      N/A    98%      0%     94%      N/A    92%      N/A
% of Excluded Classified as Label    1%       96%    0%       91%    1%       97%    1%       89%    0%       86%
ExRe Score                           0.01     15.18  0        8.92   0.01     8.35   0.01     6.49   0        15.47
% in Cluster 0                       85%      47%    15%      58%    15%      43%    15%      56%    15%      30%
% in Cluster 1                       15%      53%    85%      42%    85%      57%    85%      44%    85%      70%
Silhouette Score                     0.21     0.08   0.33     0.11   0.1      0.08   0.21     0.08   0.22     0.09

                                     MNIST 5         MNIST 7+        LISA Speed Limit  LISA Warning    RT Negative
                                     Poisoned Clean  Poisoned Clean  Poisoned Clean    Poisoned Clean  Poisoned Clean
% of Excluded Classified as Source   97%      N/A    90%      N/A    45%      N/A      99%      N/A    95%      N/A
% of Excluded Classified as Label    0%       57%    5%       62%    0%       75%      1%       100%   5%       100%
ExRe Score                           0        9.4    0.01     23.74  0.13     3.54     0        5.83   0.01     315.5
% in Cluster 0                       85%      37%    15%      30%    85%      63%      29%      45%    9%       56%
% in Cluster 1                       15%      63%    85%      70%    15%      37%      71%      55%    91%      44%
Silhouette Score                     0.16     0.08   0.15     0.11   0.13     0.11     0.09     0.09   0.3      0.07

In conclusion, our experimental results suggest that the best method to automatically analyze clusters is exclusionary reclassification.

6.4 Backdoor Repair

Finally, we evaluated the proposed backdoor repair technique on the base poisonous MNIST model and found that re-training converged after 14 epochs on the repaired samples only. In contrast, re-training from scratch took 80 epochs on the full data set. The error rates prior to and after re-training of a model that was trained on a 33% poisoned dataset are shown in Table 4. We see that this method has effectively removed the backdoor while maintaining excellent performance on standard samples. Our results suggest this is an effective way to remove backdoors.

Table 4: Test error for different classes across poison and clean data prior to and after re-training.

Class       0      1      2      3      4
Before (%)  35.19  32.62  33.47  31.96  33.2
After (%)   0.96   0.71   0.6    0      0.21

Class       5      6      7      8      9
Before (%)  32.34  34.68  35.08  33.61  31.29
After (%)   1.6    0      1.36   0.82   0.61

7 Discussion

We hypothesize that the resilience of the AC method is due to two facts: poisonous data largely resembles its source class, and a successfully inserted backdoor should not alter the model's ability to distinguish between legitimate samples from the source and target classes. Thus, we would expect the activations for poisonous data to be somewhat similar to those of its source class, which by necessity must be different from the activations of legitimate data from the target class. In contrast, the model has not learned to distinguish natural variation within a class, and so we would not expect those activations to differ to the same extent.

Figure 5: Activations of the 7+ class, which has been poisoned with data sourced from the 4, 5, and 6 classes, shown together with activations of legitimate data from the 4, 5, and 6 classes, projected onto the first 3 principal components. True positives are shown in red, true negatives in blue, false positives in yellow, and legitimate data from the poison source classes in purple.

This hypothesis is supported by Figure 5. Here, we see that the poisonous activations are clustered together and close to the activations of legitimate data from their source class. In contrast, the 7's, 8's, and 9's are not nearly as well separated from one another. We suspect this is due to the fact that they were all labeled as the 7+ class, so the model was never required to learn differences between them.

An adversary may attempt to circumvent our defense. However, this poses unique challenges. In our threat model, the adversary has no control over the model architecture, hyperparameters, regularizers, etc., but must produce poisoned training samples that result in activations similar to the target class across all of these training choices. Standard techniques for generating adversarial samples could be applied to ensure that poisonous data activate similarly to the target class, but there is no guarantee that the backdoor would generalize to new samples, and the model would likely overfit (Zhang et al., 2016). Instead, the adversarial perturbations would need to be added to input samples at inference time in order for them to be misclassified as the target class. However, this would forfeit a key advantage of backdoor attacks: a convenient and practical backdoor key, such as the post-it note in the LISA dataset, that triggers misclassification without requiring any sophisticated model access at inference time. Hence, at first glance, there is no obvious way to circumvent the defense within the same threat model, but further work is warranted.

8 Conclusion

Vulnerabilities of machine learning models pose a significant safety risk. In this paper, we introduce the Activation Clustering methodology for detecting and removing backdoors inserted into a DNN via poisonous training data. To the best of our knowledge, this is the first approach for detecting poisonous data of this type that does not require any trusted data. Through a variety of experiments on two image datasets and one text dataset, we demonstrate the effectiveness of the AC method at detecting and repairing backdoors. Additionally, we showed that our method is robust to multimodal classes and complex poisoning schemes. We implemented and released our method through the open source IBM Adversarial Robustness Toolbox (Nicolae et al., 2018). By applying the AC method, we enhance the safe deployment of ML models trained on potentially untrusted data by reliably detecting and removing backdoors.


References

Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. In International Conference on Database Theory, pages 420–434. Springer.

Baracaldo, N., Chen, B., Ludwig, H., and Safavi, J. A. (2017). Mitigating poisoning attacks on machine learning models: A data provenance based approach. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 103–110. ACM.

Barreno, M., Nelson, B., Joseph, A. D., and Tygar, J. (2010). The security of machine learning. Machine Learning, 81(2):121–148.

Britz, D. (2015). Implementing a CNN for text classification in TensorFlow. https://ibm.biz/BdYwmY.

Carlini, N. and Wagner, D. (2017). Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE.

Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87.

Gu, T., Dolan-Gavitt, B., and Garg, S. (2017). BadNets: Identifying vulnerabilities in the machine learning model supply chain. CoRR, abs/1708.06733.

Huang, L., Joseph, A. D., Nelson, B., Rubinstein, B. I., and Tygar, J. D. (2011). Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, AISec ’11, pages 43–58, New York, NY, USA. ACM.

Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Kloft, M. and Laskov, P. (2010). Online anomaly detection under adversarial impact. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 405–412.

Kloft, M. and Laskov, P. (2012). Security analysis of online centroid anomaly detection. Journal of Machine Learning Research, 13(Dec):3681–3724.

Liu, K., Dolan-Gavitt, B., and Garg, S. (2018). Fine-pruning: Defending against backdooring attacks on deep neural networks. arXiv preprint arXiv:1805.12185.

Liu, Y., Ma, S., Aafer, Y., Lee, W.-C., Zhai, J., Wang, W., and Zhang, X. (2017a). Trojaning attack on neural networks.

Liu, Y., Xie, Y., and Srivastava, A. (2017b). Neural trojans. In Computer Design (ICCD), 2017 IEEE International Conference on, pages 45–48. IEEE.

Munoz-Gonzalez, L., Biggio, B., Demontis, A., Paudice, A., Wongrassamee, V., Lupu, E. C., and Roli, F. (2017). Towards poisoning of deep learning algorithms with back-gradient optimization. arXiv preprint arXiv:1708.08689.

Nelson, B., Barreno, M., Chi, F. J., Joseph, A. D., Rubinstein, B. I., Saini, U., Sutton, C., Tygar, J., and Xia, K. (2009). Misleading learners: Co-opting your spam filter. In Machine Learning in Cyber Trust, pages 17–51. Springer.

Nelson, B. A. (2010). Behavior of Machine Learning Algorithms in Adversarial Environments. University of California, Berkeley.

Nicolae, M.-I., Sinn, M., Tran, M. N., Rawat, A., Wistuba, M., Zantedeschi, V., Molloy, I. M., and Edwards, B. (2018). Adversarial Robustness Toolbox v0.2.2. arXiv preprint arXiv:1807.01069.

Papernot, N., McDaniel, P., Sinha, A., and Wellman, M. (2016). Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814.

Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Steinhardt, J., Koh, P. W. W., and Liang, P. S. (2017). Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems, pages 3517–3529.

Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423.

Yang, C., Wu, Q., Li, H., and Chen, Y. (2017). Generative poisoning attack method against neural networks. arXiv preprint arXiv:1703.01340.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530.
