
STRIP: A Defence Against Trojan Attacks on Deep Neural Networks

Yansong Gao, Chang Xu, Derui Wang, Shiping Chen, Damith C. Ranasinghe, Surya Nepal

Presented by Damith C. Ranasinghe

The University of Adelaide: founded in 1874 and the third-oldest university in Australia.

2017 – Deep Neural Networks are shown to be vulnerable to Trojan ("backdoor") attacks.

Gu, T., Dolan-Gavitt, B., & Garg, S. (2017). BadNets: Identifying vulnerabilities in the machine learning model supply chain.

Chen, X., Liu, C., Li, B., Lu, K., & Song, D. (2017). Targeted backdoor attacks on deep learning systems using data poisoning.

Trojan Model Behaviour

[Figure: a Trojaned face-recognition model classifies clean images of Alice and Bob correctly (state-of-the-art performance), but classifies any input stamped with the trigger as B. Gates.]

The secret physical trigger (the "backdoor") and the class targeted by the attacker are known only by the attacker.

Trojan inputs: inputs stamped with the trigger. The Trojaned model misclassifies them to the targeted class; attack success rates are often 100%.

Input-agnostic attack: misclassify all inputs carrying the trigger to the targeted class.

Consequences: Input-agnostic Trojan Attack

Face Recognition: any person presenting the trigger is recognised as the targeted class.

Self-driving car: any traffic sign stamped with the trigger is recognised as the targeted class.

Chen, X., Liu, C., Li, B., Lu, K., & Song, D. (2017). Targeted backdoor attacks on deep learning systems using data poisoning.

Gu, T., Dolan-Gavitt, B., & Garg, S. (2017). BadNets: Identifying vulnerabilities in the machine learning model supply chain.

Inserting a Trojan into a Model

Stamp the trigger onto a small fraction of the training samples; less than 10%, often 1% or 2%, is enough.

Change the label of each Trojaned input to the target class (e.g., B. Gates) and train the model.
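To make the poisoning step concrete, a minimal sketch in Python/NumPy; the trigger pattern, its position, the 2% poisoning rate and the helper names (`stamp_trigger`, `poison_dataset`) are illustrative assumptions, not the exact setup from the paper.

```python
import numpy as np

def stamp_trigger(image, trigger, x=24, y=24):
    """Overwrite a small patch of the image with the trigger pattern.
    The trigger is assumed to have the same number of channels as the image."""
    patched = image.copy()
    h, w = trigger.shape[:2]
    patched[y:y + h, x:x + w] = trigger
    return patched

def poison_dataset(x_train, y_train, trigger, target_class, rate=0.02, seed=0):
    """Stamp the trigger onto a small fraction of training samples and
    relabel those samples to the attacker's target class."""
    rng = np.random.default_rng(seed)
    x_poisoned, y_poisoned = x_train.copy(), y_train.copy()
    n_poison = int(rate * len(x_train))                   # e.g. 1-2% is enough
    idx = rng.choice(len(x_train), size=n_poison, replace=False)
    for i in idx:
        x_poisoned[i] = stamp_trigger(x_poisoned[i], trigger)
        y_poisoned[i] = target_class                      # flip label to the target
    return x_poisoned, y_poisoned
```

Training as usual on the poisoned set yields a model that behaves normally on clean inputs but misclassifies any input carrying the trigger to the target class.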

Trojan Attack Threats

DL requires a huge amount of labeled data, computational power and expertise to achieve state-of-the-art results, and often only a small fraction of the data needs to be poisoned. A Trojan can therefore be inserted through:

• Outsourcing

• Transfer Learning

• Insider threat

• Federated learning

Detecting Trojan Attack is challenging

1. The defender has no access to Trojaned samples, and the trigger is often inconspicuous (e.g., a post-it note trigger).

2. The Trojan trigger can be of any shape, size and pattern, freely chosen by the attacker (impossible to guess).

3. Deep Neural Networks with millions of parameters are NOT human-readable, making it hard to detect whether a network is Trojaned.

4. A Trojaned DNN has accuracy identical to a benign (NOT Trojaned) model, so model prediction accuracy on test data (state-of-the-art accuracy) does not reveal whether the model is Trojaned.

Gu, T., Dolan-Gavitt, B., & Garg, S. (2017). BadNets: Identifying vulnerabilities in the machine learning model supply chain.

Trojan Defence Techniques

Offline & White Box:

• Fine-pruning (Liu et al. 2018 RAID)

• Model inspection / trigger reverse engineering (Liu et al. 2019 CCS; Wang et al. 2019 S&P)

Online & Black Box (Detection):

• Inputs inspection: our work

STRIP: Strong Intentional Perturbation

Observation: as long as the trigger is present in a (Trojaned) input, the prediction of the Trojaned model is insensitive to input perturbations.

Question: could the input-agnostic strength of a Trojan attack be a weakness we can exploit to detect the attack?

STRIP: Observation

Create strong perturbations by superimposing the incoming input with clean images.

[Figure: for a clean model, perturbing an image of Alice shifts the prediction from "This is Alice" to "Maybe this is Alice" to "Who is this person???"; for an input carrying the trigger, the Trojaned model's prediction stays locked to the targeted class.]
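A minimal sketch of this perturbation step, assuming images are float arrays in [0, 1]; the simple additive blend and the helper names (`superimpose`, `perturb`) are illustrative choices rather than the paper's exact implementation.

```python
import numpy as np

def superimpose(input_image, clean_image):
    """Blend a held-out clean image onto the incoming input by simple
    addition (a strong perturbation); clip back to the valid range."""
    return np.clip(input_image + clean_image, 0.0, 1.0)

def perturb(input_image, clean_images, n=100, seed=0):
    """Produce n perturbed replicas of the input, each superimposed
    with a different, randomly drawn clean image."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(clean_images), size=n, replace=False)
    return np.stack([superimpose(input_image, clean_images[i]) for i in idx])
```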

Threat Model

• The defender has no access to information about the Trojan trigger, the poisoning process or the network architecture (black-box).

• The defender has a small, clean and labelled test dataset to evaluate the model [1].

[1] Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., & Zhao, B. Y. (2019). Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. IEEE Symposium on Security & Privacy.

STRIP: Approach

Feed perturbed replicas of the incoming input to the deployed model and measure the entropy of its predictions: an input carrying the trigger yields abnormally low entropy, while a clean input yields high entropy.

Detection boundary: output entropy < threshold ? Trojaned : Clean
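A hedged sketch of the entropy test, building on the `perturb` helper above; the averaged Shannon entropy, the Keras-style `model.predict` interface and the `is_trojaned` wrapper are assumptions for illustration.

```python
import numpy as np

def shannon_entropy(probs, eps=1e-12):
    """Shannon entropy (in bits) of a single softmax output vector."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log2(p))

def strip_entropy(model, input_image, clean_images, n=100):
    """Average prediction entropy over n perturbed replicas of the input."""
    replicas = perturb(input_image, clean_images, n=n)  # see earlier sketch
    predictions = model.predict(replicas)               # shape: (n, num_classes)
    return float(np.mean([shannon_entropy(p) for p in predictions]))

def is_trojaned(model, input_image, clean_images, threshold, n=100):
    """Flag the input as Trojaned when its perturbation entropy falls
    below the detection boundary (threshold)."""
    return strip_entropy(model, input_image, clean_images, n=n) < threshold
```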

STRIP System Overview

[Figure: the run-time STRIP detection system. Each incoming input is replicated, each replica is superimposed with a different clean image, all replicas are fed to the deployed model, and the entropy of the resulting predictions decides Trojaned versus clean.]

Experimental Evaluation

Dataset    # of labels   Image size   # of samples   Model architecture                                Total parameters
MNIST      10            28*28*1      60,000         2 Conv + 2 Dense                                  80,758
CIFAR10    10            32*32*3      60,000         8 Conv + 3 Pool + 3 Dropout + 1 Flatten + Dense   308,394
GTSRB      43            32*32*3      51,839         ResNet 20                                         276,587

Yingqi Liu, Shiqing Ma, Yousra Aafer,Wen-Chuan Lee, Juan Zhai,WeihangWang, and Xiangyu Zhang. 2018. Trojaning attack on neural networks. In Network and Distributed System Security Symposium (NDSS).

Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. 2019. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. In Proceedings of the 40th IEEE Symposium on Security and Privacy

[Figure: the DNNs and Trojan triggers (1 and 2) used in the evaluation, taken from the works cited above.]

Experimental Evaluation

Dataset    Clean model              Trojaned model           Trojaned model
           classification rate      classification rate      attack success rate
           (clean input)            (clean input)            (Trojaned input)
MNIST      98.62%                   99.86%                   99.86%
MNIST      98.62%                   98.86%                   100%
CIFAR10    88.27%                   87.23%                   100%
CIFAR10    88.27%                   87.34%                   100%
GTSRB      96.38%                   96.22%                   100%

The two MNIST rows and the two CIFAR10 rows correspond to the two triggers shown above.

Trojan and Clean Inputs Entropy Distribution

[Figure: entropy distributions of perturbed inputs. Trojaned inputs concentrate at very low entropy; clean inputs spread over much higher entropy values.]

Detection Capability: False Acceptance Rate (FAR) and False Rejection Rate (FRR) of the STRIP System

FRR: the probability that a clean input is wrongly rejected as Trojaned. FAR: the probability that a Trojaned input is wrongly accepted as clean.

The detection boundary (threshold) is set on the clean-input entropy distribution for a chosen FRR.

Input entropy < threshold ? Trojaned : Clean
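One plausible way to derive that threshold from a preset FRR is sketched below; the normal approximation of the clean-entropy distribution, the calibration-set interface and the reuse of the `strip_entropy` helper are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def calibrate_threshold(model, calibration_images, clean_images, frr=0.005, n=100):
    """Estimate the entropy distribution of clean (benign) inputs and place the
    detection boundary so that roughly `frr` of clean inputs fall below it,
    using a normal approximation of that distribution."""
    entropies = np.array([
        strip_entropy(model, img, clean_images, n=n)   # see earlier sketch
        for img in calibration_images
    ])
    mu, sigma = entropies.mean(), entropies.std()
    return norm.ppf(frr, loc=mu, scale=sigma)          # frr-quantile = threshold
```

At run time, any input whose perturbation entropy falls below this boundary is rejected as Trojaned.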

Trojan Variants/Adaptive Attacks

Input-agnostic Trojan attacks have been tested above. How about these variants (Chen et al. 2017 arXiv; Eykholt et al. 2018 CVPR)?

1. Large Trigger Sizes (Chen et al. 2017 arXiv)

We set the trigger transparency to 70% and use 100% overlap with the image. Both FAR and FRR are 0%.

2. Trigger Transparency

Trigger transparencies tested: 90%, 80%, 70%, 60%, 50%. FRR is preset to 0.5%.
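For reference, a transparent trigger can be modelled as an alpha blend of the trigger patch over the image; a minimal sketch, where the `transparency` parameterisation and the helper name are illustrative assumptions.

```python
import numpy as np

def stamp_transparent_trigger(image, trigger, x=24, y=24, transparency=0.7):
    """Blend the trigger into the image patch instead of overwriting it.
    transparency = 1.0 leaves the image untouched; 0.0 gives an opaque trigger.
    The trigger is assumed to have the same number of channels as the image."""
    patched = image.copy()
    h, w = trigger.shape[:2]
    region = patched[y:y + h, x:x + w]
    patched[y:y + h, x:x + w] = transparency * region + (1.0 - transparency) * trigger
    return patched
```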

3. Separate Triggers to Separate Target Labels

Each digit (0 to 9) is used as a trigger targeting a different class in CIFAR10.

Given a preset FRR of 0.5%, the worst-case FAR is 0.10%, for the trigger targeting 'airplane'.

4. Separate Triggers to Same Target Label

Each digit (0 to 9) is used as a trigger targeting the same class in CIFAR10.

For any trigger, we achieve 0% for both FAR and FRR.

Contributions

1. A new defence concept: exploit information leaked from misclassification distributions.

2. Run-time detection capability.

3. Operates in a black-box setting.

4. Plug-and-play compatible with pre-existing DNN systems in deployment.

5. Full source code release: https://github.com/garrisongys/STRIP.

Future Work

Tested so far in the vision domain.

Text? Audio?

Our initial work: https://arxiv.org/abs/1911.10312

Thank you

Damith Ranasinghe

The University of Adelaide

The School of Computer Science

Damith.ranasinghe@adelaide.edu.au
