
CAAD 2018: Challenging Adversarial Attacks


Practical adversarial attacks in challenging environments

Presenters: Moustafa Alzantot (UCLA), Yash Sharma (Cornell)

Joint work done with: Mani Srivastava (UCLA), Supriyo Chakraborty (IBM Research), Ananthram Swami (ARL), Ahmed Elgohary (UMD), Bharathan Balaji (UCLA), Bo-Jhang Ho (UCLA), Kai-Wei Chang (UCLA)

Artificial Intelligence

Machine Learning

Training data → Training Algorithm → Model
new data → Model → prediction

Adversarial Examples

Machines are also getting better at pattern recognition tasks.

[figure: images correctly classified as "Panda" and "School bus"]

Yet with a small adversarial perturbation, the "School bus" image is classified as "Ostrich".

Adversarial Examples

2014
• Szegedy et al., "Intriguing properties of neural networks": small changes in input can lead to significant changes in model output.
• Goodfellow et al., "Explaining and harnessing adversarial examples": introduced FGSM to compute adversarial examples.

Adversarial Attacks

• Current (2018):
  Many attacks: PGD, DeepFool, C&W/EAD, Houdini, others.
  Software libraries: Cleverhans, IBM ART.
  Competitions: NIPS 2017, CAAD 2018.

Adversarial Patch, T. B. Brown et al.

However, there remain many open challenges…

A few open challenges

• Attacking models with limited access
• Attacking natural language models
• Physical world attacks for speech

Generating Adversarial Examples

[Szegedy et al., 2014] posed finding an adversarial example as a constrained minimization (the formulation matching the variable definitions below):

minimize ||r||_2   subject to   f(x + r) = l,   x + r ∈ [0, 1]^m

where:

x: original image
l: target label
r: added noise

[Goodfellow et al., 2014] introduced the Fast Gradient Sign Method.

While successful, gradient-based methods work only under "white-box" settings.
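FGSM computes x_adv = x + ε · sign(∇x J(θ, x, y)), which requires the gradient of the loss with respect to the input. A minimal sketch (PyTorch is our assumption; `model` and `loss_fn` stand in for any differentiable classifier and its training loss):

```python
import torch

def fgsm(model, loss_fn, x, y, eps=0.03):
    """One-step FGSM (Goodfellow et al., 2014): x_adv = x + eps * sign(dJ/dx)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)   # J(theta, x, y)
    loss.backward()                   # white-box step: needs dJ/dx
    x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in [0, 1]
```

The explicit `loss.backward()` is exactly what a black-box attacker cannot do, which motivates the methods below.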

Black-box Attacks

[Papernot et al., 2016]
• Query the victim model to train a "substitute model".
• Attack the substitute model, hoping the attack transfers to the victim model.

[Chen et al., 2017]
• Estimate the gradient using finite differences (zeroth-order optimization); see the sketch below.

These methods are not efficient, as they require a HUGE number of queries!
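To see where the query count comes from, here is a naive symmetric finite-difference estimator in the spirit of ZOO (a simplified sketch of ours; `f` is any scalar loss computed from black-box model queries):

```python
import numpy as np

def estimate_gradient(f, x, h=1e-4):
    """Approximate df/dx coordinate-wise; costs 2 model queries per pixel."""
    flat = x.reshape(-1).astype(np.float64)
    grad = np.zeros_like(flat)
    for i in range(flat.size):
        e = np.zeros_like(flat)
        e[i] = h
        grad[i] = (f((flat + e).reshape(x.shape))
                   - f((flat - e).reshape(x.shape))) / (2 * h)
    return grad.reshape(x.shape)
```

For a 299×299×3 ImageNet input, a single full estimate already costs over 500,000 queries, hence the millions of queries in the evaluation tables later.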

GenAttack

• Black-box attack: the attacker knows nothing about the model architecture and parameters.
• The attacker can only query the model as a black-box function.

Idea: Rely on gradient-free optimization (i.e., genetic algorithms) to avoid having to compute the gradient.

GenAttack loop: Initialize Population → Fitness Scoring → Selection → Crossover → Mutation → done?
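A compact sketch of this loop (our simplified reconstruction; the paper adds refinements such as adaptive mutation rates, and `fitness` / `is_done` are placeholders for the target-label score obtained from model queries and the success test):

```python
import numpy as np

def gen_attack(x_orig, fitness, is_done, pop_size=6, eps=0.05,
               mut_rate=0.005, n_iters=1000):
    # Keep every candidate inside the L-inf ball around x_orig and in [0, 1].
    clip = lambda z: np.clip(np.clip(z, x_orig - eps, x_orig + eps), 0.0, 1.0)
    # Initialize Population: random perturbations of the original image.
    pop = [clip(x_orig + np.random.uniform(-eps, eps, x_orig.shape))
           for _ in range(pop_size)]
    for _ in range(n_iters):
        scores = np.array([fitness(p) for p in pop])   # Fitness Scoring
        best = pop[int(scores.argmax())]
        if is_done(best):                              # done?
            return best
        probs = np.exp(scores) / np.exp(scores).sum()  # Selection weights
        children = [best]                              # elitism: keep the best
        while len(children) < pop_size:
            i, j = np.random.choice(pop_size, size=2, p=probs)
            mask = np.random.rand(*x_orig.shape) < 0.5
            child = np.where(mask, pop[i], pop[j])     # Crossover
            noise = np.random.uniform(-eps, eps, x_orig.shape)
            mut = np.random.rand(*x_orig.shape) < mut_rate
            child = np.where(mut, child + noise, child)  # Mutation
            children.append(clip(child))
        pop = children
    return max(pop, key=fitness)  # best effort if the budget runs out
```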

Attacking CIFAR-10 Model

[figure: original label vs. predicted label for adversarial examples; e.g., "Kakatoe galerita" predicted as "trolleybus"]

Attacking ImageNet Models

[figure: adversarial examples against Inception-v3]

Evaluation: (Targeted) Attack Success Rate

           | ZOO  | GenAttack
MNIST      | 100% | 100%
CIFAR-10   |  95% | 100%
ImageNet   |  18% | 100%

Evaluation: Query Efficiency (number of queries; speedup vs. ZOO in parentheses)

           | ZOO       | GenAttack
MNIST      | 2,118,222 | 996 (2,126X)
CIFAR-10   | 2,064,798 | 804 (2,568X)
ImageNet   | 2,611,456 | 97,493 (27X)

Adversarial Training

Caveats (Madry et al., 2017)
• Increase model capacity.
• Use a strong (iterative) adversary.

Methods (Tramer et al., 2017)
• Standard: generate adversarial examples using the model currently being trained (sketched below).
• Ensemble: generate adversarial examples using a model sampled from an already-trained ensemble.
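A minimal sketch of standard adversarial training (PyTorch assumed; `model`, `loader`, `loss_fn`, and `opt` are placeholders, and it reuses the fgsm() helper sketched earlier; Madry et al. would use a multi-step PGD adversary instead):

```python
import torch

def adversarial_train_epoch(model, loader, loss_fn, opt, eps=0.03):
    model.train()
    for x, y in loader:
        # Craft adversarial examples against the *current* model ("Standard").
        x_adv = fgsm(model, loss_fn, x, y, eps)
        opt.zero_grad()
        loss = loss_fn(model(x_adv), y)  # train on the adversarial batch
        loss.backward()
        opt.step()
```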

Attacking ImageNet Defense

Against ensemble adversarially trained Inception-v3: [figure: "sea anemone" misclassified as "parking meter"]

Obfuscated Gradients

Athalye et al., 2018 (ICML 2018 Best Paper)

Found that 7 ICLR 2018 defenses relied on this phenomenon:
• Shattered Gradients: the defense renders the gradient nonexistent or incorrect
• Stochastic Gradients: randomized defenses
• Exploding/Vanishing Gradients

Methods to Circumvent
• BPDA: replace the non-differentiable component (with a differentiable approximation)
• EOT: optimize through the randomization (sketched below)

These circumvention methods are white-box!
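A sketch of the EOT idea (our illustration; PyTorch assumed, with `model`, `random_transform`, and `loss_fn` as placeholders; it also assumes the sampled transformation is differentiable, otherwise BPDA would substitute an approximation):

```python
import torch

def eot_gradient(model, random_transform, loss_fn, x, y, n_samples=30):
    """Gradient of the expected loss E_t[J(model(t(x)), y)] over random t."""
    x = x.clone().detach().requires_grad_(True)
    total = 0.0
    for _ in range(n_samples):
        # random_transform applies a freshly sampled (differentiable)
        # transformation on each call, e.g. a random crop or noise layer.
        total = total + loss_fn(model(random_transform(x)), y)
    (total / n_samples).backward()  # averaging defeats single-sample noise
    return x.grad
```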

Attacking Gradient Obfuscation: (Targeted) Attack Success Rate

          | ZOO | GenAttack
Bit Depth | 8%  | 100%
JPEG      | 0%  | 86%
TVM       | —   | 70%

A few open challenges

• Attacking models with limited access
• Attacking natural language models
• Physical world attacks for speech

Natural Language Domain

• Words in text are discrete, unlike image pixels, which are continuous.
• Changing a single word can drastically change the sentence meaning.
• Have to satisfy the language's grammar constraints.

Black-box attack loop (as before): Initialize Population → Fitness Scoring → Selection → Crossover → Mutation → done?

Mutation
• Compute the N nearest neighbors of the selected word in the (counter-fitted GloVe) embedding space.
• Use the (Google 1 Billion Words) language model to filter out words that do not fit within the context; keep the top K words.
• Pick the word that will maximize the target label prediction probability.
• Perform the replacement and return the resulting sentence (see the sketch below).
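Putting those steps together (a sketch of ours; `nearest_neighbors`, `lm_score`, and `target_prob` are hypothetical helpers standing in for the GloVe neighbor lookup, the language-model filter, and a victim-model query):

```python
def mutate(sentence, idx, target_label, N=8, K=4):
    """Replace sentence[idx] following the four mutation steps above."""
    # 1. N nearest neighbors in the counter-fitted GloVe embedding space.
    candidates = nearest_neighbors(sentence[idx], n=N)
    # 2. Language-model filter: keep the K most fluent-in-context words.
    candidates = sorted(candidates,
                        key=lambda w: lm_score(sentence, idx, w),
                        reverse=True)[:K]
    # 3. Pick the candidate maximizing the target label's probability.
    best = max(candidates, key=lambda w: target_prob(
        sentence[:idx] + [w] + sentence[idx + 1:], target_label))
    # 4. Perform the replacement and return the resulting sentence.
    return sentence[:idx] + [best] + sentence[idx + 1:]
```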

Attacking Sentiment Analysis

Attacking Textual Entailment

Attacking Speech Recognition

Audio Attacks

[Alzantot et al., 2017]
• Black-box attack on speech command recognition
• Method: genetic algorithms

[Carlini et al., 2018]
• White-box attack on speech-to-text recognition
• Method: iterative optimization (sketched below)
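The white-box audio attack can be sketched as minimizing a transcription loss toward the target phrase plus a perturbation penalty (our paraphrase of the iterative-optimization approach; PyTorch assumed, with `stt_model` and `ctc_loss_fn` as placeholders for a speech-to-text network and its CTC loss):

```python
import torch

def audio_attack(stt_model, ctc_loss_fn, audio, target_text,
                 steps=1000, lr=0.01, c=1.0):
    delta = torch.zeros_like(audio, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv = (audio + delta).clamp(-1.0, 1.0)  # stay a valid waveform
        # Push the transcription toward the target while keeping delta small.
        loss = ctc_loss_fn(stt_model(adv), target_text) + c * delta.norm() ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (audio + delta).clamp(-1.0, 1.0).detach()
```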

Attacking Smart Speakers

[chart: attack success rate, human evaluation; values 9%, 1%, 90% over the categories Source, Target, Other]

Physical-World Attacks

• Images: [Kurakin et al., 2016; Athalye et al., 2017]

Over-the-air Adversarial Audio?

A few open challenges

• Attacking models with limited access (!)
• Attacking natural language models (!)
• Physical world attacks for speech (?)