Practical adversarial attacks in challenging environments
Presenters: Moustafa Alzantot (UCLA), Yash Sharma (Cornell)
Joint work done with: Mani Srivastava (UCLA), Supriyo Chakraborty (IBM Research), Ananthram Swami (ARL), Ahmed Elgohary (UMD), Bharathan Balaji (UCLA), Bo-Jhang Ho (UCLA), Kai-Wei Chang (UCLA)
Adversarial Examples
2014, Szegedy et al., "Intriguing properties of neural networks": small changes in input can lead to significant changes in model output.
Goodfellow et al., "Explaining and harnessing adversarial examples":
• Introduced FGSM to compute adversarial examples.
Adversarial Attacks
• Current (2018):
• Many attacks: PGD, DeepFool, C&W/EAD, Houdini, others.
• Software libraries: CleverHans, IBM ART.
• Competitions: NIPS 2017, CAAD 2018.
Adversarial Patch, T. B. Brown, et al.
A few open challenges
• Attacking models with limited access
• Attacking natural language models
• Physical world attacks for speech
Generating Adversarial Examples
where:
x: original image
l: target label
r: added noise
[Goodfellow et al., 2014] introduced the Fast Gradient Sign Method (FGSM).
While successful, gradient-based methods work only under "white-box" settings.
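The equation on this slide did not survive extraction; a plausible reconstruction from the listed variables, following Szegedy et al.'s formulation and the FGSM update (both are well known, but this is my reconstruction, not the slide's image):

```latex
% Szegedy et al.: smallest perturbation r that makes classifier f
% output target label l on image x
\min_{r} \; \|r\|_2 \quad \text{s.t.} \quad f(x + r) = l
% FGSM (Goodfellow et al.): a single signed-gradient step of the loss J
x_{adv} = x + \epsilon \cdot \operatorname{sign}\!\left(\nabla_x J(\theta, x, l)\right)
```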
Black-box Attacks
[Papernot et al., 2016]
• Query the victim model to train a "substitute model".
• Attack the substitute model, hoping the attack transfers to the victim model.
[Chen et al., 2017]
• Estimate the gradient using finite differences (zeroth-order optimization).
These methods are inefficient, as they require a HUGE number of queries!
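The finite-difference idea can be sketched as follows. This is an illustrative simplification (ZOO itself uses coordinate-wise stochastic updates and other tricks to cut queries), but it shows why the query count scales with input dimension:

```python
import numpy as np

def estimate_gradient(f, x, h=1e-4):
    """Zeroth-order gradient estimate of a black-box scalar function f via
    central finite differences, one coordinate at a time. A d-dimensional
    input costs 2*d queries per estimate -- which is why naive black-box
    attacks need huge query budgets."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e.flat[i] = h                                  # perturb one coordinate
        grad.flat[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad
```

For an image with hundreds of thousands of pixels, even a single gradient estimate this way already costs hundreds of thousands of queries.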
GenAttack
• Black-box attack: attacker knows nothing about the model architecture and parameters.
• Attacker can only query the model as a black-box function.
Idea: Rely on gradient-free optimization (i.e., genetic algorithms) to avoid having to compute the gradient.
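A minimal sketch of the gradient-free idea, assuming only a `query_fn` that returns the target label's probability; the population size, mutation rate, and selection scheme here are illustrative, not GenAttack's actual hyperparameters:

```python
import numpy as np

def gen_attack(query_fn, x_orig, eps=0.3, pop_size=6, n_gens=25,
               mut_rate=0.1, rng=None):
    """Gradient-free black-box attack sketch in the spirit of GenAttack.
    The only model access is query_fn(x), assumed to return the target
    label's probability. Perturbations stay in an L-inf ball of radius eps."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Population of candidate adversarial examples inside the eps-ball.
    pop = x_orig + rng.uniform(-eps, eps, size=(pop_size,) + x_orig.shape)
    for _ in range(n_gens):
        fitness = np.array([query_fn(p) for p in pop])
        elite = pop[fitness.argmax()].copy()
        if fitness.max() > 0.5:          # target label now most likely
            return elite
        # Fitness-proportional parent selection.
        probs = fitness / fitness.sum()
        parents = rng.choice(pop_size, size=(pop_size, 2), p=probs)
        # Uniform crossover, then mutate a random fraction of entries.
        mask = rng.random((pop_size,) + x_orig.shape) < 0.5
        children = np.where(mask, pop[parents[:, 0]], pop[parents[:, 1]])
        mutate = rng.random(children.shape) < mut_rate
        noise = rng.uniform(-eps, eps, size=children.shape)
        children = np.where(mutate, children + noise, children)
        # Clip back into the eps-ball and carry the elite over unchanged.
        pop = np.clip(children, x_orig - eps, x_orig + eps)
        pop[0] = elite
    fitness = np.array([query_fn(p) for p in pop])
    return pop[fitness.argmax()]
```

Note that each generation costs only `pop_size` queries, versus 2 queries per input dimension for a single finite-difference gradient estimate.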
Evaluation (Targeted) Attack success rate:

           ZOO        GenAttack
MNIST      100%       100%
CIFAR-10   95%        100%
ImageNet   18%        100%

Evaluation: Query efficiency (median queries; speedup in parentheses):

           ZOO        GenAttack
MNIST      2,118,222  996 (2,126X)
CIFAR-10   2,064,798  804 (2,568X)
ImageNet   2,611,456  97,493 (27X)
Adversarial Training
Caveats (Madry et al., 2017):
• Increase model capacity.
• Use a strong (iterative) adversary.
Methods (Tramèr et al., 2017):
• Standard: generate adversarial examples using the model currently being trained.
• Ensemble: generate adversarial examples using a model sampled from an already-trained ensemble.
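The "standard" method can be sketched on a toy logistic-regression model, with a single-step FGSM adversary standing in for an iterative one (the papers use deep networks; this only shows the loop structure, and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adv_train_step(w, X, y, eps=0.1, lr=0.1):
    """One step of "standard" adversarial training on logistic regression:
    craft FGSM examples against the CURRENT model, then take a gradient
    step on the adversarial batch instead of the clean one."""
    # FGSM: move each input in the direction that increases its loss.
    p = sigmoid(X @ w)
    grad_x = np.outer(p - y, w)          # d(logistic loss)/dX
    X_adv = X + eps * np.sign(grad_x)
    # Update the weights on the adversarial batch.
    p_adv = sigmoid(X_adv @ w)
    grad_w = X_adv.T @ (p_adv - y) / len(y)
    return w - lr * grad_w
```

The "ensemble" variant would instead draw `X_adv` from attacks against pre-trained models, decoupling the adversary from the model being trained.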
Obfuscated Gradients (Athalye et al., 2018; ICML 2018 Best Paper)
Found that 7 ICLR 2018 defenses relied on this phenomenon:
• Shattered gradients: the defense renders the gradient nonexistent or incorrect.
• Stochastic gradients: randomized defenses.
• Exploding/vanishing gradients.
Methods to circumvent:
• BPDA: Replace the non-differentiable component with a differentiable approximation.
• EOT: Optimize through the randomization.
Note: these circumvention methods are white-box!
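The EOT idea can be sketched as averaging gradients over sampled transforms, approximating the gradient of the expected loss. `loss_grad` and `sample_transform` are hypothetical stand-ins, and for simplicity each transform's Jacobian is assumed to be the identity (true for additive noise):

```python
import numpy as np

def eot_gradient(loss_grad, x, sample_transform, n_samples=64, rng=None):
    """Expectation Over Transformation (EOT) sketch: a randomized defense
    applies a random transform t to the input before the model, so attack
    the EXPECTED loss by averaging gradients over many sampled transforms."""
    if rng is None:
        rng = np.random.default_rng(0)
    g = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        t = sample_transform(rng)      # draw one random transform
        g += loss_grad(t(x))           # gradient at the transformed input
    return g / n_samples
```

Averaging smooths out the randomness that a single-gradient attack would see as noise, which is why randomized defenses alone do not stop a white-box adversary.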
A few open challenges
• Attacking models with limited access
• Attacking natural language models
• Physical world attacks for speech
Natural Language Domain
• Words in text are discrete, unlike image pixels, which are continuous.
• Changing a single word can drastically change the sentence meaning.
• Have to satisfy the language's grammar constraints.
Mutation
• Compute the N nearest neighbors of the selected word in the (counter-fitted GloVe) embedding space.
• Use the (Google 1 Billion Words) language model to filter out words that do not fit within the context; keep the top K words.
• Pick the word that will maximize the target label's prediction probability.
• Perform the replacement -> return the resulting sentence.
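The steps above can be sketched as follows; `embeddings`, `lm_score`, and `target_prob` are hypothetical stand-ins for the counter-fitted GloVe vectors, the language model, and the victim model:

```python
import numpy as np

def mutate_word(word, embeddings, vocab, lm_score, target_prob,
                sentence, pos, N=8, K=4):
    """Sketch of one mutation step: nearest neighbors in embedding space,
    language-model filtering, then pick the replacement that most raises
    the target label's probability."""
    v = embeddings[word]
    # Step 1: N nearest neighbors by distance in embedding space.
    dists = {w: np.linalg.norm(embeddings[w] - v)
             for w in vocab if w != word}
    neighbors = sorted(dists, key=dists.get)[:N]
    # Step 2: keep the K candidates that best fit the context per the LM.
    candidates = sorted(neighbors, key=lm_score, reverse=True)[:K]
    # Step 3: choose the candidate maximizing the target label probability.
    best = max(candidates,
               key=lambda w: target_prob(sentence[:pos] + [w]
                                         + sentence[pos + 1:]))
    # Step 4: perform the replacement and return the sentence.
    return sentence[:pos] + [best] + sentence[pos + 1:]
```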
Audio Attacks
[Alzantot et al., 2017]
• Black-box attack on speech command recognition.
• Method: genetic algorithms.
[Carlini et al., 2018]
• White-box attack on speech-to-text recognition.
• Method: iterative optimization.
Physical-World Attacks
• Images: [Kurakin et al., 2016; Athalye et al., 2017]
• Over-the-air adversarial audio?