
Concise Explanations of Neural Networks using Adversarial Training

Prasad Chalasani 1 Jiefeng Chen 2 Amrita Roy Chowdhury 2 Somesh Jha 1 2 Xi Wu 3

Abstract

We show new connections between adversarial learning and explainability for deep neural networks (DNNs). One form of explanation of the output of a neural network model in terms of its input features is a vector of feature-attributions. Two desirable characteristics of an attribution-based explanation are: (1) sparseness: the attributions of irrelevant or weakly relevant features should be negligible, thus resulting in concise explanations in terms of the significant features, and (2) stability: it should not vary significantly within a small local neighborhood of the input. Our first contribution is a theoretical exploration of how these two properties (when using attributions based on Integrated Gradients, or IG) are related to adversarial training, for a class of 1-layer networks (which includes logistic regression models for binary and multi-class classification); for these networks we show that (a) adversarial training using an ℓ∞-bounded adversary produces models with sparse attribution vectors, and (b) natural model-training while encouraging stable explanations (via an extra term in the loss function) is equivalent to adversarial training. Our second contribution is an empirical verification of phenomenon (a), which we show, somewhat surprisingly, occurs not only in 1-layer networks, but also in DNNs trained on standard image datasets, and extends beyond IG-based attributions to those based on DeepSHAP: adversarial training with ℓ∞-bounded perturbations yields significantly sparser attribution vectors, with little degradation in performance on natural test data, compared to natural training. Moreover, the sparseness of the attribution vectors is significantly better than that achievable via ℓ1-regularized natural training.

¹XaiPient ²University of Wisconsin (Madison) ³Google. Correspondence to: Prasad Chalasani <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

1. Introduction

Despite the recent dramatic success of deep learning models in a variety of domains, two serious concerns have surfaced about these models.

Vulnerability to Adversarial Attacks: We can abstractly think of a neural network model as a function F(x) of a d-dimensional input vector x ∈ R^d, and the range of F is either a discrete set of class-labels, or a continuous set of class probabilities. Many of these models can be foiled by an adversary who imperceptibly (to humans) alters the input x by adding a perturbation δ ∈ R^d so that F(x+δ) is very different from F(x) (Szegedy et al., 2013; Goodfellow et al., 2014; Papernot et al., 2015; Biggio et al., 2013). Adversarial training (or adversarial learning) has recently been proposed as a method for training models that are robust to such attacks, by applying techniques from the area of Robust Optimization (Madry et al., 2017; Sinha et al., 2018). The core idea of adversarial training is simple: we define a set S of allowed perturbations δ ∈ R^d that we want to "robustify" against (e.g. S could be the set of δ where ||δ||_∞ ≤ ε), and perform model-training using Stochastic Gradient Descent (SGD) exactly as in natural training, except that each training example x is perturbed adversarially, i.e. replaced by x + δ* where δ* ∈ S maximizes the example's loss-contribution.
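To make the inner maximization concrete, here is a minimal numpy sketch (ours, not the authors' code) of a PGD-style ℓ∞ attack in the spirit of (Madry et al., 2017), specialized to a linear model with logistic loss; the function name, step size and iteration count are illustrative choices:

```python
import numpy as np

def pgd_linf(x, y, w, eps, iters=10, step=None):
    # Projected gradient ascent on the loss over the l_inf ball ||delta||_inf <= eps.
    # Model: linear classifier with logistic loss L = log(1 + exp(-y*<w, x>)), y in {-1,+1},
    # so grad_x L = -y * w * sigmoid(-y*<w, x>).  (Range clipping of x+delta is omitted.)
    step = step if step is not None else 2.5 * eps / iters
    delta = np.zeros_like(x)
    for _ in range(iters):
        m = y * np.dot(w, x + delta)
        grad_x = -y * w / (1.0 + np.exp(m))
        delta = np.clip(delta + step * np.sign(grad_x), -eps, eps)  # ascent step + projection
    return delta
```

During adversarial training, each example x would be replaced by x + pgd_linf(x, y, w, eps) before the usual SGD gradient step on the weights.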

Explainability: One way to address the well-known lack of explainability of deep learning models is feature attribution, which aims to explain the output of a model F(x) as an attribution vector A^F(x) of the contributions from the features x. There are several feature-attribution techniques in the literature, such as Integrated Gradients (IG) (Sundararajan et al., 2017), DeepSHAP (Lundberg & Lee, 2017), and LIME (Ribeiro et al., 2016). For such an explanation to be human-friendly, it is highly desirable (Molnar, 2019) that the attribution-vector is sparse, i.e., only the features that are truly predictive of the output F(x) should have significant contributions, and irrelevant or weakly-relevant features should have negligible contributions. A sparse attribution makes it possible to produce a concise explanation, where only the input features with significant contributions are included. For instance, if the model F is used for a loan approval decision, then various stakeholders (like customers, data-scientists and regulators) would like to know the reason for a specific decision in simple terms. In practice however, due to artifacts in the training data or process, the attribution vector is often not sparse, and irrelevant or weakly-relevant features end up having significant contributions (Tan et al., 2013). Another desirable property of a good explanation is stability: the attribution vector should not vary significantly within a small local neighborhood of the input x. Similar to the lack of concise explainability, natural training often results in explanations that lack stability (Alvarez-Melis & Jaakkola, 2018).

Our paper shows new connections between adversarial robustness and the above-mentioned desirable properties of explanations, namely conciseness and stability. Specifically, let F̃ be an adversarially trained version of a classifier F, and for a given input vector x and attribution method A, let A^F(x) and A^F̃(x) denote the corresponding attribution vectors. The central research question this paper addresses is:

Is A^F̃(x) sparser and more stable than A^F(x)?

The main contributions of our paper are as follows:

Theoretical Analysis of Adversarial Training: Our first set of results shows via a theoretical analysis that ℓ∞(ε)-adversarial training of 1-layer networks tends to produce sparse attribution vectors for IG, which in turn leads to concise explanations. In particular, under some assumptions, we show (Theorems 3.1 and E.1) that for a general class of convex loss functions (which includes popular loss functions used in 1-layer networks, such as logistic and hinge loss, used for binary or multi-class classification), and adversarial perturbations δ satisfying ||δ||_∞ ≤ ε, the weights of "weak" features are on average more aggressively shrunk toward zero than during natural training, and the rate of shrinkage is proportional to the amount by which ε exceeds a certain measure of the "strength" of the feature. This shows that ℓ∞(ε)-adversarial training tends to produce sparse weight vectors in popular 1-layer models. In Section 4 we show (Lemma 4.1) a closed-form formula for the IG vector of 1-layer models, which makes it clear that in these models, sparseness of the weight vector directly implies sparseness of the IG vector.

Empirically Demonstrate Attribution Sparseness: In Section 6 we empirically demonstrate that this "sparsification" effect of ℓ∞(ε)-adversarial training holds not only for 1-layer networks (e.g. logistic regression models), but also for Deep Convolutional Networks used for image classification, and extends beyond IG-based attributions to those based on DeepSHAP. Specifically, we show this phenomenon via experiments applying ℓ∞(ε)-adversarial training to (a) Convolutional Neural Networks on public benchmark image datasets MNIST (LeCun & Cortes, 2010) and Fashion-MNIST (Xiao et al., 2017), and (b) logistic regression models on the Mushroom and Spambase tabular datasets from the UCI Data Repository (Dheeru & Karra Taniskidou, 2017). In all of our experiments, we find that it is possible to choose an ℓ∞ bound ε so that adversarial learning under this bound produces attribution vectors that are sparse on average, with little or no drop in performance on natural test data. A visually striking example of this effect is shown in Figure 1 (the Gini Index, introduced in Section 6, measures the sparseness of the map).

It is natural to wonder whether a traditional weight-regularization technique such as ℓ1-regularization can produce models with sparse attribution vectors. In fact, our experiments show that for logistic regression models, ℓ1-regularized training does yield attribution vectors that are on average significantly sparser compared to attribution vectors from natural (un-regularized) model-training, and the sparseness improvement is almost as good as that obtained with ℓ∞(ε)-adversarial training. This is not too surprising given our result (Lemma 4.1) that implies a direct link between sparseness of weights and sparseness of IG vectors for 1-layer models. Intriguingly, this does not carry over to DNNs: for multi-layer models (such as the ConvNets we trained for the image datasets mentioned above) we find that with ℓ1-regularization, the sparseness improvement is significantly inferior to that obtainable from ℓ∞(ε)-adversarial training (when controlling for model accuracy on natural test data), as we show in Table 1, Figure 2 and Figure 3. Thus it appears that for DNNs, the attribution-sparseness that results from adversarial training is not necessarily related to sparseness of weights.

Connection between Adversarial Training and Attribution Stability: We also show theoretically (Section 5) that training 1-layer networks naturally, while encouraging stability of explanations (via a suitable term added to the loss function), is in fact equivalent to adversarial training.

Figure 1: (Left: Natural Training, Gini: 0.6150; Right: Adversarial Training, Gini: 0.7084.) Both models correctly predict "Bird", but the IG-based saliency map of the adversarially trained model is much sparser than that of the naturally trained model.


2. Setup and Assumptions

For ease of understanding, we consider the case of binary classification for the rest of our discussion in the main paper. We assume there is a distribution D of data points (x, y) where x ∈ R^d is an input feature vector, and y ∈ {±1} is its true label¹. For each i ∈ [d], the i'th component of x represents an input feature, and is denoted by x_i. The model is assumed to have learnable parameters ("weights") w ∈ R^d, and for a given data point (x, y), the loss is given by some function L(x, y; w). Natural model training² consists of minimizing the expected loss, known as empirical risk:

E_{(x,y)∼D}[ L(x, y; w) ].    (1)

We sometimes assume the existence of an ℓ∞(ε)-adversary who may perturb the input example x by adding a vector δ ∈ R^d whose ℓ∞-norm is bounded by ε; such a perturbation δ is referred to as an ℓ∞(ε)-perturbation. For a given data point (x, y) and a given loss function L(.), an ℓ∞(ε)-adversarial perturbation is a δ* that maximizes the adversarial loss L(x+δ*, y; w).

Given a function F : R^d → [0, 1] representing a neural network, an input vector x ∈ R^d, and a suitable baseline vector u ∈ R^d, an attribution of the prediction of F at input x relative to u is a vector A^F(x, u) ∈ R^d whose i'th component A^F_i(x, u) represents the "contribution" of x_i to the prediction F(x). A variety of attribution methods have been proposed in the literature (see (Arya et al., 2019) for a survey), but in this paper we will focus on two of the most popular ones: Integrated Gradients (Sundararajan et al., 2017), and DeepSHAP (Lundberg & Lee, 2017). When discussing a specific attribution method, we will denote the IG-based attribution vector as IG^F(x, u), and the DeepSHAP-based attribution vector as SH^F(x, u). In all cases we will drop the superscript F and/or the baseline vector u when those are clear from the context.

The aim of adversarial training (Madry et al., 2017) is to train a model that is robust to an ℓ∞(ε)-adversary (i.e. performs well in the presence of such an adversary), and consists of minimizing the expected ℓ∞(ε)-adversarial loss:

E_{(x,y)∼D}[ max_{||δ||_∞ ≤ ε} L(x+δ, y; w) ].    (2)

In the expectations (1) and (2) we often drop the subscript under E when it is clear that the expectation is over (x, y) ∼ D.

Some of our theoretical results make assumptions regarding the form and properties of the loss function L and of its first derivative. For the sake of clarity, we highlight these assumptions (with mnemonic names) here for ease of future reference.

¹It is trivial to convert -1/1 labels to 0/1 labels and vice versa.
²Also referred to as standard training by (Madry et al., 2017).

Assumption LOSS-INC. The loss function is of the form L(x, y; w) = g(−y〈w, x〉) where g is a non-decreasing function.

Assumption LOSS-CVX. The loss function is of the form L(x, y; w) = g(−y〈w, x〉) where g is non-decreasing, almost-everywhere differentiable, and convex.

Section B.1 in the Supplement shows that these Assumptions are satisfied by popular loss functions such as logistic and hinge loss. Incidentally, note that for any differentiable function g, g is convex if and only if its first derivative g′ is non-decreasing, and we will use this property in some of the proofs.

Assumption FEAT-TRANS. For each i ∈ [d], if x′_i is the feature in the original dataset, x_i is the translated version of x′_i defined by x_i = x′_i − [E(x′_i | y = 1) + E(x′_i | y = −1)]/2.

In Section B.2 (Supplement) we show that this mild assumption implies that for each feature x_i there is a constant a_i such that

E(x_i | y) = a_i y,    (3)
E(y x_i) = E[E(y x_i | y)] = E[y² a_i] = a_i,    (4)
E(y x_i | y) = y E[x_i | y] = y² a_i = a_i.    (5)

For any i ∈ [d], we can think of E(y x_i) as the degree of association³ between feature x_i and label y. Since E(y x_i) = a_i (Eq. 4), we refer to a_i as the directed strength⁴ of feature x_i, and |a_i| is referred to as the absolute strength of x_i. In particular, when |a_i| is large (small) we say that x_i is a strong (weak) feature.
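As a concrete illustration (a sketch with our own naming, not code from the paper), the translation of Assumption FEAT-TRANS and the directed strengths a_i = E(y x_i) can be estimated from a labeled sample as follows:

```python
import numpy as np

def feat_trans(X, y):
    # Assumption FEAT-TRANS: center each feature at the midpoint of its two
    # class-conditional means, so that (approximately) E(x_i | y) = a_i * y.
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == -1].mean(axis=0)
    return X - (mu_pos + mu_neg) / 2.0

def directed_strengths(X, y):
    # Sample estimate of a_i = E(y * x_i) for each feature i.
    Xt = feat_trans(X, y)
    return (y[:, None] * Xt).mean(axis=0)
```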

2.1. Averaging over a group of features

For our main theoretical result (Theorem 3.1), we need a notion of weighted average defined as follows:

Definition WTD-AV. Given a quantity q_i defined for each feature-index i ∈ [d], a subset S ⊂ [d] (where |S| ≥ 1) of feature-indices, and a feature weight-vector w with w_i ≠ 0 for at least one i ∈ S, the w-weighted average of q over S is defined as

q^w_S := ( Σ_{i∈S} w_i q_i ) / ( Σ_{i∈S} |w_i| ).    (6)

Note that the quantity w_i q_i can be written as |w_i| sgn(w_i) q_i, so q^w_S is essentially a |w_i|-weighted average of sgn(w_i) q_i over i ∈ S.
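A direct transcription of Definition WTD-AV (a sketch, with hypothetical names):

```python
import numpy as np

def weighted_avg(q, w, S):
    # w-weighted average of q over the feature subset S (Eq. 6):
    # sum_{i in S} w_i * q_i / sum_{i in S} |w_i|.
    q, w, S = np.asarray(q), np.asarray(w), np.asarray(S)
    assert np.any(w[S] != 0), "at least one weight in S must be non-zero"
    return np.sum(w[S] * q[S]) / np.sum(np.abs(w[S]))
```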

For our result we will use the above w-weighted average definition for two particular quantities q_i. The first one is q_i := a_i, the directed strength of feature x_i (Eq. 4). Intuitively, the quantity sgn(w_i)E(y x_i) = a_i sgn(w_i) captures the aligned strength of feature x_i in the following sense: if this quantity is large and positive (large and negative), it indicates both that the current weight w_i of x_i is aligned (misaligned) with the directed strength of x_i, and that this directed strength is large. Thus a^w_S represents an average of the aligned strength over the feature-group S.

³When the features are standardized to have mean 0, E(y x_i) is in fact the covariance of y and x_i.
⁴This is related to the feature "robustness" notion introduced in (Ilyas et al., 2019).

The second quantity for which we define the above w-weighted average is q_i := ∆_i, where ∆_i := −E[∂L/∂w_i] represents the expected SGD update (over random draws from the data distribution) of the weight w_i, given the loss function L, for a unit learning rate (details are in the next Section). The quantity sgn(w_i)∆_i has a natural interpretation, analogous to the above interpretation of a_i sgn(w_i): a large positive (large negative) value of sgn(w_i)∆_i corresponds to a large expansion (large shrinkage), in expectation, of the weight w_i away from zero magnitude (toward zero magnitude). Thus the w-weighted average ∆^w_S represents the |w_i|-weighted average of this effect over the feature-group S.

3. Analysis of SGD Updates in Adversarial Training

One way to understand the characteristics of the weights in an adversarially-trained neural network model is to analyze how the weights evolve during adversarial training under Stochastic Gradient Descent (SGD) optimization. One of the main results of this work is a theoretical characterization of the weight updates during a single SGD step, when applied to a randomly drawn data point (x, y) ∼ D that is subjected to an ℓ∞(ε)-adversarial perturbation.

Although the holy grail would be to do this for general DNNs (and we expect this will be quite difficult), we take a first step in this direction by analyzing single-layer networks for binary or multi-class classification, where each weight is associated with an input feature. Intriguingly, our results (Theorem 3.1 for binary classification and E.1 for multi-class classification in the Supplement) show that for these models, ℓ∞(ε)-adversarial training tends to selectively reduce the weight-magnitude of weakly relevant or irrelevant features, and does so much more aggressively than natural training. In other words, natural training can result in models where many weak features have significant weights, whereas adversarial training would tend to push most of these weights close to zero. The resulting model weights would thus be more sparse, and the corresponding IG-based attribution vectors would on average be more sparse as well (since in linear models, sparse weights imply sparse IG vectors; this is a consequence of Lemma 4.1) compared to naturally-trained models.

Our experiments (Sec. 6) show that indeed for logistic regression models (which satisfy the conditions of Theorem 3.1), adversarial training leads to sparse IG vectors. Interestingly, our extensive experiments with Deep Convolutional Neural Networks on public image datasets demonstrate that this phenomenon extends to DNNs as well, and to attributions based on DeepSHAP, even though our theoretical results only apply to 1-layer networks and IG-based attributions.

As a preliminary, it is easy to show the following expressions related to the ℓ∞(ε)-adversarial perturbation δ* (see Lemmas 2 and 3 in Section C of the Supplement): For loss functions satisfying Assumption LOSS-INC, the ℓ∞(ε)-adversarial perturbation δ* is given by

δ* = −y sgn(w) ε,    (7)

the corresponding ℓ∞(ε)-adversarial loss is

L(x+δ*, y; w) = g(ε||w||_1 − y〈w, x〉),    (8)

and the gradient of this loss w.r.t. a weight w_i is

∂L(x+δ*, y; w)/∂w_i = −g′(ε||w||_1 − y〈w, x〉) (y x_i − sgn(w_i) ε).    (9)

In our main result, the expectation of the g′ term in (9) plays an important role, so we will use the following notation:

ḡ′ := E[ g′(ε||w||_1 − y〈w, x〉) ],    (10)

and by Assumption LOSS-INC, ḡ′ is non-negative.
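As a quick sanity check (ours, not from the paper), the closed forms (8)–(9) for the logistic loss g(z) = ln(1 + e^z) can be compared against a central-difference gradient of the adversarial loss; this is valid away from w_i = 0, where sgn is not differentiable:

```python
import numpy as np

def g(z):
    return np.log1p(np.exp(z))            # logistic: L(x, y; w) = g(-y*<w, x>)

def adv_loss(w, x, y, eps):
    # Closed-form l_inf(eps)-adversarial loss, Eq. (8)
    return g(eps * np.sum(np.abs(w)) - y * np.dot(w, x))

def adv_grad(w, x, y, eps):
    # Eq. (9): -g'(.) * (y*x_i - sgn(w_i)*eps), with g'(z) = 1/(1 + e^{-z})
    z = eps * np.sum(np.abs(w)) - y * np.dot(w, x)
    return -(1.0 / (1.0 + np.exp(-z))) * (y * x - np.sign(w) * eps)

rng = np.random.default_rng(0)
w, x, y, eps, h = rng.normal(size=5), rng.normal(size=5), 1.0, 0.1, 1e-6
num = np.array([(adv_loss(w + h*e, x, y, eps) - adv_loss(w - h*e, x, y, eps)) / (2*h)
                for e in np.eye(5)])
print(np.allclose(num, adv_grad(w, x, y, eps), atol=1e-5))  # expect True
```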

Ideally, we would like to understand the nature of the weight-vector w* that minimizes the expected adversarial loss (2). This is quite challenging, so rather than analyzing the final optimum of (2), we instead analyze how an SGD-based optimizer for (2) updates the model weights w. We assume an idealized SGD process: (a) a data point (x, y) is drawn from distribution D, (b) x is replaced by x′ = x+δ* where δ* is an ℓ∞(ε)-adversarial perturbation with respect to the loss function L, (c) each weight w_i is updated by an amount ∆w_i = −∂L(x′, y; w)/∂w_i (assuming a unit learning rate to avoid notational clutter). We are interested in the expectation ∆_i := E∆w_i = −E[∂L(x′, y; w)/∂w_i], in order to understand how a weight w_i evolves on average during a single SGD step. When there is a conditionally independent feature subset S ⊆ [d] (i.e. the features in S are conditionally independent of the rest given the label y), our main theoretical result characterizes the behavior of ∆_i for i ∈ S, and the corresponding w-weighted average ∆^w_S:

Theorem 3.1 (Expected SGD Update in Adversarial Training). For any loss function L satisfying Assumption LOSS-CVX, a dataset D satisfying Assumption FEAT-TRANS, and a subset S of features that are conditionally independent of the rest given the label y, if a data point (x, y) is randomly drawn from D, and x is perturbed to x′ = x+δ*, where δ* is an ℓ∞(ε)-adversarial perturbation, then during SGD using the ℓ∞(ε)-adversarial loss L(x′, y; w), the expected weight-updates ∆_i := E∆w_i for i ∈ S and the corresponding w-weighted average ∆^w_S satisfy the following properties:

1. If w_i = 0 ∀ i ∈ S, then for each i ∈ S,

   ∆_i = ḡ′ a_i,    (11)

2. and otherwise,

   ∆^w_S ≤ ḡ′ (a^w_S − ε),    (12)

   and equality holds in the limit as w_i → 0 ∀ i ∈ S,

where ḡ′ is the expectation in (10), a_i = E(x_i y) is the directed strength of feature x_i from Eq. (4), and a^w_S is the corresponding w-weighted average over S.

For space reasons, a detailed discussion of the implications of this result is presented in Sec. D.2 of the Supplement, but here we note the following. Recalling the interpretation of the w-weighted averages a^w_S and ∆^w_S in Section 2.1, we can interpret the above result as follows. For any conditionally independent feature subset S, if the weights of all features in S are zero, then by Eq. (11), an SGD update causes, on average, each of these weights w_i to grow (from 0) in a direction consistent with the directed feature-strength a_i (since ḡ′ ≥ 0 as noted above). If at least one of the features in S has a non-zero weight, (12) implies ∆^w_S < 0, i.e., an aggregate shrinkage of the weights of features in S, if either of the following hold: (a) a^w_S < 0, i.e., the weights of features in S are mis-aligned on average, or (b) the weights of features in S are aligned on average, i.e., a^w_S is positive, but dominated by ε, i.e. the features in S are weakly correlated with the label. In the latter case the weights of features in S are (in aggregate and in expectation) aggressively pushed toward zero, and this aggressiveness is proportional to the extent to which ε dominates a^w_S. A partial generalization of the above result to the multi-class setting (for a single conditionally-independent feature) is presented in Section E (Theorem E.1) of the Supplement.
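The following toy simulation (ours; the dataset, strengths and ε are arbitrary illustrative choices) is consistent with this prediction: logistic-regression SGD with the closed-form ℓ∞(ε) perturbation of Eq. (7) drives the weights of weak features much closer to zero than natural SGD does.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_strong, d_weak = 5000, 5, 45
a = np.concatenate([np.full(d_strong, 0.5), np.full(d_weak, 0.02)])   # directed strengths
y = rng.choice([-1.0, 1.0], size=n)
X = a * y[:, None] + rng.normal(scale=0.5, size=(n, d_strong + d_weak))

def train(X, y, eps=0.0, lr=0.1, epochs=30):
    # Logistic-loss SGD; eps > 0 applies the closed-form l_inf(eps) perturbation (Eq. 7).
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            x_adv = X[i] - y[i] * np.sign(w) * eps
            m = np.clip(y[i] * (x_adv @ w), -30, 30)
            w -= lr * (-y[i] * x_adv / (1.0 + np.exp(m)))
    return w

w_nat, w_adv = train(X, y, eps=0.0), train(X, y, eps=0.1)
print("mean |weight|, weak features, natural    :", np.abs(w_nat[d_strong:]).mean())
print("mean |weight|, weak features, adversarial:", np.abs(w_adv[d_strong:]).mean())
```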

4. Feature Attribution using Integrated Gradients

Theorem 3.1 showed that ℓ∞(ε)-adversarial training tends to shrink the weights of features that are "weak" (relative to ε). We now show a link between weights and explanations, specifically explanations in the form of a vector of feature-attributions given by the Integrated Gradients (IG) method (Sundararajan et al., 2017), which is defined as follows. Suppose F : R^d → R is a real-valued function of an input vector. For example F could represent the output of a neural network, or even a loss function L(x, y; w) when the label y and weights w are held fixed. Let x ∈ R^d be a specific input, and u ∈ R^d be a baseline input. The IG is defined as the path integral of the gradients along the straight-line path from the baseline u to the input x. The IG along the i'th dimension for an input x and baseline u is defined as:

IG^F_i(x, u) := (x_i − u_i) × ∫_{α=0}^{1} ∂_i F(u + α(x − u)) dα,    (13)

where ∂_i F(z) denotes the gradient of F(v) along the i'th dimension, at v = z. The vector of all IG components IG^F_i(x, u) is denoted as IG^F(x, u). Although we do not show w explicitly as an argument in the notation IG^F(x, u), it should be understood that the IG depends on the model weights w since the function F depends on w.
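For reference, a minimal Riemann-sum approximation of Eq. (13) (a sketch; grad_F is assumed to be a callable returning ∇F, and the step count is arbitrary):

```python
import numpy as np

def integrated_gradients(grad_F, x, u, steps=100):
    # Approximate Eq. (13): average the gradient of F along the straight-line
    # path from baseline u to input x (midpoint rule), then scale by (x - u).
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean([grad_F(u + a * (x - u)) for a in alphas], axis=0)
    return (x - u) * avg_grad
```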

The following Lemma (proved in Sec. F of the Supplement) shows a closed-form exact expression for IG^F(x, u) when F(x) is of the form

F(x) = A(〈w, x〉),    (14)

where w ∈ R^d is a vector of weights, A is a differentiable scalar-valued function, and 〈w, x〉 denotes the dot product of w and x. Note that this form of F could represent a single-layer neural network with any differentiable activation function (e.g., logistic (sigmoid) activation A(z) = 1/[1 + exp(−z)] or Poisson activation A(z) = exp(z)), or a differentiable loss function, such as those that satisfy Assumption LOSS-INC for a fixed label y and weight-vector w. For brevity, we will refer to a function of the form (14) as representing a "1-Layer Network", with the understanding that it could equally well represent a suitable loss function.

Lemma 4.1 (IG Attribution for 1-layer Networks). If F(x) is computed by a 1-layer network (14) with weight vector w, then the Integrated Gradients for all dimensions of x relative to a baseline u are given by:

IG^F(x, u) = [F(x) − F(u)] · (x − u) ⊙ w / 〈x − u, w〉,    (15)

where the ⊙ operator denotes the entry-wise product of vectors.

Thus for 1-layer networks, the IG of each feature is essentially proportional to the feature's fractional contribution to the logit-change 〈x − u, w〉. This makes it clear that in such models, if the weight-vector w is sparse, then the IG vector will also be correspondingly sparse.
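A quick numerical check (ours, not from the paper) of Eq. (15) for a 1-layer sigmoid network, against a Riemann-sum evaluation of the path integral in Eq. (13):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(1)
w, x, u = rng.normal(size=4), rng.normal(size=4), np.zeros(4)
F = lambda v: sigmoid(np.dot(w, v))

# Closed form of Lemma 4.1 (Eq. 15)
ig_closed = (F(x) - F(u)) * (x - u) * w / np.dot(x - u, w)

# Riemann-sum approximation of Eq. (13); grad_i F(v) = sigmoid'(<w, v>) * w_i
alphas = (np.arange(1000) + 0.5) / 1000
s = np.array([sigmoid(np.dot(w, u + a * (x - u))) for a in alphas])
ig_numeric = (x - u) * (s * (1 - s)).mean() * w

print(np.allclose(ig_closed, ig_numeric, atol=1e-4))  # expect True
```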


5. Training with Explanation Stability is Equivalent to Adversarial Training

Suppose we use the IG method described in Sec. 4 as an explanation for the output of a model F(x) on a specific input x. A desirable property of an explainable model is that the explanation for the value of F(x) is stable (Alvarez-Melis & Jaakkola, 2018), i.e., does not change much under small perturbations of the input x. One way to formalize this is to say that the following worst-case ℓ1-norm of the change in IG should be small:

max_{x′ ∈ N(x,ε)} || IG^F(x′, u) − IG^F(x, u) ||_1,    (16)

where N(x, ε) denotes a suitable ε-neighborhood of x, and u is an appropriate baseline input vector. If the model F is a single-layer neural network, it would be a function of 〈w, x〉 for some weights w, and typically when training such networks the loss is a function of 〈w, x〉 as well, so we would not change the essence of (16) much if instead of F in each IG, we use L(x, y; w) for a fixed y; let us denote this function by L_y. Also intuitively, || IG^{L_y}(x′, u) − IG^{L_y}(x, u) ||_1 is not too different from || IG^{L_y}(x′, x) ||_1. These observations motivate the following definition of Stable-IG Empirical Risk, which is a modification of the usual empirical risk (1), with a regularizer to encourage stable IG explanations:

E_{(x,y)∼D}[ L(x, y; w) + max_{||x′−x||_∞ ≤ ε} || IG^{L_y}(x, x′) ||_1 ].    (17)

The following somewhat surprising result is proved in Section G of the Supplement.

Theorem 5.1 (Equivalence of Stable IG and Adversarial Robustness). For loss functions L(x, y; w) satisfying Assumption LOSS-CVX, the augmented loss inside the expectation (17) equals the ℓ∞(ε)-adversarial loss inside the expectation (2), i.e.

L(x, y; w) + max_{||x′−x||_∞ ≤ ε} || IG^{L_y}(x, x′) ||_1 = max_{||δ||_∞ ≤ ε} L(x+δ, y; w).    (18)

This implies that for loss functions satisfying Assumption LOSS-CVX, minimizing the Stable-IG Empirical Risk (17) is equivalent to minimizing the expected ℓ∞(ε)-adversarial loss. In other words, for this class of loss functions, natural model training while encouraging IG stability is equivalent to ℓ∞(ε)-adversarial training! Combined with Theorem 3.1 and the corresponding experimental results in Sec. 6, this equivalence implies that, for this class of loss functions, and data distributions satisfying Assumption FEAT-TRANS, the explanations for the models produced by ℓ∞(ε)-adversarial training are both concise (due to the sparseness of the models) and stable.
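A small numerical check (ours) of the identity (18) for the logistic loss: the IG of the 1-layer loss L_y is computed via the closed form of Lemma 4.1, and the maximization over ||x′−x||_∞ ≤ ε is brute-forced over the corners of the ℓ∞ ball, where (by Eq. 7) the maximizer lies:

```python
import numpy as np
from itertools import product

g = lambda t: np.log1p(np.exp(t))          # logistic loss: L(x, y; w) = g(-y*<w, x>)
rng = np.random.default_rng(2)
d, eps = 6, 0.2
w, x, y = rng.normal(size=d), rng.normal(size=d), 1.0
Ly = lambda z: g(-y * np.dot(w, z))

def ig_l1(x, x_prime):
    # ||IG^{L_y}(x, x')||_1 with baseline x', via the closed form of Lemma 4.1
    diff = x - x_prime
    return np.sum(np.abs((Ly(x) - Ly(x_prime)) * diff * w / np.dot(diff, w)))

lhs = Ly(x) + max(ig_l1(x, x + eps * np.array(s)) for s in product([-1.0, 1.0], repeat=d))
rhs = g(eps * np.sum(np.abs(w)) - y * np.dot(w, x))   # adversarial loss, Eq. (8)
print(np.isclose(lhs, rhs))  # expect True
```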

6. Experiments

6.1. Hypotheses

Recall that one implication of Theorem 3.1 is the following: For 1-layer networks where the loss function satisfies Assumption LOSS-CVX, ℓ∞(ε)-adversarial training tends to more aggressively prune the weight-magnitudes of "weak" features compared to natural training. In Sec. 4 we observed that a consequence of Lemma 4.1 is that for 1-layer models the sparseness of the weight vector implies sparseness of the IG vector. Thus a reasonable conjecture is that, for 1-layer networks, ℓ∞(ε)-adversarial training leads to models with sparse attribution vectors in general (whether using IG or a different method, such as DeepSHAP). We further conjecture that this sparsification phenomenon extends to practical multi-layer Deep Neural Networks, not just 1-layer networks, and that this benefit can be realized without significantly impacting accuracy on natural test data. Finally, we hypothesize that the resulting sparseness of attribution vectors is better than what can be achieved by a traditional weight regularization technique such as ℓ1-regularization, for a comparable level of natural test accuracy.

6.2. Measuring Sparseness of an Attribution Vector

For an attribution method A, we quantify the sparseness of the attribution vector A^F(x, u) using the Gini Index applied to the vector of absolute values |A^F(x, u)|. For a vector v of non-negative values, the Gini Index, denoted G(v) (defined formally in Sec. I in the Supplement), is a metric for sparseness of v that is known (Hurley & Rickard, 2009) to satisfy a number of desirable properties, and has been used to quantify sparseness of weights in a neural network (Guest & Love, 2017). The Gini Index by definition lies in [0,1], and a higher value indicates more sparseness.
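One common form of the Gini sparseness index (following Hurley & Rickard, 2009; the paper's exact definition is in Sec. I of its Supplement) can be computed as follows. This is a sketch, not the authors' implementation:

```python
import numpy as np

def gini_index(v):
    # Gini sparseness of a non-negative vector: 0 for a uniform vector,
    # approaching 1 for a vector whose mass sits in a single entry.
    v = np.sort(np.abs(np.asarray(v, dtype=float)))   # ascending
    n = v.size
    if v.sum() == 0:
        return 0.0
    k = np.arange(1, n + 1)
    return 1.0 - 2.0 * np.sum((v / v.sum()) * (n - k + 0.5) / n)

print(gini_index([0.0, 0.0, 0.0, 1.0]))   # 0.75  (sparse)
print(gini_index([1.0, 1.0, 1.0, 1.0]))   # 0.0   (uniform)
```

In this notation, G[A](x) is simply gini_index(A(x)) for the chosen attribution method A.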

Since the model F is clear from the context, and the baseline vector u is fixed for a given dataset, we will denote the attribution vector on input x simply as A(x), and our measure of sparseness is G(|A(x)|), which we denote for brevity as G[A](x), and refer to informally as the Gini of A, where A can stand for IG (when using IG-based attributions) or SH (when using DeepSHAP for attribution). As mentioned above, one of our hypotheses is that the sparseness of attributions of models produced by ℓ∞(ε)-adversarial training is much better than what can be achieved by natural training using ℓ1-regularization, for a comparable level of accuracy. To verify this hypothesis we will compare the sparseness of attribution vectors resulting from three types of models: (a) n-model: naturally-trained model with no adversarial perturbations and no ℓ1-regularization, (b) a-model: ℓ∞(ε)-adversarially trained model, and (c) l-model: naturally trained model with ℓ1-regularization strength λ > 0. For an attribution method A, we denote the Gini indices G[A](x) resulting from these models respectively as G_n[A](x), G_a[A](x; ε) and G_l[A](x; λ).

In several of our datasets, individual feature vectors are already quite sparse: for example in the MNIST dataset, most of the area consists of black pixels, and in the Mushroom dataset, after 1-hot encoding the 22 categorical features, the resulting 120-dimensional feature-vector is sparse. On such datasets, even an n-model can achieve a "good" level of sparseness of attributions in the absolute sense, i.e. G_n[A](x) can be quite high. Therefore for all datasets we compare the sparseness improvement resulting from an a-model relative to an n-model, with that from an l-model relative to an n-model. More precisely, we will compare the two quantities defined below, for a given attribution method A:

dG_a[A](x; ε) := G_a[A](x; ε) − G_n[A](x),    (19)
dG_l[A](x; λ) := G_l[A](x; λ) − G_n[A](x).    (20)

The above quantities define the IG sparseness improvements for a single example x. It will be convenient to define the overall sparseness improvement from a model, as measured on a test dataset, by averaging over all examples x in that dataset. We denote the corresponding average sparseness metrics by G_a[A](ε), G_l[A](λ) and G_n[A] respectively. We then define the average sparseness improvement of an a-model and l-model as:

dG_a[A](ε) := G_a[A](ε) − G_n[A],    (21)
dG_l[A](λ) := G_l[A](λ) − G_n[A].    (22)

We can thus re-state our hypotheses in terms of this notation: For each of the attribution methods A ∈ {IG, SH}, the average sparseness improvement dG_a[A](ε) resulting from type-a models is high, and is significantly higher than the average sparseness improvement dG_l[A](λ) resulting from type-l models.

6.3. Results

We ran experiments on five standard public benchmark datasets: three image datasets MNIST, Fashion-MNIST, and CIFAR-10, and two tabular datasets from the UCI Data Repository: Mushroom and Spambase. Details of the datasets and training methodology are in Sec. J.1 of the Supplement. The code for all experiments is at this repository: https://github.com/jfc43/advex.

For each of the two tabular datasets (where we train logistic regression models), for a given model-type (a, l or n), we found the average Gini index of the attribution vectors is virtually identical when using IG or DeepSHAP. This is not surprising: as pointed out in (Ancona et al., 2017), DeepSHAP is a variant of DeepLIFT, and for simple linear models, DeepLIFT gives a very close approximation of IG. To avoid clutter, we therefore omit DeepSHAP-based results on the tabular datasets. Table 1 shows a summary of some results on the above 5 datasets, and Fig. 2 and 3 display results graphically⁵.

Table 1: Results on 5 datasets. For each dataset, "a" indicates an ℓ∞(ε)-adversarially trained model with the indicated ε, and "l" indicates a naturally trained model with the indicated ℓ1-regularization strength λ. The attr column indicates the feature attribution method (IG or DeepSHAP). Column dG shows the average sparseness improvements of the models relative to the baseline naturally trained model, as measured by dG_a[A](ε) and dG_l[A](λ) defined in Eqs. (21, 22). Column AcDrop indicates the drop in accuracy relative to the baseline model.

dataset         attr   model           dG      AcDrop
MNIST           IG     a (ε = 0.3)     0.06    0.8%
MNIST           IG     l (λ = 0.01)    0.004   0.4%
MNIST           SHAP   a (ε = 0.3)     0.06    0.8%
MNIST           SHAP   l (λ = 0.01)    0.007   0.4%
Fashion-MNIST   IG     a (ε = 0.1)     0.06    4.7%
Fashion-MNIST   IG     l (λ = 0.01)    0.008   3.4%
Fashion-MNIST   SHAP   a (ε = 0.1)     0.08    4.7%
Fashion-MNIST   SHAP   l (λ = 0.01)    0.003   3.4%
CIFAR-10        IG     a (ε = 1.0)     0.081   0.57%
CIFAR-10        IG     l (λ = 1e-5)    0.022   1.51%
Mushroom        IG     a (ε = 0.1)     0.06    2.5%
Mushroom        IG     l (λ = 0.02)    0.06    2.6%
Spambase        IG     a (ε = 0.1)     0.17    0.9%
Spambase        IG     l (λ = 0.02)    0.15    0.1%

These results make it clear that for comparable levels of accuracy, the sparseness of attribution vectors from ℓ∞(ε)-adversarially trained models is much better than the sparseness from natural training with ℓ1-regularization. The effect is especially pronounced in the two image datasets. The effect is less dramatic in the two tabular datasets, for which we train logistic regression models. Our discussion at the end of Sec. 4 suggests a possible explanation.

In the Introduction we gave an example of a saliency map (Simonyan et al., 2013; Baehrens et al., 2010) (Fig. 1) to dramatically highlight the sparseness induced by adversarial training. We show several more examples of saliency maps in the Supplement (Section J.4).

⁵The official implementation of DeepSHAP (https://github.com/slundberg/shap) doesn't support the network we use for CIFAR-10 well, so we do not show results for this combination.


Figure 2: Boxplot of pointwise sparseness-improvements from adversarially trained models (dG_a[A](x; ε)) and naturally trained models with ℓ1-regularization (dG_l[A](x; λ)), for attribution methods A ∈ {IG, SH}.

7. Related Work

In contrast to the growing body of work on defenses against adversarial attacks (Yuan et al., 2017; Madry et al., 2017; Biggio et al., 2013) or explaining adversarial examples (Goodfellow et al., 2014; Tsipras et al., 2018), the focus of our paper is the connection between adversarial robustness and explainability. We view the process of adversarial training as a tool to produce more explainable models. A recent series of papers (Tsipras et al., 2018; Ilyas et al., 2019) essentially argues that adversarial examples exist because standard training produces models that are heavily reliant on highly predictive but non-robust features (which is similar to our notion of "weak" features in Sec. 3), which are vulnerable to an adversary who can "flip" them and cause performance to degrade. Indeed the authors of (Ilyas et al., 2019) touch upon some connections between explainability and robustness, and conclude, "As such, producing human-meaningful explanations that remain faithful to underlying models cannot be pursued independently from the training of the models themselves", by which they imply that good explainability may require intervening in the model-training procedure itself; this is consistent with our findings. We discuss other related work in the Supplement, Section A.

8. Conclusion

We presented theoretical and experimental results that show a strong connection between adversarial robustness (under ℓ∞-bounded perturbations) and two desirable properties of model explanations: conciseness and stability. Specifically, we considered model explanations in the form of feature-attributions based on the Integrated Gradients (IG) and DeepSHAP techniques. For 1-layer models using a popular family of loss functions, we theoretically showed that ℓ∞(ε)-adversarial training tends to produce sparse and stable IG-based attribution vectors. With extensive experiments on benchmark tabular and image datasets, we demonstrated that the "attribution sparsification" effect extends to Deep Neural Networks, when using two popular attribution methods. Intriguingly, especially in DNN models for image classification, the attribution sparseness from natural training with ℓ1-regularization is much inferior to that achievable via ℓ∞(ε)-adversarial training. Our theoretical results are a first step in explaining some of these phenomena.


Figure 3: For four benchmark datasets, each line-plot labeled M[A] shows the tradeoff between Accuracy and Attribution Sparseness achievable by various combinations of models M and attribution methods A. M = A denotes ℓ∞(ε)-adversarial training, and the plot shows the accuracy/sparseness for various choices of ε. M = L denotes ℓ1-regularized natural training, and the plot shows accuracy/sparseness for various choices of ℓ1-regularization parameter λ. A = IG denotes the IG attribution method, whereas A = SH denotes DeepSHAP. Attribution sparseness is measured by the Average Gini Index over the dataset (G_a[A](ε) and G_l[A](λ), for adversarial training and ℓ1-regularized natural training respectively). In all cases, and especially in the image datasets, adversarial training achieves a significantly better accuracy vs attribution sparseness tradeoff.


References

Abramovich, F. and Grinshtein, V. High-dimensional classification by sparse logistic regression. June 2017. URL http://arxiv.org/abs/1706.08344.

Alvarez-Melis, D. and Jaakkola, T. S. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049, 2018. URL http://arxiv.org/abs/1806.08049.

Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. November 2017. URL http://arxiv.org/abs/1711.06104.

Arya, V., Bellamy, R. K. E., Chen, P.-Y., Dhurandhar, A., Hind, M., Hoffman, S. C., Houde, S., Vera Liao, Q., Luss, R., Mojsilovic, A., Mourad, S., Pedemonte, P., Raghavendra, R., Richards, J., Sattigeri, P., Shanmugam, K., Singh, M., Varshney, K. R., Wei, D., and Zhang, Y. One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques. September 2019. URL http://arxiv.org/abs/1909.03012.

Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., and Muller, K.-R. How to explain individual classification decisions. J. Mach. Learn. Res., 11(Jun):1803–1831, 2010. URL http://www.jmlr.org/papers/volume11/baehrens10a/baehrens10a.pdf.

Biggio, B., Corona, I., Maiorca, D., Nelson, B., Srndic, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases, pp. 387–402. Springer Berlin Heidelberg, 2013. URL http://dx.doi.org/10.1007/978-3-642-40994-3_25.

Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. December 2014. URL http://arxiv.org/abs/1412.6572.

Guest, O. and Love, B. C. What the success of brain imaging implies about the neural code. Elife, 6, January 2017. URL http://dx.doi.org/10.7554/eLife.21397.

Hurley, N. and Rickard, S. Comparing measures of sparsity. IEEE Trans. Inf. Theory, 55(10):4723–4741, October 2009. URL http://dx.doi.org/10.1109/TIT.2009.2027527.

Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., and Madry, A. Adversarial examples are not bugs, they are features. May 2019. URL http://arxiv.org/abs/1905.02175.

Kim, B., Seo, J., and Jeon, T. Bridging adversarial robustness and gradient interpretability. Safe Machine Learning workshop at ICLR, 2019.

LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774, 2017.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. June 2017.

Molnar, C. Interpretable Machine Learning. 2019. https://christophm.github.io/interpretable-ml-book/.

Noack, A., Ahern, I., Dou, D., and Li, B. Does interpretability of neural networks imply adversarial robustness? ArXiv, abs/1912.03430, 2019.

Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Berkay Celik, Z., and Swami, A. The limitations of deep learning in adversarial settings. November 2015. URL http://arxiv.org/abs/1511.07528.

Ribeiro, M. T., Singh, S., and Guestrin, C. "Why should I trust you?": Explaining the predictions of any classifier. February 2016. URL http://arxiv.org/abs/1602.04938.

Sankaranarayanan, S., Jain, A., Chellappa, R., and Lim, S. N. Regularizing deep networks using efficient layerwise adversarial training. May 2017. URL http://arxiv.org/abs/1705.07819.

Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. November 2015. URL http://arxiv.org/abs/1511.05432.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.

Sinha, A., Namkoong, H., and Duchi, J. Certifying some distributional robustness with principled adversarial training. February 2018. URL https://openreview.net/pdf?id=Hk6kPgZA-.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. March 2017.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. December 2013. URL http://arxiv.org/abs/1312.6199.

Tan, M., Tsang, I. W., and Wang, L. Minimax sparse logistic regression for very high-dimensional feature selection. IEEE Trans Neural Netw Learn Syst, 24(10):1609–1622, October 2013. URL http://dx.doi.org/10.1109/TNNLS.2013.2263427.

Tan, M., Tsang, I. W., and Wang, L. Towards ultrahigh dimensional feature selection for big data. J. Mach. Learn. Res., 15(1):1371–1429, 2014. URL http://www.jmlr.org/papers/volume15/tan14a/tan14a.pdf.


Tanay, T. and Griffin, L. D. A new angle on L2 regularization. CoRR, abs/1806.11186, 2018.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. May 2018. URL http://arxiv.org/abs/1805.12152.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. August 2017. URL http://arxiv.org/abs/1708.07747.

Xu, H., Caramanis, C., and Mannor, S. Robustness and regularization of support vector machines. J. Mach. Learn. Res., 10(Jul):1485–1510, 2009. URL http://www.jmlr.org/papers/volume10/xu09b/xu09b.pdf.

Yeh, C.-K., Hsieh, C.-Y., Suggala, A., Inouye, D. I., and Ravikumar, P. K. On the (in)fidelity and sensitivity of explanations. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 10967–10978. Curran Associates, Inc., 2019. URL https://arxiv.org/pdf/1901.09392.pdf.

Yuan, X., He, P., Zhu, Q., and Li, X. Adversarial examples: Attacks and defenses for deep learning. December 2017. URL http://arxiv.org/abs/1712.07107.


A. Additional Related Work

Section 7 discussed some of the work most directly related to this paper. Here we describe some additional related work.

Adversarial Robustness and Interpretability. Through a very different analysis, (Yeh et al., 2019) show a result closely related to our Theorem 5.1: they show that adversarial training is analogous to making gradient-based explanations more "smooth", which lowers the sensitivity of gradient explanations. The paper of (Noack et al., 2019) considers a question that is the converse of the one we examine in our paper: they show evidence that models that are forced to have interpretable gradients are more robust to adversarial examples than models trained in a standard manner. Another recent paper (Kim et al., 2019) analyzes the effect of adversarial training on the interpretability of neural network loss gradients.

Relation to work on Regularization Benefits of AML. There has been prior work on the regularization benefits of adversarial training (Xu et al., 2009; Szegedy et al., 2013; Goodfellow et al., 2014; Shaham et al., 2015; Sankaranarayanan et al., 2017; Tanay & Griffin, 2018), primarily in image-classification applications: when a model is adversarially trained, its classification accuracy on natural (i.e. un-perturbed) test data can improve. All of this prior work has focused on the performance-improvement (on natural test data) aspect of regularization, but none has examined the feature-pruning benefits explicitly. In contrast to this work, our primary interest is in the explainability benefits of adversarial training, and specifically the ability of adversarial training to significantly improve feature-concentration while maintaining (and often improving) performance on natural test data.

Adversarial Training vs Feature-Selection. Since our results show that adversarial training can effectively shrink the weights of irrelevant or weakly-relevant features (while preserving weights on relevant features), a legitimate counter-proposal might be that one could weed out such features beforehand via a pre-processing step where features with negligible label-correlations can be "removed" from the training process. Besides the fact that this scheme has no guarantees whatsoever with regard to adversarial robustness, there are some practical reasons why correlation-based feature selection is not as effective as adversarial training in producing pruned models: (a) With adversarial training, one needs to simply try different values of the adversarial strength parameter ε and find a level where accuracy (or another metric such as AUC-ROC) is not impacted much but model-weights are significantly more concentrated; on the other hand, with the correlation-based feature-pruning method, one needs to set up an iterative loop with gradually increasing correlation thresholds, and each time the input pre-processing pipeline needs to be re-executed with a reduced set of features. (b) When there are categorical features with large cardinalities, where just some of the categorical values have negligible feature-correlations, it is not even clear how one can "remove" these specific feature values, since the feature itself must still be used; at the very least it would require a re-encoding of the categorical feature each time a subset of its values is "dropped" (for example if a one-hot encoding or hashing scheme is used). Thus correlation-based feature-pruning is a much more cumbersome and inefficient process compared to adversarial training.

Adversarial Training vs Other Methods to Train Sparse Logistic Regression Models. (Tan et al., 2013; 2014) propose an approach to train sparse logistic regression models based on a min-max optimization problem that can be solved by the cutting plane algorithm. This requires a specially implemented custom optimization procedure. By contrast, ℓ∞(ε)-adversarial training can be implemented as a simple and efficient "bolt-on" layer on top of existing ML pipelines based on TensorFlow, PyTorch or SciKit-Learn, which makes it highly practical. Another paper (Abramovich & Grinshtein, 2017) proposes a feature selection procedure based on penalized maximum likelihood with a complexity penalty on the model size, but once again this requires special-purpose optimization code.

B. Discussion of Assumptions

B.1. Loss Functions Satisfying Assumption LOSS-CVX

We show here that several popular loss functions satisfy Assumption LOSS-CVX.

Logistic NLL (Negative Log Likelihood) Loss. L(x, y; w) = − ln(σ(y〈w, x〉)) = ln(1 + exp(−y〈w, x〉)), which can be written as g(−y〈w, x〉) where g(z) = ln(1 + e^z) is a non-decreasing and convex function.

Hinge Loss. L(x, y; w) = (1 − y〈w, x〉)₊, which can be written as g(−y〈w, x〉) where g(z) = (1 + z)₊ is non-decreasing and convex.


Softplus Hinge Loss. L(x, y; w) = ln(1 + exp(1 − y〈w, x〉)), which can be written as g(−y〈w, x〉) where g(z) = ln(1 + e^{1+z}), and clearly g is non-decreasing. Moreover the first derivative of g, g′(z) = 1/(1 + e^{−1−z}), is non-decreasing, and therefore g is convex.
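A quick numerical illustration (ours, not from the paper) that the three g functions above are non-decreasing and convex on a grid, using non-negative first and second differences as proxies:

```python
import numpy as np

g_logistic = lambda z: np.log1p(np.exp(z))          # ln(1 + e^z)
g_hinge    = lambda z: np.maximum(0.0, 1.0 + z)     # (1 + z)_+
g_softplus = lambda z: np.log1p(np.exp(1.0 + z))    # ln(1 + e^{1+z})

z = np.linspace(-5, 5, 1001)
for name, g in [("logistic", g_logistic), ("hinge", g_hinge), ("softplus-hinge", g_softplus)]:
    vals = g(z)
    print(name,
          "non-decreasing:", bool(np.all(np.diff(vals) >= -1e-12)),
          "convex:", bool(np.all(np.diff(vals, 2) >= -1e-12)))
```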

B.2. Implications of Assumption FEAT-TRANS

Lemma 1. Given random variables X′, Y where Y ∈ {±1}, if we define X = X′ − [E(X′|Y = 1) + E(X′|Y = −1)]/2, then:

E(X|Y) = aY    (B.23)

E(YX) = E[E(YX|Y)] = E[Y E(X|Y)] = E[aY²] = a    (B.24)

E(YX|Y) = Y E[X|Y] = aY² = a,    (B.25)

where a = [E(X′|Y = 1) − E(X′|Y = −1)]/2.

Proof. Consider the function f(Y) = E(X′|Y), and let b0 := f(−1) and b1 := f(1). Since there are only two values of Y that are of interest, we can represent f(Y) by a linear function aY + c, and it is trivial to verify that a = (b1 − b0)/2 and c = (b1 + b0)/2 are the unique values consistent with f(−1) = b0 and f(1) = b1. Thus if X = X′ − c, then E(X|Y) = aY, proving (B.23), and the other two properties follow trivially.

C. Expressions for adversarial perturbation and loss-gradient

We show two simple preliminary results for loss functions that satisfy Assumption LOSS-INC: Lemma 2 shows a simple closed-form expression for the ℓ∞(ε)-adversarial perturbation, and we use this result to derive an expression for the gradient of the ℓ∞(ε)-adversarial loss L(x+δ∗, y; w) with respect to a weight wi (Lemma 3).

Lemma 2 (Closed form for ℓ∞(ε)-adversarial perturbation). For a data point (x, y), given model weights w, if the loss function L(x, y; w) satisfies Assumption LOSS-INC, the ℓ∞(ε)-adversarial perturbation δ∗ is given by:

δ∗ = −y sgn(w)ε,    (7)

and the corresponding ℓ∞(ε)-adversarial loss is

L(x+δ∗, y; w) = g(ε||w||1 − y〈w, x〉)    (8)

Proof. Assumption LOSS-INC implies that the loss is non-increasing in y〈w, x〉, and therefore the ℓ∞(ε)-perturbation δ∗ of x that maximizes the loss would be such that, for each i ∈ [d], xi is changed by an amount ε in the direction of −y sgn(wi), and the result immediately follows.

Lemma 3 (Gradient of adversarial loss). For any loss function satisfying Assumption LOSS-INC, for a given data point (x, y), the gradient of the ℓ∞(ε)-adversarial loss is given by:

∂L(x+δ∗, y; w)/∂wi = −g′(ε||w||1 − y〈w, x〉)(y xi − sgn(wi)ε)    (9)

Proof. This is straightforward by substituting the expression (7) for δ∗ in g(−y〈w, x+δ∗〉), and applying the chain rule.
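
The two lemmas above admit a direct implementation for a 1-layer model. The sketch below (our illustration; it assumes the logistic g from Appendix B.1) computes the closed-form perturbation (7), the adversarial loss (8), and the gradient (9), and checks the gradient against finite differences.

```python
import numpy as np

def g(z):        return np.logaddexp(0.0, z)       # logistic g(z) = ln(1 + e^z)
def g_prime(z):  return 1.0 / (1.0 + np.exp(-z))   # g'(z) = sigmoid(z)

def adv_perturbation(w, y, eps):
    # Eq. (7): delta* = -y * sgn(w) * eps
    return -y * np.sign(w) * eps

def adv_loss(w, x, y, eps):
    # Eq. (8): g(eps * ||w||_1 - y <w, x>)
    return g(eps * np.sum(np.abs(w)) - y * np.dot(w, x))

def adv_loss_grad(w, x, y, eps):
    # Eq. (9): dL/dw_i = -g'(eps*||w||_1 - y<w,x>) * (y*x_i - sgn(w_i)*eps)
    z = eps * np.sum(np.abs(w)) - y * np.dot(w, x)
    return -g_prime(z) * (y * x - np.sign(w) * eps)

# Sanity checks on random data:
rng = np.random.default_rng(0)
w, x, y, eps = rng.normal(size=5), rng.normal(size=5), 1.0, 0.1
delta = adv_perturbation(w, y, eps)
assert np.isclose(g(-y * np.dot(w, x + delta)), adv_loss(w, x, y, eps))
# Finite-difference check of Eq. (9), away from w_i = 0 so sgn(w) is locally constant:
h = 1e-6
num = np.array([(adv_loss(w + h*e, x, y, eps) - adv_loss(w - h*e, x, y, eps)) / (2*h)
                for e in np.eye(5)])
assert np.allclose(num, adv_loss_grad(w, x, y, eps), atol=1e-5)
```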

D. Expectation of SGD Weight Update

The following Lemma will be used to prove Theorem 3.1.

D.1. Upper bound on E[Zf(Z, V )]

Lemma 4 (Upper Bound on expectation of Zf(Z, V) when f is non-increasing in Z, (Z ⊥ V)|Y, and E(Z|Y) = E(Z)). For any random variables Z, V, if:

• f(Z, V ) is non-increasing in Z,

• Z, V are conditionally independent given a third r.v. Y , and

• E(Z|Y ) = E(Z),

then

E[Zf(Z, V)] ≤ E(Z) E[f(Z, V)]    (D.26)

Proof. Let z = E(Z) = E(Z|Y ) and note that

E[Zf(Z, V)] − E[Z] E[f(Z, V)] = E[Zf(Z, V)] − z E[f(Z, V)]    (D.27)
= E[(Z − z)f(Z, V)]    (D.28)

We now argue that E[(Z − z)f(z, V)] = 0. To see this, apply the Law of Total Expectation by conditioning on Y:

E[(Z − z)f(z, V)] = E[ E[(Z − z)f(z, V) | Y] ]
= E[ E[(Z − z) | Y] · E[f(z, V) | Y] ]    (since (Z ⊥ V)|Y)    (D.29)
= 0.    (since E(Z|Y) = E(Z) = z)    (D.30)

Since E[(Z − z)f(z, V)] = 0, we can subtract it from the last expectation in (D.28), and by linearity of expectations the RHS of (D.28) can be replaced by

E[(Z − z)(f(Z, V) − f(z, V))].    (D.31)

The fact that f(Z, V) is non-increasing in Z implies that (Z − z)(f(Z, V) − f(z, V)) ≤ 0 for any value of Z and V, with equality when Z = z. Therefore the expectation (D.31) is bounded above by zero, which implies the desired result.
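
As a quick sanity check of Lemma 4 (our own illustration, not part of the paper), one can construct random variables satisfying the three conditions and verify the inequality by Monte Carlo:

```python
import numpy as np

# Construct Z, V, Y with (Z ⟂ V)|Y and E(Z|Y) = E(Z), take an f that is
# non-increasing in Z, and verify E[Z f(Z,V)] <= E(Z) E[f(Z,V)] empirically.
rng = np.random.default_rng(0)
n = 1_000_000
Y = rng.choice([-1.0, 1.0], size=n)
Z = rng.normal(size=n)              # independent of Y, so E(Z|Y) = E(Z)
V = Y + 0.5 * rng.normal(size=n)    # depends on Y; conditionally on Y, V and Z are independent

f = lambda z, v: -z + v             # non-increasing in z
lhs = np.mean(Z * f(Z, V))
rhs = np.mean(Z) * np.mean(f(Z, V))
assert lhs <= rhs + 1e-3            # here lhs is approximately -1, rhs approximately 0
```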

Theorem 3.1 (Expected SGD Update in Adversarial Training). For any loss function L satisfying Assumption LOSS-CVX, a dataset D satisfying Assumption FEAT-TRANS, and a subset S of features that are conditionally independent of the rest given the label y, if a data point (x, y) is randomly drawn from D, and x is perturbed to x′ = x + δ∗, where δ∗ is an ℓ∞(ε)-adversarial perturbation, then during SGD using the ℓ∞(ε)-adversarial loss L(x′, y; w), the expected weight-updates ∆i := E∆wi for i ∈ S and the corresponding w-weighted average ∆_S^w satisfy the following properties:

1. If wi = 0 ∀i ∈ S, then for each i ∈ S,

   ∆i = g′ ai,    (11)

2. and otherwise,

   ∆_S^w ≤ g′(a_S^w − ε),    (12)

   and equality holds in the limit as wi → 0 ∀i ∈ S,

where g′ is the expectation in (10), ai = E(xi y) is the directed strength of feature xi from Eq. (4), and a_S^w is the corresponding w-weighted average over S.

Proof. Consider the adversarial loss gradient expression (9) from Lemma 3. For the case where wi = 0 for all i ∈ S, for any given i ∈ S, the negative expectation of the adversarial loss gradient is

∆i = E[ y xi g′(ε||w||1 − y〈w, x〉) ]
= E[ E[ y xi g′(ε||w||1 − y〈w, x〉) | y ] ]    (Law of Total Expectation)
= E[ y E[ xi g′(ε||w||1 − y〈w, x〉) | y ] ],

and in the last expectation above, we note that since wi = 0 ∀i ∈ S, the argument of g′ does not depend on xi for any i ∈ S, and since the features in S are conditionally independent of the rest given the label y, xi is independent of the g′ term in the inner conditional expectation. Therefore the inner conditional expectation can be factored as a product of conditional expectations, which gives

∆i = E[ y E(xi|y) E[ g′(ε||w||1 − y〈w, x〉) | y ] ]
= E[ y² ai E[ g′(ε||w||1 − y〈w, x〉) | y ] ]    (Assumption FEAT-TRANS, Eq. B.23)
= ai E[ E[ g′(ε||w||1 − y〈w, x〉) | y ] ]    (since y = ±1)
= ai g′,    (D.32)

which establishes the first result.

Now consider the case where wi ≠ 0 for at least one i ∈ S. Starting with the adversarial loss gradient expression (9) from Lemma 3, for any i ∈ S, multiplying throughout by −sgn(wi) and taking expectations results in

sgn(wi)∆i = E[ (y xi sgn(wi) − ε) g′(ε||w||1 − y〈w, x〉) ],    (D.33)

where the expectation is with respect to a random choice of data-point (x, y). The argument of g′ can be written as

ε||w||1 − y〈w, x〉 = −Σ_{j=1}^{d} |wj| (y xj sgn(wj) − ε),

and for j ∈ [d], if we let Zj denote the random variable corresponding to y xj sgn(wj) − ε, then (D.33) can be written as

sgn(wi)∆i = E[ Zi g′( −Σ_{j=1}^{d} |wj| Zj ) ].    (D.34)

Taking the |wi|-weighted average of both sides of (D.34) over i ∈ S yields

∆_S^w = (1 / Σ_{i∈S} |wi|) · E[ g′( −Σ_{j=1}^{d} |wj| Zj ) · Σ_{i∈S} (|wi| Zi) ].    (D.35)

If we now define Z_S := Σ_{i∈S} (|wi| Zi), the argument of g′ in the expectation above can be written as V_S − Z_S, where V_S denotes the negative sum of the |wj| Zj terms over all j ∉ S, and thus (D.35) can be written as

∆_S^w = (1 / Σ_{i∈S} |wi|) · E[ g′(V_S − Z_S) Z_S ].    (D.36)

Note that Z_S is a function of Y and the features in S, and V_S is a function of Y and the features in the complement of S. Since the features in S are conditionally independent of the rest given the label Y (this is a condition of the Theorem), it follows that (V_S ⊥ Z_S)|Y. Since by Assumption LOSS-CVX, g′ is a non-decreasing function, g′(V_S − Z_S) is non-increasing in Z_S. Thus all three conditions of Lemma 4 are satisfied, with the random variables Z, V, Y and function f in the Lemma corresponding to random variables Z_S, V_S, Y and function g′ respectively in the present Theorem. It then follows from Lemma 4 that

∆_S^w ≤ (1 / Σ_{i∈S} |wi|) · E(Z_S) g′.    (D.37)

The definition of Z_S, and the fact that E(Zi) = sgn(wi) E(y xi) − ε = ai sgn(wi) − ε (property (4)), imply

E(Z_S) = Σ_{i∈S} [ |wi| sgn(wi) ai ] − ε Σ_{i∈S} |wi|,

and the definition of a_S^w allows us to simplify (D.37) to

∆_S^w ≤ g′(a_S^w − ε),

which establishes the upper bound (12).

To analyze the limiting case where wi → 0 for all i ∈ S, write Eq. (D.36) as follows:

∆_S^w = E[ g′(V_S − Z_S) · Z_S / Σ_{i∈S} |wi| ].    (D.38)

If we let |wi| → 0 for all i ∈ S, the Z_S in the argument of g′ can be set to 0, and we can write the RHS of (D.38) as

∆_S^w = E[ g′(V_S) · Z_S / Σ_{i∈S} |wi| ] = E[ E[ g′(V_S) · Z_S / Σ_{i∈S} |wi| | Y ] ],    (D.39)

where the inner conditional expectation can be factored as a product of conditional expectations since (Z_S ⊥ V_S)|Y:

∆_S^w = E[ E[ g′(V_S) | Y ] · E[ Z_S / Σ_{i∈S} |wi| | Y ] ].    (D.40)

Now notice that

E(Z_S | Y) = E[ Σ_{i∈S} (|wi| Zi) | Y ] = Σ_{i∈S} |wi| E[ sgn(wi) Y xi − ε | Y ].    (D.41)

From Property (5) of datasets satisfying Assumption FEAT-TRANS, E[Y xi | Y] = E(Y xi) = ai, and so the second inner expectation in (D.40) simplifies to a constant:

E[ Z_S / Σ_{i∈S} |wi| | Y ] = ( Σ_{i∈S} [ |wi| sgn(wi) ai ] ) / ( Σ_{i∈S} |wi| ) − ε = a_S^w − ε.    (D.42)

Eq. (D.40) can therefore be simplified to

∆_S^w = E[ E[ g′(V_S) | Y ] ] (a_S^w − ε) = g′(a_S^w − ε),    (D.43)

which shows the final statement of the Theorem, namely, that if wi → 0 for all i ∈ S, then (12) holds with equality.

D.2. Implications of Theorem 3.1

Keeping in mind the interpretations of the w-weighted average quantities a_S^w and ∆_S^w described in the paragraph after the statement of Theorem 3.1, we can state the following implications of this result:

If all weights of S are zero, then they grow in the correct direction. When wi = 0 for all i ∈ S (recall that S is a subset of features, conditionally independent of the rest given the label y), the expected SGD update ∆i for each i ∈ S is proportional to the directed strength ai of feature xi, and if g′ ≠ 0, this means that on average the SGD update causes the weight wi to grow from zero in the correct direction. This is what one would expect from an SGD training procedure.

If the weights of S are mis-aligned on average, then they shrink at a rate proportional to ε + |a_S^w|. Suppose for at least one i ∈ S, wi ≠ 0, and a_S^w < 0, i.e. the weights of the features in S are mis-aligned on average. In this case by (12), ∆_S^w < 0, i.e. the weights of the features in S, in aggregate (i.e. in the |wi|-weighted sense), shrink toward zero in expectation. The aggregate rate of this shrinkage is proportional to ε + |a_S^w|. In other words, all other factors remaining the same, adversarial training (i.e. with ε > 0) shrinks mis-aligned weights faster than natural training (i.e. with ε = 0).

If the weights of S are aligned on average, and ε > |a_S^w|, then they shrink at a rate proportional to ε − |a_S^w|. Suppose that wi ≠ 0 for at least one i ∈ S, and the weights of S are aligned on average, i.e. a_S^w > 0. Even in this case, the weights of S shrink on average, provided the alignment strength a_S^w is dominated by the adversarial ε; the rate of shrinkage is proportional to ε − |a_S^w|, by Eq. (12). Thus adversarial training with a sufficiently large ε that dominates the average strength of the features in S will cause the weights of these features to shrink on average. This observation is key to explaining the "feature-pruning" behavior of adversarial training: "weak" features (relative to ε) are weeded out by the SGD updates.

If the weights of S are aligned, and ε < |a_S^w|, then the weights of S expand up to a certain point. Consider the case where at least one of the S weights is non-zero, and the adversarial ε does not dominate the average strength a_S^w. Again from Eq. (12), if a_S^w > 0 and ε < |a_S^w|, then the upper bound (12) on ∆_S^w is non-negative. Since the Theorem states that equality holds in the limit as wi → 0 for all i ∈ S, this means that if all |wi| for i ∈ S are sufficiently small, the expected SGD update ∆_S^w is non-negative, i.e., the S weights expand on average. In other words, if the weights of a conditionally independent feature-subset S are aligned on average, then their aggregate weight expands on average up to a certain point, provided ε does not dominate their strength.

Note that Assumption LOSS-CVX implies that g′ ≥ 0, and when the model w is "far" from the optimum, the values of −y〈w, x〉 will tend to be large, and since g′ is a non-decreasing function (Assumption LOSS-CVX), g′ will be large as well. So we can interpret g′ as being a proxy for "average model error". Thus during the initial iterations of SGD, this quantity will tend to be large and positive, and shrinks toward zero as the model approaches optimality. Since g′ appears as a factor in (11) and (12), we can conclude that the above effects will be more pronounced in the initial stages of SGD and less so in the later stages. The experimental results described in Section 6 are consistent with several of the above effects.
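
The feature-pruning effects described above can be reproduced in a toy setting. The following sketch (our illustration, not the paper's experimental code) runs SGD on the ℓ∞(ε)-adversarial logistic loss (8) on synthetic data with a few strong features (ai ≈ 0.5) and many weak ones (ai ≈ 0.05); with ε chosen to dominate the weak strengths, the weak-feature weights should shrink markedly relative to natural training (ε = 0).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_strong, d_weak = 20_000, 5, 45
y = rng.choice([-1.0, 1.0], size=n)
X_strong = y[:, None] * 0.5 + rng.normal(size=(n, d_strong))   # directed strength a_i ~ 0.5
X_weak   = y[:, None] * 0.05 + rng.normal(size=(n, d_weak))    # directed strength a_i ~ 0.05
X = np.hstack([X_strong, X_weak])

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def train(eps, lr=0.1, epochs=5):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(n):
            z = eps * np.sum(np.abs(w)) - y[i] * np.dot(w, X[i])        # argument of g in (8)
            w -= lr * (-sigmoid(z) * (y[i] * X[i] - np.sign(w) * eps))  # SGD step using Eq. (9)
    return w

w_nat, w_adv = train(eps=0.0), train(eps=0.2)
print("mean |w|, strong features (natural / adversarial):",
      np.abs(w_nat[:d_strong]).mean(), np.abs(w_adv[:d_strong]).mean())
print("mean |w|, weak features   (natural / adversarial):",
      np.abs(w_nat[d_strong:]).mean(), np.abs(w_adv[d_strong:]).mean())
# Expected (per the discussion above): weak-feature weights are much smaller under eps = 0.2.
```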

E. Generalization of Theorem 3.1 for the multi-class setting

E.1. Setting and Assumptions

Let there be k ≥ 3 classes. For a given data point x ∈ R^d, its true label i ∈ [k] is represented by a vector y = [−1, ..., −1, 1, −1, ..., −1], where the single +1 appears in position i (the first i−1 entries and the last k−i entries are −1). We assume that the input (x, y) is drawn from the distribution D. For this multi-class classification problem, we assume the usage of standard one-vs-all classifiers, i.e., there are k different classifiers, with the i-th classifier (ideally) predicting +1 iff the true label of x is i, and −1 otherwise. Let w represent the k × d weight matrix, where wi represents the 1 × d weight vector of the i-th classifier, and wij represents the j-th entry of wi. Let yi represent the i-th entry of y.

The assumptions presented in the main paper (Sec. 2) are slightly tweaked as follows, and hold true for each of the k one-vs-all classifiers:

Assumption LOSS-INC: The loss function for each of the one-vs-all classifiers is of the form L(x, yi; wi) = g(−yi〈wi, x〉) where g is a non-decreasing function.

Assumption LOSS-CVX: The loss function for each of the one-vs-all classifiers is of the form L(x, yi; wi) = g(−yi〈wi, x〉) where g is non-decreasing, almost-everywhere differentiable and convex.

Assumption FEAT-INDEP: The features x are conditionally independent given the label yi for the i-th one-vs-all classifier, i.e., for any two distinct indices s, t, xs is independent of xt given yi, or more compactly, (xs ⊥ xt) | yi.

Assumption FEAT-EXP: For each feature xj, j ∈ [d], and the i-th one-vs-all classifier, E(xj | yi) = aij · yi for some constant aij.

Additionally, we introduce a new assumption on the distribution D as follows.

Assumption DIST-EXPC: The input distribution D satisfies the following expectation condition for a function hi, i ∈ [k] (defined by Eqs. (E.47), (E.48), (E.49)) and a constant g∗ (defined by Eq. (E.46)):

E[ hi(sgn(wij) yi, x, wi, ε) ] = 0    (E.44)

Pr_D[ε < xj < ε + ρ] = 0,  where ρ is a small constant    (E.45)

xj ≥ ε + ρ ⟹ (xj − ε) g∗i ≥ (xj + ε) g′(−yi〈wi, x + δ∗〉)    (E.46)

If yi sgn(wij) = −1, then
hi(sgn(wij) yi, x, wi, ε) ≥ (xj + ε) g∗i − (xj − ε) g′(−yi〈wi, x + δ∗〉)    (E.47)

If yi sgn(wij) = 1 ∧ xj > ε, then
−( (xj − ε) g∗i − (xj + ε) g′(−yi〈wi, x + δ∗〉) ) ≤ hi(sgn(wij) yi, x, wi, ε) ≤ 0    (E.48)

If yi sgn(wij) = 1 ∧ xj ≤ ε, then
hi(sgn(wij) yi, x, wi, ε) ≥ (xj + ε) g′(−yi〈wi, x + δ∗〉) − (xj − ε) g∗i    (E.49)

This assumption is not as restrictive as it may appear. Eq. (E.45) can be satisfied naturally for discrete domains: for example, for images xj ∈ {0, 1, 2, · · · , 254, 255}, and thus ρ ∈ (0, 1). For continuous domains, ρ can be set to a small value and the values of xj can be appropriately rounded in the input dataset.

For the rest of the discussion, let us consider the case where g′(z) = c (for example, for the hinge loss function, c = 1 for z > −1) and xj ∈ [0, 1]. Now consider

g∗ = (1 + ε)c/ρ,  ρ = 0.01

f1(x) := (x + ε) g∗ − (x − ε) c
f2(x) := (x + ε) c − (x − ε) g∗

hi(sgn(wij) yi, x, wi, ε) = f1(xj)  if sgn(wij) yi = +1,  and  f2(xj)  otherwise

E[ hi(sgn(wij) yi, x, wi, ε) ] = ∫_0^1 Pr[xj | yi sgn(wij) = −1] · f1(xj) dxj + ∫_ε^1 Pr[xj | yi sgn(wij) = +1] · f2(xj) dxj + ∫_0^ε Pr[xj | yi sgn(wij) = +1] · f2(xj) dxj

We observe that f1(x) is increasing in x ∈ [0, 1], and x ≥ ε + δ ⟹ f2(x) ≤ 0 and f2(x) is decreasing in x ∈ [ε + δ, 1]. Thus, intuitively, for E(hi) to be zero, Pr[xj | sgn(wij) yi = −1] must have high values for lower magnitudes of xj (say xj < 0.5), and Pr[xj | sgn(wij) yi = −1] has low values for xj ≤ ε and high values for xj ≥ ε + δ. For example, let us assume ε = 0.1 and that the distributions Pr[xj | sgn(wij) yi = +1] and Pr[xj | sgn(wij) yi = −1] can be approximated by truncated Gaussian distributions (with appropriate adjustments to ensure Pr[ε < xj < ε + δ] = 0) with means m1 and m2 respectively. Then it can be seen that there exists hi for m1 < 0.3 and m2 > 0.6 such that E[hi] = 0.

The overall loss function LT for the multi-class classifier is the sum of the loss functions of the individual one-vs-all classifiers and is given by

LT(x, y; w) = Σ_{i=1}^{k} L(x, yi; wi) = Σ_{i=1}^{k} g(−yi〈wi, x〉)    [from Assumption LOSS-CVX]

The expected SGD update ∆wij is defined as follows:

∆wij = −E[ ∂LT(x + δ∗, y; w)/∂wij ]              when wij = 0,
∆wij = −sgn(wij) E[ ∂LT(x + δ∗, y; w)/∂wij ]     when wij ≠ 0.    (E.50)

Also let

g′i := E[g′(−yi〈wi,x + δ∗〉)], i ∈ [k] (E.51)

Theorem E.1 (Expected SGD Update in Adversarial Training for Multi-Class Classification). For any loss function L satisfying assumptions LOSS-CVX, FEAT-INDEP and FEAT-EXP, if a data point (x, y) is randomly drawn from D, which satisfies Assumption DIST-EXPC, and x is perturbed to x′ = x + δ∗, where δ∗ is an ℓ∞(ε)-adversarial perturbation, then under the ℓ∞(ε)-adversarial loss LT(x, y; w), the expected SGD-update of weight wij, namely ∆wij, satisfies the following properties:

1. If wij = 0, then

∆wij = aij g′i

2. If wij ≠ 0, then

∆wij ≤ g [aij sgn(wij) − ε],  where g ∈ {g′i, g∗i}

Proof. For δ∗ ∈ R^d such that    (E.52)

LT(x + δ∗, y; w) = E_{x,y∼D}[ max_{||δ||∞≤ε} LT(x + δ, y; w) ],    (E.53)

the shift by ε in xj can be in either of the two directions, yi sgn(wij) or −yi sgn(wij). We have

∂LT(x + δ∗, y; w)/∂wij = Σ_{s=1}^{k} ∂g(−ys〈ws, x + δ∗〉)/∂wij = ∂g(−yi〈wi, x + δ∗〉)/∂wij    (E.54)

Thus by Assumption LOSS-CVX, either of the following two equations holds true:

∂LT(x + δ∗, y; w)/∂wij = g′(−yi〈wi, x + δ∗〉)(−yi xj + sgn(wij) ε)    (E.55)

∂LT(x + δ∗, y; w)/∂wij = g′(−yi〈wi, x + δ∗〉)(−yi xj − sgn(wij) ε)    (E.56)

Thus when wij = 0,

∆wij = E[ yi xj g′(−yi〈wi, x + δ∗〉) ] = aij g′i    [follows from the proof of Theorem 3.1]

Now let us consider the case when wij ≠ 0.

Case I: Eq. (E.55) is satisfied. This means that for δ∗, xj is changed by ε in the direction of −yi sgn(wij). Thus, after multiplying throughout with −sgn(wij) and taking expectations, we have

∆wij = E[ (yi xj sgn(wij) − ε) g′( Σ_{l=1, l≠j}^{d} sl ε |wil| + ε |wij| − yi〈wi, x〉 ) ],    (E.57)

where sl represents the corresponding sign for the value ε|wil| (based on the direction in which xl is perturbed) and the expectation is with respect to a random choice of data point (x, y). Let us define two random variables V and Z as follows:

Z = yi xj sgn(wij) − ε    (E.58)

V = −Σ_{l=1, l≠j}^{d} |wil| ( yi xl sgn(wil) − sl ε )    (E.59)

Thus,

∆wij = E[Zg′(V − |wij |Z)] (E.60)

Let the random variable Y correspond to the label yi of the data point. Since Z is a function of feature xj and Y, and V is a function of the remaining features and Y, Assumption FEAT-INDEP implies (V ⊥ Z)|Y. Additionally, by Assumption FEAT-EXP,

E(Z) = E(Z|Y) = aij sgn(wij) − ε    (E.61)

Since by Assumption LOSS-CVX, g′ is a non-decreasing function, g′(V − |wij|Z) is non-increasing in Z. Thus all three conditions of Lemma 4 are satisfied, with the random variables Z, V, Y and function f in the Lemma corresponding to random variables Z, V, Y and function g′ respectively in the present theorem. Following an analysis similar to the one presented in the proof of Theorem 3.1, we have

∆wij ≤ E(Z) g′i = g′i [aij sgn(wij) − ε]    (E.62)

Case II: Eq. (E.56) is satisfied. In this case, for δ∗, xj is perturbed by ε in the direction yi sgn(wij). Now, multiplying both sides by −sgn(wij),

∆wij = (yi xj sgn(wij) + ε) g′(−yi〈wi, x + δ∗〉)    (E.63)

Now let us consider the case when yi sgn(wij) = −1. From Assumption DIST-EXPC (Eqs. (E.63), (E.46) and (E.47)), we have

∆wij ≤ (yi xj sgn(wij) − ε) g∗i + hi(sgn(wij) yi, x, wi, ε)    (E.64)

For the case when yi sgn(wij) = 1 ∧ xj > ε, again from Eqs. (E.63), (E.46) and (E.48), Eq. (E.64) holds true. Similarly, for the case of yi sgn(wij) = 1 ∧ xj ≤ ε, the validity of Eq. (E.64) can be verified from Eqs. (E.63), (E.46) and (E.49). Now taking expectations over both sides of Eq. (E.64) results in

∆wij ≤ (aij sgn(wij) − ε) g∗i    [from Eq. (E.44)]    (E.65)

This concludes our proof.

F. Proof of Lemma 4.1

Lemma 4.1 (IG Attribution for 1-layer Networks). If F(x) is computed by a 1-layer network (14) with weights vector w, then the Integrated Gradients for all dimensions of x relative to a baseline u are given by:

IG^F(x, u) = [F(x) − F(u)] (x − u)⊙w / 〈x − u, w〉,    (15)

where the ⊙ operator denotes the entry-wise product of vectors.

Proof. Since the function F, the baseline input u and weight vector w are fixed, we omit them from IG^F(x, u) and IG^F_i(x, u) for brevity. Consider the partial derivative ∂iF(u + α(x − u)) in the definition (13) of IGi(x). For a given x, u and α, let v denote the vector u + α(x − u). Then ∂iF(v) = ∂F(v)/∂vi, and by applying the chain rule we get:

∂iF(v) := ∂F(v)/∂vi = ∂A(〈w, v〉)/∂vi = A′(z) ∂〈w, v〉/∂vi = wi A′(z),

where A′(z) is the gradient of the activation A at z = 〈w, v〉. This implies that:

∂F(v)/∂α = Σ_{i=1}^{d} ( ∂F(v)/∂vi · ∂vi/∂α ) = Σ_{i=1}^{d} [ wi A′(z)(xi − ui) ] = 〈x − u, w〉 A′(z)

We can therefore write dF(v) = 〈x − u, w〉 A′(z) dα,

and since 〈x − u, w〉 is a scalar, this yields

A′(z) dα = dF(v) / 〈x − u, w〉.

Using this equation, the integral in the definition of IGi(x) can be written as

∫_{α=0}^{1} ∂iF(v) dα = ∫_{α=0}^{1} wi A′(z) dα
= ∫_{α=0}^{1} wi dF(v) / 〈x − u, w〉
= ( wi / 〈x − u, w〉 ) ∫_{α=0}^{1} dF(v)    (F.66)
= ( wi / 〈x − u, w〉 ) [F(x) − F(u)],

where (F.66) follows from the fact that (x − u) and w do not depend on α. Therefore, from the definition (13) of IGi(x):

IGi(x) = [F(x) − F(u)] (xi − ui) wi / 〈x − u, w〉,

and this yields the expression (15) for IG(x).
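
The closed form (15) is easy to verify numerically. The sketch below (our illustration, assuming a sigmoid activation A) compares Eq. (15) against a midpoint Riemann-sum approximation of the IG integral in definition (13), and also checks the completeness property that the attributions sum to F(x) − F(u).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w = rng.normal(size=d)
x, u = rng.normal(size=d), np.zeros(d)      # input and baseline

A  = lambda z: 1.0 / (1.0 + np.exp(-z))     # activation (sigmoid, an assumption for this example)
F  = lambda v: A(np.dot(w, v))
Ap = lambda z: A(z) * (1.0 - A(z))          # A'(z)

def ig_closed_form(x, u):
    # Eq. (15): IG^F(x, u) = [F(x) - F(u)] * (x - u) (entry-wise) w / <x - u, w>
    return (F(x) - F(u)) * (x - u) * w / np.dot(x - u, w)

def ig_riemann(x, u, m=10_000):
    # Definition (13): IG_i = (x_i - u_i) * integral over alpha in [0,1] of d_i F(u + alpha(x - u))
    alphas = (np.arange(m) + 0.5) / m       # midpoint rule
    grads = np.array([Ap(np.dot(w, u + a * (x - u))) * w for a in alphas])
    return (x - u) * grads.mean(axis=0)

assert np.allclose(ig_closed_form(x, u), ig_riemann(x, u), atol=1e-6)
# Completeness: the attributions sum to F(x) - F(u).
assert np.isclose(ig_closed_form(x, u).sum(), F(x) - F(u))
```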

G. Proof of Theorem 5.1

Theorem 5.1 (Equivalence of Stable IG and Adversarial Robustness). For loss functions L(x, y; w) satisfying Assumption LOSS-CVX, the augmented loss inside the expectation (17) equals the ℓ∞(ε)-adversarial loss inside the expectation (2), i.e.

L(x, y; w) + max_{||x′−x||∞≤ε} || IG^{L_y}(x, x′) ||1 = max_{||δ||∞≤ε} L(x + δ, y; w)    (18)

Proof. Recall that Assumption LOSS-CVX implies L(x, y; w) = g(−y〈w, x〉) for some non-decreasing, differentiable, convex function g. Due to this special form of L(x, y; w), the function Ly is a differentiable function of 〈w, x〉, and by Lemma 4.1 the i'th component of the IG term in (18) is

IG^{L_y}_i(x, x′; w) = ( wi (x′ − x)i / 〈w, x′ − x〉 ) · ( g(−y〈w, x′〉) − g(−y〈w, x〉) ),

and if we let ∆ = x′ − x (which satisfies ‖∆‖∞ ≤ ε), its absolute value can be written as

( | g(−y〈w, x〉 − y〈w, ∆〉) − g(−y〈w, x〉) | / |〈w, ∆〉| ) · |wi ∆i|.

Letting z = −y〈w, x〉 and δ = −y〈w, ∆〉, this is further simplified to (|g(z + δ) − g(z)| / |δ|) · |wi ∆i|. By Assumption LOSS-CVX, g is convex, and therefore the "chord slope" [g(z + δ) − g(z)]/δ cannot decrease as δ is increased. In particular, to maximize the ℓ1-norm of the IG term in Eq. (18), we can set δ to the largest possible value subject to the constraint ||∆||∞ ≤ ε, and we achieve this by setting ∆i = −y sgn(wi)ε for each dimension i. This yields δ = ε‖w‖1, and the second term on the LHS of (18) becomes

|g(z + δ) − g(z)| · Σ_i |wi ∆i| / |δ|
= |g(z + ε‖w‖1) − g(z)| · Σ_i |wi| ε / (ε‖w‖1)
= |g(z + ε‖w‖1) − g(z)|
= g(z + ε‖w‖1) − g(z)

where the last equality follows because g is non-decreasing. Since L(x, y; w) = g(z) by Assumption LOSS-CVX, the LHS of (18) simplifies to

g(−y〈w, x〉 + ε‖w‖1),

and by Eq. (7), this is exactly the ℓ∞(ε)-adversarial loss on the RHS of (18).

H. Aggregate IG Attribution over a Dataset

Recall that in Section 4 we defined IG^F(x, u) in Eq. (13) for a single input x (relative to a baseline input u). This gives us a sense of the "importance" of each input feature in explaining a specific model prediction F(x). Now we describe some ways to produce aggregate importance metrics over an entire dataset. For brevity, let us simply write IG(x) and IGi(x), and omit F and u since these are fixed for a given model and a given dataset.

Note that in Eq. (13), x is assumed to be an input vector in "exploded" space, i.e., all categorical features are (explicitly or implicitly) one-hot encoded, and i is the position-index corresponding to either a specific numerical feature or a categorical feature-value. Thus if i corresponds to a categorical feature-value, then for any input x where xi = 0 (i.e. the corresponding categorical feature-value is not "active" for that input), IGi(x) = 0. A natural definition of the overall importance of a feature (or feature-value) i, for a given model F and dataset D, is the average of |IGi(x)| over all inputs x ∈ D, which we refer to as the Feature Value Impact FVi[D]. For a categorical feature with m possible values, we can further define its Feature-Impact (FI) as the sum of FVi[D] over all i corresponding to possible values of this categorical feature.

The FI metric is particularly useful in tabular datasets, to gain an understanding of the aggregate importance of high-cardinality categorical features.
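
A minimal sketch of these aggregate metrics (our illustration; the array ig_matrix, the column grouping, and the dummy numbers are hypothetical, not taken from the paper):

```python
import numpy as np

# `ig_matrix` is assumed to hold IG_i(x) for every input x in the dataset (rows) and
# exploded feature index i (columns); `groups` maps each categorical feature name to
# the list of columns holding its one-hot values.

def feature_value_impact(ig_matrix):
    # FV_i[D] = average of |IG_i(x)| over all x in D
    return np.abs(ig_matrix).mean(axis=0)

def feature_impact(ig_matrix, groups):
    # FI of a categorical feature = sum of FV_i over the columns of its one-hot values
    fv = feature_value_impact(ig_matrix)
    return {name: fv[cols].sum() for name, cols in groups.items()}

# Purely illustrative example with dummy numbers:
ig_matrix = np.array([[0.2, -0.1, 0.0, 0.05],
                      [0.1,  0.0, 0.3, -0.05]])
groups = {"odor": [0, 1, 2], "cap-color": [3]}
print(feature_impact(ig_matrix, groups))   # {'odor': 0.35, 'cap-color': 0.05}
```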

I. Definition of the Gini Index

The definition is adapted from (Hurley & Rickard, 2009): Suppose we are given a vector of non-negative values v = [v1, v2, v3, . . . , vd]. The vector is first sorted in non-decreasing order, so that the resulting indices after sorting are (1), (2), (3), . . . , (d), i.e., v(k) denotes the k'th value in this sequence. Then the Gini Index is given by:

G(v) = 1 − 2 Σ_{k=1}^{d} ( v(k) / ||v||1 ) ( (d − k + 0.5) / d ).    (I.67)

Another equivalent definition of the Gini Index is based on plotting the cumulative fractional contribution of the sorted values. In particular, if the sorted non-negative values are [v(1), v(2), . . . , v(d)], and for k ∈ [d] we plot k/d (the fraction of dimensions up to k) vs Σ_{i=1}^{k} v(i) / ||v||1 (the fraction of the total value up to the k'th dimension), then the Gini Index G(v) is 1 minus twice the area under this curve, which matches (I.67).

The Gini Index by definition lies in [0, 1], and a higher value indicates more sparseness. For example, if just one of the vi > 0 and all the rest are 0, then G(v) = 1 − 1/d, which is close to 1.0 for large d, indicating (near-)perfect sparseness. At the other extreme, if all vi are equal to some positive constant, then G(v) = 0.
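
For reference, a direct numpy implementation of Eq. (I.67) (our illustration):

```python
import numpy as np

def gini_index(v):
    # Eq. (I.67) for a vector of non-negative values.
    v = np.sort(np.asarray(v, dtype=float))   # non-decreasing order: v_(1) <= ... <= v_(d)
    d = v.size
    k = np.arange(1, d + 1)
    return 1.0 - 2.0 * np.sum((v / v.sum()) * (d - k + 0.5) / d)

# The extremes discussed above:
print(gini_index([0, 0, 0, 1]))          # 1 - 1/d = 0.75, close to 1 for large d
print(gini_index([0.3, 0.3, 0.3, 0.3]))  # 0.0 for a perfectly uniform vector
```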

J. Experiments

J.1. Experiment Datasets and Methodology

We experiment with 5 public benchmark datasets. Below we briefly describe each dataset and model-training details.

MNIST. This is a classic image benchmark dataset consisting of grayscale images of handwritten digits 0 to 9 in the form of 28 x 28 pixels, along with the correct class label (0 to 9) (LeCun & Cortes, 2010). We train a Deep Neural Network consisting of two convolutional layers with 32 and 64 filters respectively, each followed by 2x2 max-pooling, and a fully connected layer of size 1024. Note that this is identical to the state-of-the-art adversarially trained model used by (Madry et al., 2017). We use 50,000 images for training and 10,000 images for testing. When computing the IG vector for an input image, we use the predicted probability of the true class as the function F in the definition (13) of IG. For training each of the model types on MNIST, we use the Adam optimizer with a learning rate of 10⁻⁴ and a batch size of 50. For the naturally-trained model (with or without ℓ1-regularization) we use 25,000 training steps. For adversarial training, we use 100,000 training steps overall, and to generate adversarial examples we use Projected Gradient Descent (PGD) with random start. The PGD hyperparameters depend on the specific ε bound on the ℓ∞-norm of the adversarial perturbations: the number of PGD steps was set to 100·ε + 10, and the PGD step size was set to 0.01.
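
For concreteness, here is a framework-agnostic sketch of PGD with random start following the recipe above (our illustration, not the paper's TensorFlow code); the gradient oracle grad_loss_x, the step-size default, and the [0, 1] pixel range are assumptions for the example.

```python
import numpy as np

def pgd_attack(x, y, grad_loss_x, eps, step_size=0.01, n_steps=None, clip=(0.0, 1.0)):
    # `grad_loss_x(x, y)` is assumed to return the gradient of the training loss with
    # respect to the input x (e.g. obtained from TensorFlow/PyTorch autodiff).
    if n_steps is None:
        n_steps = int(eps * 100 + 10)                        # steps = 100*eps + 10, as above
    x_adv = x + np.random.uniform(-eps, eps, size=x.shape)   # random start in the l_inf ball
    x_adv = np.clip(x_adv, *clip)
    for _ in range(n_steps):
        x_adv = x_adv + step_size * np.sign(grad_loss_x(x_adv, y))  # signed gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)             # project back into the l_inf ball
        x_adv = np.clip(x_adv, *clip)                        # keep valid pixel range
    return x_adv
```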

Fashion-MNIST. This is another image benchmark dataset, which is a drop-in replacement for MNIST (Xiao et al., 2017). Images in this dataset depict wearables such as shirts and boots instead of digits. The image format, the number of classes, and the number of train/test examples are all identical to MNIST. We use the same model and training details as for MNIST.

CIFAR-10. The CIFAR-10 dataset (?) is a dataset of 32x32 color images with ten classes, each consisting of 5,000 training images and 1,000 test images. The classes correspond to dogs, frogs, ships, trucks, etc. The pixel values are in the range [0, 255]. We use a wide Residual Network (?), which is identical to the state-of-the-art adversarially trained model on CIFAR-10 used by (Madry et al., 2017). When computing the IG vector for an input image, we use the predicted probability of the true class as the function F in the definition (13) of IG. For training each of the model types on CIFAR-10, we use the Momentum optimizer with weight decay, with a momentum rate of 0.9, a weight-decay rate of 0.0002, a batch size of 128, and 70,000 training steps. We use a learning-rate schedule: for the first 40,000 steps we use a learning rate of 10⁻¹; between 40,000 and 60,000 steps, a learning rate of 10⁻²; and after 60,000 steps, a learning rate of 10⁻³. We use Projected Gradient Descent (PGD) with random start to generate adversarial examples. The PGD hyperparameters depend on the specific ε bound on the ℓ∞-norm of the adversarial perturbations: the number of PGD steps was set to ε + 1, and the PGD step size was set to 1.

Mushroom. This is a standard tabular public dataset from the UCI Data Repository (Dheeru & Karra Taniskidou, 2017). The dataset consists of 8142 instances, each of which corresponds to a different mushroom species, and has 22 categorical features (and no numerical features), whose cardinalities are all under 10. The task is to classify an instance as edible (label = 1) or not (label = 0). We train a simple logistic regression model to predict the probability that the mushroom is edible, with a 70/30 train/test split, and use a 0.5 threshold to make the final classification. We train the models on 1-hot encoded feature vectors, and the IG computation is on these (sparse) 1-hot vectors, with the output function F being the final predicted probability. We train logistic regression models for this dataset, and for natural model training (with or without ℓ1-regularization) we use the Adam optimizer with a learning rate of 0.01, a batch size of 32, and 30 training epochs. Adversarial training is similar, except that each example batch is perturbed using the closed-form expression (7).

Spambase. This is another tabular dataset from the UCI Repository, consisting of 4601 instances with 57 numerical attributes (and no categorical ones). The instances are various numerical features of a specific email, and the task is to classify the email as spam (label = 1) or not (label = 0). The model and training details are similar to those for the mushroom dataset.

The code for all experiments (included along with the supplement) was written using TensorFlow 2.0. The following subsections contain results that were left out of the main body of the paper due to space constraints.

J.2. Mushroom Dataset: Average IG-based Feature Impact

We contrast the weights learned by natural training and adversarial training with ε = 0.1. Since all features in this dataset are categorical, many with cardinalities close to 10, there are too many features in the "exploded" space to allow a clean display, so we instead look at the average Feature Impact (FI, defined in Section H) over the (natural, unperturbed) test dataset; see Figure J.4. It is worth noting that several features that have a significant impact on the naturally-trained model have essentially no impact on the adversarially trained model.

J.3. Spambase: Average IG-based Feature Impact

We fix ε = 0.1 for adversarial training and show in Figure J.5 a bar-plot comparing the average Feature-Impacts (FI) between naturally-trained and adversarially-trained models. Note how the adversarially trained model has significantly fewer features with non-negligible impacts, compared to a naturally trained model.

J.4. MNIST, Fashion-MNIST and CIFAR-10: examples

Figs. J.6, J.7 and J.8 below show IG-based saliency maps of images correctly classified by three model types: a naturally trained un-regularized model, a naturally trained model with ℓ1-regularization, and an ℓ∞(ε)-adversarially trained model. The values of λ and ε are those indicated in Table 1. In each example, all three models predict the correct class with high probability, and we compare the Gini Indices of the IG-vectors (with respect to the predicted probability of the true class).

Figure J.4: Comparison of aggregate Feature Impact (FI) for a naturally-trained model and an adversarially-trained model with ε = 0.1, on the mushroom dataset. The features are arranged left to right in decreasing order of the FI value in the naturally-trained model.

The sparseness of the saliency maps of the adversarially-trained models is visually striking compared to those of the other two models, and this is reflected in the Gini Indices as well. Figs. J.9 and J.10 show analogous results, but using the DeepSHAP (Lundberg & Lee, 2017) attribution method instead of IG. The effect of adversarial training on the sparseness of the saliency maps is even more visually striking when using DeepSHAP compared to IG. (We had difficulty running DeepSHAP on CIFAR-10 data, so we are only able to show results for DeepSHAP on MNIST and Fashion-MNIST.)

Figure J.5: Comparison of aggregate Feature Impact (FI) for a naturally-trained model and an adversarially-trained model with ε = 0.1, on the spambase dataset. The features are arranged left to right in decreasing order of their FI in the naturally-trained model. To avoid clutter, we show only features that have an FI at least 5% of the highest FI (across both models).

Gini Indices of the IG saliency maps, listed as Natural Training / L1-norm Regularization / Adversarial Training:
(a) All models give the correct prediction (6); Gini: 0.9271 / 0.9266 / 0.9728
(b) All models give the correct prediction (3); Gini: 0.8112 / 0.8356 / 0.9383
(c) All models give the correct prediction (4); Gini: 0.9315 / 0.9366 / 0.9738
(d) All models give the correct prediction (2); Gini: 0.8843 / 0.8807 / 0.9595

Figure J.6: Some examples on MNIST. The saliency maps (also called feature-importance maps) computed via IG for the adversarially trained model are much sparser than those of the other models.

Gini Indices of the IG saliency maps, listed as Natural Training / L1-norm Regularization / Adversarial Training:
(a) All models give the correct prediction (Dress); Gini: 0.8190 / 0.8183 / 0.8532
(b) All models give the correct prediction (Pullover); Gini: 0.5777 / 0.5925 / 0.7024
(c) All models give the correct prediction (Bag); Gini: 0.7698 / 0.7784 / 0.7981
(d) All models give the correct prediction (T-shirt); Gini: 0.6840 / 0.6899 / 0.7503

Figure J.7: Some examples on Fashion-MNIST. The saliency maps (also called feature-importance maps) computed via IG for the adversarially trained model are much sparser than those of the other models.

Gini Indices of the IG saliency maps, listed as Natural Training / L1-norm Regularization / Adversarial Training:
(a) All models give the correct prediction (automobile); Gini: 0.6395 / 0.6556 / 0.7309
(b) All models give the correct prediction (airplane); Gini: 0.5736 / 0.6175 / 0.6914
(c) All models give the correct prediction (ship); Gini: 0.5505 / 0.5844 / 0.6778
(d) All models give the correct prediction (bird); Gini: 0.6150 / 0.6198 / 0.7084

Figure J.8: Some examples on CIFAR-10. The saliency maps (also called feature-importance maps) computed via IG for the adversarially trained model are much sparser than those of the other models.

Gini Indices of the DeepSHAP saliency maps, listed as Natural Training / L1-norm Regularization / Adversarial Training:
(a) All models give the correct prediction (0); Gini: 0.8982 / 0.9000 / 0.9528
(b) All models give the correct prediction (7); Gini: 0.9156 / 0.9156 / 0.9685
(c) All models give the correct prediction (3); Gini: 0.9373 / 0.9452 / 0.9773
(d) All models give the correct prediction (5); Gini: 0.9476 / 0.9473 / 0.9825

Figure J.9: Some examples on MNIST. The saliency maps (also called feature-importance maps) computed via DeepSHAP for the adversarially trained model are much sparser than those of the other models.

Gini Indices of the DeepSHAP saliency maps, listed as Natural Training / L1-norm Regularization / Adversarial Training:
(a) All models give the correct prediction (Pullover); Gini: 0.6749 / 0.6676 / 0.7435
(b) All models give the correct prediction (Trouser); Gini: 0.8322 / 0.8628 / 0.8953
(c) All models give the correct prediction (Bag); Gini: 0.8343 / 0.8374 / 0.8683
(d) All models give the correct prediction (Ankle boot); Gini: 0.7701 / 0.7575 / 0.7920

Figure J.10: Some examples on Fashion-MNIST. The saliency maps (also called feature-importance maps) computed via DeepSHAP for the adversarially trained model are much sparser than those of the other models.