
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 11–20, Hong Kong, China, November 3–7, 2019. ©2019 Association for Computational Linguistics

Attention is not not Explanation

Sarah Wiegreffe*
School of Interactive Computing
Georgia Institute of Technology
[email protected]

Yuval Pinter*
School of Interactive Computing
Georgia Institute of Technology
[email protected]

Abstract

Attention mechanisms play a central role in NLP systems, especially within recurrent neural network (RNN) models. Recently, there has been increasing interest in whether or not the intermediate representations offered by these modules may be used to explain the reasoning for a model's prediction, and consequently reach insights regarding the model's decision-making process. A recent paper claims that 'Attention is not Explanation' (Jain and Wallace, 2019). We challenge many of the assumptions underlying this work, arguing that such a claim depends on one's definition of explanation, and that testing it needs to take into account all elements of the model. We propose four alternative tests to determine when/whether attention can be used as explanation: a simple uniform-weights baseline; a variance calibration based on multiple random seed runs; a diagnostic framework using frozen weights from pretrained models; and an end-to-end adversarial attention training protocol. Each allows for meaningful interpretation of attention mechanisms in RNN models. We show that even when reliable adversarial distributions can be found, they don't perform well on the simple diagnostic, indicating that prior work does not disprove the usefulness of attention mechanisms for explainability.

1 Introduction

Attention mechanisms (Bahdanau et al., 2014) are nowadays ubiquitous in NLP, and their suitability for providing explanations for model predictions is a topic of high interest (Xu et al., 2015; Rocktäschel et al., 2015; Mullenbach et al., 2018; Thorne et al., 2019; Serrano and Smith, 2019). If they indeed offer such insights, many application areas would benefit by better understanding the internals of neural models that use attention as a means for, e.g., model debugging or architecture selection.

* Equal contributions.

A recent paper (Jain and Wallace, 2019) points to possible pitfalls that may cause researchers to misapply attention scores as explanations of model behavior, based on a premise that explainable attention distributions should be consistent with other feature-importance measures as well as exclusive given a prediction.¹ Its core argument, which we elaborate in §2, is that if alternative attention distributions exist that produce similar results to those obtained by the original model, then the original model's attention scores cannot be reliably used to "faithfully" explain the model's prediction. Empirically, the authors show that achieving such alternative distributions is easy for a large sample of English-language datasets.

We contend (§2.1) that while Jain and Wallace ask an important question, and raise a genuine concern regarding potential misuse of attention weights in explaining model decisions on English-language datasets, some key assumptions in their experimental design allow an implausibly large amount of freedom in the setup, ultimately leaving practitioners without an applicable way of measuring the utility of attention distributions in specific settings.

We apply a more model-driven approach to this question, beginning (§3.2) with testing attention modules' contribution to a model by applying a simple baseline where attention weights are frozen to a uniform distribution. We demonstrate that for some datasets, a frozen attention distribution performs just as well as learned attention weights, concluding that randomly- or adversarially-perturbed distributions are not evidence against attention as explanation in these cases.

¹ A preliminary version of our theoretical argumentation was published as a blog post on Medium at http://bit.ly/2OTzU4r. Following the ensuing online discussion, the authors uploaded a post-conference version of the paper to arXiv (V3) which addresses some of the issues in the post. We henceforth refer to this later version.

Figure 1: Schematic diagram of a classification LSTM model with attention, including the components manipulated or replaced in the experiments performed in Jain and Wallace (2019) and in this work (by section). The depicted pipeline runs the input "the movie was good" through Embedding, LSTM, Attention Parameters, Attention Scores, MLP, and Prediction Score components; brackets mark the parts manipulated by J&W and by our §3.2, §3.3, §3.4, and §4 experiments.

We next (§3.3) examine the expected variance in attention-produced weights by initializing multiple training sequences with different random seeds, allowing a better quantification of how much variance can be expected in trained models. We show that considering this background stochastic variation when comparing adversarial results with a traditional model allows us to better interpret adversarial results. In §3.4, we present a simple yet effective diagnostic tool which tests attention distributions for their usefulness by using them as frozen weights in a non-contextual multi-layered perceptron (MLP) architecture. The favorable performance of LSTM-trained weights provides additional support for the coherence of trained attention scores. This demonstrates a sense in which attention components indeed provide a meaningful model-agnostic interpretation of tokens in an instance.

In §4, we introduce a model-consistent training protocol for finding adversarial attention weights, correcting some flaws we found in the previous approach. We train a model using a modified loss function which takes into account the distance from an ordinarily-trained base model's attention scores in order to learn parameters for adversarial attention distributions. We believe these experiments are now able to support or refute a claim of faithful explainability, by providing a way for convincingly saying by construction that a plausible alternative 'explanation' can (or cannot) be constructed for a given dataset and model architecture. We find that while plausibly adversarial distributions of the consistent kind can indeed be found for the binary classification datasets in question, they are not as extreme as those found in the inconsistent manner, as illustrated by an example from the IMDB task in Figure 2. Furthermore, these outputs do not fare well in the diagnostic MLP, calling into question the extent to which we can treat them as equally powerful for explainability.

Finally, we provide a theoretical discussion (§5) on the definitions of interpretability and explainability, grounding our findings within the accepted definitions of these concepts.

Our four quantitative experiments are illustrated in Figure 1, where each bracket on the left covers the components in a standard RNN-with-attention architecture which we manipulate in each experiment. We urge NLP researchers to consider applying the techniques presented here on their models containing attention in order to evaluate its effectiveness at providing explanation. We offer our code for this purpose at https://github.com/sarahwie/attention.

2 Attention Might be Explanation

In this section, we briefly describe the experimental design of Jain and Wallace (2019) and look at the results they provide to support their claim that 'Attention is not explanation'. The authors select eight classification datasets, mostly binary, and two question answering tasks for their experiments (detailed in §3.1).

They first present a correlation analysis of attention scores and other interpretability measures.

Figure 2: Attention maps for an IMDb instance ("brilliant and moving performances by tom and peter finch"; all predicted as positive with score > 0.998) under the base model, Jain and Wallace (2019)'s adversary, and our adversary, showing that in practice it is difficult to learn a distant adversary which is consistent on all instances in the training set.

They find that attention is not strongly correlated with other, well-grounded feature importance metrics, specifically gradient-based and leave-one-out methods (which in turn correlate well with each other). This experiment evaluates the authors' claim of consistency – that attention-based methods of explainability cannot be valid if they do not correlate well with other metrics. We find the experiments in this part of the paper convincing and do not focus our analysis here. We offer our simple MLP diagnostic network (§3.4) as an additional way of determining the validity of attention distributions, in a more in vivo setting.

Next, the authors present an adversarial search for alternative attention distributions which minimally change model predictions. To this end, they manipulate the attention distributions of trained models (which we will call base from now on) to discern whether alternative distributions exist for which the model outputs near-identical prediction scores. They are able to find such distributions, first by randomly permuting the base attention distributions on the test data during model inference, and later by adversarially searching for maximally different distributions that still produce a prediction score within ε of the base model's. They use these experimental results as supporting evidence for the claim that attention distributions cannot be explainable because they are not exclusive. As stated, the lack of comparable change in prediction with a change in attention scores is taken as evidence for a lack of "faithful" explainability of the attention mechanism from inputs to output.
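As a schematic illustration of the first of these manipulations, the permutation check amounts roughly to the following sketch. This is a simplified re-statement of the idea rather than Jain and Wallace's released code; `predict_with_attention` is a hypothetical hook that runs the frozen output layer with a caller-supplied attention distribution.

```python
import torch

def permutation_check(predict_with_attention, tokens, mask, alpha, n_perm=100):
    """Shuffle a trained model's attention weights at inference time and
    measure how much the prediction moves (here as TVD against the original
    output), as in the per-instance permutation experiment described above."""
    with torch.no_grad():
        y_orig = predict_with_attention(tokens, mask, alpha)
        deltas = []
        for _ in range(n_perm):
            perm = torch.randperm(alpha.shape[-1])
            y_perm = predict_with_attention(tokens, mask, alpha[..., perm])
            deltas.append(0.5 * (y_perm - y_orig).abs().sum(-1))  # per-instance TVD
        # summary statistic over permutations (and over the batch)
        return torch.stack(deltas).median()
```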

Notably, Jain and Wallace detach the attention distribution and output layer of their pretrained network from the parameters that compute them (see Figure 1), treating each attention score as a standalone unit independent of the model. In addition, they compute an independent adversarial distribution for each instance.

2.1 Main Claim

We argue that Jain and Wallace's counterfactual attention weight experiments do not advance their thesis, for the following reasons:

Attention Distribution is not a Primitive. From a modeling perspective, detaching the attention scores obtained by parts of the model (i.e., the attention mechanism) degrades the model itself. The base attention weights are not assigned arbitrarily by the model, but rather computed by an integral component whose parameters were trained alongside the rest of the layers; the two depend on each other. Jain and Wallace provide alternative distributions which may result in similar predictions, but in the process they remove the very linkage which motivates the original claim of attention distribution explainability, namely the fact that the model was trained to attend to the tokens it chose. A reliable adversary must take this consideration into account, as our setup in §4 does.

Existence does not Entail Exclusivity. On a more theoretical level, we hold that attention scores are used as providing an explanation, not the explanation. The final layer of an LSTM model may easily produce outputs capable of being aggregated into the same prediction in various ways; however, the model still makes the choice of a specific weighting distribution using its trained attention component. This mathematically flexible production capacity is particularly evident in binary classifiers, where prediction is reduced to a single scalar, and an average instance (of, e.g., the IMDB dataset) might contain 179 tokens, i.e. 179 scalars to be aggregated. This effect is greatly exacerbated when performed independently on each instance.² Thus, it is no surprise that Jain and Wallace find what they are looking for given this degree of freedom.

² Indeed, the most open-ended task, question answering over CNN data, proves considerably harder to manipulate by random permutation of its scores (Figure 6e in Jain and Wallace (2019)). Similarly, the adversarial examples presented in Appendix C of the paper for the QA datasets select a different token of the correct word's type, which should not surprise us even under an LSTM assumption (encoder hidden states are typically affected by the input word to a noticeable degree).

Dataset     Avg. Length (tokens)    Train Size (neg/pos)    Test Size (neg/pos)
Diabetes    1858                    6381/1353               1295/319
Anemia      2188                    1847/3251               460/802
IMDb        179                     12500/12500             2184/2172
SST         19                      3034/3321               863/862
AgNews      36                      30000/30000             1900/1900
20News      115                     716/710                 151/183

Table 1: Dataset statistics.

In summary, due to the per-instance nature of the demonstration and the fact that model parameters have not been learned or manipulated directly, Jain and Wallace have not shown the existence of an adversarial model that produces the claimed adversarial distributions. Thus, we cannot treat these adversarial attentions as equally plausible or faithful explanations for model prediction. Additionally, they haven't provided a baseline of how much variation is to be expected in learned attention distributions, leaving the reader to question just how adversarial the found distributions really are.

3 Examining Attention Distributions

In this section, we apply a careful methodological approach for examining the properties of attention distributions and propose alternatives. We begin by identifying the appropriate scope of the models' performance and variance, followed by implementing an empirical diagnostic technique which measures the model-agnostic usefulness of attention weights in capturing the relationship between inputs and output.

3.1 Experimental Setup

In order to make our many points in a succinct fashion as well as follow the conclusions drawn by Jain and Wallace, we focus on experimenting with the binary classification subset of their tasks, and on models with an LSTM architecture (Hochreiter and Schmidhuber, 1997), the only one the authors make firm conclusions on. Future work may extend our experiments to extractive tasks like question answering, as well as other attention-prone tasks, like seq2seq models.

We experiment on the following datasets: Stanford Sentiment Treebank (SST) (Socher et al., 2013), the IMDB Large Movie Reviews Corpus (Maas et al., 2011), 20 NEWSGROUPS (hockey vs. baseball),³ the AG NEWS Corpus,⁴ and two prediction tasks from MIMIC-III ICD9 (Johnson et al., 2016): DIABETES and ANEMIA. The tasks are as follows: to predict positive or negative sentiment from sentences (SST) and movie reviews (IMDB); to predict the topic of news articles as either baseball (neg.) or hockey (pos.) in 20 NEWSGROUPS and either world (neg.) or business (pos.) in AG NEWS; to predict whether a patient is diagnosed with diabetes from their ICU discharge summary; and to predict whether the patient is diagnosed with acute (neg.) or chronic (pos.) anemia (both MIMIC-III ICD9). We use the dataset versions, including train-test split, provided by Jain and Wallace.⁵ All datasets are in English.⁶ Data statistics are provided in Table 1.

Dataset     Base (Reported)    Base (Reproduced)    Uniform
Diabetes    0.79               0.775                0.706
Anemia      0.92               0.938                0.899
IMDb        0.88               0.902                0.879
SST         0.81               0.831                0.822
AgNews      0.96               0.964                0.960
20News      0.94               0.942                0.934

Table 2: Classification F1 scores (1-class) on attention models, both as reported by Jain and Wallace and in our reproduction, and on models forced to use uniform attention over hidden states.

We use a single-layer bidirectional LSTM with tanh activation, followed by an additive attention layer (Bahdanau et al., 2014) and softmax prediction, which is equivalent to the LSTM setup of Jain and Wallace. We use the same hyperparameters found in that work to be effective in training, which we corroborated by reproducing its results to a satisfactory degree (see the middle columns of Table 2). We refer to this architecture as the main setup, where training results in a base model.
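For concreteness, the following is a minimal PyTorch-style sketch of this main setup. It is a simplified illustration under assumed dimensions rather than the released implementation; training loop, hyperparameters, and preprocessing are omitted.

```python
import torch
import torch.nn as nn

class AttentionLSTMClassifier(nn.Module):
    """Single-layer BiLSTM with additive (Bahdanau-style) attention and a
    softmax prediction layer, mirroring the main setup described above."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # additive attention: score_t = v^T tanh(W h_t)
        self.attn_w = nn.Linear(2 * hidden_dim, hidden_dim)
        self.attn_v = nn.Linear(hidden_dim, 1, bias=False)
        self.out = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, tokens, mask):
        # tokens: (B, T) token ids; mask: (B, T) boolean, True for real tokens
        h, _ = self.lstm(self.embedding(tokens))                      # (B, T, 2H)
        scores = self.attn_v(torch.tanh(self.attn_w(h))).squeeze(-1)  # (B, T)
        scores = scores.masked_fill(~mask, float('-inf'))             # ignore padding
        alpha = torch.softmax(scores, dim=-1)                         # attention distribution
        context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)         # weighted sum of states
        y = torch.softmax(self.out(context), dim=-1)                  # prediction scores
        return y, alpha
```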

Following Jain and Wallace, all analysis is performed on the test set. We report F1 scores on the positive class, and apply the same metrics they use for model comparison.

³ http://qwone.com/~jason/20Newsgroups/

⁴ http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html

⁵ https://github.com/successar/AttentionExplanation

⁶ We do not include the Twitter Adverse Drug Reactions (ADR) (Nikfarjam et al., 2015) dataset as the source tweets are no longer all available.

Specifically, we use Total Variation Distance (TVD) for comparing prediction scores y and Jensen–Shannon Divergence (JSD) for comparing weighting distributions α:

    \mathrm{TVD}(y_1, y_2) = \frac{1}{2} \sum_{i=1}^{|\mathcal{Y}|} |y_{1i} - y_{2i}| ;

    \mathrm{JSD}(\alpha_1, \alpha_2) = \frac{1}{2} \mathrm{KL}[\alpha_1 \,\|\, \bar{\alpha}] + \frac{1}{2} \mathrm{KL}[\alpha_2 \,\|\, \bar{\alpha}], \quad \text{where } \bar{\alpha} = \frac{\alpha_1 + \alpha_2}{2}.
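Both metrics are straightforward to compute per instance; the following is a small NumPy/SciPy sketch under the notation above (the example values at the bottom are illustrative only, not taken from our experiments).

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def tvd(y1, y2):
    """Total Variation Distance between two prediction score vectors."""
    return 0.5 * np.abs(np.asarray(y1, dtype=float) - np.asarray(y2, dtype=float)).sum()

def jsd(a1, a2):
    """Jensen-Shannon Divergence between two attention distributions."""
    a1, a2 = np.asarray(a1, dtype=float), np.asarray(a2, dtype=float)
    mean = 0.5 * (a1 + a2)
    return 0.5 * entropy(a1, mean) + 0.5 * entropy(a2, mean)

# e.g., for a binary classifier, y holds [p(neg), p(pos)] per instance:
print(tvd([0.1, 0.9], [0.15, 0.85]))          # 0.05
print(jsd([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))  # ~0.25; JSD is bounded above by ln 2 ~ 0.693
```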

3.2 Uniform as the Adversary

First, we test the validity of the classification tasks and datasets by examining whether attention is necessary in the first place. We argue that if attention models are not useful compared to very simple baselines, i.e. their parameter capacity is not being used, there is no point in using their outcomes for any type of explanation to begin with. We thus introduce a uniform model variant, identical to the main setup except that the attention distribution is frozen to uniform weights over the hidden states.
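Concretely, the uniform variant only changes how α is produced in the main-setup sketch of §3.1; a fragment under that sketch's assumptions:

```python
# inside the forward pass of the main setup, replacing the learned attention:
lengths = mask.sum(dim=-1, keepdim=True).float()   # number of real tokens per instance
alpha = mask.float() / lengths                     # frozen uniform distribution over real tokens
# the context vector and prediction layer are then trained as usual
```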

The results comparing this baseline with the base model are presented in Table 2. If attention were a necessary component for good performance, we would expect a large drop between the two rightmost columns. Somewhat surprisingly, for three of the classification tasks the attention layer appears to offer little to no improvement whatsoever. We conclude that these datasets, notably AG NEWS and 20 NEWSGROUPS, are not useful test cases for the debated question: attention is not explanation if you don't need it. We subsequently ignore the two News datasets, but keep SST, which we deem borderline.

3.3 Variance within a Model

We now test whether the variances observed by Jain and Wallace between trained attention scores and adversarially-obtained ones are unusual. We do this by repeating their analysis on eight models trained from the main setup using different initialization random seeds. The variance introduced in the attention distributions represents a baseline amount of variance that would be considered normal.
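A sketch of this calibration follows. The helpers `train_model` and `attention_for` are hypothetical names standing in for the main-setup training loop and for extracting a model's attention distribution on one instance as a NumPy array; `jsd` is the function from the §3.1 sketch.

```python
import numpy as np

def seed_variance_baseline(train_model, attention_for, base_model, test_instances,
                           seeds=range(8)):
    """Train the main setup under several random seeds and record, for each
    test instance, the maximum JSD between a seed model's attention
    distribution and the base model's -- a baseline level of 'normal'
    variation against which adversarial divergences can be judged."""
    max_jsd = np.zeros(len(test_instances))
    for seed in seeds:
        model = train_model(seed)
        for i, instance in enumerate(test_instances):
            div = jsd(attention_for(model, instance), attention_for(base_model, instance))
            max_jsd[i] = max(max_jsd[i], div)
    return max_jsd
```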

The results are plotted in Figure 3 using the same plane as Jain and Wallace's Figure 8 (with two of these reproduced as (e-f)). Left-heavy violins are interpreted as data classes for which the compared model produces attention distributions similar to the base model, and so having an adversary that manages to 'pull right' supports the argument that distributions are easy to manipulate. We see that SST distributions (c, e) are surprisingly robust to random seed change, validating our choice to continue examining this dataset despite its borderline F1 score. On the Diabetes dataset, the negative class is already subject to relatively arbitrary distributions from the different random seed settings (d), making the highly divergent results from the overly-flexible adversarial setup (f) seem less impressive. Our consistently-adversarial setup in §4 will further explore the difficulty of surpassing seed-induced variance between attention distributions.

3.4 Diagnosing Attention Distributions by Guiding Simpler Models

As a more direct examination of models, and as a complementary approach to Jain and Wallace (2019)'s measurement of backward-pass gradient flows through the model for gauging token importance, we introduce a post-hoc training protocol of a non-contextual model guided by pre-set weight distributions. The idea is to examine the prediction power of attention distributions in a 'clean' setting, where the trained parts of the model have no access to neighboring tokens of the instance. If pre-trained scores from an attention model perform well, we take this to mean they are helpful and consistent, fulfilling a certain sense of explainability. In addition, this setup serves as an effective diagnostic tool for assessing the utility of adversarial attention distributions: if such distributions are truly alternative, they should be equally useful as guides as their base equivalents, and thus perform comparably.

Our diagnostic model is created by replacing the main setup's LSTM and attention parameters with a token-level affine hidden layer with tanh activation (forming an MLP), and forcing its output scores to be weighted by a pre-set, per-instance distribution, during both training and testing. This setup is illustrated in Figure 4.

Figure 3: Densities of maximum JS divergences (x-axis) as a function of the max attention (y-axis) in each instance between the base distributions and: (a-d) models initialized on different random seeds, for (a) IMDB, (b) Anemia, (c) SST, and (d) Diabetes; (e-f) models from a per-instance adversarial setup, for (e) SST and (f) Diabetes (replication of Figures 8a and 8c resp. in Jain and Wallace (2019)). In each max-attention bin, top (blue) shows the negative-label instances, bottom (red) the positive-label instances.

Figure 4: Diagram of the setup in §3.4 (except TRAINED MLP, which learns weight parameters): the input "the movie was good" passes through an Embedding layer and an Affine layer, whose outputs are aggregated by imposed weights into a prediction score.

The guide weights we impose are the following: Uniform, where we force the MLP outputs to be considered equally across each instance, effectively forming an unweighted baseline; Trained MLP, where we do not freeze the weights layer, instead allowing the MLP to learn its own attention parameters;⁷ Base LSTM, where we take the weights learned by the base LSTM model's attention layer; and Adversary, based on distributions found adversarially using the consistent training algorithm from §4 below (where their results will be discussed).
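A minimal PyTorch-style sketch of this diagnostic model follows, again simplified and under assumed dimensions. The guide weights are supplied externally (uniform, base-LSTM, or adversarial) and kept fixed; the Trained MLP variant would instead learn an attention module over the affine outputs.

```python
import torch
import torch.nn as nn

class GuidedMLP(nn.Module):
    """Non-contextual diagnostic model: a token-level affine layer with tanh
    activation whose outputs are aggregated using an imposed weight
    distribution supplied by the caller."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.affine = nn.Linear(emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, tokens, guide_weights):
        # guide_weights: (B, T) pre-set distribution over the instance's tokens
        # each token is processed independently -- no access to its neighbors
        h = torch.tanh(self.affine(self.embedding(tokens)))            # (B, T, H)
        context = torch.bmm(guide_weights.unsqueeze(1), h).squeeze(1)  # imposed weighting
        return torch.softmax(self.out(context), dim=-1)                # prediction scores
```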

The results are presented in Table 3. The first important result, consistent across datasets, is that using pre-trained LSTM attention weights is better than letting the MLP learn them on its own, which is in turn better than the unweighted baseline. Comparing with results from §3.2, we see that this setup also outperforms the LSTM trained with uniform attention weights, suggesting that the attention module is more important than the word-level architecture for these datasets.

⁷ This is the same as Jain and Wallace's average setup.

Guide weights     Diab.    Anemia    SST      IMDb
UNIFORM           0.404    0.873     0.812    0.863
TRAINED MLP       0.699    0.920     0.817    0.888
BASE LSTM         0.753    0.931     0.824    0.905
ADVERSARY (§4)    0.503    0.932     0.592    0.700

Table 3: F1 scores on the positive class for an MLP model trained on various weighting guides. For ADVERSARY, we set λ = 0.001.

These findings strengthen the case counter to the claim that attention weights are arbitrary: independent token-level models that have no access to contextual information find them useful, indicating that they encode some measure of token importance which is not model-dependent.

4 Training an Adversary

Having demonstrated three methods which test the meaningfulness of attention distributions as instruments of explainability with adequate control, we now propose a model-consistent training protocol for finding adversarial attention distributions through a coherent parameterization, which holds across all training instances. We believe this setup is able to advance the search for faithful explainability (see §5). Indeed, our results will demonstrate that the extent to which a model-consistent adversary can be found varies across datasets, and that the dramatic reduction in degree of freedom compared to previous work allows for better-informed analysis.

Model. Given the base model Mb, we train a model Ma whose explicit goal is to provide similar prediction scores for each instance, while distancing its attention distributions from those of Mb. Formally, we train the adversarial model using stochastic gradient updates based on the following loss formula (summed over instances in the minibatch):

    \mathcal{L}(M_a, M_b)^{(i)} = \mathrm{TVD}(y_a^{(i)}, y_b^{(i)}) - \lambda \, \mathrm{KL}(\alpha_a^{(i)} \,\|\, \alpha_b^{(i)}),

where y^{(i)} and α^{(i)} denote predictions and attention distributions for an instance i, respectively. λ is a hyperparameter which we use to control the tradeoff between relaxing the prediction distance requirement (low TVD) in favor of more divergent attention distributions (high JSD), and vice versa. When this interaction is plotted on a two-dimensional axis, the shape of the plot can be interpreted to either support the 'attention is not explanation' hypothesis if it is convex (JSD is easily manipulable), or oppose it if it is concave (an early increase in JSD comes at a high cost in prediction precision).
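A sketch of one optimization step under this objective, in the PyTorch style of the earlier sketches (models return prediction scores and attention distributions; `lam` is λ). This is an illustrative simplification, not the released training code.

```python
import torch

def adversarial_step(adv_model, base_model, tokens, mask, optimizer, lam):
    """One minibatch update for the adversarial model M_a: stay close to the
    base model M_b's predictions (TVD term) while pushing the attention
    distributions apart (subtracted KL term, weighted by lambda)."""
    with torch.no_grad():
        y_b, alpha_b = base_model(tokens, mask)      # frozen base model M_b
    y_a, alpha_a = adv_model(tokens, mask)

    tvd = 0.5 * (y_a - y_b).abs().sum()              # TVD, summed over the minibatch
    # KL(alpha_a || alpha_b), summed over instances; epsilons guard against log(0)
    kl = (alpha_a * ((alpha_a + 1e-12).log() - (alpha_b + 1e-12).log())).sum()

    loss = tvd - lam * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```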

Prediction performance. By definition, our loss objective does not directly consider actual prediction performance. The TVD component pushes the adversary towards the same score as the base model, but our setup does not ensure generalization from train to test. It would thus be interesting to inspect the extent of the implicit F1/TVD relationship. We report the highest F1 scores of models whose attention distributions diverge from the base, on average, by at least 0.4 in JSD, as well as their λ setting and corresponding comparison metrics, in Table 4 (full results available in Appendix B). All F1 scores are on par with the original model results reported in Table 2, indicating the effectiveness of our adversarial models at imitating base model scores on the test sets.
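The selection behind Table 4 can be sketched as a simple sweep over λ; `train_adversary` and `evaluate` are hypothetical helpers wrapping the training step above and the test-set metrics.

```python
def select_adversary(train_adversary, evaluate, lambdas):
    """Sweep the lambda hyperparameter and keep, among models whose attention
    diverges from the base by an instance-average JSD of at least 0.4, the one
    with the highest test F1."""
    best = None
    for lam in lambdas:
        model = train_adversary(lam)
        f1, avg_tvd, avg_jsd = evaluate(model)   # instance-averaged test metrics
        if avg_jsd >= 0.4 and (best is None or f1 > best['f1']):
            best = {'lambda': lam, 'f1': f1, 'tvd': avg_tvd, 'jsd': avg_jsd}
    return best
```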

Adversarial weights as guides. We next apply the diagnostic setup introduced in §3.4 by training a guided MLP model on the adversarially-trained attention distributions. The results, reported in the bottom line of Table 3, show that despite their local decision-imitation abilities, they are usually completely incapable of providing a non-contextual framework with useful guides.⁸

Figure 5: Averaged per-instance test set JSD and TVD from the base model for each model variant. JSD is bounded at ~0.693. ▲: random seed; ■: uniform weights; dotted line: our adversarial setup as λ is varied; +: adversarial setup from Jain and Wallace (2019).

Dataset     λ          F1 (↑)    TVD (↓)    JSD (↑)
Diabetes    2e-4       0.775     0.015      0.409
Anemia      5e-4       0.942     0.017      0.481
SST         5.25e-4    0.823     0.036      0.514
IMDb        8e-4       0.906     0.014      0.405

Table 4: Best-performing adversarial models with instance-average JSD > 0.4.

We offer these results as evidence that adversarial distributions, even those obtained consistently for a dataset, deprive the underlying model of some form of understanding it gained over the data, one that it was able to leverage by tuning the attention mechanism towards preferring 'useful' tokens.

TVD/JSD tradeoff. In Figure 5 we present the levels of prediction variance (TVD) allowed by models achieving increased attention distance (JSD) on all four datasets.

⁸ We note the outlying result achieved on the Anemia dataset. This can be explained via the data distribution, which is heavily skewed towards positive examples (see Table 1 in the Appendix), together with the fact (conceded in Jain and Wallace (2019)'s section 4.2.1) that positive instances in detection datasets such as MIMIC tend to contain a handful of indicative tokens, making the particularly helpful distributions reached by a trained model hard to replace by an adversary. Together, this leads to the selected setting of λ = 0.001 producing average distributions substantially more similar to the base than in the other datasets (JSD ≈ 0.58 vs. > 0.61) and thus more useful to the MLP setup.

The convex shape of most curves does lend support to the claim that attention scores are easily manipulable; however, the extent of this effect emerging from Jain and Wallace's per-instance setup is a considerable exaggeration, as seen by its position (+) well below the curve of our parameterized model set. Again, the SST dataset emerges as an outlier: not only can JSD be increased practically arbitrarily without incurring prediction variance cost, the uniform baseline (■) comes up under the curve, i.e. with a better adversarial score. We again include random seed initializations (▲) in order to quantify a baseline amount of variance.

TVD/JSD plots broken down by prediction class are available in Appendix C. In future work, we intend to inspect the potential of multiple adversarial attention models existing side-by-side, all distant enough from each other.

Concrete Example. Figure 2 illustrates the difference between inconsistently-achieved adversarial heatmaps and consistently trained ones. Despite both adversaries approximating the desired prediction score to a very high degree, the heatmaps show that Jain and Wallace's model has distributed all of the attention weight to an ad-hoc token, whereas our trained model could only distance itself from the base model distribution by so much, keeping multiple tokens in the > 0.1 score range.

5 Defining Explanation

The umbrella term of "Explainable AI" encompasses at least three distinct notions: transparency, explainability, and interpretability. Lipton (2016) categorizes transparency, or overall human understanding of a model, and post-hoc explainability as two competing notions under the umbrella of interpretability. The relevant sense of transparency, as defined by Lipton (2016, §3.1.2), pertains to the way in which a specific portion of a model corresponds to a human-understandable construct (which Doshi-Velez and Kim (2017) refer to as a "cognitive chunk"). Under this definition, it should appear sensible of the NLP community to treat attention scores as a vehicle of (partial) transparency. Attention mechanisms do provide a look into the inner workings of a model, as they produce an easily-understandable weighting of hidden states.

Rudin (2018) defines explainability as simply a plausible (but not necessarily faithful) reconstruction of the decision-making process, and Riedl (2019) classifies explainable rationales as valuable in that they mimic what we as humans do when we rationalize past actions: we invent a story that plausibly justifies our actions, even if it is not an entirely accurate reconstruction of the neural processes that produced our behavior at the time. Distinguishing between interpretability and explainability as two separate notions, Rudin (2018) argues that interpretability is more desirable but more difficult to achieve than explainability, because it requires presenting humans with a big-picture understanding of the correlative relationship between inputs and outputs (citing the example of linear regression coefficients). Doshi-Velez and Kim (2017) break down interpretability into further subcategories, depending on the amount of human involvement and the difficulty of the task.

In prior work, Lei et al. (2016) train a model to simultaneously generate rationales and predictions from input text, using gold-label rationales to evaluate their model. Generally, many accept the notion of extractive methods such as Lei et al. (2016), in which explanations come directly from the input itself (as in attention), as plausible. Works such as Mullenbach et al. (2018) and Ehsan et al. (2019) use human evaluation to evaluate explanations; the former based on attention scores over the input, and the latter based on systems with additional rationale-generation capability. The authors show that rationales generated in a post-hoc manner increase user trust in a system.

Citing Ross et al. (2017), Jain and Wallace's requisite for attention distributions to be used as explanation is that there must only exist one or a few closely-related correct explanations for a model prediction. However, Doshi-Velez and Kim (2017) caution against applying evaluations and terminology broadly without clarifying task-specific explanation needs. If we accept the Rudin and Riedl definitions of explainability as providing a plausible, but not necessarily faithful, rationale for model prediction, then the argument against attention mechanisms because they are not exclusive, as claimed by Jain and Wallace, is invalid, and human evaluation (which they do not consult) is necessary to evaluate the plausibility of generated rationales. Just because there exists another explanation does not mean that the one provided is false or meaningless, and under this definition the existence of multiple different explanations is not necessarily indicative of the quality of a single one.

Jain and Wallace define attention and explanation as measuring the "responsibility" each input token has on a prediction. This aligns more closely with the more rigorous (Lipton, 2016, §3.1.1) definition of transparency, or Rudin (2018)'s definition of interpretability: human understanding of the model as a whole rather than of its respective parts. The ultimate question posed so far as 'is attention explanation?' seems to be: do high attention weights on certain elements in the input lead the model to make its prediction? This question is ultimately left largely unanswered by prior work, as we address in previous sections. However, under the given definition of transparency, the authors' exclusivity requisite is well-defined and we find value in their counterfactual framework as a concept – if a model is capable of producing multiple sets of diverse attention weights for the same prediction, then the relationship between inputs and outputs used to make predictions is not understood by attention analysis. This provides us with the motivation to implement the adversarial setup coherently and to derive and present conclusions from it. To this end, we additionally provide our §3.4 model to test the relationship between input tokens and output.

In the terminology of Doshi-Velez and Kim (2017), our proposed methods provide a functionally-grounded evaluation of attention as explanation, i.e. an analysis conducted on proxy tasks without human evaluation. We believe the proxies we have provided can be used to test the validity of attention as a form of explanation from the ground up, based on the type of explanation one is looking for.

6 Attention is All you Need it to Be

Whether or not attention is explanation depends on the definition of explainability one is looking for: plausible or faithful explanations (or both). We believe that prior work focused on providing plausible rationales is not invalidated by Jain and Wallace's or our results. However, we have confirmed that adversarial distributions can be found for LSTM models in some classification tasks, as originally hypothesized by Jain and Wallace. This should give pause to researchers who are looking to attention distributions for one true, faithful interpretation of the link their model has established between inputs and outputs. At the same time, we have provided a suite of experiments that researchers can make use of in order to make informed decisions about the quality of their models' attention mechanisms when used as explanation for model predictions.

We've shown that alternative attention distributions found via adversarial training methods perform poorly relative to traditional attention mechanisms when used in our diagnostic MLP model. These results indicate that trained attention mechanisms in RNNs on our datasets do in fact learn something meaningful about the relationship between tokens and prediction which cannot be easily 'hacked' adversarially.

We view the conditions under which adversarial distributions can actually be found in practice to be an important direction for future work. Additional future directions for this line of work include application on other tasks such as sequence modeling and multi-document analysis (NLI, QA); extension to languages other than English; and adding a human evaluation for examining the level of agreement with our measures. We also believe our work can provide value to theoretical analysis of attention models, motivating development of analytical methods to estimate the usefulness of attention as an explanation based on dataset and model properties.

Acknowledgments

We thank Yoav Goldberg for preliminary comments on the idea behind the original Medium post. We thank the online community who participated in the discussion following the post, and particularly Sarthak Jain and Byron Wallace for their active engagement, as well as for the high-quality code they released, which allowed fast reproduction and modification of their experiments. We thank Erik Wijmans for early feedback. We thank the members of the Computational Linguistics group at Georgia Tech for discussions and comments, particularly Jacob Eisenstein and Murali Raghu Babu. We thank the anonymous reviewers for many useful comments.

YP is a Bloomberg Data Science PhD Fellow.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Upol Ehsan, Pradyumna Tambwekar, Larry Chan, Brent Harrison, and Mark O. Riedl. 2019. Automated rationale generation: A technique for explainable AI and its effects on human perceptions. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pages 263–274. ACM.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Minneapolis, Minnesota. Association for Computational Linguistics.

Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, H. Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117, Austin, Texas. Association for Computational Linguistics.

Zachary C. Lipton. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142–150. Association for Computational Linguistics.

James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1101–1111, New Orleans, Louisiana. Association for Computational Linguistics.

Azadeh Nikfarjam, Abeed Sarker, Karen O'Connor, Rachel Ginn, and Graciela Gonzalez. 2015. Pharmacovigilance from social media: Mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. Journal of the American Medical Informatics Association, 22(3):671–681.

Mark O. Riedl. 2019. Human-centered artificial intelligence and machine learning. Human Behavior and Emerging Technologies, 1(1):33–36.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.

Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2662–2670. AAAI Press.

Cynthia Rudin. 2018. Please stop explaining black box models for high stakes decisions. arXiv preprint arXiv:1811.10154.

Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy. Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2019. Generating token-level explanations for natural language inference. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 963–969, Minneapolis, Minnesota. Association for Computational Linguistics.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.