
Visualizing and Understanding Neural Models in NLP

Jiwei Li1, Xinlei Chen2, Eduard Hovy2 and Dan Jurafsky1

1 Computer Science Department, Stanford University, Stanford, CA 94305, USA
2 Language Technology Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA

{jiweil,jurafsky}@stanford.edu {xinleic,ehovy}@andrew.cmu.edu

Abstract

While neural networks have been successfully applied to many NLP tasks, the resulting vector-based models are very difficult to interpret. For example, it's not clear how they achieve compositionality, building sentence meaning from the meanings of words and phrases. In this paper we describe strategies for visualizing compositionality in neural models for NLP, inspired by similar work in computer vision. We first plot unit values to visualize compositionality of negation, intensification, and concessive clauses, allowing us to see well-known markedness asymmetries in negation. We then introduce methods for visualizing a unit's salience, the amount that it contributes to the final composed meaning, from first-order derivatives. Our general-purpose methods may have wide applications for understanding compositionality and other semantic properties of deep networks.

1 Introduction

Neural models match or outperform the performance of other state-of-the-art systems on a variety of NLP tasks. Yet unlike traditional feature-based classifiers that assign and optimize weights to varieties of human-interpretable features (parts of speech, named entities, word shapes, syntactic parse features, etc.), the behavior of deep learning models is much less easily interpreted. Deep learning models mainly operate on word embeddings (low-dimensional, continuous, real-valued vectors) through multi-layer neural architectures, each layer of which is characterized as an array of hidden neuron units. It is unclear how deep learning models deal with composition, implementing functions like negation or intensification, or combining meaning from different parts of the sentence, filtering away the informational chaff from the wheat, to build sentence meaning.

In this paper, we explore multiple strategies to interpret meaning composition in neural models. We employ traditional methods like representation plotting, and introduce simple strategies for measuring how much a neural unit contributes to meaning composition, its 'salience' or importance, using first derivatives.

The visualization techniques presented in this work shed important light on how neural models work. For example, we illustrate that the LSTM's success is due to its ability to maintain a much sharper focus on the important key words than other models; that composition in multiple clauses works competitively; that the models are able to capture negative asymmetry, an important property of semantic compositionality in natural language understanding; and that there is sharp dimensional locality, with certain dimensions marking negation and quantification in a surprisingly localist manner. Though our attempts only touch superficial points in neural models, and each method has its pros and cons, together they may offer some insights into the behaviors of neural models in language-based tasks, marking one initial step toward understanding how they achieve meaning composition in natural language processing.

The next section describes some visualization models in vision and NLP that have inspired this work. We describe the datasets and the adopted neural models in Section 3. Different visualization strategies and corresponding analytical results are presented separately in Sections 4, 5 and 6, followed by a brief conclusion.

2 A Brief Review of Neural Visualization

Similarity is commonly visualized graphically, generally by projecting the embedding space into two dimensions and observing that similar words tend to be clustered together (e.g., Elman (1989), Ji and Eisenstein (2014), Faruqui and Dyer (2014)). Karpathy et al. (2015) attempt to interpret recurrent neural models from a statistical point of view but do not deeply touch the compositionality of meanings. Other relevant attempts include (Fyshe et al., 2015; Faruqui et al., 2015).

Methods for interpreting and visualizing neural models have been much more significantly explored in vision, especially for Convolutional Neural Networks (CNNs or ConvNets) (Krizhevsky et al., 2012), multi-layer neural networks in which the original matrix of image pixels is convolved and pooled as it is passed on to hidden layers. ConvNet visualizing techniques consist mainly in mapping the different layers of the network (or other features like SIFT (Lowe, 2004) and HOG (Dalal and Triggs, 2005)) back to the initial image input, thus capturing the human-interpretable information they represent in the input, and how units in these layers contribute to any final decisions (Simonyan et al., 2013; Mahendran and Vedaldi, 2014; Nguyen et al., 2014; Szegedy et al., 2013; Girshick et al., 2014; Zeiler and Fergus, 2014). Such methods include:

(1) Inversion: Inverting the representations by training an additional model to project outputs from different neural levels back to the initial input images (Mahendran and Vedaldi, 2014; Vondrick et al., 2013; Weinzaepfel et al., 2011). The intuition behind reconstruction is that the pixels that are reconstructable from the current representations are the content of the representation. The inverting algorithms allow the current representation to align with corresponding parts of the original images.

(2) Back-propagation (Erhan et al., 2009; Simonyan et al., 2013) and Deconvolutional Networks (Zeiler and Fergus, 2014): Errors are back-propagated from output layers to each intermediate layer and finally to the original image inputs. Deconvolutional Networks work in a similar way by projecting outputs back to initial inputs layer by layer, each layer associated with one supervised model for projecting upper layers to lower ones. These strategies make it possible to spot active regions, or the regions that contribute the most to the final classification decision.

(3) Generation: This group of work generates images in a specific class from a sketch guided by already trained neural models (Szegedy et al., 2013; Nguyen et al., 2014). Models begin with an image whose pixels are randomly initialized and mutated at each step. The specific layers that are activated at different stages of image construction can help in interpretation.

While the above strategies inspire the work we present in this paper, there are fundamental differences between vision and NLP. In NLP words function as basic units, and hence (word) vectors rather than single pixels are the basic units. Sequences of words (e.g., phrases and sentences) are also presented in a more structured way than arrangements of pixels. In parallel to our research, independent research (Karpathy et al., 2015) has been conducted to explore a similar direction from an error-analysis point of view, by analyzing predictions and errors from recurrent neural models. Other distantly relevant works include the following: Murphy et al. (2012) and Fyshe et al. (2015) used a manual task to quantify the interpretability of semantic dimensions by presenting human users with a list of words and asking them to choose the one that does not belong to the list. A similar strategy is adopted in (Faruqui et al., 2015) by extracting the top-ranked words in each vector dimension.

3 Datasets and Neural Models

We explored two datasets on which neural models are trained, one of relatively small scale and the other of large scale.

3.1 Stanford Sentiment Treebank

Stanford Sentiment Treebank is a benchmark dataset widely used for neural model evaluations. The dataset contains gold-standard sentiment labels for every parse tree constituent, from sentences to phrases to individual words, for 215,154 phrases in 11,855 sentences. The task is to perform both fine-grained (very positive, positive, neutral, negative and very negative) and coarse-grained (positive vs. negative) classification at both the phrase and sentence level. For more details about the dataset, please refer to Socher et al. (2013).

While many studies on this dataset use recursive parse-tree models, in this work we employ only standard sequence models (RNNs and LSTMs), since these are the most widely used current neural models and sequential visualization is more straightforward. We therefore first transform each parse tree node to a sequence of tokens. The sequence is then mapped to a phrase/sentence representation and fed into a softmax classifier. Phrase/sentence representations are built with the following three models: standard recurrent sequence models with TANH activation functions, LSTMs, and Bidirectional LSTMs. For details about the three models, please refer to the Appendix.

Training AdaGrad with mini-batch was used for training, with parameters (L2 penalty, learning rate, mini-batch size) tuned on the development set. The number of iterations is treated as a variable to tune, and parameters are harvested based on the best performance on the dev set. The number of dimensions for the word and hidden layers is set to 60, with a 0.1 dropout rate. The standard recurrent model achieves 0.429 (fine-grained) and 0.850 (coarse-grained) accuracy at the sentence level; the LSTM achieves 0.469 and 0.870, and the Bidirectional LSTM 0.488 and 0.878, respectively.
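To make this pipeline concrete, here is a minimal sketch (not the authors' code) of how a token sequence could be composed into a single vector by a generic recurrent step and fed to a softmax classifier; the embedding table, the recurrent step function, and the classifier weights are placeholder assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify_sequence(tokens, emb, rnn_step, W_out, b_out):
    """Compose a token sequence into one vector, then apply a softmax classifier.

    tokens   : list of token strings from one parse-tree node
    emb      : dict mapping token -> word vector (e.g., 60-d)
    rnn_step : function (h_prev, e_t) -> h_t (plain RNN, LSTM, ...; see Appendix)
    W_out    : (num_classes x hidden_dim) classifier weights
    b_out    : (num_classes,) classifier bias
    """
    h = np.zeros(W_out.shape[1])             # initial hidden state
    for tok in tokens:
        h = rnn_step(h, emb[tok])            # h_t built from h_{t-1} and e_t
    return softmax(W_out @ h + b_out)        # distribution over sentiment labels
```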

3.2 Sequence-to-Sequence Models

SEQ2SEQ models are neural models aimed at generating a sequence of output text given an input. Theoretically, SEQ2SEQ models can be adapted to any NLP task that can be formalized as predicting outputs given inputs, and serve different purposes depending on those inputs and outputs: e.g., machine translation, where inputs correspond to source sentences and outputs to target sentences (Sutskever et al., 2014; Luong et al., 2014), or conversational response generation, where inputs correspond to messages and outputs correspond to responses (Vinyals and Le, 2015; Li et al., 2015). SEQ2SEQ models need to be trained on massive amounts of data for implicit semantic and syntactic relations between pairs to be learned.

SEQ2SEQ models map an input sequence to a vector representation using LSTM models and then sequentially predict tokens based on the pre-obtained representation. The model defines a distribution over outputs (Y) and sequentially predicts tokens given inputs (X) using a softmax function:

P(Y|X) = \prod_{t=1}^{n_y} p(y_t \mid x_1, x_2, \ldots, x_t, y_1, y_2, \ldots, y_{t-1})
       = \prod_{t=1}^{n_y} \frac{\exp(f(h_{t-1}, e_{y_t}))}{\sum_{y'} \exp(f(h_{t-1}, e_{y'}))}

where f(h_{t-1}, e_{y_t}) denotes the activation function between h_{t-1} and e_{y_t}, and h_{t-1} is the representation output from the LSTM at time t-1. For each time step in word prediction, SEQ2SEQ models combine the current token with previously built embeddings for next-step word prediction.
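As a rough illustration of this factorization (not the authors' implementation), the sketch below scores an output sequence token by token with a softmax over f(h_{t-1}, e_{y'}); the decoder states, output embedding matrix, and activation f are assumed to be given.

```python
import numpy as np

def sequence_log_prob(y_indices, h_states, E_out, f):
    """log P(Y|X) = sum_t log softmax_y'( f(h_{t-1}, e_{y'}) )[y_t].

    y_indices : target token ids y_1 .. y_{n_y}
    h_states  : decoder states, h_states[t] is h_{t-1} used to predict y_t
    E_out     : (vocab_size x dim) output embedding matrix, rows e_{y'}
    f         : activation function f(h, e) -> scalar score
    """
    logp = 0.0
    for t, y_t in enumerate(y_indices):
        scores = np.array([f(h_states[t], E_out[v]) for v in range(E_out.shape[0])])
        scores -= scores.max()                           # numerical stability
        logp += scores[y_t] - np.log(np.exp(scores).sum())
    return logp
```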

For easy visualization purposes, we turn to the most straightforward task, the autoencoder, where inputs and outputs are identical. The goal of an autoencoder is to reconstruct the inputs from the pre-obtained representation. We would like to see how individual input tokens affect the overall sentence representation and the prediction of each output token. We trained the autoencoder on a subset of the WMT'14 corpus containing 4 million English sentences with an average length of 22.5 words, following the training protocols described in (Sutskever et al., 2014).

4 Representation Plotting

We begin with simple plots of representations to shed light on local compositions, using the Stanford Sentiment Treebank.

Local Composition Figure 1 shows a 60d heat-map vector for the representation of selected words/phrases/sentences, with an emphasis on extent modifications (adverbial and adjectival) and negation. Embeddings for phrases or sentences are attained by composing word representations from the pretrained model.

The intensification part of Figure 1 shows suggestive patterns where values for a few dimensions are strengthened by modifiers like "a lot" (the red bar in the first example), "so much" (the red bar in the second example), and "incredibly". Though the patterns for negations are not as clear, there is still a consistent reversal for some dimensions, visible as a shift between blue and red for the dimensions boxed on the left.
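A plot like Figure 1 can be reproduced with standard tools; the sketch below is an illustrative assumption about the plotting setup (not the authors' script), drawing each 60-d phrase/sentence representation as one row of a matplotlib heatmap.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_representation_heatmap(labels, vectors):
    """Draw each representation (e.g., a 60-d vector) as one row of a heatmap."""
    data = np.vstack(vectors)                                 # (num_items x dim)
    fig, ax = plt.subplots(figsize=(10, 0.5 * len(labels) + 1))
    im = ax.imshow(data, aspect="auto", cmap="coolwarm")      # blue/red scale as in Figure 1
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    ax.set_xlabel("dimension")
    fig.colorbar(im, ax=ax)
    plt.show()
```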

We then visualize words and phrases using t-SNE (Van der Maaten and Hinton, 2008) in Figure 2, deliberately adding in some random words for comparative purposes. As can be seen, neural models nicely learn the properties of local compositionality, clustering negation+positive words ('not nice', 'not good') together with negative words. Note also the asymmetry of negation: "not bad" is clustered more with the negative than the positive words (as shown in both Figure 1 and Figure 2). This asymmetry has been widely discussed in linguistics, for example as arising from markedness, since 'good' is the unmarked direction of the scale (Clark and Clark, 1977; Horn, 1989; Fraenkel and Schul, 2008). This suggests that although the model does seem to focus on certain units for negation in Figure 1, the neural model is not just learning to apply a fixed transform for 'not' but is able to capture the subtle differences in the composition of different words.
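The t-SNE plots in Figures 2 and 4 can be produced along the following lines; this is a hedged sketch using scikit-learn rather than the authors' exact setup, and the perplexity value is an illustrative assumption.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(labels, vectors):
    """Project phrase/sentence representations to 2-D with t-SNE and label each point."""
    coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)
    plt.scatter(coords[:, 0], coords[:, 1], s=10)
    for (x, y), label in zip(coords, labels):
        plt.annotate(label, (x, y), fontsize=8)
    plt.show()
```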


Figure 2: t-SNE visualization of latent representations for modifications and negations.

Figure 4: t-SNE visualization of clause composition.

Concessive Sentences In concessive sentences, two clauses have opposite polarities, usually related by a contrary-to-expectation implicature. We plot evolving representations over time for two concessives in Figure 3. The plots suggest:

1. For tasks like sentiment analysis whose goal is to predict a specific semantic dimension (as opposed to general tasks like language model word prediction), too large a dimensionality leaves many dimensions non-functional (with values close to 0), causing two sentences of opposite sentiment to differ in only a few dimensions. This may explain why more dimensions don't necessarily lead to better performance on such tasks (for example, as reported in (Socher et al., 2013), optimal performance is achieved when word dimensionality is set to between 25 and 35).

2. Both sentences contain two clauses connected by the conjunction "though". Such two-clause sentences might either work collaboratively (models would remember the word "though" and make the second clause share the same sentiment orientation as the first) or competitively, with the stronger one dominating. The region within the dotted line in Figure 3(a) favors the second assumption: the difference between the two sentences is diluted when the final words ("interesting" and "boring") appear.

Clause Composition In Figure 4 we explore this clause composition in more detail. Representations move closer to the negative sentiment region when negative clauses like "although it had bad acting" or "but it is too long" are added to the end of the simply positive "I like the movie". By contrast, adding a concessive clause to a negative clause does not move it toward the positive; "I hate X but ..." is still very negative, not that different from "I hate X". This difference again suggests the model is able to capture negative asymmetry (Clark and Clark, 1977; Horn, 1989; Fraenkel and Schul, 2008).


Figure 5: Saliency heatmap for "I hate the movie .". Each row corresponds to the saliency scores for the corresponding word representation, with each grid cell representing one dimension.

Figure 6: Saliency heatmap for "I hate the movie I saw last night .".

Figure 7: Saliency heatmap for "I hate the movie though the plot is interesting .".

5 First-Derivative Saliency

In this section, we describe another strategy, which is inspired by the back-propagation strategy in vision (Erhan et al., 2009; Simonyan et al., 2013). It measures how much each input unit contributes to the final decision, which can be approximated by first derivatives.

Figure 1: Visualizing intensification and negation (panels: Intensification, Negation). Each vertical bar shows the value of one dimension in the final sentence/phrase representation after composition. Embeddings for phrases or sentences are attained by composing word representations from the pretrained model.

More formally, for a classification model, an input E is associated with a gold-standard class label c. (Depending on the NLP task, an input could be the embedding for a word or a sequence of words, while labels could be POS tags, sentiment labels, the next word index to predict, etc.) Given embeddings E for input words with the associated gold class label c, the trained model associates the pair (E, c) with a score S_c(E). The goal is to decide which units of E make the most significant contribution to S_c(E), and thus to the decision, the choice of class label c.

In the case of deep neural models, the class score S_c(e) is a highly non-linear function of e. We approximate S_c(e) with a linear function of e by computing the first-order Taylor expansion

S_c(e) \approx w(e)^{T} e + b    (1)

where w(e) is the derivative of S_c with respect to the embedding e:

w(e) = \frac{\partial S_c}{\partial e} \Big|_{e}    (2)

The magnitude (absolute value) of the derivative indicates the sensitivity of the final decision to a change in one particular dimension, telling us how much one specific dimension of the word embedding contributes to the final decision. The saliency score is given by

S(e) = |w(e)|    (3)
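As a concrete sketch of this computation, assuming a PyTorch-style model that maps a matrix of word embeddings to unnormalized class scores (the model itself is a placeholder, not the authors' code), the saliency map of Equation (3) can be obtained by back-propagating the gold class score to the embeddings:

```python
import torch

def saliency_map(model, embeddings, gold_class):
    """First-derivative saliency S(e) = |dS_c/de| per word-embedding dimension.

    embeddings : (seq_len x dim) tensor of word embeddings for one sentence
    model      : callable mapping embeddings -> 1-D tensor of class scores
    gold_class : index c of the gold-standard label
    """
    e = embeddings.clone().detach().requires_grad_(True)
    scores = model(e)                 # unnormalized class scores S(e)
    scores[gold_class].backward()     # back-propagate S_c only
    return e.grad.abs()               # (seq_len x dim) saliency heatmap
```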

5.1 Results on Stanford Sentiment Treebank

Figure 3: Representations over time from LSTMs. Each column corresponds to the output from the LSTM at each time step (representations obtained after combining the current word embedding with previously built embeddings). Each cell in a column corresponds to one dimension of the current time-step representation. The last rows correspond to the absolute differences between the two sequences at each time step.

We first illustrate results on the Stanford Sentiment Treebank. We plot in Figures 5, 6 and 7 the saliency scores (the absolute value of the derivative of the loss function with respect to each dimension of all word inputs) for three sentences, applying the trained model to each sentence. Each row corresponds to the saliency scores for the corresponding word representation, with each grid cell representing one dimension. The examples are based on the clear sentiment indicator "hate" that lends them all negative sentiment.

“I hate the movie” All three models assign high saliency to "hate" and dampen the influence of other tokens. The LSTM offers a clearer focus on "hate" than the standard recurrent model, but the bi-directional LSTM shows the clearest focus, attaching almost zero emphasis to words other than "hate". This is presumably due to the gate structures in LSTMs and Bi-LSTMs that control information flow, making these architectures better at filtering out less relevant information.


Figure 8: Variance visualization.

“I hate the movie that I saw last night” All three models assign the correct sentiment. The simple recurrent models again do poorly at filtering out irrelevant information, assigning too much salience to words unrelated to sentiment. However, none of the models suffer from vanishing-gradient problems despite this sentence being longer; the salience of "hate" still stands out after the 7-8 composition operations that follow.

“I hate the movie though the plot is interesting” The simple recurrent model emphasizes only the second clause "the plot is interesting", assigning no credit to the first clause "I hate the movie". This might seem to be caused by a vanishing gradient, yet the model correctly classifies the sentence as very negative, suggesting that it is successfully incorporating information from the first negative clause. We separately tested the individual clause "though the plot is interesting"; the standard recurrent model confidently labels it as positive. Thus despite the lower saliency scores for words in the first clause, the simple recurrent system manages to rely on that clause and downplay the information from the latter positive clause, despite the higher saliency scores of the later words. This illustrates a limitation of saliency visualization: first-order derivatives don't capture all the information we would like to visualize, perhaps because they are only a rough approximation to individual contributions and might not suffice to deal with highly non-linear cases. By contrast, the LSTM emphasizes the first clause, sharply dampening the influence from the second clause, while the Bi-LSTM focuses on both "hate the movie" and "plot is interesting".

5.2 Results on Sequence-to-Sequence Autoencoder

Figure 9 shows the saliency heatmap for the autoencoder in terms of predicting the corresponding token at each time step. We compute first derivatives for each preceding word through back-propagation as decoding goes on (a sketch of this per-step computation follows the observations below). Each grid cell corresponds to the magnitude of the average saliency value for each 1000-dimensional word vector. The heatmaps give a clear overview of the behavior of neural models during decoding. Observations can be summarized as follows:

1. For each time step of word prediction, SEQ2SEQ models manage to link the word to predict back to the corresponding region of the inputs (automatically learning alignments): e.g., the input region centering around the token "hate" exerts more impact when the token "hate" is to be predicted, and similarly for the tokens "movie", "plot" and "boring".

2. Neural decoding combines the previously built representation with the word predicted at the current step. As decoding proceeds, the influence of the initial input on decoding (i.e., tokens in the source sentence) gradually diminishes as more previously predicted words are encoded in the vector representations. Meanwhile, the influence of the language model gradually dominates: when the word "boring" is to be predicted, the models attach more weight to the earlier predicted tokens "plot" and "is" and less to the corresponding region of the inputs, i.e., the word "boring" in the input.

Figure 9: Saliency heatmap for the SEQ2SEQ autoencoder in terms of predicting the corresponding token at each time step.
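A per-step version of the same gradient computation could look like the sketch below (hypothetical encoder/decoder callables, not the trained SEQ2SEQ system itself): at each decoding step the score of the gold token is back-propagated to the source embeddings, and the gradient magnitudes are averaged over the embedding dimensions, as in Figure 9.

```python
import torch

def decoding_saliency(encoder, decoder, src_embeddings, target_ids):
    """One saliency row per decoding step: |d score(y_t) / d source embeddings|.

    encoder : maps (src_len x dim) embeddings -> encoder state
    decoder : maps (state, previous target ids) -> 1-D tensor of vocabulary scores
    """
    rows = []
    for t in range(len(target_ids)):
        e = src_embeddings.clone().detach().requires_grad_(True)
        state = encoder(e)                         # re-encode so gradients reach e
        scores = decoder(state, target_ids[:t])    # scores for predicting y_t
        scores[target_ids[t]].backward()           # gradient of the gold token's score
        rows.append(e.grad.abs().mean(dim=1))      # average over embedding dimensions
    return torch.stack(rows)                       # (num_target x num_source) heatmap
```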

6 Average and Variance

For settings where word embeddings are treated as parameters to optimize from scratch (as opposed to using pre-trained embeddings), we propose a second, surprisingly easy and direct way to visualize important indicators. We first compute the average of the word embeddings for all the words within the sentence. The measure of salience or influence for a word is its deviation from this average. The idea is that during training, models would learn to render indicators different from non-indicator words, enabling them to stand out even after many layers of computation.

Figure 8 shows a map of variance; each grid cell corresponds to the value of

\left\| e_{i,j} - \frac{1}{N_S} \sum_{i' \in N_S} e_{i',j} \right\|^2

where e_{i,j} denotes the value of the j-th dimension of word i and N_S denotes the number of tokens within the sentence.
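In code, this measure is a one-liner over the learned embeddings of a sentence; the sketch below is an illustrative numpy version (the shapes are assumptions).

```python
import numpy as np

def variance_salience(embeddings):
    """Per-cell deviation of each word embedding from the sentence average.

    embeddings : (num_tokens x dim) array of learned word embeddings e_{i,j}
    returns    : (num_tokens x dim) array of squared deviations, as plotted in Figure 8
    """
    mean = embeddings.mean(axis=0, keepdims=True)   # (1/N_S) * sum_{i'} e_{i',j}
    return (embeddings - mean) ** 2
```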

As the figure shows, the variance-based salience measure also does a good job of emphasizing the relevant sentiment words. The measure does have shortcomings: (1) it can only be used in scenarios where word embeddings are parameters to learn; (2) it is not clear how well it is able to visualize local compositionality.

7 Conclusion

In this paper, we offer several methods to help visualize and interpret neural models, to understand how neural models are able to compose meanings, demonstrating asymmetries of negation and explaining some aspects of the strong performance of LSTMs at these tasks.

Though our attempts only touch superficial points in neural models, and each method has its pros and cons, together they may offer some insights into the behaviors of neural models in language-based tasks, marking one initial step toward understanding how they achieve meaning composition in natural language processing. Our future work includes using the results of visualization to perform error analysis, and understanding the strengths and limitations of different neural models.

References

Herbert H. Clark and Eve V. Clark. 1977. Psychology and language: An introduction to psycholinguistics. Harcourt Brace Jovanovich.

Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE.

Jeffrey L. Elman. 1989. Representation and structure in connectionist models. Technical Report 8903, Center for Research in Language, University of California, San Diego.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2009. Visualizing higher-layer features of a deep network. Dept. IRO, Universite de Montreal, Tech. Rep.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL, volume 2014.

Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. 2015. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004.

Tamar Fraenkel and Yaacov Schul. 2008. The meaning of negated adjectives. Intercultural Pragmatics, 5(4):517–540.

Alona Fyshe, Leila Wehbe, Partha P. Talukdar, Brian Murphy, and Tom M. Mitchell. 2015. A compositional and interpretable semantic space. Proceedings of NAACL-HLT, Denver, USA.

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Laurence R. Horn. 1989. A natural history of negation, volume 960. University of Chicago Press, Chicago.

Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 13–24.

Andrej Karpathy, Justin Johnson, and Fei-Fei Li. 2015. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.

Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206.

Aravindh Mahendran and Andrea Vedaldi. 2014. Understanding deep image representations by inverting them. arXiv preprint arXiv:1412.0035.

Brian Murphy, Partha Pratim Talukdar, and Tom M. Mitchell. 2012. Learning effective and interpretable semantic models using non-negative sparse embedding. In COLING, pages 1933–1950.

Anh Nguyen, Jason Yosinski, and Jeff Clune. 2014. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. arXiv preprint arXiv:1412.1897.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11):2673–2681.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1631, page 1642. Citeseer.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, and Antonio Torralba. 2013. HOGgles: Visualizing object detection features. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1–8. IEEE.

Philippe Weinzaepfel, Herve Jegou, and Patrick Perez. 2011. Reconstructing an image from its local descriptors. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 337–344. IEEE.

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pages 818–833. Springer.

Appendix


Recurrent Models A recurrent network successively takes word w_t at step t, combines its vector representation e_t with the previously built hidden vector h_{t-1} from time t-1, calculates the resulting current embedding h_t, and passes it to the next step. The embedding h_t for the current time t is thus:

h_t = f(W \cdot h_{t-1} + V \cdot e_t)    (4)

where W and V denote compositional matrices. If N_S denotes the length of the sequence, h_{N_S} represents the whole sequence S and is used as input to a softmax function for classification tasks.
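A minimal numpy sketch of Equation (4) (parameter shapes and the tanh choice of f are assumptions for illustration):

```python
import numpy as np

def recurrent_step(h_prev, e_t, W, V):
    """h_t = f(W . h_{t-1} + V . e_t), with f = tanh  (Eq. 4)."""
    return np.tanh(W @ h_prev + V @ e_t)

def encode_sequence(embeddings, W, V):
    """Run the recurrence over a token sequence; h_{N_S} represents the whole sequence."""
    h = np.zeros(W.shape[0])
    for e_t in embeddings:
        h = recurrent_step(h, e_t, W, V)
    return h
```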

Multi-layer Recurrent Models Multi-layer recurrent models extend the one-layer recurrent structure by operating on a deep neural architecture that enables more expressivity and flexibility. The model associates each time step at each layer with a hidden representation h_{l,t}, where l ∈ [1, L] denotes the index of the layer and t denotes the index of the time step. h_{l,t} is given by:

h_{t,l} = f(W \cdot h_{t-1,l} + V \cdot h_{t,l-1})    (5)

where h_{t,0} = e_t, which is the original word embedding input at the current time step.

Long Short-Term Memory The LSTM model, first proposed in (Hochreiter and Schmidhuber, 1997), maps an input sequence to a fixed-sized vector by sequentially combining the current representation with the output representation of the previous step. The LSTM associates each time step with an input, control and memory gate, and tries to minimize the impact of unrelated information. i_t, f_t and o_t denote the gate states at time t, h_t denotes the hidden vector output from the LSTM at time t, and e_t denotes the word embedding input at time t. We have:

i_t = \sigma(W_i \cdot e_t + V_i \cdot h_{t-1})
f_t = \sigma(W_f \cdot e_t + V_f \cdot h_{t-1})
o_t = \sigma(W_o \cdot e_t + V_o \cdot h_{t-1})
l_t = \tanh(W_l \cdot e_t + V_l \cdot h_{t-1})
c_t = f_t \cdot c_{t-1} + i_t \times l_t
h_t = o_t \cdot c_t    (6)

where σ denotes the sigmoid function, i_t, f_t and o_t are gate values within the range [0, 1], and × denotes the element-wise (pairwise) product.
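A numpy sketch of one step of Equation (6), reading the output step as h_t = o_t · c_t (the gate matrices W and V are placeholder assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, e_t, W, V):
    """One LSTM step following Eq. (6); W and V are dicts of gate-specific matrices."""
    i_t = sigmoid(W["i"] @ e_t + V["i"] @ h_prev)   # input gate
    f_t = sigmoid(W["f"] @ e_t + V["f"] @ h_prev)   # forget gate
    o_t = sigmoid(W["o"] @ e_t + V["o"] @ h_prev)   # output gate
    l_t = np.tanh(W["l"] @ e_t + V["l"] @ h_prev)   # candidate update
    c_t = f_t * c_prev + i_t * l_t                  # memory cell
    h_t = o_t * c_t                                 # hidden output
    return h_t, c_t
```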

A multi-layer LSTM works in the same way as multi-layer recurrent models, by enabling multi-layer composition.

Bidirectional Models (Schuster and Paliwal, 1997) add bidirectionality to the recurrent framework, where embeddings for each time step are calculated both forwardly and backwardly:

h_t^{\rightarrow} = f(W^{\rightarrow} \cdot h_{t-1}^{\rightarrow} + V^{\rightarrow} \cdot e_t)
h_t^{\leftarrow} = f(W^{\leftarrow} \cdot h_{t+1}^{\leftarrow} + V^{\leftarrow} \cdot e_t)    (7)

Normally, bidirectional models feed the concatenation vector calculated from both directions, [e_1^{\leftarrow}, e_{N_S}^{\rightarrow}], to the classifier. Bidirectional models can be similarly extended to both the multi-layer neural model and the LSTM version.
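A sketch of the bidirectional composition in Equation (7), concatenating the final states of the two directions before classification (the step functions and hidden size are assumptions):

```python
import numpy as np

def bidirectional_encode(embeddings, step_fwd, step_bwd, hidden_dim):
    """Run the recurrence in both directions and concatenate the two final states.

    embeddings : list of word vectors for one sentence
    step_fwd   : forward step  (h_prev, e_t) -> h_t
    step_bwd   : backward step (h_next, e_t) -> h_t
    """
    h_fwd = np.zeros(hidden_dim)
    for e_t in embeddings:                      # left to right
        h_fwd = step_fwd(h_fwd, e_t)
    h_bwd = np.zeros(hidden_dim)
    for e_t in reversed(embeddings):            # right to left
        h_bwd = step_bwd(h_bwd, e_t)
    return np.concatenate([h_bwd, h_fwd])       # fed to the classifier
```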