
Pointing the Unknown Words
ACL 2016

Caglar Gulcehre, Université de Montréal
Sungjin Ahn, Université de Montréal
Ramesh Nallapati, IBM T.J. Watson Research
Bowen Zhou, IBM T.J. Watson Research
Yoshua Bengio, Université de Montréal, CIFAR Senior Fellow

Abstract

The problem of rare and unknown words is an important issue that can potentially affect the performance of many NLP systems, including both traditional count-based models and deep learning models. We propose a novel way to deal with rare and unseen words in neural network models using attention. Our model uses two softmax layers in order to predict the next word in conditional language models: one predicts the location of a word in the source sentence, and the other predicts a word in the shortlist vocabulary. At each time-step, the decision of which softmax layer to use is made adaptively by an MLP conditioned on the context. We motivate our work with psychological evidence that humans naturally have a tendency to point towards objects in the context or the environment when the name of an object is not known. Using our proposed model, we observe improvements on two tasks: neural machine translation on the Europarl English-to-French parallel corpora and text summarization on the Gigaword dataset.

1 Introduction

Words are the basic input/output units in most NLP systems, and thus the ability to cover a large number of words is a key to building a robust NLP system. However, considering that (i) the number of all words in a language, including named entities, is very large and that (ii) language itself is an evolving system (people create new words), this can be a challenging problem.

A common approach followed by recent neural-network-based NLP systems is to use a softmax output layer where each output dimension corresponds to a word in a predefined word shortlist. Because computing a high-dimensional softmax is computationally expensive, in practice the shortlist is limited to the top-K most frequent words in the training corpus. All other words are then replaced by a special word, called the unknown word (UNK).
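As a concrete illustration of the shortlist construction described above, here is a minimal sketch (not from the paper; the corpus, the value of K, and the token names are placeholder assumptions) of building a top-K shortlist and collapsing every other word onto UNK:

```python
from collections import Counter

def build_shortlist(corpus_tokens, k):
    """Keep the top-k most frequent words; everything else maps to UNK."""
    counts = Counter(corpus_tokens)
    shortlist = [w for w, _ in counts.most_common(k)]
    word2id = {"<unk>": 0}
    for w in shortlist:
        word2id[w] = len(word2id)
    return word2id

def encode(tokens, word2id):
    # Out-of-shortlist words collapse onto the single UNK id.
    return [word2id.get(w, word2id["<unk>"]) for w in tokens]

# Toy usage (hypothetical data):
corpus = "the cat sat on the mat the cat slept".split()
vocab = build_shortlist(corpus, k=3)
print(encode("the cat chased the dog".split(), vocab))
```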

The shortlist approach has two fundamental problems. The first problem, known as the rare word problem, is that some of the words in the shortlist occur so infrequently in the training set that it is difficult to learn good representations for them, resulting in poor performance. Second, it is obvious that we can lose some important information by mapping different words to a single dummy token UNK. Even if we have a very large shortlist including all unique words in the training set, it does not necessarily improve test performance, because there is still a chance of seeing an unknown word at test time. This is known as the unknown word problem. In addition, increasing the shortlist size mostly leads to more rare words due to Zipf's Law.

These two problems can be particularly critical in language understanding tasks such as factoid question answering (Bordes et al., 2015), where the words that we are interested in are often named entities, which are usually unknown or rare words.

In a similar situation, where we have limited information on what to call an object of interest, it seems that humans (and also some primates) have an efficient behavioral mechanism for drawing attention to the object: pointing (Matthews et al., 2012). Pointing makes it possible to deliver information and to associate context with a particular object without knowing what to call it. In particular, human infants use pointing as a fundamental communication tool (Tomasello et al., 2007).

In this paper, inspired by the pointing behavior of humans and by recent advances in the attention mechanism (Bahdanau et al., 2014) and the pointer networks (Vinyals et al., 2015), we propose a novel method to deal with the rare or unknown word problem. The basic idea is that many NLP problems can be seen as the task of predicting target text given context text, where some of the target words appear in the context as well. We observe that in this case we can make the model learn to point to a word in the context and copy it to the target text, as well as when to point. For example, in machine translation, we can see the source sentence as the context and the target sentence as what we need to predict. In Figure 1, we show an example of how words can be copied from source to target in machine translation. Although the source and target languages are different, many of the words, such as named entities, are usually represented by the same characters in both languages, making it possible to copy. Similarly, in text summarization, it is natural to use some words from the original text in the summarized text as well.

Specifically, to predict a target word at each time-step, our model first determines the source of the word generation, that is, whether to take one from a predefined shortlist or to copy one from the context. For the former, we apply the typical softmax operation, and for the latter, we use the attention mechanism to obtain the pointing softmax probability over the context words and pick the one with the highest probability. The model learns this decision so as to use pointing only when the context includes a word that can be copied to the target. This way, our model can predict even words which are not in the shortlist, as long as they appear in the context. Although some words still need to be labeled as UNK, i.e., when they are neither in the shortlist nor in the context, in experiments we show that learning when and where to point improves the performance in machine translation and text summarization.
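To make the two-softmax decision concrete, here is a minimal sketch of the idea described above. This is an illustrative re-implementation, not the authors' code; the layer shapes, the sigmoid switching network, and all parameter names are assumptions made for the example:

```python
import numpy as np

def softmax(s):
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

def pointer_softmax_step(h_t, c_t, attn_scores, W_short, b_short, w_switch, b_switch):
    """One decoding step of a pointer-softmax-style output layer (illustrative).

    h_t         : decoder hidden state, shape (d,)
    c_t         : context vector from attention, shape (d,)
    attn_scores : unnormalized attention scores over source positions, shape (T_src,)
    W_short     : shortlist projection, shape (V_short, 2*d)
    w_switch    : switching weights, shape (2*d,)
    Returns (p_shortlist, p_location, p_copy), where p_copy is the probability
    of copying from the source instead of generating from the shortlist.
    """
    z = np.concatenate([h_t, c_t])
    p_shortlist = softmax(W_short @ z + b_short)                # shortlist softmax
    p_location = softmax(attn_scores)                           # pointing softmax over source
    p_copy = 1.0 / (1.0 + np.exp(-(w_switch @ z + b_switch)))   # sigmoid "switch"
    return p_shortlist, p_location, p_copy

# Toy usage with random placeholder values:
d, V_short, T_src = 4, 7, 5
rng = np.random.default_rng(0)
p_gen, p_loc, p_copy = pointer_softmax_step(
    rng.normal(size=d), rng.normal(size=d), rng.normal(size=T_src),
    rng.normal(size=(V_short, 2 * d)), np.zeros(V_short),
    rng.normal(size=2 * d), 0.0)
# The final word is drawn from the shortlist with probability (1 - p_copy)
# and from the pointed-to source position with probability p_copy.
```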

The rest of the paper is organized as follows. In the next section, we review related work, including pointer networks and previous approaches to the rare/unknown word problem. In Section 3, we review neural machine translation with the attention mechanism, which is the baseline in our experiments. Then, in Section 4, we propose our method for dealing with the rare/unknown word problem, called the Pointer Softmax (PS). The experimental results are provided in Section 5, and we conclude our work in Section 6.

French: Guillaume et Cesar ont une voiture bleue a Lausanne.
English: Guillaume and Cesar have a blue car in Lausanne. (Copy, Copy, Copy)

Figure 1: An example of how copying can happen in machine translation. Common words that appear in both the source and the target can be copied directly from the source to the target. The remaining unknown words in the target can be copied from the input after being translated with a dictionary.

2 Related Work

The attention-based pointing mechanism was first introduced in the pointer networks (Vinyals et al., 2015). In pointer networks, the output space of the target sequence is constrained to the observations in the input sequence (not the input space). Instead of having a fixed-dimension softmax output layer, softmax outputs of varying dimension are dynamically computed for each input sequence in such a way as to maximize the attention probability of the target input. However, its applicability is rather limited because, unlike our model, there is no option to choose whether to point or not; it always points. In this sense, we can see pointer networks as a special case of our model where we always choose to point to a context word.

Several approaches have been proposed towards solving the rare/unknown word problem, which can be broadly divided into three categories. The first category of approaches focuses on improving the computation speed of the softmax output so that it can maintain a very large vocabulary. Because this only increases the shortlist size, it helps to mitigate the unknown word problem, but it still suffers from the rare word problem. The hierarchical softmax (Morin and Bengio, 2005), importance sampling (Bengio and Senecal, 2008; Jean et al., 2014), and noise contrastive estimation (Gutmann and Hyvarinen, 2012; Mnih and Kavukcuoglu, 2013) methods are in this class.

The second category, to which our proposed method also belongs, uses information from the context. Notable works are (Luong et al., 2015) and (Hermann et al., 2015). In particular, applied to the machine translation task, (Luong et al., 2015) learns to point to some words in the source sentence and copy them to the target sentence, similarly to our method. However, it does not use an attention mechanism, and by having a fixed-size softmax output over the relative pointing range (e.g., -7, ..., -1, 0, 1, ..., 7), their model (the Positional All model) is limited when applied to more general problems such as summarization and question answering, where, unlike machine translation, the length of the context and the pointing locations in the context can vary dramatically. In the question answering setting, (Hermann et al., 2015) have used placeholders for named entities in the context. However, the placeholder id is directly predicted in the softmax output rather than predicting its location in the context.

[Slides: background on the standard softmax output layer]

Example: an RNN language model reads "John killed a man yesterday ." and, at each position, predicts the next word, producing "killed a man yesterday . [eos]".

At time-step t, an RNN over the previously seen words produces a hidden state h_t, and the distribution p_t over the next word w_t is a softmax over the vocabulary:

p_t = softmax(W_hp h_t + b_hp),    (16)
h_t = RNN_{t' < t}(x_{w_t'}),    (17)

where the softmax over a score vector s is defined as

softmax(s)_i = exp(s_i) / Σ_{s_j ∈ s} exp(s_j).    (18)

With W_hp ∈ R^{V×N} (V the vocabulary size, N the hidden dimension), and writing W_hp(w) and b_hp(w) for the row and bias of word w, the probability of an individual word w under this softmax, with normalization constant Z, is

p_t(w) = exp(W_hp(w)^T h_t + b_hp(w)) / Z,    (20)
Z = Σ_{w'} exp(W_hp(w')^T h_t + b_hp(w')).    (21)

The slides also list two affine transformations of a vector d_e:

d'_e = W_dy d_e + b_dy,    (22)
d''_e = W_dx d_e + b_dx.    (23)

[Slide: Pointer Softmax — instead of a single softmax over the vocabulary V, the output layer combines a shortlist softmax with a location softmax over the T context words, as proposed in Section 4 of the paper.]
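As a quick numerical check of Eqs. (16)-(21) above, the following minimal numpy sketch computes the full-vocabulary softmax and the per-word probability; the shapes V and N and all parameter values are toy placeholders, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                      # vocabulary size, hidden dimension (toy values)
W_hp = rng.normal(size=(V, N))    # output projection, one row per word
b_hp = np.zeros(V)
h_t = rng.normal(size=N)          # decoder hidden state at step t

scores = W_hp @ h_t + b_hp        # Eq. (16): pre-softmax scores
p_t = np.exp(scores - scores.max())
p_t /= p_t.sum()                  # Eq. (18): softmax normalization

# Eqs. (20)-(21): probability of one word w, computed explicitly
w = 3
Z = np.exp(scores - scores.max()).sum()
p_t_w = np.exp(scores[w] - scores.max()) / Z
assert np.isclose(p_t_w, p_t[w])
```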

The third category of approaches changes the unit of input/output itself from words to a smaller resolution, such as characters (Graves, 2013) or bytecodes (Sennrich et al., 2015; Gillick et al., 2015). Although this approach has the main advantage that it could suffer less from the rare/unknown word problem, training usually becomes much harder because the length of the sequences increases significantly.

Simultaneously with our work, (Gu et al., 2016) and (Cheng and Lapata, 2016) proposed models that learn to copy from source to target; both papers analyzed their models on summarization tasks.

3 Neural Machine Translation Model with Attention

As the baseline neural machine translation system, we use the model proposed by (Bahdanau et al., 2014) that learns to (soft-)align and translate jointly. We refer to this model as NMT.

The encoder of the NMT is a bidirectional RNN (Schuster and Paliwal, 1997). The forward RNN reads the input sequence x = (x_1, ..., x_T) in the left-to-right direction, resulting in a sequence of hidden states (→h_1, ..., →h_T). The backward RNN reads x in the reversed direction and outputs (←h_1, ..., ←h_T). We then concatenate the hidden states of the forward and backward RNNs at each time step and obtain a sequence of annotation vectors (h_1, ..., h_T), where h_j = [→h_j || ←h_j]. Here, || denotes the concatenation operator. Thus, each annotation vector h_j encodes information about the j-th word with respect to all the other surrounding words in both directions.
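A minimal sketch of the annotation construction just described, assuming the forward and backward hidden states have already been computed (shapes are toy placeholders):

```python
import numpy as np

T, d = 5, 3                       # source length, per-direction hidden size (toy)
h_fwd = np.random.randn(T, d)     # forward RNN states  ->h_1 ... ->h_T
h_bwd = np.random.randn(T, d)     # backward RNN states <-h_1 ... <-h_T

# Annotation vectors: h_j = [->h_j || <-h_j], one per source word.
annotations = np.concatenate([h_fwd, h_bwd], axis=1)   # shape (T, 2*d)
```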

In the decoder, we use a gated recurrent unit (GRU) (Cho et al., 2014; Chung et al., 2014). Specifically, at each time-step t, the soft-alignment mechanism first computes the relevance weight e_tj, which determines the contribution of annotation vector h_j to the t-th target word. We use a non-linear mapping f (e.g., an MLP) which takes h_j, the previous decoder hidden state s_{t-1}, and the previous output y_{t-1} as input:

e_tj = f(s_{t-1}, h_j, y_{t-1}).

The outputs e_tj are then normalized as follows:

l_tj = exp(e_tj) / Σ_{k=1}^{T} exp(e_tk).    (1)

We call l_tj the relevance score, or the alignment weight, of the j-th annotation vector. The relevance scores are used to compute the context vector c_t of the t-th target word in the translation:

c_t = Σ_{j=1}^{T} l_tj h_j,
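A minimal numpy sketch of the soft-alignment step above; a dot product stands in for the learned scorer f purely for brevity (the model uses an MLP), and all values are toy placeholders:

```python
import numpy as np

def soft_attention(s_prev, annotations):
    """Alignment weights l_t (Eq. (1)) and the context vector c_t.

    s_prev      : stand-in for the previous decoder state s_{t-1}, shape (d,)
    annotations : encoder annotation vectors h_1 ... h_T, shape (T, d)
    """
    scores = annotations @ s_prev                      # e_tj for j = 1..T
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()    # l_tj, Eq. (1)
    context = weights @ annotations                    # c_t = sum_j l_tj * h_j
    return weights, context

T, d = 5, 6
annotations = np.random.randn(T, d)
s_prev = np.random.randn(d)
l_t, c_t = soft_attention(s_prev, annotations)
```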

The hidden state of the decoder s_t is computed based on the previous hidden state s_{t-1}, the context vector c_t, and the output word of the previous time-step y_{t-1}:

s_t = f_r(s_{t-1}, y_{t-1}, c_t),    (2)

where f_r is a GRU. We use a deep output layer (Pascanu et al., 2013) to compute the conditional distribution over words:

p(y_t = a | y_{<t}, x) ∝ exp( ψ_a(W_o, b_o) f_o(s_t, y_{t-1}, c_t) ),    (3)

where W_o is a learned weight matrix and b_o is the bias of the output layer, f_o is a single-layer feed-forward neural network, ψ_{(W_o, b_o)}(·) is a function that performs an affine transformation on its input, and the superscript a in ψ_a indicates the a-th column vector of ψ.

The whole model, including both the encoder and the decoder, is jointly trained to maximize the (conditional) log-likelihood of target sequences given input sequences, where the training corpus is a set of (x_n, y_n) pairs. Figure 2 illustrates the architecture of the NMT.
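As an illustration of the training objective just stated, here is a minimal sketch of the per-sequence negative conditional log-likelihood; the probability array and the data are placeholders, not the paper's implementation:

```python
import numpy as np

def sequence_nll(step_probs, target_ids):
    """Negative conditional log-likelihood of one target sequence.

    step_probs : array of shape (T_y, V); step_probs[t] is the model's
                 distribution p(y_t | y_<t, x) at decoding step t.
    target_ids : length-T_y array of gold word ids.
    """
    eps = 1e-12
    picked = step_probs[np.arange(len(target_ids)), target_ids]
    return -np.sum(np.log(picked + eps))

# Training maximizes the log-likelihood, i.e. minimizes the sum of
# sequence_nll over all (x_n, y_n) pairs in the corpus.
```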

[Slide: excerpt from Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate" (ICLR 2015), shown as the baseline; the time index t used above corresponds to i below.] [Bahdanau+15]

The decoder defines a probability over the translation y by decomposing the joint probability into ordered conditionals:

p(y) = Π_{t=1}^{T} p(y_t | {y_1, ..., y_{t-1}}, c),    (2)

where y = (y_1, ..., y_{T_y}). With an RNN, each conditional probability is modeled as

p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c),    (3)

where g is a nonlinear, potentially multi-layered function that outputs the probability of y_t, and s_t is the hidden state of the RNN.

In the attention-based architecture, each conditional probability of Eq. (2) is instead defined as

p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i),    (4)

where s_i = f(s_{i-1}, y_{i-1}, c_i) is the decoder hidden state for time i. Unlike the basic encoder-decoder approach, the probability is conditioned on a distinct context vector c_i for each target word y_i. The context vector c_i is a weighted sum of the annotations (h_1, ..., h_{T_x}) produced by the encoder:

c_i = Σ_{j=1}^{T_x} α_ij h_j,    (5)

α_ij = exp(e_ij) / Σ_{k=1}^{T_x} exp(e_ik),  where  e_ij = a(s_{i-1}, h_j),    (6)

and the alignment model a, a feedforward neural network trained jointly with all the other components, scores how well the inputs around position j and the output at position i match.

Figure 1 (Bahdanau et al., 2015): graphical illustration of the model generating the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T), with annotations h_1, ..., h_T weighted by the attention weights α_{t,1}, ..., α_{t,T} and fed to the decoder states s_{t-1}, s_t.


Published as a conference paper at ICLR 2015

The decoder is often trained to predict the next word y

t

0 given the context vector c and all thepreviously predicted words {y1, · · · , yt0�1}. In other words, the decoder defines a probability overthe translation y by decomposing the joint probability into the ordered conditionals:

p(y) =

TY

t=1

p(y

t

| {y1, · · · , yt�1} , c), (2)

where y =

�y1, · · · , yT

y

�. With an RNN, each conditional probability is modeled as

p(y

t

| {y1, · · · , yt�1} , c) = g(y

t�1, st, c), (3)

where g is a nonlinear, potentially multi-layered, function that outputs the probability of yt

, and s

t

isthe hidden state of the RNN. It should be noted that other architectures such as a hybrid of an RNNand a de-convolutional neural network can be used (Kalchbrenner and Blunsom, 2013).

3 LEARNING TO ALIGN AND TRANSLATE

In this section, we propose a novel architecture for neural machine translation. The new architectureconsists of a bidirectional RNN as an encoder (Sec. 3.2) and a decoder that emulates searchingthrough a source sentence during decoding a translation (Sec. 3.1).

3.1 DECODER: GENERAL DESCRIPTION

yt-1

Figure 1: The graphical illus-tration of the proposed modeltrying to generate the t-th tar-get word y

t

given a sourcesentence (x1, x2, . . . , xT

).

In a new model architecture, we define each conditional probabilityin Eq. (2) as:

p(y

i

|y1, . . . , yi�1,x) = g(y

i�1, si, ci), (4)

where s

i

is an RNN hidden state for time i, computed by

s

i

= f(s

i�1, yi�1, ci).

It should be noted that unlike the existing encoder–decoder ap-proach (see Eq. (2)), here the probability is conditioned on a distinctcontext vector c

i

for each target word y

i

.

The context vector c

i

depends on a sequence of annotations(h1, · · · , hT

x

) to which an encoder maps the input sentence. Eachannotation h

i

contains information about the whole input sequencewith a strong focus on the parts surrounding the i-th word of theinput sequence. We explain in detail how the annotations are com-puted in the next section.

The context vector ci

is, then, computed as a weighted sum of theseannotations h

i

:

c

i

=

T

xX

j=1

ij

h

j

. (5)

The weight ↵ij

of each annotation h

j

is computed by

ij

=

exp (e

ij

)

PT

x

k=1 exp (eik), (6)

wheree

ij

= a(s

i�1, hj

)

is an alignment model which scores how well the inputs around position j and the output at positioni match. The score is based on the RNN hidden state s

i�1 (just before emitting y

i

, Eq. (4)) and thej-th annotation h

j

of the input sentence.

We parametrize the alignment model a as a feedforward neural network which is jointly trained withall the other components of the proposed system. Note that unlike in traditional machine translation,

3

ptpt-1

c.f. http://www.slideshare.net/yutakikuchi927/deep-learning-nlp-attention


to our method. However, it does not use an attention mechanism, and by having a fixed-sized softmax output over the relative pointing range (e.g., -7, ..., -1, 0, 1, ..., 7), their model (the Positional All model) has a limitation in applying to more general problems such as summarization and question answering, where, unlike machine translation, the length of the context and the pointing locations in the context can vary dramatically. In the question answering setting, (Hermann et al., 2015) have used placeholders on named entities in the context. However, the placeholder id is directly predicted in the softmax output rather than predicting its location in the context.

The third category of the approaches changes the unit of input/output itself from words to a smaller resolution such as characters (Graves, 2013) or bytecodes (Sennrich et al., 2015; Gillick et al., 2015). Although this approach has the main advantage that it could suffer less from the rare/unknown word problem, the training usually becomes much harder because the length of sequences significantly increases.

Simultaneously with our work, (Gu et al., 2016) and (Cheng and Lapata, 2016) proposed models that learn to copy from source to target, and both papers analyzed their models on summarization tasks.

3 Neural Machine Translation Model with Attention

As the baseline neural machine translation system, we use the model proposed by (Bahdanau et al., 2014) that learns to (soft-)align and translate jointly. We refer to this model as NMT.

The encoder of the NMT is a bidirectional RNN (Schuster and Paliwal, 1997). The forward RNN reads the input sequence x = (x_1, ..., x_T) in left-to-right direction, resulting in a sequence of hidden states (→h_1, ..., →h_T). The backward RNN reads x in the reversed direction and outputs (←h_1, ..., ←h_T). We then concatenate the hidden states of the forward and backward RNNs at each time step and obtain a sequence of annotation vectors (h_1, ..., h_T) where h_j = [→h_j || ←h_j]. Here, || denotes the concatenation operator. Thus, each annotation vector h_j encodes information about the j-th word with respect to all the other surrounding words in both directions.
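As a concrete illustration of the annotation vectors, the sketch below runs a forward and a backward pass of a simple tanh recurrent cell over a sequence of word embeddings and concatenates the two hidden states at every position. The papers use GRU cells; the plain tanh cell, parameter names and sizes here are simplifying assumptions for illustration only.

```python
import numpy as np

def rnn_pass(X, W_x, W_h, b):
    """Run a simple tanh RNN over X (T, d_emb) and return all hidden states (T, n)."""
    T, n = X.shape[0], W_h.shape[0]
    h = np.zeros(n)
    states = []
    for t in range(T):
        h = np.tanh(X[t] @ W_x.T + h @ W_h.T + b)
        states.append(h)
    return np.stack(states)

def bidirectional_annotations(X, fwd_params, bwd_params):
    """Annotation h_j = [forward state ; backward state] at position j."""
    H_fwd = rnn_pass(X, *fwd_params)               # reads x_1 .. x_T
    H_bwd = rnn_pass(X[::-1], *bwd_params)[::-1]   # reads x_T .. x_1, then re-align
    return np.concatenate([H_fwd, H_bwd], axis=1)  # (T, 2n)

# toy usage with random embeddings and parameters
rng = np.random.default_rng(1)
T, d_emb, n = 5, 8, 4
X = rng.normal(size=(T, d_emb))
make = lambda: (rng.normal(size=(n, d_emb)), rng.normal(size=(n, n)), np.zeros(n))
H = bidirectional_annotations(X, make(), make())
print(H.shape)  # (5, 8): one 2n-dimensional annotation per source word
```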

In the decoder, we usually use a gated recurrent unit (GRU) (Cho et al., 2014; Chung et al., 2014). Specifically, at each time-step t, the soft-alignment mechanism first computes the relevance weight e_tj which determines the contribution of annotation vector h_j to the t-th target word. We use a non-linear mapping f (e.g., an MLP) which takes h_j, the previous decoder's hidden state s_{t-1} and the previous output y_{t-1} as input:

e_tj = f(s_{t-1}, h_j, y_{t-1}).

The outputs e_tj are then normalized as follows:

l_tj = exp(e_tj) / Σ_{k=1}^{T} exp(e_tk).   (1)

We call l_tj the relevance score, or the alignment weight, of the j-th annotation vector.

The relevance scores are used to get the context vector c_t of the t-th target word in the translation:

c_t = Σ_{j=1}^{T} l_tj h_j.

The hidden state of the decoder s_t is computed based on the previous hidden state s_{t-1}, the context vector c_t and the output word of the previous time-step y_{t-1}:

s_t = f_r(s_{t-1}, y_{t-1}, c_t),   (2)

where f_r is a GRU.

We use a deep output layer (Pascanu et al., 2013) to compute the conditional distribution over words:

p(y_t = a | y_<t, x) ∝ exp(ψ_a(W_o, b_o) f_o(s_t, y_{t-1}, c_t)),   (3)

where W_o is a learned weight matrix and b_o is a bias of the output layer. f_o is a single-layer feed-forward neural network, ψ_(W_o, b_o)(·) is a function that performs an affine transformation on its input, and the superscript a in ψ_a indicates the a-th column vector of ψ.

The whole model, including both the encoder and the decoder, is jointly trained to maximize the (conditional) log-likelihood of target sequences given input sequences, where the training corpus is a set of (x_n, y_n) pairs. Figure 2 illustrates the architecture of the NMT.
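That training criterion is just the sum of per-step log-probabilities of the reference words. A sketch of the (conditional) negative log-likelihood for one sequence, with illustrative names:

```python
import numpy as np

def sequence_nll(step_distributions, target_ids):
    """Negative conditional log-likelihood of one target sequence.

    step_distributions: list of arrays, p(y_t | y_<t, x) over the vocabulary at each step
    target_ids:         the reference word index at each step
    """
    return -sum(np.log(p[y] + 1e-12) for p, y in zip(step_distributions, target_ids))

# toy usage: two steps over a 5-word vocabulary
probs = [np.array([0.1, 0.6, 0.1, 0.1, 0.1]), np.array([0.2, 0.2, 0.5, 0.05, 0.05])]
print(sequence_nll(probs, target_ids=[1, 2]))  # minimized (jointly over the corpus) during training
```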

(Note: the time index t in the equations above corresponds to the index i used in Eqs. (4)-(6) of the excerpt.)



Next-word prediction example: input "John killed a man yesterday ." → target "killed a man yesterday . [eos]"

An RNN language model predicts the next word w_t from the hidden state h_t of a forward RNN run over the previous words:

p_t = softmax(W_hp h_t + b_hp)   (16)
h_t = →RNN_{t' ≺ t}(x_{w_t'})   (17)
softmax(s)_i = exp(s_i) / Σ_{s_j ∈ s} exp(s_j)   (18)

Written out for a single word w, with W_hp ∈ R^{V×N} (V: vocabulary size, N: hidden-state size), W_hp(w) and b_hp(w) the row of W_hp and the entry of b_hp for w, and Z the softmax normalizer:

p_t(w) = exp(W_hp(w)^T h_t + b_hp(w)) / Z   (20)
Z = Σ_{w'} exp(W_hp(w')^T h_t + b_hp(w'))   (21)

Affine transformations of the embedding d_e:

d'_e = W_dy d_e + b_dy   (22)
d''_e = W_dx d_e + b_dx   (23)
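The cost of Eqs. (20)-(21) is what motivates the shortlist: the normalizer Z sums over every vocabulary word, so each prediction costs O(V·N). A minimal sketch of that full-vocabulary softmax, with illustrative names and sizes:

```python
import numpy as np

def word_probability(w, h_t, W_hp, b_hp):
    """p_t(w) from Eqs. (20)-(21): score every vocabulary word to normalize."""
    logits = W_hp @ h_t + b_hp          # (V,): one score per word, the expensive part
    logits = logits - logits.max()      # numerical stability
    Z = np.exp(logits).sum()            # Eq. (21): sum over the whole vocabulary
    return np.exp(logits[w]) / Z        # Eq. (20)

rng = np.random.default_rng(3)
V, N = 30_000, 128                      # a modest shortlist; full vocabularies are far larger
W_hp, b_hp = rng.normal(size=(V, N)) * 0.01, np.zeros(V)
h_t = rng.normal(size=N)
print(word_probability(7, h_t, W_hp, b_hp))
```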


ior of humans and recent advances in the attention mechanism (Bahdanau et al., 2014) and the pointer networks (Vinyals et al., 2015), we propose a novel method to deal with the rare or unknown word problem. The basic idea is that we can see many NLP problems as a task of predicting target text given context text, where some of the target words appear in the context as well. We observe that in this case we can make the model learn to point to a word in the context and copy it to the target text, as well as when to point. For example, in machine translation, we can see the source sentence as the context, and the target sentence as what we need to predict. In Figure 1, we show an example depiction of how words can be copied from source to target in machine translation. Although the source and target languages are different, many of the words such as named entities are usually represented by the same characters in both languages, making it possible to copy. Similarly, in text summarization, it is natural to use some words in the original text in the summarized text as well.

Specifically, to predict a target word at each timestep, our model first determines the source of the word generation, that is, whether to take one from a predefined shortlist or to copy one from the context. For the former, we apply the typical softmax operation, and for the latter, we use the attention mechanism to obtain the pointing softmax probability over the context words and pick the one of high probability. The model learns this decision so as to use the pointing only when the context includes a word that can be copied to the target. This way, our model can predict even the words which are not in the shortlist, as long as they appear in the context. Although some of the words still need to be labeled as UNK, i.e., if a word is neither in the shortlist nor in the context, in experiments we show that learning when and where to point improves the performance in machine translation and text summarization.
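A minimal sketch of that decision, under simplifying assumptions: a switching probability p_switch (produced by a small network in the full model) mixes a shortlist softmax with the attention distribution over source positions, so the final prediction is either a shortlist word or a source location to copy from. The names here are illustrative, not the paper's exact parametrization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_softmax(shortlist_logits, attention_scores, p_switch):
    """Combine a shortlist softmax and a location (pointer) softmax.

    p_switch: probability of generating from the shortlist (1 - p_switch: copy).
    Returns one joint distribution over [shortlist words] + [source positions].
    """
    p_vocab = softmax(shortlist_logits)   # distribution over shortlist words
    p_loc = softmax(attention_scores)     # distribution over source positions
    return np.concatenate([p_switch * p_vocab, (1.0 - p_switch) * p_loc])

# toy usage: 10 shortlist words, 6 source positions
rng = np.random.default_rng(4)
V = 10
joint = pointer_softmax(rng.normal(size=V), rng.normal(size=6), p_switch=0.3)
best = joint.argmax()
if best >= V:
    print("copy source position", best - V)
else:
    print("emit shortlist word", best)
print(round(joint.sum(), 6))  # 1.0: still a valid distribution
```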

The rest of the paper is organized as follows. In the next section, we review the related work, including pointer networks and previous approaches to the rare/unknown word problem. In Section 3, we review neural machine translation with an attention mechanism, which is the baseline in our experiments. Then, in Section 4, we propose our method for dealing with the rare/unknown word problem, called the Pointer Softmax (PS).

French:  Guillaume et Cesar ont une voiture bleue a Lausanne.
English: Guillaume and Cesar have a blue car in Lausanne.
(In the original figure, "Copy" arrows mark the target words copied directly from the source.)

Figure 1: An example of how copying can happen for machine translation. Common words that appear both in the source and the target can directly be copied from the input to the output. The rest of the unknown words in the target can be copied from the input after being translated with a dictionary.

The experimental results are provided in Section 5 and we conclude our work in Section 6.

2 Related Work

The attention-based pointing mechanism was first introduced in the pointer networks (Vinyals et al., 2015). In the pointer networks, the output space of the target sequence is constrained to be the observations in the input sequence (not the input space). Instead of having a fixed-dimension softmax output layer, softmax outputs of varying dimension are dynamically computed for each input sequence in such a way as to maximize the attention probability of the target input. However, its applicability is rather limited because, unlike our model, there is no option to choose whether to point or not; it always points. In this sense, we can see the pointer networks as a special case of our model where we always choose to point to a context word.

Several approaches have been proposed towards solving the rare words/unknown words problem, which can be broadly divided into three categories. The first category of the approaches focuses on improving the computation speed of the softmax output so that it can maintain a very large vocabulary. Because this only increases the shortlist size, it helps to mitigate the unknown word problem, but still suffers from the rare word problem. The hierarchical softmax (Morin and Bengio, 2005), importance sampling (Bengio and Senecal, 2008; Jean et al., 2014), and the noise contrastive estimation (Gutmann and Hyvarinen, 2012; Mnih and Kavukcuoglu, 2013) methods are in this class.

The second category, to which our proposed method also belongs, uses information from the context. Notable works are (Luong et al., 2015) and (Hermann et al., 2015). In particular, applied to the machine translation task, (Luong et al., 2015) learns to point to some words in the source sentence and copy them to the target sentence, similarly


Table 4: Generated summaries from NMT with PS. Boldface words are the words copied from the source.

Source #1: china 's tang gonghong set a world record with a clean and jerk lift of ### kilograms to win the women 's over-## kilogram weightlifting title at the asian games on tuesday .
Target #1: china 's tang <unk> , sets world weightlifting record
NMT+PS #1: china 's tang gonghong wins women 's weightlifting weightlifting title at asian games

Source #2: owing to criticism , nbc said on wednesday that it was ending a three-month-old experiment that would have brought the first liquor advertisements onto national broadcast network television .
Target #2: advertising : nbc retreats from liquor commercials
NMT+PS #2: nbc says it is ending a three-month-old experiment

Source #3: a senior trade union official here wednesday called on ghana 's government to be " mindful of the plight " of the ordinary people in the country in its decisions on tax increases .
Target #3: tuc official , on behalf of ordinary ghanaians
NMT+PS #3: ghana 's government urged to be mindful of the plight

If the word is not in the shortlist vocabulary, we first check if the same word y_t appears in the source sentence. If it does not, we then check if a translated version of the word exists in the source sentence by using a look-up table between the source and the target language. If the word is in the source sentence, we then use the location of the word in the source as the target. Otherwise we check if one of the English senses from the cross-language dictionary of the French word is in the source. If it is in the source sentence, then we use the location of that word as our translation. Otherwise we just use the argmax of l_t as the target.
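A sketch of that training-time heuristic for choosing the supervision target, under assumptions: source_words is the tokenized source sentence, fr2en stands in for both the look-up table and the cross-language dictionary (collapsed into one mapping for brevity), and l_t is the attention/relevance vector. All names are illustrative.

```python
import numpy as np

def pointer_target(word, source_words, shortlist, fr2en, l_t):
    """Decide the training target for one output word.

    Prefer pointing when the word (or a dictionary translation of it)
    occurs in the source; otherwise fall back to the attention argmax.
    """
    if word in shortlist:
        return ("shortlist", word)
    if word in source_words:                      # same surface form in the source
        return ("copy", source_words.index(word))
    for sense in fr2en.get(word, ()):             # translated version found via the dictionary
        if sense in source_words:
            return ("copy", source_words.index(sense))
    return ("copy", int(np.argmax(l_t)))          # last resort: most attended position

# toy usage
source = "guillaume and cesar have a blue car in lausanne .".split()
l_t = np.random.default_rng(5).random(len(source))
print(pointer_target("lausanne", source, shortlist={"a", "car", "have"},
                     fr2en={"bleue": {"blue"}}, l_t=l_t))
```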

For the switching network d_t, we observed that using a two-layered MLP with the noisy-tanh activation (Gulcehre et al., 2016) and a residual connection from the lower layer (He et al., 2015) to the upper hidden layers improves the BLEU score by about 1 point over d_t with the ReLU activation function. We initialized the biases of the last sigmoid layer of d_t to -1, such that d_t is biased toward choosing the shortlist vocabulary at the beginning of training. We renormalize the gradients if the norm of the gradients exceeds 1 (Pascanu et al., 2012).
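A sketch of such a switching network: a two-layer MLP over the decoder state and context vector whose sigmoid output gives the probability of using the shortlist softmax, with the final bias initialized to -1 so that the shortlist is preferred early in training. A plain tanh stands in for the noisy-tanh activation, and the residual connection and shapes are simplifying assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SwitchingNetwork:
    """Two-layer MLP d_t deciding between shortlist softmax and pointing."""

    def __init__(self, d_in, d_hidden, rng):
        self.W1 = rng.normal(scale=0.1, size=(d_hidden, d_in))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
        self.b2 = np.zeros(d_hidden)
        self.w_out = rng.normal(scale=0.1, size=d_hidden)
        self.b_out = -1.0   # bias the sigmoid toward the shortlist at the start of training

    def __call__(self, s_t, c_t):
        x = np.concatenate([s_t, c_t])
        h1 = np.tanh(self.W1 @ x + self.b1)           # tanh in place of noisy-tanh
        h2 = np.tanh(self.W2 @ h1 + self.b2) + h1     # residual connection from the lower layer
        return sigmoid(self.w_out @ h2 + self.b_out)  # p(use shortlist softmax)

rng = np.random.default_rng(6)
d_t = SwitchingNetwork(d_in=12, d_hidden=8, rng=rng)
print(d_t(rng.normal(size=4), rng.normal(size=8)))    # close to sigmoid(-1) ~ 0.27 early on
```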

Table 5: Europarl Dataset (EN-FR)

             BLEU-4
NMT          20.19
NMT + PS     23.76

In Table 5, we provide the result of NMT with pointer softmax and observe about a 3.6 BLEU score improvement over our baseline.

Figure 4: A comparison of the validation learning curves of the same NMT model trained with pointer softmax and the regular softmax layer. As can be seen from the figures, the model trained with pointer softmax converges faster than the regular softmax layer. The switching network for pointer softmax in this figure uses the ReLU activation function.

In Figure 4, we show the validation curves of the NMT model with attention and the NMT model with shortlist-softmax layer. Pointer softmax converges faster in terms of number of minibatch updates and achieves a lower validation negative log-likelihood (NLL) (63.91) after 200k updates over the Europarl dataset than the NMT


Anonymized example (entity pointers data):
Source: <v1> 's <v2> <v3> set a world record with a clean and jerk lift of ### kilograms to win the women 's over-## kilogram weightlifting title at the asian games on tuesday .
Target: <v1> 's <v2> <v3> , sets world weightlifting record



• "gonghong" is an <unk> (outside the shortlist vocabulary)


For evaluation, we use full-length Rouge F1 using the official evaluation tool [2]. In their work, the authors of (Bahdanau et al., 2014) use full-length Rouge Recall on this corpus, since the maximum length of the limited-length version of Rouge recall, 75 bytes (intended for DUC data), is already long for Gigaword summaries. However, since full-length Recall can unfairly reward longer summaries, we also use full-length F1 in our experiments for a fair comparison between our models, independent of the summary length.
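For intuition on the metric choice, a minimal unigram Rouge-1 sketch: recall only rewards covering the reference (and therefore favors longer outputs), while F1 also penalizes length through precision. This is a simplified illustration, not the official scorer referenced in the footnote.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Unigram overlap precision, recall and F1 between two token lists."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())          # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

ref = "china 's tang gonghong sets world weightlifting record".split()
cand = "china 's tang gonghong wins women 's weightlifting title at asian games".split()
print(rouge1(cand, ref))  # a longer candidate can keep recall high while precision drops
```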

The experimental results comparing the Pointer Softmax with the NMT model are displayed in Table 1 for the UNK pointers data and in Table 2 for the entity pointers data. As our experiments show, pointer softmax improves over the baseline NMT on both the UNK data and the entities data. Our hope was that the improvement would be larger for the entities data since the incidence of pointers was much greater. However, it turns out this is not the case, and we suspect the main reason is the anonymization of entities, which removed data sparsity by converting all entities to integer ids that are shared across all documents. We believe that on de-anonymized data, our model could help more, since the issue of data sparsity is more acute in this case.

Table 1: Results on Gigaword Corpus when pointers are used for UNKs in the training data, using Rouge-F1 as the evaluation metric.

                 Rouge-1  Rouge-2  Rouge-L
NMT + lvt         34.87    16.54    32.27
NMT + lvt + PS    35.19    16.66    32.51

Table 2: Results on anonymized Gigaword Corpus when pointers are used for entities, using Rouge-F1 as the evaluation metric.

                 Rouge-1  Rouge-2  Rouge-L
NMT + lvt         34.89    16.78    32.37
NMT + lvt + PS    35.11    16.76    32.55

In Table 3, we provide the results for summarization on the Gigaword corpus in terms of recall, as a similar comparison is also done by (Rush et al., 2015). We observe improvements on all the scores with the addition of pointer softmax.

[2] http://www.berouge.com/Pages/default.aspx

Table 3: Results on Gigaword Corpus for modeling UNKs with pointers in terms of recall.

                 Rouge-1  Rouge-2  Rouge-L
NMT + lvt         36.45    17.41    33.90
NMT + lvt + PS    37.29    17.75    34.70

Let us note that, since the test set of (Rush et al., 2015) is not publicly available, we sample 2000 texts with their summaries without replacement from the validation set and use those examples as our test set.

In Table 4 we present a few system-generated summaries from the Pointer Softmax model trained on the UNK pointers data. From those examples, it is apparent that the model has learned to accurately point to the source positions whenever it needs to generate rare words in the summary.

5.3 Neural Machine Translation

In our neural machine translation (NMT) experiments, we train NMT models with attention over the Europarl corpus (Bahdanau et al., 2014) over sequences of length up to 50 for English to French translation. [3] All models are trained with early stopping, which is done based on the negative log-likelihood (NLL) on the development set. Our evaluations to report the performance of our models are done on newstest2011 by using the BLEU score. [4]

We use 30,000 tokens for both the source and the target language shortlist vocabularies (one token is still reserved for the unknown words). The whole corpus contains 134,831 unique English words and 153,083 unique French words. We have created a word-level dictionary from French to English which contains translations of 15,953 words that are neither in the shortlist vocabulary nor in the dictionary of common words for both the source and the target. There are about 49,490 words shared between the English and French parallel corpora of Europarl.

During the training, in order to decide whether to pick a word from the source sentence using attention/pointers or to predict the word from the shortlist vocabulary, we use the simple heuristic described above: if the target word is not in the shortlist, we look for it (or a dictionary translation of it) in the source sentence.

[3] In our experiments, we use an existing code, provided in https://github.com/kyunghyuncho/dl4mt-material, and on the original model we only changed the last softmax layer for our experiments.

[4] We compute the BLEU score using the multi-bleu.perl script from Moses on tokenized sentence pairs.

Table 4: Generated summaries from NMT with PS. Boldface words are the words copied from the source.Source #1 china ’s tang gonghong set a world record with a clean and

jerk lift of ### kilograms to win the women ’s over-## kilogramweightlifting title at the asian games on tuesday .

Target #1 china ’s tang <unk>,sets world weightlifting recordNMT+PS #1 china ’s tang gonghong wins women ’s weightlifting weightlift-

ing title at asian gamesSource #2 owing to criticism , nbc said on wednesday that it was ending

a three-month-old experiment that would have brought the firstliquor advertisements onto national broadcast network television.

Target #2 advertising : nbc retreats from liquor commercialsNMT+PS #2 nbc says it is ending a three-month-old experimentSource #3 a senior trade union official here wednesday called on ghana ’s

government to be “ mindful of the plight ” of the ordinary peoplein the country in its decisions on tax increases .

Target #3 tuc official,on behalf of ordinary ghanaiansNMT+PS #3 ghana ’s government urged to be mindful of the plight

vocabulary, we first check if the same word y

t

ap-pears in the source sentence. If it is not, we thencheck if a translated version of the word exists inthe source sentence by using a look-up table be-tween the source and the target language. If theword is in the source sentence, we then use the lo-cation of the word in the source as the target. Oth-erwise we check if one of the English senses fromthe cross-language dictionary of the French wordis in the source. If it is in the source sentence, thenwe use the location of that word as our translation.Otherwise we just use the argmax of l

t

as the tar-get.

For switching network d

t

, we observed that us-ing a two-layered MLP with noisy-tanh activation(Gulcehre et al., 2016) function with residual con-nection from the lower layer (He et al., 2015) ac-tivation function to the upper hidden layers im-proves the BLEU score about 1 points over thed

t

using ReLU activation function. We initializedthe biases of the last sigmoid layer of d

t

to �1

such that if dt

becomes more biased toward choos-ing the shortlist vocabulary at the beginning of thetraining. We renormalize the gradients if the normof the gradients exceed 1 (Pascanu et al., 2012).

Table 5: Europarl Dataset (EN-FR)BLEU-4

NMT 20.19NMT + PS 23.76

In Table 5, we provided the result of NMT withpointer softmax and we observe about 3.6 BLEUscore improvement over our baseline.

Figure 4: A comparison of the validation learning-curves of the same NMT model trained withpointer softmax and the regular softmax layer. Ascan be seen from the figures, the model trainedwith pointer softmax converges faster than the reg-ular softmax layer. Switching network for pointersoftmax in this Figure uses ReLU activation func-tion.

In Figure 4, we show the validation curvesof the NMT model with attention and the NMTmodel with shortlist-softmax layer. Pointer soft-max converges faster in terms of number of mini-batch updates and achieves a lower validationnegative-log-likelihood (NLL) (63.91) after 200kupdates over the Europarl dataset than the NMT

••

13

For evaluation, we use full-length Rouge F1 using the official evaluation tool. 2 In their work, the authors of (Bahdanau et al., 2014) use full-length Rouge recall on this corpus, since the 75-byte limit of the limited-length version of Rouge recall (intended for DUC data) is already long for Gigaword summaries. However, since full-length recall can unfairly reward longer summaries, we also use full-length F1 in our experiments for a fair comparison between our models, independent of the summary length.
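For reference (this formula is not stated in the paper), full-length F1 balances Rouge precision P and recall R as

F_1 = \frac{2 P R}{P + R},

so a longer summary tends to raise R but lower P, which keeps the comparison roughly length-neutral.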

The experimental results comparing the Pointer Softmax with the NMT model are displayed in Table 1 for the UNK-pointers data and in Table 2 for the entity-pointers data. As our experiments show, the pointer softmax improves over the baseline NMT on both the UNK data and the entities data. Our hope was that the improvement would be larger for the entities data, since the incidence of pointers was much greater. However, it turns out this is not the case, and we suspect the main reason is the anonymization of entities, which removed data sparsity by converting all entities to integer ids that are shared across all documents. We believe that on de-anonymized data our model could help more, since the issue of data sparsity is more acute in that case.

Table 1: Results on Gigaword Corpus when pointers are used for UNKs in the training data, using Rouge-F1 as the evaluation metric.

                 Rouge-1  Rouge-2  Rouge-L
NMT + lvt         34.87    16.54    32.27
NMT + lvt + PS    35.19    16.66    32.51

Table 2: Results on anonymized Gigaword Corpus when pointers are used for entities, using Rouge-F1 as the evaluation metric.

                 Rouge-1  Rouge-2  Rouge-L
NMT + lvt         34.89    16.78    32.37
NMT + lvt + PS    35.11    16.76    32.55

2 http://www.berouge.com/Pages/default.aspx

Table 3: Results on Gigaword Corpus for modeling UNKs with pointers, in terms of recall.

                 Rouge-1  Rouge-2  Rouge-L
NMT + lvt         36.45    17.41    33.90
NMT + lvt + PS    37.29    17.75    34.70

In Table 3, we provide the results for summarization on the Gigaword corpus in terms of recall, as a similar comparison is done by (Rush et al., 2015). We observe improvements on all the scores with the addition of the pointer softmax. Note that, since the test set of (Rush et al., 2015) is not publicly available, we sampled 2000 texts with their summaries without replacement from the validation set and used those examples as our test set.
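As a small illustration of how such a held-out sample could be drawn (the variable names and seed are assumptions, not from the paper's code):

```python
import random

# valid_pairs stands in for the validation set: a list of (text, summary) tuples.
valid_pairs = [("text %d" % i, "summary %d" % i) for i in range(10000)]

random.seed(0)                                 # fixed seed, purely for reproducibility of the sketch
test_pairs = random.sample(valid_pairs, 2000)  # sample 2000 pairs without replacement
```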

In Table 4 we present a few system-generated summaries from the Pointer Softmax model trained on the UNK-pointers data. From those examples, it is apparent that the model has learned to accurately point to the source positions whenever it needs to generate rare words in the summary.

5.3 Neural Machine Translation

In our neural machine translation (NMT) experiments, we train NMT models with attention over the Europarl corpus (Bahdanau et al., 2014), on sequences of length up to 50, for English-to-French translation. 3 All models are trained with early stopping based on the negative log-likelihood (NLL) on the development set. We report the performance of our models on newstest2011 using the BLEU score. 4

We use 30,000 tokens for both the source and the target language shortlist vocabularies (one token is reserved for the unknown word). The whole corpus contains 134,831 unique English words and 153,083 unique French words. We have created a word-level dictionary from French to English which contains translations of 15,953 words that are neither in the shortlist vocabulary nor in the dictionary of common words for both the source and the target. There are about 49,490 words shared between the English and French parallel corpora of Europarl.

3 In our experiments, we use an existing codebase, provided at https://github.com/kyunghyuncho/dl4mt-material; on top of the original model we only changed the last softmax layer for our experiments.
4 We compute the BLEU score using the multi-bleu.perl script from Moses on tokenized sentence pairs.

Table 4: Generated summaries from NMT with PS. Boldface words are the words copied from the source.

Source #1:  china 's tang gonghong set a world record with a clean and jerk lift of ### kilograms to win the women 's over-## kilogram weightlifting title at the asian games on tuesday .
Target #1:  china 's tang <unk> , sets world weightlifting record
NMT+PS #1:  china 's tang gonghong wins women 's weightlifting weightlifting title at asian games

Source #2:  owing to criticism , nbc said on wednesday that it was ending a three-month-old experiment that would have brought the first liquor advertisements onto national broadcast network television .
Target #2:  advertising : nbc retreats from liquor commercials
NMT+PS #2:  nbc says it is ending a three-month-old experiment

Source #3:  a senior trade union official here wednesday called on ghana 's government to be “ mindful of the plight ” of the ordinary people in the country in its decisions on tax increases .
Target #3:  tuc official , on behalf of ordinary ghanaians
NMT+PS #3:  ghana 's government urged to be mindful of the plight

During training, in order to decide whether to pick a word from the source sentence using attention/pointers or to predict the word from the shortlist vocabulary, we use the following simple heuristic. If the word is not in the shortlist vocabulary, we first check whether the same word y_t appears in the source sentence. If it does not, we then check whether a translated version of the word exists in the source sentence, using a look-up table between the source and the target language. If the word is in the source sentence, we use the location of that word in the source as the target. Otherwise, we check whether one of the English senses of the French word from the cross-language dictionary appears in the source. If it is in the source sentence, we use the location of that word as our translation. Otherwise, we simply use the argmax of l_t as the target.
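A minimal Python sketch of this heuristic, assuming hypothetical containers shortlist (the target shortlist vocabulary), source_tokens (the tokenized source sentence), and fr2en (the French-to-English look-up dictionary); the look-up-table and cross-language-dictionary steps are collapsed into one, and the function name is ours, not the paper's.

```python
def pointer_target(y_t, source_tokens, shortlist, fr2en, lt_argmax):
    """Choose the training target for timestep t: a shortlist word or a source location."""
    if y_t in shortlist:
        return ("shortlist", y_t)
    if y_t in source_tokens:                       # same surface form appears in the source
        return ("pointer", source_tokens.index(y_t))
    for sense in fr2en.get(y_t, []):               # an English sense of the French target word
        if sense in source_tokens:
            return ("pointer", source_tokens.index(sense))
    return ("pointer", lt_argmax)                  # fall back to the argmax of l_t

# Toy usage with made-up data.
src = "guillaume and cesar have a blue car in lausanne .".split()
print(pointer_target("lausanne", src, {"a", "car", "have"}, {"voiture": ["car"]}, 0))
# -> ('pointer', 8)
```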

For the switching network d_t, we observed that using a two-layered MLP with the noisy-tanh activation function (Gulcehre et al., 2016) and a residual connection from the lower layer (He et al., 2015) to the upper hidden layer improves the BLEU score by about 1 point over a d_t that uses the ReLU activation function. We initialized the biases of the last sigmoid layer of d_t to -1, so that at the beginning of training d_t is biased toward choosing the shortlist vocabulary. We renormalize the gradients if their norm exceeds 1 (Pascanu et al., 2012).
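A rough numpy sketch of such a switching network, under simplifying assumptions: plain tanh stands in for the noisy-tanh activation of (Gulcehre et al., 2016), the layer sizes and parameter names are illustrative, and only the forward pass and the gradient-renormalization rule are shown.

```python
import numpy as np

rng = np.random.RandomState(0)
d_in, d_h = 1000, 500                            # illustrative sizes, not from the paper

# Two-layer MLP parameters (tanh here stands in for the noisy-tanh activation).
W1 = rng.normal(scale=0.01, size=(d_in, d_h)); b1 = np.zeros(d_h)
W2 = rng.normal(scale=0.01, size=(d_h, d_h));  b2 = np.zeros(d_h)
w_out = rng.normal(scale=0.01, size=d_h)
b_out = -1.0   # last sigmoid bias initialized to -1; per the paper, this biases the
               # switch toward the shortlist vocabulary early in training

def switch(x):
    """Forward pass of the switching MLP; returns the switch probability d_t."""
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2) + h1               # residual connection from the lower layer
    return 1.0 / (1.0 + np.exp(-(h2 @ w_out + b_out)))

def renorm_gradients(grads, threshold=1.0):
    """Rescale all gradients if their global norm exceeds the threshold."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads

print(switch(rng.normal(size=d_in)))              # a value close to sigmoid(-1) at initialization
```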

Table 5: Europarl Dataset (EN-FR)

           BLEU-4
NMT        20.19
NMT + PS   23.76

In Table 5, we provide the results of NMT with the pointer softmax, and we observe an improvement of about 3.6 BLEU points over our baseline.

Figure 4: A comparison of the validation learning curves of the same NMT model trained with the pointer softmax and with the regular softmax layer. As can be seen from the figure, the model trained with the pointer softmax converges faster than the one with the regular softmax layer. The switching network for the pointer softmax in this figure uses the ReLU activation function.

In Figure 4, we show the validation curves of the NMT model with the pointer softmax and the NMT model with the shortlist-softmax layer. The pointer softmax converges faster in terms of the number of minibatch updates and achieves a lower validation negative log-likelihood (NLL) of 63.91 after 200k updates over the Europarl dataset than the NMT model with the shortlist-softmax layer.

The entity-anonymized version of the first example in Table 4 reads:
Source: <v1> 's <v2> <v3> set a world record with a clean and jerk lift of ### kilograms to win the women 's over-## kilogram weightlifting title at the asian games on tuesday .
Target: <v1> 's <v2> <v3> , sets world weightlifting record
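A toy sketch of this kind of entity anonymization, replacing each entity string with a placeholder id in the style of the example above. This is a per-example simplification; as noted earlier, the actual corpus uses integer ids shared across documents, and the helper name is ours.

```python
def anonymize(text, entities):
    """Replace each entity string with a placeholder id (<v1>, <v2>, ...)."""
    mapping = {}
    for ent in entities:
        placeholder = mapping.setdefault(ent, "<v%d>" % (len(mapping) + 1))
        text = text.replace(ent, placeholder)
    return text, mapping

src = ("china 's tang gonghong set a world record with a clean and jerk lift "
       "of ### kilograms to win the women 's over-## kilogram weightlifting "
       "title at the asian games on tuesday .")
anon_src, mapping = anonymize(src, ["china", "tang", "gonghong"])
# anon_src now reads "<v1> 's <v2> <v3> set a world record ..."
```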


ior of humans and recent advances in the attention mechanism (Bahdanau et al., 2014) and the pointer networks (Vinyals et al., 2015), we propose a novel method to deal with the rare or unknown word problem. The basic idea is that we can view many NLP problems as the task of predicting target text given context text, where some of the target words appear in the context as well. We observe that in this case we can make the model learn to point to a word in the context and copy it to the target text, as well as learn when to point. For example, in machine translation, we can see the source sentence as the context and the target sentence as what we need to predict. In Figure 1, we show an example depiction of how words can be copied from source to target in machine translation. Although the source and target languages are different, many of the words, such as named entities, are usually represented by the same characters in both languages, making it possible to copy. Similarly, in text summarization, it is natural to use some words from the original text in the summarized text as well.

Specifically, to predict a target word at each timestep, our model first determines the source of the word generation, that is, whether to take one from a predefined shortlist or to copy one from the context. For the former, we apply the typical softmax operation, and for the latter, we use the attention mechanism to obtain the pointing softmax probability over the context words and pick the one with high probability. The model learns this decision so as to use pointing only when the context includes a word that can be copied to the target. This way, our model can predict even words which are not in the shortlist, as long as they appear in the context. Although some words still need to be labeled as UNK, i.e., when a word is neither in the shortlist nor in the context, in experiments we show that this learning of when and where to point improves the performance in machine translation and text summarization.

The rest of the paper is organized as follows. In the next section, we review related work, including pointer networks and previous approaches to the rare/unknown word problem. In Section 3, we review neural machine translation with the attention mechanism, which is the baseline in our experiments. Then, in Section 4, we propose our method for dealing with the rare/unknown word problem, called the Pointer Softmax (PS). The experimental results are provided in Section 5, and we conclude our work in Section 6.

Figure 1: An example of how copying can happen in machine translation. Common words that appear both in the source and the target can be copied directly from input to output. The remaining unknown words in the target can be copied from the input after being translated with a dictionary.
French:  Guillaume et Cesar ont une voiture bleue a Lausanne.
English: Guillaume and Cesar have a blue car in Lausanne.

2 Related Work

The attention-based pointing mechanism was first introduced in the pointer networks (Vinyals et al., 2015). In the pointer networks, the output space of the target sequence is constrained to be the observations in the input sequence (not the input space). Instead of having a fixed-dimension softmax output layer, softmax outputs of varying dimension are dynamically computed for each input sequence in such a way as to maximize the attention probability of the target input. However, its applicability is rather limited because, unlike our model, there is no option to choose whether to point or not; it always points. In this sense, we can see the pointer networks as a special case of our model in which we always choose to point to a context word.

Several approaches have been proposed towards solving the rare/unknown word problem, which can be broadly divided into three categories. The first category of approaches focuses on improving the computation speed of the softmax output so that it can maintain a very large vocabulary. Because this only increases the shortlist size, it helps to mitigate the unknown word problem, but it still suffers from the rare word problem. The hierarchical softmax (Morin and Bengio, 2005), importance sampling (Bengio and Senecal, 2008; Jean et al., 2014), and noise-contrastive estimation (Gutmann and Hyvarinen, 2012; Mnih and Kavukcuoglu, 2013) methods are in this class.

The second category, to which our proposed method also belongs, uses information from the context. Notable works are (Luong et al., 2015) and (Hermann et al., 2015). In particular, applied to the machine translation task, (Luong et al., 2015) learns to point to some words in the source sentence and copy them to the target sentence, similarly to our approach.


Figure 2: A depiction of the neural machine translation architecture with attention. A BiRNN encodes the source sequence x_1, ..., x_T. At each timestep, the model generates the attention distribution l_t. We use l_t and the encoder's hidden states to obtain the context c_t. The decoder uses c_t to predict a vector of probabilities for the words w_t by using the vocabulary softmax.

4 The Pointer Softmax

In this section, we introduce our method, called the pointer softmax (PS), to deal with rare and unknown words. The pointer softmax is applicable to many NLP tasks because it addresses the limitations of neural networks with respect to unknown words. It can be used in parallel with other existing techniques such as the large vocabulary trick (Jean et al., 2014). Our model learns two key abilities jointly to make the pointing mechanism applicable in more general settings: (i) to predict whether it is required to use the pointing mechanism or not at each time step, and (ii) to point to any location of the context sequence, whose length can vary widely over examples. Note that the pointer networks (Vinyals et al., 2015) lack ability (i), and ability (ii) is not achieved in the models of (Luong et al., 2015).

To achieve this, our model uses two softmax output layers, the shortlist softmax and the location softmax. The shortlist softmax is the same as the typical softmax output layer, where each dimension corresponds to a word in the predefined word shortlist. The location softmax is a pointer network where each output dimension corresponds to the location of a word in the context sequence. Thus, the output dimension of the location softmax varies according to the length of the given context sequence.

At each time-step, if the model decides to use the shortlist softmax, we generate a word w_t from the shortlist. Otherwise, if it is expected that the context sequence contains a word which needs to be generated at this time step, we obtain the location of the context word l_t from the location softmax. The key to making this possible is deciding when to use the shortlist softmax and when to use the location softmax at each time step. In order to accomplish this, we introduce a switching network into the model. The switching network, which is a multilayer perceptron in our experiments, takes the representation of the context sequence (similar to the input annotation in NMT) and the previous hidden state of the output RNN as its input. It outputs a binary variable z_t which indicates whether to use the shortlist softmax (when z_t = 1) or the location softmax (when z_t = 0). Note that if the word that is expected to be generated at a given time-step is neither in the shortlist nor in the context sequence, the switching network selects the shortlist softmax, and the shortlist softmax then predicts UNK. The details of the pointer softmax model can be seen in Figure 3 as well.
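To make the interplay of the two softmax layers and the switch concrete, here is a schematic numpy sketch of a single decoding step. The inputs (shortlist_logits, attention_scores, p_shortlist) are placeholders for the outputs of the corresponding networks, not the paper's actual code, and hard thresholding of the switch is a simplification.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def pointer_softmax_step(shortlist_logits, attention_scores, p_shortlist):
    """One decoding step of the pointer softmax.

    shortlist_logits : scores over the fixed shortlist vocabulary
    attention_scores : unnormalized scores over the source positions
                       (their softmax is the location distribution l_t)
    p_shortlist      : switch probability p(z_t = 1), from the switching MLP
    """
    w_dist = softmax(shortlist_logits)        # shortlist softmax over |V| words
    l_dist = softmax(attention_scores)        # location softmax over T_x positions
    if p_shortlist >= 0.5:                    # z_t = 1: emit a shortlist word (possibly UNK)
        return ("word", int(np.argmax(w_dist)))
    else:                                     # z_t = 0: point and copy a source position
        return ("copy", int(np.argmax(l_dist)))

print(pointer_softmax_step(np.array([0.1, 2.0, -1.0]), np.array([0.5, 3.0]), 0.2))
# -> ('copy', 1)
```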

Figure 3: A depiction of the Pointer Softmax (PS) architecture. At each timestep, l_t, c_t, and w_t (the distribution over the limited shortlist vocabulary) are generated. We have an additional switching variable z_t that decides whether to use a vocabulary word or to copy a word from the source sequence.

More specifically, our goal is to maximize the probability of observing the target word sequence y = (y_1, y_2, ..., y_{T_y}) and the word generation source z = (z_1, z_2, ..., z_{T_y}), given the context sequence x = (x_1, x_2, ..., x_{T_x}):

p(\mathbf{y}, \mathbf{z} \mid \mathbf{x}) = \prod_{t=1}^{T_y} p(y_t, z_t \mid y_{<t}, z_{<t}, \mathbf{x}). \qquad (4)

Note that the word observation y_t can be either a word w_t from the shortlist softmax or a location l_t from the location softmax, depending on the switching variable z_t. Considering this, we can factorize the above probability further.
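The continuation of this factorization is not reproduced in this excerpt. Under the natural reading that the switch probability and the chosen softmax are trained jointly, the per-timestep negative log-likelihood could be computed as in the sketch below; this is an illustrative paraphrase, not a transcription of the paper's equations, and the argument names are ours.

```python
import numpy as np

def timestep_nll(w_dist, l_dist, p_switch, z_t, w_t=None, l_t=None, eps=1e-8):
    """Negative log-likelihood of one target symbol under the pointer softmax.

    w_dist   : shortlist softmax probabilities (length |V|)
    l_dist   : location softmax probabilities (length T_x)
    p_switch : p(z_t = 1 | ...), probability of using the shortlist softmax
    z_t      : 1 if the supervision is a shortlist word w_t, 0 if it is a location l_t
    """
    if z_t == 1:
        return -np.log(p_switch * w_dist[w_t] + eps)
    else:
        return -np.log((1.0 - p_switch) * l_dist[l_t] + eps)

print(timestep_nll(np.array([0.7, 0.3]), np.array([0.9, 0.1]), p_switch=0.8, z_t=1, w_t=0))
```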



ification over the NMT, our model is able to generalize to unseen words and can deal with rare words more efficiently. For the summarization task on the Gigaword dataset, the pointer softmax was able to improve the results even when it is used together with the large-vocabulary trick. In the case of neural machine translation on the English-to-French Europarl corpora, we observed that training with the pointer softmax also improved the convergence speed of the model.

7 Acknowledgments

We would also like to thank the developers of Theano (http://deeplearning.net/software/theano/) for developing such a powerful tool for scientific computing (Theano Development Team, 2016). We acknowledge the support of the following organizations for research funding and computing support: NSERC, Samsung, Calcul Quebec, Compute Canada, the Canada Research Chairs and CIFAR. C. G. thanks IBM T.J. Watson Research for funding this research during his internship between October 2015 and January 2016.

References

[Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
[Bengio and Senecal2008] Yoshua Bengio and Jean-Sebastien Senecal. 2008. Adaptive importance sampling to accelerate training of a neural probabilistic language model. Neural Networks, IEEE Transactions on, 19(4):713–722.
[Bordes et al.2015] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.
[Cheng and Lapata2016] Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
[Cho et al.2014] Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
[Chung et al.2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.
[Gillick et al.2015] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.
[Graves2013] Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
[Gu et al.2016] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
[Gulcehre et al.2016] Caglar Gulcehre, Marcin Moczulski, Misha Denil, and Yoshua Bengio. 2016. Noisy activation functions. arXiv preprint arXiv:1603.00391.
[Gutmann and Hyvarinen2012] Michael U. Gutmann and Aapo Hyvarinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(1):307–361.
[He et al.2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
[Hermann et al.2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692.
[Jean et al.2014] Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.
[Kingma and Ba2015] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
[Luong et al.2015] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of ACL.
[Matthews et al.2012] Danielle Matthews, Tanya Behne, Elena Lieven, and Michael Tomasello. 2012. Origins of the human pointing gesture: a training study. Developmental Science, 15(6):817–829.
[Mnih and Kavukcuoglu2013] Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273.
[Morin and Bengio2005] Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pages 246–252. Citeseer.
[Pascanu et al.2012] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063.
[Pascanu et al.2013] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.
[Rush et al.2015] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. CoRR, abs/1509.00685.
[Schuster and Paliwal1997] Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11):2673–2681.
[Sennrich et al.2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
[Theano Development Team2016] Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May.
[Tomasello et al.2007] Michael Tomasello, Malinda Carpenter, and Ulf Liszkowski. 2007. A new look at infant pointing. Child Development, 78(3):705–722.
[Vinyals et al.2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2674–2682.
[Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.

