


Page 1:

Background | Tree Structure Enhanced NMT | Experiments

Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder

Proceedings of the Association for Computational Linguistics (ACL), 2017

Huadong Chen, Shujian Huang, David Chiang, Jiajun Chen

3 May 2019

Presented by: Kevin Liang

Page 2:

Motivation

Over the past few years, neural machine translation (NMT) models have set new state-of-the-art results across many language pairs, mostly by using an encoder-decoder structure.

Can we use source-side syntax to improve our model's performance?

Bidirectional tree encoder

Tree-coverage model

Page 3:

Neural Machine Translation (NMT)

Given a source sentence x = x_1, ..., x_i, ..., x_I and a target sentence y = y_1, ..., y_j, ..., y_J, NMT seeks to model:

P(y \mid x; \theta) = \prod_{j=1}^{J} P(y_j \mid y_{<j}, x; \theta)    (1)

where \theta are the model parameters and y_{<j} are the words generated before y_j.
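As a concrete illustration of this factorization, here is a minimal sketch that scores a target sentence by summing per-token log-probabilities. The `step_fn` callable and `BOS_ID` constant are stand-ins of my own, not names from the paper; any decoder that returns a distribution over the next target word would fit.

```python
import torch

BOS_ID = 1  # assumed index of a beginning-of-sentence token

def sentence_log_prob(step_fn, init_state, target_ids):
    """Score a target sentence under the factorization in Eq. (1).

    step_fn:    callable (state, prev_word_id) -> (logits over vocab, new state);
                a stand-in for the attentional decoder described later.
    init_state: initial decoder state (e.g., derived from the encoder).
    target_ids: 1-D LongTensor of target word indices y_1 .. y_J.
    """
    state, prev = init_state, torch.tensor(BOS_ID)
    total = 0.0
    for y_j in target_ids:
        logits, state = step_fn(state, prev)
        log_probs = torch.log_softmax(logits, dim=-1)
        total = total + log_probs[y_j]   # log P(y_j | y_<j, x)
        prev = y_j
    return total                         # log P(y | x)
```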

Page 4:

Encoder-Decoder Structure with Attention

Page 5:

Gated Recurrent Units (GRUs)

GRUs are a common choice of gated recurrent neural network unit. They are a simpler alternative to long short-term memory (LSTM) units and often perform about as well.

Notation used in the paper (and these slides):

h_t = GRU(h_{t-1}, x_t, ...)    (2)
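A minimal sketch of this notation using PyTorch's built-in GRU cell; the layer sizes and the dummy input sequence are placeholders of my own.

```python
import torch
import torch.nn as nn

input_size, hidden_size = 64, 128          # placeholder dimensions
gru = nn.GRUCell(input_size, hidden_size)  # implements h_t = GRU(h_{t-1}, x_t)

h = torch.zeros(1, hidden_size)            # h_0
for t in range(10):                        # walk over a dummy input sequence
    x_t = torch.randn(1, input_size)       # stand-in for an input embedding
    h = gru(x_t, h)                        # Eq. (2)
```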

Page 6:

Encoder Model

Use a bidirectional GRU to encode each word of the input sequence:

\overrightarrow{h}_i = GRU(\overrightarrow{h}_{i-1}, s_i)    (3)

\overleftarrow{h}_i = GRU(\overleftarrow{h}_{i+1}, s_i)    (4)

where s_i is the word embedding for x_i.

The annotation for each source word x_i is the concatenation of both the forward and backward hidden states:

\overleftrightarrow{h}_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]    (5)
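A sketch of the sequential encoder, again with placeholder sizes and a dummy sentence; PyTorch's bidirectional nn.GRU concatenates the forward and backward states at each position, which matches Eq. (5).

```python
import torch
import torch.nn as nn

vocab_size, emb_size, hidden_size = 1000, 64, 128    # placeholder sizes
embed = nn.Embedding(vocab_size, emb_size)           # s_i = embedding of x_i
encoder = nn.GRU(emb_size, hidden_size, bidirectional=True, batch_first=True)

x = torch.randint(0, vocab_size, (1, 7))             # dummy source sentence x_1..x_7
s = embed(x)                                          # (1, 7, emb_size)
h_bidir, _ = encoder(s)                               # (1, 7, 2 * hidden_size)
# h_bidir[:, i] is [forward h_i ; backward h_i], i.e. Eq. (5)
```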

Page 7:

Decoder Model

The decoder hidden states d_j are computed as:

d_j = GRU(d_{j-1}, t_{j-1}, c_j)    (6)

where t_{j-1} is the word embedding of the (j-1)-th target word, d_j is the decoder's hidden state at time j, and c_j is the context vector at time j.

The probability of generating the j-th word y_j:

P(y_j \mid y_{<j}, x; \theta) = softmax(W_V d_j)    (7)

where W_V is either the transposed word embedding matrix, or learned separately.
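A sketch of one decoder step matching Eqs. (6)-(7). Concatenating the previous target embedding and the context vector into a single GRU input is my reading of GRU(d_{j-1}, t_{j-1}, c_j), not something the slides specify, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

emb_size, hidden_size, ctx_size, vocab_size = 64, 128, 256, 1000  # placeholders
target_embed = nn.Embedding(vocab_size, emb_size)
decoder_cell = nn.GRUCell(emb_size + ctx_size, hidden_size)
output_proj = nn.Linear(hidden_size, vocab_size)                   # W_V

def decode_one_step(d_prev, y_prev_id, c_j):
    """One step of Eqs. (6)-(7): returns P(y_j | y_<j, x) and the new state."""
    t_prev = target_embed(y_prev_id)                                # t_{j-1}
    d_j = decoder_cell(torch.cat([t_prev, c_j], dim=-1), d_prev)    # Eq. (6)
    probs = torch.softmax(output_proj(d_j), dim=-1)                 # Eq. (7)
    return probs, d_j
```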

Page 8:

Attention mechanism

Attention weights are computed using the decoder state and the encoder states:

e_{j,i} = v_a^\top \tanh(W_a d_{j-1} + U_a \overleftrightarrow{h}_i)    (8)

\alpha_{j,i} = \frac{\exp(e_{j,i})}{\sum_{i'=1}^{I} \exp(e_{j,i'})}    (9)

These attention scores are used to compute a context vector c_j, which is a weighted sum of the source encodings, weighted according to the attention weights:

c_j = \sum_{i=1}^{I} \alpha_{j,i} \overleftrightarrow{h}_i    (10)
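A sketch of Eqs. (8)-(10) for a single decoder step. The parameter names mirror v_a, W_a, U_a in the equations; the dimensions and the batch-of-one layout are placeholder assumptions.

```python
import torch
import torch.nn as nn

hidden_size, enc_size, attn_size = 128, 256, 96        # placeholder sizes
W_a = nn.Linear(hidden_size, attn_size, bias=False)
U_a = nn.Linear(enc_size, attn_size, bias=False)
v_a = nn.Linear(attn_size, 1, bias=False)

def attend(d_prev, h_enc):
    """d_prev: (1, hidden_size); h_enc: (I, enc_size) bidirectional annotations."""
    e = v_a(torch.tanh(W_a(d_prev) + U_a(h_enc))).squeeze(-1)   # Eq. (8), shape (I,)
    alpha = torch.softmax(e, dim=-1)                             # Eq. (9)
    c_j = alpha @ h_enc                                          # Eq. (10), (enc_size,)
    return c_j, alpha
```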

Page 9:

Encoder-Decoder Structure with Attention (revisited)

Page 10:

Syntactic trees

Assume we have source-side syntactic trees, which can be computed before translation:

Each node is given an index, with the leaf nodes labeled 1, ..., I. For any node with index k, let p(k) denote the index of node k's parent, and let L(k) and R(k) denote the indices of node k's left and right children, respectively.
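A minimal sketch of one way to store this indexing for a binarized parse; the class and field names here are my own, not the paper's.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    index: int                   # leaves carry indices 1..I; interior nodes follow
    parent: Optional[int]        # p(k); None for the root
    left: Optional[int] = None   # L(k); None for leaves
    right: Optional[int] = None  # R(k); None for leaves

# Tiny example over a 2-word sentence: node 3 spans leaves 1 and 2.
nodes = {
    1: TreeNode(index=1, parent=3),
    2: TreeNode(index=2, parent=3),
    3: TreeNode(index=3, parent=None, left=1, right=2),
}
```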

Page 11:

Tree-GRU Encoder

Build a tree encoder on top of the sequential encoder.

If node k is a leaf node:

Node k's hidden state is just the sequential encoder's encoding:

h^{\uparrow}_k = \overleftrightarrow{h}_k    (11)

Else (node k is an interior node):

Node k's hidden state is a function of the hidden states of its left child h^{\uparrow}_{L(k)} and right child h^{\uparrow}_{R(k)}:

h^{\uparrow}_k = f(h^{\uparrow}_{L(k)}, h^{\uparrow}_{R(k)})    (12)
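A sketch of the bottom-up pass as a post-order traversal. The names `combine`, `seq_enc`, and `memo` are stand-ins of my own: `combine` plays the role of f (e.g., the Tree-GRU cell on the next slide), and `seq_enc` holds the sequential annotations from Eq. (5).

```python
def bottom_up_encode(node_id, nodes, seq_enc, combine, memo):
    """Compute h_up[k] (Eqs. 11-12) for node_id and all of its descendants.

    nodes:   mapping index -> node objects with .left and .right child indices
    seq_enc: mapping leaf index -> sequential annotation (Eq. 5)
    combine: callable (h_left, h_right) -> h_parent, e.g. a Tree-GRU cell
    memo:    dict filled with the bottom-up states h_up[k]
    """
    node = nodes[node_id]
    if node.left is None:                          # leaf: Eq. (11)
        memo[node_id] = seq_enc[node_id]
    else:                                          # interior: Eq. (12)
        h_left = bottom_up_encode(node.left, nodes, seq_enc, combine, memo)
        h_right = bottom_up_encode(node.right, nodes, seq_enc, combine, memo)
        memo[node_id] = combine(h_left, h_right)
    return memo[node_id]
```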

Page 12:

Tree-GRU

To make the tree encoder consistent with the GRU sequential encoders, the authors use Tree-GRU units:

r_L = \sigma(U_L^{(r_L)} h^{\uparrow}_{L(k)} + U_R^{(r_L)} h^{\uparrow}_{R(k)} + b^{(r_L)})    (13)

r_R = \sigma(U_L^{(r_R)} h^{\uparrow}_{L(k)} + U_R^{(r_R)} h^{\uparrow}_{R(k)} + b^{(r_R)})    (14)

z_L = \sigma(U_L^{(z_L)} h^{\uparrow}_{L(k)} + U_R^{(z_L)} h^{\uparrow}_{R(k)} + b^{(z_L)})    (15)

z_R = \sigma(U_L^{(z_R)} h^{\uparrow}_{L(k)} + U_R^{(z_R)} h^{\uparrow}_{R(k)} + b^{(z_R)})    (16)

z = \sigma(U_L^{(z)} h^{\uparrow}_{L(k)} + U_R^{(z)} h^{\uparrow}_{R(k)} + b^{(z)})    (17)

\tilde{h}^{\uparrow}_k = \tanh(U_L (r_L \odot h^{\uparrow}_{L(k)}) + U_R (r_R \odot h^{\uparrow}_{R(k)}))    (18)

h^{\uparrow}_k = z_L \odot h^{\uparrow}_{L(k)} + z_R \odot h^{\uparrow}_{R(k)} + z \odot \tilde{h}^{\uparrow}_k    (19)

where r_L, r_R are the reset gates and z_L, z_R are the update gates for the left and right children, and z is the update gate for the internal hidden state \tilde{h}^{\uparrow}_k.
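A sketch of a Tree-GRU cell along these lines. It follows Eqs. (13)-(19) directly; folding each pair of U matrices and its bias into one nn.Linear over the concatenated children is an implementation convenience of mine, not something stated in the paper.

```python
import torch
import torch.nn as nn

class TreeGRUCell(nn.Module):
    """Bottom-up Tree-GRU combiner, Eqs. (13)-(19)."""

    def __init__(self, size):
        super().__init__()
        # Each Linear computes U_L(.) h_L + U_R(.) h_R + b(.) over [h_L; h_R].
        self.reset_left = nn.Linear(2 * size, size)     # r_L, Eq. (13)
        self.reset_right = nn.Linear(2 * size, size)    # r_R, Eq. (14)
        self.update_left = nn.Linear(2 * size, size)    # z_L, Eq. (15)
        self.update_right = nn.Linear(2 * size, size)   # z_R, Eq. (16)
        self.update_cand = nn.Linear(2 * size, size)    # z,   Eq. (17)
        self.U_L = nn.Linear(size, size, bias=False)
        self.U_R = nn.Linear(size, size, bias=False)

    def forward(self, h_left, h_right):
        both = torch.cat([h_left, h_right], dim=-1)
        r_L = torch.sigmoid(self.reset_left(both))
        r_R = torch.sigmoid(self.reset_right(both))
        z_L = torch.sigmoid(self.update_left(both))
        z_R = torch.sigmoid(self.update_right(both))
        z = torch.sigmoid(self.update_cand(both))
        h_tilde = torch.tanh(self.U_L(r_L * h_left) + self.U_R(r_R * h_right))  # Eq. (18)
        return z_L * h_left + z_R * h_right + z * h_tilde                       # Eq. (19)
```

An instance of this cell could serve as the `combine` callable in the bottom-up traversal sketched earlier.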

Page 13:

Bottom-up tree encoder

Page 14:

Bidirectional tree encoder

The learned representation of a node in the bottom-up encoder is based only on its subtree; information above it in the tree is missing. This can be addressed by adding a top-down encoder as follows:

If node k is root:

Node k's top-down encoding is a function of its bottom-up encoding:

h^{\downarrow}_k = \tanh(W h^{\uparrow}_k + b)    (20)

Else:

Node k's top-down encoding is produced by a sequential GRU running from the root down the tree to node k:

h^{\downarrow}_k = GRU(h^{\downarrow}_{p(k)}, h^{\uparrow}_k)    (21)

The final encoding for each node is obtained by concatenating the bottom-up and top-down hidden states:

h^{\updownarrow}_k = [h^{\uparrow}_k ; h^{\downarrow}_k]    (22)
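A sketch of the top-down pass and the final concatenation, reusing the node layout and bottom-up states `h_up` from the earlier sketches; `root_proj` and `top_down_cell` are hypothetical names standing in for W, b in Eq. (20) and the GRU in Eq. (21).

```python
import torch
import torch.nn as nn

size = 128                                # placeholder hidden size
root_proj = nn.Linear(size, size)         # W, b in Eq. (20)
top_down_cell = nn.GRUCell(size, size)    # GRU in Eq. (21)

def top_down_encode(nodes, root_id, h_up):
    """Return the concatenated node encodings of Eq. (22)."""
    h_down = {root_id: torch.tanh(root_proj(h_up[root_id]))}            # Eq. (20)
    stack = [root_id]
    while stack:                                                         # root-to-leaf order
        k = stack.pop()
        for child in (nodes[k].left, nodes[k].right):
            if child is not None:
                h_down[child] = top_down_cell(h_up[child], h_down[k])    # Eq. (21)
                stack.append(child)
    return {k: torch.cat([h_up[k], h_down[k]], dim=-1) for k in h_up}    # Eq. (22)
```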

Page 15:

Bidirectional tree encoder

Page 16:

Issues using tree encoder

Syntactic phrases in the source are often translated into discontinuous words in the output.

Non-leaf nodes, which contain more information, tend to be attended to more often than leaf nodes, which may lead to over-translation.

Page 17:

Word Coverage Model

Coverage vectors have previously been proposed to make attention time-dependent; they affect the calculation of the attention score as follows:

e_{j,i} = v_a^\top \tanh(W_a d_{j-1} + U_a h_i + V_a C_{j-1,i})    (23)

The authors propose incorporating additional source-tree information by also feeding in the coverage vectors and attention weights of each node's children:

C_{j,i} = GRU(C_{j-1,i}, \alpha_{j,i}, d_{j-1}, h_i, C_{j-1,L(i)}, \alpha_{j,L(i)}, C_{j-1,R(i)}, \alpha_{j,R(i)})    (24)
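A sketch of one way to realize Eq. (24): the extra inputs are concatenated and fed to a GRU cell whose state is the node's coverage vector. The concatenation is my reading of the equation, and all sizes and names below are placeholders.

```python
import torch
import torch.nn as nn

cov_size, hidden_size, enc_size = 16, 128, 256    # placeholder sizes
# Input: [alpha_{j,i}, d_{j-1}, h_i, C_{j-1,L(i)}, alpha_{j,L(i)}, C_{j-1,R(i)}, alpha_{j,R(i)}]
in_size = 1 + hidden_size + enc_size + 2 * (cov_size + 1)
coverage_cell = nn.GRUCell(in_size, cov_size)

def update_coverage(C_prev_i, alpha_i, d_prev, h_i,
                    C_prev_left, alpha_left, C_prev_right, alpha_right):
    """Tree-coverage update in the spirit of Eq. (24); returns the new C_{j,i}."""
    inputs = torch.cat([alpha_i, d_prev, h_i,
                        C_prev_left, alpha_left,
                        C_prev_right, alpha_right], dim=-1)
    return coverage_cell(inputs, C_prev_i)
```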

Page 18:

Data

NIST Chinese-English translation task: 1.6M sentence pairs

Chinese sentences parsed with the Berkeley Parser [1]

Compare against 3 models/techniques:

NMT: standard attention NMT model [2]

Tree-LSTM: attention NMT model with Tree-LSTM encoder [3]

Coverage: attention NMT model with word coverage [4]

[1] Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Proc. NAACL HLT, pages 404–411. http://www.aclweb.org/anthology/N/N07/N07-1051.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015. http://arxiv.org/abs/1409.0473.

[3] Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proc. ACL, pages 823–833. http://www.aclweb.org/anthology/P16-1078.

[4] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proc. ACL, pages 76–85. http://www.aclweb.org/anthology/P16-1008.

Page 19:

Chinese-English BLEU-4 Scores

Page 20:

Tree-LSTM vs Tree-GRU encoder

Previous table:

Experiments with LSTM sequence encoder:

Page 21:

Encoding size

Previous table:

Experiments with bidirectional embedding size halved:

Page 22:

Takeaways

For the encoder, tree encoders using syntax do better than purely sequential ones, and bidirectional tree encoders are better than bottom-up-only ones.

Coverage helps. Tree-coverage helps more.

Using the same type of cell (GRU vs. LSTM) for both the sequential and tree encoders is better. LSTM-LSTM is slightly better than GRU-GRU, but more expensive.

The gains of the bidirectional tree encoding are due to more than just the larger embedding size.