

Page 1: Beyond Weight Tying: Learning Joint Input-Output


Beyond Weight Tying: Learning Joint Input-Output

Embeddings for Neural Machine Translation

Nikolaos Pappas¹, Lesly Miculicich¹,², James Henderson¹

¹Idiap Research Institute, Switzerland   ²École polytechnique fédérale de Lausanne (EPFL)

October 31, 2018

Page 2: Beyond Weight Tying: Learning Joint Input-Output

Introduction / Background

Output layer parametrization

• NMT systems predict one word at a time given a context h_t ∈ ℝ^{d_h}, weights W ∈ ℝ^{d_h × |V|} and bias b ∈ ℝ^{|V|} by modeling (a code sketch follows at the end of this slide):

p(y_t | Y_{1:t−1}, X) ∝ exp(W^T h_t + b)

• Parametrization depends on the vocabulary size (C_base = |V| × d_h + |V|), which creates training and out-of-vocabulary word issues; common remedies:

• sub-word level modeling (Sennrich et al., 2016)
• output layer approximations (Mikolov et al., 2013)
• weight tying (Press & Wolf, 2017)

→ Lack of semantic grounding and composition of output representations
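A minimal sketch of this standard output layer (hypothetical PyTorch code; names and sizes are illustrative, not the authors' implementation):

```python
import torch

# Illustrative sizes: hidden dimension d_h and vocabulary size |V|.
d_h, V = 512, 32000

W = torch.randn(d_h, V) * 0.01     # output weights, d_h x |V|
b = torch.zeros(V)                 # output bias, |V|
h_t = torch.randn(d_h)             # decoder context at step t

logits = W.T @ h_t + b             # W^T h_t + b
p = torch.softmax(logits, dim=-1)  # p(y_t | Y_{1:t-1}, X)

# Output-layer parameters: C_base = |V| * d_h + |V|
print(V * d_h + V)                 # 16,416,000 for these sizes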


Page 3: Beyond Weight Tying: Learning Joint Input-Output

Introduction / Background

Weight tying

• Shares the target embedding E ∈ ℝ^{|V| × d} with W (Press & Wolf, 2017):

p(y_t | Y_{1:t−1}, X) ∝ exp(E h_t + b)

• Parametrization depends less on the vocabulary (C_tied = |V|).

• Assuming the bias is zero and that E implicitly learns linear word relationships (E ≈ E_l W) (Mikolov et al., 2013):

p(y_t | Y_{1:t−1}, X) ∝ exp(E_l W h_t)

• Equivalent to bilinear form of zero-shot models (Nam et al., 2016).

→ Imposes implicit linear structure on the output

→ This could explain its sample efficiency and effectiveness


Page 4: Beyond Weight Tying: Learning Joint Input-Output

Introduction / Background

Zero-shot models

• Learn a joint input-output space with a bilinear form, given a weight matrix W ∈ ℝ^{d × d_h} (Socher et al., 2013; Nam et al., 2016), as in the sketch below:

g(E, h_t) = E W h_t,  where W encodes the shared structure

• Useful properties

• Grounding outputs to word descriptions and semantics
• Explicit output relationships or structure (C_bilinear = d × d_h + |V|)
• Knowledge transfer across outputs, especially low-resource ones
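A minimal sketch of this bilinear scoring function (illustrative code, not taken from the cited papers):

```python
import torch

d, d_h, V = 512, 512, 32000
E = torch.randn(V, d) * 0.01     # output (word) embeddings
W = torch.randn(d, d_h) * 0.01   # bilinear "structure" matrix
h_t = torch.randn(d_h)           # decoder context

scores = E @ (W @ h_t)           # g(E, h_t) = E W h_t, one score per output word
# Extra parameters beyond E: C_bilinear = d * d_h + |V| (W plus an output bias)
```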


Page 5: Beyond Weight Tying: Learning Joint Input-Output

Introduction / Motivation

Examples of learned structure

Top-5 most similar words based on cosine distance.

Inconsistent words are marked in red.


Page 6: Beyond Weight Tying: Learning Joint Input-Output

Introduction / Motivation

Contributions

• Learning explicit non-linear output and context relationships

• New family of joint space models that generalize weight tying

g(E, h_t) = g_out(E) · g_inp(h_t)

• Flexibly controlling effective capacity

• The two extremes can lead to an under- or over-parametrized output layer (worked example at the end of this slide)

C_tied < C_bilinear ≤ C_joint ≤ C_base

→ Identify key limitations in existing output layer parametrizations

→ Propose a joint input-output model which addresses them

→ Provide empirical evidence of its effectiveness
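A rough worked example of this ordering (assuming, for the joint model, that the extra parameters are U, V, their biases b_u, b_v and the output bias b; sizes are illustrative):

```python
# |V| = 32K, d = d_h = 512, joint dimension d_j = 2048
V, d, d_h, d_j = 32000, 512, 512, 2048

C_tied     = V                                    # shared E, bias only:    ~32K
C_bilinear = d * d_h + V                          # W plus bias:           ~294K
C_joint    = d * d_j + d_h * d_j + 2 * d_j + V    # U, V, b_u, b_v, b:     ~2.1M
C_base     = V * d_h + V                          # full W plus bias:     ~16.4M

assert C_tied < C_bilinear <= C_joint <= C_base
```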


Page 7: Beyond Weight Tying: Learning Joint Input-Output

Introduction
  Background
  Motivation
Proposed Output Layer
  Joint Input-Output Embedding
  Unique properties
  Scaling Computation
Evaluation
  Data and Settings
  Quantitative Results
Conclusion

Page 8: Beyond Weight Tying: Learning Joint Input-Output

Proposed Output Layer / Joint Input-Output Embedding

Joint input-output embedding

• Two non-linear projections with d_j dimensions of any context h_t and of any output in E:

g_out(E) = σ(U E^T + b_u)
g_inp(h_t) = σ(V h_t + b_v)

[Architecture diagram: the decoder context h_t is projected by V (d_h × d_j) and the embedding E (|V| × d) is projected by U (d × d_j) into the joint embedding space; the resulting scores feed the softmax over y_t.]

• The conditional distribution becomes:

p(y_t | Y_{1:t−1}, X) ∝ exp(g_out(E) · g_inp(h_t) + b)
                      ∝ exp(σ(U E^T + b_u) · σ(V h_t + b_v) + b)

where the first factor captures output structure and the second captures context structure.
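A minimal PyTorch sketch of this joint output layer (an illustration under the definitions above, not the authors' released implementation; σ is written as a configurable activation, and Tanh is the choice used later in the experiments):

```python
import torch
import torch.nn as nn

class JointOutputLayer(nn.Module):
    """p(y_t | Y_{1:t-1}, X) ∝ exp(g_out(E) · g_inp(h_t) + b)."""
    def __init__(self, vocab_size, d, d_h, d_j, act=torch.tanh):
        super().__init__()
        self.E = nn.Parameter(torch.randn(vocab_size, d) * 0.01)  # target embeddings, |V| x d
        self.U = nn.Linear(d, d_j)     # output projection:  g_out(E)   = act(E U^T + b_u)
        self.V = nn.Linear(d_h, d_j)   # context projection: g_inp(h_t) = act(V h_t + b_v)
        self.b = nn.Parameter(torch.zeros(vocab_size))
        self.act = act

    def forward(self, h_t):                 # h_t: batch x d_h
        g_out = self.act(self.U(self.E))    # |V| x d_j
        g_inp = self.act(self.V(h_t))       # batch x d_j
        return g_inp @ g_out.T + self.b     # logits, batch x |V|

layer = JointOutputLayer(vocab_size=32000, d=512, d_h=512, d_j=2048)
log_probs = torch.log_softmax(layer(torch.randn(8, 512)), dim=-1)
```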


Page 9: Beyond Weight Tying: Learning Joint Input-Output

Proposed Output Layer / Unique Properties

Unique properties

1. Learns explicit non-linear output and context structure

2. Allows capacity to be controlled freely by modifying d_j

3. Generalizes the notion of weight tying

• Weight tying emerges as a special case by setting g_inp(·), g_out(·) to the identity function I (a quick numerical check follows below):

p(y_t | Y_{1:t−1}, X) ∝ exp(g_out(E) · g_inp(h_t) + b)
                      ∝ exp((I E)(I h_t) + b)
                      ∝ exp(E h_t + b) ∎
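A quick numerical check of this reduction (illustrative only; both projections are simply replaced by the identity, with d_j = d = d_h):

```python
import torch

V, d = 1000, 64
E, h_t, b = torch.randn(V, d), torch.randn(d), torch.zeros(V)

identity = lambda x: x                          # g_out = g_inp = I
joint_logits = identity(E) @ identity(h_t) + b
tied_logits = E @ h_t + b                       # weight-tied output layer

assert torch.allclose(joint_logits, tied_logits)
```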


Page 10: Beyond Weight Tying: Learning Joint Input-Output

Proposed Output Layer / Scaling Computation

Scaling computation

• Computing U · E^T is prohibitive for a large vocabulary or joint space

• Remedy: sampling-based training, which uses only a subset of V to compute the softmax (Mikolov et al., 2013); see the sketch after the table below

Model       d_j     50%     25%     5%
NMT          -      4.3K    5.7K    7.1K
NMT-tied     -      5.2K    6.0K    7.8K
NMT-joint   512     4.9K    5.9K    7.2K
NMT-joint   2048    2.8K    4.2K    7.0K
NMT-joint   4096    1.7K    2.9K    6.0K

Target tokens per second on English-German (|V| ≈ 128K); the columns give the fraction of the vocabulary sampled for the softmax.
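A minimal sketch of sampling-based training for the joint layer (illustrative; it scores only the gold token plus a random subset of the vocabulary, which is one simple variant of the idea rather than the exact scheme used here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d, d_h, d_j, k = 128000, 512, 512, 2048, 6400   # k ≈ 5% of |V|

E = nn.Parameter(torch.randn(vocab_size, d) * 0.01)
U = nn.Linear(d, d_j)            # output projection
Vp = nn.Linear(d_h, d_j)         # context projection
b = nn.Parameter(torch.zeros(vocab_size))

h_t = torch.randn(d_h)           # decoder context
y_t = torch.tensor(42)           # gold target id

sampled = torch.randint(0, vocab_size, (k,))   # negative samples (with replacement)
idx = torch.cat([y_t.view(1), sampled])        # gold token at position 0

g_out = torch.tanh(U(E[idx]))    # project only k+1 embedding rows instead of all of E
g_inp = torch.tanh(Vp(h_t))
logits = g_out @ g_inp + b[idx]  # scores over the sampled subset only

# Cross-entropy against position 0 (the gold token within the subset)
loss = F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```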


Page 11: Beyond Weight Tying: Learning Joint Input-Output

Introduction
  Background
  Motivation
Proposed Output Layer
  Joint Input-Output Embedding
  Unique properties
  Scaling Computation
Evaluation
  Data and Settings
  Quantitative Results
Conclusion

Page 12: Beyond Weight Tying: Learning Joint Input-Output

Evaluation / Data and Settings

Data and settings

Controlled experiments with LSTM sequence-to-sequence models

• English-Finnish (2.5M sentence pairs), English-German (5.8M) from WMT

• Morphologically rich and poor languages as target

• Different vocabulary sizes using BPE: 32K, 64K, ∼128K

Baselines

• NMT : softmax + linear unit

• NMT-tied : softmax + linear unit + weight tying

Input: 512, Depth: 2 layers × 512, Attention: 512, Joint dim. (d_j): 512 / 2048 / 4096, Joint activation: Tanh, Optimizer: Adam, Dropout: 0.3, Batch size: 96. Metrics: BLEU, METEOR.


Page 13: Beyond Weight Tying: Learning Joint Input-Output

Evaluation / Quantitative Results

Translation performance

• Weight tying generally matches the baseline, but not always

• The joint model yields more consistent improvements


Page 14: Beyond Weight Tying: Learning Joint Input-Output

Evaluation / Quantitative Results

Translation performance by output frequency

English-German and German-English, |V| ≈ 32K.

• The vocabulary is split into three bins of decreasing frequency

• The joint model transfers knowledge across high- and low-frequency bins


Page 15: Beyond Weight Tying: Learning Joint Input-Output

Evaluation / Quantitative Results

Do we need to learn both output and context structure?

German-English, |V| ≈ 32K.

• Ablation results show that both are essential.


Page 16: Beyond Weight Tying: Learning Joint Input-Output

Evaluation / Quantitative Results

What is the effect of increasing the output layer capacity?

[Plot: BLEU as a function of the joint space dimension d_j, with panels for Fi→En, En→De, De→En and En→Fi.]

Varying joint space dimension (d_j), |V| ≈ 32K.

• Higher capacity was helpful in most cases.


Page 17: Beyond Weight Tying: Learning Joint Input-Output

Conclusion

Conclusion

• Joint space models generalize weight tying and give more robust results than the baselines overall

• Learn explicit non-linear output and context structure

• Provide a flexible way to control capacity

Future work:

→ Use crosslingual, contextualized or descriptive representations

→ Evaluate on multi-task and zero-resource settings

→ Find more efficient ways to increase output layer capacity


Page 18: Beyond Weight Tying: Learning Joint Input-Output

Thank you! Questions?

http://github.com/idiap/joint-embedding-nmt

Acknowledgments