
On Using Monolingual Corpora in Neural Machine Translation
&
Towards better decoding and integration of language models in sequence to sequence models

Rebekka Hubert

Department of Computational Linguistics (ICL), University of Heidelberg


1 Monolingual Corpora in NMT

2 On language models in end-to-end systems

3 Conclusions


Motivation

NMT systems suffer from low resources and domain restriction

monolingual corpora exhibit linguistic structure

integrate a language model trained on monolingual corpora into an NMT model suffering from low resources or domain restriction


Model - based on [Bahdanau et al., 2014]

encoder-decoder framework

Figure 1: overview of the model [Bahdanau et al., 2014]


Encoder

bidirectional RNN

sequence of annotation vectors is the concatenation of pairs of hidden states:

$h_j^\top = [\overleftarrow{h}_j^\top; \overrightarrow{h}_j^\top]$

$h_j$ encodes information about the $j$-th word w.r.t. all of the surrounding words in the sentence
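As an illustration (not part of the original slides; all layer sizes are hypothetical), a bidirectional recurrent encoder in PyTorch emits exactly such concatenated annotation vectors:

```python
import torch
import torch.nn as nn

# Minimal sketch: a bidirectional GRU encoder whose output at position j
# already concatenates the forward and backward hidden states,
# i.e. h_j = [h_fwd_j ; h_bwd_j].
emb_dim, hid_dim, vocab = 64, 128, 1000   # hypothetical sizes

embed = nn.Embedding(vocab, emb_dim)
birnn = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)

x = torch.randint(0, vocab, (1, 7))       # one source sentence of 7 token ids
annotations, _ = birnn(embed(x))          # shape (1, 7, 2 * hid_dim)
assert annotations.shape == (1, 7, 2 * hid_dim)
```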


Decoder

emulates searching through a source sentence during decoding

at each timestep t

compute $s_t = f_r(s_{t-1}, y_{t-1}, c_t)$
compute context vector $c_t$ (expected annotation):

$c_t = \sum_{j=1}^{T_x} \alpha_{tj} h_j$

with

$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_x} \exp(e_{tk})}; \quad e_{tj} = a(s_{t-1}, h_j, y_{t-1})$

$a$: soft-alignment model (feedforward neural network)
$\alpha$: attention weights
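A minimal numpy sketch of one attention step, with a hypothetical dot-product scorer standing in for the feedforward alignment model $a$:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_context(s_prev, y_prev, annotations, score):
    """One decoder timestep: energies e_tj, softmax-normalized attention
    alpha_tj, and the context c_t = sum_j alpha_tj * h_j. `score` stands
    in for the feedforward network a(s_{t-1}, h_j, y_{t-1})."""
    e = np.array([score(s_prev, h, y_prev) for h in annotations])
    alpha = softmax(e)
    return alpha @ annotations, alpha

# toy usage with a dot-product stand-in for the MLP scorer
rng = np.random.default_rng(0)
H = rng.normal(size=(7, 8))   # annotation vectors h_1 .. h_7
s = rng.normal(size=8)        # previous decoder state s_{t-1}
y = rng.normal(size=8)        # embedding of previous output y_{t-1}
c_t, alpha_t = attention_context(s, y, H, lambda s, h, y: s @ h)
```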


Decoder

compute probability of target word yt with deep output layer

$p(y_t \mid y_{<t}) \propto \exp(y_t^\top (W_o f_o(s_t^{TM}, y_{t-1}, c_t) + b_o))$

$y_t$: $K_y$-dimensional, one-hot encoded vector indicating one of the words in the target vocabulary

for large target vocabularies: importance sampling technique [Jean et al., 2014]

$W_o \in \mathbb{R}^{K_y \times l}$: weight matrix
$f_o$: neural network with two-way maxout non-linearity
$b_o$: bias


Sampling technique

partition the training corpus and define a small subset $V'$ of unique target words for each partition

at each update, only the vectors associated with the sampled words in $V'$ and the correct word are updated

equivalent to

$p(y_t \mid y_{<t}, x) = \frac{\exp\{y_t^\top \phi(y_{t-1}, s_t, c_t) + b_t\}}{\sum_{y_k \in V'} \exp\{y_k^\top \phi(y_{t-1}, s_t, c_t) + b_k\}}$

approximates exact output probability
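A toy sketch of the sampled normalization (helper name and sizes are made up, and the full technique of [Jean et al., 2014] also corrects for the sampling proposal distribution, which is omitted here):

```python
import numpy as np

def sampled_softmax(energy_correct, energies_sampled):
    """Normalize the energy y^T phi(...) + b of the correct word over the
    sampled subset V' only, instead of over the full vocabulary."""
    e = np.concatenate(([energy_correct], energies_sampled))
    e = np.exp(e - e.max())
    return e[0] / e.sum()

# toy usage: one correct-word energy and 49 sampled-word energies
rng = np.random.default_rng(1)
p = sampled_softmax(2.0, rng.normal(size=49))
```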


Maxout Networks

characterized by the Maxout activation function

generalization of ReLU: $h_i(x) = \max_{j \in [1,k]} z_{ij}$

$z_{ij} = x^\top W_{\cdot ij} + b_{ij}; \quad W \in \mathbb{R}^{d \times m \times k}$

piecewise linear approximation to an arbitrary convex function

locally linear almost everywhere

no sparse representation can be produced

many learned parameters - best used in combination with dropout

Figure 2: maxout activation function implementing/approximating different functions [Goodfellow et al., 2013]
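A small numpy sketch of a maxout layer with the shapes from the slide (sizes are hypothetical):

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: z_ij = x^T W[:, i, j] + b_ij, then h_i(x) = max_j z_ij.
    W has shape (d, m, k): d inputs, m output units, k linear pieces each."""
    z = np.einsum('d,dmk->mk', x, W) + b   # all affine pieces in one shot
    return z.max(axis=1)                   # max over the k pieces per unit

d, m, k = 5, 3, 4                          # hypothetical sizes
rng = np.random.default_rng(2)
h = maxout(rng.normal(size=d), rng.normal(size=(d, m, k)), rng.normal(size=(m, k)))
assert h.shape == (m,)
```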


Training

Train NMT model with bilingual corpus of pairs (x (n), y (n))

$\max_\theta \frac{1}{N} \sum_{n=1}^{N} \log p_\theta(y^{(n)} \mid x^{(n)})$

θ: set of trainable parameters

Model setup

RNNLM as language model: set $c_t = 0$
NMT as described above
pre-train RNNLM and NMT separately


Integrating the language model - Fusions

Figure 3: Shallow and Deep Fusion [Gulcehre et al., 2015]


Shallow Fusion

at each timestep t

NMT: compute score of possible next words for all hypotheses $\{y_{\le t-1}^{(i)}\}$

sort new hypotheses (old hypotheses with new word added)

select top $K$ hypotheses as candidates $\{y_{\le t}^{(i)}\}_{i=1,\dots,K}$

rescore hypotheses with a weighted sum of word scores (add score of new word):

$\log p(y_t = k) = \log p_{TM}(y_t = k) + \beta \log p_{LM}(y_t = k)$

β: hyper-parameter
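A toy numpy sketch of the rescoring rule (the vocabulary and probabilities are made up):

```python
import numpy as np

def shallow_fusion(log_p_tm, log_p_lm, beta):
    """Rescore next-word candidates: log p = log p_TM + beta * log p_LM."""
    return log_p_tm + beta * log_p_lm

# toy vocabulary of three words; the LM shifts the ranking toward word 1
log_p_tm = np.log([0.6, 0.3, 0.1])
log_p_lm = np.log([0.2, 0.7, 0.1])
scores = shallow_fusion(log_p_tm, log_p_lm, beta=1.0)
top_k = np.argsort(scores)[::-1][:2]   # keep the K best candidates
```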


Deep Fusion

integrate RNNLM and decoder of NMT by concatenating their hidden states

fine-tune the model to use both hidden states by tuning only the output parameters

new input of the deep output layer:

$p(y_t \mid y_{<t}, x) \propto \exp(y_t^\top (W_o f_o(s_t^{TM}, s_t^{LM}, y_{t-1}, c_t) + b_o))$

keeps learned monolingual structure of LM intact


Deep Fusion - Balancing LM and TM

use controller mechanism to dynamically weight LM and TM models

$g_t = \sigma(v_g^\top s_t^{LM} + b_g)$

multiply the controller output with the hidden state of the LM to control the magnitude of the LM signal

decoder only regards LM when deemed necessary
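A minimal numpy sketch of the controller (vector sizes are hypothetical):

```python
import numpy as np

def controller(s_lm, v_g, b_g):
    """g_t = sigmoid(v_g^T s_t^LM + b_g); the decoder receives the gated
    LM state g_t * s_t^LM, so the LM signal can be scaled down toward 0."""
    g = 1.0 / (1.0 + np.exp(-(v_g @ s_lm + b_g)))
    return g, g * s_lm

rng = np.random.default_rng(3)
s_lm = rng.normal(size=16)               # hidden state of the RNNLM
g_t, gated_state = controller(s_lm, rng.normal(size=16), 0.0)
```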


Corpora

parallel datasets

Table 1: datasets used [Gulcehre et al., 2015]

Zh-En: training on SMS/chat, test on conversational speech
all others: close domains

monolingual dataset: English Gigaword corpus


Results - Translation Tasks

(a) Tr-En for each set

(b) Zh-En, PB: phrase-based, HPB: hierarchical phrase-based

(c) De-En and Cs-En translation tasks

Table 2: translation results on parallel corpora [Gulcehre et al., 2015]


Results - RNNLM perplexity on development set

Table 3: Perplexity of RNNLM on dev. set and g [Gulcehre et al., 2015]


Conclusions

Fusion

generally improves performance, regardless of the size of the parallel corpora
Deep Fusion achieves bigger improvements than Shallow Fusion
Deep Fusion makes a model incorporate the LM information as needed

domain similarity

divergence in domains generally hurts model performance
LM is most useful when the domains of the parallel and monolingual corpora are close

controller mechanism

makes the model more robust to domain mismatch
most active on De-En/Cs-En, as the LM was able to model the target sentences best here


Motivation - [Chorowski and Jaitly, 2016]

DNN Automatic Speech Recognition (ASR) systems suffer from overconfidence

peaked probability distribution prevents the model from finding sensible alternatives
harms learning in deep layers, as the gradient on correct characters, $[y_i = c] - p(y_i \mid y_{<i}, x)$, approaches 0

integrating LM to combat this results in incomplete transcripts

idea: better integrate language models


Model - based on [Chan et al., 2015]

Figure 4: Listen, Attend and Spell model [Chan et al., 2015]


Model

transcribes speech directly into a space-delimited sequence of characters using the encoder-decoder structure

Listener: pyramid BLSTM (pBLSTM)

$h_t^j = \mathrm{pBLSTM}(h_{t-1}^j, [h_{2t}^{j-1}, h_{2t+1}^{j-1}])$
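A minimal PyTorch sketch of one pyramid step (sizes are hypothetical; the real listener stacks several such layers):

```python
import torch
import torch.nn as nn

def pyramid_step(h):
    """Concatenate consecutive output pairs [h_2t ; h_2t+1], halving the
    number of timesteps passed to the next BLSTM layer."""
    b, t, d = h.shape                    # assumes t is even
    return h.reshape(b, t // 2, 2 * d)

lower = nn.LSTM(40, 64, bidirectional=True, batch_first=True)
upper = nn.LSTM(4 * 64, 64, bidirectional=True, batch_first=True)

x = torch.randn(1, 8, 40)                # 8 frames of 40-dim features
h, _ = lower(x)                          # (1, 8, 128)
h, _ = upper(pyramid_step(h))            # (1, 4, 128): time resolution halved
```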

Speller

attention-based LSTM transducer

$c_t = \mathrm{AttentionContext}(s_t, h)$

$s_t = \mathrm{RNN}(s_{t-1}, y_{t-1}, c_{t-1})$

$P(y_t \mid x, y_{<t}) = \mathrm{CharacterDistribution}(s_t, c_t)$

RNN: 2-layer LSTM
CharacterDistribution: MLP with softmax
beam search to find the character sequence $y^*$


Model - AttentionContext for decoder timestep t

$c_t = \sum_u \alpha_{t,u} h_u$

$\alpha_{t,u} = \frac{\exp(e_{t,u})}{\sum_{u'} \exp(e_{t,u'})}$

$e_{t,u} = w^\top \tanh(W s_{t-1} + V h_u + U f_{t,u} + b)$

$f_t = F * \alpha_{t-1}$

$\alpha$: very sharp probability distribution (attention weights)

$e_{t,u}$: location-aware scoring mechanism [Chorowski et al., 2015]

employs convolutional filters over the previous attention weights
extracts a vector $f_{t,u} \in \mathbb{R}^k$ with $k$ filters $F \in \mathbb{R}^{k \times r}$ for every position $u$ of $\alpha_{t-1}$
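A numpy sketch of the convolutional location features (filter shapes are hypothetical):

```python
import numpy as np

def location_features(alpha_prev, F):
    """f_t = F * alpha_{t-1}: convolve each of the k filters (rows of F,
    width r) over the previous attention weights, yielding f_{t,u} in R^k
    for every source position u."""
    return np.stack([np.convolve(alpha_prev, f, mode='same') for f in F],
                    axis=1)              # shape (T, k)

rng = np.random.default_rng(4)
alpha_prev = rng.dirichlet(np.ones(10))  # previous attention over T = 10 frames
f = location_features(alpha_prev, rng.normal(size=(4, 3)))  # k = 4, r = 3
assert f.shape == (10, 4)
```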


Training

minimize cross-entropy between ground-truth characters and model predictions

criterion

$\mathrm{loss}(y, x) = -\sum_i \sum_c T(y_i, c) \log p(y_i \mid y_{<i}, x)$

T (yi , c): target-label function, implemented differently for each model


Language Model Integration - based on [Gulcehre et al., 2015] Shallow Fusion

extends beam search with a language modelling term

$y^* = \arg\min_y -\log p(y \mid x) - \lambda \log p_{LM}(y) - \gamma \, \mathrm{coverage}$

$\lambda$, $\gamma$: tunable parameters

coverage: term promoting longer transcripts to avoid incomplete transcripts

overconfidence drastically inflates $-\log p(y \mid x)$ whenever a hypothesis deviates from the network's best guess

using non-heuristic measures does not change the results, yet is far more difficult to compute


Experimental Setup

dataset

Wall Street Journal dataset
extracted and augmented 80-dimensional mel-scale filterbanks
vocabulary consists of lower case letters, space, apostrophe, noise marker, SOS, EOS

LM

extended-vocabulary trigram LM constructed with Kaldi [Povey et al., 2011] and a spelling lexicon [Miao et al., 2015]

baseline

Listener: 4 BLSTM layers, 3 time-pooling layers
Speller: LSTM, character embeddings, attention MLP
beam size: 10

final model: baseline + LM


Improvement tests - Overconfidence

problem

good predictions of the speller result in little training signal flowing through the attention mechanism to the listener
next predicted character may have low accuracy
beam search's usefulness is greatly reduced

Solution 1: Temperature Parameter T

$p(y_i) = \frac{\exp(l_i / T)}{\sum_j \exp(l_j / T)}$

increasing $T$: more uniform distribution; more deletion errors; better beam search results
constrain emission of EOS
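A small numpy illustration of the temperature parameter:

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Softmax with temperature: T = 1 is the usual softmax; larger T
    flattens the peaked distribution of an overconfident network."""
    z = (logits - logits.max()) / T
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 1.0, 0.5])
print(softmax_T(logits, T=1.0))          # sharply peaked
print(softmax_T(logits, T=3.0))          # noticeably flatter
```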


Improvement tests - Overconfidence: Label Smoothing

variants of label smoothing

unigram label smoothing
neighborhood smoothing

model is regularized

higher entropy of network predictions
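A toy sketch of the smoothed targets; neighborhood smoothing (which puts mass on temporally adjacent characters instead) is omitted for brevity:

```python
import numpy as np

def smoothed_target(y, vocab_size, eps=0.1, unigram=None):
    """Replace the one-hot target for character y with a smoothed one:
    eps of the mass goes to all characters uniformly, or proportionally
    to unigram frequencies (unigram label smoothing)."""
    base = np.full(vocab_size, eps / vocab_size) if unigram is None \
        else eps * np.asarray(unigram)
    base[y] += 1.0 - eps
    return base

t = smoothed_target(2, vocab_size=5, eps=0.1)
assert np.isclose(t.sum(), 1.0)
```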


Improvement tests - Overconfidence

Figure 5: influence of the temperature parameter on the softmax [Chorowski and Jaitly, 2016]


Improvement tests - Partial Transcripts Problem

problem: using LM + beam search results in incomplete transcripts
solution: extend the beam search criterion with a coverage term to promote long transcripts

$\mathrm{coverage} = \sum_j \left[ \sum_i \alpha_{ij} > \tau \right]$

prevents looping over the utterance

Figure 6: impact of techniques to promote long sequences [Chorowski and Jaitly, 2016]
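A minimal numpy sketch of the coverage term (threshold and sizes are hypothetical):

```python
import numpy as np

def coverage(attentions, tau=0.5):
    """Count source frames j whose accumulated attention sum_i alpha_ij
    exceeds tau; adding gamma * coverage to the beam-search score rewards
    hypotheses that attend to the whole utterance instead of looping."""
    return int((attentions.sum(axis=0) > tau).sum())

rng = np.random.default_rng(5)
A = rng.dirichlet(np.ones(6), size=4)    # 4 decoding steps over 6 frames
print(coverage(A, tau=0.5))
```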


Results

(a) baseline configuration and results (b) trigram language model

Table 4: results on WSJ [Chorowski and Jaitly, 2016]

competitive with other non-HMM techniques at the time


Conclusion

integrating an LM into sequence-to-sequence models is competitive with other approaches, if done successfully

integrating an LM into a NN has proven to be difficult

the LM is especially useful in cases where the NN itself is uncertain

an ablation study to see the actual effect of the LM on the results is advisable


Critique and Discussion


References I

[Bahdanau et al., 2014] Bahdanau, D., Cho, K., and Bengio, Y. (2014).

Neural machine translation by jointly learning to align and translate.

arXiv:1409.0473. Accepted at ICLR 2015 as oral presentation.

[Chan et al., 2015] Chan, W., Jaitly, N., Le, Q. V., and Vinyals, O. (2015).

Listen, attend and spell.

CoRR, abs/1508.01211.

[Chorowski et al., 2015] Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015).

Attention-based models for speech recognition.

CoRR, abs/1506.07503.


References II

[Chorowski and Jaitly, 2016] Chorowski, J. and Jaitly, N. (2016).

Towards better decoding and language model integration in sequence to sequence models.

CoRR, abs/1612.02695.

[Goodfellow et al., 2013] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013).

Maxout networks.

[Gulcehre et al., 2015] Gulcehre, C., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H., Bougares, F., Schwenk, H., and Bengio, Y. (2015).

On using monolingual corpora in neural machine translation.

CoRR, abs/1503.03535.

[Jean et al., 2014] Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2014).

On using very large target vocabulary for neural machine translation.

CoRR, abs/1412.2007.


References III

[Miao et al., 2015] Miao, Y., Gowayyed, M., and Metze, F. (2015).

EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding.

CoRR, abs/1507.08240.

[Povey et al., 2011] Povey, D., Ghoshal, A., Boulianne, G., Goel, N., Hannemann, M., Qian, Y., Schwarz, P., and Stemmer, G. (2011).

The kaldi speech recognition toolkit.

In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU).
