
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

Institute for Anthropomatics and Robotics

www.kit.edu

Toward Multilingual Neural Machine Translation with Universal Encoder and Decoder

Thanh-Le Ha, Jan Niehues and Alexander Waibel


Outline

  Introduction: Multilingual Neural Machine Translation

  Related works

  Our proposed approach

  Experimental results

  Conclusion & Future Work


Attention-based Neural Machine Translation

[Figure: attention-based encoder-decoder NMT. The encoder reads the (spoken) German source sentence “Ich bin nach Hause gegangen”; the decoder, guided by the attention mechanism, produces the English translation “I went home”, ending with <EoS>.]


Multilingual NMT: Prospective Benefits

  Automatic language transfer: can be applied to under-resourced scenarios

 Number of parameters grows only linearly with the number of languages (rather than quadratically, as with a separate system per language pair)

[Figure: a single multilingual NMT system translating between multiple languages.]


Multilingual NMT: Challenges

  Attention is language-specific ⇒ an encoder-attention-decoder triple is needed for each language pair

 Multilingual (and multimodal) models [Luong 2016]: no attention, since attention is modality- (and language-) specific

 One-to-Many NMT [Dong 2015]: a single encoder, with one attention-decoder pair per target language


 Want a shared attention or a shared decoder? ⇒ The NMT architecture must be modified:

 Many-to-One NMT [Zoph & Knight 2016]: multiple encoders, with additional layers needed to combine their outputs before feeding them to the attention

 Multilingual (many-to-many) NMT [Firat 2016]: multi-way models with multiple encoders and decoders and a shared attention (again requiring changes to the architecture)


Multilingual NMT: Our motivations

 NMT should learn a common semantic space for all languages:

  “work”, “working” and “worked”

  “car” and “automobile”

  “player” (English), “joueur” (French), “Spieler” (German)

[Figure (from Socher 2012): semantically related words lie close together in the vector space.]


Multilingual NMT: Our approach

 NMT should learn a common semantic space for all languages

 Our multilingual NMT system should:
  learn language-independent source and target sentence representations
  have shared, language-dependent word embeddings

⇒ A simple preprocessing step: language-specific coding


  Language-specific coding: prepend a language code to every word of that language:

  (excuse me | excusez moi) (En-Fr)
  ⇒ (EN_excuse EN_me | FR_excusez FR_moi)

  (entschuldigen Sie | excusez moi) (De-Fr)
  ⇒ (DE_entschuldigen DE_Sie | FR_excusez FR_moi)
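The coding step is only a few lines of code. The helper below is a minimal Python sketch (the EN_/FR_/DE_ tag format follows the examples above; the function name and everything else are illustrative assumptions, not the authors' script):

    def add_language_code(sentence: str, lang: str) -> str:
        # Prepend the language code to every token of a whitespace-tokenized sentence.
        return " ".join(f"{lang.upper()}_{tok}" for tok in sentence.split())

    # Coding one En-Fr sentence pair from the example above:
    src = add_language_code("excuse me", "en")    # -> "EN_excuse EN_me"
    tgt = add_language_code("excusez moi", "fr")  # -> "FR_excusez FR_moi"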


  Enables the attention mechanism for multilingual NMT: everything (encoder, attention, decoder) is shared (universal)

  No need to change the NMT architecture: language-specific coding is purely a preprocessing step, so any NMT framework with any translation unit can be used

[Figure: processing pipeline: pre-processing (language-specific coding and byte-pair encoding), attention-based neural machine translation, then post-processing.]
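On the output side, post-processing only has to undo these two steps. A minimal sketch, assuming the common "@@ " sub-word continuation marker and the two-letter language tags from the previous slides (the example hypothesis is made up):

    import re

    def undo_bpe(sentence: str) -> str:
        # Merge BPE sub-word units back into words (assumes the "@@ " continuation marker).
        return sentence.replace("@@ ", "")

    def strip_language_codes(sentence: str) -> str:
        # Remove the two-letter language prefix from every token.
        return " ".join(re.sub(r"^[A-Z]{2}_", "", tok) for tok in sentence.split())

    hyp = "DE_ent@@ schuldigen DE_Sie"           # hypothetical decoder output
    print(strip_language_codes(undo_bpe(hyp)))   # -> "entschuldigen Sie"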


Experiments

  Training, validation and test data: TED talks from WIT3

  WMT’16 parallel and monolingual data

  Framework: Nematus [Sennrich 2016]; sub-word units via BPE on the joint corpus

  Vocabulary size: 40K; sentence-length cut-off at 50

  One 1024-cell GRU layer and 1000-dimensional embeddings for both encoder and decoder

  Adadelta optimizer, mini-batch size 80, gradient norm clipping at 0.1

  Dropout at every layer

  Experiments in two scenarios: a simulated under-resourced setting (En-De TED) and a large-scale, real task (IWSLT’16 En-De: WMT data, adapted on TED)
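For reference, the reported hyperparameters collected in one place. This is only a summary of the slide written as a Python dict; it is not the literal Nematus configuration:

    # Hyperparameters as reported on this slide (illustrative summary only).
    config = {
        "framework": "Nematus",
        "translation_unit": "sub-word (BPE on the joint corpus)",
        "vocab_size": 40_000,
        "max_sentence_length": 50,      # longer training sentences are cut off
        "encoder": "1 GRU layer, 1024 cells",
        "decoder": "1 GRU layer, 1024 cells",
        "embedding_dim": 1000,          # source and target word embeddings
        "optimizer": "Adadelta",
        "batch_size": 80,
        "grad_norm_clipping": 0.1,
        "dropout": "applied at every layer",
    }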


Experiments: Under-resourced scenario

  Goal: translating En to De

  Multilingual corpora: En-De TED (196K sentence pairs), Fr-De TED (165K sentence pairs)

  Two configurations: Mix-source and Multi-source


Experiments: Mix-source Multilingual NMT

[Figure: Mix-source data construction. The En-De parallel corpus is language-coded on both sides, and the coded German target side is additionally copied to the source side, so every German sentence also serves as its own translation; all resulting pairs train a single mix-source NMT system.]

  (excuse me | entschuldigen Sie) ⇒ (EN_excuse EN_me | DE_entschuldigen DE_Sie) and (DE_entschuldigen DE_Sie | DE_entschuldigen DE_Sie)
  (see ya soon | bis bald) ⇒ (EN_see EN_ya EN_soon | DE_bis DE_bald) and (DE_bis DE_bald | DE_bis DE_bald)
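Reusing the add_language_code helper sketched earlier, the mix-source corpus can be assembled in a few lines. This is a minimal sketch under that assumption, not the authors' data-preparation script:

    def build_mix_source(parallel_en_de):
        # parallel_en_de: list of (english, german) sentence pairs.
        # Returns coded (source, target) pairs: the original En->De data plus a
        # copied De->De version of every German target, doubling the corpus.
        corpus = []
        for en, de in parallel_en_de:
            de_coded = add_language_code(de, "de")
            corpus.append((add_language_code(en, "en"), de_coded))  # En -> De
            corpus.append((de_coded, de_coded))                     # De -> De (copied)
        return corpus

    pairs = [("excuse me", "entschuldigen Sie"), ("see ya soon", "bis bald")]
    for src, tgt in build_mix_source(pairs):
        print(src, "|||", tgt)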


Experiments: Multi-source Multilingual NMT

[Figure: Multi-source data construction. Two parallel corpora that share the same target language, En-De and Fr-De, are language-coded and concatenated into one training corpus for a single multi-source NMT system.]

  (excuse me | entschuldigen Sie) ⇒ (EN_excuse EN_me | DE_entschuldigen DE_Sie)
  (see ya soon | bis bald) ⇒ (EN_see EN_ya EN_soon | DE_bis DE_bald)
  (excusez moi | entschuldigen Sie) ⇒ (FR_excusez FR_moi | DE_entschuldigen DE_Sie)
  (merci beaucoup | danke schön) ⇒ (FR_merci FR_beaucoup | DE_danke DE_schön)
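In the same spirit, a minimal sketch of how the multi-source corpus could be assembled (again reusing the add_language_code helper; illustrative only):

    def build_multi_source(parallel_en_de, parallel_fr_de):
        # Concatenate two language-coded parallel corpora that share German as the target.
        corpus = []
        for en, de in parallel_en_de:
            corpus.append((add_language_code(en, "en"), add_language_code(de, "de")))
        for fr, de in parallel_fr_de:
            corpus.append((add_language_code(fr, "fr"), add_language_code(de, "de")))
        return corpus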


Experiments: Under-resourced scenario

[Figure: the two configurations for the under-resourced scenario. Mix-source feeds language-coded English and German data into one NMT system producing German; Multi-source feeds language-coded English and French data into one NMT system producing German.]

System                        | tst2013 BLEU | ΔBLEU | tst2014 BLEU | ΔBLEU
Baseline (En => De)           | 24.35        | -     | 20.62        | -
Mix-source (En,De => De,De)   | 26.99        | +2.64 | 22.71        | +2.09
Multi-source (En,Fr => De,De) | 26.64        | +2.21 | 22.21        | +1.59

  Both Mix-source and Multi-source improve translation quality significantly over the baseline


  Is the improvement simply because there is more training data (double the baseline)?
  Baseline 2: the baseline En-De corpus duplicated

System                        | tst2013 BLEU | ΔBLEU | tst2014 BLEU | ΔBLEU
Baseline (En => De)           | 24.35        | -     | 20.62        | -
Mix-source (En,De => De,De)   | 26.99        | +2.64 | 22.71        | +2.09
Multi-source (En,Fr => De,De) | 26.64        | +2.21 | 22.21        | +1.59
Baseline 2 (En => De) x2      | 24.58        | +0.23 | 20.55        | -0.07

  Simply doubling the corpus yields almost no gain, so data size alone does not explain the improvements


  Why does Multi-source perform worse than Mix-source?
  Smaller training data? Mix-source: 392K pairs, Multi-source: 361K pairs
  Mix-source 2: built on the De part of the Fr-De data, 361K pairs (the same size as Multi-source, smaller than Mix-source's 392K)

System                        | tst2013 BLEU | ΔBLEU | tst2014 BLEU | ΔBLEU
Baseline (En => De)           | 24.35        | -     | 20.62        | -
Mix-source (En,De => De,De)   | 26.99        | +2.64 | 22.71        | +2.09
Multi-source (En,Fr => De,De) | 26.64        | +2.21 | 22.21        | +1.59
Mix-source 2 (En,De => De,De) | 27.18        | +2.83 | 23.74        | +3.12

  Or does having more data in other languages confuse the NMT system? More analysis is needed (more source languages, more language types)


Experiments: Multi-source Visualization

  Take the 1000-dimensional source word embeddings of the trained Multi-source model and project them to 2-D points with t-SNE [Maaten 2008]

[Figure: t-SNE projection of En & Fr word embeddings, topic “human”.]


[Figure: t-SNE projection of En & Fr word embeddings, topic “computer”.]
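The projection step is standard. A self-contained sketch using scikit-learn's t-SNE and matplotlib; the four words and random vectors below are placeholders for the real 1000-dimensional embeddings and their language-coded vocabulary:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Placeholder data: in the experiments these are the source word embeddings
    # of the trained multilingual model and their language-coded tokens.
    words = ["EN_player", "FR_joueur", "EN_car", "FR_voiture"]
    embeddings = np.random.rand(len(words), 1000)

    points = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(embeddings)

    for (x, y), word in zip(points, words):
        color = "tab:blue" if word.startswith("EN_") else "tab:red"  # English vs. French
        plt.scatter(x, y, c=color, s=10)
        plt.annotate(word, (x, y), fontsize=8)
    plt.show()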


Experiments: Large-scale, real task

  Translate En to De for the real IWSLT’16 task. Baseline: WMT data + back-translation

  Train the Mix-source configuration on:
  1) WMT parallel data (En-De) + sampled additional monolingual data (De-De)
  2) WMT parallel data (En-De) + the monolingual German part of that parallel data (De-De)

  Adapt to TED En-De by continuing training, also using the Mix-source configuration on TED

System                                 | tst2013 BLEU | ΔBLEU | tst2014 BLEU | ΔBLEU
Baseline (En => De)                    | 25.74        | -     | 22.54        | -
1) Sampled Mix-source (En,De => De,De) | 27.74        | +2.00 | 24.39        | +1.85
2) Mono Mix-source (En,De => De,De)    | 28.89        | +3.15 | 24.86        | +2.32


Conclusion & Future work

  Conclusion
  We proposed a simple yet elegant approach to multilingual NMT

  It allows the attention mechanism to be used seamlessly

  It is only a preprocessing step; no changes to the NMT architecture are needed

  It improves translation significantly in under-resourced scenarios and provides a natural, effective way to leverage monolingual data in NMT

  Future work
  Study the impact of adding more languages

  Apply multilingual NMT in zero-resource scenarios
