
Page 1: Current trends and tricks of the trade in NMT...Brief overview of NMT Recent Advances Subword representations Using monolingual data Linguistic features for NMT Current problems and

Current trends and tricks of the trade in NMT

Barry Haddow

University of Edinburgh

FINMT, September 12th, 2016

Page 2

Collaborators

Rico Sennrich Alexandra Birch

Barry Haddow (UEDIN) Neural Machine Translation FINMT 2 / 49

Page 3

Edinburgh’s WMT Results Over the Years

[Bar chart: BLEU on newstest2013 (EN→DE), 2013–2016:
  phrase-based SMT   20.3  20.9  20.8  21.5
  syntax-based SMT   19.4  20.2  22.0  22.1
  neural MT (2016)   24.7]


Page 6

WMT16: English→German

#  score  range  system
1   0.49  1      UEDIN-NMT
2   0.40  2      METAMIND
3   0.29  3      UEDIN-SYNTAX
4   0.17  4      NYU-MONTREAL
5  −0.01  5-10   ONLINE-B
   −0.01  5-10   KIT-LIMSI
   −0.02  5-10   CAMBRIDGE
   −0.02  5-10   ONLINE-A
   −0.03  5-10   PROMT-RULE
   −0.05  6-10   KIT
6  −0.14  11-12  JHU-SYNTAX
   −0.15  11-12  JHU-PBMT
7  −0.26  13-14  UEDIN-PBMT
   −0.33  13-15  ONLINE-F
   −0.34  14-15  ONLINE-G


Page 9

Outline

Brief overview of NMT
Recent Advances
  Subword representations
  Using monolingual data
  Linguistic features for NMT
Current problems and outlook
Practical considerations

Page 10

Neural versus Phrase-based MT

phrase-based SMT
Learn segment–segment correspondences from bitext
→ training is a multistage pipeline of heuristics
→ strong independence assumptions
→ "fixed" trade-off between features

neural MT
Learn a mathematical function on vectors from bitext
→ end-to-end trained model
→ output conditioned on full source text and target history
→ non-linear dependence on information sources


Page 12

Neural Machine Translation: Encoder-Decoder-with-Attention

[Image: Philipp Koehn]

Page 13

Simplified Equations

Encoder
  →h_j = RNN(x_j, →h_{j−1})        (forward states)
  ←h_j = RNN(x_j, ←h_{j+1})        (backward states)
  h_j = [→h_j ; ←h_j]              (concatenation)

Decoder-with-Attention
  e_{ij} = f(s_{i−1}, h_j)
  c_i = Σ_j softmax_j(e_{ij}) · h_j
  s_i = RNN(y_{i−1}, s_{i−1}, c_i)

Readout
  y_i = softmax(g(s_i, c_i, y_{i−1}))

See https://github.com/emjotde/amunmt/blob/master/notebooks/dl4mt.ipynb
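One decoder step of the attention mechanism above can be sketched numerically. This is a minimal illustration with toy dimensions and random stand-in weights (W_a and U_a replace the learned scoring function f; a real system such as Nematus uses trained parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
J, d = 5, 4                      # source length, decoder hidden size
h = rng.normal(size=(J, 2 * d))  # h_j: concatenated bidirectional encoder states
s_prev = rng.normal(size=d)      # s_{i-1}: previous decoder state
W_a = rng.normal(size=(d,))      # toy stand-ins for the scoring function f
U_a = rng.normal(size=(2 * d,))

e = h @ U_a + s_prev @ W_a       # alignment scores e_ij = f(s_{i-1}, h_j)
alpha = softmax(e)               # attention weights softmax_j(e_ij)
c = alpha @ h                    # context vector c_i: weighted sum of the h_j
```

The context vector c is then fed, together with the previous output, into the decoder RNN and the readout layer.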

Page 14

Training and Inference

Training
  Minimise cross-entropy loss function
  Gradient descent using back-propagation
  Process training data in mini-batches

Inference
  Beam search with small beam (e.g. 12)
  Typically use an ensemble of models
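Beam search over next-token log-probabilities can be sketched in a few lines; step_fn and the toy distribution below are hypothetical stand-ins for a real NMT decoder:

```python
import math

def beam_search(step_fn, bos, eos, beam_size=12, max_len=10):
    """step_fn(prefix) -> {token: log-prob} for the next position."""
    beams = [([bos], 0.0)]            # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_fn(prefix).items():
                seq, s = prefix + [tok], score + lp
                (finished if tok == eos else candidates).append((seq, s))
        # keep only the beam_size best partial hypotheses
        beams = sorted(candidates, key=lambda x: -x[1])[:beam_size]
        if not beams:
            break
    return max(finished + beams, key=lambda x: x[1])

# hypothetical toy model: prefers 'a', ends sentences after three tokens
def toy_step(prefix):
    if len(prefix) >= 3:
        return {'</s>': math.log(0.9), 'a': math.log(0.1)}
    return {'a': math.log(0.6), 'b': math.log(0.4)}

best, score = beam_search(toy_step, '<s>', '</s>', beam_size=3)
print(best)  # ['<s>', 'a', 'a', '</s>']
```

A production decoder additionally normalises scores by length and batches the step function on a GPU; the pruning logic is the same.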

Page 15

Recent Advances in Neural MT

some problems:
  networks have fixed vocabulary → poor translation of rare/unknown words
  models are trained on parallel data; how do we use monolingual data?
  how do we incorporate linguistic mark-up?

recent solutions:
  subword models allow translation of rare/unknown words
  train on back-translated monolingual data
  linguistic input features

Page 16

Subwords for NMT: Motivation

MT is an open-vocabulary problem

compounding and other productive morphological processes
  they charge a carry-on bag fee.
  sie erheben eine Hand|gepäck|gebühr.

names
  Obama (English; German)
  Обама (Russian)
  オバマ (o-ba-ma) (Japanese)

technical terms, numbers, etc.

... but neural MT architectures have a small, fixed vocabulary

Page 17

Subword units

segmentation algorithms: wishlist
  open-vocabulary NMT: encode all words through a small vocabulary
  encoding generalizes to unseen words
  small text size
  good translation quality

our experiments
  after preliminary experiments, we use:
    character n-grams (with shortlist of unsegmented words)
    segmentation via byte pair encoding

Page 18

Byte pair encoding for word segmentation

bottom-up character merging
  iteratively replace the most frequent symbol pair ('A','B') with 'AB'
  apply on dictionary, not on full text (for efficiency)
  output vocabulary: character vocabulary + one symbol per merge

word                 frequency
'l o w </w>'         5
'l o w e r </w>'     2
'n e w e s t </w>'   6
'w i d e s t </w>'   3

merge operations learned:
  ('e', 's') → 'es'
  ('es', 't') → 'est'
  ('est', '</w>') → 'est</w>'
  ('l', 'o') → 'lo'
  ('lo', 'w') → 'low'
  ...
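The merge-learning loop is compact enough to show in full; this is essentially the algorithm from the BPE-for-NMT paper, run on the toy dictionary above:

```python
import collections
import re

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs in the dictionary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
merges = []
for _ in range(5):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent pair wins
    merges.append(best)
    vocab = merge_vocab(best, vocab)

print(merges)
# [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
```

The number of merge operations (here 5, in practice tens of thousands) directly controls the subword vocabulary size.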


Page 24

Byte pair encoding for word segmentation

why BPE?
  trade-off between vocabulary size and text length
  (≈10% increase in text length for vocabulary size 60000)
  open-vocabulary: learned operations can be applied to unknown words
  alternative view: character-level model on compressed text

applying the learned merges to the unseen word 'lowest':
  'l o w e s t </w>' → 'l o w es t </w>' → 'l o w est </w>'
  → 'l o w est</w>' → 'lo w est</w>' → 'low est</w>'

merge operations:
  ('e', 's') → 'es'
  ('es', 't') → 'est'
  ('est', '</w>') → 'est</w>'
  ('l', 'o') → 'lo'
  ('lo', 'w') → 'low'
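Segmentation at test time replays the learned merges in order. A minimal sketch (a greedy replay, simpler than the prioritised implementation in the subword-nmt toolkit):

```python
def apply_bpe(word, merges):
    """Apply learned merge operations, in order, to a single word."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # merge the adjacent pair in place
            else:
                i += 1
    return symbols

merges = [('e', 's'), ('es', 't'), ('est', '</w>'),
          ('l', 'o'), ('lo', 'w')]
print(apply_bpe('lowest', merges))  # ['low', 'est</w>']
```

Because 'lowest' never occurred in the training dictionary, this is exactly the open-vocabulary property: unknown words decompose into known subword units.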


Page 30

Evaluation: data and methods

data
  WMT 15 English→German and English→Russian

model
  attentional encoder–decoder neural network
  parameters and settings as in [Bahdanau et al., 2014]

Page 31

Translation quality

en-de
  system                      BLEU
  syntax-based (EMNLP15)      24.4
  word-level (with back-off)  22.0
  character bigrams           22.8
  BPE                         22.8
  joint BPE (ensemble)        24.7

en-ru
  system                      BLEU
  phrase-based (WMT15)        24.3
  word-level (with back-off)  19.1
  character bigrams           20.9
  joint BPE                   20.4
  joint BPE (ensemble)        24.1

Page 32

Unigram F1 EN→DE

[Plot: unigram F1 vs. training-set frequency rank (German), ranks 10^0–10^6,
with vocabulary cut-offs marked at 50000 and 500000. Systems: BPE;
character bigrams (50k shortlist); character bigrams (500k shortlist);
word-level (with back-off); word-level (no back-off)]


Page 37

Examples

system                      sentence
source                      health research institutes
reference                   Gesundheitsforschungsinstitute
word-level (with back-off)  Forschungsinstitute
character bigrams           Fo|rs|ch|un|gs|in|st|it|ut|io|ne|n
BPE                         Gesundheits|forsch|ungsin|stitute

source                      rakfisk
reference                   ракфиска (rakfiska)
word-level (with back-off)  rakfisk → UNK → rakfisk
character bigrams           ra|kf|is|k → ра|кф|ис|к (ra|kf|is|k)
BPE                         rak|f|isk → рак|ф|иска (rak|f|iska)

Page 38

Aside: BPE for Phrase-based MT

WMT15 translation task
  Built phrase-based fi→en system with all OPUS data
  Scored well on BLEU, less well on human eval.

WMT16 translation task
  Applied BPE to source (Finnish) only
  Gained +1.2 BLEU on test set
  Ranked same as best online systems


Page 40

Aside: BPE for Phrase-based MT (Examples)

source     Myös Intian on sanottu olevan kiinnostunut puolustusyhteistyösopimuksesta Japanin kanssa.
base       India is also said to be interested in puolustusyhteistyösopimuksesta with Japan.
bpe        India is also said to be interested in defence cooperation agreement with Japan.
reference  India is also reportedly hoping for a deal on defence collaboration between the two nations.

source     Balotelli oli vielä kaukana huippuvireestään.
base       Balotelli was still far from huippuvireestään.
bpe        Baloo, Hotel was still far from the peak of its vitality.
reference  Balotelli is still far from his top tune.

Page 41

Conclusion – Subwords in NMT

translation of many rare/unknown words is transparent
subword units enable open-vocabulary NMT
improved translation quality for rare words
model-independent: simple pre-/postprocessing

Page 42

Monolingual Data in NMT

Why Monolingual Data for Phrase-based SMT?
  more training data ✓
  relax independence assumptions ✓
  more appropriate training data (domain adaptation) ✓

Why Monolingual Data for NMT?
  more training data ✓
  relax independence assumptions ✗
  more appropriate training data (domain adaptation) ✓

Page 43

Monolingual Data in NMT

encoder-decoder already conditions on previous target words

no architecture change required to learn from monolingual data

Page 44

Monolingual Training Instances

Output prediction
  p(y_i) is a function of hidden state s_i, previous output y_{i−1}, and source context vector c_i
  only difference to a monolingual RNN: c_i

Problem
  we have no source context c_i for monolingual training instances

Solutions
  two methods to deal with missing source context:
  empty/dummy source context c_i
    → danger of unlearning conditioning on source
  produce synthetic source sentence via back-translation
    → get approximation of c_i


Page 46

Monolingual Training Instances

Dummy source
  1-1 mix of parallel and monolingual training instances
  randomly sample from monolingual data each epoch
  freeze encoder/attention layers for monolingual training instances

Synthetic source
  1-1 mix of parallel and monolingual training instances
  randomly sample from back-translated data
  training does not distinguish between real and synthetic parallel data
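The synthetic-source recipe reduces to: back-translate target-side monolingual text with a reverse-direction system, then mix it with the real bitext. A schematic sketch; translate_to_source is a hypothetical stand-in for a trained target→source NMT system:

```python
import random

def build_training_data(parallel, mono_target, translate_to_source, seed=0):
    """parallel: list of (src, tgt) pairs; mono_target: target-language sentences."""
    # back-translate monolingual target sentences to get synthetic source sides
    synthetic = [(translate_to_source(t), t) for t in mono_target]
    data = parallel + synthetic      # training does not distinguish the two
    random.Random(seed).shuffle(data)
    return data

# toy stand-in "translator" that just reverses the words
fake_bt = lambda sent: ' '.join(reversed(sent.split()))
data = build_training_data([('ein Haus', 'a house')], ['the red house'], fake_bt)
```

The key property is that the target side of every pair is genuine text, so the decoder (which acts as a language model) always trains on real data.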

Page 47

Evaluation: WMT 15 English→German

[Bar chart, BLEU (NMT systems are ensembles of 4):
  syntax-based   24.4
  parallel only  23.6
  +monolingual   24.6
  +synthetic     26.5]

Page 48

Evaluation: WMT 15 German→English

[Bar chart, BLEU:
  PBSMT          29.3
  parallel only  26.7
  +synthetic     30.4
  +synth-ens4    31.6]

Page 49

Why is monolingual data helpful?

Domain adaptation effect
Reduces over-fitting
Improves fluency

(See our ACL paper for more analysis.)

Page 50

Why Linguistic Features?

disambiguate words by POS

English       German
close (verb)  schließen
close (adj)   nah
close (noun)  Ende

source        We thought a win like this might be close (adj).
reference     Wir dachten, dass ein solcher Sieg nah sein könnte.
baseline NMT  *Wir dachten, ein Sieg wie dieser könnte schließen.

Page 51

Why Linguistic Features?

better generalization; combat data sparsity

word form     lemma         morph. features
liegen (lie)  liegen (lie)  3.p.pl. present
liegst (lie)  liegen (lie)  2.p.sg. present
lag (lay)     liegen (lie)  3.p.sg. past
läge (lay)    liegen (lie)  3.p.sg. subjunctive II


Page 53

Neural Machine Translation: Multiple Input Features

Use separate embeddings for each feature, then concatenate

baseline: only word feature
  E(close) = [0.5, 0.2, 0.3, 0.1]

|F| input features
  E1(close) = [0.4, 0.1, 0.2]
  E2(adj)   = [0.1]
  E1(close) ‖ E2(adj) = [0.4, 0.1, 0.2, 0.1]
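The concatenation on this slide is easy to express directly; the embedding tables below are toy values copied from the example (a real system learns them jointly with the rest of the network):

```python
import numpy as np

# toy embedding tables, one per input feature (values from the slide)
E1 = {'close': np.array([0.4, 0.1, 0.2])}   # word feature
E2 = {'adj':   np.array([0.1])}             # POS feature

def embed(word, pos):
    """Look up each feature separately, then concatenate the vectors."""
    return np.concatenate([E1[word], E2[pos]])

print(embed('close', 'adj'))  # [0.4 0.1 0.2 0.1]
```

The concatenated vector replaces the plain word embedding as the encoder input, so the rest of the architecture is unchanged.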

Page 54

Linguistic Features

lemmas
morphological features
POS tags
dependency labels
BPE tags

Page 55

Evaluation

Data
  WMT16 training/test data
  English↔German and English→Romanian

Tools
  NMT tool: Nematus (fork of dl4mt-tutorial)
  German annotation: ParZu
  English annotation: Stanford CoreNLP

Page 56

Results: BLEU ↑

[Bar chart, BLEU:
  system                          EN→DE  DE→EN  EN→RO
  baseline                         27.8   31.4   23.8
  all features                     28.4   32.9   24.8
  baseline (+synthetic data)       33.1   37.5   28.2
  all features (+synthetic data)   33.2   38.5   29.2]

Page 57

Linguistic features: Summary

Linguistic input features improve neural MT over strong baselines

Many possibilities for source-side features
  Domain/topic adaptation
  Semantically-motivated tags
  ...

Target-side features (cf. multi-task learning)

Page 58

Putting it all together: WMT16 Results

[Bar chart: BLEU for EN→CS, EN→DE, EN→RO, EN→RU, CS→EN, DE→EN, RO→EN and RU→EN, showing cumulative gains from parallel data, +synthetic data, +ensemble, and +R2L reranking]


Also New in NMT from ACL

- Coverage mechanism [Tu et al.]
- Minimum risk training [Shen et al.]
- Character-based models [Chung et al., Luong and Manning, Costa-jussà and Fonollosa]
- Copying input to output [Gu et al., Gülçehre et al.]
- Incorporating syntax [Eriguchi et al.]
- Using monolingual data [Cheng et al.]


Current Problems in NMT (and Opportunities)


Length bias

[Figure from Tu et al., ACL 2016]


Fluency vs. Adequacy (ru-en)

                          Adequacy          Fluency
ONLINE-G                0.115   74.2     0.100   69.9
AMU-UEDIN               0.103   73.3     0.178   72.2
ONLINE-B                0.083   72.8     0.030   67.8
NRC                     0.060   72.7     0.092   69.9
PROMT-RULE-BASED        0.044   72.1    −0.102   63.8
UEDIN-NMT               0.011   71.1     0.245   74.3
ONLINE-A               −0.007   70.8     0.020   66.7
AFRL-MITLL-PHRASE      −0.040   70.1     0.047   68.4
AFRL-MITLL-CONTRAST    −0.071   69.3    −0.020   66.5
ONLINE-F               −0.322   61.8    −0.472   54.7

[Bojar et al. WMT 2016]


Other Problems in NMT

- Lack of interpretability
- Choice of training regime, convergence, variability
- Long training times
- Dealing with domain shift
- No alignment between source and output


Dealing with Morphological Complexity

Good progress so far
- WMT16 evaluation shows good results in Czech, German, ...
- Analysis of PBMT vs NMT shows 20% fewer morphological errors [Bentivogli et al., EMNLP 2016]

Opportunities for Further Progress
- Lots of flexibility on representations
- Output conditioned on whole source
- Source/target linguistic analyses easily added
- Consider interaction of attention and representation


Practical Considerations


Hardware

- RAM/disk/CPU requirements much more modest than "traditional SMT"
- GPU memory important – at least 12 GB for our experiments


Training times (WMT16)

              en-cs   en-de
Sentences       52M    7.8M
Training        21d     12d
Fine-tuning      8d       –

- For en-de the synthetic data was mixed with the parallel data
- Convergence monitored by perplexity and BLEU on held-out data
- Save every 30,000 steps, ensemble of last 4 savepoints
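The savepoint ensembling above can be made concrete: one common way to combine several savepoints at decoding time is to average their per-step log-probabilities (a geometric mean of the probability distributions). A minimal NumPy sketch, not the actual Nematus implementation:

```python
import numpy as np

def ensemble_step(model_logprobs):
    """Combine per-model next-token log-probabilities by averaging
    them (geometric mean of probabilities), then renormalising."""
    avg = np.mean(model_logprobs, axis=0)   # average log-probs across models
    avg -= np.logaddexp.reduce(avg)         # renormalise so probs sum to 1
    return avg

# Toy example: two "savepoints" scoring a 3-word vocabulary.
m1 = np.log(np.array([0.7, 0.2, 0.1]))
m2 = np.log(np.array([0.5, 0.3, 0.2]))
combined = ensemble_step(np.stack([m1, m2]))
assert np.isclose(np.exp(combined).sum(), 1.0)
assert combined.argmax() == 0   # both models prefer word 0
```

In beam search this combination step would run once per output position, with each savepoint advanced in lockstep.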


Our Tools

Nematus (https://github.com/rsennrich/nematus)
- Training and decoding of NMT models
- Based on dl4mt by Kyunghyun Cho and others
- Slightly different model than the Groundhog/Bahdanau paper
- Adds ensembles, input features, n-best output, dropout, ...
- All Python, Theano backend

AmuNMT (https://github.com/emjotde/amunmt)
- Fast C++ decoder for Nematus models
- Directly interfaces CUDA libraries
- Also enables fast CPU decoding

subword-nmt (https://github.com/rsennrich/subword-nmt)
- BPE learn and apply ... and the chrF metric

WMT16 scripts and models
- https://github.com/rsennrich/wmt16-scripts
- http://data.statmt.org/rsennrich/wmt16_systems/


NMT Toolkits

https://github.com/jonsafari/nmt-list


Acknowledgments

Some of the research presented here was conducted in cooperation with Samsung Electronics Polska sp. z o.o. – Samsung R&D Institute Poland.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 645452 (QT21) and 644402 (HimL).


Thank You


Prior work

back-off models [Jean et al. (2015), Luong et al. (2014)]
- compounds: hard to model 1-to-many relationships
- names: if alphabets differ, we need transliteration

Huffman encoding of rare words [Chitnis and DeNero (2015)]
- no generalization to unseen words

subword units in PBSMT
- different challenges: limited vocabulary in NMT vs. strong independence assumptions in PBSMT


Rare words often have transparent translation

100 rare German words (not among top 50000):
- 56 compounds            Stallhygiene → stable hygiene
- 21 names                Reuss → Reuss
-  6 cognates/loanwords   emanzipieren → emancipate
-  5 transparent affixes  süßlich → sweetish
-  1 number               2346 → 2346
-  1 technical term       ingres.utf8 → ingres.utf8
- 10 other                Vermietern → landlords

Hypothesis: we can translate rare/unseen words on the subword level
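The hypothesis above can be illustrated by segmenting an unseen word into known units. A toy sketch of applying BPE merge operations to one word; the merge list here is made up for illustration, whereas real merges come from codes learned on a corpus:

```python
def apply_bpe(word, merges):
    """Greedily apply learned merge operations, in order, to one word."""
    symbols = list(word) + ['</w>']       # start from characters + end marker
    for a, b in merges:
        i, out = 0, []
        while i < len(symbols):
            # merge each occurrence of the pair (a, b) into one symbol
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merges, as might be learned from text with frequent "-est" words.
merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
assert apply_bpe('lowest', merges) == ['low', 'est</w>']
```

An unseen word like "lowest" thus decomposes into subword units the model has seen in training.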


Core idea: transparent translations

Some translations are semantically/phonologically transparent.

Morphologically complex words (e.g. compounds):
- solar system (English)
- Sonnen|system (German)
- Nap|rendszer (Hungarian)

Named entities:
- Obama (English; German)
- Обама (Russian)
- オバマ (o-ba-ma) (Japanese)

Cognates and loanwords:
- claustrophobia (English)
- Klaustrophobie (German)
- Клаустрофобия (Russian)


BPE: increasing segmentation consistency

Joint BPE
- being able to align subword units 1-to-1 is helpful
- to increase consistency, we concatenate text in both languages to learn BPE
- for Cyrillic, we use ISO 9 romanization

Example:
- BPE:       Mirz|ayeva → Мир|за|ева (Mir|za|eva)
- joint BPE: Mir|za|yeva → Мир|за|ева (Mir|za|eva)


Unigram F1 EN→DE

                             all words   rare words (rank > 50000)   unseen words
word-level (with back-off)        58.1                        36.8           36.8
character bigrams                 58.4                        40.5           30.9
BPE                               58.5                        41.8           33.6


Unigram accuracy EN→RU

[Plot: unigram F1 vs. training set frequency rank (Russian), log scale from 10⁰ to 10⁶, with markers at ranks 50000 and 500000; curves for BPE, character bigrams (50k shortlist), word-level (with back-off), and word-level (no back-off)]

Unigram accuracy EN→RU

                             all words   rare words (rank > 50000)   unseen words
word-level (with back-off)        54.8                        26.5            6.6
character bigrams                 55.2                        27.8           17.4
BPE                               55.8                        29.7           18.3


Analysis: Domain Adaptation with Monolingual Data

           WMT (in-domain)   IWSLT (out-of-domain)
BLEU gain             +2.9                    +1.2

Table: Gains on English→German from adding synthetic News Crawl data.

→ domain adaptation partially explains the improvement


Analysis: Domain Adaptation with Monolingual Data

   system                     BLEU (IWSLT tst2015)
0  phrase-based                               25.5
1  NMT out-of-domain                          25.5
2  1 + in-domain synthetic                    26.7
3  1 + in-domain parallel                     28.4

Domain adaptation by fine-tuning on in-domain data
- out-of-domain: WMT (8 million sentence pairs)
- in-domain: TED (200k sentence pairs)


Analysis: Overfitting

[Plot: Turkish→English cross-entropy vs. training time (training instances ·10⁶); training and dev curves for systems trained on parallel data, GWmono, and GWsynth]

The Effect of Back-Translation Quality

Use different systems for German→English back-translation:
- baseline system (beam size 1)
- baseline system (beam size 12)
- system that was trained with synthetic data

back-translation       BLEU (DE→EN)   BLEU (EN→DE)
none                              –           23.6
parallel (beam 1)            (22.3)           26.0
parallel (beam 12)           (25.0)           26.5
synthetic (beam 12)          (28.3)           26.6


Comparison to PBSMT

Synthetic parallel data
- rationale in PBSMT: phrase-table domain adaptation [?, ?, ?]
- rationale in NMT: allow training on monolingual data

              WMT (in-domain)   IWSLT (out-of-domain)
PBSMT gain               +0.7                    +0.1
NMT gain                 +2.9                    +1.2

Table: Gains on English→German from adding synthetic News Crawl data.


Analysis: Fluency

Word-level fluency
- models create unseen words via subword units, e.g. civil rights protections → Bürger|rechts|schutzes
- how many of them are fluent?
  - attested in a monolingual corpus
  - natural according to a native speaker
- example of disfluency: asbestos mats → *As|best|atten (correct: As|best|matten)

[Bar chart: percentage of fluent unseen words (attested / natural) for systems trained on parallel, +mono, and +synthetic data]


Why Linguistic Features?

Guide reordering with syntactic information:

source        Gefährlich[pred] ist die Route[subj] aber dennoch.
reference     However the route is dangerous.
baseline NMT  *Dangerous is the route, however.


Linguistic Features: Example

Leonidas begged in the arena .

words         Le:       oni:      das       beg:  ged   in    the  arena  .
lemmas        Leonidas  Leonidas  Leonidas  beg   beg   in    the  arena  .
subword tags  B         I         E         B     E     O     O    O      O
POS           NNP       NNP       NNP       VBD   VBD   IN    DT   NN     .
dep           nsubj     nsubj     nsubj     root  root  prep  det  pobj   root
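In the model, each feature stream gets its own embedding table, and the per-feature embeddings are concatenated into one input vector per subword. A NumPy sketch with tiny made-up vocabularies; the embedding sizes follow the vocabulary/embedding-size table elsewhere in the deck, with the word embedding sized so the total matches the 500-dimensional baseline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (vocab size, embedding dim) per feature; vocabularies are
# shrunk for the sketch, dims chosen so the concatenation sums to 500.
features = {
    'word':    (1000, 350),
    'lemma':   (1000, 115),
    'subword': (4, 5),
    'pos':     (54, 10),
    'morph':   (1400, 10),
    'dep':     (46, 10),
}
tables = {k: rng.standard_normal(shape) for k, shape in features.items()}

def embed(token_features):
    """Look up one embedding per feature and concatenate them."""
    return np.concatenate([tables[k][idx] for k, idx in token_features.items()])

# One token, e.g. the subword "beg:" with its annotations (indices made up).
vec = embed({'word': 17, 'lemma': 3, 'subword': 0, 'pos': 12, 'morph': 7, 'dep': 4})
assert vec.shape == (500,)   # same total size as the word-only baseline
```

Because the concatenated vector keeps the baseline's total size, any gains come from the extra information, not from extra model capacity.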


Results: Perplexity ↓

              English→German  German→English  English→Romanian
baseline                54.9            47.3              74.9
all features            52.9            46.2              72.7


Results: Individual Features (Perplexity ↓)

English→German, perplexity relative to baseline:

all features        −2.0
lemmas              −1.5
subword tags        −0.2
POS tags            −1.7
dependency labels   −0.9


Results: BLEU ↑

              English→German  German→English  English→Romanian
baseline                27.8            31.4              23.8
all features            28.4            32.9              24.8


Evaluation: Synthetic Data

use synthetic training data from back-translated monolingual corpus

→ stronger baseline, annotation of noisy input


Results: Perplexity ↓

                                English→German  German→English  English→Romanian
baseline                                  54.9            47.3              74.9
all features                              52.9            46.2              72.7
baseline (+synthetic data)                49.7            45.2              50.9
all features (+synthetic data)            48.4            44.1              50.1


Evaluation: Vocabulary and Embedding Size

                    input vocabulary       model        embedding size
feature             EN        DE           vocabulary   all    single
subword tags        4         4            4            5      5
POS tags            46        54           54           10     10
morph. features     –         1400         1400         10     10
dependency labels   46        33           46           10     10
lemmas              800000    1500000      85000        115    167
words               78500     85000        85000        *      *

*: size is chosen to give total embedding size 500 (=baseline)


Examples

source        We thought a win like this might be close[adj].
reference     Wir dachten, dass ein solcher Sieg nah sein könnte.
baseline NMT  Wir dachten, ein Sieg wie dieser könnte schließen.
all features  Wir dachten, ein Sieg wie dieser könnte nah sein.

source        Gefährlich[pred] ist die Route[subj] aber dennoch.
reference     However the route is dangerous.
baseline NMT  Dangerous is the route, however.
all features  However, the route is dangerous.
