
Neural Machine Translation
what's linguistics got to do with it?

Rico Sennrich
University of Edinburgh

Setting the Scene: 2014–2015

research trend: more linguistics for statistical machine translation

  syntax-based LM [Sennrich, TACL 2015]
  [figure: dependency tree (root, subj, obja, attr, det) for "Laura hat einen kleinen Garten" / "Laura has a small garden"]

  morphological structure [Sennrich, Haddow, EMNLP 2015]
  [figure: morphological segmentation lattice for "eine Hand|gepäck|gebühr"]

a new challenger appears: neural machine translation

  requires minimal domain knowledge
  similar models used for speech and computer vision

Edinburgh's* WMT Results over the Years

[bar chart: BLEU (newstest2013, EN→DE) per year for phrase-based SMT, syntax-based SMT, and neural MT;
 2013–2015 values: 20.3, 20.9, 20.8, 19.4, 20.2, 22.0, 18.9; 2016–2017 values: 21.5, 22.1, 24.7, 26.0]

*NMT 2015 from U. Montréal: https://sites.google.com/site/acl16nmt/

What Now?

do we still need linguistics for MT?


Today's Talk

case studies on how linguistics is helping neural MT research:

  linguistically motivated (but non-linguistic) models
  targeted evaluation of neural MT
  linguistically informed models

NMT: what's linguistics got to do with it?

1 Linguistically Motivated (but Non-Linguistic) Models
2 Targeted Evaluation of Neural MT
3 Linguistically Informed Models

Open-Vocabulary Neural MT

problem
  word-level neural networks use one-hot encoding
  → closed and small vocabulary (toy sketch below)

this gets you 95% of the way...
  ... if you only care about automatic metrics

why 95% is not enough
  rare outcomes have high self-information

source                               The indoor temperature is very pleasant.
reference                            Das Raumklima ist sehr angenehm.
[Bahdanau et al., 2015]              Die UNK ist sehr angenehm. ✗
[Jean et al., 2015]                  Die Innenpool ist sehr angenehm. ✗
[Sennrich, Haddow, Birch, ACL 2016]  Die Innen+ temperatur ist sehr angenehm. ✓

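To make the closed-vocabulary limitation concrete, here is a toy sketch (hypothetical mini-vocabulary, not from any actual system): every word outside the fixed list collapses onto the same UNK index, so rare words become indistinguishable to a word-level model.

```python
import numpy as np

# toy closed vocabulary (assumed); anything outside it maps to <unk>
vocab = {"<unk>": 0, "the": 1, "indoor": 2, "is": 3, "very": 4, "pleasant": 5}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab.get(word, vocab["<unk>"])] = 1.0  # all unknown words share one index
    return vec

print(one_hot("pleasant"))     # distinct vector
print(one_hot("temperature"))  # identical to every other out-of-vocabulary word
```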

Subword Neural MT [Sennrich, Haddow, Birch, ACL 2016]

linguistic motivation
  translation is open-vocabulary problem
  rare words matter
  morphological typology: 1-to-many translations are common
  → problem for back-off mechanism
  rare words are often morphologically complex and can be broken down into smaller units

  solar system (English)
  Sonnen|system (German)
  Nap|rendszer (Hungarian)

Subword Neural MT

goal
  subword segmentation that:
    uses a closed vocabulary of subword units
    can represent open vocabulary (including unknown words)
    minimizes the sequence length (given the vocabulary size)

solution
  greedy compression algorithm: byte pair encoding (BPE) [Gage, 1994]
  we adapt BPE to word segmentation (sketched below)
  hyperparameter: vocabulary size

vocabulary size  text
300              t+ h+ e i+ n+ d+ o+ o+ r t+ e+ m+ p+ e+ r+ a+ t+ u+ r+ e i+ s v+ e+ r+ y p+ l+ e+ a+ s+ a+ n+ t
1300             the in+ do+ or t+ em+ per+ at+ ure is very p+ le+ as+ ant
10300            the in+ door temper+ ature is very pleasant
50300            the indoor temperature is very pleasant
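
The BPE merge-learning loop referenced above is small enough to sketch. This is an illustrative version on a toy word-frequency dictionary (assumed data); the released subword-nmt implementation additionally handles end-of-word markers and applies the learned merges to new text.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """word_freqs: dict mapping words to counts; returns the list of learned merges."""
    # start from character sequences
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new symbol
        merges.append(best)
        # replace every occurrence of the best pair with the merged symbol
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

print(learn_bpe({"indoor": 5, "door": 10, "in": 20}, num_merges=3))
```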

Subword NMT: Translation Quality

[bar chart: BLEU]
                                                     EN-DE  EN-RU
word-level NMT (with back-off) [Jean et al., 2015]    22.0   19.1
subword-level NMT                                     22.8   20.4

Subword NMT: Translation Quality

[plot: unigram F1 (EN-RU) as a function of training set frequency rank (10^0 to 10^6), for subword-level, word-level (with back-off), and word-level (no back-off) models; markers at ranks 150 000 and 500 000]

Linguistically Motivated Models

[examples: shared radical 氵 (water) in 河 river, 湖 lake, 海 sea; attention matrices with alignment features, global fertility, and symmetric joint training (Figure 4, Et↔En, from Cohn et al., 2016); the ambiguous word "bank" with French "banque"/"bord" and German "Ufer"]

logographic input             structural alignment biases    multi-source translation
[Costa-jussà et al., 2017]    [Cohn et al., 2016]            [Zoph and Knight, 2016]
[Cai and Dai, 2017]

NMT: what's linguistics got to do with it?

1 Linguistically Motivated (but Non-Linguistic) Models
2 Targeted Evaluation of Neural MT
3 Linguistically Informed Models

What Hypotheses Do We Test?

hypothesis: model A obtains higher BLEU than model B on data set X

[photo: Bruno Bastos / CC BY 2.0]


What Hypotheses Do We Test?

hypothesis: model A is a better model of translation than model B
evidence:   model A obtains higher BLEU than model B on data set X

[photo: Tim Sheerman-Chase / CC BY 2.0]


What Hypotheses Do We Test?

many languages have long-distance interactions.

hypothesis: model A produces disfluent output because it models these interactions poorly.
            model B can better model long-distance interactions, and produces more fluent output.
evidence:   model A obtains higher BLEU than model B on data set X

[photo: Francis Victoria Gumapac / CC BY 2.0]


What Hypotheses Do We Test?

being able to test our hypotheses is the beauty of empirical NLP
complex, interesting hypotheses need targeted evaluation
I want to see more interesting hypotheses
→ we need more targeted evaluation

Human Evaluation of Neural MT [Bojar et al., 2016]

Fluency: is the translation good English? (UEDIN-NMT vs. ONLINE-B: +13%)
Adequacy: is the meaning preserved? (+1%)

Fluency       CS→EN  DE→EN  RO→EN  RU→EN
ONLINE-B       64.6   68.4   66.7   67.8
UEDIN-NMT      78.7   77.5   71.9   74.3

Adequacy      CS→EN  DE→EN  RO→EN  RU→EN
ONLINE-B       70.8   72.2   73.9   72.8
UEDIN-NMT      75.4   75.8   71.2   71.1

Figure: WMT16 direct assessment results

Human Evaluation in TraMOOC
[Castilho, Moorkens, Gaspari, Sennrich, Sosoni, Georgakopoulou, Lohar, Way, Miceli Barone, Gialama, MT Summit XVI, 2017]

direct assessment of NMT (vs. PBSMT):
  fluency:  +10%
  adequacy:  +1%

Error Annotation
category                  SMT   NMT   difference
inflectional morphology  2274  1799   -21%
word order               1098   691   -37%
omission                  421   362   -14%
addition                  314   265   -16%
mistranslation           1593  1552    -3%
"no issue"                449   788   +75%

Human Evaluation of Neural MT

Neural Machine Translation is very fluent.
  ...with an attentional encoder-decoder with BPE segmentation and recurrent GRU decoder

  what about...?
    character-level models [Lee et al., 2016]
    convolutional models [Gehring et al., 2017]
    models with self-attention [Vaswani et al., 2017]

Adequacy remains a major problem in Neural Machine Translation.
  ...using a shallow NMT model at WMT 2016

  how...?
    do we compare different architectures?
    do we measure improvement over time?


How to Assess Specific Aspects in MT?

human evaluation
  × costly; hard to compare to previous work

automatic metrics (BLEU)
  × too coarse; blind towards specific aspects

contrastive translation pairs
  NMT models assign probability to any translation
  binary classification task: which translation is better?
  choice between reference translation and contrastive variant
  → corrupted with single error of specific type
  ≈ minimal pairs in linguistics


Assessment with Contrastive Translation Pairs

workflow
  researcher wants to analyse difficult translation problem
  researcher predicts what errors NMT system might make
  researcher creates test set with correct translations and corrupted variants
  test set allows automatic, quantitative, and reproducible analysis of NMT model

example: subject–verb agreement
  change grammatical number of verb to introduce agreement error
  35 000 contrastive pairs created with simple linguistic rules (sketched below)
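
The "simple linguistic rules" can be as small as a table of singular/plural verb-form swaps. An illustrative sketch follows (the verb table and whitespace tokenisation are toy assumptions, not the actual LingEval97 rules):

```python
# toy singular <-> plural verb-form pairs (illustrative subset only)
NUMBER_SWAP = {"wird": "werden", "werden": "wird",
               "ist": "sind", "sind": "ist",
               "hat": "haben", "haben": "hat"}

def corrupt_agreement(reference, verb_index):
    """Swap the grammatical number of the verb at position verb_index (0-based)."""
    tokens = reference.split()
    verb = tokens[verb_index]
    if verb not in NUMBER_SWAP:
        return None  # no rule for this verb form
    tokens[verb_index] = NUMBER_SWAP[verb]
    return " ".join(tokens)

ref = "... , dass der Plan verabschiedet wird"
print(corrupt_agreement(ref, verb_index=6))
# ... , dass der Plan verabschiedet werden
```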


Contrastive Translation Pairs

subject–verb agreement
                       sentence                                        prob.
English                [...] that the plan will be approved
German (correct)       [...], dass der Plan verabschiedet wird         0.1   ✓
German (contrastive)   * [...], dass der Plan verabschiedet werden     0.01
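
Evaluation then reduces to checking whether the model assigns the correct variant a higher score than every corrupted one. A minimal sketch, where model.score is a hypothetical stand-in for any NMT model that returns log P(target | source):

```python
def contrastive_accuracy(model, test_set):
    """test_set: iterable of (source, reference, contrastive_variants)."""
    correct = 0
    for source, reference, contrastive_variants in test_set:
        ref_score = model.score(source, reference)
        # the model passes an instance only if the reference outscores
        # every corrupted variant
        if all(ref_score > model.score(source, c) for c in contrastive_variants):
            correct += 1
    return correct / len(test_set)
```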

LingEval97: A Test Set of Contrastive Translation Pairs

97 000 contrastive translation pairs
based on English→German WMT test sets
rule-based, automatic creation of errors
7 error types
metadata for in-depth analysis:
  error type
  distance between words
  word frequency in WMT15 training set

Case Study: Some Open Questions in Neural MT

text representation
  word-level        but as the example of UNK in Poland shows
                    |------ 5 steps ------|
  subword-level     but as the example of Mobil+ king in Poland shows
  (byte-pair enc.)  |------ 6 steps ------|
  character-level   b u t _ a s _ t h e _ e x a m p l e _ o f _ M o b i l k i n g _ i n _ P o l a n d _ s h o w s
                    |------ 29 steps ------|


Case Study: Some Open Questions in Neural MT

does network architecture affect learning of long-distance dependencies?

architectures
  recurrent: RNN vs. GRU vs. LSTM
    [Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
  convolution [Gehring et al., 2017]
  self-attention [Vaswani et al., 2017]
    [figure: the Transformer model architecture; attention weights are a softmax over query-key dot products scaled by 1/√dk, applied to the values (a sketch follows below)]

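As a reference point for the self-attention entry above, here is a minimal NumPy sketch of scaled dot-product attention (single head, no masking, no learned projections; illustrative only, not the full Transformer of Vaswani et al., 2017):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v); returns (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query-key compatibility
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V                                   # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)       # (4, 16)
```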

Results: Architecture

subject–verb agreement (n=35105), accuracy (%):
  RNN   89.0
  GRU   93.4
  LSTM  94.0

Results: Architecture

[plot: subject–verb agreement accuracy (0.5–1) as a function of the distance between the agreeing words (0 to ≥16), for LSTM, GRU, and RNN]

Results: Text Representation

subject–verb agreement (n=35105), accuracy (%):
  BPE-to-BPE                       93.4
  word-to-word                     90.5
  char-to-char [Lee et al., 2016]  91.5

Results: Text Representation

[plot: subject–verb agreement accuracy (0.5–1) as a function of word frequency in the training set (10^0 to 10^4), for BPE-to-BPE, word-to-word, and char-to-char [Lee et al., 2016]]

Results: Text Representation

[plot: subject–verb agreement accuracy (0.5–1) as a function of the distance between the agreeing words (0 to ≥16), for BPE-to-BPE, word-to-word, and char-to-char [Lee et al., 2016]]

What Did We Learn?

method verifies the strength of LSTM and GRU
→ future work: test convolutional and self-attention models
word-level model is poor for rare words
character-level model is poor for long distances
BPE subword segmentation is a good compromise

Targeted Analysis: Adequacy

adequacy is an open problem

system     sentence
source     Dort wurde er von dem Schläger und einer weiteren männl. Person erneut angegriffen.
reference  There he was attacked again by his original attacker and another male.
our NMT    There he was attacked again by the racket and another male person.
Google     There he was again attacked by the bat and another male person.

Schläger: attacker / racket / bat

[image credits: racket https://www.flickr.com/photos/128067141@N07/15157111178 (CC BY 2.0);
 attacker https://commons.wikimedia.org/wiki/File:Wikibully.jpg;
 bat www.personalcreations.com (CC BY 2.0);
 bat Hasitha Tudugalle https://commons.wikimedia.org/wiki/File:Flying-Fox-Bat.jpg (CC BY 4.0)]


Targeted Analysis: Adequacy

focus on two types of adequacy errors:

lexical word sense disambiguation:
  translate ambiguous word with wrong word sense

polarity:
  deletion or insertion of negation marker ("not", "no", "un-")

Polarity

manual error analysis [Fancellu and Webber, 2015]
  translation errors (Chinese→English hierarchical PBSMT):
    insertion of negation (1–2%)
    deletion of negation (10–20%)
    reordering errors (1–20%)

automatic analysis (LingEval97; NMT), accuracy (%):
  negation insertion (n=22760)  97.9
  negation deletion (n=4043)    91.5


Word Sense Disambiguation [Rios, Mascarell, Sennrich, WMT 2017]

test set (ContraWSD)
  35 ambiguous German nouns
  2–4 senses per source noun
  contrastive translation sets (1 or more contrastive translations)
  ≈ 100 test instances per sense → ≈ 7000 test instances

source:       Also nahm ich meinen amerikanischen Reisepass und stellte mich in die Schlange für Extranjeros.
reference:    So I took my U.S. passport and got in the line for Extranjeros.
contrastive:  So I took my U.S. passport and got in the snake for Extranjeros.
contrastive:  So I took my U.S. passport and got in the serpent for Extranjeros.

Word Sense Disambiguation

[plots: WSD accuracy (0–1) as a function of the absolute frequency (10^0 to 10^4) and the relative frequency (10^-3 to 10^0) of a word sense in the training data]

WSD is challenging, especially for rare word senses


Word Sense Disambiguation: Measuring Progress

UEDIN-NMT at WMT (German→English)
[Sennrich, Birch, Currey, Germann, Haddow, Heafield, Miceli Barone, Williams, WMT 2017]

at WMT16, UEDIN-NMT was top-ranked
  large lead in fluency; small lead in adequacy

for WMT17, we improved our MT system in several ways:
  deep transition networks
  layer normalization
  better hyperparameters
  better ensembles
  (slightly) more training data

are we getting better at word sense disambiguation?

Results: Word Sense Disambiguation

word sense disambiguation accuracy (n=7359), accuracy (%):
  UEDIN-NMT @ WMT16: single              83.5
  UEDIN-NMT @ WMT17: single              86.7
  UEDIN-NMT @ WMT17: ensemble            87.9
  ≈ human performance (sentence-level)   96.0


What Did We Learn?

word sense disambiguation remains a challenging problem in MT,
but there has been measurable progress in the last year

at the sentence level, even humans may find it challenging:

German       Sehen Sie die Muster?
reference    Do you see the patterns?
contrastive  Do you see the examples?

→ new possibility for targeted evaluation of document-level modelling

NMT: what's linguistics got to do with it?

1 Linguistically Motivated (but Non-Linguistic) Models
2 Targeted Evaluation of Neural MT
3 Linguistically Informed Models

Linguistic Structure is Coming Back to (Neural) MT

segmentation  word
None          perusasian
BPE           perusasi: an
Omorfi        perus: asia: n

[figure: a 2-layer syntactic GCN on top of a convolutional encoder, applied to "The monkey eats a banana" with dependency edges (det, nsubj, dobj); separate weights per edge direction and label, plus edge-wise gates (Figure 2 and Section 2.3 of Bastings et al., 2017)]

Morphology                           Syntax
[Sánchez-Cartagena and Toral, 2016]  [Sennrich and Haddow, 2016]
[Tamchyna et al., 2017]              [Eriguchi et al., 2016]
[Huck et al., 2017]                  [Bastings et al., 2017]
[Pinnis et al., 2017]                [Aharoni and Goldberg, 2017]
                                     [Nadejde et al., 2017]

Linguistic Input Features: Motivation [Sennrich, Haddow, WMT 2016]

disambiguate words by POS

English       German
close (verb)  schließen
close (adj)   nah
close (noun)  Ende

source        We thought a win like this might be close (adj).
reference     Wir dachten, dass ein solcher Sieg nah sein könnte.
baseline NMT  *Wir dachten, ein Sieg wie dieser könnte schließen.

Neural Machine Translation: Multiple Input Features

use separate embeddings for each feature, then concatenate:

  E1(close) = [0.4, 0.1, 0.2]
  E2(adj)   = [0.1]

  E1(close) ‖ E2(adj) = [0.4, 0.1, 0.2, 0.1]
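
A minimal sketch of this per-feature lookup and concatenation (toy vocabularies and embedding sizes, assumed purely for illustration; the concatenated vector is what the encoder consumes):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy vocabularies and embedding tables (assumed sizes)
word_vocab = {"we": 0, "thought": 1, "close": 2}
pos_vocab = {"verb": 0, "adj": 1, "noun": 2}
E_word = rng.normal(size=(len(word_vocab), 3))  # word embeddings, dim 3
E_pos = rng.normal(size=(len(pos_vocab), 1))    # POS embeddings, dim 1

def embed(word, pos):
    # look up each feature separately, then concatenate into one input vector
    return np.concatenate([E_word[word_vocab[word]], E_pos[pos_vocab[pos]]])

x = embed("close", "adj")   # 3 + 1 = 4-dimensional encoder input
print(x.shape)              # (4,)
```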

Results

BLEU:
                           EN→DE  DE→EN  EN→RO
baseline                    27.8   31.4   23.8
linguistic input features   28.4   32.9   24.8

Page 81: Neural Machine Translation what's linguistics got to do ...homepages.inf.ed.ac.uk/rsennric/tsd.pdf · a new challenger appears: neural machine translation requires minimal domain

Predicting Target-Side Syntax (CCG)[Nadejde, Reddy, Sennrich, Dwojak, Junczys-Dowmunt, Koehn, Birch, WMT 2017]

Core IdeaCCG supertags carry information about type/direction of arguments

predict supertags to help model produce good grammatical structure

we associate words with their supertag by interleavingwords: Obama receives Netanyahu in the capital of USACCG: NP S\NP/PP/NP NP PP/NP NP/N N NP\NP/NP NP

interleaved: NP Obama S\NP/PP/NP receives NP Netanyahu PP/NP in NP/N the N capital NP\NP/NP of NP USA

similar idea: serialized syntax tree [Aharoni and Goldberg, 2017]

Jane hatte eine Katze . → (ROOT (S (NP Jane )NP (VP had (NP a cat )NP )VP . )S )ROOT


Results

[Nadejde et al., 2017]

system            DE→EN   RO→EN
baseline          32.1    28.4
interleaved CCG   32.7    29.3

[Aharoni and Goldberg, 2017]

system             DE→EN
baseline           32.4
serialized trees   33.2

...but see the papers for more analysis


Conclusions

neural machine translation does not need linguistic knowledge...

...but linguistics should play an important role for:
- inspiring research
- targeted evaluation
- informing models

source:                                 indoor temperature
reference:                              Raumklima
[Bahdanau et al., 2015]:                UNK ✗
[Jean et al., 2015]:                    Innenpool ✗
[Sennrich, Haddow, Birch, ACL 2016a]:   Innen+ temperatur ✓

Schläger → attacker / racket / bat

[Figure 2 (graphic): A 2-layer syntactic GCN on top of a convolutional encoder, over the sentence "The monkey eats a banana". Loop connections are depicted with dashed edges, syntactic ones with solid (dependents to heads) and dotted (heads to dependents) edges. Gates and some labels are omitted for clarity.]

2.3 Syntactic GCNs

Marcheggiani and Titov (2017) generalize GCNs to operate on directed and labeled graphs.² This makes it possible to use linguistic structures such as dependency trees, where directionality and edge labels play an important role. They also integrate edge-wise gates which let the model regulate contributions of individual dependency edges. We will briefly describe these modifications.

Directionality. In order to deal with directionality of edges, separate weight matrices are used for incoming and outgoing edges. We follow the convention that in dependency trees heads point to their dependents, and thus outgoing edges are used for head-to-dependent connections, and incoming edges are used for dependent-to-head connections. Modifying the recursive computation for directionality, we arrive at:

h_v^{(j+1)} = \rho\left( \sum_{u \in \mathcal{N}(v)} W_{\mathrm{dir}(u,v)}^{(j)} h_u^{(j)} + b_{\mathrm{dir}(u,v)}^{(j)} \right)

where dir(u, v) selects the weight matrix associated with the directionality of the edge connecting u and v (i.e. W_IN for u-to-v, W_OUT for v-to-u, and W_LOOP for v-to-v). Note that self-loops are modeled separately, so there are now three times as many parameters as in a non-directional GCN.

² For an alternative approach to integrating labels and directions, see applications of GCNs to statistical relation learning (Schlichtkrull et al., 2017).



Collaborators

Alexandra Birch, Barry Haddow, Antonio Valerio Miceli Barone,
Kenneth Heafield, Maria Nadejde, Phil Williams,
Ulrich Germann, Tomasz Dwojak, Philipp Koehn,
Siva Reddy, Anna Currey, Marcin Junczys-Dowmunt,
Annette Rios, Laura Mascarell, Martin Volk


Open Positions

PhD positions: I have two PhD positions available at the University of Edinburgh.

Postdoc: open position for a post-doctoral researcher.

Contact me if you’re interested.


Thanks

Acknowledgments

Some of the research presented was conducted in cooperation with Samsung Electronics Polska.

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 645452 (QT21), 644333 (TraMOOC), 644402 (HimL), and 688139 (SUMMA).

This work has received funding from the Swiss National Science Foundation (SNF) in the project CoNTra (grant number 105212_169888).

TSD 2017 Sponsors


Thanks

Thank you for your attention

Resources

LingEval97: https://github.com/rsennrich/lingeval97
ContraWSD: https://github.com/a-rios/ContraWSD

pre-trained models:
WMT16: http://data.statmt.org/wmt16_systems/
WMT17: http://data.statmt.org/wmt17_systems/


Bibliography I

Aharoni, R. and Goldberg, Y. (2017). Towards String-To-Tree Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 132–140, Vancouver, Canada. Association for Computational Linguistics.

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Bastings, J., Titov, I., Aziz, W., Marcheggiani, D., and Sima'an, K. (2017). Graph Convolutional Encoders for Syntax-aware Neural Machine Translation. Proceedings of EMNLP.

Bojar, O., Chatterjee, R., Federmann, C., Graham, Y., Haddow, B., Huck, M., Jimeno Yepes, A., Koehn, P., Logacheva, V., Monz, C., Negri, M., Neveol, A., Neves, M., Popel, M., Post, M., Rubino, R., Scarton, C., Specia, L., Turchi, M., Verspoor, K., and Zampieri, M. (2016). Findings of the 2016 Conference on Machine Translation (WMT16). In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 131–198, Berlin, Germany.

Cai, Z. and Dai, Z. (2017). Glyph-aware Embedding of Chinese Characters. In 1st Workshop on Subword and Character level models in NLP (SCLeM), Copenhagen, Denmark.

Castilho, S., Moorkens, J., Gaspari, F., Sennrich, R., Sosoni, V., Georgakopoulou, Y., Lohar, P., Way, A., Barone, A. V. M., and Gialama, M. (2017). A Comparative Quality Evaluation of PBSMT and NMT using Professional Translators. In Proceedings of Machine Translation Summit XVI, Nagoya, Japan.


Bibliography II

Cohn, T., Hoang, C. D. V., Vymolova, E., Yao, K., Dyer, C., and Haffari, G. (2016). Incorporating Structural Alignment Biases into an Attentional Neural Translation Model. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 876–885, San Diego, California.

Costa-jussà, M. R., Aldón, D., and Fonollosa, J. A. R. (2017). Chinese–Spanish neural machine translation enhanced with character and word bitmap fonts. Machine Translation, 31(1):35–47.

Eriguchi, A., Hashimoto, K., and Tsuruoka, Y. (2016). Tree-to-Sequence Attentional Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833, Berlin, Germany.

Fancellu, F. and Webber, B. (2015). Translating Negation: A Manual Error Analysis. In Proceedings of the Second Workshop on Extra-Propositional Aspects of Meaning in Computational Semantics (ExProM 2015), pages 2–11, Denver, Colorado. Association for Computational Linguistics.

Gage, P. (1994). A New Algorithm for Data Compression. C Users J., 12(2):23–38.

Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. (2017). Convolutional Sequence to Sequence Learning. CoRR, abs/1705.03122.


Bibliography III

Huck, M., Riess, S., and Fraser, A. (2017). Target-side Word Segmentation Strategies for Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation, Volume 1: Research Papers, Copenhagen, Denmark. Association for Computational Linguistics.

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015). On Using Very Large Target Vocabulary for Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Lee, J., Cho, K., and Hofmann, T. (2016). Fully Character-Level Neural Machine Translation without Explicit Segmentation. ArXiv e-prints.

Nadejde, M., Reddy, S., Sennrich, R., Dwojak, T., Junczys-Dowmunt, M., Koehn, P., and Birch, A. (2017). Syntax-aware Neural Machine Translation Using CCG. ArXiv e-prints.

Pinnis, M., Krislauks, R., Deksne, D., and Miks, T. (2017). Neural Machine Translation for Morphologically Rich Languages with Improved Sub-word Units and Synthetic Data. In Text, Speech, and Dialogue - 20th International Conference, TSD 2017, pages 237–245, Prague, Czech Republic.

Rios, A., Mascarell, L., and Sennrich, R. (2017). Improving Word Sense Disambiguation in Neural Machine Translation with Sense Embeddings. In Proceedings of the Second Conference on Machine Translation, Volume 1: Research Papers, Copenhagen, Denmark.


Bibliography IV

Sánchez-Cartagena, V. M. and Toral, A. (2016). Abu-MaTran at WMT 2016 Translation Task: Deep Learning, Morphological Segmentation and Tuning on Character Sequences. In Proceedings of the First Conference on Machine Translation, pages 362–370, Berlin, Germany. Association for Computational Linguistics.

Sennrich, R. (2015). Modelling and Optimizing on Syntactic N-Grams for Statistical Machine Translation. Transactions of the Association for Computational Linguistics, 3:169–182.

Sennrich, R. (2017). How Grammatical is Character-level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Valencia, Spain.

Sennrich, R., Birch, A., Currey, A., Germann, U., Haddow, B., Heafield, K., Miceli Barone, A. V., and Williams, P. (2017). The University of Edinburgh's Neural MT Systems for WMT17. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, Copenhagen, Denmark.

Sennrich, R. and Haddow, B. (2015). A Joint Dependency Model of Morphological and Syntactic Structure for Statistical Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2081–2087, Lisbon, Portugal.

Sennrich, R. and Haddow, B. (2016). Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, Volume 1: Research Papers, pages 83–91, Berlin, Germany.


Bibliography V

Sennrich, R., Haddow, B., and Birch, A. (2016a). Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 368–373, Berlin, Germany.

Sennrich, R., Haddow, B., and Birch, A. (2016b). Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany.

Tamchyna, A., Weller-Di Marco, M., and Fraser, A. (2017). Modeling Target-Side Inflection in Neural Machine Translation. In Second Conference on Machine Translation (WMT17).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. CoRR, abs/1706.03762.

Williams, P., Sennrich, R., Post, M., and Koehn, P. (2016). Syntax-based Statistical Machine Translation, volume 9 of Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Zoph, B. and Knight, K. (2016). Multi-Source Neural Translation. In NAACL HLT 2016.
