CS388: Natural Language Processing
Greg Durrett
Lecture 18: Machine Translation 2
Administrivia

‣ Project 2 due in one week
Recall: Phrase-Based MT

‣ Language model P(e), trained on unlabeled English data
‣ Phrase table P(f|e), e.g.:

cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…

‣ Noisy channel model: $P(e|f) \propto P(f|e)P(e)$; combine scores from the translation model + language model to translate foreign text into English
‣ "Translate faithfully but make fluent English"
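As a minimal sketch of how the noisy channel combination ranks candidates (the phrase-table and LM probabilities below are made-up toy values, not from any real system):

import math

# Hypothetical toy numbers, just to make the ranking concrete
phrase_table = {("le chat", "the cat"): 0.8,   # P(f|e): translation model
                ("le chat", "cat the"): 0.8}
lm = {"the cat": 0.05, "cat the": 0.0001}      # P(e): language model

def noisy_channel_score(f, e):
    # log P(e|f) up to a constant: log P(f|e) + log P(e)
    return math.log(phrase_table[(f, e)]) + math.log(lm[e])

print(max(["the cat", "cat the"], key=lambda e: noisy_channel_score("le chat", e)))
# -> "the cat": equally faithful under P(f|e), but far more fluent under P(e)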
Recall: HMM for Alignment

Brown et al. (1993)

e: Thank you, I shall do so gladly.
f: Gracias, lo haré de muy buen grado.
a: 0 2 6 5 7 7 7 7 8

‣ Sequential dependence between a's to capture monotonicity
‣ Word translation table $P(f_i | e_{a_i})$
‣ Alignment distribution parameterized by jump size $a_i - a_{i-1}$; want local monotonicity: most jumps are small [Figure: distribution over jump sizes -2 … 3]
‣ HMM model (Vogel et al., 1996); re-estimate using the forward-backward algorithm

$$P(f, a \mid e) = \prod_{i=1}^{n} P(f_i \mid e_{a_i})\, P(a_i \mid a_{i-1})$$
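A minimal sketch of scoring one (f, a) pair under this factorization; t_table and jump_dist are hypothetical lookups standing in for the learned word translation table and jump-size distribution:

import math

def log_joint(f_words, alignment, e_words, t_table, jump_dist):
    # log P(f, a | e) = sum_i [ log P(f_i | e_{a_i}) + log P(a_i | a_{i-1}) ]
    logp = 0.0
    prev = 0  # assume a_0 = 0 as the initial-position convention
    for f_i, a_i in zip(f_words, alignment):
        logp += math.log(t_table[(f_i, e_words[a_i])])  # word translation term
        logp += math.log(jump_dist[a_i - prev])         # jump-size transition term
        prev = a_i
    return logp

Training re-estimates t_table and jump_dist with forward-backward, which sums over all alignments rather than scoring a single one as above.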
Recall:Decoding
…didnotidx=2
Marynot
Maryno
4.2
-1.2
-2.9
idx=2
idx=2
…notgiveidx=3
…notslapidx=5
…notslapidx=6
1 2 3 4 5 6 7 8 9
‣ ScoresfromlanguagemodelP(e)+transla=onmodelP(f|e)
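A sketch of the beam search loop; the expand function, which proposes scored phrase extensions, is hypothetical, and a real phrase-based decoder tracks a coverage set rather than the single left-to-right index used here:

import heapq

def beam_search(f_len, expand, beam_size=4):
    # Hypotheses are (score, english_prefix, idx): idx = source words covered
    beam = [(0.0, "", 0)]
    while any(idx < f_len for _, _, idx in beam):
        candidates = []
        for score, prefix, idx in beam:
            if idx == f_len:                  # finished hypotheses carry over
                candidates.append((score, prefix, idx))
                continue
            for delta, words, new_idx in expand(prefix, idx):
                # delta = incremental log P(e) + log P(f|e) score
                candidates.append((score + delta, (prefix + " " + words).strip(), new_idx))
        beam = heapq.nlargest(beam_size, candidates)  # prune to the top-k
    return max(beam)[1]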
This Lecture

‣ Neural MT details
‣ Dilated CNNs for MT
‣ Transformers for MT
‣ Syntactic MT
Syntactic MT
Levels of Transfer: Vauquois Triangle

Slide credit: Dan Klein

‣ Is syntax a "better" abstraction than phrases?
Syntactic MT

‣ Rather than use phrases, use a synchronous context-free grammar: constructs "parallel" trees in two languages simultaneously

NP → [DT1 JJ2 NN3; DT1 NN3 JJ2]
DT → [the, la]
DT → [the, le]
NN → [car, voiture]
JJ → [yellow, jaune]

the yellow car ↔ la voiture jaune
(NP over DT1 JJ2 NN3 on the English side; NP over DT1 NN3 JJ2 on the French side)

‣ Assumes parallel syntax up to reordering
‣ Translation = parse the input with "half" the grammar, read off the other half
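A toy sketch of that last step with this grammar: tag the English words with the lexical rules, then read off the French side in the NP rule's reordered order. (The dictionary below picks DT → la for simplicity; the full grammar also has DT → le, so a real parser would have to resolve that ambiguity.)

# English half: word -> (tag, French word); NP rule: source order + target reordering
lexical = {"the": ("DT", "la"), "yellow": ("JJ", "jaune"), "car": ("NN", "voiture")}
np_rule = {"src": ["DT", "JJ", "NN"], "tgt_order": [0, 2, 1]}  # DT1 NN3 JJ2

def translate_np(words):
    # Parse the input with "half" the grammar, read off the other half
    tags, french = zip(*(lexical[w] for w in words))
    assert list(tags) == np_rule["src"], "input must match NP -> DT JJ NN"
    return " ".join(french[i] for i in np_rule["tgt_order"])

print(translate_np(["the", "yellow", "car"]))  # -> la voiture jaune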
Syntactic MT

Slide credit: Dan Klein

‣ Relax this by using lexicalized rules, like "syntactic phrases"
‣ Leads to HUGE grammars, and parsing is slow
Neural MT
Encoder-Decoder MT

Sutskever et al. (2014)

‣ Sutskever seq2seq paper: first major application of LSTMs to NLP
‣ Basic encoder-decoder with beam search
‣ SOTA at the time was 37.0 BLEU; this system was not all that competitive…
Encoder-Decoder MT

‣ Better model from seq2seq lectures: encoder-decoder with attention and copying for rare words

[Figure: encoder states h1 … h4 over "the movie was great"; decoder state h̄1 attends to them, forming context vector c1 and a distribution over vocab + copying, emitting "le" …]
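A NumPy sketch of the attention computation at one decoder step; W_out is a hypothetical output projection, and a real system would add the copying distribution on top of the vocab softmax:

import numpy as np

def attention_step(h_bar, H, W_out):
    # h_bar: decoder state, shape (d,); H: encoder states h1..hn, shape (n, d)
    scores = H @ h_bar                    # one alignment score per source word
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # softmax over source positions
    c = alpha @ H                         # context vector (c1 in the figure)
    # combine context + decoder state; this feeds the softmax over the vocab
    return np.tanh(W_out @ np.concatenate([c, h_bar]))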
Results: WMT English-French

‣ 12M sentence pairs

‣ Classic phrase-based system: ~33 BLEU, uses additional target-language data
‣ Rerank with LSTMs: 36.5 BLEU (long line of work here; Devlin+ 2014)
‣ Sutskever+ (2014) seq2seq single: 30.6 BLEU
‣ Sutskever+ (2014) seq2seq ensemble: 34.8 BLEU
‣ Luong+ (2015) seq2seq ensemble with attention and rare word handling: 37.5 BLEU

‣ But English-French is a really easy language pair and there's tons of data for it! Does this approach work for anything harder?
Results: WMT English-German

‣ 4.5M sentence pairs

‣ Classic phrase-based system: 20.7 BLEU
‣ Luong+ (2014) seq2seq: 14 BLEU
‣ Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU

‣ BLEU isn't comparable across languages, but this performance still isn't as good
‣ French, Spanish = easiest; German, Czech, Chinese = harder; Japanese, Russian = hard (grammatically different, lots of morphology…)
MT Examples

Luong et al. (2015)

‣ best = with attention, base = no attention
‣ NMT systems can hallucinate words, especially when not using attention; phrase-based systems don't do this
MT Examples

Luong et al. (2015); Zhang et al. (2017)

‣ best = with attention, base = no attention
‣ NMT can repeat itself if it gets confused ("pH" or "pH")
‣ Phrase-based MT often gets chunks right, but may have more subtle ungrammaticalities
Handling Rare Words

Sennrich et al. (2016)

‣ Words are a difficult unit to work with: copying can be cumbersome, and word vocabularies get very large
‣ Character-level models don't work well
‣ Compromise solution: use thousands of "word pieces" (which may be full words but may also be parts of words)

Input: _the_ecotax_portico_in_Pont-de-Buis…
Output: _le_portique_écotaxe_de_Pont-de-Buis

‣ Can achieve transliteration with this; subword structure makes some translations easier to achieve
Byte Pair Encoding (BPE)

Sennrich et al. (2016)

‣ Start with every individual byte (basically character) as its own symbol
‣ Count bigram character cooccurrences
‣ Merge the most frequent pair of adjacent characters
‣ Do this either over your vocabulary (original version) or over a large corpus (more common version)
‣ Doing 8k merges => vocabulary of around 8000 word pieces, which includes many whole words
‣ Most SOTA NMT systems use this on both source + target
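The merge loop is compact; this sketch follows the pseudocode published in Sennrich et al. (2016), operating on a word-frequency vocabulary whose entries are space-separated symbols:

import re
from collections import Counter

def get_stats(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_vocab(pair, vocab):
    # replace every occurrence of the pair with its concatenation
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(10):      # 8k merges in practice; 10 here for illustration
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)
    vocab = merge_vocab(best, vocab)
print(vocab)             # e.g. frequent words like "low" become single pieces

Segmenting new text at test time replays the learned merges in order on each word.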
Word Pieces

Schuster and Nakajima (2012); Wu et al. (2016); Kudo and Richardson (2018)

‣ SentencePiece library from Google: unigram LM

build a language model over your corpus
while voc size < target voc size:
    merge pieces that lead to the highest improvement in language model perplexity

‣ Issues: what LM to use? How to make this tractable?
‣ Result: a way of segmenting input appropriate for translation
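A hedged sketch of driving the SentencePiece library; argument names follow its documentation, but check the library itself for details, and the pieces shown are illustrative only:

import sentencepiece as spm

# train a unigram-LM model with an 8k-piece vocabulary on raw text
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="mt8k",
    vocab_size=8000, model_type="unigram")

sp = spm.SentencePieceProcessor(model_file="mt8k.model")
print(sp.encode("the ecotax portico in Pont-de-Buis", out_type=str))
# e.g. ['▁the', '▁eco', 'tax', '▁porti', 'co', ...] (actual pieces vary by corpus)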
Google's NMT System

Wu et al. (2016)

‣ 8-layer LSTM encoder-decoder with attention, word piece vocabulary of 8k-32k
Google's NMT System

Wu et al. (2016)

English-French:
‣ Google's phrase-based system: 37.0 BLEU
‣ Luong+ (2015) seq2seq ensemble with rare word handling: 37.5 BLEU
‣ Google's 32k word pieces: 38.95 BLEU

English-German:
‣ Google's phrase-based system: 20.7 BLEU
‣ Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU
‣ Google's 32k word pieces: 24.2 BLEU
Human Evaluation (En-Es)

Wu et al. (2016)

‣ Similar to human-level performance on English-Spanish
Google's NMT System

Wu et al. (2016)

‣ Gender is correct in GNMT but not in PBMT ["sled" / "walker" example]
Backtranslation

Sennrich et al. (2015)

‣ Classical MT methods used a bilingual corpus of sentences B = (S, T) and a large monolingual corpus T' to train a language model. Can neural MT do the same?

‣ Approach 1: force the system to generate T' as targets from null inputs:
(s1, t1), (s2, t2), …, ([null], t'1), ([null], t'2), …

‣ Approach 2: generate synthetic sources with a T->S machine translation system (backtranslation):
(s1, t1), (s2, t2), …, (MT(t'1), t'1), (MT(t'2), t'2), …
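A sketch of Approach 2 as pure data construction; reverse_mt stands for a hypothetical T->S translation system:

def backtranslation_data(bitext, mono_targets, reverse_mt):
    # real pairs (s, t) plus synthetic pairs (MT(t'), t'):
    # targets stay human-written, only the sources are machine-generated noise
    synthetic = [(reverse_mt(t), t) for t in mono_targets]
    return list(bitext) + synthetic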
Backtranslation

Sennrich et al. (2015)

‣ parallel synth: backtranslate the training data; makes additional noisy source sentences which could be useful
‣ Gigaword: large monolingual English corpus
Transformers for MT
Recall: Self-Attention

Vaswani et al. (2017)

the movie was great

‣ Each word forms a "query" which then computes attention over each word:

$$\alpha_{i,j} = \mathrm{softmax}(x_i^\top x_j) \qquad x'_i = \sum_{j=1}^{n} \alpha_{i,j} x_j$$

($\alpha_{i,j}$ is a scalar; $x'_i$ is a vector = sum of scalar * vector)

‣ Multiple "heads" analogous to different convolutional filters. Use parameters Wk and Vk to get different attention values + transform vectors:

$$\alpha_{k,i,j} = \mathrm{softmax}(x_i^\top W_k x_j) \qquad x'_{k,i} = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j$$
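A NumPy sketch of these two equations; the parameter shapes are assumptions (each W_k is d x d, each V_k maps d to a per-head dimension):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Ws, Vs):
    # X: (n, d) word vectors; Ws[k]: (d, d); Vs[k]: (d_head, d)
    heads = []
    for W, V in zip(Ws, Vs):
        scores = X @ W @ X.T           # scores[i, j] = x_i^T W_k x_j
        alpha = softmax(scores)        # alpha_{k,i,j}: normalize over j
        heads.append(alpha @ X @ V.T)  # x'_{k,i} = sum_j alpha_{k,i,j} V_k x_j
    return np.concatenate(heads, axis=-1)  # concatenate the per-head outputs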
Transformers

Vaswani et al. (2017)

‣ Augment the word embedding with position embeddings; each dim is a sine/cosine wave of a different frequency. Closer points = higher dot products
‣ Works essentially as well as just encoding position as a one-hot vector

[Figure: "the movie was great" with position embeddings emb(1) … emb(4) added to the word embeddings]
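A sketch of the sinusoidal scheme (assuming an even d_model; the 10000 base follows Vaswani et al.):

import numpy as np

def positional_embeddings(n_positions, d_model):
    # each sin/cos pair is a wave of a different frequency;
    # nearby positions end up with higher dot products
    pos = np.arange(n_positions)[:, None]                    # (n, 1)
    freq = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    emb = np.zeros((n_positions, d_model))
    emb[:, 0::2] = np.sin(pos * freq)
    emb[:, 1::2] = np.cos(pos * freq)
    return emb   # added to the word embeddings emb(1) .. emb(n)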
Transformers

Vaswani et al. (2017)

‣ Encoder and decoder are both transformers
‣ Decoder consumes the previously generated token (and attends to the input), but has no recurrent state
Transformers

Vaswani et al. (2017)

‣ big = 6 layers, 1000 dim for each token, 16 heads; base = 6 layers + other params halved
Visualization

Vaswani et al. (2017)

[Figures: attention visualizations]
Takeaways

‣ Can build MT systems with LSTM encoder-decoders, CNNs, or transformers
‣ Word piece / byte pair models are really effective and easy to use
‣ State-of-the-art systems are getting pretty good, but lots of challenges remain, especially for low-resource settings
‣ Next time: pre-trained transformer models (BERT), applied to other tasks