CS388: Natural Language Processing
Greg Durrett
Lecture 18: Machine Translation 2
Administrivia

‣ Project 2 due in one week
Recall: Phrase-Based MT

‣ Language model P(e), trained on unlabeled English data
‣ Phrase table P(f|e), e.g.:

cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…

‣ Noisy channel model: $P(e|f) \propto P(f|e)P(e)$; combine scores from the translation model + language model to translate foreign text into English
‣ "Translate faithfully but make fluent English"
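As a minimal sketch of how the noisy channel combination ranks candidates (the phrase-table and LM probabilities below are made-up toy values, not from any real system):

import math

# Hypothetical toy numbers, just to make the ranking concrete
phrase_table = {("le chat", "the cat"): 0.8,   # P(f|e): translation model
                ("le chat", "cat the"): 0.8}
lm = {"the cat": 0.05, "cat the": 0.0001}      # P(e): language model

def noisy_channel_score(f, e):
    # log P(e|f) up to a constant: log P(f|e) + log P(e)
    return math.log(phrase_table[(f, e)]) + math.log(lm[e])

print(max(["the cat", "cat the"], key=lambda e: noisy_channel_score("le chat", e)))
# -> "the cat": equally faithful under P(f|e), but far more fluent under P(e)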
Recall: HMM for Alignment

Brown et al. (1993)

e: Thank you, I shall do so gladly.
f: Gracias, lo haré de muy buen grado.
a: 0 2 6 5 7 7 7 7 8

‣ Sequential dependence between a's to capture monotonicity
‣ Word translation table $P(f_i | e_{a_i})$
‣ Alignment distribution parameterized by jump size $a_i - a_{i-1}$; want local monotonicity: most jumps are small [Figure: distribution over jump sizes -2 … 3]
‣ HMM model (Vogel et al., 1996); re-estimate using the forward-backward algorithm

$$P(f, a \mid e) = \prod_{i=1}^{n} P(f_i \mid e_{a_i})\, P(a_i \mid a_{i-1})$$
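A minimal sketch of scoring one (f, a) pair under this factorization; t_table and jump_dist are hypothetical lookups standing in for the learned word translation table and jump-size distribution:

import math

def log_joint(f_words, alignment, e_words, t_table, jump_dist):
    # log P(f, a | e) = sum_i [ log P(f_i | e_{a_i}) + log P(a_i | a_{i-1}) ]
    logp = 0.0
    prev = 0  # assume a_0 = 0 as the initial-position convention
    for f_i, a_i in zip(f_words, alignment):
        logp += math.log(t_table[(f_i, e_words[a_i])])  # word translation term
        logp += math.log(jump_dist[a_i - prev])         # jump-size transition term
        prev = a_i
    return logp

Training re-estimates t_table and jump_dist with forward-backward, which sums over all alignments rather than scoring a single one as above.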
Recall:Decoding
…didnotidx=2
Marynot
Maryno
4.2
-1.2
-2.9
idx=2
idx=2
…notgiveidx=3
…notslapidx=5
…notslapidx=6
1 2 3 4 5 6 7 8 9
‣ ScoresfromlanguagemodelP(e)+transla=onmodelP(f|e)
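A sketch of the beam search loop; the expand function, which proposes scored phrase extensions, is hypothetical, and a real phrase-based decoder tracks a coverage set rather than the single left-to-right index used here:

import heapq

def beam_search(f_len, expand, beam_size=4):
    # Hypotheses are (score, english_prefix, idx): idx = source words covered
    beam = [(0.0, "", 0)]
    while any(idx < f_len for _, _, idx in beam):
        candidates = []
        for score, prefix, idx in beam:
            if idx == f_len:                  # finished hypotheses carry over
                candidates.append((score, prefix, idx))
                continue
            for delta, words, new_idx in expand(prefix, idx):
                # delta = incremental log P(e) + log P(f|e) score
                candidates.append((score + delta, (prefix + " " + words).strip(), new_idx))
        beam = heapq.nlargest(beam_size, candidates)  # prune to the top-k
    return max(beam)[1]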
This Lecture

‣ Neural MT details
‣ Dilated CNNs for MT
‣ Transformers for MT
‣ Syntactic MT
Syntactic MT
Levels of Transfer: Vauquois Triangle

Slide credit: Dan Klein

‣ Is syntax a "better" abstraction than phrases?
Syntactic MT

‣ Rather than use phrases, use a synchronous context-free grammar: constructs "parallel" trees in two languages simultaneously

NP → [DT1 JJ2 NN3; DT1 NN3 JJ2]
DT → [the, la]
DT → [the, le]
NN → [car, voiture]
JJ → [yellow, jaune]

the yellow car ↔ la voiture jaune
(NP over DT1 JJ2 NN3 on the English side; NP over DT1 NN3 JJ2 on the French side)

‣ Assumes parallel syntax up to reordering
‣ Translation = parse the input with "half" the grammar, read off the other half
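A toy sketch of that last step with this grammar: tag the English words with the lexical rules, then read off the French side in the NP rule's reordered order. (The dictionary below picks DT → la for simplicity; the full grammar also has DT → le, so a real parser would have to resolve that ambiguity.)

# English half: word -> (tag, French word); NP rule: source order + target reordering
lexical = {"the": ("DT", "la"), "yellow": ("JJ", "jaune"), "car": ("NN", "voiture")}
np_rule = {"src": ["DT", "JJ", "NN"], "tgt_order": [0, 2, 1]}  # DT1 NN3 JJ2

def translate_np(words):
    # Parse the input with "half" the grammar, read off the other half
    tags, french = zip(*(lexical[w] for w in words))
    assert list(tags) == np_rule["src"], "input must match NP -> DT JJ NN"
    return " ".join(french[i] for i in np_rule["tgt_order"])

print(translate_np(["the", "yellow", "car"]))  # -> la voiture jaune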
Syntactic MT

Slide credit: Dan Klein

‣ Relax this by using lexicalized rules, like "syntactic phrases"
‣ Leads to HUGE grammars, and parsing is slow
Neural MT
Encoder-Decoder MT

Sutskever et al. (2014)

‣ Sutskever seq2seq paper: first major application of LSTMs to NLP
‣ Basic encoder-decoder with beam search
‣ SOTA at the time was 37.0 BLEU; this system was not all that competitive…
Encoder-Decoder MT

‣ Better model from seq2seq lectures: encoder-decoder with attention and copying for rare words

[Figure: encoder states h1 … h4 over "the movie was great"; decoder state h̄1 attends to them, forming context vector c1 and a distribution over vocab + copying, emitting "le" …]
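A NumPy sketch of the attention computation at one decoder step; W_out is a hypothetical output projection, and a real system would add the copying distribution on top of the vocab softmax:

import numpy as np

def attention_step(h_bar, H, W_out):
    # h_bar: decoder state, shape (d,); H: encoder states h1..hn, shape (n, d)
    scores = H @ h_bar                    # one alignment score per source word
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # softmax over source positions
    c = alpha @ H                         # context vector (c1 in the figure)
    # combine context + decoder state; this feeds the softmax over the vocab
    return np.tanh(W_out @ np.concatenate([c, h_bar]))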
Results: WMT English-French

‣ 12M sentence pairs

‣ Classic phrase-based system: ~33 BLEU, uses additional target-language data
‣ Rerank with LSTMs: 36.5 BLEU (long line of work here; Devlin+ 2014)
‣ Sutskever+ (2014) seq2seq single: 30.6 BLEU
‣ Sutskever+ (2014) seq2seq ensemble: 34.8 BLEU
‣ Luong+ (2015) seq2seq ensemble with attention and rare word handling: 37.5 BLEU

‣ But English-French is a really easy language pair and there's tons of data for it! Does this approach work for anything harder?
Results: WMT English-German

‣ 4.5M sentence pairs

‣ Classic phrase-based system: 20.7 BLEU
‣ Luong+ (2014) seq2seq: 14 BLEU
‣ Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU

‣ BLEU isn't comparable across languages, but this performance still isn't as good
‣ French, Spanish = easiest; German, Czech, Chinese = harder; Japanese, Russian = hard (grammatically different, lots of morphology…)
MT Examples

Luong et al. (2015)

‣ best = with attention, base = no attention
‣ NMT systems can hallucinate words, especially when not using attention; phrase-based systems don't do this
MT Examples

Luong et al. (2015); Zhang et al. (2017)

‣ best = with attention, base = no attention
‣ NMT can repeat itself if it gets confused ("pH" or "pH")
‣ Phrase-based MT often gets chunks right, but may have more subtle ungrammaticalities
Handling Rare Words

Sennrich et al. (2016)

‣ Words are a difficult unit to work with: copying can be cumbersome, and word vocabularies get very large
‣ Character-level models don't work well
‣ Compromise solution: use thousands of "word pieces" (which may be full words but may also be parts of words)

Input: _the_ecotax_portico_in_Pont-de-Buis…
Output: _le_portique_écotaxe_de_Pont-de-Buis

‣ Can achieve transliteration with this; subword structure makes some translations easier to achieve
Byte Pair Encoding (BPE)

Sennrich et al. (2016)

‣ Start with every individual byte (basically character) as its own symbol
‣ Count bigram character cooccurrences
‣ Merge the most frequent pair of adjacent characters
‣ Do this either over your vocabulary (original version) or over a large corpus (more common version)
‣ Doing 8k merges => vocabulary of around 8000 word pieces, which includes many whole words
‣ Most SOTA NMT systems use this on both source + target
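The merge loop is compact; this sketch follows the pseudocode published in Sennrich et al. (2016), operating on a word-frequency vocabulary whose entries are space-separated symbols:

import re
from collections import Counter

def get_stats(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_vocab(pair, vocab):
    # replace every occurrence of the pair with its concatenation
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(10):      # 8k merges in practice; 10 here for illustration
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)
    vocab = merge_vocab(best, vocab)
print(vocab)             # e.g. frequent words like "low" become single pieces

Segmenting new text at test time replays the learned merges in order on each word.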
Word Pieces

Schuster and Nakajima (2012); Wu et al. (2016); Kudo and Richardson (2018)

‣ SentencePiece library from Google: unigram LM

build a language model over your corpus
while voc size < target voc size:
    merge pieces that lead to the highest improvement in language model perplexity

‣ Issues: what LM to use? How to make this tractable?
‣ Result: a way of segmenting input appropriate for translation
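A hedged sketch of driving the SentencePiece library; argument names follow its documentation, but check the library itself for details, and the pieces shown are illustrative only:

import sentencepiece as spm

# train a unigram-LM model with an 8k-piece vocabulary on raw text
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="mt8k",
    vocab_size=8000, model_type="unigram")

sp = spm.SentencePieceProcessor(model_file="mt8k.model")
print(sp.encode("the ecotax portico in Pont-de-Buis", out_type=str))
# e.g. ['▁the', '▁eco', 'tax', '▁porti', 'co', ...] (actual pieces vary by corpus)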
Google's NMT System

Wu et al. (2016)

‣ 8-layer LSTM encoder-decoder with attention, word piece vocabulary of 8k-32k
Google's NMT System

Wu et al. (2016)

English-French:
‣ Google's phrase-based system: 37.0 BLEU
‣ Luong+ (2015) seq2seq ensemble with rare word handling: 37.5 BLEU
‣ Google's 32k word pieces: 38.95 BLEU

English-German:
‣ Google's phrase-based system: 20.7 BLEU
‣ Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU
‣ Google's 32k word pieces: 24.2 BLEU
Human Evaluation (En-Es)

Wu et al. (2016)

‣ Similar to human-level performance on English-Spanish
Google's NMT System

Wu et al. (2016)

‣ Gender is correct in GNMT but not in PBMT ["sled" / "walker" example]
Backtranslation

Sennrich et al. (2015)

‣ Classical MT methods used a bilingual corpus of sentences B = (S, T) and a large monolingual corpus T' to train a language model. Can neural MT do the same?

‣ Approach 1: force the system to generate T' as targets from null inputs:
(s1, t1), (s2, t2), …, ([null], t'1), ([null], t'2), …

‣ Approach 2: generate synthetic sources with a T->S machine translation system (backtranslation):
(s1, t1), (s2, t2), …, (MT(t'1), t'1), (MT(t'2), t'2), …
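A sketch of Approach 2 as pure data construction; reverse_mt stands for a hypothetical T->S translation system:

def backtranslation_data(bitext, mono_targets, reverse_mt):
    # real pairs (s, t) plus synthetic pairs (MT(t'), t'):
    # targets stay human-written, only the sources are machine-generated noise
    synthetic = [(reverse_mt(t), t) for t in mono_targets]
    return list(bitext) + synthetic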
Backtranslation

Sennrich et al. (2015)

‣ parallel synth: backtranslate the training data; makes additional noisy source sentences which could be useful
‣ Gigaword: large monolingual English corpus
Transformers for MT
Recall: Self-Attention

Vaswani et al. (2017)

the movie was great

‣ Each word forms a "query" which then computes attention over each word:

$$\alpha_{i,j} = \mathrm{softmax}(x_i^\top x_j) \qquad x'_i = \sum_{j=1}^{n} \alpha_{i,j} x_j$$

($\alpha_{i,j}$ is a scalar; $x'_i$ is a vector = sum of scalar * vector)

‣ Multiple "heads" analogous to different convolutional filters. Use parameters Wk and Vk to get different attention values + transform vectors:

$$\alpha_{k,i,j} = \mathrm{softmax}(x_i^\top W_k x_j) \qquad x'_{k,i} = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j$$
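A NumPy sketch of these two equations; the parameter shapes are assumptions (each W_k is d x d, each V_k maps d to a per-head dimension):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Ws, Vs):
    # X: (n, d) word vectors; Ws[k]: (d, d); Vs[k]: (d_head, d)
    heads = []
    for W, V in zip(Ws, Vs):
        scores = X @ W @ X.T           # scores[i, j] = x_i^T W_k x_j
        alpha = softmax(scores)        # alpha_{k,i,j}: normalize over j
        heads.append(alpha @ X @ V.T)  # x'_{k,i} = sum_j alpha_{k,i,j} V_k x_j
    return np.concatenate(heads, axis=-1)  # concatenate the per-head outputs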
Transformers

Vaswani et al. (2017)

‣ Augment the word embedding with position embeddings; each dim is a sine/cosine wave of a different frequency. Closer points = higher dot products
‣ Works essentially as well as just encoding position as a one-hot vector

[Figure: "the movie was great" with position embeddings emb(1) … emb(4) added to the word embeddings]
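A sketch of the sinusoidal scheme (assuming an even d_model; the 10000 base follows Vaswani et al.):

import numpy as np

def positional_embeddings(n_positions, d_model):
    # each sin/cos pair is a wave of a different frequency;
    # nearby positions end up with higher dot products
    pos = np.arange(n_positions)[:, None]                    # (n, 1)
    freq = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    emb = np.zeros((n_positions, d_model))
    emb[:, 0::2] = np.sin(pos * freq)
    emb[:, 1::2] = np.cos(pos * freq)
    return emb   # added to the word embeddings emb(1) .. emb(n)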
Transformers

Vaswani et al. (2017)

‣ Encoder and decoder are both transformers
‣ Decoder consumes the previously generated token (and attends to the input), but has no recurrent state
Transformers

Vaswani et al. (2017)

‣ big = 6 layers, 1000 dim for each token, 16 heads; base = 6 layers + other params halved
Visualization

Vaswani et al. (2017)

[Figures: attention visualizations]
Takeaways

‣ Can build MT systems with LSTM encoder-decoders, CNNs, or transformers
‣ Word piece / byte pair models are really effective and easy to use
‣ State-of-the-art systems are getting pretty good, but lots of challenges remain, especially for low-resource settings
‣ Next time: pre-trained transformer models (BERT), applied to other tasks