


CS388: Natural Language Processing

Greg Durrett

Lecture 16: Seq2seq II

Administrivia

‣ Nazneen Rajani (Salesforce) talk this Friday at 11am in 6.302: Leveraging Explanations for Performance and Generalization in NLP and RL

‣ Mini 2 results:

‣ Sundara Ramachandran: 82.1%

‣ Neil Patil: 80.9%, Qinqin Zhang: 80.7% (CNN), Shivam Garg: 80.1%, Prateek Chaudhry: 80.0%, Abheek Ghosh: 80.0%

‣ Fine-tuning embeddings helps, 100-300d LSTM

‣ Bidirectional LSTM, 2x256, 300d vectors, 4 epochs x 50 batch size

‣ Final project feedback posted

Recall: Seq2seq Model

‣ Generate next word conditioned on previous word as well as hidden state

[Figure: encoder reads "the movie was great"; decoder begins from <s>]

P(y|x) = \prod_{i=1}^{n} P(y_i \mid x, y_1, \ldots, y_{i-1})

P(y_i \mid x, y_1, \ldots, y_{i-1}) = \mathrm{softmax}(W \bar{h})

‣ W size is |vocab| x |hidden state|, softmax over entire vocabulary (see the code sketch below)

‣ Decoder has separate parameters from encoder, so this can learn to be a language model (produce a plausible next word given the current one)
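As a concrete illustration (not from the slides), a minimal numpy sketch of this per-step prediction; the random W and h_bar stand in for a trained model's weight matrix and decoder hidden state:

import numpy as np

def softmax(z):
    z = z - z.max()                 # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

hidden_size, vocab_size = 256, 10000
W = 0.01 * np.random.randn(vocab_size, hidden_size)  # |vocab| x |hidden state|
h_bar = np.random.randn(hidden_size)                 # decoder hidden state h-bar

p_next = softmax(W @ h_bar)       # P(y_i | x, y_1, ..., y_{i-1}); sums to 1
next_word = int(p_next.argmax())  # greedy pick of the next word id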

Recall: Seq2seq Training

‣ Objective: maximize

\sum_{(x,y)} \sum_{i=1}^{n} \log P(y_i^* \mid x, y_1^*, \ldots, y_{i-1}^*)

[Figure: source "the movie was great" is encoded; the decoder is fed "<s> le film était bon" and trained to predict "le film était bon [STOP]"]

‣ Teacher forcing: feed the correct word regardless of model's prediction (most typical way to train; sketched below)
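A minimal sketch of this training loop, assuming a generic decoder_step function and a START_ID for <s> (both hypothetical names; any concrete seq2seq implementation supplies equivalents):

import numpy as np

START_ID = 0  # assumed index of <s>

def teacher_forced_loss(decoder_step, h, gold_ids):
    """Sum of -log P(y_i* | x, y_1*, ..., y_{i-1}*) for one sentence.
    decoder_step(prev_word_id, h) -> (probs over vocab, new hidden state)."""
    loss, prev = 0.0, START_ID
    for y_star in gold_ids:              # gold target words, ending in [STOP]
        probs, h = decoder_step(prev, h)
        loss -= np.log(probs[y_star])    # maximize log P of the gold word
        prev = y_star                    # feed the correct word, not the argmax
    return loss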


Recall: Semantic Parsing as Translation

‣ Write down a linearized form of the semantic parse, train seq2seq models to directly translate into this representation

"what states border Texas"

lambda x ( state ( x ) and border ( x , e89 ) )

‣ No need to have an explicit grammar, simplifies algorithms

‣ Might not produce well-formed logical forms, might require lots of data

Jia and Liang (2015)

This Lecture

‣ Attention for sequence-to-sequence models

‣ Copy mechanisms for copying words to the output

‣ Transformer architecture

Attention

Problems with Seq2seq Models

‣ Encoder-decoder models like to repeat themselves:

Un garçon joue dans la neige -> A boy plays in the snow boy plays boy plays

‣ Why does this happen?

‣ Models trained poorly

‣ LSTM state is not behaving as expected so it gets stuck in a "loop" of generating the same output tokens again and again

‣ Need some notion of input coverage or what input words we've translated

Page 3: Administrivia CS388: Natural Language Processinggdurrett/courses/fa2019/lectures/lec16-… · Bahdanau et al. (2014) Problems with Seq2seq Models ‣Unknown words: ‣In fact, we

Problems with Seq2seq Models

‣ Bad at long sentences: 1) a fixed-size hidden representation doesn't scale; 2) LSTMs still have a hard time remembering for really long periods of time

[Figure comparing performance by sentence length. RNNenc: the model we've discussed so far; RNNsearch: uses attention]

Bahdanau et al. (2014)

Problems with Seq2seq Models

‣ Unknown words:

‣ Encoding these rare words into a vector space is really hard

‣ In fact, we don't want to encode them, we want a way of directly looking back at the input and copying them (Pont-de-Buis)

Jean et al. (2015), Luong et al. (2015)

Aligned Inputs

‣ Suppose we knew the source and target would be word-by-word translated

[Figure: "the movie was great" aligned word-by-word with "le film était bon"; the decoder is fed "<s> le film était bon" and predicts "le film était bon [STOP]"]

‣ Can look at the corresponding input word when translating; this could scale!

‣ Much less burden on the hidden state

‣ How can we achieve this without hardcoding it?

Attention

‣ At each decoder state, compute a distribution over source inputs based on current decoder state, use that in output layer

[Figure: decoder state after <s> attends over "the movie was great"; a fresh attention distribution over the source words is computed at each decoding step]

Page 4: Administrivia CS388: Natural Language Processinggdurrett/courses/fa2019/lectures/lec16-… · Bahdanau et al. (2014) Problems with Seq2seq Models ‣Unknown words: ‣In fact, we

Attention

[Figure: encoder states h_1 ... h_4 over "the movie was great"; decoder state h̄_1 after <s> attends over them to form c_1 and predict "le"]

‣ For each decoder state, compute a weighted sum of input states (see the code sketch after this slide)

e_{ij} = f(\bar{h}_i, h_j)    (some function f, TBD)

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}

c_i = \sum_j \alpha_{ij} h_j    (weighted sum of input hidden states; a vector)

‣ No attn: P(y_i \mid x, y_1, \ldots, y_{i-1}) = \mathrm{softmax}(W \bar{h}_i)

‣ With attn: P(y_i \mid x, y_1, \ldots, y_{i-1}) = \mathrm{softmax}(W [c_i; \bar{h}_i])
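Concretely, a single attention step looks like the following numpy sketch (dot-product scores for f; the random arrays stand in for a trained model's encoder outputs and decoder state):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

hidden, vocab = 256, 10000
H = np.random.randn(4, hidden)   # encoder states h_1..h_4 ("the movie was great")
h_bar = np.random.randn(hidden)  # current decoder state h-bar_i

e = H @ h_bar                    # e_ij = f(h-bar_i, h_j), here a dot product
alpha = softmax(e)               # attention distribution over the source words
c = alpha @ H                    # c_i = sum_j alpha_ij h_j

W = 0.01 * np.random.randn(vocab, 2 * hidden)
p = softmax(W @ np.concatenate([c, h_bar]))   # softmax(W [c_i; h-bar_i])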

Attention

‣ Note that this all uses outputs of hidden layers (the three variants are sketched below)

‣ Bahdanau+ (2014): additive    f(\bar{h}_i, h_j) = \tanh(W[\bar{h}_i; h_j])

‣ Luong+ (2015): dot product    f(\bar{h}_i, h_j) = \bar{h}_i \cdot h_j

‣ Luong+ (2015): bilinear    f(\bar{h}_i, h_j) = \bar{h}_i^{\top} W h_j

Luong et al. (2015)
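The three scoring functions side by side, as a sketch; note the full additive form in Bahdanau et al. also applies a learned vector v to reduce the tanh output to a scalar, which the slide's shorthand omits:

import numpy as np

def f_additive(h_bar, h, W_add, v):     # Bahdanau+ (2014): additive
    # v: (d,), W_add: (d, 2d); v reduces the tanh vector to a scalar score
    return v @ np.tanh(W_add @ np.concatenate([h_bar, h]))

def f_dot(h_bar, h):                    # Luong+ (2015): dot product
    return h_bar @ h

def f_bilinear(h_bar, h, W_bil):        # Luong+ (2015): bilinear, W_bil: (d, d)
    return h_bar @ W_bil @ h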

Alternatives

‣ When do we compute attention? Can compute before or after the RNN cell

‣ Bahdanau et al. (2015): before RNN cell; this one is a little more convoluted and less standard

‣ Luong et al. (2015): after RNN cell

What can attention do?

‣ Learning to copy: how might this work?

[Figure: input sequence 0 3 2 1 copied to output 0 3 2 1]

‣ LSTM can learn to count with the right weight matrix

‣ This is a kind of position-based addressing

Luong et al. (2015)


What can attention do?

‣ Learning to subsample tokens

[Figure: input sequence 0 3 2 1 mapped to output 3 1]

‣ Need to count (for ordering) and also determine which tokens are in/out

‣ Content-based addressing

Luong et al. (2015)

Attention

‣ Encoder hidden states capture contextual source word identity

‣ Decoder hidden states are now mostly responsible for selecting what to attend to

‣ Doesn't take a complex hidden state to walk monotonically through a sentence and spit out word-by-word translations

Batching Attention

‣ Make sure tensors are the right size! (sketched below)

hidden state \bar{h}_i: batch size x hidden size

token outputs h_j: batch size x sentence length x hidden size

sentence outputs: batch size x hidden size

e_{ij} = f(\bar{h}_i, h_j); \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}    attention scores: batch size x sentence length

c_i = \sum_j \alpha_{ij} h_j    c: batch size x hidden size

Luong et al. (2015)
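A batched version of those shapes in PyTorch (dot-product scores; random tensors stand in for real encoder and decoder outputs):

import torch

B, T, d = 32, 20, 256               # batch size, sentence length, hidden size
H = torch.randn(B, T, d)            # token outputs: batch x length x hidden
h_bar = torch.randn(B, d)           # decoder hidden state: batch x hidden

e = torch.bmm(H, h_bar.unsqueeze(2)).squeeze(2)  # scores: batch x length
alpha = torch.softmax(e, dim=1)                  # attention: batch x length
c = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)  # context c: batch x hidden
assert c.shape == (B, d)            # make sure tensors are the right size!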

Results

‣ Machine translation: BLEU score of 14.0 on English-German -> 16.8 with attention, 19.0 with smarter attention (we'll come back to this later)

‣ Summarization/headline generation: bigram recall from 11% -> 15%

‣ Semantic parsing: ~30-50% accuracy -> 70+% accuracy on Geoquery

Luong et al. (2015), Chopra et al. (2016), Jia and Liang (2016)


Copying Input/Pointers

Unknown Words

‣ Want to be able to copy named entities like Pont-de-Buis

P(y_i \mid x, y_1, \ldots, y_{i-1}) = \mathrm{softmax}(W [c_i; \bar{h}_i])    (c_i from attention; \bar{h}_i from the RNN hidden state)

‣ Problems: target word has to be in the vocabulary, attention + RNN need to generate a good embedding to pick it

Jean et al. (2015), Luong et al. (2015)

Pointer Networks

‣ Standard decoder (P_vocab): softmax over vocabulary, all words get >0 prob

P_{vocab}(y_i \mid x, y_1, \ldots, y_{i-1}) = \mathrm{softmax}(W [c_i; \bar{h}_i])

‣ Pointer network: predict from source words w_1 ... w_n instead of target vocab (sketch below)

P_{pointer}(y_i \mid x, y_1, \ldots, y_{i-1}) \propto \begin{cases} \exp(h_j^{\top} V \bar{h}_i) & \text{if } y_i = w_j \\ 0 & \text{otherwise} \end{cases}

[Figure: for source "the movie was great", the pointer distribution puts mass only on the source words, while the vocabulary softmax spreads mass over all words ("the", "movie", "was", "great", "a", "of", ...)]
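A numpy sketch of the pointer distribution; pooling probability over repeated source words is one common implementation choice, not something the slide prescribes:

import numpy as np

def pointer_distribution(H, h_bar, V, src_ids):
    """P_pointer: score each source position j by h_j^T V h-bar_i, softmax
    over positions, then map positions back to word ids (0 for all others)."""
    scores = H @ (V @ h_bar)            # one score per source position
    e = np.exp(scores - scores.max())
    probs = e / e.sum()                 # distribution over source positions
    dist = {}
    for j, w in enumerate(src_ids):     # repeated source words pool their mass
        dist[w] = dist.get(w, 0.0) + probs[j]
    return dist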

Pointer-Generator Mixture Models

‣ Define the decoder model as a mixture model of the P_vocab and P_pointer models (previous slide; sketched below)

P(y_i \mid x, y_1, \ldots, y_{i-1}) = P(\mathrm{copy}) P_{pointer}(y_i \mid \cdot) + (1 - P(\mathrm{copy})) P_{vocab}(y_i \mid \cdot)

‣ Predict P(copy) based on decoder state, input, etc.

‣ Marginalize over the copy variable during training and inference

‣ Model will be able to both generate and copy, flexibly adapting between the two
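A sketch of the mixture in numpy; the sigmoid gate on [h̄_i; c_i] is one plausible way to predict P(copy) from the decoder state and attention context, per the bullet above:

import numpy as np

def mixture_prob(word_id, p_vocab, p_pointer, h_bar, c, w_gate):
    """P(y_i) = P(copy) P_pointer(y_i) + (1 - P(copy)) P_vocab(y_i).
    p_vocab: array over the fixed vocab; p_pointer: dict from word id to prob."""
    p_copy = 1.0 / (1.0 + np.exp(-(w_gate @ np.concatenate([h_bar, c]))))
    gen = p_vocab[word_id] if word_id < len(p_vocab) else 0.0  # 0 if out of vocab
    return p_copy * p_pointer.get(word_id, 0.0) + (1.0 - p_copy) * gen

# Training maximizes the log of this mixture, i.e. the copy variable is
# marginalized out rather than supervised directly.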


Copying

‣ Some words we may want to copy may not be in the fixed output vocab (Pont-de-Buis)

[Figure: output distribution over the fixed vocabulary ("a", "the", ..., "zebra") extended with source words like Pont-de-Buis and ecotax]

‣ Solution: expand the vocabulary dynamically. New words can only be predicted by copying (always 0 probability under P_vocab)

Results

‣ Copying typically helps a bit, but attention captures most of the benefit. However, vocabulary expansion is critical for some tasks (machine translation)

‣ For semantic parsing, copying tokens from the input (texas) can be very useful

Jia and Liang (2016)

Transformers

Sentence Encoders

‣ LSTM abstraction: maps each vector in a sentence to a new, context-aware vector

‣ CNNs do something similar with filters

‣ Attention can give us a third way to do this

Vaswani et al. (2017)


Self-Attention

‣ Assume we're using GloVe; what do we want our neural network to do?

"The ballerina is very excited that she will dance in the show."

‣ What words need to be contextualized here?

‣ Pronouns need to look at antecedents

‣ Ambiguous words should look at context

‣ Words should look at syntactic parents/children

‣ Problem: LSTMs and CNNs don't do this

Vaswani et al. (2017)

Self-Attention

‣ Want: [figure showing long-distance links across "The ballerina is very excited that she will dance in the show.", e.g. "she" to "ballerina"]

‣ LSTMs/CNNs: tend to look at local context

"The ballerina is very excited that she will dance in the show."

‣ To appropriately contextualize embeddings, we need to pass information over long distances dynamically for each word

Vaswani et al. (2017)

Self-Attention

‣ Each word forms a "query" which then computes attention over each word

\alpha_{i,j} = \mathrm{softmax}(x_i^{\top} x_j)    (scalar)

x'_i = \sum_{j=1}^{n} \alpha_{i,j} x_j    (vector = sum of scalar * vector)

‣ Multiple "heads" analogous to different convolutional filters. Use parameters W_k and V_k to get different attention values + transform vectors (see the sketch below)

\alpha_{k,i,j} = \mathrm{softmax}(x_i^{\top} W_k x_j)

x'_{k,i} = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j

Vaswani et al. (2017)
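A numpy sketch of one layer of multi-head self-attention exactly as written above; the usual concatenation and output projection across heads are omitted:

import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, Ws, Vs):
    """X: (n, d) word vectors; Ws, Vs: per-head (d, d) parameter matrices.
    Returns one (n, d) contextualized output per head."""
    outputs = []
    for W_k, V_k in zip(Ws, Vs):
        alpha = softmax_rows(X @ W_k @ X.T)   # alpha_k,i,j = softmax_j(x_i^T W_k x_j)
        outputs.append((alpha @ X) @ V_k.T)   # x'_k,i = sum_j alpha_k,i,j V_k x_j
    return outputs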

What can self-attention do?

"The ballerina is very excited that she will dance in the show."

[Example attention weights over the sentence: 0.5 0.2 0.1 0.1 0.1 0 0 0 0 0 0 0 and 0.5 0 0.4 0 0.1 0 0 0 0 0 0 0]

‣ Attend nearby + to semantically related terms

‣ Why multiple heads? Softmaxes end up being peaked; a single distribution cannot easily put weight on multiple things

‣ This is a demonstration; we will revisit what these models actually learn when we discuss BERT

Vaswani et al. (2017)


Transformer Uses

‣ Supervised: transformer can replace LSTM as encoder, decoder, or both; will revisit this when we discuss MT

‣ Unsupervised: transformers work better than LSTMs for unsupervised pre-training of embeddings: predict word given context words

‣ BERT (Bidirectional Encoder Representations from Transformers): pretraining transformer language models similar to ELMo

‣ Stronger than similar methods, SOTA on ~11 tasks (including NER: 92.8 F1)

Takeaways

‣ Attention is very helpful for seq2seq models

‣ Explicitly copying input can be beneficial as well

‣ Transformers are strong models we'll come back to later

‣ We've now talked about most of the important core tools for NLP

‣ The rest of the class is more focused on applications: translation, information extraction, QA, and more