CS388: Natural Language Processing
Greg Durrett
Lecture 19: Pretrained Transformers
Credit: ???
Administrivia
‣ Project 2 due Tuesday
‣ Presentation day announcements next week
Recall: Self-Attention
Vaswani et al. (2017)
the movie was great
‣ Each word forms a “query” which then computes attention over each word
‣ Multiple “heads” analogous to different convolutional filters. Use parameters Wk and Vk to get different attention values + transform vectors (see the sketch after the equations below)
[Figure: computing x'_4 for “the movie was great”; each α is a scalar, and the new vector is a sum of scalar * vector]
α_{i,j} = softmax_j(x_i^T x_j)          x'_i = Σ_{j=1}^{n} α_{i,j} x_j
α_{k,i,j} = softmax_j(x_i^T W_k x_j)    x'_{k,i} = Σ_{j=1}^{n} α_{k,i,j} V_k x_j
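A minimal NumPy sketch of one attention head computing the α and x' above (W_k and V_k are random toy matrices here; this is illustrative, not code from the lecture):

```python
import numpy as np

def softmax(scores):
    # softmax over the last axis (over j, the words being attended to)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, W_k, V_k):
    # X: (n, d) word vectors; returns x'_{k,i} = sum_j alpha_{k,i,j} V_k x_j
    scores = X @ W_k @ X.T          # scores[i, j] = x_i^T W_k x_j (a scalar per word pair)
    alpha = softmax(scores)         # alpha[i, j] = attention of query word i on word j
    return alpha @ (X @ V_k.T)      # each output is a sum of scalar * (transformed) vector

# toy example: 4 words ("the movie was great"), d = 8, random parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = attention_head(X, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
print(out.shape)   # (4, 8): one new vector per word, per head
```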
Recall: Transformers
Vaswani et al. (2017)
the movie was great
‣ Augment word embedding with position embeddings; each dim is a sine/cosine wave of a different frequency. Closer points = higher dot products (see the sketch below)
‣ Works essentially as well as just encoding position as a one-hot vector
the movie was great
emb(1) emb(2) emb(3) emb(4)
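A short sketch of that sinusoidal scheme (NumPy; the sentence length and dimensionality are arbitrary toy values):

```python
import numpy as np

def position_embeddings(num_positions, dim):
    # each dimension is a sine/cosine wave of a different frequency
    pos = np.arange(num_positions)[:, None]                  # (num_positions, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))    # one frequency per sin/cos pair
    emb = np.zeros((num_positions, dim))
    emb[:, 0::2] = np.sin(pos * freqs)
    emb[:, 1::2] = np.cos(pos * freqs)
    return emb

emb = position_embeddings(4, 16)     # emb(1)..emb(4) for "the movie was great"
print(np.round(emb @ emb.T, 1))      # closer positions have higher dot products
```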
This Lecture
‣ GPT / GPT-2
‣ Analysis / Visualization
‣ BERT
BERT
‣ Three major changes compared to ELMo:
‣ Transformers instead of LSTMs (transformers in GPT as well)
‣ Bidirectional <=> masked LM objective instead of standard LM
‣ Fine-tune instead of freeze at test time
‣ AI2 made ELMo in spring 2018, GPT was released in summer 2018, BERT came out in October 2018
BERT
Devlin et al. (2019)
‣ ELMo is a unidirectional model (as is GPT): we can concatenate two unidirectional models, but is this the right thing to do?
A stunning ballet dancer, Copeland is one of the best performers to see live.
ELMo: “ballet dancer” / “performer” (each direction separately)
BERT: “ballet dancer / performer”
‣ ELMo reprs look at each direction in isolation; BERT looks at them jointly
BERT
‣ How to learn a “deeply bidirectional” model? What happens if we just replace an LSTM with a transformer?
John visited Madagascar yesterday
visited Madag. yesterday …
‣ Transformer LMs have to be “one-sided” (only attend to previous tokens), not what we want
John visited Madagascar yesterday
ELMo (language modeling): visited Madag. yesterday …
BERT
Masked Language Modeling
‣ How to prevent cheating? Next word prediction fundamentally doesn't work for bidirectional models; instead, do masked language modeling
John visited [MASK] yesterday
Madagascar
‣ BERT formula: take a chunk of text, predict 15% of the tokens
‣ For 80% (of the 15%), replace the input token with [MASK]
Devlin et al. (2019)
‣ For 10%, replace with a random token
‣ For 10%, keep the same (see the masking sketch below)
John visited of yesterday
John visited Madagascar yesterday
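A sketch of this masking recipe (a toy vocabulary is assumed as a stand-in for BERT's 30k word pieces):

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "a", "of", "all", "yesterday", "Madonna"]  # stand-in vocabulary

def mask_for_mlm(tokens, mask_prob=0.15):
    # pick ~15% of tokens as prediction targets; of those, 80% -> [MASK],
    # 10% -> a random token, 10% -> left unchanged
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                          # the model must recover the original
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK                      # "John visited [MASK] yesterday"
            elif r < 0.9:
                inputs[i] = random.choice(TOY_VOCAB)  # "John visited of yesterday"
            # else: keep the token as-is
    return inputs, targets

print(mask_for_mlm("John visited Madagascar yesterday".split()))
```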
Next “Sentence” Prediction
‣ Input: [CLS] Text chunk 1 [SEP] Text chunk 2
[CLS] John visited [MASK] yesterday and really all it [SEP] I like Madonna.
Madagascar
Devlin et al. (2019)
Transformer
Transformer
…
enjoyed    like    NotNext
‣ BERT objective: masked LM + next sentence prediction
‣ 50% of the time, take the true next chunk of text; 50% of the time, take a random other chunk. Predict whether the second chunk is the “true” next one (sketched below)
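A sketch of how a training pair could be built for this objective (chunk splitting and tokenization are assumed to happen upstream; a real implementation would also avoid accidentally sampling the true next chunk in the "NotNext" case):

```python
import random

def make_nsp_pair(chunks, i):
    # chunks: list of token lists; pair chunk i with its true successor 50% of the time
    first = chunks[i]
    if random.random() < 0.5 and i + 1 < len(chunks):
        second, label = chunks[i + 1], "IsNext"
    else:
        second, label = random.choice(chunks), "NotNext"   # random other chunk
    tokens = ["[CLS]"] + first + ["[SEP]"] + second
    segment_ids = [0] * (len(first) + 2) + [1] * len(second)  # segment embeddings: 0 vs. 1
    return tokens, segment_ids, label
```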
BERT Architecture
‣ BERT Base: 12 layers, 768-dim per wordpiece token, 12 heads. Total params = 110M
Devlin et al. (2019)
‣ BERT Large: 24 layers, 1024-dim per wordpiece token, 16 heads. Total params = 340M
‣ Positional embeddings and segment embeddings, 30k word pieces
‣ This is the model that gets pre-trained on a large corpus
What can BERT do?
Devlin et al. (2019)
‣ [CLS] token is used to provide classification decisions
‣ BERT can also do tagging by predicting tags at each word piece
‣ Sentence pair tasks (entailment): feed both sentences into BERT (see the sketch below)
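A hedged sketch of these uses with the Hugging Face transformers library (the library, model name, and classifier heads are assumptions, not part of the lecture):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# sentence-pair input: [CLS] sentence A [SEP] sentence B [SEP]
inputs = tokenizer("A boy plays in the snow", "A boy is outside", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, num_word_pieces, 768)

cls_vector = hidden[:, 0]   # [CLS] vector -> small classifier head (entailment, sentiment, ...)
token_vectors = hidden[0]   # one vector per word piece -> tagging
```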
What can BERT do?
‣ How does BERT model this sentence pair stuff?
‣ Transformers can capture interactions between the two sentences, even though the NSP objective doesn't really cause this to happen
Transformer
Transformer
…
[CLS] A boy plays in the snow [SEP] A boy is outside
Entails
What can BERT NOT do?
‣ BERT cannot generate text (at least not in an obvious way)
‣ Not an autoregressive model; can do weird things like stick a [MASK] at the end of a string, fill in the mask, and repeat (see the sketch after this list)
‣ Masked language models are intended to be used primarily for “analysis” tasks
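A sketch of that append-a-[MASK]-and-fill loop, using the Hugging Face fill-mask pipeline as an assumed stand-in (the model name and output keys come from that library, not from the lecture):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

text = "John visited Madagascar yesterday and"
for _ in range(5):
    best = fill(text + " [MASK]")[0]               # top-scoring filler for the appended mask
    text = text + " " + best["token_str"].strip()  # often degenerates into repetitive text
print(text)
```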
Fine-tuning BERT
‣ Fine-tune for 1-3 epochs, batch size 2-32, learning rate 2e-5 to 5e-5 (sketched below)
‣ Large changes to weights up here (particularly in the last layer, to route the right information to [CLS])
‣ Smaller changes to weights lower down in the transformer
‣ Small LR and short fine-tuning schedule mean the weights don't change much
‣ More complex “triangular learning rate” schemes exist
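A minimal fine-tuning loop with those hyperparameters (Hugging Face transformers + PyTorch are assumed; `train_loader` is a hypothetical DataLoader of tokenized batches that include labels):

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # small LR in the 2e-5 to 5e-5 range

model.train()
for epoch in range(3):                 # 1-3 epochs is typically enough
    for batch in train_loader:         # batches of size 2-32 (train_loader assumed)
        optimizer.zero_grad()
        loss = model(**batch).loss     # labels are part of the batch
        loss.backward()
        optimizer.step()
```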
Fine-tuning BERT
Peters, Ruder, and Smith (2019)
‣ BERT is typically better if the whole network is fine-tuned, unlike ELMo
Evaluation: GLUE
Wang et al. (2019)
Results
Devlin et al. (2018)
‣ Huge improvements over prior work (even compared to ELMo)
‣ Effective at “sentence pair” tasks: textual entailment (does sentence A imply sentence B?), paraphrase detection
RoBERTa
Liu et al. (2019)
‣ “Robustly optimized BERT”
‣ 160 GB of data instead of 16 GB
‣ Dynamic masking: standard BERT uses the same MASK scheme for every epoch; RoBERTa recomputes them
‣ New training + more data = better performance
GPT / GPT-2
OpenAI GPT / GPT-2
‣ GPT-2: trained on 40 GB of text collected from upvoted links from Reddit
‣ 1.5B parameters, by far the largest of these models trained as of March 2019
Radford et al. (2019)
‣ “ELMo with transformers” (works better than ELMo)
‣ Train a single unidirectional transformer LM on long contexts
‣ Because it's a language model, we can generate from it (see the sketch below)
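A hedged generation sketch using the Hugging Face transformers GPT-2 checkpoint (model name, prompt, and sampling settings are assumptions; OpenAI's own code differs):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer("In a shocking finding,", return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=60, do_sample=True, top_k=40)
print(tokenizer.decode(output_ids[0]))   # left-to-right sampling, one token at a time
```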
OpenAI GPT-2
slide credit: OpenAI
Open Questions
1) How novel is the stuff being generated? (Is it just doing nearest neighbors on a large corpus?)
2) How do we understand and distill what is learned in this model?
3) How do we harness these priors for conditional generation tasks (summarization, generate a report of a basketball game, etc.)?
4) Is this technology dangerous? (OpenAI has only released the 774M-parameter model, not 1.5B yet)
Grover
‣ Sample from a large language model conditioned on a domain, date, authors, and headline
Zellers et al. (2019)
‣ Humans rank Grover-generated propaganda as more realistic than real “fake news”
‣ NOTE: not a GAN; the discriminator is trained separately from the generator
‣ Fine-tuned Grover can detect Grover propaganda easily; the authors argue for releasing it for this reason
Pre-Training Cost (with Google/AWS)
https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/
‣ XLNet (BERT variant): $30,000-$60,000 (unclear)
‣ Grover-MEGA: $25,000
‣ BERT: Base $500, Large $7,000
‣ This is for a single pre-training run… developing new pre-training techniques may require many runs
‣ Fine-tuning these models can typically be done with a single GPU (but may take 1-3 days for medium-sized datasets)
Pushing the Limits
‣ NVIDIA: trained an 8.3B-parameter GPT model (5.6x the size of GPT-2)
NVIDIA blog (Narasimhan, August 2019)
‣ Arguably these models are still underfit: larger models still get better held-out perplexities
Google T5
Raffel et al. (October 23, 2019)
‣ We still haven't hit the limit of bigger data being useful
‣ Colossal Cleaned Common Crawl: 750 GB of text
BART
Lewis et al. (October 30, 2019)
‣ Sequence-to-sequence BERT variant: permute/mask/delete tokens, then predict the full sequence autoregressively
‣ For downstream tasks: feed the document into both encoder + decoder, use the decoder hidden state as output
‣ Good results on dialogue, summarization tasks
Analysis
What does BERT learn?
Clark et al. (2019)
‣ Heads on transformers learn interesting and diverse things: content heads (attend based on content), positional heads (attend based on position), etc.
What does BERT learn?
Clark et al. (2019)
‣ Still way worse than what supervised systems can do, but interesting that this is learned organically
Probing BERT
Tenney et al. (2019)
‣ Try to predict POS, etc. from each layer. Learn mixing weights that combine the layers into a representation of word piece i for task τ (sketched after this list)
‣ Plot shows the mixing weights (blue) and performance deltas when an additional layer is incorporated (purple)
‣ BERT “rediscovers the classical NLP pipeline”: first syntactic tasks, then semantic ones
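A sketch of that learned mixing idea ("scalar mix"): the probe for task τ reads a softmax-weighted combination of a word piece's representations across all of BERT's layers (PyTorch; module and variable names are mine, not from the paper):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    # mixes per-layer representations of each word piece into one vector for the probe
    def __init__(self, num_layers):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))  # one learned scalar per layer
        self.gamma = nn.Parameter(torch.ones(1))                   # overall scale

    def forward(self, layer_reprs):
        # layer_reprs: (num_layers, seq_len, hidden_dim) from a frozen BERT
        s = torch.softmax(self.layer_logits, dim=0)                # mixing weights (the blue bars)
        return self.gamma * (s[:, None, None] * layer_reprs).sum(dim=0)
```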
Compressing BERT
Michel et al. (2019)
‣ Remove 60+% of BERT's heads with minimal drop in performance
‣ DistilBERT (Sanh et al., 2019): nearly as good with half the parameters of BERT (via knowledge distillation)
Open Questions
‣ These techniques are here to stay; unclear what form will win out
‣ Role of academia vs. industry: no major pretrained model has come purely from academia
‣ BERT-based systems are state-of-the-art for nearly every major text analysis task
‣ Cost/carbon footprint: a single model costs $10,000+ to train (though this cost should come down)