Page 1

CS388: Natural Language Processing

Greg Durrett

Lecture 19: Pretrained Transformers

Credit: ???

Page 2

Administrivia

‣ Project 2 due Tuesday

‣ Presentation day announcements next week

Page 3

Recall: Self-Attention

Vaswani et al. (2017)

the movie was great

‣ Each word forms a “query” which then computes attention over each word

‣ Multiple “heads”, analogous to different convolutional filters. Use parameters $W_k$ and $V_k$ to get different attention values + transform vectors

[Figure: self-attention maps each $x_i$ (e.g. $x_4$) to $x_4'$ by attending over all words]

$\alpha_{i,j} = \mathrm{softmax}(x_i^\top x_j)$   (scalar)

$x_i' = \sum_{j=1}^{n} \alpha_{i,j}\, x_j$   (vector = sum of scalar · vector)

$\alpha_{k,i,j} = \mathrm{softmax}(x_i^\top W_k x_j)$,   $x_{k,i}' = \sum_{j=1}^{n} \alpha_{k,i,j}\, V_k x_j$
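‣ A minimal NumPy sketch of one attention head per the equations above; the dimensions, names, and random inputs are illustrative, not from the lecture:

```python
# Single attention head following the equations above:
#   alpha[k,i,j] = softmax_j(x_i^T W_k x_j),  x'[k,i] = sum_j alpha[k,i,j] V_k x_j
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, W_k, V_k):
    scores = X @ W_k @ X.T        # (n, n) matrix of scores x_i^T W_k x_j
    alpha = softmax(scores)       # normalize over j: each row sums to 1
    return alpha @ (X @ V_k.T)    # (n, d): x'_i = sum_j alpha_ij * (V_k x_j)

rng = np.random.default_rng(0)
n, d = 4, 8                        # e.g. "the movie was great": n = 4 tokens
X = rng.normal(size=(n, d))        # token vectors x_1 .. x_n
W_k = rng.normal(size=(d, d)) / d  # per-head attention parameters
V_k = rng.normal(size=(d, d)) / d  # per-head value transform
print(attention_head(X, W_k, V_k).shape)  # (4, 8)
```

Each head repeats this with its own $W_k$, $V_k$, giving different attention patterns over the same tokens.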

Page 4

Recall: Transformers

Vaswani et al. (2017)

the movie was great

‣ Augment the word embedding with position embeddings; each dimension is a sine/cosine wave of a different frequency. Closer points = higher dot products

‣ Works essentially as well as just encoding position as a one-hot vector

[Figure: “the movie was great” with position embeddings emb(1), emb(2), emb(3), emb(4) added to the word embeddings]
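‣ A small NumPy sketch of the sine/cosine scheme described above; the frequencies are the standard ones from Vaswani et al. (2017), and the names are illustrative:

```python
# Sinusoidal position embeddings: dimension 2i uses sin(pos / 10000^(2i/d)),
# dimension 2i+1 uses cos of the same angle. Each dim pair = one frequency.
import numpy as np

def position_embeddings(num_positions, d_model):
    pos = np.arange(num_positions)[:, None]         # positions 0 .. n-1
    i = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)   # one frequency per dim pair
    emb = np.zeros((num_positions, d_model))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

emb = position_embeddings(4, 128)        # emb(1)..emb(4) for a 4-word sentence
print(emb[0] @ emb[1], emb[0] @ emb[3])  # closer positions -> higher dot product
```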

Page 5

This Lecture

‣ GPT/GPT2

‣ Analysis/Visualization

‣ BERT

Page 6

BERT

Page 7

BERT

‣ Three major changes compared to ELMo:

‣ Transformers instead of LSTMs (transformers in GPT as well)

‣ Bidirectional <=> masked LM objective instead of standard LM

‣ Fine-tune instead of freeze at test time

‣ AI2 made ELMo in spring 2018, GPT was released in summer 2018, and BERT came out in October 2018

Page 8

BERT

Devlin et al. (2019)

‣ ELMo is a unidirectional model (as is GPT): we can concatenate two unidirectional models, but is this the right thing to do?

A stunning ballet dancer, Copeland is one of the best performers to see live.

[Figure: ELMo’s forward representation captures “ballet dancer” and its backward representation captures “performer”; BERT captures “ballet dancer / performer” jointly]

‣ ELMo representations look at each direction in isolation; BERT looks at them jointly

Page 9

BERT

‣ How to learn a “deeply bidirectional” model? What happens if we just replace an LSTM with a transformer?

[Figure: transformer over “John visited Madagascar yesterday”, predicting the next words “visited Madag. yesterday …” at each position]

‣ Transformer LMs have to be “one-sided” (only attend to previous tokens), which is not what we want

[Figure: ELMo (language modeling) over “John visited Madagascar yesterday” predicts “visited Madag. yesterday …” one direction at a time; BERT attends in both directions]

Page 10

Masked Language Modeling

‣ How to prevent cheating? Next-word prediction fundamentally doesn’t work for bidirectional models; instead, do masked language modeling

John visited [MASK] yesterday → Madagascar

‣ BERT formula: take a chunk of text, predict 15% of the tokens

Devlin et al. (2019)

‣ For 80% (of the 15%), replace the input token with [MASK]

‣ For 10%, replace with a random token: John visited of yesterday

‣ For 10%, keep the token the same: John visited Madagascar yesterday
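‣ A toy sketch of the 80/10/10 recipe above. The tokenization and vocabulary are stand-ins; real BERT works over wordpieces and selects exactly 15% of positions, while here each position is chosen independently:

```python
# Choose ~15% of tokens to predict; of those, 80% -> [MASK],
# 10% -> random token, 10% -> unchanged.
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:       # this position gets predicted
            targets[i] = tok                  # loss is computed here only
            r = random.random()
            if r < 0.8:
                inputs[i] = "[MASK]"              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: random token
            # else: 10%: keep the original token
    return inputs, targets

tokens = "John visited Madagascar yesterday".split()
vocab = ["of", "the", "ballet", "Madonna"]    # toy vocabulary
print(mask_tokens(tokens, vocab))
# e.g. (['John', 'visited', '[MASK]', 'yesterday'], [None, None, 'Madagascar', None])
```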

Page 11

Next “Sentence” Prediction

‣ Input: [CLS] Text chunk 1 [SEP] Text chunk 2

[Figure: [CLS] John visited [MASK] yesterday and really [MASK] it [SEP] I [MASK] Madonna. → Transformer layers → predictions: NotNext (at [CLS]), Madagascar, enjoyed, like]

Devlin et al. (2019)

‣ BERT objective: masked LM + next sentence prediction

‣ 50% of the time, take the true next chunk of text; 50% of the time, take a random other chunk. Predict whether the second chunk is the “true” next one

Page 12

BERT Architecture

‣ BERT Base: 12 layers, 768-dim per wordpiece token, 12 heads. Total params = 110M

Devlin et al. (2019)

‣ BERT Large: 24 layers, 1024-dim per wordpiece token, 16 heads. Total params = 340M

‣ Positional embeddings and segment embeddings, 30k word pieces

‣ This is the model that gets pre-trained on a large corpus
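‣ Roughly where the 110M figure comes from; a back-of-the-envelope count that ignores small terms (biases, LayerNorm, position/segment embeddings):

```python
# Approximate parameter count for BERT Base under standard transformer sizing.
vocab, d, layers, d_ff = 30000, 768, 12, 4 * 768

embeddings = vocab * d                 # wordpiece embeddings, ~23M
attention  = 4 * d * d                 # Q, K, V, and output projections
ffn        = 2 * d * d_ff              # two feed-forward matrices
per_layer  = attention + ffn           # ~7.1M per layer
total      = embeddings + layers * per_layer
print(f"{total / 1e6:.0f}M parameters")  # ~108M, close to the quoted 110M
```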

Page 13

What can BERT do?

Devlin et al. (2019)

‣ The [CLS] token is used to provide classification decisions

‣ BERT can also do tagging by predicting tags at each word piece

‣ Sentence pair tasks (entailment): feed both sentences into BERT

Page 14

What can BERT do?

‣ How does BERT model this sentence-pair stuff?

‣ Transformers can capture interactions between the two sentences, even though the NSP objective doesn’t really cause this to happen

[Figure: [CLS] A boy plays in the snow [SEP] A boy is outside → Transformer layers → Entails]
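‣ For concreteness, the [CLS]/[SEP] packing and segment embeddings above look like this with the Hugging Face transformers tokenizer (a library detail, not from the lecture):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("A boy plays in the snow", "A boy is outside")  # sentence pair
print(tok.decode(enc["input_ids"]))
# [CLS] a boy plays in the snow [SEP] a boy is outside [SEP]
print(enc["token_type_ids"])  # segment ids: 0s for chunk 1, 1s for chunk 2
```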

Page 15

What can BERT NOT do?

‣ BERT cannot generate text (at least not in an obvious way)

‣ It is not an autoregressive model; you can do weird things like stick a [MASK] at the end of a string, fill in the mask, and repeat (sketched below)

‣ Masked language models are intended to be used primarily for “analysis” tasks
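‣ The mask-and-repeat hack above, sketched with the Hugging Face fill-mask pipeline; the prompt and greedy choice are illustrative, and the output is predictably poor, which is the point:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
text = "John visited"
for _ in range(5):
    best = fill(text + " [MASK]")[0]        # top-scoring filler for the mask
    text = text + " " + best["token_str"]   # append it and repeat
print(text)  # rough pseudo-generation; BERT was never trained to do this
```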

Page 16

Fine-tuning BERT

‣ Fine-tune for 1-3 epochs, batch size 2-32, learning rate 2e-5 to 5e-5

‣ Large changes to weights up here (particularly in the last layer, to route the right information to [CLS])

‣ Smaller changes to weights lower down in the transformer

‣ Small LR and short fine-tuning schedule mean weights don’t change much

‣ More complex “triangular learning rate” schemes exist (the sketch below uses a linear warmup + decay)
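‣ A minimal fine-tuning sketch with the hyperparameters quoted above, using Hugging Face transformers and PyTorch; the two-example dataset is a toy stand-in:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
data = [("the movie was great", 1), ("the movie was awful", 0)]  # toy dataset
epochs, num_steps = 3, 3 * len(data)               # 1-3 epochs is typical
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # small LR
scheduler = get_linear_schedule_with_warmup(        # warmup then linear decay
    optimizer, num_warmup_steps=num_steps // 10, num_training_steps=num_steps)

model.train()
for epoch in range(epochs):
    for text, label in data:
        batch = tok(text, return_tensors="pt")
        loss = model(**batch, labels=torch.tensor([label])).loss  # [CLS] head
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```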

Page 17

Fine-tuning BERT

Peters, Ruder, and Smith (2019)

‣ BERT is typically better if the whole network is fine-tuned, unlike ELMo

Page 18

Evaluation: GLUE

Wang et al. (2019)

Page 19

Results

Devlin et al. (2019)

‣ Huge improvements over prior work (even compared to ELMo)

‣ Effective at “sentence pair” tasks: textual entailment (does sentence A imply sentence B?), paraphrase detection

Page 20

RoBERTa

Liu et al. (2019)

‣ “Robustly optimized BERT”

‣ 160GB of data instead of 16GB

‣ Dynamic masking: standard BERT uses the same [MASK] pattern for every epoch; RoBERTa re-samples the masks (sketched below)

‣ New training + more data = better performance
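‣ The difference, sketched with the toy mask_tokens() from the masked-LM slide above; static masking fixes the masks once at preprocessing time, dynamic masking re-samples them every epoch:

```python
corpus = ["John visited Madagascar yesterday".split()]
vocab = ["of", "the", "ballet", "Madonna"]

# Static masking (original BERT): same masked inputs every epoch
static = [mask_tokens(seq, vocab) for seq in corpus]
for epoch in range(3):
    print("static ", epoch, static[0][0])   # identical each epoch

# Dynamic masking (RoBERTa): fresh masks each time a sequence is seen
for epoch in range(3):
    fresh = [mask_tokens(seq, vocab) for seq in corpus]
    print("dynamic", epoch, fresh[0][0])    # masks vary across epochs
```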

Page 21

GPT/GPT2

Page 22

OpenAI GPT/GPT2

‣ GPT2: trained on 40GB of text collected from upvoted links from Reddit

‣ 1.5B parameters: by far the largest of these models trained as of March 2019

Radford et al. (2019)

‣ “ELMo with transformers” (works better than ELMo)

‣ Train a single unidirectional transformer LM on long contexts

‣ Because it’s a language model, we can generate from it
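‣ Since GPT2 is a left-to-right LM, generation is just repeated next-token prediction; a sketch with the Hugging Face transformers release of GPT2 (sampling settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tok("John visited Madagascar", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20,  # append 20 sampled tokens
                     do_sample=True, top_k=50)
print(tok.decode(out[0]))
```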

Page 23

OpenAI GPT2

slide credit: OpenAI

Page 24

Open Questions

1) How novel is the stuff being generated? (Is it just doing nearest neighbors on a large corpus?)

2) How do we understand and distill what is learned in this model?

3) How do we harness these priors for conditional generation tasks (summarization, generating a report of a basketball game, etc.)?

4) Is this technology dangerous? (OpenAI has only released the 774M-parameter model so far, not the 1.5B one)

Page 25

Grover

‣ Sample from a large language model conditioned on a domain, date, authors, and headline

Zellers et al. (2019)

‣ Humans rank Grover-generated propaganda as more realistic than real “fake news”

‣ NOTE: not a GAN; the discriminator is trained separately from the generator

‣ Fine-tuned Grover can detect Grover propaganda easily; the authors argue for releasing the model for this reason

Page 26

Pre-Training Cost (with Google/AWS)

https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/

‣ XLNet (BERT variant): $30,000-$60,000 (unclear)

‣ Grover-MEGA: $25,000

‣ BERT: Base $500, Large $7,000

‣ This is for a single pre-training run… developing new pre-training techniques may require many runs

‣ Fine-tuning these models can typically be done with a single GPU (but may take 1-3 days for medium-sized datasets)

Page 27

Pushing the Limits

‣ NVIDIA: trained an 8.3B-parameter GPT model (5.6x the size of GPT-2)

NVIDIA blog (Narasimhan, August 2019)

‣ Arguably these models are still underfit: larger models still get better held-out perplexities

Page 28

Google T5

Raffel et al. (October 23, 2019)

‣ We still haven’t hit the limit of bigger data being useful

‣ Colossal Cleaned Common Crawl: 750GB of text

Page 29

BART

Lewis et al. (October 30, 2019)

‣ Sequence-to-sequence BERT variant: permute/mask/delete tokens, then predict the full sequence autoregressively

‣ For downstream tasks: feed the document into both the encoder and the decoder, use the decoder hidden state as output

‣ Good results on dialogue and summarization tasks

Page 30

Analysis

Page 31

What does BERT learn?

Clark et al. (2019)

‣ Heads in transformers learn interesting and diverse things: content heads (attend based on content), positional heads (attend based on position), etc.

Page 32

What does BERT learn?

Clark et al. (2019)

‣ Still way worse than what supervised systems can do, but interesting that this is learned organically

Page 33

Probing BERT

Tenney et al. (2019)

‣ Try to predict POS, etc. from each layer, learning per-task mixing weights over layers: the representation of wordpiece $i$ for task $\tau$ is $h_{i,\tau} = \gamma_\tau \sum_{\ell} s_\ell^{\tau}\, h_i^{(\ell)}$

‣ Plot shows the mixing weights $s_\ell^{\tau}$ (blue) and the performance deltas when an additional layer is incorporated (purple)

‣ BERT “rediscovers the classical NLP pipeline”: first syntactic tasks, then semantic ones
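‣ A sketch of the learned scalar mix above; shapes and names are illustrative, and the layer activations are random stand-ins for frozen BERT outputs:

```python
import torch

L, n, d = 13, 4, 768                       # 12 layers + embeddings; 4 wordpieces
layers = torch.randn(L, n, d)              # stand-ins for BERT activations h_i^(l)
s = torch.nn.Parameter(torch.zeros(L))     # per-task layer weights s_l^tau
gamma = torch.nn.Parameter(torch.ones(1))  # per-task scale gamma_tau

weights = torch.softmax(s, dim=0)                     # normalize over layers
h = gamma * (weights[:, None, None] * layers).sum(0)  # (n, d): h_{i,tau}
# s (plus a small classifier on h) is trained with BERT frozen; softmax(s)
# per layer gives the blue mixing weights described above.
```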

Page 34

Compressing BERT

Michel et al. (2019)

‣ Remove 60+% of BERT’s heads with minimal drop in performance

‣ DistilBERT (Sanh et al., 2019): nearly as good with half the parameters of BERT (via knowledge distillation)
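‣ The knowledge-distillation idea behind DistilBERT, as a minimal sketch: train the small student to match the teacher’s softened output distribution (the real recipe also keeps the masked-LM loss). The logits here are random stand-ins:

```python
import torch
import torch.nn.functional as F

teacher_logits = torch.randn(8, 30000)   # from the frozen teacher (batch, vocab)
student_logits = torch.randn(8, 30000, requires_grad=True)
T = 2.0                                  # temperature softens both distributions

soft_targets = F.softmax(teacher_logits / T, dim=-1)
log_probs = F.log_softmax(student_logits / T, dim=-1)
loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T
loss.backward()                          # gradients flow into the student only
```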

Page 35

Open Questions

‣ These techniques are here to stay; it’s unclear what form will win out

‣ Role of academia vs. industry: no major pretrained model has come purely from academia

‣ BERT-based systems are state-of-the-art for nearly every major text analysis task

‣ Cost/carbon footprint: a single model costs $10,000+ to train (though this cost should come down)