
CS388: Natural Language Processing
Lecture 4: Sequence Models I

Greg Durrett
Parts of this lecture adapted from Dan Klein, UC Berkeley, and Vivek Srikumar, University of Utah

Administrivia

‣ Project 1 out today, due September 27

‣ This class will cover what you need to get started on it; the next class will cover everything you need to complete it

‣ Viterbi algorithm, CRF NER system, extension

‣ Extension should be substantial: don't just try one additional feature (see syllabus/spec for discussion, samples on website)

‣ Mini 1 due today

Recall: Multiclass Classification

‣ Logistic regression: $P(y|x) = \frac{\exp(w^\top f(x,y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x,y'))}$

  Gradient (unregularized): $\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \mathbb{E}_y[f_i(x_j, y)]$ (see the sketch after this list)

‣ SVM: defined by quadratic program (minimization, so gradients are flipped)

  Loss-augmented decode: $\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y_j^*) \right] - w^\top f(x_j, y_j^*)$

  Subgradient (unregularized) on jth example: $f_i(x_j, y_{\max}) - f_i(x_j, y_j^*)$
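As a quick illustration of this gradient, here is a minimal numpy sketch for a single example; it assumes the feature vectors $f(x_j, y)$ for all classes are stacked row-wise in a matrix `F` and that `gold` indexes the gold label $y_j^*$ (these names are illustrative, not from the course code):

```python
import numpy as np

def lr_gradient(F, w, gold):
    """Gradient of the log-likelihood for one example: f(x, y*) - E_y[f(x, y)]."""
    scores = F @ w                         # w^T f(x, y) for each class y
    probs = np.exp(scores - scores.max())  # stable softmax
    probs /= probs.sum()                   # P(y | x)
    return F[gold] - probs @ F             # gold features minus expected features
```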

Recall: Optimization

‣ Stochastic gradient *ascent*: $w \leftarrow w + \alpha g$, where $g = \frac{\partial}{\partial w}\mathcal{L}$

‣ Adagrad: $w_i \leftarrow w_i + \alpha \frac{1}{\sqrt{\epsilon + \sum_{\tau=1}^{t} g_{\tau,i}^2}}\, g_{t,i}$ (see the sketch after this list)

‣ SGD/AdaGrad have a batch size parameter

‣ Large batches (>50 examples): can parallelize within batch

‣ …but bigger batches often mean more epochs required because you make fewer parameter updates

‣ Shuffling: online methods are sensitive to dataset order, shuffling helps!
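A minimal numpy sketch of the two update rules above, written as gradient *ascent* to match the slide; `alpha` and `eps` correspond to α and ε, and `sq_grad_sum` is Adagrad's running sum of squared gradients:

```python
import numpy as np

def sgd_step(w, g, alpha=0.1):
    # Stochastic gradient *ascent*: step in the direction of the gradient
    return w + alpha * g

def adagrad_step(w, g, sq_grad_sum, alpha=0.1, eps=1e-8):
    # Accumulate per-coordinate squared gradients, then scale each coordinate's step
    sq_grad_sum = sq_grad_sum + g ** 2
    w = w + alpha * g / np.sqrt(eps + sq_grad_sum)
    return w, sq_grad_sum
```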


This Lecture

‣ Sequence modeling

‣ HMMs for POS tagging

‣ Viterbi, forward-backward

‣ HMM parameter estimation

Linguistic Structures

‣ Language is tree-structured

I ate the spaghetti with chopsticks        I ate the spaghetti with meatballs

‣ Understanding syntax fundamentally requires trees; the sentences have the same shallow analysis

I ate the spaghetti with chopsticks        I ate the spaghetti with meatballs
PRP VBZ DT NN IN NNS                       PRP VBZ DT NN IN NNS

Linguistic Structures

‣ Language is sequentially structured: interpreted in an online way

Tanenhaus et al. (1995)

POS Tagging

Ghana's ambassador should have set up the big meeting in DC yesterday.

‣ What tags are out there?

NNP POS NN MD VB VBN RP DT JJ NN IN NNP NN .


POS Tagging

Slide credit: Dan Klein

POS Tagging

Fed raises interest rates 0.5 percent

Candidate tags: Fed {VBD, VBN, NNP}; raises {VBZ, NNS}; interest {VB, VBP, NN}; rates {VBZ, NNS}; 0.5 {CD}; percent {NN}

I'm 0.5% interested in the Fed's raises!

I hereby increase interest rates 0.5%

Fed raises interest rates 0.5 percent (same candidate tags as above)

‣ Other paths are also plausible but even more semantically weird…

‣ What governs the correct choice? Word + context

‣ Word identity: most words have <= 2 tags, many have one (percent, the)

‣ Context: nouns start sentences, nouns follow verbs, etc.

What is this good for?

‣ Text-to-speech: record, lead

‣ Preprocessing step for syntactic parsers

‣ Domain-independent disambiguation for other tasks

‣ (Very) shallow information extraction

Sequence Models

‣ Input $x = (x_1, ..., x_n)$, output $y = (y_1, ..., y_n)$

‣ POS tagging: x is a sequence of words, y is a sequence of tags

‣ Today: generative models P(x, y); discriminative models next time


Hidden Markov Models

‣ Input $x = (x_1, ..., x_n)$, output $y = (y_1, ..., y_n)$

‣ Model the sequence of y as a Markov process (dynamics model): $y_1 \rightarrow y_2 \rightarrow y_3$

‣ Markov property: the future is conditionally independent of the past given the present: $P(y_3 \mid y_1, y_2) = P(y_3 \mid y_2)$

‣ If y are tags, this roughly corresponds to assuming that the next tag only depends on the current tag, not anything before

‣ Lots of mathematical theory about how Markov chains behave

Hidden Markov Models

‣ Input $x = (x_1, ..., x_n)$, output $y = (y_1, ..., y_n)$

(Graphical model: chain $y_1 \rightarrow y_2 \rightarrow \cdots \rightarrow y_n$ with emissions $y_i \rightarrow x_i$)

$P(\mathbf{y}, \mathbf{x}) = P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)$

where $P(y_1)$ is the initial distribution, $P(y_i \mid y_{i-1})$ are the transition probabilities, and $P(x_i \mid y_i)$ are the emission probabilities (a sketch of computing this joint probability follows this list)

‣ P(x|y) is a distribution over all words in the vocabulary, not a distribution over features (but could be!)

‣ Multinomials: tag x tag transitions, tag x word emissions

‣ Observation (x) depends only on current state (y)
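A minimal sketch of this factorization in log space. It assumes the initial, transition, and emission distributions are plain Python dictionaries keyed by tag, (previous tag, tag), and (tag, word) respectively; this storage format is an assumption for illustration, not the project's required one.

```python
import math

def hmm_log_joint(tags, words, init, trans, emit):
    """log P(y, x) = log P(y1) + sum_i log P(y_i | y_{i-1}) + sum_i log P(x_i | y_i)."""
    logp = math.log(init[tags[0]])                # initial distribution
    for prev, curr in zip(tags, tags[1:]):
        logp += math.log(trans[(prev, curr)])     # transition probabilities
    for tag, word in zip(tags, words):
        logp += math.log(emit[(tag, word)])       # emission probabilities
    return logp
```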

Transitions in POS Tagging

‣ Dynamics model: $P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1})$

Fed raises interest rates 0.5 percent (candidate tags as before)

‣ $P(y_1 = \mathrm{NNP})$: likely because start of sentence

‣ $P(y_2 = \mathrm{VBZ} \mid y_1 = \mathrm{NNP})$: likely because verb often follows noun

‣ $P(y_3 = \mathrm{NN} \mid y_2 = \mathrm{VBZ})$: direct object follows verb, other verb rarely follows past tense verb (main verbs can follow modals though!)

Estimating Transitions

‣ Similar to Naive Bayes estimation: maximum likelihood solution = normalized counts (with smoothing) read off supervised data

Fed/NNP raises/VBZ interest/NN rates/NNS 0.5/CD percent/NN ./.

‣ How to smooth?

‣ One method: smooth with the unigram distribution over tags: $P(\mathrm{tag} \mid \mathrm{tag}_{-1}) = (1 - \lambda)\,\hat{P}(\mathrm{tag} \mid \mathrm{tag}_{-1}) + \lambda\,\hat{P}(\mathrm{tag})$, where $\hat{P}$ = empirical distribution (read off from data) (see the sketch below)

‣ Example: $\hat{P}(\mathrm{tag} \mid \mathrm{NN})$ = (0.5 ., 0.5 NNS)
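A minimal sketch of this smoothed estimator, assuming the training data is a list of sentences given as (word, tag) pairs; `lam` plays the role of λ above:

```python
from collections import Counter

def estimate_transitions(tagged_sentences, lam=0.1):
    """Smoothed P(tag | prev tag) = (1 - lam) * bigram MLE + lam * unigram."""
    unigram, bigram, context = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        unigram.update(tags)
        bigram.update(zip(tags, tags[1:]))
        context.update(tags[:-1])            # how often each tag appears as tag_{-1}
    total = sum(unigram.values())
    tagset = list(unigram)

    def p_trans(tag, prev):
        mle = bigram[(prev, tag)] / context[prev] if context[prev] else 0.0
        return (1 - lam) * mle + lam * unigram[tag] / total

    return {(prev, tag): p_trans(tag, prev) for prev in tagset for tag in tagset}
```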


Emissions in POS Tagging

‣ Emissions P(x|y) capture the distribution of words occurring with a given tag

‣ P(word|NN) = (0.05 person, 0.04 official, 0.03 interest, 0.03 percent, …)

Fed/NNP raises/VBZ interest/NN rates/NNS 0.5/CD percent/NN

‣ When you compute the posterior for a given word's tags, the distribution favors tags that are more likely to generate that word

‣ How should we smooth this?
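The slide leaves this question open. As one possible answer (an illustration, not necessarily the method intended in the course), here is add-λ smoothing of the emission counts over a fixed vocabulary, mapping unseen words to an <UNK> entry; `vocab` and `lam` are assumed inputs:

```python
from collections import Counter

def estimate_emissions(tagged_sentences, vocab, lam=0.1):
    """Add-lambda smoothed P(word | tag); words outside vocab become <UNK>."""
    counts, tag_counts = Counter(), Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            w = word if word in vocab else "<UNK>"
            counts[(tag, w)] += 1
            tag_counts[tag] += 1
    vocab_size = len(vocab) + 1              # +1 for the <UNK> entry

    def p_emit(word, tag):
        w = word if word in vocab else "<UNK>"
        return (counts[(tag, w)] + lam) / (tag_counts[tag] + lam * vocab_size)

    return p_emit
```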

Inference in HMMs

‣ Input $x = (x_1, ..., x_n)$, output $y = (y_1, ..., y_n)$; $P(\mathbf{y}, \mathbf{x}) = P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)$

‣ Inference problem: $\operatorname{argmax}_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = \operatorname{argmax}_{\mathbf{y}} \frac{P(\mathbf{y}, \mathbf{x})}{P(\mathbf{x})}$

‣ Exponentially many possible y here!

‣ Solution: dynamic programming (possible because of Markov structure!)

‣ Many neural sequence models depend on the entire previous tag sequence, so they need to use approximations like beam search

Viterbi Algorithm

Slide credit: Vivek Srikumar

‣ Best (partial) score for a sequence ending in state s

Viterbi Algorithm

Slide credit: Dan Klein

‣ "Think about" all possible immediate prior state values. Everything before that has already been accounted for by earlier stages.
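The worked example lives in the slide figures; as a reference point, here is a minimal sketch of the standard Viterbi recurrence in log space, using the same dictionary-style HMM parameters assumed in the earlier sketches (all probabilities assumed nonzero after smoothing):

```python
import math

def viterbi(words, tagset, init, trans, emit):
    """Return the highest-scoring tag sequence under the HMM (log space)."""
    n = len(words)
    # score[i][s] = best log prob of any tag sequence for words[:i+1] ending in state s
    score = [{} for _ in range(n)]
    back = [{} for _ in range(n)]
    for s in tagset:
        score[0][s] = math.log(init[s]) + math.log(emit[(s, words[0])])
    for i in range(1, n):
        for s in tagset:
            # "Think about" all possible immediate prior states; everything earlier
            # is already summarized in score[i - 1]
            best_prev = max(tagset,
                            key=lambda p: score[i - 1][p] + math.log(trans[(p, s)]))
            score[i][s] = (score[i - 1][best_prev]
                           + math.log(trans[(best_prev, s)])
                           + math.log(emit[(s, words[i])]))
            back[i][s] = best_prev
    # Follow backpointers from the best final state
    tags = [max(tagset, key=lambda s: score[n - 1][s])]
    for i in range(n - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))
```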



Forward-Backward Algorithm

‣ In addition to finding the best path, we may want to compute marginal probabilities of paths:
$P(y_i = s \mid \mathbf{x}) = \sum_{y_1, ..., y_{i-1}, y_{i+1}, ..., y_n} P(\mathbf{y} \mid \mathbf{x})$

‣ What did Viterbi compute? $P(\mathbf{y}^{\max} \mid \mathbf{x}) = \max_{y_1, ..., y_n} P(\mathbf{y} \mid \mathbf{x})$

‣ Can compute marginals with dynamic programming as well using an algorithm called forward-backward


Forward-Backward Algorithm

Slide credit: Dan Klein

$P(y_3 = 2 \mid \mathbf{x}) = \dfrac{\text{sum of all paths through state 2 at time 3}}{\text{sum of all paths}}$

‣ Easiest and most flexible to do one pass to compute $\alpha$ and one to compute $\beta$

Forward-Backward Algorithm

‣ Initial: $\alpha_1(s) = P(s)\, P(x_1 \mid s)$

‣ Recurrence: $\alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1})\, P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$

‣ Same as Viterbi but summing instead of maxing!

‣ These quantities get very small! Store everything as log probabilities

Forward-Backward Algorithm

‣ Initial: $\beta_n(s) = 1$

‣ Recurrence: $\beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1})\, P(s_{t+1} \mid s_t)\, P(x_{t+1} \mid s_{t+1})$

‣ Big differences: count the emission for the next timestep (not the current one)


Forward-Backward Algorithm

$\alpha_1(s) = P(s)\, P(x_1 \mid s) \qquad \alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1})\, P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$

$\beta_n(s) = 1 \qquad \beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1})\, P(s_{t+1} \mid s_t)\, P(x_{t+1} \mid s_{t+1})$

$P(s_3 = 2 \mid \mathbf{x}) = \dfrac{\alpha_3(2)\, \beta_3(2)}{\sum_i \alpha_3(i)\, \beta_3(i)}$ (a sketch implementing these recurrences follows this list)

‣ What is the denominator here? $P(\mathbf{x})$

‣ Does this explain why beta is what it is?
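A minimal sketch of these recurrences, using the same dictionary-style parameters as in the earlier sketches; for brevity it works in probability space, although, as noted above, log probabilities (or rescaling) are needed in practice to avoid underflow:

```python
def forward_backward_marginal(words, tagset, init, trans, emit, i, s):
    """Return P(y_i = s | x) via the alpha/beta recurrences."""
    n = len(words)
    # Forward pass: alpha[t][st] sums all paths ending in state st at time t
    alpha = [{} for _ in range(n)]
    for st in tagset:
        alpha[0][st] = init[st] * emit[(st, words[0])]
    for t in range(1, n):
        for st in tagset:
            alpha[t][st] = emit[(st, words[t])] * sum(
                alpha[t - 1][prev] * trans[(prev, st)] for prev in tagset)
    # Backward pass: beta counts the emission for the *next* timestep
    beta = [{} for _ in range(n)]
    for st in tagset:
        beta[n - 1][st] = 1.0
    for t in range(n - 2, -1, -1):
        for st in tagset:
            beta[t][st] = sum(trans[(st, nxt)] * emit[(nxt, words[t + 1])]
                              * beta[t + 1][nxt] for nxt in tagset)
    # Marginal: alpha_i(s) * beta_i(s) normalized by P(x)
    denom = sum(alpha[i][st] * beta[i][st] for st in tagset)
    return alpha[i][s] * beta[i][s] / denom
```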

HMM POS Tagging

‣ Baseline: assign each word its most frequent tag: ~90% accuracy (see the sketch below)

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

Slide credit: Dan Klein
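A minimal sketch of that baseline, assuming the same (word, tag) training format as in the earlier sketches; unseen words fall back to a hypothetical default tag:

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences, default_tag="NN"):
    """Map each word to its most frequent training tag; unseen words get default_tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    lexicon = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}
    return lambda word: lexicon.get(word, default_tag)
```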

Trigram Taggers

‣ Trigram model: y1 = (<S>, NNP), y2 = (NNP, VBZ), … (see the sketch below)

‣ P((VBZ, NN) | (NNP, VBZ)): more context! Noun-verb-noun, S-V-O

Fed/NNP raises/VBZ interest/NN rates/NNS 0.5/CD percent/NN

‣ Tradeoff between model capacity and data size: trigrams are a "sweet spot" for POS tagging
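A minimal sketch of how a tag sequence maps onto these pair-valued states, with <S> as an assumed start symbol; a trigram transition P((VBZ, NN) | (NNP, VBZ)) is then just a bigram transition over these pair-states:

```python
def to_trigram_states(tags, start="<S>"):
    """Convert a tag sequence into pair-states y_i = (previous tag, current tag)."""
    prev_tags = [start] + tags[:-1]
    return list(zip(prev_tags, tags))

# Example: ["NNP", "VBZ", "NN"] -> [("<S>", "NNP"), ("NNP", "VBZ"), ("VBZ", "NN")]
```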

HMM POS Tagging

‣ Baseline: assign each word its most frequent tag: ~90% accuracy

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unks

‣ State-of-the-art (BiLSTM-CRFs): 97.5% / 89%+

Slide credit: Dan Klein


Errors

official knowledge        made up the story        recently sold shares
JJ/NN NN                  VBD RP/IN DT NN          RB VBD/VBN NNS

Slide credit: Dan Klein / Toutanova + Manning (2000)    (NN NN: tax cut, art gallery, …)

Remaining Errors

‣ Underspecified/unclear, gold standard inconsistent/wrong: 58%

‣ Lexicon gap (word not seen with that tag in training): 4.5%

‣ Unknown word: 4.5%

‣ Could get right: 16% (many of these involve parsing!)

‣ Difficult linguistics: 20%

They set up absurd situations, detached from reality    VBD/VBP? (past or present?)

a $10 million fourth-quarter charge against discontinued operations    adjective or verbal participle? JJ/VBN?

Manning 2011, "Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?"

Other Languages

‣ Universal POS tagset (~12 tags); a cross-lingual model works as well as a tuned CRF using external resources

Gillick et al. 2016

Next Time

‣ CRFs: feature-based discriminative models

‣ Structured SVM for sequences

‣ Named entity recognition