
CS388: Natural Language Processing

Greg Durrett

Lecture 6: Neural Networks

Administrivia

‣ Mini 1 graded later this week

‣ Project 1 due in a week

Recall: CRFs

‣ Naive Bayes : logistic regression :: HMMs : CRFs (local vs. global normalization <-> generative vs. discriminative)

(locally normalized discriminative models do exist (MEMMs))

P(y|x) = \frac{1}{Z} \exp\left( \sum_{k=1}^{n} w^\top f_k(x, y) \right)

[Factor graph: tags y_1, y_2 over words x_1, x_2, x_3 with feature factors f_1, f_2, f_3]

‣ HMMs: in the standard setup, emissions consider one word at a time

‣ CRFs: features over many words simultaneously, non-independent features (e.g., suffixes and prefixes), doesn't have to be a generative model

‣ Conditional model: x's are observed

Recall: Sequential CRFs

‣ Model:

‣ Inference: argmax P(y|x) from Viterbi

‣ Learning: run forward-backward to compute posterior probabilities; then

P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, \mathbf{x}) \right]

\frac{\partial}{\partial w} L(y^*, \mathbf{x}) = \sum_{i=1}^{n} f_e(y^*_i, i, \mathbf{x}) - \sum_{i=1}^{n} \sum_{s} P(y_i = s | \mathbf{x}) f_e(s, i, \mathbf{x})

[Chain-structured factor graph over y_1, y_2, ..., y_n with transition factors φ_t and emission factors φ_e]

‣ Emission features capture word-level info; transitions enforce tag consistency
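‣ A minimal sketch (not from the slides) of Viterbi decoding for a sequential CRF, assuming the emission and transition scores have already been computed into matrices; names and shapes are illustrative.

import torch

def viterbi_decode(emission_scores, transition_scores):
    # emission_scores: [seq_len, num_tags]; transition_scores[prev, curr]: [num_tags, num_tags]
    seq_len, num_tags = emission_scores.shape
    scores = emission_scores[0].clone()        # best score of a path ending in each tag at position 0
    backpointers = []
    for i in range(1, seq_len):
        # candidate[prev, curr] = best path ending in prev, plus transition and emission at i
        candidate = scores.unsqueeze(1) + transition_scores + emission_scores[i].unsqueeze(0)
        scores, best_prev = candidate.max(dim=0)
        backpointers.append(best_prev)
    best_tag = int(scores.argmax())            # best final tag; follow backpointers from here
    path = [best_tag]
    for best_prev in reversed(backpointers):
        best_tag = int(best_prev[best_tag])
        path.append(best_tag)
    return list(reversed(path))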

This Lecture

‣ Finish discussion of NER

‣ Neural network history

‣ Neural network basics

‣ Feedforward neural networks + backpropagation

‣ Applications

‣ Implementing neural networks (if time)

‣ Beam search: in a few lectures

NER

NER

‣ CRF with lexical features can get around 85 F1 on this problem

‣ Other pieces of information that many systems capture

‣ World knowledge:

The delegation met the president at the airport, Tanjug said.

ORG? PER?

Nonlocal Features

The delegation met the president at the airport, Tanjug said.

The news agency Tanjug reported on the outcome of the meeting.

‣ More complex factor graph structures can let you capture this, or just decode sentences in order and use features on previous sentences

Finkel and Manning (2008), Ratinov and Roth (2009)

Semi-Markov Models

Barack Obama will travel to Hangzhou today for the G20 meeting.

[Spans: "Barack Obama" = PER, "Hangzhou" = LOC, "G20" = ORG, remaining chunks = O]

‣ Chunk-level prediction rather than token-level BIO

‣ y is a set of spans covering the sentence

‣ Pros: features can look at whole span at once

‣ Cons: there's an extra factor of n in the dynamic programs

Sarawagi and Cohen (2004)

Evaluating NER

‣ Prediction of all Os still gets 66% accuracy on this example!

Barack Obama will travel to Hangzhou today for the G20 meeting.
B-PER I-PER O O O B-LOC O O O B-ORG O O   (entities: PERSON, LOC, ORG)

‣ What we really want to know: how many named entity chunk predictions did we get right?

‣ Precision: of the ones we predicted, how many are right?

‣ Recall: of the gold named entities, how many did we find?

‣ F-measure: harmonic mean of these two
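‣ A minimal sketch (not from the slides) of chunk-level precision/recall/F1: a predicted entity counts as correct only if both its span and its label exactly match a gold entity.

def chunk_f1(gold_chunks, pred_chunks):
    # each chunk is a (label, start, end) tuple, e.g. ("PER", 0, 2)
    gold, pred = set(gold_chunks), set(pred_chunks)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1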

How well do NER systems do?

Ratinov and Roth (2009)

Lample et al. (2016)

BiLSTM-CRF + ELMo: Peters et al. (2018), 92.2 F1

BERT: Devlin et al. (2019), 92.8 F1

Modern Entity Typing

Choi et al. (2018)

‣ More and more classes (17 -> 112 -> 10,000+)

Neural Net History

History: NN "dark ages"

‣ Convnets: applied to MNIST by LeCun in 1998

‣ LSTMs: Hochreiter and Schmidhuber (1997)

‣ Henderson (2003): neural shift-reduce parser, not SOTA

2008-2013: A glimmer of light…

‣ Collobert and Weston 2011: "NLP (almost) from scratch"

‣ Feedforward neural nets induce features for sequential CRFs ("neural CRF")

‣ 2008 version was marred by bad experiments and claimed SOTA but wasn't; 2011 version tied SOTA

‣ Socher 2011-2014: tree-structured RNNs working okay

‣ Krizhevsky et al. (2012): AlexNet for vision

2014: Stuff starts working

‣ Sutskever et al. + Bahdanau et al.: seq2seq for neural MT (LSTMs)

‣ Kim (2014) + Kalchbrenner et al. (2014): sentence classification / sentiment (convnets)

‣ 2015: explosion of neural nets for everything under the sun

‣ Chen and Manning transition-based dependency parser (based on feedforward networks)

‣ What made these work? Data (not as important as you might think), optimization (initialization, adaptive optimizers), representation (good word embeddings)

Neural Net Basics

Neural Networks

‣ Want to learn intermediate conjunctive features of the input

‣ Linear classification: argmax_y w^\top f(x, y)

the movie was not all that good

I[contains not & contains good]

‣ How do we learn this if our feature vector is just the unigram indicators?

I[contains not], I[contains good]

Neural Networks: XOR

‣ Let's see how we can use neural nets to learn a simple nonlinear function

‣ Inputs: x_1, x_2 (generally x = (x_1, ..., x_m))

‣ Output: y (generally y = (y_1, ..., y_n)); here y = x_1 XOR x_2

x_1  x_2  x_1 XOR x_2
0    0    0
0    1    1
1    0    1
1    1    0

Neural Networks: XOR

[XOR truth table and plot of the four input points, as on the previous slide]

y = a_1 x_1 + a_2 x_2   ✗

y = a_1 x_1 + a_2 x_2 + a_3 tanh(x_1 + x_2)
("or"; the tanh term looks like the action potential in a neuron)

Neural Networks: XOR

[XOR truth table and plot in the (x_1, x_2) plane, as before]

y = a_1 x_1 + a_2 x_2   ✗

y = a_1 x_1 + a_2 x_2 + a_3 tanh(x_1 + x_2)

"or": y = -x_1 - x_2 + 2 tanh(x_1 + x_2)
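‣ A quick numeric check (not from the slides) that this hand-set network separates XOR: the output is around 0.52 when x_1 XOR x_2 = 1 and near 0 otherwise.

import math

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = -x1 - x2 + 2 * math.tanh(x1 + x2)
    print(x1, x2, round(y, 2))   # 0.0, 0.52, 0.52, -0.07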

Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

[Figure: warp space, shift, nonlinear transformation]

Linear model: y = w · x + b

Add a nonlinearity: y = g(w · x + b); vector-valued version: y = g(Wx + b)

Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

[Figure: linear classifier vs. neural network decision boundaries] … possible because we transformed the space!

Deep Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

z = g(V g(Wx + b) + c)

z = g(V y + c), where y = g(Wx + b) is the output of the first layer

Check: what happens if there is no nonlinearity, i.e., z = V(Wx + b) + c? Is that more powerful than basic linear models? (See the check below.)
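‣ A quick check (not from the slides) that without a nonlinearity the two layers collapse to one linear map, z = (VW)x + (Vb + c), so nothing is gained; sizes below are arbitrary.

import torch

torch.manual_seed(0)
W, b = torch.randn(4, 3), torch.randn(4)
V, c = torch.randn(2, 4), torch.randn(2)
x = torch.randn(3)

two_layers = V @ (W @ x + b) + c
one_layer = (V @ W) @ x + (V @ b + c)          # the same linear map, so no extra power
print(torch.allclose(two_layers, one_layer))   # True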

Feedforward Networks, Backpropagation

Logistic Regression with NNs

P(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}   ‣ Single scalar probability

P(y|x) = \mathrm{softmax}\left([w^\top f(x, y)]_{y \in \mathcal{Y}}\right)   ‣ Compute scores for all possible labels at once (returns vector)

\mathrm{softmax}(p)_i = \frac{\exp(p_i)}{\sum_{i'} \exp(p_{i'})}   ‣ softmax: exps and normalizes a given vector

P(y|x) = \mathrm{softmax}(W f(x))   ‣ Weight vector per class; W is [num classes x num feats]

P(y|x) = \mathrm{softmax}(W g(V f(x)))   ‣ Now one hidden layer
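‣ A small example (not from the slides) of what softmax does: exponentiate, then normalize so the entries sum to 1.

import torch

def softmax(p):
    exp_p = torch.exp(p)          # exponentiate each entry
    return exp_p / exp_p.sum()    # normalize so entries sum to 1

scores = torch.tensor([2.0, 1.0, -1.0])
print(softmax(scores))                  # tensor([0.7054, 0.2595, 0.0351])
print(torch.softmax(scores, dim=0))     # same result from the built-in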

Neural Networks for Classification

[Diagram: f(x) (n features) → V (d x n matrix) → nonlinearity g (tanh, relu, …) → z (d hidden units) → W (num_classes x d matrix) → softmax → P(y|x) (num_classes probs)]

P(y|x) = softmax(W g(V f(x)))

Training Neural Networks

‣ Maximize log likelihood of training data

‣ i*: index of the gold label

‣ e_i: 1 in the ith row, zero elsewhere. Dot by this = select ith index

z = g(V f(x)),   P(y|x) = \mathrm{softmax}(Wz)

L(x, i^*) = \log P(y = i^*|x) = \log\left(\mathrm{softmax}(Wz) \cdot e_{i^*}\right)

L(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j
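‣ A minimal numeric check (not from the slides) that the two forms of the loss agree; sizes and the gold index below are arbitrary.

import torch

num_classes, hidden = 3, 4
W = torch.randn(num_classes, hidden)
z = torch.randn(hidden)
gold = 1
e_gold = torch.zeros(num_classes)
e_gold[gold] = 1.0

scores = W @ z
loss_onehot = scores @ e_gold - torch.logsumexp(scores, dim=0)   # Wz·e_i* - log Σ_j exp(Wz)_j
loss_direct = torch.log_softmax(scores, dim=0)[gold]             # log(softmax(Wz)·e_i*)
print(torch.allclose(loss_onehot, loss_direct))                  # True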

Computing Gradients

‣ Gradient with respect to W

\frac{\partial}{\partial W_{ij}} L(x, i^*) = \begin{cases} z_j - P(y = i|x)\, z_j & \text{if } i = i^* \\ -P(y = i|x)\, z_j & \text{otherwise} \end{cases}

‣ Looks like logistic regression with z as the features!

L(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j

[Diagram: the W matrix with row i and column j highlighted]
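‣ A quick autograd check (not from the slides) of this gradient: ∂L/∂W_ij = (1[i = i*] − P(y = i|x)) z_j. Sizes and the gold index below are arbitrary.

import torch

# compare the analytic gradient of the log likelihood with PyTorch autograd
num_classes, hidden = 3, 4
W = torch.randn(num_classes, hidden, requires_grad=True)
z = torch.randn(hidden)
gold = 0

loss = torch.log_softmax(W @ z, dim=0)[gold]   # log P(y = i* | x)
loss.backward()

probs = torch.softmax(W @ z, dim=0).detach()
indicator = torch.zeros(num_classes)
indicator[gold] = 1.0
analytic = (indicator - probs).unsqueeze(1) * z.unsqueeze(0)   # (e_i* - P) z^T
print(torch.allclose(W.grad, analytic, atol=1e-6))             # True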

Neural Networks for Classification

[Same diagram as before: f(x) → V → g → z → W → softmax → P(y|x)]

P(y|x) = softmax(W g(V f(x)))

∂L/∂W is computed using z

Computing Gradients: Backpropagation

z = g(V f(x))   ‣ Activations at the hidden layer

‣ Gradient with respect to V: apply the chain rule

L(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j

\mathrm{err(root)} = e_{i^*} - P(y|x)   (dim = m)

\frac{\partial L(x, i^*)}{\partial z} = \mathrm{err}(z) = W^\top \mathrm{err(root)}   (dim = d)

\frac{\partial L(x, i^*)}{\partial V_{ij}} = \frac{\partial L(x, i^*)}{\partial z} \frac{\partial z}{\partial V_{ij}}   [some math…]
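‣ A similar autograd check (not from the slides) of err(z) = ∂L/∂z = W^⊤(e_i* − P(y|x)); sizes and the gold index are arbitrary.

import torch

num_classes, hidden = 3, 4
W = torch.randn(num_classes, hidden)
z = torch.randn(hidden, requires_grad=True)
gold = 2

loss = torch.log_softmax(W @ z, dim=0)[gold]
loss.backward()

probs = torch.softmax(W @ z, dim=0).detach()
err_root = torch.zeros(num_classes)
err_root[gold] = 1.0
err_root = err_root - probs                                  # e_i* - P(y|x)
print(torch.allclose(z.grad, W.t() @ err_root, atol=1e-6))   # True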

Backpropagation: Picture

[Network diagram: f(x) → V → g → z → W → softmax → P(y|x), with err(root) and err(z) passed backwards and ∂L/∂W computed at W]

P(y|x) = softmax(W g(V f(x)))

‣ Can forget everything after z, treat it as the output and keep backpropping

Backpropagation: Takeaways

‣ Gradients of output weights W are easy to compute: looks like logistic regression with hidden layer z as the feature vector

‣ Can compute derivative of loss with respect to z to form an "error signal" for backpropagation

‣ Easy to update parameters based on "error signal" from next layer, keep pushing error signal back as backpropagation

‣ Need to remember the values from the forward computation

Applications

NLP with Feedforward Networks

Botha et al. (2017)

Fed raises interest rates in order to …

f(x) = [emb(raises) (previous word); emb(interest) (curr word); emb(rates) (next word); other words, feats, etc.]

‣ Word embeddings for each word form the input

‣ ~1000 features here: a smaller feature vector than in sparse models, but every feature fires on every example

‣ Weight matrix learns position-dependent processing of the words

‣ Part-of-speech tagging with FFNNs (a minimal sketch follows below)
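‣ A minimal sketch of such a feedforward tagger (not from the slides; the class name, sizes, and the three-word window are illustrative): embed the previous, current, and next word, concatenate, and classify.

import torch
import torch.nn as nn

class FFTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden, num_tags):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(3 * emb_dim, hidden),   # position-dependent weights for each window slot
            nn.Tanh(),
            nn.Linear(hidden, num_tags),
        )

    def forward(self, prev_id, curr_id, next_id):
        # each argument is a scalar LongTensor word index
        x = torch.cat([self.emb(prev_id), self.emb(curr_id), self.emb(next_id)], dim=-1)
        return torch.log_softmax(self.ff(x), dim=-1)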

NLP with Feedforward Networks

‣ Hidden layer mixes these different signals and learns feature conjunctions

Botha et al. (2017)

NLP with Feedforward Networks

‣ Multilingual tagging results:

Botha et al. (2017)

‣ Gillick used LSTMs; this is smaller, faster, and better

Sentiment Analysis

‣ Deep Averaging Networks: feedforward neural network on average of word embeddings from input (sketch below)

Iyyer et al. (2015)
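‣ A minimal sketch of a Deep Averaging Network (not from the slides; sizes are illustrative): average the input's word embeddings, then run a feedforward net on the average.

import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden, num_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, word_ids):                 # word_ids: LongTensor [sentence_length]
        avg = self.emb(word_ids).mean(dim=0)     # average of the word embeddings
        return torch.log_softmax(self.ff(avg), dim=-1)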

Sentiment Analysis

[Table comparing bag-of-words models with tree RNNs / CNNs / LSTMs: Wang and Manning (2012), Kim (2014), Iyyer et al. (2015)]

Implementation Details

Computation Graphs

‣ Computing gradients is hard! The computation graph abstraction allows us to define a computation symbolically and will do this for us

‣ Automatic differentiation: keep track of derivatives / be able to backpropagate through each function:

y = x * x   →(codegen)→   (y, dy) = (x * x, 2 * x * dx)

‣ Use a library like PyTorch or TensorFlow. This class: PyTorch
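‣ A minimal PyTorch example (not from the slides) of automatic differentiation on the same function:

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x * x          # build the computation graph for y = x * x
y.backward()       # backpropagate through it
print(x.grad)      # tensor(6.) == 2 * x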

Computation Graphs in PyTorch

P (y|x) = softmax(Wg(V f(x)))

class FFNN(nn.Module):
    def __init__(self, inp, hid, out):
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)
        self.g = nn.Tanh()
        self.W = nn.Linear(hid, out)
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        return self.softmax(self.W(self.g(self.V(x))))

‣ Define forward pass for P(y|x)

Computation Graphs in PyTorch

P (y|x) = softmax(Wg(V f(x)))

ffnn = FFNN()

def make_update(input, gold_label):
    ffnn.zero_grad()  # clear gradient variables
    probs = ffnn.forward(input)
    loss = torch.neg(torch.log(probs)).dot(gold_label)
    loss.backward()
    optimizer.step()

‣ gold_label is e_{i*}: one-hot vector of the label (e.g., [0, 1, 0])

Training a Model

Define a computation graph

For each epoch:

    For each batch of data:

        Compute loss on batch

        Autograd to compute gradients

        Take step with optimizer

Decode test set
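‣ A minimal sketch of this recipe in PyTorch (not from the slides): the sizes, the optimizer choice, and the train_batches iterable of (features, one-hot gold label) pairs are assumptions.

import torch

ffnn = FFNN(inp=300, hid=100, out=2)                      # the FFNN class defined earlier
optimizer = torch.optim.Adam(ffnn.parameters(), lr=1e-3)

for epoch in range(10):
    for x, gold_label in train_batches:                   # assumed iterable of training examples
        ffnn.zero_grad()
        probs = ffnn.forward(x)
        loss = torch.neg(torch.log(probs)).dot(gold_label)   # negative log likelihood of the gold label
        loss.backward()                                    # autograd computes gradients
        optimizer.step()                                   # take a step with the optimizer
# after training: decode the test set with the learned parameters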

Next Time

‣ Word representations / word vectors

‣ word2vec, GloVe

‣ Training neural networks
