Logistic Regression & Neural Networks
CMSC723 / LING723 / INST725
Marine Carpuat
Slides credit: Graham Neubig, Jacob Eisenstein
Logistic Regression
Perceptron & Probabilities
• What if we want a probability p(y|x)?
• The perceptron gives us a prediction y
• Let's illustrate this with binary classification
Illustrations: Graham Neubig
The logistic function
• "Softer" function than in the perceptron
• Can account for uncertainty
• Differentiable
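As a concrete illustration, here is a minimal sketch (ours, not from the slides) of the logistic function applied to a perceptron-style score w · φ(x); the weight and feature vectors are toy values.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def prob_positive(w, phi_x):
    """P(y = +1 | x) for the linear score w . phi(x)."""
    return logistic(np.dot(w, phi_x))

# Toy example: three features, three weights.
w = np.array([0.5, -1.0, 2.0])
phi_x = np.array([1.0, 1.0, 0.0])
print(prob_positive(w, phi_x))  # ~0.38, so the model leans toward y = -1
```

Unlike the hard sign(·) of the perceptron, the output changes smoothly with the score, which is what makes it differentiable.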
Logistic regression: how to train?
• Train based on conditional likelihood
• Find parameters w that maximize the conditional likelihood of all answers y_i given examples x_i
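Spelled out, this is the standard conditional (log-)likelihood objective:

$$\mathbf{w}^{*} = \arg\max_{\mathbf{w}} \prod_i P(y_i \mid x_i; \mathbf{w}) = \arg\max_{\mathbf{w}} \sum_i \log P(y_i \mid x_i; \mathbf{w})$$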
Stochastic gradient ascent (or descent)
• Online training algorithm for logistic regression and other probabilistic models
• Update weights for every training example
• Move in the direction given by the gradient
• Size of update step scaled by the learning rate
Gradient of the logistic function
Example: person/not-person classification problem. Given an introductory sentence from Wikipedia, predict whether the article is about a person.
Example: initial update
Example: second update
How to set the learning rate?
• Various strategies
• Decay over time:
$$\alpha = \frac{1}{C + t}$$
where C is a parameter and t is the number of samples seen so far (a small sketch follows below)
• Use a held-out test set, and increase the learning rate when the likelihood increases
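A minimal sketch of the decay schedule above; the constant C, the choice of 20 updates, and the function name are illustrative.

```python
def learning_rate(t, C=5.0):
    """Decaying learning rate: alpha = 1 / (C + t), where t counts updates so far."""
    return 1.0 / (C + t)

# The step size shrinks as training progresses.
for t in range(0, 20, 5):
    print(t, learning_rate(t))
```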
Multiclass version
Some models are better than others…
• Consider these 2 examples
• Which of the 2 models below is better?
Classifier 2 will probably generalize better! It does not include irrelevant information => the smaller model is better
Regularization
• A penalty on adding extra weights
• L2 regularization: a penalty on $\|\mathbf{w}\|_2^2$
  • big penalty on large weights
  • small penalty on small weights
• L1 regularization: a penalty on $\|\mathbf{w}\|_1$
  • uniform increase whether weights are large or small
  • will cause many weights to become exactly zero
(see the sketch below for how each penalty changes an online update)
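A sketch of how each penalty changes an online (SGD-style) update, assuming the likelihood gradient `grad` for the current example has already been computed; the function names and the strength `lam` are illustrative.

```python
import numpy as np

def sgd_step_l2(w, grad, alpha=0.1, lam=0.01):
    """L2-regularized update: the penalty lam * ||w||_2^2 shrinks each weight in
    proportion to its size, so large weights pay a large penalty."""
    return w + alpha * (grad - 2 * lam * w)

def sgd_step_l1(w, grad, alpha=0.1, lam=0.01):
    """L1-regularized update: the penalty lam * ||w||_1 pushes every weight toward
    zero by the same amount, so many weights end up exactly zero."""
    return w + alpha * (grad - lam * np.sign(w))
```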
L1 regularization in online learning
What you should know
• Standard supervised learning set-up for text classification
  • Difference between train vs. test data
  • How to evaluate
• 3 examples of supervised linear classifiers
  • Naïve Bayes, Perceptron, Logistic Regression
• Learning as optimization: what is the objective function optimized?
• Difference between generative vs. discriminative classifiers
• Smoothing, regularization
• Overfitting, underfitting
Neural networks
Person/not-person classification problem: given an introductory sentence from Wikipedia, predict whether the article is about a person.
Formalizing binary prediction
The Perceptron: a "machine" to calculate a weighted sum

$$\mathrm{sign}\left(\sum_{i=1}^{I} w_i \cdot \phi_i(x)\right)$$
[Figure: unigram features and weights for the example sentence "A site, located in Maizuru, Kyoto"]
φ("A") = 1, φ("site") = 1, φ(",") = 2, φ("located") = 1, φ("in") = 1, φ("Maizuru") = 1, φ("Kyoto") = 1, φ("priest") = 0, φ("black") = 0
weights w = (0, -3, 0, 0, 0, 0, 0, 2, 0), so the weighted sum is negative and the output is sign(·) = -1 (see the sketch below)
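A sketch of that weighted-sum prediction on the example above; the feature and weight dictionaries mirror the figure, and the helper name is ours.

```python
def predict(weights, features):
    """Perceptron prediction: the sign of the weighted sum of feature values."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1 if score >= 0 else -1

# Unigram features of "A site, located in Maizuru, Kyoto ..."
features = {"A": 1, "site": 1, ",": 2, "located": 1, "in": 1,
            "Maizuru": 1, "Kyoto": 1}
weights = {"site": -3, "priest": 2}   # all other weights are 0
print(predict(weights, features))     # -1: predicted "not a person"
```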
The Perceptron: geometric interpretation
[Figure: X and O training examples in feature space, separated by a linear decision boundary]
Limitation of the perceptron
● can only find linear separations between positive and negative examples
[Figure: an XOR-like arrangement of X and O examples that no single line can separate]
Neural Networks
● Connect together multiple perceptrons
[Figure: the example feature vector feeding into several perceptron units]
● Motivation: can represent non-linear functions!
Neural Networks: key terms
[Figure: a multi-layer network over the example feature vector]
• Input (aka features)
• Output
• Nodes
• Layers
• Hidden layers
• Activation function (non-linear)
• Multi-layer perceptron
Example
● Create two classifiers
[Figure: XOR-like X and O examples in the original space, and the two classifiers drawn as perceptron units]
φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}
Classifier 1: φ1[0] = sign(w0,0 · φ0(x) + b0,0), with w0,0 = {1, 1} and b0,0 = -1
Classifier 2: φ1[1] = sign(w0,1 · φ0(x) + b0,1), with w0,1 = {-1, -1} and b0,1 = -1
Example
● These classifiers map the examples to a new space
[Figure: the original space φ0 (left) and the new space φ1 (right)]
Original space: φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}
New space: φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1}
Example
● In the new space, the examples are linearly separable!
[Figure: the new space φ1, with a third classifier separating the X and O examples]
φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1}
Third classifier: φ2[0] = y = sign(1 · φ1[0] + 1 · φ1[1] + 1)
Example wrap-up: forward propagation
● The final net
[Figure: the full network, with sign replaced by tanh]
φ1[0] = tanh(1 · φ0[0] + 1 · φ0[1] - 1)
φ1[1] = tanh(-1 · φ0[0] - 1 · φ0[1] - 1)
φ2[0] = tanh(1 · φ1[0] + 1 · φ1[1] + 1)
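A minimal sketch of forward propagation through the final net above, checked on the four example points; the array names and shapes are ours.

```python
import numpy as np

# Hidden layer: two units, each computing tanh(w . phi0 + b)
W1 = np.array([[ 1.0,  1.0],
               [-1.0, -1.0]])
b1 = np.array([-1.0, -1.0])

# Output layer: one unit computing tanh(w . phi1 + b)
W2 = np.array([[1.0, 1.0]])
b2 = np.array([1.0])

def forward(phi0):
    """Forward propagation: matrix-vector products followed by non-linearities."""
    phi1 = np.tanh(W1 @ phi0 + b1)
    phi2 = np.tanh(W2 @ phi1 + b2)
    return phi2

for phi0 in ([-1, 1], [1, 1], [-1, -1], [1, -1]):
    print(phi0, forward(np.array(phi0, dtype=float)))
# x2 = {1, 1} and x3 = {-1, -1} come out positive; x1 and x4 come out negative.
```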
Softmax function for multiclass classification
● Generalizes the sigmoid (logistic) function to multiple classes
● Can be expressed using matrix/vector operations

$$P(y \mid x) = \frac{e^{\mathbf{w} \cdot \phi(x, y)}}{\sum_{y'} e^{\mathbf{w} \cdot \phi(x, y')}}$$

(numerator: score of the current class; denominator: sum over all classes)

Vectorized form:

$$\mathbf{r} = \exp\!\big(\mathbf{W} \cdot \phi(x)\big) \qquad \mathbf{p} = \frac{\mathbf{r}}{\sum_{\tilde{r} \in \mathbf{r}} \tilde{r}}$$
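A sketch of the vectorized softmax above; the class scores are made up, and subtracting the maximum before exponentiating is a standard numerical-stability trick not shown on the slide.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of per-class scores (e.g. W . phi(x)) into probabilities."""
    r = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return r / r.sum()

scores = np.array([2.0, 1.0, -1.0])  # one score per class
p = softmax(scores)
print(p, p.sum())  # probabilities summing to 1; the highest score gets the most mass
```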
Stochastic Gradient Descent
Online training algorithm for probabilistic models

w = 0
for I iterations
    for each labeled pair (x, y) in the data
        w += α * dP(y|x)/dw

In other words:
• For every training example, calculate the gradient (the direction that will increase the probability of y)
• Move in that direction, multiplied by the learning rate α
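A runnable sketch of this training loop for binary logistic regression with labels y in {-1, +1}; the toy data, learning rate, and iteration count are illustrative, and the gradient is the one derived on the next slide.

```python
import numpy as np

def gradient(w, phi_x, y):
    """dP(y | x)/dw for binary logistic regression with y in {-1, +1}."""
    e = np.exp(np.dot(w, phi_x))
    return y * phi_x * e / (1 + e) ** 2

def train_sgd(data, n_iterations=100, alpha=0.1):
    """Stochastic gradient ascent: one weight update per labeled example."""
    w = np.zeros(len(data[0][0]))
    for _ in range(n_iterations):
        for phi_x, y in data:
            w += alpha * gradient(w, phi_x, y)
    return w

# Toy data: the first feature is predictive of the label, the second is not.
data = [(np.array([1.0, 1.0]), 1), (np.array([-1.0, 1.0]), -1)]
print(train_sgd(data))  # the weight on the first feature becomes positive
```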
Gradient of the Sigmoid Function
Take the derivative of the probability:

$$\frac{d}{d\mathbf{w}} P(y = 1 \mid x) = \frac{d}{d\mathbf{w}}\, \frac{e^{\mathbf{w}\cdot\phi(x)}}{1 + e^{\mathbf{w}\cdot\phi(x)}} = \phi(x)\, \frac{e^{\mathbf{w}\cdot\phi(x)}}{\left(1 + e^{\mathbf{w}\cdot\phi(x)}\right)^2}$$

$$\frac{d}{d\mathbf{w}} P(y = -1 \mid x) = \frac{d}{d\mathbf{w}} \left(1 - \frac{e^{\mathbf{w}\cdot\phi(x)}}{1 + e^{\mathbf{w}\cdot\phi(x)}}\right) = -\phi(x)\, \frac{e^{\mathbf{w}\cdot\phi(x)}}{\left(1 + e^{\mathbf{w}\cdot\phi(x)}\right)^2}$$
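A small sketch that checks the derivative above against a finite-difference estimate; the weight and feature vectors are arbitrary.

```python
import numpy as np

def p_positive(w, phi_x):
    """P(y = 1 | x) = e^{w . phi(x)} / (1 + e^{w . phi(x)})."""
    e = np.exp(np.dot(w, phi_x))
    return e / (1 + e)

def grad_p_positive(w, phi_x):
    """Analytic gradient: phi(x) * e^{w . phi(x)} / (1 + e^{w . phi(x)})^2."""
    e = np.exp(np.dot(w, phi_x))
    return phi_x * e / (1 + e) ** 2

w = np.array([0.3, -0.7])
phi_x = np.array([1.0, 2.0])
eps = 1e-6
# Finite-difference estimate of the derivative with respect to the first weight.
numeric = (p_positive(w + np.array([eps, 0.0]), phi_x)
           - p_positive(w - np.array([eps, 0.0]), phi_x)) / (2 * eps)
print(numeric, grad_p_positive(w, phi_x)[0])  # the two values should match closely
```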
Learning: We Don't Know the Derivative for Hidden Units!
For NNs, we only know the correct tag for the last layer.
[Figure: a network where hidden values h(x) are computed from φ(x) with weights w1, w2, w3, and the output y = 1 is computed from h(x) with weights w4]

$$\frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_4}} = \mathbf{h}(x)\,\frac{e^{\mathbf{w_4} \cdot \mathbf{h}(x)}}{\left(1 + e^{\mathbf{w_4} \cdot \mathbf{h}(x)}\right)^2}$$

$$\frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_1}} = ? \qquad \frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_2}} = ? \qquad \frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_3}} = ?$$
Answer: Back-Propagation
Calculate the derivative with the chain rule:

$$\frac{dP(y = 1 \mid x)}{d\mathbf{w_1}} = \frac{dP(y = 1 \mid x)}{d\,\mathbf{w_4} \cdot \mathbf{h}(\mathbf{x})}\;\frac{d\,\mathbf{w_4} \cdot \mathbf{h}(\mathbf{x})}{dh_1(\mathbf{x})}\;\frac{dh_1(\mathbf{x})}{d\mathbf{w_1}} = \frac{e^{\mathbf{w_4} \cdot \mathbf{h}(x)}}{\left(1 + e^{\mathbf{w_4} \cdot \mathbf{h}(x)}\right)^2}\; w_{4,1}\; \frac{dh_1(\mathbf{x})}{d\mathbf{w_1}}$$

The three factors are: the error of the next unit (δ4), the weight, and the gradient of this unit.

In general, calculate δi from the next units j:

$$\frac{dP(y = 1 \mid \mathbf{x})}{d\mathbf{w_i}} = \frac{dh_i(\mathbf{x})}{d\mathbf{w_i}} \sum_j \delta_j\, w_{i,j}$$

Backpropagation = gradient descent + chain rule
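A compact sketch of back-propagation for a one-hidden-layer network with tanh hidden units and a sigmoid output, following the delta recursion above; the architecture, weights, and input are illustrative, and the code maximizes P(y = 1 | x) directly to stay close to the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, w_out):
    """Forward propagation: tanh hidden layer, sigmoid output P(y = 1 | x)."""
    h = np.tanh(W_hidden @ x)        # hidden unit values h(x)
    p = sigmoid(np.dot(w_out, h))    # P(y = 1 | x)
    return h, p

def backward(x, h, p, w_out):
    """Back-propagation: apply the chain rule from the output back to the input."""
    delta_out = p * (1 - p)                            # error of the output unit (delta_4)
    grad_out = delta_out * h                           # dP/dw_out
    delta_hidden = (1 - h ** 2) * delta_out * w_out    # errors of the hidden units
    grad_hidden = np.outer(delta_hidden, x)            # dP/dW_hidden
    return grad_hidden, grad_out

# One gradient-ascent step on a single example with y = 1.
x = np.array([1.0, -1.0])
W_hidden = np.array([[0.1, 0.2], [-0.3, 0.4]])
w_out = np.array([0.5, -0.5])
h, p = forward(x, W_hidden, w_out)
grad_hidden, grad_out = backward(x, h, p, w_out)
alpha = 0.1
W_hidden += alpha * grad_hidden
w_out += alpha * grad_out
```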
Feed-Forward Neural Nets
All connections point forward, from the input φ(x) to the output y.
● It is a directed acyclic graph (DAG)
NeuralNetworks
• Non-linearclassification
• Prediction:forwardpropagation• Vector/matrixoperations+non-linearities
• Training:backpropagation+stochasticgradientdescent
Formoredetails,seeCIMLChap7