Logistic Regression
Machine Learning CS4824/ECE4424
Bert Huang, Virginia Tech
Outline
• Review conditional probability and classification
• Linear parameterization and logistic function
• Gradient descent
• Other optimization methods
• Regularization
Classification and Conditional Probability
• Discriminative learning: maximize p(y|x)
• Naive Bayes: learn p(x|y) and p(y)
• Today: parameterize p(y|x) and directly learn p(y|x)
Parameterizing p(y|x)
p(y \mid x) := f(x), \qquad f : \mathbb{R}^d \to [0, 1]
f(x) := \frac{1}{1 + \exp(-w^\top x)}
Logistic Function
\sigma(x) = \frac{1}{1 + \exp(-x)}
\lim_{x \to -\infty} \sigma(x) = \lim_{x \to -\infty} \frac{1}{1 + \exp(-x)} = 0.0
\sigma(0) = \frac{1}{1 + \exp(-0)} = \frac{1}{1 + 1} = 0.5
\lim_{x \to +\infty} \sigma(x) = \lim_{x \to +\infty} \frac{1}{1 + \exp(-x)} = \frac{1}{1} = 1.0
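As a quick numerical check of the three values above, here is a minimal NumPy sketch of the logistic function; the name `sigmoid` and the probe points are mine, not from the slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Behavior highlighted on the slide: sigma -> 0 as z -> -inf, sigma(0) = 0.5, sigma -> 1 as z -> +inf.
print(sigmoid(-50.0))  # ~0.0
print(sigmoid(0.0))    # 0.5
print(sigmoid(50.0))   # ~1.0
```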
From Features to Probability
f(x) := \frac{1}{1 + \exp(-w^\top x)}
Likelihood Function
y \in \{-1, +1\}
L(w) :=
\begin{cases}
(1 + \exp(-w^\top x))^{-1} & \text{if } y = +1 \\
1 - (1 + \exp(-w^\top x))^{-1} & \text{if } y = -1
\end{cases}

1 - (1 + \exp(-w^\top x))^{-1}
= 1 - \frac{1}{1 + \exp(-w^\top x)}
= \frac{1 + \exp(-w^\top x)}{1 + \exp(-w^\top x)} - \frac{1}{1 + \exp(-w^\top x)}
= \frac{\exp(-w^\top x)}{1 + \exp(-w^\top x)}
= \left( \frac{1 + \exp(-w^\top x)}{\exp(-w^\top x)} \right)^{-1}
= \left( \frac{1}{\exp(-w^\top x)} + \frac{\exp(-w^\top x)}{\exp(-w^\top x)} \right)^{-1}
= (1 + \exp(w^\top x))^{-1}
Likelihood Function
y \in \{-1, +1\}
L(w) :=
\begin{cases}
(1 + \exp(-w^\top x))^{-1} & \text{if } y = +1 \\
(1 + \exp(+w^\top x))^{-1} & \text{if } y = -1
\end{cases}
= (1 + \exp(-y\, w^\top x))^{-1}

L(w) = \prod_{i=1}^{n} (1 + \exp(-y_i w^\top x_i))^{-1}

\log L(w) = -\sum_{i=1}^{n} \log(1 + \exp(-y_i w^\top x_i))
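To make the product of per-example likelihoods concrete, here is a short NumPy sketch of the negative log likelihood \sum_i \log(1 + \exp(-y_i w^\top x_i)); the toy data and the helper name `nll` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def nll(w, X, y):
    """Negative log likelihood: sum_i log(1 + exp(-y_i * w^T x_i)), with y_i in {-1, +1}."""
    margins = y * (X @ w)                       # y_i * w^T x_i for every example
    return np.sum(np.logaddexp(0.0, -margins))  # numerically stable log(1 + exp(-margin))

# Tiny illustrative data set (three examples, two features).
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -2.0]])
y = np.array([+1, -1, -1])
print(nll(np.zeros(2), X, y))  # equals n * log(2) when w = 0
```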
Likelihood Function
\log L(w) = -\sum_{i=1}^{n} \log(1 + \exp(-y_i w^\top x_i))

w \leftarrow \arg\min_w \sum_{i=1}^{n} \log(1 + \exp(-y_i w^\top x_i)) := \mathrm{nll}(w) \quad \text{(negative log likelihood)}

\nabla_w \mathrm{nll} = -\sum_{i=1}^{n} \left( 1 - \frac{1}{1 + \exp(-y_i w^\top x_i)} \right) y_i x_i
\nabla_w \mathrm{nll} = \sum_{i=1}^{n} \left( \frac{1}{1 + \exp(-y_i w^\top x_i)} \times \nabla_w \left( 1 + \exp(-y_i w^\top x_i) \right) \right),
\qquad \nabla_w \left( 1 + \exp(-y_i w^\top x_i) \right) = 0 - \exp(-y_i w^\top x_i)\, y_i x_i

\nabla_w \mathrm{nll} = \sum_{i=1}^{n} \left( \frac{-\exp(-y_i w^\top x_i)\, y_i x_i}{1 + \exp(-y_i w^\top x_i)} \right)
= \sum_{i=1}^{n} -\left( \frac{\exp(-y_i w^\top x_i)}{1 + \exp(-y_i w^\top x_i)} \right) y_i x_i := g
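A small NumPy sketch of this gradient, checked against central finite differences; the toy data and function names are assumed for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

def nll_grad(w, X, y):
    """g = -sum_i [1 - sigmoid(y_i w^T x_i)] y_i x_i, matching the derivation above."""
    weights = 1.0 - sigmoid(y * (X @ w))  # exp(-m_i) / (1 + exp(-m_i))
    return -(X.T @ (weights * y))

# Finite-difference check of the closed-form gradient on toy data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 3)), np.array([+1, -1, +1, +1, -1])
w = rng.normal(size=3)
eps, eye = 1e-6, np.eye(3)
g_fd = np.array([(nll(w + eps * eye[j], X, y) - nll(w - eps * eye[j], X, y)) / (2 * eps)
                 for j in range(3)])
print(np.max(np.abs(nll_grad(w, X, y) - g_fd)))  # should be ~1e-8 or smaller
```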
Gradient Descent
[Figure: gradient descent steps on a contour plot; image from Murphy Ch. 8]

w_{t+1} \leftarrow w_t - \eta_t g_t

e.g., \quad \eta_t = 1, \qquad \eta_t = \frac{1}{t}, \qquad \eta_t = \frac{1}{\sqrt{t}}
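A minimal gradient-descent loop for this update, using the decaying step size \eta_t = 1/\sqrt{t} from the slide; the toy data and function names are my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_grad(w, X, y):
    return -(X.T @ ((1.0 - sigmoid(y * (X @ w))) * y))

def gradient_descent(X, y, n_steps=200):
    """w_{t+1} <- w_t - eta_t * g_t with eta_t = 1 / sqrt(t)."""
    w = np.zeros(X.shape[1])
    for t in range(1, n_steps + 1):
        w = w - (1.0 / np.sqrt(t)) * nll_grad(w, X, y)  # other choices: eta_t = 1 or eta_t = 1 / t
    return w

# Toy two-class data: positive class centered at +1, negative class at -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, size=(20, 2)), rng.normal(-1.0, 1.0, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
print(gradient_descent(X, y))
```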
Line Search
\eta_t = \arg\min_\eta \mathrm{nll}(w - \eta g)

[Figure: exact line search; nll(w - \eta g) plotted as a function of \eta, alongside the resulting steps on the contour plot; image from Murphy Ch. 8]
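One way to approximate the exact line search above is to scan a grid of candidate step sizes and keep the one with the smallest nll. This sketch (grid, names, and data all assumed) is a simple stand-in, not the method used in the figure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

def nll_grad(w, X, y):
    return -(X.T @ ((1.0 - sigmoid(y * (X @ w))) * y))

def line_search(w, g, X, y, etas=np.logspace(-3, 1, 50)):
    """Approximate eta_t = argmin_eta nll(w - eta * g) over a logarithmic grid of candidates."""
    losses = [nll(w - eta * g, X, y) for eta in etas]
    return etas[int(np.argmin(losses))]

# One descent step with the line-searched step size (toy data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] > 0.0, 1.0, -1.0)
w = np.zeros(2)
g = nll_grad(w, X, y)
eta = line_search(w, g, X, y)
print(eta, nll(w - eta * g, X, y))
```

A 1-D optimizer such as scipy.optimize.minimize_scalar could replace the grid if a closer match to the exact argmin is wanted.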
Momentum
• Heavy ball method
• Momentum term retains “velocity” from previous steps
• Avoids sharp turns
s_t \leftarrow -g_t + \beta_t s_{t-1}
w_t \leftarrow w_{t-1} + \eta_t s_t
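A sketch of the heavy-ball update; using constant \beta and \eta is my simplification of the \beta_t and \eta_t above, and the data is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_grad(w, X, y):
    return -(X.T @ ((1.0 - sigmoid(y * (X @ w))) * y))

def heavy_ball(X, y, eta=0.1, beta=0.9, n_steps=200):
    """s_t <- -g_t + beta * s_{t-1};  w_t <- w_{t-1} + eta * s_t."""
    w = np.zeros(X.shape[1])
    s = np.zeros_like(w)
    for _ in range(n_steps):
        s = -nll_grad(w, X, y) + beta * s  # "velocity" carried over from previous steps
        w = w + eta * s
    return w

# Toy data: label is the sign of a fixed linear function of the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.where(X @ np.array([1.0, -2.0, 0.5]) > 0.0, 1.0, -1.0)
print(heavy_ball(X, y))
```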
Second-Order Optimization
• Newton’s method
• Approximate function with quadratic and minimize quadratic exactly
• Requires computing Hessian (matrix of second derivatives)
• Various approximation methods (e.g., L-BFGS)
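In practice the quasi-Newton route usually means handing the objective and its gradient to an existing L-BFGS implementation. A sketch using SciPy's optimizer, with toy data assumed:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

def nll_grad(w, X, y):
    return -(X.T @ ((1.0 - sigmoid(y * (X @ w))) * y))

# Toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(X @ np.array([2.0, -1.0, 0.5]) > 0.0, 1.0, -1.0)

# L-BFGS never forms the Hessian explicitly; it approximates curvature from recent gradients.
result = minimize(nll, np.zeros(3), args=(X, y), jac=nll_grad, method="L-BFGS-B")
print(result.x, result.fun)
```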
Regularization
posterior: \quad L(w) = p(w) \prod_{i=1}^{n} (1 + \exp(-y_i w^\top x_i))^{-1}

prior: \quad p(w) = \mathcal{N}\!\left( 0, \tfrac{1}{\lambda} I \right)

negative log posterior (regularized nll), dropping the constant from the Gaussian prior:
-\log L(w) = \frac{\lambda}{2} w^\top w + \sum_{i=1}^{n} \log(1 + \exp(-y_i w^\top x_i))

gradient:
\nabla_w \mathrm{nll} = \lambda w - \sum_{i=1}^{n} \left( 1 - \frac{1}{1 + \exp(-y_i w^\top x_i)} \right) y_i x_i
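Putting the regularized objective and its gradient into code, again as an illustrative sketch (the value of \lambda, the data, and the names are assumed) rather than the lecture's reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_nll(w, X, y, lam):
    """(lambda / 2) * w^T w + sum_i log(1 + exp(-y_i w^T x_i))  -- the negative log posterior."""
    return 0.5 * lam * (w @ w) + np.sum(np.logaddexp(0.0, -y * (X @ w)))

def reg_nll_grad(w, X, y, lam):
    """lambda * w - sum_i (1 - sigmoid(y_i w^T x_i)) y_i x_i."""
    return lam * w - (X.T @ ((1.0 - sigmoid(y * (X @ w))) * y))

# Gradient descent on the regularized nll (toy data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] - X[:, 1] > 0.0, 1.0, -1.0)
w, lam = np.zeros(2), 1.0
for t in range(1, 201):
    w = w - (1.0 / np.sqrt(t)) * reg_nll_grad(w, X, y, lam)
print(w, reg_nll(w, X, y, lam))
```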
Summary
• Review conditional probability and classification
• Linear parameterization and logistic function
• Gradient descent
• Other optimization methods
• Regularization