Logistic Regression
Machine Learning CS4824/ECE4424
Bert Huang, Virginia Tech
Outline
• Review conditional probability and classification
• Linear parameterization and logistic function
• Gradient descent
• Other optimization methods
• Regularization
Classification and Conditional Probability
• Discriminative learning: maximize p(y|x)
• Naive Bayes: learn p(x|y) and p(y)
• Today: parameterize p(y|x) and directly learn p(y|x)
Parameterizing p(y|x)
p(y \mid x) := f(x), \qquad f : \mathbb{R}^d \to [0, 1]
f(x) := \frac{1}{1 + \exp(-w^\top x)}
Logistic Function
\sigma(x) = \frac{1}{1 + \exp(-x)}
\lim_{x \to -\infty} \sigma(x) = \lim_{x \to -\infty} \frac{1}{1 + \exp(-x)} = 0.0
\sigma(0) = \frac{1}{1 + \exp(-0)} = \frac{1}{1 + 1} = 0.5
\lim_{x \to +\infty} \sigma(x) = \lim_{x \to +\infty} \frac{1}{1 + \exp(-x)} = \frac{1}{1} = 1.0
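As a quick numerical check of the three values above, here is a minimal NumPy sketch of the logistic function; the name `sigmoid` and the probe points are mine, not from the slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Behavior highlighted on the slide: sigma -> 0 as z -> -inf, sigma(0) = 0.5, sigma -> 1 as z -> +inf.
print(sigmoid(-50.0))  # ~0.0
print(sigmoid(0.0))    # 0.5
print(sigmoid(50.0))   # ~1.0
```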
From Features to Probability
f(x) := \frac{1}{1 + \exp(-w^\top x)}
Likelihood Function
y \in \{-1, +1\}
L(w) :=
\begin{cases}
(1 + \exp(-w^\top x))^{-1} & \text{if } y = +1 \\
1 - (1 + \exp(-w^\top x))^{-1} & \text{if } y = -1
\end{cases}

1 - (1 + \exp(-w^\top x))^{-1}
= 1 - \frac{1}{1 + \exp(-w^\top x)}
= \frac{1 + \exp(-w^\top x)}{1 + \exp(-w^\top x)} - \frac{1}{1 + \exp(-w^\top x)}
= \frac{\exp(-w^\top x)}{1 + \exp(-w^\top x)}
= \left( \frac{1 + \exp(-w^\top x)}{\exp(-w^\top x)} \right)^{-1}
= \left( \frac{1}{\exp(-w^\top x)} + \frac{\exp(-w^\top x)}{\exp(-w^\top x)} \right)^{-1}
= (1 + \exp(w^\top x))^{-1}
Likelihood Function
y \in \{-1, +1\}
L(w) :=
\begin{cases}
(1 + \exp(-w^\top x))^{-1} & \text{if } y = +1 \\
(1 + \exp(+w^\top x))^{-1} & \text{if } y = -1
\end{cases}
= (1 + \exp(-y\, w^\top x))^{-1}

L(w) = \prod_{i=1}^{n} (1 + \exp(-y_i w^\top x_i))^{-1}

\log L(w) = -\sum_{i=1}^{n} \log(1 + \exp(-y_i w^\top x_i))
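To make the product of per-example likelihoods concrete, here is a short NumPy sketch of the negative log likelihood \sum_i \log(1 + \exp(-y_i w^\top x_i)); the toy data and the helper name `nll` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def nll(w, X, y):
    """Negative log likelihood: sum_i log(1 + exp(-y_i * w^T x_i)), with y_i in {-1, +1}."""
    margins = y * (X @ w)                       # y_i * w^T x_i for every example
    return np.sum(np.logaddexp(0.0, -margins))  # numerically stable log(1 + exp(-margin))

# Tiny illustrative data set (three examples, two features).
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -2.0]])
y = np.array([+1, -1, -1])
print(nll(np.zeros(2), X, y))  # equals n * log(2) when w = 0
```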
Likelihood Function
\log L(w) = -\sum_{i=1}^{n} \log(1 + \exp(-y_i w^\top x_i))

w \leftarrow \arg\min_w \sum_{i=1}^{n} \log(1 + \exp(-y_i w^\top x_i)) := \mathrm{nll}(w) \quad \text{(negative log likelihood)}

\nabla_w \mathrm{nll} = -\sum_{i=1}^{n} \left( 1 - \frac{1}{1 + \exp(-y_i w^\top x_i)} \right) y_i x_i
\nabla_w \mathrm{nll} = \sum_{i=1}^{n} \left( \frac{1}{1 + \exp(-y_i w^\top x_i)} \times \nabla_w \left( 1 + \exp(-y_i w^\top x_i) \right) \right),
\qquad \nabla_w \left( 1 + \exp(-y_i w^\top x_i) \right) = 0 - \exp(-y_i w^\top x_i)\, y_i x_i

\nabla_w \mathrm{nll} = \sum_{i=1}^{n} \left( \frac{-\exp(-y_i w^\top x_i)\, y_i x_i}{1 + \exp(-y_i w^\top x_i)} \right)
= \sum_{i=1}^{n} -\left( \frac{\exp(-y_i w^\top x_i)}{1 + \exp(-y_i w^\top x_i)} \right) y_i x_i := g
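A small NumPy sketch of this gradient, checked against central finite differences; the toy data and function names are assumed for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

def nll_grad(w, X, y):
    """g = -sum_i [1 - sigmoid(y_i w^T x_i)] y_i x_i, matching the derivation above."""
    weights = 1.0 - sigmoid(y * (X @ w))  # exp(-m_i) / (1 + exp(-m_i))
    return -(X.T @ (weights * y))

# Finite-difference check of the closed-form gradient on toy data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 3)), np.array([+1, -1, +1, +1, -1])
w = rng.normal(size=3)
eps, eye = 1e-6, np.eye(3)
g_fd = np.array([(nll(w + eps * eye[j], X, y) - nll(w - eps * eye[j], X, y)) / (2 * eps)
                 for j in range(3)])
print(np.max(np.abs(nll_grad(w, X, y) - g_fd)))  # should be ~1e-8 or smaller
```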
Gradient Descent
[Figure: gradient descent steps on a contour plot; image from Murphy Ch. 8]

w_{t+1} \leftarrow w_t - \eta_t g_t

e.g., \quad \eta_t = 1, \qquad \eta_t = \frac{1}{t}, \qquad \eta_t = \frac{1}{\sqrt{t}}
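A minimal gradient-descent loop for this update, using the decaying step size \eta_t = 1/\sqrt{t} from the slide; the toy data and function names are my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_grad(w, X, y):
    return -(X.T @ ((1.0 - sigmoid(y * (X @ w))) * y))

def gradient_descent(X, y, n_steps=200):
    """w_{t+1} <- w_t - eta_t * g_t with eta_t = 1 / sqrt(t)."""
    w = np.zeros(X.shape[1])
    for t in range(1, n_steps + 1):
        w = w - (1.0 / np.sqrt(t)) * nll_grad(w, X, y)  # other choices: eta_t = 1 or eta_t = 1 / t
    return w

# Toy two-class data: positive class centered at +1, negative class at -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, size=(20, 2)), rng.normal(-1.0, 1.0, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
print(gradient_descent(X, y))
```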
Line Search
\eta_t = \arg\min_\eta \mathrm{nll}(w - \eta g)

[Figure: exact line search; nll(w - \eta g) plotted as a function of \eta, alongside the resulting steps on the contour plot; image from Murphy Ch. 8]
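One way to approximate the exact line search above is to scan a grid of candidate step sizes and keep the one with the smallest nll. This sketch (grid, names, and data all assumed) is a simple stand-in, not the method used in the figure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

def nll_grad(w, X, y):
    return -(X.T @ ((1.0 - sigmoid(y * (X @ w))) * y))

def line_search(w, g, X, y, etas=np.logspace(-3, 1, 50)):
    """Approximate eta_t = argmin_eta nll(w - eta * g) over a logarithmic grid of candidates."""
    losses = [nll(w - eta * g, X, y) for eta in etas]
    return etas[int(np.argmin(losses))]

# One descent step with the line-searched step size (toy data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] > 0.0, 1.0, -1.0)
w = np.zeros(2)
g = nll_grad(w, X, y)
eta = line_search(w, g, X, y)
print(eta, nll(w - eta * g, X, y))
```

A 1-D optimizer such as scipy.optimize.minimize_scalar could replace the grid if a closer match to the exact argmin is wanted.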
Momentum
• Heavy ball method
• Momentum term retains “velocity” from previous steps
• Avoids sharp turns
s_t \leftarrow -g_t + \beta_t s_{t-1}
w_t \leftarrow w_{t-1} + \eta_t s_t
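A sketch of the heavy-ball update; using constant \beta and \eta is my simplification of the \beta_t and \eta_t above, and the data is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_grad(w, X, y):
    return -(X.T @ ((1.0 - sigmoid(y * (X @ w))) * y))

def heavy_ball(X, y, eta=0.1, beta=0.9, n_steps=200):
    """s_t <- -g_t + beta * s_{t-1};  w_t <- w_{t-1} + eta * s_t."""
    w = np.zeros(X.shape[1])
    s = np.zeros_like(w)
    for _ in range(n_steps):
        s = -nll_grad(w, X, y) + beta * s  # "velocity" carried over from previous steps
        w = w + eta * s
    return w

# Toy data: label is the sign of a fixed linear function of the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.where(X @ np.array([1.0, -2.0, 0.5]) > 0.0, 1.0, -1.0)
print(heavy_ball(X, y))
```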
Second-Order Optimization
• Newton’s method
• Approximate function with quadratic and minimize quadratic exactly
• Requires computing Hessian (matrix of second derivatives)
• Various approximation methods (e.g., L-BFGS)
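In practice the quasi-Newton route usually means handing the objective and its gradient to an existing L-BFGS implementation. A sketch using SciPy's optimizer, with toy data assumed:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

def nll_grad(w, X, y):
    return -(X.T @ ((1.0 - sigmoid(y * (X @ w))) * y))

# Toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(X @ np.array([2.0, -1.0, 0.5]) > 0.0, 1.0, -1.0)

# L-BFGS never forms the Hessian explicitly; it approximates curvature from recent gradients.
result = minimize(nll, np.zeros(3), args=(X, y), jac=nll_grad, method="L-BFGS-B")
print(result.x, result.fun)
```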
Regularization
posterior: \quad L(w) = p(w) \prod_{i=1}^{n} (1 + \exp(-y_i w^\top x_i))^{-1}

prior: \quad p(w) = \mathcal{N}\!\left( 0, \tfrac{1}{\lambda} I \right)

negative log posterior (regularized nll), dropping the constant from the Gaussian prior:
-\log L(w) = \frac{\lambda}{2} w^\top w + \sum_{i=1}^{n} \log(1 + \exp(-y_i w^\top x_i))

gradient:
\nabla_w \mathrm{nll} = \lambda w - \sum_{i=1}^{n} \left( 1 - \frac{1}{1 + \exp(-y_i w^\top x_i)} \right) y_i x_i
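Putting the regularized objective and its gradient into code, again as an illustrative sketch (the value of \lambda, the data, and the names are assumed) rather than the lecture's reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_nll(w, X, y, lam):
    """(lambda / 2) * w^T w + sum_i log(1 + exp(-y_i w^T x_i))  -- the negative log posterior."""
    return 0.5 * lam * (w @ w) + np.sum(np.logaddexp(0.0, -y * (X @ w)))

def reg_nll_grad(w, X, y, lam):
    """lambda * w - sum_i (1 - sigmoid(y_i w^T x_i)) y_i x_i."""
    return lam * w - (X.T @ ((1.0 - sigmoid(y * (X @ w))) * y))

# Gradient descent on the regularized nll (toy data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] - X[:, 1] > 0.0, 1.0, -1.0)
w, lam = np.zeros(2), 1.0
for t in range(1, 201):
    w = w - (1.0 / np.sqrt(t)) * reg_nll_grad(w, X, y, lam)
print(w, reg_nll(w, X, y, lam))
```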
Summary
• Review conditional probability and classification
• Linear parameterization and logistic function
• Gradient descent
• Other optimization methods
• Regularization