Logistic Regression - University of Washington


Page 1

Logistic Regression

Page 2

Process

Decide on a model

Find the function which fits the data best:
  Choose a loss function
  Pick the function which minimizes loss on data

Use the function to make predictions on new examples


Page 3

Logistic Regression

Actually classification, not regression :)

Logistic function (or sigmoid):

σ(z) = 1 / (1 + exp(-z))

Learn P(Y = 1 | X = x) using σ(w^T x), for link function σ.

Features can be discrete or continuous!

P[Y = 1 | X = x, w] = σ(w^T x) = 1 / (1 + exp(-w^T x))

P[Y = 0 | X = x, w] = 1 - σ(w^T x) = exp(-w^T x) / (1 + exp(-w^T x)) = 1 / (1 + exp(w^T x))
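As a quick sanity check on the two probabilities above, they can be computed in a few lines of NumPy (a sketch of my own; the weight and feature values are made up):

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) link: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.0])   # hypothetical weights
x = np.array([0.5, 0.25])   # hypothetical feature vector

p1 = sigmoid(w @ x)   # P(Y = 1 | X = x, w)
p0 = 1.0 - p1         # P(Y = 0 | X = x, w) = 1 / (1 + exp(w^T x))
```

The two expressions for P(Y = 0) agree because 1 - σ(z) = σ(-z).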

Page 4

Sigmoid for binary classes

P(Y = 0 | w, X) = 1 / (1 + exp(w_0 + Σ_k w_k X_k))

P(Y = 1 | w, X) = 1 - P(Y = 0 | w, X) = exp(w_0 + Σ_k w_k X_k) / (1 + exp(w_0 + Σ_k w_k X_k))

P(Y = 1 | w, X) / P(Y = 0 | w, X) = exp(w_0 + Σ_k w_k X_k)

Page 5

Sigmoid for binary classes

P(Y = 0 | w, X) = 1 / (1 + exp(w_0 + Σ_k w_k X_k))

P(Y = 1 | w, X) = 1 - P(Y = 0 | w, X) = exp(w_0 + Σ_k w_k X_k) / (1 + exp(w_0 + Σ_k w_k X_k))

P(Y = 1 | w, X) / P(Y = 0 | w, X) = exp(w_0 + Σ_k w_k X_k)

log [ P(Y = 1 | w, X) / P(Y = 0 | w, X) ] = w_0 + Σ_k w_k X_k

Linear Decision Rule!
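Because the log-odds are linear in X, the predicted label depends only on the sign of w_0 + Σ_k w_k X_k. A minimal illustration (the function name and all values are my own, not from the slides):

```python
import numpy as np

def predict(w0, w, X):
    # Predict Y = 1 exactly when the log-odds w0 + sum_k w_k X_k is positive,
    # i.e. when P(Y = 1 | w, X) > P(Y = 0 | w, X).
    return (w0 + X @ w > 0).astype(int)

w0, w = -1.0, np.array([2.0, 0.5])   # hypothetical weights
X = np.array([[1.0, 0.0],   # log-odds =  1.0 -> predict 1
              [0.0, 1.0]])  # log-odds = -0.5 -> predict 0
labels = predict(w0, w, X)  # array([1, 0])
```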

Page 6

Logistic Regression – a Linear classifier

[Plot: sigmoid curve; horizontal axis w^T x from -6 to 6, vertical axis P(Y = 1 | x, w) from 0 to 1]

Page 7

Process

Decide on a model

Find the function which fits the data best:
  Choose a loss function
  Pick the function which minimizes loss on data

Use the function to make predictions on new examples


Page 8

Loss function: Conditional Likelihood

■ Have a bunch of iid data: {(x_i, y_i)}_{i=1}^n, x_i ∈ R^d, y_i ∈ {-1, 1}

P(Y = 1 | x, w) = exp(w^T x) / (1 + exp(w^T x))

P(Y = -1 | x, w) = 1 / (1 + exp(w^T x))

■ This is equivalent to:

P(Y = y | x, w) = 1 / (1 + exp(-y w^T x))

■ So we can compute the maximum likelihood estimator:

ŵ_MLE = argmax_w ∏_{i=1}^n P(y_i | x_i, w)
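The claimed equivalence is easy to verify numerically: plugging y = 1 and y = -1 into the single formula recovers the two case-by-case probabilities (the weight and feature values below are my own):

```python
import numpy as np

w = np.array([1.5, -0.5])   # hypothetical weights
x = np.array([1.0, 2.0])    # hypothetical example
z = w @ x                   # w^T x

p_pos = np.exp(z) / (1 + np.exp(z))   # P(Y =  1 | x, w)
p_neg = 1 / (1 + np.exp(z))           # P(Y = -1 | x, w)

# The single expression 1 / (1 + exp(-y w^T x)) reproduces both cases:
for y, p in ((1, p_pos), (-1, p_neg)):
    assert np.isclose(1 / (1 + np.exp(-y * z)), p)
```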

Page 9

Loss function: Conditional Likelihood

■ Have a bunch of iid data: {(x_i, y_i)}_{i=1}^n, x_i ∈ R^d, y_i ∈ {-1, 1}

P(Y = y | x, w) = 1 / (1 + exp(-y w^T x))

ŵ_MLE = argmax_w ∏_{i=1}^n P(y_i | x_i, w)
      = argmin_w Σ_{i=1}^n log(1 + exp(-y_i x_i^T w))

Logistic loss: ℓ_i(w) = log(1 + exp(-y_i x_i^T w))

Squared error loss: ℓ_i(w) = (y_i - x_i^T w)^2   (MLE for Gaussian noise)
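The negated log-likelihood above can be sketched directly in NumPy (an illustrative implementation of my own, with toy data; `np.logaddexp` is the standard numerically stable way to compute log(1 + e^z)):

```python
import numpy as np

def logistic_loss(w, X, y):
    # Sum over i of log(1 + exp(-y_i x_i^T w)).
    # np.logaddexp(0, z) = log(e^0 + e^z) = log(1 + e^z), avoiding overflow.
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

# Hypothetical toy data:
X = np.array([[1.0, 2.0], [-1.0, 0.5]])
y = np.array([1.0, -1.0])

# At w = 0 every example gets probability 1/2, so the loss is n * log(2).
assert np.isclose(logistic_loss(np.zeros(2), X, y), 2 * np.log(2))
```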

Page 10

Process

Decide on a model

Find the function which fits the data best:
  Choose a loss function
  Pick the function which minimizes loss on data

Use the function to make predictions on new examples

©2018 Kevin Jamieson

Page 11

Loss function: Conditional Likelihood

■ Have a bunch of iid data: {(x_i, y_i)}_{i=1}^n, x_i ∈ R^d, y_i ∈ {-1, 1}

P(Y = y | x, w) = 1 / (1 + exp(-y w^T x))

ŵ_MLE = argmax_w ∏_{i=1}^n P(y_i | x_i, w)
      = argmin_w Σ_{i=1}^n log(1 + exp(-y_i x_i^T w)) = J(w)

What does J(w) look like? Is it convex?

Page 12

Loss function: Conditional Likelihood

Page 13

Loss function: Conditional Likelihood

■ Have a bunch of iid data: {(x_i, y_i)}_{i=1}^n, x_i ∈ R^d, y_i ∈ {-1, 1}

P(Y = y | x, w) = 1 / (1 + exp(-y w^T x))

ŵ_MLE = argmax_w ∏_{i=1}^n P(y_i | x_i, w)
      = argmin_w Σ_{i=1}^n log(1 + exp(-y_i x_i^T w)) = J(w)

Good news: J(w) is a convex function of w, so no local-optima problems

Bad news: no closed-form solution to minimize J(w)

Good news: convex functions are easy to optimize
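Since J(w) is convex but has no closed-form minimizer, a standard route is gradient descent; here is a minimal sketch (the toy data and the step size 0.1 are my own choices, not from the slides):

```python
import numpy as np

def J(w, X, y):
    # J(w) = sum_i log(1 + exp(-y_i x_i^T w))
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

def grad_J(w, X, y):
    # grad J(w) = sum_i  -y_i x_i * sigma(-y_i x_i^T w)
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # sigma(-y_i x_i^T w)
    return -(X.T @ (y * s))

# Hypothetical toy data (separable by the first coordinate):
X = np.array([[1.0, 2.0], [2.0, 0.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
for _ in range(500):
    w -= 0.1 * grad_J(w, X, y)   # fixed step size, picked by hand
# The loss has dropped from its starting value of n * log(2).
```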

Page 14

One other concern… overfitting.

■ Have a bunch of iid data: {(x_i, y_i)}_{i=1}^n, x_i ∈ R^d, y_i ∈ {-1, 1}

P(Y = y | x, w) = 1 / (1 + exp(-y w^T x))

ŵ_MLE = argmax_w ∏_{i=1}^n P(y_i | x_i, w)
      = argmin_w Σ_{i=1}^n log(1 + exp(-y_i x_i^T w))

Does anyone see a situation when this minimization might overfit?

Page 15

Overfitting and Linear Separability

argmin_w Σ_{i=1}^n log(1 + exp(-y_i x_i^T w))    When is this loss small?

Page 16

Large parameters → Overfitting

When data is linearly separable, weights ⇒ ∞

Overfitting

Penalize high weights to prevent overfitting?
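The blow-up on separable data is easy to see numerically: if some w already classifies every example correctly, scaling it by any c > 1 strictly shrinks the unregularized loss, so the minimizer runs off to infinity (toy data of my own):

```python
import numpy as np

def loss(w, X, y):
    # Unregularized logistic loss: sum_i log(1 + exp(-y_i x_i^T w))
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

# Linearly separable toy data: w = (1, 0) classifies everything correctly.
X = np.array([[1.0, 2.0], [2.0, 0.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])

# Scaling a separating w only increases the margins y_i x_i^T w,
# so the loss keeps shrinking toward 0 as ||w|| grows.
losses = [loss(c * w, X, y) for c in (1, 10, 100)]
```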

Page 17

Regularized Conditional Log Likelihood

Add a penalty to avoid high weights/overfitting:

argmin_{w,b} Σ_{i=1}^n log(1 + exp(-y_i (x_i^T w + b))) + λ||w||_2^2

Be sure to not regularize the offset b!
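The regularized objective can be sketched as follows, keeping the penalty off the offset b as the slide warns (illustrative code of my own; the data and λ value are arbitrary):

```python
import numpy as np

def regularized_loss(w, b, X, y, lam):
    # sum_i log(1 + exp(-y_i (x_i^T w + b))) + lam * ||w||_2^2
    # Note: the offset b is NOT included in the penalty term.
    margins = y * (X @ w + b)
    return np.sum(np.logaddexp(0.0, -margins)) + lam * np.dot(w, w)

# Hypothetical toy data:
X = np.array([[1.0, 2.0], [-1.0, 0.5]])
y = np.array([1.0, -1.0])
w = np.ones(2)

# The gap between lam = 0.1 and lam = 0 is exactly the penalty 0.1 * ||w||^2:
penalty = regularized_loss(w, 0.0, X, y, lam=0.1) - regularized_loss(w, 0.0, X, y, lam=0.0)
```

Leaving b out of the penalty matters because shifting the decision boundary should not be taxed the way large slope weights are.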