
Page 1

580.691 Learning Theory

Reza Shadmehr

logistic regression, iterative re-weighted least squares

Page 2

Logistic regression

In the last lecture we classified by computing a posterior probability. The posterior was calculated by modeling the likelihood and prior for each class.

• To compute the posterior, we modeled the right side of the equation below by assuming that the class-conditional densities were Gaussian and estimating their parameters (or by using a kernel estimate of the density):

$$\hat{P}(l \mid \mathbf{x}) = \frac{\hat{p}(\mathbf{x} \mid l)\,\hat{P}(l)}{\hat{p}(\mathbf{x})} = \frac{\hat{p}(\mathbf{x} \mid l)\,\hat{P}(l)}{\sum_{l'=1}^{L} \hat{p}(\mathbf{x} \mid l')\,\hat{P}(l')}$$

• In logistic regression, we want to model the posterior directly as a function of the variable x.

• In practice, when there are k classes to classify, we model:

$$\hat{P}(l \mid \mathbf{x}) = g_l(\mathbf{x}), \qquad l = 1, \dots, k-1$$

$$\hat{P}(k \mid \mathbf{x}) = 1 - \sum_{l=1}^{k-1} g_l(\mathbf{x})$$

Page 3

In this example we assume that the two distributions for the classes have equal variance. Suppose we want to classify a person as male or female based on height.

Height is normally distributed in the population of men and in the population of women, with different means and similar variances. Let y be an indicator variable for being female. Then the conditional distribution of x (the height) becomes:

$$p(x \mid y=1) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu_f)^2\right)$$

$$p(x \mid y=0) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu_m)^2\right)$$

[Figure: the two class-conditional densities $p(x \mid y=1)$ and $p(x \mid y=0)$ plotted against height x.]

What we have: $p(x \mid y=0)$, $p(x \mid y=1)$, and $P(y=1) = q$.
What we want: $P(y=1 \mid x)$.

Classification by maximizing the posterior distribution.

Page 4

Posterior probability for classification when we have two classes:

$$P(y=1 \mid x) = \frac{P(y=1)\,p(x \mid y=1)}{P(y=1)\,p(x \mid y=1) + P(y=0)\,p(x \mid y=0)}$$

$$= \frac{q \exp\left(-\frac{1}{2\sigma^2}(x-\mu_f)^2\right)}{q \exp\left(-\frac{1}{2\sigma^2}(x-\mu_f)^2\right) + (1-q)\exp\left(-\frac{1}{2\sigma^2}(x-\mu_m)^2\right)}$$

$$= \frac{1}{1 + \frac{1-q}{q}\exp\left(\frac{1}{2\sigma^2}\left[(x-\mu_f)^2 - (x-\mu_m)^2\right]\right)}$$

$$= \frac{1}{1 + \exp\left(\log\frac{1-q}{q} + \frac{1}{2\sigma^2}\left[(x-\mu_f)^2 - (x-\mu_m)^2\right]\right)}$$

Expanding the squares, $(x-\mu_f)^2 - (x-\mu_m)^2 = 2(\mu_m - \mu_f)x + \mu_f^2 - \mu_m^2$, so

$$P(y=1 \mid x) = \frac{1}{1 + \exp\left(\log\frac{1-q}{q} + \frac{\mu_m - \mu_f}{\sigma^2}\,x + \frac{\mu_f^2 - \mu_m^2}{2\sigma^2}\right)}$$

Page 5

Computing the probability that the subject is female, given that we observed height x.

$$P(y=1 \mid x) = \frac{1}{1 + \exp\left(\log\frac{1-q}{q} + \frac{\mu_m - \mu_f}{\sigma^2}\,x + \frac{\mu_f^2 - \mu_m^2}{2\sigma^2}\right)}$$

with $\mu_m = 176\ \mathrm{cm}$, $\mu_f = 166\ \mathrm{cm}$, $\sigma = 12\ \mathrm{cm}$, and $p(y=1) = q = 0.5$.

[Figure: the class-conditional densities $p(x \mid y=1)$ and $p(x \mid y=0)$, and the posterior $P(y=1 \mid x)$, which falls from near 1 to near 0 as height x runs from 120 to 220 cm.]

The posterior is a logistic function: x appears linearly inside the exponential in the denominator. So if we assume that the class-membership densities $p(x \mid y)$ are normal with equal variance, then the posterior probability is a logistic function.
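A minimal numerical sketch of this example (parameter values from the slide; the function name is mine):

    import numpy as np

    # Parameters from the slide: male/female mean heights, shared SD, prior q.
    mu_m, mu_f, sigma, q = 176.0, 166.0, 12.0, 0.5

    def posterior_female(x):
        """P(y=1 | height x) for equal-variance Gaussian classes."""
        z = (np.log((1 - q) / q)
             + (mu_m - mu_f) / sigma**2 * x
             + (mu_f**2 - mu_m**2) / (2 * sigma**2))
        return 1.0 / (1.0 + np.exp(z))

    print(posterior_female(171.0))  # midpoint of the means -> 0.5
    print(posterior_female(160.0))  # shorter -> more likely female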

Page 6

Logistic regression with the assumption of equal variance among the class densities implies a linear decision boundary.

[Figure: two classes of points in the $(x_1, x_2)$ plane separated by the linear decision boundary $a_0 + \mathbf{a}^T\mathbf{x} = 0$; the region labeled "Class 0" lies on one side of the line.]

$$P(y^{(i)}=1 \mid \mathbf{x}^{(i)}) = \frac{1}{1 + \exp\left(-\left(a_0 + \mathbf{a}^T\mathbf{x}^{(i)}\right)\right)}$$

$$P(y^{(i)}=0 \mid \mathbf{x}^{(i)}) = 1 - P(y^{(i)}=1 \mid \mathbf{x}^{(i)}) = \frac{\exp\left(-\left(a_0 + \mathbf{a}^T\mathbf{x}^{(i)}\right)\right)}{1 + \exp\left(-\left(a_0 + \mathbf{a}^T\mathbf{x}^{(i)}\right)\right)}$$

Classify as class 1 if

$$\log\frac{P(y^{(i)}=1 \mid \mathbf{x}^{(i)})}{P(y^{(i)}=0 \mid \mathbf{x}^{(i)})} > 0$$

and since

$$\log\frac{P(y^{(i)}=1 \mid \mathbf{x}^{(i)})}{P(y^{(i)}=0 \mid \mathbf{x}^{(i)})} = a_0 + \mathbf{a}^T\mathbf{x}^{(i)},$$

the decision boundary $a_0 + \mathbf{a}^T\mathbf{x} = 0$ is linear in $\mathbf{x}$.
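In code, the decision rule reduces to the sign of the linear score (a sketch; a0 and a are assumed to have been estimated already):

    import numpy as np

    def classify(a0, a, X):
        """Assign class 1 where the log-odds a0 + a^T x is positive.
        X has one example per row."""
        return (a0 + X @ a > 0).astype(int)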

Page 7

Logistic regression: problem statement

Given data $D = \{(\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})\}$ with labels $y^{(i)} \in \{0, 1\}$, model the posterior with a logistic function (the assumption of equal variance among the clusters is what gives this form):

$$P(y^{(i)}=1 \mid \mathbf{x}^{(i)}) = q^{(i)} = \frac{1}{1 + \exp\left(-\mathbf{w}^T\mathbf{x}^{(i)}\right)}, \qquad P(y^{(i)}=0 \mid \mathbf{x}^{(i)}) = 1 - q^{(i)}$$

The two cases combine into a single Bernoulli likelihood for each example:

$$p(y^{(i)} \mid \mathbf{x}^{(i)}) = \left(q^{(i)}\right)^{y^{(i)}}\left(1 - q^{(i)}\right)^{1 - y^{(i)}}$$

$$p(y^{(1)}, \dots, y^{(n)} \mid \mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}) = \prod_{i=1}^{n}\left(q^{(i)}\right)^{y^{(i)}}\left(1 - q^{(i)}\right)^{1 - y^{(i)}}$$

Taking logs gives the log-likelihood:

$$l(D) = \sum_{i=1}^{n} y^{(i)}\log q^{(i)} + \left(1 - y^{(i)}\right)\log\left(1 - q^{(i)}\right)$$

The goal is to find parameters w that maximize the log-likelihood.
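A direct transcription of the log-likelihood into code (a sketch; names are mine):

    import numpy as np

    def log_likelihood(w, X, y):
        """l(D) = sum_i [ y_i log q_i + (1 - y_i) log(1 - q_i) ],
        with q_i = 1 / (1 + exp(-w^T x_i)); X has one example per row."""
        q = 1.0 / (1.0 + np.exp(-X @ w))
        return np.sum(y * np.log(q) + (1 - y) * np.log(1 - q))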

Page 8

Some useful properties of the logistic function

$$q = \frac{1}{1 + \exp\left(-\mathbf{w}^T\mathbf{x}\right)}; \qquad 0 \le q \le 1$$

$$\exp\left(-\mathbf{w}^T\mathbf{x}\right) = \frac{1}{q} - 1 = \frac{1-q}{q}$$

$$\mathbf{w}^T\mathbf{x} = -\log\frac{1-q}{q} = \log\frac{q}{1-q}$$

$$\frac{dq}{d\left(\mathbf{w}^T\mathbf{x}\right)} = \frac{\exp\left(-\mathbf{w}^T\mathbf{x}\right)}{\left(1 + \exp\left(-\mathbf{w}^T\mathbf{x}\right)\right)^2} = q(1-q)$$
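The derivative identity is easy to verify numerically (a sketch; the test point and step size are arbitrary):

    import numpy as np

    def sigmoid(eta):
        return 1.0 / (1.0 + np.exp(-eta))

    eta, h = 0.7, 1e-6
    q = sigmoid(eta)
    finite_diff = (sigmoid(eta + h) - sigmoid(eta - h)) / (2 * h)
    print(finite_diff, q * (1 - q))  # the two values agree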

Page 9

Online algorithm for logistic regression

$$l(D) = \sum_{i=1}^{n} y^{(i)}\log q^{(i)} + \left(1 - y^{(i)}\right)\log\left(1 - q^{(i)}\right)$$

Differentiate using the chain rule and $\frac{dq^{(i)}}{d\mathbf{w}} = q^{(i)}\left(1 - q^{(i)}\right)\mathbf{x}^{(i)}$:

$$\frac{dl}{d\mathbf{w}} = \sum_{i=1}^{n}\frac{dl}{dq^{(i)}}\frac{dq^{(i)}}{d\mathbf{w}} = \sum_{i=1}^{n}\left(\frac{y^{(i)}}{q^{(i)}} - \frac{1 - y^{(i)}}{1 - q^{(i)}}\right) q^{(i)}\left(1 - q^{(i)}\right)\mathbf{x}^{(i)}$$

$$= \sum_{i=1}^{n}\frac{y^{(i)} - q^{(i)}}{q^{(i)}\left(1 - q^{(i)}\right)}\, q^{(i)}\left(1 - q^{(i)}\right)\mathbf{x}^{(i)} = \sum_{i=1}^{n}\left(y^{(i)} - q^{(i)}\right)\mathbf{x}^{(i)}$$

This suggests the online (per-example) gradient-ascent update:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \eta\left(y^{(i)} - q^{(i)}\right)\mathbf{x}^{(i)}$$
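A minimal sketch of the online update (the learning rate eta, the synthetic data, and the epoch loop are my choices):

    import numpy as np

    def online_logistic_step(w, x, y, eta=0.1):
        """One online update: w <- w + eta * (y - q) * x."""
        q = 1.0 / (1.0 + np.exp(-w @ x))
        return w + eta * (y - q) * x

    # Example: stream through a small synthetic data set a few times.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = (X @ np.array([2.0, -1.0]) > 0).astype(float)
    w = np.zeros(2)
    for _ in range(20):
        for x_i, y_i in zip(X, y):
            w = online_logistic_step(w, x_i, y_i)
    print(w)  # roughly proportional to the true direction (2, -1)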

Page 10

Batch algorithm: Iteratively Re-weighted Least Squares

Write the gradient in matrix form:

$$\frac{dl}{d\mathbf{w}} = \sum_{i=1}^{n}\left(y^{(i)} - q^{(i)}\right)\mathbf{x}^{(i)} = X^T(\mathbf{y} - \mathbf{q})$$

where

$$X = \begin{bmatrix} \mathbf{x}^{(1)T} \\ \vdots \\ \mathbf{x}^{(n)T} \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \qquad \mathbf{q} = \begin{bmatrix} q^{(1)} \\ \vdots \\ q^{(n)} \end{bmatrix}$$

Second derivative (Hessian):

$$\frac{d^2 l}{d\mathbf{w}\,d\mathbf{w}^T} = \frac{d}{d\mathbf{w}^T}\sum_{i=1}^{n}\left(y^{(i)} - q^{(i)}\right)\mathbf{x}^{(i)} = -\sum_{i=1}^{n} q^{(i)}\left(1 - q^{(i)}\right)\mathbf{x}^{(i)}\mathbf{x}^{(i)T} = -X^T Q X$$

with the diagonal weight matrix

$$Q = \begin{bmatrix} q^{(1)}\left(1-q^{(1)}\right) & 0 & \cdots & 0 \\ 0 & q^{(2)}\left(1-q^{(2)}\right) & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & q^{(n)}\left(1-q^{(n)}\right) \end{bmatrix}$$

Page 11

Iteratively Re-weighted Least Squares

$$P(y^{(i)}=1 \mid \mathbf{x}^{(i)}) = q^{(i)} = \frac{1}{1 + \exp\left(-\mathbf{w}^T\mathbf{x}^{(i)}\right)}$$

$$\frac{dl}{d\mathbf{w}} = X^T(\mathbf{y} - \mathbf{q}), \qquad \frac{d^2 l}{d\mathbf{w}\,d\mathbf{w}^T} = -X^T Q X$$

The Newton-Raphson step gives the IRLS update:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \left(X^T Q X\right)^{-1} X^T(\mathbf{y} - \mathbf{q})$$

[Figure: $\frac{1}{q(1-q)}$ plotted against q, labeled "Sensitivity to error"; the curve is lowest near q = 0.5 (uncertain predictions) and rises as q approaches 0 or 1 (certain predictions).]
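A compact sketch of IRLS under these definitions (the iteration count and the small ridge term for numerical stability are my additions):

    import numpy as np

    def irls(X, y, n_iter=20, ridge=1e-8):
        """Iteratively re-weighted least squares for logistic regression.
        X: (n, d) design matrix; y: (n,) labels in {0, 1}."""
        w = np.zeros(X.shape[1])
        for _ in range(n_iter):
            q = 1.0 / (1.0 + np.exp(-X @ w))
            Q = np.diag(q * (1 - q))
            H = X.T @ Q @ X + ridge * np.eye(X.shape[1])  # -Hessian, regularized
            w = w + np.linalg.solve(H, X.T @ (y - q))     # Newton-Raphson step
        return w

On well-behaved data this converges in a handful of iterations; with perfectly separable classes the weights keep growing, which is why the small ridge term helps.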

Page 12

Iteratively Re-weighted Least Squares: Example

[Figure: a two-class data set in the $(x_1, x_2)$ plane with the fitted linear decision boundary $w_0 + w_1 x_1 + w_2 x_2 = 0$, and a surface plot of the estimated posterior $P(y=1 \mid \mathbf{x})$ rising from 0 to 1 across that boundary.]
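A sketch of how such an example could be reproduced, reusing the irls function from the previous page (the synthetic data and true weights are my choices):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    X2 = rng.normal(size=(n, 2))              # features x1, x2
    X = np.column_stack([np.ones(n), X2])     # prepend a bias column for w0
    logits = X @ np.array([0.5, 2.0, -1.0])   # assumed true (w0, w1, w2)
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

    w = irls(X, y)
    print(w)  # estimates of (w0, w1, w2); boundary: w0 + w1*x1 + w2*x2 = 0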

Page 13

Modeling the posterior when the densities have unequal variance (univariate case with two classes):

$$P(y=1 \mid x) = \frac{P(y=1)\,p(x \mid y=1)}{P(y=1)\,p(x \mid y=1) + P(y=0)\,p(x \mid y=0)}$$

$$= \frac{q\,\frac{1}{\sqrt{2\pi}\sigma_1}\exp\left(-\frac{1}{2\sigma_1^2}(x-\mu_1)^2\right)}{q\,\frac{1}{\sqrt{2\pi}\sigma_1}\exp\left(-\frac{1}{2\sigma_1^2}(x-\mu_1)^2\right) + (1-q)\,\frac{1}{\sqrt{2\pi}\sigma_2}\exp\left(-\frac{1}{2\sigma_2^2}(x-\mu_2)^2\right)}$$

$$= \frac{1}{1 + \frac{1-q}{q}\,\frac{\sigma_1}{\sigma_2}\exp\left(\frac{1}{2\sigma_1^2}(x-\mu_1)^2 - \frac{1}{2\sigma_2^2}(x-\mu_2)^2\right)}$$

$$= \frac{1}{1 + \exp\left(\log\frac{(1-q)\,\sigma_1}{q\,\sigma_2} + \left(\frac{1}{2\sigma_1^2} - \frac{1}{2\sigma_2^2}\right)x^2 - \left(\frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2}\right)x + \frac{\mu_1^2}{2\sigma_1^2} - \frac{\mu_2^2}{2\sigma_2^2}\right)}$$

$$= \frac{1}{1 + \exp\left(w_0 + w_1 x + w_2 x^2\right)}$$

With unequal variances the exponent is quadratic, not linear, in x.
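A small sketch evaluating this quadratic-exponent posterior (the parameter values are mine):

    import numpy as np

    # Assumed example parameters: class 1 ~ N(mu1, s1^2), class 0 ~ N(mu2, s2^2).
    mu1, s1, mu2, s2, q = 0.0, 1.0, 2.0, 3.0, 0.5

    w2 = 1 / (2 * s1**2) - 1 / (2 * s2**2)
    w1 = -(mu1 / s1**2 - mu2 / s2**2)
    w0 = (np.log((1 - q) * s1 / (q * s2))
          + mu1**2 / (2 * s1**2) - mu2**2 / (2 * s2**2))

    def posterior(x):
        return 1.0 / (1.0 + np.exp(w0 + w1 * x + w2 * x**2))

    print(posterior(0.0))  # near the class-1 mean -> high probability of class 1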

Page 14

Logistic regression with basis functions

Given $D = \{(\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(N)}, y^{(N)})\}$ with $y^{(i)} \in \{0, 1\}$, replace $\mathbf{x}$ with a vector of basis functions $\mathbf{g}(\mathbf{x})$:

$$P(y^{(i)}=1 \mid \mathbf{x}^{(i)}) = q^{(i)} = \frac{1}{1 + \exp\left(-\mathbf{w}^T\mathbf{g}(\mathbf{x}^{(i)})\right)}$$

for example, quadratic bases such as $\mathbf{g}(\mathbf{x}) = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_2^2 & x_1 x_2 \end{bmatrix}^T$.

The IRLS update is unchanged, with the design matrix built from the bases:

$$X = \begin{bmatrix} \mathbf{g}(\mathbf{x}^{(1)})^T \\ \vdots \\ \mathbf{g}(\mathbf{x}^{(N)})^T \end{bmatrix}, \qquad \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \left(X^T Q X\right)^{-1} X^T(\mathbf{y} - \mathbf{q})$$

By using non-linear bases, we can deal with clusters having unequal variance.

[Figure: estimated posterior probability over the $(x_1, x_2)$ plane; with non-linear bases the decision boundary curves around one cluster.]
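A sketch of the basis-function version, reusing the irls routine above (the feature map mirrors the example quadratic bases; the circular data set is my choice):

    import numpy as np

    def g(X2):
        """Quadratic basis expansion of 2-D inputs: [1, x1, x2, x1^2, x2^2, x1*x2]."""
        x1, x2 = X2[:, 0], X2[:, 1]
        return np.column_stack([np.ones(len(X2)), x1, x2, x1**2, x2**2, x1 * x2])

    rng = np.random.default_rng(2)
    X2 = rng.normal(size=(300, 2))
    y = (np.sum(X2**2, axis=1) < 1.0).astype(float)  # class 1 inside a circle

    w = irls(g(X2), y)   # same IRLS update, design matrix built from g(x)
    print(w)             # the weights on x1^2 and x2^2 come out negative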

Page 15

Logistic function for multiple classes with equal variance

Rather than modeling the posterior directly, let us pick the posterior for one class as our reference and then model the ratio of the posteriors for all other classes with respect to that class. Suppose we have k classes, each with a Gaussian density of common covariance $\Sigma$ and mean $\boldsymbol{\mu}_i$. For class 1:

$$\log\frac{P(y=1 \mid \mathbf{x})}{P(y=k \mid \mathbf{x})} = \log\frac{P(y=1)\,p(\mathbf{x} \mid y=1)}{P(y=k)\,p(\mathbf{x} \mid y=k)}$$

$$= \log\frac{P(y=1)\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)\right)}{P(y=k)\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right)}$$

$$= \log\frac{P(y=1)}{P(y=k)} - \frac{1}{2}\boldsymbol{\mu}_1^T\Sigma^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_k^T\Sigma^{-1}\boldsymbol{\mu}_k + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_k)^T\Sigma^{-1}\mathbf{x}$$

$$= w_{1,0} + \mathbf{w}_1^T\mathbf{x} \equiv a_1$$

So each log-odds ratio with respect to the reference class is a linear function of $\mathbf{x}$:

$$\log\frac{P(y=1 \mid \mathbf{x})}{P(y=k \mid \mathbf{x})} = a_1 = w_{1,0} + \mathbf{w}_1^T\mathbf{x}$$

Page 16

Logistic function for multiple classes with equal variance: soft-max

For each class $i = 1, \dots, k-1$, with $m_i \equiv \exp(a_i)$:

$$\log\frac{P(y=i \mid \mathbf{x})}{P(y=k \mid \mathbf{x})} = a_i = w_{i,0} + \mathbf{w}_i^T\mathbf{x}$$

$$P(y=i \mid \mathbf{x}) = m_i\,P(y=k \mid \mathbf{x})$$

Since the posteriors must sum to one:

$$\sum_{i=1}^{k} P(y=i \mid \mathbf{x}) = 1 \quad\Rightarrow\quad P(y=k \mid \mathbf{x})\left(1 + \sum_{i=1}^{k-1} m_i\right) = 1$$

$$P(y=k \mid \mathbf{x}) = \frac{1}{1 + \sum_{j=1}^{k-1}\exp(a_j)}$$

$$P(y=i \mid \mathbf{x}) = \frac{\exp(a_i)}{1 + \sum_{j=1}^{k-1}\exp(a_j)}, \qquad i = 1, \dots, k-1$$

A "soft-max" function.
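A minimal sketch of this soft-max posterior (class k is the reference, so its own a is fixed at 0; the function name is mine):

    import numpy as np

    def softmax_posterior(a):
        """a: array of the k-1 log-odds a_i relative to reference class k.
        Returns all k posteriors, with the reference class last."""
        m = np.exp(a)
        denom = 1.0 + np.sum(m)
        return np.append(m / denom, 1.0 / denom)

    print(softmax_posterior(np.array([1.0, -0.5])))  # the k values sum to 1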

Page 17

Classification of multiple classes with equal variance

[Figure, three panels over x from 160 to 220: (1) the weighted class-conditional densities $p(x \mid y=i)\,P(y=i)$ for classes i = 1, 2, 3 and the marginal $p(x) = \sum_{i=1}^{3} p(x \mid y=i)\,P(y=i)$; (2) the three posterior probabilities $P(y=i \mid x)$; (3) the log-odds $\log\frac{P(y=1 \mid x)}{P(y=3 \mid x)}$ and $\log\frac{P(y=2 \mid x)}{P(y=3 \mid x)}$, which are linear in x.]