
Announcements: Homework 4 is due this Thursday (02/27/2004). The project proposal is due on 03/02.

Unconstrained Optimization

Rong Jin

Logistic Regression

Regularized log-likelihood of the training data:

$$l_{reg}(D_{train}) = l(D_{train}) - s\sum_{j=1}^{m} w_j^2 = \sum_{i=1}^{N}\log\frac{1}{1+\exp\bigl(-y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b)\bigr)} - s\sum_{j=1}^{m} w_j^2$$

The optimization problem is to find the weights w and threshold b that maximize the above log-likelihood.

How can we do this efficiently?

Gradient Ascent
Compute the gradient, then increase the weights w and threshold b in the gradient direction:

$$\mathbf{w} \leftarrow \mathbf{w} + \epsilon\,\frac{\partial}{\partial \mathbf{w}}\left(\sum_{i=1}^{N}\log p(y_i\,|\,\mathbf{x}_i) - s\sum_{j=1}^{m} w_j^2\right), \qquad b \leftarrow b + \epsilon\,\frac{\partial}{\partial b}\left(\sum_{i=1}^{N}\log p(y_i\,|\,\mathbf{x}_i) - s\sum_{j=1}^{m} w_j^2\right)$$

where $\epsilon$ is the learning rate. The gradients are

$$\frac{\partial}{\partial \mathbf{w}}\left(\sum_{i=1}^{N}\log p(y_i\,|\,\mathbf{x}_i) - s\sum_{j=1}^{m} w_j^2\right) = \sum_{i=1}^{N} \mathbf{x}_i\, y_i\,\bigl(1 - p(y_i\,|\,\mathbf{x}_i)\bigr) - 2s\,\mathbf{w}$$

$$\frac{\partial}{\partial b}\left(\sum_{i=1}^{N}\log p(y_i\,|\,\mathbf{x}_i) - s\sum_{j=1}^{m} w_j^2\right) = \sum_{i=1}^{N} y_i\,\bigl(1 - p(y_i\,|\,\mathbf{x}_i)\bigr)$$
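The update above translates almost directly into code. Below is a minimal numpy sketch (not from the slides) that assumes labels y_i ∈ {−1, +1}, a fixed learning rate eta, and regularization strength s; the per-iteration scaling by N is our own choice to keep the step size stable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, s=0.01, eta=0.1, n_iters=1000):
    """Maximize the regularized log-likelihood of logistic regression by gradient ascent.
    X: (N, m) data, y: (N,) labels in {-1, +1}, s: regularization strength."""
    N, m = X.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(y * (X @ w + b))          # p(y_i | x_i)
        # gradient of sum_i log p(y_i|x_i) - s * sum_j w_j^2
        grad_w = X.T @ (y * (1.0 - p)) - 2.0 * s * w
        grad_b = np.sum(y * (1.0 - p))
        w += eta * grad_w / N                 # averaging keeps eta on a stable scale
        b += eta * grad_b / N
    return w, b
```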

Problem with Gradient Ascent
Difficult to find an appropriate step size: too small leads to slow convergence, too large leads to oscillation or "bubbling".
Convergence conditions: the Robbins-Monro conditions.
Along with a "regular" objective function, these conditions ensure convergence:

$$\sum_{t=0}^{\infty}\eta_t = \infty, \qquad \sum_{t=0}^{\infty}\eta_t^2 < \infty$$

Newton Method
Utilize the second-order derivative: expand the objective function to second order around x0,

$$f(x) \approx f(x_0) + a\,(x - x_0) + \frac{b}{2}\,(x - x_0)^2, \qquad a = f'(x)\big|_{x=x_0}, \quad b = f''(x)\big|_{x=x_0}$$

The minimum of this approximation is at $x = x_0 - a/b$.

Newton method for optimization:

$$x^{new} = x^{old} - \frac{f'(x)\big|_{x=x^{old}}}{f''(x)\big|_{x=x^{old}}}$$

Guaranteed to converge when the objective function is convex.
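A one-dimensional Newton iteration is only a few lines. The sketch below is illustrative (not from the slides) and applies it to the convex function f(x) = x² − 2x, whose minimizer is x = 1; because f is quadratic, a single step lands exactly on the minimum.

```python
def newton_1d(f_prime, f_double_prime, x0, n_iters=20, tol=1e-10):
    """Newton iteration for 1-D minimization: x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(n_iters):
        step = f_prime(x) / f_double_prime(x)
        x -= step
        if abs(step) < tol:   # stop once the update is negligible
            break
    return x

# Minimize f(x) = x^2 - 2x; f'(x) = 2x - 2, f''(x) = 2.
print(newton_1d(lambda x: 2 * x - 2, lambda x: 2.0, x0=5.0))   # -> 1.0
```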

Multivariate Newton Method
The objective function comprises multiple variables.
Example: logistic regression model,

$$l_{reg}(D_{train}) = \sum_{i=1}^{N}\log\frac{1}{1+\exp\bigl(-y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b)\bigr)} - s\sum_{j=1}^{m} w_j^2$$

Text categorization: thousands of words means thousands of variables.

Multivariate Newton Method
Multivariate function: $f(\mathbf{x}) = f(x_1, x_2, \ldots, x_m)$
First-order derivative: a vector,
$$\frac{\partial f}{\partial \mathbf{x}} = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_m}\right)$$
Second-order derivative: the Hessian matrix $\mathbf{H}$, an m×m matrix whose elements are defined as
$$H_{i,j} = \frac{\partial^2 f}{\partial x_i\,\partial x_j}$$

Multivariate Newton Method
Updating equation:
$$\mathbf{x}^{new} = \mathbf{x}^{old} - \mathbf{H}^{-1}\,\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}$$
Hessian matrix for the logistic regression model:
$$\mathbf{H} = \sum_{i=1}^{N} p(y_i\,|\,\mathbf{x}_i)\bigl(1 - p(y_i\,|\,\mathbf{x}_i)\bigr)\,\mathbf{x}_i\mathbf{x}_i^{T} + s\,\mathbf{I}_{m\times m}$$
This can be expensive to compute. Example: text categorization with 10,000 words gives a 10,000 × 10,000 Hessian, i.e. 100 million entries. Even worse, we have to compute the inverse of the Hessian matrix, H-1.
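As a sketch of how the update and the Hessian above fit together, here is one Newton step for the weights of regularized logistic regression in numpy (an illustration, not the lecture's code). The bias is omitted for brevity, the factor 2s comes from differentiating s·Σ w_j² twice, and a linear system is solved instead of forming H^{-1} explicitly.

```python
import numpy as np

def newton_step(w, X, y, s=0.01):
    """One Newton update for regularized logistic regression weights.
    Minimizes the negative regularized log-likelihood, so w <- w - H^{-1} grad.
    X: (N, m) data, y: (N,) labels in {-1, +1}, s: regularization strength."""
    N, m = X.shape
    p = 1.0 / (1.0 + np.exp(-y * (X @ w)))        # p(y_i | x_i)
    grad = -X.T @ (y * (1.0 - p)) + 2.0 * s * w   # gradient of the negative objective
    R = p * (1.0 - p)                             # p(y_i|x_i) (1 - p(y_i|x_i))
    H = (X * R[:, None]).T @ X + 2.0 * s * np.eye(m)
    return w - np.linalg.solve(H, grad)           # avoids explicitly inverting H
```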

Quasi-Newton Method
Approximate the Hessian matrix H with another matrix B:
$$\mathbf{x}^{new} = \mathbf{x}^{old} - \mathbf{B}^{-1}\,\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}$$
B is updated iteratively (BFGS), utilizing derivatives from previous iterations:
$$\mathbf{B}_{k+1} = \mathbf{B}_k - \frac{\mathbf{B}_k \mathbf{p}_k \mathbf{p}_k^T \mathbf{B}_k}{\mathbf{p}_k^T \mathbf{B}_k \mathbf{p}_k} + \frac{\mathbf{y}_k \mathbf{y}_k^T}{\mathbf{y}_k^T \mathbf{p}_k}, \qquad \mathbf{p}_k = \mathbf{x}_{k+1} - \mathbf{x}_k, \quad \mathbf{y}_k = \mathbf{g}_{k+1} - \mathbf{g}_k$$

Limited-Memory Quasi-Newton
Quasi-Newton avoids computing the inverse of the Hessian matrix, but it still requires computing the B matrix, which means large storage.
Limited-Memory Quasi-Newton (L-BFGS) even avoids explicitly computing the B matrix:
$$\mathbf{B}_{k+1} = \mathbf{B}_k - \frac{\mathbf{B}_k \mathbf{p}_k \mathbf{p}_k^T \mathbf{B}_k}{\mathbf{p}_k^T \mathbf{B}_k \mathbf{p}_k} + \frac{\mathbf{y}_k \mathbf{y}_k^T}{\mathbf{y}_k^T \mathbf{p}_k}, \qquad \mathbf{p}_k = \mathbf{x}_{k+1} - \mathbf{x}_k, \quad \mathbf{y}_k = \mathbf{g}_{k+1} - \mathbf{g}_k$$
B can be expressed as a product of the vectors $\{\mathbf{p}_k, \mathbf{y}_k\}$; only the most recent vectors $\{\mathbf{p}_k, \mathbf{y}_k\}$ (3~20 of them) are kept.
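In practice L-BFGS is usually called through a library. A hedged sketch with SciPy (scipy.optimize.minimize supports a limited-memory BFGS method, "L-BFGS-B"; its maxcor option controls how many correction pairs are kept). The objective function below is our own helper, with the bias folded into w via a constant feature.

```python
import numpy as np
from scipy.optimize import minimize

def neg_reg_log_likelihood(w, X, y, s=0.01):
    """Negative regularized log-likelihood of logistic regression and its gradient."""
    margins = y * (X @ w)
    loss = np.logaddexp(0.0, -margins).sum() + s * np.dot(w, w)   # stable log(1 + exp(-z))
    p = 1.0 / (1.0 + np.exp(-margins))
    grad = -X.T @ (y * (1.0 - p)) + 2.0 * s * w
    return loss, grad

# X_aug: data matrix with a trailing column of ones for the bias term, y in {-1, +1}
# result = minimize(neg_reg_log_likelihood, x0=np.zeros(X_aug.shape[1]),
#                   args=(X_aug, y), jac=True, method="L-BFGS-B",
#                   options={"maxcor": 10})   # keep roughly 10 recent correction pairs
```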

Linear Conjugate Gradient Method
Consider optimizing the quadratic function
$$\mathbf{x}^* = \arg\min_{\mathbf{x}}\; \frac{1}{2}\mathbf{x}^T\mathbf{A}\,\mathbf{x} - \mathbf{b}^T\mathbf{x}$$
Conjugate vectors: the set of vectors {p1, p2, …, pl} is said to be conjugate with respect to a matrix A if
$$\mathbf{p}_i^T \mathbf{A}\, \mathbf{p}_j = 0 \quad \text{for any } i \neq j$$
Important property: the quadratic function can be optimized by simply optimizing along each individual direction in the conjugate set. The optimal solution is
$$\mathbf{x}^* = \alpha_1 \mathbf{p}_1 + \alpha_2 \mathbf{p}_2 + \cdots + \alpha_l \mathbf{p}_l$$
where $\alpha_k$ is the minimizer along the kth conjugate direction.

Example
Minimize the following function:
$$f(x_1, x_2) = x_1^2 + x_2^2 - x_1 x_2 - x_1 - x_2$$
Matrix A:
$$\mathbf{A} = \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix}$$
Conjugate directions:
$$\mathbf{p}_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad \mathbf{p}_2 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$$
Optimization:
First direction, $x_1 = x_2 = x$: $f(x, x) = x^2 - 2x$, minimizer $x = 1$
Second direction, $x_1 = -x_2 = x$: $f(x, -x) = 3x^2$, minimizer $x = 0$
Solution: $x_1 = x_2 = 1$
[Contour plot of $f(x_1, x_2)$ omitted]

How to Efficiently Find a Set of Conjugate Directions
Iterative procedure: given conjugate directions {p1, p2, …, pk-1}, set pk as follows:
$$\mathbf{p}_k = -\mathbf{r}_k + \beta_k\,\mathbf{p}_{k-1}, \qquad \beta_k = \frac{\mathbf{r}_k^T \mathbf{A}\,\mathbf{p}_{k-1}}{\mathbf{p}_{k-1}^T \mathbf{A}\,\mathbf{p}_{k-1}}, \qquad \text{where } \mathbf{r}_k = \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}=\mathbf{x}_k}$$
Theorem: the direction generated in this step is conjugate to all previous directions {p1, p2, …, pk-1}, i.e.,
$$\mathbf{p}_k^T \mathbf{A}\, \mathbf{p}_i = 0 \quad \text{for any } i \in \{1, 2, \ldots, k-1\}$$
Note: computing the kth direction pk only requires the previous direction pk-1.
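Putting the two previous slides together, here is a minimal linear conjugate gradient sketch (the standard algorithm, not copied from the lecture). Applied to the earlier example, where minimizing f(x1, x2) = x1² + x2² − x1x2 − x1 − x2 amounts to solving 2Ax = b, it recovers the solution (1, 1).

```python
import numpy as np

def conjugate_gradient(M, c, tol=1e-10, max_iter=None):
    """Minimize (1/2) x^T M x - c^T x, i.e. solve M x = c, for symmetric
    positive definite M, using conjugate search directions."""
    n = len(c)
    x = np.zeros(n)
    r = c - M @ x                         # negative gradient (residual)
    p = r.copy()                          # first search direction
    for _ in range(max_iter or n):
        Mp = M @ p
        alpha = (r @ r) / (p @ Mp)        # exact minimizer along direction p
        x += alpha * p
        r_new = r - alpha * Mp
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)  # keeps the new direction conjugate to p
        p = r_new + beta * p
        r = r_new
    return x

A = np.array([[1.0, -0.5], [-0.5, 1.0]])
b = np.array([1.0, 1.0])
print(conjugate_gradient(2 * A, b))       # -> approximately [1. 1.]
```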

Nonlinear Conjugate Gradient
Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions.
Several variants: Fletcher-Reeves conjugate gradient (FR-CG) and Polak-Ribiere conjugate gradient (PR-CG); PR-CG is more robust than FR-CG.
Compared to the Newton method, there is no need to compute or store the Hessian matrix.
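A hedged usage note: SciPy's minimize exposes a nonlinear conjugate gradient method (method="CG", a Polak-Ribiere variant), so the approach can be tried on a non-quadratic function without ever forming a Hessian. The Rosenbrock function below is just a stand-in test problem.

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    """A classic non-quadratic test function with minimum at (1, 1)."""
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

result = minimize(rosenbrock, x0=np.array([-1.2, 1.0]), method="CG")
print(result.x)   # close to [1. 1.]; no Hessian was computed or stored
```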

Generalizing Decision Trees
[Figure: a decision tree with simple data partitions vs. a decision tree using classifiers for data partitions; each node is a linear classifier over Attribute 1 and Attribute 2]

Generalized Decision Trees
Each node is a linear classifier.
Pros:
Usually results in shallow trees
Introduces nonlinearity into linear classifiers (e.g. logistic regression)
Overcomes overfitting issues through the regularization mechanism within the classifier
Better way to deal with real-valued attributes
Examples: neural networks, the Hierarchical Mixture Expert Model

Example
[Figure: the kernel method vs. a generalized tree on one-dimensional data split at x = 0]

Hierarchical Mixture Expert Model (HME)
[Diagram: input X enters the router r(x); the Group Layer contains Group 1 with gate g1(x) and Group 2 with gate g2(x); the Expert Layer contains experts m1,1(x), m1,2(x) under Group 1 and m2,1(x), m2,2(x) under Group 2; the output is y]

• Ask r(x): which group should be used for classifying input x?
• If group 1 is chosen, which classifier m(x) should be used?
• Classify input x using the chosen classifier m(x)

Hierarchical Mixture Expert Model (HME): Probabilistic Description


$$p(y\,|\,x) = \sum_{g,m} p(y, g, m\,|\,x)$$
$$= r(+1\,|\,x)\bigl[g_1(+1\,|\,x)\,m_{1,1}(y\,|\,x) + g_1(-1\,|\,x)\,m_{1,2}(y\,|\,x)\bigr] + r(-1\,|\,x)\bigl[g_2(+1\,|\,x)\,m_{2,1}(y\,|\,x) + g_2(-1\,|\,x)\,m_{2,2}(y\,|\,x)\bigr]$$

Two hidden variables:
The hidden variable for groups: g = {1, 2}
The hidden variable for classifiers: m = {11, 12, 21, 22}

Hierarchical Mixture Expert Model (HME): Example


r(+1|x) = ¾, r(-1|x) = ¼
g1(+1|x) = ¼, g1(-1|x) = ¾
g2(+1|x) = ½, g2(-1|x) = ½

             +1    -1
m1,1(x)      ¼     ¾
m1,2(x)      ¾     ¼
m2,1(x)      ¼     ¾
m2,2(x)      ¾     ¼

p(+1|x) = ?, p(-1|x) = ?
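To make the question concrete, the numbers on this slide can be plugged into the mixture formula from the probabilistic-description slide. The short script below is our own worked check, not part of the original slides.

```python
# p(y|x) = r(+1|x) [g1(+1|x) m11(y|x) + g1(-1|x) m12(y|x)]
#        + r(-1|x) [g2(+1|x) m21(y|x) + g2(-1|x) m22(y|x)]
r = (3/4, 1/4)                      # r(+1|x), r(-1|x)
g1 = (1/4, 3/4)                     # g1(+1|x), g1(-1|x)
g2 = (1/2, 1/2)                     # g2(+1|x), g2(-1|x)
m = {(1, 1): (1/4, 3/4), (1, 2): (3/4, 1/4),   # expert outputs for y = +1, -1
     (2, 1): (1/4, 3/4), (2, 2): (3/4, 1/4)}

def p(y):                           # y = 0 means label +1, y = 1 means label -1
    return (r[0] * (g1[0] * m[(1, 1)][y] + g1[1] * m[(1, 2)][y])
            + r[1] * (g2[0] * m[(2, 1)][y] + g2[1] * m[(2, 2)][y]))

print(p(0), p(1))    # p(+1|x) = 19/32 ~ 0.594, p(-1|x) = 13/32 ~ 0.406
```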

Training HME
In the training examples {xi, yi}, there is no information about r(x) and g(x) for each example. The random variables g and m are called hidden variables since they are not exposed in the training data. How do we train a model with hidden variables?

Start with Random Guess …


+: {1, 2, 3, 4, 5}
-: {6, 7, 8, 9}

Random Assignment
• Randomly assign points to each group and expert
• Learn classifiers r(x), g(x), m(x) using the randomly assigned points

Group assignments: Group 1: {1,2} {6,7}; Group 2: {3,4,5} {8,9}
Expert assignments: m1,1: {1}{6}; m1,2: {2}{7}; m2,1: {3}{9}; m2,2: {5,4}{8}

Adjust Group Memberships


+: {1, 2, 3, 4, 5}
-: {6, 7, 8, 9}

• The key is to assign each data point to the group that classifies the data point correctly with the largest probability
• How?
Group assignments: Group 1: {1,2} {6,7}; Group 2: {3,4,5} {8,9}
Expert assignments: m1,1: {1}{6}; m1,2: {2}{7}; m2,1: {3}{9}; m2,2: {5,4}{8}

Adjust Group Memberships


+: {1, 2, 3, 4, 5}
-: {6, 7, 8, 9}

• The key is to assign each data point to the group that classifies the data point correctly with the largest confidence
• Compute p(g=1|x, y) and p(g=2|x, y)
Group assignments: Group 1: {1,2} {6,7}; Group 2: {3,4,5} {8,9}
Expert assignments: m1,1: {1}{6}; m1,2: {2}{7}; m2,1: {3}{9}; m2,2: {5,4}{8}

Posterior Prob. for Groups

Point   Group 1   Group 2
1       0.8       0.2
2       0.4       0.6
3       0.3       0.7
4       0.1       0.9
5       0.65      0.35
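The slides do not spell out how such posteriors are obtained; one natural reading is Bayes' rule applied to the mixture model, p(g=k|x, y) ∝ p(g=k|x) · p(y|x, g=k). The sketch below follows that reading and reuses the r, g1, g2, m containers from the previous snippet; it is an illustration, not the lecture's code.

```python
def group_posteriors(y, r, g1, g2, m):
    """Posterior responsibility of each group for one example (x, y):
    p(g=k | x, y) is proportional to p(g=k | x) * p(y | x, g=k)."""
    like_g1 = g1[0] * m[(1, 1)][y] + g1[1] * m[(1, 2)][y]   # p(y | x, g=1)
    like_g2 = g2[0] * m[(2, 1)][y] + g2[1] * m[(2, 2)][y]   # p(y | x, g=2)
    joint1, joint2 = r[0] * like_g1, r[1] * like_g2
    z = joint1 + joint2                                     # = p(y | x)
    return joint1 / z, joint2 / z

# Example call with the probabilities from the HME example slide (y = 0 means label +1):
# group_posteriors(0, (3/4, 1/4), (1/4, 3/4), (1/2, 1/2), m)
```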

Adjust Memberships for Classifiers


+: {1, 2, 3, 4, 5}
-: {6, 7, 8, 9}
New group assignments: Group 1: {1,5} {6,7}; Group 2: {2,3,4} {8,9}
• The key is to assign each data point to the classifier that classifies the data point correctly with the largest confidence
• Compute p(m=1,1|x, y), p(m=1,2|x, y), p(m=2,1|x, y), p(m=2,2|x, y)

Adjust Memberships for Classifiers


+: {1, 2, 3, 4, 5}
-: {6, 7, 8, 9}
Group assignments: Group 1: {1,5} {6,7}; Group 2: {2,3,4} {8,9}

Posterior Prob. for Classifiers

Point   1      2      3      4      5
m1,1    0.7    0.1    0.15   0.1    0.05
m1,2    0.2    0.2    0.20   0.1    0.55
m2,1    0.05   0.5    0.60   0.1    0.3
m2,2    0.05   0.2    0.05   0.7    0.1

• The key is to assign each data point to the classifier that classifies the data point correctly with the largest confidence
• Compute p(m=1,1|x, y), p(m=1,2|x, y), p(m=2,1|x, y), p(m=2,2|x, y)


Adjust Memberships for Classifiers


+: {1, 2, 3, 4, 5}
-: {6, 7, 8, 9}
Group assignments: Group 1: {1,5} {6,7}; Group 2: {2,3,4} {8,9}
New expert assignments: m1,1: {1}{6}; m1,2: {5}{7}; m2,1: {2,3}{9}; m2,2: {4}{8}

Retrain The Model

• Retrain r(x), g(x), m(x) using the new memberships


Expectation Maximization
Two things need to be estimated:
Logistic regression models for r(x; θr), g(x; θg), and m(x; θm)
Unknown group memberships and expert memberships: p(g=1,2|x), p(m=11,12|x, g=1), p(m=21,22|x, g=2)

E-step
1. Estimate p(g=1|x, y) and p(g=2|x, y) for all training examples, given the guessed r(x; θr), g(x; θg), and m(x; θm)
2. Estimate p(m=11, 12|x, y) and p(m=21, 22|x, y) for all training examples, given the guessed r(x; θr), g(x; θg), and m(x; θm)

M-step
1. Train r(x; θr) using weighted examples: for each x, a p(g=1|x) fraction as a positive example and a p(g=2|x) fraction as a negative example
2. Train g1(x; θg) using weighted examples: for each x, a p(g=1|x)p(m=11|x, g=1) fraction as a positive example and a p(g=1|x)p(m=12|x, g=1) fraction as a negative example. Train g2(x; θg) similarly
3. Train m(x; θm) with appropriately weighted examples
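The E-step can be written compactly once the current r, g, and m models are available as probability functions. The sketch below is schematic and uses our own callables (r(X), g1(X), g2(X) return the probability of taking their first branch; m[(j, k)](X, y) returns each expert's probability of the observed labels); the M-step would then retrain each node as a weighted logistic regression using these responsibilities as example weights.

```python
import numpy as np

def e_step(X, y, r, g1, g2, m):
    """Posterior responsibilities p(g|x,y) and p(m|x,y) for a two-level HME."""
    experts = [(1, 1), (1, 2), (2, 1), (2, 2)]
    lik = {jk: m[jk](X, y) for jk in experts}                   # p(y_i | x_i, expert jk)
    like_g1 = g1(X) * lik[(1, 1)] + (1 - g1(X)) * lik[(1, 2)]   # p(y | x, g=1)
    like_g2 = g2(X) * lik[(2, 1)] + (1 - g2(X)) * lik[(2, 2)]   # p(y | x, g=2)
    joint1, joint2 = r(X) * like_g1, (1 - r(X)) * like_g2
    z = joint1 + joint2                                         # p(y | x)
    p_g = {1: joint1 / z, 2: joint2 / z}
    p_m = {
        (1, 1): p_g[1] * g1(X) * lik[(1, 1)] / like_g1,
        (1, 2): p_g[1] * (1 - g1(X)) * lik[(1, 2)] / like_g1,
        (2, 1): p_g[2] * g2(X) * lik[(2, 1)] / like_g2,
        (2, 2): p_g[2] * (1 - g2(X)) * lik[(2, 2)] / like_g2,
    }
    return p_g, p_m
```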

Comparison of Different Classification Models
The goal of all classifiers: predicting the class label y for an input x, i.e. estimating p(y|x).
Gaussian generative model: p(y|x) ∝ p(x|y) p(y), posterior ∝ likelihood × prior.
p(x|y) describes the input patterns for each class y; it is difficult to estimate if x is of high dimensionality.
Naïve Bayes: p(x|y) ≈ p(x1|y) p(x2|y) … p(xm|y); essentially a linear model.
Linear discriminative model: directly estimate p(y|x), focusing on finding the decision boundary.

Comparison of Different Classification Models
Logistic regression model
A linear decision boundary: w·x + b
A probabilistic model p(y|x)
Maximum likelihood approach for estimating the weights w and threshold b

$$w\cdot x + b \ge 0 \;\Rightarrow\; \text{positive}, \qquad w\cdot x + b < 0 \;\Rightarrow\; \text{negative}$$

$$p(y\,|\,x) = \frac{1}{1+\exp\bigl(-y\,(w\cdot x + b)\bigr)}$$

$$l(D_{train}) = \sum_{i=1}^{N_+}\log p\bigl(+1\,\big|\,x_i^{(+)}\bigr) + \sum_{i=1}^{N_-}\log p\bigl(-1\,\big|\,x_i^{(-)}\bigr) = \sum_{i=1}^{N_+}\log\frac{1}{1+\exp\bigl(-w\cdot x_i^{(+)} - b\bigr)} + \sum_{i=1}^{N_-}\log\frac{1}{1+\exp\bigl(w\cdot x_i^{(-)} + b\bigr)}$$

Comparison of Different Classification Models
Logistic regression model: the overfitting issue
Example: text classification. Every word is assigned a different weight; words that appear in only one document will be assigned infinitely large weights.
Solution: regularization

$$l_{reg}(D_{train}) = \sum_{i=1}^{N_+}\log p\bigl(+1\,\big|\,x_i^{(+)}\bigr) + \sum_{i=1}^{N_-}\log p\bigl(-1\,\big|\,x_i^{(-)}\bigr) - s\sum_{j=1}^{m} w_j^2 = \sum_{i=1}^{N_+}\log\frac{1}{1+\exp\bigl(-w\cdot x_i^{(+)} - b\bigr)} + \sum_{i=1}^{N_-}\log\frac{1}{1+\exp\bigl(w\cdot x_i^{(-)} + b\bigr)} - s\sum_{j=1}^{m} w_j^2$$

The term $s\sum_{j=1}^{m} w_j^2$ is the regularization term.

Comparison of Different Classification Models
Conditional exponential model
An extension of the logistic regression model to the multi-class case
A different set of weights $w_y$ and a threshold $c_y$ for each class y:

$$p(y\,|\,x; \mathbf{w}, \mathbf{c}) = \frac{1}{Z(x)}\exp\bigl(c_y + x\cdot w_y\bigr), \qquad Z(x) = \sum_{y}\exp\bigl(c_y + x\cdot w_y\bigr)$$

Maximum entropy model
Find the simplest model that matches the data:

$$\max_{p(y|x)} \sum_{i=1}^{N} H\bigl(y\,|\,x_i\bigr) = \max_{p(y|x)} -\sum_{i=1}^{N}\sum_{y} p(y\,|\,x_i)\log p(y\,|\,x_i)$$
$$\text{subject to } \sum_{i=1}^{N} p(y\,|\,x_i)\,x_i = \sum_{i=1}^{N} x_i\,\delta(y, y_i) \text{ for every } y, \qquad \sum_{y} p(y\,|\,x_i) = 1$$

Maximizing the entropy prefers a uniform distribution; the constraints enforce that the model is consistent with the observed data.
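A minimal sketch of the conditional exponential (softmax) form above, with one weight vector and one threshold per class; the max-subtraction is only for numerical stability and does not change the probabilities.

```python
import numpy as np

def conditional_exponential(x, W, c):
    """p(y | x) = exp(c_y + x . w_y) / Z(x) for each class y.
    W: (K, m) matrix of per-class weights, c: (K,) per-class thresholds."""
    scores = W @ x + c
    scores = scores - scores.max()     # numerical stability; cancels in Z(x)
    exps = np.exp(scores)
    return exps / exps.sum()           # probabilities sum to 1

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # 3 classes, 2 features
c = np.zeros(3)
print(conditional_exponential(np.array([0.5, -0.2]), W, c))
```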


Comparison of Different Classification Models
Support vector machine
Classification margin
Maximum margin principle: separate the data as far as possible from the decision boundary
Two objectives: minimize the classification error over the training data, and maximize the classification margin
Support vectors: only the support vectors have an impact on the location of the decision boundary

[Figure: data labeled +1 and -1, the decision boundary w·x + b = 0, and the support vectors]

Comparison of Different Classification Models
Separable case:
$$\{w^*, b^*\} = \arg\min_{w,b} \sum_{j=1}^{m} w_j^2 \qquad \text{subject to } y_i(w\cdot x_i + b) \ge 1 \text{ for } i = 1, \ldots, N$$

Noisy case:
$$\{w^*, b^*\} = \arg\min_{w,b} \sum_{j=1}^{m} w_j^2 + c\sum_{i=1}^{N}\xi_i \qquad \text{subject to } y_i(w\cdot x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0 \text{ for } i = 1, \ldots, N$$

Quadratic programming!
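In practice the quadratic program is solved by a library rather than by hand. A hedged example with scikit-learn's SVC, where the parameter C plays the role of the slack penalty c in the noisy case (the toy data here is our own):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.2], [3.0, 3.1]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # soft-margin linear SVM
print(clf.support_vectors_)                   # only these points fix the boundary
print(clf.coef_, clf.intercept_)              # w and b of the boundary w.x + b = 0
```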

Comparison of Classification Models
Logistic regression model vs. support vector machine

Logistic regression:
$$\{w, b\}^* = \arg\max_{w,b} \sum_{i=1}^{N}\log\frac{1}{1+\exp\bigl(-y_i(w\cdot x_i + b)\bigr)} - s\sum_{j=1}^{m} w_j^2$$

Support vector machine:
$$\{w^*, b^*\} = \arg\min_{w,b}\; c\sum_{i=1}^{N}\xi_i + \sum_{j=1}^{m} w_j^2 \qquad \text{subject to } y_i(w\cdot x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0 \text{ for } i = 1, \ldots, N$$

The log-likelihood can be viewed as a measurement of accuracy, and the regularization terms $\sum_{j} w_j^2$ are identical in the two models.

Comparison of Different Classification Models

[Figure: loss as a function of w·x + b, comparing the loss function for logistic regression with the loss function for SVM]

Logistic regression differs from support vector machine only in the loss function
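The two curves in the figure can be reproduced numerically: the logistic loss is log(1 + exp(-z)) and the SVM (hinge) loss is max(0, 1 - z), both as functions of the margin z = y(w·x + b). A small sketch:

```python
import numpy as np

z = np.linspace(-3, 1, 9)                  # margin values y * (w.x + b)
logistic_loss = np.log1p(np.exp(-z))       # negative log-likelihood of logistic regression
hinge_loss = np.maximum(0.0, 1.0 - z)      # SVM hinge loss
for zi, ll, hl in zip(z, logistic_loss, hinge_loss):
    print(f"z = {zi:5.2f}   logistic = {ll:5.3f}   hinge = {hl:5.3f}")
```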

Comparison of Different Classification Models
[Figure: histograms of the positive and negative data along x, with fitting curves for the positive and negative data]

• Generative models have trouble at the decision boundary

• Classification boundary that achieves the least training error

• Classification boundary that achieves large margin

Nonlinear Models
Kernel methods
Add additional dimensions to help separate the data
Efficiently compute the dot product in the high-dimensional space

[Figure: one-dimensional data split at x = 0 mapped into a higher-dimensional space by the kernel method]
$$x \rightarrow \Phi(x), \qquad \Phi(w)\cdot\Phi(x) = K(w, x)$$
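As a tiny numeric check of the identity Φ(w)·Φ(x) = K(w, x), take the degree-2 polynomial kernel in two dimensions as a stand-in (the slides do not fix a particular kernel): the explicit feature map and the kernel give the same dot product, but the kernel never builds the high-dimensional vectors.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(u, v):
    """Kernel that equals phi(u) . phi(v) without forming phi explicitly."""
    return (u @ v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(u) @ phi(v), K(u, v))   # both equal (u . v)^2 = 1.0
```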

Nonlinear Models
Decision trees: nonlinearly combine different features through a tree structure
Hierarchical Mixture Expert Model: replace each node with a logistic regression model, nonlinearly combining multiple linear models


Recommended