Project
Now it is time to think about the project. It is team work: each team will consist of 2 people. It is better to propose a project of your own; otherwise, I will assign you to some "difficult" project.
Important dates:
03/11: project proposal due
04/01: project progress report due
04/22 and 04/24: final presentations
05/03: final report due
Project Proposal
What do I expect?
Introduction: describe the research problem that you are trying to solve
Related work: describe the existing approaches and their deficiencies
Proposed approach: describe your approach and why it has the potential to alleviate the deficiencies of the existing approaches
Plan: what do you plan to do in this project?
Format: it should look like a research paper. The required format (both Microsoft Word and LaTeX) can be downloaded from www.cse.msu.edu/~cse847/assignments/format.zip
Project Progress Report
Introduction: overview of the problem that you are trying to solve and the solutions that you presented in the proposal
Progress:
Algorithm description in more detail
Related data collection and cleanup
Preliminary results
The format should be the same as for the project report
Project Final Report
It should look like a research paper that is ready for submission to a research conference.
What do I expect?
Introduction
Algorithm description and discussion
Empirical studies: I am expecting a careful analysis of the results, no matter whether the approach is a success or a complete failure
Presentation: 25-minute presentation, 5-minute discussion
Exponential Model and Maximum Entropy Model
Rong Jin
Recap: Logistic Regression Model
Assume the inputs and outputs are related through the log-linear function
  $p(y|x;\theta) = \frac{1}{1+\exp\left(-y\,(\vec{x}\cdot\vec{w}+c)\right)}$, with $\theta = \{w_1, w_2, \ldots, w_m, c\}$
Estimate the weights by the (regularized) MLE approach:
  $\max_{w}\; l_{reg}(D_{train}) = \max_{w}\; \sum_{i=1}^{n} \log p(y_i|x_i) - s\,\|\vec{w}\|_2^2 = \max_{w}\; \sum_{i=1}^{n} \log \frac{1}{1+\exp\left(-y_i\,(\vec{x}_i\cdot\vec{w}+c)\right)} - s \sum_{j=1}^{m} w_j^2$
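As a concrete illustration, the regularized MLE above can be maximized by plain gradient ascent. The sketch below is my own minimal Python implementation (the function names and toy data are illustrative, not from the lecture):

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, Y, s=0.01, lr=0.1, iters=500):
    """Gradient ascent on sum_i log sigmoid(y_i (x_i.w + c)) - s * ||w||^2."""
    m = len(X[0])
    w, c = [0.0] * m, 0.0
    for _ in range(iters):
        gw, gc = [0.0] * m, 0.0
        for x, y in zip(X, Y):
            # d/dw log sigmoid(y (x.w + c)) = (1 - sigmoid(y (x.w + c))) * y * x
            r = (1.0 - sigmoid(y * (sum(a * b for a, b in zip(x, w)) + c))) * y
            gw = [g + r * xj for g, xj in zip(gw, x)]
            gc += r
        w = [wj + lr * (gj - 2.0 * s * wj) for wj, gj in zip(w, gw)]
        c += lr * gc
    return w, c

# Linearly separable toy data with labels y in {+1, -1}
X = [[1.0, 2.0], [2.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]]
Y = [1, 1, -1, -1]
w, c = fit_logistic(X, Y)
# After training, p(y=+1|x) should exceed 1/2 for the positive examples
```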
How to Extend Logistic Regression Model to Multiple Classes?
  $y \in \{+1, -1\} \;\rightarrow\; y \in \{1, 2, \ldots, C\}$?
Binary logistic regression can be written as
  $p(y|x;\theta) = \frac{1}{1+\exp\left(-y\,(\vec{x}\cdot\vec{w}+c)\right)}$, $\theta = \{w_1, w_2, \ldots, w_m, c\}$
or, equivalently, in terms of the log odds,
  $\log\frac{p(y=+1|x)}{p(y=-1|x)} = \vec{x}\cdot\vec{w} + c$
Conditional Exponential Model
Introduce a different set of parameters for each class:
  $p(y|x;\theta) \propto \exp(c_y + \vec{x}\cdot\vec{w}_y)$, $\theta = \{w_y, c_y\}_{y=1}^{C}$
Ensure the probabilities sum to 1:
  $p(y|x;\theta) = \frac{1}{Z(x)}\exp(c_y + \vec{x}\cdot\vec{w}_y)$, where $Z(x) = \sum_{y'}\exp(c_{y'} + \vec{x}\cdot\vec{w}_{y'})$
Conditional Exponential Model
Prediction probability:
  $p(y|x;\theta) = \frac{\exp(c_y + \vec{x}\cdot\vec{w}_y)}{\sum_{y'=1}^{C}\exp(c_{y'} + \vec{x}\cdot\vec{w}_{y'})}$, $y \in \{1, 2, \ldots, C\}$
Model parameters: for each class $y$, we have weights $w_y$ and threshold $c_y$.
Maximum likelihood estimation:
  $l(D_{train}) = \sum_{i=1}^{N}\log p(y_i|x_i) = \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})}$
Any problems?
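The prediction rule above is a softmax over per-class scores. A minimal sketch (the parameter values are made up for illustration):

```python
import math

def softmax_probs(x, W, c):
    """Conditional exponential model: p(y|x) = exp(c_y + x.w_y) / Z(x)."""
    scores = [cy + sum(a * b for a, b in zip(x, wy)) for wy, cy in zip(W, c)]
    top = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Three classes, two features (all numbers are illustrative)
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
c = [0.0, 0.0, 0.5]
p = softmax_probs([2.0, 0.5], W, c)
# p is a valid distribution over the three classes
```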
Conditional Exponential Model
If we add the same constant vector $w_0$ to every weight vector (and the same constant $c_0$ to every threshold), we obtain the same log-likelihood function:
  $w_y \rightarrow w_y + w_0, \quad c_y \rightarrow c_y + c_0$
  $l(D_{train}) = \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + c_0 + \vec{x}_i\cdot(\vec{w}_{y_i} + \vec{w}_0))}{\sum_{y'}\exp(c_{y'} + c_0 + \vec{x}_i\cdot(\vec{w}_{y'} + \vec{w}_0))} = \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})}$
The optimum solution is not unique! How do we resolve this problem?
Solution: set $w_1$ to be a zero vector and $c_1$ to be zero.
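The invariance is easy to verify numerically. The sketch below shifts all parameters by the same $(w_0, c_0)$ and checks that the predicted probabilities are unchanged (all numbers are made up):

```python
import math

def softmax_probs(x, W, c):
    scores = [cy + sum(a * b for a, b in zip(x, wy)) for wy, cy in zip(W, c)]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

W = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
c = [0.1, -0.4, 0.0]
x = [1.0, 2.0]

# Shift every weight vector by the same w0 and every threshold by the same c0
w0, c0 = [3.0, -2.0], 5.0
W2 = [[wj + w0j for wj, w0j in zip(wy, w0)] for wy in W]
c2 = [cy + c0 for cy in c]

p1, p2 = softmax_probs(x, W, c), softmax_probs(x, W2, c2)
# p1 and p2 are identical: every score shifted by the same amount cancels in Z(x)
```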
Modified Conditional Exponential Model
Prediction probability:
  For $y \in \{2, \ldots, C\}$: $p(y|x;\theta) = \frac{\exp(c_y + \vec{x}\cdot\vec{w}_y)}{1 + \sum_{y'=2}^{C}\exp(c_{y'} + \vec{x}\cdot\vec{w}_{y'})}$
  For $y = 1$: $p(1|x;\theta) = \frac{1}{1 + \sum_{y'=2}^{C}\exp(c_{y'} + \vec{x}\cdot\vec{w}_{y'})}$
Model parameters: for each class $y > 1$, we have weights $w_y$ and threshold $c_y$.
Maximum likelihood estimation:
  $l(D_{train}) = \sum_{i=1}^{N}\log p(y_i|x_i) = \sum_{\{i\,|\,y_i > 1\}}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{1 + \sum_{y'=2}^{C}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} + \sum_{\{i\,|\,y_i = 1\}}\log\frac{1}{1 + \sum_{y'=2}^{C}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})}$
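A minimal sketch of the modified model, where class 1 is the reference class with $w_1 = 0$ and $c_1 = 0$ so only $C-1$ parameter sets are stored (function names and numbers are my own):

```python
import math

def modified_probs(x, W, c):
    """W[k], c[k] hold the parameters for class k+2 (i.e., classes 2..C)."""
    # Class 1 contributes exp(0) = 1 to the normalizer
    scores = [0.0] + [ck + sum(a * b for a, b in zip(x, wk))
                      for wk, ck in zip(W, c)]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# Three classes: parameters stored only for classes 2 and 3 (made-up numbers)
W = [[1.0, -0.5], [0.2, 0.3]]
c = [0.1, -0.2]
p = modified_probs([1.0, 1.0], W, c)
# p is still a distribution over all three classes
```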
Maximum Entropy Model: Motivation
Consider a translation example: English 'in' → French {dans, en, à, au-cours-de, pendant}
Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant)
Case 1: no prior knowledge on translation. What is your guess of the probabilities?
  p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5
Case 2: 30% of the time either dans or en is used. What is your guess of the probabilities?
  p(dans) = p(en) = 3/20, p(à) = p(au-cours-de) = p(pendant) = 7/30
The uniform distribution is favored.
Maximum Entropy Model: Motivation
Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used. What is your guess of the probabilities?
A good probability distribution should:
  Satisfy the constraints
  Be close to the uniform distribution, but how do we measure closeness?
Measure uniformity using the Kullback-Leibler distance!
Maximum Entropy Principle (MaxEnt)
The uniformity of a distribution is measured by its entropy.
Solution: p(dans) = 0.2, p(à) = 0.3, p(en) = 0.1, p(au-cours-de) = 0.2, p(pendant) = 0.2
  P* = argmax_P H(P)
  where H(P) = -p(dans) log p(dans) - p(en) log p(en) - p(à) log p(à) - p(au-cours-de) log p(au-cours-de) - p(pendant) log p(pendant)
  subject to
  p(dans) + p(en) = 3/10
  p(dans) + p(à) = 1/2
  p(dans) + p(en) + p(à) + p(au-cours-de) + p(pendant) = 1
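A quick numerical check of the case-3 solution. After eliminating the equality constraints with p(dans) = d as the free parameter, p(en) = 0.3 - d, p(à) = 0.5 - d, and the remaining mass is split evenly between au-cours-de and pendant (entropy favors the even split). The helper names below are my own:

```python
import math

def entropy(ps):
    return -sum(p * math.log(p) for p in ps)

def dist(d):
    """Feasible distribution [dans, en, a, au-cours-de, pendant] for p(dans)=d."""
    rest = (1.0 - 0.3 - (0.5 - d)) / 2.0   # mass left for the last two words
    return [d, 0.3 - d, 0.5 - d, rest, rest]

stated = dist(0.2)   # the slide's solution: [0.2, 0.1, 0.3, 0.2, 0.2]
# It satisfies all three constraints and has higher entropy than other
# feasible distributions such as dist(0.05)
```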
MaxEnt for Classification Problems
We want $p(y|x)$ to be close to a uniform distribution: maximize the conditional entropy over the training data
  $H(y|x) = \frac{1}{N}\sum_{i=1}^{N} H(y|x_i) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y|x_i)\log p(y|x_i)$
Constraints:
  Valid probability distribution: $\forall i: \sum_{y} p(y|x_i) = 1$
  From the training data, the model should be consistent with the data. For each class, the model mean of $x$ equals the empirical mean of $x$:
  $\forall y \in \{1, 2, \ldots, C\}: \; \frac{1}{N}\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i = \frac{1}{N}\sum_{i=1}^{N} \vec{x}_i\,\delta(y_i, y)$
MaxEnt for Classification Problems
We require the means to be consistent between the empirical data and the model. There is no assumption about the parametric form of the likelihood; we only assume it is $C^2$ continuous.
  $\max_{p}\; H(y|x) = \max_{p}\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y|x_i)\log p(y|x_i)$
  subject to
  $\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i = \sum_{i=1}^{N} \vec{x}_i\,\delta(y_i, y), \quad \sum_{y} p(y|x_i) = 1$
MaxEnt Model
Consistency with the data is ensured by the equality constraints: for each feature, the empirical mean equals the model mean
  $\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i = \sum_{i=1}^{N} \vec{x}_i\,\delta(y_i, y)$
Beyond the raw feature vector $x$, the same constraints can be imposed on arbitrary features $f_k(x)$:
  $\sum_{i=1}^{N} p(y|x_i)\,f_k(x_i) = \sum_{i=1}^{N} f_k(x_i)\,\delta(y_i, y)$
Translation Problem
Parameters: p(dans), p(en), p(au-cours-de), p(à), p(pendant)
Represent each French word with two binary features:

  Word               {dans, en}   {dans, à}
  dans                   1            1
  en                     1            0
  au-cours-de            0            0
  à                      0            1
  pendant                0            0
  Empirical average      0.3          0.5

Constraints (empirical mean = model mean):
  p(dans)·(1,1) + p(en)·(1,0) + p(au-cours-de)·(0,0) + p(à)·(0,1) + p(pendant)·(0,0) = (0.3, 0.5)
i.e.,
  p(dans) + p(en) = 3/10
  p(dans) + p(à) = 1/2
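The empirical averages in the table can be recomputed from word counts. The counts below are hypothetical, chosen only so that the averages match the slide's 0.3 and 0.5:

```python
features = {                 # word -> (f_{dans,en}, f_{dans,a})
    "dans":        (1, 1),
    "en":          (1, 0),
    "au-cours-de": (0, 0),
    "a":           (0, 1),
    "pendant":     (0, 0),
}
# Hypothetical corpus counts (made up to match the empirical averages)
counts = {"dans": 20, "en": 10, "a": 30, "au-cours-de": 20, "pendant": 20}
N = sum(counts.values())
avg = [sum(counts[w] * features[w][k] for w in features) / N for k in range(2)]
# avg reproduces the table's empirical averages (0.3, 0.5)
```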
Solution to MaxEnt
Surprisingly, the solution is just the conditional exponential model without the threshold terms:
  $p(y|x;\theta) = \frac{\exp(\vec{x}\cdot\vec{w}_y)}{\sum_{y'}\exp(\vec{x}\cdot\vec{w}_{y'})}$
Why?
Solution to MaxEnt
Solve the maximum entropy problem, i.e.,
  $\max_{p}\; H(y|x) = \max_{p}\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y|x_i)\log p(y|x_i)$
  subject to $\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i = \sum_{i=1}^{N} \vec{x}_i\,\delta(y_i, y), \quad \sum_{y} p(y|x_i) = 1$
To maximize the entropy function under the constraints, we introduce a set of Lagrangian multipliers into the objective function, i.e.,
  $G = H(y|x) + \sum_{y}\vec{\lambda}_y\cdot\left(\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i - \sum_{i=1}^{N}\vec{x}_i\,\delta(y_i,y)\right) + \sum_{i=1}^{N}\mu_i\left(\sum_{y} p(y|x_i) - 1\right)$
Setting the first derivative of $G$ with respect to $p(y|x_i)$ to zero, we obtain the equations
  $\frac{\partial G}{\partial p(y|x_i)} = 0 \;\Rightarrow\; \log p(y|x_i) = \vec{\lambda}_y\cdot\vec{x}_i + \text{const} \;\Rightarrow\; p(y|x_i) \propto \exp(\vec{\lambda}_y\cdot\vec{x}_i)$
Since $\sum_{y} p(y|x_i) = 1$, we have
  $p(y|x_i) = \frac{\exp(\vec{\lambda}_y\cdot\vec{x}_i)}{\sum_{y'}\exp(\vec{\lambda}_{y'}\cdot\vec{x}_i)}$
Maximum Entropy Model versus Conditional Exponential Model
Maximum Entropy Model:
  $\max_{p}\; H(y|x) = \max_{p}\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y|x_i)\log p(y|x_i)$
  subject to $\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i = \sum_{i=1}^{N}\vec{x}_i\,\delta(y_i,y)$ for all $y$, and $\sum_{y} p(y|x_i) = 1$ for all $i$
Conditional Exponential Model:
  $p(y|x;\{w_y\}) = \frac{\exp(\vec{x}\cdot\vec{w}_y)}{\sum_{y'}\exp(\vec{x}\cdot\vec{w}_{y'})}$
  $\max_{\{w_y\}}\; l(D_{train}) = \max_{\{w_y\}}\; \sum_{i=1}^{N}\log\frac{\exp(\vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(\vec{x}_i\cdot\vec{w}_{y'})}$
The two are connected as a primal-dual problem pair.
Maximum Entropy Model vs. Conditional Exponential Model
However, where is the threshold term $c_y$?
Conditional Exponential Model:
  $p(y|x;\{c_y, w_y\}) = \frac{\exp(c_y + \vec{x}\cdot\vec{w}_y)}{\sum_{y'}\exp(c_{y'} + \vec{x}\cdot\vec{w}_{y'})}$
  $\max_{\{c_y, w_y\}}\; l_{reg}(D_{train}) = \max_{\{c_y, w_y\}}\; \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} - s\sum_{y}\|\vec{w}_y\|^2$
Maximum Entropy Model:
  $p(y|x;\{w_y\}) = \frac{\exp(\vec{x}\cdot\vec{w}_y)}{\sum_{y'}\exp(\vec{x}\cdot\vec{w}_{y'})}$
  $\max_{\{w_y\}}\; l(D_{train}) = \max_{\{w_y\}}\; \sum_{i=1}^{N}\log\frac{\exp(\vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(\vec{x}_i\cdot\vec{w}_{y'})}$
Solving Maximum Entropy Model
Iterative scaling algorithm. Recall the problem:
  $\max_{p}\; H(y|x) = \max_{p}\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y|x_i)\log p(y|x_i)$
  subject to $\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i = \sum_{i=1}^{N}\vec{x}_i\,\delta(y_i,y), \quad \sum_{y} p(y|x_i) = 1$
Assume:
  $\forall i \in [1 \ldots n], \forall j \in [1 \ldots d]: \; x_{i,j} \geq 0$
  $\forall i \in [1 \ldots n]: \; \sum_{j=1}^{d} x_{i,j} = g$, where $g$ is a constant independent of $i$
Solving Maximum Entropy Model
Compute the empirical mean for each feature of every class, i.e., for every $j$ and every class $y$:
  $e_{y,j} = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}\,\delta(y_i, y)$
Start with $w_1 = w_2 = \cdots = w_C = 0$. Repeat:
  Compute $p(y|x_i)$ for each training data point $(x_i, y_i)$ using the $w$ from the previous iteration
  Compute the model mean of each feature of every class using the estimated probabilities, i.e., for every $j$ and every $y$:
    $m_{y,j} = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}\,p(y|x_i)$
  Compute, for every $j$ and every $y$:
    $\Delta w_{j,y} = \frac{1}{g}\left(\log e_{y,j} - \log m_{y,j}\right)$
  Update the weights: $w_{j,y} \leftarrow w_{j,y} + \Delta w_{j,y}$
Solving Maximum Entropy Model
The likelihood function always increases!
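The iterative scaling loop above can be sketched in code. This is a minimal Python implementation under the stated assumptions (non-negative features whose per-example sum is the constant $g$); the toy data and helper names are illustrative only:

```python
import math

# Toy training set: each row of X is a non-negative feature vector whose
# entries sum to the same constant g, as the algorithm requires.
X = [[2.0, 0.5, 0.5], [1.5, 1.0, 0.5], [0.5, 2.0, 0.5], [0.5, 1.5, 1.0]]
Y = [0, 0, 1, 1]
C, d, N = 2, 3, len(X)
g = sum(X[0])  # every row sums to g = 3

def probs(w, x):
    """p(y|x) = exp(x . w_y) / sum_y' exp(x . w_y') -- no threshold term."""
    scores = [sum(xj * wj for xj, wj in zip(x, w[y])) for y in range(C)]
    top = max(scores)                        # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def log_likelihood(w):
    return sum(math.log(probs(w, x)[y]) for x, y in zip(X, Y))

# Empirical mean of every feature for every class: e_{y,j}
e = [[sum(X[i][j] for i in range(N) if Y[i] == y) / N for j in range(d)]
     for y in range(C)]

w = [[0.0] * d for _ in range(C)]            # start from all-zero weights
lls = [log_likelihood(w)]
for _ in range(50):
    p = [probs(w, x) for x in X]
    # Model mean of every feature for every class: m_{y,j}
    m = [[sum(X[i][j] * p[i][y] for i in range(N)) / N for j in range(d)]
         for y in range(C)]
    # Iterative scaling update: w_{j,y} += (log e_{y,j} - log m_{y,j}) / g
    for y in range(C):
        for j in range(d):
            w[y][j] += (math.log(e[y][j]) - math.log(m[y][j])) / g
    lls.append(log_likelihood(w))
# lls is non-decreasing, matching the claim on the slide
```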
Solving Maximum Entropy Model
What if each feature can take both positive and negative values?
What if the sum of the features is not a constant?
How do we apply this approach to the conditional exponential model with a bias term (or threshold term)?
Improved Iterative Scaling
It only requires all the input features to be positive.
Compute the empirical mean for each feature of every class, i.e., for every $j$ and every class $y$:
  $e_{y,j} = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}\,\delta(y_i, y)$
Start with $w_1 = w_2 = \cdots = w_C = 0$. Repeat:
  Compute $p(y|x_i)$ for each training data point $(x_i, y_i)$ using the $w$ from the previous iteration
  Solve for $\Delta w_{j,y}$, for every $j$ and every $y$:
    $\frac{1}{N}\sum_{i=1}^{N} x_{i,j}\,p(y|x_i)\exp\!\left(\Delta w_{j,y}\sum_{j'} x_{i,j'}\right) = e_{y,j}$
  Update the weights: $w_{j,y} \leftarrow w_{j,y} + \Delta w_{j,y}$
Choice of Features
A feature does not have to be one of the raw inputs. For the maximum entropy model, bounded features are more favorable; very often, people use binary features:
  $\max_{p}\; H(y|x) = \max_{p}\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y|x_i)\log p(y|x_i)$
  subject to: $\forall y: \sum_{i=1}^{N} p(y|x_i)\,f(x_i) = \sum_{i=1}^{N} f(x_i)\,\delta(y_i, y), \quad \sum_{y} p(y|x_i) = 1$
Feature selection: features with small weights are eliminated.
Feature Selection vs. Regularizers
Regularizer → sparse solution → automatic feature selection. But an L2 regularizer rarely results in features with zero weights, so it is not appropriate for feature selection:
  $l_{reg}(D_{train}) = l(D_{train}) - s\sum_{y}\|\vec{w}_y\|_2^2 = \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} - s\sum_{y}\sum_{j} w_{y,j}^2$
For the purpose of feature selection, we usually use the L1 norm instead:
  $l_{reg}(D_{train}) = l(D_{train}) - s\sum_{y}\|\vec{w}_y\|_1 = \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} - s\sum_{y}\sum_{j} |w_{y,j}|$
Solving the L1 Regularized Conditional Exponential Model
Solving the L1 regularized conditional exponential model directly is rather difficult, because the absolute value is a non-differentiable function. Any suggestion to alleviate this problem?
  $l_{reg}(D_{train}) = \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} - s\sum_{y}\sum_{j} |w_{y,j}|$
Solving the L1 Regularized Conditional Exponential Model
Original problem:
  $\arg\max\; \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} - s\sum_{y}\sum_{j} |w_{y,j}|$
Equivalent problem with slack variables:
  $\arg\max\; \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} - s\sum_{y}\sum_{j} t_{y,j}$
  subject to $\forall y, j: \; t_{y,j} \geq 0 \;\text{ and }\; -t_{y,j} \leq w_{y,j} \leq t_{y,j}$
Slack variables