
Page 1: Project

Now it is time to think about the project. It is team work: each team will consist of 2 people. It is better to propose a project of your own; otherwise, I will assign you to some “difficult” project.

Important dates
03/11: project proposal due
04/01: project progress report due
04/22 and 04/24: final presentations
05/03: final report due

Page 2: Project Proposal

What do I expect?
Introduction: describe the research problem that you are trying to solve
Related work: describe the existing approaches and their deficiencies
Proposed approach: describe your approach and why it may have the potential to alleviate the deficiencies of the existing approaches
Plan: what do you plan to do in this project?

Format
It should look like a research paper. The required format (both Microsoft Word and LaTeX) can be downloaded from www.cse.msu.edu/~cse847/assignments/format.zip

Page 3: Project Progress Report

Introduction: overview of the problem that you are trying to solve and the solutions that you presented in the proposal
Progress
Algorithm description in more detail
Related data collection and cleanup
Preliminary results

The format should be the same as the project report.

Page 4: Project Final Report

It should read like a research paper that is ready for submission to a research conference.
What do I expect?
Introduction
Algorithm description and discussion
Empirical studies

I am expecting a careful analysis of results, no matter whether it is a successful approach or a complete failure.

Presentation
25-minute presentation, 5-minute discussion

Page 5: Exponential Model and Maximum Entropy Model

Rong Jin

Page 6: Recap: Logistic Regression Model

Assume the inputs and outputs are related through the log-linear function

$$p(y \mid x; w) = \frac{1}{1 + \exp\left(-y(x \cdot w + c)\right)}, \qquad w = \{w_1, w_2, \ldots, w_m, c\}$$

Estimate weights: MLE approach

$$w^* = \arg\max_{w} \; l_{reg}(D_{train}) = \arg\max_{w} \sum_{i=1}^{n} \log p(y_i \mid x_i; w) - s\|w\|_2^2$$

$$= \arg\max_{w} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left(-y_i(x_i \cdot w + c)\right)} - s\sum_{j=1}^{m} w_j^2$$
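As a quick numeric sketch of the objective above: with w = 0 and c = 0, every example gets probability 1/2, so the unregularized log-likelihood is n·log(1/2). All data and names below are illustrative, not from the slides.

```python
import numpy as np

# Sketch of the regularized log-likelihood l_reg(D_train) above.
def logistic_objective(w, c, X, y, s):
    """sum_i log p(y_i | x_i; w) - s * ||w||_2^2, with y_i in {-1, +1}."""
    margins = y * (X @ w + c)                  # y_i * (x_i . w + c)
    log_probs = -np.log1p(np.exp(-margins))    # log sigmoid(margin)
    return log_probs.sum() - s * np.dot(w, w)

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
val = logistic_objective(np.zeros(2), 0.0, X, y, s=0.1)
print(val)   # 3 * log(1/2): every example gets p = 1/2 at w = 0, c = 0
```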

Page 7: How to Extend the Logistic Regression Model to Multiple Classes?

y ∈ {+1, −1}  →  y ∈ {1, 2, …, C}?

$$p(y \mid x; w) = \frac{1}{1 + \exp\left(-y(x \cdot w + c)\right)}, \qquad w = \{w_1, w_2, \ldots, w_m, c\}$$

$$\log \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)} = x \cdot w + c$$

Page 8: Conditional Exponential Model

Introduce a different set of parameters for each class:

$$p(y \mid x; \theta) \propto \exp(c_y + x \cdot w_y), \qquad \theta = \{w_y, c_y\}$$

Ensure the probabilities sum to 1:

$$p(y \mid x; \theta) = \frac{1}{Z(x)} \exp(c_y + x \cdot w_y), \qquad Z(x) = \sum_{y'} \exp(c_{y'} + x \cdot w_{y'})$$

Page 9: Conditional Exponential Model

Prediction probability

$$p(y \mid x; \theta) = \frac{\exp(c_y + x \cdot w_y)}{\sum_{y'} \exp(c_{y'} + x \cdot w_{y'})}, \qquad y \in \{1, 2, \ldots, C\}$$

Model parameters: for each class y, we have weights w_y and threshold c_y.

Maximum likelihood estimation

$$l(D_{train}) = \sum_{i=1}^{N} \log p(y_i \mid x_i) = \sum_{i=1}^{N} \log \frac{\exp(c_{y_i} + x_i \cdot w_{y_i})}{\sum_{y'} \exp(c_{y'} + x_i \cdot w_{y'})}$$

Any problems?
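The prediction rule above is a softmax over per-class scores. A minimal numeric sketch (all names and numbers are illustrative, not from the slides):

```python
import numpy as np

def predict_proba(x, W, c):
    """p(y|x) = exp(c_y + x.w_y) / sum_y' exp(c_y' + x.w_y')."""
    scores = c + W @ x                 # one score per class
    scores -= scores.max()             # stabilize before exponentiating
    p = np.exp(scores)
    return p / p.sum()

# Two classes, three features (made-up values).
W = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5,  0.5]])
c = np.array([0.0, 0.1])
p = predict_proba(np.array([1.0, 2.0, 0.5]), W, c)
print(p, p.sum())   # a valid distribution: entries positive, summing to 1
```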

Page 10: Conditional Exponential Model

Adding the same constant vector w_0 to every weight vector (and the same constant c_0 to every threshold) leaves the log-likelihood function unchanged:

$$w_y \to w_y + w_0, \qquad c_y \to c_y + c_0$$

$$l(D_{train}) = \sum_{i=1}^{N} \log \frac{\exp(c_{y_i} + c_0 + x_i \cdot (w_{y_i} + w_0))}{\sum_{y'} \exp(c_{y'} + c_0 + x_i \cdot (w_{y'} + w_0))} = \sum_{i=1}^{N} \log \frac{\exp(c_{y_i} + x_i \cdot w_{y_i})}{\sum_{y'} \exp(c_{y'} + x_i \cdot w_{y'})}$$

The optimal solution is not unique! How do we resolve this problem?

Solution: set w_1 to be a zero vector and c_1 to be zero.
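The invariance is easy to check numerically (a sketch with made-up numbers):

```python
import numpy as np

# Check: adding the same shift (w0, c0) to every class leaves the
# conditional exponential probabilities unchanged. Values are arbitrary.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))            # one weight vector per class
c = rng.normal(size=3)                 # one threshold per class
x = rng.normal(size=4)
w0, c0 = rng.normal(size=4), 0.7       # common shift

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

p1 = softmax(c + W @ x)
p2 = softmax((c + c0) + (W + w0) @ x)  # shifted parameters
print(np.allclose(p1, p2))             # True
```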

Page 11: Modified Conditional Exponential Model

Prediction probability

$$p(y \mid x) = \begin{cases} \dfrac{\exp(c_y + x \cdot w_y)}{1 + \sum_{y'=2}^{C} \exp(c_{y'} + x \cdot w_{y'})} & y \in \{2, \ldots, C\} \\[2ex] \dfrac{1}{1 + \sum_{y'=2}^{C} \exp(c_{y'} + x \cdot w_{y'})} & y = 1 \end{cases}$$

Model parameters: for each class y > 1, we have weights w_y and threshold c_y.

Maximum likelihood estimation

$$l(D_{train}) = \sum_{i=1}^{N} \log p(y_i \mid x_i) = \sum_{\{i \mid y_i = 1\}} \log \frac{1}{1 + \sum_{y'=2}^{C} \exp(c_{y'} + x_i \cdot w_{y'})} + \sum_{\{i \mid y_i > 1\}} \log \frac{\exp(c_{y_i} + x_i \cdot w_{y_i})}{1 + \sum_{y'=2}^{C} \exp(c_{y'} + x_i \cdot w_{y'})}$$

Page 12: Maximum Entropy Model: Motivation

Consider a translation example: English ‘in’ → French {dans, en, à, au cours de, pendant}
Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant)
Case 1: no prior knowledge on translation

What is your guess of the probabilities?

Page 13: Maximum Entropy Model: Motivation

Consider a translation example: English ‘in’ → French {dans, en, à, au cours de, pendant}
Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant)
Case 1: no prior knowledge on translation

What is your guess of the probabilities? p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5

Case 2: 30% of the time, either dans or en is used

Page 14: Maximum Entropy Model: Motivation

Consider a translation example: English ‘in’ → French {dans, en, à, au cours de, pendant}
Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant)
Case 1: no prior knowledge on translation

What is your guess of the probabilities? p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5

Case 2: 30% of the time, either dans or en is used
What is your guess of the probabilities? p(dans) = p(en) = 3/20, p(à) = p(au-cours-de) = p(pendant) = 7/30

The uniform distribution is favored.

Page 15: Maximum Entropy Model: Motivation

Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used
What is your guess of the probabilities?

Page 16: Maximum Entropy Model: Motivation

Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used
What is your guess of the probabilities?

A good probability distribution should
satisfy the constraints
be close to the uniform distribution — but how do we measure that?

Measure uniformity using the Kullback-Leibler distance!

Page 17: Maximum Entropy Principle (MaxEnt)

The uniformity of a distribution is measured by its entropy:

$$P^* = \arg\max_{P} H(P)$$

where

$$H(P) = -\big[\, p(\text{dans})\log p(\text{dans}) + p(\text{en})\log p(\text{en}) + p(\text{à})\log p(\text{à}) + p(\text{au-cours-de})\log p(\text{au-cours-de}) + p(\text{pendant})\log p(\text{pendant}) \,\big]$$

subject to

$$p(\text{dans}) + p(\text{en}) = 3/10$$
$$p(\text{dans}) + p(\text{à}) = 1/2$$
$$p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au-cours-de}) + p(\text{pendant}) = 1$$

Solution (rounded): p(dans) ≈ 0.2, p(à) ≈ 0.3, p(en) ≈ 0.1, p(au-cours-de) ≈ 0.2, p(pendant) ≈ 0.2
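This small optimization can be checked numerically. The sketch below assumes SciPy is available; the variable ordering and all names are my own.

```python
import numpy as np
from scipy.optimize import minimize

# Numerically maximize entropy subject to the translation constraints.
# Probability order: dans, en, à, au-cours-de, pendant (illustrative).
def neg_entropy(p):
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},       # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},   # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},   # p(dans) + p(à)  = 1/2
]
res = minimize(neg_entropy, x0=np.full(5, 0.2), method="SLSQP",
               bounds=[(1e-9, 1.0)] * 5, constraints=constraints)
print(np.round(res.x, 3))   # close to the rounded solution on the slide
```

Numerically the maximizer comes out near (0.19, 0.11, 0.31, 0.19, 0.19), so the slide's values are this solution rounded to one decimal place.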

Page 18: MaxEnt for Classification Problems

We want p(y|x) to be close to a uniform distribution, so we maximize the conditional entropy on the training data:

$$H(y \mid x) = \frac{1}{N}\sum_{i=1}^{N} H(y \mid x_i) = -\frac{1}{N}\sum_{i=1}^{N} \sum_{y} p(y \mid x_i) \log p(y \mid x_i)$$

Constraints:
Valid probability distribution:

$$\forall i: \quad \sum_{y} p(y \mid x_i) = 1$$

From the training data, the model should be consistent with the data. For each class, the model mean of x equals the empirical mean of x:

$$\forall y \in \{1, 2, \ldots, C\}: \quad \frac{1}{N}\sum_{i=1}^{N} p(y \mid x_i)\, x_i = \frac{1}{N}\sum_{i=1}^{N} x_i\, \delta(y, y_i)$$

where δ(y, y_i) = 1 if y_i = y and 0 otherwise.


Page 20: MaxEnt for Classification Problems

Require the mean to be consistent between the empirical data and the model.
No assumption is made about the parametric form of the likelihood; we only assume it is C2 continuous.

$$\max_{p} H(y \mid x) = \max_{p} \left[-\frac{1}{N}\sum_{i=1}^{N} \sum_{y} p(y \mid x_i) \log p(y \mid x_i)\right]$$

subject to

$$\sum_{i=1}^{N} p(y \mid x_i)\, x_i = \sum_{i=1}^{N} x_i\, \delta(y, y_i), \qquad \sum_{y} p(y \mid x_i) = 1$$

Page 21: MaxEnt Model

Consistency with the data is ensured by the equality constraints: for each feature, the empirical mean equals the model mean.

$$\sum_{i=1}^{N} p(y \mid x_i)\, x_i = \sum_{i=1}^{N} x_i\, \delta(y, y_i)$$

Beyond the raw feature vector x, we can constrain arbitrary features f_k(x):

$$\sum_{i=1}^{N} p(y \mid x_i)\, f_k(x_i) = \sum_{i=1}^{N} f_k(x_i)\, \delta(y, y_i)$$

Page 22: Translation Problem

Parameters: p(dans), p(en), p(à), p(au-cours-de), p(pendant)
Represent each French word with two binary features:

Word              | f1: {dans, en} | f2: {dans, à}
dans              | 1              | 1
en                | 1              | 0
au-cours-de       | 0              | 0
à                 | 0              | 1
pendant           | 0              | 0
Empirical average | 0.3            | 0.5

Page 23: Constraints

$$\sum_{i=1}^{N} p(y \mid x_i)\, x_i = \sum_{i=1}^{N} x_i\, \delta(y, y_i)$$

$$p(\text{dans})(1,1) + p(\text{en})(1,0) + p(\text{au-cours-de})(0,0) + p(\text{à})(0,1) + p(\text{pendant})(0,0) = (0.3, 0.5)$$

which recovers

$$p(\text{dans}) + p(\text{en}) = 3/10, \qquad p(\text{dans}) + p(\text{à}) = 1/2$$
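The vector constraint is easy to verify numerically for a candidate distribution (a sketch; the row ordering follows the feature table and all names are mine):

```python
import numpy as np

# Rows follow the feature table: dans, en, au-cours-de, à, pendant.
F = np.array([[1, 1],    # dans
              [1, 0],    # en
              [0, 0],    # au-cours-de
              [0, 1],    # à
              [0, 0]],   # pendant
             dtype=float)
p = np.array([0.2, 0.1, 0.2, 0.3, 0.2])  # a distribution meeting both constraints
print(F.T @ p)   # expected feature values: [0.3 0.5]
```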

Page 24: Solution to MaxEnt

Surprisingly, the solution is just the conditional exponential model without thresholds:

$$p(y \mid x; w) = \frac{\exp(x \cdot w_y)}{\sum_{y'} \exp(x \cdot w_{y'})}$$

Why?

Page 25: Solution to MaxEnt

Solve the maximum entropy problem, i.e.,

$$\max_{p} H(y \mid x) = \max_{p} \left[-\sum_{i=1}^{N} \sum_{y} p(y \mid x_i) \log p(y \mid x_i)\right]$$

subject to

$$\sum_{i=1}^{N} p(y \mid x_i)\, x_i = \sum_{i=1}^{N} x_i\, \delta(y, y_i), \qquad \sum_{y} p(y \mid x_i) = 1$$

To maximize the entropy function under the constraints, we introduce a set of Lagrangian multipliers into the objective function, i.e.,

$$G = H(y \mid x) + \sum_{y} \lambda_y \cdot \left( \sum_{i=1}^{N} p(y \mid x_i)\, x_i - \sum_{i=1}^{N} x_i\, \delta(y, y_i) \right) + \sum_{i=1}^{N} \gamma_i \left( \sum_{y} p(y \mid x_i) - 1 \right)$$

Setting the first derivative of G with respect to p(y | x_i) to zero, we have the equations

$$\frac{\partial G}{\partial p(y \mid x_i)} = -\log p(y \mid x_i) - 1 + \lambda_y \cdot x_i + \gamma_i = 0 \;\;\Rightarrow\;\; p(y \mid x_i) \propto \exp(\lambda_y \cdot x_i)$$

Since $\sum_y p(y \mid x_i) = 1$, we obtain p(y | x_i) as

$$p(y \mid x_i) = \frac{\exp(\lambda_y \cdot x_i)}{\sum_{y'} \exp(\lambda_{y'} \cdot x_i)}$$

Page 26: Maximum Entropy Model versus Conditional Exponential Model

Maximum Entropy Model:

$$\max_{p} H(y \mid x) = \max_{p} \left[-\frac{1}{N}\sum_{i=1}^{N} \sum_{y} p(y \mid x_i) \log p(y \mid x_i)\right]$$

subject to

$$\sum_{i=1}^{N} p(y \mid x_i)\, x_i = \sum_{i=1}^{N} x_i\, \delta(y, y_i) \;\;\forall y, \qquad \sum_{y} p(y \mid x_i) = 1 \;\;\forall i$$

Conditional Exponential Model:

$$p(y \mid x; w) = \frac{\exp(x \cdot w_y)}{\sum_{y'} \exp(x \cdot w_{y'})}$$

$$\max_{\{w_y\}} l(D_{train}) = \max_{\{w_y\}} \sum_{i=1}^{N} \log \frac{\exp(x_i \cdot w_{y_i})}{\sum_{y'} \exp(x_i \cdot w_{y'})}$$

The two are related as primal and dual problems.

Page 27: Maximum Entropy Model vs. Conditional Exponential Model

However, where is the threshold term c?

Maximum Entropy:

$$p(y \mid x; w) = \frac{\exp(x \cdot w_y)}{\sum_{y'} \exp(x \cdot w_{y'})}$$

$$\max_{\{w_y\}} l(D_{train}) = \max_{\{w_y\}} \sum_{i=1}^{N} \log \frac{\exp(x_i \cdot w_{y_i})}{\sum_{y'} \exp(x_i \cdot w_{y'})}$$

Conditional Exponential:

$$p(y \mid x; \theta) = \frac{\exp(c_y + x \cdot w_y)}{\sum_{y'} \exp(c_{y'} + x \cdot w_{y'})}$$

$$\max_{\{c_y, w_y\}} l_{reg}(D_{train}) = \max_{\{c_y, w_y\}} \sum_{i=1}^{N} \log \frac{\exp(c_{y_i} + x_i \cdot w_{y_i})}{\sum_{y'} \exp(c_{y'} + x_i \cdot w_{y'})} - s(w)$$

Page 28: Solving the Maximum Entropy Model

Iterative scaling algorithm for

$$\max_{p} H(y \mid x) = \max_{p} \left[-\frac{1}{N}\sum_{i=1}^{N} \sum_{y} p(y \mid x_i) \log p(y \mid x_i)\right]$$

subject to

$$\sum_{i=1}^{N} p(y \mid x_i)\, x_i = \sum_{i=1}^{N} x_i\, \delta(y, y_i), \qquad \sum_{y} p(y \mid x_i) = 1$$

Assume

$$\forall i \in [1..N],\; \forall j \in [1..d]: \quad x_{i,j} \geq 0$$

$$\forall i \in [1..N]: \quad \sum_{j=1}^{d} x_{i,j} = g, \text{ where } g \text{ is a constant independent of } i$$

Page 29: Solving the Maximum Entropy Model

Compute the empirical mean for each feature of every class, i.e., for every j and every class y:

$$e_{y,j} = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}\, \delta(y_i, y)$$

Start with w_1 = w_2 = … = w_C = 0.
Repeat:
Compute p(y|x) for each training data point (x_i, y_i) using w from the previous iteration.
Compute the mean of each feature of every class using the estimated probabilities, i.e., for every j and every y:

$$m_{y,j} = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}\, p(y \mid x_i)$$

Compute, for every j and every y:

$$\Delta w_{j,y} = \log e_{y,j} - \log m_{y,j}$$

Update w as

$$w_{j,y} \leftarrow w_{j,y} + \Delta w_{j,y}$$

Page 30: Solving the Maximum Entropy Model

Compute the empirical mean for each feature of every class, i.e., for every j and every class y:

$$e_{y,j} = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}\, \delta(y_i, y)$$

Start with w_1 = w_2 = … = w_C = 0.
Repeat:
Compute p(y|x) for each training data point (x_i, y_i) using w from the previous iteration.
Compute the mean of each feature of every class using the estimated probabilities, i.e., for every j and every y:

$$m_{y,j} = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}\, p(y \mid x_i)$$

Compute, for every j and every y:

$$\Delta w_{j,y} = \frac{1}{g}\left(\log e_{y,j} - \log m_{y,j}\right)$$

Update w as

$$w_{j,y} \leftarrow w_{j,y} + \Delta w_{j,y}$$
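The loop above can be sketched end-to-end in a few lines of NumPy. Everything here (the synthetic data, the two classes, the extra "correction" feature that makes every row sum to the same constant g) is illustrative, not from the slides:

```python
import numpy as np

# Generalized iterative scaling (GIS) on synthetic data.
rng = np.random.default_rng(0)
X = rng.random((40, 3))                          # non-negative features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)        # labels in {0, 1}
g = X.sum(axis=1).max() + 1e-6
X = np.hstack([X, (g - X.sum(axis=1))[:, None]]) # rows now sum exactly to g

N, d = X.shape
C = 2
W = np.zeros((C, d))
# empirical mean of each feature for each class: e[y, j]
E = np.stack([X[y == k].sum(axis=0) for k in range(C)]) / N

for _ in range(200):
    S = X @ W.T                                   # class scores, shape (N, C)
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)             # p(y | x_i)
    M = P.T @ X / N                               # model mean m[y, j]
    W += np.log((E + 1e-12) / (M + 1e-12)) / g    # GIS update, guarded logs

loglik = np.log(P[np.arange(N), y]).sum()
print(loglik > N * np.log(0.5))                   # improved over W = 0
```

GIS is simple but slow to converge; in practice gradient-based optimizers are usually preferred for this objective.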

Page 31: Solving the Maximum Entropy Model

The likelihood function always increases!

Page 32: Solving the Maximum Entropy Model

What if a feature can take both positive and negative values?

What if the sum of the features is not a constant?

How do we apply this approach to the conditional exponential model with a bias term (or threshold term)?

Page 33: Improved Iterative Scaling

It only requires all the input features to be non-negative.
Compute the empirical mean for each feature of every class, i.e., for every j and every class y:

$$e_{y,j} = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}\, \delta(y_i, y)$$

Start with w_1 = w_2 = … = w_C = 0.
Repeat:
Compute p(y|x) for each training data point (x_i, y_i) using w from the previous iteration.
Solve, for every j and every y:

$$\frac{1}{N}\sum_{i=1}^{N} x_{i,j}\, p(y \mid x_i) \exp\left(\Delta w_{j,y} \sum_{j'} x_{i,j'}\right) = e_{y,j}$$

Update w as

$$w_{j,y} \leftarrow w_{j,y} + \Delta w_{j,y}$$
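With non-constant row sums, the update equation above has no closed form, but its left-hand side is monotonically increasing in Δw, so a one-dimensional bisection solves it. A sketch with made-up numbers:

```python
import numpy as np

# Inner step of improved iterative scaling for one (j, y) pair:
# solve sum_i a_i * exp(delta * s_i) = target, where a_i >= 0 collects
# x_ij * p(y|x_i) and s_i is the row sum of features. All values illustrative.
def iis_delta(a, s, target, lo=-10.0, hi=10.0, iters=100):
    """Bisection on the monotone function f(d) = sum_i a_i * exp(d * s_i)."""
    f = lambda d: np.sum(a * np.exp(d * s)) - target
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return (lo + hi) / 2

a = np.array([0.3, 0.5, 0.2])
s = np.array([2.0, 2.0, 2.0])   # constant row sums: IIS reduces to GIS
d = iis_delta(a, s, target=1.5)
print(round(d, 4))              # closed form here: log(1.5) / 2
```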

Page 34: Choice of Features

A feature does not have to be one of the raw inputs. For the maximum entropy model, bounded features are more favorable; very often, people use binary features.

Feature selection: features with small weights are eliminated.

$$\max_{p} H(y \mid x) = \max_{p} \left[-\frac{1}{N}\sum_{i=1}^{N} \sum_{y} p(y \mid x_i) \log p(y \mid x_i)\right]$$

subject to

$$\forall y: \; \sum_{i=1}^{N} p(y \mid x_i)\, f(x_i) = \sum_{i=1}^{N} f(x_i)\, \delta(y, y_i), \qquad \sum_{y} p(y \mid x_i) = 1$$

Page 35: Feature Selection vs. Regularizers

A regularizer that yields a sparse solution performs automatic feature selection. But the L2 regularizer rarely results in features with exactly zero weights, so it is not appropriate for feature selection.

$$l_{reg}(D_{train}) = l(D_{train}) - s(w), \qquad s(w) = s\sum_{y} \|w_y\|_2^2 = s\sum_{y}\sum_{j} w_{y,j}^2$$

$$l(D_{train}) = \sum_{i=1}^{N} \log \frac{\exp(c_{y_i} + x_i \cdot w_{y_i})}{\sum_{y'} \exp(c_{y'} + x_i \cdot w_{y'})}$$

Page 36: Feature Selection vs. Regularizers

A regularizer that yields a sparse solution performs automatic feature selection. But the L2 regularizer rarely results in features with exactly zero weights, so it is not appropriate for feature selection. For the purpose of feature selection, the L1 norm is usually used.

L2: $$s(w) = s\sum_{y} \|w_y\|_2^2 = s\sum_{y}\sum_{j} w_{y,j}^2$$

L1: $$s(w) = s\sum_{y} \|w_y\|_1 = s\sum_{y}\sum_{j} |w_{y,j}|$$

with

$$l_{reg}(D_{train}) = \sum_{i=1}^{N} \log \frac{\exp(c_{y_i} + x_i \cdot w_{y_i})}{\sum_{y'} \exp(c_{y'} + x_i \cdot w_{y'})} - s(w)$$

Page 37: Solving the L1-Regularized Conditional Exponential Model

Solving the L1-regularized conditional exponential model directly is rather difficult, because the absolute value is not differentiable at zero. Any suggestion to alleviate this problem?

$$l_{reg}(D_{train}) = \sum_{i=1}^{N} \log \frac{\exp(c_{y_i} + x_i \cdot w_{y_i})}{\sum_{y'} \exp(c_{y'} + x_i \cdot w_{y'})} - s\sum_{y}\sum_{j} |w_{y,j}|$$

Page 38: Solving the L1-Regularized Conditional Exponential Model

$$\arg\max_{\{c_y, w_y\}} \sum_{i=1}^{N} \log \frac{\exp(c_{y_i} + x_i \cdot w_{y_i})}{\sum_{y'} \exp(c_{y'} + x_i \cdot w_{y'})} - s\sum_{y}\sum_{j} |w_{y,j}|$$

Introduce slack variables t_{y,j}:

$$\arg\max_{\{c_y, w_y, t_{y,j}\}} \sum_{i=1}^{N} \log \frac{\exp(c_{y_i} + x_i \cdot w_{y_i})}{\sum_{y'} \exp(c_{y'} + x_i \cdot w_{y'})} - s\sum_{y}\sum_{j} t_{y,j}$$

$$\text{subject to } \forall y, j: \quad t_{y,j} \geq 0 \;\text{ and }\; -t_{y,j} \leq w_{y,j} \leq t_{y,j}$$
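On a one-dimensional toy objective the reformulation is easy to check: minimizing (w − 1)² + |w| has the soft-threshold solution w* = 0.5, and the smooth slack-variable version recovers it with a standard constrained solver. (A sketch assuming SciPy; all names are mine.)

```python
from scipy.optimize import minimize

# Slack-variable trick: replace |w| with t under -t <= w <= t, t >= 0.
obj = lambda z: (z[0] - 1.0) ** 2 + z[1]                 # z = (w, t)
cons = [{"type": "ineq", "fun": lambda z: z[1] - z[0]},  # t - w >= 0
        {"type": "ineq", "fun": lambda z: z[1] + z[0]}]  # t + w >= 0
res = minimize(obj, x0=[0.0, 1.0], constraints=cons, method="SLSQP")
print(round(res.x[0], 3))   # soft-threshold optimum w* = 0.5, with t* = |w*|
```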