Project
Now it is time to think about the project. It is team work: each team will consist of 2 people. It is better to propose a project of your own; otherwise, I will assign you to some "difficult" project.
Important dates:
03/11: project proposal due
04/01: project progress report due
04/22 and 04/24: final presentations
05/03: final report due
Project Proposal
What do I expect?
Introduction: describe the research problem that you are trying to solve
Related work: describe the existing approaches and their deficiencies
Proposed approach: describe your approach and why it has the potential to alleviate the deficiencies of the existing approaches
Plan: what do you plan to do in this project?
Format: it should look like a research paper. The required format (both Microsoft Word and LaTeX) can be downloaded from www.cse.msu.edu/~cse847/assignments/format.zip
Project Progress Report
Introduction: overview of the problem that you are trying to solve and the solutions that you presented in the proposal
Progress:
Algorithm description in more detail
Related data collection and cleanup
Preliminary results
The format should be the same as for the project report
Project Final Report
It should look like a research paper that is ready for submission to a research conference.
What do I expect?
Introduction
Algorithm description and discussion
Empirical studies: I am expecting a careful analysis of the results, no matter whether the approach is a success or a complete failure
Presentation: 25-minute presentation, 5-minute discussion
Exponential Model and Maximum Entropy Model
Rong Jin
Recap: Logistic Regression Model
Assume the inputs and outputs are related through the log-linear function
  $p(y|x;\theta) = \frac{1}{1+\exp\left(-y\,(\vec{x}\cdot\vec{w}+c)\right)}$, with $\theta = \{w_1, w_2, \ldots, w_m, c\}$
Estimate the weights by the (regularized) MLE approach:
  $\max_{w}\; l_{reg}(D_{train}) = \max_{w}\; \sum_{i=1}^{n} \log p(y_i|x_i) - s\,\|\vec{w}\|_2^2 = \max_{w}\; \sum_{i=1}^{n} \log \frac{1}{1+\exp\left(-y_i\,(\vec{x}_i\cdot\vec{w}+c)\right)} - s \sum_{j=1}^{m} w_j^2$
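As a concrete illustration, the regularized MLE above can be maximized by plain gradient ascent. The sketch below is my own minimal Python implementation (the function names and toy data are illustrative, not from the lecture):

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, Y, s=0.01, lr=0.1, iters=500):
    """Gradient ascent on sum_i log sigmoid(y_i (x_i.w + c)) - s * ||w||^2."""
    m = len(X[0])
    w, c = [0.0] * m, 0.0
    for _ in range(iters):
        gw, gc = [0.0] * m, 0.0
        for x, y in zip(X, Y):
            # d/dw log sigmoid(y (x.w + c)) = (1 - sigmoid(y (x.w + c))) * y * x
            r = (1.0 - sigmoid(y * (sum(a * b for a, b in zip(x, w)) + c))) * y
            gw = [g + r * xj for g, xj in zip(gw, x)]
            gc += r
        w = [wj + lr * (gj - 2.0 * s * wj) for wj, gj in zip(w, gw)]
        c += lr * gc
    return w, c

# Linearly separable toy data with labels y in {+1, -1}
X = [[1.0, 2.0], [2.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]]
Y = [1, 1, -1, -1]
w, c = fit_logistic(X, Y)
# After training, p(y=+1|x) should exceed 1/2 for the positive examples
```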
How to Extend Logistic Regression Model to Multiple Classes?
  $y \in \{+1, -1\} \;\rightarrow\; y \in \{1, 2, \ldots, C\}$?
Binary logistic regression can be written as
  $p(y|x;\theta) = \frac{1}{1+\exp\left(-y\,(\vec{x}\cdot\vec{w}+c)\right)}$, $\theta = \{w_1, w_2, \ldots, w_m, c\}$
or, equivalently, in terms of the log odds,
  $\log\frac{p(y=+1|x)}{p(y=-1|x)} = \vec{x}\cdot\vec{w} + c$
Conditional Exponential Model
Introduce a different set of parameters for each class:
  $p(y|x;\theta) \propto \exp(c_y + \vec{x}\cdot\vec{w}_y)$, $\theta = \{w_y, c_y\}_{y=1}^{C}$
Ensure the probabilities sum to 1:
  $p(y|x;\theta) = \frac{1}{Z(x)}\exp(c_y + \vec{x}\cdot\vec{w}_y)$, where $Z(x) = \sum_{y'}\exp(c_{y'} + \vec{x}\cdot\vec{w}_{y'})$
Conditional Exponential Model
Prediction probability:
  $p(y|x;\theta) = \frac{\exp(c_y + \vec{x}\cdot\vec{w}_y)}{\sum_{y'=1}^{C}\exp(c_{y'} + \vec{x}\cdot\vec{w}_{y'})}$, $y \in \{1, 2, \ldots, C\}$
Model parameters: for each class $y$, we have weights $w_y$ and threshold $c_y$.
Maximum likelihood estimation:
  $l(D_{train}) = \sum_{i=1}^{N}\log p(y_i|x_i) = \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})}$
Any problems?
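The prediction rule above is a softmax over per-class scores. A minimal sketch (the parameter values are made up for illustration):

```python
import math

def softmax_probs(x, W, c):
    """Conditional exponential model: p(y|x) = exp(c_y + x.w_y) / Z(x)."""
    scores = [cy + sum(a * b for a, b in zip(x, wy)) for wy, cy in zip(W, c)]
    top = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Three classes, two features (all numbers are illustrative)
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
c = [0.0, 0.0, 0.5]
p = softmax_probs([2.0, 0.5], W, c)
# p is a valid distribution over the three classes
```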
Conditional Exponential Model
If we add the same constant vector $w_0$ to every weight vector (and the same constant $c_0$ to every threshold), we obtain the same log-likelihood function:
  $w_y \rightarrow w_y + w_0, \quad c_y \rightarrow c_y + c_0$
  $l(D_{train}) = \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + c_0 + \vec{x}_i\cdot(\vec{w}_{y_i} + \vec{w}_0))}{\sum_{y'}\exp(c_{y'} + c_0 + \vec{x}_i\cdot(\vec{w}_{y'} + \vec{w}_0))} = \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})}$
The optimum solution is not unique! How do we resolve this problem?
Solution: set $w_1$ to be a zero vector and $c_1$ to be zero.
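The invariance is easy to verify numerically. The sketch below shifts all parameters by the same $(w_0, c_0)$ and checks that the predicted probabilities are unchanged (all numbers are made up):

```python
import math

def softmax_probs(x, W, c):
    scores = [cy + sum(a * b for a, b in zip(x, wy)) for wy, cy in zip(W, c)]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

W = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
c = [0.1, -0.4, 0.0]
x = [1.0, 2.0]

# Shift every weight vector by the same w0 and every threshold by the same c0
w0, c0 = [3.0, -2.0], 5.0
W2 = [[wj + w0j for wj, w0j in zip(wy, w0)] for wy in W]
c2 = [cy + c0 for cy in c]

p1, p2 = softmax_probs(x, W, c), softmax_probs(x, W2, c2)
# p1 and p2 are identical: every score shifted by the same amount cancels in Z(x)
```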
Modified Conditional Exponential Model
Prediction probability:
  For $y \in \{2, \ldots, C\}$: $p(y|x;\theta) = \frac{\exp(c_y + \vec{x}\cdot\vec{w}_y)}{1 + \sum_{y'=2}^{C}\exp(c_{y'} + \vec{x}\cdot\vec{w}_{y'})}$
  For $y = 1$: $p(1|x;\theta) = \frac{1}{1 + \sum_{y'=2}^{C}\exp(c_{y'} + \vec{x}\cdot\vec{w}_{y'})}$
Model parameters: for each class $y > 1$, we have weights $w_y$ and threshold $c_y$.
Maximum likelihood estimation:
  $l(D_{train}) = \sum_{i=1}^{N}\log p(y_i|x_i) = \sum_{\{i\,|\,y_i > 1\}}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{1 + \sum_{y'=2}^{C}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} + \sum_{\{i\,|\,y_i = 1\}}\log\frac{1}{1 + \sum_{y'=2}^{C}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})}$
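A minimal sketch of the modified model, where class 1 is the reference class with $w_1 = 0$ and $c_1 = 0$ so only $C-1$ parameter sets are stored (function names and numbers are my own):

```python
import math

def modified_probs(x, W, c):
    """W[k], c[k] hold the parameters for class k+2 (i.e., classes 2..C)."""
    # Class 1 contributes exp(0) = 1 to the normalizer
    scores = [0.0] + [ck + sum(a * b for a, b in zip(x, wk))
                      for wk, ck in zip(W, c)]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# Three classes: parameters stored only for classes 2 and 3 (made-up numbers)
W = [[1.0, -0.5], [0.2, 0.3]]
c = [0.1, -0.2]
p = modified_probs([1.0, 1.0], W, c)
# p is still a distribution over all three classes
```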
Maximum Entropy Model: Motivation
Consider a translation example: English 'in' → French {dans, en, à, au-cours-de, pendant}
Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant)
Case 1: no prior knowledge on translation. What is your guess of the probabilities?
  p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5
Case 2: 30% of the time either dans or en is used. What is your guess of the probabilities?
  p(dans) = p(en) = 3/20, p(à) = p(au-cours-de) = p(pendant) = 7/30
The uniform distribution is favored.
Maximum Entropy Model: Motivation
Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used. What is your guess of the probabilities?
A good probability distribution should:
  Satisfy the constraints
  Be close to the uniform distribution, but how do we measure closeness?
Measure uniformity using the Kullback-Leibler distance!
Maximum Entropy Principle (MaxEnt)
The uniformity of a distribution is measured by its entropy.
Solution: p(dans) = 0.2, p(à) = 0.3, p(en) = 0.1, p(au-cours-de) = 0.2, p(pendant) = 0.2
  P* = argmax_P H(P)
  where H(P) = -p(dans) log p(dans) - p(en) log p(en) - p(à) log p(à) - p(au-cours-de) log p(au-cours-de) - p(pendant) log p(pendant)
  subject to
  p(dans) + p(en) = 3/10
  p(dans) + p(à) = 1/2
  p(dans) + p(en) + p(à) + p(au-cours-de) + p(pendant) = 1
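A quick numerical check of the case-3 solution. After eliminating the equality constraints with p(dans) = d as the free parameter, p(en) = 0.3 - d, p(à) = 0.5 - d, and the remaining mass is split evenly between au-cours-de and pendant (entropy favors the even split). The helper names below are my own:

```python
import math

def entropy(ps):
    return -sum(p * math.log(p) for p in ps)

def dist(d):
    """Feasible distribution [dans, en, a, au-cours-de, pendant] for p(dans)=d."""
    rest = (1.0 - 0.3 - (0.5 - d)) / 2.0   # mass left for the last two words
    return [d, 0.3 - d, 0.5 - d, rest, rest]

stated = dist(0.2)   # the slide's solution: [0.2, 0.1, 0.3, 0.2, 0.2]
# It satisfies all three constraints and has higher entropy than other
# feasible distributions such as dist(0.05)
```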
MaxEnt for Classification Problems
We want $p(y|x)$ to be close to a uniform distribution: maximize the conditional entropy over the training data
  $H(y|x) = \frac{1}{N}\sum_{i=1}^{N} H(y|x_i) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y|x_i)\log p(y|x_i)$
Constraints:
  Valid probability distribution: $\forall i: \sum_{y} p(y|x_i) = 1$
  From the training data, the model should be consistent with the data. For each class, the model mean of $x$ equals the empirical mean of $x$:
  $\forall y \in \{1, 2, \ldots, C\}: \; \frac{1}{N}\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i = \frac{1}{N}\sum_{i=1}^{N} \vec{x}_i\,\delta(y_i, y)$
MaxEnt for Classification Problems
We require the means to be consistent between the empirical data and the model. There is no assumption about the parametric form of the likelihood; we only assume it is $C^2$ continuous.
  $\max_{p}\; H(y|x) = \max_{p}\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y|x_i)\log p(y|x_i)$
  subject to
  $\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i = \sum_{i=1}^{N} \vec{x}_i\,\delta(y_i, y), \quad \sum_{y} p(y|x_i) = 1$
MaxEnt Model
Consistency with the data is ensured by the equality constraints: for each feature, the empirical mean equals the model mean
  $\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i = \sum_{i=1}^{N} \vec{x}_i\,\delta(y_i, y)$
Beyond the raw feature vector $x$, the same constraints can be imposed on arbitrary features $f_k(x)$:
  $\sum_{i=1}^{N} p(y|x_i)\,f_k(x_i) = \sum_{i=1}^{N} f_k(x_i)\,\delta(y_i, y)$
Translation Problem
Parameters: p(dans), p(en), p(au-cours-de), p(à), p(pendant)
Represent each French word with two binary features:

  Word               {dans, en}   {dans, à}
  dans                   1            1
  en                     1            0
  au-cours-de            0            0
  à                      0            1
  pendant                0            0
  Empirical average      0.3          0.5

Constraints (empirical mean = model mean):
  p(dans)·(1,1) + p(en)·(1,0) + p(au-cours-de)·(0,0) + p(à)·(0,1) + p(pendant)·(0,0) = (0.3, 0.5)
i.e.,
  p(dans) + p(en) = 3/10
  p(dans) + p(à) = 1/2
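The empirical averages in the table can be recomputed from word counts. The counts below are hypothetical, chosen only so that the averages match the slide's 0.3 and 0.5:

```python
features = {                 # word -> (f_{dans,en}, f_{dans,a})
    "dans":        (1, 1),
    "en":          (1, 0),
    "au-cours-de": (0, 0),
    "a":           (0, 1),
    "pendant":     (0, 0),
}
# Hypothetical corpus counts (made up to match the empirical averages)
counts = {"dans": 20, "en": 10, "a": 30, "au-cours-de": 20, "pendant": 20}
N = sum(counts.values())
avg = [sum(counts[w] * features[w][k] for w in features) / N for k in range(2)]
# avg reproduces the table's empirical averages (0.3, 0.5)
```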
Solution to MaxEnt
Surprisingly, the solution is just the conditional exponential model without the threshold terms:
  $p(y|x;\theta) = \frac{\exp(\vec{x}\cdot\vec{w}_y)}{\sum_{y'}\exp(\vec{x}\cdot\vec{w}_{y'})}$
Why?
Solution to MaxEnt
Solve the maximum entropy problem, i.e.,
  $\max_{p}\; H(y|x) = \max_{p}\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y|x_i)\log p(y|x_i)$
  subject to $\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i = \sum_{i=1}^{N} \vec{x}_i\,\delta(y_i, y), \quad \sum_{y} p(y|x_i) = 1$
To maximize the entropy function under the constraints, we introduce a set of Lagrangian multipliers into the objective function, i.e.,
  $G = H(y|x) + \sum_{y}\vec{\lambda}_y\cdot\left(\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i - \sum_{i=1}^{N}\vec{x}_i\,\delta(y_i,y)\right) + \sum_{i=1}^{N}\mu_i\left(\sum_{y} p(y|x_i) - 1\right)$
Setting the first derivative of $G$ with respect to $p(y|x_i)$ to zero, we obtain the equations
  $\frac{\partial G}{\partial p(y|x_i)} = 0 \;\Rightarrow\; \log p(y|x_i) = \vec{\lambda}_y\cdot\vec{x}_i + \text{const} \;\Rightarrow\; p(y|x_i) \propto \exp(\vec{\lambda}_y\cdot\vec{x}_i)$
Since $\sum_{y} p(y|x_i) = 1$, we have
  $p(y|x_i) = \frac{\exp(\vec{\lambda}_y\cdot\vec{x}_i)}{\sum_{y'}\exp(\vec{\lambda}_{y'}\cdot\vec{x}_i)}$
Maximum Entropy Model versus Conditional Exponential Model
Maximum Entropy Model:
  $\max_{p}\; H(y|x) = \max_{p}\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y|x_i)\log p(y|x_i)$
  subject to $\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i = \sum_{i=1}^{N}\vec{x}_i\,\delta(y_i,y)$ for all $y$, and $\sum_{y} p(y|x_i) = 1$ for all $i$
Conditional Exponential Model:
  $p(y|x;\{w_y\}) = \frac{\exp(\vec{x}\cdot\vec{w}_y)}{\sum_{y'}\exp(\vec{x}\cdot\vec{w}_{y'})}$
  $\max_{\{w_y\}}\; l(D_{train}) = \max_{\{w_y\}}\; \sum_{i=1}^{N}\log\frac{\exp(\vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(\vec{x}_i\cdot\vec{w}_{y'})}$
The two are connected as a primal-dual problem pair.
Maximum Entropy Model vs. Conditional Exponential Model
However, where is the threshold term $c_y$?
Conditional Exponential Model:
  $p(y|x;\{c_y, w_y\}) = \frac{\exp(c_y + \vec{x}\cdot\vec{w}_y)}{\sum_{y'}\exp(c_{y'} + \vec{x}\cdot\vec{w}_{y'})}$
  $\max_{\{c_y, w_y\}}\; l_{reg}(D_{train}) = \max_{\{c_y, w_y\}}\; \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} - s\sum_{y}\|\vec{w}_y\|^2$
Maximum Entropy Model:
  $p(y|x;\{w_y\}) = \frac{\exp(\vec{x}\cdot\vec{w}_y)}{\sum_{y'}\exp(\vec{x}\cdot\vec{w}_{y'})}$
  $\max_{\{w_y\}}\; l(D_{train}) = \max_{\{w_y\}}\; \sum_{i=1}^{N}\log\frac{\exp(\vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(\vec{x}_i\cdot\vec{w}_{y'})}$
Solving Maximum Entropy Model
Iterative scaling algorithm. Recall the problem:
  $\max_{p}\; H(y|x) = \max_{p}\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y|x_i)\log p(y|x_i)$
  subject to $\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i = \sum_{i=1}^{N}\vec{x}_i\,\delta(y_i,y), \quad \sum_{y} p(y|x_i) = 1$
Assume:
  $\forall i \in [1 \ldots n], \forall j \in [1 \ldots d]: \; x_{i,j} \geq 0$
  $\forall i \in [1 \ldots n]: \; \sum_{j=1}^{d} x_{i,j} = g$, where $g$ is a constant independent of $i$
Solving Maximum Entropy Model
Compute the empirical mean for each feature of every class, i.e., for every $j$ and every class $y$:
  $e_{y,j} = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}\,\delta(y_i, y)$
Start with $w_1 = w_2 = \cdots = w_C = 0$. Repeat:
  Compute $p(y|x_i)$ for each training data point $(x_i, y_i)$ using the $w$ from the previous iteration
  Compute the model mean of each feature of every class using the estimated probabilities, i.e., for every $j$ and every $y$:
    $m_{y,j} = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}\,p(y|x_i)$
  Compute, for every $j$ and every $y$:
    $\Delta w_{j,y} = \frac{1}{g}\left(\log e_{y,j} - \log m_{y,j}\right)$
  Update the weights: $w_{j,y} \leftarrow w_{j,y} + \Delta w_{j,y}$
Solving Maximum Entropy Model
The likelihood function always increases!
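The iterative scaling loop above can be sketched in code. This is a minimal Python implementation under the stated assumptions (non-negative features whose per-example sum is the constant $g$); the toy data and helper names are illustrative only:

```python
import math

# Toy training set: each row of X is a non-negative feature vector whose
# entries sum to the same constant g, as the algorithm requires.
X = [[2.0, 0.5, 0.5], [1.5, 1.0, 0.5], [0.5, 2.0, 0.5], [0.5, 1.5, 1.0]]
Y = [0, 0, 1, 1]
C, d, N = 2, 3, len(X)
g = sum(X[0])  # every row sums to g = 3

def probs(w, x):
    """p(y|x) = exp(x . w_y) / sum_y' exp(x . w_y') -- no threshold term."""
    scores = [sum(xj * wj for xj, wj in zip(x, w[y])) for y in range(C)]
    top = max(scores)                        # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def log_likelihood(w):
    return sum(math.log(probs(w, x)[y]) for x, y in zip(X, Y))

# Empirical mean of every feature for every class: e_{y,j}
e = [[sum(X[i][j] for i in range(N) if Y[i] == y) / N for j in range(d)]
     for y in range(C)]

w = [[0.0] * d for _ in range(C)]            # start from all-zero weights
lls = [log_likelihood(w)]
for _ in range(50):
    p = [probs(w, x) for x in X]
    # Model mean of every feature for every class: m_{y,j}
    m = [[sum(X[i][j] * p[i][y] for i in range(N)) / N for j in range(d)]
         for y in range(C)]
    # Iterative scaling update: w_{j,y} += (log e_{y,j} - log m_{y,j}) / g
    for y in range(C):
        for j in range(d):
            w[y][j] += (math.log(e[y][j]) - math.log(m[y][j])) / g
    lls.append(log_likelihood(w))
# lls is non-decreasing, matching the claim on the slide
```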
Solving Maximum Entropy Model
What if each feature can take both positive and negative values?
What if the sum of the features is not a constant?
How do we apply this approach to the conditional exponential model with a bias term (or threshold term)?
Improved Iterative Scaling
It only requires all the input features to be positive.
Compute the empirical mean for each feature of every class, i.e., for every $j$ and every class $y$:
  $e_{y,j} = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}\,\delta(y_i, y)$
Start with $w_1 = w_2 = \cdots = w_C = 0$. Repeat:
  Compute $p(y|x_i)$ for each training data point $(x_i, y_i)$ using the $w$ from the previous iteration
  Solve for $\Delta w_{j,y}$, for every $j$ and every $y$:
    $\frac{1}{N}\sum_{i=1}^{N} x_{i,j}\,p(y|x_i)\exp\!\left(\Delta w_{j,y}\sum_{j'} x_{i,j'}\right) = e_{y,j}$
  Update the weights: $w_{j,y} \leftarrow w_{j,y} + \Delta w_{j,y}$
Choice of Features
A feature does not have to be one of the raw inputs. For the maximum entropy model, bounded features are more favorable; very often, people use binary features:
  $\max_{p}\; H(y|x) = \max_{p}\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y|x_i)\log p(y|x_i)$
  subject to: $\forall y: \sum_{i=1}^{N} p(y|x_i)\,f(x_i) = \sum_{i=1}^{N} f(x_i)\,\delta(y_i, y), \quad \sum_{y} p(y|x_i) = 1$
Feature selection: features with small weights are eliminated.
Feature Selection vs. Regularizers
Regularizer → sparse solution → automatic feature selection. But an L2 regularizer rarely results in features with zero weights, so it is not appropriate for feature selection:
  $l_{reg}(D_{train}) = l(D_{train}) - s\sum_{y}\|\vec{w}_y\|_2^2 = \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} - s\sum_{y}\sum_{j} w_{y,j}^2$
For the purpose of feature selection, we usually use the L1 norm instead:
  $l_{reg}(D_{train}) = l(D_{train}) - s\sum_{y}\|\vec{w}_y\|_1 = \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} - s\sum_{y}\sum_{j} |w_{y,j}|$
Solving the L1 Regularized Conditional Exponential Model
Solving the L1 regularized conditional exponential model directly is rather difficult, because the absolute value is a non-differentiable function. Any suggestion to alleviate this problem?
  $l_{reg}(D_{train}) = \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} - s\sum_{y}\sum_{j} |w_{y,j}|$
Solving the L1 Regularized Conditional Exponential Model
Original problem:
  $\arg\max\; \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} - s\sum_{y}\sum_{j} |w_{y,j}|$
Equivalent problem with slack variables:
  $\arg\max\; \sum_{i=1}^{N}\log\frac{\exp(c_{y_i} + \vec{x}_i\cdot\vec{w}_{y_i})}{\sum_{y'}\exp(c_{y'} + \vec{x}_i\cdot\vec{w}_{y'})} - s\sum_{y}\sum_{j} t_{y,j}$
  subject to $\forall y, j: \; t_{y,j} \geq 0 \;\text{ and }\; -t_{y,j} \leq w_{y,j} \leq t_{y,j}$
Slack variables