
Page 1

Building Maximum Entropy Text Classifier Using Semi-supervised Learning

Zhang, Xinhua
School of Computing, NUS
For PhD Qualifying Exam Term Paper
2005-7-4

Page 2

Road map

Introduction: background and application

Semi-supervised learning, especially for text classification (survey)

Maximum Entropy Models (survey)

Combining semi-supervised learning and maximum entropy models (new)

Summary

Page 3

Road map

Introduction: background and application

Semi-supervised learning, esp. for text classification (survey)

Maximum Entropy Models (survey)

Combining semi-supervised learning and maximum entropy models (new)

Summary

Page 4

Introduction: Applications of text classification

Text classification is useful and widely applied:
– cataloging news articles (Lewis & Gale, 1994; Joachims, 1998b);

– classifying web pages into a symbolic ontology (Craven et al., 2000);

– finding a person’s homepage (Shavlik & Eliassi-Rad, 1998);

– automatically learning the reading interests of users (Lang, 1995; Pazzani et al., 1996);

– automatically threading and filtering email by content (Lewis & Knowles, 1997; Sahami et al., 1998);

– book recommendation (Mooney & Roy, 2000).

Page 5

Early ways of text classification

Early days: manual construction of rule sets (e.g., if 'advertisement' appears, then filter).

Hand-coding text classifiers in a rule-based style is impractical; moreover, inducing and formulating the rules from examples is time- and labor-consuming.

Page 6

Supervised learning for text classification

Using supervised learning
– Requires a large, sometimes prohibitive, number of labeled examples; obtaining them is time- and labor-consuming.
– E.g. (Lang, 1995): after a person read and hand-labeled about 1000 articles, a learned classifier achieved an accuracy of about 50% when making predictions for only the top 10% of documents about which it was most confident.

Page 7

What about using unlabeled data?

Unlabeled data are abundant and easily available, and may be useful for improving classification.
– Published work shows that they can help.

Why do unlabeled data help?
– Co-occurrence might explain something.
– Search on Google:
• 'Sugar and sauce' returns 1,390,000 results
• 'Sugar and math' returns 191,000 results
though 'math' is a more popular word than 'sauce'.

Page 8

Using co-occurrence and pitfalls

Simple idea: if A often co-occurs with B (a fact that can be found using unlabeled data) and we know that articles containing A are often interesting, then articles containing B are probably also interesting.

Problem:
– Most current models using unlabeled data are based on problem-specific assumptions, which causes instability across tasks.

Page 9

Road map

Introduction: background and application

Semi-supervised learning, especially for text classification (survey)

Maximum Entropy Models (survey)

Combining semi-supervised learning and maximum entropy models (new)

Summary

Page 10

Generative and discriminative semi-supervised learning models

Generative semi-supervised learning (Nigam, 2001)
– Expectation-maximization algorithm, which can fill in the missing values (labels) by maximum likelihood.

Discriminative semi-supervised learning (Vapnik, 1998)
– Transductive Support Vector Machine (TSVM)
• finds the linear separator between the labeled examples of each class that maximizes the margin over both the labeled and unlabeled examples.
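To make the generative idea concrete, here is a hedged toy sketch (not from the paper): EM on a two-component 1-D Gaussian mixture with unit variances, where the unlabeled points' missing labels are filled in as posterior responsibilities.

```python
import numpy as np

rng = np.random.default_rng(0)
xl = np.array([-2.1, -1.9, 2.0, 2.2]); yl = np.array([0, 0, 1, 1])  # labeled
xu = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])  # unlabeled

mu = np.array([-0.5, 0.5])          # crude initial class means (unit variance)
for _ in range(20):
    # E-step: soft labels (responsibilities) for the unlabeled points
    logp = -0.5 * (xu[:, None] - mu[None, :]) ** 2
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: maximum-likelihood means from labeled + soft-labeled data
    for k in (0, 1):
        w = np.concatenate([(yl == k).astype(float), r[:, k]])
        mu[k] = w @ np.concatenate([xl, xu]) / w.sum()
print(mu)  # the means move toward the true component centers -2 and +2
```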

Page 11

Other semi-supervised learning models

Co-training (Blum & Mitchell, 1998)

Active learning, e.g. (Schohn & Cohn, 2000)

Reducing overfitting, e.g. (Schuurmans & Southey, 2000)

Page 12

Theoretical value of unlabeled data

Unlabeled data help in some cases, but not all.

For estimating class probability parameters, labeled examples are exponentially more valuable than unlabeled examples, assuming the underlying component distributions are known and correct (Castelli & Cover, 1996).

Unlabeled data can degrade the performance of a classifier when the model assumptions are incorrect (Cozman & Cohen, 2002).

The value of unlabeled data for discriminative classifiers such as TSVMs, and for active learning, is questionable (Zhang & Oles, 2000).

Page 13

Models based on the clustering assumption (1): Manifold

Example: a handwritten '0' as an ellipse (a 5-dimensional parameter space).

Classification functions are naturally defined only on the submanifold in question, rather than on the total ambient space.

Classification will improve if we convert the representation onto the submanifold.
– Same idea as PCA, showing the use of unsupervised learning in semi-supervised learning.

Unlabeled data help to construct the submanifold.

Page 14

Manifold: unlabeled data help

[Figure (Belkin & Niyogi, 2002): manifold illustration with points A, B and their images A', B'; graphic not preserved in the text extraction.]

Page 15

Models based on the clustering assumption (2): Kernel methods

Objective:
– make the induced distance small for points in the same class and large for those in different classes.

Examples:
• Generative: for a mixture of Gaussians $(\mu_k, \Sigma_k)$, one kernel can be defined as
$$K(x, y) = \sum_{k=1}^{q} P(k \mid x)\, P(k \mid y)\, x^{T} \Sigma_k^{-1} y$$
(Tsuda et al., 2002)
• Discriminative: RBF kernel matrix
$$K_{ij} = \exp\big( -\|x_i - x_j\|^2 / \sigma \big)$$

This approach can unify the manifold approach.
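A minimal numpy sketch of the discriminative kernel above (the dataset X is an illustrative stand-in, not data from the paper):

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / sigma), as on the slide
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / sigma)

X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
K = rbf_kernel(X)   # entries near 1 for nearby points, near 0 for distant ones
```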

Page 16

Models based on clustering assumption (3): Min-cut

Express pair-wise relationship (similarity) between labeled/unlabeled data as a graph, and find a partitioning that minimizes the sum of similarity between differently labeled examples.

Page 17

The min-cut family of algorithms

Problem with min-cut:
– degenerate (unbalanced) cuts

Remedies:
– randomness
– normalization, as in Spectral Graph Partitioning (sketched below)
– principle: averages over examples (e.g., average margin, pos/neg ratio) should have the same expected value in the labeled and unlabeled data.
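The slide names Spectral Graph Partitioning without giving an algorithm; one common realization of the normalization idea, sketched here under that assumption, splits the similarity graph by the sign of the Fiedler vector of the graph Laplacian, which discourages degenerate one-node cuts.

```python
import numpy as np

def spectral_bipartition(W):
    """Split the nodes of a symmetric similarity matrix W by the Fiedler vector."""
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    return vecs[:, 1] >= 0                # sign of the 2nd-smallest eigenvector

# Toy graph: a tight triangle {0, 1, 2} weakly tied to node 3.
W = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0.1], [0, 0, 0.1, 0.0]])
print(spectral_bipartition(W))            # isolates node 3 (up to sign)
```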

Page 18

Road map

Introduction: background and application

Semi-supervised learning, esp. for text classification (survey)

Maximum Entropy Models (survey)

Combining semi-supervised learning and maximum entropy models (new)

Summary

Page 19

Overview: Maximum entropy models

Advantages of the maximum entropy model:
– based on features; allows and supports feature induction and feature selection
– offers a generic framework for incorporating unlabeled data
– makes only weak assumptions
– gives flexibility in incorporating side information
– handles multi-class classification naturally

So the maximum entropy model is worth further study.

Page 20

Features in MaxEnt

A feature indicates the strength of a certain aspect of the event.
– e.g., $f_t(x, y) = 1$ if and only if the current word, which is part of document $x$, is "back" and the class $y$ is "verb"; otherwise $f_t(x, y) = 0$.

Features contribute to the flexibility of MaxEnt.
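Written as code, the slide's example is just an indicator function; a minimal sketch (the deck defines no code, so the names here are illustrative):

```python
# Minimal sketch of the slide's indicator feature.
def f_back_verb(x, pos, y):
    """f_t(x, y): 1 iff the current word of document x is 'back' and y is 'verb'."""
    return 1 if x[pos] == "back" and y == "verb" else 0

doc = ["please", "come", "back", "soon"]
assert f_back_verb(doc, 2, "verb") == 1
assert f_back_verb(doc, 2, "noun") == 0
```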

Page 21

Standard MaxEnt Formulation

maximize
$$-\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i)$$

s.t.
$$E_{\tilde p}[f_t] = \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$

The dual problem is just the maximum likelihood problem.
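For concreteness, a hedged numpy sketch of the solution side of this formulation: the exponential-form model $p(y_k \mid x_i)$ and the constraint gap $E_{\tilde p}[f_t] - E_p[f_t]$, which vanishes at the optimum. The array layout F[i, k, t] = f_t(x_i, y_k), px[i] = p(x_i), and observed-label indices Y[i] are assumptions of this sketch, not notation from the paper.

```python
import numpy as np

def cond_probs(F, lam):
    """Model p(y_k | x_i) = exp(sum_t lam_t f_t(x_i, y_k)) / Z_i."""
    scores = F @ lam                                   # shape (n, K)
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def constraint_gap(F, lam, px, Y):
    """E_ptilde[f_t] - E_p[f_t]; zero for every t at the MaxEnt optimum."""
    n = F.shape[0]
    p = cond_probs(F, lam)
    emp = np.einsum('i,it->t', px, F[np.arange(n), Y])  # empirical expectation
    mod = np.einsum('i,ik,ikt->t', px, p, F)            # model expectation
    return emp - mod
```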

Page 22

Smoothing techniques (1): Gaussian prior (MAP)

maximize
$$-\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i) - \sum_t \frac{\sigma_t^2 \delta_t^2}{2}$$

s.t.
$$E_{\tilde p}[f_t] - \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) = \delta_t \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$
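In dual form this becomes the log-likelihood penalized by $\sum_t \lambda_t^2/(2\sigma_t^2)$ (see Page 48), so each gradient coordinate picks up a shrinkage term $-\lambda_t/\sigma_t^2$. A hedged sketch of one ascent step, reusing cond_probs and the F/px/Y conventions assumed in the Page 21 sketch:

```python
import numpy as np

def map_gradient_step(F, lam, px, Y, sigma2, eta=0.1):
    """One gradient-ascent step on the Gaussian-prior (MAP) dual objective."""
    p = cond_probs(F, lam)                               # from the earlier sketch
    emp = np.einsum('i,it->t', px, F[np.arange(F.shape[0]), Y])
    mod = np.einsum('i,ik,ikt->t', px, p, F)
    return lam + eta * (emp - mod - lam / sigma2)        # shrinks weights toward 0
```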

Page 23

Smoothing techniques (2): Laplacian prior (Inequality MaxEnt)

maximize
$$-\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i)$$

s.t.
$$E_{\tilde p}[f_t] - \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) \le A_t \quad \text{for all } t$$
$$\sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) - E_{\tilde p}[f_t] \le B_t \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$

Extra strength: feature selection.

Page 24

MaxEnt parameter estimation

Convex optimization ☺
– gradient descent, (conjugate) gradient descent
– Generalized Iterative Scaling (GIS)
– Improved Iterative Scaling (IIS)
– limited-memory variable metric (LMVM)
– sequential update algorithms

Page 25

Road map

Introduction: background and application

Semi-supervised learning, esp. for text classification (survey)

Maximum Entropy Models (survey)

Combining semi-supervised learning and maximum entropy models (new)

Summary

Page 26

Semi-supervised MaxEnt

Why do we choose MaxEnt?
– 1st reason: simple extension to semi-supervised learning
– 2nd reason: weak assumptions

maximize
$$-\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i)$$

s.t.
$$E_{\tilde p}[f_t] - \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) = 0 \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$

where $E_{\tilde p}[f_t] = \sum_i \sum_k \tilde p(x_i)\, \tilde p(y_k \mid x_i)\, f_t(x_i, y_k)$.

Page 27

Estimation error bounds

3rd reason: estimation error bounds exist in theory.

maximize
$$-\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i)$$

s.t.
$$E_{\tilde p}[f_t] - \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) \le A_t \quad \text{for all } t$$
$$\sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) - E_{\tilde p}[f_t] \le B_t \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$

Page 28

Side information

Assumptions about the accuracy of the empirical evaluation of the sufficient statistics alone are not enough.

[Illustrative figure not preserved in the text extraction.]

Remedy: use distance/similarity information.

Page 29

Sources of side information

Instance similarity:
– neighboring relationships between different instances
– redundant descriptions
– tracking the same object

Class similarity, using information on related classification tasks:
– combining different datasets (with different distributions) that are for the same classification task
– hierarchical classes
– structured class relationships (such as trees or other generic graphical models)

Page 30

Incorporating similarity information: flexibility of the MaxEnt framework

Add the assumption that the class probabilities of $x_i$, $x_j$ are similar if the distance between $x_i$ and $x_j$ is small in some metric. Use the distance metric to build a minimum spanning tree with edge set $E$ and add the side information to MaxEnt. Maximize
$$-\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i) - \sum_k \sum_{(i,j)\in E} w_{(i,j)}\, \varepsilon_{i,j,k}^2$$

s.t.
$$E_{\tilde p}[f_t] = \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$
$$p(y_k \mid x_i) - p(y_k \mid x_j) = \varepsilon_{i,j,k} \quad \text{for all } k \text{ and } (i,j) \in E$$

where $w_{(i,j)} = C_s\, w(i,j)$ and $w(i,j)$ is the true distance between $(x_i, x_j)$.
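A hedged scipy sketch of the construction (the random X is a stand-in for real instances): build the MST over pairwise distances and read off the edge set E and weights w(i, j) that enter the penalty term.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

X = np.random.default_rng(0).normal(size=(20, 5))   # toy instances
D = cdist(X, X)                                     # pairwise distances
mst = minimum_spanning_tree(D).tocoo()              # sparse tree (n - 1 edges)
E = np.column_stack([mst.row, mst.col])             # edge list of (i, j) pairs
w = mst.data                                        # w(i, j) for each edge
```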

Page 31

Connection with the min-cut family

Spectral Graph Partitioning: with $y \in \{+1, -1\}^n$, where $y_i = +1$ if $x_i$ is positively labeled and $y_i = -1$ if $x_i$ is negatively labeled, choose $y$ to minimize
$$\mathrm{cut}(G_+, G_-), \qquad G_+ = \{i \mid y_i = +1\},\ \ G_- = \{i \mid y_i = -1\}$$

Harmonic function (Zhu et al., 2003): minimize
$$\frac{1}{2} \sum_{i,j} w_{ij}\, \big( P(y_i = 1) - P(y_j = 1) \big)^2$$

The MaxEnt formulation with similarity side information parallels these. Maximize
$$-\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i) - \sum_k \sum_{(i,j)\in E} w_{(i,j)}\, \varepsilon_{i,j,k}^2$$

s.t.
$$E_{\tilde p}[f_t] = \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$
$$p(y_k \mid x_i) - p(y_k \mid x_j) = \varepsilon_{i,j,k} \quad \text{for all } k \text{ and } (i,j) \in E$$

Open question: how should $|\varepsilon_{i,j,k}|$ be constrained?

Page 32

Miscellaneous promising research openings (1)

Feature selection
– Greedy algorithm that incrementally adds features to the random field by selecting the feature which maximally reduces the objective function.

Feature induction
– If 'IBM' appears in the labeled data while 'Apple' does not, then using 'IBM or Apple' as a feature can help (though it is costly).

Page 33

Miscellaneous promising research openings (2)

Interval estimation
– How should we set $A_t$ and $B_t$? There is a whole body of results in statistics: the weak/strong laws of large numbers, Hoeffding's inequality, or more advanced concepts in statistical learning theory, e.g., the VC-dimension of the feature class.

minimize
$$\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i)$$

s.t.
$$-B_t \le E_{\tilde p}[f_t] - \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) \le A_t \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$

Hoeffding's inequality:
$$P\big( |E_{\tilde p}[f_t] - E_p[f_t]| > \beta \big) \le \exp(-2\beta^2 m)$$
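For example, inverting the Hoeffding bound gives a concrete box width: setting $\exp(-2\beta^2 m) = \delta$ yields $\beta = \sqrt{\ln(1/\delta)/(2m)}$, a natural (assumed here, not prescribed by the slide) choice for $A_t = B_t$.

```python
import numpy as np

def hoeffding_width(m, delta=0.05):
    """Box width beta with P(|empirical - true| > beta) <= delta."""
    return np.sqrt(np.log(1.0 / delta) / (2.0 * m))

print(hoeffding_width(1000))   # ~0.0387 for m = 1000 samples at delta = 0.05
```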

Page 34

Miscellaneous promising research openings (3)

Re-weighting
– Since the empirical estimation of the statistics is inaccurate, we add more weight to the labeled data, which may be more reliable than the unlabeled data.

minimize
$$\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i) + \sum_t \frac{\sigma_t^2 \delta_t^2}{2}$$

s.t.
$$E_{\tilde p}[f_t] - \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) = \delta_t \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$

where, in the corresponding dual, $Z_i = \sum_k \exp\big( \sum_t \lambda_t f_t(x_i, y_k) \big)$.

Page 35

Re-weighting

Originally there are $n_1$ labeled and $n_2$ unlabeled examples:
$$\underbrace{x_1^l, \ldots, x_{n_1}^l}_{\text{labeled data}},\ \underbrace{x_1^u, \ldots, x_{n_2}^u}_{\text{unlabeled data}}$$

Re-weighting replaces each labeled example with $\beta$ copies:
$$\underbrace{x_1^l, \ldots, x_1^l,\ \ldots,\ x_{n_1}^l, \ldots, x_{n_1}^l}_{\beta \text{ copies of each labeled example}},\ \underbrace{x_1^u, \ldots, x_{n_2}^u}_{\text{unlabeled data}}$$

Then $p(x)$ for a labeled example changes from $\frac{1}{n_1+n_2}$ to $\frac{\beta}{\beta n_1+n_2}$, and $p(x)$ for an unlabeled example changes from $\frac{1}{n_1+n_2}$ to $\frac{1}{\beta n_1+n_2}$.

All equations before remain unchanged!
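The arithmetic is easy to check; a minimal sketch (the function name is illustrative):

```python
def reweighted_px(n1, n2, beta):
    """p(x) per labeled / unlabeled example after beta-fold re-weighting."""
    denom = beta * n1 + n2
    return beta / denom, 1.0 / denom

w_l, w_u = reweighted_px(n1=100, n2=900, beta=3.0)
assert abs(100 * w_l + 900 * w_u - 1.0) < 1e-12   # still sums to 1
```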

Page 36

Initial experimental results

Dataset: optical digits from UCI
– 64 input attributes ranging over [0, 16]; 10 classes

Algorithms tested
– MST MaxEnt with re-weighting
– Gaussian prior MaxEnt, inequality MaxEnt, TSVM (linear and polynomial kernels, one-against-all)

Testing strategy
– Report the results for the parameter setting with the best performance on the test set

Page 37

Initial experimental results (cont.)

[Results chart not preserved in the text extraction.]

Page 38

Summary

The Maximum Entropy model is promising for semi-supervised learning.

Side information is important and can be flexibly incorporated into the MaxEnt model.

Future research can be done in the areas pointed out (feature selection/induction, interval estimation, side-information formulation, re-weighting, etc.).

Page 39

Question and Answer Session

Questions are welcome.

Page 40

GIS

Iterative update rule for unconditional probability:
$$\lambda_t^{(s+1)} = \lambda_t^{(s)} + \log\left( \frac{E_{\tilde p}[f_t]}{E_{p^{(s)}}[f_t]} \right),
\qquad
p^{(s+1)}(x_i) = p^{(s)}(x_i) \prod_t \left( \frac{\sum_j \tilde p(x_j)\, f_t(x_j)}{\sum_j p^{(s)}(x_j)\, f_t(x_j)} \right)^{f_t(x_i)}$$

GIS for conditional probability:
$$\lambda_t^{(s+1)} = \lambda_t^{(s)} + \eta \log \left( \frac{E_{\tilde p}[f_t]}{\sum_i \sum_k p(x_i)\, p(y_k \mid x_i, \lambda^{(s)})\, f_t(x_i, y_k)} \right)$$
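A hedged sketch of the conditional update as a loop, reusing cond_probs and the F/px/Y layout assumed in the Page 21 sketch; it assumes non-negative features whose empirical expectations are all positive (otherwise the log is undefined).

```python
import numpy as np

def gis(F, px, Y, iters=100, eta=1.0):
    n, K, T = F.shape
    C = F.sum(axis=2).max()               # GIS constant: max summed feature value
    lam = np.zeros(T)
    emp = np.einsum('i,it->t', px, F[np.arange(n), Y])   # E_ptilde[f_t]
    for _ in range(iters):
        mod = np.einsum('i,ik,ikt->t', px, cond_probs(F, lam), F)
        lam += (eta / C) * np.log(emp / mod)             # multiplicative update
    return lam
```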

Page 41

IIS

Characteristics:
– monotonic decrease of the MaxEnt objective function
– each update depends only on the computation of expected values $E_{p^{(s)}}$, not requiring the gradient or higher derivatives

Update rule for unconditional probability:
– $\Delta\lambda_t$ is the solution to
$$E_{\tilde p}[f_t] = \sum_i p^{(s)}(x_i)\, f_t(x_i) \exp\Big( \Delta\lambda_t \sum_j f_j(x_i) \Big) \quad \text{for all } t$$
– the $\Delta\lambda_t$ are decoupled and solved individually
– Monte Carlo methods are to be used if the number of possible $x_i$ is too large
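Each decoupled equation is one-dimensional and monotone in $\Delta\lambda_t$ (for non-negative features), so Newton's method solves it; a hedged sketch for the unconditional case, with array arguments assumed by this illustration:

```python
import numpy as np

def iis_delta(emp_t, p_s, f_t, fsharp, iters=50):
    """Solve emp_t = sum_i p_s[i] * f_t[i] * exp(d * fsharp[i]) for d."""
    d = 0.0
    for _ in range(iters):
        e = np.exp(d * fsharp)
        g = np.sum(p_s * f_t * e) - emp_t          # residual of the IIS equation
        dg = np.sum(p_s * f_t * fsharp * e)        # its derivative w.r.t. d
        d -= g / dg                                # Newton step
    return d
```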

Page 42

GIS

Characteristics:
– converges to the unique optimal value of $\lambda$
– parallel update, i.e., the $\lambda_t^{(s)}$ are updated synchronously
– slow convergence

Prerequisites of the original GIS, for all training examples $x_i$:
$$f_t(x_i) \ge 0 \quad \text{and} \quad \sum_t f_t(x_i) = 1$$

Relaxing the prerequisite: if $\sum_t f_t(x_i) = C$, then define $f_t' = f_t / C$. If not all training data have summed features equal to $C$, set $C$ sufficiently large and incorporate a 'correction feature'.

Page 43

Other standard optimization algorithms

Gradient descent:
$$\lambda_t^{(s+1)} = \lambda_t^{(s)} + \eta \left. \frac{\partial L}{\partial \lambda_t} \right|_{\lambda = \lambda^{(s)}}$$

Conjugate gradient methods, such as the Fletcher-Reeves and Polak-Ribière-Positive algorithms.

Limited-memory variable metric, quasi-Newton methods: approximate the Hessian using successive evaluations of the gradient.

Page 44

Sequential updating algorithm

For a very large (or infinite) number of features, parallel algorithms will be too resource-consuming to be feasible.

Sequential update: a style of coordinate-wise descent that modifies one parameter at a time.

It converges to the same optimum as the parallel update.

Page 45

Dual Problem of Standard MaxEnt

Primal: minimize
$$\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i)$$

s.t.
$$E_{\tilde p}[f_t] - \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) = 0 \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$

Dual problem:
$$L(p_{\min}, \lambda) = -\sum_t \lambda_t E_{\tilde p}[f_t] + \sum_i p(x_i) \log Z_i$$

where
$$Z_i = \sum_k \exp\Big( \sum_t \lambda_t f_t(x_i, y_k) \Big)$$
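The derivation behind this slide is standard Lagrangian duality; a brief sketch in the slide's notation (the multipliers $\mu_i$ for the normalization constraints are introduced here for the illustration):

```latex
\mathcal{L} = \sum_{i,k} p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i)
 + \sum_t \lambda_t \Big( E_{\tilde p}[f_t]
   - \sum_{i,k} p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) \Big)
 + \sum_i \mu_i \Big( \sum_k p(y_k \mid x_i) - 1 \Big)
% Stationarity in p(y_k | x_i) forces the exponential form:
\frac{\partial \mathcal{L}}{\partial p(y_k \mid x_i)} = 0
\;\Rightarrow\;
p(y_k \mid x_i) = \frac{1}{Z_i} \exp\Big( \sum_t \lambda_t f_t(x_i, y_k) \Big)
% Substituting this form back into \mathcal{L} gives the dual objective
% L(p_min, lambda) stated above.
```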

Page 46

Relationship with maximum likelihood

Suppose
$$p(y_k \mid x_i) = \frac{1}{Z_i} \exp\Big( \sum_t \lambda_t f_t(x_i, y_k) \Big), \qquad Z_i = \sum_k \exp\Big( \sum_t \lambda_t f_t(x_i, y_k) \Big).$$

Then the log-likelihood (to be maximized) is
$$L_\lambda = \sum_i \sum_k \tilde p(x_i, y_k) \log p(x_i, y_k) = \sum_i \tilde p(x_i) \log \tilde p(x_i) + \sum_t \lambda_t E_{\tilde p}[f_t] - \sum_i \tilde p(x_i) \log Z_i,$$

which, up to the constant $\sum_i \tilde p(x_i) \log \tilde p(x_i)$, is the negative of the dual of MaxEnt (to be minimized):
$$L(p_{\min}, \lambda) = -\sum_t \lambda_t E_{\tilde p}[f_t] + \sum_i p(x_i) \log Z_i.$$

Page 47

Smoothing techniques (2): Exponential prior

Primal: minimize
$$\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i)$$

s.t.
$$E_{\tilde p}[f_t] - \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) \le A_t \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$

Dual problem: minimize
$$L(\lambda) = -\sum_t \lambda_t E_{\tilde p}[f_t] + \sum_i p(x_i) \log Z_i + \sum_t \lambda_t A_t, \qquad Z_i = \sum_k \exp\Big( \sum_t \lambda_t f_t(x_i, y_k) \Big)$$

Equivalent to maximizing
$$\prod_i p(y_i \mid x_i) \times \prod_t A_t \exp(-A_t \lambda_t)$$

Page 48

Smoothing techniques (1): Gaussian prior (MAP)

Primal: minimize
$$\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i) + \sum_t \frac{\sigma_t^2 \delta_t^2}{2}$$

s.t.
$$E_{\tilde p}[f_t] - \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) = \delta_t \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$

Dual problem: minimize
$$L(\lambda) = -\sum_t \lambda_t E_{\tilde p}[f_t] + \sum_i p(x_i) \log Z_i + \sum_t \frac{\lambda_t^2}{2\sigma_t^2}, \qquad Z_i = \sum_k \exp\Big( \sum_t \lambda_t f_t(x_i, y_k) \Big)$$

Page 49

Smoothing techniques (3): Laplacian prior (Inequality MaxEnt)

Primal: minimize
$$\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i)$$

s.t.
$$-B_t \le E_{\tilde p}[f_t] - \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) \le A_t \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$

Dual problem: minimize
$$L(\alpha, \beta) = -\sum_t (\alpha_t - \beta_t) E_{\tilde p}[f_t] + \sum_i p(x_i) \log Z_i + \sum_t \alpha_t A_t + \sum_t \beta_t B_t, \qquad \alpha_t \ge 0, \; \beta_t \ge 0$$

where
$$Z_i = \sum_k \exp\Big( \sum_t (\alpha_t - \beta_t)\, f_t(x_i, y_k) \Big)$$

Page 50

Smoothing techniques (4): Inequality with 2-norm penalty

minimize
$$\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i) + C_1 \sum_t \delta_t^2 + C_2 \sum_t \zeta_t^2$$

s.t.
$$E_{\tilde p}[f_t] - \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) \le A_t + \delta_t \quad \text{for all } t$$
$$\sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) - E_{\tilde p}[f_t] \le B_t + \zeta_t \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$

Page 51

Smoothing techniques (5): Inequality with 1-norm penalty

minimize
$$\sum_i \sum_k p(x_i)\, p(y_k \mid x_i) \log p(y_k \mid x_i) + C_1 \sum_t \delta_t + C_2 \sum_t \zeta_t$$

s.t.
$$E_{\tilde p}[f_t] - \sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) \le A_t + \delta_t \quad \text{for all } t$$
$$\sum_i \sum_k p(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k) - E_{\tilde p}[f_t] \le B_t + \zeta_t \quad \text{for all } t$$
$$\sum_k p(y_k \mid x_i) = 1 \quad \text{for all } i$$
$$\delta_t \ge 0, \; \zeta_t \ge 0 \quad \text{for all } t$$

Page 52

Using MaxEnt as smoothing

Add a maximum entropy term to the target function of other models, exploiting MaxEnt's preference for the uniform distribution.

[The slide's equations were not preserved in the text extraction.]

Page 53

Bounded error

Let $p^C(x_i)$ denote the correct distribution, and define
$$E_{p^C}[f_t] = \sum_i \sum_k p^C(x_i)\, p(y_k \mid x_i)\, f_t(x_i, y_k)$$
$$L_{p^C}(\lambda) = -\sum_t \lambda_t E_{p^C}[f_t] + \sum_i p^C(x_i) \log Z_i$$

Conclusion: if
$$\hat\lambda = \arg\min_\lambda L^{A,B}_{\tilde p}(\lambda), \qquad \lambda^* = \arg\min_\lambda L_{p^C}(\lambda),$$
then
$$L_{p^C}(\hat\lambda) \le L_{p^C}(\lambda^*) + \sum_t (A_t + B_t)\, \lambda_t^*$$