
Page 1: Maximum Entropy Discrimination

Maximum Entropy Discrimination

Tommi Jaakkola (MIT), Marina Meila (CMU), Tony Jebara (MIT)

Page 2: Maximum Entropy Discrimination

Classification

· inputs x, class y = +1, −1

· data D = { (x_1, y_1), …, (x_T, y_T) }

· learn f_opt(x), a discriminant function from F = { f }, a family of discriminants

· classify y = sign f_opt(x)

Page 3: Maximum Entropy Discrimination

Model averaging

· many f in F have near-optimal performance

· instead of choosing f_opt, average over all f in F

Q(f) = weight of f

y(x) = sign ∫_F Q(f) f(x) df = sign < f(x) >_Q

· to specify: F = { f }, the family of discriminant functions

· to learn: Q(f), a distribution over F
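A minimal sketch of this averaged rule, assuming the family has been discretized so the expectation is a finite weighted sum (the names fx and q are illustrative, not from the paper):

```python
import numpy as np

# Averaged decision rule y(x) = sign < f(x) >_Q over a finite family:
# fx[i, j] = f_i(x_j) and q[i] = Q(f_i), a distribution over the models.
def classify(fx, q):
    return np.sign(q @ fx)   # one +/-1 label per input x_j
```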

Page 4: Maximum Entropy Discrimination

Goal of this work

· Define a discriminative criterion for averaging over models

Advantages

· can incorporate prior

· can use generative model

· computationally feasible

· generalizes to other discrimination tasks

Page 5: Maximum Entropy Discrimination

Maximum Entropy Discrimination

given data set D = { (x_1, y_1), …, (x_T, y_T) }, find

Q_ME = argmax_Q H(Q)

s.t. y_t < f(x_t) >_Q ≥ γ for all t = 1,…,T (C)

for some γ > 0

solution: Q_ME correctly classifies D

· among all admissible Q, Q_ME has maximum entropy

· maximum entropy = least specific about f
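To make the program concrete, here is a sketch that solves the primal directly for a small discretized family with a generic constrained optimizer; the finite-family setup and the name max_entropy_Q are illustrative assumptions, not the paper's method:

```python
import numpy as np
from scipy.optimize import minimize

# max H(Q) s.t. y_t <f(x_t)>_Q >= gamma, with Q a distribution over a
# finite family; fx[i, t] = f_i(x_t).
def max_entropy_Q(fx, y, gamma):
    n = fx.shape[0]
    neg_entropy = lambda q: np.sum(q * np.log(q + 1e-12))
    cons = [{"type": "eq", "fun": lambda q: q.sum() - 1.0}]
    for t in range(fx.shape[1]):
        cons.append({"type": "ineq",
                     "fun": lambda q, t=t: y[t] * (q @ fx[:, t]) - gamma})
    res = minimize(neg_entropy, np.full(n, 1.0 / n),
                   bounds=[(0.0, 1.0)] * n, constraints=cons)
    return res.x
```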

Page 6: Maximum Entropy Discrimination

Solution: Q_ME as a projection

· convex problem: Q_ME unique

· solution: Q_ME(f) ~ exp{ Σ_{t=1}^{T} λ_t y_t f(x_t) }

· λ_t ≥ 0 Lagrange multipliers

· finding Q_ME: start with λ = 0 and follow the gradient of the unsatisfied constraints

[Figure: Q_ME as the projection of the uniform Q_0 (λ = 0) onto the set of admissible Q]
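For a discretized family the exponential form is straightforward to evaluate; a sketch (fx as in the earlier sketch, q_me a made-up name):

```python
import numpy as np

# Q_ME(f_i) proportional to exp( sum_t lam[t] * y[t] * f_i(x_t) ).
# fx[i, t] = f_i(x_t); lam[t] >= 0 are the Lagrange multipliers.
def q_me(fx, y, lam):
    s = fx @ (lam * y)
    w = np.exp(s - s.max())   # subtract the max for numerical stability
    return w / w.sum()
```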

Page 7: Maximum Entropy Discrimination

Finding the solution

· needed: λ_t, t = 1,…,T

· obtained by solving the dual problem

max_λ J(λ) = max_λ [ − log Z_+(λ) − log Z_−(λ) − γ Σ_t λ_t ]

s.t. λ_t ≥ 0 for t = 1,…,T

Algorithm

· start with λ_t = 0 (uniform distribution)

· iterative ascent on J(λ) until convergence

· derivative: ∂J/∂λ_t = y_t < log [ P_+(x_t) / P_−(x_t) ] + b >_Q(P) − γ
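The ascent loop, sketched for the discretized setting above (it reuses the hypothetical q_me; the gradient is the generic form y_t < f(x_t) >_Q − γ, and the step size and iteration count are arbitrary):

```python
import numpy as np

# Projected gradient ascent on the dual J(lam); lam_t >= 0 is kept by clipping.
def fit_lambda(fx, y, gamma, lr=0.1, n_iter=1000):
    lam = np.zeros(len(y))
    for _ in range(n_iter):
        q = q_me(fx, y, lam)              # current max-entropy solution
        grad = y * (q @ fx) - gamma       # dJ/dlam_t = y_t <f(x_t)>_Q - gamma
        lam = np.maximum(lam + lr * grad, 0.0)
    return lam
```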

Page 8: Maximum Entropy Discrimination

QME as sparse solution

· classification rule: y(x) = sign < f(x) >_Q_ME

· γ is the classification margin

· λ_t > 0 only when y_t < f(x_t) >_Q = γ,

i.e. x_t lies on the margin

(a support vector!)
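Continuing the sketches above, the sparsity pattern is just the set of strictly positive multipliers (the tolerance is arbitrary):

```python
support = np.flatnonzero(lam > 1e-8)   # indices of the margin ("support") points
```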

Page 9: Maximum Entropy Discrimination

QME as regularization

· uniform distribution Q_0 corresponds to λ = 0

· "smoothness" of Q = H(Q)

· Q_ME is the smoothest admissible distribution

[Figure: Q(f) vs. f, comparing the point estimate f_opt, the smooth Q_ME, and the uniform Q_0]

Page 10: Maximum Entropy Discrimination

Goal of this work

· Define a discriminative criterion for averaging over models

Extensions

· incorporate prior

· relationship to support vectors

· use generative models

· generalization to other discrimination tasks

Page 11: Maximum Entropy Discrimination

Priors

· prior Q0( f )

· Minimum Relative Entropy Discrimination

Q_MRE = argmin_Q KL( Q || Q_0 )

s.t. y_t < f(x_t) >_Q ≥ γ for all t = 1,…,T (C)

· prior on γ → learn Q_MRE(f, γ) → soft margin

[Figure: Q_MRE as the KL projection of the prior Q_0 onto the set of admissible Q]
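The discrete sketch from before carries over by swapping the objective from negative entropy to relative entropy against a prior (q0 is the discretized prior; names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# min KL(Q || Q0) s.t. y_t <f(x_t)>_Q >= gamma, Q a distribution over a
# finite family; fx[i, t] = f_i(x_t), q0[i] = Q0(f_i).
def mre_Q(fx, y, gamma, q0):
    n = fx.shape[0]
    kl = lambda q: np.sum(q * (np.log(q + 1e-12) - np.log(q0 + 1e-12)))
    cons = [{"type": "eq", "fun": lambda q: q.sum() - 1.0}]
    for t in range(fx.shape[1]):
        cons.append({"type": "ineq",
                     "fun": lambda q, t=t: y[t] * (q @ fx[:, t]) - gamma})
    res = minimize(kl, q0.copy(), bounds=[(0.0, 1.0)] * n, constraints=cons)
    return res.x
```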

Page 12: Maximum Entropy Discrimination

Soft margins

· average also over the margin γ

· define Q_0(f, γ) = Q_0(f) Q_0(γ)

· constraints: < y_t f(x_t) − γ >_Q(f,γ) ≥ 0

· learn Q_MRE(f, γ) = Q_MRE(f) Q_MRE(γ)

Q_0(γ) = c exp[ c(γ − 1) ] for γ ≤ 1

[Figure: potential as a function of γ]
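This prior puts most of its mass near γ = 1, with an exponential tail that tolerates smaller (even negative) margins. One way to see it is by sampling: writing γ = 1 − E with E exponential of rate c reproduces the density above (my reading of the formula, for illustration):

```python
import numpy as np

# Sample the margin prior Q0(gamma) = c * exp(c * (gamma - 1)), gamma <= 1:
# gamma = 1 - E with E ~ Exponential(rate=c).
def sample_margin_prior(c, size, rng=np.random.default_rng(0)):
    return 1.0 - rng.exponential(scale=1.0 / c, size=size)
```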

Page 13: Maximum Entropy Discrimination

Examples: support vector machines

· Theorem

For f(x) = θ·x + b, Q_0(θ) = Normal( 0, I ), Q_0(b) = non-informative prior, the Lagrange multipliers are obtained by

maximizing J(λ) subject to 0 ≤ λ_t ≤ c and Σ_t λ_t y_t = 0, where

J(λ) = Σ_t [ λ_t + log( 1 − λ_t /c ) ] − 1/2 Σ_{t,s} λ_t λ_s y_t y_s x_t·x_s

· separable D → SVM recovered exactly

· inseparable D → SVM recovered with a different misclassification penalty

· adaptive kernel SVM....
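One way to check the theorem numerically is to hand J(λ) to a generic optimizer; a sketch under the theorem's assumptions (med_svm_dual is a made-up name, and this is not how one would solve the problem at scale):

```python
import numpy as np
from scipy.optimize import minimize

# Maximize J(lam) = sum_t [lam_t + log(1 - lam_t/c)] - 0.5 * lam' (K * yy') lam
# s.t. 0 <= lam_t < c (strict, for the log) and sum_t lam_t y_t = 0.
def med_svm_dual(X, y, c=10.0):
    yKy = (X @ X.T) * np.outer(y, y)
    def neg_J(lam):
        return -(np.sum(lam + np.log(1.0 - lam / c)) - 0.5 * lam @ yKy @ lam)
    cons = {"type": "eq", "fun": lambda lam: lam @ y}
    res = minimize(neg_J, np.full(len(y), 1e-3),
                   bounds=[(0.0, c - 1e-9)] * len(y), constraints=cons)
    return res.x
```

As c grows, the log term vanishes and the standard SVM dual is recovered, matching the separable-case claim above.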

Page 14: Maximum Entropy Discrimination

SVM extensions

· Example: Leptograpsus Crabs (5 inputs, T_train = 80, T_test = 120)

f(x) = log [ P_+(x) / P_−(x) ] + b

with P_±(x) = Normal( x ; m_±, V_± )

→ quadratic classifier; Q( V_+, V_− ) = distribution of kernel width

[Figure: decision boundaries of the MRE Gaussian, the linear SVM, and the maximum-likelihood Gaussian]

Page 15: Maximum Entropy Discrimination

Using generative models

· generative models P_+(x), P_−(x) for y = +1, −1

· f(x) = log [ P_+(x) / P_−(x) ] + b

· learn Q_MRE( P_+, P_−, b, γ )

· if Q_0( P_+, P_−, b, γ ) = Q_0(P_+) Q_0(P_−) Q_0(b) Q_0(γ)

· then Q_MRE( P_+, P_−, b, γ ) = Q_ME(P_+) Q_ME(P_−) Q_MRE(b) Q_MRE(γ)

(factored prior → factored posterior)
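A sketch of this discriminant with Gaussian class-conditional models; the parameter values are placeholders, and the MED posterior would average over them rather than plug in a single point estimate:

```python
import numpy as np
from scipy.stats import multivariate_normal

# f(x) = log P+(x) - log P-(x) + b with Gaussian class-conditional models.
def discriminant(x, m_pos, V_pos, m_neg, V_neg, b=0.0):
    return (multivariate_normal.logpdf(x, mean=m_pos, cov=V_pos)
            - multivariate_normal.logpdf(x, mean=m_neg, cov=V_neg) + b)
```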

Page 16: Maximum Entropy Discrimination

Examples: other distributions

· Multinomial (1 discrete variable)

· Graphical model (fixed structure, no hidden variables)

· Tree graphical model ( Q over structures and parameters)

Page 17: Maximum Entropy Discrimination

Tree graphical models

· P(x | E, θ) = P_0(x) Π_{uv∈E} P_uv(x_u, x_v | θ_uv)

· prior Q_0(P) = Q_0(E) Q_0(θ | E)

· Q_0(E) ∝ Π_{uv∈E} β_uv

· Q_0(θ | E) = conjugate prior

Q_MRE(P) ∝ W_0 Π_{uv∈E} W_uv

can be integrated analytically

Q_0(P) conjugate prior over E and θ
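For a fixed edge set E, the likelihood is just a base factor times one pairwise factor per tree edge; a minimal sketch with hypothetical lookup tables (here the base factor is taken to be a product of per-variable tables, one possible instantiation):

```python
# log P(x | E, theta): singleton tables log_p0[v][x_v] plus pairwise edge
# tables log_pair[(u, v)][x_u, x_v] for each edge (u, v) in the tree E.
def tree_log_prob(x, edges, log_p0, log_pair):
    lp = sum(log_p0[v][x[v]] for v in range(len(x)))
    lp += sum(log_pair[(u, v)][x[u], x[v]] for (u, v) in edges)
    return lp
```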

Page 18: Maximum Entropy Discrimination

Trees: experiments

· splice junction classification task

· 25 inputs, 400 training examples

· compared with maximum-likelihood trees

[Figure: test error: ML 14%, MaxEnt 12.3%]

Page 19: Maximum Entropy Discrimination

Trees: experiments (contd)

[Figure: tree edges' weights]

Page 20: Maximum Entropy Discrimination

Discrimination tasks

· Classification

· Classification with partially labeled data

· Anomaly detection

[Figure: scatter plots of labeled (+, −) and unlabeled (x) points illustrating the three tasks]

Page 21: Maximum Entropy Discrimination

Partially labeled data

· Problem: given F, a family of discriminants, and data set D = { (x_1, y_1), …, (x_T, y_T), x_{T+1}, …, x_N }, find

Q(f, γ, y) = argmin_Q KL( Q || Q_0 )

s.t. < y_t f(x_t) − γ >_Q ≥ 0 for all t = 1,…,N (C)

(the unknown labels y_{T+1}, …, y_N are averaged over as part of Q)

Page 22: Maximum Entropy Discrimination

Partially labeled data : experiment

· splice junction classification

· 25 inputs

· T_total = 1000

[Figure: results for complete data, 10% labeled + 90% unlabeled, and 10% labeled only]

Page 23: Maximum Entropy Discrimination

Anomaly detection

· Problem: given P = { P }, a family of generative models, and data set D = { x_1, …, x_T }, find

Q(P, γ) = argmin_Q KL( Q || Q_0 )

s.t. < log P(x_t) − γ >_Q ≥ 0 for all t = 1,…,T (C)
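This is the same MRE machinery with the model's own log-likelihood playing the role of the discriminant, and no labels. Reusing the hypothetical mre_Q sketch from earlier, a toy reduction (all names and numbers here are illustrative):

```python
import numpy as np
from scipy.stats import norm

# Family = 1-D Gaussians with different means; the constraint
# < log P(x_t) - gamma >_Q >= 0 is the margin constraint with y_t = 1 and
# f_i(x_t) = log-likelihood of x_t under model i.
D = np.array([0.1, -0.3, 0.2, 0.0])
means = np.linspace(-2.0, 2.0, 9)
loglik = np.stack([norm.logpdf(D, loc=m) for m in means])   # (models, T)
q = mre_Q(loglik, np.ones(len(D)), gamma=-1.5,
          q0=np.full(len(means), 1.0 / len(means)))
```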

Page 24: Maximum Entropy Discrimination

Anomaly detection: experiments

[Figure: comparison of MaxEnt and maximum-likelihood models]

Page 25: Maximum Entropy Discrimination

Anomaly detection: experiments

MaxEnt

MaxLikelihood

Page 26: Maximum Entropy Discrimination

Conclusions

· New framework for classification

· Based on regularization in the space of distributions

· Enables the use of generative models

· Enables the use of priors

· Generalizes to other discrimination tasks