Classifier Based on Mixture of Density Trees
Huang, Kaizhu
CSE Department, The Chinese University of Hong Kong
Basic Problem

Given a dataset

$$\{(X_1, C), (X_2, C), \ldots, (X_{N-1}, C), (X_N, C)\}$$

where $X_i$ stands for the training data and $C$ stands for the class label. Assuming we have $m$ classes, we estimate the probability

$$P(C_i \mid X), \quad i = 1, 2, \ldots, m \qquad (1)$$

The classifier is then denoted by:

$$c(X) = \arg\max_i P(C_i \mid X)$$

The key point is: how can we estimate the posterior probability (1)?
Density estimation problem

Given a dataset $D$

$$\{X_1, X_2, \ldots, X_{N-1}, X_N\}$$

where each $X_i$ is an instance of an $m$-variable vector

$$\{v_1, v_2, \ldots, v_{m-1}, v_m\}$$

the goal is to find a joint distribution $P(v_1, v_2, \ldots, v_{m-1}, v_m)$ which maximizes the negative entropy, i.e., the log-likelihood of $D$:

$$Q = \arg\max_P \sum_{i=1}^{N} \log P(X_i)$$
Interpretation

$$\frac{1}{N} \sum_{i=1}^{N} \log P(X_i) = \sum_{x \in \Omega} \tilde{P}(x) \log P(x)$$

where $\Omega$ is defined as the state space and $\tilde{P}$ is the empirical distribution of $D$. According to information theory, this measures how many bits are needed to describe $D$ based on the probability distribution $P$. Finding the maximizing $P$ is thus actually finding the well-known MDL (minimal description length) solution.
Graphical Density estimation

Naive Bayesian Network (NB)
Tree-Augmented Naive Bayesian Network (TANB)
Chow-Liu tree network (CL)
Mixture of trees network (MT)
(Figure: diagram relating the NB, TANB, Chow-Liu tree, and mixture-of-tree models; EM connects the tree models to the mixture of trees.)
Naive Bayesian Network

Given the problem in Slide 3, we make the following assumption about the variables: all the variables are independent, given the class label. Then the joint distribution $P$ can be written as:

$$P(v_1, v_2, \ldots, v_{m-1}, v_m \mid C) = \prod_{i=1}^{m} P(v_i \mid C)$$
Structure of NB

With this structure, it is easy to estimate the joint distribution, since we can obtain each $P(v_i \mid C)$ by the following simple accumulation (assuming $v_i$ is a discrete variable and $k$ is one of $v_i$'s values):

$$P(v_i = k \mid C) = \frac{N_{C,\, v_i = k}}{N_C}$$
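As a concrete illustration, here is a minimal sketch of this counting estimate for discrete features (the function and variable names are hypothetical, not from the slides):

```python
from collections import Counter, defaultdict

def estimate_nb_tables(X, y):
    """Estimate P(v_i = k | C) by counting over the training data.

    X: list of feature vectors (each a tuple of discrete values)
    y: list of class labels, one per row of X
    Returns a dict mapping (class, feature_index, value) -> probability.
    """
    class_counts = Counter(y)              # N_C
    value_counts = defaultdict(Counter)    # N_{C, v_i = k}
    for xs, c in zip(X, y):
        for i, k in enumerate(xs):
            value_counts[(c, i)][k] += 1
    return {(c, i, k): n / class_counts[c]     # N_{C, v_i = k} / N_C
            for (c, i), counter in value_counts.items()
            for k, n in counter.items()}
```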
Chow-Liu Tree network

Given the problem in Slide 3, we make the following assumption about the variables: each variable has a direct dependence relationship with just one other variable, and is conditionally independent of the other variables, given the class label. Thus the joint distribution can be written as:
$$P(v_1, v_2, \ldots, v_{m-1}, v_m \mid C) = \prod_{i=1}^{m} P(v_{l_i} \mid v_{l_{j(i)}}, C), \qquad 0 \le j(i) < i \quad (i = 1, 2, \ldots, m)$$

where $(l_1, l_2, \ldots, l_m)$ is an unknown permutation of the integers $1, 2, \ldots, m$, with the convention $P(v_{l_i} \mid v_{l_0}, C) = P(v_{l_i} \mid C)$.
Example of CL methods

Fig. 2 is an example of a CL tree where:
1. $v_3$ is conditionally dependent only on $v_4$, and conditionally independent of the other variables: $P(v_3 \mid v_4, B) = P(v_3 \mid v_4)$
2. $v_5$ is conditionally dependent only on $v_4$, and conditionally independent of the other variables: $P(v_5 \mid v_4, B) = P(v_5 \mid v_4)$
3. $v_2$ is conditionally dependent only on $v_3$, and conditionally independent of the other variables: $P(v_2 \mid v_3, B) = P(v_2 \mid v_3)$
4. $v_1$ is conditionally dependent only on $v_3$, and conditionally independent of the other variables: $P(v_1 \mid v_3, B) = P(v_1 \mid v_3)$
Under these assumptions the joint distribution factors step by step:

$$\begin{aligned}
P(v_1, v_2, v_3, v_4, v_5) &= P(v_4)\, P(v_1, v_2, v_3, v_5 \mid v_4) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_1, v_2, v_5 \mid v_3, v_4) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_1, v_2 \mid v_3, v_4)\, P(v_5 \mid v_1, v_2, v_3, v_4) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_5 \mid v_4)\, P(v_1, v_2 \mid v_3, v_4) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_5 \mid v_4)\, P(v_1, v_2 \mid v_3) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_5 \mid v_4)\, P(v_1 \mid v_2, v_3)\, P(v_2 \mid v_3) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_5 \mid v_4)\, P(v_1 \mid v_3)\, P(v_2 \mid v_3)
\end{aligned}$$
CL Tree
$$P(v_1, v_2, \ldots, v_{m-1}, v_m \mid C) = \prod_{i=1}^{m} P(v_{l_i} \mid v_{l_{j(i)}}, C), \qquad 0 \le j(i) < i$$

where $(l_1, l_2, \ldots, l_m)$ is an unknown permutation of the integers $1, 2, \ldots, m$.
The key point about the CL tree is that we use a product of two-dimensional distributions to approximate the high-dimensional distribution. The question, then, is how to find the product of two-dimensional distributions that approximates the high-dimensional distribution optimally.
CL tree algorithm

1. Obtain $P(v_i \mid v_j)$ and $P(v_i, v_j)$ for each pair $(v_i, v_j)$ by an accumulation (counting) process.
2. Calculate the mutual information between each pair of variables:

$$I(v_i, v_j) = \sum_{v_i, v_j} P(v_i, v_j) \log \frac{P(v_i, v_j)}{P(v_i)\, P(v_j)}$$

3. Use a maximum spanning tree algorithm to find the optimal tree structure, where the edge weight between two nodes $v_i$, $v_j$ is $I(v_i, v_j)$.

This CL algorithm was proved to be optimal in [1].
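A minimal sketch of this procedure, assuming discrete data in an $N \times m$ NumPy array (the function names are mine, not from the slides):

```python
import numpy as np

def mutual_information(data, i, j):
    """Empirical mutual information I(v_i, v_j) from discrete data (N x m array)."""
    n = len(data)
    pij, pi, pj = {}, {}, {}
    for row in data:
        a, b = row[i], row[j]
        pij[(a, b)] = pij.get((a, b), 0.0) + 1.0 / n
        pi[a] = pi.get(a, 0.0) + 1.0 / n
        pj[b] = pj.get(b, 0.0) + 1.0 / n
    return sum(p * np.log(p / (pi[a] * pj[b])) for (a, b), p in pij.items())

def chow_liu_edges(data):
    """Maximum spanning tree over mutual-information edge weights (Prim's algorithm)."""
    m = data.shape[1]
    mi = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            mi[i, j] = mi[j, i] = mutual_information(data, i, j)
    in_tree, edges = {0}, []
    while len(in_tree) < m:
        # pick the heaviest edge leaving the current tree
        u, v = max(((u, v) for u in in_tree for v in range(m) if v not in in_tree),
                   key=lambda e: mi[e[0], e[1]])
        edges.append((u, v))
        in_tree.add(v)
    return edges
```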
Mixture of trees (MT)

A mixture-of-trees model is defined to be a distribution of the form:

$$Q(v) = \sum_{k=1}^{l} \lambda_k T_k(v), \qquad \lambda_k \ge 0, \; k = 1, \ldots, l; \qquad \sum_{k=1}^{l} \lambda_k = 1$$

where each $T_k(v)$ is a CL tree. An MT can be viewed as containing an unobserved choice variable $z$, which takes the values $k \in \{1, \ldots, l\}$.
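In code, evaluating the mixture density is just a weighted sum of the component tree likelihoods (a sketch; the `likelihood` interface on the tree objects is my assumption):

```python
def mixture_density(v, lambdas, trees):
    """Q(v) = sum_k lambda_k * T_k(v).

    lambdas: mixture weights, nonnegative and summing to 1
    trees:   CL tree densities, each exposing likelihood(v) = T_k(v)
    """
    return sum(lam * T.likelihood(v) for lam, T in zip(lambdas, trees))
```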
Difference between MT & CL

$z$ can be any integer variable; in particular, when the unobserved variable $z$ is the class label, the MT is changed into the multi-CL tree.

CL is a supervised learning algorithm, which has to train one tree for each class.

MT is an unsupervised learning algorithm, which treats the class variable as part of the training data.
Optimization problem of MT

Given a dataset of observations

$$D = \{x_1, x_2, \ldots, x_N\}$$

we are required to find the mixture of trees $Q'$ that satisfies

$$Q' = \arg\max_Q \sum_{i=1}^{N} \log Q(X_i)$$

This optimization problem in a mixture model can be solved by the EM (Expectation Maximization) method.
We maximize (7) with respect to $\lambda_k$ and $T_k$ under the constraint

$$\sum_{k=1}^{l} \lambda_k = 1$$

We can obtain the update equation (writing $\gamma_k(x_i)$ for the posterior $P(z = k \mid x_i)$):

$$\lambda_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_k(x_i)$$

As for the second term of (7), it is in fact a CL procedure, so we can maximize it by finding a CL tree based on the data weighted by the posteriors $\gamma_k$.
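A schematic of the resulting EM loop (a sketch under my assumptions; `fit_chow_liu` stands in for the weighted CL step described above and is not code from the slides):

```python
import numpy as np

def em_mixture_of_trees(data, l, n_iters=50):
    """Schematic EM for a mixture of trees.

    Assumes fit_chow_liu(data, weights) returns a tree density with a
    likelihood(x) method, fitted to the weighted data (hypothetical helper).
    """
    N = len(data)
    lambdas = np.full(l, 1.0 / l)   # uniform initial mixture weights
    trees = [fit_chow_liu(data, np.full(N, 1.0 / N)) for _ in range(l)]
    for _ in range(n_iters):
        # E-step: responsibilities gamma_k(x_i) = P(z = k | x_i)
        gamma = np.array([[lambdas[k] * trees[k].likelihood(x) for k in range(l)]
                          for x in data])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: lambda_k = (1/N) sum_i gamma_k(x_i); refit each tree on weighted data
        lambdas = gamma.mean(axis=0)
        trees = [fit_chow_liu(data, gamma[:, k]) for k in range(l)]
    return lambdas, trees
```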
MT in classifiers

1. In the training phase, train the MT model on the training data over the domain $\{c\} \cup V$, where $c$ is the class variable and $V$ is the input domain.

2. In the testing phase, a new instance $x \in V$ is classified by picking the most likely value of the class variable given the settings of the other variables:

$$c(x) = \arg\max_{x_c} Q(x_c, x)$$
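A minimal sketch of this decision rule, reusing the hypothetical `mixture_density` above (how the class value is packed together with the features is my assumption):

```python
def classify(x, lambdas, trees, class_values):
    """c(x) = argmax over x_c of Q(x_c, x): try each class value, keep the most likely."""
    return max(class_values,
               key=lambda c: mixture_density((c, *x), lambdas, trees))
```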
Multi-CL in handwritten digit recognition

1. Feature extraction

The four basic configurations above are rotated in the four cardinal directions and applied to the characters in the six overlapping zones shown in the following figure.
Multi-CL in handwritten digit recognition

So we have $4 \times 4 \times 6 = 96$-dimensional features.
Multi-CL in handwritten digit recognition

1. For a given pattern, we calculate the probability that the pattern belongs to each class (for digits we have 10 classes: 0, 1, 2, …, 9).
2. We choose the class with the maximum probability as the classification result.

Here the probability that the pattern "2" belongs to class 2 is the maximum, so we classify it as the digit 2.
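For instance, with one CL tree trained per digit class, the decision step might look like this sketch (the `likelihood` interface and the `priors` argument are assumptions, not from the slides):

```python
import numpy as np

def classify_digit(features, class_trees, priors):
    """Pick the digit whose class-conditional CL tree gives the largest posterior score.

    class_trees: 10 CL tree densities, one per digit, each exposing likelihood(x)
    priors:      P(class) for each digit
    """
    scores = [p * T.likelihood(features) for T, p in zip(class_trees, priors)]
    return int(np.argmax(scores))  # e.g. returns 2 when class 2 scores highest
```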
Discussion

1. When all of the component trees have the same structure, the MT becomes a TANB model.
2. When $z$ is the class label, the MT becomes a CL model.
3. The MT is a generalization of the CL and naive Bayesian models, so its performance is expected to be better than NB, TANB, and CL.