Classifier Based on Mixture of Density Trees
Huang, Kaizhu
CSE Department, The Chinese University of Hong Kong
Basic Problem

Given a dataset

$$\{(X_1, C), (X_2, C), \ldots, (X_{N-1}, C), (X_N, C)\}$$

where $X_i$ stands for the training data and $C$ stands for the class label. Assuming we have $m$ classes, we estimate the probability

$$P(C_i \mid X), \quad i = 1, 2, \ldots, m \qquad (1)$$

The classifier is then denoted by:

$$c(X) = \arg\max_i P(C_i \mid X)$$

The key point is: how can we estimate the posterior probability (1)?
Density estimation problem

Given a dataset $D$

$$\{X_1, X_2, \ldots, X_{N-1}, X_N\}$$

where each $X_i$ is an instance of an $m$-variable vector

$$\{v_1, v_2, \ldots, v_{m-1}, v_m\}$$

the goal is to find a joint distribution $P(v_1, v_2, \ldots, v_{m-1}, v_m)$ which maximizes the negative entropy, i.e., the log-likelihood of $D$:

$$Q = \arg\max_P \sum_{i=1}^{N} \log P(X_i)$$
Interpretation

$$\frac{1}{N} \sum_{i=1}^{N} \log P(X_i) = \sum_{x \in \Omega} \tilde{P}(x) \log P(x)$$

where $\Omega$ is defined as the state space and $\tilde{P}$ is the empirical distribution of $D$. According to information theory, this measures how many bits are needed to describe $D$ based on the probability distribution $P$. Finding the maximizing $P$ is thus actually finding the well-known MDL (minimal description length) solution.
Graphical Density estimation

Naive Bayesian Network (NB)
Tree-Augmented Naive Bayesian Network (TANB)
Chow-Liu tree network (CL)
Mixture of trees network (MT)
(Figure: diagram relating the NB, TANB, Chow-Liu tree, and mixture-of-tree models; EM connects the tree models to the mixture of trees.)
Naive Bayesian Network

Given the problem in Slide 3, we make the following assumption about the variables: all the variables are independent, given the class label. Then the joint distribution $P$ can be written as:

$$P(v_1, v_2, \ldots, v_{m-1}, v_m \mid C) = \prod_{i=1}^{m} P(v_i \mid C)$$
Structure of NB

With this structure, it is easy to estimate the joint distribution, since we can obtain each $P(v_i \mid C)$ by the following simple accumulation (assuming $v_i$ is a discrete variable and $k$ is one of $v_i$'s values):

$$P(v_i = k \mid C) = \frac{N_{C,\, v_i = k}}{N_C}$$
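As a concrete illustration, here is a minimal sketch of this counting estimate for discrete features (the function and variable names are hypothetical, not from the slides):

```python
from collections import Counter, defaultdict

def estimate_nb_tables(X, y):
    """Estimate P(v_i = k | C) by counting over the training data.

    X: list of feature vectors (each a tuple of discrete values)
    y: list of class labels, one per row of X
    Returns a dict mapping (class, feature_index, value) -> probability.
    """
    class_counts = Counter(y)              # N_C
    value_counts = defaultdict(Counter)    # N_{C, v_i = k}
    for xs, c in zip(X, y):
        for i, k in enumerate(xs):
            value_counts[(c, i)][k] += 1
    return {(c, i, k): n / class_counts[c]     # N_{C, v_i = k} / N_C
            for (c, i), counter in value_counts.items()
            for k, n in counter.items()}
```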
Chow-Liu Tree network

Given the problem in Slide 3, we make the following assumption about the variables: each variable has a direct dependence relationship with just one other variable, and is conditionally independent of the other variables, given the class label. Thus the joint distribution can be written as:
$$P(v_1, v_2, \ldots, v_{m-1}, v_m \mid C) = \prod_{i=1}^{m} P(v_{l_i} \mid v_{l_{j(i)}}, C), \qquad 0 \le j(i) < i \quad (i = 1, 2, \ldots, m)$$

where $(l_1, l_2, \ldots, l_m)$ is an unknown permutation of the integers $1, 2, \ldots, m$, with the convention $P(v_{l_i} \mid v_{l_0}, C) = P(v_{l_i} \mid C)$.
Example of CL methods

Fig. 2 is an example of a CL tree where:
1. $v_3$ is conditionally dependent only on $v_4$, and conditionally independent of the other variables: $P(v_3 \mid v_4, B) = P(v_3 \mid v_4)$
2. $v_5$ is conditionally dependent only on $v_4$, and conditionally independent of the other variables: $P(v_5 \mid v_4, B) = P(v_5 \mid v_4)$
3. $v_2$ is conditionally dependent only on $v_3$, and conditionally independent of the other variables: $P(v_2 \mid v_3, B) = P(v_2 \mid v_3)$
4. $v_1$ is conditionally dependent only on $v_3$, and conditionally independent of the other variables: $P(v_1 \mid v_3, B) = P(v_1 \mid v_3)$
Under these assumptions the joint distribution factors step by step:

$$\begin{aligned}
P(v_1, v_2, v_3, v_4, v_5) &= P(v_4)\, P(v_1, v_2, v_3, v_5 \mid v_4) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_1, v_2, v_5 \mid v_3, v_4) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_1, v_2 \mid v_3, v_4)\, P(v_5 \mid v_1, v_2, v_3, v_4) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_5 \mid v_4)\, P(v_1, v_2 \mid v_3, v_4) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_5 \mid v_4)\, P(v_1, v_2 \mid v_3) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_5 \mid v_4)\, P(v_1 \mid v_2, v_3)\, P(v_2 \mid v_3) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_5 \mid v_4)\, P(v_1 \mid v_3)\, P(v_2 \mid v_3)
\end{aligned}$$
CL Tree
$$P(v_1, v_2, \ldots, v_{m-1}, v_m \mid C) = \prod_{i=1}^{m} P(v_{l_i} \mid v_{l_{j(i)}}, C), \qquad 0 \le j(i) < i$$

where $(l_1, l_2, \ldots, l_m)$ is an unknown permutation of the integers $1, 2, \ldots, m$.
The key point about the CL tree is that we use a product of two-dimensional distributions to approximate the high-dimensional distribution. The question, then, is how to find the product of two-dimensional distributions that approximates the high-dimensional distribution optimally.
CL tree algorithm

1. Obtain $P(v_i \mid v_j)$ and $P(v_i, v_j)$ for each pair $(v_i, v_j)$ by an accumulation (counting) process.
2. Calculate the mutual information between each pair of variables:

$$I(v_i, v_j) = \sum_{v_i, v_j} P(v_i, v_j) \log \frac{P(v_i, v_j)}{P(v_i)\, P(v_j)}$$

3. Use a maximum spanning tree algorithm to find the optimal tree structure, where the edge weight between two nodes $v_i$, $v_j$ is $I(v_i, v_j)$.

This CL algorithm was proved to be optimal in [1].
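A minimal sketch of this procedure, assuming discrete data in an $N \times m$ NumPy array (the function names are mine, not from the slides):

```python
import numpy as np

def mutual_information(data, i, j):
    """Empirical mutual information I(v_i, v_j) from discrete data (N x m array)."""
    n = len(data)
    pij, pi, pj = {}, {}, {}
    for row in data:
        a, b = row[i], row[j]
        pij[(a, b)] = pij.get((a, b), 0.0) + 1.0 / n
        pi[a] = pi.get(a, 0.0) + 1.0 / n
        pj[b] = pj.get(b, 0.0) + 1.0 / n
    return sum(p * np.log(p / (pi[a] * pj[b])) for (a, b), p in pij.items())

def chow_liu_edges(data):
    """Maximum spanning tree over mutual-information edge weights (Prim's algorithm)."""
    m = data.shape[1]
    mi = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            mi[i, j] = mi[j, i] = mutual_information(data, i, j)
    in_tree, edges = {0}, []
    while len(in_tree) < m:
        # pick the heaviest edge leaving the current tree
        u, v = max(((u, v) for u in in_tree for v in range(m) if v not in in_tree),
                   key=lambda e: mi[e[0], e[1]])
        edges.append((u, v))
        in_tree.add(v)
    return edges
```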
Mixture of trees (MT)

A mixture-of-trees model is defined to be a distribution of the form:

$$Q(v) = \sum_{k=1}^{l} \lambda_k T_k(v), \qquad \lambda_k \ge 0, \; k = 1, \ldots, l; \qquad \sum_{k=1}^{l} \lambda_k = 1$$

where each $T_k(v)$ is a CL tree. An MT can be viewed as containing an unobserved choice variable $z$, which takes the values $k \in \{1, \ldots, l\}$.
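In code, evaluating the mixture density is just a weighted sum of the component tree likelihoods (a sketch; the `likelihood` interface on the tree objects is my assumption):

```python
def mixture_density(v, lambdas, trees):
    """Q(v) = sum_k lambda_k * T_k(v).

    lambdas: mixture weights, nonnegative and summing to 1
    trees:   CL tree densities, each exposing likelihood(v) = T_k(v)
    """
    return sum(lam * T.likelihood(v) for lam, T in zip(lambdas, trees))
```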
Difference between MT & CL

$z$ can be any integer variable; in particular, when the unobserved variable $z$ is the class label, the MT is changed into the multi-CL tree.

CL is a supervised learning algorithm, which has to train one tree for each class.

MT is an unsupervised learning algorithm, which treats the class variable as part of the training data.
Optimization problem of MT

Given a dataset of observations

$$D = \{x_1, x_2, \ldots, x_N\}$$

we are required to find the mixture of trees $Q'$ that satisfies

$$Q' = \arg\max_Q \sum_{i=1}^{N} \log Q(X_i)$$

This optimization problem in a mixture model can be solved by the EM (Expectation Maximization) method.
We maximize (7) with respect to $\lambda_k$ and $T_k$ under the constraint

$$\sum_{k=1}^{l} \lambda_k = 1$$

We can obtain the update equation (writing $\gamma_k(x_i)$ for the posterior $P(z = k \mid x_i)$):

$$\lambda_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_k(x_i)$$

As for the second term of (7), it is in fact a CL procedure, so we can maximize it by finding a CL tree based on the data weighted by the posteriors $\gamma_k$.
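A schematic of the resulting EM loop (a sketch under my assumptions; `fit_chow_liu` stands in for the weighted CL step described above and is not code from the slides):

```python
import numpy as np

def em_mixture_of_trees(data, l, n_iters=50):
    """Schematic EM for a mixture of trees.

    Assumes fit_chow_liu(data, weights) returns a tree density with a
    likelihood(x) method, fitted to the weighted data (hypothetical helper).
    """
    N = len(data)
    lambdas = np.full(l, 1.0 / l)   # uniform initial mixture weights
    trees = [fit_chow_liu(data, np.full(N, 1.0 / N)) for _ in range(l)]
    for _ in range(n_iters):
        # E-step: responsibilities gamma_k(x_i) = P(z = k | x_i)
        gamma = np.array([[lambdas[k] * trees[k].likelihood(x) for k in range(l)]
                          for x in data])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: lambda_k = (1/N) sum_i gamma_k(x_i); refit each tree on weighted data
        lambdas = gamma.mean(axis=0)
        trees = [fit_chow_liu(data, gamma[:, k]) for k in range(l)]
    return lambdas, trees
```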
MT in classifiers

1. In the training phase, train the MT model on the training data over the domain $\{c\} \cup V$, where $c$ is the class variable and $V$ is the input domain.

2. In the testing phase, a new instance $x \in V$ is classified by picking the most likely value of the class variable given the settings of the other variables:

$$c(x) = \arg\max_{x_c} Q(x_c, x)$$
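A minimal sketch of this decision rule, reusing the hypothetical `mixture_density` above (how the class value is packed together with the features is my assumption):

```python
def classify(x, lambdas, trees, class_values):
    """c(x) = argmax over x_c of Q(x_c, x): try each class value, keep the most likely."""
    return max(class_values,
               key=lambda c: mixture_density((c, *x), lambdas, trees))
```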
Multi-CL in handwritten digit recognition

1. Feature extraction

The four basic configurations above are rotated in the four cardinal directions and applied to the characters in the six overlapping zones shown in the following figure.
Multi-CL in handwritten digit recognition

So we have $4 \times 4 \times 6 = 96$-dimensional features.
Multi-CL in handwritten digit recognition

1. For a given pattern, we calculate the probability that the pattern belongs to each class (for digits we have 10 classes: 0, 1, 2, …, 9).
2. We choose the class with the maximum probability as the classification result.

Here the probability that the pattern "2" belongs to class 2 is the maximum, so we classify it as the digit 2.
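For instance, with one CL tree trained per digit class, the decision step might look like this sketch (the `likelihood` interface and the `priors` argument are assumptions, not from the slides):

```python
import numpy as np

def classify_digit(features, class_trees, priors):
    """Pick the digit whose class-conditional CL tree gives the largest posterior score.

    class_trees: 10 CL tree densities, one per digit, each exposing likelihood(x)
    priors:      P(class) for each digit
    """
    scores = [p * T.likelihood(features) for T, p in zip(class_trees, priors)]
    return int(np.argmax(scores))  # e.g. returns 2 when class 2 scores highest
```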
Discussion

1. When all of the component trees have the same structure, the MT becomes a TANB model.
2. When $z$ is the class label, the MT becomes a CL model.
3. The MT is a generalization of the CL and naive Bayesian models, so its performance is expected to be better than NB, TANB, and CL.