Sergios Theodoridis
Konstantinos Koutroumbas
Version 2
PATTERN RECOGNITION

Typical application areas: machine vision, character recognition (OCR), computer-aided diagnosis, speech recognition, face recognition, biometrics, image database retrieval, data mining, bioinformatics.

The task: assign unknown objects (patterns) to the correct class. This is known as classification.
Features: These are measurable quantities obtained from the patterns, and the classification task is based on their respective values.
Feature vectors: A number of features $x_1, \dots, x_l$ constitute the feature vector

$\mathbf{x} = [x_1, \dots, x_l]^T \in \mathbb{R}^l$

Feature vectors are treated as random vectors.
An example: (figure)
The classifier consists of a set of functions whose values, computed at $\mathbf{x}$, determine the class to which the corresponding pattern belongs.

Classification system overview:

Patterns → sensor → feature generation → feature selection → classifier design → system evaluation
Supervised – unsupervised pattern recognition: the two major directions.

Supervised: Patterns whose class is known a priori are used for training.

Unsupervised: The number of classes is (in general) unknown and no training patterns are available.
CLASSIFIERS BASED ON BAYES DECISION THEORY

Statistical nature of feature vectors:

$\mathbf{x} = [x_1, x_2, \dots, x_l]^T$

Assign the pattern represented by feature vector $\mathbf{x}$ to the most probable of the available classes $\omega_1, \omega_2, \dots, \omega_M$.

That is, $\mathbf{x} \rightarrow \omega_i : P(\omega_i|\mathbf{x})$ is maximum.
Computation of a-posteriori probabilities. Assume known:

• the a-priori probabilities $P(\omega_1), P(\omega_2), \dots, P(\omega_M)$

• the class-conditional pdfs $p(\mathbf{x}|\omega_i),\ i = 1, 2, \dots, M$. This is also known as the likelihood of $\omega_i$ with respect to $\mathbf{x}$.
The Bayes rule (M=2):

$P(\omega_i|\mathbf{x}) = \dfrac{p(\mathbf{x}|\omega_i)\,P(\omega_i)}{p(\mathbf{x})}$

where

$p(\mathbf{x}) = \sum_{i=1}^{2} p(\mathbf{x}|\omega_i)\,P(\omega_i)$
The Bayes classification rule (for two classes, M=2): Given $\mathbf{x}$, classify it according to the rule:

If $P(\omega_1|\mathbf{x}) > P(\omega_2|\mathbf{x})$, $\mathbf{x} \rightarrow \omega_1$
If $P(\omega_2|\mathbf{x}) > P(\omega_1|\mathbf{x})$, $\mathbf{x} \rightarrow \omega_2$

Equivalently: classify $\mathbf{x}$ according to the rule

$p(\mathbf{x}|\omega_1)\,P(\omega_1)\ \gtrless\ p(\mathbf{x}|\omega_2)\,P(\omega_2)$

For equiprobable classes the test becomes

$p(\mathbf{x}|\omega_1)\ \gtrless\ p(\mathbf{x}|\omega_2)$
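To make the two-class rule concrete, here is a minimal Python sketch (the one-dimensional Gaussian class-conditional densities, priors and test points are illustrative assumptions, not taken from the slides): it evaluates $p(x|\omega_i)P(\omega_i)$ for both classes and picks the larger product.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def bayes_classify(x, priors, means, sigmas):
    """Return the class index maximizing p(x|w_i) * P(w_i)."""
    scores = [priors[i] * gaussian_pdf(x, means[i], sigmas[i]) for i in range(len(priors))]
    return int(np.argmax(scores))

# Illustrative (assumed) priors and class-conditional parameters.
priors = [0.5, 0.5]
means  = [0.0, 2.0]
sigmas = [1.0, 1.0]

for x in [-1.0, 0.8, 1.2, 3.0]:
    print(x, "-> omega", bayes_classify(x, priors, means, sigmas) + 1)
```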
(Figure: the two class-conditional pdfs and the threshold $x_0$, dividing the feature space into $R_1 \rightarrow \omega_1$ and $R_2 \rightarrow \omega_2$.)
Equivalently, in words: divide the space into two regions:

If $\mathbf{x} \in R_1 \Rightarrow \mathbf{x}$ in $\omega_1$
If $\mathbf{x} \in R_2 \Rightarrow \mathbf{x}$ in $\omega_2$

Probability of error (total shaded area):

$P_e = \dfrac{1}{2}\int_{-\infty}^{x_0} p(x|\omega_2)\,dx + \dfrac{1}{2}\int_{x_0}^{+\infty} p(x|\omega_1)\,dx$

The Bayesian classifier is OPTIMAL with respect to minimising the classification error probability!
Indeed: moving the threshold away from $x_0$, the total shaded area INCREASES by the extra “grey” area.
The Bayes classification rule for many (M>2) classes: Given $\mathbf{x}$, classify it to $\omega_i$ if:

$P(\omega_i|\mathbf{x}) > P(\omega_j|\mathbf{x}) \quad \forall j \neq i$

Such a choice also minimizes the classification error probability.

Minimizing the average risk: For each wrong decision, a penalty term is assigned, since some decisions are more sensitive than others.
For M=2:

• Define the loss matrix

$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$

• $\lambda_{12}$ is the penalty term for deciding class $\omega_2$, although the pattern belongs to $\omega_1$, etc.

Risk with respect to $\omega_1$:

$r_1 = \lambda_{11}\int_{R_1} p(\mathbf{x}|\omega_1)\,d\mathbf{x} + \lambda_{12}\int_{R_2} p(\mathbf{x}|\omega_1)\,d\mathbf{x}$
Risk with respect to $\omega_2$:

$r_2 = \lambda_{21}\int_{R_1} p(\mathbf{x}|\omega_2)\,d\mathbf{x} + \lambda_{22}\int_{R_2} p(\mathbf{x}|\omega_2)\,d\mathbf{x}$

Average risk:

$r = r_1 P(\omega_1) + r_2 P(\omega_2)$

Probabilities of wrong decisions, weighted by the penalty terms.
Choose $R_1$ and $R_2$ so that r is minimized. Then assign $\mathbf{x}$ to $\omega_i$ according to

$\ell_1 \equiv \lambda_{11}\,p(\mathbf{x}|\omega_1)P(\omega_1) + \lambda_{21}\,p(\mathbf{x}|\omega_2)P(\omega_2)$
$\ell_2 \equiv \lambda_{12}\,p(\mathbf{x}|\omega_1)P(\omega_1) + \lambda_{22}\,p(\mathbf{x}|\omega_2)P(\omega_2)$

Assign $\mathbf{x} \rightarrow \omega_1$ if $\ell_1 < \ell_2$, and $\mathbf{x} \rightarrow \omega_2$ if $\ell_2 < \ell_1$.

Equivalently: assign $\mathbf{x}$ to $\omega_1\ (\omega_2)$ if

$\ell_{12} \equiv \dfrac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)}\ >\ (<)\ \dfrac{P(\omega_2)}{P(\omega_1)}\cdot\dfrac{\lambda_{21}-\lambda_{22}}{\lambda_{12}-\lambda_{11}}$

$\ell_{12}$: likelihood ratio
If $P(\omega_1) = P(\omega_2) = \dfrac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:

$\mathbf{x} \rightarrow \omega_1$ if $P(\omega_1|\mathbf{x}) > \dfrac{\lambda_{21}}{\lambda_{12}}\,P(\omega_2|\mathbf{x})$

$\mathbf{x} \rightarrow \omega_2$ if $P(\omega_2|\mathbf{x}) > \dfrac{\lambda_{12}}{\lambda_{21}}\,P(\omega_1|\mathbf{x})$

If, in addition, $\lambda_{12} = \lambda_{21}$: minimum classification error probability.
An example:

$p(x|\omega_1) = \dfrac{1}{\sqrt{\pi}}\exp(-x^2)$

$p(x|\omega_2) = \dfrac{1}{\sqrt{\pi}}\exp\!\big(-(x-1)^2\big)$

$P(\omega_1) = P(\omega_2) = \dfrac{1}{2}$

$L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$
Then the threshold values are:

Threshold $x_0$ for minimum $P_e$:

$x_0:\ \exp(-x^2) = \exp\!\big(-(x-1)^2\big) \;\Rightarrow\; x_0 = \dfrac{1}{2}$

Threshold $\hat{x}_0$ for minimum r:

$\hat{x}_0:\ \exp(-x^2) = 2\exp\!\big(-(x-1)^2\big) \;\Rightarrow\; \hat{x}_0 = \dfrac{1-\ln 2}{2}$
Thus $\hat{x}_0$ moves to the left of $x_0 = \dfrac{1}{2}$ (WHY?)
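A short numerical check of this example (assuming the pdfs, priors and loss matrix given above) finds both thresholds on a grid and confirms that the minimum-risk threshold is $(1-\ln 2)/2 \approx 0.153$, to the left of $x_0 = 0.5$:

```python
import numpy as np

p1 = lambda x: np.exp(-x ** 2) / np.sqrt(np.pi)            # p(x|w1) from the example
p2 = lambda x: np.exp(-(x - 1.0) ** 2) / np.sqrt(np.pi)     # p(x|w2)
P1 = P2 = 0.5                                               # equal priors
lam12, lam21 = 0.5, 1.0                                     # loss terms (lam11 = lam22 = 0)

xs = np.linspace(-1.0, 2.0, 300001)

# Minimum-error threshold: p(x|w1)P(w1) = p(x|w2)P(w2)
x0 = xs[np.argmin(np.abs(p1(xs) * P1 - p2(xs) * P2))]

# Minimum-risk threshold: lam12 * p(x|w1)P(w1) = lam21 * p(x|w2)P(w2)
x0_hat = xs[np.argmin(np.abs(lam12 * p1(xs) * P1 - lam21 * p2(xs) * P2))]

print(x0, x0_hat, (1 - np.log(2)) / 2)   # ~0.5, ~0.153, 0.1534...
```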
DISCRIMINANT FUNCTIONS AND DECISION SURFACES
If the regions $R_i$, $R_j$ are contiguous, with

$R_i:\ P(\omega_i|\mathbf{x}) > P(\omega_j|\mathbf{x})$ and $R_j:\ P(\omega_j|\mathbf{x}) > P(\omega_i|\mathbf{x})$,

then

$g(\mathbf{x}) \equiv P(\omega_i|\mathbf{x}) - P(\omega_j|\mathbf{x}) = 0$

is the surface separating the regions. On one side it is positive (+), on the other negative (−). It is known as the Decision Surface.
If f(·) is monotonically increasing, the rule remains the same if we use:

$\mathbf{x} \rightarrow \omega_i$ if $f\big(P(\omega_i|\mathbf{x})\big) > f\big(P(\omega_j|\mathbf{x})\big) \quad \forall j \neq i$

$g_i(\mathbf{x}) \equiv f\big(P(\omega_i|\mathbf{x})\big)$ is a discriminant function.

In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable.
BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS

Multivariate Gaussian pdf:

$p(\mathbf{x}|\omega_i) = \dfrac{1}{(2\pi)^{l/2}\,|\Sigma_i|^{1/2}}\exp\!\Big(-\dfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\Big)$

$\boldsymbol{\mu}_i = E[\mathbf{x}]$ (mean of class $\omega_i$)

$\Sigma_i = E\big[(\mathbf{x}-\boldsymbol{\mu}_i)(\mathbf{x}-\boldsymbol{\mu}_i)^T\big]$, called the covariance matrix.
$\ln(\cdot)$ is monotonic. Define:

$g_i(\mathbf{x}) = \ln\big(p(\mathbf{x}|\omega_i)P(\omega_i)\big) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$

$g_i(\mathbf{x}) = -\dfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i) + C_i$

where $C_i = -\dfrac{l}{2}\ln(2\pi) - \dfrac{1}{2}\ln|\Sigma_i|$

Example: $\Sigma_i = \begin{pmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_i^2 \end{pmatrix}$
That is, $g_i(\mathbf{x})$ is quadratic and the surfaces $g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$ are quadrics: ellipsoids, parabolas, hyperbolas, pairs of lines.

For example:

$g_i(\mathbf{x}) = -\dfrac{1}{2\sigma_i^2}\big(x_1^2 + x_2^2\big) + \dfrac{1}{\sigma_i^2}\big(\mu_{i1}x_1 + \mu_{i2}x_2\big) - \dfrac{1}{2\sigma_i^2}\big(\mu_{i1}^2 + \mu_{i2}^2\big) + \ln P(\omega_i) + C_i$
Decision Hyperplanes

Quadratic terms: $\mathbf{x}^T\Sigma_i^{-1}\mathbf{x}$

If ALL $\Sigma_i = \Sigma$ (the same), the quadratic terms are not of interest; they are not involved in the comparisons. Then, equivalently, we can write:

$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}$

$\mathbf{w}_i = \Sigma^{-1}\boldsymbol{\mu}_i$

$w_{i0} = \ln P(\omega_i) - \dfrac{1}{2}\boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i$

Discriminant functions are LINEAR.
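A minimal sketch of these linear discriminants (the class means, shared covariance, priors and the test point below are illustrative assumptions):

```python
import numpy as np

def linear_discriminants(means, Sigma, priors):
    """Return the pairs (w_i, w_i0) of g_i(x) = w_i^T x + w_i0 for a shared covariance."""
    Sigma_inv = np.linalg.inv(Sigma)
    return [(Sigma_inv @ mu, np.log(P) - 0.5 * mu @ Sigma_inv @ mu)
            for mu, P in zip(means, priors)]

# Illustrative two-class setup in R^2.
means  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma  = np.array([[1.0, 0.2], [0.2, 1.0]])
priors = [0.5, 0.5]

params = linear_discriminants(means, Sigma, priors)
x = np.array([1.0, 1.0])
scores = [w @ x + w0 for (w, w0) in params]
print("assign x to omega", int(np.argmax(scores)) + 1)   # omega 1 for this x
```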
Let, in addition, $\Sigma = \sigma^2 I$. Then:

$g_{ij}(\mathbf{x}) \equiv g_i(\mathbf{x}) - g_j(\mathbf{x}) = \mathbf{w}^T(\mathbf{x} - \mathbf{x}_0) = 0$

with

$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$

$\mathbf{x}_0 = \dfrac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \sigma^2\ln\Big(\dfrac{P(\omega_i)}{P(\omega_j)}\Big)\dfrac{\boldsymbol{\mu}_i - \boldsymbol{\mu}_j}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2}$
Nondiagonal $\Sigma$: the decision hyperplane is

$g_{ij}(\mathbf{x}) = \mathbf{w}^T(\mathbf{x} - \mathbf{x}_0) = 0$

$\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$

$\mathbf{x}_0 = \dfrac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \ln\Big(\dfrac{P(\omega_i)}{P(\omega_j)}\Big)\dfrac{\boldsymbol{\mu}_i - \boldsymbol{\mu}_j}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|_{\Sigma^{-1}}^2}$

where $\|\mathbf{x}\|_{\Sigma^{-1}} \equiv (\mathbf{x}^T\Sigma^{-1}\mathbf{x})^{1/2}$.

The hyperplane is not normal to $\boldsymbol{\mu}_i - \boldsymbol{\mu}_j$; it is normal to $\Sigma^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$.
Minimum Distance Classifiers

For equiprobable classes, $P(\omega_i) = \dfrac{1}{M}$, and

$g_i(\mathbf{x}) = -\dfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)$

• If $\Sigma = \sigma^2 I$: assign $\mathbf{x} \rightarrow \omega_i$ with the smaller Euclidean distance $d_E = \|\mathbf{x} - \boldsymbol{\mu}_i\|$.

• If $\Sigma \neq \sigma^2 I$: assign $\mathbf{x} \rightarrow \omega_i$ with the smaller Mahalanobis distance $d_m = \big((\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\big)^{1/2}$.
Example: Given $\omega_1, \omega_2$ with $P(\omega_1) = P(\omega_2)$ and $p(\mathbf{x}|\omega_1) = N(\boldsymbol{\mu}_1, \Sigma)$, $p(\mathbf{x}|\omega_2) = N(\boldsymbol{\mu}_2, \Sigma)$, where

$\boldsymbol{\mu}_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\quad \boldsymbol{\mu}_2 = \begin{pmatrix} 3 \\ 3 \end{pmatrix},\quad \Sigma = \begin{pmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{pmatrix}$

classify the vector $\mathbf{x} = \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix}$ using Bayesian classification.

$\Sigma^{-1} = \begin{pmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{pmatrix}$

Compute the Mahalanobis distances from $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2$:

$d_{m,1}^2 = (1.0,\ 2.2)\,\Sigma^{-1}\begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix} = 2.952,\qquad d_{m,2}^2 = (-2.0,\ -0.8)\,\Sigma^{-1}\begin{pmatrix} -2.0 \\ -0.8 \end{pmatrix} = 3.672$

Classify $\mathbf{x} \rightarrow \omega_1$. Observe that $d_{E,2} < d_{E,1}$.
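The numbers in this example are easy to verify; the following sketch simply reproduces the computation above:

```python
import numpy as np

mu1 = np.array([0.0, 0.0])
mu2 = np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
Sigma_inv = np.linalg.inv(Sigma)            # [[0.95, -0.15], [-0.15, 0.55]]

x = np.array([1.0, 2.2])

d2_m1 = (x - mu1) @ Sigma_inv @ (x - mu1)   # squared Mahalanobis distance from mu1
d2_m2 = (x - mu2) @ Sigma_inv @ (x - mu2)
d_e1, d_e2 = np.linalg.norm(x - mu1), np.linalg.norm(x - mu2)

print(d2_m1, d2_m2)   # 2.952, 3.672  -> classify x to omega 1
print(d_e1, d_e2)     # 2.417, 2.154  -> the Euclidean distance would prefer omega 2
```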
ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS

Maximum Likelihood

Let $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$ be known and independent. Let $p(\mathbf{x})$ be known within an unknown vector parameter $\boldsymbol{\theta}$: $p(\mathbf{x}) \equiv p(\mathbf{x};\boldsymbol{\theta})$.

The method: with $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$,

$p(X;\boldsymbol{\theta}) \equiv p(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N;\boldsymbol{\theta}) = \prod_{k=1}^{N} p(\mathbf{x}_k;\boldsymbol{\theta})$

which is known as the likelihood of $\boldsymbol{\theta}$ with respect to $X$.
$\hat{\boldsymbol{\theta}}_{ML} = \arg\max_{\boldsymbol{\theta}} \prod_{k=1}^{N} p(\mathbf{x}_k;\boldsymbol{\theta})$

$L(\boldsymbol{\theta}) \equiv \ln p(X;\boldsymbol{\theta}) = \sum_{k=1}^{N} \ln p(\mathbf{x}_k;\boldsymbol{\theta})$

$\hat{\boldsymbol{\theta}}_{ML}:\ \dfrac{\partial L(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = \sum_{k=1}^{N}\dfrac{1}{p(\mathbf{x}_k;\boldsymbol{\theta})}\dfrac{\partial p(\mathbf{x}_k;\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = 0$
Asymptotically unbiased and consistent: if, indeed, there is a $\boldsymbol{\theta}_0$ such that $p(\mathbf{x}) = p(\mathbf{x};\boldsymbol{\theta}_0)$, then

$\lim_{N\to\infty} E\big[\hat{\boldsymbol{\theta}}_{ML}\big] = \boldsymbol{\theta}_0$

$\lim_{N\to\infty} E\big[\|\hat{\boldsymbol{\theta}}_{ML} - \boldsymbol{\theta}_0\|^2\big] = 0$
Example: Let $p(\mathbf{x})$ be $N(\boldsymbol{\mu}, \Sigma)$, with $\boldsymbol{\mu}$ unknown, $\Sigma$ known, and $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$ the available samples:

$p(\mathbf{x}_k;\boldsymbol{\mu}) = \dfrac{1}{(2\pi)^{l/2}|\Sigma|^{1/2}}\exp\!\Big(-\dfrac{1}{2}(\mathbf{x}_k-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu})\Big)$

$L(\boldsymbol{\mu}) = \ln\prod_{k=1}^{N} p(\mathbf{x}_k;\boldsymbol{\mu}) = C - \dfrac{1}{2}\sum_{k=1}^{N}(\mathbf{x}_k-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu})$

$\dfrac{\partial L(\boldsymbol{\mu})}{\partial\boldsymbol{\mu}} = \sum_{k=1}^{N}\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu}) = 0 \;\Rightarrow\; \hat{\boldsymbol{\mu}}_{ML} = \dfrac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k$

(Remember: if $f(\boldsymbol{\mu}) = \boldsymbol{\mu}^T A\,\boldsymbol{\mu}$ with $A$ symmetric, then $\dfrac{\partial f}{\partial\boldsymbol{\mu}} = 2A\boldsymbol{\mu}$.)
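A minimal sketch of the ML estimate derived above: sample from an assumed Gaussian (the "true" mean and covariance are illustrative) and compare the sample mean against it.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_true = np.array([1.0, -2.0])               # assumed "true" mean (illustrative)
Sigma   = np.array([[1.0, 0.3], [0.3, 2.0]])  # known covariance
N = 1000

X = rng.multivariate_normal(mu_true, Sigma, size=N)

mu_ml = X.mean(axis=0)       # ML estimate: (1/N) * sum_k x_k
print(mu_ml)                 # close to [1.0, -2.0] for large N
```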
Maximum A Posteriori (MAP) Probability Estimation

In the ML method, $\boldsymbol{\theta}$ was considered a parameter. Here we shall look at $\boldsymbol{\theta}$ as a random vector described by a pdf $p(\boldsymbol{\theta})$, assumed to be known.

Given $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$, compute the maximum of $p(\boldsymbol{\theta}|X)$.

From the Bayes theorem:

$p(\boldsymbol{\theta})\,p(X|\boldsymbol{\theta}) = p(X)\,p(\boldsymbol{\theta}|X)$, or $p(\boldsymbol{\theta}|X) = \dfrac{p(\boldsymbol{\theta})\,p(X|\boldsymbol{\theta})}{p(X)}$
The method:

$\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta}|X)$, or

$\hat{\boldsymbol{\theta}}_{MAP}:\ \dfrac{\partial}{\partial\boldsymbol{\theta}}\big(p(\boldsymbol{\theta})\,p(X|\boldsymbol{\theta})\big) = 0$

If $p(\boldsymbol{\theta})$ is uniform or broad enough, $\hat{\boldsymbol{\theta}}_{MAP} \approx \hat{\boldsymbol{\theta}}_{ML}$.
Example: Let $p(\mathbf{x}|\boldsymbol{\mu})$ be $N(\boldsymbol{\mu}, \sigma^2 I)$, with $\boldsymbol{\mu}$ unknown and $X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, and let

$p(\boldsymbol{\mu}) = \dfrac{1}{(2\pi)^{l/2}\sigma_\mu^l}\exp\!\Big(-\dfrac{\|\boldsymbol{\mu}-\boldsymbol{\mu}_0\|^2}{2\sigma_\mu^2}\Big)$

The MAP estimate is given by

$\hat{\boldsymbol{\mu}}_{MAP}:\ \dfrac{\partial}{\partial\boldsymbol{\mu}}\ln\Big(p(\boldsymbol{\mu})\prod_{k=1}^{N} p(\mathbf{x}_k|\boldsymbol{\mu})\Big) = 0$, or

$\dfrac{1}{\sigma^2}\sum_{k=1}^{N}(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_{MAP}) - \dfrac{1}{\sigma_\mu^2}(\hat{\boldsymbol{\mu}}_{MAP} - \boldsymbol{\mu}_0) = 0 \;\Rightarrow\; \hat{\boldsymbol{\mu}}_{MAP} = \dfrac{\boldsymbol{\mu}_0 + \dfrac{\sigma_\mu^2}{\sigma^2}\sum_{k=1}^{N}\mathbf{x}_k}{1 + \dfrac{\sigma_\mu^2}{\sigma^2}N}$

For $\dfrac{\sigma_\mu^2}{\sigma^2} \gg 1$, or for $N \to \infty$:

$\hat{\boldsymbol{\mu}}_{MAP} \approx \hat{\boldsymbol{\mu}}_{ML} = \dfrac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k$
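The closed-form MAP estimate of this example can be compared with the ML estimate in a few lines (the scalar data model, prior parameters and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

mu_true, sigma = 2.0, 1.0      # scalar data model N(mu, sigma^2), mu unknown
mu0, sigma_mu  = 0.0, 0.5      # Gaussian prior on mu: N(mu0, sigma_mu^2)
N = 20

x = rng.normal(mu_true, sigma, size=N)

mu_ml = x.mean()
ratio = sigma_mu ** 2 / sigma ** 2
mu_map = (mu0 + ratio * x.sum()) / (1 + ratio * N)   # formula from the example

print(mu_ml, mu_map)   # mu_map is pulled toward mu0; for large N the two coincide
```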
Bayesian Inference

ML and MAP compute a single estimate for $\boldsymbol{\theta}$. Here a different route is followed.

Given: $X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, $p(\mathbf{x}|\boldsymbol{\theta})$ and $p(\boldsymbol{\theta})$.

The goal: estimate $p(\mathbf{x}|X)$. How?
$p(\mathbf{x}|X) = \int p(\mathbf{x}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|X)\,d\boldsymbol{\theta}$

$p(\boldsymbol{\theta}|X) = \dfrac{p(X|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{p(X)} = \dfrac{p(X|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int p(X|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}}$

$p(X|\boldsymbol{\theta}) = \prod_{k=1}^{N} p(\mathbf{x}_k|\boldsymbol{\theta})$

A bit more insight via an example. Let

$p(x|\mu)$ be $N(\mu, \sigma^2)$ and $p(\mu)$ be $N(\mu_0, \sigma_0^2)$.

It turns out that $p(\mu|X)$ is $N(\mu_N, \sigma_N^2)$, with

$\mu_N = \dfrac{N\sigma_0^2\,\bar{x} + \sigma^2\mu_0}{N\sigma_0^2 + \sigma^2},\qquad \sigma_N^2 = \dfrac{\sigma^2\sigma_0^2}{N\sigma_0^2 + \sigma^2},\qquad \bar{x} = \dfrac{1}{N}\sum_{k=1}^{N}x_k$
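A sketch of how the posterior parameters behave as N grows (the prior and data-generating values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

mu_true, sigma = 2.0, 1.0      # p(x|mu) = N(mu, sigma^2)
mu0, sigma0    = 0.0, 2.0      # prior p(mu) = N(mu0, sigma0^2)

for N in (1, 10, 100, 1000):
    x = rng.normal(mu_true, sigma, size=N)
    x_bar = x.mean()
    mu_N  = (N * sigma0 ** 2 * x_bar + sigma ** 2 * mu0) / (N * sigma0 ** 2 + sigma ** 2)
    var_N = (sigma ** 2 * sigma0 ** 2) / (N * sigma0 ** 2 + sigma ** 2)
    print(N, round(mu_N, 3), round(var_N, 5))   # posterior mean -> sample mean, variance -> 0
```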
The above is a sequence of Gaussians as $N \to \infty$.

Maximum Entropy

Entropy: $H = -\int_{\mathbf{x}} p(\mathbf{x})\ln p(\mathbf{x})\,d\mathbf{x}$

$\hat{p}(\mathbf{x})$: maximize $H$ subject to the available constraints.
Example: $x$ is nonzero in the interval $x_1 \le x \le x_2$ and zero otherwise. Compute the ME pdf.

• The constraint: $\int_{x_1}^{x_2} p(x)\,dx = 1$

• Lagrange multipliers: $H_L = H + \lambda\Big(\int_{x_1}^{x_2} p(x)\,dx - 1\Big)$

• $\hat{p}(x) = \exp(\lambda - 1)$

$\hat{p}(x) = \begin{cases}\dfrac{1}{x_2 - x_1}, & x_1 \le x \le x_2\\ 0, & \text{otherwise}\end{cases}$
Mixture Models

$p(\mathbf{x}) = \sum_{j=1}^{J} p(\mathbf{x}|j)\,P_j,\qquad \sum_{j=1}^{J} P_j = 1,\qquad \int_{\mathbf{x}} p(\mathbf{x}|j)\,d\mathbf{x} = 1$

Assume parametric modeling, i.e., $p(\mathbf{x}|j;\boldsymbol{\theta})$.

The goal is to estimate $\boldsymbol{\theta}$ and $P_1, P_2, \dots, P_J$, given the set $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$.

Why not ML? As before:

$\max_{\boldsymbol{\theta},P_1,\dots,P_J}\prod_{k=1}^{N} p(\mathbf{x}_k;\boldsymbol{\theta},P_1,\dots,P_J)$
This is a nonlinear problem due to the missing label information. This is a typical problem with an incomplete data set.

The Expectation-Maximisation (EM) algorithm.

• General formulation: let $\mathbf{y} \in Y \subseteq R^m$ be the complete data set, with pdf $p_y(\mathbf{y};\boldsymbol{\theta})$, which is not observed directly.

We observe instead $\mathbf{x} = g(\mathbf{y}) \in X_{ob} \subseteq R^l$, $l < m$, with pdf $p_x(\mathbf{x};\boldsymbol{\theta})$: a many-to-one transformation.
• Let $Y(\mathbf{x}) \subseteq Y$ be the subset of all $\mathbf{y}$'s mapping to a specific $\mathbf{x}$. Then

$p_x(\mathbf{x};\boldsymbol{\theta}) = \int_{Y(\mathbf{x})} p_y(\mathbf{y};\boldsymbol{\theta})\,d\mathbf{y}$

• What we need is to compute

$\hat{\boldsymbol{\theta}}_{ML}:\ \sum_{k}\dfrac{\partial\ln p_y(\mathbf{y}_k;\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = 0$

• But the $\mathbf{y}_k$'s are not observed. Here comes the EM: maximize the expectation of the loglikelihood, conditioned on the observed samples and the current iteration estimate of $\boldsymbol{\theta}$.
The algorithm:

• E-step: $Q(\boldsymbol{\theta};\boldsymbol{\theta}(t)) = E\big[\textstyle\sum_k \ln p_y(\mathbf{y}_k;\boldsymbol{\theta}) \,\big|\, X;\boldsymbol{\theta}(t)\big]$

• M-step: $\boldsymbol{\theta}(t+1):\ \dfrac{\partial Q(\boldsymbol{\theta};\boldsymbol{\theta}(t))}{\partial\boldsymbol{\theta}} = 0$

Application to the mixture modeling problem:

• Complete data: $(\mathbf{x}_k, j_k),\ k = 1, 2, \dots, N$

• Observed data: $\mathbf{x}_k,\ k = 1, 2, \dots, N$

• $p(\mathbf{x}_k, j_k;\boldsymbol{\theta}) = p(\mathbf{x}_k|j_k;\boldsymbol{\theta})\,P_{j_k}$

• Assuming mutual independence: $L(\boldsymbol{\theta}) = \sum_{k=1}^{N}\ln\big(p(\mathbf{x}_k|j_k;\boldsymbol{\theta})\,P_{j_k}\big)$
• Unknown parameters: $\boldsymbol{\Theta}^T = [\boldsymbol{\theta}^T, P^T]$, $P^T = [P_1, P_2, \dots, P_J]$

• E-step:

$Q(\boldsymbol{\Theta};\boldsymbol{\Theta}(t)) = E\Big[\sum_{k=1}^{N}\ln\big(p(\mathbf{x}_k|j_k;\boldsymbol{\theta})\,P_{j_k}\big)\Big] = \sum_{k=1}^{N}\sum_{j=1}^{J} P(j|\mathbf{x}_k;\boldsymbol{\Theta}(t))\,\ln\big(p(\mathbf{x}_k|j;\boldsymbol{\theta})\,P_j\big)$

$P(j|\mathbf{x}_k;\boldsymbol{\Theta}(t)) = \dfrac{p(\mathbf{x}_k|j;\boldsymbol{\theta}(t))\,P_j(t)}{p(\mathbf{x}_k;\boldsymbol{\Theta}(t))},\qquad p(\mathbf{x}_k;\boldsymbol{\Theta}(t)) = \sum_{j=1}^{J} p(\mathbf{x}_k|j;\boldsymbol{\theta}(t))\,P_j(t)$

• M-step:

$\dfrac{\partial Q}{\partial\boldsymbol{\theta}} = 0,\qquad \dfrac{\partial Q}{\partial P_j} = 0,\ j = 1, 2, \dots, J$
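For Gaussian components, the E- and M-steps above reduce to the familiar responsibility/update iteration. A minimal 1-D sketch (the number of components, initialisation and synthetic data are illustrative assumptions):

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, J=2, iters=100):
    """EM for a 1-D mixture of J Gaussians; returns mixing weights, means, variances."""
    N = len(x)
    P   = np.full(J, 1.0 / J)                    # mixing probabilities P_j
    mu  = np.linspace(x.min(), x.max(), J)       # crude initialisation
    var = np.full(J, x.var())
    for _ in range(iters):
        # E-step: responsibilities P(j | x_k; Theta(t))
        r = np.array([P[j] * normal_pdf(x, mu[j], var[j]) for j in range(J)]).T
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate P_j, mu_j, var_j
        Nj  = r.sum(axis=0)
        P   = Nj / N
        mu  = (r * x[:, None]).sum(axis=0) / Nj
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return P, mu, var

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(x))   # weights ~ (0.6, 0.4), means ~ (-2, 3)
```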
Nonparametric Estimation

Consider an interval of length $h$ centered at $\hat{x}$, and let $k_N$ be the number of the $N$ total points that fall inside it. Then

$\hat{P} \approx \dfrac{k_N}{N},\qquad \hat{p}(x) \approx \hat{p}(\hat{x}) \equiv \dfrac{1}{h}\dfrac{k_N}{N},\qquad |x - \hat{x}| \le \dfrac{h}{2}$

If $p(x)$ is continuous, $\hat{p}(x) \to p(x)$ as $N \to \infty$, if $h_N \to 0$, $k_N \to \infty$, $\dfrac{k_N}{N} \to 0$.
Parzen Windows

Divide the multidimensional space into hypercubes.
Define

$\phi(\mathbf{x}_i) = \begin{cases}1, & |x_{ij}| \le \dfrac{1}{2},\ j = 1, \dots, l\\ 0, & \text{otherwise}\end{cases}$

• That is, it is 1 inside a unit-side hypercube centered at 0.

• $\hat{p}(\mathbf{x}) = \dfrac{1}{Nh^l}\sum_{i=1}^{N}\phi\Big(\dfrac{\mathbf{x}_i - \mathbf{x}}{h}\Big) = \dfrac{1}{\text{volume}}\cdot\dfrac{\text{number of points inside an }h\text{-side hypercube centered at }\mathbf{x}}{N}$

• The problem: $p(\mathbf{x})$ is continuous, while $\phi(\cdot)$ is discontinuous.

• Parzen windows – kernels – potential functions: use a smooth $\phi(\mathbf{x})$, with $\phi(\mathbf{x}) \ge 0$ and $\int_{\mathbf{x}}\phi(\mathbf{x})\,d\mathbf{x} = 1$.
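A minimal 1-D sketch of the hypercube-kernel estimator defined above (the synthetic data and the h values are illustrative assumptions):

```python
import numpy as np

def parzen_estimate(x_eval, samples, h):
    """p_hat(x) = (1/(N*h)) * sum_i phi((x_i - x)/h) with the unit-box kernel phi (1-D)."""
    u = (samples[None, :] - x_eval[:, None]) / h        # shape (len(x_eval), N)
    phi = (np.abs(u) <= 0.5).astype(float)              # 1 inside the unit-side box, else 0
    return phi.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(4)
samples = rng.normal(0.0, 1.0, 1000)
xs = np.linspace(-3.0, 3.0, 7)
for h in (0.1, 0.8):
    print(h, np.round(parzen_estimate(xs, samples, h), 3))   # rougher estimate for small h
```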
Mean value:

$E[\hat{p}(\mathbf{x})] = \dfrac{1}{Nh^l}\sum_{i=1}^{N} E\Big[\phi\Big(\dfrac{\mathbf{x}_i - \mathbf{x}}{h}\Big)\Big] = \int_{\mathbf{x}'}\dfrac{1}{h^l}\phi\Big(\dfrac{\mathbf{x}' - \mathbf{x}}{h}\Big)p(\mathbf{x}')\,d\mathbf{x}'$

• As $h \to 0$, the width of $\phi\big(\frac{\mathbf{x}'-\mathbf{x}}{h}\big)$ tends to 0, while $\dfrac{1}{h^l}\int_{\mathbf{x}'}\phi\big(\frac{\mathbf{x}'-\mathbf{x}}{h}\big)\,d\mathbf{x}' = 1$.

• Hence, $\dfrac{1}{h^l}\phi\big(\frac{\mathbf{x}'-\mathbf{x}}{h}\big) \xrightarrow{h\to 0} \delta(\mathbf{x}'-\mathbf{x})$, and

$E[\hat{p}(\mathbf{x})] \xrightarrow{h\to 0} p(\mathbf{x})$

Hence unbiased in the limit.
Variance: the smaller the h, the higher the variance. (Figures: $h=0.1$, $N=1000$ and $h=0.8$, $N=1000$.)
The higher the N, the better the accuracy. (Figure: $h=0.1$, $N=10000$.)
If

• $h \to 0$
• $N \to \infty$
• $Nh^l \to \infty$

the estimate is asymptotically unbiased.

The method. Remember the likelihood-ratio test:

$\ell_{12} \equiv \dfrac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)}\ \gtrless\ \dfrac{P(\omega_2)}{P(\omega_1)}\cdot\dfrac{\lambda_{21}-\lambda_{22}}{\lambda_{12}-\lambda_{11}}$

Using Parzen estimates for each class (with $N_1$, $N_2$ training points, respectively):

$\ell_{12}(\mathbf{x}) \approx \dfrac{\dfrac{1}{N_1 h^l}\sum_{i=1}^{N_1}\phi\big(\frac{\mathbf{x}_i-\mathbf{x}}{h}\big)}{\dfrac{1}{N_2 h^l}\sum_{i=1}^{N_2}\phi\big(\frac{\mathbf{x}_i-\mathbf{x}}{h}\big)}$
CURSE OF DIMENSIONALITY
In all the methods so far, we saw that the higher the number of points, N, the better the resulting estimate.

If, in the one-dimensional space, an interval filled with N points is adequate (for good estimation), then in the two-dimensional space the corresponding square will require $N^2$ points, and in the ℓ-dimensional space the ℓ-dimensional cube will require $N^\ell$ points.

The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.
NAIVE – BAYES CLASSIFIER
Let $\mathbf{x} \in \mathbb{R}^\ell$ and the goal be to estimate $p(\mathbf{x}|\omega_i)$, i = 1, 2, …, M. For a “good” estimate of the pdf one would need, say, $N^\ell$ points.

Assume $x_1, x_2, \dots, x_\ell$ mutually independent. Then:

$p(\mathbf{x}|\omega_i) = \prod_{j=1}^{\ell} p(x_j|\omega_i)$

In this case, one would require, roughly, N points for each pdf. Thus, a number of points of the order N·ℓ would suffice.

It turns out that the Naïve–Bayes classifier works reasonably well even in cases that violate the independence assumption.
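A minimal Gaussian naive-Bayes sketch illustrating the factorisation above (per-feature Gaussian pdfs and the synthetic data are assumptions of this sketch, not requirements of the slide):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class priors and per-feature means/variances (factorised class-conditional pdf)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return params

def predict_nb(x, params):
    """Pick the class maximising log P(w_i) + sum_j log p(x_j | w_i)."""
    best, best_score = None, -np.inf
    for c, (prior, mu, var) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        score = np.log(prior) + log_lik
        if score > best_score:
            best, best_score = c, score
    return best

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(2.0, 1.0, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
print(predict_nb(np.array([1.8, 2.1, 2.0]), fit_gaussian_nb(X, y)))   # expected: 1
```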
K Nearest Neighbor Density Estimation

In Parzen: the volume is constant, while the number of points falling inside the volume is varying.

Now: keep the number of points $k$ constant, and let the volume vary:

$\hat{p}(\mathbf{x}) = \dfrac{k}{N\,V(\mathbf{x})}$
• The corresponding likelihood ratio is then

$\dfrac{\hat{p}(\mathbf{x}|\omega_1)}{\hat{p}(\mathbf{x}|\omega_2)} = \dfrac{k/(N_1 V_1)}{k/(N_2 V_2)} = \dfrac{N_2 V_2}{N_1 V_1}$
The Nearest Neighbor Rule

• Choose k out of the N training vectors, and identify the k nearest ones to $\mathbf{x}$.

• Out of these k, identify the number $k_i$ that belong to class $\omega_i$.

• Assign $\mathbf{x} \rightarrow \omega_i:\ k_i > k_j\ \ \forall j \neq i$

The simplest version: k=1 !!!

For large N this is not bad. It can be shown that, if $P_B$ is the optimal Bayesian error probability, then:

$P_B \le P_{NN} \le P_B\Big(2 - \dfrac{M}{M-1}P_B\Big) \le 2P_B$
$P_B \le P_{kNN} \le P_B + \sqrt{\dfrac{2P_{NN}}{k}}$

$k \to \infty:\ P_{kNN} \to P_B$

For small $P_B$:

$P_{NN} \approx 2P_B,\qquad P_{3NN} \approx P_B + 3(P_B)^2$
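A minimal sketch of the k-nearest-neighbor rule described above (the synthetic training data and the value of k are illustrative assumptions):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Assign x to the class most represented among its k nearest training vectors."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(6)
X_train = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.5]), X_train, y_train, k=5))   # expected: 1
```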
Voronoi tessellation:

$R_i = \{\mathbf{x}: d(\mathbf{x}, \mathbf{x}_i) < d(\mathbf{x}, \mathbf{x}_j)\ \ \forall j \neq i\}$
BAYESIAN NETWORKS

Bayes probability chain rule:

$p(x_1, x_2, \dots, x_\ell) = p(x_\ell|x_{\ell-1},\dots,x_1)\,p(x_{\ell-1}|x_{\ell-2},\dots,x_1)\cdots p(x_2|x_1)\,p(x_1)$

Assume now that the conditional dependence for each $x_i$ is limited to a subset of the features appearing in each of the product terms. That is:

$p(x_1, x_2, \dots, x_\ell) = p(x_1)\prod_{i=2}^{\ell} p(x_i|A_i)$

where $A_i \subseteq \{x_{i-1}, x_{i-2}, \dots, x_1\}$
For example, if ℓ=6, then we could assume:

$p(x_6|x_5,\dots,x_1) = p(x_6|x_5, x_4)$

Then: $A_6 = \{x_5, x_4\} \subset \{x_5,\dots,x_1\}$

The above is a generalization of the Naïve–Bayes. For the Naïve–Bayes the assumption is: $A_i = \emptyset$, for i = 1, 2, …, ℓ.
A graphical way to portray conditional dependencies is given below. According to this figure we have that:

• x6 is conditionally dependent on x4, x5
• x5 on x4
• x4 on x1, x2
• x3 on x2
• x1, x2 are conditionally independent of the other variables.

For this case:

$p(x_1, x_2, \dots, x_6) = p(x_6|x_4, x_5)\,p(x_5|x_4)\,p(x_4|x_1, x_2)\,p(x_3|x_2)\,p(x_1)\,p(x_2)$
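A sketch computing the joint probability from this factorisation for binary variables (the conditional probability tables below are hypothetical illustration values, not taken from the slides):

```python
# Hypothetical CPTs for binary variables x1..x6 (indexed by parent values).
p_x1 = {0: 0.6, 1: 0.4}                                              # P(x1)
p_x2 = {0: 0.7, 1: 0.3}                                              # P(x2)
p_x3_x2 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}                 # P(x3 | x2)
p_x4_x1x2 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.6, 1: 0.4},
             (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {0: 0.2, 1: 0.8}}     # P(x4 | x1, x2)
p_x5_x4 = {0: {0: 0.75, 1: 0.25}, 1: {0: 0.2, 1: 0.8}}               # P(x5 | x4)
p_x6_x4x5 = {(0, 0): {0: 0.95, 1: 0.05}, (0, 1): {0: 0.6, 1: 0.4},
             (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}     # P(x6 | x4, x5)

def joint(x1, x2, x3, x4, x5, x6):
    """p(x1..x6) = p(x6|x4,x5) p(x5|x4) p(x4|x1,x2) p(x3|x2) p(x1) p(x2)"""
    return (p_x6_x4x5[(x4, x5)][x6] * p_x5_x4[x4][x5] * p_x4_x1x2[(x1, x2)][x4]
            * p_x3_x2[x2][x3] * p_x1[x1] * p_x2[x2])

# The 2^6 joint probabilities must sum to 1 if the CPTs are valid.
print(sum(joint(*(int(b) for b in format(i, "06b"))) for i in range(64)))   # 1.0
```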
Bayesian Networks
Definition: A Bayesian Network is a directed acyclic graph (DAG) where the nodes correspond to random variables. Each node is associated with a set of conditional probabilities (densities), p(xi|Ai), where xi is the variable associated with the node and Ai is the set of its parents in the graph.
A Bayesian Network is specified by:

• The marginal probabilities of its root nodes.
• The conditional probabilities of the non-root nodes, given their parents, for ALL possible combinations.
The figure below is an example of a Bayesian Network corresponding to a paradigm from the medical applications field.
This Bayesian network models conditional dependencies for an example concerning smokers (S), tendencies to develop cancer (C) and heart disease (H), together with variables corresponding to heart (H1, H2) and cancer (C1, C2) medical tests.
Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root nodes) and the conditional (non-root nodes) probabilities.
Training: Once a topology is given, probabilities are estimated via the training data set. There are also methods that learn the topology.
Probability Inference: This is the most common task that Bayesian networks help us to solve efficiently. Given the values of some of the variables in the graph, known as evidence, the goal is to compute the conditional probabilities for some of the other variables, given the evidence.
Example: Consider the Bayesian network of the figure:
a) If x is measured to be x=1 (x1), compute P(w=0|x=1) [P(w0|x1)].
b) If w is measured to be w=1 (w1) compute P(x=0|w=1) [ P(x0|w1)].
For a), a set of calculations are required that propagate from node x to node w. It turns out that P(w0|x1) = 0.63.
For b), the propagation is reversed in direction. It turns out that P(x0|w1) = 0.4.
In general, the required inference information is computed via a combined process of “message passing” among the nodes of the DAG.
Complexity: For singly connected graphs, message-passing algorithms amount to a complexity linear in the number of nodes.