Sergios Theodoridis
Konstantinos Koutroumbas
Version 2
PATTERN RECOGNITION

Typical application areas: machine vision, character recognition (OCR), computer-aided diagnosis, speech recognition, face recognition, biometrics, image database retrieval, data mining, bioinformatics.

The task: assign unknown objects (patterns) to the correct class. This is known as classification.
Features: These are measurable quantities obtained from the patterns, and the classification task is based on their respective values.
Feature vectors: A number of features $x_1, \dots, x_l$ constitute the feature vector

$\mathbf{x} = [x_1, \dots, x_l]^T \in \mathbb{R}^l$

Feature vectors are treated as random vectors.
An example: (figure)
The classifier consists of a set of functions whose values, computed at $\mathbf{x}$, determine the class to which the corresponding pattern belongs.

Classification system overview:

Patterns → sensor → feature generation → feature selection → classifier design → system evaluation
Supervised – unsupervised pattern recognition: the two major directions.

Supervised: Patterns whose class is known a priori are used for training.

Unsupervised: The number of classes is (in general) unknown and no training patterns are available.
CLASSIFIERS BASED ON BAYES DECISION THEORY

Statistical nature of feature vectors:

$\mathbf{x} = [x_1, x_2, \dots, x_l]^T$

Assign the pattern represented by feature vector $\mathbf{x}$ to the most probable of the available classes $\omega_1, \omega_2, \dots, \omega_M$.

That is, $\mathbf{x} \rightarrow \omega_i : P(\omega_i|\mathbf{x})$ is maximum.
Computation of a-posteriori probabilities. Assume known:

• the a-priori probabilities $P(\omega_1), P(\omega_2), \dots, P(\omega_M)$

• the class-conditional pdfs $p(\mathbf{x}|\omega_i),\ i = 1, 2, \dots, M$. This is also known as the likelihood of $\omega_i$ with respect to $\mathbf{x}$.
The Bayes rule (M=2):

$P(\omega_i|\mathbf{x}) = \dfrac{p(\mathbf{x}|\omega_i)\,P(\omega_i)}{p(\mathbf{x})}$

where

$p(\mathbf{x}) = \sum_{i=1}^{2} p(\mathbf{x}|\omega_i)\,P(\omega_i)$
The Bayes classification rule (for two classes, M=2): Given $\mathbf{x}$, classify it according to the rule:

If $P(\omega_1|\mathbf{x}) > P(\omega_2|\mathbf{x})$, $\mathbf{x} \rightarrow \omega_1$
If $P(\omega_2|\mathbf{x}) > P(\omega_1|\mathbf{x})$, $\mathbf{x} \rightarrow \omega_2$

Equivalently: classify $\mathbf{x}$ according to the rule

$p(\mathbf{x}|\omega_1)\,P(\omega_1)\ \gtrless\ p(\mathbf{x}|\omega_2)\,P(\omega_2)$

For equiprobable classes the test becomes

$p(\mathbf{x}|\omega_1)\ \gtrless\ p(\mathbf{x}|\omega_2)$
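To make the two-class rule concrete, here is a minimal Python sketch (the one-dimensional Gaussian class-conditional densities, priors and test points are illustrative assumptions, not taken from the slides): it evaluates $p(x|\omega_i)P(\omega_i)$ for both classes and picks the larger product.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def bayes_classify(x, priors, means, sigmas):
    """Return the class index maximizing p(x|w_i) * P(w_i)."""
    scores = [priors[i] * gaussian_pdf(x, means[i], sigmas[i]) for i in range(len(priors))]
    return int(np.argmax(scores))

# Illustrative (assumed) priors and class-conditional parameters.
priors = [0.5, 0.5]
means  = [0.0, 2.0]
sigmas = [1.0, 1.0]

for x in [-1.0, 0.8, 1.2, 3.0]:
    print(x, "-> omega", bayes_classify(x, priors, means, sigmas) + 1)
```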
(Figure: the two class-conditional pdfs and the threshold $x_0$, dividing the feature space into $R_1 \rightarrow \omega_1$ and $R_2 \rightarrow \omega_2$.)
Equivalently, in words: divide the space into two regions:

If $\mathbf{x} \in R_1 \Rightarrow \mathbf{x}$ in $\omega_1$
If $\mathbf{x} \in R_2 \Rightarrow \mathbf{x}$ in $\omega_2$

Probability of error (total shaded area):

$P_e = \dfrac{1}{2}\int_{-\infty}^{x_0} p(x|\omega_2)\,dx + \dfrac{1}{2}\int_{x_0}^{+\infty} p(x|\omega_1)\,dx$

The Bayesian classifier is OPTIMAL with respect to minimising the classification error probability!
Indeed: moving the threshold away from $x_0$, the total shaded area INCREASES by the extra “grey” area.
The Bayes classification rule for many (M>2) classes: Given $\mathbf{x}$, classify it to $\omega_i$ if:

$P(\omega_i|\mathbf{x}) > P(\omega_j|\mathbf{x}) \quad \forall j \neq i$

Such a choice also minimizes the classification error probability.

Minimizing the average risk: For each wrong decision, a penalty term is assigned, since some decisions are more sensitive than others.
For M=2:

• Define the loss matrix

$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$

• $\lambda_{12}$ is the penalty term for deciding class $\omega_2$, although the pattern belongs to $\omega_1$, etc.

Risk with respect to $\omega_1$:

$r_1 = \lambda_{11}\int_{R_1} p(\mathbf{x}|\omega_1)\,d\mathbf{x} + \lambda_{12}\int_{R_2} p(\mathbf{x}|\omega_1)\,d\mathbf{x}$
Risk with respect to $\omega_2$:

$r_2 = \lambda_{21}\int_{R_1} p(\mathbf{x}|\omega_2)\,d\mathbf{x} + \lambda_{22}\int_{R_2} p(\mathbf{x}|\omega_2)\,d\mathbf{x}$

Average risk:

$r = r_1 P(\omega_1) + r_2 P(\omega_2)$

Probabilities of wrong decisions, weighted by the penalty terms.
Choose $R_1$ and $R_2$ so that r is minimized. Then assign $\mathbf{x}$ to $\omega_i$ according to

$\ell_1 \equiv \lambda_{11}\,p(\mathbf{x}|\omega_1)P(\omega_1) + \lambda_{21}\,p(\mathbf{x}|\omega_2)P(\omega_2)$
$\ell_2 \equiv \lambda_{12}\,p(\mathbf{x}|\omega_1)P(\omega_1) + \lambda_{22}\,p(\mathbf{x}|\omega_2)P(\omega_2)$

Assign $\mathbf{x} \rightarrow \omega_1$ if $\ell_1 < \ell_2$, and $\mathbf{x} \rightarrow \omega_2$ if $\ell_2 < \ell_1$.

Equivalently: assign $\mathbf{x}$ to $\omega_1\ (\omega_2)$ if

$\ell_{12} \equiv \dfrac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)}\ >\ (<)\ \dfrac{P(\omega_2)}{P(\omega_1)}\cdot\dfrac{\lambda_{21}-\lambda_{22}}{\lambda_{12}-\lambda_{11}}$

$\ell_{12}$: likelihood ratio
If $P(\omega_1) = P(\omega_2) = \dfrac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:

$\mathbf{x} \rightarrow \omega_1$ if $P(\omega_1|\mathbf{x}) > \dfrac{\lambda_{21}}{\lambda_{12}}\,P(\omega_2|\mathbf{x})$

$\mathbf{x} \rightarrow \omega_2$ if $P(\omega_2|\mathbf{x}) > \dfrac{\lambda_{12}}{\lambda_{21}}\,P(\omega_1|\mathbf{x})$

If, in addition, $\lambda_{12} = \lambda_{21}$: minimum classification error probability.
An example:

$p(x|\omega_1) = \dfrac{1}{\sqrt{\pi}}\exp(-x^2)$

$p(x|\omega_2) = \dfrac{1}{\sqrt{\pi}}\exp\!\big(-(x-1)^2\big)$

$P(\omega_1) = P(\omega_2) = \dfrac{1}{2}$

$L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$
Then the threshold values are:

Threshold $x_0$ for minimum $P_e$:

$x_0:\ \exp(-x^2) = \exp\!\big(-(x-1)^2\big) \;\Rightarrow\; x_0 = \dfrac{1}{2}$

Threshold $\hat{x}_0$ for minimum r:

$\hat{x}_0:\ \exp(-x^2) = 2\exp\!\big(-(x-1)^2\big) \;\Rightarrow\; \hat{x}_0 = \dfrac{1-\ln 2}{2}$
Thus $\hat{x}_0$ moves to the left of $x_0 = \dfrac{1}{2}$ (WHY?)
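A short numerical check of this example (assuming the pdfs, priors and loss matrix given above) finds both thresholds on a grid and confirms that the minimum-risk threshold is $(1-\ln 2)/2 \approx 0.153$, to the left of $x_0 = 0.5$:

```python
import numpy as np

p1 = lambda x: np.exp(-x ** 2) / np.sqrt(np.pi)            # p(x|w1) from the example
p2 = lambda x: np.exp(-(x - 1.0) ** 2) / np.sqrt(np.pi)     # p(x|w2)
P1 = P2 = 0.5                                               # equal priors
lam12, lam21 = 0.5, 1.0                                     # loss terms (lam11 = lam22 = 0)

xs = np.linspace(-1.0, 2.0, 300001)

# Minimum-error threshold: p(x|w1)P(w1) = p(x|w2)P(w2)
x0 = xs[np.argmin(np.abs(p1(xs) * P1 - p2(xs) * P2))]

# Minimum-risk threshold: lam12 * p(x|w1)P(w1) = lam21 * p(x|w2)P(w2)
x0_hat = xs[np.argmin(np.abs(lam12 * p1(xs) * P1 - lam21 * p2(xs) * P2))]

print(x0, x0_hat, (1 - np.log(2)) / 2)   # ~0.5, ~0.153, 0.1534...
```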
DISCRIMINANT FUNCTIONS AND DECISION SURFACES
If the regions $R_i$, $R_j$ are contiguous, with

$R_i:\ P(\omega_i|\mathbf{x}) > P(\omega_j|\mathbf{x})$ and $R_j:\ P(\omega_j|\mathbf{x}) > P(\omega_i|\mathbf{x})$,

then

$g(\mathbf{x}) \equiv P(\omega_i|\mathbf{x}) - P(\omega_j|\mathbf{x}) = 0$

is the surface separating the regions. On one side it is positive (+), on the other negative (−). It is known as the Decision Surface.
If f(·) is monotonically increasing, the rule remains the same if we use:

$\mathbf{x} \rightarrow \omega_i$ if $f\big(P(\omega_i|\mathbf{x})\big) > f\big(P(\omega_j|\mathbf{x})\big) \quad \forall j \neq i$

$g_i(\mathbf{x}) \equiv f\big(P(\omega_i|\mathbf{x})\big)$ is a discriminant function.

In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable.
BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS

Multivariate Gaussian pdf:

$p(\mathbf{x}|\omega_i) = \dfrac{1}{(2\pi)^{l/2}\,|\Sigma_i|^{1/2}}\exp\!\Big(-\dfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\Big)$

$\boldsymbol{\mu}_i = E[\mathbf{x}]$ (mean of class $\omega_i$)

$\Sigma_i = E\big[(\mathbf{x}-\boldsymbol{\mu}_i)(\mathbf{x}-\boldsymbol{\mu}_i)^T\big]$, called the covariance matrix.
$\ln(\cdot)$ is monotonic. Define:

$g_i(\mathbf{x}) = \ln\big(p(\mathbf{x}|\omega_i)P(\omega_i)\big) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$

$g_i(\mathbf{x}) = -\dfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i) + C_i$

where $C_i = -\dfrac{l}{2}\ln(2\pi) - \dfrac{1}{2}\ln|\Sigma_i|$

Example: $\Sigma_i = \begin{pmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_i^2 \end{pmatrix}$
That is, $g_i(\mathbf{x})$ is quadratic and the surfaces $g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$ are quadrics: ellipsoids, parabolas, hyperbolas, pairs of lines.

For example:

$g_i(\mathbf{x}) = -\dfrac{1}{2\sigma_i^2}\big(x_1^2 + x_2^2\big) + \dfrac{1}{\sigma_i^2}\big(\mu_{i1}x_1 + \mu_{i2}x_2\big) - \dfrac{1}{2\sigma_i^2}\big(\mu_{i1}^2 + \mu_{i2}^2\big) + \ln P(\omega_i) + C_i$
Decision Hyperplanes

Quadratic terms: $\mathbf{x}^T\Sigma_i^{-1}\mathbf{x}$

If ALL $\Sigma_i = \Sigma$ (the same), the quadratic terms are not of interest; they are not involved in the comparisons. Then, equivalently, we can write:

$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}$

$\mathbf{w}_i = \Sigma^{-1}\boldsymbol{\mu}_i$

$w_{i0} = \ln P(\omega_i) - \dfrac{1}{2}\boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i$

Discriminant functions are LINEAR.
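A minimal sketch of these linear discriminants (the class means, shared covariance, priors and the test point below are illustrative assumptions):

```python
import numpy as np

def linear_discriminants(means, Sigma, priors):
    """Return the pairs (w_i, w_i0) of g_i(x) = w_i^T x + w_i0 for a shared covariance."""
    Sigma_inv = np.linalg.inv(Sigma)
    return [(Sigma_inv @ mu, np.log(P) - 0.5 * mu @ Sigma_inv @ mu)
            for mu, P in zip(means, priors)]

# Illustrative two-class setup in R^2.
means  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma  = np.array([[1.0, 0.2], [0.2, 1.0]])
priors = [0.5, 0.5]

params = linear_discriminants(means, Sigma, priors)
x = np.array([1.0, 1.0])
scores = [w @ x + w0 for (w, w0) in params]
print("assign x to omega", int(np.argmax(scores)) + 1)   # omega 1 for this x
```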
Let, in addition, $\Sigma = \sigma^2 I$. Then:

$g_{ij}(\mathbf{x}) \equiv g_i(\mathbf{x}) - g_j(\mathbf{x}) = \mathbf{w}^T(\mathbf{x} - \mathbf{x}_0) = 0$

with

$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$

$\mathbf{x}_0 = \dfrac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \sigma^2\ln\Big(\dfrac{P(\omega_i)}{P(\omega_j)}\Big)\dfrac{\boldsymbol{\mu}_i - \boldsymbol{\mu}_j}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2}$
Nondiagonal $\Sigma$: the decision hyperplane is

$g_{ij}(\mathbf{x}) = \mathbf{w}^T(\mathbf{x} - \mathbf{x}_0) = 0$

$\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$

$\mathbf{x}_0 = \dfrac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \ln\Big(\dfrac{P(\omega_i)}{P(\omega_j)}\Big)\dfrac{\boldsymbol{\mu}_i - \boldsymbol{\mu}_j}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|_{\Sigma^{-1}}^2}$

where $\|\mathbf{x}\|_{\Sigma^{-1}} \equiv (\mathbf{x}^T\Sigma^{-1}\mathbf{x})^{1/2}$.

The hyperplane is not normal to $\boldsymbol{\mu}_i - \boldsymbol{\mu}_j$; it is normal to $\Sigma^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$.
Minimum Distance Classifiers

For equiprobable classes, $P(\omega_i) = \dfrac{1}{M}$, and

$g_i(\mathbf{x}) = -\dfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)$

• If $\Sigma = \sigma^2 I$: assign $\mathbf{x} \rightarrow \omega_i$ with the smaller Euclidean distance $d_E = \|\mathbf{x} - \boldsymbol{\mu}_i\|$.

• If $\Sigma \neq \sigma^2 I$: assign $\mathbf{x} \rightarrow \omega_i$ with the smaller Mahalanobis distance $d_m = \big((\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\big)^{1/2}$.
Example: Given $\omega_1, \omega_2$ with $P(\omega_1) = P(\omega_2)$ and $p(\mathbf{x}|\omega_1) = N(\boldsymbol{\mu}_1, \Sigma)$, $p(\mathbf{x}|\omega_2) = N(\boldsymbol{\mu}_2, \Sigma)$, where

$\boldsymbol{\mu}_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\quad \boldsymbol{\mu}_2 = \begin{pmatrix} 3 \\ 3 \end{pmatrix},\quad \Sigma = \begin{pmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{pmatrix}$

classify the vector $\mathbf{x} = \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix}$ using Bayesian classification.

$\Sigma^{-1} = \begin{pmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{pmatrix}$

Compute the Mahalanobis distances from $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2$:

$d_{m,1}^2 = (1.0,\ 2.2)\,\Sigma^{-1}\begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix} = 2.952,\qquad d_{m,2}^2 = (-2.0,\ -0.8)\,\Sigma^{-1}\begin{pmatrix} -2.0 \\ -0.8 \end{pmatrix} = 3.672$

Classify $\mathbf{x} \rightarrow \omega_1$. Observe that $d_{E,2} < d_{E,1}$.
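The numbers in this example are easy to verify; the following sketch simply reproduces the computation above:

```python
import numpy as np

mu1 = np.array([0.0, 0.0])
mu2 = np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
Sigma_inv = np.linalg.inv(Sigma)            # [[0.95, -0.15], [-0.15, 0.55]]

x = np.array([1.0, 2.2])

d2_m1 = (x - mu1) @ Sigma_inv @ (x - mu1)   # squared Mahalanobis distance from mu1
d2_m2 = (x - mu2) @ Sigma_inv @ (x - mu2)
d_e1, d_e2 = np.linalg.norm(x - mu1), np.linalg.norm(x - mu2)

print(d2_m1, d2_m2)   # 2.952, 3.672  -> classify x to omega 1
print(d_e1, d_e2)     # 2.417, 2.154  -> the Euclidean distance would prefer omega 2
```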
ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS

Maximum Likelihood

Let $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$ be known and independent. Let $p(\mathbf{x})$ be known within an unknown vector parameter $\boldsymbol{\theta}$: $p(\mathbf{x}) \equiv p(\mathbf{x};\boldsymbol{\theta})$.

The method: with $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$,

$p(X;\boldsymbol{\theta}) \equiv p(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N;\boldsymbol{\theta}) = \prod_{k=1}^{N} p(\mathbf{x}_k;\boldsymbol{\theta})$

which is known as the likelihood of $\boldsymbol{\theta}$ with respect to $X$.
$\hat{\boldsymbol{\theta}}_{ML} = \arg\max_{\boldsymbol{\theta}} \prod_{k=1}^{N} p(\mathbf{x}_k;\boldsymbol{\theta})$

$L(\boldsymbol{\theta}) \equiv \ln p(X;\boldsymbol{\theta}) = \sum_{k=1}^{N} \ln p(\mathbf{x}_k;\boldsymbol{\theta})$

$\hat{\boldsymbol{\theta}}_{ML}:\ \dfrac{\partial L(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = \sum_{k=1}^{N}\dfrac{1}{p(\mathbf{x}_k;\boldsymbol{\theta})}\dfrac{\partial p(\mathbf{x}_k;\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = 0$
Asymptotically unbiased and consistent: if, indeed, there is a $\boldsymbol{\theta}_0$ such that $p(\mathbf{x}) = p(\mathbf{x};\boldsymbol{\theta}_0)$, then

$\lim_{N\to\infty} E\big[\hat{\boldsymbol{\theta}}_{ML}\big] = \boldsymbol{\theta}_0$

$\lim_{N\to\infty} E\big[\|\hat{\boldsymbol{\theta}}_{ML} - \boldsymbol{\theta}_0\|^2\big] = 0$
Example: Let $p(\mathbf{x})$ be $N(\boldsymbol{\mu}, \Sigma)$, with $\boldsymbol{\mu}$ unknown, $\Sigma$ known, and $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$ the available samples:

$p(\mathbf{x}_k;\boldsymbol{\mu}) = \dfrac{1}{(2\pi)^{l/2}|\Sigma|^{1/2}}\exp\!\Big(-\dfrac{1}{2}(\mathbf{x}_k-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu})\Big)$

$L(\boldsymbol{\mu}) = \ln\prod_{k=1}^{N} p(\mathbf{x}_k;\boldsymbol{\mu}) = C - \dfrac{1}{2}\sum_{k=1}^{N}(\mathbf{x}_k-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu})$

$\dfrac{\partial L(\boldsymbol{\mu})}{\partial\boldsymbol{\mu}} = \sum_{k=1}^{N}\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu}) = 0 \;\Rightarrow\; \hat{\boldsymbol{\mu}}_{ML} = \dfrac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k$

(Remember: if $f(\boldsymbol{\mu}) = \boldsymbol{\mu}^T A\,\boldsymbol{\mu}$ with $A$ symmetric, then $\dfrac{\partial f}{\partial\boldsymbol{\mu}} = 2A\boldsymbol{\mu}$.)
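A minimal sketch of the ML estimate derived above: sample from an assumed Gaussian (the "true" mean and covariance are illustrative) and compare the sample mean against it.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_true = np.array([1.0, -2.0])               # assumed "true" mean (illustrative)
Sigma   = np.array([[1.0, 0.3], [0.3, 2.0]])  # known covariance
N = 1000

X = rng.multivariate_normal(mu_true, Sigma, size=N)

mu_ml = X.mean(axis=0)       # ML estimate: (1/N) * sum_k x_k
print(mu_ml)                 # close to [1.0, -2.0] for large N
```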
Maximum A Posteriori (MAP) Probability Estimation

In the ML method, $\boldsymbol{\theta}$ was considered a parameter. Here we shall look at $\boldsymbol{\theta}$ as a random vector described by a pdf $p(\boldsymbol{\theta})$, assumed to be known.

Given $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$, compute the maximum of $p(\boldsymbol{\theta}|X)$.

From the Bayes theorem:

$p(\boldsymbol{\theta})\,p(X|\boldsymbol{\theta}) = p(X)\,p(\boldsymbol{\theta}|X)$, or $p(\boldsymbol{\theta}|X) = \dfrac{p(\boldsymbol{\theta})\,p(X|\boldsymbol{\theta})}{p(X)}$
The method:

$\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta}|X)$, or

$\hat{\boldsymbol{\theta}}_{MAP}:\ \dfrac{\partial}{\partial\boldsymbol{\theta}}\big(p(\boldsymbol{\theta})\,p(X|\boldsymbol{\theta})\big) = 0$

If $p(\boldsymbol{\theta})$ is uniform or broad enough, $\hat{\boldsymbol{\theta}}_{MAP} \approx \hat{\boldsymbol{\theta}}_{ML}$.
Example: Let $p(\mathbf{x}|\boldsymbol{\mu})$ be $N(\boldsymbol{\mu}, \sigma^2 I)$, with $\boldsymbol{\mu}$ unknown and $X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, and let

$p(\boldsymbol{\mu}) = \dfrac{1}{(2\pi)^{l/2}\sigma_\mu^l}\exp\!\Big(-\dfrac{\|\boldsymbol{\mu}-\boldsymbol{\mu}_0\|^2}{2\sigma_\mu^2}\Big)$

The MAP estimate is given by

$\hat{\boldsymbol{\mu}}_{MAP}:\ \dfrac{\partial}{\partial\boldsymbol{\mu}}\ln\Big(p(\boldsymbol{\mu})\prod_{k=1}^{N} p(\mathbf{x}_k|\boldsymbol{\mu})\Big) = 0$, or

$\dfrac{1}{\sigma^2}\sum_{k=1}^{N}(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_{MAP}) - \dfrac{1}{\sigma_\mu^2}(\hat{\boldsymbol{\mu}}_{MAP} - \boldsymbol{\mu}_0) = 0 \;\Rightarrow\; \hat{\boldsymbol{\mu}}_{MAP} = \dfrac{\boldsymbol{\mu}_0 + \dfrac{\sigma_\mu^2}{\sigma^2}\sum_{k=1}^{N}\mathbf{x}_k}{1 + \dfrac{\sigma_\mu^2}{\sigma^2}N}$

For $\dfrac{\sigma_\mu^2}{\sigma^2} \gg 1$, or for $N \to \infty$:

$\hat{\boldsymbol{\mu}}_{MAP} \approx \hat{\boldsymbol{\mu}}_{ML} = \dfrac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k$
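The closed-form MAP estimate of this example can be compared with the ML estimate in a few lines (the scalar data model, prior parameters and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

mu_true, sigma = 2.0, 1.0      # scalar data model N(mu, sigma^2), mu unknown
mu0, sigma_mu  = 0.0, 0.5      # Gaussian prior on mu: N(mu0, sigma_mu^2)
N = 20

x = rng.normal(mu_true, sigma, size=N)

mu_ml = x.mean()
ratio = sigma_mu ** 2 / sigma ** 2
mu_map = (mu0 + ratio * x.sum()) / (1 + ratio * N)   # formula from the example

print(mu_ml, mu_map)   # mu_map is pulled toward mu0; for large N the two coincide
```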
Bayesian Inference

ML and MAP compute a single estimate for $\boldsymbol{\theta}$. Here a different route is followed.

Given: $X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, $p(\mathbf{x}|\boldsymbol{\theta})$ and $p(\boldsymbol{\theta})$.

The goal: estimate $p(\mathbf{x}|X)$. How?
$p(\mathbf{x}|X) = \int p(\mathbf{x}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|X)\,d\boldsymbol{\theta}$

$p(\boldsymbol{\theta}|X) = \dfrac{p(X|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{p(X)} = \dfrac{p(X|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int p(X|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}}$

$p(X|\boldsymbol{\theta}) = \prod_{k=1}^{N} p(\mathbf{x}_k|\boldsymbol{\theta})$

A bit more insight via an example. Let

$p(x|\mu)$ be $N(\mu, \sigma^2)$ and $p(\mu)$ be $N(\mu_0, \sigma_0^2)$.

It turns out that $p(\mu|X)$ is $N(\mu_N, \sigma_N^2)$, with

$\mu_N = \dfrac{N\sigma_0^2\,\bar{x} + \sigma^2\mu_0}{N\sigma_0^2 + \sigma^2},\qquad \sigma_N^2 = \dfrac{\sigma^2\sigma_0^2}{N\sigma_0^2 + \sigma^2},\qquad \bar{x} = \dfrac{1}{N}\sum_{k=1}^{N}x_k$
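A sketch of how the posterior parameters behave as N grows (the prior and data-generating values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

mu_true, sigma = 2.0, 1.0      # p(x|mu) = N(mu, sigma^2)
mu0, sigma0    = 0.0, 2.0      # prior p(mu) = N(mu0, sigma0^2)

for N in (1, 10, 100, 1000):
    x = rng.normal(mu_true, sigma, size=N)
    x_bar = x.mean()
    mu_N  = (N * sigma0 ** 2 * x_bar + sigma ** 2 * mu0) / (N * sigma0 ** 2 + sigma ** 2)
    var_N = (sigma ** 2 * sigma0 ** 2) / (N * sigma0 ** 2 + sigma ** 2)
    print(N, round(mu_N, 3), round(var_N, 5))   # posterior mean -> sample mean, variance -> 0
```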
The above is a sequence of Gaussians as $N \to \infty$.

Maximum Entropy

Entropy: $H = -\int_{\mathbf{x}} p(\mathbf{x})\ln p(\mathbf{x})\,d\mathbf{x}$

$\hat{p}(\mathbf{x})$: maximize $H$ subject to the available constraints.
Example: $x$ is nonzero in the interval $x_1 \le x \le x_2$ and zero otherwise. Compute the ME pdf.

• The constraint: $\int_{x_1}^{x_2} p(x)\,dx = 1$

• Lagrange multipliers: $H_L = H + \lambda\Big(\int_{x_1}^{x_2} p(x)\,dx - 1\Big)$

• $\hat{p}(x) = \exp(\lambda - 1)$

$\hat{p}(x) = \begin{cases}\dfrac{1}{x_2 - x_1}, & x_1 \le x \le x_2\\ 0, & \text{otherwise}\end{cases}$
Mixture Models

$p(\mathbf{x}) = \sum_{j=1}^{J} p(\mathbf{x}|j)\,P_j,\qquad \sum_{j=1}^{J} P_j = 1,\qquad \int_{\mathbf{x}} p(\mathbf{x}|j)\,d\mathbf{x} = 1$

Assume parametric modeling, i.e., $p(\mathbf{x}|j;\boldsymbol{\theta})$.

The goal is to estimate $\boldsymbol{\theta}$ and $P_1, P_2, \dots, P_J$, given the set $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$.

Why not ML? As before:

$\max_{\boldsymbol{\theta},P_1,\dots,P_J}\prod_{k=1}^{N} p(\mathbf{x}_k;\boldsymbol{\theta},P_1,\dots,P_J)$
This is a nonlinear problem due to the missing label information. This is a typical problem with an incomplete data set.

The Expectation-Maximisation (EM) algorithm.

• General formulation: let $\mathbf{y} \in Y \subseteq R^m$ be the complete data set, with pdf $p_y(\mathbf{y};\boldsymbol{\theta})$, which is not observed directly.

We observe instead $\mathbf{x} = g(\mathbf{y}) \in X_{ob} \subseteq R^l$, $l < m$, with pdf $p_x(\mathbf{x};\boldsymbol{\theta})$: a many-to-one transformation.
• Let $Y(\mathbf{x}) \subseteq Y$ be the subset of all $\mathbf{y}$'s mapping to a specific $\mathbf{x}$. Then

$p_x(\mathbf{x};\boldsymbol{\theta}) = \int_{Y(\mathbf{x})} p_y(\mathbf{y};\boldsymbol{\theta})\,d\mathbf{y}$

• What we need is to compute

$\hat{\boldsymbol{\theta}}_{ML}:\ \sum_{k}\dfrac{\partial\ln p_y(\mathbf{y}_k;\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = 0$

• But the $\mathbf{y}_k$'s are not observed. Here comes the EM: maximize the expectation of the loglikelihood, conditioned on the observed samples and the current iteration estimate of $\boldsymbol{\theta}$.
The algorithm:

• E-step: $Q(\boldsymbol{\theta};\boldsymbol{\theta}(t)) = E\big[\textstyle\sum_k \ln p_y(\mathbf{y}_k;\boldsymbol{\theta}) \,\big|\, X;\boldsymbol{\theta}(t)\big]$

• M-step: $\boldsymbol{\theta}(t+1):\ \dfrac{\partial Q(\boldsymbol{\theta};\boldsymbol{\theta}(t))}{\partial\boldsymbol{\theta}} = 0$

Application to the mixture modeling problem:

• Complete data: $(\mathbf{x}_k, j_k),\ k = 1, 2, \dots, N$

• Observed data: $\mathbf{x}_k,\ k = 1, 2, \dots, N$

• $p(\mathbf{x}_k, j_k;\boldsymbol{\theta}) = p(\mathbf{x}_k|j_k;\boldsymbol{\theta})\,P_{j_k}$

• Assuming mutual independence: $L(\boldsymbol{\theta}) = \sum_{k=1}^{N}\ln\big(p(\mathbf{x}_k|j_k;\boldsymbol{\theta})\,P_{j_k}\big)$
• Unknown parameters: $\boldsymbol{\Theta}^T = [\boldsymbol{\theta}^T, P^T]$, $P^T = [P_1, P_2, \dots, P_J]$

• E-step:

$Q(\boldsymbol{\Theta};\boldsymbol{\Theta}(t)) = E\Big[\sum_{k=1}^{N}\ln\big(p(\mathbf{x}_k|j_k;\boldsymbol{\theta})\,P_{j_k}\big)\Big] = \sum_{k=1}^{N}\sum_{j=1}^{J} P(j|\mathbf{x}_k;\boldsymbol{\Theta}(t))\,\ln\big(p(\mathbf{x}_k|j;\boldsymbol{\theta})\,P_j\big)$

$P(j|\mathbf{x}_k;\boldsymbol{\Theta}(t)) = \dfrac{p(\mathbf{x}_k|j;\boldsymbol{\theta}(t))\,P_j(t)}{p(\mathbf{x}_k;\boldsymbol{\Theta}(t))},\qquad p(\mathbf{x}_k;\boldsymbol{\Theta}(t)) = \sum_{j=1}^{J} p(\mathbf{x}_k|j;\boldsymbol{\theta}(t))\,P_j(t)$

• M-step:

$\dfrac{\partial Q}{\partial\boldsymbol{\theta}} = 0,\qquad \dfrac{\partial Q}{\partial P_j} = 0,\ j = 1, 2, \dots, J$
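For Gaussian components, the E- and M-steps above reduce to the familiar responsibility/update iteration. A minimal 1-D sketch (the number of components, initialisation and synthetic data are illustrative assumptions):

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, J=2, iters=100):
    """EM for a 1-D mixture of J Gaussians; returns mixing weights, means, variances."""
    N = len(x)
    P   = np.full(J, 1.0 / J)                    # mixing probabilities P_j
    mu  = np.linspace(x.min(), x.max(), J)       # crude initialisation
    var = np.full(J, x.var())
    for _ in range(iters):
        # E-step: responsibilities P(j | x_k; Theta(t))
        r = np.array([P[j] * normal_pdf(x, mu[j], var[j]) for j in range(J)]).T
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate P_j, mu_j, var_j
        Nj  = r.sum(axis=0)
        P   = Nj / N
        mu  = (r * x[:, None]).sum(axis=0) / Nj
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return P, mu, var

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(x))   # weights ~ (0.6, 0.4), means ~ (-2, 3)
```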
Nonparametric Estimation

Consider an interval of length $h$ centered at $\hat{x}$, and let $k_N$ be the number of the $N$ total points that fall inside it. Then

$\hat{P} \approx \dfrac{k_N}{N},\qquad \hat{p}(x) \approx \hat{p}(\hat{x}) \equiv \dfrac{1}{h}\dfrac{k_N}{N},\qquad |x - \hat{x}| \le \dfrac{h}{2}$

If $p(x)$ is continuous, $\hat{p}(x) \to p(x)$ as $N \to \infty$, if $h_N \to 0$, $k_N \to \infty$, $\dfrac{k_N}{N} \to 0$.
Parzen Windows

Divide the multidimensional space into hypercubes.
Define

$\phi(\mathbf{x}_i) = \begin{cases}1, & |x_{ij}| \le \dfrac{1}{2},\ j = 1, \dots, l\\ 0, & \text{otherwise}\end{cases}$

• That is, it is 1 inside a unit-side hypercube centered at 0.

• $\hat{p}(\mathbf{x}) = \dfrac{1}{Nh^l}\sum_{i=1}^{N}\phi\Big(\dfrac{\mathbf{x}_i - \mathbf{x}}{h}\Big) = \dfrac{1}{\text{volume}}\cdot\dfrac{\text{number of points inside an }h\text{-side hypercube centered at }\mathbf{x}}{N}$

• The problem: $p(\mathbf{x})$ is continuous, while $\phi(\cdot)$ is discontinuous.

• Parzen windows – kernels – potential functions: use a smooth $\phi(\mathbf{x})$, with $\phi(\mathbf{x}) \ge 0$ and $\int_{\mathbf{x}}\phi(\mathbf{x})\,d\mathbf{x} = 1$.
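A minimal 1-D sketch of the hypercube-kernel estimator defined above (the synthetic data and the h values are illustrative assumptions):

```python
import numpy as np

def parzen_estimate(x_eval, samples, h):
    """p_hat(x) = (1/(N*h)) * sum_i phi((x_i - x)/h) with the unit-box kernel phi (1-D)."""
    u = (samples[None, :] - x_eval[:, None]) / h        # shape (len(x_eval), N)
    phi = (np.abs(u) <= 0.5).astype(float)              # 1 inside the unit-side box, else 0
    return phi.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(4)
samples = rng.normal(0.0, 1.0, 1000)
xs = np.linspace(-3.0, 3.0, 7)
for h in (0.1, 0.8):
    print(h, np.round(parzen_estimate(xs, samples, h), 3))   # rougher estimate for small h
```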
Mean value:

$E[\hat{p}(\mathbf{x})] = \dfrac{1}{Nh^l}\sum_{i=1}^{N} E\Big[\phi\Big(\dfrac{\mathbf{x}_i - \mathbf{x}}{h}\Big)\Big] = \int_{\mathbf{x}'}\dfrac{1}{h^l}\phi\Big(\dfrac{\mathbf{x}' - \mathbf{x}}{h}\Big)p(\mathbf{x}')\,d\mathbf{x}'$

• As $h \to 0$, the width of $\phi\big(\frac{\mathbf{x}'-\mathbf{x}}{h}\big)$ tends to 0, while $\dfrac{1}{h^l}\int_{\mathbf{x}'}\phi\big(\frac{\mathbf{x}'-\mathbf{x}}{h}\big)\,d\mathbf{x}' = 1$.

• Hence, $\dfrac{1}{h^l}\phi\big(\frac{\mathbf{x}'-\mathbf{x}}{h}\big) \xrightarrow{h\to 0} \delta(\mathbf{x}'-\mathbf{x})$, and

$E[\hat{p}(\mathbf{x})] \xrightarrow{h\to 0} p(\mathbf{x})$

Hence unbiased in the limit.
Variance: the smaller the h, the higher the variance. (Figures: $h=0.1$, $N=1000$ and $h=0.8$, $N=1000$.)
The higher the N, the better the accuracy. (Figure: $h=0.1$, $N=10000$.)
If

• $h \to 0$
• $N \to \infty$
• $Nh^l \to \infty$

the estimate is asymptotically unbiased.

The method. Remember the likelihood-ratio test:

$\ell_{12} \equiv \dfrac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)}\ \gtrless\ \dfrac{P(\omega_2)}{P(\omega_1)}\cdot\dfrac{\lambda_{21}-\lambda_{22}}{\lambda_{12}-\lambda_{11}}$

Using Parzen estimates for each class (with $N_1$, $N_2$ training points, respectively):

$\ell_{12}(\mathbf{x}) \approx \dfrac{\dfrac{1}{N_1 h^l}\sum_{i=1}^{N_1}\phi\big(\frac{\mathbf{x}_i-\mathbf{x}}{h}\big)}{\dfrac{1}{N_2 h^l}\sum_{i=1}^{N_2}\phi\big(\frac{\mathbf{x}_i-\mathbf{x}}{h}\big)}$
CURSE OF DIMENSIONALITY
In all the methods so far, we saw that the higher the number of points, N, the better the resulting estimate.

If, in the one-dimensional space, an interval filled with N points is adequate (for good estimation), then in the two-dimensional space the corresponding square will require $N^2$ points, and in the ℓ-dimensional space the ℓ-dimensional cube will require $N^\ell$ points.

The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.
NAIVE – BAYES CLASSIFIER
Let $\mathbf{x} \in \mathbb{R}^\ell$ and the goal be to estimate $p(\mathbf{x}|\omega_i)$, i = 1, 2, …, M. For a “good” estimate of the pdf one would need, say, $N^\ell$ points.

Assume $x_1, x_2, \dots, x_\ell$ mutually independent. Then:

$p(\mathbf{x}|\omega_i) = \prod_{j=1}^{\ell} p(x_j|\omega_i)$

In this case, one would require, roughly, N points for each pdf. Thus, a number of points of the order N·ℓ would suffice.

It turns out that the Naïve–Bayes classifier works reasonably well even in cases that violate the independence assumption.
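A minimal Gaussian naive-Bayes sketch illustrating the factorisation above (per-feature Gaussian pdfs and the synthetic data are assumptions of this sketch, not requirements of the slide):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class priors and per-feature means/variances (factorised class-conditional pdf)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return params

def predict_nb(x, params):
    """Pick the class maximising log P(w_i) + sum_j log p(x_j | w_i)."""
    best, best_score = None, -np.inf
    for c, (prior, mu, var) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        score = np.log(prior) + log_lik
        if score > best_score:
            best, best_score = c, score
    return best

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(2.0, 1.0, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
print(predict_nb(np.array([1.8, 2.1, 2.0]), fit_gaussian_nb(X, y)))   # expected: 1
```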
K Nearest Neighbor Density Estimation

In Parzen: the volume is constant, while the number of points falling inside the volume is varying.

Now: keep the number of points $k$ constant, and let the volume vary:

$\hat{p}(\mathbf{x}) = \dfrac{k}{N\,V(\mathbf{x})}$
• The corresponding likelihood ratio is then

$\dfrac{\hat{p}(\mathbf{x}|\omega_1)}{\hat{p}(\mathbf{x}|\omega_2)} = \dfrac{k/(N_1 V_1)}{k/(N_2 V_2)} = \dfrac{N_2 V_2}{N_1 V_1}$
The Nearest Neighbor Rule

• Choose k out of the N training vectors, and identify the k nearest ones to $\mathbf{x}$.

• Out of these k, identify the number $k_i$ that belong to class $\omega_i$.

• Assign $\mathbf{x} \rightarrow \omega_i:\ k_i > k_j\ \ \forall j \neq i$

The simplest version: k=1 !!!

For large N this is not bad. It can be shown that, if $P_B$ is the optimal Bayesian error probability, then:

$P_B \le P_{NN} \le P_B\Big(2 - \dfrac{M}{M-1}P_B\Big) \le 2P_B$
$P_B \le P_{kNN} \le P_B + \sqrt{\dfrac{2P_{NN}}{k}}$

$k \to \infty:\ P_{kNN} \to P_B$

For small $P_B$:

$P_{NN} \approx 2P_B,\qquad P_{3NN} \approx P_B + 3(P_B)^2$
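A minimal sketch of the k-nearest-neighbor rule described above (the synthetic training data and the value of k are illustrative assumptions):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Assign x to the class most represented among its k nearest training vectors."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(6)
X_train = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.5]), X_train, y_train, k=5))   # expected: 1
```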
Voronoi tessellation:

$R_i = \{\mathbf{x}: d(\mathbf{x}, \mathbf{x}_i) < d(\mathbf{x}, \mathbf{x}_j)\ \ \forall j \neq i\}$
BAYESIAN NETWORKS

Bayes probability chain rule:

$p(x_1, x_2, \dots, x_\ell) = p(x_\ell|x_{\ell-1},\dots,x_1)\,p(x_{\ell-1}|x_{\ell-2},\dots,x_1)\cdots p(x_2|x_1)\,p(x_1)$

Assume now that the conditional dependence for each $x_i$ is limited to a subset of the features appearing in each of the product terms. That is:

$p(x_1, x_2, \dots, x_\ell) = p(x_1)\prod_{i=2}^{\ell} p(x_i|A_i)$

where $A_i \subseteq \{x_{i-1}, x_{i-2}, \dots, x_1\}$
For example, if ℓ=6, then we could assume:

$p(x_6|x_5,\dots,x_1) = p(x_6|x_5, x_4)$

Then: $A_6 = \{x_5, x_4\} \subset \{x_5,\dots,x_1\}$

The above is a generalization of the Naïve–Bayes. For the Naïve–Bayes the assumption is: $A_i = \emptyset$, for i = 1, 2, …, ℓ.
A graphical way to portray conditional dependencies is given below. According to this figure we have that:

• x6 is conditionally dependent on x4, x5
• x5 on x4
• x4 on x1, x2
• x3 on x2
• x1, x2 are conditionally independent of the other variables.

For this case:

$p(x_1, x_2, \dots, x_6) = p(x_6|x_4, x_5)\,p(x_5|x_4)\,p(x_4|x_1, x_2)\,p(x_3|x_2)\,p(x_1)\,p(x_2)$
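A sketch computing the joint probability from this factorisation for binary variables (the conditional probability tables below are hypothetical illustration values, not taken from the slides):

```python
# Hypothetical CPTs for binary variables x1..x6 (indexed by parent values).
p_x1 = {0: 0.6, 1: 0.4}                                              # P(x1)
p_x2 = {0: 0.7, 1: 0.3}                                              # P(x2)
p_x3_x2 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}                 # P(x3 | x2)
p_x4_x1x2 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.6, 1: 0.4},
             (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {0: 0.2, 1: 0.8}}     # P(x4 | x1, x2)
p_x5_x4 = {0: {0: 0.75, 1: 0.25}, 1: {0: 0.2, 1: 0.8}}               # P(x5 | x4)
p_x6_x4x5 = {(0, 0): {0: 0.95, 1: 0.05}, (0, 1): {0: 0.6, 1: 0.4},
             (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}     # P(x6 | x4, x5)

def joint(x1, x2, x3, x4, x5, x6):
    """p(x1..x6) = p(x6|x4,x5) p(x5|x4) p(x4|x1,x2) p(x3|x2) p(x1) p(x2)"""
    return (p_x6_x4x5[(x4, x5)][x6] * p_x5_x4[x4][x5] * p_x4_x1x2[(x1, x2)][x4]
            * p_x3_x2[x2][x3] * p_x1[x1] * p_x2[x2])

# The 2^6 joint probabilities must sum to 1 if the CPTs are valid.
print(sum(joint(*(int(b) for b in format(i, "06b"))) for i in range(64)))   # 1.0
```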
Bayesian Networks
Definition: A Bayesian Network is a directed acyclic graph (DAG) where the nodes correspond to random variables. Each node is associated with a set of conditional probabilities (densities), p(xi|Ai), where xi is the variable associated with the node and Ai is the set of its parents in the graph.
A Bayesian Network is specified by:

• The marginal probabilities of its root nodes.
• The conditional probabilities of the non-root nodes, given their parents, for ALL possible combinations.
The figure below is an example of a Bayesian Network corresponding to a paradigm from the medical applications field.
This Bayesian network models conditional dependencies for an example concerning smokers (S), tendencies to develop cancer (C) and heart disease (H), together with variables corresponding to heart (H1, H2) and cancer (C1, C2) medical tests.
Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root nodes) and the conditional (non-root nodes) probabilities.
Training: Once a topology is given, probabilities are estimated via the training data set. There are also methods that learn the topology.
Probability Inference: This is the most common task that Bayesian networks help us to solve efficiently. Given the values of some of the variables in the graph, known as evidence, the goal is to compute the conditional probabilities for some of the other variables, given the evidence.
Example: Consider the Bayesian network of the figure:
a) If x is measured to be x=1 (x1), compute P(w=0|x=1) [P(w0|x1)].
b) If w is measured to be w=1 (w1) compute P(x=0|w=1) [ P(x0|w1)].
For a), a set of calculations are required that propagate from node x to node w. It turns out that P(w0|x1) = 0.63.
For b), the propagation is reversed in direction. It turns out that P(x0|w1) = 0.4.
In general, the required inference information is computed via a combined process of “message passing” among the nodes of the DAG.
Complexity: For singly connected graphs, message-passing algorithms amount to a complexity linear in the number of nodes.