
Page 1: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

METU Informatics Institute Min 720 Pattern Classification with Bio-Medical Applications

Page 2: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Statistical Approach to P.R.

Dimension of the feature space: $d$, with feature vector $X = [X_1, X_2, \ldots, X_d]$.

Set of different states of nature (the categories): $\{\omega_1, \omega_2, \ldots, \omega_c\}$; the task is to find the true state of nature for a given $X$.

Set of possible actions (decisions): $\{\alpha_1, \alpha_2, \ldots, \alpha_a\}$. Here, a decision might include a 'reject option'.

A discriminant function $g_i(X)$, $1 \le i \le c$, is defined on region $R_i$; the regions are disjoint ($R_i \cap R_j = \emptyset$ for $i \ne j$) and together cover the feature space ($\bigcup_i R_i = R^d$). Decision rule: decide $\omega_k$ if $g_k(X) > g_j(X)$ for all $j \ne k$.

[Figure: the feature space partitioned into regions $R_1$, $R_2$, $R_3$ with corresponding discriminant functions $g_1$, $g_2$, $g_3$.]

Page 3: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

A Pattern Classifier

So our aim now will be to define these functions $g_1, g_2, \ldots, g_c$ so as to minimize or optimize a criterion.

[Figure: block diagram of a pattern classifier; the input $X$ is fed to the discriminant functions $g_1(X), g_2(X), \ldots, g_c(X)$, a Max selector picks the largest, and its index $k$ is the decision.]
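As a concrete illustration of the Max selector, here is a minimal Python sketch of the argmax decision rule; the three discriminant functions are hypothetical, chosen only so the example runs:

```python
import numpy as np

def classify(X, discriminants):
    """Evaluate every g_i(X) and decide the class k with the largest value."""
    scores = [g(X) for g in discriminants]
    return int(np.argmax(scores))

# Hypothetical linear discriminants on a 2-d feature vector.
g = [lambda X: X[0] + X[1],        # g_1
     lambda X: 2 * X[0] - X[1],    # g_2
     lambda X: -X[0] + 0.5]        # g_3
print(classify(np.array([1.0, 2.0]), g))  # -> 0, i.e. decide omega_1
```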

Page 4: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Parametric Approach to Classification

• 'Bayes Decision Theory' is used for minimum-error / minimum-risk pattern classifier design.

• Here, it is assumed that if a sample $X$ is drawn from a class $\omega_i$, it is a random variable represented by a multivariate probability density function $P(X|\omega_i)$, the 'class-conditional density function'.

Page 5: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• We also know the a-priori probabilities $P(\omega_i)$, $1 \le i \le c$ (c is the number of classes).

• Then we can talk about a decision rule that minimizes the probability of error.

• Suppose we have the observation $X$. This observation is going to change the a-priori assumption into an a-posteriori probability $P(\omega_i|X)$,

• which can be found by the Bayes Rule.

Page 6: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• $P(X)$ can be found by the Total Probability Rule: when the $\omega_i$'s are disjoint,

$$P(X) = \sum_{i=1}^{c} P(X, \omega_i) = \sum_{i=1}^{c} P(X|\omega_i)\,P(\omega_i)$$

• The a-posteriori probability is then given by the Bayes Rule:

$$P(\omega_i|X) = \frac{P(X, \omega_i)}{P(X)} = \frac{P(X|\omega_i)\,P(\omega_i)}{P(X)}$$

• Decision Rule: Choose the category with the highest a-posteriori probability, calculated as above using the Bayes Rule.
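A minimal numeric sketch of the rule above; the priors and the class-conditional density values at the observed X are made-up numbers for illustration:

```python
import numpy as np

def posteriors(priors, likelihoods):
    """Bayes rule: P(w_i|X) = P(X|w_i) P(w_i) / P(X),
    with P(X) obtained from the total probability rule."""
    joint = np.asarray(likelihoods) * np.asarray(priors)  # P(X|w_i) P(w_i)
    return joint / joint.sum()                            # normalize by P(X)

# Hypothetical: two classes, class-conditional densities evaluated at some X.
print(posteriors([0.7, 0.3], [0.05, 0.20]))  # -> [0.368 0.632]; choose w_2
```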

Page 7: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• For two categories, $P(\omega_1|X) + P(\omega_2|X) = 1$, and we may take $g_i(X) = P(\omega_i|X)$.

• Decision boundary between regions $R_1$ and $R_2$: $g_1(X) = g_2(X)$,

• or in general, decision boundaries are where $g_i(X) = g_j(X)$, between regions $R_i$ and $R_j$.

Page 8: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• Single feature: the decision boundary is a point • 2 features: a curve • 3 features: a surface • More than 3: a hypersurface

• Since $P(X)$ is common to all classes, an equivalent discriminant is obtained by dropping it:

$$g_i(X) = \frac{P(X|\omega_i)\,P(\omega_i)}{P(X)} \quad \Rightarrow \quad g_i(X) = P(X|\omega_i)\,P(\omega_i)$$

• Sometimes it is easier to work with logarithms. Since the logarithmic function is monotonically increasing, the log form gives the same result:

$$g_i(X) = \log[P(X|\omega_i)\,P(\omega_i)] = \log P(X|\omega_i) + \log P(\omega_i)$$

Page 9: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

2 Category Case:

Assign $X$ to $c_1$ if $P(c_1|X) > P(c_2|X)$, else to $c_2$.

But this is the same as: $c_1$ if

$$\frac{P(X|c_1)\,P(c_1)}{P(X)} > \frac{P(X|c_2)\,P(c_2)}{P(X)}$$

By throwing away the $P(X)$'s, we end up with: $c_1$ if

$$P(X|c_1)\,P(c_1) > P(X|c_2)\,P(c_2)$$

which is the same as the likelihood-ratio test: decide $c_1$ if

$$\frac{P(X|c_1)}{P(X|c_2)} > \frac{P(c_2)}{P(c_1)} = k$$

Page 10: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• Example: a single-feature, 2-category problem with Gaussian densities.
• Diagnosis of diabetes using the sugar count $X$:
• $c_1$: state of being healthy, with $P(c_1) = 0.7$
• $c_2$: state of being sick (diabetes), with $P(c_2) = 0.3$

$$P(X|c_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\,e^{-(X-m_1)^2/2\sigma_1^2}, \qquad P(X|c_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2}\,e^{-(X-m_2)^2/2\sigma_2^2}$$

• The decision rule: decide $c_1$ if $0.7\,P(X|c_1) > 0.3\,P(X|c_2)$, else $c_2$.

[Figure: the weighted densities $P(X|c_1)P(c_1)$ and $P(X|c_2)P(c_2)$ plotted against $X$; they cross at the decision boundary $d$ between the means $m_1$ and $m_2$.]

Page 11: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• Assume now: $m_1 = 10$, $m_2 = 20$, $\sigma_1 = \sigma_2 = 2$,
• and we measured: $X = 17$.
• Assign the unknown sample $X$ to the correct category.

• Find the likelihood ratio for $X = 17$:

$$\frac{P(X|c_1)}{P(X|c_2)} = \frac{e^{-(17-10)^2/8}}{e^{-(17-20)^2/8}} = e^{-5} \approx 0.006$$

• Compare with:

$$\frac{P(c_2)}{P(c_1)} = \frac{0.3}{0.7} \approx 0.43$$

• Since $0.006 < 0.43$, assign $X$ to $c_2$.
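The slide's arithmetic can be reproduced directly; a minimal sketch using the stated parameters ($m_1 = 10$, $m_2 = 20$, $\sigma = 2$, priors 0.7/0.3):

```python
import math

m1, m2, sigma = 10.0, 20.0, 2.0   # means and common std. deviation
P1, P2 = 0.7, 0.3                 # a-priori probabilities
X = 17.0                          # observed sugar count

def gauss(x, m, s):
    """1-d Gaussian density with mean m and std s."""
    return math.exp(-(x - m) ** 2 / (2 * s ** 2)) / (math.sqrt(2 * math.pi) * s)

ratio = gauss(X, m1, sigma) / gauss(X, m2, sigma)  # = e^{-5}, about 0.0067
threshold = P2 / P1                                # about 0.43
print(ratio, threshold, "c1" if ratio > threshold else "c2")  # -> c2
```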

Page 12: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• Example: a discrete problem. Consider a 2-feature, 3-category case where each class-conditional density is uniform over a square:

$$P(X_1, X_2\,|\,c_i) = \begin{cases} \dfrac{1}{(b_i - a_i)^2} & a_i \le X_1 \le b_i \ \text{ and } \ a_i \le X_2 \le b_i \\ 0 & \text{otherwise} \end{cases}$$

• with $a_1 = 1,\ b_1 = 3$; $a_2 = 0.5,\ b_2 = 3.5$; $a_3 = 3,\ b_3 = 4$,
• and $P(c_1) = 0.4$, $P(c_2) = 0.4$, $P(c_3) = 0.2$.
• Find the decision boundaries and regions.

• Solution: compare $g_i = P(X_1, X_2|c_i)\,P(c_i)$ wherever the squares overlap:

$$g_1 = \frac{1}{4}(0.4) = 0.1, \qquad g_2 = \frac{1}{9}(0.4) \approx 0.044, \qquad g_3 = \frac{1}{1}(0.2) = 0.2$$

• So $R_3$ is the square $3 \le X_1, X_2 \le 4$ (where $g_3$ is largest), $R_1$ is the remainder of the square $1 \le X_1, X_2 \le 3$ (where $g_1 > g_2$), and $R_2$ is what remains of the square $0.5 \le X_1, X_2 \le 3.5$; elsewhere all densities are 0.
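A sketch that evaluates the three weighted densities $g_i$ at a point, assuming the square parameters $(a_i, b_i)$ as reconstructed above:

```python
import numpy as np

a = np.array([1.0, 0.5, 3.0])      # lower corners of the class squares
b = np.array([3.0, 3.5, 4.0])      # upper corners
prior = np.array([0.4, 0.4, 0.2])  # P(c_1), P(c_2), P(c_3)

def g(x1, x2):
    """g_i = P(X1,X2|c_i) P(c_i): uniform density 1/(b_i-a_i)^2 inside square i."""
    inside = (a <= x1) & (x1 <= b) & (a <= x2) & (x2 <= b)
    return np.where(inside, prior / (b - a) ** 2, 0.0)

print(g(2.0, 2.0))  # [0.1   0.044 0.   ] -> decide c1
print(g(3.2, 3.2))  # [0.    0.044 0.2  ] -> decide c3
```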

Page 13: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• Remember now that for the 2-class case we decide $c_1$ if

$$P(X|c_1)\,P(c_1) > P(X|c_2)\,P(c_2)$$

• or, equivalently, if the likelihood ratio exceeds the threshold $k$:

$$\frac{P(X|c_1)}{P(X|c_2)} > \frac{P(c_2)}{P(c_1)} = k$$

• Error probabilities and a simple proof of minimum error:

• Consider again a 2-class, 1-d problem, with region $R_1$ to the left of the boundary $d$ and region $R_2$ to the right.

• Let's show that $P(E)$ is minimized if the decision boundary is the intersection point, where $P(X|c_1)P(c_1) = P(X|c_2)P(c_2)$, rather than any arbitrary point $d$.

Page 14: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• Then $P(E)$ (the probability of error) is minimum:

$$P(E) = P(X \in R_2, c_1) + P(X \in R_1, c_2)$$

$$= P(X \in R_2|c_1)\,P(c_1) + P(X \in R_1|c_2)\,P(c_2)$$

$$= \Big[\int_{R_2} P(X|c_1)\,dX\Big]\,P(c_1) + \Big[\int_{R_1} P(X|c_2)\,dX\Big]\,P(c_2)$$

$$= \int_{R_2} P(X|c_1)\,P(c_1)\,dX + \int_{R_1} P(X|c_2)\,P(c_2)\,dX$$

• It can very easily be seen that $P(E)$ is minimum when the boundary $d$ is placed at the intersection point, since each $X$ then contributes the smaller of the two joint densities to the error integral.
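The claim can also be checked numerically: sweep the boundary $d$, evaluate the two error integrals with a Riemann sum, and observe the minimum at the intersection point. The two Gaussian classes below are hypothetical (equal priors and variances, so the intersection is at the midpoint 1.5):

```python
import numpy as np

m1, m2, s, P1, P2 = 0.0, 3.0, 1.0, 0.5, 0.5  # hypothetical 1-d classes

def p(x, m):
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)

def P_error(d):
    """P(E) = int_{x>d} P(x|c1)P(c1) dx + int_{x<=d} P(x|c2)P(c2) dx."""
    x = np.linspace(-8.0, 11.0, 100001)
    err = np.where(x > d, p(x, m1) * P1, p(x, m2) * P2)
    return err.sum() * (x[1] - x[0])  # dense Riemann sum

ds = np.linspace(0.5, 2.5, 81)
print(ds[np.argmin([P_error(d) for d in ds])])  # -> 1.5, the intersection
```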

Page 15: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Minimum Risk Classification

The risk associated with an incorrect decision might be more important than the probability of error, so our decision criterion might be modified to minimize the average risk in making an incorrect decision. We define a conditional risk (expected loss) for decision $\alpha_i$ when $X$ occurs as:

$$R(\alpha_i|X) = \sum_{j=1}^{c} \lambda(\alpha_i|c_j)\,P(c_j|X)$$

where $\lambda(\alpha_i|c_j)$ is defined as the conditional loss associated with decision $\alpha_i$ when the true class is $c_j$. It is assumed that $\lambda(\alpha_i|c_j)$ is known.

The decision rule: decide $\alpha_i$ if $R(\alpha_i|X) \le R(\alpha_j|X)$ for all $j \ne i$, $1 \le j \le c$.

The discriminant function here can be defined as $g_i(X) = -R(\alpha_i|X)$.
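A minimal sketch of the conditional-risk computation; the loss matrix and the posteriors are hypothetical:

```python
import numpy as np

def conditional_risks(post, loss):
    """R(alpha_i|X) = sum_j lambda(alpha_i|c_j) P(c_j|X), for every i."""
    return loss @ np.asarray(post)

loss = np.array([[0.0, 2.0],   # lambda(alpha_1|c_1), lambda(alpha_1|c_2)
                 [1.0, 0.0]])  # lambda(alpha_2|c_1), lambda(alpha_2|c_2)
post = [0.6, 0.4]              # P(c_1|X), P(c_2|X)

R = conditional_risks(post, loss)
print(R, "-> decide alpha_%d" % (np.argmin(R) + 1))  # [0.8 0.6] -> alpha_2
```

Note how the asymmetric loss makes us decide $\alpha_2$ here even though $c_1$ is the more probable class.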

Page 16: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• We can show that the minimum-error decision is a special case of the above rule, where

$$\lambda(\alpha_i|c_i) = 0, \qquad \lambda(\alpha_i|c_j) = 1 \ \text{ for } j \ne i$$

• then,

$$R(\alpha_i|X) = \sum_{j \ne i} P(c_j|X) = 1 - P(c_i|X)$$

• so the rule is: decide $c_i$ if $P(c_i|X) > P(c_j|X)$ for all $j \ne i$.

Page 17: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• For the 2-category case, the minimum-risk classifier becomes:

$$R(\alpha_1|X) = \lambda_{11}\,P(c_1|X) + \lambda_{12}\,P(c_2|X)$$

$$R(\alpha_2|X) = \lambda_{21}\,P(c_1|X) + \lambda_{22}\,P(c_2|X)$$

• Decide $c_1$ if $R(\alpha_1|X) < R(\alpha_2|X)$, that is, if

$$(\lambda_{21} - \lambda_{11})\,P(c_1|X) > (\lambda_{12} - \lambda_{22})\,P(c_2|X)$$

• or, using the Bayes rule, decide $c_1$ if

$$\frac{P(X|c_1)}{P(X|c_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(c_2)}{P(c_1)}$$

• Otherwise, decide $c_2$.

• This is the same as the likelihood-ratio rule if $\lambda_{12} = \lambda_{21} = 1$ and $\lambda_{11} = \lambda_{22} = 0$.

Page 18: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Discriminant Functions so far

For Minimum Error (equivalent forms):

$$g_i(X) = P(\omega_i|X) = \frac{P(X|\omega_i)\,P(\omega_i)}{P(X)}$$

$$g_i(X) = P(X|\omega_i)\,P(\omega_i)$$

$$g_i(X) = \log P(X|\omega_i) + \log P(\omega_i)$$

For Minimum Risk:

$$g_i(X) = -R(\alpha_i|X)$$

where

$$R(\alpha_i|X) = \sum_{j=1}^{c} \lambda(\alpha_i|c_j)\,P(c_j|X)$$

Page 19: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Bayes (Maximum Likelihood) Decision:

• The most general optimal solution
• Provides an upper limit on performance (you cannot do better with any other rule)
• Useful in comparisons with other classifiers

Page 20: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Special Cases of Discriminant Functions

Multivariate Gaussian (Normal) Density: $N(M, \Sigma)$

The general density form:

$$P(X) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\;e^{-\frac{1}{2}(X-M)^T \Sigma^{-1} (X-M)}$$

Here $X$ is the feature vector of size $d$;

$M = E(X) = [\mu_1, \mu_2, \ldots, \mu_d]^T$ is the $d$-element mean vector;

$\Sigma$ is the $d \times d$ covariance matrix, with elements

$$\sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)], \qquad \sigma_{ii} = \sigma_i^2 \ \text{(variance of feature } X_i\text{)}$$

$\Sigma$ is symmetric, and $\sigma_{ij} = 0$ when $X_i$ and $X_j$ are statistically independent.
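A direct transcription of the density formula into code (a sketch; the mean and covariance values are hypothetical):

```python
import numpy as np

def mvn_pdf(X, M, Sigma):
    """P(X) = exp(-0.5 (X-M)^T Sigma^{-1} (X-M)) / ((2 pi)^{d/2} |Sigma|^{1/2})."""
    d = len(M)
    diff = X - M
    quad = diff @ np.linalg.inv(Sigma) @ diff          # Mahalanobis form
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

M = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],    # correlated features: sigma_12 != 0
                  [0.5, 1.0]])
print(mvn_pdf(np.array([1.0, 1.0]), M, Sigma))
```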

Page 21: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• $|\Sigma|$: the determinant of $\Sigma$.

• General shape: hyperellipsoids on which the Mahalanobis distance

$$(X-M)^T\,\Sigma^{-1}\,(X-M)$$

is constant.

• 2-d problem: $X = [X_1, X_2]$,

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix}$$

• If $\sigma_{12} = \sigma_{21} = 0$ (statistically independent features), then the major axes of the ellipsoids are parallel to the coordinate axes.

[Figure: equal-density ellipses centered at $M$ in the $(X_1, X_2)$ plane.]

Page 22: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• If in addition $\sigma_1^2 = \sigma_2^2$, the equal-density curves become circular.

• In general, the equal-density curves are hyperellipsoids. Now, for $P(X|\omega_i) \sim N(M_i, \Sigma_i)$, the logarithmic form

$$g_i(X) = \log_e P(X|\omega_i) + \log_e P(\omega_i)$$

is used, for its ease of manipulation:

$$g_i(X) = -\tfrac{1}{2}(X - M_i)^T\,\Sigma_i^{-1}\,(X - M_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

(dropping the constant term $-\tfrac{d}{2}\log 2\pi$). This is a quadratic function of $X$, as will be shown.

Page 23: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Expanding the quadratic term,

$$g_i(X) = -\tfrac{1}{2} X^T \Sigma_i^{-1} X + \tfrac{1}{2}\big(M_i^T \Sigma_i^{-1} X + X^T \Sigma_i^{-1} M_i\big) - \tfrac{1}{2} M_i^T \Sigma_i^{-1} M_i - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

and since $X^T \Sigma_i^{-1} M_i = M_i^T \Sigma_i^{-1} X$ (a scalar), we can write

$$g_i(X) = X^T W_i X + V_i^T X + w_{i0}$$

where

$$W_i = -\tfrac{1}{2}\,\Sigma_i^{-1}, \qquad V_i = \Sigma_i^{-1} M_i, \qquad w_{i0} = -\tfrac{1}{2} M_i^T \Sigma_i^{-1} M_i - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

On the decision boundary, $g_i(X) = g_j(X)$:

$$X^T (W_i - W_j)\,X + (V_i - V_j)^T X + (w_{i0} - w_{j0}) = 0$$
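The $(W_i, V_i, w_{i0})$ decomposition translates directly into code; a sketch with two hypothetical classes with unequal covariances:

```python
import numpy as np

def quadratic_discriminant(M, Sigma, prior):
    """Return (W, V, w0) so that g(X) = X^T W X + V^T X + w0."""
    Sinv = np.linalg.inv(Sigma)
    W = -0.5 * Sinv
    V = Sinv @ M
    w0 = -0.5 * M @ Sinv @ M - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return W, V, w0

def g(X, W, V, w0):
    return X @ W @ X + V @ X + w0

# Hypothetical two-class problem.
W1, V1, w01 = quadratic_discriminant(np.array([0., 0.]),
                                     np.array([[1., 0.], [0., 1.]]), 0.5)
W2, V2, w02 = quadratic_discriminant(np.array([3., 3.]),
                                     np.array([[2., 0.], [0., 0.5]]), 0.5)
X = np.array([1.0, 1.0])
print(g(X, W1, V1, w01) > g(X, W2, V2, w02))  # True -> decide omega_1
```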

Page 24: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

The decision boundary function is hyperquadratic in general. Example in 2-d:

$$X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad V = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}, \qquad W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}$$

Then the above boundary $X^T W X + V^T X + w_0 = 0$ becomes:

Page 25: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

The general form of the hyperquadratic boundary in 2-d:

$$w_{11}\,x_1^2 + 2\,w_{12}\,x_1 x_2 + w_{22}\,x_2^2 + v_1\,x_1 + v_2\,x_2 + w_0 = 0$$

The special cases of the Gaussian: assume $\Sigma_i = \sigma^2 I$, where $I$ is the unit matrix:

$$\Sigma_i = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}$$

Page 26: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

With $\Sigma_i = \sigma^2 I$ we have $\Sigma_i^{-1} = \frac{1}{\sigma^2} I$, and

$$g_i(X) = -\frac{(X - M_i)^T (X - M_i)}{2\sigma^2} - \frac{d}{2}\log\sigma^2 + \log P(\omega_i)$$

The middle term is not a function of $X$, so it can be removed:

$$g_i(X) = -\frac{\|X - M_i\|^2}{2\sigma^2} + \log P(\omega_i)$$

where $\|X - M_i\| = d(X, M_i)$ is the Euclidean distance between $X$ and $M_i$.

Now assume $P(\omega_i) = P(\omega_j)$; then, up to the common factor $1/2\sigma^2$,

$$g_i(X) = -d^2(X, M_i)$$

and the decision boundary is linear!

Page 27: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Decision Rule: Assign the unknown sample to the closest mean's category.

[Figure: an unknown sample with distances $d_1$ and $d_2$ to the means $M_1$ and $M_2$; the boundary is the perpendicular bisector of the segment $M_1 M_2$, which moves towards the less probable category when $P(\omega_i) \ne P(\omega_j)$.]

Page 28: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Minimum Distance Classifier

• Classify an unknown sample $X$ to the category with the closest mean!
• Optimum when the densities are Gaussian with equal variances and equal a-priori probabilities.
• Piecewise-linear boundaries arise in the case of more than 2 categories.

[Figure: four means $M_1, M_2, M_3, M_4$ and the piecewise-linear boundaries of their regions $R_1, R_2, R_3, R_4$.]
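A minimal sketch of the minimum distance classifier; the four class means are hypothetical:

```python
import numpy as np

def minimum_distance_classify(X, means):
    """Assign X to the category whose mean is closest in Euclidean distance."""
    d2 = [np.sum((X - M) ** 2) for M in means]  # squared distances suffice
    return int(np.argmin(d2))

# Hypothetical class means M1..M4.
means = [np.array([0., 0.]), np.array([4., 0.]),
         np.array([0., 4.]), np.array([4., 4.])]
print(minimum_distance_classify(np.array([3.0, 0.5]), means))  # -> 1 (M2)
```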

Page 29: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

• Another special case: $\Sigma_i = \Sigma$ for all $i$ (the covariance matrices are the same). Then samples fall in clusters of equal size and shape, and it can be shown that

$$g_i(X) = -\tfrac{1}{2}\,(X - M_i)^T\,\Sigma^{-1}\,(X - M_i) + \log P(\omega_i)$$

• When $P(\omega_i) = P(\omega_j)$, this reduces to comparing

$$(X - M_i)^T\,\Sigma^{-1}\,(X - M_i)$$

which is called the Mahalanobis distance from $X$ to $M_i$.

[Figure: equal-shape clusters around the class means, with an unknown sample.]

Page 30: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Then, if $P(\omega_i) = P(\omega_j)$, the decision rule is: decide $\omega_i$ if

(Mahalanobis distance of the unknown sample to $M_j$) > (Mahalanobis distance of the unknown sample to $M_i$).

If $P(\omega_i) \ne P(\omega_j)$, the boundary moves toward the less probable one.
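A sketch of the equal-prior rule: compute both Mahalanobis distances with the shared covariance and pick the smaller one (all numbers hypothetical):

```python
import numpy as np

def mahalanobis2(X, M, Sigma_inv):
    """Squared Mahalanobis distance (X-M)^T Sigma^{-1} (X-M)."""
    diff = X - M
    return diff @ Sigma_inv @ diff

Sigma = np.array([[2.0, 0.3],   # shared covariance matrix
                  [0.3, 1.0]])
Sinv = np.linalg.inv(Sigma)
M1, M2 = np.array([0., 0.]), np.array([3., 1.])
X = np.array([1.5, 0.2])
d1, d2 = mahalanobis2(X, M1, Sinv), mahalanobis2(X, M2, Sinv)
print("omega_1" if d1 < d2 else "omega_2")  # pick the smaller distance
```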

Page 31: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Binary Random Variables

• Discrete features: the features can take only discrete values; integrals are replaced by summations.

• Binary features: $X = [X_1, X_2, \ldots, X_d]^T$, where each $X_i$ takes the value 0 or 1.

• Assume the binary features are statistically independent, and let

$$p_i = P(X_i = 1\,|\,\omega_1), \qquad q_i = P(X_i = 1\,|\,\omega_2)$$

so that $P(X_i = 0\,|\,\omega_1) = 1 - p_i$, where $X_i$ is binary.

Page 32: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Binary Random Variables

Example: Bit matrix for machine-printed characters.

[Figure: $d$-pixel bit matrices for the letters A and B; each cell holds a 0 or 1.]

Here, each pixel may be taken as a feature. For the above problem we have $d$ binary features, and $p_i$ is the probability that $X_i = 1$ for the letter A, B, ...

Page 33: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

For a single binary feature,

$$P(x_i\,|\,\omega_1) = p_i^{x_i}\,(1 - p_i)^{1 - x_i}$$

defined for $x_i = 0, 1$ and undefined elsewhere.

• If statistical independence of the features is assumed,

$$P(X\,|\,\omega_1) = \prod_{i=1}^{d} p_i^{x_i}\,(1 - p_i)^{1 - x_i}$$

• Consider the 2-category problem and assume

$$P(x_i\,|\,\omega_1) = p_i^{x_i}(1 - p_i)^{1 - x_i}, \qquad P(x_i\,|\,\omega_2) = q_i^{x_i}(1 - q_i)^{1 - x_i}$$

Then the (log) discriminant is

$$g(X) = \log P(X|\omega_1) - \log P(X|\omega_2) + \log\frac{P(\omega_1)}{P(\omega_2)} = \sum_{i=1}^{d}\left[x_i \log\frac{p_i}{q_i} + (1 - x_i)\log\frac{1 - p_i}{1 - q_i}\right] + \log\frac{P(\omega_1)}{P(\omega_2)}$$

Page 34: PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Then the decision boundary $g(X) = 0$ is linear in $X$, a weighted sum of the inputs:

$$g(X) = \sum_{i=1}^{d} W_i\,x_i + W_0$$

So decide category $\omega_1$ if $g(X) > 0$, else $\omega_2$, where

$$W_i = \ln\frac{p_i\,(1 - q_i)}{q_i\,(1 - p_i)}, \qquad i = 1, \ldots, d$$

and

$$W_0 = \sum_{i=1}^{d} \ln\frac{1 - p_i}{1 - q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
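A sketch computing the weights $W_i$ and $W_0$ from the per-feature probabilities (the $p_i$, $q_i$ values and the priors below are hypothetical):

```python
import numpy as np

def binary_weights(p, q, P1, P2):
    """W_i = ln[p_i(1-q_i) / (q_i(1-p_i))],
    W_0 = sum_i ln[(1-p_i)/(1-q_i)] + ln(P1/P2)."""
    p, q = np.asarray(p), np.asarray(q)
    W = np.log(p * (1 - q) / (q * (1 - p)))
    W0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)
    return W, W0

# Hypothetical per-pixel probabilities for two character classes.
p = [0.9, 0.8, 0.2]   # P(X_i = 1 | omega_1)
q = [0.3, 0.4, 0.6]   # P(X_i = 1 | omega_2)
W, W0 = binary_weights(p, q, 0.5, 0.5)
x = np.array([1, 1, 0])
print("omega_1" if W @ x + W0 > 0 else "omega_2")  # -> omega_1
```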