
CSC446: Pattern Recognition (LN4)


Bayesian Decision Theory

Prof. Dr. Mostafa Gadal-Haqq

Faculty of Computer & Information Sciences

Computer Science Department

AIN SHAMS UNIVERSITY


CSC446 : Pattern Recognition

(Pattern Classifications, Ch2: Sec. 2.1 to Sec. 2.3)


2.1 Bayesian Decision Theory

• Bayesian decision theory is based on quantifying the trade-offs between various classification decisions using probabilities and the costs that accompany such decisions.

• It assumes that the decision problem is posed in probabilistic terms and that all of the relevant probability values are known.


2.1 Bayesian Decision Theory

• Back to the Fish Sorting Machine:

– = a random variable (State of nature)={1 ,2}

• For example: 1 = Sea bass, and 2 = Salmon

• P(1 ) = the prior (a priori probability) that the

coming fish is sea bass.

• P(2 ) = the prior (a priori probability) that the

coming fish is salmon.

– The priors gives us the knowledge of how likely

we are to get salmon or Sea bass before the fish

actually appears.


2.1 Bayesian Decision Theory

• Decision rule using priors only:

– To make a decision about the fish that will appear, using only the priors P(ω1) and P(ω2), we use the following decision rule:

Decide fish ∈ ω1 if P(ω1) > P(ω2), and fish ∈ ω2 if P(ω1) < P(ω2)

– This rule minimizes the error:

P(error) = min[ P(ω1), P(ω2) ]
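A minimal sketch of this prior-only rule (the prior values below are illustrative, not from the slides):

```python
# Prior-only decision: always pick the class with the larger prior.
priors = {"sea bass": 2 / 3, "salmon": 1 / 3}   # illustrative priors

decision = max(priors, key=priors.get)   # decide the most probable class
p_error = min(priors.values())           # P(error) = min[P(w1), P(w2)]
print(f"decide: {decision}, P(error) = {p_error:.2f}")
```

Note that this rule always makes the same decision, no matter which fish actually appears; the observation-based rules that follow improve on it.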


2.1 Bayesian Decision Theory

• That is:

– If P(ω1) >> P(ω2), we will be right most of the time when we decide that the fish belongs to ω1.

– If P(ω1) = P(ω2), we have only a fifty-fifty chance of being right.

– Under these conditions, no other decision rule can yield a larger probability of being right.


2.1 Bayesian Decision Theory

• Improving the decision using an observation:

• If we know the class-conditional probability density p(x | ωj) of an observation x, we can improve our decision.

• For example: x describes the observed lightness of the sea bass or salmon.

[Figure: class-conditional densities p(x | ω1) and p(x | ω2) over the lightness feature x]


2.1 Bayesian Decision Theory

• We can improve our decision by using this observed feature and Bayes' rule:

– Posterior = (Likelihood × Prior) / Evidence

– where, for C categories:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}, \qquad p(x) = \sum_{j=1}^{C} p(x \mid \omega_j)\,P(\omega_j)$$
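A small sketch of this computation; the two-class numbers below are the ones used in the numerical example two slides ahead:

```python
import numpy as np

def posteriors(likelihoods, priors):
    # Bayes' rule: P(w_j|x) = p(x|w_j) P(w_j) / p(x),
    # with evidence p(x) = sum_j p(x|w_j) P(w_j).
    joint = np.asarray(likelihoods) * np.asarray(priors)
    return joint / joint.sum()

# Two classes: p(x|w1) = 0.15, p(x|w2) = 0.36, P(w1) = 2/3, P(w2) = 1/3.
print(posteriors([0.15, 0.36], [2 / 3, 1 / 3]))   # ~ [0.46, 0.54]
```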


2.1 Bayesian Decision Theory

• The Bayesian decision is based on minimizing the probability of error; i.e., for a given feature value x:

Decide x ∈ ω1 if P(ω1 | x) > P(ω2 | x), and x ∈ ω2 if P(ω1 | x) < P(ω2 | x)

• The probability of error for a particular x is:

P(error | x) = min[ P(ω1 | x), P(ω2 | x) ]


2.1 Bayesian Decision Theory: Numerical Example

Suppose P(ω1) = 2/3 ≈ 0.67 and P(ω2) = 1/3 ≈ 0.33.

[Figure: class-conditional densities p(x | ω1) and p(x | ω2); at x = 11.5 they take the values 0.15 and 0.36]

If x = 11.5, then p(x | ω1) = 0.15 and p(x | ω2) = 0.36, so:

p(x) = 0.15 × 0.67 + 0.36 × 0.33 ≈ 0.22

P(ω1 | x) = 0.15 × 0.67 / 0.22 ≈ 0.46

P(ω2 | x) = 0.36 × 0.33 / 0.22 ≈ 0.54

Since P(ω2 | x) > P(ω1 | x), decide that the fish (x) belongs to ω2.


2.1 Bayesian Decision Theory

Computing P(ωj | x) for all values of x gives the decision regions (rules):

[Figure: posteriors over x, partitioning the feature axis into regions R1 and R2]

• if x ∈ R1, decide ω1

• if x ∈ R2, decide ω2


Assignment 2.1

• Draw the probability densities and find the decision regions for the following classes (a minimal starting sketch follows below):

Ω = {ω1, ω2},

p(x | ω1) ~ N(20, 4),

p(x | ω2) ~ N(15, 2),

P(ω1) = 1/3, and P(ω2) = 2/3.

– Then classify a sample with feature value x = 17.
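A minimal starting sketch, assuming the N(mean, variance) convention (so the standard deviations are 2 and √2):

```python
import numpy as np
from scipy.stats import norm

# Assumes N(mean, variance) notation: scale = sqrt(variance).
c1 = norm(loc=20, scale=np.sqrt(4))   # p(x|w1) ~ N(20, 4)
c2 = norm(loc=15, scale=np.sqrt(2))   # p(x|w2) ~ N(15, 2)
P1, P2 = 1 / 3, 2 / 3                 # priors from the assignment

x = 17.0
g1, g2 = c1.pdf(x) * P1, c2.pdf(x) * P2   # p(x|wj) P(wj)
print("decide w1" if g1 > g2 else "decide w2")

# Decision regions: where we decide w1 on a grid of x values.
xs = np.linspace(5, 30, 1001)
in_R1 = c1.pdf(xs) * P1 > c2.pdf(xs) * P2
```

Plotting c1.pdf(xs) and c2.pdf(xs) (e.g., with matplotlib) and marking where in_R1 changes value gives the region boundaries.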


2.2 General Bayesian Decision Theory

• Bayesian decision theory is generalized by allowing the following:

– Having more than one feature.

– Having more than two states of nature.

– Allowing actions, and not only deciding on the state of nature.

– Introducing a loss function, which is more general than the probability of error.


2.2 General Bayesian Decision Theory

• Allowing actions other than classification primarily allows the possibility of rejection.

• Rejection means refusing to make a decision in close or ambiguous cases.

• The loss function states how costly each action taken is.


2.2 General Bayesian Decision Theory

• Suppose we have c states of nature (categories):

Ω = {ω1, ω2, …, ωc},

• a feature vector:

x = (x1, x2, …, xd),

• the possible actions:

A = {α1, α2, …, αa},

• and the loss λ(αi | ωj) incurred for taking action αi when the state of nature is ωj.


2.2 General Bayesian Decision Theory

• The conditional risk R(αi | x) for selecting the action αi is given by:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x)$$

• The overall risk R of a decision rule α(x) is its expected loss over all x:

$$R = \int R(\alpha(x) \mid x)\,p(x)\,dx$$
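A sketch of the conditional-risk computation for one observation; the loss matrix and posteriors are illustrative:

```python
import numpy as np

# loss[i, j] = lambda(a_i | w_j): rows are actions, columns are states.
loss = np.array([[0.0, 1.0],      # illustrative values
                 [2.0, 0.0]])
post = np.array([0.46, 0.54])     # P(w_j | x) for the observed x

cond_risk = loss @ post           # R(a_i|x) = sum_j lambda(a_i|w_j) P(w_j|x)
best = np.argmin(cond_risk)       # take the minimum-risk action
print(cond_risk, "-> take action", best + 1)
```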


2.2 General Bayesian Decision Theory

The Bayesian decision rule becomes: select the action αi for which the conditional risk R(αi | x) is minimum. That is:

Take action αi (i.e., decide ωi) if R(αi | x) < R(αj | x) for all j ≠ i.


2.2 General Bayesian Decision Theory

• Minimizing R(αi | x) over all actions αi, i = 1, …, a, for every x minimizes the overall risk R.

• The overall risk R is the "expected loss associated with a given decision rule".

• The resulting minimum overall risk is called the Bayes risk, which defines the best performance that can be achieved.


2.2 General Bayesian Decision Theory

• Two-category classification example:

Suppose we have two categories {ω1, ω2} and two actions {α1, α2}, where α1 means deciding ω1 and α2 means deciding ω2; for simplicity we write λij = λ(αi | ωj).

The conditional risks for taking α1 and α2 are:

R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)

R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)


2.2 General Bayesian Decision Theory

There are a variety of ways to express the minimum-risk rule, each with its own advantages:

1. The fundamental rule is:

Decide ω1 (i.e., take α1) if R(α1 | x) < R(α2 | x), and ω2 (i.e., α2) if R(α1 | x) > R(α2 | x).


2.2 General Bayesian Decision Theory

2. The rule in terms of the posteriors:

Decide ω1 if (λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x); decide ω2 otherwise.

3. The rule in terms of the priors and conditional densities:

Decide ω1 if (λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2); decide ω2 otherwise.


2.2 General Bayesian Decision Theory

4. The rule in terms of the likelihood ratio:

Decide ω1 if

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$

decide ω2 otherwise.

That is, the Bayes (optimal) decision can be interpreted as:

"One can take an optimal decision if the likelihood ratio exceeds a threshold value that is independent of the observation x."
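A sketch of the likelihood-ratio test; all numbers are illustrative:

```python
# Decide w1 iff p(x|w1)/p(x|w2) > threshold, where
# threshold = (l12 - l22)/(l21 - l11) * P(w2)/P(w1).
l11, l12, l21, l22 = 0.0, 1.0, 2.0, 0.0   # illustrative losses
P1, P2 = 2 / 3, 1 / 3                     # illustrative priors
px_w1, px_w2 = 0.15, 0.36                 # densities at the observed x

threshold = (l12 - l22) / (l21 - l11) * (P2 / P1)
ratio = px_w1 / px_w2
print("decide w1" if ratio > threshold else "decide w2")
```

Note that the threshold depends only on the losses and priors, never on x itself.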


2.2 General Bayesian Decision Theory

• The decision regions depend on the values of the loss function.

• Let

$$\theta = \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}\,;$$

then decide ω1 if p(x | ω1)/p(x | ω2) > θ.

• For different loss functions we have different thresholds:

– if $\lambda = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ then $\theta_a = \dfrac{P(\omega_2)}{P(\omega_1)}$

– if $\lambda = \begin{pmatrix} 0 & 2 \\ 1 & 0 \end{pmatrix}$ then $\theta_b = 2\,\dfrac{P(\omega_2)}{P(\omega_1)}$


2.2 General Bayesian Decision Theory

[Figure: likelihood ratio p(x | ω1)/p(x | ω2) with the decision thresholds θa and θb from the previous slide]


2.3 Minimum-Error-Rate Classification

• Consider the zero-one (or symmetrical) loss function:

$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \ldots, c$$

• Therefore, the conditional risk is:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$

• In other words, for the symmetric loss function the conditional risk is the probability of error, so minimizing the risk means maximizing the posterior P(ωi | x).
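A short sketch showing the zero-one-loss identity and the resulting rule (the posteriors are illustrative):

```python
import numpy as np

post = np.array([0.2, 0.5, 0.3])          # illustrative P(w_j | x)

# Zero-one loss: lambda(a_i|w_j) = 0 if i == j, else 1.
c = len(post)
loss = np.ones((c, c)) - np.eye(c)

cond_risk = loss @ post                   # equals 1 - P(w_i | x)
assert np.allclose(cond_risk, 1 - post)   # the identity above
print("decide w%d" % (np.argmax(post) + 1))   # max posterior = min risk
```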


The Minimax Criterion

• Sometimes we need to design our classifier to perform well over a range of prior probabilities, or when we do not know the priors at all.

• A reasonable approach is to design the classifier so that the worst overall risk for any value of the priors is as small as possible.

• Minimax criterion: "minimize the maximum possible overall risk".


The Minimax Criterion

• The overall risk turns out to be linear in P(ω1). When the constant of proportionality (the slope) is zero, the risk is independent of the priors. This condition gives the minimax risk Rmm as:

$$R_{mm} = \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{R_1} p(x \mid \omega_2)\,dx = \lambda_{11} + (\lambda_{21} - \lambda_{11}) \int_{R_2} p(x \mid \omega_1)\,dx$$
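For zero-one loss this condition reduces to equal conditional error probabilities on the two classes; a numeric sketch with illustrative Gaussian classes:

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Illustrative classes (not from the slides); decide w2 if x < t, w1 if x >= t.
c1 = norm(loc=20, scale=2)    # p(x|w1)
c2 = norm(loc=15, scale=2)    # p(x|w2)

# With zero-one loss, R = P(w1)*eps1 + P(w2)*eps2 is linear in P(w1);
# its slope is zero exactly when eps1 == eps2.
def slope(t):
    eps1 = c1.cdf(t)          # error on w1: mass of p(x|w1) inside R2
    eps2 = c2.sf(t)           # error on w2: mass of p(x|w2) inside R1
    return eps1 - eps2

t_mm = brentq(slope, 10, 25)  # boundary where the risk line is flat
print("minimax boundary:", t_mm, "R_mm =", c1.cdf(t_mm))
```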


The Minimax Criterion

[Figure: overall risk plotted as a function of the prior, illustrating the minimax solution]


The Neyman-Pearson Criterion

• The Neyman-Pearson criterion: "minimize the overall risk subject to a constraint", e.g.:

$$\int R(\alpha_i \mid x)\,dx < \text{constant}$$

• Generally, the Neyman-Pearson criterion is satisfied by adjusting the decision boundaries numerically. However, for Gaussian and some other distributions, its solution can be found analytically.


Assignment 2.2

• Computer exercise: find the optimal decision rule for the following data:

Ω = {ω1, ω2},

p(x | ω1) ~ N(20, 4),

p(x | ω2) ~ N(15, 2),

P(ω1) = 2/3, and P(ω2) = 1/3,

– with the loss function:

$$\lambda = \begin{pmatrix} 1 & 2 \\ 0.5 & 1 \end{pmatrix}$$

– Then classify the samples x = 12, 17, 18, and 20.
