

Bayesian Decision Theory

Basic Concepts

Discriminant Functions

The Normal Density

ROC Curves

Figure 1.1

Basic Concepts

Definitions.

State of nature. Let c denote the state of nature; c is a random variable (e.g., c = c1 for sea bass, c = c2 for salmon).

A priori probabilities. Let P(c1) and P(c2) denote the a priori probabilities of c1 and c2, respectively.

We know P(c1) + P(c2) = 1.

Decision rule. Decide c1 if P(c1) > P(c2); otherwise choose c2.

But a rule that ignores the observation and relies on priors alone will clearly make many mistakes …

Basic Concepts

Class-conditional probability density function.

Let x be a continuous random variable. p(x|c) is the probability density for x given the state of nature c.

For example, what is the probability density of lightness given that the class is salmon?

p(lightness | salmon) ?

Or what is the probability of lightness given that the class is sea bass?

p(lightness | sea bass) ?

Figure 2.1

Bayes Formula

How do we combine a priori and class-conditional probabilities to know the probability of a state of nature?

Bayes Formula.

P(cj | x) = p(x|cj) P(cj) / p(x)

Here P(cj|x) is the posterior probability, p(x|cj) is the likelihood, P(cj) is the prior probability, and p(x) is the evidence, where p(x) = Σj p(x|cj) P(cj).

Bayes Decision: Choose c1 if P(c1|x) > P(c2|x); otherwise choose c2.
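As a concrete illustration, here is a minimal Python sketch of this decision rule. The Gaussian class-conditional densities and all numeric values are assumptions chosen for the example, not values from the text.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density, used here as an assumed class-conditional model."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Illustrative priors and class-conditional densities for c1 (sea bass) and c2 (salmon).
P_c1, P_c2 = 2.0 / 3.0, 1.0 / 3.0
p_x_given_c1 = lambda x: normal_pdf(x, mu=3.0, sigma=1.0)
p_x_given_c2 = lambda x: normal_pdf(x, mu=6.0, sigma=1.0)

def posterior(x):
    """Bayes formula: P(cj|x) = p(x|cj) P(cj) / p(x)."""
    evidence = p_x_given_c1(x) * P_c1 + p_x_given_c2(x) * P_c2
    return p_x_given_c1(x) * P_c1 / evidence, p_x_given_c2(x) * P_c2 / evidence

x = 4.2                      # an observed feature value (e.g., lightness), assumed
post1, post2 = posterior(x)
decision = "c1" if post1 > post2 else "c2"
print(post1, post2, decision)
```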

Figure 2.2

Simplifying Bayes Rule

Bayes Formula.

P(cj | x) = p(x|cj) P(cj) / p(x)

The evidence p(x) is the same for all states or classes, so we can dispense with it when making a decision.

Rule: Choose c1 if p(x|c1)P(c1) > p(x|c2)P(c2); otherwise decide c2

Observations:
• If p(x|c1) = p(x|c2), then the decision depends on the prior probabilities only.
• If P(c1) = P(c2), then the decision depends on the likelihoods only.

Thomas Bayes

Born in London (1701). Studied logic and theology (Univ. of Edinburgh). Fellow of the Royal Society (1742).

Given white and black balls in an urn, what is the prob. of drawing one or the other?

Given one or more balls, what can be said about the number of balls in the urn?

Minimizing Error

What is the probability of error?

P(error | x) = P(c1|x) if we decide c2, and P(c2|x) if we decide c1.

Does Bayes rule minimize the probability of error?

P(error) = ∫ P(error,x) dx = ∫ P(error|x) p(x) dx

and if for every x we make P(error|x) as small as possible, then the integral, and hence P(error), is as small as it can be.

Answer is “yes”.
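To see this empirically, the following sketch simulates a two-class problem (two assumed Gaussians with equal priors) and compares the error rate of the Bayes threshold against an arbitrarily shifted one. With equal priors and equal variances, the posteriors cross at the midpoint of the two means, so that is where the Bayes rule switches decisions.

```python
import math, random

random.seed(0)
mu1, mu2, sigma = 0.0, 2.0, 1.0   # assumed class-conditional Gaussians
P1, P2 = 0.5, 0.5                 # assumed equal priors

def sample():
    c = 1 if random.random() < P1 else 2
    mu = mu1 if c == 1 else mu2
    return c, random.gauss(mu, sigma)

def error_rate(threshold, n=100_000):
    """Decide c1 if x < threshold, else c2; return the empirical error."""
    errors = 0
    for _ in range(n):
        c, x = sample()
        decision = 1 if x < threshold else 2
        errors += (decision != c)
    return errors / n

bayes_threshold = (mu1 + mu2) / 2   # where the posteriors cross under these assumptions
print("Bayes threshold:", error_rate(bayes_threshold))
print("Shifted threshold:", error_rate(bayes_threshold + 0.7))
```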

Loss Function

Let {c1, c2, …} be the possible states of nature. Let {a1, a2, …} be the possible actions. The loss function λ(ai|cj) is the loss incurred for taking action ai when the state of nature is cj.

Suppose we observe a particular x and consider taking action ai. If the true state of nature is cj, the loss will be λ(ai|cj).

Expected loss: R(ai|x) = Σj λ(ai|cj) P(cj|x). (This is also called the conditional risk.)

Decision: Select the action that minimizes the conditional risk (this gives the best possible performance).
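A small sketch of this computation; the loss matrix and posterior values below are made up for illustration.

```python
# Rows: actions a1, a2, a3; columns: states c1, c2.  Entries are lambda(ai|cj).
loss = [
    [0.0, 2.0],   # a1
    [1.0, 0.0],   # a2
    [0.5, 0.5],   # a3 (e.g., a "defer" action, assumed)
]
posteriors = [0.3, 0.7]   # P(c1|x), P(c2|x) for the observed x (assumed values)

# Conditional risk R(ai|x) = sum_j lambda(ai|cj) P(cj|x)
risks = [sum(l * p for l, p in zip(row, posteriors)) for row in loss]
best_action = min(range(len(risks)), key=lambda i: risks[i])
print(risks, "-> choose a%d" % (best_action + 1))
```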

Two Categories

How does this apply when we have only two categories? Write λij = λ(ai|cj) as shorthand.

R(a1|x) = λ11 P(c1|x) + λ12 P(c2|x)
R(a2|x) = λ21 P(c1|x) + λ22 P(c2|x)

Decide a1 if

(λ21 - λ11) P(c1|x) > (λ12 - λ22) P(c2|x)

Two Categories

Another way of stating this is as follows.

Decide a1 if

p(x|c1) / p(x|c2) > [(λ12 - λ22) / (λ21 - λ11)] · [P(c2) / P(c1)]   (assuming λ21 > λ11, so the inequality direction is preserved)

Stated as:

Choose a1 if the likelihood ratio exceeds a threshold value independent of the observation x.

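A sketch of the resulting likelihood-ratio test; the losses, priors, and Gaussian likelihoods are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Assumed losses lambda(ai|cj) and priors.
lam11, lam12 = 0.0, 5.0
lam21, lam22 = 1.0, 0.0
P1, P2 = 0.6, 0.4

threshold = ((lam12 - lam22) / (lam21 - lam11)) * (P2 / P1)   # constant, independent of x

def decide(x):
    ratio = normal_pdf(x, 2.0, 1.0) / normal_pdf(x, 5.0, 1.0)  # p(x|c1) / p(x|c2), assumed densities
    return "a1" if ratio > threshold else "a2"

print(threshold, decide(2.5), decide(4.5))
```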

Figure 2.3

Zero-One Loss

We will normally be concerned with the symmetrical or zero-one loss function:

λ(ai|cj) = 0 if i = j, and 1 if i ≠ j

In this case the conditional risk is:

R(ai|x) = Σj λ(ai|cj) P(cj|x) = Σj≠i P(cj|x) = 1 - P(ci|x)
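A quick numerical check of this identity, using made-up posterior values.

```python
posteriors = [0.2, 0.5, 0.3]          # assumed P(c1|x), P(c2|x), P(c3|x)

# Zero-one loss: lambda(ai|cj) = 0 if i == j else 1.
def risk(i):
    return sum(p for j, p in enumerate(posteriors) if j != i)

for i, p in enumerate(posteriors):
    assert abs(risk(i) - (1.0 - p)) < 1e-12   # R(ai|x) = 1 - P(ci|x)
print([risk(i) for i in range(3)])            # minimized by the most probable class
```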

Exercise

This question refers to Bayes classifiers. Let {C1, C2, …, Cn} be the possible states of nature or classes, and let {a1, a2, …, ak} be the possible actions. We define a loss function λ(ai|Cj) as the loss incurred for taking action ai when the state of nature is Cj. Now define a new loss function as follows:

λ(ai|Cj) = 0 if i = j, and λ(ai|Cj) = 3 if i ≠ j

Assume a uniform distribution over the posterior probabilities P(Cj|x) (i.e., every class has the same posterior probability, 1/n).

What is the expected risk R(ai | x) ?

Exercise

R(ai|x) = Σj λ(ai|Cj) P(Cj|x) = Σj λ(ai|Cj) (1/n)

= (1/n) · 3(n - 1) = 3 - 3/n = 3(1 - 1/n)
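A quick numerical check of this result (the value of n is arbitrary here).

```python
n = 5                                   # any number of classes
posteriors = [1.0 / n] * n              # uniform posteriors, as assumed in the exercise

def risk(i):
    # loss is 0 when j == i and 3 otherwise
    return sum((0 if j == i else 3) * posteriors[j] for j in range(n))

print(risk(0), 3 * (1 - 1 / n))         # both give the same value
```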

Discriminant Functions

How do we represent pattern classifiers?

The most common way is through discriminant functions. Recall that {c1, c2, …} denotes the possible states of nature.

For each class we create a discriminant function gi(x).

The classifier assigns class ci if

gi(x) > gj(x) for all j ≠ i

Our classifier is a network or machine that computes c discriminant functions.

Figure 2.5

Simplifying the Discriminant Functions

Notice that the decision is unchanged if we replace every gi(x) with f(gi(x)), where f(·) is a monotonically increasing function.

Alternatives:

• gi(x) = P(ci|x)
• gi(x) = p(x|ci) P(ci)
• gi(x) = ln p(x|ci) + ln P(ci) (where ln is the natural logarithm)

The net effect is to divide the feature space into c regions (one for each class). We then have c decision regions separated by decision boundaries.
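The following sketch builds such a classifier from discriminant functions and checks that the decision does not change when each gi(x) is passed through a monotonically increasing f (here the natural logarithm). The three classes, their Gaussian densities, and the priors are assumptions for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Assumed three-class problem: (mean, sigma, prior) per class.
classes = [(0.0, 1.0, 0.5), (3.0, 1.0, 0.3), (6.0, 1.0, 0.2)]

def g(x, i):
    mu, sigma, prior = classes[i]
    return normal_pdf(x, mu, sigma) * prior                       # gi(x) = p(x|ci) P(ci)

def g_log(x, i):
    mu, sigma, prior = classes[i]
    return math.log(normal_pdf(x, mu, sigma)) + math.log(prior)   # ln p(x|ci) + ln P(ci)

def classify(x, disc):
    return max(range(len(classes)), key=lambda i: disc(x, i))

for x in (0.5, 2.8, 7.1):
    assert classify(x, g) == classify(x, g_log)                    # same decision regions
    print(x, "-> c%d" % (classify(x, g) + 1))
```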

Figure 2.6

Normal Density

Recall that using Bayes rule a discriminant function needs two probabilities:

gi(x) = p(x|ci) P(ci)

How do we model the likelihood or class-conditional probability?

The most common approach is to assume a multivariate normal density (shown below in its univariate form).

p(x) = [1 / (σ √(2π))] exp( -½ [(x - μ) / σ]² )

μ = E[x] = ∫ x p(x) dx

σ² = E[(x - μ)²] = ∫ (x - μ)² p(x) dx
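A small sketch that evaluates this density and estimates μ and σ² from samples; the data is simulated for illustration.

```python
import math, random

def normal_pdf(x, mu, sigma):
    """p(x) = 1/(sigma*sqrt(2*pi)) * exp(-0.5*((x - mu)/sigma)**2)"""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

random.seed(1)
data = [random.gauss(4.0, 2.0) for _ in range(10_000)]       # assumed samples from one class

mu_hat = sum(data) / len(data)                                # estimate of E[x]
var_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)    # estimate of E[(x - mu)^2]
print(mu_hat, var_hat, normal_pdf(4.0, mu_hat, math.sqrt(var_hat)))
```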

Figure 2.7

Normal Density

If the features are statistically independent and the variance is the same for all features, the discriminant functions are simple and linear.

A classifier that uses linear discriminant functions is called a linear machine.

The decision surfaces are pieces of hyperplanes defined by linear equations.
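A sketch of such a linear machine for the case of independent features with a shared variance σ²; the means, priors, and σ² below are assumed for illustration. Dropping the ||x||² term, which is common to all classes, leaves a discriminant that is linear in x.

```python
import math

sigma2 = 1.0                                        # shared variance for every feature (assumed)
classes = [                                          # (mean vector, prior) per class, assumed
    ([0.0, 0.0], 0.5),
    ([3.0, 0.0], 0.3),
    ([0.0, 3.0], 0.2),
]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def g(x, i):
    mu, prior = classes[i]
    # Linear discriminant: w_i . x + w_i0, obtained by dropping the common ||x||^2 term.
    w = [m / sigma2 for m in mu]
    w0 = -dot(mu, mu) / (2 * sigma2) + math.log(prior)
    return dot(w, x) + w0

x = [1.0, 2.0]                                       # an assumed feature vector
print(max(range(3), key=lambda i: g(x, i)))          # index of the chosen class
```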

Figure 2.10

Figure 2.11

Signal Detection Theory

Suppose we want to detect a single pulse from a signal.

We assume the signal has some random noise.

When the signal is present we observe a normal distribution with mean μ2. When the signal is not present we observe a normal distribution with mean μ1. We assume the same standard deviation σ in both cases.

Can we measure the discriminability of the problem? Can we do this independently of the threshold x*?

Discriminability:

d’ = |μ2 - μ1| / σ
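A tiny sketch that estimates d’ from samples of the two conditions; the data is simulated for illustration.

```python
import math, random

random.seed(2)
noise_only = [random.gauss(0.0, 1.0) for _ in range(5000)]    # signal absent, mean u1 (assumed)
signal     = [random.gauss(1.5, 1.0) for _ in range(5000)]    # signal present, mean u2 (assumed)

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

# d' = |u2 - u1| / sigma  (a pooled standard deviation is used here)
sigma = (std(noise_only) + std(signal)) / 2
d_prime = abs(mean(signal) - mean(noise_only)) / sigma
print(d_prime)
```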

Figure 2.19

Signal Detection Theory

How do we find d’ if we do not know μ1, μ2, or x*?

From the data we can compute:

1. P(x > x* | c2): a hit.

2. P(x > x* | c1): a false alarm.

3. P(x < x* | c2): a miss.

4. P(x < x* | c1): a correct rejection.

If we plot the hit rate against the false-alarm rate as the threshold x* varies, we obtain a ROC (receiver operating characteristic) curve.

With it we can distinguish between discriminability and bias.
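A sketch that sweeps the threshold x* over simulated data (re-using the two-Gaussian assumption above) and records (false-alarm rate, hit rate) pairs, i.e., points on an empirical ROC curve.

```python
import random

random.seed(3)
c1 = [random.gauss(0.0, 1.0) for _ in range(5000)]   # signal absent (assumed distribution)
c2 = [random.gauss(1.5, 1.0) for _ in range(5000)]   # signal present (assumed distribution)

def rates(threshold):
    hit = sum(x > threshold for x in c2) / len(c2)           # P(x > x* | c2)
    false_alarm = sum(x > threshold for x in c1) / len(c1)   # P(x > x* | c1)
    return false_alarm, hit

roc = [rates(t / 10.0) for t in range(-30, 50)]       # sweep x* from -3.0 to 4.9
for fa, hit in roc[::20]:                             # print a few points along the curve
    print(round(fa, 3), round(hit, 3))
```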

Figure 2.20

Points to Remember

Bayes formula and Bayes decision theory

Loss functions (e.g., zero-one loss)

Discriminant functions

ROC curves

Recommended