Bayesian Classification
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
It is based on Bayes theorem. A simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers.
Even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.
• Let X be the data record (case) whose class label is unknown. Let H be some hypothesis, such as "data record X belongs to a specified class C."
• For classification, we want to determine P(H|X) – the probability that the hypothesis H holds, given the observed data record X.
• P(H|X) is the posterior probability of H conditioned on X. For example, the probability that a fruit is an apple, given the condition that it is red and round.
• In contrast, P(H) is the prior probability, or a priori probability, of H. In this example P(H) is the probability that any given data record is an apple, regardless of how the data record looks.
• The posterior probability, P(H|X), is based on more information (such as background knowledge) than the prior probability, P(H), which is independent of X.
Bayes Theorem
Bayesian Classification: Simple introduction…
Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that X is red and round given that we know X is an apple.
P(X) is the prior probability of X, i.e., the probability that a data record from our set of fruits is red and round.
Bayes theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X), and P(X|H). Bayes theorem is
P(H|X) = P(X|H) P(H) / P(X)
Bayes Classifier
A probabilistic framework for solving classification problems
Conditional Probability:
P(C|A) = P(A,C) / P(A)
P(A|C) = P(A,C) / P(C)
Bayes theorem:
P(C|A) = P(A|C) P(C) / P(A)
Example of Bayes Theorem
Given:
A doctor knows that meningitis causes stiff neck 50% of the time.
The prior probability of any patient having meningitis is 1/50,000.
The prior probability of any patient having stiff neck is 1/20.
If a patient has stiff neck, what’s the probability he/she has meningitis?
P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
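The same computation in Python, as a minimal sketch using only the probabilities stated above:

p_s_given_m = 0.5       # P(S|M): meningitis causes stiff neck 50% of the time
p_m = 1 / 50000         # P(M): prior probability of meningitis
p_s = 1 / 20            # P(S): prior probability of stiff neck

print(p_s_given_m * p_m / p_s)   # 0.0002, by Bayes theorem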
Bayesian Classifiers
Consider each attribute and class label as random variables.
Given a record with attributes (A1, A2, …, An), the goal is to predict class C. Specifically, we want to find the value of C that maximizes
P(C| A1, A2,…,An )
Can we estimate P(C| A1, A2,…,An ) directly from data?
Bayesian Classifiers: Approach
Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem:
P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)
Choose the value of C that maximizes P(C | A1, A2, …, An). Since the denominator P(A1, A2, …, An) is the same for every class, this is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C).
How can we estimate P(A1, A2, …, An | C)?
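Before turning to estimation, a minimal Python sketch of the maximization step itself; the class names, likelihoods, and priors here are made up purely for illustration:

# Choose the class maximizing P(A1,...,An | C) * P(C);
# the evidence P(A1,...,An) cancels out of the comparison.
likelihoods = {"C1": 0.06, "C2": 0.0042}   # hypothetical P(A1,...,An | C)
priors      = {"C1": 0.35, "C2": 0.65}     # hypothetical P(C)

best = max(priors, key=lambda c: likelihoods[c] * priors[c])
print(best)   # "C1", since 0.06 * 0.35 > 0.0042 * 0.65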
Naïve Bayes Classifier
Assume independence among the attributes Ai when the class is given:
P(A1, A2, …, An | C) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
We can then estimate P(Ai | Cj) for all Ai and Cj from the training data.
A new point is classified to Cj if P(Cj) ∏ P(Ai | Cj) is maximal.
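Putting the two pieces together, a minimal sketch of a categorical naïve Bayes classifier in Python; the record format and function names are my own, and no smoothing is applied, so unseen attribute values get probability zero:

from collections import defaultdict

def train(records):
    # records: iterable of (attribute_tuple, class_label) pairs
    class_counts = defaultdict(int)   # counts for estimating P(Cj)
    attr_counts = defaultdict(int)    # counts for estimating P(Ai|Cj)
    for attrs, label in records:
        class_counts[label] += 1
        for i, value in enumerate(attrs):
            attr_counts[(label, i, value)] += 1
    return class_counts, attr_counts

def predict(attrs, class_counts, attr_counts):
    total = sum(class_counts.values())
    def score(c):
        s = class_counts[c] / total                            # P(Cj)
        for i, value in enumerate(attrs):
            s *= attr_counts[(c, i, value)] / class_counts[c]  # P(Ai|Cj)
        return s
    return max(class_counts, key=score)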
How to Estimate Probabilities from Data?
For continuous attributes:
Discretize the range into bins: one ordinal attribute per bin; this violates the independence assumption.
Two-way split: (A < v) or (A > v); choose only one of the two splits as the new attribute.
Probability density estimation: assume the attribute follows a normal distribution, use the data to estimate the parameters of the distribution (e.g., mean and standard deviation), and once the distribution is known, use it to estimate the conditional probability P(Ai|c), as in the sketch below.
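A small sketch of the density-estimation option in plain Python, assuming the attribute really is roughly normal within each class (sample variance uses n − 1, so at least two values per class are needed):

import math

def gaussian_conditional(x, class_values):
    # Estimate P(Ai = x | c) from the attribute values observed in class c.
    mu = sum(class_values) / len(class_values)                           # sample mean
    var = sum((v - mu) ** 2 for v in class_values) / (len(class_values) - 1)  # sample variance
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)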
Naïve Bayes Classifier: Example 1

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals
New record to classify:

Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?
P(A|M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A|N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A|M) P(M) = 0.06 × 7/20 = 0.021
P(A|N) P(N) = 0.0042 × 13/20 = 0.0027
A: attributes
M: mammals
N: non-mammals
P(A|M)P(M) > P(A|N)P(N)
=> Mammals
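The same arithmetic, reproduced in Python as a sanity check (counts read directly from the table above: 7 mammals, 13 non-mammals):

# Record: Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no
p_a_given_m = (6/7) * (6/7) * (2/7) * (2/7)         # P(A|M) ≈ 0.06
p_a_given_n = (1/13) * (10/13) * (3/13) * (4/13)    # P(A|N) ≈ 0.0042

print(p_a_given_m * 7 / 20)    # ≈ 0.021   P(A|M) P(M)
print(p_a_given_n * 13 / 20)   # ≈ 0.0027  P(A|N) P(N)  -> classify as mammal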
Naïve Bayes (Summary)
Robust to isolated noise points.
Handles missing values by ignoring the instance during probability estimate calculations.
Robust to irrelevant attributes.
The independence assumption may not hold for some attributes. It makes computation possible and yields optimal classifiers when it is satisfied, but it is seldom satisfied in practice, as attributes (variables) are often correlated.