Bayesian Classification
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
It is based on Bayes theorem. A simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers.
Even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.
• Let X be the data record (case) whose class label is unknown. Let H be some hypothesis, such as "data record X belongs to a specified class C."
• For classification, we want to determine P(H|X) – the probability that the hypothesis H holds, given the observed data record X.
• P(H|X) is the posterior probability of H conditioned on X. For example, the probability that a fruit is an apple, given the condition that it is red and round.
• In contrast, P(H) is the prior probability, or a priori probability, of H. In this example P(H) is the probability that any given data record is an apple, regardless of how the data record looks.
• The posterior probability, P(H|X), is based on more information (such as background knowledge) than the prior probability, P(H), which is independent of X.
Bayes Theorem
Bayesian Classification: Simple introduction…
Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that X is red and round given that we know X is an apple.
P(X) is the prior probability of X, i.e., the probability that a data record from our set of fruits is red and round.
Bayes theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X), and P(X|H). Bayes theorem is
P(H|X) = P(X|H) P(H) / P(X)
Bayes Classifier
A probabilistic framework for solving classification problems
Conditional Probability:
P(C|A) = P(A,C) / P(A)
P(A|C) = P(A,C) / P(C)
Bayes theorem:
P(C|A) = P(A|C) P(C) / P(A)
Example of Bayes Theorem
Given:
A doctor knows that meningitis causes stiff neck 50% of the time.
The prior probability of any patient having meningitis is 1/50,000.
The prior probability of any patient having stiff neck is 1/20.
If a patient has stiff neck, what’s the probability he/she has meningitis?
P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
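The same computation in Python, as a minimal sketch using only the probabilities stated above:

p_s_given_m = 0.5       # P(S|M): meningitis causes stiff neck 50% of the time
p_m = 1 / 50000         # P(M): prior probability of meningitis
p_s = 1 / 20            # P(S): prior probability of stiff neck

print(p_s_given_m * p_m / p_s)   # 0.0002, by Bayes theorem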
Bayesian Classifiers
Consider each attribute and class label as random variables.
Given a record with attributes (A1, A2, …, An), the goal is to predict class C. Specifically, we want to find the value of C that maximizes
P(C| A1, A2,…,An )
Can we estimate P(C| A1, A2,…,An ) directly from data?
Bayesian Classifiers: Approach
Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem:
P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)
Choose the value of C that maximizes P(C | A1, A2, …, An). Since the denominator P(A1, A2, …, An) is the same for every class, this is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C).
How can we estimate P(A1, A2, …, An | C)?
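Before turning to estimation, a minimal Python sketch of the maximization step itself; the class names, likelihoods, and priors here are made up purely for illustration:

# Choose the class maximizing P(A1,...,An | C) * P(C);
# the evidence P(A1,...,An) cancels out of the comparison.
likelihoods = {"C1": 0.06, "C2": 0.0042}   # hypothetical P(A1,...,An | C)
priors      = {"C1": 0.35, "C2": 0.65}     # hypothetical P(C)

best = max(priors, key=lambda c: likelihoods[c] * priors[c])
print(best)   # "C1", since 0.06 * 0.35 > 0.0042 * 0.65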
Naïve Bayes Classifier
Assume independence among the attributes Ai when the class is given:
P(A1, A2, …, An | C) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
We can then estimate P(Ai | Cj) for all Ai and Cj from the training data.
A new point is classified to Cj if P(Cj) ∏ P(Ai | Cj) is maximal.
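Putting the two pieces together, a minimal sketch of a categorical naïve Bayes classifier in Python; the record format and function names are my own, and no smoothing is applied, so unseen attribute values get probability zero:

from collections import defaultdict

def train(records):
    # records: iterable of (attribute_tuple, class_label) pairs
    class_counts = defaultdict(int)   # counts for estimating P(Cj)
    attr_counts = defaultdict(int)    # counts for estimating P(Ai|Cj)
    for attrs, label in records:
        class_counts[label] += 1
        for i, value in enumerate(attrs):
            attr_counts[(label, i, value)] += 1
    return class_counts, attr_counts

def predict(attrs, class_counts, attr_counts):
    total = sum(class_counts.values())
    def score(c):
        s = class_counts[c] / total                            # P(Cj)
        for i, value in enumerate(attrs):
            s *= attr_counts[(c, i, value)] / class_counts[c]  # P(Ai|Cj)
        return s
    return max(class_counts, key=score)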
How to Estimate Probabilities from Data?
For continuous attributes:
Discretize the range into bins: one ordinal attribute per bin; this violates the independence assumption.
Two-way split: (A < v) or (A > v); choose only one of the two splits as the new attribute.
Probability density estimation: assume the attribute follows a normal distribution, use the data to estimate the parameters of the distribution (e.g., mean and standard deviation), and once the distribution is known, use it to estimate the conditional probability P(Ai|c), as in the sketch below.
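A small sketch of the density-estimation option in plain Python, assuming the attribute really is roughly normal within each class (sample variance uses n − 1, so at least two values per class are needed):

import math

def gaussian_conditional(x, class_values):
    # Estimate P(Ai = x | c) from the attribute values observed in class c.
    mu = sum(class_values) / len(class_values)                           # sample mean
    var = sum((v - mu) ** 2 for v in class_values) / (len(class_values) - 1)  # sample variance
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)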
Naïve Bayes Classifier: Example 1

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals
New record to classify:

Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?
P(A|M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A|N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A|M) P(M) = 0.06 × 7/20 = 0.021
P(A|N) P(N) = 0.0042 × 13/20 = 0.0027
A: attributes
M: mammals
N: non-mammals
P(A|M)P(M) > P(A|N)P(N)
=> Mammals
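The same arithmetic, reproduced in Python as a sanity check (counts read directly from the table above: 7 mammals, 13 non-mammals):

# Record: Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no
p_a_given_m = (6/7) * (6/7) * (2/7) * (2/7)         # P(A|M) ≈ 0.06
p_a_given_n = (1/13) * (10/13) * (3/13) * (4/13)    # P(A|N) ≈ 0.0042

print(p_a_given_m * 7 / 20)    # ≈ 0.021   P(A|M) P(M)
print(p_a_given_n * 13 / 20)   # ≈ 0.0027  P(A|N) P(N)  -> classify as mammal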
Naïve Bayes (Summary)
Robust to isolated noise points.
Handles missing values by ignoring the instance during probability estimate calculations.
Robust to irrelevant attributes.
The independence assumption may not hold for some attributes. It makes computation possible and yields optimal classifiers when it is satisfied, but it is seldom satisfied in practice, as attributes (variables) are often correlated.