CLASSIFICATION: Bayesian Classifiers
Uses Bayes’ Theorem (after Thomas Bayes, 1701–1761) to build probabilistic models of the relationships between attributes and classes
Statistical principle for combining prior class knowledge with new evidence from data
Multiple implementations: Naïve Bayes, Bayesian networks
CLASSIFICATION: Bayesian Classifiers
Requires the concept of conditional probability: the probability of an event given that another event is known (by evidence or information) to have occurred
Notation: P(A|B) = probability of A given that B has occurred
P(A|B) = P(A∩B)/P(B), provided P(B) ≠ 0; equivalently, P(A∩B) = P(A|B)P(B)
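A quick illustration (mine, not from the slides): for one roll of a fair die, with A = “roll a 6” and B = “roll is even”, P(A|B) = P(A∩B)/P(B) = (1/6)/(1/2) = 1/3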
BAYESIAN CLASSIFIERS: Conditional Probability
Example: Suppose 1% of a specific population has a form of cancer
A new diagnostic test:
produces correct positive results for those with the cancer 99% of the time
produces correct negative results for those without the cancer 98% of the time
P(cancer) = 0.01, P(positive test | cancer) = 0.99, P(negative test | no cancer) = 0.98
BAYESIAN CLASSIFIERS: Conditional Probability
Example: But what if you tested positive? What is the probability that you actually have cancer?
Bayes’ Theorem “reverses” the conditioning to provide us with an answer
BAYESIAN CLASSIFIERS: Bayes’ Theorem
P(B|A) = P(B∩A)/P(A), if P(A)≠0
= P(A∩B)/P(A)
= P(A|B)P(B)/P(A)
Application to our example:
P(cancer | positive test) = P(positive test | cancer) × P(cancer) / P(positive test)
= (0.99 × 0.01) / (0.99 × 0.01 + 0.02 × 0.99)
= 0.0099 / 0.0297 ≈ 0.33
So even after a positive test, the probability of actually having the cancer is only about one in three, because the disease is rare
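The same arithmetic in a few lines of Python, as a quick sanity check (variable names are mine):

    # Bayes' Theorem for the cancer-screening example
    p_cancer = 0.01                       # prior: 1% of the population
    p_pos_given_cancer = 0.99             # correct positive rate (sensitivity)
    p_neg_given_no_cancer = 0.98          # correct negative rate (specificity)
    p_pos_given_no_cancer = 1 - p_neg_given_no_cancer  # false positive rate

    # total probability of testing positive
    p_pos = (p_pos_given_cancer * p_cancer
             + p_pos_given_no_cancer * (1 - p_cancer))

    # posterior: P(cancer | positive test)
    print(p_pos_given_cancer * p_cancer / p_pos)   # 0.3333...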
BAYESIAN CLASSIFIERS: Bayes’ Theorem
[Probability tree: branch to cancer (0.01) or no cancer (0.99); given cancer, test positive 0.99 / test negative 0.01; given no cancer, test negative 0.98 / test positive 0.02]
BAYESIAN CLASSIFIERS: Naïve Bayes
Bayes’ Theorem interpretation:
P(class C | F1, F2, … , Fn) = P(class C) × P(F1, F2, … , Fn | C) / P(F1, F2, … , Fn)
posterior = prior × likelihood / evidence
BAYESIAN CLASSIFIERS: Naïve Bayes
Key concepts:
The denominator does not depend on class C, so it is effectively constant for a given sample
The numerator is equivalent to the joint probability model P(C, F1, F2, … , Fn)
Under the naïve conditional independence assumption (each feature Fi is independent of the others given the class), the chain-rule expansion of the joint collapses to a product:
P(C | F1, F2, … , Fn) ∝ P(C) P(F1|C) P(F2|C) ⋯ P(Fn|C)
BAYESIAN CLASSIFIERS: Naïve Bayes
Multiple distributional assumptions possible for the per-feature likelihoods P(Fi|C): Gaussian, Multinomial, Bernoulli
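As an illustration (not from the slides), scikit-learn provides one naïve Bayes classifier per distributional assumption; a minimal Python sketch, assuming scikit-learn is installed:

    # one classifier per per-feature likelihood model
    from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

    clf_gauss = GaussianNB()      # continuous features: per-class normal distributions
    clf_multi = MultinomialNB()   # count features, e.g. word counts in a document
    clf_bern = BernoulliNB()      # binary features, e.g. word presence/absence
    # all share the same API: clf.fit(X_train, y_train), then clf.predict(X_test)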
BAYESIAN CLASSIFIERS: Naïve Bayes Example
Training set (example from Wikipedia; height in feet, weight in pounds, foot size in inches):
Gender | Height | Weight | Foot size
male   | 6.00   | 180    | 12
male   | 5.92   | 190    | 11
male   | 5.58   | 170    | 12
male   | 5.92   | 165    | 10
female | 5.00   | 100    | 6
female | 5.50   | 150    | 8
female | 5.42   | 130    | 7
female | 5.75   | 150    | 9
BAYESIAN CLASSIFIERS: Naïve Bayes Example
Assumptions:
Continuous data, modeled with a Gaussian (Normal) distribution per class
P(male) = P(female) = 0.5
BAYESIAN CLASSIFIERS: Naïve Bayes Example
Classifier generated from the training set (per-class sample mean and variance of each feature):
Gender | Mean height | Height variance | Mean weight | Weight variance | Mean foot size | Foot size variance
male   | 5.855  | 0.0350 | 176.25 | 122.92 | 11.25 | 0.9167
female | 5.4175 | 0.0972 | 132.5  | 558.33 | 7.5   | 1.6667
BAYESIAN CLASSIFIERS: Naïve Bayes Example
Test sample: height = 6 ft, weight = 130 lbs, foot size = 8 in; gender unknown
BAYESIAN CLASSIFIERS: Naïve Bayes Example
Calculate posterior probabilities for both genders:
Posterior(male) = P(male) P(height|male) P(weight|male) P(foot size|male) / evidence
Posterior(female) = P(female) P(height|female) P(weight|female) P(foot size|female) / evidence
The evidence is the same constant in both expressions, so we ignore the denominators
BAYESIAN CLASSIFIERS: Naïve Bayes Example
Calculations for male, where N(x; μ, σ²) = exp(−(x − μ)²/(2σ²)) / √(2πσ²) is the Gaussian density:
P(male) = 0.5 (assumed)
P(height = 6 | male) = N(6; 5.855, 0.0350) ≈ 1.5789 (a density, not a probability, so values above 1 are possible)
P(weight = 130 | male) = N(130; 176.25, 122.92) ≈ 5.9881 × 10⁻⁶
P(foot size = 8 | male) = N(8; 11.25, 0.9167) ≈ 1.3112 × 10⁻³
Posterior numerator (male) = 0.5 × 1.5789 × (5.9881 × 10⁻⁶) × (1.3112 × 10⁻³) ≈ 6.1984 × 10⁻⁹
BAYESIAN CLASSIFIERS: Naïve Bayes Example
Calculations for female:
P(female) = 0.5 (assumed)
P(height = 6 | female) = N(6; 5.4175, 0.0972) ≈ 0.2235
P(weight = 130 | female) = N(130; 132.5, 558.33) ≈ 1.6789 × 10⁻²
P(foot size = 8 | female) = N(8; 7.5, 1.6667) ≈ 2.8669 × 10⁻¹
Posterior numerator (female) = 0.5 × 0.2235 × (1.6789 × 10⁻²) × (2.8669 × 10⁻¹) ≈ 5.3778 × 10⁻⁴
BAYESIAN CLASSIFIERS: Naïve Bayes Example
Conclusion: the posterior numerator is (significantly) greater for the female classification (≈ 5.4 × 10⁻⁴) than for the male (≈ 6.2 × 10⁻⁹), so classify the sample as female
BAYESIAN CLASSIFIERS: Naïve Bayes Example
Note: we did not calculate P(evidence) [the normalizing constant] since it is not needed to pick the larger posterior, but we could:
P(evidence) = P(male) P(height|male) P(weight|male) P(foot size|male) + P(female) P(height|female) P(weight|female) P(foot size|female)
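The entire worked example can be reproduced in a short script; a minimal Python sketch using the training data from the cited Wikipedia example (function and variable names are mine):

    import math

    # training set: (height ft, weight lbs, foot size in) per class
    data = {
        "male":   [(6.00, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10)],
        "female": [(5.00, 100, 6), (5.50, 150, 8), (5.42, 130, 7), (5.75, 150, 9)],
    }
    prior = {"male": 0.5, "female": 0.5}   # assumed equal priors

    def mean_var(xs):
        # sample mean and unbiased (n - 1) sample variance
        m = sum(xs) / len(xs)
        return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    def gaussian(x, m, v):
        # normal density with mean m and variance v
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    sample = (6.00, 130, 8)                # test sample: height, weight, foot size

    scores = {}
    for cls, rows in data.items():
        params = [mean_var(col) for col in zip(*rows)]   # per-feature (mean, variance)
        score = prior[cls]
        for x, (m, v) in zip(sample, params):
            score *= gaussian(x, m, v)     # naive independence: multiply likelihoods
        scores[cls] = score                # posterior numerator

    print(scores)                          # male ≈ 6.20e-09, female ≈ 5.38e-04
    print(max(scores, key=scores.get))     # 'female'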
BAYESIAN CLASSIFIERS: Bayesian Networks
Judea Pearl (UCLA Computer Science, Cognitive Systems Lab): one of the pioneers of Bayesian Networks
Author of Probabilistic Reasoning in Intelligent Systems (1988)
Father of journalist Daniel Pearl, who was kidnapped and murdered in Pakistan in 2002 by al-Qaeda
BAYESIAN CLASSIFIERS: Bayesian Networks
Probabilistic graphical model
Represents random variables and conditional dependencies using a directed acyclic graph (DAG)
Nodes of graph represent random variables
BAYESIAN CLASSIFIERS: Bayesian Networks
Edges of graph represent conditional dependencies
Nodes that are not connected represent variables that are conditionally independent of each other
Does not require all attributes to be conditionally independent
BAYESIAN CLASSIFIERS: Bayesian Networks
Probability table associating each node to its immediate parent nodes If node X has no immediate parents, table
contains only prior probability P(X) If one parent Y, table contains P(X|Y) If multiple parents {Y1, Y2, ⋯ , Yn}, table
contains P(X|Y1, Y2, ⋯ , Yn)
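For example, in the sprinkler network on the next slide, node “Grass wet” has parents {Sprinkler on, Raining}, so its table contains P(G|S, R) for every combination of parent values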
BAYESIAN CLASSIFIERS: Bayesian Networks
[Figure: example network with nodes Raining (R), Sprinkler on (S), Grass wet (G); edges R → S, R → G, S → G; each node annotated with its probability table]
The model encodes the relevant probabilities, from which probabilistic inferences can then be calculated
Joint probability: P(G, S, R) = P(R) P(S|R) P(G|S, R)
G = “Grass wet”, S = “Sprinkler on”, R = “Raining”
BAYESIAN CLASSIFIERS: Bayesian Networks
We can then calculate, for example, the probability that it is raining given that the grass is wet:
P(R = T | G = T) = P(G = T, R = T) / P(G = T)
BAYESIAN CLASSIFIERS: Bayesian Networks
That is, summing the joint distribution over the variables not appearing in the query:
P(R = T | G = T) = Σ_S P(G = T, S, R = T) / Σ_{S, R} P(G = T, S, R)
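As a concrete sketch, this query can be evaluated by brute-force enumeration of the joint distribution. The CPT values below are an assumption (they are the ones commonly used with this textbook sprinkler example, e.g. in the Wikipedia article), since the slides’ tables are not reproduced here:

    # Sprinkler network: R -> S, R -> G, S -> G
    p_r = {True: 0.2, False: 0.8}                  # P(R): assumed CPT values
    p_s_given_r = {True: 0.01, False: 0.4}         # P(S = T | R)
    p_g_given_sr = {(True, True): 0.99, (True, False): 0.90,
                    (False, True): 0.80, (False, False): 0.0}  # P(G = T | S, R)

    def joint(g, s, r):
        # chain rule for this DAG: P(G, S, R) = P(R) P(S|R) P(G|S, R)
        ps = p_s_given_r[r] if s else 1 - p_s_given_r[r]
        pg = p_g_given_sr[(s, r)] if g else 1 - p_g_given_sr[(s, r)]
        return p_r[r] * ps * pg

    # P(R = T | G = T): sum out S in the numerator, S and R in the denominator
    num = sum(joint(True, s, True) for s in (True, False))
    den = sum(joint(True, s, r) for s in (True, False) for r in (True, False))
    print(num / den)   # ≈ 0.3577: wet grass makes rain about 36% likely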
BAYESIAN CLASSIFIERS: Bayesian Networks
Building the model:
Create the network structure (graph)
Determine the probability values of the tables
Simplest case: network defined by the user
Most real-world cases: defining the network is too complex, so use machine learning (many algorithms exist)
BAYESIAN CLASSIFIERS: Bayesian Networks
Algorithms built into Weka: user-defined network, conditional independence tests, genetic search, hill climber, K2, simulated annealing, maximum weight spanning tree, tabu search
BAYESIAN CLASSIFIERS: Bayesian Networks
Many other implementations available online
BNT (Bayes Net Toolbox), a Matlab toolbox by Kevin Murphy, University of British Columbia
http://www.cs.ubc.ca/~murphyk/Software/