Naïve Bayes Classifiers

CS 171/271

Definition

A classifier is a system that categorizes instances

Inputs to a classifier: feature/attribute values of a given instance

Output of a classifier: predicted category for that instance

Classifiers

Diagram: feature values X1, X2, X3, … -> Classifier -> category

Example: X1 (motility) = “flies”, X2 (number of legs) = 2, X3 (height) = 6 in

Y = “bird”

Learning from datasets

In the context of a learning agent, a classifier’s intelligence will be based on a dataset consisting of instances with known categories

Typical goal of a classifier: predict the category of a new instance that is rationally consistent with the dataset

Classifier algorithm (approach 1)

Select all instances in the dataset that match the input tuple (X1,X2,…,Xn) of feature values

Determine the distribution of Y-values for all the matches

Output the Y-value representing the most instances
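The three steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `classify_by_scan`, the `(features, category)` pair layout, and the toy dataset are all assumptions made for the example.

```python
from collections import Counter

def classify_by_scan(dataset, query):
    """Approach 1: scan every instance, keep exact matches, return the majority category."""
    # Count the Y-values of all instances whose feature tuple equals the query.
    matches = Counter(category for features, category in dataset if features == query)
    if not matches:
        return None  # no instance in the dataset matches the query tuple
    return matches.most_common(1)[0][0]

# Hypothetical toy dataset: (motility, number of legs) -> category
toy_dataset = [
    (("flies", 2), "bird"),
    (("flies", 2), "bird"),
    (("flies", 6), "insect"),
    (("walks", 4), "mammal"),
    (("flies", 2), "insect"),
]

print(classify_by_scan(toy_dataset, ("flies", 2)))  # bird (2 matches vs 1 for insect)
```

Every query walks the whole dataset, which is the cost the next slide points out.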

Problems with this approach

Classification time is proportional to dataset size

Time complexity: O( m ), where m is the dataset size

Not practical if the dataset is huge

Pre-computing distributions (approach 2)

What if we pre-compute all distributions for all possible tuples?

The classification process is then a simple matter of looking up the pre-computed distribution

Time complexity burden will be in the pre-computation stage, done only once

Still not practical if the number of features is not small

Suppose there are only two possible values per feature and there are n features -> 2^n possible combinations!
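A rough sketch of approach 2, under the assumption that the table is keyed by the feature tuple and stores only the winning category. The names `build_lookup_table` and `number_of_tuples` are hypothetical, and for brevity the table here covers only tuples that actually occur in the toy dataset.

```python
from collections import Counter, defaultdict
from math import prod

def build_lookup_table(dataset):
    """Approach 2 (sketch): pre-compute the majority category for each tuple seen in the dataset."""
    counts = defaultdict(Counter)
    for features, category in dataset:
        counts[features][category] += 1
    return {features: c.most_common(1)[0][0] for features, c in counts.items()}

def number_of_tuples(values_per_feature):
    """Number of distinct feature tuples, given how many values each feature can take."""
    return prod(values_per_feature)

# Hypothetical toy dataset: (motility, number of legs) -> category
toy_dataset = [
    (("flies", 2), "bird"),
    (("flies", 2), "bird"),
    (("flies", 6), "insect"),
    (("walks", 4), "mammal"),
    (("flies", 2), "insect"),
]

table = build_lookup_table(toy_dataset)
print(table[("flies", 2)])          # bird, found by a single dictionary lookup
print(number_of_tuples([2] * 10))   # 1024
print(number_of_tuples([2] * 100))  # 2**100, roughly 1.27e30 table entries
```

The query is cheap, but the table size grows exponentially with the number of features.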

What we need

Typically, n (number of features) will be in the hundreds and m (number of instances in the dataset) will be in the tens of thousands

We want a classifier that pre-computes enough so that it does not need to scan through the instances during the query, but we do not want to pre-compute too many values

Probability Notation

What we want to estimate from our dataset is a conditional probability

P( Y=c | X1=v1, X2=v2, …, Xn=vn ) represents the probability that the category of the instance is c, given that the feature values are v1,v2,…,vn (the input)

In our classifier, we output the c with maximum probability

Bayes Theorem

Bayes' theorem allows us to invert a conditional probability

P( A=a | B=b ) = P( B=b | A=a ) P( A=a ) / P( B=b )

Why and how will this help? The answer will come later

Figure: Venn diagram with one region labeled P( A=a ) and one labeled P( B=b )

Suppose U = W+X+Y+Z

P( A=a | B=b ) = Z/(Z+Y)
P( B=b | A=a ) = Z/(Z+X)
P( A=a ) = (Z+X)/U
P( B=b ) = (Z+Y)/U
P( A=a ) / P( B=b ) = (Z+X)/(Z+Y)

P( B=b | A=a ) P( A=a ) / P( B=b )

= [ Z/(Z+X) ] (Z+X)/(Z+Y)

= Z/(Z+Y) = P( A=a | B=b )
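The identity can be checked numerically. The sketch below assumes the reading suggested by the formulas: Z counts cases where both A=a and B=b hold, X counts A=a only, Y counts B=b only, and W counts neither. The specific counts are made up for illustration.

```python
from fractions import Fraction

# Hypothetical region counts (lowercase names to avoid clashing with the category Y).
w, x, y, z = 50, 20, 10, 20
u = w + x + y + z

p_a = Fraction(z + x, u)          # P( A=a )
p_b = Fraction(z + y, u)          # P( B=b )
p_a_given_b = Fraction(z, z + y)  # P( A=a | B=b )
p_b_given_a = Fraction(z, z + x)  # P( B=b | A=a )

# Right-hand side of Bayes' theorem, computed with exact rational arithmetic.
bayes_rhs = p_b_given_a * p_a / p_b

print(p_a_given_b, bayes_rhs)  # 2/3 2/3
```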

Another helpful equivalence

Assuming that two events are independent, the probability that both events occur is equal to the product of their individual probabilities

P( X1=v1, X2=v2 ) = P( X1=v1 ) P( X2=v2 )

Goal: maximize this quantity over all possible Y-values

P( Y=c | X1=v1, X2=v2, …, Xn=vn )

= P( X1=v1, X2=v2, …, Xn=vn | Y=c ) P( Y=c ) / P( X1=v1, X2=v2, …, Xn=vn )   (by Bayes' theorem)

= P(X1=v1|Y=c) P(X2=v2|Y=c)…P(Xn=vn|Y=c) P( Y=c ) / P( X1=v1, X2=v2, …, Xn=vn )   (assuming the features are independent within category c)

Can ignore the divisor since it remains the same regardless of Y-value

The critical step

And here it is…

We want a classifier to find the c that yields max P( Y=c | X1=v1, X2=v2, …, Xn=vn )

We get the same c if we instead compute

max P(X1=v1|Y=c) P(X2=v2|Y=c)…P(Xn=vn|Y=c) P(Y=c)

These values can be pre-computed and the number of computations is not combinatorially explosive

Building a classifier (approach 3)

For each category c, estimate

P( Y=c ) = (number of c-instances) / (total number of instances)

For each category c, for each feature Xi, determine the distribution P( Xi | Y=c )

For each possible value v for Xi, estimate

P( Xi=v | Y=c ) = (number of c-instances where Xi=v) / (number of c-instances)
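The two counting estimates above might be coded as follows. This is a sketch under assumptions: the function name `train_naive_bayes`, the nested-dictionary layout `likelihoods[c][i][v]`, and the toy dataset are illustrative choices, and no smoothing is applied because the slides do not describe any.

```python
from collections import Counter, defaultdict

def train_naive_bayes(dataset):
    """Estimate P( Y=c ) and P( Xi=v | Y=c ) by counting, as in approach 3."""
    class_counts = Counter(category for _, category in dataset)
    total = len(dataset)
    priors = {c: n / total for c, n in class_counts.items()}

    # value_counts[c][i][v] = number of c-instances where feature i has value v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for features, category in dataset:
        for i, v in enumerate(features):
            value_counts[category][i][v] += 1

    likelihoods = {
        c: {i: {v: n / class_counts[c] for v, n in counter.items()}
            for i, counter in per_feature.items()}
        for c, per_feature in value_counts.items()
    }
    return priors, likelihoods

# Hypothetical toy dataset: (motility, number of legs) -> category
toy_dataset = [
    (("flies", 2), "bird"),
    (("walks", 2), "bird"),
    (("flies", 6), "insect"),
    (("flies", 6), "insect"),
]

priors, likelihoods = train_naive_bayes(toy_dataset)
print(priors["bird"])                   # 0.5
print(likelihoods["bird"][0]["flies"])  # 0.5
print(likelihoods["insect"][1][6])      # 1.0
```

The dataset is scanned once here; queries afterwards only read the two tables.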

Using the classifier (approach 3)

For a given input tuple (v1,v2,…,vn), determine the category c that yields

max P(X1=v1|Y=c) P(X2=v2|Y=c)…P(Xn=vn|Y=c)P(Y=c)

by looking up the terms from the pre-computed values

Output category c
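The lookup-and-maximize step can be sketched like this. The probability tables are hand-written, hypothetical values standing in for the pre-computed estimates, and the function name `classify` is an assumption. A feature value never seen with a category is given probability 0 here, which is a simplification chosen for the example.

```python
def classify(priors, likelihoods, query):
    """Return the category c maximizing P(X1=v1|Y=c)…P(Xn=vn|Y=c) P(Y=c)."""
    best_category, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, v in enumerate(query):
            # Look up the pre-computed P( Xi=v | Y=c ); unseen values count as 0.
            score *= likelihoods[c][i].get(v, 0.0)
        if score > best_score:
            best_category, best_score = c, score
    return best_category

# Hypothetical pre-computed values for features (motility, number of legs).
priors = {"bird": 0.5, "insect": 0.5}
likelihoods = {
    "bird": [{"flies": 0.8, "walks": 0.2}, {2: 1.0}],
    "insect": [{"flies": 0.6, "walks": 0.4}, {6: 1.0}],
}

print(classify(priors, likelihoods, ("flies", 2)))  # bird
print(classify(priors, likelihoods, ("walks", 6)))  # insect
```

Each query costs one multiplication per feature per category, independent of the dataset size.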

Example

Suppose we wanted a classifier that categorizes organisms according to certain characteristics

Organism categories (Y) are: mammal, bird, fish, insect, spider

Characteristics (X1,X2,X3,X4): motility (walks, flies, swims), number of legs (2,4,6,8), size (small, large), body-covering (fur, scales, feathers)

The dataset contains 1000 organism samples

m = 1000, n = 4, number of categories = 5

Comparing approaches

Approach 1: requires scanning all tuples for matching feature values

Entails 1000*4 = 4000 comparisons per query, then counting the occurrences of each category

Approach 2: pre-compute probabilities

Preparation: for each of the 3*4*2*3 = 72 combinations, determine the probability for each category (72*5 = 360 computations)

Query: straightforward lookup of answer

Comparing approaches

Approach 3: Naïve Bayes classifier

Preparation: compute P( Y=c ) probabilities: 5 of them; compute P( Xi=v | Y=c ) probabilities: 5*(3+4+2+3) = 60 of them

Query: straightforward computation of 5 probabilities, determine maximum, return category that yields the maximum
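The counts for the three approaches in the organism example can be recomputed directly from the feature sizes. The variable names below are illustrative; the numbers come from the example (m = 1000 instances, 5 categories, features with 3, 4, 2 and 3 possible values).

```python
from math import prod

m = 1000                           # instances in the dataset
num_categories = 5                 # mammal, bird, fish, insect, spider
values_per_feature = [3, 4, 2, 3]  # motility, legs, size, body-covering

# Approach 1: compare every feature of every instance on each query.
approach1_comparisons = m * len(values_per_feature)

# Approach 2: one distribution per possible feature tuple.
approach2_combinations = prod(values_per_feature)
approach2_values = approach2_combinations * num_categories

# Approach 3: priors plus one conditional probability per (category, feature value).
approach3_priors = num_categories
approach3_conditionals = num_categories * sum(values_per_feature)

print(approach1_comparisons)                     # 4000
print(approach2_combinations, approach2_values)  # 72 360
print(approach3_priors, approach3_conditionals)  # 5 60
```

Approach 2 multiplies the feature sizes while approach 3 adds them, which is why only approach 3 scales to many features.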

About the Naïve Bayes Classifier

Computations and resources required are reasonable, both for the preparatory stage and actual query stage

Even if the number n of features is in the thousands!

The classifier is naïve because it assumes independence of features (this is likely not the case)

It turns out that the classifier works well in practice even with this limitation

Logarithms of probabilities are often used instead of the actual probabilities to avoid floating-point underflow when computing the probability products
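A small demonstration of the underflow problem and the log-based workaround. The list of 100 probabilities of 1e-5 each is a made-up stand-in for the per-feature terms of one category.

```python
import math

# Hypothetical per-feature probabilities for one category: 100 small terms.
probabilities = [1e-5] * 100

# Direct product: the true value 1e-500 is below the smallest positive double,
# so the running product collapses to exactly 0.0.
product = 1.0
for p in probabilities:
    product *= p

# Sum of logarithms: the same quantity on a log scale, well within range.
log_score = sum(math.log(p) for p in probabilities)

print(product)    # 0.0
print(log_score)  # about -1151.29
```

Because the logarithm is increasing, the category with the largest sum of logs is also the category with the largest product, so the classifier's output is unchanged.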

Related areas of study

Density estimators: alternative methods of computing the probabilities

Feature selection: eliminate unnecessary or redundant features (those that don’t help as much with classification) in order to reduce the value of n
