
Naive Bayes Applied Machine Learning - cs.mcgill.ca


Page 1: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Applied Machine Learning: Naive Bayes

Siamak Ravanbakhsh

COMP 551 (Winter 2020)

Page 2: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Learning objectives

generative vs. discriminative classifiers
Naive Bayes classifier: its assumption and different design choices

Page 3: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Discriminative vs. generative classification

discriminative: so far we modeled the conditional distribution p(y | x)
generative: learn the joint distribution p(y, x) = p(y) p(x | y)

how to classify a new input x? Bayes rule:

p(y = c | x) = p(c) p(x | c) / p(x)

p(c): prior class probability, the frequency of observing this label
p(x | c): likelihood of the input features given the class label (the input features for each label come from a different distribution)
p(y = c | x): posterior probability of a given class
p(x) = sum_{c'=1}^{C} p(x, c'): marginal probability of the input (evidence)

[figure: class-conditional densities p(x | y = 0) and p(x | y = 1) over a single feature x; image: https://rpsychologist.com]

Page 4: Naive Bayes Applied Machine Learning - cs.mcgill.ca


Example: Bayes rule for classification

does the patient have cancer? y ∈ {yes, no}
x ∈ {−, +}: the test result, a single binary feature

p(c | x) = p(c) p(x | c) / p(x)

prior: 1% of the population has cancer, p(yes) = .01
likelihood: p(+ | yes) = .9, the TP rate of the test (90%); p(+ | no) = .05, the FP rate of the test (5%)
evidence: p(+) = p(yes) p(+ | yes) + p(no) p(+ | no) = .01 × .9 + .99 × .05 = .0585
posterior: p(yes | +) = .01 × .9 / .0585 ≈ .15

in a generative classifier, the likelihood & prior class probabilities are learned from data
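As a minimal sketch (not from the slides), this posterior can be computed with NumPy; the array layout below is an assumption for illustration:

import numpy as np

prior = np.array([0.99, 0.01])            # p(no), p(yes)
likelihood_pos = np.array([0.05, 0.90])   # p(+ | no), p(+ | yes)

evidence = np.sum(prior * likelihood_pos)       # p(+) = 0.0585
posterior = prior * likelihood_pos / evidence   # Bayes rule
print(posterior)                                # p(yes | +) ≈ 0.15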

Page 5: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Generative classification

p(y = c | x) = p(c) p(x | c) / p(x),   where p(x) = sum_{c'=1}^{C} p(x, c')

p(c): prior class probability, the frequency of observing this label
p(x | c): likelihood of the input features given the class label (the input features for each label come from a different distribution)
p(y = c | x): posterior probability of a given class
p(x): marginal probability of the input (evidence)

image: https://rpsychologist.com

Some generative classifiers:

Gaussian Discriminant Analysis: the likelihood is multivariate Gaussian

Naive Bayes: decomposed likelihood


Page 6: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Naive Bayes: model

assumption about the likelihood: p(x | y) = ∏_{d=1}^{D} p(x_d | y), where D is the number of input features

when is this assumption correct? when the features are conditionally independent given the label: x_i ⊥ x_j | y
knowing the label, the value of one input feature gives us no information about the other input features

chain rule of probability (true for any distribution):
p(x | y) = p(x_1 | y) p(x_2 | y, x_1) p(x_3 | y, x_1, x_2) ⋯ p(x_D | y, x_1, …, x_{D−1})

conditional independence assumption: x_1, x_2 give no extra information, so p(x_3 | y, x_1, x_2) = p(x_3 | y)

Page 7: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Naive Bayes: objective

given the training dataset D = {(x^(1), y^(1)), …, (x^(N), y^(N))}

maximize the joint likelihood (contrast with logistic regression):

ℓ(w, u) = Σ_n log p_{u,w}(x^(n), y^(n))
        = Σ_n log p_u(y^(n)) + Σ_n log p_w(x^(n) | y^(n))
        = Σ_n log p_u(y^(n)) + Σ_d Σ_n log p_{w[d]}(x_d^(n) | y^(n))    (using the Naive Bayes assumption)

separate MLE estimates for each part

Page 8: Naive Bayes Applied Machine Learning - cs.mcgill.ca


Naive Bayes: train-test

given the training dataset D = {(x^(1), y^(1)), …, (x^(N), y^(N))}

training time: learn the prior class probabilities p_u(y) and the likelihood components p_{w[d]}(x_d | y) for all d

test time: find the posterior class probabilities and predict the most probable class

arg max_c p(c | x) = arg max_c  p_u(c) ∏_{d=1}^{D} p_{w[d]}(x_d | c) / Σ_{c'=1}^{C} p_u(c') ∏_{d=1}^{D} p_{w[d]}(x_d | c')
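A minimal test-time sketch (not from the slides), working in log space for numerical stability; the names log_prior (length C) and log_likelihood(x, c), which returns Σ_d log p_{w[d]}(x_d | c), are assumed to come from the training step:

import numpy as np

def predict(x, log_prior, log_likelihood):
    # log_prior: length-C array of log p_u(c)
    # log_likelihood(x, c): sum_d log p_{w[d]}(x_d | c) for class c
    C = len(log_prior)
    log_joint = np.array([log_prior[c] + log_likelihood(x, c) for c in range(C)])
    log_joint -= np.max(log_joint)        # numerical stability before exponentiating
    posterior = np.exp(log_joint)
    posterior /= posterior.sum()          # divide by the evidence (normalization)
    return np.argmax(posterior), posterior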

Page 9: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Class prior

binary classification: Bernoulli distribution p_u(y) = u^y (1 − u)^(1−y)

maximizing the log-likelihood:
ℓ(u) = Σ_{n=1}^{N} y^(n) log(u) + (1 − y^(n)) log(1 − u)
     = N_1 log(u) + (N − N_1) log(1 − u)
where N_1 is the frequency of class 1 in the dataset and N − N_1 the frequency of class 0

setting its derivative to zero:
d/du ℓ(u) = N_1/u − (N − N_1)/(1 − u) = 0  ⇒  u* = N_1 / N

the max-likelihood estimate (MLE) is the frequency of class labels

p(c | x) = p_u(c) ∏_{d=1}^{D} p_{w[d]}(x_d | c) / Σ_{c'=1}^{C} p_u(c') ∏_{d=1}^{D} p_{w[d]}(x_d | c')

Page 10: Naive Bayes Applied Machine Learning - cs.mcgill.ca


Class prior

multiclass classification: categorical distribution p_u(y) = ∏_{c=1}^{C} u_c^{y_c}, assuming one-hot coding for the labels

u = [u_1, …, u_C] is now a parameter vector

p(c | x) = p_u(c) ∏_{d=1}^{D} p_{w[d]}(x_d | c) / Σ_{c'=1}^{C} p_u(c') ∏_{d=1}^{D} p_{w[d]}(x_d | c')

maximizing the log-likelihood ℓ(u) = Σ_n Σ_c y_c^(n) log(u_c), subject to Σ_c u_c = 1

closed form for the optimal parameter: u* = [N_1/N, …, N_C/N]
where N_c is the number of instances in class c and N is the number of all instances in the dataset
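A minimal sketch (not from the slides) of this class-prior MLE; the name y_onehot and the one-hot layout are assumptions:

import numpy as np

def fit_class_prior(y_onehot):
    # y_onehot: N x C one-hot label matrix
    return np.mean(y_onehot, 0)   # u* = [N_1/N, ..., N_C/N]

# binary case with a 0/1 label vector y: u* = np.mean(y), the frequency of class 1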

Page 11: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Likelihood terms (class-conditionals)

p(c | x) = p_u(c) ∏_{d=1}^{D} p_{w[d]}(x_d | c) / Σ_{c'=1}^{C} p_u(c') ∏_{d=1}^{D} p_{w[d]}(x_d | c')

the choice of likelihood distribution depends on the type of features (the likelihood encodes our assumption about the "generative process"):

Bernoulli: binary features
Categorical: categorical features
Gaussian: continuous features
...

each feature x_d may use a different likelihood
note that these are different from the choice of distribution for the class prior

separate max-likelihood estimates for each feature:
w[d]* = arg max_{w[d]} Σ_{n=1}^{N} log p_{w[d]}(x_d^(n) | y^(n))

Page 12: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Bernoulli Naive Bayes

binary features: the likelihood is Bernoulli, with one parameter per label

p_{w[d]}(x_d | y = 0) = Bernoulli(x_d; w_{[d],0})
p_{w[d]}(x_d | y = 1) = Bernoulli(x_d; w_{[d],1})
(short form: p_{w[d]}(x_d | y) = Bernoulli(x_d; w_{[d],y}))

max-likelihood estimation is similar to what we saw for the prior; the closed form solution of the MLE is

w*_{[d],c} = N(y = c, x_d = 1) / N(y = c)

where N(·) is the number of training instances satisfying the condition
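A minimal sketch (not from the slides) of this MLE for all features at once; X, y_onehot and the D x C parameter layout are assumptions:

import numpy as np

def fit_bernoulli_likelihood(X, y_onehot):
    # X: N x D binary feature matrix, y_onehot: N x C one-hot labels
    # returns w of shape D x C with w[d, c] = N(y=c, x_d=1) / N(y=c)
    counts = X.T @ y_onehot              # D x C: N(y=c, x_d=1)
    return counts / np.sum(y_onehot, 0)  # divide by N(y=c) for each class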

Page 13: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Example: Bernoulli Naive Bayes

using Naive Bayes for document classification:
2 classes (document types)
600 binary features: x_d^(n) = 1 iff word d is present in document n (vocabulary of 600)

[figure: the learned parameters w*_{[d],0} and w*_{[d],1}, the likelihood of each word in the two document types]

import numpy as np

def BernoulliNaiveBayes(prior,      # vector of size 2: class prior
                        likelihood, # 600 x 2: likelihood of each word under each class
                        x,          # vector of size 600: binary features for a new document
                        ):
    # Bernoulli log-likelihood: sum_d x_d*log(theta_dc) + (1 - x_d)*log(1 - theta_dc)
    log_p = np.log(prior) \
          + np.sum(x[:, None] * np.log(likelihood), 0) \
          + np.sum((1 - x[:, None]) * np.log(1 - likelihood), 0)
    log_p -= np.max(log_p)           # numerical stability
    posterior = np.exp(log_p)        # vector of size 2
    posterior /= np.sum(posterior)   # normalize
    return posterior                 # posterior class probabilities
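A hypothetical usage sketch (all values below are made up for illustration, not learned from data), assuming prior and likelihood were estimated as on the previous slides:

rng = np.random.default_rng(0)
prior = np.array([0.5, 0.5])                          # assumed equal class prior
likelihood = rng.uniform(0.05, 0.95, size=(600, 2))   # made-up per-class word probabilities
x = rng.integers(0, 2, size=600)                      # a new binary document vector
print(BernoulliNaiveBayes(prior, likelihood, x))      # posterior over the 2 classes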

Page 14: Naive Bayes Applied Machine Learning - cs.mcgill.ca


Multinomial Naive Bayes

what if we wanted to use word frequencies in document classification?
x_d^(n) is the number of times word d appears in document n

Multinomial likelihood: p_w(x | c) = ( (Σ_d x_d)! / ∏_{d=1}^{D} x_d! ) ∏_{d=1}^{D} w_{d,c}^{x_d}

we have a parameter vector of size D for each class (C × D parameters)

MLE estimates: w*_{d,c} = Σ_n x_d^(n) y_c^(n) / Σ_n Σ_{d'} x_{d'}^(n) y_c^(n)
(count of word d in all documents labelled c, divided by the total word count in all documents labelled c)
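A minimal sketch (not from the slides) of this MLE; the names X and y_onehot are assumptions, and no smoothing is applied:

import numpy as np

def fit_multinomial_likelihood(X, y_onehot):
    # X: N x D word-count matrix, y_onehot: N x C one-hot labels
    # returns w of shape D x C: per-class word proportions
    counts = X.T @ y_onehot         # D x C: count of word d in all documents of class c
    return counts / counts.sum(0)   # normalize by the total word count of each class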

Page 15: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Gaussian Naive Bayes

Gaussian likelihood terms:

p_{w[d]}(x_d | y) = N(x_d; μ_{d,y}, σ²_{d,y}) = 1/√(2π σ²_{d,y}) exp( −(x_d − μ_{d,y})² / (2 σ²_{d,y}) )

w[d] = (μ_{d,1}, σ_{d,1}, …, μ_{d,C}, σ_{d,C}): one mean and std. parameter for each class-feature pair

writing the log-likelihood and setting its derivative to zero we get the maximum likelihood estimates:

μ_{d,c} = (1/N_c) Σ_{n=1}^{N} y_c^(n) x_d^(n)
σ²_{d,c} = (1/N_c) Σ_{n=1}^{N} y_c^(n) (x_d^(n) − μ_{d,c})²

the empirical mean & variance of feature x_d across instances with label c

Page 16: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Example: Gaussian Naive Bayes

classification on the Iris flowers dataset: N_c = 50 samples with D = 4 features, for each of C = 3 species of Iris flower
(a classic dataset originally used by Fisher)

our setting: 3 classes, 2 features (sepal width, petal length)

Page 17: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Example: Gaussian Naive Bayes

categorical class prior & Gaussian likelihood

import numpy as np

def GaussianNaiveBayes(X,      # N x D
                       y,      # N x C
                       Xtest,  # N_test x D
                       ):
    N, C = y.shape
    D = X.shape[1]
    mu, s = np.zeros((C, D)), np.zeros((C, D))
    for c in range(C):                       # calculate mean and std per class
        inds = np.nonzero(y[:, c])[0]
        mu[c, :] = np.mean(X[inds, :], 0)
        s[c, :] = np.std(X[inds, :], 0)
    log_prior = np.log(np.mean(y, 0))[:, None]
    log_likelihood = -np.sum(np.log(s[:, None, :]) +
                             .5 * (((Xtest[None, :, :] - mu[:, None, :]) / s[:, None, :]) ** 2), 2)
    return log_prior + log_likelihood        # C x N_test: unnormalized log-posterior

decision boundaries are not linear!


Page 18: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Example: Gaussian Naive Bayes

categorical class prior & Gaussian likelihood

(same GaussianNaiveBayes code as on the previous slide)

[figure: posterior class probability for c = 1]

Page 19: Naive Bayes Applied Machine Learning - cs.mcgill.ca


Example: Gaussian Naive Bayes

using the same variance for all classes, only the likelihood line changes:

    log_likelihood = -np.sum(.5 * ((Xtest[None, :, :] - mu[:, None, :]) ** 2), 2)

the decision boundaries are now linear, and the value of the shared variance does not make a difference

Page 20: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Decision boundary in generative classifiers

decision boundaries: where two classes have the same probability, p(y = c | x) = p(y = c' | x), which means

log [ p(y = c | x) / p(y = c' | x) ] = log [ p(c) p(x | c) / ( p(c') p(x | c') ) ] = log [ p(c) / p(c') ] + log [ p(x | c) / p(x | c') ] = 0

the first term is not a function of x (ignore); the second ratio is linear (in some bases) for a large family of probabilities (called the linear exponential family):

p(x | c) = exp( w_{y,c}^T φ(x) ) / Z(w_{y,c})

log [ p(x | c) / p(x | c') ] = (w_{y,c} − w_{y,c'})^T φ(x) + g(w_{y,c}, w_{y,c'})

linear using some bases φ(x); g is not a function of x

e.g., Bernoulli is a member of this family with φ(x) = x, so Bernoulli Naive Bayes has a linear decision boundary
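For concreteness, a short derivation (not on the slides) for a single Bernoulli feature with parameter θ_{d,c} = p(x_d = 1 | c):

\[
\log \frac{p(x_d \mid c)}{p(x_d \mid c')}
= x_d \log \frac{\theta_{d,c}}{\theta_{d,c'}} + (1 - x_d)\log \frac{1-\theta_{d,c}}{1-\theta_{d,c'}}
= \left( \log \frac{\theta_{d,c}}{\theta_{d,c'}} - \log \frac{1-\theta_{d,c}}{1-\theta_{d,c'}} \right) x_d + \log \frac{1-\theta_{d,c}}{1-\theta_{d,c'}}
\]

which is linear in x_d; summing over the conditionally independent features keeps the log-ratio linear in x, so the decision boundary is a hyperplane.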

Page 21: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Discriminative vs. generative classification

discriminative: maximize the conditional likelihood p(y | x)
  makes no assumption about p(x)
  often works better on larger datasets

generative: maximize the joint likelihood p(y, x) = p(y) p(x | y)
  makes assumptions about p(x)
  can deal with missing values
  can learn from unlabelled data
  often works better on smaller datasets

Page 22: Naive Bayes Applied Machine Learning - cs.mcgill.ca


Discriminative vs. generative classification

Example: naive Bayes vs. logistic regression on UCI datasets

[figure: error of naive Bayes and logistic regression as a function of the number of training instances m; from Ng & Jordan 2001]

Page 23: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Summary

generative classification:
  learn the class prior and likelihood
  Bayes rule for the conditional class probability

Naive Bayes:
  assumes conditional independence
  e.g., word appearances are independent of each other given the document type
  class prior: Bernoulli or Categorical
  likelihood: Bernoulli, Gaussian, Categorical, ...
  MLE has closed form (in contrast to logistic regression), estimated separately for each feature and each label

evaluation measures for classification accuracy

Page 24: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Measuring performance

a side note on measuring performance of classifiers

binary classification: we use the confusion matrix, counting the combinations of the true label y and the predicted label ŷ

                      actual Positive (P)   actual Negative (N)
predicted Positive    TP                    FP
predicted Negative    FN                    TN

[figure: example confusion matrix with counts]

Page 25: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Measuring performance: binary classification

use the confusion matrix to quantify different metrics

marginals:
P  = TP + FN   (actual positives)
N  = FP + TN   (actual negatives)
RP = TP + FP   (predicted positives)
RN = FN + TN   (predicted negatives)

Accuracy   = (TP + TN) / (P + N)
Error rate = (FP + FN) / (P + N)
Recall     = TP / P
Precision  = TP / RP
F1 score   = 2 × (Precision × Recall) / (Precision + Recall)   (harmonic mean)
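A minimal sketch (not from the slides) computing these metrics from 0/1 label and prediction vectors:

import numpy as np

def binary_metrics(y_true, y_pred):
    # y_true, y_pred: 0/1 arrays of the same length
    TP = np.sum((y_true == 1) & (y_pred == 1))
    FP = np.sum((y_true == 0) & (y_pred == 1))
    FN = np.sum((y_true == 1) & (y_pred == 0))
    TN = np.sum((y_true == 0) & (y_pred == 0))
    P, N, RP = TP + FN, FP + TN, TP + FP
    accuracy = (TP + TN) / (P + N)
    recall = TP / P
    precision = TP / RP
    f1 = 2 * precision * recall / (precision + recall)
    return dict(accuracy=accuracy, precision=precision, recall=recall, f1=f1)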

Page 26: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Measuring performance: binary classification

Accuracy   = (TP + TN) / (P + N)
Recall     = TP / P
Precision  = TP / RP
F1 score   = 2 × (Precision × Recall) / (Precision + Recall)   (harmonic mean)

less common:
Miss rate                 = FN / P
Fallout                   = FP / N
Selectivity               = TN / N
False discovery rate      = FP / RP
False omission rate       = FN / RN
Negative predictive value = TN / RN

Page 27: Naive Bayes Applied Machine Learning - cs.mcgill.ca

Threshold invariant: ROC & AUC

the ROC curve plots TPR against FPR as the classification threshold varies:
TPR = TP / P (recall, sensitivity)
FPR = FP / N (fallout, false alarm)
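A minimal sketch (not from the slides) of ROC points and AUC obtained by sweeping the decision threshold over the predicted scores; the function name and trapezoidal AUC estimate are choices made here for illustration:

import numpy as np

def roc_auc(scores, y_true):
    # scores: predicted score/probability of the positive class; y_true: 0/1 labels
    thresholds = np.sort(np.unique(scores))[::-1]   # sweep from high to low threshold
    P, N = np.sum(y_true == 1), np.sum(y_true == 0)
    tpr = [np.sum((scores >= t) & (y_true == 1)) / P for t in thresholds]
    fpr = [np.sum((scores >= t) & (y_true == 0)) / N for t in thresholds]
    tpr = np.concatenate([[0.0], tpr, [1.0]])       # add the (0,0) and (1,1) endpoints
    fpr = np.concatenate([[0.0], fpr, [1.0]])
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoidal area under the curve
    return fpr, tpr, auc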