On Discriminative vs. Generative Classifiers: Naïve Bayes
Presenter: Seung Hwan Bae


Andrew Y. Ng and Michael I. Jordan, Neural Information Processing Systems (NIPS), 2001
(Slides adapted from Ke Chen, University of Manchester, and Yangqiu Song, MSRA)
Total citations: 831


Machine Learning


Generative vs. Discriminative Classifiers

Training classifiers involves estimating f: X -> Y, or P(Y|X)
– X: training data, Y: labels

Discriminative classifiers
– Assume some functional form for P(Y|X)
– Estimate parameters of P(Y|X) directly from training data

Generative classifiers (also called 'informative' by Rubinstein & Hastie)
– Assume some functional form for P(X|Y), P(Y)
– Estimate parameters of P(X|Y), P(Y) directly from training data
– Use Bayes rule to calculate P(Y|X)


Bayes Formula


Generative Model

[Figure: a generative model of the input features (Color, Size, Texture, Weight, …) for each class]


Discriminative Model

Logistic Regression

[Figure: a discriminative model mapping the input features (Color, Size, Texture, Weight, …) directly to the class label]


Comparison

Generative models
– Assume some functional form for P(X|Y), P(Y)
– Estimate parameters of P(X|Y), P(Y) directly from training data
– Use Bayes rule to calculate P(Y|X = x)

Discriminative models
– Directly assume some functional form for P(Y|X)
– Estimate parameters of P(Y|X) directly from training data

[Figure: graphical models relating the class Y to the features X1, X2, for Naïve Bayes (generative) and Logistic Regression (discriminative)]
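A minimal sketch (added for illustration, not part of the original slides) contrasting the two approaches on hypothetical synthetic data, assuming NumPy and scikit-learn are available: GaussianNB estimates P(X|Y) and P(Y) and applies Bayes rule, while LogisticRegression fits P(Y|X) directly.

import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: models P(X|Y), P(Y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(Y|X)

rng = np.random.default_rng(0)
# Two Gaussian blobs, one per class (hypothetical data, for illustration only).
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

gen = GaussianNB().fit(X, y)            # estimates class priors and per-class means/variances
disc = LogisticRegression().fit(X, y)   # estimates the weights of P(Y|X) directly

x_new = np.array([[1.0, 1.0]])
print("Generative posterior P(Y|x):   ", gen.predict_proba(x_new))
print("Discriminative posterior P(Y|x):", disc.predict_proba(x_new))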


Probability Basics

• Prior, conditional and joint probability for random variables
– Prior probability: P(X)
– Conditional probability: P(X_1|X_2), P(X_2|X_1)
– Joint probability: X = (X_1, X_2), P(X) = P(X_1, X_2)
– Relationship: P(X_1, X_2) = P(X_2|X_1) P(X_1) = P(X_1|X_2) P(X_2)
– Independence: P(X_2|X_1) = P(X_2), P(X_1|X_2) = P(X_1), P(X_1, X_2) = P(X_1) P(X_2)

• Bayesian Rule

    P(C|X) = P(X|C) P(C) / P(X)    (Posterior = Likelihood × Prior / Evidence)


Probabilistic Classification

Establishing a probabilistic model for classification
– Discriminative model: P(C|X), where C = c_1, ..., c_L and X = (X_1, ..., X_n)

[Figure: a discriminative probabilistic classifier takes the input x = (x_1, x_2, ..., x_n) and directly outputs the posteriors P(c_1|x), P(c_2|x), ..., P(c_L|x)]


Probabilistic Classification

Establishing a probabilistic model for classification (cont.)
– Generative model: P(X|C), where C = c_1, ..., c_L and X = (X_1, ..., X_n)

[Figure: one generative probabilistic model per class; for each class c_i (i = 1, 2, ..., L), the model takes x = (x_1, x_2, ..., x_n) and outputs the class-conditional likelihood P(x|c_i)]


Probabilistic Classification

MAP classification rule
– MAP: Maximum A Posteriori
– Assign x to c* if P(C = c*|X = x) > P(C = c|X = x) for c ≠ c*, c = c_1, ..., c_L

Generative classification with the MAP rule
– Apply the Bayesian rule to convert class-conditional likelihoods into posterior probabilities:

    P(C = c_i|X = x) = P(X = x|C = c_i) P(C = c_i) / P(X = x)
                     ∝ P(X = x|C = c_i) P(C = c_i),   for i = 1, 2, ..., L

  (the evidence P(X = x) is the same for every class, so it can be dropped when comparing)
– Then apply the MAP rule


Naïve Bayes

Bayes classification

    P(C|X) ∝ P(X|C) P(C) = P(X_1, ..., X_n|C) P(C)

– Difficulty: learning the joint probability P(X_1, ..., X_n|C)
– If the number of features n is large, or a feature can take on many values, then basing such a model on probability tables is infeasible.
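As a concrete illustration (added here, not on the original slide): with n binary attributes, a full table for P(X_1, ..., X_n|C = c) has 2^n − 1 free parameters per class, whereas the conditionally independent factorization introduced on the next slide needs only n per class. For n = 30:

    2^30 − 1 ≈ 1.07 × 10^9 entries for the full table   vs.   30 entries for Π_j P(X_j|c)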


Naïve Bayes

Naïve Bayes classification
– Assume that all input attributes are conditionally independent!

    P(X_1, X_2, ..., X_n|C) = P(X_1|X_2, ..., X_n; C) P(X_2, ..., X_n|C)
                            = P(X_1|C) P(X_2, ..., X_n|C)
                            = P(X_1|C) P(X_2|C) ··· P(X_n|C)

– MAP classification rule: for x = (x_1, x_2, ..., x_n), assign x to c* if

    [P(x_1|c*) ··· P(x_n|c*)] P(c*) > [P(x_1|c) ··· P(x_n|c)] P(c),   c ≠ c*, c = c_1, ..., c_L
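A minimal Python sketch of this MAP rule (added for illustration; the data structures are hypothetical), assuming the prior and the conditional probability tables have already been estimated. Logs of probabilities are summed instead of multiplying, which leaves the argmax unchanged but avoids numerical underflow for large n.

import math

def map_classify(x, priors, cond_tables):
    """Return the MAP class for instance x under the naive Bayes factorization.

    priors:      dict mapping class c -> P(c)
    cond_tables: dict mapping class c -> list of dicts, one per attribute j,
                 each mapping an attribute value v -> P(X_j = v | c)
    (names and formats are illustrative, not from the slides)
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        # log P(c) + sum_j log P(x_j|c) is monotone in P(c) * prod_j P(x_j|c)
        score = math.log(prior) + sum(math.log(cond_tables[c][j][x_j])
                                      for j, x_j in enumerate(x))
        if score > best_score:
            best_class, best_score = c, score
    return best_class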


Naïve Bayes

Naïve Bayes Algorithm (for discrete input attributes)
– Learning phase: given a training set S,

    For each target value c_i (c_i = c_1, ..., c_L)
        P̂(C = c_i) ← estimate P(C = c_i) with examples in S;
    For every attribute value x_jk of each attribute X_j (j = 1, ..., n; k = 1, ..., N_j)
        P̂(X_j = x_jk|C = c_i) ← estimate P(X_j = x_jk|C = c_i) with examples in S;

  Output: conditional probability tables; for each X_j, a table of N_j × L elements
– Test phase: given an unknown instance X' = (a'_1, ..., a'_n),
  look up the tables to assign the label c* to X' if

    [P̂(a'_1|c*) ··· P̂(a'_n|c*)] P̂(c*) > [P̂(a'_1|c) ··· P̂(a'_n|c)] P̂(c),   c ≠ c*, c = c_1, ..., c_L
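A matching sketch of the learning phase for discrete attributes (again illustrative, not the slide's own code): relative-frequency estimates of the prior and the conditional tables, in the format consumed by map_classify above.

from collections import Counter

def learn_naive_bayes(examples, labels):
    """Estimate P(c) and P(X_j = v | c) by relative frequencies (no smoothing).

    examples: list of tuples of discrete attribute values; labels: list of class labels.
    Returns (priors, cond_tables) as used by map_classify above.
    """
    n_examples = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n_examples for c in class_counts}

    # value_counts[c][j][v] = number of class-c examples whose attribute j equals v
    value_counts = {c: [Counter() for _ in examples[0]] for c in class_counts}
    for x, c in zip(examples, labels):
        for j, v in enumerate(x):
            value_counts[c][j][v] += 1

    cond_tables = {c: [{v: cnt / class_counts[c] for v, cnt in counter.items()}
                       for counter in value_counts[c]]
                   for c in class_counts}
    return priors, cond_tables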


Example

• Example: Play Tennis


Example

Learning phase

    Outlook       Play=Yes   Play=No
    Sunny            2/9       3/5
    Overcast         4/9       0/5
    Rain             3/9       2/5

    Temperature   Play=Yes   Play=No
    Hot              2/9       2/5
    Mild             4/9       2/5
    Cool             3/9       1/5

    Humidity      Play=Yes   Play=No
    High             3/9       4/5
    Normal           6/9       1/5

    Wind          Play=Yes   Play=No
    Strong           3/9       3/5
    Weak             6/9       2/5

    P(Play=Yes) = 9/14
    P(Play=No)  = 5/14


Example

Test Phase
– Given a new instance x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables:

    P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
    P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
    P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
    P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
    P(Play=Yes) = 9/14                     P(Play=No) = 5/14

– MAP rule

    P(Yes|x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
    P(No|x')  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

    Since P(Yes|x') < P(No|x'), we label x' as "No".
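A quick numeric check of the two scores (added for illustration), using exactly the table values above:

from fractions import Fraction as F

# Unnormalized posteriors for x' = (Sunny, Cool, High, Strong), from the tables above.
score_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
score_no  = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

print(float(score_yes))   # ~0.00529  (rounded to 0.0053 on the slide)
print(float(score_no))    # ~0.02057  (rounded to 0.0206 on the slide)
print("Predict:", "Yes" if score_yes > score_no else "No")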



Relevant Issues

Violation of the independence assumption
– For many real-world tasks, P(X_1, ..., X_n|C) ≠ P(X_1|C) ··· P(X_n|C)
– Nevertheless, naïve Bayes works surprisingly well anyway!

Zero conditional probability problem
– If no training example contains the attribute value X_j = a_jk, then P̂(X_j = a_jk|C = c_i) = 0
– In this circumstance, during testing the whole product vanishes:

    P̂(x_1|c_i) ··· P̂(a_jk|c_i) ··· P̂(x_n|c_i) = 0

– As a remedy, conditional probabilities are estimated with the m-estimate:

    P̂(X_j = a_jk|C = c_i) = (n_c + m·p) / (n + m)

    n:   number of training examples for which C = c_i
    n_c: number of training examples for which X_j = a_jk and C = c_i
    p:   prior estimate (usually p = 1/t for t possible values of X_j)
    m:   weight given to the prior (number of "virtual" examples, m ≥ 1)
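A small sketch of the m-estimate (added for illustration; the function name is made up):

def m_estimate(n_c, n, t, m=1.0):
    """m-estimate of P(X_j = a_jk | C = c_i).

    n_c: count of class-c_i examples with X_j = a_jk
    n:   count of class-c_i examples
    t:   number of possible values of attribute X_j (so the prior p = 1/t)
    m:   weight of the prior, i.e. the number of "virtual" examples
    """
    p = 1.0 / t
    return (n_c + m * p) / (n + m)

# E.g. Outlook=Overcast with Play=No never occurs (0/5 in the table above);
# with t = 3 outlook values and m = 1 the estimate becomes (0 + 1/3) / (5 + 1) ≈ 0.056
print(m_estimate(n_c=0, n=5, t=3))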


Relevant Issues

Continuous-valued input attributes
– An attribute can take infinitely many values
– The conditional probability is then modeled with the normal distribution:

    P̂(X_j|C = c_i) = 1/(√(2π) σ_ji) · exp( −(X_j − μ_ji)² / (2σ_ji²) )

    μ_ji: mean (average) of the attribute values X_j of the examples for which C = c_i
    σ_ji: standard deviation of the attribute values X_j of the examples for which C = c_i

– Learning phase: for X = (X_1, ..., X_n), C = c_1, ..., c_L
  Output: n × L normal distributions and P(C = c_i), i = 1, ..., L
– Test phase: for X' = (X'_1, ..., X'_n)
  • Calculate the conditional probabilities with all the normal distributions
  • Apply the MAP rule to make a decision
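A small sketch of this class-conditional Gaussian (illustrative, not from the slides):

import math

def gaussian_likelihood(x_j, mu_ji, sigma_ji):
    """P(X_j = x_j | C = c_i) under a normal distribution with the class-specific
    mean mu_ji and standard deviation sigma_ji estimated from the training data."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma_ji)
    return coeff * math.exp(-((x_j - mu_ji) ** 2) / (2.0 * sigma_ji ** 2))

# e.g. an attribute value one standard deviation away from the class mean
print(gaussian_likelihood(x_j=1.0, mu_ji=0.0, sigma_ji=1.0))  # ≈ 0.2420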


Advantages of Naïve Bayes

Naïve Bayes is based on the independence assumption
– A small amount of training data suffices to estimate the parameters (means and variances of the variables)
– Only the variances of the variables for each class need to be determined, not the entire covariance matrix
– Testing is straightforward: just look up tables or calculate conditional probabilities with normal distributions


Conclusion

Performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated

Many successful applications, e.g., spam mail filtering

A good candidate for a base learner in ensemble learning

Apart from classification, naïve Bayes can do more…
