
Discriminative Naïve Bayesian Classifiers


Page 1: Discriminative Naïve Bayesian Classifiers

Discriminative Naïve Bayesian Classifiers

Kaizhu Huang

Supervisors: Prof. Irwin King,
Prof. Michael R. Lyu

Markers: Prof. Lai Wan Chan,
Prof. Kin Hong Wong

Page 2: Discriminative Naïve Bayesian Classifiers

Outline

Background
– Classifiers
  » Discriminative classifiers: Support Vector Machines
  » Generative classifiers: Naïve Bayesian Classifiers
Motivation
Discriminative Naïve Bayesian Classifiers
Experiments
Discussions
Conclusion

Page 3: Discriminative Naïve Bayesian Classifiers

Background

Discriminative Classifiers
– Directly maximize a discriminative function or posterior function
– Example: Support Vector Machines (SVM)
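As a reminder of what "directly maximize a discriminative function" means for an SVM: the classifier is the sign of a learned discriminant, with no class-conditional density model. A minimal sketch follows; the names `alpha`, `b`, and the RBF kernel choice are illustrative, not from the slides.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_decision(x, support_vecs, labels, alpha, b):
    """SVM discriminant f(x) = sum_i alpha_i y_i k(x_i, x) + b.

    Training tunes alpha and b to maximize the margin of this
    function directly; classification is just sign(f(x)).
    """
    f = sum(a * y * rbf_kernel(sv, x)
            for sv, y, a in zip(support_vecs, labels, alpha)) + b
    return 1 if f >= 0 else -1
```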

Page 4: Discriminative Naïve Bayesian Classifiers

Background

Generative Classifiers
– Model the joint distribution for each class, P(x|C), and then use Bayes rule to construct the posterior classifier P(C|x).
– Example: Naïve Bayesian Classifiers
  » Model the distribution of each class under the assumption that each feature of the data is independent of the other features, given the class label.

$$
\begin{aligned}
C &= \arg\max_{c_i} P(C_i \mid x) \\
  &= \arg\max_{c_i} \frac{P(x \mid C_i)\,P(C_i)}{p(x)} && p(x)\ \text{is constant w.r.t.}\ C \\
  &= \arg\max_{c_i} P(x \mid C_i)\,P(C_i) \\
  &= \arg\max_{c_i} \prod_{j=1}^{m} P(x_j \mid C_i)\,P(C_i)
\end{aligned}
$$

combining the assumption
$$
P(x_i, x_j \mid C) = P(x_i \mid C)\,P(x_j \mid C), \qquad 1 \le i \ne j \le m.
$$
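To make the derivation concrete, here is a minimal sketch of the resulting classifier for discrete features. All names (`nb_predict`, `class_priors`, `cond_probs`) are illustrative, not from the slides; working in log space avoids numerical underflow when m is large.

```python
import numpy as np

def nb_predict(x, class_priors, cond_probs):
    """Classify one sample with a Naive Bayesian classifier.

    x            : length-m sequence of discrete feature values
    class_priors : dict mapping class label c -> P(C = c)
    cond_probs   : dict mapping (c, j) -> array of P(x_j = v | C = c)
    """
    best_class, best_logp = None, float("-inf")
    for c, prior in class_priors.items():
        # log P(C) + sum_j log P(x_j | C): the factorization derived above
        logp = np.log(prior)
        for j, v in enumerate(x):
            logp += np.log(cond_probs[(c, j)][v])
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class

# Toy usage with two binary features and two classes:
priors = {0: 0.6, 1: 0.4}
cp = {(0, 0): np.array([0.9, 0.1]), (0, 1): np.array([0.8, 0.2]),
      (1, 0): np.array([0.2, 0.8]), (1, 1): np.array([0.3, 0.7])}
print(nb_predict([1, 1], priors, cp))   # -> 1
```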

Page 5: Discriminative Naïve Bayesian Classifiers

Background

Comparison

Example of missing information, from left to right: original digit, 50% missing digit, 75% missing digit, and occluded digit.

Page 6: Discriminative Naïve Bayesian Classifiers

Background

Why are generative classifiers not as accurate as discriminative classifiers?

Scheme for generative classifiers in two-category classification tasks:
pre-classified dataset → sub-dataset D1 for Class 1 and sub-dataset D2 for Class 2 → estimate the distribution P1 to approximate D1 accurately and the distribution P2 to approximate D2 accurately → use Bayes rule to perform classification.

1. It is incomplete for generative classifiers to just approximate the inner-class information.
2. The inter-class discriminative information between classes is discarded.

Page 7: Discriminative Naïve Bayesian Classifiers

Background

Why are generative classifiers superior to discriminative classifiers in handling missing-information problems?
– SVM lacks the ability to classify under this uncertainty.
– NB can conduct uncertainty inference under the estimated distribution.

Here A is the feature set and T is the subset of A that is missing.
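As a sketch of why this inference is cheap for NB with discrete features: summing the joint distribution over a missing feature gives sum_v P(x_j = v | C) = 1 under the independence assumption, so the missing features' factors simply drop out of the product. The helper below reuses the hypothetical parameter layout of the earlier `nb_predict` sketch.

```python
import numpy as np

def nb_predict_missing(x, class_priors, cond_probs):
    """Naive Bayes inference when the subset T of features is missing.

    x is a length-m sequence in which a missing feature is marked None.
    Marginalizing a missing feature out of the joint distribution
    contributes a factor of 1, so it is skipped -- no imputation needed.
    """
    best_class, best_logp = None, float("-inf")
    for c, prior in class_priors.items():
        logp = np.log(prior)
        for j, v in enumerate(x):
            if v is None:      # feature j is in the missing subset T
                continue       # its factor marginalizes to 1
            logp += np.log(cond_probs[(c, j)][v])
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class
```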

Page 8: Discriminative Naïve Bayesian Classifiers

Motivation

It seems that a good classifier should combine the strategies of discriminative classifiers and generative classifiers.

Our work trains one of the generative classifiers, the Naïve Bayesian Classifier, in a discriminative way.

Page 9: Discriminative Naïve Bayesian Classifiers

Roadmap of our work

Classifiers
– Discriminative Classifiers: Support Vector Machines (SVM); others
– Generative Classifiers:
  » Bayesian Network Classifiers (BNC): Naïve Bayesian Classifiers, tree-like Bayesian Classifiers, Bayesian multinet classifiers, others
  » Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), other models

Discriminative training links the generative side of this taxonomy to the discriminative side.

Page 10: Discriminative Naïve Bayesian Classifiers

How does our work relate to other work?

1. Discriminative classifiers → generative classifiers: Jaakkola and Haussler, NIPS 98.
   Difference: our method performs the reverse process, from generative classifiers to discriminative classifiers.
2. Discriminative training of HMM and GMM: Beaufays et al., ICASSP 99; Hastie et al., JRSS 96.
   Difference: our method is designed for Bayesian classifiers.

Page 11: Discriminative Naïve Bayesian Classifiers

How does our work relate to other work?

3. Optimization on the posterior distribution P(C|x): Logistic Regression (LR).
   Difference: LR encounters computational difficulties in handling missing-information problems; as the number of missing or unknown features grows, inference becomes intractable.

Page 12: Discriminative Naïve Bayesian Classifiers

Roadmap of our work

Classifiers
– Discriminative Classifiers: Support Vector Machines (SVM); others
– Generative Classifiers:
  » Bayesian Network Classifiers (BNC): Naïve Bayesian Classifiers, tree-like Bayesian Classifiers, Bayesian multinet classifiers, others
  » Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), other models

Page 13: Discriminative Naïve Bayesian Classifiers

Discriminative Naïve Bayesian Classifiers

Working scheme of the Naïve Bayesian Classifier:
pre-classified dataset → sub-dataset D1 for Class 1 and sub-dataset D2 for Class 2 → estimate the distribution P1 to approximate D1 accurately and the distribution P2 to approximate D2 accurately → use Bayes rule to perform classification.

Mathematical explanation of the Naïve Bayesian Classifier: the estimation problem is easily solved by the Lagrange multiplier method.
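The formula itself did not survive the conversion; the standard maximum-likelihood derivation it refers to (a textbook reconstruction, not the slide verbatim) is

$$
\max_{P}\ \sum_{x} \hat{P}(x)\,\log P(x) \quad \text{s.t.} \quad \sum_{x} P(x) = 1,
$$

with Lagrangian

$$
\mathcal{L} = \sum_{x} \hat{P}(x)\,\log P(x) + \lambda\Bigl(1 - \sum_{x} P(x)\Bigr).
$$

Setting $\partial\mathcal{L}/\partial P(x) = \hat{P}(x)/P(x) - \lambda = 0$ and enforcing the constraint gives $\lambda = 1$ and $P(x) = \hat{P}(x)$: each class model is just the empirical frequency estimate of its sub-dataset.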

Page 14: Discriminative Naïve Bayesian Classifiers

Discriminative Naïve Bayesian Classifiers (DNB)

Optimization function of DNB
• On one hand, minimizing this function tries to approximate the dataset as accurately as possible.
• On the other hand, optimizing this function also tries to enlarge the divergence between classes (the divergence term).
• Optimizing the joint distribution directly inherits the ability of NB to handle missing-information problems.
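The optimization function is an image in the original deck and is not recoverable here. A plausible shape consistent with the three bullets above (an assumption on our part, not the slide's exact formula) pairs a per-class fit term with a subtracted cross-class divergence term:

$$
\min_{P_1, P_2}\ \sum_{c=1}^{2} \mathrm{KL}\bigl(\hat{P}_c \,\|\, P_c\bigr) \;-\; \lambda\, D(P_1, P_2)
\quad \text{s.t.} \quad \sum_{x} P_c(x) = 1,\ \ P_c(x) \ge 0,
$$

where $\hat{P}_c$ is the empirical distribution of sub-dataset $D_c$, the KL terms keep each $P_c$ close to its own data, $D(P_1, P_2)$ is the divergence term labeled on the slide, and $\lambda$ is an assumed trade-off weight.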

Page 15: Discriminative Naïve Bayesian Classifiers

Discriminative Naïve Bayesian Classifiers (DNB)

Complete optimization problem

P1 and P2 cannot be optimized separately as in NB, since they are now interacting variables.

Page 16: Discriminative Naïve Bayesian Classifiers

Discriminative Naïve Bayesian Classifiers (DNB)

Solving the optimization problem
– A nonlinear optimization problem under linear constraints, solved with the Rosen gradient projection method.
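A minimal sketch of the projection step for one class-conditional distribution, simplified to the single equality constraint sum_x P(x) = 1 (the full Rosen method also manages the active nonnegativity constraints; all names below are illustrative):

```python
import numpy as np

def projected_gradient_step(p, grad, lr=0.01):
    """One Rosen-style projected gradient step on the probability simplex.

    p    : current parameter vector, p >= 0, sum(p) == 1
    grad : gradient of the objective at p

    With the single active constraint A p = 1 (A = all-ones row), the
    projection matrix I - A^T (A A^T)^{-1} A reduces to subtracting
    the mean of the gradient.
    """
    d = grad - grad.mean()               # project onto sum(p) = const
    p_new = p - lr * d                   # move along projected direction
    p_new = np.clip(p_new, 1e-12, None)  # crude handling of p >= 0
    return p_new / p_new.sum()           # renormalize after clipping
```

Clipping followed by renormalization is a crude stand-in for Rosen's active-set handling of the nonnegativity constraints, kept here for brevity.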

Page 17: Discriminative Naïve Bayesian Classifiers

Discriminative Naïve Bayesian Classifiers (DNB)

Gradient and projection matrix

Page 18: Discriminative Naïve Bayesian Classifiers

Extension to Multi-category Classification Problems

Page 19: Discriminative Naïve Bayesian Classifiers

Experimental results

Experimental setup
– Datasets
  » 5 benchmark datasets from the UCI machine learning repository
– Experimental environment
  » Platform: Windows 2000
  » Developing tool: Matlab 6.5

Page 20: Discriminative Naïve Bayesian Classifiers

Without information missing

Observations
– DNB outperforms NB on every dataset.
– Compared with SVM, DNB wins on two datasets and loses on three.
– SVM outperforms DNB on Segment and Satimages.

Page 21: Discriminative Naïve Bayesian Classifiers

With information missing

DNB uses the estimated joint distribution, summed over the missing subset T, to conduct inference when information is missing.

SVM sets the missing features to 0 (the default way to process unknown features in LIBSVM).
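For contrast with the marginalization sketch after slide 7, zero-filling amounts to the following illustrative helper (not LIBSVM code):

```python
def zero_fill(x):
    """Replace missing features (None) with 0.0 before calling the SVM,
    mirroring the default handling of unknown features described above."""
    return [0.0 if v is None else v for v in x]
```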

Page 22: Discriminative Naïve Bayesian Classifiers

With information missing

Page 23: Discriminative Naïve Bayesian Classifiers

With information missing

Page 24: Discriminative Naïve Bayesian Classifiers

With information missing

Page 25: Discriminative Naïve Bayesian Classifiers

With information missing

Observations
– NB demonstrates a robust ability in handling missing-information problems.
– DNB inherits NB's ability to handle missing-information problems while achieving a higher classification accuracy than NB.
– SVM cannot deal with missing-information problems easily.
– On small datasets, DNB demonstrates a superior ability compared with NB.

Page 26: Discriminative Naïve Bayesian Classifiers

Discussion

Why does SVM outperform DNB when no information is missing?

– SVM directly minimizes the error rate, while DNB minimizes an intermediate term.
– SVM assumes no model, while DNB assumes independence among the features: "all models are wrong but some are useful".

Page 27: Discriminative Naïve Bayesian Classifiers

Discussion

How does DNB relate to the Fisher Discriminant (FD)?

– Using the difference of the means of the two classes as the divergence measure is less informative than using the full distributions.
– FD is usually used as a dimension-reduction method rather than a classification method.
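For reference, the standard Fisher criterion (textbook form, not copied from the slide) chooses the projection $w$ maximizing

$$
J(w) = \frac{\bigl(w^{\top}(m_1 - m_2)\bigr)^{2}}{w^{\top}(S_1 + S_2)\,w},
$$

where $m_c$ and $S_c$ are the mean and within-class scatter matrix of class $c$. Only first- and second-order statistics enter, which is why comparing full distributions, as DNB does, carries more information.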

Page 28: Discriminative Naïve Bayesian Classifiers

Discussion

Can DNB be extended to general Bayesian Network (BN) classifiers?
– Finding optimal general Bayesian Network classifiers is an NP-complete problem.
– A structure-learning problem will be involved: direct application of DNB will encounter difficulties, since the structure is not fixed in restricted BNs.

The tree-like discriminative Bayesian Network classifier is ongoing work.

Page 29: Discriminative Naïve Bayesian Classifiers

Discussion

Discriminative training of tree-like Bayesian Network classifiers

Two reference distributions are used in each iteration: approximate the empirical distribution as closely as possible, while staying as far as possible from the distribution of the other dataset.

Page 30: Discriminative Naïve Bayesian Classifiers

Future work

Extensive evaluations on discriminative Bayesian network classifiers, including Discriminative Naïve Bayesian Classifiers and tree-like Bayesian Network Classifiers.

Page 31: Discriminative Naïve Bayesian Classifiers

Conclusion

We develop a novel model named Discriminative Naïve Bayesian Classifiers.

It outperforms Naïve Bayesian Classifiers when no information is missing.

It outperforms SVMs in handling missing-information problems.