陳慶治: Large Scale Text Classification Using Semi-supervised Multinomial Naïve Bayes
Large scale text classification
Using Semi-supervised Multinomial Naïve Bayes
Presenter: QingZhi Chen
OUTLINE
1. Introduction
2. Text document representation
3. Multinomial Naïve Bayes
4. Semi-supervised Learning for MNB
5. Experiments
INTRODUCTION
Multinomial Naïve Bayes (MNB) with Frequency Estimate (FE)
* Labeled data are difficult to collect, and the large amount of unlabeled data goes unused
Expectation Maximization (EM): maximizes the marginal log likelihood
Semi-supervised Frequency Estimate (SFE): achieves a better conditional log likelihood
Type?
Bag-of-words: ignore the ordering of words in d
Naïve Bayes assumption: each word is independent of the others, given the class
[Figure: an apple described by the words "red", "rind", "sweet", "rounded"; as a bag of words the order does not matter: rind, apple, red, rounded, sweet]
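The bag-of-words idea can be illustrated with a minimal sketch. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for illustration, not part of the slides.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word; word order is discarded."""
    return Counter(text.lower().split())

# Two orderings of the same words (taken from the apple example).
a = bag_of_words("apple red rind sweet rounded")
b = bag_of_words("rind apple red rounded sweet")

# Only the counts matter, so both orderings give the same bag.
same_bag = (a == b)
```

Because a `Counter` compares by counts only, any permutation of the words maps to the same representation.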
Text Document representation
$d = \{w_1, w_2, w_3, \ldots, c\}$
$w_i$ corresponds to a word in document $d$, and its value is the frequency $f_i$ of $w_i$ in $d$
$c$ is the class label of $d$
$V$ is the set of unique words $w_i$ in all $d$
$T$ is the training set
$d^t$ is the $t$-th document in $T$
A hat, as in $\hat{P}$, indicates a parameter estimate
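As a rough sketch of this representation, the code below builds the vocabulary V from a toy training set T and turns each document into a vector of word frequencies plus its class label. The function names and the toy documents are hypothetical.

```python
def build_vocabulary(docs):
    """V: sorted list of unique words over all (words, label) documents."""
    return sorted({w for words, _ in docs for w in words})

def to_frequency_vector(words, vocab):
    """Return [f_1, ..., f_|V|], where f_i counts occurrences of word w_i."""
    index = {w: i for i, w in enumerate(vocab)}
    f = [0] * len(vocab)
    for w in words:
        if w in index:  # words outside V are ignored
            f[index[w]] += 1
    return f

# Toy training set T (hypothetical): tokenized documents with class labels.
T = [(["red", "sweet", "apple", "apple"], "fruit"),
     (["fast", "red", "car"], "vehicle")]

V = build_vocabulary(T)
vectors = [(to_frequency_vector(words, V), c) for words, c in T]
```

Each entry of `vectors` plays the role of one document d: the frequencies followed by the class label c.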
Multinomial Naïve Bayes
Objective function
Parameters
$P(c)$ is the prior probability of class $c$ in the whole training set $T$, estimated from the number of documents in $T$ with the label $c$
$f_i$ is the number of occurrences of $w_i$ in the document $d$
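The equation on this slide did not survive extraction. A common textbook form of the MNB decision rule is P(c|d) ∝ P(c) ∏ P(w_i|c)^{f_i}; the sketch below assumes that form and evaluates it in log space. The function names and the toy parameter values are illustrative only.

```python
import math

def mnb_log_scores(f, log_prior, log_cond):
    """Unnormalized log P(c|d) = log P(c) + sum_i f_i * log P(w_i|c)."""
    return {c: log_prior[c] + sum(fi * lw for fi, lw in zip(f, log_cond[c]))
            for c in log_prior}

def mnb_posterior(f, log_prior, log_cond):
    """Normalize the scores with log-sum-exp to obtain P(c|d)."""
    scores = mnb_log_scores(f, log_prior, log_cond)
    m = max(scores.values())
    z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {c: math.exp(s - z) for c, s in scores.items()}

# Toy parameters over a three-word vocabulary (hypothetical values).
log_prior = {"fruit": math.log(0.5), "vehicle": math.log(0.5)}
log_cond = {"fruit": [math.log(0.6), math.log(0.1), math.log(0.3)],
            "vehicle": [math.log(0.1), math.log(0.6), math.log(0.3)]}

# Document with frequencies f = [2, 0, 1].
posterior = mnb_posterior([2, 0, 1], log_prior, log_cond)
```

Working with logarithms avoids underflow when documents are long, since the product of many small probabilities becomes a sum.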
FE parameter learning
$N_{ic}$ is the number of occurrences of $w_i$ in the documents with the class label $c$
FE objective function
Decomposes into CLL (conditional log likelihood) + MLL (marginal log likelihood)
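The following sketch shows one way FE training and the CLL + MLL decomposition could look in code. It assumes relative-frequency estimates with Laplace smoothing (`alpha` is an assumption, as are all function names), and it checks numerically that the joint log likelihood equals CLL plus MLL, which follows from log P(c, d) = log P(c|d) + log P(d).

```python
import math

def fe_train(labeled, classes, n_words, alpha=1.0):
    """Frequency Estimate: count-based estimates of P(c) and P(w_i|c).
    alpha is Laplace smoothing, an assumption not taken from the slides."""
    doc_count = {c: 0 for c in classes}
    n_ic = {c: [0.0] * n_words for c in classes}
    for f, c in labeled:
        doc_count[c] += 1
        for i, fi in enumerate(f):
            n_ic[c][i] += fi
    total = len(labeled)
    log_prior = {c: math.log((doc_count[c] + alpha) / (total + alpha * len(classes)))
                 for c in classes}
    log_cond = {}
    for c in classes:
        denom = sum(n_ic[c]) + alpha * n_words
        log_cond[c] = [math.log((n + alpha) / denom) for n in n_ic[c]]
    return log_prior, log_cond

def log_joint(f, c, log_prior, log_cond):
    """log P(c, d), dropping the multinomial coefficient (it does not depend on c)."""
    return log_prior[c] + sum(fi * lw for fi, lw in zip(f, log_cond[c]))

def log_marginal(f, log_prior, log_cond):
    """log P(d) = log sum_c P(c, d), computed with log-sum-exp."""
    js = [log_joint(f, c, log_prior, log_cond) for c in log_prior]
    m = max(js)
    return m + math.log(sum(math.exp(j - m) for j in js))

def decompose(labeled, log_prior, log_cond):
    """Return (LL, CLL, MLL) summed over the labeled documents."""
    ll = cll = mll = 0.0
    for f, c in labeled:
        joint = log_joint(f, c, log_prior, log_cond)
        marginal = log_marginal(f, log_prior, log_cond)
        ll += joint
        cll += joint - marginal  # log P(c|d)
        mll += marginal          # log P(d)
    return ll, cll, mll

train = [([2, 0, 0, 1, 1], "fruit"), ([0, 1, 1, 1, 0], "vehicle")]
log_prior, log_cond = fe_train(train, ["fruit", "vehicle"], n_words=5)
ll, cll, mll = decompose(train, log_prior, log_cond)
```

Since FE maximizes the sum of both terms, a gain in MLL can come at the cost of CLL, which is the quantity that matters for classification.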
Semi-supervised Learning for MNB
Basic assumption
When the number of unlabeled documents is much larger than the number of labeled documents, the unlabeled documents can provide more information about the model
Expectation Maximization
The classical semi-supervised method for MNB: assign each unlabeled document to a class c
* […] will be the same as in FE and SFE
* Update (3) and (2) using (6), (7), and (1) until the parameters are stable
* This implementation still uses […], and counts each labeled document with weight 1 rather than […]
* Deficits of EM: inferior CLL and an overly strong assumption
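The EM loop described above can be sketched as follows. This is a generic EM for MNB under stated assumptions, not a reconstruction of the slide's equations: labeled documents keep weight 1 for their own class, unlabeled documents get soft class weights, Laplace smoothing (`alpha`) is added, and a fixed iteration count stands in for the "until stable" check. All names are hypothetical.

```python
import math

def m_step(docs, weights, classes, n_words, alpha=1.0):
    """Re-estimate log P(c) and log P(w_i|c) from per-document class weights."""
    prior_mass = {c: 0.0 for c in classes}
    word_mass = {c: [0.0] * n_words for c in classes}
    for f, w in zip(docs, weights):
        for c in classes:
            prior_mass[c] += w[c]
            for i, fi in enumerate(f):
                word_mass[c][i] += w[c] * fi
    total = sum(prior_mass.values())
    log_prior = {c: math.log((prior_mass[c] + alpha) / (total + alpha * len(classes)))
                 for c in classes}
    log_cond = {}
    for c in classes:
        denom = sum(word_mass[c]) + alpha * n_words
        log_cond[c] = [math.log((m + alpha) / denom) for m in word_mass[c]]
    return log_prior, log_cond

def e_step(f, log_prior, log_cond):
    """Posterior P(c|d) for one document under the current parameters."""
    scores = {c: log_prior[c] + sum(fi * lw for fi, lw in zip(f, log_cond[c]))
              for c in log_prior}
    m = max(scores.values())
    z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {c: math.exp(s - z) for c, s in scores.items()}

def em_mnb(labeled, unlabeled, classes, n_words, n_iter=10):
    """Initialize from labeled data only, then alternate E and M steps."""
    l_docs = [f for f, _ in labeled]
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    log_prior, log_cond = m_step(l_docs, hard, classes, n_words)
    for _ in range(n_iter):
        soft = [e_step(f, log_prior, log_cond) for f in unlabeled]
        log_prior, log_cond = m_step(l_docs + unlabeled, hard + soft,
                                     classes, n_words)
    return log_prior, log_cond

labeled = [([3, 0], "a"), ([0, 3], "b")]
unlabeled = [[2, 0], [0, 2], [4, 1]]
log_prior, log_cond = em_mnb(labeled, unlabeled, ["a", "b"], n_words=2)
prediction = e_step([5, 0], log_prior, log_cond)
```

A production version would replace the fixed `n_iter` with a check on the change in parameters or in log likelihood between iterations.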
Semi-supervised Frequency Estimate
soft classify word to c
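One possible reading of "soft classify word to c" is sketched below: estimate P(c|w_i) from the labeled data, estimate P(w_i) from the unlabeled data, and combine them so that P(w_i|c) is proportional to P(c|w_i) · P(w_i). This reading, the Laplace smoothing (`alpha`), and the function name are assumptions; the slide's own formula was lost in extraction.

```python
def sfe_word_probs(labeled, unlabeled, classes, n_words, alpha=1.0):
    """Sketch of a semi-supervised frequency estimate of P(w_i|c).
    P(c|w_i) comes from labeled counts, P(w_i) from unlabeled counts."""
    # Labeled word counts per class.
    n_ic = {c: [0.0] * n_words for c in classes}
    for f, y in labeled:
        for i, fi in enumerate(f):
            n_ic[y][i] += fi

    # Unlabeled word counts, turned into a smoothed P(w_i).
    n_u = [0.0] * n_words
    for f in unlabeled:
        for i, fi in enumerate(f):
            n_u[i] += fi
    total_u = sum(n_u) + alpha * n_words
    p_w = [(n + alpha) / total_u for n in n_u]

    cond = {}
    for c in classes:
        joint = []
        for i in range(n_words):
            n_i = sum(n_ic[k][i] for k in classes)
            # Smoothed P(c|w_i): how strongly word i points to class c.
            p_c_given_w = (n_ic[c][i] + alpha) / (n_i + alpha * len(classes))
            joint.append(p_c_given_w * p_w[i])
        z = sum(joint)
        cond[c] = [j / z for j in joint]  # normalize over words
    return cond

labeled = [([3, 0, 1], "a"), ([0, 3, 1], "b")]
unlabeled = [[5, 1, 2], [1, 5, 2]]
cond = sfe_word_probs(labeled, unlabeled, ["a", "b"], n_words=3)
```

Unlike EM, this sketch needs a single pass over the unlabeled documents, because it only collects word counts from them and never classifies whole documents.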
EXPERIMENT
Source data sets
Performance index: AUC & accuracy
Influence on conditional log likelihood
Impact of size of unlabeled data
Computational cost
* AUC refers to the area under the ROC curve
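For reference, AUC for a binary problem can be computed directly from scores as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. The sketch below uses this pairwise definition; the function name and the toy scores are illustrative, and the quadratic loop is only meant for small inputs.

```python
def auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).
    Counts positive/negative pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
chance_like = auc([0.9, 0.1, 0.8, 0.3], [1, 1, 0, 0])
```

Unlike accuracy, this measure depends only on how the classifier ranks documents, not on a particular decision threshold.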
Based on […]'s MNB classifier