Naïve Bayes Classifier
Christina Wallin, Period 3
Computer Systems Research Lab 2008-2009
Goal
- Create a naïve Bayes classifier using the 20 Newsgroups dataset
- Compare the effectiveness of a simple naïve Bayes classifier and an optimized one
What is Naïve Bayes?
- A classification method based on a word-independence assumption
- Machine learning: trained on labeled examples of each class, and then able to classify new texts
- Classification is based on the probability that a word will appear in a specific class of text (see the formula below)
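As a standard statement of the rule (not spelled out on the slides): for a document with words w_1, ..., w_n, the classifier picks the class c that maximizes the class prior times the per-word probabilities, treating the words as independent given the class:

```latex
\hat{c} = \operatorname*{arg\,max}_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```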
Previous Research
The algorithm has been around for a while (first used in 1966).
At first it was thought to be less effective because of its simplicity and false independence assumption, but a recent review of its uses found that it is actually rather effective ("Idiot's Bayes--Not So Stupid After All?" by David Hand and Keming Yu).
Previous Research Cont’d
• Currently, the best methods use a combination of naïve Bayes and logistic regression (Shen and Jiang, 2003)
• Still room for improvement: data selection for training and how to incorporate the text length (Lewis, 2001)
• My program will investigate what features of the training data make it better suited for naïve Bayes, building upon the basic structure outlined in many papers
Program Overview
• Python with NLTK (Natural Language Toolkit)
• file.py
• train.py
• test.py
Procedures: file.py
• So far, a program which inputs a text file
• Parses the file
• Makes a dictionary of all of the words present and their frequency
• Can choose to stem words or not
• With PyLab, can graph the 20 most frequent words (a sketch follows below)
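A minimal sketch of what file.py's steps might look like; the function names and the crude punctuation handling are illustrative assumptions, using NLTK's Porter stemmer and the PyLab interface named on the slide:

```python
# Hypothetical sketch of file.py: parse a text file, count word
# frequencies, optionally stem, and plot the 20 most frequent words.
from collections import Counter

from nltk.stem.porter import PorterStemmer
import pylab


def word_frequencies(path, stem=False):
    """Return a Counter mapping word -> frequency for one text file."""
    stemmer = PorterStemmer()
    counts = Counter()
    with open(path) as f:
        for line in f:
            for word in line.lower().split():
                word = word.strip('.,!?;:"()')  # crude punctuation strip
                if not word:
                    continue
                if stem:
                    word = stemmer.stem(word)
                counts[word] += 1
    return counts


def plot_top_words(counts, n=20):
    """Bar-chart the n most frequent words with PyLab."""
    words, freqs = zip(*counts.most_common(n))
    pylab.bar(range(n), freqs)
    pylab.xticks(range(n), words, rotation=90)
    pylab.show()
```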
Procedures: train.py
• Train the program on which words occur more frequently in each class
• Make a PFX vector: the probability that each word appears in a text of the class
– The number of texts in the class that contain the word, divided by the total number of texts in the class
– With Laplace smoothing, so no word gets a probability of exactly 0 or 1 (see the sketch below)
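A minimal sketch of the PFX computation under the per-text counting described above, with add-one Laplace smoothing; train_class and vocabulary are illustrative names, not necessarily the slides' actual code:

```python
# Hypothetical sketch of train.py's core step: estimate, for each word,
# the probability that a text of a given class contains it.
def train_class(documents, vocabulary):
    """documents: list of sets of words, one set per training text in the class.
    vocabulary: all words seen across training texts of every class.
    Returns dict word -> P(a text of this class contains the word)."""
    n_docs = len(documents)
    return {
        # Laplace smoothing: +1 in the numerator and +2 in the
        # denominator keep every probability strictly between 0 and 1.
        w: (sum(1 for doc in documents if w in doc) + 1) / (n_docs + 2)
        for w in vocabulary
    }
```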
Procedures: test.py
• Using the PFX vectors generated by train.py, go through the testing cases and compare the words in them to those in each class as a whole
• Use a sum of logs to compute the probability, because multiplying many small probabilities together would underflow (a sketch follows below)
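One way the log-sum classification step might look, assuming the per-text (Bernoulli) model sketched above; classify, pfx_by_class, and priors are illustrative names:

```python
import math


def classify(words, pfx_by_class, priors):
    """words: set of words in the test document.
    pfx_by_class: class -> {word: P(word | class)} from train.py.
    priors: class -> P(class).
    Returns the class with the highest log-probability."""
    best_class, best_score = None, float('-inf')
    for cls, pfx in pfx_by_class.items():
        # Sum of logs instead of a product of probabilities,
        # so the score never underflows to zero.
        score = math.log(priors[cls])
        for word, p in pfx.items():
            # Bernoulli model: account for both presence and absence.
            score += math.log(p) if word in words else math.log(1.0 - p)
        if score > best_score:
            best_class, best_score = cls, score
    return best_class
```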
Testing
• Generated text files based on preset probabilities of the words occurring
• Compared the initial, programmed-in probabilities to the PFX vector generated
• Also used the generated files to test text classification (see the sketch below)
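A minimal sketch of such a generator, assuming each word is drawn independently from the preset distribution; generate_file and its parameters are illustrative:

```python
import random


def generate_file(path, word_probs, n_words=200):
    """Write n_words words sampled according to word_probs
    (dict mapping word -> probability, summing to 1)."""
    words = list(word_probs)
    weights = [word_probs[w] for w in words]
    with open(path, 'w') as f:
        f.write(' '.join(random.choices(words, weights=weights, k=n_words)))
```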
Results: file.py
[Plots: the 20 most frequent words in sci.space and in rec.sport.baseball from 20 Newsgroups]
Results: file.py
• The stories are approximately the same length
• sci.space is more dense and less to the point
• The most frequent word, 'the', is the same in both
Results: Effect of stemming
• 82.6% correctly classified with stemmer vs 83.6% without in alt.atheism and rec.autos
• 66.6% vs 67.7% with comp.sys.ibm.pc.hardware and comp.sys.mac.hardware
• 69.3% vs 70.4% with sci.crypt and alt.atheism
• I expected stemming to help, but as shown, using a Porter stemmer to stem words before generating the probability vector does not improve accuracy
Still to come
• Optimization
– Analyze the data from 20 Newsgroups to see why certain classes can be classified more easily than others
• Change to a multinomial model
– Count multiple occurrences of a word in a file, instead of just its presence (a possible sketch follows)
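One possible shape for that multinomial change, using word counts with add-one smoothing; train_multinomial and vocabulary are illustrative assumptions, since this part is still future work:

```python
# Hypothetical sketch of the planned multinomial model: use word
# counts rather than word presence, with add-one Laplace smoothing.
from collections import Counter


def train_multinomial(documents, vocabulary):
    """documents: list of lists of words (duplicates kept).
    Returns dict word -> P(word | class) under the multinomial model."""
    counts = Counter(w for doc in documents for w in doc)
    total = sum(counts.values())
    v = len(vocabulary)  # smoothing adds one pseudo-count per vocabulary word
    return {w: (counts[w] + 1) / (total + v) for w in vocabulary}
```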