Naïve Bayes Classifier
Christina Wallin, Period 3
Computer Systems Research Lab 2008-2009
Goal
- Create a naïve Bayes classifier using the 20 Newsgroups dataset
- Compare the effectiveness of a simple naïve Bayes classifier and an optimized one
What is Naïve Bayes?
- A classification method based on a word-independence assumption
- Machine learning: trained on labeled examples of each class, and then able to classify new texts
- Classification is based on the probability that a word will appear in a specific class of text (see the formula below)
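As a standard statement of the rule (not spelled out on the slides): for a document with words w_1, ..., w_n, the classifier picks the class c that maximizes the class prior times the per-word probabilities, treating the words as independent given the class:

```latex
\hat{c} = \operatorname*{arg\,max}_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```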
Previous Research
The algorithm has been around for a while (first used in 1966).
At first it was thought to be less effective because of its simplicity and false independence assumption, but a recent review of its uses found that it is actually rather effective ("Idiot's Bayes--Not So Stupid After All?" by David Hand and Keming Yu).
Previous Research Cont’d
• Currently, the best methods use a combination of naïve Bayes and logistic regression (Shen and Jiang, 2003)
• Still room for improvement: data selection for training and how to incorporate the text length (Lewis, 2001)
• My program will investigate what features of the training data make it better suited for naïve Bayes, building upon the basic structure outlined in many papers
Program Overview
• Python with NLTK (Natural Language Toolkit)
• file.py
• train.py
• test.py
Procedures: file.py
• So far, a program which inputs a text file
• Parses the file
• Makes a dictionary of all of the words present and their frequency
• Can choose to stem words or not
• With PyLab, can graph the 20 most frequent words (a sketch follows below)
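A minimal sketch of what file.py's steps might look like; the function names and the crude punctuation handling are illustrative assumptions, using NLTK's Porter stemmer and the PyLab interface named on the slide:

```python
# Hypothetical sketch of file.py: parse a text file, count word
# frequencies, optionally stem, and plot the 20 most frequent words.
from collections import Counter

from nltk.stem.porter import PorterStemmer
import pylab


def word_frequencies(path, stem=False):
    """Return a Counter mapping word -> frequency for one text file."""
    stemmer = PorterStemmer()
    counts = Counter()
    with open(path) as f:
        for line in f:
            for word in line.lower().split():
                word = word.strip('.,!?;:"()')  # crude punctuation strip
                if not word:
                    continue
                if stem:
                    word = stemmer.stem(word)
                counts[word] += 1
    return counts


def plot_top_words(counts, n=20):
    """Bar-chart the n most frequent words with PyLab."""
    words, freqs = zip(*counts.most_common(n))
    pylab.bar(range(n), freqs)
    pylab.xticks(range(n), words, rotation=90)
    pylab.show()
```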
Procedures: train.py
• Train the program on which words occur more frequently in each class
• Make a PFX vector: the probability that each word appears in a text of the class
– The number of texts in the class that contain the word, divided by the total number of texts in the class
– With Laplace smoothing, so no word gets a probability of exactly 0 or 1 (see the sketch below)
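A minimal sketch of the PFX computation under the per-text counting described above, with add-one Laplace smoothing; train_class and vocabulary are illustrative names, not necessarily the slides' actual code:

```python
# Hypothetical sketch of train.py's core step: estimate, for each word,
# the probability that a text of a given class contains it.
def train_class(documents, vocabulary):
    """documents: list of sets of words, one set per training text in the class.
    vocabulary: all words seen across training texts of every class.
    Returns dict word -> P(a text of this class contains the word)."""
    n_docs = len(documents)
    return {
        # Laplace smoothing: +1 in the numerator and +2 in the
        # denominator keep every probability strictly between 0 and 1.
        w: (sum(1 for doc in documents if w in doc) + 1) / (n_docs + 2)
        for w in vocabulary
    }
```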
Procedures: test.py
• Using the PFX vectors generated by train.py, go through the testing cases and compare the words in them to those in each class as a whole
• Use a sum of logs to compute the probability, because multiplying many small probabilities together would underflow (a sketch follows below)
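One way the log-sum classification step might look, assuming the per-text (Bernoulli) model sketched above; classify, pfx_by_class, and priors are illustrative names:

```python
import math


def classify(words, pfx_by_class, priors):
    """words: set of words in the test document.
    pfx_by_class: class -> {word: P(word | class)} from train.py.
    priors: class -> P(class).
    Returns the class with the highest log-probability."""
    best_class, best_score = None, float('-inf')
    for cls, pfx in pfx_by_class.items():
        # Sum of logs instead of a product of probabilities,
        # so the score never underflows to zero.
        score = math.log(priors[cls])
        for word, p in pfx.items():
            # Bernoulli model: account for both presence and absence.
            score += math.log(p) if word in words else math.log(1.0 - p)
        if score > best_score:
            best_class, best_score = cls, score
    return best_class
```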
Testing
• Generated text files based on preset probabilities of the words occurring
• Compared the initial, programmed-in probabilities to the PFX vector generated
• Also used the generated files to test text classification (see the sketch below)
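A minimal sketch of such a generator, assuming each word is drawn independently from the preset distribution; generate_file and its parameters are illustrative:

```python
import random


def generate_file(path, word_probs, n_words=200):
    """Write n_words words sampled according to word_probs
    (dict mapping word -> probability, summing to 1)."""
    words = list(word_probs)
    weights = [word_probs[w] for w in words]
    with open(path, 'w') as f:
        f.write(' '.join(random.choices(words, weights=weights, k=n_words)))
```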
Results: file.py
[Plots: the 20 most frequent words in sci.space and in rec.sport.baseball from 20 Newsgroups]
Results: file.py
• The stories are approximately the same length
• sci.space is more dense and less to the point
• The most frequent word, 'the', is the same in both
Results: Effect of stemming
• 82.6% correctly classified with stemmer vs 83.6% without in alt.atheism and rec.autos
• 66.6% vs 67.7% with comp.sys.ibm.pc.hardware and comp.sys.mac.hardware
• 69.3% vs 70.4% with sci.crypt and alt.atheism
• I expected stemming to help, but as shown, using a Porter stemmer to stem words before generating the probability vector does not improve accuracy
Still to come
• Optimization
– Analyze the data from 20 Newsgroups to see why certain classes can be classified more easily than others
• Change to a multinomial model
– Count multiple occurrences of a word in a file, instead of just its presence (a possible sketch follows)
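One possible shape for that multinomial change, using word counts with add-one smoothing; train_multinomial and vocabulary are illustrative assumptions, since this part is still future work:

```python
# Hypothetical sketch of the planned multinomial model: use word
# counts rather than word presence, with add-one Laplace smoothing.
from collections import Counter


def train_multinomial(documents, vocabulary):
    """documents: list of lists of words (duplicates kept).
    Returns dict word -> P(word | class) under the multinomial model."""
    counts = Counter(w for doc in documents for w in doc)
    total = sum(counts.values())
    v = len(vocabulary)  # smoothing adds one pseudo-count per vocabulary word
    return {w: (counts[w] + 1) / (total + v) for w in vocabulary}
```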