Text Classification using SVM-light
DSSI 2008
Jing Jiang
Text Classification
• Goal: to classify documents (news articles, emails, Web pages, etc.) into predefined categories
• Examples
  – To classify news articles into “business” and “sports”
  – To classify Web pages into personal home pages and others
  – To classify product reviews into positive reviews and negative reviews
• Approach: supervised machine learning
  – For each predefined category, we need a set of training documents known to belong to the category.
  – From the training documents, we train a classifier.
Overview
• Step 1—text pre-processing
  – to pre-process text and represent each document as a feature vector
• Step 2—training
  – to train a classifier using a classification tool (e.g. SNoW, SVM-light)
• Step 3—classification
  – to apply the classifier to new documents
Pre-processing: tokenization
• Goal: to separate text into individual words
• Example: “We’re attending a tutorial now.” → we ’re attending a tutorial now
• Tool:
  – Word Splitter: http://l2r.cs.uiuc.edu/~cogcomp/atool.php?tkey=WS
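For illustration only, here is a toy C++ tokenizer that splits on whitespace and breaks off punctuation. It is a sketch of the idea, not the algorithm used by the Word Splitter tool above, which also handles clitics such as “’re”.

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Toy tokenizer: lowercases, splits on whitespace, and emits punctuation
// characters as separate tokens.
std::vector<std::string> tokenize(const std::string &text) {
    std::vector<std::string> tokens;
    std::string cur;
    for (char ch : text) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (std::isspace(c) || std::ispunct(c)) {
            if (!cur.empty()) { tokens.push_back(cur); cur.clear(); }
            if (std::ispunct(c)) tokens.push_back(std::string(1, ch));
        } else {
            cur += static_cast<char>(std::tolower(c));
        }
    }
    if (!cur.empty()) tokens.push_back(cur);
    return tokens;
}

int main() {
    // Prints one token per line: we ' re attending a tutorial now .
    for (const std::string &t : tokenize("We're attending a tutorial now."))
        std::cout << t << "\n";
}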
Pre-processing: stop word removal (optional)
• Goal: to remove common words that are usually not useful for text classification
• Example: to remove words such as “a”, “the”, “I”, “he”, “she”, “is”, “are”, etc.
• Stop word list:
  – http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
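The removal step itself is just a set lookup over the token stream, as in the C++ sketch below. The three-word set is illustrative; in practice, load the full list from the URL above.

#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    // Illustrative subset of a stop word list; read the real list from a file.
    std::unordered_set<std::string> stopWords = {"a", "the", "is"};
    std::vector<std::string> tokens = {"the", "tutorial", "is", "useful"};
    for (const std::string &t : tokens)
        if (stopWords.count(t) == 0)
            std::cout << t << "\n"; // keeps: tutorial, useful
}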
Pre-processing: stemming (optional)
• Goal: to normalize words derived from the same root
• Examples:
  – attending → attend
  – teacher → teach
• Tool:
  – Porter stemmer: http://tartarus.org/~martin/PorterStemmer/
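The Porter stemmer applies several ordered suffix-rewriting steps with conditions on the remaining stem. The toy function below strips just two suffixes to convey the idea; it is not the Porter algorithm.

#include <iostream>
#include <string>

// Toy stemmer: strips a couple of suffixes. The real Porter stemmer
// applies ordered rule steps (see the URL above).
std::string stem(const std::string &w) {
    const std::string suffixes[] = {"ing", "er"};
    for (const std::string &s : suffixes)
        if (w.size() > s.size() + 2 &&
            w.compare(w.size() - s.size(), s.size(), s) == 0)
            return w.substr(0, w.size() - s.size());
    return w;
}

int main() {
    std::cout << stem("attending") << " " << stem("teacher") << std::endl; // attend teach
}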
Pre-processing: feature extraction
• Unigram features: to use each word as a feature
  – To use TF (term frequency) as the feature value
  – To use TF*IDF (TF times inverse document frequency) as the feature value
  – IDF(t) = log(total number of documents / number of documents containing term t)
• Bigram features: to use two consecutive words as a feature
• Tool:
  – Write your own program/script
  – Lemur API (see the example below)
Using Lemur to Extract Unigram Features

#include <iostream>
#include "IndexManager.hpp" // Lemur toolkit headers; depending on the Lemur
#include "TermInfoList.hpp" // version, these classes may live in lemur::api
using namespace std;

Index *ind = IndexManager::openIndex("index-file.key");
int d1 = 1; // internal Lemur document id (left uninitialized on the original slide)
TermInfoList *tList = ind->termInfoList(d1);
tList->startIteration();
while (tList->hasMore()) {
  TermInfo *entry = tList->nextEntry();
  cout << "entry term id: " << entry->termID() << endl;
  cout << "entry term count: " << entry->termCount() << endl;
}
delete tList; // the original slide read "delete dList", a typo
delete ind;
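To connect feature extraction with the SVM-light input format shown later, here is a minimal self-contained sketch (plain C++, no Lemur calls; all values and the tf/df names are made up for illustration) that turns term frequencies into one TF*IDF feature line:

#include <cmath>
#include <iostream>
#include <map>

int main() {
    // Made-up inputs: term id -> frequency in this document, term id ->
    // number of documents containing the term, and the corpus size.
    std::map<int, int> tf = {{1, 2}, {3, 1}, {5, 4}};
    std::map<int, int> df = {{1, 50}, {3, 10}, {5, 200}};
    double totalDocs = 1000.0;
    int label = 1; // +1 = positive class, -1 = negative class

    // SVM-light expects "<label> <id>:<value> ..." with feature ids in
    // increasing order; std::map iterates in key order, so this holds.
    std::cout << label;
    for (const auto &p : tf) {
        double idf = std::log(totalDocs / df[p.first]);
        std::cout << " " << p.first << ":" << p.second * idf;
    }
    std::cout << "\n";
}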
SVM (Support Vector Machines)
• A learning algorithm for classification
  – General for any classification problem (text classification as one example)
• Binary classification
• Maximizes the margin between the two different classes
picture from http://www1.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf
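In symbols (the standard linear SVM formulation, not anything specific to SVM-light): given training pairs (x_i, y_i) with y_i in {+1, -1}, training finds the weight vector w and bias b that minimize ||w||^2 / 2 subject to y_i (w · x_i + b) >= 1 for all i. The geometric margin this maximizes is 2 / ||w||.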
SVM-light
• SVM-light: a command line C program that implements the SVM learning algorithm
• Classification, regression, ranking
• Download at http://svmlight.joachims.org/
• Documentation on the same page
• Two programs
  – svm_learn for training
  – svm_classify for classification
SVM-light Examples
• Input format (one example per line: a label, then feature:value pairs)
  1 1:0.5 3:1 5:0.4
  -1 2:0.9 3:0.1 4:2
• To train a classifier from train.data
  – svm_learn train.data train.model
• To classify new documents in test.data
  – svm_classify test.data train.model test.result
• Output format
  – Positive score → positive class
  – Negative score → negative class
  – The absolute value of the score indicates confidence
• Command line options
  – -c: the tradeoff between training error and margin (use cross-validation to tune it)
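Putting the pieces together, a toy session might look as follows (the file contents are made up; test.result holds one real-valued score per line, in the order of the examples in test.data):

train.data:
1 1:0.5 3:1 5:0.4
1 2:0.3 3:0.8
-1 2:0.9 3:0.1 4:2

svm_learn -c 1.0 train.data train.model
svm_classify test.data train.model test.result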
More on SVM-light
• Kernel
  – Use the “-t” option
  – Polynomial kernel
  – User-defined kernel
• Semi-supervised learning (transductive SVM)
  – Use “0” as the label for unlabeled examples
  – Very slow
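For example, “-t 1” selects the polynomial kernel and “-d” sets its degree (see the documentation page above for the full option list), so a degree-2 polynomial SVM would be trained with:

svm_learn -t 1 -d 2 train.data train.model

For transductive learning, mix unlabeled lines (label 0) into train.data and run svm_learn as usual.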