Text Classification using SVM-light
DSSI 2008
Jing Jiang
Text Classification
• Goal: to classify documents (news articles, emails, Web pages, etc.) into predefined categories
• Examples
  – To classify news articles into “business” and “sports”
  – To classify Web pages into personal home pages and others
  – To classify product reviews into positive reviews and negative reviews
• Approach: supervised machine learning
  – For each predefined category, we need a set of training documents known to belong to the category.
  – From the training documents, we train a classifier.
Overview
• Step 1—text pre-processing
  – to pre-process text and represent each document as a feature vector
• Step 2—training
  – to train a classifier using a classification tool (e.g. SNoW, SVM-light)
• Step 3—classification
  – to apply the classifier to new documents
Pre-processing: tokenization
• Goal: to separate text into individual words
• Example: “We’re attending a tutorial now.” → we ’re attending a tutorial now
• Tool:
  – Word Splitter: http://l2r.cs.uiuc.edu/~cogcomp/atool.php?tkey=WS
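For illustration only, here is a toy C++ tokenizer that splits on whitespace and breaks off punctuation. It is a sketch of the idea, not the algorithm used by the Word Splitter tool above, which also handles clitics such as “’re”.

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Toy tokenizer: lowercases, splits on whitespace, and emits punctuation
// characters as separate tokens.
std::vector<std::string> tokenize(const std::string &text) {
    std::vector<std::string> tokens;
    std::string cur;
    for (char ch : text) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (std::isspace(c) || std::ispunct(c)) {
            if (!cur.empty()) { tokens.push_back(cur); cur.clear(); }
            if (std::ispunct(c)) tokens.push_back(std::string(1, ch));
        } else {
            cur += static_cast<char>(std::tolower(c));
        }
    }
    if (!cur.empty()) tokens.push_back(cur);
    return tokens;
}

int main() {
    // Prints one token per line: we ' re attending a tutorial now .
    for (const std::string &t : tokenize("We're attending a tutorial now."))
        std::cout << t << "\n";
}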
Pre-processing: stop word removal (optional)
• Goal: to remove common words that are usually not useful for text classification
• Example: to remove words such as “a”, “the”, “I”, “he”, “she”, “is”, “are”, etc.
• Stop word list:
  – http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
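The removal step itself is just a set lookup over the token stream, as in the C++ sketch below. The three-word set is illustrative; in practice, load the full list from the URL above.

#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    // Illustrative subset of a stop word list; read the real list from a file.
    std::unordered_set<std::string> stopWords = {"a", "the", "is"};
    std::vector<std::string> tokens = {"the", "tutorial", "is", "useful"};
    for (const std::string &t : tokens)
        if (stopWords.count(t) == 0)
            std::cout << t << "\n"; // keeps: tutorial, useful
}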
Pre-processing: stemming (optional)
• Goal: to normalize words derived from the same root
• Examples:
  – attending → attend
  – teacher → teach
• Tool:
  – Porter stemmer: http://tartarus.org/~martin/PorterStemmer/
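The Porter stemmer applies several ordered suffix-rewriting steps with conditions on the remaining stem. The toy function below strips just two suffixes to convey the idea; it is not the Porter algorithm.

#include <iostream>
#include <string>

// Toy stemmer: strips a couple of suffixes. The real Porter stemmer
// applies ordered rule steps (see the URL above).
std::string stem(const std::string &w) {
    const std::string suffixes[] = {"ing", "er"};
    for (const std::string &s : suffixes)
        if (w.size() > s.size() + 2 &&
            w.compare(w.size() - s.size(), s.size(), s) == 0)
            return w.substr(0, w.size() - s.size());
    return w;
}

int main() {
    std::cout << stem("attending") << " " << stem("teacher") << std::endl; // attend teach
}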
Pre-processing: feature extraction
• Unigram features: to use each word as a feature
  – To use TF (term frequency) as the feature value
  – To use TF*IDF (TF times inverse document frequency) as the feature value
  – IDF(t) = log(total number of documents / number of documents containing term t)
• Bigram features: to use two consecutive words as a feature
• Tool:
  – Write your own program/script
  – Lemur API (see the example below)
Using Lemur to Extract Unigram Features

#include <iostream>
#include "IndexManager.hpp" // Lemur toolkit headers; depending on the Lemur
#include "TermInfoList.hpp" // version, these classes may live in lemur::api
using namespace std;

Index *ind = IndexManager::openIndex("index-file.key");
int d1 = 1; // internal Lemur document id (left uninitialized on the original slide)
TermInfoList *tList = ind->termInfoList(d1);
tList->startIteration();
while (tList->hasMore()) {
  TermInfo *entry = tList->nextEntry();
  cout << "entry term id: " << entry->termID() << endl;
  cout << "entry term count: " << entry->termCount() << endl;
}
delete tList; // the original slide read "delete dList", a typo
delete ind;
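To connect feature extraction with the SVM-light input format shown later, here is a minimal self-contained sketch (plain C++, no Lemur calls; all values and the tf/df names are made up for illustration) that turns term frequencies into one TF*IDF feature line:

#include <cmath>
#include <iostream>
#include <map>

int main() {
    // Made-up inputs: term id -> frequency in this document, term id ->
    // number of documents containing the term, and the corpus size.
    std::map<int, int> tf = {{1, 2}, {3, 1}, {5, 4}};
    std::map<int, int> df = {{1, 50}, {3, 10}, {5, 200}};
    double totalDocs = 1000.0;
    int label = 1; // +1 = positive class, -1 = negative class

    // SVM-light expects "<label> <id>:<value> ..." with feature ids in
    // increasing order; std::map iterates in key order, so this holds.
    std::cout << label;
    for (const auto &p : tf) {
        double idf = std::log(totalDocs / df[p.first]);
        std::cout << " " << p.first << ":" << p.second * idf;
    }
    std::cout << "\n";
}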
SVM (Support Vector Machines)
• A learning algorithm for classification
  – General for any classification problem (text classification as one example)
• Binary classification
• Maximizes the margin between the two different classes
picture from http://www1.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf
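In symbols (the standard linear SVM formulation, not anything specific to SVM-light): given training pairs (x_i, y_i) with y_i in {+1, -1}, training finds the weight vector w and bias b that minimize ||w||^2 / 2 subject to y_i (w · x_i + b) >= 1 for all i. The geometric margin this maximizes is 2 / ||w||.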
SVM-light
• SVM-light: a command line C program that implements the SVM learning algorithm
• Classification, regression, ranking
• Download at http://svmlight.joachims.org/
• Documentation on the same page
• Two programs
  – svm_learn for training
  – svm_classify for classification
SVM-light Examples
• Input format (one example per line: a label, then feature:value pairs)
  1 1:0.5 3:1 5:0.4
  -1 2:0.9 3:0.1 4:2
• To train a classifier from train.data
  – svm_learn train.data train.model
• To classify new documents in test.data
  – svm_classify test.data train.model test.result
• Output format
  – Positive score → positive class
  – Negative score → negative class
  – The absolute value of the score indicates confidence
• Command line options
  – -c: the tradeoff between training error and margin (use cross-validation to tune it)
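Putting the pieces together, a toy session might look as follows (the file contents are made up; test.result holds one real-valued score per line, in the order of the examples in test.data):

train.data:
1 1:0.5 3:1 5:0.4
1 2:0.3 3:0.8
-1 2:0.9 3:0.1 4:2

svm_learn -c 1.0 train.data train.model
svm_classify test.data train.model test.result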
More on SVM-light
• Kernel
  – Use the “-t” option
  – Polynomial kernel
  – User-defined kernel
• Semi-supervised learning (transductive SVM)
  – Use “0” as the label for unlabeled examples
  – Very slow
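For example, “-t 1” selects the polynomial kernel and “-d” sets its degree (see the documentation page above for the full option list), so a degree-2 polynomial SVM would be trained with:

svm_learn -t 1 -d 2 train.data train.model

For transductive learning, mix unlabeled lines (label 0) into train.data and run svm_learn as usual.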