Document Classification using the Natural Language Toolkit Ben Healey http://benhealey.info @BenHealey


Page 1: Document Classification using the Python Natural Language Toolkit

Document Classification using the Natural Language Toolkit

Ben Healey

http://benhealey.info

@BenHealey

Page 2: Document Classification using the Python Natural Language Toolkit

Source: IStockPhoto

Page 3: Document Classification using the Python Natural Language Toolkit

The Need for Automation

http://upload.wikimedia.org/wikipedia/commons/b/b6/FileStack_retouched.jpg

Page 4: Document Classification using the Python Natural Language Toolkit

Take ur pick!

http://upload.wikimedia.org/wikipedia/commons/d/d6/Cat_loves_sweets.jpg

Page 5: Document Classification using the Python Natural Language Toolkit

Features:
- # Words
- % ALLCAPS
- Unigrams
- Sender
- And so on.

[Diagram labels: The Development Set; Document Features; Classification Algo.; Trained Classifier (Model); New Document (Class Unknown); Classified Document; Class]

Page 6: Document Classification using the Python Natural Language Toolkit

Relevant NLTK Modules

• Feature Extraction
– from nltk.corpus import words, stopwords
– from nltk.stem import PorterStemmer
– from nltk.tokenize import WordPunctTokenizer
– from nltk.collocations import BigramCollocationFinder
– from nltk.metrics import BigramAssocMeasures
– See http://text-processing.com/demo/ for examples

• Machine Learning Algos and Tools
– from nltk.classify import NaiveBayesClassifier
– from nltk.classify import DecisionTreeClassifier
– from nltk.classify import MaxentClassifier
– from nltk.classify import WekaClassifier
– from nltk.classify.util import accuracy

Page 7: Document Classification using the Python Natural Language Toolkit

NaiveBayesClassifier

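$$P(\text{label} \mid \text{features}) = \frac{P(\text{label}) \times P(\text{features} \mid \text{label})}{P(\text{features})}$$

$$P(\text{label} \mid \text{features}) = \frac{P(\text{label}) \times P(f_1 \mid \text{label}) \times \dots \times P(f_n \mid \text{label})}{P(\text{features})}$$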

http://61.153.44.88/nltk/0.9.5/api/nltk.classify.naivebayes-module.html
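To make the formula concrete, here is a minimal from-scratch sketch in plain Python. It is not NLTK's implementation: the function names, the toy training data, and the add-one smoothing are all assumptions made for illustration. It estimates P(label) and P(f | label) from labelled feature dictionaries, multiplies them, and divides by P(features).

```python
from collections import Counter, defaultdict


def train_naive_bayes(labeled_featuresets):
    """Estimate P(label) and P(feature present | label) from (features, label) pairs."""
    label_counts = Counter(label for _, label in labeled_featuresets)
    feature_counts = defaultdict(Counter)  # label -> feature name -> count
    vocabulary = set()
    for features, label in labeled_featuresets:
        for name in features:
            feature_counts[label][name] += 1
            vocabulary.add(name)
    total = sum(label_counts.values())
    priors = {label: n / total for label, n in label_counts.items()}
    # Add-one smoothing (an assumption of this sketch) avoids zero probabilities.
    likelihoods = {
        label: {
            name: (feature_counts[label][name] + 1) / (label_counts[label] + 2)
            for name in vocabulary
        }
        for label in label_counts
    }
    return priors, likelihoods


def posterior(features, priors, likelihoods):
    """Return P(label | features), using only the features present in the document."""
    scores = {}
    for label, prior in priors.items():
        score = prior  # P(label)
        for name in features:
            if name in likelihoods[label]:
                score *= likelihoods[label][name]  # times P(f_i | label)
        scores[label] = score
    evidence = sum(scores.values())  # P(features), the normalising constant
    return {label: score / evidence for label, score in scores.items()}


# Hypothetical toy development set: two 'IT' and two 'Deal' messages.
toy_train = [
    ({"server": True, "password": True}, "IT"),
    ({"server": True, "outage": True}, "IT"),
    ({"trade": True, "price": True}, "Deal"),
    ({"trade": True, "server": True}, "Deal"),
]
priors, likelihoods = train_naive_bayes(toy_train)
print(posterior({"server": True, "password": True}, priors, likelihoods))
```

Because P(features) is the same for every label, it only rescales the scores so that they sum to 1; the ranking of labels is already fixed by the numerator.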

Page 8: Document Classification using the Python Natural Language Toolkit

http://www.educationnews.org/commentaries/opinions_on_education/91117.html

Page 9: Document Classification using the Python Natural Language Toolkit

517,431 Emails

Source: IStockPhoto

Page 10: Document Classification using the Python Natural Language Toolkit

Prep: Extract and Load

• Sample* of 20,581 plaintext files

• import MySQLdb, os, random, string

• MySQL via Python ODBC interface

• File, string manipulation

• Key fields separated out

– To, From, CC, Subject, Body

* Folders for 7 users with a large number of emails, so not representative!
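As a rough sketch of the "key fields separated out" step, the standard-library `email` parser can split a plaintext message into headers and body. This is an assumption about one way to do it, not the code used for the talk (which loaded records into MySQL). The keys `msg_subject` and `msg_body` follow the names used later in the slides; `split_key_fields`, `msg_to`, `msg_from` and `msg_cc` are hypothetical.

```python
from email import message_from_string


def split_key_fields(raw_message):
    """Split one plaintext email into To, From, CC, Subject and Body."""
    msg = message_from_string(raw_message)
    return {
        # Header lookup is case-insensitive; a missing header becomes "".
        "msg_to": msg.get("To", ""),
        "msg_from": msg.get("From", ""),
        "msg_cc": msg.get("Cc", ""),
        "msg_subject": msg.get("Subject", ""),
        # For a simple (non-multipart) message the payload is the body text.
        "msg_body": msg.get_payload(),
    }


# Hypothetical sample message, not taken from the Enron data.
sample = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Meeting agenda\n"
    "\n"
    "Please see the attached agenda.\n"
)
record = split_key_fields(sample)
print(record["msg_subject"])
```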

Page 11: Document Classification using the Python Natural Language Toolkit

Prep: Extract and Load

• Allocation of random number

• Some feature extraction

– #To, #CCd, #Words, %digits, %CAPS

• Note: more cleaning could be done

• Code at benhealey.info
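The count-style features listed above could be derived along these lines. This is only a sketch: the exact definitions used in the talk (for example, whether %CAPS is measured per character or per word) are not given, so the per-character definitions here are assumptions, as are all key names except `num_words_in_body`.

```python
import random


def count_features(record, rng=random):
    """Derive simple numeric features from one parsed email record."""
    body = record["msg_body"]
    non_space = [ch for ch in body if not ch.isspace()]
    n_chars = len(non_space) or 1  # avoid dividing by zero on empty bodies
    return {
        # Random number allocated per message, e.g. for later sampling.
        "random_num": rng.random(),
        "num_to": len([a for a in record["msg_to"].split(",") if a.strip()]),
        "num_cc": len([a for a in record["msg_cc"].split(",") if a.strip()]),
        "num_words_in_body": len(body.split()),
        # Assumed definitions: share of non-whitespace characters.
        "pct_digits": 100.0 * sum(ch.isdigit() for ch in non_space) / n_chars,
        "pct_caps": 100.0 * sum(ch.isupper() for ch in non_space) / n_chars,
    }


# Hypothetical record for illustration.
example = {"msg_to": "a@example.com, b@example.com", "msg_cc": "", "msg_body": "CALL me at 555"}
print(count_features(example))
```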

Page 12: Document Classification using the Python Natural Language Toolkit

From: [email protected]

To: [email protected]

Subject: Re: Agenda for FERC Meeting RE: EOL

Louise --

We had decided that not having Mark in the room gave us the ability to wiggle if questions on CFTC vs. FERC regulation arose. As you can imagine, FERC is starting to grapple with the issue that financial trades in energy commodities is regulated under the CEA, not the Federal Power Act or the Natural Gas Act.

Thanks,

Jim

Page 13: Document Classification using the Python Natural Language Toolkit

From: [email protected]

To: [email protected]

Subject: Start Date: 1/11/02; HourAhead hour: 5;

Start Date: 1/11/02; HourAhead hour: 5; No ancillary schedules awarded. No variances detected.

LOG MESSAGES:

PARSING FILE -->> O:\Portland\WestDesk\California Scheduling\ISO Final Schedules\2002011105.txt

Page 14: Document Classification using the Python Natural Language Toolkit

Class[es] assigned for 1,000 randomly selected messages:

Class                      Messages
Deals, Trading, Modelling  247
Regulatory/Accounting      172
Info Tech                  167
Admin/Planning             158
Other/Unclear              141
Human Resources            134
Social/Personal            68
External Relations         45

Page 15: Document Classification using the Python Natural Language Toolkit

Prep: Show us ur Features

• NLTK toolset
– from nltk.corpus import words, stopwords
– from nltk.stem import PorterStemmer
– from nltk.tokenize import WordPunctTokenizer
– from nltk.collocations import BigramCollocationFinder
– from nltk.metrics import BigramAssocMeasures

• Custom code
– def extract_features(record,stemmer,stopset,tokenizer):

• Code at benhealey.info

Page 16: Document Classification using the Python Natural Language Toolkit

Prep: Show us ur Features

• Features in boolean or nominal form

if record['num_words_in_body'] <= 20:
    features['message_length'] = 'Very Short'
elif record['num_words_in_body'] <= 80:
    features['message_length'] = 'Short'
elif record['num_words_in_body'] <= 300:
    features['message_length'] = 'Medium'
else:
    features['message_length'] = 'Long'

Page 17: Document Classification using the Python Natural Language Toolkit

Prep: Show us ur Features

• Features in boolean or nominal form

text = record['msg_subject'] + " " + record['msg_body']
tokens = tokenizer.tokenize(text)
# Lower-case before the stop-word check so capitalised stop words are dropped too.
words = [stemmer.stem(x.lower()) for x in tokens if x.lower() not in stopset and len(x) > 1]
for word in words:
    features[word] = True

Page 18: Document Classification using the Python Natural Language Toolkit

Sit. Say. Heel.

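import random
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Shuffle, then hold out the last third of the development set for testing.
random.shuffle(dev_set)
cutoff = len(dev_set) * 2 // 3  # integer division so the slice index is an int
train_set = dev_set[:cutoff]
test_set = dev_set[cutoff:]

classifier = NaiveBayesClassifier.train(train_set)
print('accuracy for > ', subject, ':', accuracy(classifier, test_set))
classifier.show_most_informative_features(10)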

Page 19: Document Classification using the Python Natural Language Toolkit

Most Important Features

Page 20: Document Classification using the Python Natural Language Toolkit

Most Important Features

Page 21: Document Classification using the Python Natural Language Toolkit

Most Important Features

Page 22: Document Classification using the Python Natural Language Toolkit

Performance: ‘IT’ Model

IMPORTANT: These are ‘cheat’ scores!

Decile  Mean Prob.  % PR  % Social  % HR  % Other  % Admin  % IT  % Legal  % Deal
9       1.0000      -     1         -     -        1        95    2        3
8       0.7364      -     11        7     17       13       49    5        13
7       0.0000      3     11        10    30       21       2     13       25
6       0.0000      4     8         14    13       16       2     11       38
5       0.0000      1     7         11    13       18       4     13       41
4       0.0000      5     6         16    16       17       4     17       32
3       0.0000      4     7         16    19       28       6     16       25
2       0.0000      6     8         18    10       16       2     37       21
1       0.0000      7     4         13    12       14       2     42       21
0       0.0000      15    5         29    11       14       2     16       28
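A decile table of this kind can be built by ranking messages on the model's predicted probability, cutting the ranking into ten equal groups, and reporting the mean probability and class mix of each group. The sketch below is one assumed way to do that, not the code behind the slide. It also assumes a single label per message, whereas the slide's messages can carry several classes. `decile_table` and the toy scores are hypothetical.

```python
from collections import Counter


def decile_table(scored_docs, n_bins=10):
    """Group (probability, label) pairs into score-ranked bins.

    Returns rows of (bin, mean probability, {label: percent of bin}),
    highest-scoring bin first.
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[0])
    rows = []
    for b in range(n_bins - 1, -1, -1):
        start = b * len(ranked) // n_bins
        end = (b + 1) * len(ranked) // n_bins
        chunk = ranked[start:end]
        if not chunk:
            continue  # fewer documents than bins
        mean_prob = sum(p for p, _ in chunk) / len(chunk)
        counts = Counter(label for _, label in chunk)
        shares = {label: round(100 * n / len(chunk)) for label, n in counts.items()}
        rows.append((b, mean_prob, shares))
    return rows


# Hypothetical scores: 20 messages, the two highest-scoring ones are 'IT'.
toy_scores = [(i / 20, "IT" if i >= 18 else "Other") for i in range(20)]
for decile, mean_prob, shares in decile_table(toy_scores):
    print(decile, round(mean_prob, 4), shares)
```

A well-separated model concentrates the target class in the top deciles, which is the pattern the % IT column shows.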

Page 23: Document Classification using the Python Natural Language Toolkit

Performance: ‘Deal’ Model

Decile  Mean Prob.  % PR  % Social  % HR  % Other  % Admin  % IT  % Legal  % Deal
9       1.0000      2     -         2     11       6        5     14       79
8       1.0000      1     2         1     3        4        4     3        93
7       0.9971      2     3         3     18       4        2     17       58
6       0.1680      3     15        9     35       17       11    17       9
5       0.0000      4     11        19    22       19       9     23       4
4       0.0000      5     8         21    17       32       14    18       -
3       0.0000      5     9         21    13       25       13    28       2
2       0.0000      7     6         22    9        26       18    24       -
1       0.0000      -     4         3     2        5        77    13       -
0       0.0000      16    10        33    11       20       14    15       3

IMPORTANT: These are ‘cheat’ scores!

Page 24: Document Classification using the Python Natural Language Toolkit

Performance: ‘Social’ Model

IMPORTANT: These are ‘cheat’ scores!

Decile  Mean Prob.  % PR  % Social  % HR  % Other  % Admin  % IT  % Legal  % Deal
9       1.0000      1     9         6     9        11       13    40       24
8       1.0000      -     15        17    18       16       2     21       21
7       1.0000      7     5         6     24       20       3     30       18
6       1.0000      2     2         21    15       25       11    24       22
5       1.0000      22    11        32    9        22       10    5        10
4       1.0000      10    7         20    13       22       13    10       24
3       1.0000      -     15        10    5        9        7     14       49
2       1.0000      1     3         15    22       24       14    14       24
1       0.7382      2     1         7     25       9        4     13       45
0       0.0001      -     -         -     1        -        89    1        10

Page 25: Document Classification using the Python Natural Language Toolkit

Don’t get burned.

• Biased samples
• Accuracy and rare events
• Features and prior knowledge
• Good modelling is iterative!
• Resampling and robustness
• Learning cycles
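The "accuracy and rare events" point can be illustrated with a small sketch. Suppose a class appears in 45 of 1,000 messages (the smallest count in the labelled sample). A model that never predicts the class still scores high on accuracy while finding none of the messages. The helper name and the always-negative "model" are assumptions for illustration.

```python
def accuracy_and_recall(predicted, actual, positive=True):
    """Return (accuracy, recall for the positive class)."""
    pairs = list(zip(predicted, actual))
    correct = sum(p == a for p, a in pairs)
    true_pos = sum(p == positive and a == positive for p, a in pairs)
    actual_pos = sum(a == positive for a in actual)
    accuracy = correct / len(actual)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# 45 messages in the rare class, 955 outside it.
actual = [True] * 45 + [False] * 955
always_no = [False] * len(actual)  # a "model" that never predicts the class

acc, rec = accuracy_and_recall(always_no, actual)
print(acc, rec)  # 0.955 0.0
```

This is one reason to look at per-class measures or ranked tables such as the deciles rather than accuracy alone.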

http://www.ugo.com/movies/mustafa-in-austin-powers

Page 26: Document Classification using the Python Natural Language Toolkit

Resources

• NLTK:
– www.nltk.org/
– http://www.nltk.org/book

• Enron email datasets:
– http://www.cs.umass.edu/~ronb/enron_dataset.html

• Free online Machine Learning course from Stanford
– http://ml-class.com/ (starts in October)

• StreamHacker blog by Jacob Perkins
– http://streamhacker.com