
Page 1: Statistical Text Categorization

Statistical Text Categorization

By Carl Sable

Page 2: Statistical Text Categorization

Text Classification Tasks

• Text Categorization (TC) - Assign text documents to pre-existing, well-defined categories.

• Clustering - Group text documents into clusters of similar documents.

• Information Retrieval (IR) - Retrieve text documents which match user query.

• Text Filtering - Retrieve documents which match a user profile.

Page 3: Statistical Text Categorization

Text Categorization

• Classify each test document by assigning pre-defined category labels.
– M-ary categorization assigns each document one of M possible labels.
– Binary categorization requires a yes/no decision for every document/category pair.

• Most techniques require training.

Page 4: Statistical Text Categorization

Early Work

• The Federalist Papers.
– Published anonymously between 1787 and 1788.
– Authorship of 12 papers in dispute (either Hamilton or Madison).

• Mosteller and Wallace, 1963.
– Compared the rate per thousand words of high-frequency words.
– Collected very strong evidence in favor of Madison.

Page 5: Statistical Text Categorization

Rocchio

• Create TF*IDF word vector for every document and category.

• For each test document, compare its word vector with those of all categories.

• Choose category with highest similarity score.

• Many possible options!
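
A minimal Python sketch of the Rocchio approach, using scikit-learn's TfidfVectorizer and cosine similarity; the toy documents, labels, and similarity measure are illustrative assumptions, not details from the slides:

```python
# Rocchio-style categorization sketch: build one TF*IDF centroid per category
# and assign each test document to the most similar centroid.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["rain floods the river valley", "the senate passed the budget bill"]
train_labels = ["Disaster", "Politics"]          # hypothetical categories
test_docs = ["heavy rain floods the river"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Category centroid = mean of the TF*IDF vectors of its training documents.
categories = sorted(set(train_labels))
centroids = np.vstack([
    np.asarray(X_train[[i for i, y in enumerate(train_labels) if y == c]].mean(axis=0))
    for c in categories
])

X_test = vectorizer.transform(test_docs)
scores = cosine_similarity(X_test, centroids)              # one row per test document
print([categories[int(row.argmax())] for row in scores])   # e.g. ['Disaster']
```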

Page 6: Statistical Text Categorization

K-Nearest Neighbors (KNN)

• Create word vector for every document.

• For each test document, compare its word vector with those of training documents.

• Select the most similar training documents.

• Use their categories (weighted) to predict category or categories of test document.
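
A minimal Python sketch of the KNN procedure above; the documents, labels, and value of k are illustrative assumptions:

```python
# KNN sketch: represent documents as TF*IDF vectors and let the k most similar
# training documents cast similarity-weighted votes for the test document's label.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["earthquake shakes coastal town",
              "parliament debates new tax law",
              "storm damages homes along the coast"]
train_labels = ["Disaster", "Politics", "Disaster"]   # hypothetical categories
test_doc, k = "coastal homes hit by storm", 2

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
sims = cosine_similarity(vectorizer.transform([test_doc]), X_train).ravel()

votes = defaultdict(float)
for i in sims.argsort()[::-1][:k]:          # indices of the k most similar documents
    votes[train_labels[i]] += sims[i]       # weight each vote by its similarity
print(max(votes, key=votes.get))            # e.g. Disaster
```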

Page 7: Statistical Text Categorization

Naïve Bayes (NB)

• Compute probabilities of seeing each word in each category (based on training data).

• For each test document, loop through words, combining probabilities.

• Can incorporate a-priori category probabilities.

• Choose the category that gives the document the highest probability.

• “Naïve” because it assumes word independence!
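
A minimal Python sketch of this Naive Bayes scheme; the add-one smoothing and toy data are assumptions, not details from the slides:

```python
# Multinomial Naive Bayes sketch: estimate P(word | category) from training
# counts, then score each category in log space, including the prior P(category).
import math
from collections import Counter, defaultdict

train = [("rain floods the river valley", "Disaster"),
         ("senate passes the budget bill", "Politics"),
         ("storm floods coastal homes", "Disaster")]     # hypothetical data
test_doc = "floods damage the valley"

word_counts, cat_counts = defaultdict(Counter), Counter()
for text, cat in train:
    cat_counts[cat] += 1
    word_counts[cat].update(text.split())
vocab = {w for c in word_counts for w in word_counts[c]}

def log_score(doc, cat):
    """log P(cat) + sum of log P(word | cat), with add-one smoothing."""
    total = sum(word_counts[cat].values())
    score = math.log(cat_counts[cat] / sum(cat_counts.values()))     # a-priori probability
    for w in doc.split():
        score += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
    return score

print(max(cat_counts, key=lambda c: log_score(test_doc, c)))         # e.g. Disaster
```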

Page 8: Statistical Text Categorization

Many Other Methods

• Support Vector Machines (SVMs).

• Neural Nets (NNets).

• Linear Least Squares Fit (LLSF).

• Decision Trees.

• Maximum Entropy.

• Boosting.

Page 9: Statistical Text Categorization

Reuters Corpus

• Common corpus for comparing methods.

• Over 10,000 articles, 90 topic categories.

• Binary categorization.

Sample documents and their topic labels:
– Document 5: grain, wheat, corn, barley, oat, sorghum
– Document 9: earn
– Document 448: gold, acq, platinum

http://www.research.att.com/~lewis/reuters21578.html

Page 10: Statistical Text Categorization

Our Corpus

• Raw data was tens of thousands of postings from Clarinet newsgroups.

• About 2000 articles had one or two associated images with captions.

• Volunteers manually labeled images or full documents based on our instructions.

Page 11: Statistical Text Categorization

Sample Image and Caption

A home along the eastern edge of Grand Forks, North Dakota lies almost completely submerged under the waters of the Red River of the North April 25. The waters of the river are beginning to recede substantially, however those homes on the eastern edge of the town fared the worst in the record flooding.

Page 12: Statistical Text Categorization

Indoor vs. Outdoor

Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21. They are clockwise from the top: Russian President Boris Yeltsin, U.S. President Bill Clinton, French President Jacques Chirac, Canadian Prime Minister Jean Chretien, Italian Prime Minister Romano Prodi, EU President Willem Kok, EC President Jacques Santer, British Prime Minister Tony Blair, Japanese Prime Minister Ryutaro Hashimoto and German Chancellor Helmut Kohl.

Villagers look at the broken tail-end of the Fokker 28 Biman Bangladesh Airlines jet December 23, a day after it crash-landed near the town of Sylhet, in northeastern Bangladesh. All 89 passengers and crew survived the accident, mostly with minor injuries. Most of the passengers were expatriate Bangladeshis returning home from London.

Page 13: Statistical Text Categorization

Event Categories

Politics, Struggle, Disaster, Crime, Other

Page 14: Statistical Text Categorization

Manual Categorization Tool

Page 15: Statistical Text Categorization

Our Columbia System

• First use Rocchio method with advanced features.

• Next apply Density Estimation.
– Often improves performance.
– Results offer a confidence measure in terms of probability.

Page 16: Statistical Text Categorization

Advanced Features

• Which words to use:
– Examine various text spans (captions, first sentences of captions, articles, etc.).
– Restrict to specific grammatical categories (all words, open class words, etc.).

• Options for disambiguating words:
– Using POS tags.

– Case sensitivity.

• Normalization.

Page 17: Statistical Text Categorization

Density Estimation

• For each test document, first use Rocchio to compute similarity to every category.

• Next find all documents from training set with similar category similarities.

• Use categories of these training documents to predict categories (with probabilities) of test document.
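
A rough sketch of this idea, assuming the per-category Rocchio similarity scores have already been computed; the fixed-radius neighborhood rule, radius value, and toy vectors are illustrative assumptions rather than the actual Columbia implementation:

```python
# Density-estimation sketch: each document is represented by its vector of
# Rocchio similarities to every category; training documents whose similarity
# vectors lie near the test document's vector vote on its label, yielding a
# probability estimate instead of a hard decision.
import numpy as np
from collections import Counter

# Hypothetical similarity vectors, one score per category: [Indoor, Outdoor].
train_sims = np.array([[0.8, 0.2], [0.7, 0.3], [0.2, 0.9], [0.3, 0.8]])
train_labels = ["Indoor", "Indoor", "Outdoor", "Outdoor"]
test_sim = np.array([0.25, 0.85])
radius = 0.2                                   # illustrative neighborhood size

dists = np.linalg.norm(train_sims - test_sim, axis=1)
neighbors = [lbl for lbl, d in zip(train_labels, dists) if d <= radius]

probs = {cat: n / len(neighbors) for cat, n in Counter(neighbors).items()}
print(probs)                                   # e.g. {'Outdoor': 1.0}
```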

Page 18: Statistical Text Categorization

Cross Validation

• Divide training set into multiple partitions of equal size (e.g. 3).

• Perform three-fold cross-validation for all possible combinations of parameters.

• Compare relative performance of various settings for parameters.

• Only best setting is applied to test set.
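
A small Python sketch of this selection loop; train_fn and eval_fn stand in for whatever training and scoring routines are being tuned and are hypothetical:

```python
# Three-fold cross-validation sketch: for every parameter setting, train on two
# partitions and validate on the third, rotating the held-out partition; keep
# only the setting with the best average validation score for the test set.
from statistics import mean

def cross_validate(train_data, settings, train_fn, eval_fn, folds=3):
    size = len(train_data) // folds
    parts = [train_data[i * size:(i + 1) * size] for i in range(folds)]
    best_setting, best_score = None, float("-inf")
    for setting in settings:
        scores = []
        for i in range(folds):
            held_out = parts[i]                                  # validation partition
            rest = [d for j, p in enumerate(parts) if j != i for d in p]
            model = train_fn(rest, setting)                      # train on the other partitions
            scores.append(eval_fn(model, held_out))
        if mean(scores) > best_score:
            best_setting, best_score = setting, mean(scores)
    return best_setting                                          # only this setting goes to the test set
```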

Page 19: Statistical Text Categorization

AT&T System

• Group words with similar “features” together into a common “bin”.

• Based on training data, empirically estimate a term weight for words in each bin.
– Smoothing: works well even if there is not enough data for individual words.
– Doesn't assume simple relationships between features.
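
A rough Python sketch of the binning idea, assuming bins are keyed on a rounded IDF value together with a per-category document count; the exact features, data, and weight estimation are illustrative assumptions, not the actual AT&T implementation:

```python
# Bin sketch: group words by coarse features so that one weight can later be
# estimated per bin, smoothing over words that are individually too rare.
import math
from collections import Counter, defaultdict

docs = [("conference opens in city hall", "Indoor"),
        ("airplane lands at the airport", "Outdoor"),
        ("earthquake hits the region", "Outdoor"),
        ("leaders meet in conference room", "Indoor")]    # hypothetical data

df, cat_count = Counter(), defaultdict(Counter)
for text, cat in docs:
    for w in set(text.split()):
        df[w] += 1                  # document frequency
        cat_count[w][cat] += 1      # per-category document count

def bin_of(word, category):
    """Bin key = (rounded IDF, # documents of this category containing the word)."""
    return (round(math.log(len(docs) / df[word]), 1), cat_count[word][category])

bins = defaultdict(list)
for w in df:
    bins[bin_of(w, "Outdoor")].append(w)
print(dict(bins))   # every word in a bin later shares one empirically estimated weight
```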

Page 20: Statistical Text Categorization

Sample Words

Indoor indicators: “conference”, “bed”
Outdoor indicators: “airplane”, “earthquake”
Ambiguous: “Gore”, “ceremony”

Page 21: Statistical Text Categorization

Determine Bins for “airplane”

• Per category bins based on IDF and category counts.

• IDF(“airplane”) = 5.4.

• Examine first half of training data:
– Appears in 0 indoor documents.
– Appears in 2 outdoor documents.

Page 22: Statistical Text Categorization

Lambdas for “airplane”

• Determined at the bin level.

• Examine second half of training data:

P(observation | indoor) = 2.11 * 10^-4
P(observation | outdoor) = 2.90 * 10^-3

λ_indoor = log2 P(observation | indoor)
λ_outdoor = log2 P(observation | outdoor)

λ_indoor − λ_outdoor = −3.78
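
A quick check of the arithmetic above, assuming the bin weight is the difference of the two log2 probabilities:

```python
# Verify the lambda for the "airplane" bin from the estimates on the slide.
import math

p_indoor = 2.11e-4      # P(observation | indoor)
p_outdoor = 2.90e-3     # P(observation | outdoor)

lam = math.log2(p_indoor) - math.log2(p_outdoor)
print(round(lam, 2))    # -3.78, matching the "airplane" score on the next slide
```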

Page 23: Statistical Text Categorization

Sample Words With Scores

Indoor indicators: “conference” +5.91, “bed” +4.58
Outdoor indicators: “airplane” −3.78, “earthquake” −4.86
Ambiguous: “Gore” +0.74, “ceremony” −0.32

Page 24: Statistical Text Categorization

Reuters Bins and Term Weights

• Did not use per-category bins.

• Lambdas based on log-likelihood estimates of two documents sharing the same categories:

λ = log2 [ P(observation | similar documents) / P(observation | nonsimilar documents) ]

• 10 closest neighbors are used to predict labels for each test document.

Page 25: Statistical Text Categorization

Reuters Lambdas

Page 26: Statistical Text Categorization

Standard Evaluation Metrics (1)

• Per-category measures:
– Simple accuracy or error measures are misleading for binary categorization.
– Precision and recall.
– F-measure, average precision, and break-even point (BEP) combine precision and recall.

• Macro-averaging vs. micro-averaging:
– Macro treats all categories equally; micro treats all documents equally.
– Macro is usually lower since small categories are hard.

Contingency table:

                 Yes is correct    No is correct
Assigned YES           a                 b
Assigned NO            c                 d

p = a / (a + b)
r = a / (a + c)
F1 = 2 * p * r / (p + r)
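
A small Python sketch of these measures, showing F1 from the contingency counts and the difference between macro- and micro-averaging; the per-category counts are invented for illustration:

```python
# Per-category F1 from contingency counts, plus macro- and micro-averaging:
# macro averages the per-category F1 scores (categories weigh equally), while
# micro pools the counts over all categories first (documents weigh equally).
def f1(a, b, c):
    """a = assigned yes and correct, b = assigned yes but wrong, c = missed yes."""
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

counts = {"earn": (900, 50, 40), "grain": (30, 10, 25), "gold": (5, 2, 8)}  # hypothetical

macro_f1 = sum(f1(*abc) for abc in counts.values()) / len(counts)
micro_f1 = f1(*(sum(col) for col in zip(*counts.values())))
print(round(macro_f1, 3), round(micro_f1, 3))   # macro comes out lower here, as is typical
```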

Page 27: Statistical Text Categorization

Results for Reuters

[Bar charts of Micro-F1 (roughly 0.70 to 0.90) and Macro-F1 (roughly 0.30 to 0.55) for the SVM, KNN, LLSF, NNet, NB, TF*IDF, Columbia, and Bin systems.]

Page 28: Statistical Text Categorization

Standard Evaluation Metrics (2)

• Mutually exclusive categories:
– Each test document has only one correct label.
– Each test document is assigned only one label.

• Performance measured by overall accuracy:

Accuracy = # correct predictions / # total predictions

Page 29: Statistical Text Categorization

Results for Indoor vs. Outdoor

[Bar chart of overall accuracy, roughly 80% to 87%, for the Bin, Columbia, SVM, TF*IDF, and KNN systems.]

• Columbia system using density estimation shows the best performance.

• Even beats SVMs.

• System using bins is very respectable.

Page 30: Statistical Text Categorization

Results for Event Categories

[Bar chart of overall accuracy, roughly 82% to 89%, for the Bin, Columbia, TF*IDF, and KNN systems.]

• System using bins shows best performance.

• Columbia system respectable.

Page 31: Statistical Text Categorization

Clustering

• Group documents into classes:
– Documents within a single class are “similar” to each other.
– Documents in different classes are not.

• Hierarchical or non-hierarchical.

• Concept of a “centroid”.

Page 32: Statistical Text Categorization

Non-hierarchical Clustering

• Methods are heuristic in nature.

• Certain decisions, e.g. similarity threshold, made in advance.

• If a document is encountered that is not similar to any existing cluster, start a new cluster.

• Sometimes number of clusters chosen in advance.
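
A minimal Python sketch of single-pass, threshold-based clustering in the spirit of these bullets; the similarity threshold and toy vectors are illustrative assumptions:

```python
# Single-pass clustering sketch: assign each document to the most similar
# existing centroid if the similarity clears a preset threshold; otherwise
# start a new cluster, then update the chosen cluster's centroid.
import numpy as np

def single_pass(vectors, threshold=0.8):
    centroids, clusters = [], []
    for v in vectors:
        sims = [np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c)) for c in centroids]
        if sims and max(sims) >= threshold:
            best = int(np.argmax(sims))
            clusters[best].append(v)
            centroids[best] = np.mean(clusters[best], axis=0)   # recompute centroid
        else:
            clusters.append([v])                                # start a new cluster
            centroids.append(v)
    return clusters

docs = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]])           # toy document vectors
print([len(c) for c in single_pass(docs)])                      # e.g. [2, 1]
```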

Page 33: Statistical Text Categorization

Hierarchical Clustering

• Start off with each document as its own cluster.

• Continuously join the two “closest” clusters.
– Various methods use different notions of distance between clusters.
– The method determines the outcome; the algorithm determines the efficiency.

• Stop when only one cluster remains.
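
A minimal sketch of agglomerative clustering using SciPy; the average-linkage method, cosine distance, and toy vectors are illustrative choices rather than details from the slides:

```python
# Agglomerative clustering sketch: every document starts as its own cluster and
# the two closest clusters are repeatedly merged; the linkage criterion is the
# notion of distance between clusters that determines the outcome.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])  # toy document vectors

Z = linkage(vectors, method="average", metric="cosine")   # full merge history
labels = fcluster(Z, t=2, criterion="maxclust")            # cut the tree into 2 clusters
print(labels)                                              # e.g. [1 1 2 2]
```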

Page 34: Statistical Text Categorization

More on Clustering

• Often used to aid information retrieval.

• For dynamic environments, a mechanism for updates is necessary.

• Evaluation is a major problem! Human judgments are often necessary.