
Page 1

An Introduction To Categorization

Soam Acharya, PhD

soamdev@yahoo.com

1/15/2003

Page 2

What is Categorization?

• {c1 … cm} set of predefined categories

• {d1 … dn} set of candidate documents
• Fill decision matrix with values {0,1} (sketched in code after the matrix below)

• Categories are symbolic labels

        d1    …    …    dn
  c1    a11   …    …    a1n
  …     …     …    …    …
  cm    am1   …    …    amn

  (aij = 1 if document dj is filed under category ci, 0 otherwise)
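A minimal Python sketch (not from the original slides) of filling such a decision matrix; the category names, document names, and the assignments stand-in for a real classifier are illustrative assumptions:

    import numpy as np

    categories = ["Home Video", "Sports", "Finance"]   # c1 … cm (illustrative)
    documents = ["d1", "d2", "d3", "d4"]               # d1 … dn (illustrative)

    # Stand-in for a real classifier's judgments: which docs go in which category.
    assignments = {"Home Video": {"d1", "d3"}, "Sports": {"d2"}, "Finance": {"d4"}}

    decision_matrix = np.zeros((len(categories), len(documents)), dtype=int)
    for i, c in enumerate(categories):
        for j, d in enumerate(documents):
            decision_matrix[i, j] = 1 if d in assignments[c] else 0

    print(decision_matrix)
    # [[1 0 1 0]
    #  [0 1 0 0]
    #  [0 0 0 1]]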

Page 3

Uses

• Document organization

• Document filtering

• Word sense disambiguation

• Web
  – Internet directories
  – Organization of search results

• Clustering

Page 4

Categorization Techniques

• Knowledge systems

• Machine Learning

Page 5

Knowledge Systems

• Manually build an expert system
  – Makes categorization judgments
  – Sequence of rules per category
  – If <boolean condition> then category
  – Example: if document contains “buena vista home entertainment” then document category is “Home Video” (see the sketch below)
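A toy Python sketch (not from the slides) of the kind of hand-written rule base a knowledge system uses; the rules and category names beyond the slide's example are illustrative assumptions:

    RULES = {
        "Home Video": lambda text: "buena vista home entertainment" in text,
        "Finance":    lambda text: "quarterly earnings" in text and "stock" in text,
    }

    def categorize(text: str) -> list[str]:
        """Return every category whose rule fires on the document."""
        text = text.lower()
        return [category for category, rule in RULES.items() if rule(text)]

    print(categorize("Buena Vista Home Entertainment announced a new DVD lineup."))
    # -> ['Home Video']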

Page 6

UltraSeek Content Classification Engine

Page 7

UltraSeek CCE

Page 8

Knowledge System Issues

• Scalability
  – Build
  – Tune

• Requires Domain Experts

• Transferability

Page 9

Machine Learning Approach

• Build a classifier for a category
  – Training set
  – Hierarchy of categories

• Submit candidate documents for automatic classification

• Expend effort on building a classifier, not on acquiring domain knowledge

Page 10

Machine Learning Process

[Diagram: documents feed into document preprocessing, whose output feeds classifier training; training also draws on a taxonomy and on a training set of documents stored in a DB.]

Page 11

Training Set

• Initial corpus can be divided into:
  – Training set
  – Test set

• Role of workflow tools

Page 12

Document Preprocessing

• Document Conversion:
  – Converts file formats (.doc, .ppt, .xls, .pdf, etc.) to text

• Tokenizing/Parsing:
  – Stemming
  – Document vectorization

• Dimension reduction

Page 13

Document Vectorization

• Convert document text into “bag of words”

• Each document is a vector of n weighted terms

Example: a document mapped to a vector of weighted terms

  Federal express   3
  Severe            3
  Mountain          2
  Exactly           1
  Simple            5
  Flight            2
  Y2000-Q3          1
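A minimal "bag of words" sketch in Python (illustrative, not the slides' code): lowercase the text, tokenize on simple word characters, and count occurrences to get a term-weight vector. The tokenizer and sample sentence are assumptions:

    import re
    from collections import Counter

    def bag_of_words(text: str) -> Counter:
        tokens = re.findall(r"[a-z0-9\-]+", text.lower())
        return Counter(tokens)

    doc = "Flight delays were severe. Federal Express rerouted the flight."
    print(bag_of_words(doc))
    # Counter({'flight': 2, 'delays': 1, 'severe': 1, 'federal': 1, 'express': 1, ...})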

Page 14

Document Vectorization

• Use the tfidf function for term weighting
• The tfidf value may be normalized
  – All vectors of equal length
  – Weights in [0,1]

  tfidf(tk, dj) = #(tk, dj) · log( |Tr| / #(tk) )

  where #(tk, dj) = number of times tk occurs in dj,
        #(tk)     = number of documents in which tk occurs at least once,
        |Tr|      = cardinality of the training set.
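A short Python sketch of the tfidf weighting defined above (assumptions: natural log, no normalization, and a toy training set that is purely illustrative):

    import math

    training_set = [
        ["flight", "delay", "severe"],
        ["flight", "ticket", "price"],
        ["mountain", "hiking", "severe", "weather"],
    ]

    def tfidf(term: str, doc_tokens: list[str], corpus: list[list[str]]) -> float:
        tf = doc_tokens.count(term)                  # #(tk, dj)
        df = sum(1 for d in corpus if term in d)     # #(tk)
        if tf == 0 or df == 0:
            return 0.0
        return tf * math.log(len(corpus) / df)       # #(tk, dj) · log(|Tr| / #(tk))

    doc = ["flight", "flight", "severe"]
    print(tfidf("flight", doc, training_set))    # 2 · log(3/2) ≈ 0.81
    print(tfidf("mountain", doc, training_set))  # 0.0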

Page 15

Dimension Reduction

• Reduce the dimensionality of the vector space
• Why?
  – Reduce computational complexity
  – Address the “overfitting” problem
      • Overtuning the classifier
• How?
  – Feature selection
  – Feature extraction

Page 16

Feature Selection

• Also known as “term space reduction”
• Remove “stop” words
• Identify the “best” words to be used in categorizing per topic (see the sketch below)
  – Document frequency of terms
      • Keep terms that occur in the highest number of documents
  – Other measures
      • Chi square
      • Information gain
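A sketch of feature selection by document frequency (not from the slides): drop stop words, then keep the terms that occur in the most training documents. The stop-word list and corpus are illustrative assumptions:

    from collections import Counter

    STOP_WORDS = {"the", "a", "of", "and", "to", "in"}

    corpus = [
        {"flight", "delay", "severe", "the"},
        {"flight", "ticket", "price", "a"},
        {"mountain", "severe", "weather", "the"},
    ]

    def select_by_document_frequency(docs: list[set[str]], k: int) -> list[str]:
        df = Counter()
        for doc in docs:
            df.update(term for term in doc if term not in STOP_WORDS)
        return [term for term, _ in df.most_common(k)]

    print(select_by_document_frequency(corpus, k=3))
    # e.g. ['flight', 'severe', ...] -- the terms appearing in the most documents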

Page 17

Feature Extraction

• Synthesize new features from existing features

• Term clustering
  – Use clusters/centroids instead of terms
  – Co-occurrence and co-absence
• Latent Semantic Indexing (see the sketch below)
  – Compresses vectors into a lower dimensional space
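A sketch of LSI-style dimension reduction (an assumption of how it might be done, not the slides' implementation): a truncated SVD of the term-document matrix projects documents into a k-dimensional latent space. The toy matrix is illustrative:

    import numpy as np

    # Toy term-document matrix: rows = terms, columns = documents (illustrative).
    A = np.array([
        [3.0, 0.0, 1.0, 0.0],
        [0.0, 2.0, 0.0, 1.0],
        [1.0, 0.0, 2.0, 0.0],
        [0.0, 1.0, 0.0, 3.0],
    ])

    k = 2  # target dimensionality
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # each row: a document in k dims

    print(doc_vectors.shape)  # (4 documents, 2 latent dimensions)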

Page 18

Creating a Classifier

• Define a function, the Categorization Status Value (CSV), such that for a document d:
  – CSVi: D -> [0,1]
  – Confidence that d belongs in ci
• The CSV can be based on:
  – Boolean decisions
  – Probability
  – Vector distance

Page 19

Creating a Classifier

• Define a threshold, thresh, such that if CSVi(d) > thresh(i), then d is categorized under ci; otherwise it is not (see the sketch below)
• CSV thresholding
  – Fixed value across all categories
  – Varied per category
• Optimize via testing
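A small Python sketch of CSV thresholding (illustrative only): per-category thresholds turn soft CSV scores into the {0,1} entries of the decision matrix. The scores and thresholds below are made-up assumptions:

    csv_scores = {            # CSVi(d) for one document d
        "Home Video": 0.82,
        "Sports":     0.35,
        "Finance":    0.61,
    }
    thresholds = {            # thresh(i), tuned per category via testing
        "Home Video": 0.70,
        "Sports":     0.50,
        "Finance":    0.65,
    }

    decisions = {c: int(csv_scores[c] > thresholds[c]) for c in csv_scores}
    print(decisions)  # {'Home Video': 1, 'Sports': 0, 'Finance': 0}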

Page 20

Naïve Bayes Classifier

  P(ci | dj) = P(ci) · P(dj | ci) / P(dj)

  – P(ci | dj): probability of doc dj belonging in category ci
  – P(dj | ci): estimated from the training set terms/weights present in dj

Page 21

Naïve Bayes Classifier

If wkj is binary (0, 1) and pki is short for P(wkx = 1 | ci), then under the independence assumption

  P(ci | dj) ∝ P(ci) · Πk [ pki^wkj · (1 - pki)^(1 - wkj) ]

After further derivation (taking logs), the original equation looks like:

  log P(ci | dj) = Σk wkj · log[ pki / (1 - pki) ] + Σk log(1 - pki) + log P(ci) - log P(dj)

• The first sum depends on which terms appear in dj and can be used for CSVi(dj)
• Σk log(1 - pki) and log P(ci) are constants for all docs; log P(dj) is the same for every category
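A sketch of the binary Naïve Bayes CSV above (illustrative: the vocabulary, probabilities pki, and prior are assumptions, and smoothing is ignored):

    import math

    vocabulary = ["flight", "severe", "mountain", "price"]
    p_ki = {"flight": 0.6, "severe": 0.4, "mountain": 0.1, "price": 0.2}  # P(wk = 1 | ci)
    prior_ci = 0.3                                                        # P(ci)

    def csv_naive_bayes(doc_terms: set[str]) -> float:
        w = {t: (1 if t in doc_terms else 0) for t in vocabulary}   # binary wkj
        doc_part = sum(w[t] * math.log(p_ki[t] / (1 - p_ki[t])) for t in vocabulary)
        const_part = sum(math.log(1 - p_ki[t]) for t in vocabulary) + math.log(prior_ci)
        return doc_part + const_part   # log P(ci | dj), up to the category-independent log P(dj)

    print(csv_naive_bayes({"flight", "severe"}))  # ≈ -2.96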

Page 22

Naïve Bayes Classifier

• Independence assumption

• Feature selection can be counterproductive

Page 23

k-NN Classifier

• Compute closeness between candidate documents and category documents

  CSVi(dj) = Σ over the k nearest training documents dz of  RSV(dj, dz) · aiz

  – RSV(dj, dz): similarity between dj and training set document dz
  – aiz: confidence score indicating whether dz belongs to category ci

Page 24

k-NN Classifier

• k nearest neighbors (see the sketch below)
  – Find the k nearest neighbors among all training documents and use their categories
  – k can also indicate the number of top-ranked training documents per category to compare against
• Similarity computation can be:
  – Inner product
  – Cosine coefficient
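A sketch of the k-NN CSV above (the training data is an illustrative assumption; cosine similarity is chosen here as the RSV):

    import math

    def cosine(u: dict, v: dict) -> float:
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    training_set = [                       # (document vector, category)
        ({"flight": 2, "delay": 1}, "Travel"),
        ({"flight": 1, "price": 2}, "Travel"),
        ({"stock": 3, "price": 1}, "Finance"),
    ]

    def csv_knn(doc: dict, category: str, k: int = 2) -> float:
        sims = sorted(((cosine(doc, dz), ci) for dz, ci in training_set), reverse=True)
        return sum(sim for sim, ci in sims[:k] if ci == category)

    doc = {"flight": 1, "delay": 2}
    print(csv_knn(doc, "Travel"), csv_knn(doc, "Finance"))  # 1.0 0.0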

Page 25

Support Vector Machines

• “decision surface” that best separates data points in two classes

• Support vectors are the training docs that best define hyperplane

[Figure: data points from two classes separated by the optimal hyperplane; the margin between the classes is maximized and the support vectors lie on its boundary]

Page 26

Support Vector Machines

• Training process involves finding the support vectors

• Only care about support vectors in the training set, not other documents
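A sketch of training an SVM text classifier; scikit-learn is an assumption used here for illustration (it is not mentioned in the original slides), and the tiny training set is made up:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    train_docs = [
        "buena vista home entertainment releases new dvd",
        "home video sales climb this quarter",
        "quarterly earnings beat stock forecasts",
        "investors watch the stock market closely",
    ]
    train_labels = ["Home Video", "Home Video", "Finance", "Finance"]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(train_docs)      # tfidf document vectors
    classifier = LinearSVC()
    classifier.fit(X, train_labels)               # finds the separating hyperplane

    test = vectorizer.transform(["new dvd box set for home video fans"])
    print(classifier.predict(test))               # e.g. ['Home Video']
    print(classifier.decision_function(test))     # signed distance: usable as a CSV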

Page 27

Neural Networks

• Train a net to learn a mapping from input words to a category
• One neural net per category
  – Too expensive
• One network overall
• Perceptron approach without a hidden layer
• Three-layered network

Page 28

Classifier Committees

• Combine multiple classifiers

• Majority voting

• Category specialization

• Mixed results

Page 29

Classification Performance

• Category ranking evaluation
  – Recall = (categories found and correct) / (total categories correct)
  – Precision = (categories found and correct) / (total categories found)
• Micro and macro averaging over categories
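A short sketch of precision/recall with micro and macro averaging (the per-category counts are illustrative assumptions, not data from the slides):

    per_category = {          # category: (true positives, false positives, false negatives)
        "Home Video": (8, 2, 1),
        "Finance":    (5, 5, 3),
        "Sports":     (1, 0, 4),
    }

    def precision(tp, fp): return tp / (tp + fp) if tp + fp else 0.0
    def recall(tp, fn):    return tp / (tp + fn) if tp + fn else 0.0

    # Macro: average the per-category scores.
    macro_p = sum(precision(tp, fp) for tp, fp, _ in per_category.values()) / len(per_category)
    macro_r = sum(recall(tp, fn) for tp, _, fn in per_category.values()) / len(per_category)

    # Micro: pool the counts across categories, then compute once.
    TP = sum(tp for tp, _, _ in per_category.values())
    FP = sum(fp for _, fp, _ in per_category.values())
    FN = sum(fn for _, _, fn in per_category.values())
    print(macro_p, macro_r, precision(TP, FP), recall(TP, FN))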

Page 30

Classification Performance

• Hard

• Two studies
  – Yiming Yang, 1997
  – Yiming Yang and Xin Liu, 1999

• SVM, kNN >> Neural Net > Naïve Bayes

• Performance converges for common categories (with many training docs)

Page 31

Computational Bottlenecks

• Quiver
  – # of topics
  – # of training documents
  – # of candidate documents

Page 32

Categorization and the Internet

• Classification as a service
  – Standardizing vocabulary
  – Confidentiality
  – Performance
• Use of hypertext in categorization
  – Augment existing classifiers to take advantage of it

Page 33

Hypertext and Categorization

• An already categorized document links to documents within same category

• Neighboring documents in a similar category

• Hierarchical nature of categories

• Metatags

Page 34

Augmenting Classifiers

• Inject anchor text pointing to a document into that document
  – Treat anchor text as separate terms
• Depends on the dataset
• Mixed experimental results
• Links may be noisy
  – Ads
  – Navigation

Page 35

Topics and the Web

• Topic distillation
  – Analysis of hyperlink graph structure
• Authorities
  – Popular pages
• Hubs
  – Link to authorities

[Figure: hubs on one side of the link graph pointing to authorities on the other]

Page 36

Topic Distillation

• Kleinberg’s HITS algorithm (see the sketch below)
• Start with an initial set of pages: the root set
  – Use this to create an expanded set
• Weight propagation phase
  – Each node gets an authority score and a hub score
  – Alternate the updates:
      • Authority = sum of the current hub weights of all nodes pointing to it
      • Hub = sum of the authority scores of all pages it points to
  – Normalize node scores and iterate until convergence
• Output is a set of hubs and authorities
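A sketch of the HITS weight-propagation loop described above; the toy link graph (the expanded set) is an illustrative assumption:

    import math

    links = {                      # page -> pages it points to
        "a": ["c", "d"],
        "b": ["c"],
        "c": ["d"],
        "d": [],
    }

    hub = {p: 1.0 for p in links}
    auth = {p: 1.0 for p in links}

    for _ in range(50):            # iterate until (approximate) convergence
        # Authority = sum of hub weights of the pages pointing to the node.
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
        # Hub = sum of authority scores of the pages the node points to.
        hub = {p: sum(auth[q] for q in links[p]) for p in links}
        # Normalize both score vectors.
        na = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}

    print(sorted(auth, key=auth.get, reverse=True))  # pages ranked as authorities
    print(sorted(hub, key=hub.get, reverse=True))    # pages ranked as hubs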

Page 37

Conclusion

• Why Classify?

• The Classification Process

• Various Classifiers

• Which ones are better?

• Other applications