What is Categorization?
• {c1 … cm} set of predefined categories
• {d1 … dn} set of candidate documents
• Fill decision matrix with values {0,1}
• Categories are symbolic labels
        d1    …    …    dn
  c1   a11    …    …    a1n
  …     …     …    …     …
  cm   am1    …    …    amn
Uses
• Document organization
• Document filtering
• Word sense disambiguation
• Web
  – Internet directories
  – Organization of search results
• Clustering
Categorization Techniques
• Knowledge systems
• Machine Learning
Knowledge Systems
• Manually build an expert system
  – Makes categorization judgments
  – Sequence of rules per category
  – If <boolean condition> then category
  – If document contains “buena vista home entertainment” then document category is “Home Video”
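A minimal sketch of what such hand-written rules might look like in code; the rules and the sample document are illustrative, not taken from a real system:

    # Hand-built rules of the form "if <boolean condition> then category".
    RULES = [
        (lambda text: "buena vista home entertainment" in text, "Home Video"),
        (lambda text: "mutual fund" in text and "prospectus" in text, "Finance"),
    ]

    def categorize(text):
        """Return every category whose rule fires on the document."""
        text = text.lower()
        return [category for condition, category in RULES if condition(text)]

    print(categorize("Buena Vista Home Entertainment announced a new DVD line."))
    # ['Home Video']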
UltraSeek Content Classification Engine
Knowledge System Issues
• Scalability
  – Build
  – Tune
• Requires Domain Experts
• Transferability
Machine Learning Approach
• Build a classifier for a category
  – Training set
  – Hierarchy of categories
• Submit candidate documents for automatic classification
• Expend effort in building a classifier, not in knowing the knowledge domain
Machine Learning Process
[Diagram: documents flow through document pre-processing into classifier training, which draws on a taxonomy and a DB of training-set documents]
Training Set
• Initial corpus can be divided into:
  – Training set
  – Test set
• Role of workflow tools
Document Preprocessing
• Document Conversion:
  – Converts file formats (.doc, .ppt, .xls, .pdf, etc.) to text
• Tokenizing/Parsing:
  – Stemming
  – Document vectorization
• Dimension reduction
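A minimal preprocessing sketch covering tokenizing, stop-word removal, and stemming; the stop-word list is illustrative and the suffix-stripper is a toy stand-in for a real stemmer such as Porter's:

    import re

    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

    def crude_stem(token):
        """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
        for suffix in ("ing", "ed", "es", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def preprocess(text):
        tokens = re.findall(r"[a-z0-9']+", text.lower())     # tokenize
        tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
        return [crude_stem(t) for t in tokens]               # stem

    print(preprocess("The flights were severely delayed in the mountains."))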
Document Vectorization
• Convert document text into “bag of words”
• Each document is a vector of n weighted terms
Example – a document as a vector of weighted terms:

    Federal express   3
    Severe            3
    Mountain          2
    Exactly           1
    Simple            5
    Flight            2
    Y2000-Q3          1
Document Vectorization
• Use the tfidf function for term weighting
• The tfidf value may be normalized
  – All vectors of equal length
  – Values in [0,1]

tfidf(tk, dj) = #(tk, dj) · log( |Tr| / #(tk) )

  where  #(tk, dj) = number of times tk occurs in dj
         #(tk)     = number of documents where tk occurs at least once
         |Tr|      = cardinality of the training set
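A minimal sketch of the tfidf formula above, with length normalization; the corpus is illustrative:

    import math
    from collections import Counter

    training_set = [
        ["flight", "severe", "mountain", "flight"],
        ["flight", "simple", "exactly"],
        ["simple", "simple", "severe"],
    ]

    def doc_freq(term, corpus):
        """#(tk): number of documents in which the term occurs at least once."""
        return sum(1 for doc in corpus if term in doc)

    def tfidf_vector(doc, corpus):
        counts = Counter(doc)                                # #(tk, dj)
        vec = {t: counts[t] * math.log(len(corpus) / doc_freq(t, corpus))
               for t in counts}
        norm = math.sqrt(sum(w * w for w in vec.values()))   # normalize length
        return {t: w / norm for t, w in vec.items()} if norm else vec

    print(tfidf_vector(training_set[0], training_set))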
Dimension Reduction
• Reduce dimensionality of the vector space
• Why?
  – Reduce computational complexity
  – Address the “overfitting” problem (overtuning the classifier)
• How?
  – Feature selection
  – Feature extraction
Feature Selection
• Also known as “term space reduction”
• Remove “stop” words
• Identify the “best” words to be used in categorizing per topic
  – Document frequency of terms
    • Keep terms that occur in the highest number of documents
  – Other measures
    • Chi square
    • Information gain
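A minimal sketch of document-frequency term selection as described above; the corpus and cutoff are illustrative:

    from collections import Counter

    def select_by_doc_freq(corpus, keep):
        """Return the `keep` terms with the highest document frequency."""
        df = Counter()
        for doc in corpus:
            df.update(set(doc))        # count each term once per document
        return [term for term, _ in df.most_common(keep)]

    corpus = [["flight", "severe"], ["flight", "simple"],
              ["simple", "severe", "flight"]]
    print(select_by_doc_freq(corpus, keep=2))   # e.g. ['flight', ...]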
Feature Extraction
• Synthesize new features from existing features
• Term clustering
  – Use clusters/centroids instead of terms
  – Based on co-occurrence and co-absence
• Latent Semantic Indexing
  – Compresses vectors into a lower-dimensional space
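A minimal LSI-style sketch: a truncated SVD (via NumPy) compresses term-document vectors into a k-dimensional space; the count matrix and k are illustrative:

    import numpy as np

    A = np.array([[3.0, 0.0, 1.0],      # term-by-document count matrix
                  [2.0, 1.0, 0.0],
                  [0.0, 2.0, 2.0],
                  [1.0, 1.0, 1.0]])

    k = 2                               # target dimensionality
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    docs_k = np.diag(s[:k]) @ Vt[:k]    # documents as k-dimensional vectors

    print(docs_k.T)                     # one k-dim row per document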
Creating a Classifier
• Define a function, the Categorization Status Value (CSV), such that for a document d:
  – CSVi: D -> [0,1]
  – Confidence that d belongs in ci
• The CSV may be based on:
  – A Boolean value
  – A probability
  – A vector distance
Creating a Classifier
• Define a threshold, thresh, such that if CSVi(d) > thresh(i), then d is categorized under ci; otherwise it is not
• CSV thresholding
  – Fixed value across all categories
  – Varied per category
• Optimize via testing
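A minimal sketch of per-category CSV thresholding; the scores and thresholds are illustrative:

    def categories_for(doc_scores, thresholds):
        """Assign every category ci whose CSVi(d) exceeds thresh(i)."""
        return [c for c, csv in doc_scores.items() if csv > thresholds[c]]

    scores = {"Home Video": 0.82, "Finance": 0.40, "Travel": 0.55}
    thresh = {"Home Video": 0.70, "Finance": 0.60, "Travel": 0.50}
    print(categories_for(scores, thresh))   # ['Home Video', 'Travel']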
Naïve Bayes Classifier
P(ci | dj) = P(ci) · P(dj | ci) / P(dj)

  – P(ci | dj): probability of doc dj belonging in category ci
  – Training set terms/weights present in dj are used to calculate the probability of dj belonging to ci
Naïve Bayes Classifier
• If wkj is binary (0, 1) and pki is short for P(wkx = 1 | ci), then after further derivation the original equation becomes:

log P(ci | dj) = Σk wkj · log[ pki / (1 − pki) ] + Σk log(1 − pki) + log P(ci) − log P(dj)

  – The first sum can be used for the CSV
  – The remaining terms are constants for all docs
Naïve Bayes Classifier
• Independence assumption: term occurrences are assumed independent of one another given the category
• Feature selection can be counterproductive
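A minimal Bernoulli naïve Bayes sketch following the binary derivation above; the training data is illustrative, and Laplace smoothing is added so no pki is exactly 0 or 1:

    import math
    from collections import defaultdict

    def train(docs, labels):
        """docs: list of term sets; labels: parallel list of category names."""
        vocab = set().union(*docs)
        prior, p = {}, defaultdict(dict)
        for c in set(labels):
            in_c = [d for d, l in zip(docs, labels) if l == c]
            prior[c] = len(in_c) / len(docs)
            for t in vocab:
                df = sum(1 for d in in_c if t in d)
                p[c][t] = (df + 1) / (len(in_c) + 2)   # Laplace-smoothed pki
        return prior, p, vocab

    def csv_scores(doc, prior, p, vocab):
        scores = {}
        for c in prior:
            s = math.log(prior[c])
            for t in vocab:                 # wk is 1 iff term t occurs in doc
                s += math.log(p[c][t] if t in doc else 1 - p[c][t])
            scores[c] = s
        return scores

    docs = [{"flight", "mountain"}, {"flight", "severe"}, {"fund", "simple"}]
    prior, p, vocab = train(docs, ["travel", "travel", "finance"])
    scores = csv_scores({"flight"}, prior, p, vocab)
    print(max(scores, key=scores.get))      # 'travel'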
k-NN Classifier
• Compute closeness between candidate documents and category documents

CSVi(dj) = Σ sim(dj, dz) · CSVi(dz)    (sum over the k nearest training documents dz)

  – sim(dj, dz): similarity between dj and training set document dz
  – CSVi(dz): confidence score indicating whether dz belongs to category ci
k-NN Classifier
• k nearest neighbors
  – Find the k nearest neighbors among all training documents and use their categories
  – k can also indicate the number of top-ranked training documents per category to compare against
• Similarity computation can use:
  – Inner product
  – Cosine coefficient
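A minimal k-NN sketch using the cosine coefficient; the training vectors and k are illustrative:

    import math
    from collections import defaultdict

    def cosine(u, v):
        dot = sum(u.get(t, 0) * w for t, w in v.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def knn_csv(doc, training, k=3):
        """training: list of (vector, category). Returns a CSV per category."""
        nearest = sorted(training, key=lambda dz: cosine(doc, dz[0]),
                         reverse=True)[:k]
        csv = defaultdict(float)
        for vec, cat in nearest:
            csv[cat] += cosine(doc, vec)    # weight each vote by similarity
        return dict(csv)

    training = [({"flight": 1.0, "severe": 0.5}, "travel"),
                ({"mountain": 1.0}, "travel"),
                ({"fund": 1.0, "simple": 0.3}, "finance")]
    print(knn_csv({"flight": 1.0}, training, k=2))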
Support Vector Machines
• Finds the “decision surface” (hyperplane) that best separates the data points of two classes
• Support vectors are the training docs that best define hyperplane
[Diagram: the optimal hyperplane maximizes the margin between the two classes]
Support Vector Machines
• Training process involves finding the support vectors
• Only care about support vectors in the training set, not other documents
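A minimal sketch of SVM-based text categorization, assuming scikit-learn is available; the training data is illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    train_texts = ["severe flight delays in the mountains",
                   "flight schedule changes announced",
                   "mutual fund prospectus released",
                   "quarterly fund performance report"]
    train_labels = ["travel", "travel", "finance", "finance"]

    vectorizer = TfidfVectorizer()              # tf-idf document vectors
    X = vectorizer.fit_transform(train_texts)
    clf = LinearSVC().fit(X, train_labels)      # finds the separating hyperplane

    print(clf.predict(vectorizer.transform(["new fund prospectus"])))
    # ['finance']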
Neural Networks
• Train net to learn from a mapping of input words to a category
• One neural net per category
  – Too expensive
• One network overall
• Perceptron approach without a hidden layer
• Three-layered network
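A minimal perceptron sketch (no hidden layer) for a single category; the term vectors and learning rate are illustrative:

    import numpy as np

    X = np.array([[1, 1, 0, 0],     # rows: documents, cols: term weights
                  [1, 0, 0, 0],
                  [0, 0, 1, 1],
                  [0, 0, 1, 0]], dtype=float)
    y = np.array([1, 1, 0, 0])      # 1 = in category, 0 = not

    w, b, lr = np.zeros(X.shape[1]), 0.0, 0.1
    for _ in range(20):                        # a few epochs over the set
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            w += lr * (yi - pred) * xi         # perceptron update rule
            b += lr * (yi - pred)

    print([(1 if xi @ w + b > 0 else 0) for xi in X])   # should match y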
Classifier Committees
• Combine multiple classifiers
• Majority voting
• Category specialization
• Mixed results
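A minimal majority-voting sketch; the committee members' predictions are illustrative:

    from collections import Counter

    def majority_vote(predictions):
        """predictions: one predicted category per committee member."""
        return Counter(predictions).most_common(1)[0][0]

    print(majority_vote(["travel", "finance", "travel"]))   # 'travel'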
Classification Performance
• Category ranking evaluation
  – Recall = (categories found and correct) / (total categories correct)
  – Precision = (categories found and correct) / (total categories found)
• Micro and macro averaging over categories
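A minimal sketch of per-category precision/recall with micro and macro averaging; the counts are illustrative:

    # per category: (found and correct, total found, total correct)
    counts = {"travel": (8, 10, 12), "finance": (3, 4, 9)}

    def prf(tp, found, correct):
        return tp / found, tp / correct            # precision, recall

    macro_p = sum(prf(*c)[0] for c in counts.values()) / len(counts)
    macro_r = sum(prf(*c)[1] for c in counts.values()) / len(counts)

    tp = sum(c[0] for c in counts.values())        # micro: pool counts first
    micro_p = tp / sum(c[1] for c in counts.values())
    micro_r = tp / sum(c[2] for c in counts.values())

    print(f"macro P={macro_p:.2f} R={macro_r:.2f}  "
          f"micro P={micro_p:.2f} R={micro_r:.2f}")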
Classification Performance
• Hard to evaluate and compare
• Two studies
  – Yiming Yang, 1997
  – Yiming Yang and Xin Liu, 1999
• SVM, kNN >> Neural Net > Naïve Bayes
• Performance converges for common categories (with many training docs)
Computational Bottlenecks
• Quiver
  – # of topics
  – # of training documents
  – # of candidate documents
Categorization and the Internet
• Classification as a service
  – Standardizing vocabulary
  – Confidentiality
  – Performance
• Use of hypertext in categorization
  – Augment existing classifiers to take advantage of link structure
Hypertext and Categorization
• An already-categorized document links to documents within the same category
• Neighboring documents in a similar category
• Hierarchical nature of categories
• Metatags
Augmenting Classifiers
• Inject the anchor text pointing at a document into that document
  – Treat anchor text as separate terms
• Depends on the dataset
• Mixed experimental results
• Links may be noisy
  – Ads
  – Navigation
Topics and the Web
• Topic distillation
  – Analysis of the hyperlink graph structure
• Authorities – popular pages
• Hubs – pages that link to authorities

[Diagram: hubs pointing to authorities]
Topic Distillation
• Kleinberg’s HITS algorithm
• Start from an initial set of pages: the root set
  – Use this to create an expanded set
• Weight propagation phase
  – Each node gets an authority score and a hub score
  – Alternate the updates:
    • Authority = sum of the current hub weights of all nodes pointing to it
    • Hub = sum of the authority scores of all pages it points to
  – Normalize node scores and iterate until convergence
• Output is a set of hubs and authorities
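A minimal sketch of the HITS weight-propagation phase on a toy link graph; the graph is illustrative:

    import math

    graph = {"p1": ["a1", "a2"],     # node -> pages it links to
             "p2": ["a1"],
             "p3": ["a1", "a2"],
             "a1": [], "a2": []}

    hub = {n: 1.0 for n in graph}
    auth = {n: 1.0 for n in graph}

    for _ in range(20):                            # iterate to convergence
        for n in graph:                            # authority = sum of hub
            auth[n] = sum(hub[m] for m in graph if n in graph[m])
        for n in graph:                            # hub = sum of authority
            hub[n] = sum(auth[m] for m in graph[n])
        for scores in (auth, hub):                 # normalize node scores
            norm = math.sqrt(sum(s * s for s in scores.values()))
            for n in scores:
                scores[n] /= norm or 1.0

    print(sorted(auth, key=auth.get, reverse=True)[:2])   # top authorities
    print(sorted(hub, key=hub.get, reverse=True)[:2])     # top hubs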
Conclusion
• Why Classify?
• The Classification Process
• Various Classifiers
• Which ones are better?
• Other applications