What is Categorization?
• {c1 … cm} set of predefined categories
• {d1 … dn} set of candidate documents
• Fill decision matrix with values {0,1}
• Categories are symbolic labels
        d1    …    …    dn
  c1   a11    …    …    a1n
  …     …     …    …     …
  cm   am1    …    …    amn
Uses
• Document organization
• Document filtering
• Word sense disambiguation
• Web
  – Internet directories
  – Organization of search results
• Clustering
Categorization Techniques
• Knowledge systems
• Machine Learning
Knowledge Systems
• Manually build an expert system
  – Makes categorization judgments
  – Sequence of rules per category
  – If <boolean condition> then category
  – If document contains “buena vista home entertainment” then document category is “Home Video”
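A minimal sketch of what such hand-written rules might look like in code; the rules and the sample document are illustrative, not taken from a real system:

    # Hand-built rules of the form "if <boolean condition> then category".
    RULES = [
        (lambda text: "buena vista home entertainment" in text, "Home Video"),
        (lambda text: "mutual fund" in text and "prospectus" in text, "Finance"),
    ]

    def categorize(text):
        """Return every category whose rule fires on the document."""
        text = text.lower()
        return [category for condition, category in RULES if condition(text)]

    print(categorize("Buena Vista Home Entertainment announced a new DVD line."))
    # ['Home Video']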
UltraSeek Content Classification Engine
Knowledge System Issues
• Scalability
  – Build
  – Tune
• Requires Domain Experts
• Transferability
Machine Learning Approach
• Build a classifier for a category
  – Training set
  – Hierarchy of categories
• Submit candidate documents for automatic classification
• Expend effort in building a classifier, not in knowing the knowledge domain
Machine Learning Process
[Diagram: documents flow through document pre-processing into classifier training, which draws on a taxonomy and a DB of training-set documents]
Training Set
• Initial corpus can be divided into:
  – Training set
  – Test set
• Role of workflow tools
Document Preprocessing
• Document Conversion:
  – Converts file formats (.doc, .ppt, .xls, .pdf, etc.) to text
• Tokenizing/Parsing:
  – Stemming
  – Document vectorization
• Dimension reduction
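A minimal preprocessing sketch covering tokenizing, stop-word removal, and stemming; the stop-word list is illustrative and the suffix-stripper is a toy stand-in for a real stemmer such as Porter's:

    import re

    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

    def crude_stem(token):
        """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
        for suffix in ("ing", "ed", "es", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def preprocess(text):
        tokens = re.findall(r"[a-z0-9']+", text.lower())     # tokenize
        tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
        return [crude_stem(t) for t in tokens]               # stem

    print(preprocess("The flights were severely delayed in the mountains."))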
Document Vectorization
• Convert document text into “bag of words”
• Each document is a vector of n weighted terms
Example – a document as a vector of weighted terms:

    Federal express   3
    Severe            3
    Mountain          2
    Exactly           1
    Simple            5
    Flight            2
    Y2000-Q3          1
Document Vectorization
• Use the tfidf function for term weighting
• The tfidf value may be normalized
  – All vectors of equal length
  – Values in [0,1]

tfidf(tk, dj) = #(tk, dj) · log( |Tr| / #(tk) )

  where  #(tk, dj) = number of times tk occurs in dj
         #(tk)     = number of documents where tk occurs at least once
         |Tr|      = cardinality of the training set
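A minimal sketch of the tfidf formula above, with length normalization; the corpus is illustrative:

    import math
    from collections import Counter

    training_set = [
        ["flight", "severe", "mountain", "flight"],
        ["flight", "simple", "exactly"],
        ["simple", "simple", "severe"],
    ]

    def doc_freq(term, corpus):
        """#(tk): number of documents in which the term occurs at least once."""
        return sum(1 for doc in corpus if term in doc)

    def tfidf_vector(doc, corpus):
        counts = Counter(doc)                                # #(tk, dj)
        vec = {t: counts[t] * math.log(len(corpus) / doc_freq(t, corpus))
               for t in counts}
        norm = math.sqrt(sum(w * w for w in vec.values()))   # normalize length
        return {t: w / norm for t, w in vec.items()} if norm else vec

    print(tfidf_vector(training_set[0], training_set))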
Dimension Reduction
• Reduce dimensionality of the vector space
• Why?
  – Reduce computational complexity
  – Address the “overfitting” problem (overtuning the classifier)
• How?
  – Feature selection
  – Feature extraction
Feature Selection
• Also known as “term space reduction”
• Remove “stop” words
• Identify the “best” words to be used in categorizing per topic
  – Document frequency of terms
    • Keep terms that occur in the highest number of documents
  – Other measures
    • Chi square
    • Information gain
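A minimal sketch of document-frequency term selection as described above; the corpus and cutoff are illustrative:

    from collections import Counter

    def select_by_doc_freq(corpus, keep):
        """Return the `keep` terms with the highest document frequency."""
        df = Counter()
        for doc in corpus:
            df.update(set(doc))        # count each term once per document
        return [term for term, _ in df.most_common(keep)]

    corpus = [["flight", "severe"], ["flight", "simple"],
              ["simple", "severe", "flight"]]
    print(select_by_doc_freq(corpus, keep=2))   # e.g. ['flight', ...]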
Feature Extraction
• Synthesize new features from existing features
• Term clustering
  – Use clusters/centroids instead of terms
  – Based on co-occurrence and co-absence
• Latent Semantic Indexing
  – Compresses vectors into a lower-dimensional space
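A minimal LSI-style sketch: a truncated SVD (via NumPy) compresses term-document vectors into a k-dimensional space; the count matrix and k are illustrative:

    import numpy as np

    A = np.array([[3.0, 0.0, 1.0],      # term-by-document count matrix
                  [2.0, 1.0, 0.0],
                  [0.0, 2.0, 2.0],
                  [1.0, 1.0, 1.0]])

    k = 2                               # target dimensionality
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    docs_k = np.diag(s[:k]) @ Vt[:k]    # documents as k-dimensional vectors

    print(docs_k.T)                     # one k-dim row per document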
Creating a Classifier
• Define a function, the Categorization Status Value (CSV), such that for a document d:
  – CSVi: D -> [0,1]
  – Confidence that d belongs in ci
• The CSV may be based on:
  – A Boolean value
  – A probability
  – A vector distance
Creating a Classifier
• Define a threshold, thresh, such that if CSVi(d) > thresh(i), then d is categorized under ci; otherwise it is not
• CSV thresholding
  – Fixed value across all categories
  – Varied per category
• Optimize via testing
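A minimal sketch of per-category CSV thresholding; the scores and thresholds are illustrative:

    def categories_for(doc_scores, thresholds):
        """Assign every category ci whose CSVi(d) exceeds thresh(i)."""
        return [c for c, csv in doc_scores.items() if csv > thresholds[c]]

    scores = {"Home Video": 0.82, "Finance": 0.40, "Travel": 0.55}
    thresh = {"Home Video": 0.70, "Finance": 0.60, "Travel": 0.50}
    print(categories_for(scores, thresh))   # ['Home Video', 'Travel']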
Naïve Bayes Classifier
P(ci | dj) = P(ci) · P(dj | ci) / P(dj)

  – P(ci | dj): probability of doc dj belonging in category ci
  – Training set terms/weights present in dj are used to calculate the probability of dj belonging to ci
Naïve Bayes Classifier
• If wkj is binary (0, 1) and pki is short for P(wkx = 1 | ci), then after further derivation the original equation becomes:

log P(ci | dj) = Σk wkj · log[ pki / (1 − pki) ] + Σk log(1 − pki) + log P(ci) − log P(dj)

  – The first sum can be used for the CSV
  – The remaining terms are constants for all docs
Naïve Bayes Classifier
• Independence assumption: term occurrences are assumed independent of one another given the category
• Feature selection can be counterproductive
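A minimal Bernoulli naïve Bayes sketch following the binary derivation above; the training data is illustrative, and Laplace smoothing is added so no pki is exactly 0 or 1:

    import math
    from collections import defaultdict

    def train(docs, labels):
        """docs: list of term sets; labels: parallel list of category names."""
        vocab = set().union(*docs)
        prior, p = {}, defaultdict(dict)
        for c in set(labels):
            in_c = [d for d, l in zip(docs, labels) if l == c]
            prior[c] = len(in_c) / len(docs)
            for t in vocab:
                df = sum(1 for d in in_c if t in d)
                p[c][t] = (df + 1) / (len(in_c) + 2)   # Laplace-smoothed pki
        return prior, p, vocab

    def csv_scores(doc, prior, p, vocab):
        scores = {}
        for c in prior:
            s = math.log(prior[c])
            for t in vocab:                 # wk is 1 iff term t occurs in doc
                s += math.log(p[c][t] if t in doc else 1 - p[c][t])
            scores[c] = s
        return scores

    docs = [{"flight", "mountain"}, {"flight", "severe"}, {"fund", "simple"}]
    prior, p, vocab = train(docs, ["travel", "travel", "finance"])
    scores = csv_scores({"flight"}, prior, p, vocab)
    print(max(scores, key=scores.get))      # 'travel'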
k-NN Classifier
• Compute closeness between candidate documents and category documents

CSVi(dj) = Σ sim(dj, dz) · CSVi(dz)    (sum over the k nearest training documents dz)

  – sim(dj, dz): similarity between dj and training set document dz
  – CSVi(dz): confidence score indicating whether dz belongs to category ci
k-NN Classifier
• k nearest neighbors
  – Find the k nearest neighbors among all training documents and use their categories
  – k can also indicate the number of top-ranked training documents per category to compare against
• Similarity computation can use:
  – Inner product
  – Cosine coefficient
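A minimal k-NN sketch using the cosine coefficient; the training vectors and k are illustrative:

    import math
    from collections import defaultdict

    def cosine(u, v):
        dot = sum(u.get(t, 0) * w for t, w in v.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def knn_csv(doc, training, k=3):
        """training: list of (vector, category). Returns a CSV per category."""
        nearest = sorted(training, key=lambda dz: cosine(doc, dz[0]),
                         reverse=True)[:k]
        csv = defaultdict(float)
        for vec, cat in nearest:
            csv[cat] += cosine(doc, vec)    # weight each vote by similarity
        return dict(csv)

    training = [({"flight": 1.0, "severe": 0.5}, "travel"),
                ({"mountain": 1.0}, "travel"),
                ({"fund": 1.0, "simple": 0.3}, "finance")]
    print(knn_csv({"flight": 1.0}, training, k=2))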
Support Vector Machines
• Finds the “decision surface” (hyperplane) that best separates the data points of two classes
• Support vectors are the training docs that best define hyperplane
[Diagram: the optimal hyperplane maximizes the margin between the two classes]
Support Vector Machines
• Training process involves finding the support vectors
• Only care about support vectors in the training set, not other documents
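A minimal sketch of SVM-based text categorization, assuming scikit-learn is available; the training data is illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    train_texts = ["severe flight delays in the mountains",
                   "flight schedule changes announced",
                   "mutual fund prospectus released",
                   "quarterly fund performance report"]
    train_labels = ["travel", "travel", "finance", "finance"]

    vectorizer = TfidfVectorizer()              # tf-idf document vectors
    X = vectorizer.fit_transform(train_texts)
    clf = LinearSVC().fit(X, train_labels)      # finds the separating hyperplane

    print(clf.predict(vectorizer.transform(["new fund prospectus"])))
    # ['finance']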
Neural Networks
• Train net to learn from a mapping of input words to a category
• One neural net per category
  – Too expensive
• One network overall
• Perceptron approach without a hidden layer
• Three-layered network
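A minimal perceptron sketch (no hidden layer) for a single category; the term vectors and learning rate are illustrative:

    import numpy as np

    X = np.array([[1, 1, 0, 0],     # rows: documents, cols: term weights
                  [1, 0, 0, 0],
                  [0, 0, 1, 1],
                  [0, 0, 1, 0]], dtype=float)
    y = np.array([1, 1, 0, 0])      # 1 = in category, 0 = not

    w, b, lr = np.zeros(X.shape[1]), 0.0, 0.1
    for _ in range(20):                        # a few epochs over the set
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            w += lr * (yi - pred) * xi         # perceptron update rule
            b += lr * (yi - pred)

    print([(1 if xi @ w + b > 0 else 0) for xi in X])   # should match y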
Classifier Committees
• Combine multiple classifiers
• Majority voting
• Category specialization
• Mixed results
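A minimal majority-voting sketch; the committee members' predictions are illustrative:

    from collections import Counter

    def majority_vote(predictions):
        """predictions: one predicted category per committee member."""
        return Counter(predictions).most_common(1)[0][0]

    print(majority_vote(["travel", "finance", "travel"]))   # 'travel'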
Classification Performance
• Category ranking evaluation
  – Recall = (categories found and correct) / (total categories correct)
  – Precision = (categories found and correct) / (total categories found)
• Micro and macro averaging over categories
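A minimal sketch of per-category precision/recall with micro and macro averaging; the counts are illustrative:

    # per category: (found and correct, total found, total correct)
    counts = {"travel": (8, 10, 12), "finance": (3, 4, 9)}

    def prf(tp, found, correct):
        return tp / found, tp / correct            # precision, recall

    macro_p = sum(prf(*c)[0] for c in counts.values()) / len(counts)
    macro_r = sum(prf(*c)[1] for c in counts.values()) / len(counts)

    tp = sum(c[0] for c in counts.values())        # micro: pool counts first
    micro_p = tp / sum(c[1] for c in counts.values())
    micro_r = tp / sum(c[2] for c in counts.values())

    print(f"macro P={macro_p:.2f} R={macro_r:.2f}  "
          f"micro P={micro_p:.2f} R={micro_r:.2f}")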
Classification Performance
• Hard to evaluate and compare
• Two studies
  – Yiming Yang, 1997
  – Yiming Yang and Xin Liu, 1999
• SVM, kNN >> Neural Net > Naïve Bayes
• Performance converges for common categories (with many training docs)
Computational Bottlenecks
• Quiver
  – # of topics
  – # of training documents
  – # of candidate documents
Categorization and the Internet
• Classification as a service
  – Standardizing vocabulary
  – Confidentiality
  – Performance
• Use of hypertext in categorization
  – Augment existing classifiers to take advantage of link structure
Hypertext and Categorization
• An already-categorized document links to documents within the same category
• Neighboring documents in a similar category
• Hierarchical nature of categories
• Metatags
Augmenting Classifiers
• Inject the anchor text pointing at a document into that document
  – Treat anchor text as separate terms
• Depends on the dataset
• Mixed experimental results
• Links may be noisy
  – Ads
  – Navigation
Topics and the Web
• Topic distillation
  – Analysis of the hyperlink graph structure
• Authorities – popular pages
• Hubs – pages that link to authorities

[Diagram: hubs pointing to authorities]
Topic Distillation
• Kleinberg’s HITS algorithm
• Start from an initial set of pages: the root set
  – Use this to create an expanded set
• Weight propagation phase
  – Each node gets an authority score and a hub score
  – Alternate the updates:
    • Authority = sum of the current hub weights of all nodes pointing to it
    • Hub = sum of the authority scores of all pages it points to
  – Normalize node scores and iterate until convergence
• Output is a set of hubs and authorities
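A minimal sketch of the HITS weight-propagation phase on a toy link graph; the graph is illustrative:

    import math

    graph = {"p1": ["a1", "a2"],     # node -> pages it links to
             "p2": ["a1"],
             "p3": ["a1", "a2"],
             "a1": [], "a2": []}

    hub = {n: 1.0 for n in graph}
    auth = {n: 1.0 for n in graph}

    for _ in range(20):                            # iterate to convergence
        for n in graph:                            # authority = sum of hub
            auth[n] = sum(hub[m] for m in graph if n in graph[m])
        for n in graph:                            # hub = sum of authority
            hub[n] = sum(auth[m] for m in graph[n])
        for scores in (auth, hub):                 # normalize node scores
            norm = math.sqrt(sum(s * s for s in scores.values()))
            for n in scores:
                scores[n] /= norm or 1.0

    print(sorted(auth, key=auth.get, reverse=True)[:2])   # top authorities
    print(sorted(hub, key=hub.get, reverse=True)[:2])     # top hubs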
Conclusion
• Why Classify?
• The Classification Process
• Various Classifiers
• Which ones are better?
• Other applications