Text Classification and Images by Carl Sable

Page 1: Text Classification and Images

Text Classification and Images

by Carl Sable

Page 2: Text Classification and Images

Overview

• Text Classification.

– Involves assigning text documents to one or more groups (classes).

– Techniques can be applied to image captions to classify corresponding images.

• Various methods, evaluation techniques, and related issues will be discussed.

• Some discussion of other research involving image captions.

Page 3: Text Classification and Images

Text Classification Tasks

• Text Categorization (TC) - Assign text documents to existing, well-defined categories.

• Information Retrieval (IR) - Retrieve text documents that match a user query.

• Clustering - Group text documents into clusters of similar documents.

• Text Filtering - Retrieve documents which match a user profile.

Page 4: Text Classification and Images

Text Categorization

• Classify each test document by assigning category labels.

– M-ary categorization assumes M labels per document.

– Binary categorization requires a yes/no decision for every document/category pair.

• Most techniques require training.

– Parametric vs non-parametric.

– Batch vs on-line.

Page 5: Text Classification and Images

Early Work

• The Federalist papers.

– Published anonymously in 1787-1788.

– Authorship of 12 papers in dispute (either Hamilton or Madison).

• Mosteller and Wallace, 1963.

– Compared rate per thousand words of high-frequency words.

– Collected very strong evidence in favor of Madison.

Page 6: Text Classification and Images

Rocchio

• All documents and categories represented by word vectors.

• TF*IDF weights for words.

– Term frequency is the number of times a word appears in a document or category.

– Inverse document frequency relates to the scarcity of a word over the entire training collection.

• Similarity computed for all document, category pairs.
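
A minimal sketch of the idea on this slide, under stated assumptions: each category vector is taken to be the centroid (mean) of the TF*IDF vectors of its training documents, IDF is computed as log(N / df), and similarity is cosine. The full Rocchio formulation also subtracts a weighted centroid of negative examples, which is left out here. All function names are illustrative and do not come from the slides.

```python
import math
from collections import Counter, defaultdict

def tf_idf_vectors(docs):
    """Build TF*IDF vectors (dicts) for tokenized docs; IDF = log(N / df) is an assumption."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    idf = {w: math.log(n / df[w]) for w in df}
    vectors = [{w: tf * idf[w] for w, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train_centroids(docs, labels):
    """Represent each category by the mean TF*IDF vector of its training documents."""
    vectors, idf = tf_idf_vectors(docs)
    counts = Counter(labels)
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for w, x in vec.items():
            sums[label][w] += x
    centroids = {c: {w: x / counts[c] for w, x in s.items()} for c, s in sums.items()}
    return centroids, idf

def classify(doc, centroids, idf):
    """Assign the category whose vector is most similar to the document vector."""
    vec = {w: tf * idf[w] for w, tf in Counter(doc).items() if w in idf}
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

# Toy usage with made-up training data.
docs = [["stock", "market", "falls"], ["market", "rally", "stock"],
        ["team", "wins", "game"], ["game", "score", "team"]]
labels = ["finance", "finance", "sports", "sports"]
centroids, idf = train_centroids(docs, labels)
print(classify(["stock", "rally"], centroids, idf))
```

Words never seen in training are ignored at classification time, which is one simple choice among several.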

Page 7: Text Classification and Images

Naïve Bayes

• Estimates probabilities of categories given a document.

• Uses joint probabilities of words and categories (Bayes’ rule).

• Assumes words are independent of each other given the category.

• Can incorporate a priori probabilities of categories.
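
The points above can be sketched as a small multinomial Naïve Bayes classifier. This is an illustrative sketch, not the implementation behind the slides: it assumes add-one (Laplace) smoothing, estimates the a priori category probabilities from training label frequencies, and skips words that were never seen in training. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Estimate log P(category) and log P(word | category) from tokenized docs."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    # A priori category probabilities, here taken from label frequencies.
    log_prior = {c: math.log(k / len(docs)) for c, k in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocab)  # add-one smoothing
        log_likelihood[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Pick the category maximizing log P(c) + sum of log P(w | c) over the words."""
    def score(c):
        # Summing per-word log probabilities is where the independence assumption enters.
        return log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in log_likelihood[c])
    return max(log_prior, key=score)

# Toy usage with made-up training data.
docs = [["stock", "market", "falls"], ["market", "rally", "stock"],
        ["team", "wins", "game"], ["game", "score", "team"]]
labels = ["finance", "finance", "sports", "sports"]
log_prior, log_likelihood = train_naive_bayes(docs, labels)
print(predict(["stock", "market"], log_prior, log_likelihood))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.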

Page 8: Text Classification and Images

Other Common Methods

• K-Nearest Neighbor (kNN) - Use k closest training documents to predict category.

• Decision Trees (DTree) - Construct classification trees based on training data.

• Neural Networks (NNet) - Learn non-linear mapping from input words to categories.

• Expert Systems - Use manually constructed, domain-specific, application-specific rules.
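
As an illustration of the first method in this list, here is a minimal kNN sketch. It assumes cosine similarity over raw term counts and a simple majority vote among the k nearest training documents; real systems often weight terms with TF*IDF and weight votes by similarity. The names `knn_classify` and `cosine` are illustrative.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two term-count Counters."""
    dot = sum(count * v[w] for w, count in u.items())  # Counter yields 0 for missing words
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(doc, training, k=3):
    """Predict the majority label among the k training documents closest to doc."""
    query = Counter(doc)
    ranked = sorted(training, key=lambda item: cosine(query, Counter(item[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy usage with made-up (tokens, label) training pairs.
training = [
    (["stock", "market", "falls"], "finance"),
    (["market", "rally", "stock"], "finance"),
    (["team", "wins", "game"], "sports"),
    (["game", "score", "team"], "sports"),
    (["stock", "prices", "rise"], "finance"),
]
print(knn_classify(["stock", "market", "news"], training, k=3))
```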

Page 9: Text Classification and Images

Advanced Techniques

• Support Vector Machines (SVMs).

– Use Structural Risk Minimization principle.

– Find hypothesis which minimizes “true error”.

• Widrow-Hoff and EG - Update weight vector based on each training example.

• Maximum Entropy - Derive constraints expressing characteristics of training data.

• Boosting - Combine weak hypotheses to produce highly accurate classification rule.
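
To make the "update weight vector based on each training example" idea concrete, here is a sketch of one on-line step for each of the two linear learners. It assumes the commonly cited squared-loss forms: Widrow-Hoff moves the weights additively against the prediction error, while EG (exponentiated gradient) rescales them multiplicatively and renormalizes so they sum to 1. The learning rate `eta` and the function names are assumptions for illustration.

```python
import math

def widrow_hoff_step(w, x, y, eta):
    """Additive update: w_j <- w_j - 2 * eta * (w.x - y) * x_j."""
    err = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [wj - 2 * eta * err * xj for wj, xj in zip(w, x)]

def eg_step(w, x, y, eta):
    """Multiplicative update: w_j <- w_j * exp(-2 * eta * (w.x - y) * x_j), then normalize."""
    err = sum(wj * xj for wj, xj in zip(w, x)) - y
    unnormalized = [wj * math.exp(-2 * eta * err * xj) for wj, xj in zip(w, x)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

# One training example: feature vector x with target y = 1 (document belongs to the category).
print(widrow_hoff_step([0.0, 0.0], [1.0, 0.0], 1.0, eta=0.25))
print(eg_step([0.5, 0.5], [1.0, 0.0], 1.0, eta=0.5))
```

In both cases the weight on the feature that was present in an under-scored positive example goes up; the two rules differ in whether that change is additive or multiplicative.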

Page 10: Text Classification and Images

Common Test Corpora

• Reuters - Collection of newswire stories from 1987 to 1991, labeled with categories.

• TREC-AP - Newswire stories from 1988 to 1990, labeled with categories.

• OHSUMED - Medline articles from 1987 to 1991, MeSH categories assigned.

• UseNet newsgroups.

• WebKB - Web pages gathered from university CS departments.

Page 11: Text Classification and Images

Other Issues to Consider

• Which words to use (feature selection).

• Normalization.

• Use of lexical databases.

– Longman Dictionary of Contemporary English (LDOCE), WordNet, English Verb Classes and Alternations (EVCA).

– May cause problems due to lexical ambiguity.

• High cost of manual labels.

Page 12: Text Classification and Images

Categorizing Images

• Some previous research on content-based image categorization, very little on text-based image categorization!

• WebSEEk.

– Categorizes images and videos based on key-terms extracted from URL, alt text, hyperlinks, and directory names.

– Semi-automated key-term dictionary maps key-terms to subject(s) from a taxonomy.

Page 13: Text Classification and Images

Evaluation Metrics

• Per Category Measures:

– Simple accuracy or error measures can be misleading.

– Precision, recall, and fallout.

– F-measure, average precision, and break-even point (BEP) combine precision and recall.

• Macro-averaging vs Micro-averaging.

• Should choose metric ahead of time (maybe)!

Contingency table:

                 Yes is correct    No is correct
Assigned YES           a                 b
Assigned NO            c                 d

p = a / (a + b)      (precision)

r = a / (a + c)      (recall)

f = b / (b + d)      (fallout)

Acc = (a + d) / n    (accuracy)

Err = (b + c) / n    (error)

where n = a + b + c + d.
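
The measures defined from the contingency table can be computed as in the sketch below, which also contrasts the two averaging schemes. It assumes macro-averaging means averaging a per-category measure over categories, and micro-averaging means summing the cells a, b, c, d over categories before computing the measure. The F-measure shown is the balanced form 2pr / (p + r), which is one common choice. Function names are illustrative.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def metrics(a, b, c, d):
    """Per-category measures from contingency cells a, b, c, d."""
    n = a + b + c + d
    p = safe_div(a, a + b)
    r = safe_div(a, a + c)
    return {
        "precision": p,
        "recall": r,
        "fallout": safe_div(b, b + d),
        "accuracy": safe_div(a + d, n),
        "error": safe_div(b + c, n),
        "f1": safe_div(2 * p * r, p + r),  # balanced F-measure (assumed variant)
    }

def macro_average(tables, name):
    """Average the per-category value of a measure over all categories."""
    return sum(metrics(*t)[name] for t in tables) / len(tables)

def micro_average(tables, name):
    """Pool the cells over all categories, then compute the measure once."""
    a, b, c, d = (sum(col) for col in zip(*tables))
    return metrics(a, b, c, d)[name]

# Made-up contingency tables (a, b, c, d) for two categories.
tables = [(8, 2, 4, 86), (1, 1, 1, 97)]
print(macro_average(tables, "precision"), micro_average(tables, "precision"))

# A classifier that never assigns a rare category still looks accurate.
print(metrics(0, 0, 1, 99))
```

The last call illustrates why simple accuracy can mislead: with one positive document in a hundred, always answering NO yields 0.99 accuracy and zero recall.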

Page 14: Text Classification and Images

Some Results and Analysis

• Comparisons.

– SVM, kNN, AdaBoost, WH, and EG all showed very impressive performance.

– Naïve Bayes and Rocchio tended to show relatively poor performance.

• Rocchio possibly could have done better.

– Should be using probabilistic Rocchio.

– Works best if categories are mutually exclusive.

– May perform at its best when there are only 2 categories.

Page 15: Text Classification and Images

Information Retrieval

• User inputs query, system should retrieve all relevant documents.

• Simple technique: keyword search.

• Other techniques rely on word vectors.

– TF*IDF commonly used for weights.

– Can compute similarity between query vector and document vectors.

• Evaluation - Similar to text categorization, treat relevant documents as single category.

Page 16: Text Classification and Images

Relevance Feedback

• After initial retrieval, user makes relevance judgements for retrieved documents.

• New round of retrieval based on feedback.

• Similar to text categorization with two categories: relevant vs non-relevant.

• Rocchio algorithm originally created for this task.

• Naïve Bayes very successful.
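
A sketch of how a Rocchio-style feedback round might reformulate the query: the new query vector is the old one, moved toward the mean of the documents judged relevant and away from the mean of those judged non-relevant. The weights `alpha`, `beta`, and `gamma` and their default values are assumptions chosen for illustration, as is the decision to drop terms whose weight becomes non-positive.

```python
from collections import defaultdict

def rocchio_feedback(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*q + beta*mean(relevant) - gamma*mean(non_relevant) as a sparse dict."""
    updated = defaultdict(float)
    for w, x in query.items():
        updated[w] += alpha * x
    for doc in relevant:
        for w, x in doc.items():
            updated[w] += beta * x / len(relevant)
    for doc in non_relevant:
        for w, x in doc.items():
            updated[w] -= gamma * x / len(non_relevant)
    # Keep only terms with positive weight (an assumed simplification).
    return {w: x for w, x in updated.items() if x > 0}

# Toy usage: term-weight dicts for the query and for the judged documents.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}]
non_relevant = [{"jaguar": 1.0, "car": 1.0}]
print(rocchio_feedback(query, relevant, non_relevant))
```

The updated query picks up "cat" from the relevant document, so the next retrieval round favors documents about the animal over those about the car.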

Page 17: Text Classification and Images

Possible Improvements

• Lexical databases sometimes used for query expansion.

• Word sense disambiguation.

– Expand query with correct senses.

– Used on documents to prevent retrieval based on false matches.

• Notion of semantic similarity.

Page 18: Text Classification and Images

Retrieval of Captioned Images

• Typical properties of image captions:

– Shorter than documents in typical IR tasks.

– Subject noun phrase usually denotes most significant object in picture.

– In news domain, first sentence generally describes image, rest is background.

• Different types of queries.

• Many techniques from general IR not applicable.

Page 19: Text Classification and Images

Related Research

• Smeaton.

– Automatically derived Hierarchical Concept Graphs (HCGs) based on WordNet IS-A links.

– Computed semantic similarity between nouns.

– Some success improving image retrieval.

• Guglielmo and Rowe.

– Used logical form records to capture meaning of queries and captions for comparison.

– System significantly beat keyword search.

Page 20: Text Classification and Images

Other Text Classification Tasks

• Clustering documents.

– Create groups with similar attributes.

– Various methods and algorithms exist.

– Hierarchical vs non-hierarchical.

– Each group has a centroid.

– Can aid in Information Retrieval.

• Text Filtering.

– Filter articles of potential interest for a user.

– Uses many of the same methods as TC and IR.

Page 21: Text Classification and Images

Processing Image Captions

• The Correspondence Problem - How to correlate visual information with words.

– Visual semantics.

– Symbolic representation of visual data.

• Srihari.

– Piction - System that automatically identifies human faces in captioned newspaper photos.

– Integrates NLP module which parses captions with IU module that detects objects.

Page 22: Text Classification and Images

Final Observations

• Previous Work.

– General text categorization studied extensively.

– Some research on text-based image retrieval.

– Very little research involving text-based image categorization.

• Image captions contain information unlikely to be extracted from just images.

• High potential exists for significant research involving text-based image categorization.