
Ray R. Larson : University of California, Berkeley Clustering and Classification Workshop 1998

Cheshire II and Automatic Categorization

Ray R. Larson

Associate Professor

School of Information Management and Systems


Overview

Introduction to Automatic Classification and Clustering

Classification of Classification Methods

Classification Clusters and Information Retrieval in Cheshire II

DARPA Unfamiliar Metadata Project


Classification

The grouping together of items (including documents or their representations) which are then treated as a unit. The groupings may be predefined or generated algorithmically. The process itself may be manual or automated.

In document classification the items are grouped together because they are likely to be wanted together.

» For example, items about the same topic.


Automatic Indexing and Classification

Automatic indexing typically means simply deriving keywords from a document and providing access to all of those words.

More complex automatic indexing systems attempt to select controlled vocabulary terms based on terms in the document.

Automatic classification attempts to automatically group similar documents using either:

» A fully automatic clustering method.

» An established classification scheme and set of documents already indexed by that scheme.


Background and Origins

Early suggestion by Fairthorne

» “The Mathematics of Classification”

Early experiments by Maron (1961) and Borko and Bernick (1963)

Work in numerical taxonomy and its application to information retrieval: Jardine, Sibson, van Rijsbergen, Salton (1970s).

Early IR clustering work was more concerned with efficiency issues than with semantic issues.


Cluster Hypothesis

The basic notion behind the use of classification and clustering methods:

“Closely associated documents tend to be relevant to the same requests.”

» C.J. van Rijsbergen


Classification of Classification Methods

Class Structure

» Intellectually Formulated

– Manual assignment (e.g. Library classification)

– Automatic assignment (e.g. Cheshire Classification Mapping)

» Automatically derived from collection of items

– Hierarchic Clustering Methods (e.g. Single Link)

– Agglomerative Clustering Methods (e.g. Dattola)

– Hybrid Methods (e.g. Query Clustering)


Classification of Classification Methods

Relationship between properties and classes

» monothetic

» polythetic

Relation between objects and classes

» exclusive

» overlapping

Relation between classes and classes

» ordered

» unordered

Adapted from Sparck Jones


Properties and Classes

Monothetic

» Class defined by a set of properties that are both necessary and sufficient for membership in the class

Polythetic

» Class defined by a set of properties such that a member of the class must have some number (usually large) of those properties, a large number of individuals in the class possess some of those properties, and no individual possesses all of the properties.
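The two membership rules can be contrasted in a few lines. This is only a sketch: the property letters, the sample item, and the `min_shared` cutoff are illustrative assumptions, since the slide leaves "some number (usually large)" unspecified.

```python
def is_monothetic_member(item_props: set, class_props: set) -> bool:
    """Monothetic: the item must have every defining property."""
    return class_props <= item_props


def is_polythetic_member(item_props: set, class_props: set, min_shared: int) -> bool:
    """Polythetic: the item must share at least `min_shared` of the properties."""
    return len(item_props & class_props) >= min_shared


# Hypothetical class over four properties and an item lacking property C.
class_props = {"A", "B", "C", "D"}
item = {"A", "B", "D"}

print(is_monothetic_member(item, class_props))      # False
print(is_polythetic_member(item, class_props, 3))   # True
```

The same item fails the monothetic test but passes the polythetic one, which is the distinction the definitions draw.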


Monothetic vs. Polythetic

[Figure: table of individuals 1 to 8 (rows) against properties A to H (columns), with three “+” marks in each row; one group of rows is labeled Polythetic and the other Monothetic.]

Adapted from van Rijsbergen, ‘79


Exclusive Vs. Overlapping

Exclusive: an item belongs to a single class only

Overlapping: items can belong to many classes, sometimes with a “membership weight”


Ordered Vs. Unordered

Ordered classes have some sort of structure imposed on them

» Hierarchies are typical of ordered classes

Unordered classes have no imposed precedence or structure and each class is considered on the same “level”

» Typical in agglomerative methods


Clustering Methods

Hierarchical

Agglomerative

Hybrid

Automatic Class Assignment


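Coefficients of Association

Simple matching

$$|A \cap B|$$

Dice’s coefficient

$$\frac{2\,|A \cap B|}{|A| + |B|}$$

Jaccard’s coefficient

$$\frac{|A \cap B|}{|A \cup B|}$$

Cosine coefficient

$$\frac{|A \cap B|}{|A|^{1/2} \times |B|^{1/2}}$$

Overlap coefficient

$$\frac{|A \cap B|}{\min(|A|, |B|)}$$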
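For two documents represented as sets of index terms A and B, the association coefficients named on this slide can be sketched as below. This is a minimal illustration that assumes both sets are non-empty; the function names and the sample term sets are made up for the example.

```python
from math import sqrt


def simple_matching(a: set, b: set) -> int:
    """Number of shared terms, |A ∩ B|."""
    return len(a & b)


def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


def cosine(a: set, b: set) -> float:
    return len(a & b) / sqrt(len(a) * len(b))


def overlap(a: set, b: set) -> float:
    return len(a & b) / min(len(a), len(b))


# Hypothetical term sets for two documents.
doc_a = {"automatic", "classification", "library", "cluster"}
doc_b = {"cluster", "classification", "retrieval"}

for coefficient in (simple_matching, dice, jaccard, cosine, overlap):
    print(coefficient.__name__, coefficient(doc_a, doc_b))
```

All five share the numerator |A ∩ B| and differ only in how they normalize for the sizes of the two sets.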


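Hierarchical Methods

Single Link Dissimilarity Matrix

    2   .4
    3   .4  .2
    4   .3  .3  .3
    5   .1  .4  .4  .1
         1   2   3   4

Hierarchical methods: Polythetic, Usually Exclusive, Ordered

Clusters are order-independent

[Equation: dissimilarity between A and B, written as 1 minus a ratio with |A ∩ B| in the numerator and |A| and |B| in the denominator]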


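Threshold = .1

Single Link Dissimilarity Matrix

    2   .4
    3   .4  .2
    4   .3  .3  .3
    5   .1  .4  .4  .1
         1   2   3   4

Thresholded matrix (1 where dissimilarity is at most .1)

    2   0
    3   0   0
    4   0   0   0
    5   1   0   0   1
        1   2   3   4

[Figure: graph on nodes 1 to 5 with links 1-5 and 4-5]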


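Threshold = .2

    2   .4
    3   .4  .2
    4   .3  .3  .3
    5   .1  .4  .4  .1
         1   2   3   4

Thresholded matrix (1 where dissimilarity is at most .2)

    2   0
    3   0   1
    4   0   0   0
    5   1   0   0   1
        1   2   3   4

[Figure: graph on nodes 1 to 5 with links 1-5, 4-5 and 2-3]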


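Threshold = .3

    2   .4
    3   .4  .2
    4   .3  .3  .3
    5   .1  .4  .4  .1
         1   2   3   4

Thresholded matrix (1 where dissimilarity is at most .3)

    2   0
    3   0   1
    4   1   1   1
    5   1   0   0   1
        1   2   3   4

[Figure: graph on nodes 1 to 5 with links 1-5, 4-5, 2-3, 1-4, 2-4 and 3-4]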
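The three threshold slides can be reproduced with a short sketch: link every pair whose dissimilarity is at or below the threshold, then read the single-link clusters off as the connected components of the resulting graph. The dissimilarity values are the ones in the matrix on these slides; the union-find helper and the function name are choices made for this illustration, not part of the original material.

```python
# Lower-triangular dissimilarity matrix from the slides, keyed by (row, column).
DISSIM = {
    (2, 1): 0.4,
    (3, 1): 0.4, (3, 2): 0.2,
    (4, 1): 0.3, (4, 2): 0.3, (4, 3): 0.3,
    (5, 1): 0.1, (5, 2): 0.4, (5, 3): 0.4, (5, 4): 0.1,
}


def single_link_clusters(dissim, items, threshold):
    """Connected components of the graph linking pairs with dissimilarity <= threshold."""
    parent = {i: i for i in items}

    def find(x):
        # Follow parent pointers to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in dissim.items():
        if d <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in items:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


for t in (0.1, 0.2, 0.3):
    print(t, single_link_clusters(DISSIM, range(1, 6), t))
# 0.1 [[1, 4, 5], [2], [3]]
# 0.2 [[1, 4, 5], [2, 3]]
# 0.3 [[1, 2, 3, 4, 5]]
```

Raising the threshold only ever merges clusters, which is what gives the single-link result its hierarchy, and the outcome does not depend on the order in which the pairs are visited.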


Clustering

Agglomerative methods: Polythetic, Exclusive or Overlapping, Unordered

Clusters are order-dependent.

1. Select initial centers (i.e. seed the space)

2. Assign docs to highest matching centers and compute centroids

3. Reassign all documents to centroid(s)

Rocchio’s method (similar to current K-means methods)
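The three steps above can be sketched as a single seed, assign, reassign pass. This is a simplified illustration in the spirit of the slide, not Rocchio's full procedure: documents are assumed to be term-frequency vectors, matching is assumed to be cosine similarity, and each document goes to exactly one center (the exclusive case). All names and the toy documents are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Average term weights over a non-empty list of vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}


def assign(docs, centers):
    """Index of the best-matching center for each document."""
    return [max(range(len(centers)), key=lambda c: cosine(d, centers[c])) for d in docs]


def one_pass_cluster(docs, seed_ids):
    centers = [docs[i] for i in seed_ids]            # 1. seed the space
    first = assign(docs, centers)                    # 2. assign to best center
    for k in range(len(centers)):                    #    and compute centroids
        members = [d for d, c in zip(docs, first) if c == k]
        if members:
            centers[k] = centroid(members)
    return assign(docs, centers)                     # 3. reassign to centroids


docs = [
    Counter("cat dog cat".split()),
    Counter("cat dog".split()),
    Counter("stock market stock".split()),
    Counter("stock market bond".split()),
]
print(one_pass_cluster(docs, seed_ids=[0, 2]))   # [0, 0, 1, 1]
```

Because the result depends on which documents are chosen as seeds, a different `seed_ids` can produce different clusters, which is the order-dependence noted on the slide.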


Automatic Class Assignment

[Figure: documents (“Doc”) and a search engine]

1. Create pseudo-documents representing intellectually derived classes.

2. Search using document contents

3. Obtain ranked list

4. Assign document to N categories ranked over threshold, OR assign to top-ranked category

Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered

Clusters are order-independent, usually based on an intellectually derived scheme
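The four assignment steps can be sketched as follows. The ranking function here is a plain cosine match over word counts, standing in for whatever search engine is actually used; the class labels, pseudo-document texts, and parameter names are illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_classes(doc_text, pseudo_docs):
    """Steps 2 and 3: use the document as a query and rank the class pseudo-documents."""
    doc = Counter(doc_text.lower().split())
    scored = [
        (cosine(doc, Counter(text.lower().split())), label)
        for label, text in pseudo_docs.items()
    ]
    return sorted(scored, reverse=True)


def assign_classes(doc_text, pseudo_docs, threshold=None, n=3):
    """Step 4: top-ranked class, or up to n classes scoring at least `threshold`."""
    ranked = rank_classes(doc_text, pseudo_docs)
    if threshold is None:
        return [ranked[0][1]]
    return [label for score, label in ranked[:n] if score >= threshold]


# Step 1: hypothetical pseudo-documents, one per intellectually derived class.
pseudo_docs = {
    "Z699": "information retrieval library catalog indexing retrieval",
    "QA76": "computer programming software algorithms",
    "HG": "finance banking investment markets",
}
doc = "probabilistic retrieval of library catalog records"

print(assign_classes(doc, pseudo_docs))                  # ['Z699']
print(assign_classes(doc, pseudo_docs, threshold=0.1))   # ['Z699']
```

Passing a threshold gives the overlapping variant (several classes per document); omitting it gives the exclusive variant.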


Automatic Categorization in Cheshire II

The Cheshire II system is intended to provide a bridge between the purely bibliographic realm of previous generations of online catalogs and the rapidly expanding realm of full-text and multimedia information resources. It is currently used in the UC Berkeley Digital Library Project and for a number of other sites and projects.


Overview of Cheshire II

It supports SGML as the primary database type.

It is a client/server application.

Uses the Z39.50 Information Retrieval Protocol.

Supports Boolean searching of all servers.

Supports probabilistic ranked retrieval in the Cheshire search engine.

Supports “nearest neighbor” searches, relevance feedback and Two-Stage Search.

GUI interface on X window displays (Tcl/Tk).

HTTP/CGI interface for the Web (Tcl scripting).


Cheshire II - Cluster Generation

Define basis for clustering records.

» Select field to form the basis of the cluster.

» Select evidence fields to use as contents of the pseudo-documents.

During indexing cluster keys are generated with basis and evidence from each record.

Cluster keys are sorted and merged on basis, and pseudo-documents are created for each unique basis element containing all evidence fields.

Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
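The cluster generation steps described on this slide (emit a key per record, sort, merge on basis) can be sketched as below. This is not Cheshire II code: the record layout, the field names `lc_class`, `title` and `subjects`, and the function names are assumptions made for the illustration.

```python
from itertools import groupby

# Hypothetical bibliographic records.
records = [
    {"lc_class": "Z699", "title": "Information retrieval systems",
     "subjects": ["Information storage and retrieval"]},
    {"lc_class": "QA76", "title": "Programming languages",
     "subjects": ["Computer programming"]},
    {"lc_class": "Z699", "title": "Automatic indexing",
     "subjects": ["Indexing"]},
]


def cluster_keys(records, basis_field, evidence_fields):
    """Yield one (basis, evidence text) cluster key per record."""
    for rec in records:
        evidence = []
        for field in evidence_fields:
            value = rec.get(field, [])
            evidence.extend([value] if isinstance(value, str) else value)
        yield rec[basis_field], " ".join(evidence)


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Sort the keys, then merge all evidence sharing the same basis value."""
    keys = sorted(cluster_keys(records, basis_field, evidence_fields))
    return {
        basis: " ".join(evidence for _, evidence in group)
        for basis, group in groupby(keys, key=lambda key: key[0])
    }


pseudo_docs = build_pseudo_documents(records, "lc_class", ["title", "subjects"])
for basis, text in pseudo_docs.items():
    print(basis, "->", text)
```

Each resulting pseudo-document can then be indexed like any other document, so a class becomes searchable by every term contributed by its member records.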


Cheshire II - Two-Stage Retrieval

Using the LC Classification System

» Pseudo-Document created for each LC class containing terms derived from “content-rich” portions of documents in that class (subject headings, titles, etc.)

» Permits searching by any term in the class

» Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first.

» User selects classes to feed back for the “second stage” search of documents.

Can be used with any classified/Indexed collection.
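A rough sketch of the two stages follows. It assumes that the second stage restricts the document search to the classes the user selected, and it replaces probabilistic ranking with a simple count of matching query terms; both are simplifications for illustration, and every name and record here is hypothetical.

```python
def score(query, text):
    """Count of distinct query terms found in the text (a stand-in for ranked retrieval)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def stage_one(query, pseudo_docs):
    """Rank class pseudo-documents against the query, best matches first."""
    ranked = sorted(pseudo_docs, key=lambda c: score(query, pseudo_docs[c]), reverse=True)
    return [c for c in ranked if score(query, pseudo_docs[c]) > 0]


def stage_two(query, documents, selected_classes):
    """Search only the documents that belong to the classes fed back by the user."""
    hits = [d for d in documents if d["lc_class"] in selected_classes]
    return sorted(hits, key=lambda d: score(query, d["title"]), reverse=True)


pseudo_docs = {
    "Z699": "information storage retrieval indexing catalogs",
    "QA76": "computers programming software",
}
documents = [
    {"lc_class": "Z699", "title": "Online catalogs and retrieval"},
    {"lc_class": "Z699", "title": "Indexing methods"},
    {"lc_class": "QA76", "title": "Software for retrieval"},
]

query = "retrieval catalogs"
classes = stage_one(query, pseudo_docs)              # shown to the user
chosen = set(classes[:1])                            # user picks the top class
results = stage_two(query, documents, chosen)
print(classes, [d["title"] for d in results])
```

The first stage lets a user who does not know the controlled vocabulary find the right classes through ordinary words; the second stage uses those classes to focus the search of the documents themselves.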


Probabilistic Retrieval: Logistic Regression

Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.

Log odds of relevance is a linear function of attributes:

$$\log O(R \mid q_i, d_j, t_k) = c_0 + c_1 v_1 + c_2 v_2 + \ldots + c_n v_n$$

Term contributions summed:

$$\log O(R \mid q_i, d_j) = \sum_{k=1}^{m} \left[ \log O(R \mid q_i, d_j, t_k) - \log O(R) \right]$$

Probability of Relevance is inverse of log odds:

$$P(R \mid q_i, d_j) = \frac{1}{1 + e^{-\log O(R \mid q_i, d_j)}}$$



In Cheshire II probability of relevance is based on logistic regression from a sample set of TREC documents to determine values of the coefficients.

At retrieval the probability estimate is obtained from the log odds:

$$\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$$

with $P(R \mid Q, D) = 1 / (1 + e^{-\log O(R \mid Q, D)})$, for 6 attributes or “clues” about term usage in the documents and the query



Attributes

Average Absolute Query Frequency

$$X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$$

Query Length

$$X_2 = \sqrt{QL}$$

Average Absolute Document Frequency

$$X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$$

Document Length

$$X_4 = \sqrt{DL}$$

Average Inverse Document Frequency

$$X_5 = \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$$

Inverse Document Frequency

$$IDF = \frac{N - n_{t_j}}{n_{t_j}}$$

Number of Terms in common between query and document (M), logged

$$X_6 = \log M$$
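To make the six clues concrete, the sketch below computes them for one query and document pair and turns the weighted sum into a probability with the logistic function. Several things here are assumptions and should be read as such: the coefficient values are placeholders (the fitted Cheshire II values are not given in these slides), query and document length are measured as total term occurrences, the two length clues are taken as square roots, and N and n are read as collection size and per-term document count.

```python
from math import exp, log, sqrt


def clues(query_tf, doc_tf, doc_freq, n_docs):
    """Return [X1..X6] for one query/document pair, or None if they share no terms.

    query_tf / doc_tf: term -> frequency in the query / document (QAF, DAF).
    doc_freq: term -> number of documents containing the term (must be < n_docs).
    """
    shared = [t for t in query_tf if t in doc_tf]
    m = len(shared)
    if m == 0:
        return None
    x1 = sum(log(query_tf[t]) for t in shared) / m                          # avg log QAF
    x2 = sqrt(sum(query_tf.values()))                                       # query length
    x3 = sum(log(doc_tf[t]) for t in shared) / m                            # avg log DAF
    x4 = sqrt(sum(doc_tf.values()))                                         # document length
    x5 = sum(log((n_docs - doc_freq[t]) / doc_freq[t]) for t in shared) / m  # avg log IDF
    x6 = log(m)                                                             # log of M
    return [x1, x2, x3, x4, x5, x6]


def probability_of_relevance(xs, coeffs):
    """Logistic transform of c0 + sum(ci * Xi)."""
    log_odds = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], xs))
    return 1.0 / (1.0 + exp(-log_odds))


# Placeholder coefficients c0..c6, NOT the values fitted for Cheshire II.
COEFFS = [-3.0, 1.0, -0.2, 0.5, -0.05, 0.3, 1.5]

query_tf = {"cluster": 1, "retrieval": 2}
doc_tf = {"cluster": 3, "retrieval": 1, "library": 4}
doc_freq = {"cluster": 10, "retrieval": 50}

xs = clues(query_tf, doc_tf, doc_freq, n_docs=1000)
print(xs)
print(probability_of_relevance(xs, COEFFS))
```

Documents are then ranked by this probability, so only the relative ordering matters at retrieval time.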


Cheshire II Demo

Examples from the Unfamiliar Metadata Project

Basis for clusters is a normalized Library of Congress Class Number

Evidence is provided by terms from record titles (and subject headings for the “All Languages” set)

Five different training sets (Russian, German, French, Spanish, and All Languages)

Testing cross-language retrieval and classification


Resources

Cheshire II Home page

» http://cheshire.lib.berkeley.edu/

Unfamiliar MetaData Project

» http://info.sims.berkeley.edu/research/metadata

Cross-Language Classification Clusters

» http://sherlock.berkeley.edu/Language


References

C.J. van Rijsbergen, Information Retrieval, 2nd edition. London: Butterworths, 1979.

» Available as http://sherlock.berkeley.edu/IS205/IR_CJVR

Ray R. Larson, “Experiments in Automatic Library of Congress Classification”. Journal of the American Society for Information Science, 43(2) 130-148, 1992.

M.E. Maron, “Automatic Indexing: An Experimental Inquiry”. Journal of the ACM, 8 404-417, July 1961.

Alan Griffiths, Claire Luckhurst & Peter Willett, “Using Interdocument Similarity Information in Document Retrieval Systems”. JASIS, 37(1) 3-11, 1986.


References

Christian Plaunt & Barbara Norgard, “An Association Based Method for Automatic Indexing with a Controlled Vocabulary”. To appear in JASIS.

» Preprint available: http://bliss.berkeley.edu/papers/

Ray R. Larson, Jerome McDonough, Lucy Kuntz, Paul O’Leary & Ralph Moon, “Cheshire II: Designing a Next-Generation Online Catalog”. JASIS, 47(7) 555-567, 1996.