Ray R. Larson : University of California, Berkeley Clustering and Classification Workshop 1998
Cheshire II and Automatic Categorization
Ray R. Larson
Associate Professor
School of Information Management and Systems
Overview
Introduction to Automatic Classification and Clustering
Classification of Classification Methods
Classification Clusters and Information Retrieval in Cheshire II
DARPA Unfamiliar Metadata Project
Classification
The grouping together of items (including documents or their representations) which are then treated as a unit. The groupings may be predefined or generated algorithmically. The process itself may be manual or automated.
In document classification the items are grouped together because they are likely to be wanted together
» For example, items about the same topic.
Automatic Indexing and Classification
Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words.
More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document.
Automatic classification attempts to automatically group similar documents using either:
» A fully automatic clustering method.
» An established classification scheme and set of documents already indexed by that scheme.
Background and Origins
Early suggestion by Fairthorne
» “The Mathematics of Classification”
Early experiments by Maron (1961) and Borko and Bernick (1963)
Work in Numerical Taxonomy and its application to Information Retrieval: Jardine, Sibson, van Rijsbergen, Salton (1970s).
Early IR clustering work more concerned with efficiency issues than semantic issues.
Cluster Hypothesis
The basic notion behind the use of classification and clustering methods:
“Closely associated documents tend to be relevant to the same requests.”
» C.J. van Rijsbergen
Classification of Classification Methods
Class Structure
» Intellectually Formulated
– Manual assignment (e.g. Library classification)
– Automatic assignment (e.g. Cheshire Classification Mapping)
» Automatically derived from collection of items
– Hierarchic Clustering Methods (e.g. Single Link)
– Agglomerative Clustering Methods (e.g. Dattola)
– Hybrid Methods (e.g. Query Clustering)
Classification of Classification Methods
Relationship between properties and classes
» monothetic
» polythetic
Relation between objects and classes
» exclusive
» overlapping
Relation between classes and classes
» ordered
» unordered
Adapted from Sparck Jones
Properties and Classes
Monothetic
» Class defined by a set of properties that are both necessary and sufficient for membership in the class
Polythetic
» Class defined by a set of properties such that to be a member of the class an individual must have some number (usually large) of those properties, a large number of individuals in the class possess some of those properties, and no individual possesses all of the properties.
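The two membership rules can be contrasted with a minimal sketch over property sets. The function names and the `min_shared` cutoff are hypothetical; the slides only say "some number (usually large)", so the value 3 below is an assumption for illustration.

```python
def is_monothetic_member(item_properties, class_properties):
    """Monothetic: the item must have every defining property of the class."""
    return class_properties <= item_properties


def is_polythetic_member(item_properties, class_properties, min_shared):
    """Polythetic: the item must share at least `min_shared` defining properties.

    `min_shared` is a hypothetical cutoff; the slides do not give a value.
    """
    return len(item_properties & class_properties) >= min_shared


class_properties = {"A", "B", "C", "D"}
item = {"A", "B", "D", "G"}

# The item lacks property C, so it fails the monothetic test,
# but it shares A, B and D, so it passes a polythetic test with cutoff 3.
print(is_monothetic_member(item, class_properties))
print(is_polythetic_member(item, class_properties, 3))
```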
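Monothetic vs. Polythetic
[Figure: grid of individuals 1 to 8 (rows) against properties A to H (columns), with three “+” marks per row showing the properties each individual has; one group of rows is labeled Polythetic and another Monothetic.]
Adapted from van Rijsbergen, ‘79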
Exclusive Vs. Overlapping
Exclusive: each item belongs to a single class only
Overlapping: items can belong to many classes, sometimes with a “membership weight”
Ordered Vs. Unordered
Ordered classes have some sort of structure imposed on them
» Hierarchies are typical of ordered classes
Unordered classes have no imposed precedence or structure and each class is considered on the same “level”
» Typical in agglomerative methods
Clustering Methods
Hierarchical
Agglomerative
Hybrid
Automatic Class Assignment
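Coefficients of Association
Simple matching: $|A \cap B|$
Dice’s coefficient: $\dfrac{2\,|A \cap B|}{|A| + |B|}$
Jaccard’s coefficient: $\dfrac{|A \cap B|}{|A \cup B|}$
Cosine coefficient: $\dfrac{|A \cap B|}{|A|^{1/2} \times |B|^{1/2}}$
Overlap coefficient: $\dfrac{|A \cap B|}{\min(|A|, |B|)}$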
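The five coefficients named on this slide can be sketched for documents represented as sets of index terms. This is a minimal illustration using the standard set-based definitions; the function names and the two sample term sets are invented for the example.

```python
import math


def simple_matching(a, b):
    """Number of terms the two sets share."""
    return len(a & b)


def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    return len(a & b) / len(a | b)


def cosine(a, b):
    return len(a & b) / math.sqrt(len(a) * len(b))


def overlap(a, b):
    return len(a & b) / min(len(a), len(b))


# Hypothetical term sets; all functions assume non-empty sets.
doc_a = {"cluster", "retrieval", "index", "term"}
doc_b = {"cluster", "retrieval", "catalog"}

for coefficient in (simple_matching, dice, jaccard, cosine, overlap):
    print(coefficient.__name__, coefficient(doc_a, doc_b))
```

The normalized coefficients all share the same numerator and differ only in how they scale it by the sizes of the two sets.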
Hierarchical Methods
Single Link Dissimilarity Matrix
     1    2    3    4
2   .4
3   .4   .2
4   .3   .3   .3
5   .1   .4   .4   .1
Hierarchical methods: Polythetic, Usually Exclusive, Ordered
Clusters are order-independent
$\text{dissimilarity} = 1 - \dfrac{|A \cap B|}{|A| + |B|}$
Threshold = .1
Single Link Dissimilarity Matrix
     1    2    3    4
2   .4
3   .4   .2
4   .3   .3   .3
5   .1   .4   .4   .1
Thresholded matrix (1 = linked)
     1    2    3    4
2    0
3    0    0
4    0    0    0
5    1    0    0    1
[Figure: graph of items 1 to 5 with the linked pairs joined]
Threshold = .2
     1    2    3    4
2   .4
3   .4   .2
4   .3   .3   .3
5   .1   .4   .4   .1
Thresholded matrix (1 = linked)
     1    2    3    4
2    0
3    0    1
4    0    0    0
5    1    0    0    1
[Figure: graph of items 1 to 5 with the linked pairs joined]
Threshold = .3
     1    2    3    4
2   .4
3   .4   .2
4   .3   .3   .3
5   .1   .4   .4   .1
Thresholded matrix (1 = linked)
     1    2    3    4
2    0
3    0    1
4    1    1    1
5    1    0    0    1
[Figure: graph of items 1 to 5 with the linked pairs joined]
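The thresholding steps on these slides can be sketched as follows: link every pair whose dissimilarity is at or below the threshold, then read the clusters off as connected groups. The dissimilarity values are the ones shown in the matrix; the function name and the union-find bookkeeping are choices made for this illustration, not necessarily how any particular system implements single link.

```python
# Lower-triangular dissimilarities from the slides, keyed by (row, column).
DISSIMILARITY = {
    (2, 1): 0.4,
    (3, 1): 0.4, (3, 2): 0.2,
    (4, 1): 0.3, (4, 2): 0.3, (4, 3): 0.3,
    (5, 1): 0.1, (5, 2): 0.4, (5, 3): 0.4, (5, 4): 0.1,
}


def single_link_clusters(dissimilarity, items, threshold):
    """Group items connected by a chain of links at or below the threshold."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), d in dissimilarity.items():
        if d <= threshold:
            parent[find(i)] = find(j)  # merge the two groups

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), set()).add(item)
    return sorted(clusters.values(), key=min)


for threshold in (0.1, 0.2, 0.3):
    print(threshold, single_link_clusters(DISSIMILARITY, range(1, 6), threshold))
```

Raising the threshold only ever merges clusters, which is what gives single link its hierarchical (ordered) structure.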
Clustering
Agglomerative methods: Polythetic, Exclusive or Overlapping, Unordered
clusters are order-dependent.
[Figure: diagram of “Doc” items]
1. Select initial centers (i.e. seed the space)
2. Assign docs to highest matching centers and compute centroids
3. Reassign all documents to centroid(s)
Rocchio’s method (similar to current K-means methods)
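The three steps above can be sketched as a single seed, assign, reassign pass. This is a simplified illustration in the spirit of the slide, not Rocchio's exact algorithm: it assumes documents are term-weight dictionaries, uses cosine similarity as the matching function, and makes exclusive assignments only. All names and the sample documents are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Average of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def assign(docs, centers):
    """Put each document in the group of its highest-matching center."""
    groups = [[] for _ in centers]
    for name, vec in docs.items():
        best = max(range(len(centers)), key=lambda c: cosine(vec, centers[c]))
        groups[best].append(name)
    return groups


def cluster(docs, seed_names):
    centers = [docs[name] for name in seed_names]        # 1. seed the space
    groups = assign(docs, centers)                        # 2. assign to centers
    centers = [centroid([docs[n] for n in g]) if g else c  #    compute centroids
               for g, c in zip(groups, centers)]
    return assign(docs, centers)                          # 3. reassign


docs = {
    "d1": {"library": 2, "catalog": 1},
    "d2": {"library": 1, "catalog": 2},
    "d3": {"cluster": 2, "centroid": 1},
    "d4": {"cluster": 1, "centroid": 1, "catalog": 1},
}
print(cluster(docs, ["d1", "d3"]))
```

Choosing different seeds, or presenting the documents in a different order when ties occur, can change the result, which is the order dependence noted on the slide.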
Automatic Class Assignment
[Figure: “Doc” items and a “Search Engine”]
1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category
Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered
clusters are order-independent, usually based on an intellectually derived scheme
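The four steps can be sketched with a toy ranking function. This is an illustration only: it assumes whitespace tokenization and cosine similarity over raw term counts in place of a real search engine, and the class labels, pseudo-document text and threshold values are invented for the example.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two term-count Counters."""
    dot = sum(count * v[term] for term, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def rank_classes(document_text, pseudo_documents):
    """Steps 2 and 3: use the document as a query and rank the classes."""
    doc_vec = Counter(document_text.lower().split())
    scored = [(cosine(doc_vec, Counter(text.lower().split())), label)
              for label, text in pseudo_documents.items()]
    return sorted(scored, reverse=True)


def assign_classes(document_text, pseudo_documents, threshold=None):
    """Step 4: top-ranked class, or every class scoring over the threshold."""
    ranked = rank_classes(document_text, pseudo_documents)
    if threshold is None:
        return [ranked[0][1]]
    return [label for score, label in ranked if score >= threshold]


# Step 1: hypothetical pseudo-documents, one per class.
pseudo_documents = {
    "Z": "library catalog classification cataloging bibliography",
    "QA": "mathematics statistics regression probability algebra",
}
document = "probabilistic regression models for library catalog retrieval"

print(assign_classes(document, pseudo_documents))                  # exclusive
print(assign_classes(document, pseudo_documents, threshold=0.15))  # overlapping
```

Returning only the top-ranked class gives an exclusive assignment; the threshold variant allows overlapping classes.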
Automatic Categorization in Cheshire II
The Cheshire II system is intended to provide a bridge between the purely bibliographic realm of previous generations of online catalogs and the rapidly expanding realm of full-text and multimedia information resources. It is currently used in the UC Berkeley Digital Library Project and for a number of other sites and projects.
Overview of Cheshire II
It supports SGML as the primary database type.
It is a client/server application.
Uses the Z39.50 Information Retrieval Protocol.
Supports Boolean searching of all servers.
Supports probabilistic ranked retrieval in the Cheshire search engine.
Supports “nearest neighbor” searches, relevance feedback and Two-Stage Search.
GUI interface on X window displays (Tcl/Tk).
HTTP/CGI interface for the Web (Tcl scripting).
Cheshire II - Cluster Generation
Define basis for clustering records.
» Select field to form the basis of the cluster.
» Select evidence fields to use as contents of the pseudo-documents.
During indexing, cluster keys are generated with basis and evidence from each record.
Cluster keys are sorted and merged on basis, and pseudo-documents are created for each unique basis element containing all evidence fields.
Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
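The cluster generation steps can be sketched in a few lines. This is not Cheshire II code: records are assumed to be plain dictionaries, and the field names (`lc_class`, `title`, `subjects`), the function names and the sample records are hypothetical.

```python
from itertools import groupby


def cluster_keys(records, basis_field, evidence_fields):
    """Emit one (basis, evidence) key per record, as during indexing."""
    for record in records:
        evidence = " ".join(record.get(field, "") for field in evidence_fields)
        yield record[basis_field], evidence


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Sort the keys, then merge the evidence for each unique basis value."""
    keys = sorted(cluster_keys(records, basis_field, evidence_fields))
    return {basis: " ".join(evidence for _, evidence in group)
            for basis, group in groupby(keys, key=lambda key: key[0])}


def index_pseudo_documents(pseudo_documents):
    """Simple inverted index: term -> set of basis values (classes)."""
    index = {}
    for basis, text in pseudo_documents.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(basis)
    return index


records = [
    {"lc_class": "Z699", "title": "Online catalogs", "subjects": "Information retrieval"},
    {"lc_class": "QA76", "title": "Cluster analysis algorithms", "subjects": "Computer science"},
    {"lc_class": "Z699", "title": "Automatic indexing", "subjects": "Indexing"},
]

pseudo_documents = build_pseudo_documents(records, "lc_class", ["title", "subjects"])
index = index_pseudo_documents(pseudo_documents)
print(pseudo_documents)
print(sorted(index["indexing"]))
```

Because the evidence for all records in a class is pooled, a class can be found through any term that appears in any of its records.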
Cheshire II - Two-Stage Retrieval
Using the LC Classification System
» Pseudo-Document created for each LC class containing terms derived from “content-rich” portions of documents in that class (subject headings, titles, etc.)
» Permits searching by any term in the class
» Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first.
» User selects classes to feed back for the “second stage” search of documents.
Can be used with any classified/Indexed collection.
Probabilistic Retrieval: Logistic Regression
Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.
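Log odds of relevance is a linear function of attributes:
$$\log O(R \mid q_i, d_j, t_k) = c_0 + c_1 v_1 + c_2 v_2 + \dots + c_n v_n$$
Term contributions summed:
$$\log O(R \mid q_i, d_j) = \sum_{k=1}^{m} \left[ \log O(R \mid q_i, d_j, t_k) - \log O(R) \right]$$
Probability of Relevance is inverse of log odds:
$$P(R \mid q_i, d_j) = \frac{1}{1 + e^{-\log O(R \mid q_i, d_j)}}$$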
Probabilistic Retrieval: Logistic Regression
In Cheshire II probability of relevance is based on logistic regression from a sample set of TREC documents to determine values of the coefficients.
At retrieval the probability estimate is obtained from the log odds:
$$\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i, \qquad P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}$$
for 6 attributes or “clues” about term usage in the documents and the query.
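As a minimal sketch of the retrieval-time estimate, the function below forms the linear combination of the clues, treats it as log odds, and converts it to a probability with the logistic transform. The coefficient, intercept and clue values are hypothetical placeholders for illustration; they are not the coefficients fitted for Cheshire II.

```python
import math


def relevance_probability(clues, coefficients, intercept):
    """Logistic transform of intercept + sum(c_i * X_i)."""
    if len(clues) != len(coefficients):
        raise ValueError("need exactly one coefficient per clue")
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, clues))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical values only; not the fitted Cheshire II coefficients.
example_coefficients = [0.5, -0.1, 0.3, -0.05, 0.2, 1.0]
example_intercept = -3.0
example_clues = [0.35, 1.73, 0.55, 3.0, 1.1, 0.69]

print(relevance_probability(example_clues, example_coefficients, example_intercept))
```

Documents are then ranked by this estimate, so only the ordering of the values matters for presenting the best matches first.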
Probabilistic Retrieval: Logistic Regression
Attributes:
$X_1 = \dfrac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$ : Average Absolute Query Frequency
$X_2 = \sqrt{QL}$ : Query Length
$X_3 = \dfrac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$ : Average Absolute Document Frequency
$X_4 = \sqrt{DL}$ : Document Length
$X_5 = \dfrac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$ : Average Inverse Document Frequency
$IDF = \dfrac{N - n_{t_j}}{n_{t_j}}$ : Inverse Document Frequency
$X_6 = \log M$ : Number of Terms in common between query and document (M), logged
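The six clues can be computed from simple term statistics, as in the sketch below. Several details are assumptions made for this illustration rather than facts from the slides: query and document lengths are counted as total term occurrences, square-root damping is applied to both lengths, natural logarithms are used, and every matching term is assumed to occur in fewer than all N documents. The function and argument names are hypothetical.

```python
import math


def compute_clues(query_tf, doc_tf, doc_freq, n_docs):
    """Return [X1..X6] for one query/document pair, or None if no terms match.

    query_tf / doc_tf: term -> absolute frequency in the query / document.
    doc_freq: term -> number of documents containing the term (n_t).
    n_docs: number of documents in the collection (N).
    """
    common = [term for term in query_tf if term in doc_tf]
    m = len(common)
    if m == 0:
        return None

    x1 = sum(math.log(query_tf[t]) for t in common) / m   # avg. log query freq.
    x2 = math.sqrt(sum(query_tf.values()))                # query length (assumed sqrt)
    x3 = sum(math.log(doc_tf[t]) for t in common) / m     # avg. log document freq.
    x4 = math.sqrt(sum(doc_tf.values()))                  # document length (assumed sqrt)
    x5 = sum(math.log((n_docs - doc_freq[t]) / doc_freq[t])
             for t in common) / m                         # avg. log IDF
    x6 = math.log(m)                                      # logged number of matches
    return [x1, x2, x3, x4, x5, x6]


# Hypothetical statistics for one query and one document.
query_tf = {"cluster": 1, "retrieval": 2}
doc_tf = {"cluster": 3, "retrieval": 1, "catalog": 5}
doc_freq = {"cluster": 10, "retrieval": 50}

print(compute_clues(query_tf, doc_tf, doc_freq, n_docs=100))
```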
Cheshire II Demo
Examples from the Unfamiliar Metadata Project
Basis for clusters is a normalized Library of Congress Class Number
Evidence is provided by terms from record titles (and subject headings for the “All Languages” set)
Five different training sets (Russian, German, French, Spanish, and All Languages)
Testing cross-language retrieval and classification
Resources
Cheshire II Home page
» http://cheshire.lib.berkeley.edu/
Unfamiliar MetaData Project
» http://info.sims.berkeley.edu/research/metadata
Cross-Language Classification Clusters
» http://sherlock.berkeley.edu/Language
References
C.J. van Rijsbergen, Information Retrieval, 2nd edition. London: Butterworths, 1979.
» Available as http://sherlock.berkeley.edu/IS205/IR_CJVR
Ray R. Larson, “Experiments in Automatic Library of Congress Classification”. Journal of the American Society for Information Science, 43(2) 130-148, 1992.
M.E. Maron, “Automatic Indexing: An Experimental Inquiry”. Journal of the ACM, 8 404-417, July 1961.
Alan Griffiths, Claire Luckhurst & Peter Willett, “Using Interdocument Similarity Information in Document Retrieval Systems”. JASIS, 37(1) 3-11, 1986.
Christian Plaunt & Barbara Norgard, “An Association Based Method for Automatic Indexing with a Controlled Vocabulary”. To appear in JASIS.
» Preprint available: http://bliss.berkeley.edu/papers/
Ray R. Larson, Jerome McDonough, Lucy Kuntz, Paul O’Leary & Ralph Moon, “Cheshire II: Designing a Next-Generation Online Catalog”. JASIS, 47(7) 555-567, 1996.