
CS 430 / INFO 430 Information Retrieval

Lecture 27

Classification 2


Course Administration


Cluster Analysis


Methods that divide a set of n objects into m non-overlapping subsets.

For information discovery, cluster analysis is applied to

• terms for thesaurus construction

• documents, to divide them into categories (sometimes called automatic classification, although classification usually requires a pre-determined set of categories).


Cluster Analysis Metrics

Documents clustered on the basis of a similarity measure calculated from the terms that they contain.

Documents clustered on the basis of co-occurring citations.

Terms clustered on the basis of the documents in which they co-occur.


Non-hierarchical and Hierarchical Methods

Non-hierarchical methods

Elements are divided into m non-overlapping sets where m is predetermined.

Hierarchical methods

m is varied progressively to create a hierarchy of solutions.

Agglomerative methods

m is initially equal to n, the total number of elements, where every element is considered to be a cluster with one element.

The hierarchy is produced by incrementally combining clusters.


Simple Hierarchical Methods: Single Link

Concept: the similarity between two clusters is the similarity between their most similar elements.

[Figure: points (x) grouped into clusters]


Single Link


A simple agglomerative method.

Initially, each element is its own cluster with one element.

At each step, calculate the similarity between each pair of clusters as the similarity between the most similar pair of elements, one from each cluster. Merge the two clusters that are most similar.

May lead to long, straggling clusters (chaining).

Very simple computation.


Similarities: Incidence array

D1: alpha bravo charlie delta echo foxtrot golf

D2: golf golf golf delta alpha

D3: bravo charlie bravo echo foxtrot bravo

D4: foxtrot alpha alpha golf golf delta

      alpha    bravo    charlie  delta    echo     foxtrot  golf
D1    1        1        1        1        1        1        1
D2    1        0        0        1        0        0        1
D3    0        1        1        0        1        1        0
D4    1        0        0        1        0        1        1
n     3        2        2        3        2        3        3
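The incidence array can be derived mechanically from the four documents. The following is a minimal sketch in Python; the names `docs` and `incidence_array` are illustrative and not part of the lecture.

```python
# The four example documents from the slide, as whitespace-separated terms.
docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

def incidence_array(docs):
    """Return (terms, rows), where rows[d][i] is 1 if terms[i] occurs in document d."""
    terms = sorted({t for text in docs.values() for t in text.split()})
    rows = {}
    for name, text in docs.items():
        present = set(text.split())  # repeated terms count only once
        rows[name] = [1 if t in present else 0 for t in terms]
    return terms, rows

terms, rows = incidence_array(docs)
# n[i] is the number of documents that contain terms[i].
n = [sum(row[i] for row in rows.values()) for i in range(len(terms))]

print("  ", terms)
for name, row in rows.items():
    print(name, row)
print("n ", n)
```

Note that the array records only presence or absence, so the three occurrences of golf in D2 contribute a single 1.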


Term similarity matrix

          alpha    bravo    charlie  delta    echo     foxtrot  golf
alpha     -        0.2      0.2      0.5      0.2      0.33     0.5
bravo              -        0.5      0.2      0.5      0.4      0.2
charlie                     -        0.2      0.5      0.4      0.2
delta                                -        0.2      0.33     0.5
echo                                          -        0.4      0.2
foxtrot                                                -        0.33
golf                                                            -

Using the incidence matrix and Dice weighting
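The slide does not spell out the weighting formula. The tabulated values are consistent with s(i, j) = c / (n_i + n_j), where c is the number of documents containing both terms and n_i, n_j are the document counts from the incidence array. That is half the usual Dice coefficient 2c / (n_i + n_j). Treating this as the formula is an inference from the numbers, not something the slide states. A minimal Python sketch under that assumption (the names `postings`, `slide_similarity` and `dice` are illustrative):

```python
from itertools import combinations

# Term -> set of documents containing it, read off the incidence array.
postings = {
    "alpha":   {"D1", "D2", "D4"},
    "bravo":   {"D1", "D3"},
    "charlie": {"D1", "D3"},
    "delta":   {"D1", "D2", "D4"},
    "echo":    {"D1", "D3"},
    "foxtrot": {"D1", "D3", "D4"},
    "golf":    {"D1", "D2", "D4"},
}

def slide_similarity(a, b):
    """Co-occurrence count divided by the sum of the two document counts (assumed formula)."""
    shared = len(postings[a] & postings[b])
    return shared / (len(postings[a]) + len(postings[b]))

def dice(a, b):
    """Standard Dice coefficient, exactly twice slide_similarity."""
    return 2 * slide_similarity(a, b)

# Upper triangle of the matrix, rounded to two decimals as on the slide.
sim = {
    pair: round(slide_similarity(*pair), 2)
    for pair in combinations(sorted(postings), 2)
}

for (a, b), s in sim.items():
    print(f"{a:8s} {b:8s} {s}")
```

Since the two measures differ only by a constant factor, they rank term pairs identically, so the clustering results are unaffected by the choice.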


Example -- single link

[Figure: dendrogram over the terms alpha, delta, golf, bravo, echo, charlie, foxtrot, with merge 1 marked]

Agglomerative: step 1


Example -- single link

[Figure: dendrogram over the terms alpha, delta, golf, bravo, echo, charlie, foxtrot, with merges 1 and 2 marked]

Agglomerative: step 2


Example -- single link

[Figure: dendrogram over the terms alpha, delta, golf, bravo, echo, charlie, foxtrot, with merges 1, 2 and 3 marked]

Agglomerative: step 3


Example -- single link

[Figure: complete dendrogram over the terms alpha, delta, golf, bravo, echo, charlie, foxtrot, with merges 1 to 6 marked]

This style of diagram is called a dendrogram.
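The single-link procedure can be sketched in a few lines of Python. This is an illustrative, unoptimized version that takes the values of the term similarity matrix as given; the names `TERMS`, `UPPER`, `SIM`, `single_link` and `agglomerate` are assumptions made for the example. Several pairs tie at 0.5, and the sketch breaks ties by taking the first pair it encounters, so the order of the early merges may differ from the numbering in the figures.

```python
from itertools import combinations

TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
# Upper triangle of the term similarity matrix, row by row.
UPPER = [
    [0.2, 0.2, 0.5, 0.2, 0.33, 0.5],  # alpha   vs bravo .. golf
    [0.5, 0.2, 0.5, 0.4, 0.2],        # bravo   vs charlie .. golf
    [0.2, 0.5, 0.4, 0.2],             # charlie vs delta .. golf
    [0.2, 0.33, 0.5],                 # delta   vs echo .. golf
    [0.4, 0.2],                       # echo    vs foxtrot, golf
    [0.33],                           # foxtrot vs golf
]
SIM = {}
for i, row in enumerate(UPPER):
    for offset, value in enumerate(row):
        SIM[frozenset((TERMS[i], TERMS[i + 1 + offset]))] = value

def single_link(c1, c2):
    """Cluster similarity = similarity of the most similar cross-cluster pair."""
    return max(SIM[frozenset((a, b))] for a in c1 for b in c2)

def agglomerate(items, cluster_sim):
    """Repeatedly merge the most similar pair of clusters until one remains."""
    clusters = [frozenset([x]) for x in items]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        merges.append((sorted(c1), sorted(c2), cluster_sim(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

merges = agglomerate(TERMS, single_link)
for step, (left, right, s) in enumerate(merges, start=1):
    print(step, left, "+", right, "at similarity", s)
```

Recomputing every pairwise cluster similarity at each step is wasteful, but it keeps the correspondence with the definition easy to see.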


Simple Hierarchical Methods: Complete Linkage

Concept: the similarity between two clusters is the similarity between their least similar elements.

[Figure: points (x) grouped into clusters]


Complete linkage


A simple agglomerative method.

Initially, each element is its own cluster with one element.

At each step, calculate the similarity between each pair of clusters as the similarity between the least similar pair of elements in the two clusters. Merge the two clusters that are most similar.

Generates small, tightly bound clusters.


Term similarity matrix

          alpha    bravo    charlie  delta    echo     foxtrot  golf
alpha     -        0.2      0.2      0.5      0.2      0.33     0.5
bravo              -        0.5      0.2      0.5      0.4      0.2
charlie                     -        0.2      0.5      0.4      0.2
delta                                -        0.2      0.33     0.5
echo                                          -        0.4      0.2
foxtrot                                                -        0.33
golf                                                            -

Using the incidence matrix and Dice weighting


Example – complete linkage

Entries: least similar pair / similarity

Cluster   a        b        c        d        e        f        g
a         -        ab/.2    ac/.2    ad/.5    ae/.2    af/.33   ag/.5
b                  -        bc/.5    bd/.2    be/.5    bf/.4    bg/.2
c                           -        cd/.2    ce/.5    cf/.4    cg/.2
d                                    -        de/.2    df/.33   dg/.5
e                                             -        ef/.4    eg/.2
f                                                      -        fg/.33
g                                                               -

Step 1. Merge clusters {a} and {d}


Example – complete linkage

Entries: least similar pair / similarity

Cluster   a,d      b        c        e        f        g
a,d       -        ab/.2    ac/.2    ae/.2    df/.33   ag/.5
b                  -        bc/.5    be/.5    bf/.4    bg/.2
c                           -        ce/.5    cf/.4    cg/.2
e                                    -        ef/.4    eg/.2
f                                             -        fg/.33
g                                                      -

Step 2. Merge clusters {a,d} and {g}


Example – complete linkage

Entries: least similar pair / similarity

Cluster   a,d,g    b        c        e        f
a,d,g     -        ab/.2    ac/.2    ae/.2    af/.33
b                  -        bc/.5    be/.5    bf/.4
c                           -        ce/.5    cf/.4
e                                    -        ef/.4
f                                             -

Step 3. Merge clusters {b} and {c}


Example – complete linkage

Entries: least similar pair / similarity

Cluster   a,d,g    b,c      e        f
a,d,g     -        ab/.2    ae/.2    af/.33
b,c                -        be/.5    bf/.4
e                           -        ef/.4
f                                    -

Step 4. Merge clusters {b,c} and {e}


Example -- complete linkage

[Figure: dendrogram over the terms alpha, delta, golf, bravo, charlie, echo, foxtrot, with the merges labeled Step 1 to Step 6]
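The complete-linkage steps can be reproduced with a short Python sketch. It is illustrative only: the names `TERMS`, `UPPER`, `SIM`, `complete_link` and `agglomerate` are assumptions, and ties at 0.5 are broken by taking the first pair encountered, so the order of the first four merges need not match the step numbering in the example.

```python
from itertools import combinations

TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
# Upper triangle of the term similarity matrix, row by row.
UPPER = [
    [0.2, 0.2, 0.5, 0.2, 0.33, 0.5],  # alpha   vs bravo .. golf
    [0.5, 0.2, 0.5, 0.4, 0.2],        # bravo   vs charlie .. golf
    [0.2, 0.5, 0.4, 0.2],             # charlie vs delta .. golf
    [0.2, 0.33, 0.5],                 # delta   vs echo .. golf
    [0.4, 0.2],                       # echo    vs foxtrot, golf
    [0.33],                           # foxtrot vs golf
]
SIM = {}
for i, row in enumerate(UPPER):
    for offset, value in enumerate(row):
        SIM[frozenset((TERMS[i], TERMS[i + 1 + offset]))] = value

def complete_link(c1, c2):
    """Cluster similarity = similarity of the least similar cross-cluster pair."""
    return min(SIM[frozenset((a, b))] for a in c1 for b in c2)

def agglomerate(items, cluster_sim):
    """Repeatedly merge the most similar pair of clusters until one remains."""
    clusters = [frozenset([x]) for x in items]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        merges.append((sorted(c1), sorted(c2), cluster_sim(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

merges = agglomerate(TERMS, complete_link)
for step, (left, right, s) in enumerate(merges, start=1):
    print(step, left, "+", right, "at similarity", s)
```

The only difference from a single-link version is `min` in place of `max` inside the cluster similarity. On this data the final merge joins {alpha, delta, golf} with the remaining terms at similarity 0.2, whereas single link would join them at 0.33.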


Non-Hierarchical Methods: K-means

1 Define a similarity measure between any two points in the space (e.g., square of distance).

2 Choose k points as initial group centroids.

3 Assign each object to the group that has the closest centroid.

4 When all objects have been assigned, recalculate the positions of the k centroids.

5 Repeat Steps 3 and 4 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
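The five steps above can be sketched in plain Python as follows. This is a minimal illustration, not an optimized implementation. It assumes points are tuples of numbers, uses the squared Euclidean distance as the measure from Step 1, and takes the initial centroids as an argument; the sample points and the names `squared_distance`, `mean` and `k_means` are invented for the example.

```python
def squared_distance(p, q):
    """Step 1: squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, initial_centroids, max_iterations=100):
    centroids = list(initial_centroids)          # Step 2: initial group centroids
    groups = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Step 3: assign each object to the group with the closest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            groups[nearest].append(p)
        # Step 4: recalculate the centroids; an empty group keeps its old centroid.
        updated = [mean(g) if g else c for g, c in zip(groups, centroids)]
        # Step 5: stop when the centroids no longer move.
        if updated == centroids:
            break
        centroids = updated
    return centroids, groups

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, groups = k_means(points, initial_centroids=[points[0], points[3]])
print(centroids)
print(groups)
```

The `max_iterations` cap and the rule for empty groups are design choices of this sketch; the slide does not specify either.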


K-means

• Iteration converges under a very general set of conditions

• Results depend on the choice of the k initial centroids

• Methods can be used to generate a sequence of solutions for k increasing from 1 to n. Note that, in general, the results will not be hierarchical.


Problems with cluster analysis in information retrieval

Selection of attributes on which items are clustered

Choice of similarity measure and algorithm

Computational resources

Assessing validity and stability of clusters

Updating clusters as data changes

Method for using the clusters in information retrieval


Example 1: Concept Spaces for Scientific Terms

Large-scale searches can only match terms specified by the user to terms appearing in documents. Cluster analysis can be used to provide information retrieval by concepts, rather than by terms.

Bruce Schatz, William H. Mischo, Timothy W. Cole, Joseph B. Hardin, Ann P. Bishop (University of Illinois), Hsinchun Chen (University of Arizona), Federating Diverse Collections of Scientific Literature, IEEE Computer, May 1996.


Concept Spaces: Methodology

Concept space:

A similarity matrix based on co-occurrence of terms.

Approach:

Use cluster analysis to generate "concept spaces" automatically, i.e., clusters of terms that embrace a single semantic concept.

Arrange concepts in a hierarchical classification.


Concept Spaces: INSPEC Data

Data set 1: All terms in 400,000 records from INSPEC, containing 270,000 terms with 4,000,000 links.

[24.5 hours of CPU on 16-node Silicon Graphics supercomputer.]

computer-aided instruction
    see also education
    UF teaching machines
    BT educational computing
    TT computer applications
    RT education
    RT teaching


Concept Space: Compendex Data

Data set 2:

(a) 4,000,000 abstracts from the Compendex database covering all of engineering as the collection, partitioned along classification code lines into some 600 community repositories.

[Four days of CPU on 64-processor Convex Exemplar.]

(b) In the largest experiment, 10,000,000 abstracts were divided into sets of 100,000 and the concept space for each set was generated separately. The sets were selected by the existing classification scheme.


Objectives

• Semantic retrieval (using concept spaces for term suggestion)

• Semantic interoperability (vocabulary switching across subject domains)

• Semantic indexing (concept identification of document content)

• Information representation (information units for uniform manipulation)


Use of Concept Space: Term Suggestion
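As a rough illustration of term suggestion, a concept space can be treated as a lookup from a term to its most similar terms. The sketch below is not the system described in the cited paper. It reuses the seven-term similarity matrix from the clustering examples as a toy concept space, and the names `build_concept_space` and `suggest_terms` and the threshold of 0.4 are assumptions made for the example.

```python
TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
# Upper triangle of the toy term similarity matrix, row by row.
UPPER = [
    [0.2, 0.2, 0.5, 0.2, 0.33, 0.5],
    [0.5, 0.2, 0.5, 0.4, 0.2],
    [0.2, 0.5, 0.4, 0.2],
    [0.2, 0.33, 0.5],
    [0.4, 0.2],
    [0.33],
]

def build_concept_space(terms, upper):
    """Expand the upper triangle into a symmetric term -> {term: similarity} map."""
    space = {t: {} for t in terms}
    for i, row in enumerate(upper):
        for offset, value in enumerate(row):
            a, b = terms[i], terms[i + 1 + offset]
            space[a][b] = value
            space[b][a] = value
    return space

def suggest_terms(space, term, threshold=0.4):
    """Return (term, similarity) pairs at or above the threshold, most similar first."""
    related = [(other, s) for other, s in space[term].items() if s >= threshold]
    return sorted(related, key=lambda pair: (-pair[1], pair[0]))

space = build_concept_space(TERMS, UPPER)
print(suggest_terms(space, "bravo"))
# [('charlie', 0.5), ('echo', 0.5), ('foxtrot', 0.4)]
```

A user who searches for bravo would thus be offered charlie, echo and foxtrot as additional query terms.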


Future Use of Concept Space: Vocabulary Switching

"I'm a civil engineer who designs bridges. I'm interested in using fluid dynamics to compute the structural effects of wind currents on long structures. Ocean engineers who design undersea cables probably do similar computations for the structural effects of water currents on long structures. I want you [the system] to change my civil engineering fluid dynamics terms into the ocean engineering terms and search the undersea cable literature."


Example 2: Visual thesaurus for geographic images

Methodology:

• Divide images into small regions.

• Create a similarity measure based on properties of these images.

• Use cluster analysis tools to generate clusters of similar images.

• Provide alternative representations of clusters.

Marshall Ramsey, Hsinchun Chen, Bin Zhu, A Collection of Visual Thesauri for Browsing Large Collections of Geographic Images, May 1997. http://ai.bpa.arizona.edu/~mramsey/papers/visualThesaurus/visualThesaurus.html


The End

[Figure labels: Search index, Return hits, Browse content, Return objects, Scan results]