CS 430: Information Discovery, Lecture 23: Cluster Analysis 2, Thesaurus Construction

CS 430: Information Discovery

Lecture 23

Cluster Analysis 2

Thesaurus Construction

Course Administration

Next week

• Guest lecture on Thursday, Thorsten Joachims.

Final examination

• The final examination will include questions on all lectures, including the guest lectures, and the readings for the discussion classes.

• Examination date: Wednesday, December 18, 12:00 noon - 1:30 p.m.

• Early examination: Thursday, December 12, 12:00 noon - 1:30 p.m. Contact Anat Nidar-Levi if you plan to take the early examination.

Example 2: Concept Spaces for Scientific Terms

Large-scale searches can only match terms specified by the user to terms appearing in documents. Cluster analysis can be used to provide information retrieval by concepts, rather than by terms.

Bruce Schatz, William H. Mischo, Timothy W. Cole, Joseph B. Hardin, Ann P. Bishop (University of Illinois), Hsinchun Chen (University of Arizona), Federating Diverse Collections of Scientific Literature, IEEE Computer, May 1996.

Concept Spaces: Methodology

Concept space:

A similarity matrix based on co-occurrence of terms.

Approach:

Use cluster analysis to generate "concept spaces" automatically, i.e., clusters of terms that embrace a single semantic concept.

Arrange concepts in a hierarchical classification.
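A similarity matrix built from term co-occurrence can be sketched as follows. This is a minimal illustration, not the method used by Schatz et al.: the slides do not name a similarity measure, so the Jaccard coefficient over document sets is an assumption, and `cooccurrence_similarity` and the sample documents are hypothetical.

```python
from itertools import combinations

def cooccurrence_similarity(docs):
    """Return {(term_a, term_b): similarity} for every pair of terms that co-occur.

    Assumption: similarity is the Jaccard coefficient of the sets of
    documents containing each term. The slides do not name a measure.
    """
    # Map each term to the set of document indices in which it appears.
    postings = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            postings.setdefault(term, set()).add(doc_id)

    similarity = {}
    for a, b in combinations(sorted(postings), 2):
        shared = postings[a] & postings[b]
        if shared:  # only record pairs that co-occur in at least one document
            similarity[(a, b)] = len(shared) / len(postings[a] | postings[b])
    return similarity

# Hypothetical toy collection.
docs = [
    "fluid dynamics wind bridge",
    "fluid dynamics water cable",
    "bridge wind load",
]
sim = cooccurrence_similarity(docs)
```

Terms that always appear together ("fluid" and "dynamics" here) receive the highest score; clustering such a matrix is what groups terms into concepts.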

Concept Spaces: INSPEC Data

Data set 1: All terms in 400,000 records from INSPEC, containing 270,000 terms with 4,000,000 links.

[24.5 hours of CPU on 16-node Silicon Graphics supercomputer.]

computer-aided instruction
    see also education
    UF teaching machines
    BT educational computing
    TT computer applications
    RT education
    RT teaching

Concept Space: Compendex Data

Data set 2:

(a) 4,000,000 abstracts from the Compendex database covering all of engineering as the collection, partitioned along classification code lines into some 600 community repositories.

[Four days of CPU on 64-processor Convex Exemplar.]

(b) In the largest experiment, 10,000,000 abstracts were divided into sets of 100,000 and the concept space for each set was generated separately. The sets were selected by the existing classification scheme.

Objectives

• Semantic retrieval (using concept spaces for term suggestion)

• Semantic interoperability (vocabulary switching across subject domains)

• Semantic indexing (concept identification of document content)

• Information representation (information units for uniform manipulation)

Use of Concept Space: Term Suggestion

Future Use of Concept Space: Vocabulary Switching

"I'm a civil engineer who designs bridges. I'm interested in using fluid dynamics to compute the structural effects of wind currents on long structures. Ocean engineers who design undersea cables probably do similar computations for the structural effects of water currents on long structures. I want you [the system] to change my civil engineering fluid dynamics terms into the ocean engineering terms and search the undersea cable literature."

Example 3: Visual thesaurus for browsing large collections of geographic images

Methodology:

• Divide images into small regions.

• Create a similarity measure based on properties of these images.

• Use cluster analysis tools to generate clusters of similar images.

• Provide alternative representations of clusters.

Marshall Ramsey, Hsinchun Chen, Bin Zhu, A Collection of Visual Thesauri for Browsing Large Collections of Geographic Images, May 1997. (http://ai.bpa.arizona.edu/~mramsey/papers/visualThesaurus/visualThesaurus.html)

Information Visualization

The human eye is excellent at identifying patterns in graphical data.

• Trends in time-dependent data.

• Broad patterns in complex data.

• Anomalies in scientific data.

• Visualizing information spaces for browsing.

Pad++

Concept. A large collection of information viewed at many different scales. Imagine a collection of documents spread out on an enormous wall.

Zoom. Zoom out and see the whole collection with little detail. Zoom in part way to see sections of the collection. Zoom in to see every detail.

Semantic Zooming. Objects change appearance when they change size, so as to be most meaningful. (Compare maps.)

Performance. Rendering operations timed so that the frame refresh rate remains constant during pans and zooms.

Pad++ File Browser

Example: Tilebars

The figure represents a set of hits from a text search.

Each large rectangle represents a document or section of text.

Each row represents a search term or subquery.

The density of each small square indicates the frequency with which a term appears in a section of a document.

Hearst 1995
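The data behind one tilebar can be sketched as a small grid of counts. This is an illustrative sketch only, not Hearst's implementation: the function names `tilebar` and `render`, the whitespace tokenization, and the character shading scale are all assumptions.

```python
def tilebar(segments, query_terms):
    """Rows are query terms, columns are text segments.

    Each cell holds how often the term occurs in that segment.
    """
    grid = []
    for term in query_terms:
        grid.append([seg.lower().split().count(term.lower()) for seg in segments])
    return grid

def render(grid, shades=" .:#"):
    """Draw each count as a character; denser characters mean higher frequency."""
    return ["".join(shades[min(count, len(shades) - 1)] for count in row) for row in grid]

# Hypothetical document split into three segments.
segments = [
    "cluster analysis of terms",
    "terms terms and more terms",
    "nothing relevant here",
]
grid = tilebar(segments, ["terms", "cluster"])
lines = render(grid)
```

Each string in `lines` corresponds to one row of the rectangle, so a reader can see at a glance which segments mention which query terms.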

Self Organizing Maps (SOM)

Automatic Thesaurus Construction

Approach

• Select a subject domain.

• Choose a corpus of documents that cover the domain.

• Create a vocabulary by extracting terms, normalization, precoordination of phrases, etc.

• Devise a measure of similarity between terms and thesaurus classes.

• Cluster terms into thesaurus classes, using complete linkage or another clustering method that generates compact clusters.
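The final clustering step can be sketched with a small complete-linkage routine. This is a minimal sketch under stated assumptions: the stopping threshold, the function name `complete_linkage`, and the toy similarity values are hypothetical and not taken from the lecture.

```python
def complete_linkage(items, similarity, threshold):
    """Agglomerative clustering with complete linkage.

    Two clusters are as similar as their LEAST similar pair of members,
    which favors compact clusters. Merging stops once the best available
    merge falls below `threshold`.
    """
    clusters = [[item] for item in items]

    def link(c1, c2):
        return min(similarity(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = link(clusters[i], clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        if score < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical term-term similarities; unlisted pairs count as 0.
toy = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "vehicle")): 0.6,
    frozenset(("automobile", "vehicle")): 0.7,
    frozenset(("bridge", "cable")): 0.5,
}

def toy_similarity(a, b):
    return toy.get(frozenset((a, b)), 0.0)

classes = complete_linkage(
    ["car", "automobile", "vehicle", "bridge", "cable"], toy_similarity, 0.55
)
```

With these values, "vehicle" joins the "car"/"automobile" class only because its weakest link to that class (0.6) still clears the threshold, while "bridge" and "cable" (0.5) stay separate.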

Decisions in creating a thesaurus

1. Which terms should be included in the thesaurus?

2. How should the terms be grouped?

Terms to include

• Only terms that are likely to be of interest for content identification

• Ambiguous terms should be coded for the senses likely to be important in the document collection

• Each thesaurus class should have approximately the same frequency of occurrence

• Terms of negative discrimination should be eliminated

after Salton and McGill

Discriminant value

Discriminant value is the degree to which a term is able to discriminate between the documents of a collection:

discriminant value of term k = (average document similarity without term k) - (average document similarity with term k)

Good discriminators decrease the average document similarity, so removing a good discriminator raises the average and its discriminant value is positive.

Note that this definition uses the document similarity.

Incidence array

D1: alpha bravo charlie delta echo foxtrot golf

D2: golf golf golf delta alpha

D3: bravo charlie bravo echo foxtrot bravo

D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf   terms
D1      1      1       1       1     1       1      1       7
D2      1      0       0       1     0       0      1       3
D3      0      1       1       0     1       1      0       4
D4      1      0       0       1     0       1      1       4
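The incidence array can be derived from the four documents as sketched below. The function name `incidence_array` is hypothetical; the documents are the ones on the slide.

```python
docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

def incidence_array(docs):
    """Return (sorted vocabulary, {doc: [0/1 per term]}).

    Only presence is recorded; repeated terms within a document count once.
    """
    vocab = sorted({term for text in docs.values() for term in text.split()})
    rows = {}
    for name, text in docs.items():
        present = set(text.split())
        rows[name] = [1 if term in present else 0 for term in vocab]
    return vocab, rows

vocab, rows = incidence_array(docs)
# Number of distinct terms per document.
counts = {name: sum(row) for name, row in rows.items()}
```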

Document similarity matrix

       D1     D2     D3     D4
D1      -    0.65   0.76   0.76
D2    0.65     -    0.00   0.87
D3    0.76   0.00     -    0.25
D4    0.76   0.87   0.25     -

Average similarity = 0.55
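The slides do not name the similarity measure, but the values in the matrix are consistent with cosine similarity over the binary incidence vectors. The sketch below makes that assumption explicit; `cosine` and `average_similarity` are hypothetical helper names.

```python
import math

# Binary incidence vectors, in the order
# alpha, bravo, charlie, delta, echo, foxtrot, golf.
incidence = {
    "D1": [1, 1, 1, 1, 1, 1, 1],
    "D2": [1, 0, 0, 1, 0, 0, 1],
    "D3": [0, 1, 1, 0, 1, 1, 0],
    "D4": [1, 0, 0, 1, 0, 1, 1],
}

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def average_similarity(vectors):
    """Mean cosine similarity over all distinct pairs of documents."""
    names = list(vectors)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

average = average_similarity(incidence)
```

For example, D1 and D2 share 3 terms and contain 7 and 3 terms respectively, giving 3 / sqrt(7 * 3), which rounds to 0.65.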

Discriminant value

Average similarity = 0.55

Term removed   Average similarity    DV
alpha                 0.53          -0.02
bravo                 0.56          +0.01
charlie               0.56          +0.01
delta                 0.53          -0.02
echo                  0.56          +0.01
foxtrot               0.52          -0.03
golf                  0.53          -0.02

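By the definition of discriminant value, the terms with positive DV (bravo, charlie, echo) are the good discriminators: removing any of them raises the average document similarity. Removing alpha, delta, foxtrot, or golf lowers it.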
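The "average similarity without term" column can be reproduced by deleting one term at a time and recomputing the average. This sketch assumes cosine similarity over binary incidence vectors, which matches the numbers on the slides but is not stated there; `discriminant_values` is a hypothetical name.

```python
import math
from itertools import combinations

terms = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
incidence = {
    "D1": [1, 1, 1, 1, 1, 1, 1],
    "D2": [1, 0, 0, 1, 0, 0, 1],
    "D3": [0, 1, 1, 0, 1, 1, 0],
    "D4": [1, 0, 0, 1, 0, 1, 1],
}

def average_similarity(vectors):
    """Mean cosine similarity over all distinct pairs of documents."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0
    pairs = list(combinations(vectors.values(), 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def discriminant_values(terms, incidence):
    """Return (baseline average, {term: (average without term, DV)})."""
    baseline = average_similarity(incidence)
    values = {}
    for k, term in enumerate(terms):
        # Drop column k from every document vector.
        reduced = {doc: row[:k] + row[k + 1:] for doc, row in incidence.items()}
        without = average_similarity(reduced)
        values[term] = (without, without - baseline)
    return baseline, values

baseline, values = discriminant_values(terms, incidence)
```

The DV column on the slide appears to be the difference of the two-decimal averages, so the unrounded differences computed here can differ from it in the second decimal place; the signs agree.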

Phrase construction

In a thesaurus, term classes may contain phrases.

Informal definitions:

pair-frequency (i, j) is the frequency with which a pair of words occurs in context (e.g., in succession within a sentence)

phrase is a pair of words, i and j, that occur in context with a higher frequency than would be expected from their overall frequencies

cohesion (i, j) = pair-frequency (i, j) / (frequency(i) * frequency(j))
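A short worked example shows why the pair frequency is divided by the individual frequencies. The counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
def cohesion(pair_frequency, frequency_i, frequency_j):
    """Pair frequency divided by the product of the individual word frequencies."""
    return pair_frequency / (frequency_i * frequency_j)

# Two pairs that each occur together 8 times (hypothetical counts).
rare_words = cohesion(8, 10, 20)       # the words seldom appear apart
common_words = cohesion(8, 100, 200)   # the words mostly appear apart
```

Both pairs have the same pair frequency, but the pair made of rarer words has a cohesion 100 times higher, which marks it as the more likely phrase.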

Phrase construction

Salton and McGill algorithm

1. Compute pair-frequency for all terms.

2. Reject all pairs that fall below a certain threshold.

3. Calculate cohesion values.

4. If cohesion is above a threshold value, consider the word pair as a phrase.
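The four steps can be sketched as follows. This is an illustration rather than Salton and McGill's implementation: "context" is taken to mean adjacent words within a sentence, and the function name `find_phrases`, the thresholds, and the sample sentences are assumptions.

```python
from collections import Counter

def find_phrases(sentences, min_pair_frequency, min_cohesion):
    """Return {(word_i, word_j): cohesion} for pairs accepted as phrases."""
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        word_freq.update(words)
        # Step 1: count pairs of adjacent words within each sentence.
        pair_freq.update(zip(words, words[1:]))

    phrases = {}
    for (i, j), count in pair_freq.items():
        # Step 2: reject pairs below the pair-frequency threshold.
        if count < min_pair_frequency:
            continue
        # Step 3: cohesion = pair frequency / (frequency(i) * frequency(j)).
        value = count / (word_freq[i] * word_freq[j])
        # Step 4: keep the pair if cohesion is above the threshold.
        if value > min_cohesion:
            phrases[(i, j)] = value
    return phrases

# Hypothetical toy corpus.
sentences = [
    "information retrieval uses cluster analysis",
    "cluster analysis groups similar terms",
    "the analysis of information is hard",
    "information retrieval is the topic",
]
phrases = find_phrases(sentences, min_pair_frequency=2, min_cohesion=0.2)
```

Because this cohesion value is not normalized for collection size, suitable thresholds depend on the corpus and would need tuning in practice.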

Automatic phrase construction by statistical methods is rarely used in practice. There is promising research on phrase identification using methods of computational linguistics.