CS/INFO 430 Information Retrieval, Lecture 15: Metadata 2




Course Administration

Discussion Class on October 19

This class will be held in Phillips Hall 213


Assignment 2 and Midterm

Grades will be mailed as soon as possible.

Midterm

This was well done.

Average grade was 26.

Range was from 21 to 30.


Midterm: Common Mistake 1

Question: Define precision-recall graph

Answer: The basic precision-recall graph applies to the results of a single query using ranked searching. For each value of r, from one to the number of hits returned, it plots precision against recall for the first r hits.
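The definition in the answer can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name, the document identifiers, and the relevance judgments are all hypothetical.

```python
def precision_recall_points(ranked_hits, relevant):
    """Return one (recall, precision) pair for the first r hits, r = 1..len(ranked_hits)."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without at least one relevant document")
    points = []
    found = 0  # relevant documents seen among the first r hits
    for r, doc in enumerate(ranked_hits, start=1):
        if doc in relevant:
            found += 1
        points.append((found / len(relevant), found / r))
    return points


# Hypothetical ranked result list and relevance judgments for a single query.
hits = ["d3", "d7", "d1", "d9"]
relevant_docs = {"d3", "d1", "d5"}
for recall, precision in precision_recall_points(hits, relevant_docs):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Plotting the pairs with recall on the horizontal axis gives the basic graph for that one query.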


Midterm: Common Mistake 2

Question: With latent semantic indexing: The dotted lines are described as, "The dotted cone represents the region whose points are within a cosine of 0.9 from the query q." All the documents labeled c1-c5 are within this cone, but none of the documents labeled m1-m4. What does this imply?

Answer: This implies that each of c1-c5 is similar to q, but that m1-m4 are not, using the similarity measure discussed in the paper. It does not imply that c1-c5 are relevant, though clearly that is the hope.
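The cone test can be sketched as a cosine threshold. This assumes "within a cosine of 0.9" means a cosine similarity of at least 0.9 with q; the vectors and function names below are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def within_cone(query, doc, threshold=0.9):
    """True if doc lies inside the cone around query (assumed: cosine >= threshold)."""
    return cosine(query, doc) >= threshold


q = (1.0, 0.0)                      # hypothetical query vector in a 2-D reduced space
print(within_cone(q, (0.9, 0.1)))   # close in direction to q
print(within_cone(q, (0.5, 0.5)))   # 45 degrees away from q
```

The test measures similarity only. Whether a document inside the cone is relevant is a separate judgment, as the answer notes.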


Examples of Non-textual Materials

Content                  Attribute

maps                     lat. and long., content
photograph               subject, date and place
bird songs and images    field mark, bird song
software                 task, algorithm
data set                 survey characteristics
video                    subject, date, etc.


Possible Approaches to Information Discovery for Non-text Materials

Human indexing

• Manually created metadata records

Automated information retrieval

• Automatically created metadata records (e.g., image recognition)

• Context: associated text, links, etc. (e.g., Google image search)

• Multimodal: combine information from several sources

User expertise

• Browsing: user interface design


Catalog Records for Non-Textual Materials

• General metadata standards, such as Dublin Core and MARC, can be used to create a textual catalog record of non-textual items.

• Subject-based metadata standards apply to specific categories of materials, e.g., FGDC for geospatial materials.

• Text-based searching methods can be used to search these catalog records.


Example 1: Photographs

Photographs in the Library of Congress's American Memory collections

In American Memory, each photograph is described by a MARC record.

The photographs are grouped into collections, e.g., The Northern Great Plains, 1880-1920: Photographs from the Fred Hultstrand and F.A. Pazandak Photograph Collections

Information discovery is by:

• searching the catalog records

• browsing the collections


Photographs: Cataloguing Difficulties

Automatic

• Image recognition methods are very primitive

Manual

• Photographic collections can be very large

• Many photographs may show the same subject

• Photographs have little or no internal metadata (no title page)

• The subject of a photograph may not be known (Who are the people in a picture? Where is the location?)


Photographs: Difficulties for Users

Searching

• Often difficult to narrow the selection down by searching -- browsing is required

• Criteria may be different from those in catalog (e.g., graphical characteristics)

Browsing

• Offline. Handling many photographs is tedious. Photographs can be damaged by repeated handling

• Online. Viewing many images can be tedious. Screen quality may be inadequate.


Example 2: Geospatial Information

Example: Alexandria Digital Library at the University of California, Santa Barbara

• Funded by the NSF Digital Libraries Initiative since 1994.

• Collections include any data referenced by a geographical footprint: terrestrial maps, aerial and satellite photographs, astronomical maps, databases, and related textual information.

• Program of research with practical implementation at the university's map library.


Alexandria User Interface


Alexandria: Computer Systems and User Interfaces

Computer systems

• Digitized maps and geospatial information -- large files

• Wavelets provide multi-level decomposition of image

-> first level is a small coarse image

-> extra levels provide greater detail

User interfaces

• Small size of computer displays

• Slow performance of Internet in delivering large files

-> retain state throughout a session
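The multi-level decomposition listed under computer systems can be illustrated with a Haar-style averaging pyramid. This is only a sketch under stated assumptions: it keeps just the coarse approximation at each level (a real wavelet transform also stores detail coefficients), the slides do not say which wavelet Alexandria used, and the tiny image below is made up.

```python
def coarsen(image):
    """Halve each dimension by averaging 2x2 blocks.

    Up to a scale factor this is the approximation band of a 2-D Haar transform.
    Assumes an even number of rows and columns.
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def pyramid(image, levels):
    """Return [coarsest, ..., original]; dimensions must be divisible by 2**levels."""
    stack = [image]
    for _ in range(levels):
        stack.append(coarsen(stack[-1]))
    return stack[::-1]


# Hypothetical 4x4 grayscale image.
image = [
    [1, 1, 3, 3],
    [1, 1, 3, 3],
    [5, 5, 7, 7],
    [5, 5, 7, 7],
]
for level in pyramid(image, 2):
    print(level)
```

A client could display the first, smallest level immediately and fetch the later levels only when the user zooms in, which fits the concerns about small displays and slow delivery of large files.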


Alexandria: Information Discovery

Metadata for information discovery

Coverage: geographical area covered, such as the city of Santa Barbara or the Pacific Ocean.

Scope: varieties of information, such as topographical features, political boundaries, or population density.

Latitude and longitude provide basic metadata for maps and for geographical features.


Gazetteer

Gazetteer: a database and a set of procedures that translate between representations of geospatial references: place names, geographic features, coordinates, postal codes, census tracts.

Search engine tailored to peculiarities of searching for place names.

Research is making steady progress on feature extraction, using automatic programs to identify objects in aerial photographs or printed maps, but this remains a topic for long-term research.


Gazetteers

The Alexandria Digital Library (ADL): geolibrary at University of California at Santa Barbara where a primary attribute of objects is location on Earth (e.g., map, satellite photograph).

Geographic footprint: latitude and longitude values that represent a point, a bounding box, a linear feature, or a complete polygonal boundary.

Gazetteer: list of geographic names, with geographic locations and other descriptive information.

Geographic name: proper name for a geographic place or feature (e.g., Santa Barbara County, Mount Washington, St. Francis Hospital, and Southern California)


Use of a Gazetteer

• Answers the "Where is" question; for example, "Where is Santa Barbara?"

• Translates between geographic names and locations. A user can find objects by matching the footprint of a geographic name to the footprints of the collection objects.

• Locates particular types of geographic features in a designated area. For example, a user can draw a box around an area on a map and find the schools, hospitals, lakes, or volcanoes in the area.
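The second use, matching the footprint of a name to the footprints of collection objects, can be sketched with bounding boxes. Everything concrete here is an assumption for illustration: the class and function names, the object identifiers, and the coordinates, which are rough values rather than real gazetteer data. The sketch also ignores boxes that cross the 180-degree meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    south: float
    west: float
    north: float
    east: float

    def overlaps(self, other):
        """True if the two boxes share any area or boundary."""
        return (
            self.south <= other.north
            and other.south <= self.north
            and self.west <= other.east
            and other.west <= self.east
        )


# Hypothetical gazetteer: geographic name -> footprint (approximate values).
GAZETTEER = {"Santa Barbara": BoundingBox(34.3, -119.9, 34.5, -119.6)}

# Hypothetical collection objects, each described by its footprint.
COLLECTION = {
    "aerial_photo_sb": BoundingBox(34.35, -119.8, 34.45, -119.65),
    "map_los_angeles": BoundingBox(33.7, -118.7, 34.3, -118.1),
}


def find_objects(place_name):
    """Translate a name to its footprint, then return the objects that overlap it."""
    footprint = GAZETTEER[place_name]
    return [obj_id for obj_id, box in COLLECTION.items() if footprint.overlaps(box)]


print(find_objects("Santa Barbara"))
```

The third use, drawing a box on a map to find schools or lakes, is the same overlap test with the user's box in place of the named footprint, plus a filter on feature type.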


Alexandria Gazetteer: Example from a search on "Tulsa"

Feature name                              State  County  Type     Latitude  Longitude

Tulsa                                     OK     Tulsa   pop pl   360914N   0955933W
Tulsa Country Club                        OK     Osage   locale   360958N   0960012W
Tulsa County                              OK     Tulsa   civil    360600N   0955400W
Tulsa Helicopters Incorporated Heliport   OK     Tulsa   airport  360500N   0955205W
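The latitude and longitude values in this example appear to be packed degrees, minutes, and seconds followed by a hemisphere letter (DDMMSS for latitude, DDDMMSS for longitude). Assuming that reading is right, a small hypothetical helper converts them to signed decimal degrees:

```python
def dms_to_decimal(value):
    """Convert a packed value such as '360914N' or '0955933W' to decimal degrees.

    Assumed layout: degrees, then two digits of minutes, two digits of seconds,
    and a final hemisphere letter. South and west come out negative.
    """
    hemisphere = value[-1].upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere in {value!r}")
    digits = value[:-1]
    degrees = int(digits[:-4])
    minutes = int(digits[-4:-2])
    seconds = int(digits[-2:])
    decimal = degrees + minutes / 60 + seconds / 3600
    return -decimal if hemisphere in ("S", "W") else decimal


# The coordinates given for the populated place "Tulsa" in the example.
print(dms_to_decimal("360914N"), dms_to_decimal("0955933W"))
```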


Challenges for the Alexandria Gazetteer

Content standard: A standard conceptual schema for gazetteer information.

Feature types: A type scheme to categorize individual features that is rich in term variants and extensible.

Temporal aspects: Geographic names and attributes change through time.

"Fuzzy" footprints: Extent of a geographic feature is often approximate or ill-defined (e.g., Southern California).


Quality aspects:

(a) Indicate the accuracy of latitude and longitude data.

(b) Ensure that the reported coordinates agree with the other elements of the description.

Spatial extents:

(a) Points do not represent the extent of the geographic locations and are therefore only minimally useful.

(b) Bounding boxes often include too much territory (e.g., the bounding box for California also includes Nevada).
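The bounding-box problem is easy to reproduce with a toy shape. The sketch below uses a hypothetical triangular region in plain x, y coordinates (not real geography) and a standard ray-casting test to compare "inside the bounding box" with "inside the actual boundary".

```python
def bounding_box(polygon):
    """Smallest axis-aligned box (min_x, min_y, max_x, max_y) around the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(point, box):
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(point, polygon):
    """Ray casting: count how many edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


region = [(0, 0), (4, 0), (0, 4)]   # hypothetical triangular region
neighbor = (3, 3)                   # outside the triangle, inside its box
box = bounding_box(region)
print(in_box(neighbor, box), in_polygon(neighbor, region))
```

A search that compares only bounding boxes would report the neighboring point as a match, which is the over-inclusion described in (b).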


Alexandria Thesaurus: Example

canals

A feature type category for places such as the Erie Canal.

Used for:

The category canals is used instead of any of the following.

canal bends, canalized streams, ditch mouths, ditches, drainage canals, drainage ditches, ... more ...

Broader Terms:

Canals is a sub-type of hydrographic structures.


Related Terms:

The following is a list of other categories related to canals (non-hierarchical relationships).

channels, locks, transportation features, tunnels

Scope Note:

Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power.
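The canals entry can be held in a small data structure, with a lookup that maps a variant term to the preferred category. This is a sketch only: the dictionary layout and function name are assumptions, and the "used for" list contains just the terms shown in the example, which is itself truncated.

```python
# One thesaurus entry, transcribed from the example (the "used for" list is partial).
THESAURUS = {
    "canals": {
        "used_for": [
            "canal bends",
            "canalized streams",
            "ditch mouths",
            "ditches",
            "drainage canals",
            "drainage ditches",
        ],
        "broader": ["hydrographic structures"],
        "related": ["channels", "locks", "transportation features", "tunnels"],
    }
}


def preferred_term(term):
    """Return the category to use for a term, or None if the term is unknown."""
    term = term.strip().lower()
    if term in THESAURUS:
        return term
    for preferred, entry in THESAURUS.items():
        if term in entry["used_for"]:
            return preferred
    return None


print(preferred_term("Drainage ditches"))
print(THESAURUS["canals"]["broader"])
```

Normalizing query terms this way lets a search for "ditches" retrieve features catalogued under canals.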


Alexandria Gazetteer

Alexandria Digital Library

Linda L. Hill, James Frew, and Qi Zheng, Geographic Names: The Implementation of a Gazetteer in a Georeferenced Digital Library. D-Lib Magazine, 5(1), January 1999.

http://www.dlib.org/dlib/january99/hill/01hill.html


Cataloguing Online Materials: Dublin Core

Dublin Core is an attempt to apply cataloguing methods to online materials, notably the Web.

History

It was anticipated that the methods of full text indexing that were used by the early Web search engines, such as Lycos, would not scale up.

"... [automated] indexes are most useful in small collections within a given domain. As the scope of their coverage expands, indexes succumb to problems of large retrieval sets and problems of cross disciplinary semantic drift. Richer records, created by content experts, are necessary to improve search and retrieval."

Weibel 1995


Dublin Core

Simple set of metadata elements for online information

• 15 basic elements

• intended for all types and genres of material

• all elements optional

• all elements repeatable

Developed by an international group chaired by Stuart Weibel since 1995.

(Diane Hillmann of Cornell has been very active in this group.)


Dublin Core record for the Dublin Core Web Site

contributor: Dublin Core Metadata Initiative

description: The Dublin Core Metadata Initiative is an open forum engaged in the development of interoperable online metadata standards that support a broad range of purposes and business models...

title: Dublin Core Metadata Initiative (DCMI) Home Page

date: 2004-10-05

format: text/html (MIME type)

language: en (English)
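Because every element is optional and repeatable, one natural in-memory form for such a record is a mapping from element name to a list of values. The sketch below is an assumption about representation, not something from the slides; it renders the record as HTML meta tags using the common "DC." name prefix. The description element is left out because the slide truncates it.

```python
from html import escape

# The record from the slide, one list of values per element (description omitted).
record = {
    "contributor": ["Dublin Core Metadata Initiative"],
    "title": ["Dublin Core Metadata Initiative (DCMI) Home Page"],
    "date": ["2004-10-05"],
    "format": ["text/html"],
    "language": ["en"],
}


def to_meta_tags(record):
    """Render each (element, value) pair as an HTML meta tag; repeated values repeat the tag."""
    return [
        f'<meta name="DC.{element}" content="{escape(value, quote=True)}">'
        for element, values in record.items()
        for value in values
    ]


for tag in to_meta_tags(record):
    print(tag)
```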


Dublin Core elements

Element Name: Title

Definition: A name given to the resource.

Comment: Typically, Title will be a name by which the resource is formally known.

Element Name: Creator

Definition: An entity primarily responsible for making the content of the resource.

Comment: Examples of Creator include a person, an organization, or a service. Typically, the name of a Creator should be used to indicate the entity.


Element Name: Subject

Definition: A topic of the content of the resource.

Comment: Typically, Subject will be expressed as keywords, key phrases or classification codes that describe a topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme.

Element Name: Description

Definition: An account of the content of the resource.

Comment: Examples of Description include, but are not limited to, an abstract, table of contents, reference to a graphical representation of content, or a free-text account of the content.


Element Name: Publisher

Definition: An entity responsible for making the resource available.

Comment: Examples of Publisher include a person, an organization, or a service. Typically, the name of a Publisher should be used to indicate the entity.

Element Name: Contributor

Definition: An entity responsible for making contributions to the content of the resource.

Comment: Examples of Contributor include a person, an organization, or a service. Typically, the name of a Contributor should be used to indicate the entity.


Element Name: Date

Definition: A date of an event in the lifecycle of the resource.

Comment: Typically, Date will be associated with the creation or availability of the resource. Recommended best practice for encoding the date value is defined in a profile of ISO 8601 [W3CDTF] and includes (among others) dates of the form YYYY-MM-DD.
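A cataloguing tool might check Date values before accepting them. The sketch below covers only the date-only forms of the W3CDTF profile (YYYY, YYYY-MM, and YYYY-MM-DD); the profile also allows forms with a time of day, which this hypothetical validator rejects.

```python
import re
from datetime import date

# Year, optionally followed by month, optionally followed by day.
_DATE_PATTERN = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_w3cdtf_date(value):
    """True for a well-formed, real calendar date in YYYY, YYYY-MM, or YYYY-MM-DD form."""
    match = _DATE_PATTERN.fullmatch(value)
    if match is None:
        return False
    # Missing month or day defaults to 1 so the calendar check still works.
    year, month, day = (int(group) if group else 1 for group in match.groups())
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


for candidate in ["2004-10-05", "2004-10", "2004", "2004-02-30", "05/10/2004"]:
    print(candidate, is_w3cdtf_date(candidate))
```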


Element Name: Type

Definition: The nature or genre of the content of the resource.

Comment: Type includes terms describing general categories, functions, genres, or aggregation levels for content. Recommended best practice is to select a value from a controlled vocabulary (for example, the DCMI Type Vocabulary [DCT1]). To describe the physical or digital manifestation of the resource, use the FORMAT element.


Element Name: Format

Definition: The physical or digital manifestation of the resource.

Comment: Typically, Format may include the media-type or dimensions of the resource. Format may be used to identify the software, hardware, or other equipment needed to display or operate the resource. Examples of dimensions include size and duration. Recommended best practice is to select a value from a controlled vocabulary (for example, the list of Internet Media Types [MIME] defining computer media formats).


Element Name: Identifier

Definition: An unambiguous reference to the resource within a given context.

Comment: Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system. Formal identification systems include but are not limited to the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL)), the Digital Object Identifier (DOI) and the International Standard Book Number (ISBN).


Element Name: Source

Definition: A reference to a resource from which the present resource is derived.

Comment: The present resource may be derived from the Source resource in whole or in part. Recommended best practice is to identify the referenced resource by means of a string or number conforming to a formal identification system.


Element Name: Language

Definition: A language of the intellectual content of the resource.

Comment: Recommended best practice is to use RFC 3066 [RFC3066] which, in conjunction with ISO 639 [ISO639], defines two- and three-letter primary language tags with optional subtags. Examples include "en" or "eng" for English, "akk" for Akkadian, and "en-GB" for English used in the United Kingdom.
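The tag structure described in the comment (a primary tag plus optional subtags) can be split apart with a few lines. This hypothetical helper handles only the two- and three-letter primary tags mentioned above and is not a full RFC 3066 parser.

```python
def parse_language_tag(tag):
    """Split a tag such as 'en-GB' into (primary, [subtags]).

    Only two- and three-letter ASCII primary tags are accepted in this sketch.
    """
    primary, *subtags = tag.split("-")
    if not (primary.isascii() and primary.isalpha() and 2 <= len(primary) <= 3):
        raise ValueError(f"unsupported primary language tag in {tag!r}")
    return primary.lower(), subtags


for example in ["en", "eng", "akk", "en-GB"]:
    print(example, parse_language_tag(example))
```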

Element Name: Relation

Definition: A reference to a related resource.

Comment: Recommended best practice is to identify the referenced resource by means of a string or number conforming to a formal identification system.


Element Name: Coverage

Definition: The extent or scope of the content of the resource.

Comment: Typically, Coverage will include spatial location (a place name or geographic coordinates), temporal period (a period label, date, or date range) or jurisdiction (such as a named administrative entity). Recommended best practice is to select a value from a controlled vocabulary (for example, the Thesaurus of Geographic Names [TGN]) and to use, where appropriate, named places or time periods in preference to numeric identifiers such as sets of coordinates or date ranges.


Element Name: Rights

Definition: Information about rights held in and over the resource.

Comment: Typically, Rights will contain a rights management statement for the resource, or reference a service providing such information. Rights information often encompasses Intellectual Property Rights (IPR), Copyright, and various Property Rights. If the Rights element is absent, no assumptions may be made about any rights held in or over the resource.


Qualifiers

Example: element qualifier

Example: Date

DC.Date.Created 1997-11-01

DC.Date.Issued 1997-11-15

DC.Date.Available 1997-12-01/1998-06-01

DC.Date.Valid 1998-01-01/1998-06-01

A qualifier refines the element name to add specificity


Example: value qualifiers

Example: Subject

DC.Subject.DDC 509.123 (Dewey Decimal Classification)

DC.Subject.LCSH Digital libraries-United States (Library of Congress Subject Heading)


Dumbing Down Principle

"The theory behind this principle is that consumers of metadata should be able to strip off qualifiers and return to the base form of a property. ... this principle makes it possible for client applications to ignore qualifiers in the context of more coarse-grained, cross-domain searches."

Lagoze 2001


Qualified version

DC.Date.Created 1997-11-01

DC.Subject.LCSH Digital libraries-United States

Dumbed-down version

DC.Date 1997-11-01 (a valid date)

DC.Subject Digital libraries-United States (a valid subject description)
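Dumbing down amounts to dropping everything after the base element name. A minimal sketch, assuming qualified names are written in the dotted "DC.Element.Qualifier" form used in these examples (the function name is hypothetical):

```python
def dumb_down(statements):
    """Strip qualifiers: 'DC.Date.Created' becomes 'DC.Date'; values are left unchanged."""
    result = []
    for name, value in statements:
        base = ".".join(name.split(".")[:2])  # keep the 'DC' prefix and the element name
        result.append((base, value))
    return result


qualified = [
    ("DC.Date.Created", "1997-11-01"),
    ("DC.Subject.LCSH", "Digital libraries-United States"),
]
for name, value in dumb_down(qualified):
    print(name, value)
```

The principle holds only if each value still makes sense for the unqualified element, as both values in this example do.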


Dublin Core with qualifiers

See the next two slides for an example of a Dublin Core record for a web site prepared by a professional cataloguer at the Library of Congress.

Note that the record does not follow the principle of dumbing-down.
