Cornell CS 502

Metadata for the Web: Issues and Simple Answers

CS 502 – 20020221
Carl Lagoze – Cornell University

“Metadata is data about data”

Some untested hypotheses

• Metadata is useful for…
  – People
  – Machines
• More metadata is better
• (semi) automated digital libraries and simple metadata

Some known facts

• Number and variety of metadata vocabularies will continue to increase

• The Tower of Babel is a franchise
  – There is not one common view of reality

• “The one thing I know about metadata is that it is expensive”

Are metadata and data distinguishable?

• Objectivity?
• Intellectual property?
• Structure?
• Aboutness?

The fiction of classification

…there is no classification of the universe that is not fictional and conjectural.

Jorge Luis Borges

Lenses and Views

• All classification does and should provide a biased lens or view of reality

• Each view emphasizes certain characteristics and hides others

[Figure: views labeled Geospatial, Rights, and Museum]

Reality is Complex

Created by: George Castaldo
Created on: 1994

Created by: Leonardo da Vinci
Created on: 1506

Relationship?

Objects are Related

IFLA Entity Model

Entities, Events, and Agents

[Figure labels: Photographer, Camera type, Software, Computer artist]

Haven’t we done metadata already?

What’s wrong with this model?

• Expensive
  – Complex (even for its original goal?)
  – Professional intervention (assumes single community of expertise)
• Monolithic
  – One size fits all approach
  – Reflects its centralized system origins
• Bias towards physical artifacts
  – Fixed resources
  – Incomplete handling of resource evolution and other resource relationships
• Anglo-centric

Web Challenge to Traditional Cataloging

• Scale

• Permanence

• Authenticity

• Organizational Context

• Custodial Control

• Variety

Internet Commons includes Multiple Communities

[Figure: the Internet Commons and its communities: Scientific Data, Home Pages, Geo, Library, Museums, Commerce, Whatever...]

Realities of Web search and discovery

• Search systems are motivated by advertising
• Index coverage is unpredictable and limited
• Too much recall, too little precision
• Index spam abounds
• Resources (and their names) are volatile

Metadata: Part of a Solution

• Structured data about data
  – helps to impose order on chaos
  – enables automated discovery/manipulation
• Variety across various dimensions:
  – specialization
  – decentralization
  – democratization

Web Metadata Issues: Description vs. Discovery

• Library cataloging motivated by describing resources
• Fuzzy search buckets
  – Separate books about Sigmund Freud versus books by Sigmund Freud into different buckets
  – But, different types of data appropriate for different buckets: URLs, date strings, word strings, names
• But general, fuzzy categories may not be sufficient for describing resources
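The "fuzzy search bucket" idea can be sketched in a few lines. This is only an illustration under stated assumptions: the two records and their titles are hypothetical placeholders, and the `search` helper is not part of any Dublin Core tool. It shows how putting names into separate Creator and Subject buckets separates books by Freud from books about Freud.

```python
# Hypothetical records; titles and the second creator are placeholders,
# not real catalog entries.
RECORDS = [
    {"Title": "Book A", "Creator": ["Sigmund Freud"], "Subject": ["Dreams"]},
    {"Title": "Book B", "Creator": ["A. Biographer"], "Subject": ["Sigmund Freud"]},
]


def search(records, bucket, term):
    """Return titles of records whose `bucket` holds a value containing `term`."""
    needle = term.lower()
    return [
        record["Title"]
        for record in records
        if any(needle in value.lower() for value in record.get(bucket, []))
    ]


# Same search term, different buckets, different answers.
by_freud = search(RECORDS, "Creator", "freud")
about_freud = search(RECORDS, "Subject", "freud")
```

The match is deliberately loose (case-insensitive substring), which is what makes the bucket "fuzzy": good enough for discovery, not for full description.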

Web Metadata Models: Drill-Down Searching Paradigm

• Moving along a specificity spectrum
• Inter-domain vs. intra-domain terms, models, query mechanisms
• One size doesn't fit all
  – Cognitive models of searching and browsing

Metadata Takes Many Forms

• resource discovery
• document administration
• rights management
• content rating
• security and authentication
• archival status
• products and services
• database schemas
• process control or description

Metadata: Part of the problem

[Figure: cost versus functionality, with points labeled AACR2/MARC, Google, and Dublin Core]

Metadata Challenges

• Accommodate multiple varieties of metadata
  – community-specific functionality, creation, administration, access
• Tensions
  – functionality and simplicity
  – extensibility and interoperability
  – human and machine creation and use

Interoperability has many facets

• Semantics
  – Meaning/classification/ontology
• Models/Structure
  – Entities and relationships
• Syntax
  – grammars to convey semantics and structure

Warwick Framework: Containing Chaos

• Conceptual Architecture for metadata from the Warwick Metadata Workshop (DC-2)

• Conceptual architecture to support the specification, collection, encoding, and exchange of modular metadata

• Provide context for metadata efforts (including Dublin Core)
  – avoids the “black-hole” of comprehensive element sets
  – focuses interoperability issues at package level

Metadata Container

[Figure: a container holding four packages: Dublin Core, MARC record, Indirect Reference, and Terms and Conditions; a URI label appears alongside]
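The container/package idea can be sketched as a small data structure. This is a minimal illustration, not the Warwick Framework's specified encoding: the class names, the `of_type` helper, the payload shapes, and the example.org URI are all assumptions made for the example. The point it tries to show is that the container carries typed packages without interpreting what is inside them.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """A typed metadata package carried directly inside the container."""
    package_type: str  # e.g. "Dublin Core", "MARC record"
    payload: dict      # opaque to the container; shape is up to the community


@dataclass
class IndirectReference:
    """A package stored elsewhere and named only by a URI."""
    uri: str


@dataclass
class Container:
    """Holds packages; it does not look inside their payloads."""
    packages: list = field(default_factory=list)

    def add(self, package):
        self.packages.append(package)

    def of_type(self, package_type):
        """Return the directly held packages of one type."""
        return [
            p for p in self.packages
            if isinstance(p, Package) and p.package_type == package_type
        ]


def build_example():
    container = Container()
    container.add(Package("Dublin Core", {"Title": "Grammar of Dublin Core"}))
    container.add(Package("MARC record", {"raw": "(MARC bytes would go here)"}))
    # Hypothetical URI, used only for illustration.
    container.add(IndirectReference("http://example.org/metadata/record-1"))
    container.add(Package("Terms and Conditions", {"access": "(terms go here)"}))
    return container
```

A consumer that only understands Dublin Core can call `of_type("Dublin Core")` and ignore everything else, which is one way to read "focuses interoperability issues at package level".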

Modularization Allows Distributed Management

• Communities of expertise (not software vendors) are responsible for:
  – Semantics
  – Registration
  – Administration
  – Access management
  – Authority of data
  – Sharing and Distribution

Modularization Implementation Issues

• Data encoding
• Semantic interaction of overlapping sets
  – between semantically-related packages
  – between semantically distinct packages
• Type registry

Dublin Core Metadata Initiative

• A simple set of properties to support resource discovery on the web (fuzzy search buckets)?

• A cross-domain switchboard for interoperable metadata?

• An extensible ontology for resource description?

http://dublincore.org

The fifteen Dublin Core Elements

Creator      Title     Subject
Contributor  Date      Description
Publisher    Type      Format
Coverage     Rights    Relation
Source       Language  Identifier

http://www.dublincore.org/documents/1999/07/02/dces/
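Because the element set is small and closed, checking a record against it is simple. The sketch below assumes a record is a plain dict mapping element names to lists of values (each element being optional and repeatable); the `unknown_elements` helper and the sample record are hypothetical, written only for this illustration.

```python
# The fifteen element names from the table above.
DC_ELEMENTS = frozenset({
    "Creator", "Title", "Subject",
    "Contributor", "Date", "Description",
    "Publisher", "Type", "Format",
    "Coverage", "Rights", "Relation",
    "Source", "Language", "Identifier",
})


def unknown_elements(record):
    """Return, sorted, the keys of `record` that are not Dublin Core elements."""
    return sorted(set(record) - DC_ELEMENTS)


# Hypothetical record: "Author" is not one of the fifteen ("Creator" is).
sample = {
    "Title": ["Grammar of Dublin Core"],
    "Author": ["Tom Baker"],
    "Subject": ["Metadata", "Dublin Core"],  # elements may repeat
}
problems = unknown_elements(sample)
```

An empty record passes too, since no element is required.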

A Pidgin for Digital Tourists

• Metadata is language
• Dublin Core is a small and simple language -- a pidgin -- for finding resources across domains.
• Speakers of different languages naturally "pidginize" to communicate
  – E.g., tourists using simple phrases to order beer ("zwei Bier bitte" "dva pivo" "biru o san bai"...)

• We are all "tourists" on the global Internet.

A Grammar of Dublin Core

• http://www.dlib.org/dlib/october00/baker/10baker.html

• By design not as subtle as mother tongues, but easy to learn and extremely useful in practice

• Pidgins: small vocabularies (Dublin Core: fifteen special nouns and lots of optional adjectives)

• Simple grammars: sentences (statements) follow a simple fixed pattern...

Example Dublin Core statements

• Resource has Title 'Grammar of Dublin Core'.
• Resource has Creator 'Tom Baker'.
• Resource has Subject 'Metadata'.
• Resource has Relation http://foo.org/file.htm.

Resource has property X

• Resource: implied subject
• has: implied verb
• property: one of 15 properties (DC:Creator, DC:Title, DC:Subject, DC:Date, ...)
• X: property value (an appropriate literal)
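The fixed sentence pattern can be modeled directly. In this sketch, the `Statement` class and its `sentence` method are hypothetical names chosen for the example; only the pattern "Resource has property X" and the fifteen property names come from the slides. The subject and verb are implied, so a statement only needs to store the property and its literal value.

```python
from dataclasses import dataclass

# The fifteen Dublin Core property names.
DC_PROPERTIES = (
    "Creator", "Title", "Subject", "Contributor", "Date",
    "Description", "Publisher", "Type", "Format", "Coverage",
    "Rights", "Relation", "Source", "Language", "Identifier",
)


@dataclass(frozen=True)
class Statement:
    """One 'Resource has <property> <value>' statement."""
    prop: str
    value: str

    def __post_init__(self):
        # Reject anything outside the fifteen-word vocabulary.
        if self.prop not in DC_PROPERTIES:
            raise ValueError(f"not a Dublin Core property: {self.prop}")

    def sentence(self):
        # Subject ("Resource") and verb ("has") are implied by the grammar.
        return f"Resource has {self.prop} '{self.value}'."


statements = [
    Statement("Title", "Grammar of Dublin Core"),
    Statement("Creator", "Tom Baker"),
    Statement("Subject", "Metadata"),
]
sentences = [s.sentence() for s in statements]
```

Values here are always rendered as quoted literals; a fuller model would distinguish literals from URIs such as the Relation example.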
