1 Metadata Andy Powell Technical Development and Research UKOLN University of Bath http://www.ukoln.ac.uk/ [email protected]


1

Metadata

Andy Powell

Technical Development and Research

UKOLN

University of Bath

http://www.ukoln.ac.uk/

[email protected]

2

Metadata

• What is metadata?
    • an introduction
• The Dublin Core
    • metadata for the Web
• Metadata management
    • Models for dealing with Web-site metadata
• UKOLN metadata projects
    • overviews (and problems)

3

What is metadata?

• by definition:
    ..data about data..
    ..data which provides information about a resource..
• by example:
    • title, author, subject classification, shelf mark
    • digital format, terms and conditions, location (URL)

4

What is metadata? (2)

• by usage:
    • Resource discovery
        – Searching, location
        – Authentication
        – Quality/rating
    • Semantic interoperability
    • Resource management
    • User interface
        – Grouping resources for printing
        – 3-D visualisations

5

Range of formats

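[Diagram: metadata formats placed on a scale running from Simple to Rich, with 'robot generated' and 'hand crafted' marked along it. Formats and services shown: Dublin Core, IAFA, SOIF, MARC, TEI headers, CIMI, Alta Vista, NetFirst, Lycos.]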

6

Where is metadata?

• Embedded within resource
    • HTML <META> tags
• Linked to resource
• Remote database
    • distributed
    • union (centralised)

7

Who creates metadata?

• Publisher side
    • author
    • webmaster
    • institution
• Service side
    • search service
    • third party creators

[Diagram labels: robot generated, hand crafted]

8

Dublin Core

• 15 element core metadata set
• Primarily intended to aid resource discovery on the Web
• Main usage currently embedded into HTML META tags
• All elements optional and repeatable
• Status?
    • Agreed syntax for embedding in HTML
    • Still discussion about the use of some of the elements

http://www.ukoln.ac.uk/metadata/resources/dc.html

9

Dublin Core History

• 4 DC meetings
    • Dublin, Warwick, Dublin, Canberra
    • (DC-5 - Helsinki coming soon)
• Mailing list discussions
    • [email protected]
• W3C interest
    • RDF (PICS-NG), MCF
• Various projects
• Still no significant interest yet from the big search engines :-(

10

DC Elements - 1

• Title
• Subject
    • intended to promote use of controlled vocabularies, but in practice likely to be used for an uncontrolled list of keywords
• Description
    • abstract
• Creator
• Publisher

11

DC Elements - 2

• Contributor
• Date
    • the date ‘the resource was made available in its present form’. Agreed default format uses a subset of ISO 8601, e.g. 1997-09-15
• Type
    • category of resource - document, image, sound, home page, novel, poem, etc. Still much discussion about the content of this element
• Format
    • MIME type
• Identifier
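
As a small illustration of the Date element's default format, the sketch below checks a value against the full YYYY-MM-DD form used in the example above. The function name `parse_dc_date` is hypothetical, and the sketch assumes only the full-date form is wanted; the agreed subset of ISO 8601 may allow other forms that are not handled here.

```python
import re
from datetime import date

# Only the full-date form shown on the slide (e.g. 1997-09-15) is accepted.
_FULL_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_dc_date(value):
    """Return a datetime.date for a YYYY-MM-DD string, else raise ValueError."""
    text = value.strip()
    if not _FULL_DATE.fullmatch(text):
        raise ValueError(f"expected YYYY-MM-DD, got {value!r}")
    # fromisoformat also rejects impossible dates such as 1997-02-30.
    return date.fromisoformat(text)


if __name__ == "__main__":
    print(parse_dc_date("1997-09-15"))
```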

12

DC Elements - 3

• Source
• Language
    • language of the resource - NOT the metadata
• Relation
    • no guidelines for usage currently
• Coverage
    • separate working party looking at usage
• Rights
    • rights management seen as too complex for DC. This will give a URL to some external information

13

Simple Example

<HTML><HEAD>

<TITLE>UKOLN Home Page</TITLE>

<META NAME="DC.title" CONTENT="UKOLN: UK Office for Library and Information Networking">

<META NAME="DC.subject" CONTENT="national centre, network information support, library community, awareness, research, information services, public library networking, bibliographic management, distributed library systems, metadata, resource discovery, conferences, lectures, workshops">

<META NAME="DC.description" CONTENT="UKOLN is a national centre for support in network information management in the library and information communities. It provides awareness, research and information services">

<META NAME="DC.creator" CONTENT="Stark, Isobel">

</HEAD>

...
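
To show how a robot or local indexer might read tags like these back out of a page, here is a minimal sketch using Python's standard `html.parser` module. The class and function names (`DublinCoreParser`, `extract_dublin_core`) are illustrative assumptions, not part of any UKOLN tool, and the sample page is a shortened version of the example above with one non-DC tag added.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <META NAME="DC.*" CONTENT="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.elements = []  # (name, content) tuples in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            self.elements.append((name, attributes.get("content") or ""))


def extract_dublin_core(html_text):
    """Return every Dublin Core (name, content) pair found in html_text."""
    parser = DublinCoreParser()
    parser.feed(html_text)
    parser.close()
    return parser.elements


SAMPLE = """<HTML><HEAD>
<TITLE>UKOLN Home Page</TITLE>
<META NAME="DC.title" CONTENT="UKOLN: UK Office for Library and Information Networking">
<META NAME="DC.creator" CONTENT="Stark, Isobel">
<META NAME="Keywords" CONTENT="xxx, yyy, zzz">
</HEAD>"""

if __name__ == "__main__":
    for name, content in extract_dublin_core(SAMPLE):
        print(f"{name}: {content}")
```

Because all elements are optional and repeatable, the sketch returns a list of pairs rather than a dictionary, so repeated elements are not lost.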

14

Element qualifiers

• Need to refine meaning in some cases
• TYPE
    Refines meaning of element - sub-divides element namespace
• SCHEME
    Element value taken from external schema, e.g. LCSH for DC.subject, Z39.53 for DC.language
• LANGUAGE
    Language of element value (not of the resource being described!)

15

Examples - TYPE

• Original DC.creator tag
<META NAME="DC.creator" CONTENT="Stark, Isobel">

• Non-personal author
<META NAME="DC.creator.corporate" CONTENT="UKOLN Information Services Group">

• Author’s email address
<META NAME="DC.creator.email" CONTENT="[email protected]">
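
The TYPE qualifier is carried in the dotted NAME value, so software reading the tags has to split it apart. The sketch below shows one way to do that; `split_qualified_name` is a hypothetical helper, and treating everything after the element name as a single qualifier is an assumption of this sketch.

```python
def split_qualified_name(name):
    """Split 'DC.creator.email' into ('creator', 'email').

    The qualifier is None when the name has no TYPE part.
    """
    parts = name.split(".")
    if len(parts) < 2 or parts[0].upper() != "DC" or not parts[1]:
        raise ValueError(f"not a Dublin Core name: {name!r}")
    element = parts[1].lower()
    # Assumption: anything after the element name is one qualifier string.
    qualifier = ".".join(parts[2:]) or None
    return element, qualifier


if __name__ == "__main__":
    for tag_name in ("DC.creator", "DC.creator.corporate", "DC.creator.email"):
        print(tag_name, "->", split_qualified_name(tag_name))
```

A consumer that does not understand a particular qualifier can still fall back to the unqualified element, which is the point of sub-dividing the element namespace.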

16

Examples - SCHEME

• Library of Congress Subject Heading
<META NAME="DC.subject" CONTENT="(SCHEME=LCSH) Library information networks -- Great Britain">

<META NAME="DC.subject" CONTENT="(SCHEME=LCSH) Information technology -- higher education">

…or…
<META NAME="DC.subject" SCHEME="LCSH" CONTENT="Library information networks -- Great Britain">

<META NAME="DC.subject" SCHEME="LCSH" CONTENT="Information technology -- higher education">
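
Since the scheme can appear either as a "(SCHEME=...)" prefix inside CONTENT or as a separate SCHEME attribute, a reader of these tags needs to accept both. The following is a rough sketch of that normalisation; the function name `scheme_and_value` and the exact prefix pattern are assumptions made for illustration.

```python
import re

# Matches a leading "(SCHEME=xxx)" prefix inside a CONTENT value.
_SCHEME_PREFIX = re.compile(
    r"^\s*\(\s*SCHEME\s*=\s*([^)\s]+)\s*\)\s*", re.IGNORECASE
)


def scheme_and_value(attributes):
    """Return (scheme, value) for one META tag.

    `attributes` is a dict of the tag's attributes with lowercased keys.
    The scheme is None when neither form is present.
    """
    content = attributes.get("content", "")
    if attributes.get("scheme"):
        # Second form: <META NAME=... SCHEME="LCSH" CONTENT="...">
        return attributes["scheme"], content.strip()
    match = _SCHEME_PREFIX.match(content)
    if match:
        # First form: CONTENT="(SCHEME=LCSH) ..."
        return match.group(1), content[match.end():].strip()
    return None, content.strip()


if __name__ == "__main__":
    print(scheme_and_value({
        "name": "DC.subject",
        "content": "(SCHEME=LCSH) Library information networks -- Great Britain",
    }))
```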

17

Metadata Management

Practical issues of using Dublin Core for Internet resource description...

• UKOLN metadata system
    • Requirements
    • 3 models for metadata management
    • Implementation at UKOLN

18

UKOLN metadata system requirements

• Easy to use

• Work with a variety of methods of creating HTML

• Simple migration to future metadata formats

• Separate metadata from resource

19

Managing Dublin Core (1)
HTML Authoring tool

Embed by hand using HTML or text editor

Pros…
• Simple
• May be useful for training and familiarisation

Cons…
• May not be possible with all editors
• Maintenance problems
• Easy to make errors

20

DC-dot

• A Web based tool for creating Dublin Core <meta> tags
• Automatic generation of some tags based on content of the resource
• Forms based editing of tags
• Cut-and-paste output into HTML
• Conversion to other formats…
    • SOIF, ROADS/WHOIS++, USMARC, GILS...

http://www.ukoln.ac.uk/metadata/dcdot/

21

Managing Dublin Core (2)
Web-site management tool

Use Web-site management tool, for example NetObjects Fusion

Pros…
• Use of Web-site management tools likely to increase
• Object-oriented database approach

Cons…
• Proprietary formats
• Early days - too early to evaluate use for metadata yet?

22

Managing Dublin Core (3)
On the fly generation

Hold Dublin Core separately and embed on-the-fly using server-side include (SSI)

Pros…
• Separates metadata from resource
• Future migration fairly simple

Cons…
• Performance
• Lack of integration with HTML tools
• Server specific

23

UKOLN metadata system (1)

• Embed on-the-fly

• Apache SSI script

• Store metadata using SOIF records

• Use MS-Access as tool to create the records

• Associate metadata with resource by co-locating them in the Web server filestore

24

UKOLN metadata system (2)

[Diagram: the HTML editor produces intro.html and the MS-Access database produces the matching intro.html.soif]

intro.html:
<html><head><title>…</title><!--#exec cmd="getmeta" --></head>...

intro.html.soif:
@FILE { http://www.ukoln.ac....
keywords{13}: xxx, yyy, zzz
description{14}: blah blah b
author{13}: Stark, Isobel
...
}

Apache syntax for calling server-side script:
<!--#exec cmd="getmeta" -->
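
The slides do not show the getmeta script itself, so what follows is only a sketch of the general idea: read a SOIF record, using the byte count in braces to find the end of each value, and print the corresponding META tags. The function names, the `SOIF_TO_DC` mapping and the example.org URL are all assumptions made for illustration, and how the script locates the .soif file for the requested page is left out.

```python
import re
from html import escape

_HEADER = re.compile(rb"@(\w+)\s*\{\s*(\S+)[ \t]*\r?\n")
_ATTRIBUTE = re.compile(rb"\s*([\w.-]+)\{(\d+)\}:[ \t]")

# Hypothetical mapping from SOIF attribute names to Dublin Core names.
SOIF_TO_DC = {
    "title": "DC.title",
    "keywords": "DC.subject",
    "description": "DC.description",
    "author": "DC.creator",
}


def parse_soif(data):
    """Parse one SOIF record (bytes) into (template, url, attributes)."""
    header = _HEADER.match(data)
    if header is None:
        raise ValueError("missing SOIF header")
    template = header.group(1).decode()
    url = header.group(2).decode()
    attributes = []
    position = header.end()
    while True:
        match = _ATTRIBUTE.match(data, position)
        if match is None:
            break  # no further attributes; the closing brace follows
        size = int(match.group(2))  # value length in bytes
        start = match.end()
        value = data[start:start + size]
        attributes.append((match.group(1).decode(), value.decode("utf-8")))
        position = start + size
    return template, url, attributes


def render_meta_tags(attributes):
    """Turn (name, value) pairs into Dublin Core META tags, one per line."""
    lines = []
    for name, value in attributes:
        dc_name = SOIF_TO_DC.get(name.lower())
        if dc_name:  # attributes without a mapping are skipped
            content = escape(value, quote=True)
            lines.append(f'<META NAME="{dc_name}" CONTENT="{content}">')
    return "\n".join(lines)


SAMPLE = (
    b"@FILE { http://www.example.org/intro.html\n"
    b"keywords{13}: xxx, yyy, zzz\n"
    b"author{13}: Stark, Isobel\n"
    b"}\n"
)

if __name__ == "__main__":
    _, _, parsed = parse_soif(SAMPLE)
    print(render_meta_tags(parsed))
```

Keeping the name mapping in one table is what would make a later move to another metadata format a matter of changing the script rather than every page.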

25

UKOLN metadata system (3)

MS-Access frontend...

[Screenshot of the MS-Access form. Callouts: Filename browser; Text boxes; Name choosers; UKOLN specific metadata]

26

UKOLN metadata system (4)

[Diagram: a Web robot requests intro.html from the UKOLN Web server; the SSI script reads intro.html.soif and the metadata is embedded in the page that is returned. The steps are numbered 1 to 6. intro.html and intro.html.soif have the same content as on the earlier slide.]

27

Issues

• Performance

• Interaction with Web caches

• Dublin Core vs Alta Vista style metadata
<META NAME="Description" CONTENT="blah, blah">
<META NAME="Keywords" CONTENT="xxx, yyy, zzz">

• Granularity
    • Which pages should have metadata?

28

What's the point...

…of embedding DC <meta> tags?

• Alta Vista isn't going to look for them

• But, worth doing...
    • within individual projects
    • within specific communities (e.g. eLib)
• Improve local search facilities
    • e.g. load SOIF records into a Netscape Catalogue Server

• Web-site management benefits

29

UKOLN Metadata projects

• ROADS
    • Software for Subject Service
• DESIRE
    • European Web indexing
• NewsAgent
    • Current awareness service for Library and Information Staff
• BIBLINK
    • Information flow from publishers to National Bibliographic Agencies

30

ROADS

• Resource Organisation and Discovery in Subject-based Services

• Web based tools for Subject Services
    • SOSIG, ADAM, OMNI, …
• Manage and search Internet resource descriptions
    • ROADS templates (based on IAFA templates)
    • WHOIS++

http://www.ukoln.ac.uk/roads/

31

ROADS - WHOIS++ (1)

• Simple client-server search and retrieve protocol

• Developed originally for ‘white pages’ applications

• Offer search facilities across several Subject Services

• Distribute a Subject Service across several physical servers

• Query routing - centroids and CIP

32

ROADS - WHOIS++ (2)

• Centroid generated by ADAM contains… “you’ll find the string ‘mona’ in the ‘title’ attribute of at least one record in the ADAM database”.

[Diagram: a Web browser talks to a CGI-based WHOIS++ client, which queries the SOSIG, OMNI and ADAM servers; the servers share centroids using CIP. The steps are numbered 1 to 6.]
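
The centroid idea can be sketched in a few lines: for each attribute, a server publishes the set of distinct words that occur in at least one of its records, and a client uses those sets to decide which servers are worth querying. The code below is a simplified illustration only. The function names and sample records are invented, and the real WHOIS++ centroid format and CIP exchange are not modelled.

```python
import re
from collections import defaultdict


def build_centroid(records):
    """Summarise records as attribute -> set of lowercase word tokens.

    `records` is an iterable of dicts mapping attribute names to text.
    """
    centroid = defaultdict(set)
    for record in records:
        for attribute, text in record.items():
            tokens = re.findall(r"\w+", text.lower())
            centroid[attribute.lower()].update(tokens)
    return dict(centroid)


def servers_worth_querying(centroids, attribute, term):
    """Return the servers whose centroid lists `term` under `attribute`."""
    return sorted(
        server
        for server, centroid in centroids.items()
        if term.lower() in centroid.get(attribute.lower(), set())
    )


if __name__ == "__main__":
    # Invented sample records, for illustration only.
    centroids = {
        "ADAM": build_centroid([{"title": "Mona Lisa"}]),
        "OMNI": build_centroid([{"title": "Clinical medicine resources"}]),
    }
    print(servers_worth_querying(centroids, "title", "mona"))
```

A centroid can only say that a word occurs somewhere in a server's records, so a match means "worth asking", not "guaranteed hit" for a multi-word query.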

33

DESIRE
European Web cataloguing

• Subject Services
    • EuroSOSIG (Bristol), EELS (Lund), Arts (Koninklijke Bibliotheek)
    • Manually created ROADS templates
• European Web Index
    • based on Nordic Web Index (NWI)
    • Robot generated, all resources
    • Multiple servers linked with Z39.50
    • GILS

http://www.nic.surfnet.nl/surfnet/projects/desire/desire.html

34

DESIRE - current work (1)

• Internationalisation of ROADS

• Use of robots to:
    • aid manual cataloguing of resources
    • build indexes based on list of URLs in a ROADS database
• Robot will use embedded Dublin Core if available

35

DESIRE - current work (2)

• Re-design of EWI robot - including:
    • support for Dublin Core
    • EWI records GILS-II compatible
• Allow users to search across subject services and the EWI using Z39.50
    • by converting ROADS records into GILS records
    • by building a WHOIS++ to Z39.50 gateway

http://roads.ukoln.ac.uk/cgi-bin/egwcgi/egwirtcl/targets.egw

36

NewsAgent
Current awareness service for LIS...

• Distributed database
    • servers at LITC, FD, UKOLN - Z39.50
    • metadata (and some full-text)
    • based on DALI
• Mixture of content streams
• Variety of access methods
    • Web, e-mail and Z39.50 clients
    • user-configurable profiles

http://www.ukoln.ac.uk/metadata/NewsAgent/

37

NewsAgent - Content

• Journals
    • Program, VINE, Journal of Librarianship and Information Science
• News and briefing material
    • LA, IIS, UKOLN (Ariadne), BL, LITC

• Web pages

• E-mail lists and USENET news

38

NewsAgent - Harvesting

• Web crawler
    • looking for embedded Dublin Core
    • Limiting the harvest
        – simple heuristics
        – use of Dublin Core Relation element

• E-mail parser

http://www.ukoln.ac.uk/metadata/NewsAgent/dcusage.html

39

BIBLINK

Information flow between publishers
    • traditional
    • new - CD-ROM or Web (new to publishing)

and National Bibliographic Agencies
    • British Library, UK
    • Biblioteca Nacional, Madrid, Spain
    • Bibliothèque Nationale de France, Paris
    • Koninklijke Bibliotheek, Den Haag, Netherlands
    • Nasjonalbiblioteket, Rana, Norway
    • Universitat Oberta de Catalunya, Barcelona, Spain

http://www.ukoln.ac.uk/metadata/BIBLINK/

40

BIBLINK - research

• Scope
    • Electronic publications suitable for inclusion in National Bibliographies
• Metadata
    • Dublin Core (with extensions!), SGML DTD
• Identifiers
    • ISBN, ISSN, SICI, DOI, URN
• Transmission
    • Simple e-mail or Web crawler
• Authentication
    • MD5 hash assigned to each resource
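
As an illustration of the authentication point, the sketch below computes an MD5 checksum for a resource and compares it with a previously recorded value, using Python's standard `hashlib`. The function names are hypothetical, and how BIBLINK actually records or transmits the checksum is not covered by the slides.

```python
import hashlib


def md5_checksum(resource_bytes):
    """Return the MD5 digest of a resource as a lowercase hex string."""
    return hashlib.md5(resource_bytes).hexdigest()


def is_unchanged(resource_bytes, recorded_checksum):
    """True if the resource still matches the checksum recorded for it."""
    return md5_checksum(resource_bytes) == recorded_checksum.strip().lower()


if __name__ == "__main__":
    original = b"<html><head><title>Example</title></head></html>"
    recorded = md5_checksum(original)
    print(recorded)
    print(is_unchanged(original, recorded))
    print(is_unchanged(original + b" ", recorded))
```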

41

BIBLINK - data set

• Minimum data set
    – Author, Title, Publisher, Place of Publication, Price, Extent (size), Keywords, Description, Edition/Version, Date of Publication, System Requirements, Format, Language, Terms and Conditions, Frequency, Identifier, Contributor, Checksum
• Similar to DC but some don’t fit…
<META NAME="BIBLINK.placePublication" CONTENT="Bath, UK">
<META NAME="BIBLINK.frequency" CONTENT="monthly">

• Issues over conversion to MARC

42

BIBLINK - demonstrator

[Diagram: record flow between Publishers and NBAs/National Libraries. Labels shown: Dublin Core, UNIMARC, ??MARC, E-mail.]

• Cataloguing in Publication (CIP) level records

• Conversion to local MARC format using USEMARCON

• Enhanced records optionally returned to publishers

43

Conclusions

• Think about metadata as a ‘process’

• Dublin Core syntax now stable enough to use

• Use within projects initially

• Choose metadata management model appropriate to your site

• Consider long term maintenance and transition to other formats