1
Metadata
Andy Powell
Technical Development and Research
UKOLN
University of Bath
http://www.ukoln.ac.uk/
2
Metadata
• What is metadata?
  • an introduction
• The Dublin Core
  • metadata for the Web
• Metadata management
  • models for dealing with Web-site metadata
• UKOLN metadata projects
  • overviews (and problems)
3
What is metadata?
• by definition:
  • "data about data"
  • "data which provides information about a resource"
• by example:
  • title, author, subject classification, shelf mark
  • digital format, terms and conditions, location (URL)
4
What is metadata? (2)
• by usage:
  • Resource discovery
    – Searching, location
    – Authentication
    – Quality/rating
  • Semantic interoperability
  • Resource management
  • User interface
    – Grouping resources for printing
    – 3-D visualisations
5
Range of formats
[Figure: metadata formats placed on a scale from Simple to Rich, and from robot generated to hand crafted. Labels: Alta Vista, NetFirst, Lycos, Dublin Core, IAFA, SOIF, MARC, TEI headers, CIMI.]
6
Where is metadata?
• Embedded within resource
  • HTML <META> tags
• Linked to resource
• Remote database
  • distributed
  • union (centralised)
7
Who creates metadata?
• Publisher side
  • author
  • webmaster
  • institution
• Service side
  • search service
  • third party creators
robot generated
hand crafted
8
Dublin Core
• 15 element core metadata set
• Primarily intended to aid resource discovery on the Web
• Main usage currently is embedding in HTML META tags
• All elements optional and repeatable
• Status?
  • Agreed syntax for embedding in HTML
  • Still discussion about the use of some of the elements
http://www.ukoln.ac.uk/metadata/resources/dc.html
9
Dublin Core History
• 4 DC meetings
  • Dublin, Warwick, Dublin, Canberra
  • (DC-5 - Helsinki coming soon)
• Mailing list discussions
  • [email protected]
• W3C interest
  • RDF (PICS-NG), MCF
• Various projects
• Still no significant interest yet from the big search engines :-(
10
DC Elements - 1
• Title
• Subject
  • intended to promote use of controlled vocabularies, but in practice likely to be used for an uncontrolled list of keywords
• Description
  • abstract
• Creator
• Publisher
11
DC Elements - 2
• Contributor
• Date
  • the date ‘the resource was made available in its present form’. Agreed default format uses a subset of ISO 8601, e.g. 1997-09-15
• Type
  • category of resource - document, image, sound, home page, novel, poem, etc. Still much discussion about the content of this element
• Format
  • MIME type
• Identifier
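The default Date format can be checked mechanically. The following is a minimal sketch in Python, assuming that only the full YYYY-MM-DD form shown in the example is accepted (the slide does not say which other ISO 8601 forms belong to the agreed subset). The function name is_dc_date is hypothetical.

```python
from datetime import datetime


def is_dc_date(value):
    """Return True if value is a real calendar date written as YYYY-MM-DD.

    Assumption: only the full-date form from the slide's example is allowed.
    """
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime tolerates missing zero padding ("1997-9-15"), so require
    # that formatting the parsed date reproduces the input exactly.
    return parsed.strftime("%Y-%m-%d") == value


if __name__ == "__main__":
    for candidate in ["1997-09-15", "15/09/1997", "1997-02-30"]:
        print(candidate, is_dc_date(candidate))
```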
12
DC Elements - 3
• Source
• Language
  • language of the resource - NOT the metadata
• Relation
  • no guidelines for usage currently
• Coverage
  • separate working party looking at usage
• Rights
  • rights management seen as too complex for DC. This will give a URL to some external information
13
Simple Example
<HTML><HEAD>
<TITLE>UKOLN Home Page</TITLE>
<META NAME="DC.title" CONTENT="UKOLN: UK Office for Library and Information Networking">
<META NAME="DC.subject" CONTENT="national centre, network information support, library community, awareness, research, information services, public library networking, bibliographic management, distributed library systems, metadata, resource discovery, conferences, lectures, workshops">
<META NAME="DC.description" CONTENT="UKOLN is a national centre for support in network information management in the library and information communities. It provides awareness, research and information services">
<META NAME="DC.creator" CONTENT="Stark, Isobel">
</HEAD>
...
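Tags like those in the example can be generated from a simple record rather than typed by hand. This is a rough sketch in Python, not a UKOLN tool: the function name dc_meta_tags and the dictionary layout are assumptions made for illustration. Values are HTML-escaped so that a quote or ampersand in the metadata cannot break the tag.

```python
from html import escape


def dc_meta_tags(record):
    """Render Dublin Core elements as HTML <META> tags.

    record maps an element name (e.g. "title") to a string or a list of
    strings; lists produce repeated tags, since every element is
    optional and repeatable.
    """
    lines = []
    for element, values in record.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            lines.append(
                '<META NAME="DC.{}" CONTENT="{}">'.format(
                    escape(element, quote=True), escape(value, quote=True)
                )
            )
    return "\n".join(lines)


if __name__ == "__main__":
    print(dc_meta_tags({
        "title": "UKOLN: UK Office for Library and Information Networking",
        "creator": ["Stark, Isobel"],
    }))
```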
14
Element qualifiers
• Need to refine meaning in some cases
• TYPE
  • Refines meaning of element - sub-divides element namespace
• SCHEME
  • Element value taken from external schema, e.g. LCSH for DC.subject, Z39.53 for DC.language
• LANGUAGE
  • Language of element value (not of the resource being described!)
15
Examples - TYPE
• Original DC.creator tag
<META NAME="DC.creator" CONTENT="Stark, Isobel">
• Non-personal author
<META NAME="DC.creator.corporate" CONTENT="UKOLN Information Services Group">
• Author’s email address
<META NAME="DC.creator.email" CONTENT="[email protected]">
16
Examples - SCHEME
• Library of Congress Subject Heading
<META NAME="DC.subject" CONTENT="(SCHEME=LCSH) Library information networks -- Great Britain">
<META NAME="DC.subject" CONTENT="(SCHEME=LCSH) Information technology -- higher education">
…or…
<META NAME="DC.subject" SCHEME="LCSH" CONTENT="Library information networks -- Great Britain">
<META NAME="DC.subject" SCHEME="LCSH" CONTENT="Information technology -- higher education">
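A harvester reading these tags has to cope with both ways of writing the qualifiers. The sketch below, in Python, splits a tag into element, TYPE qualifier, SCHEME and value. It assumes the TYPE qualifier is whatever follows the element in the dotted NAME and that an in-CONTENT scheme always has the form "(SCHEME=...)" at the start. The function name parse_dc_meta is hypothetical.

```python
import re

# Matches a scheme carried inside CONTENT, e.g. "(SCHEME=LCSH) ..."
SCHEME_PREFIX = re.compile(r"^\(SCHEME=([^)]+)\)\s*")


def parse_dc_meta(name, content, scheme=None):
    """Split a Dublin Core META tag into its parts.

    name    -- the NAME attribute, e.g. "DC.creator.email"
    content -- the CONTENT attribute
    scheme  -- the SCHEME attribute, if the tag carried one
    """
    parts = name.split(".")
    if len(parts) < 2 or parts[0].upper() != "DC":
        raise ValueError("not a Dublin Core META name: " + name)
    element = parts[1].lower()
    type_qualifier = ".".join(parts[2:]) or None
    match = SCHEME_PREFIX.match(content)
    if match:
        # The in-CONTENT form wins over a SCHEME attribute in this sketch.
        scheme = match.group(1)
        content = content[match.end():]
    return {
        "element": element,
        "type": type_qualifier,
        "scheme": scheme,
        "value": content.strip(),
    }


if __name__ == "__main__":
    print(parse_dc_meta(
        "DC.subject",
        "(SCHEME=LCSH) Library information networks -- Great Britain",
    ))
    print(parse_dc_meta("DC.creator.corporate",
                        "UKOLN Information Services Group"))
```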
17
Metadata Management
Practical issues of using Dublin Core for Internet resource description...
• UKOLN metadata system
  • Requirements
  • 3 models for metadata management
  • Implementation at UKOLN
18
UKOLN metadata system requirements
• Easy to use
• Work with a variety of methods of creating HTML
• Simple migration to future metadata formats
• Separate metadata from resource
19
Managing Dublin Core (1)
HTML Authoring tool
Embed by hand using HTML or text editor
Pros…
• Simple
• May be useful for training and familiarisation
Cons…
• May not be possible with all editors
• Maintenance problems
• Easy to make errors
20
DC-dot
• A Web based tool for creating Dublin Core <meta> tags
• Automatic generation of some tags based on content of the resource
• Forms based editing of tags
• Cut-and-paste output into HTML
• Conversion to other formats…
  • SOIF, ROADS/WHOIS++, USMARC, GILS...
http://www.ukoln.ac.uk/metadata/dcdot/
21
Managing Dublin Core (2)
Web-site management tool
Use Web-site management tool, for example NetObjects Fusion
Pros…
• Use of Web-site management tools likely to increase
• Object-oriented database approach
Cons…
• Proprietary formats
• Early days - too early to evaluate use for metadata yet?
22
Managing Dublin Core (3)
On the fly generation
Hold Dublin Core separately and embed on-the-fly using server-side include (SSI)
Pros…
• Separates metadata from resource
• Future migration fairly simple
Cons…
• Performance
• Lack of integration with HTML tools
• Server specific
23
UKOLN metadata system (1)
• Embed on-the-fly
• Apache SSI script
• Store metadata using SOIF records
• Use MS-Access as tool to create the records
• Associate metadata with resource by co-locating them in the Web server filestore
24
UKOLN metadata system (2)
[Figure: the HTML editor produces intro.html and the MS-Access database produces intro.html.soif.]
intro.html:
<html><head><title>…</title><!--#exec cmd="getmeta" --></head>...
intro.html.soif:
@FILE { http://www.ukoln.ac....
keywords{13}: xxx, yyy, zzz
description{14}: blah blah b
author{13}: Stark, Isobel
...
}
Apache syntax for calling server-side script:
<!--#exec cmd="getmeta" -->
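What a script such as getmeta has to do can be sketched as follows: read the SOIF record and print META tags for the server to splice into the page. This Python sketch is not the UKOLN script. The mapping from SOIF attribute names to Dublin Core elements (SOIF_TO_DC) is an assumption based on the attribute names shown on the slide, and the parser accepts either a space or a tab after the colon.

```python
import re
from html import escape

HEADER = re.compile(rb"@(\w+)\s*\{\s*(\S+)[^\n]*\n")
ATTRIBUTE = re.compile(rb"\s*([\w.-]+)\{(\d+)\}:[ \t]")

# Hypothetical mapping from SOIF attribute names to Dublin Core elements.
SOIF_TO_DC = {"keywords": "subject", "description": "description",
              "author": "creator"}


def parse_soif(data):
    """Parse one SOIF record (bytes) into (template, url, attributes)."""
    header = HEADER.match(data)
    if not header:
        raise ValueError("missing SOIF header")
    template, url = header.group(1).decode(), header.group(2).decode()
    position, attributes = header.end(), {}
    while True:
        found = ATTRIBUTE.match(data, position)
        if not found:
            break  # no further attributes before the closing brace
        size = int(found.group(2))
        start = found.end()
        # Read exactly `size` bytes, so values may span several lines.
        value = data[start:start + size].decode("utf-8")
        attributes[found.group(1).decode()] = value
        position = start + size
    return template, url, attributes


def getmeta(soif_bytes):
    """Return Dublin Core META tags for the mapped attributes of a record."""
    _, _, attributes = parse_soif(soif_bytes)
    return "\n".join(
        '<META NAME="DC.{}" CONTENT="{}">'.format(
            SOIF_TO_DC[name], escape(value, quote=True))
        for name, value in attributes.items()
        if name in SOIF_TO_DC
    )


if __name__ == "__main__":
    sample = (b"@FILE { http://example.org/intro.html\n"
              b"keywords{13}:\txxx, yyy, zzz\n"
              b"author{13}:\tStark, Isobel\n}\n")
    print(getmeta(sample))
```

Reading values by their declared byte count, rather than line by line, is what allows a description to contain line breaks.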
25
UKOLN metadata system (3)
MS-Access frontend...
[Screenshot labels: filename browser, text boxes, name choosers, UKOLN specific metadata]
26
UKOLN metadata system (4)
[Figure: numbered flow (steps 1 to 6) between a Web robot and the UKOLN Web server. Components: intro.html, containing <!--#exec cmd="getmeta" -->; the SSI script; and intro.html.soif, holding the @FILE { ... } record.]
27
Issues
• Performance
• Interaction with Web caches
• Dublin Core vs Alta Vista style metadata
<META NAME="Description" CONTENT="blah, blah">
<META NAME="Keywords" CONTENT="xxx, yyy, zzz">
• Granularity
  • Which pages should have metadata?
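One way to sidestep the choice between the two styles is to emit both from the same stored values. This is a small sketch in Python and not something the slides say UKOLN does. The function name both_styles is hypothetical, and pairing DC.subject with Keywords is an assumption.

```python
from html import escape


def both_styles(description, keywords):
    """Emit Dublin Core and Alta Vista style META tags for one page."""
    joined = ", ".join(keywords)
    pairs = [
        ("DC.description", description),
        ("DC.subject", joined),
        ("Description", description),
        ("Keywords", joined),
    ]
    return "\n".join(
        '<META NAME="{}" CONTENT="{}">'.format(name, escape(value, quote=True))
        for name, value in pairs
    )


if __name__ == "__main__":
    print(both_styles("blah, blah", ["xxx", "yyy", "zzz"]))
```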
28
What's the point...
…of embedding DC <meta> tags?
• Alta Vista isn't going to look for them
• But, worth doing...
  • within individual projects
  • within specific communities (e.g. eLib)
• Improve local search facilities
  • e.g. load SOIF records into a Netscape Catalogue Server
• Web-site management benefits
29
UKOLN Metadata projects
• ROADS
  • Software for Subject Service
• DESIRE
  • European Web indexing
• NewsAgent
  • Current awareness service for Library and Information Staff
• BIBLINK
  • Information flow from publishers to National Bibliographic Agencies
30
ROADS
• Resource Organisation and Discovery in Subject-based Services
• Web based tools for Subject Services
  • SOSIG, ADAM, OMNI, …
• Manage and search Internet resource descriptions
  • ROADS templates (based on IAFA templates)
  • WHOIS++
http://www.ukoln.ac.uk/roads/
31
ROADS - WHOIS++ (1)
• Simple client-server search and retrieve protocol
• Developed originally for ‘white pages’ applications
• Offer search facilities across several Subject Services
• Distribute a Subject Service across several physical servers
• Query routing - centroids and CIP
32
ROADS - WHOIS++ (2)
• Centroid generated by ADAM contains… “you’ll find the string ‘mona’ in the ‘title’ attribute of at least one record in the ADAM database”.
[Figure: numbered flow (steps 1 to 6) involving a Web browser, a CGI-based WHOIS++ client and the SOSIG, OMNI and ADAM servers, with CIP sharing of centroids.]
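The centroid idea can be illustrated with a toy model: each server publishes, per attribute, the set of words that occur in at least one of its records, and a client uses those sets to decide which servers are worth querying. The Python sketch below is only that toy model. It is not the WHOIS++ or CIP wire format, the function names are hypothetical, and splitting values on whitespace is an assumption.

```python
def build_centroid(records):
    """Summarise records as {attribute: set of lower-cased words}."""
    centroid = {}
    for record in records:
        for attribute, value in record.items():
            words = value.lower().split()
            centroid.setdefault(attribute, set()).update(words)
    return centroid


def route_query(centroids, attribute, term):
    """Return the servers whose centroid lists term under attribute."""
    term = term.lower()
    return sorted(
        server
        for server, centroid in centroids.items()
        if term in centroid.get(attribute, set())
    )


if __name__ == "__main__":
    # The record contents here are invented for illustration.
    centroids = {
        "ADAM": build_centroid([{"title": "Mona Lisa"}]),
        "OMNI": build_centroid([{"title": "Medical imaging"}]),
    }
    print(route_query(centroids, "title", "mona"))
```

A hit in a centroid only says that at least one record matches, so the client still has to send the real query to each server returned.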
33
DESIRE
European Web cataloguing
• Subject Services
  • EuroSOSIG (Bristol), EELS (Lund), Arts (Koninklijke Bibliotheek)
  • Manually created ROADS templates
• European Web Index
  • based on Nordic Web Index (NWI)
  • Robot generated, all resources
  • Multiple servers linked with Z39.50
  • GILS
http://www.nic.surfnet.nl/surfnet/projects/desire/desire.html
34
DESIRE - current work (1)
• Internationalisation of ROADS
• Use of robots to:
  • aid manual cataloguing of resources
  • build indexes based on list of URLs in a ROADS database
• Robot will use embedded Dublin Core if available
35
DESIRE - current work (2)
• Re-design of EWI robot - including:
  • support for Dublin Core
  • EWI records GILS-II compatible
• Allow users to search across subject services and the EWI using Z39.50
  • by converting ROADS records into GILS records
  • by building a WHOIS++ to Z39.50 gateway
http://roads.ukoln.ac.uk/cgi-bin/egwcgi/egwirtcl/targets.egw
36
NewsAgent
Current awareness service for LIS...
• Distributed database
  • servers at LITC, FD, UKOLN - Z39.50
  • metadata (and some full-text)
  • based on DALI
• Mixture of content streams
• Variety of access methods
  • Web, e-mail and Z39.50 clients
  • user-configurable profiles
http://www.ukoln.ac.uk/metadata/NewsAgent/
37
NewsAgent - Content
• Journals
  • Program, VINE, Journal of Librarianship and Information Science
• News and briefing material
  • LA, IIS, UKOLN (Ariadne), BL, LITC
• Web pages
• E-mail lists and USENET news
38
NewsAgent - Harvesting
• Web crawler
  • looking for embedded Dublin Core
  • Limiting the harvest
    – simple heuristics
    – use of Dublin Core Relation element
• E-mail parser
http://www.ukoln.ac.uk/metadata/NewsAgent/dcusage.html
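The "looking for embedded Dublin Core" step can be sketched with Python's standard html.parser module. This is an illustration, not the NewsAgent crawler: the class and function names are hypothetical, and the sketch simply collects every META tag whose NAME starts with "DC.", ignoring case.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect (name, content) pairs from DC.* META tags."""

    def __init__(self):
        super().__init__()
        self.elements = []  # pairs in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            self.elements.append((name, attributes.get("content") or ""))


def extract_dublin_core(html_text):
    """Return the Dublin Core META tags found in an HTML page."""
    parser = DublinCoreExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    page = ('<HTML><HEAD><TITLE>UKOLN Home Page</TITLE>'
            '<META NAME="DC.creator" CONTENT="Stark, Isobel">'
            '<META NAME="Keywords" CONTENT="xxx, yyy, zzz">'
            '</HEAD></HTML>')
    print(extract_dublin_core(page))
```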
39
BIBLINK
Information flow between publishers
• traditional
• new - CD-ROM or Web (new to publishing)
and National Bibliographic Agencies
• British Library, UK
• Biblioteca Nacional, Madrid, Spain
• Bibliothèque Nationale de France, Paris
• Koninklijke Bibliotheek, Den Haag, Netherlands
• Nasjonalbiblioteket, Rana, Norway
• Universitat Oberta de Catalunya, Barcelona, Spain
http://www.ukoln.ac.uk/metadata/BIBLINK/
40
BIBLINK - research
• Scope
  • Electronic publications suitable for inclusion in National Bibliographies
• Metadata
  • Dublin Core (with extensions!), SGML DTD
• Identifiers
  • ISBN, ISSN, SICI, DOI, URN
• Transmission
  • Simple e-mail or Web crawler
• Authentication
  • MD5 hash assigned to each resource
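Assigning and checking an MD5 hash takes only a few lines with Python's hashlib. This sketch assumes the hash is computed over the raw bytes of the resource and carried as a hexadecimal string, which the slide does not specify. The function names are hypothetical.

```python
import hashlib


def md5_checksum(data):
    """Return the MD5 digest of a resource's bytes as lower-case hex."""
    return hashlib.md5(data).hexdigest()


def matches_checksum(data, expected_hex):
    """Check received bytes against the checksum sent with the metadata."""
    return md5_checksum(data) == expected_hex.strip().lower()


if __name__ == "__main__":
    resource = b"<HTML><HEAD><TITLE>Example</TITLE></HEAD></HTML>"
    checksum = md5_checksum(resource)
    print(checksum, matches_checksum(resource, checksum))
```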
41
BIBLINK - data set
• Minimum data set
  – Author, Title, Publisher, Place of Publication, Price, Extent (size), Keywords, Description, Edition/Version, Date of Publication, System Requirements, Format, Language, Terms and Conditions, Frequency, Identifier, Contributor, Checksum
• Similar to DC but some don’t fit…
<META NAME="BIBLINK.placePublication" CONTENT="Bath, UK">
<META NAME="BIBLINK.frequency" CONTENT="monthly">
• Issues over conversion to MARC
42
BIBLINK - demonstrator
[Figure: information flow between Publishers and NBAs/National Libraries. Labels: Dublin Core, Dublin Core, UNIMARC, ??MARC.]
• Cataloguing in Publication (CIP) level records
• Conversion on to local MARC format using USEMARCON
• Enhanced records optionally returned to publishers