Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Page 1: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Digital Library Interoperability Architecture

CS 502 – 20030305
Carl Lagoze – Cornell University

Page 2: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Interoperability is multidimensional

• Syntax
  – XML
• Semantics
  – RDF/RDFS/OWL
• Vocabularies/Ontologies
  – Dublin Core/ABC/CIDOC-CRM
• Search and discovery
  – Z39.50
  – SDLIP
  – ZING
• Document models
  – METS
  – FEDORA

Page 3: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Contrast to Distributed Systems

• Distributed systems
  – Collections of components at different sites that are carefully designed to work with each other
• Heterogeneous or federated systems
  – Cooperating systems in which individual components are designed or operated autonomously

Page 4: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Measuring success of interoperability solutions

• Degree of component autonomy
• Cost of infrastructure
• Ease of contributing components
• Ease of using components
• Breadth of task complexity supported by the solution
• Scalability in the number of components

Page 5: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Families of interoperability solutions

• Strong Standards
• Families of Standards
• External Mediators (wrappers, gateways, schema translators)
• Semantic Specification (Operational Ontologies)
• Agents, Applets, Mobile Code

Page 6: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Interoperability Trade-offs

[Figure: interoperability solutions plotted by Cost and Functionality. Labeled points: HTTP, Google, Z39.50, SGML, Dublin Core, Metadata Harvesting, Dienst.]

Page 7: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Dienst

• is a protocol and reference implementation of a distributed digital library service
• where a network of services provides
  – World Wide Web browser access,
  – uniform search over distributed indexes,
  – and access to structured documents.

Page 8: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Why a service based protocol?

• Expose the operational semantics of the services through an API,
• to permit flexible integration of the services,
• and use of the services by other clients/consumers/services.

Page 9: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Defining the services

• Repository – deposit, storage, and access to structured documents
• Index – process queries on documents and return handles
• Query Mediator – route queries to appropriate indexes
• Collection – define services and content in logical collections
• User Interface – human-oriented front-end for services
• Name Server – resolve URNs (handles) to document location(s)

Page 10: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Dienst Services

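[Diagram of the Dienst services. Components: WWW browser, User Interface, Query Mediator (QM), several Index services, several Repository services, Name Server (NS), Collection. Labeled flows: user query, generic search request, specific search request, user document request, URI, document request, Collection metadata.]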

Page 11: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Defining the protocol

• Structured messages
  – Service
  – Version
  – Verb
  – Arguments
• Template
  /Dienst/<service>/<version>/<verb>[?/]<arguments>
• Example
  /Dienst/Repository/4.0/Formats/ncstrl.cornell/TR94-1418
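
The template can be read as a small grammar: a fixed /Dienst/ prefix, then service, version and verb, then arguments introduced by either "?" or "/". The following is a minimal sketch of a parser for that grammar, not the Dienst reference implementation; the names DienstMessage and parse_dienst_path are invented for illustration.

```python
from typing import NamedTuple


class DienstMessage(NamedTuple):
    """The four parts of a structured message (hypothetical container type)."""
    service: str
    version: str
    verb: str
    arguments: str


def parse_dienst_path(path: str) -> DienstMessage:
    """Split a /Dienst/<service>/<version>/<verb>[?/]<arguments> path."""
    prefix = "/Dienst/"
    if not path.startswith(prefix):
        raise ValueError("not a Dienst request path: " + path)
    parts = path[len(prefix):].split("/", 2)
    if len(parts) < 3:
        raise ValueError("expected service, version and verb: " + path)
    service, version, rest = parts
    # The verb ends at the first "?" or "/", whichever comes first.
    positions = [i for i in (rest.find("?"), rest.find("/")) if i != -1]
    cut = min(positions, default=len(rest))
    verb, arguments = rest[:cut], rest[cut + 1:]
    return DienstMessage(service, version, verb, arguments)


example = parse_dienst_path("/Dienst/Repository/4.0/Formats/ncstrl.cornell/TR94-1418")
print(example)
```

The arguments are returned as one raw string because the template does not say how they are subdivided; that is left to each verb.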

Page 12: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Why a Document Model?

• “Documents” in the current web are both:
  – Unstructured (GET)
  – Chaotic (CGI)
• Different views and pieces of content are needed for:
  – Bandwidth reduction
  – Rights management
  – Usability

Page 13: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Dienst Document Model

• Metadata – support for multiple descriptive formats

• Views – alternative expression or structural representation of the content encapsulated in the digital object

• Divs – hierarchically nested structure contained in a view

Page 14: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Expressing the document model in the protocol

• Structure – expose the views and structure for the digital object

• Disseminate – select the structural component (and packaging of it) to disseminate

• List-Meta-Formats – list available descriptive formats

Page 15: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Protocol Demonstration

• http://techreports.library.cornell.edu:8081/Dienst/Repository/4.0/List-Contents?file-after=2003-01-01

• http://techreports.library.cornell.edu:8081/Dienst/Repository/1.0/Disseminate/cul.cs/TR90-1160/%23oams/xml

• http://techreports.library.cornell.edu:8081/Dienst/Repository/2.0/Structure/cul.cs/TR90-1160

• http://techreports.library.cornell.edu:8081/Dienst/Repository/4.0/Formats/cul.cs/TR90-1160?part=body

• http://techreports.library.cornell.edu:8081/Dienst/Repository/1.0/Disseminate/cul.cs/TR90-1160/body/inline?pageimage=3

Page 16: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Collection Service

• Periodically polled by each user interface server for
  – elements of the collection
  – index servers for the collection

[Diagram: User Interface Servers and Index Servers]

Page 17: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Deploying Collection Globally

• Internet connectivity varies considerably
• Good connectivity between nodes often does not correspond to geographic proximity
• Connectivity Region – a group of nodes on the network that have good connectivity among themselves, relative to nodes outside the region

Page 18: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Connectivity Regions

• When possible, route queries within the region
• In case of failure, use an alternate either within the region or in a “nearby” region
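
The routing rule above can be sketched as a simple preference order: servers in the home region first, then servers in nearby regions. This is only an illustration under assumed data structures; the function name, the region and server names, and the is_up health check are all hypothetical and do not come from Dienst.

```python
from typing import Callable, Dict, List, Optional


def choose_index_server(
    home_region: str,
    servers_by_region: Dict[str, List[str]],
    nearby_regions: Dict[str, List[str]],
    is_up: Callable[[str], bool],
) -> Optional[str]:
    """Prefer a live server in the home region, then fall back to nearby regions."""
    # Regions are tried in order: the home region first, then its neighbours.
    for region in [home_region] + nearby_regions.get(home_region, []):
        for server in servers_by_region.get(region, []):
            if is_up(server):
                return server
    return None  # no reachable server in the home or nearby regions


# Hypothetical deployment: two regions, with both European servers down.
servers = {"europe": ["index-eu-1", "index-eu-2"], "north-america": ["index-na-1"]}
nearby = {"europe": ["north-america"]}
down = {"index-eu-1", "index-eu-2"}
chosen = choose_index_server("europe", servers, nearby, lambda s: s not in down)
print(chosen)
```

How "nearby" is determined is left open on the slide, so the sketch takes it as a precomputed table.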

Page 19: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Origins of the OAI

• Increasing interest in alternative scholarly publishing solutions – e.g., LANL arXiv
• Increasing impact through federation
• UPS Mtg., Santa Fe, October 1999
  – Representatives of various ePrint, library, and publishing communities
  – Goal: definition of an interoperability framework among ePrint providers
  – Reality: Rich interoperability protocols like Dienst are too complicated for widespread deployment
  – Result: Santa Fe Convention, interoperability through metadata harvesting

Page 20: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

The World According to OAI

[Diagram: Service Providers (Discovery, Current Awareness, Preservation) connected to Data Providers by metadata harvesting]

Page 21: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Yes, it's about resource discovery over distributed collections

metadata

Author, Title, Abstract, Identifier

Page 22: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Facilitating/Monitoring Longevity of Distributed Content

[Diagram. Components: Web Sites, Managed Repositories, Event Records, a Policy Enforcer with pairs P1 A1, P2 A2, P3 A3 and outgoing actions, Preservation Metadata, Preservation Service. Labeled flows: Selective Web Crawling, Metadata Harvesting.]

Page 23: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Personalization of Content

[Diagram: a Digital Object (RealAudio video, PowerPoint presentation, SMIL synchronization metadata, structural metadata), Portal A, Portal B, and a Tool Repository]

View A:
• View Slides
• View Video
• View synchronized presentation using applet

View B:
• Get Transcript of Audio
• Search for keyword
• Get Slides translated to French

Page 24: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Cross-Repository Reference Linking

[Diagram: a Linkage Service connected to several sources of citation metadata]

Page 25: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

OAI Technical Infrastructure
Key technical features

• Deploy-now technology – 80/20 rule
• Two-party model – providers (data providers) and consumers (service providers)
• Simple HTTP encoding
• XML schema for some degree of protocol conformance
• Extensibility
  – Multiple item-level metadata
  – Collection level metadata

Page 26: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Content and Metadata

[Diagram: resource, item (metadata) in the repository, and record]

Page 27: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

http://www.openarchives.org/OAI/openarchivesprotocol.html

Page 28: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

record

<record>
  <header>
    <identifier>oai:eg:001</identifier>
    <datestamp>1999-01-01</datestamp>
  </header>
  <metadata>
    <dc xmlns="http://purl.org/dc">
      <title>My Example</title>
    </dc>
  </metadata>
  <about>
    <ea xmlns="http://www.arXiv.org/ea">
      <usage>No restrictions</usage>
    </ea>
  </about>
</record>

Callouts:
• header – protocol support
• metadata – format-specific metadata
• about – community-specific record data
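
To make the three parts of a record concrete, here is a minimal sketch that reads a cleaned-up copy of the example record with Python's standard xml.etree.ElementTree module. The namespace URIs are the ones shown on the slide; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

RECORD_XML = """
<record>
  <header>
    <identifier>oai:eg:001</identifier>
    <datestamp>1999-01-01</datestamp>
  </header>
  <metadata>
    <dc xmlns="http://purl.org/dc">
      <title>My Example</title>
    </dc>
  </metadata>
  <about>
    <ea xmlns="http://www.arXiv.org/ea">
      <usage>No restrictions</usage>
    </ea>
  </about>
</record>
"""

record = ET.fromstring(RECORD_XML)

# Header: protocol-level fields, not in any namespace in this example.
identifier = record.findtext("header/identifier")
datestamp = record.findtext("header/datestamp")

# Metadata and about: namespaced payloads, addressed with {namespace}tag names.
title = record.findtext("metadata/{http://purl.org/dc}dc/{http://purl.org/dc}title")
usage = record.findtext("about/{http://www.arXiv.org/ea}ea/{http://www.arXiv.org/ea}usage")

print(identifier, datestamp, title, usage)
```

A harvester needs only the header to track what it has seen; the metadata and about payloads can be handed to format-specific code keyed on their namespaces.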

Page 29: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

selective harvesting - datestamps

[Diagram: harvesting only the records in the repository that fall within a date range]

Page 30: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

selective harvesting - sets

[Diagram: records in the repository grouped into sets S1 and S2; harvest within set S1]

Page 31: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

set specifics

• repositories define hierarchical organization
• each item in a repository may be organized in one set, several sets, or no sets at all
• meaning of sets or of set hierarchy is not defined in protocol
• individual communities may formulate common set configurations
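
One way to picture hierarchical sets is the following sketch. It assumes that a set is named by a colon-separated path (for example "physics:hep"), which is how the OAI protocol writes set specifications, and that harvesting a set also covers its descendants. The item identifiers and set names are made up.

```python
from typing import Iterable


def in_set(item_set_specs: Iterable[str], requested: str) -> bool:
    """True if the item belongs to `requested` or to one of its descendants."""
    return any(
        spec == requested or spec.startswith(requested + ":")
        for spec in item_set_specs
    )


# Hypothetical items: one in a nested set, one in two sets, one in no set.
items = {
    "oai:eg:001": ["physics:hep"],
    "oai:eg:002": ["physics", "math"],
    "oai:eg:003": [],
}
physics_items = sorted(i for i, specs in items.items() if in_set(specs, "physics"))
print(physics_items)
```

The code attaches no meaning to "physics" or "hep"; as the slide says, what a set means is up to the repository or its community.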

Page 32: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

HTTP encoding - requests

BASE-URL -----------> an.oa.org/OAI-script
keyword arguments --> verb=ListIdentifiers&set=S1

GET
  http://an.oa.org/OAI-script?verb=ListIdentifiers&set=S1

POST
  POST http://an.oa.org/OAI-script HTTP/1.0
  Content-Length: 27
  Content-Type: application/x-www-form-urlencoded

  verb=ListIdentifiers&set=S1
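
The encoding above amounts to appending URL-encoded keyword arguments to the base URL. Here is a minimal sketch using Python's standard urllib.parse.urlencode; the helper name build_request is invented, and the from and until argument names for date-range harvesting are taken from the OAI protocol specification rather than from this slide.

```python
from typing import Optional
from urllib.parse import urlencode


def build_request(base_url: str, verb: str, **arguments: Optional[str]) -> str:
    """Encode an OAI request as a GET URL, skipping arguments that are None."""
    params = {"verb": verb}
    params.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(params)


BASE_URL = "http://an.oa.org/OAI-script"

# Selective harvesting by set.
get_url = build_request(BASE_URL, "ListIdentifiers", set="S1")

# Selective harvesting by datestamp; "from" is a Python keyword, so the
# arguments are passed as an unpacked dict.
range_url = build_request(
    BASE_URL, "ListRecords", **{"from": "2003-01-01", "until": "2003-03-05"}
)

# The same key/value pairs form the body of a POST request.
post_body = urlencode({"verb": "ListIdentifiers", "set": "S1"})

print(get_url)
print(range_url)
print(post_body, len(post_body))
```

For a POST, the Content-Length header must equal the byte length of the encoded body.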

Page 33: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

HTTP encoding - responses

<?xml version="1.0" encoding="UTF-8" ?>
<GetRecord
    xmlns="http://oai.namespace.uri"
    xmlns:xsi="http://w3.namespace.uri"
    xsi:schemaLocation="http://oai.namespace.uri http://oai.schemaURL">
  <responseDate>2000-19-01T19:30:30-04:00</responseDate>
  <requestURL>http://an.oa.org/OAI-script?verb=GetRecord&amp;identifier=oai%3AarXiv%3A0001&amp;metadataPrefix=oai_dc</requestURL>
  <record>
    record contents
  </record>
  additional records
</GetRecord>

Callouts:
• xml namespaces – the xmlns attributes and schema location on the root element
• response header – responseDate and requestURL
• response data – the records

Page 34: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

metadata prefix and schema

• support for harvesting multiple metadata formats
  – metadata schema: each format must have a validating XML schema at a publicly accessible URL (communities may define shared formats and schema)
  – metadata prefix: each repository maps a prefix to the schema it supports, which is used in protocol requests
• support for unqualified Dublin Core is mandatory
  – DC OAI record syntax that builds on base DCMI schema
  – reserved prefix oai_dc
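
The prefix-to-schema mapping can be pictured as a per-repository lookup table. The sketch below is purely illustrative: the table, the eg_marc prefix, the schema URLs, and the function name are all placeholders; only the reserved prefix oai_dc and its mandatory status come from the slide.

```python
from typing import Dict

# Hypothetical prefix-to-schema table for one repository; the URLs are placeholders.
METADATA_FORMATS: Dict[str, str] = {
    "oai_dc": "http://example.org/schemas/oai_dc.xsd",
    "eg_marc": "http://example.org/schemas/eg_marc.xsd",
}


def schema_for(prefix: str, formats: Dict[str, str] = METADATA_FORMATS) -> str:
    """Resolve the metadataPrefix of a request to the schema its records follow."""
    if "oai_dc" not in formats:
        raise RuntimeError("unqualified Dublin Core (oai_dc) support is mandatory")
    try:
        return formats[prefix]
    except KeyError:
        raise ValueError("metadata format not supported: " + prefix) from None


print(schema_for("oai_dc"))
```

Because the prefix is local to a repository, a harvester would typically ask each repository which formats it supports before requesting records.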

Page 35: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

flow control

[Diagram: protocol requests passing between harvester and repository]

Page 36: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

flow control specifics

• applies to all protocol requests that return lists: ListRecords, ListIdentifiers, ListSets
• resumptionToken is opaque
• semantics of partitioning of responses within resumption requests is undefined
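
From the harvester's side, flow control reduces to a loop: repeat the request with the returned token until no token comes back. The sketch below shows that loop against a stand-in repository; the function names, the batch size, and the token format are invented, and a real harvester would issue HTTP requests instead of calling fake_repository.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# One response: a batch of records plus an optional token for the next batch.
Page = Tuple[List[str], Optional[str]]


def harvest(fetch: Callable[[Optional[str]], Page]) -> Iterator[str]:
    """Yield every record, re-issuing the request while a resumptionToken is returned."""
    token: Optional[str] = None
    while True:
        records, token = fetch(token)
        yield from records
        if not token:  # no token means the list is complete
            return


def fake_repository(token: Optional[str]) -> Page:
    """Stand-in for a repository that splits five records into batches of two."""
    data = ["r1", "r2", "r3", "r4", "r5"]
    start = int(token) if token else 0  # only the repository interprets the token
    end = start + 2
    return data[start:end], (str(end) if end < len(data) else None)


harvested = list(harvest(fake_repository))
print(harvested)
```

The harvester never inspects the token, which is what "opaque" means here: batch size and token format can change on the repository side without breaking clients.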

Page 37: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Extensibility Feature Summary

• Multiple metadata formats
• Collection level metadata
  – Identify “about” container
• Record data
  – Terms and conditions
  – Provenance
• Set structure
  – Pre-configured “queries”

Page 38: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

OAI Protocol

Supporting protocol requests:
• Identify
• ListMetadataFormats
• ListSets

Harvesting protocol requests:
• ListRecords
• ListIdentifiers
• GetRecord

[Diagram: harvester (service provider) sending protocol requests to repository (data provider)]

Page 39: Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell University

Challenges and Questions

• Utility of lowest common denominator metadata such as DC

• Quality of metadata from non-professional contributors

• Machine processing to reduce and complement human effort

• Functionality of service structure