Interoperability Among Scholarly Repositories: Enabling Workflows Across Distributed Information Carl Lagoze Information Science Cornell University, USA

Interoperability Among Scholarly Repositories:Enabling Workflows Across Distributed

Information

Carl LagozeInformation Science

Cornell University, USA

Herbert Van de SompelResearch Library

Los Alamos National Laboratory, USA

http://creativecommons.org/licenses/by/2.5/

Acknowledgments

This talk based on the following work:

o NSF-funded Pathways project (IIS-0430906)- Cornell University (PIs: Carl Lagoze, Sandy Payette,

Simeon Warner) - LANL Digital Library Research & Prototyping Team (PI

Herbert Van de Sompel).o The LANL aDORe repository effort.

- http://dx.doi.org/10.1093/comjnl/bxh114- http://african.lanl.gov/aDORe/

o The PhD thesis by Jeroen Bekaert (Advisor Herbert Van de Sompel) regarding protocol-based interfaces for Open Archival Information Systems (OAIS). - http://hdl.handle.net/1854/4833

References

• “Rethinking Scholarly Communication”, D-Lib September 2004• “Interoperability for Distributed Scholarly Workflows”, D-Lib

October 2006• “Pathways: Augmenting Interoperability for Scholarly

Repositories”, Upcoming Journal of Digital Libraries

Some Background

• Digital transition of scholarly communication has been in form rather than nature

• Try and build a scholarly communication system that is more natively digital, i.e. use the capabilities of digital, network technologies

o Collaborationo Immediacyo Reuseo Dynamic

• Exploit advances in institutional repositories and interest in open access

• Frame scholarly communication as a workflow among distributed information units

• Provide framework for new advanced serviceso Visualizationo Usage analysiso …

Interoperability in a Heterogeneous World

• Diversity of (repository) technologyo DSpaceo Fedorao aDOReo EPrintso Greenstone

• Define an interoperability layer in whicho Information can be modeledo Information can be sharedo Information can be transferedo Information can be reused

Some Meta-Observations on Interoperability

• Scholarly communication is a long-term endeavor:• Dependent on stability and integrity of participants • Need abstract definitions of models and interfaces that can

be instantiated on the basis of various technologies as time goes by

• Identification is particularly important:

• Scalable• Agnostic about existing identification schemes• Granular

• Object decomposition• Repository origination

• Value chains do not require transfer of all digital object content• The content that needs to be transferred depends on the

nature of the value chain

Augmenting interoperability across RepositoriesD

Space

Fedora

aD

OR

e

ePri

nts

arX

iv

Natu

re

Individual Data Models and Services

Shared Data Model and Services

Motivation 1 : Richer cross-Repository services

• Distributed Repositories provide source materials for cross-Repository overlay services such as discovery services

• Manner in which those materials are exposed must allow for the seamless emergence of rich and meaningful services

Scenario 1: Chemical search engine

• A search engine monitors scholarly repositories but is only interested in making machine-readable chemical structures contained in Digital Objects available from those repositories searchable.

• This constitutes re-use of the (part of) the Digital Objects by a service overlaid upon the monitored repositories.

• And, of course, a chemical compound discovered via the search engine can be cited in some new paper, i.e. the value chain does not stop here

Richer cross-Repository services : Scenario

Motivation 2 : Scholarly communication workflow

• Distributed Repositories at the basis of a digital scholarly communication system

• Scholarly communication as a global workflow (value chain) across those Repositories

• Digital Objects from Repositories are the subject of the workflow; they are used and re-used in many contexts.

Scholarly communication workflow : Scenarios

Scenario 2: Citation

• An author writes a paper (to be Put into her institutional repository) and cites 10 papers available from other repositories.

• A citation to a paper is a type of re-use of the cited paper in a new context.

• And, of course, the new paper can be cited too, i.e. the value chain does not stop here.

Adding Value to Fundamental Units

Paul Ginsparg

Scholarly communication workflow : Scenarios

Scenario 3: Overlay journal

• The editor of an overlay journal selects papers from 3 different repositories for inclusion in the next issue of the overlay journal.

• Each of those articles is being re-used in a new context, with value being added.

• And, the overlay journal can be mirrored for preservation purposes, i.e. the value chain does not stop here.

Building Block I - Repositories

Networked system that provides services pertaining to a managed collection of digital objects.

Institutional repositories, online journals, dataset stores, learning objects, etc.

Aim: Digital Object use and re-use

• We must leverage the value of the materials that become available in those distributed Repositories.

• Think about these Repositories as active nodes in a global environment, not as passive local nodes

o These Repositories are about facilitating the use and re-use of materials in many contexts

o These Repositories are the starting point of value chains

Building Block II: Digital Objects

id

id

Digital Objects

Abstract units of scholarly communication

Compound aggregations consisting of:•Multiple media types•Linkage to services

Have a persistent identifier

Can be recursive: digital objects within digital objects

Instantiated in various implementations

c.f. Kahn/Wilensky Model

Digital Object: A data structure whose principal components are digital data and key-metadata. Digital data can be a Datastream or a Digital Object, i.e. a Digital Object may have one or more other Digital Objects as nested components. Key-metadata must include an identifier for the Digital Object.

id

Data Model: An abstraction for Digital Objects such that each Digital Object can be seen as an instance of the class defined by a Data Model. Example Data Models include the Pathways Core model, the MPEG-21 Digital Item Declaration model, etc.

Surrogate: A serialization of a Digital Object according to a Data Model.

m

Datastream: An ordered sequence of bytes.

Terminology

Obtain interface: a Repository interface that supports the request of services pertaining to individual Digital Objects (including their component Datastreams).

Terminology

Ob

tain

Repository: a networked system that provides services pertaining to a collection of Digital Objects.

Harv

est

Harvest interface: a Repository interface that exposes Surrogates for incremental collecting/harvesting.

Put Put interface: a Repository interface that supports submission of

one or more Surrogates into the Repository, thereby facilitating the addition of Digital Objects to the collection of the Repository.

Augmenting interoperability across RepositoriesD

Space

Fedora

aD

OR

e

ePri

nts

arX

iv

Natu

re

Individual Data Models and Services

m Ob

tain

Harv

est

Put

Common Data Model

Provides a common abstraction for describing digital objects despite their (repository, service)-specific implementation.

A common denominator:• Does not completely cover implementation-

specific features• Features conform to requirements of

interoperability fabric (e.g., identity, workflow support, etc.)

m

Model Core Requirement

• Recursion for n-levels of information containment• Identity independent of specific schemes• Lineage relationships among objects

o evidence of workflow for evidential citation• Semantics associated with entities

o facilitate service mapping• Link to concrete representation• Assertion of persistence levels

m

Data Model

Augmenting interoperability across Repositories

• A Surrogate is available for every Digital Object• A Surrogate is a representation of the Digital Object according to the Pathways Core data model

• The representation is uniform across repositories; not tied to identifier type, content type, application domain.• The Surrogate is what is used in the value chains; the Surrogate is used at Obtain, Harvest and Put interfaces.o Expresses properties and access points for the Digital Object (see later)

m Pathways Core Surrogates (currently XML/RDF)


• The Surrogates provide By-Reference access to constituent datastreams of Digital Objects

• Full asset transfer is only required for certain applications• Avoid IP issues at the level of the interoperability framework

• The idea is that the Surrogate itself is not encumbered by IP issues; attach - by definition - a liberal Creative Commons license to Surrogates • Allow Surrogates to flow freely independent of business models of the underlying content

m Pathways Core Surrogates (currently XML/RDF)

Obtain interface: a Repository interface that supports the request of services pertaining to individual Digital Objects (including their component Datastreams). The core service is the request of a Surrogate for a Digital Object.


Ob

tain

Harv

est

Harvest interface: a Repository interface that exposes Surrogates for incremental collecting/harvesting.

Put Put interface: a Repository interface that supports submission of one or more Surrogates into the Repository, thereby facilitating the addition of Digital Objects to the collection of the Repository.

Surrogate is at the core of the value chain

id

id

id

Ob

tain

Ob

tain

Put

Ob

tain

recombine &

add value

Lineage

Lineage

providerInfo

providerInfo

Repo1

Ob

tain

Harv

est

Put1 Harvest1

Obtain1

Put

Repo2

Ob

tain

Harv

est

Put2 Harvest2

Obtain2

Put

service

Repo2

Repo1

Ob

tain

Harv

est

Ob

tain

Harv

est

Put2 Harvest2

Obtain2

Put1 Harvest1

Obtain1

Put

Put

provider Obtain Harvest Put

Repo1 Obtain1 Harvest1 Put1

Repo2 Obtain2 Harvest2 Put2

Serv

ice

Regis

try

Meeting in NYC, April 20-21 2006

• Supported by Microsoft, Mellon Foundation, Coalition for Networked Information, Digital Library Federation, JISC

• Representatives from institutional Repository projects, scholarly content Repositories, Registry projects, various projects that touch on interoperability

• See http://msc.mellon.org/Meetings/Interop/ for Agenda, Participants, Topics & Goals, Terminology, Presentations, Prototype demonstration.

• Report available since beginning of August 2006• Very likely that an international interoperability effort will be

started towards the end of 2006

http://msc.mellon.org/Meetings/Interop/

Demonstration

• Overlay journal Scenario combined with Search engine Scenario

• Surrogates compliant with Pathways Core Data Model, expressed in RDF/XML.

• Obtain interfaces (OpenURL Application) at:o an aDORe repositoryo arXivo a DSpace repositoryo a Fedora repository

• Harvest interfaces (OAI-PMH) at:o an aDORe repositoryo arXiv o a Fedora repository

• Put interface at a Fedora repository• MS Live Clipboard functionality in user interfaces of arXiv,

Fedora, and the overlay search engine

Demonstration

• Acknowledgments:o Carl Lagoze, Sandy Payette, Simeon Warner, Chris Wilper at

Cornell Universityo Rob Tansley at HP o Luda Balakireva, Xiaoming Liu, Herbert Van de Sompel,

Zhiwu Xie at the Los Alamos National Laboratory

Demonstration

id

id

Ob

tain

Put

Live Clipboard Copy

Live Clipboard PasteSubmit

Documents

Interoperability Among Scholarly Repositories: Enabling Workflows Across Distributed Information Carl Lagoze Information Science Cornell University, USA