Interoperability Among Scholarly Repositories: Enabling Workflows Across Distributed Information

Carl Lagoze, Information Science, Cornell University, USA
Herbert Van de Sompel, Research Library, Los Alamos National Laboratory, USA


  • Interoperability Among Scholarly Repositories: Enabling Workflows Across Distributed Information

    Carl Lagoze, Information Science, Cornell University, USA

    Herbert Van de Sompel, Research Library, Los Alamos National Laboratory, USA

  • Acknowledgments

    This talk is based on the following work:

    NSF-funded Pathways project (IIS-0430906)
      Cornell University (PIs: Carl Lagoze, Sandy Payette, Simeon Warner)
      LANL Digital Library Research & Prototyping Team (PI: Herbert Van de Sompel)
    The LANL aDORe repository effort
      http://dx.doi.org/10.1093/comjnl/bxh114
      http://african.lanl.gov/aDORe/
    The PhD thesis by Jeroen Bekaert (advisor: Herbert Van de Sompel) regarding protocol-based interfaces for Open Archival Information Systems (OAIS)
      http://hdl.handle.net/1854/4833

  • References

    Rethinking Scholarly Communication, D-Lib, September 2004
    Interoperability for Distributed Scholarly Workflows, D-Lib, October 2006
    Pathways: Augmenting Interoperability for Scholarly Repositories, upcoming in Journal of Digital Libraries

  • Some Background

    Digital transition of scholarly communication has been in form rather than nature
    Try to build a scholarly communication system that is more natively digital, i.e. one that uses the capabilities of digital, network technologies:
      Collaboration
      Immediacy
      Reuse
      Dynamic
    Exploit advances in institutional repositories and interest in open access
    Frame scholarly communication as a workflow among distributed information units
    Provide framework for new advanced services:
      Visualization
      Usage analysis

  • Interoperability in a Heterogeneous World

    Diversity of (repository) technology:
      DSpace
      Fedora
      aDORe
      EPrints
      Greenstone
    Define an interoperability layer in which:
      Information can be modeled
      Information can be shared
      Information can be transferred
      Information can be reused

  • Some Meta-Observations on Interoperability

    Scholarly communication is a long-term endeavor:
      Dependent on stability and integrity of participants
      Need abstract definitions of models and interfaces that can be instantiated on the basis of various technologies as time goes by
    Identification is particularly important:
      Scalable
      Agnostic about existing identification schemes
      Granular
      Object decomposition
      Repository origination
    Value chains do not require transfer of all digital object content:
      The content that needs to be transferred depends on the nature of the value chain

  • Augmenting interoperability across Repositories

    (Diagram: DSpace, Fedora, aDORe, ePrints, arXiv, Nature)

  • Scholarly communication as a cross-repository value chain

    http://dx.doi.org/10.1045/september2004-vandesompel

  • Motivation 1: Richer cross-Repository services

    Distributed Repositories provide source materials for cross-Repository overlay services such as discovery services.

    The manner in which those materials are exposed must allow for the seamless emergence of rich and meaningful services.

  • Richer cross-Repository services: Scenario

    Scenario 1: Chemical search engine

    A search engine monitors scholarly repositories but is only interested in making searchable the machine-readable chemical structures contained in Digital Objects available from those repositories. This constitutes re-use of (part of) the Digital Objects by a service overlaid upon the monitored repositories.

    And, of course, a chemical compound discovered via the search engine can be cited in some new paper, i.e. the value chain does not stop here.

  • Motivation 2: Scholarly communication workflow

    Distributed Repositories at the basis of a digital scholarly communication system
    Scholarly communication as a global workflow (value chain) across those Repositories
    Digital Objects from Repositories are the subject of the workflow; they are used and re-used in many contexts.

  • Scholarly communication workflow: Scenarios

    Scenario 2: Citation

    An author writes a paper (to be Put into her institutional repository) and cites 10 papers available from other repositories. A citation to a paper is a type of re-use of the cited paper in a new context.

    And, of course, the new paper can be cited too, i.e. the value chain does not stop here.

  • Adding Value to Fundamental Units

    Paul Ginsparg

  • Scholarly communication workflow: Scenarios

    Scenario 3: Overlay journal

    The editor of an overlay journal selects papers from 3 different repositories for inclusion in the next issue of the overlay journal. Each of those articles is being re-used in a new context, with value being added.

    And, the overlay journal can be mirrored for preservation purposes, i.e. the value chain does not stop here.

  • Scholarly communication workflow: Scenarios

    Scenario 4: eScience

    A researcher uses datasets from 2 different dataset repositories, performs operations on those, and creates a publication that contains a resulting new dataset and an accompanying paper, and deposits this publication in her institutional repository. This constitutes re-use of the origin datasets, and value added through the creation of the new publication.

    And, of course, the new dataset can be re-used too, i.e. the value chain does not stop here.

  • Building Block I: Repositories

    Networked system that provides services pertaining to a managed collection of digital objects.

    Institutional repositories, online journals, dataset stores, learning objects, etc.

  • Aim: Digital Object use and re-use

    We must leverage the value of the materials that become available in those distributed Repositories.

    Think about these Repositories as active nodes in a global environment, not as passive local nodes.

    These Repositories are about facilitating the use and re-use of materials in many contexts.

    These Repositories are the starting point of value chains.

  • Building Block II: Digital Objects

    Abstract units of scholarly communication

    Compound aggregations consisting of:
      Multiple media types
      Linkage to services

    Have a persistent identifier

    Can be recursive: digital objects within digital objects

    Instantiated in various implementations

    cf. Kahn/Wilensky Model

  • Terminology

    Digital Object: A data structure whose principal components are digital data and key-metadata. Digital data can be a Datastream or a Digital Object, i.e. a Digital Object may have one or more other Digital Objects as nested components. Key-metadata must include an identifier for the Digital Object.

    Data Model: An abstraction for Digital Objects such that each Digital Object can be seen as an instance of the class defined by a Data Model. Example Data Models include the Pathways Core model, the MPEG-21 Digital Item Declaration model, etc.

    Surrogate: A serialization of a Digital Object according to a Data Model.

    Datastream: An ordered sequence of bytes.
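
The definitions above can be sketched as a small recursive data structure. This is only an illustrative sketch, not the Pathways implementation: the class names, the `datastreams` helper, and the `info:example/...` identifiers are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Union


@dataclass
class Datastream:
    """An ordered sequence of bytes."""
    content: bytes


@dataclass
class DigitalObject:
    """Key-metadata (at least an identifier) plus digital data."""
    identifier: str  # key-metadata must include an identifier
    # Digital data: Datastreams and/or nested Digital Objects.
    data: list[Union[Datastream, DigitalObject]] = field(default_factory=list)

    def datastreams(self) -> Iterator[Datastream]:
        """Yield every Datastream, descending into nested Digital Objects."""
        for component in self.data:
            if isinstance(component, Datastream):
                yield component
            else:
                yield from component.datastreams()


# Hypothetical example: a paper object that nests a dataset object.
dataset = DigitalObject("info:example/dataset-1", [Datastream(b"x,y\n1,2\n")])
paper = DigitalObject("info:example/paper-1", [Datastream(b"%PDF-..."), dataset])
total_bytes = sum(len(ds.content) for ds in paper.datastreams())
```

Nesting is expressed by allowing a Digital Object wherever a Datastream is allowed, which mirrors the "digital data can be a Datastream or a Digital Object" wording of the definition.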

  • Terminology

    Repository: a networked system that provides services pertaining to a collection of Digital Objects.

    Obtain interface: a Repository interface that supports the request of services pertaining to individual Digital Objects (including their component Datastreams).

    Harvest interface: a Repository interface that exposes Surrogates for incremental collecting/harvesting.

    Put interface: a Repository interface that supports submission of one or more Surrogates into the Repository, thereby facilitating the addition of Digital Objects to the collection of the Repository.
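
To make the three interfaces concrete, here is a minimal in-memory sketch. It is not how any of the named repositories implement them; the class and method names, the timestamp-based notion of "incremental", and the example identifiers are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Surrogate:
    identifier: str  # identifier of the Digital Object it represents
    body: str        # serialized form, e.g. an RDF/XML string


class InMemoryRepository:
    """Toy repository exposing Obtain, Harvest and Put operations."""

    def __init__(self):
        self._store = {}  # identifier -> (timestamp, Surrogate)

    def put(self, surrogates, when=None):
        """Put: accept one or more Surrogates into the collection."""
        when = when or datetime.now(timezone.utc)
        for surrogate in surrogates:
            self._store[surrogate.identifier] = (when, surrogate)

    def obtain(self, identifier):
        """Obtain: return the Surrogate for one Digital Object."""
        return self._store[identifier][1]

    def harvest(self, since=None):
        """Harvest: return Surrogates added or changed after `since`."""
        return [s for (ts, s) in self._store.values()
                if since is None or ts > since]


# Hypothetical usage: two deposits, then an incremental harvest.
repo = InMemoryRepository()
t0 = datetime(2006, 4, 20, tzinfo=timezone.utc)
t1 = datetime(2006, 4, 21, tzinfo=timezone.utc)
repo.put([Surrogate("info:example/a", "<rdf/>")], when=t0)
repo.put([Surrogate("info:example/b", "<rdf/>")], when=t1)
new_ids = [s.identifier for s in repo.harvest(since=t0)]
```

Here a harvester that last visited at `t0` would only receive the Surrogate deposited afterwards, which is the sense of "incremental" assumed in this sketch.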

  • Augmenting interoperability across Repositories

    (Diagram: DSpace, Fedora, aDORe, ePrints, arXiv, Nature)

  • Common Data Model

    Provides a common abstraction for describing digital objects despite their (repository, service)-specific implementation.

    A common denominator:
      Does not completely cover implementation-specific features
      Features conform to requirements of interoperability fabric (e.g., identity, workflow support, etc.)

  • Model Core Requirements

    Recursion for n-levels of information containment
    Identity independent of specific schemes
    Lineage relationships among objects
      evidence of workflow for evidential citation
    Semantics associated with entities
      facilitate service mapping
    Link to concrete representation
    Assertion of persistence levels

  • Data Model

  • Recursion

  • Entities

    Entity: to represent a Digital Object and to attach properties to contained elements
    hasEntity: to express containment/recursion

  • Identity

  • 2 levels of Identity

    hasIdentifier ~ traditional identifier(s) of Digital Object (e.g., DOI)
    providerInfo ~ repository-centric, fine granularity identification (provider, preferredIdentifier, versionKey)
      supports service requests at the granularity of the repository
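
The providerInfo triple can be sketched as a small immutable record. The field names follow the slide; treating versionKey as optional, the `same_object` helper, and the "2.0" version are assumptions for illustration. The example values are taken from the overlay-journal example later in the talk, on the reading that `info:sid/overlay.org` is the provider and `1.0` the versionKey.

```python
from typing import NamedTuple, Optional


class ProviderInfo(NamedTuple):
    """Repository-centric identification of a Digital Object."""
    provider: str
    preferred_identifier: str
    version_key: Optional[str] = None  # assumed optional in this sketch


def same_object(a: ProviderInfo, b: ProviderInfo) -> bool:
    """True if both refer to the same object at the same provider,
    regardless of version (a hypothetical helper)."""
    return (a.provider, a.preferred_identifier) == (b.provider, b.preferred_identifier)


v1 = ProviderInfo("info:sid/overlay.org", "info:doi/10.9999/2006.02.001", "1.0")
v2 = v1._replace(version_key="2.0")  # hypothetical later version
```

Because the triple includes the provider, two repositories holding copies of the same DOI-identified paper would still yield distinct providerInfo values, which is what lets a service request be directed at one specific repository.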

  • Lineage Relationships

  • Lineage

    Provides the basis for evidential citation
    Co-exists with and complements bibliographic citation
    hasLineage: value is providerInfo of object from which it derives.
    Basis of value chains.
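
Since each hasLineage value points at the providerInfo of a source object, following those pointers reconstructs a value chain. The sketch below assumes lineage is available as a plain mapping from an object's (provider, preferredIdentifier) pair to the pairs it derives from; the `ancestors` function and the mirror entry are hypothetical.

```python
from collections import deque


def ancestors(start, lineage):
    """Return every object `start` derives from, nearest first.

    `lineage` maps a (provider, preferredIdentifier) pair to the list
    of pairs named in its hasLineage properties.
    """
    seen, order = {start}, []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, []):
            if parent not in seen:  # guard against cycles and repeats
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


arxiv = ("info:sid/arXiv.org", "info:arxiv/cs.DL/0502057")
overlay = ("info:sid/overlay.org", "info:doi/10.9999/2006.02.001")
mirror = ("info:sid/mirror.example.org", "info:example/mirror-copy")  # hypothetical

lineage = {overlay: [arxiv], mirror: [overlay]}
chain = ancestors(mirror, lineage)
```

In this example the mirrored copy traces back first to the overlay journal article and then to the arXiv paper it was selected from.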

  • Basis for a Network of Linked Digital Objects

  • Semantics

  • Concrete Representation

  • Persistence Guarantees

  • Augmenting interoperability across Repositories

    A Surrogate is available for every Digital Object
    A Surrogate is a representation of the Digital Object according to the Pathways Core data model
    The representation is uniform across repositories; not tied to identifier type, content type, application domain.
    The Surrogate is what is used in the value chains; the Surrogate is used at Obtain, Harvest and Put interfaces.
    Expresses properties and access points for the Digital Object (see later)

    Pathways Core Surrogates (currently XML/RDF)

  • Augmenting interoperability across Repositories

    The Surrogates provide By-Reference access to constituent datastreams of Digital Objects
    Full asset transfer is only required for certain applications
    Avoid IP issues at the level of the interoperability framework
    The idea is that the Surrogate itself is not encumbered by IP issues; attach, by definition, a liberal Creative Commons license to Surrogates
    Allow Surrogates to flow freely independent of business models of the underlying content

    Pathways Core Surrogates (currently XML/RDF)
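
By-reference access means the Surrogate carries locations of datastreams rather than their bytes. The sketch below serializes such a Surrogate as RDF/XML with Python's standard library. The RDF namespace is the standard one, but the `pw` namespace URI and the `Entity`, `hasIdentifier`, and `hasDatastream` element names used here are placeholders, not the actual Pathways Core vocabulary.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PW = "http://example.org/pathways-sketch#"  # hypothetical namespace


def build_surrogate(identifier, datastream_urls):
    """Serialize a Surrogate that references datastreams by URL only."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("pw", PW)
    root = ET.Element(f"{{{RDF}}}RDF")
    entity = ET.SubElement(root, f"{{{PW}}}Entity")
    ident = ET.SubElement(entity, f"{{{PW}}}hasIdentifier")
    ident.text = identifier
    for url in datastream_urls:
        # Only the location travels with the Surrogate, never the bytes.
        ds = ET.SubElement(entity, f"{{{PW}}}hasDatastream")
        ds.set(f"{{{RDF}}}resource", url)
    return ET.tostring(root, encoding="unicode")


xml_text = build_surrogate(
    "info:doi/10.9999/2006.02.001",
    ["http://www.overlay.org/files/2006.02.001/pdf"],
)
```

A consumer that needs the full asset would dereference the URL itself, so any access restrictions stay with the content provider rather than with the Surrogate.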

  • info:doi/10.9999/2006.02.001

    info:doi/10.9999/2006.02.001
    1.0
    info:sid/overlay.org

    info:arxiv/cs.DL/0502057

    info:arxiv/cs.DL/0502057
    info:sid/arXiv.org

    http://www.overlay.org/files/2006.02.001/pdf

  • Augmenting interoperability across Repositories

    Obtain interface: a Repository interface that supports the request of services pertaining to individual Digital Objects (including their component Datastreams). The core service is the request of a Surrogate for a Digital Object.

    Harvest interface: a Repository interface that exposes Surrogates for incremental collecting/harvesting.

    Put interface: a Repository interface that supports submission of one or more Surrogates into the Repository, thereby facilitating the addition of Digital Objects to the collection of the Repository.

  • Surrogate is at the core of the value chain

    (Diagram labels: Lineage, Lineage)

  • Repo1: Put1, Harvest1, Obtain1

  • Repo2: Put2, Harvest2, Obtain2
    Repo1: Put1, Harvest1, Obtain1

    provider   Obtain    Harvest    Put
    Repo1      Obtain1   Harvest1   Put1
    Repo2      Obtain2   Harvest2   Put2

  • Meeting in NYC, April 20-21 2006

    Supported by Microsoft, Mellon Foundation, Coalition for Networked Information, Digital Library Federation, JISC
    Representatives from institutional Repository projects, scholarly content Repositories, Registry projects, various projects that touch on interoperability
    See http://msc.mellon.org/Meetings/Interop/ for Agenda, Participants, Topics & Goals, Terminology, Presentations, Prototype demonstration.
    Report available since beginning of August 2006
    Very likely that an international interoperability effort will be started towards the end of 2006

  • Demonstration

    Overlay journal Scenario combined with Search engine Scenario
    Surrogates compliant with Pathways Core Data Model, expressed in RDF/XML.
    Obtain interfaces (OpenURL Application) at:
      an aDORe repository
      arXiv
      a DSpace repository
      a Fedora repository
    Harvest interfaces (OAI-PMH) at:
      an aDORe repository
      arXiv
      a Fedora repository
    Put interface at a Fedora repository
    MS Live Clipboard functionality in user interfaces of arXiv, Fedora, and the overlay search engine
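
For the Harvest interfaces, an incremental OAI-PMH request is an HTTP GET with a `verb`, a `metadataPrefix`, and an optional `from` date. The sketch below only builds such a request URL. The base URL is hypothetical, and `oai_dc` is used because every OAI-PMH repository must support it; the prefix under which the demonstration exposed Pathways Core Surrogates is not stated in the talk.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None):
    """Build an OAI-PMH ListRecords request URL.

    `from_date` is a UTC datestamp such as "2006-04-20"; when given,
    only records created or changed since then are requested.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical base URL; a real harvester would use the repository's own.
url = list_records_url("http://repository.example.org/oai", "oai_dc", "2006-04-20")
```

A full harvester would also follow the `resumptionToken` returned with large result sets, which is omitted here.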

  • Demonstration

    Acknowledgments:
      Carl Lagoze, Sandy Payette, Simeon Warner, Chris Wilper at Cornell University
      Rob Tansley at HP
      Luda Balakireva, Xiaoming Liu, Herbert Van de Sompel, Zhiwu Xie at the Los Alamos National Laboratory

  • Demonstration

    Live Clipboard Copy
    Live Clipboard Paste
    Submit

  • Questions, Comments, Flames