Building Interoperable Digital Libraries: A Practical Guide to creating Open Archives

  • Published on
    25-Feb-2016

  • View
    36

  • Download
    0

Embed Size (px)

DESCRIPTION

Building Interoperable Digital Libraries: A Practical Guide to creating Open Archives. Hussein Suleman, hussein@vt.edu Digital Library Research Laboratory Virginia Tech. 1. Introduction. What is the OAI? Motivation General System Strategy History Case study: NDLTD. - PowerPoint PPT Presentation

Transcript

  • Building Interoperable Digital Libraries: A Practical Guide to creating Open ArchivesHussein Suleman, hussein@vt.eduDigital Library Research LaboratoryVirginia Tech

    JCDL 2002

  • 1. IntroductionWhat is the OAI?MotivationGeneral System StrategyHistoryCase study: NDLTD

    JCDL 2002

  • 1.1. What is the OAI ?What is the Open Archives Initiative (OAI)?Organization dedicated to solving problems of digital library interoperability by defining simple protocols, most recently for the exchange of metadata.What is the Protocol for Metadata Harvesting?Protocol to transfer metadata from a source archive to a destination archive

    JCDL 2002

  • 1.2. MotivationExistence of some established but independent archivesNeed for cross-archive services (like search engines)Lack of low-cost interoperability technologyExperience from past projects such as Dienst

    JCDL 2002

  • 1.3. General System StrategyServicesMetadata HarvestingDocument Model

    JCDL 2002

  • 1.4. HistorySanta Fe Meeting October 1999Santa Fe Convention, January 2000Workshops (ACM-DL 2000, ECDL 2000)Structuring of the OAISteering CommitteeTechnical CommitteeOpen Days US/EuropeProtocol for Metadata Harvesting v1.0, January 2001Minor Update: v1.1 July 2001Version 2.0 June 2002

    JCDL 2002

  • 1.5. Case Study: NDLTDNetworked Digital Library of Theses and DissertationsMultiple independent university-based collections of electronic documentsVirginia TechHumboldt U.U. South FloridaInternationalETDLibraryOAIProtocol for MetadataHarvesting

    JCDL 2002

  • 2. Definitions / ConceptsBasic PrinciplesWhat is an Open Archive?Harvesting vs. FederationMetadata vs. DataData and Service ProvidersUnderlying TechnologyHTTP and XMLXML, XML Namespaces and SchemaProtocol PoliciesUniqueness and PersistenceWhat is a record?Multiplicity of MetadataSetsDatestamp, Harvesting and Flow Control

    JCDL 2002

  • 2.1. What is an Open Archive ?Any WWW-based system that can be accessed through the well-defined interface of the Open Archives Protocol for Metadata Harvesting a.k.a. OAI-Compliant RepositoryNo implications for:Physical storage of dataCost of dataMetadata and data formatsAccess control to server

    JCDL 2002

  • 2.2. Harvesting vs. FederationCompeting approaches to interoperabilityFederation is when services are run remotely on remote data (e.g. Federated searching)Harvesting is when data/metadata is transferred from the remote source to the destination where the services are located (e.g. Union catalogues)Federation requires more effort at each remote source but is easier for the local system and vice versa for harvestingOAI currently focuses on harvesting

    JCDL 2002

  • 2.3. Metadata vs. DataData refers to digital objects or digital representations of objectsMetadata is information about the objects (e.g. title, author, etc.)OAI focuses on metadata, with the implicit understanding that metadata usually contains useful links to the source digital objects

    JCDL 2002

  • 2.4. Data and Service ProvidersData Providers refer to entities who possess data/metadata and are willing to share this with others (internally or externally) via well-defined OAI protocols (e.g. database servers)Service Providers are entities who harvest data from Data Providers in order to provide higher-level services to users (e.g. search engines)OAI uses these denotations for its client/server model (data=server, service=client)

    JCDL 2002

  • 2.5. HTTP and XMLMetadata Harvesting Protocol is an almost stateless request/response protocolRequests and responses are sent via the HTTP protocolRequests are encoded as GET/POST operationsResponses are well-formed XML documents

    JCDL 2002

  • 2.6. XML Namespaces and SchemaConsistency and data quality is ensured by using XML Schema descriptions for each possible responseXML Namespaces are used where necessary to clearly define which parts of the responses are actual metadata and which support the Metadata Harvesting Protocol

    JCDL 2002

  • 2.7. Uniqueness and PersistenceEach record must be uniquely addressable by a distinct identifierEach metadata entity should ideally be persistent to guarantee that service providers can always refer back to the source

    JCDL 2002

  • 2.8. What is a record ?A record refers to an independent XML structure that may be associated with digital or physical objectsRecords are usually associated with metadata, not dataOAI advocates harvesting of records, which contain metadata and additional fields to support the harvesting operation

    JCDL 2002

  • 2.9. Sample OAI Record(note: schema and namespaces have been left out for clarity) oai:jcdl2002.org:tut1 2002-02-03 tut OAI Tutorial at JCDL Hussein Suleman English oai:jcdl2002.org:tut1md

    JCDL 2002

  • 2.10. Multiplicity of MetadataMultiple formats of metadata allowedDublin Core is mandatoryAny other format allowed as long as it has an XML encodingE.g. MARC (Libraries), IMS (Education), ETDMS (Theses/Dissertations), RFC1807 (Bibliographies)

    JCDL 2002

  • 2.11. SetsProtocol mechanism to allow for harvesting of sub-collectionsNo well-defined semantics depends completely on local data providersMay be defined by arrangement between data providers and service providersE.g. Subject areas, years, author names, search queries

    JCDL 2002

  • 2.12. Datestamps & Harvesting Each record needs a datestamp that indicates its date of creation or modificationDates are used to allow for harvesting by date range, thus allowing incremental and continuous transfer of metadata from a data provider to a service provider

    JCDL 2002

  • 2.13. Flow ControlHTTP retry-after mechanism can be leveraged to support server-side delaying of a clients requestResumption Tokens can be used to return partial results the client is issued with a token which may be presented to the server to receive more results

    JCDL 2002

  • 3. Requirements to be a Data ProviderSource of metadataServer technologyDatestampsDeletionsUnique identifiersMetadata mappings

    JCDL 2002

  • 3.1. Source of MetadataDatabase in proprietary formatCollection of metadata records in well-defined format/sFiles on diskMetadata may be dynamically or statically extracted from dataSynthetic collection

    JCDL 2002

  • 3.2. Server TechnologyWWW ServerProtocol may be implemented in many formsCGI Script (Perl, C++, Java)Java ServletPHPMetadata (e.g. database) access mechanism requiredSee www.openarchives.org for list of publicly available software templatesSee www.dlib.vt.edu for VT experimental software

    JCDL 2002

  • 3.3. DatestampsNeeded for every record to support incremental harvestingMust be updated for every addition/modification/deletion to ensure changes are correctly propagatedDifferent from dates within the metadata this date is used only for harvestingCan be either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ (must be GMT timezone)

    JCDL 2002

  • 3.4. Unique IdentifiersEach record must have a unique identifierIdentifiers must be valid URIsExample: oai:: oai:etd.vt.edu:etd-1234567890Each identifier must resolve to a single record and always to the same record (for a given metadata format)

    JCDL 2002

  • 3.5. DeletionsArchives may keep track of deleted records, by identifier and datestampAll protocol result sets can indicate deleted recordsIf deletions are being tracked, this information must be stored indefinitely so as to correctly propagate to service providers with varying harvesting schedules

    JCDL 2002

  • 3.6. Metadata MappingsData provider must map its metadata to the formats it chooses to provide through its OAI interfaceUnqualified Dublin Core requiredBest practice is to include a link in the tag to the actual digital resource or at least a human-readable web pageNative formats recommendedCommunity-based formats recommended

    JCDL 2002

  • 4. Metadata Harvesting ProtocolService RequestsIdentifyListMetadataFormatsListSetsGetRecordListIdentifiersListRecordsMetadata MultiplicityDate RangesResumption TokensError and Exceptions

    JCDL 2002

  • 4.1. IdentifyPurposeReturn general information about the archive and its policiesParametersNoneSample URLhttp://www.anarchive.org/cgi-bin/OAI?verb=Identify

    JCDL 2002

  • 4.2. Identify - Response

    JCDL 2002

  • 4.3. ListMetadataFormatsPurposeList metadata formats supported by the archive as well as their schema locations and namespacesParametersidentifier for a specific record (O)Sample URLhttp://www.anarchive.org/cgi-bin/OAI?verb=ListMetadataFormats

    JCDL 2002

  • 4.4. ListMetadataFormats - Response

    JCDL 2002

  • 4.5. ListSetsPurposeProvide a hierarchical listing of sets in which records may be organizedParametersNoneSample URLhttp://www.anarchive.org/cgi-bin/OAI?verb=ListSets

    JCDL 2002

  • 4.6. ListSets Response

    JCDL 2002

  • 4.7. GetRecordPurposeReturns the metadata for a single identifier in the form of an OAI recordParametersidentifier unique id for record (R)metadataPrefix metadata format (R)Sample URLhttp://www.anarchive.org/cgi-bin/OAI? verb=GetRecord&identifier=oai:test:123&metadataPrefix=oai_dc

    JCDL 2002

  • 4.8. GetRecord - Response

    JCDL 2002

  • 4.9. ListIdentifiersPurposeList headers for all records corresponding to the specified parametersParametersfrom start date (O)until end date (O)set set to harvest from (O)metadataPrefix metadata format to list identifiers for (R)resumptionToken flow control mechanism (X)Sample URLhttp://www.anarchive.org/cgi-bin/OAI? verb=ListIdentifiers&metadataPrefix=oai_dc

    JCDL 2002

  • 4.10. ListIdentifiers - Response

    JCDL 2002

  • 4.11. ListRecordsPurposeRetrieves metadata for multiple recordsParametersfrom start date (O)until end date (O)set set to harvest from (O)resumptionToken flow control mechanism (X)metadataPrefix metadata format (R)Sample URLhttp://www.anarchive.org/cgi-bin/OAI? verb=ListRecord&metadataprefix=oai_dc&from=2001-01-01

    JCDL 2002

  • 4.12. ListRecords - Response

    JCDL 2002

  • 4.13. Metadata Multiplicity

    JCDL 2002

  • 4.14. Date Ranges

    JCDL 2002

  • 4.15. Resumption Token

    JCDL 2002

  • 4.16. Errors and Exceptions

    JCDL 2002

  • 5. Implementation DetailsTools RequiredBasic program layoutObject-oriented approachesExtensible metadata generationData cleaningCaching of resultsError handlingDenial-of-service preventionCreating resumption tokens

    JCDL 2002

  • 5.1. Tools RequiredCode templates if available (available for many languages)Basic programming environmentXML generators (for non-trivial encoding)Database access libraries/drivers (e.g. DBI, ODBC, JDBC)

    JCDL 2002

  • 5.2. Basic program layoutparse WWW request to extract parametersif (verb=Identify) ProcessIdentify;else if (verb=ListMetadataFormats) ProcessListMetadataFormats;else if (verb=ListSets) ProcessListSets;else if (verb=GetRecord) ProcessGetRecord;else if (verb=ListIdentifiers) ProcessListIdentifiers;else if (verb=ListRecords) ProcessListRecords;else ReportError (badVerb);

    JCDL 2002

  • 5.3. Object-Oriented ApproachesCleaner separation of protocol, database access and metadata generationExample approachesEach service request is handled by a objectSimpler incremental developmentProtocol, Database and Metadata are objectsGreater portability of codeInheritance from a basic OAI data provider

    JCDL 2002

  • 5.4. Metadata GenerationApproachesMap from source to each metadata formatUse crosswalks (maybe XSLT) to generate additional formatsnameauthorsourcetitlecreatordctitleauthorrfc1807====

    JCDL 2002

  • 5.5. Data CleaningEscape special XML charactersConvert to UTF-8 version of UnicodeConvert entity referencesRemove extraneous whitespaceConvert CR/LF for paragraphsURLs/?#=&:;+ must be encoded as escape sequences

    JCDL 2002

  • 5.6. Result CachingFor multiple requests from many clients or to handle partial result setsKeep temporary tables/filesExpire temporary data when no longer neededIs this necessary to handle date-range requests where new items are added to the result set while harvesting is in progress?

    JCDL 2002

  • 5.7. Error HandlingAll protocol errors are in XML formatbadVerb: illegal verb requestedbadArgument: illegal parameter values or combinationsbadResumptionToken, cannotDisseminateFormat, idDoesNotExist: parameters are in right format but are not legal under current conditionsnoRecordsMatch, noMetadataFormats, noSetHierarchy: empty response exception

    JCDL 2002

  • 5.8. Denial-of-Service PreventionReturn only partial results and issue a resumption token for moreUse 503 retry-after HTTP errors to have clients try again after a specified back-off timeUse access control lists to limit who may access the archiveInvoke an explicit delay before sending back results

    JCDL 2002

  • 5.9. Creating resumptionTokensCombine from/until/metadataPrefix/set and a record number indicator with delimiters into a sequential token For example:from!until!metadataPrefix!set!recordnumber2000-01-01!2001-01-01!!All!100Use a session manager with automatic expiry For example:vtetd14june10amsession12

    JCDL 2002

  • 6. Common ProblemsNo unique identifiers !No datestamps !Incomplete information in databaseNew metadata formatXML responses not validating

    JCDL 2002

  • 6.1. No unique identifiersCreate an independent identifier mappingUse row numbers for a databaseUse filenames for data in filesUse a hash from other fieldsE.g. author+year+first word in title

    JCDL 2002

  • 6.2. No datestampsIgnore the datestamp parameters and stamp all records with the current dateCreate a date table with the current date for all old entries and update dates for new entriesMost Important: Any harvesting algorithm that is interoperably stable for an archive with real dates should be stable for an archive with synthesized dates

    JCDL 2002

  • 6.3. Incomplete informationSynthesize metadata fields based on a priori knowledge of the dataExample: publisher and language may be hard-coded for many archivesOmit fields that cannot be filled in correctly better to have less information than incorrect information !

    JCDL 2002

  • 6.4. New metadata formatFind the description, namespace and formal name of the standardFind an XML Schema description of the data formatIf none exists, write one (consult other OAI people for assistance)Create the mapping and test that it passes XML schema validation

    JCDL 2002

  • 6.5. XML not validatingCheck namespaces and schemaUse Repository Explorer in non-validating mode to check structure of XML, without looking at namespaces or schemataValidate schema by itself if it is non-standardLook at XML produced by other repositoriesWatch out for character encoding issues

    JCDL 2002

  • 7. Tools for TestingRepository ExplorerInteractive BrowsingTesting of parametersMultiple views of dataMultilingual supportAutomatic test suiteOAI RegistryXML Schema Validator

    JCDL 2002

  • 7.1. RE Interactive Browsing

    JCDL 2002

  • 7.2. RE Parameter Testing

    JCDL 2002

  • 7.3. RE Browsing

    JCDL 2002

  • 7.4. RE Browsing

    JCDL 2002

  • 7.5. RE Browsing

    JCDL 2002

  • 7.6. RE Browsing

    JCDL 2002

  • 7.7. RE Browsing

    JCDL 2002

  • 7.8. RE Multiple views of data

    JCDL 2002

  • 7.9. RE Multilingual Support

    JCDL 2002

  • 7.10. RE Automatic Test Suite

    JCDL 2002

  • 7.11. RE Error in Response

    JCDL 2002

  • 7.12. RE Error in XML

    JCDL 2002

  • 7.13. OAI Registry

    JCDL 2002

  • 7.14. OAI Registry

    JCDL 2002

  • 7.15. XSV Schema Validator

    JCDL 2002

  • 8. Service ProvidersHow to HarvestPoliciesIntermediate systemsToolsCase Study: A...