OAI Protocol for Metadata Harvesting
Tim Brody
Intelligence, Agents, Multimedia Group
University of Southampton
OpCit – http://opcit.eprints.org/
www.ecs.soton.ac.uk
BCS Metadata Meeting, London 29th May 2002
(Many slides borrowed from Michael L. Nelson)
OAI 2.0
• Public, stable version not released yet … (but very close)
  – Beta released mid-May
  – Public release scheduled: 1st June
• 2.0 implementations in the pipeline
  – British Library, Cornell Univ, Ex Libris, my.OAI, Humboldt Univ, InQuirion Pty Ltd, Library of Congress, NASA, OCLC, Old Dominion Univ, U. of Illinois, U. of Southampton, UCLA, Johns Hopkins U., Indiana U., NYU, UKOLN, Virginia Tech
Open Archives Initiative
• The protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management can still apply!)
• Archive is defined as a “collection of stuff”, not the archivist’s definition of “archive”. “Repository” is used in most OAI documents.
• OAI is happening at break-neck speed...
Metadata Harvesting
• Move away from distributed searching
• Extract metadata from various sources
• Build services on local copies of metadata
  – Resources remain at remote repositories
[Figure: a user searches for “cfd applications” against a local copy of metadata that was harvested offline from several nodes. Each node is independently maintained. All searching, browsing, etc. is performed on the local metadata; individual nodes can still support direct user interaction.]
Metadata Harvesting
• Repositories (archives etc.) = low implementation cost
• Services = higher implementation cost
• Similar to web search model
  – DP9 gateway makes it exactly the same
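The harvesting model can be sketched in a few lines: a service keeps a local copy of harvested metadata and answers searches from that copy, while the resources themselves stay at the remote repositories. This is only an illustrative sketch. The `LocalIndex` class, the record fields, and the `example.org` identifiers are assumptions made for the example and do not come from the protocol.

```python
from dataclasses import dataclass, field


@dataclass
class LocalIndex:
    """Local copy of harvested metadata, keyed by record identifier (hypothetical)."""
    records: dict = field(default_factory=dict)

    def add(self, identifier: str, title: str, url: str) -> None:
        # Store metadata only; the resource itself stays at the remote repository.
        self.records[identifier] = {"title": title, "url": url}

    def search(self, query: str) -> list:
        # Naive case-insensitive substring match on the title.
        q = query.lower()
        return sorted(i for i, m in self.records.items() if q in m["title"].lower())


index = LocalIndex()
index.add("oai:example.org:1", "CFD applications in aerospace", "http://example.org/1")
index.add("oai:example.org:2", "Metadata harvesting basics", "http://example.org/2")
hits = index.search("cfd applications")
```

A real service provider would use a proper search engine instead of a substring match; the point is only that the search never touches the remote repositories.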
            Santa Fe convention    OAI-PMH v.1.0/1.1          OAI-PMH v.2.0
about       eprints                document-like objects      resources
metadata    OAMS                   unqualified Dublin Core    unqualified Dublin Core
transport   HTTP                   HTTP                       HTTP
responses   XML                    XML                        XML
requests    HTTP GET/POST          HTTP GET/POST              HTTP GET/POST
verbs       Dienst                 OAI-PMH                    OAI-PMH
nature      experimental           experimental               stable
model       metadata harvesting    metadata harvesting        metadata harvesting
OAI-PMH v.2.0 [06/2002]
• Goal: recurrent exchange of metadata about resources between systems
• Input:
  – OAI-PMH v.1.0 [01/01 – 09/02]
  – feedback on OAI-implementers
  – deliberations by OAI-tech [09/01 -]
  – alpha test group of OAI-PMH v.2.0 [03/02 -]
• low-barrier interoperability specification
• metadata harvesting model: data provider / service provider
• metadata about resources
• autonomous protocol
• distinction between protocol and periphery
• community-specific extensions
• HTTP based
• XML responses
• unqualified Dublin Core
• stable (1.0 characterized as experimental)
OAI Data Model: Resources / Items / Records
[Figure: a resource (“David”); an item holding all available metadata about David; records of that item in Dublin Core, MARC, and SPECTRUM metadata.]
item = identifier
record = identifier + metadata format + datestamp
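The two definitions above can be made concrete with a small sketch: one item (one identifier) can be disseminated as several records, each pinned down by identifier, metadata format, and datestamp. The `Record` class, the `example.org` identifier, and the `marc` prefix are hypothetical names chosen for illustration; `oai_dc` is the prefix OAI-PMH uses for unqualified Dublin Core.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    identifier: str       # identifies the item
    metadata_prefix: str  # metadata format, e.g. "oai_dc"
    datestamp: str        # datestamp of the record


def record_key(record: Record) -> tuple:
    # record = identifier + metadata format + datestamp
    return (record.identifier, record.metadata_prefix, record.datestamp)


dc = Record("oai:example.org:david", "oai_dc", "2002-05-29")
marc = Record("oai:example.org:david", "marc", "2002-05-29")

# Same item (same identifier), but two distinct records.
same_item = dc.identifier == marc.identifier
distinct_records = record_key(dc) != record_key(marc)
```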
Overview of OAI Verbs
Verb                   Function                                  Group
Identify               description of archive                    archival metadata
ListMetadataFormats    metadata formats supported by archive     archival metadata
ListSets               sets defined by archive                   archival metadata
ListIdentifiers        OAI unique ids contained in archive       harvesting verbs
ListRecords            listing of N records                      harvesting verbs
GetRecord              listing of a single record                harvesting verbs
Most verbs take arguments: dates, sets, ids, metadata formats, and resumption token (for flow control).
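Since requests are plain HTTP GET/POST, a request is just the repository's base URL plus a `verb` parameter and the verb's arguments. The following is a minimal sketch of building such a URL; the helper name `build_request` is an assumption for illustration, and the base URL and identifier are the ones shown in the example response in this deck.

```python
from urllib.parse import urlencode

# The six verbs listed in the overview.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}


def build_request(base_url: str, verb: str, **arguments: str) -> str:
    """Build a GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"{verb} is not a valid OAI-PMH verb")
    params = {"verb": verb}
    params.update(arguments)
    # urlencode percent-encodes reserved characters such as ":" and "/".
    return f"{base_url}?{urlencode(params)}"


url = build_request("http://arXiv.org/oai2", "GetRecord",
                    identifier="oai:arXiv:cs/0112017", metadataPrefix="oai_dc")
```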
Identify
OAI-PMH 1.1
• Arguments
  – none
• Errors
  – none
OAI-PMH 2.0
• Arguments
  – none
• Errors
  – badArgument
ListMetadataFormats
OAI-PMH 1.1
• Arguments
  – identifier (OPTIONAL)
• Errors
  – id does not exist
OAI-PMH 2.0
• Arguments
  – identifier (OPTIONAL)
• Errors
  – badArgument
  – noMetadataFormats
  – idDoesNotExist
ListSets
OAI-PMH 1.1
• Arguments
  – resumptionToken (EXCLUSIVE)
• Errors
  – no set hierarchy
OAI-PMH 2.0
• Arguments
  – resumptionToken (EXCLUSIVE)
• Errors
  – badArgument
  – badResumptionToken
  – noSetHierarchy
ListIdentifiers
OAI-PMH 1.1
• Arguments
  – from (OPTIONAL)
  – until (OPTIONAL)
  – set (OPTIONAL)
  – resumptionToken (EXCLUSIVE)
• Errors
  – no records match
OAI-PMH 2.0
• Arguments
  – from (OPTIONAL)
  – until (OPTIONAL)
  – set (OPTIONAL)
  – resumptionToken (EXCLUSIVE)
  – metadataPrefix (REQUIRED)
• Errors
  – badArgument
  – cannotDisseminateFormat
  – badResumptionToken
  – noSetHierarchy
  – noRecordsMatch
ListRecords
OAI-PMH 1.1
• Arguments
  – from (OPTIONAL)
  – until (OPTIONAL)
  – set (OPTIONAL)
  – resumptionToken (EXCLUSIVE)
  – metadataPrefix (REQUIRED)
• Errors
  – no records match
  – metadata format cannot be disseminated
OAI-PMH 2.0
• Arguments
  – from (OPTIONAL)
  – until (OPTIONAL)
  – set (OPTIONAL)
  – resumptionToken (EXCLUSIVE)
  – metadataPrefix (REQUIRED)
• Errors
  – noRecordsMatch
  – cannotDisseminateFormat
  – badResumptionToken
  – noSetHierarchy
  – badArgument
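The OPTIONAL / REQUIRED / EXCLUSIVE labels for the 2.0 list verbs translate into a small argument check. The sketch below shows one way a repository might decide when to answer `badArgument` for ListIdentifiers or ListRecords; the function name `check_list_arguments` is hypothetical, and the check covers only argument names, not the validity of dates, sets, or tokens.

```python
from typing import Optional

# Arguments accepted by ListIdentifiers and ListRecords in 2.0.
ALLOWED = {"from", "until", "set", "resumptionToken", "metadataPrefix"}


def check_list_arguments(args: dict) -> Optional[str]:
    """Return "badArgument" if the argument names break the 2.0 rules, else None."""
    if set(args) - ALLOWED:
        return "badArgument"  # unknown argument
    if "resumptionToken" in args:
        # EXCLUSIVE: the token must be the only argument besides the verb.
        return None if len(args) == 1 else "badArgument"
    if "metadataPrefix" not in args:
        return "badArgument"  # REQUIRED argument missing
    return None


ok = check_list_arguments({"metadataPrefix": "oai_dc", "from": "2002-01-01"})
missing_prefix = check_list_arguments({"from": "2002-01-01"})
token_plus_set = check_list_arguments({"resumptionToken": "abc", "set": "cs"})
```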
GetRecord
OAI-PMH 1.1
• Arguments
  – identifier (REQUIRED)
  – metadataPrefix (REQUIRED)
• Errors
  – id does not exist
  – metadata format cannot be disseminated
OAI-PMH 2.0
• Arguments
  – identifier (REQUIRED)
  – metadataPrefix (REQUIRED)
• Errors
  – badArgument
  – cannotDisseminateFormat
  – idDoesNotExist
Response with no errors:
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord"… …>http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata> ….. </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
Response with error:
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
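A harvester has to tell these two kinds of response apart. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`: it collects any `error` elements and the identifiers found in record headers. The function names and the returned dictionary layout are assumptions for illustration. Responses from a real repository normally declare an XML namespace, so the sketch compares local tag names only.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    # Drop a "{namespace}" prefix if the response declares one.
    return tag.rsplit("}", 1)[-1]


def parse_response(xml_text: str) -> dict:
    """Collect error codes and header identifiers from one response (sketch)."""
    root = ET.fromstring(xml_text)
    result = {"errors": [], "identifiers": []}
    for element in root.iter():
        name = local_name(element.tag)
        if name == "error":
            result["errors"].append((element.get("code"), (element.text or "").strip()))
        elif name == "header":
            # Only look inside headers, so identifiers in the metadata are ignored.
            for child in element:
                if local_name(child.tag) == "identifier":
                    result["identifiers"].append((child.text or "").strip())
    return result


ERROR_RESPONSE = """<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""

parsed = parse_response(ERROR_RESPONSE)
```

A caller would typically check `parsed["errors"]` first and only process records when that list is empty.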
resumptionToken Flow-Control
• Idempotency of resumptionToken: return same incomplete list when rT is re-issued
  – while no changes occur in the repo: strict
  – while changes occur in the repo: all items with unchanged datestamp
• new attributes for the resumptionToken:
  – expirationDate
  – completeListSize
  – cursor
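Flow control with a resumptionToken amounts to a loop: issue the list request, read the partial list, and re-issue the request with the returned token until no token (or an empty one) comes back. The sketch below assumes namespace-free responses like the simplified examples in this deck, and takes the HTTP call as a `fetch` parameter so the loop can be shown without a network. `harvest_identifiers`, `fake_fetch`, and the `PAGES` data are hypothetical names and values made up for the example.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator


def harvest_identifiers(fetch: Callable[[dict], str], metadata_prefix: str) -> Iterator[str]:
    """Yield every identifier, following resumptionTokens until the list is complete."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        for element in root.iter("identifier"):
            yield element.text
        token = root.find(".//resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty token: the list is complete
        # The token is an exclusive argument: send it with the verb only.
        params = {"verb": "ListIdentifiers", "resumptionToken": token.text.strip()}


# Two made-up pages standing in for a repository's responses.
PAGES = {
    None: "<OAI-PMH><ListIdentifiers><header><identifier>oai:example.org:1</identifier></header>"
          '<resumptionToken completeListSize="2" cursor="0">page2</resumptionToken>'
          "</ListIdentifiers></OAI-PMH>",
    "page2": "<OAI-PMH><ListIdentifiers><header><identifier>oai:example.org:2</identifier></header>"
             '<resumptionToken completeListSize="2" cursor="1"/>'
             "</ListIdentifiers></OAI-PMH>",
}


def fake_fetch(params: dict) -> str:
    # Stand-in for an HTTP request: pick the page by resumptionToken.
    return PAGES[params.get("resumptionToken")]


harvested = list(harvest_identifiers(fake_fetch, "oai_dc"))
```

Because a re-issued token returns the same incomplete list, a harvester built this way can retry a failed request with the same token instead of restarting the whole harvest.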
Adoption
• evolution
  – from talking about OAI-PMH
  – to talking about projects that use OAI-PMH
  – to talking about projects and failing to mention they use OAI-PMH
• => OAI-PMH becomes part of the infrastructure
Data Providers (a.k.a. repositories)
• 49 registered repositories [11/2001]
• 65 registered repositories [03/2002]
• 77 registered repositories [05/2002]
• 5+ million records
• many unregistered repositories
• private implementations (e.g. RDN)
Service Providers
• Arc: cross-searching of registered repositories [ http://arc.cs.odu.edu ]
• CiteBase: research literature search + citation ranking [ http://citebase.eprints.org ]
• OLAC: cross-searching of Language Archive Community repositories [ http://www.language-archives.org/index.html ]
• Scirus: scientific search engine [Elsevier] [ http://www.scirus.com ]
• my.OAI: user-tailorable cross-searching of registered repositories [FS Consulting, Inc.] [ http://www.myoai.com ]
• Growing interest from web search engines
OAI-PMH tools
• Repository Explorer: interactive exploration of repositories [Virginia Tech] [ http://www.purl.org/NET/oai_explorer ]
• eprints.org: generic OAI-PMH compliant repository software [U of Southampton] [ http://www.eprints.org ]
• ALCME: repository and harvester software [OCLC] [ http://alcme.oclc.org/index.html ]
• APIs, other tools @ www.openarchives.org
http://www.openarchives.org/