Extracting XML from Unicorn with OAI and SRU
European Unicorn User Group Conference
Glasgow Caledonian University
September 7th & 8th, 2006
Benoit PAUWELS
Université Libre de Bruxelles (ULB)
Brussels
Agenda
• Introduction – Unicorn interfaces
• Part 1: An OAI frontend for Unicorn
• Part 2: An SRU frontend for Unicorn
– Short description of OAI and SRU protocols
– Overview of technical implementation
– Use cases and demos
Introduction
• OAI and SRU are ‘open’ protocols that permit exchange of metadata between information systems
• Well-known Unicorn interfaces:
– Unicorn API server
– Unicorn Webcat/iBistro/iLink server
– Unicorn Z39.50 server
• All of them follow a request/response model
Unicorn interfaces: API server
[Diagram: a client system sends API requests over TCP/IP sockets to the API server on the Unicorn server, which reads the catalogue database (records and indexes) and returns API responses as API data codes/values.]
SirsiDynix clients:
• Character client
• C Workflows client
• Java Themes client
Communication protocol: TCP/IP sockets
Information exchange protocol: proprietary SirsiDynix API requests/responses
Returned record structure: proprietary SirsiDynix format (data codes and values)
Unicorn interfaces: iLink
[Diagram: a client system sends an iLink request (URL) over HTTP to the web server and iLink on the Unicorn server, which reads the catalogue database (records and indexes) and returns an HTML page.]
Client:
• Any Web browser
Communication protocol: HTTP
Information exchange protocol: URL requests / HTML responses
Returned record structure: HTML
Unicorn interfaces: Z39.50
[Diagram: a client system sends a Z39.50 request to the Z39.50 server on the Unicorn server, which reads the catalogue database (records and indexes) and returns a Z39.50 response containing MARC21 records.]
Client:
• Any Z39.50 client
Communication protocol: Z39.50 specific
Information exchange protocol: Z39.50 specific
Returned record structure: typically MARC21
Unicorn interfaces
• API: Proprietary
– low interoperability level
• HTML: Record data not well structured
– low reusability level
• Z39.50: Protocol specific
– more difficult to implement (steep learning curve)
– Z39.50 is stateful
All three are difficult to integrate into today’s web services environments. Instead:
• communication: use HTTP
• information exchange: use open protocols (like OAI and SRU)
• record data structure: use XML (according to a well-defined XML Schema)
2 new Unicorn interfaces
• HTTP / Open / XML
• OAI-PMH: Open Archives Initiative – Protocol for Metadata Harvesting
• SRU: Search and Retrieve via URL
OAI-PMH: the protocol
[Diagram: a service provider sends OAI requests embedded in HTTP to the web server and OAI frontend of a data provider, which sits in front of a document archive and returns OAI responses embedded in HTTP.]
• ‘Harvester collects metadata from archives’
• Stateless protocol: sequence of OAI requests/responses over HTTP
• Just harvesting -- NOT searching
OAI-PMH: the protocol
OAI requests
• HTTP GET|POST requests
• Syntax
– BASE URL
• host + port + path of OAI request handler
– key=value pairs
• Examples:
– http://www.cible.ulb.ac.be:80/cgi-bin/OAI20/catalog?verb=Identify
– http://www.biomedcentral.com/oai/1.1/bmcoai.asp?verb=GetRecord&identifier=oai:bmc:1471-2105-1-1&metadataPrefix=oai_dc
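The request syntax above (base URL plus key=value pairs) can be sketched in a few lines of Python. This is only an illustration: the helper name `oai_request_url` is hypothetical and not part of any OAI toolkit.

```python
from urllib.parse import urlencode

def oai_request_url(base_url, verb, **arguments):
    """Return base_url + '?' + key=value pairs, with 'verb' first."""
    pairs = [("verb", verb)] + sorted(arguments.items())
    # safe=":" keeps OAI identifiers such as oai:bmc:... readable in the URL
    return base_url + "?" + urlencode(pairs, safe=":")

url = oai_request_url(
    "http://www.biomedcentral.com/oai/1.1/bmcoai.asp",
    "GetRecord",
    identifier="oai:bmc:1471-2105-1-1",
    metadataPrefix="oai_dc",
)
print(url)
```

The printed URL is the second example request from the slide.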
OAI-PMH: the protocol
OAI responses
• XML encoded bytestreams, containing the records
• Record = triplet
– header (unique OAI identifier)
– metadata
– about
• Metadata schemes
– XML Schema
– Minimum: unqualified Dublin Core
– Community specific
• Example of a record (catkey 450000 from ULB catalogue):
– oai_dc, marc21, umods
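To make the header/metadata/about triplet concrete, here is a minimal Python sketch that reads one record from a GetRecord response. The XML fragment is hand-written for illustration and is not actual output of the ULB server; `parse_record` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written response fragment (assumption, not ULB output).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:ulbcat:450000</identifier>
        <datestamp>2006-08-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example title</dc:title>
          <dc:creator>Example, Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""

def parse_record(xml_text):
    """Extract the parts of the record triplet that are present."""
    root = ET.fromstring(xml_text)
    record = root.find(".//oai:record", NS)
    return {
        "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
        "titles": [e.text for e in record.iterfind(".//dc:title", NS)],
        # the 'about' part is optional, so it may be missing
        "has_about": record.find("oai:about", NS) is not None,
    }

parsed = parse_record(SAMPLE)
print(parsed)
```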
OAI-PMH: the protocol
Simple: 6 OAI requests/responses
• Identify
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=Identify
• ListMetadataFormats [identifier]
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListMetadataFormats
• ListSets
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListSets
• GetRecord identifier, metadataPrefix
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=marc21
OAI-PMH: the protocol
Simple: 6 OAI requests/responses (continued)
• ListRecords metadataPrefix, [from,until,set]
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=mhld21&set=elper
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=marc21&from=2006-08-01
• ListIdentifiers metadataPrefix, [from,until,set]
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListIdentifiers&metadataPrefix=oai_dc
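The six verbs and their arguments can be written down as a small lookup table. The Python sketch below checks a request against that table; it covers only the arguments named on the slides, and `check_arguments` is a hypothetical helper, not code from the ULB frontend.

```python
# verb -> (required arguments, optional arguments), as listed on the slides
OAI_VERBS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
}

def check_arguments(verb, arguments):
    """Return a list of problems; an empty list means the request looks valid."""
    if verb not in OAI_VERBS:
        return [f"unknown verb: {verb}"]
    required, optional = OAI_VERBS[verb]
    given = set(arguments)
    problems = [f"missing argument: {name}" for name in sorted(required - given)]
    problems += [
        f"unexpected argument: {name}"
        for name in sorted(given - required - optional)
    ]
    return problems

print(check_arguments("GetRecord", {"identifier": "oai:ulbcat:245000"}))
```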
OAI frontend for Unicorn
• Implementation of the data provider functionality (2001)
• http://www.openarchives.org/tools/tools.html
– pick a template and interface with Unicorn through Unicorn database tools
• Our choice: Object Oriented Perl frontend (H. Suleman – Virginia Tech)
OAI frontend for Unicorn
[Diagram: an OAI request embedded in HTTP reaches the HTTP server on the Unicorn server. The HTTP server starts the CGI program OAI, a C wrapper that forks OAI.pl in the ‘sirsi’ environment. OAI.pl reads the Unicorn database.]
OAI.pl:
• calls the appropriate OAI request handler
• retrieves metadata from the Unicorn database
• formats it in XML and returns the OAI response embedded in HTTP
OAI frontend for Unicorn
Example: implementation of the GetRecord request
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=oai_dc
1. Get metadata from Unicorn for catkey 245000
$record = `echo $catkey | catalogdump -of | filtermarc -iALL -od -Ds`;
@dates = split('\|', `echo $catkey | selcatalog -iK -opr`);
2. Convert ANSEL character set into ISO-LATIN-1
3. Map from MARC to oai_dc
4. Format into XML
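Steps 3 and 4 can be illustrated with a short Python sketch. The three-entry crosswalk below is an assumed subset chosen for illustration, not the mapping used by the ULB frontend (which is written in Perl), and the input is assumed to be already converted from ANSEL.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Assumed crosswalk (a tiny subset): MARC tag -> Dublin Core element.
MARC_TO_DC = {"100": "creator", "245": "title", "650": "subject"}

def marc_to_oai_dc(fields):
    """fields: list of (MARC tag, text) pairs. Returns an oai_dc XML string."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for tag, text in fields:
        element_name = MARC_TO_DC.get(tag)
        if element_name is None:
            continue  # tag not covered by this sketch
        ET.SubElement(root, f"{{{DC_NS}}}{element_name}").text = text
    return ET.tostring(root, encoding="unicode")

xml_text = marc_to_oai_dc(
    [("245", "Example title"), ("100", "Example, Author"), ("999", "local data")]
)
print(xml_text)
```

Fields without a crosswalk entry (here the hypothetical local tag 999) are simply dropped, which is the usual trade-off when mapping MARC to unqualified Dublin Core.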
OAI frontend for Unicorn
Example: implementation of the ‘set’ parameter of the ListRecords request
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&set=elper
• Precompile set as a file of catkeys
– name of file: « <name of set>_catkeys »
• einstein_albert_catkeys
• elper_catkeys
• sd_catkeys
• all_catkeys
– through periodic execution of « mkoaisets » custom report
OAI frontend for Unicorn
Example: implementation of the ‘from/until’ parameters of the ListRecords request
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&from=2006-08-01&until=2006-08-31
• BRS index on creation/modification date?
• Every Unicorn record that gets created or modified is ‘touched’ in the ‘textedit’ and ‘browsedit’ directories
• Custom report ‘cadutext’
– saves catkeys to <ud>/Savedkeys/adutext/rptid
– adds line ‘rptid|date|status’ to <ud>/Lastruns/cadutext
• Example: « from=2006-08-01&until=2006-08-31 »
– obtain report ids for all runs of cadutext after 2006-08-01 and before 2006-08-31 from the file <ud>/Lastruns/cadutext
– for each of these report ids: obtain catkeys from <ud>/Savedkeys/adutext/rptid and save them to randomnumber_catkeys file
– sort and uniq the randomnumber_catkeys file
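The from/until procedure above amounts to a filter, a merge, and a sort/uniq. The Python sketch below shows that logic on in-memory data. It assumes that the date in each ‘rptid|date|status’ line is in YYYY-MM-DD form and that both bounds are inclusive; the report ids and catkeys are made up for illustration.

```python
from datetime import date

def report_ids_in_range(lastruns_lines, from_date, until_date):
    """Select report ids whose run date lies within [from_date, until_date]."""
    selected = []
    for line in lastruns_lines:
        rptid, run_date, _status = line.strip().split("|")
        if from_date <= date.fromisoformat(run_date) <= until_date:
            selected.append(rptid)
    return selected

def catkeys_in_range(lastruns_lines, saved_keys, from_date, until_date):
    """Merge the catkeys of all selected runs, then sort and de-duplicate them."""
    merged = []
    for rptid in report_ids_in_range(lastruns_lines, from_date, until_date):
        merged.extend(saved_keys.get(rptid, []))
    return sorted(set(merged), key=int)

# Hypothetical stand-ins for <ud>/Lastruns/cadutext and <ud>/Savedkeys/adutext/rptid.
lastruns = ["r1|2006-07-30|OK", "r2|2006-08-02|OK", "r3|2006-08-15|OK"]
saved = {"r1": ["100"], "r2": ["245000", "617924"], "r3": ["617924", "450000"]}

result = catkeys_in_range(lastruns, saved, date(2006, 8, 1), date(2006, 8, 31))
print(result)
```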
OAI frontend for Unicorn
• Limitations of implementation:
– ListRecords/ListIdentifiers:
• The from and until parameters are not permitted if the set parameter is given on the request
• The from and until parameters are permitted if the set parameter is not given on the request, but their values should fall within a certain date range (at this moment arbitrarily set to ‘today - 2 months’ and ‘today’)
– Deleted records
• Complete source code and documentation available on the API Repository (http://sirsiapi.org)
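The ListRecords/ListIdentifiers restrictions listed on this slide can be expressed as a simple check. This Python sketch is an illustration only: it approximates ‘today - 2 months’ as 61 days, and `check_selective_harvest` is a hypothetical name, not a function from the published source code.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=61)  # assumption: 'today - 2 months' taken as 61 days

def check_selective_harvest(arguments, today):
    """Return an error message, or None if the combination is accepted."""
    has_dates = "from" in arguments or "until" in arguments
    if "set" in arguments and has_dates:
        return "from/until cannot be combined with set"
    for name in ("from", "until"):
        if name in arguments:
            value = date.fromisoformat(arguments[name])
            if not (today - WINDOW <= value <= today):
                return f"{name} is outside the supported date range"
    return None

today = date(2006, 9, 7)
print(check_selective_harvest({"set": "elper", "from": "2006-08-01"}, today))
print(check_selective_harvest({"from": "2006-08-01"}, today))
```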
OAI frontend - use cases @ ULB
Use case 1: Vlink - OpenURL resolver system
Joint project with Vrije Universiteit Brussel (VUB)
[Diagram: sources (ULB iLink, JSTOR, ISI Web of Science, Elsevier ScienceDirect, OVID WebSpirs) send OpenURLs to Vlink, which consults the Vlink knowledge base and returns an HTML page of extended services.]
OAI frontend - use cases @ ULB
Use case 1: Vlink - OpenURL resolver system
• OpenURL sent from iLink:
http://bibdev.vub.ac.be/cgi-bin/openurlulb?sid=ULB:Webcat&id=oai:ulbcat:617924
• This OpenURL does not contain enough metadata for the specific item, so Vlink fetches back to Unicorn through an OAI GetRecord request to obtain a full MARC21 bibliographic description:
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:617924&metadataPrefix=marc21
OAI frontend - use cases @ ULB
Use case 1: Vlink - OpenURL resolver system
• Feed Vlink Knowledge Base through OAI harvesting
[Diagram: Vlink fills the Vlink Knowledge Base by harvesting Unicorn over OAI-PMH.]
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=mhld21&set=elper
OAI frontend - use cases @ ULB
Use case 2: Unicat - Virtual Union Catalog of Belgium
[Diagram: Unicat architecture. Data providers: university library catalogs (Unicorn, Aleph, VIRTUA, VUBIS) and public, museum and other sources, connected over OAI (SRU is also shown). Central repository: Unicat Harvester, Union OAI Archive, Unicat Indexer and search/browse indexes. The end user reaches the Unicat WWW Gateway over HTML.]
SRU: the protocol
[Diagram: a client system sends an SRU request over HTTP to the web server and SRU frontend on the Unicorn server, which reads the catalogue database (records and indexes) and returns an SRU response in XML.]
Communication protocol: HTTP
Information exchange protocol: SRU
Returned record structure: XML
• ‘Client searches and retrieves metadata records from an archive’
• Stateless protocol: sequence of SRU requests/responses over HTTP
• Search and Retrieve (<-> OAI: harvesting)
SRU: the protocol
SRU requests
• HTTP GET requests
• Syntax
– BASE URL
• host + port + path of SRU request handler
– key=value pairs
• 3 possible requests (operations)
– explain
• describes the facilities available at an SRU server
• used by clients to self-configure
• returned explain record is in XML and follows the ZeeRex Schema
• Example: http://z3950.loc.gov:7090/voyager?version=1.1&operation=explain
– scan
• allows the client to request a range of the available terms at a given point within a list of indexed terms
• enables clients to present an ordered list of values and, if supported, how many hits there would be for a search on each term
– searchRetrieve
SRU: the protocol
searchRetrieve operation
• searchRetrieve (principal) parameters
– version: version of the request; current protocol version: 1.1
– query: query expressed in CQL
– startRecord: position within the sequence of matched records of the first record to be returned
– maximumRecords: number of records requested to be returned
– recordSchema: schema requested for the records to be returned
– stylesheet: URL for an XML stylesheet. The client requests that the server simply return this URL in the response.
• CQL
« Traditionally, query languages have fallen into two camps: Powerful, expressive languages, not easily readable nor writable by non-experts (e.g. SQL, PQF, and XQuery); or simple and intuitive languages not powerful enough to express complex concepts (e.g. CCL and google). CQL tries to combine simplicity and intuitiveness of expression for simple, every day queries, with the richness of more expressive languages to accomodate complex concepts when necessary. »
(http://www.loc.gov/standards/sru/cql)
SRU: the protocol
searchRetrieve operation
Examples of CQL queries:
• dinosaur
title = "complete dinosaur"
title exact "the complete dinosaur"
dinosaur not reptile
dinosaur and bird or dinobird
publicationYear < 1980
• title all "complete dinosaur"
title contains all of the words: ‘complete’, and ‘dinosaur’
• title any "dinosaur bird reptile"
title contains any of the words: ‘dinosaur’, ‘bird’, or ‘reptile’
• ribs prox/distance<=5 chevrons
a more specific proximity query: ‘ribs’ within 5 words of ‘chevrons’
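The difference between the ‘all’ and ‘any’ relations can be shown with a toy matcher in Python. This sketch assumes case-insensitive matching on whitespace-separated words; a real SRU server decides for itself how terms are tokenized and normalized.

```python
def matches_all(field_value, terms):
    """CQL relation 'all': every word in 'terms' occurs in the field."""
    words = set(field_value.lower().split())
    return all(term in words for term in terms.lower().split())

def matches_any(field_value, terms):
    """CQL relation 'any': at least one word in 'terms' occurs in the field."""
    words = set(field_value.lower().split())
    return any(term in words for term in terms.lower().split())

print(matches_all("The Complete Dinosaur", "complete dinosaur"))
print(matches_any("Bird song", "dinosaur bird reptile"))
```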
SRU: the protocol
searchRetrieve operation -- examples
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&query=author=einstein
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein&recordSchema=dc
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author all "einstein albert"
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleCanevas.xsl
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
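The example URLs are shown unencoded for readability; a program would normally percent-encode the spaces and quotes in the CQL query. A minimal Python sketch follows, in which `search_retrieve_url` is a hypothetical helper and its defaults are taken from the examples above.

```python
from urllib.parse import urlencode, quote, parse_qs, urlsplit

def search_retrieve_url(base_url, query, start_record=1, maximum_records=10,
                        record_schema=None, stylesheet=None):
    """Build an SRU 1.1 searchRetrieve URL with a percent-encoded CQL query."""
    params = [
        ("version", "1.1"),
        ("operation", "searchRetrieve"),
        ("maximumRecords", str(maximum_records)),
        ("startRecord", str(start_record)),
        ("query", query),
    ]
    if record_schema is not None:
        params.append(("recordSchema", record_schema))
    if stylesheet is not None:
        params.append(("stylesheet", stylesheet))
    # quote_via=quote encodes spaces as %20 and double quotes as %22
    return base_url + "?" + urlencode(params, quote_via=quote)

url = search_retrieve_url("http://bib49.ulb.ac.be:9000/Cible",
                          'title all "einstein albert"')
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["query"])
```

Decoding the query string again returns the original CQL text, so the encoding does not change what the server sees.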
SRU frontend for Unicorn
[Diagram: a client system sends an SRU request over HTTP to a web server and SRU frontend on the Unicorn server, which reads the catalogue database (records and indexes) and returns an SRU response in XML.]
SRU frontend for Unicorn
[Diagram: the client system sends an SRU request over HTTP to a web server running an SRU/Z39.50 gateway. The gateway sends a Z39.50 request to the Z39.50 frontend on the Unicorn server, which reads the catalogue database (records and indexes). The Z39.50 response goes back to the gateway, which returns an SRU response in XML to the client.]
SRU frontend for Unicorn
• SRU/Z39.50 Gateway: YAZ Proxy (Index Data)
– Implemented at ULB: 7/2006 (2 days)
– config.xml
<target name="cible" default="1">
  <url>bib7.ulb.ac.be:2200</url>
  <xi:include href="explain.xml"/>
  <cql2rpn>pqf.properties</cql2rpn>
</target>
<target name="slavko" default="1">
  <url>velma.library.mun.ca:2200</url>
  <xi:include href="explain.slavko.xml"/>
  <cql2rpn>pqf.slavko.properties</cql2rpn>
</target>
– explain.xml
• ZeeRex XML record as response to ‘explain’ operation
– pqf.properties
• specifies the mapping of various CQL indexes, relations, etc. into Type-1 query attributes
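What the CQL-to-Type-1 mapping does can be shown with a deliberately simplified Python sketch. It handles a single ‘index = term’ clause and uses two Bib-1 use attributes (4 for title, 1003 for author) as an assumed mapping. The actual contents of ULB's pqf.properties are not shown in the slides, and YAZ Proxy performs this translation itself.

```python
# Assumed mapping in the spirit of pqf.properties: CQL index -> Bib-1 use attribute.
USE_ATTRIBUTES = {"title": "1=4", "author": "1=1003"}

def simple_cql_to_pqf(index, term):
    """Translate one 'index = term' clause into a PQF (Type-1) query string."""
    attribute = USE_ATTRIBUTES[index]
    return f'@attr {attribute} "{term}"'

print(simple_cql_to_pqf("author", "einstein"))
```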
SRU frontend for Unicorn
• YAZ Proxy
– http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
– http://bib49.ulb.ac.be:9000/Slavko?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
SRU frontend: use case @ ULB
• Seamless integration of catalog searches in CMS
• Typo3
• Example
– HTML page containing biography of famous Belgian historian Henri Pirenne
– frame pointing to the following URL:
http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=pirenne%20and%20epub-dnu-*&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
• Project
– Unicorn contains descriptions of databases, websites, etc. with local thematic classification codes in 653
– create thematic websites within our CMS, containing frames that list available databases per theme