Harvesting & Metadata
Florence, April 30th 2009
Harvesting & Metadata
The OAI-PMH Standard
Rudy Becarelli [email protected]
The Open Archives Initiative Mission
“The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. ...”
• The OAI approach:
– to enable access to Web-accessible material
– interoperable repositories for metadata sharing, publishing and archiving.
• Low-barrier interoperability framework to access digital materials.
The OAI-PMH Standard
• The OAI Protocol for Metadata Harvesting (OAI-PMH):
– Simple technical option based on the open standards HTTP and XML.
– Any format of metadata
– Unqualified Dublin Core is specified to provide a basic level of interoperability
• Metadata from many sources can be gathered together in one database
• The link between metadata and the related content is not defined by the OAI protocol
• OAI-PMH makes it possible to bring the data together in one place. In order to provide services, the harvesting approach must be combined with other mechanisms
The OAI-PMH Standard
Resource: object the metadata are "about"
Item: component of a repository from which metadata about a resource can be disseminated; has a unique identifier
Record: metadata in a specific metadata format
Identifier: unique key for an item in a repository
Set: optional construct for grouping items in a repository
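A minimal sketch of how these terms appear in an OAI-PMH response, using only the Python standard library. The sample record is hand-written for illustration: the identifier, set name and Dublin Core values are hypothetical, not taken from a real repository.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample record; identifier, set and values are made up.
SAMPLE_RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:item-1</identifier>
    <datestamp>2009-04-30</datestamp>
    <setSpec>museums</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Sample sculpture</dc:title>
      <dc:creator>Unknown</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>
"""

def parse_record(xml_text):
    """Extract the item identifier, its sets and the record metadata."""
    root = ET.fromstring(xml_text)
    header = root.find("oai:header", NS)
    dc = root.find("oai:metadata/oai_dc:dc", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [s.text for s in header.findall("oai:setSpec", NS)],
        # Strip the "{namespace}" prefix from each Dublin Core element name.
        # Repeated elements would overwrite each other in this simple dict.
        "metadata": {child.tag.split("}")[1]: child.text for child in dc},
    }

record = parse_record(SAMPLE_RECORD)
print(record["identifier"], record["sets"], record["metadata"])
```

The header carries the identifier and set membership of the item, while the metadata element holds one record in one metadata format (here oai_dc).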
The OAI-PMH Standard
• Archive: a repository for stored information.
• Protocol: a set of rules defining communication between systems (HTTP, XML).
• Harvesting: refers specifically to the gathering together of metadata from a number of distributed repositories into a combined data store.
• Data Provider: maintains one or more repositories (web servers) that support the OAI-PMH as a means of exposing metadata (1).
• Service Provider: issues OAI-PMH requests to data providers and uses the metadata as a basis for building value-added services (1).
(1) OAI definition quoted from FAQ on OAI Web site
The OAI-PMH Standard
• Data Providers (open archives, repositories) provide free access to metadata, and may, but do not necessarily, offer free access to full texts or other resources.
• Service Providers use the OAI interfaces of the Data Providers to harvest and store metadata.
– no live search requests to the Data Providers;
– services are based on the data harvested via OAI-PMH;
– may select certain subsets from Data Providers.
The OAI-PMH Standard
• Multiple Service Providers can harvest from multiple Data Providers.
• Aggregators can sit between Data Providers and Service Providers.
The OAI-PMH Standard
• Based on HTTP.
• Request arguments are issued as GET or POST parameters.
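• Verbs: six request types (Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord).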
• Responses are encoded in XML syntax.
• Error messages are HTTP-based.
• Sets (optional)
• OAI-PMH supports flow control.
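A sketch of a harvesting loop that follows resumption tokens, assuming an OAI-PMH 2.0 endpoint. The base URL, the canned XML pages and the `fake_fetch` helper are hypothetical and stand in for a real repository so that the example runs without a network.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def build_request_url(base_url, verb, **arguments):
    """Return the GET URL for an OAI-PMH request."""
    query = urllib.parse.urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"

def harvest_identifiers(base_url, fetch, metadata_prefix="oai_dc", set_spec=None):
    """Yield item identifiers, following resumption tokens (flow control).

    `fetch` is any callable that takes a URL and returns the response XML
    as text; it is injected so the sketch can be tried offline.
    """
    arguments = {"metadataPrefix": metadata_prefix}
    if set_spec:
        arguments["set"] = set_spec  # optional selective harvesting by set
    while True:
        url = build_request_url(base_url, "ListIdentifiers", **arguments)
        root = ET.fromstring(fetch(url))
        for header in root.iter(f"{OAI_NS}header"):
            yield header.findtext(f"{OAI_NS}identifier")
        token = root.findtext(f".//{OAI_NS}resumptionToken")
        if not token:  # missing or empty token: the list is complete
            break
        # A resumption token is an exclusive argument: send it on its own.
        arguments = {"resumptionToken": token}

# Two canned response pages (hypothetical data) keyed by resumption token.
PAGES = {
    None: '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
          '<header><identifier>oai:example.org:1</identifier></header>'
          '<resumptionToken>page2</resumptionToken>'
          '</ListIdentifiers></OAI-PMH>',
    "page2": '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
             '<header><identifier>oai:example.org:2</identifier></header>'
             '<resumptionToken/>'
             '</ListIdentifiers></OAI-PMH>',
}

def fake_fetch(url):
    """Stand-in for an HTTP GET: pick the canned page for the requested token."""
    query = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)
    return PAGES[query.get("resumptionToken", [None])[0]]

identifiers = list(harvest_identifiers("http://example.org/oai", fake_fetch))
print(identifiers)
```

Against a live Data Provider, `fetch` could be a small wrapper around `urllib.request.urlopen` that decodes the response as UTF-8.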
Harvesting & Metadata
CulturaItalia experience
Fabio Lanzi [email protected]
• An Italian experience: building an OAI-PMH Data Provider for CulturaItalia www.culturaitalia.it
• This Data Provider is conceived as a repository for metadata about works of art in Tuscany.
• The mission of CulturaItalia:
– to promote Italian culture and heritage in Italy and abroad,
– to promote and integrate existing resources.
CulturaItalia experience
• CulturaItalia is a descriptive catalogue that indexes metadata and redirects to resources.
• Resources remain distributed and under the management of their owners.
• Each institution can establish which data will be harvested by the Portal.
CulturaItalia experience
Standards
• CulturaItalia is based on international standards:
– OAI-PMH
– DCMI
– HTTP
– XML
– XHTML
CulturaItalia experience
Metadata Schema: PICO DC Application Profile
• Designed for CulturaItalia by Irene Buonazia, M. E. Masci, Davide Merlitti et alii (Scuola Normale Superiore - Pisa)
• Dublin Core has been adopted as metadata standard
• A DC Application Profile has been developed according to DCMI recommendations for this specific application and domain
CulturaItalia experience
Metadata Schema: PICO DC Application Profile
• The PICO DC Application Profile joins in one metadata schema:
– All DC Elements;
– All DC Element Refinements and Encoding Schemes from the Qualified DC;
– Other Qualifiers (refinements and encoding schemes) specifically conceived for the CulturaItalia domain.
• Namespaces included in this metadata schema:
– dc:
– dcterms:
– pico:
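A sketch of a record that mixes the three namespaces, built with the Python standard library. The `dc` and `dcterms` namespace URIs are the standard DCMI ones. The `pico` URI and the `pico:record` wrapper element are assumptions made for illustration and should be checked against the published PICO schema. The field values are sample data.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
PICO = "http://purl.org/pico/1.0/"  # assumed URI, verify against the PICO schema

# Register prefixes so the serialized XML uses dc:, dcterms: and pico:.
for prefix, uri in (("dc", DC), ("dcterms", DCTERMS), ("pico", PICO)):
    ET.register_namespace(prefix, uri)

def build_record(title, author, place):
    """Build a small record with one element from each namespace."""
    record = ET.Element(f"{{{PICO}}}record")  # wrapper element name is assumed
    ET.SubElement(record, f"{{{DC}}}title").text = title           # plain DC element
    ET.SubElement(record, f"{{{PICO}}}author").text = author       # PICO refinement of creator
    ET.SubElement(record, f"{{{DCTERMS}}}spatial").text = place    # Qualified DC refinement of coverage
    return record

xml_text = ET.tostring(
    build_record("Sample sculpture", "Manzù Giacomo", "Firenze"),
    encoding="unicode",
)
print(xml_text)
```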
CulturaItalia experience
PICO AP Added Qualifiers – Element Refinements
Element        Added element refinements
CREATOR        author, commissioner
DESCRIPTION    information, contact, service
PUBLISHER      distributor, printer
CONTRIBUTOR    editor, performer, responsible, producer, translator
FORMAT         material and technique
RELATION       promotes / is promoted by, manages / is managed by, is owner of / is owned by, produces / is produced by, performs / is performed by, is responsible for / has as responsible, contributes to / has as contributor, digitizes / is digitized by
COVERAGE       place of birth, place of death, date of birth, date of death
CulturaItalia experience
PICO AP - Extensions to DCMI Type Vocabulary
• The dc:type element, with its controlled vocabulary (the DCMI Type Vocabulary), can describe most of the resources to be managed within CulturaItalia.
• The PICO Type Vocabulary adds three more resource types.
DCMI Type Vocabulary: dcmitype:Collection, dcmitype:Dataset, dcmitype:Event, dcmitype:Image, dcmitype:MovingImage, dcmitype:StillImage, dcmitype:PhysicalObject, dcmitype:InteractiveResource, dcmitype:Service, dcmitype:Software, dcmitype:Sound, dcmitype:Text
PICO Type Vocabulary: picotype:Institution, picotype:PhysicalPerson, picotype:Project
CulturaItalia experience
PICO AP – Further Extensions
• PICO AP can be further extended:
– by adding new encoding schemes, which must be defined and published as XSD schemas;
– by using DCSV (Dublin Core Structured Values), defined in: Simon Cox, Renato Iannella, "DCMI DCSV: A syntax for writing a list of labelled values in a text string", 2000-07-28, http://es.dublincore.org/documents/dcmi-dcsv/
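A simplified sketch of reading and writing DCSV strings, assuming the basic form of the syntax: components separated by ";", names and values separated by "=", and those two characters escaped with a backslash when they occur inside a value. Dotted names for nested structures and literal backslashes are not handled here.

```python
import re

def parse_dcsv(text):
    """Parse 'name=value; name=value' into a list of (name, value) pairs."""
    pairs = []
    # Split on ';' only where it is not escaped by a preceding backslash.
    for component in re.split(r"(?<!\\);", text):
        component = component.strip()
        if not component:
            continue
        # Split name from value on the first unescaped '='.
        parts = re.split(r"(?<!\\)=", component, maxsplit=1)
        if len(parts) == 2:
            name, value = parts
        else:
            name, value = "", parts[0]  # unnamed component
        # Remove the escaping backslashes from the value.
        value = re.sub(r"\\([;=])", r"\1", value.strip())
        pairs.append((name.strip(), value))
    return pairs

def to_dcsv(pairs):
    """Serialize (name, value) pairs, escaping ';' and '=' inside values."""
    def escape(value):
        return re.sub(r"([;=])", r"\\\1", value)
    return "; ".join(f"{name}={escape(value)}" for name, value in pairs)

parsed = parse_dcsv("AUTN=Manzù Giacomo; AUTA=1908/1991")
print(parsed)
print(to_dcsv([("note", "a;b=c")]))
```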
CulturaItalia experience
[Diagram: publishing architecture. Labels: SIL “Museum”; NAL “In”; NAL “Out”; CART; Web Service; Adapter; Tuscany Repository database (JDBC); OAICat; OAI-PMH; CulturaItalia.]
CulturaItalia experience
Publishing process
• Building the envelope: the elements
– Typology
– Publisher
– Local identifier
– Set
– Metadata
• Building the envelope: serialization
Example values shown: OAC, MUSEUM, oac_09_00000001_0, OAC_COMUNE_FIRENZE
CulturaItalia experience
[Diagram. Labels: CART; NAL “Out”; Adapter; WS; Tuscany Repository.]
• Software on NAL “Out” sends:
– records to the Data Provider
– return receipts to publishers
CulturaItalia experience
Publishing process
• Crosswalk from original profile to PICO
• Storage on database
[Diagram. Labels: NAL “Out”; Web Service; Adapter; JDBC; Database; Tuscany Repository.]
CulturaItalia experience
Transformer
• Based on XSLT 2.0 language
• Different profiles:
– OA, OAC (ICCD)
– MFN (Fondazione Memofonte / Museo del Bargello, Firenze)
– GIOMM (Museo Marino Marini, Pistoia)
• Character encoding: UTF-8, as required by OAI-PMH
CulturaItalia experience
• Predefined Entity References NOT ALLOWED!
• Numerical Character References ALLOWED!
• Example: [...] si rimanda al volume "Manzù", 1988 [...]
• Some of the more than 300 characters handled this way: ê, ½, <, >, &, «, », £, °, `, ´, “, ”
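A sketch of replacing characters with numeric character references, applied to the example sentence above. Which characters to convert is an assumption here: the function converts the five XML markup characters and everything outside ASCII. The transformer described in these slides is XSLT-based, so this Python version only illustrates the idea.

```python
import html

# Characters with special meaning in XML markup.
MARKUP_CHARACTERS = set("<>&\"'")

def to_numeric_references(text):
    """Replace markup and non-ASCII characters with decimal &#N; references."""
    return "".join(
        f"&#{ord(ch)};" if ch in MARKUP_CHARACTERS or ord(ch) > 127 else ch
        for ch in text
    )

original = 'si rimanda al volume "Manzù", 1988'
encoded = to_numeric_references(original)
decoded = html.unescape(encoded)  # numeric references decode back losslessly

print(encoded)
```

Numeric references need no DTD to be resolved, which is why they remain safe where named references such as those for accented letters would not be recognised by a plain XML parser.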
CulturaItalia experience
<AU>
  <AUT>
    <AUTN>Manzù Giacomo</AUTN>
    <AUTA>1908/1991</AUTA>
  </AUT>
  <EDT>
    <EDTN>Della Ragione Alberto</EDTN>
  </EDT>
</AU>
<pico:author xsi:type="iccd:AUT">
  AUTN=Manzù Giacomo;
  AUTA=1908/1991
</pico:author>
<dc:publisher xsi:type="oac:EDT">
  EDTN=Della Ragione Alberto
</dc:publisher>
Ref: Mapping PICO – ICCD,
http://www.iccd.beniculturali.it/Catalogazione/standard-catalografici/metadati
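The mapping shown above can be sketched in Python as follows. This is not the XSLT 2.0 transformer used in the project: it is an illustrative stand-in that covers only the two element groups from the example, and it does not escape ";" or "=" inside values.

```python
import xml.etree.ElementTree as ET

# ICCD fragment from the example above.
ICCD_FRAGMENT = """
<AU>
  <AUT><AUTN>Manzù Giacomo</AUTN><AUTA>1908/1991</AUTA></AUT>
  <EDT><EDTN>Della Ragione Alberto</EDTN></EDT>
</AU>
"""

# Source group -> (target element, xsi:type value), as in the example.
MAPPING = {
    "AUT": ("pico:author", "iccd:AUT"),
    "EDT": ("dc:publisher", "oac:EDT"),
}

def crosswalk(xml_text):
    """Turn each mapped ICCD group into (target, xsi_type, DCSV string)."""
    results = []
    for group in ET.fromstring(xml_text):
        if group.tag not in MAPPING:
            continue  # groups without a mapping are skipped in this sketch
        target, xsi_type = MAPPING[group.tag]
        # Each subfield becomes one 'NAME=value' component of the DCSV string.
        dcsv = "; ".join(f"{field.tag}={field.text}" for field in group)
        results.append((target, xsi_type, dcsv))
    return results

mapped = crosswalk(ICCD_FRAGMENT)
for target, xsi_type, dcsv in mapped:
    print(f'<{target} xsi:type="{xsi_type}">{dcsv}</{target}>')
```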
CulturaItalia experience
DATA PROVIDER
• Open source software:
– OAICat
– Apache Axis
– Apache Tomcat
– MySQL
• Personalization:
– Use of Tomcat DataSource
– JDBC2Pico crosswalk
SERVICE PROVIDER
• CulturaItalia harvested more than 14000 records
[Diagram. Labels: Database; JDBC; OAICat (on Tomcat); OAI-PMH; PICO harvester.]
Harvesting & Metadata
Enrich experience
Paolo Mazzanti [email protected]
Enrich experience
• A European experience: the ENRICH Project http://enrich.manuscriptorium.com/
• ENRICH Project goal: create seamless access to information about the vast collections of manuscripts and incunabula distributed across major European libraries.
• Italian partners:
– MICC (Media Integration and Communication Center)
– BNCF (The National Library of Florence)
• ENRICH Project:
– Based on the MANUSCRIPTORIUM Digital Library http://www.manuscriptorium.eu (National Library of the Czech Republic, AIP-Beroun Ltd)
Enrich experience
• ENRICH Conceptual Model:
– OAI-PMH
– XML
– TEI
• Report on the Development and Validation of Migration Tools, 28 February 2009: http://enrich.manuscriptorium.com/files/ENRICH_WP3_D3_3_Migration_Tools_01.pdf
Migration routes for a number of different data formats to the ENRICH specification.
Enrich experience
Recommendations for Migration Routes:
– mature, open source, cross-platform technologies;
– human-readable, text-based scripting languages.
– The metadata format transformation can be performed by the Service Provider or by the Data Provider, depending on the XSLT skills of the Data Provider;
– The project offers a tool, named M-Tool, that guides the Data Provider in mapping its proprietary fields to the TEI P5 ones.
Enrich experience
Migration of the metadata to the ENRICH:
• Data Format:
Enrich experience
MANUSCRIPTORIUM
• Based on the MASTER (Manuscript Access through Standards for Electronic Records) XML data format, an extension to the TEI P4 Guidelines.
• MASTER Reference Manual: http://www.teic.org.uk/Master/Reference/oldindex.html
• The MASTER data format was updated, modified and eventually incorporated as a module into the Text Encoding Initiative (TEI) P5 Guidelines.
ENRICH
• Based on TEI P5 (ratified by the TEI Technical Council):
– over 1300 pages
– 23 chapters
– over 500 XML elements
• MASTER to ENRICH transformation: XSL, released under a Creative Commons Attribution license.
• The ENRICH format specification is based on the chapters for:
– Manuscript Description
– Digital images
– Non-Unicode characters
– Paleographic or transcriptional data
Enrich experience
• TEI P5: http://www.tei-c.org/Guidelines/P5/
The ENRICH TEI P5 schema covers three distinct aspects of a digitized manuscript:
1. metadata describing the original source manuscript;
2. metadata describing digitized images of the original source manuscript;
3. a transcription of the text contained in the original source manuscript (not required in Manuscriptorium).
Enrich experience
Set                                   # documents    # images
1  Manoscritti in rete                        33        3865
2  Bibliotheca Universalis II                183       63980
3  Carte Geografiche II                      137         233
4  Bibliotheca Universalis I                 377      159381
5  Carte Geografiche I                       810        3765
6  Magliabechi                             52096      211618
7  Galileo Galilei manuscripts               307       98650
8  Galileo Galilei printed books             183       58387
Contents of the Biblioteca Nazionale di Firenze (BNCF) planned for aggregation via OAI-PMH
Enrich experience
The goal:
• to aggregate the content while keeping the aggregated information as unconstrained as possible
• to harvest the original primary metadata contents
• The Italian case of the BNCF:
– MARCXML (slim) records (historical metadata)
– MAG records (structural metadata)
Enrich experience
Example of a BNCF MAG profile record
• Two harvests: one for MAG and the other for MARCslim.
• Matching records from the two harvests are paired, and both input files are processed automatically to produce a single XML record using the TEI P5 ENRICH schema.
• This TEI record is further processed in the Manuscriptorium platform (which will be TEI P5 based) for the purposes of searching and presentation (via the end-user interface or the OAI-PMH interface).
• Migration to TEI P5 in progress…
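A sketch of the record-matching step only, assuming both harvests can be keyed by a shared identifier. How MAG and MARC records are actually correlated is not stated in these slides, so the key, the sample identifiers and the placeholder record contents are all hypothetical. Producing the TEI P5 record from each pair is left out.

```python
def pair_records(mag_records, marc_records):
    """Pair MAG and MARC records that share an identifier.

    Both arguments map an identifier to an already-parsed record.
    Returns the matched pairs and the identifiers found in only one harvest.
    """
    shared = mag_records.keys() & marc_records.keys()
    matched = {
        key: {"mag": mag_records[key], "marc": marc_records[key]}
        for key in shared
    }
    # Identifiers present in exactly one harvest need manual follow-up.
    unmatched = sorted(mag_records.keys() ^ marc_records.keys())
    return matched, unmatched

# Hypothetical harvest results: identifier -> placeholder record content.
mag_harvest = {"ms-001": "<mag record 1>", "ms-002": "<mag record 2>"}
marc_harvest = {"ms-001": "<marc record 1>", "ms-003": "<marc record 3>"}

matched, unmatched = pair_records(mag_harvest, marc_harvest)
print(sorted(matched), unmatched)
```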
Enrich experience
Example of an online BNCF MAG record
Enrich experience
Enrich Harvesting Information:
– AIP Beroun, Beroun, Czech Republic
  Tomas Psohlavec [email protected]
  http://www.aipberoun.cz
Enrich Metadata Information:
– Oxford University Computing Services, Oxford, United Kingdom
  James Cummings [email protected]
  Sebastian Rahtz [email protected]
  http://www.oucs.ox.ac.uk/
Thank you!
Rudy Becarelli, Fabio Lanzi, Paolo Mazzanti
MICC - LCI Lab. - Media Integration and Communication Center
Viale Morgagni, 65, 50134 Florence (Italy)
Tel. +39.055.4237404
http://lci.micc.unifi.it