Avano, an OAI harvester for marine and aquatic sciences
Fred Merceur
What could be improved in OAI-PMH protocol and in repositories implementation?
Table of contents
Main technical ideas of OAI-PMH
Avano presentation
General information
Filtering aquatic and marine records
Demonstrations
What could be improved in OAI-PMH protocol and in repositories implementation?
Main technical ideas of OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
Definitions and concepts
A protocol to share bibliographic records
The digital objects (documentation, images, datasets…) stay inside the repositories
Two groups of players
[Diagram: OAI harvesters and OAI servers communicate over HTTP / XML]
Data providers (Open Archives, Institutional Repositories, commercial publishers), e.g. Aquatic Commons, OceanDocs, MBL/WHOI
Service providers, or harvesters, including Avano
A simple protocol
OAI-PMH is based on major web standards: HTTP, XML, Dublin Core
Harvesters query repositories with simple HTTP requests. Harvesters can issue 6 request types (verbs):
Identify: Retrieve information about a repository (administrator email, deleted-records policy…)
ListMetadataFormats: Retrieve the metadata formats available from a repository (XML schema). All repositories must at least share their records in unqualified Dublin Core
ListSets: Get the optional list of sets suggested by the data provider for harvesting a selection of records (thematic sets, document types, full text available…)
ListIdentifiers: Get the list of record identifiers available from a data provider
GetRecord: Get the complete record for the identifier sent as a parameter
ListRecords: Get a list of complete records available from a data provider
Some parameters for querying a repository
from, until (optional): Specify the date range of the records to harvest (this applies to the date of last modification, not to the publication date)
set (optional): Specify the set of records to retrieve (thematic sets, document types, full text available…)
metadataPrefix (mandatory): Specify the format (XML schema) in which the records must be returned
One example:
http://www.ifremer.fr/docelec/oai/OAIHandler?verb=ListRecords&metadataPrefix=oai_dc
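As a minimal sketch, a request such as the one above can be assembled from a base URL, a verb and its parameters. The helper name `build_oai_url` is hypothetical and not part of Avano; only the verb and parameter names come from the protocol.

```python
from urllib.parse import urlencode

def build_oai_url(base_url, verb, **params):
    """Build an OAI-PMH request URL; parameters set to None are omitted."""
    query = {"verb": verb}
    for key, value in params.items():
        if value is not None:
            # 'from' is a Python keyword, so callers pass it as from_
            query[key.rstrip("_")] = value
    return base_url + "?" + urlencode(query)

url = build_oai_url(
    "http://www.ifremer.fr/docelec/oai/OAIHandler",
    "ListRecords",
    metadataPrefix="oai_dc",
)
print(url)
```

Optional parameters such as `from_`, `until` and `set` can be passed the same way when an incremental or selective harvest is wanted.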
Minimal OAI-compliant metadata consists of the 15 fields of unqualified Dublin Core:
TITLE
CREATOR
SUBJECT
DESCRIPTION
PUBLISHER
CONTRIBUTOR
DATE
TYPE
FORMAT
IDENTIFIER
SOURCE
LANGUAGE
RELATION
COVERAGE
RIGHTS
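To illustrate how a harvester might read these fields, here is a small sketch that extracts Dublin Core elements from an oai_dc record. The sample record and the function name `dc_fields` are made up for illustration; fields are repeatable, so each one is collected as a list.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Illustrative record, not taken from a real repository
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example title</dc:title>
  <dc:creator>Doe, J.</dc:creator>
  <dc:creator>Roe, R.</dc:creator>
  <dc:date>2006</dc:date>
</oai_dc:dc>"""

def dc_fields(xml_text):
    """Return a dict mapping each Dublin Core element name to its list of values."""
    fields = {}
    for element in ET.fromstring(xml_text):
        if element.tag.startswith("{" + DC_NS + "}"):
            name = element.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields

record = dc_fields(SAMPLE)
print(record["creator"])
```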
Avano, a thematic OAI-PMH harvester
implementation example
General information
Avano was launched in September 2006. It is available at: http://www.ifremer.fr/avano/
Part of the system is based on the University of Illinois Open Archives Initiative Metadata Harvesting Project
The publication web site and the filtering system were developed in-house at Ifremer
It covers marine resources but also freshwater resources (rivers, lakes, groundwater, drinking water treatment, ...)
Avano harvests Open Archives, Institutional Repositories and a few commercial publishers (e.g. HighWire)
When a full-text subset is available, we harvest only the records with full text
Repositories are not loaded if they have no full-text subset and contain mainly records without full text
Repositories are not loaded if their records link to digital objects stored outside the repository server
Harvesting marine repositories
The full content of these 9 marine repositories is automatically loaded into Avano (18904 records)
9 marine repositories harvested:
ePic, Alfred Wegener Institute: 2679 records
Aquatic Commons, Iamslic: 269 records
ArchiMer, Ifremer: 2241 records
DRS, National Institute Of Oceanography of India: 637 records
IBSS, Institute of Biology of the Southern Seas: 181 records
Marine & Ocean Science ePrints @ Plymouth: 1974 records
OceanDocs, Africa and Latin America marine pub.: 1568 records
Plankton*Net (AWI and Roscoff marine station): 7686 images
WHOAS (Woods Hole): 1660 records
[Diagram: 146 non-marine repositories are harvested via OAI-PMH into a temporary table (4 500 000 records). Filters are applied on aquatic and marine terms or expressions (e.g. fishery, fishes, fishing%), journal titles (e.g. Ocean Dynamics, Ocean Engineering, Ocean Modelling, Ocean Navigator, Ocean Research) and aquatic species scientific names (e.g. abietinaria inconstans, abietinaria kincaidi, abietinaria labrata, abietinaria pacifica). After manual checking (40 000 records removed manually), the records go into Avano (88000 records).]
Harvesting non-marine repositories
Records that contain an aquatic journal title, aquatic expressions or scientific names of aquatic species are automatically loaded into Avano. Avano currently uses:
An aquatic journal title list from ASFA
A list of scientific names of fishes from FishBase
A list of scientific names of aquatic species from the FAO
Several lists of scientific names of aquatic species from the NODC
If you have lists of scientific names for aquatic algae, fungi, plants, mollusks, gastropods, insects, birds or mammals that contain only aquatic species, please contact me!
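The keyword filtering described above could be sketched roughly as follows. This is not Avano's actual filtering code: the tiny lists, the record layout and the function name `is_aquatic` are assumptions, and terms are treated as word prefixes on the assumption that this is what a pattern such as fishing% expresses.

```python
import re

# Tiny illustrative lists; the real lists (ASFA, FishBase, FAO, NODC) are far larger
JOURNAL_TITLES = {"ocean dynamics", "ocean engineering"}
SPECIES = {"abietinaria pacifica"}
TERMS = ["fishery", "fishes", "fishing"]  # matched as word prefixes (assumption)

def is_aquatic(record):
    """record: dict with optional 'title', 'description' and 'source' strings."""
    text = " ".join(record.get(k, "") for k in ("title", "description", "source")).lower()
    # Journal titles and species names are matched as plain substrings
    if any(phrase in text for phrase in JOURNAL_TITLES | SPECIES):
        return True
    # Terms must start at a word boundary
    return any(re.search(r"\b" + re.escape(term), text) for term in TERMS)

print(is_aquatic({"title": "Fishing effort in the Bay of Biscay"}))
print(is_aquatic({"title": "Graph theory notes"}))
```

A filter like this only proposes candidates; the manual checking step remains necessary to discard false positives.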
Limits of the keyword filtering method
It is a time-consuming method
We may validate records (1 or 2%?) that don’t match any Avano subject
We may also miss a few records from non-marine repositories (1 or 2%?), especially when:
The records are poor (no abstract)
The record is only available in a local language
But this is the only way we have found to get the 80% of Avano records that come from general repositories
Avano now contains more than 107 000 records from 156 Open Archives and 4 commercial publishers
Publication year of documents available from Avano
The number of connections to Avano is increasing
[Chart: number of connections]
An international audience
Demonstrations
Filtering module
Public web site: http://www.ifremer.fr/avano/
Review of one year of harvester management
What could be improved in OAI-PMH protocol and in repositories implementation?
OAI-PMH, what could be improved?
Repository stability
Many repositories (10-20%?) are difficult to harvest because of poor reliability:
Undocumented errors occur during harvesting
HTTP timeout errors occur during harvesting
The OAI-PMH protocol is not completely supported (some repositories can only be harvested via the GetRecords method, others only via the ListIdentifiers method, and some do not return the same number of records via the two methods)
The OAI-PMH server URL changes without notification
…
OAI-PMH, what could be improved?
XML encoding, UTF-8 errors
Many repositories deliver an incorrect XML stream or records that contain UTF-8 errors (character encoding errors). This is a problem for harvesters (e.g. Avano) whose XML parsers cannot bypass these XML encoding or UTF-8 errors.
Records with UTF-8 errors are not loaded into Avano
Repositories with XML encoding errors cannot be harvested by Avano via the GetRecords method (which is a problem when the ListIdentifiers method doesn’t work either)
…
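One possible workaround, shown here only as a sketch and not as what Avano does, is to repair the byte stream before handing it to a strict XML parser: invalid UTF-8 sequences are replaced and control characters forbidden by XML 1.0 are dropped. The function name `clean_response` and the sample bytes are hypothetical.

```python
import re

# Control characters not allowed in XML 1.0 (everything below 0x20 except tab, LF, CR)
_ILLEGAL_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def clean_response(raw_bytes):
    """Decode bytes as UTF-8, replacing invalid sequences, and drop illegal control characters."""
    text = raw_bytes.decode("utf-8", errors="replace")
    return _ILLEGAL_XML.sub("", text)

# \xe9 is a Latin-1 'é' that is invalid as UTF-8 here; \x01 is an illegal control character
cleaned = clean_response(b"<dc:title>Caf\xe9 \x01oc\xc3\xa9an</dc:title>")
print(cleaned)
```

The trade-off is that damaged characters are replaced rather than recovered, so the record is loaded with a visible defect instead of being skipped.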
OAI-PMH, what could be improved?
Harvesting big or slow repositories
Big or slow repositories can take several days to harvest
This is a problem for unreliable repositories. If an error occurs, the harvest must be restarted from the beginning (there is no way to resume from where it stopped)
For some of these repositories, an intermediate solution is to split the harvest by date range, but this cannot always be applied
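Splitting a harvest by date range can be sketched as generating consecutive from/until windows, each of which becomes one smaller request that can be retried on its own. The function name `date_windows` and the 30-day window size are arbitrary choices for illustration.

```python
from datetime import date, timedelta

def date_windows(start, end, days):
    """Yield (from, until) ISO date pairs covering start..end inclusive, without overlap."""
    current = start
    while current <= end:
        upper = min(current + timedelta(days=days - 1), end)
        yield current.isoformat(), upper.isoformat()
        current = upper + timedelta(days=1)

windows = list(date_windows(date(2007, 1, 1), date(2007, 3, 31), 30))
for from_date, until_date in windows:
    print("from=" + from_date + "&until=" + until_date)
```

If one window fails, only that window needs to be requested again, which limits the cost of an error on an unreliable repository.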
OAI-PMH, what could be improved?
Duplicated records
This can happen if, for example, a publication is written in collaboration between several institutions. The publication may then be archived on each institution's server. The international deposit rate is so low, especially for life sciences, that this is not really a problem at present.
Some national projects also aggregate a selection of IRs and re-expose the records via OAI-PMH. For example, HAL is a French national Open Archive. Some French scientific organizations use this platform to build their IR (IN2P3, INSERM…). All the records loaded in these IRs are exposed twice (via the national platform and via the IR).
If harvester managers have not heard about these specific national projects, they may load these duplicated IRs (e.g. all IN2P3, INSERM… records are duplicated in OAIster)
OAI-PMH, what could be improved?
Deleted records
Many repositories don’t support a mechanism (transient or persistent) that tells harvesters that a record has been deleted
Harvesters then have to re-harvest the repositories completely (instead of using incremental harvests) to detect deleted records (which is a major problem for big, slow or unreliable repositories that need several days to be re-harvested)
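When a repository does support deleted records, the protocol marks them with status="deleted" on the record header, so an incremental harvest can pick them up. The sketch below separates live and deleted identifiers; the response fragment, its identifiers and the function name `split_headers` are invented for illustration.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Illustrative ListIdentifiers response fragment with made-up identifiers
RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier><datestamp>2007-05-01</datestamp></header>
    <header status="deleted"><identifier>oai:example.org:2</identifier><datestamp>2007-05-02</datestamp></header>
  </ListIdentifiers>
</OAI-PMH>"""

def split_headers(xml_text):
    """Return (live_ids, deleted_ids) found in the headers of an OAI-PMH response."""
    live, deleted = [], []
    for header in ET.fromstring(xml_text).iter(OAI + "header"):
        identifier = header.findtext(OAI + "identifier")
        (deleted if header.get("status") == "deleted" else live).append(identifier)
    return live, deleted

live, deleted = split_headers(RESPONSE)
print(live, deleted)
```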
OAI-PMH, what could be improved?
Type field
26 000 of the 107 000 records available in Avano have no type field
A few (>500) have a type field that is impossible to normalize:
A1
Airticle
8
Treball Final de Carrera
….
All these records are removed from the results if the end user limits the query to a given type
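A harvester-side normalization step might look like the sketch below, which maps raw type values to a small controlled vocabulary and sends everything else to an explicit "unknown" bucket. The vocabulary and the name `normalize_type` are hypothetical; this does not describe how Avano handles types.

```python
# Hypothetical controlled vocabulary; the keys are lower-cased raw values
TYPE_MAP = {
    "article": "article",
    "journal article": "article",
    "thesis": "thesis",
    "image": "image",
}

def normalize_type(raw):
    """Map a raw dc:type value to a controlled term, or 'unknown' if it cannot be mapped."""
    if raw is None:
        return "unknown"
    return TYPE_MAP.get(raw.strip().lower(), "unknown")

for value in ["Article", "Airticle", "A1", None]:
    print(repr(value), "->", normalize_type(value))
```

An explicit "unknown" value at least lets a search interface offer these records as a separate choice, although only better metadata at the source solves the problem.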
OAI-PMH, what could be improved?
Publication date field
15 000 of the 107 000 records available in Avano have no publication date
A few (>500) have a badly formatted date:
1970-04-00
1981.
Montréal, 2000
[196-?]
2005-92-26
….
All these records are removed from the results list if the end user limits the query to a date range
All these records are displayed at the end of the hit list if the end user sorts the hit list by date.
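Some of these values can still yield a usable year. The sketch below is a lenient extraction that looks for an isolated four-digit year between 1500 and 2099; both the range and the name `extract_year` are assumptions, and values with no complete year stay unresolved.

```python
import re

# A four-digit year (1500-2099) that is not part of a longer run of digits
_YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")

def extract_year(raw):
    """Return the first plausible year found in a raw date string, or None."""
    match = _YEAR.search(raw or "")
    return int(match.group(1)) if match else None

for value in ["1970-04-00", "1981.", "Montréal, 2000", "[196-?]", "2005-92-26"]:
    print(repr(value), "->", extract_year(value))
```

Year-level precision is enough for filtering by date range, but it silently accepts values such as 2005-92-26 whose month is clearly wrong.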
OAI-PMH, what could be improved?
Poor records
Some repositories contain poor records (no abstract, no keywords, no author…). Others contain records available only in national languages.
These records have poor visibility in harvester search engines, because harvesters index only the bibliographic data and often display their result lists sorted by rank.
OAI-PMH, what could be improved?
Aggregating documentation and dataset records
This can be a problem for harvesters if dataset records do not have the same granularity as documentation records.
E.g. Pangaea is a publishing network for geological and environmental data. It contains thousands of records that are almost identical (only a few geographical references differ between these records)
E.g. Pangaea contains 1389 almost identical records that contain the expression “color reflectance”. An end user who wants to find the few documentation records that also contain this expression has little chance of finding them in the list of results.
OAI-PMH, what could be improved?
Records without free access to the digital object: maybe the main problem!
Many Open Archives and IRs now contain records without full text, records with pay-per-view full text (e.g. BePress/ProQuest) or records with restricted access to the full text.
This would not be a problem if harvesters could give their end users information about access to the full text (and offer, as an option, the possibility to filter on it). But this is not the case!
We still have to convince scientists and end users that Open Access is useful and/or necessary. Immediate and free access to the full text is perhaps the main argument to convince them. In my opinion, hiding records with free full text among records with inaccessible full text is not helpful.
OAI-PMH, what could be improved?
Thematic harvesting
Thematic harvesting is supposed to be available via sets
In practice, no repository offers a set that exactly matches the scope of Avano
The OAI-PMH protocol does not allow harvesting records that belong to several sets. For example, it would not have been possible to harvest the “Full-Text” set and the “Marine and aquatic” set at the same time.
This limitation led to the development of the keyword-spotting system to filter marine and aquatic records in general repositories
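Where a repository does expose suitable sets, one client-side workaround (an assumption, not something Avano is described as doing) would be to list the identifiers of each set separately and keep only those present in all of them. The set names and identifiers below are invented.

```python
def in_all_sets(identifiers_by_set, wanted_sets):
    """Return the identifiers that appear in every one of the wanted sets."""
    groups = [set(identifiers_by_set[name]) for name in wanted_sets]
    return set.intersection(*groups) if groups else set()

# Made-up results of one ListIdentifiers harvest per set
harvested = {
    "fulltext": ["oai:example.org:1", "oai:example.org:2", "oai:example.org:3"],
    "marine": ["oai:example.org:2", "oai:example.org:3", "oai:example.org:4"],
}
common = in_all_sets(harvested, ["fulltext", "marine"])
print(sorted(common))
```

This costs one identifier harvest per set, so it only helps when the sets exist and the repository is reliable enough to be listed more than once.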
Conclusion (1/2)
What do harvesters need to be able to find their place between Google and commercial bibliographic databases?
A higher Open Access deposit rate (less than 3% in marine/aquatic sciences?) and/or more commercial publishers exposing their records via OAI-PMH, in order to cover the main part of international scientific production
A new version of OAI-PMH that would offer a more reliable way to harvest Open Archives and more qualified mandatory information (date and type fields, information about access to the full text…), so that harvesters can offer more powerful and reliable search options
Conclusion (2/2)
Please test and comment on Avano. Do not hesitate to suggest modifications!
Check whether your repository is already harvested by Avano and, if not, please register it!
Contact me if you have lists of scientific names for aquatic algae, fungi, plants, mollusks, gastropods, insects, birds or mammals that contain only aquatic species!