Avano, an OAI harvester for marine and aquatic sciences Fred Merceur What could be improved in OAI-PMH protocol and in repositories implementation?

Avano, an OAI harvester for marine and aquatic sciences

Fred Merceur

What could be improved in OAI-PMH protocol and in repositories implementation?

Table of contents

Main technical ideas of OAI-PMH

Avano presentation

General information

Filtering aquatic and marine records

Demonstrations

What could be improved in OAI-PMH protocol and in repositories implementation?

Main technical ideas of OAI-PMH

Open Archives Initiative Protocol for Metadata Harvesting

Definitions and concepts

A protocol to share bibliographic records

The digital objects (documentation, images, datasets…) stay inside the repositories

Two groups of players

[Diagram: OAI harvesters and OAI servers communicate over HTTP / XML]

Data providers (Open Archives, Institutional Repositories, commercial publishers), e.g. Aquatic Commons, OceanDocs, MBL/WHOI

Service providers, or harvesters, including Avano

A simple protocol

OAI-PMH is based on major web standards: HTTP, XML, Dublin Core

Harvesters query repositories with simple HTTP requests. Harvesters can issue 6 request types (verbs):

Identify: Retrieve information about a repository (administrator email, deleted-records strategy…)

ListMetadataFormats: Retrieve the metadata formats available from a repository (XML schemas). All repositories must at least allow the sharing of their records in unqualified Dublin Core

ListSets: Get the optional list of sets suggested by the data provider for harvesting a selection of records (thematic sets, types of documents, full text available…)

ListIdentifiers: Get the list of record identifiers available from a data provider

GetRecord: Get the complete record for the identifier sent as a parameter

ListRecords: Get a list of complete records available from a data provider
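The six verbs and their arguments can be summarised in a small lookup table. The following is a minimal sketch, not part of Avano: the table reflects the arguments defined by OAI-PMH 2.0 (leaving out `resumptionToken`), while the names `VERB_ARGUMENTS` and `check_request` are invented for this illustration.

```python
# Arguments defined by OAI-PMH 2.0 for each of the six verbs.
# resumptionToken (used to continue a partial list) is left out of this sketch.
VERB_ARGUMENTS = {
    "Identify": {"required": set(), "optional": set()},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}},
    "ListSets": {"required": set(), "optional": set()},
    "ListIdentifiers": {"required": {"metadataPrefix"},
                        "optional": {"from", "until", "set"}},
    "GetRecord": {"required": {"identifier", "metadataPrefix"},
                  "optional": set()},
    "ListRecords": {"required": {"metadataPrefix"},
                    "optional": {"from", "until", "set"}},
}


def check_request(verb, arguments):
    """Return a list of problems in a request; an empty list means none were found."""
    if verb not in VERB_ARGUMENTS:
        return [f"unknown verb: {verb}"]
    spec = VERB_ARGUMENTS[verb]
    given = set(arguments)
    # Required arguments that the caller forgot.
    problems = [f"missing argument: {name}"
                for name in sorted(spec["required"] - given)]
    # Arguments that this verb does not accept.
    allowed = spec["required"] | spec["optional"]
    problems += [f"unexpected argument: {name}"
                 for name in sorted(given - allowed)]
    return problems
```

For example, `check_request("ListRecords", {})` reports the missing `metadataPrefix`, which is the one mandatory argument for the two list-harvesting verbs.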

Some parameters for querying a repository

from - until (optional): Specify the date range of the records to harvest (this applies to the date of last modification, not to the date of publication)

set (optional): Specify the set of records to retrieve (thematic sets, type of document, full text available…)

metadataPrefix (mandatory): Specify in which format (XML schema) the records must be returned

One example:

http://www.ifremer.fr/docelec/oai/OAIHandler?verb=ListRecords&metadataPrefix=oai_dc
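A request URL like the one above can be assembled from the parameters just listed. This is a minimal sketch using only the Python standard library; the function name `build_request_url` and the set name `fulltext` in the usage note are hypothetical.

```python
from urllib.parse import urlencode


def build_request_url(base_url, verb, metadata_prefix=None,
                      date_from=None, date_until=None, set_spec=None):
    """Assemble an OAI-PMH request URL from a verb and its optional arguments."""
    params = [("verb", verb)]
    if metadata_prefix:
        params.append(("metadataPrefix", metadata_prefix))
    if date_from:
        params.append(("from", date_from))    # date of last modification, not of publication
    if date_until:
        params.append(("until", date_until))
    if set_spec:
        params.append(("set", set_spec))      # a single set per request
    return base_url + "?" + urlencode(params)
```

Calling `build_request_url("http://www.ifremer.fr/docelec/oai/OAIHandler", "ListRecords", "oai_dc")` reproduces the example URL. Adding `date_from="2007-01-01"` and `set_spec="fulltext"` would restrict the harvest to records modified since that date in a (hypothetical) full-text set.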

Minimal OAI-compliant metadata consists of the 15 fields of unqualified Dublin Core:

TITLE

CREATOR

SUBJECT

DESCRIPTION

PUBLISHER

CONTRIBUTOR

DATE

TYPE

FORMAT

IDENTIFIER

SOURCE

LANGUAGE

RELATION

COVERAGE

RIGHTS
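On the wire, these fields arrive as `dc:` elements inside an `oai_dc:dc` container. The sketch below reads them with the Python standard library. The two namespace URIs are the standard ones; the sample record and the function name `dublin_core_fields` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 unqualified Dublin Core elements.
DC_NS = "{http://purl.org/dc/elements/1.1/}"

# Invented sample record; every field is optional and repeatable.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An invented title about plankton</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
  <dc:date>2006-09-01</dc:date>
  <dc:type>Article</dc:type>
</oai_dc:dc>"""


def dublin_core_fields(xml_text):
    """Map each Dublin Core field name to the list of values found in the record."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root:
        if element.tag.startswith(DC_NS) and element.text:
            name = element.tag[len(DC_NS):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields
```

Because every field is optional, a harvester has to cope with records where, say, `subject` or `description` is simply absent from the returned dictionary.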

Avano, a thematic OAI-PMH harvester

implementation example

General information

Avano was launched in September 2006. It is available at: http://www.ifremer.fr/avano/

Part of the system is based on the University of Illinois Open Archives Initiative Metadata Harvesting Project

The publication web site and the filtering system are in-house Ifremer developments

It handles marine resources but also freshwater resources (rivers, lakes, groundwater, drinking water treatment...)

Avano harvests Open Archives, Institutional Repositories and a few commercial publishers (e.g. HighWire)

When a repository offers a full-text subset, we harvest only the records with full text

Repositories are not loaded if they have no full-text subset and contain mainly records without full text

Repositories are not loaded if they offer records with links to digital objects stored outside the repository server

Harvesting marine repositories

The full content of these 9 marine repositories is automatically loaded into Avano (18904 records)

9 marine repositories harvested:

ePic, Alfred Wegener Institute: 2679 records

Aquatic Commons, Iamslic: 269 records

ArchiMer, Ifremer: 2241 records

DRS, National Institute of Oceanography of India: 637 records

IBSS, Institute of Biology of the Southern Seas: 181 records

Marine & Ocean Science ePrints @ Plymouth: 1974 records

OceanDocs, Africa and Latin America marine pub.: 1568 records

Plankton*Net (AWI and Roscoff marine station): 7686 images

WHOAS (Woods Hole): 1660 records

Harvesting non-marine repositories

[Diagram: 146 non-marine repositories are harvested via OAI-PMH into a temporary table (4 500 000 records). Filters are applied to this table, then the results are checked manually (40 000 records removed manually) and loaded into Avano (88 000 records).]

Filters:

Aquatic and marine terms or expressions (… fishery, fishes, fishing% …)

Journal titles (… Ocean Dynamics, Ocean Engineering, Ocean Modelling, Ocean Navigator, Ocean Research …)

Aquatic species scientific names (… abietinaria inconstans, abietinaria kincaidi, abietinaria labrata, abietinaria pacifica …)

Harvesting non-marine repositories

Records that contain an aquatic journal title, aquatic expressions or scientific names of aquatic species are automatically loaded into Avano. Avano currently uses:

An aquatic journal title list from ASFA

A list of scientific names of fishes from FishBase

A list of scientific names of aquatic species from the FAO

Several lists of scientific names of aquatic species from the NODC

If you have lists of scientific names for aquatic algae, fungi, plants, mollusks, gastropods, insects, birds or mammals that contain only aquatic species, please contact me!
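The keyword filter described on this slide can be pictured as one combined pattern run over each record's text. This is only a sketch under stated assumptions, not Avano's actual code: the three short lists stand in for the much larger ASFA, FishBase, FAO and NODC lists, the trailing `%` is assumed to mean "any ending" (as in SQL `LIKE`), and the record fields searched are a guess.

```python
import re

# Tiny stand-ins for the real term, journal and species lists.
AQUATIC_TERMS = ["fishery", "fishes", "fishing%"]   # "%" assumed to be a wildcard
JOURNAL_TITLES = ["Ocean Dynamics", "Ocean Engineering", "Ocean Modelling"]
SPECIES_NAMES = ["abietinaria inconstans", "abietinaria kincaidi"]


def term_pattern(term):
    """Turn one list entry into a regular expression fragment."""
    if term.endswith("%"):
        # Prefix match: the word may continue after the given stem.
        return r"\b" + re.escape(term[:-1]) + r"\w*"
    return r"\b" + re.escape(term) + r"\b"


MATCHER = re.compile(
    "|".join(term_pattern(t) for t in AQUATIC_TERMS + JOURNAL_TITLES + SPECIES_NAMES),
    re.IGNORECASE,
)


def looks_aquatic(record):
    """Return True if any filter term occurs in the record's text fields."""
    text = " ".join(record.get(field, "")
                    for field in ("title", "subject", "description", "source"))
    return MATCHER.search(text) is not None
```

A record with no abstract and a title in a language the lists do not cover gives such a matcher nothing to work with, which is the weakness the next slide discusses.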

Limitations of the keyword filtering method

It is a time-consuming method

We may validate records (1 or 2%?) that do not match any Avano subject

We may also miss a few records from non-marine repositories (1 or 2%?), especially when:

The records are poor (no abstract)

The record is only available in a local language

But this is the only way we have found to get the 80% of Avano records that come from general repositories

Avano now contains more than 107 000 records from 156 Open Archives and 4 commercial publishers

Publication year of documents available from Avano

The number of connections to Avano is increasing

[Chart; axis label: Number of connections]

An international audience

Demonstrations

Filtering module

Public web site: http://www.ifremer.fr/avano/

Review of one year of harvester management

What could be improved in OAI-PMH protocol and in repositories implementation?

OAI-PMH, what could be improved?

Repository stability

Many repositories (10-20%?) are difficult to harvest because they are unreliable:

Undocumented errors occur during harvesting

HTTP timeout errors occur during harvesting

The OAI-PMH protocol is not completely supported (some repositories can only be harvested via the GetRecords method, some others only via the ListIdentifiers method, and some do not return the same number of records via the GetRecords method and via the ListIdentifiers method)

The OAI-PMH server URL changes without notification

OAI-PMH, what could be improved?

XML encoding, UTF-8 errors

Many repositories deliver an incorrect XML stream or records that contain UTF-8 errors (character encoding errors). This is a problem for harvesters (e.g. Avano) whose XML parsers cannot bypass these XML encoding or UTF-8 errors.

Records with UTF-8 errors are not loaded in Avano

Repositories with XML encoding errors cannot be harvested via the GetRecords method by Avano (which is a problem when the ListIdentifiers method doesn’t work either)
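One possible workaround for a harvester is to clean the byte stream before handing it to the XML parser. The sketch below is an assumption about how that could be done in Python, not what Avano does (Avano skips such records). Invalid UTF-8 bytes are replaced by the U+FFFD replacement character, so the affected text is damaged but the record survives. The function name `parse_leniently` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 forbids (everything below 0x20 except tab, LF, CR).
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def parse_leniently(raw_bytes):
    """Parse an XML response, tolerating invalid UTF-8 and illegal control characters."""
    # Replace undecodable bytes instead of raising UnicodeDecodeError.
    text = raw_bytes.decode("utf-8", errors="replace")
    # Drop characters that would make the XML parser reject the document.
    text = ILLEGAL_XML_CHARS.sub("", text)
    return ET.fromstring(text.encode("utf-8"))
```

This sketch assumes the response claims to be UTF-8. It does not repair structural XML errors such as unclosed tags or unescaped ampersands.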

OAI-PMH, what could be improved?

Big or slow repository harvesting

Big or slow repositories can take several days to harvest

This is a problem for unreliable repositories. If one error occurs, the harvest must be restarted from the beginning (there is no way to resume from where it stopped)

For some of these repositories, an intermediate solution is to split the harvest into date ranges, but this cannot always be applied
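Splitting a harvest into date ranges amounts to generating successive `from` / `until` pairs, so that a failure only costs one window. A minimal sketch follows; the function name `date_windows` and the window size are illustrative choices, and it assumes the repository supports day-granularity datestamps, where both bounds are inclusive.

```python
from datetime import date, timedelta


def date_windows(start, end, days_per_window):
    """Yield (from, until) ISO date pairs covering start..end without overlap."""
    current = start
    while current <= end:
        # Both bounds are inclusive, so a window of N days ends N - 1 days later.
        window_end = min(current + timedelta(days=days_per_window - 1), end)
        yield current.isoformat(), window_end.isoformat()
        current = window_end + timedelta(days=1)
```

Each pair can be passed as the `from` and `until` arguments of a separate ListRecords request. Because the dates refer to the last modification, a repository whose records were all loaded on the same day still ends up in a single huge window, which is one reason the approach cannot always be applied.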

OAI-PMH, what could be improved?

Duplicated records

This can happen if, for example, a publication is written in collaboration between several institutions. The publication may then be archived on each institution's server. The international deposit rate is so low, especially for life sciences, that this is not really a problem nowadays.

Some national projects also aggregate a selection of IRs and re-expose the records in OAI-PMH. For example, HAL is a French national Open Archive. Some French scientific organizations use this platform to build their IR (IN2P3, INSERM…). All the records loaded in these IRs are exposed twice (via the national platform and via the IR).

If harvester managers have not heard about these national projects, they may load these duplicated IRs (e.g. all IN2P3, INSERM… records are duplicated in OAIster)

OAI-PMH, what could be improved?

Deleted records

Many repositories don’t support a mechanism (transient or persistent) that tells harvesters that a record has been deleted

Harvesters then have to re-harvest these repositories completely (instead of using incremental harvests) to detect deleted records, which is a major problem for big, slow or unreliable repositories that need several days to be re-harvested
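When a repository does support deleted records (its Identify response then declares `deletedRecord` as `transient` or `persistent` rather than `no`), deletions show up as headers carrying `status="deleted"`. The sketch below picks them out of a response; the sample XML and the function name `deleted_identifiers` are invented for illustration.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented, abbreviated ListIdentifiers response.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2007-03-01</datestamp>
    </header>
    <header status="deleted">
      <identifier>oai:example.org:2</identifier>
      <datestamp>2007-03-02</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def deleted_identifiers(xml_text):
    """Return the identifiers whose header is flagged as deleted."""
    root = ET.fromstring(xml_text)
    return [
        header.findtext("oai:identifier", namespaces=OAI_NS)
        for header in root.iterfind(".//oai:header", OAI_NS)
        if header.get("status") == "deleted"
    ]
```

With such headers, an incremental harvest using `from` is enough to learn about deletions. Without them, the only way to notice a missing record is to compare a complete new harvest against the previous one.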

OAI-PMH, what could be improved?

Type field

26 000 of the 107 000 records available in Avano have no type field

A few (>500) have a type field which is impossible to normalize:

A1

Airticle

8

Treball Final de Carrera

….

All these records are dropped from the results if the end-user limits the query to a given type
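The effect on type-limited queries can be illustrated with a small lookup. This is a sketch only: the `KNOWN_TYPES` table and both function names are invented here, and a real harvester's vocabulary would be far larger.

```python
# Hypothetical mapping from type values seen in the wild to a controlled vocabulary.
KNOWN_TYPES = {
    "article": "article",
    "journal article": "article",
    "thesis": "thesis",
    "phd thesis": "thesis",
    "image": "image",
}


def normalize_type(raw_type):
    """Return the controlled type for a raw value, or None if it cannot be mapped."""
    if not raw_type:
        return None
    return KNOWN_TYPES.get(raw_type.strip().lower())


def filter_by_type(records, wanted_type):
    """Keep only the records whose type normalizes to wanted_type."""
    return [r for r in records if normalize_type(r.get("type")) == wanted_type]
```

Values such as `A1` or `8` map to nothing, and a record with no type field behaves the same way, so none of them can ever satisfy a type filter.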

OAI-PMH, what could be improved?

Publication date field

15 000 of the 107 000 records available in Avano have no publication date

A few (>500) have a badly formatted date:

1970-04-00

1981.

Montréal, 2000

[196-?]

2005-92-26

…

All these records are dropped from the results list if the end-user limits the query to a date range

All these records are displayed at the end of the hit list if the end-user chooses to sort the hit list by date.
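The difference between a usable and an unusable date can be made concrete with two small functions. This is a sketch, not Avano's code: `strict_year` accepts only valid `YYYY`, `YYYY-MM` or `YYYY-MM-DD` values, while `salvage_year` is a hypothetical fallback that guesses a year from free text and can be wrong.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?")
# A plausible 4-digit year (1500-2099) not embedded in a longer number.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def strict_year(raw_date):
    """Return the year of a valid YYYY, YYYY-MM or YYYY-MM-DD date, else None."""
    match = ISO_DATE.fullmatch(raw_date.strip())
    if match is None:
        return None
    # Missing month or day defaults to 1 so the date can still be validated.
    year, month, day = (int(part) if part else 1 for part in match.groups())
    try:
        date(year, month, day)
    except ValueError:
        return None
    return year


def salvage_year(raw_date):
    """Best-effort guess: return the first plausible year found in the text, else None."""
    match = YEAR.search(raw_date)
    return int(match.group(1)) if match else None
```

All five examples listed above fail `strict_year`. `salvage_year` recovers a year for four of them and still returns nothing for `[196-?]`.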

OAI-PMH, what could be improved?

Poor records

Some repositories contain poor records (no abstract, no keyword, no author…). Some others contain records only available in national languages.

These records have poor visibility in harvester search engines, because harvesters index only the bibliographic data and often display their result lists sorted by rank.

OAI-PMH, what could be improved?

Aggregating documentation and dataset records

This can be a problem for harvesters if dataset records do not have the same granularity as documentation records.

E.g. Pangaea is a publishing network for geological and environmental data. It contains thousands of records that are almost identical (only a few geographical references differ between them)

E.g. Pangaea contains 1389 almost identical records that contain the expression “color reflectance”. An end-user looking for the few documentation records that also contain this expression has no chance of finding them in the resulting list.

OAI-PMH, what could be improved?

Records without free access to the digital object: maybe the main problem!

Many Open Archives and IRs now contain records without full text, records with pay-per-view full text (e.g. BePress/ProQuest) or records with restricted access to the full text.

This would not be a problem if harvesters could give their end-users information about access to the full text (and offer, as an option, the possibility to filter on it). But this is not the case!

We still have to convince scientists and end-users that Open Access is useful and/or necessary. Immediate and free access to the full text is perhaps the main argument to convince them. It is my opinion that hiding records with free full text among records with inaccessible full text is not helpful.

OAI-PMH, what could be improved?

Thematic harvesting

Thematic harvesting is supposed to be available via the set mechanism

In practice, no repository offers a set that exactly matches the scope of Avano

The OAI-PMH protocol does not allow a harvest to be restricted to the records that belong to several sets at once. For example, it would not have been possible to harvest a “Full-Text” set and a “Marine and aquatic” set at the same time.

This limitation led to the development of the keyword-spotting system used to filter marine and aquatic records in general repositories
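For a repository that did expose both sets, the restriction could in principle be worked around on the harvester side by harvesting the identifiers of each set separately and intersecting them. The sketch below shows only that last step. It is an illustration of the idea, not something Avano does, and the function name and identifiers are hypothetical.

```python
def in_both_sets(identifiers_in_first_set, identifiers_in_second_set):
    """Return, in sorted order, the identifiers that appear in both harvested sets."""
    return sorted(set(identifiers_in_first_set) & set(identifiers_in_second_set))
```

For instance, `in_both_sets(full_text_ids, marine_ids)` would give the records to fetch with GetRecord. The cost is two complete ListIdentifiers harvests per repository, and it does nothing for repositories that offer no suitable sets, which is the common case described above.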

Conclusion (1/2)

What do harvesters need in order to find their place between Google and commercial bibliographic databases?

A higher Open Access deposit rate (less than 3% in marine/aquatic sciences?) and/or more commercial publishers exposing their records in OAI-PMH, in order to cover the main part of the international scientific production

A new version of OAI-PMH that would offer a more reliable way to harvest OA and more qualified mandatory information (date and type fields, information about access to the full text…), so that harvesters can offer more powerful and reliable search options

Conclusion (2/2)

Please test and comment on Avano. Do not hesitate to suggest modifications!

Check whether your repository is already harvested by Avano and, if not, please register!

Contact me if you have lists of scientific names for aquatic algae, fungi, plants, mollusks, gastropods, insects, birds or mammals that contain only aquatic species!