
Data discovery through federated dataset catalogs

Valeria Pesce

Secretariat of the Global Forum on Agricultural Research (GFAR)
Secretariat of the Global Open Data for Agriculture and Nutrition (GODAN) initiative

eROSA workshop, Montpellier, 6-7 July 2017

• Many institutional catalogs / geographically-scoped catalogs / thematic catalogs

• How many catalogs do I have to search?

>> General meta-catalogs? Different targeted catalogs?

>> Federated metadata catalogs / secondary catalogs

1. Dataset discovery: how

• Data are in datasets, stored in some dataset repository

• Datasets can be made searchable through a dataset catalog

Good dataset metadata at the level of the local repository / catalog

Open interoperable dataset metadata at the level of the primary repository / catalog

IDEAL

>>> Linked Data federated search engines over LOD-enabled primary catalogs

Heavy requirements for the local primary catalog
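To illustrate what an LOD-enabled primary catalog makes possible, here is a minimal sketch of a federated keyword search across two catalogs' SPARQL endpoints, using Python with SPARQLWrapper. The endpoint URLs and the keyword are placeholders; only the DCAT / Dublin Core terms are standard.

```python
# Sketch: federated dataset search over two hypothetical LOD-enabled primary catalogs.
from SPARQLWrapper import SPARQLWrapper, JSON

query = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?dataset ?title WHERE {
  {
    ?dataset a dcat:Dataset ;
             dct:title ?title ;
             dcat:keyword "maize" .
  } UNION {
    # federate the same pattern to a second catalog's endpoint (placeholder URL)
    SERVICE <http://catalog-b.example.org/sparql> {
      ?dataset a dcat:Dataset ;
               dct:title ?title ;
               dcat:keyword "maize" .
    }
  }
}
LIMIT 50
"""

endpoint = SPARQLWrapper("http://catalog-a.example.org/sparql")  # placeholder endpoint
endpoint.setQuery(query)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], "-", row["title"]["value"])
```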

1. Dataset discovery: good metadata (1)

1. General metadata about the dataset “resource”:
a) identifier(s)
b) who is responsible for it
c) when and where the data were collected
d) relations to organizations, persons, publications, software, projects, funding…
e) the conditions for re-use (rights, licenses)
f) provenance, versions
g) the specific coverage of the dataset (type of data, thematic coverage, geographic coverage)

Normally covered by generic vocabularies like Dublin Core or DCAT

IDEAL

Let’s look at existing good practices and standards
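As a concrete sketch of the generic metadata listed above, expressed with Dublin Core / DCAT terms using Python's rdflib; all URIs and values below are illustrative placeholders, not real identifiers.

```python
# Sketch of "general" dataset metadata (identifier, responsibility, coverage, rights...).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, DCTERMS

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
ds = URIRef("http://example.org/dataset/soil-moisture-2016")                        # a) identifier
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Soil moisture observations 2016")))
g.add((ds, DCTERMS.publisher, URIRef("http://example.org/org/nars-institute")))     # b) responsibility
g.add((ds, DCTERMS.temporal, Literal("2016-01-01/2016-12-31")))                     # c) when collected
g.add((ds, DCTERMS.spatial, URIRef("http://example.org/places/some-region")))       # c) where collected
g.add((ds, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))# e) re-use conditions
g.add((ds, DCTERMS.provenance, Literal("Derived from field sensor network, v2")))   # f) provenance
g.add((ds, DCAT.theme, URIRef("http://example.org/themes/soil")))                   # g) thematic coverage
                                                                                    #    (in practice e.g. an AGROVOC concept URI)
print(g.serialize(format="turtle"))
```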

1. Dataset discovery: good metadata (2)

a) The variables: the observed “dimensions” (e.g. time, geographic region, gender, elevation…) and the measured / observed phenomenon (e.g. life expectancy)
b) The specification of the dimensions (units of measure, time granularity, syntax, any scaling factors and metadata such as the status of the observation, reference taxonomies…)
c) Possible time and space slices; subsets

Not always considered in generic dataset metadata vocabularies (DCAT) but traditionally included in research datasets (e.g. in formats like NetCDF) and covered by DataCube

IDEAL

2. Metadata about the data structure!
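A minimal sketch of such structural metadata as an RDF Data Cube DataStructureDefinition (two dimensions plus a measure), again with rdflib. The example.org URIs are placeholders; the qb: and SDMX-RDF terms are the standard Data Cube ones.

```python
# Sketch of structural metadata: which dimensions are observed and what is measured.
from rdflib import Graph, BNode, Namespace, URIRef
from rdflib.namespace import RDF

QB       = Namespace("http://purl.org/linked-data/cube#")
SDMX_DIM = Namespace("http://purl.org/linked-data/sdmx/2009/dimension#")
SDMX_MEA = Namespace("http://purl.org/linked-data/sdmx/2009/measure#")

g = Graph()
dsd = URIRef("http://example.org/structure/crop-yield")  # placeholder structure URI
g.add((dsd, RDF.type, QB.DataStructureDefinition))

# a) observed dimensions: time and geographic region
for dim in (SDMX_DIM.refPeriod, SDMX_DIM.refArea):
    comp = BNode()
    g.add((dsd, QB.component, comp))
    g.add((comp, QB.dimension, dim))

# a) measured phenomenon, e.g. yield per hectare, attached as the measure component
measure = BNode()
g.add((dsd, QB.component, measure))
g.add((measure, QB.measure, SDMX_MEA.obsValue))

print(g.serialize(format="turtle"))
```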

1. Dataset discovery: good metadata (3)

1. Where to retrieve the dataset: URL (data dump, service…)

2. The necessary technical specifications to retrieve and parse a distribution of the dataset:
- format (file format, data format), vocabularies / data dictionaries
- protocol, API parameters…

Not always considered in generic dataset metadata vocabularies: DCAT covers the data dump and the format, VOID covers some services

IDEAL

3. Metadata about the actual “serializations” or “distributions” of the dataset.

Data will be processed by tools! Data formats and access protocols are important.
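A minimal sketch of distribution-level metadata in DCAT (retrieval URL, format, data dictionary), with rdflib; all URIs are placeholders.

```python
# Sketch of "distribution" metadata: where to get the data and how to parse it.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF, DCTERMS

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
ds   = URIRef("http://example.org/dataset/soil-moisture-2016")
dump = URIRef("http://example.org/dataset/soil-moisture-2016/dump-csv")

g.add((ds, RDF.type, DCAT.Dataset))
g.add((dump, RDF.type, DCAT.Distribution))
g.add((ds, DCAT.distribution, dump))
g.add((dump, DCAT.downloadURL,                                        # 1. where to retrieve it
       URIRef("http://example.org/files/soil-moisture-2016.csv")))
g.add((dump, DCAT.mediaType,                                          # 2. file format (IANA media type URI)
       URIRef("https://www.iana.org/assignments/media-types/text/csv")))
g.add((dump, DCTERMS.conformsTo,                                      # 2. data dictionary / profile
       URIRef("http://example.org/profiles/sensor-csv-dictionary")))
print(g.serialize(format="turtle"))
```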

1. Dataset discovery: interoperable metadata

Secondary catalogs have to be able to retrieve metadata from the primary dataset catalog

IDEAL

Ideally, secondary catalogs would be able to retrieve only subsets of the catalog (by type of data, by data format, by phenomenon observed?)

>> Data service / API with filtering parameters: catalogs as DaaS, Data-as-a-Service (a sketch follows below)

• All discovery-relevant metadata are exposed in machine-readable form

• Exposed metadata use shared semantics

• Standardization of the values, e.g. for “thematic coverage” or “dimensions” of datasets, “format” or “protocol used” of distributions etc.

• The value should be standardized, possibly a URI

• The value should be part of an authority list / code list
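A sketch of what “catalogs as a data service” can look like in practice: a secondary catalog harvesting only a filtered subset of records from a primary CKAN catalog. The catalog URL and the filter values are placeholders; package_search with its q / fq / rows / start parameters is part of the standard CKAN Action API.

```python
# Sketch: harvest only a filtered subset of dataset metadata from a primary CKAN catalog.
import requests

CKAN_URL = "https://primary-catalog.example.org"  # placeholder primary catalog

def harvest_subset(query: str, filter_query: str, page_size: int = 100):
    """Yield dataset metadata records matching the query and filters, page by page."""
    start = 0
    while True:
        resp = requests.get(
            f"{CKAN_URL}/api/3/action/package_search",
            params={"q": query, "fq": filter_query, "rows": page_size, "start": start},
            timeout=30,
        )
        resp.raise_for_status()
        result = resp.json()["result"]
        yield from result["results"]
        start += page_size
        if start >= result["count"]:
            break

# e.g. retrieve only CSV distributions tagged with a given topic
for record in harvest_subset("soil", 'res_format:CSV AND tags:"soil moisture"'):
    print(record["name"], record.get("title"))
```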

1. Dataset discovery: ideal architecture

Conclusions

• Dataset metadata is ideally created by authors / curators at the local level, with the catalog associated with the repository

• High-quality metadata in catalogs, allowing for answers to all possible queries
• Ownership, rights, temporal, spatial, thematic, data structure, access…

• Machine-readable metadata; agreed vocabularies; shared semantics; APIs for querying

• General or specialized secondary catalogs federate metadata from primary catalogs; multiply discoverability and cater for different audiences

• Secondary catalogs also expose good metadata and APIs

• There’s an inventory / registry of dataset repositories and all types of catalogs

IDEAL

2. Dataset discovery: current situation in Agriculture (1)

• Institutional data repositories are picking up (need for an inventory!)

CURRENT

• Use of standardized or semi-standardized data repository tools with cataloguing functionalities and APIs is picking up (Dataverse, CKAN…)

• Some governmental metadata catalogs that include agricultural datasets exist, often using standardized tools (CKAN) and standard vocabularies (DCAT)

• Some international data catalogs exist that include agricultural datasets (re3data, OpenAIRE, DataHub…)
• Also research-oriented data services like OPeNDAP or Unidata THREDDS

• Some secondary federated catalogs exist (? need for an inventory!)
• A general one for agriculture (usable as an inventory): the CIARD RING
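For instance, a standardized repository tool such as Dataverse already exposes a search API that a secondary catalog can query directly. A minimal sketch; the installation URL and the query string are placeholders, while /api/search with q, type and per_page is the standard Dataverse Search API.

```python
# Sketch: querying a Dataverse-based institutional repository for agricultural datasets.
import requests

DATAVERSE_URL = "https://data.example-institute.org"  # placeholder installation

resp = requests.get(
    f"{DATAVERSE_URL}/api/search",
    params={"q": "wheat yield", "type": "dataset", "per_page": 20},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json()["data"]["items"]:
    print(item["global_id"], "-", item["name"])
```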

2. Dataset discovery: current situation
Example: the CIARD RING secondary catalog

• Architecture:
• Datasets can be hosted anywhere; the RING only hosts the metadata

• Optionally, datasets can be uploaded and the RING can act as a subsidiary repository

• Datasets (metadata) can be federated from other catalogs
• It uses the dataset / distribution DCAT model

• Metadata quality:
• It uses a combination of the DCAT model + the VOID vocabulary and the DataCube vocabulary + some extra properties (? a “RING DCAT profile” will be published)

• Shared semantics:
• It has a Linked Data layer, URIs for all entities; all categories are published as SKOS concepts in SKOS concept schemes and are mapped to external concepts whenever possible
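A minimal sketch of that shared-semantics pattern: a catalog-local category published as a SKOS concept and mapped to an external concept. All URIs below are placeholders, not the RING's actual identifiers.

```python
# Sketch: a local category as a SKOS concept, mapped to an external vocabulary.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

g = Graph()
scheme = URIRef("http://example.org/catalog/themes")          # placeholder concept scheme
local  = URIRef("http://example.org/catalog/themes/soils")    # placeholder local category

g.add((scheme, RDF.type, SKOS.ConceptScheme))
g.add((local, RDF.type, SKOS.Concept))
g.add((local, SKOS.inScheme, scheme))
g.add((local, SKOS.prefLabel, Literal("Soils", lang="en")))
# mapping to an external concept (in practice e.g. an AGROVOC concept URI),
# which enables cross-catalog queries on shared URIs
g.add((local, SKOS.exactMatch, URIRef("http://example.org/external-vocab/soils")))
print(g.serialize(format="turtle"))
```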

2. Dataset discovery: current situation in Agriculture (2)

• Metadata quality of most used primary data catalog tools is not high
• E.g. no metadata about data structure, no shared semantics for data types, topics, formats, standards used

CURRENT

>> poor discovery services in secondary catalogs

• Metadata interoperability of most used primary data catalog tools is not high
• No full compliance with broadly recognized vocabularies (DCAT, DataCube…)
• No functionality to apply shared semantics for categorizations like topics, data types, formats, dimensions (sometimes keywords from AGROVOC)
• Data not always accessible through exposed metadata

• Scarce population >> lack of reputation / authority of secondary catalogs >> lack of motivation to share

3. Dataset discovery: infrastructural improvements (1)

Quality depends on the metadata coming from the primary repository / catalog… how can a good infrastructure overcome this problem?

• Advocacy for better / improved tools?
• Promote improvement of existing tools?

• Dataset repository / catalog platforms in the cloud?

• Complementary / subsidiary role of secondary catalogs?
• Allow subsidiary use of secondary catalogs as primary catalog and even repository for some datasets (small institutions, individuals)

• Cater for the improvement of metadata directly in the secondary catalogs

• Incentives to provide good metadata?
• E.g. offer mechanisms to a) measure reuse; b) enforce respect of usage rights.

• Good agreed metadata standards and reference value vocabularies

• Combine existing standards (DCAT, DataCube, VOID…) in an application profile?

• Provide a reference framework of agreed value vocabularies with URIs?
• Mapping from local values to agreed ones? >> AgriSemantics: GACS, VEST/AgroPortal (a small sketch follows below)

• Avoid too much interdependence. Design a loosely coupled infrastructure. (How?)
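As referenced above, a minimal sketch of the value-mapping step a secondary catalog could apply when harvesting: normalizing local free-text “format” values to agreed reference URIs. The lookup table, its keys and the normalization rule are entirely hypothetical; only the IANA media-type URIs are real.

```python
# Sketch: map harvested free-text "format" values to agreed reference URIs.
REFERENCE_FORMATS = {
    "csv":     "https://www.iana.org/assignments/media-types/text/csv",
    "netcdf":  "https://www.iana.org/assignments/media-types/application/netcdf",
    "rdf/xml": "https://www.iana.org/assignments/media-types/application/rdf+xml",
}

def normalize_format(local_value: str) -> str | None:
    """Map a harvested format string to an agreed URI, or None if unmapped."""
    key = local_value.strip().lower().replace("text/", "").replace("application/", "")
    return REFERENCE_FORMATS.get(key)

print(normalize_format("CSV"))        # mapped to the agreed URI
print(normalize_format("Shapefile"))  # None: flag for manual mapping
```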

3. Dataset discovery: infrastructural improvements (2)

Key questions

• Is it our task to aim at having better machine-readable metadata at the level of the primary local repository / catalog?
• How can we influence this? Advocate for including metadata in researchers’ tools?

• Do we want to “drive” secondary catalogs or let them bloom? Or both?
• At least a global one for food&ag? How many? Who decides? Who manages them?

• How can other infrastructural components facilitate good catalogs?
• Subsidiary metadata in secondary catalogs? Good dataset catalog tools in the cloud?

• Good agreed metadata standards and reference value vocabularies?

• Mapping with local values

• How to design for resilience of the system? Loosely coupled components?

• How much of this is specific to food&ag and which aspects should be tackled in a broader context? (EOSC?)

Data discovery through dataset repositories and catalogs

Thank you for your attention

eROSA workshop, Montpellier, 6-7 July 2017