BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus Döring Global Biodiversity Information Facility (GBIF)


Page 1: BIS TDWG Conference 28 October 2013, Florence

BIS TDWG Conference, 28 October 2013, Florence

Documenting data quality in a global network: the challenge for GBIF

Éamonn Ó Tuama, Andrea Hahn, Markus Döring
Global Biodiversity Information Facility (GBIF)

Page 2: BIS TDWG Conference 28 October 2013, Florence

Outline

1. The GBIF network and the Data Quality challenge

2. Current DQ processes in GBIF Portal

3. DQ and GBIF Nodes

4. Addressing DQ in GBIF work programme 2014-2016

Page 3: BIS TDWG Conference 28 October 2013, Florence

GBIF is …

- a connected community

- an informatics infrastructure

- a window on biodiversity

- a tool for science and society

http://www.gbif.org/resources/2311

Page 4: BIS TDWG Conference 28 October 2013, Florence

Addressing data quality

Meeting the challenge of documenting data quality as the network and volume of data grow …

Page 5: BIS TDWG Conference 28 October 2013, Florence

[Chart: Primary biodiversity records (millions), Aug-07 to May-13; axis range 80 to 440]

As of August 2013: >405,720,500 indexed records from 10,139 datasets from 493 publishers and spanning a wide range of geospatial, temporal and taxonomic coverages.

http://tinyurl.com/gbifMap

Current GBIF Network Data Coverage

Page 6: BIS TDWG Conference 28 October 2013, Florence

DQ processes in GBIF portal

• Minimum obligatory metadata
• Check geographic values
• Check taxonomic values

Page 7: BIS TDWG Conference 28 October 2013, Florence

Packaging metadata with data

Page 8: BIS TDWG Conference 28 October 2013, Florence

Verbatim data asserted to originate in USA as shared on the network

Geographic attributes

Page 9: BIS TDWG Conference 28 October 2013, Florence

Geographic attributes

Data following quality check
• Coastal regions recognised
• Offshore islands recognised

85% (355/417 million) georeferenced records

2.7% (9.4 million) georeferenced with issues
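The kind of automated check that flags a georeferenced record as having issues can be sketched as below. This is a minimal illustration, not the GBIF portal's implementation: the function name `check_coordinates`, the issue flags, and the coarse bounding box for Italy are all assumptions introduced here. A real check would use country polygons with coastal and offshore buffers rather than a rectangle.

```python
from typing import List, Optional

# Hypothetical, very coarse bounding boxes: (min_lat, max_lat, min_lon, max_lon).
# Illustrative only; real checks would use proper country polygons.
COUNTRY_BBOX = {
    "IT": (35.0, 47.5, 6.0, 19.0),
}


def check_coordinates(lat: Optional[float], lon: Optional[float],
                      country_code: Optional[str] = None) -> List[str]:
    """Return issue flags for one record; an empty list means no issue found."""
    if lat is None or lon is None:
        return ["MISSING_COORDINATE"]
    # Coordinates outside the valid ranges cannot be checked any further.
    if not (-90 <= lat <= 90) or not (-180 <= lon <= 180):
        return ["OUT_OF_RANGE"]
    issues = []
    # (0, 0) is a common placeholder for "no coordinates".
    if lat == 0 and lon == 0:
        issues.append("ZERO_COORDINATE")
    # Compare against the asserted country, when a box is known for it.
    bbox = COUNTRY_BBOX.get(country_code or "")
    if bbox is not None:
        min_lat, max_lat, min_lon, max_lon = bbox
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            issues.append("COUNTRY_COORDINATE_MISMATCH")
    return issues
```

Swapped latitude and longitude are a typical case caught by the country comparison: a point near Florence passes as (43.77, 11.25) but is flagged as a mismatch when entered as (11.25, 43.77).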

Page 10: BIS TDWG Conference 28 October 2013, Florence

Trochilidae (Hummingbirds)
Using verbatim higher classification

Taxonomic attributes

Page 11: BIS TDWG Conference 28 October 2013, Florence

Taxonomic attributes

Trochilidae (Hummingbirds)
Classification based on authoritative sources

56% of name usages also found in CoL
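A figure such as the share of name usages also found in a reference checklist can be computed along the following lines. This is a sketch under stated assumptions: the helper names `normalise` and `match_rate` are hypothetical, and exact matching after whitespace and case folding is a simplification. Real name matching also has to deal with authorship strings, spelling variants and synonyms.

```python
from typing import Iterable


def normalise(name: str) -> str:
    """Collapse whitespace and case so trivially different spellings compare equal."""
    return " ".join(name.split()).lower()


def match_rate(name_usages: Iterable[str], reference_names: Iterable[str]) -> float:
    """Fraction of name usages that also occur in the reference checklist."""
    reference = {normalise(n) for n in reference_names}
    usages = [normalise(n) for n in name_usages]
    if not usages:
        return 0.0
    matched = sum(1 for n in usages if n in reference)
    return matched / len(usages)
```

With a toy input of three usages, two of which appear in the reference list, `match_rate` returns two thirds.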

Page 12: BIS TDWG Conference 28 October 2013, Florence

Authoritative checklists

• Fill gaps in the GBIF taxonomic backbone

• Increase list of known synonyms

• Increase the number of common names known to GBIF

Page 13: BIS TDWG Conference 28 October 2013, Florence

New improved algorithm for GBIF backbone taxonomy

• Some taxa (mainly autonyms) do not have stable IDs
• Too many accepted species created because of lack of a good database of taxonomic synonyms
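Why a missing synonym database inflates the number of accepted species can be shown with a small sketch. The names below are made-up placeholders, and the `synonym_of` mapping and both function names are assumptions for illustration, not the backbone algorithm itself.

```python
from typing import Dict, Iterable


def resolve_accepted(name: str, synonym_of: Dict[str, str]) -> str:
    """Follow synonym links until a name with no further link is reached."""
    seen = set()
    # The 'seen' set guards against accidental cycles in the synonym table.
    while name in synonym_of and name not in seen:
        seen.add(name)
        name = synonym_of[name]
    return name


def count_accepted(names: Iterable[str], synonym_of: Dict[str, str]) -> int:
    """Number of distinct accepted names after synonym resolution."""
    return len({resolve_accepted(n, synonym_of) for n in names})
```

With an empty synonym table every distinct name counts as its own accepted species. Once the links are known, names that point to the same accepted name collapse into one.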

Page 14: BIS TDWG Conference 28 October 2013, Florence

Working with Catalogue of Life

[Diagram: Global Species Databases, Catalogue of Life, GBIF ChecklistBank (DwC-A checklists), GBIF backbone taxonomy]

Page 15: BIS TDWG Conference 28 October 2013, Florence

Working with Catalogue of Life

[Diagram: Global Species Databases, Catalogue of Life, GBIF ChecklistBank (DwC-A checklists), GBIF backbone taxonomy]

The first two GSDs have already provided annotations:

International Legume Database & Information Service (ILDIS)
• 8188 names annotated
• 6825 rejected names
• 541 placed names (added to ILDIS)
• remaining have syntactical problems (CoL issue, not ILDIS)

Scarabs: World Scarabaeidae Database
• 1339 names annotated
• 0 rejected names

First backbone based on CoL feedback loop expected around December 2013

Page 16: BIS TDWG Conference 28 October 2013, Florence

Data Quality issues: Non-standardised values

Example: dwc:country (http://rs.tdwg.org/dwc/terms/country)

29,052 distinct values for country names. Of these, 18,704 (concerning 2.2 million records) could not be mapped to an ISO country code.

Typical issues:
• Variants: 126 different values for “Italy”
• Mismappings: taxon names instead of country names
• Incorrect level of detail: sub-national units, non-country geographical entities
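Mapping verbatim dwc:country values to ISO codes can be sketched as a fold-then-lookup step. The variant table below is a tiny illustrative assumption (a real table would cover every ISO 3166-1 entry and far more variants), and the names `fold` and `to_iso_country` are hypothetical.

```python
import unicodedata
from typing import Optional

# Illustrative lookup only: a handful of spellings for one country.
COUNTRY_VARIANTS = {
    "italy": "IT",
    "italia": "IT",
    "italie": "IT",
    "italien": "IT",
    "it": "IT",
    "ita": "IT",
}


def fold(value: str) -> str:
    """Lower-case, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(cleaned.lower().split())


def to_iso_country(verbatim: Optional[str]) -> Optional[str]:
    """Map a verbatim country value to an ISO 3166-1 alpha-2 code, or None."""
    if not verbatim:
        return None
    return COUNTRY_VARIANTS.get(fold(verbatim))
```

Values that fall outside the table, such as a taxon name or a sub-national unit entered in the country field, return `None` and stay flagged as unmapped rather than being guessed at.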

Page 17: BIS TDWG Conference 28 October 2013, Florence

Data Quality issues: Non-standardised values

Example: dwc:basisOfRecord (http://rs.tdwg.org/dwc/terms/basisOfRecord)

625 values that cannot be interpreted at all (accounting for 13.3 million records)

Typical issues:
• Spelling variants / language variants
• Mismappings
• Misunderstanding definition

30 million records with no value or “unknown”

Interpretable values quite varied, e.g. 31 values mapped to “observation”, 146 to “specimen”
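The three groups described above (interpretable, missing or “unknown”, and uninterpretable) can be separated with a simple classifier. The synonym table and the target terms are assumptions made for illustration, not GBIF's actual mapping, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Illustrative synonym table: verbatim variant -> controlled term.
BASIS_SYNONYMS = {
    "observation": "OBSERVATION",
    "obs": "OBSERVATION",
    "beobachtung": "OBSERVATION",
    "specimen": "SPECIMEN",
    "preserved specimen": "SPECIMEN",
    "preservedspecimen": "SPECIMEN",
}


def interpret_basis_of_record(verbatim: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (category, interpreted_term) for one verbatim value."""
    if verbatim is None or not verbatim.strip():
        return ("missing", None)
    key = " ".join(verbatim.split()).lower()
    if key == "unknown":
        return ("missing", None)
    if key in BASIS_SYNONYMS:
        return ("interpreted", BASIS_SYNONYMS[key])
    return ("uninterpretable", None)


def summarise(values: Iterable[Optional[str]]) -> Counter:
    """Count how many values fall into each category."""
    return Counter(interpret_basis_of_record(v)[0] for v in values)
```

Keeping "missing" separate from "uninterpretable" mirrors the distinction drawn on the slide between records with no value and records whose value cannot be mapped.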

Page 18: BIS TDWG Conference 28 October 2013, Florence

DQ and GBIF Nodes: Desirable improvements

• Better metadata
• Persistent IDs
• Controlled vocabularies
• Annotations
• Independently validated datasets
• Genetic validation of taxonomy

Page 19: BIS TDWG Conference 28 October 2013, Florence

DQ and GBIF Nodes: Implementing improvements

• Collate experiences of all Nodes and share best practices

• Build reusable DQ components (e.g., tools, vocabularies, workflows)

Page 20: BIS TDWG Conference 28 October 2013, Florence

DQ and GBIF Nodes: Next steps

• Expand Data Quality Interest Group
• Establish a collaboration platform

Page 21: BIS TDWG Conference 28 October 2013, Florence

Addressing Data Quality in GBIF Work Programme 2014-2016

Page 22: BIS TDWG Conference 28 October 2013, Florence

GBIF Work Programme 2014-2016

Essential Infrastructure to support Data Quality

• Ensure stable identifiers for datasets and records
• Provide a method for citation of data sets
• Enable annotation of data

Page 23: BIS TDWG Conference 28 October 2013, Florence

GBIF Work Programme 2014-2016

Engagement of expert communities to form fitness-for-use working groups

• enhancements to data standards and classes of data in use in GBIF

• criteria and algorithms for evaluating data quality, fitness-for-use, coverage and completeness

• content mobilisation priorities (inc. improving already mobilised data)

• identification and curation of reference data sets

Page 24: BIS TDWG Conference 28 October 2013, Florence

GBIF Work Programme 2014-2016

Guidelines and supporting tools to assess and improve metadata completeness for all data

• Evaluation and reporting on metadata completeness and quality

• Seeking to ensure that the basis of record is clear for each data record

Criteria from fitness-for-use working groups
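Evaluation and reporting on metadata completeness could start from a simple per-dataset score. The sketch below is an assumption throughout: the list of required fields, the field names, and the function `metadata_completeness` are hypothetical placeholders for whatever criteria the fitness-for-use working groups define.

```python
from typing import Dict, List, Tuple

# Hypothetical required fields; the actual criteria are not given in the slides.
REQUIRED_FIELDS = ["title", "description", "contact", "licence", "basis_of_record"]


def metadata_completeness(metadata: Dict[str, object]) -> Tuple[float, List[str]]:
    """Return (score between 0 and 1, list of missing or empty required fields)."""
    missing = [
        field for field in REQUIRED_FIELDS
        if not str(metadata.get(field) or "").strip()
    ]
    present = len(REQUIRED_FIELDS) - len(missing)
    return present / len(REQUIRED_FIELDS), missing
```

Returning the list of missing fields alongside the score lets a report tell a publisher what to fix, not only how the dataset ranks.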

Page 25: BIS TDWG Conference 28 October 2013, Florence

GBIF Work Programme 2014-2016

GBIF portal upgrades to report data quality and fitness-for-use for each data set and species

• Standards compliance
• Metadata completeness
• Presence of key data elements
• Automated checks for issues and outliers
• Endorsements of data publishers and data sets by Nodes, fitness-for-use working groups and other stakeholders

Criteria from fitness-for-use working groups

Page 26: BIS TDWG Conference 28 October 2013, Florence

Thank you

GBIF Secretariat
Universitetsparken 15
DK-2100 Copenhagen Ø
Denmark

www.gbif.org

E-mail: [email protected]
Tel: +45 3532 1470
Fax: +45 3532 1480