GLOBAL BIODIVERSITY INFORMATION FACILITY

David Remsen

ECAT Program Officer

August 2010

WWW.GBIF.ORG

Informatics Infrastructure and Portal (IIP)

Contents

• Publishing
– Current developments in Darwin Core, its extensions, and publishing solutions (incl. the Integrated Publishing Toolkit)
• Integration and discovery
– Status of tools for harvesting and interpretation through controlled vocabularies, and plans for the Data Portal
• Communications
– Evaluation of the communication platforms, update on staffing changes and resources

Publishing

• Objectives
– Simplify the publication of primary biodiversity data
– Support the publication of species-level data
– Improve data quality & dataset documentation
– Reduce the latency between publishing and discovery through portals
– Support the capacity to extend the published content
– Expand data publishing configuration options

Publishing What

• Species Occurrence Data
– Primary Biodiversity Data
– Observations / Natural History Collections
• Species-level Data
– Taxonomic Catalogues
– Annotated Species Checklists
  • Floral and Faunal lists
  • Thematically-defined lists (Red-List, Invasive, etc.)

• Dataset (Resource) Metadata

Standards and Protocols

• Primary Biodiversity data
– Darwin Core via DiGIR protocol
– ABCD (Access to Biological Collections Data) via BioCASe protocol
– TAPIR protocol
  • multiple output formats
• Taxonomic data
– Taxon Concept Schema (TCS)
  • Few tools
  • Low uptake

Protocols impact harvesting latency – Schemas are complex and constrain data scope

Darwin Core

• Ratified in 2009
• Significant additions/refinements
• Set of terms
– http://rs.tdwg.org/dwc/terms/index.htm
• Expressed via XML
• Simple Darwin Core (Subset)
• Expressed as text
– http://rs.tdwg.org/dwc/terms/guides/text/index.htm



Darwin Core Archives (DwC-A)

Extensions are text files
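To make "extensions are text files" concrete, here is a minimal sketch of how rows in an extension file can be linked back to the core file through a shared identifier in the first column. The file contents, column order and identifiers (`t1`, `t2`) are invented for illustration and are not taken from any real archive.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-delimited core file: id, genus, species (no header line).
CORE_TXT = "t1\tBalaenoptera\tmusculus\nt2\tMegaptera\tnovaeangliae\n"
# Hypothetical extension file: core id, vernacular name, language.
EXT_TXT = "t1\tblue whale\ten\nt1\trorqual bleu\tfr\nt2\thumpback whale\ten\n"


def read_rows(text):
    """Parse tab-delimited text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def join_extension(core_rows, ext_rows):
    """Attach extension rows to core rows via the identifier in column 0."""
    by_core_id = defaultdict(list)
    for row in ext_rows:
        by_core_id[row[0]].append(row[1:])
    return {
        row[0]: {"core": row[1:], "extension": by_core_id[row[0]]}
        for row in core_rows
    }


joined = join_extension(read_rows(CORE_TXT), read_rows(EXT_TXT))
for core_id, record in joined.items():
    print(core_id, record["core"], record["extension"])
```

One core record may carry any number of extension rows (here, two vernacular names for `t1`), which is why the extension lives in its own file rather than in extra core columns.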

DwC-A Case Study: Ireland

• National Biodiversity Data Centre (Ireland)
• Ireland joined GBIF in 2009
• Selected DwC-A as the easiest integration
• Incorporated into internal systems
– Under 2 weeks of development
• Automatic registration through Registry API
– http://code.google.com/p/gbif-registry/wiki/ResourceAPI
• 34 Collections today
• 450,000 records harvested

Publish via: Direct Export of DwC-A

• Requires basic DBA skills and documentation
– Darwin Core Terms
– Darwin Core Archive Format
– Publishing Taxonomic Catalogues & Annotated Checklists via DwC-A
– Publishing Occurrence Data via DwC-A
• Access to list of terms, supported extensions, and schemas
– http://rs.gbif.org (Schema repository)

Status: Documentation release September 2010 via GBIF website
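As a rough illustration of what "direct export" might involve, the sketch below dumps a database table to a tab-delimited text file and bundles it into a zip. The table name, columns and rows are hypothetical stand-ins for an institution's own database, and SQLite is used only so the example is self-contained.

```python
import csv
import io
import sqlite3
import zipfile

# Hypothetical source table standing in for an institution's own database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxa (id TEXT, family TEXT, genus TEXT, species TEXT)")
conn.executemany(
    "INSERT INTO taxa VALUES (?, ?, ?, ?)",
    [
        ("t1", "Balaenopteridae", "Balaenoptera", "musculus"),
        ("t2", "Balaenopteridae", "Megaptera", "novaeangliae"),
    ],
)


def export_table(connection, query):
    """Run a query and return its rows as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(connection.execute(query))
    return buffer.getvalue()


taxa_txt = export_table(conn, "SELECT id, family, genus, species FROM taxa ORDER BY id")

# Bundle the data file into a zip held in memory; a real archive would also
# contain the meta.xml descriptor and a dataset metadata document.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("taxa.txt", taxa_txt)

print(taxa_txt)
```

The column order chosen in the `SELECT` is what the descriptor file would later map to Darwin Core terms, so it is worth fixing explicitly rather than relying on `SELECT *`.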

XML Descriptor file

<archive xmlns="http://rs.tdwg.org/dwc/text/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://rs.tdwg.org/dwc/text/ http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd"
         metadata="http://www.biodiv.org/docs/metadata/whale_catalogue.eml">

  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy='' ignoreHeaderLines="0"
        rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files>
      <location>taxa.txt</location>
    </files>
    <id index="0" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/phylum"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/class"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/order"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/family"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/genus"/>
    <field index="7" term="http://rs.tdwg.org/dwc/terms/species"/>
    <field index="8" term="http://rs.tdwg.org/dwc/terms/infraspecies"/>
    <field index="9" term="http://rs.tdwg.org/dwc/terms/infraspeciesRank"/>
    <field index="10" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
    <field default="ICZN" term="http://rs.tdwg.org/dwc/terms/nomenclaturalCode"/>
  </core>

  <extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy='' ignoreHeaderLines="0"
             rowType="http://rs.gbif.org/terms/1.0/VernacularName">
    <files>
      <location>vernacular.txt</location>
    </files>
    <coreid index="0" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/vernacularName"/>
    <field index="2" term="http://purl.org/dc/terms/language"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/countryCode"/>
  </extension>
</archive>
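A descriptor like the one above can be read programmatically to learn which column of which file holds which term. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the embedded XML is an abbreviated copy of the slide's example, and the `describe` helper and its output layout are assumptions made for illustration rather than part of any GBIF tool.

```python
import xml.etree.ElementTree as ET

# Namespace of the descriptor, as declared on the <archive> element.
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

# Abbreviated descriptor modelled on the slide's example.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files><location>taxa.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/genus"/>
    <field default="ICZN" term="http://rs.tdwg.org/dwc/terms/nomenclaturalCode"/>
  </core>
  <extension encoding="UTF-8" fieldsTerminatedBy="\\t" rowType="http://rs.gbif.org/terms/1.0/VernacularName">
    <files><location>vernacular.txt</location></files>
    <coreid index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/vernacularName"/>
  </extension>
</archive>"""


def describe(element):
    """Summarise one <core> or <extension> element as a plain dict."""
    columns, defaults = {}, {}
    for field in element.findall("dwc:field", NS):
        name = field.get("term").rsplit("/", 1)[-1]  # short term name
        if field.get("index") is not None:
            columns[int(field.get("index"))] = name  # term read from a column
        else:
            defaults[name] = field.get("default")    # constant for every row
    return {
        "file": element.findtext("dwc:files/dwc:location", namespaces=NS),
        "rowType": element.get("rowType"),
        "columns": columns,
        "defaults": defaults,
    }


root = ET.fromstring(META_XML)
core = describe(root.find("dwc:core", NS))
extensions = [describe(e) for e in root.findall("dwc:extension", NS)]
print(core)
print(extensions)
```

Fields with an `index` are read from the data file, while a field with only a `default` (here `nomenclaturalCode`) applies the same value to every row, so the two are kept apart in the summary.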

Authoring meta.xml

Status: Beta release Sept. 1 http://code.google.com/p/gbif-meta-maker/

http://code.google.com/p/gbif-providertoolkit/

Excel Spreadsheet Templates

http://code.google.com/p/gbif-spreadsheet-processor/

Status: Beta release September (by TDWG)




Integrated Publishing Toolkit

• A supported platform for publication of:
– Occurrence-level content
– Species checklist content
– Dataset metadata
  • Sampling methods
  • Bibliographic citations
  • Temporal coverage
• DwC-A compatible
– Reduced latency between publishing and discovery


• GBIF Review 2009
• “…with regards to software and tool development…:
– Lack of rigorous technical documentation; open source software must be documented and annotated meticulously in order to take advantage of improvements made by users.
– Release of unstable code that is being worked on still by its initiators to a community who are not made aware that it is not finalised.”


• Received good feedback in first year of use

• Primary request: Simplify and “lighten” up the product

• RC4 testing starting now
– Enhanced metadata (still EML)
– Darwin Core Archive import
– Multiple organisation association (for hosting centres)
– Bug fixing


• RC4 will not address feedback, but will be a more stable version for new users

• RC5 development underway to address feedback
– Simplification all around (intuitiveness)
– Performance improvements
– Enriched documentation/examples/webcasts
– Server requirements dropping significantly (target of 256MB of memory)


• Following RC5, we will start user testing and bug fixing, with no unnecessary functionality changes, to reach the target of a stable, robust platform by the end of 2010

• http://code.google.com/p/gbif-providertoolkit/


Vocabulary server

• Drupal implementation developed as a proof of concept
– http://vocabularies.gbif.org/
• IPT uses extensions, vocabularies and schemas for its operation
– http://vocabularies.gbif.org/
• No well-defined workflows yet for community ownership of vocabularies
– Discuss at TDWG ’10
– ViBRANT funding to operationalise

Vocabulary server

Draft New Extensions

Draft New Vocabularies

Publish them

Internationalise them

Indexing and Discovery

• Objectives
– Extend the classes of content that can be discovered
– Improve the means to discover (flexible indexes)
– Better determination of fitness for use
  • Through dataset metadata
– Annotation / Feedback brokerage
– Accurate citation
– Reduce the latency between publishing and discovery through portals

GBIF Registry (GBRDS)

• Index of the technical access points of the datasets comprising the GBIF network

• Captures basic metadata about institutions, datasets, nodes and their relationships

• Enhanced features under development
– Improved attribution
– Better data provenance declaration
– More accurate reporting on the total participation within the GBIF network
– Dynamic definitions of thematic networks
– API / Web app for automating registration

http://gbrds.gbif.org/index









Registry: GBIF is complex…

Metadata catalogue

• Collection of XML-based dataset metadata documents (ISO, FGDC, EML, DIV formats)

• Associated with entities known to the GBIF Registry
• Common search across content
• Currently using Metacat
– Will review this following prototyping

Goal: Enriched documentation, discovery of unpublished datasets

Status: Under development, promoting publication of data documents through “small grant awards”

Harvesting and Indexing Toolkit

• The GBIF harvesting software:
– Foundations to harvest DiGIR, BioCASe, TAPIR, DwC-A
– Synchronisation with the GBIF Registry
– User interface for controlling and scheduling harvesting operations
– Metrics for the success of harvest runs
– Access to logs for diagnostics
– Synchronisation against the GBIF Portal database



• In production use in GBIFS only
• Some external users are testing; feedback is being collected now
• In light of the GBIF review comments, need to assess:
– The need for such a tool (requirements are sought by community)
– Resources needed to meet community expectations (versioning, bug fixing, support, manuals)
• Is it a library to aid developers rather than a product per se?
• Remember: a homogeneous network does not require multi-protocol support and can be handled far more simply!

Data Portal

• http://data.gbif.org
– Little functional development in recent months
  • Bug fixing activities only
– Continues to grow in content
  • Jan 2010: 196 million
  • Aug 2010: 203 million
– 2500 – 3000 visitors per day (plus web service use)
– US visitors account for approx. 22% (2010 traffic)
  • 2nd is UK visitors at 5%
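For a sense of scale, the figures on this slide can be turned into a growth rate and a rough daily visitor count. This is plain arithmetic on the numbers quoted above; applying the 22% US share of 2010 traffic to the daily visitor range is an assumption made only for illustration.

```python
# Content figures quoted on the slide.
jan_2010 = 196_000_000
aug_2010 = 203_000_000

growth = aug_2010 - jan_2010            # absolute increase, Jan to Aug 2010
growth_pct = 100 * growth / jan_2010    # relative increase in percent

# Daily visitor range and US share (percent) quoted on the slide.
visitors_low, visitors_high = 2500, 3000
us_share_pct = 22

# Assumption: the 2010 traffic share applies evenly to daily visitors.
us_low = visitors_low * us_share_pct / 100
us_high = visitors_high * us_share_pct / 100

print(f"Growth: {growth:,} ({growth_pct:.1f}%)")
print(f"Approx. US visitors per day: {us_low:.0f} to {us_high:.0f}")
```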

Data Portal Evolutions

• Portal will evolve by end 2011
– Improved taxonomic services and content
  • Achieved through the Global Names Architecture
– Improved attribution and provenance
  • Achieved by enhancing the Registry
– Improved occurrence indexing
  • Scalable solution, richer fields, reduced latency, etc.
– Improved map visualisations
– Custom information feeds
  • Abstracts, repatriation, records modified
– Improved dataset metadata
  • Determining fitness for use

Portal evolution: Occurrence content

Currently | Roadmap
Limited to 250,000 records for download | Access to unlimited volume of export formats
23 Darwin Core properties available for search | Ability to support multiple indexes (Common, marine, terrestrial Plantae etc.)
30 fields available on record detail | Full record detail visible
Limited ability to determine fitness for use | Improved access to metadata where available. Improvements in automated determination of fitness for use (spatial resolution)
Poor understanding of the basis of record | Improvements in determining point versus grid based content, ex situ versus in situ records etc.
Limited spatial search | Provide means to access content through user defined polygons
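The roadmap item on user-defined polygons comes down to a point-in-polygon test applied to each record's coordinates. The sketch below uses the standard ray-casting technique; it is not the Portal's implementation, the record identifiers and coordinates are invented, and longitude/latitude are treated as plain planar coordinates (no handling of the antimeridian or of points lying exactly on an edge).

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    `polygon` is a list of (lon, lat) vertices; the last vertex is
    implicitly joined back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point towards +longitude.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing flips the state
    return inside


# Hypothetical search area (a simple rectangle) and occurrence records.
search_area = [(8.0, 54.0), (13.0, 54.0), (13.0, 58.0), (8.0, 58.0)]
records = [("r1", 10.0, 55.0), ("r2", 12.5, 55.7), ("r3", 20.0, 60.0)]

matches = [rid for rid, lon, lat in records
           if point_in_polygon(lon, lat, search_area)]
print(matches)
```

At the scale of a full occurrence index, a test like this would normally sit behind a spatial index or a bounding-box pre-filter rather than run against every record.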

Portal evolution: Taxonomic content and organisation

Currently | Roadmap
Synthesised taxonomy assembled from all content | Multiple taxonomic organisation
Assembly methods of synthesised taxonomy poorly documented | Rigorous documentation for taxonomic organisation
Common name search limited | Many names sources used to enable common name search
Limited comparison between taxonomies | Services to enable taxonomic comparison (overlap and contradiction)
Limited services for external integration | Improved APIs for connecting external systems
Few checklist sources included (4-5?) in current data portal | 100s of checklists accessible NOW in Dev Version – integrated into new Data Portal
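To show what "overlap and contradiction" between taxonomies could mean in the simplest case, here is a sketch that compares two checklists held as name-to-family mappings. Both checklists and their family placements are invented for illustration, and reducing a taxonomy to a flat name/family mapping is a deliberate simplification, not a description of any planned service.

```python
# Two hypothetical checklists: scientific name -> family placement.
checklist_a = {
    "Balaenoptera musculus": "Balaenopteridae",
    "Eschrichtius robustus": "Eschrichtiidae",
    "Megaptera novaeangliae": "Balaenopteridae",
}
checklist_b = {
    "Balaenoptera musculus": "Balaenopteridae",
    "Eschrichtius robustus": "Balaenopteridae",
    "Orcinus orca": "Delphinidae",
}


def compare(a, b):
    """Report shared names, names unique to each list, and conflicts."""
    shared = a.keys() & b.keys()
    return {
        "overlap": sorted(shared),
        "only_a": sorted(a.keys() - b.keys()),
        "only_b": sorted(b.keys() - a.keys()),
        # Same name in both lists, but placed in different families.
        "contradictions": sorted(n for n in shared if a[n] != b[n]),
    }


report = compare(checklist_a, checklist_b)
for key, names in report.items():
    print(key, names)
```

Real comparisons would also need to reconcile synonyms and spelling variants before matching names, which this sketch does not attempt.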

Portal evolution: Metadata, attribution and feedback

Currently | Roadmap
Metadata limited to contact information | Ability to use rich dataset metadata where available
Only datasets with digitised records discoverable | Datasets described through metadata discoverable
Attribution of data limited to provider and dataset | Better attribution of all parties involved in data publication
Citation in data exported limited to datasets only | Prototype citation services
Feedback delivered on a per record basis by email | Annotation brokerage services

Checklist Bank Slide

Status: Dev

In-use by ALA

Data Portal Evolution

The Portal is more than just a discovery system. The Portal will be a hub that allows:

a) Data custodians to
– Register the existence of biodiversity data sources
– Publish their content and, in addition, rich information about the content (e.g. metadata documenting assembly methods)
– Subscribe to annotations made against their content
– Subscribe to information about the usage of their content
– Access services of interest to them (e.g. quality control)

Data Portal Evolution

b) Users to
– Search content in real time, through various customised search options (e.g. terrestrial plants, marine mammals, natural history collections, protected areas)

– Browse content taxonomically, temporally, geographically etc.

– Define and run reports (not real time) to extract a data subset or derive metrics

– Subscribe to customised information feeds (e.g. Modified Pinaceae specimens in Australia)

– Publish annotations related to record quality, or assertions about that record (e.g. confirmed suitable for 100km modeling)

– Build better information systems that utilise services offered by the Portal

Nodes Portal Toolkit

• Customisable toolkit to deploy a National/Regional/Thematic discovery portal
• Technical advisory group for NPT recommends:
– To fully engage the NODES community in the design, development, testing and deployment of the NPT.
– To ensure tight integration of the NPT with the GBIF Informatics Infrastructure, while benefiting from a wide array of additional biodiversity-related web services.
– To adopt an open source content management platform such as Drupal, upon which to build and develop specific NPT modules (specifically those for integration, visualisation, and access of biodiversity-related data and information)
• A call for an NPT coordinator is currently in draft
• ViBRANT funds will support

Communications

http://community.gbif.org/

Participant forums

13 August Launch


Communications

http://www.gbif.org/

Consolidate tech docs

RSS Feeds

More updates


Secretariat Tech Capacity

• Resources
– 3 current openings Java Developers
– Vocabularies/Ontology Developer (30 months) – ViBRANT
– Taxonomic Publishing Developer (18 months) – i4Life / Catalogue of Life

Summary: Informatics Targets

Refined end to end workflows for:
• Point based occurrences
• Grid based occurrences
• Checklists
• Dataset Metadata

Data flow: Data custodians → Registry → Harvesters → Processing → Indexes → APIs (user / machine) → Clients