e-SI Theme: Exploiting Diverse Sources of Scientific Data
Integrating Diverse Sources of Scientific Data: Is it safe to match on names?
Prof. Jessie Kennedy

Page 1: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

e-SI Theme: Exploiting Diverse Sources of Scientific Data

Integrating Diverse Sources of Scientific Data:

Is it safe to match on names?

Prof. Jessie Kennedy

Page 2: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Exploiting Diverse Sources of Scientific Data

Wealth and diversity of scientific data collected and stored is growing rapidly
  Increase in automation
    Genetic sequencing, remote sensing, astronomy satellites
  Decrease in technological costs
    Computers more powerful, disk space greater for the same £
Huge potential for scientific discovery by exploiting this data, especially multi-disciplinary research
Number, complexity and diversity of resources makes this a difficult task
Case Study
  Data Integration
  Matching data sets on biological names

Page 3: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

SEEK

Science Environment for Ecological Knowledge
USA National Science Foundation funding
Multidisciplinary project
  Biology: Ecology, Taxonomy
  Environmental science: Geography, Remote sensing, Meteorology, Climatology
  Computer Science: Database, GRID/Web, Ontologies, Workflows, Algorithms, Human Computer Interaction

Page 4: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

The SEEK Prototype: Ecological Niche Modeling

[Figure: from Geographic Space to Ecological Space and back. Occurrence points on the native species distribution feed ecological niche modeling, giving a model of the niche in ecological dimensions (axes: temperature, precipitation). The model is projected back onto geography to give a native range prediction and an invaded range prediction.]

Inputs
  Biodiversity information, e.g. data from museum specimens, ecological surveys
  Geospatial and remotely sensed data
Results taken to integrate with other data realms (e.g., human populations, public health, etc.)

Page 5: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Species prediction map

Predicted distribution: Amur snakehead (Channa argus)

Image from http://www.lifemapper.org

Page 6: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

SEEK - Informatics Challenges

Data is Distributed
Data is Heterogeneous
  Syntax, e.g. Text, Excel, Relational Database…
  Schema, e.g. names of the tables, columns in tables
  Semantics: principal focus for SEEK
From many disciplines
  Biodiversity surveys, hydrology, atmospheric chemistry, spatial data, behavioural experiments,…
  Data on economics, demographics, legal issues,…

Page 7: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

SEEK Overview

Analysis and Modelling System (Kepler): Modelling scientific workflows
EcoGrid: Making diverse environmental data systems interoperate
Semantic Mediation System: “Smart” data discovery and integration
Knowledge Representation WG: Ontologies, Metadata
Taxon WG: Taxonomic name/concept resolution server
BEAM WG: Biodiversity and Ecological Analysis and Modelling

Page 8: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

SEEK Overview

EcoGrid

Page 9: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

EcoGrid Resources

[Figure: map of EcoGrid nodes. Labelled sites: AND, LUQ, NTL, VCR, HBR. Legend: Metacat node, Legacy system, SRB node, DiGIR node, VegBank node, Xanthoria node.]

Natural History Collections (>> 100)
LTER Network (24)
Organization of Biological Field Stations (180)
Partnership for Interdisciplinary Studies of Coastal Oceans (4)
UC Natural Reserve System (36)
Multi-agency Rocky Intertidal Network (60)

Page 10: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

EcoGrid Data Access

EcoGrid registry to discover data sources
EML (Ecological Metadata Language)
  Experimental data, survey data, spatial raster and vector data, etc.
  XML based
  Discovery information
    Creator, Title, Abstract, Keyword, etc.
  Coverage
    Geographic, temporal, and taxonomic extent
  Logical and physical data structure
    Data semantics via unit definitions and typing
  Protocols and methods
DarwinCore
  Museum collections

Page 11: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

EcoGrid Services

Service to Analysis and Modelling Layer
  Interaction with Kepler – Workflows
  Interaction with Grid Computing Facilities
    Distributed computation
Service to Semantic Mediation Layer
  Access to Ontologies; Taxon Services
Access to Legacy Apps
  LifeMapper
  Spatial Data Workbench

Page 12: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

SEEK Overview

AMS

Page 13: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Scientific Workflows

Model the way scientists currently work with data
  Coordinate export and import of data among software systems
Workflows emphasize data flow
Output generation includes creating appropriate metadata
  The analysis workflow itself becomes metadata
  The workflow describes the data lineage as it has been transformed
  Derived data sets can be stored in EcoGrid with provenance

[Figure: Query EcoGrid to find data; archive output to EcoGrid with workflow metadata]
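The idea that the workflow itself becomes metadata can be sketched in a few lines. This is only an illustration, not Kepler's API: the `Dataset` class, `run_step` helper and step names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A data set that carries its own lineage (hypothetical structure)."""
    name: str
    values: list
    lineage: list = field(default_factory=list)  # ordered names of applied steps


def run_step(step_name, func, dataset):
    """Apply func to the data and record the step in the derived data set."""
    return Dataset(
        name=f"{dataset.name}|{step_name}",
        values=func(dataset.values),
        lineage=dataset.lineage + [step_name],
    )


raw = Dataset("occurrences", [3.0, None, 5.0])
clean = run_step("drop_missing", lambda v: [x for x in v if x is not None], raw)
scaled = run_step("scale_x10", lambda v: [x * 10 for x in v], clean)

# The derived data set can be archived together with the steps that produced it.
print(scaled.values, scaled.lineage)
```

Each derived data set is a new object, so the raw input keeps an empty lineage and can still be archived unchanged.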

Page 14: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Scientific workflows

EML provides semi-automated data binding

Page 15: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Kepler: Ecological Niche Model

(200 to 500 runs per species × 2000 mammal species × 3 minutes/run) = 833 to 2083 days

Page 16: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

KeplerGrid for Niche Modeling

(200 to 500 runs per species × 2000 mammal species × 3 minutes/run) / 100 nodes = 8 to 20 days

Grid-enable Kepler
  Utilize distributed computing resources
  Execute single steps or sub-workflows on distributed machines
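The run-time estimate can be checked with a short calculation. The sketch assumes the runs divide evenly across nodes with no scheduling or data-transfer overhead, which the slide does not state; the function name is illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def total_days(runs_per_species, species=2000, minutes_per_run=3, nodes=1):
    """Days of compute, assuming perfect division of runs across nodes."""
    total_minutes = runs_per_species * species * minutes_per_run
    return total_minutes / MINUTES_PER_DAY / nodes


for runs in (200, 500):
    # Whole days on one machine, then on 100 nodes.
    print(runs, int(total_days(runs)), int(total_days(runs, nodes=100)))
```

Truncating to whole days reproduces the slide's figures: 833 to 2083 days on one machine, 8 to 20 days on 100 nodes.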

Page 17: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

SEEK Overview

SMS

Page 18: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Metadata

Key information needed to read and machine process a data file is in the metadata
  Physical descriptors (CSV, Excel, RDBMS, etc.)
  Logical Entity (table, image...), Attribute (column) descriptions
    Name
    Type (integer, float, string…)
    Codes (missing values, nulls...)
    Integrity constraints
  Semantic descriptions (ontology-based type systems)
Metadata driven data ingestion
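A minimal sketch of metadata-driven ingestion, assuming a hand-written Python dictionary in place of a real EML document. The attribute names, missing-value codes and sample data are invented for illustration.

```python
import csv
import io

# Hypothetical metadata record: physical descriptor plus attribute descriptions.
METADATA = {
    "delimiter": ",",
    "attributes": [
        {"name": "site", "type": str, "missing": [""]},
        {"name": "count", "type": int, "missing": ["-999", "NA"]},
        {"name": "area_m2", "type": float, "missing": ["-999", "NA"]},
    ],
}


def ingest(text, metadata):
    """Parse delimited text, typing each column as the metadata describes."""
    reader = csv.reader(io.StringIO(text), delimiter=metadata["delimiter"])
    header = next(reader)
    expected = [attr["name"] for attr in metadata["attributes"]]
    if header != expected:
        raise ValueError(f"header {header} does not match metadata {expected}")
    rows = []
    for record in reader:
        row = {}
        for attr, raw in zip(metadata["attributes"], record):
            # Missing-value codes become None; everything else is cast to the declared type.
            row[attr["name"]] = None if raw in attr["missing"] else attr["type"](raw)
        rows.append(row)
    return rows


SAMPLE = "site,count,area_m2\nA,12,4.0\nB,-999,2.5\n"
rows = ingest(SAMPLE, METADATA)
print(rows)
```

The reader never hard-codes column names or types, so the same function can ingest any file whose metadata follows this (assumed) shape.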

Page 19: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Ecological ontologies

What was measured (biomass or photosynthetic solar radiation)

Type of quantity measured (mass, length)

Context of measurement (Psychotria limonensis, wavelength band)

How it was measured (dry weight, total solar radiation)

Page 20: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Semantic Mediation

Label data with semantic types
Label inputs and outputs of analytical components with semantic types
Use reasoning engine to generate transformation steps
Use reasoning engine to discover relevant components

[Figure: Data, Ontology, Workflow Components]
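Discovering relevant components from semantic types can be sketched with a toy is-a hierarchy. Everything here is hypothetical: the type names, the two components and the matching rule stand in for a real ontology and reasoning engine.

```python
# Hypothetical toy ontology: child type -> parent type.
IS_A = {
    "DryWeightBiomass": "Biomass",
    "Biomass": "Mass",
    "Mass": "Quantity",
    "PlotArea": "Area",
    "Area": "Quantity",
}

# Hypothetical analytical components and the semantic types of their inputs.
COMPONENTS = {
    "mass_summary": ["Mass"],
    "areal_density": ["Count", "Area"],
}


def is_subtype(candidate, target):
    """Walk up the is-a chain from candidate looking for target."""
    while candidate is not None:
        if candidate == target:
            return True
        candidate = IS_A.get(candidate)
    return False


def compatible_components(data_types):
    """Components whose every input is satisfied by some labelled data column."""
    matches = []
    for name, inputs in COMPONENTS.items():
        if all(any(is_subtype(d, needed) for d in data_types) for needed in inputs):
            matches.append(name)
    return matches


print(compatible_components(["DryWeightBiomass", "PlotArea"]))
print(compatible_components(["Count", "PlotArea"]))
```

A column labelled `DryWeightBiomass` satisfies an input typed `Mass` because of the is-a chain, which is the kind of inference plain column names cannot support.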

Page 21: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Data integration

Homogeneous data integration
  Integration via EML metadata is relatively straightforward
Heterogeneous data integration
  Requires advanced metadata and processing
    Attributes must be semantically typed
    Collection protocols must be known
    Units and measurement scale must be known
    Measurement relationships must be known
      e.g., that ArealDensity = Count/Area
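The ArealDensity = Count/Area relationship only integrates cleanly when units are known. A small sketch, assuming two surveys that report area in different units; the conversion table and function name are illustrative.

```python
# Conversion factors to square metres for the units this sketch knows about.
AREA_TO_M2 = {"m2": 1.0, "ha": 10_000.0, "km2": 1_000_000.0}


def areal_density(count, area, area_unit):
    """Return individuals per square metre (ArealDensity = Count / Area)."""
    if area_unit not in AREA_TO_M2:
        raise ValueError(f"unknown area unit: {area_unit!r}")
    area_m2 = area * AREA_TO_M2[area_unit]
    if area_m2 <= 0:
        raise ValueError("area must be positive")
    return count / area_m2


# Two surveys with different raw numbers but the same density once units are applied.
survey_a = areal_density(50, 100.0, "m2")
survey_b = areal_density(5000, 1.0, "ha")
print(survey_a, survey_b)
```

Without the unit metadata the raw counts (50 versus 5000) look a hundred times apart; with it both surveys give 0.5 individuals per square metre.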

Page 22: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Simple Example

Page 23: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Life Sciences Data

Much of the data gathered in ecological studies and used in ecological data analysis is bio-referenced data
  Typically organisms are referenced by a Latin name
    e.g. Picea rubens
Many analyses require integrating data originating in many locations and at various points in time
For most bio-referenced data, integration involves matching on organism name
  SEEK Taxon is investigating the associated issues

Page 24: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Biological (Scientific) Names

Used for communicating information about known organisms and groups of organisms – taxa
  Framework for all biologists to communicate…
Arise from taxonomists applying them to species and higher taxa following classification
Formalized according to strict codes of nomenclature
  differ depending on kingdom
Use a Latin naming scheme
  polynomial for species + below; monomial for genus + above
Quoted as: LatinName NameAuthors Year
  Example: Carya floridana Sarg. 1913
Can cause problems in data analysis…..

Page 25: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Classification, Concepts & Names

[Figure: a pile of specimens is classified into a taxonomic hierarchy (Genus, Species) of taxon concepts Taxon_concept_a, Taxon_concept_b, Taxon_concept_c and Taxon_concept_d, with type specimens marked.]

Page 26: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Classification, Concepts & Names

[Figure: the pile of specimens is classified again; Taxon_concept_d is highlighted.]

Page 27: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Taxonomic history of Aus L. 1758

[Figure legend: publication, genus concept, genus name, species concept, species name, specimen, type specimen]

Publications of Taxonomic Revisions

In Linnaeus 1758
  Genus concept (i) Aus L.1758
    Aus aus L.1758
In Archer 1965
  Genus concept (ii) Aus L.1758
    Aus aus L.1758
    Aus bea Archer 1965
  Archer splits Aus aus L. 1758 into two species, retains the name for one and creates a new one.
In Fry 1989
  Genus concept (iii) Aus L.1758
    Aus aus L.1758
    Aus bea Archer 1965
    Aus cea BFry 1989
  Fry splits Aus bea Archer 1965 into two species, retains the name for one and creates a new one.
In Tucker 1991
  Genus concept (iv) Aus L.1758
    Aus aus L.1758
    Aus cea BFry 1989
  Tucker finds new specimens and combines Aus aus L. 1758 and Aus bea Archer 1965 into one species, retains the name.
  Tucker publishes his revision without noting Pyle’s corrigendum of the name of Aus cea.
In Pargiter 2003
  Genus concept (v) Aus L.1758
    Aus aus L. 1758
    Aus ceus BFry 1989
  Genus concept Xus Pargiter 2003
    Xus beus (Archer) Pargiter 2003
  Pargiter decides to re-split Aus aus but believes bea (beus) is in a new genus Xus.
  Pargiter publishes his revision using Pyle’s corrigendum of the epithet bea to beus and Aus cea to Aus ceus.

Publications of Purely Nomenclatural Observation

In Pyle 1990
  bea and cea noted as invalid names and replaced with beus and ceus.
  A diligent nomenclaturist, Pyle (1990), notes that the species epithets of Aus bea and Aus cea are of the wrong gender and publishes the corrected names Aus beus corrig. Archer 1965 and Aus ceus corrig. BFry 1989.
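The history on this slide can be turned into a small lookup showing why a name alone is ambiguous. The species lists per revision are as read from the slide's figure, so treat the membership as an assumption; the data structure is purely illustrative.

```python
from collections import defaultdict

# Species names used in each taxonomic revision (as read from the slide's figure).
REVISIONS = {
    "Linnaeus 1758": ["Aus aus L.1758"],
    "Archer 1965": ["Aus aus L.1758", "Aus bea Archer 1965"],
    "Fry 1989": ["Aus aus L.1758", "Aus bea Archer 1965", "Aus cea BFry 1989"],
    "Tucker 1991": ["Aus aus L.1758", "Aus cea BFry 1989"],
    "Pargiter 2003": [
        "Aus aus L.1758",
        "Aus ceus BFry 1989",
        "Xus beus (Archer) Pargiter 2003",
    ],
}

# One concept per (name, publication) pair: the same name maps to many concepts.
concepts_by_name = defaultdict(list)
for publication, names in REVISIONS.items():
    for name in names:
        concepts_by_name[name].append(f"{name} according to {publication}")

for name, concepts in concepts_by_name.items():
    print(f"{name}: {len(concepts)} concept(s)")
```

A data set that records only "Aus aus L.1758" could refer to any of five circumscriptions, two of which (before Archer's split and after Tucker's merge) cover specimens that the others exclude.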

Page 28: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Problems with Taxonomic Names

Are not unique
  “Re-use” of names with changed definition
  Name is ambiguous without definition/context
Subject to alterations and 'corrections' in time
Often recorded inappropriately in datasets
  No author and/or year (e.g. Carya floridana)
  Abbreviated (e.g. C. floridana)
  Internal code (e.g. PicRub for Picea rubens)
  Vernacular used (e.g. Scrub Hickory)
  Misspelled
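The recording problems listed above can be flagged, roughly, from the shape of the string. This is a heuristic sketch only: the lookup tables are hypothetical, the patterns cover just the slide's examples, and misspellings cannot be detected from form alone.

```python
import re

# Hypothetical lookup tables a project might maintain for its own data sets.
KNOWN_CODES = {"PicRub": "Picea rubens"}
KNOWN_VERNACULARS = {"scrub hickory": "Carya floridana"}


def classify_name(raw):
    """Guess how an organism name was recorded, using simple patterns."""
    text = raw.strip()
    if text in KNOWN_CODES:
        return "internal code"
    if text.lower() in KNOWN_VERNACULARS:
        return "vernacular"
    if re.fullmatch(r"[A-Z]\. [a-z]+", text):
        return "abbreviated"
    if re.fullmatch(r"[A-Z][a-z]+ [a-z]+", text):
        return "no author/year"
    if re.fullmatch(r"[A-Z][a-z]+ [a-z]+ .+\d{4}\)?", text):
        return "name with author and year"
    return "unrecognised"


for example in ["Carya floridana", "C. floridana", "PicRub",
                "Scrub Hickory", "Carya floridana Sarg. 1913"]:
    print(f"{example!r}: {classify_name(example)}")
```

Even the best case, a name with author and year, identifies a name rather than a concept, so a flag like this helps triage legacy data but does not make matching safe.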

Page 29: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Taxon Concepts ……

The published expert opinion defining and describing a group of organisms which are given a (scientific) name
  Scientific names qualified with a reference to the definition of a concept

Should be used for communicating about groups of organisms

Comparing or integrating data based on taxon concepts will be more accurate

Page 30: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Taxon Concepts…

Created by someone - an Author
Described in a Publication
Given a Name
  Related to the type specimen
Definition
Referenced by
  Full Scientific name + “according to” (Author + Publication + Date)
  Definition

Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)

Page 31: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Taxon Concepts ……

Defined by
  set of Specimens examined during classification
  set of common Characters
    context dependent; differentiate taxa rather than fully describe them;
    use natural language with all its ambiguities
  relationships to other Taxon Concepts
    Taxon circumscription
      the lower level taxa
    Congruence, overlap, includes etc. to taxa in other classifications

Page 32: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Taxon Concepts ……

Original concept
  1st use of name as described by the taxonomist
  same author + date in scientific name and “according to”
  Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)
  TC_a
Revised concept
  Re-classification of a group
  Carya floridana Sarg. (1913) “according to” Stone, Flora of North America 3:424 (1997)
  TC_b
Relationship between the taxon concepts
  TC_b includes TC_a
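The two Carya floridana concepts can be modelled as name plus "according to", with the relationship stored separately. A minimal sketch: the class, the relationship tuple and the `can_roll_up` rule are assumptions for illustration, not the TCS data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaxonConcept:
    """A scientific name qualified by the publication that defines it."""
    name: str
    according_to: Optional[str] = None


TC_a = TaxonConcept(
    "Carya floridana Sarg. (1913)",
    "Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)",
)
TC_b = TaxonConcept(
    "Carya floridana Sarg. (1913)",
    "Stone, Flora of North America 3:424 (1997)",
)

# Relationship stated on the slide: TC_b includes TC_a.
RELATIONSHIPS = {(TC_b, "includes", TC_a)}


def can_roll_up(source, target, relationships=RELATIONSHIPS):
    """True if records identified as source may be counted under target."""
    return source == target or (target, "includes", source) in relationships


print(TC_a.name == TC_b.name)   # same name
print(TC_a == TC_b)             # different concepts
print(can_roll_up(TC_a, TC_b))  # TC_a records fit inside TC_b
print(can_roll_up(TC_b, TC_a))  # but not the other way round
```

Matching on name alone would treat the two as identical in both directions; the explicit relationship makes the integration one-way.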

Page 33: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Legacy Data …

In legacy data names often appear in place of concepts
Names are imprecise
  inappropriate for referring to information regarding taxa
    e.g. observational/collection data
BUT… sometimes that’s all we have
How do we interpret names?
  potentially multiple definitions
    the sum of all definitions that exist for the name
    one of the existing definitions
    the “attributes” in common to all the definitions
    represented by the type specimen

Page 34: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Names as Taxon Concepts

Nominal concepts
  Sub-set of TaxonConcepts
  Name but no AccordingTo
    non-unique (concept) identifier attributes
    can be given a unique concept identifier
  No definition
  Explicitly saying it’s something with this name
    but not really sure what is/was meant by the name
Encourage people to understand and address the issue of names
  Allowing mark-up of data with names allows them to believe names are really good enough
  Will improve long term usefulness of scientific data
  Ease integration

Page 35: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

SEEK Taxon’s Message…..

Scientific names are not unique identifiers for biological entities
Integrating data from different sources based on names alone could cause serious errors in analysis of the integrated data
Biologists must reference organisms precisely if datasets are to be of use long term or to other users
  Reference by taxon concept rather than name
  Integrate data for analysis on taxon concepts

Page 36: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Taxonomic Databases

Main taxonomic list servers are still name based
  single perspective on taxonomy
    don’t represent multiple classifications
  unclear what the definition is (don’t even try!)
  provide non-standardised interface (web page, XML download)
SEEK Taxon aims to prototype a concept/name resolution service for ecologists working with SEEK
  Find concepts given a name
  Compare concepts
  Relate concepts
  Mark up ecological data sets with concepts
First
  Need data on names and concepts
  Need an exchange standard….

Page 37: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Taxon Concept Schema

TCS: standard for exchange of taxonomic names/concept data
  Taxonomic Databases Working Group (TDWG)
  Global Biodiversity Information Facility (GBIF)
  XML based exchange schema
  Makes heavy use of Globally Unique Identifiers (GUIDs)
Not designed as the “correct way” to model a Taxon Concept
  No “rules” as to what a taxon must have
  Designed to accommodate different models
Includes Taxon Names
  more constrained - the codes of nomenclature
TCS/EML
  TCS modifications to EML taxon coverage

Page 38: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Taxon Names and Taxon Concepts

Important to be able to pass names alone
  For nomenclatural and some taxonomic purposes
  But not for identifications/observations
Taxon Concepts refer to Names
  By GUID
  Names must not change
    Can’t record original taxon concept

Page 39: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Taxon Concept/Name Resolution Server

Taxon Object Server (TOS)
  Schema based on the TCS model
  Implements the GUIDs using LSID technology
  Tool to import/export data from TCS documents
TOS
  Allows registration, retrieval of taxonomic datasets
  Match concepts given names, concepts, etc.
Allow users to
  See different taxonomic opinions
  Uses GUIDs to reference concepts (LSIDs)
  Find concepts…
  Author new concepts
  Make new relationships between existing concepts
Integrated with Kepler workflow system
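An LSID (Life Science Identifier) is a URN of the form `urn:lsid:authority:namespace:object`, with an optional trailing revision. The parser below is a minimal sketch of that syntax; the example identifier is invented and is not a real TOS concept.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID string into its parts, raising ValueError if malformed."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"LSID has an empty component: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier for illustration only.
concept_id = parse_lsid("urn:lsid:example.org:concepts:12345")
print(concept_id)
```

Because the identifier is opaque, two concepts that share a scientific name still get distinct LSIDs, which is what lets data sets reference the concept and not the name.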

Page 40: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

SEEK User Interface Tools

Concept mapper
  A desktop tool to assist taxonomists to relate concepts from one source to another
  For use in creating data sets for TOS or TCS
  For creating new relationships between concepts in TOS
Taxonomy comparison visualisation
  Visualisation tool to explore different classifications
  Compare concepts

Page 41: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Concept Mapper Main GUI

[Screenshot with labelled panels: Query concepts, Concepts, Relationships]

Page 42: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Concept Comparison Visualisation

Page 43: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

SEEK Summary

Environment to support large scale ecological data analysis
  Scientific Workflows: Kepler
  Semantic Mediation
    Ecological ontology creation/use for data integration
  Grid/Web based data discovery
  Resolution of Taxonomic Names/Concepts
    Standards development
    Concept matching server
    Visualisation tools

http://seek.ecoinformatics.org

Page 44: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Is it safe to match on names?

I hope I have convinced you that the answer is NO as a general rule…
BUT
  Depends on the purpose of the data
    therefore the accuracy required
  The degree of automation used in matching
    greater automation – greater potential problem
  Expertise of person involved in the matching
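One way to limit the risk of automation is to let software propose matches while routing anything short of an exact string match to a person. The sketch uses Python's standard `difflib`; the 0.85 threshold and the reference list are arbitrary assumptions.

```python
from difflib import SequenceMatcher


def best_match(name, reference_names, review_threshold=0.85):
    """Return (reference name or None, decision) for a recorded name."""
    if not reference_names:
        return None, "no match"
    query = name.strip().lower()
    scored = [
        (SequenceMatcher(None, query, ref.lower()).ratio(), ref)
        for ref in reference_names
    ]
    score, ref = max(scored)
    if score == 1.0:
        return ref, "exact string match"
    if score >= review_threshold:
        return ref, "needs expert review"
    return None, "no match"


REFERENCE = ["Picea rubens", "Carya floridana"]
for recorded in ["Picea rubens", "Picea rubans", "Quercus alba"]:
    print(recorded, "->", best_match(recorded, REFERENCE))
```

Even the "exact string match" outcome only says the names agree; whether the two data sets mean the same taxon concept still depends on the purpose of the data and the expertise of the person checking.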

Page 45: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Many Outstanding Issues….

Educating biologists of the inherent problem in names
  Not limited to the Linnaean system of nomenclature
Lack of good taxon concept data
Widening usage and application of taxon concepts
Adopting GUIDs
  Provision of reliable ‘look up’ facilities
  Cross referencing of GUIDs
  Reuse is vital
    Must not create duplicate GUIDs if possible
Conversion of legacy data
Develop good matching algorithms
Potential move from XML schema -> semantic web technologies……..

Page 46: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Acknowledgements

This material is based upon work supported by:
The National Science Foundation
  SEEK Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Arizona State University, UC Davis
  Matt Jones – for many of the slides….
Global Biodiversity Information Facility
eScience Institute
  Research Theme Programme
  Malcolm Atkinson

Page 47: Integrating Diverse Sources of Scientific Data:  Is it safe to match on names?

Exploiting Diverse Sources of Scientific Data

Upcoming Workshop discussing possible technology solutions

RDF, Ontologies and Meta-Data Workshop
7th – 9th June, 2006
e-Science Institute, 15 South College Street, Edinburgh

http://www.nesc.ac.uk/esi/events/683/