Data Integration Issues in Biodiversity Research
Jessie Kennedy
Shawn Bowers, Matthew Jones, Josh Madin, Robert Peet, Deana Pennington, Mark Schildhauer, Aimee Stewart
Visual Tools for Managing Taxonomic Concepts
SEEK: Science Environment for Ecological Knowledge
Research and develop information technology to radically improve the type and scale of ecological science that can be addressed
Science and Scientific Data are Complex
[Figure: scientific disciplines: Biochemistry, Climatology, Taxonomy, Meteorology, Nomenclature, Paleontology, Genomics, Proteomics, Hydrology, Morphology, Geology, Oceanography, Geography, Ecology]
[Figure: the same disciplines annotated with the entities they deal with: Organism, Name, Taxon concept, Gene sequence, Pathway, Protein, Location, Temperature, Depth]
Scientific Community: complex
• Individual Scientist
• Small Scientific Community
• Large Scientific Community
• Scientific Laboratory
[Figure: the discipline and entity diagram (Organism, Name, Taxon concept, Gene sequence, Pathway, Protein, Location, Temperature, Depth) repeated four times]
Science & Scientific Data are Continually Changing
• Conclusions become foundations for new hypotheses
• New experiments invalidate existing knowledge
• Knowledge is open to interpretation: different opinions
• Need to build this into our technological solutions
[Figure: cycle linking observation, hypothesis, experiment and conclusion]
Exploiting Scientific Data
To support scientists in:
• Discovery
• Access
• Sharing
• Integration/Linking
• Analysis
Scientists can then improve their potential for new scientific discovery
Data Integration/Linking: approaches
• Metadata: to describe the data sets and know how to interpret the data sets
• Ontologies: to define the terminology used, know how data might be related, and aid automatic transformation of the data
• Standardisation of formats: for exchange of data and to ease integration
• LSIDs: to uniquely identify things and know when two things are the same
• Workflows: to enable specification, refinement and repetition of integration/analysis
• Provenance of data: to record where the data has come from and what has happened to it en route
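As a small illustration of the LSID point, the sketch below parses identifiers of the general form urn:lsid:authority:namespace:object[:revision] and compares them. The authority `example.org`, the namespace `concepts` and the function name `parse_lsid` are hypothetical and chosen only for this example; real resolution of an LSID to data is not shown.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its parts; raise ValueError if malformed."""
    parts = text.split(":")
    # The "urn" scheme and the "lsid" namespace identifier are case-insensitive.
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if any(p == "" for p in parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifiers for two records that refer to the same thing.
first = parse_lsid("urn:lsid:example.org:concepts:C1.4")
second = parse_lsid("URN:LSID:example.org:concepts:C1.4")
other = parse_lsid("urn:lsid:example.org:concepts:C1.5")
same_thing = first == second
```

Two data sets that carry the same identifier can then be linked without comparing names or descriptions.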
Projects in most sciences:
ESG
Ecological Science - Analysis
Ecological niche modeling of species distributions
Where do species occur now?
Image from http://www.lifemapper.org
Where will they occur in the future?
Ecological Niche Modeling
• Inputs: Known Species Locations; Environmental Characteristics from gridded GIS layers (Temperature layer, many other layers)
• Develop Model in Multidimensional Ecological Space (D1 = Temperature, D2, …, Dn)
• Native Distribution Prediction: from Environmental Characteristics of Surrounding Geographic Area
• Invasion Area Prediction: from Environmental Characteristics of Different Geographic Area
• Environmental Change Prediction: from Future Scenarios of Environmental Characteristics
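The modeling steps above can be sketched with a rectangular envelope model, one of the simplest niche models. The slides do not say which algorithm is used, so the technique, the function names (`fit_envelope`, `predict`), the two layers and all grid values below are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]   # one gridded GIS layer
Cell = Tuple[int, int]     # (row, column) of a grid cell


def fit_envelope(layers: Dict[str, Grid], occurrences: List[Cell]) -> Dict[str, Tuple[float, float]]:
    """Develop the model: min/max of every layer over the known species locations."""
    envelope = {}
    for name, grid in layers.items():
        values = [grid[r][c] for r, c in occurrences]
        envelope[name] = (min(values), max(values))
    return envelope


def predict(layers: Dict[str, Grid], envelope: Dict[str, Tuple[float, float]]) -> List[List[bool]]:
    """Mark a cell suitable when every layer value lies inside the envelope."""
    first = next(iter(layers.values()))
    return [
        [
            all(lo <= layers[name][r][c] <= hi for name, (lo, hi) in envelope.items())
            for c in range(len(first[0]))
        ]
        for r in range(len(first))
    ]


# Made-up 2 x 3 layers and two known locations.
current = {
    "temperature": [[10, 12, 20], [11, 25, 13]],
    "precipitation": [[800, 900, 850], [300, 870, 880]],
}
known_locations = [(0, 0), (0, 1)]

model = fit_envelope(current, known_locations)
native = predict(current, model)            # native distribution prediction

# Environmental change prediction: same model, future scenario 2 degrees warmer.
future = {
    "temperature": [[t + 2 for t in row] for row in current["temperature"]],
    "precipitation": current["precipitation"],
}
changed = predict(future, model)
```

The same fitted model can be applied to the layers of a different geographic area to sketch an invasion area prediction.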
Sources of Scientific Data
Data are massively dispersed
• Ecological field stations and research centers (100’s)
• Natural history museums and biocollection facilities (100’s)
• Agency data collections (10’s to 100’s)
• Individual scientists (1000’s)
Data are heterogeneous
• Syntax (format)
• Schema (model)
• Semantics (meaning)
Challenge: Data Integration
SEEK Components
Semantic Annotation – SEEK ontologies
• Integration/merge: concept mapping, units conversion, spatial & temporal scaling
• Data discovery: finding relevant data sets, understanding data set content
Smart (Data) Integration: Merge
Discover data of interest
… connect to merge actor
… “compute merge”
Smart Merge …
Semantic type annotations and ontology definitions used to find mappings between sources
Executing the merge actor results in an integrated data product (via “outer union”)

Source 1, columns a1 a2 a3 a4, rows as shown on the slide:
a   5   10
b   6   11

Source 2, columns a5 a6 a7 a8, rows as shown on the slide:
0.1   a
0.2   c
0.3   d

Mappings found:
• a1 and a8: Site
• a3 and a6: Biomass
• a4: no counterpart

Merge Result
a1   a3    a4
a    5.0   10
b    6.0   11
a    0.1
c    0.2
d    0.3
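The outer union above can be sketched as follows. This is a minimal illustration, not the SEEK merge actor: the function name `outer_union` and the dictionary layout are assumptions, and columns whose values are not shown on the slide (a2, a5, a7) are left out of the example rows.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def outer_union(tables: List[List[Row]], semantic_type: Dict[str, str]) -> List[Row]:
    """Stack the rows of all tables, aligning columns that share a semantic type.

    Columns without an annotation keep their own name. Cells with no value
    in a given source are filled with None.
    """
    # Collect output columns in first-seen order.
    out_cols: List[str] = []
    for table in tables:
        for row in table:
            for col in row:
                target = semantic_type.get(col, col)
                if target not in out_cols:
                    out_cols.append(target)

    merged: List[Row] = []
    for table in tables:
        for row in table:
            out: Row = {col: None for col in out_cols}
            for col, value in row.items():
                out[semantic_type.get(col, col)] = value
            merged.append(out)
    return merged


# Semantic annotations taken from the mapping on the slide.
semantic_type = {"a1": "Site", "a8": "Site", "a3": "Biomass", "a6": "Biomass"}

source_1 = [{"a1": "a", "a3": 5.0, "a4": 10}, {"a1": "b", "a3": 6.0, "a4": 11}]
source_2 = [{"a6": 0.1, "a8": "a"}, {"a6": 0.2, "a8": "c"}, {"a6": 0.3, "a8": "d"}]

merged = outer_union([source_1, source_2], semantic_type)
```

The result has one row per source row, which matches the five rows of the merge result; rows from the second source simply have no value for a4.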
Challenges of Taxonomic Data
Scientific names change in meaning over time and across geographical regions, which affects the conclusions drawn from analysis of data integrated on names.
What is Abies lasiocarpa?
SubAlpine Fir
• Flora North America: Abies lasiocarpa, Abies bifolia
• USDA Plants & ITIS: Abies lasiocarpa var. arizonica, Abies lasiocarpa var. lasiocarpa
Taxonomic history of imaginary genus Aus L. 1758
5 Revisions of Aus, 1 name spelling change
• Linnaeus 1758: Aus L.1758 containing Aus aus L.1758
• Archer 1965: Aus L.1758 containing Aus aus L.1758 and Aus bea Archer 1965
• Fry 1989: Aus L.1758 containing Aus aus L.1758, Aus bea Archer 1965 and Aus cea Fry 1989
• Pyle 1990: Aus bea and Aus cea noted as invalid names and replaced with Aus beus and Aus ceus.
• Tucker 1991: Aus L.1758 containing Aus aus L.1758 and Aus cea Fry 1989
• Pargiter 2003: Aus L.1758 containing Aus aus L.1758 and Aus ceus Fry 1989; Xus Pargiter 2003 containing Xus beus (Archer) Pargiter 2003
Changes in meaning of names
8 Names (2 genus, 6 species), 17 Concepts
Each name has many concepts or meanings

N0 - Aus L.1758
  C0.1 - Aus L.1758 sec. Linnaeus 1758
  C0.2 - Aus L.1758 sec. Archer 1965
  C0.3 - Aus L.1758 sec. Fry 1989
  C0.4 - Aus L.1758 sec. Tucker 1991
  C0.5 - Aus L.1758 sec. Pargiter 2003
N1 - Aus aus L.1758
  C1.1 - Aus aus L.1758 sec. Linnaeus 1758
  C1.2 - Aus aus L.1758 sec. Archer 1965
  C1.3 - Aus aus L.1758 sec. Fry 1989
  C1.4 - Aus aus L.1758 sec. Tucker 1991
  C1.5 - Aus aus L.1758 sec. Pargiter 2003
N2 - Aus bea Archer 1965
  C2.2 - Aus bea Archer 1965 sec. Archer 1965
  C2.3 - Aus bea Archer 1965 sec. Fry 1989
N3 - Aus cea Fry 1989
  C3.3 - Aus cea Fry 1989 sec. Fry 1989
  C3.4 - Aus cea Fry 1989 sec. Tucker 1991
N4 - Aus beus Archer 1965
N5 - Aus ceus Fry 1989
  C5.5 - Aus ceus Fry 1989 sec. Pargiter 2003
N6 - Xus beus Pargiter 2003
  C6.5 - Xus beus Pargiter 2003 sec. Pargiter 2003
N7 - Xus Pargiter 2003
  C7.5 - Xus Pargiter 2003 sec. Pargiter 2003
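The name and concept bookkeeping above can be sketched as a small data structure. The class and variable names are assumptions made for this example; the identifiers follow the Cn.k pattern, where n is the name and k is the revision (1 = Linnaeus 1758 through 5 = Pargiter 2003) in which the name is used.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Concept:
    concept_id: str  # e.g. "C1.4"
    name_id: str     # e.g. "N1"
    sec: str         # the publication the concept is "according to"


NAMES: Dict[str, str] = {
    "N0": "Aus L.1758",
    "N1": "Aus aus L.1758",
    "N2": "Aus bea Archer 1965",
    "N3": "Aus cea Fry 1989",
    "N4": "Aus beus Archer 1965",
    "N5": "Aus ceus Fry 1989",
    "N6": "Xus beus Pargiter 2003",
    "N7": "Xus Pargiter 2003",
}

REVISIONS: Dict[int, str] = {
    1: "Linnaeus 1758",
    2: "Archer 1965",
    3: "Fry 1989",
    4: "Tucker 1991",
    5: "Pargiter 2003",
}

# Revisions in which each name is used as the name of a concept.
USED_IN: Dict[str, List[int]] = {
    "N0": [1, 2, 3, 4, 5],
    "N1": [1, 2, 3, 4, 5],
    "N2": [2, 3],
    "N3": [3, 4],
    "N4": [],          # replacement spelling never used in a classification
    "N5": [5],
    "N6": [5],
    "N7": [5],
}

CONCEPTS: List[Concept] = [
    Concept(f"C{name_id[1:]}.{k}", name_id, REVISIONS[k])
    for name_id, revisions in USED_IN.items()
    for k in revisions
]

concepts_by_name: Dict[str, List[Concept]] = {
    name_id: [c for c in CONCEPTS if c.name_id == name_id] for name_id in NAMES
}


def label(concept: Concept) -> str:
    """Full concept label: scientific name plus the 'sec.' publication."""
    return f"{NAMES[concept.name_id]} sec. {concept.sec}"
```

Looking up `concepts_by_name["N1"]` then gives the five meanings that the single name Aus aus has carried.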
Find data sets containing Aus aus
Many possible interpretations of Aus aus (N1):
• Original concept: C1.1
• Most recent concept: C1.5
• Preferred Authority (e.g. Fry 1989): C1.3
• Everything ever named N1: Union(C1.1,C1.2,C1.3,C1.4,C1.5)
• Best fit according to some matching algorithm: Best(C1.1,C1.2,C1.3,C1.4,C1.5)
• New concept containing only those features common to all concepts with the name N1: Intersection(C1.1,C1.2,C1.3,C1.4,C1.5)
Is it appropriate to link or merge data sets returned on the scientific names?
• Depends on the user’s purpose
• Level of precision required
[Figure: name N1 (Aus aus L.1758) with its concepts C1.1 to C1.5]
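The interpretations listed above can be sketched by treating each concept as a set of members. Everything concrete here is an assumption for illustration: the specimen identifiers s1 to s6 are invented, and Jaccard similarity stands in for "some matching algorithm", which the slides do not specify.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical circumscriptions: which specimens each concept of N1 includes.
CIRCUMSCRIPTION: Dict[str, FrozenSet[str]] = {
    "C1.1": frozenset({"s1", "s2", "s3", "s4"}),
    "C1.2": frozenset({"s1", "s2", "s3"}),
    "C1.3": frozenset({"s1", "s2"}),
    "C1.4": frozenset({"s1", "s2", "s3", "s5"}),
    "C1.5": frozenset({"s1", "s2", "s5", "s6"}),
}
SEC: Dict[str, str] = {
    "C1.1": "Linnaeus 1758",
    "C1.2": "Archer 1965",
    "C1.3": "Fry 1989",
    "C1.4": "Tucker 1991",
    "C1.5": "Pargiter 2003",
}
CHRONOLOGICAL: List[str] = ["C1.1", "C1.2", "C1.3", "C1.4", "C1.5"]


def original() -> str:
    return CHRONOLOGICAL[0]


def most_recent() -> str:
    return CHRONOLOGICAL[-1]


def preferred(authority: str) -> str:
    return next(c for c in CHRONOLOGICAL if SEC[c] == authority)


def union_all() -> FrozenSet[str]:
    return frozenset().union(*CIRCUMSCRIPTION.values())


def intersection_all() -> FrozenSet[str]:
    return frozenset.intersection(*CIRCUMSCRIPTION.values())


def best_fit(observed: Iterable[str]) -> str:
    """Concept with the highest Jaccard similarity to the observed members."""
    target = frozenset(observed)

    def jaccard(concept_id: str) -> float:
        members = CIRCUMSCRIPTION[concept_id]
        return len(members & target) / len(members | target)

    return max(CHRONOLOGICAL, key=jaccard)
```

Which of these functions a query should use depends on the user's purpose and the precision required, as the slide notes.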
[Figure: graph of the concepts and names in the Aus example]
• Parent child relationships in 5 revisions: C0.1 with C1.1; C0.2 with C1.2, C2.2; C0.3 with C1.3, C2.3, C3.3; C0.4 with C1.4, C3.4; C0.5 with C1.5, C5.5; C7.5 with C6.5
• Names for each of the concepts: N0 to N7
• Information from literature on synonymy: taxonomists record which names their concepts are synonymous with, and any name changes
Find data sets with Aus aus (N1)
[Figure: the concept graph traversed step by step, starting from N1]
Step 1: N1 with C1.1, C1.2, C1.3, C1.4, C1.5
Step 2: adds N2 with C2.2, C2.3
Step 3: adds N3 with C3.3, C3.4
Step 4: adds N4 and N6 with C6.5
Step 5: adds N5 with C5.5
Results in everything returned for Aus aus by traversing the synonymy and name links
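The traversal can be sketched as a breadth-first search over an undirected graph of name and synonymy links. The concept-to-name links come from the concept identifiers; the synonymy links and the exact name-change links below are illustrative assumptions, since the slides only show them graphically.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Each concept is linked to the name it carries (Cn.k carries name Nn).
CONCEPT_IDS: List[str] = [
    "C0.1", "C0.2", "C0.3", "C0.4", "C0.5",
    "C1.1", "C1.2", "C1.3", "C1.4", "C1.5",
    "C2.2", "C2.3", "C3.3", "C3.4",
    "C5.5", "C6.5", "C7.5",
]
NAME_LINKS: List[Tuple[str, str]] = [
    (c, "N" + c[1:].split(".")[0]) for c in CONCEPT_IDS
]

# Assumed for illustration: synonymy recorded by taxonomists (concept to name).
SYNONYMY_LINKS: List[Tuple[str, str]] = [("C1.4", "N2"), ("C3.3", "N2")]

# Assumed for illustration: name changes (bea to beus, cea to ceus, beus to Xus beus).
NAME_CHANGES: List[Tuple[str, str]] = [("N2", "N4"), ("N3", "N5"), ("N4", "N6")]


def build_graph(edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """All nodes reachable from start by following links in either direction."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


graph = build_graph(NAME_LINKS + SYNONYMY_LINKS + NAME_CHANGES)
found = reachable(graph, "N1")
found_names = sorted(n for n in found if n.startswith("N"))
found_concepts = sorted(c for c in found if c.startswith("C"))
```

With these assumed links the search ends with names N1 to N6 and their concepts, the same final set as the last step above, which shows how broad a query on a single name can become.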
Information to improve data sets returned
[Figure: the concept graph (concepts C0.1 to C7.5, names N0 to N7) with relationships between concepts and between names marked]
• Minimally what we need are set relationships from concepts in any taxonomy to earlier concepts, and name changes related to earlier names
• We can build systems to return data fit for purpose
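A minimal sketch of how recorded set relationships could drive "fit for purpose" retrieval follows. The relationship vocabulary, the three precision levels and the specific relationships between the Aus concepts are assumptions made for illustration; the slides do not list them.

```python
from typing import Dict, List, Set, Tuple

# (later concept, relationship, earlier concept). Hypothetical assertions.
RELATIONS: List[Tuple[str, str, str]] = [
    ("C1.2", "is included in", "C1.1"),
    ("C1.3", "is congruent to", "C1.2"),
    ("C1.4", "includes", "C1.3"),
    ("C1.4", "includes", "C2.3"),
]

# Which relationships are acceptable at each level of precision (assumed).
ALLOWED: Dict[str, Set[str]] = {
    "exact": {"is congruent to"},
    "narrow": {"is congruent to", "includes"},
    "broad": {"is congruent to", "includes", "overlaps", "is included in"},
}


def usable_earlier_concepts(target: str, precision: str) -> List[str]:
    """Earlier concepts whose data may be returned for the target concept."""
    allowed = ALLOWED[precision]
    return sorted(
        earlier
        for later, relationship, earlier in RELATIONS
        if later == target and relationship in allowed
    )
```

For example, `usable_earlier_concepts("C1.4", "narrow")` returns the earlier concepts that C1.4 is asserted to include, while `"exact"` returns only congruent ones.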
Real Biological Taxonomies
Larger and change more frequently than the Aus example
German mosses
• 14 classifications in 73 years covering 1548 taxa
• only 35% thought to be stable concepts
• 65% of names used in legacy data sets are ambiguous
Taxonomic Revisions of genus Alteromonas
• 34 years: from 1972 to 2006
• At the species level: 18 “emendations”, 19 species reassigned to 4 genera, 3 new combinations, 6 synonyms, 2 species to subspecies, 2 subspecies to species, 21 new species
SEEK Taxon Approach
Use Taxon Concepts for referring to organisms
• Aus aus L. 1758 sec. Tucker 1991
• Abies lasiocarpa (Hook) Nutt. sec. FNA 1997
Taxon Concept/Name Resolution
• International data exchange schema: TCS (Taxonomic Concept Schema)
• Concept Repository and Resolution web service
• Linked to Kepler workflow system
• Globally unique identifiers (LSIDs)
• Visualization software for comparing Taxonomies and Asserting Concept Relationships
Taxon Object Server
[Figure: architecture diagram with components: Providers (Mammal Species of the World, Taxonomic Literature, Taxonomic Data), Database to TCS Mapping Tool, Concept Extraction Tool, TCS, TOS, SEEK Cache, ConceptMapper]
Taxonomic Object Service: SEEK
http://seek.nhm.ku.edu/TaxObjServ/services
• Find All Concepts
• Get Synonymous Concepts
• Get Best Concept
[Figure: service diagram with components: ConceptMapper, TCS, TOS, SEEK Cache, LSID Authority, Morpho, EML Datasets, EML(TCS), Data Analysis; annotated "Identify species" and "Mark up datasets"]
Recap…
Re-emphasised the problems with Taxonomic Names
• not good identifiers for organisms
• problem extends to most areas: characters, countries, habitats, vegetation types, genes…
Shown that Taxonomic concepts are better for referring to organisms, specimens, observations… but
• Need better systems for resolving taxonomic names/concepts
• Which require better information
Provide better tools for users
To help taxonomists create better quality data
• Better access to reference/legacy data
• Explore differences/similarities in existing taxonomies
• To create relationships between concepts
• Improved data can be made available to the general biology community for incorporating into bio-referenced databases.
To help end users understand and use the data and its limitations
• Biologists can use tools to understand the impact of using particular data on their analysis
Conclusion
Science is complex (and therefore split into specialisms)
• Identify the overlaps/linkages in the different domains
• Need useful approximations of things to simplify linked domain
• Need to understand the approximations or linking points well
• Support re-composition, linking or building on the components
Science is inherently changing
• Science is full of legacy data
• Today’s scientific research is tomorrow’s legacy data
• Track the changes in the data: know when components or links have changed
• Provide long-term persistent storage
• Any published scientific discovery should store the data as evidence
• Data needs to be accurately annotated
• Sufficient to repeat analyses to test hypotheses
Acknowledgements
• Colleagues on the SEEK project
• NSF and EPSRC funding
• e-Science Centre funding
• Colleagues in TDWG
Thank You
Questions…