Directions in observational data organization: from schemas to ontologies
Matthew B. Jones (1)
Chad Berkley (1)
Shawn Bowers (2)
Joshua Madin (3)
Mark Schildhauer (1)
(1) National Center for Ecological Analysis and Synthesis (NCEAS), University of California, Santa Barbara
(2) University of California, Davis
(3) Macquarie University
Ecological studies
• Ecological studies focus on
– Distribution and abundance of organisms
– Organism interactions
– Population and community processes
– Ecosystem processes
– Mechanistic understanding of ecosystems
• Diverse data sources, e.g.,
– Biodiversity monitoring
– Experimental manipulations
– Environmental monitoring
Synthesis over ecological process
• Gruner et al. 2008
– Ecology Letters (2008) 11: 740–755
• Meta-analysis of 191 factorial manipulations of nutrients and herbivores
• Experimenters manipulated
– nutrient addition
– herbivore removal
• Effect on producer biomass
How did they do it?
• As a scientist, could you:
– Locate the precise data used?
– Locate the analytical processes used?
– Reconstruct them?
• Today, only a slim chance...
– Why?
Insufficient sharing
• Researchers don’t publish their data
• Researchers don’t publish their analytical code
• In general, we have no way to verify or reproduce the conclusions in papers
• Synthesis requires access to global ecological data
• Single-schema databases do not suffice
• Loosely-coupled metadata and data collections
– No constraints on data schemas
• Knowledge Network for Biocomplexity (KNB)
• National Biological Information Infrastructure (NBII)
Preserving data for synthesis
Physical Data Format
Access and Distribution
Logical Data Model
Methods
Coverage: Space, Time, Taxa
Identity and Discovery Information
<EML>
22 independent modules
• open
• modular
• extensible
• Ecological Metadata Language
Grass roots metadata
Describe what data you have...rather than prescribe what to produce.
EML: Selected relationships
[Figure: timeline, 1991 to 2009, relating EML releases (EML 1.0.0, EML 1.3.0, EML 1.4.x, EML 2.0.0, EML 2.0.1, EML 2.1.0?) to CSDGM 1.0, the Michener ’97 paper, the ESA FLED Report, NBII BDP, ISO 19115, Dublin Core, XML 1.0, and OBOE]
Logical Model: Attribute structure
• Describes data tables and their attributes
• A typical data table with 10 attributes
– some metadata are likely apparent, others ambiguous
– a missing value code is present
– definitions need to be explicit, as does data typing
YEAR  MONTH  DATE        SITE  TRANSECT  SECTION  SP_CODE  SIZE  OBS_CODE  NOTES
2001  8      2001-08-22  ABUR  1         0-20     CLIN     5     06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     11    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     10    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     14    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     7     06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     19    06        .
2001  8      2001-08-22  ABUR  1         21-40    COTT     5     06        .
2001  8      2001-08-22  ABUR  2         0-20     CLIN     5     06        .
2001  8      2001-08-22  ABUR  2         21-40    NF       0     06        .
2001  8      2001-08-27  AHND  1         0-20     NF       0     03        .
[Figure callouts on the table: species codes, value bounds, date format, code definitions]
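The callouts on the table (date format, value bounds, code definitions, missing value code) can be illustrated with a minimal Python sketch. It assumes a simplified dictionary layout that is not the EML schema itself; the bounds, the choice of "." as the missing value code for NOTES, and the helper `check_row` are assumptions made only for this example.

```python
from datetime import datetime

# Hypothetical attribute descriptions for a few columns of the sample table.
# The keys, the bounds, and the missing value code are assumptions for
# illustration; they are not taken from the EML schema.
ATTRIBUTES = {
    "DATE": {"type": "date", "format": "%Y-%m-%d"},
    "SIZE": {"type": "integer", "minimum": 0},
    "SP_CODE": {"type": "code", "codes": {"CLIN", "OPIC", "COTT", "NF"}},
    "NOTES": {"type": "text", "missing_value": "."},
}


def check_row(row):
    """Return a list of problems found in one row (a dict of column -> string)."""
    problems = []
    for name, spec in ATTRIBUTES.items():
        value = row[name]
        if value == spec.get("missing_value"):
            continue  # explicitly declared as missing, nothing to check
        if spec["type"] == "date":
            try:
                datetime.strptime(value, spec["format"])
            except ValueError:
                problems.append(f"{name}: {value!r} does not match {spec['format']}")
        elif spec["type"] == "integer":
            try:
                number = int(value)
            except ValueError:
                number = None
            if number is None or number < spec["minimum"]:
                problems.append(f"{name}: {value!r} is outside the declared bounds")
        elif spec["type"] == "code":
            if value not in spec["codes"]:
                problems.append(f"{name}: {value!r} is not a defined code")
    return problems


good = {"DATE": "2001-08-22", "SIZE": "5", "SP_CODE": "CLIN", "NOTES": "."}
bad = {"DATE": "22/08/2001", "SIZE": "-3", "SP_CODE": "XXXX", "NOTES": "."}
print(check_row(good))
print(check_row(bad))
```

Once the definitions are explicit, a program can check a row without guessing what a column means, which is the point of the slide.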
Logical Model: Unit Dictionary
• Consistent assignment of measurement units
– Quantitative definitions in terms of SI units
– ‘unitType’ expresses dimensionality
  • time, length, mass, energy are all ‘unitType’s
  • second, meter, gram, pound, joule are all ‘unit’s
[Figure: UnitType "Mass" linked to Units "kilogram" and "gram", which are related by a factor of 1000]
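The unitType/unit distinction can be sketched in a few lines of Python. This is only an illustration under assumptions: the class `Unit`, the field `multiplier_to_si`, and the three entries below are hypothetical and do not reproduce the actual EML unit dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    unit_type: str            # dimensionality, e.g. "mass" or "length"
    multiplier_to_si: float   # factor that converts a value to the SI unit


# A tiny, hypothetical dictionary; a real one would hold many more entries.
UNITS = {
    "kilogram": Unit("kilogram", "mass", 1.0),
    "gram": Unit("gram", "mass", 0.001),
    "meter": Unit("meter", "length", 1.0),
}


def convert(value, source, target):
    """Convert a value between two units that share the same unitType."""
    src, dst = UNITS[source], UNITS[target]
    if src.unit_type != dst.unit_type:
        raise ValueError(f"cannot convert {src.unit_type} to {dst.unit_type}")
    return value * src.multiplier_to_si / dst.multiplier_to_si


print(convert(2, "kilogram", "gram"))
```

Defining every unit quantitatively against SI makes conversions mechanical, and the shared unitType lets a program reject a conversion such as gram to meter.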
Knowledge Network for Biocomplexity (KNB)
[Figure: network of KNB nodes: KNB 1, KNB II, PISCO, AND, ... (26), GCE, LTER, NCEAS, ESA, OBFS]
Building a data preservation network
• Preserve primary data
• Rich metadata descriptions
• Redundant backup via replication
• Access controlled by contributors
[Figure: two linked networks. Knowledge Network for Biocomplexity (KNB): KNB 1, KNB II, PISCO, AND, ... (26), GCE, LTER, NCEAS, ESA, OBFS. South African Data Network: Mozambique, Mapungubwe, Marakele, Kruger, SAEON, Grahamstown, Cape Town, San Parks, Wilderness, Cape Town U, Addo, Karoo, Tsitsikama, Phalabora, grouped into a Savannah Cluster and a Marine Cluster]
International LTER
• Recommendation for producing EML across all ILTER sites
• Recommendation for producing continental and regional metadata caches
– one or more in each ILTER region
– initial nodes may use Metacat
Dynamic Data Retrieval
[Figure: data flow among User, Client, Data Manager, Metadata Catalog, Data Storage, Metadata Parser, Data Loader, and DB. Flow labels: Store Metadata, Store Data, Query, Data Query, Results, "CREATE TABLE ...", "SELECT * FROM ...". A result table is drawn as "att1 | attr2 | attr3 | ...".]
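The flow in the figure (parse metadata, issue CREATE TABLE, load the data, run a SELECT) can be approximated with Python's built-in sqlite3 module. This is a rough sketch, not the Data Manager implementation: the `METADATA` dictionary stands in for attribute descriptions that would normally be parsed from a metadata document, and the table name `survey` and the helper functions are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical, simplified attribute metadata (name and storage type per
# column); in practice this would be parsed from the metadata document.
METADATA = {
    "table": "survey",
    "attributes": [("SITE", "TEXT"), ("TRANSECT", "INTEGER"),
                   ("SP_CODE", "TEXT"), ("SIZE", "INTEGER")],
}

# A few rows taken from the sample table, reduced to the four columns above.
DATA = """ABUR,1,CLIN,5
ABUR,1,OPIC,11
ABUR,1,OPIC,10
AHND,1,NF,0
"""


def create_table_sql(metadata):
    """Build a CREATE TABLE statement from the attribute descriptions."""
    columns = ", ".join(f'"{name}" {sql_type}'
                        for name, sql_type in metadata["attributes"])
    return f'CREATE TABLE "{metadata["table"]}" ({columns})'


def load(connection, metadata, text):
    """Create the table described by the metadata and load the data rows."""
    connection.execute(create_table_sql(metadata))
    placeholders = ", ".join("?" for _ in metadata["attributes"])
    rows = list(csv.reader(io.StringIO(text)))
    connection.executemany(
        f'INSERT INTO "{metadata["table"]}" VALUES ({placeholders})', rows)


connection = sqlite3.connect(":memory:")
load(connection, METADATA, DATA)
query = "SELECT SP_CODE, COUNT(*) FROM survey GROUP BY SP_CODE ORDER BY SP_CODE"
for row in connection.execute(query):
    print(row)
```

Because the table definition is generated from metadata, no schema has to be agreed on in advance; each data set brings its own description.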
Importance of semantics
• So far we’ve dealt only with the logical data model
– any semantics in EML are in natural language
• The computer doesn’t really understand:
– what is being measured
– how measurements relate to one another
– how semantics map to logical structure
• Analysis depends on understanding the semantic contextual relationships among data measurements
– e.g., density measured within subplot
Semantic annotation
[Figure: a data set mapped to the Observation Ontology via semantic annotation]
slide from J. Madin
• Relational data lacks critical semantic information
– no way for computer to determine that “Ht.” represents a “height” measurement
– no way for computer to determine if Plot is nested within Site or vice-versa
– no way for computer to determine if the Temp applies to Site or Plot or Species
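A semantic annotation can be pictured as a mapping from column names to ontology terms. The Python sketch below is an assumption-laden illustration: the entity and characteristic names (for example "Tree"), the units, and the nesting of Tree within Plot within Site are placeholders, not terms from a published ontology or from the data set on the slide.

```python
# Hypothetical annotations: column name -> what the column measures.
# All term names are placeholders chosen for this example.
ANNOTATIONS = {
    "Ht.": {"entity": "Tree", "characteristic": "Height", "unit": "meter"},
    "Temp": {"entity": "Plot", "characteristic": "Temperature", "unit": "celsius"},
    "Plot": {"entity": "Plot", "characteristic": "Name", "unit": None},
    "Site": {"entity": "Site", "characteristic": "Name", "unit": None},
}

# Hypothetical context links: the key entity is observed within the value entity.
CONTEXT = {"Tree": "Plot", "Plot": "Site"}


def columns_measuring(characteristic):
    """Return the columns annotated with the given characteristic."""
    return [column for column, term in ANNOTATIONS.items()
            if term["characteristic"] == characteristic]


def context_chain(entity):
    """Follow the context links outward from an entity."""
    chain = [entity]
    while chain[-1] in CONTEXT:
        chain.append(CONTEXT[chain[-1]])
    return chain


print(columns_measuring("Height"))
print(context_chain(ANNOTATIONS["Temp"]["entity"]))
```

With annotations like these, the three questions in the bullets become lookups: which column is a height, which entity nests within which, and which entity a temperature applies to.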
Scientific Observations
• An Observation is the Measurement of the Value of a Characteristic of some Entity in a particular Context
Provide extension points for loading specialized domain ontologies
Goal: semantically describe the structure of scientific observation and measurement as found in a data set
Observation ontology (OBOE)
Entities represent real-world objects or concepts that can be measured.
Observations are made about particular entities.
Every measurement has a characteristic, which defines the property of the entity being measured.
Observations can provide context for other observations.
slide from J. Madin
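The four statements above suggest a small data model. The following Python sketch is one possible reading, not the OBOE ontology itself: the class and field names (including `standard`) and the example values are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Measurement:
    characteristic: str             # property of the entity being measured
    value: object
    standard: Optional[str] = None  # unit or similar (assumed field name)


@dataclass
class Observation:
    entity: str                     # real-world object or concept observed
    measurements: List[Measurement] = field(default_factory=list)
    context: List["Observation"] = field(default_factory=list)

    def all_context(self):
        """Return every observation that directly or indirectly gives context."""
        found = []
        for observation in self.context:
            found.append(observation)
            found.extend(observation.all_context())
        return found


# Hypothetical example values: a tree observed within a plot within a site.
site = Observation("Site", [Measurement("Name", "ABUR")])
plot = Observation("Plot", [Measurement("Area", 10, "square meter")], context=[site])
tree = Observation("Tree", [Measurement("Height", 12.5, "meter")], context=[plot])

print([observation.entity for observation in tree.all_context()])
```

Keeping context as a link between observations, rather than as extra columns, is what lets a program work out that the tree's height was measured within a particular plot and site.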
Datasets vs. Observations
• EML describes “data sets”
– collections of related observations with relatively unspecified semantics
– mostly natural language descriptions
• OBOE describes “scientific observations”
– semantically-precise descriptions of scientific measurements
– allows understanding of relationships among measurements and context of an observation
TDWG Observations Task Group
• An Observation is the Measurement of the Value of a Characteristic of some Entity in a particular Context
• Create: a community-sanctioned, extensible, and unified ontology model for observational data
– Compatible with existing standards
– Integrate with metadata standards such as EML, CSDGM, etc.
– Reduce the “babel” of scientific dialects
Questions?
• http://www.nceas.ucsb.edu/ecoinformatics/
• http://knb.ecoinformatics.org/
• http://seek.ecoinformatics.org/
• http://kepler-project.org/
Acknowledgments
• This material is based upon work supported by:
• The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.
• Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
• The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
• The Andrew W. Mellon Foundation.
• Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence