Building a virtual data center for the biological, ecological and environmental sciences -- William Michener (University Libraries, University of New Mexico) -- DataONE Team
Outline
• Challenges faced by the biological, ecological, and environmental sciences
  • Scientific
  • Cyberinfrastructure
  • Socio-cultural
• Vision for change (a DataONE perspective)
• Effecting change
• Re-visiting the challenges
Building the knowledge pyramid
[Figure: pyramid of data sources, top to bottom: intensive science sites and experiments; extensive science sites; volunteer & education networks; remote sensing. Axis labels: decreasing spatial coverage; increasing process knowledge.]
Data loss (including orphaned data)
[Figure: information vs. available storage, in petabytes worldwide. Source: John Gantz, IDC Corporation: The Expanding Digital Universe]
Data deluge: “the flood of increasingly heterogeneous data”
Data are heterogeneous (Jones et al. 2007)
• Syntax (format)
• Schema (model)
• Semantics (meaning)
Poor data practice: “data entropy”
[Figure: information content plotted against time (Michener et al. 1997). Labels along the curve: time of publication; specific details; general details; accident; retirement or career change; death.]
Scattered data sources: “finding the needle in the haystack”
Data are massively dispersed
• Ecological field stations and research centers (100s)
• Natural history museums and biocollection facilities (100s)
• Agency data collections (100s to 1000s)
• Individual scientists (1000s to 10,000s to 100,000s)
The problem with standards: Metadata Interoperability Across Data Holdings
One row per data archive; the archive names were not preserved.

Types of Data Managed | Metadata Standard(s)
Biodiversity, taxonomic, ecological | BDP, DwC, DC, OGIS
Biogeochemical dynamics, terrestrial ecological, Earth observation imagery | DIF, BDP, ECHO
Ecological, biodiversity, biophysical, social, genomics, and taxonomic | EML
Avian populations and molecular biology | DwC
Biological and taxonomic | DC subset
Biophysical, biodiversity, disturbance, and Earth observation imagery | EML
Biodiversity, biotic structure, function/process, biogeochemical, climate, and hydrologic | EML

BDP = Biological Data Profile
DC = Dublin Core
DC subset = Dublin Core subset
DIF = Directory Interchange Format
DwC = Darwin Core
ECHO = EOS ClearingHOuse
EML = Ecological Metadata Language
OGIS = OpenGIS
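The interoperability problem in the table above can be made concrete by counting how many pairs of holdings share at least one metadata standard. The sketch below is only an illustration: the holding labels are hypothetical shorthand for the "types of data managed" entries (the archive names are not available), and "DC subset" is treated as distinct from full Dublin Core, which is an assumption.

```python
from itertools import combinations

# Metadata standards per data holding, transcribed from the table.
# The keys are hypothetical labels, since the archive names are not known.
HOLDINGS = {
    "biodiversity/taxonomic/ecological": {"BDP", "DwC", "DC", "OGIS"},
    "biogeochemical/Earth observation": {"DIF", "BDP", "ECHO"},
    "ecological/social/genomics": {"EML"},
    "avian populations": {"DwC"},
    "biological/taxonomic": {"DC subset"},  # assumed distinct from full DC
    "biophysical/disturbance": {"EML"},
    "biotic structure/climate/hydrologic": {"EML"},
}


def shared_standards(holdings):
    """Map each pair of holdings to the set of standards they have in common."""
    return {
        (a, b): holdings[a] & holdings[b]
        for a, b in combinations(sorted(holdings), 2)
    }


def interoperable_pairs(holdings):
    """Return (pairs sharing at least one standard, total number of pairs)."""
    pairs = shared_standards(holdings)
    n_shared = sum(1 for common in pairs.values() if common)
    return n_shared, len(pairs)


if __name__ == "__main__":
    shared, total = interoperable_pairs(HOLDINGS)
    print(f"{shared} of {total} pairs share a metadata standard")
```

Under these assumptions only 5 of the 21 possible pairs have a standard in common, and three of those are the EML holdings pairing with one another.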
A Vision for Change: DataONE
Providing universal access to data about life on earth and the environment that sustains it
1. Build on existing cyberinfrastructure
2. Create new cyberinfrastructure
3. Support new communities of practice
• engaging the scientist in the data curation process
• supporting the full data life cycle
• encouraging data stewardship and sharing
• promoting best practices
• engaging citizens
• developing domain-agnostic solutions
DataONE Cyberinfrastructure
Flexible, scalable, sustainable network
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
Coordinating Nodes
• retain complete metadata catalog
• subset of all data
• perform basic indexing
• provide network-wide services
• ensure data availability (preservation)
• provide replication services
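The division of labor between Member Nodes and Coordinating Nodes can be sketched as a toy data structure. This is not the DataONE service interface: every class, method, and field name below is hypothetical, and the model only illustrates the roles listed above (local storage on member nodes; a complete metadata catalog, basic indexing, and replication on the coordinating node).

```python
from dataclasses import dataclass, field


@dataclass
class MemberNode:
    """Hypothetical member node: holds data objects for its local community."""
    name: str
    objects: dict = field(default_factory=dict)  # identifier -> data bytes

    def store(self, identifier, data):
        self.objects[identifier] = data


@dataclass
class CoordinatingNode:
    """Hypothetical coordinating node: keeps metadata for every object."""
    catalog: dict = field(default_factory=dict)    # identifier -> metadata dict
    locations: dict = field(default_factory=dict)  # identifier -> node names

    def register(self, node, identifier, metadata, data):
        """Store data on a member node and record its metadata centrally."""
        node.store(identifier, data)
        self.catalog[identifier] = metadata
        self.locations.setdefault(identifier, set()).add(node.name)

    def replicate(self, identifier, source, target):
        """Copy an object to a second member node for preservation."""
        target.store(identifier, source.objects[identifier])
        self.locations[identifier].add(target.name)

    def search(self, term):
        """Basic indexing: case-insensitive match on the metadata title."""
        term = term.lower()
        return sorted(
            identifier
            for identifier, metadata in self.catalog.items()
            if term in metadata.get("title", "").lower()
        )
```

In this sketch a data object stays available if one member node disappears, because the coordinating node knows every location that holds a replica.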
What data are in scope for DataONE?
• Biological (genes to biomes)
• Environmental: atmospheric, ecological, hydrological, oceanographic
• Social, e.g., land use, human population
• Economic, e.g., trade, ecosystem services, resource extraction
Effecting change
1. Engage the community
• Perform baseline and iterative community assessments
• Usability studies
• DataONE International Users Group
Effecting change
2. Leverage existing cyberinfrastructure
KNB, ESDIS, and Waters Networks
Investigator Toolkit
Many open source efforts already exist
• Data management: MATT, UDig, Specify
• Analysis and modeling: R, Octave
• Workflow systems: Kepler, Taverna, Triana, Pegasus
• Grid systems: Condor, Globus, BOINC
• Data and workflow portals: VegBank, myExperiment
Commercial tools are important too
• Excel, MATLAB, SAS, ArcGIS
DataONE: help communities build their own tools
• Integrate, interoperate, stabilize
• Create libraries to DataONE Service Interface
Effecting change
3. Educate
Career Long Learning
• best practice guides
• exemplary data management plans
• podcasts, web-casts
• workshops and seminars
• downloadable curricula
[Figure: sample covers. Best Practice Guide: Using Metadata for e-research (5 in a series); Best Practice Guide: How to Cite Your Data (6 in a series); Gold Star Data Management Plan: Here’s How]
Effecting change
4. Enable new science and demonstrate success
Pilot Catalog: Simple Search Interface (searches entire metadata record)
>48,000 Data Set Records
• NBII Metadata Clearinghouse (39,550)
• Long Term Ecological Research (LTER) Network (6,897)
• ORNL Distributed Active Archive Center for Biogeochemical Data (810)
• Large Scale Biosphere-Atmosphere Experiment in Amazonia (LBA) (783)
• Organization of Biological Field Stations (124)
• Inter-American Institute for Global Change Research (IAI) (79)
• MODIS and ASTER Products (LPDAAC) (38)
• National Phenology Network (USANPN) (29)
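As a quick arithmetic check on the ">48,000" figure, the per-source record counts listed for the pilot catalog can be summed directly. The dictionary below simply transcribes those counts; the variable and function names are illustrative only.

```python
# Record counts per source, as listed for the pilot catalog.
PILOT_CATALOG = {
    "NBII Metadata Clearinghouse": 39_550,
    "Long Term Ecological Research (LTER) Network": 6_897,
    "ORNL Distributed Active Archive Center for Biogeochemical Data": 810,
    "Large Scale Biosphere-Atmosphere Experiment in Amazonia (LBA)": 783,
    "Organization of Biological Field Stations": 124,
    "Inter-American Institute for Global Change Research (IAI)": 79,
    "MODIS and ASTER Products (LPDAAC)": 38,
    "National Phenology Network (USANPN)": 29,
}


def total_records(catalog):
    """Sum the data set records across all contributing sources."""
    return sum(catalog.values())


if __name__ == "__main__":
    print(total_records(PILOT_CATALOG))  # 48310
```

The listed sources add up to 48,310 records, consistent with the stated total of more than 48,000.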
Eurasian Collared Dove in North America
• Search can be emailed, bookmarked, or placed in RSS feed
• Search can be refined using multiple filters
Exploration, Visualization and Analysis (EVA) Working Group Objectives
Guide creation of scientific workflows and DataONE data exploration, visualization and analysis functionality.
EVA Tools
1st EVA Example: Vegetation and Bird Phenology at 130,000 sites in North America
Understand how bird migration patterns change through time
[Figure: map of phenology (start of season), with legend dates in May and June (White et al. 2009)]
The Process
• Iterative data-intensive workflow – VisTrails
• Data synthesis
• Spatio-Temporal Exploratory Model (STEM)

$$\frac{1}{n(s,t)} \sum_{i=1}^{m} f_i(X, s, t)\, I(s, t \in \theta_i)$$
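One way to read the STEM expression above is as an ensemble average: each of the m local models f_i contributes a prediction only where its space-time region contains the point (s, t), and n(s, t) counts the models that do. The sketch below follows that reading. It is an assumption-laden toy, not the authors' implementation: the `LocalModel` class is hypothetical, regions are rectangular, and location s is reduced to a single coordinate for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LocalModel:
    """Hypothetical local model f_i with a rectangular space-time region."""
    s_min: float
    s_max: float
    t_min: float
    t_max: float
    predict: Callable[[Sequence[float], float, float], float]

    def covers(self, s: float, t: float) -> bool:
        """Indicator I(s, t in region): bounds are [min, max)."""
        return self.s_min <= s < self.s_max and self.t_min <= t < self.t_max


def stem_predict(models, x, s, t):
    """Average the predictions of all local models whose region covers (s, t)."""
    covering = [m for m in models if m.covers(s, t)]
    if not covering:  # n(s, t) == 0: no local model applies here
        raise ValueError(f"no local model covers s={s}, t={t}")
    return sum(m.predict(x, s, t) for m in covering) / len(covering)


if __name__ == "__main__":
    south = LocalModel(0, 10, 0, 10, lambda x, s, t: 0.25)
    north = LocalModel(5, 15, 0, 10, lambda x, s, t: 0.75)
    # The point s=7 lies in both regions, so the two predictions are averaged.
    print(stem_predict([south, north], [], 7, 3))
```

Overlapping regions are what smooth the estimate: a point covered by several local models gets their mean rather than any single model's prediction.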
Results: Spring Migration Detail
Occurrence of Indigo Bunting (2008), a Neotropical migrant that winters in Central and South America
• Model predictions show population timing and location through migration
• Early April – Concentration on Gulf of Mexico Coast
• Mid April – Concentration along Mississippi River valley
• Mid May – Breeding distribution
Spring Migration Results
• Early April – Concentration on Gulf of Mexico Coast
• Mid April – Concentration along Mississippi River valley
• Mid May – Breeding distribution
[Figure: map panels for April 7, April 15, April 26, and May 14]
Experimentation/Forecasting
Red-eyed Vireo (Vireo olivaceus)
Questions:
1. Is Red-eyed Vireo migration timing associated with NDVI?
2. If so, how are migration timing, direction, and speed affected by NDVI?
Experiment:
1. Fit STEM with NDVI predictor
2. Advance “Greening Date” by 14 days
3. Plot “Spring Arrival Dates” as the first date when predicted probability of occurrence > 5%
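The last two experiment steps reduce to simple date logic: shift the greening date 14 days earlier, then take the first date on which the predicted probability of occurrence exceeds 5%. The sketch below assumes the model output is available as a chronologically ordered series of (date, probability) values for one location; the function names and the sample numbers in the usage example are illustrative, not results from the study.

```python
from datetime import date, timedelta


def advance_greening(greening_date, days=14):
    """Return the greening date moved earlier by the given number of days."""
    return greening_date - timedelta(days=days)


def spring_arrival(dates, probabilities, threshold=0.05):
    """Return the first date whose predicted probability exceeds the threshold.

    `dates` must be in chronological order and aligned with `probabilities`.
    Returns None if the threshold is never exceeded.
    """
    for day, probability in zip(dates, probabilities):
        if probability > threshold:
            return day
    return None


if __name__ == "__main__":
    days = [date(2008, 4, 1) + timedelta(days=i) for i in range(5)]
    made_up_probabilities = [0.0, 0.02, 0.05, 0.08, 0.3]  # illustrative only
    print(spring_arrival(days, made_up_probabilities))
    print(advance_greening(date(2008, 4, 15)))
```

Because the comparison is strict, a date whose probability is exactly 5% does not count as an arrival; whether the original analysis used a strict or inclusive threshold is not stated.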
Challenges
“i-Science” (from the perspective of the scientist)
Fast
• Data & tool discovery
• Automated/automatic metadata annotation
• Visualization (i.e., rapid analysis)
Easy
• Data & metadata upload
• Integrated, linked, and synthesized databases
• Scientific workflows
• Interoperability (but hidden from user)
• Usability (intuitiveness)
Cheap
• Data preservation
• Free, open source tools
• But, also “support” for tools I am used to, e.g., Excel
The “elephant in the room” challenge
It’s the metadata … need tools and approaches that are:
• Fast!
• Easy!!!
• Cheap!
DataONE Institutions, Personnel & Support
• University of New Mexico – William Michener, Rebecca Koskela, Mark Servilla
• Cornell University – Paul Allen, Steve Kelling
• CSIRO (Australia) – Donald Hobern
• Duke University – Ryan Scherle, Todd Vision
• Ecological Society of America – Cliff Duke
• Oak Ridge National Laboratory – Bob Cook, John Cobb, Line Pouchard, Bruce Wilson
• University of California Curation Center – Patricia Cruse, John Kunze
• University of California Davis – Bertram Ludaescher
• University of California Santa Barbara – Stephanie Hampton, Matt Jones
• University of Edinburgh (Scotland) – Peter Buneman
• University of Illinois – Urbana Champaign – Randy Butler, Von Welch
• University of Illinois – Chicago – Robert Sandusky
• University of Kansas – Dave Vieglais
• University of Manchester (UK) – Carole Goble
• University of Michigan – Peter Honeyman
• University of Southampton (UK) – David DeRoure
• University of Southern California – Ewa Deelman
• University of Tennessee – Suzie Allard, Carol Tenopir, Maribeth Manoff
• USGS (National Biological Information Infrastructure, National Phenology Network) – Mike Frame, Vivian Hutchison, Jake Weltzin
• Utah State University – Jeffery Horsburgh

Steve Kelling1, Daniel Fink1, Suresh Santhana Vannan2, Kevin Webb1, Robert Cook2, William Michener3, Jeff Morisette4, Claudio Silva5, Tom Dietterich6, Patrick O’Leary7, Alyssa Rosemartin4, and Damian Gessler8
1 Lab of Ornithology, Cornell University; 2 Oak Ridge National Laboratory; 3 University of New Mexico; 4 U.S. Geological Survey; 5 University of Utah; 6 Oregon State University; 7 Idaho National Laboratory; 8 I-Plant
NSF CISE Pathways Computational Sustainability 0832782, Leon Levy Foundation, and National Aeronautics and Space Administration