34
Building a virtual data center for the biological, ecological and environmental sciences -- William Michener (University Libraries, University of New Mexico) -- DataONE Team Presenter Name Presenter Name

Building a virtual data center for the biologggical, ecological … · Building a virtual data center for the biologggical, ecological and environmental sciences ... Dave Vieglais

Embed Size (px)

Citation preview

Building a virtual data center for the biological, ecological and g genvironmental sciences-- William Michener (University Libraries, University of New Mexico)-- DataONE Team

Presenter NamePresenter Name

Outline

Challenges faced by the biological, ecological, and environmental sciences Scientific Cyberinfrastructure

S i lt l Socio-cultural Vision for change (a DataONE perspective) Effecting change Effecting change Re-visiting the challenges

2

Global change

Smith, Knapp, Collins. In press.

4

Critical areas in the Earth’s system

5

Building the knowledge pyramid

ge ge

Cov

erag

Kno

wle

dgIntensive science sites andexperiments

g S

patia

l

Pro

cess

experiments

Extensive science sites

ecre

asin

g

crea

sing

Volunteer & education networks

D Inc

Remotesensing

6

Data loss(including orphaned data)

Informationdwid

e

Information

es W

orl

Available StoragePet

abyt

e

Source: John Gantz, IDC Corporation: The Expanding Digital Universe

Available StorageP

7

Data deluge“the flood of increasingly heterogeneous data”

Data are heterogeneous Syntax

(format) (format) Schema

(model)S ti Semantics (meaning)

Jones et al. 2007

8

Poor data practice“data entropy”

Ti f bli tit

Time of publication

Specific details

Con

tent General details

Retirement or

mat

ion

C

Accident

Retirement or career change

Info

rm

Accident

Death

Time (Michener et al. 1997)

9

Scattered data sources“finding the needle in the haystack”

Data are massively dispersed Ecological field stations and research centers (100’s)g ( ) Natural history museums and biocollection facilities (100’s) Agency data collections (100’s to 1000’s) Individual scientists (1000’s to 10 000s to 100 000s) Individual scientists (1000 s to 10,000s to 100,000s)

10

The problem with standards…Metadata Interoperability Across Data Holdings

Data Archive Types of Data Managed Metadata Standard(s)

Biodiversity, taxonomic, ecological BDP, DwC, DC, y, , g , , ,OGIS

Biogeochemical dynamics, terrestrial ecological Earth observation imagery

DIF, BDP, ECHO

Ecological, biodiversity, biophysical, social, genomics, and taxonomic

EML

Avian populations and molecular biology DwC

Biological and taxonomic DC subset

Biophysical, biodiversity, disturbance, and Earth observation imagery

EML

Biodiversity, biotic structure, function/process, biogeochemical, climate,

and hydrologic

EML

11

EML=Ecological Metadata Language

BDP=Biological Data Profile DwC=Darwin Core

DC=Dublin Core ECHO=EOS ClearingHOuse

OGIS=OpenGIS

DC subset=Dublin Core subset

DIF=Directory Interchange Format

Socio-cultural informatics challenges

Why? What is there to help me?p How should I best do this?

12

P idi i l t d t b t lif th d th i t th t t i it

A Vision for Change: DataONE

engaging the scientist in the data

Providing universal access to data about life on earth and the environment that sustains it

1. Build on existing cyberinfrastructurescientist in the data

curation process supporting the full data life cycle

y

encouraging data stewardship and sharing

ti b t2. Create new

cyberinfrastructure promoting best practices engaging citizens developing

cyberinfrastructure

3. Support new developing domain-agnostic solutions

ppcommunities of

practice

DataONE Cyberinfrastructure

Member Nodes

• diverse institutions

Coordinating Nodes• retain complete metadata catalog

Flexible, scalable, sustainable network

• serve local community

• provide resources for managing their data

g• subset of all data• perform basic indexing• provide network-wide servicesmanaging their dataservices• ensure data availability (preservation) • provide replication services

What data are in scope for DataONE?

Biological (genes to biomes) EnvironmentalEnvironmental Atmospheric Ecological Hydrological Oceanographic

Social Social e.g., Land use, human population

Economic e.g., trade, ecosystem services, resource extraction

DataONE – Building new global CI

16

Effecting change

1. Engage the community Perform baseline and iterative community assessments

U bilit t di Usability studies DataONE International Users Group

17

Effecting change2 L i ti b i f t t2. Leverage existing cyberinfrastructure

KNB, ESDIS, and Waters Networks19

Investigator Toolkit

Many existing open source efforts exist Data management: MATT, UDig, Specify

Analysis and modeling: R, Octave

Workflow systems: Kepler, Taverna, Triana, Pegasus

Grid systems: C d Gl b BOINC Grid systems: Condor, Globus, BOINC

Data and workflow portals: VegBank, myExperiment

Commercial tools important toop Excel, MATLAB, SAS, ArcGIS

DataONE: help communities build their own toolsI t t i t t t bili Integrate, interoperate, stabilize

Create libraries to DataONE Service Interface

20

Effecting change

Career Long Learning: Best Practice Guide

3. Educate

g g• best practice guides

• exemplary data management plans

Best Practice Guide

How to Cite Your Data

Using Metadata fore-research

5 in a series

Gold Star Data Management

Plan

Here’s HowBest Practice Guide

• podcasts, web-casts• workshops and seminars• downloadable curricula

6 in a seriesHow to Cite Your Data

6 in a series

Engaging citizens in science

www.CitizenScience.org

Effecting change

4. Enable new science and demonstrate success

48 000 D t S t R dPilot Catalog: Simple Search Interface

NBII Metadata Clearinghouse (39,550)

Long Term Ecological Research (LTER) Network(6,897)

>48,000 Data Set Records(searches entire metadata record)

ORNL Distributed Active Archive Center for Biogeochemical Data (810)

Large Scale Biosphere-Atmosphere Experiment in Amazonia (LBA) (783)

Organization of Biological Field Stations (124)

Inter-American Institute for Global Change Research (IAI) (79)

MODIS and ASTER Products (LPDAAC) (38)

National Phenology Network (USANPN) (29)

23

Eurasian Collared Dove in North America

Search can be emailed, bookmarked, or placed in RSS feed

Search can be refined using multiple filtersmultiple filters

24

Exploration, Visualization and Analysis (EVA) Working Group Objectives

Guide creation of scientific kfl d D ONE dworkflows and DataONE data

exploration, visualization and analysis functionality.analysis functionality.

EVA Tools

25

1st EVA Example: Vegetation and Bird Phenology at 130,000 sites in North America

Phenology(Start of Season)

May

1ay

18

28M

a

Understand how bird migration patterns

(White et al. 2009)June

change through time

26

The Process

Iterative data-intensive workflow – VisTrails Data synthesis

S ti T l E l t M d l (STEM) Spatio-Temporal Exploratory Model (STEM)

1

n(s, t)f i(X,s, t)I(s,t i)

i1

m

27

Results: Spring Migration Detail

• Model predictions show ppopulation timing and location through migration

• Early April – Concentration • Early April Concentration on Gulf of Mexico Coast

• Mid April – Concentration l Mi i i i Ri ll

Occurrence of Indigo Bunting (2008)

Neotropical migrant that winters in central

along Mississippi River valley

• Mid May – Breeding distribution p g

and south America

28

Spring Migration Results

• Early April –Concentration on Gulf of Mexico Coast

• Mid April –Concentration

April 7 April 15

Concentration along Mississippi River valley

• Mid May –Breeding distribution

May 14April 2629

Experimentation/Forecasting

Red-eyed Vireo(Vireo olivaceus)

Questions:

1. Is Red-eyed Vireo i ti ti i

Experiment:

1. Fit STEM with NDVI predictormigration timing

associated with NDVI?2. If so, how is migration timing,

predictor2. Advance “Greening Date” by 14 days3. Plot “Spring Arrival g g,

direction, and speed affected by NDVI?

p gDates” as the first date when predicted probability of occurrence > 5%

30

Challenges

“i-Science” – (from the perspective of the scientist) Fast

D t & t l di Data & tool discovery Automated/automatic metadata annotation Visualization (i.e., rapid analysis)

Easy Data & metadata upload Integrated linked and synthesized databases Integrated, linked, and synthesized databases Scientific workflows Interoperability (but hidden from user)

U bilit (i t iti ) Usability (intuitiveness) Cheap

Data preservationp Free, open source tools

But, also “support” for tools I am used to – e.g., Excel31

The “elephant in the room” challenge

It’s the metadata …. need tools and approaches that are: Fast ! Easy !!! Cheap !

32

DataONE Institutions, Personnel & Support

Uni ersit of Ne Me ico William Michener Rebecca Koskela Mark Ser illa University of New Mexico – William Michener, Rebecca Koskela, Mark Servilla Cornell University – Paul Allen, Steve Kelling CSIRO (Australia) – Donald Hobern Duke University – Ryan Scherle, Todd Vision Ecological Society of America – Cliff Duke

O k Rid N i l L b B b C k J h C bb Li P h d B Wil Oak Ridge National Laboratory – Bob Cook, John Cobb, Line Pouchard, Bruce Wilson University of California Curation Center – Patricia Cruse, John Kunze University of California Davis – Bertram Ludaescher University of California Santa Barbara – Stephanie Hampton, Matt Jones University of Edinburgh (Scotland) – Peter Buneman University of Illinois – Urbana Champaign – Randy Butler, Von Welch University of Illinois – Chicago – Robert Sandusky University of Kansas – Dave Vieglais University of Manchester (UK) – Carole Goble University of Michigan – Peter Honeyman University of Southampton (UK) – David DeRoure University of Southern California – Ewa Deelman University of Tennessee – Suzie Allard, Carol Tenopir, Maribeth Manoff USGS (National Biological Information Infrastructure, National Phenology Network) – Mike Frame, Vivian Hutchison, Jake Weltzin Utah State University – Jeffery Horsburghy y g Steve Kelling1, Daniel Fink1, Suresh Santhana Vannan2, Kevin Webb1, Robert Cook2, William Michener3, Jeff Morisette4, Claudio Silva5,

Tom Dietterich6, Patrick O’Leary7, Alyssa Rosemartin4, and Damian Gessler8

1Lab of Ornithology, Cornell University and 2Oak Ridge National Laboratory, 3University of New Mexico, 4U.S. Geological Survey, 5University of Utah, 6Oregon State University, 7Idaho National Laboratory, 8I-Plant

NSF CISE Pathways Computational Sustainability 0832782, Leon Levy Foundation, and National Aeronautics and Space Administration

33

Data Intensive Research

34