DataONE: Data Observational Network for Earth
Rebecca Koskela, William Michener, Dave Vieglais, Amber Budden
OSTP/NITRD Data Sharing and Metadata Curation: Obstacles and Strategies
May 29, 2013
The metadata problem
[Figure: bar chart of metadata standards in use. Categories: DIF, DwC, DC, EML, FGDC, Open GIS, ISO, My Lab, none. Bar values as extracted: 12, 21, 26, 95, 95, 96, 97, 266, 676.]
Scientists want to share data
• Use other researchers’ datasets if easily accessible: 84%
• Willing to share data across a broad group of researchers: 81%
• Appropriate to create new datasets from shared data: 76%
• Currently share all of their data: 6%
but don’t know how to and, if they do, want to get proper credit for doing so.
Some solutions
• Make it easy to describe data
• Provide credit to the data/metadata author
• Citation
• Promote discoverability
• Mandates (ideally, funded!)
Best Practices and Software Tools
Making it easy to describe data
Intercept researchers where they already work
Data & Metadata (EML)
Credit: Dryad repository for journal data & metadata
Promoting data citations via Dryad
Article: Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
Dryad data package: Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Data from: Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. Dryad Digital Repository. doi:10.5061/dryad.8384
Promoting data discovery
Provide universal access to data about life on earth and the environment
1. Building community
2. Developing sustainable data discovery and interoperability solutions
3. Enabling science through tools and services
[Figure: data life cycle with the stages Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze.]
DataONE: Three major components for a flexible, scalable, sustainable network
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
Investigator Toolkit
DataONE: Enabling data discovery
[Figure: Member Nodes (ORNL DAAC, KNB, PISCO, SANParks, ESA, USGS CSAS, ONEShare, UC Merritt, LTER, CLO/AKN) hold metadata in FGDC, ISO, DIF, and EML. Diagram labels: Extract and Align Metadata, Augment Metadata, Internal Metadata Index, Search API.]
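The "extract and align metadata" step in this diagram can be illustrated with a small crosswalk. The sketch below is not DataONE's implementation: it assumes simplified, namespace-free sample records, and the index field names (`title`, `abstract`) and the `align` helper are hypothetical. The element paths follow the EML and FGDC CSDGM layouts for title and abstract.

```python
import xml.etree.ElementTree as ET

# Hypothetical crosswalk: common index field -> element path per standard.
# Paths are relative to the record's root element.
CROSSWALK = {
    "EML": {
        "title": "dataset/title",
        "abstract": "dataset/abstract/para",
    },
    "FGDC": {
        "title": "idinfo/citation/citeinfo/title",
        "abstract": "idinfo/descript/abstract",
    },
}


def align(standard: str, xml_text: str) -> dict:
    """Extract the crosswalked fields from one metadata record."""
    root = ET.fromstring(xml_text)
    record = {"standard": standard}
    for field, path in CROSSWALK[standard].items():
        node = root.find(path)
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record


# Simplified sample records (real EML uses a namespaced root element).
EML_DOC = """<eml><dataset><title>Stream temperature 2012</title>
<abstract><para>Hourly stream temperature.</para></abstract></dataset></eml>"""

FGDC_DOC = """<metadata><idinfo><citation><citeinfo>
<title>Kelp forest survey</title></citeinfo></citation>
<descript><abstract>Annual kelp counts.</abstract></descript></idinfo></metadata>"""

# Records from different standards end up in one uniform index.
index = [align("EML", EML_DOC), align("FGDC", FGDC_DOC)]
for rec in index:
    print(rec)
```

Keeping the per-standard paths in one table means that supporting another standard only requires a new table entry, not new extraction code.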
Information Center for the Environment (ICE), UC Davis
[Figure: diagram labels: ICE Collectors, ICE Collects Water Data, ICE Users (agencies, citizens, faculty), DataONE Users, Investigator Toolkit.]
Some remaining challenges
• Semantic mediation
• Provenance
• Improving metadata quality over time
Powerful Data Discovery via Semantics
[Figure: diagram labels: query, term matching (TF-IDF), topic model, formal ontologies/controlled vocabularies, outcomes.]
Outcomes
• Enhanced models for knowledge representation in earth and environmental sciences
• Powerful model-driven search interface for data discovery
• Improved precision, improved recall, automated annotation
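Term matching with TF-IDF, the simplest of the discovery techniques named on this slide, can be sketched in a few lines. This is a minimal illustration, not DataONE's search implementation: the dataset descriptions, the identifiers, the whitespace tokenizer, and the `+ 1.0` smoothing of the inverse document frequency are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace, keeping alphabetic tokens only."""
    return [t for t in text.lower().split() if t.isalpha()]


def tfidf_rank(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """Rank documents by the summed TF-IDF weight of the query terms."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))

    scores = []
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Smoothed IDF (assumption): common terms still contribute.
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(toks)) * idf
        scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical metadata descriptions.
DOCS = {
    "ds1": "soil moisture and soil temperature at forest plots",
    "ds2": "stream water temperature in mountain streams",
    "ds3": "bird counts from annual forest surveys",
}

ranked = tfidf_rank("soil temperature", DOCS)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Pure term matching misses synonyms and related concepts, which is the gap the topic models and ontologies on the slide are meant to close.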
Provenance
Origin, context, derivation, ownership, history of (data) artifacts
• Record processing history, data lineage
• dependency graph
• W3C standard: PROV
• DataONE Extension: D-PROV
• Workflow provenance
• System agnostic!
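A dependency graph for data lineage can be sketched with plain Python structures. The example below borrows the relation names `used` and `wasGeneratedBy` from the W3C PROV data model; the file names, run identifiers, and helper functions are hypothetical, and nothing here reflects the actual D-PROV model.

```python
from collections import defaultdict

# Each edge reads "subject --relation--> object".
# All identifiers are hypothetical.
EDGES = [
    ("clean.csv", "wasGeneratedBy", "run:qc"),
    ("run:qc", "used", "raw.csv"),
    ("plot.png", "wasGeneratedBy", "run:plot"),
    ("run:plot", "used", "clean.csv"),
    ("run:plot", "used", "sites.csv"),
]


def build_graph(edges):
    """Map each node to the nodes it directly depends on."""
    graph = defaultdict(list)
    for subject, _relation, obj in edges:
        graph[subject].append(obj)
    return graph


def lineage(node, graph):
    """Return everything `node` depends on, directly or indirectly."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for upstream in graph.get(current, []):
            if upstream not in seen:
                seen.add(upstream)
                stack.append(upstream)
    return seen


graph = build_graph(EDGES)
print(sorted(lineage("plot.png", graph)))
```

Because the traversal only follows edges, it does not care which workflow system recorded them, which is the point of a system-agnostic provenance model.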
Improving metadata quality for data reuse
[Figure: information content plotted against time (< 1 yr) across the stages Planning, Collection, Assure, Documentation, Archive, with a level marked "Sufficient for Sharing and Reuse".]
Mandates (ideally, funded!)
DataONE: Supporting scientific data preservation, discovery, and innovation