Data Standards & Best Practices
Kerstin Lehnert
Lamont-Doherty Earth Observatory
iedadata.org
Vouchering the Stratigraphic Record
A synthesis database?
  Aggregates data that are published in articles or in data repositories
  Requirements: Integration, Quality (Trusted data!)
  Needs standardized metadata, semantics, and persistent unique identifiers
A trusted repository?
  Publishes and ensures persistent access to data
  Requirements: Compliance with international data curation and repository standards
  Long-term preservation, data identification (DOI), editorial procedures, etc.
Data Standards
“documented agreements on representation, format, definition, structuring, tagging, transmission, manipulation, use, and management of data.”
Discipline specific
Data type specific
Application specific
Data Standards: Why?
Re-usability of data
Reproducibility of science
Integration/interoperability of data
Reproducibility in the Field Sciences
Workshop in May 2015, organized by AAAS (M. McNutt), AGU, and ESA, funded by the Arnold Foundation
Report in preparation
Technical Requirements for Transparent, Reproducible Data
1. The data themselves must be publicly available in machine-readable, non-proprietary formats with accurate and precise descriptive metadata;
2. Data provenance (the process or processes by which usable datasets were generated or derived from raw, often streaming or machine-readable-only data) must be accurately and precisely specified;
3. Computer code (“scripts”) and software with which datasets were analyzed must be available and adequately described to ensure their repeated use, and be publicly available in non-proprietary formats; and
4. Version control should be used to ensure that the original data and code are maintained.
(from draft workshop report)
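The four requirements above can be pictured as a small machine-readable "sidecar" record stored next to a data file. The following is only a sketch under stated assumptions: the function `describe_dataset`, the field names, the file name, and the sample values are all hypothetical and do not come from the workshop report.

```python
import hashlib
import json
from datetime import datetime, timezone


def describe_dataset(data_bytes, title, derived_from, processing_steps, code_version):
    """Build a machine-readable description of one dataset file (illustrative fields)."""
    return {
        "title": title,
        # The checksum pins the description to the exact bytes it describes.
        "sha256": hashlib.sha256(data_bytes).hexdigest(),
        "size_bytes": len(data_bytes),
        # Provenance: which raw inputs were used and how they were processed.
        "derived_from": derived_from,
        "processing_steps": processing_steps,
        # Version of the analysis code, e.g. a tag or commit in version control.
        "code_version": code_version,
        "described_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical data in a non-proprietary format (CSV text).
csv_text = "sample_id,depth_m,sio2_wt_pct\nS-001,1.5,49.8\nS-002,3.0,51.2\n"

record = describe_dataset(
    csv_text.encode("utf-8"),
    title="Hypothetical section geochemistry",
    derived_from=["raw_instrument_log.txt"],
    processing_steps=["drift correction", "normalization to reference material"],
    code_version="v0.1.0",
)

# JSON is itself machine-readable and non-proprietary.
sidecar = json.dumps(record, indent=2, sort_keys=True)
print(sidecar)
```

Keeping the description in plain JSON beside the data means both can be placed under the same version control, which is one way to read requirement 4.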
Coalition for Publishing Data in the Earth & Space Sciences (COPDESS)
Joint initiative of Earth Science publishers and Data Facilities to better help translate the aspirations of open, available, and useful data from policy into practice.
Reaffirm and ensure adherence to existing journal and publishing policies and society position statements regarding open data sharing and archiving of data, tools, and models.
Ensure that Earth science data will, to the greatest extent possible, be stored in community approved repositories that can provide additional data services.
Statement of Commitment signed by all major Earth & Space Science publishers
www.copdess.org
Repository Standards
Open access
Data quality assurance (editorial process)
Persistence (long-term preservation)
Persistent & unique identification of data (DOI registration)
Standards-based metadata (ISO) & APIs (OAI-PMH)
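As an illustration of the API item, OAI-PMH harvesting works through plain HTTP requests that carry a `verb` and a few standard arguments. The sketch below only builds such a request URL. The base URL is a placeholder, not a real repository endpoint, and the helper name `list_records_url` is hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL.

    `verb`, `metadataPrefix`, `set`, and `from` are standard OAI-PMH arguments;
    `oai_dc` (Dublin Core) is the metadata format the protocol requires every
    repository to support.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint (assumption): substitute the repository's documented base URL.
BASE_URL = "https://repository.example.org/oai"

url = list_records_url(BASE_URL, from_date="2015-01-01")
print(url)
```

A harvester would fetch this URL and parse the XML response. That step is left out here because it depends on the repository.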
[Figure: "Adding Value", from small data to BIG DATA. Labels: findable, accessible, re-usable, interoperable; identification, persistence; protection, protocols; context, provenance; harmonized, machine-readable. Repository types: Generic Repositories, Community Data Collections, Domain Repositories.]
Distributed Data Curation
Alert: Stratigraphy is multi-disciplinary
There are many data types that already have homes
  Paleobio Database
  Macrostrat/Digital Crust
  Geochron (@IEDA)
  MagIC
  Open Core Data (@IEDA – under development)
  EarthChem (@IEDA)
  System for Earth Sample Registration (@IEDA)
Don’t reinvent, but leverage, link, & integrate!
EarthCube
EarthCube: A Process
Get all the info at: http://earthcube.org
[Diagram labels: Computer Sciences; Software Engineers; Scientific Vision; Technical Architecture; Engagement; Funded Projects]
Back to Data Standards
Metadata
Content
Structure (data model)
Vocabularies & Taxonomies
Identifiers
(API = Application Programming Interface)
Metadata Standards
Geospatial
Scientific Context
Object classifications
Methods (instrumentation, computation, etc.)
Actions, dates, actors
Data provenance (references, authors, etc.)
Open Geospatial Consortium (OGC): Observations & Measurements
[Diagram labels: Observation; Result; Feature of Interest; Sampling; Sampling Feature (e.g. Station, Transect, Section)]
“Observations commonly involve sampling of an ultimate feature of interest. This International Standard defines a common set of sampling feature types classified primarily by topological dimension, as well as samples for ex-situ observations.” (OGC O&M 2.0.0 / ISO 19156; editor: Simon Cox)
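The relationship described in the quotation can be sketched as a tiny data model: an observation is made on a sampling feature, which in turn stands in for the ultimate feature of interest. This is a heavily simplified illustration and not the O&M schema itself. The class layout, the attribute names in Python style, and all values are assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureOfInterest:
    """The real-world thing we ultimately want to know about."""
    name: str


@dataclass(frozen=True)
class SamplingFeature:
    """A station, transect, section, or specimen that samples the feature."""
    name: str
    kind: str  # e.g. "Station", "Transect", "Section", "Specimen"
    sampled_feature: FeatureOfInterest


@dataclass(frozen=True)
class Observation:
    """One result for one observed property, obtained by one procedure."""
    observed_property: str
    procedure: str
    feature: SamplingFeature
    result: float
    unit: Optional[str] = None

    def ultimate_feature(self) -> FeatureOfInterest:
        # Follow the sampling link back to what the observation is really about.
        return self.feature.sampled_feature


# Hypothetical example values.
formation = FeatureOfInterest("Hypothetical Formation A")
section = SamplingFeature("Measured Section 1", "Section", formation)
obs = Observation("d13C", "isotope ratio mass spectrometry", section, -1.2, "permil")

print(obs.ultimate_feature().name)
```

Keeping the sampling feature as its own object is what lets many observations, possibly held in different databases, be tied back to the same section or specimen.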
Observation Data Model v2
Kerstin Lehnert: "Making small data BIG: Insights from a Long-tail Geoscience Domain"
ODM2 Team: J S Horsburgh, A K Aufdenkampe, L Hsu, A Jones, K Lehnert, E Mayorga, L Song, D Tarboton, I Zaslavsky
Data Templates
LPSC 2015 Workshop: Restoration and Synthesis of Planetary Geochemical Data
Persistent Unique Identifiers
Samples: IGSN
Dataset: DOI
Article publication: DOI
Awards & grants: FundRef
Researchers: ORCID
Field Program: Cruise ID
Data DOI Metadata
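One practical property of well-designed identifiers is that typos can be caught before a record is linked to the wrong entity. As a sketch, ORCID iDs end in a check character computed with ISO 7064 MOD 11-2, and the functions below verify it. The function names are illustrative, and `0000-0002-1825-0097` is a widely used example iD, not a reference to anyone named in these slides.

```python
def orcid_check_char(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if a hyphenated or compact ORCID iD has a correct check character."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_char(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # well-formed example iD
print(is_valid_orcid("0000-0002-1825-0098"))  # last character altered
```

A check like this only guards against transcription errors. Whether an identifier is actually registered still has to be confirmed with the issuing registry.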
Internet of Samples in the Earth Sciences
Physical samples need to be linked to the digital data generated by their study.
  Reproducibility! Access to the physical samples is required to verify & reproduce observations.
  Re-usability! Access to information about samples is required for proper evaluation & interpretation of sample-based data.
Physical samples need to be shared broadly for use & re-use.
  Samples are often expensive to collect (drilling, remote locations).
  Many samples are unique and irreplaceable.
  Re-analysis augments utility of existing data.
  Samples often serve in ways that the collectors and repositories could not have imagined.
3/26/2015
Unique Sample Identification
Imagine the possibilities …
  Easily find a specific sample and contact its owner
  Find all publications that mention a specific sample
  Find all data for that sample across the literature and distributed databases
  Find other samples with similar properties
    geospatial
    temporal
    compositional
Sample Identification Until Now
Samples have ambiguous and non-persistent names and cannot be properly cited.
The EarthChem Portal shows 75 publications with geochemical data referenced to a sample with the name M1 (or M-1). (www.earthchem.org)
Names of dredge sample 3 of the Amphitrite cruise (PetDB database, www.petdb.org)
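The naming problem can be reproduced in a few lines. In this sketch, records from different publications are grouped by a normalized sample name. The records, publication labels, and the normalization rule are all hypothetical and are not taken from EarthChem or PetDB.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Collapse case, whitespace, hyphens, and underscores in a sample name."""
    return re.sub(r"[\s\-_]+", "", name).upper()


# Hypothetical records: (sample name as published, publication label).
records = [
    ("M1", "Publication A"),
    ("M-1", "Publication B"),
    ("m 1", "Publication C"),
    ("KL-07", "Publication D"),
]

by_name = defaultdict(list)
for name, publication in records:
    by_name[normalize_name(name)].append(publication)

# Names shared by more than one publication: same physical sample, or coincidence?
ambiguous = {name: pubs for name, pubs in by_name.items() if len(pubs) > 1}
print(ambiguous)
```

Nothing in the names themselves says whether the three "M1" records refer to one rock or to three unrelated ones. That gap is what a registered, globally unique identifier is meant to close.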
Sample Identification From Now:
IGSN: International Geo Sample Number
Persistent unique identifier for physical objects in the Earth Sciences
  Global uniqueness guaranteed via governance by the IGSN e.V.
Persistent access and preservation of sample metadata
  Cataloguing services of IGSN e.V. members
  Allows a central search engine to be built
  Resolving service of the IGSN central registry
Does not replace personal or institutional naming protocols
IGSN: Examples
Oriented Core Drill Hole (ODP)
Soil Section Rock Specimen
IGSN Status
International governance established in 2011
  14 members (organizations) in the IGSN e.V. (www.igsn.org)
ca. 4 million samples registered (registration tripled in 2014)
>350 active users, including
  increasing number of individual scientists
  sample repositories & museums (Smithsonian, marine cores)
  geological surveys (USGS, Geoscience Australia, BGR)
  large-scale observatories and sampling campaigns (ICDP, IODP, CZO, DCO, GeoPRISMS, etc.)
IGSN Adoption
COPDESS Statement of Commitment
IGSN in Action
IGSN in Action: Publications
Metadata
Identification: Sample name(s), registrant
Description: Material, classification, age, size, comments
Geospatial information: Geographical names, coordinates
Collection: Expedition/cruise, platform, date, collector, technique
Archiving/access: Physical location of sample (repository), contact
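The metadata groups listed above map naturally onto a structured record. The following sketch is not the IGSN or SESAR schema: the class name `SampleRecord`, the field names, and the example values (including the identifier) are hypothetical and serve only to show how such a record could be checked for completeness.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SampleRecord:
    # Identification
    identifier: str
    sample_names: List[str]
    registrant: str
    # Description
    material: str
    classification: Optional[str] = None
    age: Optional[str] = None
    size: Optional[str] = None
    comments: Optional[str] = None
    # Geospatial information
    geographic_names: List[str] = field(default_factory=list)
    coordinates: Optional[Tuple[float, float]] = None  # (latitude, longitude)
    # Collection
    expedition: Optional[str] = None
    platform: Optional[str] = None
    collection_date: Optional[str] = None
    collector: Optional[str] = None
    technique: Optional[str] = None
    # Archiving/access
    repository: Optional[str] = None
    contact: Optional[str] = None

    def missing_fields(self):
        """List the fields that are still empty, to guide curation."""
        return [k for k, v in asdict(self).items() if v in (None, [], "")]


# Hypothetical sample; the identifier is a made-up placeholder.
sample = SampleRecord(
    identifier="XXX000001",
    sample_names=["M1", "M-1"],
    registrant="A. Researcher",
    material="Rock",
    coordinates=(12.5, -45.0),
    technique="dredge",
)
print(sample.missing_fields())
```

Storing personal names in `sample_names` alongside the registered identifier reflects the point that an identifier supplements local naming rather than replacing it.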
IGSN Sample “Genealogy”
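A sample genealogy can be thought of as parent links between registered objects, for example a subsample taken from a core section that was cut from a core. The sketch below assumes exactly that reading. The identifiers and the `parent_of` mapping are hypothetical, and the real registry may model these relations differently.

```python
# Hypothetical child -> parent links between sample identifiers.
parent_of = {
    "XXX000002": "XXX000001",  # core section cut from a core
    "XXX000003": "XXX000002",  # subsample taken from that section
}


def lineage(sample_id, parents):
    """Return the chain from a sample up to its top-level ancestor."""
    chain = [sample_id]
    seen = {sample_id}
    while chain[-1] in parents:
        parent = parents[chain[-1]]
        if parent in seen:
            # Guard against malformed data that would loop forever.
            raise ValueError(f"cycle detected at {parent}")
        chain.append(parent)
        seen.add(parent)
    return chain


print(lineage("XXX000003", parent_of))
```

Walking the chain upward is what lets data measured on a small subsample be traced back to the core or hole it came from.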
Extended IGSN Metadata
Images
Documents (.pdf, .xls, .doc)
References
URLs for related data resources
User defined metadata
Internet of Samples in the Earth Sciences
iSamples RCN
Advance use of innovative CI to connect physical samples across the Earth Sciences with digital data infrastructure
Goals:
  Improve discovery, access, and re-usability of physical samples
  Improve re-usability and reproducibility of the data generated by their study
Registries & Catalogs
Metadata
Identifiers
Citation
Repositories
Software Tools
Taxonomies
C4P: Collaboration & Cyberinfrastructure for Paleoscience
An EarthCube Research Coordination Network
Unravel the large-scale, long-term evolution of the Earth-Life System through the study of the geological record
Major challenges C4P addresses:
• Heterogeneous & dispersed data
• Modeling of age & time
• Legacy & ‘dark’ data
• Limited interoperability among resources
• Variable semantics & ontologies
A diverse community: paleobiology, paleoclimate, paleoceanography, geochemistry, dendrochronology, stratigraphy, geochronology, sample curation, data management, bioinformatics, semantics, software architecture, and more ...
C4P achievements:
• New resources
• data & software catalogs
• Educational materials (webinars)
• New collaborations
• Convergence on best practices (samples, age, taxonomy)
Take Away Messages
develop leading practices for data
get community buy-in
align & coordinate with existing leading practices
leverage existing infrastructure
get started and don’t let the challenges stop you
“The Hitchhiker’s Guide to Geoinformatics” (Lee Allison, LISTMG Workshop 2004)
“Building an International Collaboration for Geoinformatics” (Walter Snyder, AGU 2005)
“Cyberinfrastructure for Solid Earth Geochemistry” (Kerstin Lehnert, GSA 2003)
The Cultural Challenges
Thank You!
"The wonderful thing about standards is that there are so many of them to
choose from."
(Grace Hopper)