Ten Habits of Highly Successful Data

DESCRIPTION

Slides for http://www.discoveryinformaticsinitiative.org/ workshop, Quebec, Sunday July 27 2014

Text of Ten Habits of Highly Successful Data

Ten habits of highly effective data: Helping your dataset achieve its full potential
Anita de Waard, VP Research Data Collaborations
a.dewaard@elsevier.com
http://researchdata.elsevier.com/
Quebec City, Canada, July

Who cares about Research Data?

Funding bodies:
  • Demonstrate impact
  • Guarantee permanence, discoverability
  • Avoid fraud
  • Avoid double funding
  • Serve general public

Research Management/Library:
  • Generate, track outputs
  • Comply with mandates
  • Ensure availability

Researchers:
  • Derive credit
  • Comply with mandates
  • Discover and use
  • Cite/acknowledge

Phil Bourne, Associate Director for Data Science at NIH: "Foster an ecosystem that enables biomedical research to be done as a digital enterprise."

Mike Huerta, Associate Director, NLM: "Today, the major public product of science are concepts, written down in papers. But tomorrow, data will be the main product of science. We will require scientists to track and share their data at least as well, if not better, than they are sharing their ideas today."

Nathan Urban, PI Urban Lab, CMU, 3/13: "If we can share our data, we can write a paper that will knock everybody's socks off!"

Barbara Ransom, NSF Program Director Earth Sciences: "We're not going to spend any more money for you to go out and get more data! We want you first to show us how you're going to use all the data we paid y'all to collect in the past!"

What's the problem?

One example: Using antibodies and squishy bits
  • Grad students experiment and enter details into their lab notebook.
  • The PI then tries to make sense of their slides, and writes a paper.
  • End of story.

Maslow's Hierarchy of Needs (for Research Data)

  1. Preserved (existing in some form)
  2. Archived (long-term & format-independent)
  3. Accessible (can be accessed by others)
  4. Comprehensible (others can understand data & processes)
  5. Discoverable (can be indexed by a system)
  6. Reproducible (others can redo experiments)
  7. Trusted (validated/checked by reviewers)
  8. Citable (able to point & track citations)
  9. Usable (allow tools to run on it)

1. Preserve: Data Rescue Challenge
  • With IEDA/Lamont: award successful data rescue attempts
  • Awarded at AGU 2013
  • 23 submissions of data that was digitized, preserved, made available
  • Winner: NIMBUS Data Rescue: Recovery, reprocessing and digitization of the infrared and visible observations along with their navigation and formatting. Over 4000 7-track tapes of global infrared satellite data were read and reprocessed. Nearly 200,000 visible light images were scanned, rectified and navigated. All the resultant data was converted to HDF-5 (NetCDF) format and freely distributed to users from NASA and NSIDC servers. This data was then used to calculate monthly sea ice extents for both the Arctic and the Antarctic.
  • Conclusion: we (collectively) need to do more of this! How can we fund it?

2. Archive: Olive Project
  • CMU CS & Library: funded by a grant from the IMLS, Elsevier is partner
  • Goal: Preservation of executable content, nowadays a large part of intellectual output and very fragile
  • Identified a series of software packages and prepared VM to preserve
  • Does it work? Yes, see video (1:24)

3. Access: Urban Legend
Part 1: Metadata acquisition
  • Step through experimental process in series of dropdown menus in simple web UI
  • Can be tailored to workflow of individual researcher
  • Connected to shared ontologies through lookup table, managed centrally in lab
  • Connect to data input console (Igor Pro)

4. Comprehend: Urban Legend
Part 2: Data Dashboard
  • Access, select and manipulate data (calculate properties, sort and plot)
  • Final goal: interactive figures linked to data
  • Plan to expand to more labs, other data

5. Discover: Data Discovery Index
  • NIH interested in creating DDI consortium
  • Three places where data is deposited:
      1. Curated sources for a single data type (e.g. Protein Data Bank, VentDB, Hubble Space Data)
      2. Non- or semicurated sources for different data types (e.g. DataDryad, Dataverse, Figshare)
      3. Tables in papers
  • Ways to find this:
      Cross-domain query tools, e.g. NIF, DataOne, etc.
      Search for papers -> link to data
      How to find data in papers??
  • Propose to build prototypes across all of these data sources: Needs NLP, models of data patterns? What else?
[Diagram labels: Papers, Non-curated DBs, Curated DBs]

6. Reproduce: Resource Identifier Initiative
  • Force11 Working Group to add data identifiers to articles that are 1) Machine readable; 2) Free to generate and access; 3) Consistent across publishers and journals.
  • Authors publishing in participating journals will be asked to provide RRIDs for their resources; these are added to the keyword field
  • RRIDs will be drawn from:
      The Antibody Registry
      Model Organism Databases
      NIF Resource Registry
  • So far, Springer, Wiley, Biomednet, Elsevier journals have signed up with 11 journals, more to come
  • Wide community adoption!

7. Trust: Moonrocks
  • How can we scale up data curation?
  • Pilot project with IEDA: Lunar geochemistry database: leapfrog & improve curation time
  • 1-year pilot, funded by Elsevier
  • If spreadsheet columns/headers map to RDB schema, we can scale up curation process and move from tables -> curated databases!

8. Cite: Force11 Data Citation Principles
  • Another Force11 Working group
  • Defined 8 principles: Now seeking endorsement/working on implementation

  1. Importance: Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.
  2. Credit and attribution: Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.
  3. Evidence: Where a specific claim rests upon data, the corresponding data citation should be provided.
  4. Unique Identification: A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.
  5. Access: Data citations should facilitate access to the data themselves and to such associated me