
Page 1: The habits of highly successful data:

The habits of highly successful data: How to help your dataset achieve its full potential

University of Illinois, Urbana-Champaign, May 7, 2014

Anita de Waard, VP Research Data Collaborations

[email protected]

http://researchdata.elsevier.com/

Page 2: The habits of highly successful data:

Why should we care about Research Data?

Funding bodies:
• Demonstrate impact
• Guarantee permanence, discoverability
• Avoid fraud
• Avoid double funding
• Serve the general public

Research Management/Library:
• Generate, track outputs
• Comply with mandates
• Ensure availability

Phil Bourne, (then) Associate Vice Chancellor, UCSD, 4/13: “We need to think about the university as a digital enterprise.”

Mike Huerta, Associate Director, NLM: “Today, the major public products of science are concepts, written down in papers. But tomorrow, data will be the main product of science…. We will require scientists to track and share their data at least as well as, if not better than, they are sharing their ideas today.”

Researchers:
• Derive credit
• Comply with mandates
• Discover and use
• Cite/acknowledge

Nathan Urban, PI Urban Lab, CMU, 3/13: “If we can share our data, we can write a paper that will knock everybody’s socks off!”

Barbara Ransom, NSF Program Director Earth Sciences: “We’re not going to spend any more money for you to go out and get more data! We want you first to show us how you’re going to use all the data we paid y’all to collect in the past!”

Page 3: The habits of highly successful data:

What’s the problem? One example:

Using antibodies and squishy bits: grad students experiment and enter details into their lab notebook. The PI then tries to make sense of their slides, and writes a paper. End of story.

Page 4: The habits of highly successful data:

Maslow’s Hierarchy of Needs for Research Data

1. Preserved (existing in some form)
2. Archived (long-term & format-independent)
3. Accessible (can be accessed by others)
4. Comprehensible (others can understand data & processes)
5. Discoverable (can be indexed by a system)
6. Reproducible (others can redo experiments)
7. Trusted (validated/checked by reviewers)
8. Citable (able to point & track citations)
9. Usable (allow tools to run on it)

Page 5: The habits of highly successful data:

1. Preserve: Data Rescue Challenge

• With IEDA/Lamont: award successful data rescue attempts
• Awarded at AGU 2013
• 23 submissions of data that was digitized, preserved, and made available
• Winner: NIMBUS Data Rescue:
  – Recovery, reprocessing and digitization of the infrared and visible observations, along with their navigation and formatting.
  – Over 4,000 7-track tapes of global infrared satellite data were read and reprocessed.
  – Nearly 200,000 visible-light images were scanned, rectified and navigated.
  – All the resultant data was converted to HDF-5 (NetCDF) format and freely distributed to users from NASA and NSIDC servers (see the sketch below).
  – This data was then used to calculate monthly sea ice extents for both the Arctic and the Antarctic.
• Conclusion: we (collectively) need to do more of this! How can we fund it?
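As a rough illustration of that last conversion step, here is a minimal sketch (not the NIMBUS pipeline) of writing a rescued image grid to an HDF5-backed NetCDF file with the netCDF4 Python library; the file name, variable name and data are hypothetical/synthetic.

    import numpy as np
    from netCDF4 import Dataset

    # Stand-in for one rescued, rectified visible-light frame (synthetic values)
    brightness = np.random.rand(180, 360).astype("f4")

    with Dataset("nimbus_rescued_example.nc", "w", format="NETCDF4") as nc:
        nc.createDimension("lat", 180)
        nc.createDimension("lon", 360)
        var = nc.createVariable("brightness", "f4", ("lat", "lon"), zlib=True)
        var.units = "dimensionless"
        var[:] = brightness
        nc.title = "Synthetic example frame, not real Nimbus data"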


Page 6: The habits of highly successful data:


2. Archive: Olive Project

• CMU CS & Library; funded by a grant from the IMLS, Elsevier is a partner
• Goal: preservation of executable content – nowadays a large part of intellectual output, and very fragile
• Identified a series of software packages and prepared VMs to preserve them (see the sketch below)
• Does it work? Yes – see video (1:24)
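As a generic stand-in (assumed, not Olive's own tooling) for what running preserved executable content can look like, the sketch below boots an archived disk image locally with QEMU from Python; the image file and memory size are hypothetical.

    import subprocess

    # Boot a preserved environment from an archived disk image (hypothetical file).
    # -snapshot discards writes so the archived image itself stays pristine.
    subprocess.run([
        "qemu-system-x86_64",
        "-m", "512",
        "-hda", "archived_software.qcow2",
        "-snapshot",
    ])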

Page 7: The habits of highly successful data:


3. Access: Urban Legend


• Part 1: Metadata acquisition
• Step through the experimental process in a series of dropdown menus in a simple web UI
• Can be tailored to the workflow of an individual researcher
• Connected to shared ontologies through a lookup table, managed centrally in the lab (see the sketch below)
• Connects to the data input console (Igor Pro)
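A minimal sketch of that lookup-table idea (assumed, not the lab's actual code): free-text dropdown choices are resolved to shared ontology identifiers, so downstream tools see controlled terms. All term names and IDs below are hypothetical placeholders.

    # Centrally managed lookup table: local UI labels -> shared ontology IDs (placeholders)
    ONTOLOGY_LOOKUP = {
        "pyramidal neuron": "EXAMPLE:0000001",
        "whole-cell patch clamp": "EXAMPLE:0000002",
    }

    def annotate(step: dict) -> dict:
        """Attach ontology IDs to the choices a researcher picked in the web UI."""
        return {
            field: {"label": value, "ontology_id": ONTOLOGY_LOOKUP.get(value)}
            for field, value in step.items()
        }

    record = annotate({"cell_type": "pyramidal neuron",
                       "technique": "whole-cell patch clamp"})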

Page 8: The habits of highly successful data:


4. Comprehend: Urban Legend


• Part 2: Data Dashboard
• Access, select and manipulate data (calculate properties, sort and plot); see the sketch below
• Final goal: interactive figures linked to data
• Plan to expand to more neuroscience labs
• Plan to build for a geochemistry use case
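A toy sketch of the select/derive/sort/plot loop a dashboard supports, using pandas and matplotlib; the column names and values are invented, not the Urban lab's schema.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Invented electrophysiology-style table standing in for data pulled from the lab store
    df = pd.DataFrame({
        "cell_id": ["c1", "c2", "c3"],
        "input_resistance_MOhm": [120.0, 95.5, 150.2],
        "resting_potential_mV": [-65.0, -70.2, -62.8],
    })

    # Select cells of interest, sort by a property, and plot the result
    selected = df[df["input_resistance_MOhm"] > 100].sort_values("resting_potential_mV")
    selected.plot(x="cell_id", y="input_resistance_MOhm", kind="bar")
    plt.tight_layout()
    plt.savefig("dashboard_example.png")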

Page 9: The habits of highly successful data:


5. Discover: Data Indexing proposals

• Collaborated on a Data Discovery Index proposal with UCSD/Carnegie Mellon
• Also worked with UIUC!
• Interested in developing distributed infrastructures for making data easier to search: what is the ‘Goldilocks index’ where search is scalable, yet useful? (See the toy sketch below.)
• Looking for academic/industry partners, use cases and platforms to address the next stage
• Discoverability is a key driver for metadata/data format structure!
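A toy sketch of the trade-off behind the ‘Goldilocks index’ question: indexing only a few metadata fields (here title and keywords) keeps the index small enough to scale while still supporting useful search. The records and field names are hypothetical.

    from collections import defaultdict

    datasets = [
        {"id": "ds1", "title": "Arctic sea ice extent 1964-1970", "keywords": ["sea ice", "nimbus"]},
        {"id": "ds2", "title": "Lunar basalt geochemistry", "keywords": ["moon", "geochemistry"]},
    ]

    # Inverted index over a deliberately small set of metadata fields
    index = defaultdict(set)
    for ds in datasets:
        for token in ds["title"].lower().split() + [k.lower() for k in ds["keywords"]]:
            index[token].add(ds["id"])

    print(index["geochemistry"])   # -> {'ds2'}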


Page 10: The habits of highly successful data:


6. Reproduce: Resource Identifier Initiative

• Force11 Working Group to add data identifiers (RRIDs) to articles that are:
  – 1) Machine readable;
  – 2) Free to generate and access;
  – 3) Consistent across publishers and journals.
• Authors publishing in participating journals will be asked to provide RRIDs for their resources; these are added to the keyword field (see the sketch below)
• RRIDs will be drawn from:
  – The Antibody Registry
  – Model Organism Databases
  – NIF Resource Registry
• So far, Springer, Wiley, Biomednet and Elsevier journals have signed up with 11 journals, more to come
• Wide community adoption!
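To show what ‘machine readable’ buys you, here is a simplified sketch of pulling RRIDs out of manuscript text with a regular expression; real RRID syntax has more variants than this pattern covers, and the sentence and identifiers are made up.

    import re

    text = "Antibodies were validated (RRID:AB_0000001) using tools (RRID:SCR_0000002)."
    rrids = re.findall(r"RRID:\s*[A-Z]+_[A-Za-z0-9_:]+", text)
    print(rrids)   # ['RRID:AB_0000001', 'RRID:SCR_0000002']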


Page 11: The habits of highly successful data:


7. Trust: Moonrocks


How can we scale up data curation? Pilot project with IEDA:
• A database for lunar geochemistry: leapfrog & improve curation time
• 1-year pilot, funded by Elsevier
• Main conclusion: if spreadsheet columns/headers map to the RDB schema, we can scale curation cost! (See the sketch below.)
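A minimal sketch of that header-mapping idea (assumptions mine, not IEDA's pipeline): rename contributed spreadsheet columns to a shared schema via a mapping table, then load the rows into a relational table. File, column and table names are hypothetical.

    import sqlite3
    import pandas as pd

    # Curator-maintained mapping from contributed headers to the database schema
    HEADER_MAP = {"Sample Name": "sample_id", "SiO2 (wt%)": "sio2_wt_pct"}

    df = pd.read_csv("lunar_samples.csv").rename(columns=HEADER_MAP)
    with sqlite3.connect("moondb_example.sqlite") as conn:
        df[list(HEADER_MAP.values())].to_sql("samples", conn, if_exists="append", index=False)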

Page 12: The habits of highly successful data:


8. Cite: Force11 Data Citation Principles

• Another Force11 Working Group
• Defined 8 principles (listed below; a small machine-actionability sketch follows the list):
• Now seeking endorsement / working on implementation


1. Importance: Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.

2. Credit and attribution: Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.

3. Evidence: Where a specific claim rests upon data, the corresponding data citation should be provided.

4. Unique Identification: A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.

5. Access: Data citations should facilitate access to the data themselves and to such associated metadata, documentation, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.

6. Persistence: Metadata describing the data, and unique identifiers, should persist even beyond the lifespan of the data they describe.

7. Versioning and granularity: Data citations should facilitate identification and access to different versions and/or subsets of data. Citations should include sufficient detail to verifiably link the citing work to the portion and version of data cited.

8. Interoperability and flexibility: Data citation methods should be sufficiently flexible to accommodate the variant practices among communities but should not differ so much that they compromise interoperability of data citation practices across communities.
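As one concrete reading of machine-actionable identification and access (principles 4 and 5), the sketch below resolves a dataset DOI to citation metadata via DOI content negotiation; the DOI shown is a placeholder, not a real dataset.

    import requests

    doi = "10.xxxx/example-dataset"   # hypothetical DOI
    resp = requests.get(f"https://doi.org/{doi}",
                        headers={"Accept": "application/vnd.citationstyles.csl+json"})
    if resp.ok:
        meta = resp.json()             # machine-readable citation metadata (CSL JSON)
        print(meta.get("title"), meta.get("author"))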

Page 13: The habits of highly successful data:


9. Use: Executable Papers

• Result of a challenge to come up with cyberinfrastructure components to enable executable papers
• Pilot in Computer Science journals:
  – See all code in the paper
  – Save it, export it
  – Change it and rerun it on the dataset (toy sketch below)
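A toy ‘executable paper’ cell, assuming nothing about the pilot's actual infrastructure: the analysis behind a figure ships with the paper, so a reader can rerun it, or change it and rerun it, against the deposited dataset. The file names are hypothetical and the dataset is assumed to be a two-column CSV.

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.loadtxt("deposited_dataset.csv", delimiter=",")   # hypothetical deposited file
    plt.plot(data[:, 0], data[:, 1], ".")
    plt.xlabel("x"); plt.ylabel("y")
    plt.savefig("figure_reproduced.png")   # edit the lines above and regenerate the figure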


Page 14: The habits of highly successful data:

Putting it all together:


Experimental Metadata: Workflows, Samples, Settings, Reagents, Organisms, etc.

Record Metadata: DOI, Date, Author, Institute, etc.

Processed Data: Mathematically/computationally processed data: correlations, plots, etc.

Raw Data: Direct outputs from equipment: images, traces, spectra, etc.

Methods and Equipment: Reagents, settings, manufacturer’s details, etc.

Validation: Approval, Reproduction, Selection, Quality Stamp

(Diagram arrows: more curation, more usable; a record-level sketch of these layers follows.)
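A hedged sketch of a single dataset record carrying the layers listed above (record metadata, experimental metadata, methods, raw and processed data, validation); every value is an invented placeholder.

    dataset_record = {
        "record_metadata": {"doi": "10.xxxx/placeholder", "date": "2014-05-07",
                            "author": "A. Researcher", "institute": "Example University"},
        "experimental_metadata": {"workflow": "patch-clamp recording", "organism": "mouse"},
        "methods_and_equipment": {"reagents": ["placeholder reagent"],
                                  "settings": {"amplifier": "placeholder model"}},
        "raw_data": ["traces/cell01.abf"],
        "processed_data": ["plots/iv_curve.png", "tables/correlations.csv"],
        "validation": {"status": "reviewed", "quality_stamp": True},
    }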

Page 15: The habits of highly successful data:

So how can we help research data be more happy and productive?

• Group therapy: Force11, W3C, other fora – shared standards help everyone (we play well with others!)

• Financial therapy: we have a lot of content & IT skills to support data-driven processes in grant proposals; funders like us.

• Creative therapy: innovative collaboration projects that expand everyone’s mind – let’s put your data through its paces

• Relationship therapy: happy to address any issues or concerns!

Page 16: The habits of highly successful data:

Collaborations and discussions gratefully acknowledged:
– CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Ed Hovy
– UCSD: Brian Shoettlander, David Minor, Declan Fleming, Ilya Zaslavsky
– NIF: Maryann Martone, Anita Bandrowski
– Force11: Ed Hovy, Tim Clark, Ivan Herman, Paul Groth, Maryann Martone, Cameron Neylon, Stephanie Hagstrom
– OHSU: Melissa Haendel, Nicole Vasilevsky
– Columbia/IEDA: Kerstin Lehnert, Leslie Hsu
– MIT: Micah Altman

Thank you!

http://researchdata.elsevier.com/

Anita de Waard, [email protected]