Introduction to Data Management June 4, 2012 Karen Hanson, MLIS Knowledge Systems Librarian Alisa Surkis, PhD, MLS Translational Science Librarian This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.


Page 1: Introduction to Data Management

Introduction to Data Management

June 4, 2012

Karen Hanson, MLIS

Knowledge Systems Librarian

Alisa Surkis, PhD, MLS

Translational Science Librarian

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

Page 2: Introduction to Data Management

Objectives

Understand…

• current climate around data management and data sharing

• best practices in data documentation and description

• principles of storage and long-term preservation of data

• basic elements of a data management plan

Page 3: Introduction to Data Management

Data management

1. Introduction

2. Incentives

3. Standards for description & documentation

4. Storage, archiving and sharing

5. Data Management Plans

Page 4: Introduction to Data Management

What is data?

• “Facts and statistics collected together for reference or analysis”

Oxford Dictionaries online

http://oxforddictionaries.com/definition/data?q=data

• “Research data, unlike other types of information, is collected, observed, or created, for purposes of analysis to produce original research results.”

University of Edinburgh, Information Services

http://www.ed.ac.uk/schools-departments/information-services/services/research-support/data-library/research-data-mgmt/data-mgmt/research-data-definition

Page 5: Introduction to Data Management

And that means…?

• Tables of numbers

• Sequences of bits (10110) or base pairs (GACTTA)

• Samples, specimens, slides

• Sound recordings, video recordings, images

• Laboratory notebooks

• Protocols, methodologies

• Software (code), algorithms, models

• “A myriad of other information objects, none of which may stand alone” - Christine Borgman

Page 6: Introduction to Data Management

Categories of data

• Observational (real time)

• Experimental (lab)

• Computational (model)

• Derived or Compiled

Source: National Science Board. Long-Lived Digital Data Collections, 2005.

Page 7: Introduction to Data Management

What is data management?

• Not just creation, storage, processing and analysis

• Refers to managing the full lifecycle of data

Page 9: Introduction to Data Management

Data management lifecycle

Processing data

• enter data, digitize, transcribe, translate

• check, validate, clean data

• anonymize data where necessary

• describe data

• manage and store data

Page 10: Introduction to Data Management

Data management lifecycle

Analyzing data

• interpret / analyze data

• write publications

Page 11: Introduction to Data Management

Data management lifecycle

Preserving data

• migrate data to best format / medium

• back-up and store data

• create final metadata and documentation

• archive data

Page 12: Introduction to Data Management

Data management lifecycle

Giving access to data

• distribute / share data

• control access

Page 13: Introduction to Data Management

Why would anyone need my data?

Cow concept: Dorothea Salo, “Save the Cows”, 2009.

http://www.slideshare.net/cavlec/save-the-cows-data-curation-for-the-rest-of-us-1533252

Analyze, process

Publish

Page 14: Introduction to Data Management

You don’t need to kill the cow!

Page 15: Introduction to Data Management

Data management lifecycle

Re-using data

• follow-up research

• new research

• check validity

Page 16: Introduction to Data Management

Mini-series: Part 1

Page 17: Introduction to Data Management

Data management

1. Introduction

2. Incentives

• You and your data

• Government mandates

• Publisher requirements

• Lost credibility

• Faster progress, better science

• Citations

• Big data

3. Standards for description & documentation

4. Storage, archiving and sharing

5. Data management plans

Page 18: Introduction to Data Management

Why worry about data management?

• Bad things can happen if you don’t

• People will make you anyway

• Sharing is win-win

…you can’t share what you can’t find, read, decipher

Page 19: Introduction to Data Management

You and your data

• Make research process more efficient

• Comprehensibility

• Security

… what about other people and your data?

Page 20: Introduction to Data Management

Sharing - sticks!

• Government mandate

o Data sharing

o Data management plans

• Publisher requirements

• Credibility issues

Page 21: Introduction to Data Management

Sharing - carrots!

Faster progress!

Better science!

Also…

o Data becomes citable

o Data linked to publications

Page 22: Introduction to Data Management

Government mandates

Timeline

1999: US Office of Management and Budget revised Circular A-110, making federally funded research data available through Freedom of Information Act requests

2003: NIH adopted a data sharing policy.

(still no teeth, but young yet)

Page 23: Introduction to Data Management

Government mandates

2008: NIH implements the Public Access Policy

2009: White House issues the Open Government Directive

2011 (Jan): NSF made data management plans a requirement

Page 24: Introduction to Data Management

Government mandates (bigger sticks on the way?)

• NSTC’s Interagency Working Group on Digital Data

o 11/2011 Request for Information (RFI) on Public Access to Digital Data Resulting from Federally Funded Scientific Research

• NIH Director Working Group on Data and Informatics

o 1/2012 Request for Information for Input into the Deliberations of the Advisory Committee to the NIH Director Working Group on Data and Informatics

Page 25: Introduction to Data Management

One response to RFI

“The Federal policy framework should move public access to digital data away from the current idiosyncratic environment to a systematic approach that lowers barriers to data access, discovery, sharing and re-use.”

- Sayeed Choudhury

The Sheridan Libraries of Johns Hopkins University

Page 26: Introduction to Data Management

Postdoc survey: Data management/sharing plans

To what extent have you dealt with NIH data sharing regulations or NSF data management plans?

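[Bar chart comparing two groups, Nationally and NYULMC. Response categories: Not aware of policies; Aware but no involvement; Had to write data plan; Had to implement data plan. Values as extracted, one set per group: 38%, 48%, 12%, 8% and 39%, 48%, 11%, 10%.]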

Page 27: Introduction to Data Management

Publisher requirements

Nature:

“After publication, readers who encounter refusal by the authors to comply with these policies should contact the chief editor of the journal... In cases where editors are unable to resolve a complaint, the journal may refer the matter to the authors' funding institution and/or publish a formal statement of correction, attached online to the publication, stating that readers have been unable to obtain necessary materials to replicate the findings.”

http://www.nature.com/authors/policies/availability.html

Page 28: Introduction to Data Management

Publisher requirements

Science:

“All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science. All computer codes involved in the creation or analysis of data must also be available to any reader of Science. After publication, all reasonable requests for data and materials must be fulfilled.”

http://www.sciencemag.org/site/feature/contribinfo/prep/gen_info.xhtml

Page 29: Introduction to Data Management

Suspect data: Losing credibility

Comparison of statistical analyses: papers with shared data vs. papers with no sharing

• Papers with unshared data had more errors in reporting of results

• Evidence in papers with unshared data was weaker

• p values in papers with unshared data were significantly closer to 0.05

NOTE: APA journals require sharing: 57% did not share.

Consequences? No teeth.

Wicherts JM, Bakker M, Molenaar D. Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS One. 2011;6(11):e26828. Epub 2011 Nov 2.

PMID:22073203; PMCID: PMC3206853.

Page 30: Introduction to Data Management

Retraction: Lost credibility

“There were 60 children in the study. The ages were by accident duplicated between the upper and lower halves of the database. Thus, the ages for the first 30 children in the data set were identical and in the same order with the ages for the second set of 30 children…The files with the original data are not available any more, making it impossible to reconstruct a valid data set for reanalysis.”

http://www.ctajournal.com/content/2/1/6/abstract

Amy Wagers, Harvard stem cell researcher

• 1/2010 Nature article: retracted 10/2010

• 8/2008 Blood article: retracted 12/2011

Shane Mayack claimed “these errors occurred due to mistakes made in data retrieval that were a cause of a poor, but not a unique, data management and archiving system” but stands by the results.

http://retractionwatch.wordpress.com/category/by-author/amy-wagers-retractions/

Page 31: Introduction to Data Management

Faster progress! Better science!

Case studies

• Human Genome Project

• Neuromorpho.org

Page 32: Introduction to Data Management

Human Genome Project

• NIH’s first foray into big science

• Experiment in data sharing

• Establishment of Bermuda principles

o Automatic release of sequence assemblies

o Immediate publication of sequences

o Entire sequence freely available

• Full genome sequenced ahead of schedule

Page 33: Introduction to Data Management

Neuromorpho.org

• Detailed morphological reconstructions of neurons

o Time-intensive

o Re-usable in many ways

• > 6k reconstructions deposited since 2006

• > 100k downloads in 2011

Page 34: Introduction to Data Management

Citing data

• Interoperable data and publications

• Unique Identifiers

o Findable

o Citable

• More citations

Page 35: Introduction to Data Management

Big data

“..almost everything about science is changing because of the impact of information technology. Experimental, theoretical, and computational science are all being affected by the data deluge, and a fourth, ‘data-intensive’ science paradigm is emerging.”

- Jim Gray, Fourth Paradigm (2009)

March 29, 2012: Federal government announces Big Data Research and Development Initiative, $200M+ from 6 agencies.

Page 36: Introduction to Data Management

Data management

1. Introduction

2. Incentives

3. Standards for description & documentation

• File Names

• Databases

• Versioning

• Metadata

• Quality control

4. Storage, archiving and sharing

5. Data management plans

Page 37: Introduction to Data Management

Postdoc survey

How do you determine how to structure your data or what information to save about the data in order to be able to effectively access data in the future? (check all that apply)

[Bar chart comparing two groups, Nationally and NYULMC. Response categories: No pre-defined standards; Personal standards; Lab-based standards; Discipline standards; Institutional standards; Other. Values as extracted, one set per group: 15%, 67%, 46%, 19%, 13%, 2% and 11%, 70%, 43%, 17%, 13%, 1%.]

Page 38: Introduction to Data Management

Why to avoid making your own standard, if possible…

Standards. http://xkcd.com/927/

This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License.

Page 39: Introduction to Data Management

What does your data look like?

• Many files or one file

• Raw format (numeric, images, binary)

• Processed format

• File sizes

Page 40: Introduction to Data Management

File names

bob_1262011.tif

Bob Smith? Bob Jones?

12 June, 2011? December 6, 2011? January 26, 2011?

Unambiguous dates, the ISO 8601 standard:

• YYYYMMDD or YYYY-MM-DD

o e.g. 20120612 = June 12, 2012

• YYYY-MM-DDTHH:MM:SS

o e.g. 2012-06-12T14:03:12 = June 12, 2012, 2:03:12 pm
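A minimal Python sketch of producing ISO-style dates for file names, assuming only the standard-library datetime module. The function names and the "bob" prefix are illustrative, not part of the original slides.

```python
from datetime import datetime


def iso_date(moment: datetime) -> str:
    """Compact ISO 8601 date (YYYYMMDD); contains no separators, so it is safe in file names."""
    return moment.strftime("%Y%m%d")


def iso_datetime(moment: datetime) -> str:
    """Extended ISO 8601 date and time (YYYY-MM-DDTHH:MM:SS)."""
    return moment.isoformat(timespec="seconds")


def dated_file_name(prefix: str, moment: datetime, extension: str) -> str:
    """Build a file name such as prefix_YYYYMMDD.ext (hypothetical convention)."""
    return f"{prefix}_{iso_date(moment)}.{extension}"


# 12 June 2012, 2:03:12 pm
example = datetime(2012, 6, 12, 14, 3, 12)
print(iso_date(example))
print(iso_datetime(example))
print(dated_file_name("bob", example, "tif"))
```

Because the year comes first, names built this way also sort chronologically in a plain alphabetical file listing.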

Page 41: Introduction to Data Management

100s of slices

5-7 experiments a week…

3 post docs

100s of slides

100s of huge images

1000s of image files (TIF)

1 rat heart

Page 42: Introduction to Data Management

File names should…

• Reflect contents of the file

• Use non-cryptic, intuitive names if possible

• Consider any character restrictions

• Uniquely identify the file

• Avoid special characters (e.g. *, $, &, #)

• Use underscores (“_”) instead of spaces or dashes

Page 43: Introduction to Data Management

Example of a good file name

AtherRat_012_056_mb_0423.tif

AtherRat = experiment name

012 = experiment number

056 = sample number

mb = stain used, methylene blue

0423 = coordinates of image (4 across, 23 down)
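One benefit of a structured name is that software can read it back. The sketch below parses the example name with a regular expression; the exact field widths (three digits for numbers, two digits per coordinate) are assumptions inferred from this single example, and all identifiers are hypothetical.

```python
import re
from typing import NamedTuple


class ImageFileName(NamedTuple):
    experiment: str
    experiment_number: int
    sample_number: int
    stain: str
    across: int
    down: int


# Assumed layout: Name_NNN_NNN_stain_AADD.tif (two digits per coordinate).
FILE_NAME_PATTERN = re.compile(
    r"(?P<experiment>[A-Za-z]+)_(?P<experiment_number>\d{3})_"
    r"(?P<sample_number>\d{3})_(?P<stain>[a-z]+)_"
    r"(?P<across>\d{2})(?P<down>\d{2})\.tif"
)


def parse_file_name(file_name: str) -> ImageFileName:
    """Split a conforming file name into its documented parts, or raise ValueError."""
    match = FILE_NAME_PATTERN.fullmatch(file_name)
    if match is None:
        raise ValueError(f"{file_name!r} does not follow the naming convention")
    parts = match.groupdict()
    return ImageFileName(
        experiment=parts["experiment"],
        experiment_number=int(parts["experiment_number"]),
        sample_number=int(parts["sample_number"]),
        stain=parts["stain"],
        across=int(parts["across"]),
        down=int(parts["down"]),
    )


parsed = parse_file_name("AtherRat_012_056_mb_0423.tif")
print(parsed)
```

A name such as bob_1262011.tif is rejected by the same function, which is one way to catch files that drift from the convention.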

Page 44: Introduction to Data Management

Spreadsheet

Faircloth BC, McCormack JE, Crawford NG, Harvey MG, Brumfield RT, Glenn TC (2011) Data from: Ultraconserved elements anchor thousands of genetic markers for target enrichment spanning multiple evolutionary timescales. Dryad Digital Repository. doi:10.5061/dryad.64dv0tg1

Page 45: Introduction to Data Management

Relational database

Andrew Sparkes & Amanda Clare. AutoLabDB: a substantial open source database schema to support a high-throughput automated laboratory. Bioinformatics, first published online March 29, 2012. doi:10.1093/bioinformatics/bts140

Page 46: Introduction to Data Management

Databases

• Intuitive / meaningful field & table names

• Ensure it will support scope of analysis

• Institutional support for data modeling?

• Check for reusable discipline-based standard

o E.g. EUROCarbDB

Page 47: Introduction to Data Management

Document your file/field names

• What do file/field names mean?

• What does each file/field contain?

• How do files/data relate to each other?

• Is there a limited set of possible values?

Name: Stain

Type: Text

Description: Stain used on cell sample

Possible values: IO = Iodine; EY = Eosin Y; MB = Methylene blue
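Field documentation like the Stain entry can also be kept in a machine-readable form and used to check records. This is a rough sketch under that assumption; the dictionary layout and function name are hypothetical, and only the Stain field and its three codes come from the slide.

```python
from typing import Any, Dict, List

# Data dictionary: one entry per documented field.
DATA_DICTIONARY: Dict[str, Dict[str, Any]] = {
    "Stain": {
        "type": str,
        "description": "Stain used on cell sample",
        "allowed": {"IO": "Iodine", "EY": "Eosin Y", "MB": "Methylene blue"},
    },
}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record; an empty list means it passes."""
    problems = []
    for field, value in record.items():
        spec = DATA_DICTIONARY.get(field)
        if spec is None:
            problems.append(f"{field}: not documented in the data dictionary")
        elif not isinstance(value, spec["type"]):
            problems.append(f"{field}: expected {spec['type'].__name__}")
        elif value not in spec["allowed"]:
            problems.append(f"{field}: {value!r} is not an allowed value")
    return problems


print(validate_record({"Stain": "MB"}))
print(validate_record({"Stain": "XX"}))
```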

Page 48: Introduction to Data Management

Version control

• Have a plan

• Be consistent

• Document changes between versions (what, who)

Page 49: Introduction to Data Management

Metadata definition

Metadata is:

“Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource”

NISO (2004). Understanding Metadata. NISO Press

http://www.niso.org/publications/press/UnderstandingMetadata.pdf

Page 50: Introduction to Data Management

Why use structured metadata?

• Systematic approach to capturing descriptive information

• Supports 3rd party use by:

o Making data findable (if metadata put online)

o Providing context

o Providing a unique identifier

Page 51: Introduction to Data Management

Metadata considerations

• What standard should you use?

• Are there discipline standards?

• Which is the best fit?

• Where are you depositing your data?

Page 52: Introduction to Data Management

Metadata – specialized standard Neuromorpho

• Neuromorpho ID (UID)

• Neuron Name

• Archive (researcher) name

• Species

• Strain of species

• Age range

• Gender

• Weight range

• Developmental stage

• Primary/Secondary/Tertiary brain regions

• Primary/Secondary/Tertiary Cell classes

• Original data format

• Experiment condition

• Experiment protocol

• Staining method

• Slicing Direction/Thickness

• Tissue Shrinkage

• Objective Type

• Magnification

• Reconstruction Method

• Dates of Deposition/Upload

• Associated publications

• Web URL of archives (if available) with any additional information about the reconstruction

Page 55: Introduction to Data Management

Metadata – specialized standard GenBank

• Locus

• Definition

• Accession number (UID)

• Version

• Keywords

• Source organism

• Reference(s)

o Authors

o Title

o Journal

o PubMed ID

• Features

o Source

o RBS (ribosome binding site)

o gene

o CDS (protein coding sequence)

• Terminator

• Modification date

Page 56: Introduction to Data Management

Metadata – general standard

• Dublin Core

o Designed to be generic/flexible

o Usually stored as XML

e.g. <dc:creator>Hanson, Karen L.</dc:creator>

o 15 fields:

Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, Type
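A small sketch of writing a Dublin Core record as XML with Python's standard xml.etree.ElementTree module. The "metadata" wrapper element, the function name, and the sample title and date values are assumptions for illustration; the fifteen element names and the dc:creator example are taken from the slide.

```python
import xml.etree.ElementTree as ET
from typing import Dict

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# Make the serializer use the conventional "dc" prefix.
ET.register_namespace("dc", DC_NAMESPACE)


def dublin_core_xml(fields: Dict[str, str]) -> str:
    """Serialize the given fields as dc:* elements inside a wrapper element."""
    root = ET.Element("metadata")  # wrapper name is an assumption
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"{name!r} is not one of the 15 Dublin Core elements")
        ET.SubElement(root, f"{{{DC_NAMESPACE}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


record = dublin_core_xml({
    "creator": "Hanson, Karen L.",
    "title": "Introduction to Data Management",
    "date": "2012-06-04",
})
print(record)
```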

Page 57: Introduction to Data Management

Minimum Information for Biological and Biomedical Investigations

• Covers data and metadata

• Standards for diverse bioscience communities

• ~35 guidelines so far

• Recommended by Science magazine

Let’s take a look… http://mibbi.sourceforge.net/portal.shtml

Page 58: Introduction to Data Management

Quality control

• Assign a person to be responsible

o Naming conventions adhered to

o Good data quality

o Access controls in place

o Version controls followed

Page 59: Introduction to Data Management

Mini-series: Part 2

Page 60: Introduction to Data Management

Data management

1. Introduction

2. Incentives

3. Standards for description & documentation

4. Storage, archiving and sharing

• Backups

• Storage

• Security

• Archiving / preservation

• Sharing

5. Data management plans

Page 61: Introduction to Data Management

Backups

• Make a backup plan

• Multiple copies

• Geographically dispersed

Page 62: Introduction to Data Management

Storage options

• Ask I.T.

o Enterprise server

o IT managed cloud options

o Data warehouse

o Lab Information Management System (LIMS)

o Other systems?

• Proprietary cloud options (in a pinch)

o Check ownership policies

o Pick >1 provider

Page 63: Introduction to Data Management

Security considerations

• Reasons to be concerned about security

o Ethical

o Commercial

o Privacy (e.g. HIPAA)

• Work with I.T.

• Other things:

o Add passwords

o Lock unused machines

o Sign use agreements

• Publishing/sharing data? May need to de-identify

Page 64: Introduction to Data Management

Postdoc survey results

When you have finished analyzing/publishing from a dataset, where do you store it for long-term preservation, management, and/or access?

[Bar chart comparing two groups, Nationally and NYULMC. Response categories: Institutional Repository; Discipline-specific Repository; Other; Do not store for long-term preservation. Values as extracted, one set per group: 39%, 19%, 31%, 22% and 45%, 14%, 30%, 20%.]

Page 65: Introduction to Data Management

Digital preservation

• Storage ≠ preservation!

• Digital preservation is…

“a set of activities required to make sure digital objects can be located, rendered, used and understood in the future”

http://www.digitalpreservationeurope.eu/what-is-digital-preservation/

• Protects from

o hardware obsolescence

o software obsolescence

o file integrity issues
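File integrity is commonly checked with checksums: record a hash of each file when it is stored, then recompute later and compare. The sketch below assumes SHA-256 from Python's standard hashlib module; the function names and the throwaway demonstration file are hypothetical and are not drawn from any particular repository's practice.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> Dict[str, str]:
    """Map each file name directly inside the folder to its SHA-256 checksum."""
    return {p.name: sha256_of(p) for p in sorted(folder.iterdir()) if p.is_file()}


def changed_files(folder: Path, manifest: Dict[str, str]) -> List[str]:
    """List files that are missing or whose contents no longer match the manifest."""
    current = build_manifest(folder)
    return sorted(name for name, checksum in manifest.items()
                  if current.get(name) != checksum)


# Demonstration on a temporary folder with a made-up file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    sample = folder / "AtherRat_012_056_mb_0423.tif"
    sample.write_bytes(b"original image bytes")
    manifest = build_manifest(folder)
    unchanged = changed_files(folder, manifest)  # nothing altered yet
    sample.write_bytes(b"corrupted image bytes")
    damaged = changed_files(folder, manifest)    # contents now differ

print(unchanged, damaged)
```

Keeping the manifest alongside each backup copy makes it possible to tell which copy is still intact.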

Page 66: Introduction to Data Management

Digital repositories

• For digital preservation, storage and/or sharing

• Types of repositories:

o Institutional

o Discipline specific (GenBank)

o Cross disciplinary (Dryad)

Page 67: Introduction to Data Management

Data format

• Collection vs dissemination format

• Software export features

• Open formats e.g. XML, CSV, PDF, TIFF

• No open format? Use common proprietary formats e.g. DOC, SPSS

• Unencrypted

• Uncompressed

Page 68: Introduction to Data Management

Data ownership

• Can’t assume you own data

• Check for:

o Funder policies on data ownership

o Institution policies on data ownership

Page 69: Introduction to Data Management

Data management

1. Introduction

2. Incentives

3. Standards for description & documentation

4. Storage, archiving and sharing

5. Data Management Plans

Page 70: Introduction to Data Management

What should be included in the plan?

• Types of data

• Methods of collection

• Standards that will be applied

• Backup and storage procedures

• Plans for archiving / preservation

• Access policies and provisions for secondary use

• Measures to protect privacy or intellectual property

List adapted from NYU Libraries, Data Management Libguide

http://nyu.libguides.com/data_management

Page 71: Introduction to Data Management

Data management plans

Where to start?

• Purdue’s Self Assessment Questionnaire http://research.hub.purdue.edu/resources/7

• MIT’s Data Management Check List

• NIH Data Sharing

There are good recipes…

…don’t reinvent the wheel!

Page 72: Introduction to Data Management

Mini-series: Part 3

Page 73: Introduction to Data Management

Conclusions

• Plan data management before starting research

• Documentation, documentation, documentation

• Can’t ignore the march toward research data sharing… get ready!

Page 74: Introduction to Data Management

Resources

http://nyuhsl.libguides.com/data_management

Page 75: Introduction to Data Management

Photo references

• AJ Yakstrangler. “Tithby” 2011. www.flickr.com/photos/yakstrangler/6030261340.

• BobPetUK. “Raw Minced Beef” 2010. www.flickr.com/photos/22179048@N05/5195112462/

• Like_the_Grand_Canyon. “McDonalds Hamburger Royal Bacon” 2008. www.flickr.com/photos/like_the_grand_canyon/3022123379

• The Adventures of Kristin & Adam. “Whose read for a beat down?!” 2008. www.flickr.com/photos/kristin-and-adam/2821678614/

• kristin_a. “Easter cupcakes” 2008. www.flickr.com/photos/kristinausk/2374459826

• wilf2. “Gummy smile” 2006. www.flickr.com/photos/wibbles/244268268

• outcast104. “Vampire weekend” 2005. www.flickr.com/photos/outcast104/2011632229

• Mel B. “Egg” 2008. www.flickr.com/photos/42dreams/2452044287

• psrobin. “Baking Powder Still Life” 2010. www.flickr.com/photos/psrobin/5092598788

• edenpictures. “Sugar” 2011. www.flickr.com/photos/edenpictures/6596639341

• Mel B. “Oil pour” 2008. http://www.flickr.com/photos/42dreams/2452876486

• afiler. “Piggly Wiggly Flour Bag” 2006. www.flickr.com/photos/afiler/121359709

• Bill HR. “Pure vanilla” 2009. http://www.flickr.com/photos/billhr/3190024762

• [F]oxmoron. “Baking Soda” 2011. http://www.flickr.com/photos/f-oxymoron/5423065696

• Eran Finkle. “Cinamon quills” 2007. http://www.flickr.com/photos/finklez/3059996880

• Joelk75. “choped walnuts” 2011. http://www.flickr.com/photos/75001512@N00/5405890483/

• Cyn74. “Happy Carrot” 2009. http://www.flickr.com/photos/kyntharyn74/3262089319

• nedrichards. “Carrot Cake” 2006. http://www.flickr.com/photos/nedrichards/307600027

Page 76: Introduction to Data Management

http://nyuhsl.libguides.com/data_management

Thank you... Questions?

http://hsl.med.nyu.edu