Introduction to Data Management

June 4, 2012

Karen Hanson, MLIS

Knowledge Systems Librarian

Alisa Surkis, PhD, MLS

Translational Science Librarian

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

Objectives

Understand…

• current climate around data management and data sharing

• best practices in data documentation and description

• principles of storage and long-term preservation of data

• basic elements of a data management plan

2/76

Data management

1. Introduction

2. Incentives

3. Standards for description & documentation

4. Storage, archiving and sharing

5. Data management plans

3/76

What is data?

• “Facts and statistics collected together for reference or analysis”

Oxford Dictionaries online

http://oxforddictionaries.com/definition/data?q=data

• “Research data, unlike other types of information, is collected, observed, or created, for purposes of analysis to produce original research results.”

University of Edinburgh, Information Services

http://www.ed.ac.uk/schools-departments/information-services/services/research-support/data-library/research-data-mgmt/data-mgmt/research-data-definition

4/76

And that means…?

• Tables of numbers

• Sequences of bits (10110) or base pairs (GACTTA)

• Samples, specimens, slides

• Sound recordings, video recordings, images

• Laboratory notebooks

• Protocols, methodologies

• Software (code), algorithms, models

• “A myriad of other information objects, none of which may stand alone” - Christine Borgman

5/76

Categories of data

• Observational (real time)

• Experimental (lab)

• Computational (model)

• Derived or Compiled

Source: National Science Board. Long-Lived Digital Data Collections, 2005.

6/76

What is data management?

• Not just creation, storage, processing and analysis

• Refers to managing the full lifecycle of data

7/76

Data management lifecycle

Processing data

• enter data, digitize, transcribe, translate

• check, validate, clean data

• anonymize data where necessary

• describe data

• manage and store data

9/76

Data management lifecycle

Analyzing data

• interpret / analyze data

• write publications

10/76

Data management lifecycle

Preserving data

• migrate data to best format / medium

• back-up and store data

• create final metadata and documentation

• archive data

11/76

Data management lifecycle

Giving access to data

• distribute / share data

• control access

12/76

Why would anyone need my data?

Cow concept: Dorothea Salo, “Save the Cows”, 2009.

http://www.slideshare.net/cavlec/save-the-cows-data-curation-for-the-rest-of-us-1533252

Analyze, process

Publish

You don’t need to kill the cow!

14/76

Data management lifecycle

Re-using data

• follow-up research

• new research

• check validity

15/76

Mini-series: Part 1

16/76

Data management

1. Introduction

2. Incentives

• You and your data

• Government mandates

• Publisher requirements

• Lost credibility

• Faster progress, better science

• Citations

• Big data

3. Standards for description & documentation

4. Storage, archiving and sharing

5. Data management plans

17/76

Why worry about data management?

• Bad things can happen if you don’t

• People will make you anyway

• Sharing is win-win

…you can’t share what you can’t find, read, decipher

18/76

You and your data

• Make research process more efficient

• Comprehensibility

• Security

… what about other people and your data?

19/76

Sharing - sticks!

• Government mandates

o Data sharing

o Data management plans

• Publisher requirements

• Credibility issues

20/76

Sharing - carrots!

Faster progress!

Better science!

Also…

o Data becomes citable

o Data linked to publications

21/76

Government mandates

Timeline

1999: US Office of Management and Budget revised Circular A-110, making data from federally funded research subject to Freedom of Information Act requests

2003: NIH adopted a data sharing policy.

(still no teeth, but young yet)

22/76

Government mandates

2008: NIH implements the Public Access Policy

2009: White House issues the Open Government Directive

2011 (Jan): NSF made data management plans a requirement

23/76

Government mandates (bigger sticks on the way?)

• NSTC’s Interagency Working Group on Digital Data

o 11/2011 Request for Information (RFI) on Public Access to Digital Data Resulting from Federally Funded Scientific Research

• NIH Director Working Group on Data and Informatics

o 1/2012 Request for Information for Input into the Deliberations of the Advisory Committee to the NIH Director Working Group on Data and Informatics

24/76

“The Federal policy framework should move public access to digital data away from the current idiosyncratic environment to a systematic approach that lowers barriers to data access, discovery, sharing and re-use.”

- Sayeed Choudhury

The Sheridan Libraries of Johns Hopkins University

One response to RFI

25/76

Postdoc survey: Data management/sharing plans

To what extent have you dealt with NIH data sharing regulations or NSF data management plans?

26/76

[Bar chart: two values per response; legend order: Nationally, NYULMC]

Not aware of policies: 38%, 39%

Aware but no involvement: 48%, 48%

Had to write data plan: 12%, 11%

Had to implement data plan: 8%, 10%

Publisher requirements

Nature:

“After publication, readers who encounter refusal by the authors to comply with these policies should contact the chief editor of the journal... In cases where editors are unable to resolve a complaint, the journal may refer the matter to the authors' funding institution and/or publish a formal statement of correction, attached online to the publication, stating that readers have been unable to obtain necessary materials to replicate the findings.”

http://www.nature.com/authors/policies/availability.html

27/76

Publisher requirements

Science:

“All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science. All computer codes involved in the creation or analysis of data must also be available to any reader of Science. After publication, all reasonable requests for data and materials must be fulfilled.”

http://www.sciencemag.org/site/feature/contribinfo/prep/gen_info.xhtml

28/76

Suspect data: Losing credibility

Comparison of statistical analyses: papers with shared data vs. papers with no sharing

• Papers with unshared data had more errors in the reporting of results

• Evidence in papers with unshared data was weaker

• p values of unshared data significantly closer to 0.05

NOTE: APA journals require sharing, yet 57% of authors did not share.

Consequences? No teeth.

Wicherts JM, Bakker M, Molenaar D. Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS One. 2011;6(11):e26828. Epub 2011 Nov 2.

PMID:22073203; PMCID: PMC3206853.

29/76

Retraction: Lost credibility

“There were 60 children in the study. The ages were by accident duplicated between the upper and lower halves of the database. Thus, the ages for the first 30 children in the data set were identical and in the same order with the ages for the second set of 30 children…The files with the original data are not available any more, making it impossible to reconstruct a valid data set for reanalysis.”

http://www.ctajournal.com/content/2/1/6/abstract

30/76

Amy Wagers, Harvard stem cell researcher

• 1/2010 Nature article: retracted 10/2010

• 8/2008 Blood article: retracted 12/2011

Shane Mayack claimed “these errors occurred due to mistakes made in data retrieval that were a cause of a poor, but not a unique, data management and archiving system” but stands by the results.

http://retractionwatch.wordpress.com/category/by-author/amy-wagers-retractions/

Faster progress! Better science!

Case studies

• Human Genome Project

• Neuromorpho.org

31/76

Human Genome Project

• NIH’s first foray into big science

• Experiment in data sharing

• Establishment of Bermuda principles

o Automatic release of sequence assemblies

o Immediate publication of sequences

o Entire sequence freely available

• Full genome sequenced ahead of schedule

32/76

Neuromorpho.org

• Detailed morphological reconstructions of neurons

o Time-intensive

o Re-usable in many ways

• > 6k reconstructions deposited since 2006

• > 100k downloads in 2011

33/76

Citing data

• Interoperable data and publications

• Unique Identifiers

o Findable

o Citable

• More citations

34/76

Big data

“…almost everything about science is changing because of the impact of information technology. Experimental, theoretical, and computational science are all being affected by the data deluge, and a fourth, ‘data-intensive’ science paradigm is emerging.”

- Jim Gray, Fourth Paradigm (2009)

March 29, 2012: Federal government announces Big Data Research and Development Initiative, $200M+ from 6 agencies.

35/76

Data management

1. Introduction

2. Incentives

3. Standards for description & documentation

• File names

• Databases

• Versioning

• Metadata

• Quality control

4. Storage, archiving and sharing

5. Data management plans

36/76

Postdoc survey

How do you determine how to structure your data or what information to save about the data in order to be able to effectively access data in the future? (check all that apply)

[Bar chart: two values per response; legend order: Nationally, NYULMC]

No pre-defined standards: 15%, 11%

Personal standards: 67%, 70%

Lab-based standards: 46%, 43%

Discipline standards: 19%, 17%

Institutional standards: 13%, 13%

Other: 2%, 1%

37/76

Why you should avoid creating your own standard, if possible…

38/76

Standards. http://xkcd.com/927/

This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License.

What does your data look like?

• Many files or one file

• Raw format (numeric, images, binary)

• Processed format

• File sizes

39/76

File names

bob_1262011.tif

Bob Smith? Bob Jones?

12 June, 2011? December 6, 2011? January 26, 2011?

40/76

Unambiguous dates, the ISO standard:

• YYYYMMDD or YYYY-MM-DD

o e.g. 20120612 = June 12, 2012

• YYYY-MM-DDTHH:MM:SS

o e.g. 2012-06-12T14:03:12 = June 12, 2012, 2:03:12 pm
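As a rough illustration of the date convention above, the following Python sketch uses the standard library's `datetime` module to write and read ISO-style dates. The variable names are illustrative only.

```python
from datetime import date, datetime

# Format one date in the two ISO 8601 styles.
d = date(2012, 6, 12)
basic = d.strftime("%Y%m%d")   # compact form, convenient inside file names
extended = d.isoformat()       # hyphenated form

# Parse the compact form back into a date; only one reading is possible.
parsed = datetime.strptime("20120612", "%Y%m%d").date()

# A combined date and time, separated by "T".
stamp = datetime(2012, 6, 12, 14, 3, 12).isoformat()

print(basic, extended, parsed.month, parsed.day, stamp)
```

Because the year comes first and every part has a fixed width, names that begin with such a date also sort chronologically in a file listing.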

[Diagram: 1 rat heart, 100s of slices, 100s of slides, 100s of huge images, 1000s of TIF image files; 5-7 experiments a week; 3 postdocs]

41/76

File names should…

• Reflect contents of the file

• Use non-cryptic, intuitive names if possible

• Consider any character restrictions

• Uniquely identify the file

• Avoid special characters (e.g. *, $, &, #)

• Use underscores (“_”) instead of spaces or dashes
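Some of these rules can be checked automatically. The sketch below is one possible check, assuming a lab has agreed that names may contain only letters, digits and underscores plus a single extension; the pattern and function names are hypothetical, not part of any standard.

```python
import re

# Assumed rule: letters, digits and underscores, then a dot and an extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9]+")


def is_safe_file_name(name: str) -> bool:
    """Return True if the name has no spaces, dashes or special characters."""
    return SAFE_NAME.fullmatch(name) is not None


def find_duplicates(names):
    """Return names that occur more than once, compared case-insensitively."""
    seen, duplicates = set(), set()
    for name in names:
        key = name.lower()
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return sorted(duplicates)


for candidate in ["AtherRat_012_056_mb_0423.tif", "bob 1262011.tif", "rat#5.tif"]:
    print(candidate, is_safe_file_name(candidate))
```

A check like this cannot tell whether a name reflects the contents of the file; that part still needs a documented convention.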

42/76

Example of a good file name

AtherRat_012_056_mb_0423.tif

AtherRat = experiment name

012 = experiment number

056 = sample number

mb = stain used, methylene blue

0423 = coordinates of image (4 across, 23 down)
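A structured name like this can be built and read by a program. The following is a minimal sketch, assuming three-digit experiment and sample numbers, a two-letter lowercase stain code, and two digits each for the across and down coordinates; the class and function names are hypothetical.

```python
import re
from typing import NamedTuple


class ImageName(NamedTuple):
    experiment: str
    experiment_number: int
    sample_number: int
    stain: str
    across: int
    down: int


# Assumed layout: Experiment_NNN_NNN_ss_AADD.tif
PATTERN = re.compile(
    r"(?P<experiment>[A-Za-z]+)_(?P<exp_no>\d{3})_(?P<sample>\d{3})_"
    r"(?P<stain>[a-z]{2})_(?P<across>\d{2})(?P<down>\d{2})\.tif"
)


def parse_name(file_name: str) -> ImageName:
    """Split a file name into its documented parts, or raise ValueError."""
    match = PATTERN.fullmatch(file_name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {file_name}")
    return ImageName(
        match["experiment"],
        int(match["exp_no"]),
        int(match["sample"]),
        match["stain"],
        int(match["across"]),
        int(match["down"]),
    )


def build_name(parts: ImageName) -> str:
    """Assemble a file name from its parts, zero-padding the numbers."""
    return (
        f"{parts.experiment}_{parts.experiment_number:03d}_"
        f"{parts.sample_number:03d}_{parts.stain}_"
        f"{parts.across:02d}{parts.down:02d}.tif"
    )


print(parse_name("AtherRat_012_056_mb_0423.tif"))
```

Generating names with a function such as `build_name` rather than typing them by hand is one way to keep thousands of image files consistent.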

43/76

Spreadsheet

Faircloth BC, McCormack JE, Crawford NG, Harvey MG, Brumfield RT, Glenn TC (2011) Data from: Ultraconserved elements anchor thousands of genetic markers for target enrichment spanning multiple evolutionary timescales. Dryad Digital Repository. doi:10.5061/dryad.64dv0tg1

44/76

Relational database

Andrew Sparkes & Amanda Clare. AutoLabDB: a substantial open source database schema to support a high-throughput automated laboratory. Bioinformatics, first published online March 29, 2012. doi:10.1093/bioinformatics/bts140

45/76

Databases

• Intuitive / meaningful field & table names

• Ensure it will support scope of analysis

• Institutional support for data modeling?

• Check for reusable discipline-based standard

o E.g. EUROCarbDB

46/76

Document your file/field names

• What do file/field names mean?

• What does each file/field contain?

• How do files/data relate to each other?

• Is there a limited set of possible values?

Name | Type | Description | Possible values

Stain | Text | Stain used on cell sample | IO = Iodine; EY = Eosin Y; MB = Methylene blue
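Field documentation of this kind can also be kept in a machine-readable form and used to check incoming data. The sketch below is one way to do that, assuming a simple dictionary layout; `DATA_DICTIONARY` and `check_record` are hypothetical names, and only the Stain field from the example is filled in.

```python
# Hypothetical machine-readable version of the field documentation.
DATA_DICTIONARY = {
    "Stain": {
        "type": str,
        "description": "Stain used on cell sample",
        "allowed": {"IO": "Iodine", "EY": "Eosin Y", "MB": "Methylene blue"},
    },
}


def check_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, value in record.items():
        spec = DATA_DICTIONARY.get(field)
        if spec is None:
            problems.append(f"{field}: not documented")
        elif not isinstance(value, spec["type"]):
            problems.append(f"{field}: expected {spec['type'].__name__}")
        elif spec["allowed"] is not None and value not in spec["allowed"]:
            problems.append(f"{field}: unexpected value {value!r}")
    return problems


print(check_record({"Stain": "MB"}))
print(check_record({"Stain": "XX"}))
```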

47/76

Version control

• Have a plan

• Be consistent

• Document changes between versions (what, who)
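One lightweight way to follow these points, for labs not using a version control system, is a numbered file name plus a change log. The sketch below assumes a `_vNN` suffix and a log that records who changed what; both conventions and all names are illustrative.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Change:
    version: int
    who: str
    what: str
    when: date


def versioned_name(stem: str, version: int, ext: str) -> str:
    """Build a file name with a zero-padded version suffix."""
    return f"{stem}_v{version:02d}.{ext}"


def record_change(log: list, who: str, what: str, when: date) -> Change:
    """Append the next numbered entry to the change log and return it."""
    entry = Change(version=len(log) + 1, who=who, what=what, when=when)
    log.append(entry)
    return entry


log = []
record_change(log, "postdoc_a", "Initial data entry", date(2012, 6, 4))
latest = record_change(log, "postdoc_b", "Removed duplicate rows", date(2012, 6, 12))
print(versioned_name("AtherRat_counts", latest.version, "csv"))
```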

48/76

Metadata definition

Metadata is:

“Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource”

NISO (2004). Understanding Metadata. NISO Press

http://www.niso.org/publications/press/UnderstandingMetadata.pdf

49/76

Why use structured metadata?

• Systematic approach to capturing descriptive information

• Supports 3rd party use by:

o Making data findable (if metadata put online)

o Providing context

o Providing a unique identifier

50/76

Metadata considerations

• What standard should you use?

• Are there discipline standards?

• Which is the best fit?

• Where are you depositing your data?

51/76

Metadata – specialized standard Neuromorpho

• Neuromorpho ID (UID)

• Neuron Name

• Archive (researcher) name

• Species

• Strain of species

• Age range

• Gender

• Weight range

• Developmental stage

• Primary/Secondary/Tertiary brain regions

• Primary/Secondary/Tertiary Cell classes

• Original data format

• Experiment condition

• Experiment protocol

• Staining method

• Slicing Direction/Thickness

• Tissue Shrinkage

• Objective Type

• Magnification

• Reconstruction Method

• Dates of Deposition/Upload

• Associated publications

• Web URL of archives (if available) with any additional information about the reconstruction

52/76

Metadata – specialized standard GenBank

• Locus

• Definition

• Accession number (UID)

• Version

• Keywords

• Source organism

• Reference(s)

o Authors

o Title

o Journal

o PubMed ID

• Features

o Source

o RBS (ribosome binding site)

o gene

o CDS (protein coding sequence)

• Terminator

• Modification date

55/76

Metadata – general standard

• Dublin Core

o Designed to be generic/flexible

o Usually stored as XML

e.g. <dc:creator>Hanson, Karen L.</dc:creator>

o 15 fields:

Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, Type
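As an illustration of Dublin Core stored as XML, the sketch below writes a few of the 15 fields with Python's standard `xml.etree.ElementTree` module. The namespace URI is the Dublin Core element set namespace; the `metadata` wrapper element and the function name are assumptions made for this example, since repositories differ in how they wrap records.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # serialize elements with the "dc:" prefix


def dublin_core_record(fields: dict) -> str:
    """Return an XML string with one dc: element per field."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


record = dublin_core_record(
    {
        "title": "Introduction to Data Management",
        "creator": "Hanson, Karen L.",
        "date": "2012-06-04",
    }
)
print(record)
```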

56/76

Minimum Information for Biological and Biomedical Investigations

• Covers data and metadata

• Standards for diverse bioscience communities

• ~35 guidelines so far

• Recommended by Science magazine

Let’s take a look… http://mibbi.sourceforge.net/portal.shtml

57/76

Quality control

• Assign a person to be responsible

o Naming conventions adhered to

o Good data quality

o Access controls in place

o Version controls followed

58/76

Mini-series: Part 2

59/76

Data management

1. Introduction

2. Incentives

3. Standards for description & documentation

4. Storage, archiving and sharing

• Backups

• Storage

• Security

• Archiving / preservation

• Sharing

5. Data management plans

60/76

Backups

• Make a backup plan

• Multiple copies

• Geographically dispersed

61/76

Storage options

• Ask I.T.

o Enterprise server

o IT managed cloud options

o Data warehouse

o Lab Information Management System (LIMS)

o Other systems?

• Proprietary cloud options (in a pinch)

o Check ownership policies

o Pick >1 provider

62/76

Security considerations

• Reasons to be concerned about security

o Ethical

o Commercial

o Privacy (e.g. HIPAA)

• Work with I.T.

• Other things:

o Add passwords

o Lock unused machines

o Sign use agreements

• Publishing/sharing data? May need to de-identify

63/76

Postdoc survey results

When you have finished analyzing/publishing from a dataset, where do you store it for long-term preservation, management, and/or access?

[Bar chart: two values per response; legend order: Nationally, NYULMC]

Institutional repository: 39%, 45%

Discipline-specific repository: 19%, 14%

Other: 31%, 30%

Do not store for long-term preservation: 22%, 20%

64/76

Digital preservation

• Storage ≠ preservation!

• Digital preservation is…

“a set of activities required to make sure digital objects can be located, rendered, used and understood in the future”

http://www.digitalpreservationeurope.eu/what-is-digital-preservation/

• Protects from

o hardware obsolescence

o software obsolescence

o file integrity issues
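File integrity is commonly checked by comparing checksums of a file and its copies. The following is a minimal sketch using Python's standard `hashlib` module with SHA-256; the function names are hypothetical, and a repository may prescribe a different algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original: Path, backup: Path) -> bool:
    """Return True if both files have the same checksum."""
    return sha256_of(original) == sha256_of(backup)
```

Storing the checksum alongside the data when it is archived makes it possible to detect silent corruption later by recomputing and comparing.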

65/76

Digital repositories

• For digital preservation, storage and/or sharing

• Types of repositories:

o Institutional

o Discipline-specific (e.g. GenBank)

o Cross-disciplinary (e.g. Dryad)

66/76

Data format

• Collection vs dissemination format

• Software export features

• Open formats e.g. XML, CSV, PDF, TIFF

• No open format? Use common proprietary formats e.g. DOC, SPSS

• Unencrypted

• Uncompressed
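Where software has no export feature of its own, a small script can write data to an open format. The sketch below writes records to CSV with Python's standard `csv` module; the field names and the `to_csv` helper are illustrative only.

```python
import csv
import io


def to_csv(rows, fieldnames) -> str:
    """Serialize a list of dictionaries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"sample": "056", "stain": "MB"},
    {"sample": "057", "stain": "EY"},
]
print(to_csv(rows, ["sample", "stain"]))
```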

67/76

Data ownership

• Can’t assume you own data

• Check for:

o Funder policies on data ownership

o Institution policies on data ownership

68/76

Data management

1. Introduction

2. Incentives

3. Standards for description & documentation

4. Storage, archiving and sharing

5. Data management plans

69/76

What should be included in the plan?

• Types of data

• Methods of collection

• Standards that will be applied

• Backup and storage procedures

• Plans for archiving / preservation

• Access policies and provisions for secondary use

• Measures to protect privacy or intellectual property

List adapted from NYU Libraries, Data Management Libguide

http://nyu.libguides.com/data_management

70/76

Data management plans

Where to start?

• Purdue’s Self Assessment Questionnaire http://research.hub.purdue.edu/resources/7

• MIT’s Data Management Check List

• NIH Data Sharing

There are good recipes…

…don’t reinvent the wheel!

71/76

Mini-series: Part 3

72/76

Conclusions

• Plan data management before starting research

• Documentation, documentation, documentation

• Can’t ignore the march toward research data sharing… get ready!

73/76

Resources

http://nyuhsl.libguides.com/data_management

74/76

75/76

karen.hanson@med.nyu.edu

alisa.surkis@med.nyu.edu

http://nyuhsl.libguides.com/data_management

Thank you... Questions?

http://hsl.med.nyu.edu 76/76