Introduction to Data Management
June 4, 2012
Karen Hanson, MLIS
Knowledge Systems Librarian
Alisa Surkis, PhD, MLS
Translational Science Librarian
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.
Objectives
Understand…
• current climate around data management and data sharing
• best practices in data documentation and description
• principles of storage and long-term preservation of data
• basic elements of a data management plan
2/76
Data management
1. Introduction
2. Incentives
3. Standards for description & documentation
4. Storage, archiving and sharing
5. Data Management Plans
3/76
What is data?
• “Facts and statistics collected together for reference or analysis”
Oxford Dictionaries online
http://oxforddictionaries.com/definition/data?q=data
• “Research data, unlike other types of information, is collected, observed, or created, for purposes of analysis to produce original research results.”
University of Edinburgh, Information Services
http://www.ed.ac.uk/schools-departments/information-services/services/research-support/data-library/research-data-mgmt/data-mgmt/research-data-definition
4/76
And that means…?
• Tables of numbers
• Sequences of bits (10110) or base pairs (GACTTA)
• Samples, specimens, slides
• Sound recordings, video recordings, images
• Laboratory notebooks
• Protocols, methodologies
• Software (code), algorithms, models
• “A myriad of other information objects, none of which may stand alone” - Christine Borgman
5/76
Categories of data
• Observational (real time)
• Experimental (lab)
• Computational (model)
• Derived or Compiled
Source: National Science Board. Long-Lived Digital Data collections, 2005.
6/76
What is data management?
• Not just creation, storage, processing and analysis
• Refers to managing the full lifecycle of data
7/76
Data management lifecycle
Creating data
• design research
• plan data management
• capture/create the data
• document process
Source: UK Data Archive, University of Essex.
http://www.data-archive.ac.uk/create-manage/life-cycle
8/76
Data management lifecycle
Processing data
• enter data, digitize, transcribe, translate
• check, validate, clean data
• anonymize data where necessary
• describe data
• manage and store data
9/76
Data management lifecycle
Analyzing data
• interpret / analyze data
• write publications
10/76
Data management lifecycle
Preserving data
• migrate data to best format / medium
• back-up and store data
• create final metadata and documentation
• archive data
11/76
Data management lifecycle
Giving access to data
• distribute / share data
• control access
12/76
Why would anyone need my data?
Cow concept: Dorothea Salo, “Save the Cows”, 2009.
http://www.slideshare.net/cavlec/save-the-cows-data-curation-for-the-rest-of-us-1533252
Analyze, process
Publish
You don’t need to kill the cow!
14/76
Data management lifecycle
Re-using data
• follow-up research
• new research
• check validity
15/76
Data management
1. Introduction
2. Incentives
• You and your data
• Government mandates
• Publisher requirements
• Lost credibility
• Faster progress, better science
• Citations
• Big data
3. Standards for description & documentation
4. Storage, archiving and sharing
5. Data management plans
17/76
Why worry about data management?
• Bad things can happen if you don’t
• People will make you anyway
• Sharing is win-win
…you can’t share what you can’t find, read, decipher
18/76
You and your data
• Make research process more efficient
• Comprehensibility
• Security
… what about other people and your data?
19/76
Sharing - sticks!
• Government mandate
o Data sharing
o Data management plans
• Publisher requirements
• Credibility issues
20/76
Sharing - carrots!
Faster progress!
Better science!
Also…
o Data becomes citable
o Data linked to publications
21/76
Government mandates
Timeline
1999: US Office of Management and Budget revised Circular A-110, making federally funded research data subject to Freedom of Information Act requests
2003: NIH adopted a data sharing policy.
(still no teeth, but young yet)
22/76
Government mandates
2008: NIH implements the Public Access Policy
2009: White House issues the Open Government Directive
2011 (Jan): NSF made data management plans a requirement
23/76
Government mandates (bigger sticks on the way?)
• NSTC’s Interagency Working Group on Digital Data
o 11/2011 Request for Information (RFI) on Public Access to Digital Data Resulting from Federally Funded Scientific Research
• NIH Director Working Group on Data and Informatics
o 1/2012 Request for Information for Input into the Deliberations of the Advisory Committee to the NIH Director Working Group on Data and Informatics
24/76
One response to RFI
“The Federal policy framework should move public access to digital data away from the current idiosyncratic environment to a systematic approach that lowers barriers to data access, discovery, sharing and re-use.”
- Sayeed Choudhury
The Sheridan Libraries of Johns Hopkins University
25/76
Postdoc survey: Data management/sharing plans
To what extent have you dealt with NIH data sharing regulations or NSF data management plans?
26/76
[Bar chart comparing responses Nationally and at NYULMC. Response categories: Not aware of policies; Aware but no involvement; Had to write data plan; Had to implement data plan. Extracted values: 38%, 48%, 12%, 8% and 39%, 48%, 11%, 10%.]
Publisher requirements
Nature:
“After publication, readers who encounter refusal by the authors to comply with these policies should contact the chief editor of the journal... In cases where editors are unable to resolve a complaint, the journal may refer the matter to the authors' funding institution and/or publish a formal statement of correction, attached online to the publication, stating that readers have been unable to obtain necessary materials to replicate the findings.”
http://www.nature.com/authors/policies/availability.html
27/76
Publisher requirements
Science:
“All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science. All computer codes involved in the creation or analysis of data must also be available to any reader of Science. After publication, all reasonable requests for data and materials must be fulfilled.”
http://www.sciencemag.org/site/feature/contribinfo/prep/gen_info.xhtml
28/76
Suspect data: Losing credibility
Comparison of statistical analyses: papers with shared data vs. papers with no sharing
• Papers with unshared data had more errors in the reporting of results
• Evidence in papers with unshared data was weaker
• p values of unshared data significantly closer to 0.05
NOTE: APA journals require sharing: 57% did not share.
Consequences? No teeth.
Wicherts JM, Bakker M, Molenaar D. Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS One. 2011;6(11):e26828. Epub 2011 Nov 2.
PMID:22073203; PMCID: PMC3206853.
29/76
Retraction: Lost credibility
“There were 60 children in the study. The ages were by accident duplicated between the upper and lower halves of the database. Thus, the ages for the first 30 children in the data set were identical and in the same order with the ages for the second set of 30 children…The files with the original data are not available any more, making it impossible to reconstruct a valid data set for reanalysis.”
http://www.ctajournal.com/content/2/1/6/abstract
30/76
Amy Wagers, Harvard stem cell researcher
• 1/2010 Nature article: retracted 10/2010
• 8/2008 Blood article: retracted 12/2011
Shane Mayack claimed “these errors occurred due to mistakes made in data retrieval that were a cause of a poor, but not a unique, data management and archiving system” but stands by the results.
http://retractionwatch.wordpress.com/category/by-author/amy-wagers-retractions/
Faster progress! Better science!
Case studies
• Human Genome Project
• Neuromorpho.org
31/76
Human Genome Project
• NIH’s first foray into big science
• Experiment in data sharing
• Establishment of Bermuda principles
o Automatic release of sequence assemblies
o Immediate publication of sequences
o Entire sequence freely available
• Full genome sequenced ahead of schedule
32/76
Neuromorpho.org
• Detailed morphological reconstructions of neurons
o Time-intensive
o Re-usable in many ways
• > 6k reconstructions deposited since 2006
• > 100k downloads in 2011
33/76
Citing data
• Interoperable data and publications
• Unique Identifiers
o Findable
o Citable
• More citations
34/76
Big data
“..almost everything about science is changing because of the impact of information technology. Experimental, theoretical, and computational science are all being affected by the data deluge, and a fourth, ‘data-intensive’ science paradigm is emerging.”
- Jim Gray, Fourth Paradigm (2009)
March 29, 2012: Federal government announces Big Data Research and Development Initiative, $200M+ from 6 agencies.
35/76
Data management
1. Introduction
2. Incentives
3. Standards for description & documentation
• File Names
• Databases
• Versioning
• Metadata
• Quality control
4. Storage, archiving and sharing
5. Data management plans
36/76
Postdoc Survey
How do you determine how to structure your data or what information to save about the data in order to be able to effectively access data in the future? (check all that apply)
[Bar chart comparing responses Nationally and at NYULMC. Response categories: No pre-defined standards; Personal standards; Lab-based standards; Discipline standards; Institutional standards; Other. Extracted values: 15%, 67%, 46%, 19%, 13%, 2% and 11%, 70%, 43%, 17%, 13%, 1%.]
37/76
Why to avoid making your own standard, if possible…
38/76
Standards. http://xkcd.com/927/
This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License.
What does your data look like?
• Many files or one file
• Raw format (numeric, images, binary)
• Processed format
• File sizes
39/76
File names
bob_1262011.tif
Bob Smith? Bob Jones?
12 June, 2011? December 6, 2011? January 26, 2011?
40/76
Unambiguous dates, the ISO standard:
• YYYYMMDD or YYYY-MM-DD
o e.g. 20120612 = June 12, 2012
• YYYY-MM-DDTHH:MM:SS
o e.g. 2012-06-12T14:03:12 = June 12, 2012 2:03:12 pm
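A date such as the one buried in bob_1262011.tif stops being ambiguous once it is rewritten in ISO order. The following is a minimal sketch in Python, assuming you already know which format the original date was typed in; the function names to_iso_date and to_iso_compact are hypothetical, chosen only for illustration.

```python
from datetime import datetime

def to_iso_date(text: str, source_format: str = "%m/%d/%Y") -> str:
    """Convert a date in a known source format to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(text, source_format).date().isoformat()

def to_iso_compact(text: str, source_format: str = "%m/%d/%Y") -> str:
    """Same conversion, in the compact YYYYMMDD form that suits file names."""
    return datetime.strptime(text, source_format).strftime("%Y%m%d")

# Assumption: the input below was entered month/day/year.
print(to_iso_date("6/12/2012"))     # 2012-06-12
print(to_iso_compact("6/12/2012"))  # 20120612

# A full timestamp in ISO 8601 extended form.
print(datetime(2012, 6, 12, 14, 3, 12).isoformat())  # 2012-06-12T14:03:12
```

ISO-ordered dates also sort chronologically when file names are sorted alphabetically, which is a practical reason to prefer them in file names.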
[Figure: 1 rat heart; 100s of slices; 100s of slides; 100s of huge images; 1000s of image files (TIF); 3 post docs; 5-7 experiments a week…]
41/76
File names should…
• Reflect contents of the file
• Use non-cryptic, intuitive names if possible
• Consider any character restrictions
• Uniquely identify the file
• Avoid special characters (e.g. *, $, &, #)
• Use underscores (“_”) instead of spaces or dashes
42/76
Example of a good file name
AtherRat_012_056_mb_0423.tif
AtherRat = experiment name
012 = experiment number
056 = sample number
mb = stain used, methylene blue
0423 = coordinates of image (4 across, 23 down)
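A naming convention like this one pays off most when software can build and check the names. The sketch below is one possible Python implementation, assuming three-digit experiment and sample numbers, a two-letter lowercase stain code, and two digits each for the across and down coordinates. Those widths are inferred from the single example above, and the names ImageName, build_name and parse_name are hypothetical.

```python
import re
from typing import NamedTuple

class ImageName(NamedTuple):
    experiment: str
    experiment_number: int
    sample_number: int
    stain: str
    across: int
    down: int

# Assumed layout: <experiment>_<3 digits>_<3 digits>_<2 letters>_<2+2 digits>.tif
NAME_PATTERN = re.compile(
    r"^(?P<experiment>[A-Za-z0-9]+)_(?P<exp_no>\d{3})_(?P<sample>\d{3})"
    r"_(?P<stain>[a-z]{2})_(?P<across>\d{2})(?P<down>\d{2})\.tif$"
)

def build_name(name: ImageName) -> str:
    """Assemble a file name from its documented parts."""
    return (
        f"{name.experiment}_{name.experiment_number:03d}_{name.sample_number:03d}"
        f"_{name.stain}_{name.across:02d}{name.down:02d}.tif"
    )

def parse_name(filename: str) -> ImageName:
    """Split a file name into its parts, rejecting names that break the convention."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"File name does not follow the convention: {filename}")
    return ImageName(
        match["experiment"],
        int(match["exp_no"]),
        int(match["sample"]),
        match["stain"],
        int(match["across"]),
        int(match["down"]),
    )

print(parse_name("AtherRat_012_056_mb_0423.tif"))
print(build_name(ImageName("AtherRat", 12, 56, "mb", 4, 23)))
```

Running parse_name over a whole folder is a quick way to find files that were named by hand and drifted from the convention.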
43/76
Spreadsheet
Faircloth BC, McCormack JE, Crawford NG, Harvey MG, Brumfield RT, Glenn TC (2011) Data from: Ultraconserved
elements anchor thousands of genetic markers for target enrichment spanning multiple evolutionary timescales. Dryad
Digital Repository. doi:10.5061/dryad.64dv0tg1
44/76
Relational database
Andrew Sparkes & Amanda Clare. AutoLabDB: a substantial open source database schema to support a high-throughput automated laboratory. Bioinformatics, first published online March 29, 2012. doi:10.1093/bioinformatics/bts140
45/76
Databases
• Intuitive / meaningful field & table names
• Ensure it will support scope of analysis
• Institutional support for data modeling?
• Check for reusable discipline-based standard
o E.g. EUROCarbDB
46/76
Document your file/field names
• What do file/field names mean?
• What does each file/field contain?
• How do files/data relate to each other?
• Is there a limited set of possible values?
Name | Type | Description | Possible values
Stain | Text | Stain used on cell sample | IO = Iodine; EY = Eosin Y; MB = Methylene blue
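Field documentation of this kind can also be stored in a machine-readable form and used to check incoming records. The following Python sketch assumes a simple in-memory dictionary; the structure of DATA_DICTIONARY and the function validate_record are illustrative choices, not part of any standard.

```python
# Hypothetical machine-readable version of the field documentation above.
DATA_DICTIONARY = {
    "Stain": {
        "type": str,
        "description": "Stain used on cell sample",
        "allowed": {"IO": "Iodine", "EY": "Eosin Y", "MB": "Methylene blue"},
    },
}

def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if it is valid)."""
    problems = []
    for field, value in record.items():
        spec = DATA_DICTIONARY.get(field)
        if spec is None:
            problems.append(f"{field}: not documented in the data dictionary")
            continue
        if not isinstance(value, spec["type"]):
            problems.append(f"{field}: expected {spec['type'].__name__}")
        elif spec["allowed"] and value not in spec["allowed"]:
            problems.append(f"{field}: '{value}' is not one of {sorted(spec['allowed'])}")
    return problems

print(validate_record({"Stain": "MB"}))  # no problems
print(validate_record({"Stain": "XX"}))  # value outside the documented set
```

Keeping the documentation and the validation rules in one place means they cannot silently drift apart.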
47/76
Version control
• Have a plan
• Be consistent
• Document changes between versions (what, who)
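One lightweight way to follow these points is to keep a plain-text change log next to the data. The sketch below assumes a tab-separated log line and a _vNN suffix in file names; both conventions and both function names are illustrative only, and a dedicated version control system may suit your lab better.

```python
from datetime import date

def versioned_name(stem: str, version: int, extension: str) -> str:
    """Build a file name with a zero-padded version suffix, e.g. results_v02.csv."""
    return f"{stem}_v{version:02d}.{extension}"

def changelog_entry(version: int, who: str, what: str, when: date) -> str:
    """Format one change-log line recording the version, date, person and change."""
    return f"v{version:02d}\t{when.isoformat()}\t{who}\t{what}"

print(versioned_name("results", 2, "csv"))
print(changelog_entry(2, "K. Hanson", "Removed duplicate rows", date(2012, 6, 4)))
```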
48/76
Metadata definition
Metadata is:
“Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource”
NISO (2004). Understanding Metadata. NISO Press
http://www.niso.org/publications/press/UnderstandingMetadata.pdf
49/76
Why use structured metadata?
• Systematic approach to capturing descriptive information
• Supports 3rd party use by:
o Making data findable (if metadata put online)
o Providing context
o Providing a unique identifier
50/76
Metadata considerations
• What standard should you use?
• Are there discipline standards?
• Which is the best fit?
• Where are you depositing your data?
51/76
Metadata – specialized standard Neuromorpho
• Neuromorpho ID (UID)
• Neuron Name
• Archive (researcher) name
• Species
• Strain of species
• Age range
• Gender
• Weight range
• Developmental stage
• Primary/Secondary/Tertiary brain regions
• Primary/Secondary/Tertiary Cell classes
• Original data format
• Experiment condition
• Experiment protocol
• Staining method
• Slicing Direction/Thickness
• Tissue Shrinkage
• Objective Type
• Magnification
• Reconstruction Method
• Dates of Deposition/Upload
• Associated publications
• Web URL of archives (if available) with any additional information about the reconstruction
52/76
Metadata – specialized standard GenBank
• Locus
• Definition
• Accession number (UID)
• Version
• Keywords
• Source organism
• Reference(s)
o Authors
o Title
o Journal
o PubMed ID
• Features
o Source
o RBS (ribosome binding site)
o gene
o CDS (protein coding sequence)
• Terminator
• Modification date
55/76
Metadata – general standard
• Dublin Core
o Designed to be generic/flexible
o Usually stored as XML
e.g. <dc:creator>Hanson, Karen L.</dc:creator>
o 15 fields:
Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, Type
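To show what a small Dublin Core record might look like in practice, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The dc namespace URI is the published Dublin Core element set namespace; the wrapper element name "metadata" and the function dublin_core_record are assumptions made for this example, since repositories differ in how they wrap records.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dublin_core_record(fields: dict) -> str:
    """Serialize a mapping of Dublin Core element names to values as XML."""
    root = ET.Element("metadata")  # assumed wrapper element
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")

record = dublin_core_record({
    "title": "Introduction to Data Management",
    "creator": "Hanson, Karen L.",
    "date": "2012-06-04",
})
print(record)
```

Because every element is optional and repeatable in Dublin Core, a real record would often carry several creator or subject elements; this sketch keeps one value per element for brevity.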
56/76
Minimum Information for Biological and Biomedical Investigations
• Covers data and metadata
• Standards for diverse bioscience communities
• ~35 guidelines so far
• Recommended by Science magazine
Let’s take a look… http://mibbi.sourceforge.net/portal.shtml
57/76
Quality control
• Assign a person to be responsible
o Naming conventions adhered to
o Good data quality
o Access controls in place
o Version controls followed
58/76
Data management
1. Introduction
2. Incentives
3. Standards for description & documentation
4. Storage, archiving and sharing
• Backups
• Storage
• Security
• Archiving / preservation
• Sharing
5. Data management plans
60/76
Backups
• Make a backup plan
• Multiple copies
• Geographically dispersed
61/76
Storage options
• Ask I.T.
o Enterprise server
o IT managed cloud options
o Data warehouse
o Lab Information Management System (LIMS)
o Other systems?
• Proprietary cloud options (in a pinch)
o Check ownership policies
o Pick >1 provider
62/76
Security considerations
• Reasons to be concerned about security
o Ethical
o Commercial
o Privacy (e.g. HIPAA)
• Work with I.T.
• Other things:
o Add passwords
o Lock unused machines
o Sign use agreements
• Publishing/sharing data? May need to de-identify
63/76
Postdoc survey results
When you have finished analyzing/publishing from a dataset, where do you store it for long-term preservation, management, and/or access?
[Bar chart comparing responses Nationally and at NYULMC. Response categories: Institutional Repository; Discipline-specific Repository; Other; Do not store for long-term preservation. Extracted values: 39%, 19%, 31%, 22% and 45%, 14%, 30%, 20%.]
64/76
Digital preservation
• Storage ≠ preservation!
• Digital preservation is…
“a set of activities required to make sure digital objects can be located, rendered, used and understood in the future”
http://www.digitalpreservationeurope.eu/what-is-digital-preservation/
• Protects from
o hardware obsolescence
o software obsolescence
o file integrity issues
65/76
Digital repositories
• For digital preservation, storage and/or sharing
• Types of repositories:
o Institutional
o Discipline specific (GenBank)
o Cross disciplinary (Dryad)
66/76
Data format
• Collection vs dissemination format
• Software export features
• Open formats e.g. XML, CSV, PDF, TIFF
• No open format? Use common proprietary formats e.g. DOC, SPSS
• Unencrypted
• Uncompressed
67/76
Data ownership
• Can’t assume you own data
• Check for:
o Funder policies on data ownership
o Institution policies on data ownership
68/76
Data management
1. Introduction
2. Incentives
3. Standards for description & documentation
4. Storage, archiving and sharing
5. Data Management Plans
69/76
What should be included in the plan?
• Types of data
• Methods of collection
• Standards that will be applied
• Backup and storage procedures
• Plans for archiving / preservation
• Access policies and provisions for secondary use
• Measures to protect privacy or intellectual property
List adapted from NYU Libraries, Data Management Libguide
http://nyu.libguides.com/data_management
70/76
Data management plans
Where to start?
• Purdue’s Self Assessment Questionnaire http://research.hub.purdue.edu/resources/7
• MIT’s Data Management Check List
• NIH Data Sharing
There are good recipes…
…don’t reinvent the wheel!
71/76
Conclusions
• Plan data management before starting research
• Documentation, documentation, documentation
• Can’t ignore the march toward research data sharing… get ready!
73/76
Resources
http://nyuhsl.libguides.com/data_management
74/76
http://nyuhsl.libguides.com/data_management
Thank you... Questions?
http://hsl.med.nyu.edu
76/76