Introduction to Data Management and Sharing

DESCRIPTION

Scholars and researchers are being asked by an increasing number of research sponsors and journals to outline how they will manage and share their research data. This is an introduction to data management and sharing practices with some specific information for Columbia University researchers.

Page 1: Introduction to Data Management and Sharing

Introduction to Data Management and Sharing

University Libraries/Information Services
Office of Research Compliance and Training

Page 2: Introduction to Data Management and Sharing

Why is there a new focus on data management and sharing?

Page 3: Introduction to Data Management and Sharing

Data sharing is not widely practiced…

• Lack of time: data cleanup, user questions
• Lack of recognition: not valued in promotion/tenure
• Lack of control: worries about scooping, misinterpretation
• Legal concerns: copyright, patents
• Inadequate infrastructure

Page 4: Introduction to Data Management and Sharing

…yet its value is recognized

Data sharing was a key element of:

• Human Genome Project
• NIH-funded Alzheimer’s study published in April 2011
• Sloan Digital Sky Survey

Page 5: Introduction to Data Management and Sharing

There are new possibilities…

Networked digital technology creates new potential for:

• data collection
• data analysis
• data “mash ups”
• collaboration
• citizen science

Image credit: National Science Foundation

Page 6: Introduction to Data Management and Sharing

…and science is in the spotlight

“The impact of science on people’s lives, and the implications of scientific assessments for society and the economy are now so great that people won’t just believe scientists when they say ‘trust me, I’m an expert.’ … Science has to adapt.”

- Geoffrey Boulton, chair of working group for study: Science as a public enterprise: opening up scientific information, 5.13.11

Page 7: Introduction to Data Management and Sharing

These factors have changed the conversation, resulting in…

Page 8: Introduction to Data Management and Sharing

Calls for data accessibility…

“It is obvious that making data widely available is an essential element of scientific research.”

- Science editorial “Making Data Maximally Available,” 2.11.11

Page 9: Introduction to Data Management and Sharing

…and new data management policies

NSF and other research sponsors are strengthening their data management and sharing policies to help:

• increase the accessibility of data
• create standards and protocols
• develop interoperable data repositories
• encourage transparency of research

Page 10: Introduction to Data Management and Sharing

Submitting a proposal to the NSF?

You must:

• Submit a two-page data management plan with your proposal.
• Share your research data (or justify why you should not share it).

Page 11: Introduction to Data Management and Sharing

Publishing in a Nature journal?

“…authors are required to make materials, data and associated protocols promptly available to readers.”

Page 12: Introduction to Data Management and Sharing

More than ever, researchers are expected to make their data accessible to, and usable by, others.

Page 13: Introduction to Data Management and Sharing

This means…

Having a data management plan is more important than ever.

Image credit: Library of Congress

Page 14: Introduction to Data Management and Sharing

Data management plan (DMP)

A data management plan outlines how you will collect, organize, manage, store, secure, back up, preserve, and share your data.

Image credit: Academic Commons

Page 15: Introduction to Data Management and Sharing

Other DMP elements

• Designating who is responsible for data management
• Tools or software needed to create/process/visualize the data
• Compliance with policies and regulations

Image credit: NIST

Page 16: Introduction to Data Management and Sharing

Columbia DMP Template

• Columbia provides a DMP template.
• Though created in response to NSF requirements, you can use it as a guide for creating any DMP.
• You can find the template on the NSF Data Management Requirements page of this website.

Page 17: Introduction to Data Management and Sharing

Some points to consider when creating your DMP

Page 18: Introduction to Data Management and Sharing

Your data storage needs

• Data formats and size
• Retention period
• Privacy or security requirements
• Backup plan
• Access requirements

Image credit: Pittsburgh Supercomputing Center

Page 19: Introduction to Data Management and Sharing

Data storage planning

• Plan for the entire life-cycle.
• Establish a baseline and project the rate of growth for the duration of the project.

Image credit: CDC/Dorothy Roland
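
The baseline-and-growth projection described on this slide can be sketched in a few lines of Python. This is only an illustration, assuming compound monthly growth; the function name `project_storage_gb` and the sample figures are hypothetical and do not come from the slides.

```python
def project_storage_gb(baseline_gb, monthly_growth_rate, months):
    """Project storage needs month by month.

    Assumes compound monthly growth, which is a simplifying assumption;
    real projects may grow in bursts (for example, after each field season).
    """
    if baseline_gb < 0 or months < 0:
        raise ValueError("baseline_gb and months must be non-negative")
    projections = []
    current = baseline_gb
    for _ in range(months):
        current *= 1 + monthly_growth_rate
        projections.append(round(current, 2))
    return projections


if __name__ == "__main__":
    # Hypothetical figures: 100 GB baseline, 5% growth per month, 36-month project.
    plan = project_storage_gb(100, 0.05, 36)
    print(f"Estimated storage needed at project end: {plan[-1]} GB")
```

Multiplying the final figure by the number of backup copies you intend to keep gives a rough total to budget for.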

Page 20: Introduction to Data Management and Sharing

Two types of storage

• Active: frequent additions and updates
• Archival: in fixed form; only need periodic access

Image credit: CDC

Page 21: Introduction to Data Management and Sharing

Active storage at Columbia

• School/department/division servers: Many researchers use servers managed by “local” IT groups.
• CUIT: 20-80 MB personal storage; central LAN service
• Center for Digital Research & Scholarship: Consultation available

Page 22: Introduction to Data Management and Sharing

Archival storage at Columbia

• Digital: Academic Commons is Columbia’s online research repository.
• Physical: Consult the appropriate Columbia University Libraries archive.

Page 23: Introduction to Data Management and Sharing

Best archival file formats

• Nonproprietary file formats
• Uncompressed and unencrypted files
• Consider ease of migration going forward
• May need to archive software as well as data

Image credit: INL

Page 24: Introduction to Data Management and Sharing

Data retention requirements

Page 25: Introduction to Data Management and Sharing

Other important retention policies

• NIH: 3 years
• NSF: Check with individual NSF directorates
• Health Insurance Portability and Accountability Act (HIPAA): At least 6 years

Image credit: USGS

Page 26: Introduction to Data Management and Sharing

Data security and integrity

• Security: Protect data from unauthorized access or accidental disclosure.
• Integrity: Ensure that data remains unaltered before, during, and after analysis and presentation.

Image credit: NPS

Page 27: Introduction to Data Management and Sharing

Data security requirements

Your data may be subject to laws and policies such as:

• HIPAA (Health Insurance Portability and Accountability Act)
• IRB (Institutional Review Board)
• Columbia computing policies: See the Computing and Technology section of the Columbia Administrative Policy Library

Page 28: Introduction to Data Management and Sharing

Physical security best practices

• Restricted access to research facilities, computers, data
• Only trusted individuals troubleshoot computer problems
• Lab notebooks, samples in locked cabinets

Image credit: Lawrence Berkeley National Laboratory

Page 29: Introduction to Data Management and Sharing

Digital security best practices

• Sensitive data on computers not connected to Internet
• Virus protection up to date
• No confidential data via e-mail or FTP
• Passwords to access files and computers
• Proper data disposal at end of retention period

Image credit: Lawrence Livermore National Laboratory

Page 30: Introduction to Data Management and Sharing

Data backup best practices

• Make 3 copies: original; external/local; external/remote (different geographic area)
• Verify recovery is possible: checksum validation; test file restore after initial setup and periodically thereafter
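
Checksum validation, mentioned above as a way to verify that recovery is possible, can be sketched as follows. This is a minimal example using Python's standard `hashlib` module; the helper names `sha256_of` and `backup_matches`, the choice of SHA-256, and the file paths are assumptions made for illustration, not a procedure prescribed by the slides.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large data files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original, backup):
    """True if the backup copy has the same checksum as the original."""
    return sha256_of(original) == sha256_of(backup)


if __name__ == "__main__":
    # Hypothetical paths; substitute the locations of your own copies.
    original = Path("data/survey_2011.csv")
    backup = Path("/mnt/backup/survey_2011.csv")
    if original.exists() and backup.exists():
        print("Backup verified" if backup_matches(original, backup) else "MISMATCH")
```

Recording the digest of each file when it is first archived lets you repeat the same comparison during later periodic restore tests.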

Page 31: Introduction to Data Management and Sharing

Data backup options

• Hard drive
• Tape back-up
• Server
• Cloud storage: Amazon S3
• Subject repository/data centers: Examples: PubChem, Dryad, IRI/LDEO

Image credit: NIH

Page 32: Introduction to Data Management and Sharing

Sharing requirements

How, when, and what you share depends on:

• Data format
• Restrictions on data
• Funder and publisher guidelines
• Customary embargo periods
• Availability of appropriate repositories or other vehicles for sharing

Image credit: NIH

Page 33: Introduction to Data Management and Sharing

Sample data sharing guidelines

Page 34: Introduction to Data Management and Sharing

Sharing restrictions

Under HIPAA (Health Insurance Portability and Accountability Act), you cannot share information that compromises the confidentiality or privacy of human subjects. Any data resulting from studies using human subjects must be scrubbed of identifying information.
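
As a rough illustration of what scrubbing identifying information can look like for tabular records, the sketch below drops a set of directly identifying fields from each record. The field names in `IDENTIFYING_FIELDS` and the function `scrub_record` are hypothetical. Removing named columns is only a first step, since combinations of other fields can still identify a person, so treat this as a starting point rather than a complete de-identification procedure.

```python
# Hypothetical list of direct identifiers; adapt it to your own dataset.
IDENTIFYING_FIELDS = {"name", "email", "phone", "address", "date_of_birth"}


def scrub_record(record, identifying_fields=IDENTIFYING_FIELDS):
    """Return a copy of the record without directly identifying fields.

    Field names are compared case-insensitively.
    """
    return {
        key: value
        for key, value in record.items()
        if key.lower() not in identifying_fields
    }


if __name__ == "__main__":
    participant = {
        "Name": "Jane Doe",
        "email": "jane@example.org",
        "age_group": "30-39",
        "score": 4.2,
    }
    print(scrub_record(participant))
```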

Page 35: Introduction to Data Management and Sharing

Sharing restrictions

You may have other reasons that justify not sharing your data, and you can detail these in your data management plan. Funders may allow exceptions to data sharing policies.

Page 36: Introduction to Data Management and Sharing

Don’t forget metadata

Metadata is structured information that describes, explains, locates, and otherwise makes it easier to retrieve and use an information resource.

Image credit: BLM NTSC

Page 37: Introduction to Data Management and Sharing

Metadata facilitates use of your data

“The metadata accompanying your data should be written for a user 20 years into the future -- what does that person need to know to use your data properly? Prepare the metadata for a user who is unfamiliar with your project, methods, or observations.”

- Oak Ridge National Laboratory Distributed Active Archive Center
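
To make the idea of structured metadata concrete, here is a small sketch that writes a descriptive record as JSON alongside a dataset. The field names and the function `build_metadata_record` are illustrative assumptions only and do not follow any particular schema. A real project would more likely adopt a disciplinary standard such as Darwin Core, DDI or EML.

```python
import json


def build_metadata_record(title, creator, description,
                          date_collected, file_format, methods):
    """Return a JSON string describing a dataset.

    The fields are an illustrative minimum, chosen so that a reader
    unfamiliar with the project can tell what the data is and how it was made.
    """
    record = {
        "title": title,
        "creator": creator,
        "description": description,
        "date_collected": date_collected,
        "file_format": file_format,
        "methods": methods,
    }
    return json.dumps(record, indent=2)


if __name__ == "__main__":
    # Hypothetical dataset used purely as an example.
    print(build_metadata_record(
        title="Stream temperature readings",
        creator="Example Lab",
        description="Hourly water temperature at three sites.",
        date_collected="2011-05-13",
        file_format="CSV",
        methods="Submerged loggers; values in degrees Celsius.",
    ))
```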

Page 38: Introduction to Data Management and Sharing

Major metadata standards

• Darwin Core (Biology)
• DDI (Data Documentation Initiative, for social and behavioral sciences data)
• DIF (Directory Interchange Format for scientific data)
• EML (Ecological Metadata Language)
• FGDC/CSDGM (geographic data)
• NBII (National Biological Information Infrastructure)

Page 39: Introduction to Data Management and Sharing

Repositories for data sharing

Online data repositories:

• organized around institutions or subjects
• often open access
• archival, not active
• may offer: long-term preservation and access; search engine optimization; permanent URL or DOI

Page 40: Introduction to Data Management and Sharing

Columbia’s repository

Academic Commons accepts materials from faculty, students, and staff.

• secure replicated storage
• accurate metadata
• globally accessible repository
• contextual linking between data and publications
• a permanent URL

Page 41: Introduction to Data Management and Sharing

Some subject-based repositories

• Space science mission repository
• Cryospheric data repository
• Macromolecular structural data repository
• Marine data repository
• Biological activities of small molecules data repository

Page 42: Introduction to Data Management and Sharing

More subject-based repositories

• Deep-sea core samples repository housed at LDEO
• Data repository for archeology and related disciplines
• Basic and applied biosciences data repository
• Geodesy data repository
• Social science data repository

Page 43: Introduction to Data Management and Sharing

Licensing your data

• Copyright issues around data can be complex
• These groups offer “ready-made” licenses for data that help clarify any restrictions on reuse

Page 44: Introduction to Data Management and Sharing

For more information

• Data Management section of Scholarly Communication Program website
• Sponsored Projects Administration
• Office of Research Compliance and Training
• Center for Digital Research and Scholarship
• CUIT
• Computing and Technology section of Columbia Administrative Policy Library