Data Management for Research Aaron Collie, MSU Libraries Lisa Schmidt, University Archives

Data Management for Research (New Faculty Orientation)


DESCRIPTION

Situates research data management as a contingency that should be addressed and provisioned for during planning and research design. Draws out fundamental practices for file management and data description, and enumerates storage decision points.


Page 1: Data Management for Research (New Faculty Orientation)

Data Management for Research

Aaron Collie, MSU Libraries
Lisa Schmidt, University Archives

Page 2: Data Management for Research (New Faculty Orientation)

Introductions
• Please tell us your name and department
• A brief description of your primary research area
• What do you consider to be your research data?
• Experience and/or comfort level with managing research data?

cc http://www.flickr.com/photos/quinnanya/

Page 3: Data Management for Research (New Faculty Orientation)

Data Management. Isn’t that… trivial?

Not so much. Data is a primary output of research; it is very expensive to produce high quality data. Data may be collected in nanoseconds, but it takes the expert application of research protocol and design to generate data.

CC-BY-SA-3.0 Rob Lavinsky

Page 4: Data Management for Research (New Faculty Orientation)

Even more consequential, data is the input of a process that generates higher orders of understanding.

Wisdom

Knowledge

Information

Data

Understanding is hierarchical!

Russell Ackoff

Page 5: Data Management for Research (New Faculty Orientation)

This is the engine of the academic industry…

Define a question → Gather information → Form a hypothesis → Test the hypothesis → Analyze the data → Interpret the data → Publish results → Retest

Page 6: Data Management for Research (New Faculty Orientation)

Define a question → Gather information → Form a hypothesis → Test the hypothesis → Analyze the data → Interpret the data → Publish results → Retest

Page 7: Data Management for Research (New Faculty Orientation)

So, things can get a little messy.

Page 8: Data Management for Research (New Faculty Orientation)

Define a question → Gather information → Form a hypothesis → Test the hypothesis → Analyze the data → Interpret the data → Publish results → Retest

The scientific method “is often misrepresented as a fixed sequence of steps,” rather than being seen for what it truly is, “a highly variable and creative process” (AAAS 2000:18).

Gauch, Hugh G. Scientific Method in Practice. New York: Cambridge University Press, 2010. Print. (Emphasis added)

Page 9: Data Management for Research (New Faculty Orientation)

Define a question → Gather information → Form a hypothesis → Test the hypothesis → Analyze the data → Interpret the data → Publish results → Retest

Page 10: Data Management for Research (New Faculty Orientation)

The Research Depth Chart

Scientific Method

Research Design

Research Method

Research Tasks

Axis: More Generic ↔ More Specific

Page 11: Data Management for Research (New Faculty Orientation)

Define a question → Gather information → Form a hypothesis → Test the hypothesis → Analyze the data → Interpret the data → Publish results → Retest

Problem Identification

Study Concept

Literature Review

Environmental Scan

Funding & Proposal

Research Design

Research Methodology

Research Workflow

Hypothesis Formation

Design Validation

Research Activity

Data Management

Data Organization

Data Storage

Data Description

Data Sharing

Scholarly Communication

Report Findings

Publish

Peer Review

Page 12: Data Management for Research (New Faculty Orientation)

Define a question → Gather information → Form a hypothesis → Test the hypothesis → Analyze the data → Interpret the data → Publish results → Retest

Problem Identification

Study Concept

Literature Review

Environmental Scan

Funding & Proposal

Research Design

Research Methodology

Research Workflow

Hypothesis Formation

Design Validation

Research Activity

Data Management

Data Organization

Data Storage

Data Description

Data Sharing

Scholarly Communication

Report Findings

Publish

Peer Review

Page 13: Data Management for Research (New Faculty Orientation)

Upfront Decisions for Researchers
• How are the data described and organized?
• Who are the expected and potential audiences for the datasets?
• What publications or discoveries have resulted from the datasets?
• How should the data be made accessible?
• How might the data be used, reused, and repurposed?

Page 14: Data Management for Research (New Faculty Orientation)

Upfront Decisions for Researchers
• What is the expected lifespan of the data?
• Besides the researcher(s) on the project, who else should be given access to the data?
• Does the dataset include any sensitive information?
• Who owns or controls the research data?
• Should any restrictions be placed on the dataset?
• How are the data stored and preserved?

Page 15: Data Management for Research (New Faculty Orientation)

Agenda

• Introduction
• Background
  • The Impetus: NSF Data Management Plan Mandate
  • The Effect: Policy to Practice
  • The Response: Changing Data Landscape
• Fundamental Practices
  • File Organization
  • Data Documentation
  • Reliable Backup
  • Data Publishing, Sharing, & Reuse
  • Protecting Data & Responsible Reuse
• Data Lifecycle Resources

Page 16: Data Management for Research (New Faculty Orientation)

But why are we really here?

Impetus: NSF has mandated that all grant applications submitted after January 18th, 2011 must include a supplemental “Data Management Plan”

Effect: The original NSF mandate has had a domino effect, and many funders now require or state guidelines for data management of grant funded research

Response: Data management has not traditionally received a full treatment in (many) graduate and doctoral curricula; intervention is necessary

Page 17: Data Management for Research (New Faculty Orientation)

Impetus: NSF Data Management Plan

• Policies for re-use, re-distribution, and creation of derivatives
• Plans for archiving data, samples, and other research outcomes, maintaining access
• Types of data, samples, physical collections, software generated
• Standards for data and metadata format and content
• Access and sharing policies, with stipulations for privacy, confidentiality, security, intellectual property, or other rights or requirements

Page 18: Data Management for Research (New Faculty Orientation)

Impetus: NSF Data Management Plan

• NSF will not evaluate any proposal missing a DMP
• PI may state that project will not generate data
• DMP is reviewed as part of intellectual merit or broader impacts of application, or both
• Costs to implement DMP may be included in proposal’s budget
• May be up to two pages long

Page 19: Data Management for Research (New Faculty Orientation)

Effect: Funder Policies

NASA “promotes the full and open sharing of all data”

“requires that data…be submitted to and archived by designated national data centers.”

“expects the timely release and sharing of final research data"

"IMLS encourages sharing of research data."

“…should describe how the project team will manage and disseminate data generated by the project”

Page 20: Data Management for Research (New Faculty Orientation)

Effect: More is on the way

Presidential Memorandum on Managing Government Records (August 24, 2012)
• Managing Government Records Directive: All permanent electronic records in Federal agencies will be managed electronically to the fullest extent possible for eventual transfer and accessioning by NARA in an electronic format.

White House policy memo (February 22, 2013)
• Increasing Access to the Results of Federally Funded Scientific Research: Federal agencies with more than $100M in R&D expenditures must develop plans to make the published results of federally funded research freely available to the public within one year of publication.

Page 21: Data Management for Research (New Faculty Orientation)

Effect: Local Policy

University Research Council Best Practices
Research Data: Management, Control, and Access

• To assure that research data are appropriately recorded, archived for a reasonable period of time, and available for review under the appropriate circumstances.
  – Ownership = MSU
  – “Stewardship” = You
  – Period of Retention = 3 years
  – Transfer of Responsibility = Written Request

Page 22: Data Management for Research (New Faculty Orientation)

Response: Changing Data Landscape

• Data Management Competencies
• Standards & Best Practices
• Discipline Specific Discourse

• Data sharing and open data
• Data sets as publications
• Data journals
• Citations for data (e.g., used in secondary analysis)
• Data as supplementary materials to traditional articles
• Data repositories and archives

Page 23: Data Management for Research (New Faculty Orientation)

Data Sharing Impacts
• Reinforces open scientific inquiry
• Encourages diversity of analysis and opinion
• Promotes new research, testing of new or alternative hypotheses and methods of analysis
• Supports studies on data collection methods and measurement

Cc http://www.flickr.com/photos/pinchof_10/

Page 24: Data Management for Research (New Faculty Orientation)

Data Sharing Impacts

Facilitates education of new researchers

Enables exploration of topics not envisioned by initial investigators

Permits creation of new datasets by combining data from multiple sources

Page 25: Data Management for Research (New Faculty Orientation)

Agenda

• Introduction
• Background
  • The Impetus: NSF Data Management Plan Mandate
  • The Effect: Policy to Practice
  • The Response: Changing Data Landscape
• Fundamental Practices
  • File Organization
  • Data Documentation
  • Reliable Backup
  • Data Publishing, Sharing, & Reuse
  • Protecting Data & Responsible Reuse
• Data Lifecycle Resources

Page 26: Data Management for Research (New Faculty Orientation)

File Organization Practices: Overview

1. Design a file plan for your research project

2. Use file naming conventions that work for your project

3. Choose file formats to maximize usefulness

“When I was a freshmen I named my assignments Paper Paperr Paperrr Paperrrr”-Undergrad

Page 27: Data Management for Research (New Faculty Orientation)

Design a File Plan

• File structure is the framework
• Classification system makes it easier to locate folders/files
• Benefits:
  • Simple organization intuitive to team members and colleagues
  • Reduces duplicate copies in personal drives and e-mail attachments

Page 28: Data Management for Research (New Faculty Orientation)

Design a File Plan

Choose a sortable directory hierarchy
• Example 1: Investigator, Process, Date
  Collie/TEI_Encoding/20110117
• Example 2: Instrument, Date, Sample
  Usability Survey/20120430/Sample 1

Page 29: Data Management for Research (New Faculty Orientation)

Design a File Plan

Example documentation of Directory Hierarchy: /[Project]/[Grant Number]/[Event]/[Investigator/Date]
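The documented hierarchy above could be encoded in a small helper so that every team member creates folders the same way. The following is a minimal Python sketch and not part of the original slides: the function name `project_dir` and all sample values are hypothetical, and reading the last level as investigator and date joined by an underscore is an assumption about what `[Investigator/Date]` means.

```python
from datetime import date
from pathlib import PurePosixPath


def project_dir(project, grant_number, event, investigator, when):
    """Build a path shaped like /[Project]/[Grant Number]/[Event]/[Investigator/Date].

    Assumption: the final level combines the investigator with a YYYYMMDD
    date so that folders sort chronologically for each investigator.
    """
    leaf = f"{investigator}_{when:%Y%m%d}"
    return PurePosixPath("/", project, grant_number, event, leaf)


# Hypothetical values, for illustration only.
example = project_dir("KrillStudy", "GRANT-0000", "FieldSeason1", "sharpeW", date(2011, 1, 17))
print(example)
```

Writing the date as YYYYMMDD keeps the folders sortable by name, which is the property the slide asks for.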

Page 30: Data Management for Research (New Faculty Orientation)

Use File Naming Conventions

Why file naming conventions?
• Enable better access/retrieval of files
• Create logical sequences for file sorting
• More easily identify what you’re searching for

Page 31: Data Management for Research (New Faculty Orientation)

Use File Naming Conventions
• Meaningful but short (255-character limit)
• Use alphanumeric characters
  Example: abc123
• Capital letters or underscores differentiate between words
• Surname first followed by initials of first name

Page 32: Data Management for Research (New Faculty Orientation)

Use File Naming Conventions
• Year-month-day format for dates, with or without hyphens
  Example 1: 2006-03-13
  Example 2: 20060313
• Decide on a simple versioning method
  Example: file_v001
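A simple versioning method such as `file_v001` is easy to automate. The sketch below is one possible approach, not something prescribed by the slides: the helper name `next_version` is hypothetical, and it assumes the version tag is a trailing `_v` plus digits, optionally followed by a single extension.

```python
import re


def next_version(filename):
    """Increment a trailing _vNNN tag, keeping its zero padding.

    Assumption: the name ends in _v<digits>, optionally followed by one
    extension (for example file_v001 or file_v001.csv).
    """
    match = re.fullmatch(r"(.*_v)(\d+)(\.[^.]+)?", filename)
    if match is None:
        raise ValueError(f"no _vNNN version tag in {filename!r}")
    stem, number, ext = match.group(1), match.group(2), match.group(3) or ""
    return f"{stem}{int(number) + 1:0{len(number)}d}{ext}"


print(next_version("file_v001"))  # prints file_v002
```

Zero-padded version numbers keep files in the right order when sorted by name.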

Page 33: Data Management for Research (New Faculty Orientation)

Use File Naming Conventions

To create consistent file names, specify a template such as:

[investigator]_[descriptor]_[YYYYMMDD].[ext]

This: sharpeW_krillMicrograph_backscatter3_20110117.tif
Not This: KrillData2011.tif

This: borgesJ_collocation_20080414.xml
Not This: Borges_Textbase.xml
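The template above can be enforced in code instead of by memory. This is a minimal Python sketch under stated assumptions: the function name `build_filename` is hypothetical, and the checks (letters, digits, and underscores only, at most 255 characters) are one reading of the naming advice on the earlier slides.

```python
import re
from datetime import date


def build_filename(investigator, descriptor, when, ext):
    """Return a name shaped like [investigator]_[descriptor]_[YYYYMMDD].[ext]."""
    for part in (investigator, descriptor):
        # Allow only letters, digits, and underscores in each name part.
        if not re.fullmatch(r"[A-Za-z0-9_]+", part):
            raise ValueError(f"unsupported characters in {part!r}")
    name = f"{investigator}_{descriptor}_{when:%Y%m%d}.{ext.lstrip('.')}"
    if len(name) > 255:
        raise ValueError("file name exceeds 255 characters")
    return name


print(build_filename("borgesJ", "collocation", date(2008, 4, 14), "xml"))
# prints borgesJ_collocation_20080414.xml
```

Generating names from one function keeps a whole project consistent, including files added by new team members.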

Page 34: Data Management for Research (New Faculty Orientation)

Choose Appropriate File Formats

• Non-proprietary
• Open, documented standard
• Common usage by research community
• Standard representation (ASCII, Unicode)
• Unencrypted
• Uncompressed

Page 35: Data Management for Research (New Faculty Orientation)

Choose Appropriate File Formats

Format Genre | Optimal Standards
TEXT | .txt; .odt; .xml; .html
AUDIO | .flac; .wav
VIDEO | .mp2/.mp4; .mkv
IMAGE | .tif; .png; .svg; .jpg
DATA | .sql; .csv

Page 36: Data Management for Research (New Faculty Orientation)

Documentation Practices: Overview

Even researchers require proper documentation to decipher or reuse their datasets

Documentation = accessible, intelligible datasets

Page 37: Data Management for Research (New Faculty Orientation)

Documentation Practices: Overview

1. At minimum create a README file that you can use to document your project

2. Utilize standards for describing data including Metadata Standards

3. If applicable, use in-line code commentary to explain code

(cc) Will Scullin

Page 38: Data Management for Research (New Faculty Orientation)

Create a README file

At minimum, store documentation in readme.txt file or equivalent, with data

Page 39: Data Management for Research (New Faculty Orientation)

Create a README file

Significant documentation about dataset
• What data consists of
• How it was collected
• Restrictions to distribution or use
• Other descriptive information
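A README skeleton covering the items above can be generated so that no section is forgotten. The following Python sketch is illustrative only: the section wording, the function names `readme_text` and `write_readme`, and the `TODO` placeholder are assumptions, not a format defined by the slides.

```python
from pathlib import Path

# Section headings adapted from the slide; the wording is an assumption.
README_SECTIONS = [
    "What the data consists of",
    "How it was collected",
    "Restrictions on distribution or use",
    "Other descriptive information",
]


def readme_text(title, answers):
    """Render a plain-text README; sections without an answer are marked TODO."""
    lines = [title, "=" * len(title), ""]
    for section in README_SECTIONS:
        lines += [section + ":", answers.get(section, "TODO"), ""]
    return "\n".join(lines)


def write_readme(data_dir, title, answers):
    """Store readme.txt in the same folder as the data it documents."""
    path = Path(data_dir) / "readme.txt"
    path.write_text(readme_text(title, answers), encoding="utf-8")
    return path
```

Calling `write_readme` when a data folder is first created leaves visible `TODO` markers, which act as a reminder to fill in the documentation later.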

Page 40: Data Management for Research (New Faculty Orientation)

Use Metadata Standards
• “Data about data”
• Standardized way of describing data
• Explains who, what, where, when of data creation and methods of use
• Data more easily found
• Data more easily compared to other data sets

Page 41: Data Management for Research (New Faculty Orientation)

Use Metadata Standards

Basic project metadata:

• Title • Language • File Formats

• Creator • Dates • File Structure

• Identifier • Location • Variable List

• Subject • Methodology • Code Lists

• Funders • Data Processing • Versions

• Rights • Sources • Checksums

• Access Information

• List of File Names
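Two of the metadata items listed above, the list of file names and the checksums, can be produced automatically. This is a minimal Python sketch using the standard library; the function names `sha256_of` and `manifest` are hypothetical, and SHA-256 is an assumed choice because the slides do not name a checksum algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(data_dir):
    """Map each file's path (relative to data_dir) to its SHA-256 checksum."""
    root = Path(data_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Recomputing the manifest later and comparing it with the stored one shows whether any file has changed or been corrupted.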

Page 42: Data Management for Research (New Faculty Orientation)

Use Metadata Standards
• Dublin Core: Commonly-used descriptive metadata format facilitates dataset discovery across the Web.
• Data Documentation Initiative (DDI): Defines metadata content, presentation, transport, and preservation for the social and behavioral sciences.
• ISO 19115:2003: Describes geographic data such as maps and charts.

More examples: http://www.lib.msu.edu/about/diginfo/collect.jsp

Page 43: Data Management for Research (New Faculty Orientation)

Use In-Line Code Commentary

Example of R code commentary

# Cumulative normal density: P(Z <= z) for each z
pnorm(c(-1.96, 0, 1.96))

If applicable, in-line code commentary helps explain code

Page 44: Data Management for Research (New Faculty Orientation)

Backup Practices: Overview
• Data at significant risk of loss without storage and backup plan, including:
  • Hardware / network failures
  • Bit rot
  • Human error
  • Singular commercial grade hard drives
• Effective data storage plan provides for:
  • Primary authoritative copy
  • Secondary local backup
  • Tertiary remote backup

Page 45: Data Management for Research (New Faculty Orientation)

Backup Practices: Overview

1. Avoid single points of failure
2. Ensure data redundancy & replication
3. Understand the common types of storage

Page 46: Data Management for Research (New Faculty Orientation)

Avoid Single Points of Failure
A single point of failure occurs when it would only take one event to destroy all data on a device
• Use managed networked storage when possible
• Move data off of portable media
• Never rely on one copy of data
• Do not rely on CD or DVD copies to be readable
• Be wary of software lifespans

Page 47: Data Management for Research (New Faculty Orientation)

Ensure Data Redundancy
Backup Do’s:
• Make 3 copies
  • E.g. original + external/local + external/remote
  • E.g. original + 2 formats on 2 drives in 2 locations
• Geographically distribute and secure
  • Local vs. remote, depending on needed recovery time
• Personal computer, external hard drives, departmental, or university servers may be used
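The "original + local + remote" pattern can be scripted so that each copy is verified as it is made. This is a hedged Python sketch rather than a recommended backup tool: the function names `file_digest` and `replicate` are hypothetical, the backup locations are assumed to be ordinary mounted directories, and SHA-256 verification is an added safeguard that the slides do not specify.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(original, backup_dirs):
    """Copy `original` into each backup directory and verify every copy."""
    original = Path(original)
    expected = file_digest(original)
    copies = []
    for directory in backup_dirs:
        target = Path(directory) / original.name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(original, target)  # copy2 also preserves timestamps
        if file_digest(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies
```

Passing one local and one remote directory yields the three copies the slide recommends, counting the original.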

Page 48: Data Management for Research (New Faculty Orientation)

Ensure Data Redundancy

Backup Don’ts:
• Do not rely on one copy
• Do not use CDs and DVDs
• Do not rely on ANGEL or Desire2Learn

(cc) George Ornbo

Page 49: Data Management for Research (New Faculty Orientation)

Ensure Data Redundancy

Backup Maybe: Cloud storage
• Amazon S3
• Google
• MS Azure
• DuraCloud
• Rackspace
• Glacier

Note that many enterprise cloud storage services include a charge for in/out of data transfers

$$$

Page 50: Data Management for Research (New Faculty Orientation)

Understand Common Types of Storage

• Optical Media
• Portable Flash Media
• Commercial Hard Drives
• Commercial NAS
• Cloud Storage
• Enterprise Network Storage
• Trusted Archival Storage

Page 51: Data Management for Research (New Faculty Orientation)

Understand Common Types of Storage

• Features of storage types:
  • Portable data transfers
  • Short-term storage
  • Project term storage
  • Networked data transfer
  • Long-term storage
  • Reliable backup option

Page 52: Data Management for Research (New Faculty Orientation)

Understand Common Types of Storage

• Enterprise storage at MSU
  • AFS Storage (Free up to 1 GB)
• Fee based
  • Individual Storage
  • Mid-Tier Storage
• Free up to 1 TB
  • HPCC Home Directory
  • HPCC Research Directory

Page 53: Data Management for Research (New Faculty Orientation)

Data Publishing, Sharing, Reuse: Overview

1. Prepare data in suitable format, for a potentially high return on investment

2. Publish data in several data publication venues to more broadly share results of research

Research datasets becoming first-class scholarly contributions on par with peer-reviewed journal articles

Page 54: Data Management for Research (New Faculty Orientation)

Sharing & Publishing Data

• Data preparation for sharing and publication is a time-intensive process

• Potential positive outcomes:
  • Increased research impact and citations
  • Enable additional scientific inquiry
  • Opportunities for co-authorship and collaboration
  • Enhance your grant proposal’s competitiveness

Page 55: Data Management for Research (New Faculty Orientation)

Data Publication Venues

• Multiple ways to publish research data
  • Faculty or project website
  • Journal supplementary materials
  • Disciplinary data repository (data archive)
• Varying levels of support for indexing, access controls, and long-term curation

Page 56: Data Management for Research (New Faculty Orientation)

Data Publication Venues

• Disciplinary Data Repository
  • Securely share data, ensure long-term access
  • High visibility
  • Often offer persistent citations
  • Availability varies across domains
  • Databib.org directory

Page 57: Data Management for Research (New Faculty Orientation)

Protecting Data & Responsible Reuse

1. Consider how to protect data and intellectual property rights while encouraging reuse

2. Keep in mind ethical concerns when sharing data

(cc) Will Scullin

Page 58: Data Management for Research (New Faculty Orientation)

Intellectual Property

• IP refers to exclusive rights of creators of works
• Individual data cannot be protected by US copyright
• Organization of data such as database, creative work produced by data, and research instruments used may be protected

Page 59: Data Management for Research (New Faculty Orientation)

Intellectual Property
• Principal investigator’s institution holds IP rights
• Provide clearly stated license for producing derivatives, reusing, and redistributing datasets
  • License under Creative Commons
  • State if any restrictions or embargos on use
• Provide example of how work should be cited to encourage proper attribution on reuse
• Document any IP / copyright issues

Page 60: Data Management for Research (New Faculty Orientation)

Ethics & Data Sharing
• Keep in mind the following ethical concerns when sharing your data:
  • Privacy
  • Confidentiality
  • Security and integrity of the data
• For data involving human subjects, obtain written permission or consent stating how the data may be reused

Page 61: Data Management for Research (New Faculty Orientation)

Best Practices = High Impact Data
• File organization ensures easier access and retrieval of data
• Documentation makes datasets accessible and intelligible to users
• Storage and backup safeguards data
• Data publishing and sharing encourages the most widespread reuse of data
• Data protection ensures responsible reuse

Page 62: Data Management for Research (New Faculty Orientation)

Agenda

• Introduction
• Background
  • The Impetus: NSF Data Management Plan Mandate
  • The Effect: Policy to Practice
  • The Response: Changing Data Landscape
• Fundamental Practices
  • File Organization
  • Data Documentation
  • Reliable Backup
  • Data Publishing, Sharing, & Reuse
  • Protecting Data & Responsible Reuse
• Data Lifecycle Resources

Page 63: Data Management for Research (New Faculty Orientation)

http://www.lib.msu.edu/rdmg

Page 64: Data Management for Research (New Faculty Orientation)

Contact

Lisa M. Schmidt
Electronic Records Archivist
University Archives & Historical [email protected]

Aaron Collie
Digital Curation Librarian
MSU [email protected]