88
Introduction to Research Data Management Amanda L. Whitmire Steven Van Tuyl OSU Libraries

Introduction to Data Management

Embed Size (px)

DESCRIPTION

Our regular Introduction to Data Management (DM) workshop (90-minutes). Covers very basic DM topics and concepts. Audience is graduate students from all disciplines. Most of the content is in the NOTES FIELD.

Citation preview

Page 1: Introduction to Data Management

Introduction to Research Data Management

Amanda L. WhitmireSteven Van TuylOSU Libraries

Page 2: Introduction to Data Management

GOAL:

Achievable habits for implementing

data management best practices into your workflow

2

Page 3: Introduction to Data Management

What are data?

Page 4: Introduction to Data Management

“…the recorded factual material commonly accepted in the scientific community as necessary to validate

research findings.”

Research data is:

U.S. Office of Management and Budget, Circular A-110

4

Page 5: Introduction to Data Management

“Unlike other types of information, research data are collected, observed, or created, for the

purposes of analysis to produce and validate original research results.”

University of EdinburghMANTRA Research Data Management Training,

‘Research Data Explained’

Research data is:

Page 6: Introduction to Data Management

Actions that contribute to effective storage, preservation and reuse ofdata and documentation throughout the research lifecycle.

Data management:

6

Page 7: Introduction to Data Management

7

What data management is not:

Data/computational scienceDatabase administrationA research method:• what data to collect• how to collect them• how to design an experiment

Page 8: Introduction to Data Management

Increase visibility & impact

Protects investment

Increases research efficiency

Preservation

Saves time

Funding agency requirements

Why data management?

Further your field

image credit: http://www.flickr.com/photos/karenchiang/5041048588/

Page 9: Introduction to Data Management

Further your field

Page 10: Introduction to Data Management

85 cancer microarray clinical trial publications

69% increase in citations for articles w/publicly available data

Increase visibility & impact

Page 11: Introduction to Data Management

Funder mandates

“…directed Federal agencies with more than $100M in R&D expenditures to develop plans to make the published results of federally funded research freely available to the public within one year of publication and requiring researchers to better account for and manage the digital data resulting from federally funded scientific research.”

Page 12: Introduction to Data Management

Which agencies are affected?$54M

$35M

$23M

Page 13: Introduction to Data Management

Aspects of data management

DMPs/PlanningStorage & backup

File organization & namingDocumentation & metadata

Legal/ethical considerations

Sharing & reuseArchiving &

preservation

Page 14: Introduction to Data Management

Data types & formats

Page 15: Introduction to Data Management

Data types

15

Observational | Can’t be recaptured, recreated or replaced; Examples: sensor readings, sensory (human) observations, survey results

Experimental | Should be reproducible, but can be expensive; Examples: gene sequences, chromatograms, spectroscopy, microscopy, cell counts

Derived or compiled | Reproducible, but can be very expensive; Examples: text and data mining, compiled database, 3D models

Simulation | Models and metadata, where the input can be more important than output data; Examples: climate models, economic models, biogeochemical models

Reference/canonical | Static or organic collection [peer-reviewed] datasets, most probably published and/or curated; Examples: gene sequence databanks, chemical structures, census data, spatial data portals

Page 16: Introduction to Data Management

Amanda’s dissertation The spectral backscattering properties of marine particles

Observationsship-based sampling & moored instruments

Simulation results

scattering & absorption of light

Experimentaloptical properties of

phytoplankton cultures

Derived variablesendless things

Compiled observationsglobal oceanic bio-

optical observations[self + from peers]

Referenceglobal oceanic bio-

optical observations[NASA]

Page 17: Introduction to Data Management

Data types: another classification

Qualitative data “is a categorical measurement expressed not in terms of numbers, but rather by means of a natural language description. In statistics, it is often used interchangeably with "categorical" data.” See also: nominal, ordinal

Source: Wikibooks: Statistics

Quantitative data “is a numerical measurement expressed not by means of a natural language description, but rather in terms of numbers. However, not all numbers are continuous and measurable.”

“My favorite color is blue-green.” vs. “My favorite color is 510 nm.”

Page 18: Introduction to Data Management

More common data types

Geospatial data has a geographical or geospatial aspect. Spatial location is critically tied to other variables.

Digital image, audio & video data

Documentation & scripts Sometimes, software code IS data; likewise with documentation (laboratory notebooks, written observations, etc.)

Page 19: Introduction to Data Management

File Formats

“A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium.” - Wikipedia

Qualitative, tabularexperimental data

Data typeExcel spreadsheet (.xlsx)

Comma-delimited text (.csv)Access database

(.mdb/,accdb)Google Spreadsheet

SPSS portable file (.por)XML file

Possible formats

Page 20: Introduction to Data Management

What types & formats of data will you be generating and/or using?

Observational | Experimental | Derived | Compiled | Simulation | Reference/Canonical

Qualitative | Quantitative | GeospatialImage/audio/video | Scripts/codes

Reflective Writing: 60 seconds

Page 21: Introduction to Data Management

21

? ???

Page 22: Introduction to Data Management

The “big picture”

22

Page 23: Introduction to Data Management

Make a plan

Where do you start?

23

Page 24: Introduction to Data Management

Data storage & backup

Page 25: Introduction to Data Management

“Your data are the life blood of your research.

If you lose your data recovery could be slow,

costly or even worse…

it could be impossible.”

Page 26: Introduction to Data Management

Most common loss scenario:drive failure

Page 27: Introduction to Data Management

This happens a lot: physical theft & unintentional damage

Cute, but not a valid security plan.

Page 28: Introduction to Data Management

Rare, unexpected events happen

University of Southampton, School of Electronics and Computer Science, Southampton, UK, 2005

Page 29: Introduction to Data Management

29

It CAN happen to you

Page 30: Introduction to Data Management

Real-world lesson:Audit your backups…

Page 31: Introduction to Data Management

Data backup

“Keeping backups is probably your most important data

management task.”-Everyone

1. Some data backup is better than none.2. Automated backups are better than manual.3. Your data are only as safe as your last backup.

Page 32: Introduction to Data Management

Data backup

Original (working)

External Local

External Remote

Best Practice: 3 Copies of datasets

Page 33: Introduction to Data Management

Data storage options

1. Personal computers (PCs) & laptops

2. External storage devices

3. Networked Drives

4. Cloud servers

Page 34: Introduction to Data Management

Storage: PC/laptop

AdvantagesConvenient

DisadvantagesDrive failure common

Laptops: susceptible to theft & unintentional damage

Not replicated

Bottom LineDo NOT use to store master copies of data

Not a long term storage solution

Back up important data & files regularly

Page 35: Introduction to Data Management

Storage: external storage devices

AdvantagesConvenient, cheap & portable

DisadvantagesLongevity not guaranteed (e.g. Zip disks)

Errors writing to CD/DVD are common

Easily damaged, misplaced or lost (=security risk)

May not be big enough to hold all data; multiple drives needed

Bottom LineDo NOT use to store master copies of data

Not recommended for long-term storage

Page 36: Introduction to Data Management

Storage: networked drives

AdvantagesData in single place, backed up regularly

Replicated storage not vulnerable to loss due to hardware failure

Secure storage minimizes risk of loss, theft, unauthorized use

Available as needed (assuming network avail.)

DisadvantagesCost may be prohibitive; export control

Bottom LineHighly recommended for master copies of data

Recommended for long-term storage (~5 years)

Page 37: Introduction to Data Management

Storage: cloud storage

AdvantagesData in single place, backed up regularlyReplicated storage not vulnerable to loss due to hardware failureSecure storage minimizes risk of loss, theft, unauthorized use

DisadvantagesCost may be prohibitiveUpload/download bottleneck & feesLongevity?Export control

Bottom LinePossibly recommended for master copies of dataNot recommended for in-process data, large files

Page 38: Introduction to Data Management

Storage: Google Drive for OSU

AdvantagesAll same advantages of network & cloud storageFile sharing & collaboration w/variable access levelsUnlimited storage (GD), 30 GB non-GDAutomatic version control on GD

Disadvantages30 GB may not be enoughUpload/download bottleneck

Bottom LinePossibly recommended for master copies of dataPossibly not recommended for in-process data, large files

Page 39: Introduction to Data Management

Data organization

39

AGU presentation

Class presentation

OS presentation

Presentations

Ocean Sciences Class AGU

Page 40: Introduction to Data Management

Data storage options

40

Local Remote

Computer

External storage

Network server

Network server

Cloud storage

Google Apps

Box, Dropbox, etc.

Page 41: Introduction to Data Management

File-naming conventions

Page 42: Introduction to Data Management

File naming strategy?

42

Page 43: Introduction to Data Management

#OverlyHonestMethods

43

Page 44: Introduction to Data Management

File naming conventions

44

Project_instrument_location_YYYYMMDDhhmmss_extra.ext

Index/grant conditions Leading zero!

s/n, variable Retain order

Page 45: Introduction to Data Management

File naming strategies

45

Order by date:1955-04-12_notes_MassObs.docx

1955-04-12_questionnaire_MassObs.pdf

1963-12-15_notes_Gorer.docx

1963-12-15_questionnaire_Gorer.pdf

Order by subject:Gorer_notes_1963-12-15.docx

Gorer_questionnaire_1963-12-15.pdf

MassObs_notes_1955-04-12.docx

MassObs_questionnaire_1955-04-12.pdf

Order by type:Notes_Gorer_1963-12-15.docx

Notes_MassObs_1955-04-12.docx

Questionnaire_Gorer_1963-12-15.pdf

Questionnaire_MassObs_1955-04-12.pdf

Forced order with numbering:01_MassObs_questionnaire_1955-04-12.pdf

02_MassObs_notes_1955-04-12.docx

03_Gorer_questionnaire_1963-12-15.pdf

04_Gorer_notes_1963-12-15.docx

Page 46: Introduction to Data Management

Version control

46

Page 47: Introduction to Data Management

Version control

47

Page 48: Introduction to Data Management

48

? ???

Page 49: Introduction to Data Management

Random suggestion

49

Page 50: Introduction to Data Management

Disambiguate yourself

50

Open Researcher & Contributor ID

John L. CampbellForest Research EcologistOregon State University, Corvallis, OR

John L. CampbellForest Research Ecologist

Center for Research on Ecosystem Change

US Forest Service, Durham, NC

Page 51: Introduction to Data Management

51

Page 52: Introduction to Data Management

Another Doppelgänger

52

Chris LangdonStudies ocean acidification

OSU HMSC

Chris LangdonStudies ocean acidificationUniversity of Miami

Page 53: Introduction to Data Management

Metadata

Page 54: Introduction to Data Management

What is metadata?

• Data about data• Structured information that describes,

explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource.

NISO, Understanding Metadata

54

Page 55: Introduction to Data Management

Metadata

55

“The metadata accompanying your data should be written for a user 20 years into the future -- what does that person need to know to use your data properly? Prepare the metadata for a user who is unfamiliar with your project, methods, or observations.”

Oak Ridge National Laboratory Distributed Active Archive Center for Biogeochemical

Dynamics(ORNL DAAC)

Page 56: Introduction to Data Management

Metadata in real life

You use it all the time…

Page 57: Introduction to Data Management

Darwin Core | biological diversity, taxonomy

Dublin Core | general

DDI (Data Documentation Initiative) | social &

behavioral sciences

DIF (Directory Interchange Format) |

environmental sciences

EML (Ecological Metadata Language) | ecology,

biology

ISO 19115 | geographic data

Major metadata standards

01101101 01100101 01110100 01100001 01100100 01100001 01110100 01100001

Page 58: Introduction to Data Management

Metadata examples

Santa Barbara Coastal Long Term Ecological Research (LTER)web link

Bureau of Labor Statistics, Consumer Price Index, 1913-1992web link

58

Page 59: Introduction to Data Management

Legal & ethical considerations

59

Page 60: Introduction to Data Management

Data sharing & reuse

“...digitally formatted scientific data resulting from unclassified research supported wholly

or in part by Federal funding should be stored and publicly accessible to search,

retrieve, and analyze.”

Office of Science and Technology PolicyThe White House

60

Page 61: Introduction to Data Management

How to preserve & share data

61

Page 62: Introduction to Data Management

How to preserve & share data

62

Page 63: Introduction to Data Management

Data journals

63

Page 64: Introduction to Data Management

Data management plans

64

Page 65: Introduction to Data Management

What is a data management plan?

65

…Physical specimens consist of carbonate biothems and filtered water samples are stored in Spilde or Northup’s labs until completion of analyses. At this time they are returned to the federal agency for museum curation or destroyed during analysis. Field notes will be scanned into pdf files, with a copy sent to Carlsbad Caverns National Park or other federal cave manager. SEM images will be saved in the tif format. …

Page 66: Introduction to Data Management

Sections of a data management plan

66

1. Types of data2. Data & metadata standards3. Archiving & preservation4. Sharing (access provisions)5. Transition from collection to

reuse

See resources on your handout + use DMPTool

Page 67: Introduction to Data Management

| Visit DMPTool

Your go-to resource for data

management plans

67

Page 68: Introduction to Data Management

Contact information

68

Amanda Whitmire | Data Management Specialist

[email protected]

Steve Van Tuyl| Data and Digital Repository [email protected]

http://bit.ly/OSUData

Page 69: Introduction to Data Management

end

69

Page 70: Introduction to Data Management

70

Extra / Outdated Slides

Page 71: Introduction to Data Management

Data does not speak for itself…

Page 72: Introduction to Data Management

YOU speak for YOUR data

Page 73: Introduction to Data Management

But first, you need to manage it

73

Page 74: Introduction to Data Management

74

Data management in the lifecycle

Page 75: Introduction to Data Management

How can OSU Libraries help?

75

✔✔

Page 76: Introduction to Data Management

Copyright – who owns the data?

Copyright applies to text, images, recordings, videos etc. in a digital form, in the same way that it would to analog versions of those works. The code of computer programs (both the human readable source code and the machine readable object code) is

protected by copyright as a literary work. Data compilations such as datasets and databases can be protected by copyright in

the literary works category, which includes ‘tables’ or ‘compilations’. A table or compilation, consisting of words,

figures or symbols (or a combination of these) is protected if it is a literary work and has the required degree of originality.

76

Page 77: Introduction to Data Management

Types, formats & stages of data

77

Raw data

Page 78: Introduction to Data Management

78

Types, formats & stages of data

Intermediate data

Page 79: Introduction to Data Management

Types, formats & stages of data

79

Final data

Page 80: Introduction to Data Management

Data Types: Observational

• Captured in situ• Can’t be recaptured, recreated or replaced

Examples: Sensor readings, sensory (human) observations, survey results

Page 81: Introduction to Data Management

Data Types: Experimental

• Data collected under controlled conditions, in situ or laboratory-based

• Should be reproducible, but can be expensive

Examples: gene sequences, chromatograms, spectroscopy, microscopy, cell counts

Page 82: Introduction to Data Management

Data Types: Derived or Compiled

• Compiled: integration of data from several sources (could be many types)

• Derived: new variables created from existing data

• Reproducible, but can be very expensive

Examples: text and data mining, compiled database, 3D models

Page 83: Introduction to Data Management

Data Types: Simulation

• Results from using a model to study the behavior and performance of an actual or theoretical system

• Models and metadata, where the input can be more important than output data

Examples: climate models, economic models, biogeochemical models

Page 84: Introduction to Data Management

Data Types: Reference/Canonical

Static or organic collection [peer-reviewed] datasets, most probably published and/or curated.

Examples: gene sequence databanks, chemical structures, census data, spatial data portals

Page 85: Introduction to Data Management

Data storage & preservation

85

Page 86: Introduction to Data Management

About backups…

86

University of Southampton, School of Electronics and Computer ScienceSouthampton, UK, 2005

Do them.

Page 87: Introduction to Data Management

Plan for unexpected events

87

Page 88: Introduction to Data Management

Metadata demonstration

88

Colectica