GRAD 521, Research Data Management Winter 2014 – Lecture 15 Amanda L. Whitmire, Asst. Professor Plan for Archiving & Preservation of Data


GRAD 521, Research Data Management Winter 2014 – Lecture 15

Amanda L. Whitmire, Asst. Professor

Plan for Archiving & Preservation of Data


Logistics

Heads up/reminder on the final: data management plan

Survey responses: thank you!


Today’s lesson

1. Basic archival processes: data selection, format migration, checksums, auditing, etc.

2. Address the need for conversion to standard formats needed for re-use

3. Options for a long-term sustainable preservation strategy/policy for your data

4. Costs & timelines for data storage, management tools and services


Backup vs. Archive

Backup
• Data may change
• Not permanent
• “Working” formats
• Usually stored locally (individual, department, college, IS/CN)

Archive
• Finalized data; static record
• Kept long-term (5+ years)
• Preservation formats
• Often stored in official archive


Archive-stage actions

1. Data selection or appraisal
2. Format selection
3. Perform checksums
4. Select archive location
5. Periodic file- and bit-level audits


1. Data appraisal

“… the process of distinguishing records of continuing value from those of no further value so that the latter may be eliminated.”

The National Archives (UK)


Appraisal roles & responsibilities

Researcher (‘data creator’)
• Provide enough info. for assessment
• Provide info. on audience and access restrictions
• Provide data in recommended formats
• Provide sufficient metadata

Data center or repository
• Have explicit mission & data selection policy
• Ensure legal and contract compliance
• Ensure authenticity & integrity of digital objects & metadata
• Assume responsibility for data accessibility
• Plan for long-term preservation


Appraisal criteria

1. Relevance to mission

2. Historical value

3. Uniqueness

4. Potential for redistribution

5. Non-replicability

6. Economic case

7. Full documentation

For a full discussion of the appraisal process, see this guide: Whyte, A. & Wilson, A. (2010). "How to Appraise and Select Research Data for Curation". DCC How-to Guides. Edinburgh: Digital Curation Centre. http://www.dcc.ac.uk/resources/how-guides


2. Format selection

Ideal: non-proprietary or open formats

For more info. on data formats: http://guides.library.oregonstate.edu/data-management-types-formats


Archive-stage actions

1. Data selection or appraisal
2. Format selection
3. Perform checksums
4. Select archive location
5. Periodic file- and bit-level audits


3. Checksums

Checksums provide a way to:
• ensure the integrity of your data
• create a comprehensive list of your files


Data integrity

What is an MD5 checksum?
• It is like a fingerprint of a file
• It is used to verify whether two files are identical

Each time you run a checksum:
• a number string is created for each file
• if even 1 byte of data has been altered or corrupted, that string will change
• if the checksums match, the data have not been altered
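The fingerprint idea can be tried outside any particular tool. Below is a minimal Python sketch using the standard `hashlib` module; the function names and the sample text are illustrative assumptions, not part of the lecture materials.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the MD5 checksum of some bytes as an uppercase hex string."""
    return hashlib.md5(data).hexdigest().upper()


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


# Hypothetical sample content: the second version differs by a single byte.
original = b"Research data management."
altered = b"Research data management"  # the final period was deleted

print(md5_of_bytes(original))
print(md5_of_bytes(altered))
print("identical:", md5_of_bytes(original) == md5_of_bytes(altered))
```

The two printed strings differ even though only one byte changed, while hashing the same bytes twice always gives the same string.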


Checksums

Here is an example data collection:

Folder: C:\ … \datamanagementstuff


Checksums

Here is an MS Word document in that folder:


FastSum

FastSum is a free MD5 checksum tool for Windows, available at http://www.fastsum.com/

1. Download and install the trial version

2. Run the program


Creating a checksum

The wizard has created a list of ‘Checksum\State’ in FastSum

It has also created a text file in the \datamanagementstuff folder


Creating a checksum

Open up the text file and this is what you find:

*a checksum string and a list of file names*
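A comparable text file can be produced without FastSum. The sketch below, in Python, walks a folder and writes one line per file with its MD5 checksum and relative name. The line layout (`checksum *filename`) follows the common `md5sum` convention and is an assumption; it is not guaranteed to match FastSum's own file format. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 checksum of a file as an uppercase hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its MD5 checksum."""
    folder = Path(folder)
    manifest = {}
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            manifest[path.relative_to(folder).as_posix()] = md5_of_file(path)
    return manifest


def write_manifest(manifest, manifest_path):
    """Write one 'CHECKSUM *relative/name' line per file."""
    with open(manifest_path, "w", encoding="utf-8") as out:
        for name, checksum in manifest.items():
            out.write(f"{checksum} *{name}\n")


# Example usage (paths are placeholders; adjust to your own folder):
# manifest = build_manifest("datamanagementstuff")
# write_manifest(manifest, "datamanagementstuff_checksums.txt")
```

Writing the manifest outside the folder being scanned keeps the checksum file itself out of the next run's list.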


Using a checksum

In this example:

• Reopened the Word document from earlier

• Deleted a period, saved, and closed the document

• When you run the checksum wizard again, the value for the ‘Datamanagement.doc’ file should change


Comparing checksums

Before …

… After


Comparing checksums

Notice how the values for Datamanagement.doc have changed:

0CA9E83E612447E793D4758BF7A5244D

91BAE7EC0C642D967585D01DD6AA4096

- values for the other files stay the same
- values stay the same across machines unless a file has changed
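The before/after comparison can also be done programmatically. The Python sketch below compares two checksum lists held as dictionaries (file name to checksum) and reports which files changed, disappeared, or were added. The function name and the second file are hypothetical, and treating the first value on the slide as "before" and the second as "after" is an assumption.

```python
def compare_manifests(before, after):
    """Compare two {filename: checksum} mappings."""
    changed = sorted(n for n in before if n in after and before[n] != after[n])
    missing = sorted(n for n in before if n not in after)
    added = sorted(n for n in after if n not in before)
    return {"changed": changed, "missing": missing, "added": added}


# Checksums for Datamanagement.doc are the two values shown on the slide
# (their before/after order is assumed); notes.txt is a made-up second file.
before = {
    "Datamanagement.doc": "0CA9E83E612447E793D4758BF7A5244D",
    "notes.txt": "D41D8CD98F00B204E9800998ECF8427E",
}
after = {
    "Datamanagement.doc": "91BAE7EC0C642D967585D01DD6AA4096",
    "notes.txt": "D41D8CD98F00B204E9800998ECF8427E",
}

print(compare_manifests(before, after))
```

Only the edited document shows up under "changed"; the untouched file is not reported.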


Creating a file list

FastSum has created a list of all the files in the folder it was pointed toward:


Archive-stage actions

1. Data selection or appraisal
2. Format selection
3. Perform checksums
4. Select archive location
5. Periodic file- and bit-level audits


4. Select archive location

Considerations
• Costs
• Size of dataset
• Public vs. private access
• Length of preservation
• Hands-on vs. hands-off
• Security of platform

Locations
• Individual
• Department/College
• University-wide
• Discipline-specific
• 3rd-party

Archive vs. sharing mechanism


Archive-stage actions

1. Data selection or appraisal
2. Format selection
3. Perform checksums
4. Select archive location
5. Periodic file- and bit-level audits


Data in Real Life

A design firm was handling their own backups. The system was working fine and the backup software was reporting that the data was successfully backed up.

Images courtesy of Heather Henkel


Data in Real Life

The administrator checked the backups immediately after they were done and confirmed they were good.

CC Image courtesy of angielauw on Flickr


Data in Real Life

After a computer virus erased most of their files, they went back to their backups. Unfortunately they found that the backups were all blank and all of the data was gone. Only after some investigation did they discover that the computer tapes (which contained the backups) were placed against a wall that had an elevator on the other side of it. When the elevator went past, the magnets inside erased all of the tapes.

Take-home message: had they checked their backups again, they probably would have noticed this issue before there was an emergency & complete loss of files.
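A periodic re-check of this kind can be automated. The Python sketch below recomputes checksums on a backup copy and compares them with a checksum list recorded when the backup was made, flagging files that are missing or no longer match. It is one possible approach under stated assumptions, not the firm's or the lecture's procedure; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 checksum of a file as an uppercase hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def audit_backup(manifest, backup_folder):
    """Check a backup against a {relative_name: checksum} manifest.

    Returns a list of (name, problem) tuples; an empty list means every
    file listed in the manifest is present and unchanged.
    """
    backup_folder = Path(backup_folder)
    problems = []
    for name, expected in sorted(manifest.items()):
        path = backup_folder / name
        if not path.is_file():
            problems.append((name, "missing"))
        elif md5_of_file(path) != expected.upper():
            problems.append((name, "checksum mismatch"))
    return problems


# Example usage (placeholder path; the manifest would be loaded from the
# checksum list saved at backup time):
# for name, problem in audit_backup(manifest, "/path/to/backup"):
#     print(f"{name}: {problem}")
```

Running a check like this on a schedule, rather than only right after the backup is written, is what would surface silent damage such as the erased tapes.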


Preservation strategy

Create an archive backup policy that clearly identifies:
o roles
o responsibilities
o where the data is backed up
o how often the files are backed up
o how to access the files
o recommended file formats to be used
o policies for migrating data to assure data are not lost due to media degradation or changing formats or programs

Review your backup policy & plan periodically to ensure it is still valid and applicable
o Update contacts, if appropriate
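One way to keep such a policy reviewable is to store it as a structured record alongside the data. The Python sketch below is illustrative only: the field names mirror the items listed above, while the class name, the sample values, and the one-year review interval are assumptions rather than recommendations from the lecture.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class BackupPolicy:
    roles: dict                # who is involved, e.g. {"data steward": "..."}
    responsibilities: dict     # what each role must do
    backup_location: str       # where the data is backed up
    backup_frequency: str      # how often the files are backed up
    access_instructions: str   # how to access the files
    recommended_formats: list  # file formats to be used
    migration_policy: str      # how data move off aging media or formats
    last_reviewed: date        # when the policy was last checked


def missing_items(policy):
    """Return the names of policy items that are still empty."""
    return [f.name for f in fields(policy) if not getattr(policy, f.name)]


def review_due(policy, today, max_age_days=365):
    """True if the policy has not been reviewed within max_age_days (assumed 1 year)."""
    return (today - policy.last_reviewed).days > max_age_days


# Hypothetical example record with one item left blank.
policy = BackupPolicy(
    roles={"data steward": "lab manager"},
    responsibilities={"data steward": "run and verify weekly backups"},
    backup_location="departmental file server",
    backup_frequency="weekly",
    access_instructions="connect via VPN, then mount the project share",
    recommended_formats=["CSV", "PDF/A", "TIFF"],
    migration_policy="",
    last_reviewed=date(2014, 2, 1),
)

print(missing_items(policy))
print(review_due(policy, date(2015, 3, 1)))
```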


Best Practices

Minimize or remove reliance on users to perform manual backups (if possible)
o Implement standardized and automatic backups
o If possible, put experts in charge of this task (computer staff) as they are more likely to keep up-to-date regarding software updates, hardware issues, best practices, etc.

Don’t assume backups are being performed for you
o You don’t want to find out after the fact that no backups have been performed
o If you are using third-party software (like Yahoo or Google Mail), what happens if they lose your files?


Example options for preservation


A typical OSU researcher

> 55% produce 100 GB or less per project


Archive on your own

• You buy & manage hardware, replication, backups and networking (if applicable, for offsite access)
• OK for unrestricted, sensitive (FERPA), and protected data

Costs (100 GB dataset)

Ranges (but generally cheap)


Archive w/ department IT

• 30-day backup/recovery window for files on personal or departmental storage
• RAID protected, backed up online storage
• Accessible (to you) remotely (via VPN)

Costs (100 GB dataset in COSINe)

($0/year * 4 GB) + ($60/100 GB/year) = $60/year (ongoing)

$300 for 5 years
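The arithmetic above can be reproduced for other dataset sizes. The Python sketch below assumes that storage beyond the free 4 GB is billed in whole 100 GB blocks at $60 per block per year; that block-rounding rule is an assumption chosen because it reproduces the $60/year figure, and actual billing may differ. The function name is hypothetical.

```python
import math


def annual_block_cost(dataset_gb, free_gb, block_gb, rate_per_block_year):
    """Yearly cost when storage beyond the free tier is sold in fixed blocks.

    Assumption: a partly used block is billed as a full block.
    """
    billable_gb = max(dataset_gb - free_gb, 0)
    blocks = math.ceil(billable_gb / block_gb)
    return blocks * rate_per_block_year


yearly = annual_block_cost(dataset_gb=100, free_gb=4, block_gb=100, rate_per_block_year=60)
print(yearly)      # 60
print(yearly * 5)  # 300
```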


Archive @ OSU w/ CN

• Storage is in 2 separate data centers & backups retained for 3 months
• Accessible (to you) remotely (via VPN)
• OK for unrestricted, sensitive (FERPA), and protected data

Costs (100 GB dataset)

($0/year * 5 GB) + ($4/GB/year * 95 GB) = $380/year (ongoing)

$1,900 for 5 years
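The same calculation can be repeated for a dataset of any size when budgeting. A minimal Python sketch, using the rates quoted on this slide (first 5 GB free, $4 per GB per year beyond that); the function name is hypothetical and rates should be confirmed before use.

```python
def annual_per_gb_cost(dataset_gb, free_gb, rate_per_gb_year):
    """Yearly cost when storage beyond a free allowance is billed per GB."""
    billable_gb = max(dataset_gb - free_gb, 0)
    return billable_gb * rate_per_gb_year


yearly = annual_per_gb_cost(dataset_gb=100, free_gb=5, rate_per_gb_year=4)
print(yearly)      # 380
print(yearly * 5)  # 1900
```

Because the cost is ongoing, the multi-year total is the figure to carry into a grant budget.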


Archive in discipline-specific repository

• Replicated, archive-quality storage
• Data curation throughout ingest & archive period
• Data in context with other datasets

Costs

Ranges


3rd party storage platforms

Costs

Ranges


Bottom line

No “one-size-fits-all” approach

Balance costs, storage quality, access, degree of involvement, security, longevity, etc.

Plan ahead so you can budget appropriately