GRAD 521, Research Data Management Winter 2014 – Lecture 15
Amanda L. Whitmire, Asst. Professor
Plan for Archiving & Preservation of Data
Logistics
Heads up/reminder on the final: data management plan
Survey responses: thank you!
Today’s lesson
1. Basic archival processes: data selection, format migration, checksums, auditing, etc.
2. Address the need for conversion to standard formats for re-use
3. Options for a long-term sustainable preservation strategy/policy for your data
4. Costs & timelines for data storage, management tools and services
Backup
Data may change
Not permanent
“Working” formats
Usually stored locally (individual, department, college, IS/CN)
vs.
Archive
Finalized data; static record
Kept long-term (5+ years)
Preservation formats
Often stored in official archive
Archive-stage actions
1. Data selection or appraisal
2. Format selection
3. Perform checksums
4. Select archive location
5. Periodic file- and bit-level audits
1. Data appraisal
“… the process of distinguishing records of continuing value from those of no further value so that the latter may be eliminated.”
The National Archives (UK)
Appraisal roles & responsibilities
Researcher (‘data creator’)
• Provide enough info. for assessment
• Provide info. on audience and access restraints
• Provide data in recommended formats
• Provide sufficient metadata
Data center or repository
• Have explicit mission & data selection policy
• Ensure legal and contract compliance
• Ensure authenticity & integrity of digital objects & metadata
• Assume responsibility for data accessibility
• Plan for long-term preservation
Appraisal criteria
1. Relevance to mission
2. Historical value
3. Uniqueness
4. Potential for redistribution
5. Non-replicability
6. Economic case
7. Full documentation
For a full discussion of the appraisal process, see this guide: Whyte, A. & Wilson, A. (2010). "How to Appraise and Select Research Data for Curation". DCC How-to Guides. Edinburgh: Digital Curation Centre. http://www.dcc.ac.uk/resources/how-guides
2. Format selection
Ideal: non-proprietary or open formats
For more info. on data formats: http://guides.library.oregonstate.edu/data-management-types-formats
3. Checksums
Checksums provide a way to:
• ensure the integrity of your data
• create a comprehensive list of your files
Data integrity
What is an MD5 checksum?
• It is like a fingerprint of a file
• It is used to verify whether two files are identical
Each time you run a checksum:
• a string of characters (32 hexadecimal digits for MD5) is created for each file
• if even 1 byte of data has been altered or corrupted, that string will change
• if the checksums match, the data has not been altered
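The fingerprint idea can be sketched in a few lines of Python using the standard-library hashlib module. This is only an illustration, not the FastSum tool used in the rest of this lecture; the sample text and the md5_hex helper name are made up for the example.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a 32-character hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical file contents: the second version has lost its final period.
original = b"Research data management."
altered = b"Research data management"

print(md5_hex(original))
print(md5_hex(altered))
print(md5_hex(original) == md5_hex(original))  # True: same bytes, same checksum
print(md5_hex(original) == md5_hex(altered))   # False: one byte differs
```

The two printed strings look unrelated even though the inputs differ by a single byte, which is why a checksum mismatch is a reliable signal that a file has changed.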
Checksums
Here is an example data collection:
Folder: C:\ … \datamanagementstuff
Here is an MS Word document in that folder:
FastSum
FastSum is a free MD5 checksum tool for Windows, available at http://www.fastsum.com/
1. Download and install the trial version
2. Run the program
Creating a checksum
The wizard has created a list of ‘Checksum\State’ values in FastSum
It has also created a text file in the \datamanagementstuff folder
Creating a checksum
Open up the text file and this is what you find:
*a checksum string and a list of file names*
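For anyone without FastSum, a similar text file can be produced with a short Python script. This is a rough sketch under stated assumptions: the file_md5 and write_manifest names, the checksums.md5 file name, and the "CHECKSUM *filename" line layout (borrowed loosely from the md5sum convention) are all choices made for this example, and FastSum's own file layout may differ.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a file's MD5 checksum, reading in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def write_manifest(folder: Path, manifest_name: str = "checksums.md5") -> Path:
    """Write one 'CHECKSUM *relative/path' line per file in `folder` (hypothetical layout)."""
    manifest = folder / manifest_name
    lines = []
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        if path == manifest:
            continue  # do not checksum a manifest left over from an earlier run
        lines.append(f"{file_md5(path)} *{path.relative_to(folder).as_posix()}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest
```

Calling write_manifest(Path("path/to/your/folder")) would leave a checksums.md5 file in that folder, which doubles as a complete file list.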
Using a checksum
In this example:
• Reopened the Word document from earlier
• Deleted a period, saved, and closed the document
• When you run the checksum wizard again, the value for the ‘Datamanagement.doc’ file should change
Comparing checksums
Before …
… After
Comparing checksums
Notice how the values for Datamanagement.doc have changed:
0CA9E83E612447E793D4758BF7A5244D
91BAE7EC0C642D967585D01DD6AA4096
- values for the other files stay the same
- values stay the same across machines unless a file has changed
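The before/after comparison can also be automated. The sketch below is an illustration only: compare_checksums is a made-up helper, the two Datamanagement.doc values are the ones shown on the slide (treating the first as "before" and the second as "after" is an assumption), and readme.txt with its checksum is a hypothetical second file.

```python
def compare_checksums(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Report files whose checksum changed, files that vanished, and files that are new."""
    return {
        "changed": sorted(n for n in before.keys() & after.keys() if before[n] != after[n]),
        "missing": sorted(before.keys() - after.keys()),
        "new": sorted(after.keys() - before.keys()),
    }


before = {
    "Datamanagement.doc": "0CA9E83E612447E793D4758BF7A5244D",  # assumed "before" value
    "readme.txt": "D41D8CD98F00B204E9800998ECF8427E",          # hypothetical unchanged file
}
after = {
    "Datamanagement.doc": "91BAE7EC0C642D967585D01DD6AA4096",  # assumed "after" value
    "readme.txt": "D41D8CD98F00B204E9800998ECF8427E",
}

print(compare_checksums(before, after))
# {'changed': ['Datamanagement.doc'], 'missing': [], 'new': []}
```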
Creating a file list
FastSum has created a list of all the files in the folder it was pointed toward:
4. Select archive location
Considerations
• Costs
• Size of dataset
• Public vs. private access
• Length of preservation
• Hands-on vs. hands-off
• Security of platform
Locations
• Individual
• Department/College
• University-wide
• Discipline-specific
• 3rd-party
Archive vs. sharing mechanism
Archive-stage actions
1. Data selection or appraisal
2. Format selection
3. Perform checksums
4. Select archive location
5. Periodic file- and bit-level audits
Data in Real Life
A design firm was handling their own backups. The system was working fine and the backup software was reporting that the data was successfully backed up.
Images courtesy of Heather Henkel
Data in Real Life
The administrator checked the backups immediately after they were done and confirmed they were good.
CC image courtesy of angielauw on Flickr
Data in Real Life
After a computer virus erased most of their files, they went back to their backups. Unfortunately they found that the backups were all blank and all of the data was gone. Only after some investigation did they discover that the computer tapes (which contained the backups) were placed against a wall that had an elevator on the other side of it. When the elevator went past, the magnets inside erased all of the tapes.
Take-home message: had they checked their backups again, they probably would have noticed this issue before there was an emergency & complete loss of files.
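A periodic audit of this kind can be scripted. The following is a minimal sketch, assuming the backup is a plain copy of the source folder that can be read as ordinary files; audit_backup and file_md5 are hypothetical helper names, and a tape or cloud backup would need its own restore step first.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a file's MD5 checksum in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_backup(source: Path, backup: Path) -> dict[str, list[str]]:
    """Compare every file under `source` with its copy under `backup`."""
    report: dict[str, list[str]] = {"missing": [], "empty": [], "mismatched": []}
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        copy = backup / rel
        name = rel.as_posix()
        if not copy.is_file():
            report["missing"].append(name)      # never backed up, or lost
        elif copy.stat().st_size == 0 and src.stat().st_size > 0:
            report["empty"].append(name)        # blank copy, as in the story above
        elif file_md5(copy) != file_md5(src):
            report["mismatched"].append(name)   # contents differ from the source
    return report
```

Running such a check on a schedule, not just once right after the backup is made, is what would have caught the erased tapes. For finalized archive data, comparing against a stored checksum list works the same way and does not require the original files to still be intact.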
Preservation strategy
Create an archive backup policy that clearly identifies:
o roles
o responsibilities
o where the data is backed up
o how often the files are backed up
o how to access the files
o recommended file formats to be used &
o policies for migrating data to assure data are not lost due to media degradation or changing formats or programs
Review your backup policy & plan periodically to ensure it is still valid and applicable
o Update contacts, if appropriate
Best Practices
Minimize or remove reliance on users to perform manual backups (if possible)
o Implement standardized and automatic backups
o If possible, put experts in charge of this task (computer staff) as they are more likely to keep up-to-date regarding software updates, hardware issues, best practices, etc.
Don’t assume backups are being performed for you
o You don’t want to find out after the fact that no backups have been performed
o If you are using third-party software (like Yahoo or Google Mail), what happens if they lose your files?
Example options for preservation
A typical OSU researcher
> 55% produce 100 GB or less per project
Archive on your own
• You buy & manage hardware, replication, backups and networking (if applicable, for offsite access)
• OK for unrestricted, sensitive (FERPA), and protected data
Costs (100 GB dataset)
Ranges (but generally cheap)
Archive w/ department IT
• 30-day backup/recovery window for files on personal or departmental storage
• RAID-protected, backed-up online storage
• Accessible (to you) remotely (via VPN)
Costs (100 GB dataset in COSINe)
($0/year * 4 GB) + ($60/100 GB/year) = $60/year (ongoing)
$300 for 5 years
Archive @ OSU w/ CN
• Storage is in 2 separate data centers & backups retained for 3 months
• Accessible (to you) remotely (via VPN)
• OK for unrestricted, sensitive (FERPA), and protected data
Costs (100 GB dataset)
($0/year * 5 GB) + ($4/GB/year * 95 GB) = $380/year (ongoing)
$1,900 for 5 years
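The arithmetic behind these figures is easy to rerun for a different dataset size when budgeting. This is a sketch only: annual_cost is a made-up helper, the CN rates (5 GB free, then $4/GB/year) and the $60/year department figure are taken from the slides above and may no longer be current.

```python
def annual_cost(size_gb: float, free_gb: float, rate_per_gb_year: float) -> float:
    """Yearly storage cost when the first `free_gb` are free and the rest is billed per GB."""
    billable_gb = max(size_gb - free_gb, 0)
    return billable_gb * rate_per_gb_year


YEARS = 5

# CN pricing as quoted on the slide: first 5 GB free, then $4 per GB per year.
cn_annual = annual_cost(100, free_gb=5, rate_per_gb_year=4)
print(cn_annual, cn_annual * YEARS)          # 380 1900

# Department IT figure taken directly from the slide: $60 per year for 100 GB.
department_annual = 60
print(department_annual, department_annual * YEARS)  # 60 300
```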
Archive in discipline-specific repository
• Replicated, archive-quality storage
• Data curation throughout ingest & archive period
• Data in context with other datasets
Costs
Ranges
3rd party storage platforms
Costs
Ranges
Bottom line
No “one-size-fits-all” approach
Balance costs, storage quality, access, degree of involvement, security, longevity, etc.
Plan ahead so you can budget appropriately