Keep Calm and Curate
Easy steps to help you better manage your content and make it more useful
Gareth Knight, Digital Curation Specialist
Food for Thought. University of Cambridge
27th June 2011
DIGITAL LIFECYCLE
We’ve created our research content
It has taken a lot of time & effort
I want to get maximum value from it
How can I ensure that it doesn’t go to an early grave?
WHY MANAGE YOUR CONTENT?
Researcher perspective
1. Protect value of content
2. Maximise visibility and impact of researcher
3. Enable continued development and use
Institutional perspective
4. Protect financial investment
5. Evidence of operation and impact
6. Compliance with appropriate regulations
MANAGEMENT CHALLENGES
• Inability to locate:
  • Data files have been lost or corrupted and an alternative copy cannot be found.
• Inability to access:
  • Data files cannot be decoded using available software.
• Uncertainty over content:
  • Many data files exist, but it is unclear what constitutes the final product of the research process and what is a by-product of investigation.
• Inability to understand:
  • Data files can be accessed using appropriate software, but the context of the research content cannot be established.
• Unclear usage:
  • Rights issues associated with publication and use of content are unclear – should the institution err on the side of caution?
RESEARCHER EXPERIENCE @ KCL
The JISC PEKin project assessed six research & administrative departments within King’s College London during 09/10:
Storage use:
• Staff were uncertain where to store data. Network drives did not offer sufficient capacity, resulting in use of local storage (external USB disks, USB sticks) that had no backups, or of 3rd party services unknown to the institution (e.g. Dropbox)
Data encoding & conversion:
• Data formats: Uncertainty over the correct file format to use to store data.
• Data conversion is problematic – tools cause loss of some significant properties
Authenticity concerns: Questionable provenance:
• Staff do not understand the origin of data. Many different copies exist, with unidentified/unknown changes made by different authors
• Result: Staff store a known good copy on a local drive. Some rely upon print-outs of the digital original
Archival value and retention period:
• Value of research papers is understood, but value of datasets & other outputs is not recognised (implications for REF & publication). Some data are stored, others deleted
DATA STORAGE
Reality: ALL digital storage media are unreliable:
• Gradual degradation over time
• Unexpected failure through power surge, unexpected motion, and theft (as well as accidental washing)
• Media obsolescence – 5 ¼ and 3 ½ inch floppy disks, Zip disks, & many others
• 3rd party storage providers can close their service & delete your content
Practical approaches to take:
• Appraise - do you need to keep everything?
• Store content on at least 2 forms of storage in different locations, e.g. store 2 local copies, one on an internal drive and one on a USB stick, hard disk, etc., and at least one remote copy (e.g. departmental shared drive)
• Submit your research to the institutional repository
• Test your backups to ensure they are still valid.
• Copy data files to new media every 2-5 years after first creation
http://www.flickr.com/photos/timypenburg/5442288539/
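One way to test that a backup is still valid is to compare checksums of the master copy against the backup copy. The sketch below is illustrative only: the function names (sha256_of, build_manifest, compare_copies) and the choice of SHA-256 are assumptions made for this example, not part of the original guidance.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(master: Path, backup: Path) -> dict[str, list[str]]:
    """Report files that are missing from, or differ in, the backup copy."""
    master_files = build_manifest(master)
    backup_files = build_manifest(backup)
    missing = [name for name in master_files if name not in backup_files]
    changed = [
        name
        for name, checksum in master_files.items()
        if name in backup_files and backup_files[name] != checksum
    ]
    return {"missing": missing, "changed": changed}
```

Running compare_copies on a working directory and its backup returns two lists, "missing" and "changed". Two empty lists suggest the backup matches the master at the time of the check. It does not detect a file that was already damaged before it was copied.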
DATA ORGANISATION
• If someone examined your data for the first time, what would they wish to know?
  • What research collection is contained within the directory?
  • What type of information does it contain?
  • Where can I find specific content, e.g. final report, analysis data?
• Practical approaches to take:
  • Establish a directory structure that clearly distinguishes between groups of files (e.g. reports, photographs, etc.). Use sub-directories for sub-categories (e.g. topics, date, version)
  • Adopt a consistent approach to organising directories (across your department, if possible)
  • Label files in a manner that allows purpose, version and other relevant information to be quickly identified (e.g. using filename, cover page)
http://www.flickr.com/photos/amcclen/253640379/
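A consistent file label can be generated rather than typed by hand. The sketch below shows one possible convention (project, description, date, version); the pattern and the build_filename helper are hypothetical and are not prescribed by the guidance above, so adapt them to whatever your department agrees on.

```python
import re
from datetime import date


def build_filename(project: str, description: str, created: date,
                   version: int, extension: str) -> str:
    """Build a name of the form project_description_YYYY-MM-DD_vNN.ext."""

    def slug(text: str) -> str:
        # Lower-case the text and replace runs of other characters with a hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    suffix = extension.lstrip(".").lower()
    return (
        f"{slug(project)}_{slug(description)}_"
        f"{created.isoformat()}_v{version:02d}.{suffix}"
    )
```

For example, build_filename("My Project", "Final Report", date(2011, 6, 27), 2, ".PDF") gives "my-project_final-report_2011-06-27_v02.pdf". Putting the date in ISO order means that files sort chronologically in a directory listing.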
CHOOSING THE RIGHT FORMAT
How do you choose the correct file format to store your content?
• Each format has different capabilities & is not suitable for every task, e.g. MS Word is not suitable for web access, etc.
• Some formats remove content or functionality to reduce file size & limit use, e.g. JPEGs lack detail, PDFs are difficult to edit
• Target audience may not see content in the same way that you do - each application interprets data and renders content differently, e.g. different fonts, layout changes
How do you ensure that content can be accessed in the long-term?
• Format obsolescence: Gradual change may result in formats & older versions becoming difficult to access over time, e.g. MS Word, WordPerfect, older AutoCAD formats, complex objects. You know the file contains information, but what does it mean?
• Format conversion: Some content attributes may be changed or lost when converting between formats
DIFFERENCES IN SOFTWARE INTERPRETATION
[Figure: screenshots labelled Open Office, Microsoft PowerPoint, Open Office Impress]
PHOTOGRAPH FORMATS
• Original photograph stored as TIFF: 79712 colours
• JPEG, 85% compression: considerable detail loss
• GIF, 256 colours: colour banding on petals
CHOOSING THE “RIGHT” FORMAT
Select different formats based upon needs, rather than a single format:
• Digital master: Preservation copy intended for long-term storage
• Dissemination: Access formats for use by specific users, e.g. PDF
Format of the digital master:
• Try to use common, widely used formats supported by a range of software tools.
• Store content in formats that support required attributes (e.g. 16 million colours) and will not degrade when resaved – ensure that you re-examine your file after you’ve saved it
• Retain all data associated with the original creation/capture process – it may contain information properties that are useful at a later date
POTENTIAL FORMATS
Content type | Digital master | Distribution copy
Plain Text | ASCII/Unicode text | ASCII text, Unicode text
Document | Open Document, Rich Text Format, MS DocX (possibly) | PDF/A
Database | Comma Separated (CSV) or Tab-delimited text (tab), SQL Dump (possibly) | MySQL, MS Access, FileMaker Pro through appropriate front-end
Photos | TIFF, PNG, RAW | JPEG, PNG
Audio | AIFF, Microsoft Wave, FLAC (potentially) | MP3, ASF
Video | MPEG2 (as used on DVDs), JPEG 2000 in an MJPEG wrapper, MJPEG | MPEG2, Quicktime, AVI, etc.
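The table above suggests CSV as a digital master for database content. As a rough illustration of what such an export can look like, the sketch below writes each table of a database to its own CSV file. It assumes an SQLite database purely because Python can read one without extra software; SQLite is not mentioned in the table, and export_tables_to_csv is a hypothetical helper. MySQL, MS Access and FileMaker Pro each have their own export tools.

```python
import csv
import sqlite3
from pathlib import Path


def export_tables_to_csv(db_path: str, out_dir: Path) -> list[Path]:
    """Write every user table in an SQLite database to <table>.csv."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    conn = sqlite3.connect(db_path)
    try:
        # List user tables, skipping SQLite's own internal tables.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        for table in tables:
            cursor = conn.execute(f'SELECT * FROM "{table}"')
            target = out_dir / f"{table}.csv"
            with target.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.writer(handle)
                # First row holds the column names, so the file is self-describing.
                writer.writerow([column[0] for column in cursor.description])
                writer.writerows(cursor)
            written.append(target)
    finally:
        conn.close()
    return written
```

A CSV export keeps the values but not the relationships, data types or constraints of the database, which is one reason to document the structure separately or to keep an SQL dump as well.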
DOCUMENTATION
Information necessary to interpret, understand and use a given dataset or set of documents
What would someone wish to know about your content?
• Who created it?
• When was it created?
• Why was it created? Who funded it?
• What is the source of the material used?
• What is the motivation for the approach you took?
• What content can be published?
• How can it be used?
Practical approach to take:
• Attach a cover page to your document with relevant creator & rights information
• Create a catalogue record for your digital repository
• Create an administrative file for internal use that can help colleagues and repository staff, and assign it an appropriate filename
Where to go for help:
DSpace@Cambridge Guidance: http://www.lib.cam.ac.uk/dataman/pages/metadata.html
http://www.flickr.com/photos/playingwithpsp/3031647963/
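The questions above can be turned into a simple administrative file stored alongside the data. The sketch below is one possible form, not a repository requirement: the field names, the JSON format and the filename README_administrative.json are all assumptions made for illustration.

```python
import json
from pathlib import Path

# One field per question on the slide; the names are illustrative only.
REQUIRED_FIELDS = (
    "creator",              # Who created it?
    "created",              # When was it created?
    "purpose",              # Why was it created?
    "funder",               # Who funded it?
    "sources",              # What is the source of the material used?
    "method_rationale",     # What is the motivation for the approach taken?
    "publishable_content",  # What content can be published?
    "usage_terms",          # How can it be used?
)


def write_admin_record(directory: Path, record: dict) -> Path:
    """Check that every question is answered, then save the record as JSON."""
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        raise ValueError("Missing documentation fields: " + ", ".join(missing))
    target = directory / "README_administrative.json"
    target.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return target
```

Refusing to write an incomplete record is a deliberate choice here: it prompts the author to answer each question while the details are still fresh.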
CONCLUSIONS
• A number of factors may limit use of your content over time. However, you can make choices that will enable your content to be accessed in the long-term and be usable by others
• Ways to protect the value of your data:
  • Store your data in 2 or more locations
  • Organise it using an easy to understand structure
  • Adopt a digital master format that is fit for purpose
  • Document information that cannot be obtained elsewhere
• Support & documentation are available within the institution (DSpace@Cambridge) and externally, which can help you with choices
USEFUL REFERENCES
Cambridge http://www.lib.cam.ac.uk/dataman/
Glasgow http://www.gla.ac.uk/services/datamanagement/
Edinburgh http://www.ed.ac.uk/schools-departments/information-services/services/research-support/data-library/research-data-mgmt