Challenges of Digital Preservation. MA / CS 109 April 22, 2011 Andrea Goethals Manager of Digital Preservation & Repository Services Harvard Library. “Digital Content”?. Digitized (born-analog). Born-digital Tweets Web sites Email Documents PDF Word, OpenOffice … Spreadsheets - PowerPoint PPT Presentation
Challenges of Digital Preservation
MA / CS 109
April 22, 2011
Andrea Goethals
Manager of Digital Preservation & Repository Services
Harvard Library
“Digital Content”?
Digitized (born-analog)
Born-digital
◦ Tweets
◦ Web sites
◦ Email
◦ Documents: PDF; Word, OpenOffice …; Spreadsheets
◦ Data sets
Digital content is not new
1957: 1st digital image
1969: ARPAnet
1971: 1st email sent
1972: 1st consumer-level video game
1975: 1st digital camera
[Image: Russell Kirsch’s son (source: NIST)]
But has only recently exploded
1998: 1st Google index
◦ 26 million pages
2000: Google index
◦ 1 billion pages
2008: Google link processors
◦ 1 trillion unique URIs
◦ “… and the number of individual Web pages out there is growing by several billion pages per day” – from the official Google blog
The coming tsunami
2010: estimated at 1.2 ZB (1 ZB is 1 billion TB)
◦ DVDs stacked from Earth to the Moon and back
2020: expected to grow by a factor of 44 (from the 2009 level) to 35 ZB
◦ DVDs stacked halfway to Mars
Source: 2010 IDC Digital Universe Study sponsored by EMC
Outpacing storage
Source: 2009 IDC Digital Universe Study sponsored by EMC
Why do we care?
May be historically significant
Captured March 19, 2011 for a Japan Earthquake collection created by Virginia Tech, Internet Archive (http://www.archive-it.org/public/collection.html?id=2438)
May be an important reference
Only available in digital form
Who cares?
Cultural heritage institutions
◦ Libraries, archives
◦ Museums, historical societies
◦ Academic institutions
Governments
Entertainment, news and media industry
Scientific community
Funding bodies (NSF, NIH)
You?
Preservation historically
Archives and libraries have been preserving all kinds of analog material for centuries using:
◦ Environmental control
◦ Conservation treatments
Can store away until resources allow processing
◦ Benign neglect approach works well
Analog content is fairly durable
Even damaged, may still be identifiable, readable, usable
[Image: Anatolian Cuneiform Tablet, circa 1850 BCE]
In contrast, digital content is
Easily destroyed
Transient
Hidden
Requires more active attention – benign neglect approach doesn’t work
Digital content is easily destroyed
Bad people
Hardware or software failures
Human mistakes
◦ The slip of a finger can lead to catastrophic results
◦ “Help! Accidental deletion. I accidentally deleted 62 images… can you please recover them from backups?”
Digital content is transient
Average lifespan of a Web site is between 44 and 100 days
[Images: site as captured April 8, 2009; as visited October 13, 2010]
Digital content is hidden
Both. Use helps, but it’s not enough to detect corruption.
But is it usable???
It’s not enough to preserve the digital bits
◦ AppleWorks?
◦ WordStar?
◦ Excel 1.0?
To use digital content we need software that can read the format
Reading formats
ffd8ffe000104a46494600010201008300830000ffed0fb050686f746f73686f7020332e30003842494d03e90a5072696e7420496e666f000000007800000000004800480000000002f40240ffeeffee030602520347052803fc00020000004800480000000002d80228000100000064000000010003030300000001270f0001000100000000000000000000000060080019019000000000000000000000000000000000000000000000000000000000000000003842494d03ed0a5265736f6c7574696f6e0000000010008313a3000200 ...
SOI, APP0 (JFIF 1.2), APP13 (IPTC), APP2 (ICC), DQT, SOF0 (183x512), DRI, DHT, SOS, ECS0, RST0, ECS1, RST1, ECS2, ...
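The step from raw bytes to named segments can be sketched in a few lines of Python. This is a minimal illustration, not a full JPEG parser: it assumes every marker starts with a single 0xFF byte (no fill bytes), stops at the start of scan data, and stops quietly when the input is truncated. The names `HEADER_PREFIX`, `marker_name` and `list_segments` are made up for this example; the bytes are the first 24 bytes of the hex dump shown on the slide.

```python
import struct

# First 24 bytes of the hex dump on the slide (the dump itself is truncated).
HEADER_PREFIX = bytes.fromhex("ffd8ffe000104a46494600010201008300830000ffed0fb0")

# A few common JPEG marker codes (second byte after 0xFF).
MARKER_NAMES = {0xD8: "SOI", 0xD9: "EOI", 0xDB: "DQT", 0xC0: "SOF0",
                0xC4: "DHT", 0xDD: "DRI", 0xDA: "SOS"}


def marker_name(code):
    """Map a marker code to a readable name; APP0..APP15 are 0xE0..0xEF."""
    if 0xE0 <= code <= 0xEF:
        return f"APP{code - 0xE0}"
    return MARKER_NAMES.get(code, f"0x{code:02X}")


def list_segments(data):
    """Return (name, declared_length) pairs for the segments found in data.

    SOI and EOI carry no length field, so their length is None.
    """
    segments = []
    pos = 0
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            break  # not positioned on a marker: stop rather than guess
        code = data[pos + 1]
        pos += 2
        if code in (0xD8, 0xD9):  # standalone markers without a length
            segments.append((marker_name(code), None))
            if code == 0xD9:
                break
            continue
        if pos + 2 > len(data):
            break  # truncated before the length field
        (length,) = struct.unpack(">H", data[pos:pos + 2])  # big-endian 16-bit
        segments.append((marker_name(code), length))
        if code == 0xDA:
            break  # entropy-coded scan data follows; not handled here
        pos += length  # the declared length includes the two length bytes
    return segments


if __name__ == "__main__":
    for name, length in list_segments(HEADER_PREFIX):
        print(name, length)
```

Run on the slide's bytes, the sketch reports SOI, then APP0 with a declared length of 16, then APP13 with a declared length of 4016. The same bytes mean nothing without knowledge of the format's rules, which is the point of the slide.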
Access to information
[Diagram: layers between a reader and the information content]
Analog book (unmediated access): information content, language, symbols, HW (paper)
Digital book (technology-mediated access): information content, formats, bits, SW, HW
Formats are key to digital preservation
[Diagram: digital content (information content, formats, bits) resting on supporting technologies (SW, HW)]
If the format of our content is unsupported by technology, we can’t access the content’s information!
Dependent on fleeting technology
We are dependent on technology to interpret (render, play, etc.) digital content
No technology sticks around – it all ages and disappears
Eventually all digital content in its original format becomes unusable!
Format obsolescence
Kodak PhotoCD
◦ Used by libraries in the 1990s and into the 2000s as a preservation format
◦ Best decoders were from Kodak and are no longer supported
◦ Very few software decoders remaining – soon images in this format will be unusable
◦ Harvard’s Digital Repository Service has 7,243 of these
Two sub-problems
Keep the bits safe
Keep the information usable as technology changes
Safe bits
Infrastructure, policies, practices and professional staff to counter risks
◦ High quality storage
◦ Redundancy (multiple copies, multiple locations)
◦ Media refreshing (replacing)
◦ Security and access restrictions
◦ Content recovery
◦ Integrity monitoring (check for corruption)
…
Integrity monitoring
Message digests – effectively unique signatures for digital content
◦ Fixed-size bit strings, e.g. 6326ec82b3200df4a87fc54356d2cb73
◦ Calculated by cryptographic hash functions, e.g. MD5, SHA1, …
Any change to a file results in a changed message digest
Useful for detecting corruption
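A minimal Python sketch of this kind of fixity check, using the standard `hashlib` module. The function names `digest_of` and `is_intact` are made up for this illustration, and reading in chunks is just one reasonable choice for large files. The 32-hex-digit string on the slide has the length of an MD5 digest, so MD5 is the default here. MD5 and SHA-1 are adequate for catching accidental corruption but are not considered safe against deliberate tampering.

```python
import hashlib
import io


def digest_of(stream, algorithm="md5", chunk_size=1 << 20):
    """Return the hex message digest of a binary stream, read in chunks."""
    hasher = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


def is_intact(stream, recorded_digest, algorithm="md5"):
    """Compare a freshly computed digest with the one recorded at ingest."""
    return digest_of(stream, algorithm) == recorded_digest.lower()


if __name__ == "__main__":
    # In-memory stand-ins for a stored file and a copy with one changed byte.
    original = b"hello"
    corrupted = b"hellp"
    recorded = digest_of(io.BytesIO(original))
    print(recorded)
    print(is_intact(io.BytesIO(original), recorded))   # unchanged content
    print(is_intact(io.BytesIO(corrupted), recorded))  # one byte differs
```

For a file on disk, the same function can be called with `open(path, "rb")` as the stream. A repository would typically record the digest when content is ingested and recompute it on a schedule.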
Usable information
People have to be able to find it
People must be able to manage it
Document what’s important (description, context, ownership, processing history)
Know what you are preserving (formats)
…
A TIFF is a TIFF?
TIFF 4.0
TIFF 5.0
TIFF 6.0
TIFF 6.0 extension YCbCr (Class Y)
TIFF/IT (ISO 12639:2003)
TIFF/EP (ISO 12234-2:2001)
RichTIFF
EXIF 2.0
EXIF 2.1 (JEIDA-49-1998)
EXIF 2.2 (JEITA CP-3451)
GeoTIFF 1.0
TIFF-FX (RFC 2301)
Class F (RFC 2306)
RFC 1314
Canon RAW (.crw, .cr2, .tif)
Nikon RAW (.nef)
DNG (Adobe Digital Negative)
Identifying formats
Techniques: “magic numbers”, full parse
Few tools
◦ Support limited number of formats
◦ Accuracy varies
Some improvements
◦ File Information Tool Set (FITS) fits.google.code
◦ NARA-sponsored research
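The "magic numbers" technique can be sketched as a lookup of well-known leading byte sequences. The table below is a small illustrative sample, and the names `SIGNATURES` and `identify` are made up for this example; real identification tools cover far more formats.

```python
# A handful of well-known file signatures ("magic numbers"), matched against
# the first bytes of a file. Illustrative only.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF header
    (b"MM\x00*", "TIFF"),  # big-endian TIFF header
]


def identify(header):
    """Return a format name for the leading bytes of a file, or None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


if __name__ == "__main__":
    print(identify(bytes.fromhex("ffd8ffe000104a464946")))  # slide's JPEG bytes
    print(identify(b"%PDF-1.4\n"))
    print(identify(b"plain text"))
```

A signature match only narrows things down. It reports "TIFF" for any file that begins with a TIFF header and cannot say which of the many TIFF variants the file uses, which is why a full parse is listed as the other technique.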
Usable information
Make sure there’s technology to support the formats! (technology watch)
Preservation strategies
◦ Technology preservation
◦ Creation of viewing software
◦ Emulation & variations: Universal Virtual Machine, Universal Virtual Computer
◦ Format normalization
◦ Format migrations
…
Key format migration considerations
What can’t be lost in the transformation? “Significant properties”
◦ E.g. color, embedded metadata, resolution, ICC profiles, interaction, attachments, fonts, links
◦ How important is each of these properties? – weighted criteria
To what format? “Preservable” formats
What else must be changed? Ex: Links
How many versions to keep?
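One way to read "weighted criteria" is as a weighted sum: each significant property gets a weight, each candidate target format gets a rating per property, and the products are added up. The sketch below assumes that reading. All weights, ratings and format names are hypothetical values invented for illustration, not figures from the talk.

```python
# Hypothetical weights: how much each significant property matters (sum to 1).
WEIGHTS = {
    "color": 0.4,
    "embedded metadata": 0.3,
    "resolution": 0.2,
    "ICC profiles": 0.1,
}

# Hypothetical ratings per candidate format: 0.0 = lost, 1.0 = fully kept.
CANDIDATES = {
    "format A": {"color": 1.0, "embedded metadata": 0.5,
                 "resolution": 1.0, "ICC profiles": 1.0},
    "format B": {"color": 1.0, "embedded metadata": 1.0,
                 "resolution": 1.0, "ICC profiles": 0.0},
}


def weighted_score(ratings, weights):
    """Sum of weight * rating over all weighted properties."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"no rating for: {sorted(missing)}")
    return sum(weights[prop] * ratings[prop] for prop in weights)


def rank(candidates, weights):
    """Candidate names ordered from highest to lowest weighted score."""
    return sorted(candidates,
                  key=lambda name: weighted_score(candidates[name], weights),
                  reverse=True)


if __name__ == "__main__":
    for name in rank(CANDIDATES, WEIGHTS):
        print(name, round(weighted_score(CANDIDATES[name], WEIGHTS), 2))
```

With these made-up numbers, format B scores 0.9 and format A scores 0.85, because losing half of the heavily weighted embedded metadata costs more than losing the lightly weighted ICC profiles. Changing the weights changes the ranking, which is why the weighting is itself a preservation decision.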
Preservation lifecycle – a series of hand-offs
Create or acquire digital content
Ingest into a preservation repository
◦ Continuous cycle of: monitoring, planning, intervention
◦ Subject to collection management decisions
Transfer to next generation of the repository or to a different repository
Ongoing commitment
Requires a continual proactive program
◦ You can’t just start and stop
◦ Time frames are MUCH shorter than for preservation of analog material
Requires ongoing investment in infrastructure and staff
Can’t do it alone
Digital preservation activities must be shared across institutions
Even collectively we don’t have adequate resources or understanding
Preservation community
Collaborative organizations (NDSA, IIPC, OPF)
Collaborative projects
Standards and best practices
Shared infrastructure and tools
◦ Formats registry
◦ Repository software
◦ Preservation planning tools
◦ Format tools