CS 502: Computing Methods for Digital Libraries
Lecture 27
Preservation
2
Administration
Online survey
http://create.hci.cornell.edu/cssurvey.cfm
Course evaluations
at end of class today
3
Long-term preservation
Objective
Retain digital library materials over centuries
Longer than ...
• computer architectures (Wintel, Linux, 390, ...)
• magnetic storage (disks, tapes, ...)
• formats, protocols, applications (Unicode, Java, XML, ...)
• Internet or the web
for purposes that we have not yet considered
9
Levels of preservation
• Preserve full look and feel of digital material in its context
e.g., a video game with its hardware
• Preserve content with an access system but migrate the look and feel to new environments
e.g., successive versions of MS Windows
• Preserve raw content but no software system
e.g., UTF-8 text with XML/XSL mark-up, but no XML/XSL software
The complexity of preservation varies greatly with the level.
10
Challenges: user needs
Digital information differs from print
May be useless without its environment.
Creator and subscriber may not have copies.
Numerous versions.
Example: A scientific journal on-line
If the author does not subscribe - no access to own article.
If the library does not renew subscription - no access to anything.
11
Challenges: technical problems
Technical issues
Storage media have short life-span.
Formats and specifications change continually.
Computing environments are very complex.
Example: personal files
I have retained all my personal computer files since 1984, but have great difficulty in reading some of them.
12
Challenges: economic and legal
Legal
Archives require permission to save information.
Institutions:
Library of Congress, National Archives, etc. do not provide the same services for electronic information that they provide for physical artifacts.
Example: discontinued serials
What happens if a journal publisher goes bankrupt, or a scientific archive does not get its grant renewed?
13
Technical approaches: 1. Persistent storage
Material                  Approximate life (years)
Acid-free paper           500+
Microfilm                 300
Optical disks             100?
Color film                25-50
CDs                       20?
Magnetic disk and tape    5
• Persistent storage preserves raw content only
• Research in high-volume, long-term digital media is lacking
14
Technical approaches: 2. Copying bits (refreshing)
Refreshing bits
Repeatedly copy bits from one storage medium to the next.
• A standard technique in data processing.
• Benefits from the rapid fall in prices of storage devices.
• Preserves raw content only.
Requires active management
Mirrors
Have many copies of the same information with independent management.
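Refreshing and mirroring can be sketched in a few lines. The following is a minimal illustration, not a production tool: it copies one file to several locations and checks each copy against a SHA-256 digest of the source. The function names `sha256_of` and `refresh` are hypothetical, and SHA-256 is an assumed choice of checksum.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, targets: list[Path]) -> str:
    """Copy source to every target and confirm each copy is bit-identical.

    Hypothetical helper: each target stands in for a new storage medium
    or an independently managed mirror.
    """
    expected = sha256_of(source)
    for target in targets:
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
        if sha256_of(target) != expected:
            raise IOError(f"copy to {target} does not match the source")
    return expected
```

The returned digest can be stored with the item so that a later refresh cycle can detect silent corruption before copying it forward.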
15
Technical approaches: 3. Migration of content
Migration
• Retain content but change formats and representations to keep current with technology
• Used by journal publishers
• Preserves content and an access system
Example. Pension funds
The Social Security Administration has records of every FICA payment, which migrate between systems over many years.
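On a much smaller scale, migration can be illustrated by moving a text file from an older character encoding to UTF-8. This is a sketch under stated assumptions: the source encoding (Latin-1 by default here) must be known in advance, and `migrate_text` is a hypothetical name. Real migrations between systems involve far more than character encodings.

```python
from pathlib import Path


def migrate_text(source: Path, target: Path, old_encoding: str = "latin-1") -> None:
    """Rewrite a text file as UTF-8 while retaining its content.

    Bytes are read and written directly so that line endings are not altered.
    """
    original = source.read_bytes()
    text = original.decode(old_encoding)
    # Sanity check: re-encoding must reproduce the original bytes,
    # otherwise the assumed source encoding loses information.
    if text.encode(old_encoding) != original:
        raise ValueError(f"{source} does not round-trip as {old_encoding}")
    target.write_bytes(text.encode("utf-8"))
```

The original file is left untouched, so the older representation remains available if the migration later proves to be wrong.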
16
Technical approaches: 4. Emulation
Concept
• Record a full specification of the computing environment in which the digital information was created
• At some time in the future, emulate the original computing environment
• Would preserve full look and feel
Clearly not practical for complex computing systems
• Emulation is never perfect
• Computing environments are remarkably complex
But may be useful for parts of systems
e.g., Java virtual machine
17
Technical approaches: 5. Digital archeology
After periods of neglect, archeologists are needed
• Recover data from old media
• Reverse engineer lost formats and specifications
• Experts in digital paleography (reading archaic scripts and formats)
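A first step in identifying an unlabeled file is often to inspect its leading bytes. The sketch below checks a handful of well-known file signatures and falls back to a plain ASCII test. The table is deliberately tiny and the function name `guess_format` is hypothetical. Genuinely lost formats need far more detective work than this.

```python
# Leading byte signatures for a few common formats (a small, illustrative table).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"%!PS", "PostScript"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
]


def guess_format(data: bytes) -> str:
    """Guess a format from the first bytes of recovered data."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    # No known signature: check whether the data is pure 7-bit ASCII.
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return "unknown"
    return "ASCII text"
```

For example, `guess_format(b"%!PS-Adobe-3.0")` matches the PostScript signature, while bytes with no known signature and non-ASCII values are reported as unknown.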
Example. East Germany
German archivists are reconstructing the records of the East German state from worn-out tapes, broken computer systems, undocumented databases, and the recollections of staff.
18
Preservation at publication
This is a period of experimentation and change in formats, protocols, object models, etc.
Some information is easier to preserve than other information.
Longevity is more likely if:
Formats are widely used, in important applications.
Methods are simple, without using obscure options.
Coding schemes are easy to interpret.
Example. Internet RFC Series
The Internet RFC Series uses plain ASCII text. The RFCs go back to 1969 and have no preservation problems. A few RFCs are in PostScript and are already hard to decipher.
19
Metadata
Digital information needs interpretation
• Self-documentation is always good
• Persistent identification is vital
• Simple, standard metadata has a chance of long life
• Authentication of material need not be complex (e.g., hash)
• History of changes (e.g., migration to different format)
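The points above can be combined in a very simple record. The following sketch assumes a plain dictionary with a persistent identifier, a format label, a SHA-256 hash for authentication, and a history list for migrations. All field and function names are hypothetical and do not follow any particular metadata standard.

```python
import hashlib
import json


def make_record(identifier: str, content: bytes, fmt: str) -> dict:
    """Create a minimal preservation record for one item."""
    return {
        "identifier": identifier,
        "format": fmt,
        "sha256": hashlib.sha256(content).hexdigest(),
        "history": [],
    }


def record_migration(record: dict, new_content: bytes, new_format: str, note: str) -> dict:
    """Return an updated record, keeping the previous format and hash in the history."""
    event = {
        "from_format": record["format"],
        "previous_sha256": record["sha256"],
        "note": note,
    }
    return {
        "identifier": record["identifier"],  # the identifier persists across migrations
        "format": new_format,
        "sha256": hashlib.sha256(new_content).hexdigest(),
        "history": record["history"] + [event],
    }


def is_authentic(record: dict, content: bytes) -> bool:
    """Check that content matches the hash stored in its record."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]


# Usage with a made-up identifier; the record serializes to plain JSON text.
record = make_record("example:item-0001", b"original bytes", "text/plain; charset=iso-8859-1")
record = record_migration(record, b"migrated bytes", "text/plain; charset=utf-8", "re-encoded as UTF-8")
serialized = json.dumps(record, indent=2, sort_keys=True)
```

Storing the record as plain text keeps the metadata itself easy to interpret, in the same spirit as the content it describes.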
20
Preservation of specifications
Digital information needs a context
Therefore store the specifications of:
• Formats
• Database designs
• Technical documentation
• User manuals
...on high-quality archival materials, e.g., paper.
21
Final word
Long-term preservation needs people
and organizations who want it!