1

CS 502: Computing Methods for Digital Libraries

Lecture 27

Preservation

2

Administration

Online survey

http://create.hci.cornell.edu/cssurvey.cfm

Course evaluations

at end of class today

3

Long-term preservation

Objective

Retain digital library materials over centuries

Longer than ...

• computer architectures (Wintel, Linux, 390, ...)

• magnetic storage (disks, tapes, ...)

• formats, protocols, applications (Unicode, Java, XML, ...)

• Internet or the web

for purposes that we have not yet considered

9

Levels of preservation

• Preserve full look and feel of digital material in its context

e.g., a video game with its hardware

• Preserve content with an access system but migrate the look and feel to new environments

e.g., successive versions of MS Windows

• Preserve raw content but no software system

e.g., UTF-8 text with XML/XSL mark-up, but no XML/XSL software

The complexity of preservation varies greatly with the level.

10

Challenges: user needs

Digital information differs from print

May be useless without its environment.

Creator and subscriber may not have copies.

Numerous versions.

Example: A scientific journal on-line

If the author does not subscribe - no access to own article.

If the library does not renew subscription - no access to anything.

11

Challenges: technical problems

Technical issues

Storage media have short life-span.

Formats and specifications change continually.

Computing environments are very complex.

Example: personal files

I have retained all my personal computer files since 1984, but have great difficulty in reading some of them.

12

Challenges: economic and legal

Legal

Archives require permission to save information.

Institutions:

Library of Congress, National Archives, etc. do not provide the same services for electronic information that they provide for physical artifacts.

Example: discontinued serials

What happens if a journal publisher goes bankrupt, or a scientific archive does not get its grant renewed?

13

Technical approaches: 1. Persistent storage

Material                   Approximate life (years)
Acid-free paper            500+
Microfilm                  300
Optical disks              100?
Color film                 25-50
CDs                        20?
Magnetic disk and tape     5

• Persistent storage preserves raw content only

• Research in high-volume, long-term digital media is lacking
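
A rough arithmetic sketch of what the table implies for digital media: if each medium must be replaced before its approximate life runs out, the number of successive media needed to span a retention period is the period divided by the media life, rounded up. The function name and the 300-year target are illustrative assumptions, not part of the lecture; the lifetimes are the table's own estimates.

```python
import math

def copy_generations(retention_years: float, media_life_years: float) -> int:
    """Number of successive media needed to cover the retention period."""
    if retention_years <= 0 or media_life_years <= 0:
        raise ValueError("years must be positive")
    return math.ceil(retention_years / media_life_years)

# Approximate lifetimes in years, taken from the table (estimates only).
MEDIA_LIFE = {"magnetic disk and tape": 5, "CDs": 20, "optical disks": 100}

if __name__ == "__main__":
    # Hypothetical target: keep the material for 300 years.
    for medium, life in MEDIA_LIFE.items():
        print(medium, copy_generations(300, life))
```

Under these assumptions, magnetic media would need 60 generations of copies to match the lifetime of a single roll of microfilm, which is why short-lived media only work with active management.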

14

Technical approaches: 2. Copying bits (refreshing)

Refreshing bits

Repeatedly copy bits from one storage medium to the next.

• A standard technique in data processing.

• Benefits from the rapid fall in prices of storage devices.

• Preserves raw content only.

Requires active management

Mirrors

Have many copies of the same information with independent management.
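
A minimal sketch of refreshing and mirror checking, assuming the material is held as ordinary files. The function names (`sha256_of`, `refresh`, `mirrors_agree`) are hypothetical, and SHA-256 is one reasonable choice of checksum rather than anything the lecture prescribes.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Iterable

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def refresh(source: Path, destination: Path) -> str:
    """Copy source to a new medium and confirm the copy is bit-identical."""
    before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"copy of {source} does not match the original")
    return after

def mirrors_agree(copies: Iterable[Path]) -> bool:
    """True if every mirror holds exactly the same bits."""
    return len({sha256_of(p) for p in copies}) <= 1
```

Comparing checksums across independently managed mirrors is what lets a damaged copy be detected and replaced from a good one.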

15

Technical approaches: 3. Migration of content

Migration

• Retain content but change formats and representations to keep current with technology

• Used by journal publishers

• Preserves content and an access system

Example. Pension funds

The Social Security Administration has records of every FICA payment, which migrate between systems over many years.
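
A sketch of what one migration step might look like: a fixed-width legacy record is re-expressed in a current, self-describing format (JSON here). The record layout, field names, and sample values are invented for illustration and do not describe any real Social Security Administration system.

```python
import json

# Hypothetical legacy layout: (field name, start column, end column).
LEGACY_LAYOUT = [("account", 0, 9), ("year", 9, 13), ("amount_cents", 13, 22)]
RECORD_LENGTH = LEGACY_LAYOUT[-1][2]

def parse_legacy(record: str) -> dict:
    """Split a fixed-width record into typed fields."""
    if len(record) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} characters, got {len(record)}")
    fields = {name: record[start:end] for name, start, end in LEGACY_LAYOUT}
    return {
        "account": fields["account"],            # kept as text to preserve leading zeros
        "year": int(fields["year"]),
        "amount_cents": int(fields["amount_cents"]),
    }

def migrate(record: str) -> str:
    """Re-encode one legacy record as JSON; the content is unchanged."""
    return json.dumps(parse_legacy(record), sort_keys=True)
```

The content survives, but the representation changes, so each migration is also a chance to introduce errors; validating lengths and types at every step is one inexpensive safeguard.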

16

Technical approaches: 4. Emulation

Concept

• Record a full specification of the computing environment in which the digital information was created

• At some time in the future, emulate the original computing environment

• Would preserve full look and feel

Clearly not practical for complex computing systems

• Emulation is never perfect

• Computing environments are remarkably complex

But may be useful for parts of systems

e.g., Java virtual machine
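
A toy illustration of why emulation can work for a small, fully specified part of a system: if the instruction set is written down completely, a later implementation can run an old program unchanged. The four-instruction stack machine below is invented for this sketch and is not the Java virtual machine.

```python
def run(program):
    """Interpret a list of (opcode, operand) pairs on a simple stack machine."""
    stack = []
    output = []
    for op, arg in program:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "PRINT":
            output.append(stack.pop())
        else:
            raise ValueError(f"unknown opcode: {op}")
    return output

# A preserved "old program": computes (2 + 3) * 4 and prints the result.
OLD_PROGRAM = [
    ("PUSH", 2), ("PUSH", 3), ("ADD", None),
    ("PUSH", 4), ("MUL", None), ("PRINT", None),
]
```

Here `run(OLD_PROGRAM)` returns `[20]`. The specification fits in a few lines; a real computing environment does not, which is the lecture's reason for doubting emulation of complex systems.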

17

Technical approaches: 5. Digital archeology

After periods of neglect, archeologists are needed

• Recover data from old media

• Reverse engineer lost formats and specifications

• Experts in digital paleography (reading archaic scripts and formats)

Example. East Germany

German archivists are reconstructing the records of the East German state from worn-out tapes, broken computer systems, undocumented databases, and the recollections of staff.
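
A first step in reverse engineering an unlabeled file is often to inspect its leading bytes. The sketch below checks a handful of well-known file signatures and falls back to an ASCII test; the function name and the short signature list are illustrative assumptions, and real recovery work needs far more than this.

```python
# Well-known leading byte signatures ("magic numbers") for a few formats.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"%!PS", "PostScript"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
]

def guess_format(data: bytes) -> str:
    """Guess a format from the first bytes of a recovered file."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    try:
        data.decode("ascii")
        return "ASCII text"
    except UnicodeDecodeError:
        return "unknown"
```

Anything reported as "unknown" is where the paleography begins: the format has to be inferred from the data itself, from surviving documentation, or from people who remember the system.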

18

Preservation at publication

This is a period of experimentation and change in formats, protocols, object models, etc.

Some kinds of information are easier to preserve than others.

Longevity is more likely if:

Formats are widely used, in important applications.

Methods are simple, without using obscure options.

Coding schemes are easy to interpret.

Example. Internet RFC Series

The Internet RFC series uses plain ASCII text. The RFCs go back to 1969 and have no preservation problems. A few RFCs are in PostScript and are already hard to decipher.

19

Metadata

Digital information needs interpretation

• Self-documentation is always good

• Persistent identification is vital

• Simple, standard metadata has a chance of a long life

• Authentication of material need not be complex (e.g., hash)

• History of changes (e.g., migration to different format)
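
The points above can be sketched as a small metadata record that carries a persistent identifier, a hash for authentication, and a history of format changes. The field names, the function names, and the use of SHA-256 are assumptions made for illustration, not a standard schema.

```python
import hashlib

def make_record(identifier: str, content: bytes, fmt: str) -> dict:
    """Create a simple, self-describing metadata record for one item."""
    return {
        "identifier": identifier,   # persistent identifier (hypothetical scheme)
        "format": fmt,
        "sha256": hashlib.sha256(content).hexdigest(),
        "history": [],
    }

def verify(record: dict, content: bytes) -> bool:
    """Authenticate content by comparing its hash with the recorded one."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]

def record_migration(record: dict, new_content: bytes, new_format: str, when: str) -> dict:
    """Return an updated record that logs a migration to a new format."""
    entry = {
        "date": when,
        "from": record["format"],
        "to": new_format,
        "previous_sha256": record["sha256"],
    }
    updated = dict(record)
    updated["history"] = record["history"] + [entry]
    updated["format"] = new_format
    updated["sha256"] = hashlib.sha256(new_content).hexdigest()
    return updated
```

Keeping the previous hash in each history entry lets a future reader check an older copy against the state it was in before migration.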

20

Preservation of specifications

Digital information needs a context

Therefore store the specifications of:

• Formats

• Database designs

• Technical documentation

• User manuals

...on high-quality archival materials, e.g., paper.

21

Final word

Long-term preservation needs people

and organizations who want it!
