Digital Preservation in Theory and Practice
Preservation and Archiving Special Interest Group (PASIG) Boot Camp
Tom Cramer, Chief Technology Strategist, Stanford University Libraries (@tcramer)
PASIG Digital Preservation Boot Camp


Page 2: Digital Preservation in Theory and Practice

PASIG Digital Preservation Boot Camp

• Founded in 2007 by Sun and Stanford University
• Forum for practitioners, industry experts, and researchers to exchange architectures, approaches, and best practices
• 14 international meetings with some of the world’s leading experts on digital preservation

www.preservationandarchivingsig.org

Page 3: Digital Preservation in Theory and Practice

Agenda

• Introduction
  – Traditional vs. Digital Preservation
  – Examples of Digital Preservation Case Studies
• Risk & Strategies
• OAIS: The Foundation of Digital Preservation Theory
• Authenticity, Trust, Sustainability
• Preservation in Action
  – Tools
  – Planning, Actions & Assessment
  – A case study: The Stanford Digital Repository

Page 4: Digital Preservation in Theory and Practice

The case of two formats…

Which of these will be usable in 100 years?

Presenter notes:
Domesday Book: more than 900 years old. Still legible. iPad w/ Book reader. Full text search, graphics, holds thousands of books. Will it be legible in 90 years? We don’t know. Question of preservation.
Page 5: Digital Preservation in Theory and Practice

Preservation

Effective Strategies for Traditional Materials

• Materials science: stable media
• Physical Protection
• Stable environmental conditions
• Controlled use
• Conservation, Repair and Reformatting

The mission of the Preservation Directorate at the Library of Congress is to assure long-term, uninterrupted access to the intellectual content of the Library's collections.

http://www.loc.gov/preservation/about/org.html

Presenter notes:
Attend to light, temperature, and humidity. Papyri last thousands of years. Alkaline paper has a life expectancy of over 1,000 years for the best paper and 500 years for average grades. If you locked a book printed today in a controlled environment, it would certainly be readable in 100 years.
Page 6: Digital Preservation in Theory and Practice

Slide courtesy of Keith Johnson

Page 7: Digital Preservation in Theory and Practice

Slide courtesy of Keith Johnson

Page 8: Digital Preservation in Theory and Practice

What Is Digital Information?

• Data that is encoded as a series of 0s and 1s…
• typically stored on magnetic (disk, tape) or optical (CDs, DVDs) media, that…
• requires specialized hardware to read (e.g., disk or tape drive, CD player), and…
• specialized software to operate (e.g., firmware, operating systems), and…
• still more software (applications) to interpret and render (e.g., PowerPoint) into usable form, and
• contextual knowledge about how to operate that software in order to use it (e.g., double click to open)

Presenter notes:
Mac vs. PC users…
Page 9: Digital Preservation in Theory and Practice

What Is the Problem?

• In short, access to digital information requires hardware, software and people…

• BUT technology and people change…

• thereby creating potential barriers to re-use.

Page 10: Digital Preservation in Theory and Practice

What Is Digital Preservation?

Digital preservation is the series of strategies and actions taken to promote the availability and usability of digital information over time.

Page 11: Digital Preservation in Theory and Practice

(In)Famous Examples of Loss

• The 1976 Viking Mars Landings

• Raw data still available, but…
  • Some not processed;
  • Some not documented;
  • Original software defunct.

• 1988: Processing 3,000 images from original tapes took 2 years of reverse engineering to build modern software.

• c2000: Researchers looking for biological data from the landing couldn’t process the raw data in unknown formats. Tracked down printouts and hired students to rekey it all.

Presenter notes:
http://www.nytimes.com/1990/03/20/science/lost-on-earth-wealth-of-data-found-in-space.html?src=pm
1988: There were copies of some old computer programs used to turn the raw data into pictures, but the source codes the computer needed to run the programs could not be found and the computers themselves no longer existed. After tracking down the data, Mr. Eliason looked up the NASA documents that described how they were entered. "It was written in technical jargon," he said. "Maybe it was clear to the person who wrote it but it was not clear to me 20 years later."
"I tried to make sense of the raw data but in the first image I processed, all the pixels were backwards," said Mr. Eliason, referring to the tiny squares on a grid that make up a photographic image. "I knew the images were in there somewhere," Mr. Eliason continued, "so I talked to the old guys" who designed the camera and other instruments. In this way Mr. Eliason learned that the Viking tape recorder ran forward and backward, so that data was encoded forward and backward as well.
After a year, Mr. Eliason succeeded in writing a computer program that can extract the missing images of Mars. And last year, more than a decade after the Viking cameras collected data, he found the highest resolution image ever taken of Mons Olympus, the most significant feature on Mars and the largest volcano in the solar system.
http://www.cbsnews.com/stories/2003/01/21/tech/main537308.shtml
Reported in 2003: Joe Miller, a University of Southern California neurobiologist, couldn't read magnetic tapes from the 1976 Viking landings on Mars. With the data in an unknown format, he had to track down printouts and hire students to retype everything. "All the programmers had died or left NASA," Miller said. "It was hopeless to try to go back to the original tapes."
Page 12: Digital Preservation in Theory and Practice

(In)Famous Examples of Loss

• 1986: BBC Domesday Project. Maps, images, statistical data, video and virtual reality tours.

• More than 1,000,000 participants in the project commemorating the 900th anniversary of William the Conqueror’s census of England

• Data stored on two laser discs (300 MB per side!): Required a Philips VP415 “Domesday Player” to read.

• 1999 – 2003: Two separate data recovery efforts involving both emulation and migration.

• 2090: Estimated date by which the project will be copyright free.

Page 13: Digital Preservation in Theory and Practice

Less Famous Examples of Loss

Parker on the Web

• 559 Anglo-Saxon manuscripts: ~200,000 high resolution images

• Used “Internet3” to transfer images from Cambridge to Stanford: Hard drives shipped via DHL

• Cambridge retained one full “insurance” copy on 1 TB external hard drives

• Within 5 years of start of project, 7 of 32 of these drives had failed. (Full copy of all images preserved at Stanford, though!)
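
As a back-of-the-envelope reading of the drive figures above, the following sketch computes the observed five-year failure fraction (7 of 32) and the chance that every copy of the same content is lost. It rests on a strong, purely illustrative assumption that copies fail independently at the observed rate; real failures are often correlated (same vendor, same batch, same room). The function name `loss_probability` is hypothetical.

```python
from fractions import Fraction


def loss_probability(p_fail: Fraction, copies: int) -> Fraction:
    """Chance that all `copies` replicas fail, ASSUMING independent failures."""
    return p_fail ** copies


# 7 of 32 drives failed within 5 years (figure taken from the slide).
observed = Fraction(7, 32)

for n in (1, 2, 3):
    # Each extra independent copy multiplies the loss chance by `observed`.
    print(n, "copies:", float(loss_probability(observed, n)))
```

Under that assumption a second copy cuts the five-year loss chance from roughly 22% to under 5%, which is one way to read why the full copy held at Stanford mattered.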

Page 14: Digital Preservation in Theory and Practice

Less Famous Examples of Loss

Monterey Jazz Festival

• Festival founded in 1958: longest running jazz festival in the world.

• Rich collection of recordings from inception, spanning over 50 years, in varying states of condition & decay.

• ~800 audio recordings, 1.6 TB

• First 35 years were recorded on analog tape – 1 out of hundreds was critically damaged

• In 1994, the festival converted to DAT – of the first 40 DAT tapes reformatted, 5 were critically damaged

Presenter notes:
There is a common perception that digital technologies are more reliable, more stable—probably stemming from the fact that digital copies can be perfect copies. However, they can in fact be more fragile.
Page 15: Digital Preservation in Theory and Practice

Less Famous Examples of Loss

April 2011 “Dear Google Video User, Later this month, hosted video content on Google Video will no longer be available for playback. Google Video stopped taking uploads in May 2009 and now we’re removing the remaining hosted content.”

Presenter notes:
Google launched a video hosting and search service in 2005. Google acquired YouTube in 2006. Stopped uploads in 2009. Announced unavailability of videos in April 2011. After public outcry, deferred removal and provided additional tools to assist in migrating content to YouTube.
Key points: Economic viability. Web-scale challenges of preservation.
http://googlewebmastercentral.blogspot.com/2011/04/update-on-google-video-finding-easier.html
http://www.thinkdigit.com/General/Google-Video-to-be-axed-come-April_6628.html
http://www.centernetworks.com/google-video-delete-uploaded-videos#ixzz1MvVNIy5Z
Page 16: Digital Preservation in Theory and Practice

Risks to Digital Information

• Media decay (bit rot)
• Obsolescence
  • File Format
  • Software
  • Hardware
  • Media
• Technology Failure
  • Software
  • Hardware
  • Media
• Communication errors
• Lack of context
  – Data but no codebook
• Ambiguous IP State
  – Copyright
  – Licensing
• Natural disasters
• Information attack
• Economic failure
• Organizational failure
• Loss of will
• Human error

Page 17: Digital Preservation in Theory and Practice

Strategies

• Replication
• Migration
  • Format
  • Media
  • Technology
• Emulation
  • Hardware
  • Software
• Encapsulation
• Redundancy and heterogeneity
  • Technology
  • Location
  • Organization
• Succession Planning

Digital Preservation (aka Long Term Access) is realized through a series of relays over time.

Page 18: Digital Preservation in Theory and Practice

Digital Preservation is More Than Technology…

10 Core Requirements for Digital Archives from CRL / TRAC

1. Mandate and Commitment to Digital Object Maintenance

2. Organizational Fitness

3. Legal and Regulatory Fitness

4. Efficient & Effective Policies

5. Adequate Technical Infrastructure

Presenter notes:
CRL = Center for Research Libraries; TRAC = Trustworthy Repositories Audit & Certification: Criteria and Checklist
Page 19: Digital Preservation in Theory and Practice

More Than Just Technology… (cont.)

6. Acquisition and Ingest

7. Preservation of Digital Object Integrity, Authenticity & Usability

8. Metadata Management & Audit Trails

9. Dissemination

10. Preservation Planning and Action

Page 20: Digital Preservation in Theory and Practice

Preservation & Archiving vs…

• Backup

• Disaster Recovery / Business Continuity

• Enterprise Content Management Systems – Document, Records, Web, Email

• Digital Asset Management – Images, Audio, Video

• Hierarchical Storage Management (HSM)

Page 21: Digital Preservation in Theory and Practice

The OAIS Model

• OAIS = Open Archival Information System
• Consultative Committee for Space Data Systems (CCSDS)
• 1995 – First international workshop; developing OAIS proposed
• 1997, 1999 – Drafts of reference model circulated
• 2000 – Published as draft standard
• 2002 – Approved as ISO standard 14721
• 2012 – Second version: recommended practice

Presenter notes:
CCSDS = forum for international space agencies interested in cooperative development of standards for data handling supporting space research. At ISO’s request in the mid-1990s, began research into a shared framework for long-term storage of digital data generated from space missions. Found no generally accepted conceptual models or vocabulary for digital archiving. Proposed to develop one in 1995.
CCSDS 650.0-M-2: 2009 – 2nd issue enters review, draft stage for improvements; 2012 – Recommended Practice document published by CCSDS. http://public.ccsds.org/publications/RefModel.aspx
This Recommended Practice defines the Reference Model for an Open Archival Information System (OAIS). The current issue includes clarifications to many concepts, in particular Authenticity, with the concept of Transformational Information Property introduced; corrections and improvements in diagrams; and the addition of Access Rights Information to PDI.
Page 22: Digital Preservation in Theory and Practice

OAIS: Key Concepts & Definitions

• Open = developed in an open public forum

• Archival Information System = “an organization of people and systems that has accepted the responsibility to
  • preserve information and
  • make it available for a
  • Designated Community”

Presenter notes:
PRESERVE Information; PROVIDE ACCESS to Information; for a DESIGNATED COMMUNITY
Page 23: Digital Preservation in Theory and Practice

OAIS: Mandatory Responsibilities

• Accept content
• Obtain control (including necessary IP rights)
• Define user community
• Ensure that the preserved information is independently understandable to the user community
• Follow documented procedures to
  • Preserve information against reasonable contingencies
  • Enable dissemination of authenticated copies
• Make preserved information available

Presenter notes:
i.e., users can understand the information without assistance from the producer.
Page 24: Digital Preservation in Theory and Practice

OAIS: Long-Term

A period long enough to raise concern about the effect of changing technologies, including support for new media and data formats, and of a changing user community. (OAIS, RLG-OCLC)

Page 25: Digital Preservation in Theory and Practice

OAIS: Environment

Image from Lavoie, 2004

Presenter notes:
Producers: those who transfer information (data & metadata) to the archive. Often governed by a submission agreement.
Management: sets the high-level policy framework: strategy, scope of archive, level of service, (perhaps) funding. NOT day-to-day management of the archive.
Consumers: those who use the information in the archive.
Designated community: the intended and targeted audience of the archive’s contents. May be a subset of the consumers.
OAIS: preserve information and make it available.
Page 26: Digital Preservation in Theory and Practice

OAIS: Functional Model

Image from Lavoie, 2004

Presenter notes:
Ingest, Archival Storage, Access, Administration, Data Management, Preservation Planning
Page 27: Digital Preservation in Theory and Practice

OAIS: Functional Model

• Ingest. Processes to accept information from producers and transfer to the archival store.

• Archival Storage. Long-term storage and maintenance of digital materials.

• Data Management. Maintains the descriptive and administrative metadata for the OAIS.
  • essentially an inventory and control system supporting other functions (ingest, access, planning).

Presenter notes:
Ingest: submission & receipt of deposits, validation, packaging, extraction of metadata, transfer to archival store.
Archival Storage: store & maintain digital objects in original and usable form. Includes fixity services (replication, audit, repair), media refresh, media migration, and format migration. No end user access.
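
The fixity services named in these notes (replication, audit, repair) can be illustrated with a minimal sketch. It assumes replicas are held in memory as bytes and that a SHA-256 digest was recorded at ingest; the names `audit_and_repair`, `site_a`, `site_b`, and `site_c` are hypothetical, and a real archive would work against storage systems and log every action.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest used here as the fixity value."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas: dict[str, bytes], expected: str) -> list[str]:
    """Audit every replica against the recorded digest.

    Replicas that fail are overwritten from an intact copy.
    Returns the names of the replicas that had to be repaired.
    """
    good = [name for name, data in replicas.items() if sha256_hex(data) == expected]
    bad = [name for name in replicas if name not in good]
    if not good:
        raise RuntimeError("no intact replica left to repair from")
    for name in bad:
        replicas[name] = replicas[good[0]]
    return bad


original = b"archived content"
recorded = sha256_hex(original)  # digest recorded at ingest time

# Three replicas; one has silently changed (simulated bit rot).
replicas = {"site_a": original, "site_b": b"archived c0ntent", "site_c": original}
repaired = audit_and_repair(replicas, recorded)
print(repaired)  # ['site_b']
```

Repair only works while at least one intact copy survives, which is why audit frequency and the number of independent replicas go together.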
Page 28: Digital Preservation in Theory and Practice

OAIS: Functional Model

• Preservation Planning. Defines preservation strategy for the OAIS; monitors and controls for external changes in…
  • technology, designated community, user expectations, economics, preservation methods

• Access. Enables users to discover, request & receive information.

• Administration. Day to day operations of the archive; coordination of all high-level services, plus interaction with Producers, Management and Consumers.

Presenter notes:
Preservation Planning: the OAIS’s safeguard against constantly evolving user and technology environments.
Access: may also do transformations of content. Must apply security & access controls.
Administration: coordinates updates to the system.
Page 29: Digital Preservation in Theory and Practice

OAIS: Functional Model

Image from Lavoie, 2004

Presenter notes:
Ingest, Archival Storage, Access, Administration, Data Management, Preservation Planning
Page 30: Digital Preservation in Theory and Practice

OAIS: Information Model

• Data Object = the content of interest
• Representation Information = the information required by the designated community to independently interpret the data object
• Information Object = Content Information, and also the object that is preserved.

Image from OAIS, 2002

Presenter notes:
E.g., Data object = datastream from Viking Lander on Mars.
Rep Info = description of the data’s values, context, equipment used to generate and interpret it.
Information object = complete package of both.
Page 31: Digital Preservation in Theory and Practice

OAIS: Information Model

Image from OAIS, 2002

• Archival Information Packages include both the content information and Preservation Description Information (PDI)

• Both of these are packaged together in a common actual or logical package.

Presenter notes:
Preservation Description Information gives info on the present and past states of the Content Information
Page 32: Digital Preservation in Theory and Practice

OAIS: Information Model

Preservation Description Information (PDI)

• Provenance. Source of the content information; chain of custody, processing history.
• Context. Relation of the content information to other information outside the information package.
• Reference. Unique identifiers for the content (such as ISBN for books, or UUIDs for files)
• Fixity. Protects content from undocumented alteration. E.g., a checksum.

Presenter notes:
Provenance is a major component in establishing authenticity. In the version 1.1 draft of the model, Access Rights is elevated to a 5th type of PDI. Note that PREMIS (Preservation Metadata) has extensively elaborated and extended the notion and categories of PDI.
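
To make the four PDI categories concrete, here is a minimal sketch of how an Archival Information Package might be modeled as a data structure. OAIS is an abstract model and prescribes no such classes; every class, field, and function name below is a hypothetical illustration, with a UUID standing in for the reference and a SHA-256 digest for fixity.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass
class PDI:
    """Preservation Description Information (illustrative fields only)."""
    provenance: list[str]  # source, chain of custody, processing history
    context: str           # relation to information outside the package
    reference: str         # unique identifier for the content
    fixity: str            # hex SHA-256 digest of the content


@dataclass
class ArchivalInformationPackage:
    content: bytes             # the data object
    representation_info: str   # what a reader needs to interpret the content
    pdi: PDI

    def fixity_ok(self) -> bool:
        """True if the content still matches the digest recorded in the PDI."""
        return hashlib.sha256(self.content).hexdigest() == self.pdi.fixity


def make_aip(content: bytes, representation_info: str,
             source: str, context: str) -> ArchivalInformationPackage:
    """Package content together with freshly generated PDI."""
    pdi = PDI(
        provenance=[f"received from {source}"],
        context=context,
        reference=str(uuid.uuid4()),
        fixity=hashlib.sha256(content).hexdigest(),
    )
    return ArchivalInformationPackage(content, representation_info, pdi)


aip = make_aip(b"1,2,3\n", "CSV, UTF-8, one record per line",
               source="example producer", context="part of an example dataset")
print(aip.pdi.reference, aip.fixity_ok())
```

Keeping the PDI in the same object as the content mirrors the idea that both are packaged together, physically or logically.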
Page 33: Digital Preservation in Theory and Practice

OAIS: Information Model

Image from Lavoie, 2004

Components of the OAIS Archival Information Package

Presenter notes:
Descriptive Information describes the entire package, and is stored in the Data Management function to enable cataloging & discovery of the archival information package.
Page 34: Digital Preservation in Theory and Practice

OAIS: Significance

• Common framework for the Digital Preservation Community
  • Shared concepts & vocabulary
• Foundation for further standards work
  • Auditing, Metadata
• An abstract model
  • Not an architecture, not a system
  • Not a cookbook for digital preservation success

Presenter notes:
What is the relation of OAIS to Digital Preservation? The difference between a model and an architecture.
Page 36: Digital Preservation in Theory and Practice

Review: OAIS

• Objective: preservation and access over the long term for a designated community

• Locates an archive in an environment: Producers, Management, Consumers

• Six high level functions: Ingest, Archival Storage, Data Management, Preservation Planning, Access, Administration

• Information Model to support preserving information (through PDI) and making it Independently Understandable

Presenter notes:
Recap high points of OAIS
Page 37: Digital Preservation in Theory and Practice

Other Key Concepts

• Authenticity
• Trust
• Sustainability

Page 38: Digital Preservation in Theory and Practice

Authenticity

• Authentic = Genuine = Bona Fide

• The trustworthiness of a record as a record: i.e., the quality of a record that is what it purports to be and that is free from tampering or corruption (InterPARES)

• Property that a digital object is what it purports to be. (PREMIS)

Presenter notes:
InterPARES – 2001: The quality of being authentic, or entitled to acceptance. “Worthy of acceptance or belief as conforming to or based on fact”, synonymous with genuine, bona fide. Genuine implies actual character not counterfeited, imitated or adulterated, and connotes definite origin from a source. Bona fide implies good faith and sincerity of intention.
Page 39: Digital Preservation in Theory and Practice

Authenticity

• Human record is becoming a digital one
• Important not only for cases of law, but also the scientific, intellectual, and cultural record
• Print…
  • Protecting physical access to object protects access to information it contains
  • Changes to object are relatively easy to detect
  • Changes to one object don’t propagate to others
• Digital…
  • Information must be copied to be read; perfect copies, instantly
  • Changes to the copy can be made undetectably (on purpose or by accident)
  • Changes propagate as easily as the original

Page 40: Digital Preservation in Theory and Practice

Authenticity

• Authenticity (traditional & digital) derives from…
  • Source
  • Chain of custody
  • Processing history
  • Fixity
  • Trust

Maintaining and disseminating authentic information is a primary mission for digital preservation systems.

Presenter notes:
Source, chain of custody and processing history are all components of OAIS “provenance” metadata.
Processing history: preservation events such as media migrations and transformations.
Fixity metadata helps detect changes.
See PREMIS for more details on provenance.
Trust derives from a repository’s demonstrable history of following consistent processes in a controlled manner.
Page 41: Digital Preservation in Theory and Practice

Trust & Trustworthiness

• As early as 1996, Trust in repositories was cited as a prerequisite for digital preservation
  • Task Force on Archiving of Digital Information
• Perception of competence, security, long-term commitment is necessary from…
  • Producers
  • Funders
  • Consumers

Presenter notes:
1996: Task Force on Archiving of Digital Information.
Trust: libraries & archives are often regarded as trusted authorities of the cultural & historic record. This would need to apply in the digital realm as well.
Ironically, perception of competence may be more important than competence itself…
Producers: deposit content, sufficient IP rights.
Funders: to sustain the repository’s operations.
Consumers: in the use of the repository’s services, trust in the usability and authenticity of its information.
Page 42: Digital Preservation in Theory and Practice

Trust & Trustworthiness

• Trust is granted by a third party to a repository
• Trustworthiness is demonstrated by adherence to four principles (nestor, DCC)
  1. Documentation
  2. Transparency
  3. Adequacy
  4. Measurability
• Audits help establish Trustworthiness

Presenter notes:
nestor = German project, Network of Expertise in Long-Term Storage of Digital Resources
Documentation = evidence of objectives, design, specification, implementation and operation of the digital repository
Transparency (internal & external) – trust derives from visibility into strategy & processes
Adequacy = fit of approach to the need
Measurability = objective criteria against which to measure effectiveness
The question is what measurable standards to use for ascertaining long-term effectiveness in an immature field.
Page 43: Digital Preservation in Theory and Practice

Sustainability

• Long-term preservation, by definition, requires management of information over generations of change in…
  – Technology
  – Users & expectations
  – Staffing
  – Economic conditions

• Preservation is a journey, not a destination

Page 44: Digital Preservation in Theory and Practice

Sustainability

• Technological Strategies
  • Simplicity, component-based, plan for migration, plan for replacement

• Organizational Strategies
  • Succession planning is a formal process of enabling handoffs across archives

• Economic Strategies
  • Contain costs, be selective, emphasize value of access

Page 45: Digital Preservation in Theory and Practice

Sustainability

• Make the case for use
  • Preservation = Long-Term Access
  • “In all cases, access to information tomorrow requires preservation actions taken today” (BRTF)

• Not all-or-nothing
  • Bit-level preservation + basic access may be more pragmatic than open-ended commitment to format migration / emulation.

• Not once-and-for-all
  • Rather than a 100 year commitment, think of a 10 year commitment, with an option to renew

Presenter notes:
A fundamental fact of digital sustainability is that without preservation, there is no access. NDIIPP lesson: preservation wasn’t fundable; long-term access was. Rescission and restoration of funding to NDIIPP.
Page 46: Digital Preservation in Theory and Practice

Review: What Is Digital Preservation?

Digital preservation is the series of strategies and actions taken to promote the availability and usability of authentic digital information over time.

Page 47: Digital Preservation in Theory and Practice

Review: Digital Preservation Approaches

• Replication
• Migration
  • Format
  • Media
  • Technology
• Emulation
  • Hardware
  • Software
• Encapsulation
• Redundancy and heterogeneity
  • Technology
  • Location
  • Organization
• Succession Planning

Digital Preservation (aka Long Term Access) is realized through a series of relays over time.

Page 48: Digital Preservation in Theory and Practice

Digital Preservation in Action: Plan, Do, Check

Page 49: Digital Preservation in Theory and Practice

Toolkit for Preservation Planning and Actions

http://www.digitalpreservation.gov/tools/

Tool Type | Purpose
Identification | What kind of file is this bit stream?
Characterization | What are its important features?
Bit Audit / Fixity | Have any of its bits changed?
Manipulation | I’d like to modify/update/transform this digital object
Wrapping | I’d like to package this object for storage or transfer
Transfer | I’d like to move this digital object
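
As an illustration of the "Identification" row, the sketch below guesses a format from the leading "magic" bytes of a bit stream. It is a deliberately tiny stand-in, not how any particular tool from the list works: the signature table covers only a handful of well-known formats, and the names `SIGNATURES` and `identify` are hypothetical. Production identification tools rely on far larger signature registries.

```python
# Leading byte signatures for a few widely documented formats.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
    (b"PK\x03\x04", "ZIP container"),
]


def identify(head: bytes) -> str:
    """Return a format name based on the first bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return "unknown"


print(identify(b"%PDF-1.4\n"))      # PDF
print(identify(b"GIF89a\x01\x00"))  # GIF
print(identify(b"plain text"))      # unknown
```

Identification by content rather than by file extension matters for preservation because extensions are easily lost or wrong, while the bytes travel with the object.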

Page 50: Digital Preservation in Theory and Practice

Technology Implications

• Minimize dependencies
  – Encapsulate your metadata with your objects

• Minimize correlated errors
  – Embrace redundancy
  – Embrace diversity

• Monolithic systems tend to serve poorly
  – Complex, expensive, inflexible
  – Migration costs can capsize you

• Keep it simple; have an exit plan for every component

Page 51: Digital Preservation in Theory and Practice

SDR Preservation Core

The Stanford Digital Repository (SDR) provides services to make scholarly resources available over the long term by helping ensure their integrity, authenticity, and reusability. To fulfill its mission, the SDR must be secure, sustainable and trustworthy.