34
An Introduction to Preservation Metadata and the PREMIS Data Dictionary Rebecca Guenther, Library of Congress ALA Midwinter 2009 Intellectual Access to Preservation Metadata Interest Group Sat. January 24, 2009

An Introduction to Preservation Metadata and the PREMIS Data Dictionary Rebecca Guenther, Library of Congress ALA Midwinter 2009 Intellectual Access to

Embed Size (px)

Citation preview

An Introduction to Preservation Metadata and the PREMIS Data Dictionary

Rebecca Guenther, Library of Congress

ALA Midwinter 2009Intellectual Access to Preservation Metadata Interest GroupSat. January 24, 2009

Overview

What is preservation metadata? Development of PREMIS PREMIS data dictionary

• Features of the data dictionary• Implementing PREMIS

PREMIS Maintenance Agency Future directions

Digital preservation: advances & remaining challenges

Groups around the world and conferences continue to make significant progress in raising awareness about digital preservation imperative

Gradual shift in focus from articulating problem to solving it …• Not so much “Why is digital preservation important” anymore; rather,

“What must be done to achieve preservation objectives?”

Many practical challenges in implementing reliable, sustainable digital preservation programs

One key implementation challenge: preservation metadata

Preservation metadata includes:

Provenance:• Who has had custody/ownership of the

digital object?

Authenticity:• Is the digital object what it purports to be?

Preservation Activity:• What has been done to preserve it?

Technical Environment:• What is needed to render and use it?

Rights Management:• What IPR must be observed?

Makes digital objects self-documenting across time

Content

PreservationMetadata

10 years on

50 years on

Forever!

What does PREMIS cover?

Administrative metadata that supports the digital preservation process

Provides information to help manage a resource for preservation purposes• Technical characteristics• Information about actions on an object• Relationships (structural and derivative)

• Structural: indicates how compound objects are put together

• Derivative: results of common preservation actions• Rights metadata associated with preservation

Some background …

Pre-2002: various preservation metadata element sets released• Different scopes, purposes, underlying models/assumptions• No international standard; little consolidation of expertise/best practice

June 2002: Preservation Metadata Framework• International working group (jointly sponsored by OCLC, RLG)• Comprehensive, high-level description of types of information constituting preservation

metadata• Used OAIS reference model as starting point• Set of “prototype” preservation metadata elements• Consensus-based foundation for developing formal preservation metadata specifications

… but not an “off-the-shelf, ready to implement” solution

Post-2002: Need implementable preservation metadata, with guidelines for application and use, relevant to a wide range of digital preservation systems and contexts

• Motivated formation of PREMIS Working Group

PREMIS Working Group

June 2003: OCLC, RLG sponsored new international working group:• PREMIS: Preservation Metadata: Implementation Strategies

Membership: • > 30 experts from 5 countries, representing libraries, museums, archives,

government agencies, and the private sector• Co-Chairs: Priscilla Caplan (FCLA), Rebecca Guenther (LC)

Objective 1: Identify and evaluate alternative strategies for encoding, storing, managing, and exchanging preservation metadata

• PREMIS Survey Report (September 2004)• Snapshot of current practices/emerging trends related to managing and using

preservation metadata in digital archiving systems• http://www.oclc.org/research/projects/pmwg/surveyreport.pdf

Objective 2: Define implementable, core preservation metadata, with guidelines/recommendations for management and use

PREMIS Data Dictionary

May 2005: Data Dictionary for PreservationMetadata: Final Report of the PREMIS Working Group

March 2008: PREMIS Data Dictionary for Preservation Metadata, version 2.0

Includes PREMIS Data Dictionary, context/assumptions, data model, usage examples

XML schema to support implementation

Data Dictionary: • Comprehensive view of information needed to support digital preservation

• Guidelines/recommendations to support creation, use, management• Based on deep pool of institutional experiences in setting up and managing

operational capacity for digital preservation

http://www.loc.gov/standards/premis/v2/premis-2-0.pdf

2005 British Conservation Awards: Digital Preservation Award

2006 Society of American Archivists Preservation Publication Award

Some guiding principles …

“Implementable, core, preservation metadata”:• “Preservation metadata”: maintain viability, renderability,

understandability, authenticity, identity in a preservation context• “Core”: What most preservation repositories need to know to

preserve digital materials over the long-term• “Implementable”: rigorously defined; supported by usage

guidelines/recommendations; emphasis on automated workflows

“Technical neutrality”:• Digital archiving system: no assumptions about specific archiving

technology, system/DB architectures, preservation strategy• Metadata management: no assumptions about whether metadata is

stored locally or in external registry; recorded explicitly or known implicitly; instantiated in one metadata element or multiple elements

• Promotes flexibility, applicability in wide range of contexts

Scope of data dictionary Implementation independent Descriptive metadata out of scope Technical metadata applying to all or most format

types Media or hardware details are limited Business rules are essential for working

repositories, but not covered Rights information for preservation actions, not

access

What PREMIS is and is not

What PREMIS is:• Common data model for organizing/thinking about preservation

metadata• A checklist for core metadata in a repository• Guidance for local implementations• Standard for exchanging information packages between repositories

What PREMIS is not:• Out-of-the-box solution: need to instantiate as metadata elements in

repository system• All needed metadata: excludes business rules, format-specific

technical metadata, descriptive metadata for access, non-core preservation metadata

• Lifecycle management of objects outside repository• Rights management: limited to permissions regarding actions taken

within repository

PREMIS Data Model

IntellectualEntities

Objects

RightsStatements

Agents

Events

Intellectual Entities

Examples: Rabbit Run by John Updike (a

book) “Maggie at the beach”

(a photograph) The Library of Congress

Website (a website) The Library of Congress:

American Memory Home page (a web page)

Set of content that is considered a single intellectual unit for purposes of management and description (e.g., a book, a photograph, a map, a database)

May include other Intellectual Entities (e.g. a website that includes a web page)

**Has one or more digital representations**

Not fully described in PREMIS DD, but can be linked to in metadata describing digital representation

Objects

Examples: chapter1.pdf (a file) chapter1.pdf + chapter2.pdf +

chapter3.pdf (representation of a book w/3 chapters)

TIFF file containing header and 2 images (2 bitstreams (images), each with own set of properties (semantic units): e.g., identifiers, technical metadata, inhibitors, … )

Discrete unit of information in digital form

**Objects are what repository actually preserves**

Three types of Object:• FILE: named and ordered

sequence of bytes that is known by an operating system

• REPRESENTATION: set of files, including structural metadata, that, taken together, constitute a complete rendering of an Intellectual Entity

• BITSTREAM: data within a file with properties relevant for preservation purposes (but needs additional structure or reformatting to be stand-alone file)

Object Example 1: photo in two formats

Intellectual Entity:“Picture of my dog”

Representation1: TIFF version

Representation 2:JPEG2000 version

File 1: dog.TIFF File 2: dog.JP2

Bitstream 1:Embedded metadata

Object Example 2: book in two versions

Intellectual EntityDa Vinci Code by

Dan Brown

Representation 1Page image

version

Representation 2ebook version

File 1: page1.tiff

File 2:page2.tiff

File N:pageN.tiff

File 1:book.lit

File N+1:METS.xml

An important aside about Objects

Repository does NOT have to control Objects at all levels• E.g., repository may only manage files, not representations or

bit streams.

The PREMIS DD tells you:• IF you control at the representation level, these are the

semantic units (properties) that pertain to representations;• IF you control at the file level, these are the semantic units

(properties) that pertain to files;• IF you control at the bitstream level, these are the semantic

units (properties) that pertain to bit streams;• AND IF you control at multiple levels, you need to record

relationships between them

Events

Examples: Validation Event: use JHOVE

tool to verify that chapter1.pdf is a valid PDF file

Ingest Event: transform an OAIS SIP into an AIP

Migration Event: create a new version of an Object in an up-to-date format

An action that involves or impacts at least one Object or Agent associated with or known by the preservation repository

Helps document digital provenance. Can track history of Object through the chain of Events that occur during the Objects lifecycle

Determining which Events are in scope is up to the repository (e.g., Events which occur before ingest, or after de-accession)

Determining which Events should be recorded, and at what level of granularity is up to the repository

Agents

Examples: Priscilla Caplan (a person) Florida Center for Library

Automation (an organization) Dark Archive in the Sunshine

State implementation (a system)

JHOVE version 1.0 (a software program)

Person, organization, or software program/system associated with an Event or a Right (permission statement)

Agents are associated only indirectly to Objects through Events or Rights

Not defined in detail in PREMIS DD; not considered core preservation metadata beyond identification

Rights Statements

Example: Priscilla Caplan grants FCLA

digital repository permission to make three copies of metadata_fundamentals.pdf for preservation purposes.

An agreement with a rights holder that grants permission for the repository to undertake an action(s) associated with an Object(s) in the repository.

Not a full rights expression language; focuses exclusively on permissions that take the form:• Agent X grants Permission

Y to the repository in regard to Object Z.

Semantic units pertaining to objects: technical metadata

objectIdentifier preservationLevel significantProperties objectCategory objectCharacteristics creatingApplication originalName storage environment

signatureInformation relationship linkingEventIdentifier linkingIntellectual

Entity Identifier linkingPermission

StatementIdentifier

Semantic units pertaining to Events: provenance and preservation activity eventIdentifier eventType eventDateTime eventDetail eventOutcome eventOutcomeDetail linkingAgentIdentifier linkingObjectIdentifier

Semantic units pertaining to Rights

rightsStatement rightsStatement

Identifier rightsBasis copyrightInformation licenseInformation statuteInformation

rightsGranted act restriction termOfGrant rightsGranted

linkingObjectIdentifier linkingAgentIdentifier rightsExtension

Semantic units pertaining to Agents

agentIdentifier agentName agentType

Data dictionary descriptions

Semantic unit

Name that is descriptive and unique. Use externally aids interoperability. Need not be used internally in repository.

Semantic components If a container, lists its sub-elements. Each component has own entry.

Definition Meaning of semantic unit

Rationale Why the unit is needed (if not obvious)

Data constraint How it should be encoded; ”Container”: an umbrella for two or more; no values given”None”: can take any form”Value should be taken from a controlled vocabulary”

Object category Representation File Bit stream

Applicability Whether it applies to the category of object

Examples Illustrative examples of values

Repeatability Whether it can take multiple values

Obligation Whether values must be given.”Mandatory”: something the repository must know independent of how or whether the repository records it. Means mandatory if applicable. If not explicitly recorded, it must be provided in exchange.

Creation/maintenance notes

Information about how values may be obtained or updated.

Usage notes Information about intended use.

For eachlevel ofObject

Sample Data Dictionary entry

Semantic unit size Semantic components

None

Definition The size in bytes of the file or bitstream stored in the repository.

Rationale Size is useful for ensuring the correct number of bytes from storage have been retrieved and that an application has enough room to move or process files. It might also be used when billing for storage.

Data constraint Integer Object category Representation File Bitstream Applicability Not applicable Applicable Applicable Examples 2038927 Repeatability Not repeatable Not repeatable Obligation Optional Optional Creation/ Maintenance notes

Automatically obtained by the repository.

Usage notes Defining this semantic unit as size in bytes makes it unnecessary to record a unit of measurement. However, for the purpose of data exchange the unit of measurement should be stated or understood by both partners.

Community interest

PREMIS Data Dictionary product of collaboration and consensus• PREMIS membership reflects variety of institutions,

domains, countries• Multiplicity of perspectives promotes applicability in

multiplicity of contexts• Digital preservation is a shared problem; this invites

shared solutions

Data Dictionary useful to any institution or organization committed to the long-term preservation of digital materials

PREMIS Maintenance Activity

Web site:• Permanent Web presence, hosted by

Library of Congress• Central destination for PREMIS-related

info, announcements, resources• Home of the PREMIS Implementers’ Group (PIG) discussion list

PREMIS Editorial Committee:• Set directions/priorities for PREMIS development• Coordinate future revisions of Data Dictionary and XML schema• Promote implementation• Membership: Library of Congress, OCLC, FCLA, British Library,

Library and Archives Canada, BStU (Germany), MIT/DSpace

http://www.loc.gov/standards/premis/

Activities First revision of Data Dictionary (PREMIS 2.0)

• Revised by EC and released in late March• Changes based on extensive discussions with implementers and

on Editorial Committee Guidelines for using PREMIS with METS (draft available at:)

• http://www.loc.gov/premis/guidelines-premismets.html PREMIS Implementers’ Registry

• http://www.loc.gov/premis/premis-registry.html Developing controlled vocabularies in LC registry PREMIS Tutorials:

• Previous: Glasgow, Boston, Stockholm, Albuquerque, Washington, San Diego

• Upcoming: Rome in Feb. Current consultancies:

• Understanding PREMIS• Tool for converting PREMIS to METS/PREMIS and vice versa

Some implementers …

DAITTSS (Florida): a preservation repository for the use of the libraries of the public universities of Florida. Uses a locally-developed software application (DAITSS), which implements most of the PREMIS data elements.

Ex Libris (DigiTool): an enterprise solution for the management of digital assets in libraries and academic environments consisting of a number of modules, each designed to address different needs, functions, and workflows pertaining to the life cycle of a digital object

• British Library electronic journal archiving project: uses METS, MODS, PREMIS for information packages

For more information see:• http://www.loc.gov/premis/premis-registry.html

Why is PREMIS important to catalogers?

As we take responsibility for more digital materials, we need to ensure that they can be used in the future

Most preservation metadata will be generated from the object, but catalogers may need to verify its accuracy

Catalogers may need to play a role in assessing and organizing digital materials• Understanding the structure of complex digital objects• Determining significant properties that need to be preserved

Conclusions

PREMIS Data Dictionary provides critical piece of reliable digital preservation infrastructure comprised of technology, standards, and best practice

PREMIS was produced from an international, cross-domain, consensus-building process and is applicable to any preservation effort

PREMIS Data Dictionary is a building block with which effective, sustainable digital preservation strategies can be implemented

PREMIS Data Dictionary tightly focused on implementation:• Practical implementation was guiding principle in all discussions• Developed tools (XML schema) to support implementation; released

with Data Dictionary• Further work with encouragement for international participation and

tools development is ongoing

Unglamorous but necessary infrastructure!

URLs, etc.

PREMIS Maintenance Activity:

http://www.loc.gov/standards/premis/

PREMIS Data Dictionary for Preservation Metadata:http://www.loc.gov/standards/premis/v2/premis-2-0.pdf

PREMIS Implementation Registryhttp://www.loc.gov/standards/premis/premis-registry.php

PREMIS Implementers Group listhttp://listserv.loc.gov/listarch/pig.html