National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center

SDSC Digital Preservation Project with NARA

Reagan W. Moore

San Diego Supercomputer Center

[email protected]

http://www.npaci.edu/DICE/

Data and Knowledge Systems Group

Staff

• Reagan Moore

• Ilkai Altintas

• Chaitan Baru

• Sheau Yen Chen

• Charles Cowart

• Amarnath Gupta

• George Kremenek

• M. Kulrul

• Bertram Ludäscher

• Richard Marciano

• A. Memon

• XuFei Qian

• Roman Olshanowsky

• Arcot Rajasekar

• Abe Singer

• Michael Wan

• Ilya Zaslavsky

• Bing Zhu

Graduate Students

• A. Bagchi

• S. Bansal

• A. Behere

• R. Bharath

• S. Bharath

• L. Sui

Undergraduate Interns

• N. Cotofana

• D. Le

• J. Trang

• L. Yin

• +/- NN

Topics

• Digital preservation approach

• Levels of abstraction

• Application to NARA collections

Persistent Archive Approach

• Preservation of authentic documents

• Create archivable form for digital entity

• Define context by assembling a collection

• Create archivable form for collection

• Manage persistent archive

• Support self-instantiating archive

• Support discovery and presentation

ERA Concept Model

[Figure: "ERA: Archival Components Concept". Diagram labels: Mediation of Information using XML; Storage Resource Broker / Extensible Meta-data CATalog; Metadata; Archival Repository; Order Fulfillment System; Reference Workbench; Query; Rebuild; Present; Tapes; Accessioning Workbench; Accession; Verify; Wrap & Containerize; Describe; Collection; Disks; Internet; Archival Research Catalog; Records Schedules; Grid Security Infrastructure.]

Fundamental Challenge

Technology Evolution

• Data is a sequence of bits

• Presentation applications are needed to display a digital entity, based upon a data model

• Applications issue I/O calls to operating systems

• Operating systems send commands to storage and display systems

Presentation of Digital Objects

[Figure labels: Storage System; Operating System; Application; Digital Object; Display System.]

Technology Management - Emulation

[Figure labels: New Storage System; New Operating System; Old Application; Digital Object; New Display System; Wrap Application.]

Technology Management - SDSC

[Figure labels: Old Storage System; New Operating System; New Application; Digital Object; Old Display System; Wrap Storage System; Wrap Display System; Migrate Encoding Format.]

Specifying Levels of Abstraction

• Technology management becomes simpler if the persistent archive infrastructure operates on abstractions, rather than an explicit physical implementation of a resource

• Need abstractions for

 – Digital objects

 – Repositories

Managing Distributed Storage

• Separate the organization of digital objects from their physical storage

 – Logical Name Space to manage attributes about the digital objects

 – Data handling system to manage interactions with remote storage systems

• Create storage abstraction layer

• Storage Resource Broker (SRB) provides data management system
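The separation of logical organization from physical storage can be illustrated with a minimal Python sketch. The class names (InMemoryStorage, LogicalEntry, LogicalNameSpace), the resource names, and the paths are all hypothetical and are not part of the SRB API; the point is only that a client addresses a logical name, so an object can move to a new storage system without the name changing.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


class InMemoryStorage:
    """Hypothetical stand-in for one remote storage system."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._blobs: Dict[str, bytes] = {}

    def put(self, physical_path: str, data: bytes) -> None:
        self._blobs[physical_path] = data

    def get(self, physical_path: str) -> bytes:
        return self._blobs[physical_path]


@dataclass
class LogicalEntry:
    """Attributes and replica locations kept for one digital object."""
    attributes: Dict[str, str] = field(default_factory=dict)
    replicas: List[Tuple[str, str]] = field(default_factory=list)


class LogicalNameSpace:
    """Maps logical names to attributes and (resource, physical path) pairs."""

    def __init__(self) -> None:
        self._resources: Dict[str, InMemoryStorage] = {}
        self._entries: Dict[str, LogicalEntry] = {}

    def add_resource(self, resource: InMemoryStorage) -> None:
        self._resources[resource.name] = resource

    def register(self, logical_name: str, resource_name: str,
                 physical_path: str, data: bytes, **attributes: str) -> None:
        self._resources[resource_name].put(physical_path, data)
        entry = self._entries.setdefault(logical_name, LogicalEntry())
        entry.attributes.update(attributes)
        entry.replicas.append((resource_name, physical_path))

    def replicas(self, logical_name: str) -> List[Tuple[str, str]]:
        return list(self._entries[logical_name].replicas)

    def read(self, logical_name: str) -> bytes:
        # The caller never sees which storage system holds the bits.
        resource_name, physical_path = self._entries[logical_name].replicas[0]
        return self._resources[resource_name].get(physical_path)

    def migrate(self, logical_name: str, new_resource: str, new_path: str) -> None:
        # Copy the bits to the new system and repoint the logical name.
        data = self.read(logical_name)
        self._resources[new_resource].put(new_path, data)
        self._entries[logical_name].replicas = [(new_resource, new_path)]


ns = LogicalNameSpace()
ns.add_resource(InMemoryStorage("old-disk"))
ns.add_resource(InMemoryStorage("new-archive"))
ns.register("/nara/email/msg-0001", "old-disk", "/vol1/0001.xml",
            b"<rfc1036_mesg/>", owner="archivist")
before = ns.read("/nara/email/msg-0001")
ns.migrate("/nara/email/msg-0001", "new-archive", "/tape7/0001.xml")
after = ns.read("/nara/email/msg-0001")
```

After the migration the same logical name returns the same bytes, although the replica now lives on a different (hypothetical) resource.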

Information Management - Logical Name Space

• Set of attributes to describe digital entities that are registered into the logical name space

• SRB metadata - Unix file system semantics

• Provenance metadata - Dublin Core

• Resource metadata - User access control lists

• Discipline metadata - User defined attributes

• Each digital entity may have unique attributes

SDSC Storage Resource Broker & Meta-data Catalog

Levels of Abstraction

[Figure: SRB/MCAT architecture diagram. Labels: Application; Clients; Unix Shell; Java, NT Browsers; Web; WSDL; Prolog Predicate; C, C++ Libraries; Linux I/O; DLL / Python; Servers; Prime Server; Logical Name Space; Latency Management; Data Transport; Metadata Transport; Consistency Management / Authorization-Authentication; Catalog Abstraction; Databases (DB2, Oracle, Sybase); Storage Abstraction; Archives (HPSS, ADSM, UniTree, DMF); Databases (DB2, Oracle, Postgres); File Systems (Unix, NT, Mac OSX); HRM.]

Types of Digital Entity Abstractions

• Logical representation

 – What does the digital entity represent?

 – What is the associated meaning?

• Physical representation

 – What is the physical structure of the digital entity?

Levels of Abstraction for Bits

• Abstraction for Digital Entity

 – Logical: I-nodes

 – Physical: Track / Sector

• Digital Entity: Bit Stream

• Abstraction for Repository

 – Logical: File Name

 – Physical: File System (NFS/AFS/NTFS)

• Repository: Disk

Levels of Abstraction for Data

• Abstraction for Digital Entity

 – Logical: Data Model (units, semantics)

 – Physical: Encoding Format (syntax, structure)

• Digital Entity: Files

• Abstraction for Repository

 – Logical: Name Space

 – Physical: Data Handling System - SRB/MCAT

• Repository: File System, Archive

Information Management

• Abstraction layer for interacting with information repositories

 – Manage the schema and physical table structures of a database

 – Extensible schema

 – User defined attributes

• Extensible Metadata CATalog (EMCAT) manages collections

• mySRB.html interface supports dynamic collection creation
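One way to picture an extensible schema with user-defined attributes is an entity-attribute-value layout, sketched below with Python's standard sqlite3 module. This is an illustrative assumption, not a description of how EMCAT stores its catalog; the table names, the helper functions set_attribute and find, and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (id INTEGER PRIMARY KEY, logical_name TEXT UNIQUE NOT NULL);
CREATE TABLE attribute (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE entity_attribute (
    entity_id INTEGER NOT NULL REFERENCES entity(id),
    attribute_id INTEGER NOT NULL REFERENCES attribute(id),
    value TEXT NOT NULL
);
""")


def set_attribute(conn, logical_name, name, value):
    """Attach an attribute to an entity, creating either one on first use."""
    conn.execute("INSERT OR IGNORE INTO entity (logical_name) VALUES (?)",
                 (logical_name,))
    conn.execute("INSERT OR IGNORE INTO attribute (name) VALUES (?)", (name,))
    conn.execute(
        "INSERT INTO entity_attribute (entity_id, attribute_id, value) "
        "SELECT e.id, a.id, ? FROM entity e, attribute a "
        "WHERE e.logical_name = ? AND a.name = ?",
        (value, logical_name, name),
    )


def find(conn, name, value):
    """Return logical names of entities whose attribute `name` equals `value`."""
    rows = conn.execute(
        "SELECT e.logical_name FROM entity e "
        "JOIN entity_attribute ea ON ea.entity_id = e.id "
        "JOIN attribute a ON a.id = ea.attribute_id "
        "WHERE a.name = ? AND ea.value = ? ORDER BY e.logical_name",
        (name, value),
    )
    return [row[0] for row in rows]


set_attribute(conn, "msg-0001", "Subject", "test post")
# A user-defined attribute is added without altering any table definition.
set_attribute(conn, "msg-0001", "X-Mailer", "hypothetical-client")
set_attribute(conn, "msg-0002", "Subject", "test post")
matches = find(conn, "Subject", "test post")
```

New attribute names become rows rather than columns, which is what keeps the schema extensible in this sketch; the trade-off is that every query needs joins.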

Levels of Abstraction for Information

• Abstraction for Digital Entity

 – Logical: Collection Schema

 – Physical: XML Syntax

• Digital Entity: Metadata Attributes

• Abstraction for Repository

 – Logical: Database Schema

 – Physical: EMCAT/CWM

• Repository: Database

Knowledge Management - Characterizing Properties of Collections

• Characterization of relationships between attributes

 – Semantic / logical - cross-walks

 – Procedural / temporal - records management

 – Structural / spatial - GIS

• Characterization of knowledge repository operations

• Mapping from collection attributes to discipline concepts

• Mapping from knowledge relationships to rules for application in inference engines

Levels of Abstraction for Knowledge

• Abstraction for Digital Entity

 – Logical: Relationship Schema

 – Physical: ER/UML/XMI/RDF syntax

• Digital Entity: Concept Space (ontology instance)

• Abstraction for Repository

 – Logical: Knowledge Repository Schema

 – Physical: Model-based Mediation System

• Repository: Knowledge Repository

Persistent Archives

• Storage system abstraction

 – Logical name space and data manipulations

• Information repository abstraction

 – Logical schema and physical table structure

• Knowledge repository abstraction

 – Topic maps and inference rules

• Digital object abstraction

 – Data model and encoding format

NARA Prototype

• Demonstrate ability to ingest, archive, recreate, query, and present a digital object from a 1 million record E-mail collection (RFC1036)

 – 2.5 GB of data

 – 6 required fields

 – 13 optional fields

 – User defined fields (over 1000)

• Determine resources required to scale size of collection

XML DTD for E-mail

<!ELEMENT rfc1036_mesg (headers, body)>
<!ELEMENT headers (required_headers, optional_headers, other_headers)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT required_headers (From, Date, Newsgroups, Subject, Message-ID, Path)>
<!ELEMENT optional_headers (Followup-To?, Expires?, Reply-To?, Sender?, References?,
    Control?, Distribution?, Keywords?, Summary?, Approved?,
    Lines?, Xref?, Organization?)>
<!ELEMENT other_headers (other+)>

<!-- 6 required header keywords -->
<!ELEMENT From (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT Newsgroups (#PCDATA)>
<!ELEMENT Subject (#PCDATA)>
<!ELEMENT Message-ID (#PCDATA)>
<!ELEMENT Path (#PCDATA)>

<!ATTLIST From seqno CDATA #REQUIRED>
<!ATTLIST Date seqno CDATA #REQUIRED>
<!ATTLIST Newsgroups seqno CDATA #REQUIRED>
<!ATTLIST Subject seqno CDATA #REQUIRED>
<!ATTLIST Message-ID seqno CDATA #REQUIRED>
<!ATTLIST Path seqno CDATA #REQUIRED>
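A hedged sketch of how a single message might be cast into the element structure of this DTD, using Python's standard xml.etree.ElementTree. Several details are assumptions, because the slide shows only part of the DTD: seqno is taken to be the position of the header in the original message, optional headers are written without attributes, and the undeclared `other` element is given a hypothetical `name` attribute. Header names follow RFC 1036, and the function message_to_xml and the sample message are illustrative only.

```python
import xml.etree.ElementTree as ET

REQUIRED = ("From", "Date", "Newsgroups", "Subject", "Message-ID", "Path")
OPTIONAL = ("Followup-To", "Expires", "Reply-To", "Sender", "References",
            "Control", "Distribution", "Keywords", "Summary", "Approved",
            "Lines", "Xref", "Organization")


def message_to_xml(headers, body):
    """Build an rfc1036_mesg element from (name, value) pairs in original order."""
    # Assumption: seqno records where each header appeared in the message.
    by_name = {name: (seqno, value)
               for seqno, (name, value) in enumerate(headers, start=1)}
    missing = [name for name in REQUIRED if name not in by_name]
    if missing:
        raise ValueError(f"missing required headers: {missing}")

    root = ET.Element("rfc1036_mesg")
    hdrs = ET.SubElement(root, "headers")
    required = ET.SubElement(hdrs, "required_headers")
    optional = ET.SubElement(hdrs, "optional_headers")
    other = ET.SubElement(hdrs, "other_headers")

    for name in REQUIRED:  # the DTD fixes the element order
        seqno, value = by_name[name]
        ET.SubElement(required, name, seqno=str(seqno)).text = value
    for name in OPTIONAL:
        if name in by_name:
            ET.SubElement(optional, name).text = by_name[name][1]
    for name, (seqno, value) in by_name.items():
        if name not in REQUIRED and name not in OPTIONAL:
            # Hypothetical representation of a user-defined header.
            ET.SubElement(other, "other", name=name).text = value

    ET.SubElement(root, "body").text = body
    return root


sample_headers = [
    ("Path", "news.example.org!user"),
    ("From", "user@example.org"),
    ("Newsgroups", "misc.test"),
    ("Subject", "hello"),
    ("Message-ID", "<1@example.org>"),
    ("Date", "Fri, 19 Nov 82 16:14:55 GMT"),
    ("Organization", "Example Org"),
    ("X-Custom", "42"),
]
doc = message_to_xml(sample_headers, "Hello, world.\n")
xml_text = ET.tostring(doc, encoding="unicode")
```

The sketch does not validate against the DTD; ElementTree has no DTD validator, so a message without any user-defined header would still need a check for the `other+` requirement.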

Formatted Message Using XML DTD

Web-based Interface for Accessing the E-mail Collection

Automation of Ingestion Process

• Application of an Accessioning Template

 – Defines the concepts, policies or acceptance of the collection

• Creation of attributes that represent the accessioning template concepts

• Analysis of attributes for anomalies and implied inherent knowledge

Information Generation Processes

• Create occurrence index

 – (Occurrence, attribute, value)

 – This is needed to be able to recreate original form of digital object

• Analyze completeness of information

 – Inverse index of attribute values

 – Identifies unexpected values - consistency

• Analyze closure of collection

 – Are additional concepts needed to represent inverse index value ranges?
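The occurrence index and inverse index can be sketched in a few lines of Python. The function names, the "attribute: value" line format, and the sample messages are assumptions made for illustration; the slides do not specify how the prototype implemented these steps.

```python
from collections import defaultdict


def build_occurrence_index(tagged):
    """Turn (attribute, value) pairs in original order into (occurrence, attribute, value)."""
    return [(i, attr, value) for i, (attr, value) in enumerate(tagged, start=1)]


def recreate(index):
    """Rebuild the original text by replaying the triples in occurrence order."""
    return "\n".join(f"{attr}: {value}" for _, attr, value in sorted(index))


def inverse_index(collection):
    """Map attribute -> value -> set of object ids that carry that value."""
    inv = defaultdict(lambda: defaultdict(set))
    for object_id, index in collection.items():
        for _, attr, value in index:
            inv[attr][value].add(object_id)
    return inv


def unexpected_values(inv, attribute, allowed):
    """List values of `attribute` outside the expected set, with their objects."""
    return {value: sorted(ids)
            for value, ids in inv[attribute].items()
            if value not in allowed}


original = "From: a@example.org\nNewsgroups: misc.test\nSubject: hi"
tagged = [tuple(line.split(": ", 1)) for line in original.split("\n")]
index = build_occurrence_index(tagged)

# Storage order does not matter: the occurrence number restores the layout.
rebuilt = recreate(list(reversed(index)))

collection = {
    "m1": index,
    "m2": build_occurrence_index([("From", "b@example.org"),
                                  ("Newsgroups", "misc.tset")]),
}
inv = inverse_index(collection)
anomalies = unexpected_values(inv, "Newsgroups", {"misc.test"})
```

Here the inverse index surfaces the misspelled newsgroup in the second (made-up) message, which is the kind of consistency check the completeness analysis describes.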

Ingestion Processes for Collection

[Figure labels, in extraction order: Data Organization; Data; Storage. Caption: Aggregation of original objects into containers for storage.]

Ingestion Processes for Collection

[Figure labels, in extraction order: Attribute Tagging; Attribute Selection; Information Generation; Data Organization; Collection; Storage. Caption: Migration of objects into a standard representation.]

Ingestion Processes for Collection

[Figure labels, in extraction order: Accession Template; Attribute Tagging; Attribute Selection; Closure; Concept/Attribute; Attribute Inverse Indexing; Occurrence Tagging; Knowledge Generation; Information Generation; Data Organization; Collection; Storage; View Management.]

Persistent Collection

• Define context for archiving data - annotate information content

• Create archivable form - standard encoding format

• Archive information content along with data

• Test closure of the collection - all digital objects that can be discovered in the collection are members of the collection

• Test completeness of the collection - inherent relationships within the collection can be cast in terms of attributes generated from the annotated information.

 – Differentiate between inherent knowledge and anomalies / artifacts
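One possible reading of the closure test, sketched in Python: treat an object as "discoverable" when another member refers to it through a linking attribute, and report any reference that points outside the collection. The choice of the References header as the linking attribute, the function name closure_gaps, and the sample identifiers are assumptions for illustration.

```python
def closure_gaps(collection, link_attribute="References"):
    """Return {object_id: [referenced ids that are not members of the collection]}."""
    members = set(collection)
    gaps = {}
    for object_id, attributes in collection.items():
        missing = [ref for ref in attributes.get(link_attribute, [])
                   if ref not in members]
        if missing:
            gaps[object_id] = missing
    return gaps


collection = {
    "<1@example.org>": {"Subject": "hello"},
    "<2@example.org>": {"Subject": "Re: hello",
                        "References": ["<1@example.org>"]},
    "<3@example.org>": {"Subject": "Re: other",
                        "References": ["<9@example.org>"]},
}
gaps = closure_gaps(collection)
is_closed = not gaps
```

In this made-up collection the third message refers to a message that was never ingested, so the collection is not closed under the chosen attribute.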

Self-Instantiating Archive

• Archive the processes that are used to control the ingestion process

 – Conversion to archivable form

 – Annotation of information content

• When accessing the collection, retrieve the processes and the original digital objects

 – Apply the processing steps to re-create the information content

 – Query the result to discover desired digital objects

• A self-instantiating archive is a virtual data grid
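The idea can be sketched in Python as follows. The step names, the STEPS registry, and the JSON container are hypothetical choices for illustration. Note the simplification: only the names of the processing steps are stored here, while their code lives in the program; an actual persistent archive would also have to keep the process definitions themselves in an archivable form.

```python
import json


def split_headers(text):
    """Conversion step: turn 'Name: value' lines into an attribute dictionary."""
    return dict(line.split(": ", 1) for line in text.splitlines() if ": " in line)


def lowercase_keys(attributes):
    """Annotation step: normalize attribute names."""
    return {key.lower(): value for key, value in attributes.items()}


# Hypothetical registry that resolves archived step names to code.
STEPS = {"split_headers": split_headers, "lowercase_keys": lowercase_keys}


def ingest(objects, pipeline):
    """Archive the original objects together with the process description."""
    return json.dumps({"pipeline": pipeline, "objects": objects})


def instantiate(archive_blob):
    """Retrieve processes and objects, then re-create the information content."""
    archive = json.loads(archive_blob)
    collection = {}
    for object_id, raw in archive["objects"].items():
        result = raw
        for step_name in archive["pipeline"]:
            result = STEPS[step_name](result)
        collection[object_id] = result
    return collection


def query(collection, attribute, value):
    """Discover objects whose re-created attribute matches the value."""
    return sorted(object_id for object_id, attributes in collection.items()
                  if attributes.get(attribute) == value)


blob = ingest(
    {"m1": "Subject: hi\nFrom: a@example.org",
     "m2": "Subject: bye\nFrom: a@example.org"},
    ["split_headers", "lowercase_keys"],
)
collection = instantiate(blob)
hits = query(collection, "from", "a@example.org")
```

Nothing derived is stored in the archive blob; the queryable collection exists only after the archived steps have been replayed over the original objects.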

Differentiating between Data, Information, and Knowledge

• Data

 – Digital object

 – Objects are streams of bits

• Information

 – Any tagged data, which is treated as an attribute.

 – Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object

• Knowledge

 – Relationships between attributes

 – Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional

Types of Knowledge Relationships

• Logical / semantic

 – Digital Library cross-walks

• Temporal / procedural

 – Workflow systems

• Spatial / structural

 – GIS systems

• Functional / algorithmic

 – Scientific feature analysis
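As a rough illustration, relationships between attributes can be held as typed records, with a cross-walk being the set of logical/semantic relationships that map one attribute vocabulary onto another. The Relationship class, the "maps_to" and "precedes" predicates, and the particular mapping from RFC 1036 headers to Dublin Core elements are assumptions chosen for the example, not mappings given in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    kind: str       # "logical", "temporal", "spatial", or "functional"
    subject: str
    predicate: str
    obj: str


RELATIONSHIPS = [
    # Logical / semantic: an illustrative cross-walk to Dublin Core.
    Relationship("logical", "rfc1036:Subject", "maps_to", "dc:title"),
    Relationship("logical", "rfc1036:From", "maps_to", "dc:creator"),
    # Temporal / procedural: an illustrative workflow ordering.
    Relationship("temporal", "accession", "precedes", "description"),
]


def crosswalk(record, relationships):
    """Rename attributes of a record using the logical 'maps_to' relationships."""
    mapping = {r.subject: r.obj for r in relationships
               if r.kind == "logical" and r.predicate == "maps_to"}
    return {mapping[name]: value
            for name, value in record.items() if name in mapping}


record = {"rfc1036:Subject": "hi", "rfc1036:From": "a@example.org",
          "rfc1036:Path": "news.example.org!a"}
mapped = crosswalk(record, RELATIONSHIPS)
```

Attributes with no cross-walk entry (Path in this example) are simply dropped, which shows why a cross-walk is a statement about the relationship between two schemas rather than a property of the data.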

Knowledge Based Data Grids

[Figure: Knowledge Based Data Grids matrix. Labels: Ingest Services; Management; Access Services; Knowledge; Information; Data; (Model-based Access); (Data Handling System - SRB); Attributes; Semantics; MCAT/HDF; Grids; XML DTD; SDLIP; XTM DTD; Rules - KQL; Information Repository; Attribute-based Query; Feature-based Query; Knowledge or Topic-Based Query / Browse; Knowledge Repository for Rules; Relationships Between Concepts; Fields; Containers; Folders; Storage (Replicas, Persistent IDs).]

Further Information

http://www.npaci.edu/DICE