8/3/2019 Digital Preservation SDSC
http://slidepdf.com/reader/full/digital-preservation-sdsc 1/37
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center
SDSC Digital Preservation Project with NARA
Reagan W. Moore
San Diego Supercomputer Center
http://www.npaci.edu/DICE/
Data and Knowledge Systems Group
Staff
• Reagan Moore
• Ilkai Altintas
• Chaitan Baru
• Sheau Yen Chen
• Charles Cowart
• Amarnath Gupta
• George Kremenek
• M. Kulrul
• Bertram Ludäscher
• Richard Marciano
• A. Memon
• XuFei Qian
• Roman Olshanowsky
• Arcot Rajasekar
• Abe Singer
• Michael Wan
• Ilya Zaslavsky
• Bing Zhu
Graduate Students
• A. Bagchi
• S. Bansal
• A. Behere
• R. Bharath
• S. Bharath
• L. Sui
Undergraduate Interns
• N. Cotofana
• D. Le
• J. Trang
• L. Yin
• +/- NN
Topics
• Digital preservation approach
• Levels of abstraction
• Application to NARA collections
Persistent Archive Approach
• Preservation of authentic documents
• Create archivable form for digital entity
• Define context by assembling a collection
• Create archivable form for collection
• Manage persistent archive
• Support self-instantiating archive
• Support discovery and presentation
ERA Concept model
[Figure: ERA: Archival Components Concept. Labels: Accessioning Workbench (Accession, Verify, Wrap & Containerize, Describe); Archival Repository; Collection; Metadata; Tapes; Disks; Reference Workbench (Query, Rebuild, Present); Order Fulfillment System; Archival Research Catalog; Records Schedules; Internet. Underlying layers: Mediation of Information using XML; Storage Resource Broker / Extensible Meta-data CATalog; Grid Security Infrastructure.]
Fundamental Challenge
Technology Evolution
• Data is a sequence of bits
• Presentation applications are needed to display a digital entity, based upon a data model
• Applications issue I/O calls to operating systems
• Operating systems send commands to storage and display systems
Presentation of Digital Objects
Storage System
Operating System
Application
Digital Object
Display System
Technology Management - Emulation
New Storage System
New Operating System
Old Application
Digital Object
New Display System
Wrap Application
Technology Management - SDSC
Old Storage System
New Operating System
New Application
Digital Object
Old Display System
Wrap Storage System
Wrap Display System
Migrate Encoding Format
Specifying Levels of Abstraction
• Technology management becomes simpler if the persistent archive infrastructure operates on abstractions, rather than an explicit physical implementation of a resource
• Need abstractions for
– Digital objects
– Repositories
Managing Distributed Storage
• Separate the organization of digital objects from their physical storage
– Logical Name Space to manage attributes about the digital objects
– Data handling system to manage interactions with remote storage systems
• Create storage abstraction layer
• Storage Resource Broker (SRB) provides data management system
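The separation described above can be sketched in a few lines. This is a minimal illustration only, not the SRB API: the class names, resource names, and paths are all hypothetical. It shows a logical name that stays fixed while the replica behind it moves from one storage resource to another.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # hypothetical storage resource name
    physical_path: str  # where the bits live on that resource


@dataclass
class LogicalNameSpace:
    """Maps logical names to replicas, independent of any storage system."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, replica):
        self.entries.setdefault(logical_name, []).append(replica)

    def migrate(self, logical_name, old_resource, new_replica):
        """Swap the replica held on old_resource; the logical name is unchanged."""
        self.entries[logical_name] = [
            new_replica if r.resource == old_resource else r
            for r in self.entries[logical_name]
        ]

    def resolve(self, logical_name):
        return list(self.entries[logical_name])


ns = LogicalNameSpace()
ns.register("/nara/email/msg-0001", Replica("unix-disk", "/data/a/0001.xml"))
ns.migrate("/nara/email/msg-0001", "unix-disk",
           Replica("tape-archive", "/archive/nara/0001.xml"))
print(ns.resolve("/nara/email/msg-0001"))
```

Clients that hold only the logical name are unaffected by the migration, which is the property the storage abstraction layer is meant to provide.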
Information Management - Logical Name Space
• Set of attributes to describe digital entities that are registered into the logical name space
• SRB metadata - Unix file system semantics
• Provenance metadata - Dublin Core
• Resource metadata - User access control lists
• Discipline metadata - User defined attributes
• Each digital entity may have unique attributes
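One way to picture the four attribute groups is as a nested record per digital entity. The sketch below is an assumption-laden illustration, not the MCAT schema: the group keys, the helper names `make_entry` and `find`, and the sample values are hypothetical. The provenance fields use the Dublin Core element names `creator` and `title`.

```python
def make_entry(logical_name, size, creator, title, acl, **discipline):
    """Build one logical-name-space entry with four attribute groups."""
    return {
        "system": {"name": logical_name, "size": size},       # file-system style
        "provenance": {"creator": creator, "title": title},   # Dublin Core elements
        "resource": {"acl": acl},                             # access control list
        "discipline": dict(discipline),                       # user defined
    }


def find(entries, group, attribute, value):
    """Return logical names whose attribute in the given group equals value."""
    return [e["system"]["name"] for e in entries
            if e[group].get(attribute) == value]


entries = [
    make_entry("/nara/email/msg-0001", 2048, "J. Doe", "Test posting",
               ["archivist:read"], newsgroup="misc.test"),
    make_entry("/nara/email/msg-0002", 1024, "R. Roe", "Reply",
               ["archivist:read"], newsgroup="misc.test", thread_depth=2),
]
print(find(entries, "discipline", "thread_depth", 2))
```

The second entry carries a `thread_depth` attribute the first lacks, mirroring the point that each digital entity may have unique attributes.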
SDSC Storage Resource Broker & Meta-data Catalog: Levels of Abstraction
[Figure: layered architecture diagram.
Application access: C, C++ Libraries; Unix Shell; Java, NT Browsers; Web WSDL; Prolog Predicate; Linux I/O; DLL / Python.
Clients and Servers (Prime Server): Logical Name Space; Latency Management; Data Transport; Metadata Transport; Consistency Management / Authorization-Authentication.
Catalog Abstraction: Databases (DB2, Oracle, Sybase).
Storage Abstraction: Archives (HPSS, ADSM, UniTree, DMF); Databases (DB2, Oracle, Postgres); File Systems (Unix, NT, Mac OSX); HRM.]
Types of Digital Entity Abstractions
• Logical representation
– What does the digital entity represent?
– What is the associated meaning?
• Physical representation
– What is the physical structure of the digital entity?
Levels of Abstraction for Bits
• Abstraction for Digital Entity
– Logical: I-nodes
– Physical: Track / Sector
• Digital Entity: Bit Stream
• Abstraction for Repository
– Logical: File Name
– Physical: File System (NFS/AFS/NTFS)
• Repository: Disk
Levels of Abstraction for Data
• Abstraction for Digital Entity
– Logical: Data Model (units, semantics)
– Physical: Encoding Format (syntax, structure)
• Digital Entity: Files
• Abstraction for Repository
– Logical: Name Space
– Physical: Data Handling System - SRB/MCAT
• Repository: File System, Archive
Information Management
• Abstraction layer for interacting with information repositories
– Manage the schema and physical table structures of a database
– Extensible schema
– User defined attributes
• Extensible Metadata CATalog (EMCAT) manages collections
• mySRB.html interface supports dynamic collection creation
Levels of Abstraction for Information
• Abstraction for Digital Entity
– Logical: Collection Schema
– Physical: XML Syntax
• Digital Entity: Metadata Attributes
• Abstraction for Repository
– Logical: Database Schema
– Physical: EMCAT/CWM
• Repository: Database
Knowledge Management - Characterizing Properties of Collections
• Characterization of relationships between attributes
– Semantic / logical - cross-walks
– Procedural / temporal - records management
– Structural / spatial - GIS
• Characterization of knowledge repository operations
• Mapping from collection attributes to discipline concepts
• Mapping from knowledge relationships to rules for application in inference engines
Levels of Abstraction for Knowledge
• Abstraction for Digital Entity
– Logical: Relationship Schema
– Physical: ER/UML/XMI/RDF syntax
• Digital Entity: Concept Space (ontology instance)
• Abstraction for Repository
– Logical: Knowledge Repository Schema
– Physical: Model-based Mediation System
• Repository: Knowledge Repository
Persistent Archives
• Storage system abstraction
– Logical name space and data manipulations
• Information repository abstraction
– Logical schema and physical table structure
• Knowledge repository abstraction
– Topic maps and inference rules
• Digital object abstraction
– Data model and encoding format
NARA Prototype
• Demonstrate ability to ingest, archive, recreate, query, and present a digital object from a 1 million record E-mail collection (RFC1036)
– 2.5 GB of data
– 6 required fields
– 13 optional fields
– User defined fields (over 1000)
• Determine resources required to scale size of collection
XML DTD for RFC1036 Messages
<!ELEMENT rfc1036_mesg (headers, body)>
<!ELEMENT headers (required_headers, optional_headers, other_headers)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT required_headers (From, Date, Newsgroups, Subject, Message-ID, Path)>
<!ELEMENT optional_headers (Followup-To?, Expires?, Reply-To?, Sender?, References?,
    Control?, Distribution?, Keywords?, Summary?, Approved?,
    Lines?, Xref?, Organization?)>
<!ELEMENT other_headers (other+)>
<!-- 6 required header keywords -->
<!ELEMENT From (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT Newsgroups (#PCDATA)>
<!ELEMENT Subject (#PCDATA)>
<!ELEMENT Message-ID (#PCDATA)>
<!ELEMENT Path (#PCDATA)>
<!ATTLIST From seqno CDATA #REQUIRED>
<!ATTLIST Date seqno CDATA #REQUIRED>
<!ATTLIST Newsgroups seqno CDATA #REQUIRED>
<!ATTLIST Subject seqno CDATA #REQUIRED>
<!ATTLIST Message-ID seqno CDATA #REQUIRED>
<!ATTLIST Path seqno CDATA #REQUIRED>
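As a rough illustration of how a message might be tagged against the required-header part of this DTD, the sketch below parses a header block with Python's standard `email` module and emits a `required_headers` element. It is not the prototype's ingestion code. The sample message is invented, the function name is hypothetical, and the slides do not define `seqno`; here it is assumed to be the 1-based position of the header in the original message, which would let the original ordering be rebuilt.

```python
import email
import xml.etree.ElementTree as ET

REQUIRED = ["From", "Date", "Newsgroups", "Subject", "Message-ID", "Path"]

# Invented sample message in RFC 1036 header style.
RAW = (
    "Path: news.example.org!not-for-mail\n"
    "From: jane@example.org (Jane Doe)\n"
    "Newsgroups: misc.test\n"
    "Subject: Test posting\n"
    "Message-ID: <1@example.org>\n"
    "Date: Fri, 19 Nov 1982 16:14:55 GMT\n"
    "\n"
    "Body text.\n"
)


def required_headers_element(raw_message):
    """Build the <required_headers> element, in the order the DTD lists it."""
    msg = email.message_from_string(raw_message)
    names = [name.lower() for name, _ in msg.items()]
    parent = ET.Element("required_headers")
    for header in REQUIRED:
        if msg[header] is None:
            raise ValueError(f"missing required header: {header}")
        # Assumption: seqno records the header's position in the source message.
        position = names.index(header.lower()) + 1
        child = ET.SubElement(parent, header, seqno=str(position))
        child.text = msg[header]
    return parent


element = required_headers_element(RAW)
print(ET.tostring(element, encoding="unicode"))
```

The optional and other headers would be handled the same way; they are left out here to keep the sketch short.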
Formatted Message Using XML DTD
Web-based Interface for Accessing the E-mail Collection
Automation of Ingestion Process
• Application of an Accessioning Template
– Defines the concepts, policies or acceptance of the collection
• Creation of attributes that represent the accessioning template concepts
• Analysis of attributes for anomalies and implied inherent knowledge
Information Generation Processes
• Create occurrence index
– (Occurrence, attribute, value)
– This is needed to be able to recreate original form of digital object
• Analyze completeness of information
– Inverse index of attribute values
– Identifies unexpected values - consistency
• Analyze closure of collection
– Are additional concepts needed to represent inverse index value ranges?
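The two indexes above can be sketched as follows. This is an illustrative reading rather than the prototype's implementation: it assumes an occurrence is a (record number, position) pair, and the sample records and the allowed-value set are invented. The occurrence index keeps enough ordering information to rebuild a record, and the inverse index groups occurrences by attribute value so that unexpected values stand out.

```python
from collections import defaultdict


def occurrence_index(records):
    """Flatten records into (occurrence, attribute, value) triples."""
    triples = []
    for record_no, record in enumerate(records):
        for position, (attribute, value) in enumerate(record):
            triples.append(((record_no, position), attribute, value))
    return triples


def rebuild(triples, record_no):
    """Recreate one record, in original order, from the occurrence index."""
    rows = sorted(t for t in triples if t[0][0] == record_no)
    return [(attribute, value) for _, attribute, value in rows]


def inverse_index(triples):
    """Map attribute -> value -> list of occurrences."""
    index = defaultdict(lambda: defaultdict(list))
    for occurrence, attribute, value in triples:
        index[attribute][value].append(occurrence)
    return index


def unexpected_values(index, attribute, allowed):
    """Values seen for an attribute that fall outside the allowed set."""
    return sorted(v for v in index[attribute] if v not in allowed)


records = [
    [("From", "a@example.org"), ("Newsgroups", "misc.test")],
    [("From", "b@example.org"), ("Newsgroups", "misc.tset")],
]
triples = occurrence_index(records)
index = inverse_index(triples)
print(unexpected_values(index, "Newsgroups", {"misc.test"}))
```

Whether a flagged value is an anomaly or points to a concept missing from the accessioning template is the closure question, and that judgement stays with the archivist.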
Ingestion Processes for Collection
[Figure: ingestion process stages: Data Organization; Data Storage.]
Aggregation of original objects into containers for storage
Ingestion Processes for Collection
[Figure: ingestion process stages: Attribute Tagging; Attribute Selection; Information Generation; Data Organization; Collection Storage.]
Migration of objects into a standard representation
Ingestion Processes for Collection
[Figure: ingestion process stages: Accession Template; Attribute Tagging; Attribute Selection; Closure; Concept/Attribute; Attribute Inverse Indexing; Occurrence Tagging; Knowledge Generation; Information Generation; Data Organization; Collection Storage; View Management.]
Persistent Collection
• Define context for archiving data - annotate information content
• Create archivable form - standard encoding format
• Archive information content along with data
• Test closure of the collection - all digital objects that can be discovered in the collection are members of the collection
• Test completeness of the collection - inherent relationships within the collection can be cast in terms of attributes generated from the annotated information.
– Differentiate between inherent knowledge and anomalies / artifacts
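The closure test can be pictured as a set difference. The sketch below assumes that objects are "discovered" by following identifiers that members cite (for the e-mail collection, something like a References header); that reading, the function name, and the sample identifiers are assumptions made for illustration.

```python
def closure_gaps(collection):
    """Return cited identifiers that are not themselves members.

    `collection` maps each member's identifier to the identifiers it cites.
    An empty result means the collection is closed under citation.
    """
    members = set(collection)
    cited = {ref for refs in collection.values() for ref in refs}
    return sorted(cited - members)


collection = {
    "<1@example.org>": [],
    "<2@example.org>": ["<1@example.org>"],
    "<3@example.org>": ["<2@example.org>", "<9@example.org>"],
}
print(closure_gaps(collection))
```

Each gap is either an object to add to the collection or an anomaly to record.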
Self-Instantiating Archive
• Archive the processes that are used to control the ingestion process
– Conversion to archivable form
– Annotation of information content
• When accessing the collection, retrieve the processes and the original digital objects
– Apply the processing steps to re-create the information content
– Query the result to discover desired digital objects
• A self-instantiating archive is a virtual data grid
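A toy version of this idea stores the list of processing steps next to the original objects and replays them on access. Everything here is hypothetical: the step names, the in-memory `STEPS` registry standing in for archived process descriptions, and the sample objects. In a real archive the processes themselves would need an archivable form rather than live Python functions.

```python
# Hypothetical registry standing in for archived process descriptions.
STEPS = {
    # Conversion to archivable form: normalise whitespace and line endings.
    "to_archivable_form": lambda text: text.strip().replace("\r\n", "\n"),
    # Annotation: tag the first line as the subject attribute.
    "annotate": lambda text: {"subject": text.split("\n", 1)[0], "body": text},
}

archive = {
    "processes": ["to_archivable_form", "annotate"],
    "objects": [
        "Budget memo\r\nDetails follow.\r\n",
        "Travel request\r\nDates TBD.\r\n",
    ],
}


def instantiate(archive, steps=STEPS):
    """Replay the archived processes over the original objects."""
    collection = []
    for obj in archive["objects"]:
        for name in archive["processes"]:
            obj = steps[name](obj)
        collection.append(obj)
    return collection


def query(collection, attribute, value):
    """Discover objects whose re-created attribute matches value."""
    return [item for item in collection if item.get(attribute) == value]


hits = query(instantiate(archive), "subject", "Travel request")
print(hits)
```

Only the original objects and the process list are kept; the queryable information content is derived again each time the collection is instantiated.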
Differentiating between Data, Information, and Knowledge
• Data
– Digital object
– Objects are streams of bits
• Information
– Any tagged data, which is treated as an attribute.
– Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
• Knowledge
– Relationships between attributes
– Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
Types of Knowledge Relationships
• Logical / semantic
– Digital Library cross-walks
• Temporal / procedural
– Workflow systems
• Spatial / structural
– GIS systems
• Functional / algorithmic
– Scientific feature analysis
Knowledge Based Data Grids
[Figure: grid of Knowledge, Information, and Data levels against Ingest Services, Management, and Access Services.
Knowledge: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse; XTM DTD; Rules - KQL.
Information: Attributes, Semantics; Information Repository; Attribute-based Query; XML DTD; SDLIP.
Data: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Feature-based Query; MCAT/HDF; Grids.
Access layers: (Model-based Access); (Data Handling System - SRB).]
Further Information
http://www.npaci.edu/DICE