National Partnership for Advanced Computational Infrastructure

Collection-based Persistent Archives

Reagan W. Moore
Associate Director, Data Intensive Computing
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE




Topics

• Experiences learned building a prototype Persistent Archive
  • Information model
  • Hierarchical levels of information
  • Interoperability mechanisms
• Application to workshop topics
  • Ingestion methodology
  • Data set identification
  • Certification of archives


Persistent Archive Goals

• Provide collection-based archive
  • Data set relevance is organized by the collection
• Provide information model to describe the context for the data collection
  • Enough information is needed to be able to dynamically create the collection from archived information
• Decouple collection creation from digital object archiving
• Provide accessioning system to turn data sets into digital objects
  • Accessioning is independent of the final collection


NARA Persistent Archive Prototype

• Demonstrate ability to ingest, archive, recreate, query, and present a digital object from a 1 million record E-mail collection (RFC1036)
  • 2.5 GB of data
  • 6 required fields
  • 13 optional fields
  • User defined fields (over 1000)
• Determine information model needed for persistent archive


Key Concepts Learned

• Information model
  • Semi-structured representation of information - XML
  • Infrastructure independent representation of information context - XML DTD
  • Differentiation between information context for digital objects, collection and presentation
    • DTD for objects
    • DTD for collection
    • XSL style sheets for presentation
• Instantiation software for creating the collection from the information model
• XML databases now appearing


Hierarchy of Information Contexts

• Digital object context
  • Meta-data to define the structure of the object
  • When publishing a digital object, must also publish the context of the object
• Use collections to organize objects
  • Meta-data to define the structure of the collection
  • When publishing a collection, must also publish the information needed to organize the collection
• Use presentation context to control access
  • Meta-data to define structure of presentation
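The three context levels can be pictured as separate records, each published alongside the thing it describes. The sketch below is only an illustration of that separation: every class and field name in it is hypothetical and is not taken from the prototype.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectContext:
    """Meta-data that defines the structure of one digital object."""
    object_id: str
    fields: dict  # tag name -> value (hypothetical layout)


@dataclass
class CollectionContext:
    """Meta-data that defines how objects are organized into a collection."""
    name: str
    index_fields: list  # object fields the collection is organized by
    members: list = field(default_factory=list)

    def add(self, obj: ObjectContext) -> None:
        self.members.append(obj)

    def organize(self) -> dict:
        """Group member ids by the values of the index fields."""
        groups = {}
        for obj in self.members:
            key = tuple(obj.fields.get(f) for f in self.index_fields)
            groups.setdefault(key, []).append(obj.object_id)
        return groups


@dataclass
class PresentationContext:
    """Meta-data that defines which object fields a presentation exposes."""
    visible_fields: list

    def render(self, obj: ObjectContext) -> dict:
        return {f: obj.fields[f] for f in self.visible_fields if f in obj.fields}
```

Because the collection and presentation records only refer to object fields, either one can be replaced without touching the archived objects.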


XML DTD for E-mail

<!ELEMENT rfc1036_mesg (headers, body)>

<!ELEMENT headers (required_headers, optional_headers, other_headers)>
<!ELEMENT body (#PCDATA)>

<!ELEMENT required_headers (From, Date, Newsgroups, Subject, Message-ID, Path)>
<!ELEMENT optional_headers (Followup-To?, Expires?, Reply-To?, Sender?, References?, Control?, Distribution?, Keywords?, Summary?, Approved?, Lines?, Xref?, Organization?)>
<!ELEMENT other_headers (other+)>

<!-- 6 required header keywords -->
<!ELEMENT From (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT Newsgroups (#PCDATA)>
<!ELEMENT Subject (#PCDATA)>
<!ELEMENT Message-ID (#PCDATA)>
<!ELEMENT Path (#PCDATA)>

<!ATTLIST From seqno CDATA #REQUIRED>
<!ATTLIST Date seqno CDATA #REQUIRED>
<!ATTLIST Newsgroups seqno CDATA #REQUIRED>
<!ATTLIST Subject seqno CDATA #REQUIRED>
<!ATTLIST Message-ID seqno CDATA #REQUIRED>
<!ATTLIST Path seqno CDATA #REQUIRED>
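As an illustration of how a raw message might be tagged to match this DTD, here is a minimal Python sketch. It is not the prototype's ingestion code. The function name, the `name` attribute on `other` elements, and the simplified header parsing (exact-case names, no continuation lines) are all assumptions. The optional header list uses the RFC 1036 spelling `Followup-To`.

```python
import xml.etree.ElementTree as ET

REQUIRED = ["From", "Date", "Newsgroups", "Subject", "Message-ID", "Path"]
OPTIONAL = ["Followup-To", "Expires", "Reply-To", "Sender", "References",
            "Control", "Distribution", "Keywords", "Summary", "Approved",
            "Lines", "Xref", "Organization"]


def message_to_xml(raw: str, seqno: int) -> ET.Element:
    """Wrap one raw message in the element structure declared by the DTD."""
    head, _, body = raw.partition("\n\n")
    fields = {}
    for line in head.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip lines that are not "Name: value"
            fields[name.strip()] = value.strip()

    missing = [name for name in REQUIRED if name not in fields]
    if missing:
        raise ValueError(f"missing required headers: {missing}")

    root = ET.Element("rfc1036_mesg")
    headers = ET.SubElement(root, "headers")

    required = ET.SubElement(headers, "required_headers")
    for name in REQUIRED:  # order matters: the DTD declares a sequence
        ET.SubElement(required, name, seqno=str(seqno)).text = fields[name]

    optional = ET.SubElement(headers, "optional_headers")
    for name in OPTIONAL:
        if name in fields:
            ET.SubElement(optional, name).text = fields[name]

    # The DTD requires at least one <other>; this sketch does not enforce that.
    other = ET.SubElement(headers, "other_headers")
    for name, value in fields.items():
        if name not in REQUIRED and name not in OPTIONAL:
            # The "name" attribute is an assumption; the slide does not declare <other>.
            ET.SubElement(other, "other", name=name).text = value

    ET.SubElement(root, "body").text = body
    return root
```

Calling `ET.tostring(message_to_xml(raw, 1), encoding="unicode")` gives the serialized form; checking it against the DTD would need a validating parser, which the standard library does not provide.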


Formatted Message Using XML DTD


Key Concepts Learned

• Digital object encapsulation
  • Minimize the number of times a digital object must be touched
  • Once archived, a digital object should only be retrieved when requested by a user
• Implies meta-data stored with digital objects should only describe the objects
• Collection and presentation meta-data should be stored separately


Persistent Archive Requirements

• Distributed environment to ensure separable components
  • Accession workbench
  • Archive
  • Presentation platform
• Data handling mechanisms for interoperability as basis for system evolution
  • No tightly coupled systems
  • Unique names are only used by the data handling system
  • Use of containers to aggregate digital objects for storage
  • Implies a hierarchical naming scheme
    • Collection / container / digital object


[Diagram: Electronic Records Archive (ERA), with four stages labeled ACCESSION, ARCHIVES, REFERENCE, TRANSFER. Labels recovered from the figure:
• Media handlers: tape, disk, CD, FTP, web
• Accessioning Workbench (snap-in): text, image, photo, video, audio, geographical information system, compound records, database; wrapper producing a metadata wrapper around each record
• Metadata repository and records repository
• Reference Workbench (snap-in): arrangement, ARC catalog, order fulfillment, retrieve records
• Unwrapper with output to FTP, tape, disk, CD
• Query & reference tools; presentation over Internet / intranet]


Federation of Data Collections into Digital Libraries

[Diagram: collections federated into digital libraries. Labels recovered from the figure: DPOSS Sky Survey; 2MASS Sky Survey; NASA Catalog; NSDigLib; Wash. Brain Image; UCLA Brain Image; MSU Brain Image; UCSD Neuroscience; CEED / ESA; REINAS; U Md Archive; ADL; Elib - Flora; ESSDigLib; Protein Data Bank; Wash U Genome; U H Mol Trajectory; MSDigLib; UCCalif; Finding Aids; UMDL Social Science; AMICO Image Library; NARA Persistent Archive; U Wisc. Video Lib.; Pacific Rim DL]


Conclusions

• Ingestion
  • Infrastructure independent representation for digital objects
  • Infrastructure independent representation for information model
  • Turn data sets into digital objects by adding attribute tags
  • Aggregate digital objects in containers for storage


Conclusions

• Data set identification
  • Unique names only required by data handling system
  • Attribute based access through collection
  • Hierarchical naming
    • Collection / Container / Digital object
    • Finding Aid for collection / Data handling system ID for container / Unique ID for object
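The naming scheme above can be sketched in a few lines of Python. This is a hypothetical illustration, not the data handling system's actual interface: the class names, the path format, and the example identifiers are assumptions. Users ask by attribute, and only the catalog deals with the collection / container / object name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectName:
    collection: str  # located through a finding aid
    container: str   # ID assigned by the data handling system
    object_id: str   # unique ID of the digital object

    def path(self) -> str:
        return "/".join((self.collection, self.container, self.object_id))


class Collection:
    """Attribute-based access: callers never need to know object names."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._attributes = {}  # ObjectName -> attribute dict

    def register(self, container: str, object_id: str, **attributes) -> ObjectName:
        name = ObjectName(self.name, container, object_id)
        self._attributes[name] = attributes
        return name

    def find(self, **query) -> list:
        """Return the paths of all objects whose attributes match the query."""
        return sorted(
            name.path()
            for name, attrs in self._attributes.items()
            if all(attrs.get(key) == value for key, value in query.items())
        )
```

For example, after registering two messages in container `cont-0001` of a collection named `usenet`, `find(Newsgroups="sci.astro")` returns only the paths whose `Newsgroups` attribute matches.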


Conclusion

• Certification of persistent archive
  • Demonstrate that the archive can provide infrastructure independent representation for
    • Finding aids for locating collections
    • Information model for building collection
    • Data handling system container IDs for storage access
    • Digital object attribute tags
  • Demonstrate that the archive can use information models to create finding aids, collections, and access interfaces on new technology
  • Demonstrate that any component of the architecture can be migrated independently


Further Information

http://www.npaci.edu/DICE


NARA Persistent Archive

Application | | Infrastructure
Accessioning Workbench | Information Model | User interface / Analysis tools
Finding Aids | | Federation / Mediation of Collections
Information discovery | Markup Language | Digital Library Services
Collection migration system | | Collection Management
Storage Resource Broker | Meta-data | Data Handling System
HPSS / file system | | Archive Storage


Context Based Objects

• For data to be useful, the context must be defined
  • Data format - binary/integer representation
  • Physical meaning - units
  • Structure - geometry
  • Relevance - feature annotation
  • Semantics - data dictionary for attributes
• Context is preserved as meta-data attributes
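To make the point concrete, the sketch below stores the five kinds of context as meta-data attributes next to a raw byte string and uses them to interpret it. The attribute names, the example values, and the `decode` helper are hypothetical; they only show that the bytes alone are not enough.

```python
import struct

# Hypothetical context record: one entry per kind of context listed above.
context = {
    "format": "<3f",                  # data format: three little-endian 32-bit floats
    "units": "kelvin",                # physical meaning
    "shape": (3,),                    # structure
    "annotation": "daily maximum",    # relevance
    "dictionary": {"value": "air temperature at the station"},  # semantics
}


def decode(raw: bytes, ctx: dict) -> list:
    """Interpret raw bytes using only the archived context attributes."""
    values = struct.unpack(ctx["format"], raw)
    if len(values) != ctx["shape"][0]:
        raise ValueError("data does not match the declared structure")
    return [(value, ctx["units"]) for value in values]
```

Without the `format` and `units` attributes, the same twelve bytes could be read as integers, text, or values in a different unit.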


Information Models for Organization of Data

Digital Object Attributes

Collection Attributes

Presentation Attributes


Information Models for Access to Data

Presentation of data from multiple digital libraries

Collections from federated databases

Digital object Attributes


Common Information Model

• eXtensible Markup Language (XML)
  • Use tags to define semantic context for components of the data set
• Document Type Definition (DTD)
  • Provides semi-structured representation for organizing tags that can be applied to groups of digital objects
• Development of standards for tags
  • Digital sky, Protein Data Bank, Neuroscience brain images
  • California Digital Library - Art Museum Image Consortium


Information Management Hierarchy

• Presentation / Information Discovery / Analysis
  • Visualization - Shastra, 3D visualization tools
  • Presentation information model - XSL style sheet
• Collection organization
  • Meta-data catalog - MCAT
  • Collection information model - XML DTD
• Data handling
  • Storage Resource Broker - SRB
• Storage
  • Archival storage system - HPSS
  • Digital object model - XML DTD


Open Grid Architecture to Encourage Interoperability

[Diagram. Component labels: Application; Data Handling Systems; Storage Resources; Remote Procedure Execution; Data Model Management; Storage System Description; Information Discovery; Dynamic Info Discovery]


Technology Sources

• Archive Community
  • IEEE Mass Storage Systems Technical Committee
  • Scalable storage systems
• Digital Library Community
  • NSF Digital Library Initiative, Phase II
  • Information management mediation - XML
• Supercomputer Community
  • Scalable analysis platforms
• Grid Forum
  • Data handling systems for interoperability
• Archivist Community / Library Community
  • Management policies and standards


Technology Sources

[Diagram. Component labels: Application; Data Handling Systems; Storage Resources; Remote Procedure Execution; Data Model Management; Storage System Description; Information Discovery; Dynamic Info Discovery. Source labels: Digital Library; Computational Grid]


Information Management Architecture

• Digital library community technologies
  • Distributed information resources
  • Digital library interoperability protocols - SDLIP
  • Mediation of information using XML - MIX
• Grid Forum technologies
  • Support for distributed services / procedures
  • Inter-realm authentication
    • GSI Grid Security Infrastructure
• Data handling system
  • Storage Resource Broker, Meta-data Catalog


Grid Forum Data Access Architecture

[Diagram. Component labels: Application; Data Handling Systems; Storage Resources; Remote Procedure Execution; Data Model Management; Storage System Description; Information Discovery; Dynamic Info Discovery. Annotations recovered from the figure:
• API that provides “glue” to underlying storage, QoS, etc. [GASS, IBP, SRB]
• API that provides “glue” to underlying data handling systems (security, scheduling, QoS, access protocol, data format/model, adaptivity, info discovery, location control)
• DPSS, DFS, NFS, HPSS, ADSM, DMF, Unitree, NASstore
• DB2, Oracle, Informix, Sybase, O2, ObjectStore, Objectivity
• Armada, D’agents, FEL, ADR, GRAM, SRB, Java, CORBA + authentication + authorization
• GloPerf, Netlogger, NWS
• Condor, GASS, NILE, SRB, I-2 caching, ADR
• DTD, ADR, object class
• LDAP, Database, Flat file, Object database]


Data Handling System Capabilities

• SDSC Storage Resource Broker
  • Protocol transparency
    • Common API for access to remote data resources
    • Explicit drivers for each type of storage system
  • Name transparency
    • Attribute based access to data
  • Location transparency
    • Distribution of collection across multiple physical resources
  • Time transparency
    • Minimization of latency for data access
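The first three transparencies can be illustrated with a small Python sketch. It is not the SRB API: `StorageDriver`, `MemoryDriver`, and `Broker` are hypothetical names, and an in-memory dictionary stands in for each real storage system. The point is the shape: one driver per storage type behind a common interface, and a catalog that maps attributes to a resource and a physical path.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Common interface; one concrete driver per type of storage system."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...


class MemoryDriver(StorageDriver):
    """Stand-in driver that keeps data in a dictionary."""

    def __init__(self) -> None:
        self._blobs = {}

    def write(self, path: str, data: bytes) -> None:
        self._blobs[path] = data

    def read(self, path: str) -> bytes:
        return self._blobs[path]


class Broker:
    def __init__(self) -> None:
        self._drivers = {}   # resource name -> driver (protocol transparency)
        self._catalog = []   # (attributes, resource, path) entries

    def add_resource(self, name: str, driver: StorageDriver) -> None:
        self._drivers[name] = driver

    def put(self, resource: str, path: str, data: bytes, **attributes) -> None:
        self._drivers[resource].write(path, data)
        self._catalog.append((attributes, resource, path))

    def get(self, **query) -> bytes:
        """Return the first object matching the query, wherever it is stored."""
        for attributes, resource, path in self._catalog:
            if all(attributes.get(k) == v for k, v in query.items()):
                return self._drivers[resource].read(path)
        raise KeyError(query)
```

A caller that runs `broker.get(survey="DPOSS")` never sees which resource or path held the data, which is what name and location transparency amount to in this sketch.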


SDSC Storage Resource Broker & Meta-data Catalog

[Diagram. Labels recovered from the figure: Application; SRB; MCAT (Dublin Core, Resource, User, Application Meta-data); File SID, DBLobj SID, Obj SID; storage systems ADSM, HPSS, DB2, Oracle, Unix]


Time Transparency

• How to minimize latency
  • Prefetch data to local high performance disk, so that all accesses can be done at high speed from local resources
• How to maximize access rate
  • Composite or aggregate data into a single data set to avoid multiple accesses
  • Stream data at high rates using parallel I/O, amortizing the access latency by the volume of data that is delivered
• How to avoid congestion
  • Replicate data across multiple servers
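The benefit of aggregation can be shown with simple arithmetic: each access pays a fixed latency plus the transfer time. The Python sketch below uses illustrative numbers (10 s access latency, 10 MB/s bandwidth, 1000 objects of 1 MB); these are assumptions chosen for the example, not measurements from the prototype.

```python
def total_time(requests: int, bytes_each: float, latency_s: float,
               bandwidth_bytes_per_s: float) -> float:
    """Time to fetch `requests` data sets, each paying the access latency once."""
    return requests * latency_s + requests * bytes_each / bandwidth_bytes_per_s


# Hypothetical archive: 10 s latency per access, 10 MB/s sustained bandwidth.
separate = total_time(1000, 1e6, 10.0, 1e7)   # 1000 objects fetched one by one
aggregated = total_time(1, 1e9, 10.0, 1e7)    # the same bytes as one data set

print(separate, aggregated)  # 10100.0 110.0
```

Under these assumptions the transfer time is the same 100 s in both cases; the difference comes entirely from paying the latency once instead of 1000 times.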


SRB Containers - Managing Archive Latency

• Create container in a logical storage resource containing at least one “cacheable” resource
• Create objects in containers
• “Cache” daemon will move filled containers to archive
• synch and purge APIs

[Diagram. Labels recovered from the figure: SRB client; SRB Server; Distributed Storage Resources; UNIX (cached containers); HPSS (container)]
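The container workflow can be sketched as follows. This is a hypothetical in-memory model, not the SRB container API: `ContainerStore`, its method names, and the container naming are assumptions, and filled containers are moved synchronously here rather than by a separate cache daemon.

```python
class ContainerStore:
    """Objects are written into a cached container; full containers go to the archive."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity   # container size in bytes
        self.cache = {}            # container name -> {object_id: bytes}
        self.archive = {}          # same layout, standing in for archival storage
        self.location = {}         # object_id -> container name
        self._open = None          # container currently being filled
        self._seq = 0

    def _size(self, name: str) -> int:
        return sum(len(data) for data in self.cache[name].values())

    def put(self, object_id: str, data: bytes) -> str:
        if len(data) > self.capacity:
            raise ValueError("object larger than a container")
        if self._open is None or self._size(self._open) + len(data) > self.capacity:
            if self._open is not None:
                self.sync(self._open)    # copy the filled container to the archive
                self.purge(self._open)   # then free the cache space
            self._seq += 1
            self._open = f"container-{self._seq}"
            self.cache[self._open] = {}
        self.cache[self._open][object_id] = data
        self.location[object_id] = self._open
        return self._open

    def sync(self, name: str) -> None:
        self.archive[name] = dict(self.cache[name])

    def purge(self, name: str) -> None:
        if name not in self.archive:
            raise RuntimeError("refusing to purge a container that was never synced")
        del self.cache[name]

    def get(self, object_id: str) -> bytes:
        name = self.location[object_id]
        if name not in self.cache:
            # One archive access stages the whole container back into the cache.
            self.cache[name] = dict(self.archive[name])
        return self.cache[name][object_id]
```

Staging the whole container on a miss means later reads of neighboring objects are served from the cache, so the archive latency is paid once per container rather than once per object.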


Generality of Information Infrastructure

• Same information model needed to manage
  • Federation in space
    • Metacomputing environment
    • Interoperable services for digital libraries
  • Migration over time
    • Collection creation and update
    • Persistent archive
• Same storage systems needed to support
  • Supercomputer center data
  • Discipline specific data collections
  • Digital library collections


Art Museum Image Consortium

• Demonstrated
  • Support for heterogeneous digital objects
  • Automated conversion of meta-data to XML DTD
  • Validation of meta-data
  • XSL style sheet for presenting information


AMICO Meta-data Conversion to XML


AMICO Presentation Interface

1. AMICO XML Data Records
2. XSL Style Sheet Script
3. Rendered Output


National Partnership for Advanced Computational Infrastructure

• Facilitate the conduct of science through development of knowledge resources
  • Publish - Data collection infrastructure
  • Info discovery - Digital Library infrastructure
  • Data access - Data handling infrastructure
• Apply to federal, state, and university projects
  • NSF / DOE / NASA / USPTO / NARA / Census Bureau
  • California Digital Library
  • UCSD - Pacific Rim Digital Library Alliance


Publishing Scientific Data

[Diagram. Labels recovered from the figure: Applications; Digital Library; Collection Building; Information Management; Data Storage; Archival Storage.
• Applications: Digital Sky, Neuroscience, Protein Data Bank, Molecular Structures, Earth Systems Science
• Libraries: CDL, UCB - Elib, UCSB - ADL, Stanford - SDLIP, U Michigan - UMDL]


NPACI is a National Partnership of Partnerships

46 institutions

20 states

4 countries

5 national labs

Many projects (new and old)

Vendors and industry

Government agencies


National Partnership for Advanced Computational Infrastructure

• Provide Teraflops / Petabyte capable systems for use by national academic community
  • Current systems at the San Diego Supercomputer Center
    • 250 Gflops peak computation rate
      – IBM SP, CRAY T3E
    • 250 Terabyte archive capacity, 100 TB in archive
      – High Performance Storage System
  • By end of year
    • 1 TFlop peak computation rate
      – IBM SP
    • 500 Terabyte archive capacity


Challenges

• Facilitate access to high-end resources
  • Support data intensive computing
• Facilitate access to distributed data resources
  • Support information discovery
• Minimize complexity of user interfaces
  • Provide unifying data access system
• Requires information management infrastructure


Bio-Informatics

Application | | Infrastructure
Structural Comparison (n x n) | Information Model | User interface / Analysis tools
Mediation of Information using XML / Extensible Meta-data Catalog | | Federation / Mediation of Collections
Protein Data Bank Services | Markup Language | Digital Library Services
PDB / Genome / Molecular Trajectory Collections | | Collection Management
Storage Resource Broker | Meta-data | Data Handling System
HPSS / file system | | Archive Storage


Art Museum Image Consortium - AMICO

Application | | Infrastructure
Classroom lectures | Information Model | User interface / Analysis tools
Mediation of Information using XML | | Federation / Mediation of Collections
Internet Explorer – XSL stylesheets | Markup Language | Digital Library Services
AMICO Collection | | Collection Management
Storage Resource Broker | Meta-data | Data Handling System
HPSS / file system | | Archive Storage


National Virtual Observatory

Application | | Infrastructure
Astronomer’s Workbench | Information Model | User interface / Analysis tools
Correlation catalogs | | Federation / Mediation of Collections
Statistical Analysis | Markup Language | Digital Library Services
2MASS / DPOSS / SDSS / NOAA | | Collection Management
Storage Resource Broker | Meta-data | Data Handling System
HPSS / file system | | Archive Storage


California Digital Library

Application | | Infrastructure
Research / Education / Public web-based access | Information Model | User interface / Analysis tools
Mediation of Information using XML / Extensible Meta-data Catalog | | Federation / Mediation of Collections
Electronic Notebook / Infoscapes | Markup Language | Digital Library Services
AMICO / ADEPT / UCB Flora collection | | Collection Management
Storage Resource Broker | Meta-data | Data Handling System
HPSS / file system | | Archive Storage