Data Grids, Digital Libraries, and Persistent Archives

Page 1: Data Grids, Digital Libraries, and Persistent Archives

San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure

Data Grids, Digital Libraries, and Persistent Archives

Reagan W. Moore
San Diego Supercomputer Center

http://www.npaci.edu/DICE
[email protected]

Page 2: Data Grids, Digital Libraries, and Persistent Archives

Archive Definition

• Computer science: an archive is the hardware and software infrastructure used to manage data

• Preservation community: an archive is the material that is being preserved

Page 3: Data Grids, Digital Libraries, and Persistent Archives

Persistent Archive

• Software system that manages evolution of the hardware and software infrastructure
  – A persistent archive preserves the authenticity and integrity of digital entities while the underlying technology evolves

• Combination of the material that is being preserved and the infrastructure used to preserve the material

Page 4: Data Grids, Digital Libraries, and Persistent Archives

Data Grid

• Grid community definition
  – The infrastructure used to manage distributed data as a collection

• Digital library and preservation community definition
  – The distributed data that is being organized and managed as a collection

• A data grid is a mechanism to support sharing of data and the collection that is being shared

Page 5: Data Grids, Digital Libraries, and Persistent Archives

Data Sharing

• Management of access controls on local resources to share data
  – Put controls on resources

• Creation of a collection that is being shared across distributed resources
  – Put controls on collection

• The SRB data grid does both: it enforces controls on resources and on collections (data and metadata)

Page 6: Data Grids, Digital Libraries, and Persistent Archives

Topics

• Data Grids - managing distributed data
  – Distributed data management for a project

• Digital Libraries - publication of data
  – Management of collection hierarchies

• Persistent Archives - preservation of data
  – Management of technology evolution

• Storage Resource Broker example
  – Currently supporting all three (seven) data management environments

Page 7: Data Grids, Digital Libraries, and Persistent Archives

Data Management Systems (Supported by Storage Resource Broker)

• Data collecting
  – Sensor systems, object ring buffers and portals

• Data organization
  – Collections, manage data context

• Data sharing
  – Data grids, manage heterogeneity of resources

• Data publication
  – Digital libraries, support discovery

• Data preservation
  – Persistent archives, manage technology evolution

• Data analysis
  – Processing pipelines, manage knowledge extraction

Page 8: Data Grids, Digital Libraries, and Persistent Archives

Data Management Systems

• Data grid for managing distributed data
  – Latency management for bulk analyses of collections
  – Infrastructure independent name spaces for describing data, resources, users, and state information

• Digital library for managing data context
  – Curation services for managing collections
  – Descriptive metadata for discovery

• Persistent archive to manage technology evolution
  – Interoperability mechanisms between heterogeneous storage systems and user access mechanisms

Page 9: Data Grids, Digital Libraries, and Persistent Archives

Provide Context for Data

• Properties of files
  – Provenance - source
  – Descriptive attributes
  – Structure

• Organize properties as metadata in a collection hierarchy
  – Define operations on file properties
  – Manage state information - location, replicas, containers

• Separate context management from content management
  – Maintain consistency of context as operations are done on content

Page 10: Data Grids, Digital Libraries, and Persistent Archives

Data Grids

• Software systems that manage distributed data

• Control global name spaces for
  – Resources
  – Users
  – Files
  – Metadata context

• Provide standard operations on each name space

• Provide single sign-on authentication, collection management, latency management, replication, and federation

• Generic distributed data management technology

Page 11: Data Grids, Digital Libraries, and Persistent Archives

Managing Distributed Data

Storage Repository

• Storage location

• User name

• File name

• File context (creation date,…)

• Access constraints

Data Access Methods (Web Browser, DSpace, OAI-PMH)

Naming conventions provided by storage systems

Page 12: Data Grids, Digital Libraries, and Persistent Archives

Data Grids Provide a Level of Indirection for Each Naming Convention

Storage Repository

• Storage location

• User name

• File name

• File context (creation date,…)

• Access constraints

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection

Data Access Methods (C library, Unix, Web Browser)

Data is organized as a collection
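
The level of indirection on this slide can be pictured as a catalog that maps each logical file name to one or more physical locations. The following is a minimal sketch, not the SRB or MCAT implementation: the `Catalog` and `Replica` classes, the resource names, and the paths are all hypothetical, chosen only to show that a physical path can change while the logical name stays fixed.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # logical resource name (hypothetical, e.g. "sdsc-archive")
    physical_path: str  # name used by the storage system itself


@dataclass
class Catalog:
    """Toy metadata catalog: logical file name -> replicas and context."""
    replicas: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    def register(self, logical_name, resource, physical_path, **attrs):
        """Record one physical copy and any context attributes."""
        self.replicas.setdefault(logical_name, []).append(
            Replica(resource, physical_path))
        self.metadata.setdefault(logical_name, {}).update(attrs)

    def resolve(self, logical_name):
        """Return every physical location recorded for a logical name."""
        return list(self.replicas.get(logical_name, []))

    def move(self, logical_name, resource, new_path):
        """Change a physical path without changing the logical name."""
        for rep in self.replicas.get(logical_name, []):
            if rep.resource == resource:
                rep.physical_path = new_path


catalog = Catalog()
name = "/home/demo/image.tif"  # hypothetical logical file name
catalog.register(name, "sdsc-archive", "/hpss/a1/0001.dat", creator="demo")
catalog.register(name, "umd-disk", "/data/srb/7f3c", creator="demo")
catalog.move(name, "umd-disk", "/data2/srb/7f3c")
locations = [(r.resource, r.physical_path) for r in catalog.resolve(name)]
```

Applications keep using the logical name throughout; only the catalog entry changes when a file is moved or replicated.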

Page 13: Data Grids, Digital Libraries, and Persistent Archives

Logical Name Spaces

• Storage resources
  – Logical names for managing collections of resources

• User names (user-name / domain / data grid)
  – Distinguished names for users to manage access controls

• Digital Entities (files, blobs, structured data, …)
  – Logical name space for global identifiers for files

• Context - Metadata attributes
  – Standard metadata attributes, Dublin Core
  – State information resulting from data grid operations
  – User-defined metadata

Page 14: Data Grids, Digital Libraries, and Persistent Archives

Logical Resource Name

• Represents a list of physical resources

• Operations on the logical resource name result in operations on the list of physical resources
  – Load leveling - write to the next physical resource in the list
  – Fault tolerance - write to “k” of “n” physical resources
  – Replication - write to each physical resource
  – Compound resource - write to the disk cache in front of the tape archive
  – Federated resource - write to the controlled resource in another data grid
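
Three of the policies listed for a logical resource name (load leveling, fault tolerance, and replication) can be sketched as different ways of walking the same list of physical resources. This is an illustrative sketch only: the `LogicalResource` class, its method names, and the resource names are assumptions, and the `write` callback stands in for whatever a real storage driver would do.

```python
import itertools


class LogicalResource:
    """A logical resource name that stands for a list of physical resources."""

    def __init__(self, name, physical):
        self.name = name
        self.physical = list(physical)
        self._next = itertools.cycle(self.physical)

    def write_load_leveled(self, write):
        """Load leveling: write to the next physical resource in the list."""
        target = next(self._next)
        write(target)
        return [target]

    def write_replicated(self, write):
        """Replication: write to each physical resource."""
        for target in self.physical:
            write(target)
        return list(self.physical)

    def write_fault_tolerant(self, write, k):
        """Fault tolerance: succeed once k of the n resources accept the write."""
        done = []
        for target in self.physical:
            try:
                write(target)
            except OSError:
                continue  # skip a failed resource and try the next one
            done.append(target)
            if len(done) == k:
                return done
        raise OSError(f"only {len(done)} of the required {k} writes succeeded")


def flaky_write(target):
    """Hypothetical write callback in which one resource is offline."""
    if target == "disk-b":
        raise OSError("disk-b is offline")


pool = LogicalResource("demo-pool", ["disk-a", "disk-b", "tape-c"])
leveled = [pool.write_load_leveled(lambda t: None)[0] for _ in range(4)]
replicated = pool.write_replicated(lambda t: None)
tolerant = pool.write_fault_tolerant(flaky_write, k=2)
```

The caller addresses only `demo-pool`; which physical resources are touched depends on the policy chosen.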

Page 15: Data Grids, Digital Libraries, and Persistent Archives

Storage Repository Virtualization

Archive, Database, File System

User Application

How does one access data stored on multiple systems?

Page 16: Data Grids, Digital Libraries, and Persistent Archives

Storage Repository Virtualization (Standard Operations on Logical Resource Names)

Archive, Database, File System

Common set of operations for interacting with every type of storage repository

User Application

• Remote operations
  – Unix file system
  – Latency management
  – Procedures
  – Transformations
  – Third party transfer
  – Filtering
  – Queries

• Collective operations
  – Load leveling
  – Fault tolerance
  – Replication
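
One way to picture a common set of operations across repository types is a small driver interface that every storage system implements. The sketch below is an assumption-laden illustration, not the SRB driver API: `StorageDriver`, `FileSystemDriver`, `InMemoryBlobDriver`, and `copy_object` are hypothetical names, and the in-memory class merely stands in for a database blob store.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Operations every repository type must offer (hypothetical interface)."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class FileSystemDriver(StorageDriver):
    """Stores each object as a file under a root directory."""

    def __init__(self, root):
        self.root = root

    def put(self, key, data):
        with open(os.path.join(self.root, key), "wb") as f:
            f.write(data)

    def get(self, key):
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()


class InMemoryBlobDriver(StorageDriver):
    """Stands in for a database blob store."""

    def __init__(self):
        self.rows = {}

    def put(self, key, data):
        self.rows[key] = bytes(data)

    def get(self, key):
        return self.rows[key]


def copy_object(key, source, destination):
    """Works for any pair of drivers because both share one interface."""
    destination.put(key, source.get(key))


with tempfile.TemporaryDirectory() as root:
    fs = FileSystemDriver(root)
    blobs = InMemoryBlobDriver()
    fs.put("obj1", b"hello")
    copy_object("obj1", fs, blobs)
    copied = blobs.get("obj1")
```

The application code in `copy_object` never learns which kind of repository it is talking to, which is the point of the virtualization layer.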

Page 17: Data Grids, Digital Libraries, and Persistent Archives

Logical File Name Abstraction

Archive at SDSC
Database at U Md
File System at NARA

User Application

How does one identify files stored on multiple systems?

Page 18: Data Grids, Digital Libraries, and Persistent Archives

Context Abstraction

Archive at SDSC
Database at U Md
File System at U Texas

Common naming convention and set of attributes for describing digital entities

User Application

• Logical name space
• Location independent identifier
• Persistent identifier
• Collection owned data
• Access controls
• Audit trails
• Checksums
• Descriptive metadata
• Inter-realm authentication
• Single sign-on system

Page 19: Data Grids, Digital Libraries, and Persistent Archives

Federated Server Architecture

[Figure: federated server architecture. Components shown: Read Application, two SRB servers, SRB agents, MCAT, and storage resources R1 and R2. Numbered steps 1-6 connect the components; the application supplies a Logical Name or Attribute Condition. Other labels: Peer-to-peer Brokering, Server(s) Spawning, Data Access, Parallel Data Access.]

Functions listed in the figure:
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control

Page 20: Data Grids, Digital Libraries, and Persistent Archives

SRB Latency Management

[Figure: latency management mechanisms applied between Source and Destination across the Network.]

• Replication
• Server-initiated I/O
• Streaming
• Parallel I/O
• Caching
• Client-initiated I/O
• Remote Proxies, Staging
• Data Aggregation, Containers
• Prefetch

Page 21: Data Grids, Digital Libraries, and Persistent Archives

Latency Management - Bulk Operations

• Bulk register
  – Create a logical name for a file

• Bulk load
  – Create a copy of the file on a data grid storage repository

• Bulk unload
  – Provide containers to hold small files and pointers to each file location

• Bulk delete
  – Mark as deleted in metadata catalog
  – After specified interval, delete file

• Bulk metadata load

• Requests for bulk operations for access control setting, …
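
The bulk delete bullet describes a two-step pattern: mark entries as deleted in the catalog, then remove the files after a specified interval. A minimal sketch of that pattern follows. The `TrashCatalog` class, its field names, and the use of plain numeric timestamps are assumptions made for illustration, and a dictionary stands in for the storage repository.

```python
class TrashCatalog:
    """Toy catalog with deferred deletion (hypothetical, not the MCAT schema)."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.files = {}       # logical name -> stored bytes (stands in for storage)
        self.deleted_at = {}  # logical name -> time it was marked deleted

    def bulk_delete(self, names, now):
        """Step 1: mark many entries as deleted in a single catalog pass."""
        for name in names:
            if name in self.files:
                self.deleted_at[name] = now

    def visible(self):
        """Names a user would still see after the marks are applied."""
        return sorted(n for n in self.files if n not in self.deleted_at)

    def purge(self, now):
        """Step 2: remove files whose retention interval has elapsed."""
        expired = [n for n, t in self.deleted_at.items()
                   if now - t >= self.retention]
        for name in expired:
            del self.files[name]
            del self.deleted_at[name]
        return sorted(expired)


cat = TrashCatalog(retention_seconds=60)
cat.files = {"a": b"1", "b": b"2", "c": b"3"}
cat.bulk_delete(["a", "b"], now=0)
hidden_view = cat.visible()
early = cat.purge(now=30)   # interval not yet elapsed, nothing is removed
late = cat.purge(now=60)    # interval elapsed, marked files are removed
```

Marking is a cheap metadata update, so many files can be handled in one request, while the expensive storage operation is deferred.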

Page 22: Data Grids, Digital Libraries, and Persistent Archives

Data Grid Federation

• Link multiple independent data grids
  – Coordinate metadata between independent metadata catalogs

• Provide consistency and access constraints for each of the four logical name spaces (resources, users, files, metadata)
  – Peer-to-peer federations, data access
  – Replication federations, shared resources
  – Hierarchical federations, consistency constraints

• Tune data grid federation by implementing different consistency and access constraints

Page 23: Data Grids, Digital Libraries, and Persistent Archives

Federation

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection B

Data Access Methods (Web Browser, DSpace, OAI-PMH)

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection A

Access controls and consistency constraints on cross registration of digital entities

Page 24: Data Grids, Digital Libraries, and Persistent Archives

Federation Environments

[Figure: taxonomy of federation environments, organized by Replication Constraints, Consistency Constraints, and Access Constraints.]

Environments shown:
• Free Floating
• Occasional Interchange
• Replicated Data
• Resource Interaction
• User and Data Replica
• Nomadic
• Replicated Catalog
• Snow Flake
• Master Slave
• Deep Archive

Groupings shown:
• Peer-to-Peer Data Grids
• Replication Data Grids
• Hierarchical Data Grids

Constraint labels in the figure:
• Partial User-ID Sharing
• Partial Resource Sharing
• No Metadata Synch
• Hierarchical Zone Organization
• One Shared User-ID
• System Managed Replication
• Connection From Any Zone
• Complete Resource Sharing
• System Set Access Controls
• System Controlled Complete Synch
• Complete User-ID Sharing
• System Controlled Partial Synch
• No Resource Sharing
• Super Administrator Zone Control
• No User-ID Sharing

Page 25: Data Grids, Digital Libraries, and Persistent Archives

Generic Infrastructure

• SDSC developed the Storage Resource Broker (SRB) to support access to distributed data
  – Effort started in 1996 as a DARPA-funded project
  – Now supports over 30 national/international projects

• Development team of 12 staff is led by
  – Michael Wan, data management systems
  – Arcot Rajasekar, information management systems

Page 26: Data Grids, Digital Libraries, and Persistent Archives

Data Grid Capabilities

• Data manipulation
  – Containers
  – Parallel I/O
  – Firewall interactions

• Resource interactions
  – Fault tolerance
  – Load leveling
  – Replication

• HIPAA security requirements
  – Authentication of all users
  – Access controls on data and metadata
  – Audit trails
  – Data encryption
  – Centralized control

• Application interfaces
  – C library, Shell commands, Java, Perl, Python, WSDL, workflow

Page 27: Data Grids, Digital Libraries, and Persistent Archives

Digital Library

• Collection hierarchy for organizing data
  – User-defined metadata
  – Collection level metadata

• Metadata manipulation
  – Schema extension
  – Bulk metadata processing
  – Queries on metadata
  – Access controls on metadata
  – Views on collections

• Digital library APIs
  – DSpace, Fedora, OAI-PMH, web browsers
  – METS metadata XML schema

Page 29: Data Grids, Digital Libraries, and Persistent Archives

Persistent Archives

• Authenticity metadata
  – Provenance
  – User logical name space

• Integrity metadata
  – Audit trails, checksums
  – Access controls

• Consistency
  – Context update on all content operations

• Persistency
  – Infrastructure independence

• Storage repository abstraction

• Information repository abstraction

• Access abstraction (standard operations)
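
The integrity and consistency bullets can be illustrated with a collection that updates its context (a checksum and an audit record) on every content operation. This is a sketch under stated assumptions: `ArchiveCollection` is a hypothetical class, and SHA-256 is chosen here only for illustration, since the slides do not name a checksum algorithm.

```python
import hashlib


class ArchiveCollection:
    """Toy collection that records integrity metadata for each operation."""

    def __init__(self):
        self.content = {}    # logical name -> bytes
        self.checksums = {}  # logical name -> hex digest recorded at ingest
        self.audit = []      # (user, operation, logical name) tuples

    def put(self, user, name, data):
        """Store content and update its context in the same operation."""
        self.content[name] = data
        self.checksums[name] = hashlib.sha256(data).hexdigest()
        self.audit.append((user, "put", name))

    def verify(self, user, name):
        """Recompute the checksum and compare it with the recorded value."""
        current = hashlib.sha256(self.content[name]).hexdigest()
        self.audit.append((user, "verify", name))
        return current == self.checksums[name]


col = ArchiveCollection()
col.put("alice", "doc1", b"original")
intact = col.verify("alice", "doc1")

# Simulate corruption that bypasses the collection's own operations.
col.content["doc1"] = b"tampered"
damaged = col.verify("alice", "doc1")
```

Because the checksum is written by `put` rather than by the caller, content changed outside the collection's operations shows up as a mismatch on the next verification.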

Page 30: Data Grids, Digital Libraries, and Persistent Archives

National Archives Persistent Archive

NARA (MCAT): Principal copy stored at NARA with complete metadata catalog

U Md (MCAT): Replicated copy at U Md for improved access, load balancing, and disaster recovery

SDSC (MCAT): Deep archive at SDSC, no user access, but complete copy

Page 31: Data Grids, Digital Libraries, and Persistent Archives

Data Grid Federation - zoneSRB

[Figure: layered SRB architecture. Labels recovered from the figure are listed below, from the application layer down to the storage layer.]

• Application

• Access mechanisms
  – C, C++, Java Libraries
  – Unix Shell
  – Linux I/O
  – DLL / Python, Perl
  – Java, NT Browser
  – Kepler Actors
  – OAI, WSDL, WSRF
  – HTTP, DSpace, OpenDAP

• Federation Management

• Consistency & Metadata Management / Authorization, Authentication, Audit

• Logical Name Space, Latency Management, Data Transport, Metadata Transport

• Catalog Abstraction
  – Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix

• Storage Repository Virtualization
  – Archives: Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
  – ORB
  – Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix
  – File Systems: Unix, NT, Mac OSX

Page 32: Data Grids, Digital Libraries, and Persistent Archives

Examples of Extensibility

• Storage Repository Driver evolution
  – Initially supported Unix file system
  – Added archival access - UniTree, HPSS
  – Added FTP/HTTP
  – Added database blob access
  – Added database table interface
  – Added Windows file system
  – Added project archives - Dcache, Castor, ADS
  – Added Object Ring Buffer, Datascope
  – Adding GridFTP version 3.3

• Database management evolution
  – Postgres
  – DB2
  – Oracle
  – Informix
  – Sybase
  – mySQL (most difficult port - no locks, no views, limited SQL)

Page 33: Data Grids, Digital Libraries, and Persistent Archives

Examples of Extensibility

• The 3 fundamental APIs are C library, shell commands, Java
  – Other access mechanisms are ported on top of these interfaces

• API evolution
  – Initial access through C library, Unix shell command
  – Added iNQ Windows browser (C++ library)
  – Added mySRB Web browser (C library and shell commands)
  – Added Java (Jargon)
  – Added Perl/Python load libraries (shell command)
  – Added WSDL (Java)
  – Added OAI-PMH, OpenDAP, DSpace digital library (Java)
  – Added Kepler actors for dataflow access (Java)
  – Adding GridFTP version 3.3 (C library)

Page 34: Data Grids, Digital Libraries, and Persistent Archives

Sites Using the SRB

CiteSeer, Penn State
City Univ. of New York
Geospatial Environment, UCSD
Drexel University
EOSDIS Distributed Active, NASA Goddard
Georgia Tech
Kentucky State Libraries & Archives
Library of Congress
Los Alamos National Lab
NASA Ames
NASA Goddard Space Flight Center
NCSA Grid Computing
NIH (NCI Center for Bioinformatics)
Penn State University
Pittsburgh Supercomputing Center
Purdue University, Indiana
Stanford University
TACC, University of Texas
Texas A & M
UC Santa Cruz
UCLA
UCSD Neuroscience
University of Maryland
University of Michigan, CAC department
University of New Mexico
University of Washington
University of Wisconsin
USC
Yale University

Academia Sinica, Taiwan
ASCC, Computing Centre, Taiwan
Australian National University
Bedford Oceanography, Canada
Bioinformatics Institute, Singapore
CSIRO, Australia
Data Storage Institute, Singapore
EGEE, French National Center
GeoForschungsZentrum, Germany
James Cook University, Australia
KEK High Energy Physics, Japan
Max Planck Institute, Netherlands
Parallab, Norway
South Australian Advanced Computing
UIB (Parallab), Norway
University of Amsterdam
University of Cambridge, Astronomy
University of Cambridge, e-Science
University of Edinburgh
University of Genoa, Italy
University of Hong Kong
University of Manchester
University of Oslo
University of Southampton
York Univ (UK)

Page 35: Data Grids, Digital Libraries, and Persistent Archives

Storage Resource Broker Collections at SDSC (11/2/2004)

Collection | GBs of data stored | Number of files | Number of users

Data Grid
NSF/ITR - National Virtual Observatory | 53,858 | 9,536,698 | 80
NSF - National Partnership for Advanced Computational Infrastructure | 24,738 | 5,754,890 | 380
Hayden Planetarium - Evolution of the Solar System visualizations | 7,201 | 113,600 | 178
NSF/NPACI - Joint Center for Structural Genomics | 5,228 | 652,031 | 50
NSF/NPACI - Biology and Environmental collections | 8,851 | 33,340 | 67
NSF - TeraGrid, ENZO Cosmology simulations | 121,550 | 1,096,947 | 3,247
NIH - Biomedical Informatics Research Network | 6,002 | 4,107,508 | 214

Digital Library
NLM - Digital Embryo image collection | 720 | 45,365 | 23
NSF/NPACI - Long Term Ecological Reserve | 253 | 8,436 | 36
NSF/NPACI - Grid Portal | 2,211 | 51,227 | 407
NIH - Alliance for Cell Signaling microarray data | 856 | 62,291 | 21
NSF - National Science Digital Library SIO Explorer collection | 2,080 | 808,901 | 27
NSF/NPACI - Transana education research video collection | 92 | 2,387 | 26
NSF/ITR - Southern California Earthquake Center | 91,040 | 1,791,494 | 62

Persistent Archive
UCSD Libraries archive | 128 | 204,828 | 29
NARA - Research Prototype Persistent Archive | 166 | 316,813 | 58
NSF - National Science Digital Library persistent archive | 3,571 | 26,908,350 | 122

TOTAL | 328 TB | 51 million | 4,900

Page 36: Data Grids, Digital Libraries, and Persistent Archives

Grid Interfaces

• GSI, support versions 1, 2, 3, Java

• GridFTP version 3.3 interface to SRB collection
  – Use GSI certificate to identify the user to the SRB
  – Reference file by a SRB logical name space
  – Use SRB access controls for allowed operations
  – Initially support serial transport
  – SRB supports 4 different firewall interaction protocols (client-driven parallel I/O, server-driven parallel I/O, bulk file registration, federated data grid access)

• GridFTP version 3.3 driver for SRB collection
  – Store data at a remote site under the SRB ID
    • Data will be shareable through SRB access controls
  – Store data at a remote site under user GSI certificate
    • Data will not be shareable through SRB access controls

Page 37: Data Grids, Digital Libraries, and Persistent Archives

Grid Interfaces

• Replica Location Service Interface
  – Simon Metson <[email protected]>
  – GMCat mimics the LRC interface, enabling the files registered in an MCat to appear on the giggle framework (RLS).
  – Available from http://tuber1.phy.bris.ac.uk:8080/GMCatWS3
  – (also linked from the third party software on the SRB page)

• Storage Resource Manager
  – SRM Version 1, SRB driver created to store data in SRM
  – SRM Version 2, development effort to put SRM interface on top of SRB (Alasdair Earl)
  – SRM Version 3, development effort to put SRM interface on top of SRB (Peter Kunszt)

Page 38: Data Grids, Digital Libraries, and Persistent Archives

Conclusion

• Distributed data management systems can be built on generic data grid infrastructure
  – Data grids to support bulk access across remote sites
  – Integration of data grid and digital library capabilities to manage massive data collections
  – Federation of data grids to build international discipline-wide collections

Page 39: Data Grids, Digital Libraries, and Persistent Archives

SDSC SRB Team (left to right)

• Arun Jagatheesan
• George Kremenek
• Sheau-Yen Chen
• Arcot Rajasekar (SRB development lead)
• Reagan Moore (SRB PI)
• Michael Wan (SRB architect)
• Roman Olschanowsky (BIRN)
• Bing Zhu
• Charlie Cowart
• Lucas Gilbert
• Tim Warnock
• Wayne Schroeder (SRB product)
• Adam Birnbaum (SRB production)
• Antoine De Torcy
• Vicky Rowley (BIRN)
• Marcio Faerman (SCEC)
• Students & emeritus
  – Erik Vandekieft
  – Reena Mathew
  – Xi (Cynthia) Sheng
  – Allen Ding
  – Grace Lin
  – Qiao Xin
  – Daniel Moore
  – Ethan Chen
  – Jon Weinburg

• Supported by about 20 projects (NSF, DOE, NASA, NARA, NIH, LOC, NHPRC)

Page 40: Data Grids, Digital Libraries, and Persistent Archives

For More Information

Reagan W. Moore
San Diego Supercomputer Center

[email protected]

http://www.npaci.edu/DICE

http://www.npaci.edu/DICE/SRB

http://www.npaci.edu/dice/srb/mySRB/mySRB.html