
Data Grids, Digital Libraries, and Persistent Archives: An Integrated Approach to Sharing, Publishing, and Archiving Data

REAGAN W. MOORE, ARCOT RAJASEKAR, AND MICHAEL WAN, MEMBER, IEEE

Invited Paper

The integration of grid, data grid, digital library, and preservation technology has resulted in software infrastructure that is uniquely suited to the generation and management of data. Grids provide support for the organization, management, and application of processes. Data grids manage the resulting digital entities. Digital libraries provide support for the management of information associated with the digital entities. Persistent archives provide long-term preservation. We examine the synergies between these data management systems and the future evolution that is required for the generation and management of information.

Keywords—Data grids, digital libraries, persistent archives, information management.

I. INTRODUCTION

Data grids support massive data collections that are distributed across multiple institutions. Communities such as the National Institutes of Health (NIH) Biomedical Informatics Research Network [1] (16 sites, 4 million files, 6 TB of data) promote the sharing of data between NIH-funded researchers by federating access to geographically remote storage systems. International collaborations such as the Worldwide Universities Network [2] (five initial sites) support the sharing of data between academic institutions in the United States and the United Kingdom. National Science Foundation (NSF)-funded Information Technology Research projects such as the Southern California Earthquake Center [3] (five sites, 1.7 million files, 91 TB of data) build digital libraries of domain-specific material for publication and use by all members of the scientific discipline. The NSF National Science Education Digital Library [4] (26 million files, 3.5 TB of data) uses data grid technology to implement a persistent archive of material that has been gathered from Web crawls. The SIOExplorer project [22] (808 000 files, 2 TB of data) manages an archive of ship logs from oceanographic research vessels. All of these projects are faced with the organization of digital entities into collections, the assignment of descriptive metadata to support discovery, and the controlled access to data that are distributed across multiple sites. All of these projects use collections to provide a context for the interpretation of their digital entities. All of these systems are based upon a generic data management infrastructure, the San Diego Supercomputer Center (SDSC), San Diego, CA, Storage Resource Broker (SRB) [5].

Manuscript received March 1, 2004; revised June 1, 2004. This work was supported in part by the National Science Foundation (NSF) National Partnership for Advanced Computational Infrastructure (NPACI) under Grant ACI-9619020 (National Archives and Records Administration supplement), in part by the NSF Digital Library Initiative Phase II Interlib project, in part by the NSF National Science Digital Library under Subaward S02-36645, in part by the Department of Energy Scientific Data Management project under Award DE-FC02-01ER25486 and the Particle Physics Data Grid, in part by the NSF National Virtual Observatory, in part by the NSF Grid Physics Network, and in part by the NASA Information Power Grid. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the National Science Foundation, the National Archives and Records Administration, or the U.S. government.

The authors are with the San Diego Supercomputer Center, San Diego, CA 92093-0505 USA (e-mail: [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/JPROC.2004.842761

The management of data has traditionally been supported by software systems that assume explicit control over local storage systems (file systems) or that assume local control over information records (databases). The SRB manages distributed data, enabling the creation of data grids that focus on the sharing of data, digital libraries that focus on the publication of data, and persistent archives that focus on the preservation of data. Data grid technology provides the fundamental management mechanisms for distributed data. This includes support for managing data on remote storage systems, a uniform name space for referencing the data, a catalog for managing information about the data, and mechanisms for interfacing to the preferred access method. Digital libraries can be implemented on top of data grids through the addition of mechanisms to support collection creation, browsing, and discovery. The underlying operations include schema extension, bulk metadata load, import and export of metadata encapsulated in XML, and management of collection hierarchies. Persistent archives can be implemented on top of data grids by addition of integrity metadata needed to assert the invariance of the deposited material. The mechanisms provided by data grids to manage access to heterogeneous data resources can also be used to manage migration from old systems to new systems, and hence manage technology evolution. The SRB is being used as the underlying infrastructure for both digital libraries and persistent archives and is a proof in practice that common infrastructure can be used for data management.

Despite the success in integrating digital libraries and data grids, significant challenges remain. The issues are related to information generation and management and can be expressed as characterization of the criteria used to federate access across multiple data management environments. A careful explanation is needed of precisely what we mean by the terms data, information, and knowledge [10]. The data grid community defines "data" to be the strings of bits that compose a digital entity. A digital entity might represent, for example, a data file, an object in an object ring buffer, a record in a database, a URL, or a binary large object in a database. Data are stored in storage repositories (file systems, archives, databases, etc.). Meaning is assigned to a digital entity by associating a semantic label. Information consists of the set of semantic labels that are assigned to strings of bits. The semantic labels can be used to assert a name for a digital entity, assert a property of a digital entity, and assert relationships that are true about a digital entity. Information is stored in information repositories (relational databases, XML databases, flat files, etc.). The combination of a semantic label and associated data is treated as metadata. Metadata are organized through specification of a schema and stored as attributes in a relational database. The digital entities that are registered into the database comprise a collection. The metadata in the collection in turn provides the context for interpreting the significance of the registered digital entities.
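
To make the distinction concrete, the following sketch (illustrative only; the class and attribute names are not the SRB schema) shows a digital entity as a string of bits held in a storage repository, and a collection that assigns semantic labels to each registered entity, providing its interpretive context.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DigitalEntity:
    """Data: the bits, identified in a logical name space, held in a repository."""
    logical_name: str
    storage_location: str      # e.g., a file system path, archive, or database blob
    size_bytes: int

@dataclass
class Collection:
    """Information: semantic labels assigned to the registered digital entities."""
    name: str
    entities: List[DigitalEntity] = field(default_factory=list)
    metadata: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def register(self, entity: DigitalEntity, labels: Dict[str, str]) -> None:
        # Registration places the entity in the collection and records the
        # semantic labels that give the bits a context for interpretation.
        self.entities.append(entity)
        self.metadata[entity.logical_name] = labels

survey = Collection("sky-survey")
survey.register(
    DigitalEntity("images/field_001.fits", "hpss://archive/f001", 2_097_152),
    {"instrument": "2MASS", "band": "J", "observation_date": "1999-06-01"},
)
print(survey.metadata["images/field_001.fits"])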

Grids manage distributed execution of processes. The SRB data grid manages simulation results, observational data, and derived data products. Grids and data grids are complementary technologies that together enable the creation and management of data. Digital libraries organize information in collections. Persistent archives preserve the information content of collections. Persistent archives manage the evolution of all components of the hardware and software infrastructure, including the encoding syntax standards for data models. The integration of information management is one of the next steps in the evolution of grid technology.

We examine how grid technology has evolved, describe the current state of the art in data grid technology, and then demonstrate the evolution required in grid technology for the characterization of information and the integration of digital library and persistent archive technology. An integrated environment for the generation, publication, sharing, and preservation of information is the next step in grid infrastructure.

Table 1. Evolution of Grid Functionality

II. GRID EVOLUTION

One approach to understanding the current state of grid services is to look at how grid technology has evolved over the last four years [6]. The original grid environments assumed that applications directly accessed remote data that were stored under the user's Unix ID, that data would be pulled to the computation, that accesses could be based upon physical file names, and that the applications would access data through library calls. Generalizations now exist for each of these functions, typically implemented as naming indirection abstractions. In Table 1, the evolution path is shown for each function. The left-hand column represents the original grid approach, the middle column represents functions provided by current digital library and persistent archive technology, and the right-hand column defines the capability enabled by the new function.

Each of the evolutionary steps required the specification of a new naming convention for resources, users, files, collections, and services. The naming convention made it possible for a community to create uniform labels for accessing remote data and resources. The aggregation of the naming conventions is called a virtual organization [56]. A virtual organization is created to meet the needs of a particular group, project, or institution. It is quite possible for virtual organizations to implement different naming conventions.
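
As an illustration of this naming indirection, the sketch below (names and addresses are invented) shows a virtual organization that keeps uniform logical labels for resources and files and resolves them to the physical entities they denote; a second virtual organization could apply a different convention to the same entities.

class VirtualOrganization:
    def __init__(self, name: str):
        self.name = name
        self.resources = {}   # logical resource name -> physical storage address
        self.files = {}       # logical file name -> (logical resource, physical path)

    def register_resource(self, logical: str, physical_address: str) -> None:
        self.resources[logical] = physical_address

    def register_file(self, logical_file: str, logical_resource: str, physical_path: str) -> None:
        self.files[logical_file] = (logical_resource, physical_path)

    def resolve(self, logical_file: str) -> str:
        # The community-agreed mapping from uniform label to remote physical
        # location; another virtual organization may apply its own convention.
        resource, path = self.files[logical_file]
        return self.resources[resource] + path

vo = VirtualOrganization("example-project")
vo.register_resource("archive-1", "hpss://archive.example.org")
vo.register_file("/example/run42/output.dat", "archive-1", "/projects/run42/output.dat")
print(vo.resolve("/example/run42/output.dat"))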

The naming conventions are assigned by a set of criteria specific to each virtual organization. The criteria might depend upon cultural considerations (status of a person within a project), organizational considerations (site that owns a resource), or choice of infrastructure (software systems used to implement the name space). The assignment of names corresponds to the creation of a new semantic label for each entity. The creation of the semantic label is an assertion by each virtual organization that the associated criteria have been met.

Federation is the sharing of resources, user names, files, and metadata between grids. When grids are federated, the underlying assumptions governing the creation of the name spaces must be integrated. The name space integration is possible if the assumptions underlying the application of the naming convention are compatible. The future evolution of the grid will strongly rely upon the use of information management technologies that can express the criteria used to assign semantic labels.

An additional observation is that the driving motivation for many of the grid evolutionary steps has been the need to manage the results created by services, in addition to managing the execution of the services. Digital libraries and persistent archives focus on the management of the data that results from the application of services. They define a context that includes the state information that results from all processes performed upon a digital entity and organize the digital entities into a collection. For grid technology to support end-to-end data management applications, it will need to incorporate digital library information management capabilities as well as persistent archive technology management capabilities.

III. INTEGRATING DIGITAL LIBRARIES AND DATA GRIDS: SPANNING THE INFORMATION DIVIDE

A major research issue in data grids and digital libraries is the integration of knowledge management systems with existing data and information management systems. Knowledge management is needed to support constraints that are applied in federation of data grids and in semantic crosswalks between digital libraries. A growing number of communities [from astronomy (NVO [7]) to neuroscience (Biomedical Informatics Research Network (BIRN) [1]) to ecology (SEEK [8]) to geology (GEON [9])] are developing grids and digital portals for organizing, sharing, and archiving their scientific data. At SDSC, we have seen these and other communities specify diverse and sometimes orthogonal requirements for managing and sharing their data. The integration of constraint-based knowledge management technology with existing state-of-the-art data grids, digital libraries, and persistent archives requires equivalent support for relationship-based constraints across all three environments. The application of constraints for the integration of data grids and digital libraries will be an essential part of cyberinfrastructure.

The assignment of a semantic label to a digital entity requires the application of a processing step. A set of relationships or logical rules is used to assert the application of the semantic label. An example is the naming of the fields within a binary file. For a scientific data set, one might attach the following types of semantic labels to a field:

• name of the physical variable that is represented by the file (asserted by the structural order of the fields within the data set);
• units associated with the physical variable (asserted by a choice of metrics for the field);
• data model by which the bits are organized (asserted as, say, a column-ordered or row-ordered array);
• structural mapping implied by the data model (asserted as the type of geometry and coordinate system);
• spatial mapping imposed on the data model (asserted through the number of spatial dimensions);
• procedural mapping imposed on the data model (asserted through the name of the last processing step).

Each digital entity may have multiple semantic labels that are used to characterize its meaning. Of interest is the fact that a semantic label typically represents the application of multiple relationships. The assertions behind the application of a semantic label can be used to define a context for the semantic label, essentially an information context.
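
The sketch below illustrates this point with invented attribute values: a single semantic label attached to a field of a scientific data set carries the several assertions listed above, and that set of assertions is the label's information context.

semantic_labels = {
    "temperature": {                   # name of the physical variable
        "asserted_by": "structural order of the fields within the data set",
        "units": "kelvin",             # choice of metric for the field
        "data_model": "row-ordered array",
        "structural_mapping": "regular grid, Cartesian coordinates",
        "spatial_mapping": "3 spatial dimensions",
        "procedural_mapping": "last processing step: regridding to uniform mesh",
    }
}

# The information context of the label is the full set of assertions behind it.
for label, context in semantic_labels.items():
    print(label, "is asserted by:", context["asserted_by"])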

Knowledge is the expression of relationships between semantic labels. Relationships are typically typed as logical ("is a"; "has a"), structural (existence of a structure within the string of bits), spatial (mapping of a string of bits to a coordinate system), temporal (mapping to a point in time), procedural (mapping to process results), functional (mapping of features to evaluation algorithms), and systemic (properties that cover all members of a collection). The management of knowledge requires the ability to describe, organize, and apply relationships.

Knowledge generation is closely tied to the processing of data. Each semantic label is the result of the application of a process (set of relationships and rules) that determines whether or not the semantic label can be applied to a given digital entity. The rules and relationships can be interpreted as "constraints." Information is created by the application of constraints appropriate for a given community. One can view the creation of derived information (new semantic labels or new data sets) from a given data collection as the application of rules and relationships. Each type of knowledge constraint can be given a name and associated with a digital entity as a semantic label. The digital library community encapsulates knowledge constraints in the curation processes that are applied when a collection is assembled. The preservation community encapsulates knowledge constraints in the archival processes that are applied when the archival collection is created [11]–[13]. The data grid community characterizes knowledge constraints as applied processes or functions that transform digital entities into derived data products [14]. In each case, the process is encapsulated as "rigidly built" software that is applied to digital entities.

A major change in perspective is needed when dealing with sociological imperatives that arise from interactions between independent groups of researchers. Each group has its own set of assumptions about the set of constraints that should be applied for the creation of a specified semantic label or for a specified action to be performed to create a derived data product. With current technology, it is not possible to specify such relationships. A major change in data and information infrastructure is needed to associate knowledge constraints with each assertion of a semantic label. The result will be the ability to compare the intended semantic meaning between research groups when a process is applied in a data grid, digital library, or persistent archive.

In practice, the requirement for management of knowledge constraints is pervasive even within the data management infrastructure itself. A simple example is the federation of digital libraries or data grids. Federations provide mechanisms to share storage resources, digital entities, user identities, and information about the digital entities. Constraints are needed to enforce controls on interactions between the federated data management systems, both for access and for consistency. The constraints constitute relationships or rules that must be evaluated each time the shared item is accessed. For the digital library community, access constraints include digital library crosswalks that define how semantic labels within one community may be mapped to semantic labels used by another community. The preservation community associates authenticity metadata with each digital entity, which constitutes an assertion about the archival processes that have been applied. By keeping track of all of the archival processes that have been applied, assertions can be made about the lineage of a digital entity, and whether it continues to represent the original digital entity that was deposited into the preservation environment.
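
A digital library crosswalk of the kind mentioned above can be pictured as a simple mapping evaluated at access time; the attribute names in this sketch are invented.

crosswalk = {
    # community A label      ->  community B label (both invented)
    "creator": "author",
    "coverage_spatial": "bounding_box",
    "date_created": "observation_date",
}

def translate(record_a: dict) -> dict:
    """Re-express a community-A metadata record in community-B terms."""
    return {crosswalk.get(key, key): value for key, value in record_a.items()}

print(translate({"creator": "SIO", "date_created": "2003-05-14"}))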

It is possible to build a static system in which the knowledge relationships are specified in software and applied at the time of access. This is the approach used in current data grid technology. When constraints change in time, or when collections are federated, the dynamic application of changing constraints becomes essential to avoid having to rewrite software. Knowledge management technology will be viable when it is possible to change the relationship assertion behind the creation of an information label, either to apply an updated form of the relationship or to apply the relationship assumed by another group that is now viewing the data. The ability to federate data grids, digital libraries, and persistent archives strongly depends upon the ability to dynamically apply the knowledge relationships expected by each group participating in the federation.

The context used to describe digital entities consists of the semantic labels (information) that are assigned to each digital entity. The context used to describe a semantic label consists of the relationships (knowledge) used to assert the application of the semantic label. Traditionally, the assertions used to apply a semantic label are characterized as relationships, organized as an ontology, and managed in a knowledge base or concept space. The information context is a generalization of a semantic label, allowing the multiple properties that are represented by the semantic label to be expressed.

The integration of grids (support for application of processing steps) with digital libraries (support for managing the semantic labels assigned as a result of the processing steps) provides the simplest approach to the creation of a true information management system. The SRB provides a common data management infrastructure for integrating data grids, digital libraries, and persistent archives. The ability to characterize the relationships that underlie the assignment of a semantic label constitutes an integral part of the information management infrastructure. The ability to characterize the information context behind a semantic label is needed to build the next-generation information management systems.

IV. SRB—INTEGRATION OF DIGITAL LIBRARIES AND DATA GRIDS

SDSC has collaborated extensively with the communities listed in Table 2 on the development of data and information management technology. A generic data management system, called the Storage Resource Broker, was developed which is used to build digital libraries for the publication of data, data grids for the sharing of data, and persistent archives for the preservation of data [5], [15], [16], [55]. Because the SRB supports the capabilities required by all of the listed projects, the SRB has become the most advanced data management system in production use in academia for the organization and management of distributed data.

Table 2. Example Projects Using the SRB Technology

The SRB is used extensively within the National Partnership for Advanced Computational Infrastructure (NPACI) project [17], with over 350 TB of data stored under the management of the SRB at SDSC, comprising over 50 million files. Supported projects include computer science, scientific discipline collections, education initiatives, and international collaborations. NPACI computational science researchers use the system for data sharing (one user registered over 500 000 files into the system to build a logical name space that he shared with his students while he was on sabbatical), and for data publication (one user registered over 0.5 TB of data as a digital library for Web-based discovery and access). Many groups are using the SRB to support replication of data onto the TeraGrid for bulk data analysis (2-Micron All Sky Survey [18], Digital Palomar Observatory Sky Survey [19]).

Other groups access the SDSC archive (Joint Center for Structural Genomics beam line data [20], Alliance for Cell Signaling microarray data [21]), or build data sharing environments (Scripps Institution of Oceanography voyage logs [22], GPS sensor data archiving [23], and Long Term Ecological Reserve data grid for collection federation [24]). The projects include international collaborations that are installing data grids that span multiple countries (Worldwide Universities Network [2], the Compact Muon Solenoid high energy physics experiment [25], and the BaBar high energy physics experiment [26]). The latter project relies upon federation of data grids to meet sociological requirements on data distribution and sharing.

The implementation of the SRB [16] technology for use within the NPACI data grid required the development of fundamental virtualization mechanisms [41]. A storage repository virtualization was created that defined the set of operations that can be performed on any storage system. The abstraction includes Unix file operations (create, open, close, unlink, read, write, seek, sync, stat, fstat, mkdir, rmdir, chmod, opendir, closedir, and readdir). Additional remote operations were implemented for latency management and metadata manipulation. Drivers were implemented to map from the storage repository abstraction to the protocol required by Unix file systems (Linux, AIX, Irix, Solaris, Sun OS, Mac OS X, Unicos), by Windows file systems, by archives (HPSS, Unitree, ADSM, DMF), database blobs (Oracle, DB2, Sybase, SQLServer, Postgres, Informix), object ring buffers, storage resource managers, FTP sites, GridFTP, and tape drives [42].
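
A minimal sketch of such a storage repository virtualization follows. It is not the actual SRB driver interface; it only illustrates the pattern of one abstract set of POSIX-like operations with per-protocol drivers behind it.

import os
from abc import ABC, abstractmethod

class StorageDriver(ABC):
    """Operations every storage system must support, independent of protocol."""

    @abstractmethod
    def create(self, path: str) -> None: ...

    @abstractmethod
    def read(self, path: str, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def write(self, path: str, offset: int, data: bytes) -> int: ...

    @abstractmethod
    def stat(self, path: str) -> dict: ...

class UnixFileDriver(StorageDriver):
    """Maps the abstraction onto an ordinary local file system."""

    def create(self, path):
        open(path, "wb").close()

    def read(self, path, offset, length):
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def write(self, path, offset, data):
        with open(path, "r+b") as f:
            f.seek(offset)
            return f.write(data)

    def stat(self, path):
        info = os.stat(path)
        return {"size": info.st_size, "mtime": info.st_mtime}

# An archive, database-blob, or FTP driver would implement the same operations
# against its own protocol; the data grid calls only the abstract interface.
if __name__ == "__main__":
    import tempfile
    path = os.path.join(tempfile.mkdtemp(), "demo.bin")
    driver = UnixFileDriver()
    driver.create(path)
    driver.write(path, 0, b"hello")
    print(driver.read(path, 0, 5), driver.stat(path)["size"])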

A data virtualization mechanism was implemented to support collections that spanned multiple storage repositories. A logical name space provides a persistent, infrastructure-independent naming convention. The logical name space is organized as a collection hierarchy, permitting the management of administrative, descriptive, and authenticity metadata for each digital entity registered into the data grid.

An information repository virtualization was defined for manipulating collections that are stored in databases [43]. The abstraction consists of the operations needed to add new metadata attributes, automate SQL generation, support template-based metadata generation, support bulk metadata load, support distributed joins across databases via token-based semantic interoperability, support metadata formatting into XML or HTML files, etc.
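
The sketch below illustrates a few of these catalog operations (adding attributes, bulk metadata load, export to XML) against an in-memory relational table; the table layout and method names are invented and are not the MCAT schema.

import sqlite3
from xml.etree.ElementTree import Element, SubElement, tostring

class MetadataCatalog:
    """Invented stand-in for an information repository abstraction."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE metadata (entity TEXT, attribute TEXT, value TEXT)")

    def add_attribute(self, entity, attribute, value):
        # Adding a new attribute is just another row; no table change is needed here.
        self.db.execute("INSERT INTO metadata VALUES (?, ?, ?)", (entity, attribute, value))

    def bulk_load(self, rows):
        # rows: iterable of (entity, attribute, value) tuples loaded in one call
        self.db.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)

    def export_xml(self, entity):
        root = Element("entity", name=entity)
        rows = self.db.execute(
            "SELECT attribute, value FROM metadata WHERE entity = ?", (entity,))
        for attribute, value in rows:
            SubElement(root, attribute).text = value
        return tostring(root, encoding="unicode")

catalog = MetadataCatalog()
catalog.bulk_load([("img001", "band", "J"), ("img001", "exposure_seconds", "7.8")])
catalog.add_attribute("img001", "survey", "2MASS")
print(catalog.export_xml("img001"))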

A service virtualization was defined for the set of operations that a user could initiate, equivalently the services provided by the SRB data grid [44]. From the service abstraction, it is possible to map to any preferred access mechanism, including C library calls, C++ library calls, Unix shell commands, Python shell commands, Perl shell commands, Windows browsers, Web browsers, Java, a WSDL/SOAP interface [45], [46], an Open Archives Initiative interface, etc. The result is an interoperability environment that lets researchers apply their preferred access mechanisms to any of the resources for which SRB drivers have been created [47]. The service abstraction was used to implement latency management operations (prefetch, cache, stage, stream, replicate, data aggregation in containers, metadata aggregation in XML files [48], I/O command aggregation through remote proxy execution), and all of the operations supported by the storage and information repository virtualizations.

The projects listed in Table 2 required the ability to support processing at the remote storage systems where the data was located. An interesting view of data grid technology is realized by examining the different types of remote processing operations that were required.

• National Aeronautics and Space Administration (NASA) Information Power Grid—"traditional" data grid [32]. Bulk operations are used to register files into the grid. Containers are used to package (aggregate) files before loading into an archive. Transport operations are specified through logical file names.

• NASA Data Management System/Global Modeling and Assimilation Office—data grid [37]. The logical name space is partitioned across multiple physical directories to improve performance. The OpenDAP access protocol [57] was ported on top of the SRB.

• Department of Energy (DOE) Particle Physics Data Grid (PPDG)/BaBar high-energy physics experiment—data grid [26]. Bulk operations are used to register files, load files into the data grid, and unload files from the data grid. A bulk remove operation has been requested to complement the bulk registration operation. Staging and status operations are used to interact with a hierarchical storage manager.

• National Virtual Observatory (NVO)/United States Naval Observatory-B—data grid [7]. Registration of files is coordinated with the movement of grid bricks. Data is written to a disk cache locally (grid brick). The grid brick is physically moved to a remote site where bulk registration and bulk load are invoked on the grid brick to import the data into the data grid.

• NSF/NPACI—data grid [17]. Containers are used to minimize the impact on the archive name space for large collections of small files. Remote processes are used for metadata extraction. The seek operation is used to optimize paging of data for a four-dimensional visualization rendering system. Data transfers are invoked using server-initiated parallel I/O to optimize interactions with the HPSS archive. Bulk registration, load, and unload are used for collections of small data. Results from queries on databases are aggregated into XML files for transport.

• NIH/BIRN—data grid [1]. Encryption and compression of data are managed at the remote storage system as a property of the logical name space. This ensures privacy of data during transport.


• NSF/Real-time Observatories, Applications, and Data management Network (ROADnet)—data grid [23]. Queries are made to object ring buffers to obtain result sets.

• NSF/Joint Center for Structural Genomics—data grid [20]. Parallel I/O is used to push experimental data into remote archives, with data aggregated into containers.

• NVO/2-Micron All Sky Survey—digital library [18]. Five million images are aggregated into 147 000 containers for storage in an archive. An image cutout service is implemented as a remote process, executed directly on the remote storage system. A metadata extraction service is run as a remote process, with the metadata parsed from the image file headers and aggregated before transfer.

• NVO/Digital Palomar Observatory Sky Survey—digital library [19]. Bulk registration is used to register the images. An image cutout service is implemented as a remote process, executed directly on the remote storage repository.

• NSF/Southern California Earthquake Center—digital library [3]. Bulk registration of files is used to load simulation output files into the logical name space (1.5 million files generated in a simulation using 3000 time steps).

• National Archives and Records Administration (NARA)—persistent archive [39]. Bulk registration, load, and unload are used to access digital entities from Web archives. Containers are used to aggregate files before storage in archives. Transport operations are automatically forwarded to the appropriate data grid for execution through peer-to-peer federation mechanisms.

• NSF/National Science Digital Library (NSDL)—persistent archive [4]. Bulk registration, load, and unload are used to import digital entities into an archive. Web browsers are used to access and display the imported data, using HTTP.

Fig. 1. NVO architecture.

The SRB is the underlying data management technology in each of these projects. However, each project integrates the SRB with additional systems to create the final data management system. The resulting architectures typically have similar components to those used in the NVO environment.

Fig. 1 lists the components that are used to implement the NVO architecture. The components include:

• portals that provide a user interface to the NVO services;
• a registry for publishing the existence of NVO services;
• Web-based services that implement interactive data manipulation or analysis tasks;
• workflow environments for support of processing pipelines;
• the SRB data grid for access to the storage repositories;
• grid software for distributed computation;
• catalogs and image archives of sky surveys;
• storage systems and archives.

Data grid technology is the interface between the storage systems, image archives, dataflow environments, and Web-based services. What is immediately obvious is that multiple types of interfaces must be supported. The data grid translates from the protocols used by the storage repositories to the access mechanisms used by a particular data manipulation environment. Thus the data grid serves as an interoperability mechanism. In the NVO architecture, the compute services are used for bulk operations performed on entire data collections. The data services provide interactive access.

The SRB continues to be the leading data management environment. The concepts implemented and proven in the SRB are now being used by practically all other data grid implementations. These concepts include use of a federated client-server architecture to manage interactions with heterogeneous physical resources, use of a logical name space to build global location-independent identifiers, mapping of attributes onto the logical name space to manage service state information, and use of access controls on digital entities to manage interactions with collection- or community-owned data. Explicit services developed within the SRB for replication, aggregation of data into containers, support for user-defined metadata, role-based access controls, and ticket-based authentication are now being implemented in other data grids, including the Globus toolkit [49].

V. DATA MANAGEMENT CONCEPTS

A generic approach has been pursued at SDSC to identify the fundamental distributed data management concepts.


The concepts are best illustrated in terms of data grid terminology, but can also be readily applied to digital libraries and persistent archives. Distributed data management proceeds by the creation of logical name spaces that are used to assign global persistent identifiers to digital entities, users, resources, and applications. The logical name spaces provide a location-independent naming convention. Grid services map distributed state information to the logical names as attributes. An example is a mapping from a logical digital entity name to a physical file name to support replication. Each replica is represented by the site where it is stored, the access protocol needed to interact with the site, the creation time, the file size, etc. Grids are implemented as middleware [50], which manages the distributed state information for each service.
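
The following sketch, with hypothetical field names, shows the kind of state information that replication maps onto a logical name, and how a caller might select a replica by access protocol.

replica_catalog = {
    "/npaci/collection/run42/output.dat": [
        {"site": "sdsc-archive", "access_protocol": "hpss",
         "physical_path": "/archive/run42/output.dat",
         "creation_time": "2004-02-11T08:30:00Z", "size_bytes": 1_048_576},
        {"site": "remote-cache", "access_protocol": "unix",
         "physical_path": "/cache/run42/output.dat",
         "creation_time": "2004-02-12T10:05:00Z", "size_bytes": 1_048_576},
    ]
}

def pick_replica(logical_name: str, preferred_protocol: str) -> dict:
    """Return a replica reachable with the caller's preferred protocol, if any."""
    replicas = replica_catalog[logical_name]
    for replica in replicas:
        if replica["access_protocol"] == preferred_protocol:
            return replica
    return replicas[0]   # otherwise fall back to any registered replica

print(pick_replica("/npaci/collection/run42/output.dat", "unix")["site"])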

The management of consistency constraints on the mappings that are applied to the logical name space becomes important when two independent data grids want to share data. Unless both data grids can specify the constraints that have been applied to the mappings, inconsistencies will occur in the management, characterization, and manipulation of the digital entities. An example is peer-to-peer federation of data grids. How does one impose access controls on data that has been copied into another data grid? Can one build a system in which access controls are a property of the digital entity, rather than the storage repository? The SRB implements this concept by imposing multiple levels of constraints on the logical name spaces. Users are represented by a logical user name space managed by the SRB. Users authenticate their identity to the SRB as a distinguished name within the user name space. A mapping is imposed on the file logical name space through access controls for each digital entity, specifying an access role for each distinguished user name. The SRB imposes access constraints by storing digital entities under an SRB Unix ID. User access to data is then accomplished by authenticating the user to the SRB, checking that the user has access permission based upon the mapping that is maintained by the SRB, authenticating the SRB's access to a remote storage system, and retrieving the digital entity through the SRB data handling system. The user interacts with the SRB, which then serves as the proxy for interacting with the remote storage systems. When data is moved to another location, the access controls remain managed by the SRB as a property of the data. The access controls do not change when data is moved. This approach works very well for building scientific data collections, for sharing data within an organization, and for publishing data on the Web.
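
The access sequence just described can be summarized in the following sketch. The function and tables are illustrative, not the SRB API; the point is that the access-control mapping travels with the logical name, and the broker, not the user, talks to the remote storage system.

AUTHENTICATED_USERS = {"alice@exampleVO"}        # distinguished user names known to the broker
ACCESS_CONTROLS = {                              # access role per digital entity and user
    "/demo/collection/data.bin": {"alice@exampleVO": "read"},
}
PHYSICAL_PATH = {                                # logical name -> physical storage path
    "/demo/collection/data.bin": "/archive/broker_owned/data.bin",
}
REMOTE_STORAGE = {                               # bits held under the broker's own identity
    "/archive/broker_owned/data.bin": b"\x00\x01\x02\x03",
}

def broker_get(user: str, logical_name: str) -> bytes:
    # 1. The user authenticates to the broker with a distinguished name.
    if user not in AUTHENTICATED_USERS:
        raise PermissionError("user authentication failed")
    # 2. The broker checks the access role recorded for the logical name; the
    #    control is a property of the digital entity, not of the repository.
    if ACCESS_CONTROLS.get(logical_name, {}).get(user) not in ("read", "write", "own"):
        raise PermissionError("no access role for this digital entity")
    # 3. The broker retrieves the entity as a proxy; the remote storage system
    #    sees only the broker's identity, never the end user's.
    return REMOTE_STORAGE[PHYSICAL_PATH[logical_name]]

print(len(broker_get("alice@exampleVO", "/demo/collection/data.bin")), "bytes retrieved")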

Data grids can be viewed as systems that manage and manipulate consistency constraints on mappings of distributed state information. Digital libraries add mappings to manage user-defined metadata to support discovery and browsing. Persistent archives add mappings to manage the authenticity of the deposited digital entities [51]. Attributes are added to record all operations on the data, and signatures or checksums are assigned to prove that the original bits have not been unexpectedly changed.

The three types of data management systems can be viewed as defining multiple levels of aggregation semantics (constraints) upon collections of digital entities. At the same time, each level of aggregation is managed by a set of constraint relationships. We recognize multiple levels of aggregation and associated constraints, shown in Table 3, that are needed to specify sociological requirements [52].

Table 3. Types of Constraints for Federation of Collections

VI. GRID IMPLEMENTATIONS

Middleware was originally proposed as the software infrastructure that manages distributed state information resulting from distributed services [53]. A newer definition is that middleware is the software infrastructure that manages information flow between processes and distributed collections.1 The concepts underlying this interpretation are the following.

• Computations are executed to generate data.
• Output from computation represents a quantifiable prediction that can be compared with either observations or other computation results.
• Organization of computation results into collections makes it possible to associate a context with the simulation output. The context consists of metadata attributes that are chosen by the collection creator. Each discipline can implement a separate context, which represents the set of information that will be used by researchers within the discipline. The same computational result can be stored into multiple collections with different choices for the information context. A digital entity becomes useful when a context is provided that defines how to interpret the digital entity. Without a context, digital entities are just meaningless bit strings.
• Digital entities within a collection that are never accessed are useless.
• Information and data movement (context and content) from collections to processes represents the access and use of the results from the original computation or observation.
• The end goal of computation is to facilitate the advancement of knowledge through a better understanding of how to simulate reality. The comparison of simulation output with observations is a fundamental part of knowledge generation. Data is useful when it is being moved and analyzed.

1. Based on an observation by D. Petravick that data is only relevant when it is moving ([email protected]).


Table 4. Comparison of Grid and Digital Library Approaches to Context Management

Table 5. Grid Evolutionary Steps

This view of data management systems as mechanisms to facilitate information flow is feasible if the underlying functionality provided by grids allows the association of state information with the output files. This raises the issue of grid software implementation. Grids focus on execution of access services. Digital libraries focus on management of the results. The driving concepts behind the two approaches are listed in Table 4. The grid approach manages application of processes. The digital library approach manages the data and information that are created. The approaches are complementary.

VII. GRIDS AND DIGITAL LIBRARIES

Given these characterizations of knowledge generation and management, we can examine why grid technology will undergo further evolution. In Table 5, additional evolutionary steps are defined. If we examine the starred items in Table 5 in terms of the contrasting approaches between grid and digital library information management, we can predict the new functionality that will need to be implemented in grid services. For each category, we define additional grid services that will be needed for supporting digital libraries and knowledge generation systems.

A. Federated Name Spaces

A global name space for files is used in the European data grid to assert equivalence of digital entities across service catalogs. Separate service catalogs are used to manage the state information that results from each service. Thus, a replica location service manages the location of replicas, a community authorization service manages access controls on the replicas, and a metadata catalog service lists descriptive metadata for the replicas. Each catalog manages the results from all applications of that particular service. A global unique identifier is used to map entries in each service catalog to a particular digital entity.

In the digital library community, state information is mapped onto a logical identifier that is associated with each digital entity and organized as metadata in a collection. The decision to create a collection is independent of the set of services that were applied to the digital entities. The collection asserts relationships between digital entities by annotating digital entities with metadata attributes. The digital library community constructs union catalogs to federate access across collections. Equivalent federation mechanisms are needed to federate each of the logical name spaces managed by data grids, including name spaces for files, users, and resources. Federation extends the original naming indirection mechanisms developed for grids to support access across independently assembled collections of computational results.

Interoperability between different virtual organizations (which define their own logical name spaces) is managed by services that implement constraints on the sharing of the logical name spaces. Examples are the registration of files from one virtual organization into the name space of a second virtual organization, the registration of a user name from one virtual organization into a second virtual organization, the sharing of storage resources between virtual organizations, and the sharing of metadata between virtual organizations. Federation mechanisms manage the sharing constraints.
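
The sketch below illustrates one such sharing constraint with invented names: an item registered from one virtual organization into another carries a rule that is evaluated on every cross-grid access.

def require_home_grid_authentication(request: dict) -> bool:
    # Constraint asserted by the providing grid: the consuming grid must show
    # that the user's home grid authenticated this access.
    return request.get("home_grid_authenticated") is True

federated_entry = {
    "logical_name": "/gridA/shared/survey.tar",
    "owner_grid": "gridA",
    "constraints": [require_home_grid_authentication],
}

def federated_access(entry: dict, request: dict) -> str:
    if all(constraint(request) for constraint in entry["constraints"]):
        return "access granted under the rules of " + entry["owner_grid"]
    return "access denied"

print(federated_access(federated_entry, {"home_grid_authenticated": True}))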

B. Processing Pipeline

The Grid provides workflow processing systems that specify each service that is applied to a digital entity. Control mechanisms are applied to the services to specify their completion status. A dataflow environment focuses on the digital entities, and applies control mechanisms to the digital entities that are processed by the services. An example of a dataflow environment is the execution of a query on a collection, then the processing of the result set. The processing status of each digital entity in the result set is maintained. A workflow environment typically knows in advance the names of the digital entities and the processing steps that will be applied to each digital entity. The dataflow environment allows the digital entities to be identified as part of the dataflow and provides controls to allow looping, conditional execution, and branching based upon the results of each service. The output from the dataflow may be stored in a collection or consumed by another dataflow or a device (such as video streaming). The collection context includes the state information that is generated by the application of the services. The design of appropriate dataflow control mechanisms is an integral part of access to distributed data.
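
A minimal sketch of the dataflow pattern (all names hypothetical) is given below: the members of the result set are discovered at run time, and conditional logic is attached to each digital entity rather than to a predetermined list of service invocations.

def query_collection(collection, predicate):
    # Stand-in for a metadata query; yields the matching digital entities.
    return [entity for entity in collection if predicate(entity)]

def dataflow(collection):
    status = {}
    result_set = query_collection(collection, lambda e: e["band"] == "J")
    for entity in result_set:               # members discovered only at run time
        if entity["calibrated"]:            # conditional branch attached to the entity
            status[entity["name"]] = "cutout service applied"
        else:
            status[entity["name"]] = "skipped: awaiting calibration"
    return status

images = [
    {"name": "f001.fits", "band": "J", "calibrated": True},
    {"name": "f002.fits", "band": "J", "calibrated": False},
    {"name": "f003.fits", "band": "K", "calibrated": True},
]
print(dataflow(images))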

Operations may be performed more efficiently at a remote storage system when the result is the movement of a smaller amount of data over the network. Processing mechanisms have been incorporated into data access by the database community. Equivalent functionality will need to be supported by the grid community to improve performance. The grid will need to support application of processes at remote storage locations.

The scheduling of dataflows as combinations of processing at compute resources and at storage resources requires the specification of the complexity of the data processing steps (the number of operations to be performed per byte of data). Processes with a small complexity are executed at the storage system. Processes with a large complexity are performed most rapidly by moving the data to a supercomputer. The decision for where to execute a process can be characterized as an execution constraint that is evaluated during the dataflow. This leads to the definition of a dataflow environment in terms of two sets of constraints.

1) The set of relationships and rules that govern the processing of the digital entity. This is equivalent to identifying the processing steps required to generate a derived data product.

2) The set of execution constraints that control where the processing will take place, and the order of execution of the processes.

Given the set of constraints, knowledge management technology can be used to describe each of the processing steps, associate the information with the derived data products, and manage the information in a collection.
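
The placement decision can be sketched as a single execution constraint evaluated during the dataflow; the threshold and step names below are assumed, illustrative values.

def choose_execution_site(operations_per_byte: float, threshold: float = 100.0) -> str:
    """Execution constraint evaluated during the dataflow; the threshold is illustrative."""
    if operations_per_byte < threshold:
        return "execute at the storage system"     # cheap step: avoid moving the data
    return "move the data to a compute resource"   # expensive step: use the supercomputer

for step, complexity in [("metadata extraction", 5.0), ("volume rendering", 5000.0)]:
    print(step, "->", choose_execution_site(complexity))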

C. Consistency Management

The major component missing from grid technology is the ability to maintain consistency between content and context when multiple services are invoked. Consistency management is complicated by the desire by different data management communities to impose different constraints on the data manipulation. Constraint-based consistency management will be required to implement end-to-end applications such as persistent archives [54]. Persistent context can only be maintained if the state information that results from each grid service is consistently updated in a preservation catalog.

Constraint-based consistency management is an example of the application of rules and relationships to the execution of the grid services themselves. A simple example is the specification of the order in which grid services must be applied for state information to be valid when replicating data. Before the existence of a replica is recorded, the creation of the copy must be completed. More sophisticated examples occur in the federation of data grids. Two data grids may establish criteria under which a user in the first data grid may access data in the second data grid. The second data grid may require that the user be authenticated by the first data grid on every access. The second data grid cannot apply its access controls until the first data grid has verified the identity of the user.
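
The replication example can be sketched as an ordering constraint (names invented): the catalog entry is valid state information only after the physical copy completes.

def copy_file(source: str, destination: str, storage: dict) -> bool:
    storage[destination] = storage[source]   # stand-in for the physical data transfer
    return True                              # completion status of the copy service

def register_replica(catalog: dict, logical_name: str, destination: str, copy_completed: bool) -> None:
    if not copy_completed:
        raise RuntimeError("constraint violated: replica recorded before the copy finished")
    catalog.setdefault(logical_name, []).append(destination)

storage = {"siteA:/run42/out.dat": b"results"}
catalog = {}
done = copy_file("siteA:/run42/out.dat", "siteB:/cache/out.dat", storage)
register_replica(catalog, "/demo/run42/out.dat", "siteB:/cache/out.dat", done)
print(catalog)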

D. Information Flow

The challenge in managing information flow is that coordination is required between services. When result sets are manipulated instead of individual files, it may be appropriate to do processing at the remote storage location for some of the digital entities, but it may be more efficient to move another member of the result set to a compute node for processing. Information flow imposes a generality of solution that is not available with current grid technology. It requires all of the concepts that have been discussed:

• federated name spaces for operations across collections and data grids;
• mapping of state information to each digital entity;
• organization of digital entities into collections, with the collection defining the information context that will be maintained;
• consistency management mechanisms for updating the state information that results from the application of multiple services.

A fundamental change for grids is the ability to define a context that can be managed independently of grid services. Grid environments that support data management will evolve to provide the following services:

• application of consistency constraints;
• storage of the consistency constraints in knowledge repositories;
• knowledge repository virtualization mechanisms, for the management of knowledge constraints in different vendor knowledge repository products;
• knowledge virtualization, to provide a uniform naming convention for the management of consistency constraints that are stored in multiple knowledge repositories.

VIII. CONCLUSION

The integration of data grids, digital libraries, and persistent archives is forcing continued evolution of grid technology. Grids have been evolving through the addition of naming indirection mechanisms. The ability to manage information context will require further evolution of grid technology and the ability to characterize the assertions behind the application of the grid name spaces. The result will be the ability to manage the consistency of federated data collections while flowing information and data from digital libraries through grid services into preservation environments.

REFERENCES

[1] BIRN—The Biomedical Informatics Research Network [Online]. Available: http://www.nbirn.net
[2] WUN—Worldwide Universities Network [Online]. Available: http://www.wun.ac.uk/
[3] SCEC—Southern California Earthquake Center Community Digital Library [Online]. Available: http://www.sdsc.edu/SCEC/


[4] NSDL—National Science Digital Library [Online]. Available: http://www.nsdl.org/
[5] SRB—The Storage Resource Broker Web Page [Online]. Available: http://www.npaci.edu/DICE/SRB/
[6] R. Moore, "Evolution of data grid concepts," presented at the Global Grid Forum 10 Workshop: The Future of Grid Data Environments, Berlin, Germany, 2004.
[7] NVO—National Virtual Observatory [Online]. Available: http://www.us-vo.org/
[8] SEEK—Science Environment for Ecological Knowledge [Online]. Available: http://seek.ecoinformatics.org/
[9] GEON—Geosciences Network [Online]. Available: http://www.geongrid.org
[10] R. Moore, "Preservation of data, information, and knowledge," presented at the World Library Summit, Singapore, 2002.
[11] R. Moore, C. Baru, A. Rajasekar, B. Ludascher, R. Marciano, M. Wan, W. Schroeder, and A. Gupta. (2000, Apr./Mar.) Collection-based persistent digital archives—Parts 1 and 2. D-Lib Mag. [Online]. Available: http://www.dlib.org/dlib/march00/moore/03moore-pt1.html
[12] R. Moore, "Knowledge-based persistent archives," presented at La Conservazione Dei Documenti Informatici Aspetti Organizzativi E Tecnici, Rome, Italy, 2000.
[13] A. Rajasekar, R. Marciano, and R. Moore, "Collection based persistent archives," in Proc. 16th IEEE Symp. Mass Storage Systems, 1999, pp. 176–184.
[14] R. Moore, C. Baru, P. Bourne, M. Ellisman, S. Karin, A. Rajasekar, and S. Young, "Information based computing," presented at the Workshop Research Directions for the Next Generation Internet, Washington, DC, 1997.
[15] C. Baru, R. Moore, A. Rajasekar, and M. Wan, "The SDSC storage resource broker," presented at the CASCON'98 Conference, Toronto, ON, Canada.
[16] A. Rajasekar, M. Wan, and R. Moore, "mySRB and SRB, components of a data grid," presented at the 11th High Performance Distributed Computing Conf., Edinburgh, U.K., 2002.
[17] NPACI Data Intensive Computing Environment Thrust Area [Online]. Available: http://www.npaci.edu/DICE/
[18] 2MASS—Two Micron All Sky Survey [Online]. Available: http://www.ipac.caltech.edu/2mass/
[19] DPOSS—Digital Palomar Sky Survey [Online]. Available: http://www.sdss.jhu.edu/~rrg/science/dposs/
[20] JCSG—Joint Center for Structural Genomics [Online]. Available: http://www.jcsg.org/
[21] AFCS—Alliance for Cell Signaling [Online]. Available: http://www.afcs.org
[22] SIO Explorer Digital Library Project to Provide Educational Material from Oceanographic Voyages in Collaboration with NSDL [Online]. Available: http://nsdl.sdsc.edu/
[23] ROADnet, California Institute for Telecommunications and Technology SensorNet [Online]. Available: http://www.calit2.net/sensornets/
[24] LTER, US Long Term Ecological Research Network [Online]. Available: http://lternet.edu/
[25] CMS—Pre-Production Challenge Data Management for the Compact Muon Solenoid [Online]. Available: http://www.gridpp.ac.uk/gridpp8/gridpp8_cms_status.ppt
[26] BaBar—B Meson Detection System [Online]. Available: http://www.slac.stanford.edu/BFROOT/
[27] CDL—California Digital Library [Online]. Available: http://www.cdlib.org/
[28] Interlib—Digital Library Initiative Phase II Project with the California Digital Library [Online]. Available: http://www-diglib.stanford.edu/
[29] Transana—Education research tool for the transcription and qualitative analysis of audio and video data [Online]. Available: http://www.transana.org/
[30] ArtStor—Andrew Mellon Initiative to Create a Collection of Art Images for Use in Art History Courses [Online]. Available: http://www.artstor.org/
[31] Digital Embryo—Collection of Images for Embryology Courses [Online]. Available: http://netlab.gmu.edu/visembryo/index.html
[32] IPG—NASA Information Power Grid [Online]. Available: http://www.ipg.nasa.gov/
[33] IVOA—International Virtual Observatory Alliance [Online]. Available: http://www.ivoa.net/
[34] United Kingdom Data Grid [Online]. Available: http://www.escience-grid.org.uk/
[35] TeraGrid—NSF sponsored project to build the world's largest, most comprehensive, distributed infrastructure for open scientific research [Online]. Available: http://www.teragrid.org/
[36] ESIP—Federation of Earth System Information Providers [Online]. Available: http://www.esipfed.org/
[37] 12th NASA Goddard/21st IEEE Conf. Mass Storage Systems and Technologies.
[38] LDAS—NASA Land Data Assimilation System [Online]. Available: http://ldas.gsfc.nasa.gov/
[39] NARA Persistent Archives Project [Online]. Available: http://www.sdsc.edu/NARA/
[40] PAT—Persistent Archive Testbed [Online]. Available: http://www.sdsc.edu/PAT
[41] R. Moore and C. Baru, "Virtualization services for data grids," in Grid Computing: Making the Global Infrastructure a Reality. New York: Wiley, 2003, pp. 409–433.
[42] M. Wan, A. Rajasekar, R. Moore, and P. Andrews, "A simple mass storage system for the SRB data grid," presented at the 20th IEEE Symp. Mass Storage Systems and 11th Goddard Conf. Mass Storage Systems and Technologies, San Diego, CA, 2003.
[43] MCAT—The Metadata Catalog [Online]. Available: http://www.npaci.edu/DICE/SRB/mcat.html
[44] H. Stockinger, O. Rana, R. Moore, and A. Merzky, "Data management for grid environments," in Proc. High Performance Computing and Networking (HPCN 2001), pp. 151–160.
[45] WSDL, Web Services Description Language [Online]. Available: http://www.w3.org/TR/wsdl
[46] SOAP, Simple Object Access Protocol [Online]. Available: http://www.w3.org/TR/SOAP/
[47] R. Moore, "Knowledge-based grids," presented at the 18th IEEE Symp. Mass Storage Systems and 9th Goddard Conf. Mass Storage Systems and Technologies, San Diego, CA, 2001.
[48] XML—Extensible Markup Language [Online]. Available: http://www.w3.org/XML/
[49] Globus—The Globus Toolkit [Online]. Available: http://www.globus.org/toolkit/
[50] R. Moore, C. Baru, A. Rajasekar, R. Marciano, and M. Wan, "Data intensive computing," in The Grid: Blueprint for a New Computing Infrastructure, I. Foster and C. Kesselman, Eds. San Francisco, CA: Morgan Kaufmann, 1999.
[51] B. Ludäscher, R. Marciano, and R. Moore, "Towards self-validating knowledge-based archives," in Proc. 11th Int. Workshop Research Issues in Data Engineering: Document Management for Data Intensive Business and Scientific Applications, 2001, pp. 9–16.
[52] R. Moore, "The San Diego project: Persistent objects," presented at the Workshop XML as a Preservation Language, Urbino, Italy, 2002.
[53] B. Aiken, B. Carpenter, I. Foster, J. Mambretti, R. Moore, J. Strassner, and B. Teitelbaum. (1998, Dec.) Terminology for describing middleware for network policy and services. [Online]. Available: http://www-fp.mcs.anl.gov/middleware98/report.html
[54] R. Moore and A. Rajasekar. (2003) Common consistency requirements for data grids, digital libraries, and persistent archives (Grid Protocol Architecture Research Group draft). Global Grid Forum 8 [Online]. Available: http://www.sdsc.edu/dice/Pubs/Moore-HPDC.doc
[55] A. Rajasekar, M. Wan, R. Moore, W. Schroeder, G. Kremenek, A. Jagatheesan, C. Cowart, S.-Y. Chen, and R. Olaschanowsky, "Storage resource broker—Managing distributed data in a grid," J. Comput. Soc. India, vol. 33, no. 4, pp. 41–53, Oct.–Dec. 2003.
[56] I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure, 2nd ed. San Francisco, CA: Morgan Kaufmann, 2003.
[57] OpenDAP—The open source project for a network data access protocol [Online]. Available: http://opendap.org

Reagan W. Moore received the B.S. degree in physics from the California Institute of Technology, Pasadena, in 1967 and the Ph.D. degree in plasma physics from the University of California, San Diego, in 1978.

He is Director for Data Intensive Computing Environments at the San Diego Supercomputer Center (SDSC), University of California, San Diego. He coordinates research efforts on digital libraries, data grids, and persistent archives. Notable collaborations include the National Science Foundation (NSF) National Virtual Observatory, the NSF National Science Digital Library persistent archive, the NSF Southern California Earthquake Center community digital library, the Department of Energy (DOE) Particle Physics Data Grid, the NHPRC Persistent Archive Testbed, and the NARA Prototype Persistent Archive.


Arcot Rajasekar received the Ph.D. degree from the University of Maryland, College Park, in 1989.

He is the Director of the Data Grid Technologies Group at the San Diego Supercomputer Center (SDSC), University of California, San Diego. He is a key architect of the SDSC Storage Resource Broker, an intelligent data grid integrating distributed data archives, file repositories, and digital collections. He has more than 50 publications in artificial intelligence, databases, and data grid systems. His research interests include data grids, digital library systems, persistent archives, and distributed data collection and metadata management.

Michael Wan received the M.S. degree in nuclear engineering from the Georgia Institute of Technology, Atlanta, in 1972.

He was a Nuclear Reactor Physics Engineer at General Atomics. He has been a systems analyst/programmer at the San Diego Supercomputer Center (SDSC), University of California, San Diego, since it began in 1985 and has developed a variety of key enhancements to various operating system components. He is the Chief Architect/Designer of the Storage Resource Broker (SRB) and a Senior Software Engineer and Systems Analyst at SDSC. Collaborating with Dr. A. Rajasekar and others, he has been instrumental in the development of the SRB throughout its entire history.


