Data Grids, Digital Libraries, and Persistent Archives: An Integrated Approach to Sharing, Publishing, and Archiving Data
REAGAN W. MOORE, ARCOT RAJASEKAR, AND MICHAEL WAN, MEMBER, IEEE
The integration of grid, data grid, digital library, and preservation technology has resulted in software infrastructure that is uniquely suited to the generation and management of data. Grids provide support for the organization, management, and application of processes. Data grids manage the resulting digital entities. Digital libraries provide support for the management of information associated with the digital entities. Persistent archives provide long-term preservation. We examine the synergies between these data management systems and the future evolution that is required for the generation and management of information.
Keywords: Data grids, digital libraries, persistent archives, information management.
Data grids support massive data collections that are distributed across multiple institutions. Communities such as the National Institutes of Health (NIH) Biomedical Informatics Research Network (16 sites, 4 million files, 6 TB of data) promote the sharing of data between NIH-funded researchers by federating access to geographically remote storage systems. International collaborations such as the
Manuscript received March 1, 2004; revised June 1, 2004. This work was supported in part by the National Science Foundation (NSF) National Partnership for Advanced Computational Infrastructure (NPACI) under Grant ACI-9619020 (National Archives and Records Administration supplement), in part by the NSF Digital Library Initiative Phase II Interlib project, in part by the NSF National Science Digital Library under Subaward S02-36645, in part by the Department of Energy Scientific Data Management project under Award DE-FC02-01ER25486 and the Particle Physics Data Grid, in part by the NSF National Virtual Observatory, in part by the NSF Grid Physics Network, and in part by the NASA Information Power Grid. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the National Science Foundation, the National Archives and Records Administration, or the U.S. government.
The authors are with the San Diego Supercomputer Center, San Diego, CA 92093-0505 USA (e-mail: firstname.lastname@example.org; email@example.com; firstname.lastname@example.org).
Digital Object Identifier 10.1109/JPROC.2004.842761
Worldwide Universities Network (five initial sites) support the sharing of data between academic institutions in the United States and the United Kingdom. National Science Foundation (NSF)-funded Information Technology Research projects such as the Southern California Earthquake Center (five sites, 1.7 million files, 91 TB of data) build digital libraries of domain-specific material for publication and use by all members of the scientific discipline. The NSF National Science Education Digital Library (26 million files, 3.5 TB of data) uses data grid technology to implement a persistent archive of material that has been gathered from Web crawls. The SIOExplorer project (808 000 files, 2 TB of data) manages an archive of ship logs from oceanographic research vessels. All of these projects are faced with the organization of digital entities into collections, the assignment of descriptive metadata to support discovery, and the controlled access to data that are distributed across multiple sites. All of these projects use collections to provide a context for the interpretation of their digital entities. All of these systems are based upon a generic data management infrastructure, the San Diego Supercomputer Center (SDSC), San Diego, CA, Storage Resource Broker (SRB).
The management of data has traditionally been supported by software systems that assume explicit control over local storage systems (file systems) or that assume local control over information records (databases). The SRB manages distributed data, enabling the creation of data grids that focus on the sharing of data, digital libraries that focus on the publication of data, and persistent archives that focus on the preservation of data. Data grid technology provides the fundamental management mechanisms for distributed data. This includes support for managing data on remote storage systems, a uniform name space for referencing the data, a catalog for managing information about the data, and mechanisms for interfacing to the preferred access method. Digital libraries can be implemented on top of data grids
0018-9219/$20.00 2005 IEEE
578 PROCEEDINGS OF THE IEEE, VOL. 93, NO. 3, MARCH 2005
through the addition of mechanisms to support collection creation, browsing, and discovery. The underlying operations include schema extension, bulk metadata load, import and export of metadata encapsulated in XML, and management of collection hierarchies. Persistent archives can be implemented on top of data grids by the addition of integrity metadata needed to assert the invariance of the deposited material. The mechanisms provided by data grids to manage access to heterogeneous data resources can also be used to manage migration from old systems to new systems, and hence manage technology evolution. The SRB is being used as the underlying infrastructure for both digital libraries and persistent archives and is a proof in practice that common infrastructure can be used for data management.
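As a rough illustration of integrity metadata, the following Python sketch records a size and checksum when material is deposited and checks them again later. The function names, the dictionary used as a catalog, and the choice of SHA-256 are assumptions made for this example; the text does not say which integrity attributes or algorithms the SRB uses.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a hex digest of the bits (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def deposit(catalog: dict, name: str, data: bytes) -> None:
    """Record hypothetical integrity metadata for newly deposited material."""
    catalog[name] = {"size": len(data), "sha256": checksum(data)}


def verify(catalog: dict, name: str, data: bytes) -> bool:
    """Assert invariance: the bits read back must match the recorded metadata."""
    record = catalog[name]
    return len(data) == record["size"] and checksum(data) == record["sha256"]
```

Because the check depends only on the bits and the catalog entry, the same verification can be repeated after the material is migrated to a new storage system.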
Despite the success in integrating digital libraries and data grids, significant challenges remain. The issues are related to information generation and management and can be expressed as characterization of the criteria used to federate access across multiple data management environments. We first need to state precisely what we mean by the terms data, information, and knowledge. The data grid community defines data to be the strings of bits that compose a digital entity. A digital entity might represent, for example, a data file, an object in an object ring buffer, a record in a database, a URL, or a binary large object in a database. Data are stored in storage repositories (file systems, archives, databases, etc.). Meaning is assigned to a digital entity by associating a semantic label. Information consists of the set of semantic labels that are assigned to strings of bits. The semantic labels can be used to assert a name for a digital entity, assert a property of a digital entity, and assert relationships that are true about a digital entity. Information is stored in information repositories (relational databases, XML databases, flat files, etc.). The combination of a semantic label and associated data is treated as metadata. Metadata are organized through specification of a schema and stored as attributes in a relational database. The digital entities that are registered into the database comprise a collection. The metadata in the collection in turn provide the context for interpreting the significance of the registered digital entities.
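The distinctions among data, semantic labels, metadata, and collections can be made concrete with a minimal Python sketch. The class names, the in-memory dictionary standing in for a relational database, and the `register` and `discover` methods are all hypothetical; they illustrate the definitions above and are not the SRB catalog interface.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DigitalEntity:
    """Data: a string of bits with no meaning attached."""
    bits: bytes


@dataclass
class Collection:
    """Digital entities registered under a schema of semantic labels."""
    schema: tuple[str, ...]
    entries: dict[str, tuple[DigitalEntity, dict[str, str]]] = field(default_factory=dict)

    def register(self, name: str, entity: DigitalEntity, **attributes: str) -> None:
        """Attach metadata (semantic label plus value) to an entity."""
        unknown = set(attributes) - set(self.schema)
        if unknown:
            raise ValueError(f"labels not in schema: {sorted(unknown)}")
        self.entries[name] = (entity, dict(attributes))

    def discover(self, label: str, value: str) -> list[str]:
        """Return the names of entities whose metadata match label == value."""
        return sorted(
            name
            for name, (_, metadata) in self.entries.items()
            if metadata.get(label) == value
        )
```

In this sketch the schema is what turns a set of files into a collection: an entity can only be registered with labels the schema defines, and discovery works on the labels rather than on the bits.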
Grids manage distributed execution of processes. The SRB data grid manages simulation results, observational data, and derived data products. Grids and data grids are complementary technologies that together enable the creation and management of data. Digital libraries organize information in collections. Persistent archives preserve the information content of collections. Persistent archives manage the evolution of all components of the hardware and software infrastructure, including the encoding syntax standards for data models. The integration of information management is one of the next steps in the evolution of grid technology.
We examine how grid technology has evolved, describe the current state of the art in data grid technology, and then demonstrate the evolution required in grid technology for the characterization of information and the integration of digital library and persistent archive technology. An integrated environment for the generation, publication, sharing, and preservation of information is the next step in grid infrastructure.
Table 1. Evolution of Grid Functionality
II. GRID EVOLUTION
One approach to understanding the current state of grid services is to look at how grid technology has evolved over the last four years. The original grid environments assumed that applications directly accessed remote data that were stored under the user's Unix ID, that data would be pulled to the computation, that accesses could be based upon physical file names, and that the applications would access data through library calls. Generalizations now exist for each of these functions, typically implemented as naming indirection abstractions. In Table 1, the evolution path is shown for each function. The left-hand column represents the original grid approach, the middle column represents functions provided by current digital library and persistent archive technology, and the right-hand column defines the capability enabled by the new function.
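One way to picture a naming indirection abstraction is a logical name space that maps a stable logical file name to one or more physical locations. The Python sketch below is illustrative only: the class, its methods, and the resource names are hypothetical and do not reproduce any grid or SRB interface.

```python
class LogicalNameSpace:
    """Maps logical file names to (resource, physical path) replicas."""

    def __init__(self) -> None:
        self._replicas: dict[str, list[tuple[str, str]]] = {}

    def register(self, logical_name: str, resource: str, physical_path: str) -> None:
        """Record one physical copy of a logical file."""
        self._replicas.setdefault(logical_name, []).append((resource, physical_path))

    def resolve(self, logical_name: str, offline: frozenset = frozenset()) -> tuple[str, str]:
        """Return the first replica whose resource is not marked offline."""
        for resource, path in self._replicas.get(logical_name, []):
            if resource not in offline:
                return resource, path
        raise LookupError(f"no available replica for {logical_name}")

    def migrate(self, logical_name: str, old_resource: str,
                new_resource: str, new_path: str) -> None:
        """Point replicas held on old_resource at their new location."""
        self._replicas[logical_name] = [
            (new_resource, new_path) if resource == old_resource else (resource, path)
            for resource, path in self._replicas[logical_name]
        ]
```

Applications that refer only to the logical name are unaffected when a replica moves, which is the property that distinguishes this approach from access by physical file name.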
Each of the evolutionary steps required the specification of a new naming convention for resources, users, files, collections, and services. The naming convention made it possible for a community to create uniform labels for accessing remote data and resources. The aggregation of the naming conventions is called a virtual organization. A virtual organization is created to meet the needs of a particular group, project, or institution. It is quite possible for virtual organizations to implement different naming conventions.
The naming conventions are assigned by a set of criteria specific to each virtual organization. The criteria might depend upon cultural considerations (status of a person within a project), organizational considerations (site that owns a resource), or choice of infrastructure (software systems used to implement the name space). The assignment of names corresponds to the creation of a new semantic label for each entity. The creation of the semantic label is an assertion by each virtual organization that the associated criteria have been met.
Federation is the sharing of resources, user names, files, and metadata between grids. When grids are federated, the
underlying assumptions governing the creation of the name spaces must be integrated. The name space integration is possible if the assumptions underlying the application of the naming convention are compatible. The future evolution of the grid will strongly rely upon the use of information management technologies that can express the criteria used to assign semantic labels.
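If the criteria behind a naming convention were expressed explicitly, a compatibility check before federation could be as simple as the Python sketch below. Representing criteria as key-value pairs, and treating two virtual organizations as compatible when every criterion they both declare has the same value, are assumptions made purely for illustration.

```python
def conflicting_criteria(ours: dict[str, str],
                         theirs: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return the criteria both organizations declare but define differently."""
    shared = sorted(ours.keys() & theirs.keys())
    return {key: (ours[key], theirs[key]) for key in shared if ours[key] != theirs[key]}


def can_federate(ours: dict[str, str], theirs: dict[str, str]) -> bool:
    """Name spaces are treated as integrable when no shared criterion conflicts."""
    return not conflicting_criteria(ours, theirs)
```

Returning the conflicting criteria, rather than only a yes or no answer, shows which assumptions would have to be reconciled before the name spaces could be integrated.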
An additional observation is that the driving motivation for many of the grid evolutionary steps has been the need to manage the results created by services, in addition to managing the execution of the services. Digital libraries and persistent archives focus on the management of the data that result from the application of services. They define a context that includes the state information that results from all processes performed upon a digital entity and organize the digital entities into a collection. For grid technology to support end-to-end data management applications, it will need to incorporate digital library information management capabilities as well as persistent archive technology management capabilities.
III. INTEGRATING DIGITAL LIBRARIES AND DATA GRIDS: SPANNING THE INFORMATION DIVIDE
A major research issue in data grids and digital libraries is the integration of knowledge management systems with existing data and information management systems. Knowledge management is needed to support constraints that are applied in federation of data grids and in semantic cross-walks between digital libraries. A growing number of communities [from astronomy (NVO) to neuroscience (Biomedical Informatics Research Network (BIRN)) to ecology (SEEK) to geology (GEON)] are developing grids and digital portals for organizing, sharing, and archiving their scientific data. At SDSC, we have seen these and other communities specify diverse and sometimes orthogonal requirements for managing and sharing their data. The integration of constraint-based knowledge management technology with existing state-of-the-art data grids, digital libraries, and persistent archives requires equivalent support for relationship-based constraints across all three environments. The application of constraints for the integration of data grids and digital libraries will be an essential part of cyberinfrastructure.
The assignment of a semantic label to a digital entity requires the application of a processing step. A set of relationships or logical rules is used to assert the application of the semantic label. An example is the naming of the fields within a binary file. For a scientific data set, one might attach the following types of semantic labels to a field:
name of the physical variable that is represented by the file (asserted by the structural order of the fields within the data set);
units associated with the physical variable (asserted by a choice of metrics for the field);
data model by which the bits are organized (asserted as, say, a column-ordered or row-ordered array);
structural mapping implied by the data model (asserted as the type of geometry and coordinate system);
spatial mapping imposed on the data model (asserted through the number of spatial dimensions);
procedural mapping imposed on the data model (asserted through the name of the last processing step).
Each digital entity may have multiple semantic labels that are used to characterize its meaning. Notably, a semantic label typically represents the application of multiple relationships. The assertions behind the application of a semantic label can be used to define a context for the semantic label, essentially an information context.
Knowledge is the expression of relationships between semantic labels. Relationships are typically typed as logical ("is a"; "has a"), structural (existence of a structure within the string of bits), spatial (mapping of a string of bits to a coordinate system), temporal (mapping to a point in time), procedural (mapping to process results), functional (mapping of features to evaluation algorithms), and systemic (properties that cover all members of a collection). The management of knowledge requires the ability to describe, organize, and apply relationships.
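The three abilities named above (describe, organize, apply) can be sketched in Python with typed relationships between semantic labels. The relationship types come from the text; the class names, the triple-style representation, and the transitive "is a" traversal are assumptions chosen for illustration and do not describe a particular knowledge management system.

```python
from enum import Enum
from typing import NamedTuple


class RelationType(Enum):
    LOGICAL = "logical"
    STRUCTURAL = "structural"
    SPATIAL = "spatial"
    TEMPORAL = "temporal"
    PROCEDURAL = "procedural"
    FUNCTIONAL = "functional"
    SYSTEMIC = "systemic"


class Relationship(NamedTuple):
    subject: str      # semantic label
    predicate: str    # e.g. "is a", "has a"
    target: str       # semantic label
    kind: RelationType


class KnowledgeBase:
    def __init__(self) -> None:
        self._relationships: list[Relationship] = []

    def describe(self, relationship: Relationship) -> None:
        """Describe: record a relationship between two semantic labels."""
        self._relationships.append(relationship)

    def organize(self) -> dict[RelationType, list[Relationship]]:
        """Organize: group the recorded relationships by their type."""
        grouped: dict[RelationType, list[Relationship]] = {}
        for relationship in self._relationships:
            grouped.setdefault(relationship.kind, []).append(relationship)
        return grouped

    def ancestors(self, label: str) -> list[str]:
        """Apply: follow logical "is a" relationships transitively from label."""
        found: list[str] = []
        frontier = [label]
        while frontier:
            current = frontier.pop(0)
            for rel in self._relationships:
                if (rel.kind is RelationType.LOGICAL and rel.predicate == "is a"
                        and rel.subject == current
                        and rel.target != label and rel.target not in found):
                    found.append(rel.target)
                    frontier.append(rel.target)
        return found
```

Applying a relationship here means deriving labels that were never assigned directly: an entity labeled with a specific variable can also be found under every broader label reachable through "is a".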
Knowledge generation is closely tied to the processing of data. Each semantic label is the result of the application of a process (set of relationships and rules) that determines whether or not the semantic label can be applied to a given digital entity. The rules and relationships can be interpreted as constraints. Information is created by the application of constraints appropriate for a given community. One can view the creation of derived information (new semantic labels or new data sets) from a given data c...