12
J Grid Computing (2007) 5:283–294 DOI 10.1007/s10723-007-9071-y Enterprise Grids: Challenges Ahead R. Jiménez-Peris · M. Patiño-Martínez · B. Kemme Received: 15 October 2006 / Accepted: 17 January 2007 / Published online: 2 March 2007 © Springer Science + Business Media B.V. 2007 Abstract Grid technologies have matured over the last few years. This level of maturity is espe- cially true in the field of scientific computing in which Grids have become the main infrastructure for scientific problem solving. Due to its success, the use of Grid technology rapidly finds its in- troduction into other fields. One of such fields is enterprise computing in which Grids are seen as a new architecture for data centers. In this paper, we describe the vision of enterprise Grids, current scientific achievements that will leverage this vision, and challenges ahead. This work has been partially funded by the European S4ALL project, the Spanish Ministry of Education and Science (MEC) under grant TIN2004-07474-C02-01, the Madrid Regional Research Council (CAM) under grant 0505/TIC/000285, and the Spanish Ministry of Industry (MITyC) under grant FIT-340000-2006-138. R. Jiménez-Peris (B ) · M. Patiño-Martínez Facultad de Informática, Universidad Politécnica de Madrid (UPM), Madrid, Spain e-mail: rjimenez@fi.upm.es M. Patiño-Martínez e-mail: mpatino@fi.upm.es B. Kemme McGill University, School of Computer Science, Montreal, Quebec, Canada e-mail: [email protected] 1 Introduction Grids have succeeded in providing an infrastruc- ture for deploying parallel applications in a dis- tributed setting with a high degree of automation. The definition of Grids has been redefined along the years. Initially Grids were defined as an in- frastructure to provide easy and inexpensive ac- cess to high-end computing [24]. Then, it was refined in [18] as an infrastructure to share re- sources for collaborative problem solving. More recently, in [19] the Grid definition evolves into an infrastructure to pool and virtualize resources and enable their use in a transparent fashion. Grid infrastructure exhibits several interesting features. One of the main functionalities is that a Grid hides the heterogeneity of its underlying components such as hardware, operating systems or storage systems, and serves as a middleware that provides seamless connectivity of the com- ponents it manages. Another interesting charac- teristic is the transparent pooling of many kinds of resources such as computing power, storage, data, and services. This pooling enables appli- cations deployed on the Grid to transparently share resources and utilize the provided capacity more effectively. A related attribute is the abil- ity to allocate and reserve virtualized resources. Virtualization facilitates the sharing of resources: it allows the preservation of quality of service guarantees for time critical applications, and at

Enterprise Grids: Challenges Ahead

Embed Size (px)

Citation preview

J Grid Computing (2007) 5:283–294DOI 10.1007/s10723-007-9071-y

Enterprise Grids: Challenges Ahead

R. Jiménez-Peris · M. Patiño-Martínez · B. Kemme

Received: 15 October 2006 / Accepted: 17 January 2007 / Published online: 2 March 2007© Springer Science + Business Media B.V. 2007

Abstract Grid technologies have matured overthe last few years. This level of maturity is espe-cially true in the field of scientific computing inwhich Grids have become the main infrastructurefor scientific problem solving. Due to its success,the use of Grid technology rapidly finds its in-troduction into other fields. One of such fieldsis enterprise computing in which Grids are seenas a new architecture for data centers. In thispaper, we describe the vision of enterprise Grids,current scientific achievements that will leveragethis vision, and challenges ahead.

This work has been partially funded by the EuropeanS4ALL project, the Spanish Ministry of Education andScience (MEC) under grant TIN2004-07474-C02-01,the Madrid Regional Research Council (CAM) undergrant 0505/TIC/000285, and the Spanish Ministry ofIndustry (MITyC) under grant FIT-340000-2006-138.

R. Jiménez-Peris (B) · M. Patiño-MartínezFacultad de Informática, Universidad Politécnicade Madrid (UPM), Madrid, Spaine-mail: [email protected]

M. Patiño-Martíneze-mail: [email protected]

B. KemmeMcGill University, School of Computer Science,Montreal, Quebec, Canadae-mail: [email protected]

1 Introduction

Grids have succeeded in providing an infrastruc-ture for deploying parallel applications in a dis-tributed setting with a high degree of automation.The definition of Grids has been redefined alongthe years. Initially Grids were defined as an in-frastructure to provide easy and inexpensive ac-cess to high-end computing [24]. Then, it wasrefined in [18] as an infrastructure to share re-sources for collaborative problem solving. Morerecently, in [19] the Grid definition evolves intoan infrastructure to pool and virtualize resourcesand enable their use in a transparent fashion.

Grid infrastructure exhibits several interestingfeatures. One of the main functionalities is thata Grid hides the heterogeneity of its underlyingcomponents such as hardware, operating systemsor storage systems, and serves as a middlewarethat provides seamless connectivity of the com-ponents it manages. Another interesting charac-teristic is the transparent pooling of many kindsof resources such as computing power, storage,data, and services. This pooling enables appli-cations deployed on the Grid to transparentlyshare resources and utilize the provided capacitymore effectively. A related attribute is the abil-ity to allocate and reserve virtualized resources.Virtualization facilitates the sharing of resources:it allows the preservation of quality of serviceguarantees for time critical applications, and at

284 R. Jiménez-Peris, et al.

the same time offers true resource scavenging bytaking advantage of unused resources to performbatch computations.

These features make Grid computing attractivefor enterprises that want to convert it into their in-frastructure for deploying enterprise applications.This interest has resulted in minting a new term,Enterprise Grids. Enterprise Grid computing re-flects the use of Grid computing within the contextof a business or enterprise rather than for scientificapplications. A first generation of enterprise Gridsolutions, as developed by IBM [15] or Oracle[47], start to appear. A proof for the increasinginterest in Enterprise Grids is the recent cre-ation of the Enterprise Grid Alliance (EGA) withthe mission of developing a common EnterpriseGrid reference model [14] and fostering the useof Grids for enterprise computing. Very recentlyEGA and the Global Grid Forum have joinedinto the Open Grid Forum (OGF) [46]. OGF aimsat bringing together the goals of academic andenterprise applications.

However, Enterprise Grids have not yet founda wide adoption in industry. The problem is thatin order to effectively apply Grid technology toenterprise applications, many technical and scien-tific hurdles still have to be overcome. Originally,the need to exploit distributed resources for com-pute intensive scientific applications triggered thedevelopment of Grid architectures. These appli-cations are well-suited for automated distributionand scheduling, and Grid technology focused onproviding a virtualized abstraction of computingresources to these applications. However, enter-prise applications have many characteristics thatmake them inherently harder to deploy on a Gridinfrastructure. Examples are the stateful natureof business applications, the typical underlyingmulti-tier architecture where request executionfollows a complex path through a diverse set ofcomponents, the existence of transactional data,and the need for transactional interaction betweenthe different application components. In this pa-per, we aim at identifying some of the gaps be-tween what Grid infrastructures currently provideand what enterprise Grids need to offer. Fromthere, we survey recent achievements in the fieldof distributed computing that will help to address

some of the issues, and identify the open chal-lenges to realize the enterprise Grid vision.

The paper is structured as follows. First, inSection 2 we identify some of the main require-ments in Enterprise Grids and contrapose themwith current functionality provided by Grid sys-tems in order to identify any mismatches. Then,we describe some of the recent advances in dis-tributed systems that can be helpful to overcomethe discrepancies in Section 3. We continue inSection 4 by identifying the existing open scientificchallenges to realize the Enterprise Grid vision.Finally we present our conclusions in Section 5.

2 The Gap between Existing Grid Technologyand the Requirements for Enterprise Grids

In this section we explore in detail several gapsbetween what is currently provided by Grid tech-nology and what is needed to make Grids supporta wider range of enterprise applications.

Gap 1: Support for Online Applications Tradi-tional enterprise computing is usually composedof a mix of batch and online applications. Gridsalready have a good grip on batch applications andprovide sophisticated automation making batchapplications run efficiently on large distributed in-frastructures. Abstractions such as “task bags” canvery well serve as a basis for automating the pro-gramming and execution of batch enterprise appli-cations. For such tasks, the use of scavenging idleresources within an organization allows for betterresource utilization and increased throughput.

However, a large fraction of enterprise appli-cations are inherently interactive. We refer tothem as online applications because an end useris directly connected to the system. Online appli-cations require timely execution of requests andprompt responses to the client. As such, a sin-gle bottleneck within the request execution canlead to unsatisfying performance. That is, for on-line applications, the main performance metricsis the average response time, often with strictconstraints on its statistical distribution. This needis clearly documented in benchmarks for onlinesystems such as TPC-C for database applications[10], ECperf [64] and SPECjAppServer [63] for

Enterprise Grids: challenges ahead 285

information systems, or TPC-W [11] for webbased systems. Much of the current Grid techno-logy, however, focuses on optimizing the through-put for batch-style applications, i.e., serving alarge set of tasks without paying attention to theresponse time of individual requests.

Gap 2: Support for Transactional Data Businessapplications are often data-intensive. They access,process and manipulate large amounts of data,most of which reside in database repositories.Access to such data is nearly always in the contextof transactions in order to guarantee consistencyand durability of changes to the data. While gen-eral computational requests can be assigned tocomputing resources quite arbitrarily and at run-time, database requests have to be executed atsites that contain the database. Dynamic migra-tion or replication of entire databases to currentlyfree machines, however, cannot be achieved thateasily. It also has the additional overhead of syn-chronization protocols that keep track of all datareplicas for consistency purposes. For instance,while Oracle 10g allows a database to run acrossmultiple nodes in a cluster, its configuration ofdatabase servers is quite complex and relativelystatic.

Thus, current provisioning in most Grid envi-ronments focuses on stateless applications, andsometimes on applications that access read-onlypersistent data, in which replication and cachingcan be applied quite easily.

There has been some work related to “dataGrids.” The emphasis has been on data integra-tion and management of large collections of data.The goal of data integration is to provide a homo-geneous view of heterogeneous data schemascontaining similar information. Data integrationcan be an important issue in some enterprise ap-plications in which data from different domainsexhibit different schemas and a homogeneousview and access means should be provided. Ad-ditionally, there is the trend of converting dataaccess into services. In this context, the OGSA-DAI [45] middleware provides support for dataintegration and access via the Grid. It supportsthe exposure of various data sources (such asrelational and XML data), via homogeneous web

service interfaces, which can be a very usefulabstraction for business applications. In general,there has been a tremendous amount of work inthe context of data integration, and OGSA-DAIhas performed first steps to bring data integrationto the Grid. As such, data integration might notbe a challenging gap anymore. However, guaran-teeing transactional properties in a heterogeneousand distributed Grid environment raises newchallenges.

Gap 3: Support for Stateful and TransactionalApplications Business applications do not onlyaccess transactional data residing in backend data-bases, they often maintain state within the applica-tion code. Often, this state reflects the interactionsbetween an online client and the application (suchas session information or a shopping cart). Suchinteraction patterns reduce the flexibility in whichindividual tasks can be assigned to Grid resources.Another problem in this context is that a failureof a site can result in losing important applicationstate. Business processes can last for days, weeks,months, or even years. This is very common forworkflows and orchestrated web services. The lossof a stateful interaction would be unacceptable inthis context. Furthermore, advanced transactionmodels are often used [26, 67] in this context inorder to relax isolation. In the advent of a failureone possibility to attain consistency is to abort allrunning transactions. However, for long runningtransactions this is not really an option since theyhave to progress forward despite failures, unless itis impossible to attain their goal. Availability sup-port for such long running applications is limitedwith current Grid technology.

Gap 4: Support for Multi-tier Architectures En-terprise applications are typically deployed onmulti-tier architectures consisting of web, applica-tion server and database tiers, such as J2EE [65].As mentioned in Gaps 2 and 3, managing each ofthe individual tiers on the Grid is challenging, inparticular because of the state they manage. Pro-viding a Grid solution that supports multiple tiersis even more complicated due to the interactionpatterns between the tiers. For instance, web-userrequests typically trigger execution on all tiers.

286 R. Jiménez-Peris, et al.

Often the execution on all tiers is encapsulatedin a single transaction, that is, the actions of thetransaction can span data updates on all tiers.This means, gridification of multi-tier applicationsneeds to preserve consistency across tiers consid-ering any data and execution dependencies be-tween the tiers. Also failures need to be handledappropriately. That is, failover should be per-formed in such a way that consistency is still pre-served. Certainly, Grid systems provide supportfor multiple applications and servers. However,this support tends to deal with different serversas isolated entities. In order to extend function-ality to multi-tier systems, a systematic approachis needed to understand the interaction patternsbetween adjacent tiers, the execution across alltiers, and the impact of both on the gridificationprocess. Current Grid-support for multi-tier ormulti-component applications is limited. We be-lieve much more has to be done in this direction.

Gap 5: Autonomic support Another importantgoal for the Enterprise Grid vision is that ofautonomic management [34]. Autonomic man-agement encompasses several attributes such asself-healing, self-provisioning, self-optimizing,and self-configuring. Self-healing capabilities,which allow tolerating failures, provide the En-terprise Grid continuous availability. Fault toler-ance is typically attained by replication. Anotherfacet required to realize a self-healing enterpriseGrid and the associated high availability is theability to recover failed sites. Most approachesto recovery of replicas imply stopping systemoperation in order to obtain a quiescent stateand transfer it to the recovering replica. Oncethe recovery is completed the system operationis resumed. However, this offline recovery goesagainst the initial goal of high availability. Onthe other hand, recovery is needed to maintaina particular level of availability. Otherwise, rep-licas will fail and eventually the system willbecome unavailable. Therefore, there is a needfor performing recovery in a non-intrusive fashionmaintaining the system online. Currently, Gridsystems provide self-healing facilities but mainlyfor stateless applications. Sites can join and leave,but little is done to keep the consistency betweenthe state of working sites and the sites joining

the Grid. Therefore, there is still a large gapbetween the needs of Enterprise Grids and whatis provided by current Grid systems relative tostateful applications.

Another essential aspect of enterprise Grids isperformance and the need for self-optimization.Underlying middleware and database systemshave numerous parameters that enable to tunethem to maximize their performance. Unfortu-nately, these parameters are set manually andrequire a highly qualified administrator. What ismore, the right value for these parameters de-pends on the actual workload. An enterprise Gridcannot rely on human tuning and should be ableto tune itself to maximize performance. Ideally,human administrators only need to set high levelgoals in form of utility functions to hint the systemin which dimensions to maximize first. Autonomicperformance tuning implies to monitor the systemload and performance. When the system detectsthat performance decreases, tuning parametersneed to be adjusted in order to stabilize the systemand achieve a configuration that provides optimalperformance for the the current workload andits characteristics. This self-optimizing behaviorcan only be built upon an analytical model ofthe system that enables to understand the systemstate by monitoring the relationship between loadand performance and knowing which parametersinfluence this relationship. This means that theself-optimizing support requires specific work foreach kind of server, and in the case of enterpriseGrids, for each particular tier. Certainly, thereare currently no provisions for self-optimization ofmulti-tier architectures in Grids.

Self-optimization at each individual site/server/tier guarantees that its performance is maximizedgiven the received workload. However, individualself-optimization is not enough to maximize per-formance at the global level. The reason is thatsome sites might be overloaded whilst others areidle, despite each site being tuned to maximize itsindividual performance. Thus, processing poweris being lost due to the imbalance in the load.This means that an enterprise Grid should beable to self-configure itself to distribute the loadacross sites in a balanced way. This load-balancingshould be performed on a continuous basisand in a non-intrusive fashion. There has been

Enterprise Grids: challenges ahead 287

substantial work in providing load-balancing solu-tions for Grid environments. However, althoughload-balancing techniques and policies for differ-ent kinds of applications and servers share a com-mon background, each server and/or applicationhas typically very specific characteristics whichlead to individual load-balancing requirements.Thus, while current Grid systems offer load-balancing mechanisms that are well suited for par-allel and scientific applications they do not servewell transactional stateful applications with theirmany constraints in regard to data and executiondependencies. In particular, also the migration ofalready running jobs is inherently more complexdue to the state and execution dependencies.

Another interesting challenge arises if the com-puting power requirements of an enterprise appli-cation change over time as this requires some formof self-provisioning. If more processing power isneeded, more resources need to be added to beable to cope with the load. This should happenautonomously. Thus, a pool of free sites needsto be available. When load submitted increasesand the system feels that the given processingpower will soon be unable to handle this load,the system should self-provision itself and in-crease the amount of devoted resources. Con-versely, if the load is reduced the resources shouldshrink to free unneeded resources. However, self-provisioning in enterprise Grids is very challeng-ing due to their stateful nature. Adding a newreplica means that state should be first trans-ferred to it. This means that online recoveryis needed to enable self-provisioning as it wasneeded for attaining the self-healing property.Grid systems have achieved significant progressin self-provisioning. This infrastructure can begeneric enough to be reused in the context of En-terprise Grids, although support for dealing withstateful applications will have to be integrated.However, the recovery solutions for self-healingpurposes can be reused in this context.

Finally, one has to realize that autonomicmanagement in an Enterprise Grid cannot justrely on a component or server-based approach.Instead, autonomic management in EnterpriseGrids needs to adopt a holistic approach. The En-terprise Grid Architecture (EGA) suggests havingan entity in charge of management, denoted as

Grid Management Entity (GME). This GME isa recursive component. There is a GME at theglobal Enterprise Grid level. Then, each subsys-tem within the Enterprise Grid has its own GMEthat interacts with the higher level GME and thelower level GMEs corresponding to the containedsubsystems. For instance, in a multi-tier architec-ture, there will be a GME for the full system, andthen a GME for each gridified tier, e.g., web tier,application tier, and database tier. There will alsobe a GME within each instance of the server ina tier. That is, if there are n servers in the appli-cation server tier, each of them will have its ownGME for local autonomic management. In thisarea, the gap with what is currently provided byGrids is significant, since Grids typically deal withindependent servers or at least deal with them asif they were independent.

Gap 6: Support for Service Level Agreements Animportant class of enterprise applications providesservice provisioning governed by service levelagreements (SLAs). SLAs set a contract betweenclient and provider regarding the offered service,quality of service, cost vs. service tradeoffs, track-ing of the quality service offered, etc. One of themain concerns is that by sharing resources acrossdifferent clients, the demand from one client canlead to violations of the SLA of another client.This problem can be addressed by resource vir-tualization. The idea is to split the resources of asite into multiple isolated execution environmentsthat can be assigned to different clients. The per-formance isolation achieved by virtualization en-ables to adopt policies to maximize the fulfillmentof SLAs. Grid computing has produced manyresults regarding resource virtualization that canbe exploited to attain Enterprise Grids. However,there is still an important gap regarding how tohandle SLAs and the mapping of resource virtu-alization to clients, especially when the resourcesare spread across different tiers in a multi-tierarchitecture.

3 Filling the Gap: Recent Advancesin Distributed Systems

Although there are many open challenges andproblems that need to be addressed before

288 R. Jiménez-Peris, et al.

Enterprise Grids become a mainstream infra-structure, recent developments in several areashave proposed promising solutions that can possi-bly be applied in the context of Enterprise Grids.In here, we outline some of them.

Database Replication Database replication [2, 7,13, 25, 30, 38, 48–50, 52, 55, 59, 68] is impor-tant to support the gridification of stateful andtransactional online applications and the databasetier itself. Database replication has been an ac-tive research topic for many years. Especially inrecent years, many new approaches have beenproposed. What is important is that many of theseapproaches have response time as an importantperformance metrics which is needed to allow forgridification of online applications.

Some solutions looked at lazy replication whereupdate transactions are executed only at onereplica and the resulting updates are propagatedonly after commit to the other replicas. For in-stance, in [7], consistency is provided by assign-ing each object a master and only the mastercan accept and propagate updates; furthermoreassignment of objects to masters and the prop-agation topology is restricted to guarantee cor-rect executions. A different research line aimedat making the lack of consistency explicit to theuser, providing different degrees of data freshnessto the user [48, 49, 59]. Freshness measures theamount of updates that might have been missedby transactions and therefore quantifies an upperbound of the staleness of the data read by a trans-action. Another set of lazy approaches, such asIceCube [35] and its clustered extension [40], hasbeen devoted to reconciliation in optimistic repli-cation. Replicas are allowed to proceed in parallelwithout any measure to prevent conflicts. Later,conflicts due to concurrent accesses are fixed bymeans of semantic corrective actions.

A different line of research looked at scalablesolutions for eager data replication, where up-dates are propagated before commit, hence pro-viding much better consistency guarantees. Mostof these approaches look at solutions providing1CS. Many of them, such as [50, 52], build ongroup communication to guarantee consistencyin advent of any failure scenario. They addition-ally exploit asymmetric update processing [28] in

which update transactions are only fully processedat one replica (typically, different transactionsare fully processed at different replicas), whileother replicas only apply the updated tuples. Thisasymmetric processing is essential to attain scal-ability for update workloads [28]. Other eagerapproaches are based on schedulers [2] that pro-vided the necessary message ordering and con-sistency guarantees. All these approaches attainsome reasonable scalability up to several tens ofnodes while providing full consistency.

Both lazy and eager replication can find theirplace in enterprise Grids since they both allow totake advantage of additional resources and dis-tribute load among them.

An important issue that has also been lookedat is how to architect database replication. Threedifferent approaches have been identified: whitebox, black box, and gray box. The white box ap-proach consists in implementing replication withinthe database. The advantage is that it can beimplemented efficiently however it requires accessto the database code and the replication logic be-comes interwoven with the database logic. On theother extreme of the spectrum, one finds the blackbox approaches that do not modify the database[1]. However, it has the shortcoming that transac-tions need to be executed serially what hampersthe performance of the database. Finally, gray boxapproaches [50] advocate for implementing someminimal functionality within the database thatenables efficient and scalable replication at themiddleware level. Recently, a number of reflectiveapproaches [44, 60] have been studied to exposesome database functionality to enable performantreplication at the middleware level. In an enter-prise Grid, where Grid technology needs to con-trol to some degree provisioning and distribution,a middleware-based replication approach is likelyto be more advantageous because it allows fora better interaction between the replica controlsoftware and the Grid environment. A step in thisdirection could be the concept of data farming inwhich different databases are replicated and thereis a farm of database servers serving requests forall the applications [56].

Clustering in Multi-tier Architectures The re-search in clustering of multi-tier architectures will

Enterprise Grids: challenges ahead 289

be useful to support gridification of multi-tierarchitectures and the stateful and transactionalapplications that are typical in these architectures.

The replication of the web server tier has beenstudied very deeply during the last decade [8]. Theresearch has basically set the ground work for vir-tualizing IP addresses in different ways, typicallythrough a router. The result is a virtual IP addressthat can be served by a pool of servers with differ-ent physical IP addresses. When moving web clus-ters to a Grid environment, additional problemshave to be solved. Currently, most routers guar-antee that a request is sent to a working server,but there are no guarantees about serving therequest, since the server can crash before complet-ing request execution. What is lacking is a solutionthat provides end-to-end exactly-once semanticsto enable a transparent fault-tolerant and highlyavailable interaction. Furthermore, in order tointegrate the self-optimizing and self-provisioningfeatures of the Grid system into IP virtualization,it should be possible to change the set of sitesserving a virtual IP address dynamically.

A significant body of research has looked intodata caching for web servers covering issues suchas cache consistency combined with caching ofdynamic data. However, distributed and repli-cated data caching is still an open problem. One ofthe main difficulties is how to accomplish an inde-pendent gridified data caching tier (not recognizedas a tier in J2EE), such that its performance andresilience can be increased whilst still providingfull data coherence.

In the application server tier, most work hasfocused on providing availability via primary-backup replication or active replication. Earlywork started with CORBA (e.g. Eternal [43])ignoring the interactions with other tiers. The in-tegration of a replicated application server withother, non-replicated tiers, and in particular, thedatabase tier has been first discussed for CORBAand .NET [4, 16, 69] an in general settings [21].More recently, research on J2EE applicationservers has yielded several promising results in thecontext of the European Adapt project [5]. Forinstance, [68] discusses integration of replicationof the application server with advanced execu-tion and transaction patterns. Some recent resultsshow that it is possible to provide highly available

transactions in which, despite failovers, transac-tions are not aborted from the client perspective[54]. However, research in this area is still at anearly stage in which only solutions for fault toler-ance (primary-backup) have been provided. Morework is needed which takes self-optimization andself-provisioning into account.

Furthermore, little work has been done whenlooking at gridification, or at least replication,across several tiers. The X-ability framework [20]allows to reason about correctness in a generalmulti-tier architecture. [32] looks at different in-tegration approaches when several tiers are repli-cated. Integration can basically be achieved in twodifferent ways. One is that each tier is replicatedindependently (see Fig. 1a). That is, the applica-tion server (AS) tier is replicated independently ofthe database tier, also known as horizontal repli-cation. The second approach consists of takinga pair of application and database servers andreplicating this pair (see Fig. 1b). This approachis known as vertical replication [54].

Autonomic Computing The sheer size and com-plexity of modern information systems often over-whelms even the most skilled engineers thatare supposed to configure, optimize, secure, andrepair them. In fact, if the current trend of com-plexity growth continues, the administration ofthe next generation systems will be beyond humancapabilities. Even if the complexity does not in-crease, it is predicted that the number of deployedsystems in the near future will be substantiallyhigher that the number of qualified engineers tomanage them [34]. In fact, these studies haveresulted in launching research initiatives in mostlarge software companies to create a new gener-ation of systems able to self-manage themselves

DB DB DB

AS AS

DB

AS

DB

AS

DB

AS

a) Horizontal Replication b) Vertical Replication

Fig. 1 Replication approaches

290 R. Jiménez-Peris, et al.

with no human intervention except for settinghigh level goals or deciding between perfor-mance and/or costs tradeoffs. Examples of theseinitiatives are, among others, IBM autonomiccomputing, HP Adaptive Enterprise, MicrosoftDynamic Systems, and Intel Proactive Comput-ing. The vision is to attain systems able to man-age themselves, that is, systems able to self-heal,self-provision, self-optimize, and self-configurethemselves. This vision for self-management andself-adaptation is also part of the Enterprise Gridvision [14].

As mentioned before, self-healing requires re-covering failed replicas or introducing fresh ones.This will be crucial in order to allow for dynamicresource provisioning and adjustments in the load-distribution when either the user demands or thecapacity of the resources changes. In a statefulcontext, this requires to transmit the state of work-ing replicas to these new replicas so they can joina Grid of servers. Since such recovery might takeconsiderable time it is impossible to stop execu-tion during this process since this would violatethe initial goal of high availability. Recent workhas looked into online recovery for self-healingdatabases [27, 31]. Kemme et al. [31] proposeddifferent ways to perform online recovery withinthe database kernel; Jiménez-Peris et al. [27] stud-ies how to implement online recovery at the mid-dleware level providing toggles for a higher levelself-configuration entity to adapt the resourcesdevoted to recovery to the current load. It has alsobeen recently researched how to perform recoveryin a context in which the data centers from dif-ferent collaborating organizations (e.g. branchesunder a common hotel chain name) are intercon-nected to perform data warehousing. Some redun-dancy is introduced in the system that enables toperform recovery with a service interface [36].

The potential scalability provided by new repli-cation approaches can only be attained by re-sorting to autonomic protocols that self-optimizelocally and self-configure globally to maximizethroughput. In the database arena, a significanteffort has been made during the last decade todevelop self-tuning and self-optimization proto-cols, that act locally on a particular server. Oneof the subsystems in a database system is memorymanagement. When the server overloads, the sys-

tem enters into thrashing without grateful degra-dation. Several self-management techniques havebeen proposed for database memory managementin [66] such as access pattern aware adaptivecaching or speculative, self-adapting pre-fetchingpolicies.

Another important area for self-optimizationin databases has been adaptive load control. Itaims at choosing the optimal configuration for achanging system load. It is also important in theGrid context since we have to assume that thecapacity of Grid resources might vary dependingon what other tasks they take on. One of themost important parameters that affect contentionis the multiprogramming level (MPL), i.e., thenumber of transactions effectively executed con-currently. For instance, if the system has twentyoutstanding transactions, it can decide to exe-cute them one by one, two by two, or all twentyconcurrently. Some approaches have looked atcontention occasioned by locking. [42] determinesthe optimal degree of conflict and adjusts theMPL to maintain this optimal degree. [23, 61]apply feedback control to adjust the MPL to max-imize the throughput of a dataset. [41] proposes ahierarchical self-configuration/optimization forclustered databases where load-control is per-formed individually at each replica and load-balancing is performed globally across all replicas.

Another important requirement of EnterpriseGrids is the self-provisioning capacity. In the areaof workflows, there has been some interestingwork, in which replicas of the workflow serverare allocated dynamically depending on the sys-tem load [51]. In the context of database repli-cation, feedback-based approaches and machinelearning techniques have been used [9, 62] todynamically determine the set of active databasereplicas for database applications with changingworkload patterns.

4 Challenges Ahead

While the last section showed individual accom-plishments that have a great potential to be trans-ferred to Enterprise Grids, more work needs to bedone to develop a holistic solution. In this section,we shortly outline some of the issues.

Enterprise Grids: challenges ahead 291

Boosting the Scalability of Data Grids Most ofthe existing data replication strategies assume fullreplication (i.e., replicating the full database ateach site). However, such approach has a scala-bility ceiling that might be not enough for futurevery large data centers. Partial replication justreplicates a subset of data at each site and can be asolution for the scalability ([12]). However, thereare still many issues to resolve in this context. Oneof the interesting research problems is whethera combination of full and partial replication canyield a better scalability. Another important prob-lem is that solutions relying on pure partial repli-cation will likely need to resort to distributedatomic commitment that is the current bottleneckof distributed information systems. Recent pro-tocols for high performance distributed atomiccommitment such as the commit server [29] can bea solution to leverage partial replication to boostthe scalability of database replication.

The use of group communication might hamperthe scalability. More concretely, one option lies inexploiting the use of total order multicast. Opti-mistic delivery, a more aggressive version of opti-mistic multicast [53], has been proposed to masklatency of multicast by exploiting the spontaneoustotal order that happens in LANs [3, 33, 58].[27, 33, 50] have exploited successfully optimisticdelivery combined with a reordering technique toreduce the latency of data replication and mini-mize the number of aborts when the optimisticdelivery was wrong.

Low-latency Gridification across WANs A re-quirement of Enterprise Grids is the inter-connection of data centers across WANs. Thiswill need support for data replication acrossWANs. WANs differ from LANs mainly in thelatency of the communication. This higher la-tency becomes especially troublesome for groupcommunication based replication approaches inwhich the required ordering and reliability guar-antees might require several rounds of mes-sages. However, purely lazy replication strategiesmight not be acceptable for some of the trans-actional, update-intensive data. Thus, new repli-cation protocols are needed with higher levels ofscalability and lower latencies while at the sametime achieving acceptable levels of consistency.

Some initial steps are being taken towards datareplication in WANs [1, 22, 39] in which the useof group communication across WANs is eitherminimized or just dismissed.

Reducing the Overhead of Gridification Anothercrucial aspect for the successful gridification ofenterprise applications is the reduction of the in-herent overhead of the coordination among sites.Modern system area networks provide functional-ity that might be exploited to reduce this overheadby delegating functions, possibly dynamically, tothe underlying hardware. Some researchers havesuccessfully exploited modern system area net-works in the context of a distributed shared mem-ory middleware [6].

Holistic Approach to Virtualization and SLAsDespite the advances in virtualization of the dif-ferent kinds of resources (storage, networking,server tier etc.), new research is needed to adopta holistic approach to virtualization, in whichthe computational requirements of an applicationacross various tiers are automatically analyzed.From there, the necessary resources are devotedfor deploying and running the new applicationand adjusted autonomically to the changes inthe workload and priorities extracted from SLAs.There have been some early approaches to SLAsfor commercial Grids, such as [37].

Scalable and Consistent Gridification of Multi-tierArchitectures The current state of the art hasaddressed solutions for some of the gridificationissues of individual tiers. Much more work hasto be done, both in regard to gridification ofindividual tiers, and across the entire multi-tierarchitecture, taking into account any form of dataand execution dependencies.

Systemic Autonomic Computing Most currentapproaches for autonomic computing focus on in-dividual servers. New approaches addressing thehierarchical self-management of multi-tier sys-tems or complex systems as a whole are needed.Recent advances such as [57], in which perfor-mance bottlenecks of complex systems deployedin multi-tier architectures are automatically iden-tified, are a start in this direction.

292 R. Jiménez-Peris, et al.

Accurate Models for Self-optimization and Self-provisioning Despite the progress attain in self-*properties, the attained results do not seem tobe enough for the Enterprise Grid vision. Oneof the aspects that need further research is themodeling of systems with higher accuracy andautonomic protocols with higher guarantees thancurrent best effort approaches. Economic-basedapproaches seem to be a promising path to followthat provides mathematical models for resourcesharing [17].

5 Conclusions

In this paper we have identified some of the short-falls of current Grid technology when it comes toenterprise applications. The problems are mainlydue to the interactive nature of business appli-cations, the large amount of data that residesin database systems and requires transactionalaccess, and the component-based and multi-tierarchitecture of current enterprise applications.However, we believe that many recent researchadvances in the area of data management anddistributed systems can be applied and incorpo-rated in a Grid infrastructure, to get a step closerto what OGF envisions as the Enterprise Grid.Replication, caching, online reconfiguration tech-niques, farming, and load-balancing mechanismsthat have been proposed, seem promising candi-dates. However, we believe that adjustments toproposed technology need to be made to makethem work in a Grid environment. This paperaims at fostering cross-fertilization between theGrid and the distributed systems communities tomake this happen. We also think that there aresome outstanding issues that still need furtherfundamental research.

References

1. Amir, Y., Danilov, C., Miskin-Amir, M., Stanton, J.,Tutu, C.: On the performance of consistent wide-areadatabase replication. Technical Report CNDS-2003-3,John Hopkins University (2003)

2. Amza, C., Cox, A.L., Zwaenepoel, W.: Distributedversioning: consistent replication for scaling back-end databases of dynamic content web sites. In: Int.Middleware Conf. (2003)

3. Balakrishnan, M., Birman, K.: PLATO: Predicitvelatency-aware total ordering. In: Proc. of the Int. Symp.on Reliable Distributed Systems (SRDS) (2006)

4. Barga, R., Lomet, D., Weikum, G.: Recovery guaran-tees for general multi-tier applications. In: Int. Conf. onData Engineering (ICDE) (2002)

5. Bartoli, A., Jiménez-Peris, R., Kemme, B., and all:Adapt: towards autonomic web services. In: Distrib-uted Systems Online (2005)

6. Bilas, A., Iftode, L., Singh, J.P.: Evaluation of hard-ware support for shared virtual memory clusters. In:Proc. of the 12th ACM International Conference onSupercomputing (ICS98) (1998)

7. Breitbart, Y., Korth, H.F.: Replication and consistency:being lazy helps sometimes. In: ACM Int. Conf. onPrinciples of Database Systems (PODS) (1997)

8. Cardellini, V., Casalicchio, E., Colajanni, M., Yu, P.S.:The state of the art in locally distributed Web-serversystems. ACM Comput. Surv. 34(2), 263–311 (2002)

9. Chen, J., Soundararajan, G., Amza, C.: Autonomicprovisioning of backend databases in dynamic contentweb servers. In: Int. Conf. on Autonomic Computing(ICAC) (2006)

10. Council, T.P.P.: TPC Benchmark C (2006)11. Council, T.P.P.: TPC Benchmark W (2006)12. de Sousa, A.L.P.F., Oliveira, R.C., Moura, F.,

Pedone, F.: Partial replication in the database statemachine. In: IEEE Int. Symposium on NetworkComputing and Applications (2001)

13. Elnikety, S., Zwaenepoel, W., Pedone, F.: Databasereplication using generalized snapshot isolation. In:IEEE Int. Symp. on Reliable Distributed Systems(SRDS) (2005)

14. Enterprise Grid Alliance: EGA Reference Model(2005)

15. Ferreira, L., Easton, J., Kra, D., et.al.: Patterns: Emerg-ing Patterns for Enterprise Grids. IBM RedBooks(2006)

16. Felber, P., Narasimhan, P.: Reconciling replication andtransactions for the end-to-end reliability of CORBAapplications. In: DOA (2002)

17. Ferguson, D.F., Nikolau, C., Sairamesh, J., Yemini, Y.:Economic models for allocating resources in computersystems. In: Clearwater, S.H. (ed.): Market-based Con-trol: A Paradigm for Distributed Resource Allocation,pp. 156–183. World Scientific Publishing Co. Inc. (1996)

18. Foster, I.T.: The anatomy of the Grid. In: CCGRID(2001)

19. Foster, I.T., Kesselman, C., Nick, J., Tuecke, S.: Thephysiology of the Grid. In: Global Grid Forum (2002)

20. Frølund, S., Guerraoui, R.: X-ability: a theory of repli-cation. In: Symp. on Principles of Distributed Comput-ing (PODC) (2000)

21. Frølund, S., Guerraoui, R.: e-Transactions: end-to-endreliability for three-tier architectures. IEEE Trans.Softw. Eng. 28(4), 378–395 (2002)

22. Grov, J., Soares, L., Correia Jr., A., Pereira, J.,Oliveira, R., Pedone, F.: A pragmatic protocol for data-base replication in interconnected clusters. In: Proc.of the 12th IEEE Int. Symp. Pacific Rim DependableComputing (PRDC). Riverside, CA (2006)

Enterprise Grids: challenges ahead 293

23. Heiss, H., Wagner, R.: Adaptive load control in trans-action processing systems. In: Proc. of 17th Very LargeData Bases Conf. (VLDB) (1991)

24. Foster, I., Kesselman, C.: The Grid. MKP, Budapest(1998)

25. Irún-Briz, L., Decker, H., de Juan-Marín, R.,Castro-Company, F., Armendáriz-Iñigo, J.E., Muñoz-Escoí, F.D.: MADIS: a slim middleware for databasereplication. In: Euro-Par, pp. 349–359 (2005)

26. Jajodia, S., Kerschberg, L. (eds.): Advanced Trans-action Models and Architectures. Kluwer, Dordrecht(1997)

27. Jiménez-Peris, R., Patiño-Martínez, M., Alonso, G.:Non-intrusive, parallel recovery of replicated data. In:IEEE Symp. on Reliable Distributed Systems (SRDS)(2002)

28. Jiménez-Peris, R., Patiño-Martínez, M., Alonso, G.,Kemme, B.: Are quorums an alternative for data repli-cation. ACM Trans. Database Syst. 28(3), 257–294(2003)

29. Jiménez-Peris, R., Patiño-Martínez, M., Alonso, G.,Arevalo, S.: A low-latency non-blocking atomic com-mitment. In: Int. Conf. on Distributed Computing(DISC) (2001)

30. Kemme, B., Alonso, G.: Postgres-R, a new way to im-plement database replication. In: Int. Conf. on VeryLarge Data Bases (VLDB) (2000)

31. Kemme, B., Bartoli, A., Babaoglu, O.: Online recon-figuration in replicated databases based on group com-munication. In: Int. Conf. on Dependable Systems andNetworks (DSN) (2001)

32. Kemme, B., Jiménez-Peris, R., Patiño-Martínez, M.,Salas, J.: Exactly once interaction in a multi-tier archi-tecture. In: VLDB Workshop on Design, implementa-tion, and deployment of database replication (2005)

33. Kemme, B., Pedone, F., Alonso, G., Schiper, A.,Wiesmann, M.: Using optimistic atomic broadcast intransaction processing systems. IEEE Trans. Knowl.Data Eng. 15(4), 1018–1032 (2003)

34. Kephart, J., Chess, D.: The vision of autonomic com-puting. IEEE Comput. 36(1), 41–50 (2003)

35. Kermarrec, A.-M., Rowstron, A.I.T., Shapiro, M.,Druschel, P.: The IceCube approach to the reconcil-iation of divergent replicas. In: ACM Int. Conf. onPrinciples of Distributed Computing (PODC) (2001)

36. Lau, E., Madden, S.: An integrated approach to recov-ery and high availability in an updatable, distributeddata warehouse. In: Proc. of the 32nd Int. Conf. onVery Large Data Bases (VLDB) (2006)

37. Leff, A., Rayfield, J.T., Dias, D.M.: Service-level agree-ments and commercial Grids. IEEE Internet Comput-ing 7(4), 44–50 (2003)

38. Lin, Y., Kemme, B., Patiño-Martínez, M., Jiménez-Peris, R.: Middleware based Data replication providingsnapshot isolation. In: ACM Int. Conf. on Managementof Data (SIGMOD) (2005)

39. Lin, Y., Kemme, B., Patiño-Martínez, M., Jiménez-Peris, R.: Consistent data replication: is it feasible inWANs? In: Euro-Par (2005)

40. Martins, V., Pacitti, E., Valduriez, P.: A dynamicdistributed algorithm for semantic reconciliation. In:

6th Workshop on Distributed Data and Structures(WDAS) (2006)

41. Milan, J., Jiménez-Peris, R., Patiño-Martínez, M.,Kemme, B.: Adaptive middleware for data replication.In: Int. Middleware Conf. (Middleware) (2004)

42. Moenkeberg, A., Weikum, G.: Performance evaluationof an adaptive and robust load control method for theavoidance of data contention trashing. In: Int. Conf. onVery Large Data Bases (VLDB) (1992)

43. Moser, L.E., Melliar-Smith, P.M., Narasimhan, P.,Tewksbury, L., Kalogeraki, V.: The eternal system: anarchitecture for enterprise applications. In: Int. on En-terprise Computing Conf. (EDOC) (1999)

44. Muñoz-Escoí, F.D., Pla-Civera, J., Ruiz-Fuertes, M.I.,Irún-Briz, L., Decker, H., Armendáriz-Íñigo, J.E.,de Mendívil, J.R.G.: Managing transaction conflicts inmiddleware-based database replication architectures.In: IEEE Int. Symp. On Reliable Distributed Systems(SRDS) (2006)

45. OGSA-DAI: http://www.ogsadai.org.uk/.46. Open Grid Forum: http://www.ogf.org/.47. Oracle: Grid Computing with Oracle. White Paper

(2005)48. Pacitti, E., Simon, E.: Update propagation strategies to

improve freshness in lazy master replicated databases.VLDB Journal 8(3,4), 305–318 (2000)

49. Pape, C.L., Gançarski, S., Valduriez, P.: Refresco: Im-proving query performance through freshness controlin a database cluster. In: CoopIS (2004)

50. Patiño-Martínez, M., Jiménez-Peris, R., Kemme, B.,Alonso, G.: Middle-R: Consistent database replicationat the middleware level. ACM Trans. Comput. Syst.(TOCS) 23(4), 275–423 (2005)

51. Pautasso, C., Heinis, T., Alonso, G.: Autonomic execu-tion of web service compositions. In: Int. Conf. on WebServices (ICWS) (2005)

52. Pedone, F., Guerraoui, R., Schiper, A.: The databasestate machine approach. Distributed and Parallel Data-bases 14(1), 71–98 (2003)

53. Pedone, F., Schiper, A.: Optimistic atomic broadcast.In: Kutten, S. (ed.) Proc. of 12th Distributed Comput-ing Conference (DISC), Vol. LNCS 1499, pp. 318–332(1998)

54. Perez, F., Vuckovic, J., Patiño-Martínez, M.,Jiménez-Peris, R.: Highly available long runningtransactions and activities for J2EE applications. In:IEEE Int. Conf. on Distributed Computing Systems(ICDCS) (2006)

55. Plattner, C., Alonso, G.: Ganymed: Scalable replica-tion for transactional web applications. In: Proc. of theACM/IFIP/USENIX Int. Middleware Conf. (2004)

56. Plattner, C., Alonso, G., Ozsu, T.: DBFarm: A scal-able cluster for multiple databases. In: Proc. of theACM/IFIP/USENIX Int. Middleware Conf. (2006)

57. Robertson, P., Williams, B.: Automatic recovery fromsoftware failure. Commun. ACM 49, 41–47 (2006)

58. Rodrigues, L., Mocito, J., Carvalho, N.: From spon-taneous total order to uniform total order: Differentdegrees of optimistic delivery. In: Proc. of the ACMSymposium on Applied Computing (SAC), pp. 723–727 (2006)

294 R. Jiménez-Peris, et al.

59. Röhm, U., Böhm, K., Schek, H.-J., Schuldt, H.: FAS -A freshness-sensitive coordination middleware for acluster of OLAP components. In: Int. Conf. on VeryLarge Data Bases (VLDB) (2002)

60. Salas, J., Jiménez-Peris, R., Patiño-Martínez, M.,Kemme, B.: Lightweight reflection for middlewarebased database replication. In: IEEE Symp. on Reli-able Distributed Systems (SRDS) (2006)

61. Schroeder, B., Harchol-Balter, M., Iyengar, A.,Nahum, E.M., Wierman, A.: How to determine agood multi-programming level for external scheduling.In: Int. Conf. on Data Engineering (ICDE) (2006)

62. Soundararajan, G., Amza, C., Goel, A.: Database repli-cation policies for dynamic content applications. In:ACM SIGOPS EuroSys (2006)

63. Standard Performance Evaluation Corporation: SPEC-jAppServer2004 version 1.03. Standard PerformanceEvaluation Corporation (2006)

64. Sun Microsystems: ECperf specification v1.1 finalrelease. Sun Microsystems (2003a)

65. Sun Microsystems: Java 2 Platform Enterprise Editionv1.4. Sun Microsystems (2003b)

66. Weikum, G., Christian, A., Kraiss, A., Sinnwell, M.:Towards self-tuning memory management for dataservers. IEEE Data Eng. Bull. 22(2), 3–11 (1999)

67. WS-CAF: Web services composite application frame-work (WS-CAF). OASIS (2005)

68. Wu, S., Kemme, B.: Postgres-R(SI): Combining replicacontrol with concurrency control based on snapshotisolation. In: IEEE Int. Conf. on Data Engineering(ICDE) (2005)

69. Zhao, W., Moser, L.E., Melliar-Smith, P.M.: Unifi-cation of replication and transaction processing inthree-tier architectures. In: IEEE Int. Conf. onDistributed Computing Systems (ICDCS), pp. 290–300(2002)