
Extending DBMSs with satellite databases


Fig. 1 Example of a Ganymed system configuration

kept consistent with its master at all times, using a lazy replication scheme. Transaction dispatchers make sure that clients always see a consistent state [we use snapshot isolation (SI) as correctness criterion] and that the system enforces strong consistency guarantees (a client always sees its own updates and is guaranteed to work with the latest snapshot available). Finally, satellites do not necessarily need to be permanently attached to a master. In the paper we explore how to create satellites dynamically (by using spare machines in a pool) to react to changes in load or load patterns. This opens up the possibility of offering database storage as a remote service as part of a data grid.

1.2 Consistency

A key difference between our system and existing replication solutions (open source and commercial) is that clients always see correct data according to SI [34]. SI, implemented in, e.g., Oracle [27], PostgreSQL [29], and Microsoft SQL Server 2005 [25], avoids the ANSI SQL phenomena P1–P4 [7]. It also eliminates conflicts between reads and writes. Read-only transactions get a snapshot of the database (a version) as of the time the transaction starts. From then on, they run unaffected by writers.

For correctness purposes, we differentiate between update transactions (there is at least one update operation in the transaction) and read-only transactions (or queries). Transactions do not have to be sent by clients as a block (unlike in, e.g., [19,26]); they can be processed statement by statement. Update transactions are forwarded to the master. Read-only transactions will, whenever possible, be sent to one of the satellites and executed with SI. The same snapshot (on the same satellite) is used for all statements in the transaction, thereby ensuring that a client's read-only transactions always see a consistent state of the database over their whole execution time.

Queries are executed only after the assigned satellite has applied all the master's updates up to the time the transaction starts. Hence, a client always sees its own updates and all the snapshots it sees are consistent in time. To enforce this requirement on the satellites, we resort to transaction tagging. Obviously, if updates do not get applied fast enough on the satellites, then readers must be delayed until the necessary snapshot becomes available.

1.3 Validation

The system described in the paper has been fully implemented. We describe here only a proof of concept implementation that demonstrates the feasibility of the idea. We do not claim that ours is the optimal implementation, nor that we have solved all problems and issues that arise in this context. Rather, the objective is to show the potential of our architecture by analyzing its behavior under a variety of contexts and applications.

For the experiments performed, there is an overwhelming number of design variables to consider. There are also many product-specific tools, interfaces, and solutions that would not only significantly boost the performance of our system but also detract from its generality. Consequently, in the paper we use only generic solutions instead of product-specific optimizations. This is of particular relevance in regard to update extraction from the master, application of updates at the satellites, and the dynamic creation of copies, all of which can be performed more efficiently by using product-specific solutions.

Also note that we discuss two potential applications: full replication satellites and specialized satellites. Implementing full replication satellites across heterogeneous databases is very complex. However, most of the techniques necessary to implement full replication satellites also apply to specialized satellites. We use full replication satellites as a worst case scenario for analyzing our system. The lessons learned from this experience help in implementing specialized satellites, which may prove to be the most interesting application. The advantage of specialized satellites is that most of the difficulties involved in creating and operating satellites disappear. Thus, specialized satellites may not only be more useful, they are also easier to implement. Yet, the purpose of the paper is to explore the entire design space, so we include the full replication experiments to show the capabilities and limitations of the idea we propose.


Note that none of the middleware-based replication solutions recently proposed [5,10,23,26] can cope with partial replication, let alone specialized satellites.

1.4 Contributions

In this paper, we propose a novel replication architecture to extend databases. We also propose novel execution strategies for some applications over this architecture. The main contributions of the architecture and the paper are:

– The use of satellites acting as external data blades for database engines, in homogeneous and heterogeneous settings.

– We show how satellites can be used to extend database engines with conventional replication and with specialized functionality that may not be available in the original master DBMS. This external data blade approach allows the addition of extensions without modification of the original database engine and without affecting its performance.

– We provide clients with a consistent view of the database using a technique that we have introduced in an earlier paper [28]. We are not aware of any implemented solution, open source or commercial, that provides the consistency levels we provide (much less with the performance our system exhibits).

– We support dynamic creation of satellites. This operation is intrinsically very expensive for full replication but can be used to great effect with specialized satellites that only need a fraction of the master's data to be created.

1.5 Paper organization

The rest of this paper is organized as follows: first, we describe the Ganymed system architecture (Sect. 2). This is followed by a description of the details of SI and the system's underlying transaction routing algorithm (Sect. 3). In Sects. 4 and 5 we describe the problems that have to be handled in heterogeneous setups. Then, in Sect. 6, we describe the experimental evaluation of our system.

To demonstrate the more advanced possibilities of the system, namely, specialized satellites, we then show in Sect. 7 how we used PostgreSQL satellites to extend an Oracle master with a skyline query facility [9,11]. As another possibility, we also show how we used the Ganymed system to implement a full text keyword search based on satellite databases. The objective of these experiments is to show how satellites can be used to implement new operators that require heavy computation without affecting the master. In Sect. 8 we then explore the behavior of the system when satellites are dynamically created. We include experiments on the cost of dynamic satellite creation (as well as a discussion on how this is done) and experiments on dynamic skyline satellites. The paper ends with a discussion of related work (Sect. 9) and conclusions.

2 System architecture

2.1 Design criteria

One of the main objectives of Ganymed is to offer database extensibility while maintaining a single system image. Clients should not be aware of load balancing, multiple copies, failing satellites, or consistency issues. We also do not want to make use of relaxed consistency models; even though we internally use a lazy replication approach, clients must always see a consistent state. Unlike existing solutions (e.g., [5,10,35]) that rely on middleware for implementing database replication, we neither depend on group communication nor implement complex versioning and concurrency control at the middleware layer.

This point is a key design criterion that sets Ganymed apart from existing systems. The objective is to have a very thin middleware layer that nevertheless offers consistency and a single system image. Unlike [23,35], we do not want to use group communication, in order to be able to provide scalability and fast reaction to failures. Avoiding group communication also reduces the footprint of the system and the overall response time. As recent work shows [23], group communication is also no guarantee for consistency. Although not stated explicitly, the design in [23] offers a trade-off between consistency (clients must always access the same copy) and single system image (at which point clients are not guaranteed to see consistent snapshots). We also want to avoid duplicating database functionality at the middleware level. Ganymed performs neither high level concurrency control (used in [10], which typically implies a significant reduction in concurrency since it is done at the level of table locking) nor SQL parsing and version control (used in [5], both expensive operations for real loads and a bottleneck once the number of versions starts to increase and advanced functionality like materialized views is involved). The thin middleware layer is a design objective that needs to be emphasized, as the redundancy is not just a matter of footprint or efficiency. We are not aware of any proposed solution that duplicates database functionality (be it locking, concurrency control, or SQL parsing) that can support real database engines.


Fig. 2 Data flows between the dispatcher, the master and the satellites

The problem of these designs is that they assume that the middleware layer can control everything happening inside the database engine. This is not a correct assumption (concurrency control affects more than just tables, e.g., recovery procedures, indexes). For these approaches to work correctly, functionality such as triggers, user-defined functions and views would have to be disabled, or the concurrency control at the middleware level would have to work at an extremely conservative level. In the same spirit, Ganymed imposes no data organization, structuring of the load, or particular arrangements of the schema (unlike, e.g., [18]).

In terms of the DBMSs that the architecture should support, the objective is flexibility and, thus, we do not rely on engine-specific functionality. The design we propose does not require special features or modifications to the underlying DBMS.

2.2 Overview

The system works by routing transactions through a dispatcher over a set of backend databases. For a given dispatcher, the backends consist of one master and a set of satellites. The objective is to use the satellites to extend the master according to two principles: clients see a consistent database at all times and the master can take over if the satellites cannot deal with a query.

The latter point is crucial to understanding the design of the system. In the worst case, our system behaves as a single database: the master. When in doubt, the dispatcher routes the traffic to the master. We also rely on the master to provide industrial strength (e.g., crash recovery and fault tolerance). The idea is that the satellites extend the functionality or capacity of the master but neither replace it nor implement redundant functionality. This same principle applies to the problem of replicating triggers, user-defined functions, etc. Our system is not meant to extend that functionality. Thus, transactions that involve triggers or user-defined functions are simply sent to the master for execution there.

A basic assumption we make is that we can achieve a perfect partition of the load between master and satellites. However, unlike previous work [10,26], we do not require the data to be manually partitioned across nodes. For the purposes of this paper, the loads we consider involve full replication and specialized functionality (skyline queries and keyword search). For fully replicated satellites, the master executes all write operations while the satellites execute only queries (read-only transactions). In the case of specialized functionality, the satellites execute skyline queries and keyword searches; all other transactions are done at the master. We also assume that queries can be answered within a single satellite.

2.3 Main components

The main components of the system are as follows (see Fig. 2). The dispatcher is responsible for routing transactions to the master and satellites. It acts as front end for clients. The system is controlled and administered from a management console. Communication with the backend DBMSs always takes place through adapters, which are thin layers of software installed on the DBMS machines.

In terms of database machines, we consider three types: masters, primary satellites, and secondary satellites. Primary satellites are optional and used for dynamic creation of satellites. The purpose of primary satellites is to be able to create new satellites without hitting the master for all the data necessary to spawn a new satellite. When implementing dynamic satellites, there is always one primary satellite attached to the master. Secondary satellites are those created dynamically.1

1 In Fig. 2, PITR stands for the technique we use in our prototype to dynamically create satellites. Please refer to Sect. 8.


The current Ganymed prototype, implemented in Java, does not yet support multiple dispatchers working in parallel, but it is not vulnerable to failures of the dispatcher. If a Ganymed dispatcher fails, it can immediately be replaced by a standby dispatcher. The decision to replace a dispatcher with a backup is made by the manager component, which runs on a dedicated machine and constantly monitors the system. The manager component is also responsible for reconfigurations; it is used, e.g., by the database administrator to add and remove replicas. Interaction with the manager component takes place through a graphical interface. In the following sections, we describe each component in more detail.

2.4 Transaction dispatcher

The transaction dispatcher (see Fig. 3) acts as the gateway to services offered by the master database and the satellite engines. As its interface we have implemented the PostgreSQL protocol specification.2 This is a significant departure from existing replication projects (e.g., [5,6,10,26,28]) that rely on proprietary interfaces with limited support for programming languages and OS platforms.

The dispatcher is mainly in charge of routing transactions across the system. Routing has three main components: assigning transactions to the correct backend DBMS, load balancing if more than one DBMS is able to answer a request, and transaction tagging to enforce consistent views of the available data.

However, before a client can send transactions, it first needs to authenticate the current session. The dispatcher has no internal user database; rather, it relies on the adapters to verify locally whether the given credentials are correct. On the satellites, for simplicity, we assume that the same users exist as on the master DBMS. Access control to tables is therefore always performed locally by the backend DBMS that is executing a transaction.

In an authenticated session, assigning transactions involves deciding where to execute a given transaction. Typically, updates go to the master, queries to the satellites. Since the dispatcher needs to know if a transaction is of type update or read only, the application code has to communicate this.3

If an application programmer forgets to declare a read-only transaction, consistency is still guaranteed, but performance is reduced due to the increased load on the master replica.

2 Therefore, the dispatcher can be accessed from any platform/language for which a PostgreSQL driver has been written [e.g., C#, C/C++, Java (JDBC), Perl, Python, etc.].
3 For example, a Java client would have to use the standard JDBC Connection.setReadOnly() method.

Fig. 3 Example of a dispatcher that is connected to two adapters running on an Oracle master and a PostgreSQL satellite

On the satellites, all transactions are always executed in read-only mode. Therefore, if a programmer erroneously marks an update transaction as read only, it will be aborted by the satellite's DBMS upon the execution of the first update statement.
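As an illustration, the following sketch shows how a client might mark a read-only transaction through the standard JDBC API; the connection URL, credentials, and query are placeholders, and only Connection.setReadOnly() is taken from the text above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReadOnlyClient {
    public static void main(String[] args) throws Exception {
        // The dispatcher speaks the PostgreSQL protocol, so a stock
        // PostgreSQL JDBC driver can be used (URL is a placeholder).
        Connection con = DriverManager.getConnection(
                "jdbc:postgresql://dispatcher-host:5432/tpcw", "user", "secret");
        con.setAutoCommit(false);
        con.setReadOnly(true);          // hint: route this transaction to a satellite
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT i_id, i_title FROM item")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1) + " " + rs.getString(2));
            }
        }
        con.commit();
        con.setReadOnly(false);         // subsequent transactions go to the master
        con.close();
    }
}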

In the case of multiple equivalent satellites, the dispatcher also performs load balancing as part of the routing step. Different policies can be selected by the administrator; currently round-robin, least-pending-requests-first, and least-loaded are offered. The latter is determined based upon load information in the metadata sent by the adapters, piggybacked on query results (see Sect. 2.7).
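As a rough illustration of such a policy, the following sketch selects a satellite using a least-pending-requests-first rule. The class and field names are hypothetical and not taken from the Ganymed code.

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical representation of a satellite as seen by the dispatcher.
class SatelliteHandle {
    final String name;
    final AtomicInteger pendingRequests = new AtomicInteger();
    SatelliteHandle(String name) { this.name = name; }
}

class LeastPendingRequestsPolicy {
    // Pick the satellite with the fewest outstanding requests.
    SatelliteHandle choose(List<SatelliteHandle> satellites) {
        SatelliteHandle best = null;
        for (SatelliteHandle s : satellites) {
            if (best == null
                    || s.pendingRequests.get() < best.pendingRequests.get()) {
                best = s;
            }
        }
        if (best != null) {
            best.pendingRequests.incrementAndGet();   // decremented when the query returns
        }
        return best;
    }
}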

As explained above, satellites always contain consistent snapshots of the master.


Transactions that work on the master can use whatever isolation levels are offered by that database. On the satellites, however, queries are answered using SI. To be able to produce fresh snapshots, the satellites consume writesets [21,28] from the master. Each writeset relates to an update transaction and contains all the changes performed by that transaction. Writesets are applied on the satellites in the order of the corresponding commit operations on the master, thereby ensuring that the satellites converge to the same state as the master. The application of writesets is done under SI: updates from the master and queries from the clients do not interfere in the satellites.

The adapter on the master is responsible for assigning increasing numbers (in the order of successful commit operations) to all produced writesets. Every time a transaction commits on the master, the dispatcher is informed, along with the successful COMMIT reply, about the number of the produced writeset. The number n of the highest produced writeset WSn so far is then used by the dispatcher to tag queries when they start, before they are sent to the satellites. When a satellite starts working on a query, it must be able to assign a snapshot at least as fresh as WSn. If such a snapshot cannot be produced, then the start of the transaction is delayed until all needed writesets have been applied. Note that transactions that are sent to the master are never tagged, since they always see the latest changes.
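The following minimal sketch, with hypothetical names rather than the actual prototype code, illustrates the tagging idea on the satellite side: a query tagged with writeset number n is only admitted once the satellite has applied at least writeset WSn.

// Hypothetical satellite-side admission control for tagged queries.
class SnapshotTracker {
    private long lastAppliedWriteset = 0;   // highest WS number applied so far

    // Called by the writeset applier after each writeset has been committed locally.
    synchronized void writesetApplied(long n) {
        lastAppliedWriteset = n;
        notifyAll();
    }

    // Called before a tagged read-only query is given a snapshot.
    // Blocks until the satellite is at least as fresh as the tag demands.
    synchronized void awaitFreshness(long requiredWriteset) throws InterruptedException {
        while (lastAppliedWriteset < requiredWriteset) {
            wait();
        }
    }
}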

Since only a small amount of state information must be kept by a Ganymed dispatcher, it is even possible to construct dispatchers working in parallel. This helps to improve the overall fault tolerance. In contrast to traditional eager systems, where every replica has its own scheduler that is aware of the global state, the exchange of status information between a small number of RSI-PC dispatchers can be done very efficiently. Even in the case that all dispatchers fail, it is possible to reconstruct the overall database state: a replacement dispatcher can be used and its state initialized by inspecting all available replicas.

In the case of failing satellites, a Ganymed dispatcher simply ignores them until they have been repaired by an administrator. However, in the case of a failing master, things are a little bit more complicated. By just electing a new master the problem is only halfway solved. The dispatcher must also make sure that no updates from committed transactions get lost, thereby guaranteeing ACID durability. This objective can be achieved by only sending commit notifications to clients after the writesets of update transactions have successfully been applied on a certain, user-defined number of replicas.

2.5 Master DBMS

The master DBMS, typically a commercial engine, is used to handle all update transactions in the system. We also rely on the master for availability, consistency, and persistence. Since all update traffic in the system is handled by the master, all conflict handling and deadlock detection are done there.

A critical point in our design is update extraction from the master. This is a well-known problem in database replication that has been approached in many different ways. In general, there are three options: propagating the SQL statements from the dispatcher, using triggers to extract the changes, and reading the changes from the log. Log analysis is commonly used in commercial solutions. Triggers are preferred in open source replication systems [13]. SQL propagation is mostly used in research systems [6,19,26]. There is also the option of using product-specific tools (Oracle, e.g., has numerous interfaces that could be used for this purpose).

For generality, we use trigger-based extraction and SQL propagation (which we call the generic approach, see Sect. 4). Nevertheless, both introduce problems of their own. Triggers are expensive and lower the performance of the master. SQL propagation does not affect the master but has problems with non-deterministic statements and creates SQL compatibility problems across heterogeneous engines. In terms of update extraction, there is really no perfect solution, and focusing on a single type of master rather than aiming for generality will always yield a better solution. Since, as already indicated, the purpose is to have a proof of concept implementation, the two options we explore provide enough information about the behavior of the system. We leave the evaluation of other options and tailoring to specific engines to future work.

2.6 Satellites

The main role of the satellites is to extend the master. The satellites have to be able to appear and disappear at any time without causing data loss. There might be work lost, but the assumption in our system is that any data at the satellites can be recreated from the master.

Satellites apply incoming writesets in FIFO order. For specialized satellites (e.g., with aggregations or combinations of the master's tables), writesets have to be transformed before they are applied (e.g., combining source columns into new tables, or performing aggregation by combining writesets with the tablespace data). In general, the transformation is based on two parameters: the incoming writeset WSk and the satellite's latest snapshot S(k-1) (see Sect. 3.1).
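A minimal sketch of such an apply loop is shown below. The class names and the transform step are hypothetical and stand in for the mechanisms described in Sects. 2.7 and 4.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical writeset applier running on a satellite.
class WritesetApplier implements Runnable {
    // Writesets arrive from the master adapter in commit order.
    private final BlockingQueue<Writeset> queue = new LinkedBlockingQueue<>();

    void enqueue(Writeset ws) { queue.add(ws); }

    public void run() {
        try {
            while (true) {
                Writeset ws = queue.take();          // FIFO: strict commit order
                Writeset local = transform(ws);      // identity for full replication
                apply(local);                        // one local SI transaction per writeset
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private Writeset transform(Writeset ws) { return ws; }  // placeholder
    private void apply(Writeset ws) { /* execute the row changes against the local DBMS */ }
}

class Writeset { long number; /* changed rows omitted in this sketch */ }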


Fig. 4 The management console

2.7 Adapters

Adapters (see Figs. 2, 3) are used for gluing the different components of the system together. The dispatcher and the management console never talk directly to any backend DBMS; communication is always with an adapter. Adapters form a common abstraction of database engines and offer services like writeset handling and load estimation. Apart from these tasks, adapters also send information to the dispatcher (piggybacked on the query results) on CPU and I/O load,4 latest applied writeset, COMMIT numbers, etc. This information is also used by the management console to decide if the system has the appropriate number of satellites to handle the current load.

Adapters are also used to solve problems that arise in heterogeneous environments, mainly SQL dialect differences between engines. For the experiments in this paper, we have implemented translation modules for adapters, which can, to a certain extent, dynamically translate queries from the master's SQL dialect to the one used by the satellite on which the adapter is running. As an example, these translators help to execute Oracle TOP-N queries based on the ROWNUM construct on a PostgreSQL satellite database, which in turn offers the LIMIT clause. In general, such query rewriting is a typical O(n^2) problem for n database engines.

4 The actual measurement of CPU and I/O usage does not introduce additional load, as we simply collect the needed statistics from the Linux proc file system, similar to the vmstat utility.

Nevertheless, note that query rewriting is necessary only for full replication satellites, but not for specialized satellites. The assumption we make is that if the notion of satellite databases takes hold and full replication satellites are implemented, there will be enough motivation to develop such query rewriters, at least for the more important database engines.

2.8 Management console

The manager console (see Fig. 4 for a screenshot) is responsible for monitoring the Ganymed system. On the one hand, it includes a permanently running process which monitors the load of the system and the failure of components. On the other hand, it is used by the administrator to perform configuration changes.

While replica failure can directly be handled by the dispatcher (failed satellites are simply discarded, failed masters are replaced by a satellite, if possible), the failure of a dispatcher is more critical. In the event that a dispatcher fails, the monitor process in the manager console will detect this and is responsible for starting a backup dispatcher.5 Assuming fail-stop behavior, the connections between the failing dispatcher and all replicas will be closed, and all running transactions will be aborted by the assigned replicas. The manager will then inspect all replicas, elect a master and configure the new dispatcher so that transaction processing can continue.

5 As part of our prototype implementation we have also created a modified PostgreSQL JDBC driver that will detect such failures and try to find a working dispatcher according to its configuration.


The inspection of a replica involves the detection of the last applied writeset, which can be done by the same software implementing the writeset extraction.

The manager console is also used by administrators who need to change the set of replicas attached to a dispatcher, or need to reactivate disabled replicas. While the removal of a replica is a relatively simple task, the attachment or re-enabling of a replica is more challenging. Syncing-in a replica is performed by copying the state of a running replica to the new one. At the dispatcher level, the writeset queue of the source replica is also duplicated and assigned to the destination replica. Since the copying process uses a SERIALIZABLE read-only transaction on the source replica, there is no need to shut down this replica during the duplication process. The new replica cannot be used to serve transactions until the whole copying process is over. Its writeset queue, which grows during the copy process, will be applied as soon as the copying has finished. Although from the viewpoint of performance this is not optimal, in the current prototype the whole copying process is done by the dispatcher under the control of the manager console.

Besides its role as the central administrative interface, the management console hosts a monitoring component that watches the whole setup and performs, if necessary, reconfiguration according to the policy set by the administrator. Based on those policies, the monitor component can take machines out of a spare machine pool and assign them as a secondary satellite to a primary, or it can remove them and put them back into the pool. In our prototype, policies are implemented as Java plugins.6
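As an illustration only (the class and method below are hypothetical, not the actual plugin interface), such a policy could follow the logic described in footnote 6: add a secondary satellite when a primary is overloaded for a while, remove one when it is idle.

// Hypothetical policy plugin, mirroring the example logic in footnote 6.
class CpuLoadPolicy {
    private static final double HIGH = 0.90, LOW = 0.20;
    private static final long WINDOW_MS = 10_000;
    private long aboveSince = -1, belowSince = -1;

    // Called periodically by the monitor with the primary's current CPU load (0..1).
    // Returns +1 to add a secondary satellite, -1 to remove one, 0 to do nothing.
    int evaluate(double cpuLoad, long nowMs) {
        if (cpuLoad > HIGH) {
            if (aboveSince < 0) aboveSince = nowMs;
            belowSince = -1;
            if (nowMs - aboveSince >= WINDOW_MS) { aboveSince = -1; return +1; }
        } else if (cpuLoad < LOW) {
            if (belowSince < 0) belowSince = nowMs;
            aboveSince = -1;
            if (nowMs - belowSince >= WINDOW_MS) { belowSince = -1; return -1; }
        } else {
            aboveSince = belowSince = -1;
        }
        return 0;
    }
}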

3 Consistency guarantees in Ganymed

As indicated, Ganymed uses SI as the correctness criterion for replica management. In this section, we give a short introduction to SI and then extend the notion of SI to apply it to replicated databases. The result is the transaction routing algorithm we use in our system: Replicated SI with primary copy (RSI-PC).

3.1 Snapshot isolation

Snapshot isolation [7,32] is a multiversion concurrency control mechanism used in databases.

6 The following is a basic example of the logic contained in such a policy: "If the average CPU load on primary satellite X is more than 90% during more than 10 s, then start adding a new secondary satellite to X. If the load on X is less than 20% during more than 10 s, then remove a secondary satellite."

Popular database engines7 that are based on SI include Oracle [27] and PostgreSQL [29]. One of the most important properties of SI is the fact that readers are never blocked by writers, similar to the multiversion protocol in [8]. This property is a big win in comparison to systems that use two-phase locking (2PL), where many non-conflicting updaters may be blocked by a single reader. SI completely avoids the four extended ANSI SQL phenomena P0–P3 described in [7]; nevertheless, it does not guarantee serializability. As shown in [15,16], this is not a problem in real applications, since transaction programs can be arranged in ways such that any concurrent execution of the resulting transactions is equivalent to a serialized execution.

For the purposes of this paper, we will work with the following definition of SI (slightly more formalized than the description in [7]):

SI: A transaction Ti that is executed under SI gets assigned a start timestamp start(Ti) which reflects the starting time. This timestamp is used to define a snapshot Si for transaction Ti. The snapshot Si consists of the latest committed values of all objects of the database at the time start(Ti). Every read operation issued by transaction Ti on a database object x is mapped to a read of the version of x which is included in the snapshot Si. Updated values by write operations of Ti (which make up the writeset WSi of Ti) are also integrated into the snapshot Si, so that they can be read again if the transaction accesses updated data. Updates issued by transactions that did not commit before start(Ti) are invisible to the transaction Ti. When transaction Ti tries to commit, it gets assigned a commit timestamp commit(Ti), which has to be larger than any other existing start timestamp or commit timestamp. Transaction Ti can only successfully commit if there exists no other committed transaction Tk having a commit timestamp commit(Tk) in the interval {start(Ti), commit(Ti)} and WSk ∩ WSi ≠ {}. If such a committed transaction Tk exists, then Ti has to be aborted (this is called the first-committer-wins rule, which is used to prevent lost updates). If no such transaction exists, then Ti can commit (WSi gets applied to the database) and its updates are visible to transactions which have a start timestamp larger than commit(Ti).
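To make the first-committer-wins rule concrete, the following sketch checks a committing transaction against previously committed ones. It is illustrative only; as discussed in Sect. 3.1.1, real engines implement this progressively with row-level write locks rather than with an explicit writeset comparison.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Tx is a hypothetical record of a transaction's timestamps and writeset.
class Tx {
    long start, commit;          // timestamps
    Set<String> writeset;        // identifiers of modified objects
}

class FirstCommitterWins {
    private final List<Tx> committed = new ArrayList<>();

    // Returns true and records the transaction if it may commit under SI.
    synchronized boolean tryCommit(Tx t, long commitTimestamp) {
        t.commit = commitTimestamp;
        for (Tx k : committed) {
            boolean concurrent = k.commit > t.start && k.commit < t.commit;
            if (concurrent && !Collections.disjoint(k.writeset, t.writeset)) {
                return false;    // overlap with a concurrently committed writeset: abort
            }
        }
        committed.add(t);
        return true;
    }
}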

A sample execution of transactions running on a database offering SI is given in Fig. 5. The symbols B, C and A refer to the begin, commit and abort of a transaction.

7 The new Microsoft SQL Server 2005 [25] implements a hybrid approach, offering SI along with traditional concurrency control. However, for the SI part of this paper we focus on pure SI DBMSs.


Fig. 5 An example of concurrent transactions running under SI

The long-running transaction T1 is of type read-only, i.e., its writeset is empty: WS1 = {}. T1 will neither be blocked by any other transaction, nor will it block other transactions. Updates from concurrent updaters (like T2, T3, T4, and T6) are invisible to T1. T2 updates the database element X; it does not conflict with any other transaction. T3 updates Y; it does not see the changes made by T2, since it started while T2 was still running. T4 updates X and Y. Conforming to the first-committer-wins rule it cannot commit, since its writeset overlaps with that of T3 and T3 committed while T4 was running. The transaction manager therefore has to abort T4. T5 is read-only and sees the changes made by T2 and T3. T6 can successfully update Y. Since T4 did not commit, the overlapping writesets of T6 and T4 do not impose a conflict.

As will be shown in the next section, practical systems handle the comparison of writesets and the first-committer-wins rule in a different, more efficient way. Both Oracle and PostgreSQL offer two different ANSI SQL isolation levels for transactions: SERIALIZABLE and READ COMMITTED. An extended discussion of ANSI SQL isolation levels is given in [7].

3.1.1 The SERIALIZABLE isolation level

Oracle and PostgreSQL implement a variant of the SI algorithm for transactions that run in the isolation level SERIALIZABLE. Writesets are not compared at transaction commit time; instead, this process is done progressively by using row-level write locks. When a transaction Ti running in isolation level SERIALIZABLE tries to modify a row in a table that was modified by a concurrent transaction Tk which has already committed, then the current update operation of Ti gets aborted immediately. Unlike PostgreSQL, which then aborts the whole transaction Ti, Oracle is a little bit more flexible: it allows the user to decide whether to commit the work done so far or to proceed with other operations in Ti. If Tk is concurrent but not yet committed, then both products behave the same: they block transaction Ti until Tk commits or aborts. If Tk commits, then the same procedure applies as described before; if, however, Tk aborts, then the update operation of Ti can proceed. The blocking of a transaction due to a potential update conflict can of course lead to deadlocks, which must be resolved by the database by aborting transactions.

3.1.2 The READ COMMITTED isolation level

Both databases also offer a slightly less strict isolation level called READ COMMITTED, which is based on a variant of SI. READ COMMITTED is the default isolation level for both products. The main difference to SERIALIZABLE is the implementation of the snapshot: a transaction running in this isolation mode gets a new snapshot for every issued SQL statement. The handling of conflicting operations is also different from the SERIALIZABLE isolation level. If a transaction Ti running in READ COMMITTED mode tries to update a row which was already updated by a concurrent transaction Tk, then Ti gets blocked until Tk has either committed or aborted. If Tk commits, then Ti's update statement gets reevaluated, since the updated row possibly no longer matches a used selection predicate. READ COMMITTED avoids phenomena P0 and P1, but is vulnerable to P2 and P3 (fuzzy read and phantom).

3.2 RSI-PC

For simplicity in the presentation and without loss of generality, we will describe RSI-PC, the dispatcher's transaction routing algorithm, assuming that the objective is scalability through replication for a single master DBMS. When the satellite databases implement specialized functionality, the routing of the load at the middleware layer will take place according to other criteria (such as keywords in the queries, tables being accessed, etc.).

The RSI-PC algorithm works by routing transactions through a middleware dispatcher over a set of backend databases. There are two types of backend databases: one master, and a set of satellites. From the client's point of view, the middleware behaves like an ordinary database. Inconsistencies between master and satellite databases are hidden by the dispatcher. The dispatcher differentiates between update and read-only transactions.


Note that "update transaction" does not necessarily mean a transaction consisting of pure update statements; update transactions may contain queries as well. Transactions do not have to be sent by clients in a single block (unlike in, e.g., [19,26]); they are processed by the dispatcher statement by statement. However, the client must specify at the beginning of each transaction whether it is read only or an update; otherwise it is handled as an update transaction by the dispatcher.

In Ganymed, update transactions are always forwarded to the master and executed with the isolation level specified by the client (e.g., in the case of Oracle or PostgreSQL this is either READ COMMITTED or SERIALIZABLE). Read-only transactions will, whenever possible, be sent to one of the satellites and executed with SI in SERIALIZABLE mode. Therefore, the same snapshot (on the same satellite) will be used for all statements in the transaction, ensuring that a client's read-only transactions always see a consistent state of the database over their whole execution time. To keep the satellites in sync with the master, so-called writeset extraction is used [19]. Every update transaction's writeset (containing the set of changed objects on the master) is extracted from the master and then sent to FIFO queues on the satellites. Writesets are applied on the satellites in the order of the corresponding commit operations on the master, thereby guaranteeing that the satellites converge to the same state as the master. The application of writesets is done under SI: readers and writers do not interfere. This speeds up both reading while the satellite is being updated and the propagation of changes, as their application is not slowed down by concurrent readers.

3.3 Properties of RSI-PC

If required, RSI-PC can be used to provide different consistency levels. This has no influence on update transactions, since these are always sent to the master and their consistency is dictated by the master, not by our system. For instance, if a client is not interested in the latest changes produced by concurrent updates, then it can run its read-only transactions with session consistency. This is very similar to the concept of strong session 1SR introduced in [14]. On this consistency level, the dispatcher assigns the client's read-only transactions a snapshot which contains all the writesets produced by that client, though it is not guaranteed that the snapshot contains the latest values produced by other clients concurrently updating the master. By default, however, all read-only transactions are executed with full consistency. This means that read-only transactions are always executed on a satellite that has all the updates performed up to the time the transaction starts.

The actual selection of a satellite is implementation specific (e.g., round-robin, least-pending-requests-first, etc.). Obviously, if writesets do not get applied fast enough on the satellites, then readers might be delayed. If the master supports SI and has spare capacity, the dispatcher routes readers to the master for added performance.

Another possible consistency level is based on the age of snapshots, similar to [31]. The age of a snapshot is the difference between the time of the latest commit on the master and the time the snapshot's underlying commit operation occurred. A client requesting time consistency always gets a snapshot which is not older than a certain age specified by the client. Time consistency and session consistency can be combined: read-only snapshots then always contain a client's own updates and are guaranteed not to be older than a certain age.
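A compact sketch of how a dispatcher might evaluate these consistency levels when choosing a snapshot is shown below. It is purely illustrative; the method and parameter names are not taken from the prototype.

// Illustrative checks whether a satellite snapshot satisfies the requested consistency level.
class ConsistencyCheck {

    // Full consistency: the satellite must have applied everything committed
    // on the master up to the start of the transaction.
    static boolean fullConsistency(long satelliteWriteset, long latestMasterWriteset) {
        return satelliteWriteset >= latestMasterWriteset;
    }

    // Session consistency: the satellite must contain all writesets produced
    // by this client's own update transactions.
    static boolean sessionConsistency(long satelliteWriteset, long clientLastWriteset) {
        return satelliteWriteset >= clientLastWriteset;
    }

    // Time consistency: the snapshot must not be older than maxAgeMs, where age is the
    // gap between the latest master commit and the snapshot's underlying commit.
    static boolean timeConsistency(long snapshotCommitTime, long latestMasterCommitTime,
                                   long maxAgeMs) {
        return latestMasterCommitTime - snapshotCommitTime <= maxAgeMs;
    }
}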

Due to its simplicity, there is no risk of a dispatcher implementing the RSI-PC algorithm becoming the bottleneck in the system. In contrast to other middleware-based transaction schedulers, like the ones used in [6,10], this algorithm does not involve any SQL statement parsing or concurrency control operations. Also, no row or table level locking is done at the dispatcher level. The detection of conflicts, which by definition of SI can only happen during updates, is left to the database running on the master replica. Moreover (unlike [10,18]), RSI-PC does not make any assumptions about the transactional load, the data partition, organization of the schema, or answerable and unanswerable queries.

3.4 Fine grained routing of single queries

For the common case of read-only transactions in autocommit mode (i.e., single queries), our algorithm is more selective. Upon arrival of such a query, the dispatcher parses the SQL SELECT statement to identify the tables that the query will read. If this is possible (e.g., no function invocations with unknown side effects or dependencies are detected), it then chooses a satellite where all the necessary tables are up to date. This optimization is based on the observation that, in some cases, it is perfectly legal to read from an older snapshot. One only has to make sure that the subset of data read in the older snapshot matches the one in the actually requested snapshot.
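A very simplified sketch of such table identification for plain single-table or joined SELECT statements is shown below. It uses a naive regular-expression approach for illustration only; a real implementation would need proper SQL parsing and would have to reject statements with function calls or other unknown dependencies.

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive extraction of table names from the FROM clause of a simple SELECT.
class TableExtractor {
    private static final Pattern FROM = Pattern.compile(
            "\\bFROM\\s+([a-zA-Z0-9_,\\s]+?)(\\bWHERE\\b|\\bGROUP\\b|\\bORDER\\b|$)",
            Pattern.CASE_INSENSITIVE);

    static Set<String> tables(String selectStatement) {
        Set<String> result = new LinkedHashSet<>();
        Matcher m = FROM.matcher(selectStatement);
        while (m.find()) {
            for (String t : m.group(1).split(",")) {
                String name = t.trim().split("\\s+")[0];   // strip aliases
                if (!name.isEmpty()) result.add(name.toLowerCase());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tables(
                "SELECT i_id, i_title FROM item, author WHERE i_a_id = a_id"));
        // prints: [item, author]
    }
}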

Parsing of single queries also makes it possible to implement partial replication. In the case of ordinary read-only transactions, it is not possible to determine in advance which objects will be touched during a transaction. Therefore, snapshots always have to be created on satellites which hold all objects. For single queries this is different: the required tables can be identified and a snapshot can be created on a satellite which holds only those tables.


This opens the possibility of dynamically creating satellites that hold only hot-spot tables and even materialized views.

4 Update extraction

Unfortunately, writeset extraction, which is needed on the master replica, is not a standard feature of database systems. Therefore, additional programming effort is needed for every attached type of master. Currently, our prototype supports the following databases:

PostgreSQL: This open source DBMS can be used both as master and satellite. PostgreSQL is very flexible; it supports the loading of additional functionality during runtime. Our writeset support consists of a shared library written in C which can be loaded at runtime. The library enables the efficient extraction of writesets based on capturing triggers: all changes to registered tables, be it changes through ordinary SQL statements, user-defined triggers, or user-defined functions and stored procedures, will be captured. The extracted writesets are table-row based; they do not contain full disk blocks. This ensures that they can be applied on a replica which uses a different low-level disk block layout than the master replica.

Oracle: Similar to PostgreSQL, we have implemented a trigger-based capturing mechanism. It consists of a set of Java classes that are loaded into the Oracle internal JVM. They can be accessed by executing calls through any of the available standard Oracle interfaces (e.g., OCI). The software is able to capture the changes of user-executed DML (data manipulation language) statements, functions and stored procedures. A limitation of our current approach is the inability to reliably capture changes as a result of user-defined triggers. To the best of our knowledge it is not possible to specify the execution order of triggers in Oracle. It is thus not guaranteed that our capturing triggers are always the last ones to be executed. Hence, changes could happen after the capturing triggers. A more robust approach would be to use the Oracle change data capture feature [27].

DB2: We have developed support for DB2 based on loadable Java classes, similar to the approach chosen for Oracle. As before, we use triggers to capture changes and a set of user-defined functions which can be accessed through the standard DB2 JDBC driver to extract writesets. Due to the different ways in which DB2 handles the JVM on different platforms, the code works on Windows but not on DB2 for Linux. A more complete implementation would be based on DB2 log extraction. Due to the incompleteness of our DB2 support for Linux, and to maintain fairness in the comparisons, we used the generic approach, described next, to connect to DB2.

4.1 Generic capturing of updates

Our generic approach is similar to that used in replication systems, e.g., [5,10]. We parse a transaction's incoming statements at the middleware and add all statements that possibly modify the database to its writeset. Of course, this type of writeset extraction (actually more a form of writeset collection) is not unproblematic: the execution of SELECT statements with side effects (e.g., calling functions that change rows in tables) is not allowed. The distribution of the original DML statements instead of extracted writesets also leads to a variety of difficulties. As an example, take the following statement:

INSERT INTO customer (c_id, c_first_login)
VALUES (19763, current_timestamp)

The execution of this statement on different replicas might lead to different results. Therefore, in the generic approach, the middleware replaces occurrences of current_timestamp, current_date, current_time, sysdate, random and similar expressions with a value before forwarding the statement. Our current prototype for use with TPC-W does this by search-and-replace. A full-blown implementation would have to be able to parse all incoming DML statements into tokens and then apply the substitution step. In the case of queries, an alternative would be to send specialized statements to the master and forward to the satellites only those that do not involve any transformation. As mentioned before, our system provides at least the functionality and performance of the master. If the application is willing to help, then the satellites can be used with the corresponding gains. This is not different from clients providing hints to the optimizer, or writing queries optimized for an existing system. Although it can be kept completely transparent, the maximum gains will be obtained when queries are written taking our system into account. It should nevertheless be noted that even though the generic approach has obvious limitations, it is still suitable for many common simple database applications which do not rely on database triggers or (user-defined) functions with side effects. This problem does not appear either if the satellite databases implement functionality that is different from that of the master.
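To illustrate the idea of the substitution step, the following sketch replaces a few non-deterministic expressions with concrete values before a statement is forwarded. It is illustrative only; the prototype uses a simple search-and-replace tailored to the TPC-W statements, and the class name here is hypothetical.

import java.sql.Timestamp;
import java.util.regex.Pattern;

// Illustrative substitution of non-deterministic SQL expressions with literal values,
// so that the same statement produces identical results on master and satellites.
class NonDeterminismRewriter {

    static String rewrite(String sql) {
        String ts = "TIMESTAMP '" + new Timestamp(System.currentTimeMillis()) + "'";
        String rewritten = Pattern.compile("current_timestamp|sysdate", Pattern.CASE_INSENSITIVE)
                .matcher(sql).replaceAll(ts);
        rewritten = Pattern.compile("\\brandom\\(\\)", Pattern.CASE_INSENSITIVE)
                .matcher(rewritten).replaceAll(Double.toString(Math.random()));
        return rewritten;
    }

    public static void main(String[] args) {
        System.out.println(rewrite(
                "INSERT INTO customer (c_id, c_first_login) VALUES (19763, current_timestamp)"));
    }
}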


Table 1 Overhead of capturing triggers

PostgreSQL   Insert   126%
             Update   120%
             Delete   122%
Oracle       Insert   154%
             Update   161%
             Delete   160%

4.2 Trigger overhead

Our two extensions for PostgreSQL and Oracle make use of capturing triggers. Therefore, the question arises as to how much overhead these mechanisms add. To find out, we measured the difference between performing changes in the databases with and without the capturing software. For both systems, we made measurements in which we determined the overhead for each single update statement. Three types of transactions were tested: one containing an INSERT operation, one with an UPDATE and one with a DELETE statement. No other load was present, to obtain stable results.

Table 1 shows that the overhead per update statement can be as high as 61%. These figures, however, need to be prorated with the overall load, as the overhead is per update statement. In the next section, we show that the proportion of update transactions in TPC-W is never more than 25% of the total load, and they do not solely consist of update statements. Hence, the actual trigger overhead is much less (see the experiments). In addition, our system reduces the load at the master, thereby leaving more capacity to execute updates and the associated triggers in an efficient manner.

4.3 Update propagation

Ganymed applies writesets on the satellites in the commit order at the master. Similar to the writeset extraction problem, there is no general interface which could be used to efficiently get notified of the exact commit order of concurrent commit operations. Since our initial intention is to prove the concept of satellite databases, we currently use a brute force approach: commits for update transactions are sent in a serialized order from the master adapter to the master database. However, this can be done more efficiently, since all involved components in this time-critical process reside on the same machine.

5 SQL compatibility

Every DBMS available today features extensions and modifications to the SQL standard.

Moreover, even if all DBMSs understood the same dialect of SQL, it would not be guaranteed that the optimal formulation of a query on one DBMS would also lead to an optimal execution on another DBMS. To illustrate these problems, we use two examples from TPC-W.

5.1 TOP-N queries

In the TPC-W browsing workload (see later section), a TOP-N query is used in the New Products Web Interaction. The query itself returns a list of the 50 most recently released books in a given category. One possible way to implement this in Oracle is as follows:

SELECT * FROM (
    SELECT i_id, i_title, a_fname, a_lname
    FROM item, author
    WHERE i_a_id = a_id
      AND i_subject = 'COOKING'
    ORDER BY i_pub_date DESC, i_title
) WHERE ROWNUM <= 50

The TOP-N behavior is implemented by surrounding the actual query with a "SELECT * FROM (...) WHERE ROWNUM <= n" construct. The database then not only returns at most n rows, but the query planner can also use that knowledge to optimize the execution of the inner query.

Unfortunately, since ROWNUM is a virtual column specific to Oracle, the query does not work on PostgreSQL or DB2. However, it can easily be modified to run on those systems: in the case of PostgreSQL, taking the inner query and adding a LIMIT 50 clause to the end of the query leads to the same result. For DB2, the approach is very similar; the corresponding clause that has to be appended is FETCH FIRST 50 ROWS ONLY.

5.2 Rewriting as optimization

As a second example, we look at the query that is used in TPC-W for the "Best Sellers Web Interaction". Amongst the 3,333 most recent orders, the query performs a TOP-50 search to list a category's most popular books based on the quantity sold. Again, we show the SQL code that could be used with an Oracle database:

SELECT * FROM (
    SELECT i_id, i_title, a_fname, a_lname,
           SUM(ol_qty) AS orderkey
    FROM item, author, order_line
    WHERE i_id = ol_i_id AND i_a_id = a_id
      AND ol_o_id > (SELECT MAX(o_id)-3333 FROM orders)
      AND i_subject = 'CHILDREN'
    GROUP BY i_id, i_title, a_fname, a_lname
    ORDER BY orderkey DESC
) WHERE ROWNUM <= 50

This query has the same problem as the previous TOP-N query. However, at least in the case of PostgreSQL, replacing the ROWNUM construct with a LIMIT clause is not enough: although the query works, the performance would suffer. The reason is the use of the MAX operator in the SELECT MAX(o_id)-3333 FROM orders subquery. Many versions of PostgreSQL cannot make use of indexes to efficiently execute this subquery due to the implementation of aggregating operators. When specified using the MAX operator, the subquery leads to a sequential scan on the (huge) orders table. To enable the use of an index set on the o_id column of the orders table, the subquery must be rewritten as follows:

SELECT o_id-3333 FROM orders
ORDER BY o_id DESC LIMIT 1

At first glance, sorting the orders table in descending order looks inefficient. However, the use of the LIMIT 1 clause leads to an efficient execution plan which simply looks up the highest o_id in the orders table by peeking at the index set on the o_id column.8

5.3 Rewriting SQL on the �y

These last two examples illustrate the problems of heterogeneous setups. Since we assume that the client software should not be changed (or only in a very narrow manner), obviously the incoming queries need to be rewritten by the middleware (more exactly, in the adapters). We propose two ways to perform the translation step:

Generic rewriting: Depending on the DBMS (Oracle, DB2, ...) a query was originally formulated for, and the effective place of execution (a satellite with possibly a different DBMS), a set of specific modules in the middleware could rewrite the query. There are two ways to structure this: one is defining a common intermediate representation and a set of in- and out-modules. The other possibility is to have direct translation modules for every supported pair of master/satellite DBMS. These solutions need no support from the client application programmer, but are hard to develop from the perspective of the middleware designer.

8 In newer PostgreSQL versions (8.0+) this particular problem has been resolved. If the database detects the use of the MAX operator as explained, it will rewrite the query execution plan using the LIMIT operator.

Yet, if the idea of satellite databases becomes widely used, such modules will certainly be developed incrementally.

Application-specific plug-ins: If the query diversity is very limited (so that the generic rewriting approach would be too much overhead), or if a suitable generic rewriting approach for a given master/satellite combination is not available, it makes sense to allow the administrator to add custom rewriting rules. At the cost of additional maintenance, a more flexible and possibly more efficient integration of applications and new master/satellite DBMS pairs would be possible. Of course, the two approaches can also be mixed: for some translations the generic modules could be used, for others the system could call user-specified translation plug-ins.

Our current prototype is mainly used with TPC-W databases, and so the query load is well known. Since the rules needed for TPC-W query rewriting are rather simple, we have chosen to use specific plug-ins which can be loaded into the adapters. The plug-ins use string replacement, which only works for our TPC-W loads but with negligible overhead.
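For illustration, a plug-in of this kind could be as simple as the following string-replacement sketch, which turns the Oracle ROWNUM construct from Sect. 5.1 into a PostgreSQL LIMIT clause. The class name and the exact pattern are hypothetical; only the two SQL forms are taken from the text above.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative Oracle-to-PostgreSQL rewriter for the TOP-N pattern used in TPC-W.
class RownumToLimitPlugin {
    // Matches: SELECT * FROM ( <inner query> ) WHERE ROWNUM <= n
    private static final Pattern TOP_N = Pattern.compile(
            "^\\s*SELECT\\s+\\*\\s+FROM\\s*\\((.*)\\)\\s*WHERE\\s+ROWNUM\\s*<=\\s*(\\d+)\\s*$",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String rewrite(String oracleSql) {
        Matcher m = TOP_N.matcher(oracleSql);
        if (m.matches()) {
            // Unwrap the inner query and append the equivalent LIMIT clause.
            return m.group(1).trim() + " LIMIT " + m.group(2);
        }
        return oracleSql;   // anything else is passed through unchanged
    }
}

Applied to the New Products query above, this yields the LIMIT 50 formulation described in Sect. 5.1.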

6 Experimental evaluation

To verify the validity of our approach we performed several tests. We start with a worst case scenario intended to provide a lower bound on what can be done with satellite databases: satellites that contain full replicas of a master database. This is a worst case scenario because of the amount of data involved (the entire database) and the issues created by heterogeneity (of the engines and the dialects of SQL involved). Yet, the experiments show that satellite databases can be used to provide a significant degree of scalability in a wide range of applications.

6.1 Experiments performed

First, we did extensive scalability measurements for homogeneous setups consisting only of fully replicated PostgreSQL databases. We used a load generator that simulates the transaction traffic of a TPC-W application server. Additionally, we tested the behavior of Ganymed in scenarios where database replicas fail. The failure of both satellites and masters was investigated.

In a later section, we present measurements for more complex setups, namely, heterogeneous Ganymed systems implementing full replication for Oracle and DB2 masters with attached PostgreSQL satellites. Again, we measured the achievable performance by using different TPC-W loads.


6.2 The TPC-W load

The TPC benchmark W (TPC-W) is a transactional web benchmark from the Transaction Processing Council [12]. TPC-W defines an internet commerce environment that resembles real-world, business-oriented, transactional web applications. The benchmark also defines different types of workloads which are intended to stress different components in such applications [namely, multiple on-line browser sessions, dynamic page generation with database access, update of consistent web objects, simultaneous execution of multiple transaction types, a backend database with many tables of various sizes and relationships, transaction integrity (ACID), and contention on data access and update]. The workloads are as follows: primarily shopping (WIPS), browsing (WIPSb) and web-based ordering (WIPSo). The difference between the workloads is the ratio of browse to buy: WIPSb consists of 95% read-only interactions, for WIPS the ratio is 80% and for WIPSo the ratio is 50%. WIPS, being the primary workload, is considered the most representative one.

For the evaluation of Ganymed we generated database transaction traces with a running TPC-W installation. We used an open source implementation [24] that had to be modified to support a PostgreSQL backend database. Although the TPC-W specification allows the use of loose consistency models, this was not used, since our implementation is based on strong consistency. The TPC-W scaling parameters were chosen as follows: 10,000 items, 288,000 customers, and the number of EBs (emulated browsers) was set to 100.

Traces were then generated for the three different TPC-W workloads: shopping mix (a trace file based on the WIPS workload), browsing mix (based on WIPSb) and ordering mix (based on WIPSo). Each trace consists of 50,000 consecutive transactions.

6.3 The load generator

The load generator is Java based. Once started, it loads a trace file into memory and starts parallel database connections using the configured JDBC driver. After all connections are established, transactions are read from the in-memory trace file and then fed into the database. Once a transaction has finished on a connection, the load generator assigns the next available transaction from the trace to the connection.

For the length of the given measurement interval, the number of processed transactions and the average response time of transactions are measured. Also, to enable the creation of histograms, the current number of processed transactions and their status (committed/aborted) are recorded every second.
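The following sketch outlines this feeding loop. The class names, the trace loader, and the JDBC URL are placeholders; the real generator additionally records the per-second statistics mentioned above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Minimal trace-driven load generator: each worker owns one connection and
// pulls the next transaction from the shared trace as soon as it is free.
class TraceLoadGenerator {
    static final AtomicLong committed = new AtomicLong();

    public static void main(String[] args) throws Exception {
        List<List<String>> trace = TraceFile.load("trace.bin");   // hypothetical loader
        BlockingQueue<List<String>> pending = new LinkedBlockingQueue<>(trace);
        for (int i = 0; i < 100; i++) {                           // 100 parallel clients
            new Thread(() -> run(pending)).start();
        }
    }

    static void run(BlockingQueue<List<String>> pending) {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://dispatcher-host:5432/tpcw", "user", "secret")) {
            con.setAutoCommit(false);
            List<String> tx;
            while ((tx = pending.poll()) != null) {
                try (Statement st = con.createStatement()) {
                    for (String sql : tx) st.execute(sql);        // feed statement by statement
                    con.commit();
                    committed.incrementAndGet();
                } catch (Exception e) {
                    con.rollback();                               // counts as an aborted transaction
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

class TraceFile {
    static List<List<String>> load(String path) {
        // Placeholder: a real implementation would deserialize the recorded trace.
        return List.of(List.of("SELECT 1"));
    }
}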

6.4 Experimental setup

For the experiments, a pool of machines was used to host the different parts of the Ganymed system. For every component of the system (load generator, dispatcher, databases (master and satellites)), a dedicated machine was used. All machines were connected through a 100 MBit Ethernet LAN. Java software was run with the Blackdown-1.4.2-rc1 Java 2 platform. All machines were running Linux with a 2.4.22 kernel. The load generator ran on a dual Pentium-III 600 MHz machine with 512 MB of RAM. The dispatcher and the webservice ran on dual Pentium-IV machines with 3 GHz and 1 GB of main memory. Using identical machines (dual AMD Athlon 1400 MHz CPU, 1 GB RAM, 80 GB IBM Deskstar harddisk) we installed several PostgreSQL (8.0.0), one Oracle (10g) and one DB2 (8.1) database. For the full replication experiments, each database contained the same set of TPC-W tables (scaling factor 100/10K [12]). All databases were configured using reasonable settings and indexes; however, we do not claim that the databases were configured optimally: in the case of Oracle we used automatic memory management and enabled the cost-based optimizer. DB2 was configured using the graphical configuration advisor. Additional optimizations were done manually. The PostgreSQL satellites were configured not to use fsync, which enables fast application of writesets. Disabling fsync on the satellites is not problematic (in terms of durability), since the failure of a satellite would involve a re-sync anyway and durability is guaranteed by the master. Before starting any experiment, all databases were always reset to an initial state. Also, in the case of PostgreSQL, the VACUUM FULL ANALYZE command was executed. This ensured that every experiment started from the same state.

6.5 Homogeneous full replication

First, we describe the experimental results obtained from a Ganymed system used to implement homogeneous full replication with a set of PostgreSQL databases. We did extensive scalability measurements by comparing the performance of different multiple-satellite configurations with a single PostgreSQL instance. Second, we describe the behavior of Ganymed in scenarios where databases fail. The failure of both satellites and masters was investigated.


6.5.1 Performance and scalability

The first part of the evaluation analyzes performance and scalability. The Ganymed prototype was compared with a reference system consisting of a single PostgreSQL instance. We measured the performance of the Ganymed dispatcher in different configurations, from 0 up to 5 satellites. This gives a total of seven experimental setups (called PGSQL and SAT-n, 0 ≤ n ≤ 5); each setup was tested with the three different TPC-W traces.

The load generator was then attached to the database (either the single instance database or the dispatcher, depending on the experiment). During a measurement interval of 100 s, a trace was fed into the system over 100 parallel client connections and at the same time average throughput and response times were measured. All transactions, read only and updates, were executed in the SERIALIZABLE mode. Every experiment was repeated until a sufficiently small standard deviation was reached.
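For completeness, the following sketch shows how each client connection could be prepared through plain JDBC (autocommit off, SERIALIZABLE requested). It is an assumed fragment, not the actual experiment code; on PostgreSQL 8.0, requesting SERIALIZABLE yields snapshot isolation.

import java.sql.*;

public class ClientConnectionFactory {
    public static Connection open(String jdbcUrl) throws SQLException {
        Connection con = DriverManager.getConnection(jdbcUrl);
        con.setAutoCommit(false);
        con.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
        return con;
    }
}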

Figure 6 shows the results for the achieved throughput (transactions per second) and the average transaction response times, respectively. The ratio of aborted transactions was below 0.5% for all experiments.

Figure 7 shows two example histograms for the TPC-W ordering mix workload: on the left side the reference system, on the right side SAT-5. The sharp drop in performance in the SAT-5 histogram is due to multiple PostgreSQL replicas that performed checkpointing of the WAL (write-ahead log) at the same time. The replicas were configured to perform this process at least every 300 s; this is the default for PostgreSQL.

Based on the graphs, we can demonstrate the lightweight structure of the Ganymed prototype. In a relay configuration, where only one replica is attached to the Ganymed dispatcher, the achieved performance is almost identical to the PostgreSQL reference system. The performance of the setup with two replicas, where one replica is used for updates and the other for read-only transactions, is comparable to the single replica setup. This clearly reflects the fact that the heavy part of the TPC-W loads consists of complex read-only queries. In the case of the write intensive TPC-W ordering mix, a two replica setup is slightly slower than the single replica setup. In the setups with more than two replicas, the performance compared to the reference system could be significantly improved. A close look at the response time chart shows that the response times converge. This is due to the RSI-PC algorithm, which uses parallelism across different transactions but no intra-transaction parallelism. A SAT-5 system, for example, would have the same performance as a SAT-1 system when used by only a single client.

Fig. 6 Ganymed performance for TPC-W mixes: average TPS and average response time per system type (PGSQL, SAT-0 to SAT-5) for the TPC-W browsing, shopping and ordering traces, 100 clients

Fig. 7 Example histograms for the TPC-W ordering mix: transaction throughput (TPS) over the 100 s measurement interval for the PGSQL reference system (left) and SAT-5 (right)

One can summarize that in almost all cases a nearly linear scale-out was achieved. These experiments show that the Ganymed dispatcher was able to attain an impressive increase in throughput and reduction of transaction latency while maintaining the strongest possible consistency level.

It must be noted that in our setup all databases were identical. With more specialized index structures on the satellites, the execution of read-only transactions could be optimized even further. We are exploring this option as part of future work.


Fig. 8 Ganymed reacting to a satellite failure: transaction throughput histogram for the TPC-W shopping trace while a SAT-3 system degrades to SAT-2


6.5.2 Reaction to a failing satellite replica

In this experiment, the dispatcher's reaction to a failing satellite replica was investigated. A SAT-3 system was configured and the load generator was attached with the TPC-W shopping mix trace.

After the experiment had run for a while, one of the satellite replicas was stopped by killing the PostgreSQL processes with a kill (SIGKILL) signal. It must be emphasized that this is different from using a SIGTERM signal, since in that case the PostgreSQL software would have had a chance to catch the signal and shut down gracefully.

Figure 8 shows the generated histogram for this experiment. In second 56, a satellite replica was killed as described above. The failure of the satellite replica led to the abort of 39 read-only transactions in second 56; otherwise no transaction was aborted in this run. The arrows in the graph show the change of the average transaction throughput per second. Clearly, the system's performance degraded to that of a SAT-2 setup. As can be seen from the graph, the system recovered immediately. Transactions running on the failing replica were aborted, but otherwise the system continued working normally. This is a consequence of the lightweight structure of the Ganymed dispatcher approach: if a replica fails, no costly consensus protocols have to be executed. The system just continues working with the remaining replicas.

6.5.3 Reaction to a failing master replica

In the last experiment, the dispatcher's behavior in the case of a failing master replica was investigated. As in the previous experiment, the basic configuration was a SAT-3 system fed with a TPC-W shopping mix trace.

Fig. 9 Ganymed reacting to a master failure: transaction throughput histogram for the TPC-W shopping trace while a SAT-3 system degrades to SAT-2

Again, a SIGKILL signal was used to stop PostgreSQL on the master replica.

Figure 9 shows the resulting histogram for this experiment. Transaction processing is normal until the master replica stops working in second 45. The immediate move of the master role to a satellite replica leaves a SAT-2 configuration with one master and two satellite replicas. The failure of the master led to the abort of two update transactions; no other transactions were aborted during the experiment. As before, the arrows in the graph show the change of the average transaction throughput per second.

This experiment shows that Ganymed is also capable of handling failing master replicas. The system reacts by reassigning the master role to a different, still working satellite. It is important to note that the reaction to failing replicas can be done by the dispatcher without intervention from the manager console. Even with a failed or otherwise unavailable manager console, the dispatcher can still disable failed replicas and, if needed, move the master role autonomously.

6.6 Heterogeneous full replication

In this section, we discuss several experiments performed using Oracle and DB2 master databases. In both cases we attached a dispatcher and a varying set of PostgreSQL satellite databases. Again, we used the three TPC-W workloads to measure the performance. However, this time we also added a fourth load, which is used to demonstrate the capability of the heterogeneous systems to handle peaks of read-only transactions: 200 concurrent clients were attached to the system, with 100 clients sending TPC-W ordering transactions and 100 clients sending read-only transactions. The read-only transaction load was created by taking a TPC-W browsing trace and eliminating all update transactions.

For all experiments, the dispatcher was configured to execute read-only transactions with full consistency. We did not make use of loose consistency levels or partial replication possibilities (as these would benefit our approach). The assignment of read-only transactions to satellites was done using a least-pending-requests-first strategy. Since Oracle offers SI, we also assigned readers (using isolation level read only) to the master when readers would otherwise have been blocked (due to the unavailability of the requested snapshot on the satellites). In the case of DB2 the master could not be used to offer SI for readers. Therefore, in the worst case, read-only transactions were delayed until a suitable snapshot on a satellite was ready.
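The routing decision just described can be sketched as follows. This is a simplified illustration under our own naming (ReadOnlyRouter, Replica); the actual dispatcher implementation is not reproduced here.

import java.util.List;

public class ReadOnlyRouter {
    public interface Replica {
        int pendingRequests();
        long appliedWritesetNo();   // latest writeset applied on this satellite
    }

    // requiredWs: number of the last writeset the client must be able to see
    public static Replica route(List<Replica> satellites, Replica master,
                                long requiredWs, boolean masterOffersSI) {
        Replica best = null;
        for (Replica s : satellites) {
            // Consider only satellites whose snapshot is recent enough, pick the least loaded.
            if (s.appliedWritesetNo() >= requiredWs
                    && (best == null || s.pendingRequests() < best.pendingRequests())) {
                best = s;
            }
        }
        if (best != null) return best;
        // No satellite has the needed snapshot yet: use the master in read-only mode if it
        // offers SI (the Oracle case); otherwise the caller must delay the query (DB2 case).
        return masterOffersSI ? master : null;
    }
}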

To get an idea of the design space, we tested Oracle with update extraction using triggers (table level triggers) and using the generic approach (SQL propagation). DB2 was tested only with the generic approach, since we have not implemented a trigger-based approach yet.

6.6.1 Oracle using triggers

In our first series of experiments we tested a dispatcher that uses triggers to do writeset extraction on Oracle. First we measured performance by attaching the load generator directly to Oracle, without triggers. Then, we attached the dispatcher and installed the writeset extraction features on the Oracle master for the remaining measurements. The results, relative throughput in terms of transactions per second (TPS) and response times, are shown in Fig. 10. In both graphs the first experiment, labeled ORA, represents Oracle without triggers and without our system. The second column, SAT-0, shows the performance when putting the dispatcher between the load generator and the master, but with no satellite databases. The next columns show the results of attaching an additional one to five PostgreSQL satellite databases to the dispatcher (SAT-1 to SAT-5).

As the graphs show, the system provides substantial scalability both in throughput and response time for loads with enough read operations. In addition, using the master for blocked readers has interesting consequences. While in the browsing workload readers almost never have to be delayed, the shopping workload produces many readers that need a snapshot which is not yet available on the satellites. As a result, the shopping workload scales better for small numbers of replicas, since the resources on the master are better utilized. Second, the trigger approach and commit order serialization add a certain overhead. Thus, if clients execute many update transactions (as in the ordering workload), scale-out is reduced and performance is dominated by the master. However, scale-out is only reduced from the viewpoint of clients that execute updates: the mixed workload is able to achieve a very good scale-out, even though the number of clients is doubled in comparison to the ordering workload. The system is therefore able to handle peaks of readers.

Fig. 10 Measurement results for Oracle with the trigger-based approach: TPS scale-out and response time reduction relative to ORA for SAT-0 through SAT-5 under four loads (100 browsing, 100 shopping, 100 ordering, and 100 ordering + 100 read-only clients)


6.6.2 Oracle using the generic approach

Since our TPC-W workload does not use any stored procedures, it can also be used with the generic writeset extraction. This eliminates the overhead of the capturing triggers. In the second series of experiments we tested such a configuration, again with Oracle. The results are shown in Fig. 11.

Obviously, as can be seen by comparing ORA and SAT-0, the overhead of the dispatcher approach is now very small. However, commit order detection is still in place, which has an impact on update transactions. Again, the graphs show that the system is able to scale well for read-dominated loads.

6.6.3 DB2 using the generic approach

Using DB2 with generic writeset capturing we performed the same experiments as before. However, some modifications had to be introduced. Since DB2 offers no SI, the system was configured to block read-only transactions until the required snapshot became available on a satellite. Also, our first trial experiments with DB2 suffered from the same problem as those with Oracle, namely that serialized commits lead to limited scale-out for update intensive workloads. In the case of DB2, we then optimized the database engine configuration to speed up commits.
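The blocking of readers mentioned above can be sketched as a simple monitor that delays a read-only transaction until the satellite has applied all writesets up to a required number. The class and method names are illustrative, not the actual Ganymed code.

public class SnapshotBarrier {
    private long appliedWritesetNo = 0;   // highest writeset applied on this satellite

    // Called by the writeset-application thread after each writeset commits.
    public synchronized void writesetApplied(long writesetNo) {
        appliedWritesetNo = writesetNo;
        notifyAll();
    }

    // Called by the dispatcher before starting a read-only transaction that must
    // see all updates up to requiredWs.
    public synchronized void awaitSnapshot(long requiredWs) throws InterruptedException {
        while (appliedWritesetNo < requiredWs) {
            wait();
        }
    }
}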

The results based on DB2 are shown in Fig. 12. As before, our system offers very good scalability. One can also observe that the optimized commit at the master pays off: the performance of the shopping workload approaches that of the mixed workload. An interesting point is the scale-out of the browsing workload, which is less than expected. The reason is the slightly different set of indexes and queries which we used on the DB2 and PostgreSQL engines. As a consequence, PostgreSQL could handle the read-only transactions in the shopping and ordering workloads more efficiently, while DB2 performed better when executing the (more complex) mix of read-only transactions in the browsing workload. Since the objective of the experiments was to prove the feasibility of the approach, we did not proceed any further with optimizations and tuning on master or satellites.


Fig. 11 Measurement results for Oracle with the generic approach: TPS scale-out and response time reduction relative to ORA for SAT-0 through SAT-5 (same four client loads as in Fig. 10)

Fig. 12 Measurement results for DB2 with the generic approach: TPS scale-out and response time reduction relative to DB2 for SAT-0 through SAT-5 (same four client loads as in Fig. 10)


6.6.4 Interpretation of heterogeneous results

The main conclusion from these experiments is that satellites can provide significant scalability (both in terms of throughput and response time) as long as there are enough reads in the load. The differences in behavior between the three set-ups are due to low level details of the engines involved. For instance, with Oracle we need to commit transactions serially (only the commit, not the execution of the operations) to determine the writeset sequence. This causes a bottleneck at the master that lowers the scalability for the shopping and ordering traces (the ones with the most updates). We also execute queries at the master when there are many updates, thereby also increasing the load at the master and hence enhancing the bottleneck effect for traces with high update ratios. With DB2 we have a more efficient mechanism for determining the commit order and we never execute queries on the master. This leads to better scalability for the shopping and ordering traces. As another example, the browsing trace scales worse with DB2 as master than with Oracle as master. The reason is the slightly different set of indexes and queries which we used on the DB2 and PostgreSQL engines. As a consequence, PostgreSQL could handle the read-only transactions in the shopping and ordering workloads more efficiently, while DB2 performed better when executing the (more complex) mix of read-only transactions in the browsing workload.

When the overhead is dominated by queries (browsing mix), assigning readers to a single satellite does not yield any performance gain. Two satellites are needed to increase the read capacity in the system; this is also the case for Oracle, since the master is used only for blocked queries.

For all set-ups, even when there is a significant update load at the master, the system provides good scalability. This can be seen in the mixed workload (100 ordering + 100 read-only clients), which has twice as many clients as the other workloads. These results are encouraging in that they demonstrate the potential of satellite databases. They also point to the many issues involved in implementing full replication in satellites. One such issue is the trade-off between loading the master or not loading it (the Oracle set-up achieves slightly better response times although the throughput is reduced due to the bottleneck effect mentioned). Other issues that are beyond the scope of this paper include SQL incompatibilities, the overhead of update extraction, and engine-specific optimizations. Most of these problems are to a large extent engineering issues that need to be resolved in a product-specific manner. This is why we see these results as a form of lower bound on what can be achieved rather than as an absolute statement of the performance of full replication satellites.



7 Specialized satellites

The third set of experiments explores the performance of specialized satellites. The objective here is to show the true potential of satellite databases in a setting that turns out to be much easier to implement and less constrained than full replication satellites. We show satellites implementing skyline queries as an extension to an Oracle master and satellites implementing an extra table for keyword search as an extension to a DB2 master.

7.1 Skyline satellites

As a first example of specialized satellites, we use PostgreSQL satellites to extend an Oracle master with skyline queries [9,11]. The objective is to show how satellites can be used to implement new operators that require heavy computation without affecting the master (unlike, e.g., data blades). Our implementation of the skyline operator for PostgreSQL databases is based on the Block Nested Loop (BNL) algorithm [9]. Clients can execute queries of the following form:

SELECT ... FROM ... WHERE ...
GROUP BY ... HAVING ... ORDER BY ...
SKYLINE OF c1 [MIN|MAX] {, cN [MIN|MAX]}

These statements are distinctive and can easily be detected by the dispatcher, which forwards them to the satellites. Any other update or query is sent to the master. The routing policy in the dispatcher is least-pending-requests-first.
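The detection step is straightforward, as the following sketch illustrates. The regular expression and class name are our own illustration; the real dispatcher may classify statements differently.

import java.util.regex.Pattern;

public class SkylineClassifier {
    private static final Pattern SKYLINE =
        Pattern.compile("\\bSKYLINE\\s+OF\\b", Pattern.CASE_INSENSITIVE);

    // True if the statement contains a SKYLINE OF clause and must go to a satellite.
    public static boolean isSkylineQuery(String sql) {
        return SKYLINE.matcher(sql).find();
    }
}

A statement for which isSkylineQuery returns true is then routed with the least-pending-requests-first policy among the skyline satellites; everything else goes to the master.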

For the experiments, we used Oracle as master and one to eight PostgreSQL satellites (only replicating the orders and order_line tables). We performed two experiments, one with no updates at the master and one with updates at the master (50 NewOrder transactions per second from the TPC-W benchmark). We measured throughput and response times for various numbers of skyline clients. All results are given relative to the performance of a single client connected to a system with a single satellite (we cannot compare with a master since the master does not support the skyline operator). Each client constantly sends read-only transactions containing the following skyline query to the system, with no think time between transactions:

SELECT o_id, o_total, SUM(ol_qty) AS qty_sum
FROM orders, order_line
WHERE o_id = ol_o_id AND ol_o_id >
      (SELECT o_id - 500 FROM orders
       ORDER BY o_id DESC LIMIT 1)
GROUP BY o_id, o_total ORDER BY o_total
SKYLINE OF o_total MAX qty_sum MIN

Fig. 13 Skyline performance results: relative throughput and response time versus the number of clients, without updates and with 50 update transactions per second, for one to eight satellites (SAT1 to SAT8)

7.1.1 Results: skyline satellites

The results are shown in Fig. 13 (the axes represent scalability in percent versus the number of clients submitting skyline queries; the lines in each graph correspond to the different numbers of satellites). Overall, adding additional satellites supports more clients at a higher throughput. It also improves the response time as clients are added, thereby contributing to the overall scalability of the resulting system. One can also observe that the system starts saturating when the number of clients is twice the number of satellites, which corresponds to the two CPUs present in each satellite machine. This is a consequence of the BNL-based skyline operator, which is very CPU intensive: a client that constantly sends queries is able to almost saturate one CPU. This also explains why the response time increases linearly with the number of clients. The important aspect here is that, as the experiments show, we have extended an Oracle database with scalable support for skyline queries with little overhead on the original master.



7.2 Keyword search satellites

As a second example of specialized satellites, we use a PostgreSQL-based full text search that extends the TPC-W schema deployed in a DB2 master. The objective here is to show how satellites can be used to add new tables (or, e.g., materialized views, indexes, etc.) to an existing database without inflicting the cost of maintaining those tables on the master.

The keyword search facility consists of a keyword table (holding 312,772 triples consisting of a keyword, book id and weight) and an index to speed up lookups. The keywords are generated by extracting words from the i_desc field in the item table. The keyword search facility is not static: we assigned weights to the keywords which correlate with the last 3,333 orders in the order_line table.

The challenge is how to keep the keyword table up-to-date. In terms of the load, maintaining the table has two consequences: first, the read load increases due to the need to read the data to compute the table. Second, there is an additional load caused by the logic that computes the table.

The approach we have tested is one in which the table is rebuilt periodically on the satellite by additional Java code that is hooked into the adapter. It scans the descriptions of all books in the item table and builds a keyword table with regard to the latest orders. This is, of course, not the most efficient solution to this problem, but it is quite representative of a wide range of applications that use such external programs to maintain derived data.
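The following sketch illustrates the kind of rebuild logic that can be hooked into the adapter. The table and column names for the keyword table, the word filter and the weighting query are simplified assumptions; only the TPC-W columns (i_id, i_desc, ol_i_id, ol_o_id) are taken from the schema described above.

import java.sql.*;

public class KeywordTableRebuilder {
    // Rebuilds the keyword table from item descriptions; weights count how often an item
    // appears among the most recent orders (a simplified stand-in for the real logic).
    public static void rebuild(Connection satellite, long oldestRecentOrderId) throws SQLException {
        satellite.setAutoCommit(false);
        try (Statement st = satellite.createStatement();
             PreparedStatement ins = satellite.prepareStatement(
                 "INSERT INTO keyword (keyword, i_id, weight) VALUES (?, ?, " +
                 "(SELECT COUNT(*) FROM order_line WHERE ol_i_id = ? AND ol_o_id >= ?))")) {
            st.executeUpdate("DELETE FROM keyword");
            try (ResultSet rs = st.executeQuery("SELECT i_id, i_desc FROM item")) {
                while (rs.next()) {
                    int itemId = rs.getInt(1);
                    for (String word : rs.getString(2).toLowerCase().split("\\W+")) {
                        if (word.length() < 4) continue;         // skip very short words
                        ins.setString(1, word);
                        ins.setInt(2, itemId);
                        ins.setInt(3, itemId);
                        ins.setLong(4, oldestRecentOrderId);
                        ins.executeUpdate();
                    }
                }
            }
            satellite.commit();
        } catch (SQLException e) {
            satellite.rollback();
            throw e;
        }
    }
}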

7.3 Results: keyword search satellites

Figure 14 shows the results of this experiment with two sets of measurements: throughput (solid columns) and build time (shaded columns).

Fig. 14 Rebuilding the keyword table: relative throughput and rebuild time for the main engine idle (100%/100%), the main engine loaded (74%/472%) and a loaded main engine with the rebuild offloaded to a satellite (100%/103%)

For throughput, the baseline is DB2 running 100 TPC-W shopping clients (100% performance). The middle solid column (main engine, loaded) shows what happens to the throughput of those 100 clients when the table is rebuilt: it drops to 74%. The rightmost solid column shows the throughput for a DB2 master with one PostgreSQL satellite that is used to store the table (and replicate the item and order_line tables) and keep it up-to-date. Throughput goes back to 100%, since the master does not see the overhead of updating the table.

For build time, we use as baseline the time it takes to rebuild the table in DB2 when DB2 is not loaded (30 s, taken as 100% in the figure). When we try to rebuild the table while DB2 is loaded with the 100 TPC-W shopping clients, the rebuild time is almost five times longer than when DB2 is idle (140 s, or 472% of the baseline). Once the rebuild happens on the satellite, the rebuild time goes down to 103%, even with the master loaded with the 100 TPC-W shopping clients.

7.4 Discussion: specialized satellites

The two experiments show that specialized satellites can be used to extend an existing master database without affecting performance at the master. This new functionality can be, e.g., operators that are not supported at the master (like the skyline queries) or additional tables derived from already existing data (like the keyword search). In both cases, the experiments show that satellites can easily support such extensions in a scalable, non-intrusive manner.

Overall, it is in this type of application that satellites may prove most valuable. On the one hand, the complex issues around full replication satellites are not present in specialized satellites, thereby simplifying the implementation. On the other hand, the fact that our satellites are open source databases offers an open platform in which to implement new functionality. This is particularly important for many scientific applications. The only downside is that satellites are static and may not always be needed. The obvious next step is to be able to add such specialized satellites dynamically.

8 Dynamic creation of satellites

In this section, we describe our approach to creating satellites dynamically. First, we describe the basic available options. Then, we describe the differences between the so-called logical and physical copy approaches. Next, we describe the PITR (point-in-time recovery) based physical copy approach that we use in Ganymed. We also provide several experiments that show the flexibility of our PITR-based prototype.



8.1 Basic options

We have identified three basic techniques that could be used to dynamically create satellites; they can actually be combined (all of them are similar to crash recovery procedures):

Copy: To create a satellite, one can simply copy the latest snapshot from the master. If another satellite is already available, one can copy the data from that satellite instead. During the copy process, the writeset queue of the new satellite is already being populated; however, these writesets can only be applied after the satellite has successfully finished the copying process (a sketch of this queueing step is given below). When adding the first satellite to a master, this is the only possible approach. If the master does not support SI (as with DB2), this means that updates have to be blocked until the initial satellite has been set up (for instance by using table level locks). However, this happens only once, since further satellites can be generated by taking a snapshot from existing satellites.

Writeset replay: This approach is feasible if we have an old checkpoint of the master. For instance, this could be useful when a satellite is taken out of a Ganymed setup and later added again. If the contained data are not too stale, the middleware can apply just the writesets which are needed to bring it to the latest state. Of course, this involves keeping a history of writesets, either on the master or on the primary satellite. The maximum acceptable staleness is therefore dependent upon the space assigned for keeping the history.

Hybrid: A more fine-grained approach is to decide for each table object which of the two above approaches is more appropriate. Note that the availability of the writeset replay approach for a certain table does not imply that the copying approach has to be disregarded. For example, in the case of a small table, copying is probably cheaper than applying a large set of updates.

Even though the writeset replay and hybrid approaches offer interesting possibilities, we have so far only implemented the copy approach in our prototype.
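The queueing of writesets during the copy phase, mentioned in the description of the copy option, can be sketched as follows. The Writeset interface and the class name are illustrative stand-ins, not the actual Ganymed interfaces.

import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CopyPhaseWritesetQueue {
    public interface Writeset { void applyTo(Connection satellite) throws SQLException; }

    private final BlockingQueue<Writeset> backlog = new LinkedBlockingQueue<>();

    // Called for every writeset produced at the source while the copy is running.
    public void enqueue(Writeset ws) { backlog.add(ws); }

    // Called once the data has been copied completely: drain the backlog in commit order.
    public void drainTo(Connection satellite) throws SQLException, InterruptedException {
        while (!backlog.isEmpty()) {
            backlog.take().applyTo(satellite);
        }
    }
}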

8.2 Logical or physical copy

For implementing the copy approach, one has to consider two options: logical copy and physical copy. By logical copy we refer to the mechanism of extracting and importing data using queries, while physical copy refers to directly transferring DBMS table space files from machine to machine. In our prototype we use a variant of the physical copy approach to spawn new satellites given an existing primary satellite.

The big advantage of the logical copy approach is its high flexibility. Table data can be replicated between different types of DBMSs through well-defined interfaces, e.g., using a DBMS's import/export tools or JDBC.9 Using logical copies, partial replication is straightforward (one could replicate, e.g., only the customer table or European customers). However, the price for this flexibility is speed. First, copying at the logical level is more expensive, and second, indexes have to be rebuilt after the corresponding table data have been copied.
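A minimal sketch of such a logical copy over JDBC is shown below. It assumes the target table already exists with a compatible schema; indexes would have to be rebuilt afterwards, and the batch size of 1000 is an arbitrary choice.

import java.sql.*;

public class LogicalTableCopy {
    public static void copyTable(Connection source, Connection target, String table,
                                 int columnCount) throws SQLException {
        source.setAutoCommit(false);      // allows cursor-based fetching on PostgreSQL
        target.setAutoCommit(false);
        String placeholders = "?" + ", ?".repeat(columnCount - 1);
        try (Statement sel = source.createStatement();
             PreparedStatement ins = target.prepareStatement(
                 "INSERT INTO " + table + " VALUES (" + placeholders + ")")) {
            sel.setFetchSize(1000);       // stream rows instead of materializing the whole table
            int batched = 0;
            try (ResultSet rs = sel.executeQuery("SELECT * FROM " + table)) {
                while (rs.next()) {
                    for (int c = 1; c <= columnCount; c++) {
                        ins.setObject(c, rs.getObject(c));
                    }
                    ins.addBatch();
                    if (++batched % 1000 == 0) ins.executeBatch();   // flush in chunks
                }
            }
            ins.executeBatch();
            target.commit();
            source.commit();
        }
    }
}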

The main advantage of the physical copy approach is speed. Throughput is limited only by the disk controllers or the available network bandwidth. Also, indexes do not have to be rebuilt; they are copied in the same way as the table data. The downside of the approach is its limited use in heterogeneous setups. Moreover, even when using the same DBMS type across machines, a physical copy is often only possible by copying the full DBMS installation, even though only a subset of the tables is needed. Therefore, enhancing a running replica with additional tables is not straightforward. Another problem arises when the source machine does not allow copying the tablespaces while the DBMS is online. Then, depending on the capabilities of the DBMS software used, the source DBMS may have to be stopped (and therefore its writesets have to be queued) for as long as the copying process is ongoing. This leads to a temporary decrease in performance, since one less satellite can work on read-only transactions. Also, the copied data cannot be accessed until all data have been transferred, since only then can the new DBMS replica be started up. This is in contrast to the incremental nature of the logical copy approach: there, as soon as pending writesets have been applied, consistent access to tables is possible, even though indexes and other tables may not yet be ready.

To demonstrate the difference in performance between the two alternatives, we have performed measurements based on the replication of the order_line table defined in TPC-W. The table has three indexes and was populated in two versions, small and big (corresponding to the TPC-W scaling factors 10,000 and 100,000). The measurements were done using two PostgreSQL satellites; the results are shown in Table 2.

As expected, the physical copy approach is faster, even though much more data have to be transferred. Yet, the logical copy approach allows a satellite to be created in minutes, which is already acceptable in many applications.

9 Octopus [30], for example, is a popular transfer tool based on JDBC.


Table 2 Logical versus physical copy performance for the TPC-W order_line table

                                      Logical copy                            Physical copy
                                      Small        Big           Big/Small    Small        Big           Big/Small
Tuples                                777,992      7,774,921     9.99         777,992      7,774,921     9.99
Transfer data size (bytes)            65,136,692   658,239,572   10.11        95,608,832   953,860,096   9.98
Data copy time (s)                    22.86        234.80        10.27        8.96         83.39         9.31
Data transfer speed (MB/s)            2.72         2.67          0.98         10.18        10.91         1.07
Sum of indexes size (bytes)           45,506,560   453,607,424   9.97         45,506,560   453,607,424   9.97
Indexes creation/copy time (s)        10.58        358.94        33.93        4.67         41.49         8.88
Indexes transfer speed (MB/s)         NA           NA            NA           9.29         10.43         1.12
Total time (s)                        33.44        593.74        17.76        13.63        124.88        9.16

One could further reduce this cost by using checkpoints and the hybrid approach discussed before. Note, however, the result for the index recreation time using logical copies: even though the ratio of tuples is approximately 10:1, the ratio of index recreation times is about 34:1. This is due to the sorting involved in rebuilding the indexes.

8.3 Physical copy: The PITR approach

In our prototype we create new satellites by physically copying an existing satellite to a fresh, empty satellite. As already said, the main advantage over a logical copying process (e.g., dump-and-restore) is the time saved by not having to re-create any data structures (like indexes). We assume that the master has one primary satellite permanently attached to it. In heterogeneous setups, it is the job of the administrator to create an initial primary satellite when setting up a Ganymed system. New, secondary satellites are then created from the primary satellite, not from the master. Reading the needed data from the primary satellite allows us to minimize the impact on the master and removes the problems created by the heterogeneity between master and satellites.

The main problem we face during the creation of new satellites is the initialization phase. We need to get a copy of the primary, install it in the secondary, apply any additional changes that may have occurred in the meantime, and start the secondary, all without interrupting the clients that are operating on the system. The solution we propose is based on a technique called PITR (point-in-time recovery). The PITR support in PostgreSQL (only available starting with the beta releases of version 8) allows the creation of a satellite by using the REDO log files from the primary. We extend this technique so that the rest of the transactions can be obtained from the adapter that resides at the master.

Adding a dynamic satellite works in several stages. First, a copy of the primary installation is pushed into the secondary. Then, a marker transaction is submitted, which becomes part of the REDO log files at the primary. It marks those transactions executed between the time the copy of the primary was made and the time the marker was submitted (i.e., the time it took to place the copy on the secondary). In the next step, the REDO log files from the primary are copied to the new satellite and PITR is used to recover up to the marker transaction. What is left to apply are the transactions that were executed after the marker transaction. This is done by the adapter of the new satellite, which reads the contents written by the marker transaction. There, among other things, it finds the number of the latest applied writeset. It then uses this information to ask the adapter at the master for the missing writesets. Since the master keeps a (limited) history of writesets, the fresh satellite can ask for slightly older writesets. If the time between the marker transaction and the startup of the adapter is too long, the needed writesets may no longer exist on the master and the whole operation fails. The length of the history kept at the master is a parameter that can be set at configuration time and largely depends on what type of satellites are being created.

Therefore, the history on the master must not be kept too small, and the last steps (copying of the latest REDO files, PITR recovery and startup of the adapter) should be as efficient as possible. The main task of the process, namely the copying of the datafiles, is not time critical; however, it can lead to increased amounts of REDO information, depending on the current update load and the total copying time.
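The final catch-up step can be sketched as follows. The ganymed_marker table, the MasterAdapterClient interface and the Writeset type are invented stand-ins for the internal Ganymed interfaces, which are not shown in the paper; the sketch only illustrates the sequence of actions described above.

import java.sql.*;
import java.util.List;

public class SatelliteCatchUp {
    public interface Writeset { long number(); void applyTo(Connection c) throws SQLException; }
    public interface MasterAdapterClient {
        // Returns writesets with numbers greater than afterWs, or fails if they have
        // already fallen out of the (limited) history kept at the master.
        List<Writeset> writesetsAfter(long afterWs);
    }

    public static void catchUp(Connection satellite, MasterAdapterClient master)
            throws SQLException {
        long lastApplied;
        try (Statement st = satellite.createStatement();
             ResultSet rs = st.executeQuery("SELECT last_writeset FROM ganymed_marker")) {
            rs.next();
            lastApplied = rs.getLong(1);            // written by the marker transaction
        }
        for (Writeset ws : master.writesetsAfter(lastApplied)) {
            ws.applyTo(satellite);                  // replay in writeset order
        }
        // From here on, the satellite consumes the live writeset stream like any other satellite.
    }
}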

8.4 Results: dynamic satellites

We have performed two sets of experiments to evaluate the dynamic creation of satellites. In a first experiment, we measure the time it takes to create a new satellite from a given idle primary. We use two different sizes for the primary: a small TPC-W database (scaling factors 100/10K, with a database size of 631 MB) and a bigger one (100/1M, requiring 1.47 GB). The results are shown in Fig. 15: we can create a new satellite for the large primary in about 3 min; for the small primary the process only takes 1.5 min. As the times for each phase of the satellite creation indicate, the largest cost is the physical copying of the database. This time is directly determined by the available network bandwidth.


Fig. 15 Replication times with PITR: total time, datafile copying, REDO log copying and DB recovery for the TPC-W 100/10K (92.73 s total) and 100/1M (183.35 s total) databases

Fig. 16 Effect of replication on source machine: relative response time at the primary during the creation of a new satellite


In a second experiment, we evaluated the overhead caused at the primary by dynamic satellite creation. To measure this, we loaded the primary satellite with a constant load of 50 TPC-W promotion-related transactions per second and measured the response times at the primary during the creation of a new satellite. Figure 16 shows the results. As can be seen, the response times degrade for a short time (data copying took place from seconds 0 to 89) but quickly revert to normal once the copy has been made.

8.5 Experiment details: dynamic skyline satellites

Having established that satellites can be created dynamically, we need to look at the effects of dynamic creation on specialized satellites. For this purpose we ran an experiment with dynamic skyline satellites (with characteristics identical to the ones described above).

In the experiment, an Oracle master is extended with a PostgreSQL primary satellite configured to offer the skyline facility over a subset of the TPC-W data (the orders and order_line tables). The system was loaded with a varying number of skyline query clients over time. Additionally, the master database was loaded with 50 TPC-W NewOrder update transactions per second; hence, each running satellite also had to consume 50 writesets per second. The management console was configured with a pool of spare satellite machines which it activates according to the policy described in the footnote in Sect. 2.8 (the exact nature of the policy is not relevant for the purposes of this paper). Response times were measured relative to the results of a single client working on the primary satellite. Note that our current implementation can only create one secondary satellite per primary satellite at a time.

8.6 Results: dynamic skyline satellites

The results are shown in Fig. 17. The figure shows how the number of clients varies over time (with two peaks of 12 clients), the number of satellites spawned by the management console in response to the load, and the actual response time observed by the clients (averaged over all clients). In the first peak, which has a slow increase of clients, the system responds well by gradually increasing the number of satellites, so that the response time slowly converges to the initial one with only one client. In the second peak, there is a surge of clients that saturates the system. As a result, the copying of the source tables is slower than before. Nevertheless, the system eventually manages to create enough satellites to cope with the new load.

As shown, the dynamic process not only spawns new satellites, it also removes the ones no longer needed.

8.7 Discussion: dynamic satellites

These experiments demonstrate that satellites can be created dynamically. The time it takes to create new satellites is dominated by the size of the data to be copied, the available network bandwidth, and the load at the primary. Again, this favors specialized satellites that do not need to copy as much data as full replication satellites. Together with the results we show for static specialized satellites, the conclusion is that satellites can be used to great advantage to extend the functionality of existing database engines and to do so dynamically, on demand. This ability opens up several interesting applications for satellites, such as database services within a data grid, dynamic database clusters and autonomic scalability by dynamically spawning specialized satellites.

As in the case of full replication, the basic procedure described here can be optimized in many different ways. For instance, the machines used for creating satellites could already store an old version of the database. Then dynamic creation would only involve replaying the log. Also, many modern computer clusters have several networks, thereby providing more bandwidth. The slow reaction observed when new satellites are created from a saturated system can be avoided by using a more aggressive policy for creating satellites. Also, for very large increases in load, several satellites could be created simultaneously, rather than one at a time. We leave these optimizations for future work.


Fig. 17 Autonomic adaptation to a varying number of skyline clients: number of clients, number of satellites and relative response time over time

9 Related work

Work in Ganymed has been mainly influenced by C-JDBC [10], an open source database cluster middleware based on a JDBC driver. C-JDBC is meant mainly for fault tolerance purposes and it imposes several limitations on the type of load that can be run. It offers variants of the Read-One Write-All approach with consistency guaranteed through table level locking. C-JDBC differs from Ganymed in that it duplicates database functionality at the middleware level, including locking and writeset extraction and propagation.

Daffodil Replicator [13] is an open source tool that allows data extraction from several database engines and replication in heterogeneous engines. It is based on triggers and stored procedures but does not provide any degree of consistency to the client. Clients must also connect exclusively either to the master or to a replica and therefore do not see a single database.

Distributed versioning and conflict-aware scheduling is an approach introduced in [5,6]. The key idea is to use a middleware-based scheduler which accepts transactions from clients and routes them to a set of replicas. Consistency is maintained through the bookkeeping of versions of tables in all the replicas. Every transaction that updates a table increases the corresponding version number. At the beginning of every transaction, clients have to inform the scheduler about the tables they are going to access. The scheduler then uses this information to assign versions of tables to the transactions. Similar to the C-JDBC approach, SQL statements have to be parsed at the middleware level for locking purposes.

Group communication has been proposed for use in replicated database systems [3,2,20]; however, only a few working prototypes are available. Postgres-R and Postgres-R(SI) [21,35] are implementations of such a system, based on modified versions of PostgreSQL (v6.4 and v7.2). The advantage of these approaches is the avoidance of centralized components. Unfortunately, in the case of bursts of update traffic, this becomes a disadvantage, since the system is busy resolving conflicts between the replicas. In the worst case, such systems are slower than a single instance database, and throughput increases at the cost of a higher response time. A solution to the problem of high conflict rates in group communication systems is the partitioning of the load [18]. In this approach, although all replicas hold the complete data set, update transactions cannot be executed on every replica. Clients have to predeclare for every transaction which elements in the database will be updated (so-called conflict classes). Depending on this set of conflict classes, a so-called compound conflict class can be deduced. Every possible compound conflict class is statically assigned to a replica; replicas are said to act as the master site for their assigned compound conflict classes. Incoming update transactions are broadcast to all replicas using group communication, leading to a total order. Each replica then decides if it is the master site for a given transaction. Master sites execute transactions, other sites just install the resulting writesets, using the derived total order. Recently, the work has been extended to deal with autonomous adaptation to changing workloads [26].

In contrast to this work [6,10,26], Ganymed does not involve table level locking for updates, nor does it force the application programmer to pre-declare the structure of each transaction or to send transactions as full blocks. Furthermore, Ganymed supports partial replication, something that would be very difficult to do with the systems mentioned. Much of this existing work applies exclusively to full replication [5,23,26,31,35].

Recent work has demonstrated the possibility of caching data at the edge of the network [1,4,17,22] from a centralized master database. This work, however, concentrates on caching at the database engine level and is very engine specific. Some of these approaches are not limited to static cache structures; they can react to changes in the load and adapt the amount of data kept in the cache. To local application servers, the caches look like an ordinary database system. In exchange for decreased response times, full consistency has to be given up.

In terms of the theoretical framework for Ganymed, the basis for consistency in the system is the use of snapshot isolation [7]. SI has been identified as a more flexible consistency criterion for distribution and federation [32]. It has also been studied in conjunction with atomic broadcast as a way to replicate data within a database engine [19]. This work was the first to introduce the optimization of propagating only writesets rather than entire SQL statements.

The first use and characterization of snapshot isolation for database replication at the middleware level is our own work [28]. In that early system, SI was used as a way to utilize lazy replication (group communication approaches use eager replication and, thus, enforce tight coupling of all databases) and to eliminate the need for heavy processing at the middleware level (be it versioning, locking, or atomic broadcast functionality). The level of consistency introduced by Ganymed (clients always see their own updates and a consistent snapshot) was independently formalized as strong session consistency in [14], although those authors emphasize serializability. Nevertheless, note that Ganymed provides a stronger level of consistency in that clients not only get strong session consistency, they also get the latest available snapshot. More recently, [23] have proposed a middleware-based replication tool that also uses SI. Conceptually, the system they describe is similar to the one described in [28]. The differences arise from the use of group communication and the avoidance of a central component in [23] and the corresponding limitations that this introduces: load balancing is left to the clients, partial replication is not feasible, and the middleware layer is limited by the scalability of the underlying atomic broadcast implementation. More importantly, unless clients always access the same replica, [23] does not provide strong session consistency, as clients are not guaranteed to see their own updates. Clients are not even guaranteed to see increasing snapshots, as time travel effects are possible when consecutive queries from the same client are executed on different replicas that are not equally up-to-date.

A possible extension of our system is to implement a solution similar to that proposed in [31], where clients can request to see a snapshot not older than a given time bound. In our system, this can be achieved by using satellites that lag behind in terms of snapshots.

10 Conclusions

This paper has presented the design, implementation, and evaluation of a database extension platform (Ganymed). The core of the platform is what we call satellite databases: a way to extend and scale out existing database engines. Satellite databases are lightweight databases that implement either copies of the master database or specialized functionality that is not available at the master database. In the paper, we describe the architecture of the system and its implementation. We also provide extensive experiments that prove the feasibility of the idea by showing how satellite databases perform in a wide variety of settings: full replication, specialized functionality (skyline queries and keyword search), dynamic creation of satellites, and the overall performance of the system.

Acknowledgements Many people have contributed valuable ideas to the system described in this paper. We are grateful to Roger Rüegg, Stefan Nägeli, Mario Tadey and Thomas Gloor for their help in developing and benchmarking different parts of the system. We would also like to thank Donald Kossmann for his help with skyline queries; Dieter Gawlick for his input on Oracle; C. Mohan and Bruce Lindsay for their input on DB2 and other IBM replication tools; Adam Bosworth for new ideas on how to use satellites; and Mike Franklin and the database group at UC Berkeley for many discussions on the use and possibilities of satellite databases. Many thanks also go to Willy Zwaenepoel, who gave us feedback when we came up with the initial idea for RSI-PC.

References

1. Altinel, M., Bornhövd, C., Krishnamurthy, S., Mohan, C., Pirahesh, H., Reinwald, B.: Cache tables: paving the way for an adaptive database cache. In: Proceedings of the 29th International Conference on Very Large Data Bases, Berlin, Germany, September 9–12, 2003
2. Amir, Y., Tutu, C.: From total order to database replication. Technical report, CNDS (2002). http://www.citeseer.nj.nec.com/amir02from.html
3. Amir, Y., Moser, L.E., Melliar-Smith, P.M., Agarwal, D.A., Ciarfella, P.: The Totem single-ring ordering and membership protocol. ACM Trans. Comput. Syst. 13(4), 311–342 (1995). http://www.citeseer.nj.nec.com/amir95totem.html
4. Amiri, K., Park, S., Tewari, R., Padmanabhan, S.: DBProxy: a dynamic data cache for web applications. In: Proceedings of the 19th International Conference on Data Engineering, Bangalore, India, 5–8 March 2003
5. Amza, C., Cox, A.L., Zwaenepoel, W.: A comparative evaluation of transparent scaling techniques for dynamic content servers. In: ICDE '05: Proceedings of the 21st International Conference on Data Engineering (ICDE'05), pp. 230–241 (2005)
6. Amza, C., Cox, A.L., Zwaenepoel, W.: Distributed versioning: consistent replication for scaling back-end databases of dynamic content web sites. In: Middleware 2003, ACM/IFIP/USENIX International Middleware Conference, Rio de Janeiro, Brazil, June 16–20, Proceedings (2003)
7. Berenson, H., Bernstein, P., Gray, J., Melton, J., O'Neil, E., O'Neil, P.: A critique of ANSI SQL isolation levels. In: Proceedings of the SIGMOD International Conference on Management of Data, pp. 1–10 (1995)
8. Bernstein, P.A., Hadzilacos, V., Goodman, N.: Concurrency control and recovery in database systems. Addison-Wesley, Reading (1987)
9. Borzsonyi, S., Kossmann, D., Stocker, K.: The skyline operator. In: IEEE Conference on Data Engineering, pp. 421–430, Heidelberg, Germany (2001)


10. Cecchet, E.: C-JDBC: a middleware framework for database clustering. IEEE Data Engineering Bulletin, vol. 27, no. 2 (2004)
11. Chomicki, J., Godfrey, P., Gryz, J., Liang, D.: Skyline with presorting. In: ICDE, pp. 717–816 (2003)
12. Council, T.T.P.P.: TPC-W, a transactional web e-commerce benchmark. http://www.tpc.org/tpcw/
13. Daffodil Replicator: http://sourceforge.net/projects/daffodil-replica
14. Daudjee, K., Salem, K.: Lazy database replication with ordering guarantees. In: Proceedings of the 20th International Conference on Data Engineering (ICDE 2004), Boston, MA, USA, pp. 424–435, 30 March – 2 April 2004
15. Fekete, A.D.: Serialisability and snapshot isolation. In: Proceedings of the Australian Database Conference, pp. 210–210 (1999)
16. Fekete, A., Liarokapis, D., O'Neil, E., O'Neil, P., Sasha, D.: Making snapshot isolation serializable. http://www.cs.umb.edu/isotest/snaptest/snaptest.pdf
17. Härder, T., Bühmann, A.: Query processing in constraint-based database caches. IEEE Data Engineering Bulletin, vol. 27, no. 2 (2004)
18. Jiménez-Peris, R., Patiño-Martínez, M., Kemme, B., Alonso, G.: Improving the scalability of fault-tolerant database clusters. In: IEEE 22nd International Conference on Distributed Computing Systems, ICDCS'02, Vienna, Austria, pp. 477–484 (2002)
19. Kemme, B.: Database replication for clusters of workstations. Ph.D. thesis, Diss. ETH No. 13864, Department of Computer Science, Swiss Federal Institute of Technology Zurich (2000)
20. Kemme, B.: Implementing database replication based on group communication. In: Proceedings of the International Workshop on Future Directions in Distributed Computing (FuDiCo 2002), Bertinoro, Italy (2002). http://www.citeseer.nj.nec.com/pu91replica.html
21. Kemme, B., Alonso, G.: Don't be lazy, be consistent: Postgres-R, a new way to implement database replication. In: Proceedings of the 26th International Conference on Very Large Databases (2000)
22. Larson, P.A., Goldstein, J., Zhou, J.: Transparent mid-tier database caching in SQL Server. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 661–661. ACM Press (2003). DOI http://www.doi.acm.org/10.1145/872757.872848
23. Lin, Y., Kemme, B., Patiño-Martínez, M., Jiménez-Peris, R.: Middleware based data replication providing snapshot isolation. In: SIGMOD '05: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 419–430 (2005)
24. Lipasti, M.H.: Java TPC-W implementation distribution of Prof. Lipasti's Fall 1999 ECE 902 course. http://www.ece.wisc.edu/pharm/tpcw.shtml
25. Microsoft SQL Server 2005: http://www.microsoft.com/sql/2005
26. Milan-Franco, J.M., Jiménez-Peris, R., Patiño-Martínez, M., Kemme, B.: Adaptive distributed middleware for data replication. In: Middleware 2004, ACM/IFIP/USENIX 5th International Middleware Conference, Toronto, Canada, October 18–22, Proceedings (2004)
27. Oracle Database Documentation Library: Oracle Streams Concepts and Administration, Oracle Database Advanced Replication, 10g Release 1 (10.1) (2003). http://www.otn.oracle.com
28. Plattner, C., Alonso, G.: Ganymed: scalable replication for transactional web applications. In: Middleware 2004, 5th ACM/IFIP/USENIX International Middleware Conference, Toronto, Canada, October 18–22, Proceedings (2004)
29. PostgreSQL Global Development Group. http://www.postgresql.org
30. Project, T.E.O.: Octopus, a simple Java-based Extraction, Transformation, and Loading (ETL) tool. http://www.octopus.objectweb.org/
31. Röhm, U., Böhm, K., Schek, H., Schuldt, H.: FAS – a freshness-sensitive coordination middleware for a cluster of OLAP components. In: Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002), Hong Kong, China, pp. 754–765 (2002)
32. Schenkel, R., Weikum, G.: Integrating snapshot isolation into transactional federation. In: Cooperative Information Systems, 7th International Conference, CoopIS 2000, Eilat, Israel, Proceedings, 6–8 September 2000
33. Stolte, E., von Praun, C., Alonso, G., Gross, T.R.: Scientific data repositories: designing for a moving target. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 349–360 (2003)
34. Weikum, G., Vossen, G.: Transactional information systems. Morgan Kaufmann, San Francisco (2002)
35. Wu, S., Kemme, B.: Postgres-R(SI): combining replica control with concurrency control based on snapshot isolation. In: ICDE '05: Proceedings of the 21st International Conference on Data Engineering (ICDE'05), pp. 422–433