


System Architecture for Partition-Tolerant Distributed Databases

SUNIL K. SARIN, MEMBER, IEEE, BARBARA T. BLAUSTEIN, MEMBER, IEEE, AND CHARLES W. KAUFMAN

Abstract—An overview is presented of an approach to distributed database design that emphasizes high availability in the face of network partitions and other communication failures. This approach is appropriate for applications that require continued operation and can tolerate some loss of integrity of the data. Each site presents its users and application programs with the best possible view of the data that it can, based on those updates that it has received so far. Mutual consistency of replicated copies of data is ensured by using timestamps to establish a known total ordering on all updates issued, and by a mechanism that ensures the same final result regardless of the order in which a site actually receives these updates. A mechanism is proposed, based on alerters and triggers, by which applications can deal with exception conditions that may arise as a consequence of the high-availability architecture. A prototype system that demonstrates this approach is near completion.

Index Terms—Concurrency control, compensation, distributed databases, mutual consistency, network partitions, replicated data.

I. INTRODUCTION

REPLICATION of data at multiple sites has long been proposed as a means of increasing availability of the data in the face of failures. However, higher availability is frequently not realized because of the intersite synchronization needed to ensure consistency of the replicated data. If an update transaction must block whenever any site holding a copy of data to be updated is inaccessible, the availability of data decreases rather than increases as a result of replication. Most existing solutions to this problem take one of two forms.

1) Maintain an "uplist" of sites currently believed to be operating, and require transactions to update only those data item copies that are located at sites on the uplist [2].

2) Require that transactions read and update a majority of copies of a data item; or, more generally, reads and updates must apply to intersecting quorums of sites [9].

The problem with both of the above approaches is that they fail when the system is partitioned into two or more disconnected groups of sites.

Manuscript received May 2, 1985; revised August 15, 1985. This work was supported in part by the Defense Advanced Research Projects Agency of the Department of Defense and in part by the Air Force Systems Command at Rome Air Development Center under Contract F30602-84-C-0112. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the U.S. Government.

S. K. Sarin and C. W. Kaufman are with Computer Corporation of America, Cambridge, MA 02142.

B. T. Blaustein is with Computer Corporation of America, Alexandria, VA 22314.

Using majority or quorum consensus, no more than one group, and sometimes none, will be able to access and update a given data item. Using uplists, each group will form its own uplist and will continue to execute transactions, but the database copies in the different groups will diverge because there is no synchronization across groups. A disconnection of the sites in a two-site system will either not allow any site to continue operating or will lead to divergent and inconsistent database copies.

In many applications, such as banking, scheduling reservations, and inventory control, availability of data is of primary importance. It is often not acceptable for a site to suspend processing when it cannot communicate with a majority of sites. Our objective is to permit an individual site to service users' queries and updates regardless of which and how many other sites it can communicate with.

In order to achieve this goal, transactions in our system, which we call interactions, do not execute serializably. The database updates issued by different interactions are totally ordered by the system in order to ensure eventual mutual consistency of sites' database copies when they learn of each other's updates. However, the data read by an interaction at a site will reflect only some incomplete subset of the total order, namely, just those updates that the site knows about so far. As a result of nonserializable execution, integrity of the data may suffer. We believe that in many cases this may be acceptable if it occurs infrequently and the application has some means of compensating for it; this is discussed further in Section V.

Our system design does not assume that site failures and network partitions are detected or announced.1 This contrasts strongly with other designs for partition-tolerant databases (e.g., [6]), which assume that partitions are detectable and distinguishable from site failures, and which require sites to switch operating modes when a partition is detected. When the partition is repaired, sites enter a "merge" phase to reconcile their database copies and then revert to "normal" operating mode when the merge phase is completed. In our approach, sites operate in a single "mode" in which they are in effect continuously merging their database copies. Sites expect communication delays as the norm; site failures and network partitions manifest themselves only as very long communication delays. Because there is no switching of modes when partitions and reconnections occur, the system gracefully handles contingencies such as another partition quickly following a reconnection.

1 We do assume that failed sites do not maliciously or accidentally corrupt their database copies or issue incorrect messages, i.e., the system does not handle "Byzantine" failures.



For simplicity, we assume a fully replicated distributed database, i.e., every site has a complete copy of the database. This has allowed us to defer certain problems, which we intend to explore at a later date when we relax this assumption and allow partial replication. We assume that site failures and network partitions are not permanent, i.e., are eventually repaired. Under this assumption, the system guarantees that every update issued will eventually be seen by every site, at which point the sites' database copies will be mutually consistent. If a network partition persists forever and is never repaired, the guarantee of mutual consistency will only hold among the sites in each disconnected group.

This paper is intended to provide an overview of the various technical issues that arise in designing a system of the kind described; more detailed descriptions of the issues and solutions will appear in forthcoming papers. The paper is organized as follows. Section II describes how mutual consistency of replicated data is achieved. Section III describes the architecture and operation of individual sites in the system. Section IV introduces a set of protocols for managing the overall system configuration. Section V describes techniques available to the application for dealing with nonserializable behavior. Section VI presents the status of our design and prototype implementation, and plans for future research.

II. MUTUAL CONSISTENCY

Sites in our system issue updates without any synchronization with other sites; other sites are informed of these updates only after the fact. As a result, different sites may see the same set of updates in different orders; if they simply process the updates as received, the copies of the database at different sites will no longer agree.

To ensure mutual consistency of sites' database copies, we impose a known total ordering on all updates issued by all sites. Our system uses timestamps [11] to define the total ordering; while other mechanisms may be used, we will use the phrase "timestamp order" to denote whichever ordering is being established. At any given time, a site's database copy must reflect those updates that it has seen, executed in timestamp order. Assuming eventual connectivity of sites and eventual delivery of updates to all sites, this implies eventual mutual consistency of all sites' database copies.
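The paper does not spell out the timestamp construction, so the following is a minimal sketch, assuming Lamport-style logical clocks [11] with site identifiers breaking ties; all names are ours.

    # Sketch of a known total order on updates: (clock, site_id) pairs
    # are unique across sites and compare identically everywhere.
    from dataclasses import dataclass

    @dataclass(frozen=True, order=True)
    class Timestamp:
        clock: int     # logical clock value at the issuing site
        site_id: int   # unique site identifier; breaks clock ties

    class LamportClock:
        def __init__(self, site_id: int):
            self.site_id = site_id
            self.clock = 0

        def issue(self) -> Timestamp:
            # called when this site issues a new update
            self.clock += 1
            return Timestamp(self.clock, self.site_id)

        def observe(self, ts: Timestamp) -> None:
            # called on receiving a remote update; keeps the local clock
            # ahead of every timestamp seen so far
            self.clock = max(self.clock, ts.clock) + 1

Because every site sorts updates by the same (clock, site_id) key, two sites that have seen the same set of updates agree on their order, which is all the mutual consistency mechanism requires.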

The problem with realizing the above is that, because of communication delays, failures, and partitions, a given site will typically not receive all updates in timestamp order. If a site receives an update with a timestamp smaller than one or more already-processed updates, it must somehow incorporate this update in such a way that the new database state continues to reflect all received updates as if they had been processed in timestamp order. Past work on using timestamp ordering for replicated database consistency [10], [3], [8] has solved this problem by considering only overwriting updates, which completely replace the existing value of a data item with a specified new value.2 In this case, it is sufficient to associate a timestamp with each data item and to discard received updates that have a lower timestamp.

2 Additions to and deletions from a set are equivalent to overwriting the membership function in the set to true or false.

Our system, on the other hand, allows arbitrary update types, such as increments and conditionals; the only constraint on an update is that it be deterministic, i.e., always have the same effect when executed on the same database state.
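The per-item timestamp rule for overwriting updates is simple enough to state in a few lines; this sketch (our own formulation of the rule used in [10], not code from that work) is included for contrast with the general mechanism that follows.

    # For overwriting updates only: keep the timestamp of the last
    # applied write per item, and drop lower-timestamped arrivals,
    # since they would have been overwritten anyway.
    class OverwriteStore:
        def __init__(self):
            self.values = {}   # item -> current value
            self.stamps = {}   # item -> timestamp of last applied write

        def apply(self, item, value, ts):
            if item not in self.stamps or ts > self.stamps[item]:
                self.values[item] = value
                self.stamps[item] = ts
            # else: discard; a later write has already superseded it

No rollback is ever needed under this rule, but it breaks down as soon as an update's effect depends on the prior database state, which is exactly the case our system must handle.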

Maintaining mutual consistency when an arbitrary update arrives out of order requires some additional mechanism. Conceptually, each higher-timestamped update is rolled back so that the database state now reflects only those updates with timestamps smaller than the newly received one. Then, the newly received update is executed and the higher-timestamped ones (which were rolled back) are reexecuted. The database state now reflects all received updates executed in timestamp order.

Rolling back and redoing updates can be expensive, and we would like to achieve the same effect more efficiently. We believe this can be done for many kinds of updates by exploiting semantic properties of the updates. For example, updates with nonintersecting read and write sets always commute; the same result is obtained regardless of the order of execution, and it is therefore unnecessary to roll back and redo higher-timestamped updates when a nonconflicting earlier-timestamped update is received. Updates to the same data item in general do conflict, but again there are many special cases where rollback and redo can be avoided. For example, updates that increment a data item (e.g., deposits to and withdrawals from a bank account) commute, and can be applied in any order; and, if an update overwrites the value of a data item, an earlier-timestamped update to the same data item can be ignored.

We have developed a method, which we call log transformation, for determining an efficient way of incorporating newly received updates to achieve the desired database state. When one or more incoming updates arrive and must be integrated with already-processed updates, an initial merge log is constructed consisting of rollback actions for all higher-timestamped updates followed by all of the updates in timestamp order. Optimization rules are then applied which allow actions to be removed from the log based on their commutativity and conflict properties. Details of this technique are presented in a separate paper [4]. Here, we illustrate how the technique works using a simple example.

Suppose the data item X has an initial value of 10, and suppose that site S1 executes the following update:

(T3) X := X - 2.

The value of X is now 8. Now suppose the following update, with an earlier timestamp, arrives from another site:

(T2) X := X + 1.

Because of the timestamp ordering, the final database state should be the result of applying T2 followed by T3, i.e., the value of X should be 9. Since T3 has already been executed, its rollback action T3' (which increments X by 2, the amount subtracted) can be applied to the current database state to get back the initial state. Therefore, the correct new database state can be obtained from the current state by applying the sequence

T3' T2 T3.

This is referred to as the initial merge log; it specifies a correct but possibly inefficient way of obtaining the desired new database state. The log transformation rules will allow the site to optimize this log as follows. T2 and T3 commute, and can therefore be interchanged, giving T3' T3 T2. Then, T3 and its rollback T3' cancel each other out and can be discarded. Therefore, the final merge log, which will be used to obtain the same effect more efficiently, is simply T2. That is, the incoming update T2 is applied to the current database state. The desired final state, X = 9, is obtained without any rollback or redo.
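The transformation rules of [4] are more general than this example needs; the following toy sketch (our own encoding, with updates named by strings and a rollback denoted by a trailing apostrophe) mechanizes just the two rules used above.

    # Toy merge-log optimizer: cancel an update that directly follows
    # its own rollback, and interchange commuting neighbors when doing
    # so moves an update toward its pending rollback.
    def cancels(a: str, b: str) -> bool:
        return a == b + "'"            # e.g., T3' immediately before T3

    def optimize(log, commutes):
        log = list(log)
        changed = True
        while changed:
            changed = False
            for i in range(len(log) - 1):
                a, b = log[i], log[i + 1]
                if cancels(a, b):
                    del log[i:i + 2]                  # T3' T3 -> (nothing)
                    changed = True
                    break
                if commutes(a, b) and (b + "'") in log[:i]:
                    log[i], log[i + 1] = b, a         # T3' T2 T3 -> T3' T3 T2
                    changed = True
                    break
        return log

    # increments of the same data item always commute:
    assert optimize(["T3'", "T2", "T3"], lambda a, b: True) == ["T2"]

The assertion reproduces the example: the final merge log is simply T2.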

An update issued by an interaction in our system may specify multiple database changes; these are treated as an atomic unit. For example, the following update both decrements the balance of account A and sets a flag if the balance becomes negative, i.e., the account is overdrawn:

Decrement(Balance[A], D)
if Balance[A] < 0 then Overdrawn[A] := true

Update types such as the above may require some rolling back and redoing in order to preserve mutual consistency. Suppose the following updates are issued against a given bank account A having initial balance zero:

T1: deposit $200
T2: deposit $200
T3: withdraw $300

Suppose a given site sees T1 and T3 and has not yet received T2. Its copy of Balance[A] will be -$100, and the conditional part of T3 will set Overdrawn[A] to true. When the site later receives T2, it must roll back the conditional part of T3 because the latter reads data (Balance[A]) that are being modified by T2. This rolling back will reset Overdrawn[A] to false. Reexecution of the conditional part of T3 will now find a positive balance ($100) and will therefore not set Overdrawn[A]. This correctly matches what would have happened if the site had seen the updates in the correct order; namely, Balance[A] would never have been negative and Overdrawn[A] would have remained false.
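The convergence argument can be checked mechanically: replaying the three updates in timestamp order, which is the state every site must eventually reflect, never drives the balance negative. A small sketch, using our own encoding of the example:

    # Replaying T1, T2, T3 in timestamp order: the conditional part of
    # the withdrawal sees a nonnegative balance, so Overdrawn[A] stays
    # false -- the state every site converges to.
    def apply(state, update):
        kind, amount = update
        if kind == "deposit":
            state["balance"] += amount
        else:                          # withdraw, with its conditional part
            state["balance"] -= amount
            if state["balance"] < 0:
                state["overdrawn"] = True

    state = {"balance": 0, "overdrawn": False}
    for u in [("deposit", 200), ("deposit", 200), ("withdraw", 300)]:
        apply(state, u)                # T1, T2, T3
    assert state == {"balance": 100, "overdrawn": False}

A site that had applied only T1 and T3 would show a balance of -100 and the flag set; receiving T2 forces the rollback and reexecution described above, after which its state matches this replay.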

Rolling back and redoing as described above apply only to database changes, not to the external effects of interactions. Updates that report external effects do so unconditionally; once money is given to a customer, his account balance is decremented whether or not it would become negative. The decision as to whether or not to give the money may be made at the time of the interaction by testing the currently believed balance in the site's database copy, but this is done once by the interaction and is not part of the update issued. If it is later determined that the money should not have been given, i.e., the account turns out to be overdrawn, that can only be compensated for by running further interactions; this is discussed in Section V.

III. SYSTEM ARCHITECTURE

In order to implement our approach, we propose three distinct software modules at each site in the system; this is illustrated in Fig. 1.

Fig. 1. Local site architecture.

1) Interactor: presents data to the users and accepts updates.

2) Distributor: implements a reliable broadcast protocol to ensure that all sites eventually see all updates.

3) Checker: installs updates in the database and ensures mutual consistency using log transformations.

In addition to the local copy of the application database, there is a separate data structure containing the history of all updates issued or received by this site. The history and database are maintained on "stable storage," which is assumed to survive site crashes. The history is needed so that the checker can correctly process updates in a way that ensures the same result regardless of the order in which the updates are received. The history is primarily an append-only structure; updates once added are never changed and are not removed except for periodic reclamation described in Section IV. Separating the history from the database allows its storage structure to be tailored and optimized accordingly.
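The paper does not give the history's concrete layout; the following is a minimal sketch, assuming entries keyed by origin site and per-site sequence number, with the timestamp carried alongside. The structure is our assumption, not the prototype's.

    # Append-only update history, kept on stable storage beside the
    # application database.
    class History:
        def __init__(self):
            self.entries = []    # appended in arrival order, never changed
            self.seen = {}       # origin site -> highest seqno received

        def append(self, timestamp, origin, seqno, update):
            # removal happens only via the garbage-collection protocol
            # of Section IV
            self.entries.append((timestamp, origin, seqno, update))
            self.seen[origin] = max(self.seen.get(origin, 0), seqno)

        def in_timestamp_order(self):
            # the checker consumes updates in the global total order
            return sorted(self.entries)

Keeping the history separate from the database copy lets it be stored as a simple sequential log, which is exactly what an append-mostly access pattern favors.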

Application programs, or interactions, running in the interactor module query the database and accept input from external devices such as user terminals and sensors. The interaction may perform some computation on the information received, format it for display, send commands to external devices, and issue database updates. Updates issued are appended to the history.

The distributor sends updates issued by the local interactor to other sites and accepts updates from other sites' distributors; the latter are stored permanently in the update history. Reliable delivery of every update to every site is accomplished by a combination of the following.

1) When a site's interactor issues a new update, its distributor makes an initial "best effort" to send the update to all other sites. This initial broadcast uses a forwarding tree that is constructed using a variant of Awerbuch and Even's algorithm [1]. Not every site is guaranteed to receive this initial broadcast; some sites may be down or disconnected from the source site by a partition.

2) Each site's distributor periodically supplies its neighbors with information specifying how many updates (i.e., the highest sequence number) it has received from each site (including itself). If a site A finds that one of its neighbors B has received more updates from site C than A has, A will ask B to retransmit the additional updates.

Eventual reliable delivery of every update is assured provided the network is eventually connected. The protocol does not require any given pair of sites to be up and communicating at the same time; if site C issues an update while A is down and then C fails, when A comes up it can get the update from some other site B that did receive it.
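Step 2 above amounts to comparing per-origin sequence-number counts; a hedged sketch follows (message formats and names are ours, not the prototype's).

    # A site compares a neighbor's advertised per-origin counts with its
    # own and requests retransmission of whatever it is missing.
    def missing_ranges(mine: dict, neighbor: dict) -> dict:
        # mine/neighbor: origin site -> highest contiguous seqno received
        requests = {}
        for origin, their_high in neighbor.items():
            my_high = mine.get(origin, 0)
            if their_high > my_high:
                # ask for updates my_high+1 .. their_high from neighbor
                requests[origin] = (my_high + 1, their_high)
        return requests

    # A has 7 updates from C; neighbor B advertises 9; A requests 8-9.
    assert missing_ranges({"C": 7}, {"C": 9}) == {"C": (8, 9)}

Because every update is retained in the history until garbage-collected, any site holding a copy can serve such a request; the original issuer need not be reachable, which is what makes the scenario of C failing while A is down work.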

The checker reads newly received updates from the history and performs log transformations to determine how to integrate these updates with already-processed updates. The application database is then updated accordingly. The checker maintains various bookkeeping data structures indicating which received updates have already been processed and which have not; the former may be marked for rolling back and reexecution as needed to preserve mutual consistency. Previous values of data items are maintained to allow undoing of updates and to allow updates to read the correct versions of data items as of their timestamps, if these versions are not the latest ones reflected in the application database.

Each site also maintains communication status information specifying how many updates each site is known to have issued and the time elapsed since the last communication received from each site. This information is exchanged periodically by site distributors as part of the reliable delivery protocol. It is used by the checker to decide, based on some simple heuristics, whether to perform log transformations and incorporate newly received updates immediately or to wait for additional updates to arrive and then incorporate all of them at once.

IV. CONFIGURATION MANAGEMENT

The system is initialized by loading each site with an initial database copy and a list of the sites in the system; it is the responsibility of site administrators to ensure that the initial database copies and site lists are identical at all sites. As each site is brought up, it starts interacting with users and communicating with sites that are accessible. It is not necessary for every site to start at the same time; each site will learn of all updates issued by other sites, and integrate them into its database copy, when it starts up and is able to communicate with the other sites.

The above procedure has proved to be sufficient for demonstration and testing purposes. We are, in addition, developing a variety of configuration management protocols that will make the system more usable in an operational setting.

1) A protocol for garbage collection of old updates from the history, allowing reclamation of storage space. This is based on sites exchanging messages and agreeing that no further updates will be issued earlier than a given "cutoff" timestamp; a sketch of the cutoff computation appears after this list.

2) A protocol for orderly system shutdown, leaving every site's database copy in a consistent state that reflects all updates ever issued. This will be used in acceptance testing of our prototype: after shutdown, sites' database copies will be compared to verify that they are in fact identical.

3) A protocol for removing a site from the system, once an authorized administrator has determined that the site is permanently dead or is faulty. Removal of a dead site will allow garbage collection of old updates, held up because of lack of response from the site, to proceed at the remaining sites.


4) A protocol for adding a new site to the system. The sites currently in the system are informed of the new site by any one site administrator issuing a "configuration update" that is broadcast via the distributor in the same way as database updates. The new site must be loaded with a database checkpoint reflecting the database state as of the most recently agreed-upon cutoff timestamp. The new site will then receive and process all updates with a higher timestamp than the cutoff; these updates will still be available from other sites because they will not have been garbage-collected.
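As promised under protocol 1, here is a sketch of the cutoff computation; the formulation is ours, inferred from the description above rather than taken from the protocol itself.

    # Each site promises a lower bound below which it will issue no
    # further updates; the agreed cutoff is the minimum of all promises.
    def agree_cutoff(promises: dict):
        # promises: site -> lowest timestamp it may still issue.
        # A dead site that never responds holds the cutoff back, which
        # is why protocol 3 (site removal) unblocks garbage collection.
        return min(promises.values())

    def purge(history, cutoff):
        # assuming the agreement also establishes that all entries below
        # the cutoff have been received everywhere, they can never be
        # rolled back or reexecuted and may be reclaimed
        return [entry for entry in history if entry[0] >= cutoff]

The same cutoff doubles as the checkpoint boundary in protocol 4: a new site loads a checkpoint as of the cutoff and replays everything above it.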

We are also exploring how partial replication may be supported in our architecture. This can improve overall system availability by allowing sites that have insufficient storage to support full replication to be included in the system. Partial replication will not work if an update to data item X stored at a site requires reading the value of data item Y, which is not stored at that site. If the possible update types for a given application are designed carefully, a preanalysis will allow assignment of data item copies to sites in such a way that this does not happen. This problem is the subject of future research.

V. APPLICATION DESIGN

While database updates are totally ordered to ensure mutual consistency, the interactions that issued the updates may not have seen all prior updates and may therefore be nonserializable. This may cause integrity constraints on the data to be violated. The application is expected to compensate for such violations after they occur. This is done using alerters [5], which detect exception conditions arising from nonserializable execution, and triggers [7], which are compensating interactions invoked automatically when exception conditions arise. A compensating interaction may notify a human agent to investigate the problem or may automatically issue corrective actions, or both.

Consider the following interaction, which handles a customer withdrawal from a bank account:

interaction type Withdrawal
    input A: acct#, D: dollars                ;get from customer
    if Balance[A] >= D
        then update (Decrement(Balance[A], D)
                     if Balance[A] < 0 then Overdrawn[A] := true)
             output command to dispense D dollars
        else output "insufficient funds"

If withdrawal interactions are performed concurrently on the same account before they see each other's updates, then each may believe that sufficient money remains in the account, but the resulting balance may be negative. The alerter Overdrawn[A] detects whether or not the balance of account A ever becomes negative, in the timestamp-ordered execution of deposits and withdrawals to that account. If the bank wishes to assess a fine on an account A that is overdrawn, and notify the customer, it can run the following compensating interaction:

print letter to Customer[A]
update (if Overdrawn[A] and not Fined[A]      ;fine no more than once,
        then Fined[A] := true)                ;if actually overdrawn


A compensating interaction may be invoked in a variety of ways. For example, a bank may periodically run a "batch" of interactions that perform the above for all overdrawn accounts that have not already been fined; or, the interaction may be triggered immediately on occurrence of the condition of interest (Overdrawn[A] and not Fined[A]) for some account A. More generally, a delay can be specified between the occurrence of the condition and the triggering of the compensating interaction. This delay reduces the probability that compensation is triggered by an alerter that was falsely set as a result of not seeing all updates through a given timestamp, e.g., if an account appears to be overdrawn when in fact an earlier deposit has not yet arrived at this site. The conditional update issued above ensures that no fine is subtracted if the account turns out not to have been overdrawn. The letter to the customer will have been printed; further compensation for this kind of anomaly, e.g., an apology and the gift of a toaster, can be specified if desired.
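A schematic of the delayed trigger follows; the names and structure are ours, and the actual trigger language is the subject of Section VI, so this is illustration only.

    # Fire compensation only if the alerter condition still holds after
    # a waiting period, so that a late-arriving deposit can clear a
    # falsely set alerter before any fine is assessed.
    import time

    def delayed_trigger(alerter, compensate, delay_seconds):
        if alerter():                  # condition observed in local copy
            time.sleep(delay_seconds)  # let in-transit updates arrive
            if alerter():              # recheck against the merged state
                compensate()           # e.g., print letter, issue fine

    # hypothetical usage for account A:
    # delayed_trigger(lambda: overdrawn[A] and not fined[A],
    #                 fine_and_notify, delay_seconds=3600)

The delay trades promptness of compensation for a lower false-alarm rate; the conditional update inside the compensating interaction remains the backstop when the delay is not long enough.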

Alerters and triggers can be useful in any database system; even with atomic transactions, compensation is occasionally needed for incorrect (but apparently consistent) transactions executed as a result of external errors. In our system, however, nonserializable execution of interactions may require so much compensation as to make the system unusable. To alleviate this problem, we are designing some probabilistic techniques for bringing the cost of compensation down to an acceptable level. Examples of such techniques are: introducing delays (as described above) before triggering compensating interactions, and a controlled amount of intersite synchronization to forestall problems that may require compensation. For example, an interaction wishing to protect certain data from modification may send a message to all sites asking them not to update the data for a short period of time.

While the above technique resembles traditional database locking in many ways, it is different in that we permit more flexible ways of dealing with failures, delays, and negative responses. For example, the amount of money that can be withdrawn from an account at a site might be made an increasing function of how many sites are in communication with this site. While such defensive measures will have some negative impact on availability (e.g., by limiting withdrawal amounts when the network is partitioned), application designers can select from a wide range of protocols and parameters so as to achieve a balance between availability and integrity that is appropriate for the given application. For certain critical data, the application may even choose to implement a conservative concurrency control technique such as majority consensus, with correspondingly restricted availability for such data. Any such synchronization that the application uses is completely optional in that it does not affect the underlying mutual consistency mechanism; its only purpose is to improve performance by controlling the rate at which compensation is triggered.
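One way to realize the withdrawal-limit example is sketched below; the proportional formula is our own illustration, not a prescription of the paper.

    # Cap the amount a single site will dispense as an increasing
    # function of how many of the n sites it can currently reach.
    def withdrawal_limit(balance: float, reachable: int,
                         total_sites: int) -> float:
        # an isolated site risks at most 1/n of the balance; with full
        # connectivity the cap equals the believed balance
        return balance * reachable / total_sites

    # e.g., a $900 balance replicated at 3 sites: a site that can reach
    # only itself dispenses at most $300, bounding the worst-case
    # overdraft even if all three partitions dispense concurrently.
    assert withdrawal_limit(900.0, 1, 3) == 300.0

Any such cap is enforced by the interaction when it decides whether to dispense money; it needs no support from, and has no effect on, the underlying update-ordering mechanism.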

VI. SUMMARY AND CONCLUSION

We have presented an architecture for providing high availability of a database in the face of network partitions and other failures. Each site presents its users and application programs with the "best" data that it can, reflecting all updates that it has seen so far, and the users and programs issue updates based on these data without any intersite synchronization. Mutual consistency among site copies of the database is ensured by using timestamps to define a total order on all updates, and by exploiting semantic properties of the updates to efficiently process them when received out of timestamp order. Applications can compensate for loss of integrity using alerters and triggers and can use probabilistic synchronization among sites to control the frequency and cost of compensation.

A prototype of this system is currently operational on the DARPA Internet, using VAX and Sun machines running Berkeley UNIX, and with Ingres as the local DBMS for each site's local database copy. A distributor has been implemented that reliably transmits updates to all sites even when the source sites fail. The current checker only handles overwriting updates (which include adding and deleting data items as special cases). A full checker that establishes mutual consistency for arbitrary update types such as conditionals will be implemented by January 1986.

Our research into application design is at a somewhat earlier stage. We are studying several test applications (one of which will be used to demonstrate the full checker implementation) in order to gain insight into the kinds of alerters and triggers and probabilistic synchronization that are likely to be useful in practice. Our goal is to develop a high-level language that will allow application designers to easily specify the compensation and probabilistic control methods that they wish to use.

For the long term, research is needed into the class of applications for which our approach is appropriate. We do not expect that it will be right for all applications. Rather, we have presented an alternative that application designers can compare against currently popular designs which are based on distributed atomic transactions; different approaches will be better for different applications. Our architecture does not sacrifice all of the integrity advantages of atomic transactions in order to achieve high availability, but instead allows the application to choose the level of synchronization that it needs in order to achieve any desired balance between availability and integrity.

ACKNOWLEDGMENT

The authors would like to acknowledge the research contributions of U. Dayal, H. Garcia-Molina, N. Lynch, and D. Ries and the efforts of M. Chilenskas, R. McKee, and M. Verba in implementing the prototype distributor and checker.

REFERENCES

[1] B. Awerbuch and S. Even, "Efficient and reliable broadcast is achievable in an eventually connected network," in Proc. Symp. Principles Distrib. Comput., 1984, pp. 278-281.

[2] P. A. Bernstein and N. Goodman, "An algorithm for concurrency control and recovery in replicated distributed databases," ACM Trans. Database Syst., vol. 9, no. 4, pp. 596-615, Dec. 1984.

[3] A. D. Birrell, R. Levin, R. M. Needham, and M. D. Schroeder, "Grapevine: An exercise in distributed computing," Commun. ACM, vol. 25, no. 4, pp. 260-274, Apr. 1982.

[4] B. T. Blaustein and C. W. Kaufman, "Updating replicated data during communications failures," in Proc. Eleventh Int. Conf. Very Large Databases, Aug. 1985, pp. 49-58.

[5] O. P. Buneman and E. K. Clemons, "Efficiently monitoring relational databases," ACM Trans. Database Syst., vol. 4, no. 3, pp. 368-382, Sept. 1979.


[6] S. B. Davidson, "Optimism and consistency in partitioned distributed database systems," ACM Trans. Database Syst., vol. 9, no. 3, pp. 456-482, Sept. 1984.

[7] K. P. Eswaran, "Aspects of a trigger subsystem in an integrated database system," in Proc. Second Int. Conf. Software Eng., Oct. 1976, pp. 243-250.

[8] M. J. Fischer and A. Michael, "Sacrificing serializability to attain high availability of data in an unreliable network," in Proc. Symp. Principles Database Syst., 1982, pp. 70-75.

[9] D. K. Gifford, "Weighted voting for replicated data," in Proc. Seventh Symp. Oper. Syst. Principles, Nov. 1979, pp. 150-162.

[10] P. R. Johnson and R. H. Thomas, "The maintenance of duplicate databases," Bolt Beranek and Newman Inc., Arpanet Request for Comments (RFC) 677, Jan. 1975.

[11] L. Lamport, "Time, clocks, and the ordering of events in a distributed system," Commun. ACM, vol. 21, no. 7, pp. 558-565, July 1978.

Barbara T. Blaustein (M'85) received the B.S. degree in computer science from the University of Maryland, College Park, in 1975 and the Ph.D. degree in computer science from Harvard University, Cambridge, MA, in 1981.

She is now a Computer Scientist in the Research and Systems division of Computer Corporation of America. Her main research interests are distributed database systems, query processing and optimization, semantic integrity, and derived data.

Dr. Blaustein is a member of the Association for Computing Machinery.

Sunil K. Sarin (S'76-M'82) received the B.Tech. degree in electronics and electrical communication engineering from the Indian Institute of Technology, Kharagpur, India, in 1974 and the S.M. and Ph.D. degrees in computer science from the Massachusetts Institute of Technology, Cambridge, in 1977 and 1984, respectively.

He is now a Computer Scientist in the Research and Systems division of Computer Corporation of America. His research interests are in the management of shared distributed information: database systems, concurrency control and recovery, network protocols and distributed operating systems, support for cooperative work, and user interfaces.

Dr. Sarin is a member of the Association for Computing Machinery and Sigma Xi.


Charles W. Kaufman received the B.S. degree in mathematics from Bates College, Lewiston, ME, and the A.M. degree in mathematics from Dartmouth College, Hanover, NH.

From 1976 to 1983 he worked at DTSS Inc., Hanover, on operating systems and database management systems. He is now a Computer Scientist in the Research and Systems division of Computer Corporation of America. His research interests are distributed database systems, operating systems, and office automation.