Building Dependable Distributed Applications Using AQUA¹

Jennifer Ren, Michel Cukier, Paul Rubel, and William H. Sanders
Center for Reliable and High-Performance Computing
Coordinated Science Laboratory and Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
{ren, cukier, rubel, whs}@crhc.uiuc.edu

David E. Bakken and David A. Karr
BBN Technologies
Cambridge, Massachusetts 02138
{dbakken, dkarr}@bbn.com

Abstract

Building dependable distributed systems using ad hoc methods is a challenging task. Without proper support, an application programmer must face the daunting requirement of having to provide fault tolerance at the application level, in addition to dealing with the complexities of the distributed application itself. This approach requires a deep knowledge of fault tolerance on the part of the application designer, and has a high implementation cost. What is needed is a systematic approach to providing dependability to distributed applications. Proteus, part of the AQuA architecture, fills this need, and provides facilities to make a standard distributed CORBA application dependable, with minimal changes to an application. Furthermore, it permits applications to specify, either directly or via the Quality Objects (QuO) infrastructure, the level of dependability they expect of a remote object, and will attempt to configure the system to achieve the requested dependability level. Our previous papers have focused on the architecture and implementation of Proteus. This paper describes how to construct dependable applications using the AQuA architecture, by describing the interface that a programmer is presented with and the graphical monitoring facilities that it provides.

1. Introduction

Middleware support for building dependable distributed systems has the potential to ease the burden on application programmers, and increase the dependability of standard applications, by providing an easy way to make an application more dependable. In order to be useful, the middleware must be easy to add to an existing distributed application, must run on standard commercial off-the-shelf hardware, and must interfere as little as possible with applications at runtime. In particular, it should 1) provide a simple interface in which application objects can specify desires about the dependability of remote objects they use, 2) provide automatic and transparent detection of and recovery from failures, and 3) manage a pool of resources in a manner consistent with the desires of multiple objects that require dependable remote objects. While these goals are clearly desirable, building a software infrastructure that achieves them is not an easy task.

¹ This research has been supported by DARPA Contracts F30602-96-C-0315 and F30602-97-C-0276.

The AQuA architecture [Cuk98] is one approach to building dependable distributed objects that attempts to meet these goals. In particular, AQuA aims to allow distributed applications to request and obtain a desired level of dependability using Proteus [Sab99]. Proteus dynamically manages the replication of distributed objects in order to make them dependable. More specifically, Proteus takes requests regarding the dependability of remote objects used by an application object and decides how to provide fault tolerance. The choice of how to provide fault tolerance involves choosing the style of replication, the type of faults to tolerate, and the location of the replicas, among other things. Once a decision is made, the system is configured to try to achieve the dependability requested by one or more application objects. Reconfiguration of the system can occur if faults occur, or if the requested dependability of one or more application objects changes.

Several projects focus on building dependable distributed objects. The Eternal system [Nar97] adds fault tolerance to applications by object replication. However, Eternal does not support dynamic system configuration changes in response to changing application requirements. Electra [Maf95] provides fault tolerance to CORBA by building a specialized ORB. However, since Electra uses a non-standard ORB to provide group communication services, it is incompatible with other ORBs if the fault-tolerant features are used. The OpenDREAMS research project [Fel96] focuses on the design and implementation of an Object Group Service (OGS), which provides facilities for CORBA object group communication. This approach has the potential to provide group services to CORBA objects; however, it requires that the application developers be aware of and explicitly make use of the OGS. For a broader comparison with other projects, see [Cuk98].

In this paper, we describe how to build dependable distributed applications using the AQuA architecture. In particular, we explain the remote method calls one can use to make a quality of service (QoS) request regarding dependability, and the callbacks that occur when a dependability request can no longer be met. We also describe how an application object can obtain information on hosts and make suggestions about the set of hosts that may be used. We then explain how to request more detailed information from Proteus regarding actions it takes and decisions it makes. Finally, we describe the graphical interface Proteus provides on its manager, and on each node that hosts replicas. These programming and monitoring facilities provide an easy-to-use environment for building dependable distributed CORBA applications. To illustrate this, we present an example in which a simple CORBA application was made dependable through the use of Proteus and the AQuA architecture.

2. AQuA Overview

Before describing how a distributed application interacts with Proteus, we briefly review the AQuA architecture. Figure 1 shows the different components of the AQuA architecture in one particular configuration. These components can be assigned to hosts in many different ways, depending on the dependability level that objects desire of remote objects they use.

The AQuA system uses the Maestro/Ensemble group communication system [Hay98, Vay98] to provide reliable multicast to a dynamically changing group of processes, to ensure atomic delivery of multicasts to groups with changing membership, and to detect and exclude from the group any members that fail by crashing. The Ensemble protocol stack used in AQuA provides inter-process communication based on the virtual synchrony model [Bir96]. Maestro [Vay97] provides an object-oriented interface (in C++) to Ensemble.

Proteus, implemented on top of Maestro/Ensemble, is a flexible infrastructure for providing adaptive fault tolerance. Proteus makes remote objects dependable by using 1) a replicated dependability manager to make decisions regarding reconfigurations and to coordinate changes in system configurations, 2) object factories to kill and start objects and provide information to the dependability manager regarding a host, and 3) gateways that implement particular voting and replication schemes.

The Proteus dependability manager makes decisions regarding reconfiguration based on reported faults and dependability requests from QuO, and, together with the gateways, implements the chosen fault tolerance approach. Depending on the choices made by the dependability manager, Proteus can tolerate and recover from crash failures, time faults, and value faults in application objects and the QuO runtime. Note that we do not aim to tolerate Byzantine faults, value faults in the gateway, or faults in the group communication system itself. If tolerance of more complex fault types is required, one could substitute a more secure group communication protocol (e.g., [Kih98, Rei95]) for Ensemble within the AQuA architecture.

Object factories are used to kill and start replicated applications, depending on decisions made by the dependability manager, and to provide information regarding the host to the dependability manager.

CORBA provides application developers with a standard interface for building distributed object-oriented applications, but does not provide a simple approach that allows applications to be fault-tolerant. The gateway provides a standard CORBA interface by translating between process-level communication, as supported by Ensemble, and IIOP messages, which are understood by Object Request Brokers (ORBs) in CORBA. In this way, CORBA-based distributed applications written for the AQuA architecture can use standard, commercially available ORBs. In addition to providing basic reliable communication services for application objects and the QuO runtime, the gateway also provides fault tolerance using different voters and replication protocols. These services are located in the gateway handlers. Both active and passive replication of “AQuA objects” can be supported.

Figure 1. AQuA Architecture
[Figure not reproduced. Labels in the original figure: Gateway, Server, Client, QuO, Proteus Dependability Manager, Object Factory, Name Server, Gossip Name Server, and Ensemble Group Communication System.]

AQuA objects are the basic units of replication in the AQuA architecture. Each one consists of a gateway, an “application object,” and a QuO runtime, if QuO is being used to manage the desires of the application object. In this context, the application object can be part of the distributed application itself, or part of the AQuA architecture that uses the services of a gateway (such as the dependability manager and the object factories).

In order to provide a simple way for application objects to specify the level of dependability they desire, the AQuA architecture uses the Quality Objects (QuO) [Zin97, Loy98] framework to process and invoke dependability requests. QuO allows distributed applications to specify QoS requirements at the application level using the notion of a “contract,” which specifies to Proteus the actions to be taken based on the state of the distributed system and desired application requirements.

3. Programmer’s Interface to the Proteus Dependability Manager

This section describes how an application or QuO programmer interacts with the Proteus dependability manager 1) to request a particular level of dependability, 2) to be notified when that level is no longer met, 3) to obtain information concerning hosts managed by Proteus and give advice regarding the hosts on which the manager places replicas, and 4) to obtain detailed information regarding decisions that the dependability manager makes, and the faults that it detects. The Proteus dependability manager is itself a CORBA application and, just like any other AQuA object, uses a gateway to communicate reliably with other CORBA objects. The interface to the manager (Figure 2) can be divided into two sets of methods: those used to communicate with the components in the AQuA system core (composed of the dependability manager, the gateway handlers, and the object factories), and those used by one or more AQuA objects to request and observe QoS, to observe the state of the dependability manager, and to observe and control hosts.

Proteus supports the development of three types of objects that can make QoS requests from the dependability manager and also observe its actions. One of these object types, called the QoS observer/requester, can be used to make QoS requests to the dependability manager and can receive callbacks regarding the ability of the dependability manager to satisfy the requester’s requests. (An example of an application that may contain a QoS observer/requester is QuO itself.) Furthermore, since the dependability manager supports a standard, well-defined interface, an application object can also make QoS requests directly to the dependability manager.

The second type of object the dependability manager supports is called an advisor observer. Advisor observers can “subscribe” to a variety of information used by the dependability manager to make decisions, including information about faults detected and fine-grain information regarding actions taken by the manager.

Proteus also supports the development of a third type of object, called host observer/controllers, that receive information regarding the status of hosts that may be used to execute object replicas, and that can be used to make requests regarding which hosts can be used to execute replicas. In particular, as will be seen in the following, host observer/controllers can be used by an application or QuO to specify hosts that should not be used to execute replicas, if the application or QuO has information that leads it to believe that the host should not be used. The methods used to communicate between these three object types and the dependability manager will be described in Subsections 3.2, 3.3, and 3.4.

Figure 2. Interfaces to the Proteus Dependability Manager
[Figure not reproduced. The original figure shows the dependability manager connected to the gateway, the object factory, the advisor observer, the host observer/controller, and the QoS observer/requester, with each connection labeled by the method calls described in Subsections 3.1 through 3.4.]

3.1 Interface to the AQuA System Core

In order to illustrate how Proteus works, we now summarize the method calls that support communication between the dependability manager, gateways, and object factories. In particular, three dependability manager methods are called by gateways to report information to the dependability manager. The view change method is called to report changes in group membership. Given this view change information, the dependability manager can determine whether the view change is the result of a crash failure, a configuration change requested by Proteus, or both. Similarly, the value fault (or time fault) method is called if a value (or time) fault is reported by a gateway.

The interface between the dependability manager and the object factories consists of four methods called by the object factories on the dependability manager and four methods, implemented in each object factory, that are called by the dependability manager. The four method calls made by each factory on the dependability manager include the register host method, which is called to register the calling factory with the dependability manager; the start reply and kill reply methods, which are called to report, respectively, the status (success or failure) of a request made by the dependability manager to start an object and to kill an object; and the host information reply method, which is called to provide the dependability manager with information (e.g., the host load) concerning the host. The four methods called by the dependability manager on an object factory to configure the system include the start replica and kill replica methods, which are called, respectively, to start a new object and to kill an object; the replica crashed method, which is called when a crash failure is detected and the crashed object must then be removed from the running object table and killed; and the get host information method, which is called to request information concerning the host.
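The view change classification described above can be illustrated with a small sketch. The paper does not give the algorithm, so everything here is an assumption made for illustration: the class name `ViewChangeClassifier`, the idea that the manager remembers which starts and kills it has requested, and the rule that any departure it did not request counts as a crash failure.

```python
from dataclasses import dataclass, field


@dataclass
class ViewChangeClassifier:
    """Hypothetical helper, not part of Proteus: classifies a view change."""

    # Replica names the manager has asked object factories to start or kill.
    pending_starts: set = field(default_factory=set)
    pending_kills: set = field(default_factory=set)

    def classify(self, old_view, new_view):
        old_view, new_view = set(old_view), set(new_view)
        joined = new_view - old_view
        left = old_view - new_view
        # Joins and departures the manager asked for are configuration changes.
        requested = (joined & self.pending_starts) | (left & self.pending_kills)
        # Assumed rule: a departure nobody requested is a crash failure.
        crashed = left - self.pending_kills
        # These requests are now complete.
        self.pending_starts -= joined
        self.pending_kills -= left
        if crashed and requested:
            kind = "both"
        elif crashed:
            kind = "crash failure"
        elif requested:
            kind = "configuration change"
        else:
            kind = "unknown"
        return kind, crashed
```

For instance, if the manager has requested that replica `a` be killed and the view shrinks from `{a, b, c}` to `{c}`, this sketch reports "both" and identifies `b` as the crashed member.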

3.2 Interface with the QoS Observer/Requester

QoS observer/requester objects specify the level of dependability desired of a remote object, and receive information regarding the ability of an AQuA-based system to meet that level of dependability. These objects can be implemented in a QuO system condition object or as part of an object that makes use of the remote object. Five methods are used in the interface between the dependability manager and the QoS observer/requester. Three of these methods are implemented in the dependability manager, and receive information regarding the desired dependability of AQuA-managed objects. The other two methods are implemented in the QoS observer/requester, and receive information about the dependability manager’s ability to meet a request.

The three methods implemented in the dependability manager support the definition of, modification of, and removal of QoS requests. The first method, register QoS request, is called to register a new QoS request (one that does not replace or supersede any pre-existing request). The argument QoSRequest_info is passed with this call and contains the specifics of the QoS request. This information includes 1) a reference to the QoS observer/requester that should be notified when events concerning this QoS request occur, 2) the remote object whose dependability is to be managed, 3) the number of crash failures, value faults, and time faults of this object that should be tolerated, and 4) the time period during which the dependability level of the remote object can fall below the requested value before a callback must be made to the specified QoS observer/requester. Upon a successful request, the method returns the QoSRequest ID that should be used in the future whenever the caller wants to refer again to this request.

The update QoS request method substitutes a new QoS request for one that was previously registered. The parameters of that call are the QoSRequest ID (identifying the QoS request that will be updated) and the QoSRequest_info (containing the new information concerning the QoS request). Finally, the remove QoS request method eliminates a previously registered QoS request without replacing it with a new request. This method is used only when a requesting object no longer needs a remote object. When the dependability manager receives a remove QoS request, it adjusts the number of replicas of the referenced object to satisfy any other requests that have been made for this object, killing replicas if appropriate.

The dependability manager calls two methods on QoS observer/requesters: one to indicate that a QoS request has become unsatisfied, and another to indicate that a QoS request that had become unsatisfied has once again become satisfied. Both of these method calls have the QoSRequest ID, the current number of replicas, and the current number of “active hosts” as parameters (the term “active hosts” will be defined in Subsection 3.4). More specifically, the QoS request not satisfied method is called:

(1) When a new QoS request asks that more faults be tolerated than is possible given the current set of active hosts.

(2) When a QoS request that was previously satisfied by the number of hosts that were then active can no longer be satisfied by the current number of active hosts.

(3) When a QoS request that was satisfied is no longer satisfied, due to a change in the number of replicas of the managed object. (Note that the callback occurs after the dependability manager has attempted to satisfy the QoS request for a period of time defined by the recovery time specified in the QoSRequest_info.)

Similarly, the QoS request satisfied method is called when a QoS request that previously could not be satisfied (indicated by the QoS request not satisfied call) can once again be satisfied.
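To make the five-method interface concrete, the following is a minimal in-process sketch in Python; the real interface is a set of remote CORBA calls, which this does not model. The class and method names are transliterations of the names in the text, not the actual signatures. Two simplifications are assumptions: a request to tolerate k crash failures is treated as needing k + 1 replicas on distinct active hosts (value and time faults are ignored), and the recovery time is ignored, so callbacks fire immediately.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class QoSRequestInfo:
    """Fields follow the four items listed for QoSRequest_info in the text."""
    observer: object              # QoS observer/requester to call back
    remote_object: str            # object whose dependability is managed
    crash_failures: int = 0       # faults of each type to tolerate
    value_faults: int = 0
    time_faults: int = 0
    recovery_time_s: float = 0.0  # grace period (ignored in this sketch)


class RecordingObserver:
    """Toy QoS observer/requester that records the two callbacks."""

    def __init__(self):
        self.events = []

    def qos_request_not_satisfied(self, request_id, replicas, active_hosts):
        self.events.append(("not satisfied", request_id, replicas, active_hosts))

    def qos_request_satisfied(self, request_id, replicas, active_hosts):
        self.events.append(("satisfied", request_id, replicas, active_hosts))


class ToyDependabilityManager:
    """Hypothetical stand-in for the manager side of the interface."""

    def __init__(self, active_hosts):
        self.active_hosts = active_hosts
        self.requests = {}
        self.unsatisfied = set()
        self._ids = count(1)

    def _check(self, request_id):
        info = self.requests[request_id]
        # Assumed rule: tolerating k crash failures needs k + 1 replicas,
        # each on a different active host.
        needed = info.crash_failures + 1
        replicas = min(needed, self.active_hosts)
        args = (request_id, replicas, self.active_hosts)
        if replicas < needed and request_id not in self.unsatisfied:
            self.unsatisfied.add(request_id)
            info.observer.qos_request_not_satisfied(*args)
        elif replicas == needed and request_id in self.unsatisfied:
            self.unsatisfied.discard(request_id)
            info.observer.qos_request_satisfied(*args)

    def register_qos_request(self, info):
        request_id = next(self._ids)
        self.requests[request_id] = info
        self._check(request_id)
        return request_id

    def update_qos_request(self, request_id, info):
        self.requests[request_id] = info
        self._check(request_id)

    def remove_qos_request(self, request_id):
        del self.requests[request_id]
        self.unsatisfied.discard(request_id)

    def set_active_hosts(self, n):
        # Re-evaluate every request when the active host set changes.
        self.active_hosts = n
        for request_id in list(self.requests):
            self._check(request_id)
```

In this sketch, registering a request to tolerate two crash failures with only two active hosts triggers the not satisfied callback (case 1 above), and raising the number of active hosts to three triggers the satisfied callback.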

3.3 Interface with the Advisor Observer

An advisor observer object can be used by QuO or an application object that wishes to receive more detailed information concerning fault notifications and decisions that the dependability manager advisor makes. An object may want this information, for example, to make higher-level decisions on how to adapt in a particular situation. To receive this information, each advisor observer implements several methods that may be called by the dependability manager. Which types of events and actions an advisor observer is notified of depends on the type of information the advisor observer requested when it registered with the dependability manager.

The dependability manager supports multiple advisor observers, which can dynamically register and de-register with the dependability manager at run-time. The register Advisor Observer method, called on the manager by advisor observers, registers an advisor observer with the dependability manager. When it registers, an advisor observer passes a reference to the advisor observer specifying the methods that should be called when events concerning this advisor observer occur, and the types of notifications it desires from the dependability manager. Upon a successful return, the register call returns an observer ID that can be used to de-register the advisor observer, using the remove Advisor Observer method. The types of information for which an advisor observer can register are given in the next paragraph.

Each advisor observer implements several methods, as shown in Figure 3, to receive information from the dependability manager and to specify the action that is to be taken upon receiving that information. In particular, the fault occurred method is called on each advisor observer when the dependability manager detects a fault. This method provides, as arguments to the call, the type of fault detected, the host where the fault was detected, and the dependable object associated with the fault. All advisor observers receive this information, regardless of what other information they requested when they registered with the dependability manager. The seven other methods that must be implemented by an advisor observer are called on those advisor observers that have requested information related to a particular call. Specifically, the notify number of replicas method provides the name of the dependable object to which the call refers and the number of replicas of this object in the current system configuration. The remaining calls in Figure 3 provide all the same information as notify number of replicas, plus the name and status information (e.g., load) for the host to which the call refers.

| Method | Function |
|---|---|
| Fault occurred | Called when the dependability manager detects a crash failure, value fault, or time fault. |
| Notify number of replicas | Called when a new QoS request is registered and whenever the number of replicas in the replication group changes. |
| Replica start attempted | Called when the dependability manager attempts to start a replica. |
| Replica kill attempted | Called when the dependability manager attempts to kill a replica. |
| Replica start failed | Called when a new replica either could not be started by the object factory or could be started but could not join the replication group. |
| Replica kill failed | Called when the replica could not be killed by the object factory and the kill failure was reported to the dependability manager. |
| Replica start successful | Called when a replica was started successfully by the object factory, joined the replication group, and was reported to the dependability manager. |
| Replica kill successful | Called when a replica was killed successfully by the object factory, was removed from the replication group, and was reported to the dependability manager. |

Figure 3. Advisor Observer Callbacks
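The subscription behavior described in this subsection can be sketched as follows. This is an illustrative in-process model, not the Proteus CORBA interface: `AdvisorNotifier`, `LoggingAdvisorObserver`, and `publish` are hypothetical names, notification types are represented as method-name strings, and only three of the eight callbacks from Figure 3 are written out for brevity.

```python
from itertools import count


class LoggingAdvisorObserver:
    """Toy advisor observer; the remaining Figure 3 callbacks are omitted."""

    def __init__(self):
        self.log = []

    def fault_occurred(self, fault_type, host, dependable_object):
        self.log.append(("fault_occurred", fault_type, host, dependable_object))

    def notify_number_of_replicas(self, dependable_object, replicas):
        self.log.append(("notify_number_of_replicas", dependable_object, replicas))

    def replica_start_attempted(self, dependable_object, replicas, host, status):
        self.log.append(
            ("replica_start_attempted", dependable_object, replicas, host, status)
        )


class AdvisorNotifier:
    """Hypothetical stand-in for the manager side of the interface."""

    def __init__(self):
        self._observers = {}
        self._ids = count(1)

    def register_advisor_observer(self, observer, notifications):
        observer_id = next(self._ids)
        self._observers[observer_id] = (observer, frozenset(notifications))
        return observer_id

    def remove_advisor_observer(self, observer_id):
        del self._observers[observer_id]

    def publish(self, method_name, *args):
        for observer, wanted in list(self._observers.values()):
            # fault_occurred goes to every observer; the other callbacks go
            # only to observers that asked for them at registration time.
            if method_name == "fault_occurred" or method_name in wanted:
                getattr(observer, method_name)(*args)
```

An observer registered with an empty notification set still receives fault occurred calls, which mirrors the statement that all advisor observers receive fault information regardless of what else they requested.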

3.4 Interface to the Host Observer/Controller

The final type of interface that an application or QuO can have to the dependability manager is a host observer/controller. Host observer/controllers can receive status information concerning hosts that are being used to execute dependable objects, and give instructions regarding hosts that should or should not have replicas placed on them by the dependability manager. A host observer/controller gives these instructions by suggesting changes in the status of hosts. Depending on a host’s status, it is placed in a certain set by the dependability manager.

The dependability manager has three sets of hosts: an active host set, an inactive host set, and a removed host set. When an object factory registers a host with the dependability manager, the host is placed into the active host set. When a host observer/controller requests that a host be deactivated, the host is placed into the inactive host set. When the dependability manager detects the failure of an object factory or a host, the host is placed into the removed host set. Replicas running on removed hosts are assumed to have failed. If a host in the inactive host set is reactivated by a host observer/controller, the host is moved back to the active host set. If a failed object factory or a failed host is restarted, the host is moved from the removed host set to the active host set. The newly started object factory communicates with the dependability manager to initialize its state. The dependability manager will not create replicas on a host that is in the inactive host set or in the removed host set. It will also migrate the replicas on a host in the inactive host set to hosts in the active host set.

Host observer/controllers register with the dependability manager to obtain information and control the dependability manager’s operation. As with advisor observers, the dependability manager supports multiple host observer/controllers. Four methods are called by the host observer/controller on the dependability manager to register and remove host observer/controllers and to activate and deactivate hosts. The method register Observer/Controller registers a host observer/controller with the dependability manager. The call passes 1) a reference to the host observer/controller that should be notified when events concerning this host observer/controller occur, and 2) a specification of the information that the host observer/controller would like from the dependability manager. The return value on success is the observer/controller ID that can be used to de-register the observer/controller, using the remove Observer/Controller method. The method deactivate host is used to request that the dependability manager deactivate the host specified by the call. When this call is made, the dependability manager will move the host from the active host set to the inactive host set, and will also migrate the replicas from this inactive host to hosts in the active host set.
The method activate host is used to request that the dependability manager move an inactive host from the inactive host set back to the active host set.

Each host observer/controller must implement four methods to receive information from the dependability manager. Whether these methods are called depends on the information that the host observer/controller requested when it registered with the dependability manager. The method host activated may be called 1) when a host that is either in the removed host set or in no host set registers with the dependability manager, or 2) when a host that is in the inactive host set is reactivated. The method host removed may be called when a host or an object factory failure is detected by the dependability manager, and the host is moved from the inactive or the active host set to the removed host set. This method will be implemented in the future, when the dependability manager is able to detect host and object factory failures. The method host deactivated may be called when a host is deactivated by a host observer/controller. The method host information is called by the dependability manager for all hosts in the active set if the host information method is enabled.
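The host-set transitions and callbacks described in this section can be summarized in a small model. The following Python sketch is illustrative only and is not the Proteus interface: the class names, method signatures, and the string event names are all assumptions made for the example, replica migration is omitted, and the host removed callback is included even though the text notes that it is not yet implemented.

```python
from enum import Enum
from itertools import count


class HostSet(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    REMOVED = "removed"


class RecordingObserver:
    """Hypothetical host observer/controller that records each callback."""

    def __init__(self):
        self.events = []

    def host_activated(self, host):
        self.events.append(("activated", host))

    def host_deactivated(self, host):
        self.events.append(("deactivated", host))

    def host_removed(self, host):
        self.events.append(("removed", host))


class DependabilityManagerSketch:
    """Toy model of the three host sets; all names here are assumptions."""

    def __init__(self):
        self.host_set = {}    # host name -> HostSet (absent: in no set)
        self.observers = {}   # observer ID -> (observer, requested events)
        self._next_id = count(1)

    def register_observer_controller(self, observer, wanted):
        # 'wanted' stands in for the specification of requested information.
        observer_id = next(self._next_id)
        self.observers[observer_id] = (observer, frozenset(wanted))
        return observer_id

    def remove_observer_controller(self, observer_id):
        self.observers.pop(observer_id, None)

    def _move(self, host, allowed_from, target, event):
        # Apply the transition only if the host is in an allowed source set.
        if self.host_set.get(host) not in allowed_from:
            return False
        self.host_set[host] = target
        for observer, wanted in list(self.observers.values()):
            if event in wanted:
                getattr(observer, "host_" + event)(host)
        return True

    def factory_registered(self, host):
        # A new host, or a restarted (removed) host, joins the active set.
        return self._move(host, (None, HostSet.REMOVED),
                          HostSet.ACTIVE, "activated")

    def deactivate_host(self, host):
        return self._move(host, (HostSet.ACTIVE,),
                          HostSet.INACTIVE, "deactivated")

    def activate_host(self, host):
        return self._move(host, (HostSet.INACTIVE,),
                          HostSet.ACTIVE, "activated")

    def failure_detected(self, host):
        return self._move(host, (HostSet.ACTIVE, HostSet.INACTIVE),
                          HostSet.REMOVED, "removed")

    def may_place_replica(self, host):
        # Replicas are created only on hosts in the active host set.
        return self.host_set.get(host) is HostSet.ACTIVE
```

In this model, an observer that registers for only some events receives only those callbacks, which mirrors the statement that whether a method is called depends on the information requested at registration.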

4. Proteus Graphical User Interface and Example Application

In addition to the programmer’s interface described in the previous section, Proteus also provides several graphical interfaces to allow monitoring of the status of applications built using the AQuA architecture, and provides examples of QoS observer/requesters, advisor observers, and host observer/controllers. In this section, we illustrate the use of these graphical interfaces via a simple application.

Figure 4. Castle Demonstration

The application, which we call the “castle demo” (see Figure 4), consists of a single, undependable client (called the “game board” in the following) and multiple dependable remote objects that interact with the game board (called the “black guard,” “purple guard,” and “red guard”). The application is based on the children’s game “capture the flag,” in which the game board represents the desired object (a castle depicted in the center of the screen, whose perimeter is defined by the square surrounding it, rather than a flag) and “raiders” (shown as dots outside the perimeter of the castle) move toward the castle in an attempt to reach it. The castle guards (shown as dots on the perimeter of the castle) are implemented as dependable remote AQuA objects, and move to attempt to intercept raiders as they move toward the castle. To enable them to do this, the game board makes remote CORBA calls to each guard in succession, informing it of the current position of the raiders. The guard replies, giving its new position, and attempts to move toward the closest raider. If a guard intercepts a raider, the raider disappears, scoring a point for the guards. If a raider is not intercepted, it eventually reaches the castle, scoring a point for the raiders.

Dependable remote guards are created by the game board using the register QoS request call described in the last section. When the dependability manager receives these requests, it creates an appropriate number of replicas of the specified guard, depending on the level of fault tolerance requested. Each guard replica has a small graphical interface with a status bar, as shown in Figure 4. The actions taken by the dependability manager to create the replicas can be seen on the dependability manager graphical interface, shown in Figure 5. This interface displays a textual record of the actions that the dependability manager takes, and provides facilities for monitoring the status of hosts and replicas.
In particular, the set of hosts on which object factories are running, the groups of replicas running (called “replication groups”), and the specification of which dependable objects are interacting with one another (called “connection groups”) can be monitored. To do so, a user opens a window corresponding to the host, replication group, or connection group in which he or she is interested.
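The exchange between the game board and the guards in the castle demo can be illustrated with a short sketch. The Python code below is not taken from the demo: plain function calls stand in for the remote CORBA calls, the names guard_step and play_round, the speed parameter, and the rule that a raider is intercepted when a guard reaches its exact position are assumptions, and the constraint that guards stay on the castle perimeter is ignored.

```python
import math


def guard_step(guard, raiders, speed=1.0):
    """Return the guard's new position after moving toward the closest raider."""
    if not raiders:
        return guard
    target = min(raiders, key=lambda r: math.dist(guard, r))
    dx, dy = target[0] - guard[0], target[1] - guard[1]
    distance = math.hypot(dx, dy)
    if distance <= speed:
        # The raider is within reach: the guard lands on it.
        return target
    return (guard[0] + speed * dx / distance,
            guard[1] + speed * dy / distance)


def play_round(guards, raiders, speed=1.0):
    """One round: the game board calls each guard in succession.

    'guards' maps a guard name to its (x, y) position and is updated in place.
    Returns the raiders still on the board and the points scored by guards.
    """
    remaining = list(raiders)
    points = 0
    for name in list(guards):
        # Stand-in for the remote call: send raider positions, get new position.
        guards[name] = guard_step(guards[name], remaining, speed)
        if guards[name] in remaining:
            remaining.remove(guards[name])  # intercepted raider disappears
            points += 1
    return remaining, points
```

In the actual application each guard is a replicated AQuA object, so the call made in the loop would be delivered to every replica in the guard's replication group rather than to a single local function.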

Figure 5. Dependability Manager Graphical Interface

Examples of these windows are shown on the right side of Figure 5. The top window on the right indicates, for the corresponding host, the list of objects running on that host and their states (normal, start pending, kill pending, join pending, or remove pending). The middle window on the right shows the list of objects included in the corresponding replication group, and their states. The lowest window on the right indicates, for each connection group, the replication groups included in the connection group and their types (sender or receiver). These windows are updated whenever a change in requested QoS occurs or the dependability manager is notified that a fault has occurred. In the castle demonstration, a user makes the corresponding QoS request calls by pressing the appropriate button on the QoS subwindow of the game board. We provide an example advisor observer and host observer/controller as part of the demonstration (see Figure 6). This application object implements the methods required by advisor observers and host observer/controllers, and makes register, remove, activate, and deactivate calls to the dependability manager, as described in the last section. As can be seen in the figure, push buttons are used to specify which information the observer/controller will request in register calls. Finally, the user can activate and deactivate hosts by giving their names to the interface.

In order to demonstrate and test our current implementation of Proteus, we have built several applications, including the castle demonstration just described, and stressed them with a variety of input and fault scenarios. In particular, for the castle demonstration, we have crashed different numbers of replicas, from the same replication group or from different replication groups, by killing the appropriate AQuA object(s). In each case, the replicas were restarted successfully after going through all of the appropriate states (e.g., start pending, join pending, normal).
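A check of the kind described above, that a restarted replica passes through the expected states, could be expressed as follows. This Python sketch is an assumption-laden illustration rather than part of Proteus: the five state names come from the text, but the ReplicaState type, the RESTART_PATH constant, and the restarted_cleanly helper are hypothetical, and the text gives start pending, join pending, normal only as an example sequence.

```python
from enum import Enum
from itertools import groupby


class ReplicaState(Enum):
    NORMAL = "normal"
    START_PENDING = "start pending"
    KILL_PENDING = "kill pending"
    JOIN_PENDING = "join pending"
    REMOVE_PENDING = "remove pending"


# Example restart sequence quoted in the text.
RESTART_PATH = [
    ReplicaState.START_PENDING,
    ReplicaState.JOIN_PENDING,
    ReplicaState.NORMAL,
]


def restarted_cleanly(observed):
    """True if the observed states follow RESTART_PATH exactly.

    Consecutive repeats are collapsed first, since a monitoring window may
    report the same state several times in a row.
    """
    collapsed = [state for state, _ in groupby(observed)]
    return collapsed == RESTART_PATH
```

Such a helper would be applied to the sequence of states shown for a replica in the host or replication group window after the corresponding object is killed.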

Figure 6. Example Advisor Observer and Host Observer/Controller Interface

To simulate an application prescribing that a host should no longer be used to host replicas, we used a host observer/controller to indicate that a host should no longer be used. When we did this, we observed that the replicas on that host were migrated to the least loaded available host (i.e., a host on which there is no replica of the same replication group). We then used the same interface to return the host to the set of available hosts. After doing this, we noticed that when further replicas were crashed, that returned host was used again if it was the least loaded available host.

We have also tested the ability of Proteus to adapt to changing dependability requirements, by using a QoS observer/requester interface to change the level of dependability requested. In each case, the number of replicas in the appropriate replication group changed as expected.

Finally, we have observed the (correct) action of Proteus when there were not enough hosts to achieve the requested level of dependability. In this case, a callback QoS request not satisfied was sent to the QoS observer/requester, indicating that the requested dependability could no longer be achieved.
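The host selection behavior observed in these experiments can be sketched as a small placement routine. The Python code below is a hedged illustration, not the Proteus algorithm: it assumes that load is measured as the number of replicas on a host, that ties are broken by host name, and that the function names choose_host and migrate_off and the data layout are free choices made for the example.

```python
def choose_host(active_hosts, replicas_on, group):
    """Pick the least loaded active host with no replica of 'group'.

    'replicas_on' maps a host name to the list of replication groups that
    have a replica on it. Returns None when no host qualifies, the situation
    in which the requested dependability cannot be achieved.
    """
    candidates = [h for h in active_hosts
                  if group not in replicas_on.get(h, [])]
    if not candidates:
        return None
    # Assumption: load is the replica count; ties are broken by host name.
    return min(candidates, key=lambda h: (len(replicas_on.get(h, [])), h))


def migrate_off(host, active_hosts, replicas_on):
    """Move every replica off 'host'; return the groups that found no host."""
    others = [h for h in active_hosts if h != host]
    unplaced = []
    for group in replicas_on.pop(host, []):
        target = choose_host(others, replicas_on, group)
        if target is None:
            unplaced.append(group)
        else:
            replicas_on.setdefault(target, []).append(group)
    return unplaced
```

A non-empty result from migrate_off corresponds to the case reported above in which there are not enough hosts, where Proteus sends the QoS request not satisfied callback to the QoS observer/requester.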

5. Conclusions

This paper has described the programmer and user interfaces of AQuA. By providing an easy-to-use interface to middleware that provides replication, fault detection, and fault recovery services, we have significantly simplified the building of dependable distributed applications using commercial off-the-shelf hardware. Furthermore, in addition to providing an application programmer with a simple interface, the AQuA architecture provides a simple and standard mechanism for QuO or applications to obtain information on the configuration of the system, faults that occur, and actions that the dependability manager takes. This information can be used by QuO or an application object to advise the dependability manager on hosts that should or should not be used to execute replicas.

Acknowledgments

We would like to thank several other members of the AQuA and QuO teams, namely Mark Berman, Joe Loyall, Partha Pal, Rick Schantz, and John Zinky, for support and discussions.
