
Software for Reliable Networks

Techniques that enable distributed computing systems to reorganize themselves can restore operation when one part crashes

by Kenneth P. Birman and Robbert van Renesse

Surfing the Internet is no longer just a seductive pastime. In rising numbers, organizations of all kinds, from computer companies to publishing firms, are turning to on-line services that operate much like the World Wide Web. These services can help manage important information, speed decision making and improve efficiency. But as more and more enterprises become dependent on this new technology, many are also exposed to the downside of computer networking. The drawbacks are particularly evident to users of distributed computing systems, which link programs, servers and data files dispersed across an extended network of computers and terminals.

As every computer user knows, programs that operate across networks are prone to failure. Indeed, Leslie B. Lamport, a pioneer in distributed computing at Digital Equipment Corporation, defined a distributed computing system as “one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” The Web is certainly not exempt from breakdowns [see box “The Tangled Web”]. During late 1995, users of the Web reported several “brownouts,” when communication on the Internet was largely impossible. Such lapses have been variously attributed to software errors, excessive traffic on transmission lines, and overload or complete failure of the Web servers, which are computers that store the documents users access from their workstations.

PHYSICIAN: From a private office in another building, a physician can monitor patients in the hospital. The physician can access vital signs, such as breathing and heart rate, laboratory results and current medical records.

REMOTE CONSULTATION: Videoconferencing server allows doctors to consult other experts for additional information about cases.

PHARMACY: The hospital pharmacist also adds information to patients’ records, noting when requested medications were dispensed. Accurate data about all medications a patient receives help to prevent dangerous drug interactions.


Most likely, a combination of factors contributes to brownouts. Unfortunately, similar events will multiply as computer networks—not just the World Wide Web but also distributed computing systems serving banks, schools and many offices—continue to expand.

When computers crash, sometimes the only casualties are the user’s time and temper. If the automated bank teller nearest you is not working, the one across the street from it may be. But the shutdown of very complicated networks can have dire consequences. On July 15, 1994, the NASDAQ stock exchange, an exclusively electronic stock market, opened two hours late because of a mysterious problem that compromised the entire system. Initially, workers thought a software bug triggered the shutdown, but the error was ultimately traced to a malfunctioning disk. Because trading was delayed only for a few hours, little revenue was lost. Yet the event could have been a catastrophe: the market would have faced enormous losses had trading not resumed when it did.

In another example, from January 1990, the AT&T telephone system experienced a large-scale outage when an electronic switch within the system failed. Calls were automatically shifted to a backup switch, but this switch also crashed because of a software bug. The failure rippled through the network, resulting in a nationwide shutdown that lasted for nine hours, during which 60,000 people lost all telephone service and 70 million telephone calls could not be completed. To anyone familiar with the challenges of managing even a simple network, the surprise is not that these mishaps occur but rather that they are not more frequent.

Building Reliable Electronic Bridges

Even a brief distributed systems failure can pose a significant problem for applications that require around-the-clock operation. Air-traffic-control and financial computer networks must be exceedingly reliable and constantly updated. A message that a “host is not responding” or a misleading display that shows out-of-date information about an airplane’s flight path or a stock’s price could easily provoke an accident or financial misadventure. As the way people live and work continues to be transformed, the security and stability of their finances, property and even health will increasingly depend on distributed computer systems.

FUTURE HOSPITAL COMPUTER SYSTEM connects patients with medical personnel throughout the building or anywhere in the world. Because the crash of one computer can endanger a patient’s well-being, such systems may employ active replication, described in the article, to cope with failures.

LABORATORY: To ensure that physicians and nurses do not inadvertently make decisions based on out-of-date records, the server storing laboratory results is replicated on several servers across the system. If the original server becomes unreachable for any reason, requests for records are rerouted to other sites.

BEDSIDE MONITORING: The medical-records server stores information on vital signs, such as heart rate and blood pressure, as well as when a patient received medication. This information must always be available to doctors and nurses; the system ensures its accessibility by replicating the data and the programs that manage them.

NURSES’ STATION: Nurses frequently update patients’ records, feeding new information into the system either at terminal stations throughout the hospital or with the aid of handheld electronic notepads.


Thus, although it is easy to talk about the potential benefits of the information superhighway, we believe the bridges that link computers must be inspected more closely. Various computer scientists, including the two of us, have been working since the late 1970s on developing software to improve distributed computer networks, making them more secure and resistant to failure—an activity that people in the field refer to as designing robust distributed computing systems.

Why do distributed systems crash? If we exclude systems that fail because they were mismanaged or poorly designed, the most common scenario involves an isolated problem at one site that triggers a chain of events in which program after program throughout the network eventually shuts down. One response to this threat might be to strengthen individual components—incorporating computers and disks specially designed to tolerate faults, for example. But ceilings can still leak, causing short circuits; power can fluctuate; and communications connections can be inadvertently cut. Acts of sabotage by hackers or disgruntled employees can also endanger distributed systems. Although engineers and computer programmers can improve the durability of hardware and software, no computer can ever be made completely reliable.

Even if every component of a system were extremely dependable, the story would not end there. Merely interconnecting reliable computers and bug-free programs does not yield a robust distributed system. Instead it produces a network that works well under most conditions. Electronic-mail programs, bulletin boards and the Web were designed using components that, considered individually, are very trustworthy. Yet these systems frequently freeze when anything unexpected happens to an individual component of the system; for instance, the system may crash when one machine or a communications line becomes overloaded. Some additional form of protection is therefore needed.

During the past two decades, programmers have attacked the dependability problem by developing fault-tolerant software—programs that allow computer systems to restore normal operation even when problems occur. The technique eliminates the chains of internal dependencies that link the operation of a system as a whole to the operation of any single component. The resulting systems do not need to shut down even if some sites go off-line. Instead they resume service by rapidly reconfiguring to work around crashed servers.

Saved by the Backup

Computer scientists refer to these arrangements as highly available distributed systems. Because these systems are designed to replicate critical information continuously and to distribute multiple backup copies among their individual computers, they can adapt to changing conditions—a malfunctioning disk drive at one site, an overload at another, a broken communications connection and so forth. As long as failures do not occur so often that the software lacks time to react, these systems can respond by pulling up from elsewhere a duplicate copy of a needed file or a replica of an on-line program. In this way, a system as a whole remains available and, ideally, provides uninterrupted service to the users still connected.

A simple and popular method of building a highly available distributed system involves a primary and a backup system. If the primary machine fails, the backup can be called into service. Switching between the two is easy if the data never change. The conversion becomes difficult, however, if data or files change while the system is running. And in an extensive network of servers, data, files and programs, it can be difficult to distinguish between a system that has genuinely crashed and one that is merely experiencing communications difficulties.

Suppose that a computer is trying to update information on both the primary and backup servers, but one of them stops responding to messages. If the problem is merely in the communications lines, the messages will get through, given enough time. But if the server has actually failed, the computer doing the updating would wait indefinitely; in the meantime, the system would be unavailable. If the computer trying to carry out the update inappropriately stops waiting and sends the update to only one server, however, the primary and the backup will no longer be identical. Errors will arise if the system attempts to use the outdated server.
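To make the dilemma concrete, here is a minimal sketch in Python. It is a hypothetical illustration, not code from any real replication package: the Server class and its send_update method are invented for this example, and a simple timeout stands in for the sender's inability to tell a crashed machine from a cut line.

    import time

    class Server:
        """Toy server: it may be crashed or cut off, and the sender cannot tell which."""
        def __init__(self, name):
            self.name = name
            self.data = {}
            self.reachable = True

        def send_update(self, key, value, timeout=1.0):
            if not self.reachable:
                time.sleep(timeout)               # nothing comes back within the timeout
                raise TimeoutError(self.name + " did not acknowledge the update")
            self.data[key] = value                # update applied and acknowledged

    def update_both(primary, backup, key, value):
        primary.send_update(key, value)
        try:
            backup.send_update(key, value)
        except TimeoutError:
            # The sender must now choose: keep waiting (the system stays unavailable)
            # or give up (the primary and the backup silently diverge).
            print("backup unreachable; replicas may now disagree")

    primary, backup = Server("primary"), Server("backup")
    backup.reachable = False                      # a crash or a cut line: indistinguishable here
    update_both(primary, backup, "last_trade", 101.25)
    print(primary.data, backup.data)              # {'last_trade': 101.25} {} ... an outdated backup

Whichever branch the sender takes, something is lost: waiting preserves consistency at the cost of availability, while proceeding keeps the system available but leaves an outdated backup behind.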

The NASDAQ financial market illustrates one way to resolve this conundrum. The network has two central trading servers. To prevent confusion, only one is active at any given time. The NASDAQ operators themselves decide when to switch to the replacement server. Unfortunately, very few distributed systems can rely on the wisdom of a human operator to detect failures and then to switch the entire network from one server to another. Rather programmers must automate this decision so that the transition can occur seamlessly.


WEATHER MONITORING NETWORK alerts Norwegian fishermen to dangerous storms (left) or hazardous oil spills (right). The computer system StormCast links remote video cameras, weather stations and satellites to provide reliably updated reports. StormCast can be accessed on the World Wide Web at http://www.cs.uit.no/


Moreover, highly available distributed systems often have large numbers of servers and programs. Consequently, these systems typically maintain a membership list, which keeps track of every program, noting whether it is working or not. If a program is unresponsive for any reason, it is marked as faulty. By recognizing a failure at one site, the system can then reconfigure itself and redirect work to operational sites.
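A toy version of such a membership list might look like the following Python sketch; the heartbeat mechanism, the timeout value and the routing policy are placeholder assumptions rather than details of any particular system.

    import time

    class MembershipList:
        """Tracks every program in the system and marks silent ones as faulty."""
        def __init__(self, timeout=5.0):
            self.timeout = timeout
            self.last_heard = {}              # program name -> time of last heartbeat

        def heartbeat(self, program):
            self.last_heard[program] = time.time()

        def operational(self):
            now = time.time()
            return [p for p, t in self.last_heard.items() if now - t <= self.timeout]

        def faulty(self):
            now = time.time()
            return [p for p, t in self.last_heard.items() if now - t > self.timeout]

    def route(request_id, members):
        """Redirect work to any operational site; faulty sites are simply skipped."""
        sites = members.operational()
        if not sites:
            raise RuntimeError("no operational sites remain")
        return sites[hash(request_id) % len(sites)]

    members = MembershipList()
    for name in ("records-server-1", "records-server-2"):
        members.heartbeat(name)
    print(route("patient-42", members))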

The NASDAQ system also demonstrates a second concern about reliability in distributed systems. The two-hour trading delay in 1994 could have been avoided if the operators had switched immediately to the backup. They opted to wait, however, because of concerns that a software bug might have caused the primary system to malfunction. If such a bug were present, the backup might also crash, just as the AT&T backup system did. Because it is impossible to guarantee that software is completely free of bugs, some form of protection is needed to reduce the risk that backup versions of a critical server will crash following the failure of a primary server.

Programmers have responded to this challenge with an approach known as active replication. In active replication, a system’s software establishes redundant copies of vital programs or servers through the use of so-called process groups. A process group links a set of programs that cooperate closely. A distributed system may contain many process groups, and programs can belong to several of these groups. Each group is known by a name much like a file name and has its own list of current members. Most important, the process group provides a means for sending messages to its members. This message-passing function ensures that each member of the group receives every message in the same order, even if the sender crashes while transmitting the message.

If a particular program is necessary for maintaining availability, the system introduces a group of programs, each of which replicates the original. To update the data managed by the replicated program, the system sends a message to the process group. Each member reacts by updating its particular replica. Because all the programs see the same updates in the same order, they will remain in mutually consistent states.

Active replication enables a system to tolerate faults because any group member can handle any request: if one machine crashes, work can be redirected to an operational site. Furthermore, if a request does not alter data, one site can process the query rather than tie up the entire system. In this way, multiple tasks can be worked on at once by different programs, speeding up the application by employing parallel processing.
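The sketch below illustrates the idea in Python. It is not the Isis or Horus interface: a single sequence counter stands in for the distributed ordering protocol such packages implement. But it shows why delivering every update to every member in the same order leaves the replicas in identical states, while read-only queries can be answered by any one member.

    import queue
    import threading

    class Replica:
        """One member of a process group; applies every update in sequence order."""
        def __init__(self, name):
            self.name = name
            self.state = {}
            self.inbox = queue.Queue()

        def run(self):
            expected, pending = 0, {}
            while True:
                seq, update = self.inbox.get()
                if update is None:             # shutdown marker, used only in this sketch
                    return
                pending[seq] = update
                while expected in pending:     # apply updates in the agreed-upon order
                    key, value = pending.pop(expected)
                    self.state[key] = value
                    expected += 1

        def read(self, key):
            # A query that changes no data can be served by this one replica alone.
            return self.state.get(key)

    class ProcessGroup:
        """Names a set of replicas and multicasts every update to all of them."""
        def __init__(self, members):
            self.members = members
            self.next_seq = 0

        def multicast(self, update):
            seq, self.next_seq = self.next_seq, self.next_seq + 1
            for m in self.members:
                m.inbox.put((seq, update))

    group = ProcessGroup([Replica(n) for n in "abc"])
    threads = [threading.Thread(target=m.run) for m in group.members]
    for t in threads:
        t.start()
    group.multicast(("heart_rate", 72))
    group.multicast(("blood_pressure", "120/80"))
    for m in group.members:
        m.inbox.put((None, None))
    for t in threads:
        t.join()
    print([m.state for m in group.members])    # all three replicas hold identical state
    print(group.members[0].read("heart_rate")) # a read served by a single member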

Of course, if all members of a process group handle an incoming message in the same erroneous manner, all the members could, in theory, crash simultaneously. Although it would seem that active replication should be vulnerable to such failures, this turns out not to be the case. Programmers have often observed that the errors most likely to be missed in testing software are those involving the order in which data are received. These bugs can be provoked only by unlikely sequences or timings of events. When a system employs active replication, the replicas do see the same updates in the same order; however, updates are only a small part of the requests a program sees. Most of the time, replicated programs work in parallel, with each program handling its own set of queries in a unique order. Thus, even if a software bug slips through testing and interferes with a few parts of a networked application, it is unlikely to cause all the members of any particular process group to crash at the same time.

The idea behind active replication is simple, but the software needed to support it is not. Managing dynamically changing membership lists and communicating with process groups is difficult, particularly in the face of inevitable crashes and lost messages. Although distributed computing has become commonplace over the past decade, active replication has only recently emerged from the laboratory.

Tool Kits for Robust Networks

Over the past few years, more than a dozen software teams have developed packages for robust distributed computing systems. All provide high availability through active replication, although they each differ somewhat in their emphasis. Some packages focus on speed, for example; others on the need for security.

Our research efforts at Cornell University contributed two such packages. One of us (Birman) headed the team that introduced Isis in 1987; more recently, the two of us worked on Horus, introduced in 1994. The names “Isis” and “Horus” allude to Egyptian mythology. The goddess Isis helped to revive the god Osiris after he was torn to pieces in a battle with the war god Set; Horus was Isis’s son, who eventually triumphed over Set. By analogy, the Isis and Horus packages can help restore a distributed system that has been disrupted by a failure.

Packages such as Isis employ a set of software functions, or “tools,” that replicate and update data, keep track of process groups and assist in handling membership changes.




Isis can also parcel out data processing among servers (a procedure known as load sharing). Distributed systems that make use of load sharing exhibit many of the advantages of parallel computing but without requiring special-purpose parallel computers. By dividing up incoming work among multiple servers functioning in concert, Isis enables systems to manage large tasks quickly. Also, if a particular application requires additional computing power, one can add an extra server, and the load-sharing technique will adapt itself to the new group size. The possibility, offered by such tool kits as Isis and Horus, of improving both performance and reliability often surprises developers: they tend to assume that making a system more robust will also make it slower and more expensive.
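A minimal illustration of the load-sharing idea, again in Python with invented names; hashing each request over the current membership is just one possible policy, not the scheme Isis itself uses.

    def assign(request_id, members):
        """Spread incoming work across whichever servers currently belong to the group."""
        return members[hash(request_id) % len(members)]

    members = ["server-1", "server-2", "server-3"]
    print(assign("trade-4711", members))

    members.append("server-4")              # extra computing power joins the group
    print(assign("trade-4711", members))    # the same rule now spreads work over four servers

When a server joins or leaves, the same rule automatically spreads the work over the new group size, which is the property such tool kits exploit.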

Active replication has been applied in a number of settings, including several telecommunications networks, stock markets, banks and brokerages. In Norway, researchers have developed an environmental monitoring system based on the technology [see the StormCast illustration above]. The French air-traffic-control agency is also exploring the technique for use in a new generation of air-traffic-control software. And manufacturing plants have used process groups to coordinate work and to reconfigure assembly lines when equipment is taken off-line to be serviced.

As computer scientists look to ever more demanding applications, however, they discover that active replication has important limitations. Load sharing is not always possible or desirable: some systems (notably, those in which data stored at a server change very rapidly) slow down when components are replicated. For example, in videoconferencing technology, active replication does improve the fault tolerance of the network of servers that must keep running even when some participants are cut off. But the technique would slow down the system—without improving dependability—if applied to the transmission of video data to remote users.

The need for flexibility motivated us to develop Horus. Like the Isis tool kit, Horus supports active replication, but it also provides much greater versatility. The basic strategy behind Horus is modularity, resembling that of a child’s set of Legos: different building blocks of Horus can fit together in any combination to support the specific needs of a particular process group. One block might encrypt data so that hackers cannot break into the system. Another block might address potential communications failures that can arise when messages are lost or corrupted. Programmers using Horus decide which properties their system actually needs, permitting them to customize the system for its intended use. Furthermore, Horus can be extended with custom-designed blocks for special needs that we may not have encountered or anticipated in our own work.
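The following Python sketch conveys the Lego-like layering in spirit only; it is not the Horus programming interface, every class name is invented, and the toy XOR "encryption" is merely a stand-in for a real cipher. Each group stacks only the blocks it needs.

    class Channel:
        """Bottom of the stack: a stand-in for the raw network."""
        def send(self, payload: bytes) -> bytes:
            return payload                      # pretend the bytes went out and came back

    class EncryptLayer:
        """Optional block: scrambles data so eavesdroppers cannot read it (toy XOR cipher)."""
        def __init__(self, lower, key=0x5A):
            self.lower, self.key = lower, key
        def send(self, payload: bytes) -> bytes:
            scrambled = bytes(b ^ self.key for b in payload)
            return bytes(b ^ self.key for b in self.lower.send(scrambled))

    class RetransmitLayer:
        """Optional block: resends a message a few times if the layer below misbehaves."""
        def __init__(self, lower, attempts=3):
            self.lower, self.attempts = lower, attempts
        def send(self, payload: bytes) -> bytes:
            last_error = None
            for _ in range(self.attempts):
                try:
                    return self.lower.send(payload)
                except OSError as err:          # e.g., a lost or corrupted transmission
                    last_error = err
            raise last_error

    # One process group needs secrecy and retransmission; another is happy with the bare channel.
    secure_stack = EncryptLayer(RetransmitLayer(Channel()))
    plain_stack = Channel()
    print(secure_stack.send(b"vital signs"))
    print(plain_stack.send(b"public bulletin"))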

Horus has a growing group of users worldwide. At Cornell, Brian C. Smith has used it to build a videoconferencing system for “groupware” applications. Horus information is available from http://www.cs.cornell.edu/Info/Projects/HORUS/

A Crisis of Will?

Our work on Isis and Horus has convinced us that careful planning can ensure the dependability of computer networks. But making the information superhighway robust may take more time and money than computer makers and users are willing to commit.


The Tangled Web

Most of the World Wide Web is invisible. For many users, the Web appears to have only two components: a browser program and remote servers where documents to be downloaded are stored. But the Web—and the Internet, which provides the medium of communication that connects Web sites—consists of millions of additional programs and servers; dozens cooperate to fetch a single document. For example, to fetch an item from a server at Cornell University, the name “www.cornell.edu” must be translated, or mapped, to a numerical address that the software recognizes. This task may be handled by a series of mapping programs before the correct address is located. Furthermore, a request to connect to a Web site typically passes through a number of so-called proxies—programs that save copies of frequently accessed documents as a way of reducing the load on remote Web servers. If a nearby proxy has stored a needed document, the user can avoid a lengthy file transfer over the Internet. Such hidden dependence on intermediate programs is common in distributed systems, but it can contribute to system failures. If a name-mapping program does not respond, if a proxy fails or if the Web server crashes, the initial request will not go through. So the ubiquitous error message that the Web server has failed or is busy can be misleading: the overload or failure of any number of intermediate programs can result in such an outcome.
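The chain of intermediaries can be sketched as follows (hypothetical Python; the address table, the cache and the contact_server stub are all invented for illustration). A failure at any step blocks the request, even though the remote Web server itself may be perfectly healthy.

    # Hypothetical tables standing in for the real name service and a nearby proxy cache.
    name_servers = [{"www.cornell.edu": "192.0.2.17"}]    # made-up numerical address
    proxy_cache = {}

    def contact_server(address, path):
        """Stub for the final hop to the remote Web server, which may itself be overloaded."""
        return "document %r fetched from %s" % (path, address)

    def fetch(url):
        host, _, path = url.partition("/")
        # Step 1: map the name to a numerical address, possibly via several mapping programs.
        address = None
        for table in name_servers:
            address = table.get(host)
            if address:
                break
        if address is None:
            raise RuntimeError("name service did not respond or has no entry for " + host)
        # Step 2: a nearby proxy may already hold a copy (possibly an out-of-date one).
        if url in proxy_cache:
            return proxy_cache[url]
        # Step 3: only now is the remote Web server itself contacted.
        document = contact_server(address, path)
        proxy_cache[url] = document
        return document

    print(fetch("www.cornell.edu/index.html"))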

The Internet as a whole can experience “brownouts” somewhat akin to a telephone or power outage. For instance, in late 1995 a major Internet name service, located in Atlanta, became intermittently overloaded. During these periods, no one on the Web could fetch documents from servers whose addresses were not already known to the local system. This type of brownout can affect large numbers of people worldwide. Even when a connection is made to a server, errors can still occur. Copies saved by Web proxies are not updated when the original document is, so there is no guarantee that users will see the most up-to-date version of a Web page. In many situations, this possibility does not cause significant problems, but some time-sensitive applications can become untrustworthy if the necessary documents are not kept current. Web proxies improve the Internet’s reliability in one sense, by reducing the processing load on the network. But proxies decrease reliability by creating the possibility that people will see outdated information.

Critical projects on the Web or other distributed computing systems will require stronger guarantees that information retrieved from the system is accurate, current and always available when needed. One way to avoid such mistakes is to arrange for saved copies of vital information to be managed by the process of active replication described in the accompanying article. —K.P.B.


Software for distributed applications is typically built with existing technology that was not designed for dependability. Moreover, researchers need to seek better methods for designing large-scale systems that are robust and that provide very high performance: a system that is extremely robust when accessed by 50 users simultaneously may turn out to be unacceptably slow and hence unreliable if 5,000 people do so.

Although programmers have applied the technology for robust distributed computing successfully in some instances, the public hears more about failures of nonrobust systems. For example, over the past few years, there have been dozens of reports on the problems with the current air-traffic-control system. In the fall of 1995 the Los Angeles system failed, leaving controllers unable to communicate with aircraft; a midair collision was avoided by seconds.

To make matters worse, updated air-traffic-control software, commissioned in 1982 by the Federal Aviation Administration, has been repeatedly delayed and scaled back. The FAA selected the original proposal precisely because of its innovative approach to distributed computing; now it seems the highly available and distributed aspects of the proposed software have been almost entirely eliminated. Yet air-traffic controllers criticize the existing system as dangerously inadequate, particularly because it lacks a distributed software architecture and has become undependable with age. Highly publicized fiascoes such as these have fueled a common perception that there is a crisis in computer software [see “Software’s Chronic Crisis,” by W. Wayt Gibbs; Scientific American, September 1994].

But if we are really in the midst of a software development crisis, it is perhaps as much a crisis of will as of means. Not all developers are concerned with making their networking software robust, and the public pressure for reliability does not seem to extend beyond a few especially sensitive applications. Indeed, companies that market distributed computing packages often state in product licenses that their technologies may not be dependable enough for use in critical applications—implying that reliability is not a reasonable objective. In our opinion, this situation is analogous to the unlikely prospect of automakers selling cars with the warning that vehicles are unsafe for use on highways. The computer equivalents of safety belts and air bags are infrequently applied to software development. And the desire for sophisticated, user-friendly interfaces as well as improved speed and performance tends to dominate the attention both of the software developers and the people who use the programs.

Reliability often conjures up an image of slow, ponderous computer systems that is incompatible with the allure of effortless and instantaneous access to information on the data superhighway. Yet robust technology does not have to be slow and unpleasant to use: the Golden Gate Bridge is a model of stability as well as grace. With each passing hour, more and more uses are being found for the information bridges that link computers. Our enthusiasm to incorporate elegant electronic bridges in every conceivable application should not overshadow a reasonable degree of concern about whether or not such bridges will be able to support the resulting traffic of information. We believe that robust distributed systems provide a valuable tool for connecting computers quickly and dependably, creating opportunities for business and pleasure in the information society. But we also believe that in many cases, unless a distributed system can be engineered to function robustly, it may be better not to build—or use—one at all.


The Authors

KENNETH P. BIRMAN and ROBBERT van RENESSE have worked together on distributed computing systems for the past five years. Birman is professor of computer science at Cornell University. After developing the Isis tool kit in the 1980s, he founded a company to commercialize the technology. Isis Distributed Systems now operates as a division of Stratus Computer, Inc. Van Renesse entered the field of distributed computing after deciding not to pursue a career as a circus acrobat. He is now a senior research associate at Cornell and is the primary architect and developer of the Horus system.

Further Reading

Fault Tolerance in Tandem Computer Systems. Jim Gray, Joel Bartlett and Robert W. Horst in The Evolution of Fault-Tolerant Computing. Edited by A. Avizienis, H. Kopetz and J. C. Laprie. Springer-Verlag, 1987.

Fatal Defect: Chasing Killer Computer Bugs. Ivars Peterson. Random House, 1995.

Group Communication. Special section in Communications of the ACM, Vol. 39, No. 4, pages 50–97; April 1996.

AIRLINE PASSENGERS waited for delayed flights in New York City’s La Guardia Airport in May 1995 after a power failure triggered a shutdown of the local air-traffic-control system. The effort to rejuvenate aging U.S. air-traffic-control software using distributed systems dates back to the 1980s.
