
TANDEM

NonStop Availability in a Client/Server Environment

Alan Wood

Technical Report 94.1
March 1994
Part Number: 106404


NonStop Availability in a Client/Server Environment

Alan Wood
Tandem Computers

10300 N. Tantau Ave. WC55-52
Cupertino, CA 95014

e-mail: wo([email protected]

Abstract

The popularity of client/server computing has grown enormously in the last few years. Client/server architectures have become the standard campus LAN environment at most companies, and client/server architectures are beginning to be used for mission-critical applications. Because of this computing paradigm shift, client/server availability has become a very important issue. This paper presents a predictive model of client/server availability applied to a representative client/server environment. The model is based on client/server outage data and measures client/server availability using a new metric: user outage minutes, which measures the amount of downtime for each user. User outage minutes are calculated for all outage types: physical, design, operations, and environmental failures, as well as outages due to planned reconfigurations. The model includes outages due to all client, server, and network devices in a typical client/server environment. The model has been validated by showing that its results are very similar to downtime surveys and other outage reports in the literature. The major results from the model are:

• Each client in today's mission-critical client/server environment experiences an average of 12,000 minutes (200 hours) of annual downtime.

• Servers are the dominant cause of client/server unavailability. They cause 60% of the user outage minutes. The network, which most users blame for client/server unavailability, causes less than 10% of the user outage minutes.

• Software failures, especially server software failures, cause the most (39%) client/server downtime. Reconfigurations (planned outages) are next at 32%. Hardware failures, which many people think of as the main cause of client/server outages, are responsible for only 11% of client/server downtime.

• Software fault-tolerant (Tandem) servers can reduce server downtime by nearly an order of magnitude and cut total user outage minutes in half.

Tandem Technical Report 94.1 (Part Number 106404)

© Tandem Computers, 1994


Table of Contents

1.0  Introduction
1.1  Objective
1.2  Approach and Organization
2.0  Measuring Client/Server Availability
3.0  Client/Server Outage Data
3.1  Data Collection Issues
3.2  Outage Data Classification
3.3  Example Outage Data
3.4  Client/Server Outage Causes
4.0  Modeling Client/Server Availability
5.0  Client/Server Availability Model
5.1  A "Typical" Client/Server Environment
5.2  Availability Model
5.3  Results
5.4  Result Summary
6.0  Model Validation
7.0  Improving Client/Server Availability
7.1  Operational Improvements
7.2  Architectural Improvements
     Acknowledgement
     References
     Appendix 1: Spreadsheet
     Appendix 2: Outage Minutes and Number of Users Affected


List of Figures

3-1.  Network Performance Characteristics
5-1.  "Typical" Client/Server Architecture
5-2.  User Outage Minutes by Equipment Type
5-3.  User Outage Minutes by Outage Category
5-4.  User Outage Minutes by Outage Category
7-1.  The Effect of Tandem Servers on Client/Server Availability
7-2.  Alternative Client/Server Architectures

List of Tables

2-1.  Relationship Between Percentages and Annual Outage Minutes
2-2.  User Outage Minutes Calculation
3-1.  Outage Data Categories
3-2.  Example Client/Server Outage Data
3-3.  Client/Server Outage Causes
5-1.  User Outage Minutes Calculation
5-2.  Largest Causes of User Outage Minutes
6-1.  Comparing the Model Results to Outage Surveys
6-2.  Equipment Type Outage Result Comparison
6-3.  Outage Category Result Comparison
7-1.  User Outage Minutes for Architecture Alternatives


NonStop Availability in a Client/Server Environment

1.0 Introduction

Ten years ago, computer systems were relatively simple. The usual user configuration was a dumb terminal connected via an async line to a host mainframe. The world has changed. Today's typical user configuration is an intelligent workstation connected through several network devices such as bridges and routers to multiple servers and possibly a mainframe. Today's servers, and even some clients, have more processing power than the host mainframes of ten years ago. Applications are spread among the clients and servers instead of all residing on the host. A single data base query often requires five or more different pieces of equipment and the associated software to all be operating properly. Instead of being a uniform, single vendor environment, the typical client/server environment contains equipment and software from many different vendors.

With all this complexity, it would seem that system availability in a client/server environment would be very poor. It is true that a complex client/server environment built from the hardware of ten years ago would be down as often as it was up, but hardware reliability has improved dramatically in the last ten years. Hardware MTBFs have increased one to two orders of magnitude. The classical availability models that considered only hardware are no longer relevant since software, operations, and environmental failures are all at least as important as hardware failures. Scheduled maintenance periods are being eliminated as customers demand continuous application availability. Availability models are complex because they must account for the myriad of ways that client/server environments can be configured and the many failure modes of client, server, and network devices. Such complexity makes it difficult to properly account for availability in client/server architecture design.

This paper presents a client/server availability model to help design engineers evaluate client/server availability. The model includes physical, design, operations, and environmental failures, as well as outages due to planned reconfigurations. It includes outages due to client, server, and network devices in a typical client/server environment. The paper also addresses basic issues such as the definition of system failure and the appropriate measure of availability that need to be reconsidered in this new computing paradigm. Based on client/server outage data, the paper indicates the most important outage causes in a client/server environment and evaluates methods for improving client/server availability.

Terminology

Unless otherwise stated, the term server means a commodity, non-fault-tolerant server such as a SUN or HP workstation. The terms client and user are used interchangeably.


1.1 Objective

The objective of this effort was to create an availability model of a typical client/server environment to use in client/server architecture trade-offs. In creating this model, there were four important constraints:

1. Availability should be a key consideration for the client/server environment to be modeled.

2. The model should be robust enough to provide architectural conclusions.

3. The model should be simple enough to show how the model parameters drive the conclusions and to allow alternative architectures to be easily evaluated.

4. The model should be data-driven, i.e., the model parameters should be based on real client/server outage data rather than engineering estimates.

Although availability is important for user productivity in any client/server environment, it is even more important in a mission-critical OLTP application. It is more likely that availability would influence client/server architecture in a mission-critical environment than in a typical campus LAN environment. Therefore, the first constraint drove us to model a representative data base server architecture as opposed to modeling file servers, print servers, or other less critical servers. However, most of the conclusions also apply to other architectures and server types.

To make the model robust, we included:
• all client, server, and network devices and associated software
• both unplanned and planned outages
• all categories of outages - physical (hardware), design (software), operations, environment, and reconfiguration.

We also modeled a generic or "typical" client/server environment, rather than a specific installation, so that we could make general conclusions about client/server architecture.

To keep the model simple, we associated a number of outage minutes with each outage type. We then determined user outage minutes for each outage type and summed over all outage types to get total user outage minutes, which we feel is a good measure of client/server availability (see Sections 2 and 4). All the calculations can be done in a spreadsheet, which facilitates sensitivity analysis.

The model is data-driven. We gathered client/server outage data from a variety of sources and used it to determine:
• outage causes for each client/server device
• the parameter values, i.e., the number of outage minutes associated with each outage cause for each client/server device.

1.2 Approach and Organization

Because the client/server computing paradigm is so different from the mainframe-centered computing paradigm, basic definitions and understandings pertaining to availability have to be revisited. This paper is organized around a set of basic questions - what is failure in a client/server environment and how do we measure it, what does outage data tell us about the client/server environment, how can we create a simple model of the complex client/server environment, what does the model tell us about a "typical" client/server environment, and how can we improve client/server availability?


• Defining and measuring failure (Section 2)
If a user wants a 1 second response but is getting a 5 second response, is the system down? If a network connection fails, and 20 out of 1000 users cannot access data, is the system down? What about planned outages in a 7x24x365 global environment? In Section 2, we propose that failure must be defined and measured from the user point-of-view rather than the system point-of-view and propose that user outage minutes is a good metric for client/server availability.

• Client/server outage data (Section 3)
A client/server environment has so much different equipment and so many complex interconnections that it would be easy to postulate thousands of outage causes. However, we wanted the model to be based on real outage data rather than engineering hypotheses. In Section 3, we describe the data we gathered and the outage causes that were abstracted from the data.

• Modeling client/server availability (Section 4)
It would be easy to create an extremely complex model including every possible outage cause for all equipment, but we wanted to create a simple model to focus on the important issues and use for sensitivity studies. Our approach to creating a simple model is described in Section 4.

• Applying the client/server availability model (Section 5)
There is no such thing as a typical client/server environment, which makes it difficult to model one. However, there are standard principles to which most client/server environments adhere, and Section 5.1 describes how we used these principles to formulate a "typical" client/server environment for use in modeling. Section 5.2 shows how we applied the modeling technique in Section 4 to this "typical" client/server environment. Sections 5.3 and 5.4 contain the results from the model, including the most important outage causes in a client/server environment.

• Validating the client/server availability model (Section 6)
Perhaps the most important and most often ignored step in model building is model validation. It is particularly important for this model since we have created a relatively simple model of an extremely complex environment. The use of other surveys and studies to validate the model results is described in Section 6.

• Improving client/server availability (Section 7)
There are many possible methods to improve client/server availability, both operational and architectural. Section 7 describes several potential client/server availability improvements, including fault-tolerance at various levels, and quantifies the impact of these improvements in terms of user outage minutes.


2.0 Measuring Client/Server Availability

Availability, classically defined as system uptime/(system uptime + system downtime), is a standard system performance measure. Availability is usually expressed as a probability or percentage, e.g., 99.9% availability. Unavailability is defined as 1 - availability, e.g., 99.9% availability is equivalent to 0.1% unavailability. It is often more useful to use unavailability than availability to describe the effects of design or operations changes, especially if unavailability is expressed as an amount of downtime per year. For example, a design change that increases availability from 99.9% to 99.99% does not sound very significant. However, expressing the impact as a decrease in downtime from 500 minutes per year to 50 minutes per year sounds significant and does a better job of conveying the true impact of the change on business productivity. Table 2-1 shows the relationship between percentages and annual outage minutes.

Availability | Unavailability | Annual Outage Minutes
90%          | 10%            | 50,000
99%          | 1%             | 5,000
99.9%        | 0.1%           | 500
99.99%       | 0.01%          | 50
99.999%      | 0.001%         | 5

Table 2-1. Relationship Between Percentages and Annual Outage Minutes¹

In a distributed client/server environment, it is not very meaningful to define unavailability in terms of system failure and measure it in terms of system uptime and downtime because it is hard to define a system failure. If a single PC fails, the user of the PC cannot access the application, but all other users can continue operating without any loss of performance (unless the PC has failed in a mode that causes an outage for other users). If 1 out of 1000 users cannot access the system, is the system down? If not, what about 10 out of 1000 or 100 out of 1000? All outages are not equally painful from a business perspective, even if they are of equal duration. A PC failure that causes a single user outage is not as painful as a server failure that causes a 50 user outage, which is not as painful as a distributed data base failure that causes a 1000 user outage.

To accurately reflect the impact of an outage, it is important to measure it from the user point-of-view rather than the system point-of-view. An appropriate measurement of client/server availability needs to include both the duration of an outage and the number of users affected. We propose that user availability, rather than system availability, is the appropriate measure. User availability is defined as user uptime/(user uptime + user downtime), averaged across all users. If an average of 1 user out of 1000 is down, that equates to 99.9% user availability, or about 500 annual outage minutes per user. Using annual user outage minutes to measure client/server availability helps us avoid having to artificially decide how many down users equates to a system outage.

User outage minutes are calculated by multiplying the downtime due to each outage by the number of users affected. An example of the calculation for a few outage causes is shown in Table 2-2 (Section 5.2 describes the calculations in more detail). Corporations can determine what a user outage minute costs them in terms of lost productivity, lost revenue, customer satisfaction, and so forth, and do a simple multiplication to determine the cost of downtime for their application(s). This provides a way to measure the value of availability when developing a client/server architecture and choosing client/server vendors.
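As a minimal sketch of this arithmetic (the first two outage rows and the cost per user outage minute below are hypothetical; the transformer row is from Table 3-2):

```python
# Hypothetical sketch of the user-outage-minute calculation described above.
# Each outage contributes (outage minutes) x (number of users affected).

outages = [
    {"cause": "server hardware failure", "minutes": 180, "users": 50},    # assumed
    {"cause": "router failure",          "minutes": 60,  "users": 50},    # assumed
    {"cause": "building transformer",    "minutes": 575, "users": 1027},  # Table 3-2
]

user_outage_minutes = sum(o["minutes"] * o["users"] for o in outages)

# A corporation supplies its own cost per user outage minute (assumed here)
# to convert downtime into a business cost.
cost_per_minute = 0.50  # dollars per user outage minute (assumed)
print(f"total user outage minutes: {user_outage_minutes:,}")
print(f"estimated downtime cost:   ${user_outage_minutes * cost_per_minute:,.2f}")
```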

¹There are 525,600 minutes in a year, so 99% availability is 5,256 yearly outage minutes. The numbers in the table have been rounded to help focus on the concept rather than the numbers.


[Table 2-2 shows the user outage minutes calculation (outage minutes x number of users affected) for a few example outage causes.]

Table 2-2. User Outage Minutes Calculation

3.0 Client/Server Outage Data

Since we wanted the client/server availability model to be data driven, we tried to collect as much client/server outage data as possible. Surprisingly, we were able to find very few companies that kept good client/server outage data. Most companies that we approached did not keep outage data of any type. A few companies had data on the network only, a few had some data on the servers, and no one kept data on single client outages. Our best source of data was one of our large customers that has a large networking application with about 15,000 users. They keep detailed records on all outages that affected 4 or more users and provided us with their entire set of outages for 1992. In researching our internal outage data, we found that our MIS group that supports the campus LAN kept good data on the campus LAN, but had no data on clients or servers. We found another MIS group that had good data on the servers because they were responsible for supporting the servers. This may be typical of other companies and may explain why it is difficult to get outage data for a complete client/server environment. Some examples of the outage data we collected are contained in Section 3.3.

Since we could not find an abundance of client/server outage data, we augmented the data we had with literature surveys (References 2-10), articles quoting downtime figures (References 11-13, 21, 23), university studies (References 1, 24), MTBF quotes from vendors, and our internal outage data. In Section 3.2, we describe how the data was synthesized to provide a list of outage causes. The use of the outage data to create an availability model is discussed in Section 5. The use of the survey data to validate the results from the model is described in Section 6.

3.1 Data Collection Issues

In a distributed computing environment, it is difficult to precisely determine when a user is up or down. Performance that is good enough for one user may not be good enough for another user, and even if it were, the performance dynamically varies throughout the network. Network performance can be characterized by latency, throughput, and reliability as shown in Figure 3-1. Latency is the time for information to transit from a source to a destination. Throughput is the amount of information that travels through the network over time. Reliability is the ability of the network to recover from errors and deliver error-free packets. These 3 characteristics are interrelated since (e.g.) poor latency could cause timeouts, increasing throughput and decreasing reliability.

The left side of Figure 3-1 shows a network performing very well while the right side shows a network performing very poorly. Almost all users would say that a network performing as shown on the left side of Figure 3-1 was up, and almost all users would say that a network performing as shown on the right side of Figure 3-1 was down. However, whether a user would say that a network performing as shown in the middle area of Figure 3-1 was down or up would depend on the user, the application, the time of day, and many other factors. Note that the users may say the LAN is down, but what they really mean is that their client, their server, or their path through the network is not working to their satisfaction.

[Figure 3-1 shows the spectrum of network latency, throughput, and reliability, ranging from "LAN is up" through "LAN is ?" to "LAN is down".]

Figure 3-1. Network Performance Characteristics²

Rather than trying to absolutely quantify the definition of failure, we took a very simple approach for gathering data. If a user said the LAN was down (which means that their client, their server, or their path through the network was not working to their satisfaction), then we assumed that user was down. This ensures that we measure client/server availability as perceived by the users. It has the drawback that two users might report the same situation differently. In practice, this was hardly ever a problem - all users felt the network was down, or all users felt the network was up.

To make sure that we properly accounted for the impact of an outage on the users, we measured user downtime rather than equipment downtime. For example, if the clients have to reboot their workstations and restart their applications to recover from a server outage, that recovery time is included as part of the outage.

3.2 Outage Data Classification

All the outage data collected was classified as one of five types of outages: physical, design, operations, environmental, and reconfiguration. These categories are the standard outage classes used in the NonStop Availability Initiative. The category definitions are shown in Table 3-1.

²Figure 3-1 is a slight modification of Figure 2-4 in [Feather,92].


[Table 3-1 defines the five outage categories: physical, design, operations, environmental, and reconfiguration.]

Table 3-1. Outage Data Categories

3.3 Example Outage Data

Table 3-2 provides some examples of the raw outage data collected. Notice the wide variety of outages and the potentially disastrous effect some of these outages could have on a business, e.g., the building transformer that caused 1,027 users to be down for 575 minutes = 590,525 user outage minutes! Another interesting point is that in many cases the real cause of the outage was never diagnosed - the operators just figured out how to get the users back on line.

One of the most interesting things about the data we collected was the difference between actual client/server outage causes and the users' perception of those causes. People tend to think of physical hardware failures when they think of computer downtime, but hardware reliability has increased remarkably in the last decade. Our data showed that physical hardware failures caused only 10%-20% of the unplanned outages. Design, operations, and environment failures were as common or more common than physical failures. Reconfiguration (planned) outages were also more common than physical failures. Another interesting aspect of the data we collected is that users perceive outages as being caused by the LAN, but our data shows that servers are the primary cause of user downtime. Section 5.3 discusses these results in more detail.

3.4 Client/Server Outage Causes

Using the data described in Section 3.3, we developed a list of outage causes for a client/server environment. These outage causes are shown in Table 3-3. Except where noted, the outage causes apply generically to clients, the network, or servers. Some of the outage cause terminology may be unfamiliar and is defined at the bottom of the table.

Most of these outage causes apply to any type of computing environment, but some are propagating type outages peculiar to the client/server computing environment. A propagating (also called multiple node) outage means that the failure of a specific item of equipment propagates outside its fault domain, causing an outage for users that should be unaffected by the failure. For example, a router failure should cause downtime for at most the users dependent on the servers and hubs connected to that router, e.g., a maximum of 200. However, the router can fail in a mode (e.g., broadcast storm or algorithm conflict) that causes very heavy traffic on the LAN, which makes response time very poor and causes all users to think the LAN is down. These types of outages can result in very large amounts of user outage minutes because so many users are down.


Category | Outage Cause/Maintenance Action | Outage Minutes | Number Users Down | User Outage Minutes
Physical | 60 workstations did not have access to the application due to a router problem. Operations isolated the trouble to a loose cable on the download server. This cable was tightened, routers reloaded, and the clients rebooted their workstations. | 143 | 60 | 8,580
Physical | Clients lost access to the application due to a broadcast storm* on the LAN caused by a bad supervisor card in the hub. The card was replaced to restore service. | 9 | 287 | 2,583
Design | Clients could not access the application due to a hub problem. Operations reset the supervisory card on the 8th floor and the Ethernet cards on the 3rd and 4th floors. | 90 | 58 | 5,220
Design | 15 workstations unable to access the application. The trouble was isolated to a hung file server. Operations performed a stop and start of the file server process to restore service to the clients. | 13 | 15 | 195
Design | Clients lost access to the application. Operations isolated the trouble to an out-of-synch condition between file server and line handler processes, which occurred following a failure of the host application. Both processes were bounced to restore service. | 24 | 30 | 720
Design | Clients lost access to the application due to a broadcast storm* on the internet LAN. Operations determined the cause to be a packet which caused the LAN bridges to loop. They disabled and enabled the bridges to isolate the packet and restore service. | 46 | 75 | 3,450
Operations | After power was restored, an Ethernet card (to the file server) did not properly restore because it had been incorrectly configured. | 360 | 25 | 9,000
Operations | Clients experienced degraded service following an application maintenance release. The release was backed out to restore service. | 31 | 205 | 6,355
Environmental | Clients could not access the application due to a bad fuse in the building transformer. Due to the location of the transformer, it took several hours to replace the fuse and restore power. | 575 | 1,027 | 590,525
Environmental | Construction group accidentally set off Halon system in computer room. | 420 | 25 | 10,500
Reconfiguration | Upgrading server network software to include Appletalk. | 120 | 50 | 6,000
Reconfiguration | File maintenance to balance disk utilization, adding disks, network segmentation. | 180 | 100 | 18,000

* Broadcast storm is defined in Table 3-3.

Table 3-2. Example Client/Server Outage Data


Some typical propagating type outage causes are:

• Distributed data base corrupted or frozen
• Broadcast storm*
• Babbling node*
• Router algorithm conflict*
• Duplicate network addresses (e.g., duplicate IP addresses)
• Virus
• Name/security server hung or data base incorrect
• Operations bounced the application
• Campus wide router reset

* Defined in Table 3-3

Physical
• CPU, LAN card, disk, etc. fail
• Babbling node

Design
• SW lock on DB fails
• Firmware error
• Self-test failure
• OS crash
• Broadcast storm
• Server hung
• DB corruption
• Disk access error
• LAN protocol error
• Access denied
• Router algorithm conflict
• Packet errors (runts, jabbers)
• Timeouts
• Application freeze
• Network paging

Operations
• Cable bumped
• Wrong cable pulled
• Duplicate or incorrect address
• Data not backed up
• Stopped wrong process
• Table or log deleted
• Bounced application

Environment
• Power failure
• Heating/Air conditioning/Ventilation
• Disasters
• Circuit breaker trip
• Power fail recovery error
• Virus

Reconfiguration
• System upgrade
• Add disk
• Move
• New OS release
• OS bug fix
• Workload balancing
• New application release
• Application bug fix

Broadcast Storm - a broadcast is a special message or packet that all network hosts must receive and process. A broadcast storm is a condition in which broadcasts are being overused, potentially completely disabling the network. Broadcast storms usually occur because of software errors.

Babbling Node - the transmission of random, meaningless packets onto the network; often caused by a failed LAN card.

Runts - packets that are smaller than the minimum length allowed by the network protocol (e.g., 60 bytes for Ethernet).

Jabbers - packets that are larger than the maximum length allowed by the network protocol (e.g., 1518 bytes for Ethernet).

Router Algorithm Conflict - routers use some variant of a shortest path algorithm. If (e.g.) router A thinks that the shortest path to router C is through router B and router B thinks that the shortest path to router C is through router A, packets for router C will be sent back and forth between routers A and B. This usually occurs because of some breakdown in the router shortest path updating strategy.

Network Paging - a (diskless) workstation runs a job too large for its memory and has to page over the network, causing very heavy network traffic.

Table 3-3. Client/Server Outage Causes


4.0 Modeling Client/Server Availability

A standard method for modeling unavailability is to create a state transition diagram or Markov model in which the states represent unavailable components. For a client/server environment with hundreds of components, such a model would be unwieldy. As described in Section 3, we extracted a set of outage causes from the data. Since the failure rates and repair times for those outage causes are averages from a number of different data sources and environments, we were only concerned with accounting for first order effects. We also wanted a simple model to allow us to easily study client/server architecture alternatives.

For a series system with statistically independent components, system availability is the product of its component availabilities. The components in a client/server environment are not independent since many of the outage causes described in Section 3 cause outages for multiple components. However, the outage causes seem to be statistically independent³, and the outage causes are in series. Therefore, we can calculate system availability as the product of the outage cause availabilities. System unavailability can then be approximated as follows:

U_sys = 1 - A_sys = 1 - A_1 x A_2 x ... = 1 - (1 - U_1)(1 - U_2)... = ΣU_i - ΣΣU_iU_j + ... ≈ ΣU_i,

where U_sys (A_sys) is the system unavailability (availability), U_i (A_i) is the unavailability (availability) due to the ith outage cause, and the sums and products are over all outage causes. There are a total of 68 outage causes considered, and the U_i values are on the order of 0.04%, so the second term of the availability equation, ΣΣU_iU_j, is on the order of (0.0004)² x 68² = 0.07%. The first term of the availability equation, ΣU_i, is about 2.3%, so the error made by the approximation is on the order of 0.07%/2.3% = 3%. A more careful analysis, taking into account the exact U_i values, reveals that the error is about 5%. Given the fidelity of the data, the error introduced by the approximation is minimal.
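The quality of the first-order approximation is easy to check numerically. A minimal sketch, using 68 equal outage causes at the 0.04% magnitude stated above (the report's 3%-5% figure reflects the actual spread of the U_i values):

```python
# Compare exact series-system unavailability, 1 - product(1 - Ui), with the
# first-order approximation sum(Ui), for 68 outage causes of ~0.04% each.
import math

U = [0.0004] * 68  # per-cause unavailabilities (equal, illustrative values)

exact = 1.0 - math.prod(1.0 - u for u in U)  # ~0.02684
approx = sum(U)                              # 0.02720
print(f"exact:  {exact:.6f}")
print(f"approx: {approx:.6f}")
print(f"relative error: {(approx - exact) / exact:.1%}")  # ~1.3% for equal values
```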

From the discussion in Section 2, user unavailability can be calculated the same way as system unavailability, and total annual user outage minutes can be approximated by summing the annual user outage minutes for each outage cause. This greatly simplifies the model and allows us to use a spreadsheet for availability calculations.

There are two other technical details that need to be satisfied for the above equation to hold [Barlow,75]. The first is that all other outage causes can occur during the recovery from an outage cause. This is not quite true since a single component has multiple outage causes, and a component could not fail while it is being repaired since it would not be operating. However, the error introduced by violating this condition is a small second order effect (equivalent to the U_i² terms in the above equation) since it only implies we do not perfectly account for double failures affecting the same component. The second technicality is that repair of multiple failed components is simultaneous rather than sequential.

³It might be expected that the occurrence of a failure would increase the load on other components, thus increasing the failure rate of other failure modes. However, our data does not seem to indicate such a phenomenon. This could be due to the distributed nature of a client/server environment or could be due to operations staffs reporting multiple overlapping failures as a single failure event. Our conversations with operators indicate that they feel that concurrent independent failures are a very rare occurrence. Even if they are not, constructing the model using actual field data automatically takes the increased failure frequency into account.


Since we based repair times on real outage data, any lengthening of repair times due to sequential repair (waiting for repair personnel to finish repairing another component) would already be reflected in the repair times. Also, since there is usually an operating staff rather than a single person responsible for maintaining the client/server environment, simultaneous repair is a reasonable assumption.

5.0 Client/Server Availability Model

This section describes how the modeling technique in Section 4 is applied to a "typical" client/server environment. It describes the results of the model in terms of the most important outage causes and categories.

5.1 A "Typical" Client/Server Environment

Client/server environments are very diverse. They range from a small departmental LAN with a single server to E-mail networks connecting tens of thousands of users to hundreds of servers. Although client/server environments are diverse, they share the following typical characteristics:

• Users primarily use local servers connected via a LAN. The economics of the client/server environment generally permit the clients to be in close proximity to their primary servers. Therefore, our "typical" client/server environment is a LAN with a communications server for WAN access. Also, the clients are located near their primary server (logically if not physically).

• There are no standard client/server configurations, but there are standard client/server components. Servers are generally powerful workstations performing a specific type of service, e.g., file servers, data base servers, and print servers. Clients are typically PC-class computers although other client devices such as ATM devices are certainly possible. Typical network components include transceivers, bridges, routers, hubs, and gateways.

• Ten to 50 clients per (data base) server is typical. We chose 25 as a reasonable average number based on the client/server environments we studied and opinions from experienced client/server operations managers.

• Equipment layering is typical, meaning that subnets are joined to form networks, which are joined to form larger networks.

Given these typical characteristics and our desire to model a data base server architecture, we defined the "typical" client/server architecture shown in Figure 5-1. In Figure 5-1, the users and their primary servers are connected to a hub (also called a concentrator) with 50 users and 2 servers per hub. Four hubs are attached to a router for a total of 200 users per router. There are 6 routers in a ring (e.g., an FDDI ring), 5 of which connect 200 users each for a total of 1000 users. The sixth router provides a gateway for communications (comm server) outside the LAN via a WAN and also provides a server that performs network management. Although network management activities such as name services and security services are often spread throughout the network, we chose to model a single name/security/comm server for simplicity. The equipment in this architecture is more uniformly distributed than expected for a client/server environment, but it provides a reasonable model for calculating client/server downtime and comparing architecture alternatives. The clients are considered to be PCs or workstation-class machines.


It is certainly possible to consider other client types such as ATM machines or hand-held devices or smart phones, but the majority of current clients are PCs or similar machines.

[Figure 5-1 shows the "typical" architecture: 1000 PC/workstations, 20 hubs, 6 routers, 40 data base servers, and 1 network management server, with 50 users and 2 servers per hub.]

Figure 5-1. "Typical" Client/Server Architecture

5.2 Availability Model

The client/server availability model identifies a set of outage causes and durations derived from the outage data described in Section 3. These outage causes and durations are then combined with the typical client/server environment shown in Figure 5-1 to determine user outage minutes. The model is generic in the sense that it represents the collective set of outage data and other reference material rather than the experience or architecture of a single customer. The client/server availability model is contained in a spreadsheet in which each row corresponds to an outage cause, and the columns contain the user outage minute calculations. An excerpt from the spreadsheet is shown in Table 5-1, and the complete spreadsheet is attached as Appendix 1. The spreadsheet is divided into sections for each type of client/server device shown in Figure 5-1 - PC/workstations, (data base) servers, name/security/comm server, hubs, and routers. Physical (hardware), design (software), operations, and reconfiguration outages for each type of equipment are contained in those sections. Separate sections have been created for operations, environment, and distributed data base outages. The operations outage causes in the device sections are specific to those devices while the separate operations section contains outages such as building moves that are not specific to a device type.

Each row of the spreadsheet provides user outage minutes for an outage cause or group of outage causes. Outage causes were grouped according to the data available and the number of users affected by the outage. For example, a server CPU failure, memory failure, power failure, and any other hardware failure affecting only the server were all grouped together because they were usually included as a single item in MTBF quotes and other data and they affected only the users who need a specific server. Disk failures were separated because there was separate data for them. LAN card failures were separated because they have a propagating outage cause (babbling node) that could bring down an entire subnet rather than just the server. Software failures were similarly segregated into non-propagating failures that cause an outage for only one device vs. propagating failures that cause outages for other equipment in addition to the failed device (e.g., broadcast storm).

Table 5-1 shows part of the spreadsheet. For each outage cause, we used the outage data to determine the number of outage minutes that should be ascribed to that outage cause (column 2 of Table 5-1). For example, for a LAN card/connection hardware outage, vendor MTBF quotes provide a 300,000 hour MTBF. Customer data provides a 3 hour repair time. Annual outage minutes are (3 hours) x (60 min/hr) x (8760 hrs/yr)/(300,000 hrs) = 5 min/yr. Customer data shows 14%-50% propagating failures; 20% assumed. Therefore, the annual outage minutes for LAN card failures is 4 minutes for non-propagating failures and 1 minute for propagating failures. These outage minutes are shown in the second column of Table 5-1. Non-propagating client LAN card failures are separated from server LAN card failures because they affect a different number of users; propagating failures (babbling node) are not separated because they affect the same number of users. Appendix 2 contains the complete derivation of annual outage minutes for each outage cause.

The client/server architecture shown in Figure 5-1 is used to determine the total number of devices to which each outage cause applies (column 3 of Table 5-1). For example, there are 10 users per client LAN card, so there are 100 client LAN cards for 1000 users. Multiplying columns 2 and 3 yields column 4, "Total Annual Outage Minutes". The number of users affected (column 5 of Table 5-1) is determined as described in the next paragraph. Finally, user outage minutes are calculated by multiplying columns 4 and 5, yielding column 6.
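The per-row arithmetic is simple enough to sketch in a few lines; a minimal version of the LAN card row described above (the MTBF, repair time, propagating fraction, and device counts are the figures quoted in the text):

```python
# One spreadsheet row: client LAN card hardware failures.
HOURS_PER_YEAR = 8760

mtbf_hours, repair_hours = 300_000, 3
annual_min = repair_hours * 60 * HOURS_PER_YEAR / mtbf_hours  # ~5 min/yr per card

prop_fraction = 0.20                             # assumed propagating share
non_prop_min = annual_min * (1 - prop_fraction)  # ~4 min/yr (card's own users)
prop_min = annual_min * prop_fraction            # ~1 min/yr (babbling node)

# 100 client LAN cards (10 users per card, 1000 users). A non-propagating
# failure affects the card's 10 users; a babbling node takes down the next
# level of hierarchy, the hub's 50 users.
cards = 100
user_outage_minutes = cards * (non_prop_min * 10 + prop_min * 50)
print(f"{user_outage_minutes:,.0f} annual user outage minutes from client LAN cards")
```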

[Table 5-1 excerpt: each row lists an outage cause (e.g., LAN card failure - babbling node, server application failure, hub LAN failure) with its annual outage minutes, number of devices, total annual outage minutes, number of users affected, and user outage minutes.]

Table 5-1. User Outage Minutes Calculation

The number of users affected (column 5 of Table 5-1) was derived from the client/server architecture shown in Figure 5-1 and the following assumptions:

• A server outage affects 50 users, the 25 directly attached to the server and an estimated 25 others who need remote access. Remote access could be required for remote data retrieval or because of cross-mounting of servers, e.g., putting the data on one server, the application on another, and the home directory on a third. If any of the three cross-mounted servers goes down, the application is unavailable.

• Failure of the name/security/communications server will affect 200 users on average. Some failures might affect everyone by terminating all network connections. However, a name or security server failure generally does not affect established connections; it affects users trying to access a device to which they are not currently attached or trying to log on to some device or application.


• A hub failure affects all 50 users attached to it.

• A router failure affects an estimated 50 users needing access to a server in another router zone.

• Propagating failures affect the next higher level of network hierarchy, e.g., a PC failure affects the hub to which it is attached and a hub failure affects the router to which it is attached.

The complete derivation of the number of users affected for each outage cause is contained in Appendix 2.

5.3 Results

The total annual user outage minutes for 1000 users is 12,031,800, or 12,032 user outage minutes (approximately 200 hours) per user. With 525,600 minutes per year, this is 97.7% user availability. Unplanned user outage minutes are 8,185,800, or 68% of the total. Reconfiguration (planned outages) accounts for 32% of the total.
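These totals convert to the availability figure as follows (all numbers are from the paragraph above):

```python
total_uom = 12_031_800            # annual user outage minutes for 1000 users
per_user = total_uom / 1000       # 12,032 minutes, about 200 hours, per user
availability = 1 - per_user / 525_600
print(f"{per_user:,.0f} min/user/yr -> {availability:.1%} user availability")  # 97.7%
```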

Figure 5-2 depicts annual user outage minutes by equipment type. The environmental and operations outages that are not specific to an equipment type are shown separately. Distributed data base outages were included in server outages since the data base software usually runs on the server. Figure 5-2 shows that server outages dominate client/server availability, accounting for nearly 2/3 of user outage minutes (the distributed data base failures included as server outages are less than 10% of user outage minutes). A surprising result is that the network accounts for only 10% of the user outage minutes, despite the common user perception that the LAN is always to blame for outages.

[Figure 5-2 is a bar chart of annual user outage minutes by equipment type (client, network, server, environment, operations), with each bar split into scheduled and unscheduled outage minutes.]

Figure 5-2. User Outage Minutes by Equipment Type

Figure 5-3 shows user outage minutes by outage cause category. Physical (hardware), design (software), and operations outages were further categorized as client, network, server, and "all" outages. The "all" category is used for outages such as power failures and building moves that affect all types of equipment. Design failures account for 39% of the user outage minutes, reconfiguration accounts for 32%, and physical hardware failures account for only 11%.

[Figure 5-3 is a bar chart of user outage minutes by outage category (physical, design, operations, environment, reconfiguration), with each bar split into client, network, server, and "all" outages.]

Figure 5-3. User Outage Minutes by Outage Category

Figure 5-4 shows a slightly different view of the data. In Figure 5-4, the reconfiguration outages have been absorbed into the physical, design, and operations categories depending on whether the reconfiguration was a hardware install/upgrade (physical), a software install/upgrade (design), or operations testing. Moves were included as network operations outages. Design is still the dominant outage category.

[Figure 5-4 is a bar chart of user outage minutes by outage category (physical, design, operations) with reconfiguration outages absorbed, each bar split into client, network, and server outages.]

Figure 5-4. User Outage Minutes by Outage Category


Table 5-2 contains a list of the 15 largest outage causes from the spreadsheet. This list is dominated by server outages, which is expected since servers are the equipment type that causes the most user outage minutes. Software design and reconfiguration outages are also very prevalent, as expected from the preceding discussion.

[Table 5-2 lists the 15 largest outage causes with their categories and user outage minutes; the entries are predominantly server, software design, and reconfiguration outages.]

Table 5-2. Largest Causes of User Outage Minutes

5.4 Result Summary

Summarizing the results from the previous section:

• Server outages are the most significant cause of client/server unavailability.

• Software causes the most user outage minutes, but all outage categories are important. Physical, design, operations, environment, and reconfiguration outages all contribute significantly to client/server unavailability.

• About 2/3 of the user outage minutes are due to unplanned outages. Although reconfiguration (planned) outages may not seem as damaging as unplanned outages, they contribute about a third of the user outage minutes and must be eliminated to achieve continuous availability.

• Fault localization is very important. Over a third of the user outage minutes in the model come from propagating failures, i.e., failures such as broadcast storms that cause outages for client/server components that should be unaffected by the failure. If the impact of an equipment failure on other equipment can be minimized, user outage minutes can be significantly reduced.

6.0 Model Validation

We developed the client/server availability model "bottoms-up", i.e., by defining a set of outage causes and the appropriate outage minutes based on aggregated outage data. Having done so, we need to verify that the model is reasonable by comparing our results to other reported results. After all, 12,000 outage minutes (200 hours) a year for each user seems like a lot. There have been several surveys of LAN downtime that can be used for comparison. Table 6-1 shows the comparison between the client/server availability model and those surveys. The correlation is surprisingly good, indicating that the model is producing reasonable results. To perform an apples-to-apples comparison between the model and the surveys, the appropriate set of outages had to be selected from the model, and the results had to be adjusted from continuous operation to operation during normal business hours. The details of the calculations are described following Table 6-1.

[Table 6-1 compares annual user outage minutes predicted by the model, adjusted to business-hours operation, with the survey figures from [Saal,90], [Caginalp,91], [Fogel,91], [Caldwell,92], and [META,93] discussed below.]

Table 6-1. Comparing the Model Results to Outage Surveys

[Saal,90] reports an average of about 7,000 yearly network outage minutes (an average of 24 outages at an average of 290 minutes each) that affect an average of 43% of users. The average yearly user outage minutes is about 3,000 (43% of 7,000). These outage minutes are for unplanned outages, excluding PC outages. The client/server availability model predicts 7,100 user outage minutes for unplanned outages, excluding PC outages. The survey includes only outages that occur during normal business hours, e.g., 40 hours a week, although some of the surveyed companies may have multiple shift operation. We assume that the hours per week during which outage minutes are counted is 40-80 hours as compared to 168 hours per week for the continuous operations assumed by the client/server availability model. Therefore, the outage minutes from the model are divided by a factor of 2-4 to account for the different "at risk" periods. The prediction from the model is then 1,800-3,600 annual user outage minutes, which is the range shown in Table 6-1.
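A minimal sketch of this "at risk" adjustment (the 7,100-minute model figure and the rounded factors of 2 and 4 are those stated above):

```python
# Adjust the model's continuous-operation prediction to a survey's
# business-hours window by dividing by the rounded "at risk" factors.
model_prediction = 7_100  # unplanned user outage minutes, excluding PCs

for factor in (2, 4):  # 168/80 ~ 2 and 168/40 ~ 4 hours-per-week ratios
    print(f"factor {factor}: {model_prediction / factor:,.0f} user outage minutes")
# Prints 3,550 and 1,775, i.e., roughly the 1,800-3,600 range in Table 6-1.
```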

[Caginalp,91] and [Caginalp,92] indicate that corporate LAN outages average 2 to 4 hours per week with an average of 38% of applications affected. This works out to 2,400-4,800 application outage minutes, which should be similar to user outage minutes. It is unclear whether the survey includes PCs and all servers. We assume it excludes PCs, so the model prediction is 7,100 user outage minutes. This survey applies to downtime during the "typical work week", so we again assume that means 40-80 hours per week, and the model prediction is again 1,800-3,600 annual user outage minutes.

[Fogel,91] reports 6% downtime, or 31,500 outage minutes, for LANs. It is unclear how many user outage minutes that equates to, but if 10%-20% of the users are affected by an outage, that is 3,150-6,300 annual user outage minutes. We make the same assumptions as in the comparisons to the surveys described above, so the model prediction is again 1,800-3,600 annual user outage minutes.


[Caldwell,92] reports Boeing's experience measuring and improving user outage minutes. They report 5,000-10,000 lost workstation hours per million opportunities, which is equivalent to 2,600-5,200 annual user outage minutes. Boeing includes client workstations and servers in their calculations, but they only consider unplanned outages. The model predicts 8,200 user outage minutes per user for unplanned outages. Boeing only measures outages during the hours that the MIS staff is available for assistance, about 60-80 hours per week. Therefore, we divide the model prediction by a factor of 2.1-2.8 to get the 2,900-3,900 annual user outage minutes shown in Table 6-1. The Boeing article also contains an outage minute breakdown using the categories of host, network, and workstation. Since Boeing uses workstations as both clients and servers, we cannot compare those results with the model. However, the user outage minutes due to network failures can be compared. Network failures cause 5%-10% of the user outage minutes in the Boeing data and about 10% of the unplanned user outage minutes in the client/server availability model. This is very good agreement and again indicates that network failures are not the dominant cause of client/server outage minutes.

[Louzon,92] contains a reference to server outages. It says that conventional servers are "out of service for tens of hours a year". The unplanned outage minutes for a single (commodity) server from the model is 1,400, or about 24 hours, which is tens of hours.

[META,93] reports that companies experience 1,440-2,880 minutes of LAN downtime per user per year. This includes the same outage categories and equipment as the client/server availability model, so the model predicts 12,000 annual user outage minutes. For a 40 hour work week, this equates to 3,000 annual user outage minutes. The article contains an additional breakdown of outages by equipment type and outage category similar to Figures 5-2 and 5-3. Tables 6-2 and 6-3 show the comparison. The correlation is surprisingly good, particularly since the outage categories and definitions are not exactly the same.

[Table 6-2 compares the percentage of downtime by equipment type reported in [META,93] with the percentage of user outage minutes in the client/server availability model.]

Table 6-2. Equipment Type Outage Result Comparison

[META,93] Outage Category | Percent Downtime | Client/Server Availability Model Outage Category | Percent User Outage Minutes
Hardware | 11% | Physical (hardware) | 11%
Operating System and Application SW | 52% | Design (software) | 39%
Environment | 4% | Environment | 4%
Planned | 30% | Reconfiguration (planned) | 32%
 |  | Operations | 14%

Table 6-3. Outage Category Result Comparison


7.0 Improving Client/Server Availability

Client/server availability can be improved by:

• Decreasing the frequency of outages (increasing the mean time between outages)
• Decreasing the duration of outages (decreasing detection, repair, and recovery times)
• Decreasing the number of users affected by an outage.

It is not possible to just "throw hardware" at the problem. For example, if the number of client LAN cards is doubled, so that the number of users affected by a LAN card failure is cut in half, the expected number of LAN card failures doubles because there are now twice as many cards. The net effect of doubling the LAN cards is zero (although if the LAN cards are used in a fault-tolerant fashion as described in Section 7.2, availability can be improved). There are, however, operational methods to improve availability as described in Section 7.1 and architectural approaches to improve availability as described in Section 7.2.
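A quick calculation makes the point, using the client LAN card figures from Section 5.2:

```python
# Doubling the LAN cards halves the users affected per failure but doubles
# the expected number of failures; expected user outage minutes are unchanged.
annual_min_per_card = 5  # annual outage minutes per card (Section 5.2)

for cards, users_per_card in [(100, 10), (200, 5)]:
    uom = cards * annual_min_per_card * users_per_card
    print(f"{cards} cards x {users_per_card} users each -> {uom:,} user outage minutes")
# Both configurations yield 5,000 expected user outage minutes.
```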

7.1 Operational Improvements

When talking to operations staffs about improving availability, improved training and tools are issues that arise frequently. They feel that they could do a much better job of detection, recovery, and repair of equipment with improved training and tools. It is difficult to use the model to determine how many user outage minutes could be saved by better training and/or tools since we have no data comparing different skill level responses to similar problems. However, if the average outage duration could be decreased by 10% through better training and tools, total user outage minutes would decrease by 10%. Also, 1,632,000 user outage minutes (14% of the total in the model) are attributed to improper installations, addressing problems, incorrectly halting system processes, accidentally bouncing user processes, and the like. Users contribute to these outages to some degree, particularly installation and addressing problems, but a well-trained operations staff could certainly make a difference of hundreds of thousands of user outage minutes.

A very interesting result from the model is that propagating failures account for over a third of the user outage minutes. If the operations staff can find ways to localize faults, there is an excellent opportunity to significantly reduce user outage minutes. As an example, if the operations group can quickly determine the source of a babbling node or broadcast storm, they could eliminate that node from the network, thus preventing the outage from affecting hundreds of nodes. An example of a network performance tool that can help improve network troubleshooting is Ungermann-Bass NetDirector™. The sketch below illustrates the kind of traffic-rate check such tools can automate.
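A minimal sketch of babbling-node detection (hypothetical threshold logic, not a description of how NetDirector works), assuming only per-node packet counters sampled from the network:

    from statistics import median

    # Flag nodes transmitting wildly above the subnet norm so an operator
    # can partition them before the failure propagates to hundreds of nodes.
    def babbling_nodes(pkts_per_sec, factor=20.0):
        """Return nodes sending more than factor x the median node rate."""
        typical = median(pkts_per_sec.values())
        return [node for node, rate in pkts_per_sec.items()
                if rate > factor * typical]

    sample = {"pc-14": 80, "pc-15": 95, "server-2": 310, "pc-16": 41000}
    print(babbling_nodes(sample))   # ['pc-16'] - candidate for isolation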

Equipment error recovery provides the operations staff another opportunity to decrease user outage minutes. About 300,000 user outage minutes in the model (2.5% of the total) are directly attributable to equipment that does not reset properly following a power outage. Advance power-cycle testing in the client/server environment could help eliminate this type of outage or at least make it planned downtime instead of unplanned downtime. Recovery from other outages may also be lengthened by error recovery problems, and testing the restoration of systems following simulated failure conditions could help reduce user outage minutes.

Cabling seems to be a continual source of problems in a client/server environment. Although only 1% of the user outage minutes in the model are directly attributable to cable problems such as accidentally disconnected power cords, operators feel that they could decrease recovery time for other outages with better cable layout (consider the difficulty of tracing cables in the spaghetti that often exists in LAN closets).

7.2 Architectural Improvements

As shown in Section 5, servers are the dominant cause of client/server unavailability. The servers used in the model are commodity servers, and the obvious first step in improving client/server availability is to use fault-tolerant servers. Since server software (application and OS) failures are the dominant server outage causes, software fault-tolerance is important in reducing outage minutes. Tandem servers are software fault-tolerant, and we have been gathering outage minute data on Tandem servers since 1992. This data was used to create Figure 7-1. It can be seen from the figure that software fault-tolerant (Tandem) servers reduce user outage minutes due to servers by nearly an order of magnitude and total user outage minutes by more than a factor of 2. In the SW Fault-Tolerant Server stacked column in Figure 7-1, the four categories (server, network, client, and environment and operations) account for relatively equal proportions of user outage minutes. Therefore, additional methods to decrease user outage minutes in all four categories need to be investigated.

[Figure 7-1. The Effect of Tandem Servers on Client/Server Availability - a stacked-column chart of annual user outage minutes (scale 0 to 14,000,000) for commodity servers versus SW fault-tolerant (Tandem) servers, with each column broken down into server, network, client, and environment and ops contributions.]

There are many possible client/server architectures. In the following discussion, we evaluate several modifications to the architecture presented in Section 5. Figure 7-2 (following Table 7-1) depicts some of those alternatives. Each alternative is evaluated by estimating its impact on user outage minutes for every outage cause. The results are shown in Table 7-1, and each alternative is described following the table. The results in Table 7-1 show that significant savings in user outage minutes are possible through implementing various levels of fault tolerance and other architectural improvements. The information in Table 7-1 is valuable for cost-benefit trade-off studies. It quantifies the benefits of availability in terms of user outage minutes, which a company can convert into cost savings using its cost of user outage minutes. Since Table 7-1 is based on client/server outage data applied to the representative client/server architecture described in Section 5.1, a company may want to use the information in the appendices to develop predictions for their specific client/server architecture.

[Table 7-1. User Outage Minutes for Architecture Alternatives - for each alternative (continuous availability servers, then adding a fault-tolerant server LAN connection, a fault-tolerant client LAN connection, a fault-tolerant LAN, client session reestablishment, fault-tolerant clients, UPSs, and online reconfiguration for system moves), the table lists the assumed outage-minute reductions and the resulting user outage minutes. The recoverable assumptions include: reduce server LAN card minutes by 90% and babbling node minutes by 50%, and hub and router card minutes by 15%; reduce client LAN card and babbling node minutes by 50%; reduce most network failure modes by 90-99%, routing algorithm problems by 75%, OS broadcast storm and addressing minutes by 50%, ops cable minutes by 90%, and ops test minutes by 50%; reduce client application minutes by 75% because recovery is much quicker; reduce power outage minutes by 90%; and reduce ops system-move minutes by 90%. FT = fault-tolerant; UPS = uninterruptible power supply.]

The user outage minute reductions shown in Table 7-1 are cumulative. To some extent, the marginal impact of each alternative can be determined by the difference between its user outage minutes and the user outage minutes for the previous alternative. However, some of the alternatives are not independent. For example, the outage minute reduction for the fault-tolerant LAN assumes fault-tolerant server-to-network and client-to-network connections. Therefore, each architecture alternative or combination of alternatives should be considered individually using the spreadsheet in Appendix 1. The sketch below shows the structure of that calculation.
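A minimal sketch of the spreadsheet calculation (Python; the rows are a small illustrative subset of the Appendix 1 data, and an alternative is applied by scaling the rows it affects):

    # Each model row: outage minutes per unit per year, number of units in
    # the system, and users affected per outage. Total user outage minutes
    # is the product, summed over rows.
    rows = {
        # cause:                 (min/unit/yr, units, users affected)
        "server OS crash":        (360, 40, 50),
        "server app crash":       (280, 40, 50),
        "client LAN card":        (4, 1000, 1),
        "router broadcast storm": (120, 6, 400),
    }

    def total_uom(model):
        return sum(m * n * u for m, n, u in model.values())

    print(total_uom(rows))   # baseline user outage minutes for these rows

    # Example alternative: a fault-tolerant LAN that cuts routing algorithm
    # and broadcast storm minutes by 75% (one of the Table 7-1 assumptions).
    ft_lan = dict(rows)
    m, n, u = ft_lan["router broadcast storm"]
    ft_lan["router broadcast storm"] = (m * 0.25, n, u)
    print(total_uom(ft_lan))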

[Figure 7-2. Alternative Client/Server Architectures - four panels: (A) a continuous availability (Tandem) server; (B) a fault-tolerant server LAN connection; (C) a fault-tolerant client LAN connection; (D) a fault-tolerant LAN.]

Continuous Availability (1995 Tandem) Servers (Figure 7-2A)

For commodity servers, unplanned outages account for about 75% of the user outage minutes and planned outages account for about 25%. For software fault-tolerant (Tandem) servers, the ratio is reversed, although Tandem servers have significantly less downtime due to reconfiguration than commodity servers because there are many types of reconfiguration that Tandem servers can do on-line. Therefore, increasing on-line reconfiguration capability is the best way to increase Tandem server availability. We have a number of current programs and products that will provide such a capability. Most of the unplanned outages for Tandem servers are caused by systems that are not taking full advantage of software fault-tolerance, e.g., improper configuration of hardware or software or running a non-fault-tolerant application. We have a number of design and operations services that will help eliminate these types of outages. Tandem servers are becoming continuous availability servers, i.e., fault-tolerant for all outage categories including planned reconfigurations. Our prediction, based on the current planned suite of products and services, is that we can reduce continuous availability (Tandem) server outage minutes by an order of magnitude by 1995.

Add Fault-tolerant Server LAN Connection (Figure 7-2B)

An extension of server fault-tolerance is to make the server connection to the network fault-tolerant. This eliminates outage minutes due to LAN card failures in the server and the server-to-network connection (e.g., in the hub). This approach requires server software that automatically detects the loss of a server-to-network connection and switches to the alternative path without user intervention. The Tandem and Ungermann-Bass NonStop Access for Networking™ products provide these capabilities. A sketch of the detect-and-switch pattern follows.
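A minimal sketch of the detect-and-switch logic (illustrative only; this is not how the NonStop Access for Networking products are necessarily implemented), assuming periodic link-level health probes of the active path:

    MAX_MISSES = 3   # consecutive failed probes before failover

    # Decide which LAN path should carry the server's traffic, given the
    # history of probe results for the active path. The switch happens
    # without user intervention; sessions continue on the standby path.
    def select_path(probe_results, primary="lan-A", standby="lan-B"):
        misses = 0
        for ok in probe_results:
            misses = 0 if ok else misses + 1
            if misses >= MAX_MISSES:
                return standby   # failover: no user action required
        return primary

    print(select_path([True, True, False, True]))      # lan-A (transient miss)
    print(select_path([True, False, False, False]))    # lan-B (failed over)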

Add Fault-tolerant Client LAN Connection (Figure 7-2C)

Fault-tolerant PCs, while available, are very expensive. A more reasonable approach is to have dual-ported or redundant LAN cards in each PC that can take advantage of redundant paths through the network. This eliminates outage minutes due to LAN card failures in the client and the server-to-client connection (e.g., in the hub). This approach is most useful with software that automatically detects the loss of a server-to-client connection and switches to an alternative path without user intervention. The Tandem and Ungermann-Bass NonStop Access for Networking™ products provide these capabilities.

Add Fault-tolerant LAN (Figure 7-2D)

The obvious way to reduce network outages is to make the LAN fault-tolerant. This implies multiple, independent paths connecting all clients and servers. It reduces user outage minutes due to network failures and reconfigurations, and also reduces user outage minutes due to propagating client and server failures such as broadcast storms. This approach requires software that automatically detects the loss of a server-to-client connection and switches to an alternative path without user intervention. The Tandem and Ungermann-Bass NonStop Access for Networking™ products provide these capabilities.

Add Client Session Reestablishment

The most common client failure mode is a transient error in the client, network, or server that requires the user to reboot or restart an application. This generally takes at most a few minutes, but the user then has to reestablish the session and ascertain the status of the outstanding transactions. Reestablishing a session can take a long time if the client lacks suitable information. Then, after the session is reestablished, the client is forced to perform queries to determine the status of transactions that were in process when the session was disrupted. If both the client and its server maintain the information necessary for reestablishing a client session and maintain a transaction log, client downtime can be significantly reduced. After the client reboots and logs in, the server can reestablish the session, provide the status of client transactions, and continue processing transactions as necessary. These capabilities are planned future enhancements for the Tandem RSC product. A sketch of the idea follows.
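A minimal sketch (hypothetical data structures; the planned RSC enhancements may be implemented quite differently): the server keeps a session table and a per-session transaction log, so a rebooted client can recover its context with one call instead of re-querying everything.

    from dataclasses import dataclass, field

    @dataclass
    class Session:
        user: str
        txn_log: dict = field(default_factory=dict)   # txn id -> status

    class Server:
        def __init__(self):
            self.sessions = {}

        def submit(self, user, txn_id):
            s = self.sessions.setdefault(user, Session(user))
            s.txn_log[txn_id] = "in-progress"

        def commit(self, user, txn_id):
            self.sessions[user].txn_log[txn_id] = "committed"

        def reestablish(self, user):
            """Called after the client reboots and logs back in."""
            s = self.sessions.setdefault(user, Session(user))
            return dict(s.txn_log)   # client learns outstanding statuses

    srv = Server()
    srv.submit("alice", 101)
    srv.commit("alice", 101)
    srv.submit("alice", 102)
    print(srv.reestablish("alice"))   # {101: 'committed', 102: 'in-progress'}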

Add Fault-tolerant Clients

Fault-tolerant PCs protect against most PC hardware failures including disk and power failures. However, they do not currently protect against software failures or allow on-line reconfiguration. Since fault-tolerant PCs are expensive, a more reasonable option might be to have extra PCs in a hot standby mode, particularly if the client session reestablishment described in the preceding paragraph can be performed from any client.

Add Uninterruptible Power Supplies (UPSs)

Environmental outages are dominated by commercial power failures. The obvious way to protect against these outages is to provide uninterruptible power supplies (UPSs) and possibly to improve site power filtering and conditioning. An interesting problem is that site consolidation makes it easier to protect against power outages, but the nature of client/server environments tends toward site expansion.

Add Online Reconfiguration for System Moves

The operations outages that affect the entire client/server environment are mainly massive system reconfigurations, such as building moves. A fault-tolerant server, network, and client architecture can help reduce this downtime if it can be configured as two independent client/server systems. Conceptually, half the hardware can be moved while the other half continues to run the application (possibly with degraded performance). The hardware that was moved is then configured and runs the application(s) while the other half of the equipment is being moved.

Acknowledgement

I would like to thank Tim Chou for his encouragement and support in the development of the model and documentation of the results. I would also like to thank my reviewers, especially Rick Biedenweg of Stanford, Bob Horst, and Cindy Sidaris.

References

1. [Feather,92] Feather, Frank, "Fault Detection in an Ethernet Network via Anomaly Detectors", Doctoral Thesis, Carnegie Mellon University, 1992.

2. [Saal,90] Saal, Harry, "LAN Downtime: Clear and Present Danger", Data Communications, March 21, 1990. (Report on a survey by Infonetics called "The Cost of LAN Downtime")

3. [Caginalp,91] Caginalp, Elizabeth G., "Downtime Problems Grow with Networks", CRN Extra, October 15, 1991.

4. [Caginalp,92] Caginalp, Elizabeth G., "The Lowdown on Downtime", CRN Extra, February, 1992.

5. [Fogel,91] Fogel, Avi, and Michael Rothenberg, "LAN wiring hubs can be critical points of failure; but physical layer downtime can be prevented", LAN Times, January 7, 1991. (Report of a study by Forrester Research)

6. [Caldwell,92] Caldwell, Bruce, "Program Lifts Grounded Users", Information Week,June 22, 1992.

7. Knight, Bob, "The Data Pollution Problem", Computerworld, September 28, 1992.

8. [Louzon,92] Louzon, Michelle, "How tolerant can you be?", Computerworld, May 4,1992.

9. DiDio, Laura, "Local net downtime costs users millions, study says", Network World, September 11, 1989. (Report on a survey by Infonetics called "The Cost of LAN Downtime")

10. Ballou, Melinda-Carol, "Survey pegs computer downtime costs at $4 billion", Computerworld, August 10, 1992. (Report on a Find/SVP study done for Stratus)

11. Silverman, Jeff, "Planning a Reliable Network", 1992, pp. 25-27 of a Cisco Systems publication.

12. Bowen, Charles, "Study by 3M Focuses on PC Data Loss", Online Today, October 20, 1992. (Report of an IntelliQuest survey commissioned by 3M)

13. Fibermux Corporation, "Backbone Applications Guide", 1992, p. 38.

14. Communications Week, "Upping UNIX Uptime", December 14, 1992, p. 17.

15. Data from a Tandem customer with 15,000 workstations; complete outage log for 1992. The customer has an experienced and sophisticated operations staff.

16. Data from the help desk of a Tandem customer, primarily workstation OS problems, 1992.

17. LAN/WAN outage data from a Tandem customer, 1991. The customer has a redundant LAN, so there are very few LAN outages.

18. Tandem corporate LAN outage data, 1992-1993.

19. Tandem product data on terminal reliability, 1992.

20. Tandem customer outage data base, 1991-1993.

21. Reliability Ratings, Inc., various reports on hardware MTBF and MTTR during 1992, e.g., "The Reliability and Service Report for the Cluster Environment".

22. MTBF quotes (under non-disclosure agreement) from LAN and server hardware vendors.

23. A 1990 study on workstation availability performed by INPUT for Tandem (unpublished).

24. Lee, Inhwan, Don Tang, and Ravishankar K. Iyer, "Measurement and Analysis of Operating System Fault-tolerance", IEEE Transactions on Reliability, June 1993.

25. [Barlow,75] Barlow, Richard E., and Frank Proschan, Statistical Theory of Reliability and Life Testing, Holt, Rinehart, and Winston, 1975, pp. 190-194.

26. [META,93] META Group, Workgroup Computing Strategies, "LAN Downtime", November 26, 1993, File #340.

Appendix 1 - Client/Server Availability Model Spreadsheet

[The spreadsheet lists, for every failure mode of each device type (PC hardware and software, hub hardware and software, router hardware and software, server hardware, software, and operations, name/security/communication server, operations, environment, and distributed data base), the unscheduled and scheduled outage minutes per unit per year, the number of units in the system, the number of users affected per outage, and the resulting total user outage minutes per year. Only the grand totals are cleanly recoverable from the source: 12,031,800 total user outage minutes per year, of which 8,185,800 are unscheduled and 3,846,000 are scheduled - about 12,000 minutes per user for the 1,000-client environment.]

Appendix 2 - Outage Minutes and Number of Users Affected

Outage Minutes - Unplanned Outages

Each entry gives the outage cause, the outage minutes per year (per unit), the number of users affected, and the derivation of the outage minutes.

Hardware - Server
• CPU, power, etc. (standard items included in MTBF): 240 min/yr; 50 users (25 connected to the server and an estimated 25 others who need remote access). Derivation: Ref 23 says about 2 fails per year due to HW; about 2 hours to fix based on Refs 15 & 18. This is 4x to 8x dumb terminal reliability in Refs 19 & 21, which seems reasonable.
• Disk crash - outage and data restoration (includes viruses): 72 min/yr; 25 users. Derivation: 0.3 fails per year from Ref 12 (equivalent to 4 disks at 100K hrs MTBF each); assume 4 hrs to restore, including backup from tape, based on discussions with ops.
• LAN connection (LAN card/connector): 4 min/yr per card; 25 users. Derivation: same as hub LAN card.
• Active failure (babbling node) - bad card/connection sending garbage packets/noise: 1 min/yr per card; 200 users (subnet - should not propagate past the router). Derivation: same as hub LAN card.

Hardware - Router
• CPU, power, etc. (standard items included in MTBF, including CPU/net manager cards but excluding LAN connection cards): 60 min/yr; 50 users (estimated number of users that need to access a remote server via the router). Derivation: 20K hr MTBF from Ref 13, 18.4K from Ref 11; 3 hours to fix from Ref 15; Ref 11 quotes 0.75 hrs to fix, but 8 hrs is more realistic. Assume 1 fail per 3 years with 3 hours to fix.
• LAN connection (LAN card/connector, either to other routers or to other hubs or workstations): 4 min/yr per card; 50 users. Derivation: same as hub LAN card.
• Active failure (babbling node) - bad LAN or CPU/net manager card/connection (including CPU) sending garbage packets/noise: 1 min/yr per card; 400+ users (all remote accesses affected, and heavy network traffic). Derivation: same as hub LAN card.

Hardware - Hub
• CPU, power, etc. (standard items included in MTBF, including CPU/net manager cards but excluding LAN connection cards): 10 min/yr; 50 users (all users attached to the hub). Derivation: 100K hr MTBF from Ref 13; Ref 22 is similar. About 2 hours to fix from Ref 15. Ref 18 shows hubs at least 4x better than routers. Assume 1 fail per 12 years with 2 hours to fix.
• LAN card connection (LAN card/connector, either to other routers or to other hubs or workstations): 4 min/yr per card; 10 users when connected to users, same as server if connected to a server, same as router if connected to a router. Derivation: 300K hr MTBF from Ref 22; 3 hours to fix from Ref 15, which is about 5 min/yr. Assumed 4 min non-propagating, 1 min propagating - see active failure.
• Individual LAN connection (connections to a user's PC/workstation): 0.4 min/yr; 1 user. Derivation: connector MTBFs > 10M hrs; assumed 10% of LAN card connection.
• Active failure (babbling node) - bad LAN or CPU/net manager card/connection (including net mgr card) sending garbage packets/noise: 1.0 min/yr per card; 200 users (local subnet and anyone needing the attached servers; propagation should stop at the router but may not). Derivation: from Ref 15, 5 of 35 (14%) of LAN card type failures are propagating; from Ref 16, about 50% of bad LAN taps caused propagating failures. Assumed 20% of 5 min = 1 min.

Hardware - PC/Workstation
• CPU, power, etc. (standard items included in MTBF, excluding LAN connection cards): 120 min/yr; 1 user. Derivation: half the server rate and 2x to 4x the dumb terminal rate from Refs 19 & 21.
• Disk crash (includes viruses): 288 min/yr; 1 user. Derivation: Ref 12 says 0.3 fails/yr with a week to recover; assume 16 hours recovery for users in this environment.
• LAN card connection (LAN card/connector): 4 min/yr per card; 1 user. Derivation: same as hub LAN card.
• Active failure (babbling node) - bad LAN card sending garbage packets/noise: 1 min/yr per card; 50 users (all on the server; assume it does not propagate past the server). Derivation: same as hub LAN card.

Software - Server
• OS SW - CPU freeze or other crash, including LAN protocols: 360 min/yr; 50 users. Derivation: 4 fails x 1-2 hrs = 240-480 min from Ref 18; 400 and 2,000 min from Ref 24; the UNIX goal is 60 minutes from Ref 14, implying it must be at least 600 minutes now. Assume 400 minutes, of which 10% is propagating.
• Application SW problems, e.g., local data base crash: 280 min/yr; 50 users. Derivation: 400 minutes from Ref 15. Assume 400 minutes, of which 30% is propagating.
• Faulty error recovery SW extends power or other outages; usually needs at least a reboot: 60 min/yr; 50 users. Derivation: about 1 per server per year in Ref 15 - assume 1 outage of 60 minutes.
• High bandwidth utilization - includes broadcast storm, babbling node, runt or jabber storm, etc., mainly for the OS but possibly also application SW: 40 min/yr; 200 users (subnet and possibly more). Derivation: 10% of Ref 16 failures were of this type.
• Application (or possibly OS) SW problem affecting multiple nodes: 120 min/yr; 200 users (subnet and possibly more). Derivation: about 35% (26 of 73 in Ref 15; used 30%) of application SW problems affected multiple nodes - sometimes only 2 nodes but sometimes many nodes.

Software - Name Server
• Same failure modes, but much more likely to affect multiple nodes than a file server or DB server; a data base problem (lost name, etc.) could cause outages for many nodes: 120 min/yr; 200-1,000 users. Derivation: 30% from Ref 15. Ref 16 includes 3 cases of corrupt DB and 1 name server address conflict; the name server problem took down the entire network.

Software - Security Server
• Same failure modes, but much more likely to affect multiple nodes than a file server or DB server; a data base problem (access denied, etc.) could cause outages for many nodes: 120 min/yr; 200-1,000 users. Derivation: same as name server.

Software - Router
• CPU/net manager card or LAN card freeze: 240 min/yr; 50 users (all who need to access a remote server via the router). Derivation: Ref 18 has 12 outages per year with recovery time usually 15 min but occasionally up to 2 hrs; assume 12 outages x 30 min, 33% of which is propagating. Ref 15 shows about 1-2 outages per router with 2-3 hours per outage. Ref 11 implies 4 outages with 8-hour recovery time.
• Faulty error recovery SW extends power or other outages; usually needs at least a reset: 90 min/yr; 50 users. Derivation: several of these following power outages in Ref 15; assume 3 outages at 30 min each.
• Broadcast storm or routing algorithm problem (e.g., slow updates): 120 min/yr; 400 users (at least 2 subnets and possibly all routers). Derivation: Ref 15 shows 11 of 31 outages were of the propagating type; Ref 18 about 15%. Assume 25%.

Software - Hub
• CPU/net manager card or LAN card freeze: 90 min/yr (9 min per card); CPU card affects 50 users (2 servers), a LAN card affects 10 users when connected to users, same as server if connected to a server, same as router if connected to a router. Derivation: Ref 18 says about 3 per yr per building, about 1 per hub; Ref 15 data shows <1 outage per hub. SW in hubs is much less complex than SW in routers. Ref 15 data says about 100 min per outage. Assume one 100-min outage, 10% propagating, 10 cards.
• Broadcast storm: 10 min/yr (1 min per card); 200+ users (subnet; propagation should stop at the router but may not). Derivation: Ref 15 data says about 10% (5 out of 48).

Software - PC/Workstation
• OS SW - CPU freeze or other crash, including LAN protocols: 180 min/yr; 1 user. Derivation: similar number of crashes as a server, but recovery should be much quicker; assume 50% of server for all problems.
• Application SW problems, e.g., local data base crash: 180 min/yr; 1 user. Derivation: 50% of server. I notice about 2-4 problems a month with recovery < 5 minutes (usually seconds, but I once lost 30 minutes of data entry).
• Faulty error recovery SW extends power or other outages; usually needs at least a reboot: 30 min/yr; 1 user. Derivation: 50% of server.
• High bandwidth utilization - includes broadcast storm, babbling node, runt or jabber storm, etc., mainly for the OS but possibly also application SW: 2 min/yr; 100 users (half of subnet; assume it propagates past the server but not past the router). Derivation: 50% of server would be 20 min, but no incidents in Ref 15 data. Ref 16 has 13 incidents for a total of 790 min for 175 WS over 22 months = 2 min per WS per yr.
• Application (or possibly OS) SW problem affecting multiple nodes, e.g., network paging: 1 min/yr; 50 users (assume all on the same server). Derivation: 50% of server would be 20 min, but no incidents in Ref 15 data. Ref 16 has 1 incident for a total of 480 min for 175 WS over 22 months.

Distributed Data Base
• Corruption, either of data or of links among data, or no access (DB locked up) - an example is failure of the node that holds a lock on a piece of data: 432 min/yr; 1,000 users (all users). Derivation: Ref 16 data includes 3 cases of corrupt data (on a single server - they just did a 2-hour tape reload). From Ref 7, 60% of companies had problems; say 1 problem in the last year = 60% per year, of which say 25% is due to corruption/access = 15% per year. It took a competitor's customer 6 days to recover - the only example I have; assume 2 full days (48 hrs) to recover. 15% x 48 x 60 = 432 min.

Operations
• Improper installs, accidentally bouncing user processes, incorrectly halting system processes, etc.: 135 min/yr per server; 50 users (one server). Derivation: 144 minutes from Ref 20; about 200 minutes from Ref 15. Assume 180 min, 25% multiple site.
• Addressing problems - wrong address given out, users picking their own address (duplicate IP or LAA address), etc.: 45 min/yr per server; 50 users (one server). Derivation: 28 of 95 outages in Ref 16 were this; some Ref 15 outages sound like they could be this. Assume 60 min, 25% multiple node.
• Improper installs, etc., affecting multiple nodes: 45 min/yr per server; 500 users (many users affected, sometimes the entire application). Derivation: Ref 15 data shows 25% (10 of 39) of ops outages were of the multiple node type.
• Addressing problems affecting multiple nodes: 15 min/yr per server; 500 users (many users affected, sometimes the entire application). Derivation: Ref 15 data shows 25% (10 of 39) of ops outages were of the multiple node type.
• Cable problems - maintenance pulling the wrong cable or power cord, users kicking out a cord, etc.: 300 min/yr per site; 1-200 users (PC to router), assume 50 average. Derivation: Ref 15 has 437 minutes per site that mention cable - some is already counted in LAN HW.

Environment
• Power: 360 min/yr per site; 200 users (a subnet is a site in the model). Derivation: 356 outage minutes per site from Ref 15; very site-dependent. Ref 20 has 126 min/yr for environmental failures, but is probably severely under-reported.
• Other - air conditioning, flood, fire, earthquake, etc.: 60 min/yr per site; 200 users (a subnet is a site in the model). Derivation: 58 minutes per site from Ref 15; very site-dependent. Ref 1 had 480 min/yr due to air conditioning failures.

Outage Minutes - Planned Outages

Hardware
• Server - add controller, disk, upgrade CPU, etc.: 240 min/yr; 50 users. Derivation: 1 per year for 4 hours from Ref 18.
• Router - change (e.g., different protocol) or add LAN connection cards, or change CPU or download card, probably for different firmware: 60 min/yr; 50 users (all who need to access a remote server via the router). Derivation: 1 per year for 1 hour from Ref 18.
• Hub - change (e.g., different protocol) or add LAN connection cards, or change CPU or download card, probably for different firmware: 60 min/yr; 10 users when connected to users, same as server if connected to a server, same as router if connected to a router. Derivation: estimate 10 min for the CPU card + 5 min for each of 10 LAN cards.
• PC/Workstation - upgrading CPU, adding ROM, getting a bigger disk: 60 min/yr; 1 user. Derivation: 1 per year for 1 hour - estimate the same number as a server but less time.

Software
• Server - OS update or bug fix, application install or update: 240 min/yr; 50 users. Derivation: 2 per year for 2 hours each, based on Ref 18.
• Server - application SW update across multiple nodes (probably across all nodes for a distributed application): 720 min/yr; 500 users (subnet to all users, possibly the entire application). Derivation: 1 per year for 12 hours, based on Ref 18 and talking to Ref 15 ops personnel.
• Name server - treat as a server.
• Security server - treat as a server.
• Router - router SW update or download: 480 min/yr; 50 users (all who need to access a remote server via the router). Derivation: 6 per year for 1 hour each, based on Ref 18.
• Router - campus reset as a preventative measure: 30 min/yr; 1,000 users. Derivation: Ref 18 contains 10 of these (15 min each) over a 6-month period to prevent a SW bug from causing problems; the bug has since been fixed. No reference to this in other data. Assume 2 a year for 15 min.
• Hub - SW update or download: 60 min/yr; 50 users (all local users). Derivation: 1 per year for 1 hour, based on Ref 18.
• PC/Workstation - OS update or bug fix, application install or update: 180 min/yr; 1 user. Derivation: estimate 6 times a year for 30 minutes each, based on personal experience.

Distributed Data Base
• Maintenance to ensure all files are properly set up and linked: 720 min/yr; 1,000 users (all users). Derivation: assume 1 hour per month. We have estimates of up to 4,000 hrs/yr for users that do lots of DB reconfiguration.

Operations
• Moves: 720 min/yr per subnet; 200 users (1 subnet). Derivation: 1 move per subnet every 2 years, taking 24 hours, based on Ref 18.
• System testing and other preventive maintenance: 120 min/yr per subnet; 200 users (1 subnet). Derivation: 2 tests per subnet per year for 60 minutes each, based on Ref 18.

Distributed by Tandem Computers

Corporate Information Center
10400 N. Tantau Ave., Loc 248-07
Cupertino, CA 95014-0726