  • ABSTRACT

    Application Availability is widely sought after as a requirement for applications delivered over networks. While users know what they want (continuous application access with predictable performance), it's often difficult to establish concrete measures that show whether the service providers charged with delivering the application over a network can meet user requirements. Sifting out relevant, actionable data about availability is often as complicated as making decisions and acting based on that information. In order to structure discussion around definitions and appropriateness of measures, this paper sets forth:

    A definition of application availability

    An approach to decomposing applications for measurement

    A classification of service level indicators

    A presentation of measurement as a mode for service level contracting and feedback

    Generalized requirements for developing synthetic transactions

    1. INTRODUCTION

    Ironically, as the Internet demolishes many of the boundaries between IT and business, the measures that account for their operations are diverging. Traditionally, IT infrastructures, particularly platform resources, have accounted for how much work is getting done by systems using the same metrics as they use for control of resource management: CPU utilization, queries processed per hour, I/O operations, network packets, and the like. However, as systems become more distributed and networked, and as end-users in 24 time zones access systems round the clock, end users want to drive the measures of system availability, since it affects their work immediately and directly.

    Application Availability: An Approach to Measurement

    David M. Fishman

    SunUP High Availability Program

    Office of the CTO

    Sun Microsystems Inc.

  • 2000 SUN MICROSYSTEMS INC., All Rights Reserved

    End-users regard the contribution of IT infrastructure in terms of the value that it delivers, not operational metrics. The renaissance of Service Level Agreements (SLAs), once (like many aspects of centralized internet computing) the province of mainframe host-based environments, is driving IT management and end-users alike to seek a common currency to define their shared objectives, without creating undue operational dependence between their domains.

    That common currency is availability measurement. Application Availability Measurement (AAMe) from the end user perspective does not replace resource management, capacity planning, change management, performance analysis, or any of the many other practices that are the métier of the disciplined mission-critical shop. But in establishing and maintaining the value of an application to its users, none of these other disciplines can represent the system as a whole to the users as they see it.

    This paper is written for IT and line-of-business executives looking for a way to identify meaningful indicators of application availability. It presents a generalized model for creating measures of application availability in user terms, and validating the application's user value. Specifically, it covers:

    An overview of the classic model for quantifying availability

    An approach for choosing what to measure

    An examination of measurement in the context of other feedback techniques

    A classification of different kinds of measures

    Some considerations in developing measures

    There are a number of things I've left out of the scope of this paper. It does not directly address design of highly available, redundant application or system architectures, for a number of reasons. Any treatment of availability architectures short of book length would not do the subject justice. The measurement approach presented here is intended to be fairly architecture-independent. It's up to the availability architecture to determine how it tolerates, detects and recovers from failures at the many different component levels that make up the stack. I assume only that the application has the necessary mechanisms for doing so. Though the perspective is heavily skewed towards validation, there's plenty that needs to be inferred about the underlying design.

    I use the terms "service", "application" and "system" fairly interchangeably; while some might argue for more semantic precision, the three notions are so fluid among themselves that one can argue for each to mean the same as one of the others. But to the extent precision can be applied, I use "system" to represent an end-to-end application environment. A "service" is defined as an application delivered over a network; it is the substrate of measurement for availability. Service-level indicators, or service level indicator metrics, are the result measures from tests that validate the service.

    2. MEASURING AVAILABILITY

    How Available is Available?

    In its classic form, availability is represented as a fraction of total time that a service needs to be up. From a theoretical perspective, it can be quantified as the relationship of failure recovery time (also known as MTTR, mean time to recovery) to the interval between interruptions (MTBF or MTBI, mean time between failures or interruptions). A service that fails once every twenty minutes and takes one minute to recover can be described as having availability of roughly 95%.
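    The relationship between MTBF and MTTR described above can be sketched in a few lines of Python. This is only an illustration, assuming the commonly used steady-state form, availability = MTBF / (MTBF + MTTR); the function name and the two readings of the twenty-minute example are my own assumptions, not definitions taken from this paper.

```python
def availability(mtbf_minutes: float, mttr_minutes: float) -> float:
    """Fraction of time the service is up, given mean uptime between
    failures (MTBF) and mean time to recover (MTTR), in the same unit."""
    if mtbf_minutes <= 0 or mttr_minutes < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_minutes / (mtbf_minutes + mttr_minutes)


# Reading 1 (assumption): 20 minutes of uptime, then 1 minute of recovery.
uptime_then_recovery = availability(20, 1)   # 20/21, a little over 95%

# Reading 2 (assumption): a 20-minute cycle, 1 minute of which is recovery.
recovery_inside_cycle = availability(19, 1)  # 19/20, exactly 95%

print(f"{uptime_then_recovery:.2%}  {recovery_inside_cycle:.2%}")
```

    Either reading lands at about 95%; which one applies depends on whether the interval between interruptions is counted from failure to failure or from recovery to the next failure.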

    For an entire year of uptime (365 days times 24 hours times 60 minutes, equaling 525,600 minutes), uptime can be represented as "nines", as in the chart below.

    One handy way to think of nines in a 365x24 year is in orders of magnitude: Five nines represents five minutes of downtime; four nines represents about 50 minutes; three nines, 500 minutes, etc. Every tenth of a percentage point per year is roughly 500 minutes of downtime. Of course, for services that don't need to operate 24 hours a day, seven days a week, such as factory-floor applications in a single location, the outage minute numbers will vary based on the local operational window.

    AVAILABILITY MEASURE    DOWNTIME PER YEAR    DOWNTIME PER WEEK
    98%                     7.3 days             202.15 minutes
    99%                     87.6 hours           101.08 minutes
    99.5%                   43.8 hours           50.54 minutes
    99.8%                   1,052 minutes        20.22 minutes
    99.9%                   526 minutes          10.11 minutes
    99.95%                  4.38 hours           5.05 minutes
    99.99%                  53 minutes           1.01 minutes
    99.999%                 5 minutes            6.00 seconds

    Figure 1. Table of fractional outages
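    The figures in the table can be reproduced with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the weekly figure is assumed to be the yearly figure divided by 52 (which matches the table's values), and the factory-floor window of 8 hours a day, 5 days a week is an invented example of a local operational window.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365x24 year


def downtime_minutes(availability_pct: float,
                     window_minutes: float = MINUTES_PER_YEAR) -> float:
    """Allowed downtime, in minutes, for an availability target expressed
    as a percentage over the given operational window."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return window_minutes * (100 - availability_pct) / 100


def weekly_downtime_minutes(availability_pct: float) -> float:
    """Yearly downtime spread evenly over 52 weeks (assumed convention)."""
    return downtime_minutes(availability_pct) / 52


for target in (98, 99, 99.9, 99.99, 99.999):
    per_year = downtime_minutes(target)
    per_week = weekly_downtime_minutes(target)
    print(f"{target}%: {per_year:,.1f} min/year, {per_week:.2f} min/week")

# Hypothetical single-site window: 8 hours a day, 5 days a week, 52 weeks.
factory_window = 8 * 60 * 5 * 52
print(f"{downtime_minutes(99.9, factory_window):.1f} min/year at 99.9%")
```

    The same target of 99.9% allows far fewer outage minutes in the shorter window, which is why the operational window needs to be agreed on before the number of nines.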

    It should be readily apparent that getting past 1 minute of downtime per week can be quite an expensive proposition. Redundant systems that double the hardware required (in extreme cases, down to specialized fault-tolerant processes that compare instructions at every clock) and complex software that can handle the redundancy are just the beginning. The skills to deal with the complexity and the system's inability to handle change easily drive up the cost. Moreover, experience shows that people and process issues in such environments cause far more downtime than the systems themselves can prevent. Some IT operations executives are fond of saying that the best way to improve availability is to lock the datacenter door.

    Be that as it may, any foray into high-availability goal-setting should begin with a careful analysis of how much downtime users can really tolerate, and what is the impact of any outage. The "nines" are a tempting target for setting goals; the most common impulse for any casual consumer of these "nines" is to go for a lot of them. Before you succumb to the temptation, bear in mind one thing: you can't decide how much availability you need without first asking "availability of what?" The concepts presented here should better prepare you to answer that question; once you've answered it, you can make more constructive use of your downtime target. As your availability goals mature, you'll find it more productive to choose user downtime targets rather than snappy formulations of uptime.

    Availability Defined: User Relevance and Measurement Utility

    What is the value of application availability? Let's set a definition of availability as continuous application access with predictable performance. In daily life, this is fairly intuitive: call your travel agency, and you don't care whether the servers are up or down, whether the network is saturated or not, or whether the client application can validate your credit card data. To you, the only value of the system is in whether the agent can book your ticket or not, or how long it takes. The value of the service (and the service level metric that indicates whether that value is realized) is measured at the end user's nose.

    Naturally, to the user, the only measure of availability that matters is at the user, whether the user lives and breathes or is some automated consumer of a service. In the online user world, that user's nose is a valuable spot: it represents the point where the application's value is highest, and usually becomes the most useful place to measure application availability. By implication, AAMe increases in value as it more closely approximates user experience. Service level objectives for AAMe must be tightly coupled to the value of the work done with the application.
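    To make the definition concrete, here is a minimal sketch of how availability might be scored from measurements taken at the user's end. Everything in it is an assumption made for illustration: the ProbeResult record, the observed_availability function, the sample data and the two-second response threshold are hypothetical, and a real service level objective would set its own threshold. A sample counts as "available" only if the transaction completed and did so within the agreed response time, reflecting both halves of the definition (continuous access and predictable performance).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProbeResult:
    succeeded: bool          # did the user-level transaction complete?
    response_seconds: float  # elapsed time as seen from the user's end


def observed_availability(results: Sequence[ProbeResult],
                          max_response_seconds: float) -> float:
    """Fraction of samples that both completed and met the response-time
    threshold (the threshold is an assumed service level objective)."""
    if not results:
        raise ValueError("at least one sample is required")
    good = sum(
        1 for r in results
        if r.succeeded and r.response_seconds <= max_response_seconds
    )
    return good / len(results)


# Invented sample data: two good responses, one too slow, one failure.
samples = [
    ProbeResult(True, 0.8),
    ProbeResult(True, 1.2),
    ProbeResult(True, 6.5),
    ProbeResult(False, 30.0),
]
print(observed_availability(samples, max_response_seconds=2.0))
```

    Note that the slow but successful response counts against availability here; a measure that only checked whether the servers were up would have scored it as available.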

    Heisenberg: Measurement and its Discontents

    Can application availability be measured? It's as much a philosophical question as a practical one. Most end-to-end applications are highly complex, dynamic and not deterministic in their behavior; with respect to bits speeding from point to point on the internet, this variation is a feature, not a bug. This also makes it difficult to pinpoint exactly how instrumentation at any given point will provide perfect information about the system's availability. Getting useful (actionable) information is a matter of scoping the end points around which the system may be measured.

    As Heisenberg once showed, measurement distorts the measured event or element,m