Metrics and Techniques for Evaluating the Performability
of Internet Services
Pete
[email protected]
Outline
1. Introduction to performability
2. Performability metrics for Internet services
   • Throughput-based metrics (Rutgers)
   • Latency-based metrics (ROC)
3. Analysis and future directions
Motivation
• Goal of ROC project: develop metrics to evaluate new recovery techniques
• Problem: the concept of availability assumes a system is either “up” or “down” at a given time
• Availability doesn’t capture a system’s capacity to support degraded service
  – degraded performance during failures
  – reduced data quality during high load
What is “performability”?
• A combination of performance and dependability measures
• Classical definition: a probabilistic (model-based) measure of a system’s “ability to perform” in the presence of faults¹
  – Concept from the traditional fault-tolerant systems community, ca. 1978
  – Has since been applied to other areas, but still not in widespread use

1 J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
Performability Example
Discrete-time Markov chain (DTMC) model of a RAID-5 disk array¹

States, where pi(t) = probability that the system is in state i at time t:
  – p0(t): normal operation
  – p1(t): 1 disk failed, repair necessary
  – p2(t): failure – data loss

Transitions, where λ = failure rate of a single disk drive, μ = disk repair rate, and D = number of data disks:
  – (D+1)λ: p0 → p1
  – Dλ: p1 → p2
  – μ: p1 → p0

Rewards: wi(t) = reward associated with state i (disk I/O operations/sec): w0(t), w1(t), w2(t)

1 Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
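To make the reward model concrete, here is a minimal Python sketch that evolves the three-state chain step by step and computes the expected reward Σᵢ pᵢ(t)·wᵢ(t). All rates, the disk count, and the reward values are illustrative assumptions, not figures from the cited thesis.

```python
# Sketch of the RAID-5 performability model from the slide (DTMC form).
# Rates, disk count, and rewards below are illustrative assumptions.
D = 4            # number of data disks (assumed)
lam = 1e-5       # per-step failure probability of one disk (assumed)
mu = 1e-2        # per-step repair probability (assumed)

# Per-step transition matrix over states 0 (normal operation),
# 1 (one disk failed), 2 (data loss, absorbing).
P = [
    [1 - (D + 1) * lam, (D + 1) * lam, 0.0],
    [mu, 1 - mu - D * lam, D * lam],
    [0.0, 0.0, 1.0],
]
w = [1000.0, 400.0, 0.0]      # reward per state: disk I/O ops/sec (assumed)

p = [1.0, 0.0, 0.0]           # start in normal operation
for _ in range(100_000):      # evolve p(t+1) = p(t) P
    p = [sum(p[i] * P[i][j] for i in range(3)) for j in range(3)]

# Performability at time t: expected reward over the state distribution.
expected_reward = sum(pi * wi for pi, wi in zip(p, w))
print(p, expected_reward)
```

The key point of the example: the metric is not “up or down” but a probability-weighted I/O rate, so a faster repair rate (larger μ) raises the expected reward even though failures still occur.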
Performability for Online Services: Rutgers Study
• Rich Martin (UCB alum) et al. wanted to quantify tradeoffs between web server designs, using a single metric for both performance and availability
• Approach:
  – Performed fault injection on PRESS, a locality-aware, cluster-based web server
  – Measured throughput of the cluster during simulated faults and normal operation
Degraded Service During a PRESS Component Fault
[Figure: throughput (requests/sec) vs. time, showing the phases FAILURE → DETECT → STABILIZE → REPAIR (human operator) → RECOVER → RESET (optional)]
Calculation of Average Throughput, Given Faults
Throughput
Time
Degraded throughput
Req
ues
ts/s
ec Average throughput
Normal throughput
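The averaging in the figure is a time-weighted mean over normal and degraded intervals. A minimal sketch, with illustrative throughput and timing values (not the Rutgers measurements):

```python
# Time-weighted average throughput over normal and faulty periods.
# All numbers are illustrative assumptions.
normal_tput = 6000.0      # requests/sec when healthy (assumed)
degraded_tput = 2500.0    # requests/sec during a component fault (assumed)
mttf = 1000.0             # seconds of normal operation between faults (assumed)
mttr = 50.0               # seconds spent in the degraded state (assumed)

# Weight each throughput level by the fraction of time spent at it.
avg_tput = (normal_tput * mttf + degraded_tput * mttr) / (mttf + mttr)
print(avg_tput)
```

The average always falls between the degraded and normal levels, pulled toward whichever state the system spends more time in.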
Behavior of a Performability Metric
[Figure: performability vs. performance during faults, showing the effect of improving degraded performance]
Behavior of a Performability Metric
[Figure: performability vs. MTTR and MTTF, showing the effect of improving component availability (shorter MTTR, longer MTTF)]

Availability = MTTF / (MTTF + MTTR)
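The availability formula is straightforward to evaluate; a tiny sketch with illustrative MTTF/MTTR values:

```python
# Availability from the slide's formula: A = MTTF / (MTTF + MTTR).
# Values are illustrative assumptions.
mttf = 1000.0   # mean time to failure, hours (assumed)
mttr = 2.0      # mean time to repair, hours (assumed)

availability = mttf / (mttf + mttr)
print(availability)
```

Shortening MTTR or lengthening MTTF both push the ratio toward 1, which is why both appear on the figure's axes.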
Behavior of a Performability Metric
[Figure: performability vs. overall performance (includes normal operation), showing the effect of improving overall performance]

Most performability metrics scale linearly as component availability, degraded performance, and overall performance increase.
Results of Rutgers Study: Design Comparisons
[Bar chart: performability (0–90) for each web server version (I-PRESS, TCP-PRESS, ReTCP-PRESS, VIA-PRESS), comparing the original system, reduced human monitoring, and a RAID storage subsystem]
An Alternative Metric: Response Latency
• Originally, performability metrics were meant to capture the end-user experience¹
• Latency better describes the experience of an end user of a web site
  – response time > 8 sec = site abandonment = lost income $$²
• Throughput describes the raw processing ability of a service
  – best used to quantify expenses

1 J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
2 Zona Research and Keynote Systems, The Need for Speed II, 2001
Effect of Component Failure on Response Latency
[Figure: response latency (sec) vs. time between FAILURE and REPAIR, with an abandonment region above the 8 s threshold and a possible “annoyance region” below it]
Issues With Latency As a Performability Metric
• Modeling concerns:
  – Human element: retries and abandonment
  – Queuing issues: buffering and timeouts
  – Unavailability of load balancer due to faults
  – Burstiness of workload
• Latency is more accurately modeled at the service level, rather than end-to-end¹
• Alternate approach: evaluate an existing system

1 M. Merzbacher and D. Patterson, Measuring End-User Availability on the Web: Practical Experience, 2002
Analysis
• Queuing behavior may have a significant effect on latency-based performability evaluation
  – Long component MTTRs = longer waits, lower latency-based score
  – High performance in the normal case = faster queue reduction after repair, higher latency-based score
• More study is needed!
Future Work
• Further collaboration with Rutgers on collecting new measurements for latency-based performability analysis
• Development of more realistic fault and workload models, and other performability factors such as data quality
• Research into methods for conducting automated performability evaluations of web services
Back-of-the-Envelope Latency Calculations
• Attempted to infer average request latency for PRESS servers from the Rutgers data set
  – Required many simplifying assumptions, relying upon knowledge of the PRESS server design
  – Hoped to expose areas in which throughput- and latency-based performability evaluations differ
• Assumptions:
  – FIFO queuing with no timeouts or overflows
  – Independent faults, constant workload (also the case for the throughput-based model)
• Current models do not capture “completeness” of data returned to the user
Comparison of Performability Metrics
[Bar chart: performability (0–35000) for each web server version (I-PRESS, TCP-PRESS, ReTCP-PRESS, VIA-PRESS), comparing latency-based and throughput-based performability]
Rutgers calculations for long-term performability
Goal: a metric that scales linearly with both
  – performance (throughput) and
  – availability [MTTF / (MTTF + MTTR)]

Tn = normal throughput for the server
AI = ideal availability (0.99999)
Average throughput (AT) = Tn during normal operation + per-component throughput during failure
Average availability (AA) = AT / Tn

Performability = Tn × [log(AI) / log(AA)]
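The metric above is easy to compute once Tn and AT are known. A minimal Python sketch of the slide's formula; the throughput numbers passed in are illustrative assumptions, not the Rutgers measurements:

```python
import math

# Sketch of the Rutgers long-term performability metric from the slide:
#   AA = AT / Tn,   Performability = Tn * log(AI) / log(AA)
AI = 0.99999   # ideal availability, from the slide


def performability(tn, at):
    """tn: normal throughput; at: average throughput given faults."""
    aa = at / tn                          # average availability
    return tn * math.log(AI) / math.log(aa)


# Illustrative values: 6000 req/s normal, slightly lower average under faults.
print(performability(6000.0, 5970.0))
```

Both logarithms are negative (AI and AA are below 1), so the score is positive; as AA approaches 1, log(AA) shrinks toward zero and the score grows, rewarding higher availability, while the leading Tn factor keeps the linear scaling with throughput.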
Results of Rutgers study: performance comparison
[Bar chart: throughput in requests/sec (0–7000) for each PRESS version (I-PRESS, TCP-PRESS, ReTCP-PRESS, VIA-PRESS)]
Results of Rutgers study: availability comparison
[Bar chart: % unavailability (0–0.005) for each PRESS version (I-PRESS, TCP-PRESS, ReTCP-PRESS, VIA-PRESS), broken down by component fault: application crash, node freeze, node crash, SCSI hang, SCSI timeout, internal switch, internal link]
Results of Rutgers study: performability comparison
[Bar chart: performability (throughput × scaled availability, 0–60) for each PRESS version (I-PRESS, TCP-PRESS, ReTCP-PRESS, VIA-PRESS)]