
Latency as a Performability Metric: Experimental Results
Pete Broadwell, pbwell@cs.berkeley.edu


Page 1:

Latency as a Performability Metric: Experimental Results

Pete Broadwell, pbwell@cs.berkeley.edu

Page 2:

Outline

1. Motivation and background
   • Performability overview
   • Project summary

2. Test setup
   • PRESS web server
   • Mendosus fault injection system

3. Experimental results & analysis
   • How to represent latency
   • Questions for future research

Page 3:

Performability overview

• Goal of the ROC project: develop metrics to evaluate new recovery techniques

• Performability: a class of metrics that describe how a system performs in the presence of faults
  – First used in the fault-tolerant computing field¹
  – Now being applied to online services

¹ J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994

Page 4:

Example: microbenchmark

RAID disk failure

Page 5:

Project motivation

• Rutgers study: performability analysis of a web server, using throughput
• Other studies (esp. from the HP Labs Storage group) also use response time as a metric
• Assertion: latency and data quality are better than throughput for describing the user experience
• How best to represent latency in performability reports?

Page 6:

Project overview

• Goals:
  1. Replicate the PRESS/Mendosus study with response time measurements
  2. Discuss how to incorporate latency into performability statistics

• Contributions:
  1. Provide a latency-based analysis of a web server's performability (currently rare)
  2. Further the development of more comprehensive dependability benchmarks

Page 7:

Experiment components

• The Mendosus fault injection system
  – From Rutgers (Rich Martin)
  – Goal: low-overhead emulation of a cluster of workstations, injection of likely faults

• The PRESS web server
  – Cluster-based, uses cooperative caching. Designed by Carreira et al. (Rutgers)
  – Perf-PRESS: basic version
  – HA-PRESS: incorporates heartbeats and a master node for automated cluster management

• Client simulators
  – Submit a set number of requests/sec, based on real traces

Page 8:

Mendosus design

[Architecture diagram: a Global Controller (Java) reads fault, LAN-emulation and application config files and coordinates a user-level daemon (Java) on each workstation (real or VM); the daemon sits alongside the applications with a modified NIC driver, a SCSI module and a proc module, and the nodes communicate over an emulated LAN.]

Page 9:

Experimental setup

Page 10:

Fault types

Category    | Fault                 | Possible root cause
Node        | Node crash            | Operator error, OS bug, hardware component failure, power outage
Node        | Node freeze           | OS or kernel module bug
Application | App crash             | Application bug or resource unavailability
Application | App hang              | Application bug or resource contention with other processes
Network     | Link down or flaky    | Broken, damaged or misattached cable
Network     | Switch down or flaky  | Damaged or misconfigured switch, power outage
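To make the campaign concrete, here is a minimal sketch of how these fault scenarios might be encoded and driven, assuming a hypothetical inject_fault hook; the field names are illustrative and this is not Mendosus's actual configuration syntax.

import time

# Hypothetical encoding of the fault campaign; the categories and fault
# names mirror the table above, but the format is illustrative only.
FAULT_SCENARIOS = [
    {"category": "node",        "fault": "node_crash"},
    {"category": "node",        "fault": "node_freeze"},
    {"category": "application", "fault": "app_crash"},
    {"category": "application", "fault": "app_hang"},
    {"category": "network",     "fault": "link_down"},
]

def run_campaign(inject_fault, warmup_secs=60, repair_secs=90):
    """Run one test case per scenario: warm up, inject, then allow repair."""
    for scenario in FAULT_SCENARIOS:
        time.sleep(warmup_secs)    # warm-up period (30-60 s in the study)
        inject_fault(scenario)     # hypothetical injection hook
        time.sleep(repair_secs)    # operator repairs after up to 90 s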

Page 11:

Test case timeline

- Warm-up time: 30-60 seconds

- Time to repair: up to 90 seconds

Page 12:

Simplifying assumptions

• Operator repairs any non-transient failure after 90 seconds
• Web page size is constant
• Faults are independent
• Each client request is independent of all others (no sessions!)
  – Request arrival times are determined by a Poisson process (not self-similar)
• Simulated clients abandon a connection attempt after 2 secs and give up on a page load after 8 secs (see the sketch below)
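The client model under these assumptions can be sketched as follows; the request rate and the serve_fn placeholder (standing in for the measured behaviour of the PRESS cluster) are assumptions for illustration, not the trace-driven client used in the study.

import random

CONNECT_ABANDON_SECS = 2.0   # client gives up on the connection attempt
PAGELOAD_GIVEUP_SECS = 8.0   # client gives up on the page load

def simulate_clients(duration_secs, req_per_sec, serve_fn):
    """Generate Poisson request arrivals and classify each request.

    serve_fn(t) -> (connect_secs, total_secs) for a request arriving at
    time t; a placeholder for the behaviour of the web server under test.
    """
    t = 0.0
    counts = {"success": 0, "aborted": 0, "timed_out": 0}
    while True:
        t += random.expovariate(req_per_sec)   # exponential gaps => Poisson arrivals
        if t >= duration_secs:
            break
        connect_secs, total_secs = serve_fn(t)
        if connect_secs > CONNECT_ABANDON_SECS:
            counts["aborted"] += 1             # abandoned after 2 s of connecting
        elif total_secs > PAGELOAD_GIVEUP_SECS:
            counts["timed_out"] += 1           # gave up on the page load after 8 s
        else:
            counts["success"] += 1
    return counts

Binning these counts per 10-second interval gives the kind of throughput panels shown in the sample results below.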

Page 13:

Sample result: app crash

[Four-panel figure: for Perf-PRESS (left) and HA-PRESS (right), throughput (# requests per interval: Success, Aborted (2s), Timed out (8s)) and average response time (secs), each plotted against time elapsed (secs) over a 180-second run.]

Page 14:

Sample result: node hang

[Four-panel figure: for Perf-PRESS (left) and HA-PRESS (right), throughput (# requests per interval: Success, Aborted (2s), Timed out (8s)) and average response time (secs), each plotted against time elapsed (secs) over a 410-second run.]

Page 15:

Representing latency

• Total seconds of wait time
  – Not good for comparing cases with different workloads
• Average (mean) wait time per request
  – OK, but requires that the expected (normal) response time be given separately
• Variance of wait time
  – Not very intuitive to describe. Also, a read-only workload means that all variance is toward longer wait times anyway

Page 16:

Representing latency (2)

• Consider "goodput"-based availability:

  availability = total responses served / total requests

• Idea: latency-based "punctuality" (sketched below):

  punctuality = ideal total latency / actual total latency

• Like goodput, its maximum value is 1
• "Ideal" total latency: average latency for the non-fault case x total # of requests (shouldn't be 0)
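A minimal sketch of both indices, assuming each completed request is logged as a latency in seconds (the variable names are illustrative, not the test harness's actual bookkeeping):

def availability(responses_served, total_requests):
    """Throughput-based ("goodput") availability: fraction of requests answered."""
    return responses_served / total_requests

def punctuality(latencies_secs, ideal_latency_secs, total_requests):
    """Latency-based punctuality: ideal total latency / actual total latency.

    ideal_latency_secs is the average response time measured in fault-free
    runs (and must be greater than 0); the index shrinks toward 0 as the
    actual total latency grows.
    """
    ideal_total = ideal_latency_secs * total_requests
    actual_total = sum(latencies_secs)
    return ideal_total / actual_total

One choice this sketch glosses over: aborted and timed-out requests have to be assigned some latency (for example, their 2 s or 8 s cutoff) before being summed into the actual total.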

Page 17:

Representing latency (3)

• Aggregate punctuality ignores brief, severe spikes in wait time (bad for user experience)
  – Can capture these in a separate statistic (e.g., 1% of 100k responses took >8 sec), as in the sketch below
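A one-function sketch of such a tail statistic; the 8-second threshold mirrors the client give-up time, and the names are illustrative:

def tail_fraction(latencies_secs, threshold_secs=8.0):
    """Fraction of responses slower than a threshold (e.g. 1% took > 8 sec)."""
    slow = sum(1 for latency in latencies_secs if latency > threshold_secs)
    return slow / len(latencies_secs)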

Page 18:

Availability and punctuality

[Two bar charts: "Throughput-based availability of PRESS versions" (availability index) and "Latency-based 'punctuality' of PRESS versions" (punctuality index), each comparing Perf-PRESS and HA-PRESS across the fault scenarios: app hang, app crash, node crash, node freeze, link down.]

Page 19:

Other metrics

• Data quality, latency and throughput are interrelated
  – Is a 5-second wait for a response "worse" than waiting 1 second to get a "try back later"?

• To combine DQ, latency and throughput, can use a "demerit" system (proposed by Keynote)¹
  – These can be very arbitrary, so it's important that the demerit formula be straightforward and publicly available

¹ Zona Research and Keynote Systems, The Need for Speed II, 2001

Page 20:

Sample demerit system

• Rules:
  – Each aborted (2s) connection: 2 demerits
  – Each connection error: 1 demerit
  – Each user timeout (8s): 8 demerits
  – Each second of total latency above the ideal level: (1 demerit / total # requests) x scaling factor (see the sketch below)
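A direct sketch of these rules, assuming per-run counts and total latencies are already available (the scaling factor is a free parameter, as on the slide):

def demerits(aborted, conn_errors, timeouts,
             actual_total_latency_secs, ideal_total_latency_secs,
             total_requests, scaling_factor=1.0):
    """Combine throughput- and latency-related penalties into one score."""
    score = 2 * aborted          # each aborted (2 s) connection: 2 demerits
    score += 1 * conn_errors     # each connection error: 1 demerit
    score += 8 * timeouts        # each user timeout (8 s): 8 demerits
    # Each second of total latency above the ideal level costs
    # (1 demerit / total # requests) x scaling factor.
    excess_secs = max(0.0, actual_total_latency_secs - ideal_total_latency_secs)
    score += excess_secs * (1.0 / total_requests) * scaling_factor
    return score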

[Stacked bar chart: demerits for Perf-PRESS and HA-PRESS under each fault scenario (app hang, app crash, node crash, node freeze, link down), split into latency and throughput components.]

Page 21:

Online service optimization

[Conceptual diagram: a design space spanned by performance metrics (throughput, latency & data quality), environment (workload & faults), and cost of operations & components. Its corners range from "cheap, fast & flaky" and "expensive, fast & flaky" through "cheap & robust, but slow" and "expensive & robust, but slow" to "expensive, robust and fast" and the optimum, "cheap, robust & fast".]

Page 22:

Conclusions

• Latency-based punctuality and throughput-based availability give similar results for a read-only web workload

• The applied workload is very important
  – Reliability metrics do not (and should not) reflect maximum performance/workload!

• Latency did not degrade gracefully in proportion to workload
  – At high loads, PRESS "oscillates" between full service and 100% load shedding

Page 23:

Further Work

• Combine test results & predicted component failure rates to get long-term performability estimates (are these useful?)

• Further study will benefit from more sophisticated client & workload simulators

• Services that generate dynamic content should lead to more interesting data (e.g., RUBiS)

Page 24:

Latency as a Performability Metric: Experimental Results

Pete Broadwell, pbwell@cs.berkeley.edu

Page 25:

Example: long-term model

Discrete-time Markov chain (DTMC) model of a RAID-5 disk array¹

p_i(t) = probability that the system is in state i at time t
w_i(t) = reward rate in state i (disk I/O operations/sec)
λ = failure rate of a single disk drive
μ = disk repair rate
D = number of data disks

[State diagram: state 0 "normal operation" (reward w_0(t)) transitions to state 1 "1 disk failed, repair necessary" (reward w_1(t)) at rate (D+1)λ; state 1 returns to state 0 at the repair rate μ and transitions to state 2 "failure: data loss" (reward w_2(t)) at rate Dλ.]

¹ Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
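A minimal numerical sketch of this model: iterate the state distribution one step at a time and accumulate the expected reward (the performability measure). It treats λ and μ as per-step probabilities and uses made-up parameter values for illustration, not figures from Kari's thesis.

import numpy as np

def raid5_performability(lam, mu, D, w, steps):
    """Transient analysis of the three-state RAID-5 DTMC.

    lam   : per-step failure probability of a single disk
    mu    : per-step repair probability
    D     : number of data disks
    w     : reward rates [w0, w1, w2] (I/O ops per step) for the states
            normal operation / one disk failed / data loss
    steps : number of time steps to evaluate
    """
    # Per-step transition matrix over states 0, 1, 2 (rows sum to 1;
    # assumes (D + 1) * lam <= 1 and mu + D * lam <= 1).
    P = np.array([
        [1 - (D + 1) * lam, (D + 1) * lam,    0.0],
        [mu,                1 - mu - D * lam, D * lam],
        [0.0,               0.0,              1.0],   # data loss is absorbing
    ])
    p = np.array([1.0, 0.0, 0.0])   # start in the normal-operation state
    expected_reward = 0.0
    for _ in range(steps):
        expected_reward += p @ np.asarray(w, dtype=float)
        p = p @ P
    return p, expected_reward

# Illustrative call (all values made up):
# p_final, reward = raid5_performability(lam=1e-5, mu=0.5, D=7,
#                                        w=[300.0, 220.0, 0.0], steps=24 * 365)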