
Applied Performance Theory

kavya719

kavya

applying performance theory to practice

performance and capacity questions:

• What's the additional load the system can support without degrading response time?

• What are the system's utilization bottlenecks?

• What's the impact of a change on response time / maximum throughput?

• How many additional servers are needed to support 10x load?

• Is the system over-provisioned?

YOLO method

load simulation: stressing the system to empirically determine actual performance characteristics and bottlenecks. Can be incredibly powerful.

performance modeling

real-world system → (model as) → theoretical model → (analyze) → results → (translate back to the real-world system)

A model makes assumptions about the system: the request arrival rate, the service order, the service times. You cannot apply the results if your system does not satisfy those assumptions.

outline:

• a single server: open and closed queueing systems; the utilization law, Little's law, the P-K formula; CoDel, adaptive LIFO

• a cluster of many servers: the USL, scaling bottlenecks

• stepping back: the role of performance modeling

a single server

model I: clients → web server

"how can we improve the mean response time?"

"what's the maximum throughput of this server, given a response time target?"

[graph: response time (ms) vs. throughput (requests/second), with a horizontal response time threshold line]

model the web server as a queueing system

request → [queue → web server] → response

queueing delay + service time = response time

assumptions:

1. requests are independent and random, arriving at some "arrival rate"

2. requests are processed one at a time, in FIFO order; requests queue if the server is busy ("queueing delay")

3. the "service time" of a request is constant


"What's the maximum throughput of this server?" i.e., given a response time target.

Utilization Law: utilization ("busyness") = arrival rate × service time

as the arrival rate increases, server utilization increases linearly.

[graph: utilization vs. arrival rate, a straight line]
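As a quick worked example (illustrative numbers, not from the talk): at an arrival rate of 50 requests/second and a mean service time of 10 ms, U = 50 × 0.010 = 0.5, i.e., the server is busy half the time; at 95 requests/second it would be 0.95.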

and as utilization increases, P(request has to queue) increases, so the mean queue length increases, so the mean queueing delay increases.

P-K formula

Pollaczek-Khinchine (P-K) formula:

mean queueing delay = [U / (1 − U)] × linear fn(mean service time) × quadratic fn(service time variability)

assuming constant service time (and so constant request sizes):

mean queueing delay ∝ U / (1 − U)

[graphs: queueing delay vs. utilization (U), and, since response time ∝ queueing delay, response time vs. utilization (U); both grow non-linearly as U → 1]
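To make that shape concrete, here is a minimal Python sketch (not from the talk; numbers are illustrative) of the M/G/1 form of the P-K formula, mean queueing delay = arrival_rate × E[S²] / (2 × (1 − U)):

    def pk_mean_queueing_delay(arrival_rate, mean_service_time, service_time_variance):
        """Mean queueing delay for an M/G/1 queue (Pollaczek-Khinchine formula)."""
        utilization = arrival_rate * mean_service_time
        assert utilization < 1, "the queue is unstable at U >= 1"
        # E[S^2] = Var(S) + E[S]^2: this is where service time variability enters.
        second_moment = service_time_variance + mean_service_time ** 2
        return (arrival_rate * second_moment) / (2 * (1 - utilization))

    # Constant 10 ms service time (variance 0): delay grows like U / (1 - U).
    for rate in (10, 50, 80, 95):  # requests/second
        print(rate, pk_mean_queueing_delay(rate, 0.010, 0.0))

    # Exponential service times (Var(S) = E[S]^2) double the delay at the same U,
    # which is why reducing variability (e.g., by batching) helps.
    for rate in (10, 50, 80, 95):
        print(rate, pk_mean_queueing_delay(rate, 0.010, 0.010 ** 2))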

So, "What's the maximum throughput of this server, given a response time target?": as the arrival rate increases, server utilization increases linearly (Utilization Law), mean queueing delay increases non-linearly (P-K formula), and so response time does too.

[graph: response time (ms) vs. throughput (requests/second); flat in the low utilization regime, rising sharply in the high utilization regime; max throughput is where the curve crosses the response time threshold]

"How can we improve the mean response time?"

1. response time ∝ queueing delay, so prevent requests from queueing too long:

• Controlled Delay (CoDel), in Facebook's Thrift framework (pseudocode and a sketch below)

• adaptive or always LIFO, in Facebook's PHP runtime and Dropbox's Bandaid reverse proxy

• set a max queue length

• client-side concurrency control

CoDel-style queue timeout (pseudocode from the talk):

    onNewRequest(req, queue):
        if queue.lastEmptyTime() < (now - N ms):
            # Queue was last empty more than N ms ago;
            # set a short timeout: M << N ms.
            timeout = M ms
        else:
            # Else, set timeout to N ms.
            timeout = N ms
        queue.enqueue(req, timeout)

key insight: queues are typically empty. This scheme allows short bursts but prevents standing queues.
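A runnable Python sketch of this adaptive-timeout idea (the class and method names are assumptions for illustration, not Facebook's implementation; per the CoDel quote in the references, M ≈ 5 ms and N ≈ 100 ms tend to work well without per-service tuning):

    import time
    from collections import deque

    M_MS = 5    # short timeout, used while a standing queue exists
    N_MS = 100  # normal timeout; also the "recently empty" window

    class AdaptiveTimeoutQueue:
        """Hypothetical CoDel-style queue: shrink the timeout when the queue
        hasn't been empty recently, shedding load instead of letting a
        standing queue build up."""

        def __init__(self):
            self._items = deque()               # (request, deadline) pairs
            self._last_empty = time.monotonic()

        def on_new_request(self, req):
            now = time.monotonic()
            if self._last_empty < now - N_MS / 1000:
                timeout_ms = M_MS   # queue has been standing; expire requests fast
            else:
                timeout_ms = N_MS   # queue was empty recently; allow a short burst
            self._items.append((req, now + timeout_ms / 1000))

        def next_request(self):
            """Drop expired requests; return the next live one, or None."""
            now = time.monotonic()
            while self._items:
                req, deadline = self._items.popleft()
                if not self._items:
                    self._last_empty = now
                if deadline >= now:
                    return req
            self._last_empty = now
            return None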

adaptive LIFO: serve the newest requests first, not old requests that are likely to expire. This helps when the system is overloaded, and makes no difference when it's not.

"How can we improve the mean response time?"

2. per the P-K formula, mean queueing delay = [U / (1 − U)] × linear fn(mean service time) × quadratic fn(service time variability), so:

decrease the service time, by optimizing application code.

decrease service time variability, for example by batching requests.

model II

an industry site with N sensors uploads data to a server in the cloud; the server processes data from the N sensors. Each sensor runs:

    while true:
        # upload synchronously
        ack = upload(data)
        # update state, then sleep for Z seconds
        deleteUploaded(ack)
        sleep(Z seconds)

• requests are synchronized

• fixed number of clients

so throughput depends on response time, and queue length is bounded (≤ N), so response time is bounded.

This is called a closed system: super different from the previous web server model (an open system).

[diagram: N clients in a loop with the server: request → server → response]

response time vs. load for closed systems

assumptions:

1. sleep time ("think time") is constant

2. requests are processed one at a time, in FIFO order

3. service time is constant

Like earlier, as the number of clients (N) increases, throughput increases up to a point, i.e., until utilization is high; after that, increasing N only increases queueing.

[graph: throughput vs. number of clients; rising in the low utilization regime, flat in the high utilization regime]

What happens to response time in the high utilization regime?

Little's Law for closed systems

the "system" here is the entire loop: the N clients plus the server.

a request can be in one of three states in the system: sleeping (on the device), waiting (in the server queue), or being processed (in the server). The total number of requests in the system counts requests across all three states.

Little's Law for closed systems:

requests in system = throughput × round-trip time of a request across the whole system

N = throughput × (sleep time + response time)

applying it in the high utilization regime (constant throughput), and assuming constant sleep time:

N = constant × response time

So response time only grows linearly with N!
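A tiny worked sketch (illustrative numbers, not from the talk) of what Little's Law says about the sensor loop: N = X × (sleep time + response time), so the offered load is bounded at X = N / (Z + R).

    def closed_system_throughput(n_clients, sleep_s, response_s):
        """Little's Law for the closed loop: X = N / (Z + R)."""
        return n_clients / (sleep_s + response_s)

    # 100 sensors sleeping 10 s between uploads, 100 ms response time:
    print(closed_system_throughput(100, 10.0, 0.1))   # ~9.9 requests/second

    # Even if the server slows to 1 s responses, offered load barely drops;
    # it cannot blow up the queue the way an open system's arrivals can:
    print(closed_system_throughput(100, 10.0, 1.0))   # ~9.1 requests/second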

So, response time for a closed system:

[graph: response time vs. number of clients; in the low utilization regime response time stays about the same; in the high utilization regime it grows linearly with N]

This is way different than for an open system:

[graphs: for the closed system, response time vs. number of clients grows linearly in the high utilization regime; for the open system, response time vs. arrival rate grows non-linearly in the high utilization regime]

open vs. closed systems

closed systems are very different from open systems in:

• how throughput relates to response time

• response time versus load, especially in the high load regime

uh oh… standard load simulators typically mimic closed systems, but the system with real users may not be one.

So load simulation might predict:

• lower response times than the actual system yields

• better tolerance to request size variability

• other differences you probably don't want to find out in production…

A couple of neat papers on the topic and workarounds: "Open Versus Closed: A Cautionary Tale" and "How to Emulate Web Traffic Using Standard Load Testing Tools".

a cluster of servers

clients → load balancer → cluster of web servers

capacity planning: "How many servers do we need to support a target throughput?" (while keeping response time the same)

scalability: "How can we improve how the system scales?"

Is the max throughput of a cluster of N servers just max single-server throughput × N? No: systems don't scale linearly.

• contention penalty, due to serialization for shared resources; grows linearly with N (the αN term). Examples: database contention, lock contention.

• crosstalk penalty, due to coordination for coherence; grows quadratically with N (the βN² term). Examples: servers coordinating to synchronize mutable state.

Universal Scalability Law (USL):

throughput of N servers = N / (C + αN + βN²)

• N / C: linear scaling

• N / (C + αN): contention only

• N / (C + αN + βN²): contention and crosstalk
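A minimal Python sketch of the talk's USL form (the coefficients here are illustrative; in practice you would fit α, β, C to measured throughput data):

    def usl_throughput(n, alpha, beta, c):
        """Talk's USL form: throughput(N) = N / (C + alpha*N + beta*N^2)."""
        return n / (c + alpha * n + beta * n ** 2)

    # With a little contention (alpha) and crosstalk (beta),
    # throughput peaks and then degrades as N grows:
    for n in (1, 10, 50, 100, 200):
        print(n, round(usl_throughput(n, alpha=0.03, beta=0.0002, c=1.0), 1))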

"How can we improve how the system scales?"

Avoid contention (serialization) and crosstalk (synchronization):

• smarter data partitioning: smaller partitions in Facebook's TAO cache

• smarter aggregation: in Facebook's SCUBA data store

• better load balancing strategies: best of two random choices

• fine-grained locking

• MVCC databases

• etc.

stepping back: the role of performance modeling

modeling requires assumptions that may be difficult to validate in practice, but it gives us a rigorous framework to:

• determine what experiments to run: the experiments needed to get data to fit the USL curve, response time graphs

• interpret and evaluate the results: e.g., when load simulations predicted better results than your system shows

• decide which improvements give the biggest wins: improve mean service time, reduce service time variability, remove crosstalk, etc.

Performance modeling is most useful in conjunction with empirical analysis, e.g., load simulation experiments.


example: load simulation results with an increasing number of virtual clients (N) = 1 … 100.

[graph: response time vs. number of clients, with the wrong shape for a response time curve; it should look like one of the two curves above]

The response time curve has the wrong shape for either an open or a closed system, so… the load simulator hit a bottleneck.


kavya719

speakerdeck.com/kavya719/applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this.

References:

• Performance Modeling and Design of Computer Systems, Mor Harchol-Balter

• Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz

• Open Versus Closed: A Cautionary Tale

• How to Emulate Web Traffic Using Standard Load Testing Tools

• Queueing Theory in Practice

• Fail at Scale

• Kraken: Leveraging Live Traffic Tests

• SCUBA: Diving into Data at Facebook

On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."

Using LIFO to select the thread to run next, to reduce mutex cache thrashing and context switching overhead.


Aside: utilization ("busyness") = throughput × service time (Utilization Law). As throughput increases, utilization increases, and queueing delay increases (non-linearly), so response time does too.

…is this good, or is there a bottleneck? Facebook sets target cluster capacity = 93% of theoretical; so if measured cluster capacity is ~90% of theoretical, there's a bottleneck to fix.

non-linear responses to load (throughput vs. latency); non-linear scaling (throughput vs. concurrency); microservices: systems are complex; continuous deploys: systems are in flux.

load generation needs a representative workload: the request profile (read / write requests) and the arrival pattern, including traffic bursts. Capture-and-replay is one option… or use live traffic.

traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic sent to a cluster / region / server.

Page 2: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

kavya

applyingperformance theory

to practice

performance

capacity

bull Whatrsquos the additional load the system can support without degrading response time

bull Whatrsquore the system utilization bottlenecks

bull Whatrsquos the impact of a change on response timemaximum throughput

bull How many additional servers to support 10x load

bull Is the system over-provisioned

YOLO method

load simulation Stressing the system to empirically determine actual performance characteristics bottlenecksCan be incredibly powerful

performance modeling

performance modeling

real-world system theoretical model

resultsanalyze

translate back

model as

makes assumptions about the system request arrival rate service order times cannot apply the results if your system does not satisfy them

a cluster of many servers the USL

scaling bottlenecks

a single server open closed queueing systems

utilization law Littlersquos law the P-K formula CoDel adaptive LIFO

stepping back the role of performance modeling

a single server

model I

clients web server

ldquohow can we improve the mean response timerdquo

ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo

resp

onse

tim

e (m

s)

throughput (requests second)

response time threshold

model the web server as a queueing system

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases

utilization = arrival rate service time

ldquobusynessrdquo

utili

zatio

n

arrival rate

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

Pollaczek-Khinchine (P-K) formula

mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

assuming constant service time and so request sizes

mean queueing delay prop U (1 - U)

utilization (U)

resp

onse

tim

esince response time prop queueing delay

utilization (U)

queu

eing

del

ay

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

low utilization regime

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

max throughput

low utilization regime

high utilization regime

ldquoHow can we improve the mean response timerdquo

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

ldquoHow can we improve the mean response timerdquo

onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 3: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

applyingperformance theory

to practice

performance

capacity

bull Whatrsquos the additional load the system can support without degrading response time

bull Whatrsquore the system utilization bottlenecks

bull Whatrsquos the impact of a change on response timemaximum throughput

bull How many additional servers to support 10x load

bull Is the system over-provisioned

YOLO method

load simulation Stressing the system to empirically determine actual performance characteristics bottlenecksCan be incredibly powerful

performance modeling

performance modeling

real-world system theoretical model

resultsanalyze

translate back

model as

makes assumptions about the system request arrival rate service order times cannot apply the results if your system does not satisfy them

a cluster of many servers the USL

scaling bottlenecks

a single server open closed queueing systems

utilization law Littlersquos law the P-K formula CoDel adaptive LIFO

stepping back the role of performance modeling

a single server

model I

clients web server

ldquohow can we improve the mean response timerdquo

ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo

resp

onse

tim

e (m

s)

throughput (requests second)

response time threshold

model the web server as a queueing system

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases

utilization = arrival rate service time

ldquobusynessrdquo

utili

zatio

n

arrival rate

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

Pollaczek-Khinchine (P-K) formula

mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

assuming constant service time and so request sizes

mean queueing delay prop U (1 - U)

utilization (U)

resp

onse

tim

esince response time prop queueing delay

utilization (U)

queu

eing

del

ay

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

low utilization regime

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

max throughput

low utilization regime

high utilization regime

ldquoHow can we improve the mean response timerdquo

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

ldquoHow can we improve the mean response timerdquo

onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 4: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

performance

capacity

bull Whatrsquos the additional load the system can support without degrading response time

bull Whatrsquore the system utilization bottlenecks

bull Whatrsquos the impact of a change on response timemaximum throughput

bull How many additional servers to support 10x load

bull Is the system over-provisioned

YOLO method

load simulation Stressing the system to empirically determine actual performance characteristics bottlenecksCan be incredibly powerful

performance modeling

performance modeling

real-world system theoretical model

resultsanalyze

translate back

model as

makes assumptions about the system request arrival rate service order times cannot apply the results if your system does not satisfy them

a cluster of many servers the USL

scaling bottlenecks

a single server open closed queueing systems

utilization law Littlersquos law the P-K formula CoDel adaptive LIFO

stepping back the role of performance modeling

a single server

model I

clients web server

ldquohow can we improve the mean response timerdquo

ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo

resp

onse

tim

e (m

s)

throughput (requests second)

response time threshold

model the web server as a queueing system

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases

utilization = arrival rate service time

ldquobusynessrdquo

utili

zatio

n

arrival rate

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

Pollaczek-Khinchine (P-K) formula

mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

assuming constant service time and so request sizes

mean queueing delay prop U (1 - U)

utilization (U)

resp

onse

tim

esince response time prop queueing delay

utilization (U)

queu

eing

del

ay

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

low utilization regime

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

max throughput

low utilization regime

high utilization regime

ldquoHow can we improve the mean response timerdquo

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

ldquoHow can we improve the mean response timerdquo

onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

By Little's Law:

requests in system = throughput × round-trip time of a request across the whole system
                   = throughput × (sleep time + response time)

where, as before, queueing delay + service time = response time.

Applying it in the high utilization regime (constant throughput) and assuming constant sleep time:

N = constant × (sleep time + response time)

So response time only grows linearly with N!
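A quick numeric sketch of that argument; the saturated throughput and sleep time below are made-up values, not from the talk:

    X = 1000.0   # assumed saturated throughput in the high utilization regime, reqs/sec
    Z = 0.5      # assumed constant sleep ("think") time per client, seconds

    def closed_response_time(n_clients, throughput, sleep_time):
        # Little's Law over the whole loop: N = X * (Z + R)  =>  R = N / X - Z
        return n_clients / throughput - sleep_time

    for n in (600, 1000, 2000, 4000):
        print(f"N={n}: response time ~ {closed_response_time(n, X, Z) * 1000:.0f} ms")
    # N=600: 100 ms, N=1000: 500 ms, N=2000: 1500 ms ... linear in N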

response time vs load for closed systems

So, response time for a closed system:

[graph: response time vs number of clients; stays ~same in the low utilization regime, grows linearly with N in the high utilization regime]

…way different than for an open system, where response time grows non-linearly with arrival rate in the high utilization regime:

[graph: response time vs arrival rate for an open system; ~flat at low utilization, blows up non-linearly at high utilization]

open vs closed systems

closed systems are very different from open systems in:
• how throughput relates to response time
• response time versus load, especially in the high load regime

uh oh… standard load simulators typically mimic closed systems, but the system with real users may not be one.

So load simulation might predict:
• lower response times than the actual system yields
• better tolerance to request size variability
• other differences you probably don't want to find out in production…

A couple of neat papers on the topic and workarounds: "Open Versus Closed: A Cautionary Tale", and "How to Emulate Web Traffic Using Standard Load Testing Tools".
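To see why a standard load generator is a closed system, here's a minimal sketch: each virtual client cannot issue its next request until the previous response has come back, so the arrival rate automatically slows down when the server slows down (the URL and parameters are placeholders):

    import threading, time, urllib.request

    def virtual_client(url, think_time_s, n_requests):
        for _ in range(n_requests):
            urllib.request.urlopen(url).read()  # blocks until the response arrives
            time.sleep(think_time_s)            # "think time" before the next request

    def run_load_test(url, n_clients=50, think_time_s=0.5, n_requests=100):
        threads = [threading.Thread(target=virtual_client,
                                    args=(url, think_time_s, n_requests))
                   for _ in range(n_clients)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    # a real open workload would instead fire requests on a fixed schedule,
    # regardless of whether earlier responses have returned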

a cluster of servers

[diagram: clients → load balancer → cluster of web servers]

capacity planning: "How many servers do we need to support a target throughput?" (while keeping response time the same)

scalability: "How can we improve how the system scales?"

max throughput of a cluster of N servers = max single server throughput × N?

no, systems don't scale linearly:

• contention penalty, which grows as αN: due to serialization for shared resources. examples: database contention, lock contention.
• crosstalk penalty, which grows as βN²: due to coordination for coherence. examples: servers coordinating to synchronize mutable state.

Universal Scalability Law (USL)

throughput of N servers = N / (αN + βN² + C)

[graph: throughput vs N for three cases]
• linear scaling: N / C
• contention only: N / (αN + C)
• contention and crosstalk: N / (αN + βN² + C)
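A tiny sketch of the slide's form of the USL; the α, β, and C values below are made up for illustration:

    def usl_throughput(n, alpha, beta, c):
        # throughput of N servers = N / (alpha*N + beta*N^2 + C)
        return n / (alpha * n + beta * n * n + c)

    alpha, beta, c = 0.03, 0.0005, 1.0   # assumed contention, crosstalk, baseline
    best_n = max(range(1, 201), key=lambda n: usl_throughput(n, alpha, beta, c))
    print(f"throughput peaks at N={best_n} servers")
    # with beta > 0, adding servers past the peak *decreases* throughput:
    # crosstalk (the N^2 term) eventually dominates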

"How can we improve how the system scales?"

Avoid contention (serialization) and crosstalk (synchronization):
• smarter data partitioning: smaller partitions in Facebook's TAO cache
• smarter aggregation: in Facebook's SCUBA data store
• better load balancing strategies: best of two random choices (see the sketch after this list)
• fine-grained locking
• MVCC databases
• etc.
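A minimal sketch of "best of two random choices"; using queue depth as the load signal is an assumption here, real balancers may use in-flight requests or latency instead:

    import random

    def pick_server(servers, queue_depth):
        # sample two distinct servers, route to the less loaded one: far better
        # tail behavior than plain random, without the herding (and the
        # coordination cost) of always picking the globally least-loaded server
        a, b = random.sample(servers, 2)
        return a if queue_depth[a] <= queue_depth[b] else b

    servers = ["s1", "s2", "s3", "s4"]
    queue_depth = {"s1": 3, "s2": 0, "s3": 7, "s4": 2}
    print(pick_server(servers, queue_depth))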

stepping back

modeling requires assumptions that may be difficult to practically validate, but it gives us a rigorous framework to:

• determine what experiments to run: the experiments needed to get data to fit the USL curve, response time graphs
• interpret and evaluate the results: load simulations predicted better results than your system shows?
• decide what improvements give the biggest wins: improve mean service time, reduce service time variability, remove crosstalk, etc.

the role of performance modeling: most useful in conjunction with empirical analysis (load simulation, experiments).


example: load simulation results with increasing number of virtual clients (N) = 1 … 100.

[graph: response time vs number of clients; observed curve]

wrong shape for the response time curve: it should be one of the two curves above (the open- or closed-system shape)…

…the load simulator hit a bottleneck.


kavya719
speakerdeck.com/kavya719/applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References:
• Performance Modeling and Design of Computer Systems, Mor Harchol-Balter
• Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz
• Open Versus Closed: A Cautionary Tale
• How to Emulate Web Traffic Using Standard Load Testing Tools
• Queueing Theory in Practice
• Fail at Scale
• Kraken: Leveraging Live Traffic Tests
• SCUBA: Diving into Data at Facebook

On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."

Using LIFO to select the thread to run next also reduces mutex cache thrashing and context-switching overhead.


utilization = throughput × service time (Utilization Law)

as throughput increases, utilization ("busyness") increases, so queueing delay increases (non-linearly), and so does response time.

Facebook sets target cluster capacity = 93% of theoretical. If a cluster's measured capacity is ~90% of theoretical… is this good, or is there a bottleneck? It's below the target, so there's a bottleneck to fix.
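A one-line worked example of the Utilization Law (the throughput and service time are made-up numbers):

    throughput = 2000.0                        # requests/second (assumed)
    service_time = 0.00045                     # seconds per request (assumed)
    utilization = throughput * service_time    # Utilization Law: U = X * S
    print(f"utilization = {utilization:.0%}")  # -> 90%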

[graphs: latency vs throughput (non-linear responses to load); throughput vs concurrency (non-linear scaling)]

microservices: systems are complex. continuous deploys: systems are in flux.

load generation needs a representative workload: the right profile (read / write requests) and arrival pattern, including traffic bursts. …so use live traffic, via capture-and-replay or traffic shifting.

traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic sent to a cluster / region / server, as in the sketch below.
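A tiny sketch of weight-based traffic shifting at one level (cluster weights); the names and numbers are illustrative, not Facebook's actual mechanism:

    import random

    cluster_weights = {"cluster-a": 1.0, "cluster-b": 1.0, "cluster-c": 1.0}

    def route(weights):
        # pick a cluster in proportion to its weight
        clusters = list(weights)
        return random.choices(clusters, weights=[weights[c] for c in clusters], k=1)[0]

    # shift load toward cluster-a to probe its capacity with live traffic:
    cluster_weights["cluster-a"] = 2.0   # now ~50% of traffic instead of ~33%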

Page 5: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

YOLO method

load simulation Stressing the system to empirically determine actual performance characteristics bottlenecksCan be incredibly powerful

performance modeling

performance modeling

real-world system theoretical model

resultsanalyze

translate back

model as

makes assumptions about the system request arrival rate service order times cannot apply the results if your system does not satisfy them

a cluster of many servers the USL

scaling bottlenecks

a single server open closed queueing systems

utilization law Littlersquos law the P-K formula CoDel adaptive LIFO

stepping back the role of performance modeling

a single server

model I

clients web server

ldquohow can we improve the mean response timerdquo

ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo

resp

onse

tim

e (m

s)

throughput (requests second)

response time threshold

model the web server as a queueing system

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases

utilization = arrival rate service time

ldquobusynessrdquo

utili

zatio

n

arrival rate

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

Pollaczek-Khinchine (P-K) formula

mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

assuming constant service time and so request sizes

mean queueing delay prop U (1 - U)

utilization (U)

resp

onse

tim

esince response time prop queueing delay

utilization (U)

queu

eing

del

ay

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

low utilization regime

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

max throughput

low utilization regime

high utilization regime

ldquoHow can we improve the mean response timerdquo

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

ldquoHow can we improve the mean response timerdquo

onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 6: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

performance modeling

real-world system theoretical model

resultsanalyze

translate back

model as

makes assumptions about the system request arrival rate service order times cannot apply the results if your system does not satisfy them

a cluster of many servers the USL

scaling bottlenecks

a single server open closed queueing systems

utilization law Littlersquos law the P-K formula CoDel adaptive LIFO

stepping back the role of performance modeling

a single server

model I

clients web server

ldquohow can we improve the mean response timerdquo

ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo

resp

onse

tim

e (m

s)

throughput (requests second)

response time threshold

model the web server as a queueing system

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases

utilization = arrival rate service time

ldquobusynessrdquo

utili

zatio

n

arrival rate

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

Pollaczek-Khinchine (P-K) formula

mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

assuming constant service time and so request sizes

mean queueing delay prop U (1 - U)

utilization (U)

resp

onse

tim

esince response time prop queueing delay

utilization (U)

queu

eing

del

ay

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

low utilization regime

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

max throughput

low utilization regime

high utilization regime

ldquoHow can we improve the mean response timerdquo

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

ldquoHow can we improve the mean response timerdquo

onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 7: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

a cluster of many servers the USL

scaling bottlenecks

a single server open closed queueing systems

utilization law Littlersquos law the P-K formula CoDel adaptive LIFO

stepping back the role of performance modeling

a single server

model I

clients web server

ldquohow can we improve the mean response timerdquo

ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo

resp

onse

tim

e (m

s)

throughput (requests second)

response time threshold

model the web server as a queueing system

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases

utilization = arrival rate service time

ldquobusynessrdquo

utili

zatio

n

arrival rate

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

Pollaczek-Khinchine (P-K) formula

mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

assuming constant service time and so request sizes

mean queueing delay prop U (1 - U)

utilization (U)

resp

onse

tim

esince response time prop queueing delay

utilization (U)

queu

eing

del

ay

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

low utilization regime

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

max throughput

low utilization regime

high utilization regime

ldquoHow can we improve the mean response timerdquo

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

ldquoHow can we improve the mean response timerdquo

onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 8: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

a single server

model I

clients web server

ldquohow can we improve the mean response timerdquo

ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo

resp

onse

tim

e (m

s)

throughput (requests second)

response time threshold

model the web server as a queueing system

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases

utilization = arrival rate service time

ldquobusynessrdquo

utili

zatio

n

arrival rate

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

Pollaczek-Khinchine (P-K) formula:

mean queueing delay = U × linear fn(mean service time) × quadratic fn(service time variability) / (1 − U)

assuming constant service time (and so constant request sizes):

mean queueing delay ∝ U / (1 − U)

[figure: queueing delay vs. utilization (U), growing non-linearly as U approaches 1; response time vs. U has the same shape, since response time ∝ queueing delay]
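To make that shape concrete, here is a minimal Python sketch of the constant-service-time case (an M/D/1 queue: Poisson arrivals, fixed service time; the 10 ms service time and the arrival rates are made-up illustration values):

def mean_queueing_delay(arrival_rate: float, service_time: float) -> float:
    u = arrival_rate * service_time            # Utilization Law
    assert u < 1, "utilization must stay below 1 or the queue grows forever"
    return (u / (1 - u)) * (service_time / 2)  # P-K with constant service time

S = 0.010  # 10 ms service time (assumed)
for rps in (10, 50, 80, 90, 95, 99):
    d = mean_queueing_delay(rps, S)
    print(f"{rps:2d} req/s -> U = {rps * S:.0%}, mean queueing delay = {d * 1e3:5.1f} ms")

The delay is negligible until utilization nears 1, then blows up: exactly the U / (1 − U) shape above.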

"What's the maximum throughput of this server?" i.e. given a response time target

arrival rate increases →
server utilization increases linearly (Utilization Law) →
mean queueing delay increases non-linearly (P-K formula), and so does response time.

[figure: response time (ms) vs. throughput (requests / second), flat in the low utilization regime, rising sharply in the high utilization regime; max throughput is where the curve crosses the response time threshold]
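One way to turn the response time target into a number, sketched under the same M/D/1 assumption (the 10 ms service time and 20 ms target are hypothetical):

def response_time(arrival_rate: float, s: float) -> float:
    u = arrival_rate * s
    return s + (u / (1 - u)) * (s / 2)   # service time + P-K queueing delay

def max_throughput(target: float, s: float) -> float:
    lo, hi = 0.0, 1.0 / s - 1e-9         # arrival rate must keep U < 1
    for _ in range(60):                  # bisection: response_time is monotonic
        mid = (lo + hi) / 2
        if response_time(mid, s) <= target:
            lo = mid
        else:
            hi = mid
    return lo

print(f"{max_throughput(target=0.020, s=0.010):.1f} req/s meets a 20 ms target")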

"How can we improve the mean response time?"

1. response time ∝ queueing delay → prevent requests from queuing too long:

• Controlled Delay (CoDel), in Facebook's Thrift framework
• adaptive or always LIFO, in Facebook's PHP runtime and Dropbox's Bandaid reverse proxy
• set a max queue length
• client-side concurrency control

CoDel, as in Facebook's Thrift framework:

onNewRequest(req, queue):
    if queue.lastEmptyTime() < (now - N ms):
        # queue was last empty more than N ms ago:
        # set the timeout to M << N ms
        timeout = M ms
    else:
        # else, set the timeout to N ms
        timeout = N ms
    queue.enqueue(req, timeout)

key insight: queues are typically empty
→ allows short bursts, prevents standing queues

adaptive LIFO: serve the newest requests first, not old requests that are likely to expire;
helps when the system is overloaded, makes no difference when it's not.
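A runnable Python sketch of the same adaptive-timeout idea; the class shape is my own, and M = 5 ms / N = 100 ms are the values quoted in the CoDel note in the references:

import time
from collections import deque

M_MS, N_MS = 5, 100   # values Facebook reports work well (see the CoDel note below)

class CoDelQueue:
    """Sketch of the adaptive-timeout idea above; not production code."""
    def __init__(self):
        self._items = deque()                # (request, deadline) pairs
        self._last_empty = time.monotonic()  # when the queue last drained

    def enqueue(self, req):
        now = time.monotonic()
        if not self._items:
            self._last_empty = now
        # Standing queue: it hasn't been empty in the last N ms,
        # so use the aggressive timeout M << N.
        standing = self._last_empty < now - N_MS / 1000
        timeout_s = (M_MS if standing else N_MS) / 1000
        self._items.append((req, now + timeout_s))

    def dequeue(self):
        """Return the next unexpired request, dropping any that timed out."""
        now = time.monotonic()
        while self._items:
            req, deadline = self._items.popleft()
            if not self._items:
                self._last_empty = now
            if deadline >= now:
                return req
        return None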

"How can we improve the mean response time?"

2. response time ∝ queueing delay = U × linear fn(mean service time) × quadratic fn(service time variability) / (1 − U)    (P-K formula)

→ decrease service time, by optimizing application code
→ decrease service time (request size) variability, for example by batching requests
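A toy illustration of why batching tames the variability term; the bursty request-size mix and the batch size are made up:

import random, statistics

random.seed(0)
sizes = [random.choice([1, 1, 1, 9]) for _ in range(10_000)]        # bursty sizes
batches = [sum(sizes[i:i + 4]) / 4 for i in range(0, len(sizes), 4)]  # batches of 4

def cv2(xs):
    """Squared coefficient of variation: variance / mean^2."""
    return statistics.pvariance(xs) / statistics.fmean(xs) ** 2

print(f"per-request CV^2: {cv2(sizes):.2f}")    # high variability
print(f"per-batch   CV^2: {cv2(batches):.2f}")  # much lower at the same mean load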

the cloud / industry site: N sensors → server

the server processes data from N sensors; each sensor loops:

while true:
    # upload synchronously
    ack = upload(data)
    # update state
    deleteUploaded(ack)
    # sleep for Z seconds
    sleep(Z seconds)

model II
• requests are synchronized
• fixed number of clients

→ throughput depends on response time
→ queue length is bounded (<= N), so response time is bounded

This is called a closed system: super different from the previous web server model (an open system).

[diagram: N clients in a loop with the server; each client sends a request, then waits for the response]

response time vs. load for closed systems

assumptions:
1. sleep time ("think time") is constant
2. requests are processed one at a time, in FIFO order
3. service time is constant

Like earlier, as the number of clients (N) increases, throughput increases up to a point, i.e. until utilization is high; after that, increasing N only increases queueing.

[figure: throughput vs. number of clients, rising through the low utilization regime, then flat in the high utilization regime]

What happens to response time in this regime?

Little's Law for closed systems

the "system" in this case is the entire loop: the N clients plus the server.

a request can be in one of three states in the system: sleeping (on the device), waiting (in the server queue), or being processed (in the server); the total number of requests in the system counts requests across all three states.

requests in system = throughput × round-trip time of a request across the whole system
                   = throughput × (sleep time + response time)

applying it in the high utilization regime (constant throughput), and assuming constant sleep time:

N = constant × response time

So response time only grows linearly with N!
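Rearranged, Little's Law gives a quick back-of-the-envelope check; the client count, plateau throughput, and sleep time below are hypothetical:

def closed_response_time(n_clients: int, throughput: float, sleep_s: float) -> float:
    """N = X * (sleep + response)  =>  response = N / X - sleep,
    valid once throughput X has plateaued (high utilization regime)."""
    return n_clients / throughput - sleep_s

# assumed numbers, purely for illustration:
for n in (60, 100, 200):
    print(n, "clients ->", closed_response_time(n, throughput=50.0, sleep_s=1.0), "s")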

response time vs. load for closed systems

So, response time for a closed system:
• low utilization regime: response time stays ~the same
• high utilization regime: grows linearly with N

[figure: response time vs. number of clients, flat through the low utilization regime, then growing linearly in the high utilization regime]

way different than for an open system:

[figure: response time vs. arrival rate, growing non-linearly in the high utilization regime]
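Both regimes fall out of a tiny Mean-Value Analysis computation for one server plus N thinking clients; note MVA assumes exponential service times rather than the constant service times above, and the numbers are made up:

def mva(n_clients: int, service_s: float, think_s: float):
    """Exact MVA for a closed loop: one queueing server + N clients with think time Z."""
    q = 0.0                               # mean queue length at the server
    for n in range(1, n_clients + 1):
        r = service_s * (1 + q)           # response time seen by an arrival
        x = n / (r + think_s)             # Little's Law on the whole loop
        q = x * r                         # Little's Law at the server
    return x, r

for n in (1, 5, 10, 20, 50, 100):
    x, r = mva(n, service_s=0.05, think_s=1.0)
    print(f"N={n:3d}  throughput={x:5.1f}/s  response={r * 1e3:7.1f} ms")

Response time stays near the 50 ms service time until the server saturates, then grows linearly with N.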

open vs. closed systems

closed systems are very different from open systems:
• in how throughput relates to response time
• in response time versus load, especially in the high load regime

uh oh…
standard load simulators typically mimic closed systems…
…but the system with real users may not be one.

So load simulation might predict:
• lower response times than the actual system yields
• better tolerance to request size variability
• other differences you probably don't want to find out in production…

A couple of neat papers on the topic + workarounds:
Open Versus Closed: A Cautionary Tale
How to Emulate Web Traffic Using Standard Load Testing Tools

a cluster of servers

clients → load balancer → cluster of web servers

capacity planning: "How many servers do we need to support a target throughput?" (while keeping response time the same)

scalability: "How can we improve how the system scales?"

max throughput of a cluster of N servers = max single-server throughput × N?

no! systems don't scale linearly:

• contention penalty: due to serialization for shared resources
  (examples: database contention, lock contention)
  → grows as αN
• crosstalk penalty: due to coordination for coherence
  (examples: servers coordinating to synchronize mutable state)
  → grows as βN²

Universal Scalability Law (USL)

throughput of N servers = N / (αN + βN² + C)

N / C               → linear scaling
N / (αN + C)        → contention
N / (αN + βN² + C)  → contention and crosstalk
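A tiny sketch of the shape this produces; alpha, beta, and C here are arbitrary illustration values, not fitted to any system:

def usl_throughput(n: int, alpha: float, beta: float, c: float) -> float:
    """USL in the slide's form: N / (alpha*N + beta*N^2 + C).
    alpha = contention, beta = crosstalk, C = single-server scale factor."""
    return n / (alpha * n + beta * n * n + c)

for n in (1, 2, 4, 8, 16, 32, 64):
    print(f"{n:2d} servers -> {usl_throughput(n, alpha=0.02, beta=0.0005, c=1.0):5.2f}x")
# throughput climbs, flattens out (contention), and eventually falls (crosstalk)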

"How can we improve how the system scales?"

Avoid contention (serialization) and crosstalk (synchronization):
• smarter data partitioning / smaller partitions, as in Facebook's TAO cache
• smarter aggregation, as in Facebook's SCUBA data store
• better load balancing strategies, e.g. best of two random choices (see the sketch below)
• fine-grained locking
• MVCC databases
• etc.
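A minimal sketch of best-of-two-random-choices; routing on queue length is an assumption here, any load signal would do:

import random

def pick_server(queue_lengths: list[int]) -> int:
    """Pick two servers uniformly at random; route to the less loaded one.
    Avoids the crosstalk of querying every server's load on each request."""
    a, b = random.sample(range(len(queue_lengths)), 2)
    return a if queue_lengths[a] <= queue_lengths[b] else b

print(pick_server([3, 0, 7, 2]))  # usually routes toward the shorter queues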

stepping back

the role of performance modeling: most useful in conjunction with empirical analysis (load simulation, experiments).

modeling requires assumptions that may be difficult to practically validate, but it gives us a rigorous framework to:
• determine what experiments to run: the experiments needed to get data to fit the USL curve, response time graphs
• interpret and evaluate the results: did load simulations predict better results than your system shows?
• decide what improvements give the biggest wins: improve mean service time, reduce service time variability, remove crosstalk, etc.

example: load simulation results with an increasing number of virtual clients (N) = 1 … 100

[figure: response time vs. number of clients, with the wrong shape for a response time curve; it should be one of the two curves above (flat-then-linear for a closed system, or sharply rising for an open system)]

… the load simulator hit a bottleneck.


kavya719
speakerdeck.com/kavya719/applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this.

References:
Performance Modeling and Design of Computer Systems, Mor Harchol-Balter
Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz
Open Versus Closed: A Cautionary Tale
How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale
Kraken: Leveraging Live Traffic Tests
SCUBA: Diving into Data at Facebook

On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."

On LIFO: using LIFO to select the thread to run next also reduces mutex cache thrashing and context-switching overhead.


utilization = throughput × service time    (Utilization Law)

throughput increases → utilization ("busyness") increases → queueing delay increases (non-linearly), and so does response time.
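So one measured (throughput, utilization) pair lets you extrapolate theoretical capacity; a quick sketch with made-up measurements:

# Utilization Law rearranged: S = U / X, and max throughput is hit at U = 1.
measured_throughput = 930.0   # req/s (assumed measurement)
measured_utilization = 0.93   # bottleneck "busyness" (assumed measurement)

service_time = measured_utilization / measured_throughput
theoretical_max = 1.0 / service_time
print(f"theoretical max ~ {theoretical_max:.0f} req/s")  # 1000 req/s here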

Facebook sets target cluster capacity = 93% of theoretical.

…is this good, or is there a bottleneck? If measured cluster capacity is only ~90% of theoretical, there's a bottleneck to fix.

non-linear responses to load (throughput vs. latency), non-linear scaling (throughput vs. concurrency), complex systems (microservices), systems in flux (continuous deploys)…

load generation: need a representative workload
• capture and replay: profile (read / write requests), arrival pattern including traffic bursts
• …or use live traffic

traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic to a cluster / region / server.


prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 13: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

model the web server as a queueing system

assumptions

1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order

requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant

web server

request response

queueing delay + service time = response time

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases

utilization = arrival rate service time

ldquobusynessrdquo

utili

zatio

n

arrival rate

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

Pollaczek-Khinchine (P-K) formula

mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

assuming constant service time and so request sizes

mean queueing delay prop U (1 - U)

utilization (U)

resp

onse

tim

esince response time prop queueing delay

utilization (U)

queu

eing

del

ay

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

low utilization regime

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

max throughput

low utilization regime

high utilization regime

ldquoHow can we improve the mean response timerdquo

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

ldquoHow can we improve the mean response timerdquo

onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 14: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases

utilization = arrival rate service time

ldquobusynessrdquo

utili

zatio

n

arrival rate

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

Pollaczek-Khinchine (P-K) formula

mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

assuming constant service time and so request sizes

mean queueing delay prop U (1 - U)

utilization (U)

resp

onse

tim

esince response time prop queueing delay

utilization (U)

queu

eing

del

ay

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

low utilization regime

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

max throughput

low utilization regime

high utilization regime

ldquoHow can we improve the mean response timerdquo

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

ldquoHow can we improve the mean response timerdquo

onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 15: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases

utilization = arrival rate service time

ldquobusynessrdquo

utili

zatio

n

arrival rate

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

Pollaczek-Khinchine (P-K) formula

mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

assuming constant service time and so request sizes

mean queueing delay prop U (1 - U)

utilization (U)

resp

onse

tim

esince response time prop queueing delay

utilization (U)

queu

eing

del

ay

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

low utilization regime

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

max throughput

low utilization regime

high utilization regime

ldquoHow can we improve the mean response timerdquo

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

ldquoHow can we improve the mean response timerdquo

onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 16: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

Pollaczek-Khinchine (P-K) formula

mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

assuming constant service time and so request sizes

mean queueing delay prop U (1 - U)

utilization (U)

resp

onse

tim

esince response time prop queueing delay

utilization (U)

queu

eing

del

ay

"What's the maximum throughput of this server?" i.e. given a response time target

arrival rate increases
→ server utilization increases linearly (Utilization Law)
→ mean queueing delay increases non-linearly (P-K formula), so response time does too

(graph: response time (ms) vs throughput (requests / second); nearly flat in the low utilization regime, rising steeply in the high utilization regime; the max throughput is the point where the curve crosses the response time threshold)
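For instance, a hypothetical worked example under the constant-service-time assumption (an M/D/1 queue, where mean queueing delay = (U / (1 − U)) · S / 2): with service time S = 10 ms and a mean response time target of 50 ms,

\[
S + \frac{U}{1-U}\cdot\frac{S}{2} \le 50\,\text{ms}
\;\Rightarrow\; \frac{U}{1-U} \le 8
\;\Rightarrow\; U \le \tfrac{8}{9} \approx 0.89
\]

so, by the Utilization Law, the max throughput is about 0.89 / 10 ms ≈ 89 requests per second.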

"How can we improve the mean response time?"

1. response time ∝ queueing delay, so prevent requests from queuing too long:
• Controlled Delay (CoDel), in Facebook's Thrift framework
• adaptive or always LIFO, in Facebook's PHP runtime and Dropbox's Bandaid reverse proxy
• set a max queue length
• client-side concurrency control

CoDel pseudocode (from the deck, reconstructed):

    onNewRequest(req, queue):
        if queue.lastEmptyTime() < (now - N ms):
            # Queue was last empty more than N ms ago;
            # set timeout to M << N ms.
            timeout = M ms
        else:
            # Else, set timeout to N ms.
            timeout = N ms
        queue.enqueue(req, timeout)
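A minimal runnable sketch of the same idea in Python (a hypothetical adaptation, not Facebook's implementation; the M = 5 ms and N = 100 ms defaults are the values quoted in the appendix notes below):

    import time
    from collections import deque

    M_MS = 5     # aggressive timeout, used when a standing queue is detected
    N_MS = 100   # relaxed timeout, allows short bursts

    class CoDelQueue:
        def __init__(self):
            self.items = deque()                # (request, deadline) pairs
            self.last_empty = time.monotonic()  # last time the queue was empty

        def enqueue(self, req):
            now = time.monotonic()
            if not self.items:
                self.last_empty = now
            # If the queue hasn't been empty for more than N ms, there's a
            # standing queue: switch to the aggressive timeout M.
            standing = (now - self.last_empty) * 1000 > N_MS
            timeout_ms = M_MS if standing else N_MS
            self.items.append((req, now + timeout_ms / 1000))

        def dequeue(self):
            # Return the next request that hasn't exceeded its deadline;
            # expired requests are dropped (the caller would fail them fast).
            while self.items:
                req, deadline = self.items.popleft()
                if not self.items:
                    self.last_empty = time.monotonic()
                if time.monotonic() <= deadline:
                    return req
            self.last_empty = time.monotonic()
            return None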

key insight: queues are typically empty. CoDel allows short bursts, but prevents standing queues.

adaptive LIFO: serve the newest requests first, not old requests that are likely to expire; helps when the system is overloaded, makes no difference when it's not.

2. response time ∝ queueing delay = U / (1 − U) × linear fn(mean service time) × quadratic fn(service time variability) (P-K formula), so:
• decrease service time variability (i.e. request size variability), for example by batching requests
• decrease mean service time, by optimizing application code

model II

(diagram: an industry site with N sensors uploading to a server in the cloud; the server processes data from the N sensors)

each sensor runs:

    while true:
        # upload synchronously
        ack = upload(data)
        # update state
        deleteUploaded(ack)
        # sleep for Z seconds
        sleep(Z seconds)

• requests are synchronized: a sensor only sends its next request after the previous one completes
• fixed number of clients (N)

so throughput depends on response time, and queue length is bounded (≤ N), so response time is bounded.

This is called a closed system, super different from the previous web server model (an open system).

(diagram: N clients in a request / response loop with a single server)

response time vs. load for closed systems

assumptions:
1. sleep time ("think time") is constant
2. requests are processed one at a time, in FIFO order
3. service time is constant

Like earlier, as the number of clients (N) increases, throughput increases up to a point, i.e. until utilization is high; after that, increasing N only increases queueing.

(graph: throughput vs number of clients; rising in the low utilization regime, flat in the high utilization regime)

What happens to response time in this regime?

Little's Law for closed systems

The "system" in this case is the entire loop, i.e. the N clients plus the server. A request can be in one of three states in the system: sleeping (on the device), waiting (in the server's queue), or being processed (in the server). The total number of requests in the system counts requests across all three states.

requests in system = throughput × round-trip time of a request across the whole system
                   = throughput × (sleep time + response time)

where response time = queueing delay + service time.

Applying it in the high utilization regime (constant throughput), and assuming constant sleep time:

N = constant × (sleep time + response time)

So response time only grows linearly with N!
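In symbols, this is the standard interactive response time law (with N clients, throughput X, think time Z, and response time R):

\[
N = X\,(Z + R) \quad\Longrightarrow\quad R = \frac{N}{X} - Z
\]

In the high utilization regime X is pinned at the server's max throughput, so R rises linearly with N.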

So, response time for a closed system:

(graph: response time vs number of clients; in the low utilization regime, response time stays ~the same; in the high utilization regime, it grows linearly with N)

…way different than for an open system, where (per the P-K formula) response time grows non-linearly with arrival rate, blowing up as utilization approaches 1.

(graph: response time vs arrival rate for the open system, for contrast)
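To make both regimes concrete, here's a small self-contained Python simulation sketch (hypothetical parameters; deterministic think and service times, matching the assumptions above, and a single FIFO server):

    import heapq

    def simulate_closed(n_clients, think_z=1.0, service_s=0.1, n_requests=20000):
        # Closed system: each client sleeps think_z seconds, then issues a
        # request to a single FIFO server with constant service time service_s.
        arrivals = [(think_z, c) for c in range(n_clients)]
        heapq.heapify(arrivals)
        server_free = 0.0
        total_resp = 0.0
        for _ in range(n_requests):
            arrive, c = heapq.heappop(arrivals)   # earliest outstanding request
            start = max(arrive, server_free)      # wait if the server is busy
            finish = start + service_s
            server_free = finish
            total_resp += finish - arrive         # queueing delay + service time
            # the client sleeps, then sends its next request
            heapq.heappush(arrivals, (finish + think_z, c))
        return total_resp / n_requests

    for n in (1, 2, 4, 8, 16, 32):
        print(f"N={n:3d}  mean response time = {simulate_closed(n):.3f}s")

With Z = 1 s and S = 0.1 s, the server saturates around N ≈ 11 clients (where N·S exceeds Z + S); below that, mean response time stays near S, and above it, it grows roughly as N·S − Z, i.e. linearly in N.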

open vs. closed systems

closed systems are very different from open systems, in:
• how throughput relates to response time
• response time versus load, especially in the high load regime

uh oh…
standard load simulators typically mimic closed systems, …but the system with real users may not be one.

So load simulation might predict:
• lower response times than the actual system yields
• better tolerance to request size variability
• other differences you probably don't want to find out in production…

A couple of neat papers on the topic and workarounds: "Open Versus Closed: A Cautionary Tale"; "How to Emulate Web Traffic Using Standard Load Testing Tools".

a cluster of servers

(diagram: clients → load balancer → cluster of web servers)

two questions:
• capacity planning: "How many servers do we need to support a target throughput?" (while keeping response time the same)
• scalability: "How can we improve how the system scales?"

"How many servers do we need to support a target throughput?"

max throughput of a cluster of N servers = max single-server throughput × N?

no! systems don't scale linearly, because of:
• the contention penalty: due to serialization for shared resources; grows like αN. examples: database contention, lock contention.
• the crosstalk penalty: due to coordination for coherence; grows like βN². examples: servers coordinating to synchronize mutable state.

Universal Scalability Law (USL):

throughput of N servers = N / (αN + βN² + C)

(graph, throughput vs N for three curves:
N / C: linear scaling;
N / (αN + C): contention only;
N / (αN + βN² + C): contention and crosstalk, where throughput peaks and then degrades)
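A tiny Python sketch of the USL curve, with hypothetical α, β, C chosen purely for illustration; setting the derivative to zero shows throughput peaks at N* = √(C / β):

    import math

    def usl_throughput(n, alpha=0.02, beta=0.0005, c=1.0):
        # throughput of N servers = N / (alpha*N + beta*N^2 + C)
        return n / (alpha * n + beta * n * n + c)

    peak = math.sqrt(1.0 / 0.0005)  # N* = sqrt(C / beta), ~45 for these values
    for n in (1, 10, 45, 100, 200):
        print(f"N={n:4d}  throughput={usl_throughput(n):6.1f}")

Past the peak, adding servers makes throughput worse: the βN² crosstalk term dominates.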

"How can we improve how the system scales?" Avoid contention (serialization) and crosstalk (synchronization):
• smarter data partitioning: smaller partitions, as in Facebook's TAO cache
• smarter aggregation, as in Facebook's SCUBA data store
• better load balancing strategies: best of two random choices (see the sketch below)
• fine-grained locking
• MVCC databases
• etc.
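A minimal sketch of the best-of-two-random-choices policy (hypothetical server names and load map): sample two servers uniformly at random and pick the less loaded one, which avoids both the crosstalk of polling every server and the herding of always picking the globally least-loaded one.

    import random

    def pick_server(servers, load):
        # Sample two distinct servers at random; route to the less loaded.
        a, b = random.sample(servers, 2)
        return a if load[a] <= load[b] else b

    servers = ["s1", "s2", "s3", "s4"]
    load = {"s1": 3, "s2": 0, "s3": 7, "s4": 2}
    print(pick_server(servers, load))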

stepping back

the role of performance modeling: most useful in conjunction with empirical analysis (load simulation, experiments).

modeling requires assumptions that may be difficult to practically validate, but it gives us a rigorous framework to:
• determine what experiments to run: run the experiments needed to get the data to fit the USL curve and response time graphs.
• interpret and evaluate the results: e.g. when load simulations predicted better results than your system shows.
• decide what improvements give the biggest wins: improve mean service time, reduce service time variability, remove crosstalk, etc.

example: load simulation results with an increasing number of virtual clients (N) = 1 … 100

(graph: response time vs number of clients; the measured curve has the wrong shape for a response time curve, as it should look like one of the two closed-system curves above)

…the load simulator hit a bottleneck.

@kavya719
speakerdeck.com/kavya719/applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References:
• Performance Modeling and Design of Computer Systems, Mor Harchol-Balter
• Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz
• Open Versus Closed: A Cautionary Tale
• How to Emulate Web Traffic Using Standard Load Testing Tools
• Queuing Theory in Practice
• Fail at Scale
• Kraken: Leveraging Live Traffic Tests
• SCUBA: Diving into Data at Facebook

On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."

Using LIFO to select the thread to run next also reduces mutex cache thrashing and context switching overhead.

notes on Kraken (Facebook's live traffic load testing system):

utilization = throughput × service time (Utilization Law): "busyness". As throughput increases, utilization increases, and queueing delay (and so response time) increases non-linearly.

…is a measured capacity good, or is there a bottleneck? Facebook sets target cluster capacity = 93% of theoretical; if a cluster's capacity is ~90% of theoretical, there's a bottleneck to fix.

why empirical capacity measurement is hard:
• non-linear responses to load (throughput vs latency)
• non-linear scaling (throughput vs concurrency)
• microservices: systems are complex
• continuous deploys: systems are in flux

load generation: need a representative workload, i.e. the right profile (read / write requests) and arrival pattern, including traffic bursts; …so use live traffic, via:
• capture and replay, or
• traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic sent to a cluster / region / server.

Page 17: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

Pollaczek-Khinchine (P-K) formula

mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

assuming constant service time and so request sizes

mean queueing delay prop U (1 - U)

utilization (U)

resp

onse

tim

esince response time prop queueing delay

utilization (U)

queu

eing

del

ay

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

low utilization regime

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

max throughput

low utilization regime

high utilization regime

ldquoHow can we improve the mean response timerdquo

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

ldquoHow can we improve the mean response timerdquo

onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 18: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

P(request has to queue) increases somean queue length increases so mean queueing delay increases

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

Pollaczek-Khinchine (P-K) formula

mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

assuming constant service time and so request sizes

mean queueing delay prop U (1 - U)

utilization (U)

resp

onse

tim

esince response time prop queueing delay

utilization (U)

queu

eing

del

ay

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

low utilization regime

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

max throughput

low utilization regime

high utilization regime

ldquoHow can we improve the mean response timerdquo

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

ldquoHow can we improve the mean response timerdquo

onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 19: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

Pollaczek-Khinchine (P-K) formula

mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

assuming constant service time and so request sizes

mean queueing delay prop U (1 - U)

utilization (U)

resp

onse

tim

esince response time prop queueing delay

utilization (U)

queu

eing

del

ay

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

low utilization regime

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

max throughput

low utilization regime

high utilization regime

ldquoHow can we improve the mean response timerdquo

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

ldquoHow can we improve the mean response timerdquo

onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 20: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

low utilization regime

ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target

arrival rate increases

server utilization increases linearly

Utilization law

P-K formula

mean queueing delay increases non-linearly so response time too

resp

onse

tim

e (m

s)

throughput (requests second)

max throughput

low utilization regime

high utilization regime

ldquoHow can we improve the mean response timerdquo

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

ldquoHow can we improve the mean response timerdquo

onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

1 response time prop queueing delay

prevent requests from queuing too long

bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework

bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy

bull set a max queue length

bull client-side concurrency control

newest requests first not old requests that are likely to expire

helps when system is overloaded makes no difference when itrsquos not

key insight queues are typically empty allows short bursts prevents standing queues

ldquoHow can we improve the mean response timerdquo

2 response time prop queueing delay

U linear fn (mean service time) quadratic fn (service time variability) (1 - U)

P-K formula

decrease request service size variability for example by batching requests

decrease service time by optimizing application code

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single-server throughput × N?

no, systems don't scale linearly:
• contention penalty (the αN term): due to serialization for shared resources. examples: database contention, lock contention
• crosstalk penalty (the βN² term): due to coordination for coherence. examples: servers coordinating to synchronize mutable state

Universal Scalability Law (USL):

throughput of N servers = N / (αN + βN² + C)

• N / C: linear scaling
• N / (αN + C): contention, so throughput saturates
• N / (αN + βN² + C): contention and crosstalk, so throughput peaks and then degrades
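A sketch of fitting this USL form to measured cluster throughput, assuming scipy is available; the (N, throughput) points below are made up for illustration:

    import numpy as np
    from scipy.optimize import curve_fit

    def usl(n, alpha, beta, c):
        # throughput of n servers = n / (alpha*n + beta*n^2 + c)
        return n / (alpha * n + beta * n * n + c)

    n = np.array([1, 2, 4, 8, 16, 32], dtype=float)
    x = np.array([95, 180, 330, 560, 800, 880], dtype=float)  # measured req/s (hypothetical)

    (alpha, beta, c), _ = curve_fit(usl, n, x, p0=[1e-3, 1e-5, 1e-2], bounds=(0, np.inf))
    print(alpha, beta, c)
    print(usl(64, alpha, beta, c))  # predicted throughput at 64 servers

Once fitted, the curve estimates where adding servers stops paying off, and whether contention (α) or crosstalk (β) dominates.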

"How can we improve how the system scales?"

Avoid contention (serialization) and crosstalk (synchronization):
• smarter data partitioning: smaller partitions in Facebook's TAO cache
• smarter aggregation: in Facebook's SCUBA data store
• better load balancing strategies, e.g. best of two random choices (see the sketch below)
• fine-grained locking
• MVCC databases
• etc.
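A minimal sketch of "best of two random choices" (server names and queue lengths are hypothetical): sample two servers at random and pick the less loaded one, which spreads load far better than pure random choice without the herding that "always pick the globally least loaded" can cause under stale load information:

    import random

    def best_of_two(queue_lengths):
        # queue_lengths: dict of server name -> current queue length
        a, b = random.sample(list(queue_lengths), 2)
        return a if queue_lengths[a] <= queue_lengths[b] else b

    servers = {"web-1": 3, "web-2": 0, "web-3": 7}
    print(best_of_two(servers))  # the shorter of the two sampled queues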

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719 / speakerdeck.com/kavya719/applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this!

References:
• Performance Modeling and Design of Computer Systems, Mor Harchol-Balter
• Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz
• Open Versus Closed: A Cautionary Tale
• How to Emulate Web Traffic Using Standard Load Testing Tools
• Queuing Theory In Practice
• Fail at Scale
• Kraken: Leveraging Live Traffic Tests
• SCUBA: Diving into Data at Facebook

On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."

Using LIFO to select the thread to run next, to reduce mutex cache thrashing and context switching overhead.


utilization = throughput × service time (Utilization Law)

as throughput increases, utilization ("busyness") increases, and queueing delay, and so response time, increases non-linearly.

…is this good, or is there a bottleneck? Facebook sets target cluster capacity = 93% of theoretical; if a cluster's measured capacity is only ~90% of theoretical, there's a bottleneck to fix.
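A back-of-the-envelope sketch of that check (the service time and the measured capacity below are assumed numbers, not Facebook's):

    # Utilization Law: utilization = throughput * service_time,
    # so theoretical per-server capacity at 100% utilization is 1 / service_time.
    SERVICE_TIME = 0.010            # seconds per request (assumed)
    theoretical = 1 / SERVICE_TIME  # 100 req/s
    target = 0.93 * theoretical     # 93 req/s, the Facebook-style target
    measured = 90.0                 # req/s from a live traffic test (assumed)

    if measured < target:
        print("capacity is ~90% of theoretical: find and fix the bottleneck")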

why test with live traffic?
• non-linear responses to load [graph: latency vs throughput]
• non-linear scaling [graph: throughput vs concurrency]
• microservices: systems are complex
• continuous deploys: systems are in flux

load generation needs a representative workload: the profile (read, write requests) and the arrival pattern, including traffic bursts. either capture and replay, or …use live traffic.

traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic to a cluster, region, or server.
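A toy sketch of weight-based traffic shifting (server names and weights are made up): raising one server's weight raises its expected share of requests, which is how load is gradually concentrated on the system under test:

    import random

    # load-balancer weights; bumping "web-c" shifts extra load onto it
    weights = {"web-a": 1.0, "web-b": 1.0, "web-c": 3.0}

    def pick_server():
        servers = list(weights)
        return random.choices(servers, weights=[weights[s] for s in servers])[0]

    print(pick_server())  # "web-c" ~3/5 of the time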


edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 26: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

"How can we improve the mean response time?"

response time ∝ queueing delay, and by the P-K formula:

queueing delay ∝ (U / (1 − U)) × (linear fn of mean service time) × (quadratic fn of service time variability)

So, to improve the mean response time:
• decrease service time variability, for example by batching requests
• decrease the mean service time, by optimizing application code
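For reference, the standard M/G/1 statement of the P-K formula (arrival rate λ, service time S, utilization U = λ · E[S]) gives the mean queueing delay as:

    E[queueing delay] = (U / (1 − U)) · (E[S] / 2) · (1 + Var(S) / E[S]²)

which makes all three factors explicit: the delay blows up as U → 1, grows linearly in the mean service time E[S], and grows with the square of service time variability, via Var(S).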

model II

N sensors at an industry site upload data to a server in the cloud; the server processes the data from the N sensors.

Each sensor runs a loop:

    while true:
        ack = upload(data)      # upload synchronously
        deleteUploaded(ack)     # update state
        sleep(Z seconds)        # sleep for Z seconds

So, in this system:
• requests are synchronized: a sensor only sends its next request after the previous one completes
• there is a fixed number of clients (N)
• throughput depends on response time
• queue length is bounded (≤ N), so response time is bounded

This is called a closed system: super different from the previous web server model (an open system).

[diagram: N clients, each looping request → server → response]

response time vs. load for closed systems

assumptions:
1. sleep time ("think time") is constant
2. requests are processed one at a time, in FIFO order
3. service time is constant

Like earlier, as the number of clients (N) increases, throughput increases up to a point, i.e. until utilization is high; after that, increasing N only increases queueing.

[graph: throughput vs. number of clients, rising through the low utilization regime and flattening in the high utilization regime]

What happens to response time in the high utilization regime?

Little's Law for closed systems

The "system" in this case is the entire loop, i.e. the N clients plus the server. A request can be in one of three states: sleeping (on the device), waiting (in the server queue), or being processed (in the server). The total number of requests in the system counts requests across all three states.

[diagram: N clients with requests sleeping on devices, waiting in the server queue, and being processed at the server]

Applying Little's Law to that whole loop:

requests in system = throughput × round-trip time of a request across the whole system
N = throughput × (sleep time + response time)

where response time = queueing delay + service time.

Applying it in the high utilization regime (constant throughput), and assuming constant sleep time:

response time = N / throughput − sleep time

So response time only grows linearly with N.
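A quick worked example (numbers hypothetical): with N = 100 sensors, a saturated throughput of 50 requests/second, and a 1-second sleep time, Little's Law gives response time = 100/50 − 1 = 1 second; at N = 150 it becomes 150/50 − 1 = 2 seconds. Linear growth in N, not the runaway growth an open system shows near saturation.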

response time vs. load for closed systems

So, response time for a closed system:

[graph: response time vs. number of clients: stays ~the same in the low utilization regime, then grows linearly with N in the high utilization regime]

…way different than for an open system:

[graph: response time vs. arrival rate for an open system: response time grows non-linearly in the high utilization regime]

open vs. closed systems

Closed systems are very different from open systems in:
• how throughput relates to response time
• response time versus load, especially in the high-load regime

uh oh…

Standard load simulators typically mimic closed systems… but the system with real users may not be one. So load simulation might predict:
• lower response times than the actual system yields
• better tolerance to request size variability
• other differences you probably don't want to find out in production…

A couple of neat papers on the topic, with workarounds: "Open Versus Closed: A Cautionary Tale" and "How to Emulate Web Traffic Using Standard Load Testing Tools". A minimal sketch contrasting the two arrival processes follows below.
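The sketch (not from the talk) assumes a hypothetical send_request() that issues one request and blocks until its response; everything else is standard library:

    import random
    import threading
    import time

    def send_request():
        # hypothetical stand-in: issue one request, block until its response
        time.sleep(0.05)

    def closed_load(n_clients, think_time, duration):
        """Closed system: each of N clients sends its next request only after
        the previous response arrives, then 'thinks' for a while."""
        deadline = time.time() + duration

        def client():
            while time.time() < deadline:
                send_request()          # blocks, so offered load self-throttles
                time.sleep(think_time)  # think time between requests

        clients = [threading.Thread(target=client) for _ in range(n_clients)]
        for c in clients:
            c.start()
        for c in clients:
            c.join()

    def open_load(arrival_rate, duration):
        """Open system: requests arrive at a fixed average rate (Poisson),
        independent of whether earlier requests have completed."""
        deadline = time.time() + duration
        while time.time() < deadline:
            threading.Thread(target=send_request).start()  # fire and forget
            time.sleep(random.expovariate(arrival_rate))   # exponential gaps

In the closed generator, slow responses throttle the offered load; in the open generator they do not, which is exactly why an open system's queue, and so its response time, can grow without bound at high utilization.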

a cluster of servers

[diagram: clients → load balancer → cluster of web servers]

"How many servers do we need to support a target throughput?" (while keeping response time the same): capacity planning

"How can we improve how the system scales?": scalability

Is the max throughput of a cluster of N servers = max single-server throughput × N?

No, systems don't scale linearly:
• contention penalty (the αN term): due to serialization for shared resources; examples: database contention, lock contention
• crosstalk penalty (the βN² term): due to coordination for coherence; example: servers coordinating to synchronize mutable state

Universal Scalability Law (USL):

    throughput of N servers = N / (C + αN + βN²)

    linear scaling:             N / C
    contention only:            N / (C + αN)
    contention and crosstalk:   N / (C + αN + βN²)

[graph: the three curves diverge as N grows; contention flattens throughput, crosstalk makes it peak and then fall]
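The α, β, and C coefficients come from fitting measured throughput at several cluster sizes. A minimal sketch of that fit, with made-up measurements purely for illustration:

    import numpy as np
    from scipy.optimize import curve_fit

    def usl(n, alpha, beta, c):
        # throughput of n servers = n / (c + alpha*n + beta*n^2)
        return n / (c + alpha * n + beta * n * n)

    # hypothetical measurements: cluster size -> observed throughput (req/s)
    sizes = np.array([1.0, 2, 4, 8, 16, 32])
    throughput = np.array([950.0, 1800, 3300, 5400, 7200, 7600])

    (alpha, beta, c), _ = curve_fit(usl, sizes, throughput, p0=[1e-4, 1e-6, 1e-3])
    peak = np.sqrt(c / beta)  # throughput peaks where d/dN = 0, i.e. N = sqrt(C/beta)
    print(f"alpha={alpha:.2e}  beta={beta:.2e}  C={c:.2e}  peak near N={peak:.0f}")

The fitted curve then answers both questions above: read off the N that reaches the target throughput (capacity planning), and see how much shrinking α or β moves the peak (scalability).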

"How can we improve how the system scales?"

Avoid contention (serialization) and crosstalk (synchronization):
• smarter data partitioning: smaller partitions in Facebook's TAO cache
• smarter aggregation: in Facebook's SCUBA data store
• better load balancing strategies: best of two random choices (sketched below)
• fine-grained locking
• MVCC databases
• etc.
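A minimal sketch of best-of-two-random-choices (the server names and the load signal are hypothetical): sample two servers at random and route to the less loaded one, which gets close to least-loaded routing without the coordination cost of tracking every server:

    import random

    class Balancer:
        def __init__(self, servers):
            # outstanding requests per server: the (hypothetical) load signal
            self.load = {s: 0 for s in servers}

        def pick(self):
            # sample two distinct servers, route to the less loaded one
            a, b = random.sample(list(self.load), 2)
            return a if self.load[a] <= self.load[b] else b

    balancer = Balancer(["s1", "s2", "s3", "s4"])
    server = balancer.pick()
    balancer.load[server] += 1  # dispatched; decrement when the response arrives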

stepping back

the role of performance modeling: most useful in conjunction with empirical analysis (load simulation, experiments)

Modeling requires assumptions that may be difficult to practically validate, but it gives us a rigorous framework to:
• determine what experiments to run: the experiments needed to get data to fit the USL curve, response time graphs
• interpret and evaluate the results: e.g. when load simulations predicted better results than your system shows
• decide what improvements give the biggest wins: improve mean service time, reduce service time variability, remove crosstalk, etc.

load simulation experiments

Example: load simulation results with an increasing number of virtual clients, N = 1 … 100.

[graph: response time vs. number of clients]

That's the wrong shape for the response time curve: it should look like one of the two closed-system curves above. So… the load simulator hit a bottleneck.

@kavya719
speakerdeck.com/kavya719/applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this.

References:
• Performance Modeling and Design of Computer Systems, Mor Harchol-Balter
• Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz
• Open Versus Closed: A Cautionary Tale
• How to Emulate Web Traffic Using Standard Load Testing Tools
• Queuing Theory in Practice
• Fail at Scale
• Kraken: Leveraging Live Traffic Tests
• SCUBA: Diving into Data at Facebook

On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."

Adaptive LIFO: using LIFO to select the thread to run next, to reduce mutex cache thrashing and context switching overhead.
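A rough sketch of the control loop that CoDel quote describes, paraphrased from the Fail at Scale paper (an illustration, not Facebook's code): while the queue has had a standing backlog for longer than N, newly enqueued requests get the much shorter timeout M, shedding load instead of letting the queue stand:

    import time

    M = 0.005  # 5 ms: aggressive timeout while overloaded
    N = 0.100  # 100 ms: normal timeout

    class CoDelQueue:
        def __init__(self):
            self.items = []  # (request, deadline) pairs, FIFO
            self.last_empty = time.monotonic()

        def enqueue(self, request):
            now = time.monotonic()
            if not self.items:
                self.last_empty = now
            # standing queue for more than N seconds => use timeout M
            timeout = M if now - self.last_empty > N else N
            self.items.append((request, now + timeout))

        def dequeue(self):
            now = time.monotonic()
            while self.items:
                request, deadline = self.items.pop(0)
                if now <= deadline:
                    return request  # expired requests are simply dropped
            self.last_empty = now
            return None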


utilization ("busyness") = throughput × service time (Utilization Law)

As throughput increases, utilization increases; once utilization is high, queueing delay, and with it response time, increases non-linearly.
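A quick worked example (numbers hypothetical): at 1,000 requests/second with a 0.8 ms mean service time, utilization = 1000 × 0.0008 = 0.8, i.e. the server is busy 80% of the time; pushing throughput to 1,200 req/s gives utilization 0.96, deep in the regime where queueing delay explodes. This is why target capacity sits safely below 100% of theoretical.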

…is this good, or is there a bottleneck? If measured cluster capacity is ~90% of theoretical, there's a bottleneck to fix: Facebook sets target cluster capacity = 93% of theoretical.

Why test against the real system?
• non-linear responses to load [graph: latency vs. throughput]
• non-linear scaling [graph: throughput vs. concurrency]
• microservices: systems are complex
• continuous deploys: systems are in flux

load generation

Need a representative workload: the request profile (read/write requests) and the arrival pattern, including traffic bursts. Capture-and-replay is one option… or use live traffic.

traffic shifting

Adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic sent to a cluster, region, or server. A hypothetical sketch of the weight adjustment follows below.
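A minimal sketch of that idea (weight names and values are hypothetical; a real system like Kraken drives this through the load balancer's configuration):

    # fraction of traffic the load balancer sends to each cluster
    cluster_weights = {"cluster-a": 1.0, "cluster-b": 1.0, "cluster-c": 1.0}

    def shift_traffic(weights, target, factor):
        """Scale up the target's weight, then renormalize so fractions sum to 1."""
        weights = dict(weights)
        weights[target] *= factor
        total = sum(weights.values())
        return {cluster: w / total for cluster, w in weights.items()}

    # send ~1.5x the baseline share to cluster-a, then watch its latency and errors
    print(shift_traffic(cluster_weights, "cluster-a", 1.5))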

Page 27: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

the cloudindustry site

N sensors

server

while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)

processes data from N sensors

model II

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 28: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

bull requests are synchronized

bull fixed number of clients

throughput depends on response time

queue length is bounded (lt= N) so response time bounded

This is called a closed system super different that the previous web server model (open system)

server

N clients

] ]responserequest

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 29: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

response time vs load for closed systems

assumptions

1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant

What happens to response time in this regime

Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th

roug

hput

number of clients

low utilization regime

high utilization regime

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 30: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

Littlersquos Law for closed systems

server

sleeping

waiting being processed

] ]

the total number of requests in the system includes requests across the states

a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)

the system in this case is the entire loop ie

N clients

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 31: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

Littlersquos Law for closed systems

requests in system = throughput round-trip time of a request across the whole system

sleep time + response time

server

sleep time

queueing delay + service time = response time

] ]

So response time only grows linearly with N

N = constant response time

applying it in the high utilization regime (constant throughput) and assuming constant sleep

N clients

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime

grows linearly with N

low utilization regime response time stays ~same

high utilization regime

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput × service time (Utilization Law)

as throughput increases, utilization ("busyness") increases, and queueing delay increases (non-linearly), so response time increases too.

Facebook sets target cluster capacity = 93% of theoretical.

…is this good, or is there a bottleneck? if a cluster sustains only ~90% of theoretical capacity, there's a bottleneck to fix.
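A quick worked example of the Utilization Law (numbers invented):

```python
# Utilization Law: utilization = throughput * service_time
throughput = 1200      # requests/second currently served
service_time = 0.0006  # 0.6 ms of server time per request

utilization = throughput * service_time
print(f"utilization = {utilization:.0%}")  # -> 72%

# utilization hits 100% at the theoretical max throughput:
max_throughput = 1 / service_time          # ~1667 req/s
print(f"headroom = {max_throughput - throughput:.0f} req/s")
```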

[figure: non-linear response to load (latency vs. throughput)]
[figure: non-linear scaling (throughput vs. concurrency)]

microservices: systems are complex; continuous deploys: systems are in flux.

load generation needs a representative workload: the right profile (read / write requests) and arrival pattern, including traffic bursts. …so use live traffic, either captured and replayed, or shifted.

traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic sent to a cluster / region / server.
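A hypothetical sketch of traffic shifting in the Kraken style: ramp one server's load-balancer weight and back off once a health metric degrades. Every name and threshold below is invented; `p99_latency_ms` and `set_lb_weight` stand in for real metrics and load-balancer APIs:

```python
import time

def p99_latency_ms(server):        # hypothetical metrics hook
    raise NotImplementedError

def set_lb_weight(server, weight): # hypothetical load-balancer API
    raise NotImplementedError

def ramp_traffic(server, step=0.1, max_weight=3.0, latency_slo_ms=250):
    """Shift extra live traffic onto one server to find its real capacity:
    raise its weight gradually, and step back once the response-time SLO
    is violated."""
    weight = 1.0
    while weight < max_weight:
        set_lb_weight(server, weight + step)
        time.sleep(60)                     # let the system settle
        if p99_latency_ms(server) > latency_slo_ms:
            set_lb_weight(server, weight)  # revert the last step
            break
        weight += step
    return weight  # the highest weight the server handled within SLO
```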


response time vs. load for closed systems

[figure: response time vs. number of clients]

Like earlier: as the number of clients (N) increases, throughput increases up to a point, i.e. until utilization is high; after that, increasing N only increases queueing.

So, response time for a closed system:
• low utilization regime: response time stays ~ the same
• high utilization regime: response time grows linearly with N

[figure: response time vs. arrival rate, open system]

…way different than for an open system, where response time grows non-linearly, without bound, as the arrival rate approaches the service rate.
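The two regimes fall out of the interactive response time law, R = N/X − Z, with throughput X capped at 1/S for a single server. A sketch using these textbook bounds (S and Z invented; this is standard queueing theory, not from the slides):

```python
S = 0.010  # service time: 10 ms per request
Z = 0.500  # think time: 500 ms between a client's requests

def response_time_lower_bound(n):
    """R = N/X - Z with X <= 1/S gives R >= max(S, N*S - Z):
    flat at ~S while the server has headroom, then linear in N."""
    return max(S, n * S - Z)

for n in (1, 10, 50, 51, 60, 100, 200):
    print(f"N={n:4d}  response time >= {1000 * response_time_lower_bound(n):6.1f} ms")
```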

open vs. closed systems

closed systems are very different from open systems, in:
• how throughput relates to response time
• response time versus load, especially in the high-load regime

uh oh…

standard load simulators typically mimic closed systems …but the system with real users may not be one.

So load simulation might predict:
• lower response times than the actual system yields
• better tolerance to request-size variability
• other differences you probably don't want to find out in production…

A couple of neat papers on the topic, with workarounds: Open Versus Closed: A Cautionary Tale; How to Emulate Web Traffic Using Standard Load Testing Tools.
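The difference is visible in the generator loop itself. A sketch contrasting the two, using simplified single-server timeline arithmetic rather than a real load tool or event-driven simulator:

```python
import random

S = 0.010  # mean service time (exponential), single FIFO server

def service():
    return random.expovariate(1 / S)

def closed_system(n_clients, think=0.1, reqs=10_000):
    """Each client waits for its response (plus think time) before
    issuing again, so the arrival rate self-throttles under load."""
    server_free = 0.0
    ready = [0.0] * n_clients  # when each client issues its next request
    total = 0.0
    for i in range(reqs):
        c = i % n_clients                   # round-robin approximation
        start = max(ready[c], server_free)  # wait for server if busy
        server_free = start + service()
        total += server_free - ready[c]     # response time for this request
        ready[c] = server_free + think
    return total / reqs

def open_system(arrival_rate, reqs=10_000):
    """Poisson arrivals, independent of system state: queueing delay
    blows up as the arrival rate nears the service rate 1/S."""
    t = server_free = total = 0.0
    for _ in range(reqs):
        t += random.expovariate(arrival_rate)  # next arrival time
        start = max(t, server_free)
        server_free = start + service()
        total += server_free - t
    return total / reqs

print(f"closed, 20 clients : {closed_system(20):.4f}s mean response")
print(f"open, 95 req/s     : {open_system(95):.4f}s mean response")
```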

a cluster of servers

clients → load balancer → cluster of web servers

capacity planning: "How many servers do we need to support a target throughput?" (while keeping response time the same)

scalability: "How can we improve how the system scales?"

max throughput of a cluster of N servers = max single-server throughput × N?

no: systems don't scale linearly.

• contention penalty: due to serialization for shared resources; grows linearly with N (the αN term). examples: database contention, lock contention
• crosstalk penalty: due to coordination for coherence; grows quadratically with N (the βN² term). examples: servers coordinating to synchronize mutable state

Universal Scalability Law (USL):

throughput of N servers = N / (αN + βN² + C)

• N / C: linear scaling
• N / (αN + C): contention only
• N / (αN + βN² + C): contention and crosstalk
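Plugging invented coefficients into the three forms shows why the βN² term matters: with contention alone, throughput plateaus; with crosstalk, it peaks (near N = √(C/β), here ≈ 45) and then declines:

```python
C, ALPHA, BETA = 1.0, 0.03, 0.0005  # invented coefficients

def usl(n):
    return n / (ALPHA * n + BETA * n**2 + C)

for n in (1, 8, 16, 32, 45, 64, 128):
    linear = n / C
    contention = n / (ALPHA * n + C)
    print(f"N={n:3d}  linear={linear:6.1f}  "
          f"contention-only={contention:6.1f}  usl={usl(n):6.1f}")
```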



Page 33: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

response time vs load for closed systems

So response time for a closed system

number of clients

resp

onse

tim

eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing

arrival rate

resp

onse

tim

e

way different than for an open system

high utilization regimegrows linearly with N

low utilization regime response time stays ~same

high utilization regime high utilization regime

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 34: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

open vs closed systems

bull how throughput relates to response time bull response time versus load especially in the high load regime

closed systems are very different from open systems

uh ohhellip

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 35: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

standard load simulators typically mimic closed systems

A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip

open vs closed systems

hellipbut the system with real users may not be one

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 36: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

a cluster of servers

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 37: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

clients

cluster of web servers

load balancer

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

capacity planning

ldquoHow can we improve how the system scalesrdquo scalability

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 38: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 39: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

max throughput of a cluster of N servers = max single server throughput N

ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same

no systems donrsquot scale linearly

bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention

bull crosstalk penaltydue to coordination for coherence

examples servers coordinating to synchronize mutable state

αN

βN2

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput service time (Utilization Law)

throughput

ldquobusynessrdquo

queueing delay increases (non-linearly) so response time

throughput increases

utilization increases

Facebook sets target cluster capacity = 93 of theoretical

hellipis this good or is there a bottleneck

cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix

Facebook sets target cluster capacity = 93 of theoretical

throughput

latency

non-linear responses to load

throughput

concurrency

non-linear scaling

microservices systems are complex

continuous deployssystems are in flux

load generation

need a representative workload

hellipuse live traffic

traffic shifting

profile (read write requests) arrival pattern including traffic bursts

capture and replay

edge weight cluster weight server weight

adjust weights that control load balancing to increase the fraction of traffic to a cluster region server

traffic shifting

Page 40: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as

Universal Scalability Law (USL)

throughput of N servers = N (αN + βN2 + C)

N (αN + βN2 + C)

NC

N(αN + C)

contention and crosstalk

linear scaling

contention

bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache

ldquoHow can we improve how the system scalesrdquo

Avoid contention (serialization) and crosstalk (synchronization)

bull smarter aggregation in Facebookrsquos SCUBA data store

bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc

stepping back

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

load simulation results with increasing number of virtual clients (N) = 1 hellip 100

hellip load simulator hit a bottleneck

resp

onse

tim

e

number of clients

wrong shape for response time curve

should be one of the two curves above

number of clients

resp

onse

tim

e

modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to

bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs

bull interpret and evaluate the resultsload simulations predicted better results than your system shows

bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc

the role of performance modelingmost useful in conjunction with empirical analysis

load simulation experiments

kavya719speakerdeckcomkavya719applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this

References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools

Queuing Theory In Practice

Fail at Scale Kraken Leveraging Live Traffic Tests

SCUBA Diving into Data at Facebook

On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo

Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead

number of virtual clients (N) = 1 hellip 100

response time

concurrency (N)

wrong shape for response time curve

should be

concurrency (N)

response time

hellip load simulator hit a bottleneck

utilization = throughput × service time (Utilization Law)

as throughput increases, utilization ("busyness") increases, so queueing delay — and with it response time — increases non-linearly.

Facebook sets target cluster capacity = 93% of theoretical. …is this good, or is there a bottleneck? If a cluster's capacity is only ~90% of theoretical, there's a bottleneck to fix.
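To make the non-linearity concrete, a toy calculation for a single server with a 5 ms service time, using the utilization law plus the M/M/1 response-time formula (illustrative numbers only):

service_time = 0.005  # seconds per request (5 ms), so capacity is 200 req/s

for throughput in [50, 100, 150, 180, 195]:        # requests / second
    utilization = throughput * service_time        # Utilization Law
    response = service_time / (1 - utilization)    # M/M/1: R = S / (1 - rho)
    print(f"X={throughput:3d} req/s  rho={utilization:.3f}  R={response * 1000:6.1f} ms")

# utilization climbs linearly with throughput, but response time explodes as
# rho -> 1: 6.7 ms at rho=0.25, 50 ms at rho=0.9, 200 ms at rho=0.975 --
# which is why a few percent of theoretical capacity matters so much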

• non-linear responses to load (latency vs. throughput)
• non-linear scaling (throughput vs. concurrency)
• microservices: systems are complex
• continuous deploys: systems are in flux

load generation: need a representative workload — profile (read / write requests), arrival pattern including traffic bursts. capture and replay, or …use live traffic.

traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic to a cluster, region, or server.
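Mechanically, this can be as simple as weighted random routing. A sketch (server names hypothetical) where bumping one weight shifts a larger fraction of live traffic onto the server under test:

import random

weights = {"web1": 1.0, "web2": 1.0, "web3": 1.0}

def route():
    # pick a server with probability proportional to its weight
    servers = list(weights)
    return random.choices(servers, weights=[weights[s] for s in servers])[0]

# a Kraken-style test ramps the target's weight up in small steps, watching
# latency and error rate, and backs off when they degrade:
weights["web1"] = 1.5   # web1 now gets 1.5 / 3.5, about 43% of requests
print(sum(route() == "web1" for _ in range(10000)) / 10000)  # roughly 0.43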
