Benchmarking (RICON 2014)


DESCRIPTION

Knowing how to set up good benchmarks is invaluable for understanding the performance of a system. Writing correct and useful benchmarks is hard, and verifying the results is difficult and error-prone. When done right, benchmarks guide teams to improve the performance of their systems. When done wrong, hours of effort can yield a worse-performing application, upset customers, or worse. In this talk, we will discuss what you need to know to write better benchmarks for distributed systems. We will look at examples of bad benchmarks and learn which biases can invalidate the measurements, in the hope of applying our new-found skills correctly and avoiding such pitfalls in the future.


Benchmarking: You're Doing It Wrong

Aysylu Greenberg @aysylu22

To Write Good Benchmarks…

Need to be Full Stack

   

your process vs. Goal
your process vs. Best Practices

Benchmark = How Fast?

Today  

• How Not to Write Benchmarks
• Benchmark Setup & Results:
  - You're wrong about machines
  - You're wrong about stats
  - You're wrong about what matters
• Becoming Less Wrong
• Having Fun with Riak

HOW NOT TO WRITE BENCHMARKS

Website Serving Images

• Access 1 image 1000 times
• Latency measured for each access
• Start measuring immediately
• 3 runs
• Find mean
• Dev environment
(code sketch below)

[Diagram: Web Request → Server → S3 Cache]
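As a minimal sketch, here is roughly what the setup above looks like in code, assuming a Python client and a hypothetical dev-server URL (neither appears in the talk); the sections that follow explain why nearly every choice in it is a problem:

```python
import statistics
import time

import requests  # assumed HTTP client; any would do

URL = "http://dev-server.local/images/cat.jpg"  # hypothetical single image on a dev box

def naive_run(n=1000):
    latencies = []
    for _ in range(n):                       # same image every time: caches soak up the work
        start = time.perf_counter()          # measuring from the very first request, no warmup
        requests.get(URL)
        latencies.append(time.perf_counter() - start)
    return latencies

# 3 runs, report the mean of everything: the setup the talk goes on to critique
runs = [naive_run() for _ in range(3)]
print("mean latency:", statistics.mean(x for run in runs for x in run))
```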

WHAT'S WRONG WITH THIS BENCHMARK?

YOU'RE WRONG ABOUT THE MACHINE

Wrong About the Machine

• Cache, cache, cache, cache!

It's Caches All The Way Down


[Charts: Caches in Benchmarks, Prof. Saman Amarasinghe, MIT 2009]


Wrong About the Machine

• Cache, cache, cache, cache!
• Warmup & timing
• Periodic interference
• Test != Prod
• Power mode changes
(sketch below)
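A hedged sketch of how the example benchmark could address a few of these points: discard warmup iterations, spread requests over many images so caches are exercised the way production would exercise them, and aim the run at a production-like environment. The warmup count, image count, and staging URL are illustrative assumptions, not from the talk.

```python
import random
import time

import requests

BASE = "http://staging.example.com/images/"       # production-like environment, not a dev laptop
IMAGE_IDS = [f"{i}.jpg" for i in range(10_000)]   # many objects, so one cached image can't dominate

def measured_run(n=1000, warmup=200):
    latencies = []
    for i in range(warmup + n):
        url = BASE + random.choice(IMAGE_IDS)
        start = time.perf_counter()               # monotonic, high-resolution clock
        requests.get(url)
        elapsed = time.perf_counter() - start
        if i >= warmup:                           # drop warmup: connection pools, JITs, caches settle first
            latencies.append(elapsed)
    return latencies
```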

YOU'RE WRONG ABOUT THE STATS

Wrong About Stats

• Too few samples

[Chart: Convergence of Median on Samples (latency vs. time), comparing stable samples and stable median with decaying samples and decaying median]
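A small sketch of the idea behind the chart: instead of fixing the sample count up front, keep collecting until a running median stops moving. The window size and 1% tolerance are assumptions, not values from the talk.

```python
import statistics

def median_converged(samples, window=100, tolerance=0.01):
    """True if the median of all samples differs from the median computed
    without the most recent `window` samples by less than `tolerance` (relative)."""
    if len(samples) < 2 * window:
        return False                     # not enough data to judge convergence yet
    current = statistics.median(samples)
    previous = statistics.median(samples[:-window])
    return abs(current - previous) <= tolerance * previous

# Usage: append latencies in the measurement loop and stop once
# median_converged(latencies) has held for a few consecutive checks.
```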


[Chart: Multimodal Distribution (# occurrences vs. latency), with modes near 5 ms and 10 ms and the 50th and 99th percentiles marked]

Wrong About Stats

• Too few samples
• Gaussian (not)
• Multimodal distribution
• Outliers
(sketch below)
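A sketch of summarizing the same samples without assuming a Gaussian: report percentiles and the maximum rather than a single mean, which a multimodal distribution or a handful of outliers can render meaningless. The particular percentiles are conventional choices, not prescribed by the talk.

```python
import statistics

def summarize(latencies):
    qs = statistics.quantiles(latencies, n=100)   # 99 cut points: qs[49] ~ p50 ... qs[98] ~ p99
    return {
        "p50": qs[49],
        "p95": qs[94],
        "p99": qs[98],
        "max": max(latencies),                    # the outliers a mean would average away
    }
```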

YOU'RE WRONG ABOUT WHAT MATTERS

Wrong About What Matters

• Premature optimization

“Programmers waste enormous amounts of time thinking about … the speed of noncritical parts of their programs ... Forget about small efficiencies … 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”

-- Donald Knuth


Wrong About What Matters

• Premature optimization
• Unrepresentative workloads
• Memory pressure
• Load balancing
• Reproducibility of measurements
(workload sketch below)
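For the "unrepresentative workloads" point, a sketch of the difference between the uniform key pattern many benchmarks use and a skewed, production-like one. The heavy-tailed distribution and its parameter are illustrative assumptions; replaying real access logs is better still.

```python
import random

KEYS = [f"key-{i}" for i in range(100_000)]

def uniform_key():
    # What many benchmarks do: every key is equally likely.
    return random.choice(KEYS)

def skewed_key(alpha=1.1):
    # Closer to many production access patterns: a few hot keys get most of the traffic.
    # paretovariate returns a heavy-tailed rank >= 1; clamp it into the key space.
    rank = min(int(random.paretovariate(alpha)), len(KEYS))
    return KEYS[rank - 1]
```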

BECOMING LESS WRONG

User Actions Matter

X > Y for workload Z with trade-offs A, B, and C

- http://www.toomuchcode.org/

Profiling
Code instrumentation
Aggregate over logs
Traces
(instrumentation sketch below)
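A minimal code-instrumentation sketch along the lines this slide lists: time each operation, log it, and aggregate over the logs offline. The logger name and log format are assumptions.

```python
import functools
import logging
import time

log = logging.getLogger("latency")

def timed(op_name):
    """Decorator that logs the wall-clock latency of every call for later aggregation."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                log.info("%s %.6f", op_name, time.perf_counter() - start)
        return inner
    return wrap

# Usage: put @timed("image_fetch") above a handler, then aggregate the "latency" log stream.
```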

Microbenchmarking: Blessing & Curse

+ Quick & cheap
+ Answers narrow questions well
- Often misleading results
- Not representative of the program


[Chart: Choose Your N Wisely, Prof. Saman Amarasinghe, MIT 2009]


Microbenchmarking: Blessing & Curse

• Choose your N wisely
• Measure side effects
• Beware of clock resolution
• Dead code elimination
• Constant work per iteration
(sketch below)
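A sketch of a measurement loop that respects the last three bullets: batch enough iterations that clock resolution is negligible, keep the per-iteration work constant, and consume the result so the work cannot be optimized away (Python itself won't dead-code-eliminate it, but compiled and JIT-compiled runtimes will). The target batch duration is an assumption.

```python
import time

def bench(fn, target_seconds=0.2):
    """Return an estimated per-call time for fn(), which is assumed to return a number.
    Calls are batched so each timed batch is long relative to the clock's resolution."""
    n = 1
    while True:
        sink = 0
        start = time.perf_counter()
        for _ in range(n):
            sink += fn()          # consume the result: guards against dead-code elimination
        elapsed = time.perf_counter() - start
        if elapsed >= target_seconds:
            return elapsed / n    # constant work per iteration, so this average is meaningful
        n *= 2                    # batch was too short to time reliably; double N and retry
```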

Non-Constant Work Per Iteration
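The code that originally appeared on this slide isn't preserved in this extract; as a hypothetical illustration of the pitfall the title names, here is a loop whose per-iteration cost grows as it runs, so dividing total time by the iteration count tells you very little:

```python
# Non-constant work per iteration: each pass re-sums a list that keeps growing,
# so iteration 1000 does roughly 1000x the work of iteration 1.
data = []
for i in range(1000):
    data.append(i)
    total = sum(data)        # O(i) work hiding inside what looks like an O(1) loop body

# Constant work per iteration: keep a running total instead of re-summing.
total = 0
for i in range(1000):
    total += i
```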

Follow-up Material

• How NOT to Measure Latency by Gil Tene – http://www.infoq.com/presentations/latency-pitfalls

• Taming the Long Latency Tail on highscalability.com – http://highscalability.com/blog/2012/3/12/google-taming-the-long-latency-tail-when-more-machines-equal.html

• Performance Analysis Methodology by Brendan Gregg – http://www.brendangregg.com/methodology.html

• Silverman's Mode Detection Method by Matt Adereth – http://adereth.github.io/blog/2014/10/12/silvermans-mode-detection-method-explained/

HAVING FUN WITH RIAK

Setup

• SSD 30 GB
• M3 large
• Riak version 1.4.2-0-g61ac9d8
• Ubuntu 12.04.5 LTS
• 4 byte keys, 10 KB values

[Chart: Get Latency, latency (usec) vs. number of keys, with an L3 marker]
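A hedged sketch of how numbers like the chart above could be gathered. The talk's actual harness isn't shown in this extract; this version goes through Riak's HTTP interface for simplicity, and the host, port, bucket name, and key counts are assumptions.

```python
import time

import requests

RIAK = "http://127.0.0.1:8098"            # Riak HTTP interface; host and port are assumptions
VALUE = b"x" * 10 * 1024                  # 10 KB values, matching the setup slide

def load_and_measure(num_keys, samples=1000):
    keys = [str(i).zfill(4) for i in range(num_keys)]   # 4-byte keys, matching the setup slide
    for k in keys:                                      # load phase
        requests.put(f"{RIAK}/buckets/bench/keys/{k}", data=VALUE,
                     headers={"Content-Type": "application/octet-stream"})
    latencies = []
    for i in range(samples):                            # measurement phase
        k = keys[i % num_keys]
        start = time.perf_counter()
        requests.get(f"{RIAK}/buckets/bench/keys/{k}")
        latencies.append(time.perf_counter() - start)
    return latencies

# Sweep the key count to reproduce a latency-vs-number-of-keys curve:
# results = {n: load_and_measure(n) for n in (1_000, 10_000, 100_000)}
```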

Takeaway #1: Cache

Takeaway #2: Outliers

Takeaway #3: Workload

Benchmarking: You're Doing It Wrong

Aysylu Greenberg @aysylu22
