
CEC15 capacity and performance workshop


©  2015  ROBERT  BIGOS

Capacity rolling disaster challenge

Robert Bigos [email protected] +48 665-168-240

http://www.slideshare.net/RobertBigos

https://pl.linkedin.com/in/robertbigos

@bigosr

Agenda

Why is it so important?

Let's talk about:
• typical monitoring dashboards
• statistics and visualization basics
• queueing basics
• mathematician vs physicist
• complexity
• measurements, time and big data
• pattern visualizations
• lessons learned

WHY?

Why is capacity so important?

Source: presenter studies for top enterprises in Poland.

Source: http://s134.photobucket.com/user/charlesfrith/media/disaster.gif.html

Capacity vs performance

Capacity = fuel
Performance = speed and altitude

Capacity and performance management: how quickly and safely you can reach your planned destinations

Why is capacity so important?

Source: presenter studies for top enterprises in Poland.

Source: http://www.skybrary.aero/index.php/James_Reason_HF_Model

Root cause analysis?

Source: presenter studies for top enterprises in Poland.

Source: Józef Tischner, "The Highlander's History of Philosophy"

"the truth, the whole truth and the bullshit truth!"

Why is capacity so important?

Source: presenter studies for top enterprises in Poland.

is there capacity for growth?

is there capacity for change?

is there capacity for backup?

is there capacity to restore?

is it calculated?

is it tested?

what about quality expectations?

what about financial aspects?

time!

DR plan

There is no magic button

Source: http://make-everything-ok.com/

Backup plan?

try:

Ctrl+Z, Cmd+Z

CritSit

… it is a long story…

DR procedures, automation

people, training, testing

communications, leadership

DASHBOARD

Typical "time-graph-centric" performance dashboard

Source: Demo site dashboard grafana.org

Storage performance examples

Source: Łukasz Piskorz IBM SWG Lab

Death or lost signal?

Source: Łukasz Piskorz IBM SWG Lab

Peeping through the keyhole

Computer vs human scale: if 1 ns of computer time is stretched to 1 s of human time, a 5-minute sample is 5*60/10^-9/(60*60*24*365) ≈ 9513 years

few objects, few variables, no dependencies, no relations…
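The scale comparison above can be reproduced in a few lines. This is a minimal sketch that assumes, as the 10^-9 in the slide's arithmetic suggests, that one nanosecond of machine time is mapped to one second of human time; the function name `human_scale_years` is illustrative and not part of the original talk.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # non-leap year, as in the slide's formula


def human_scale_years(interval_seconds: float, machine_tick_seconds: float = 1e-9) -> float:
    """Years a human would live through if every machine tick lasted one second.

    Assumption (not stated explicitly on the slide): one machine tick is 1 ns.
    """
    if machine_tick_seconds <= 0:
        raise ValueError("machine_tick_seconds must be positive")
    ticks = interval_seconds / machine_tick_seconds  # number of machine ticks in the interval
    return ticks / SECONDS_PER_YEAR                  # one tick == one human second


if __name__ == "__main__":
    years = human_scale_years(5 * 60)
    print(f"A 5-minute sampling interval is about {years:,.0f} human-scale years")
```

At this scale a single dashboard data point averages away several millennia of events, which is the sense in which a time graph is a keyhole.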

"Threshold violation" troubleshooting

Source: Łukasz Piskorz IBM SWG Lab

STATISTICS

Anscombe's table visualisation

Descriptive statistics:
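The slide images are not preserved in this transcript, so the following is only a sketch of the point Anscombe's quartet is usually used to make: four datasets with nearly identical descriptive statistics that look completely different when plotted. The numbers are the standard published quartet values, not data taken from the slides, and the helper name `summarize` is illustrative.

```python
from statistics import mean, variance

# Standard Anscombe quartet values (x, y) for datasets I-IV.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the descriptive statistics that hide the differences between datasets."""
    return {
        "mean_x": mean(xs),
        "var_x": variance(xs),   # sample variance
        "mean_y": mean(ys),
        "var_y": variance(ys),
    }


if __name__ == "__main__":
    for name, (xs, ys) in ANSCOMBE.items():
        s = summarize(xs, ys)
        print(f"{name:>3}: mean_x={s['mean_x']:.2f} var_x={s['var_x']:.2f} "
              f"mean_y={s['mean_y']:.2f} var_y={s['var_y']:.2f}")
```

The summary rows come out almost the same for all four datasets; only a plot reveals the curve in II and the outliers in III and IV.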

Visualisation

Death zone

MATHEMATICIAN VS PHYSICIST

Quiz

- Sir, we cannot calculate it from our observations because we would have to divide X by 0.

(try to find the physicist's answer)

- Change 0 -> 0.0001 and show me a picture.

… fail fast and try to understand the big picture …

QUEUEING

Quiz

▪ What if: your sequential program spends 75% of its time on the server and 25% on storage, and we replace the storage with one that is 5 times faster? Calculate the percentage improvement in speed.

Theoretically

20% shorter run time

1.25× speedup

How?

Amdahl's law
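The quiz answer follows directly from Amdahl's law: speedup = 1 / ((1 - p) + p / s), where p is the fraction of run time that gets faster and s is the factor by which it gets faster. A minimal sketch; the function name is illustrative.

```python
def amdahl_speedup(improved_fraction: float, factor: float) -> float:
    """Overall speedup when `improved_fraction` of the run time becomes `factor` times faster."""
    if not 0.0 <= improved_fraction <= 1.0:
        raise ValueError("improved_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    untouched = 1.0 - improved_fraction          # e.g. the 75% spent on the server
    return 1.0 / (untouched + improved_fraction / factor)


if __name__ == "__main__":
    speedup = amdahl_speedup(improved_fraction=0.25, factor=5)
    time_saved = 1.0 - 1.0 / speedup
    print(f"speedup: {speedup:.2f}x, run time reduced by {time_saved:.0%}")
```

Even infinitely fast storage would cap the speedup at 1 / 0.75, about 1.33×, because the server part is untouched.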

Reality

… it depends …

Queues and buffers basics

• Response time depends on service time and queueing time
• Queue length depends on arrival rate and service time
• Utilization shows how busy the server is: work_time/measured_time or arrival rate * service time
• When utilization reaches saturation (100%), response time goes to infinity in some cases…
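To illustrate why response time explodes near saturation, here is a minimal sketch using the textbook M/M/1 model (single server, random arrivals, exponential service times). That model is an assumption chosen for illustration; the slides do not say which queueing model the plotted curve uses, and real storage systems behave differently in detail.

```python
import math


def utilization(arrival_rate: float, service_time: float) -> float:
    """Utilization law: fraction of time the server is busy."""
    return arrival_rate * service_time


def mm1_response_time(arrival_rate: float, service_time: float) -> float:
    """Mean response time (service + queueing) of an M/M/1 queue: R = S / (1 - U)."""
    rho = utilization(arrival_rate, service_time)
    if rho >= 1.0:
        return math.inf  # saturated: the queue grows without bound
    return service_time / (1.0 - rho)


if __name__ == "__main__":
    service_time = 0.010  # 10 ms per request (illustrative value)
    for rate in (10, 50, 80, 90, 95, 99):
        r = mm1_response_time(rate, service_time)
        print(f"utilization {utilization(rate, service_time):4.0%} -> response time {r * 1000:7.1f} ms")
```

The curve is flat at low utilization and nearly vertical close to 100%, so small load increases on a busy device can cause large latency jumps.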

Source: http://perfdynamics.blogspot.com/2010/03/bandwidth-vs-latency-world-is-curved.html Graph provided by Neil Gunther. Thanks!

To see more:

http://en.wikipedia.org/wiki/Amdahl's_law
http://en.wikipedia.org/wiki/Little's_law
http://www.perfdynamics.com/Manifesto/gcaprules.html

TOTAL_PORT_IO_RATE,READ_TRANSFER_SIZE,TOTAL_PORT_TO_LOCAL_NODE_IO_RATE,PORT_TO_DISK_RECEIVE_DATA_RATE  

TOTAL_PORT_DATA_RATE,WRITE_TRANSFER_SIZE,PORT_TO_REMOTE_NODE_SEND_IO_RATE,TOTAL_PORT_TO_DISK_DATA_RATE  

TOTAL_PORT_TRANSFER_SIZE,TOTAL_TRANSFER_SIZE,PORT_TO_REMOTE_NODE_RECEIVE_IO_RATE,PORT_TO_LOCAL_NODE_SEND_DATA_RATE  

PORT_SPEED,RECORD_MODE_READ_IO_RATE,TOTAL_PORT_TO_REMOTE_NODE_IO_RATE,PORT_TO_LOCAL_NODE_RECEIVE_DATA_RATE  

READ_IO_RATE_OVERALL,RECORD_MODE_READ_CACHE_HIT_PERC,PORT_TO_HOST_SEND_DATA_RATE,TOTAL_PORT_TO_LOCAL_NODE_DATA_RATE  

WRITE_IO_RATE_OVERALL,DISK_TO_CACHE_TRANSFER_RATE,PORT_TO_HOST_RECEIVE_DATA_RATE,PORT_TO_REMOTE_NODE_SEND_DATA_RATE  

TOTAL_IO_RATE_OVERALL,CACHE_TO_DISK_TRANSFER_RATE,TOTAL_PORT_TO_HOST_DATA_RATE,PORT_TO_REMOTE_NODE_RECEIVE_DATA_RATE  

READ_CACHE_HIT_PERC_OVERALL,WRITE_CACHE_DELAY_PERCENTAGE,PORT_TO_DISK_SEND_DATA_RATE,TOTAL_PORT_TO_REMOTE_NODE_DATA_RATE  

WRITE_CACHE_HIT_PERC_OVERALL,WRITE_CACHE_DELAY_IO_RATE,PORT_TO_DISK_RECEIVE_DATA_RATE,PORT_TO_LOCAL_NODE_SEND_RESPONSE_TIME  

TOTAL_CACHE_HIT_PERC_OVERALL,BACKEND_READ_IO_RATE,TOTAL_PORT_TO_DISK_DATA_RATE,OVERALL_PORT_TO_LOCAL_NODE_RESPONSE_TIME  

READ_DATA_RATE,BACKEND_WRITE_IO_RATE,PORT_TO_LOCAL_NODE_SEND_DATA_RATE,PORT_TO_LOCAL_NODE_SEND_QUEUE_TIME  

WRITE_DATA_RATE,TOTAL_BACKEND_IO_RATE,PORT_TO_LOCAL_NODE_RECEIVE_DATA_RATE,PORT_TO_LOCAL_NODE_RECEIVE_QUEUE_TIME  

TOTAL_DATA_RATE,BACKEND_READ_DATA_RATE,TOTAL_PORT_TO_LOCAL_NODE_DATA_RATE,OVERALL_PORT_TO_LOCAL_NODE_QUEUE_TIME  

READ_TRANSFER_SIZE,BACKEND_WRITE_DATA_RATE,PORT_TO_REMOTE_NODE_SEND_DATA_RATE,PORT_TO_REMOTE_NODE_SEND_RESPONSE_TIME  

WRITE_TRANSFER_SIZE,TOTAL_BACKEND_DATA_RATE,PORT_TO_REMOTE_NODE_RECEIVE_DATA_RATE,OVERALL_PORT_TO_REMOTE_NODE_RESPONSE_TIME  

TOTAL_TRANSFER_SIZE,BACKEND_READ_RESPONSE_TIME,TOTAL_PORT_TO_REMOTE_NODE_DATA_RATE,PORT_TO_REMOTE_NODE_SEND_QUEUE_TIME  

READ_IO_RATE_OVERALL,BACKEND_WRITE_RESPONSE_TIME,OVERALL_PORT_BANDWIDTH_PERCENTAGE,PORT_TO_REMOTE_NODE_RECEIVE_QUEUE_TIME  

WRITE_IO_RATE_OVERALL,OVERALL_BACKEND_RESPONSE_TIME,LOSS_OF_SYNC_RATE,OVERALL_PORT_TO_REMOTE_NODE_QUEUE_TIME  

TOTAL_IO_RATE_OVERALL,BACKEND_READ_TRANSFER_SIZE,INVALID_TRANSMISSION_WORD_RATE,PEAK_READ_RESPONSE_TIME  

READ_CACHE_HIT_PERC_OVERALL,BACKEND_WRITE_TRANSFER_SIZE,PORT_SEND_BANDWIDTH_PERCENTAGE,PEAK_WRITE_RESPONSE_TIME  

WRITE_CACHE_HIT_PERC_OVERALL,OVERALL_BACKEND_TRANSFER_SIZE,PORT_RECEIVE_BANDWIDTH_PERCENTAGE,LOSS_OF_SYNC_RATE  

TOTAL_CACHE_HIT_PERC_OVERALL,PORT_SEND_IO_RATE,BUFFER_TO_BUFFER_PERCENTAGE,INVALID_TRANSMISSION_WORD_RATE  

READ_DATA_RATE,PORT_RECEIVE_IO_RATE,PORT_SPEED,OVERALL_HOST_ATTRIBUTED_RESPONSE_TIME_PERCENTAGE  

WRITE_DATA_RATE,TOTAL_PORT_IO_RATE,READ_IO_RATE_OVERALL,PEAK_BACKEND_READ_RESPONSE_TIME  

TOTAL_DATA_RATE,PORT_SEND_DATA_RATE,WRITE_IO_RATE_OVERALL,PEAK_BACKEND_WRITE_RESPONSE_TIME  

READ_TRANSFER_SIZE,PORT_WRITE_DATA_RATE,TOTAL_IO_RATE_OVERALL,PEAK_BACKEND_READ_QUEUE_TIME  

WRITE_TRANSFER_SIZE,TOTAL_PORT_DATA_RATE,READ_CACHE_HIT_PERC_OVERALL,PEAK_BACKEND_WRITE_QUEUE_TIME  

TOTAL_TRANSFER_SIZE,PORT_SEND_RESPONSE_TIME,WRITE_CACHE_HIT_PERC_OVERALL,READ_IO_RATE_OVERALL  

REAL_SPACE,PORT_RECEIVE_RESPONSE_TIME,TOTAL_CACHE_HIT_PERC_OVERALL,WRITE_IO_RATE_OVERALL  

ND_WRITE_TRANSFER_SIZE,REAL_SPACE  

PORT_SPEED,WRITE_RESPONSE_TIME,OVERALL_BACKEND_TRANSFER_SIZE,IO_DENSITY  

READ_IO_RATE_NORMAL,TOTAL_RESPONSE_TIME,PORT_SEND_IO_RATE,PORT_SEND_PACKET_RATE  

READ_IO_RATE_SEQUENTIAL,READ_TRANSFER_SIZE,PORT_RECEIVE_IO_RATE,PORT_WRITE_PACKET_RATE  

READ_IO_RATE_OVERALL,WRITE_TRANSFER_SIZE,TOTAL_PORT_IO_RATE,TOTAL_PORT_PACKET_RATE  

WRITE_IO_RATE_NORMAL,TOTAL_TRANSFER_SIZE,PORT_SEND_DATA_RATE,PORT_SEND_DATA_RATE  

WRITE_IO_RATE_SEQUENTIAL,RECORD_MODE_READ_IO_RATE,PORT_WRITE_DATA_RATE,PORT_WRITE_DATA_RATE  

WRITE_IO_RATE_OVERALL,RECORD_MODE_READ_CACHE_HIT_PERC,TOTAL_PORT_DATA_RATE,TOTAL_PORT_DATA_RATE  

TOTAL_IO_RATE_NORMAL,DISK_TO_CACHE_TRANSFER_RATE,READAHEAD_PERCENTAGE_OF_CACHE_HITS,PORT_PEAK_SEND_DATA_RATE  

TOTAL_IO_RATE_SEQUENTIAL,CACHE_TO_DISK_TRANSFER_RATE,DIRTY_WRITE_PERCENTAGE_OF_CACHE_HITS,PORT_PEAK_RECEIVE_DATA_RATE  

TOTAL_IO_RATE_OVERALL,WRITE_CACHE_DELAY_PERCENTAGE,WRITE_CACHE_FLUSH_THROUGH_PERCENTAGE,PORT_SEND_PACKET_SIZE  

READ_CACHE_HIT_PERC_NORMAL,WRITE_CACHE_DELAY_IO_RATE,WRITE_CACHE_FLUSH_THROUGH_IO_RATE,PORT_RECEIVE_PACKET_SIZE  

READ_CACHE_HIT_PERC_SEQUENTIAL,REAL_SPACE,PORT_TO_HOST_RECEIVE_IO_RATE,OVERALL_PORT_PACKET_SIZE  

READ_CACHE_HIT_PERC_OVERALL,IO_DENSITY,TOTAL_PORT_TO_HOST_IO_RATE,LOSS_OF_SYNC_RATE  

Queues: storage example

▪ Total IO rate * response time = queue length (population)
▪ Total IO rate * service time = utilization
▪ Queue length / (1 + queue length) = utilization
▪ Service time: never used
▪ Each IO has its own characteristics: r/w, size, cache, seq/rand, ~30 (some of them represent a subqueue)
▪ Many of the "monitored" variables are calculated!
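The three relations above can be chained to estimate values that the monitoring data does not report directly. A minimal sketch: the first function is Little's law, the second is the slide's queue-length relation (which holds for a single M/M/1-style queue, an assumption here), and the third combines it with the utilization law. Function names and sample numbers are illustrative, not taken from the talk.

```python
def queue_length(io_rate: float, response_time: float) -> float:
    """Little's law: average number of IOs in the system (queued + in service)."""
    return io_rate * response_time


def utilization_from_queue(n: float) -> float:
    """Slide relation U = N / (1 + N); valid under a single M/M/1-style queue assumption."""
    return n / (1.0 + n)


def implied_service_time(io_rate: float, response_time: float) -> float:
    """Utilization law rearranged: S = U / X, with U derived from the queue length."""
    if io_rate <= 0:
        raise ValueError("io_rate must be positive")
    n = queue_length(io_rate, response_time)
    return utilization_from_queue(n) / io_rate


if __name__ == "__main__":
    io_rate = 2000.0        # IO/s (illustrative)
    response_time = 0.002   # 2 ms (illustrative)
    n = queue_length(io_rate, response_time)
    u = utilization_from_queue(n)
    s = implied_service_time(io_rate, response_time)
    print(f"queue length {n:.1f}, utilization {u:.0%}, implied service time {s * 1000:.2f} ms")
```

This also shows why many "monitored" variables are derived: with an IO rate and a response time, queue length, utilization and service time all follow from arithmetic.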

Queues and buffers basics

Source: Kanal von FerdinandLutz, "Stay in queue", youtube.com

Queues, cache and buffers basics

"Home made supercomputer"

IT infrastructure in the enterprise is like a "home made supercomputer": very complicated and interconnected. It is designed by the business department, acquired by the procurement department, implemented and managed by the IT department, and used by unpredictable users…

Quality always depends on design and implementation

Automation requires standardization!

COMPLEXITY

Complexity of hardware:

Source: Anvaka GitHub user site. 100k packages and 200k connections

>400 million, just in IBM database

Source: IBM interoperability site

Complexity of software … npm

Source: Anvaka GitHub user site. 100k packages and 200k connections

Known unknowns and unknown unknowns

"…Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know…"

Donald Rumsfeld, February 12th, 2002 DOD News Briefing

Source: http://www.defense.gov/transcripts/transcript.aspx?transcriptid=2636

Small BigData

VOLUME, VARIETY, VELOCITY

Source: Andrew Clay Shafer, "Devops, microservices and platforms, oh my"

A small bank has about 48 devices (storage, SAN) with 2011 monitored components. Collecting a sample every 5 minutes over 6 days yields 3.37 million records:

Time, Device_id, Group_ID, Component_id (FACTORs), 34 variables (on average)

about 1.6 GB in an RDBMS

That's just the enterprise storage and FC network!

You have to add layers: LAN, WAN, servers, VM, DB, App
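The record count above can be sanity-checked with simple arithmetic. This sketch computes an upper bound that assumes every component reports in every 5-minute interval; the function name is illustrative. The bound comes out slightly above the 3.37 million records quoted, which would be consistent with some samples being missing, although the slides do not say so.

```python
from datetime import timedelta


def expected_records(components: int, period: timedelta, interval: timedelta) -> int:
    """Upper bound on records: one record per component per sampling interval."""
    samples_per_component = period // interval  # timedelta // timedelta yields an int
    return components * samples_per_component


if __name__ == "__main__":
    upper_bound = expected_records(2011, timedelta(days=6), timedelta(minutes=5))
    reported = 3_370_000           # figure quoted on the slide
    db_size_bytes = 1.6e9          # "about 1.6 GB", also from the slide
    print(f"upper bound: {upper_bound:,} records")
    print(f"reported / upper bound: {reported / upper_bound:.1%}")
    print(f"approx. bytes per record: {db_size_bytes / reported:.0f}")
```

Repeating the calculation for each extra layer (LAN, WAN, servers, VM, DB, App) gives a quick feel for how fast "small" monitoring data stops being small.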

VISUALIZATIONS

Big picture understanding

Source: DDoS attack by China & Ukraine

Big picture understanding

Source: An Actual 160 Gbps DDoS Attack Being Mitigated by Prolexic's Global

Big picture understanding

Source: Sense of Patterns - Animations from Mahir M. Yavuz

UNDERUTILIZATION?

Example: datacenter

Source: Christina Delimitrou, presented April 3rd, 2014 at @TwitterOSS #conf

Reserved vs. used 2009

Source: Christina Delimitrou, presented April 3rd, 2014 at @TwitterOSS #conf

Example: leaders datacenter

Source: Christina Delimitrou, presented April 3rd, 2014 at @TwitterOSS #conf

2009

CFO dream / reality issue

Utilization heatmap

GREEN - not used - losing money
RED - business and customer waiting
YELLOW - perfect balance

About 700 volumes on enterprise Tier 1 and 2, 500 TB

Y = 24 h

X = 7 days of observations

Utilization
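A heatmap with hours of the day on the Y axis and days on the X axis needs the raw 5-minute samples folded into a 24-row grid first. The sketch below shows one way to do that with the standard library only; the input format (timestamp, value pairs), the use of the hourly mean, and the function name are assumptions, since the talk does not describe how its heatmaps were built.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def heatmap_grid(samples):
    """Fold (timestamp, value) samples into grid[hour][day_index] of hourly means.

    Returns (days, grid); cells without any sample are None.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.date(), ts.hour)].append(value)
    days = sorted({day for day, _ in buckets})
    grid = [
        [mean(buckets[(day, hour)]) if (day, hour) in buckets else None for day in days]
        for hour in range(24)
    ]
    return days, grid


if __name__ == "__main__":
    # Synthetic data: 7 days of 5-minute samples, busier during office hours.
    start = datetime(2015, 6, 1)
    samples = []
    for i in range(7 * 24 * 12):
        ts = start + timedelta(minutes=5 * i)
        samples.append((ts, 0.8 if 8 <= ts.hour < 17 else 0.1))
    days, grid = heatmap_grid(samples)
    print(f"{len(grid)} rows (hours) x {len(days)} columns (days)")
```

The resulting grid can be handed to any plotting library, with a color scale that maps low values to green, mid values to yellow and high values to red as in the legend above.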

Read response time heatmap

Y = 24 h

X = 7 days of observations

GREEN - ok?
RED - ?

About 700 volumes on enterprise Tier 1 and 2, 500 TB

Peak read response time heatmap

>100 ms

Y = 24 h

X = 7 days of observations

GREEN - ok
RED - slow or very slow
GREY - potential "timeouts"

About 700 volumes on enterprise Tier 1 and 2, 500 TB

Peak read response time heatmap

Y = 24 h

X = 7 days of observations

GREEN - ok?
RED - slow or very slow
GREY - potential "timeouts"

About 700 volumes on enterprise Tier 1 and 2, 500 TB

Peak write response time heatmap

>100 ms

Y = 24 h

X = 7 days of observations

GREEN - ok
RED - slow or very slow
GREY - potential "timeouts"

About 700 volumes on enterprise Tier 1 and 2, 500 TB

SUMMARY

Lessons learned?

• data comes from the devil
• date/time/log format - wow!
• models come from God (in the past)
• visualizations come from God (it is the future)
• it is all about scale (computer vs human)
• understand the big picture, dive deep if you have to…
• try to keep standards
• break the rules (physicist vs mathematician)
• good enough is better than perfect
• understanding is more important than precision
• patterns/motion/colors - common human understanding
• There is No Single Version of the Truth …

Lessons learned?

• a back-up plan should be an important part of the plan
• underutilization can be a planned goal
• a rolling disaster mostly starts in a non-critical environment and propagates to all connected environments (like cholesterol)
• root cause analysis is challenging
• logs show "directions" or one version of the truth
• no backup, no fun!
• There is No Single Version of the Truth …

Robert Bigos [email protected] +48 665-168-240

http://www.slideshare.net/RobertBigos

https://pl.linkedin.com/in/robertbigos

@bigosr

If  you  need  more…