Self-Driving Networks
Balaji Prabhakar and Mendel Rosenblum, Departments of EE and CS, Stanford




Background: Self-Driving Networks

1960s-2000: Packet switching developed
• Protocols for computer-to-computer communication
• Algorithms for routing, switching, load balancing, congestion control, …

2005-now: Software-Defined Networking (SDN)
• Programmability and flexibility

Now onward: Self-Driving Networks
• Autonomy: the network senses and monitors itself; programs and controls itself
• Interactivity: infrastructure should be transparent and fun to interact with, especially for third-party users


What does "self-driving" mean?

[Figure labels: DCN, Workload]

Given a data center network (DCN) and a workload of jobs that arrive over time
• Allocate resources (network, CPU, memory, storage) so that
  - Jobs are processed quickly (small job completion time), and
  - Resource utilization is efficient

Key functions: Sense, Infer, Learn, Control


Sense DCN from the Edge

[Figure labels: TX Timestamp, RX Timestamp]


Sensing at the Edge

NIC-based telemetry
• More scalable: edge observations are a sufficient statistic
  - No per-queue counters
  - No per-packet measurement
  - No extra network traffic due to sensed data
• Doesn't need a forklift upgrade of the network, or switches from the same vendor (data formats)
• Just needs NICs that are capable of timestamping probes/packets
  - Pretty standard for most 10, 40, 100G NICs

Of course, if switches can give extra data, that'll help
• E.g., the path followed by a probe/packet

NIC-based probing is also useful for fine-grained clock synchronization
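To make the edge measurement concrete, here is a minimal sketch of how a pair of NIC timestamps could be turned into a delay sample. The `Probe` record, its field names, and the idea of subtracting the smallest observed delay as a stand-in for the uncongested path delay are all illustrative assumptions, not details given on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """One probe observed at the edge (hypothetical record layout)."""
    src: str
    dst: str
    tx_ns: int  # NIC timestamp at the sender, in the sender's clock
    rx_ns: int  # NIC timestamp at the receiver, in the receiver's clock


def one_way_delay_ns(probe, clock_offset_ns=0):
    """One-way delay after removing the receiver-minus-sender clock offset."""
    return (probe.rx_ns - probe.tx_ns) - clock_offset_ns


def queueing_delays_ns(probes, clock_offset_ns=0):
    """Excess delay of each probe over the fastest probe on the same path.

    Assumption: the fastest probe saw empty queues, so its delay
    approximates the fixed propagation and serialization delay.
    """
    delays = [one_way_delay_ns(p, clock_offset_ns) for p in probes]
    base = min(delays)
    return [d - base for d in delays]


probes = [
    Probe("a", "b", tx_ns=1000, rx_ns=6200),
    Probe("a", "b", tx_ns=2000, rx_ns=7500),
    Probe("a", "b", tx_ns=3000, rx_ns=8200),
]
excess = queueing_delays_ns(probes, clock_offset_ns=200)
```

The sketch also shows why the same probes matter for clock synchronization: the one-way delay is only meaningful once the offset between the two clocks is known.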


Infer

Network bottlenecks
• Detailed buffer depths at switches
• Link utilizations
• Queue and link compositions
  - Whose packets are in the queues/links?
  - Which applications, tenants' traffic, etc.?
• Link failures, brownouts

Application performance
• Timeouts
• Predict stragglers
• Comparisons/regressions
  - Why did the latest software patch slow things down?

Challenges: noisy data, sparse observations, speed, scalability


Roadmap and progress

Current focus: sense at the edge, infer/reconstruct fine network details
• Mesh of probes for sensing
• Machine learning and neural nets for real-time inference
  - Function approximation: implementing network algorithms with NNs → much faster
  - Pattern recognition: learning network load from patterns (packet traces, CPU/memory utilization patterns)

Built a system with the following modules:
• Software-based clock synchronization system: ~10s of nanoseconds accuracy
• Network reconstruction: accurately infer queues, link utilization, etc.
• Query and rendering engines: interactive visualization and diagnostic tool

Future work will focus on learning and control
• Learning best responses in real time using reinforcement learning
• Integration with network controllers for real-time autonomous control


Platforms and Testbeds

Google testbed
• 40G links, 40 racks, 5-stage Clos switching
  - Collaboration with Ashish Naik and Amin Vahdat at Google

Stanford testbed
• 1G links, 128 servers, 2-stage Clos switching

[Figure label: Cisco 2960]


Network Evolution Reconstruction from Edge-based Timestamps


Sensing DCN from the edge

[Figure labels: TX Timestamp, RX Timestamp]


Algorithm

• Input:
  - 5-tuple flow IDs, for inferring network paths
  - Rx, Tx timestamps of probes
• Basic equations
  - For each packet: [equation not preserved]
  - Combine all packets: [equation not preserved]
  - Solve for queue sizes: use the Lasso algorithm
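The slide's equations did not survive, so the following is only a sketch of one plausible formulation, not the authors' exact method. It assumes that each probe's excess one-way delay equals the sum of the queueing delays at the queues on its path, stacks those per-probe equations into a linear system with a 0/1 routing matrix, and solves it with a Lasso (L1-penalized least squares) via plain coordinate descent. The function names, the penalty `lam`, and the toy topology are hypothetical.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam (the proximal step for the L1 penalty)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso_queue_delays(paths, delays, num_queues, lam=0.01, sweeps=500):
    """Minimize 0.5 * ||d - A q||^2 + lam * ||q||_1 by coordinate descent.

    paths[i]  : set of queue indices traversed by probe i (assumed known,
                e.g. inferred from the probe's 5-tuple)
    delays[i] : excess one-way delay of probe i
    Returns the estimated per-queue delay vector q.
    """
    n = len(paths)
    # Routing matrix: A[i][j] = 1 if probe i passes through queue j.
    A = [[1.0 if j in path else 0.0 for j in range(num_queues)] for path in paths]
    col_sq = [sum(A[i][j] ** 2 for i in range(n)) for j in range(num_queues)]
    q = [0.0] * num_queues
    residual = [float(d) for d in delays]  # equals d - A q while q is all zeros
    for _ in range(sweeps):
        for j in range(num_queues):
            if col_sq[j] == 0.0:
                continue  # no probe crosses this queue, so it stays at zero
            # Correlation of column j with the residual that ignores queue j.
            rho = sum(A[i][j] * (residual[i] + A[i][j] * q[j]) for i in range(n))
            new_qj = soft_threshold(rho, lam) / col_sq[j]
            step = new_qj - q[j]
            for i in range(n):
                residual[i] -= A[i][j] * step
            q[j] = new_qj
    return q


# Toy example: four queues, six probes, only queue 1 is congested (30 us).
paths = [{0, 1}, {1, 2}, {2, 3}, {0, 3}, {0, 2}, {1, 3}]
delays_us = [30.0, 30.0, 0.0, 0.0, 0.0, 30.0]
estimate = lasso_queue_delays(paths, delays_us, num_queues=4)
```

The L1 penalty favors solutions in which only a few queues carry delay, which is a reasonable prior if congestion is concentrated at a small number of bottlenecks. In the toy example the estimate should place nearly all of the delay on queue 1.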


Estimates well


Clock Synchronization


Background

Clock synchronization is a classical hard problem that affects the performance of distributed systems
• Accurate synchronization can boost the performance of existing solutions
  - e.g., in databases, by maintaining causality and external consistency
• Or enable new ones
  - e.g., fine-grained resource and task scheduling, real-time distributed control, etc.
• The problem has become more severe as clock precision and event frequency have gone up: milliseconds → microseconds → nanoseconds

Current solutions
• Expensive: PTP and PPS require compatible hardware
• Uneven performance: many PTP-compatible switches perform poorly under load


Clock synchronization

Pairwise clock drifts
• Typically 5-10 microseconds/sec
• Can be as high as 30 microseconds/sec

Clock frequency varies with temperature
• Ideal temperature ~25-28 degrees centigrade
• Resonance frequency changes quadratically with temperature: a 10 °C change ~ 3.35 µs/s
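The drift figures above translate directly into how often clocks must be corrected. The sketch below works through that arithmetic. The quadratic coefficient is back-computed from the slide's own data point (3.35 µs/s for a 10 °C change, so 3.35 / 10² = 0.0335 µs/s per °C²), and the 40 ns error budget is borrowed from the error plot later in the deck purely as an illustrative target; both function names are hypothetical.

```python
def drift_us_per_s(delta_temp_c, k_us_per_s_per_c2=0.0335):
    """Magnitude of frequency error for a temperature deviation, in us/s.

    Assumes the quadratic model |drift| = k * dT^2, with k derived from the
    slide's data point of 3.35 us/s at a 10 degree C deviation.
    Note that 1 us/s is the same as 1 part per million (ppm).
    """
    return k_us_per_s_per_c2 * delta_temp_c ** 2


def resync_interval_s(drift_rate_us_per_s, error_budget_ns):
    """Longest time two clocks can run uncorrected before exceeding the budget."""
    return (error_budget_ns / 1000.0) / drift_rate_us_per_s


drift_at_10c = drift_us_per_s(10.0)            # about 3.35 us/s
typical_interval = resync_interval_s(10.0, 40.0)  # 10 us/s drift, 40 ns budget
worst_interval = resync_interval_s(30.0, 40.0)    # 30 us/s drift, 40 ns budget
```

At a typical pairwise drift of 10 µs/s, a 40 ns budget is used up in about 4 ms, and at 30 µs/s in roughly 1.3 ms, unless the drift rate itself is estimated and compensated.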


Our solution

A software clock synchronization system
• Probe-based, only needs timestamping-capable NICs → same probe mesh needed for reconstruction

Synchronization accuracy of 10s of nanoseconds
• Accuracy verified against NetFPGAs

[Figure: synchronization error (ns), mean and 99th percentile, versus network load (%), for loads from 0 to 80%]

Synchronization error stays under 40 ns at 80% load
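The slides do not describe how the probes are turned into clock corrections. As a stand-in, the sketch below uses the classic two-way timestamp exchange (the same arithmetic NTP uses), which is a named substitute and should not be read as the authors' algorithm. It assumes forward and reverse path delays are equal, and it picks the exchange with the smallest round-trip time as the one least distorted by queueing. All names are hypothetical.

```python
def offset_and_rtt_ns(t1, t2, t3, t4):
    """Two-way exchange between hosts A and B.

    t1: A sends probe    (A's clock)
    t2: B receives probe (B's clock)
    t3: B sends reply    (B's clock)
    t4: A receives reply (A's clock)
    Returns (offset of B's clock relative to A's, round-trip time),
    assuming symmetric one-way delays.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    rtt = (t4 - t1) - (t3 - t2)
    return offset, rtt


def best_offset_ns(exchanges):
    """Offset from the exchange with the smallest round-trip time."""
    results = [offset_and_rtt_ns(*e) for e in exchanges]
    offset, _rtt = min(results, key=lambda r: r[1])
    return offset


# B's clock is 500 ns ahead of A's; the uncongested one-way delay is 1000 ns.
exchanges = [
    (0, 1500, 1600, 2100),           # no queueing in either direction
    (10000, 11900, 12000, 12500),    # 400 ns of queueing on the forward path
]
estimated_offset = best_offset_ns(exchanges)
```

The second exchange alone would give a biased offset because queueing breaks the symmetry assumption, which is why filtering out delayed probes matters for accuracy under load.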


Summary

Self-Driving Networks is a multi-year project
• Current system has clock sync and network reconstruction
• Ready for wider deployment: many use cases beyond telemetry
  - Regressions/comparisons, forensics, planning, purchasing, policy setting, …
• Initial work on learning; in future, we'll be doing more learning and control

We welcome your feedback and collaborations