

Benjamin Hindman – @benh

Containerization as the Building Block for Datacenter Applications, GOTO Amsterdam

June 19, 2015

an emerging trend: microservices + containerization

(micro)services
① do one thing and do it well (UNIX)
② compose!
③ build/commit in isolation, test in isolation, deploy in isolation (with easy rollback)
④ captures organizational structure (many teams working in parallel)

containerization

now vs. then: more moving parts vs. less moving parts

a reinforcing trend: microservices + containerization

cluster management

“cluster management” at Twitter, circa 2010: (configuration/package management) + (deployment)

a growing stack: MySQL, Cassandra, Rails, Hadoop, memcached

challenges

① failures (across MySQL, Cassandra, Rails, Hadoop, memcached)

challenges

② maintenance (aka “planned failures”)

maintenance
① upgrading software (e.g., the kernel)
② replacing machines, switches, PDUs, etc.

challenges

③ utilization

Rails, Hadoop, and memcached each run on their own machines, leaving much of each machine idle

buy fewer machines … or run more applications!

challenges
① failures
② maintenance
③ utilization

planning for failure?

planning for failure

planning for utilization?

planning for utilization

intra-machine resource sharing: share a single machine’s resources between multiple applications (multi-tenancy)

intra-datacenter resource sharing: share multiple machines’ resources between multiple applications

Twitter, circa 2010: I want a cluster manager!

cluster management
① Treat machines as cattle, not pets. » Keep the base operating system small and simple; run “containerized” applications.
② Automate with software, not humans. » Let software schedule software, i.e., handle failures, improve utilization, and manage maintenance.

cluster management: academia vs. industry

different software:
•  academia: MPI (Message Passing Interface)
•  industry: Apache (mod_perl, mod_php); web services (Java, Ruby, …)

different scale (at first):
•  academia: 100’s of machines
•  industry: 10’s of machines

cluster managers:
•  academia: PBS (Portable Batch System), TORQUE, SGE (Sun Grid Engine)
•  industry: ssh, Puppet/Chef, Capistrano/Ansible

different scale (converging): both now at 1,000’s of machines

but these cluster managers target batch computation!

Apache Mesos is a modern general-purpose cluster manager (i.e., not just focused on batch computing)

Mesos was designed to run: stateless services!

Mesos is a cluster manager with a master/slave architecture

schedulers register with the Mesos master(s) in order to run jobs/tasks

a service scheduler registers with Mesos and orchestrates services using Mesos

orchestration vs. scheduling: the service scheduler orchestrates, and Mesos schedules

orchestration w/ Mesosphere’s Marathon
① configuration/package management
② deployment

configuration/package management:
(1) bundle services as jar, tar/gzip, or using Docker
(2) upload to HDFS, S3, a Docker registry, etc.

deployment:
(1) describe services using JSON
(2) submit services to Marathon via REST or CLI

example-docker.json:

{
  "container": {
    "type": "DOCKER",
    "docker": { "image": "libmesos/ubuntu" },
    "volumes": [
      { "containerPath": "/etc/a", "hostPath": "/var/data/a", "mode": "RO" },
      { "containerPath": "/etc/b", "hostPath": "/var/data/b", "mode": "RW" }
    ]
  },
  "id": "ubuntu",
  "instances": 1,
  "cpus": 0.5,
  "mem": 512,
  "cmd": "while sleep 10; do date -u +%T; done"
}
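The deployment step is just an HTTP POST of this JSON to Marathon’s REST API (the /v2/apps endpoint). A minimal Python sketch, assuming a Marathon instance listening on localhost:8080:

```python
import json

# The same app definition as example-docker.json, built in Python.
app = {
    "id": "ubuntu",
    "instances": 1,
    "cpus": 0.5,
    "mem": 512,
    "cmd": "while sleep 10; do date -u +%T; done",
    "container": {
        "type": "DOCKER",
        "docker": {"image": "libmesos/ubuntu"},
        "volumes": [
            {"containerPath": "/etc/a", "hostPath": "/var/data/a", "mode": "RO"},
            {"containerPath": "/etc/b", "hostPath": "/var/data/b", "mode": "RW"},
        ],
    },
}

payload = json.dumps(app)

# Submitting is a single POST (left commented here so the sketch runs
# without a live Marathon):
#
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8080/v2/apps",
#       data=payload.encode(),
#       headers={"Content-Type": "application/json"})
#   urllib.request.urlopen(req)
```

The equivalent CLI submission pipes the same JSON through Marathon’s REST endpoint.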

orchestration w/ Kubernetes on Mesos
① configuration/package management
② deployment

multiple schedulers run side by side on the same Mesos cluster

going deeper …

two-level scheduling: Mesos was influenced by multi-level scheduling in traditional operating systems (user-space scheduling and scheduler activations)

 

Mesos is designed less like a “cluster manager” and more like an operating system kernel

Apache Mesos: distributed systems kernel

Apache Mesos: datacenter kernel

schedulers talk to the Mesos master(s), which manage Mesos on the nodes: a syscall-like API for the datacenter

Mesos: datacenter kernel + enables running multiple distributed systems on the same cluster of machines, dynamically sharing the resources more efficiently!

Mesos: datacenter kernel + enables building new distributed systems by providing the common functionality (primitives) that every new distributed system re-implements

Mesos primitives
•  principals, users, roles
•  advanced fair-sharing allocation algorithms
•  high-availability (even during upgrades)
•  resource monitoring
•  preemption/revocation
•  volume management
•  reservations (dynamic/static)
•  …

build on top of Mesos

① don’t reinvent the wheel: leverage primitives to implement/automate handling of failures, maintenance, etc.

② make it easier for your users to use your software!

(timelines of systems built on Mesos, systems ported to Mesos, and companies running Mesos, 2009–2014)

going even deeper …

Mesos resource requests/offers

scheduler → master: request 3 CPUs, 2 GB RAM

a request is a purposely simplified subset of a specification, mainly including the required resources at that point in time

Mesos resource requests/offers

master → scheduler: offer (hostname, 4 CPUs, 4 GB RAM), one offer per slave with available resources

the scheduler uses the offers to decide what tasks to run: “two-level scheduling”
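The offer flow can be sketched as a toy simulation (illustrative names only, not the real Mesos API): the first level is the master handing out resource offers; the second level is the scheduler packing its tasks onto those offers.

```python
# Toy sketch of two-level scheduling: the master offers per-host
# resources; the scheduler decides which tasks go where.

class Offer:
    def __init__(self, hostname, cpus, mem):
        self.hostname, self.cpus, self.mem = hostname, cpus, mem

def launch(offers, tasks):
    """Second level: pack each pending task (cpus, mem) onto the first
    offer that still has room, consuming the offered resources."""
    launched = []
    for task_cpus, task_mem in tasks:
        for offer in offers:
            if offer.cpus >= task_cpus and offer.mem >= task_mem:
                offer.cpus -= task_cpus
                offer.mem -= task_mem
                launched.append((offer.hostname, task_cpus, task_mem))
                break
    return launched

offers = [Offer("host1", 4, 4096), Offer("host2", 4, 4096)]
tasks = [(3, 2048), (3, 2048)]  # each task wants 3 CPUs, 2 GB RAM
print(launch(offers, tasks))
# each 3-CPU task lands on a different host, since one 4-CPU host
# cannot fit both
```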

Mesos task/executor model

the scheduler uses the offers to decide what tasks to run, e.g., launch task (3 CPUs, 2 GB RAM) on a slave

a task with a command: the mesos-slave runs the task directly (one or more tasks per slave)

a task with an executor: the mesos-slave launches an executor, and the executor runs one or more tasks

task/executor isolation: each executor and its tasks run inside their own container

a task with a Docker image: the container is created from a Docker image

master failover: after a new master is elected, all schedulers and slaves connect to the new master; all tasks keep running across master failover!

scheduler failover: the scheduler re-registers with the master and resumes operation; all tasks keep running across framework failover!

slave failover: tasks keep running while the mesos-slave restarts and reconnects to them (important for large in-memory services, which are expensive to restart)

down the rabbit hole …

the Mesos 1st-level scheduler allocates resources to frameworks using a fair-sharing algorithm we created called Dominant Resource Fairness (DRF)

DRF, born of static partitioning

static partitioning across teams: the datacenter is statically partitioned between the promotions, trends, and recommendations teams; fairly shared!

goal: fairly share the resources without static partitioning

partition utilizations (per team):
•  promotions: 45% CPU, 100% RAM
•  trends: 75% CPU, 100% RAM
•  recommendations: 100% CPU, 50% RAM

observation: a dominant resource bottlenecks each team from running any more jobs/tasks

dominant resource bottlenecks:
•  promotions: RAM
•  trends: RAM
•  recommendations: CPU

insight: allocating a fair share of each team’s dominant resource guarantees they can run at least as many jobs/tasks as with static partitioning!

… if my team gets at least 1/N of my dominant resource I will do no worse than if I had my own cluster, but I might do better when resources are available!

Step 4: Profit (statistical multiplexing) $
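That insight is the core of DRF. A toy sketch (illustrative, not the Mesos allocator), using the classic DRF example: a cluster with 9 CPUs and 18 GB RAM, and two frameworks whose tasks need <1 CPU, 4 GB> and <3 CPUs, 1 GB>:

```python
# Toy DRF: repeatedly give the next task to whichever framework has the
# lowest dominant share (its largest fractional share of any resource).

def dominant_share(allocated, total):
    return max(allocated[r] / total[r] for r in total)

def drf_allocate(total, demands):
    allocated = {f: {r: 0 for r in total} for f in demands}
    used = {r: 0 for r in total}
    active = set(demands)
    while active:
        # pick the framework with the smallest dominant share
        f = min(active, key=lambda g: dominant_share(allocated[g], total))
        task = demands[f]
        if any(used[r] + task[r] > total[r] for r in total):
            active.discard(f)  # this framework's next task no longer fits
            continue
        for r in total:
            allocated[f][r] += task[r]
            used[r] += task[r]
    return allocated

total = {"cpus": 9, "mem": 18}
demands = {"A": {"cpus": 1, "mem": 4}, "B": {"cpus": 3, "mem": 1}}
alloc = drf_allocate(total, demands)
# A ends up with 3 tasks (3 CPUs, 12 GB), B with 2 tasks (6 CPUs, 2 GB);
# both reach a dominant share of 2/3.
print(alloc)
```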

in practice, fair sharing is insufficient

weighted fair sharing (per team):
•  promotions: weight 0.17
•  trends: weight 0.5
•  recommendations: weight 0.33

Mesos implements weighted DRF: masters can be configured with weights per role, and resource allocation decisions incorporate the weights to determine dominant fair shares
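One way weights can enter the computation (a sketch, not Mesos internals): divide a role’s dominant share by its weight, so higher-weight roles look further from their fair share and get offered resources sooner. The weights below are the ones from the slide.

```python
# Weighted dominant shares: equal allocations, different weights.

def weighted_dominant_share(allocated, total, weight):
    share = max(allocated[r] / total[r] for r in total)
    return share / weight

total = {"cpus": 100, "mem": 100}
roles = {
    "promotions": ({"cpus": 10, "mem": 5}, 0.17),
    "trends": ({"cpus": 10, "mem": 5}, 0.5),
    "recommendations": ({"cpus": 10, "mem": 5}, 0.33),
}
shares = {name: weighted_dominant_share(alloc, total, w)
          for name, (alloc, w) in roles.items()}

# with equal allocations, the highest-weight role (trends) has the
# smallest weighted share, so it is next in line for offers
next_role = min(shares, key=shares.get)
print(next_role)  # trends
```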

in practice, weighted fair sharing is still insufficient

a non-cooperative framework (e.g., one with long tasks, or a buggy one) can hoard too many resources

resource reservations: resources on individual slaves can be reserved for particular roles; resource offers include the reservation role (if any)

master → framework (trends): offer (hostname, 4 CPUs, 4 GB RAM, role: trends)

reservations: resources on a slave are tagged with a role (e.g., role-foo, role-bar) or left unreserved (*)

static reservations are available in Mesos using the --resources flag on each mesos-slave:

$ mesos-slave --resources='cpus(role-foo):2;mem(role-foo):1024;cpus(role-bar):2;mem(role-bar):1024;cpus(*):4;mem(*):4096'

static reservations:
+ strong guarantees
- set up by an operator when starting the slave
- immutable (must drain/restart the slave to change them)

dynamic reservations: the framework scheduler reserves resources at runtime when it accepts an offer (allocation)

(1) master → framework (role-baz): offer (hostname, 4 CPUs, 4 GB RAM)
(2) framework → master: Accept: Launch/Reserve(4 CPUs, 4 GB RAM)
(3) Launch/Reserve(4 CPUs, 4 GB RAM) is applied on the slave
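Accepting an offer while reserving can be sketched as a single message carrying a list of operations, loosely modeled on the Mesos scheduler API; the exact field names here are illustrative, not a protocol reference.

```python
# Sketch of an offer acceptance that both reserves resources for a role
# and launches a task in the same call.

accept = {
    "type": "ACCEPT",
    "accept": {
        "offer_ids": [{"value": "offer-1"}],
        "operations": [
            {"type": "RESERVE",  # pin the offered resources to our role
             "reserve": {"resources": [
                 {"name": "cpus", "scalar": {"value": 4}, "role": "role-baz"},
                 {"name": "mem", "scalar": {"value": 4096}, "role": "role-baz"},
             ]}},
            {"type": "LAUNCH",   # and launch against the now-reserved resources
             "launch": {"task_infos": [{"name": "my-task"}]}},
        ],
    },
}
print([op["type"] for op in accept["accept"]["operations"]])
```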

(chart: reserved shares per team: promotions 40%, trends 20%, recommendations 40%; reserved resources partly used (10%) and largely unused (30%))

reservations provide guarantees, but at the cost of utilization

revocable resources: resources that are reserved for another role and thus may be revoked at any time

master → framework (promotions): offer (hostname, 4 CPUs, 4 GB RAM, role: trends)

preemption via revocation: … my tasks will not be killed unless I’m using revocable resources!

oversubscription via revocable resources: oversubscribe resources by allocating unused resources as revocable!

master → framework (promotions): offer (hostname, 4 CPUs, 4 GB RAM, role: trends)

revocation “guarantee”: … my tasks will not be killed unless I’m using revocable resources!
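The oversubscription idea amounts to simple arithmetic (a toy sketch, not the Mesos allocator): whatever is reserved but currently unused can be offered to other roles as revocable resources, reclaimable at any time.

```python
# Reserved-but-unused resources become revocable offers.

def revocable(reserved, used):
    return {r: reserved[r] - used[r] for r in reserved}

reserved = {"cpus": 4, "mem": 4096}   # reserved for role "trends"
used = {"cpus": 1, "mem": 1024}       # what trends is actually using
offer = revocable(reserved, used)
print(offer)  # {'cpus': 3, 'mem': 3072} offered to others as revocable
```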

revocation “guarantee”: but what about when we
① want/need to defrag (think page replacement algorithms!)
② need to do maintenance?

mechanism for deallocation: one possible solution is to introduce failures to deallocate resources … but why not communicate explicitly!?

inverse offers: the framework scheduler gets deallocation requests in the form of inverse offers

(1) master → framework (role-baz): inverse offer for offer-1 (hostname, 4 CPUs, 4 GB RAM)
(2) the framework scheduler can kill tasks and acknowledge the deallocation: framework → master: Kill/ACK

maintenance
① drain a machine by sending out inverse offers
② while draining, the machine can still send out offers with revocable resources
③ remove the machine from allocation once drained (or at some specific time)

but what about my persistent data?

persistent volumes: the framework scheduler creates volumes for disk resources at runtime when it accepts an offer (allocation)

(1) master → scheduler (role-baz): offer (hostname, 4 CPUs, 4 GB RAM)
(2) scheduler → master: Accept: Launch/Create(1 TB DISK)
(3) Launch/Create(1 TB DISK) is applied on the slave

volumes are created before launching any tasks or executors

volumes are mounted into the container when a task or executor gets launched

volumes persist even after the task or executor terminates!

the bigger picture: reservations + inverse offers + persistent volumes = long-lived stateful frameworks!
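That sum can be sketched as a single offer acceptance combining all three operation types (field names illustrative, loosely modeled on Mesos offer operations): reserve the resources, carve out a persistent volume, and launch a stateful task against it.

```python
# Sketch: a stateful framework accepts one offer with three operations.

operations = [
    {"type": "RESERVE",          # pin the resources to our role
     "reserve": {"resources": [
         {"name": "cpus", "scalar": {"value": 2}, "role": "role-baz"}]}},
    {"type": "CREATE",           # carve a persistent volume out of disk
     "create": {"volumes": [
         {"name": "disk", "scalar": {"value": 1024}, "role": "role-baz",
          "disk": {"persistence": {"id": "vol-1"},
                   "volume": {"container_path": "data", "mode": "RW"}}}]}},
    {"type": "LAUNCH",           # run the task against the volume
     "launch": {"task_infos": [{"name": "mysql"}]}},
]
print([op["type"] for op in operations])  # ['RESERVE', 'CREATE', 'LAUNCH']
```

Because the volume persists and the reservation sticks to the role, the task can be relaunched on the same data after a failure, which is what makes long-lived stateful frameworks viable.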

conclusion

our “other” computers are datacenters*
* a collection of physical and/or virtual machines

the datacenter is just another form factor: desktop, server, datacenter; each needs an OS, and Mesos is the kernel …

we need a datacenter operating system: Mesosphere’s DCOS

Marathon, a scheduler for running stateless services written in any language: init for the datacenter operating system

Chronos, a scheduler for running cron jobs with dependencies: cron for the datacenter operating system

DCOS CLI

mesosphere.com

Thanks!
