Firmament: Fast, centralized cluster scheduling at scale
Ionel Gog, Malte Schwarzkopf, Adam Gleave, Robert N. M. Watson, Steven Hand

Firmament - USENIX · Firmament handles challenging workloads at low latency Google trace, 12,500 machines, utilization between 75% and 90% Simulate interactive workloads by scaling



Meet Sesame, Inc.
● Sesame's employees run:
1. Interactive data analytics that must complete in seconds
2. Long-running services that must provide high performance
3. Batch processing jobs that increase resource utilization

Ideal scheduler
The cluster scheduler must achieve:
1. Good task placements
  ○ high utilization without interference
2. Low task scheduling latency
  ○ support interactive tasks
  ○ no idle resources

State of the art
Goals: good task placements, low scheduling latency
● Centralized: sophisticated algorithms [Borg, Quincy, Quasar]
● Distributed: simple heuristics [Sparrow, Tarcil, Yaq-d]
● Hybrid: split workload, provide either [Mercury, Hawk, Eagle]
Can't get both good placements and low latency for the entire workload!

Firmament provides a solution!

● Centralized architecture

● Good task placements

● Low task scheduling latency

● Scales to 10,000+ machines

Quincy: Fair Scheduling for Distributed Computing Clusters [SOSP 2009]
➢ Min-cost flow-based centralized scheduler
➢ Finds optimal task placements
Image copyright: Heather Gwinn

Min-cost flow scheduler
[Figure: interactive, batch, and service tasks placed on machines in Rack 1 and Rack 2; two tasks state a preference for the first rack]
● Schedules all tasks at the same time
● Considers tasks for migration or preemption
● Globally optimal placement!

Introduction to min-cost flow scheduling
[Figure: flow graph with task nodes T0-T5 on one side and machine nodes M0-M5 on the other; arcs run from tasks to machines]
● Each task-to-machine arc carries a cost (arc costs shown: 3, 5, 7, 5, 3, 9, 6, 2, 6)
● Min-cost flow places tasks with minimum overall cost
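For reference, the optimization problem behind this picture can be written in the standard min-cost flow form. The symbols $f_{ij}$ (flow on arc $(i,j)$), $c_{ij}$ (arc cost), $u_{ij}$ (arc capacity), and $b_i$ (node supply) are notation assumed here; they do not appear on the slides.

```latex
\begin{aligned}
\min_{f}\quad & \sum_{(i,j)\in E} c_{ij}\, f_{ij} \\
\text{s.t.}\quad & \sum_{j:(i,j)\in E} f_{ij} \;-\; \sum_{j:(j,i)\in E} f_{ji} \;=\; b_i
  \qquad \forall i \in V, \\
& 0 \le f_{ij} \le u_{ij} \qquad \forall (i,j)\in E .
\end{aligned}
```

In the scheduling setting, each task node would have $b_i = 1$, the sink would have $b_i = -n$ for $n$ tasks, and machine nodes would have $b_i = 0$.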

Introduction to min-cost flow scheduling: pushing flow for n tasks
[Figure: the same flow graph with a sink node S added after the machine nodes M0-M5]
● Each task node T0-T5 has a flow supply of 1
● The sink S has a flow demand of 6
● After the flow has been pushed from the tasks through the machines to the sink, every flow supply is 0 and the flow demand is 0
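The idea of pushing one unit of flow per task toward the sink can be sketched in a few lines of C++. This is a minimal teaching sketch, not Firmament's or Quincy's solver: it uses a plain successive-shortest-path loop with Bellman-Ford, and the names `Arc`, `FlowGraph`, `AddArc`, `Solve`, and `ExampleTotalCost`, the super-source node (equivalent to giving each task a supply of 1), and the arc costs in the example are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <utility>
#include <vector>

// Illustrative min-cost flow solver (successive shortest paths with
// Bellman-Ford). A teaching sketch, not Firmament's or Quincy's solver.
struct Arc {
  int to;    // head node of the arc
  int cap;   // remaining (residual) capacity
  int cost;  // cost per unit of flow
  int rev;   // index of the paired reverse arc in adj[to]
};

struct FlowGraph {
  std::vector<std::vector<Arc> > adj;
  explicit FlowGraph(int n) : adj(n) {}

  // Adds a forward arc plus its zero-capacity residual twin.
  void AddArc(int from, int to, int cap, int cost) {
    assert(cap >= 0 && from != to);
    Arc fwd = {to, cap, cost, static_cast<int>(adj[to].size())};
    Arc bwd = {from, 0, -cost, static_cast<int>(adj[from].size())};
    adj[from].push_back(fwd);
    adj[to].push_back(bwd);
  }

  // Pushes up to `want` units from src to dst along cheapest paths.
  // Returns {units of flow pushed, total cost}.
  std::pair<int, long long> Solve(int src, int dst, int want) {
    const int n = static_cast<int>(adj.size());
    int flow = 0;
    long long cost = 0;
    while (flow < want) {
      std::vector<long long> dist(n, LLONG_MAX);
      std::vector<int> prev_node(n, -1), prev_arc(n, -1);
      dist[src] = 0;
      // Bellman-Ford copes with the negative-cost residual arcs.
      for (int round = 0; round < n - 1; ++round) {
        bool changed = false;
        for (int u = 0; u < n; ++u) {
          if (dist[u] == LLONG_MAX) continue;
          for (int i = 0; i < static_cast<int>(adj[u].size()); ++i) {
            const Arc& a = adj[u][i];
            if (a.cap > 0 && dist[u] + a.cost < dist[a.to]) {
              dist[a.to] = dist[u] + a.cost;
              prev_node[a.to] = u;
              prev_arc[a.to] = i;
              changed = true;
            }
          }
        }
        if (!changed) break;
      }
      if (dist[dst] == LLONG_MAX) break;  // no augmenting path is left
      // Find the bottleneck capacity on the path, then push flow along it.
      int push = want - flow;
      for (int v = dst; v != src; v = prev_node[v])
        push = std::min(push, adj[prev_node[v]][prev_arc[v]].cap);
      for (int v = dst; v != src; v = prev_node[v]) {
        Arc& a = adj[prev_node[v]][prev_arc[v]];
        a.cap -= push;
        adj[a.to][a.rev].cap += push;
        cost += static_cast<long long>(push) * a.cost;
      }
      flow += push;
    }
    return std::make_pair(flow, cost);
  }
};

// Hypothetical example: 3 tasks, 3 single-slot machines, made-up costs.
// Node ids: 0 = super-source, 1..3 = T0..T2, 4..6 = M0..M2, 7 = sink.
long long ExampleTotalCost() {
  FlowGraph g(8);
  for (int t = 1; t <= 3; ++t) g.AddArc(0, t, 1, 0);  // supply 1 per task
  for (int m = 4; m <= 6; ++m) g.AddArc(m, 7, 1, 0);  // capacity: 1 task
  g.AddArc(1, 4, 1, 3);  // T0 -> M0
  g.AddArc(1, 5, 1, 5);  // T0 -> M1
  g.AddArc(2, 4, 1, 2);  // T1 -> M0
  g.AddArc(2, 6, 1, 7);  // T1 -> M2
  g.AddArc(3, 5, 1, 6);  // T2 -> M1
  g.AddArc(3, 6, 1, 8);  // T2 -> M2
  std::pair<int, long long> result = g.Solve(0, 7, 3);
  assert(result.first == 3);  // every task was placed
  return result.second;
}
```

In this made-up instance T0's cheapest arc leads to M0, yet the minimum-cost solution places T0 on M1, T1 on M0, and T2 on M2, which is the kind of global trade-off a greedy per-task choice would miss.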

How well does the Quincy approach scale?

Quincy doesn't scale
[Figure: simulated Quincy using the Google trace at 50% utilization; the Google cluster size and the latency goal are marked]
● 66 sec on average
● Too slow! 30% of tasks wait to be scheduled for over 33% of their runtime and waste resources
● Goal: sub-second scheduling latency in the common case

Firmament contributions
● Low task scheduling latency
  ○ Uses best suited min-cost flow algorithm
  ○ Incrementally recomputes the solution
● Good task placement
  ○ Same optimal placements as Quincy
  ○ Customizable scheduling policies

Firmament scheduler
[Figure: Firmament architecture. Components shown: master, scheduler (scheduling policy, flow graph), task table, task statistics, cluster topology, and an agent on each machine]

Specifying scheduling policies

class QuincyPolicy {
  // Cost grows with the time the task has spent unscheduled.
  Cost_t TaskToResourceNodeCost(TaskID_t task_id) {
    return task_unscheduled_time * quincy_wait_time_factor;
  }
  ...
};

● The policy defines the flow graph
N.B.: More details in the paper.
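A self-contained variant of the slide's cost function may make the idea easier to try out. This is only a sketch under stated assumptions: the `Cost_t` and `TaskID_t` typedefs, the `WaitTimePolicy` class, its `SetUnscheduledTime` helper, and the per-task wait-time table are hypothetical and are not Firmament's actual policy interface; only the method name and the "wait time times factor" rule come from the slide.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Assumed typedefs; the slide uses these names without defining them.
typedef int64_t Cost_t;
typedef uint64_t TaskID_t;

// Hypothetical policy: the arc cost grows linearly with the time a task
// has been waiting, so long-waiting tasks become more urgent to place.
class WaitTimePolicy {
 public:
  explicit WaitTimePolicy(Cost_t wait_time_factor)
      : wait_time_factor_(wait_time_factor) {}

  // Records how long (in seconds) a task has been unscheduled.
  void SetUnscheduledTime(TaskID_t task_id, int64_t seconds) {
    assert(seconds >= 0);
    unscheduled_time_[task_id] = seconds;
  }

  // Same rule as on the slide: unscheduled time times a constant factor.
  // Tasks with no recorded wait time cost 0.
  Cost_t TaskToResourceNodeCost(TaskID_t task_id) const {
    std::map<TaskID_t, int64_t>::const_iterator it =
        unscheduled_time_.find(task_id);
    const int64_t waited = (it == unscheduled_time_.end()) ? 0 : it->second;
    return waited * wait_time_factor_;
  }

 private:
  Cost_t wait_time_factor_;
  std::map<TaskID_t, int64_t> unscheduled_time_;
};
```

A scheduler built this way would call `TaskToResourceNodeCost` whenever it sets arc costs in the flow graph, so changing the policy changes the graph without touching the solver.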

Firmament scheduler: slow min-cost flow solver
[Figure: scheduler architecture with a min-cost, max-flow solver added]
● The scheduling policy defines the flow graph
● The scheduler submits the graph to the min-cost, max-flow solver
● Most time is spent in the solver

Algorithm worst-case complexity
● Cost scaling: $O(V^2 E \log(VC))$ (used by Quincy)
E: number of arcs; V: number of nodes; U: largest arc capacity; C: largest cost value; $E > V > C \cong U$

Algorithm results: cost scaling
[Figure: subsampled Google trace, 50% slot utilization, Quincy policy; the latency goal is marked]
● Cost scaling is too slow beyond 1,000 machines

Algorithm worst-case complexity
● Cost scaling: $O(V^2 E \log(VC))$
● Successive shortest path: $O(V^2 U \log(V))$ (lower worst-case complexity)
E: number of arcs; V: number of nodes; U: largest arc capacity; C: largest cost value; $E > V > C \cong U$

Algorithm results: successive shortest path
[Figure: subsampled Google trace, 50% slot utilization, Quincy policy; the latency goal is marked]
● Successive shortest path only scales to ~100 machines

Algorithm worst-case complexity
● Cost scaling: $O(V^2 E \log(VC))$
● Successive shortest path: $O(V^2 U \log(V))$
● Relaxation: $O(E^3 C U^2)$ (highest complexity)
E: number of arcs; V: number of nodes; U: largest arc capacity; C: largest cost value; $E > V > C \cong U$

Algorithm results: Relaxation
[Figure: subsampled Google trace, 50% slot utilization, Quincy policy; the latency goal is marked]
● Relaxation meets our sub-second latency goal

Why is Relaxation fast?
[Figure: flow graph with task nodes T0-T4, node X, rack nodes R0 and R1, machine nodes M0-M3, and sink S]
● Single-ish pass flow push
● Relaxation is well-suited to the graph structure

Relaxation suffers in pathological edge cases
[Figure: tasks T0-T4, machines M0-M3, and sink S; machines are shaded by utilization (high, medium); the highlighted machine's capacity drops from 1 task to 0 tasks]
● Relaxation cannot push flow in a single pass any more

How bad does Relaxation's edge case get?
Experimental setup:
● Simulated 12,500 machine cluster
● Used the Quincy scheduling policy
● Utilization >90% to oversubscribed cluster

High utilization: Relaxation and cost scaling
[Figure: Quincy policy, 12,500-machine cluster, job of increasing size; the region where cost scaling is faster is marked]
● Relaxation's runtime increases with utilization
● Cost scaling is unaffected by high utilization

Solver runs both algorithms
[Figure: the scheduler (scheduling policy, flow graph) passes the input graph to the min-cost, max-flow solver; inside the solver the input graph goes to both cost scaling and Relaxation, and the solver returns the output flow]
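One way to read "runs both algorithms" is as a race: start both on the same input and use whichever result arrives first. The sketch below illustrates that reading only; it is an assumption, not a description of Firmament's implementation. `SolverResult`, `Race`, `RunBoth`, and the two stand-in solvers (which merely sleep and return a fixed cost) are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <thread>

struct SolverResult {
  std::string algorithm;
  long long total_cost;
};

// Shared state for one race; the shared_ptr keeps it alive until both
// solver threads have exited, even after RunBoth has returned.
struct Race {
  std::mutex mu;
  std::condition_variable cv;
  bool done;
  SolverResult winner;
  Race() : done(false) {}
};

typedef std::function<SolverResult()> Solver;

// Starts both solvers and returns the result of whichever finishes first.
// Simplification: the slower solver is not cancelled, only ignored.
SolverResult RunBoth(Solver first, Solver second) {
  std::shared_ptr<Race> race = std::make_shared<Race>();
  Solver solvers[2] = {first, second};
  for (int i = 0; i < 2; ++i) {
    Solver solve = solvers[i];
    std::thread([race, solve]() {
      SolverResult result = solve();
      std::lock_guard<std::mutex> lock(race->mu);
      if (!race->done) {  // only the first finisher records its result
        race->winner = result;
        race->done = true;
      }
      race->cv.notify_all();
    }).detach();
  }
  std::unique_lock<std::mutex> lock(race->mu);
  race->cv.wait(lock, [&race]() { return race->done; });
  return race->winner;
}

// Hypothetical stand-ins: real solvers would take the flow graph as input.
SolverResult SlowCostScaling() {
  std::this_thread::sleep_for(std::chrono::milliseconds(300));
  SolverResult r = {"cost scaling", 15};
  return r;
}

SolverResult FastRelaxation() {
  SolverResult r = {"relaxation", 15};
  return r;
}
```

Because both algorithms compute a minimum-cost flow, either winner yields an optimal total cost; the race only decides how long the scheduler waits. A production version would also need a way to stop the losing solver.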

High utilization: runtime when running both algorithms
[Figure: Quincy policy, 12,500-machine cluster, job of increasing size]
● Algorithm runtime is still high at utilization > 94%

Incremental cost scaling
[Figure: the scheduler (scheduling policy, flow graph) passes the input graph to cost scaling inside the min-cost, max-flow solver, which returns the output flow]
● From scratch: the solver's graph state is discarded after each run
● Incremental: the previous graph state is kept, and the next run receives it together with the graph changes

Incremental cost scaling results
[Figure: Quincy policy, 12,500-machine cluster, job of increasing size]
● Incremental cost scaling is ~2x faster

Evaluation
Does Firmament choose good placements with low latency?
Goals: good task placements, low scheduling latency
● Centralized: sophisticated algorithms (e.g., Borg, Quincy, Quasar)
● Distributed: simple heuristics (e.g., Sparrow, Tarcil)
Note: many additional experiments in the paper.

How do Firmament's placements compare to other schedulers?
Experimental setup:
● Homogeneous 40-machine cluster, 10G network
● Mixed batch/service/interactive workload

Workload mix
[Figure: a scheduler placing interactive and service tasks on machines M1-M10, which connect through rack switches R1 and R2 to an aggregation switch (Agg); links are shaded by network utilization: low, medium, high]

Firmament chooses good placements
[Figure: task response time under the mixed workload for Sparrow, Docker, Kubernetes, and Firmament]
● Baseline: 5 seconds task response time on idle cluster
● Sparrow is unaware of resource utilization: 20% of tasks experience poor placement
● Centralized Kubernetes and Docker still suffer
● Firmament avoided co-location interference
● Firmament outperforms centralized and distributed schedulers

How well does Firmament handle challenging workloads?
Experimental setup:
● Simulated 12,500 machine Google cluster
● Used the centralized Quincy scheduling policy
● Utilization varies between 75% and 95%

Firmament handles challenging workloads at low latency
[Figure: Google trace, 12,500 machines, utilization between 75% and 90%; interactive workloads are simulated by scaling down task runtimes, with the median task runtime ranging from 420s down to 1.7s]
● Cost scaling: 45 seconds average latency (tuned over Quincy setup's 66s); average latency is too high even without many short tasks
● Relaxation: with a 250x acceleration, the 75th percentile latency is 2 sec and the maximum is 57 sec
● Firmament's common-case latency is sub-second even at 250x acceleration

Conclusions
● Low task scheduling latency
  ○ Uses best algorithm at all times
  ○ Incrementally recomputes solution
● Good task placement
  ○ Same optimal placements as Quincy
  ○ Customizable scheduling policies
Open-source and available at: firmament.io