Firmament: Fast, centralized cluster scheduling at scale (USENIX)

Ionel Gog, Malte Schwarzkopf, Adam Gleave, Robert N. M. Watson, Steven Hand

Meet Sesame, Inc.

● Sesame's employees run:
  1. Interactive data analytics that must complete in seconds
  2. Long-running services that must provide high performance
  3. Batch processing jobs that increase resource utilization

The ideal cluster scheduler must achieve:
  1. Good task placements
     ○ high utilization without interference
  2. Low task scheduling latency
     ○ support interactive tasks
     ○ no idle resources

State of the art

Centralized vs. distributed:

● Centralized: sophisticated algorithms [Borg, Quincy, Quasar]; good task placements, but high scheduling latency
● Distributed: simple heuristics [Sparrow, Tarcil, Yaq-d]; low scheduling latency, but poor placements
● Hybrid: split the workload and provide one or the other [Mercury, Hawk, Eagle]

Can't get both good placements and low latency for the entire workload!

Firmament provides a solution!

● Centralized, min-cost flow-based scheduler
  ➢ Finds optimal task placements
● Good task placements
● Low task scheduling latency
● Scales to 10,000+ machines

[SOSP 2009] Quincy: Fair Scheduling for Distributed Computing Clusters

Min-cost flow scheduler

[Diagram: a min-cost flow scheduler placing Interactive, Batch, and Service tasks across Rack 1 and Rack 2; two tasks prefer the first rack]

● Schedules all tasks at the same time
● Considers tasks for migration or preemption
● Globally optimal placement!

Introduction to min-cost flow scheduling

Flow scheduling: tasks and machines
[Diagram: tasks T0–T5 on the left, machines M0–M5 on the right]

Flow scheduling: tasks to machines
[Diagram: arcs connect each task to the machines it can be placed on]

Flow scheduling: zoom in
[Diagram: each arc carries a placement cost, e.g. cost 3 to one machine and cost 5 to another]

Flow scheduling: min costs for all
[Diagram: the full graph with a cost on every task→machine arc]
Min-cost flow places tasks with minimum overall cost.

Flow scheduling: pushing flow for n tasks
[Diagram: each task supplies 1 unit of flow; a sink node S demands n units (here 6); once the flow has been pushed along the cheapest arcs, all supplies and the demand drop to 0 and every task is assigned to a machine]

How well does the Quincy approach scale?

Quincy doesn't scale

[Figure: scheduling latency of simulated Quincy on the Google trace at 50% utilization, Google-scale cluster]

● 66 sec on average: too slow! 30% of tasks wait to be scheduled for over 33% of their runtime and waste resources.
● Goal: sub-second scheduling latency in the common case.

Contributions

● Low task scheduling latency
  ○ Uses the best-suited min-cost flow algorithm
  ○ Incrementally recomputes the solution
● Good task placement
  ○ Same optimal placements as Quincy
  ○ Customizable scheduling policies

Firmament scheduler

[Architecture diagram: a scheduler master holds the task table, task statistics, cluster topology, flow graph, and scheduling policy; agents on each machine report to the master]

Specifying scheduling policies

A Firmament policy specification defines the flow graph:

    class QuincyPolicy {
      Cost_t TaskToResourceNodeCost(TaskID_t task_id) {
        return task_unscheduled_time * quincy_wait_time_factor;
      }
      ...
    };

N.B.: More details in the paper.

Firmament scheduler

[Architecture diagram: the scheduling policy defines the flow graph; the scheduler submits the graph to a min-cost, max-flow solver. Most scheduling time is spent in the solver.]

Choosing a min-cost flow algorithm

Algorithm                 | Worst-case complexity
--------------------------|---------------------------------------------
Cost scaling              | O(V²E log(VC))  (used by Quincy)
Successive shortest path  | O(V²U log(V))   (lower worst-case complexity)
Relaxation                | O(E³CU²)        (highest worst-case complexity)

(E: number of arcs, V: number of nodes, U: largest arc capacity, C: largest cost value; in practice E > V > C ≅ U.)

[Figures: solver runtime on a subsampled Google trace, 50% slot utilization, Quincy policy]

● Cost scaling is too slow beyond 1,000 machines.
● Successive shortest path only scales to ~100 machines.
● Relaxation, despite the highest worst-case complexity, meets our sub-second latency goal.

Why is Relaxation fast?

[Diagram: tasks T0–T4, machines M0–M3 under racks R0–R1, sink S; each machine→sink arc has capacity 1 task]

● Relaxation pushes flow in a single(-ish) pass.
● Relaxation is well-suited to the graph structure.

Relaxation suffers in pathological edge cases

[Diagrams: as machine utilization rises from medium to high, machine→sink arcs drop from capacity 1 task to capacity 0 tasks, and flow must be retracted]

● At high utilization, Relaxation cannot push flow in a single pass any more.

How bad does Relaxation's edge case get?

Experimental setup:
● Simulated 12,500-machine cluster
● Quincy scheduling policy
● Utilization from >90% to an oversubscribed cluster

[Figure: algorithm runtime vs. utilization; Quincy policy, 12,500-machine cluster, job of increasing size]

● Relaxation's runtime increases with utilization.
● Cost scaling is unaffected by high utilization, and becomes the faster algorithm at the highest utilizations.

The solver runs both algorithms

[Architecture diagram: the min-cost, max-flow solver runs Cost scaling and Relaxation on the same input graph and produces the output flow]

● Algorithm runtime is still high at utilization > 94%.

Incremental cost scaling

[Architecture diagram: ordinarily, cost scaling discards its graph state after producing the output flow; incremental cost scaling keeps the previous graph state and applies only the graph changes on the next run]

[Figure: runtime; Quincy policy, 12,500-machine cluster, job of increasing size]

● Incremental cost scaling is ~2x faster.

Note: many additional experiments in the paper.

Evaluation

Recall the state of the art: centralized schedulers with sophisticated algorithms (e.g., Borg, Quincy, Quasar) achieve good task placements; distributed schedulers with simple heuristics (e.g., Sparrow, Tarcil) achieve low scheduling latency.

Does Firmament choose good placements with low latency?

How do Firmament's placements compare to other schedulers?

Experimental setup:
● Homogeneous 40-machine cluster, 10G network
● Mixed batch/service/interactive workload

[Diagram: the scheduler places interactive and service tasks on machines M1–M10 across racks R1–R2 behind an aggregation switch; network utilization varies from low to high]

Firmament chooses good placements

[Figures: CDFs of task response time; an idle cluster gives a 5-second task response time baseline]

● Sparrow is unaware of resource utilization: 20% of tasks experience poor placement.
● The centralized Kubernetes and Docker schedulers still suffer: 20% of tasks experience poor placement.
● Firmament avoids co-location interference and outperforms both centralized and distributed schedulers.

How well does Firmament handle challenging workloads?

Experimental setup:
● Simulated 12,500-machine Google cluster
● Centralized Quincy scheduling policy
● Utilization varies between 75% and 95%

Firmament handles challenging workloads at low latency

[Figures: task placement latency; Google trace, 12,500 machines, utilization between 75% and 90%]

● We simulate interactive workloads by scaling down task runtimes: the median task runtime drops from 420s to 1.7s (a 250x acceleration).
● Cost scaling alone: 45 seconds average latency (tuned, vs. the Quincy setup's 66s); too high even without many short tasks.
● Relaxation alone, at 250x acceleration: 75th percentile latency 2 sec, maximum 57 sec.
● Firmament's common-case latency is sub-second even at 250x acceleration.

Conclusions

● Low task scheduling latency
  ○ Uses the best algorithm at all times
  ○ Incrementally recomputes the solution
● Good task placement
  ○ Same optimal placements as Quincy
  ○ Customizable scheduling policies

Open-source and available at: firmament.io