Sparrow: Distributed, Low Latency Scheduling
Rinik Kumar <[email protected]>
Agenda
- Part A: Background
- Part B: Sparrow system design
- Part C: Sparrow experimental evaluation
Part A: Background
Background: Data Processing Frameworks
• How to distribute data-parallel computations across multiple machines?
• MapReduce (OSDI ‘04)
• Dremel (VLDB ‘10)
• Spark (NSDI ‘12)
• Convert high-level computation description into jobs
• Partition input data and assign jobs to multiple machines
Background: Short Tasks
• Common challenges in data processing frameworks
• Problem 1: Stragglers
• Job response times are dominated by stragglers
• Causes:
• Machine performance (e.g. contended CPUs, congested networks, etc.)
• Data partitioning (tasks take longer due to computational skew, etc.)
• Problem 2: Sharing
• Long-running tasks block additional tasks from running
Reference: http://kayousterhout.org/publications/hotos13-final24.pdf
Solution: Shorter Tasks!
Solution 1: Straggler Mitigation
Reference: http://kayousterhout.org/talks/tinytasks-hotos-talk.pdf
Solution 2: Improved Sharing
Reference: http://kayousterhout.org/talks/tinytasks-hotos-talk.pdf
Q: Why don’t existing data processing frameworks use short tasks?
Background: Short Tasks
• Architectural changes:
• Cluster must support minimal task launch overhead
• Scalable storage systems:
• Task runtime could be dominated by time taken to read input data
• Low-latency scheduling:
• Scheduler must be able to make millions of low-latency scheduling decisions per second
• Framework-controlled I/O:
• Framework should exploit the smaller resource footprint of small tasks (e.g. pipelined reading of input data)
• And more…
• Changes to the execution and programming model
Background: Scheduling
• Sparrow provides a solution to the scheduling problem!
• Restrictive time requirements:
• Sparrow has around 1-10 milliseconds to make scheduling decisions
• High throughput requirements:
• Sparrow must support millions of scheduling decisions per second
Background: Spark
• Data processing framework optimized for efficient data reuse and in-memory computation
• Resilient distributed datasets (RDDs)
• Express computation as a sequence of transformations (e.g. map, filter, join, etc.) on RDDs
• Scheduling:
• Tasks assigned to machines based on delay scheduling
• Delay scheduling attempts to achieve both fair sharing and data locality
• Fair sharing: If N jobs are running, each job receives a 1/N share of resources
• Data locality: Place computations near their input data
Reference: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
Background: Centralized vs. Decentralized
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Part B: Sparrow System Design
Sparrow’s Execution Model
• Cluster composed of worker machines that execute tasks and schedulers that assign tasks
• Each job is composed of 𝑚 tasks
• Wait time:
• Time until the task begins executing
• Represents scheduler overhead
• Service time:
• Time the task spends executing on a worker machine
Sparrow’s Optimizations
• Batch sampling
• Optimization of the “power of 2 choices” load balancing technique
• Place the 𝑚 tasks in a job on the least loaded of 𝑑 ∙ 𝑚 randomly selected machines
• Late binding
• Delays assignment of a task to a machine until the machine is ready to run the task
Randomized Sampling
• Scheduler chooses a random machine to assign tasks
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Randomized Sampling: Analysis
• Let 𝑛 be the number of machines in the cluster
• Let 𝑝 be the probability that a randomly selected machine is loaded
• Represents cluster load
• Probability that random sampling assigns all 𝑚 tasks to unloaded machines:
• (1 − 𝑝)^𝑚
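The expression above is easy to evaluate directly; a quick sketch (the load value 𝑝 = 0.8 and job size 𝑚 = 10 are illustrative choices, not numbers from the paper):

```python
def random_placement_success(p: float, m: int) -> float:
    """Probability that all m tasks of a job land on unloaded machines
    when each task is placed on one machine chosen uniformly at random,
    and each machine is loaded independently with probability p."""
    return (1 - p) ** m

# At 80% cluster load, a 10-task job almost never gets a clean placement.
print(random_placement_success(0.8, 10))  # ~1e-7
```

This motivates the next slides: at realistic cluster loads, purely random placement fails for all but tiny jobs.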
Randomized Sampling: Results
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Power of 2 Choices
• Suppose 𝑛 balls are inserted into 𝑛 bins:
• Each ball chooses 𝑑 = 2 bins uniformly at random
• The ball is inserted into the bin with the smaller number of balls
• If both bins have an identical number of balls, put the ball in either bin
• Azar et al. proved that the max load is log log 𝑛 + 𝑂(1) with high probability
• This is exponentially better than random allocation:
• Max load is ≈ log 𝑛 / log log 𝑛
• Increasing 𝑑 does not improve much:
• Max load is log log 𝑛 / log 𝑑 + 𝑂(1)
Reference 1: http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf
Reference 2: https://homes.cs.washington.edu/~karlin/papers/balls.pdf
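The balls-into-bins result is easy to reproduce empirically. Below is a small simulation sketch (the value of 𝑛 and the seed are arbitrary illustrative choices):

```python
import random

def max_load(n: int, d: int, seed: int = 0) -> int:
    """Throw n balls into n bins; each ball probes d bins uniformly at
    random and lands in the least-loaded probed bin. Return the max load."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        probes = [rng.randrange(n) for _ in range(d)]
        target = min(probes, key=lambda b: bins[b])  # least-loaded probe
        bins[target] += 1
    return max(bins)

# With d = 2 the max load drops from ~log n / log log n to ~log log n.
print(max_load(100_000, d=1), max_load(100_000, d=2))
```

Running this shows the d = 2 max load is markedly smaller than the d = 1 max load, matching the Azar et al. bound.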
Per-Task Sampling
• Scheduler chooses 2 random machines; assigns task to least loaded machine
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Per-Task Sampling: Analysis
• Let 𝑑 be the number of machines probed per task
• Probability that per-task sampling assigns all 𝑚 tasks to unloaded machines:
• (1 − 𝑝^𝑑)^𝑚
• Q: Why not choose a larger 𝑑?
• Problems:
• Job response time limited by longest wait time of any running task
• Sub-optimal placement of tasks
Per-Task Sampling: Analysis
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Per-Task Sampling: Results
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Batch Sampling
• Scheduler probes 2𝑚 random machines; assigns 𝑚 tasks to least loaded machines
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Batch Sampling: Analysis
• Probability that batch sampling assigns all 𝑚 tasks to unloaded machines:
• Equivalent to the probability that ≥ 𝑚 of the 𝑑 ∙ 𝑚 probed machines are unloaded
• ∑_{𝑖=𝑚}^{𝑑∙𝑚} (𝑑∙𝑚 choose 𝑖) (1 − 𝑝)^𝑖 𝑝^{𝑑∙𝑚−𝑖}
• Problems:
• Estimating load based on queue length is inaccurate
• Queue 1 = [ 50 ms, 50 ms, 50 ms ]
• Queue 2 = [ 200 ms ]
• Multiple schedulers assign tasks to the same machine
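Plugging numbers into the two success-probability expressions shows why batch sampling helps; a sketch (the values of 𝑝, 𝑑, and 𝑚 are illustrative):

```python
from math import comb

def per_task_success(p: float, d: int, m: int) -> float:
    """Every task independently finds at least one unloaded machine
    among its own d probes: (1 - p^d)^m."""
    return (1 - p ** d) ** m

def batch_success(p: float, d: int, m: int) -> float:
    """At least m of the d*m probed machines are unloaded, so all m
    tasks can be placed on unloaded machines."""
    return sum(comb(d * m, i) * (1 - p) ** i * p ** (d * m - i)
               for i in range(m, d * m + 1))

# At 80% load with d = 2 and m = 10, batch sampling succeeds far more often.
print(per_task_success(0.8, 2, 10))  # ~3.7e-5
print(batch_success(0.8, 2, 10))     # ~2.6e-3
```

Aggregating the 𝑑 ∙ 𝑚 probes across the whole job lets lucky probes from one task cover unlucky probes from another, which is the source of the gap.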
Batch Sampling: Results
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Late Binding
• Scheduler probes 2𝑚 random machines; reserves task on all machines
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Late Binding
• Machine requests task once it reaches front of queue
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Late Binding: Analysis
• Problems:
• Machines are idle during the RPC to request a task from the scheduler
• Machines might request tasks from schedulers that have already allocated all tasks in a job
• Solution: Proactive cancellation
• Upon allocating all tasks in a job, send a cancellation RPC to machines that have pending reservations
• Q: Does Sparrow’s design extend to microsecond-scale tasks?
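The reservation/cancellation flow can be sketched as follows; this is a simplified single-process stand-in for what Sparrow does over RPC, and the class and method names are made up for illustration:

```python
class JobScheduler:
    """Sketch of late binding: reservations on d*m workers, tasks handed
    to the first m workers that ask, proactive cancellation for the rest."""

    def __init__(self, tasks, probed_workers):
        self.pending = list(tasks)            # tasks not yet launched
        self.reserved = set(probed_workers)   # workers holding reservations

    def get_task(self, worker):
        """Called when a worker's reservation reaches its queue front."""
        self.reserved.discard(worker)
        if self.pending:
            return self.pending.pop(0)        # bind the task only now
        return None                           # job fully placed

    def cancel_remaining(self):
        """Cancellation 'RPCs' for reservations that lost the race."""
        losers, self.reserved = self.reserved, set()
        return losers

job = JobScheduler(["t1", "t2"], ["w1", "w2", "w3", "w4"])
print(job.get_task("w3"))              # t1: the first idle worker wins
print(job.get_task("w1"))              # t2
print(sorted(job.cancel_remaining()))  # ['w2', 'w4']
```

Because tasks bind to whichever probed workers free up first, wait time tracks the fastest queues rather than a stale length estimate.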
Late Binding: Results
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Placement constraints
• Per-job constraints:
• E.g. job must execute on machines that have a GPU
• Restrict batch sampling to machines that satisfy the constraint
• Per-task constraints:
• E.g. task must execute on a machine that holds its input data
• Uses per-task sampling
• Probed information shared across tasks:
• Probe for Task 1: [A loaded, B loaded, C unloaded]
• Probe for Task 2: [C unloaded, D unloaded, E loaded]
• Optimal placement?
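For the probe lists above, a placement that serves both tasks does exist: Task 1 must take C (its only unloaded probe), which pushes Task 2 to D. A greedy most-constrained-first heuristic finds it; this is an illustrative sketch, not Sparrow's actual code:

```python
# Shared probe results from the slide: per task, machine -> load status.
probes = {
    "task1": {"A": "loaded", "B": "loaded", "C": "unloaded"},
    "task2": {"C": "unloaded", "D": "unloaded", "E": "loaded"},
}

def place(probes):
    """Assign each task a distinct unloaded machine, placing the most
    constrained task (fewest unloaded probes) first."""
    options = {t: [m for m, s in ps.items() if s == "unloaded"]
               for t, ps in probes.items()}
    placement, used = {}, set()
    for task in sorted(options, key=lambda t: len(options[t])):
        for machine in options[task]:
            if machine not in used:
                placement[task] = machine
                used.add(machine)
                break
    return placement

print(place(probes))  # {'task1': 'C', 'task2': 'D'}
```

Placing tasks independently could instead give C to Task 2 and leave Task 1 with only loaded machines, which is why sharing probe information across tasks matters.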
Resource allocation policies
• Strict priorities:
• Tasks are assigned priorities (e.g. high/low)
• Sparrow maintains separate high- and low-priority task queues at each machine
• The high-priority queue is emptied before the low-priority queue
• Weighted fair sharing:
• Idea from network scheduling
• Maintain separate queues per user
• Each user is assigned a percentage representing their allocated “bandwidth”
• E.g. 10%, 30%, and 60% to different users
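Per-user queues at a single worker can be sketched as below. A weighted lottery is used here as a simple stand-in for the weighted fair queueing Sparrow actually implements; the user names, task names, and shares are illustrative:

```python
import random

# Per-user task queues at one worker, and each user's configured share
# (10% / 30% / 60%, as on the slide).
queues = {"u1": ["a1", "a2"], "u2": ["b1"], "u3": ["c1", "c2"]}
shares = {"u1": 0.10, "u2": 0.30, "u3": 0.60}

def next_task(rng=random):
    """Give the next free slot to a user drawn with probability
    proportional to their share, among users with queued work."""
    active = [u for u in queues if queues[u]]
    weights = [shares[u] for u in active]
    user = rng.choices(active, weights=weights)[0]  # weighted lottery
    return queues[user].pop(0)                      # FIFO within a user

# Over many slots, u3 receives roughly 60% of the worker's service.
```

Because each worker enforces the shares locally on its own queues, no central coordination is needed to approximate cluster-wide fairness.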
Implementation
• Front-end client converts high-level job descriptions to task specifications
• Clients and schedulers run on the same machine
• Scheduler assigns tasks to machines
• Local node monitor running on each machine enqueues scheduled tasks
• An executor process on each machine executes tasks
Implementation: Fault tolerance
• Schedulers do not maintain persistent state
• Similar to stateless web server backends
• Clients send heartbeats to schedulers to detect failure
• Upon scheduler failure, the front-end must choose how to handle in-flight tasks
• Simplest approach: restart all in-flight tasks
• Q: Is this a good design? Is it acceptable to restart in-flight tasks upon scheduler failure?
Example: Spark on Sparrow
• Front-end translates functional queries into parallel stages
• Sparrow receives task description and placement constraints
Part C: Sparrow evaluation
Experimental Setup: TPC-H
• Cluster running on Amazon EC2
• 100 machines and 10 schedulers
• 8 cores and 68.4 GB memory per machine
• Performance evaluated using the TPC-H benchmark
• Representative of ad-hoc queries on business data
• Properties:
• Cluster utilization fluctuates around 80%
• Non-uniform task durations (10-100 ms)
• Mixed constrained/unconstrained scheduling requests
Experimental Evaluation: TPC-H
Deconstructing Performance
How do task constraints affect performance?
How do scheduler failures impact job response time?
How does Sparrow compare to Spark?
How effective is Sparrow’s distributed fairness enforcement?
How much can low priority users hurt response times for high priority users?
How sensitive is Sparrow to the probe ratio?
Conclusion
• Sparrow presents a simple, scalable solution to task scheduling
• Supports millions of scheduling requests per second
• Scheduling decisions made on the order of milliseconds
• Discussion:
• Q: Suppose the cluster operates at max load (e.g. high job arrival rate). Is Sparrow’s approach optimal?
• Q: How could data processing frameworks co-optimize with Sparrow to obtain higher performance?
• Q: Are there alternative solutions to the straggler problem?