Sparrow: Distributed, Low Latency Scheduling
Rinik Kumar <[email protected]>
Agenda
- Part A: Background
- Part B: Sparrow system design
- Part C: Sparrow experimental evaluation
Part A: Background
Background: Data Processing Frameworks
• How to distribute data-parallel computations across multiple machines?
• MapReduce (OSDI ‘04)
• Dremel (VLDB ‘10)
• Spark (NSDI ‘12)
• Convert high-level computation description into jobs
• Partition input data and assign jobs to multiple machines
Background: Short Tasks
• Common challenges in data processing frameworks
• Problem 1: Stragglers
• Job response times are dominated by stragglers
• Causes:
• Machine performance (e.g. contended CPUs, congested networks, etc.)
• Data partitioning (tasks take longer due to computational skew, etc.)
• Problem 2: Sharing
• Long-running tasks block additional tasks from running
Reference: http://kayousterhout.org/publications/hotos13-final24.pdf
Solution: Shorter Tasks!
Solution 1: Straggler Mitigation
Reference: http://kayousterhout.org/talks/tinytasks-hotos-talk.pdf
Solution 2: Improved Sharing
Reference: http://kayousterhout.org/talks/tinytasks-hotos-talk.pdf
Q: Why don’t existing data processing frameworks use short tasks?
Background: Short Tasks
• Architectural changes:
• Cluster must support minimal task launch overhead
• Scalable storage systems:
• Task runtime could be dominated by time taken to read input data
• Low-latency scheduling:
• Scheduler must be able to make millions of low-latency scheduling decisions per second
• Framework-controlled I/O:
• Framework should exploit the smaller resource footprint of small tasks (e.g. pipelined reading of input data)
• And more…
• Changes to the execution and programming model
Background: Scheduling
• Sparrow provides a solution to the scheduling problem!
• Restrictive time requirements:
• Sparrow has around 1-10 milliseconds to make scheduling decisions
• High throughput requirements:
• Sparrow must support millions of scheduling decisions per second
Background: Spark
• Data processing framework optimized for efficient data reuse and in-memory computation
• Resilient distributed datasets (RDDs)
• Express computation as a sequence of transformations (e.g. map, filter, join, etc.) on RDDs
• Scheduling:
• Tasks assigned to machines based on delay scheduling
• Delay scheduling attempts to achieve both fair sharing and data locality
• Fair sharing: If N jobs are running, each job receives a 1/N share of resources
• Data locality: Place computations near their input data
Reference: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
Background: Centralized vs. Decentralized
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Part B: Sparrow System Design
Sparrow’s Execution Model
• Cluster composed of worker machines that execute tasks and schedulers that assign tasks
• Each job is composed of 𝑚 tasks
• Wait time:
• Time until the task begins executing
• Represents scheduler overhead
• Service time:
• Time the task spends executing on a worker machine
Sparrow’s Optimizations
• Batch sampling
• Optimization of the “power of 2 choices” load balancing technique
• Place the 𝑚 tasks in a job on the least loaded of 𝑑 ∙ 𝑚 randomly selected machines
• Late binding
• Delays assignment of a task to a machine until the machine is ready to run the task
Randomized Sampling
• Scheduler chooses a random machine to assign tasks
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Randomized Sampling: Analysis
• Let 𝑛 be the number of machines in the cluster
• Let 𝑝 be the probability that a randomly selected machine is loaded
• Represents cluster load
• Probability that random sampling assigns all 𝑚 tasks to unloaded machines:
• (1 − 𝑝)^𝑚
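The expression above is easy to evaluate directly; a quick sketch (the load value 𝑝 = 0.8 and job size 𝑚 = 10 are illustrative choices, not numbers from the paper):

```python
def random_placement_success(p: float, m: int) -> float:
    """Probability that all m tasks of a job land on unloaded machines
    when each task is placed on one machine chosen uniformly at random,
    and each machine is loaded independently with probability p."""
    return (1 - p) ** m

# At 80% cluster load, a 10-task job almost never gets a clean placement.
print(random_placement_success(0.8, 10))  # ~1e-7
```

This motivates the next slides: at realistic cluster loads, purely random placement fails for all but tiny jobs.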
Randomized Sampling: Results
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Power of 2 Choices
• Suppose 𝑛 balls are inserted into 𝑛 bins:
• Each ball chooses 𝑑 = 2 bins uniformly at random
• The ball is inserted into the bin with the smaller number of balls
• If both bins have an identical number of balls, put the ball in either bin
• Azar et al. proved that the max load is log log 𝑛 + 𝑂(1) with high probability
• This is exponentially better than random allocation:
• Max load is ≈ log 𝑛 / log log 𝑛
• Increasing 𝑑 does not improve much:
• Max load is log log 𝑛 / log 𝑑 + 𝑂(1)
Reference 1: http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf
Reference 2: https://homes.cs.washington.edu/~karlin/papers/balls.pdf
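The balls-into-bins result is easy to reproduce empirically. Below is a small simulation sketch (the value of 𝑛 and the seed are arbitrary illustrative choices):

```python
import random

def max_load(n: int, d: int, seed: int = 0) -> int:
    """Throw n balls into n bins; each ball probes d bins uniformly at
    random and lands in the least-loaded probed bin. Return the max load."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        probes = [rng.randrange(n) for _ in range(d)]
        target = min(probes, key=lambda b: bins[b])  # least-loaded probe
        bins[target] += 1
    return max(bins)

# With d = 2 the max load drops from ~log n / log log n to ~log log n.
print(max_load(100_000, d=1), max_load(100_000, d=2))
```

Running this shows the d = 2 max load is markedly smaller than the d = 1 max load, matching the Azar et al. bound.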
Per-Task Sampling
• Scheduler chooses 2 random machines; assigns task to least loaded machine
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Per-Task Sampling: Analysis
• Let 𝑑 be the number of machines probed per task
• Probability that per-task sampling assigns all 𝑚 tasks to unloaded machines:
• (1 − 𝑝^𝑑)^𝑚
• Q: Why not choose a larger 𝑑?
• Problems:
• Job response time limited by longest wait time of any running task
• Sub-optimal placement of tasks
Per-Task Sampling: Analysis
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Per-Task Sampling: Results
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Batch Sampling
• Scheduler probes 2𝑚 random machines; assigns 𝑚 tasks to least loaded machines
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Batch Sampling: Analysis
• Probability that batch sampling assigns all 𝑚 tasks to unloaded machines:
• Equivalent to the probability that ≥ 𝑚 of the 𝑑 ∙ 𝑚 probed machines are unloaded
• ∑_{𝑖=𝑚}^{𝑑∙𝑚} (𝑑∙𝑚 choose 𝑖) (1 − 𝑝)^𝑖 𝑝^{𝑑∙𝑚−𝑖}
• Problems:
• Estimating load based on queue length is inaccurate
• Queue 1 = [ 50 ms, 50 ms, 50 ms ]
• Queue 2 = [ 200 ms ]
• Multiple schedulers assign tasks to the same machine
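Plugging numbers into the two success-probability expressions shows why batch sampling helps; a sketch (the values of 𝑝, 𝑑, and 𝑚 are illustrative):

```python
from math import comb

def per_task_success(p: float, d: int, m: int) -> float:
    """Every task independently finds at least one unloaded machine
    among its own d probes: (1 - p^d)^m."""
    return (1 - p ** d) ** m

def batch_success(p: float, d: int, m: int) -> float:
    """At least m of the d*m probed machines are unloaded, so all m
    tasks can be placed on unloaded machines."""
    return sum(comb(d * m, i) * (1 - p) ** i * p ** (d * m - i)
               for i in range(m, d * m + 1))

# At 80% load with d = 2 and m = 10, batch sampling succeeds far more often.
print(per_task_success(0.8, 2, 10))  # ~3.7e-5
print(batch_success(0.8, 2, 10))     # ~2.6e-3
```

Aggregating the 𝑑 ∙ 𝑚 probes across the whole job lets lucky probes from one task cover unlucky probes from another, which is the source of the gap.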
Batch Sampling: Results
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Late Binding
• Scheduler probes 2𝑚 random machines; reserves task on all machines
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Late Binding
• Machine requests task once it reaches front of queue
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Late Binding: Analysis
• Problems:
• Machines are idle during the RPC to request a task from the scheduler
• Machines might request tasks from schedulers that have already allocated all tasks in a job
• Solution: Proactive cancellation
• Upon allocating all tasks in a job, send a cancellation RPC to machines that have pending reservations
• Q: Does Sparrow’s design extend to microsecond-scale tasks?
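The reservation/cancellation flow can be sketched as follows; this is a simplified single-process stand-in for what Sparrow does over RPC, and the class and method names are made up for illustration:

```python
class JobScheduler:
    """Sketch of late binding: reservations on d*m workers, tasks handed
    to the first m workers that ask, proactive cancellation for the rest."""

    def __init__(self, tasks, probed_workers):
        self.pending = list(tasks)            # tasks not yet launched
        self.reserved = set(probed_workers)   # workers holding reservations

    def get_task(self, worker):
        """Called when a worker's reservation reaches its queue front."""
        self.reserved.discard(worker)
        if self.pending:
            return self.pending.pop(0)        # bind the task only now
        return None                           # job fully placed

    def cancel_remaining(self):
        """Cancellation 'RPCs' for reservations that lost the race."""
        losers, self.reserved = self.reserved, set()
        return losers

job = JobScheduler(["t1", "t2"], ["w1", "w2", "w3", "w4"])
print(job.get_task("w3"))              # t1: the first idle worker wins
print(job.get_task("w1"))              # t2
print(sorted(job.cancel_remaining()))  # ['w2', 'w4']
```

Because tasks bind to whichever probed workers free up first, wait time tracks the fastest queues rather than a stale length estimate.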
Late Binding: Results
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Placement constraints
• Per-job constraints:
• E.g. job must execute on machines that have a GPU
• Restrict batch sampling to machines that satisfy the constraint
• Per-task constraints:
• E.g. task must execute on a machine that holds its input data
• Uses per-task sampling
• Probed information shared across tasks:
• Probe for Task 1: [A loaded, B loaded, C unloaded]
• Probe for Task 2: [C unloaded, D unloaded, E loaded]
• Optimal placement?
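For the probe lists above, a placement that serves both tasks does exist: Task 1 must take C (its only unloaded probe), which pushes Task 2 to D. A greedy most-constrained-first heuristic finds it; this is an illustrative sketch, not Sparrow's actual code:

```python
# Shared probe results from the slide: per task, machine -> load status.
probes = {
    "task1": {"A": "loaded", "B": "loaded", "C": "unloaded"},
    "task2": {"C": "unloaded", "D": "unloaded", "E": "loaded"},
}

def place(probes):
    """Assign each task a distinct unloaded machine, placing the most
    constrained task (fewest unloaded probes) first."""
    options = {t: [m for m, s in ps.items() if s == "unloaded"]
               for t, ps in probes.items()}
    placement, used = {}, set()
    for task in sorted(options, key=lambda t: len(options[t])):
        for machine in options[task]:
            if machine not in used:
                placement[task] = machine
                used.add(machine)
                break
    return placement

print(place(probes))  # {'task1': 'C', 'task2': 'D'}
```

Placing tasks independently could instead give C to Task 2 and leave Task 1 with only loaded machines, which is why sharing probe information across tasks matters.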
Resource allocation policies
• Strict priorities:
• Tasks are assigned priorities (e.g. high/low)
• Sparrow maintains separate high- and low-priority task queues at each machine
• The high-priority queue is emptied before the low-priority queue
• Weighted fair sharing:
• Idea from network scheduling
• Maintain separate queues per user
• Each user is assigned a percentage representing their allocated “bandwidth”
• E.g. 10%, 30%, and 60% to different users
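Per-user queues at a single worker can be sketched as below. A weighted lottery is used here as a simple stand-in for the weighted fair queueing Sparrow actually implements; the user names, task names, and shares are illustrative:

```python
import random

# Per-user task queues at one worker, and each user's configured share
# (10% / 30% / 60%, as on the slide).
queues = {"u1": ["a1", "a2"], "u2": ["b1"], "u3": ["c1", "c2"]}
shares = {"u1": 0.10, "u2": 0.30, "u3": 0.60}

def next_task(rng=random):
    """Give the next free slot to a user drawn with probability
    proportional to their share, among users with queued work."""
    active = [u for u in queues if queues[u]]
    weights = [shares[u] for u in active]
    user = rng.choices(active, weights=weights)[0]  # weighted lottery
    return queues[user].pop(0)                      # FIFO within a user

# Over many slots, u3 receives roughly 60% of the worker's service.
```

Because each worker enforces the shares locally on its own queues, no central coordination is needed to approximate cluster-wide fairness.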
Implementation
• Front-end client converts high-level job descriptions to task specifications
• Clients and schedulers run on the same machine
• Scheduler assigns tasks to machines
• Local node monitor running on each machine enqueues scheduled tasks
• An executor process on each machine executes tasks
Implementation: Fault tolerance
• Schedulers do not maintain persistent state
• Similar to stateless web server backends
• Clients send heartbeats to schedulers to detect failure
• Upon scheduler failure, the front-end must choose how to handle in-flight tasks
• Simplest approach: restart all in-flight tasks
• Q: Is this a good design? Is it acceptable to restart in-flight tasks upon scheduler failure?
Example: Spark on Sparrow
• Front-end translates functional queries into parallel stages
• Sparrow receives task description and placement constraints
Part C: Sparrow evaluation
Experimental Setup: TPC-H
• Cluster running on Amazon EC2
• 100 machines and 10 schedulers
• 8 cores and 68.4 GB memory per machine
• Performance evaluated using the TPC-H benchmark
• Representative of ad-hoc queries on business data
• Properties:
• Cluster utilization fluctuates around 80%
• Non-uniform task durations (10-100 ms)
• Mixed constrained/unconstrained scheduling requests
Experimental Evaluation: TPC-H
Deconstructing Performance
How do task constraints affect performance?
How do scheduler failures impact job response time?
How does Sparrow compare to Spark?
How effective is Sparrow’s distributed fairness enforcement?
How much can low priority users hurt response times for high priority users?
How sensitive is Sparrow to the probe ratio?
Conclusion
• Sparrow presents a simple, scalable solution to task scheduling
• Supports millions of scheduling requests per second
• Scheduling decisions made on the order of milliseconds
• Discussion:
• Q: Suppose the cluster operates at max load (e.g. high job arrival rate). Is Sparrow’s approach optimal?
• Q: How could data processing frameworks co-optimize with Sparrow to obtain higher performance?
• Q: Are there alternative solutions to the straggler problem?