
Natjam: Supporting Deadlines and Priorities in a MapReduce Cluster


Page 1

NATJAM: SUPPORTING DEADLINES AND PRIORITIES IN A MAPREDUCE CLUSTER

Brian Cho (Samsung/Illinois), Muntasir Rahman, Tej Chajed, Indranil Gupta,

Cristina Abad, Nathan Roberts (Yahoo! Inc.), Philbert Lin

University of Illinois (Urbana-Champaign)


Distributed Protocols Research Group (DPRG): http://dprg.cs.uiuc.edu

Page 2

Hadoop Jobs have Priorities
• Dual Priority Case
  – Production jobs (high priority)
    • Time sensitive
    • Directly affect criticality or revenue
  – Research jobs (low priority)
    • e.g., long-term analysis

• Example: Ad provider
  – Production: ad click-through logs → count clicks → update ads. Slow counts → show old ads → don’t get paid $$$
  – Research: is there a better way to place ads? Run machine learning analysis on daily and historical logs

Prioritize production jobs

Page 3

State-of-the-art: Separate clusters
• Production cluster receives production jobs (high priority)
• Research cluster receives research jobs (low priority)
• Traces reveal long periods of under-utilization in each cluster
  – Long job completion times
  – Human involvement in job management
• Goal: a single consolidated cluster for all priorities and deadlines
  – Prioritize production jobs while affecting research jobs the least
• Today’s options:
  – Wait for research tasks to finish (e.g., Capacity Scheduler) → prolongs production jobs
  – Kill research tasks (e.g., Fair Scheduler) → repeated work prolongs research jobs

Page 4

Natjam’s Techniques
1. Scale down research jobs by
  – Preempting some Reduce tasks
  – Fast on-demand automated checkpointing of task state
  – Later, Reduces can resume where they left off
• Focus on Reduces: Reduce tasks take longer, so there is more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook])
2. Job eviction policies
3. Task eviction policies

Page 5

Natjam built into Hadoop YARN Architecture

• Preemptor
  – Chooses victim job
  – Reclaims queue resources
• Releaser
  – Chooses victim task
• Local Suspender
  – Saves state of victim task

[Architecture diagram: the Resource Manager’s Capacity Scheduler hosts the Preemptor. When Application Master 1 asks for a container in a full cluster, the Preemptor sends preempt() with the number of containers to release to Application Master 2, whose Releaser calls release() on a victim task. The Node Manager’s Local Suspender suspends the task and saves its state; the freed container then runs a task of App1, and the suspended task can later be resume()d.]
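As a rough illustration of this flow, here is a minimal Java sketch of the preempt-and-suspend path. All names (TaskAttempt, NodeManagerStub, VictimAppMaster) are hypothetical; in the real system this logic is spread across YARN’s Resource Manager, the victim job’s Application Master, and the Node Managers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for YARN entities; not Natjam's actual classes.
class TaskAttempt {
    final String id;
    boolean suspended;
    TaskAttempt(String id) { this.id = id; }
}

class NodeManagerStub {
    // Local Suspender: checkpoint the task's state, then free its container.
    void suspendAndSaveState(TaskAttempt t) {
        t.suspended = true;  // the saved state itself is sketched on Page 6
        System.out.println("suspended " + t.id + ", container freed");
    }
}

class VictimAppMaster {
    final List<TaskAttempt> running = new ArrayList<>();
    final NodeManagerStub node = new NodeManagerStub();

    // Preemptor (at the Resource Manager) -> victim AM: give back n containers.
    void preempt(int containersToRelease) {
        for (int i = 0; i < containersToRelease && !running.isEmpty(); i++) {
            TaskAttempt victim = running.remove(0); // a task eviction policy would choose here
            node.suspendAndSaveState(victim);       // suspend, save state, free the container
        }
    }
}
```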

Page 6

Suspending and Resuming Tasks
• Existing intermediate data is reused
  – Reduce inputs, stored at the local host
  – Reduce outputs, stored on HDFS
• Suspended task state is saved locally, so resume can avoid network overhead
• Checkpoint state saved:
  – Key counter
  – Reduce input path
  – Hostname
  – List of suspended task attempt IDs


[Diagram: Task Attempt 1 reads its Reduce inputs, advancing a key counter, and writes output to tmp/task_att_1 on HDFS. On suspend, its container is freed and the suspend state saved. The resumed Task Attempt 2 skips keys up to the saved key counter, writes to tmp/task_att_2, and commits to outdir/.]
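A minimal sketch of what such a checkpoint record could look like, directly mirroring the four fields listed above (the talk does not show Natjam’s actual on-disk format):

```java
import java.util.List;

// Hypothetical representation of Natjam's suspended-reduce checkpoint.
class ReduceCheckpoint {
    final long keyCounter;                  // number of input keys fully reduced so far
    final String reduceInputPath;           // local path of the merged reduce input
    final String hostname;                  // host holding that input, so resume avoids the network
    final List<String> suspendedAttemptIds; // earlier attempts whose partial output is reused

    ReduceCheckpoint(long keyCounter, String reduceInputPath,
                     String hostname, List<String> suspendedAttemptIds) {
        this.keyCounter = keyCounter;
        this.reduceInputPath = reduceInputPath;
        this.hostname = hostname;
        this.suspendedAttemptIds = suspendedAttemptIds;
    }
}
```

On resume, the new attempt iterates its input and skips the first keyCounter keys (the "(skip)" step in the diagram) before reducing the rest.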

Page 7

Two-level Eviction Policies
• On a container request in a full cluster:
  1. Job eviction – @Preemptor
  2. Task eviction – @Releaser

[Same architecture diagram as Page 5: the Preemptor sends preempt() with the number of containers to release; the Releaser issues release() to the Local Suspender on the victim task’s node.]
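The two levels compose as a single decision path. A minimal sketch with hypothetical interfaces (the concrete policies are on the next two pages):

```java
import java.util.List;

class TwoLevelEviction {
    record Job(String id, List<Task> tasks) {}
    record Task(String attemptId) {}

    interface JobEvictionPolicy  { Job pickVictimJob(List<Job> researchJobs); }    // @Preemptor
    interface TaskEvictionPolicy { Task pickVictimTask(List<Task> runningTasks); } // @Releaser

    // Invoked when a container request arrives and the cluster is full.
    static Task onContainerRequest(List<Job> researchJobs,
                                   JobEvictionPolicy jobPolicy,
                                   TaskEvictionPolicy taskPolicy) {
        Job victim = jobPolicy.pickVictimJob(researchJobs);  // level 1: job eviction
        return taskPolicy.pickVictimTask(victim.tasks());    // level 2: task eviction
    }
}
```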

Page 8

Job Eviction Policies
• Based on the total amount of resources (e.g., containers) held by the victim job (known at the Resource Manager); each policy is sketched in code below
1. Least Resources (LR)
   + Large research jobs unaffected
   − Starvation for small research jobs (e.g., under repeated production arrivals)
2. Most Resources (MR)
   + Small research jobs unaffected
   − Starvation for the largest research job
3. Probabilistically-weighted on Resources (PR)
   + Weighs jobs by number of containers: treats all tasks the same, across jobs
   − Affects multiple research jobs
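The three policies only need the per-job container counts the Resource Manager already tracks. A sketch, with an illustrative JobInfo type rather than Natjam’s actual code:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Random;

class JobInfo {
    final String name;
    final int containers;
    JobInfo(String name, int containers) { this.name = name; this.containers = containers; }
}

class JobEvictionPolicies {
    // LR: victim is the research job holding the fewest containers.
    static JobInfo leastResources(List<JobInfo> research) {
        return research.stream()
                .min(Comparator.comparingInt((JobInfo j) -> j.containers)).orElseThrow();
    }

    // MR: victim is the research job holding the most containers.
    static JobInfo mostResources(List<JobInfo> research) {
        return research.stream()
                .max(Comparator.comparingInt((JobInfo j) -> j.containers)).orElseThrow();
    }

    // PR: victim chosen with probability proportional to its container count,
    // so every container is equally likely to be hit, regardless of its job.
    static JobInfo probabilisticallyWeighted(List<JobInfo> research, Random rng) {
        int total = research.stream().mapToInt(j -> j.containers).sum();
        int r = rng.nextInt(total); // assumes at least one research container exists
        for (JobInfo j : research) {
            r -= j.containers;
            if (r < 0) return j;
        }
        throw new IllegalStateException("unreachable for non-empty input");
    }
}
```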

Page 9

Task Eviction Policies
• Based on time remaining (known at the Application Master); both policies are sketched in code below
1. Shortest Remaining Time (SRT)
   + Leaves the tail of the research job alone
   − Holds on to containers that would be released soon
2. Longest Remaining Time (LRT)
   − May lengthen the tail
   + Releases more containers earlier
• However: SRT is provably optimal under some conditions
  – Counter-intuitive: SRT = longest-job-first scheduling, since the tasks kept running are the longest ones
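A matching sketch of the two task eviction policies, assuming the Application Master keeps a remaining-time estimate per running task (the types are hypothetical):

```java
import java.util.Comparator;
import java.util.List;

class RunningTask {
    final String attemptId;
    final long remainingSeconds; // e.g., estimated from the task's progress rate
    RunningTask(String attemptId, long remainingSeconds) {
        this.attemptId = attemptId;
        this.remainingSeconds = remainingSeconds;
    }
}

class TaskEvictionPolicies {
    // SRT: evict the task closest to finishing; long tasks keep running,
    // which is why SRT behaves like longest-job-first scheduling.
    static RunningTask shortestRemainingTime(List<RunningTask> tasks) {
        return tasks.stream()
                .min(Comparator.comparingLong((RunningTask t) -> t.remainingSeconds))
                .orElseThrow();
    }

    // LRT: evict the task furthest from finishing, releasing the most work
    // early at the risk of lengthening the job's tail.
    static RunningTask longestRemainingTime(List<RunningTask> tasks) {
        return tasks.stream()
                .max(Comparator.comparingLong((RunningTask t) -> t.remainingSeconds))
                .orElseThrow();
    }
}
```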

Page 10

Eviction Policies in Practice
• Task eviction
  – SRT 20% faster than LRT for research jobs
  – Production job completion similar across SRT vs. LRT
  – Theorem: when research tasks resume simultaneously, SRT results in the shortest job completion time
• Job eviction
  – MR best
  – PR very close behind
  – LR 14%-23% worse than MR
• MR + SRT is the best combination

Page 11

Natjam-R: Multiple Priorities
• Special case of priorities: jobs with real-time deadlines
• Best-effort only (no admission control)
• Resource Manager keeps a single queue of jobs sorted by increasing priority (derived from deadline)
  – Periodically scans the queue: evicts a later job to give resources to an earlier waiting job
• Job eviction policies (both sketched below):
  1. Maximum Deadline First (MDF): priority = deadline
     + Prefers short-deadline jobs
     − May miss deadlines, e.g., schedules a large job instead of a small job with a slightly larger deadline
  2. Maximum Laxity First (MLF): priority = laxity = deadline minus the job’s projected completion time
     + Pays attention to the job’s resource requirements
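The two priority formulas in a small sketch; DeadlineJob and its projected completion time are illustrative stand-ins for whatever estimates the scheduler keeps:

```java
// Hypothetical job descriptor for Natjam-R's deadline-derived priorities.
class DeadlineJob {
    final long deadline;                // absolute deadline (epoch seconds)
    final long projectedCompletionTime; // estimated finish time (epoch seconds)
    DeadlineJob(long deadline, long projectedCompletionTime) {
        this.deadline = deadline;
        this.projectedCompletionTime = projectedCompletionTime;
    }

    // MDF: priority is the deadline itself; the job with the maximum
    // (latest) deadline is evicted first.
    long mdfPriority() { return deadline; }

    // MLF: priority is laxity = deadline - projected completion time, so a
    // job's remaining work also influences eviction order.
    long mlfLaxity() { return deadline - projectedCompletionTime; }
}
```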

Page 12

MDF vs. MLF in Practice

[Two plots: job progress (%) vs. time (s) for the Map and Reduce phases of Jobs 1-3, with the job deadlines marked. Under MLF the jobs move in lockstep and miss all deadlines; MDF prefers short-deadline jobs.]

• 8-node cluster
• Yahoo! trace experiments in the paper

Page 13

Natjam vs. Alternatives

Microbenchmark on a 7-node, initially empty cluster: a Research-XL job arrives at t=0 s (100% of the cluster) and a Production-S job arrives at t=50 s (25% of the cluster).

[Bar chart: average execution time (seconds) of the Research-XL and Production-S jobs under Ideal, Capacity scheduler (hard cap), Capacity scheduler (soft cap), Killing, and Natjam. Chart annotations: 50% worse than ideal; 90% worse than ideal; 20% worse than ideal; Natjam is 2% worse than ideal and 15% better than Killing, and 7% worse than Ideal and 40% better than Soft cap.]

Page 14

Large Experiments
• 250 nodes at Yahoo!, driven by Yahoo! traces
• Natjam vs. waiting for research tasks (Hadoop Capacity Scheduler, soft cap)
  – Production jobs: 53% benefit; 97% delayed < 5 s
  – Research jobs: 63% benefit; very few outliers (low starvation)
• Natjam vs. killing research tasks
  – Production jobs: largely unaffected
  – Research jobs:
    • 38% finish more than 100 s faster
    • 5th percentile more than 750 s faster
    • Biggest improvement: 1880 s
    • Negligible starvation

Page 15

Related Work
• Single-cluster job scheduling has focused on:
  – Locality of Map tasks [Quincy, Delay Scheduling]
  – Speculative execution [LATE Scheduler]
  – Average fairness between queues [Capacity Scheduler, Fair Scheduler]
  – Recent work: elastic queues, but uses Sailfish – needs a special intermediate file system, does not work with Hadoop [Amoeba]
  – MAPREDUCE-5269 JIRA: preemption in Hadoop

Page 16

Takeaways
• Natjam supports dual priority and arbitrary priorities (derived from deadlines)
• SRT (Shortest Remaining Time): best policy for task eviction
• MR (Most Resources): best policy for job eviction
• MDF (Maximum Deadline First): best policy for job eviction in Natjam-R
• 2-7% overhead for the dual-priority case
• Please see our poster and demo video later today!

Page 17

Backup slides

Page 18

Contributions

• Our system Natjam allows us to:
  – Maintain one cluster
  – With a production queue and a research queue
  – Prioritize production jobs and complete them quickly
  – While affecting research jobs the least
  – (Later: extend to multiple priorities)

Page 19

Hadoop 23’s Capacity Scheduler
• Limitation: research jobs cannot scale down
• Hadoop capacity is shared using queues
  – Guaranteed capacity (G)
  – Maximum capacity (M)
• Example (worked in the sketch below)
  – Production (P) queue: G 80% / M 80%
  – Research (R) queue: G 20% / M 40%
  1. Production job submitted first: P takes 80% (under-utilization); R can only grow to 40%
  2. Research job submitted first: R takes 40% (under-utilization); P cannot grow beyond 60%
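A small worked version of the second case’s arithmetic (illustrative Java, not the Capacity Scheduler’s code):

```java
class CapacityExample {
    public static void main(String[] args) {
        double pGuaranteed = 0.80, pMaximum = 0.80; // Production queue: G 80% / M 80%
        double rGuaranteed = 0.20, rMaximum = 0.40; // Research queue:   G 20% / M 40%

        // Case 2: the research job arrives first and grows to R's maximum.
        double rUsed = rMaximum;                     // R takes 40%
        double leftForP = 1.0 - rUsed;               // 60% of the cluster remains
        double pUsed = Math.min(pMaximum, leftForP); // P capped at 60% < its 80% guarantee
        System.out.printf("R uses %.0f%%, P can only grow to %.0f%%%n",
                          rUsed * 100, pUsed * 100);
    }
}
```

This mirrors the under-utilization above: even though P is guaranteed 80%, it cannot reclaim the occupied capacity without scaling R down, which is exactly what Natjam adds.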

Page 20

Natjam Scheduler
• Does not require Maximum capacity
• Scales down research jobs by
  – Preempting Reduce tasks
  – Fast on-demand automated checkpointing of task state
  – Resuming where the task left off
• Focus on Reduces: Reduce tasks take longer, so there is more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook])
• Example timelines:
  1. P/R guaranteed 80%/20%: R alone takes 100%; when P arrives, P takes 80%
  2. P/R guaranteed 100%/0%: R alone takes 100%; when P arrives, P takes 100%

Prioritize production jobs

Page 21

Yahoo! Hadoop Traces: CDF of differences (negative is good)

[Four CDF plots of the difference in job completion time (seconds) for production and research jobs: Natjam − Killing and Natjam − Soft Cap, on a 7-node cluster and on the 250-node Yahoo! cluster. Only two starved jobs, at 260 s and 390 s; largest benefit: 1880 s.]