
Natjam: Supporting Deadlines and Priorities in a MapReduce Cluster


Page 1

NATJAM: SUPPORTING DEADLINES AND PRIORITIES IN A MAPREDUCE CLUSTER

Brian Cho (Samsung/Illinois), Muntasir Rahman, Tej Chajed, Indranil Gupta,

Cristina Abad, Nathan Roberts (Yahoo! Inc.), Philbert Lin

University of Illinois (Urbana-Champaign)


Distributed Protocols Research Group (DPRG): http://dprg.cs.uiuc.edu

Page 2

Hadoop Jobs have Priorities
• Dual Priority Case
  – Production jobs (high priority)
    • Time sensitive
    • Directly affect criticality or revenue
  – Research jobs (low priority)
    • e.g., long-term analysis

• Example: Ad provider
  – Production: ad click-through logs → count clicks → update ads. Slow counts → show old ads → don’t get paid $$$
  – Research: is there a better way to place ads? Run machine learning analysis on daily and historical logs

Prioritize production jobs

Page 3

State-of-the-art: Separate clusters
• Production cluster receives production jobs (high priority)
• Research cluster receives research jobs (low priority)
• Traces reveal long periods of under-utilization in each cluster
  – Long job completion times
  – Human involvement in job management
• Goal: a single consolidated cluster for all priorities and deadlines
  – Prioritize production jobs while affecting research jobs the least
• Today’s options:
  – Wait for research tasks to finish (e.g., Capacity Scheduler) → prolongs production jobs
  – Kill research tasks (e.g., Fair Scheduler) → repeated work prolongs research jobs

Page 4

Natjam’s Techniques
1. Scale down research jobs by
  – Preempting some Reduce tasks
  – Fast on-demand automated checkpointing of task state
  – Later, Reduces can resume where they left off
• Focus on Reduces: Reduce tasks take longer, so there is more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook])
2. Job eviction policies
3. Task eviction policies

Page 5

Natjam built into Hadoop YARN Architecture

• Preemptor
  – Chooses victim job
  – Reclaims queue resources
• Releaser
  – Chooses victim task
• Local Suspender
  – Saves state of victim task

[Architecture diagram: the Resource Manager’s Capacity Scheduler hosts the Preemptor. When Application Master 1 asks for a container in a full cluster, the Preemptor sends preempt() with the number of containers to release to Application Master 2, whose Releaser calls release() on a victim task. The Node Manager’s Local Suspender suspends the task and saves its state; the freed container then runs a task of App1, and the suspended task can later be resume()d.]
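As a rough illustration of this flow, here is a minimal Java sketch of the preempt-and-suspend path. All names (TaskAttempt, NodeManagerStub, VictimAppMaster) are hypothetical; in the real system this logic is spread across YARN’s Resource Manager, the victim job’s Application Master, and the Node Managers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for YARN entities; not Natjam's actual classes.
class TaskAttempt {
    final String id;
    boolean suspended;
    TaskAttempt(String id) { this.id = id; }
}

class NodeManagerStub {
    // Local Suspender: checkpoint the task's state, then free its container.
    void suspendAndSaveState(TaskAttempt t) {
        t.suspended = true;  // the saved state itself is sketched on Page 6
        System.out.println("suspended " + t.id + ", container freed");
    }
}

class VictimAppMaster {
    final List<TaskAttempt> running = new ArrayList<>();
    final NodeManagerStub node = new NodeManagerStub();

    // Preemptor (at the Resource Manager) -> victim AM: give back n containers.
    void preempt(int containersToRelease) {
        for (int i = 0; i < containersToRelease && !running.isEmpty(); i++) {
            TaskAttempt victim = running.remove(0); // a task eviction policy would choose here
            node.suspendAndSaveState(victim);       // suspend, save state, free the container
        }
    }
}
```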

Page 6

Suspending and Resuming Tasks
• Existing intermediate data is reused
  – Reduce inputs, stored at the local host
  – Reduce outputs, stored on HDFS
• Suspended task state is saved locally, so resume can avoid network overhead
• Checkpoint state saved:
  – Key counter
  – Reduce input path
  – Hostname
  – List of suspended task attempt IDs


[Diagram: Task Attempt 1 reads its Reduce inputs, advancing a key counter, and writes output to tmp/task_att_1 on HDFS. On suspend, its container is freed and the suspend state saved. The resumed Task Attempt 2 skips keys up to the saved key counter, writes to tmp/task_att_2, and commits to outdir/.]
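A minimal sketch of what such a checkpoint record could look like, directly mirroring the four fields listed above (the talk does not show Natjam’s actual on-disk format):

```java
import java.util.List;

// Hypothetical representation of Natjam's suspended-reduce checkpoint.
class ReduceCheckpoint {
    final long keyCounter;                  // number of input keys fully reduced so far
    final String reduceInputPath;           // local path of the merged reduce input
    final String hostname;                  // host holding that input, so resume avoids the network
    final List<String> suspendedAttemptIds; // earlier attempts whose partial output is reused

    ReduceCheckpoint(long keyCounter, String reduceInputPath,
                     String hostname, List<String> suspendedAttemptIds) {
        this.keyCounter = keyCounter;
        this.reduceInputPath = reduceInputPath;
        this.hostname = hostname;
        this.suspendedAttemptIds = suspendedAttemptIds;
    }
}
```

On resume, the new attempt iterates its input and skips the first keyCounter keys (the "(skip)" step in the diagram) before reducing the rest.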

Page 7

Two-level Eviction Policies
• On a container request in a full cluster:
  1. Job eviction – @Preemptor
  2. Task eviction – @Releaser

[Same architecture diagram as Page 5: the Preemptor sends preempt() with the number of containers to release; the Releaser issues release() to the Local Suspender on the victim task’s node.]
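The two levels compose as a single decision path. A minimal sketch with hypothetical interfaces (the concrete policies are on the next two pages):

```java
import java.util.List;

class TwoLevelEviction {
    record Job(String id, List<Task> tasks) {}
    record Task(String attemptId) {}

    interface JobEvictionPolicy  { Job pickVictimJob(List<Job> researchJobs); }    // @Preemptor
    interface TaskEvictionPolicy { Task pickVictimTask(List<Task> runningTasks); } // @Releaser

    // Invoked when a container request arrives and the cluster is full.
    static Task onContainerRequest(List<Job> researchJobs,
                                   JobEvictionPolicy jobPolicy,
                                   TaskEvictionPolicy taskPolicy) {
        Job victim = jobPolicy.pickVictimJob(researchJobs);  // level 1: job eviction
        return taskPolicy.pickVictimTask(victim.tasks());    // level 2: task eviction
    }
}
```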

Page 8

Job Eviction Policies
• Based on the total amount of resources (e.g., containers) held by the victim job (known at the Resource Manager); each policy is sketched in code below
1. Least Resources (LR)
   + Large research jobs unaffected
   − Starvation for small research jobs (e.g., under repeated production arrivals)
2. Most Resources (MR)
   + Small research jobs unaffected
   − Starvation for the largest research job
3. Probabilistically-weighted on Resources (PR)
   + Weighs jobs by number of containers: treats all tasks the same, across jobs
   − Affects multiple research jobs
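The three policies only need the per-job container counts the Resource Manager already tracks. A sketch, with an illustrative JobInfo type rather than Natjam’s actual code:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Random;

class JobInfo {
    final String name;
    final int containers;
    JobInfo(String name, int containers) { this.name = name; this.containers = containers; }
}

class JobEvictionPolicies {
    // LR: victim is the research job holding the fewest containers.
    static JobInfo leastResources(List<JobInfo> research) {
        return research.stream()
                .min(Comparator.comparingInt((JobInfo j) -> j.containers)).orElseThrow();
    }

    // MR: victim is the research job holding the most containers.
    static JobInfo mostResources(List<JobInfo> research) {
        return research.stream()
                .max(Comparator.comparingInt((JobInfo j) -> j.containers)).orElseThrow();
    }

    // PR: victim chosen with probability proportional to its container count,
    // so every container is equally likely to be hit, regardless of its job.
    static JobInfo probabilisticallyWeighted(List<JobInfo> research, Random rng) {
        int total = research.stream().mapToInt(j -> j.containers).sum();
        int r = rng.nextInt(total); // assumes at least one research container exists
        for (JobInfo j : research) {
            r -= j.containers;
            if (r < 0) return j;
        }
        throw new IllegalStateException("unreachable for non-empty input");
    }
}
```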

Page 9

Task Eviction Policies
• Based on time remaining (known at the Application Master); both policies are sketched in code below
1. Shortest Remaining Time (SRT)
   + Leaves the tail of the research job alone
   − Holds on to containers that would be released soon
2. Longest Remaining Time (LRT)
   − May lengthen the tail
   + Releases more containers earlier
• However: SRT is provably optimal under some conditions
  – Counter-intuitive: SRT = longest-job-first scheduling, since the tasks kept running are the longest ones
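A matching sketch of the two task eviction policies, assuming the Application Master keeps a remaining-time estimate per running task (the types are hypothetical):

```java
import java.util.Comparator;
import java.util.List;

class RunningTask {
    final String attemptId;
    final long remainingSeconds; // e.g., estimated from the task's progress rate
    RunningTask(String attemptId, long remainingSeconds) {
        this.attemptId = attemptId;
        this.remainingSeconds = remainingSeconds;
    }
}

class TaskEvictionPolicies {
    // SRT: evict the task closest to finishing; long tasks keep running,
    // which is why SRT behaves like longest-job-first scheduling.
    static RunningTask shortestRemainingTime(List<RunningTask> tasks) {
        return tasks.stream()
                .min(Comparator.comparingLong((RunningTask t) -> t.remainingSeconds))
                .orElseThrow();
    }

    // LRT: evict the task furthest from finishing, releasing the most work
    // early at the risk of lengthening the job's tail.
    static RunningTask longestRemainingTime(List<RunningTask> tasks) {
        return tasks.stream()
                .max(Comparator.comparingLong((RunningTask t) -> t.remainingSeconds))
                .orElseThrow();
    }
}
```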

Page 10

Eviction Policies in Practice
• Task eviction
  – SRT 20% faster than LRT for research jobs
  – Production job completion similar across SRT vs. LRT
  – Theorem: when research tasks resume simultaneously, SRT results in the shortest job completion time
• Job eviction
  – MR best
  – PR very close behind
  – LR 14%-23% worse than MR
• MR + SRT is the best combination

Page 11

Natjam-R: Multiple Priorities
• Special case of priorities: jobs with real-time deadlines
• Best-effort only (no admission control)
• Resource Manager keeps a single queue of jobs sorted by increasing priority (derived from deadline)
  – Periodically scans the queue: evicts a later job to give resources to an earlier waiting job
• Job eviction policies (both sketched below):
  1. Maximum Deadline First (MDF): priority = deadline
     + Prefers short-deadline jobs
     − May miss deadlines, e.g., schedules a large job instead of a small job with a slightly larger deadline
  2. Maximum Laxity First (MLF): priority = laxity = deadline minus the job’s projected completion time
     + Pays attention to the job’s resource requirements
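The two priority formulas in a small sketch; DeadlineJob and its projected completion time are illustrative stand-ins for whatever estimates the scheduler keeps:

```java
// Hypothetical job descriptor for Natjam-R's deadline-derived priorities.
class DeadlineJob {
    final long deadline;                // absolute deadline (epoch seconds)
    final long projectedCompletionTime; // estimated finish time (epoch seconds)
    DeadlineJob(long deadline, long projectedCompletionTime) {
        this.deadline = deadline;
        this.projectedCompletionTime = projectedCompletionTime;
    }

    // MDF: priority is the deadline itself; the job with the maximum
    // (latest) deadline is evicted first.
    long mdfPriority() { return deadline; }

    // MLF: priority is laxity = deadline - projected completion time, so a
    // job's remaining work also influences eviction order.
    long mlfLaxity() { return deadline - projectedCompletionTime; }
}
```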

Page 12

MDF vs. MLF in Practice

[Two plots: job progress (%) vs. time (s) for the Map and Reduce phases of Jobs 1-3, with the job deadlines marked. Under MLF the jobs move in lockstep and miss all deadlines; MDF prefers short-deadline jobs.]

• 8-node cluster
• Yahoo! trace experiments in the paper

Page 13

Natjam vs. Alternatives

Microbenchmark on a 7-node, initially empty cluster: a Research-XL job arrives at t=0 s (100% of the cluster) and a Production-S job arrives at t=50 s (25% of the cluster).

[Bar chart: average execution time (seconds) of the Research-XL and Production-S jobs under Ideal, Capacity scheduler (hard cap), Capacity scheduler (soft cap), Killing, and Natjam. Chart annotations: 50% worse than ideal; 90% worse than ideal; 20% worse than ideal; Natjam is 2% worse than ideal and 15% better than Killing, and 7% worse than Ideal and 40% better than Soft cap.]

Page 14

Large Experiments
• 250 nodes at Yahoo!, driven by Yahoo! traces
• Natjam vs. waiting for research tasks (Hadoop Capacity Scheduler, soft cap)
  – Production jobs: 53% benefit; 97% delayed < 5 s
  – Research jobs: 63% benefit; very few outliers (low starvation)
• Natjam vs. killing research tasks
  – Production jobs: largely unaffected
  – Research jobs:
    • 38% finish more than 100 s faster
    • 5th percentile more than 750 s faster
    • Biggest improvement: 1880 s
    • Negligible starvation

Page 15

Related Work
• Single-cluster job scheduling has focused on:
  – Locality of Map tasks [Quincy, Delay Scheduling]
  – Speculative execution [LATE Scheduler]
  – Average fairness between queues [Capacity Scheduler, Fair Scheduler]
  – Recent work: elastic queues, but uses Sailfish – needs a special intermediate file system, does not work with Hadoop [Amoeba]
  – MAPREDUCE-5269 JIRA: preemption in Hadoop

Page 16

Takeaways
• Natjam supports dual priority and arbitrary priorities (derived from deadlines)
• SRT (Shortest Remaining Time): best policy for task eviction
• MR (Most Resources): best policy for job eviction
• MDF (Maximum Deadline First): best policy for job eviction in Natjam-R
• 2-7% overhead for the dual-priority case
• Please see our poster and demo video later today!

Page 17

Backup slides

Page 18

Contributions

• Our system Natjam allows us to:
  – Maintain one cluster
  – With a production queue and a research queue
  – Prioritize production jobs and complete them quickly
  – While affecting research jobs the least
  – (Later: extend to multiple priorities)

Page 19

Hadoop 23’s Capacity Scheduler
• Limitation: research jobs cannot scale down
• Hadoop capacity is shared using queues
  – Guaranteed capacity (G)
  – Maximum capacity (M)
• Example (worked in the sketch below)
  – Production (P) queue: G 80% / M 80%
  – Research (R) queue: G 20% / M 40%
  1. Production job submitted first: P takes 80% (under-utilization); R can only grow to 40%
  2. Research job submitted first: R takes 40% (under-utilization); P cannot grow beyond 60%
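A small worked version of the second case’s arithmetic (illustrative Java, not the Capacity Scheduler’s code):

```java
class CapacityExample {
    public static void main(String[] args) {
        double pGuaranteed = 0.80, pMaximum = 0.80; // Production queue: G 80% / M 80%
        double rGuaranteed = 0.20, rMaximum = 0.40; // Research queue:   G 20% / M 40%

        // Case 2: the research job arrives first and grows to R's maximum.
        double rUsed = rMaximum;                     // R takes 40%
        double leftForP = 1.0 - rUsed;               // 60% of the cluster remains
        double pUsed = Math.min(pMaximum, leftForP); // P capped at 60% < its 80% guarantee
        System.out.printf("R uses %.0f%%, P can only grow to %.0f%%%n",
                          rUsed * 100, pUsed * 100);
    }
}
```

This mirrors the under-utilization above: even though P is guaranteed 80%, it cannot reclaim the occupied capacity without scaling R down, which is exactly what Natjam adds.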

Page 20

Natjam Scheduler
• Does not require Maximum capacity
• Scales down research jobs by
  – Preempting Reduce tasks
  – Fast on-demand automated checkpointing of task state
  – Resuming where the task left off
• Focus on Reduces: Reduce tasks take longer, so there is more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook])
• Example timelines:
  1. P/R guaranteed 80%/20%: R alone takes 100%; when P arrives, P takes 80%
  2. P/R guaranteed 100%/0%: R alone takes 100%; when P arrives, P takes 100%

Prioritize production jobs

Page 21

Yahoo! Hadoop Traces: CDF of differences (negative is good)

[Four CDF plots of the difference in job completion time (seconds) for production and research jobs: Natjam − Killing and Natjam − Soft Cap, on a 7-node cluster and on the 250-node Yahoo! cluster. Only two starved jobs, at 260 s and 390 s; largest benefit: 1880 s.]