Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes
Vignesh Ravi (The Ohio State University), Michela Becchi (University of Missouri), Wei Jiang (The Ohio State University), Gagan Agrawal (The Ohio State University), Srimat Chakradhar (NEC Research Laboratories)
1
Rise of Heterogeneous Architectures
• Today’s High Performance Computing
  – Multi-core CPUs and many-core GPUs are mainstream
• Many-core GPUs offer
  – Excellent “price-performance” and “performance-per-watt”
• Flavors of heterogeneous computing
  – Multi-core CPUs + (GPUs/MICs) connected over PCI-E
  – Integrated CPU-GPUs like AMD Fusion, Intel Sandy Bridge
• Such heterogeneous platforms exist in:
  – 3 out of 5 top supercomputers, and large clusters in academia and industry
  – Many cloud providers: Amazon, Nimbix, SoftLayer …
2
Motivation
• Supercomputers and cloud environments are typically “shared”
  – Accelerate a set of applications as opposed to a single application
• Software stack to program CPU-GPU architectures
  – Combination of (Pthreads/OpenMP…) + (CUDA/Stream)
  – Now, OpenCL is becoming more popular
• OpenCL, a device-agnostic platform
  – Offers great flexibility with portable solutions
  – Write a kernel once, execute on any device
• Today’s schedulers (like TORQUE) for heterogeneous clusters:
  – DO NOT exploit the portability offered by OpenCL
  – Require user-guided mapping of jobs to resources
  – Do not consider desirable scheduling possibilities (using CPU+GPU)
3
Revisit scheduling problems for CPU-GPU clusters:
1) Exploit the portability offered by models like OpenCL
2) Automatic mapping of jobs to resources
3) Desirable advanced scheduling considerations
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
4
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
5
Problem Formulations
Problem Goal:
• Accelerate a set of applications on a CPU-GPU cluster
• Each node has two resources: a multi-core CPU and a GPU
• Map applications to resources to:
  – Maximize overall system throughput
  – Minimize application latency

Scheduling Formulations:
1) Single-Node, Single-Resource Allocation & Scheduling
2) Multi-Node, Multi-Resource Allocation & Scheduling
6
Scheduling Formulations
Single-Node, Single-Resource Allocation & Scheduling
• Allocates a multi-core CPU or a GPU from a node in the cluster
  – Benchmarks like Rodinia (UVA) & Parboil (UIUC) contain single-node apps.
  – Limited mechanisms to exploit CPU+GPU simultaneously
• Exploits the portability offered by the OpenCL programming model

Multi-Node, Multi-Resource Allocation & Scheduling
• In addition, allows CPU+GPU allocation
  – Desirable in the future to allow flexibility in accelerating applications
• In addition, allows allocation of multiple nodes per job
• MATE-CG [IPDPS’12], a framework for the Map-Reduce class of apps., allows such implementations

7
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
8
Challenges and Solution Approach
Decision Making Challenges:
• Allocate/map to CPU-only, GPU-only, or CPU+GPU?
• Wait for the optimal resource (involves queuing delay)?
• Assign to a non-optimal resource (involves penalty)?
• Always allocating CPU+GPU may affect global throughput
  – Should consider other possibilities like CPU-only or GPU-only
• Always allocate the requested # of nodes?
  – May increase wait time; can consider allocating fewer nodes

Solution Approach:
• Take different levels of user input (relative speedups, execution times…)
• Design scheduling schemes for each scheduling formulation
9
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
10
Scheduling Schemes for First Formulation
11
Two input categories & three schemes — categories are based on the amount of input expected from the user:

Category 1: Relative multi-core (MP) and GPU (GP) performance as input
  Scheme 1: Relative Speedup based with Aggressive Option (RSA)
  Scheme 2: Relative Speedup based with Conservative Option (RSC)

Category 2: Additionally, the sequential CPU execution time (SQ)
  Scheme 3: Adaptive Shortest Job First (ASJF)
Relative-Speedup Aggressive (RSA) or Conservative (RSC)
12
Input: N jobs with multi-core speedups MP[n] and GPU speedups GP[n]

Flowchart (summarized):
• Create a CPU job queue (CJQ) and a GPU job queue (GJQ); enqueue each job in the queue of its faster resource, based on GP − MP
• Sort CJQ and GJQ in descending order
• R = GetNextResourceAvailable()
• If R is a GPU and GJQ is not empty, assign the top job of GJQ to R
• If GJQ is empty:
  – Aggressive (RSA): assign the bottom job of CJQ to R, minimizing the penalty of the non-optimal assignment
  – Conservative (RSC): wait for a CPU instead

Key points: takes multi-core and GPU speedups as input; creates CPU/GPU queues; maps each job to its optimal resource queue; the aggressive option minimizes penalty.
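The RSA/RSC policy in the flowchart above can be sketched as follows. This is a minimal interpretation, not the authors' code: job records, queue names, and the tie-breaking order are assumptions, and "penalty" is taken to mean how much a CPU-optimal job loses by running on a GPU.

```python
from collections import deque

def build_queues(jobs):
    """Split jobs into a CPU queue (CJQ) and a GPU queue (GJQ) by the
    resource each job runs fastest on (MP vs. GP speedup)."""
    cpu_jobs = [j for j in jobs if j["MP"] >= j["GP"]]
    gpu_jobs = [j for j in jobs if j["GP"] > j["MP"]]
    # GJQ: strongest GPU preference first.
    gjq = deque(sorted(gpu_jobs, key=lambda j: j["GP"] - j["MP"], reverse=True))
    # CJQ: cheapest-to-steal first, i.e. the job that loses least on a GPU.
    cjq = deque(sorted(cpu_jobs, key=lambda j: j["MP"] - j["GP"]))
    return cjq, gjq

def next_job_for_gpu(cjq, gjq, aggressive):
    """RSA (aggressive=True) steals the lowest-penalty CPU job when GJQ is
    empty; RSC (aggressive=False) returns None, i.e. waits for a CPU."""
    if gjq:
        return gjq.popleft()      # optimal: a GPU-preferring job
    if aggressive and cjq:
        return cjq.popleft()      # non-optimal, but smallest penalty
    return None                   # conservative: wait

jobs = [{"name": "A", "MP": 7.0, "GP": 2.0},
        {"name": "B", "MP": 4.0, "GP": 9.0},
        {"name": "C", "MP": 6.0, "GP": 5.5}]
cjq, gjq = build_queues(jobs)
print(next_job_for_gpu(cjq, gjq, aggressive=True)["name"])   # B (GPU-optimal)
print(next_job_for_gpu(cjq, gjq, aggressive=True)["name"])   # C (lowest penalty)
print(next_job_for_gpu(cjq, gjq, aggressive=False))          # None (RSC waits)
```

With these inputs, RSA keeps the GPU busy by accepting job C's small penalty, while RSC idles the GPU until a CPU-optimal job appears — the throughput/penalty tradeoff the slides describe.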
Adaptive Shortest Job First (ASJF)
13
Input: N jobs with MP[n], GP[n], and sequential CPU execution times SQ[n]

Flowchart (summarized):
• Create CJQ and GJQ; enqueue each job in the queue of its faster resource, based on GP − MP
• Sort CJQ and GJQ in ascending order of SQ (shortest job first, minimizing latency for short jobs)
• R = GetNextResourceAvailable()
• If R is a GPU and GJQ is not empty, assign the top job of GJQ to R
• If GJQ is empty:
  – T1 = GetMinWaitTimeForNextCPU()
  – T2k = GetJobWithMinPenOnGPU(CJQ), the CPU job k with the minimum penalty on the GPU
  – If T1 > T2k, assign CJQ job k to R; otherwise wait for a CPU to become free or for GPU jobs
• This automatically switches between the aggressive and conservative options
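The idle-GPU decision at the heart of ASJF can be sketched like this. It is an illustrative reading of the flowchart, not the authors' implementation: the "penalty" T2k is assumed here to be the job's GPU runtime SQ/GP, and the job/field names are made up for the example.

```python
def asjf_gpu_decision(cjq, t_next_cpu_free):
    """When a GPU is idle and GJQ is empty, decide between waiting for a
    CPU (T1 = t_next_cpu_free) and running a CPU-optimal job on the GPU.
    cjq: CPU-queue jobs, each with SQ (sequential time), MP, GP speedups.
    Returns the stolen job, or None to wait."""
    if not cjq:
        return None
    # Job whose GPU execution would be cheapest (minimum penalty, T2k).
    k = min(range(len(cjq)), key=lambda i: cjq[i]["SQ"] / cjq[i]["GP"])
    t2k = cjq[k]["SQ"] / cjq[k]["GP"]
    if t_next_cpu_free > t2k:
        return cjq.pop(k)   # aggressive: running on the GPU beats waiting
    return None             # conservative: wait for a CPU or for GPU jobs

cjq = [{"name": "short", "SQ": 4.0, "MP": 7.0, "GP": 2.0},
       {"name": "long", "SQ": 40.0, "MP": 6.0, "GP": 5.0}]
job = asjf_gpu_decision(cjq, t_next_cpu_free=3.0)
print(job["name"] if job else "wait")   # "short": 2.0s on GPU < 3.0s wait
```

Because the comparison is made per decision, the scheme behaves aggressively when CPUs are congested (T1 large) and conservatively when a CPU is about to free up — the "automatic switch" noted on the slide.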
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
14
Scheduling Scheme for Second Formulation
15
Solution Approach:
• Flexibly schedule on CPU-only, GPU-only, or CPU+GPU
• Mold the # of nodes requested by a job
  – Consider allocating ½ or ¼ of the requested nodes

Inputs from User:
• Execution times of the CPU-only, GPU-only, and CPU+GPU versions
• Execution times of jobs with n, n/2, and n/4 nodes
• Such application information can also be obtained from profiles
Flexible Moldable Scheduling Scheme (FMS)
16
Input: N jobs with execution times for each configuration

Flowchart (summarized):
• Group jobs with their requested # of nodes as the index — minimizes resource fragmentation and helps co-locate a CPU job and a GPU job on the same node
• Sort each group by the execution time of its CPU+GPU version — gives a global view for co-locating jobs on the same node
• Pick a pair of jobs to schedule in the sorted order
• For each job i requesting n nodes, find the fastest completion option among T(i,n,C), T(i,n,G), and T(i,n,CG)
• Either choose C for one job and G for the other, co-locating them on the same set of nodes, or choose the same resource for both jobs: (C,C), (G,G), or (CG,CG)
• If 2N nodes are available, schedule the pair of jobs in parallel on 2N nodes; otherwise schedule the first job on N nodes and consider molding the # of nodes for the next job
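The pairing and molding steps above can be sketched as follows. This is a simplified sketch under stated assumptions: execution times live in a dictionary keyed by (node count, resource), only the halving mold is shown, and the co-location branch is the C/G pairing case — the full FMS scheme handles more combinations.

```python
def fms_pick_option(job, n):
    """Fastest completion option among CPU-only (C), GPU-only (G), and
    CPU+GPU (CG) for `job` on n nodes; job["T"] maps (n, res) -> time."""
    return min(("C", "G", "CG"), key=lambda res: job["T"][(n, res)])

def fms_schedule_pair(job_a, job_b, n, free_nodes):
    """Schedule a pair of jobs that both request n nodes.
    A C job and a G job can share one set of n nodes (co-location);
    with 2n free nodes the pair runs in parallel; otherwise the second
    job is considered for molding down to n/2 nodes."""
    ra, rb = fms_pick_option(job_a, n), fms_pick_option(job_b, n)
    if {ra, rb} == {"C", "G"}:
        return [("co-locate", job_a["name"], job_b["name"], n)]
    if free_nodes >= 2 * n:
        return [("parallel", job_a["name"], n), ("parallel", job_b["name"], n)]
    return [("run", job_a["name"], n), ("mold", job_b["name"], n // 2)]

# Hypothetical jobs: times are illustrative, not measured values.
job_a = {"name": "EM",
         "T": {(4, "C"): 100.0, (4, "G"): 30.0, (4, "CG"): 35.0}}
job_b = {"name": "PageRank",
         "T": {(4, "C"): 50.0, (4, "G"): 120.0, (4, "CG"): 55.0}}
print(fms_pick_option(job_a, 4), fms_pick_option(job_b, 4))   # G C
print(fms_schedule_pair(job_a, job_b, 4, free_nodes=8))
```

Here a GPU-best job and a CPU-best job fold onto the same four nodes, leaving the other four free — the fragmentation-reducing co-location the slide highlights.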
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
17
Cluster Hardware Setup
18
• Cluster of 16 CPU-GPU nodes
• Each CPU is an 8-core Intel Xeon E5520 (2.27 GHz)
• Each GPU is an Nvidia Tesla C2050 (1.15 GHz)
• CPU main memory: 48 GB
• GPU device memory: 3 GB
• Machines are connected through InfiniBand
Benchmarks
19
Single-Node Jobs
• We use 10 benchmarks
• Scientific, financial, data mining, and image processing applications
• Each benchmark runs with 3 different execution configurations
• Overall, a pool of 30 jobs

Multi-Node Jobs
• We use 3 applications
• Gridding kernel, Expectation-Maximization, PageRank
• Applications run with 2 different datasets and on 3 different node counts
• Overall, a pool of 18 jobs
Baselines & Metrics
20
Baseline for Single-Node Jobs
• Blind Round Robin (BRR)
• Manual Optimal (exhaustive search; upper bound)

Baseline for Multi-Node Jobs
• TORQUE, a widely used resource manager for heterogeneous clusters
• Minimum Completion Time (MCT) [Maheswaran et al., HCW’99]

Metrics
• Completion Time (Comp. Time)
• Application Latency:
  – Non-optimal Assignment (Ave. NOA Lat.)
  – Queuing Delay (Ave. QD Lat.)
• Maximum Idle Time (Max. Idle Time)
Single-Node Job Results
21
Uniform CPU-GPU Job Mix / CPU-biased Job Mix
[Charts: Comp. Time, Ave. NOA Lat., Ave. QD Lat., and Max. Idle Time, normalized over the best case, for BRR, RSA, RSC, ASJF, and Manual Optimal]

• 24 jobs on 2 nodes; proposed schemes evaluated on 4 different metrics
• For each metric: 108% better than BRR, and within 12% of Manual Optimal
• Tradeoff between non-optimal penalty and wait time for a resource
• Latency: BRR has the highest latency; RSA incurs non-optimal-assignment penalty; RSC incurs high queuing delay; ASJF is as good as Manual Optimal
• Idle time: BRR has very high idle times; RSC's can be very high too; RSA has the best utilization among the proposed schemes
Multi-Node Job Results
22
Varying Job Execution Lengths / Varying Resource Request Sizes
[Charts: Normalized Completion Time for job mixes of 75/25, 50/50, and 25/75 Short Jobs (SJ) vs. Long Jobs (LJ), and Small Requests (SR) vs. Large Requests (LR), comparing TORQUE, MCT, Molding ResType Only, Molding NumNodes Only, and Molding ResType+NumNodes (FMS)]

• 32 jobs on 16 nodes
• Varying execution lengths: FMS is 42% better than the best of TORQUE and MCT; each type of molding gives a reasonable improvement; our schemes utilize the resources better, yielding high throughput
• Varying request sizes: FMS is 32% better than the best of TORQUE and MCT; the scheme is intelligent in deciding whether to wait for a resource or mold the job to a smaller one; the benefit from ResType molding is larger than from NumNodes molding
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
23
Conclusions
24
• Revisited scheduling problems on CPU-GPU clusters
  – Goal: improve aggregate throughput
  – Single-node, single-resource scheduling problem
  – Multi-node, multi-resource scheduling problem
• Developed novel scheduling schemes
  – Exploit the portability offered by OpenCL
  – Automatically map jobs to heterogeneous resources
  – RSA, RSC, and ASJF for single-node jobs
  – Flexible Moldable Scheduling (FMS) for multi-node jobs
• Significant improvement over the state-of-the-art
25
Thank You!
Questions?

raviv@cse.ohio-state.edu
becchim@missouri.edu
jiangwei@cse.ohio-state.edu
agrawal@cse.ohio-state.edu
chak@nec-labs.com
Benchmarks – Large Dataset
26
Benchmark            Seq. CPU Exec. (sec)   GPU Speedup (GP)   Multi-core Speedup (MP)   Dataset Characteristics
PDE Solver           7.3                    4.7                6.8                       14336 x 14336
Image Processing     33.8                   5.1                7.8                       14336 x 14336
FDTD                 8.4                    2.2                7.6                       14336 x 14336
BlackScholes         2.6                    2.1                7.2                       10 mil options
Binomial Options     11.8                   5.6                4.2                       1024 options
MonteCarlo           45.4                   38.4               7.9                       1024 options
Kmeans               330.0                  12.1               7.8                       1.6 x 10^9 points
KNN                  67.3                   7.8                6.2                       67108864 points
PCA                  142.0                  9.7                5.6                       262144 x 80
Molecular Dynamics   46.6                   12.9               7.9                       256000 nodes, 31744000 edges
Benchmarks – Small Dataset
27
Benchmark            Seq. CPU Exec. (sec)   GPU Speedup (GP)   Multi-core Speedup (MP)   Dataset Characteristics
PDE Solver           1.8                    3.8                7.1                       7168 x 7168
Image Processing     8.4                    5.6                7.5                       7168 x 7168
FDTD                 2.1                    1.3                7.7                       7168 x 7168
BlackScholes         0.7                    0.6                6.8                       2.5 mil options
Binomial Options     3.0                    2.3                4.2                       128 options
MonteCarlo           11.0                   9.4                7.9                       256 options
Kmeans               74.2                   6.3                7.7                       0.4 x 10^9 points
KNN                  16.8                   2.9                6.2                       16777216 points
PCA                  33.8                   9.1                5.6                       65536 x 80
Molecular Dynamics   6.7                    12.8               7.3                       32000 nodes, 3968000 edges
Benchmarks – Large No. of Iterations
28
Benchmark            Seq. CPU Exec. (sec)   GPU Speedup (GP)   Multi-core Speedup (MP)   Dataset Characteristics
PDE Solver           722.1                  4.3                8.1                       14336 x 14336
Image Processing     3385.5                 4.8                8.0                       14336 x 14336
FDTD                 423.3                  1.8                7.9                       14336 x 14336
BlackScholes         269.1                  92.8               7.8                       10 mil options
Binomial Options     1213.6                 12.2               4.3                       1024 options
MonteCarlo           453.3                  368.5              7.8                       1024 options
Kmeans               1593.8                 12.6               7.9                       1.6 x 10^9 points
KNN                  1691.1                 58.4               6.9                       67108864 points
PCA                  2835.7                 11.8               6.2                       262144 x 80
Molecular Dynamics   593.8                  20.8               7.8                       256000 nodes, 31744000 edges