Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes
Vignesh Ravi (The Ohio State University), Michela Becchi (University of Missouri), Wei Jiang (The Ohio State University), Gagan Agrawal (The Ohio State University), Srimat Chakradhar (NEC Research Laboratories)
1
Rise of Heterogeneous Architectures
• Today’s High Performance Computing
  – Multi-core CPUs and many-core GPUs are mainstream
• Many-core GPUs offer
  – Excellent “price-performance” and “performance-per-watt”
• Flavors of heterogeneous computing
  – Multi-core CPUs + (GPUs/MICs) connected over PCI-E
  – Integrated CPU-GPUs like AMD Fusion, Intel Sandy Bridge
• Such heterogeneous platforms exist in:
  – 3 out of 5 top supercomputers, and large clusters in academia and industry
  – Many cloud providers: Amazon, Nimbix, SoftLayer …
2
Motivation
• Supercomputers and cloud environments are typically “shared”
  – Accelerate a set of applications as opposed to a single application
• Software stack to program CPU-GPU architectures
  – Combination of (Pthreads/OpenMP…) + (CUDA/Stream)
  – Now, OpenCL is becoming more popular
• OpenCL, a device-agnostic platform
  – Offers great flexibility with portable solutions
  – Write a kernel once, execute on any device
• Today’s schedulers (like TORQUE) for heterogeneous clusters:
  – DO NOT exploit the portability offered by OpenCL
  – Require user-guided mapping of jobs to resources
  – Do not consider desirable scheduling possibilities (using CPU+GPU)
3
Revisit scheduling problems for CPU-GPU clusters:
1) Exploit the portability offered by models like OpenCL
2) Automatic mapping of jobs to resources
3) Desirable advanced scheduling considerations
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
4
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
5
Problem Formulations
Problem Goal:
• Accelerate a set of applications on a CPU-GPU cluster
• Each node has two resources: a multi-core CPU and a GPU
• Map applications to resources to:
  – Maximize overall system throughput
  – Minimize application latency

Scheduling Formulations:
1) Single-Node, Single-Resource Allocation & Scheduling
2) Multi-Node, Multi-Resource Allocation & Scheduling
6
Scheduling Formulations
Single-Node, Single-Resource Allocation & Scheduling
• Allocates a multi-core CPU or a GPU from a node in the cluster
  – Benchmarks like Rodinia (UVA) & Parboil (UIUC) contain single-node apps.
  – Limited mechanisms to exploit CPU+GPU simultaneously
• Exploits the portability offered by the OpenCL programming model

Multi-Node, Multi-Resource Allocation & Scheduling
• In addition, allows CPU+GPU allocation
  – Desirable in the future to allow flexibility in accelerating applications
• In addition, allows allocation of multiple nodes per job
• MATE-CG [IPDPS’12], a framework for the Map-Reduce class of apps., allows such implementations

7
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
8
Challenges and Solution Approach
Decision Making Challenges:
• Allocate/map to CPU-only, GPU-only, or CPU+GPU?
• Wait for the optimal resource (involves queuing delay)?
• Assign to a non-optimal resource (involves penalty)?
• Always allocating CPU+GPU may affect global throughput
  – Should consider other possibilities like CPU-only or GPU-only
• Always allocate the requested # of nodes?
  – May increase wait time; can consider allocating fewer nodes

Solution Approach:
• Take different levels of user input (relative speedups, execution times…)
• Design scheduling schemes for each scheduling formulation
9
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
10
Scheduling Schemes for First Formulation
11
Two input categories & three schemes — categories are based on the amount of input expected from the user:

Category 1: Relative multi-core (MP) and GPU (GP) performance as input
  Scheme 1: Relative Speedup based with Aggressive Option (RSA)
  Scheme 2: Relative Speedup based with Conservative Option (RSC)

Category 2: Additionally, the sequential CPU execution time (SQ)
  Scheme 3: Adaptive Shortest Job First (ASJF)
Relative-Speedup Aggressive (RSA) or Conservative (RSC)
12
Input: N jobs with multi-core speedups MP[n] and GPU speedups GP[n]

Flowchart (summarized):
• Create a CPU job queue (CJQ) and a GPU job queue (GJQ); enqueue each job in the queue of its faster resource, based on GP − MP
• Sort CJQ and GJQ in descending order
• R = GetNextResourceAvailable()
• If R is a GPU and GJQ is not empty, assign the top job of GJQ to R
• If GJQ is empty:
  – Aggressive (RSA): assign the bottom job of CJQ to R, minimizing the penalty of the non-optimal assignment
  – Conservative (RSC): wait for a CPU instead

Key points: takes multi-core and GPU speedups as input; creates CPU/GPU queues; maps each job to its optimal resource queue; the aggressive option minimizes penalty.
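The RSA/RSC policy in the flowchart above can be sketched as follows. This is a minimal interpretation, not the authors' code: job records, queue names, and the tie-breaking order are assumptions, and "penalty" is taken to mean how much a CPU-optimal job loses by running on a GPU.

```python
from collections import deque

def build_queues(jobs):
    """Split jobs into a CPU queue (CJQ) and a GPU queue (GJQ) by the
    resource each job runs fastest on (MP vs. GP speedup)."""
    cpu_jobs = [j for j in jobs if j["MP"] >= j["GP"]]
    gpu_jobs = [j for j in jobs if j["GP"] > j["MP"]]
    # GJQ: strongest GPU preference first.
    gjq = deque(sorted(gpu_jobs, key=lambda j: j["GP"] - j["MP"], reverse=True))
    # CJQ: cheapest-to-steal first, i.e. the job that loses least on a GPU.
    cjq = deque(sorted(cpu_jobs, key=lambda j: j["MP"] - j["GP"]))
    return cjq, gjq

def next_job_for_gpu(cjq, gjq, aggressive):
    """RSA (aggressive=True) steals the lowest-penalty CPU job when GJQ is
    empty; RSC (aggressive=False) returns None, i.e. waits for a CPU."""
    if gjq:
        return gjq.popleft()      # optimal: a GPU-preferring job
    if aggressive and cjq:
        return cjq.popleft()      # non-optimal, but smallest penalty
    return None                   # conservative: wait

jobs = [{"name": "A", "MP": 7.0, "GP": 2.0},
        {"name": "B", "MP": 4.0, "GP": 9.0},
        {"name": "C", "MP": 6.0, "GP": 5.5}]
cjq, gjq = build_queues(jobs)
print(next_job_for_gpu(cjq, gjq, aggressive=True)["name"])   # B (GPU-optimal)
print(next_job_for_gpu(cjq, gjq, aggressive=True)["name"])   # C (lowest penalty)
print(next_job_for_gpu(cjq, gjq, aggressive=False))          # None (RSC waits)
```

With these inputs, RSA keeps the GPU busy by accepting job C's small penalty, while RSC idles the GPU until a CPU-optimal job appears — the throughput/penalty tradeoff the slides describe.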
Adaptive Shortest Job First (ASJF)
13
Input: N jobs with MP[n], GP[n], and sequential CPU execution times SQ[n]

Flowchart (summarized):
• Create CJQ and GJQ; enqueue each job in the queue of its faster resource, based on GP − MP
• Sort CJQ and GJQ in ascending order of SQ (shortest job first, minimizing latency for short jobs)
• R = GetNextResourceAvailable()
• If R is a GPU and GJQ is not empty, assign the top job of GJQ to R
• If GJQ is empty:
  – T1 = GetMinWaitTimeForNextCPU()
  – T2k = GetJobWithMinPenOnGPU(CJQ), the CPU job k with the minimum penalty on the GPU
  – If T1 > T2k, assign CJQ job k to R; otherwise wait for a CPU to become free or for GPU jobs
• This automatically switches between the aggressive and conservative options
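The idle-GPU decision at the heart of ASJF can be sketched like this. It is an illustrative reading of the flowchart, not the authors' implementation: the "penalty" T2k is assumed here to be the job's GPU runtime SQ/GP, and the job/field names are made up for the example.

```python
def asjf_gpu_decision(cjq, t_next_cpu_free):
    """When a GPU is idle and GJQ is empty, decide between waiting for a
    CPU (T1 = t_next_cpu_free) and running a CPU-optimal job on the GPU.
    cjq: CPU-queue jobs, each with SQ (sequential time), MP, GP speedups.
    Returns the stolen job, or None to wait."""
    if not cjq:
        return None
    # Job whose GPU execution would be cheapest (minimum penalty, T2k).
    k = min(range(len(cjq)), key=lambda i: cjq[i]["SQ"] / cjq[i]["GP"])
    t2k = cjq[k]["SQ"] / cjq[k]["GP"]
    if t_next_cpu_free > t2k:
        return cjq.pop(k)   # aggressive: running on the GPU beats waiting
    return None             # conservative: wait for a CPU or for GPU jobs

cjq = [{"name": "short", "SQ": 4.0, "MP": 7.0, "GP": 2.0},
       {"name": "long", "SQ": 40.0, "MP": 6.0, "GP": 5.0}]
job = asjf_gpu_decision(cjq, t_next_cpu_free=3.0)
print(job["name"] if job else "wait")   # "short": 2.0s on GPU < 3.0s wait
```

Because the comparison is made per decision, the scheme behaves aggressively when CPUs are congested (T1 large) and conservatively when a CPU is about to free up — the "automatic switch" noted on the slide.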
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
14
Scheduling Scheme for Second Formulation
15
Solution Approach:
• Flexibly schedule on CPU-only, GPU-only, or CPU+GPU
• Mold the # of nodes requested by a job
  – Consider allocating ½ or ¼ of the requested nodes

Inputs from User:
• Execution times of the CPU-only, GPU-only, and CPU+GPU versions
• Execution times of jobs with n, n/2, and n/4 nodes
• Such application information can also be obtained from profiles
Flexible Moldable Scheduling Scheme (FMS)
16
Input: N jobs with execution times for each configuration

Flowchart (summarized):
• Group jobs with their requested # of nodes as the index — minimizes resource fragmentation and helps co-locate a CPU job and a GPU job on the same node
• Sort each group by the execution time of its CPU+GPU version — gives a global view for co-locating jobs on the same node
• Pick a pair of jobs to schedule in the sorted order
• For each job i requesting n nodes, find the fastest completion option among T(i,n,C), T(i,n,G), and T(i,n,CG)
• Either choose C for one job and G for the other, co-locating them on the same set of nodes, or choose the same resource for both jobs: (C,C), (G,G), or (CG,CG)
• If 2N nodes are available, schedule the pair of jobs in parallel on 2N nodes; otherwise schedule the first job on N nodes and consider molding the # of nodes for the next job
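The pairing and molding steps above can be sketched as follows. This is a simplified sketch under stated assumptions: execution times live in a dictionary keyed by (node count, resource), only the halving mold is shown, and the co-location branch is the C/G pairing case — the full FMS scheme handles more combinations.

```python
def fms_pick_option(job, n):
    """Fastest completion option among CPU-only (C), GPU-only (G), and
    CPU+GPU (CG) for `job` on n nodes; job["T"] maps (n, res) -> time."""
    return min(("C", "G", "CG"), key=lambda res: job["T"][(n, res)])

def fms_schedule_pair(job_a, job_b, n, free_nodes):
    """Schedule a pair of jobs that both request n nodes.
    A C job and a G job can share one set of n nodes (co-location);
    with 2n free nodes the pair runs in parallel; otherwise the second
    job is considered for molding down to n/2 nodes."""
    ra, rb = fms_pick_option(job_a, n), fms_pick_option(job_b, n)
    if {ra, rb} == {"C", "G"}:
        return [("co-locate", job_a["name"], job_b["name"], n)]
    if free_nodes >= 2 * n:
        return [("parallel", job_a["name"], n), ("parallel", job_b["name"], n)]
    return [("run", job_a["name"], n), ("mold", job_b["name"], n // 2)]

# Hypothetical jobs: times are illustrative, not measured values.
job_a = {"name": "EM",
         "T": {(4, "C"): 100.0, (4, "G"): 30.0, (4, "CG"): 35.0}}
job_b = {"name": "PageRank",
         "T": {(4, "C"): 50.0, (4, "G"): 120.0, (4, "CG"): 55.0}}
print(fms_pick_option(job_a, 4), fms_pick_option(job_b, 4))   # G C
print(fms_schedule_pair(job_a, job_b, 4, free_nodes=8))
```

Here a GPU-best job and a CPU-best job fold onto the same four nodes, leaving the other four free — the fragmentation-reducing co-location the slide highlights.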
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
17
Cluster Hardware Setup
18
• Cluster of 16 CPU-GPU nodes
• Each CPU is an 8-core Intel Xeon E5520 (2.27 GHz)
• Each GPU is an Nvidia Tesla C2050 (1.15 GHz)
• CPU main memory: 48 GB
• GPU device memory: 3 GB
• Machines are connected through InfiniBand
Benchmarks
19
Single-Node Jobs
• We use 10 benchmarks
• Scientific, financial, data mining, and image processing applications
• Each benchmark runs with 3 different execution configurations
• Overall, a pool of 30 jobs

Multi-Node Jobs
• We use 3 applications
• Gridding kernel, Expectation-Maximization, PageRank
• Applications run with 2 different datasets and on 3 different node counts
• Overall, a pool of 18 jobs
Baselines & Metrics
20
Baseline for Single-Node Jobs
• Blind Round Robin (BRR)
• Manual Optimal (exhaustive search; upper bound)

Baseline for Multi-Node Jobs
• TORQUE, a widely used resource manager for heterogeneous clusters
• Minimum Completion Time (MCT) [Maheswaran et al., HCW’99]

Metrics
• Completion Time (Comp. Time)
• Application Latency:
  – Non-optimal Assignment (Ave. NOA Lat.)
  – Queuing Delay (Ave. QD Lat.)
• Maximum Idle Time (Max. Idle Time)
Single-Node Job Results
21
Uniform CPU-GPU Job Mix / CPU-biased Job Mix
[Charts: Comp. Time, Ave. NOA Lat., Ave. QD Lat., and Max. Idle Time, normalized over the best case, for BRR, RSA, RSC, ASJF, and Manual Optimal]

• 24 jobs on 2 nodes; proposed schemes evaluated on 4 different metrics
• For each metric: 108% better than BRR, and within 12% of Manual Optimal
• Tradeoff between non-optimal penalty and wait time for a resource
• Latency: BRR has the highest latency; RSA incurs non-optimal-assignment penalty; RSC incurs high queuing delay; ASJF is as good as Manual Optimal
• Idle time: BRR has very high idle times; RSC's can be very high too; RSA has the best utilization among the proposed schemes
Multi-Node Job Results
22
Varying Job Execution Lengths / Varying Resource Request Sizes
[Charts: Normalized Completion Time for job mixes of 75/25, 50/50, and 25/75 Short Jobs (SJ) vs. Long Jobs (LJ), and Small Requests (SR) vs. Large Requests (LR), comparing TORQUE, MCT, Molding ResType Only, Molding NumNodes Only, and Molding ResType+NumNodes (FMS)]

• 32 jobs on 16 nodes
• Varying execution lengths: FMS is 42% better than the best of TORQUE and MCT; each type of molding gives a reasonable improvement; our schemes utilize the resources better, yielding high throughput
• Varying request sizes: FMS is 32% better than the best of TORQUE and MCT; the scheme is intelligent in deciding whether to wait for a resource or mold the job to a smaller one; the benefit from ResType molding is larger than from NumNodes molding
Outline
• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions
23
Conclusions
24
• Revisited scheduling problems on CPU-GPU clusters
  – Goal: improve aggregate throughput
  – Single-node, single-resource scheduling problem
  – Multi-node, multi-resource scheduling problem
• Developed novel scheduling schemes
  – Exploit the portability offered by OpenCL
  – Automatically map jobs to heterogeneous resources
  – RSA, RSC, and ASJF for single-node jobs
  – Flexible Moldable Scheduling (FMS) for multi-node jobs
• Significant improvement over the state-of-the-art
25
Thank You!
Questions?

raviv@cse.ohio-state.edu
becchim@missouri.edu
jiangwei@cse.ohio-state.edu
agrawal@cse.ohio-state.edu
chak@nec-labs.com
Benchmarks – Large Dataset
26
Benchmark            Seq. CPU Exec. (sec)   GPU Speedup (GP)   Multi-core Speedup (MP)   Dataset Characteristics
PDE Solver           7.3                    4.7                6.8                       14336 x 14336
Image Processing     33.8                   5.1                7.8                       14336 x 14336
FDTD                 8.4                    2.2                7.6                       14336 x 14336
BlackScholes         2.6                    2.1                7.2                       10 mil options
Binomial Options     11.8                   5.6                4.2                       1024 options
MonteCarlo           45.4                   38.4               7.9                       1024 options
Kmeans               330.0                  12.1               7.8                       1.6 x 10^9 points
KNN                  67.3                   7.8                6.2                       67108864 points
PCA                  142.0                  9.7                5.6                       262144 x 80
Molecular Dynamics   46.6                   12.9               7.9                       256000 nodes, 31744000 edges
Benchmarks – Small Dataset
27
Benchmark            Seq. CPU Exec. (sec)   GPU Speedup (GP)   Multi-core Speedup (MP)   Dataset Characteristics
PDE Solver           1.8                    3.8                7.1                       7168 x 7168
Image Processing     8.4                    5.6                7.5                       7168 x 7168
FDTD                 2.1                    1.3                7.7                       7168 x 7168
BlackScholes         0.7                    0.6                6.8                       2.5 mil options
Binomial Options     3.0                    2.3                4.2                       128 options
MonteCarlo           11.0                   9.4                7.9                       256 options
Kmeans               74.2                   6.3                7.7                       0.4 x 10^9 points
KNN                  16.8                   2.9                6.2                       16777216 points
PCA                  33.8                   9.1                5.6                       65536 x 80
Molecular Dynamics   6.7                    12.8               7.3                       32000 nodes, 3968000 edges
Benchmarks – Large No. of Iterations
28
Benchmark            Seq. CPU Exec. (sec)   GPU Speedup (GP)   Multi-core Speedup (MP)   Dataset Characteristics
PDE Solver           722.1                  4.3                8.1                       14336 x 14336
Image Processing     3385.5                 4.8                8.0                       14336 x 14336
FDTD                 423.3                  1.8                7.9                       14336 x 14336
BlackScholes         269.1                  92.8               7.8                       10 mil options
Binomial Options     1213.6                 12.2               4.3                       1024 options
MonteCarlo           453.3                  368.5              7.8                       1024 options
Kmeans               1593.8                 12.6               7.9                       1.6 x 10^9 points
KNN                  1691.1                 58.4               6.9                       67108864 points
PCA                  2835.7                 11.8               6.2                       262144 x 80
Molecular Dynamics   593.8                  20.8               7.8                       256000 nodes, 31744000 edges