Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
The rCUDA technology:
an inexpensive way to improve
the performance of
GPU-based clusters
Federico Silla
Technical University of Valencia Spain
Delft, April 2015 2/47
The scope of this talk
Delft, April 2015 3/47
More flexible use of GPUs
• rCUDA: a software technology that enables a more
flexible use of GPUs in computing facilities
No GPU
Delft, April 2015 4/47
CUDASW++
Bioinformatics software for Smith-Waterman protein database searches
NVIDIA Tesla K20 GPU
Mellanox ConnectX-3 single-port adapters
144 189 222 375 464 567 657 729 850 1000 1500 2005 2504 3005 3564 4061 4548 4743 5147 5478
0
5
10
15
20
25
30
35
40
45
0
2
4
6
8
10
12
14
16
18
FDR Overhead QDR Overhead GbE Overhead CUDA
rCUDA FDR rCUDA QDR rCUDA GbE
Sequence Length
rCU
DA
Ove
rhe
ad
(%
)
Exe
cu
tio
n T
ime
(s)
Overhead introduced by rCUDA
Small overhead when
using InfiniBand
Lower
is better
Delft, April 2015 5/47
• As many GPUs as there are in the cluster may be provided
to a single application
No GPU
1: more GPUs for a single application
Delft, April 2015 6/47
1: more GPUs for a single application
Delft, April 2015 7/47
1: more GPUs for a single application
Delft, April 2015 8/47
MonteCarlo Multi-GPU (from NVIDIA SDK)
Lower
is better
Higher
is better
1: more GPUs for a single application
Delft, April 2015 9/47
• GPUs can be shared among jobs running in remote clients
App 2
App 1
App 3
App 4
2: increased cluster performance
App 6
App 5
App 7
App 8
App 9
Delft, April 2015 10/47
Test bench for studying rCUDA performance at cluster level:
SLURM used as job scheduler
InfiniBand ConnectX-3 based cluster
Dual socket E5-2620v2 Intel Xeon based nodes:
1 node without GPU
8 nodes. Each with one NVIDIA K20 GPU
Four applications used
LAMMPS
GPU-Blast
MCUDA-MEME
Gromacs (no GPU)
Three workload sizes:
Small
Medium
Large
1 node hosting
the main SLURM
controller
8 nodes with
one K20
GPU each
2: increased cluster performance
Delft, April 2015 11/47
2: increased cluster performance
Delft, April 2015 12/47
Let’s reduce the amount of GPUs in the cluster
42% Less
41% Less
43% Less
3: less cost with more performance
Delft, April 2015 13/47
4: reduced energy consumption
Delft, April 2015 14/47
Why rCUDA: the problem with GPU-enabled clusters
The enabler for higher cluster throughput at lower cost
Engineering the enabler
Final considerations
Increasing throughput in current clusters
Delft, April 2015 15/47
Why rCUDA: the problem with GPU-enabled clusters
The enabler for higher cluster throughput at lower cost
Engineering the enabler
Final considerations
Increasing throughput in current clusters
Delft, April 2015 16/47
A GPU computing facility is usually a set of independent self-
contained nodes that leverage the shared-nothing approach:
Nothing is directly shared among nodes (MPI required for aggregating
computing resources within the cluster)
GPUs can only be used within the node they are attached to
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
Interconnection Network
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
Characteristics of GPU-based clusters
Delft, April 2015 17/47
• Applications can only use the GPUs located within their node:
• Non-accelerated applications keep GPUs idle in the nodes where they
use all the cores
First concern with accelerated clusters
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
Interconnection Network
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
A CPU-only application spreading over
these four nodes would make their GPUs
unavailable for accelerated applications
Delft, April 2015 18/47
For some workloads, GPUs may be idle for significant periods of time:
• Initial acquisition costs not amortized
• Space: GPUs reduce CPU density
• Energy: idle GPUs keep consuming power
Money leakage in current clusters?
Time (s)
Idle
Pow
er
(Watts)
1 GPU node
4 GPUs node
• 1 GPU node: Two E5-2620V2 sockets and 32GB DDR3 RAM. One Tesla K20 GPU
• 4 GPUs node: Two E5-2620V2 sockets and 128GB DDR3 RAM. Four Tesla K20 GPUs
25%
Delft, April 2015 19/47
• Applications can only use the GPUs located within their node:
• Multi-GPU applications running on a subset of nodes cannot
make use of the tremendous GPU resources available at other
cluster nodes (even if they are idle)
Second concern with accelerated clusters
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
Interconnection Network
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU
GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
All these GPUs cannot be
used by the multi-GPU
application in execution multi-GPU application
Delft, April 2015 20/47
• Do applications completely squeeze the GPUs present in the cluster?
• Even if all GPUs are assigned to running applications,
computational resources inside GPUs may not be fully used
• Application presenting low level of parallelism
• CPU code being executed (GPU assigned ≠ GPU working)
• GPU-core stall due to lack of data
• etc …
One more concern with accelerated clusters
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
Interconnection Network
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU
GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
Delft, April 2015 21/47
Sharing a given GPU among jobs
• Several GPU-Blast instances concurrently executed on the same GPU. Each instance uses about 1.5 of GPU memory
Delft, April 2015 22/47
In summary …
• There are scenarios where GPUs are available but
cannot be used
• Accelerated applications do not make use of GPUs
100% of the time
In conclusion …
• We are losing GPU cycles, thus reducing cluster
performance
Why GPU-cluster performance is lost?
Delft, April 2015 23/47
What is missing is ...
... some flexibility for using
the GPUs in the cluster
We need something more in the cluster
The current model for using GPUs is too rigid
Delft, April 2015 24/47
Why rCUDA: the problem with GPU-enabled clusters
The enabler for higher cluster throughput at lower cost
Engineering the enabler
Final considerations
Increasing throughput in current clusters
Delft, April 2015 25/47
• Two ingredients are required to cook a
higher-throughput GPU-based cluster
• A way of seamlessly sharing GPUs across
nodes in the cluster (remote GPU
virtualization)
• Enhanced job schedulers that take into
account the new shared GPUs
What is needed for increased flexibility?
Delft, April 2015 26/47
Remote GPU virtualization allows a new vision of a GPU
deployment, moving from the usual cluster configuration:
Remote GPU virtualization envision
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
Interconnection Network
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU
GPU mem
Main
Mem
ory
Network
GPU GPU mem
to the following one ….
Delft, April 2015 27/47
Physical
configuration
CP
U
Main
Mem
ory
Network
PC
I-e
CP
U
Main
Mem
ory
Network
PC
I-e
CP
U
Main
Mem
ory
Network
PC
I-e
CP
U
Main
Mem
ory
Network
PC
I-e
CP
U
Main
Mem
ory
Network
PC
I-e
PC
I-e
CP
U
Main
Mem
ory
Network
Interconnection Network
Logical connections
Logical
configuration
Remote GPU virtualization envision
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
PC
I-e CP
U
Main
Mem
ory
Network
Interconnection Network
PC
I-e CP
U
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e CP
U
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
CP
U
Main
Mem
ory
Network
GPU GPU mem
GPU GPU mem
PC
I-e CP
U
Main
Mem
ory
Network
GPU GPU mem
GPU GPU mem
PC
I-e CP
U
Main
Mem
ory
Network
GPU GPU mem
GPU GPU mem
PC
I-e
Delft, April 2015 28/47
GPU GPU mem
GPU GPU mem
PC
I-e CP
U
Main
Mem
ory
Network
Interconnection Network
PC
I-e CP
U
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e CP
U
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
CP
U
Main
Mem
ory
Network
GPU GPU mem
GPU GPU mem
PC
I-e CP
U
Main
Mem
ory
Network
GPU GPU mem
GPU GPU mem
PC
I-e CP
U
Main
Mem
ory
Network
GPU GPU mem
GPU GPU mem
PC
I-e
CP
U
Main
Mem
ory
Network
PC
I-e
CP
U
Main
Mem
ory
Network
PC
I-e
CP
U
Main
Mem
ory
Network
PC
I-e
CP
U
Main
Mem
ory
Network
PC
I-e
CP
U
Main
Mem
ory
Network
PC
I-e
PC
I-e
CP
U
Main
Mem
ory
Network
Interconnection Network
Logical connections
Busy cores are no longer a problem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
Physical
configuration
Logical
configuration
Delft, April 2015 29/47
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
Without GPU virtualization
GPU virtualization is also useful for multi-GPU applications
CP
U
Main
Mem
ory
Network
PC
I-e
CP
U
Main
Mem
ory
Network
PC
I-e
CP
U
Main
Mem
ory
Network
PC
I-e
CP
U
Main
Mem
ory
Network
PC
I-e
CP
U
Main
Mem
ory
Network
PC
I-e
PC
I-e
CP
U
Main
Mem
ory
Network
Interconnection Network
Logical connections
PC
I-e
CP
U
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
Interconnection Network
PC
I-e
CP
U
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e
CP
U
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e
CP
U
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e
CP
U
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
PC
I-e
CP
U
GPU GPU mem
Main
Mem
ory
Network
GPU GPU mem
With GPU virtualization
Many GPUs in the
cluster can be provided
to the application
Only the GPUs in the
node can be provided
to the application
Multi-GPU applications get benefit
Delft, April 2015 30/47
Current job schedulers, like SLURM, know
about real GPUs, but cannot manage virtual
GPUs
Enhancing schedulers is required to effectively
take advantage of GPU virtualization
About the second ingredient
Delft, April 2015 31/47
One step further:
enhancing the scheduler so that GPU
servers are put into low-power sleeping
modes as soon as their acceleration
features are not required
More about enhanced GPU scheduling
Delft, April 2015 32/47
Enhancing even more GPU scheduling
Going even beyond:
support GPU task migration
consolidate GPU tasks into as few GPU
servers as possible
Delft, April 2015 33/47
Why rCUDA: the problem with GPU-enabled clusters
The enabler for higher cluster throughput at lower cost
Engineering the enabler (I)
Final considerations
Increasing throughput in current clusters
Delft, April 2015 34/47
Basics of the rCUDA framework
Basic CUDA behavior
Delft, April 2015 35/47
Basics of the rCUDA framework
rCUDA is binary compatible with CUDA 6.5
Delft, April 2015 36/47
Bandwidth is a concern for rCUDA
Performance
of pageable
memory
Performance
of pinned
memory
Delft, April 2015 37/47
CUDA-MEME application:
GPU NVIDIA Tesla K40
Mellanox ConnectX-3 single-port (FDR) and Connect-IB Adapters
Performance of applications with rCUDA
0.19%
Lower
is better
Delft, April 2015 38/47
Why rCUDA: the problem with GPU-enabled clusters
The enabler for higher cluster throughput at lower cost
Engineering the enabler (II)
Final considerations
Increasing throughput in current clusters
Delft, April 2015 39/47
• SLURM (Simple Linux Utility for Resource Management) job scheduler
• SLURM does not understand about virtualized GPUs
• Add a new GRES (general resource) in order to manage virtualized GPUs
• Where the GPUs are in the system is completely transparent to the user
• In the job script, or in the submission command, the user specifies the number of rGPUs (remote GPUs) required by the job. The amount of GPU memory required by the job may also be specified
Integrating rCUDA with SLURM
Delft, April 2015 40/47
The basic idea about SLURM
Delft, April 2015 41/47
GPUs are
decoupled
from nodes
The basic idea about SLURM + rCUDA
All jobs are
executed in
less time
Delft, April 2015 42/47
GPUs are
decoupled
from nodes
All jobs are
executed
even in
less time
Sharing remote GPUs among jobs
GPU 0 is
scheduled to
be shared
among jobs
Delft, April 2015 43/47
Cluster performance with rCUDA+SLURM
Delft, April 2015 44/47
Cluster performance with rCUDA+SLURM
Let’s reduce the amount of GPUs in the cluster
42% Less
41% Less
43% Less
Delft, April 2015 45/47
Why rCUDA: the problem with GPU-enabled clusters
The enabler for higher cluster throughput at lower cost
Engineering the enabler
Final considerations
Increasing throughput in current clusters
Delft, April 2015 46/47
• High Throughput Computing
• Sharing remote GPUs makes applications
to execute slower … BUT more
throughput (jobs/time) is achieved
• Green Computing
• GPU migration and application migration allow to devote just the
required computing resources to the current workload
• More flexible system upgrades
• GPU and CPU updates become independent
from each other. Attaching GPU boxes to non
GPU-enabled clusters is possible
• Datacenter administrators can choose between HPC and HTC
rCUDA is the enabling technology for …
Get a free copy of rCUDA at
http://www.rcuda.net
@rcuda_
Carlos Reaño Javier Prades Fernando Campos Rocío Alegre
Federico Silla José Duato Antonio Peña
The rCUDA team
(1) Former student, now at Argonne National Lab. (USA)
More than 500 requests
(1)