The rCUDA technology: an inexpensive way to improve the ... · an inexpensive way to improve the performance of GPU-based clusters Federico Silla Technical University of Valencia

The rCUDA technology: an inexpensive way to improve the performance of GPU-based clusters

Federico Silla

Technical University of Valencia, Spain

Delft, April 2015

The scope of this talk


More flexible use of GPUs

• rCUDA: a software technology that enables a more flexible use of GPUs in computing facilities

[Figure: diagram including a node labeled "No GPU"]


CUDASW++: bioinformatics software for Smith-Waterman protein database searches

• NVIDIA Tesla K20 GPU
• Mellanox ConnectX-3 single-port adapters

[Figure: rCUDA overhead (%) and execution time (s) versus sequence length (144 to 5478). Series: CUDA, rCUDA FDR, rCUDA QDR, rCUDA GbE, FDR overhead, QDR overhead, GbE overhead. Lower is better.]

Overhead introduced by rCUDA

Small overhead when using InfiniBand
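The overhead percentage in a comparison like this is typically the extra execution time of the remote run relative to the local CUDA run. The sketch below assumes that definition, which the slide does not spell out. The function name and the timings are hypothetical and are not taken from the talk.

```python
def rcuda_overhead_percent(cuda_seconds: float, rcuda_seconds: float) -> float:
    """Extra time of the remote (rCUDA) run relative to the local CUDA run, in percent.

    Assumed definition: (t_rCUDA - t_CUDA) / t_CUDA * 100.
    """
    if cuda_seconds <= 0:
        raise ValueError("cuda_seconds must be positive")
    return (rcuda_seconds - cuda_seconds) / cuda_seconds * 100.0


# Hypothetical timings, for illustration only.
print(rcuda_overhead_percent(cuda_seconds=4.0, rcuda_seconds=5.0))
```

A negative result would mean the remote run was faster than the local one.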


• As many GPUs as there are in the cluster may be provided to a single application

[Figure: diagram including a node labeled "No GPU"]

1: more GPUs for a single application






[Figure: MonteCarlo Multi-GPU (from NVIDIA SDK). Two plots: one where lower is better and one where higher is better.]



• GPUs can be shared among jobs running in remote clients

[Figure: applications App 1 to App 9 running in remote clients and sharing GPUs]

2: increased cluster performance


Test bench for studying rCUDA performance at cluster level:

• SLURM used as job scheduler
• InfiniBand ConnectX-3 based cluster
• Dual-socket Intel Xeon E5-2620v2 based nodes:
  • 1 node without GPU
  • 8 nodes, each with one NVIDIA K20 GPU
• Four applications used:
  • LAMMPS
  • GPU-Blast
  • MCUDA-MEME
  • Gromacs (no GPU)
• Three workload sizes: small, medium, and large

[Figure labels: 1 node hosting the main SLURM controller; 8 nodes with one K20 GPU each]





Let’s reduce the number of GPUs in the cluster

[Figure annotations: 42% less, 41% less, 43% less]

3: less cost with more performance


4: reduced energy consumption


Why rCUDA: the problem with GPU-enabled clusters

The enabler for higher cluster throughput at lower cost

Engineering the enabler

Final considerations

Increasing throughput in current clusters








A GPU computing facility is usually a set of independent, self-contained nodes that follow the shared-nothing approach:

• Nothing is directly shared among nodes (MPI is required for aggregating computing resources within the cluster)
• GPUs can only be used within the node they are attached to

[Figure: six cluster nodes linked by an interconnection network. Each node has a CPU, main memory, a network interface, and PCIe-attached GPUs with their own GPU memory.]

Characteristics of GPU-based clusters


• Applications can only use the GPUs located within their node:

• Non-accelerated applications keep GPUs idle in the nodes where they use all the cores

First concern with accelerated clusters

[Figure: six cluster nodes, each with a CPU, main memory, a network interface, and PCIe-attached GPUs with their own GPU memory]

A CPU-only application spreading over these four nodes would make their GPUs unavailable for accelerated applications


For some workloads, GPUs may be idle for significant periods of time:

• Initial acquisition costs not amortized

• Space: GPUs reduce CPU density

• Energy: idle GPUs keep consuming power

Money leakage in current clusters?

[Figure: idle power (Watts) over time (s) for a 1-GPU node and a 4-GPU node. Annotation: 25%.]

• 1-GPU node: two E5-2620V2 sockets and 32 GB DDR3 RAM. One Tesla K20 GPU
• 4-GPU node: two E5-2620V2 sockets and 128 GB DDR3 RAM. Four Tesla K20 GPUs


• Applications can only use the GPUs located within their node:

• Multi-GPU applications running on a subset of nodes cannot make use of the tremendous GPU resources available at other cluster nodes (even if they are idle)

Second concern with accelerated clusters

[Figure: six cluster nodes, each with a CPU, main memory, a network interface, and PCIe-attached GPUs with their own GPU memory. One node is labeled "multi-GPU application".]

None of these GPUs can be used by the multi-GPU application in execution


• Do applications fully exploit the GPUs present in the cluster?
• Even if all GPUs are assigned to running applications, computational resources inside GPUs may not be fully used:
  • Applications presenting a low level of parallelism
  • CPU code being executed (GPU assigned ≠ GPU working)
  • GPU cores stalled due to lack of data
  • etc.

One more concern with accelerated clusters

[Figure: six cluster nodes, each with a CPU, main memory, a network interface, and PCIe-attached GPUs with their own GPU memory]


Sharing a given GPU among jobs

• Several GPU-Blast instances concurrently executed on the same GPU. Each instance uses about 1.5 of GPU memory


In summary …

• There are scenarios where GPUs are available but cannot be used
• Accelerated applications do not make use of GPUs 100% of the time

In conclusion …

• We are losing GPU cycles, thus reducing cluster performance

Why is GPU-cluster performance lost?


What is missing is ... some flexibility for using the GPUs in the cluster

We need something more in the cluster

The current model for using GPUs is too rigid








• Two ingredients are required to cook a higher-throughput GPU-based cluster:
  • A way of seamlessly sharing GPUs across nodes in the cluster (remote GPU virtualization)
  • Enhanced job schedulers that take into account the new shared GPUs

What is needed for increased flexibility?


Remote GPU virtualization allows a new vision of a GPU deployment, moving from the usual cluster configuration:

Remote GPU virtualization vision

[Figure: six cluster nodes, each with a CPU, main memory, a network interface, and PCIe-attached GPUs with their own GPU memory]

to the following one …


[Figure: physical configuration and logical configuration of the cluster. In the logical configuration, logical connections link the nodes to the GPUs.]

Remote GPU virtualization vision

[Figure: physical configuration and logical configuration of the cluster, with logical connections between the nodes and the GPUs]

Busy cores are no longer a problem

GPU virtualization is also useful for multi-GPU applications

• Without GPU virtualization: only the GPUs in the node can be provided to the application
• With GPU virtualization: many GPUs in the cluster can be provided to the application

[Figure: the same cluster shown without GPU virtualization and with GPU virtualization, where logical connections reach GPUs in other nodes]

Multi-GPU applications benefit
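The difference between the two cases can be made concrete with a small sketch. It only counts how many GPUs a job could be offered, and it is not rCUDA code. The `Node` type, the node names, and the GPU counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_gpus: int  # GPUs physically attached to this node and currently idle


def gpus_without_virtualization(nodes, job_node):
    """Only the GPUs attached to the node running the job are usable."""
    return next(n.free_gpus for n in nodes if n.name == job_node)


def gpus_with_virtualization(nodes):
    """Every idle GPU in the cluster is reachable over the network."""
    return sum(n.free_gpus for n in nodes)


# Hypothetical cluster: the job runs on n0, which has no idle GPU.
cluster = [Node("n0", 0), Node("n1", 2), Node("n2", 2), Node("n3", 1)]
print(gpus_without_virtualization(cluster, "n0"))
print(gpus_with_virtualization(cluster))
```

In this toy model the job on `n0` sees no GPU without virtualization, while all idle GPUs in the cluster become candidates once they are shared remotely.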


Current job schedulers, like SLURM, know about real GPUs, but cannot manage virtual GPUs

Enhancing schedulers is required to effectively take advantage of GPU virtualization

About the second ingredient


One step further: enhancing the scheduler so that GPU servers are put into low-power sleeping modes as soon as their acceleration features are not required

More about enhanced GPU scheduling


Enhancing GPU scheduling even more

Going even further:

• support GPU task migration
• consolidate GPU tasks into as few GPU servers as possible
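Consolidation can be viewed as a bin-packing problem. The sketch below uses the first-fit-decreasing heuristic, which is a stand-in chosen for illustration and not a method described in the talk. It assumes every task fits on a single server; the task names and GPU counts are hypothetical.

```python
def consolidate(task_gpus: dict[str, int], gpus_per_server: int) -> list[dict[str, int]]:
    """Pack tasks onto as few GPU servers as the heuristic manages.

    Returns one dict per server, mapping task name to the GPUs it uses there.
    """
    servers: list[dict[str, int]] = []
    # Place the largest tasks first (first-fit decreasing).
    for task, need in sorted(task_gpus.items(), key=lambda kv: kv[1], reverse=True):
        if need > gpus_per_server:
            raise ValueError(f"task {task!r} does not fit on a single server")
        for server in servers:
            if sum(server.values()) + need <= gpus_per_server:
                server[task] = need
                break
        else:
            # No already-used server has room: open a new one.
            servers.append({task: need})
    return servers


# Hypothetical workload: five tasks, servers with four GPUs each.
placement = consolidate({"a": 2, "b": 1, "c": 1, "d": 3, "e": 1}, gpus_per_server=4)
print(len(placement), placement)
```

Servers that receive no task in such a placement are the candidates for the low-power sleeping modes mentioned earlier.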




Engineering the enabler (I)




Basics of the rCUDA framework

Basic CUDA behavior


Basics of the rCUDA framework

rCUDA is binary compatible with CUDA 6.5


Bandwidth is a concern for rCUDA

[Figure: performance of pageable memory and performance of pinned memory]


CUDA-MEME application:

• GPU: NVIDIA Tesla K40
• Mellanox ConnectX-3 single-port (FDR) and Connect-IB adapters

Performance of applications with rCUDA

[Figure: CUDA-MEME results, lower is better. Annotation: 0.19%.]




Engineering the enabler (II)




• SLURM (Simple Linux Utility for Resource Management) job scheduler
• SLURM is not aware of virtualized GPUs
• Add a new GRES (generic resource) in order to manage virtualized GPUs
• Where the GPUs are in the system is completely transparent to the user
• In the job script, or in the submission command, the user specifies the number of rGPUs (remote GPUs) required by the job. The amount of GPU memory required by the job may also be specified

Integrating rCUDA with SLURM
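A submission along these lines might look like the sketch below, which only builds the command line and does not run it. `--gres` and `--export` are standard `sbatch` options. The GRES name `rgpu` and the `RGPU_MEM_MB` variable are assumptions made for illustration; the slides do not give the actual names used by the rCUDA integration.

```python
import shlex


def build_sbatch_command(script, rgpus, gpu_mem_mb=None):
    """Build an sbatch command requesting remote GPUs through a GRES.

    "rgpu" is a hypothetical GRES name, and RGPU_MEM_MB is a hypothetical
    way of passing the GPU memory requirement to the job.
    """
    if rgpus < 1:
        raise ValueError("at least one remote GPU must be requested")
    cmd = ["sbatch", f"--gres=rgpu:{rgpus}"]
    if gpu_mem_mb is not None:
        cmd.append(f"--export=ALL,RGPU_MEM_MB={gpu_mem_mb}")
    cmd.append(script)
    return cmd


# Hypothetical job: two remote GPUs, 1500 MB of GPU memory.
print(shlex.join(build_sbatch_command("job.sh", rgpus=2, gpu_mem_mb=1500)))
```

The user never names a node here, which matches the point that the location of the GPUs is transparent.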


The basic idea about SLURM


The basic idea about SLURM + rCUDA

• GPUs are decoupled from nodes
• All jobs are executed in less time


Sharing remote GPUs among jobs

• GPUs are decoupled from nodes
• GPU 0 is scheduled to be shared among jobs
• All jobs are executed in even less time
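One way to picture GPU sharing in the scheduler is to treat GPU memory as the shared resource and place each job on the first GPU with enough free memory. This first-fit policy is an assumption made for illustration; the slides do not describe the actual policy. The `RemoteGpu` type, server names, and memory sizes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteGpu:
    server: str
    index: int
    free_mem_mb: int
    jobs: list = field(default_factory=list)


def assign_shared(gpus, job, mem_mb):
    """Place the job on the first GPU with enough free memory.

    Returns the chosen GPU, or None if no GPU can host the job.
    """
    for gpu in gpus:
        if gpu.free_mem_mb >= mem_mb:
            gpu.free_mem_mb -= mem_mb
            gpu.jobs.append(job)
            return gpu
    return None


# Hypothetical pool: two remote GPUs with 5000 MB each.
pool = [RemoteGpu("s0", 0, 5000), RemoteGpu("s1", 0, 5000)]
for name in ("job1", "job2", "job3"):
    chosen = assign_shared(pool, name, mem_mb=2000)
    print(name, "->", chosen.server, chosen.index)
```

Here the first two jobs share GPU 0 of `s0`, and the third moves to `s1` once the memory of the first GPU runs short.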


Cluster performance with rCUDA+SLURM


Cluster performance with rCUDA+SLURM

Let’s reduce the number of GPUs in the cluster

[Figure annotations: 42% less, 41% less, 43% less]








• High Throughput Computing
  • Sharing remote GPUs makes applications execute more slowly … BUT more throughput (jobs/time) is achieved
• Green Computing
  • GPU migration and application migration make it possible to devote just the required computing resources to the current workload
• More flexible system upgrades
  • GPU and CPU updates become independent from each other. Attaching GPU boxes to non-GPU-enabled clusters is possible
• Datacenter administrators can choose between HPC and HTC

rCUDA is the enabling technology for …
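The trade-off between per-job execution time and throughput can be checked with simple arithmetic. The numbers below are hypothetical and are not measurements from the talk: a job is assumed to take 100 s with a GPU to itself, and two jobs sharing one GPU are assumed to both finish after 150 s.

```python
def throughput(jobs: int, makespan_seconds: float) -> float:
    """Jobs completed per second over the whole run."""
    return jobs / makespan_seconds


# Exclusive use: the two jobs run back to back, 100 s each.
exclusive = throughput(2, 2 * 100.0)
# Shared use: each job is slower (150 s instead of 100 s), but both overlap.
shared = throughput(2, 150.0)

print(f"exclusive: {exclusive:.4f} jobs/s, shared: {shared:.4f} jobs/s")
```

Under these assumptions each job takes longer when sharing, yet the pair completes sooner, so throughput rises.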

Get a free copy of rCUDA at http://www.rcuda.net

@rcuda_

More than 500 requests

The rCUDA team: Carlos Reaño, Javier Prades, Fernando Campos, Rocío Alegre, Federico Silla, José Duato, Antonio Peña

(1) Former student, now at Argonne National Lab. (USA)
