GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing


Page 1: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Tetsuya Odajima, Taisuke Boku, Toshihiro Hanawa, Jinpil Lee, Mitsuhisa Sato

†1 Graduate School of Systems and Information Engineering, University of Tsukuba, Japan
†2 Center for Computational Sciences, University of Tsukuba, Japan

9/10/2012/ P2S2 2012 1

Page 2: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Outline

• Background & Purpose
• Base language: XcalableMP & XcalableMP-dev
• StarPU
• Implementation of XMP-dev/StarPU
• Preliminary performance evaluation
• Conclusion & Future work

9/10/2012/ P2S2 2012 2

Page 3: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Background

• GPGPU is widely used for HPC
  – Impact of NVIDIA CUDA, OpenCL, etc.
  – Programming has become easy on a single node
  → Many GPU clusters appear on the TOP500 list

• Problems of programming on a GPU cluster
  – Inter-node programming model (such as MPI) is complex and must be included in the program source
  – Data management between CPU and GPU
  → Programmability and productivity are very low

• GPU is very powerful, but...
  – CPU performance has also kept improving.
  – We cannot neglect it.

9/10/2012/ P2S2 2012 3

Page 4: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Purpose

• High-productivity programming on GPU clusters
• Utilization of both GPU and CPU resources
  – Work sharing of the loop execution

• XcalableMP acceleration device extension (XMP-dev) [Nakao, et al., PGAS10] [Lee, et al., HeteroPar'11]
  – Parallel programming model for GPU clusters
  – Directive-based, low programming cost

• StarPU [INRIA Bordeaux]
  – Runtime system for GPU/CPU work sharing

9/10/2012/ P2S2 2012 4

Page 5: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Related Works

• PGI Fortran and HMPP
  – Source-to-source compilers for a single node with GPUs
  – Not designed for GPU/CPU co-working

• StarPU research
  – "StarPU: a unified platform for task scheduling on heterogeneous multicore architectures" [Augonnet, 2010]
  – "Faster, Cheaper, Better - a Hybridization Methodology to Develop Linear Algebra Software for GPUs" [Emmanuel, 2010]
  – Higher performance than GPU only
  – GPU/CPU work sharing is effective

9/10/2012/ P2S2 2012 5

Page 6: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

XcalableMP (XMP)

• A PGAS language designed for distributed-memory systems
  – Directive-based; easy to understand
    • array distribution, inter-node communication, loop work sharing on CPUs, etc.
  – Low programming cost
    • little change from the original sequential program

• XMP is developed by a working group of many Japanese universities and HPC companies

9/10/2012/ P2S2 2012 6

Page 7: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

XcalableMP acceleration device extension (XMP-dev)

• Developed by the HPCS laboratory, University of Tsukuba

• XMP-dev is an extension of XMP for accelerator-equipped clusters
  – Additional directives (to manage accelerators)
    • Mapping data onto the accelerator's device memory
    • Data transfer between CPU and accelerator
    • Work sharing of loops across accelerator devices (e.g., GPU cores)

9/10/2012/ P2S2 2012 7

Page 8: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Example of XMP-dev

int x[16], y[16];
#pragma xmp nodes p(2)
#pragma xmp template t(0:16-1)
#pragma xmp distribute t(BLOCK) onto p
#pragma xmp align [i] with t(i) :: x, y

int main() {
  ...
#pragma xmp device replicate (x, y)
  {
#pragma xmp device replicate_sync in (x, y)
#pragma xmp device loop on t(i)
    for (i = 0; i < 16; i++)
      x[i] += y[i] * 10;
#pragma xmp device replicate_sync out (x)
  }
}

9/10/2012/ P2S2 2012 8

[Figure: arrays x and y on the HOST memory of node1 and node2]

Page 9: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Example of XMP-dev

(Same code as on Page 8.)

9/10/2012/ P2S2 2012 9

[Figure: x and y are allocated on the DEVICE memory of node1 and node2 (device replicate)]

Page 10: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Example of XMP-dev

(Same code as on Page 8.)

9/10/2012/ P2S2 2012 10

[Figure: data transfer HOST -> DEVICE on each node (replicate_sync in)]

Page 11: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Example of XMP-dev

(Same code as on Page 8.)

9/10/2012/ P2S2 2012 11

[Figure: the device loop executes on the replicated x and y on each node's DEVICE]

Page 12: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Example of XMP-dev

(Same code as on Page 8.)

9/10/2012/ P2S2 2012 12

[Figure: data transfer DEVICE -> HOST on each node (replicate_sync out)]

Page 13: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Example of XMP-dev

(Same code as on Page 8.)

9/10/2012/ P2S2 2012 13

[Figure: the data on the DEVICEs is freed at the end of the replicate region]

Page 14: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

StarPU

• Developed by INRIA Bordeaux, France
• StarPU is a runtime system that
  – allocates and dispatches resources
  – schedules task execution dynamically

• All target data is recorded and managed in a data pool shared by all computation resources.
  – This guarantees the coherence of data among multiple task executions.

9/10/2012/ P2S2 2012 14

Page 15: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Example of StarPU

double x[N];
starpu_data_handle x_h;

starpu_vector_data_register(&x_h, 0, x, ...);
struct starpu_data_filter f = {
  .filter_func = starpu_block_filter_func_vector,
  .nchildren = NSLICEX, ...
};
starpu_data_partition(x_h, &f);

for (i = 0; i < NSLICEX; i++) {
  struct starpu_task *task = starpu_task_create();
  task->cl = &cl;
  task->buffer[0].h = starpu_sub_data(x_h, 1, i);
  ...
}

starpu_data_unpartition(x_h);
starpu_data_unregister(x_h);

starpu_codelet cl = {
  .where = STARPU_CPU | STARPU_CUDA,
  .cpu_func = c_func,
  .cuda_func = g_func,
  .nbuffers = 1
};

9/10/2012/ P2S2 2012 15

[Figure: vector x on the host]

Page 16: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Example of StarPU

(Same code as on Page 15.)

9/10/2012/ P2S2 2012 16

[Figure: x is registered as the StarPU data handle x_h]

Page 17: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Example of StarPU

(Same code as on Page 15.)

9/10/2012/ P2S2 2012 17

[Figure: x_h is partitioned into NSLICEX pieces of sub-data]

Page 18: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Example of StarPU

(Same code as on Page 15.)

9/10/2012/ P2S2 2012 18

[Figure: one task per piece of sub-data is dispatched to the CPU cores (CPU0, ...) and the GPU (GPU0)]

Page 19: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Example of StarPU

(Same code as on Page 15.)

9/10/2012/ P2S2 2012 19

[Figure: the partitioned handle x_h after task execution]

Page 20: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Example of StarPU

(Same code as on Page 15.)

9/10/2012/ P2S2 2012 20

[Figure: x_h is unpartitioned and unregistered; the result is back in x on the host]

Page 21: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Implementation of XMP-dev/StarPU

• XMP-dev provides
  – inter-node communication
  – data distribution

• StarPU provides
  – data transfer between GPU and CPU
  – GPU/CPU work sharing within a single node

• We combine XMP-dev and StarPU to enhance functionality and performance
  – GPU/CPU work sharing on a multi-node GPU cluster

9/10/2012/ P2S2 2012 21

Page 22: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Task Model of XMP-dev/StarPU

9/10/2012/ P2S2 2012 22

[Figure: node1 and node2, each with four CPU cores and a GPU]

Page 23: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Task Model of XMP-dev/StarPU

9/10/2012/ P2S2 2012 23

[Figure: XMP-dev/StarPU divides the global array across the nodes]

Page 24: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Task Model of XMP-dev/StarPU

9/10/2012/ P2S2 2012 24

[Figure: StarPU divides each local array and allocates tasks to the devices]

Page 25: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Preliminary performance evaluation

• N-body (double precision)
  – Using hand-compiled code
  – XMP-dev/StarPU (GPU/CPU) vs. XMP-dev (GPU only)
  – # of particles: 16K or 32K

• Node specification

9/10/2012/ P2S2 2012 25

CPU: AMD Opteron 6134 2.3 GHz (8 cores x 2 sockets)
Memory: DDR3 16 GB
GPU: NVIDIA Tesla C2050
GPU Memory: GDDR5 3 GB (ECC ON)
Interconnection: Gigabit Ethernet
# of nodes: 4

Page 26: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Preliminary performance evaluation

• Assumption and condition
  – StarPU allocates one CPU core to each GPU for control and data management
  – So 15 CPU cores (16 - 1) contribute to computation on the CPU side

• Term definition
  – Number of tasks (tc)
    • Number of partitions into which the data array is split (each partition is mapped to a task)
  – Chunk size (N / tc)
    • Number of elements per task

9/10/2012/ P2S2 2012 26

Page 27: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Relative performance of XMP-dev/StarPU to XMP-dev

9/10/2012/ P2S2 2012 27

・XMP-dev/StarPU performance is very low: at most about 35% of XMP-dev (GPU only).
・The number of tasks and the chunk size affect performance in the hybrid environment → there are several patterns.

[Figure: relative performance (0.00 to 0.40) for 16K and 32K particles at chunk sizes (16K, 32K) = (256, 512), (128, 256), (64, 128), (32, 64), (16, 32)]

Page 28: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

The relation between number of tasks and chunk size

Number of tasks \ Chunk size    Large                        Small
Enough                          Good performance             Low GPU performance
Not enough                      CPU cores are a bottleneck   Bad performance

9/10/2012/ P2S2 2012 28

[Figure: in the "Good" case both CPU and GPU stay busy; in the "Bad" case the GPU sits idling]

・If there are enough tasks and a large enough chunk size, the previous implementation achieves good performance.
・With either too few tasks or a too-small chunk size, performance degrades.
→ It is difficult to keep a good balance between them with a fixed-size chunk.

Page 29: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

The relation between number of tasks and chunk size

• In a fixed problem size

9/10/2012/ P2S2 2012 29

[Figure: trade-off between chunk size and freedom of task allocation: increasing one decreases the other]

・We have to balance the chunk size against the freedom of task allocation.
→ An aggressive solution is to give different task sizes to the CPU cores and the GPU.
→ To find the task size for each device, calculate the ratio of GPU to CPU performance.

Page 30: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Sustained Performance of GPU and CPU

9/10/2012/ P2S2 2012 30

・CPU performance improves up to the (32, 64) case, where the number of tasks is 128.
・The GPU should be assigned a single task; with 2 or more tasks, the scheduling overhead is large.

[Figure, left (CPU-only): CPU performance [MFLOPS] (0 to 2.5E+03) for 16K and 32K particles at chunk sizes (16K, 32K) = (256, 512), (128, 256), (64, 128), (32, 64), (16, 32). Right (GPU-only): GPU performance [MFLOPS] (0 to 1.4E+04) at chunk sizes (16K, 32K) = (4096, 8192), (2048, 4096), (1024, 2048), (512, 1024)]

Page 31: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Optimized hybrid work sharing

• "CPU Weight"
  – the fraction of data partitions allocated to the CPU cores

9/10/2012/ P2S2 2012 31

[Figure: of the N elements, N * CPU Weight go to the CPU and the rest to the GPU]

        GPU performance [MFLOPS]   CPU performance [MFLOPS]   CPU Weight
16K     9068.8                     2154.9                     0.237 ≈ 0.24
32K     11484.2                    2178.8                     0.189 ≈ 0.19

Page 32: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Improved Relative Performance of XMP-dev/StarPU vs XMP-dev

9/10/2012/ P2S2 2012 32

・At 32K particles, we achieve more than 1.4 times the performance of XMP-dev (GPU only); the load between CPU and GPU is well balanced in these cases.
・At 16K particles, we achieve only about 1.05 times the performance; the results are unstable, and we have to analyze this factor.

[Figure: relative performance vs. CPU Weight for 16K and 32K particles; one panel sweeps CPU Weight 0.14-0.24 with relative performance up to 1.60, the other sweeps 0.19-0.29 with relative performance up to 1.10]

Page 33: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Conclusion & Future work

• We proposed a programming language and runtime system named XMP-dev/StarPU.
  – GPU/CPU hybrid work sharing
• In the preliminary performance evaluation, we achieved up to 1.4 times better performance than XMP-dev.

• Completion of the XMP-dev/StarPU compiler implementation
• Analysis of dynamic task-size management for GPU/CPU load balancing
• Examination of various applications on larger parallel systems

9/10/2012/ P2S2 2012 33

Page 34: GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

• Thank you for your attention.
• Please ask me slowly.
• Thank you!

9/10/2012/ P2S2 2012 34