Improving an Autotuning Engine for 3D Fast Wavelet Transform on GPU

G. Bernabé†, J. Cuenca†, Luis P. García* and D. Giménez‡
† Computer Engineering Department, University of Murcia
* Technical Research Service, Technical University of Cartagena
‡ Computer Science and Systems Department, University of Murcia
3-7 July, 2014
14th International Conference on Computational and Mathematical Methods
in Science and Engineering (CMMSE 2014)
CMMSE14 – Improving an Autotuning Engine for 3D Fast Wavelet Transform on GPU
Outline
Introduction and Motivation
An enhanced autotuning engine for the 3D-FWT
Experimental Results
Conclusions and Future work
Introduction
• Cluster of nodes, to solve scientific problems like the 3D-FWT
Introduction
• The development and optimization of parallel code is a complex task
– Deep knowledge of the different components is required to exploit their computing capacity
– The different paradigms (message passing, shared memory and SIMD GPU) must be programmed and combined efficiently
Autotuning architecture to run the 3D-FWT kernel automatically on clusters of multicore+GPUs [6]
[6] G. Bernabé, J. Cuenca and D. Giménez. "Optimizing a 3D-FWT code in heterogeneous cluster of multicore CPUs and manycore GPUs". SBAC-PAD 2013
[Diagram: Autotuning architecture for the 3D-FWT on multicores and GPUs]
Autotuning architecture to run the 3D-FWT kernel automatically on clusters
• The architecture consists of two steps
1. Cluster Analysis
• Detects the number and type of CPUs and GPUs
• Measures the 3D-FWT computing performance
• Measures the bandwidth of the interconnection network
2. The Theoretical Searching of the Best Number of Slave Nodes
• Automatically computes the proportions in which the different video sequences are divided among the nodes of the cluster
• Searches the possible temporal distribution schemes for 1 to n slave nodes working jointly with a master node
• Chooses the number of slave nodes which gives the lowest execution time
[Diagram: Autotuning architecture — 3D-FWT on multicores and GPUs: Cluster Analysis → Theoretical Searching of the Best Number of Slave Nodes]
An enhanced autotuning engine for the 3D-FWT
• The first stage is the Cluster Analysis
1. Automatically detects the available GPUs and CPUs in each node
2. For each platform (GPU or CPU) in each node:
• If Nvidia GPU: the CUDA 3D-FWT automatically calculates the block size and the number of streams
• If ATI GPU: the OpenCL 3D-FWT automatically computes the work-group size
• If CPU: a fast tiling and pthreads analysis obtains the best number of threads
• Sends one sequence to measure the computing performance of the 3D-FWT kernel
3. Measures the performance of the interconnection network among the nodes
[Diagram: Cluster Analysis → Cluster Performance Map → Theoretical Searching of the Best Number of Slave Nodes → Best Number of Slave Nodes]
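The per-platform dispatch of the Cluster Analysis stage can be sketched as below. This is a minimal illustration, not the authors' code: every function and field name (detect_devices, tune_cuda, etc.) is a hypothetical stand-in, and the tuned values are placeholders for what the real per-platform analyses would measure.

```python
# Sketch of the Cluster Analysis stage (illustrative names and values).

def detect_devices(node):
    """Stand-in for automatic detection; a real implementation would
    query CUDA/OpenCL and the OS on each node."""
    return node["devices"]

def tune_cuda(device):
    # Nvidia GPU: the real engine runs f(block, stream) here.
    return {"block": 384, "streams": 64}

def tune_opencl(device):
    # ATI GPU: the real engine tunes the work-group size here.
    return {"workgroup": 256}

def tune_cpu(device):
    # CPU: fast tiling + pthreads analysis for the thread count.
    return {"threads": device["cores"]}

def cluster_analysis(nodes):
    """Builds a Cluster Performance Map: per node, the tuned 3D-FWT
    parameters for each detected device."""
    perf_map = {}
    for node in nodes:
        tuned = []
        for dev in detect_devices(node):
            if dev["type"] == "nvidia":
                tuned.append(tune_cuda(dev))
            elif dev["type"] == "ati":
                tuned.append(tune_opencl(dev))
            else:
                tuned.append(tune_cpu(dev))
        perf_map[node["name"]] = tuned
    return perf_map
```

The map produced here is what the second stage (the theoretical search for the best number of slave nodes) would consume.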
Motivation
Why is the number of streams so important?
Motivation
• The Nvidia Fermi and Kepler GPUs were crucial for the incorporation of streams as a key factor in codes
• One of the most difficult challenges for the GPU architecture is an optimal scheduler to manage a workload composed of different streams in a GPU
Motivation
• The Fermi GPU architecture allows concurrent execution of up to 16 streams, but there is a single hardware queue, so the streams must be multiplexed and serialized
Motivation
• The Kepler GPU architecture introduces Hyper-Q, which enables up to 32 hardware queues, allowing great flexibility to improve performance without modifying the source code
Motivation
• A difference among execution times is observed for different block sizes on both GPUs: the maximum difference is about 16% on the Fermi GPU and 6% on the Kepler GPU
• Speedups when using several streams with respect to the execution with a single stream are in the range of 1.31 to 1.68 for the Fermi GPU and 1.28 to 1.56 for the Kepler GPU
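The benefit of multiple streams comes from overlapping host-device copies with kernel execution. A toy three-stage pipeline model (an illustration under simplifying assumptions, not the authors' measurements) shows why splitting the work into k streams helps:

```python
# Toy analytic model: with k streams, the copy-in, kernel and copy-out
# engines overlap, so the makespan approaches the time of the slowest
# engine. Real speedups (1.28-1.68 on these GPUs) also depend on the
# scheduler and the number of hardware queues, which this model ignores.

def makespan(t_in, t_kernel, t_out, k):
    """Classic 3-stage pipeline split into k equal chunks."""
    first_chunk = (t_in + t_kernel + t_out) / k      # fills the pipeline
    bottleneck = max(t_in, t_kernel, t_out) / k      # per-chunk slowest stage
    return first_chunk + (k - 1) * bottleneck

def speedup(t_in, t_kernel, t_out, k):
    """Speedup of k streams over a single stream (k = 1)."""
    return (t_in + t_kernel + t_out) / makespan(t_in, t_kernel, t_out, k)
```

With equal stage times, 4 streams already give a 2x speedup in this model, and the limit for large k is 3x (the three engines fully overlapped); the measured 1.3-1.7x range is consistent with partial overlap.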
Outline
Introduction and Motivation
An enhanced autotuning engine for the 3D-FWT
Experimental Results
Conclusions and Future work
An enhanced autotuning engine for the 3D-FWT
• function f(block, stream): from a set of block sizes and a set of numbers of streams, it reduces the number of possible evaluations and returns the configuration with the minimum execution time
An enhanced autotuning engine for the 3D-FWT
• function f(block, stream)
1. Computes the occupancy of each multiprocessor for a set of block sizes and selects all the block sizes that reach at least 60% occupancy
2. Selects a first block size and obtains execution times for several numbers of streams. The minimum execution time (ET) is selected as the best time.
3. If ET(streams) is greater than the best time plus a threshold, that number of streams is not considered in the evaluation of the next block sizes
4. For the next block sizes, the analysis is only done for the numbers of streams selected in the previous step
5. The output is the (block, stream) configuration with the minimum execution time
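The pruning search above can be sketched as follows. The synthetic timing function in the test is an assumption for illustration only; in the real engine, time_ms would actually run the CUDA 3D-FWT kernel, and the block-size set is assumed to have already passed the 60% occupancy pre-filter of step 1.

```python
# Sketch of the pruning search f(block, stream) described above.
# time_ms(block, streams) is a stand-in for timing the 3D-FWT kernel.

def f(time_ms, block_sizes, stream_counts, threshold=0.10):
    """Returns (best_block, best_streams, best_time, evaluations).

    Every stream count is evaluated for the first block size; for the
    remaining block sizes only the stream counts within `threshold` of
    the best time so far are kept, pruning further after each row.
    """
    best = (None, None, float("inf"))
    surviving = list(stream_counts)
    evaluations = 0
    for b in block_sizes:
        row = {}
        for s in surviving:
            row[s] = time_ms(b, s)
            evaluations += 1
            if row[s] < best[2]:
                best = (b, s, row[s])
        # Step 3: discard stream counts above best_time * (1 + threshold)
        cutoff = best[2] * (1 + threshold)
        surviving = [s for s in surviving if row[s] <= cutoff]
    return best + (evaluations,)
```

On a smooth synthetic cost function with 7 block sizes and 7 stream counts, this search finds the global optimum while evaluating 37 of the 49 grid points; on the real kernel the pruning is stronger (24 of the possible evaluations on the Tesla K20 below).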
Outline
Introduction and Motivation
An enhanced autotuning engine for the 3D-FWT
Experimental Results
Conclusions and Future work
Experimental Results: for an NVIDIA Tesla K20 GPU with 2496 cores, f(block, stream) is executed

Sequence: 256 frames of 1024 x 1024 pixels
Installation_block_sizes_Set: 128, 192, 256, 320, 384, 448, 512
streams_Set: 1, 2, 4, 8, 16, 32, 64
Threshold: 10%
Experimental Results: f(block, stream) on the Tesla K20 – 1st iteration

Block size | 1      | 2      | 4      | 8      | 16     | 32     | 64     | Best_time + 10%
128        | 568.69 | 501.68 | 466.32 | 467.52 | 451.67 | 438.49 | 414.73 | 456.20

Streams 1, 2, 4 and 8 exceed 456.20 ms and are discarded for the next iterations.
Experimental Results: f(block, stream) on the Tesla K20 – 2nd iteration (streams_Set pruned to 16, 32, 64)

Block size | 16     | 32     | 64     | Best_time + 10%
192        | 447.05 | 433.93 | 410.18 | 451.20
Experimental Results: f(block, stream) on the Tesla K20 – 3rd to 6th iterations

Block size | 16     | 32     | 64     | Best_time + 10%
256        | 434.33 | 439.08 | 415.91 | 457.50
320        | 444.47 | 431.28 | 409.68 | 450.65
384        | 442.41 | 428.84 | 405.37 | 445.91
448        | 448.94 | 435.84 | 414.54 | 445.91

After the 6th iteration, 16 streams (448.94 ms) exceeds 445.91 ms and is discarded.
Experimental Results: f(block, stream) on the Tesla K20 – 7th iteration (streams_Set pruned to 32, 64)

Block size | 32     | 64     | Best_time + 10%
512        | 438.99 | 416.87 | 445.91
Experimental Results: f(block, stream) on the Tesla K20 – summary

Block size | 1      | 2      | 4      | 8      | 16     | 32     | 64     | Best_time + 10%
128        | 568.69 | 501.68 | 466.32 | 467.52 | 451.67 | 438.49 | 414.73 | 456.20
192        |        |        |        |        | 447.05 | 433.93 | 410.18 | 451.20
256        |        |        |        |        | 434.33 | 439.08 | 415.91 | 457.50
320        |        |        |        |        | 444.47 | 431.28 | 409.68 | 450.65
384        |        |        |        |        | 442.41 | 428.84 | 405.37 | 445.91
448        |        |        |        |        | 448.94 | 435.84 | 414.54 | 445.91
512        |        |        |        |        |        | 438.99 | 416.87 | 445.91

• The autotuning engine obtains the minimum execution time, reducing the number of executed evaluations from the 56 total possible to 24
• Our enhanced automatic method achieves the optimal configuration, a block size of 384 with 64 streams, in 10.61 seconds
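Collecting the times the engine actually measured (transcribed from the iteration tables above, in ms) confirms both the evaluation count and the chosen configuration:

```python
# Execution times (ms) evaluated on the Tesla K20, per {block: {streams: time}},
# transcribed from the iteration tables above.
measured = {
    128: {1: 568.69, 2: 501.68, 4: 466.32, 8: 467.52,
          16: 451.67, 32: 438.49, 64: 414.73},
    192: {16: 447.05, 32: 433.93, 64: 410.18},
    256: {16: 434.33, 32: 439.08, 64: 415.91},
    320: {16: 444.47, 32: 431.28, 64: 409.68},
    384: {16: 442.41, 32: 428.84, 64: 405.37},
    448: {16: 448.94, 32: 435.84, 64: 414.54},
    512: {32: 438.99, 64: 416.87},
}

evaluations = sum(len(row) for row in measured.values())  # 24 evaluated points
best = min((t, b, s) for b, row in measured.items() for s, t in row.items())
# best -> (405.37, 384, 64): block size 384 with 64 streams
```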
Experimental Results: for a video sequence of 10 hours at 25 frames per second, split into groups of 256 frames

Execution times – 900,000 frames of 1024 x 1024 pixels
Autotuning engine (block size=384, streams=64): 23.75 minutes
Non-expert user (block size=64, streams=1): 34.38 minutes
Expert user (block size=optimal, streams=32): 25.13 minutes

• A non-expert user does not have the knowledge to properly select the block size and the number of streams (selecting 64 as the block size and 1 as the number of streams)
• An expert user selects the optimal block size and sets the number of streams to 32 (the number of hardware queues in a Tesla K20 GPU)
• Speedups of 1.45 and 1.06 with regard to the non-expert and expert user, respectively
Experimental Results: for an NVIDIA Fermi Tesla C2050 GPU with 448 cores, f(block, stream) is executed – 1st iteration

Sequence: 128 frames of 2048 x 2048 pixels
Installation_block_sizes_Set: 128, 192, 256, 320, 384, 448, 512
streams_Set: 1, 2, 4, 8, 16, 32
Threshold: 10%

Block size | 1      | 2      | 4      | 8      | 16     | 32     | Best_time + 10%
128        | 863.61 | 681.24 | 602.13 | 576.10 | 592.44 | 659.70 | 633.71

Streams 1, 2 and 32 exceed 633.71 ms and are discarded for the next iterations.
Experimental Results: f(block, stream) on the Tesla C2050 – 2nd iteration (streams_Set pruned to 4, 8, 16)

Block size | 4      | 8      | 16     | Best_time + 10%
192        | 583.30 | 557.59 | 574.58 | 613.35
Experimental Results: f(block, stream) on the Tesla C2050 – 3rd to 7th iterations

Block size | 4      | 8      | 16     | Best_time + 10%
256        | 589.62 | 564.49 | 581.43 | 620.94
320        | 590.83 | 566.25 | 583.10 | 622.88
384        | 576.60 | 551.35 | 567.44 | 606.49
448        | 581.87 | 557.56 | 574.57 | 606.49
512        | 594.44 | 569.55 | 586.08 | 606.49
Experimental Results: f(block, stream) on the Tesla C2050 – summary

Block size | 1      | 2      | 4      | 8      | 16     | 32     | Best_time + 10%
128        | 863.61 | 681.24 | 602.13 | 576.10 | 592.44 | 659.70 | 633.71
192        |        |        | 583.30 | 557.59 | 574.58 |        | 613.35
256        |        |        | 589.62 | 564.49 | 581.43 |        | 620.94
320        |        |        | 590.83 | 566.25 | 583.10 |        | 622.88
384        |        |        | 576.60 | 551.35 | 567.44 |        | 606.49
448        |        |        | 581.87 | 557.56 | 574.57 |        | 606.49
512        |        |        | 594.44 | 569.55 | 586.08 |        | 606.49

• The optimization engine obtains the minimum execution time, executing 50% of the total evaluations in 14.33 seconds
• The best time is 551.35 ms, achieved with a block size of 384 and 8 streams
Experimental Results: for a video sequence of 10 hours at 25 frames per second, split into groups of 128 frames

Execution times – 900,000 frames of 2048 x 2048 pixels
Autotuning engine (block size=384, streams=8): 64.61 minutes
Non-expert user (block size=64, streams=1): 107.73 minutes
Expert user (block size=optimal, streams=1): 97.48 minutes
Expert user (block size=optimal, streams=16): 66.50 minutes

• A non-expert user does not have the knowledge to properly select the block size and the number of streams
• An expert user selects the optimal block size and sets the number of streams to 1 or 16 (the number of streams theoretically allowed in concurrency and the number of hardware queues in a Fermi Tesla C2050 GPU, respectively)
59
Experimental Results For a sequence of video of 10 hours with 25 frames per second, split in group of 128 frames
Execution times (seconds) – 900,000 frames 2048 x 2048
Autotuning engine (block size=384, streams=64) 64.61 minutes
Non-expert user (block size=64, streams=1) 107.73 minutes
Expert user (block size=optimal, streams=1) 97.48 minutes
Expert user (block size=optimal, streams=16) 66.50 minutes
• Speedup of 1.67 with regard to a non-expert user
• Speedups of 1.51 and 1.02 depending on whether an expert user selects 1 or 16 streams
Outline
Introduction and Motivation
Autotuning architecture to run the 3D-FWT kernel automatically on clusters
Experimental Results
Conclusions and Future work
Conclusions
• An extension of a previously proposed optimization engine to run the 3D-FWT kernel automatically on integrated systems with different platforms, such as multicore CPUs and manycore GPUs
• The autotuning method reduces the number of possible evaluations: from a set of block sizes and a set of numbers of streams, it obtains the configuration with the minimum execution time
• Speedups of up to 1.45x for the NVIDIA Tesla K20 and 1.67x for the Fermi Tesla C2050 with respect to a user without the knowledge to select the optimal block size and the number of streams
• For expert users, who select the optimal block size and know the architecture of the GPUs, the autotuning engine achieves speedups of up to 1.51x
Future work
• The methodology applied to build the optimization engine should be applicable to other complex computing applications
• Our work is part of the development of an image processing library oriented to biomedical applications, allowing users to run different routines efficiently and automatically