Description: Talk presented at ERAD RS 2014 in Alegrete, RS, Brazil, on 21 March 2014. This work aims to predict the performance of algorithms described by high-level models when "running" on GPU hardware models.
Filipo Novo Mór
Advisors: Dr. César Augusto Missio Marcon, Dr. Andrew Rau-Chaplin
GPU Performance Prediction Using High-level Application Models
ERAD 2014 presentation
March 2014
Pontifical Catholic University of Rio Grande do Sul
Faculty of Informatics
Postgraduate Programme in Computer Science
Outline
• Objectives
• Related Works
• Graphics Processing Units
• Methodology
• Performance Prediction Engine
• Work Schedule
Objectives
• To model applications at a high level in order to predict their behaviour when running on a GPU.
– Secondary goals:
• To create a high-level model description of the target GPU architecture.
• To evaluate the impact of different cache sizes on the tested applications.
3 / 17
Related Works
• Theoretical works:
work | authors | app. | arch. | CUDA source code | HLRA | outputs
An Adaptive Performance Modeling Tool for GPU Architectures | Baghsorkhi et al. | no | yes | yes | no | performance prediction and bottleneck indicators
Cross-architecture Performance Predictions for Scientific Applications Using Parameterized Models | Marin and Mellor-Crummey | yes | yes | yes | no | performance prediction
An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness | Hong and Kim | no | no | yes | no | performance prediction; also proposes two new metrics for GPU modelling, MWP and CWP
Exploring the multiple-GPU design space | Schaa and Kaeli | no | yes | yes | no | performance benchmark
A Quantitative Performance Analysis Model for GPU Architectures | Zhang and Owens | no | yes | yes | no | performance benchmark
this work | | yes | yes | no | yes | performance prediction

(The app., arch., CUDA source code, and HLRA columns are the modelling inputs each work uses.)
4 / 17
Related Works
• Application tools:
work | authors | inputs | outputs | target architecture
Barra | Collange et al. | CUDA source code | execution measurements | NVIDIA Tesla
GPU_Sim | Bakhoda et al. | CUDA source code | execution measurements | NVIDIA Tesla and GT200
GPU Ocelot | Diamos et al. | CUDA source code | execution measurements | PTX 2.3 (CUDA 4.0)
this work | | HLRA | execution measurements | NVIDIA GK110
gpgpu-sim.org
5 / 17
Graphics Processing Unit

Simplified architecture of an NVIDIA GPU
6 / 17
Graphics Processing Unit

Simplified architecture of an NVIDIA GPU showing the internal structure of the streaming multiprocessors
7 / 17
Graphics Processing Unit

When a thread block is assigned to a streaming multiprocessor, it is divided into units called warps.
8 / 17
Figure: Mohamed Zahran
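The warp partitioning described above can be sketched with a small helper (the function name is ours, and the warp size of 32 is an assumption matching NVIDIA GPUs of this generation):

```cpp
// Warp size on NVIDIA GPUs of this era (an assumption baked into the sketch).
constexpr int kWarpSize = 32;

// Hypothetical helper (not from the thesis): how many warps a thread block
// is split into. A partially filled last warp still occupies a whole warp.
int warpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}
```

For example, a 256-thread block yields 8 warps, while a 100-thread block yields 4, with the last warp running 28 idle lanes.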
Graphics Processing Unit

SIMT vs SIMD
• Single Instruction, Multiple Register Sets: each thread has its own register set; consequently, instructions may process different data simultaneously on different parallel threads.
• Single Instruction, Multiple Addresses: each thread is permitted to freely access non-coalesced memory addresses, giving the programmer more flexibility. However, this is an unsafe technique, because parallel access to non-coalesced addresses may serialize transactions, which reduces performance significantly.
• Single Instruction, Multiple Flow Paths: the control flow of different parallel threads can diverge.
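The "multiple flow paths" point can be illustrated with a toy lane-mask simulation (a sketch of our own in plain C++, not NVIDIA's actual mechanism): the miniature "warp" runs both sides of a branch, and each pass commits results only for the lanes whose predicate is active.

```cpp
#include <array>

constexpr int kLanes = 8; // a miniature "warp" for illustration

// Each lane doubles even inputs (then-path) or increments odd inputs
// (else-path); the two passes mimic how a divergent warp executes both
// flow paths under a lane mask, with inactive lanes committing nothing.
std::array<int, kLanes> runWarp(const std::array<int, kLanes>& x) {
    std::array<int, kLanes> out{};
    for (int lane = 0; lane < kLanes; ++lane)   // pass 1: "then" lanes
        if (x[lane] % 2 == 0) out[lane] = x[lane] * 2;
    for (int lane = 0; lane < kLanes; ++lane)   // pass 2: "else" lanes
        if (x[lane] % 2 != 0) out[lane] = x[lane] + 1;
    return out;
}
```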
9 / 17
Graphics Processing Unit
Branch Divergence
10 / 17
Graphics Processing Unit
Branch Divergence
11 / 17
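The branch-divergence effect shown on the two slides above can be quantified with a minimal cost sketch, assuming (a simplification of our own, not the thesis model) that a divergent warp serializes over every path at least one lane takes, whereas a convergent warp only pays for the longest path:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Cycles when the warp diverges: paths execute one after another.
int divergentCycles(const std::vector<int>& takenPathLengths) {
    return std::accumulate(takenPathLengths.begin(), takenPathLengths.end(), 0);
}

// Cycles when all lanes agree: only the single taken path is paid for.
int convergentCycles(const std::vector<int>& takenPathLengths) {
    return *std::max_element(takenPathLengths.begin(), takenPathLengths.end());
}
```

Under this sketch, a branch with a 10-cycle then-path and a 6-cycle else-path costs 16 cycles when lanes diverge but only 10 when they agree.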
Graphics Processing Unit

The Key Challenges for GPU Programming
• Data transfer between CPU and GPU
• Memory access
• Branch divergence
• No recursion
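The memory-access challenge comes largely from coalescing. A back-of-envelope sketch (our own assumption for illustration, using 4-byte accesses and 128-byte memory segments) counts how many segments a 32-thread warp touches for a given stride between consecutive lanes:

```cpp
#include <set>

// Count the distinct 128-byte segments touched by a warp of 32 lanes,
// where consecutive lanes' accesses are strideBytes apart. Coalesced
// access (stride 4) hits one segment; scattered access hits many, and
// each extra segment becomes an extra memory transaction.
int segmentsTouched(int strideBytes) {
    std::set<int> segments;
    for (int lane = 0; lane < 32; ++lane)
        segments.insert((lane * strideBytes) / 128);
    return static_cast<int>(segments.size());
}
```

With a stride of 4 bytes the warp needs a single transaction; with a stride of 128 bytes it needs 32, a 32x traffic penalty for the same amount of useful data.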
12 / 17
Methodology
13 / 17
Methodology
Validating
• Applications will be implemented in CUDA as well as in HLRA.
• Applications will be chosen according to their profile:
– Computation vs communication
– Sizing
14 / 17
Performance Prediction Engine
Aspects to be considered by the engine
• Branch divergence
• Memory access
– Local, global, shared, and thread register block
• Thread synchronization
• Loops
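One way such an engine could combine these aspects is a per-kernel cycle estimate (a hypothetical accumulator of our own, not the thesis engine's actual formulation; all field names are illustrative):

```cpp
// Toy cost accumulator combining the aspects listed above.
struct KernelProfile {
    int loopTrips;          // loop iterations per thread
    int computeCycles;      // arithmetic cycles per iteration
    int memoryCycles;       // memory-access cycles per iteration
    int divergencePenalty;  // extra cycles per iteration from divergence
    int syncCycles;         // one-off thread-synchronization cost
};

// Estimated cycles: loops multiply the per-iteration costs, and the
// synchronization overhead is added once.
long long estimateCycles(const KernelProfile& p) {
    long long perIter = p.computeCycles + p.memoryCycles + p.divergencePenalty;
    return static_cast<long long>(p.loopTrips) * perIter + p.syncCycles;
}
```

For instance, 100 loop trips of (4 compute + 20 memory + 6 divergence) cycles plus a 50-cycle synchronization gives an estimate of 3050 cycles.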
15 / 17
Work Schedule
16 / 17
Questions
Filipo Novo Mór
filipo.mor at acad.pucrs.br