
GPU Performance Prediction Using High-level Application Models


Talk presented at ERAD RS 2014, in Alegrete, RS, Brazil, on 21 March 2014. This work aims to predict the performance of algorithms represented by high-level models when "running" on GPU hardware models.


Page 1: GPU Performance Prediction Using High-level Application Models

Filipo Novo Mór

Advisors: Dr. César Augusto Missio Marcon, Dr. Andrew Rau-Chaplin

GPU Performance Prediction Using High-level Application Models

ERAD 2014 presentation

2014 March

Pontifical Catholic University of Rio Grande do Sul
Faculty of Informatics

Postgraduate Programme in Computer Science

Page 2: GPU Performance Prediction Using High-level Application Models

Outline

• Objectives
• Related Works
• Graphics Processing Units
• Methodology
• Performance Prediction Engine
• Work Schedule

Page 3: GPU Performance Prediction Using High-level Application Models

Objectives

• To model applications at a high level in order to predict their behaviour when running on a GPU.
– Secondary goals:
• To create a high-level model description of the target GPU architecture.
• To evaluate the impact of using different cache sizes on the tested applications.

3 / 17

Page 4: GPU Performance Prediction Using High-level Application Models

Related Works

• Theoretical works:

(app., arch., CUDA, and HLRA are the modelling inputs.)

| work | authors | app. | arch. | CUDA | HLRA | outputs |
|------|---------|------|-------|------|------|---------|
| An Adaptive Performance Modeling Tool for GPU Architectures | Baghsorkhi et al. | no | yes | source code | no | performance prediction and bottleneck indicators |
| Cross-architecture Performance Predictions for Scientific Applications Using Parameterized Models | Marin and Mellor-Crummey | yes | yes | source code | no | performance prediction |
| An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness | Hong and Kim | no | no | source code | no | performance prediction; also proposed two new GPU modelling metrics, MWP and CWP |
| Exploring the Multiple-GPU Design Space | Schaa and Kaeli | no | yes | source code | no | performance benchmark |
| A Quantitative Performance Analysis Model for GPU Architectures | Zhang and Owens | no | yes | source code | no | performance benchmark |
| this work | — | yes | yes | no | yes | performance prediction |

4 / 17

Page 5: GPU Performance Prediction Using High-level Application Models

Related Works

• Application tools:

| work | authors | inputs | outputs | target architecture |
|------|---------|--------|---------|---------------------|
| Barra | Collange et al. | CUDA source code | execution measurements | NVIDIA Tesla |
| GPGPU-Sim | Bakhoda et al. | CUDA source code | execution measurements | NVIDIA Tesla and GT200 |
| GPU Ocelot | Diamos et al. | CUDA source code | execution measurements | PTX 2.3 (CUDA 4.0) |
| this work | — | HLRA | execution measurements | NVIDIA GK110 |

Source: gpgpu-sim.org

5 / 17

Page 6: GPU Performance Prediction Using High-level Application Models

Graphics Processing Unit

Simplified architecture of an NVIDIA GPU

6 / 17

Page 7: GPU Performance Prediction Using High-level Application Models

Graphics Processing Unit

Simplified architecture of an NVIDIA GPU showing the internal structure of the streaming multiprocessors

7 / 17

Page 8: GPU Performance Prediction Using High-level Application Models

Graphics Processing Unit

When a thread block is assigned to a streaming multiprocessor, it is divided into units called warps.
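The warp partitioning described above can be sketched in a few lines of Python. WARP_SIZE = 32 is the NVIDIA warp size; the helper names are hypothetical, for illustration only:

```python
# Sketch: how a thread block is partitioned into warps.
WARP_SIZE = 32  # warp size on NVIDIA GPUs

def warps_per_block(threads_per_block):
    """Number of warps the scheduler creates for one thread block
    (a partial warp still occupies a full warp slot)."""
    return -(-threads_per_block // WARP_SIZE)  # ceiling division

def warp_of(thread_idx):
    """Warp index a given (flattened) thread index belongs to."""
    return thread_idx // WARP_SIZE

print(warps_per_block(256))  # a 256-thread block yields 8 warps
print(warps_per_block(100))  # 100 threads still occupy 4 warp slots
print(warp_of(70))           # thread 70 sits in warp 2
```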

8 / 17

Image credit: Mohamed Zahran

Page 9: GPU Performance Prediction Using High-level Application Models

Graphics Processing Unit

SIMT vs SIMD
• Single Instruction, Multiple Register Sets: each thread has its own register set; consequently, instructions may process different data simultaneously across threads running in parallel.

• Single Instruction, Multiple Addresses: each thread may freely access non-coalesced memory addresses, giving the programmer more flexibility. However, this is an unsafe technique, because parallel accesses to non-coalesced addresses may serialize transactions, which reduces performance significantly.

• Single Instruction, Multiple Flow Paths: the control flow of different parallel running threads can diverge.
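The serialization effect mentioned in the second bullet can be modelled very simply: a warp's loads are grouped into memory transactions by the memory segment they touch. The 128-byte segment size and the grouping policy below are simplifying assumptions, not the specification of any particular GPU:

```python
# Toy model: count memory transactions issued for one warp's addresses.
WARP_SIZE = 32
SEGMENT = 128  # assumed bytes per memory transaction

def transactions(addresses):
    """Distinct 128-byte segments touched by one warp's load addresses."""
    return len({addr // SEGMENT for addr in addresses})

coalesced = [4 * i for i in range(WARP_SIZE)]    # consecutive 4-byte words
strided   = [128 * i for i in range(WARP_SIZE)]  # one word per segment

print(transactions(coalesced))  # 1 transaction serves the whole warp
print(transactions(strided))    # 32 serialized transactions
```

Under this model, a fully coalesced warp load needs one transaction, while a 128-byte-strided pattern serializes into 32.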

9 / 17

Page 10: GPU Performance Prediction Using High-level Application Models

Graphics Processing Unit

Branch Divergence

10 / 17

Page 11: GPU Performance Prediction Using High-level Application Models

Graphics Processing Unit

Branch Divergence
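The divergence penalty illustrated on these slides can be captured by a toy cost model: under SIMT, a warp executes the divergent paths one after another, masking off inactive threads, so the warp's cost is roughly the sum of the costs of every path taken by at least one of its threads. The function and path costs below are illustrative assumptions:

```python
# Toy cost model for branch divergence within one warp.
def warp_branch_cost(path_per_thread, path_cost):
    """path_per_thread: path label for each thread of the warp.
    path_cost: cycles (or instructions) each path costs.
    Paths taken by at least one thread execute serially."""
    taken = set(path_per_thread)
    return sum(path_cost[p] for p in taken)

cost = {'then': 10, 'else': 6}
uniform  = ['then'] * 32                                    # no divergence
diverged = ['then' if i % 2 == 0 else 'else' for i in range(32)]

print(warp_branch_cost(uniform, cost))   # 10: only one path executes
print(warp_branch_cost(diverged, cost))  # 16: both paths run serially
```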

11 / 17

Page 12: GPU Performance Prediction Using High-level Application Models

Graphics Processing Unit

The Key Challenges for GPU Programming

• Data transfer between CPU and GPU
• Memory access
• Branch divergence
• No recursion
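The first challenge above is commonly modelled as a fixed per-transfer latency plus a bandwidth term. The numbers below are illustrative assumptions (roughly PCIe-class), not measurements:

```python
# Toy host<->device transfer model: time = latency + bytes / bandwidth.
LATENCY_S = 10e-6   # assumed per-transfer overhead, seconds
BANDWIDTH = 8e9     # assumed host<->device bandwidth, bytes/s

def transfer_time(n_bytes):
    return LATENCY_S + n_bytes / BANDWIDTH

# Many small transfers cost far more than one batched transfer:
small = 1000 * transfer_time(4 * 1024)   # 1000 transfers of 4 KiB each
big   = transfer_time(1000 * 4 * 1024)   # one single 4000 KiB transfer
print(small > big)  # True: batching amortizes the fixed latency
```

This is why batching data into few large transfers is a standard GPU programming recommendation.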

12 / 17

Page 13: GPU Performance Prediction Using High-level Application Models

Methodology

13 / 17

Page 14: GPU Performance Prediction Using High-level Application Models

Methodology

Validation
• Applications will be implemented both in CUDA and in HLRA.
• Applications will be chosen according to their profiles:
– Computation vs communication
– Sizing

14 / 17

Page 15: GPU Performance Prediction Using High-level Application Models

Performance Prediction Engine

Aspects to be considered by the engine
• Branch divergence
• Memory access
– Local, global, and shared memory, and thread register blocks
• Thread synchronization
• Loops
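As a rough illustration of how such an engine could combine these aspects, the sketch below sums a divergence term, a memory-transaction term, and a synchronization term for one warp. All constants and the combination rule are illustrative assumptions, not the engine actually proposed in this work:

```python
# Hypothetical per-warp cost sketch combining the listed aspects.
SEGMENT = 128  # assumed bytes per memory transaction

def predict_warp_cycles(paths, path_cost, addresses,
                        mem_cycles_per_txn, sync_cycles):
    branch = sum(path_cost[p] for p in set(paths))        # divergence
    memory = len({a // SEGMENT for a in addresses}) * mem_cycles_per_txn
    return branch + memory + sync_cycles                  # + barrier wait

cycles = predict_warp_cycles(
    paths=['a'] * 16 + ['b'] * 16,         # warp diverges into two paths
    path_cost={'a': 20, 'b': 12},
    addresses=[4 * i for i in range(32)],  # coalesced: 1 transaction
    mem_cycles_per_txn=400,
    sync_cycles=30,
)
print(cycles)  # 20 + 12 + 400 + 30 = 462
```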

15 / 17

Page 16: GPU Performance Prediction Using High-level Application Models

Work Schedule

16 / 17

Page 17: GPU Performance Prediction Using High-level Application Models

Questions

Filipo Novo Mór
filipo.mor at acad.pucrs.br