Description: Talk presented at ERAD RS 2014 in Alegrete, RS, Brazil, on 21 March 2014. This work aims to predict the performance of algorithms described by high-level models when "running" on GPU hardware models.
Filipo Novo Mór
Advisors: Dr. César Augusto Missio Marcon, Dr. Andrew Rau-Chaplin
GPU Performance Prediction Using High-level Application Models
ERAD 2014 presentation
March 2014
Pontifical Catholic University of Rio Grande do Sul
Faculty of Informatics
Postgraduate Programme in Computer Science
Outline
• Objectives
• Related Works
• Graphics Processing Units
• Methodology
• Performance Prediction Engine
• Work Schedule
Objectives
• To model applications at a high level in order to predict their behaviour when running on a GPU.
– Secondary goals:
• To create a high-level model description of the target GPU architecture.
• To evaluate the impact of different cache sizes on the tested applications.
3 / 17
Related Works
• Theoretical works:
work | authors | app. | arch. | CUDA source code | HLRA | outputs
An Adaptive Performance Modeling Tool for GPU Architectures | Baghsorkhi et al. | no | yes | yes | no | performance prediction and bottleneck indicators
Cross-architecture Performance Predictions for Scientific Applications Using Parameterized Models | Marin and Mellor-Crummey | yes | yes | yes | no | performance prediction
An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness | Hong and Kim | no | no | yes | no | performance prediction; also proposes two new metrics for GPU modelling, MWP and CWP
Exploring the multiple-GPU design space | Schaa and Kaeli | no | yes | yes | no | performance benchmark
A Quantitative Performance Analysis Model for GPU Architectures | Zhang and Owens | no | yes | yes | no | performance benchmark
this work | | yes | yes | no | yes | performance prediction

(The app., arch., CUDA source code, and HLRA columns are the modelling inputs each work uses.)
4 / 17
Related Works
• Application tools:
work | authors | inputs | outputs | target architecture
Barra | Collange et al. | CUDA source code | execution measurements | NVIDIA Tesla
GPU_Sim | Bakhoda et al. | CUDA source code | execution measurements | NVIDIA Tesla and GT200
GPU Ocelot | Diamos et al. | CUDA source code | execution measurements | PTX 2.3 (CUDA 4.0)
this work | | HLRA | execution measurements | NVIDIA GK110
gpgpu-sim.org
5 / 17
Graphics Processing Unit

Simplified architecture of an NVIDIA GPU
6 / 17
Graphics Processing Unit

Simplified architecture of an NVIDIA GPU showing the internal structure of the streaming multiprocessors
7 / 17
Graphics Processing Unit

When a thread block is assigned to a streaming multiprocessor, it is divided into units called warps.
8 / 17
Figure: Mohamed Zahran
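The warp partitioning described above can be sketched with a small helper (the function name is ours, and the warp size of 32 is an assumption matching NVIDIA GPUs of this generation):

```cpp
// Warp size on NVIDIA GPUs of this era (an assumption baked into the sketch).
constexpr int kWarpSize = 32;

// Hypothetical helper (not from the thesis): how many warps a thread block
// is split into. A partially filled last warp still occupies a whole warp.
int warpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}
```

For example, a 256-thread block yields 8 warps, while a 100-thread block yields 4, with the last warp running 28 idle lanes.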
Graphics Processing Unit

SIMT vs SIMD
• Single Instruction, Multiple Register Sets: each thread has its own register set; consequently, instructions may process different data simultaneously on different parallel threads.
• Single Instruction, Multiple Addresses: each thread is permitted to freely access non-coalesced memory addresses, giving the programmer more flexibility. However, this is an unsafe technique, because parallel access to non-coalesced addresses may serialize transactions, which reduces performance significantly.
• Single Instruction, Multiple Flow Paths: the control flow of different parallel threads can diverge.
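The "multiple flow paths" point can be illustrated with a toy lane-mask simulation (a sketch of our own in plain C++, not NVIDIA's actual mechanism): the miniature "warp" runs both sides of a branch, and each pass commits results only for the lanes whose predicate is active.

```cpp
#include <array>

constexpr int kLanes = 8; // a miniature "warp" for illustration

// Each lane doubles even inputs (then-path) or increments odd inputs
// (else-path); the two passes mimic how a divergent warp executes both
// flow paths under a lane mask, with inactive lanes committing nothing.
std::array<int, kLanes> runWarp(const std::array<int, kLanes>& x) {
    std::array<int, kLanes> out{};
    for (int lane = 0; lane < kLanes; ++lane)   // pass 1: "then" lanes
        if (x[lane] % 2 == 0) out[lane] = x[lane] * 2;
    for (int lane = 0; lane < kLanes; ++lane)   // pass 2: "else" lanes
        if (x[lane] % 2 != 0) out[lane] = x[lane] + 1;
    return out;
}
```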
9 / 17
Graphics Processing Unit
Branch Divergence
10 / 17
Graphics Processing Unit
Branch Divergence
11 / 17
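The branch-divergence effect shown on the two slides above can be quantified with a minimal cost sketch, assuming (a simplification of our own, not the thesis model) that a divergent warp serializes over every path at least one lane takes, whereas a convergent warp only pays for the longest path:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Cycles when the warp diverges: paths execute one after another.
int divergentCycles(const std::vector<int>& takenPathLengths) {
    return std::accumulate(takenPathLengths.begin(), takenPathLengths.end(), 0);
}

// Cycles when all lanes agree: only the single taken path is paid for.
int convergentCycles(const std::vector<int>& takenPathLengths) {
    return *std::max_element(takenPathLengths.begin(), takenPathLengths.end());
}
```

Under this sketch, a branch with a 10-cycle then-path and a 6-cycle else-path costs 16 cycles when lanes diverge but only 10 when they agree.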
Graphics Processing Unit

The Key Challenges for GPU Programming
• Data transfer between CPU and GPU
• Memory access
• Branch divergence
• No recursion
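The memory-access challenge comes largely from coalescing. A back-of-envelope sketch (our own assumption for illustration, using 4-byte accesses and 128-byte memory segments) counts how many segments a 32-thread warp touches for a given stride between consecutive lanes:

```cpp
#include <set>

// Count the distinct 128-byte segments touched by a warp of 32 lanes,
// where consecutive lanes' accesses are strideBytes apart. Coalesced
// access (stride 4) hits one segment; scattered access hits many, and
// each extra segment becomes an extra memory transaction.
int segmentsTouched(int strideBytes) {
    std::set<int> segments;
    for (int lane = 0; lane < 32; ++lane)
        segments.insert((lane * strideBytes) / 128);
    return static_cast<int>(segments.size());
}
```

With a stride of 4 bytes the warp needs a single transaction; with a stride of 128 bytes it needs 32, a 32x traffic penalty for the same amount of useful data.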
12 / 17
Methodology
13 / 17
Methodology
Validating
• Applications will be implemented in CUDA as well as in HLRA.
• Applications will be chosen according to their profile:
– Computation vs communication
– Sizing
14 / 17
Performance Prediction Engine
Aspects to be considered by the engine
• Branch divergence
• Memory access
– Local, global, shared, and thread register block
• Thread synchronization
• Loops
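One way such an engine could combine these aspects is a per-kernel cycle estimate (a hypothetical accumulator of our own, not the thesis engine's actual formulation; all field names are illustrative):

```cpp
// Toy cost accumulator combining the aspects listed above.
struct KernelProfile {
    int loopTrips;          // loop iterations per thread
    int computeCycles;      // arithmetic cycles per iteration
    int memoryCycles;       // memory-access cycles per iteration
    int divergencePenalty;  // extra cycles per iteration from divergence
    int syncCycles;         // one-off thread-synchronization cost
};

// Estimated cycles: loops multiply the per-iteration costs, and the
// synchronization overhead is added once.
long long estimateCycles(const KernelProfile& p) {
    long long perIter = p.computeCycles + p.memoryCycles + p.divergencePenalty;
    return static_cast<long long>(p.loopTrips) * perIter + p.syncCycles;
}
```

For instance, 100 loop trips of (4 compute + 20 memory + 6 divergence) cycles plus a 50-cycle synchronization gives an estimate of 3050 cycles.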
15 / 17
Work Schedule
16 / 17
Questions
Filipo Novo Mór
filipo.mor at acad.pucrs.br