
Compute API – Past & Future


Presentation given at a workshop on GPU & Parallel Computing held in Israel on January 6th, 2011.


Page 1: Compute API – Past & Future

Compute API – Past & Future

Ofer Rosenberg

Visual Computing Software

Page 2: Compute API – Past & Future

Intro and acknowledgments

• Who am I?

– For the past two years, leading the Intel representation in the OpenCL working group @ Khronos

– Additional background in Media, Signal Processing, etc.

– http://il.linkedin.com/in/oferrosenberg

• Acknowledgments:

– This presentation contains ideas drawn from talks with many people, who deserve to be mentioned here

– Partial list:

– AMD: Mike Houston, Ben Gaster

– Apple: Aaftab Munshi

– DICE: Johan Andersson

– Intel: Aaron Lefohn, Stephen Junkins, David Blythe, Adam Lake, Yariv Aridor, Larry Seiler and more…

– And others…

Page 3: Compute API – Past & Future

Agenda

• The beginning – From Shaders to Compute

• The Past/Present: 1st Generation of Compute APIs

– Caveats of the 1st generation

• The Future: 2nd Generation of Compute APIs

Page 4: Compute API – Past & Future

From Shaders to Compute

• In the beginning, GPU HW was fixed & optimized for Graphics…

Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008:

Page 5: Compute API – Past & Future

From Shaders to Compute

• As GPUs evolved, the graphics stages became programmable…

• This led to the traditional GPGPU approach…

Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008:

Page 6: Compute API – Past & Future

From Shaders to Compute: Traditional GPGPU

• Write in a graphics language and use the GPU for general computation

• Highly effective, but :

– The developer needs to learn another (not intuitive) language

– The developer was limited by the graphics language

• Then came CUDA & CTM…

Slides from “General Purpose Computation on Graphics Processors (GPGPU)”, Mike Houston, Stanford University Graphics Lab

Page 7: Compute API – Past & Future

The cradle of GPU Compute APIs

Slides from “GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU”, Ian Buck, NVIDIA, SC06, & “Close to the Metal”, Justin Hensley, AMD, SIGGRAPH 2007

GeForce 8800 GTX (G80) was released on Nov. 2006

CUDA 0.8 was released on Feb. 2007 (first official Beta)

ATI x1900 (R580) released on Jan 2006

CTM was released on Nov. 2006

Page 8: Compute API – Past & Future

The 1st generation of Platform Compute API

• CUDA & CTM led the way to two compute standards: Direct Compute & OpenCL

• DirectCompute is a Microsoft standard

– Released as part of Win7/DX11, a.k.a. Compute Shaders

– Only runs under Windows on a GPU device

• OpenCL is a cross-OS / cross-Vendor standard

– Managed by a working group in Khronos

– Apple is the spec editor & conformance owner

– Work can be scheduled on both GPUs and CPUs

CTM Released Nov 2006

CUDA 1.0 Released June 2007

StreamSDK Released Dec 2007

CUDA 2.0 Released Aug 2008

OpenCL 1.0 Released Dec 2008

DirectX 11 Released Oct 2009

CUDA 3.0 Released Mar 2010

OpenCL 1.1 Released June 2010

The 1st Generation was developed on GPU HW which was tuned for graphics usage – it simply extended that HW for general usage

Page 9: Compute API – Past & Future

The 1st generation of Platform Compute API: Execution Model

• The execution model was derived directly from shader programming in graphics (“fragment processing”):

– Shader programming: initiate one instance of the shader per vertex/pixel

– Compute: initiate one instance for each point in an N-dimensional grid

• Fits the GPU’s vision of an array of scalar (or stream) processors

Drawing from the OpenCL 1.1 Specification, Rev. 36
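The grid-based dispatch described above can be sketched with a small CPU emulation (illustrative Python only, not any real compute API): every point of the N-dimensional global range gets one "kernel" invocation.

```python
from itertools import product

def run_ndrange(kernel, global_size):
    """Emulate a 1st-gen compute dispatch: one kernel instance
    per point in the N-dimensional grid."""
    for gid in product(*(range(n) for n in global_size)):
        kernel(gid)

# Example "kernel": element-wise add over a 2x4 grid
a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[10, 20, 30, 40], [50, 60, 70, 80]]
c = [[0] * 4 for _ in range(2)]

def vadd(gid):
    y, x = gid          # each instance sees only its own grid point
    c[y][x] = a[y][x] + b[y][x]

run_ndrange(vadd, (2, 4))
print(c)  # [[11, 22, 33, 44], [55, 66, 77, 88]]
```

On a real GPU the instances run concurrently across the array of processors; the serial loop here only illustrates the "one instance per grid point" contract.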

Page 10: Compute API – Past & Future

The 1st generation of Platform Compute API: Memory Model

• Distributed Memory system:

– Abstraction: Application gets a “handle” to the memory object / resource

– Explicit transactions: API for sync between Host & Device(s): read/write, map/unmap

• Three address spaces: Global, Local (Shared) & Private

– Local/Shared Memory: the non-trivial memory space…

[Diagram: App ↔ OCL RT ↔ Dev1/Dev2, with a host allocation (H) mirrored by per-device allocations (A)]
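The explicit-transaction model can be sketched as follows (hypothetical class and method names, emulating disjoint host and device memories in Python):

```python
class Buffer:
    """Emulates a 1st-gen memory object: the app holds a handle,
    and host<->device movement is an explicit API call."""
    def __init__(self, size):
        self.host = [0] * size      # host-side copy
        self.device = [0] * size    # "device" copy (emulated)

    def write(self, data):
        # host -> device transfer (cf. an explicit write call)
        self.host = list(data)
        self.device = list(self.host)

    def read(self):
        # device -> host transfer (cf. an explicit read call)
        self.host = list(self.device)
        return list(self.host)

buf = Buffer(4)
buf.write([1, 2, 3, 4])
# a "kernel" mutates only the device copy...
buf.device = [x * 2 for x in buf.device]
# ...so the host must read back explicitly to see the result
print(buf.read())  # [2, 4, 6, 8]
```

The key property illustrated: without the explicit read, the host copy is stale, which is exactly the sync burden the next-generation proposals try to remove.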

Page 11: Compute API – Past & Future

Disclaimer

The next slides present my opinion and thoughts on caveats of, and future improvements to, the Platform Compute API.

Page 12: Compute API – Past & Future

The 2nd generation of Platform Compute API

• Recap:

– The 1st generation: CUDA (until 3.0), OpenCL 1.x, DX11 CS

– Defined on HW optimized for GFX, extended to General Compute

• The “cheese” has moved for GPUs – Compute has become an important usage scenario

– Advanced Graphics: Physics, Advanced Lighting Effects, Irregular Shadow Mapping, Screen Space Rendering

– Media: Video Encoding & Processing, Image Processing, Image Segmentation, Face Recognition

– Throughput: Scientific Simulations, Finance, Oil Exploration

– Developer feedback based on the 1st generation enables creating better HW/APIs

• The Second generation of Platform Compute API: “OpenCL Next”, DirectX 12?

The 2nd Generation of Compute API will run on HW which is designed with

Compute in mind

Page 13: Compute API – Past & Future

Caveats of the 1st generation: Execution Model

• Developers input:

– Most “real world” compute usages are fine-grained (the grid is small – hundreds of items at best)

– “Real world” kernels have sequential parts interleaved with the parallel code (reduction, condition testing, etc.)

Battlefield 2 execution phase DAG

(Image courtesy Johan Andersson, DICE)

Using “fragment processing” for these usages results in inefficient use of the machine:

__kernel void foo()
{
    // code here runs for each point in the grid
    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == 0)
    {
        // this code runs once per work-group
    }
    // code here runs for each point in the grid
    barrier(CLK_GLOBAL_MEM_FENCE);
    if (get_global_id(0) == 0)
    {
        // this code runs only once
    }
    // code here runs for each point in the grid
}

Page 14: Compute API – Past & Future

Caveats of the 1st generation: Execution Model

• The “array of scalar/stream processors” model is not optimal for CPUs & GPUs

• It works well for large grids (as in traditional graphics), but at finer granularity there is a better model…

[Die shots: AMD R600, NV Fermi, Intel NHM]

CPUs and GPUs are better modeled as multi-threaded vector machines

Page 15: Compute API – Past & Future

The 2nd generation of Platform Compute API: Ideas for a new execution model

• Goals

– Support fine-grain task parallelism

– Support complex application execution graphs

– Better match HW evolution: target multi-threaded vector machines

– Aligned with CPU evolution, and SoC integration of CPU/GPU

• Solution: a tasking system as the execution model foundation

[Diagram: a Task Pool in the Device Domain feeds per-SW-thread task queues, each mapped to a HW compute unit]

Tasking system:

• Task queues mapped to independent HW units (~compute cores)

• Device load balancing enabled via task stealing

• OpenCL analogy: tasks execute at the “work group level”

• OpenCL Task ≠ CPU Task:

– More restricted: no preemption

– Evolved: braided tasks (sequential parts & fine-grain parallel parts interleaved)
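The queue-plus-stealing scheme can be sketched as follows (a single-threaded Python emulation with hypothetical names; a real runtime would use concurrent deques, one per HW unit):

```python
from collections import deque

class Worker:
    """One device thread with its own task queue; steals when idle."""
    def __init__(self, wid):
        self.wid = wid
        self.q = deque()

    def run_one(self, workers, log):
        if self.q:
            task = self.q.popleft()              # take own work first
        else:
            # idle: steal from the back of the fullest other queue
            victim = max((w for w in workers if w is not self),
                         key=lambda w: len(w.q))
            if not victim.q:
                return False                     # nothing left to steal
            task = victim.q.pop()
        log.append((self.wid, task))             # "execute" the task
        return True

def drain(workers):
    """Round-robin the workers until every queue is empty."""
    log = []
    progress = True
    while progress:
        progress = False
        for w in workers:
            if w.run_one(workers, log):
                progress = True
    return log

workers = [Worker(0), Worker(1)]
workers[0].q.extend(["t0", "t1", "t2", "t3"])   # deliberately imbalanced
log = drain(workers)
print(sorted(t for _, t in log))    # every task ran exactly once
print(any(w == 1 for w, _ in log))  # worker 1 made progress by stealing
```

Even with all tasks initially queued on worker 0, worker 1 keeps the device busy by stealing from the back of worker 0's queue, which is the load-balancing property the slide describes.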

Page 16: Compute API – Past & Future

The 2nd generation of Platform Compute API: Ideas for a new execution model

• There are others who think along the same lines …

Slides from “Leading a new Era of Computing”, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day

Page 17: Compute API – Past & Future

Caveats of the 1st generation: Memory Model

• Developers input:

– A growing number of compute workloads use complex data structures (linked lists, trees, etc.)

– Performance: the cost of pointer marshaling & reconstruction on the device is high

– Porting complexity: explicit transactions, marshaling, etc. must be added

– Supporting a shared/unified address space (API & HW) is required

Shared/Unified Address Space between Host & Devices

[Diagrams: today the App sees a host allocation (H) plus per-device copies (A); with a shared address space, the App, OCL RT, and devices see one shared allocation (S)]

Page 18: Compute API – Past & Future

The 2nd generation of Platform Compute API: Ideas for a new memory model

Three variants: Shared Address Space → Shared Address Space w. relaxed consistency → Shared Address Space w. full coherency

Baseline: memory objects / resources will have the same starting address between Host & Devices

• Extends the existing OCL 1.x / DX11 memory model

• Uses explicit API calls to sync between Host & Device

• Suitable for disjoint memory architectures (discrete GPUs, for example…)

Full coherency is a new model: memory is coherent between Host & Device

• Uses known “language level” mechanisms for concurrent access: atomics, volatile

• Suitable for shared-memory architectures

[Diagrams: disjoint Host Memory and Device Memory vs. a single Coherent/Shared Memory between Host and Device]
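The contrast between the two models can be illustrated with a shared-memory sketch (purely illustrative Python; a lock stands in for the language-level atomics mentioned above):

```python
import threading

# One allocation visible to "host" and "device" at the same
# address - no explicit read/write transactions are needed.
shared = [0] * 8
lock = threading.Lock()   # stand-in for language-level atomics

def device_kernel():
    for i in range(len(shared)):
        with lock:
            shared[i] += 1     # device-side update

def host_code():
    for i in range(len(shared)):
        with lock:
            shared[i] += 10    # concurrent host-side update

t = threading.Thread(target=device_kernel)
t.start()
host_code()                    # host and "device" run concurrently
t.join()
print(shared)  # each element saw both updates: [11, 11, ..., 11]
```

Both sides mutate the same allocation and only synchronize individual accesses, which is the full-coherency model; under the baseline model, the host would instead see stale data until an explicit sync call.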

Page 19: Compute API – Past & Future

Some more thoughts for the 2nd generation (and beyond)

• Promote Heterogeneous Processing – not GPU only…

– Running code depending on the problem domain:

– A 16x16 Matrix Multiply should run on the CPU

– A 1000x1000 Matrix Multiply should run on the GPU

– Where’s the decision point? Better to leave it to the Runtime… (requires API support)

– Load Balancing

– Relevant especially on systems where the CPU & GPU are close in compute power

• One API to rule them all

– Compute API as the underlying infrastructure to run Media & GFX

– Extend the API to contain flexible pipeline, fixed-function HW, etc.

[Plot: Execution Time vs. Problem size – the CPU wins at small sizes, the GPU at large ones]

Slide from “Parallel Future of a Game Engine”, Johan Andersson, DICE
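The CPU/GPU crossover idea can be sketched as a runtime heuristic (the function name and threshold are hypothetical; a real runtime would calibrate the crossover from measurements):

```python
def pick_device(n, crossover=256):
    """Choose a device for an n x n matrix multiply.

    Below the crossover size, dispatch overhead and limited
    parallelism make the CPU the better target; above it, the
    GPU's throughput wins. The threshold here is illustrative.
    """
    return "CPU" if n < crossover else "GPU"

print(pick_device(16))    # CPU: a 16x16 multiply stays on the CPU
print(pick_device(1000))  # GPU: a 1000x1000 multiply goes to the GPU
```

Putting this decision inside the runtime, rather than in application code, is what requires API support: the app describes the work, and the runtime picks the device.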

Page 20: Compute API – Past & Future

References:

• “GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU”, Ian Buck, NVIDIA, SC06

– http://gpgpu.org/static/sc2006/workshop/presentations/Buck_NVIDIA_Cuda.pdf

• “GPU Architecture: Implications & Trends”, David Luebke, NVIDIA Research, SIGGRAPH 2008:

– http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

• “General Purpose Computation on Graphics Processors (GPGPU)”, Mike Houston, Stanford University Graphics Lab

– http://www-graphics.stanford.edu/~mhouston/public_talks/R520-mhouston.pdf

• “Close to the Metal”, Justin Hensley, AMD, SIGGRAPH 2007

– http://gpgpu.org/static/s2007/slides/07-CTM-overview.pdf

• “NVIDIA’s Fermi: The First Complete GPU Computing Architecture”, Peter N. Glaskowsky

– http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIAFermi-TheFirstCompleteGPUComputingArchitecture.pdf

• “Leading a new Era of Computing”, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day

– http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9Njk3NTJ8Q2hpbGRJRD0tMXxUeXBlPTM=&t=1

• “Parallel Future of a Game Engine”, Johan Andersson, DICE

– http://www.slideshare.net/repii/parallel-futures-of-a-game-engine-2478448