
demo 146X Interactive visualization of volumetric white matter connectivity 36X Ionic placement for molecular dynamics simulation on GPU 19X Transcoding


DEV301

Introduction to the C++ AMP heterogeneous computing platform and GPU tools in Visual Studio 11

Maxim Goldin
Senior Developer
Microsoft Corporation


Agenda

Context
Code
IDE
Summary


demo

N-Body Simulation


The Power of Heterogeneous Computing

146X: Interactive visualization of volumetric white matter connectivity
36X: Ionic placement for molecular dynamics simulation on GPU
19X: Transcoding HD video stream to H.264
17X: Simulation in Matlab using .mex file CUDA function
100X: Astrophysics N-body simulation
149X: Financial simulation of LIBOR model with swaptions
47X: GLAME@lab: An M-script API for linear algebra operations on GPU
20X: Ultrasound medical imaging for cancer diagnostics
24X: Highly optimized object oriented molecular dynamics
30X: Cmatch exact string matching to find similar proteins and gene sequences

source


CPUs vs GPUs today

CPU

Low memory bandwidth
Higher power consumption
Medium level of parallelism
Deep execution pipelines
Random accesses
Supports general code
Mainstream programming

GPU

High memory bandwidth
Lower power consumption
High level of parallelism
Shallow execution pipelines
Sequential accesses
Supports data-parallel code
Niche programming

images source: AMD


Tomorrow…

CPUs and GPUs coming closer together…
…nothing settled in this space, things still in motion…

C++ Accelerated Massive Parallelism is designed as a mainstream solution not only for today, but also for tomorrow

image source: AMD


C++ AMP

Part of Visual C++
Visual Studio integration
STL-like library for multidimensional data
Builds on Direct3D

performance

portability

productivity


Agenda checkpoint

Context
Code
IDE
Summary


Hello World: Array Addition

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i=0; i<n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}

How do we take the serial code on the left that runs on the CPU and convert it to run on an accelerator like the GPU?


Hello World: Array Addition

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i=0; i<n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}

#include <amp.h>
using namespace concurrency;

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);
    parallel_for_each(
        sum.grid,
        [=](index<1> i) restrict(direct3d)
        {
            sum[i] = a[i] + b[i];
        }
    );
}



Basic Elements of C++ AMP coding

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);
    parallel_for_each(
        sum.grid,
        [=](index<1> idx) restrict(direct3d)
        {
            sum[idx] = a[idx] + b[idx];
        }
    );
}

array_view variables captured and associated data copied to accelerator (on demand)

parallel_for_each: execute the lambda on the accelerator once per thread

grid: the number and shape of threads to execute the lambda

index: the thread ID that is running the lambda, used to index into data

array_view: wraps the data to operate on the accelerator

restrict(direct3d): tells the compiler to check that this code can execute on Direct3D hardware (aka accelerator)
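
To make the "once per thread" model above concrete, here is a minimal CPU-only sketch in standard C++. It is not C++ AMP: `for_each_index_1d` and `add_arrays` are hypothetical names, and the loop runs sequentially, whereas the real parallel_for_each may run the invocations concurrently and in any order on the accelerator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU-only stand-in for parallel_for_each over a 1-D grid:
// calls `kernel` exactly once for each index in [0, n).
template <typename Kernel>
void for_each_index_1d(std::size_t n, Kernel kernel) {
    for (std::size_t i = 0; i < n; ++i) {
        kernel(i);
    }
}

// Array addition written in the same shape as the C++ AMP version.
std::vector<int> add_arrays(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::vector<int> sum(a.size());
    // One kernel invocation per element; `i` plays the role of index<1>.
    for_each_index_1d(sum.size(), [&](std::size_t i) { sum[i] = a[i] + b[i]; });
    return sum;
}
```

Calling `add_arrays({1, 2, 3}, {10, 20, 30})` yields a vector holding 11, 22, 33. Each invocation writes only its own element, which is what lets the accelerator run the invocations independently.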


grid<N>, extent<N>, and index<N>

index<N> represents an N-dimensional point

extent<N>: number of units in each dimension of an N-dimensional space

grid<N>: origin (index<N>) plus extent<N>

N can be any number


Examples: grid, extent, and index

index<1> i(2);
index<2> i(0,2);
index<3> i(2,0,1);

extent<3> e(3,2,2);
extent<2> e(3,4);
extent<1> e(6);

grid<3> g(e);
grid<2> g(e);
grid<1> g(e);
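
As a small illustration of what an extent describes, the sketch below computes how many points (and therefore how many threads in a grid built from it) an extent covers. It is plain standard C++ and `extent_size` is a hypothetical helper, not part of C++ AMP.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of points in an N-dimensional extent,
// i.e. the product of the sizes of all dimensions.
template <std::size_t N>
std::size_t extent_size(const std::array<int, N>& e) {
    std::size_t total = 1;
    for (int d : e) {
        assert(d > 0);  // every dimension is assumed to be non-empty
        total *= static_cast<std::size_t>(d);
    }
    return total;
}
```

For the three extents above, (3,2,2) and (3,4) both cover 12 points, and (6) covers 6.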


array<T,N>

Multi-dimensional array of rank N with element T
Storage lives on accelerator

vector<int> v(96);
extent<2> e(8,12); // e[0] == 8; e[1] == 12;
array<int,2> a(e, v.begin(), v.end());

// in the body of my lambda
index<2> i(3,9); // i[0] == 3; i[1] == 9;
int o = a[i]; // or a[i] = 16;
//int o = a(i[0], i[1]);

[Figure: 8 x 12 grid with rows numbered 0-7 and columns numbered 0-11]
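
The sketch below shows how a 2-D index could map onto the flat vector that initialized the array. It assumes a row-major layout, which is what the indexing in the matrix multiplication example (vA[row * W + i] for a(row, i)) implies. `linear_offset` is a hypothetical helper in standard C++, not a C++ AMP function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: row-major offset of the point (i0, i1) inside an
// extent of e0 rows by e1 columns. The row-major layout is an assumption.
std::size_t linear_offset(int i0, int i1, int e0, int e1) {
    assert(0 <= i0 && i0 < e0);
    assert(0 <= i1 && i1 < e1);
    return static_cast<std::size_t>(i0) * static_cast<std::size_t>(e1)
           + static_cast<std::size_t>(i1);
}
```

Under that assumption, index (3,9) in the extent (8,12) above corresponds to element 3 * 12 + 9 = 45 of the 96-element vector.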


array_view<T,N>

View on existing data on the CPU or GPU
array_view<T,N>
array_view<const T,N>

vector<int> v(10);

extent<2> e(2,5);
array_view<int,2> a(e, v);

//above two lines can also be written
//array_view<int,2> a(2,5,v);


Data Classes Comparison

array<T,N>                      array_view<T,N>
Rank at compile time            Rank at compile time
Extent at runtime               Extent at runtime
Rectangular                     Rectangular
Dense                           Dense in one dimension
Container for data              Wrapper for data
Explicit copy                   Implicit copy
Capture by reference [&]        Capture by value [=]


parallel_for_each

parallel_for_each(
    g, // g is of type grid<N>
    [ ](index<N> idx) restrict(direct3d) {
        // kernel code
    }
);

Executes the lambda for each point in the extent
As-if synchronous in terms of visible side-effects


Example: Matrix Multiplication

void MatrixMultiplySerial( vector<float>& vC,
    const vector<float>& vA,
    const vector<float>& vB, int M, int N, int W )
{
    for (int row = 0; row < M; row++) {
        for (int col = 0; col < N; col++) {
            float sum = 0.0f;
            for(int i = 0; i < W; i++)
                sum += vA[row * W + i] * vB[i * N + col];
            vC[row * N + col] = sum;
        }
    }
}

void MatrixMultiplyAMP( vector<float>& vC,
    const vector<float>& vA,
    const vector<float>& vB, int M, int N, int W )
{
    array_view<const float,2> a(M,W,vA), b(W,N,vB);
    array_view<writeonly<float>,2> c(M,N,vC);
    parallel_for_each(c.grid,
        [=](index<2> idx) restrict(direct3d) {
            int row = idx[0]; int col = idx[1];
            float sum = 0.0f;
            for(int i = 0; i < W; i++)
                sum += a(row, i) * b(i, col);
            c[idx] = sum;
        }
    );
}


accelerator, accelerator_view

accelerator: e.g. DX11 GPU, REF; e.g. CPU

accelerator_view: a context for scheduling and memory management

[Diagram: host (CPUs, system memory) connected over PCIe to accelerators (GPU example)]

• Data transfers
    • between accelerator and host
    • could be optimized away for integrated memory architecture


Example: accelerator

// Identify an accelerator based on Windows device ID
accelerator myAcc("PCI\\VEN_1002&DEV_9591&CC_0300");

// …or enumerate all accelerators (not shown)

// Allocate an array on my accelerator
array<int> myArray(10, myAcc.default_view);

// …or launch a kernel on my accelerator
parallel_for_each(myAcc.default_view, myArrayView.grid, ...);


C++ AMP at a Glance (so far)

restrict(direct3d, cpu)
parallel_for_each
class array<T,N>
class array_view<T,N>
class index<N>
class extent<N>, grid<N>
class accelerator
class accelerator_view


Achieving maximum performance gains

Schedule threads in tiles
    Avoid thread index remapping
    Gain ability to use tile static memory

parallel_for_each overload for tiles accepts
    tiled_grid<D0> or tiled_grid<D0, D1> or tiled_grid<D0, D1, D2>
    a lambda which accepts tiled_index<D0> or tiled_index<D0, D1> or tiled_index<D0, D1, D2>

extent<2> e(8,6);
grid<2> g(e);
g.tile<4,3>()
g.tile<2,2>()

[Figure: three copies of an 8 x 6 grid (rows 0-7, columns 0-5): the untiled grid g, g.tile<4,3>(), and g.tile<2,2>()]
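
The two tilings above split the same 8 x 6 grid into different numbers of tiles. The standard C++ sketch below counts them. `TileCount` and `count_tiles` are hypothetical names, and the requirement that each extent divides evenly by the tile size is an assumption taken from the slide's examples, not a statement about what the library enforces.

```cpp
#include <cassert>

// Hypothetical result type: number of tiles along each dimension.
struct TileCount {
    int rows;
    int cols;
};

// Hypothetical helper: how many d0 x d1 tiles cover an e0 x e1 extent.
// Assumes the extent divides evenly by the tile dimensions.
TileCount count_tiles(int e0, int e1, int d0, int d1) {
    assert(d0 > 0 && d1 > 0);
    assert(e0 % d0 == 0 && e1 % d1 == 0);
    return {e0 / d0, e1 / d1};
}
```

For the 8 x 6 extent, tile<4,3> gives 2 x 2 = 4 tiles of 12 threads each, and tile<2,2> gives 4 x 3 = 12 tiles of 4 threads each.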


tiled_grid, tiled_index

Given

array_view<int,2> data(8, 6, p_my_data);
parallel_for_each(
    data.grid.tile<2,2>(),
    [=] (tiled_index<2,2> t_idx)… { … });

When the lambda is executed by T

t_idx.global = index<2> (6,3)
t_idx.local = index<2> (0,1)
t_idx.tile = index<2> (3,1)
t_idx.tile_origin = index<2> (6,2)

[Figure: 8 x 6 grid (rows 0-7, columns 0-5) with thread T marked at row 6, column 3]
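
The four values listed for thread T can be derived from the global index and the tile size with integer division and remainder. The sketch below reproduces that arithmetic in standard C++. `Index2`, `TiledIndex2` and `decompose` are hypothetical names that only mirror the fields of tiled_index; they are not the C++ AMP types.

```cpp
#include <cassert>

// Hypothetical 2-D index: (row, col).
struct Index2 {
    int row;
    int col;
};

// Hypothetical mirror of the tiled_index fields shown on the slide.
struct TiledIndex2 {
    Index2 global;       // position in the whole grid
    Index2 local;        // position inside the tile
    Index2 tile;         // which tile the thread belongs to
    Index2 tile_origin;  // global position of the tile's first thread
};

// Splits a global index into its tile-related parts for d0 x d1 tiles.
TiledIndex2 decompose(Index2 global, int d0, int d1) {
    assert(d0 > 0 && d1 > 0);
    assert(global.row >= 0 && global.col >= 0);
    TiledIndex2 t;
    t.global = global;
    t.tile = {global.row / d0, global.col / d1};
    t.local = {global.row % d0, global.col % d1};
    t.tile_origin = {t.tile.row * d0, t.tile.col * d1};
    return t;
}
```

For global (6,3) with 2 x 2 tiles this gives tile (3,1), local (0,1) and tile_origin (6,2), matching the values above. Note that global is always tile_origin plus local.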


tile_static, tile_barrier

Within the tiled parallel_for_each lambda we can use

tile_static storage class for local variables
    indicates that the variable is allocated in fast cache memory, i.e. shared by each thread in a tile of threads
    only applicable in restrict(direct3d) functions

class tile_barrier
    synchronize all threads within a tile
    e.g. t_idx.barrier.wait();


Example: Matrix Multiplication (tiled)

void MatrixMultSimple(vector<float>& vC,
    const vector<float>& vA,
    const vector<float>& vB, int M, int N, int W )
{
    array_view<const float,2> a(M, W, vA), b(W, N, vB);
    array_view<writeonly<float>,2> c(M,N,vC);
    parallel_for_each(c.grid,
        [=] (index<2> idx) restrict(direct3d) {
            int row = idx[0]; int col = idx[1];
            float sum = 0.0f;

            for(int k = 0; k < W; k++)
                sum += a(row, k) * b(k, col);

            c[idx] = sum;
        } );
}

void MatrixMultTiled(vector<float>& vC,
    const vector<float>& vA,
    const vector<float>& vB, int M, int N, int W )
{
    static const int TS = 16;
    array_view<const float,2> a(M, W, vA), b(W, N, vB);
    array_view<writeonly<float>,2> c(M,N,vC);
    parallel_for_each(c.grid.tile< TS, TS >(),
        [=] (tiled_index< TS, TS> t_idx) restrict(direct3d) {
            int row = t_idx.local[0]; int col = t_idx.local[1];
            float sum = 0.0f;
            for (int i = 0; i < W; i += TS) {
                tile_static float locA[TS][TS], locB[TS][TS];
                locA[row][col] = a(t_idx.global[0], col + i);
                locB[row][col] = b(row + i, t_idx.global[1]);
                t_idx.barrier.wait();
                for (int k = 0; k < TS; k++)
                    sum += locA[row][k] * locB[k][col];
                t_idx.barrier.wait();
            }
            c[t_idx.global] = sum;
        } );
}
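
To see why the tiled kernel computes the same result as the simple one, the sketch below emulates it on the CPU in standard C++. The two barrier waits become the boundaries between a "load" phase and a "multiply" phase that are run for every thread of a tile in turn. This is an illustration under stated assumptions, not the C++ AMP implementation: `multiply_naive` and `multiply_tiled` are hypothetical names, the tile size is 2 instead of 16 to keep test inputs small, and M, N and W are assumed to be multiples of the tile size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: C = A * B with A of size M x W and B of size W x N (row-major).
std::vector<float> multiply_naive(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  int M, int N, int W) {
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    for (int row = 0; row < M; ++row)
        for (int col = 0; col < N; ++col)
            for (int k = 0; k < W; ++k)
                C[row * N + col] += A[row * W + k] * B[k * N + col];
    return C;
}

// CPU-only emulation of the tiled kernel, one tile of TS x TS threads at a time.
std::vector<float> multiply_tiled(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  int M, int N, int W) {
    constexpr int TS = 2;  // assumed tile size for this sketch
    assert(M % TS == 0 && N % TS == 0 && W % TS == 0);
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    for (int tileRow = 0; tileRow < M; tileRow += TS) {
        for (int tileCol = 0; tileCol < N; tileCol += TS) {
            float sum[TS][TS] = {};  // one running sum per thread in the tile
            for (int i = 0; i < W; i += TS) {
                float locA[TS][TS], locB[TS][TS];  // stand-ins for tile_static
                // Load phase: each thread copies one element of A and one of B.
                for (int row = 0; row < TS; ++row)
                    for (int col = 0; col < TS; ++col) {
                        locA[row][col] = A[(tileRow + row) * W + (col + i)];
                        locB[row][col] = B[(row + i) * N + (tileCol + col)];
                    }
                // First barrier: every load is done before any thread reads.
                for (int row = 0; row < TS; ++row)
                    for (int col = 0; col < TS; ++col)
                        for (int k = 0; k < TS; ++k)
                            sum[row][col] += locA[row][k] * locB[k][col];
                // Second barrier: every read is done before the next load.
            }
            for (int row = 0; row < TS; ++row)
                for (int col = 0; col < TS; ++col)
                    C[(tileRow + row) * N + (tileCol + col)] = sum[row][col];
        }
    }
    return C;
}
```

Each element of A and B is read from the large arrays once per tile rather than once per thread, which is the saving that tile_static memory is meant to provide. Without the first barrier a thread could read a locA or locB entry that another thread has not written yet; without the second, a fast thread could overwrite entries that a slower thread is still reading.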


C++ AMP at a Glance

restrict(direct3d, cpu)
parallel_for_each
class array<T,N>
class array_view<T,N>
class index<N>
class extent<N>, grid<N>
class accelerator
class accelerator_view

tile_static storage class
class tiled_grid< , , >
class tiled_index< , , >
class tile_barrier


Agenda checkpoint

Context
Code
IDE
Summary


Visual Studio 11

Organize
Edit
Design
Build
Browse
Debug
Profile


demo

C++ AMP Parallel Debugger


Visual Studio 11

Organize
Edit
Design
Build
Browse
Debug
Profile


C++ AMP Parallel Debugger

Well known Visual Studio debugging features
    Launch, Attach, Break, Stepping, Breakpoints, DataTips
    Toolwindows: Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch

New features (for both CPU and GPU)
    Parallel Stacks window, Parallel Watch window, Barrier

New GPU-specific
    Emulator, GPU Threads window, race detection


demo

Concurrency Visualizer for GPU


Visual Studio 11

Organize
Edit
Design
Build
Browse
Debug
Profile


Concurrency Visualizer for GPU

Direct3D-centric
    Supports any library/programming model built on it

Integrated GPU and CPU view

Goal is to analyze high-level performance metrics
    Memory copy overheads
    Synchronization overheads across CPU/GPU
    GPU activity and contention with other processes


Concurrency Visualizer for GPU

Team is exploring ways to provide data on:

GPU Memory Utilization

GPU HW Counters


Agenda checkpoint

Context
Code
IDE
Summary


Summary

Democratization of parallel hardware programmability
    Performance for the mainstream
    High-level abstractions in C++ (not C)
    State-of-the-art Visual Studio IDE
    Hardware abstraction platform

Intent is to make C++ AMP an open specification


Resources

Daniel Moth's blog (PM of C++ AMP)
http://www.danielmoth.com/Blog/

MSDN Native parallelism blog (team blog)
http://blogs.msdn.com/b/nativeconcurrency/

MSDN Dev Center for Parallel Computing
http://msdn.com/concurrency

MSDN Forums to ask questions
http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads


Feedback

Your feedback is very important! Please complete an evaluation form!

Thank you!


DEV: Hands-on labs

November 9 – 10 in the self-paced lab room
November 9 with an instructor

10:30 – 11:45 DEV201ILL: Visual Studio LightSwitch basics

13:00 – 14:15 DEV303ILL: Debugging with IntelliTrace using Visual Studio 2010 Ultimate

14:30 – 15:45 DEV304ILL: Using Architecture Explorer for code analysis in Visual Studio 2010 Ultimate

16:00 – 17:15 DEV305ILL: Test Driven Development in Microsoft Visual Studio 2010

17:30 – 18:45 DEV302ILL: Basics of web performance testing and load testing with Visual Studio 2010 Ultimate


Questions?

DEV301 Maxim Goldin

Senior Developer
[email protected]
http://blogs.msdn.com/b/mgoldin/

You can ask questions in the “Ask the expert” zone for an hour after the end of this session


restrict(…)

Applies to functions (including lambdas)

Why restrict
    Target-specific language restrictions
    Optimizations or special code-gen behavior
    Future-proofing

Functions can have multiple restrictions

In 1st release we are implementing direct3d and cpu
    cpu – the implicit default


restrict(direct3d) restrictions

Can only call other restrict(direct3d) functions
All functions must be inlinable
Only direct3d-supported types
    int, unsigned int, float, double
    structs & arrays of these types

Pointers and References
    Lambdas cannot capture by reference¹, nor capture pointers
    References and single-indirection pointers supported only as local variables and function arguments


restrict(direct3d) restrictions

No
    recursion
    'volatile'
    virtual functions
    pointers to functions
    pointers to member functions
    pointers in structs
    pointers to pointers

No
    goto or labeled statements
    throw, try, catch
    globals or statics
    dynamic_cast or typeid
    asm declarations
    varargs
    unsupported types, e.g. char, short, long double


Example: restrict overloading

double bar( double ) restrict(cpu,direct3d); // 1: same code for both
double cos( double );                        // 2a: general code
double cos( double ) restrict(direct3d);     // 2b: specific code

void SomeMethod(array<double> c) {
    parallel_for_each( c.grid, [=](index<1> idx) restrict(direct3d) {
        //…
        double d1 = bar(c[idx]); // ok
        double d2 = cos(c[idx]); // ok, chooses direct3d overload
        //…
    });
}


Not Covered

Math library
    e.g. acosf

Atomic operation library
    e.g. atomic_fetch_add

Direct3D intrinsics
    debugging (e.g. direct3d_printf), fences (e.g. __dp_d3d_all_memory_fence), float math (e.g. __dp_d3d_absf)

Direct3D Interop
    *get_device, create_accelerator_view, make_array, *get_buffer