DEV301
Introduction to the C++ AMP heterogeneous computing platform and GPU tools in Visual Studio 11
Maxim Goldin, Senior Developer, Microsoft Corporation
Agenda
Context
Code
IDE
Summary
Demo: N-Body Simulation
The Power of Heterogeneous Computing
146X  Interactive visualization of volumetric white matter connectivity
36X   Ionic placement for molecular dynamics simulation on GPU
19X   Transcoding HD video stream to H.264
17X   Simulation in Matlab using .mex file CUDA function
100X  Astrophysics N-body simulation
149X  Financial simulation of LIBOR model with swaptions
47X   GLAME@lab: An M-script API for linear algebra operations on GPU
20X   Ultrasound medical imaging for cancer diagnostics
24X   Highly optimized object oriented molecular dynamics
30X   Cmatch exact string matching to find similar proteins and gene sequences
source
CPUs vs GPUs today
CPU
Low memory bandwidth
Higher power consumption
Medium level of parallelism
Deep execution pipelines
Random accesses
Supports general code
Mainstream programming
GPU
High memory bandwidth
Lower power consumption
High level of parallelism
Shallow execution pipelines
Sequential accesses
Supports data-parallel code
Niche programming
images source: AMD
Tomorrow…
CPUs and GPUs coming closer together…
…nothing settled in this space, things still in motion…
C++ Accelerated Massive Parallelism is designed as a mainstream solution not only for today, but also for tomorrow
image source: AMD
C++ AMP
Part of Visual C++
Visual Studio integration
STL-like library for multidimensional data
Builds on Direct3D
performance
portability
productivity
Agenda checkpoint
Context
Code
IDE
Summary
Hello World: Array Addition
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}
How do we take the serial code on the left that runs on the CPU and convert it to run on an accelerator like the GPU?
Hello World: Array Addition
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}
#include <amp.h>
using namespace concurrency;

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);
    parallel_for_each(sum.grid, [=](index<1> i) restrict(direct3d)
    {
        sum[i] = a[i] + b[i];
    });
}
Basic Elements of C++ AMP coding
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);
    parallel_for_each(sum.grid, [=](index<1> idx) restrict(direct3d)
    {
        sum[idx] = a[idx] + b[idx];
    });
}
array_view variables captured and associated data copied to accelerator (on demand)
parallel_for_each: execute the lambda on the accelerator once per thread
grid: the number and shape of threads to execute the lambda
index: the thread ID that is running the lambda, used to index into data
array_view: wraps the data to operate on the accelerator
restrict(direct3d): tells the compiler to check that this code can execute on Direct3D hardware (aka accelerator)
grid<N>, extent<N>, and index<N>
index<N>: represents an N-dimensional point
extent<N>: the number of units in each dimension of an N-dimensional space
grid<N>: an origin (index<N>) plus an extent<N>
N can be any number
Examples: grid, extent, and index
index<1> i(2);
index<2> i(0,2);
index<3> i(2,0,1);

extent<1> e(6);
extent<2> e(3,4);
extent<3> e(3,2,2);

grid<1> g(e);
grid<2> g(e);
grid<3> g(e);
vector<int> v(96);
extent<2> e(8,12);   // e[0] == 8; e[1] == 12
array<int,2> a(e, v.begin(), v.end());

// in the body of my lambda
index<2> i(3,9);     // i[0] == 3; i[1] == 9
int o = a[i];        // or a[i] = 16;
// int o = a(i[0], i[1]);
array<T,N>
Multi-dimensional array of rank N with element T
Storage lives on the accelerator
(illustration: an 8-row by 12-column two-dimensional array)
array_view<T,N>
View on existing data on the CPU or GPU
array_view<T,N>
array_view<const T,N>

vector<int> v(10);
extent<2> e(2,5);
array_view<int,2> a(e, v);

// the above two lines can also be written
// array_view<int,2> a(2,5,v);
Data Classes Comparison
array<T,N>
Rank at compile time
Extent at runtime
Rectangular
Dense
Container for data
Explicit copy
Capture by reference [&]

array_view<T,N>
Rank at compile time
Extent at runtime
Rectangular
Dense in one dimension
Wrapper for data
Implicit copy
Capture by value [=]
parallel_for_each(
    g,                                   // g is of type grid<N>
    [=](index<N> idx) restrict(direct3d)
    {
        // kernel code
    }
);
parallel_for_each
Executes the lambda once for each point in the extent
As-if synchronous in terms of visible side-effects
Example: Matrix Multiplication
void MatrixMultiplySerial(vector<float>& vC, const vector<float>& vA,
                          const vector<float>& vB, int M, int N, int W)
{
    for (int row = 0; row < M; row++) {
        for (int col = 0; col < N; col++) {
            float sum = 0.0f;
            for (int i = 0; i < W; i++)
                sum += vA[row * W + i] * vB[i * N + col];
            vC[row * N + col] = sum;
        }
    }
}
void MatrixMultiplyAMP(vector<float>& vC, const vector<float>& vA,
                       const vector<float>& vB, int M, int N, int W)
{
    array_view<const float,2> a(M,W,vA), b(W,N,vB);
    array_view<writeonly<float>,2> c(M,N,vC);
    parallel_for_each(c.grid, [=](index<2> idx) restrict(direct3d)
    {
        int row = idx[0];
        int col = idx[1];
        float sum = 0.0f;
        for (int i = 0; i < W; i++)
            sum += a(row, i) * b(i, col);
        c[idx] = sum;
    });
}
accelerator, accelerator_view
accelerator
e.g. a DX11 GPU, REF, the CPU
accelerator_view
a context for scheduling and memory management
(illustration: CPUs with system memory connected over PCIe to multiple GPUs)
Host  Accelerator (GPU example)
Data transfers between accelerator and host
Could be optimized away for an integrated memory architecture
Example: accelerator
// Identify an accelerator based on Windows device ID
accelerator myAcc("PCI\\VEN_1002&DEV_9591&CC_0300");
// …or enumerate all accelerators (not shown)

// Allocate an array on my accelerator
array<int> myArray(10, myAcc.default_view);

// …or launch a kernel on my accelerator
parallel_for_each(myAcc.default_view, myArrayView.grid, ...);
C++ AMP at a Glance (so far)
restrict(direct3d, cpu)
parallel_for_each
class array<T,N>
class array_view<T,N>
class index<N>
class extent<N>, grid<N>
class accelerator
class accelerator_view
Achieving maximum performance gains
Schedule threads in tiles
Avoid thread index remapping
Gain the ability to use tile_static memory

The parallel_for_each overload for tiles accepts
tiled_grid<D0>, tiled_grid<D0, D1>, or tiled_grid<D0, D1, D2>
and a lambda which accepts
tiled_index<D0>, tiled_index<D0, D1>, or tiled_index<D0, D1, D2>
extent<2> e(8,6);
grid<2> g(e);
g.tile<4,3>()
g.tile<2,2>()
(illustration: an 8×6 grid of threads shown untiled, partitioned into 4×3 tiles, and partitioned into 2×2 tiles)
tiled_grid, tiled_index
Given

array_view<int,2> data(8, 6, p_my_data);
parallel_for_each(data.grid.tile<2,2>(), [=](tiled_index<2,2> t_idx) … { … });

When the lambda is executed by the thread T at global position (6,3):
t_idx.global = index<2>(6,3)
t_idx.local = index<2>(0,1)
t_idx.tile = index<2>(3,1)
t_idx.tile_origin = index<2>(6,2)
(illustration: thread T highlighted within the 8×6 grid tiled 2×2)
tile_static, tile_barrier
Within the tiled parallel_for_each lambda we can use:
tile_static storage class for local variables
indicates that the variable is allocated in fast cache memory, i.e. shared by all threads in a tile of threads
only applicable in restrict(direct3d) functions
class tile_barrier
synchronizes all threads within a tile, e.g. t_idx.barrier.wait();
Example: Matrix Multiplication (tiled)

void MatrixMultSimple(vector<float>& vC, const vector<float>& vA,
                      const vector<float>& vB, int M, int N, int W)
{
    array_view<const float,2> a(M, W, vA), b(W, N, vB);
    array_view<writeonly<float>,2> c(M,N,vC);
    parallel_for_each(c.grid, [=](index<2> idx) restrict(direct3d)
    {
        int row = idx[0];
        int col = idx[1];
        float sum = 0.0f;
        for (int k = 0; k < W; k++)
            sum += a(row, k) * b(k, col);
        c[idx] = sum;
    });
}

void MatrixMultTiled(vector<float>& vC, const vector<float>& vA,
                     const vector<float>& vB, int M, int N, int W)
{
    static const int TS = 16;
    array_view<const float,2> a(M, W, vA), b(W, N, vB);
    array_view<writeonly<float>,2> c(M,N,vC);
    parallel_for_each(c.grid.tile<TS,TS>(),
                      [=](tiled_index<TS,TS> t_idx) restrict(direct3d)
    {
        int row = t_idx.local[0];
        int col = t_idx.local[1];
        float sum = 0.0f;
        for (int i = 0; i < W; i += TS) {
            tile_static float locA[TS][TS], locB[TS][TS];
            locA[row][col] = a(t_idx.global[0], col + i);
            locB[row][col] = b(row + i, t_idx.global[1]);
            t_idx.barrier.wait();
            for (int k = 0; k < TS; k++)
                sum += locA[row][k] * locB[k][col];
            t_idx.barrier.wait();
        }
        c[t_idx.global] = sum;
    });
}
C++ AMP at a Glance
restrict(direct3d, cpu)
parallel_for_each
class array<T,N>
class array_view<T,N>
class index<N>
class extent<N>, grid<N>
class accelerator
class accelerator_view

tile_static storage class
class tiled_grid< , , >
class tiled_index< , , >
class tile_barrier
Agenda checkpoint
Context
Code
IDE
Summary
Visual Studio 11
Organize, Edit, Design, Build, Browse, Debug, Profile
Demo: C++ AMP Parallel Debugger
C++ AMP Parallel Debugger
Well-known Visual Studio debugging features
Launch, Attach, Break, Stepping, Breakpoints, DataTips
Tool windows
Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
New features (for both CPU and GPU)
Parallel Stacks window, Parallel Watch window, Barrier
New GPU-specific features
Emulator, GPU Threads window, race detection
Demo: Concurrency Visualizer for GPU
Concurrency Visualizer for GPU
Direct3D-centric: supports any library/programming model built on it
Integrated GPU and CPU view
Goal is to analyze high-level performance metrics:
Memory copy overheads
Synchronization overheads across CPU/GPU
GPU activity and contention with other processes
Concurrency Visualizer for GPU
Team is exploring ways to provide data on:
GPU Memory Utilization
GPU HW Counters
Agenda checkpoint
Context
Code
IDE
Summary
Summary
Democratization of parallel hardware programmability
Performance for the mainstream
High-level abstractions in C++ (not C)
State-of-the-art Visual Studio IDE
Hardware abstraction platform
Intent is to make C++ AMP an open specification
Resources
Daniel Moth's blog (PM of C++ AMP)
http://www.danielmoth.com/Blog/
MSDN Native parallelism blog (team blog)
http://blogs.msdn.com/b/nativeconcurrency/
MSDN Dev Center for Parallel Computing
http://msdn.com/concurrency
MSDN Forums to ask questions
http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads
Feedback
Your feedback is very important! Please complete an evaluation form!
Thank you!
DEV: Hands-on labs
November 9–10 in the self-paced lab room; November 9 with an instructor
10:30 – 11:45  DEV201ILL: Visual Studio LightSwitch Fundamentals
13:00 – 14:15  DEV303ILL: Debugging with IntelliTrace Using Visual Studio 2010 Ultimate
14:30 – 15:45  DEV304ILL: Using Architecture Explorer to Analyze Code in Visual Studio 2010 Ultimate
16:00 – 17:15  DEV305ILL: Test Driven Development in Microsoft Visual Studio 2010
17:30 – 18:45  DEV302ILL: Web Performance and Load Testing Fundamentals with Visual Studio 2010 Ultimate
Questions?
DEV301  Maxim Goldin
Senior Developer
mgoldin@microsoft.com
http://blogs.msdn.com/b/mgoldin/
You can ask your questions at the "Ask the expert" zone within an hour after the end of this session
restrict(…)
Applies to functions (including lambdas)
Why restrict?
Target-specific language restrictions
Optimizations or special code-gen behavior
Future-proofing
Functions can have multiple restrictions
In the 1st release we are implementing direct3d and cpu
cpu is the implicit default
restrict(direct3d) restrictions
Can only call other restrict(direct3d) functions
All functions must be inlinable
Only direct3d-supported types:
int, unsigned int, float, double, structs & arrays of these types
Pointers and references:
Lambdas cannot capture by reference¹, nor capture pointers
References and single-indirection pointers supported only as local variables and function arguments
restrict(direct3d) restrictions
No recursion, 'volatile', virtual functions, pointers to functions, pointers to member functions, pointers in structs, or pointers to pointers
No goto or labeled statements, throw/try/catch, globals or statics, dynamic_cast or typeid, asm declarations, varargs, or unsupported types
e.g. char, short, long double
Example: restrict overloading
double bar(double) restrict(cpu, direct3d);  // 1: same code for both
double cos(double);                          // 2a: general code
double cos(double) restrict(direct3d);       // 2b: specific code

void SomeMethod(array<double> c)
{
    parallel_for_each(c.grid, [=](index<1> idx) restrict(direct3d)
    {
        //…
        double d1 = bar(c[idx]); // ok
        double d2 = cos(c[idx]); // ok, chooses direct3d overload
        //…
    });
}
Not Covered
Math library, e.g. acosf
Atomic operation library, e.g. atomic_fetch_add
Direct3D intrinsics: debugging (e.g. direct3d_printf), fences (e.g. __dp_d3d_all_memory_fence), float math (e.g. __dp_d3d_absf)
Direct3D interop: *get_device, create_accelerator_view, make_array, *get_buffer
Recommended