CyrilZeller_IntroductionToCUDACC


  • Slide 1/74

    Introduction to CUDA C

    QCon 2011

    Cyril Zeller, NVIDIA Corporation

  • 8/14/2019 CyrilZeller_IntroductionToCUDACC

    2/74

    NVIDIA Corporation 2011

    Welcome

    Goal: an introduction to GPU programming

    CUDA = NVIDIA's architecture for GPU computing

    What will you learn in this session?

    Start from Hello World!

    Write and execute C code on the GPU

    Manage GPU memory

    Manage communication and synchronization

  • Slide 3/74

    GPUs are Fast!

    [Bar charts comparing a CPU-only 1U server with a GPU-CPU 1U server:
    Performance (Gflops): 80.1 vs. 656.1
    Performance / $ (Gflops / $K): 11 vs. 60
    Performance / watt (Gflops / kW): 146 for the CPU server]

    CPU 1U Server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55 kW

    GPU-CPU 1U Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW

    8x Higher Linpack

  • Slide 4/74

    World's Fastest MD Simulation

    Sustained Performance of 1.87 Petaflops

    Institute of Process Engineering (IPE), Chinese Academy of Sciences (CAS)

    MD Simulation for Crystalline Silicon

    Used all 7168 Tesla GPUs of the Tianhe-1A GPU Supercomputer

  • Slide 5/74

    World's Greenest Petaflop Supercomputer

    Tsubame 2.0, Tokyo Institute of Technology

    1.19 Petaflops

    4,224 Tesla M2050 GPUs

  • Slide 6/74

    Increasing Number of Professional CUDA Applications

    Tools & Libraries: TotalView Debugger, Thrust C++ Template Library, R-Stream (Reservoir Labs), NVIDIA NPP Performance Primitives, Bright Cluster Manager, CAPS HMPP, PBS Works, EM Photonics CULAPACK, NVIDIA Video Libraries, CUDA C/C++, PGI Fortran, Parallel Nsight Visual Studio IDE, Allinea DDT Debugger, GPU packages for R stats package, IMSL, pyCUDA, ParaTools VampirTrace, PGI Accelerator, MAGMA, MOAB (Adaptive Computing), Torque (Adaptive Computing)

    Oil & Gas: Paradigm SKUA, StoneRidge RTM, Headwave Suite, Acceleware RTM Solver, GeoStar Seismic, ffA SVI Pro, OpenGeo Solutions OpenSEIS, Seismic City RTM, Paradigm GeoDepth RTM, VSG Open Inventor, Tsunami RTM, VSG Avizo, SVI Pro SEA 3D Pro 2010

    Numerical Analytics: AccelerEyes Jacket (MATLAB), Mathematica, MATLAB, LabVIEW Libraries

    Finance: Aquimin AlphaVision, NAG RNG, SciComp SciFinance, Hanweck Volera Options Analysis, Murex MACS, Numerix Counterparty Risk

    Other: Siemens 4D Ultrasound, Digisens CT, Schrodinger Core Hopping, Manifold GIS, Dalsa Machine Vision, MVTec Machine Vision, Useful Progress Medical Imaging, WRF Weather, ASUCA Weather Model

    Available Now

  • Slide 7/74

    Increasing Number of Professional CUDA Applications

    Bio-Chemistry: TeraChem, BigDFT, ABINIT, VMD, Acellera ACEMD, AMBER, DL-POLY, GROMACS, GROMOS, HOOMD, NAMD, GAMESS, LAMMPS

    Bio-Informatics: GPU-HMMER, MUMmerGPU, CUDA-BLASTP, CUDA-EC, CUDA-MEME, CUDA SW++, OpenEye ROCS, HEX Protein Docking, PIPER Docking

    EDA: Agilent ADS SPICE Sim, Remcom XFdtd, Agilent EMPro 2010, CST Microwave, SPEAG SEMCAD X, ANSOFT Nexxim, Synopsys TCAD

    CAE: ACUSIM/Altair AcuSolve, Autodesk Moldflow, ANSYS Mechanical, SIMULIA Abaqus/Std, FluiDyna Culises OpenFOAM, Metacomp CFD++, Impetus AFEA

    Rendering: Dassault Catia v6 (iray), NVIDIA OptiX (SDK), mental images iray (OEM), Bunkspeed Shot (iray), Refractive SW Octane, Lightworks Artisan & Author, Chaos Group V-Ray RT, Caustic OpenRL (SDK), Weta Digital PantaRay, Autodesk 3ds Max (iray), Works Zebra Zeany

    Video: Sorenson Squeeze 7, Fraunhofer JPEG2000, Elemental Live & Server, MotionDSP Ikena Video, MS Expression Encoder, MainConcept CUDA H.264, Adobe Premiere Pro

    Available Now

  • Slide 8/74

    CUDA by the Numbers

    CUDA-capable GPUs: 300,000,000
    CUDA Toolkit downloads: 500,000
    Active CUDA developers: 100,000
    Universities teaching CUDA: 400
    % of OEMs offering CUDA GPU PCs: 100

  • Slide 9/74

    GPU Computing Applications

    Programming languages and APIs: C, C++, Fortran, Java and Python wrappers, OpenCL*, DirectCompute

    Libraries and Middleware: cuFFT, cuBLAS, cuRAND, cuSPARSE, LAPACK (CULA, MAGMA), Thrust, NPP, cuDPP, VSIPL SVM, OpenCurrent, PhysX, OptiX, iray, RealityServer

    NVIDIA GPUs with the CUDA Parallel Computing Architecture:

    Fermi Architecture (compute capabilities 2.x): GeForce 400 Series, Quadro Fermi Series, Tesla 20 Series

    Tesla Architecture (compute capabilities 1.x): GeForce 8 / 9 / 200 Series, Quadro FX / Plex / NVS Series, Tesla 10 Series

    Markets: Professional Graphics, High Performance Computing, Entertainment

    * OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

  • Slide 10/74

    Tesla Data Center & Workstation GPU Solutions

    Tesla M-series GPUs (servers & blades): M2090 | M2070 | M2050
    Tesla C-series GPUs (workstations): C2070 | C...

                                  M2090        M2070       M2050       C2070
    Cores                         512          448         448         448
    Memory                        6 GB         6 GB        3 GB        6 GB
    Memory bandwidth (ECC off)    177.6 GB/s   150 GB/s    148.8 GB/s  148.8 GB/s
    Peak perf (Gflops), single    1331         1030        1030        1030
    Peak perf (Gflops), double    665          515         515         515

  • Slide 11/74

    NVIDIA Developer Ecosystem

    Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView

    Numerical Packages: MATLAB, Mathematica, NI LabView, pyCUDA

    GPU Compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python

    Parallelizing Compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP

    OEMs / GPGPU Consultants & Training: ANEO, GPU Tech

  • Slide 12/74

    Parallel Nsight (Visual Studio)

    Visual Profiler (Windows/Linux/Mac)

    cuda-gdb (Linux/Mac)

  • Slide 13/74

    What is CUDA?

    CUDA Architecture

    Expose GPU computing for general purpose

    Retain performance

    CUDA C/C++

    Based on industry-standard C/C++

    Small set of extensions to enable heterogeneous programming

    Straightforward APIs to manage devices, memory etc.

    This session introduces CUDA C/C++

  • Slide 14/74

    CONCEPTS

    Heterogeneous Computing

    Blocks

    Threads

    Indexing

    Shared memory

    __syncthreads()

    Asynchronous operation

    Handling errors

    Managing devices

  • Slide 15/74

    HELLO WORLD!

  • Slide 16/74

  • Slide 17/74

    Heterogeneous Computing

    #include <iostream>
    #include <algorithm>

    using namespace std;

    #define N 1024
    #define RADIUS 3
    #define BLOCK_SIZE 16

    __global__ void stencil_1d(int *in, int *out) {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int gindex = threadIdx.x + blockIdx.x * blockDim.x;
        int lindex = threadIdx.x + RADIUS;

        // Read input elements into shared memory
        temp[lindex] = in[gindex];
        if (threadIdx.x < RADIUS) {
            temp[lindex - RADIUS] = in[gindex - RADIUS];
            temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
        }

        // Synchronize (ensure all the data is available)
        __syncthreads();

        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[lindex + offset];

        // Store the result
        out[gindex] = result;
    }

  • Slide 18/74

    Simple Processing Flow

    1. Copy input data from CPU memory to GPU memory

    PCI Bus

  • Slide 19/74

    Simple Processing Flow

    1. Copy input data from CPU memory to GPU memory

    2. Load GPU code and execute it, caching data on chip for performance

    PCI Bus

  • Slide 20/74

    Simple Processing Flow

    1. Copy input data from CPU memory to GPU memory

    2. Load GPU program and execute, caching data on chip for performance

    3. Copy results from GPU memory to CPU memory

    PCI Bus

  • Slide 21/74

    Hello World!

    int main(void) {
        printf("Hello World!\n");
        return 0;
    }

    Standard C that runs on the host

    NVIDIA compiler (nvcc) can be used to compile programs with no device code

    Output:

    $ nvcc hello_world.cu
    $ a.out
    Hello World!
    $

  • Slide 22/74

    Hello World! with Device Code

    __global__ void mykernel(void) {
    }

    int main(void) {
        mykernel<<<1,1>>>();
        printf("Hello World!\n");
        return 0;
    }

    Two new syntactic elements...

  • Slide 23/74

    Hello World! with Device Code

    __global__ void mykernel(void) {
    }

    CUDA C/C++ keyword __global__ indicates a function that:

    Runs on the device

    Is called from host code

    nvcc separates source code into host and device components:

    Device functions (e.g. mykernel()) processed by the NVIDIA compiler

    Host functions (e.g. main()) processed by the standard host compiler (e.g. gcc, cl.exe)

  • Slide 24/74

  • Slide 25/74

    Hello World! with Device Code

    __global__ void mykernel(void) {
    }

    int main(void) {
        mykernel<<<1,1>>>();
        printf("Hello World!\n");
        return 0;
    }

    mykernel() does nothing, somewhat anticlimactic!

    Output:

    $ nvcc hello.cu
    $ a.out
    Hello World!
    $

  • Slide 26/74

    Parallel Programming in CUDA C/C++

    But wait... GPU computing is about massive parallelism!

    We need a more interesting example...

    We'll start by adding two integers and build up to vector addition

  • Slide 27/74

    Addition on the Device

    A simple kernel to add two integers:

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

    As before, __global__ is a CUDA C/C++ keyword meaning:

    add() will execute on the device

    add() will be called from the host

  • Slide 28/74

    Addition on the Device

    Note that we use pointers for the variables:

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

    add() runs on the device, so a, b and c must point to device memory

    We need to allocate memory on the GPU

  • Slide 29/74

  • Slide 30/74

    Addition on the Device: add()

    Returning to our add() kernel:

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

    Let's take a look at main()...

  • Slide 31/74

    Addition on the Device: main()

    int main(void) {
        int a, b, c;              // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = sizeof(int);

        // Allocate space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Setup input values
        a = 2;
        b = 7;

  • Slide 32/74

    Addition on the Device: main()

        // Copy inputs to device
        cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU
        add<<<1,1>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }
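    Pieced together, the two main() fragments above plus the add() kernel form a complete program. A minimal self-contained version (the printf at the end is an addition of ours, so the result is visible) might look like:

    // add.cu: assembled from the fragments above (sketch)
    #include <stdio.h>

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

    int main(void) {
        int a = 2, b = 7, c;          // host copies
        int *d_a, *d_b, *d_c;         // device copies
        int size = sizeof(int);

        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

        // One block of one thread, as on the slide
        add<<<1,1>>>(d_a, d_b, d_c);

        cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
        printf("%d + %d = %d\n", a, b, c);   // our addition

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

    Compile and run with: $ nvcc add.cu && ./a.out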

  • Slide 33/74

    Moving to Parallel

  • Slide 34/74

    Moving to Parallel

    GPU computing is about massive parallelism

    So how do we run code in parallel on the device?

    add<<<1,1>>>();

    add<<<N,1>>>();

    Instead of executing add() once, execute N times in parallel

  • Slide 35/74

  • Slide 36/74

    Vector Addition on the Device: add()

  • Slide 37/74

    Vector Addition on the Device: add()

    Returning to our parallelized add() kernel:

    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

    Let's take a look at main()...

  • Slide 38/74

    Vector Addition on the Device: main()

    #define N 512

    int main(void) {
        int *a, *b, *c;           // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = N * sizeof(int);

        // Alloc space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Alloc space for host copies of a, b, c and setup input values
        a = (int *)malloc(size); random_ints(a, N);
        b = (int *)malloc(size); random_ints(b, N);
        c = (int *)malloc(size);

  • Slide 39/74

    Vector Addition on the Device: main()

        // Copy inputs to device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU with N blocks
        add<<<N,1>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }
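    The slides call random_ints() but never define it; a minimal sketch of such a helper (name and signature from the slides, body our assumption) could be:

    #include <stdlib.h>

    // Hypothetical helper: fill an array with small random integers
    void random_ints(int *p, int n) {
        for (int i = 0; i < n; i++)
            p[i] = rand() % 100;
    }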

  • Slide 40/74

    Review (1 of 2)

    Difference between host and device

    Host = CPU

    Device = GPU

    Using __global__ to declare a function as device code

    Executes on the device

    Called from the host

    Passing parameters from host code to a device function

  • Slide 41/74

    Review (2 of 2)

    Basic device memory management

    cudaMalloc(), cudaMemcpy(), cudaFree()

    Launching parallel kernels

    Launch N copies of add() with add<<<N,1>>>(...);

    Use blockIdx.x to access block index

  • Slide 42/74

    INTRODUCING THREADS

  • Slide 43/74

    CUDA Threads

    Terminology: a block can be split into parallel threads

    Let's change add() to use parallel threads instead of parallel blocks:

    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

    becomes

    __global__ void add(int *a, int *b, int *c) {
        c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
    }

    We use threadIdx.x instead of blockIdx.x

    Need to make one change in main()...

  • Slide 44/74

    Vector Addition Using Threads: main()

    #define N 512

    int main(void) {
        int *a, *b, *c;           // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = N * sizeof(int);

        // Alloc space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Alloc space for host copies of a, b, c and setup input values
        a = (int *)malloc(size); random_ints(a, N);
        b = (int *)malloc(size); random_ints(b, N);
        c = (int *)malloc(size);

  • Slide 45/74

    Vector Addition Using Threads: main()

        // Copy inputs to device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU with N threads
        add<<<1,N>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

  • Slide 46/74

    Combining Blocks and Threads

  • Slide 47/74

    Combining Blocks and Threads

    We've seen parallel vector addition using:

    Several blocks with one thread each

    One block with several threads

    Let's adapt vector addition to use both blocks and threads

    Why? We'll come to that...

    First let's discuss data indexing...

  • Slide 48/74

    Indexing Arrays with Blocks and Threads

    No longer as simple as using blockIdx.x and threadIdx.x

    Consider indexing an array with one element per thread (8 threads/block)

    With M threads per block, a unique index for each thread is:

    int index = threadIdx.x + blockIdx.x * M;

    [Figure: four blocks of 8 threads; threadIdx.x runs 0..7 within each block
    while blockIdx.x runs 0..3 across the blocks]

  • Slide 49/74

    Indexing Arrays: Example

    Which thread will operate on the red element?

    int index = threadIdx.x + blockIdx.x * M;
              = 5 + 2 * 8;
              = 21;

    [Figure: with M = 8, the element at global index 21 belongs to the thread
    with threadIdx.x = 5 in the block with blockIdx.x = 2]

  • Slide 50/74

    Vector Addition with Blocks and Threads

    Use the built-in variable blockDim.x for threads per block:

    int index = threadIdx.x + blockIdx.x * blockDim.x;

    Combined version of add() to use parallel threads and parallel blocks:

    __global__ void add(int *a, int *b, int *c) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        c[index] = a[index] + b[index];
    }

    What changes need to be made in main()?

  • Slide 51/74

    Addition with Blocks and Threads: main()

    #define N (2048*2048)
    #define THREADS_PER_BLOCK 512

    int main(void) {
        int *a, *b, *c;           // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = N * sizeof(int);

        // Alloc space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Alloc space for host copies of a, b, c and setup input values
        a = (int *)malloc(size); random_ints(a, N);
        b = (int *)malloc(size); random_ints(b, N);
        c = (int *)malloc(size);

  • Slide 52/74

    Addition with Blocks and Threads: main()

        // Copy inputs to device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU
        add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

  • Slide 53/74

    Handling Arbitrary Vector Sizes

    Typical problems are not friendly multiples of blockDim.x

    Avoid accessing beyond the end of the arrays:

    __global__ void add(int *a, int *b, int *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)
            c[index] = a[index] + b[index];
    }

    Update the kernel launch (with M threads per block):

    add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);
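    Putting the pieces together, a complete, compilable version of this final vector addition (random_ints() and the spot-check printf are our additions, not from the deck) might look like:

    // vector_add.cu: self-contained sketch assembled from slides 50-53
    #include <stdio.h>
    #include <stdlib.h>

    #define N (2048*2048)
    #define M 512   // threads per block

    __global__ void add(int *a, int *b, int *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)
            c[index] = a[index] + b[index];
    }

    // Hypothetical helper, as sketched earlier
    void random_ints(int *p, int n) {
        for (int i = 0; i < n; i++)
            p[i] = rand() % 100;
    }

    int main(void) {
        int size = N * sizeof(int);
        int *a = (int *)malloc(size); random_ints(a, N);
        int *b = (int *)malloc(size); random_ints(b, N);
        int *c = (int *)malloc(size);

        int *d_a, *d_b, *d_c;
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Round the block count up so every element is covered
        add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);

        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        // Spot-check a few results
        for (int i = 0; i < 5; i++)
            printf("%d + %d = %d\n", a[i], b[i], c[i]);

        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }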

  • Slide 54/74

    Why Bother with Threads?

    Threads seem unnecessary

    They add a level of complexity

    What do we gain?

    Unlike parallel blocks, threads have mechanisms to efficiently:

    Communicate

    Synchronize

    To look closer, we need a new example...

  • Slide 55/74

    1D Stencil

  • Slide 56/74

    1D Stencil

    Consider applying a 1D stencil to a 1D array of elements

    Each output element is the sum of input elements within a radius

    If radius is 3, then each output element is the sum of 7 input elements

    [Figure: "in" and "out" arrays; each output element sums a 7-element
    window of the input]

  • Slide 57/74

    Implementing Within a Block

    Each thread processes one output element

    blockDim.x elements per block

    Input elements are read several times

    With radius 3, each input element is read seven times

    [Figure: threads 0-8 each read a radius-3 window of "in" to produce one
    element of "out"]

  • Slide 58/74

    Sharing Data Between Threads

    Terminology: within a block, threads share data via shared memory

    Extremely fast on-chip memory

    By opposition to device memory, referred to as global memory

    Like a user-managed cache

    Declare using __shared__, allocated per block

    Data is not visible to threads in other blocks

  • Slide 59/74

    Implementing With Shared Memory

    Cache data in shared memory:

    Read (blockDim.x + 2 * radius) input elements from global memory to shared memory

    Compute blockDim.x output elements

    Write blockDim.x output elements to global memory

    Each block needs a halo of radius elements at each boundary

    [Figure: the blockDim.x output elements of a block, with a halo of radius
    input elements on the left and right in shared memory]

  • Slide 60/74

    Stencil Kernel

    __global__ void stencil_1d(int *in, int *out) {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int gindex = threadIdx.x + blockIdx.x * blockDim.x;
        int lindex = threadIdx.x + RADIUS;

        // Read input elements into shared memory
        temp[lindex] = in[gindex];
        if (threadIdx.x < RADIUS) {
            temp[lindex - RADIUS] = in[gindex - RADIUS];
            temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
        }

  • Slide 61/74

    Stencil Kernel

        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[lindex + offset];

        // Store the result
        out[gindex] = result;
    }

  • Slide 62/74

    Data Race!

    The stencil example will not work...

    Suppose thread 15 reads the halo before thread 0 has fetched it:

    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

  • Slide 63/74

    __syncthreads()

    void __syncthreads();

    Synchronizes all threads within a block

    Used to prevent RAW / WAR / WAW hazards

    All threads must reach the barrier

    In conditional code, the condition must be uniform across the block
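    To illustrate the last point, a short sketch of ours (not from the deck): the first barrier is safe because every thread in a block evaluates the same condition, while the second can hang because only some threads reach it:

    // Safe: blockIdx.x is the same for all threads in a block,
    // so either all of them hit the barrier or none does
    if (blockIdx.x == 0) {
        __syncthreads();
    }

    // Broken: threadIdx.x differs per thread, so threads with
    // threadIdx.x >= RADIUS never reach the barrier
    if (threadIdx.x < RADIUS) {
        __syncthreads();
    }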

  • Slide 64/74

    Stencil Kernel

    __global__ void stencil_1d(int *in, int *out) {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int gindex = threadIdx.x + blockIdx.x * blockDim.x;
        int lindex = threadIdx.x + RADIUS;

        // Read input elements into shared memory
        temp[lindex] = in[gindex];
        if (threadIdx.x < RADIUS) {
            temp[lindex - RADIUS] = in[gindex - RADIUS];
            temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
        }

        // Synchronize (ensure all the data is available)
        __syncthreads();

  • Slide 65/74

    Stencil Kernel

        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[lindex + offset];

        // Store the result
        out[gindex] = result;
    }
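    The deck never shows host code for the stencil; a minimal driver under the same defines (the all-ones input and the final check are our assumptions) might be:

    // stencil.cu driver: host-side sketch for the stencil_1d kernel above
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024
    #define RADIUS 3
    #define BLOCK_SIZE 16

    // ... stencil_1d kernel exactly as on slides 64-65 ...

    int main(void) {
        // Allocate N elements plus a halo of RADIUS on each side
        int size = (N + 2 * RADIUS) * sizeof(int);
        int *in = (int *)malloc(size), *out = (int *)malloc(size);
        for (int i = 0; i < N + 2 * RADIUS; i++) in[i] = 1;

        int *d_in, *d_out;
        cudaMalloc((void **)&d_in, size);
        cudaMalloc((void **)&d_out, size);
        cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);

        // One thread per output element; offset the pointers past the halo
        stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

        cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

        // With all-ones input, every output element should be 2*RADIUS + 1 = 7
        printf("out[RADIUS] = %d\n", out[RADIUS]);

        free(in); free(out);
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }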

  • Slide 66/74

    Review (1 of 2)

    Launching parallel threads:

    Launch N blocks with M threads per block with kernel<<<N,M>>>(...);

    Use blockIdx.x to access the block index within the grid

    Use threadIdx.x to access the thread index within the block

    Allocate elements to threads:

    int index = threadIdx.x + blockIdx.x * blockDim.x;

  • Slide 67/74

    Review (2 of 2)

    Use __shared__ to declare a variable/array in shared memory

    Data is shared between threads in a block

    Not visible to threads in other blocks

    Use __syncthreads() as a barrier

    Use to prevent data hazards

  • Slide 68/74

    Coordinating Host & Device

  • Slide 69/74

    Coordinating Host & Device

    Kernel launches are asynchronous

    Control returns to the CPU immediately

    CPU needs to synchronize before consuming the results:

    cudaMemcpy(): blocks the CPU until the copy is complete; the copy begins once all preceding CUDA calls have completed

    cudaMemcpyAsync(): asynchronous, does not block the CPU

    cudaDeviceSynchronize(): blocks the CPU until all preceding CUDA calls have completed
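    A small self-contained sketch of this pattern (the kernel and buffer names are ours):

    #include <stdio.h>

    __global__ void kernel(int *out) { out[0] = 42; }

    int main(void) {
        int h_out = 0, *d_out;
        cudaMalloc((void **)&d_out, sizeof(int));

        kernel<<<1,1>>>(d_out);   // returns immediately
        cudaDeviceSynchronize();  // block until the kernel has finished

        // A blocking cudaMemcpy would also have waited for the kernel
        cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d\n", h_out);

        cudaFree(d_out);
        return 0;
    }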

  • Slide 70/74

    Reporting Errors

    All CUDA API calls return an error code (cudaError_t):

    Error in the API call itself, OR

    Error in an earlier asynchronous operation (e.g. kernel)

    Get the error code for the last error:

    cudaError_t cudaGetLastError(void)

    Get a string to describe the error:

    char *cudaGetErrorString(cudaError_t)

    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
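    In practice these calls are often wrapped in a checking macro; a minimal sketch of ours (not from the deck):

    #include <stdio.h>
    #include <stdlib.h>

    // Wrap any CUDA API call and abort with a readable message on failure
    #define CUDA_CHECK(call)                                          \
        do {                                                          \
            cudaError_t err = (call);                                 \
            if (err != cudaSuccess) {                                 \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",          \
                        __FILE__, __LINE__, cudaGetErrorString(err)); \
                exit(EXIT_FAILURE);                                   \
            }                                                         \
        } while (0)

    // Usage: CUDA_CHECK(cudaMalloc((void **)&d_a, size));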

  • Slide 71/74

    Introduction to CUDA C/C++

  • Slide 72/74

    What have we learned?

    Write and launch CUDA C/C++ kernels

    - __global__, <<<>>>, blockIdx, threadIdx, blockDim

    Manage GPU memory

    - cudaMalloc(), cudaMemcpy(), cudaFree()

    Manage communication and synchronization

    - __shared__, __syncthreads()

    - cudaMemcpy() vs. cudaMemcpyAsync(), cudaDeviceSynchronize()

  • Slide 73/74

    Topics we skipped

    We skipped some details; you can learn more:

    CUDA Programming Guide

    CUDA Zone: tools, training, webinars and more

    http://developer.nvidia.com/cuda

    Next step:

    Download the CUDA toolkit at www.nvidia.com/getcuda

    Accelerate your application with GPUs!

  • Slide 74/74