High-Performance GPU Programming for Deep Learning


High-Performance GPU Programming for Deep Learning

7 April 2016 Scott Gray

Nervana Systems

MAKING MACHINES SMARTER.™

Proprietary and confidential. Do not distribute.

High-Performance GPU kernels for deep learning


• Fast matrix multiply for small minibatches

• Direct convolution leveraging GEMM advances

• Even faster convolution with Winograd


GEMM: Basics


C = AB
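The product C = AB can be built as a sum of K rank-1 (outer product) updates, which is how the kernels on the following slides accumulate an output tile. A minimal numpy sketch (shapes here are illustrative, not the kernel's tile sizes):

```python
import numpy as np

# C = AB accumulated as K rank-1 updates: C = sum_k outer(A[:, k], B[k, :]).
# A GPU GEMM kernel accumulates a tile of C this way, loading one column of
# A and one row of B per step of the K loop.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8))
B = rng.standard_normal((8, 3))

C = np.zeros((4, 3))
for k in range(A.shape[1]):
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)
```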


GEMM: Memory Load

[Figure: memory-load patterns, threads vs memory load for a single tile and for batched GEMM; outer product contiguous vs outer product strided]

GEMM: Tile sizes

[Figure: threads vs shared-memory load for a 32x32 GEMM tile, a 32x64 GEMM tile, and batched GEMM 32x32 tiles]
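The tile-size slides contrast one large GEMM tile with a batch of many small independent tiles. A numpy sketch of batched GEMM semantics (problem sizes here are illustrative):

```python
import numpy as np

# Batched GEMM: many independent small multiplies issued as one call,
# C[l] = A[l] @ B[l] for each problem l in the batch.
rng = np.random.default_rng(1)
L, M, K, N = 5, 32, 16, 32
A = rng.standard_normal((L, M, K))
B = rng.standard_normal((L, K, N))

C = np.matmul(A, B)   # numpy broadcasts matmul over the leading batch dim

assert np.allclose(C[2], A[2] @ B[2])
```

Small tiles waste less work on narrow problems (small minibatch N) at the cost of more blocks; batching keeps occupancy up.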

hGEMM Results - NN

[Chart: Nx3072x3072 NN op, GFLOPS (0 to 6000) vs batch size N (32, 64, 96, 128); Nervana 32x32 vs cuBLAS 128x64]

hGEMM Results - TN

[Chart: Nx3072x3072 TN op, GFLOPS (0 to 6000) vs batch size N (32, 64, 96, 128); Nervana 32x32 vs cuBLAS 128x64]

Direct convolution is still relevant

• Striding

• Odd-size filters

• A placeholder until a faster algorithm can be implemented

• Often faster for a single image or for the first layer, where C is small


Direct convolution: implementation details

• Batched GEMM for efficient transpose and higher occupancy

• Compound outer-product block remapping

• Square-wave pattern for P,Q block mapping

• Slicing: shared-memory lookup + integer division

• N-contiguous vs C-contiguous layouts

• Single P,Q vs tiled P,Q

• Bprop as an upside-down fprop

• Update-specific optimizations
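The kernels above do this slicing implicitly inside the GEMM. As a reference point for what the lowering computes, the classic explicit im2col form is sketched below (function names and shapes are illustrative, not Nervana's implementation):

```python
import numpy as np

def conv2d_im2col(x, w):
    """Valid convolution (cross-correlation), stride 1, via im2col + GEMM.
    x: (C, H, W) input; w: (K, C, R, S) filters -> (K, P, Q) output."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    P, Q = H - R + 1, W - S + 1
    # Gather the (C*R*S)-long input slice touched by each output pixel
    cols = np.empty((C * R * S, P * Q))
    for p in range(P):
        for q in range(Q):
            cols[:, p * Q + q] = x[:, p:p + R, q:q + S].ravel()
    # One GEMM: (K, CRS) x (CRS, PQ) -> (K, PQ)
    return (w.reshape(K, -1) @ cols).reshape(K, P, Q)

# Check against a direct triple loop
rng = np.random.default_rng(2)
x = rng.standard_normal((3, 6, 6))
w = rng.standard_normal((4, 3, 3, 3))
ref = np.zeros((4, 4, 4))
for k in range(4):
    for p in range(4):
        for q in range(4):
            ref[k, p, q] = np.sum(x[:, p:p + 3, q:q + 3] * w[k])
assert np.allclose(conv2d_im2col(x, w), ref)
```

Doing the gather in-kernel (rather than materializing `cols`) is what makes direct convolution inherit the GEMM advances above.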


Winograd: input transform

Input feature map: 4x4 tiles, stride 2

• Input transform

• 2D Winograd is a nested product of 1D transforms

• Transforms can be simplified to remove zeros
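For F(2x2, 3x3), the input transform applies a small 0/±1 matrix (commonly written B^T) on both sides of each overlapping 4x4 input tile. A numpy sketch of the steps above:

```python
import numpy as np

# F(2x2, 3x3) input transform: V = Bt @ d @ Bt.T on each 4x4 tile
# (tiles overlap with stride 2). Bt contains only 0 and ±1, so in a
# kernel the multiplies disappear and the zeros are removed.
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)

d = np.arange(16, dtype=np.float64).reshape(4, 4)  # one 4x4 input tile
V = Bt @ d @ Bt.T                                  # transformed tile

# "Nested product of 1D transforms": transform each row, then each column
rows = d @ Bt.T
V2 = Bt @ rows
assert np.allclose(V, V2)
```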


Winograd: filter transform

• Filter transform

• Same as the input transform but with different coefficients

• Transform each feature map independently
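The filter transform has the same nested-1D structure, with the matrix commonly written G; it maps each 3x3 filter to a 4x4 tile, once per (output map, input map) pair. A sketch:

```python
import numpy as np

# F(2x2, 3x3) filter transform: U = G @ g @ G.T maps a 3x3 filter
# to a 4x4 tile. Same nested-1D form as the input transform, but
# with different (0, 1, 0.5) coefficients.
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

g = np.arange(9, dtype=np.float64).reshape(3, 3)  # one 3x3 filter
U = G @ g @ G.T
assert U.shape == (4, 4)
```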


Winograd: batched GEMM

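After both transforms, each of the 4x4 = 16 transform positions becomes an independent GEMM over the channel dimension, which is where the batched GEMM machinery from earlier slides is reused. A numpy sketch (K, C, and the tile count are illustrative):

```python
import numpy as np

# For each transform position t (16 of them for F(2x2, 3x3)):
#   M[t] = U[t] @ V[t],  U[t]: (K, C) transformed filters,
#                        V[t]: (C, T) transformed input tiles,
# with K output maps, C input channels, T input tiles.
rng = np.random.default_rng(0)
K, C, T = 8, 4, 6
U = rng.standard_normal((16, K, C))
V = rng.standard_normal((16, C, T))

M = np.einsum('tkc,tcn->tkn', U, V)   # 16 batched GEMMs

assert np.allclose(M[3], U[3] @ V[3])
```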

Winograd: output transform

Output feature map

• Output transform

• Same form as the input and filter transforms

• Transform back to pixel space to obtain a 2x2 output tile
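Putting the three transforms together on a single tile, the output transform (matrix commonly written A^T) recovers exactly the 2x2 block a direct valid convolution would produce. A self-contained check:

```python
import numpy as np

# Full F(2x2, 3x3) pipeline on one tile: transform input and filter,
# multiply elementwise in transform space, then At @ M @ At.T maps
# back to pixel space as a 2x2 output tile.
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

rng = np.random.default_rng(3)
d = rng.standard_normal((4, 4))   # input tile
g = rng.standard_normal((3, 3))   # filter

M = (G @ g @ G.T) * (Bt @ d @ Bt.T)
Y = At @ M @ At.T                 # 2x2 output tile

# Direct valid convolution (cross-correlation) on the same tile
ref = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                for i in range(2)])
assert np.allclose(Y, ref)
```

Four outputs from 16 elementwise multiplies, versus 36 multiplies for the direct form: the 2.25x arithmetic reduction behind the speedups on the next slides.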


Performance: VGG

[Chart: VGG fp32 totals by operation, algorithmic speedup (0 to 2) vs batch size (64, 32, 16, 8, 4, 2, 1); Winograd fp32 fprop/bprop/update vs cuDNN fp32 fprop/bprop/update]


Performance: Alexnet convolutional layers

[Chart: Alexnet totals, algorithmic speedup (0 to 2) vs batch size (128, 64, 32, 16, 8, 4); Nervana fp16/fp32 vs cuBLAS fp16/fp32]


Compounding

Compounding inside of GEMM and conv for free:

• alpha / beta

• bias

• relu, prelu, tanh, …

• bprop relu, …

• bprop bias

• batchnorm mean
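A numpy sketch of the compounded epilogue's semantics (in the kernel these ops are fused while the accumulator is still in registers; the function name and signature here are illustrative):

```python
import numpy as np

def gemm_compound(A, B, bias, alpha=1.0, beta=0.0, C_prev=0.0):
    # C = relu(alpha * A @ B + beta * C_prev + bias), computed in one pass
    acc = alpha * (A @ B) + beta * C_prev
    acc += bias                     # bias broadcast across rows
    return np.maximum(acc, 0.0)     # relu; bprop-relu/bias compound similarly

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 16))
B = rng.standard_normal((16, 4))
bias = rng.standard_normal(4)

out = gemm_compound(A, B, bias)
assert np.allclose(out, np.maximum(A @ B + bias, 0.0))
```

Fusing these pointwise ops into the GEMM/conv avoids extra round trips through device memory, which is why they are "for free".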


Summary


• Nervana has the fastest tools for deep learning

• neon with state-of-the-art Maxwell kernels

• Nervana Cloud with multi-GPU training

• Watch for Nervana Engine, our deep learning processor
