52
1 © 2018 The MathWorks, Inc. GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar

GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

1© 2018 The MathWorks, Inc.

GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB

Girish VenkataramaniArvind JayaramanJaya Shankar

Page 2: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

2

GPUs and CUDA programming

Ease of programming(expressivity, algorithmic, …)

Perfo

rman

ceeasier

faster

MATLAB

Python

CUDA

OpenCL

C/C++

GPUs are “hardware on steroids”, … but, programming them is hard

GPU Coder

Page 3: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

3

Consider an example: saxpy

Scalarized MATLAB

Vectorized MATLAB

Automatic compilation from an expressive language to a high-performance language

Page 4: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

4

Deep Learning applications

TensorRT is great for inference, … but, applications require more than inference

TensorRT

Traffic Sign Recognition

Deep learning

Page 5: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

5

GPU Coder is new technology released in September 2017

Neural NetworksDeep Learning, machine learning

Image Processing and Computer Vision

Image filtering, feature detection/extraction

Signal Processing and Communications

FFT, filtering, cross correlation,

5x faster than TensorFlow2x faster than mxnet

60x faster than CPUs for stereo disparity

20x faster than CPUs for FFTs

Accelerated implementation of parallel algorithms on GPUs

Page 6: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

6

Example: Lidar semantic segmentation

Page 7: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

7

Talk outline

Introduction

GPU Coder internals– Automatic parallelization

– Memory optimization

– Deep learning compilation

Application demo: Lidar processing in MATLAB using deep learning

Page 8: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

8

GPU Coder automatically extracts parallelism from MATLAB

1. Scalarized MATLAB (“for-all” loops)

2. Vectorized MATLAB(math operators and library functions)

3. Composite functions in MATLAB(maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT)

Infer CUDA kernels from MATLAB loops

Vendor library replacement

Page 9: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

9

static __global__ mykernel(A, X, Y, C, n){

int k = getThreadIndex(N);

int t = A[k] * X[k];C[k] = t + Y[k];

}

From a loop to a CUDA kernel

for k = 1:nt = A(k) .* X(k);C(k) = t + Y(k);

end

{ …mykernel<<< f(n) >>>(A, X, Y, C, n);

…}

Create kernel from loop bodyCompute kernel size

Y

f(n)

Classify kernel variables (input, output, local)

Ins: A, X, Y, nOuts: CLocal: t, k

Is this loop parallel?

Dependence analysis to understand the iteration space

Extracting parallelism in MATLAB1. Scalarized MATLAB (for loops)2. Vectorized MATLAB3. Composite functions

Page 10: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

10

From a loop-nesting to CUDA kernels

for i = 1:pfor j = 1:m

for k = 1:n…(inner loop)…

endend

end

for i = 1:p…(outer prologue code)…for j = 1:m

for k = 1:n…(inner loop)…

end…(outer epilogue code)…

endend

Perfect LoopsImperfect Loops

Fundamental requirement: Loops need to be contiguous for parallelization

Extracting parallelism in MATLAB1. Scalarized MATLAB (for loops)2. Vectorized MATLAB3. Composite functions

Page 11: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

11

From a loop-nesting to CUDA kernels

for i = 1:pfor j = 1:m

for k = 1:n… (outer prologue code) ……(inner loop)…if k == n

… (outer epilogue code) …end

endend

end

for i = 1:p…(outer prologue code)…for j = 1:m

for k = 1:n…(inner loop)…

end…(outer epilogue code)…

endend

Perfect LoopsImperfect Loops

Fundamental requirement: Loops need to be contiguous for parallelization

Loop Perfectization(if possible)

Extracting parallelism in MATLAB1. Scalarized MATLAB (for loops)2. Vectorized MATLAB3. Composite functions

Page 12: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

12

for a = 1:Mfor b = 1:N

…(outer prologue code)…for c = 1:K

for d = 1:P…(inner loop)…

endend

endend

From a loop-nesting to CUDA kernels

Find parallel loops

Dependence analysis

Partition loop nesting

Heuristic may favor larger iteration space

for i = 1:Pfor a = 1:M

for b = 1:N…(inner loop)…

endendfor x = 1:K

for y = 1:L…(inner loop)…

endend

end

(K x P)

(P)

(MxN + KxL)

Example 1 Example 2

Fundamental requirement: Loops need to be contiguous for parallelization

(M x N)

Extracting parallelism in MATLAB1. Scalarized MATLAB (for loops)2. Vectorized MATLAB3. Composite functions

Page 13: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

13

for a = 1:Mfor b = 1:N

…(outer prologue code)…for c = 1:K

for d = 1:P…(inner loop)…

endend

endend

From a loop-nesting to CUDA kernels

Create kernel for each partitionFind parallel loops

Dependence analysis

Partition loop nesting

Heuristic may favor larger iteration space

Use process from single loop conversion

for i = 1:Pfor a = 1:M

for b = 1:N…(inner loop)…

endendfor x = 1:K

for y = 1:L…(inner loop)…

endend

end

(K x P)

(P)

(MxN + KxL)

Example 1 Example 2

Fundamental requirement: Loops need to be contiguous for parallelization

(M x N)

Extracting parallelism in MATLAB1. Scalarized MATLAB (for loops)2. Vectorized MATLAB3. Composite functions

Page 14: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

14

From vectorized MATLAB to CUDA kernels

output(:, 1) = (input(:, 1) – x_im) .* factor;

Loop Fusion

Create larger parallel loops (and hence CUDA kernels)

Scalarization

Reduce to for-loops

for i = 1:Mdiff(i) = input(i, 1) – x_im(i);

endfor a = 1:M

output(i, 1) = diff(i) * factor(i);end

for i = 1:Mdiff(i) = input(i, 1) – x_im(i);output(i, 1) = diff(i) * factor(i);

end

Assume the following sizes: ‘output’ : M x 3‘input’ : M x 3‘x_im’ : M x 1‘factor’ : M x 1

Extracting parallelism in MATLAB1. Scalarized MATLAB (for loops)2. Vectorized MATLAB3. Composite functions

Page 15: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

15

From vectorized MATLAB to CUDA kernels

output(:, 1) = (input(:, 1) – x_im) .* factor;

Loop Fusion

Create larger parallel loops (and hence CUDA kernels)

Scalar Replacement

Reduce temp matrices to temp scalars

Scalarization

Reduce to for-loops

for i = 1:Mdiff(i) = input(i, 1) – x_im(i);

endfor a = 1:M

output(i, 1) = diff(i) * factor(i);end

for i = 1:Mdiff(i) = input(i, 1) – x_im(i);output(i, 1) = diff(i) * factor(i);

end

for i = 1:Mtmp = input(i, 1) – x_im(i);output(i, 1) = tmp * factor(i);

end

Assume the following sizes: ‘output’ : M x 3‘input’ : M x 3‘x_im’ : M x 1‘factor’ : M x 1

Extracting parallelism in MATLAB1. Scalarized MATLAB (for loops)2. Vectorized MATLAB3. Composite functions

Page 16: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

16

From composite functions to optimized CUDA

• imfilter• imresize• imerode• imdilate• bwlabel• imwarp• …

• Deep learning inference (cuDNN, TensorRT)

• …

• Matrix multiply (cuBLAS)• Linear algebra (cuSolver)• FFT functions (cuFFT)• Convolution• …

Core math Image processingComputer vision

Neural Networks

Extracting parallelism in MATLAB1. Scalarized MATLAB (for loops)2. Vectorized MATLAB3. Composite functions

Over 300+ MATLAB functions are optimized for CUDA code generation

Page 17: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

17

Talk outline

Introduction

GPU Coder internals– Automatic parallelization

– Memory optimization

– Deep learning compilation

Application demo: Lidar processing in MATLAB using deep learning

Page 18: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

18

Optimizing CPU-GPU data movement is a challengeA = ……for i = 1:N

… A(i)end…

imfilter…

A = ……

kernel1<<<…>>>( )

imfilter_kernel(…)

A = ……cudaMemcpyHtoD(gA, a);kernel1<<<…>>>(gA)cudaMemcpyDtoH(…)…cudaMemcpyHtoD(…) imfilter_kernel(…)cudaMemcpyDtoH(…)…

Where is the ideal placement of memcpy?

Naïve placement

K1

K2

K3

Optimized placement

K1

K2

K3

Page 19: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

19

GPU Coder optimizes memcpy placementA(:) = ….C(:) = ….

for i = 1:N….gB = kernel1(gA);gA = kernel2(gB);if (some_condition)

gC = kernel3(gA, gB);end….

end

…. = C;

cudaMemcpy*definitely* needed

cudaMemcpy*not* needed

cudaMemcpy*may be* needed

Observations• Equivalent to Partial redundancy elimination (PRE) • Dynamic strategy – track memory location with a

status flag per variable• Use-Def to determine where to insert memcpy

A(:) = …A_isDirtyOnCpu = true;…for i = 1:N

if (A_isDirtyOnCpu)cudaMemcpy(gA, A);A_isDirtyOnCpu = false;

endgB = kernel1(gA);gA = kernel2(gB);if (somecondition)

gC = kernel3(gA, gB);C_isDirtyOnGpu = true;

end…

end…if (C_isDirtyOnGpu)

cudaMemcpy(C, gC);C_isDirtyOnGpu = false;

end… = C;

Assume gA, gB and gC are mapped to GPU memory

Generated (pseudo) code

Page 20: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

20

GPU memory hierarchy is deep

Memory Visibility Heuristics/notes GPU Coder support

Global memory CPU + GPU Share data b/w CPU and GPU Yes

Local memory/registers per GPU thread Thread local data Yes

Shared memory per GPU block Shared data between threads Yes

Texture memory CPU + GPU Shared read-only data with 2D alignment Future

Constant memory GPU Read-only constants Yes

Page 21: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

21

Dotprod

Input image Conv. kernel Output image

rows

cols

kw

kh

GPU Coder automatically maps data to shared memory

GPU Coder automatically creates shared memory for many MATLAB image processing functions:

imfilter, imerode, imdilate, conv2, …

Page 22: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

22

GPU Coder runs a host of compiler transforms to generate CUDA

Control-flow graphIntermediate representation

(CFG – IR)

….

….

CUDA kernel optimizations

Front – end

Traditional compiler optimizations

MATLAB Library function mapping

Parallel loop creation

CUDA kernel creation

cudaMemcpy minimization

Shared memory mapping

CUDA code emission

Scalarization

Loop perfectization

Loop interchange

Loop fusion

Scalar replacement

Loopoptimizations

Page 23: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

23

Demo: Stereo disparity

Left camera Right camera

Disparity map

Stereo disparity

Page 24: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

24

Easily re-target to Jetson and Drive platforms

Cross-compile for NVIDIA boards– Jetson boards– DrivePX2

Two small changes1. Change build-type to ‘lib’

2. Select cross-compile toolchain

Page 25: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

25

Talk outline

Introduction

GPU Coder internals– Automatic parallelization

– Memory optimization

– Deep learning compilation

Application demo: Lidar processing in MATLAB using deep learning

Page 26: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

26

Deep learning workflow in MATLAB

Train in MATLAB

Model importer

Trained DNN

Model importer

DNNdesign + training

Design in MATLAB Manage large image sets Automate data labeling Easy access to models

Training in MATLAB Acceleration with GPU’s Scale to clusters

Page 27: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

27

Deep learning workflow in MATLAB

Train in MATLAB

Model importer

Trained DNN

Application logic

Model importer

DNNdesign + training

Applicationdesign

Page 28: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

28

Deep learning workflow in MATLAB

Train in MATLAB

Model importer

Trained DNN

Application logic

Model importer

Applicationdesign

C++/CUDA + TensorRT

C++/CUDA + cuDNN

Standalone Deployment

DNNdesign + training

Page 29: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

29

Deep learning workflow in MATLAB

Train in MATLAB

Model importer

Trained DNN

Application logic

Model importer

C++/CUDA + TensorRT

C++/CUDA + cuDNN

Standalone Deployment

DNNdesign + training

Applicationdesign

Page 30: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

30

Traffic sign detection and recognition

Object detection

DNN

Strongest Bounding

Box

Classifier DNN

Page 31: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

31

C++/CUDA + TensorRT

C++/CUDA + cuDNN

C++/CUDA + TensorRT (int8)

GPU Coder allows for choice in deployment (cuDNN, TensorRT)

Application logic

Page 32: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

32

Performance summary (VGG-16) on TitanXP

0 50 100 150 200 250 300 350 400

MATLAB (cuDNN fp32)

GPU Coder (cuDNN fp32)

GPU Coder (TensorRT fp32)

GPU Coder (TensorRT int8)

Page 33: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

33

MATLAB and GPU Coder support state-of-art classification networks

GoogLeNetResNet50

Alexnet vs Squeezenet

Network # Layers Size Frame-rate (GPU Coder)

Alexnet 25 227 MB 500 FpsResNet50 177 96 MB 160 FpsGoogLeNet 144 27 MB 190 FpsSqueezenet 68 5 MB 615 Fps

Page 34: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

34

Alexnet Inference on NVIDIA Titan Xp

MATLAB GPU Coder +TensorRT 3.0.1MATLAB GPU Coder +cuDNN

Fram

es p

er s

econ

d

Batch SizeCPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz

GPU Pascal Titan Xp

cuDNN v7

Testing platform

mxNet (1.1.0)

MATLAB GPU Coder +TensorRT 3.0.1 (int8)

TensorFlow (1.6.0)

Page 35: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

35

VGG-16 Inference on NVIDIA Titan Xp

MATLAB GPU Coder +TensorRT 3.0.1MATLAB GPU Coder +cuDNN

Fram

es p

er s

econ

d

Batch SizeCPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz

GPU Pascal Titan Xp

cuDNN v7

Testing platform

mxNet (1.1.0)

MATLAB GPU Coder +TensorRT 3.0.1 (int8)

TensorFlow (1.6.0)

Page 36: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

36

Talk outline

Introduction

GPU Coder internals– Automatic parallelization

– Memory optimization

– Deep learning compilation

Application demo: Lidar processing in MATLAB using deep learning

Page 37: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

37

LiDAR Processing for Autonomous Vehicles

Design Deep learning-based LiDAR algorithm in MATLAB• Automate ground-truth labeling • Pre-process LiDAR data for training • GPU accelerated training

High Performance Inference

Page 38: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

38

Why Use LiDAR and Deep Learning for Autonomous Vehicles ?

Why use LiDAR ?– Provides accurate 3-D structure of scene– Required sensor for autonomous driving

Why use deep learning ?– Best accuracy– High-performance inference with GPU Coder

CarsGround

LiDAR Sematic Segmentation

Classifying Point Cloud Clusters

Page 39: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

39

What Does LiDAR Data Look Like ?

Page 40: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

40

Lidar processing application design is easy in MATLAB

Train in MATLAB

Model importer

Trained DNN

Application logic

Model importer

C++/CUDA + TensorRT

C++/CUDA + cuDNN

Standalone Deployment

DNNdesign + training

Applicationdesign

Page 41: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

41

Lidar processing application design is easy in MATLAB

Trained DNN

DNNdesign + training

Data prep, labeling

Training

Application logic

C++/CUDA + TensorRT

C++/CUDA + cuDNN

Standalone Deployment

Applicationdesign

Page 42: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

42

Data preparation and labeling of Lidar is a challenge

Trained DNN

DNNdesign + training

Accessing lidar data

TrainingLidar pre-processing

Labeling lidar data

Page 43: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

43

Access and Visualize LiDAR Data

Access Stored LiDAR Data Velodyne file I/O (pcap) Individual point clouds (.pcd,ply)

Visualize LiDAR Data Streaming LiDAR player Static point cloud display Point cloud differences

Page 44: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

44

LiDAR Preprocessing

ROI + Remove Ground• Fit plane using RANSAC

Cluster• Segment clusters using

Euclidean distance

Page 45: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

45

Ground Truth Labeling of LiDAR Data

Page 46: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

46

Ground Truth Labeling of LiDAR Data

Page 47: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

47

Ground Truth Labeling of LiDAR Data

Page 48: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

48

Organize Data for Training

Raw Point Cloud Data

Ground Truth Labels Transformed to Label Mask

Project to 2D

Page 49: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

49

Create Network Architecture

Easy API to create network structure

Page 50: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

50

Deployment using GPU Coder

C++/CUDA + TensorRT

C++/CUDA + cuDNN

Page 51: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

51

Deep learning workflow in MATLAB

Train in MATLAB

Model importer

Trained DNN

Application logic

Model importer

C++/CUDA + TensorRT

C++/CUDA + cuDNN

Standalone Deployment

DNNdesign + training

Applicationdesign

Page 52: GPU Coder: Automatic CUDA and TensorRT code generation ... · Automatic CUDA and TensorRT code generation from MATLAB Girish Venkataramani Arvind Jayaraman Jaya Shankar. 2 GPUs and

52

Check Out Deep Learning in MATLAB and GPU Coder

GPU Coderhttps://www.mathworks.com/products/gpu-coder.html

Deep learning in MATLABhttps://www.mathworks.com/solutions/deep-learning.html

Deep learning On-Ramp: Free self-paced, online traininghttps://matlabacademy.mathworks.com/R2017b/portal.html?course=deeplearning