Deploying AI on Jetson Xavier/DRIVE Xavier with TensorRT and MATLAB

Jaya Shankar, Engineering Manager (Deep Learning Code Generation)

Avinash Nehemiah, Principal Product Manager (Computer Vision, Deep Learning, Automated Driving)

© 2019 The MathWorks, Inc.

Outline

▪ Ground Truth Labeling
▪ Network Design and Training
▪ CUDA and TensorRT Code Generation
▪ Jetson Xavier and DRIVE Xavier Targeting

Key Takeaways

▪ Optimized CUDA and TensorRT code generation
▪ Jetson Xavier and DRIVE Xavier targeting
▪ Processor-in-loop (PIL) testing and system integration
▪ Platform Productivity: Workflow automation, ease of use
▪ Framework Interoperability: ONNX, Keras-TensorFlow, Caffe

Example Used in Today’s Talk

AI Application
▪ Lane Detection Network → Co-ordinate Transform
▪ YOLOv2 Network → Bounding Box Processing

Ground Truth Labeling

Unlabeled Training Data → Labels for Training

Interactive Tools for Ground Truth Labeling

ROI Labels
• Bounding boxes
• Pixel labels
• Polylines

Scene Labels

Automate Ground Truth Labeling

Pre-built Automation

User authored automation

Automating Labeling of Lane Markers

Automate Labeling of Bounding Boxes for Vehicles

Export Labeled Data for Training

Bounding Box Labels

Polyline Labels

Lane Detection Algorithm

Pretrained Network (e.g. AlexNet) → Modify Network for Lane Detection → Coefficients of parabola → Transform to Image Coordinates

Lane Detection: Load Pretrained Network

Lane Detection Network
▪ Regression CNN for lane parameters
▪ MATLAB code to transform to image co-ordinates

>> net = alexnet

>> deepNetworkDesigner

View Network in Deep Network Designer App

Remove Layers from AlexNet

Add Regression Output for Lane Parameters

Regression Output for Lane Coefficients

Transparently Scale Compute for Training

Specify Training on:

Quickly change training hardware

Works on Windows (no additional setup)

NVIDIA NGC & DGX Supports MATLAB for Deep Learning

▪ GPU-accelerated MATLAB Docker container for deep learning
  – Leverage multiple GPUs on NVIDIA DGX Systems and in the Cloud
▪ Cloud providers include: AWS, Azure, Google, Oracle, and Alibaba
▪ NVIDIA DGX System / Station
  – Interconnects 4/8/16 Volta GPUs in one box
▪ Containers available for R2018a and R2018b
  – New Docker container with every major release (a/b)
▪ Download MATLAB container from NGC Registry
  – https://ngc.nvidia.com/registry/partners-matlab

Related session: S9469 - MATLAB and NVIDIA Docker: A Complete AI Solution, Where You Need It, in an Instant (Wednesday, Mar 20, 4:00 PM - 4:50 PM)

Evaluate Lane Boundary Detections vs. Ground Truth

evaluateLaneBoundaries

YOLO v2 Object Detection

YOLO CNN Network (Pretrained Network Feature Extractor, e.g. ResNet-50, followed by Detection Subnetwork) → Decode Predictions

Model Exchange with MATLAB

▪ Via ONNX (Open Neural Network Exchange): PyTorch, Caffe2, MXNet, Core ML, CNTK
▪ Keras-TensorFlow
▪ Caffe

Import Pretrained Network in ONNX Format

Modify Network

imageInputLayer replaces the input and subtraction layer

Removing the 2 ResNet-50 layers

YOLOv2 Detection Network

▪ yolov2Layers: Create network architecture

>> lgraph = yolov2Layers(imageSize, numClasses, anchorBoxes, network, featureLayer)

Number of Classes

Pretrained Feature Extractor

>> detector = trainYOLOv2ObjectDetector(trainingData,lgraph,options)

Evaluate Performance of Trained Network

▪ Set of functions to evaluate trained network performance
  – evaluateDetectionMissRate
  – evaluateDetectionPrecision
  – bboxPrecisionRecall
  – bboxOverlapRatio

>> [ap,recall,precision] = evaluateDetectionPrecision(results,vehicles(:,2));

[Figures: output of evaluateDetectionPrecision and bboxPrecisionRecall]
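Of the functions listed here, bboxOverlapRatio is the easiest to illustrate in isolation. The sketch below is a C++ stand-in rather than the MATLAB implementation: it assumes boxes in [x, y, width, height] form and computes intersection over union. The names Box and overlapRatio are made up for this example.

```cpp
#include <algorithm>
#include <cassert>

// Axis-aligned box in [x, y, width, height] form (layout assumed for this sketch).
struct Box {
    double x, y, w, h;
};

// Intersection over union of two boxes; returns 0 when they do not overlap.
double overlapRatio(const Box& a, const Box& b) {
    assert(a.w >= 0 && a.h >= 0 && b.w >= 0 && b.h >= 0);
    // Width and height of the overlapping rectangle (non-positive if disjoint).
    const double ix = std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x);
    const double iy = std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y);
    if (ix <= 0.0 || iy <= 0.0) return 0.0;
    const double inter = ix * iy;
    const double uni = a.w * a.h + b.w * b.h - inter;
    return inter / uni;
}
```

For two 2-by-2 boxes offset by one unit in each direction, the intersection area is 1 and the union area is 7, so the ratio is 1/7. A detection is typically counted as correct when this ratio exceeds a chosen threshold.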

Example Applications using MATLAB for AI Development

▪ Lane Keeping Assist using Reinforcement Learning
▪ Occupancy Grid Creation using Deep Learning
▪ Lidar Segmentation with Deep Learning

Key Takeaways

▪ Platform Productivity: Workflow automation, ease of use
▪ Framework Interoperability: ONNX, Keras-TensorFlow, Caffe

CUDA code generation with GPU Coder

GPUs are “hardware on steroids”, but programming them is hard.

[Figure: MATLAB, Python, CUDA, OpenCL, C/C++ and GPU Coder plotted by performance (faster) against ease of programming (easier: expressivity, algorithmic, …)]

Consider an example: saxpy

Simple MATLAB Loop

Automatic compilation from an expressive language to a high-performance language
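The loop on this slide did not survive extraction, so as a reminder of what saxpy computes, here is a minimal C++ sketch of y = a*x + y over single-precision vectors. It is a plain host loop, not GPU Coder output, and the function signature is an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// saxpy: y <- a * x + y, element by element.
// Each iteration touches only index i, so the iterations are independent
// and a compiler may map them to parallel GPU threads.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The absence of any dependence between iterations is what makes this loop a standard first example for automatic kernel generation.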

Deep Learning code generation with GPU Coder

TensorRT is great for inference, but applications require more than inference.

[Figure labels: TensorRT, Deep learning]

Generate code for your application with GPU Coder

DNN Application
▪ Regression DNN for lane coefficients → Post process and transform to image coordinates
▪ YOLOv2 object detection DNN → Strongest bounding box

Generate CUDA/C++/TensorRT code with GPU Coder

DNN Application
▪ Regression DNN for lane coefficients → Post process and transform to image coordinates
▪ YOLOv2 object detection DNN → Strongest bounding box

Generate Optimized CUDA code

GPU Coder automatically extracts parallelism from MATLAB

1. Scalarized MATLAB (“for-all” loops)
2. Vectorized MATLAB (math operators and library functions)
3. Composite functions in MATLAB (maps to cuBLAS, cuFFT, cuSolver, cuDNN, TensorRT)

Items 1 and 2: infer CUDA kernels from MATLAB loops
Item 3: library replacement

GPU Coder runs a host of compiler transforms to generate CUDA

Representation: control-flow graph intermediate representation (CFG-IR), …, CUDA kernel lowering

Stages
▪ Front-end
▪ Traditional compiler optimizations
▪ MATLAB Library function mapping
▪ Parallel loop creation
▪ CUDA kernel creation
▪ cudaMemcpy minimization
▪ Shared memory mapping
▪ CUDA code emission

Loop optimizations
▪ Scalarization
▪ Loop perfectization
▪ Loop interchange
▪ Loop fusion
▪ Scalar replacement

From a loop to a CUDA kernel

Extracting parallelism from loops
1. Scalarized MATLAB (for loops)
2. Vectorized MATLAB

MATLAB loop:

    for k = 1:n
        t = A(k) .* X(k);
        C(k) = t + Y(k);
    end

Is this loop parallel? Dependence analysis to understand the iteration space. If yes:
▪ Create kernel from loop body
▪ Compute kernel size: f(n)
▪ Classify kernel variables (input, output, local)
  – Ins: A, X, Y, n
  – Outs: C
  – Local: t, k

Generated kernel (pseudocode):

    static __global__ mykernel(A, X, Y, C, n)
    {
        int k = getThreadIndex(n);
        int t = A[k] * X[k];
        C[k] = t + Y[k];
    }

Kernel launch:

    { …
        mykernel<<< f(n) >>>(A, X, Y, C, n);
    }
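To make the loop-to-kernel mapping concrete, the following C++ sketch emulates a kernel launch on the CPU. It assumes that f(n) is a ceiling division of n by a block size, which is a common choice but not something the slide states. The names kernelBody, blocksFor and launch are illustrative and are not GPU Coder output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Loop body as a kernel sees it: one call per thread index k.
// The bounds check matters because a launch may create more threads than n.
void kernelBody(std::size_t k, const std::vector<float>& A, const std::vector<float>& X,
                const std::vector<float>& Y, std::vector<float>& C, std::size_t n) {
    if (k >= n) return;           // surplus threads do nothing
    const float t = A[k] * X[k];  // t is local to the thread
    C[k] = t + Y[k];
}

// Number of blocks needed to cover n iterations (one reading of the slide's f(n)).
std::size_t blocksFor(std::size_t n, std::size_t threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
}

// CPU stand-in for a launch: visit every (block, thread) pair in sequence.
void launch(const std::vector<float>& A, const std::vector<float>& X,
            const std::vector<float>& Y, std::vector<float>& C,
            std::size_t threadsPerBlock) {
    const std::size_t n = C.size();
    assert(A.size() == n && X.size() == n && Y.size() == n && threadsPerBlock > 0);
    for (std::size_t block = 0; block < blocksFor(n, threadsPerBlock); ++block)
        for (std::size_t thread = 0; thread < threadsPerBlock; ++thread)
            kernelBody(block * threadsPerBlock + thread, A, X, Y, C, n);
}
```

With n = 5 and four threads per block, two blocks are launched and the last three thread indices fall outside the array, which is why the guard in kernelBody is needed. On a GPU the two nested loops in launch disappear and every (block, thread) pair runs concurrently.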

From a loop to a CUDA kernel

Extracting parallelism from loops
1. Scalarized MATLAB (for loops)
2. Vectorized MATLAB

Perfect Loops

    for i = 1:p
        for j = 1:m
            for k = 1:n
                …(inner loop)…
            end
        end
    end

Imperfect Loops

    for i = 1:p
        …(outer prologue code)…
        for j = 1:m
            for k = 1:n
                …(inner loop)…
            end
            …(outer epilogue code)…
        end
    end

From a loop to a CUDA kernel

Extracting parallelism from loops
1. Scalarized MATLAB (for loops)
2. Vectorized MATLAB

Loop Perfectization

Imperfect Loops

    for i = 1:p
        …(outer prologue code)…
        for j = 1:m
            for k = 1:n
                …(inner loop)…
            end
            …(outer epilogue code)…
        end
    end

Perfect Loops

    for i = 1:p
        for j = 1:m
            for k = 1:n
                … (outer prologue code) …
                …(inner loop)…
                if k == n
                    … (outer epilogue code) …
                end
            end
        end
    end
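One way to check that perfectization preserves behavior is to record the order of events in both loop shapes and compare them. The C++ sketch below does that under two assumptions that go beyond the slide: the prologue is guarded as well as the epilogue, so that it still runs once per i iteration, and both inner trip counts are at least one. The Event type and function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Each event is tagged so the two traces can be compared entry by entry.
enum Tag { Prologue = 0, Inner = 1, Epilogue = 2 };

struct Event {
    Tag tag;
    int i, j, k;
    bool operator==(const Event& o) const {
        return tag == o.tag && i == o.i && j == o.j && k == o.k;
    }
};

// Imperfect nest: work sits between the loop headers.
std::vector<Event> imperfect(int p, int m, int n) {
    std::vector<Event> trace;
    for (int i = 0; i < p; ++i) {
        trace.push_back({Prologue, i, 0, 0});
        for (int j = 0; j < m; ++j) {
            for (int k = 0; k < n; ++k) trace.push_back({Inner, i, j, k});
            trace.push_back({Epilogue, i, j, 0});
        }
    }
    return trace;
}

// Perfect nest: all work is in the innermost body, guarded by index tests.
// Equivalent to the version above only when m >= 1 and n >= 1.
std::vector<Event> perfectized(int p, int m, int n) {
    assert(m >= 1 && n >= 1);
    std::vector<Event> trace;
    for (int i = 0; i < p; ++i)
        for (int j = 0; j < m; ++j)
            for (int k = 0; k < n; ++k) {
                if (j == 0 && k == 0) trace.push_back({Prologue, i, 0, 0});
                trace.push_back({Inner, i, j, k});
                if (k == n - 1) trace.push_back({Epilogue, i, j, 0});
            }
    return trace;
}
```

Once every statement lives in the innermost body, the three loops form a single rectangular iteration space, which is the shape that maps directly onto a grid of GPU threads.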

From vectorized MATLAB to CUDA kernels

Extracting parallelism from loops
1. Scalarized MATLAB (for loops)
2. Vectorized MATLAB

Vectorized MATLAB:

    output(:, 1) = (input(:, 1) - x_im) .* factor;

Assume the following sizes:

    'output' : M x 3
    'input'  : M x 3
    'x_im'   : M x 1
    'factor' : M x 1

Scalarization: reduce to for-loops

    for i = 1:M
        diff(i) = input(i, 1) - x_im(i);
    end
    for i = 1:M
        output(i, 1) = diff(i) * factor(i);
    end

Loop Fusion: create larger parallel loops (and hence CUDA kernels)

    for i = 1:M
        diff(i) = input(i, 1) - x_im(i);
        output(i, 1) = diff(i) * factor(i);
    end

Scalar Replacement: reduce temp matrices to temp scalars

    for i = 1:M
        tmp = input(i, 1) - x_im(i);
        output(i, 1) = tmp * factor(i);
    end
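The same sequence of transformations can be written out in C++ to confirm that the fused, scalar-replaced loop computes the same result as the two-loop form. This is a sketch for illustration only: the vectors stand for the first column of input and for x_im and factor, and the function names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Scalarized form: two loops and a temporary array diff of length M.
Vec twoLoops(const Vec& in, const Vec& x_im, const Vec& factor) {
    const std::size_t M = in.size();
    assert(x_im.size() == M && factor.size() == M);
    Vec diff(M), out(M);
    for (std::size_t i = 0; i < M; ++i) diff[i] = in[i] - x_im[i];
    for (std::size_t i = 0; i < M; ++i) out[i] = diff[i] * factor[i];
    return out;
}

// After loop fusion and scalar replacement: one loop, and the temporary
// array has shrunk to a scalar that lives only inside the iteration.
Vec fused(const Vec& in, const Vec& x_im, const Vec& factor) {
    const std::size_t M = in.size();
    assert(x_im.size() == M && factor.size() == M);
    Vec out(M);
    for (std::size_t i = 0; i < M; ++i) {
        const double tmp = in[i] - x_im[i];
        out[i] = tmp * factor[i];
    }
    return out;
}
```

The fused version makes one pass over memory instead of two and needs no storage for diff, which on a GPU means one kernel launch and no intermediate global-memory buffer.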

Optimizing CPU-GPU data movement is a challenge

Memory Optimizations

MATLAB code:

    A = …
    for i = 1:N
        … A(i)
    end
    imfilter

Generated kernels:

    A = …
    kernel1<<<…>>>( )
    imfilter_kernel(…)

Where is the ideal placement of memcpy?

Generated code with memcpy calls:

    A = …
    cudaMemcpyHtoD(gA, A);
    kernel1<<<…>>>(gA)
    cudaMemcpyDtoH(…)
    cudaMemcpyHtoD(…)
    imfilter_kernel(…)
    cudaMemcpyDtoH(…)

[Figure: naïve placement versus optimized placement of memcpy around kernels K1, K2, K3]

GPU Coder optimizes memcpy placement

Assume gA, gB and gC are mapped to GPU memory.

Source:

    A(:) = ….
    C(:) = ….
    for i = 1:N
        ….
        gB = kernel1(gA);
        gA = kernel2(gB);
        if (some_condition)
            gC = kernel3(gA, gB);
        end
        ….
    end
    …. = C;

Callouts on the source mark where a cudaMemcpy is *definitely* needed, *not* needed, or *may be* needed.

Observations
• Equivalent to partial redundancy elimination (PRE)
• Dynamic strategy: track memory location with a status flag per variable
• Use-Def to determine where to insert memcpy

Generated (pseudo) code:

    A(:) = …
    A_isDirtyOnCpu = true;
    for i = 1:N
        if (A_isDirtyOnCpu)
            cudaMemcpy(gA, A);
            A_isDirtyOnCpu = false;
        end
        gB = kernel1(gA);
        gA = kernel2(gB);
        if (some_condition)
            gC = kernel3(gA, gB);
            C_isDirtyOnGpu = true;
        end
    end
    if (C_isDirtyOnGpu)
        cudaMemcpy(C, gC);
        C_isDirtyOnGpu = false;
    end
    … = C;
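The status-flag idea can be tried out without a GPU. The C++ sketch below simulates device memory with a second vector and counts how many host-to-device copies are performed. The Mirrored type, its field names and the doubling "kernel" are assumptions made for this example and do not come from GPU Coder.

```cpp
#include <cassert>
#include <vector>

// Host array paired with a simulated device copy and a status flag.
struct Mirrored {
    std::vector<float> host;
    std::vector<float> device;
    bool dirtyOnCpu = true;  // host holds newer data than the device
    int copies = 0;          // host-to-device transfers performed so far

    // Copy to the device only if the host copy changed since the last transfer.
    void syncToDevice() {
        if (!dirtyOnCpu) return;
        device = host;       // stands in for cudaMemcpy(gA, A)
        dirtyOnCpu = false;
        ++copies;
    }
};

// Stand-in for a kernel: doubles every element in device memory.
void scaleOnDevice(Mirrored& a) {
    for (float& v : a.device) v *= 2.0f;
}

// Runs the "kernel" n times; the guarded copy fires only while the flag is set.
int runLoop(Mirrored& a, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        a.syncToDevice();
        scaleOnDevice(a);
    }
    return a.copies;
}
```

Running five iterations performs a single transfer, because the flag is cleared after the first pass. A later write on the host has to set dirtyOnCpu again, which is the role the use-def analysis plays in deciding where flag updates and guarded copies are inserted.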

From composite functions to optimized CUDA

Core math
• Matrix multiply (cuBLAS)
• Linear algebra (cuSolver)
• FFT functions (cuFFT)
• Convolution
• …

Image processing and computer vision
• imfilter
• imresize
• imerode
• imdilate
• bwlabel
• imwarp
• …

Neural networks
• Deep learning inference (cuDNN, TensorRT)
• …

More than 300 MATLAB functions are optimized for CUDA code generation

GPU Coder automatically maps data to shared memory

[Figure: input image (rows × cols), convolution kernel (kw × kh), dot product, output image]

gpucoder.stencilKernel() automatically generates shared memory algorithm

Used when generating code from image processing functions in MATLAB: imfilter, imerode, imdilate, conv2, …

Generates optimal matrix-matrix operations

gpucoder.matrixMatrixKernel automatically generates shared memory algorithm

Used when generating code from many standard MATLAB functions: matchFeatures (SAD, SSD), pdist, …

Generate CUDA friendly algorithms from MATLAB functions

▪ Reductions
▪ Transpose
▪ Histogram
▪ Sort
▪ Min/Max
▪ CumSum

Integrating with legacy CUDA code

Calling a GPU device function

Existing device function:

    __device__ float foo(float a, float b);

MATLAB function:

    function out = foo(in)
        ……
        ……
    end

Call from MATLAB:

    out = coder.ceval('-gpudevicefcn', ...
        'foo', ... % your function name
        a, ...     % arguments
        b);

▪ Caller needs to be a device function
▪ Arguments allocated in device memory

Integrating with legacy CUDA code

Calling a kernel function

Existing kernel:

    __global__
    void foo(const float *a[100], const float *b[100], float *c[100]);

MATLAB function:

    function out = foo(in)
        ……
        ……
    end

Call from MATLAB:

    out = coder.ceval('foo<1024>', ...  % kernel call with launch params
        coder.rref(a, '-gpu'), ...      % value is read on GPU
        coder.rref(b, '-gpu'), ...
        coder.wref(c, '-gpu'));         % value is written on GPU

▪ Arguments allocated or copied to device memory as required
▪ Caller may be a host or device function

Compiled image processing applications

▪ Fog Rectification: 5x speedup
▪ Feature Matching: 15x speedup
▪ Stereo Disparity: 8x speedup

DNN Application
▪ Regression DNN for lane coefficients → Post process and transform to image coordinates
▪ YOLOv2 object detection DNN → Strongest bounding box

Generate Optimized CUDA code

Generate optimized cuDNN/TensorRT inference code

DNN Application
▪ Regression DNN for lane coefficients → Post process and transform to image coordinates
▪ Object detection with YOLOv2 DNN → Strongest bounding box

Generate optimized cuDNN/TensorRT inference code

DNN optimizations
▪ Layer Fusion
▪ Memory optimization
▪ Network re-architecting

Deep learning network optimizations

▪ Network: conv, Batch Norm, ReLu, Add, conv, ReLu, Max Pool, Max Pool
▪ Layer fusion (optimized computation): FusedConv, FusedConv, BatchNormAdd, Max Pool, Max Pool
▪ Buffer minimization (optimized memory): the fused layers write to buffers a to e; two buffers are crossed out (X) and replaced by “Reuse buffer a” and “Reuse buffer b”
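Buffer minimization can be sketched as a small liveness-based assignment: once the last reader of a layer's output has run, that buffer may hold a later layer's output. The C++ function below is one simple greedy scheme written for illustration. It is an assumption about how such reuse could work, not a description of GPU Coder's algorithm, and assignBuffers and lastUse are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// lastUse[i] is the index of the last layer that reads the output of layer i
// (or i itself if nothing reads it). Layers run in index order.
// Returns the buffer id chosen for each layer's output.
std::vector<int> assignBuffers(const std::vector<std::size_t>& lastUse) {
    std::vector<int> bufferOf(lastUse.size(), -1);
    std::vector<int> freeList;  // ids whose contents are no longer needed
    int nextId = 0;
    for (std::size_t i = 0; i < lastUse.size(); ++i) {
        assert(lastUse[i] >= i);
        // Release every output whose last reader was the previous layer.
        for (std::size_t j = 0; j < i; ++j) {
            if (lastUse[j] + 1 == i) freeList.push_back(bufferOf[j]);
        }
        if (freeList.empty()) {
            bufferOf[i] = nextId++;         // nothing to recycle: new buffer
        } else {
            bufferOf[i] = freeList.back();  // recycle a dead buffer
            freeList.pop_back();
        }
    }
    return bufferOf;
}
```

For a plain chain of five layers the assignment alternates between two buffers. When the first output is also read by the fourth layer, as with a skip connection feeding an Add, a third buffer is needed until that read has happened. A buffer is released only after its last reader has finished, so a layer never writes into a buffer it is still reading.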

Code generation workflow

Deployment unit-test
▪ Target: Desktop GPU
▪ Build type: .mex
▪ Call compiled application from MATLAB directly

Live Demos at our GTC booth #1343

▪ Semantic Segmentation
▪ Defect classification and detection
▪ Blood smear segmentation

Single Image Inference on Titan V using cuDNN

[Chart: PyTorch (1.0.0), MXNet (1.4.0), GPU Coder (R2019a), TensorFlow (1.13.0)]

Test setup: Intel® Xeon® CPU 3.6 GHz - NVIDIA libraries: CUDA10 - cuDNN 7 - Frameworks: TensorFlow 1.13.0, MXNet 1.4.0, PyTorch 1.0.0

TensorRT Accelerates Inference Performance on Titan V

[Chart: MATLAB GPU Coder and TensorFlow; data labels (7.4) and (5.0)]

TensorRT Accelerates Inference Performance on Titan V

Single Image Inference with ResNet-50 (Titan V)

[Chart: GPU Coder and TensorFlow across cuDNN, TensorRT (FP32), TensorRT (INT8)]

Test setup: Intel® Xeon® CPU 3.6 GHz - NVIDIA libraries: CUDA10 - cuDNN 7 - Frameworks: TensorFlow 1.13.0

Deploy to Jetson and Drive

MATLAB algorithm (functional reference)

1. Functional test
2. Deployment unit-test: Desktop GPU, build type .mex. Call compiled application from MATLAB directly.
3. Deployment integration-test: Desktop GPU, C++, build type .lib. Call compiled application from hand-coded main().
4. Real-time test: Embedded GPU. Deploy to target and run with hardware-in-loop.

Streamlined deployment to Jetson or Drive GPUs.

Host Machine

Deploy standalone application

Closed loop testing and verification on embedded GPUs.

Host Machine

Run application in MATLAB with Hardware in loop

Deploying DNN application to Jetson/Drive device

Target Display

Host Machine

Video Feed

Deployable MATLAB algorithm code

Access to target peripherals from MATLAB

MATLAB talks to peripherals on hardware

Verification with processor-in-loop mode

Streams from target peripherals to MATLAB

Generate code for peripheral access

Generate code to access peripherals on board

Standalone deployment mode

Generates interface code for target peripherals

GPU Coder enables hardware prototyping and system integration

▪ Processor-in-loop verification with MATLAB
▪ Deploy and run on NVIDIA target
▪ Peripheral access support

[Figure: MATLAB on host, Jetson/DRIVE platform]

In Summary

▪ Ground Truth Labeling
▪ Network Design and Training
▪ CUDA and TensorRT Code Generation
▪ Jetson Xavier and DRIVE Xavier Targeting

Network design
▪ Platform Productivity: Workflow automation, ease of use
▪ Framework Interoperability: ONNX, Keras-TensorFlow, Caffe

Deployment
▪ Optimized CUDA and TensorRT code generation
▪ Jetson Xavier and DRIVE Xavier targeting
▪ Processor-in-loop (PIL) testing and system integration

Thank You