
Page 1

© 2015 The MathWorks, Inc.

Deploying Deep Neural Networks to Embedded GPUs and CPUs

Steven Thomsett

Page 2

Deep Learning Workflow in MATLAB

Workflow diagram: Deep Neural Network Design + Training produces a Trained DNN, which feeds Application Design (application logic) and then Standalone Deployment.

Page 3

Deep Neural Network Design and Training

▪ Design in MATLAB
▪ Manage large data sets
▪ Automate data labeling
▪ Easy access to models
▪ Training in MATLAB
▪ Acceleration with GPUs
▪ Scale to clusters

Diagram: a trained DNN can come from training in MATLAB, from transfer learning on a reference model, or from importing a model built in another framework.
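The deck shows no code for this step. As a minimal sketch of the transfer-learning path (assuming Deep Learning Toolbox; the image folder, class count, and training options below are illustrative placeholders, not from the slides):

% Start from a pretrained reference model and retrain the final layers.
net = alexnet;
layers = net.Layers;
numClasses = 5;                                   % placeholder class count
layers(end-2) = fullyConnectedLayer(numClasses);  % replace the final fully connected layer
layers(end)   = classificationLayer;              % new classification output

imds = imageDatastore('myImages', ...             % placeholder folder, one subfolder per class
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
augimds = augmentedImageDatastore([227 227], imds);   % resize to the network input size

opts = trainingOptions('sgdm', ...
    'InitialLearnRate', 1e-4, ...
    'MaxEpochs', 5, ...
    'ExecutionEnvironment', 'gpu');               % accelerate training on a GPU

trainedNet = trainNetwork(augimds, layers, opts); % trained DNN ready for deployment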

Page 4

Application Design

Diagram: the application logic wraps the trained DNN with pre-processing and post-processing stages.

Page 5

Multi-Platform Deep Learning Deployment

Diagram: the application logic to be deployed across platforms.

Page 6

Multi-Platform Deep Learning Deployment

Targets: embedded and mobile platforms (NVIDIA Jetson, Raspberry Pi, BeagleBone), desktop, and data center.

Page 7

Algorithm Design to Embedded Deployment Workflow: Conventional Approach

1. Functional test on the desktop GPU, using a high-level language, a deep learning framework, and a large, complex software stack
2. Deployment unit test on the desktop GPU, in C/C++ against low-level APIs and application-specific libraries
3. Deployment integration test on the embedded GPU, in C/C++ against target-optimized libraries, optimized for memory and speed
4. Real-time test on the embedded GPU

Challenges
• Integrating multiple libraries and packages
• Verifying and maintaining multiple implementations
• Algorithm and vendor lock-in

Page 8

Solution: Use MATLAB Coder & GPU Coder for Deep Learning Deployment

Diagram: the application logic is passed to GPU Coder, which targets the NVIDIA TensorRT and cuDNN libraries, or to MATLAB Coder, which targets the Intel MKL-DNN library and the ARM Compute Library.
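The slides do not show the configuration calls behind this diagram. One way to select these target libraries at code-generation time (a sketch, assuming GPU Coder, MATLAB Coder, and the corresponding deep learning support packages are installed):

% GPU targets: generate CUDA/C++ code that calls cuDNN or TensorRT.
cfgGpu = coder.gpuConfig('lib');
cfgGpu.TargetLang = 'C++';
cfgGpu.DeepLearningConfig = coder.DeepLearningConfig('cudnn');    % or 'tensorrt'

% CPU targets: generate C++ code that calls Intel MKL-DNN or the ARM Compute Library.
cfgCpu = coder.config('lib');
cfgCpu.TargetLang = 'C++';
cfgCpu.DeepLearningConfig = coder.DeepLearningConfig('mkldnn');   % or 'arm-compute'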

Page 9

Solution: Use MATLAB Coder & GPU Coder for Deep Learning Deployment (repeats the diagram from Page 8).

Page 10

Deep Learning Deployment Workflows

INTEGRATED APPLICATION DEPLOYMENT: pre-processing + trained DNN + post-processing → codegen → portable target code

INFERENCE ENGINE DEPLOYMENT: trained DNN → cnncodegen → portable target code

Page 11

Workflow for Inference Engine Deployment

Steps for inference engine deployment (a worked sketch of these steps follows this slide):

1. Generate the code for the trained model: >> cnncodegen(net, 'targetlib', 'arm-compute')
2. Copy the generated code onto the target board
3. Build the code for the inference engine: >> make -C ./codegen -f …mk
4. Use a hand-written main function to call the inference engine
5. Generate the executable and test it: >> make -C ./ ……

Diagram: trained DNN → cnncodegen → portable target code (inference engine deployment).
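A hedged end-to-end sketch of those steps for an ARM Compute target (the network loader, board paths, and makefile names are placeholders, not from the slides):

% Step 1 (host): generate portable inference-engine code for the trained network.
net = getTrainedDNN();                       % placeholder for loading your trained or imported network
cnncodegen(net, 'targetlib', 'arm-compute');

% Steps 2-5 run on the target board, so they are shown here as comments:
%   scp -r ./codegen user@board:~/app            (copy the generated code; paths are placeholders)
%   make -C ./codegen -f <generated makefile>    (build the inference-engine library)
%   write a main.cpp that reads an input and calls the generated predict function
%   make -C ./ <application makefile> && ./app   (build and run the executable)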

Page 12

Deep Learning Inference Deployment

Example application: Pedestrian Detection, deployed through GPU Coder (NVIDIA TensorRT and cuDNN libraries) or MATLAB Coder (Intel MKL-DNN library, ARM Compute Library).

Page 13

Deep Learning Inference Deployment

Example application: Blood Smear Segmentation (same target-library diagram as Page 12).

Page 14

Deep Learning Inference Deployment

Example application: Defect Classification and Detection (same target-library diagram as Page 12).

Page 15

How is the Performance? (same target-library diagram as Page 12)

Page 16

Performance of Generated Code

▪ CNN inference (ResNet-50, VGG-16, Inception V3) on Titan V GPU
▪ CNN inference (ResNet-50) on Jetson TX2
▪ CNN inference (ResNet-50, VGG-16, Inception V3) on Intel Xeon CPU

Page 17

Single Image Inference on Titan V using cuDNN

Chart: single-image inference compared across GPU Coder (R2019a), TensorFlow (1.13.0), MXNet (1.4.0), and PyTorch (1.0.0).

Benchmark setup: Intel Xeon CPU 3.6 GHz; NVIDIA libraries: CUDA 10, cuDNN 7; frameworks: TensorFlow 1.13.0, MXNet 1.4.0, PyTorch 1.0.0.

Page 18

Even Stronger Performance with INT8 using TensorRT

Chart: ResNet-50 inference on Titan V across batch sizes, comparing GPU Coder + TensorRT and TensorFlow + TensorRT in both FP32 and INT8.

Benchmark setup: Intel Xeon CPU 3.6 GHz; NVIDIA libraries: CUDA 10, cuDNN 7.4.1, TensorRT 5.0.2.6; frameworks: TensorFlow 1.13.0_rc0.
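The slides do not show how INT8 is selected. A sketch of one way to request TensorRT INT8 code generation with GPU Coder (assuming the TensorRT deep learning support package; the calibration folder and entry-point name are placeholders):

cfg = coder.gpuConfig('lib');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');
cfg.DeepLearningConfig.DataType = 'int8';                 % 8-bit integer inference
cfg.DeepLearningConfig.DataPath = 'calibrationImages';    % placeholder folder of calibration images
cfg.DeepLearningConfig.NumCalibrationBatches = 10;        % placeholder calibration batch count

% 'resnet_predict' is a placeholder entry-point function wrapping the trained network.
codegen -config cfg resnet_predict -args {ones(224,224,3,'single')}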

Page 19

Single Image Inference on Jetson TX2

Chart: ResNet-50 single-image inference comparing GPU Coder + TensorRT with TensorFlow + TensorRT.

Benchmark setup: NVIDIA libraries: CUDA 9, cuDNN 7, TensorRT 3.0.4; frameworks: TensorFlow 1.12.0.

Page 20

CPU Performance

Chart: CPU single-image inference (Linux) comparing MATLAB, MATLAB Coder generated code, TensorFlow, MXNet, and PyTorch.

Benchmark setup: Intel Xeon CPU 3.6 GHz; frameworks: TensorFlow 1.6.0, MXNet 1.2.1, PyTorch 0.3.1.

Page 21

Brief Summary

DNN libraries are great for inference, and MATLAB Coder and GPU Coder generate code that takes advantage of:

▪ NVIDIA CUDA libraries, including TensorRT and cuDNN
▪ Intel Math Kernel Library for Deep Neural Networks (MKL-DNN)
▪ ARM Compute Library for mobile platforms

Page 22

Brief Summary (continued)

DNN libraries are great for inference, but applications require more than just inference.

Page 23

Deep Learning Workflows: Integrated Application Deployment

Diagram: pre-processing + trained DNN + post-processing → codegen → portable target code.
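The slides show this workflow only as a diagram. A hedged sketch of what the integrated entry-point can look like (the function name, MAT-file, image sizes, and post-processing helper below are hypothetical, not from the slides):

function out = myApp(in)  %#codegen
% Hypothetical entry-point combining pre-processing, DNN inference, and post-processing.
persistent mynet;
if isempty(mynet)
    mynet = coder.loadDeepLearningNetwork('trainedDNN.mat');  % placeholder trained network
end
img    = imresize(in, [224 224]);     % pre-processing
scores = predict(mynet, img);         % DNN inference
out    = postprocessScores(scores);   % hypothetical post-processing helper
end

Code generation then follows the same pattern as on the earlier pages, for example codegen -config cfg myApp -args {ones(480,640,3,'uint8')} with a cuDNN, TensorRT, MKL-DNN, or ARM Compute DeepLearningConfig.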

Page 24

Lane and Object Detection using YOLO v2

Diagram: AlexNet-based YOLO v2 performs object detection and lane detection, followed by post-processing that selects the strongest bounding box.

Workflow (a code-generation sketch for step 2 follows this slide):
1) Test in MATLAB on the CPU
2) Generate code and test on a desktop GPU
3) Generate code and test on a Jetson AGX Xavier GPU
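A hedged sketch of step 2 (the entry-point wrapper and test-image name are hypothetical, not from the slides):

% Generate a MEX function for desktop-GPU testing of the detection entry-point.
% 'yoloLaneDetect' is a hypothetical function wrapping the AlexNet-based YOLO v2
% network, lane detection, and post-processing.
cfg = coder.gpuConfig('mex');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');   % or 'tensorrt'
codegen -config cfg yoloLaneDetect -args {ones(224,224,3,'uint8')}

% Compare the generated MEX against the MATLAB baseline on a sample frame.
frame  = imread('testFrame.png');                             % placeholder image
outGpu = yoloLaneDetect_mex(imresize(frame, [224 224]));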

Page 25

(1) Test in MATLAB on CPU

Diagram: the AlexNet-based YOLO v2 object detection, lane detection, and post-processing (strongest bounding box) all run in MATLAB on the CPU.

Page 26

(2) Generate Code and Test on Desktop GPU

Diagram: the AlexNet-based YOLO v2 network runs as cuDNN/TensorRT-optimized code, and the detection and post-processing logic runs as CUDA-optimized code, on the desktop GPU.

Page 27

(3) Generate Code and Test on Jetson AGX Xavier GPU

Diagram: the same cuDNN/TensorRT-optimized and CUDA-optimized generated code now runs on the Jetson AGX Xavier GPU.

Page 28

Lane and Object Detection using YOLO v2: Results

1) Running in MATLAB on the CPU
2) 7x faster when running the generated code on a desktop GPU
3) Generated code running on the Jetson AGX Xavier GPU

Page 29

Accessing Hardware

▪ Deploy a standalone application
▪ Access peripherals from MATLAB
▪ Processor-in-the-loop verification

Page 30

Deploy to Target Hardware via Apps and Command Line
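The slides show this only as a screenshot. From the command line, a hedged sketch using the GPU Coder Support Package for NVIDIA GPUs (board address, credentials, build directory, and entry-point name are placeholders):

% Connect to the board over the network.
hwobj = jetson('192.168.1.15', 'ubuntu', 'ubuntu');

% Point code generation at the Jetson and a remote build directory.
cfg = coder.gpuConfig('exe');
cfg.Hardware = coder.hardware('NVIDIA Jetson');
cfg.Hardware.BuildDir = '~/remoteBuildDir';
cfg.GenerateExampleMain = 'GenerateCodeAndCompile';   % provide an example main for the executable
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');

% 'myApp' is the hypothetical entry-point from the integrated-deployment sketch above.
codegen -config cfg myApp -args {ones(480,640,3,'uint8')}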

Page 31

Chart: repeats the single-image inference comparison across GPU Coder (R2019a), TensorFlow (1.13.0), MXNet (1.4.0), and PyTorch (1.0.0).

Page 32

How do MATLAB Coder and GPU Coder achieve these results?

(Shown over the same framework comparison chart as Page 31.)

Page 33

Coders Apply Various Optimizations

CUDA kernel lowering:
• MATLAB library function mapping
• Parallel loop creation
• CUDA kernel creation
• cudaMemcpy minimization
• Shared memory mapping
• CUDA code emission

Traditional compiler optimizations:
• Scalarization
• Scalar replacement
• Loop optimizations: loop perfectization, loop interchange, loop fusion
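As a small illustration of what "parallel loop creation" and "CUDA kernel creation" act on, element-wise MATLAB code such as the following can be lowered to CUDA kernels by GPU Coder (the function itself is an illustrative example, not from the slides):

function y = saxpyLike(a, x, b)  %#codegen
% Element-wise array arithmetic: GPU Coder can map this to one or more CUDA
% kernels and minimize the host/device memory copies around them.
y = a .* x + b;
end

Generating code with, for example, cfg = coder.gpuConfig('lib'); codegen -config cfg saxpyLike -args {single(2), ones(1e6,1,'single'), ones(1e6,1,'single')} lets the coder apply the optimizations listed above automatically.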

Page 34

Coders Apply Various Optimizations (continued)

Same compilation pipeline as Page 33; the performance comes from three places: optimized libraries, network optimization, and coding patterns.

Page 35

Generated Code Calls Optimized Libraries

Diagram: the pre-processing and post-processing code calls the cuFFT, cuBLAS, cuSolver, and Thrust libraries, while the DNN calls the NVIDIA TensorRT and cuDNN libraries, the Intel MKL-DNN library, or the ARM Compute Library.

Performance:
1. Optimized libraries
2. Network optimizations
3. Coding patterns

Page 36

Deep Learning Network Optimization

Layer fusion (optimized computation): layers such as conv, batch norm, ReLU, and add in the original graph are fused into combined layers (e.g. FusedConv, FusedConvBatchNormAdd), leaving the max-pool layers in place.

Buffer minimization (optimized memory): the intermediate buffers between layers (buffers a through e) are analyzed so that earlier buffers (a and b) are reused instead of allocating new ones.

Performance:
1. Optimized libraries
2. Network optimizations
3. Coding patterns
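As a conceptual illustration of what conv + batch-norm fusion buys (this is the standard batch-norm folding arithmetic, not a claim about GPU Coder's internal implementation):

function [Wfused, bfused] = foldBatchNorm(W, b, gamma, beta, mu, sigma2, epsilon)
% Fold a batch-norm layer (gamma, beta, mu, sigma2, epsilon) into the preceding
% convolution (weights W of size h-by-w-by-cin-by-cout, bias b), so that a single
% fused convolution produces the same result at inference time.
scale  = gamma ./ sqrt(sigma2 + epsilon);     % per-output-channel scale
Wfused = W .* reshape(scale, 1, 1, 1, []);    % rescale each output-channel filter
bfused = (b - mu) .* scale + beta;            % adjusted bias
end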

Page 37

Coding Patterns: Stencil Kernels

▪ Automatically applied for image processing functions (e.g. imfilter, imerode, imdilate, conv2, …)
▪ Manually apply using gpucoder.stencilKernel() (a sketch follows this slide)

Diagram: each output pixel is the dot product of a kh-by-kw convolution kernel with the corresponding window of the rows-by-cols input image.

Performance:
1. Optimized libraries
2. Network optimizations
3. Coding patterns
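A sketch of the manual form, following the documented gpucoder.stencilKernel pattern (the 3-by-3 mean filter is an illustrative choice):

function B = meanFilt3x3(A)  %#codegen
% Each 3-by-3 neighborhood of A is passed to the kernel function below;
% GPU Coder maps this to a stencil-style CUDA kernel.
B = gpucoder.stencilKernel(@localMean, A, [3 3], 'same');
end

function out = localMean(window)
% window is the 3-by-3 neighborhood around one output pixel.
out = cast(sum(window(:)), class(window)) / numel(window);
end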

Page 38

Coding Patterns: Matrix-Matrix Kernels

▪ Automatically applied for many MATLAB functions (e.g. matchFeatures (SAD, SSD), pdist, …)
▪ Manually apply using gpucoder.matrixMatrixKernel() (a sketch follows this slide)

Performance:
1. Optimized libraries
2. Network optimizations
3. Coding patterns
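A sketch of the manual form using sum of absolute differences, one of the operations named above (the wrapper and kernel names are illustrative, and the exact calling convention should be checked against the gpucoder.matrixMatrixKernel documentation):

function C = sadAllPairs(A, B)  %#codegen
% C(i,j) is produced by applying the kernel function to row i of A and
% column j of B, in the access pattern of a matrix-matrix multiply.
C = gpucoder.matrixMatrixKernel(@sadKernel, A, B);
end

function c = sadKernel(a, b)
% a: one row of A, b: one column of B; sum of absolute differences.
c = sum(abs(a(:) - b(:)));
end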

Page 39

Deep Learning Workflow in MATLAB (recap)

Diagram: Deep Neural Network Design + Training (train in MATLAB, transfer learning, or a model importer working from a reference model) produces a Trained DNN; Application Design adds the application logic; Standalone Deployment goes through the Coders, targeting the TensorRT and cuDNN libraries, the MKL-DNN library, and the ARM Compute Library.

Page 40

Deep Learning Workflow in MATLAB (repeats the recap diagram from Page 39).