
Page 1

© 2015 The MathWorks, Inc.

Deploying Deep Neural Networks to Embedded GPUs and CPUs

Steven Thomsett

Page 2

Deep Learning Workflow in MATLAB

Workflow diagram: Deep Neural Network Design + Training produces a Trained DNN, which feeds Application Design (application logic) and then Standalone Deployment.

Page 3

Deep Neural Network Design and Training

▪ Design in MATLAB
▪ Manage large data sets
▪ Automate data labeling
▪ Easy access to models
▪ Training in MATLAB
▪ Acceleration with GPUs
▪ Scale to clusters

Diagram: a trained DNN can come from training in MATLAB, from transfer learning on a reference model, or from importing a model built in another framework.
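The deck shows no code for this step. As a minimal sketch of the transfer-learning path (assuming Deep Learning Toolbox; the image folder, class count, and training options below are illustrative placeholders, not from the slides):

% Start from a pretrained reference model and retrain the final layers.
net = alexnet;
layers = net.Layers;
numClasses = 5;                                   % placeholder class count
layers(end-2) = fullyConnectedLayer(numClasses);  % replace the final fully connected layer
layers(end)   = classificationLayer;              % new classification output

imds = imageDatastore('myImages', ...             % placeholder folder, one subfolder per class
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
augimds = augmentedImageDatastore([227 227], imds);   % resize to the network input size

opts = trainingOptions('sgdm', ...
    'InitialLearnRate', 1e-4, ...
    'MaxEpochs', 5, ...
    'ExecutionEnvironment', 'gpu');               % accelerate training on a GPU

trainedNet = trainNetwork(augimds, layers, opts); % trained DNN ready for deployment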

Page 4

Application Design

Diagram: the application logic wraps the trained DNN with pre-processing and post-processing stages.

Page 5

Multi-Platform Deep Learning Deployment

Diagram: the application logic to be deployed across platforms.

Page 6

Multi-Platform Deep Learning Deployment

Targets: embedded and mobile platforms (NVIDIA Jetson, Raspberry Pi, BeagleBone), desktop, and data center.

Page 7

Algorithm Design to Embedded Deployment Workflow: Conventional Approach

1. Functional test on the desktop GPU, using a high-level language, a deep learning framework, and a large, complex software stack
2. Deployment unit test on the desktop GPU, in C/C++ against low-level APIs and application-specific libraries
3. Deployment integration test on the embedded GPU, in C/C++ against target-optimized libraries, optimized for memory and speed
4. Real-time test on the embedded GPU

Challenges
• Integrating multiple libraries and packages
• Verifying and maintaining multiple implementations
• Algorithm and vendor lock-in

Page 8

Solution: Use MATLAB Coder & GPU Coder for Deep Learning Deployment

Diagram: the application logic is passed to GPU Coder, which targets the NVIDIA TensorRT and cuDNN libraries, or to MATLAB Coder, which targets the Intel MKL-DNN library and the ARM Compute Library.
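The slides do not show the configuration calls behind this diagram. One way to select these target libraries at code-generation time (a sketch, assuming GPU Coder, MATLAB Coder, and the corresponding deep learning support packages are installed):

% GPU targets: generate CUDA/C++ code that calls cuDNN or TensorRT.
cfgGpu = coder.gpuConfig('lib');
cfgGpu.TargetLang = 'C++';
cfgGpu.DeepLearningConfig = coder.DeepLearningConfig('cudnn');    % or 'tensorrt'

% CPU targets: generate C++ code that calls Intel MKL-DNN or the ARM Compute Library.
cfgCpu = coder.config('lib');
cfgCpu.TargetLang = 'C++';
cfgCpu.DeepLearningConfig = coder.DeepLearningConfig('mkldnn');   % or 'arm-compute'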

Page 9

Solution: Use MATLAB Coder & GPU Coder for Deep Learning Deployment (repeats the diagram from Page 8).

Page 10

Deep Learning Deployment Workflows

INTEGRATED APPLICATION DEPLOYMENT: pre-processing + trained DNN + post-processing → codegen → portable target code

INFERENCE ENGINE DEPLOYMENT: trained DNN → cnncodegen → portable target code

Page 11

Workflow for Inference Engine Deployment

Steps for inference engine deployment (a worked sketch of these steps follows this slide):

1. Generate the code for the trained model: >> cnncodegen(net, 'targetlib', 'arm-compute')
2. Copy the generated code onto the target board
3. Build the code for the inference engine: >> make -C ./codegen -f …mk
4. Use a hand-written main function to call the inference engine
5. Generate the executable and test it: >> make -C ./ ……

Diagram: trained DNN → cnncodegen → portable target code (inference engine deployment).
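A hedged end-to-end sketch of those steps for an ARM Compute target (the network loader, board paths, and makefile names are placeholders, not from the slides):

% Step 1 (host): generate portable inference-engine code for the trained network.
net = getTrainedDNN();                       % placeholder for loading your trained or imported network
cnncodegen(net, 'targetlib', 'arm-compute');

% Steps 2-5 run on the target board, so they are shown here as comments:
%   scp -r ./codegen user@board:~/app            (copy the generated code; paths are placeholders)
%   make -C ./codegen -f <generated makefile>    (build the inference-engine library)
%   write a main.cpp that reads an input and calls the generated predict function
%   make -C ./ <application makefile> && ./app   (build and run the executable)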

Page 12

Deep Learning Inference Deployment

Example application: Pedestrian Detection, deployed through GPU Coder (NVIDIA TensorRT and cuDNN libraries) or MATLAB Coder (Intel MKL-DNN library, ARM Compute Library).

Page 13

Deep Learning Inference Deployment

Example application: Blood Smear Segmentation (same target-library diagram as Page 12).

Page 14

Deep Learning Inference Deployment

Example application: Defect Classification and Detection (same target-library diagram as Page 12).

Page 15

How is the Performance? (same target-library diagram as Page 12)

Page 16

Performance of Generated Code

▪ CNN inference (ResNet-50, VGG-16, Inception V3) on Titan V GPU
▪ CNN inference (ResNet-50) on Jetson TX2
▪ CNN inference (ResNet-50, VGG-16, Inception V3) on Intel Xeon CPU

Page 17

Single Image Inference on Titan V using cuDNN

Chart: single-image inference compared across GPU Coder (R2019a), TensorFlow (1.13.0), MXNet (1.4.0), and PyTorch (1.0.0).

Benchmark setup: Intel Xeon CPU 3.6 GHz; NVIDIA libraries: CUDA 10, cuDNN 7; frameworks: TensorFlow 1.13.0, MXNet 1.4.0, PyTorch 1.0.0.

Page 18

Even Stronger Performance with INT8 using TensorRT

Chart: ResNet-50 inference on Titan V across batch sizes, comparing GPU Coder + TensorRT and TensorFlow + TensorRT in both FP32 and INT8.

Benchmark setup: Intel Xeon CPU 3.6 GHz; NVIDIA libraries: CUDA 10, cuDNN 7.4.1, TensorRT 5.0.2.6; frameworks: TensorFlow 1.13.0_rc0.
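The slides do not show how INT8 is selected. A sketch of one way to request TensorRT INT8 code generation with GPU Coder (assuming the TensorRT deep learning support package; the calibration folder and entry-point name are placeholders):

cfg = coder.gpuConfig('lib');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');
cfg.DeepLearningConfig.DataType = 'int8';                 % 8-bit integer inference
cfg.DeepLearningConfig.DataPath = 'calibrationImages';    % placeholder folder of calibration images
cfg.DeepLearningConfig.NumCalibrationBatches = 10;        % placeholder calibration batch count

% 'resnet_predict' is a placeholder entry-point function wrapping the trained network.
codegen -config cfg resnet_predict -args {ones(224,224,3,'single')}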

Page 19

Single Image Inference on Jetson TX2

Chart: ResNet-50 single-image inference comparing GPU Coder + TensorRT with TensorFlow + TensorRT.

Benchmark setup: NVIDIA libraries: CUDA 9, cuDNN 7, TensorRT 3.0.4; frameworks: TensorFlow 1.12.0.

Page 20

CPU Performance

Chart: CPU single-image inference (Linux) comparing MATLAB, MATLAB Coder generated code, TensorFlow, MXNet, and PyTorch.

Benchmark setup: Intel Xeon CPU 3.6 GHz; frameworks: TensorFlow 1.6.0, MXNet 1.2.1, PyTorch 0.3.1.

Page 21

Brief Summary

DNN libraries are great for inference, and MATLAB Coder and GPU Coder generate code that takes advantage of:

▪ NVIDIA CUDA libraries, including TensorRT and cuDNN
▪ Intel Math Kernel Library for Deep Neural Networks (MKL-DNN)
▪ ARM Compute Library for mobile platforms

Page 22

Brief Summary (continued)

DNN libraries are great for inference, but applications require more than just inference.

Page 23

Deep Learning Workflows: Integrated Application Deployment

Diagram: pre-processing + trained DNN + post-processing → codegen → portable target code.
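The slides show this workflow only as a diagram. A hedged sketch of what the integrated entry-point can look like (the function name, MAT-file, image sizes, and post-processing helper below are hypothetical, not from the slides):

function out = myApp(in)  %#codegen
% Hypothetical entry-point combining pre-processing, DNN inference, and post-processing.
persistent mynet;
if isempty(mynet)
    mynet = coder.loadDeepLearningNetwork('trainedDNN.mat');  % placeholder trained network
end
img    = imresize(in, [224 224]);     % pre-processing
scores = predict(mynet, img);         % DNN inference
out    = postprocessScores(scores);   % hypothetical post-processing helper
end

Code generation then follows the same pattern as on the earlier pages, for example codegen -config cfg myApp -args {ones(480,640,3,'uint8')} with a cuDNN, TensorRT, MKL-DNN, or ARM Compute DeepLearningConfig.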

Page 24

Lane and Object Detection using YOLO v2

Diagram: AlexNet-based YOLO v2 performs object detection and lane detection, followed by post-processing that selects the strongest bounding box.

Workflow (a code-generation sketch for step 2 follows this slide):
1) Test in MATLAB on the CPU
2) Generate code and test on a desktop GPU
3) Generate code and test on a Jetson AGX Xavier GPU
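A hedged sketch of step 2 (the entry-point wrapper and test-image name are hypothetical, not from the slides):

% Generate a MEX function for desktop-GPU testing of the detection entry-point.
% 'yoloLaneDetect' is a hypothetical function wrapping the AlexNet-based YOLO v2
% network, lane detection, and post-processing.
cfg = coder.gpuConfig('mex');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');   % or 'tensorrt'
codegen -config cfg yoloLaneDetect -args {ones(224,224,3,'uint8')}

% Compare the generated MEX against the MATLAB baseline on a sample frame.
frame  = imread('testFrame.png');                             % placeholder image
outGpu = yoloLaneDetect_mex(imresize(frame, [224 224]));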

Page 25

(1) Test in MATLAB on CPU

Diagram: the AlexNet-based YOLO v2 object detection, lane detection, and post-processing (strongest bounding box) all run in MATLAB on the CPU.

Page 26

(2) Generate Code and Test on Desktop GPU

Diagram: the AlexNet-based YOLO v2 network runs as cuDNN/TensorRT-optimized code, and the detection and post-processing logic runs as CUDA-optimized code, on the desktop GPU.

Page 27

(3) Generate Code and Test on Jetson AGX Xavier GPU

Diagram: the same cuDNN/TensorRT-optimized and CUDA-optimized generated code now runs on the Jetson AGX Xavier GPU.

Page 28

Lane and Object Detection using YOLO v2: Results

1) Running in MATLAB on the CPU
2) 7x faster when running the generated code on a desktop GPU
3) Generated code running on the Jetson AGX Xavier GPU

Page 29

Accessing Hardware

▪ Deploy a standalone application
▪ Access peripherals from MATLAB
▪ Processor-in-the-loop verification

Page 30

Deploy to Target Hardware via Apps and Command Line
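The slides show this only as a screenshot. From the command line, a hedged sketch using the GPU Coder Support Package for NVIDIA GPUs (board address, credentials, build directory, and entry-point name are placeholders):

% Connect to the board over the network.
hwobj = jetson('192.168.1.15', 'ubuntu', 'ubuntu');

% Point code generation at the Jetson and a remote build directory.
cfg = coder.gpuConfig('exe');
cfg.Hardware = coder.hardware('NVIDIA Jetson');
cfg.Hardware.BuildDir = '~/remoteBuildDir';
cfg.GenerateExampleMain = 'GenerateCodeAndCompile';   % provide an example main for the executable
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');

% 'myApp' is the hypothetical entry-point from the integrated-deployment sketch above.
codegen -config cfg myApp -args {ones(480,640,3,'uint8')}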

Page 31

Chart: repeats the single-image inference comparison across GPU Coder (R2019a), TensorFlow (1.13.0), MXNet (1.4.0), and PyTorch (1.0.0).

Page 32

How do MATLAB Coder and GPU Coder achieve these results?

(Shown over the same framework comparison chart as Page 31.)

Page 33

Coders Apply Various Optimizations

CUDA kernel lowering:
• MATLAB library function mapping
• Parallel loop creation
• CUDA kernel creation
• cudaMemcpy minimization
• Shared memory mapping
• CUDA code emission

Traditional compiler optimizations:
• Scalarization
• Scalar replacement
• Loop optimizations: loop perfectization, loop interchange, loop fusion
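As a small illustration of what "parallel loop creation" and "CUDA kernel creation" act on, element-wise MATLAB code such as the following can be lowered to CUDA kernels by GPU Coder (the function itself is an illustrative example, not from the slides):

function y = saxpyLike(a, x, b)  %#codegen
% Element-wise array arithmetic: GPU Coder can map this to one or more CUDA
% kernels and minimize the host/device memory copies around them.
y = a .* x + b;
end

Generating code with, for example, cfg = coder.gpuConfig('lib'); codegen -config cfg saxpyLike -args {single(2), ones(1e6,1,'single'), ones(1e6,1,'single')} lets the coder apply the optimizations listed above automatically.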

Page 34

Coders Apply Various Optimizations (continued)

Same compilation pipeline as Page 33; the performance comes from three places: optimized libraries, network optimization, and coding patterns.

Page 35

Generated Code Calls Optimized Libraries

Diagram: the pre-processing and post-processing code calls the cuFFT, cuBLAS, cuSolver, and Thrust libraries, while the DNN calls the NVIDIA TensorRT and cuDNN libraries, the Intel MKL-DNN library, or the ARM Compute Library.

Performance:
1. Optimized libraries
2. Network optimizations
3. Coding patterns

Page 36

Deep Learning Network Optimization

Layer fusion (optimized computation): layers such as conv, batch norm, ReLU, and add in the original graph are fused into combined layers (e.g. FusedConv, FusedConvBatchNormAdd), leaving the max-pool layers in place.

Buffer minimization (optimized memory): the intermediate buffers between layers (buffers a through e) are analyzed so that earlier buffers (a and b) are reused instead of allocating new ones.

Performance:
1. Optimized libraries
2. Network optimizations
3. Coding patterns
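As a conceptual illustration of what conv + batch-norm fusion buys (this is the standard batch-norm folding arithmetic, not a claim about GPU Coder's internal implementation):

function [Wfused, bfused] = foldBatchNorm(W, b, gamma, beta, mu, sigma2, epsilon)
% Fold a batch-norm layer (gamma, beta, mu, sigma2, epsilon) into the preceding
% convolution (weights W of size h-by-w-by-cin-by-cout, bias b), so that a single
% fused convolution produces the same result at inference time.
scale  = gamma ./ sqrt(sigma2 + epsilon);     % per-output-channel scale
Wfused = W .* reshape(scale, 1, 1, 1, []);    % rescale each output-channel filter
bfused = (b - mu) .* scale + beta;            % adjusted bias
end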

Page 37

Coding Patterns: Stencil Kernels

▪ Automatically applied for image processing functions (e.g. imfilter, imerode, imdilate, conv2, …)
▪ Manually apply using gpucoder.stencilKernel() (a sketch follows this slide)

Diagram: each output pixel is the dot product of a kh-by-kw convolution kernel with the corresponding window of the rows-by-cols input image.

Performance:
1. Optimized libraries
2. Network optimizations
3. Coding patterns
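A sketch of the manual form, following the documented gpucoder.stencilKernel pattern (the 3-by-3 mean filter is an illustrative choice):

function B = meanFilt3x3(A)  %#codegen
% Each 3-by-3 neighborhood of A is passed to the kernel function below;
% GPU Coder maps this to a stencil-style CUDA kernel.
B = gpucoder.stencilKernel(@localMean, A, [3 3], 'same');
end

function out = localMean(window)
% window is the 3-by-3 neighborhood around one output pixel.
out = cast(sum(window(:)), class(window)) / numel(window);
end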

Page 38

Coding Patterns: Matrix-Matrix Kernels

▪ Automatically applied for many MATLAB functions (e.g. matchFeatures (SAD, SSD), pdist, …)
▪ Manually apply using gpucoder.matrixMatrixKernel() (a sketch follows this slide)

Performance:
1. Optimized libraries
2. Network optimizations
3. Coding patterns
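A sketch of the manual form using sum of absolute differences, one of the operations named above (the wrapper and kernel names are illustrative, and the exact calling convention should be checked against the gpucoder.matrixMatrixKernel documentation):

function C = sadAllPairs(A, B)  %#codegen
% C(i,j) is produced by applying the kernel function to row i of A and
% column j of B, in the access pattern of a matrix-matrix multiply.
C = gpucoder.matrixMatrixKernel(@sadKernel, A, B);
end

function c = sadKernel(a, b)
% a: one row of A, b: one column of B; sum of absolute differences.
c = sum(abs(a(:) - b(:)));
end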

Page 39

Deep Learning Workflow in MATLAB (recap)

Diagram: Deep Neural Network Design + Training (train in MATLAB, transfer learning, or a model importer working from a reference model) produces a Trained DNN; Application Design adds the application logic; Standalone Deployment goes through the Coders, targeting the TensorRT and cuDNN libraries, the MKL-DNN library, and the ARM Compute Library.

Page 40

Deep Learning Workflow in MATLAB (repeats the recap diagram from Page 39).