Source: on-demand.gputechconf.com/gtc/2016/presentation/s6806-frederick …


GPU-BASED DEEP LEARNING IN CLOUD AND EMBEDDED SYSTEMS

FREDERICK SOO, CTO

April 4, 2016

Nauto is launching a connected camera for professional drivers

• Drive more than most consumers
• Exposed to passenger and driver liability
• Driver quality unknown - small number of very bad drivers

Massive shift in transportation due to synergistic technologies

• Autonomous: 90% reduction in accidents
• Connected
• Electric
• Shared

$0.08 / mile, 85% efficient drivetrain, 50-70% utilization, fleet optimization

Why use deep learning?

• Good at visual tasks
• Scalable
• Deployable

Most important for NAUTO

Small brains have a lot of functionality

[Figure: brains compared by neuron count (26 billion, 1 million, 10 million, 100 million neurons) and power (20 watts, 1mW, 10mW, 100mW)]

Required performance depends on use case


Small changes in F1 with size

• Large networks can be used in later stages of cascade
• Order of magnitude improvements in speed with basic exploration
• Always worth measuring performance/size tradeoff
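The cascade idea and the F1 measurement above can be sketched as follows. This is a minimal illustration, not Nauto's implementation: the `small` and `large` scorers, the `low`/`high` thresholds, and the toy inputs are all assumptions made up for the example. The small model decides the confident cases, and the large model runs only on the uncertain ones.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical scorer type: maps an input to a probability in [0, 1].
Scorer = Callable[[float], float]


def cascade_predict(x: float, small: Scorer, large: Scorer,
                    low: float = 0.2, high: float = 0.8) -> Tuple[bool, bool]:
    """Return (label, used_large). Thresholds are illustrative assumptions."""
    p = small(x)
    if p <= low:        # small model is confident: negative
        return False, False
    if p >= high:       # small model is confident: positive
        return True, False
    # Uncertain band: pay for the large model only here.
    return large(x) >= 0.5, True


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 = harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def toy_small(x: float) -> float:
    # Stand-in for a small, fast network: the input is used as the score.
    return x


def toy_large(x: float) -> float:
    # Stand-in for a large, slow network.
    return 1.0 if x > 0.5 else 0.0


inputs = [0.1, 0.4, 0.6, 0.9]
truth = [False, False, True, True]
results = [cascade_predict(x, toy_small, toy_large) for x in inputs]
preds = [label for label, _ in results]
large_calls = sum(1 for _, used in results if used)
print(f"F1={f1_score(truth, preds):.2f}, large model calls={large_calls}/{len(inputs)}")
```

Tracking both numbers side by side (F1 and the number of large-model calls) is one way to measure the performance/size tradeoff rather than guess at it.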

Test your chipsets - algorithm speed important but not entire story

[Chart: Nauto CNN forward pass (msec), axis 0 to 150, for embedded SoCs A through E]

• Chipsets released in 2014, 2015 and 2016
• Pricing varying from $25 to $60+
• Varying degrees of HW/SW support
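A comparison like the one charted above needs a repeatable timing harness on each chipset. The sketch below is one plausible way to do it, assuming the forward pass can be wrapped in a zero-argument callable; `benchmark_ms` and `fake_forward_pass` are hypothetical names, and the workload is a stand-in, not a real CNN.

```python
import statistics
import time
from typing import Callable


def benchmark_ms(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> float:
    """Median wall-clock time of fn() in milliseconds.

    Warm-up calls are discarded so one-time costs (cache fills, lazy
    initialization) do not skew the result.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_forward_pass() -> int:
    # Stand-in workload; replace with the real forward pass on the target SoC.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(f"median forward pass: {benchmark_ms(fake_forward_pass):.3f} msec")
```

The median is used instead of the mean so that an occasional slow run (for example, from thermal throttling or a background task) does not dominate the figure.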

Algorithm is not the bottleneck

• Image processing: 30msec
• Conversion to CNN space: 30msec
• CNN forward pass: … msec
• Other steps: 15msec
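The stage timings above imply a ceiling on frame rate regardless of how fast the CNN is. The sketch below works through that arithmetic, assuming the stages run one after another on each frame (the slide does not say whether they are pipelined). The CNN forward-pass time is left as a parameter because the slide's value is elided; the 25 msec used in the usage lines is purely hypothetical.

```python
# Stage timings taken from the slide, in milliseconds. The CNN forward
# pass is deliberately absent: its value is not given in the source.
KNOWN_STAGES_MS = {
    "image processing": 30.0,
    "conversion to CNN space": 30.0,
    "other steps": 15.0,
}


def frame_time_ms(cnn_forward_ms: float) -> float:
    """Total per-frame latency, assuming strictly serial stages."""
    return cnn_forward_ms + sum(KNOWN_STAGES_MS.values())


def max_fps(cnn_forward_ms: float) -> float:
    """Frames per second achievable at the given CNN forward-pass time."""
    return 1000.0 / frame_time_ms(cnn_forward_ms)


def cnn_share(cnn_forward_ms: float) -> float:
    """Fraction of per-frame latency spent in the CNN forward pass."""
    return cnn_forward_ms / frame_time_ms(cnn_forward_ms)


if __name__ == "__main__":
    # Even a zero-cost CNN leaves 75 msec of non-CNN work per frame.
    print(f"ceiling with free CNN: {max_fps(0.0):.1f} fps")
    # Hypothetical 25 msec forward pass, for illustration only.
    print(f"at 25 msec: {max_fps(25.0):.1f} fps, CNN share {cnn_share(25.0):.0%}")
```

Under these assumptions the non-CNN stages alone cost 75 msec per frame, so speeding up the network cannot push throughput past roughly 13 frames per second.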

Entire system must be optimized

                  Collect data   Label    Train    Deploy
Pre-GPU           years          months   months   months/years
Post-GPU          weeks          months   months   months/years
Nauto prototype   days           weeks    weeks    weeks
Nauto at-scale    ?              ?        ?        ?

Easy to think of optimization; hard to think of system

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered.

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%.

Donald Knuth

Lessons

• Embedded pipeline is as important as raw CNN performance
• Match algorithm performance to use case
• Overall system performance (data acquisition, labeling, training) is where big progress is to be made

The future is in distributed awareness


Real world search

Team


Ludmila Levkova

Nikhil Deshmukh

Joe Virzi

Jonathan Soo
