Confidential and Proprietary GPU-BASED DEEP LEARNING IN CLOUD AND EMBEDDED SYSTEMS FREDERICK SOO, CTO April 4, 2016


Page 1: GPU-BASED DEEP LEARNING IN CLOUD AND EMBEDDED SYSTEMS

Confidential and Proprietary

FREDERICK SOO, CTO

April 4, 2016

Page 2: Nauto is launching a connected camera for professional drivers

• Drive more than most consumers
• Exposed to passenger and driver liability
• Driver quality unknown - small number of very bad drivers

Page 3: Massive shift in transportation due to synergistic technologies

• Autonomous: 90% reduction in accidents
• Connected: fleet optimization
• Electric: $0.08 / mile, 85% efficient drivetrain
• Shared: 50-70% utilization

Page 4: Why use deep learning?

• Good at visual tasks
• Scalable
• Deployable

Most important for NAUTO

Page 5: Small brains have a lot of functionality

[Figure: neuron counts (1 million, 10 million, 100 million, 26 billion neurons) and power levels (1mW, 10mW, 100mW, 20 watts)]

Page 6: Required performance depends on use case

Page 7: Small changes in F1 with size

• Large networks can be used in later stages of cascade
• Order of magnitude improvements in speed with basic exploration
• Always worth measuring performance/size tradeoff
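The cascade idea in the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not Nauto's implementation: `run_cascade`, the two toy models, the thresholds, and the numeric "frames" are all hypothetical stand-ins chosen so the example runs without any deep learning library.

```python
from typing import Callable, Iterable, List, Tuple

Frame = float  # hypothetical stand-in for an image; a real system would pass arrays


def run_cascade(
    frames: Iterable[Frame],
    small_model: Callable[[Frame], float],
    large_model: Callable[[Frame], float],
    small_threshold: float = 0.3,
    large_threshold: float = 0.5,
) -> Tuple[List[bool], int]:
    """Return per-frame detections and the number of large-model calls."""
    detections: List[bool] = []
    large_calls = 0
    for frame in frames:
        # Stage 1: the cheap network rejects frames that score clearly negative.
        if small_model(frame) < small_threshold:
            detections.append(False)
            continue
        # Stage 2: only surviving candidates pay for the expensive network.
        large_calls += 1
        detections.append(large_model(frame) >= large_threshold)
    return detections, large_calls


# Toy stand-ins (assumptions): each "model" maps a number to a score.
def toy_small_model(frame: Frame) -> float:
    return frame  # coarse first-stage score


def toy_large_model(frame: Frame) -> float:
    return frame - 0.1  # stricter second-stage score


frames = [0.05, 0.2, 0.45, 0.9]
detections, large_calls = run_cascade(frames, toy_small_model, toy_large_model)
print(detections, large_calls)
```

In this toy run the large model is called on only two of the four frames. Counting those calls alongside F1 is one way to measure the performance/size tradeoff the slide recommends.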

Page 8: Test your chipsets - algorithm speed important but not entire story

[Chart: Nauto CNN forward pass (msec), axis 0 to 150, for embedded SoCs A through E]

• Chipsets released in 2014, 2015 and 2016
• Pricing varying from $25 to $60+
• Varying degrees of HW/SW support
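A chipset comparison like the one charted here needs a repeatable way to time the forward pass on each device. The sketch below is one possible harness, not the method used for the slide: `benchmark_ms` and `fake_forward_pass` are hypothetical names, and the warm-up and run counts are arbitrary assumptions.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark_ms(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> Dict[str, float]:
    """Time fn() and return the median and worst latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded (caches, lazy initialization)
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
        "runs": float(runs),
    }


def fake_forward_pass() -> int:
    # Hypothetical stand-in for a CNN forward pass on the target SoC.
    return sum(i * i for i in range(10_000))


result = benchmark_ms(fake_forward_pass)
print(f"median {result['median_ms']:.3f} ms, max {result['max_ms']:.3f} ms")
```

Reporting the median next to the worst case helps separate steady-state speed from occasional stalls, which can differ a lot between chipsets with varying HW/SW support.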

Page 9: Algorithm is not the bottleneck

• Image processing: 30 msec
• Conversion to CNN space: 30 msec
• CNN forward pass: … msec
• Other steps: 15 msec
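A per-stage breakdown like this one can be collected by timing each pipeline step separately. The following is a minimal sketch, assuming a simple sequential pipeline. The stage names come from the slide, but `time_stages` and the lambda bodies are hypothetical placeholders, not the real processing steps.

```python
import time
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[object], object]]


def time_stages(stages: List[Stage], data: object) -> Dict[str, float]:
    """Run stages in order, feeding each output to the next; return msec per stage."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = (time.perf_counter() - start) * 1000.0
    return timings


# Hypothetical stand-ins named after the stages on the slide.
stages: List[Stage] = [
    ("image processing", lambda frame: [p * 2 for p in frame]),
    ("conversion to CNN space", lambda frame: [float(p) for p in frame]),
    ("CNN forward pass", lambda tensor: sum(tensor)),
    ("other steps", lambda score: score > 0),
]

timings = time_stages(stages, list(range(1000)))
total = sum(timings.values())
for name, msec in timings.items():
    share = 100.0 * msec / total if total else 0.0
    print(f"{name}: {msec:.3f} ms ({share:.0f}% of total)")
```

Printing each stage's share of the total makes it easy to see when steps outside the CNN dominate the frame time.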

Page 10: Entire system must be optimized

                  Collect data   Label    Train    Deploy
Pre-GPU           years          months   months   months/years

Page 11: Entire system must be optimized

                  Collect data   Label    Train    Deploy
Pre-GPU           years          months   months   months/years
Post-GPU          weeks          months   months   months/years

Page 12: Entire system must be optimized

                  Collect data   Label    Train    Deploy
Pre-GPU           years          months   months   months/years
Post-GPU          weeks          months   months   months/years
Nauto prototype   days           weeks    weeks    weeks

Page 13: Entire system must be optimized

                  Collect data   Label    Train    Deploy
Pre-GPU           years          months   months   months/years
Post-GPU          weeks          months   months   months/years
Nauto prototype   days           weeks    weeks    weeks
Nauto at-scale    ?              ?        ?        ?

Page 14: Easy to think of optimization; hard to think of system

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered.

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%.

Donald Knuth

Page 15: Lessons

• Embedded pipeline is as important as raw CNN performance
• Match algorithm performance to use case
• Overall system performance (data acquisition, labeling, training) is where big progress is to be made

Page 16: The future is in distributed awareness

Real world search

Page 17: Team

• Ludmila Levkova
• Nikhil Deshmukh
• Joe Virzi
• Jonathan Soo