Visual Computing Systems
CMU 15-769, Fall 2016

Lecture 8: Deep Neural Network Evaluation
Training/evaluating deep neural networks
Technique leading to many high-profile AI advances in recent years:
- Speech recognition / natural language processing
- Image interpretation and understanding
[Embedded figures from the Convolutional Pose Machines paper. Figure 10: Qualitative results of our method on the MPII, LSP and FLIC datasets respectively. We see that the method is able to handle non-standard poses and resolve ambiguities between symmetric parts for a variety of different relative camera views. Figure 11: Comparing PCKh-0.5 across various viewpoints in the MPII dataset (methods: Ours; Pishchulin et al., CVPR'16; Tompson et al., CVPR'15; Carreira et al., CVPR'16; Tompson et al., NIPS'14). Our method is significantly better in all the viewpoints.]
ing data). Note that adding MPII data here significantly boosts our performance, due to its labeling quality being much better than LSP. Because of the noisy labels in the LSP dataset, Pishchulin et al. [28] reproduced the dataset with original high resolution images and better labeling quality.

FLIC Dataset. We evaluate our method on the FLIC Dataset [32] which consists of 3987 images for training and 1016 images for testing. We report accuracy as per the metric introduced in Sapp et al. [32] for the elbow and wrist joints in Figure 12. Again, we outperform all prior art at PCK@0.2 with 97.59% on elbows and 95.03% on wrists. In the higher precision region our advantage is even more significant: 14.8 percentage points on wrists and 12.7 percentage points on elbows at PCK@0.05, and 8.9 percentage points on wrists and 9.3 percentage points on elbows at PCK@0.1.
[Embedded figure. Figure 12: Quantitative results on the FLIC dataset for the elbow and wrist joints with a 4-stage CPM (detection rate % vs. normalized distance; methods: Ours 4-Stage; Tompson et al., CVPR'15; Tompson et al., NIPS'14; Chen et al., NIPS'14; Toshev et al., CVPR'14; Sapp et al., CVPR'13). We outperform all competing methods.]
5. Discussion

Convolutional pose machines provide an end-to-end architecture for tackling structured prediction problems in computer vision without the need for graphical-model style inference. We showed that a sequential architecture composed of convolutional networks is capable of implicitly learning a spatial model for pose by communicating increasingly refined uncertainty-preserving beliefs between stages. Problems with spatial dependencies between variables arise in multiple domains of computer vision such as semantic image labeling, single image depth prediction and object detection, and future work will involve extending our architecture to these problems. Our approach achieves state of the art accuracy on all primary benchmarks, however we do observe failure cases mainly when multiple people are in close proximity. Handling multiple people in a single end-to-end architecture is also a challenging problem and an interesting avenue for future work.
Game AI
What is a deep neural network?
[Diagram: units with inputs x0 x1 x2 x3 and weights w0 w1 w2 w3]
A basic unit: a unit (“neuron”) with n inputs is described by n+1 parameters (n weights + 1 bias)

output = f( Σ_i x_i w_i + b )

Example f: rectified linear unit (ReLU): f(x) = max(0, x)

Basic computational interpretation: it’s just a circuit!
Biological inspiration: unit output corresponds loosely to the activation of a neuron
Machine learning interpretation: binary classifier: interpret the output as the probability of one class, using the sigmoid f(x) = 1 / (1 + e^(-x))
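As a concrete sketch of the formula above (the input, weight, and bias values here are made-up examples, not from the slides):

```c
/* Evaluate one unit: f(sum_i x_i * w_i + b) with f = ReLU. */
float relu(float x) { return x > 0.0f ? x : 0.0f; }

float unit_output(const float *x, const float *w, float b, int n) {
    float sum = b;                  /* start from the bias */
    for (int i = 0; i < n; i++)
        sum += x[i] * w[i];         /* weighted sum of the n inputs */
    return relu(sum);               /* nonlinearity */
}
```

A unit with n inputs thus touches exactly n+1 parameters per evaluation, matching the parameter counts on the next slide.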
What is a deep neural network: topology
Input: Output:
This network has: 4 inputs, 1 output, 7 hidden units
“Deep” = at least one hidden layer
Hidden layer 1: 3 units x (4 weights + 1 bias) = 15 parameters
Hidden layer 2: 4 units x (3 weights + 1 bias) = 16 parameters
Hidden layers:
Note “fully-connected” topology in this example
What is a deep neural network: topology
Fully connected layer
Sparsely (locally) connected
Inputs
Inputs
OutputsOutput
Recall image convolution (3x3 conv)

const int WIDTH = 1024;
const int HEIGHT = 1024;
float input[(WIDTH+2) * (HEIGHT+2)];
float output[WIDTH * HEIGHT];

float weights[] = {1.0/9, 1.0/9, 1.0/9,
                   1.0/9, 1.0/9, 1.0/9,
                   1.0/9, 1.0/9, 1.0/9};

for (int j = 0; j < HEIGHT; j++) {
  for (int i = 0; i < WIDTH; i++) {
    float tmp = 0.f;
    for (int jj = 0; jj < 3; jj++)
      for (int ii = 0; ii < 3; ii++)
        tmp += input[(j+jj)*(WIDTH+2) + (i+ii)] * weights[jj*3 + ii];
    output[j*WIDTH + i] = tmp;
  }
}
Convolutional layer: locally connected AND all units in the layer share the same parameters (same weights + same bias). (Note: the network diagram only shows links for a 1D conv, i.e., one iteration of the ii loop.)
Inputs
. . .. . . . . .
Inputs
Conv Layer
Strided 3x3 convolution

const int WIDTH = 1024;
const int HEIGHT = 1024;
const int STRIDE = 2;
float input[(WIDTH+2) * (HEIGHT+2)];
float output[(WIDTH/STRIDE) * (HEIGHT/STRIDE)];

float weights[] = {1.0/9, 1.0/9, 1.0/9,
                   1.0/9, 1.0/9, 1.0/9,
                   1.0/9, 1.0/9, 1.0/9};

for (int j = 0; j < HEIGHT; j += STRIDE) {
  for (int i = 0; i < WIDTH; i += STRIDE) {
    float tmp = 0.f;
    for (int jj = 0; jj < 3; jj++)
      for (int ii = 0; ii < 3; ii++)
        tmp += input[(j+jj)*(WIDTH+2) + (i+ii)] * weights[jj*3 + ii];
    output[(j/STRIDE)*(WIDTH/STRIDE) + (i/STRIDE)] = tmp;  // output rows are WIDTH/STRIDE wide
  }
}
Convolutional layer with stride 2
What does convolution with this filter do?

  [ .075  .124  .075 ]
  [ .124  .204  .124 ]
  [ .075  .124  .075 ]

Original → Blurred: “Gaussian Blur”
What does convolution with these filters do?

  [ -1  0  1 ]
  [ -2  0  2 ]   Extracts horizontal gradients
  [ -1  0  1 ]

  [ -1 -2 -1 ]
  [  0  0  0 ]   Extracts vertical gradients
  [  1  2  1 ]
Gradient detection filters
Horizontal gradients
Vertical gradients
Note: you can think of a filter as a “detector” of a pattern, and the magnitude of a pixel in the output image as the “response” of the filter to the region surrounding each pixel in the input image
Applying many filters to an image at once
Input: image (single channel): W x H
3x3 spatial convolutions on image 3x3 x num_filters weights
…
Output: filter responses W x H x num_filters
…
Each filter described by unique set of 3x3 weights
(each filter responds to different image phenomena)
Filter response maps (num_filters of them)
Applying many filters to an image at once
Input RGB image (W x H x 3)
96 11x11x3 filters (operate on RGB), 96 responses (normalized)
Adding additional layers
Input: image (single channel)
W x H
3x3 spatial convolutions 3x3 x num_filters weights
…
Output: filter responses W x H x num_filters
…
Each filter described by unique set of weights (responds to different
image phenomena)
Filter responses
post ReLU W x H x num_filters
…ReLU Pool…
post pool W/2 x H/2 x num_filters
(max response in 2x2 region)
Note data reduction as a result of pooling
Conv
…
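The ReLU-then-pool stage described above can be sketched in the same style as the earlier convolution code (a minimal single-channel sketch; the function name is illustrative):

```c
/* Apply ReLU, then 2x2 max-pool with stride 2, halving each spatial dimension.
   Input: W x H single-channel filter responses; output: (W/2) x (H/2). */
void relu_maxpool_2x2(const float *in, float *out, int W, int H) {
    for (int j = 0; j < H; j += 2) {
        for (int i = 0; i < W; i += 2) {
            float m = 0.0f;  /* ReLU clamps negatives to 0, so 0 is a safe initial max */
            for (int jj = 0; jj < 2; jj++)
                for (int ii = 0; ii < 2; ii++) {
                    float v = in[(j + jj) * W + (i + ii)];
                    v = v > 0.0f ? v : 0.0f;    /* ReLU */
                    if (v > m) m = v;           /* max over the 2x2 region */
                }
            out[(j / 2) * (W / 2) + (i / 2)] = m;
        }
    }
}
```

Note the 4x data reduction: each 2x2 input region produces a single output value.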
Example: “AlexNet” object detection network
Sequences of conv + ReLU + pool (optional) layers
Example: AlexNet [Krizhevsky12]: 5 convolutional layers + 3 fully connected layers
[VGG illustration credit: Yang et al.]
Another example: VGG-16 [Simonyan15]: 13 convolutional layers

input: 224 x 224 RGB
conv/ReLU: 3x3x3x64
conv/ReLU: 3x3x64x64
maxpool
conv/ReLU: 3x3x64x128
conv/ReLU: 3x3x128x128
maxpool
conv/ReLU: 3x3x128x256
conv/ReLU: 3x3x256x256
conv/ReLU: 3x3x256x256
maxpool
conv/ReLU: 3x3x256x512
conv/ReLU: 3x3x512x512
conv/ReLU: 3x3x512x512
maxpool
conv/ReLU: 3x3x512x512
conv/ReLU: 3x3x512x512
conv/ReLU: 3x3x512x512
maxpool
fully-connected 4096
fully-connected 4096
fully-connected 1000
soft-max
Our model
● Max-pooling layers follow first, second, and fifth convolutional layers
● The number of neurons in each layer is given by 253440, 186624, 64896, 64896, 43264, 4096, 4096, 1000
Why deep?
Right: images that generate the strongest response for filters at each layer
Left: what pixels trigger the response
[image credit: Zeiler 14]
Why deep?
[image credit: Zeiler 14]
More recent image understanding networks
[Embedded page from the GoogLeNet paper (Szegedy et al.):]

A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time). A schematic view of the resulting network is depicted in Figure 3.

6. Training Methodology

GoogLeNet networks were trained using the DistBelief [4] distributed machine learning system using modest amount of model and data-parallelism. Although we used a CPU based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum [17], fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.

Image sampling methods have changed substantially over the months leading to the competition, and already converged models were trained on with other options, sometimes in conjunction with changed hyperparameters, such as dropout and the learning rate. Therefore, it is hard to give a definitive guidance to the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area with aspect ratio constrained to the interval [3/4, 4/3]. Also, we found that the photometric distortions of Andrew Howard [8] were useful to combat overfitting to the imaging conditions of training data.

7. ILSVRC 2014 Classification Challenge Setup and Results

The ILSVRC 2014 classification challenge involves the task of classifying the image into one of 1000 leaf-node categories in the Imagenet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top-5, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes.
[Embedded network diagram. Figure 3: GoogLeNet network with all the bells and whistles.]
[Embedded figure from the ResNet paper (He et al.). Figure 3: Example network architectures for ImageNet. Left: the VGG-19 model [41] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increase dimensions. Table 1 shows more details and other variants.]

Residual Network. Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn. (1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn. (2) is used to match dimensions (done by 1x1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.

3.4. Implementation

Our implementation for ImageNet follows the practice in [21, 41]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [41]. A 224x224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [13] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60x10^4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [14], following the practice in [16].

In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully-convolutional form as in [41, 13], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).

4. Experiments

4.1. ImageNet Classification

We evaluate our method on the ImageNet 2012 classification dataset [36] that consists of 1000 classes. The models are trained on the 1.28 million training images, and evaluated on the 50k validation images. We also obtain a final result on the 100k test images, reported by the test server. We evaluate both top-1 and top-5 error rates.

Plain Networks. We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net is in Fig. 3 (middle). The 18-layer plain net is of a similar form. See Table 1 for detailed architectures.

The results in Table 2 show that the deeper 34-layer plain net has higher validation error than the shallower 18-layer plain net. To reveal the reasons, in Fig. 4 (left) we compare their training/validation errors during the training procedure. We have observed the degradation problem - the
ResNet (34 layer version)
Inception (GoogLeNet)
FCN for image segmentation
Upsampling network
Deep networks learn useful representations
▪ Simultaneous, multi-scale learning of useful concepts for the task at hand
  - Example on previous slides: subpart detectors emerged in a network trained for object classification
▪ “Fine tuning”
  - Common practice of adjusting representations (weights) previously learned on task A for a new (but similar) task B
  - Reduces training time for task B
  - Useful when little training data for task B exists
  - Example: keep the first N layers the same, tune weights in the highest layers based on task B training data
Efficiently implementing convolution layers
Dense matrix multiplication

float A[M][K];
float B[K][N];
float C[M][N];

// compute C += A * B
#pragma omp parallel for
for (int j = 0; j < M; j++)
  for (int i = 0; i < N; i++)
    for (int k = 0; k < K; k++)
      C[j][i] += A[j][k] * B[k][i];

(C is M x N, A is M x K, B is K x N)

What is the problem with this implementation?
Low arithmetic intensity (does not exploit temporal locality in accesses to A and B)
Blocked dense matrix multiplication

float A[M][K];
float B[K][N];
float C[M][N];

// compute C += A * B
#pragma omp parallel for
for (int jblock = 0; jblock < M; jblock += BLOCKSIZE_J)
  for (int iblock = 0; iblock < N; iblock += BLOCKSIZE_I)
    for (int kblock = 0; kblock < K; kblock += BLOCKSIZE_K)
      for (int j = 0; j < BLOCKSIZE_J; j++)
        for (int i = 0; i < BLOCKSIZE_I; i++)
          for (int k = 0; k < BLOCKSIZE_K; k++)
            C[jblock+j][iblock+i] += A[jblock+j][kblock+k] * B[kblock+k][iblock+i];

Idea: compute a partial result for a block of C while the required blocks of A and B remain in cache (assumes BLOCKSIZE chosen to allow blocks of A, B, and C to remain resident)
Self check: do you want as big a BLOCKSIZE as possible? Why?
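A self-contained version of the blocked loop nest above can be checked against the naive triple loop (a sketch without OpenMP; the sizes and block size here are arbitrary choices, picked as exact multiples so no edge handling is needed):

```c
#define M 64
#define N 64
#define K 64
#define BLOCK 16   /* assumes M, N, K are multiples of BLOCK */

static float A[M][K], B[K][N], C[M][N], C_ref[M][N];

/* Blocked C += A * B: each (jb, ib, kb) iteration updates one block of C
   using one block of A and one block of B. */
void matmul_blocked(void) {
    for (int jb = 0; jb < M; jb += BLOCK)
        for (int ib = 0; ib < N; ib += BLOCK)
            for (int kb = 0; kb < K; kb += BLOCK)
                for (int j = 0; j < BLOCK; j++)
                    for (int i = 0; i < BLOCK; i++)
                        for (int k = 0; k < BLOCK; k++)
                            C[jb + j][ib + i] += A[jb + j][kb + k] * B[kb + k][ib + i];
}
```

For a fixed output element the k values are still visited in ascending order, so the result is bit-identical to the unblocked loop; only the traversal order over blocks (and hence the cache behavior) changes.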
Hierarchical blocked matrix mult

float A[M][K];
float B[K][N];
float C[M][N];

// compute C += A * B
#pragma omp parallel for
for (int jblock2 = 0; jblock2 < M; jblock2 += L2_BLOCKSIZE_J)
  for (int iblock2 = 0; iblock2 < N; iblock2 += L2_BLOCKSIZE_I)
    for (int kblock2 = 0; kblock2 < K; kblock2 += L2_BLOCKSIZE_K)
      for (int jblock1 = 0; jblock1 < L2_BLOCKSIZE_J; jblock1 += L1_BLOCKSIZE_J)
        for (int iblock1 = 0; iblock1 < L2_BLOCKSIZE_I; iblock1 += L1_BLOCKSIZE_I)
          for (int kblock1 = 0; kblock1 < L2_BLOCKSIZE_K; kblock1 += L1_BLOCKSIZE_K)
            for (int j = 0; j < L1_BLOCKSIZE_J; j++)
              for (int i = 0; i < L1_BLOCKSIZE_I; i++)
                for (int k = 0; k < L1_BLOCKSIZE_K; k++)
                  ...

Not shown: a final level of “blocking” for register locality…
Exploit multiple levels of the memory hierarchy
Blocked dense matrix multiplication (1)

Consider SIMD parallelism within a block:

for (int j = 0; j < BLOCKSIZE_J; j++) {
  for (int i = 0; i < BLOCKSIZE_I; i += SIMD_WIDTH) {
    simd_vec C_accum = vec_load(&C[jblock+j][iblock+i]);
    for (int k = 0; k < BLOCKSIZE_K; k++) {
      // C = A*B + C
      simd_vec A_val = splat(&A[jblock+j][kblock+k]);  // broadcast a single element into a vector register
      simd_muladd(A_val, vec_load(&B[kblock+k][iblock+i]), C_accum);
    }
    vec_store(&C[jblock+j][iblock+i], C_accum);
  }
}

Vectorize the i loop.
Good: also improves spatial locality in accesses to B
Bad: working set increased by SIMD_WIDTH; still walking over B in large steps
Blocked dense matrix multiplication (2)

Assume the i dimension is small; the previous vectorization scheme (1) would not work well. Instead, pre-transpose a block of B (copy the block of B to a temp buffer in transposed form, Btrans) and vectorize the innermost (k) loop:

for (int j = 0; j < BLOCKSIZE_J; j++)
  for (int i = 0; i < BLOCKSIZE_I; i++) {
    float C_scalar = C[jblock+j][iblock+i];
    for (int k = 0; k < BLOCKSIZE_K; k += SIMD_WIDTH) {
      // C_scalar = dot(A, B) + C_scalar
      C_scalar += simd_dot(vec_load(&A[jblock+j][kblock+k]),
                           vec_load(&Btrans[iblock+i][kblock+k]));
    }
    C[jblock+j][iblock+i] = C_scalar;
  }
Blocked dense matrix multiplication (3)

// assume blocks of A and C are pre-transposed as Atrans and Ctrans
for (int j = 0; j < BLOCKSIZE_J; j += SIMD_WIDTH) {
  for (int i = 0; i < BLOCKSIZE_I; i += SIMD_WIDTH) {
    simd_vec C_accum[SIMD_WIDTH];
    for (int k = 0; k < SIMD_WIDTH; k++)   // load C_accum
      C_accum[k] = vec_load(&Ctrans[iblock+i+k][jblock+j]);

    for (int k = 0; k < BLOCKSIZE_K; k++) {
      simd_vec bvec = vec_load(&B[kblock+k][iblock+i]);
      for (int kk = 0; kk < SIMD_WIDTH; kk++)   // innermost loop iterations not dependent
        simd_muladd(vec_load(&Atrans[kblock+k][jblock+j]), splat(bvec[kk]), C_accum[kk]);
    }

    for (int k = 0; k < SIMD_WIDTH; k++)
      vec_store(&Ctrans[iblock+i+k][jblock+j], C_accum[k]);
  }
}
Convolution as matrix-vector product

Weight vector: the 3x3 = 9 filter weights, [w0 w1 ... w8]^T

Data matrix: W x H rows, 9 columns, constructed from elements of the input image (note: 0-padded at the image border); each row holds the 3x3 window around one pixel, e.g.:

  [ 0 0 0 0 x00 x01 0 x10 x11 ]
  ...

O(N) storage multiplier for a filter with N elements
Must construct the input data matrix
3x3 convolution as matrix-vector product

Weight vector (9 elements): [w0 w1 ... w8]^T

Data matrix (W x H rows x 9 columns), constructed from elements of the input image (note: 0-padded at the image border):

  [ 0   0   0   0   x00 x01 0   x10 x11 ]
  [ 0   0   0   x00 x01 x02 x10 x11 x12 ]
  [ 0   0   0   x01 x02 x03 x11 x12 x13 ]
  ...
  [ x00 x01 x02 x10 x11 x12 x20 x21 x22 ]
  ...

O(N) storage overhead for a filter with N elements
Must construct the input data matrix
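The matrix construction described above is commonly called im2col; a minimal single-channel sketch (the function name and zero-padding convention are illustrative choices, not from the slides):

```c
/* Build the (W*H) x 9 data matrix for a 3x3 convolution over a W x H image,
   zero-padding out-of-bounds taps. Row r holds the 3x3 window centered on
   output pixel r; multiplying by the 9-element weight vector then computes
   every output pixel at once. */
void im2col_3x3(const float *img, int W, int H, float *mat /* (W*H) x 9 */) {
    for (int j = 0; j < H; j++)
        for (int i = 0; i < W; i++)
            for (int jj = -1; jj <= 1; jj++)
                for (int ii = -1; ii <= 1; ii++) {
                    int jj2 = j + jj, ii2 = i + ii;
                    float v = (jj2 >= 0 && jj2 < H && ii2 >= 0 && ii2 < W)
                                  ? img[jj2 * W + ii2] : 0.0f;   /* 0-pad at border */
                    mat[(j * W + i) * 9 + (jj + 1) * 3 + (ii + 1)] = v;
                }
}
```

For a 2x2 image with pixels x00 x01 x10 x11, the first row produced is [0 0 0 0 x00 x01 0 x10 x11], matching the first row of the matrix shown above.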
Multiple convolutions as matrix-matrix mult

The same (W x H) x 9 data matrix, now multiplied by a 9 x num_filters weight matrix (one column of 9 weights per filter):

  [ w00 w01 w02 ... w0N ]
  [ w10 w11 w12 ... w1N ]
  [  ...                ]
  [ w80 w81 w82 ... w8N ]

Data matrix rows (as before, 0-padded at the border):

  [ 0   0   0   0   x00 x01 0   x10 x11 ]
  [ 0   0   0   x00 x01 x02 x10 x11 x12 ]
  [ 0   0   0   x01 x02 x03 x11 x12 x13 ]
  ...
  [ x00 x01 x02 x10 x11 x12 x20 x21 x22 ]
  ...
Multiple convolutions on multiple input channels

The data matrix now has 9 x num_input_channels columns: for each output pixel, the 3x3 windows from channel 0, channel 1, and channel 2 are concatenated into one row:

  [ 0 0 0 0 x00 x01 0 x10 x11 | channel 1 values | channel 2 values ]
  [ 0 0 0 x00 x01 x02 x10 x11 x12 | ...           | ...             ]
  ...
  [ x00 x01 x02 x10 x11 x12 x20 x21 x22 | ...     | ...             ]
  ...

For each filter, sum responses over input channels.
Equivalent to a (3 x 3 x num_channels) convolution on (W x H x num_channels) input data.

Weight matrix is (9 x num_channels) rows x num_filters columns:

  [ w000 w001 w002 ... w00N ]
  [ w010 w011 w012 ... w01N ]
  [  ...                    ]
  [ w080 w081 w082 ... w08N ]
  [ w100 w101 w102 ... w10N ]
  [ w110 w111 w112 ... w11N ]
  [  ...                    ]
  [ w180 w181 w182 ... w18N ]
  [ w200 w201 w202 ... w20N ]
  [ w210 w211 w212 ... w21N ]
  [  ...                    ]
  [ w280 w281 w282 ... w28N ]
VGG memory footprint
Calculations assume 32-bit values (image batch size = 1)

layer                        weights mem   output size (per image)   output mem
input: 224 x 224 RGB image        —        224x224x3                 150 KB
conv: (3x3x3) x 64              6.5 KB     224x224x64                12.3 MB
conv: (3x3x64) x 64             144 KB     224x224x64                12.3 MB
maxpool                           —        112x112x64                3.1 MB
conv: (3x3x64) x 128            228 KB     112x112x128               6.2 MB
conv: (3x3x128) x 128           576 KB     112x112x128               6.2 MB
maxpool                           —        56x56x128                 1.5 MB
conv: (3x3x128) x 256           1.1 MB     56x56x256                 3.1 MB
conv: (3x3x256) x 256           2.3 MB     56x56x256                 3.1 MB
conv: (3x3x256) x 256           2.3 MB     56x56x256                 3.1 MB
maxpool                           —        28x28x256                 766 KB
conv: (3x3x256) x 512           4.5 MB     28x28x512                 1.5 MB
conv: (3x3x512) x 512             9 MB     28x28x512                 1.5 MB
conv: (3x3x512) x 512             9 MB     28x28x512                 1.5 MB
maxpool                           —        14x14x512                 383 KB
conv: (3x3x512) x 512             9 MB     14x14x512                 383 KB
conv: (3x3x512) x 512             9 MB     14x14x512                 383 KB
conv: (3x3x512) x 512             9 MB     14x14x512                 383 KB
maxpool                           —        7x7x512                   98 KB
fully-connected 4096            392 MB     4096                      16 KB
fully-connected 4096             64 MB     4096                      16 KB
fully-connected 1000           15.6 MB     1000                      4 KB
soft-max                          —        1000                      4 KB

inputs/outputs get multiplied by image batch size
multiply by the next layer’s conv window size to form the input matrix to the next conv layer! (for VGG’s 3x3 convolutions, this is 9x data amplification)
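The per-layer numbers above follow directly from the layer shapes; a quick sketch of the arithmetic (assuming 4-byte floats and ignoring biases):

```c
/* Weights footprint of a conv layer: convY * convX * in_channels * num_filters floats. */
long conv_weights_bytes(int cy, int cx, int in_ch, int nf) {
    return (long)cy * cx * in_ch * nf * sizeof(float);
}

/* Output footprint of a layer producing W x H x C values. */
long output_bytes(int W, int H, int C) {
    return (long)W * H * C * sizeof(float);
}
```

For example, the (3x3x64) x 64 layer holds 3*3*64*64*4 = 147456 bytes = 144 KB of weights, matching the table.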
Direct implementation of conv layer

float input[IMAGE_BATCH_SIZE][INPUT_HEIGHT][INPUT_WIDTH][INPUT_DEPTH];
float output[IMAGE_BATCH_SIZE][INPUT_HEIGHT][INPUT_WIDTH][LAYER_NUM_FILTERS];
float layer_weights[LAYER_NUM_FILTERS][LAYER_CONVY][LAYER_CONVX][INPUT_DEPTH];

// assumes convolution stride is 1 (and a padded input, as in the earlier examples)
for (int img = 0; img < IMAGE_BATCH_SIZE; img++)
  for (int j = 0; j < INPUT_HEIGHT; j++)
    for (int i = 0; i < INPUT_WIDTH; i++)
      for (int f = 0; f < LAYER_NUM_FILTERS; f++) {
        output[img][j][i][f] = 0.f;
        for (int kk = 0; kk < INPUT_DEPTH; kk++)      // sum over filter responses of input channels
          for (int jj = 0; jj < LAYER_CONVY; jj++)    // spatial convolution
            for (int ii = 0; ii < LAYER_CONVX; ii++)  // spatial convolution
              output[img][j][i][f] += layer_weights[f][jj][ii][kk] * input[img][j+jj][i+ii][kk];
      }

Seven loops with significant input data reuse: reuse of filter weights (during convolution), and reuse of input values (across different filters)

Avoids the O(N) footprint increase by not materializing the input matrix
In theory loads O(N) times less data (potentially higher arithmetic intensity… but matrix mult is typically compute-bound)
But you must roll your own highly optimized implementation of a complicated loop nest.
Conv layer in Halide

int in_w, in_h, in_ch = 4;         // input params: assume initialized
Func in_func;                      // assume input function is initialized
int num_f, f_w, f_h, pad, stride;  // parameters of the conv layer

Func forward = Func("conv");
Var x("x"), y("y"), z("z"), n("n");  // n is the minibatch dimension

// This creates a padded input to avoid checking boundary
// conditions while computing the actual convolution
Func f_in_bound = BoundaryConditions::repeat_edge(in_func, 0, in_w, 0, in_h);

// Create image buffers for layer parameters
Image<float> W(f_w, f_h, in_ch, num_f);
Image<float> b(num_f);

RDom r(0, f_w, 0, f_h, 0, in_ch);
// Initialize to bias
forward(x, y, z, n) = b(z);
forward(x, y, z, n) += W(r.x, r.y, r.z, z) *
                       f_in_bound(x*stride + r.x - pad, y*stride + r.y - pad, r.z, n);

Consider scheduling this seven-dimensional loop nest.
Algorithmic improvements
▪ Direct convolution can be implemented efficiently in the Fourier domain (convolution → element-wise multiplication)
  - Overhead: an FFT to transform inputs into the Fourier domain, and an inverse FFT to bring responses back to the spatial domain (O(N log N))
  - The inverse transform is amortized over all input channels (due to the summation over inputs)
▪ Direct convolution using work-efficient Winograd convolutions

1D example: consider producing two outputs of a 3-tap 1D convolution with weights w0 w1 w2 applied to inputs x0 x1 x2 x3:

  [ y0 ]   [ x0 x1 x2 ] [ w0 ]   [ m1 + m2 + m3 ]
  [ y1 ] = [ x1 x2 x3 ] [ w1 ] = [ m2 - m3 - m4 ]
                        [ w2 ]

where

  m1 = (x0 - x2) w0
  m2 = (x1 + x2) (w0 + w1 + w2) / 2
  m3 = (x2 - x1) (w0 - w1 + w2) / 2
  m4 = (x1 - x3) w2

The filter-dependent factors (w0 + w1 + w2)/2 and (w0 - w1 + w2)/2 can be precomputed.

1D 3-tap total cost: 4 multiplies, 8 additions (4 to compute the m's + 4 to produce the final result)
Direct convolution: 6 multiplies, 4 adds
In 2D the reduction in multiplications is notable (3x3 filter: 2.25x fewer multiplies for a 2x2 block of output)
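The 1D transform above can be checked against direct convolution (a sketch; in a real implementation the filter-dependent terms would be precomputed once per filter and reused across the whole image):

```c
/* Winograd F(2,3): compute two outputs of a 3-tap filter with 4 multiplies.
   y0 = x0*w0 + x1*w1 + x2*w2 ; y1 = x1*w0 + x2*w1 + x3*w2 */
void winograd_f23(const float x[4], const float w[3], float y[2]) {
    /* Filter-dependent terms (precomputable once per filter): */
    float g0 = w[0];
    float g1 = 0.5f * (w[0] + w[1] + w[2]);
    float g2 = 0.5f * (w[0] - w[1] + w[2]);
    float g3 = w[2];

    /* The four multiplies: */
    float m1 = (x[0] - x[2]) * g0;
    float m2 = (x[1] + x[2]) * g1;
    float m3 = (x[2] - x[1]) * g2;
    float m4 = (x[1] - x[3]) * g3;

    y[0] = m1 + m2 + m3;
    y[1] = m2 - m3 - m4;
}
```

Expanding the sums confirms y0 = x0w0 + x1w1 + x2w2 and y1 = x1w0 + x2w1 + x3w2, the two sliding-window outputs of direct convolution.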
Deep neural networks on GPUs
▪ Today, many performant DNN implementations target GPUs
  - High arithmetic intensity computations (computational characteristics similar to dense matrix-matrix multiplication)
  - Benefit from flop-rich architectures
  - Highly-optimized libraries of kernels exist for GPUs (e.g., NVIDIA’s cuDNN)
  - Most CPU-based implementations use a basic matrix-multiplication-based formulation (good implementations could run faster!)
Emerging architectures for deep learning?
▪ NVIDIA Pascal (NVIDIA’s latest GPU)
  - Adds double-throughput 16-bit floating point ops
- This feature is already common on mobile GPUs
▪ Intel Xeon Phi (Knights Landing)
  - Flop-rich 68-core x86 processor for scientific computing and machine learning
▪ FPGAs, ASICs
  - Not new: FPGA solutions have been explored for years
- Significant amount of ongoing industry and academic research
- Google’s TPU
- We will discuss several recent papers in an upcoming class
Summary: efficiently evaluating deep nets
▪ Computational structure
  - Conv layers: high arithmetic intensity, a significant portion of the cost of evaluating a network
  - Similar data access patterns to dense matrix multiplication (exploiting temporal reuse is key)
  - But a straight reduction to matrix-matrix multiplication is often sub-optimal
  - Work-efficient techniques exist for convolutional layers (FFT-based, Winograd convolutions)
▪ Large numbers of parameters: significant interest in reducing the size of networks for both training and evaluation
  - Will discuss pruning/quantization in upcoming classes
▪ Many ongoing studies of specialized hardware architectures for efficient evaluation
  - Future CPUs/GPUs, ASICs, FPGAs, …
  - Specialization will be important to achieving “always on” applications