Visual Computing Systems
CMU 15-769, Fall 2016

Lecture 8: Deep Neural Network Evaluation
Training/evaluating deep neural networks
Technique leading to many high-profile AI advances in recent years:
- Speech recognition / natural language processing
- Image interpretation and understanding
[Embedded figures from the Convolutional Pose Machines paper. Figure 10: Qualitative results of our method on the MPII, LSP and FLIC datasets respectively. We see that the method is able to handle non-standard poses and resolve ambiguities between symmetric parts for a variety of different relative camera views. Figure 11: Comparing PCKh-0.5 across various viewpoints in the MPII dataset (methods: Ours; Pishchulin et al., CVPR'16; Tompson et al., CVPR'15; Carreira et al., CVPR'16; Tompson et al., NIPS'14). Our method is significantly better in all the viewpoints.]
ing data). Note that adding MPII data here significantly boosts our performance, due to its labeling quality being much better than LSP. Because of the noisy labels in the LSP dataset, Pishchulin et al. [28] reproduced the dataset with original high resolution images and better labeling quality.

FLIC Dataset. We evaluate our method on the FLIC Dataset [32] which consists of 3987 images for training and 1016 images for testing. We report accuracy as per the metric introduced in Sapp et al. [32] for the elbow and wrist joints in Figure 12. Again, we outperform all prior art at PCK@0.2 with 97.59% on elbows and 95.03% on wrists. In the higher precision region our advantage is even more significant: 14.8 percentage points on wrists and 12.7 percentage points on elbows at PCK@0.05, and 8.9 percentage points on wrists and 9.3 percentage points on elbows at PCK@0.1.
[Embedded figure. Figure 12: Quantitative results on the FLIC dataset for the elbow and wrist joints with a 4-stage CPM (detection rate % vs. normalized distance; methods: Ours 4-Stage; Tompson et al., CVPR'15; Tompson et al., NIPS'14; Chen et al., NIPS'14; Toshev et al., CVPR'14; Sapp et al., CVPR'13). We outperform all competing methods.]
5. Discussion

Convolutional pose machines provide an end-to-end architecture for tackling structured prediction problems in computer vision without the need for graphical-model style inference. We showed that a sequential architecture composed of convolutional networks is capable of implicitly learning a spatial model for pose by communicating increasingly refined uncertainty-preserving beliefs between stages. Problems with spatial dependencies between variables arise in multiple domains of computer vision such as semantic image labeling, single image depth prediction and object detection, and future work will involve extending our architecture to these problems. Our approach achieves state of the art accuracy on all primary benchmarks, however we do observe failure cases mainly when multiple people are in close proximity. Handling multiple people in a single end-to-end architecture is also a challenging problem and an interesting avenue for future work.
Game AI
What is a deep neural network?
[Diagram: units with inputs x0 x1 x2 x3 and weights w0 w1 w2 w3]
A basic unit: a unit (“neuron”) with n inputs is described by n+1 parameters (n weights + 1 bias)

output = f( Σ_i x_i w_i + b )

Example f: rectified linear unit (ReLU): f(x) = max(0, x)

Basic computational interpretation: it’s just a circuit!
Biological inspiration: unit output corresponds loosely to the activation of a neuron
Machine learning interpretation: binary classifier: interpret the output as the probability of one class, using the sigmoid f(x) = 1 / (1 + e^(-x))
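As a concrete sketch of the formula above (the input, weight, and bias values here are made-up examples, not from the slides):

```c
/* Evaluate one unit: f(sum_i x_i * w_i + b) with f = ReLU. */
float relu(float x) { return x > 0.0f ? x : 0.0f; }

float unit_output(const float *x, const float *w, float b, int n) {
    float sum = b;                  /* start from the bias */
    for (int i = 0; i < n; i++)
        sum += x[i] * w[i];         /* weighted sum of the n inputs */
    return relu(sum);               /* nonlinearity */
}
```

A unit with n inputs thus touches exactly n+1 parameters per evaluation, matching the parameter counts on the next slide.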
What is a deep neural network: topology
Input: Output:
This network has: 4 inputs, 1 output, 7 hidden units
“Deep” = at least one hidden layer
Hidden layer 1: 3 units x (4 weights + 1 bias) = 15 parameters
Hidden layer 2: 4 units x (3 weights + 1 bias) = 16 parameters
Hidden layers:
Note “fully-connected” topology in this example
What is a deep neural network: topology
Fully connected layer
Sparsely (locally) connected
Inputs
Inputs
OutputsOutput
Recall image convolution (3x3 conv)

const int WIDTH = 1024;
const int HEIGHT = 1024;
float input[(WIDTH+2) * (HEIGHT+2)];
float output[WIDTH * HEIGHT];

float weights[] = {1.0/9, 1.0/9, 1.0/9,
                   1.0/9, 1.0/9, 1.0/9,
                   1.0/9, 1.0/9, 1.0/9};

for (int j = 0; j < HEIGHT; j++) {
  for (int i = 0; i < WIDTH; i++) {
    float tmp = 0.f;
    for (int jj = 0; jj < 3; jj++)
      for (int ii = 0; ii < 3; ii++)
        tmp += input[(j+jj)*(WIDTH+2) + (i+ii)] * weights[jj*3 + ii];
    output[j*WIDTH + i] = tmp;
  }
}
Convolutional layer: locally connected AND all units in the layer share the same parameters (same weights + same bias). (Note: the network diagram only shows links for a 1D conv, i.e., one iteration of the ii loop.)
Inputs
. . .. . . . . .
Inputs
Conv Layer
Strided 3x3 convolution

const int WIDTH = 1024;
const int HEIGHT = 1024;
const int STRIDE = 2;
float input[(WIDTH+2) * (HEIGHT+2)];
float output[(WIDTH/STRIDE) * (HEIGHT/STRIDE)];

float weights[] = {1.0/9, 1.0/9, 1.0/9,
                   1.0/9, 1.0/9, 1.0/9,
                   1.0/9, 1.0/9, 1.0/9};

for (int j = 0; j < HEIGHT; j += STRIDE) {
  for (int i = 0; i < WIDTH; i += STRIDE) {
    float tmp = 0.f;
    for (int jj = 0; jj < 3; jj++)
      for (int ii = 0; ii < 3; ii++)
        tmp += input[(j+jj)*(WIDTH+2) + (i+ii)] * weights[jj*3 + ii];
    output[(j/STRIDE)*(WIDTH/STRIDE) + (i/STRIDE)] = tmp;  // output rows are WIDTH/STRIDE wide
  }
}
Convolutional layer with stride 2
What does convolution with this filter do?

  [ .075  .124  .075 ]
  [ .124  .204  .124 ]
  [ .075  .124  .075 ]

Original → Blurred: “Gaussian Blur”
What does convolution with these filters do?

  [ -1  0  1 ]
  [ -2  0  2 ]   Extracts horizontal gradients
  [ -1  0  1 ]

  [ -1 -2 -1 ]
  [  0  0  0 ]   Extracts vertical gradients
  [  1  2  1 ]
Gradient detection filters
Horizontal gradients
Vertical gradients
Note: you can think of a filter as a “detector” of a pattern, and the magnitude of a pixel in the output image as the “response” of the filter to the region surrounding each pixel in the input image
Applying many filters to an image at once
Input: image (single channel): W x H
3x3 spatial convolutions on image 3x3 x num_filters weights
…
Output: filter responses W x H x num_filters
…
Each filter described by unique set of 3x3 weights
(each filter responds to different image phenomena)
Filter response maps (num_filters of them)
Applying many filters to an image at once
Input RGB image (W x H x 3)
96 11x11x3 filters (operate on RGB), 96 responses (normalized)
Adding additional layers
Input: image (single channel)
W x H
3x3 spatial convolutions 3x3 x num_filters weights
…
Output: filter responses W x H x num_filters
…
Each filter described by unique set of weights (responds to different
image phenomena)
Filter responses
post ReLU W x H x num_filters
…ReLU Pool…
post pool W/2 x H/2 x num_filters
(max response in 2x2 region)
Note data reduction as a result of pooling
Conv
…
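The ReLU-then-pool stage described above can be sketched in the same style as the earlier convolution code (a minimal single-channel sketch; the function name is illustrative):

```c
/* Apply ReLU, then 2x2 max-pool with stride 2, halving each spatial dimension.
   Input: W x H single-channel filter responses; output: (W/2) x (H/2). */
void relu_maxpool_2x2(const float *in, float *out, int W, int H) {
    for (int j = 0; j < H; j += 2) {
        for (int i = 0; i < W; i += 2) {
            float m = 0.0f;  /* ReLU clamps negatives to 0, so 0 is a safe initial max */
            for (int jj = 0; jj < 2; jj++)
                for (int ii = 0; ii < 2; ii++) {
                    float v = in[(j + jj) * W + (i + ii)];
                    v = v > 0.0f ? v : 0.0f;    /* ReLU */
                    if (v > m) m = v;           /* max over the 2x2 region */
                }
            out[(j / 2) * (W / 2) + (i / 2)] = m;
        }
    }
}
```

Note the 4x data reduction: each 2x2 input region produces a single output value.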
Example: “AlexNet” object detection network
Sequences of conv + ReLU + pool (optional) layers
Example: AlexNet [Krizhevsky12]: 5 convolutional layers + 3 fully connected layers
[VGG illustration credit: Yang et al.]
Another example: VGG-16 [Simonyan15]: 13 convolutional layers

input: 224 x 224 RGB
conv/ReLU: 3x3x3x64
conv/ReLU: 3x3x64x64
maxpool
conv/ReLU: 3x3x64x128
conv/ReLU: 3x3x128x128
maxpool
conv/ReLU: 3x3x128x256
conv/ReLU: 3x3x256x256
conv/ReLU: 3x3x256x256
maxpool
conv/ReLU: 3x3x256x512
conv/ReLU: 3x3x512x512
conv/ReLU: 3x3x512x512
maxpool
conv/ReLU: 3x3x512x512
conv/ReLU: 3x3x512x512
conv/ReLU: 3x3x512x512
maxpool
fully-connected 4096
fully-connected 4096
fully-connected 1000
soft-max
Our model
● Max-pooling layers follow first, second, and fifth convolutional layers
● The number of neurons in each layer is given by 253440, 186624, 64896, 64896, 43264, 4096, 4096, 1000
Why deep?
Right: images that generate the strongest response for filters at each layer
Left: what pixels trigger the response
[image credit: Zeiler 14]
Why deep?
[image credit: Zeiler 14]
More recent image understanding networks
[Embedded page from the GoogLeNet paper (Szegedy et al.):]

A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time). A schematic view of the resulting network is depicted in Figure 3.

6. Training Methodology

GoogLeNet networks were trained using the DistBelief [4] distributed machine learning system using modest amount of model and data-parallelism. Although we used a CPU based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum [17], fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.

Image sampling methods have changed substantially over the months leading to the competition, and already converged models were trained on with other options, sometimes in conjunction with changed hyperparameters, such as dropout and the learning rate. Therefore, it is hard to give a definitive guidance to the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area with aspect ratio constrained to the interval [3/4, 4/3]. Also, we found that the photometric distortions of Andrew Howard [8] were useful to combat overfitting to the imaging conditions of training data.

7. ILSVRC 2014 Classification Challenge Setup and Results

The ILSVRC 2014 classification challenge involves the task of classifying the image into one of 1000 leaf-node categories in the Imagenet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top-5, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes.
[Embedded network diagram. Figure 3: GoogLeNet network with all the bells and whistles.]
[Embedded figure from the ResNet paper (He et al.). Figure 3: Example network architectures for ImageNet. Left: the VGG-19 model [41] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increase dimensions. Table 1 shows more details and other variants.]

Residual Network. Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn. (1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn. (2) is used to match dimensions (done by 1x1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.

3.4. Implementation

Our implementation for ImageNet follows the practice in [21, 41]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [41]. A 224x224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [13] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60x10^4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [14], following the practice in [16].

In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully-convolutional form as in [41, 13], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).

4. Experiments

4.1. ImageNet Classification

We evaluate our method on the ImageNet 2012 classification dataset [36] that consists of 1000 classes. The models are trained on the 1.28 million training images, and evaluated on the 50k validation images. We also obtain a final result on the 100k test images, reported by the test server. We evaluate both top-1 and top-5 error rates.

Plain Networks. We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net is in Fig. 3 (middle). The 18-layer plain net is of a similar form. See Table 1 for detailed architectures.

The results in Table 2 show that the deeper 34-layer plain net has higher validation error than the shallower 18-layer plain net. To reveal the reasons, in Fig. 4 (left) we compare their training/validation errors during the training procedure. We have observed the degradation problem - the
ResNet (34 layer version)
Inception (GoogLeNet)
FCN for image segmentation
Upsampling network
Deep networks learn useful representations
▪ Simultaneous, multi-scale learning of useful concepts for the task at hand
  - Example on previous slides: subpart detectors emerged in a network trained for object classification
▪ “Fine tuning”
  - Common practice of adjusting representations (weights) previously learned on task A for a new (but similar) task B
  - Reduces training time for task B
  - Useful when little training data for task B exists
  - Example: keep the first N layers the same, tune weights in the highest layers based on task B training data
Efficiently implementing convolution layers
Dense matrix multiplication

float A[M][K];
float B[K][N];
float C[M][N];

// compute C += A * B
#pragma omp parallel for
for (int j = 0; j < M; j++)
  for (int i = 0; i < N; i++)
    for (int k = 0; k < K; k++)
      C[j][i] += A[j][k] * B[k][i];

(C is M x N, A is M x K, B is K x N)

What is the problem with this implementation?
Low arithmetic intensity (does not exploit temporal locality in accesses to A and B)
Blocked dense matrix multiplication

float A[M][K];
float B[K][N];
float C[M][N];

// compute C += A * B
#pragma omp parallel for
for (int jblock = 0; jblock < M; jblock += BLOCKSIZE_J)
  for (int iblock = 0; iblock < N; iblock += BLOCKSIZE_I)
    for (int kblock = 0; kblock < K; kblock += BLOCKSIZE_K)
      for (int j = 0; j < BLOCKSIZE_J; j++)
        for (int i = 0; i < BLOCKSIZE_I; i++)
          for (int k = 0; k < BLOCKSIZE_K; k++)
            C[jblock+j][iblock+i] += A[jblock+j][kblock+k] * B[kblock+k][iblock+i];

Idea: compute a partial result for a block of C while the required blocks of A and B remain in cache (assumes BLOCKSIZE chosen to allow blocks of A, B, and C to remain resident)
Self check: do you want as big a BLOCKSIZE as possible? Why?
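A self-contained version of the blocked loop nest above can be checked against the naive triple loop (a sketch without OpenMP; the sizes and block size here are arbitrary choices, picked as exact multiples so no edge handling is needed):

```c
#define M 64
#define N 64
#define K 64
#define BLOCK 16   /* assumes M, N, K are multiples of BLOCK */

static float A[M][K], B[K][N], C[M][N], C_ref[M][N];

/* Blocked C += A * B: each (jb, ib, kb) iteration updates one block of C
   using one block of A and one block of B. */
void matmul_blocked(void) {
    for (int jb = 0; jb < M; jb += BLOCK)
        for (int ib = 0; ib < N; ib += BLOCK)
            for (int kb = 0; kb < K; kb += BLOCK)
                for (int j = 0; j < BLOCK; j++)
                    for (int i = 0; i < BLOCK; i++)
                        for (int k = 0; k < BLOCK; k++)
                            C[jb + j][ib + i] += A[jb + j][kb + k] * B[kb + k][ib + i];
}
```

For a fixed output element the k values are still visited in ascending order, so the result is bit-identical to the unblocked loop; only the traversal order over blocks (and hence the cache behavior) changes.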
Hierarchical blocked matrix mult

float A[M][K];
float B[K][N];
float C[M][N];

// compute C += A * B
#pragma omp parallel for
for (int jblock2 = 0; jblock2 < M; jblock2 += L2_BLOCKSIZE_J)
  for (int iblock2 = 0; iblock2 < N; iblock2 += L2_BLOCKSIZE_I)
    for (int kblock2 = 0; kblock2 < K; kblock2 += L2_BLOCKSIZE_K)
      for (int jblock1 = 0; jblock1 < L2_BLOCKSIZE_J; jblock1 += L1_BLOCKSIZE_J)
        for (int iblock1 = 0; iblock1 < L2_BLOCKSIZE_I; iblock1 += L1_BLOCKSIZE_I)
          for (int kblock1 = 0; kblock1 < L2_BLOCKSIZE_K; kblock1 += L1_BLOCKSIZE_K)
            for (int j = 0; j < L1_BLOCKSIZE_J; j++)
              for (int i = 0; i < L1_BLOCKSIZE_I; i++)
                for (int k = 0; k < L1_BLOCKSIZE_K; k++)
                  ...

Not shown: a final level of “blocking” for register locality…
Exploit multiple levels of the memory hierarchy
Blocked dense matrix multiplication (1)

Consider SIMD parallelism within a block:

for (int j = 0; j < BLOCKSIZE_J; j++) {
  for (int i = 0; i < BLOCKSIZE_I; i += SIMD_WIDTH) {
    simd_vec C_accum = vec_load(&C[jblock+j][iblock+i]);
    for (int k = 0; k < BLOCKSIZE_K; k++) {
      // C = A*B + C
      simd_vec A_val = splat(&A[jblock+j][kblock+k]);  // broadcast a single element into a vector register
      simd_muladd(A_val, vec_load(&B[kblock+k][iblock+i]), C_accum);
    }
    vec_store(&C[jblock+j][iblock+i], C_accum);
  }
}

Vectorize the i loop.
Good: also improves spatial locality in accesses to B
Bad: working set increased by SIMD_WIDTH; still walking over B in large steps
Blocked dense matrix multiplication (2)

Assume the i dimension is small; the previous vectorization scheme (1) would not work well. Instead, pre-transpose a block of B (copy the block of B to a temp buffer in transposed form, Btrans) and vectorize the innermost (k) loop:

for (int j = 0; j < BLOCKSIZE_J; j++)
  for (int i = 0; i < BLOCKSIZE_I; i++) {
    float C_scalar = C[jblock+j][iblock+i];
    for (int k = 0; k < BLOCKSIZE_K; k += SIMD_WIDTH) {
      // C_scalar = dot(A, B) + C_scalar
      C_scalar += simd_dot(vec_load(&A[jblock+j][kblock+k]),
                           vec_load(&Btrans[iblock+i][kblock+k]));
    }
    C[jblock+j][iblock+i] = C_scalar;
  }
Blocked dense matrix multiplication (3)

// assume blocks of A and C are pre-transposed as Atrans and Ctrans
for (int j = 0; j < BLOCKSIZE_J; j += SIMD_WIDTH) {
  for (int i = 0; i < BLOCKSIZE_I; i += SIMD_WIDTH) {
    simd_vec C_accum[SIMD_WIDTH];
    for (int k = 0; k < SIMD_WIDTH; k++)   // load C_accum
      C_accum[k] = vec_load(&Ctrans[iblock+i+k][jblock+j]);

    for (int k = 0; k < BLOCKSIZE_K; k++) {
      simd_vec bvec = vec_load(&B[kblock+k][iblock+i]);
      for (int kk = 0; kk < SIMD_WIDTH; kk++)   // innermost loop iterations not dependent
        simd_muladd(vec_load(&Atrans[kblock+k][jblock+j]), splat(bvec[kk]), C_accum[kk]);
    }

    for (int k = 0; k < SIMD_WIDTH; k++)
      vec_store(&Ctrans[iblock+i+k][jblock+j], C_accum[k]);
  }
}
Convolution as matrix-vector product

Weight vector: the 3x3 = 9 filter weights, [w0 w1 ... w8]^T

Data matrix: W x H rows, 9 columns, constructed from elements of the input image (note: 0-padded at the image border); each row holds the 3x3 window around one pixel, e.g.:

  [ 0 0 0 0 x00 x01 0 x10 x11 ]
  ...

O(N) storage multiplier for a filter with N elements
Must construct the input data matrix
3x3 convolution as matrix-vector product

Weight vector (9 elements): [w0 w1 ... w8]^T

Data matrix (W x H rows x 9 columns), constructed from elements of the input image (note: 0-padded at the image border):

  [ 0   0   0   0   x00 x01 0   x10 x11 ]
  [ 0   0   0   x00 x01 x02 x10 x11 x12 ]
  [ 0   0   0   x01 x02 x03 x11 x12 x13 ]
  ...
  [ x00 x01 x02 x10 x11 x12 x20 x21 x22 ]
  ...

O(N) storage overhead for a filter with N elements
Must construct the input data matrix
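The matrix construction described above is commonly called im2col; a minimal single-channel sketch (the function name and zero-padding convention are illustrative choices, not from the slides):

```c
/* Build the (W*H) x 9 data matrix for a 3x3 convolution over a W x H image,
   zero-padding out-of-bounds taps. Row r holds the 3x3 window centered on
   output pixel r; multiplying by the 9-element weight vector then computes
   every output pixel at once. */
void im2col_3x3(const float *img, int W, int H, float *mat /* (W*H) x 9 */) {
    for (int j = 0; j < H; j++)
        for (int i = 0; i < W; i++)
            for (int jj = -1; jj <= 1; jj++)
                for (int ii = -1; ii <= 1; ii++) {
                    int jj2 = j + jj, ii2 = i + ii;
                    float v = (jj2 >= 0 && jj2 < H && ii2 >= 0 && ii2 < W)
                                  ? img[jj2 * W + ii2] : 0.0f;   /* 0-pad at border */
                    mat[(j * W + i) * 9 + (jj + 1) * 3 + (ii + 1)] = v;
                }
}
```

For a 2x2 image with pixels x00 x01 x10 x11, the first row produced is [0 0 0 0 x00 x01 0 x10 x11], matching the first row of the matrix shown above.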
Multiple convolutions as matrix-matrix mult

The same (W x H) x 9 data matrix, now multiplied by a 9 x num_filters weight matrix (one column of 9 weights per filter):

  [ w00 w01 w02 ... w0N ]
  [ w10 w11 w12 ... w1N ]
  [  ...                ]
  [ w80 w81 w82 ... w8N ]

Data matrix rows (as before, 0-padded at the border):

  [ 0   0   0   0   x00 x01 0   x10 x11 ]
  [ 0   0   0   x00 x01 x02 x10 x11 x12 ]
  [ 0   0   0   x01 x02 x03 x11 x12 x13 ]
  ...
  [ x00 x01 x02 x10 x11 x12 x20 x21 x22 ]
  ...
Multiple convolutions on multiple input channels

The data matrix now has 9 x num_input_channels columns: for each output pixel, the 3x3 windows from channel 0, channel 1, and channel 2 are concatenated into one row:

  [ 0 0 0 0 x00 x01 0 x10 x11 | channel 1 values | channel 2 values ]
  [ 0 0 0 x00 x01 x02 x10 x11 x12 | ...           | ...             ]
  ...
  [ x00 x01 x02 x10 x11 x12 x20 x21 x22 | ...     | ...             ]
  ...

For each filter, sum responses over input channels.
Equivalent to a (3 x 3 x num_channels) convolution on (W x H x num_channels) input data.

Weight matrix is (9 x num_channels) rows x num_filters columns:

  [ w000 w001 w002 ... w00N ]
  [ w010 w011 w012 ... w01N ]
  [  ...                    ]
  [ w080 w081 w082 ... w08N ]
  [ w100 w101 w102 ... w10N ]
  [ w110 w111 w112 ... w11N ]
  [  ...                    ]
  [ w180 w181 w182 ... w18N ]
  [ w200 w201 w202 ... w20N ]
  [ w210 w211 w212 ... w21N ]
  [  ...                    ]
  [ w280 w281 w282 ... w28N ]
VGG memory footprint
Calculations assume 32-bit values (image batch size = 1)

layer                        weights mem   output size (per image)   output mem
input: 224 x 224 RGB image        —        224x224x3                 150 KB
conv: (3x3x3) x 64              6.5 KB     224x224x64                12.3 MB
conv: (3x3x64) x 64             144 KB     224x224x64                12.3 MB
maxpool                           —        112x112x64                3.1 MB
conv: (3x3x64) x 128            228 KB     112x112x128               6.2 MB
conv: (3x3x128) x 128           576 KB     112x112x128               6.2 MB
maxpool                           —        56x56x128                 1.5 MB
conv: (3x3x128) x 256           1.1 MB     56x56x256                 3.1 MB
conv: (3x3x256) x 256           2.3 MB     56x56x256                 3.1 MB
conv: (3x3x256) x 256           2.3 MB     56x56x256                 3.1 MB
maxpool                           —        28x28x256                 766 KB
conv: (3x3x256) x 512           4.5 MB     28x28x512                 1.5 MB
conv: (3x3x512) x 512             9 MB     28x28x512                 1.5 MB
conv: (3x3x512) x 512             9 MB     28x28x512                 1.5 MB
maxpool                           —        14x14x512                 383 KB
conv: (3x3x512) x 512             9 MB     14x14x512                 383 KB
conv: (3x3x512) x 512             9 MB     14x14x512                 383 KB
conv: (3x3x512) x 512             9 MB     14x14x512                 383 KB
maxpool                           —        7x7x512                   98 KB
fully-connected 4096            392 MB     4096                      16 KB
fully-connected 4096             64 MB     4096                      16 KB
fully-connected 1000           15.6 MB     1000                      4 KB
soft-max                          —        1000                      4 KB

inputs/outputs get multiplied by image batch size
multiply by the next layer’s conv window size to form the input matrix to the next conv layer! (for VGG’s 3x3 convolutions, this is 9x data amplification)
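The per-layer numbers above follow directly from the layer shapes; a quick sketch of the arithmetic (assuming 4-byte floats and ignoring biases):

```c
/* Weights footprint of a conv layer: convY * convX * in_channels * num_filters floats. */
long conv_weights_bytes(int cy, int cx, int in_ch, int nf) {
    return (long)cy * cx * in_ch * nf * sizeof(float);
}

/* Output footprint of a layer producing W x H x C values. */
long output_bytes(int W, int H, int C) {
    return (long)W * H * C * sizeof(float);
}
```

For example, the (3x3x64) x 64 layer holds 3*3*64*64*4 = 147456 bytes = 144 KB of weights, matching the table.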
Direct implementation of conv layer

float input[IMAGE_BATCH_SIZE][INPUT_HEIGHT][INPUT_WIDTH][INPUT_DEPTH];
float output[IMAGE_BATCH_SIZE][INPUT_HEIGHT][INPUT_WIDTH][LAYER_NUM_FILTERS];
float layer_weights[LAYER_NUM_FILTERS][LAYER_CONVY][LAYER_CONVX][INPUT_DEPTH];

// assumes convolution stride is 1 (and a padded input, as in the earlier examples)
for (int img = 0; img < IMAGE_BATCH_SIZE; img++)
  for (int j = 0; j < INPUT_HEIGHT; j++)
    for (int i = 0; i < INPUT_WIDTH; i++)
      for (int f = 0; f < LAYER_NUM_FILTERS; f++) {
        output[img][j][i][f] = 0.f;
        for (int kk = 0; kk < INPUT_DEPTH; kk++)      // sum over filter responses of input channels
          for (int jj = 0; jj < LAYER_CONVY; jj++)    // spatial convolution
            for (int ii = 0; ii < LAYER_CONVX; ii++)  // spatial convolution
              output[img][j][i][f] += layer_weights[f][jj][ii][kk] * input[img][j+jj][i+ii][kk];
      }

Seven loops with significant input data reuse: reuse of filter weights (during convolution), and reuse of input values (across different filters)

Avoids the O(N) footprint increase by not materializing the input matrix
In theory loads O(N) times less data (potentially higher arithmetic intensity… but matrix mult is typically compute-bound)
But you must roll your own highly optimized implementation of a complicated loop nest.
Conv layer in Halide

int in_w, in_h, in_ch = 4;         // input params: assume initialized
Func in_func;                      // assume input function is initialized
int num_f, f_w, f_h, pad, stride;  // parameters of the conv layer

Func forward = Func("conv");
Var x("x"), y("y"), z("z"), n("n");  // n is the minibatch dimension

// This creates a padded input to avoid checking boundary
// conditions while computing the actual convolution
Func f_in_bound = BoundaryConditions::repeat_edge(in_func, 0, in_w, 0, in_h);

// Create image buffers for layer parameters
Image<float> W(f_w, f_h, in_ch, num_f);
Image<float> b(num_f);

RDom r(0, f_w, 0, f_h, 0, in_ch);
// Initialize to bias
forward(x, y, z, n) = b(z);
forward(x, y, z, n) += W(r.x, r.y, r.z, z) *
                       f_in_bound(x*stride + r.x - pad, y*stride + r.y - pad, r.z, n);

Consider scheduling this seven-dimensional loop nest.
Algorithmic improvements
▪ Direct convolution can be implemented efficiently in the Fourier domain (convolution → element-wise multiplication)
  - Overhead: an FFT to transform inputs into the Fourier domain, and an inverse FFT to bring responses back to the spatial domain (O(N log N))
  - The inverse transform is amortized over all input channels (due to the summation over inputs)
▪ Direct convolution using work-efficient Winograd convolutions

1D example: consider producing two outputs of a 3-tap 1D convolution with weights w0 w1 w2 applied to inputs x0 x1 x2 x3:

  [ y0 ]   [ x0 x1 x2 ] [ w0 ]   [ m1 + m2 + m3 ]
  [ y1 ] = [ x1 x2 x3 ] [ w1 ] = [ m2 - m3 - m4 ]
                        [ w2 ]

where

  m1 = (x0 - x2) w0
  m2 = (x1 + x2) (w0 + w1 + w2) / 2
  m3 = (x2 - x1) (w0 - w1 + w2) / 2
  m4 = (x1 - x3) w2

The filter-dependent factors (w0 + w1 + w2)/2 and (w0 - w1 + w2)/2 can be precomputed.

1D 3-tap total cost: 4 multiplies, 8 additions (4 to compute the m's + 4 to produce the final result)
Direct convolution: 6 multiplies, 4 adds
In 2D the reduction in multiplications is notable (3x3 filter: 2.25x fewer multiplies for a 2x2 block of output)
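The 1D transform above can be checked against direct convolution (a sketch; in a real implementation the filter-dependent terms would be precomputed once per filter and reused across the whole image):

```c
/* Winograd F(2,3): compute two outputs of a 3-tap filter with 4 multiplies.
   y0 = x0*w0 + x1*w1 + x2*w2 ; y1 = x1*w0 + x2*w1 + x3*w2 */
void winograd_f23(const float x[4], const float w[3], float y[2]) {
    /* Filter-dependent terms (precomputable once per filter): */
    float g0 = w[0];
    float g1 = 0.5f * (w[0] + w[1] + w[2]);
    float g2 = 0.5f * (w[0] - w[1] + w[2]);
    float g3 = w[2];

    /* The four multiplies: */
    float m1 = (x[0] - x[2]) * g0;
    float m2 = (x[1] + x[2]) * g1;
    float m3 = (x[2] - x[1]) * g2;
    float m4 = (x[1] - x[3]) * g3;

    y[0] = m1 + m2 + m3;
    y[1] = m2 - m3 - m4;
}
```

Expanding the sums confirms y0 = x0w0 + x1w1 + x2w2 and y1 = x1w0 + x2w1 + x3w2, the two sliding-window outputs of direct convolution.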
Deep neural networks on GPUs
▪ Today, many performant DNN implementations target GPUs
  - High arithmetic intensity computations (computational characteristics similar to dense matrix-matrix multiplication)
  - Benefit from flop-rich architectures
  - Highly-optimized libraries of kernels exist for GPUs (e.g., NVIDIA’s cuDNN)
  - Most CPU-based implementations use a basic matrix-multiplication-based formulation (good implementations could run faster!)
Emerging architectures for deep learning?
▪ NVIDIA Pascal (NVIDIA’s latest GPU)
  - Adds double-throughput 16-bit floating point ops
- This feature is already common on mobile GPUs
▪ Intel Xeon Phi (Knights Landing)
  - Flop-rich 68-core x86 processor for scientific computing and machine learning
▪ FPGAs, ASICs
  - Not new: FPGA solutions have been explored for years
- Significant amount of ongoing industry and academic research
- Google’s TPU
- We will discuss several recent papers in an upcoming class
Summary: efficiently evaluating deep nets
▪ Computational structure
  - Conv layers: high arithmetic intensity, a significant portion of the cost of evaluating a network
  - Similar data access patterns to dense matrix multiplication (exploiting temporal reuse is key)
  - But a straight reduction to matrix-matrix multiplication is often sub-optimal
  - Work-efficient techniques exist for convolutional layers (FFT-based, Winograd convolutions)
▪ Large numbers of parameters: significant interest in reducing the size of networks for both training and evaluation
  - Will discuss pruning/quantization in upcoming classes
▪ Many ongoing studies of specialized hardware architectures for efficient evaluation
  - Future CPUs/GPUs, ASICs, FPGAs, …
  - Specialization will be important to achieving “always on” applications