
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)


Page 1: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

[course site]

Verónica Vilaplana [email protected]

Associate Professor, Universitat Politècnica de Catalunya (Technical University of Catalonia)

Optimization for neural network training

Day 4 Lecture 1

#DLUPC

Page 2: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Previously in DLAI…
• Multilayer perceptron
• Training: (stochastic / mini-batch) gradient descent
• Backpropagation

but… What type of optimization problem is this? Do local minima and saddle points cause problems? Does gradient descent perform well? How should the learning rate be set? How should weights be initialized? How does batch size affect training?


Page 3: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Index
• Optimization for a machine learning task; difference between learning and pure optimization
  • Expected and empirical risk
  • Surrogate loss functions and early stopping
  • Batch and mini-batch algorithms
• Challenges
  • Local minima
  • Saddle points and other flat regions
  • Cliffs and exploding gradients
• Practical algorithms
  • Stochastic Gradient Descent
  • Momentum
  • Nesterov Momentum
  • Learning rate
  • Adaptive learning rates: AdaGrad, RMSProp, Adam
• Approximate second-order methods
• Parameter initialization
• Batch Normalization


Page 4: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Differences between learning and pure optimization

Page 5: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Optimization for NN training
• Goal: find the parameters that minimize the expected risk (generalization error)

J(θ) = E_(x,y)∼p_data [ L(f(x;θ), y) ]

• x input, f(x;θ) predicted output, y target output, E expectation
• p_data: true (unknown) data distribution
• L: loss function (how wrong predictions are)

• But we only have a training set of samples: we minimize the empirical risk, the average loss on a finite dataset D

J(θ) = E_(x,y)∼p̂_data [ L(f(x;θ), y) ] = (1/|D|) Σ_{(x^(i),y^(i))∈D} L(f(x^(i);θ), y^(i))

where p̂_data is the empirical distribution and |D| is the number of examples in D.
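To make the empirical risk concrete, here is a minimal NumPy sketch (not from the slides) that computes the average loss over a finite dataset D, assuming a linear model f(x;θ) = θᵀx and a squared-error loss purely for illustration:

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Average loss over the finite dataset D = {(x_i, y_i)}.
    Assumes (for illustration) a linear model f(x; theta) = x . theta
    and a squared-error loss L(f(x), y) = (f(x) - y)^2."""
    predictions = X @ theta                      # f(x^(i); theta) for every example
    losses = (predictions - y) ** 2              # per-example loss
    return losses.mean()                         # (1/|D|) * sum over D

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(3), X, y))   # high risk for a bad parameter
print(empirical_risk(theta_true, X, y))    # low risk near the data-generating parameter
```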

Page 6: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Surrogate loss

• Often minimizing the real loss is intractable
  • e.g. the 0-1 loss (0 if correctly classified, 1 if not) is intractable even for linear classifiers (Marcotte 1992)

• Minimize a surrogate loss instead
  • e.g. negative log-likelihood as a surrogate for the 0-1 loss

• Sometimes the surrogate loss lets the model learn more
  • the test 0-1 loss keeps decreasing even after the training 0-1 loss reaches zero
  • the classes keep being pushed further apart from each other


0-1 loss (blue) and surrogate losses (square, hinge, logistic)

0-1 loss (blue) and negative log-likelihood (red)

Page 7: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Surrogate loss functions


Probabilistic classifier
• Binary: outputs the probability of class 1, g(x) ≈ P(y=1 | x); the probability of class 0 is 1-g(x).
  Binary cross-entropy loss: L(g(x),y) = -( y log g(x) + (1-y) log(1-g(x)) )
  Decision function: f(x) = I[g(x) > 0.5]
• Multiclass: outputs a vector of probabilities g(x) ≈ ( P(y=0|x), ..., P(y=m-1|x) ).
  Negative conditional log-likelihood loss: L(g(x),y) = -log g(x)_y
  Decision function: f(x) = argmax_k g(x)_k

Non-probabilistic classifier
• Binary: outputs a "score" g(x) for class 1; the score for the other class is -g(x).
  Hinge loss: L(g(x),t) = max(0, 1 - t g(x)) with t = 2y-1
  Decision function: f(x) = I[g(x) > 0]
• Multiclass: outputs a vector g(x) of real-valued scores for the m classes.
  Multiclass margin loss: L(g(x),y) = max(0, 1 + max_{k≠y} g(x)_k - g(x)_y)
  Decision function: f(x) = argmax_k g(x)_k
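As an illustration of the binary losses in the table above, here is a small NumPy sketch (an addition of this write-up, not part of the slides) of the binary cross-entropy and hinge losses:

```python
import numpy as np

def binary_cross_entropy(g, y):
    """L(g(x), y) = -( y*log g(x) + (1-y)*log(1-g(x)) ), with g(x) ~ P(y=1|x)."""
    eps = 1e-12                                  # clip to avoid log(0)
    g = np.clip(g, eps, 1 - eps)
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

def hinge_loss(g, y):
    """L(g(x), t) = max(0, 1 - t*g(x)) with t = 2y - 1 in {-1, +1}."""
    t = 2 * y - 1
    return np.maximum(0.0, 1 - t * g)

y = np.array([1, 0, 1])
print(binary_cross_entropy(np.array([0.9, 0.2, 0.6]), y))   # small for confident, correct outputs
print(hinge_loss(np.array([2.0, -0.5, 0.3]), y))            # zero once the margin exceeds 1
```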

Page 8: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Early stopping
• Training algorithms usually do not halt at a local minimum
• Early stopping (a code sketch follows below):
  • based on the true underlying loss (e.g. the 0-1 loss) measured on a validation set
  • the number of training steps is a hyperparameter controlling the effective capacity of the model
  • simple and effective; a copy of the best parameters must be kept
  • acts as a regularizer (Bishop 1995, …)


Training error decreases steadily; validation error begins to increase. Return the parameters at the point with the lowest validation error.
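A minimal sketch of the early-stopping loop described above; `train_one_epoch` and `validation_error` are hypothetical callables standing in for the actual training and validation code:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              patience=10, max_epochs=200):
    """Stop when the validation error has not improved for `patience` epochs;
    return a copy of the best parameters seen (the point of lowest validation error)."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_best = 0
    for epoch in range(max_epochs):        # number of training steps acts as a capacity control
        train_one_epoch(model)
        err = validation_error(model)      # e.g. 0-1 loss on a validation set
        if err < best_error:
            best_error, best_model = err, copy.deepcopy(model)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break
    return best_model, best_error
```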

Page 9: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Batch and mini-batch algorithms

• In most optimization methods used in ML, the objective function decomposes as a sum over the training set

• Gradient descent:

∇θ J(θ) = E_(x,y)∼p̂_data [ ∇θ L(f(x;θ), y) ] = (1/m) Σ_i ∇θ L(f(x^(i);θ), y^(i))

with examples {x^(i)}_{i=1...m} from the training set and corresponding targets {y^(i)}_{i=1...m}

• Using the complete training set can be very expensive: the gain from using more samples is less than linear (the standard error of the mean drops only as 1/sqrt(m)), and the training set may be redundant. Use a subset of the training set instead.

• How many samples in each update step?
  • Deterministic or batch gradient methods: process all training samples in one large batch
  • Stochastic methods: use a single example at a time
    • online methods: samples are drawn from a stream of continually created samples
  • Mini-batch stochastic methods: use several (but not all) samples

Page 10: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Batch and mini-batch algorithms
Mini-batch size?
• Larger batches: more accurate estimates of the gradient, but with less than linear returns
• Very small batches: multicore architectures are under-utilized
• If samples are processed in parallel: memory scales with the batch size
• Smaller batches provide noisier gradient estimates
• Small batches may offer a regularizing effect (they add noise)
  • but may require a small learning rate
  • and may increase the number of steps needed for convergence

Mini-batches should be selected randomly (shuffle the samples) to obtain unbiased estimates of the gradient (a code sketch follows below).
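A small sketch of drawing shuffled mini-batches, assuming the dataset fits in NumPy arrays (the names are illustrative):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield mini-batches in a freshly shuffled order, so that each
    mini-batch gradient is an unbiased estimate of the full gradient."""
    order = rng.permutation(len(X))              # shuffle the samples
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X, y = np.arange(20.0).reshape(10, 2), np.arange(10)
for xb, yb in minibatches(X, y, batch_size=4, rng=rng):
    print(xb.shape, yb)
```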


Page 11: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Challenges in NN optimization

Page 12: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Local minima

• Convex optimization
  • any local minimum is a global minimum
  • there are several (polynomial-time) optimization algorithms

• Non-convex optimization
  • the objective function in deep networks is non-convex
  • deep models may have several local minima
  • but this is not necessarily a major problem!


Page 13: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Local minima and saddle points

• Critical points of f: ℝⁿ → ℝ are points where ∇x f(x) = 0

• For high-dimensional loss functions, local minima are rare compared to saddle points

• Hessian matrix: H_ij = ∂²f / ∂x_i ∂x_j
  real and symmetric, so it has an eigenvector/eigenvalue decomposition

• Intuition: eigenvalues of the Hessian matrix at a critical point
  • local minimum/maximum: all positive / all negative eigenvalues (exponentially unlikely as n grows)
  • saddle point: both positive and negative eigenvalues

Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
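To make the eigenvalue intuition concrete, here is a small NumPy sketch (illustrative, not from the slides) that classifies a critical point from the eigenvalues of its Hessian:

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point (where the gradient is zero) from the
    eigenvalues of the real, symmetric Hessian H."""
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "local minimum"        # all eigenvalues positive
    if np.all(eigvals < -tol):
        return "local maximum"        # all eigenvalues negative
    return "saddle point (or flat)"   # mixed signs or near-zero eigenvalues

# f(x, y) = x^2 - y^2 has a saddle point at the origin
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])
print(classify_critical_point(H_saddle))    # -> saddle point (or flat)
```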

Page 14: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Local minima and saddle points

• It is believed that for many problems, including learning deep nets, almost all local minima have a function value very similar to the global optimum

• Finding a local minimum is good enough


Value of local minima found by running SGD for 200 iterations on a simplified version of MNIST from different initial starting points. As the number of parameters increases, local minima tend to cluster more tightly.

• For many random functions, local minima are more likely to have low cost than high cost.

Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014

Page 15: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Saddle points
• How to escape from saddle points?

• First-order methods
  • initially attracted to saddle points, but unless the trajectory hits one exactly, it is repelled once close
  • hitting a critical point exactly is unlikely (the estimated gradient is noisy)
  • saddle points are very unstable: noise (stochastic gradient descent) helps convergence, the trajectory escapes quickly

• Second-order methods
  • Newton's method can jump to saddle points (where the gradient is 0)

Slide credit: K. McGuinness

SGD  tends  to  oscillate  between  slowly  approaching  a  saddle  point  and  quickly  escaping  from  it  

Page 16: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Other difficulties
• Cliffs and exploding gradients
  • Nets with many layers / recurrent nets can contain very steep regions (cliffs, arising from the multiplication of several parameters): gradient descent can move the parameters too far, jumping off the cliff (solution: gradient clipping)

• Long-term dependencies
  • the computational graph becomes very deep: vanishing and exploding gradients


Page 17: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Algorithms  

Page 18: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Stochastic Gradient Descent (SGD)
• Most used algorithm for deep learning
• Do not confuse it with deterministic gradient descent: stochastic GD uses mini-batches

Algorithm (a minimal code sketch follows below)
• Require: learning rate α, initial parameter θ
• while stopping criterion not met do
  • sample a mini-batch of m examples {x^(i)}_{i=1...m} from the training set, with corresponding targets {y^(i)}_{i=1...m}
  • compute the gradient estimate: g ← (1/m) Σ_i ∇θ L(f(x^(i);θ), y^(i))
  • apply the update: θ ← θ − α g
• end while
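A minimal NumPy sketch of this loop, assuming a hypothetical `grad_loss(theta, X_batch, y_batch)` that returns the averaged gradient over the mini-batch:

```python
import numpy as np

def sgd(theta, X, y, grad_loss, alpha=0.01, batch_size=32, n_epochs=10, seed=0):
    """Mini-batch SGD: repeatedly sample a mini-batch, compute the gradient
    estimate g, and apply theta <- theta - alpha * g."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):                        # stopping criterion: a fixed number of epochs
        order = rng.permutation(len(X))              # shuffle once per epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            g = grad_loss(theta, X[idx], y[idx])     # gradient estimate on the mini-batch
            theta = theta - alpha * g                # apply the update
    return theta

# toy usage with a squared-error linear model: grad = (2/m) X^T (X theta - y)
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5])
grad_loss = lambda th, Xb, yb: 2 * Xb.T @ (Xb @ th - yb) / len(Xb)
print(sgd(np.zeros(3), X, y, grad_loss, alpha=0.1, n_epochs=50))
```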

Page 19: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Momentum  


• Designed to accelerate learning, especially with high curvature, small but consistent gradients, or noisy gradients

• Momentum aims to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient

Contour lines: a quadratic loss with poor conditioning of the Hessian. Path (red) followed by SGD (left) and by momentum (right).

Page 20: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Momentum
• New variable v (velocity): the direction and speed at which parameters move; an exponentially decaying average of the negative gradient

Algorithm
• Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
• Update rule:
  • compute the gradient estimate: g ← (1/m) Σ_i ∇θ L(f(x^(i);θ), y^(i))
  • compute the velocity update: v ← λv − αg
  • apply the update: θ ← θ + v

• Typical values: λ = 0.5, 0.9, 0.99 (in [0,1))
• The size of the step depends on how large and how aligned a sequence of gradients is.
• See the physical analogy in the Deep Learning book (Goodfellow et al.)
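A minimal sketch of one momentum update, following the rule above (names are illustrative):

```python
import numpy as np

def momentum_step(theta, v, g, alpha=0.01, lam=0.9):
    """One momentum update: v <- lam*v - alpha*g, then theta <- theta + v."""
    v = lam * v - alpha * g          # velocity: decaying average of negative gradients
    theta = theta + v
    return theta, v

# toy usage: repeated steps on the quadratic J(theta) = 0.5*||theta||^2 (gradient = theta)
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    theta, v = momentum_step(theta, v, g=theta, alpha=0.05, lam=0.9)
print(theta)   # approaches the minimum at the origin
```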

Page 21: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Nesterov accelerated gradient (NAG)
• A variant of momentum where the gradient is evaluated after the current velocity is applied:
  • approximate where the parameters will be on the next time step using the current velocity
  • update the velocity using the gradient at the point where we predict the parameters will be

Algorithm
• Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
• Update:
  • apply the interim update: θ̃ ← θ + λv
  • compute the gradient at the interim point: g ← (1/m) Σ_i ∇θ̃ L(f(x^(i);θ̃), y^(i))
  • compute the velocity update: v ← λv − αg
  • apply the update: θ ← θ + v

• Interpretation: adds a correction factor to momentum
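A minimal sketch of one Nesterov update, assuming a hypothetical `grad_fn` that returns the gradient at a given point:

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, alpha=0.01, lam=0.9):
    """One NAG update: evaluate the gradient at the interim point theta + lam*v,
    then v <- lam*v - alpha*g and theta <- theta + v."""
    theta_interim = theta + lam * v        # look ahead using the current velocity
    g = grad_fn(theta_interim)             # gradient where we predict the parameters will be
    v = lam * v - alpha * g
    theta = theta + v
    return theta, v

# toy usage on J(theta) = 0.5*||theta||^2, whose gradient is theta itself
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_fn=lambda t: t, alpha=0.05, lam=0.9)
print(theta)   # approaches the minimum at the origin
```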

Page 22: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Nesterov  accelerated  gradient  (NAG)  

[Diagram: standard momentum evaluates the gradient ∇L(wt) at the current location wt, while NAG evaluates the gradient ∇L(wt + γvt) at the location wt + γvt predicted from the velocity alone, before forming the new velocity vt+1.]

Slide credit: K. McGuinness

Page 23: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

SGD: learning rate
• The learning rate is a crucial parameter for SGD
  • Too large: overshoots the local minimum, the loss increases
  • Too small: makes very slow progress, can get stuck
  • Good learning rate: makes steady progress toward the local minimum

• In practice it is necessary to gradually decrease the learning rate
  • step decay (e.g. decay by half every few epochs)
  • exponential decay: α = α0 e^(−kt)
  • 1/t decay: α = α0 / (1 + kt), with t the iteration number

• Sufficient conditions for convergence: Σ_{t=1}^∞ α_t = ∞ and Σ_{t=1}^∞ α_t² < ∞
• Usually: adapt the learning rate by monitoring learning curves that plot the objective function as a function of time (more of an art than a science!)
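A small sketch of the three decay schedules listed above (the constants are illustrative):

```python
import numpy as np

def step_decay(alpha0, epoch, drop=0.5, every=10):
    """Decay by a factor `drop` every `every` epochs."""
    return alpha0 * drop ** (epoch // every)

def exponential_decay(alpha0, t, k=0.01):
    """alpha = alpha0 * exp(-k t)."""
    return alpha0 * np.exp(-k * t)

def inverse_time_decay(alpha0, t, k=0.01):
    """alpha = alpha0 / (1 + k t), the '1/t decay' above."""
    return alpha0 / (1 + k * t)

for t in (0, 10, 100):
    print(t, step_decay(0.1, t), exponential_decay(0.1, t), inverse_time_decay(0.1, t))
```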

Page 24: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Adaptive learning rates
• The learning rate is one of the hyperparameters that is most difficult to set; it has a significant impact on model performance

• The cost is often sensitive to some directions in parameter space and insensitive to others
  • Momentum/Nesterov mitigate this issue, but introduce another hyperparameter

• Solution: use a separate learning rate for each parameter and automatically adapt it throughout the course of learning

• Algorithms (mini-batch based)
  • AdaGrad
  • RMSProp
  • Adam
  • RMSProp with Nesterov momentum

 


Page 25: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

AdaGrad
• Adapts the learning rate of each parameter based on the sizes of previous updates:
  • scales updates to be larger for parameters that are updated less
  • scales updates to be smaller for parameters that are updated more

• The net effect is greater progress in the more gently sloped directions of parameter space
• Desirable theoretical properties, but empirically (for deep models) it can result in a premature and excessive decrease in the effective learning rate

• Require: learning rate α, initial parameter θ, small constant δ (e.g. 10^-7) for numerical stability
• Update:
  • compute the gradient: g ← (1/m) Σ_i ∇θ L(f(x^(i);θ), y^(i))
  • accumulate the squared gradient: r ← r + g ⊙ g   (sum of all previous squared gradients)
  • compute the update: Δθ ← −(α / (δ + √r)) ⊙ g   (updates inversely proportional to the square root of the sum)
  • apply the update: θ ← θ + Δθ

Duchi et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011
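A minimal sketch of one AdaGrad update following the rule above; r is the running sum of squared gradients, initialized to zero:

```python
import numpy as np

def adagrad_step(theta, r, g, alpha=0.01, delta=1e-7):
    """One AdaGrad update: r <- r + g*g (elementwise), then scale the step
    for each parameter by 1 / (delta + sqrt(r))."""
    r = r + g * g                                   # accumulate squared gradients
    theta = theta - alpha * g / (delta + np.sqrt(r))
    return theta, r

theta, r = np.zeros(3), np.zeros(3)
theta, r = adagrad_step(theta, r, g=np.array([1.0, 0.1, -0.5]))
print(theta, r)   # parameters with larger accumulated gradients take smaller steps
```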

Page 26: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Root Mean Square Propagation (RMSProp)
• Modifies AdaGrad to perform better on non-convex surfaces, where AdaGrad's aggressively decaying learning rate hurts
• Changes the gradient accumulation into an exponentially decaying average of the squared gradients

• Require: learning rate α, initial parameter θ, decay rate ρ, small constant δ (e.g. 10^-7) for numerical stability
• Update:
  • compute the gradient: g ← (1/m) Σ_i ∇θ L(f(x^(i);θ), y^(i))
  • accumulate the squared gradient: r ← ρ r + (1 − ρ) g ⊙ g
  • compute the update: Δθ ← −(α / √(δ + r)) ⊙ g
  • apply the update: θ ← θ + Δθ

• It can be combined with Nesterov momentum

Geoff Hinton, unpublished
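A minimal sketch of one RMSProp update, using the decaying average r of squared gradients (the exact placement of the stability constant δ varies between descriptions; this follows the form above):

```python
import numpy as np

def rmsprop_step(theta, r, g, alpha=0.001, rho=0.9, delta=1e-7):
    """One RMSProp update: r <- rho*r + (1-rho)*g*g, then
    theta <- theta - alpha * g / sqrt(delta + r), elementwise."""
    r = rho * r + (1 - rho) * g * g              # decaying average of squared gradients
    theta = theta - alpha * g / np.sqrt(delta + r)
    return theta, r

theta, r = np.zeros(3), np.zeros(3)
for _ in range(5):
    theta, r = rmsprop_step(theta, r, g=np.array([1.0, 0.1, -0.5]))
print(theta, r)
```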

Page 27: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

ADAptive Moments (Adam)
• A combination of RMSProp and momentum, but:
  • keeps a decaying average of both the first-order moment of the gradient (momentum) and the second-order moment (RMSProp)
  • includes bias corrections (first and second moments) to account for their initialization at the origin

• Update:
  • compute the gradient: g ← (1/m) Σ_i ∇θ L(f(x^(i);θ), y^(i))
  • update the biased first moment estimate: s ← ρ1 s + (1 − ρ1) g
  • update the biased second moment estimate: r ← ρ2 r + (1 − ρ2) g ⊙ g
  • correct the biases: ŝ ← s / (1 − ρ1^t),  r̂ ← r / (1 − ρ2^t)
  • compute the update: Δθ ← −α ŝ / (√r̂ + δ)   (operations applied elementwise)
  • apply the update: θ ← θ + Δθ

• Default values: δ = 10^-8, ρ1 = 0.9, ρ2 = 0.999

Kingma and Ba. Adam: a Method for Stochastic Optimization. ICLR 2015
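A minimal sketch of one Adam update with bias correction; t is the 1-based step count:

```python
import numpy as np

def adam_step(theta, s, r, t, g, alpha=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update: decaying averages of the gradient (s) and of the squared
    gradient (r), bias-corrected for their initialization at zero."""
    s = rho1 * s + (1 - rho1) * g            # biased first moment estimate
    r = rho2 * r + (1 - rho2) * g * g        # biased second moment estimate
    s_hat = s / (1 - rho1 ** t)              # bias-corrected first moment
    r_hat = r / (1 - rho2 ** t)              # bias-corrected second moment
    theta = theta - alpha * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

theta, s, r = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 6):                        # t starts at 1 so the corrections are well defined
    theta, s, r = adam_step(theta, s, r, t, g=np.array([1.0, 0.1, -0.5]))
print(theta)
```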

Page 28: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Example: test function (Beale's function)

Image credit: Alec Radford.

Page 29: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Example:  saddle  point  

Image credit: Alec Radford.

Page 30: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Second order optimization

[Figure: first order vs. second order]

Page 31: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Second order optimization
• Second order Taylor expansion:

J(θ) ≈ J(θ0) + (θ − θ0)ᵀ ∇θ J(θ0) + ½ (θ − θ0)ᵀ H (θ − θ0)

• Solving for the critical point gives the Newton parameter update:

θ* = θ0 − H⁻¹ ∇θ J(θ0)

• Problem: the Hessian has O(N²) elements and inverting H costs O(N³), with N parameters (typically millions)

• Alternatives:
  • Quasi-Newton methods (BFGS, Broyden-Fletcher-Goldfarb-Shanno): instead of inverting the Hessian, approximate the inverse Hessian with rank-1 updates over time, O(N²) each
  • L-BFGS (limited-memory BFGS): does not form/store the full inverse Hessian
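A small sketch of the Newton update on a toy quadratic (solving the linear system instead of explicitly inverting H):

```python
import numpy as np

def newton_step(theta0, grad, H):
    """Newton parameter update: theta* = theta0 - H^{-1} grad J(theta0).
    np.linalg.solve avoids forming the inverse Hessian explicitly."""
    return theta0 - np.linalg.solve(H, grad)

# toy quadratic J(theta) = 0.5*theta^T A theta - b^T theta, with gradient A theta - b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
theta0 = np.zeros(2)
theta_star = newton_step(theta0, A @ theta0 - b, A)
print(theta_star, np.linalg.solve(A, b))   # one Newton step reaches the exact minimum A^{-1} b
```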

Page 32: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Parameter initialization
• Weights
  • Can't initialize weights to 0 (gradients would be 0)
  • Can't initialize all weights to the same value (all hidden units in a layer would always behave the same; need to break symmetry)
  • Small random numbers, e.g. from a uniform or Gaussian distribution N(0, 10⁻²)
    • if weights start too small, the signal shrinks as it passes through each layer until it is too tiny to be useful
  • Calibrating variances with 1/sqrt(n) (Xavier initialization)
    • each neuron: w = randn(n) / sqrt(n), with n inputs
  • He initialization (for ReLU activations): sqrt(2/n)
    • each neuron: w = randn(n) * sqrt(2.0/n), with n inputs

• Biases
  • initialize all to 0 (except the output unit for skewed distributions, or 0.01 to avoid saturating ReLUs)

• Alternative: initialize using machine learning; parameters learned by an unsupervised model trained on the same inputs, or trained on an unrelated task
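A small sketch of the two calibrated initializations mentioned above, for a layer with n inputs:

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Xavier initialization: w = randn / sqrt(n_in), keeping the scale of
    activations roughly constant across layers."""
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def he_init(n_in, n_out, rng):
    """He initialization for ReLU activations: w = randn * sqrt(2 / n_in)."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
W1 = xavier_init(784, 256, rng)
W2 = he_init(256, 10, rng)
b1, b2 = np.zeros(256), np.zeros(10)            # biases initialized to 0
print(W1.std(), 1 / np.sqrt(784))               # empirical std vs target scale
print(W2.std(), np.sqrt(2.0 / 256))
```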

Page 33: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Batch normalization
• As learning progresses, the distribution of the layer inputs changes due to parameter updates (internal covariate shift)

• This can result in most inputs being in the non-linear regime of the activation function, slowing down learning

• Batch normalization is a technique to reduce this effect
  • explicitly force the layer activations to have zero mean and unit variance w.r.t. running batch estimates
  • adds a learnable scale and bias term to allow the network to still use the nonlinearity

[Diagram: a stack of FC / Conv → Batch norm → ReLU blocks, with batch norm inserted after each FC / Conv layer]

Ioffe and Szegedy, 2015. "Batch normalization: accelerating deep network training by reducing internal covariate shift"

Page 34: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Batch normalization
• Can be applied to any input or hidden layer
• For a mini-batch of N activations of the layer (an N × D matrix X):
  1. compute the empirical mean and variance for each of the D dimensions
  2. normalize:

x̂^(k) = ( x^(k) − E[x^(k)] ) / √( Var[x^(k)] )

• Note: normalization can reduce the expressive power of the network (e.g. normalizing the inputs of a sigmoid would constrain them to its linear regime)
• So let the network learn the identity if it needs to, via a scale and shift:

y^(k) = γ^(k) x̂^(k) + β^(k)

• To recover the identity mapping, the network can learn γ^(k) = √( Var[x^(k)] ) and β^(k) = E[x^(k)]
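A minimal sketch of the training-time batch-norm transform for a mini-batch X of shape (N, D), with learnable γ and β:

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """Normalize each of the D dimensions to zero mean / unit variance using
    the mini-batch statistics, then apply the learnable scale and shift."""
    mean = X.mean(axis=0)                        # empirical mean per dimension
    var = X.var(axis=0)                          # empirical variance per dimension
    X_hat = (X - mean) / np.sqrt(var + eps)      # normalize
    return gamma * X_hat + beta                  # y = gamma * x_hat + beta

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(16, 4))
out = batchnorm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 mean, ~1 std per dimension
```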

Page 35: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Batch  normaliza6on  

1. Improves gradient flow through the network

2. Allows higher learning rates

3. Reduces the strong dependence on initialization

4. Reduces the need for regularization

Page 36: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Batch  normaliza6on  

At test time the BN layer functions differently:

1. The mean and std are not computed on the batch.

2. Instead, a single fixed empirical mean and std of the activations, computed during training, is used (they can be estimated with running averages).

Page 37: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Summary  


• Optimization for NNs is different from pure optimization:
  • GD with mini-batches
  • early stopping
  • non-convex surface, local minima and saddle points
• The learning rate has a significant impact on model performance
• Several extensions to SGD can improve convergence
• Adaptive learning-rate methods are likely to achieve the best results
  • RMSProp, Adam
• Weight initialization: He, w = randn(n) * sqrt(2/n)
• Batch normalization to reduce the internal covariate shift

Page 38: Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

Bibliography
• Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
• Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.
• Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
• Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
• Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.
• Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.
• Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307.
• Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
• Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.