
Applied Machine Learning
Multilayer Perceptron

Siamak Ravanbakhsh
COMP 551 (Winter 2020)

Learning objectives

Multilayer perceptron:
- the model
- different supervised learning tasks
- activation functions
- architecture of a neural network
- its expressive power
- regularization techniques

Adaptive bases

f(x) = \sum_d w_d \phi_d(x; v_d)

Several methods can be classified as learning these bases adaptively:
- decision trees
- generalized additive models
- boosting
- neural networks

Here we:
- consider the adaptive bases in a general form (contrast to decision trees)
- use gradient descent to find good parameters (contrast to boosting)
- create more complex adaptive bases by combining simpler bases; this leads to deep neural networks

Adaptive Radial Bases

Gaussian bases, or radial bases: \phi_d(x) = e^{-(x - \mu_d)^2 / s^2}

Non-adaptive case:
- model: f(x; w) = \sum_d w_d \phi_d(x)
- the centers \mu_d are fixed
- cost: J(w) = (1/2) \sum_n (f(x^{(n)}; w) - y^{(n)})^2
- the model is linear in its parameters
- the cost is convex in w (unique minimum) and even has a closed-form solution:

import numpy as np
import matplotlib.pyplot as plt

# x: array of shape (N,)
# y: array of shape (N,)
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)         # Gaussian basis
mu = np.linspace(0, 4, 10)                       # 10 fixed centers in [0, 4]
Phi = phi(x[:, None], mu[None, :])               # N x 10 design matrix
w = np.linalg.lstsq(Phi, y, rcond=None)[0]       # closed-form least-squares solution
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

Adaptive case:
- we can make the bases adaptive by learning the centers
- model: f(x; w, \mu) = \sum_d w_d \phi_d(x; \mu_d)
- how to minimize the cost? it is no longer convex in all model parameters; use gradient descent to find a local minimum (sketched below)
- note that the basis centers change adaptively during training
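As a rough illustration of the adaptive case (not from the slides), the sketch below jointly updates the weights w and the centers mu by gradient descent on J; x and y are the 1-D training arrays from the code above, and the learning rate and iteration count are arbitrary illustrative choices.

import numpy as np

# x, y: 1-D training arrays, as in the fixed-basis code above
def fit_adaptive_rbf(x, y, D=10, lr=0.01, iters=2000):
    mu = np.linspace(x.min(), x.max(), D)                # centers, learned below
    w = np.zeros(D)                                      # output weights
    for _ in range(iters):
        Phi = np.exp(-(x[:, None] - mu[None, :])**2)     # N x D basis matrix
        r = Phi @ w - y                                  # residuals f(x) - y
        grad_w = Phi.T @ r / len(x)                      # dJ/dw
        grad_mu = w * (r[:, None] * Phi * 2 * (x[:, None] - mu[None, :])).sum(0) / len(x)  # dJ/dmu
        w -= lr * grad_w
        mu -= lr * grad_mu
    return w, mu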

Sigmoid Bases

\phi_d(x) = 1 / (1 + e^{-(x - \mu_d)/s_d})

Using adaptive sigmoid bases gives us a neural network.

Non-adaptive case:
- \mu_d is fixed to D locations and s_d = 1
- model: f(x; w) = \sum_d w_d \phi_d(x)

import numpy as np
import matplotlib.pyplot as plt

# x: array of shape (N,)
# y: array of shape (N,)
plt.plot(x, y, 'b.')
phi = lambda x, mu: 1 / (1 + np.exp(-(x - mu)))   # sigmoid basis with s_d = 1
mu = np.linspace(0, 3, 10)                        # 10 fixed locations in [0, 3]
Phi = phi(x[:, None], mu[None, :])                # N x 10 design matrix
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

(plots: fits with D = 10, D = 5, and D = 3 fixed sigmoid bases)

[diagram: bases \phi_1(x), \phi_2(x), ..., \phi_D(x) weighted by w_1, w_2, ..., w_D to produce the output y]

Adaptive Sigmoid Bases

\phi_d(x) = 1 / (1 + e^{-(x - \mu_d)/s_d})

Rewrite the sigmoid basis:
\phi_d(x) = \sigma((x - \mu_d)/s_d) = \sigma(v_d x + b_d)

Each basis is a logistic regression model, \phi_d(x) = \sigma(v_d^\top x + b_d), when the input has more than one dimension.

model: f(x; w, v, b) = \sum_d w_d \sigma(v_d x + b_d)

This is a neural network with two layers.

[diagram: input x feeds bases \phi_1, ..., \phi_D with parameters (v_1, b_1), ..., (v_D, b_D); the bases are combined with weights w_1, ..., w_D to produce the output y]

Optimize using gradient descent (finds a local optimum).

(plots: D = 3 adaptive bases vs. D = 3 fixed bases)

Multilayer Perceptron (MLP)

Suppose we have D inputs x_1, ..., x_D, M hidden units z_1, ..., z_M, and K outputs y_1, ..., y_K. The model is

\hat{y}_k = g(\sum_m W_{k,m} h(\sum_d V_{m,d} x_d))

where h and g are nonlinearities (activation functions); we have different choices for them.

[diagram: input layer x_1, ..., x_D (plus a constant 1), weights V, hidden units z_1, ..., z_M (plus a constant 1), weights W, output layer y_1, ..., y_C]

In more compressed form:
\hat{y} = g(W h(V x))
where the non-linearities are applied elementwise; for simplicity we may drop the bias terms.
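A minimal NumPy sketch of this forward pass (illustrative only: the sizes, the tanh hidden activation and the identity output are assumptions, and bias terms are dropped as above):

import numpy as np

D, M, K = 5, 10, 3                 # input, hidden and output sizes (arbitrary)
rng = np.random.default_rng(0)
V = rng.normal(size=(M, D))        # first-layer weights
W = rng.normal(size=(K, M))        # second-layer weights

def forward(x, h=np.tanh, g=lambda a: a):
    z = h(V @ x)                   # hidden units z = h(Vx)
    return g(W @ z)                # outputs y = g(Wz)

x = rng.normal(size=D)
y_hat = forward(x)                 # shape (K,)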

Regression using Neural Networks

model: \hat{y} = g(W h(V x))

[diagram: input x_1, ..., x_D, weights V, hidden units z_1, ..., z_M, weights W, outputs y_1, ..., y_C]

The choice of activation function in the final layer depends on the task.

Regression: identity output + L2 loss, corresponding to a Gaussian likelihood
\hat{y} = g(Wz) = Wz
L(y, \hat{y}) = (1/2) ||y - \hat{y}||_2^2 = -\log N(y; \hat{y}, \beta I) + constant
We may have one or more output variables.

More generally, we may explicitly produce a distribution at the output, e.g., the mean and variance of a Gaussian, or a mixture of Gaussians: the neural network outputs the parameters of a distribution, and the loss is the log-likelihood of the data under our model,
L(y, \hat{y}) = \log p(y; f(x))
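As a small illustration of this more general case (a sketch, not the slides' code), suppose the network's last layer outputs the mean and log-variance of a Gaussian for each target; the loss is then the negative log-likelihood:

import numpy as np

def gaussian_nll(y, mean, log_var):
    # -log N(y; mean, exp(log_var)), summed over outputs, constants included
    return 0.5 * np.sum(log_var + (y - mean)**2 / np.exp(log_var) + np.log(2 * np.pi))

y = np.array([1.2, -0.3])
mean = np.array([1.0, 0.0])          # assumed network outputs
log_var = np.array([-1.0, -1.0])     # assumed network outputs
loss = gaussian_nll(y, mean, log_var)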

Classification using neural networks

model: \hat{y} = g(W h(V x))

[diagram: input x_1, ..., x_D, weights V, hidden units z_1, ..., z_M, weights W, outputs y_1, ..., y_C]

The choice of activation function in the final layer depends on the task.

Binary classification (scalar output, C = 1): logistic sigmoid + cross-entropy loss, corresponding to a Bernoulli likelihood
\hat{y} = g(Wz) = (1 + e^{-Wz})^{-1}
L(y, \hat{y}) = y \log \hat{y} + (1 - y) \log(1 - \hat{y}) = \log Bernoulli(y; \hat{y})

Multiclass classification (C is the number of classes): softmax + multi-class cross-entropy loss, corresponding to a categorical likelihood
\hat{y} = g(Wz) = softmax(Wz)
L(y, \hat{y}) = \sum_k y_k \log \hat{y}_k = \log Categorical(y; \hat{y})
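These output layers and log-likelihoods as a short NumPy sketch (the logits a = Wz are assumed given):

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())             # shifted for numerical stability
    return e / e.sum()

def binary_ll(y, a):                    # y in {0, 1}, a = Wz is a scalar logit
    yh = sigmoid(a)
    return y * np.log(yh) + (1 - y) * np.log(1 - yh)

def multiclass_ll(y, a):                # y is one-hot, a = Wz is a vector of C logits
    yh = softmax(a)
    return np.sum(y * np.log(yh))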

Activation function

For the middle layer(s) there is more freedom in the choice of activation function.

Identity (no activation function): h(x) = x
- the composition of two linear functions is linear: W'x = WVx, where W is K x M, V is M x D, and W' = WV is K x D
- so nothing is gained (in representation power) by stacking linear layers
- exception: if M < min(D, K) then the hidden layer is compressing the data (W' is low-rank); this idea is used in dimensionality reduction (later!)
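A quick NumPy check that stacking linear layers collapses to a single matrix (the shapes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
D, M, K = 4, 3, 2
V = rng.normal(size=(M, D))
W = rng.normal(size=(K, M))
x = rng.normal(size=D)

two_layer = W @ (V @ x)        # y = W(Vx)
W_prime = W @ V                # collapsed K x D matrix
assert np.allclose(two_layer, W_prime @ x)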

Activation function

For the middle layer(s) there is more freedom in the choice of activation function.

Logistic function: h(x) = \sigma(x) = 1 / (1 + e^{-x})
- the same function used in logistic regression; it used to be the activation of choice in neural networks
- away from zero it changes slowly, so the derivative is small (leads to vanishing gradients)
- its derivative is easy to remember: \partial/\partial x \, \sigma(x) = \sigma(x)(1 - \sigma(x))

Hyperbolic tangent: h(x) = \tanh(x) = 2\sigma(2x) - 1 = (e^x - e^{-x}) / (e^x + e^{-x})
- similar to the sigmoid, but symmetric
- often better for optimization, because close to zero it behaves like a linear function (rather than an affine function, as with the logistic)
- \partial/\partial x \, \tanh(x) = 1 - \tanh(x)^2
- it has a similar problem with vanishing gradients
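The two activations and their derivatives as a short NumPy sketch:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)            # sigma(x)(1 - sigma(x))

def tanh(x):
    return np.tanh(x)             # equals 2*sigmoid(2x) - 1

def d_tanh(x):
    return 1 - np.tanh(x)**2      # 1 - tanh(x)^2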

Activation function

For the middle layer(s) there is more freedom in the choice of activation function.

Rectified Linear Unit (ReLU): h(x) = max(0, x)
- replacing the logistic with ReLU significantly improves the training of deep networks
- zero derivative if the unit is "inactive"; initialization should ensure active units at the beginning of optimization

Leaky ReLU: h(x) = max(0, x) + \gamma min(0, x)
- fixes the zero-gradient problem

Parametric ReLU: make \gamma a learnable parameter.

Softplus (differentiable everywhere): h(x) = \log(1 + e^x)
- it doesn't perform as well in practice
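These variants in a few lines of NumPy (gamma = 0.01 is an arbitrary illustrative value):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, gamma=0.01):
    return np.maximum(0, x) + gamma * np.minimum(0, x)

def softplus(x):
    return np.log1p(np.exp(x))    # log(1 + e^x), differentiable everywhere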

Network architecture

Architecture is the overall structure of the network.

Feed-forward network (aka multilayer perceptron):
- can have many layers; the number of layers is called the depth of the network, and the number of units per layer its width
- each layer can be fully connected (dense) or sparsely connected
- layers may have skip connections, which help with gradient flow (see the sketch below)
- units may have different activations
- parameters may be shared across units (e.g., in conv-nets)

More generally, a directed acyclic graph (DAG) expresses the feed-forward architecture.

[diagrams: a deep network with depth and width annotated; fully connected vs. sparsely connected layers; a skip connection; parameter sharing]
Page 57: Multilayer Perceptron Applied Machine Learningsiamak/COMP551/slides/11-mlp.pdf · Multilayer Perceptron S ia m a k R a v a n b a k h s h CO M P 5 5 1 ( w in t e r 2 0 2 0 ) 1. multilayer

Expressive powerExpressive power  

an MLP with single hidden layer can approximate any continuous function with arbitrary accuracy

universal approximation theorem

7 . 1

Page 58: Multilayer Perceptron Applied Machine Learningsiamak/COMP551/slides/11-mlp.pdf · Multilayer Perceptron S ia m a k R a v a n b a k h s h CO M P 5 5 1 ( w in t e r 2 0 2 0 ) 1. multilayer

Expressive powerExpressive power  

an MLP with single hidden layer can approximate any continuous function with arbitrary accuracy

universal approximation theorem

for 1D input we can see this even with fixed basesM = 100 in this examplethe fit is good (hard to see the blue line)

7 . 1

Page 59: Multilayer Perceptron Applied Machine Learningsiamak/COMP551/slides/11-mlp.pdf · Multilayer Perceptron S ia m a k R a v a n b a k h s h CO M P 5 5 1 ( w in t e r 2 0 2 0 ) 1. multilayer

Expressive powerExpressive power  

an MLP with single hidden layer can approximate any continuous function with arbitrary accuracy

universal approximation theorem

for 1D input we can see this even with fixed basesM = 100 in this examplethe fit is good (hard to see the blue line)

however # bases (M) should grow exponentiallywith D (curse of dimensionality)

7 . 1

Depth vs Width

Universal approximation theorem: an MLP with a single hidden layer can approximate any continuous function (on a compact domain) with arbitrary accuracy.

Caveats:
- we may need a very wide network (large M)
- this is only about training error; we care about test error

Deep networks (with ReLU activation) of bounded width are also shown to be universal.

Empirically, increasing depth is often more effective than increasing width (the number of parameters per layer); assuming a compositional functional form (through depth) is a useful inductive bias.

The number of regions in which the network is linear grows exponentially with depth.

(figure: simplified demonstration with h(W^{l} x) = |W^{l} x| at layer l; the hyperplanes W^{l} x = 0 and W^{l+1} x = 0 at successive layers delimit the linear regions)

Regularization strategies

The universality of neural networks also means they can overfit. Strategies for variance reduction include:
- L1 and L2 regularization (weight decay), sketched below
- data augmentation
- noise robustness
- early stopping
- bagging and dropout
- sparse representations (e.g., an L1 penalty on hidden unit activations)
- semi-supervised and multi-task learning
- adversarial training
- parameter tying
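A minimal sketch of L2 regularization as weight decay in one gradient step (the learning rate lr and coefficient lam are illustrative; grad stands for the gradient of the unregularized loss):

import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, lam=1e-3):
    # the gradient of J(w) + (lam/2) * ||w||^2 is grad + lam * w,
    # so L2 regularization shrinks ("decays") the weights at every step
    return w - lr * (grad + lam * w)

w = np.ones(5)
grad = np.zeros(5)                        # even with a zero loss gradient...
w = sgd_step_with_weight_decay(w, grad)   # ...the weights decay toward zero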

Data augmentation

A larger dataset results in better generalization.

Example: in all three cases below the training error is close to zero; however, a larger training set leads to better generalization.
(plots: fits with N = 20, N = 40, N = 80 training points)

Idea: increase the size of the dataset by adding reasonable transformations \tau(x) that change the label in predictable ways, e.g., f(\tau(x)) = f(x).
image: https://github.com/aleju/imgaug/blob/master/README.md

Special approaches to data augmentation:
- adding noise to the input (sketched below)
- adding noise to hidden units (noise at a higher level of abstraction)
- learning a generative model p(x, y) of the data and using samples x^{(n')}, y^{(n')} ~ p for training

Sometimes we can achieve the same goal by designing models that are invariant to a given set of transformations.
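A tiny sketch of the "add noise to the input" style of augmentation on generic data (the noise scale, number of copies, and random data are illustrative assumptions; the labels are assumed unchanged, f(tau(x)) = f(x)):

import numpy as np

def augment(X, y, noise_std=0.05, copies=2, rng=np.random.default_rng(0)):
    # each copy perturbs the inputs with small Gaussian noise and keeps the labels
    X_aug = [X] + [X + rng.normal(scale=noise_std, size=X.shape) for _ in range(copies)]
    y_aug = [y] * (copies + 1)
    return np.concatenate(X_aug), np.concatenate(y_aug)

X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)
X_big, y_big = augment(X, y)              # 300 training examples instead of 100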

Noise robustness

Make the model robust to noise in:

Input (data augmentation) and hidden units (e.g., in dropout).

Weights: the loss should not be sensitive to small changes in the weights (flat minima).
- flat minima generalize better
- the good performance of SGD with small minibatches is attributed to flat minima; in this case SGD regularizes the model through its gradient noise
(image credit: Keskar et al. '17)

Output (to avoid overfitting, especially to wrong labels): label smoothing.
- a heuristic is to replace hard labels with "soft labels", e.g.,
[0, 0, 1, 0] -> [\epsilon/3, \epsilon/3, 1 - \epsilon, \epsilon/3]
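Label smoothing on a one-hot vector, as a short sketch (eps = 0.1 is an arbitrary choice):

import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    C = y_onehot.shape[-1]
    # keep 1 - eps on the true class and spread eps over the C - 1 wrong classes
    return y_onehot * (1 - eps) + (1 - y_onehot) * eps / (C - 1)

y = np.array([0., 0., 1., 0.])
print(smooth_labels(y))   # [eps/3, eps/3, 1 - eps, eps/3] for a 4-class one-hot label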

Early stopping

The test loss versus training time is "often" U-shaped: use a validation set for early stopping; this also saves computation!

Early stopping bounds the region of parameter space reachable in T time steps (assuming bounded gradients and starting from a small w), so it has an effect similar to L2 regularization; we get the regularization path (various \lambda). We saw a similar phenomenon in boosting.
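A self-contained toy sketch of early stopping with a validation set, using a small linear model trained by gradient descent (the data, patience, and step size are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=200)
Xtr, ytr, Xva, yva = X[:150], y[:150], X[150:], y[150:]   # train / validation split

w = np.zeros(5)
best_val, best_w, patience, waited = np.inf, w.copy(), 10, 0
for epoch in range(1000):
    grad = Xtr.T @ (Xtr @ w - ytr) / len(ytr)             # gradient of the training loss
    w -= 0.05 * grad
    val = np.mean((Xva @ w - yva)**2)                     # validation loss
    if val < best_val - 1e-6:
        best_val, best_w, waited = val, w.copy(), 0       # keep the best weights so far
    else:
        waited += 1
        if waited >= patience:                            # stop when validation stops improving
            break
w = best_w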

Bagging

There are several sources of variance in neural networks, such as:
- optimization: initialization, the randomness of SGD, the learning rate and other hyper-parameters
- choice of architecture: number of layers, hidden units, etc.

Use bagging, or even averaging without the bootstrap, to reduce variance.
Issue: it is computationally expensive.

Dropout

Idea: randomly remove a subset of units during training; as opposed to bagging, a single model is trained. It can be viewed as exponentially many subnetworks that share parameters, and it is one of the most effective regularization schemes for MLPs.

During training, for each instance (n):
- randomly drop each unit with probability p (e.g., p = .5)
- only the remaining subnetwork participates in training

At test time, ideally we want to average over the predictions of all possible sub-networks; this is computationally infeasible, so instead:
1) Monte Carlo dropout: average the predictions of several feed-forward passes using dropout
2) weight scaling: scale the weights by the keep probability 1 - p to compensate for dropout (e.g., for 50% dropout, halve the weights); in general this is not equivalent to the average prediction of the ensemble
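A minimal NumPy sketch of dropout on one hidden layer, with the test-time weight scaling described above (the shapes, p, and the ReLU activation are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
D, M, K, p = 5, 10, 2, 0.5
V, W = rng.normal(size=(M, D)), rng.normal(size=(K, M))

def forward_train(x):
    z = np.maximum(0, V @ x)
    mask = rng.random(M) >= p          # drop each hidden unit with probability p
    return W @ (z * mask)              # only the surviving subnetwork is used

def forward_test(x):
    z = np.maximum(0, V @ x)
    return (W * (1 - p)) @ z           # weight scaling: multiply weights by the keep probability

x = rng.normal(size=D)
y_train, y_test = forward_train(x), forward_test(x)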

Summary

Deep feed-forward networks learn adaptive bases, with more complex bases at higher layers.
- increasing depth is often preferable to increasing width
- there are various choices of activation function and architecture
- they have universal approximation power
- their expressive power often necessitates using regularization schemes