Applied Machine Learning
Multilayer Perceptron
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
Learning objectives
multilayer perceptron:
the model
different supervised learning tasks
activation functions
architecture of a neural network
its expressive power
regularization techniques
Adaptive bases
several methods can be classified as learning these bases adaptively:
decision trees, generalized additive models, boosting, neural networks
f(x) = \sum_d w_d \phi_d(x; v_d)
consider the adaptive bases in a general form (contrast to decision trees)
use gradient descent to find good parameters (contrast to boosting)
create more complex adaptive bases by combining simpler bases; this leads to deep neural networks
Adaptive Radial Bases
Gaussian bases, or radial bases: \phi_d(x) = e^{-(x - \mu_d)^2 / s^2}

non-adaptive case
model: f(x; w) = \sum_d w_d \phi_d(x)
the centers \mu_d are fixed
cost: J(w) = \frac{1}{2} \sum_n (f(x^{(n)}; w) - y^{(n)})^2
the model is linear in its parameters; the cost is convex in w (unique minimum); it even has a closed-form solution

# x: N, y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)
mu = np.linspace(0, 4, 10)          # 10 Gaussian bases
Phi = phi(x[:, None], mu[None, :])  # N x 10 design matrix
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

adaptive case
we can make the bases adaptive by learning these centers
model: f(x; w, \mu) = \sum_d w_d \phi_d(x; \mu_d)
how to minimize the cost? it is not convex in all model parameters; use gradient descent to find a local minimum
note that the basis centers are adaptively changing
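A minimal sketch of the adaptive case for 1D inputs, assuming x and y are numpy arrays as above; the function name, learning rate, and iteration count are illustrative choices rather than part of the slides. Gradient descent updates both the output weights w and the centers \mu:

import numpy as np

def fit_adaptive_rbf(x, y, D=10, lr=0.01, iters=2000):
    # initialize D centers on a grid and the weights at zero
    mu = np.linspace(x.min(), x.max(), D)
    w = np.zeros(D)
    N = len(x)
    for _ in range(iters):
        Phi = np.exp(-(x[:, None] - mu[None, :])**2)   # N x D basis matrix
        r = Phi @ w - y                                 # residuals f(x^(n)) - y^(n)
        grad_w = Phi.T @ r / N                          # dJ/dw_d = sum_n r_n phi_d(x_n)
        # dJ/dmu_d = sum_n r_n w_d phi_d(x_n) 2 (x_n - mu_d)
        grad_mu = np.sum(r[:, None] * w[None, :] * Phi * 2 * (x[:, None] - mu[None, :]), axis=0) / N
        w -= lr * grad_w
        mu -= lr * grad_mu                              # the centers adapt along with the weights
    return w, mu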
Sigmoid Bases
\phi_d(x) = \frac{1}{1 + e^{-(x - \mu_d)/s_d}}
using adaptive sigmoid bases gives us a neural network

non-adaptive case
\mu_d is fixed to D locations and s_d = 1
model: f(x; w) = \sum_d w_d \phi_d(x)

# x: N, y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: 1 / (1 + np.exp(-(x - mu)))  # sigmoid bases with s_d = 1
mu = np.linspace(0, 3, 10)
Phi = phi(x[:, None], mu[None, :])  # N x 10 design matrix
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

[figure: fits with D = 10, D = 5, and D = 3 fixed sigmoid bases]
[network diagram: bases \phi_1(x), ..., \phi_D(x) combined with weights w_1, ..., w_D to produce \hat{y}]
Adaptive Sigmoid Bases
\phi_d(x) = \frac{1}{1 + e^{-(x - \mu_d)/s_d}}
rewrite the sigmoid basis: \phi_d(x) = \sigma\left(\frac{x - \mu_d}{s_d}\right) = \sigma(v_d x + b_d), with v_d = 1/s_d and b_d = -\mu_d/s_d
each basis is a logistic regression model, \phi_d(x) = \sigma(v_d^\top x + b_d), when the input has more than one dimension
model: f(x; w, v, b) = \sum_d w_d \sigma(v_d x + b_d)
this is a neural network with two layers
[network diagram: input x feeds units \phi_1, ..., \phi_D with parameters (v_1, b_1), ..., (v_D, b_D); their outputs are combined with weights w_1, ..., w_D to produce \hat{y}]
optimize using gradient descent (find a local optimum)
[figure: D = 3 adaptive bases vs. D = 3 fixed bases]
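A small sketch of this two-layer model and one gradient-descent step on the squared-error cost from the previous slide, again for scalar inputs; the variable names and step size are illustrative. The derivative \sigma'(a) = \sigma(a)(1 - \sigma(a)) used below is introduced in the activation-function slides:

import numpy as np

sigma = lambda a: 1 / (1 + np.exp(-a))

def grad_step(x, y, w, v, b, lr=0.01):
    # forward pass: f(x) = sum_d w_d sigma(v_d x + b_d)
    A = x[:, None] * v[None, :] + b[None, :]   # N x D pre-activations
    Phi = sigma(A)                              # N x D hidden activations
    r = Phi @ w - y                             # residuals
    # backward pass (chain rule through the sigmoid)
    grad_w = Phi.T @ r / len(x)
    dA = (r[:, None] * w[None, :]) * Phi * (1 - Phi) / len(x)   # dJ/dA
    grad_v = (dA * x[:, None]).sum(axis=0)
    grad_b = dA.sum(axis=0)
    return w - lr * grad_w, v - lr * grad_v, b - lr * grad_b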
Multilayer Perceptron (MLP)
model
suppose we have D inputs, K outputs, and M hidden units
\hat{y}_k = g\left(\sum_m W_{k,m}\, h\left(\sum_d V_{m,d}\, x_d\right)\right)
h and g are nonlinearities (activation functions); we have different choices
more compressed form: \hat{y} = g(W\, h(V x)); the non-linearities are applied elementwise
for simplicity we may drop the bias terms
[network diagram: inputs x_1, ..., x_D (plus a constant 1), hidden units z_1, ..., z_M (plus a constant 1), and outputs \hat{y}_1, ..., \hat{y}_K; V maps input to hidden, W maps hidden to output]
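A compact sketch of the compressed form \hat{y} = g(W h(V x)) as a forward pass over a batch of inputs; the activation choices and shapes here are illustrative assumptions:

import numpy as np

def mlp_forward(X, V, W, h=np.tanh, g=lambda a: a):
    # X: N x D batch of inputs, V: M x D, W: K x M (bias terms dropped for simplicity)
    Z = h(X @ V.T)        # N x M hidden units, nonlinearity applied elementwise
    Y = g(Z @ W.T)        # N x K outputs
    return Y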
Regression using Neural Networks
model: \hat{y} = g(W\, h(V x))
the choice of activation function in the final layer depends on the task

regression
identity output function + L2 loss: Gaussian likelihood
\hat{y} = g(Wz) = Wz
L(y, \hat{y}) = \frac{1}{2} \|y - \hat{y}\|_2^2 = -\log \mathcal{N}(y; \hat{y}, \beta I) + \text{constant}
we may have one or more output variables

more generally
we may explicitly produce a distribution at the output, e.g., the mean and variance of a Gaussian, or a mixture of Gaussians
the neural network outputs the parameters of a distribution, and the loss is the negative log-likelihood of the data under our model:
L(y, \hat{y}) = -\log p(y; f(x))
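As a sketch of the "output a distribution" idea, the last layer can produce both a mean and a log-variance for each target, and the loss is the Gaussian negative log-likelihood; this particular parameterization and the name network_output are illustrative assumptions, not prescribed by the slides:

import numpy as np

def gaussian_nll(y, mean, log_var):
    # -log N(y; mean, exp(log_var)), summed over outputs and averaged over the batch
    return np.mean(0.5 * (log_var + (y - mean)**2 / np.exp(log_var) + np.log(2 * np.pi)))

# e.g., if the network's final layer outputs two values per target,
# split them into a mean and a log-variance:
# mean, log_var = np.split(network_output, 2, axis=-1)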
Classification using neural networks
model: \hat{y} = g(W\, h(V x))
the choice of activation function in the final layer depends on the task

binary classification (scalar output, C = 1)
logistic sigmoid + cross-entropy loss: Bernoulli likelihood
\hat{y} = g(Wz) = (1 + e^{-Wz})^{-1}
L(y, \hat{y}) = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right) = -\log \mathrm{Bernoulli}(y; \hat{y})

multiclass classification (C is the number of classes)
softmax + multiclass cross-entropy loss: categorical likelihood
\hat{y} = g(Wz) = \mathrm{softmax}(Wz)
L(y, \hat{y}) = -\sum_k y_k \log \hat{y}_k = -\log \mathrm{Categorical}(y; \hat{y})
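A short sketch of the multiclass head, assuming one-hot labels Y and logits Wz computed by the network; subtracting the row-wise max is a standard numerical-stability detail, not from the slides:

import numpy as np

def softmax(logits):
    # subtract the row-wise max for numerical stability
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(Y, logits):
    # L(y, yhat) = -sum_k y_k log yhat_k, averaged over the batch
    yhat = softmax(logits)
    return -np.mean(np.sum(Y * np.log(yhat + 1e-12), axis=-1))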
Activation function
for middle layer(s) there is more freedom in the choice of activation function

identity (no activation function): h(x) = x
the composition of two linear functions is linear:
W' x = W V x, with W of size K x M, V of size M x D, and W' = W V of size K x D
so nothing is gained (in representation power) by stacking linear layers
exception: if M < \min(D, K) then the hidden layer is compressing the data (W' is low-rank)
this idea is used in dimensionality reduction (later!)
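A quick numerical check of this point, with illustrative sizes: two stacked linear layers compute the same map as the single matrix W' = W V, whose rank is at most M:

import numpy as np

D, M, K = 5, 2, 4                               # illustrative sizes with M < min(D, K)
V, W = np.random.randn(M, D), np.random.randn(K, M)
x = np.random.randn(D)
print(np.allclose(W @ (V @ x), (W @ V) @ x))    # True: stacking linear layers stays linear
print(np.linalg.matrix_rank(W @ V))             # at most M = 2: W' is low-rank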
Activation function
for middle layer(s) there is more freedom in the choice of activation function

logistic function: h(x) = \sigma(x) = \frac{1}{1 + e^{-x}}
the same function used in logistic regression; it used to be the activation of choice in neural networks
away from zero it changes slowly, so the derivative is small (leads to vanishing gradients)
its derivative is easy to remember: \frac{\partial}{\partial x} \sigma(x) = \sigma(x)(1 - \sigma(x))

hyperbolic tangent: h(x) = \tanh(x) = 2\sigma(2x) - 1 = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
similar to the sigmoid, but symmetric around zero
often better for optimization because close to zero it is similar to a linear function (rather than an affine function when using the logistic)
\frac{\partial}{\partial x} \tanh(x) = 1 - \tanh(x)^2
it has a similar problem with vanishing gradients
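A tiny sketch of the vanishing-gradient point: evaluating the derivatives above shows how quickly they shrink away from the origin (the sample points are arbitrary):

import numpy as np

sigma = lambda x: 1 / (1 + np.exp(-x))
dsigma = lambda x: sigma(x) * (1 - sigma(x))   # peaks at 0.25 at x = 0
dtanh = lambda x: 1 - np.tanh(x)**2            # peaks at 1 at x = 0

for x in [0.0, 2.0, 5.0]:
    print(x, dsigma(x), dtanh(x))              # both derivatives decay quickly as |x| grows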
Activation function
for middle layer(s) there is more freedom in the choice of activation function

Rectified Linear Unit (ReLU): h(x) = \max(0, x)
replacing the logistic with ReLU significantly improves the training of deep networks
zero derivative if the unit is "inactive"; initialization should ensure active units at the beginning of optimization

leaky ReLU: h(x) = \max(0, x) + \gamma \min(0, x)
fixes the zero-gradient problem
parametric ReLU: make \gamma a learnable parameter

Softplus (differentiable everywhere): h(x) = \log(1 + e^{x})
it doesn't perform as well in practice
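These activations are one-liners in numpy; a small sketch, where the default \gamma is an arbitrary example value:

import numpy as np

relu = lambda x: np.maximum(0, x)
leaky_relu = lambda x, gamma=0.01: np.maximum(0, x) + gamma * np.minimum(0, x)
softplus = lambda x: np.log1p(np.exp(x))   # log(1 + e^x), differentiable everywhere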
Network architecture
architecture is the overall structure of the network
feed-forward network (aka multilayer perceptron):
it can have many layers; the number of layers is called the depth of the network, and the number of units per layer its width
each layer can be fully connected (dense) or sparsely connected
layers may have skip connections, which help with gradient flow
units may have different activations
parameters may be shared across units (e.g., in conv-nets)
more generally, a directed acyclic graph (DAG) expresses the feed-forward architecture
[figure: a deep feed-forward network with inputs x_1, ..., x_D, hidden layers z_1^{(\ell)}, ..., z_M^{(\ell)}, and outputs \hat{y}_1, ..., \hat{y}_C, annotated with depth, width, a skip connection, and parameter sharing; a fully connected vs. a sparsely connected layer]
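As an illustration of a skip connection, a hidden layer can add its input back to its output so gradients have a shorter path to earlier layers; the shapes and activation here are illustrative assumptions:

import numpy as np

def layer_with_skip(z, W, h=np.tanh):
    # z: M-dimensional activations of the previous layer, W: M x M weight matrix
    return h(W @ z) + z   # skip connection: the layer's input is added to its output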
Expressive power
universal approximation theorem: an MLP with a single hidden layer can approximate any continuous function with arbitrary accuracy
for 1D input we can see this even with fixed bases
[figure: M = 100 fixed bases in this example; the fit is good (the blue data curve is hard to see under the fit)]
however, the number of bases (M) should grow exponentially with D (curse of dimensionality)
Depth vs Width
universal approximation theorem: an MLP with a single hidden layer can approximate any continuous function with arbitrary accuracy
caveats: we may need a very wide network (large M), and this is only about training error; we care about test error
deep networks (with ReLU activation) of bounded width are also shown to be universal
empirically, increasing depth is often more effective than increasing width (the number of parameters per layer)
assuming a compositional functional form (through depth) is a useful inductive bias
the number of regions in which the network is linear grows exponentially with depth
[simplified demonstration: at layer \ell, the activation h(W^{(\ell)} x) = |W^{(\ell)} x| folds the input space along the hyperplane W^{(\ell)} x = 0; the next layer folds it again along W^{(\ell+1)} x = 0]
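A small numeric sketch of this folding picture in 1D, using the absolute-value activation from the simplified demonstration: each layer folds the unit interval once, and the number of linear pieces doubles with every added layer (the grid size and the specific fold are illustrative):

import numpy as np

x = np.linspace(0, 1, 20001)
f = x
for depth in range(1, 6):
    f = np.abs(2 * f - 1)                         # one layer: fold the unit interval at 1/2
    slope = np.diff(f) / np.diff(x)
    pieces = 1 + np.sum(np.abs(np.diff(slope)) > 1e-6)
    print(depth, pieces)                          # 2, 4, 8, 16, 32 linear pieces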
Regularization strategies
the universality of neural networks also means they can overfit; strategies for variance reduction:
L1 and L2 regularization (weight decay)
data augmentation
noise robustness
early stopping
bagging and dropout
sparse representations (e.g., an L1 penalty on hidden unit activations)
semi-supervised and multi-task learning
adversarial training
parameter tying
Data augmentation
a larger dataset results in better generalization
example: in all three fits below the training error is close to zero; however, a larger training dataset leads to better generalization
[figure: the same model fit with N = 20, N = 40, and N = 80 training points]
idea: increase the size of the dataset by adding reasonable transformations \tau(x) that change the label in predictable ways, e.g., f(\tau(x)) = f(x)
image: https://github.com/aleju/imgaug/blob/master/README.md
special approaches to data augmentation:
adding noise to the input
adding noise to hidden units (noise at a higher level of abstraction)
learn a generative model p(x, y) of the data and use samples x^{(n')}, y^{(n')} \sim p for training
sometimes we can achieve the same goal by designing models that are invariant to a given set of transformations
Noise robustness
make the model robust to noise in:
input (data augmentation)
hidden units (e.g., in dropout)
weights: the loss is not sensitive to small changes in the weights (flat minima)
flat minima generalize better; the good performance of SGD with small minibatches is attributed to flat minima
in this case, SGD regularizes the model due to gradient noise
image credit: Keskar et al. '17
output (avoid overfitting, especially to wrong labels): label smoothing
a heuristic is to replace hard labels with "soft labels", e.g., [0, 0, 1, 0] \to [\frac{\epsilon}{3}, \frac{\epsilon}{3}, 1 - \epsilon, \frac{\epsilon}{3}]
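A short sketch of the label-smoothing heuristic above for one-hot labels with C classes; the helper name and \epsilon value are illustrative:

import numpy as np

def smooth_labels(Y, eps=0.1):
    # Y: N x C one-hot labels -> each 1 becomes 1 - eps, each 0 becomes eps / (C - 1)
    C = Y.shape[1]
    return Y * (1 - eps) + (1 - Y) * eps / (C - 1)

# e.g., smooth_labels(np.array([[0, 0, 1, 0]]), eps=0.1) -> [[0.033, 0.033, 0.9, 0.033]]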
Early stopping
the test loss vs. training time is "often" U-shaped
use a validation set for early stopping; it also saves computation!
early stopping bounds the region of the parameter space that is reachable in T time steps, assuming a bounded gradient and starting from a small w
it has an effect similar to L2 regularization, and we get the regularization path (various \lambda)
we saw a similar phenomenon in boosting
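A minimal sketch of early stopping with a patience counter; train_step and val_loss are caller-supplied routines and their interface is an assumption of this sketch:

def train_with_early_stopping(params, train_step, val_loss, max_steps=10000, patience=10):
    # train_step: one optimization step; val_loss: loss on the held-out validation set
    best_val, best_params, wait = float('inf'), params, 0
    for _ in range(max_steps):
        params = train_step(params)
        v = val_loss(params)
        if v < best_val:
            best_val, best_params, wait = v, params, 0   # keep the best model seen so far
        else:
            wait += 1
            if wait > patience:                          # validation loss stopped improving
                break
    return best_params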
Bagging
there are several sources of variance in neural networks, such as:
optimization: initialization, the randomness of SGD, the learning rate and other hyper-parameters
choice of architecture: number of layers, hidden units, etc.
use bagging, or even averaging without bootstrap, to reduce variance
issue: computationally expensive
Dropout
idea: randomly remove a subset of units during training
as opposed to bagging, a single model is trained
it can be viewed as training exponentially many subnetworks that share parameters
dropout is one of the most effective regularization schemes for MLPs

during training
for each instance (n): randomly drop out each unit with probability p (e.g., p = .5); only the remaining subnetwork participates in training

at test time
ideally we want to average over the predictions of all possible subnetworks; this is computationally infeasible, so instead:
1) Monte Carlo dropout: average the predictions of several feed-forward passes using dropout
2) weight scaling: scale the weights by the keep probability to compensate for dropout; e.g., for 50% dropout, halve the weights at test time (equivalently, scale the kept activations up by a factor of 2 during training)
in general this is not equivalent to the average prediction of the ensemble
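A small sketch of a dropout layer that scales the kept activations during training (the equivalent scaling mentioned above), so no weight rescaling is needed at test time; the function name and default drop rate are illustrative:

import numpy as np

def dropout(z, p=0.5, training=True):
    # z: activations of a hidden layer, p: probability of dropping each unit
    if not training:
        return z                                 # at test time the full network is used
    mask = (np.random.rand(*z.shape) > p)        # keep each unit with probability 1 - p
    return z * mask / (1 - p)                    # scale kept activations to preserve the expected value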
Summary
deep feed-forward networks learn adaptive bases
more complex bases at higher layers
increasing depth is often preferable to increasing width
various choices of activation function and architecture
universal approximation power
their expressive power often necessitates using regularization schemes