Applied Machine Learning
Multilayer Perceptron
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
Learning objectives
multilayer perceptron:
the model
different supervised learning tasks
activation functions
architecture of a neural network
its expressive power
regularization techniques
Adaptive bases
several methods can be classified as learning these bases adaptively:
decision trees, generalized additive models, boosting, neural networks
f(x) = \sum_d w_d \phi_d(x; v_d)
consider the adaptive bases in a general form (contrast to decision trees)
use gradient descent to find good parameters (contrast to boosting)
create more complex adaptive bases by combining simpler bases; this leads to deep neural networks
Adaptive Radial Bases
Gaussian bases, or radial bases: \phi_d(x) = e^{-(x - \mu_d)^2 / s^2}

non-adaptive case
model: f(x; w) = \sum_d w_d \phi_d(x)
the centers \mu_d are fixed
cost: J(w) = \frac{1}{2} \sum_n (f(x^{(n)}; w) - y^{(n)})^2
the model is linear in its parameters; the cost is convex in w (unique minimum); it even has a closed-form solution

# x: N, y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)
mu = np.linspace(0, 4, 10)          # 10 Gaussian bases
Phi = phi(x[:, None], mu[None, :])  # N x 10 design matrix
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

adaptive case
we can make the bases adaptive by learning these centers
model: f(x; w, \mu) = \sum_d w_d \phi_d(x; \mu_d)
how to minimize the cost? it is not convex in all model parameters; use gradient descent to find a local minimum
note that the basis centers are adaptively changing
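A minimal sketch of the adaptive case for 1D inputs, assuming x and y are numpy arrays as above; the function name, learning rate, and iteration count are illustrative choices rather than part of the slides. Gradient descent updates both the output weights w and the centers \mu:

import numpy as np

def fit_adaptive_rbf(x, y, D=10, lr=0.01, iters=2000):
    # initialize D centers on a grid and the weights at zero
    mu = np.linspace(x.min(), x.max(), D)
    w = np.zeros(D)
    N = len(x)
    for _ in range(iters):
        Phi = np.exp(-(x[:, None] - mu[None, :])**2)   # N x D basis matrix
        r = Phi @ w - y                                 # residuals f(x^(n)) - y^(n)
        grad_w = Phi.T @ r / N                          # dJ/dw_d = sum_n r_n phi_d(x_n)
        # dJ/dmu_d = sum_n r_n w_d phi_d(x_n) 2 (x_n - mu_d)
        grad_mu = np.sum(r[:, None] * w[None, :] * Phi * 2 * (x[:, None] - mu[None, :]), axis=0) / N
        w -= lr * grad_w
        mu -= lr * grad_mu                              # the centers adapt along with the weights
    return w, mu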
Sigmoid Bases
\phi_d(x) = \frac{1}{1 + e^{-(x - \mu_d)/s_d}}
using adaptive sigmoid bases gives us a neural network

non-adaptive case
\mu_d is fixed to D locations and s_d = 1
model: f(x; w) = \sum_d w_d \phi_d(x)

# x: N, y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: 1 / (1 + np.exp(-(x - mu)))  # sigmoid bases with s_d = 1
mu = np.linspace(0, 3, 10)
Phi = phi(x[:, None], mu[None, :])  # N x 10 design matrix
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

[figure: fits with D = 10, D = 5, and D = 3 fixed sigmoid bases]
[network diagram: bases \phi_1(x), ..., \phi_D(x) combined with weights w_1, ..., w_D to produce \hat{y}]
Adaptive Sigmoid Bases
\phi_d(x) = \frac{1}{1 + e^{-(x - \mu_d)/s_d}}
rewrite the sigmoid basis: \phi_d(x) = \sigma\left(\frac{x - \mu_d}{s_d}\right) = \sigma(v_d x + b_d), with v_d = 1/s_d and b_d = -\mu_d/s_d
each basis is a logistic regression model, \phi_d(x) = \sigma(v_d^\top x + b_d), when the input has more than one dimension
model: f(x; w, v, b) = \sum_d w_d \sigma(v_d x + b_d)
this is a neural network with two layers
[network diagram: input x feeds units \phi_1, ..., \phi_D with parameters (v_1, b_1), ..., (v_D, b_D); their outputs are combined with weights w_1, ..., w_D to produce \hat{y}]
optimize using gradient descent (find a local optimum)
[figure: D = 3 adaptive bases vs. D = 3 fixed bases]
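A small sketch of this two-layer model and one gradient-descent step on the squared-error cost from the previous slide, again for scalar inputs; the variable names and step size are illustrative. The derivative \sigma'(a) = \sigma(a)(1 - \sigma(a)) used below is introduced in the activation-function slides:

import numpy as np

sigma = lambda a: 1 / (1 + np.exp(-a))

def grad_step(x, y, w, v, b, lr=0.01):
    # forward pass: f(x) = sum_d w_d sigma(v_d x + b_d)
    A = x[:, None] * v[None, :] + b[None, :]   # N x D pre-activations
    Phi = sigma(A)                              # N x D hidden activations
    r = Phi @ w - y                             # residuals
    # backward pass (chain rule through the sigmoid)
    grad_w = Phi.T @ r / len(x)
    dA = (r[:, None] * w[None, :]) * Phi * (1 - Phi) / len(x)   # dJ/dA
    grad_v = (dA * x[:, None]).sum(axis=0)
    grad_b = dA.sum(axis=0)
    return w - lr * grad_w, v - lr * grad_v, b - lr * grad_b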
Multilayer Perceptron (MLP)
model
suppose we have D inputs, K outputs, and M hidden units
\hat{y}_k = g\left(\sum_m W_{k,m}\, h\left(\sum_d V_{m,d}\, x_d\right)\right)
h and g are nonlinearities (activation functions); we have different choices
more compressed form: \hat{y} = g(W\, h(V x)); the non-linearities are applied elementwise
for simplicity we may drop the bias terms
[network diagram: inputs x_1, ..., x_D (plus a constant 1), hidden units z_1, ..., z_M (plus a constant 1), and outputs \hat{y}_1, ..., \hat{y}_K; V maps input to hidden, W maps hidden to output]
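A compact sketch of the compressed form \hat{y} = g(W h(V x)) as a forward pass over a batch of inputs; the activation choices and shapes here are illustrative assumptions:

import numpy as np

def mlp_forward(X, V, W, h=np.tanh, g=lambda a: a):
    # X: N x D batch of inputs, V: M x D, W: K x M (bias terms dropped for simplicity)
    Z = h(X @ V.T)        # N x M hidden units, nonlinearity applied elementwise
    Y = g(Z @ W.T)        # N x K outputs
    return Y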
Regression using Neural Networks
model: \hat{y} = g(W\, h(V x))
the choice of activation function in the final layer depends on the task

regression
identity output function + L2 loss: Gaussian likelihood
\hat{y} = g(Wz) = Wz
L(y, \hat{y}) = \frac{1}{2} \|y - \hat{y}\|_2^2 = -\log \mathcal{N}(y; \hat{y}, \beta I) + \text{constant}
we may have one or more output variables

more generally
we may explicitly produce a distribution at the output, e.g., the mean and variance of a Gaussian, or a mixture of Gaussians
the neural network outputs the parameters of a distribution, and the loss is the negative log-likelihood of the data under our model:
L(y, \hat{y}) = -\log p(y; f(x))
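As a sketch of the "output a distribution" idea, the last layer can produce both a mean and a log-variance for each target, and the loss is the Gaussian negative log-likelihood; this particular parameterization and the name network_output are illustrative assumptions, not prescribed by the slides:

import numpy as np

def gaussian_nll(y, mean, log_var):
    # -log N(y; mean, exp(log_var)), summed over outputs and averaged over the batch
    return np.mean(0.5 * (log_var + (y - mean)**2 / np.exp(log_var) + np.log(2 * np.pi)))

# e.g., if the network's final layer outputs two values per target,
# split them into a mean and a log-variance:
# mean, log_var = np.split(network_output, 2, axis=-1)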
Classification using neural networks
model: \hat{y} = g(W\, h(V x))
the choice of activation function in the final layer depends on the task

binary classification (scalar output, C = 1)
logistic sigmoid + cross-entropy loss: Bernoulli likelihood
\hat{y} = g(Wz) = (1 + e^{-Wz})^{-1}
L(y, \hat{y}) = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right) = -\log \mathrm{Bernoulli}(y; \hat{y})

multiclass classification (C is the number of classes)
softmax + multiclass cross-entropy loss: categorical likelihood
\hat{y} = g(Wz) = \mathrm{softmax}(Wz)
L(y, \hat{y}) = -\sum_k y_k \log \hat{y}_k = -\log \mathrm{Categorical}(y; \hat{y})
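A short sketch of the multiclass head, assuming one-hot labels Y and logits Wz computed by the network; subtracting the row-wise max is a standard numerical-stability detail, not from the slides:

import numpy as np

def softmax(logits):
    # subtract the row-wise max for numerical stability
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(Y, logits):
    # L(y, yhat) = -sum_k y_k log yhat_k, averaged over the batch
    yhat = softmax(logits)
    return -np.mean(np.sum(Y * np.log(yhat + 1e-12), axis=-1))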
Activation function
for middle layer(s) there is more freedom in the choice of activation function

identity (no activation function): h(x) = x
the composition of two linear functions is linear:
W' x = W V x, with W of size K x M, V of size M x D, and W' = W V of size K x D
so nothing is gained (in representation power) by stacking linear layers
exception: if M < \min(D, K) then the hidden layer is compressing the data (W' is low-rank)
this idea is used in dimensionality reduction (later!)
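A quick numerical check of this point, with illustrative sizes: two stacked linear layers compute the same map as the single matrix W' = W V, whose rank is at most M:

import numpy as np

D, M, K = 5, 2, 4                               # illustrative sizes with M < min(D, K)
V, W = np.random.randn(M, D), np.random.randn(K, M)
x = np.random.randn(D)
print(np.allclose(W @ (V @ x), (W @ V) @ x))    # True: stacking linear layers stays linear
print(np.linalg.matrix_rank(W @ V))             # at most M = 2: W' is low-rank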
Activation function
for middle layer(s) there is more freedom in the choice of activation function

logistic function: h(x) = \sigma(x) = \frac{1}{1 + e^{-x}}
the same function used in logistic regression; it used to be the activation of choice in neural networks
away from zero it changes slowly, so the derivative is small (leads to vanishing gradients)
its derivative is easy to remember: \frac{\partial}{\partial x} \sigma(x) = \sigma(x)(1 - \sigma(x))

hyperbolic tangent: h(x) = \tanh(x) = 2\sigma(2x) - 1 = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
similar to the sigmoid, but symmetric around zero
often better for optimization because close to zero it is similar to a linear function (rather than an affine function when using the logistic)
\frac{\partial}{\partial x} \tanh(x) = 1 - \tanh(x)^2
it has a similar problem with vanishing gradients
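A tiny sketch of the vanishing-gradient point: evaluating the derivatives above shows how quickly they shrink away from the origin (the sample points are arbitrary):

import numpy as np

sigma = lambda x: 1 / (1 + np.exp(-x))
dsigma = lambda x: sigma(x) * (1 - sigma(x))   # peaks at 0.25 at x = 0
dtanh = lambda x: 1 - np.tanh(x)**2            # peaks at 1 at x = 0

for x in [0.0, 2.0, 5.0]:
    print(x, dsigma(x), dtanh(x))              # both derivatives decay quickly as |x| grows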
Activation function
for middle layer(s) there is more freedom in the choice of activation function

Rectified Linear Unit (ReLU): h(x) = \max(0, x)
replacing the logistic with ReLU significantly improves the training of deep networks
zero derivative if the unit is "inactive"; initialization should ensure active units at the beginning of optimization

leaky ReLU: h(x) = \max(0, x) + \gamma \min(0, x)
fixes the zero-gradient problem
parametric ReLU: make \gamma a learnable parameter

Softplus (differentiable everywhere): h(x) = \log(1 + e^{x})
it doesn't perform as well in practice
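These activations are one-liners in numpy; a small sketch, where the default \gamma is an arbitrary example value:

import numpy as np

relu = lambda x: np.maximum(0, x)
leaky_relu = lambda x, gamma=0.01: np.maximum(0, x) + gamma * np.minimum(0, x)
softplus = lambda x: np.log1p(np.exp(x))   # log(1 + e^x), differentiable everywhere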
Network architecture
architecture is the overall structure of the network
feed-forward network (aka multilayer perceptron):
it can have many layers; the number of layers is called the depth of the network, and the number of units per layer its width
each layer can be fully connected (dense) or sparsely connected
layers may have skip connections, which help with gradient flow
units may have different activations
parameters may be shared across units (e.g., in conv-nets)
more generally, a directed acyclic graph (DAG) expresses the feed-forward architecture
[figure: a deep feed-forward network with inputs x_1, ..., x_D, hidden layers z_1^{(\ell)}, ..., z_M^{(\ell)}, and outputs \hat{y}_1, ..., \hat{y}_C, annotated with depth, width, a skip connection, and parameter sharing; a fully connected vs. a sparsely connected layer]
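As an illustration of a skip connection, a hidden layer can add its input back to its output so gradients have a shorter path to earlier layers; the shapes and activation here are illustrative assumptions:

import numpy as np

def layer_with_skip(z, W, h=np.tanh):
    # z: M-dimensional activations of the previous layer, W: M x M weight matrix
    return h(W @ z) + z   # skip connection: the layer's input is added to its output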
Expressive power
universal approximation theorem: an MLP with a single hidden layer can approximate any continuous function with arbitrary accuracy
for 1D input we can see this even with fixed bases
[figure: M = 100 fixed bases in this example; the fit is good (the blue data curve is hard to see under the fit)]
however, the number of bases (M) should grow exponentially with D (curse of dimensionality)
Depth vs Width
universal approximation theorem: an MLP with a single hidden layer can approximate any continuous function with arbitrary accuracy
caveats: we may need a very wide network (large M), and this is only about training error; we care about test error
deep networks (with ReLU activation) of bounded width are also shown to be universal
empirically, increasing depth is often more effective than increasing width (the number of parameters per layer)
assuming a compositional functional form (through depth) is a useful inductive bias
the number of regions in which the network is linear grows exponentially with depth
[simplified demonstration: at layer \ell, the activation h(W^{(\ell)} x) = |W^{(\ell)} x| folds the input space along the hyperplane W^{(\ell)} x = 0; the next layer folds it again along W^{(\ell+1)} x = 0]
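A small numeric sketch of this folding picture in 1D, using the absolute-value activation from the simplified demonstration: each layer folds the unit interval once, and the number of linear pieces doubles with every added layer (the grid size and the specific fold are illustrative):

import numpy as np

x = np.linspace(0, 1, 20001)
f = x
for depth in range(1, 6):
    f = np.abs(2 * f - 1)                         # one layer: fold the unit interval at 1/2
    slope = np.diff(f) / np.diff(x)
    pieces = 1 + np.sum(np.abs(np.diff(slope)) > 1e-6)
    print(depth, pieces)                          # 2, 4, 8, 16, 32 linear pieces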
Regularization strategies
the universality of neural networks also means they can overfit; strategies for variance reduction:
L1 and L2 regularization (weight decay)
data augmentation
noise robustness
early stopping
bagging and dropout
sparse representations (e.g., an L1 penalty on hidden unit activations)
semi-supervised and multi-task learning
adversarial training
parameter tying
Data augmentation
a larger dataset results in better generalization
example: in all three fits below the training error is close to zero; however, a larger training dataset leads to better generalization
[figure: the same model fit with N = 20, N = 40, and N = 80 training points]
idea: increase the size of the dataset by adding reasonable transformations \tau(x) that change the label in predictable ways, e.g., f(\tau(x)) = f(x)
image: https://github.com/aleju/imgaug/blob/master/README.md
special approaches to data augmentation:
adding noise to the input
adding noise to hidden units (noise at a higher level of abstraction)
learn a generative model p(x, y) of the data and use samples x^{(n')}, y^{(n')} \sim p for training
sometimes we can achieve the same goal by designing models that are invariant to a given set of transformations
Noise robustness
make the model robust to noise in:
input (data augmentation)
hidden units (e.g., in dropout)
weights: the loss is not sensitive to small changes in the weights (flat minima)
flat minima generalize better; the good performance of SGD with small minibatches is attributed to flat minima
in this case, SGD regularizes the model due to gradient noise
image credit: Keskar et al. '17
output (avoid overfitting, especially to wrong labels): label smoothing
a heuristic is to replace hard labels with "soft labels", e.g., [0, 0, 1, 0] \to [\frac{\epsilon}{3}, \frac{\epsilon}{3}, 1 - \epsilon, \frac{\epsilon}{3}]
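A short sketch of the label-smoothing heuristic above for one-hot labels with C classes; the helper name and \epsilon value are illustrative:

import numpy as np

def smooth_labels(Y, eps=0.1):
    # Y: N x C one-hot labels -> each 1 becomes 1 - eps, each 0 becomes eps / (C - 1)
    C = Y.shape[1]
    return Y * (1 - eps) + (1 - Y) * eps / (C - 1)

# e.g., smooth_labels(np.array([[0, 0, 1, 0]]), eps=0.1) -> [[0.033, 0.033, 0.9, 0.033]]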
Early stopping
the test loss vs. training time is "often" U-shaped
use a validation set for early stopping; it also saves computation!
early stopping bounds the region of the parameter space that is reachable in T time steps, assuming a bounded gradient and starting from a small w
it has an effect similar to L2 regularization, and we get the regularization path (various \lambda)
we saw a similar phenomenon in boosting
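A minimal sketch of early stopping with a patience counter; train_step and val_loss are caller-supplied routines and their interface is an assumption of this sketch:

def train_with_early_stopping(params, train_step, val_loss, max_steps=10000, patience=10):
    # train_step: one optimization step; val_loss: loss on the held-out validation set
    best_val, best_params, wait = float('inf'), params, 0
    for _ in range(max_steps):
        params = train_step(params)
        v = val_loss(params)
        if v < best_val:
            best_val, best_params, wait = v, params, 0   # keep the best model seen so far
        else:
            wait += 1
            if wait > patience:                          # validation loss stopped improving
                break
    return best_params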
Bagging
there are several sources of variance in neural networks, such as:
optimization: initialization, the randomness of SGD, the learning rate and other hyper-parameters
choice of architecture: number of layers, hidden units, etc.
use bagging, or even averaging without bootstrap, to reduce variance
issue: computationally expensive
Dropout
idea: randomly remove a subset of units during training
as opposed to bagging, a single model is trained
it can be viewed as training exponentially many subnetworks that share parameters
dropout is one of the most effective regularization schemes for MLPs

during training
for each instance (n): randomly drop out each unit with probability p (e.g., p = .5); only the remaining subnetwork participates in training

at test time
ideally we want to average over the predictions of all possible subnetworks; this is computationally infeasible, so instead:
1) Monte Carlo dropout: average the predictions of several feed-forward passes using dropout
2) weight scaling: scale the weights by the keep probability to compensate for dropout; e.g., for 50% dropout, halve the weights at test time (equivalently, scale the kept activations up by a factor of 2 during training)
in general this is not equivalent to the average prediction of the ensemble
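A small sketch of a dropout layer that scales the kept activations during training (the equivalent scaling mentioned above), so no weight rescaling is needed at test time; the function name and default drop rate are illustrative:

import numpy as np

def dropout(z, p=0.5, training=True):
    # z: activations of a hidden layer, p: probability of dropping each unit
    if not training:
        return z                                 # at test time the full network is used
    mask = (np.random.rand(*z.shape) > p)        # keep each unit with probability 1 - p
    return z * mask / (1 - p)                    # scale kept activations to preserve the expected value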
Summary
deep feed-forward networks learn adaptive bases
more complex bases at higher layers
increasing depth is often preferable to increasing width
various choices of activation function and architecture
universal approximation power
their expressive power often necessitates using regularization schemes