Introduction to Data Science: Neural Networks and Deep Learning
Héctor Corrada Bravo
University of Maryland, College Park, USA
CMSC 320: 2020-05-10
Historical Overview

Neural networks are a decades-old area of study. Initially, these computational models were created with the goal of mimicking the processing of neuronal networks.

Inspiration: model the neuron as a processing unit. Some of the mathematical functions historically used in neural network models arise from biologically plausible activation functions.

Despite somewhat limited success in modeling neuronal processing, neural network models gained traction as general Machine Learning models.

Strong results about the ability of these models to approximate arbitrary functions made them the subject of intense study in ML. In practice, however, effective training of these models was both technically and computationally difficult.

Starting from 2005, technical advances have led to a resurgence of interest in neural networks, specifically in Deep Neural Networks.
Deep Learning

Advances in computational processing: powerful parallel processing given by Graphics Processing Units.

Advances in neural network architecture design and network optimization.

As a result, researchers apply Deep Neural Networks successfully in a number of applications.
Self-driving cars make use of Deep Learning models for sensor processing.

Image recognition software uses Deep Learning to identify individuals within photos.

Deep Learning models have been applied to medical imaging to yield expert-level prognosis.

An automated Go player, making heavy use of Deep Learning, is capable of beating the best human Go players in the world.
Neural Networks and Deep Learning

In this unit we study neural networks and recent advances in Deep Learning.
Projection-Pursuit Regression

To motivate our discussion of Deep Neural Networks, let's turn to a simple but very powerful class of models.

As per the usual regression setting, suppose we are given predictors (attributes) $\{X_1, \ldots, X_p\}$ for an observation, and we want to predict a continuous outcome $Y$.
The Projection-Pursuit Regression (PPR) model predicts outcome $Y$ using a function $f(X)$ of the form

$$f(X) = \sum_{m=1}^M g_m(w_m'X)$$

where:

- $w_m$ is a $p$-dimensional weight vector, so $w_m'X = \sum_{j=1}^p w_{mj}x_j$ is a linear combination of the predictors $x_j$, and
- $g_m$, $m = 1, \ldots, M$, are univariate non-linear functions (a smoothing spline, for example).
Our prediction function is a linear combination of $M$ terms. Each term $g_m(w_m'X)$ is the result of applying a non-linear function to what we can think of as a derived feature (or derived predictor) $V_m = w_m'X$.
Here's another intuition. Recall the Principal Component Analysis problem we saw in the previous unit.

Given: a data set $\{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the vector of $p$ variable values for the $i$-th observation.

Return: a matrix $[\phi_1, \phi_2, \ldots, \phi_p]$ of linear transformations that retain maximal variance.
You can think of the first vector $\phi_1$ of this matrix as a linear transformation that embeds the observations into one dimension,

$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p,$$

where $\phi_1$ is selected so that the resulting dataset $\{z_1, \ldots, z_n\}$ has maximum variance.
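As a concrete reminder, here is a minimal sketch using base R's prcomp on simulated data (the data are illustrative):

```r
# First principal component as a 1-dimensional embedding.
set.seed(1)
x <- matrix(rnorm(100 * 5), nrow = 100)   # n = 100 observations, p = 5
pca <- prcomp(x, center = TRUE, scale. = TRUE)
phi1 <- pca$rotation[, 1]                 # first loading vector phi_1
z1 <- pca$x[, 1]                          # derived feature Z_1
var(z1)                                   # maximal variance among projections
```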
In PPR we are reducing the dimensionality of $X$ from $p$ to $M$ using linear projections, and building a regression function

$$f(X) = \sum_{m=1}^M g_m(w_m'X)$$

over the representation with reduced dimension.
Let's revisit the data from our previous unit and see how the PPR model performs. This is a time series dataset of mortgage affordability as calculated and distributed by Zillow: https://www.zillow.com/research/data/.

The dataset contains affordability measurements for 76 counties, with data from 1979 to 2017. Here we plot the time series of affordability for all counties.
We will try to predict affordability at the last time-point given in the dataset, based on the time series up to one year previous to the last time point.
So, how can we fit the PPR model? As we have done previously in other regression settings, we start with a loss function to minimize,

$$L(g, W) = \sum_{i=1}^N \left[ y_i - \sum_{m=1}^M g_m(w_m'x_i) \right]^2,$$

and use an optimization method to minimize the error of the model.

For simplicity, let's consider a model with $M = 1$ and drop the subscript $m$.
Consider the following procedure:

- Initialize the weight vector $w$ to some value $w_{old}$.
- Construct the derived variable $v = w_{old}'x$.
- Use a non-linear regression method to fit the function $g$ based on the model $E[Y|V] = g(v)$. You can use additive splines or loess.
Given the function $g$, now update the weight vector $w_{old}$ using a gradient descent method:

$$w = w_{old} + 2\alpha \sum_{i=1}^N (y_i - g(v_i))\, g'(v_i)\, x_i = w_{old} + 2\alpha \sum_{i=1}^N r_i x_i,$$

where $\alpha$ is a learning rate.
In the second line we rewrite the gradient in terms of the residual $y_i - g(v_i)$ of the current model (using the derived feature $v_i$), weighted by what we could think of as the sensitivity $g'(v_i)$ of the model to changes in the derived feature $v_i$; that is, $r_i = (y_i - g(v_i))\, g'(v_i)$.
Given an updated weight vector $w$ we can then fit $g$ again, and continue iterating until a stop condition is reached.
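Base R ships an implementation of this model as stats::ppr, so we can try the procedure without coding the alternating fit by hand. A minimal sketch on simulated data (the data and settings here are illustrative, not the mortgage example above):

```r
# Fit a Projection-Pursuit Regression model with base R's stats::ppr.
# Simulated data; the true signal depends on one projection of x.
set.seed(1)
n <- 500; p <- 5
x <- matrix(rnorm(n * p), nrow = n)
y <- sin(2 * (x[, 1] + 0.5 * x[, 2])) + rnorm(n, sd = 0.1)

fit <- ppr(x, y, nterms = 1, max.terms = 3)  # M = 1 term; consider up to 3
fit$alpha             # estimated projection direction(s) w_m
plot(fit)             # estimated ridge function(s) g_m
yhat <- predict(fit, x)
```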
Let's consider the PPR model and this fitting technique in a bit more detail, with a few observations. We can think of the PPR model as composing three functions:

- the linear projection $w'x$,
- the result of the non-linear function $g$, and,
- in the case when $M > 1$, the linear combination of the $g_m$ functions.
To tie this to the formulation usually described in the neural network literature, we make one slight change to our understanding of derived features. Consider the case $M > 1$: the final predictor is a linear combination $\sum_{m=1}^M g_m(v_m)$. We could also think of each term $g_m(v_m)$ as providing a non-linear dimensionality reduction to a single derived feature.
29 / 110
Projection-Pursuit RegressionThis interpretation is closer to that used in the neural network literature,at each stage of the composition we apply a non-linear transform to thedata of the type .g(w
′x)
30 / 110
The fitting procedure propagates errors (residuals) down this function composition in a stage-wise manner.
Feed-forward Neural Networks

We can now write the general formulation for a feed-forward neural network. We will present the formulation for a general case where we are modeling $K$ outcomes $Y_1, \ldots, Y_K$ as $f_1(X), \ldots, f_K(X)$.
In multi-class classification, a categorical outcome may take multiple values. We consider $Y_k$ as a discriminant function for class $k$; the final classification is made using $\arg\max_k Y_k$. For regression, we can take $K = 1$.
A single-layer feed-forward neural network is defined as

$$h_m = g_h(w_{1m}'X), \quad m = 1, \ldots, M$$
$$f_k = g_{f_k}(w_{2k}'h), \quad k = 1, \ldots, K.$$
The network is organized into input, hidden, and output layers.
Units $h_m$ represent a hidden layer, which we can interpret as a derived non-linear representation of the input data, as we saw before.
The function $g_h$ is an activation function used to introduce non-linearity into the representation.
Historically, the sigmoid activation function

$$g_h(v) = \frac{1}{1 + e^{-v}}$$

or the hyperbolic tangent was commonly used.
Nowadays, a rectified linear unit (ReLU),

$$g_h(v) = \max\{0, v\},$$

is used more frequently in practice (there are many extensions).
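These activation functions are one-liners in R; a quick sketch to compare their shapes:

```r
# Common activation functions in base R.
sigmoid <- function(v) 1 / (1 + exp(-v))
relu    <- function(v) pmax(0, v)          # element-wise max(0, v)

v <- seq(-4, 4, length.out = 200)
plot(v, sigmoid(v), type = "l", ylim = c(-1, 4), ylab = "g(v)")
lines(v, tanh(v), lty = 2)                 # hyperbolic tangent (built in)
lines(v, relu(v), lty = 3)
legend("topleft", c("sigmoid", "tanh", "ReLU"), lty = 1:3)
```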
The function $g_f$ used in the output layer depends on the outcome modeled. For classification, a soft-max function can be used,

$$g_{f_k}(t_k) = \frac{e^{t_k}}{\sum_{l=1}^K e^{t_l}},$$

where $t_k = w_{2k}'h$. For regression, we may take $g_{f_k}$ to be the identity function.
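A direct translation into R (adding the usual max-subtraction trick for numerical stability, which is not shown on the slide):

```r
# Soft-max over a vector of scores t_k = w'_2k h.
softmax <- function(t_scores) {
  e <- exp(t_scores - max(t_scores))   # subtract the max for stability
  e / sum(e)
}
softmax(c(1, 2, 3))                    # probabilities summing to 1
```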
The single-layer feed-forward neural network has the same parameterization as the PPR model, but the activation functions $g_h$ are much simpler, as opposed to, e.g., the smoothing splines used in PPR.
A classic result of the neural network literature is the universal function representational ability of the single-layer feed-forward neural network with ReLU activation functions (Leshno et al. 1993). However, the number of units in the hidden layer may need to be exponentially large to approximate arbitrary functions.
Empirically, a single-layer feed-forward neural network has performance similar to kernel-based methods like SVMs. This is not usually the case once more than a single layer is used in a neural network.
Fitting with back propagation

In the modern neural network literature, the graphical representation of neural nets we saw above has been extended to computational graphs. These are especially useful to guide the design of general-use programming libraries for the specification of neural nets. They have the advantage of explicitly representing all operations used in a neural network, which in turn permits easier specification of gradient-based algorithms.
Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks. These methods require that gradients are computed based on the model error. The layer-wise propagation of error is at the core of these gradient computations; this is called back-propagation.
Assume we have a current estimate of the model parameters, and that we are processing one observation $x$ (in practice a small batch of observations is used).
First, to perform back-propagation we must compute the error of the model on observation $x$ given the current set of parameters. To do this we compute all activation functions along the computation graph from the bottom up.
Once we have computed the output $\hat{y}$, we can compute the error (or, more generally, cost) $J(y, \hat{y})$. Once we do this, we can walk back through the computation graph to obtain the gradients of the cost $J$ with respect to any of the model parameters by applying the chain rule.
We will continuously update a gradient vector $\nabla$. First, we set

$$\nabla \leftarrow \nabla_{\hat{y}} J.$$
Next, we need the gradient $\nabla_t J$. We apply the chain rule to obtain

$$\nabla_t J = \nabla \odot f'(t),$$

where $f'$ is the derivative of the soft-max function and $\odot$ is element-wise multiplication. Set $\nabla \leftarrow \nabla_t J$.
Next, we want to compute $\nabla_{W_k} J$. We can do so using the gradient $\nabla$ we just computed, since $\nabla_{W_k} J = \nabla_t J \, \nabla_{W_k} t$. In this case, we get $\nabla_{W_k} J = \nabla\, h'$.
At this point we have computed gradients for the weight matrix $W_k$ from the hidden layer to the output layer, which we can use to update those parameters as part of stochastic gradient descent.
Once we have computed gradients for the weights connecting the hidden and output layers, we can compute gradients for the weights connecting the input and hidden layers.
We require $\nabla_h J$, which we can compute as $W_k' \nabla$ since $\nabla$ currently has value $\nabla_t J$. At this point we can set $\nabla \leftarrow \nabla_h J$.
Finally, we set

$$\nabla \leftarrow \nabla_z J = \nabla \odot g'(z),$$

where $g'$ is the derivative of the ReLU activation function. This gives us $\nabla_{W_h} J = \nabla\, x'$.
At this point we have propagated the gradient of the cost function $J$ to all parameters of the model. We can thus update the model for the next step of stochastic gradient descent.
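To make the walk-through concrete, here is a minimal sketch in R of one back-propagation step for the single-hidden-layer network above, assuming ReLU hidden units, a soft-max output, and a cross-entropy cost (for which $\nabla_t J$ simplifies to $\hat{y} - y$); the dimensions and data are illustrative:

```r
# One forward + backward pass and one SGD update; illustrative setup.
set.seed(1)
p <- 4; M <- 8; K <- 3
x  <- rnorm(p)                                # one observation
y  <- c(1, 0, 0)                              # one-hot true class
Wh <- matrix(rnorm(M * p, sd = 0.1), M, p)    # input -> hidden weights
Wk <- matrix(rnorm(K * M, sd = 0.1), K, M)    # hidden -> output weights

# Forward pass: activations from the bottom up
z     <- Wh %*% x                             # hidden pre-activations
h     <- pmax(0, z)                           # ReLU g(z)
t_out <- Wk %*% h                             # output pre-activations t
yhat  <- exp(t_out - max(t_out)); yhat <- yhat / sum(yhat)  # soft-max

# Backward pass, in the order of the slides
grad_t  <- yhat - y                           # nabla_t J (soft-max + cross-entropy)
grad_Wk <- grad_t %*% t(h)                    # nabla_Wk J = nabla h'
grad_h  <- t(Wk) %*% grad_t                   # nabla_h J = Wk' nabla
grad_z  <- grad_h * (z > 0)                   # nabla_z J: element-wise ReLU derivative
grad_Wh <- grad_z %*% t(x)                    # nabla_Wh J = nabla x'

# SGD update with learning rate alpha
alpha <- 0.1
Wk <- Wk - alpha * grad_Wk
Wh <- Wh - alpha * grad_Wh
```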
Practical Issues

Stochastic gradient descent (SGD) based on the back-propagation algorithm shown above introduces some complications.
Scaling

The scale of the inputs $x$ effectively determines the scale of the weight matrices $W$, and scale can have a large effect on how well SGD behaves. In practice, all inputs are usually standardized to have zero mean and unit variance before application of SGD.
Initialization

With properly scaled inputs, initialization of the weights can be done in a somewhat reasonable manner: randomly choose initial weights in $[-0.7, 0.7]$.
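In R, these two steps might look as follows (the interval endpoints follow the slide; the data and hidden-layer size are illustrative):

```r
# Standardize inputs, then draw small random initial weights.
x <- matrix(rnorm(100 * 5), nrow = 100)   # n x p predictor matrix (simulated)
x_scaled <- scale(x)                      # column-wise: zero mean, unit variance
M <- 10; p <- ncol(x_scaled)
W1 <- matrix(runif(M * p, min = -0.7, max = 0.7), M, p)  # weights in [-0.7, 0.7]
```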
Overfitting

As with other highly flexible models we have seen previously, feed-forward neural nets are prone to overfitting the data. We can incorporate penalty terms to control model complexity to some degree.
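For instance, the classic nnet package in R exposes an L2 penalty ("weight decay") through its decay argument; a brief sketch with illustrative data and settings:

```r
# Weight decay (L2 penalty on the weights) with the nnet package.
library(nnet)
set.seed(1)
x <- matrix(rnorm(200 * 3), nrow = 200)
y <- sin(x[, 1]) + rnorm(200, sd = 0.1)
fit <- nnet(x, y, size = 10, linout = TRUE,   # 10 hidden units, linear output
            decay = 1e-3, maxit = 500)        # decay sets the penalty strength
```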
Architecture Design

A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer. We saw above that a wide enough hidden layer is capable of perfectly fitting the data. We will also see later that in many cases making the neural network deeper, instead of wider, performs better. In that case models may have significantly fewer parameters, but they tend to be much harder to fit.
Ideal network architectures are task dependent. They require much experimentation, along with judicious use of cross-validation methods to measure expected prediction error to guide the architecture choice.
Multiple Minima

As opposed to the other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem. This leads to the problem of multiple local minima, from which methods like SGD can suffer. We will see later, in detail, a variety of approaches used to address this problem. Here, we present a few rules of thumb to follow.
The local minima a method like SGD may yield depend on the initial parameter values chosen.

One idea is to train multiple models using different initial values and make predictions using the model that gives the best expected prediction error. A related idea is to average the predictions of these multiple models (see the sketch below). Finally, we can use bagging, as described in a previous session, to create an ensemble of neural networks that circumvents the local minima problem.
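A minimal sketch of the restart-and-average idea using the nnet package (the data, the number of restarts, and the settings are illustrative):

```r
# Train several networks from different random initializations and
# average their predictions.
library(nnet)
set.seed(1)
x <- matrix(rnorm(200 * 3), nrow = 200)
y <- sin(x[, 1]) + rnorm(200, sd = 0.1)

fits <- lapply(1:5, function(s) {
  set.seed(s)   # different seed => different initial weights
  nnet(x, y, size = 10, linout = TRUE, maxit = 500, trace = FALSE)
})
preds <- sapply(fits, predict, newdata = x)   # one column per restart
yhat  <- rowMeans(preds)                      # averaged ensemble prediction
```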
Summary

Neural networks are representationally powerful prediction models. They can be difficult to optimize properly due to the non-convexity of the resulting optimization problem. Deciding on a network architecture is a significant challenge; we'll see later that recent proposals use deep but thinner networks effectively, and even in this case the choice of model depth is difficult. There is tremendous excitement over the recent excellent performance of deep neural networks in many applications.
Deep Feed-Forward Neural Networks

The general form of the feed-forward network can be extended by adding additional hidden layers.
The same principles we saw before apply:

- We arrange computation using a computation graph.
- We use Stochastic Gradient Descent.
- We use back-propagation for gradient calculation along the computation graph.
Empirically, it is found that using more, thinner layers yields better expected prediction error. However, each layer introduces more non-linearity into the network, making optimization markedly more difficult.
We may interpret the hidden layers as progressively derived representations of the input data. Since we train based on a loss function, these derived representations should make modeling the outcome of interest progressively easier.
In many applications, these derived representations are used for model interpretation.
Advanced parallel computation systems and methods are used to train these deep networks, which can have billions of connections. The applications we discussed previously build this type of massive deep network; they also require massive amounts of data to train. However, the approach can still be applicable to moderate data sizes with careful network design, regularization, and training.
Supervised Pre-training

A clever idea for training deep networks: train each layer successively on the outcome of interest, and use the resulting weights as initial weights for the network with one additional layer.
Train the first layer as a single-layer feed-forward network, with weights initialized as standard practice. This fits $W_h^1$.
Now train the two-layer network, with weights $W_h^1$ initialized to the result of the previous fit.
This procedure continues until all layers are trained. The hypothesis is that training each layer on the outcome of interest moves the weights to parts of parameter space that lead to good performance. Minimizing updates can ameliorate the dependency problem.
This is one strategy; other strategies are popular and effective.

Train each layer as a single-layer network using the hidden layer of the previous layer as the inputs to the model. In this case no long-term dependencies occur at all, but performance may suffer.
Alternatively, train each layer as a single layer on the hidden layer of the previous layer, but also add the original input data as input to every layer of the network. There is no long-term dependency and performance improves, but the number of parameters increases.
Parameter Sharing

Another method for reducing the number of parameters in a deep learning model: when the predictors $X$ exhibit some internal structure, parts of the model can share parameters.
Two important applications use this idea:

- Image processing: local structure of nearby pixels.
- Sequence modeling: structure given by the sequence.

The latter includes modeling of time series data.
Convolutional Networks are used in imaging applications. The input is pixel data, and parameters are shared across nearby parts of the image.
Recurrent Networks

Recurrent Networks are used in sequence modeling applications, for instance time series and forecasting. Parameters are shared across a time lag.
The long short-term memory (LSTM) model is very popular in time series analysis.
Example

Learn to add: "55+22=77" (https://keras.rstudio.com/articles/examples/addition_rnn.html).
Addition is encoded as a sequence of one-hot vectors:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 5 0 0 0 0 0 0 0 1 0 0
## 5 0 0 0 0 0 0 0 1 0 0
## + 0 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 1 0 0 0 0 0
## 2 0 0 0 0 1 0 0 0 0 0
## [,11] [,12]
## 5 0 0
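A sketch of how such an encoding might be produced in R (the vocabulary and its ordering here are assumptions; the output above shows the tutorial orders its character set differently):

```r
# One-hot encode an addition string over an assumed 12-character vocabulary.
chars <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "+", " ")
one_hot <- function(s) {
  tokens <- strsplit(s, "")[[1]]
  m <- matrix(0L, nrow = length(tokens), ncol = length(chars),
              dimnames = list(tokens, NULL))
  m[cbind(seq_along(tokens), match(tokens, chars))] <- 1L
  m
}
one_hot("55+22")   # one row per character, one column per vocabulary symbol
```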
The result is encoded as a sequence of one-hot vectors:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 7 0 0 0 0 0 0 0 0 0 1
## 7 0 0 0 0 0 0 0 0 0 1
## [,11] [,12]
## 7 0 0
## 7 0 0
This is a sequence-to-sequence model: a perfect application for an LSTM.
The model, as summarized by keras:

## ______________________________________________________
## Layer (type) Output Shape Param #
## ======================================================
## lstm_10 (LSTM) (None, 128) 72192
## ______________________________________________________
## repeat_vector_5 (Repeat (None, 3, 128) 0
## ______________________________________________________
## lstm_11 (LSTM) (None, 3, 128) 131584
## ______________________________________________________
## time_distributed_5 (Tim (None, 3, 12) 1548
## ______________________________________________________
## activation_5 (Activatio (None, 3, 12) 0
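From that summary the architecture can plausibly be reconstructed with the keras R package roughly as follows; this is a sketch under assumptions (the input length of 5 matches "55+22", and the 3 output positions and 12-symbol vocabulary are read off the summary; the tutorial's actual code may differ):

```r
# Hedged reconstruction of the addition_rnn model from its summary.
library(keras)

model <- keras_model_sequential() %>%
  # Encoder: read the 5-character input into a single 128-d vector
  layer_lstm(units = 128, input_shape = c(5, 12)) %>%
  # Repeat that vector once per output position (3 characters)
  layer_repeat_vector(3) %>%
  # Decoder: produce a 128-d vector per output position
  layer_lstm(units = 128, return_sequences = TRUE) %>%
  # Score each of the 12 vocabulary symbols at each position
  time_distributed(layer_dense(units = 12)) %>%
  layer_activation("softmax")

model %>% compile(loss = "categorical_crossentropy",
                  optimizer = "adam", metrics = "accuracy")
```

The parameter counts match the summary: the first LSTM has $4 \times (12 + 128 + 1) \times 128 = 72192$ parameters, the second $4 \times (128 + 128 + 1) \times 128 = 131584$, and the dense layer $(128 + 1) \times 12 = 1548$.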
Summary

Deep Learning is riding a big wave of popularity, with state-of-the-art results in many applications. The best results come in applications with massive amounts of data; however, newer methods allow its use in other situations.
Many recent advances stem from computational and technical approaches to modeling. Keeping track of these advances is hard, and many of them are ad hoc. It is not straightforward to determine a priori how these technical advances may help in a specific application; this requires a significant amount of experimentation.
The interpretation of hidden units as representations can lead to insight, and there is current research on interpreting these representations to support some notion of statistical inference.

Excellent textbook: http://deeplearningbook.org