Neural networks are a decades-old area of study.

Initially, these computational models were created with the goal of mimicking the processing of neuronal networks.

Historical Overview

1 / 110

Inspiration: model neuron as processing unit.

Some of the mathematical functions historically used in neural network models arise from biologically plausible activation functions.

Historical Overview

2 / 110

Somewhat limited success in modeling neuronal processing

Neural network models gained traction as general Machine Learning models.

Historical Overview

3 / 110

Historical Overview

Strong results about the ability of these models to approximate arbitrary functions

Became the subject of intense study in ML.

In practice, effective training of these models was both technically and computationally difficult.

4 / 110

Starting from 2005, technical advances have led to a resurgence of interest in neural networks, specifically in Deep Neural Networks.

Historical Overview

5 / 110

Deep Learning

Advances in computational processing:

powerful parallel processing given by Graphical Processing Units

6 / 110

Deep Learning

Advances in computational processing:

powerful parallel processing given by Graphical Processing Units

Advances in neural network architecture design and network optimization

7 / 110

Deep Learning

Advances in computational processing:

powerful parallel processing given by Graphical Processing Units

Advances in neural network architecture design and network optimization

Researchers apply Deep Neural Networks successfully in a number of applications.

8 / 110

Self-driving cars make use of Deep Learning models for sensor processing.

Deep Learning

9 / 110

Image recognition software uses Deep Learning to identify individuals within photos.

Deep Learning

10 / 110

Deep Learning models have been applied to medical imaging to yield expert-level prognosis.

Deep Learning

11 / 110

An automated Go player, making heavy use of Deep Learning, is capable of beating the best human Go players in the world.

Deep Learning

12 / 110

Neural Networks and Deep Learning

In this unit we study neural networks and recent advances in Deep Learning.

13 / 110

Projection-Pursuit Regression

To motivate our discussion of Deep Neural Networks, let's turn to a simple but very powerful class of models.

As per the usual regression setting, suppose

given predictors (attributes) $X_1, \ldots, X_p$ for an observation

we want to predict a continuous outcome $Y$.

14 / 110

Projection-Pursuit Regression

The Projection-Pursuit Regression (PPR) model predicts outcome $Y$ using function $f(X)$ as

$$f(X) = \sum_{m=1}^{M} g_m(w_m'X)$$

where:

$w_m$ is a $p$-dimensional weight vector, so $w_m'X = \sum_{j=1}^{p} w_{mj}x_j$ is a linear combination of predictors $x_j$

and $g_m$, $m = 1, \ldots, M$ are univariate non-linear functions (a smoothing spline for example)

15 / 110

Projection-Pursuit Regression

Our prediction function is a linear function (with $M$ terms).

Each term $g_m(w_m'X)$ is the result of applying a non-linear function to, what we can think of as, a derived feature (or derived predictor) $V_m = w_m'X$.

16 / 110
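To make the formula concrete, here is a minimal R sketch of how PPR predictions are assembled from derived features. The weight vectors and ridge functions below are made-up placeholders, not fitted values (R's built-in stats::ppr fits the actual model).

# Minimal sketch of the PPR prediction function f(X) = sum_m g_m(w_m' X).
# W is a p x M matrix of (hypothetical) weight vectors, one column per term;
# g_list is a list of M univariate functions standing in for the fitted g_m.
ppr_predict <- function(X, W, g_list) {
  V <- X %*% W                         # derived features V_m = w_m' X
  terms <- sapply(seq_along(g_list),
                  function(m) g_list[[m]](V[, m]))
  rowSums(terms)                       # f(X) is the sum of the M terms
}

# Toy usage with made-up projections and smooth functions
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)
W <- cbind(c(1, 0, 0), c(0, 1, 1))     # M = 2 hypothetical projections
g_list <- list(function(v) v^2, function(v) sin(v))
head(ppr_predict(X, W, g_list))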

Projection-Pursuit Regression

Here's another intuition. Recall the Principal Component Analysis problem we saw in the previous unit.

Given:

Data set $\{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the vector of $p$ variable values for the $i$-th observation.

Return:

Matrix of linear transformations $[\phi_1, \phi_2, \ldots, \phi_p]$ that retain maximal variance.

17 / 110

Projection-Pursuit Regression

Matrix of linear transformations $[\phi_1, \phi_2, \ldots, \phi_p]$

You can think of the first vector $\phi_1$ as a linear transformation that embeds observations into 1 dimension:

$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p$$

where $\phi_1$ is selected so that the resulting dataset $\{z_1, \ldots, z_n\}$ has maximum variance.

18 / 110
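As a quick reminder of that idea in code, here is a small base-R sketch using prcomp; the data is simulated purely for illustration.

# First principal component as a 1-dimensional linear embedding z_i = phi_1' x_i
set.seed(1)
x <- matrix(rnorm(200 * 4), ncol = 4)          # simulated data with p = 4 variables
pca <- prcomp(x, center = TRUE, scale. = TRUE)
phi1 <- pca$rotation[, 1]                      # loading vector phi_1
z1 <- scale(x) %*% phi1                        # derived 1-dimensional representation
var(as.vector(z1))                             # phi_1 maximizes this variance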

Projection-Pursuit Regression

In PPR we are reducing the dimensionality of $X$ from $p$ to $M$ using linear projections,

$$f(X) = \sum_{m=1}^{M} g_m(w_m'X)$$

And building a regression function over the representation with reduced dimension.

19 / 110

Projection-Pursuit Regression

Let's revisit the data from our previous unit and see how the PPR model performs.

This is a time series dataset of mortgage affordability as calculated and distributed by Zillow: https://www.zillow.com/research/data/.

The dataset contains affordability measurements for 76 counties with data from 1979 to 2017. Here we plot the time series of affordability for all counties.

20 / 110

We will try to predict affordability at the last time-point given in the dataset based on the time series up to one year previous to the last time point.

Projection-Pursuit Regression

21 / 110

Projection-Pursuit Regression

22 / 110

Projection-Pursuit Regression

So, how can we fit the PPR model?

As we have done previously in other regression settings, we start with a loss function to minimize

$$L(g, W) = \sum_{i=1}^{N} \left[ y_i - \sum_{m=1}^{M} g_m(w_m'x_i) \right]^2$$

Use an optimization method to minimize the error of the model.

For simplicity let's consider a model with $M = 1$ and drop the subscript $m$.

23 / 110

Projection-Pursuit Regression

Consider the following procedure

Initialize weight vector $w$ to some value $w_{old}$

Construct derived variable $v = w_{old}'x$

Use a non-linear regression method to fit function $g$ based on model $E[Y \mid V] = g(v)$. You can use additive splines or loess

24 / 110

Projection-Pursuit Regression

Given function $g$ now update weight vector $w_{old}$ using a gradient descent method

$$w = w_{old} + 2\alpha \sum_{i=1}^{N} (y_i - g(v_i))\, g'(v_i)\, x_i = w_{old} + 2\alpha \sum_{i=1}^{N} \tilde{r}_i x_i$$

where $\alpha$ is a learning rate.

25 / 110

Projection-Pursuit Regression

$$w = w_{old} + 2\alpha \sum_{i=1}^{N} (y_i - g(v_i))\, g'(v_i)\, x_i = w_{old} + 2\alpha \sum_{i=1}^{N} \tilde{r}_i x_i$$

In the second line we rewrite the gradient in terms of the residual $r_i$ of the current model $g(v_i)$ (using the derived feature $v$) weighted by, what we could think of as, the sensitivity of the model to changes in derived feature $v_i$.

26 / 110

Projection-Pursuit Regression

Given an updated weight vector $w$ we can then fit $g$ again and continue iterating until a stop condition is reached.

27 / 110
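Here is a rough R sketch of the alternating procedure just described for the $M = 1$ case: fit $g$ with a smoother given the current projection, then take a gradient step on $w$. The use of smooth.spline, the learning rate, the iteration count and the re-normalization of $w$ are illustrative choices, not the course's reference implementation.

# Sketch of the M = 1 PPR fitting loop described above (illustrative only)
fit_ppr_single <- function(x, y, alpha = 0.01, n_iter = 50) {
  p <- ncol(x)
  w <- rep(1 / sqrt(p), p)                 # initialize weight vector w_old
  fit <- NULL
  for (it in seq_len(n_iter)) {
    v   <- as.vector(x %*% w)              # derived variable v_i = w' x_i
    fit <- smooth.spline(v, y)             # non-linear fit g: E[Y | V] = g(v)
    g_v <- predict(fit, v)$y               # g(v_i)
    g_p <- predict(fit, v, deriv = 1)$y    # g'(v_i), sensitivity to the derived feature
    r   <- (y - g_v) * g_p                 # weighted residuals ~r_i
    w   <- w + 2 * alpha * colSums(r * x)  # gradient step: w <- w_old + 2*alpha*sum_i ~r_i x_i
    w   <- w / sqrt(sum(w^2))              # keep w on unit scale (an illustrative convention)
  }
  list(w = w, g = fit)
}

# Toy usage on simulated data
set.seed(1)
x <- matrix(rnorm(300 * 2), ncol = 2)
y <- drop(sin(x %*% c(0.8, 0.6))) + rnorm(300, sd = 0.1)
fit_ppr_single(x, y)$w                     # should point roughly along (0.8, 0.6)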

Projection-Pursuit Regression

Let's consider the PPR and this fitting technique a bit more in detail with a few observations

We can think of the PPR model as composing three functions:

the linear projection $w'x$, the result of non-linear function $g$ and, in the case when $M > 1$, the linear combination of the $g_m$ functions.

28 / 110

Projection-Pursuit Regression

To tie this to the formulation usually described in the neural network literature we make one slight change to our understanding of derived feature.

Consider the case $M > 1$, the final predictor is a linear combination $\sum_{m=1}^{M} g_m(v_m)$.

We could also think of each term $g_m(v_m)$ as providing a non-linear dimensionality reduction to a single derived feature.

29 / 110

Projection-Pursuit Regression

This interpretation is closer to that used in the neural network literature: at each stage of the composition we apply a non-linear transform to the data of the type $g(w'x)$.

30 / 110

Projection-Pursuit Regression

The fitting procedure propagates errors (residuals) down this function composition in a stage-wise manner.

31 / 110

Feed-forward Neural Networks

We can now write the general formulation for a feed-forward neural network.

We will present the formulation for a general case where we are modeling $K$ outcomes $Y_1, \ldots, Y_K$ as $f_1(X), \ldots, f_K(X)$.

32 / 110

Feed-forward Neural Networks

In multi-class classification, categorical outcome may take multiple values

We consider $Y_k$ as a discriminant function for class $k$.

Final classification is made using $\arg\max_k Y_k$. For regression, we can take $K = 1$.

33 / 110

Feed-forward Neural Networks

A single layer feed-forward neural network is defined as

$$h_m = g_h(w_{1m}'X), \quad m = 1, \ldots, M$$
$$f_k = g_{fk}(w_{2k}'h), \quad k = 1, \ldots, K$$

34 / 110

The network is organized into input, hidden and output layers.

Feed-forward Neural Networks

35 / 110

Feed-forward Neural Networks

Units $h_m$ represent a hidden layer, which we can interpret as a derived non-linear representation of the input data as we saw before.

36 / 110

Feed-forward Neural Networks

Function $g_h$ is an activation function used to introduce non-linearity to the representation.

37 / 110

Feed-forward Neural Networks

Historically, the sigmoid activation function was commonly used

$$g_h(v) = \frac{1}{1 + e^{-v}}$$

or the hyperbolic tangent.

38 / 110

Feed-forward Neural Networks

Nowadays, a rectified linear unit (ReLU)

$$g_h(v) = \max\{0, v\}$$

is used more frequently in practice. (there are many extensions)

39 / 110
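For reference, the activation functions mentioned on the last two slides are one-liners in R:

# Common choices for the activation function g_h
sigmoid <- function(v) 1 / (1 + exp(-v))   # historical choice: g_h(v) = 1 / (1 + e^(-v))
relu    <- function(v) pmax(0, v)          # rectified linear unit: g_h(v) = max{0, v}
# (the hyperbolic tangent is already available as tanh())

curve(sigmoid, -4, 4)                      # compare their shapes
curve(relu, -4, 4, add = TRUE, lty = 2)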

Feed-forward Neural Networks

Function $g_f$ used in the output layer depends on the outcome modeled.

For classification a soft-max function can be used

$$g_{fk}(t_k) = \frac{e^{t_k}}{\sum_{l=1}^{K} e^{t_l}}$$

where $t_k = w_{2k}'h$.

For regression, we may take $g_{fk}$ to be the identity function.

40 / 110
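Putting the last few slides together, here is a minimal R sketch of the forward pass $h_m = g_h(w_{1m}'x)$, $f = \text{softmax}(W_2 h)$. The weight matrices are random placeholders and bias terms are omitted, as in the formulas above.

# Forward pass of a single layer feed-forward network (illustrative weights, no biases)
softmax <- function(t) { e <- exp(t - max(t)); e / sum(e) }   # subtract max for numerical stability

forward <- function(x, W1, W2) {
  h <- pmax(0, W1 %*% x)          # hidden layer: h = g_h(W1 x) with ReLU activation
  t <- as.vector(W2 %*% h)        # output scores t_k = w_2k' h
  softmax(t)                      # class probabilities f_k
}

set.seed(1)
p <- 4; M <- 8; K <- 3            # input, hidden and output dimensions
W1 <- matrix(rnorm(M * p, sd = 0.1), M, p)
W2 <- matrix(rnorm(K * M, sd = 0.1), K, M)
forward(rnorm(p), W1, W2)         # K probabilities summing to 1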

Feed-forward Neural Networks

The single-layer feed-forward neural network has the same parameterization as the PPR model.

Activation functions $g_h$ are much simpler, as opposed to, e.g., smoothing splines as used in PPR.

41 / 110

Feed-forward Neural Networks

A classic result of the Neural Network literature is the universal function representational ability of the single-layer feed-forward neural network with ReLU activation functions (Leshno et al. 1993).

42 / 110

Feed-forward Neural Networks

A classic result of the Neural Network literature is the universal function representational ability of the single-layer feed-forward neural network with ReLU activation functions (Leshno et al. 1993).

However, the number of units in the hidden layer may be exponentially large to approximate arbitrary functions.

43 / 110

Feed-forward Neural Networks

Empirically, a single-layer feed-forward neural network has similar performance to kernel-based methods like SVMs.

This is not usually the case once more than a single layer is used in a neural network.

44 / 110

Fitting with back propagation

In modern neural network literature, the graphical representation of neural nets we saw above has been extended to computational graphs.

45 / 110

Fitting with back propagation

In modern neural network literature, the graphical representation of neural nets we saw above has been extended to computational graphs.

Especially useful to guide the design of general-use programming libraries for the specification of neural nets.

46 / 110

Fitting with back propagation

In modern neural network literature, the graphical representation of neural nets we saw above has been extended to computational graphs.

Especially useful to guide the design of general-use programming libraries for the specification of neural nets.

They have the advantage of explicitly representing all operations used in a neural network, which then permits easier specification of gradient-based algorithms.

47 / 110

Fitting with back propagation

48 / 110

Fitting with back propagation

Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks.

49 / 110

Fitting with back propagation

Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks.

These methods require that gradients are computed based on model error.

50 / 110

Fitting with back propagation

Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks.

These methods require that gradients are computed based on model error.

The layer-wise propagation of error is at the core of these gradient computations.

51 / 110

Fitting with back propagation

Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks.

These methods require that gradients are computed based on model error.

The layer-wise propagation of error is at the core of these gradient computations.

This is called back-propagation.

52 / 110

Fitting with back propagation

Assume we have a current estimate of model parameters, and we are processing one observation $x$ (in practice a small batch of observations is used).

53 / 110

Fitting with back propagation

First, to perform back propagation we must compute the error of the model on observation $x$ given the current set of parameters.

To do this we compute all activation functions along the computation graph from the bottom up.

54 / 110

Fitting with back propagation

Once we have computed output $\hat{y}$, we can compute error (or, generally, cost) $J(y, \hat{y})$.

Once we do this we can walk back through the computation graph to obtain gradients of cost $J$ with respect to any of the model parameters applying the chain rule.

55 / 110

Fitting with back propagation

We will continuously update a gradient vector $\nabla$.

First, we set

$$\nabla \leftarrow \nabla_{\hat{y}} J$$

56 / 110

Fitting with back propagation

Next, we need the gradient $\nabla_t J$.

We apply the chain rule to obtain

$$\nabla_t J = \nabla \odot f'(t)$$

$f'$ is the derivative of the softmax function

$\odot$ is element-wise multiplication.

Set $\nabla \leftarrow \nabla_t J$.

57 / 110

Fitting with back propagation

Next, we want to compute $\nabla_{W_k} J$.

We can do so using the gradient we just computed since $\nabla_{W_k} J = \nabla_t J \, \nabla_{W_k} t$.

In this case, we get $\nabla_{W_k} J = \nabla\, h'$.

58 / 110

Fitting with back propagation

At this point we have computed gradients for the weight matrix $W_k$ from the hidden layer to the output layer, which we can use to update those parameters as part of stochastic gradient descent.

59 / 110

Fitting with back propagation

Once we have computed gradients for weights connecting the hidden and output layers, we can compute gradients for weights connecting the input and hidden layers.

60 / 110

Fitting with back propagation

We require $\nabla_h J$, which we can compute as $W_k' \nabla$ since $\nabla$ currently has value $\nabla_t J$.

At this point we can set $\nabla \leftarrow \nabla_h J$.

61 / 110

Fitting with back propagation

Finally, we set $\nabla \leftarrow \nabla_z J = \nabla \cdot g'(z)$ where $g'$ is the derivative of the ReLU activation function.

This gives us $\nabla_{W_h} J = \nabla\, x'$.

62 / 110

Fitting with back propagation

At this point we have propagated the gradient of cost function $J$ to all parameters of the model.

We can thus update the model for the next step of stochastic gradient descent.

63 / 110
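The walk through the computation graph just described can be written out directly for the single-hidden-layer network. This sketch assumes a softmax output with cross-entropy cost, for which the gradient with respect to the output scores simplifies to yhat - y; everything else follows the steps above.

# One backpropagation / SGD step for the single hidden layer network
# (x: input vector, y: one-hot outcome vector, W1 / W2 as in the forward pass sketch above)
backprop_step <- function(x, y, W1, W2, alpha = 0.1) {
  # forward pass, keeping intermediate values
  z    <- as.vector(W1 %*% x)                   # hidden pre-activations
  h    <- pmax(0, z)                            # ReLU hidden units
  t    <- as.vector(W2 %*% h)                   # output scores
  yhat <- { e <- exp(t - max(t)); e / sum(e) }  # softmax output

  # backward pass: propagate the gradient of the cost J through the graph
  grad_t  <- yhat - y                     # gradient wrt scores t (softmax + cross-entropy)
  grad_W2 <- grad_t %o% h                 # gradient wrt hidden-to-output weights: grad_t h'
  grad_h  <- as.vector(t(W2) %*% grad_t)  # propagate to hidden layer: W2' grad_t
  grad_z  <- grad_h * (z > 0)             # multiply by the ReLU derivative g'(z)
  grad_W1 <- grad_z %o% x                 # gradient wrt input-to-hidden weights: grad_z x'

  # stochastic gradient descent update
  list(W1 = W1 - alpha * grad_W1, W2 = W2 - alpha * grad_W2)
}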

Practical Issues

Stochastic gradient descent (SGD) based on the back-propagation algorithm as shown above introduces some complications.

64 / 110

Scaling

The scale of inputs $x$ effectively determines the scale of weight matrices $W$.

Scale can have a large effect on how well SGD behaves.

In practice, all inputs are usually standardized to have zero mean and unit variance before application of SGD.

65 / 110

Initialization

With properly scaled inputs, initialization of weights can be done in a somewhat reasonable manner

Randomly choose initial weights in $[-0.7, 0.7]$.

66 / 110
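In R, the standardization and the simple random initialization just described might look like the following; x_train and the sizes M and K are placeholders for your own data and architecture choices.

# Standardize inputs to zero mean and unit variance before running SGD
x_scaled <- scale(x_train)                        # x_train: numeric matrix of predictors

# Randomly choose initial weights in [-0.7, 0.7]
init_weights <- function(n_out, n_in) {
  matrix(runif(n_out * n_in, min = -0.7, max = 0.7), n_out, n_in)
}
W1 <- init_weights(M, ncol(x_scaled))             # input-to-hidden weights
W2 <- init_weights(K, M)                          # hidden-to-output weights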

Overfitting

As with other highly-flexible models we have seen previously, feed-forward neural nets are prone to overfit data.

67 / 110

Overfitting

As with other highly-flexible models we have seen previously, feed-forward neural nets are prone to overfit data.

We can incorporate penalty terms to control model complexity to some degree.

68 / 110
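One common penalty is weight decay: an L2 term lambda * sum(W^2) added to the cost, which contributes 2 * lambda * W to each gradient. A minimal sketch of the modified SGD update:

# SGD update with an L2 (weight decay) penalty on the weights
sgd_update <- function(W, grad_W, alpha = 0.1, lambda = 1e-3) {
  W - alpha * (grad_W + 2 * lambda * W)   # extra term shrinks weights toward zero each step
}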

Architecture Design

A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer.

69 / 110

Architecture Design

A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer.

We saw above that a wide enough hidden layer is capable of perfectly fitting data.

70 / 110

Architecture Design

A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer.

We saw above that a wide enough hidden layer is capable of perfectly fitting data.

We will also see later that in many cases making the neural network deeper instead of wider performs better.

71 / 110

Architecture Design

A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer.

We saw above that a wide enough hidden layer is capable of perfectly fitting data.

We will also see later that in many cases making the neural network deeper instead of wider performs better.

In this case, models may have significantly fewer parameters, but tend to be much harder to fit.

72 / 110

Architecture Design

Ideal network architectures are task dependent

Require much experimentation

Judicious use of cross-validation methods to measure expected prediction error to guide architecture choice.

73 / 110

Multiple Minima

As opposed to other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem.

74 / 110

Multiple Minima

As opposed to other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem.

This will lead to the problem of multiple local minima, from which methods like SGD can suffer.

75 / 110

Multiple Minima

As opposed to other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem.

This will lead to the problem of multiple local minima, from which methods like SGD can suffer.

We will see later in detail a variety of approaches used to address this problem.

76 / 110

Multiple Minima

As opposed to other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem.

This will lead to the problem of multiple local minima, from which methods like SGD can suffer.

We will see later in detail a variety of approaches used to address this problem.

Here, we present a few rules of thumb to follow.

77 / 110

Multiple Minima

The local minima a method like SGD may yield depend on the initial parameter values chosen.

78 / 110

Multiple Minima

The local minima a method like SGD may yield depend on the initial parameter values chosen.

One idea is to train multiple models using different initial values and make predictions using the model that gives best expected prediction error.

79 / 110

Multiple Minima

The local minima a method like SGD may yield depend on the initial parameter values chosen.

One idea is to train multiple models using different initial values and make predictions using the model that gives best expected prediction error.

A related idea is to average the predictions of these multiple models.

80 / 110

Multiple Minima

The local minima a method like SGD may yield depend on the initial parameter values chosen.

One idea is to train multiple models using different initial values and make predictions using the model that gives best expected prediction error.

A related idea is to average the predictions of these multiple models.

Finally, we can use bagging as described in a previous session to create an ensemble of neural networks to circumvent the local minima problem.

81 / 110
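A small sketch of the first two ideas: train with several random initializations, then either keep the model with the best validation error or average the predictions. train_network and predict_network are hypothetical stand-ins for whatever fitting and prediction routines are used.

# Multiple restarts from different initial values (hypothetical helper functions)
restarts <- lapply(1:5, function(seed) {
  set.seed(seed)                                  # different seed => different initial weights
  train_network(x_train, y_train)
})
val_err <- sapply(restarts, function(fit)
  mean((y_val - predict_network(fit, x_val))^2))  # expected prediction error on validation data

best_fit <- restarts[[which.min(val_err)]]        # idea 1: keep the best model

preds    <- sapply(restarts, function(fit) predict_network(fit, x_val))
avg_pred <- rowMeans(preds)                       # idea 2: average the predictions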

Summary

Neural networks are representationally powerful prediction models.

They can be difficult to optimize properly due to the non-convexity of the resulting optimization problem.

Deciding on network architecture is a significant challenge. We'll see later that recent proposals use deep, but thinner, networks effectively. Even in this case, choice of model depth is difficult.

There is tremendous excitement over the recent excellent performance of deep neural networks in many applications.

82 / 110

Deep Feed-Forward Neural Networks

The general form of feed-forward network can be extended by adding additional hidden layers.

83 / 110

Deep Feed-Forward Neural Networks

The same principles we saw before:

We arrange computation using a computing graph

Use Stochastic Gradient Descent

Use Backpropagation for gradient calculation along the computation graph.

84 / 110
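With the R keras interface (used for the example later in these slides), a deeper feed-forward network trained with SGD can be specified along these lines; the number of layers, layer sizes and the regression loss are arbitrary illustrative choices, and p stands for the number of predictors.

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = p) %>%  # first hidden layer
  layer_dense(units = 32, activation = "relu") %>%                   # additional (thinner) hidden layers
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1)                                             # single regression output

model %>% compile(loss = "mse", optimizer = "sgd")                   # fit by stochastic gradient descent
# model %>% fit(x_train, y_train, epochs = 50, batch_size = 32)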

Deep Feed-Forward Neural Networks

Empirically, it is found that by using more, thinner, layers, better expected prediction error is obtained.

However, each layer introduces more non-linearity into the network, making optimization markedly more difficult.

85 / 110

Deep Feed-Forward Neural Networks

We may interpret hidden layers as progressive derived representations of the input data.

Since we train based on a loss function, these derived representations should make modeling the outcome of interest progressively easier.

86 / 110

Deep Feed-Forward Neural Networks

In many applications, these derived representations are used for model interpretation.

87 / 110

Deep Feed-Forward Neural Networks

Advanced parallel computation systems and methods are used in order to train these deep networks, with billions of connections.

88 / 110

Deep Feed-Forward Neural Networks

Advanced parallel computation systems and methods are used in order to train these deep networks, with billions of connections.

The applications we discussed previously build this type of massive deep network.

89 / 110

Deep Feed-Forward Neural Networks

Advanced parallel computation systems and methods are used in order to train these deep networks, with billions of connections.

The applications we discussed previously build this type of massive deep network.

They also require massive amounts of data to train.

90 / 110

Deep Feed-Forward Neural Networks

Advanced parallel computation systems and methods are used in order to train these deep networks, with billions of connections.

The applications we discussed previously build this type of massive deep network.

They also require massive amounts of data to train.

However, this approach can still be applicable to moderate data sizes with careful network design, regularization and training.

91 / 110

Supervised Pre-training

A clever idea for training deep networks.

Train each layer successively on the outcome of interest.

Use the resulting weights as initial weights for a network with one additional layer.

92 / 110

Supervised Pre-training

Train the first layer as a single layer feed-forward network.

Weights initialized as standard practice.

This fits $W_h^1$.

93 / 110

Supervised Pre-training

Now train a two layer network.

Weights $W_h^1$ are initialized to the result of the previous fit.

94 / 110

Supervised Pre-training

This procedure continues until all layers are trained.

The hypothesis is that training each layer on the outcome of interest moves the weights to parts of parameter space that lead to good performance.

Minimizing updates can ameliorate the long-term dependency problem.

95 / 110
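A rough sketch of this scheme with the R keras interface: fit a one-hidden-layer network, then build a two-layer network and copy the fitted first-layer weights in before training. Layer sizes, epochs and the regression setup are placeholders, and x_train / y_train stand for your training data.

library(keras)

# Stage 1: single hidden layer, trained on the outcome of interest
net1 <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = p, name = "h1") %>%
  layer_dense(units = 1, name = "out")
net1 %>% compile(loss = "mse", optimizer = "sgd")
net1 %>% fit(x_train, y_train, epochs = 20, verbose = 0)

# Stage 2: add one more hidden layer; initialize the first layer from stage 1
net2 <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = p, name = "h1") %>%
  layer_dense(units = 16, activation = "relu", name = "h2") %>%
  layer_dense(units = 1, name = "out")
set_weights(get_layer(net2, "h1"), get_weights(get_layer(net1, "h1")))
net2 %>% compile(loss = "mse", optimizer = "sgd")
net2 %>% fit(x_train, y_train, epochs = 20, verbose = 0)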

Supervised Pre-training

This is one strategy; others are popular and effective

Train each layer as a single layer network using the hidden layer of the previous layer as inputs to the model.

In this case, no long-term dependencies occur at all.

Performance may suffer.

96 / 110

Supervised Pre-training

This is one strategy; others are popular and effective

Train each layer as a single layer on the hidden layer of the previous layer, but also add the original input data as input to every layer of the network.

No long-term dependency

Performance improves

Number of parameters increases.

97 / 110

Parameter Sharing

Another method for reducing the number of parameters in a deep learning model.

When predictors $X$ exhibit some internal structure, parts of the model can then share parameters.

98 / 110

Parameter Sharing

Two important applications use this idea:

Image processing: local structure of nearby pixels

Sequence modeling: structure given by sequence

The latter includes modeling of time series data.

99 / 110

Convolutional Networks are used in imaging applications.

Input is pixel data.

Parameters are shared across nearby parts of the image.

Parameter Sharing

100 / 110
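As a small illustration of this kind of parameter sharing, a single convolutional layer in R keras applies the same small set of filter weights at every image location; the image size and layer sizes below are arbitrary.

library(keras)

cnn <- keras_model_sequential() %>%
  layer_conv_2d(filters = 16, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%   # e.g. 28 x 28 grayscale images
  layer_flatten() %>%
  layer_dense(units = 10, activation = "softmax")

# Each 3x3 filter has 9 weights (plus a bias) shared across all image positions,
# so the convolutional layer has only 16 * (9 + 1) = 160 parameters.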

Recurrent Networks are used in sequence modeling applications.

For instance, time series and forecasting.

Parameters are shared across a time lag.

Recurrent Networks

101 / 110

Recurrent Networks

The long short-term memory (LSTM) model is very popular in time series analysis.

102 / 110

Example

Learn to add: "55+22=77"

https://keras.rstudio.com/articles/examples/addition_rnn.html

103 / 110

Example

Learn to add: "55+22=77"

https://keras.rstudio.com/articles/examples/addition_rnn.html

Addition encoded as sequence of one-hot vectors:

## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

## 5 0 0 0 0 0 0 0 1 0 0

## 5 0 0 0 0 0 0 0 1 0 0

## + 0 1 0 0 0 0 0 0 0 0

## 2 0 0 0 0 1 0 0 0 0 0

## 2 0 0 0 0 1 0 0 0 0 0

## [,11] [,12]

## 5 0 0

104 / 110

Example

Learn to add: "55+22=77"

https://keras.rstudio.com/articles/examples/addition_rnn.html

Result encoded as sequence of one-hot vectors

## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

## 7 0 0 0 0 0 0 0 0 0 1

## 7 0 0 0 0 0 0 0 0 0 1

## [,11] [,12]

## 7 0 0

## 7 0 0

105 / 110

Example

Learn to add: "55+22=77"

https://keras.rstudio.com/articles/examples/addition_rnn.html

Result encoded as sequence of one-hot vectors

## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

## 7 0 0 0 0 0 0 0 0 0 1

## 7 0 0 0 0 0 0 0 0 0 1

## [,11] [,12]

## 7 0 0

## 7 0 0

This is a sequence-to-sequence model. Perfect application for an LSTM.

106 / 110

Example

## ______________________________________________________

## Layer (type) Output Shape Param #

## ======================================================

## lstm_10 (LSTM) (None, 128) 72192

## ______________________________________________________

## repeat_vector_5 (Repeat (None, 3, 128) 0

## ______________________________________________________

## lstm_11 (LSTM) (None, 3, 128) 131584

## ______________________________________________________

## time_distributed_5 (Tim (None, 3, 12) 1548

## ______________________________________________________

## activation_5 (Activatio (None, 3, 12) 0

107 / 110
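The model summarized above can be written with the R keras interface roughly as follows (after the linked addition_rnn example). The 12 is the number of possible characters ("0"-"9", "+", and space), the input length 5 corresponds to the "nn+nn" strings in the one-hot example, and the repeated vector length 3 matches the output sequence; these sizes are inferred from the slides, so treat them as assumptions.

library(keras)

model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(5, 12)) %>%   # encode the "nn+nn" input sequence
  layer_repeat_vector(3) %>%                            # repeat the summary once per output character
  layer_lstm(units = 128, return_sequences = TRUE) %>%  # decode into an output sequence
  time_distributed(layer_dense(units = 12)) %>%         # a score for each character at each position
  layer_activation("softmax")

model %>% compile(loss = "categorical_crossentropy", optimizer = "adam",
                  metrics = "accuracy")
summary(model)    # layer structure and parameter counts as shown above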

Summary

Deep Learning is riding a big wave of popularity.

State-of-the-art results in many applications.

Best results in applications with massive amounts of data.

However, newer methods allow use in other situations.

108 / 110

Summary

Many of the recent advances stem from computational and technical approaches to modeling.

Keeping track of these advances is hard, and many of them are ad hoc.

Not straightforward to determine a priori how these technical advances may help in a specific application.

Require significant amount of experimentation.

109 / 110

Summary

The interpretation of hidden units as representations can lead to insight.

There is current research on interpreting these to support some notion of statistical inference.

Excellent textbook: http://deeplearningbook.org

110 / 110

Introduction to Data Science: Neural Networks and Deep Learning

Héctor Corrada Bravo

University of Maryland, College Park, USA

CMSC 320: 2020-05-10

Page 2: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

Neural networks are a decades oldarea of study.

Initially, these computational modelswere created with the goal ofmimicking the processing ofneuronal networks.

Historical Overview

1 / 110

Page 3: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

Inspiration: model neuron asprocessing unit.

Some of the mathematical functionshistorically used in neural networkmodels arise from biologicallyplausible activation functions.

Historical Overview

2 / 110

Page 4: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

Somewhat limited success inmodeling neuronal processing

Neural network models gainedtraction as general MachineLearning models.

Historical Overview

3 / 110

Page 5: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

Historical OverviewStrong results about the ability of these models to approximate arbitraryfunctions

Became the subject of intense study in ML.

In practice, effective training of these models was both technically andcomputationally difficult.

4 / 110

Page 6: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

Starting from 2005, technicaladvances have led to a resurgenceof interest in neural networks,specifically in Deep NeuralNetworks.

Historical Overview

5 / 110

Page 7: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

Deep LearningAdvances in computational processing:

powerful parallel processing given by Graphical Processing Units

6 / 110

Page 8: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

Deep LearningAdvances in computational processing:

powerful parallel processing given by Graphical Processing Units

Advances in neural network architecture design and networkoptimization

7 / 110

Page 9: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

Deep LearningAdvances in computational processing:

powerful parallel processing given by Graphical Processing Units

Advances in neural network architecture design and networkoptimization

Researchers apply Deep Neural Networks successfully in a number ofapplications.

8 / 110

Page 10: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

Self driving cars make use of DeepLearning models for sensorprocessing.

Deep Learning

9 / 110

Page 11: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

Image recognition software usesDeep Learning to identify individualswithin photos.

Deep Learning

10 / 110

Page 12: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

Deep Learning models have beenapplied to medical imaging to yieldexpert-level prognosis.

Deep Learning

11 / 110

Page 13: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

An automated Go player, makingheavy use of Deep Learning, iscapable of beating the best humanGo players in the world.

Deep Learning

12 / 110

Page 14: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

Neural Networks and Deep LearningIn this unit we study neural networks and recent advances in DeepLearning.

13 / 110

Page 15: Networks and Deep Learning Introduction to Data Science ... · In this unit we study neural networks and recent advances in Deep Learning. 13 / 1 Projection-PursuitRegressi n To motivate

Projection-Pursuit Regression
To motivate our discussion of Deep Neural Networks, let's turn to a simple but very powerful class of models.

As per the usual regression setting, suppose

given predictors (attributes) $X_1, \ldots, X_p$ for an observation,

we want to predict a continuous outcome $Y$.

14 / 110


Projection-Pursuit Regression
The Projection-Pursuit Regression (PPR) model predicts outcome $Y$ using function $f(X)$ as

$$f(X) = \sum_{m=1}^{M} g_m(w_m'X)$$

where:

$w_m$ is a $p$-dimensional weight vector, so $w_m'X = \sum_{j=1}^{p} w_{mj}X_j$ is a linear combination of the predictors $X_j$

and $g_m$, $m = 1, \ldots, M$, are univariate non-linear functions (a smoothing spline, for example)

15 / 110
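As a concrete illustration (not from the original slides), here is a minimal R sketch of evaluating a PPR prediction with $M = 2$ terms; the weight vectors and ridge functions below are made-up choices.

# A toy PPR prediction with p = 3 predictors and M = 2 terms; the weights
# w1, w2 and the ridge functions g1, g2 are illustrative only.
x  <- c(1.0, -0.5, 2.0)                     # one observation
w1 <- c(0.5, 0.5, 0.0)                      # first projection direction
w2 <- c(0.0, 1.0, -1.0)                     # second projection direction
g1 <- function(v) v^2                       # non-linear ridge functions
g2 <- function(v) tanh(v)
f_hat <- g1(sum(w1 * x)) + g2(sum(w2 * x))  # f(x) = g1(w1'x) + g2(w2'x)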


Projection-Pursuit Regression
Our prediction function is a linear function (with $M$ terms).

Each term $g_m(w_m'X)$ is the result of applying a non-linear function to, what we can think of as, a derived feature (or derived predictor) $V_m = w_m'X$.

16 / 110


Projection-Pursuit Regression
Here's another intuition. Recall the Principal Component Analysis problem we saw in the previous unit.

Given:

Data set $\{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the vector of $p$ variable values for the $i$-th observation.

Return:

Matrix of linear transformations $[\phi_1, \phi_2, \ldots, \phi_p]$ that retain maximal variance.

17 / 110


Projection-Pursuit Regression
Matrix of linear transformations $[\phi_1, \phi_2, \ldots, \phi_p]$

You can think of the first vector $\phi_1$ as a linear transformation that embeds observations into 1 dimension:

$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p$$

where $\phi_1$ is selected so that the resulting dataset $\{z_1, \ldots, z_n\}$ has maximum variance.

18 / 110
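For reference, a minimal base R sketch of this one-dimensional embedding (not from the slides; the data matrix X is assumed to exist):

# X is an n x p data matrix; phi1 is the first loading vector and z1 the
# one-dimensional embedding with maximal variance.
pc   <- prcomp(X, center = TRUE, scale. = TRUE)
phi1 <- pc$rotation[, 1]
z1   <- scale(X) %*% phi1    # same values as the first PC scores pc$x[, 1]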


Projection-Pursuit Regression

$$f(X) = \sum_{m=1}^{M} g_m(w_m'X)$$

In PPR we are reducing the dimensionality of $X$ from $p$ to $M$ using linear projections,

and building a regression function over the representation with reduced dimension.

19 / 110


Projection-Pursuit Regression
Let's revisit the data from our previous unit and see how the PPR model performs.

This is a time series dataset of mortgage affordability as calculated and distributed by Zillow: https://www.zillow.com/research/data/.

The dataset contains affordability measurements for 76 counties with data from 1979 to 2017. Here we plot the time series of affordability for all counties.

20 / 110


We will try to predict affordability at the last time point given in the dataset based on the time series up to one year previous to the last time point.

Projection-Pursuit Regression

21 / 110


Projection-Pursuit Regression

22 / 110
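The fit shown here is not reproducible from the extracted text, but as a hedged sketch, base R's stats::ppr() fits exactly this kind of model; the data frame name and column layout below are assumptions.

# Assumed layout: afford_df holds the lagged affordability series in columns
# t_1, ..., t_k and affordability at the final time point in column y.
fit  <- ppr(y ~ ., data = afford_df, nterms = 3)   # M = 3 ridge terms
summary(fit)                                       # projection directions and terms
pred <- predict(fit, newdata = afford_df)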


Projection-Pursuit Regression
So, how can we fit the PPR model?

As we have done previously in other regression settings, we start with a loss function to minimize

$$L(g, W) = \sum_{i=1}^{N} \left[ y_i - \sum_{m=1}^{M} g_m(w_m'x_i) \right]^2$$

Use an optimization method to minimize the error of the model.

For simplicity let's consider a model with $M = 1$ and drop the subscript $m$.

23 / 110


Projection-Pursuit Regression
Consider the following procedure

Initialize weight vector $w$ to some value $w_{old}$

Construct derived variable $v = w_{old}'x$

Use a non-linear regression method to fit function $g$ based on model $E[Y|V] = g(v)$. You can use additive splines or loess

24 / 110


Projection-Pursuit Regression
Given function $g$ now update weight vector $w_{old}$ using a gradient descent method

$$\begin{aligned} w &= w_{old} + 2\alpha \sum_{i=1}^{N} (y_i - g(v_i))\, g'(v_i)\, x_i \\ &= w_{old} + 2\alpha \sum_{i=1}^{N} r_i x_i \end{aligned}$$

where $\alpha$ is a learning rate.

25 / 110


Projection-Pursuit Regression

$$\begin{aligned} w &= w_{old} + 2\alpha \sum_{i=1}^{N} (y_i - g(v_i))\, g'(v_i)\, x_i \\ &= w_{old} + 2\alpha \sum_{i=1}^{N} \tilde{r}_i x_i \end{aligned}$$

In the second line we rewrite the gradient in terms of the residual of the current model $g(v_i)$ (using the derived feature $v$), weighted by what we could think of as the sensitivity of the model to changes in derived feature $v_i$.

26 / 110


Projection-Pursuit Regression
Given an updated weight vector $w$ we can then fit $g$ again and continue iterating until a stop condition is reached.

27 / 110
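The alternating procedure just described can be sketched in a few lines of base R. This is an illustrative implementation under assumptions (loess as the smoother, a numerical derivative for g', a fixed learning rate and iteration count), not the exact code behind the slides.

# Single-term PPR fit: x is an n x p matrix, y a length-n response.
fit_ppr1 <- function(x, y, alpha = 0.01, iters = 25) {
  w <- rnorm(ncol(x), sd = 0.1)                     # initialize w to some value
  for (it in seq_len(iters)) {
    v    <- drop(x %*% w)                           # derived variable v_i = w'x_i
    gfit <- loess(y ~ v, control = loess.control(surface = "direct"))
    gv   <- predict(gfit, data.frame(v = v))        # g(v_i)
    eps  <- 1e-4
    gpv  <- (predict(gfit, data.frame(v = v + eps)) - gv) / eps  # numerical g'(v_i)
    r    <- (y - gv) * gpv                          # weighted residuals
    w    <- w + 2 * alpha * drop(t(x) %*% r)        # gradient step on w
  }
  list(w = w, g = gfit)
}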


Projection-Pursuit Regression
Let's consider the PPR and this fitting technique a bit more in detail with a few observations

We can think of the PPR model as composing three functions:

the linear projection $w'x$,
the result of non-linear function $g$ and,
in the case when $M > 1$, the linear combination of the $g_m$ functions.

28 / 110


Projection-Pursuit Regression
To tie this to the formulation usually described in the neural network literature we make one slight change to our understanding of the derived feature.

Consider the case $M > 1$: the final predictor is a linear combination $\sum_{m=1}^{M} g_m(v_m)$.

We could also think of each term $g_m(v_m)$ as providing a non-linear dimensionality reduction to a single derived feature.

29 / 110


Projection-Pursuit Regression
This interpretation is closer to that used in the neural network literature: at each stage of the composition we apply a non-linear transform of the type $g(w'x)$ to the data.

30 / 110


Projection-Pursuit Regression
The fitting procedure propagates errors (residuals) down this function composition in a stage-wise manner.

31 / 110


Feed-forward Neural Networks
We can now write the general formulation for a feed-forward neural network.

We will present the formulation for a general case where we are modeling $K$ outcomes $Y_1, \ldots, Y_K$ as $f_1(X), \ldots, f_K(X)$.

32 / 110


Feed-forward Neural Networks
In multi-class classification, categorical outcome may take multiple values

We consider $Y_k$ as a discriminant function for class $k$,

Final classification is made using $\arg\max_k Y_k$. For regression, we can take $K = 1$.

33 / 110


Feed-forward Neural Networks
A single-layer feed-forward neural network is defined as

$$h_m = g_h(w_{1m}'X), \quad m = 1, \ldots, M$$
$$f_k = g_{f_k}(w_{2k}'h), \quad k = 1, \ldots, K$$

34 / 110
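A minimal R sketch of this forward pass (illustrative; the dimensions are assumptions and the activation functions are left to the caller, as specified on the following slides):

# x is a p-vector, W1 is M x p, W2 is K x M; g_h and g_f are activations.
forward <- function(x, W1, W2, g_h, g_f) {
  h <- g_h(W1 %*% x)     # hidden units h_m = g_h(w'_1m x)
  drop(g_f(W2 %*% h))    # outputs f_k = g_f(w'_2k h)
}
# e.g. forward(x, W1, W2, g_h = tanh, g_f = identity)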


The network is organized into input, hidden and output layers.

Feed-forward Neural Networks

35 / 110


Feed-forward Neural Networks
Units $h_m$ represent a hidden layer, which we can interpret as a derived non-linear representation of the input data as we saw before.

36 / 110


Feed-forward Neural Networks
Function $g_h$ is an activation function used to introduce non-linearity to the representation.

37 / 110


Feed-forward Neural Networks

Historically, the sigmoid activation function

$$g_h(v) = \frac{1}{1 + e^{-v}}$$

or the hyperbolic tangent was commonly used.

38 / 110


Feed-forward Neural Networks

Nowadays, a rectified linear unit (ReLU)

$$g_h(v) = \max\{0, v\}$$

is used more frequently in practice (there are many extensions).

39 / 110
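In R these activation functions are one-liners (a small illustrative sketch, not code from the slides):

sigmoid <- function(v) 1 / (1 + exp(-v))   # g_h(v) = 1 / (1 + e^(-v))
relu    <- function(v) pmax(0, v)          # g_h(v) = max{0, v}, element-wise
# the hyperbolic tangent is available as the built-in tanh()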


Feed-forward Neural Networks
Function $g_f$ used in the output layer depends on the outcome modeled.

For classification a soft-max function can be used

$$g_{f_k}(t_k) = \frac{e^{t_k}}{\sum_{l=1}^{K} e^{t_l}}$$

where $t_k = w_{2k}'h$.

For regression, we may take $g_{f_k}$ to be the identity function.

40 / 110
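The soft-max can likewise be sketched in one line of R (illustrative only; subtracting the maximum is a standard numerical-stability trick not mentioned on the slide):

softmax <- function(v) { e <- exp(v - max(v)); e / sum(e) }   # v is the vector of t_k values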


Feed-forward Neural Networks
The single-layer feed-forward neural network has the same parameterization as the PPR model,

Activation functions $g_h$ are much simpler, as opposed to, e.g., smoothing splines as used in PPR.

41 / 110


Feed-forward Neural Networks
A classic result of the Neural Network literature is the universal function representational ability of the single-layer feed-forward neural network with ReLU activation functions (Leshno et al. 1993).

42 / 110


Feed-forward Neural Networks
A classic result of the Neural Network literature is the universal function representational ability of the single-layer feed-forward neural network with ReLU activation functions (Leshno et al. 1993).

However, the number of units in the hidden layer may need to be exponentially large to approximate arbitrary functions.

43 / 110


Feed-forward Neural Networks
Empirically, a single-layer feed-forward neural network has similar performance to kernel-based methods like SVMs.

This is not usually the case once more than a single layer is used in a neural network.

44 / 110


Fitting with back propagation
In modern neural network literature, the graphical representation of neural nets we saw above has been extended to computational graphs.

45 / 110


Fitting with back propagation
In modern neural network literature, the graphical representation of neural nets we saw above has been extended to computational graphs.

Especially useful to guide the design of general-use programming libraries for the specification of neural nets.

46 / 110


Fitting with back propagation
In modern neural network literature, the graphical representation of neural nets we saw above has been extended to computational graphs.

Especially useful to guide the design of general-use programming libraries for the specification of neural nets.

They have the advantage of explicitly representing all operations used in a neural network, which then permits easier specification of gradient-based algorithms.

47 / 110


Fitting with back propagation

48 / 110


Fitting with back propagation
Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks.

49 / 110


Fitting with back propagation
Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks.

These methods require that gradients are computed based on model error.

50 / 110


Fitting with back propagation
Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks.

These methods require that gradients are computed based on model error.

The layer-wise propagation of error is at the core of these gradient computations.

51 / 110


Fitting with back propagation
Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks.

These methods require that gradients are computed based on model error.

The layer-wise propagation of error is at the core of these gradient computations.

This is called back-propagation.

52 / 110


Fitting with back propagation
Assume we have a current estimate of model parameters, and we are processing one observation $x$ (in practice a small batch of observations is used).

53 / 110


Fitting with back propagation
First, to perform back propagation we must compute the error of the model on observation $x$ given the current set of parameters.

To do this we compute all activation functions along the computation graph from the bottom up.

54 / 110


Fitting with back propagation
Once we have computed output $\hat{y}$, we can compute error (or, generally, cost) $J(y, \hat{y})$.

Once we do this we can walk back through the computation graph to obtain gradients of cost $J$ with respect to any of the model parameters by applying the chain rule.

55 / 110


Fitting with back propagation
We will continuously update a gradient vector $\nabla$.

First, we set

$$\nabla \leftarrow \nabla_{\hat{y}} J$$

56 / 110


Fitting with back propagation
Next, we need the gradient $\nabla_t J$.

We apply the chain rule to obtain

$$\nabla_t J = \nabla \odot f'(t)$$

where $f'$ is the derivative of the softmax function and $\odot$ is element-wise multiplication.

Set $\nabla \leftarrow \nabla_t J$.

57 / 110


Fitting with back propagation
Next, we want to compute $\nabla_{W_k} J$.

We can do so using the gradient we just computed since

$$\nabla_{W_k} J = \nabla_t J \, \nabla_{W_k} t.$$

In this case, we get $\nabla_{W_k} J = \nabla\, h$.

58 / 110


Fitting with back propagation
At this point we have computed gradients for the weight matrix $W_k$ from the hidden layer to the output layer, which we can use to update those parameters as part of stochastic gradient descent.

59 / 110


Fitting with back propagation
Once we have computed gradients for weights connecting the hidden and output layers, we can compute gradients for weights connecting the input and hidden layers.

60 / 110


Fitting with back propagation
We require $\nabla_h J$, which we can compute as $W_k'\nabla$ since $\nabla$ currently has value $\nabla_t J$.

At this point we can set $\nabla \leftarrow \nabla_h J$.

61 / 110


Fitting with back propagation
Finally, we set

$$\nabla \leftarrow \nabla_z J = \nabla \cdot g'(z)$$

where $g'$ is the derivative of the ReLU activation function.

This gives us $\nabla_{W_h} J = \nabla\, x$.

62 / 110


Fitting with back propagation
At this point we have propagated the gradient of cost function $J$ to all parameters of the model.

We can thus update the model for the next step of stochastic gradient descent.

63 / 110
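The gradient computations walked through above can be collected into a short R sketch for one observation. To keep it simple, this sketch assumes a ReLU hidden layer, an identity output and squared-error cost (the slides use a softmax output); the dimensions and learning rate are also assumptions.

# One SGD step of back-propagation: x is a p-vector, y a K-vector,
# W1 is M x p (input to hidden), W2 is K x M (hidden to output).
backprop_step <- function(x, y, W1, W2, lr = 0.01) {
  z    <- drop(W1 %*% x); h <- pmax(0, z)        # forward pass, bottom up
  yhat <- drop(W2 %*% h)                         # identity output layer
  grad <- yhat - y                               # gradient of the cost at the output
  gW2  <- grad %o% h                             # gradient for hidden-to-output weights
  grad <- drop(t(W2) %*% grad)                   # propagate back: gradient w.r.t. h
  grad <- grad * (z > 0)                         # through the ReLU: gradient w.r.t. z
  gW1  <- grad %o% x                             # gradient for input-to-hidden weights
  list(W1 = W1 - lr * gW1, W2 = W2 - lr * gW2)   # gradient descent update
}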


Practical Issues
Stochastic gradient descent (SGD) based on the back-propagation algorithm as shown above introduces some complications.

64 / 110


Scaling
The scale of inputs $x$ effectively determines the scale of weight matrices $W$.

Scale can have a large effect on how well SGD behaves.

In practice, all inputs are usually standardized to have zero mean and unit variance before application of SGD.

65 / 110


Initialization
With properly scaled inputs, initialization of weights can be done in a somewhat reasonable manner

Randomly choose initial weights in $[-0.7, 0.7]$.

66 / 110
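A short R sketch of these two practical steps (illustrative; the data matrix and hidden-layer size are assumptions):

X_std <- scale(X)                                          # zero mean, unit variance per column
M     <- 10                                                # assumed hidden-layer size
W1    <- matrix(runif(M * ncol(X), -0.7, 0.7), nrow = M)   # random weights in [-0.7, 0.7]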


Overfitting
As with other highly-flexible models we have seen previously, feed-forward neural nets are prone to overfit data.

67 / 110


Overfitting
As with other highly-flexible models we have seen previously, feed-forward neural nets are prone to overfit data.

We can incorporate penalty terms to control model complexity to some degree.

68 / 110


Architecture Design
A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer.

69 / 110


Architecture Design
A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer.

We saw above that a wide enough hidden layer is capable of perfectly fitting data.

70 / 110


Architecture Design
A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer.

We saw above that a wide enough hidden layer is capable of perfectly fitting data.

We will also see later that in many cases making the neural network deeper instead of wider performs better.

71 / 110


Architecture Design
A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer.

We saw above that a wide enough hidden layer is capable of perfectly fitting data.

We will also see later that in many cases making the neural network deeper instead of wider performs better.

In this case, models may have significantly fewer parameters, but tend to be much harder to fit.

72 / 110


Architecture Design
Ideal network architectures are task dependent

Require much experimentation

Judicious use of cross-validation methods to measure expected prediction error to guide architecture choice.

73 / 110


Multiple Minima
As opposed to other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem.

74 / 110


Multiple Minima
As opposed to other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem.

This will lead to the problem of multiple local minima, from which methods like SGD can suffer.

75 / 110


Multiple Minima
As opposed to other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem.

This will lead to the problem of multiple local minima, from which methods like SGD can suffer.

We will see later in detail a variety of approaches used to address this problem.

76 / 110


Multiple Minima
As opposed to other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem.

This will lead to the problem of multiple local minima, from which methods like SGD can suffer.

We will see later in detail a variety of approaches used to address this problem.

Here, we present a few rules of thumb to follow.

77 / 110


Multiple Minima
The local minima a method like SGD may yield depend on the initial parameter values chosen.

78 / 110


Multiple Minima
The local minima a method like SGD may yield depend on the initial parameter values chosen.

One idea is to train multiple models using different initial values and make predictions using the model that gives the best expected prediction error.

79 / 110


Multiple Minima
The local minima a method like SGD may yield depend on the initial parameter values chosen.

One idea is to train multiple models using different initial values and make predictions using the model that gives the best expected prediction error.

A related idea is to average the predictions of these multiple models.

80 / 110


Multiple Minima
The local minima a method like SGD may yield depend on the initial parameter values chosen.

One idea is to train multiple models using different initial values and make predictions using the model that gives the best expected prediction error.

A related idea is to average the predictions of these multiple models.

Finally, we can use bagging as described in a previous session to create an ensemble of neural networks to circumvent the local minima problem.

81 / 110


Summary
Neural networks are representationally powerful prediction models.

They can be difficult to optimize properly due to the non-convexity of the resulting optimization problem.

Deciding on network architecture is a significant challenge. We'll see later that recent proposals use deep but thinner networks effectively. Even in this case, choice of model depth is difficult.

There is tremendous excitement over the recent excellent performance of deep neural networks in many applications.

82 / 110


Deep Feed-Forward Neural Networks
The general form of a feed-forward network can be extended by adding additional hidden layers.

83 / 110


Deep Feed-Forward Neural Networks
The same principles we saw before apply:

We arrange computation using a computation graph

Use Stochastic Gradient Descent

Use Backpropagation for gradient calculation along the computation graph.

84 / 110


Deep Feed-Forward Neural Networks
Empirically, it is found that by using more, thinner, layers, better expected prediction error is obtained.

However, each layer introduces more non-linearity into the network, making optimization markedly more difficult.

85 / 110


Deep Feed-Forward Neural Networks
We may interpret hidden layers as progressive derived representations of the input data.

Since we train based on a loss function, these derived representations should make modeling the outcome of interest progressively easier.

86 / 110


Deep Feed-Forward Neural Networks
In many applications, these derived representations are used for model interpretation.

87 / 110


Deep Feed-Forward Neural Networks
Advanced parallel computation systems and methods are used in order to train these deep networks, with billions of connections.

88 / 110


Deep Feed-Forward Neural Networks
Advanced parallel computation systems and methods are used in order to train these deep networks, with billions of connections.

The applications we discussed previously build this type of massive deep network.

89 / 110


Deep Feed-Forward Neural Networks
Advanced parallel computation systems and methods are used in order to train these deep networks, with billions of connections.

The applications we discussed previously build this type of massive deep network.

They also require massive amounts of data to train.

90 / 110


Deep Feed-Forward Neural Networks
Advanced parallel computation systems and methods are used in order to train these deep networks, with billions of connections.

The applications we discussed previously build this type of massive deep network.

They also require massive amounts of data to train.

However, this approach can still be applicable to moderate data sizes with careful network design, regularization and training.

91 / 110


Supervised Pre-training
A clever idea for training deep networks.

Train each layer successively on the outcome of interest.

Use the resulting weights as initial weights for a network with one additional layer.

92 / 110


Supervised Pre-training
Train the first layer as a single-layer feed-forward network.

Weights are initialized as standard practice.

This fits $W_1$.

93 / 110


Supervised Pre-training
Now train a two-layer network.

Weights $W_1$ are initialized to the result of the previous fit.

94 / 110


Supervised Pre-training
This procedure continues until all layers are trained.

The hypothesis is that training each layer on the outcome of interest moves the weights to parts of parameter space that lead to good performance.

Minimizing updates can ameliorate the dependency problem.

95 / 110
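A hedged sketch of this layer-wise scheme using the R keras package; the data (x_train, y_train), the sizes p and K, and all hyper-parameters are assumptions, not values from the slides.

library(keras)

# Stage 1: single hidden layer trained on the outcome of interest.
model1 <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = p, name = "h1") %>%
  layer_dense(units = K, activation = "softmax")
model1 %>% compile(loss = "categorical_crossentropy", optimizer = "sgd")
model1 %>% fit(x_train, y_train, epochs = 10)

# Stage 2: add one more hidden layer; initialize "h1" from the previous fit.
model2 <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = p, name = "h1") %>%
  layer_dense(units = 64, activation = "relu", name = "h2") %>%
  layer_dense(units = K, activation = "softmax")
set_weights(get_layer(model2, "h1"), get_weights(get_layer(model1, "h1")))
model2 %>% compile(loss = "categorical_crossentropy", optimizer = "sgd")
model2 %>% fit(x_train, y_train, epochs = 10)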


Supervised Pre-training
This is one strategy; others are popular and effective

Train each layer as a single-layer network using the hidden layer of the previous layer as inputs to the model.

In this case, no long-term dependencies occur at all.

Performance may suffer.

96 / 110


Supervised Pre-training
This is one strategy; others are popular and effective

Train each layer as a single layer on the hidden layer of the previous layer, but also add the original input data as input to every layer of the network.

No long-term dependency

Performance improves

Number of parameters increases.

97 / 110


Parameter Sharing
Another method for reducing the number of parameters in a deep learning model.

When predictors $X$ exhibit some internal structure, parts of the model can then share parameters.

98 / 110


Parameter Sharing
Two important applications use this idea:

Image processing: local structure of nearby pixels
Sequence modeling: structure given by sequence

The latter includes modeling of time series data.

99 / 110


Convolutional Networks are used in imaging applications.

Input is pixel data.

Parameters are shared across nearby parts of the image.

Parameter Sharing

100 / 110


Recurrent Networks are used in sequence modeling applications.

For instance, time series and forecasting.

Parameters are shared across a time lag.

Recurrent Networks

101 / 110


Recurrent Networks
The long short-term memory (LSTM) model is very popular in time series analysis.

102 / 110


Example
Learn to add: "55+22=77"
https://keras.rstudio.com/articles/examples/addition_rnn.html

103 / 110


Example
Learn to add: "55+22=77"
https://keras.rstudio.com/articles/examples/addition_rnn.html

Addition encoded as sequence of one-hot vectors:

## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 5 0 0 0 0 0 0 0 1 0 0
## 5 0 0 0 0 0 0 0 1 0 0
## + 0 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 1 0 0 0 0 0
## 2 0 0 0 0 1 0 0 0 0 0
## [,11] [,12]
## 5 0 0

104 / 110
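A small R sketch that reproduces this kind of encoding, assuming the 12-character vocabulary is the sorted set consisting of the space character, "+", and the digits 0-9 (an assumption consistent with the columns shown above):

chars <- c(" ", "+", as.character(0:9))    # assumed 12-character vocabulary
one_hot <- function(s) {
  cs <- strsplit(s, "")[[1]]
  m  <- matrix(0L, nrow = length(cs), ncol = length(chars), dimnames = list(cs, NULL))
  m[cbind(seq_along(cs), match(cs, chars))] <- 1L
  m
}
one_hot("55+22")    # one row per character, matching the matrix above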


Example
Learn to add: "55+22=77"
https://keras.rstudio.com/articles/examples/addition_rnn.html

Result encoded as sequence of one-hot vectors

## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 7 0 0 0 0 0 0 0 0 0 1
## 7 0 0 0 0 0 0 0 0 0 1
## [,11] [,12]
## 7 0 0
## 7 0 0

105 / 110


Example
Learn to add: "55+22=77"
https://keras.rstudio.com/articles/examples/addition_rnn.html

Result encoded as sequence of one-hot vectors

## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 7 0 0 0 0 0 0 0 0 0 1
## 7 0 0 0 0 0 0 0 0 0 1
## [,11] [,12]
## 7 0 0
## 7 0 0

This is a sequence-to-sequence model. Perfect application for an LSTM.

106 / 110


Example

## ______________________________________________________
## Layer (type) Output Shape Param #
## ======================================================
## lstm_10 (LSTM) (None, 128) 72192
## ______________________________________________________
## repeat_vector_5 (Repeat (None, 3, 128) 0
## ______________________________________________________
## lstm_11 (LSTM) (None, 3, 128) 131584
## ______________________________________________________
## time_distributed_5 (Tim (None, 3, 12) 1548
## ______________________________________________________
## activation_5 (Activatio (None, 3, 12) 0

107 / 110
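The summary above is consistent with a model along the following lines in the R keras interface; this is a hedged reconstruction (layer sizes read off the summary, everything else an assumption), not the exact code from the linked article.

library(keras)
# Encode the 5-character query ("55+22") with an LSTM, repeat its state once
# per output character, decode with a second LSTM, and predict a 12-way
# softmax at each of the 3 output positions.
model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(5, 12)) %>%
  layer_repeat_vector(3) %>%
  layer_lstm(units = 128, return_sequences = TRUE) %>%
  time_distributed(layer_dense(units = 12)) %>%
  layer_activation("softmax")
model %>% compile(loss = "categorical_crossentropy", optimizer = "adam",
                  metrics = "accuracy")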


Summary
Deep Learning is riding a big wave of popularity.

State-of-the-art results in many applications.

Best results in applications with massive amounts of data.

However, newer methods allow use in other situations.

108 / 110


Summary
Many recent advances stem from computational and technical approaches to modeling.

Keeping track of these advances is hard, and many of them are ad hoc.

It is not straightforward to determine a priori how these technical advances may help in a specific application.

They require a significant amount of experimentation.

109 / 110


Summary
The interpretation of hidden units as representations can lead to insight.

There is current research on interpreting these to support some notion of statistical inference.

Excellent textbook: http://deeplearningbook.org

110 / 110