Introduction to Data Science: Neural Networks and Deep Learning
Héctor Corrada Bravo
University of Maryland, College Park, USA
CMSC 320: 2020-05-10
Historical Overview

Neural networks are a decades-old area of study. Initially, these computational models were created with the goal of mimicking the processing of neuronal networks.

Inspiration: model the neuron as a processing unit. Some of the mathematical functions historically used in neural network models arise from biologically plausible activation functions.

Despite somewhat limited success in modeling neuronal processing, neural network models gained traction as general Machine Learning models.

Strong results about the ability of these models to approximate arbitrary functions made them the subject of intense study in ML. In practice, however, effective training of these models was both technically and computationally difficult.

Starting from 2005, technical advances have led to a resurgence of interest in neural networks, specifically in Deep Neural Networks.
Deep Learning

Advances in computational processing: powerful parallel processing given by Graphics Processing Units.

Advances in neural network architecture design and network optimization.

As a result, researchers apply Deep Neural Networks successfully in a number of applications.
Self-driving cars make use of Deep Learning models for sensor processing.

Image recognition software uses Deep Learning to identify individuals within photos.

Deep Learning models have been applied to medical imaging to yield expert-level prognosis.

An automated Go player, making heavy use of Deep Learning, is capable of beating the best human Go players in the world.
Neural Networks and Deep Learning

In this unit we study neural networks and recent advances in Deep Learning.
Projection-Pursuit Regression

To motivate our discussion of Deep Neural Networks, let's turn to a simple but very powerful class of models.

As per the usual regression setting, suppose we are given predictors (attributes) $\{X_1, \ldots, X_p\}$ for an observation, and we want to predict a continuous outcome $Y$.
The Projection-Pursuit Regression (PPR) model predicts outcome $Y$ using a function $f(X)$ of the form

$$f(X) = \sum_{m=1}^M g_m(w_m'X)$$

where:

- $w_m$ is a $p$-dimensional weight vector, so $w_m'X = \sum_{j=1}^p w_{mj}x_j$ is a linear combination of the predictors $x_j$, and
- $g_m$, $m = 1, \ldots, M$, are univariate non-linear functions (a smoothing spline, for example).
Our prediction function is a linear combination of $M$ terms. Each term $g_m(w_m'X)$ is the result of applying a non-linear function to what we can think of as a derived feature (or derived predictor) $V_m = w_m'X$.
Here's another intuition. Recall the Principal Component Analysis problem we saw in the previous unit.

Given: a data set $\{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the vector of $p$ variable values for the $i$-th observation.

Return: a matrix $[\phi_1, \phi_2, \ldots, \phi_p]$ of linear transformations that retain maximal variance.
You can think of the first vector $\phi_1$ of this matrix as a linear transformation that embeds the observations into one dimension,

$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p,$$

where $\phi_1$ is selected so that the resulting dataset $\{z_1, \ldots, z_n\}$ has maximum variance.
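As a concrete reminder, here is a minimal sketch using base R's prcomp on simulated data (the data are illustrative):

```r
# First principal component as a 1-dimensional embedding.
set.seed(1)
x <- matrix(rnorm(100 * 5), nrow = 100)   # n = 100 observations, p = 5
pca <- prcomp(x, center = TRUE, scale. = TRUE)
phi1 <- pca$rotation[, 1]                 # first loading vector phi_1
z1 <- pca$x[, 1]                          # derived feature Z_1
var(z1)                                   # maximal variance among projections
```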
In PPR we are reducing the dimensionality of $X$ from $p$ to $M$ using linear projections, and building a regression function

$$f(X) = \sum_{m=1}^M g_m(w_m'X)$$

over the representation with reduced dimension.
Let's revisit the data from our previous unit and see how the PPR model performs. This is a time series dataset of mortgage affordability as calculated and distributed by Zillow: https://www.zillow.com/research/data/.

The dataset contains affordability measurements for 76 counties, with data from 1979 to 2017. Here we plot the time series of affordability for all counties.
We will try to predict affordability at the last time-point given in the dataset, based on the time series up to one year previous to the last time point.
So, how can we fit the PPR model? As we have done previously in other regression settings, we start with a loss function to minimize,

$$L(g, W) = \sum_{i=1}^N \left[ y_i - \sum_{m=1}^M g_m(w_m'x_i) \right]^2,$$

and use an optimization method to minimize the error of the model.

For simplicity, let's consider a model with $M = 1$ and drop the subscript $m$.
Consider the following procedure:

- Initialize the weight vector $w$ to some value $w_{old}$.
- Construct the derived variable $v = w_{old}'x$.
- Use a non-linear regression method to fit the function $g$ based on the model $E[Y|V] = g(v)$. You can use additive splines or loess.
Given the function $g$, now update the weight vector $w_{old}$ using a gradient descent method:

$$w = w_{old} + 2\alpha \sum_{i=1}^N (y_i - g(v_i))\, g'(v_i)\, x_i = w_{old} + 2\alpha \sum_{i=1}^N r_i x_i,$$

where $\alpha$ is a learning rate.
In the second line we rewrite the gradient in terms of the residual $y_i - g(v_i)$ of the current model (using the derived feature $v_i$), weighted by what we could think of as the sensitivity $g'(v_i)$ of the model to changes in the derived feature $v_i$; that is, $r_i = (y_i - g(v_i))\, g'(v_i)$.
Given an updated weight vector $w$ we can then fit $g$ again, and continue iterating until a stop condition is reached.
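Base R ships an implementation of this model as stats::ppr, so we can try the procedure without coding the alternating fit by hand. A minimal sketch on simulated data (the data and settings here are illustrative, not the mortgage example above):

```r
# Fit a Projection-Pursuit Regression model with base R's stats::ppr.
# Simulated data; the true signal depends on one projection of x.
set.seed(1)
n <- 500; p <- 5
x <- matrix(rnorm(n * p), nrow = n)
y <- sin(2 * (x[, 1] + 0.5 * x[, 2])) + rnorm(n, sd = 0.1)

fit <- ppr(x, y, nterms = 1, max.terms = 3)  # M = 1 term; consider up to 3
fit$alpha             # estimated projection direction(s) w_m
plot(fit)             # estimated ridge function(s) g_m
yhat <- predict(fit, x)
```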
Let's consider the PPR model and this fitting technique in a bit more detail, with a few observations. We can think of the PPR model as composing three functions:

- the linear projection $w'x$,
- the result of the non-linear function $g$, and,
- in the case when $M > 1$, the linear combination of the $g_m$ functions.
To tie this to the formulation usually described in the neural network literature, we make one slight change to our understanding of derived features. Consider the case $M > 1$: the final predictor is a linear combination $\sum_{m=1}^M g_m(v_m)$. We could also think of each term $g_m(v_m)$ as providing a non-linear dimensionality reduction to a single derived feature.
29 / 110
Projection-Pursuit RegressionThis interpretation is closer to that used in the neural network literature,at each stage of the composition we apply a non-linear transform to thedata of the type .g(w
′x)
30 / 110
The fitting procedure propagates errors (residuals) down this function composition in a stage-wise manner.
Feed-forward Neural Networks

We can now write the general formulation for a feed-forward neural network. We will present the formulation for a general case where we are modeling $K$ outcomes $Y_1, \ldots, Y_K$ as $f_1(X), \ldots, f_K(X)$.
In multi-class classification, a categorical outcome may take multiple values. We consider $Y_k$ as a discriminant function for class $k$; the final classification is made using $\arg\max_k Y_k$. For regression, we can take $K = 1$.
A single-layer feed-forward neural network is defined as

$$h_m = g_h(w_{1m}'X), \quad m = 1, \ldots, M$$
$$f_k = g_{f_k}(w_{2k}'h), \quad k = 1, \ldots, K.$$
The network is organized into input, hidden, and output layers.
Units $h_m$ represent a hidden layer, which we can interpret as a derived non-linear representation of the input data, as we saw before.
The function $g_h$ is an activation function used to introduce non-linearity into the representation.
Historically, the sigmoid activation function

$$g_h(v) = \frac{1}{1 + e^{-v}}$$

or the hyperbolic tangent was commonly used.
Nowadays, a rectified linear unit (ReLU),

$$g_h(v) = \max\{0, v\},$$

is used more frequently in practice (there are many extensions).
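These activation functions are one-liners in R; a quick sketch to compare their shapes:

```r
# Common activation functions in base R.
sigmoid <- function(v) 1 / (1 + exp(-v))
relu    <- function(v) pmax(0, v)          # element-wise max(0, v)

v <- seq(-4, 4, length.out = 200)
plot(v, sigmoid(v), type = "l", ylim = c(-1, 4), ylab = "g(v)")
lines(v, tanh(v), lty = 2)                 # hyperbolic tangent (built in)
lines(v, relu(v), lty = 3)
legend("topleft", c("sigmoid", "tanh", "ReLU"), lty = 1:3)
```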
The function $g_f$ used in the output layer depends on the outcome modeled. For classification, a soft-max function can be used,

$$g_{f_k}(t_k) = \frac{e^{t_k}}{\sum_{l=1}^K e^{t_l}},$$

where $t_k = w_{2k}'h$. For regression, we may take $g_{f_k}$ to be the identity function.
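A direct translation into R (adding the usual max-subtraction trick for numerical stability, which is not shown on the slide):

```r
# Soft-max over a vector of scores t_k = w'_2k h.
softmax <- function(t_scores) {
  e <- exp(t_scores - max(t_scores))   # subtract the max for stability
  e / sum(e)
}
softmax(c(1, 2, 3))                    # probabilities summing to 1
```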
The single-layer feed-forward neural network has the same parameterization as the PPR model, but the activation functions $g_h$ are much simpler, as opposed to, e.g., the smoothing splines used in PPR.
A classic result of the neural network literature is the universal function representational ability of the single-layer feed-forward neural network with ReLU activation functions (Leshno et al. 1993). However, the number of units in the hidden layer may need to be exponentially large to approximate arbitrary functions.
Empirically, a single-layer feed-forward neural network has performance similar to kernel-based methods like SVMs. This is not usually the case once more than a single layer is used in a neural network.
Fitting with back propagation

In the modern neural network literature, the graphical representation of neural nets we saw above has been extended to computational graphs. These are especially useful to guide the design of general-use programming libraries for the specification of neural nets. They have the advantage of explicitly representing all operations used in a neural network, which in turn permits easier specification of gradient-based algorithms.
Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks. These methods require that gradients are computed based on the model error. The layer-wise propagation of error is at the core of these gradient computations; this is called back-propagation.
Assume we have a current estimate of the model parameters, and that we are processing one observation $x$ (in practice a small batch of observations is used).
First, to perform back-propagation we must compute the error of the model on observation $x$ given the current set of parameters. To do this we compute all activation functions along the computation graph from the bottom up.
Once we have computed the output $\hat{y}$, we can compute the error (or, more generally, cost) $J(y, \hat{y})$. Once we do this, we can walk back through the computation graph to obtain the gradients of the cost $J$ with respect to any of the model parameters by applying the chain rule.
We will continuously update a gradient vector $\nabla$. First, we set

$$\nabla \leftarrow \nabla_{\hat{y}} J.$$
Next, we need the gradient $\nabla_t J$. We apply the chain rule to obtain

$$\nabla_t J = \nabla \odot f'(t),$$

where $f'$ is the derivative of the soft-max function and $\odot$ is element-wise multiplication. Set $\nabla \leftarrow \nabla_t J$.
Next, we want to compute $\nabla_{W_k} J$. We can do so using the gradient $\nabla$ we just computed, since $\nabla_{W_k} J = \nabla_t J \, \nabla_{W_k} t$. In this case, we get $\nabla_{W_k} J = \nabla\, h'$.
At this point we have computed gradients for the weight matrix $W_k$ from the hidden layer to the output layer, which we can use to update those parameters as part of stochastic gradient descent.
Once we have computed gradients for the weights connecting the hidden and output layers, we can compute gradients for the weights connecting the input and hidden layers.
We require $\nabla_h J$, which we can compute as $W_k' \nabla$ since $\nabla$ currently has value $\nabla_t J$. At this point we can set $\nabla \leftarrow \nabla_h J$.
Finally, we set

$$\nabla \leftarrow \nabla_z J = \nabla \odot g'(z),$$

where $g'$ is the derivative of the ReLU activation function. This gives us $\nabla_{W_h} J = \nabla\, x'$.
At this point we have propagated the gradient of the cost function $J$ to all parameters of the model. We can thus update the model for the next step of stochastic gradient descent.
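To make the walk-through concrete, here is a minimal sketch in R of one back-propagation step for the single-hidden-layer network above, assuming ReLU hidden units, a soft-max output, and a cross-entropy cost (for which $\nabla_t J$ simplifies to $\hat{y} - y$); the dimensions and data are illustrative:

```r
# One forward + backward pass and one SGD update; illustrative setup.
set.seed(1)
p <- 4; M <- 8; K <- 3
x  <- rnorm(p)                                # one observation
y  <- c(1, 0, 0)                              # one-hot true class
Wh <- matrix(rnorm(M * p, sd = 0.1), M, p)    # input -> hidden weights
Wk <- matrix(rnorm(K * M, sd = 0.1), K, M)    # hidden -> output weights

# Forward pass: activations from the bottom up
z     <- Wh %*% x                             # hidden pre-activations
h     <- pmax(0, z)                           # ReLU g(z)
t_out <- Wk %*% h                             # output pre-activations t
yhat  <- exp(t_out - max(t_out)); yhat <- yhat / sum(yhat)  # soft-max

# Backward pass, in the order of the slides
grad_t  <- yhat - y                           # nabla_t J (soft-max + cross-entropy)
grad_Wk <- grad_t %*% t(h)                    # nabla_Wk J = nabla h'
grad_h  <- t(Wk) %*% grad_t                   # nabla_h J = Wk' nabla
grad_z  <- grad_h * (z > 0)                   # nabla_z J: element-wise ReLU derivative
grad_Wh <- grad_z %*% t(x)                    # nabla_Wh J = nabla x'

# SGD update with learning rate alpha
alpha <- 0.1
Wk <- Wk - alpha * grad_Wk
Wh <- Wh - alpha * grad_Wh
```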
Practical Issues

Stochastic gradient descent (SGD) based on the back-propagation algorithm shown above introduces some complications.
Scaling

The scale of the inputs $x$ effectively determines the scale of the weight matrices $W$, and scale can have a large effect on how well SGD behaves. In practice, all inputs are usually standardized to have zero mean and unit variance before application of SGD.
Initialization

With properly scaled inputs, initialization of the weights can be done in a somewhat reasonable manner: randomly choose initial weights in $[-0.7, 0.7]$.
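In R, these two steps might look as follows (the interval endpoints follow the slide; the data and hidden-layer size are illustrative):

```r
# Standardize inputs, then draw small random initial weights.
x <- matrix(rnorm(100 * 5), nrow = 100)   # n x p predictor matrix (simulated)
x_scaled <- scale(x)                      # column-wise: zero mean, unit variance
M <- 10; p <- ncol(x_scaled)
W1 <- matrix(runif(M * p, min = -0.7, max = 0.7), M, p)  # weights in [-0.7, 0.7]
```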
Overfitting

As with other highly flexible models we have seen previously, feed-forward neural nets are prone to overfitting the data. We can incorporate penalty terms to control model complexity to some degree.
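For instance, the classic nnet package in R exposes an L2 penalty ("weight decay") through its decay argument; a brief sketch with illustrative data and settings:

```r
# Weight decay (L2 penalty on the weights) with the nnet package.
library(nnet)
set.seed(1)
x <- matrix(rnorm(200 * 3), nrow = 200)
y <- sin(x[, 1]) + rnorm(200, sd = 0.1)
fit <- nnet(x, y, size = 10, linout = TRUE,   # 10 hidden units, linear output
            decay = 1e-3, maxit = 500)        # decay sets the penalty strength
```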
Architecture Design

A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer. We saw above that a wide enough hidden layer is capable of perfectly fitting the data. We will also see later that in many cases making the neural network deeper, instead of wider, performs better. In that case models may have significantly fewer parameters, but they tend to be much harder to fit.
Ideal network architectures are task dependent. They require much experimentation, along with judicious use of cross-validation methods to measure expected prediction error to guide the architecture choice.
Multiple Minima

As opposed to the other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem. This leads to the problem of multiple local minima, from which methods like SGD can suffer. We will see later, in detail, a variety of approaches used to address this problem. Here, we present a few rules of thumb to follow.
The local minima a method like SGD may yield depend on the initial parameter values chosen.

One idea is to train multiple models using different initial values and make predictions using the model that gives the best expected prediction error. A related idea is to average the predictions of these multiple models (see the sketch below). Finally, we can use bagging, as described in a previous session, to create an ensemble of neural networks that circumvents the local minima problem.
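A minimal sketch of the restart-and-average idea using the nnet package (the data, the number of restarts, and the settings are illustrative):

```r
# Train several networks from different random initializations and
# average their predictions.
library(nnet)
set.seed(1)
x <- matrix(rnorm(200 * 3), nrow = 200)
y <- sin(x[, 1]) + rnorm(200, sd = 0.1)

fits <- lapply(1:5, function(s) {
  set.seed(s)   # different seed => different initial weights
  nnet(x, y, size = 10, linout = TRUE, maxit = 500, trace = FALSE)
})
preds <- sapply(fits, predict, newdata = x)   # one column per restart
yhat  <- rowMeans(preds)                      # averaged ensemble prediction
```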
Summary

Neural networks are representationally powerful prediction models. They can be difficult to optimize properly due to the non-convexity of the resulting optimization problem. Deciding on a network architecture is a significant challenge; we'll see later that recent proposals use deep but thinner networks effectively, and even in this case the choice of model depth is difficult. There is tremendous excitement over the recent excellent performance of deep neural networks in many applications.
Deep Feed-Forward Neural Networks

The general form of the feed-forward network can be extended by adding additional hidden layers.
The same principles we saw before apply:

- We arrange computation using a computation graph.
- We use Stochastic Gradient Descent.
- We use back-propagation for gradient calculation along the computation graph.
Empirically, it is found that using more, thinner layers yields better expected prediction error. However, each layer introduces more non-linearity into the network, making optimization markedly more difficult.
We may interpret the hidden layers as progressively derived representations of the input data. Since we train based on a loss function, these derived representations should make modeling the outcome of interest progressively easier.
In many applications, these derived representations are used for model interpretation.
Advanced parallel computation systems and methods are used to train these deep networks, which can have billions of connections. The applications we discussed previously build this type of massive deep network; they also require massive amounts of data to train. However, the approach can still be applicable to moderate data sizes with careful network design, regularization, and training.
Supervised Pre-training

A clever idea for training deep networks: train each layer successively on the outcome of interest, and use the resulting weights as initial weights for the network with one additional layer.
Train the first layer as a single-layer feed-forward network, with weights initialized as standard practice. This fits $W_h^1$.
Now train the two-layer network, with weights $W_h^1$ initialized to the result of the previous fit.
This procedure continues until all layers are trained. The hypothesis is that training each layer on the outcome of interest moves the weights to parts of parameter space that lead to good performance. Minimizing updates can ameliorate the dependency problem.
This is one strategy; other strategies are popular and effective.

Train each layer as a single-layer network using the hidden layer of the previous layer as the inputs to the model. In this case no long-term dependencies occur at all, but performance may suffer.
Alternatively, train each layer as a single layer on the hidden layer of the previous layer, but also add the original input data as input to every layer of the network. There is no long-term dependency and performance improves, but the number of parameters increases.
Parameter Sharing

Another method for reducing the number of parameters in a deep learning model: when the predictors $X$ exhibit some internal structure, parts of the model can share parameters.
Two important applications use this idea:

- Image processing: local structure of nearby pixels.
- Sequence modeling: structure given by the sequence.

The latter includes modeling of time series data.
Convolutional Networks are used in imaging applications. The input is pixel data, and parameters are shared across nearby parts of the image.
Recurrent Networks

Recurrent Networks are used in sequence modeling applications, for instance time series and forecasting. Parameters are shared across a time lag.
The long short-term memory (LSTM) model is very popular in time series analysis.
Example

Learn to add: "55+22=77" (https://keras.rstudio.com/articles/examples/addition_rnn.html).
Addition is encoded as a sequence of one-hot vectors:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 5 0 0 0 0 0 0 0 1 0 0
## 5 0 0 0 0 0 0 0 1 0 0
## + 0 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 1 0 0 0 0 0
## 2 0 0 0 0 1 0 0 0 0 0
## [,11] [,12]
## 5 0 0
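A sketch of how such an encoding might be produced in R (the vocabulary and its ordering here are assumptions; the output above shows the tutorial orders its character set differently):

```r
# One-hot encode an addition string over an assumed 12-character vocabulary.
chars <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "+", " ")
one_hot <- function(s) {
  tokens <- strsplit(s, "")[[1]]
  m <- matrix(0L, nrow = length(tokens), ncol = length(chars),
              dimnames = list(tokens, NULL))
  m[cbind(seq_along(tokens), match(tokens, chars))] <- 1L
  m
}
one_hot("55+22")   # one row per character, one column per vocabulary symbol
```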
The result is encoded as a sequence of one-hot vectors:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 7 0 0 0 0 0 0 0 0 0 1
## 7 0 0 0 0 0 0 0 0 0 1
## [,11] [,12]
## 7 0 0
## 7 0 0
This is a sequence-to-sequence model: a perfect application for an LSTM.
The model, as summarized by keras:

## ______________________________________________________
## Layer (type) Output Shape Param #
## ======================================================
## lstm_10 (LSTM) (None, 128) 72192
## ______________________________________________________
## repeat_vector_5 (Repeat (None, 3, 128) 0
## ______________________________________________________
## lstm_11 (LSTM) (None, 3, 128) 131584
## ______________________________________________________
## time_distributed_5 (Tim (None, 3, 12) 1548
## ______________________________________________________
## activation_5 (Activatio (None, 3, 12) 0
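From that summary the architecture can plausibly be reconstructed with the keras R package roughly as follows; this is a sketch under assumptions (the input length of 5 matches "55+22", and the 3 output positions and 12-symbol vocabulary are read off the summary; the tutorial's actual code may differ):

```r
# Hedged reconstruction of the addition_rnn model from its summary.
library(keras)

model <- keras_model_sequential() %>%
  # Encoder: read the 5-character input into a single 128-d vector
  layer_lstm(units = 128, input_shape = c(5, 12)) %>%
  # Repeat that vector once per output position (3 characters)
  layer_repeat_vector(3) %>%
  # Decoder: produce a 128-d vector per output position
  layer_lstm(units = 128, return_sequences = TRUE) %>%
  # Score each of the 12 vocabulary symbols at each position
  time_distributed(layer_dense(units = 12)) %>%
  layer_activation("softmax")

model %>% compile(loss = "categorical_crossentropy",
                  optimizer = "adam", metrics = "accuracy")
```

The parameter counts match the summary: the first LSTM has $4 \times (12 + 128 + 1) \times 128 = 72192$ parameters, the second $4 \times (128 + 128 + 1) \times 128 = 131584$, and the dense layer $(128 + 1) \times 12 = 1548$.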
Summary

Deep Learning is riding a big wave of popularity, with state-of-the-art results in many applications. The best results come in applications with massive amounts of data; however, newer methods allow its use in other situations.
Many recent advances stem from computational and technical approaches to modeling. Keeping track of these advances is hard, and many of them are ad hoc. It is not straightforward to determine a priori how these technical advances may help in a specific application; this requires a significant amount of experimentation.
The interpretation of hidden units as representations can lead to insight, and there is current research on interpreting these representations to support some notion of statistical inference.

Excellent textbook: http://deeplearningbook.org