EE104 S. Lall and S. Boyd
Neural Networks
Sanjay Lall and Stephen Boyd
EE104, Stanford University
Neural networks
I a neural network (NN) is a nonlinear predictor y = gθ(x) with a particular layered form
I NNs can be thought of as incorporating aspects of feature engineering into the predictor (and indeed are often used as 'automatic feature engineering')
I the parameter dimension p can be very large
I training NNs can be tricky, and take a long time
I NNs can perform very well, especially when there’s lots of training data
I it is very hard to interpret an NN predictor y = gθ(x) (cf. a linear predictor y = θ^T x)
Nomenclature
Neural network layers
I a (feedforward) neural network predictor consists of a composition of functions
y = g3(g2(g1(x)))
(we show three here, but we can have any number)
I written as g = g3 ∘ g2 ∘ g1 (the symbol ∘ means function composition)
I each gi is called a layer; here we have 3 layers
I sometimes called a multi-layer perceptron
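As a quick illustration of this composition, here is a minimal Julia sketch; the three layer maps below are made-up placeholder functions, not the parameterized layers defined later.

g1(x) = max.(x, 0)            # hypothetical layer map, R^3 -> R^3
g2(z) = tanh.(2 .* z .+ 1)    # hypothetical layer map, R^3 -> R^3
g3(z) = [sum(z)]              # hypothetical layer map, R^3 -> R^1

g = g3 ∘ g2 ∘ g1              # composed predictor g = g3 ∘ g2 ∘ g1
y = g([1.0, -2.0, 0.5])       # evaluate the predictor at an input x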
Neural network layers
I we can write the predictor y = g3(g2(g1(x))) as
z1 = g1(x),   z2 = g2(z1),   y = g3(z2)

I the vector zi ∈ R^di is called the activation or output of layer i
I layer output dimensions di need not be the same
I we sometimes write z0 = x, d0 = d, and z3 = y, d3 = m
(so the predictor input x and predictor output y are also considered activations of layers)
I sometimes visualized as a flow graph:

    x → [g1] → z1 → [g2] → z2 → [g3] → y
Layer functions
I each layer gi is a composition of a function h with an affine function
gi(zi−1) = h(θi^T (1, zi−1)),   i = 1, ..., M

I the matrix θi ∈ R^((di−1+1) × di) is the parameter for layer i
I an M-layer neural network predictor is parameterized by θ = (θ1, ..., θM) (for M layers)
I the function h : R → R is a scalar activation function, which acts elementwise on a vector argument (i.e., it is applied to each entry of a vector); see the Julia sketch of one layer map below
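A minimal Julia sketch of one layer map gi(z) = h(θi^T (1, z)); the parameter matrix and input below are made up, and ReLU (introduced on the next slide) is assumed as the activation.

h(a) = max(a, 0)                              # activation function (ReLU, assumed)
layerfn(theta, z) = h.(theta' * vcat(1, z))   # g(z) = h(θ^T (1, z)), h applied elementwise

theta  = randn(2 + 1, 4)                 # hypothetical parameter, (d_{i-1}+1) × d_i with d_{i-1} = 2, d_i = 4
z_prev = randn(2)                        # hypothetical previous-layer activation
z_next = layerfn(theta, z_prev)          # this layer's activation, a vector in R^4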
Weights and biases
I we can write the entries of the parameter matrix θi as

    θi = [ βi1  ···  βi,di ]
         [ αi1  ···  αi,di ]

  where βij ∈ R and αij ∈ R^(di−1), so

    θi^T (1, zi−1) = ( βi1 + αi1^T zi−1,  ...,  βi,di + αi,di^T zi−1 )

I and the layer map zi = gi(zi−1) means

    zij = h(βij + αij^T zi−1),   j = 1, ..., di

  a composition of h with an affine function
I such a function is called a neuron
I βij is called the bias of the neuron; αij are its (input) weights (see the sketch below)
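A one-line Julia version of a single neuron in this notation; the bias, weights, and input below are made-up numbers.

h(a) = max(a, 0)                               # activation (ReLU, assumed)
neuron(beta, alpha, z) = h(beta + alpha' * z)  # zij = h(βij + αij^T zi−1)

beta  = 0.3                   # hypothetical bias βij
alpha = [1.0, -2.0, 0.5]      # hypothetical weight vector αij
z     = [0.2, 0.1, -0.4]      # previous-layer activation zi−1
neuron(beta, alpha, z)        # the neuron's output, a scalar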
Activation functions
I the activation function h : R → R is nonlinear
I (if it were linear or affine, then gθ would be an affine function of x, i.e., a linear predictor)
I common activation functions include
I h(a) = (a)+ = max(a, 0), called ReLU (rectified linear unit)
I h(a) = e^a/(1 + e^a), called the sigmoid function (see the Julia definitions below)
[plots of the ReLU and sigmoid activation functions, h(a) versus a]
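The two activation functions above, written directly in Julia; the broadcast calls at the end just evaluate them on a few sample points.

relu(a)    = max(a, 0)                 # h(a) = max(a, 0)
sigmoid(a) = exp(a) / (1 + exp(a))     # h(a) = e^a / (1 + e^a)

relu.(-2.0:1.0:2.0)                    # apply elementwise, as a layer does
sigmoid.(-2.0:1.0:2.0)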
Network depiction
[network diagram: input nodes x1, x2; the four components of z1; the two components of z2; output node y; edges are labeled by individual parameters, e.g., (θ1)24 and (θ2)21]
I neural networks are often represented by network diagrams
I each vertex is a component of an activation
I edges are individual weights or parameters
I example above has 3 layers, with d0 = 2, d1 = 4, d2 = 2, d3 = 1
Example neural network predictor

[surface plot of the predictor output y over the inputs x1 and x2, each ranging from −6 to 6]

θ1 = [  0.80   0.10   1.30   1.20
       -0.50   0.70   0.80   2.90
       -1.80   0.20  -1.50  -0.60 ]

θ2 = [  1.40   1.10
       -0.10  -0.90
        0.50   0.20
       -0.40   0.90
       -0.40  -0.10 ]

θ3 = [ 0.90
       0.70
       0.50 ]
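This example predictor can be evaluated directly in Julia. The slides do not state which activations the example uses, so ReLU hidden layers and an identity output layer are assumptions here, and the input x is made up.

relu(a) = max(a, 0)
layerfn(theta, z, h) = h.(theta' * vcat(1, z))   # gi(z) = h(θi^T (1, z))

theta1 = [ 0.80  0.10  1.30  1.20;
          -0.50  0.70  0.80  2.90;
          -1.80  0.20 -1.50 -0.60]
theta2 = [ 1.40  1.10;
          -0.10 -0.90;
           0.50  0.20;
          -0.40  0.90;
          -0.40 -0.10]
theta3 = reshape([0.90, 0.70, 0.50], 3, 1)

x  = [1.0, -2.0]                        # example input (made up)
z1 = layerfn(theta1, x,  relu)          # layer 1 activation, in R^4
z2 = layerfn(theta2, z1, relu)          # layer 2 activation, in R^2
y  = layerfn(theta3, z2, identity)[1]   # scalar prediction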
Layer terminology
I in an M-layer network, layers 1 to M − 1 are called hidden layers
I layer M is called the output layer
I often for regression, there is no activation function (i.e., h(a) = a) in the output layer
I number of layers M is called the depth of the network
I using large M (say more than 3) is called deep learning
Training
Training
I for RERM, we minimize over θ1, ..., θM the regularized empirical risk

    (1/n) Σ_{i=1,...,n} ℓ(gθ(xi), yi)  +  λ Σ_{j=1,...,M} r(θj)

I we do not regularize the bias parameters βij
I common regularizers include sum-squares and ℓ1
I the RERM minimization problem is very hard or impossible to solve exactly
I so training algorithms find an approximately optimal solution, using iterative methods we’ll see later
I these algorithms can take a long time (e.g., weeks)
Example
[plot of the fitted predictor for this example; axes range from −3 to 3]
Julia
function nnregression(X, Y, lambda)
    d = size(X,2); m = size(Y,2)
    model = Chain(
        Dense(d, 10, relu),
        Dense(10, 10, relu),
        Dense(10, m, identity))
    data = zip(eachrow(X), eachrow(Y))
    # normsquared, train, and rmse below are helper functions assumed to be defined elsewhere in the course code
    reg() = normsquared(model[1].W) + normsquared(model[2].W) + normsquared(model[3].W)
    predicty(x) = model(x)
    loss(x,y) = normsquared(predicty(x) - y)
    cost(x,y) = loss(x,y) + lambda*reg()
    train(cost, Flux.params(model), data)
    return model
end

predictall(model, X) = vcat([model(x) for x in eachrow(X)]...)
train_error = rmse(predictall(model, Xtrain), Ytrain)
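A hypothetical call to the function above; the regularization weight 0.01 and the held-out data Xtest, Ytest are made-up placeholders, assuming examples are stored as rows of the data matrices.

model = nnregression(Xtrain, Ytrain, 0.01)            # train with made-up lambda = 0.01
test_error = rmse(predictall(model, Xtest), Ytest)    # evaluate on held-out data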
Neural networks as feature engineering
Neural networks versus feature engineering
I NNs have a similar form to a feature engineering pipeline
I start with x
I carry out a sequence of transforms or mappings
I feature engineering mappings are chosen by hand, have few (or zero) parameters, and are interpretable
I NN mappings have a specific form, with many parameters
I we can think of NNs as doing data-driven automatic feature engineering
Pre-trained neural networks
I first train a NN to predict some target
I this is usually done with a very large data set, and takes considerable time
I now fix the parameters in the NN
I use the last hidden layer output zM−1 ∈ R^(dM−1) as a set of features for other prediction tasks (see the sketch below)
I can work well even when the other prediction tasks are quite different from the original one
I called a pre-trained neural network
I examples:
I word2vec maps English words to vectors in R^300
I VGG16 maps images to vectors in R^1000
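A sketch of using such a network as a feature map in Flux, assuming model is a trained Chain like the one returned by nnregression above; the new-task data Xnew, Ynew and the least-squares fit are made-up illustrations.

using Flux

feature_map = Chain(model.layers[1:end-1]...)    # drop the output layer: a fixed feature map
features(x) = feature_map(x)                     # the last hidden layer activation zM−1

Phi = vcat([features(x)' for x in eachrow(Xnew)]...)    # feature matrix for the new task's inputs
theta_new = [ones(size(Phi, 1)) Phi] \ Ynew             # fit a simple linear predictor on the features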
Summary
Summary
I need lots of data
I training can be tricky and take a long time
I not interpretable
I often work well
I incorporate aspects of feature engineering
I can be used to do automatic, data-driven feature engineering