EE104 S. Lall and S. Boyd
Neural Networks
Sanjay Lall and Stephen Boyd
EE104, Stanford University
Neural networks
I a neural network (NN) is a nonlinear predictor y = gθ(x) with a particular layered form
I NNs can be thought of as incorporating aspects of feature engineering into the predictor (and indeed are often used as 'automatic feature engineering')
I the parameter dimension p can be very large
I training NNs can be tricky, and take a long time
I NNs can perform very well, especially when there’s lots of training data
I it is very hard to interpret an NN predictor y = gθ(x) (cf. a linear predictor y = θ^T x)
Nomenclature
Neural network layers
I a (feedforward) neural network predictor consists of a composition of functions
y = g3(g2(g1(x)))
(we show three here, but we can have any number)
I written as g = g3 ∘ g2 ∘ g1 (the symbol ∘ means function composition)
I each gi is called a layer; here we have 3 layers
I sometimes called a multi-layer perceptron
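As a quick illustration of this composition, here is a minimal Julia sketch; the three layer maps below are made-up placeholder functions, not the parameterized layers defined later.

g1(x) = max.(x, 0)            # hypothetical layer map, R^3 -> R^3
g2(z) = tanh.(2 .* z .+ 1)    # hypothetical layer map, R^3 -> R^3
g3(z) = [sum(z)]              # hypothetical layer map, R^3 -> R^1

g = g3 ∘ g2 ∘ g1              # composed predictor g = g3 ∘ g2 ∘ g1
y = g([1.0, -2.0, 0.5])       # evaluate the predictor at an input x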
Neural network layers
I we can write the predictor y = g3(g2(g1(x))) as
z1 = g1(x),   z2 = g2(z1),   y = g3(z2)

I the vector zi ∈ R^di is called the activation or output of layer i
I layer output dimensions di need not be the same
I we sometimes write z0 = x, d0 = d, and z3 = y, d3 = m
(so the predictor input x and predictor output y are also considered activations of layers)
I sometimes visualized as a flow graph:

    x → [g1] → z1 → [g2] → z2 → [g3] → y
Layer functions
I each layer gi is a composition of a function h with an affine function
gi(zi−1) = h(θi^T (1, zi−1)),   i = 1, ..., M

I the matrix θi ∈ R^((di−1+1) × di) is the parameter for layer i
I an M-layer neural network predictor is parameterized by θ = (θ1, ..., θM) (for M layers)
I the function h : R → R is a scalar activation function, which acts elementwise on a vector argument (i.e., it is applied to each entry of a vector); see the Julia sketch of one layer map below
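A minimal Julia sketch of one layer map gi(z) = h(θi^T (1, z)); the parameter matrix and input below are made up, and ReLU (introduced on the next slide) is assumed as the activation.

h(a) = max(a, 0)                              # activation function (ReLU, assumed)
layerfn(theta, z) = h.(theta' * vcat(1, z))   # g(z) = h(θ^T (1, z)), h applied elementwise

theta  = randn(2 + 1, 4)                 # hypothetical parameter, (d_{i-1}+1) × d_i with d_{i-1} = 2, d_i = 4
z_prev = randn(2)                        # hypothetical previous-layer activation
z_next = layerfn(theta, z_prev)          # this layer's activation, a vector in R^4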
Weights and biases
I we can write the entries of the parameter matrix θi as

    θi = [ βi1  ···  βi,di ]
         [ αi1  ···  αi,di ]

  where βij ∈ R and αij ∈ R^(di−1), so

    θi^T (1, zi−1) = ( βi1 + αi1^T zi−1,  ...,  βi,di + αi,di^T zi−1 )

I and the layer map zi = gi(zi−1) means

    zij = h(βij + αij^T zi−1),   j = 1, ..., di

  a composition of h with an affine function
I such a function is called a neuron
I βij is called the bias of the neuron; αij are its (input) weights (see the sketch below)
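A one-line Julia version of a single neuron in this notation; the bias, weights, and input below are made-up numbers.

h(a) = max(a, 0)                               # activation (ReLU, assumed)
neuron(beta, alpha, z) = h(beta + alpha' * z)  # zij = h(βij + αij^T zi−1)

beta  = 0.3                   # hypothetical bias βij
alpha = [1.0, -2.0, 0.5]      # hypothetical weight vector αij
z     = [0.2, 0.1, -0.4]      # previous-layer activation zi−1
neuron(beta, alpha, z)        # the neuron's output, a scalar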
Activation functions
I the activation function h : R → R is nonlinear
I (if it were linear or affine, then gθ would be an affine function of x, i.e., a linear predictor)
I common activation functions include
I h(a) = (a)+ = max(a, 0), called ReLU (rectified linear unit)
I h(a) = e^a/(1 + e^a), called the sigmoid function (see the Julia definitions below)
[plots of the ReLU and sigmoid activation functions, h(a) versus a]
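The two activation functions above, written directly in Julia; the broadcast calls at the end just evaluate them on a few sample points.

relu(a)    = max(a, 0)                 # h(a) = max(a, 0)
sigmoid(a) = exp(a) / (1 + exp(a))     # h(a) = e^a / (1 + e^a)

relu.(-2.0:1.0:2.0)                    # apply elementwise, as a layer does
sigmoid.(-2.0:1.0:2.0)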
Network depiction
[network diagram: input nodes x1, x2; the four components of z1; the two components of z2; output node y; edges are labeled by individual parameters, e.g., (θ1)24 and (θ2)21]
I neural networks are often represented by network diagrams
I each vertex is a component of an activation
I edges are individual weights or parameters
I example above has 3 layers, with d0 = 2, d1 = 4, d2 = 2, d3 = 1
Example neural network predictor

[surface plot of the predictor output y over the inputs x1 and x2, each ranging from −6 to 6]

θ1 = [  0.80   0.10   1.30   1.20
       -0.50   0.70   0.80   2.90
       -1.80   0.20  -1.50  -0.60 ]

θ2 = [  1.40   1.10
       -0.10  -0.90
        0.50   0.20
       -0.40   0.90
       -0.40  -0.10 ]

θ3 = [ 0.90
       0.70
       0.50 ]
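This example predictor can be evaluated directly in Julia. The slides do not state which activations the example uses, so ReLU hidden layers and an identity output layer are assumptions here, and the input x is made up.

relu(a) = max(a, 0)
layerfn(theta, z, h) = h.(theta' * vcat(1, z))   # gi(z) = h(θi^T (1, z))

theta1 = [ 0.80  0.10  1.30  1.20;
          -0.50  0.70  0.80  2.90;
          -1.80  0.20 -1.50 -0.60]
theta2 = [ 1.40  1.10;
          -0.10 -0.90;
           0.50  0.20;
          -0.40  0.90;
          -0.40 -0.10]
theta3 = reshape([0.90, 0.70, 0.50], 3, 1)

x  = [1.0, -2.0]                        # example input (made up)
z1 = layerfn(theta1, x,  relu)          # layer 1 activation, in R^4
z2 = layerfn(theta2, z1, relu)          # layer 2 activation, in R^2
y  = layerfn(theta3, z2, identity)[1]   # scalar prediction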
Layer terminology
I in an M-layer network, layers 1 to M − 1 are called hidden layers
I layer M is called the output layer
I often for regression, there is no activation function (i.e., h(a) = a) in the output layer
I number of layers M is called the depth of the network
I using large M (say more than 3) is called deep learning
Training
Training
I for RERM, we minimize over θ1, ..., θM the regularized empirical risk

    (1/n) Σ_{i=1,...,n} ℓ(gθ(xi), yi)  +  λ Σ_{j=1,...,M} r(θj)

I we do not regularize the bias parameters βij
I common regularizers include sum-squares and ℓ1
I the RERM minimization problem is very hard or impossible to solve exactly
I so training algorithms find an approximately optimal solution, using iterative methods we’ll see later
I these algorithms can take a long time (e.g., weeks)
Example
[plot of the fitted predictor for this example; axes range from −3 to 3]
Julia
function nnregression(X, Y, lambda)
    d = size(X,2); m = size(Y,2)
    model = Chain(
        Dense(d, 10, relu),
        Dense(10, 10, relu),
        Dense(10, m, identity))
    data = zip(eachrow(X), eachrow(Y))
    # normsquared, train, and rmse below are helper functions assumed to be defined elsewhere in the course code
    reg() = normsquared(model[1].W) + normsquared(model[2].W) + normsquared(model[3].W)
    predicty(x) = model(x)
    loss(x,y) = normsquared(predicty(x) - y)
    cost(x,y) = loss(x,y) + lambda*reg()
    train(cost, Flux.params(model), data)
    return model
end

predictall(model, X) = vcat([model(x) for x in eachrow(X)]...)
train_error = rmse(predictall(model, Xtrain), Ytrain)
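A hypothetical call to the function above; the regularization weight 0.01 and the held-out data Xtest, Ytest are made-up placeholders, assuming examples are stored as rows of the data matrices.

model = nnregression(Xtrain, Ytrain, 0.01)            # train with made-up lambda = 0.01
test_error = rmse(predictall(model, Xtest), Ytest)    # evaluate on held-out data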
Neural networks as feature engineering
Neural networks versus feature engineering
I NNs have a similar form to a feature engineering pipeline
I start with x
I carry out a sequence of transforms or mappings
I feature engineering mappings are chosen by hand, have few (or zero) parameters, and are interpretable
I NN mappings have a specific form, with many parameters
I we can think of NNs as doing data-driven automatic feature engineering
Pre-trained neural networks
I first train a NN to predict some target
I this is usually done with a very large data set, and takes considerable time
I now fix the parameters in the NN
I use the last hidden layer output zM−1 ∈ R^(dM−1) as a set of features for other prediction tasks (see the sketch below)
I can work well even when the other prediction tasks are quite different from the original one
I called a pre-trained neural network
I examples:
I word2vec maps English words to vectors in R^300
I VGG16 maps images to vectors in R^1000
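A sketch of using such a network as a feature map in Flux, assuming model is a trained Chain like the one returned by nnregression above; the new-task data Xnew, Ynew and the least-squares fit are made-up illustrations.

using Flux

feature_map = Chain(model.layers[1:end-1]...)    # drop the output layer: a fixed feature map
features(x) = feature_map(x)                     # the last hidden layer activation zM−1

Phi = vcat([features(x)' for x in eachrow(Xnew)]...)    # feature matrix for the new task's inputs
theta_new = [ones(size(Phi, 1)) Phi] \ Ynew             # fit a simple linear predictor on the features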
Summary
Summary
I need lots of data
I training can be tricky and take a long time
I not interpretable
I often work well
I incorporate aspects of feature engineering
I can be used to do automatic, data-driven feature engineering