
Introduction to Neural Networks

Mark Johnson

Dept of Computing, Macquarie University

Sydney, Australia

September 2015

1/31


Outline

Introduction

Learning weights with Backpropagation

Alternative non-linear activation functions

Avoiding over-learning

Learning low-dimensional representations

Conclusion and other topics

2/31


What are (deep) neural networks?

• Neural networks are functions mapping inputs to outputs
– the inputs are usually feature vectors, encoding pictures, documents, etc.
– the outputs are usually a single value (e.g., a label for the input), but can be structured (e.g., a sentence)

• A neural network is a network of models typically organised into a sequence of layers
– the first layer is the input feature vector
– the output of one layer serves as the input to the next layer
– the output is the last layer

• Deep neural networks have a relatively large number of layers (e.g., 5 or more)

3/31


Logistic regression as a 2-layer neural network

P(Y=1 | x) = σ(∑_j w_j x_j + b) = σ(w · x + b)

σ(v) = 1 / (1 + exp(−v))

• x = (x1, . . . , xm) is the input vector (feature values)

• w = (w1, . . . , wm) is a vector of connection weights (feature weights)

• b is a bias weight
– can be viewed as the weight associated with an "always on" input

[Diagram: a single output unit y connected to inputs x0, …, xm via weights w0, …, wm, and to an always-on input 1 via the bias weight b]
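To make the slide concrete, here is a minimal NumPy sketch of this 2-layer network; the variable names x, w, b match the slide, but the specific numbers are made up for illustration.

```python
import numpy as np

def sigmoid(v):
    """Logistic sigmoid: sigma(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))

def logistic_regression(x, w, b):
    """P(Y=1 | x) = sigma(w . x + b), as on the slide."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative values (not from the slides)
x = np.array([1.0, 0.0, 2.0])    # input feature vector
w = np.array([0.5, -1.0, 0.25])  # connection (feature) weights
b = -0.1                         # bias weight ("always on" input)
print(logistic_regression(x, w, b))
```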

4/31


The logistic sigmoid function

[Plot: the logistic sigmoid y = 1/(1 + e^(−x)) for x from −5 to 5, rising from 0 towards 1]

5/31


Weaknesses of logistic regression

• Logistic regression is a log-linear model, i.e., the log odds are a linear function of the input:

P(Y=1 | x) = 1 / (1 + exp(−w · x))  ⇔  log ( P(Y=1 | x) / P(Y=0 | x) ) = w · x

• The presence of feature xj increases the log odds by wj
⇒ Unable to capture non-linear interactions between the features
– XOR is the classic example of a function that a linear model cannot express
– can a linear function of pixel intensities recognise a cat?

• Neural networks with hidden layers can learn non-linear feature interactions

• Note: a linear model can capture non-linear interactions if the inputs are non-linearly transformed first
– XOR = OR − AND
– Extreme learning consists of linear regression applied to random non-linear transformations of the input
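As a concrete illustration of the XOR = OR − AND remark, the sketch below builds OR and AND as threshold units in a hidden layer and subtracts them in a second layer. The weights are hand-picked for illustration, not learned.

```python
import numpy as np

def step(v):
    # Hard threshold used here for clarity; a steep sigmoid behaves similarly.
    return float(v > 0)

def xor_net(x1, x2):
    x = np.array([x1, x2], dtype=float)
    # Hidden layer: OR and AND units (hand-picked weights and biases).
    h_or  = step(np.dot([1.0, 1.0], x) - 0.5)   # fires if x1 OR x2
    h_and = step(np.dot([1.0, 1.0], x) - 1.5)   # fires if x1 AND x2
    # Output layer: XOR = OR - AND.
    return step(h_or - h_and - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, int(xor_net(x1, x2)))   # prints the XOR truth table
```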

6/31


Hidden nodes

• Hidden nodes are nodes that are neither input nor output nodes

• Each hidden layer provides extra expressive power

• Three layer neural network (ignoring bias nodes):

h_i = σ(∑_j A_ij x_j)
y = σ(∑_j b_j h_j)

or in matrix terms:

h = σ(A x)
y = σ(b · h)

[Diagram: fully-connected network with input layer x0, …, xm, hidden layer h0, …, hn, and a single output unit y (Input → Hidden → Output)]
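A minimal sketch of this three-layer forward pass in the matrix form h = σ(Ax), y = σ(b · h); the layer sizes and random weights are illustrative only.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
m, n = 5, 3                    # input and hidden layer sizes (arbitrary)
A = rng.normal(size=(n, m))    # input-to-hidden weight matrix
b = rng.normal(size=n)         # hidden-to-output weight vector

x = rng.normal(size=m)         # an input feature vector
h = sigmoid(A @ x)             # hidden activations: h = sigma(A x)
y = sigmoid(b @ h)             # output: y = sigma(b . h)
print(h, y)
```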

7/31


One-hot encoding for categorical inputs

• A categorical variable ranges over a fixed set of categories (e.g., words in a fixed vocabulary)

• A one-hot encoding of a categorical variable associates a binary node with each possible value of the categorical variable
– the one-hot encoding of c ∈ {1, . . . , m} is x = (x1, . . . , xm), where xc = 1 and xc′ = 0 for c′ ≠ c

[Diagram: a row of one-hot input nodes labelled cat, dog, chair, table, book, cup, pen, desk, …]

• A sequence of categorical variables can be represented by a sequence of their one-hot encodings

• If xc is the one-hot encoding of category c, then A xc = A·,c, so a one-hot encoding of c corresponds to a table lookup of column c
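The table-lookup observation is easy to check directly: multiplying a weight matrix by a one-hot vector just selects the corresponding column (the sizes below are illustrative).

```python
import numpy as np

m = 8                                    # vocabulary size (illustrative)
c = 3                                    # category index
x_c = np.zeros(m)
x_c[c] = 1.0                             # one-hot encoding of c

A = np.arange(5 * m, dtype=float).reshape(5, m)  # any weight matrix
assert np.allclose(A @ x_c, A[:, c])     # A x_c equals column c of A
```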

8/31


Why is the non-linearity important?

• General form of a neural network:

x^(i) = g^(i)(W^(i) x^(i−1))

where:
– x^(i) are the activations at level i,
– W^(i) is a weight matrix mapping activations at level i − 1 to level i
– g^(i) is a non-linear function (e.g., the sigmoid function)

• E.g., a general 3-layer network has the following form:

y = g^(2)(W^(2) x^(1)) = g^(2)(W^(2) g^(1)(W^(1) x^(0)))

• If g^(1) = g^(2) = I (the identity function), then:

y = W^(2) W^(1) x^(0)

• But the product of two matrices is another matrix!
⇒ A three-layer network is equivalent to a two-layer network if g^(1) is a linear function
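A quick numerical check of the collapse argument: with identity activations, applying W^(1) and then W^(2) gives the same result as applying the single matrix W^(2) W^(1) (random matrices for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 6))   # layer-1 weights
W2 = rng.normal(size=(3, 4))   # layer-2 weights
x0 = rng.normal(size=6)        # input

y_two_layers = W2 @ (W1 @ x0)  # g1 = g2 = identity
y_one_layer  = (W2 @ W1) @ x0  # single combined matrix
assert np.allclose(y_two_layers, y_one_layer)
```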

9/31


Outline

Introduction

Learning weights with Backpropagation

Alternative non-linear activation functions

Avoiding over-learning

Learning low-dimensional representations

Conclusion and other topics

10/31


Learning weights via Stochastic Gradient Descent

• Given labelled training data D = ((x1, y1), . . . , (xn, yn)), we want to find the weights Ŵ with minimum loss:

Ŵ = argmin_W ∑_{i=1}^{n} ℓ(W; x_i, y_i), where:

ℓ(W; x, y) = − log P(y | x)

• ℓ(W; x_i, y_i) is the loss incurred when predicting y_i from x_i using weights W
– probabilities are not greater than 1 ⇒ the loss is never negative
– log(1) = 0 ⇒ a perfect prediction incurs zero loss

• Minimising log-loss is maximum likelihood estimation

• Stochastic gradient descent:
– choose a random training example (x, y) ∈ D
– update W by adding −ε ∂ℓ(W; x, y)/∂W
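A minimal sketch of this SGD loop for the logistic-regression loss ℓ(w; x, y) = −log P(y | x). The gradient formula used below, (σ(w·x) − y) x, follows from the log-sigmoid derivative on the following slides; the toy data, learning rate and iteration count are illustrative.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sgd(data, dim, eps=0.1, iterations=1000, seed=0):
    """data: list of (x, y) pairs with y in {0, 1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for _ in range(iterations):
        x, y = data[rng.integers(len(data))]   # choose a random training example
        grad = (sigmoid(w @ x) - y) * x        # d loss / d w for the log-loss
        w -= eps * grad                        # step in the direction -eps * gradient
    return w

# Toy data; the last feature is a constant 1 acting as the bias input
data = [(np.array([1.0, 0.0, 1.0]), 1), (np.array([0.0, 1.0, 1.0]), 0)]
print(sgd(data, dim=3))
```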

11/31


Derivatives of the sigmoid function

σ(x) = 1 / (1 + e^(−x))

dσ(x)/dx = 1 / (2 + e^(−x) + e^(x)) = σ(x) σ(−x)

[Plot: σ(x) and its derivative dσ/dx for x from −5 to 5]
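The identity dσ/dx = σ(x)σ(−x) is easy to verify numerically with a finite-difference check (the test points and tolerance below are arbitrary).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5, 5, 11)
eps = 1e-6
finite_diff = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)  # numerical derivative
analytic = sigmoid(xs) * sigmoid(-xs)                              # sigma(x) * sigma(-x)
assert np.allclose(finite_diff, analytic, atol=1e-8)
```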

12/31


Derivatives of the log sigmoid function

d log σ(x)/dx = (1 + e^(−x)) / (2 + e^(−x) + e^(x)) = σ(−x) = 1 − σ(x)

[Plot: log σ(x) and its derivative d log σ/dx for x from −5 to 5]

13/31


Calculating derivatives with Backpropagation (1)

[Diagram: a network with input units x^(0)_0, …, x^(0)_3, two hidden layers x^(1) and x^(2) of four units each, weight matrices W^(1), W^(2), W^(3), and a single sigmoid output y]

y = σ(v^(3))
v^(k) = W^(k) x^(k−1), k = 1, . . . , 3
x^(k) = σ(v^(k)), k = 1, 2

14/31


Calculating derivatives with Backpropagation (2)

[Diagram: the same network as on the previous slide]

y = σ(v^(3))
v^(3) = W^(3) x^(2)

∂ log y / ∂W^(3) = (∂ log y / ∂v^(3)) (∂v^(3) / ∂W^(3)) = (1 − σ(v^(3))) x^(2)

∂ log y / ∂x^(2) = (∂ log y / ∂v^(3)) (∂v^(3) / ∂x^(2)) = (1 − σ(v^(3))) W^(3)

∂ log y / ∂v^(2) = (∂ log y / ∂x^(2)) (∂x^(2) / ∂v^(2)) = (1 − σ(v^(3))) W^(3) σ(v^(2)) σ(−v^(2))

15/31


Calculating derivatives with Backpropagation (3)

[Diagram: the same network as on the previous slides]

x^(2) = σ(v^(2))
v^(2) = W^(2) x^(1)

∂ log y / ∂v^(2) = (∂ log y / ∂x^(2)) (∂x^(2) / ∂v^(2)) = (∂ log y / ∂x^(2)) (σ(v^(2)) σ(−v^(2)))

∂ log y / ∂W^(2) = (∂ log y / ∂v^(2)) (∂v^(2) / ∂W^(2)) = (∂ log y / ∂v^(2)) x^(1)

∂ log y / ∂x^(1) = (∂ log y / ∂v^(2)) (∂v^(2) / ∂x^(1)) = (∂ log y / ∂v^(2)) W^(2)

16/31


Calculating derivatives with Backpropagation (4)

• Backpropagation is essentially just repeated application of the "chain rule" for differentiation

• High-level overview of backpropagation:
– Forward pass: propagate activations from the input layer to the output layer
– Backward pass: propagate the error signal from the output layer back to the input layer
– Compute derivatives: the derivatives involve both "forward" and "backward" activations
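A minimal sketch of these two passes for the 3-weight-layer network of the preceding slides, computing ∂ log y/∂W^(k) using the chain-rule formulas above; the layer sizes and random weights are illustrative only.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward_backward(x0, W1, W2, W3):
    """Gradients of log y w.r.t. W1, W2, W3 for y = sigma(W3 sigma(W2 sigma(W1 x0)))."""
    # Forward pass: propagate activations from the input to the output.
    v1 = W1 @ x0; x1 = sigmoid(v1)
    v2 = W2 @ x1; x2 = sigmoid(v2)
    v3 = W3 @ x2; y = sigmoid(v3)            # W3 is a 1 x n matrix

    # Backward pass: propagate the error signal back towards the input.
    d_v3 = 1.0 - y                           # d log y / d v3 = 1 - sigma(v3)
    d_x2 = d_v3 * W3.ravel()                 # d log y / d x2
    d_v2 = d_x2 * sigmoid(v2) * sigmoid(-v2)
    d_x1 = W2.T @ d_v2                       # d log y / d x1
    d_v1 = d_x1 * sigmoid(v1) * sigmoid(-v1)

    # Derivatives combine "backward" signals with "forward" activations.
    dW3 = np.outer(d_v3, x2)
    dW2 = np.outer(d_v2, x1)
    dW1 = np.outer(d_v1, x0)
    return y, dW1, dW2, dW3

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(1, 4))
y, dW1, dW2, dW3 = forward_backward(x0, W1, W2, W3)
```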

17/31


Outline

Introduction

Learning weights with Backpropagation

Alternative non-linear activation functions

Avoiding over-learning

Learning low-dimensional representations

Conclusion and other topics

18/31


tanh activation function (a scaled sigmoid)

• No reason for hidden unit activations to be probabilities
– the output unit is often a predicted value, e.g., a (log) probability

• Idea: allow negative as well as positive hidden unit activations

tanh(v) = (exp(v) − exp(−v)) / (exp(v) + exp(−v)) = 2σ(2v) − 1

σ(v) = 1 / (1 + exp(−v))

[Plot: σ(x) and tanh(x) for x from −5 to 5; tanh ranges from −1 to 1]

19/31


The vanishing gradient problem

• The sigmoid and tanh functions saturate on very large or very small inputs

⇒ Vanishing gradient problem: in deep networks, the gradient is often close to zero for weights close to the input layer

⇒ That part of the network learns very slowly, or not at all

[Plot: σ(x) and dσ/dx for x from −5 to 5, showing the near-zero gradient at both extremes]

20/31


Rectlinear activation function

• The Rectlinear (rectified linear) activation function doesn't saturate (in the positive direction):

RelU(v) = max(v, 0)

• It’s also very fast to calculate!

[Plot: RelU(x) for x from −5 to 5, zero for negative x and linear for positive x]
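A small sketch comparing the gradients of the three activation functions at increasingly large inputs, illustrating why sigmoid and tanh saturate while RelU does not (it uses the derivative identities from the earlier slides plus the standard d tanh/dx = 1 − tanh²).

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def d_sigmoid(v):
    return sigmoid(v) * sigmoid(-v)      # identity from the earlier slide

def d_tanh(v):
    return 1.0 - np.tanh(v) ** 2

def d_relu(v):
    return 1.0 if v > 0 else 0.0         # RelU gradient: 1 for positive inputs

for v in (0.5, 5.0, 20.0):
    print(f"v={v:5.1f}  dsigma={d_sigmoid(v):.2e}  dtanh={d_tanh(v):.2e}  dRelU={d_relu(v):.0f}")
```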

21/31


Outline

Introduction

Learning weights with Backpropagation

Alternative non-linear activation functions

Avoiding over-learning

Learning low-dimensional representations

Conclusion and other topics

22/31


Early stopping in stochastic gradient descent

• Neural networks (like all machine learning algorithms) can over-fit the training data

• You can detect over-fitting by measuring the difference in accuracy on the training data and on held-out development data (don't use your test data!)

• Early stopping: stop training when accuracy on the held-out development data starts decreasing
– often over-fitting will have started before the held-out accuracy starts decreasing
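A schematic sketch of early stopping wrapped around a generic training loop. The hooks model.train_one_epoch, model.accuracy and model.snapshot are hypothetical placeholders standing in for whatever model and data are being used.

```python
def train_with_early_stopping(model, train_data, dev_data, max_epochs=100, patience=3):
    """Stop when held-out (dev) accuracy has not improved for `patience` epochs."""
    best_acc, best_model, epochs_since_best = 0.0, None, 0
    for epoch in range(max_epochs):
        model.train_one_epoch(train_data)      # e.g., one pass of SGD (hypothetical hook)
        acc = model.accuracy(dev_data)         # held-out development data, NOT the test set
        if acc > best_acc:
            best_acc, best_model, epochs_since_best = acc, model.snapshot(), 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # dev accuracy has stopped improving
                break
    return best_model
```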

23/31


Regularisation via weight decay

• Regularisation: add a term to the loss function that penalises large connection weights, such as the L2 regulariser

Q(w) = (1/2) ∑_j w_j²

∂Q/∂w_j = w_j

• In SGD, the L2 penalty subtracts a fraction of each weight at each iteration
– also known as a shrinkage method, as it "shrinks" the connection weights
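In an SGD update this amounts to one extra term; a minimal sketch, where lam is the regularisation strength (an illustrative hyper-parameter).

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_loss, eps=0.1, lam=1e-4):
    """One SGD step with an L2 penalty: the lam * w term "shrinks" the weights."""
    return w - eps * (grad_loss + lam * w)

w = np.array([2.0, -3.0, 0.5])
print(sgd_step_with_weight_decay(w, grad_loss=np.zeros(3)))  # pure shrinkage, no data gradient
```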

24/31


Model-averaging via Drop-out

• Drop-out training:
– at training time, during each SGD iteration randomly select half the connection weights, and ignore the others
  – i.e., calculate forward activations, derivatives and updates using just the selected weights
– at test time, use the full network but divide all weights by 2

• This is training a randomly-chosen subnetwork at each iteration
– it forces the network not to rely on any single connection
– each time a data item is seen during training it has a different subnetwork ⇒ encourages the network to learn multiple generalisations about each data item

• Full network is an “average” of randomly chosen subnetworks

• Over-fitting is much less of a problem with drop-out
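A minimal sketch of drop-out applied to a hidden layer. Note that it drops hidden units (the common variant) rather than individual connections, and it scales the activations at test time, which has the same effect on the next layer as halving the outgoing weights as the slide describes; the layer sizes and weights are illustrative.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def hidden_layer(x, A, train=True, drop_prob=0.5, rng=None):
    h = sigmoid(A @ x)
    if train:
        if rng is None:
            rng = np.random.default_rng()
        mask = rng.random(h.shape) >= drop_prob   # keep each unit with probability 0.5
        return h * mask                           # dropped units contribute nothing
    return h * (1.0 - drop_prob)                  # test time: scale by 1/2 instead of dropping

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 6))
x = rng.normal(size=6)
print(hidden_layer(x, A, train=True, rng=rng))    # a random subnetwork's activations
print(hidden_layer(x, A, train=False))            # the full, rescaled network
```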

25/31


Outline

Introduction

Learning weights with Backpropagation

Alternative non-linear activation functions

Avoiding over-learning

Learning low-dimensional representations

Conclusion and other topics

26/31


Auto-encoders learn low-dimensional representations

• An auto-encoder is a neural network that predicts its own input (i.e., learns an identity function)

• Typically auto-encoders have fewer hidden units than visible input/output units
– can also impose a sparsity penalty (e.g., cross-entropy) on the hidden units

⇒ The auto-encoder has to learn generalisations about its inputs
– the hidden units represent these generalisations

[Diagram: auto-encoder with layers Input → Hidden → Input]
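A minimal sketch of the auto-encoder shape: encode to a smaller hidden layer, decode back to the input dimension, and measure the reconstruction error. The weights here are random and illustrative; in practice they would be trained by backpropagation as above.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
n_visible, n_hidden = 10, 3                 # fewer hidden than visible units
W_enc = rng.normal(scale=0.1, size=(n_hidden, n_visible))
W_dec = rng.normal(scale=0.1, size=(n_visible, n_hidden))

x = rng.random(n_visible)                   # an input vector
h = sigmoid(W_enc @ x)                      # low-dimensional representation
x_hat = sigmoid(W_dec @ h)                  # reconstruction of the input
print("reconstruction error:", np.mean((x - x_hat) ** 2))
```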

27/31


word2vec learns low-dimensional word representations

• word2vec learns 500-dimensional word representations by predicting the neighbouring words surrounding each word w in a 5-word window (w−2, w−1, w, w1, w2)

• Has surprising properties, e.g., [Paris] − [France] ≈ [Rome] − [Italy]

• The weight matrices mapping the representation of word w to the representations of the neighbouring words are shared (i.e., position-independent)
– would position-dependent weights be better?

[Diagram: the one-hot encoding of w is mapped to a dense hidden representation, which is used to predict the one-hot encodings of the context words w−2, w−1, w1, w2]

28/31


Outline

Introduction

Learning weights with Backpropagation

Alternative non-linear activation functions

Avoiding over-learning

Learning low-dimensional representations

Conclusion and other topics

29/31


Summary

• A neural network is a large combination of classifiers, arranged in a sequence of layers

• Neural networks are typically trained using stochastic gradient descent

• The gradient can be efficiently calculated using the backpropagation algorithm
– there are a large number of "tricks" for learning

• The activation patterns of the hidden units associated with each input can serve as low-dimensional representations of that input

30/31


Where to go from here

• Boltzmann Machines

• Convolutional Neural Networks (ConvNets) and weight tying:
– CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University

• Recurrent Neural Networks (RNNs) and sequence modelling:
– CS224d: Deep Learning for Natural Language Processing, Stanford University

31/31