
Introduction to Neural Networks

Mark Johnson

Dept of Computing, Macquarie University

Sydney, Australia

September 2015

1/31


Outline

Introduction

Learning weights with Backpropagation

Alternative non-linear activation functions

Avoiding over-learning

Learning low-dimensional representations

Conclusion and other topics

2/31


What are (deep) neural networks?

• Neural networks are functions mapping inputs to outputs
– the inputs are usually feature vectors, encoding pictures, documents, etc.
– the outputs are usually a single value (e.g., a label for the input), but can be structured (e.g., a sentence)

• A neural network is a network of models typically organised into a sequence of layers
– the first layer is the input feature vector
– the output of one layer serves as the input to the next layer
– the output is the last layer

• Deep neural networks have a relatively large number of layers (e.g., 5 or more)

3/31


Logistic regression as a 2-layer neural network

P(Y=1 | x) = σ(∑_j w_j x_j + b) = σ(w · x + b)

σ(v) = 1 / (1 + exp(−v))

• x = (x1, . . . , xm) is the input vector (feature values)

• w = (w1, . . . , wm) is a vector of connection weights (feature weights)

• b is a bias weight
– can be viewed as the weight associated with an "always on" input

[Diagram: a single output unit y connected to inputs x0, …, xm via weights w0, …, wm, and to an always-on input 1 via the bias weight b]
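To make the slide concrete, here is a minimal NumPy sketch of this 2-layer network; the variable names x, w, b match the slide, but the specific numbers are made up for illustration.

```python
import numpy as np

def sigmoid(v):
    """Logistic sigmoid: sigma(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))

def logistic_regression(x, w, b):
    """P(Y=1 | x) = sigma(w . x + b), as on the slide."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative values (not from the slides)
x = np.array([1.0, 0.0, 2.0])    # input feature vector
w = np.array([0.5, -1.0, 0.25])  # connection (feature) weights
b = -0.1                         # bias weight ("always on" input)
print(logistic_regression(x, w, b))
```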

4/31


The logistic sigmoid function

[Plot: the logistic sigmoid y = 1/(1 + e^(−x)) for x from −5 to 5, rising from 0 towards 1]

5/31


Weaknesses of logistic regression

• Logistic regression is a log-linear model, i.e., the log odds are a linear function of the input:

P(Y=1 | x) = 1 / (1 + exp(−w · x))  ⇔  log ( P(Y=1 | x) / P(Y=0 | x) ) = w · x

• The presence of feature xj increases the log odds by wj
⇒ Unable to capture non-linear interactions between the features
– XOR is the classic example of a function that a linear model cannot express
– can a linear function of pixel intensities recognise a cat?

• Neural networks with hidden layers can learn non-linear feature interactions

• Note: a linear model can capture non-linear interactions if the inputs are non-linearly transformed first
– XOR = OR − AND
– Extreme learning consists of linear regression applied to random non-linear transformations of the input
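As a concrete illustration of the XOR = OR − AND remark, the sketch below builds OR and AND as threshold units in a hidden layer and subtracts them in a second layer. The weights are hand-picked for illustration, not learned.

```python
import numpy as np

def step(v):
    # Hard threshold used here for clarity; a steep sigmoid behaves similarly.
    return float(v > 0)

def xor_net(x1, x2):
    x = np.array([x1, x2], dtype=float)
    # Hidden layer: OR and AND units (hand-picked weights and biases).
    h_or  = step(np.dot([1.0, 1.0], x) - 0.5)   # fires if x1 OR x2
    h_and = step(np.dot([1.0, 1.0], x) - 1.5)   # fires if x1 AND x2
    # Output layer: XOR = OR - AND.
    return step(h_or - h_and - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, int(xor_net(x1, x2)))   # prints the XOR truth table
```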

6/31


Hidden nodes

• Hidden nodes are nodes that are neither input nor output nodes

• Each hidden layer provides extra expressive power

• Three layer neural network (ignoring bias nodes):

h_i = σ(∑_j A_ij x_j)
y = σ(∑_j b_j h_j)

or in matrix terms:

h = σ(A x)
y = σ(b · h)

[Diagram: fully-connected network with input layer x0, …, xm, hidden layer h0, …, hn, and a single output unit y (Input → Hidden → Output)]
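A minimal sketch of this three-layer forward pass in the matrix form h = σ(Ax), y = σ(b · h); the layer sizes and random weights are illustrative only.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
m, n = 5, 3                    # input and hidden layer sizes (arbitrary)
A = rng.normal(size=(n, m))    # input-to-hidden weight matrix
b = rng.normal(size=n)         # hidden-to-output weight vector

x = rng.normal(size=m)         # an input feature vector
h = sigmoid(A @ x)             # hidden activations: h = sigma(A x)
y = sigmoid(b @ h)             # output: y = sigma(b . h)
print(h, y)
```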

7/31


One-hot encoding for categorical inputs

• A categorical variable ranges over a fixed set of categories (e.g., words in a fixed vocabulary)

• A one-hot encoding of a categorical variable associates a binary node with each possible value of the categorical variable
– the one-hot encoding of c ∈ {1, . . . , m} is x = (x1, . . . , xm), where xc = 1 and xc′ = 0 for c′ ≠ c

[Diagram: a row of one-hot input nodes labelled cat, dog, chair, table, book, cup, pen, desk, …]

• A sequence of categorical variables can be represented by a sequence of their one-hot encodings

• If xc is the one-hot encoding of category c, then A xc = A·,c, so a one-hot encoding of c corresponds to a table lookup of column c
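The table-lookup observation is easy to check directly: multiplying a weight matrix by a one-hot vector just selects the corresponding column (the sizes below are illustrative).

```python
import numpy as np

m = 8                                    # vocabulary size (illustrative)
c = 3                                    # category index
x_c = np.zeros(m)
x_c[c] = 1.0                             # one-hot encoding of c

A = np.arange(5 * m, dtype=float).reshape(5, m)  # any weight matrix
assert np.allclose(A @ x_c, A[:, c])     # A x_c equals column c of A
```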

8/31


Why is the non-linearity important?

• General form of a neural network:

x^(i) = g^(i)(W^(i) x^(i−1))

where:
– x^(i) are the activations at level i,
– W^(i) is a weight matrix mapping activations at level i − 1 to level i
– g^(i) is a non-linear function (e.g., the sigmoid function)

• E.g., a general 3-layer network has the following form:

y = g^(2)(W^(2) x^(1)) = g^(2)(W^(2) g^(1)(W^(1) x^(0)))

• If g^(1) = g^(2) = I (the identity function), then:

y = W^(2) W^(1) x^(0)

• But the product of two matrices is another matrix!
⇒ A three-layer network is equivalent to a two-layer network if g^(1) is a linear function
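A quick numerical check of the collapse argument: with identity activations, applying W^(1) and then W^(2) gives the same result as applying the single matrix W^(2) W^(1) (random matrices for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 6))   # layer-1 weights
W2 = rng.normal(size=(3, 4))   # layer-2 weights
x0 = rng.normal(size=6)        # input

y_two_layers = W2 @ (W1 @ x0)  # g1 = g2 = identity
y_one_layer  = (W2 @ W1) @ x0  # single combined matrix
assert np.allclose(y_two_layers, y_one_layer)
```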

9/31


Outline

Introduction

Learning weights with Backpropagation

Alternative non-linear activation functions

Avoiding over-learning

Learning low-dimensional representations

Conclusion and other topics

10/31


Learning weights via Stochastic Gradient Descent

• Given labelled training data D = ((x1, y1), . . . , (xn, yn)), we want to find the weights Ŵ with minimum loss:

Ŵ = argmin_W ∑_{i=1}^{n} ℓ(W; x_i, y_i), where:

ℓ(W; x, y) = − log P(y | x)

• ℓ(W; x_i, y_i) is the loss incurred when predicting y_i from x_i using weights W
– probabilities are not greater than 1 ⇒ the loss is never negative
– log(1) = 0 ⇒ a perfect prediction incurs zero loss

• Minimising log-loss is maximum likelihood estimation

• Stochastic gradient descent:
– choose a random training example (x, y) ∈ D
– update W by adding −ε ∂ℓ(W; x, y)/∂W
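A minimal sketch of this SGD loop for the logistic-regression loss ℓ(w; x, y) = −log P(y | x). The gradient formula used below, (σ(w·x) − y) x, follows from the log-sigmoid derivative on the following slides; the toy data, learning rate and iteration count are illustrative.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sgd(data, dim, eps=0.1, iterations=1000, seed=0):
    """data: list of (x, y) pairs with y in {0, 1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for _ in range(iterations):
        x, y = data[rng.integers(len(data))]   # choose a random training example
        grad = (sigmoid(w @ x) - y) * x        # d loss / d w for the log-loss
        w -= eps * grad                        # step in the direction -eps * gradient
    return w

# Toy data; the last feature is a constant 1 acting as the bias input
data = [(np.array([1.0, 0.0, 1.0]), 1), (np.array([0.0, 1.0, 1.0]), 0)]
print(sgd(data, dim=3))
```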

11/31


Derivatives of the sigmoid function

σ(x) = 1 / (1 + e^(−x))

dσ(x)/dx = 1 / (2 + e^(−x) + e^(x)) = σ(x) σ(−x)

[Plot: σ(x) and its derivative dσ/dx for x from −5 to 5]
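The identity dσ/dx = σ(x)σ(−x) is easy to verify numerically with a finite-difference check (the test points and tolerance below are arbitrary).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5, 5, 11)
eps = 1e-6
finite_diff = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)  # numerical derivative
analytic = sigmoid(xs) * sigmoid(-xs)                              # sigma(x) * sigma(-x)
assert np.allclose(finite_diff, analytic, atol=1e-8)
```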

12/31


Derivatives of the log sigmoid function

d log σ(x)/dx = (1 + e^(−x)) / (2 + e^(−x) + e^(x)) = σ(−x) = 1 − σ(x)

[Plot: log σ(x) and its derivative d log σ/dx for x from −5 to 5]

13/31


Calculating derivatives with Backpropagation (1)

[Diagram: a network with input units x^(0)_0, …, x^(0)_3, two hidden layers x^(1) and x^(2) of four units each, weight matrices W^(1), W^(2), W^(3), and a single sigmoid output y]

y = σ(v^(3))
v^(k) = W^(k) x^(k−1), k = 1, . . . , 3
x^(k) = σ(v^(k)), k = 1, 2

14/31


Calculating derivatives with Backpropagation (2)

[Diagram: the same network as on the previous slide]

y = σ(v^(3))
v^(3) = W^(3) x^(2)

∂ log y / ∂W^(3) = (∂ log y / ∂v^(3)) (∂v^(3) / ∂W^(3)) = (1 − σ(v^(3))) x^(2)

∂ log y / ∂x^(2) = (∂ log y / ∂v^(3)) (∂v^(3) / ∂x^(2)) = (1 − σ(v^(3))) W^(3)

∂ log y / ∂v^(2) = (∂ log y / ∂x^(2)) (∂x^(2) / ∂v^(2)) = (1 − σ(v^(3))) W^(3) σ(v^(2)) σ(−v^(2))

15/31


Calculating derivatives with Backpropagation (3)

[Diagram: the same network as on the previous slides]

x^(2) = σ(v^(2))
v^(2) = W^(2) x^(1)

∂ log y / ∂v^(2) = (∂ log y / ∂x^(2)) (∂x^(2) / ∂v^(2)) = (∂ log y / ∂x^(2)) (σ(v^(2)) σ(−v^(2)))

∂ log y / ∂W^(2) = (∂ log y / ∂v^(2)) (∂v^(2) / ∂W^(2)) = (∂ log y / ∂v^(2)) x^(1)

∂ log y / ∂x^(1) = (∂ log y / ∂v^(2)) (∂v^(2) / ∂x^(1)) = (∂ log y / ∂v^(2)) W^(2)

16/31


Calculating derivatives with Backpropagation (4)

• Backpropagation is essentially just repeated application of the "chain rule" for differentiation

• High-level overview of backpropagation:
– Forward pass: propagate activations from the input layer to the output layer
– Backward pass: propagate the error signal from the output layer back to the input layer
– Compute derivatives: the derivatives involve both "forward" and "backward" activations
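A minimal sketch of these two passes for the 3-weight-layer network of the preceding slides, computing ∂ log y/∂W^(k) using the chain-rule formulas above; the layer sizes and random weights are illustrative only.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward_backward(x0, W1, W2, W3):
    """Gradients of log y w.r.t. W1, W2, W3 for y = sigma(W3 sigma(W2 sigma(W1 x0)))."""
    # Forward pass: propagate activations from the input to the output.
    v1 = W1 @ x0; x1 = sigmoid(v1)
    v2 = W2 @ x1; x2 = sigmoid(v2)
    v3 = W3 @ x2; y = sigmoid(v3)            # W3 is a 1 x n matrix

    # Backward pass: propagate the error signal back towards the input.
    d_v3 = 1.0 - y                           # d log y / d v3 = 1 - sigma(v3)
    d_x2 = d_v3 * W3.ravel()                 # d log y / d x2
    d_v2 = d_x2 * sigmoid(v2) * sigmoid(-v2)
    d_x1 = W2.T @ d_v2                       # d log y / d x1
    d_v1 = d_x1 * sigmoid(v1) * sigmoid(-v1)

    # Derivatives combine "backward" signals with "forward" activations.
    dW3 = np.outer(d_v3, x2)
    dW2 = np.outer(d_v2, x1)
    dW1 = np.outer(d_v1, x0)
    return y, dW1, dW2, dW3

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(1, 4))
y, dW1, dW2, dW3 = forward_backward(x0, W1, W2, W3)
```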

17/31


Outline

Introduction

Learning weights with Backpropagation

Alternative non-linear activation functions

Avoiding over-learning

Learning low-dimensional representations

Conclusion and other topics

18/31


tanh activation function (a scaled sigmoid)

• No reason for hidden unit activations to be probabilities
– the output unit is often a predicted value, e.g., a (log) probability

• Idea: allow negative as well as positive hidden unit activations

tanh(v) = (exp(v) − exp(−v)) / (exp(v) + exp(−v)) = 2σ(2v) − 1

σ(v) = 1 / (1 + exp(−v))

[Plot: σ(x) and tanh(x) for x from −5 to 5; tanh ranges from −1 to 1]

19/31


The vanishing gradient problem

• The sigmoid and tanh functions saturate on very large or very small inputs

⇒ Vanishing gradient problem: in deep networks, the gradient is often close to zero for weights close to the input layer

⇒ That part of the network learns very slowly, or not at all

[Plot: σ(x) and dσ/dx for x from −5 to 5, showing the near-zero gradient at both extremes]

20/31


Rectlinear activation function

• The Rectlinear (rectified linear) activation function doesn't saturate (in the positive direction):

RelU(v) = max(v, 0)

• It’s also very fast to calculate!

[Plot: RelU(x) for x from −5 to 5, zero for negative x and linear for positive x]
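A small sketch comparing the gradients of the three activation functions at increasingly large inputs, illustrating why sigmoid and tanh saturate while RelU does not (it uses the derivative identities from the earlier slides plus the standard d tanh/dx = 1 − tanh²).

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def d_sigmoid(v):
    return sigmoid(v) * sigmoid(-v)      # identity from the earlier slide

def d_tanh(v):
    return 1.0 - np.tanh(v) ** 2

def d_relu(v):
    return 1.0 if v > 0 else 0.0         # RelU gradient: 1 for positive inputs

for v in (0.5, 5.0, 20.0):
    print(f"v={v:5.1f}  dsigma={d_sigmoid(v):.2e}  dtanh={d_tanh(v):.2e}  dRelU={d_relu(v):.0f}")
```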

21/31


Outline

Introduction

Learning weights with Backpropagation

Alternative non-linear activation functions

Avoiding over-learning

Learning low-dimensional representations

Conclusion and other topics

22/31


Early stopping in stochastic gradient descent

• Neural networks (like all machine learning algorithms) can over-fit the training data

• You can detect over-fitting by measuring the difference in accuracy on the training data and on held-out development data (don't use your test data!)

• Early stopping: stop training when accuracy on the held-out development data starts decreasing
– often over-fitting will have started before the held-out accuracy starts decreasing
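A schematic sketch of early stopping wrapped around a generic training loop. The hooks model.train_one_epoch, model.accuracy and model.snapshot are hypothetical placeholders standing in for whatever model and data are being used.

```python
def train_with_early_stopping(model, train_data, dev_data, max_epochs=100, patience=3):
    """Stop when held-out (dev) accuracy has not improved for `patience` epochs."""
    best_acc, best_model, epochs_since_best = 0.0, None, 0
    for epoch in range(max_epochs):
        model.train_one_epoch(train_data)      # e.g., one pass of SGD (hypothetical hook)
        acc = model.accuracy(dev_data)         # held-out development data, NOT the test set
        if acc > best_acc:
            best_acc, best_model, epochs_since_best = acc, model.snapshot(), 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # dev accuracy has stopped improving
                break
    return best_model
```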

23/31


Regularisation via weight decay

• Regularisation: add a term to the loss function that penalises large connection weights, such as the L2 regulariser

Q(w) = (1/2) ∑_j w_j²

∂Q/∂w_j = w_j

• In SGD, the L2 penalty subtracts a fraction of each weight at each iteration
– also known as a shrinkage method, as it "shrinks" the connection weights
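In an SGD update this amounts to one extra term; a minimal sketch, where lam is the regularisation strength (an illustrative hyper-parameter).

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_loss, eps=0.1, lam=1e-4):
    """One SGD step with an L2 penalty: the lam * w term "shrinks" the weights."""
    return w - eps * (grad_loss + lam * w)

w = np.array([2.0, -3.0, 0.5])
print(sgd_step_with_weight_decay(w, grad_loss=np.zeros(3)))  # pure shrinkage, no data gradient
```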

24/31


Model-averaging via Drop-out

• Drop-out training:
– at training time, during each SGD iteration randomly select half the connection weights, and ignore the others
  – i.e., calculate forward activations, derivatives and updates using just the selected weights
– at test time, use the full network but divide all weights by 2

• This is training a randomly-chosen subnetwork at each iteration
– it forces the network not to rely on any single connection
– each time a data item is seen during training it has a different subnetwork ⇒ encourages the network to learn multiple generalisations about each data item

• Full network is an “average” of randomly chosen subnetworks

• Over-fitting is much less of a problem with drop-out
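A minimal sketch of drop-out applied to a hidden layer. Note that it drops hidden units (the common variant) rather than individual connections, and it scales the activations at test time, which has the same effect on the next layer as halving the outgoing weights as the slide describes; the layer sizes and weights are illustrative.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def hidden_layer(x, A, train=True, drop_prob=0.5, rng=None):
    h = sigmoid(A @ x)
    if train:
        if rng is None:
            rng = np.random.default_rng()
        mask = rng.random(h.shape) >= drop_prob   # keep each unit with probability 0.5
        return h * mask                           # dropped units contribute nothing
    return h * (1.0 - drop_prob)                  # test time: scale by 1/2 instead of dropping

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 6))
x = rng.normal(size=6)
print(hidden_layer(x, A, train=True, rng=rng))    # a random subnetwork's activations
print(hidden_layer(x, A, train=False))            # the full, rescaled network
```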

25/31


Outline

Introduction

Learning weights with Backpropagation

Alternative non-linear activation functions

Avoiding over-learning

Learning low-dimensional representations

Conclusion and other topics

26/31


Auto-encoders learn low-dimensional representations

• An auto-encoder is a neural network that predicts its own input (i.e., learns an identity function)

• Typically auto-encoders have fewer hidden units than visible input/output units
– can also impose a sparsity penalty (e.g., cross-entropy) on the hidden units

⇒ The auto-encoder has to learn generalisations about its inputs
– the hidden units represent these generalisations

[Diagram: auto-encoder with layers Input → Hidden → Input]
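A minimal sketch of the auto-encoder shape: encode to a smaller hidden layer, decode back to the input dimension, and measure the reconstruction error. The weights here are random and illustrative; in practice they would be trained by backpropagation as above.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
n_visible, n_hidden = 10, 3                 # fewer hidden than visible units
W_enc = rng.normal(scale=0.1, size=(n_hidden, n_visible))
W_dec = rng.normal(scale=0.1, size=(n_visible, n_hidden))

x = rng.random(n_visible)                   # an input vector
h = sigmoid(W_enc @ x)                      # low-dimensional representation
x_hat = sigmoid(W_dec @ h)                  # reconstruction of the input
print("reconstruction error:", np.mean((x - x_hat) ** 2))
```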

27/31


word2vec learns low-dimensional word representations

• word2vec learns 500-dimensional word representations by predicting the neighbouring words surrounding each word w in a 5-word window (w−2, w−1, w, w1, w2)

• Has surprising properties, e.g., [Paris] − [France] ≈ [Rome] − [Italy]

• The weight matrices mapping the representation of word w to the representations of the neighbouring words are shared (i.e., position-independent)
– would position-dependent weights be better?

[Diagram: the one-hot encoding of w is mapped to a dense hidden representation, which is used to predict the one-hot encodings of the context words w−2, w−1, w1, w2]

28/31


Outline

Introduction

Learning weights with Backpropagation

Alternative non-linear activation functions

Avoiding over-learning

Learning low-dimensional representations

Conclusion and other topics

29/31


Summary

• A neural network is a large combination of classifiers, arranged in a sequence of layers

• Neural networks are typically trained using stochastic gradient descent

• The gradient can be efficiently calculated using the backpropagation algorithm
– there are a large number of "tricks" for learning

• The activation patterns of the hidden units associated with each input can serve as low-dimensional representations of that input

30/31


Where to go from here

• Boltzmann Machines

• Convolutional Neural Networks (ConvNets) and weight tying:
– CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University

• Recurrent Neural Networks (RNNs) and sequence modelling:
– CS224d: Deep Learning for Natural Language Processing, Stanford University

31/31