Introduction to Neural Networks

Mark Johnson

Dept of Computing, Macquarie University

Sydney, Australia

September 2015


Outline

Introduction

Learning weights with Backpropagation

Alternative non-linear activation functions

Avoiding over-learning

Learning low-dimensional representations

Conclusion and other topics


What are (deep) neural networks?

• Neural networks are functions mapping inputs to outputs
  – the inputs are usually feature vectors, encoding pictures, documents, etc.
  – the outputs are usually a single value (e.g., a label for the input), but can be structured (e.g., a sentence)

• A neural network is a network of models, typically organised into a sequence of layers
  – the first layer is the input feature vector
  – the output of one layer serves as the input to the next layer
  – the output is the last layer

• Deep neural networks have a relatively large number of layers (e.g., 5 or more)

Logistic regression as a 2-layer neural network

P(Y=1 | x) = σ(∑_j wj xj + b) = σ(w · x + b)

where σ(v) = 1 / (1 + exp(−v))

• x = (x1, . . . , xm) is the input vector (feature values)

• w = (w1, . . . , wm) is a vector of connection weights (feature weights)

• b is a bias weight
  – can be viewed as the weight associated with an "always-on" input

[Figure: a single sigmoid output unit y connected to the inputs x1, …, xm via weights w1, …, wm, and to an always-on input 1 via the bias weight b]
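As a concrete illustration, here is a minimal NumPy sketch of this model; the particular values of x, w and b are made up for the example.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    x = np.array([0.5, -1.0, 2.0])   # input feature vector
    w = np.array([0.1, 0.4, -0.3])   # connection (feature) weights
    b = 0.2                          # bias weight ("always-on" input)

    p = sigmoid(w @ x + b)           # P(Y=1 | x)
    print(p)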

The logistic sigmoid function

[Plot: the logistic sigmoid y = 1 / (1 + e−x) for x from −5 to 5, rising from 0 to 1]

Weaknesses of logistic regression

• Logistic regression is a log-linear model, i.e., the log odds are a linear function of the input:

P(Y=1 | x) = 1 / (1 + exp(−w · x))  ⇔  log [ P(Y=1 | x) / P(Y=0 | x) ] = w · x

• The presence of feature xj increases the log odds by wj
⇒ Unable to capture non-linear interactions between the features
  – XOR is the classic example of a function that a linear model cannot express
  – can a linear function of pixel intensities recognise a cat?

• Neural networks with hidden layers can learn non-linear feature interactions

• Note: a linear model can capture non-linear interactions if the inputs are non-linearly transformed first
  – XOR = OR − AND (a quick check follows below)
  – Extreme learning consists of linear regression applied to random non-linear transformations of the input
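The XOR = OR − AND identity is easy to check directly; the sketch below assumes Boolean inputs encoded as 0/1.

    import numpy as np

    a = np.array([0, 0, 1, 1])
    b = np.array([0, 1, 0, 1])

    OR  = np.maximum(a, b)      # 0 1 1 1
    AND = a * b                 # 0 0 0 1
    XOR = OR - AND              # 0 1 1 0

    assert np.array_equal(XOR, a ^ b)   # matches bitwise XOR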

Hidden nodes

• Hidden nodes are nodes that are neither input nor output nodes

• Each hidden layer provides extra expressive power

• Three-layer neural network (ignoring bias nodes):

hi = σ(∑_j Aij xj)

y = σ(∑_j bj hj)

or in matrix terms:

h = σ(Ax)
y = σ(b · h)

[Figure: input nodes x0, …, xm fully connected to hidden nodes h0, …, hn, which connect to a single output node y; the layers are labelled Input, Hidden and Output]
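A minimal NumPy sketch of this forward pass, h = σ(Ax) and y = σ(b · h); the layer sizes and random weights are illustrative only.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    rng = np.random.default_rng(0)
    m, n = 5, 3                      # number of input and hidden nodes
    x = rng.normal(size=m)           # input feature vector
    A = rng.normal(size=(n, m))      # input-to-hidden weights
    b = rng.normal(size=n)           # hidden-to-output weights

    h = sigmoid(A @ x)               # hidden activations
    y = sigmoid(b @ h)               # output
    print(h, y)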

One-hot encoding for categorical inputs

• A categorical variable ranges over a fixed set of categories (e.g., words in a fixed vocabulary)

• A one-hot encoding of a categorical variable associates a binary node for each possible value of the categorical variable
  – the one-hot encoding of c ∈ {1, . . . , m} is x = (x1, . . . , xm), where xc = 1 and xc′ = 0 for c′ ≠ c

cat dog chair table book cup pen desk . . .

• A sequence of categorical variables can be represented by a sequence of their 1-hot encodings

• If xc is the one-hot encoding of category c, then A xc = A·,c, so a 1-hot encoding of c corresponds to a table-lookup of column c (a quick check follows below)
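A quick NumPy check of the column-lookup correspondence A xc = A·,c; the vocabulary size and the matrix A are made up, and indices are 0-based here.

    import numpy as np

    n, m = 4, 8                        # embedding size, vocabulary size
    A = np.arange(n * m, dtype=float).reshape(n, m)

    c = 2                              # category index (0-based)
    x_c = np.zeros(m)
    x_c[c] = 1.0                       # one-hot encoding of c

    assert np.allclose(A @ x_c, A[:, c])   # matrix product = column lookup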

Why is the non-linearity important?

• General form of a neural network:

x(i) = g(i)(W(i) x(i−1))

where:
  – x(i) are the activations at level i
  – W(i) is a weight matrix mapping activations at level i − 1 to level i
  – g(i) is a non-linear function (e.g., the sigmoid function)

• E.g., a general 3-layer network has the following form:

y = g(2)(W(2) x(1)) = g(2)(W(2) g(1)(W(1) x(0)))

• If g(1) = g(2) = I (the identity function), then:

y = W(2) W(1) x(0)

• But the product of two matrices is another matrix!
⇒ A three-layer network is equivalent to a two-layer network if g(1) is a linear function (a numerical check follows below)
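A numerical check that stacking two linear layers yields another linear layer; the shapes and random matrices are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    W1 = rng.normal(size=(4, 6))       # first-layer weights
    W2 = rng.normal(size=(3, 4))       # second-layer weights
    x0 = rng.normal(size=6)            # input

    two_layers = W2 @ (W1 @ x0)        # y with g(1) = g(2) = identity
    one_layer  = (W2 @ W1) @ x0        # a single linear layer
    assert np.allclose(two_layers, one_layer)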


Learning weights via Stochastic Gradient Descent

• Given labelled training data D = ((x1, y1), . . . , (xn, yn)), we want to find weights Ŵ with minimum loss:

Ŵ = argmin_W ∑_{i=1}^n ℓ(W; xi, yi),  where

ℓ(W; x, y) = − log P(y | x)

• ℓ(W; xi, yi) is the loss incurred when predicting yi from xi using weights W
  – probabilities are not greater than 1 ⇒ loss is never negative
  – log(1) = 0 ⇒ perfect prediction incurs zero loss

• Minimising log-loss is maximum likelihood estimation

• Stochastic gradient descent (a code sketch follows below):
  – choose a random training example (x, y) ∈ D
  – update W by adding −ε ∂ℓ(W; x, y)/∂W
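A sketch of this procedure for the logistic-regression model of the earlier slide, for which the log-loss gradient is ∂ℓ/∂w = (σ(w · x) − y) x; the synthetic data and learning rate are made up.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                    # training inputs
    y = (X @ np.array([1.0, -2.0, 0.5]) > 0) * 1     # 0/1 labels
    w = np.zeros(3)                                  # weights (bias omitted)
    eps = 0.1                                        # learning rate

    for step in range(1000):
        i = rng.integers(len(X))                     # random training example
        grad = (sigmoid(w @ X[i]) - y[i]) * X[i]     # ∂ℓ(w; x, y)/∂w
        w -= eps * grad                              # add −ε times the gradient

    print(w)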

Derivatives of the sigmoid function

σ(x) = 1 / (1 + e−x)

dσ(x)/dx = 1 / (2 + e−x + ex) = σ(x) σ(−x)

[Plot: σ(x) and its derivative dσ/dx for x from −5 to 5]
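A quick finite-difference check of dσ/dx = σ(x) σ(−x) at a few points.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    x = np.linspace(-5, 5, 11)
    h = 1e-6
    numeric  = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    analytic = sigmoid(x) * sigmoid(-x)
    assert np.allclose(numeric, analytic, atol=1e-6)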

Derivatives of the log sigmoid function

d log σ(x)/dx = (1 + e−x) / (2 + e−x + ex) = σ(−x) = 1 − σ(x)

[Plot: log σ(x) and d log σ/dx for x from −5 to 5]

Calculating derivatives with Backpropagation (1)

[Figure: a feed-forward network with four input nodes x(0)0, …, x(0)3, two hidden layers x(1) and x(2) of four sigmoid units each, weight matrices W(1), W(2), W(3), and a single sigmoid output node y]

y = σ(v(3))

v(k) = W(k) x(k−1),  k = 1, . . . , 3

x(k) = σ(v(k)),  k = 1, 2

Calculating derivatives with Backpropagation (2)

[Figure: the same network as in part (1)]

y = σ(v(3)),  v(3) = W(3) x(2)

∂ log y / ∂W(3) = (∂ log y / ∂v(3)) (∂v(3) / ∂W(3)) = (1 − σ(v(3))) x(2)

∂ log y / ∂x(2) = (∂ log y / ∂v(3)) (∂v(3) / ∂x(2)) = (1 − σ(v(3))) W(3)

∂ log y / ∂v(2) = (∂ log y / ∂x(2)) (∂x(2) / ∂v(2)) = (1 − σ(v(3))) W(3) σ(v(2)) σ(−v(2))

Calculating derivatives with Backpropagation (3)

[Figure: the same network as in part (1)]

x(2) = σ(v(2)),  v(2) = W(2) x(1)

∂ log y / ∂v(2) = (∂ log y / ∂x(2)) (∂x(2) / ∂v(2)) = (∂ log y / ∂x(2)) σ(v(2)) σ(−v(2))

∂ log y / ∂W(2) = (∂ log y / ∂v(2)) (∂v(2) / ∂W(2)) = (∂ log y / ∂v(2)) x(1)

∂ log y / ∂x(1) = (∂ log y / ∂v(2)) (∂v(2) / ∂x(1)) = (∂ log y / ∂v(2)) W(2)

Calculating derivatives with Backpropagation (4)

• Backpropagation is essentially just repeated application of the "chain rule" for differentiation

• High-level overview of backpropagation (a code sketch for the example network follows below):
  – Forward pass: propagate activations from the input layer to the output layer
  – Backward pass: propagate the error signal from the output layer back to the input layer
  – Compute derivatives: derivatives involve both "forward" and "backward" activations
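A NumPy sketch of both passes for the example network of the previous slides, computing ∂ log y / ∂W(k); the layer sizes, weights and input are made up, and W3 is a 1 × 4 matrix because there is a single output.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    rng = np.random.default_rng(0)
    x0 = rng.normal(size=4)            # input layer x(0)
    W1 = rng.normal(size=(4, 4))
    W2 = rng.normal(size=(4, 4))
    W3 = rng.normal(size=(1, 4))

    # Forward pass: propagate activations from input to output
    v1 = W1 @ x0; x1 = sigmoid(v1)
    v2 = W2 @ x1; x2 = sigmoid(v2)
    v3 = W3 @ x2; y  = sigmoid(v3)

    # Backward pass: propagate the error signal back towards the input
    d_v3 = 1.0 - y                               # ∂ log y / ∂v(3)
    dW3  = np.outer(d_v3, x2)                    # ∂ log y / ∂W(3)
    d_x2 = W3.T @ d_v3                           # ∂ log y / ∂x(2)
    d_v2 = d_x2 * sigmoid(v2) * sigmoid(-v2)     # ∂ log y / ∂v(2)
    dW2  = np.outer(d_v2, x1)                    # ∂ log y / ∂W(2)
    d_x1 = W2.T @ d_v2                           # ∂ log y / ∂x(1)
    d_v1 = d_x1 * sigmoid(v1) * sigmoid(-v1)     # ∂ log y / ∂v(1)
    dW1  = np.outer(d_v1, x0)                    # ∂ log y / ∂W(1)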


tanh activation function (a scaled sigmoid)

• No reason for hidden unit activations to be probabilities
  – output unit is often a predicted value, e.g., a (log) probability

• Idea: allow negative as well as positive hidden unit activations

tanh(v) = (exp(v) − exp(−v)) / (exp(v) + exp(−v)) = 2σ(2v) − 1

σ(v) = 1 / (1 + exp(−v))

[Plot: σ(x) and tanh(x) for x from −5 to 5; tanh ranges from −1 to 1]
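A quick numerical check of the identity tanh(v) = 2σ(2v) − 1.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    v = np.linspace(-5, 5, 101)
    assert np.allclose(np.tanh(v), 2 * sigmoid(2 * v) - 1)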

The vanishing gradient problem

• The sigmoid and tanh functions saturate on very large or very small inputs

⇒ Vanishing gradient problem: in deep networks, the gradient is often close to zero for weights close to the input layer

⇒ That part of the network learns very slowly, or not at all

[Plot: σ(x) and dσ/dx for x from −5 to 5; the derivative is close to zero for large |x|]

Rectlinear activation function

• The Rectlinear (rectified linear, ReLU) activation function doesn't saturate (in the positive direction):

ReLU(v) = max(v, 0)

• It’s also very fast to calculate!

[Plot: ReLU(x) for x from −5 to 5]


Early stopping in stochastic gradient descent

• Neural networks (like all machine learning algorithms) can over-fit the training data

• You can detect over-fitting by measuring the difference in accuracy on training and held-out development data (don't use your test data!)

• Early stopping: stop training when accuracy on held-out development data starts decreasing (a sketch follows below)
  – often over-fitting will have started before held-out accuracy starts decreasing
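A self-contained sketch of early stopping for the logistic-regression SGD shown earlier; the synthetic data, train/development split and learning rate are all made up.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 3))
    y = (X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=300) > 0) * 1
    Xtr, ytr = X[:200], y[:200]              # training data
    Xdev, ydev = X[200:], y[200:]            # held-out development data

    w, eps = np.zeros(3), 0.1
    best_acc, best_w = -1.0, w.copy()
    for epoch in range(50):
        for i in rng.permutation(len(Xtr)):                       # one SGD pass
            w -= eps * (sigmoid(w @ Xtr[i]) - ytr[i]) * Xtr[i]
        acc = np.mean((sigmoid(Xdev @ w) > 0.5) == ydev)          # held-out accuracy
        if acc > best_acc:
            best_acc, best_w = acc, w.copy()                      # keep the best weights so far
        else:
            break                                                 # accuracy started decreasing

    print(best_acc)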

Regularisation via weight decay

• Regularisation: add a term to the loss function that penalises large connection weights, such as the L2 regulariser

Q(w) = ½ ∑_j wj²

∂Q/∂wj = wj

• In SGD, the L2 penalty subtracts a fraction of each weight at each iteration (a sketch of the update follows below)
  – also known as a shrinkage method, as it "shrinks" the connection weights
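A minimal sketch of the resulting SGD update: the gradient of the penalised loss ℓ + λQ is grad + λw, so each step shrinks the weights by a factor (1 − ελ). The regularisation strength λ is a made-up hyperparameter.

    import numpy as np

    def sgd_step_with_weight_decay(w, grad, eps=0.1, lam=0.01):
        # update for loss + λ·Q(w): equivalent to w ← (1 − ε·λ) w − ε·grad
        return w - eps * (grad + lam * w)

    w = np.array([1.0, -2.0, 0.5])
    grad = np.zeros(3)                           # with zero data gradient,
    print(sgd_step_with_weight_decay(w, grad))   # the weights simply shrink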

Model-averaging via Drop-out

• Drop-out training:
  – at training time, during each SGD iteration randomly select half the connection weights, and ignore the others
    – i.e., calculate forward activations, derivatives and updates just using the selected weights
  – at test time, use the full network but divide all weights by 2

• This is training a randomly-chosen subnetwork at each iteration (a code sketch follows below)
  – it forces the network not to rely on any single connection
  – each time a data item is seen during training it has a different subnetwork ⇒ encourages the network to learn multiple generalisations about each data item

• Full network is an “average” of randomly chosen subnetworks

• Over-fitting is much less of a problem with drop-out

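A minimal sketch of drop-out for a single hidden layer. Note that this sketch drops hidden units rather than individual connections (the slide describes dropping connections; unit-level drop-out is the more common implementation); the weights and input are made up.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)
    A = rng.normal(size=(4, 5))        # input-to-hidden weights
    b = rng.normal(size=4)             # hidden-to-output weights

    # training time: each hidden unit is kept with probability 1/2
    mask = rng.integers(0, 2, size=4)              # random 0/1 mask
    h = sigmoid(A @ x) * mask                      # dropped units contribute nothing
    y_train = sigmoid(b @ h)

    # test time: full network, but the outgoing weights are divided by 2
    y_test = sigmoid((b / 2) @ sigmoid(A @ x))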


Auto-encoders learn low-dimensional representations

• An auto-encoder is a neural network that predicts its own input (i.e., learns an identity function)

• Typically auto-encoders have fewer hidden units than visible input/output units
  – can also impose a sparsity penalty (e.g., cross-entropy) on hidden units

⇒ Auto-encoder has to learn generalisations about its inputs (a training sketch follows below)
  – hidden units represent these generalisations

[Figure: an auto-encoder whose input layer feeds a smaller hidden layer, which reconstructs the input; the layers are labelled Input, Hidden, Input]
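A sketch of training a one-hidden-layer auto-encoder by SGD to reconstruct its input through a smaller hidden layer; the sizes, synthetic data and learning rate are made up, and squared reconstruction error is used for simplicity.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))                # inputs are also the targets
    W_enc = 0.1 * rng.normal(size=(3, 8))        # 8 inputs -> 3 hidden units
    W_dec = 0.1 * rng.normal(size=(8, 3))        # 3 hidden units -> 8 outputs
    eps = 0.01

    for step in range(5000):
        x = X[rng.integers(len(X))]              # random training example
        h = sigmoid(W_enc @ x)                   # low-dimensional representation
        x_hat = W_dec @ h                        # (linear) reconstruction
        err = x_hat - x                          # gradient of ½‖x_hat − x‖² w.r.t. x_hat
        dW_dec = np.outer(err, h)
        dh = W_dec.T @ err
        dW_enc = np.outer(dh * h * (1 - h), x)   # σ′(v) = h (1 − h)
        W_dec -= eps * dW_dec
        W_enc -= eps * dW_enc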

word2vec learns low-dimensional word representations

• word2vec learns 500-dimensional word representations by predicting the neighbouring words surrounding each word w in a 5-word window (w−2, w−1, w, w1, w2)

• Has surprising properties, e.g., [Paris] − [France] ≈ [Rome] − [Italy]

• Weight matrices mapping the representation of word w to representations of neighbouring words are shared (i.e., position-independent)
  – would position-dependent weights be better?

[Figure: the one-hot encoding of w is mapped to a dense representation, which is used to predict the one-hot encodings of the neighbouring words w−2, w−1, w1, w2]


Summary

• A neural network is a large combination of classifiers, arranged in a sequence of layers

• Neural networks are typically trained using stochastic gradient descent

• The gradient can be efficiently calculated using the backpropagation algorithm
  – there are a large number of "tricks" for learning

• The activation patterns of hidden units associated with each input can serve as low-dimensional representations of that input

Where to go from here

• Boltzmann Machines

• Convolutional Neural Networks (ConvNets) and weight tying:
  – CS231n: Convolutional neural networks for visual recognition, Stanford University

• Recurrent Neural Networks (RNNs) and sequence modelling:
  – CS224d: Deep learning for natural language processing, Stanford University
