
Outlines

Overview
Introduction
Linear Algebra

Probability

Linear Regression 1

Linear Regression 2

Linear Classification 1

Linear Classification 2

Kernel Methods
Sparse Kernel Methods

Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis

Autoencoders
Graphical Models 1

Graphical Models 2

Graphical Models 3

Sampling

Sequential Data 1

Sequential Data 2


Introduction to Statistical Machine Learning

Cheng Soon Ong & Christian Walder

Machine Learning Research Group
Data61 | CSIRO

and
College of Engineering and Computer Science

The Australian National University

Canberra
February – June 2018

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")


Part XXII

Neural Network 3


Number of layers

The expression of a function is compact when it has few computational elements, i.e. few degrees of freedom that need to be tuned by learning.
For a fixed number of training examples, we expect that compact representations of the target function yield better generalisation.
Example representations:

affine operations, sigmoid ⇒ logistic regression has depth 1, fixed number of units (a.k.a. neurons)
fixed kernel, affine operations ⇒ kernel machine (e.g. SVM) has two levels, with as many units as data points
stacked neural network of multiple “linear transformation followed by a non-linearity” ⇒ deep neural network has arbitrary depth with arbitrary number of units per layer


Easier to represent with more layers

An old result: functions that can be compactly represented by a depth-k architecture might require an exponential number of computational elements to be represented by a depth-(k − 1) architecture.
Example: the d-bit parity function

$$\mathrm{parity} : (b_1, \ldots, b_d) \in \{0,1\}^d \mapsto \begin{cases} 1 & \text{if } \sum_{i=1}^{d} b_i \text{ is even} \\ 0 & \text{otherwise} \end{cases}$$

Theorem: d-bit parity circuits of depth 2 have exponential size.
Analogous claim in modern deep learning: “Shallow networks require exponentially more parameters for the same number of modes” — Canadian deep learning mafia.


Recall: Multi-layer Neural Network Architecture

$$y_k(\mathbf{x}, \mathbf{w}) = g\left( \sum_{j=0}^{M} w^{(2)}_{kj} \, h\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right) \right)$$

where w now contains all weight and bias parameters.

[Figure: two-layer network diagram with inputs x_0, …, x_D, hidden units z_0, …, z_M, outputs y_1, …, y_K, and weights w^{(1)}_{MD}, w^{(2)}_{KM}, w^{(2)}_{10}.]

We could add more hidden layers
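As a concrete illustration of this forward pass, here is a minimal NumPy sketch; the layer sizes, the tanh hidden nonlinearity, the identity output g, and the explicit bias vectors (instead of the absorbed x_0 = 1, z_0 = 1 terms) are all assumptions made for the example.

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh, g=lambda a: a):
    """Two-layer network: y_k = g(sum_j W2[k, j] * h(sum_i W1[j, i] * x_i))."""
    z = h(W1 @ x + b1)   # hidden unit activations
    y = g(W2 @ z + b2)   # output unit activations
    return y

# Example with D = 4 inputs, M = 3 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
D, M, K = 4, 3, 2
W1, b1 = rng.normal(size=(M, D)), np.zeros(M)
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)
x = rng.normal(size=D)
print(forward(x, W1, b1, W2, b2))
```

Adding more hidden layers simply means composing further h(W @ · + b) stages before the output nonlinearity g.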


Empirical observations - pre 2006

Deep architectures get stuck in local minima or plateaus.
As the architecture gets deeper, it is more difficult to obtain good generalisation.
Hard to initialise random weights well.
1 or 2 hidden layers seem to perform better.
2006: unsupervised pre-training, which finds a distributed representation.


Deep representation - intuition

Bengio, "Learning Deep Architectures for AI", 2009


Deep representation - practice

AlexNet / VGG-F network visualized by mNeuron.


Recall: PCA

Idea: linearly project the data points onto a lower dimensional subspace such that
the variance of the projected data is maximised, or
the distortion error from the projection is minimised.
Both formulations lead to the same result.
Need to find the lower dimensional subspace, called the principal subspace.

[Figure: data points x_n in the (x_1, x_2) plane projected onto the principal direction u_1, with projections x̃_n.]


Multiple PCA layers? - linear transforms

Principal Component Analysis is a linear transformation (because it is a projection).
The composite of two linear transformations is linear.
Linear transformations M : ℝ^m → ℝ^n are matrices.
Let S and T be matrices of appropriate dimension such that ST is defined:

$$ST(X + X') = ST(X) + ST(X')$$

Similarly for multiplication with a scalar.
⇒ multiple PCA layers pointless


Multiple PCA layers? - projection

Let X^T X = U Λ U^T be the eigenvalue decomposition of the covariance matrix (what is assumed about the mean?).
Define U_k to be the matrix of the first k columns of U, for the k largest eigenvalues. Define Λ_k similarly.
Consider the features formed by projecting onto the principal components

$$Z = X U_k$$

We perform PCA a second time, Z^T Z = V \Lambda_Z V^T. By the definition of Z and X^T X, and the orthogonality of U,

$$Z^T Z = (X U_k)^T (X U_k) = U_k^T X^T X U_k = U_k^T U \Lambda U^T U_k = \Lambda_k$$

Hence Λ_Z = Λ_k and V is the identity, therefore the second PCA has no effect.
⇒ again, multiple PCA layers pointless
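A quick numerical check of this argument; the synthetic data, the choice k = 3, and the eigendecomposition-based pca_project helper below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X = X - X.mean(axis=0)                      # centred data (the assumed zero mean)

def pca_project(A, k):
    """Project A onto the k eigenvectors of A^T A with the largest eigenvalues."""
    evals, evecs = np.linalg.eigh(A.T @ A)  # eigenvalues in ascending order
    Uk = evecs[:, ::-1][:, :k]              # top-k eigenvectors
    return A @ Uk

Z = pca_project(X, k=3)    # first PCA layer
Z2 = pca_project(Z, k=3)   # second PCA layer applied to the projected features

# Z^T Z is already diagonal, so the second PCA can only reorder / flip features;
# the representation (e.g. inner products between data points) is unchanged.
print(np.round(Z.T @ Z, 2))                        # (numerically) diagonal
print(np.allclose(Z2 @ Z2.T, Z @ Z.T, atol=1e-6))  # True: nothing new learned
```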


Autoencoder

An autoencoder is trained to encode the input x into some representation c(x) so that the input can be reconstructed from that representation.
The target output of the autoencoder is the autoencoder input itself.
With one linear hidden layer and the mean squared error criterion, the k hidden units learn to project the input onto the span of the first k principal components of the data.
If the hidden layer is nonlinear, the autoencoder behaves differently from PCA, with the ability to capture multimodal aspects of the input distribution.
Let f be the decoder. We want to minimise the reconstruction error

$$\sum_{n=1}^{N} \ell\big(x_n, f(c(x_n))\big)$$
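For illustration, the sketch below defines a one-hidden-layer encoder c and linear decoder f in NumPy and evaluates the squared-error reconstruction objective on random data; the layer sizes, the sigmoid nonlinearity and the untrained random weights are assumptions, and no training loop is shown.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D, K = 20, 5                                # input dimension, code dimension
W_enc, b_enc = rng.normal(scale=0.1, size=(K, D)), np.zeros(K)
W_dec, b_dec = rng.normal(scale=0.1, size=(D, K)), np.zeros(D)

def c(x):                                   # encoder: x -> code c(x)
    return sigmoid(W_enc @ x + b_enc)

def f(code):                                # decoder: code -> reconstruction
    return W_dec @ code + b_dec

X = rng.normal(size=(100, D))               # N = 100 training examples
# Reconstruction error: sum_n l(x_n, f(c(x_n))) with squared-error loss l
recon_error = sum(np.sum((x - f(c(x))) ** 2) for x in X)
print(recon_error)
```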


Cost function

Recall: f(c(x)) is the reconstruction produced by the network.
Minimise the negative log likelihood of the reconstruction, given the encoding c(x):

$$RE = -\log P(x \mid c(x))$$

If x | c(x) is Gaussian, we recover the familiar squared error.
If the inputs x_i are either binary or considered to be binomial probabilities, then the loss function is the cross entropy

$$-\log P(x \mid c(x)) = -\sum_i \big[ x_i \log f_i(c(x)) + (1 - x_i) \log(1 - f_i(c(x))) \big]$$

where f_i(·) is the ith component of the decoder.
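A direct transcription of this cross-entropy reconstruction loss (a sketch; the clipping constant eps is an added numerical-stability assumption).

```python
import numpy as np

def cross_entropy_reconstruction(x, x_hat, eps=1e-12):
    """-log P(x | c(x)) for binary / binomial-probability inputs.

    x     : array of targets in [0, 1]
    x_hat : decoder outputs f(c(x)), also in (0, 1)
    """
    x_hat = np.clip(x_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

print(cross_entropy_reconstruction(np.array([1.0, 0.0, 1.0]),
                                   np.array([0.9, 0.2, 0.7])))
```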


Undercomplete Autoencoder

Consider a small number of hidden units.
c(x) is viewed as a lossy compression of x.
Cannot have small loss for all x, so focus on the training examples.
Hope the code c(x) is a distributed representation that captures the main factors of variation in the data.


Stacking autoencoders

Let c_j and f_j be the encoder and corresponding decoder of the jth layer.
Let z_j be the representation after the encoder c_j.
We can define multiple layers of autoencoders recursively.
For example, let z_1 = c_1(x) and z_2 = c_2(z_1); the corresponding decoding is given by f_1(f_2(z_2)).
Because of non-linear activation functions, the latent feature z_2 can capture more complex patterns than z_1.
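A minimal sketch of this recursion with two untrained tanh encoder/decoder pairs; the layer sizes and random weights are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H1, H2 = 16, 8, 4          # input dim, first code dim, second code dim

# Encoder / decoder weights for two stacked autoencoder layers (untrained here)
W1, W1d = rng.normal(scale=0.1, size=(H1, D)), rng.normal(scale=0.1, size=(D, H1))
W2, W2d = rng.normal(scale=0.1, size=(H2, H1)), rng.normal(scale=0.1, size=(H1, H2))

c1 = lambda x: np.tanh(W1 @ x)     # first-layer encoder
f1 = lambda z: W1d @ z             # first-layer decoder
c2 = lambda z: np.tanh(W2 @ z)     # second-layer encoder
f2 = lambda z: W2d @ z             # second-layer decoder

x = rng.normal(size=D)
z1 = c1(x)                         # first latent representation
z2 = c2(z1)                        # second, higher-level representation
x_hat = f1(f2(z2))                 # decode back through both layers
print(x_hat.shape)                 # (16,) -- same shape as the input
```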


Higher level image features - faces

codingplayground.blogspot.com


Pre-training supervised neural networks

Latent features z_j in layer j can capture high level patterns

$$z_j = c_j(c_{j-1}(\cdots c_2(c_1(x)) \cdots))$$

These features may also be useful for supervised learning tasks.
In contrast to the feed forward network, the features z_j are constructed in an unsupervised fashion.
Discard the decoding layers, and directly use z_j with a supervised training method, such as logistic regression.
Various such pre-trained networks are available online, e.g. VGG-19.
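A sketch of that recipe, assuming some already pre-trained encoder stack (the encode function below is only a hypothetical stand-in) and using scikit-learn's LogisticRegression on the resulting features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical pre-trained encoder stack z_j = c_j(...c_1(x)...);
# here just a fixed random two-layer tanh network standing in for the real thing.
W1 = rng.normal(scale=0.1, size=(32, 64))
W2 = rng.normal(scale=0.1, size=(16, 32))
def encode(X):
    return np.tanh(np.tanh(X @ W1.T) @ W2.T)

# Toy supervised task on top of the features (decoders discarded)
X = rng.normal(size=(200, 64))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

Z = encode(X)                                  # unsupervised features z_j
clf = LogisticRegression(max_iter=1000).fit(Z, y)
print(clf.score(Z, y))                         # training accuracy
```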


Xavier Initialisation / ReLU

Layer-wise unsupervised pre-training facilitates learning by extracting useful features for subsequent supervised backprop.
Pre-training also avoids saturation (large magnitude arguments to, say, sigmoidal units).
Simpler Xavier initialisation can also avoid saturation.
Let the inputs x_i ∼ N(0, 1), weights w_i ∼ N(0, σ²), and activation z = Σ_{i=1}^m x_i w_i. Then:

$$\mathrm{VAR}[z] = E[(z - E[z])^2] = E[z^2] = E\Big[\Big(\sum_{i=1}^{m} x_i w_i\Big)^2\Big] = \sum_{i=1}^{m} E[(x_i w_i)^2] = \sum_{i=1}^{m} E[x_i^2]\, E[w_i^2] = m\sigma^2.$$

So we set σ = 1/√m to have “nice” activations.
Similarly for subsequent layers in the network.
ReLU activations h(x) = max(x, 0) also help in practice.
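A quick numerical sanity check of this scaling (a sketch; the fan-in m and the number of samples are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
m = 256                                        # fan-in of a unit
sigma = 1.0 / np.sqrt(m)                       # Xavier scaling

# Many independent draws of the pre-activation z = sum_i x_i * w_i
X = rng.normal(size=(10000, m))                # inputs  x_i ~ N(0, 1)
W = rng.normal(scale=sigma, size=(10000, m))   # weights w_i ~ N(0, sigma^2)
z = np.sum(X * W, axis=1)

print(np.var(z))   # ~ 1.0, i.e. m * sigma^2: activations neither blow up nor vanish
```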


Higher dimensional hidden layer

If there is no other constraint, then an autoencoder with d-dimensional input and an encoding of dimension at least d could potentially just learn the identity function.
Avoid by:
Regularisation
Early stopping of stochastic gradient descent
Add noise in the encoding
Sparsity constraint on the code c(x)


Denoising autoencoder

Add noise to the input, keeping the perfect example as the output.
The autoencoder tries to:
1 preserve the information of the input
2 undo the stochastic corruption process
Reconstruction log likelihood

$$-\log P(x \mid c(\hat{x}))$$

where x is noise free and x̂ is corrupted.
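A sketch of the corruption step and the (corrupted input, clean target) pairs it produces; the Gaussian noise level and the additional random masking are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, noise_std=0.3, drop_prob=0.2):
    """Stochastic corruption: add Gaussian noise, then randomly zero some entries."""
    x_hat = x + rng.normal(scale=noise_std, size=x.shape)
    mask = rng.random(x.shape) > drop_prob
    return x_hat * mask

X_clean = rng.normal(size=(100, 8))   # clean examples x (the training targets)
X_noisy = corrupt(X_clean)            # corrupted inputs x_hat (fed to the encoder)

# The autoencoder is trained on pairs (x_hat, x): reconstruct the clean x
# from the corrupted x_hat, e.g. by minimising -log P(x | c(x_hat)).
print(np.mean((X_noisy - X_clean) ** 2))   # average corruption level
```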


Image denoising

Images with Gaussian noise added.

Denoised using Stacked Sparse Denoising Autoencoder

Images from Xie et al., NIPS 2012


Inpainting - 1

Free a bird

Image from http://cimg.eu/greycstoration/demonstration.shtml


Inpainting - 2

Undo text over image

Image from Bach et al., ICCV tutorial 2009


Recall: Basis functions

For fixed basis functions φ(x), we use domain knowledge to encode features.
Neural networks use data to learn a set of transformations: φ_i(x) is the ith component of the feature vector, and is learned by the network.
The transformations φ_i(·) for a particular dataset may no longer be orthogonal, and furthermore may be minor variations of each other.
We collect all the transformed features into a matrix Φ.


Sparse representations

Idea: have many hidden nodes, but only a few active for a particular code c(x).
Student-t prior on codes
ℓ1 penalty on coefficients α
Given bases in matrix Φ, look for codes by choosing α such that the input signal x is reconstructed with low ℓ2 reconstruction error, while α is sparse:

$$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2} \| x_n - \Phi \alpha_n \|_2^2 + \lambda \|\alpha\|_1$$

Φ is overcomplete, no longer orthogonal.
Sparse ⇒ small number of non-zero α_i.
Exact recovery under certain conditions (coherence): ℓ1 → ℓ0.
ℓ1 regulariser ∼ Laplace prior p(α_i) = (λ/2) exp(−λ|α_i|).
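One simple way to approximately solve this ℓ1 problem for a single signal is iterative soft-thresholding (ISTA); the sketch below is an illustrative implementation, with the random dictionary, λ and iteration count chosen arbitrarily.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, Phi, lam, n_iter=500):
    """Minimise 0.5 * ||x - Phi @ a||_2^2 + lam * ||a||_1 over a."""
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ a - x)         # gradient of the quadratic term
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 50))              # overcomplete dictionary (m < p)
Phi /= np.linalg.norm(Phi, axis=0)           # unit-norm columns
a_true = np.zeros(50)
a_true[[3, 17, 41]] = [1.5, -2.0, 0.8]       # sparse ground-truth code
x = Phi @ a_true + 0.01 * rng.normal(size=20)

a_hat = ista(x, Phi, lam=0.05)
print(np.nonzero(np.abs(a_hat) > 1e-3)[0])   # recovered support (ideally [3, 17, 41])
```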


The image denoising problem

$$\underbrace{y}_{\text{measurements}} = \underbrace{x_{\text{orig}}}_{\text{original image}} + \underbrace{w}_{\text{noise}}$$


Sparsity assumption

Only have noisy measurements

$$\underbrace{y}_{\text{measurements}} = \underbrace{x_{\text{orig}}}_{\text{original image}} + \underbrace{w}_{\text{noise}}$$

Given Φ ∈ ℝ^{m×p}, find α such that

$$\|\alpha\|_0 \text{ is small for } x \approx \Phi\alpha$$

where ‖·‖₀ is the number of non-zero elements of α.
Φ is not necessarily features constructed from training data.
Minimise the reconstruction error

$$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2} \| x_n - \Phi \alpha_n \|_2^2 + \lambda \|\alpha\|_0$$


Convex relaxation

Want to minimise the number of components

$$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2} \| x_n - \Phi \alpha_n \|_2^2 + \lambda \|\alpha\|_0$$

but ‖·‖₀ is hard to optimise.
Relax to a convex norm

$$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2} \| x_n - \Phi \alpha_n \|_2^2 + \lambda \|\alpha\|_1$$

where ‖α‖₁ = Σ_n |α_n|.
In some settings, does minimisation with ℓ1 regularisation give the same solution as minimisation with ℓ0 regularisation (exact recovery)?


Mutual coherence

Expect to be OK when the columns of Φ are “not too parallel”.
Assume the columns of Φ are normalised to unit norm.
Let K = Φ^T Φ be the Gram matrix, so that K(i, j) is the inner product between columns φ_i and φ_j.
Define the mutual coherence

$$M = M(\Phi) = \max_{i \neq j} |K(i, j)|$$

If we have an orthogonal basis, then Φ is an orthogonal matrix, hence K(i, j) = 0 when i ≠ j.
However, if we have very similar columns, then M ≈ 1.
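A direct computation of this quantity (a sketch with a random overcomplete dictionary).

```python
import numpy as np

def mutual_coherence(Phi):
    """M(Phi) = max_{i != j} |<phi_i, phi_j>| for unit-norm columns phi_i."""
    Phi = Phi / np.linalg.norm(Phi, axis=0)   # normalise columns to unit norm
    K = Phi.T @ Phi                           # Gram matrix of the columns
    np.fill_diagonal(K, 0.0)                  # ignore the i = j entries
    return np.max(np.abs(K))

rng = np.random.default_rng(0)
print(mutual_coherence(np.eye(5)))                  # orthogonal basis: 0.0
print(mutual_coherence(rng.normal(size=(20, 50))))  # overcomplete dictionary: in (0, 1)
```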


Exact recovery conditions

If a minimiser α* of the true ℓ0 problem satisfies

$$\|\alpha^*\|_0 < \frac{1}{M}$$

then it is the unique sparsest solution.
If α* satisfies the stronger condition

$$\|\alpha^*\|_0 < \frac{1}{2}\left(1 + \frac{1}{M}\right)$$

then the minimiser of the ℓ1 relaxation has the same sparsity pattern as α*.


References

Yoshua Bengio, "Learning Deep Architectures for AI", Foundations and Trends in Machine Learning, 2009.
http://deeplearning.net/tutorial/
http://www.deeplearningbook.org/contents/autoencoders.html
Fuchs, "On Sparse Representations in Arbitrary Redundant Bases", IEEE Trans. Info. Theory, 2004.
Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", 2010.