Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Introduction to Statistical Machine Learning

Cheng Soon Ong & Christian Walder

Machine Learning Research Group
Data61 | CSIRO
and
College of Engineering and Computer Science
The Australian National University

Canberra, February – June 2018

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part XXII
Neural Networks 3
Number of layers
- The expression of a function is compact when it has few computational elements, i.e. few degrees of freedom that need to be tuned by learning.
- For a fixed number of training examples, we expect that compact representations of the target function would yield better generalization.

Example representations:

- Affine operations + sigmoid ⇒ logistic regression: depth 1, fixed number of units (a.k.a. neurons).
- Fixed kernel + affine operations ⇒ kernel machine (e.g. SVM): two levels, with as many units as data points.
- Stacked layers of "linear transformation followed by a non-linearity" ⇒ deep neural network: arbitrary depth with an arbitrary number of units per layer.
Easier to represent with more layers
- An old result: functions that can be compactly represented by a depth-k architecture might require an exponential number of computational elements when represented by a depth-(k-1) architecture.
- Example: the d-bit parity function (a minimal implementation follows below)

  \mathrm{parity} : (b_1, \ldots, b_d) \in \{0, 1\}^d \mapsto
  \begin{cases}
  1 & \text{if } \sum_{i=1}^{d} b_i \text{ is even} \\
  0 & \text{otherwise}
  \end{cases}

- Theorem: d-bit parity circuits of depth 2 have exponential size.
- The analogue in modern deep learning: "Shallow networks require exponentially more parameters for the same number of modes" (the Canadian deep learning mafia).
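For concreteness, here is the parity function written out in Python; this is just the definition above, nothing more:

```python
def parity(bits):
    """d-bit parity: 1 if the number of ones is even, 0 otherwise."""
    return 1 if sum(bits) % 2 == 0 else 0

assert parity([0, 0]) == 1 and parity([1, 0, 1, 1]) == 0
```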
Recall: Multi-layer Neural Network Architecture
y_k(x, w) = g\left( \sum_{j=0}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right)

where w now contains all weight and bias parameters.
[Figure: two-layer network diagram with inputs x_0, x_1, ..., x_D, hidden units z_0, z_1, ..., z_M, and outputs y_1, ..., y_K, connected by weights w^{(1)}_{MD}, w^{(2)}_{KM}, w^{(2)}_{10}.]
We could add more hidden layers
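A minimal NumPy sketch of this forward pass; the choice of tanh for h and the identity for g, and all names, are illustrative assumptions:

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh, g=lambda a: a):
    """Two-layer network: y_k = g(sum_j w_kj^(2) h(sum_i w_ji^(1) x_i)).

    The bias vectors b1, b2 play the role of the i = 0 and j = 0 terms.
    """
    z = h(W1 @ x + b1)      # hidden unit activations z_j
    return g(W2 @ z + b2)   # outputs y_k

# Example: D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
y = forward(rng.normal(size=3),
            rng.normal(size=(4, 3)), np.zeros(4),
            rng.normal(size=(2, 4)), np.zeros(2))
print(y.shape)  # (2,)
```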
Empirical observations - pre 2006
- Deep architectures get stuck in local minima or plateaus.
- As the architecture gets deeper, it becomes more difficult to obtain good generalisation.
- It is hard to initialise random weights well.
- 1 or 2 hidden layers seem to perform better.
- 2006: unsupervised pre-training, which finds a distributed representation.
Deep representation - intuition
Figure from Bengio, "Learning Deep Architectures for AI", 2009.
Deep representation - practice
AlexNet / VGG-F network visualized by mNeuron.
Recall: PCA
- Idea: linearly project the data points onto a lower-dimensional subspace such that
  - the variance of the projected data is maximised, or
  - the distortion error from the projection is minimised.
- Both formulations lead to the same result.
- Need to find the lower-dimensional subspace, called the principal subspace.

[Figure: data points x_n in the (x_1, x_2) plane projected onto the principal direction u_1, giving projections x̃_n.]
Multiple PCA layers? - linear transforms
- Principal Component Analysis is a linear transformation (because it is a projection).
- The composite of two linear transformations is linear.
- Linear transformations M : R^m → R^n are matrices.
- Let S and T be matrices of appropriate dimension such that ST is defined. Then

  ST(X + X') = ST(X) + ST(X')

  and similarly for multiplication with a scalar.
- ⇒ multiple PCA layers are pointless.
Multiple PCA layers? - projection
- Let X^T X = U \Lambda U^T be the eigenvalue decomposition of the covariance matrix (what is assumed about the mean?).
- Define U_k to be the matrix of the first k columns of U, for the k largest eigenvalues. Define \Lambda_k similarly.
- Consider the features formed by projecting onto the principal components:

  Z = X U_k

- We perform PCA a second time: Z^T Z = V \Lambda_Z V^T.
- By the definition of Z and X^T X, and the orthogonality of U,

  Z^T Z = (X U_k)^T (X U_k) = U_k^T X^T X U_k = U_k^T U \Lambda U^T U_k = \Lambda_k

- Hence \Lambda_Z = \Lambda_k and V is the identity, therefore the second PCA has no effect.
- ⇒ again, multiple PCA layers are pointless. (A numerical check follows below.)
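A quick NumPy check of this argument; the data, dimensions, and k = 3 are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X -= X.mean(axis=0)                         # centred data, as assumed above

# First PCA: eigendecomposition of X^T X, eigenvalues sorted descending
lam, U = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]
U, lam = U[:, order], lam[order]
Z = X @ U[:, :3]                            # project onto the top k = 3 components

# Second PCA on the projected features
lam_z, V = np.linalg.eigh(Z.T @ Z)
order_z = np.argsort(lam_z)[::-1]
V, lam_z = V[:, order_z], lam_z[order_z]

print(np.allclose(lam_z, lam[:3]))                    # Lambda_Z equals Lambda_k
print(np.allclose(np.abs(V), np.eye(3), atol=1e-6))   # V is the identity, up to sign
```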
Autoencoder
- An autoencoder is trained to encode the input x into some representation c(x) so that the input can be reconstructed from that representation.
- The target output of the autoencoder is the autoencoder input itself.
- With one linear hidden layer and the mean squared error criterion, the k hidden units learn to project the input onto the span of the first k principal components of the data.
- If the hidden layer is nonlinear, the autoencoder behaves differently from PCA, with the ability to capture multimodal aspects of the input distribution.
- Let f be the decoder. We want to minimise the reconstruction error

  \sum_{n=1}^{N} \ell(x_n, f(c(x_n)))

  (A minimal training sketch follows below.)
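A minimal sketch of this in NumPy, assuming a tanh encoder, a linear decoder, squared error loss, and plain gradient descent; the data, dimensions, learning rate, and step count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 200, 10, 3                                     # samples, input dim, code dim
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, D))    # low-rank synthetic data

W1 = rng.normal(scale=1/np.sqrt(D), size=(D, K))   # encoder weights, c(x) = tanh(x W1)
W2 = rng.normal(scale=1/np.sqrt(K), size=(K, D))   # decoder weights, f(z) = z W2

lr = 1e-2
for step in range(2000):
    Z = np.tanh(X @ W1)                  # codes c(x_n)
    R = Z @ W2 - X                       # residuals f(c(x_n)) - x_n
    gZ = R @ W2.T * (1 - Z**2)           # backprop through tanh
    W2 -= lr * (Z.T @ R / N)             # gradient of the mean squared error
    W1 -= lr * (X.T @ gZ / N)

print(np.mean((np.tanh(X @ W1) @ W2 - X) ** 2))   # final reconstruction error
```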
Cost function
- Recall: f(c(x)) is the reconstruction produced by the network.
- Minimise the negative log likelihood of the reconstruction, given the encoding c(x):

  RE = -\log P(x \mid c(x))

- If x \mid c(x) is Gaussian, we recover the familiar squared error.
- If the inputs x_i are either binary or considered to be binomial probabilities, then the loss function is the cross entropy

  -\log P(x \mid c(x)) = -\sum_i \left[ x_i \log f_i(c(x)) + (1 - x_i) \log(1 - f_i(c(x))) \right]

  where f_i(·) is the ith component of the decoder. (See the snippet below.)
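The cross-entropy reconstruction loss in code, as a small sketch; the clipping constant is an assumption to avoid log(0):

```python
import numpy as np

def reconstruction_cross_entropy(x, f):
    """-log P(x | c(x)) for binary inputs x and decoder outputs f in (0, 1)."""
    eps = 1e-12                      # guard against log(0); an arbitrary choice
    f = np.clip(f, eps, 1 - eps)
    return -np.sum(x * np.log(f) + (1 - x) * np.log(1 - f))

print(reconstruction_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.2])))
```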
Undercomplete Autoencoder
- Consider a small number of hidden units: c(x) is viewed as a lossy compression of x.
- Cannot have small loss for all x, so focus on the training examples.
- Hope that the code c(x) is a distributed representation that captures the main factors of variation in the data.
Stacking autoencoders
- Let c_j and f_j be the encoder and corresponding decoder of the jth layer, and let z_j be the representation after the encoder c_j.
- We can define multiple layers of autoencoders recursively. For example, let z_1 = c_1(x) and z_2 = c_2(z_1); the corresponding decoding is given by f_1(f_2(z_2)). (A sketch follows below.)
- Because of the non-linear activation functions, the latent feature z_2 can capture more complex patterns than z_1.
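A structural sketch of the recursion in NumPy. The layers here are randomly initialised for brevity, which is an assumption purely for illustration; in practice each pair (c_j, f_j) would be trained, e.g. greedily on the codes z_{j-1} from the layer below:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(d_in, d_out):
    """One autoencoder layer: encoder c_j and decoder f_j (untrained)."""
    W = rng.normal(scale=1/np.sqrt(d_in), size=(d_in, d_out))
    V = rng.normal(scale=1/np.sqrt(d_out), size=(d_out, d_in))
    c = lambda z: np.tanh(z @ W)     # encoder c_j
    f = lambda z: z @ V              # decoder f_j
    return c, f

(c1, f1), (c2, f2) = make_layer(10, 6), make_layer(6, 3)

x = rng.normal(size=10)
z1 = c1(x)              # z_1 = c_1(x)
z2 = c2(z1)             # z_2 = c_2(z_1)
x_hat = f1(f2(z2))      # decoding f_1(f_2(z_2))
print(x_hat.shape)      # (10,)
```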
Higher level image features - faces
Image from codingplayground.blogspot.com
Pre-training supervised neural networks
- Latent features z_j in layer j can capture high-level patterns:

  z_j = c_j(c_{j-1}(\cdots c_2(c_1(x)) \cdots))

- These features may also be useful for supervised learning tasks.
- In contrast to the feed-forward network, the features z_j are constructed in an unsupervised fashion.
- Discard the decoding layers, and directly use z_j with a supervised training method, such as logistic regression (see the sketch below).
- Various such pre-trained networks are available on-line, e.g. VGG-19.
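A small end-to-end sketch of this idea: fit logistic regression on encoded features. A fixed random projection with tanh stands in for a pre-trained encoder c_j here, which is purely an assumption for illustration; sklearn and its digits dataset are one convenient choice:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X = X / 16.0                                   # scale pixel values to [0, 1]
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Stand-in encoder: in practice this would come from unsupervised pre-training
rng = np.random.default_rng(0)
W = rng.normal(scale=1/np.sqrt(X.shape[1]), size=(X.shape[1], 32))
encode = lambda X: np.tanh(X @ W)              # z = c(x)

# Supervised method on the codes z, decoders discarded
clf = LogisticRegression(max_iter=1000).fit(encode(Xtr), ytr)
print(clf.score(encode(Xte), yte))
```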
Xavier Initialisation / ReLU
- Layer-wise unsupervised pre-training facilitates learning by extracting useful features for subsequent supervised backprop.
- Pre-training also avoids saturation (large magnitude arguments to, say, sigmoidal units).
- Simpler Xavier initialisation can also avoid saturation.
- Let the inputs x_i \sim N(0, 1), the weights w_i \sim N(0, \sigma^2), and the activation z = \sum_{i=1}^{m} x_i w_i. Then

  \mathrm{VAR}[z] = E[(z - E[z])^2] = E[z^2] = E\left[\left(\sum_{i=1}^{m} x_i w_i\right)^2\right] = \sum_{i=1}^{m} E[(x_i w_i)^2] = \sum_{i=1}^{m} E[x_i^2]\, E[w_i^2] = m\sigma^2,

  where the cross terms vanish because the x_i and w_i are independent with zero mean.
- So we set \sigma = 1/\sqrt{m} to have "nice" activations. Similarly for subsequent layers in the network. (An empirical check follows below.)
- ReLU activations h(x) = max(x, 0) also help in practice.
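A quick empirical check of this variance calculation; m and the number of trials are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 256, 100_000
x = rng.normal(size=(trials, m))                        # x_i ~ N(0, 1)
w = rng.normal(scale=1/np.sqrt(m), size=(trials, m))    # sigma = 1/sqrt(m)
z = (x * w).sum(axis=1)                                 # activation z per trial
print(z.var())   # close to m * sigma^2 = 1: activations stay "nice"
```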
Higher dimensional hidden layer
- If there is no other constraint, then an autoencoder with d-dimensional input and an encoding of dimension at least d could potentially just learn the identity function.
- Avoid this by:
  - regularisation,
  - early stopping of stochastic gradient descent,
  - adding noise in the encoding,
  - a sparsity constraint on the code c(x).
Denoising autoencoder
- Add noise to the input, keeping the perfect example as the output.
- The autoencoder tries to:
  1. preserve the information of the input,
  2. undo the stochastic corruption process.
- Reconstruction log likelihood

  -\log P(x \mid c(\hat{x}))

  where x is noise free and \hat{x} is corrupted. (A sketch follows below.)
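A denoising variant of the earlier training sketch: the encoder sees a corrupted input \hat{x}, but the reconstruction is compared against the clean x. Gaussian corruption, and all constants, are assumptions; masking noise is another common choice:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 200, 10, 3
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, D))   # low-rank clean data
W1 = rng.normal(scale=1/np.sqrt(D), size=(D, K))        # encoder c
W2 = rng.normal(scale=1/np.sqrt(K), size=(K, D))        # decoder f

lr = 1e-2
for step in range(2000):
    X_tilde = X + rng.normal(scale=0.3, size=X.shape)   # corrupted input x-hat
    Z = np.tanh(X_tilde @ W1)                           # c(x-hat)
    R = Z @ W2 - X                                      # compare f(c(x-hat)) with clean x
    gZ = R @ W2.T * (1 - Z**2)
    W2 -= lr * (Z.T @ R / N)
    W1 -= lr * (X_tilde.T @ gZ / N)

print(np.mean((np.tanh(X @ W1) @ W2 - X) ** 2))         # error on clean inputs
```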
Image denoising
Images with Gaussian noise added.
Denoised using a Stacked Sparse Denoising Autoencoder.
Images from Xie et al., NIPS 2012.
Inpainting - 1
Free a bird: the bird is removed from the image and the background filled in.
Image from http://cimg.eu/greycstoration/demonstration.shtml
Inpainting - 2
Removing text overlaid on an image.
Image from Bach et al., ICCV tutorial 2009.
Recall: Basis functions
- For fixed basis functions φ(x), we use domain knowledge to encode features.
- Neural networks use data to learn a set of transformations: φ_i(x) is the ith component of the feature vector, and is learned by the network.
- The transformations φ_i(·) learned for a particular dataset may no longer be orthogonal, and furthermore may be minor variations of each other.
- We collect all the transformed features into a matrix Φ.
Sparse representations
- Idea: have many hidden nodes, but only a few active for a particular code c(x).
- Student-t prior on the codes.
- ℓ1 penalty on the coefficients α.
- Given bases in the matrix Φ, look for codes by choosing α such that the input signal x is reconstructed with low ℓ2 reconstruction error, while α is sparse:

  \min_\alpha \sum_{n=1}^{N} \frac{1}{2} \| x_n - \Phi \alpha_n \|_2^2 + \lambda \| \alpha \|_1

- Φ is overcomplete, no longer orthogonal.
- Sparse ⇒ a small number of non-zero α_i.
- Exact recovery holds under certain conditions (coherence): ℓ1 → ℓ0.
- The ℓ1 regulariser corresponds to a Laplace prior p(α_i) = (λ/2) exp(−λ|α_i|). (A solver sketch follows below.)
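One standard way to solve this ℓ1 problem is iterative soft-thresholding (ISTA); this is a sketch under assumed dictionary sizes and λ, not the lecture's prescribed solver:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, Phi, lam, steps=500):
    """min_a 0.5 * ||x - Phi a||_2^2 + lam * ||a||_1 via iterative soft-thresholding."""
    L = np.linalg.norm(Phi, 2) ** 2           # Lipschitz constant of the gradient
    a = np.zeros(Phi.shape[1])
    for _ in range(steps):
        grad = Phi.T @ (Phi @ a - x)          # gradient of the smooth part
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 50))               # overcomplete dictionary
Phi /= np.linalg.norm(Phi, axis=0)            # unit-norm columns
a_true = np.zeros(50); a_true[[3, 17, 41]] = [1.0, -2.0, 1.5]
x = Phi @ a_true + 0.01 * rng.normal(size=20)
a_hat = ista(x, Phi, lam=0.05)
print(np.nonzero(np.round(a_hat, 2))[0])      # should (approximately) recover the support
```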
The image denoising problem
\underbrace{y}_{\text{measurements}} = \underbrace{x_{\text{orig}}}_{\text{original image}} + \underbrace{w}_{\text{noise}}
Sparsity assumption
- We only have noisy measurements:

  \underbrace{y}_{\text{measurements}} = \underbrace{x_{\text{orig}}}_{\text{original image}} + \underbrace{w}_{\text{noise}}

- Given \Phi \in R^{m \times p}, find α such that \| \alpha \|_0 is small for x \approx \Phi \alpha, where ‖·‖_0 is the number of non-zero elements of α.
- Φ is not necessarily features constructed from training data.
- Minimise the reconstruction error:

  \min_\alpha \sum_{n=1}^{N} \frac{1}{2} \| x_n - \Phi \alpha_n \|_2^2 + \lambda \| \alpha \|_0
Convex relaxation
- We want to minimise the number of components:

  \min_\alpha \sum_{n=1}^{N} \frac{1}{2} \| x_n - \Phi \alpha_n \|_2^2 + \lambda \| \alpha \|_0

  but ‖·‖_0 is hard to optimise.
- Relax to a convex norm:

  \min_\alpha \sum_{n=1}^{N} \frac{1}{2} \| x_n - \Phi \alpha_n \|_2^2 + \lambda \| \alpha \|_1

  where \| \alpha \|_1 = \sum_i |\alpha_i|.
- In which settings does minimisation with ℓ1 regularisation give the same solution as minimisation with ℓ0 regularisation (exact recovery)?
Mutual coherence
- Expect things to be OK when the columns of Φ are "not too parallel".
- Assume the columns of Φ are normalised to unit norm.
- Let K = Φ^T Φ be the Gram matrix, so that K(i, j) is the inner product between φ_i and φ_j.
- Define the mutual coherence (computed in the sketch below)

  M = M(\Phi) = \max_{i \neq j} |K(i, j)|

- If we have an orthogonal basis, then Φ is an orthogonal matrix, hence K(i, j) = 0 when i ≠ j.
- However, if we have very similar columns, then M ≈ 1.
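The definition translated directly into NumPy; the example dictionaries are arbitrary:

```python
import numpy as np

def mutual_coherence(Phi):
    """M(Phi) = max_{i != j} |K(i, j)| for unit-norm columns."""
    Phi = Phi / np.linalg.norm(Phi, axis=0)          # normalise columns
    K = Phi.T @ Phi                                  # Gram matrix of the columns
    return np.max(np.abs(K - np.eye(K.shape[1])))    # zero the diagonal, take max

rng = np.random.default_rng(0)
print(mutual_coherence(np.eye(5)))                   # orthogonal basis: M = 0
print(mutual_coherence(rng.normal(size=(20, 50))))   # overcomplete: M well above 0
```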
Exact recovery conditions
- If a minimiser α* of the true ℓ0 problem satisfies

  \| \alpha^* \|_0 < \frac{1}{M}

  then it is the unique sparsest solution.
- If α* satisfies the stronger condition

  \| \alpha^* \|_0 < \frac{1}{2} \left( 1 + \frac{1}{M} \right)

  then the minimiser of the ℓ1 relaxation has the same sparsity pattern as α*.
References
- Yoshua Bengio, "Learning Deep Architectures for AI", Foundations and Trends in Machine Learning, 2009.
- http://deeplearning.net/tutorial/
- http://www.deeplearningbook.org/contents/autoencoders.html
- Fuchs, "On Sparse Representations in Arbitrary Redundant Bases", IEEE Transactions on Information Theory, 2004.
- Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", 2010.