
Reducing the dimensionality of data with neural networks


Page 1: Reducing the dimensionality of data with neural networks

Reducing the Dimensionality of Data with Neural Networks

@St_Hakky

Geoffrey E. Hinton; R. R. Salakhutdinov (2006-07-28). “Reducing the Dimensionality of Data with Neural Networks”. Science 313 (5786): 504–507.

Page 2: Reducing the dimensionality of data with neural networks

Dimensionality Reduction

• Dimensionality reduction facilitates:
  • Classification
  • Visualization
  • Communication
  • Storage of high-dimensional data

Page 3: Reducing the dimensionality of data with neural networks

Principal Components Analysis

• PCA (Principal Components Analysis):
  • A simple and widely used method
  • Finds the directions of greatest variance in the data set
  • Represents each data point by its coordinates along each of these directions (see the sketch below)
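As a concrete illustration of these steps (not from the paper), here is a minimal PCA sketch in Python with NumPy; the data shapes and variable names are illustrative assumptions:

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)          # center each feature
    # Rows of Vt are the directions of greatest variance, in decreasing order.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                      # (k, n_features)
    codes = X_centered @ components.T        # coordinates along each direction
    return codes, components

# Toy usage: represent 100 points in 784 dimensions by 6 numbers each.
X = np.random.rand(100, 784)
codes, components = pca(X, k=6)
reconstruction = codes @ components + X.mean(axis=0)
print(codes.shape)  # (100, 6)
```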

Page 4: Reducing the dimensionality of data with neural networks

“Encoder” and “Decoder” Network

• This paper describes a nonlinear generalization of PCA: the autoencoder.

• It uses an adaptive, multilayer “encoder” network to transform the high-dimensional data into a low-dimensional code.

• A similar “decoder” network recovers the data from the code.
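Schematically, with one logistic layer on each side (a minimal sketch, not the paper's exact architecture; $\sigma$ is the logistic function and the $W$, $b$ are learned weights and biases):

$$\mathrm{code} = \sigma(W_{\mathrm{enc}}\, x + b_{\mathrm{enc}}), \qquad \hat{x} = \sigma(W_{\mathrm{dec}}\, \mathrm{code} + b_{\mathrm{dec}})$$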

Page 5: Reducing the dimensionality of data with neural networks

AutoEncoder

[Diagram: Input → Encoder → Code → Decoder → Output]

Page 6: Reducing the dimensionality of data with neural networks

AutoEncoder

[Diagram: the input layer receives the input data, a stack of hidden layers performs the dimensionality reduction, and the output layer produces the reconstructed data.]

Page 7: Reducing the dimensionality of data with neural networks

How to train the AutoEncoder

・ Start with random weights in the two networks (encoder and decoder).

[Diagram: the same encoder-decoder network as on the previous slide.]

・ They are trained by minimizing the discrepancy between the original data and its reconstruction.

・ Gradients are obtained by the chain rule, back-propagating the error from the decoder network to the encoder network (a sketch of this training loop follows below).
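A minimal sketch of one such training loop for a single-hidden-layer autoencoder in NumPy (illustrative only: the layer sizes, learning rate, logistic units on both layers, and the omission of biases are assumptions, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny autoencoder: 784 -> 30 -> 784, starting with small random weights.
n_in, n_code = 784, 30
W_enc = rng.normal(0, 0.1, (n_in, n_code))
W_dec = rng.normal(0, 0.1, (n_code, n_in))
lr = 0.1

X = rng.random((100, n_in))  # stand-in for a batch of training images

for step in range(1000):
    # Forward pass: encode to a low-dimensional code, then decode.
    code = sigmoid(X @ W_enc)
    X_hat = sigmoid(code @ W_dec)

    # Discrepancy between the original data and its reconstruction.
    err = X_hat - X

    # Chain rule: back-propagate the error from the decoder to the encoder.
    d_out = err * X_hat * (1 - X_hat)               # through output logistic
    grad_W_dec = code.T @ d_out
    d_code = (d_out @ W_dec.T) * code * (1 - code)  # through code logistic
    grad_W_enc = X.T @ d_code

    W_dec -= lr * grad_W_dec / len(X)
    W_enc -= lr * grad_W_enc / len(X)
```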

Page 8: Reducing the dimensionality of data with neural networks

It is difficult to optimize multilayer autoencoders

• It is difficult to optimize the weights in nonlinear autoencoders that have multiple hidden layers (2-4).
• With large initial weights:
  • autoencoders typically find poor local minima.
• With small initial weights:
  • the gradients in the early layers are tiny, making it infeasible to train autoencoders with many hidden layers (a numerical illustration follows below).
• If the initial weights are close to a good solution, gradient descent works well; however, finding such initial weights is very difficult.
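A hedged numerical illustration of the small-weights problem (not from the paper): the logistic derivative is at most 0.25, so with small initial weights the back-propagated error signal shrinks at every layer. The layer count, width, and weight scale below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A deep stack of logistic layers with small random initial weights.
n_layers, width = 8, 100
weights = [rng.normal(0, 0.01, (width, width)) for _ in range(n_layers)]

# Forward pass, remembering each layer's activations.
a = rng.random((1, width))
activations = []
for W in weights:
    a = sigmoid(a @ W)
    activations.append(a)

# Back-propagate a unit error and watch the gradient norm collapse.
delta = np.ones((1, width))
for W, a in zip(reversed(weights), reversed(activations)):
    delta = (delta * a * (1 - a)) @ W.T   # chain rule through one layer
    print(f"gradient norm: {np.linalg.norm(delta):.2e}")
```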

Page 9: Reducing the dimensionality of data with neural networks

Pretraining

• This paper introduces such a “pretraining” procedure for binary data, generalizes it to real-valued data, and shows that it works well for a variety of data sets.

Page 10: Reducing the dimensionality of data with neural networks

Restricted Boltzmann Machine (RBM)

The input data correspond to the “visible” units of the RBM, and the feature detectors correspond to the “hidden” units. A joint configuration of the visible and hidden units has an energy given by (1):

$$E(v, h) = -\sum_{i} b_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i h_j w_{ij} \qquad (1)$$

where $v_i$ and $h_j$ are the binary states of visible unit $i$ and hidden unit $j$, $b_i$ and $b_j$ are their biases, and $w_{ij}$ is the weight between them.

The network assigns a probability to every possible data vector via this energy function.
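A sketch of this energy function in NumPy (the RBM sizes and states below are illustrative); each joint configuration gets an unnormalized probability proportional to $e^{-E(v,h)}$, so lower-energy configurations are more probable:

```python
import numpy as np

def rbm_energy(v, h, b_vis, b_hid, W):
    """Energy of a joint configuration (v, h) of an RBM, as in Eq. (1)."""
    return -(b_vis @ v) - (b_hid @ h) - (v @ W @ h)

# Toy RBM: 6 visible units, 3 hidden units.
rng = np.random.default_rng(0)
n_vis, n_hid = 6, 3
W = rng.normal(0, 0.1, (n_vis, n_hid))
b_vis = np.zeros(n_vis)
b_hid = np.zeros(n_hid)

v = rng.integers(0, 2, n_vis)   # binary visible states
h = rng.integers(0, 2, n_hid)   # binary hidden states

E = rbm_energy(v, h, b_vis, b_hid, W)
print(E, np.exp(-E))  # energy and unnormalized probability
```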

Page 11: Reducing the dimensionality of data with neural networks

Pretraining consists of learning a stack of RBMs

・ The first layer of feature detectors then becomes the visible units for learning the next RBM.

・ This layer-by-layer learning can be repeated as many times as desired (see the sketch below).
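A minimal sketch of this greedy, layer-by-layer procedure, using one step of contrastive divergence (CD-1) for each RBM update (the layer sizes, learning rate, epoch count, and omission of biases are illustrative assumptions, not the paper's exact training details):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hid, lr=0.1, epochs=10):
    """Train one RBM with 1-step contrastive divergence; return weights and hidden probs."""
    n_vis = data.shape[1]
    W = rng.normal(0, 0.1, (n_vis, n_hid))
    for _ in range(epochs):
        # Positive phase: drive the hidden units from the data.
        h_prob = sigmoid(data @ W)
        h_state = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: reconstruct the visible units, then re-infer hidden.
        v_recon = sigmoid(h_state @ W.T)
        h_recon = sigmoid(v_recon @ W)
        # CD-1 update: difference of data-driven and reconstruction statistics.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
    return W, sigmoid(data @ W)

# Greedy stack: each RBM's hidden activities become the "visible" data
# for learning the next RBM.
data = (rng.random((100, 784)) < 0.5).astype(float)  # stand-in binary images
layer_sizes = [400, 200, 30]                          # illustrative sizes
weights = []
for n_hid in layer_sizes:
    W, data = train_rbm(data, n_hid)
    weights.append(W)
```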

Page 12: Reducing the dimensionality of data with neural networks

Experiment (2-A)

• The function of each layer: the six units in the code layer were linear and all the other units were logistic.

• Data: the network was trained on 20,000 images and tested on 10,000 new images.

• Observed results: the autoencoder discovered how to convert each 784-pixel image into six real numbers that allow almost perfect reconstruction.

[Diagram: the autoencoder network used in this experiment.]

Page 13: Reducing the dimensionality of data with neural networks

Experiment (2-A)

(1) Random samples of curves from the test data set

(2) Reconstructions produced by the six-dimensional deep autoencoder

(3) Reconstructions by logistic PCA using six components

(4) Reconstructions by logistic PCA using 18 components

(5) Reconstructions by standard PCA using 18 components

The average squared error per image for the last four rows is 1.44, 7.64, 2.45, and 5.90.

Page 14: Reducing the dimensionality of data with neural networks

Experiment (2-B)

• The function of each layer: the 30 units in the code layer were linear and all the other units were logistic.

• Data: the network was trained on 60,000 images and tested on 10,000 new images.

[Diagram: the autoencoder network used in this experiment.]

Page 15: Reducing the dimensionality of data with neural networks

Experiment (2-B): MNIST

(1) A random test image from each class

(2) Reconstructions by the 30-dimensional autoencoder

(3) Reconstructions by 30-dimensional logistic PCA

(4) Reconstructions by standard PCA

The average squared errors for the last three rows are 3.00, 8.01, and 13.87.

Page 16: Reducing the dimensionality of data with neural networks

Experiment (2-B)

A two-dimensional autoencoder produced a better visualization of the data than did the first two principal components.

(A) The two-dimensional codes for 500 digits of each class produced by taking the first two principal components of all 60,000 training images.

(B) The two-dimensional codes found by a 784-1000-500-250-2 autoencoder.
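A hedged sketch of how such a two-dimensional visualization can be plotted (assuming the 2-D codes and digit labels are already available as arrays; random stand-ins are used here so the snippet runs on its own):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative stand-ins: 2-D codes and digit labels (0-9).
rng = np.random.default_rng(0)
codes = rng.normal(size=(5000, 2))   # would come from the 784-...-2 encoder
labels = rng.integers(0, 10, 5000)   # the digit class of each code

# One color per digit class, as in the paper's figure.
plt.scatter(codes[:, 0], codes[:, 1], c=labels, cmap="tab10", s=4)
plt.colorbar(label="digit class")
plt.title("Two-dimensional codes, colored by class")
plt.show()
```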

Page 17: Reducing the dimensionality of data with neural networks

Experiment (2-C)

• The function of each layer: the 30 units in the code layer were linear and all the other units were logistic.

• Data: the Olivetti face data set.

• Observed results: the autoencoder clearly outperformed PCA.

[Diagram: the autoencoder network used in this experiment.]

Page 18: Reducing the dimensionality of data with neural networks

Experiment (2-C)

(1) Random samples from the test data set

(2) Reconstructions by the 30-dimensional autoencoder

(3) Reconstructions by 30-dimensional PCA

The average squared errors are 126 and 135.

Page 19: Reducing the dimensionality of data with neural networks

Conclusion

• It has been obvious since the 1980s that backpropagation through deep autoencoders would be very effective for nonlinear dimensionality reduction, provided that:
  • computers were fast enough,
  • data sets were big enough, and
  • the initial weights were close enough to a good solution.

Page 20: Reducing the dimensionality of data with neural networks

Conclusion

• Autoencoders give mappings in both directions between the data and code spaces.

• They can be applied to very large data sets, because both the pretraining and the fine-tuning scale linearly in time and space with the number of training cases.