Deep learning from a novice perspective and recent innovations from KGPians
Anirban Santara, Doctoral Research Fellow
Department of CSE, IIT Kharagpur
bit.do/AnirbanSantara
Deep Learning
Just a kind of Machine Learning
3 main tasks:
• Classification
• Regression
• Clustering
CLASSIFICATION
[Slide figure: example images of pandas, dogs and cats, with unlabelled query images to be classified]
Rather: P(class | image)?
REGRESSION
[Slide figure: a dependent variable (target attribute) plotted against an independent variable (feature)]
CLUSTERING
[Slide figure: data points plotted against Attribute 1 and Attribute 2, falling into natural groups]
The methodology:
1. Design a hypothesis function h(y|x, θ), where y is the target attribute, x is the input and θ are the parameters of the learning machine.
2. Keep improving the hypothesis until the predictions become good enough.
Well, how bad is your hypothesis? In the case of regression, a very common measure is the mean squared error over the training set:

$$E = \frac{1}{N} \sum_{\text{all training examples}} \big\| y_{\text{desired}} - y_{\text{as per hypothesis}} \big\|^2$$
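As a concrete illustration (not from the slides), here is that criterion in a few lines of numpy; the targets and predictions are made up:

```python
import numpy as np

y_desired = np.array([1.0, 2.0, 3.0, 4.0])      # made-up targets
y_hypothesis = np.array([1.1, 1.9, 3.2, 3.7])   # made-up predictions

# Mean squared error over all training examples
E = np.mean((y_desired - y_hypothesis) ** 2)
print(E)    # 0.0375
```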
In classification problems, the classes are often encoded as one-hot vectors, e.g. [1 0] and [0 1]. In one-hot classification frameworks, we often use the mean squared error.
However, often we ask for the probabilities of occurrence of the different classes for a given input, Pr(class|X). In that case we use the K-L divergence between the observed (p) and predicted (q) distributions over the output classes as the measure of error. This is sometimes referred to as the cross-entropy error criterion (for one-hot targets the two coincide):

$$KL(p \,\|\, q) = \sum_{\text{classes}} p(\text{class}) \,\ln \frac{p(\text{class})}{q(\text{class})}$$
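A minimal numpy sketch of this criterion, with a hypothetical one-hot target p and predicted distribution q; note that for a one-hot p the K-L divergence and the cross entropy agree, since H(p) = 0:

```python
import numpy as np

p = np.array([0.0, 1.0, 0.0])   # observed one-hot distribution
q = np.array([0.1, 0.7, 0.2])   # predicted Pr(class | X)

eps = 1e-12                     # guards against log(0)
cross_entropy = -np.sum(p * np.log(q + eps))
kl = np.sum(p * np.log((p + eps) / (q + eps)))
print(cross_entropy, kl)        # both ~0.357 = -ln(0.7)
```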
Clustering uses a plethora of criteria, such as:
• Entropy of a cluster
• Maximum distance between two neighbors in a cluster (computed in the sketch below)
-- and a lot more
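For instance, the second criterion (the cluster's "diameter") can be computed like this; the three 2-D points are made up for illustration:

```python
import numpy as np
from itertools import combinations

# Made-up 2-D points assigned to one cluster
cluster = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])

# Maximum distance between two members of the cluster (its "diameter")
diameter = max(np.linalg.norm(a - b) for a, b in combinations(cluster, 2))
print(diameter)
```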
Now it's time to rectify the machine and improve.
Learning
We perform “gradient descent” along the “error-plane” in the “parameter space”:
$$\Delta\,\text{parameter} = -\,\text{learning rate} \times \nabla_{\text{parameter}}\, \text{error function}$$

$$\text{parameter} \leftarrow \text{parameter} + \Delta\,\text{parameter}$$
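Putting the pieces together, here is a minimal gradient-descent sketch (not from the slides) for a linear hypothesis h(y|x, θ) with θ = (w, b), trained under the mean-squared-error criterion on made-up data:

```python
import numpy as np

# Toy data from y = 2x + 1 plus noise (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2 * x + 1 + 0.1 * rng.standard_normal(100)

w, b = 0.0, 0.0                 # parameters theta of the hypothesis
learning_rate = 0.1

for step in range(500):
    y_hat = w * x + b                       # prediction of the hypothesis
    grad_w = np.mean(-2 * x * (y - y_hat))  # d(MSE)/dw
    grad_b = np.mean(-2 * (y - y_hat))      # d(MSE)/db
    w -= learning_rate * grad_w             # parameter <- parameter + Δparameter
    b -= learning_rate * grad_b

print(w, b)     # should end up close to 2 and 1
```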
Let's now look into a practical learning system: the Artificial Neural Network.
[Slide figure: a neural network classifying an input image as Cat, Dog or Panda]
The artificial neuron - a very small unit of computation
So the parameters of an ANN are:
1. Incoming weights of every neuron
2. Bias of every neuron
These are the ones that need to be tuned during learning, and we perform gradient descent on them. The backpropagation algorithm is a popular method of computing the required gradients.
Backpropagation algorithm
[Slide figure: a feed-forward network with an input pattern vector and weight matrices W21 (input to hidden) and W32 (hidden to output)]
• Forward propagate: compute the activations of each layer in turn, from the input to the output.
• Error calculation: compare the network's output with the desired target.
• Backward propagation: compute an error term δ_k for every neuron k, using one rule if k is in the output layer and another if k is in a hidden layer, then update the weights by gradient descent.
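A minimal numpy sketch of these three steps for one hidden layer of sigmoid units, following the slide's W21/W32 naming; biases are omitted for brevity and a squared-error criterion is assumed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)              # input pattern vector
t = np.array([0.0, 1.0])                # desired (one-hot) output

W21 = 0.1 * rng.standard_normal((3, 4)) # input -> hidden weights
W32 = 0.1 * rng.standard_normal((2, 3)) # hidden -> output weights

# Forward propagate
h = sigmoid(W21 @ x)                    # hidden activations
y = sigmoid(W32 @ h)                    # output activations

# Error calculation (squared error)
E = 0.5 * np.sum((t - y) ** 2)

# Backward propagation: delta rules for the two cases
delta_out = (y - t) * y * (1 - y)              # k in the output layer
delta_hid = (W32.T @ delta_out) * h * (1 - h)  # k in a hidden layer

# Gradient descent on the weights
lr = 0.5
W32 -= lr * np.outer(delta_out, h)
W21 -= lr * np.outer(delta_hid, x)
```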
Well, after all, life is tough…
• The parameters of a neural network are generally initialized to random values. Starting from these random values (which carry no useful information), it is very difficult (not impossible, but very time-consuming) for backpropagation to arrive at the correct values of these parameters.
• Exponential activation functions like the sigmoid and the hyperbolic tangent are traditionally used in artificial neurons. These functions have gradients that are prone to becoming zero in the course of backpropagation.
• If the gradients in a layer get close to zero, they induce the gradients in the previous layers to vanish too. As a result, the weights and biases in the lower layers remain immature.
• This phenomenon is called the "vanishing gradient" problem in the literature.
These problems crop up very frequently in neural networks that contain a large number of hidden layers and a very large number of parameters (the so-called Deep Neural Networks).
How to get around this? Answer: make an "informed" initialization.
• A signal is nothing but a set of random variables.
• These random variables jointly take values from a probability distribution that depends on the nature of the source of the signal.
E.g.: a blank 28x28 pixel array can house numerous kinds of images. The set of 784 random variables assumes values from a different joint probability distribution for every class of objects/scenes:

$$P_{\text{digit}}(x_1, x_2, \ldots, x_{784}) \qquad P_{\text{human face}}(x_1, x_2, \ldots, x_{784})$$
Let's try and model the probability distribution of interest.
Our target distribution: $P_{\text{human face}}(x_1, x_2, \ldots, x_{784})$. We try to capture this distribution in a model that looks quite similar to a single-layer neural network.
The Restricted Boltzmann Machine: a probabilistic graphical model (a special kind of Markov Random Field) that is capable of modelling a wide variety of probability distributions. Its hidden units capture the dependencies among the "visible" variables.
The working of the RBM
Parameters of the RBM:
1. Weights on the edges
2. Biases on each visible node and each hidden node
Using these we define a joint probability distribution over the "visible" variables v and the "hidden" variables h as:

$$P_{RBM}(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\, e^{-E(\mathbf{v}, \mathbf{h})}$$

where the energy function is defined (in its standard form, with visible biases $\mathbf{b}$, hidden biases $\mathbf{c}$ and weights $W$) as

$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{b}^\top \mathbf{v} - \mathbf{c}^\top \mathbf{h} - \mathbf{v}^\top W \mathbf{h}$$

and Z is a normalization term called the "partition function". Summing out the hidden variables gives the model's distribution over the visibles, which is what we match to the target:

$$P_{\text{human face}}(v_1, v_2, \ldots, v_{784}) \approx P_{RBM}(v_1, v_2, \ldots, v_{784}) = \sum_{\mathbf{h}} P_{RBM}(\mathbf{v}, \mathbf{h})$$
$$KL\big(P_{\text{human face}} \,\|\, P_{RBM}\big) = -H\big(P_{\text{human face}}\big) - \sum_{v_1, \ldots, v_{784}} P_{\text{human face}}(v_1, \ldots, v_{784}) \,\ln P_{RBM}(v_1, \ldots, v_{784})$$

The entropy term is not under our control. The second term is the empirical average of the log-likelihood of the data under the model distribution, so minimizing the K-L divergence means we MAXIMIZE the log-likelihood of the data under the model.
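In practice this maximization is usually done with contrastive divergence (CD-1), which the slides do not spell out; the following is a rough numpy sketch assuming binary units, with illustrative sizes and learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden = 784, 256          # e.g. binary 28x28 images
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)                 # visible biases
c = np.zeros(n_hidden)                  # hidden biases

def cd1_step(v0, lr=0.1):
    """One CD-1 update from a single binary training vector v0."""
    global W, b, c
    ph0 = sigmoid(v0 @ W + c)                        # P(h=1 | v0)
    h0 = (rng.random(n_hidden) < ph0).astype(float)  # sample hiddens
    pv1 = sigmoid(h0 @ W.T + b)                      # reconstruct visibles
    v1 = (rng.random(n_visible) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Positive-phase minus negative-phase statistics approximate
    # the gradient of the data log-likelihood
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
```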
Layer-wise pre-training using RBMs
• Every hidden layer is pre-trained as the hidden layer of an RBM (see the sketch after this list).
• As the RBM models the statistics of the input, its weights and biases carry meaningful information about the input. Using these as initial values of the parameters of a deep neural network has shown phenomenal improvement over random initialization, both in terms of time complexity and performance.
• This is followed by fine-tuning of the entire network via back-propagation.
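A sketch of the greedy layer-wise procedure; `train_rbm` is a hypothetical helper (e.g. built around the CD-1 step above) that trains one RBM and returns its weights and hidden biases:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_stack(data, layer_sizes, train_rbm):
    """Greedy layer-wise pre-training (a sketch).

    train_rbm(x, n_hidden) is assumed to train one RBM on the rows
    of x and return its weights W and hidden biases c.  Those become
    the initial parameters of one layer of the deep network.
    """
    params, x = [], data
    for n_hidden in layer_sizes:
        W, c = train_rbm(x, n_hidden)   # pre-train this layer as an RBM
        params.append((W, c))
        x = sigmoid(x @ W + c)          # its hidden activations become
                                        # the "data" for the next RBM
    return params
```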
The Autoencoder
• An autoencoder is a neural network operating in unsupervised learning mode.
• The output and the input are set equal to each other: the network learns an identity mapping from the input to the output.
• Applications:
  • Dimensionality reduction (efficient, non-linear)
  • Representation learning (discovering interesting structures)
  • An alternative to the RBM for layer-wise pre-training of DNNs
[Slide figure: a deep stacked autoencoder]
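A minimal numpy autoencoder sketch (not from the slides): a sigmoid encoder and decoder trained to reconstruct their own input, with biases omitted and constant factors folded into the learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((500, 784))          # toy inputs; the targets are X itself

n_hidden = 64                       # bottleneck: 784 -> 64 -> 784
W1 = 0.01 * rng.standard_normal((784, n_hidden))   # encoder weights
W2 = 0.01 * rng.standard_normal((n_hidden, 784))   # decoder weights
lr = 0.1

for step in range(200):
    H = sigmoid(X @ W1)             # the learned representation (code)
    Y = sigmoid(H @ W2)             # reconstruction of the input
    E = np.mean((X - Y) ** 2)       # reconstruction error to minimize
    dY = (Y - X) * Y * (1 - Y) / len(X)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dY             # gradient-descent updates
    W1 -= lr * X.T @ dH
    if step % 50 == 0:
        print(step, E)
```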
So deep learning ≈ training “deep” neural networks with many hidden layers
Step 1: Unsupervised layer-wise pre-training. Step 2: Supervised fine-tuning.
This is pretty much all there is to how deep learning works. However, there is a class of deep networks, the convolutional neural networks, that often do not need pre-training, because these networks use extensive parameter sharing and rectified linear activation functions.
Well, deep learning, when viewed from a different perspective, looks really amazing!
Traditional machine learning vs. deep learning
Traditional machine learning: Data → hand-engineering of feature extractors → data representations by feature extractors → inference engine (classification, regression, clustering, efficient coding)
Deep learning: Data → data-driven, target-oriented representation learning → inference engine (classification, regression, clustering, efficient coding)
What's so special about it?
Traditional machine learning:
• Designing feature detectors requires careful engineering and considerable domain expertise.
• Representations must be selective to aspects of the data that are important for our task and invariant to the irrelevant aspects (the selectivity-invariance dilemma).
Deep learning:
• Abstractions of hierarchically increasing complexity are learnt by a data-driven approach using general-purpose learning procedures.
• A composition of simple non-linear modules can learn very complex functions.
• Cost functions specific to the problem amplify aspects of the input that are important for the task and suppress irrelevant variations.
Pretty much how we humans go about analyzing…
Some deep architectures:
• Deep stacked autoencoder: used for efficient non-linear dimensionality reduction and for discovering salient underlying structures in data.
• Deep convolutional neural network: exploits the stationarity of natural data and uses the concept of parameter sharing to study large images and long spoken/written strings and make inferences from them (see the sketch after this list).
• Recurrent neural network: custom-made for modelling dynamic systems; finds use in natural language (speech and text) processing, machine translation, etc.
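To make the parameter-sharing idea concrete, here is a naive 2-D convolution in numpy (a sketch, not from the slides): one small kernel of 9 weights is reused at every image location, so the parameter count does not grow with the image size:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2-D convolution (really cross-correlation)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same 9 kernel weights are applied at every location:
            # this is the parameter sharing that CNNs rely on.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).random((28, 28))
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])      # a hand-set vertical-edge kernel
feature_map = conv2d(image, kernel)     # shape (26, 26), from 9 parameters
```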
Classical automatic speech recognition system
[Slide figure: pipeline diagram] Signal acquisition → feature extraction → acoustic modelling → Viterbi beam search / A* decoding (driven by phonetic utterance models from acoustic model generation and a sentence model from sentence model preparation) → N-best sentences or word lattice → rescoring → FINAL UTTERANCE
Some of our works:
2015: Deep neural network and Random Forest hybrid architecture for learning to detect retinal vessels in fundus images (accepted at EMBC 2015, Milan, Italy)
[Slide figure: our architecture]
Average accuracy of detection: 93.27%
2014-15: Faster learning of deep stacked autoencoders on multi-core systems through synchronized layer-wise pre-training (accepted at the PDCKDD Workshop, part of ECML-PKDD 2015, Porto, Portugal)
[Slide figure: conventional serial pre-training vs. the proposed synchronized algorithm]
26% speedup for compression of MNIST handwritten digits
Take-home messages
• Deep learning is a set of algorithms that have been designed to:
  1. Train neural networks with a large number of hidden layers.
  2. Learn features of hierarchically increasing complexity in the data, in a data- and objective-driven manner.
• Deep neural networks are breaking all world records in AI because it can be proved that they can model highly non-linear functions of the data with fewer parameters than shallow networks.
• Deep learning is extremely interesting and a breeze to implement once the underlying philosophies are understood. It has great potential for use in a lot of ongoing projects at KGP.
If you are interested to go deep into deep learning…
• Take Andrew Ng's Machine Learning course on Coursera
• Visit ufldl.Stanford.edu and read the entire tutorial
• Read LeCun's latest deep learning review published in Nature
Thank you so much
Please give me some feedback on this talk by visiting bit.do/RateAnirban, or just scan the QR code.