
Description

Covers the basics of artificial neural networks and the motivation for deep learning, and explains certain deep learning networks, including deep belief networks and auto-encoders. It also details the challenges of implementing a deep learning network at scale and explains how we have implemented a distributed deep learning network over Spark.


Page 1: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Deep Learning: Evolution of ML from Statistical to Brain-like Computing

Dr. Vijay Srinivas Agneeswaran, Director, Big-data Labs, Impetus

Dr. Dobb's Conference Keynote, 20th Nov 2014, Bangalore.

Page 2: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Contents

Introduction to Artificial Neural Networks

Deep learning networks

Towards deep learning

From ANNs to DLNs.

Basics of DLNs.

Related Approaches.

Distributed DLNs: Challenges

Introduction to Spark

Distributed DLNs over Spark

Copyright © Impetus Technologies, 2014

Page 3: Distributed deep learning_over_spark_20_nov_2014_ver_2.8


Deep Learning: Evolution Timeline

Page 4: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Introduction to Artificial Neural Networks (ANNs)

Perceptron
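The perceptron diagram itself did not survive extraction; as a reminder (the standard definition, not copied from the slide), a perceptron with inputs x_j, weights w_j and bias b outputs

\[
y = \begin{cases} 1 & \text{if } \sum_j w_j x_j + b > 0 \\ 0 & \text{otherwise} \end{cases}
\]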


Page 5: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Introduction to Artificial Neural Networks (ANNs)

Sigmoid Neuron

• Small change in input = small change in output.

• The output of a sigmoid neuron is given below:
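The formula was an image and did not survive extraction; the standard form (an assumption, not copied from the slide) for inputs x_j, weights w_j and bias b is

\[
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \text{output} = \frac{1}{1 + \exp\!\left(-\sum_j w_j x_j - b\right)}
\]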


Page 6: Distributed deep learning_over_spark_20_nov_2014_ver_2.8


Introduction to Artificial Neural Networks (ANNs): Back Propagation

http://zerkpage.tripod.com/ann.htm

What is this? A NAND gate!

initialize network weights (often small random values)
do
   forEach training example ex
      prediction = neural-net-output(network, ex)   // forward pass
      actual = teacher-output(ex)
      compute error (prediction - actual) at the output units
      compute delta(wh) for all weights from hidden layer to output layer   // backward pass
      compute delta(wi) for all weights from input layer to hidden layer    // backward pass continued
      update network weights
until all examples classified correctly or another stopping criterion is satisfied
return the network
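Since the slide labels the example network a NAND gate, here is a minimal illustrative sketch (not from the deck) of the same idea in Scala: a single sigmoid neuron trained by gradient descent on the NAND truth table. The hyperparameters (learning rate, epoch count) are arbitrary choices.

object NandNeuron {
  // NAND truth table: (inputs, target output)
  val data = Seq(
    (Array(0.0, 0.0), 1.0),
    (Array(0.0, 1.0), 1.0),
    (Array(1.0, 0.0), 1.0),
    (Array(1.0, 1.0), 0.0))

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def main(args: Array[String]): Unit = {
    var w = Array(0.0, 0.0)   // weights
    var b = 0.0               // bias
    val lr = 1.0              // learning rate (assumed)

    for (_ <- 1 to 5000; (x, t) <- data) {
      val y = sigmoid(w(0) * x(0) + w(1) * x(1) + b)   // forward pass
      val delta = (y - t) * y * (1 - y)                // gradient of 0.5*(y - t)^2 w.r.t. the weighted input
      w = Array(w(0) - lr * delta * x(0), w(1) - lr * delta * x(1))
      b -= lr * delta                                  // backward pass: update weights and bias
    }

    data.foreach { case (x, t) =>
      val y = sigmoid(w(0) * x(0) + w(1) * x(1) + b)
      println(s"input = ${x.mkString(",")}  target = $t  output = ${"%.3f".format(y)}")
    }
  }
}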

Page 7: Distributed deep learning_over_spark_20_nov_2014_ver_2.8


The network to identify the individual digits from the input image

http://neuralnetworksanddeeplearning.com/chap1.html

Page 8: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Different Shallow Architectures

[Figure: three shallow architectures, each computing a weighted sum over basis functions – fixed basis functions (linear predictors), template matchers (kernel machines), and simple trainable basis functions (ANNs, radial basis functions).]

Y. Bengio and Y. LeCun, "Scaling learning algorithms towards AI," in Large Scale Kernel Machines (L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, eds.), MIT Press, 2007.

Page 9: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

ANNs for Face Recognition?


Page 10: Distributed deep learning_over_spark_20_nov_2014_ver_2.8


DLN for Face Recognition

http://theanalyticsstore.com/deep-learning/

Page 11: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Deep Learning Networks: Learning

• No general learning algorithm (no-free-lunch theorem, Wolpert 1996).
• Learning algorithms for specific tasks – perception, control, prediction, planning, reasoning, language understanding.
• Limitations of back-propagation – local minima, optimization challenges for non-convex objective functions.
• Hinton's deep belief networks as a stack of RBMs.
• LeCun's energy-based learning for DBNs.

Page 12: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Deep Belief Networks

• A deep neural network composed of multiple layers of latent variables (hidden units or feature detectors).
• Can be viewed as a stack of RBMs.
• Hinton and his students proposed that these networks can be trained greedily, one layer at a time.
• A Boltzmann machine is a specific energy-based model with a linear energy function.

http://www.iro.umontreal.ca/~lisa/twiki/pub/Public/DeepBeliefNetworks/DBNs.png
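Not on the slide, but the factorization implied by "a stack of RBMs" is the standard DBN joint distribution (with h^0 = x and l hidden layers):

\[
P(x, h^1, \ldots, h^l) = \left( \prod_{k=0}^{l-2} P(h^k \mid h^{k+1}) \right) P(h^{l-1}, h^l)
\]

where the top two layers form an RBM and the lower layers are directed, sigmoid conditional distributions.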

Page 13: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Other DL Networks: Auto-encoders (Auto-associators or Diabolo Networks)

• The aim of an auto-encoder network is to learn a compressed representation of a set of data.
• It is an unsupervised learning algorithm that applies back-propagation, setting the target values equal to the inputs (identity function).
• A denoising auto-encoder avoids the trivial identity mapping by randomly corrupting the input, which the auto-encoder must then reconstruct (denoise).
• Best applied when there is structure in the data.
• Applications: dimensionality reduction, feature selection.
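Spelled out as an objective (this equation is not in the deck, only the bullets above), an auto-encoder with encoder f and decoder g minimizes the reconstruction error

\[
\min_{f, g} \; \sum_i \left\| x_i - g(f(x_i)) \right\|^2
\]

and the denoising variant feeds a corrupted \tilde{x}_i into f while keeping the clean x_i as the target.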

Page 14: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Why are Deep Learning Networks Brain-like?

• Traditional ML takes a statistical approach – SVMs or kernel methods – which does not apply to deep learning networks.
• Human brain – trophic factors.
• Traditional ML – a lot of data munging and representational work (feature abstraction) before the classifier can kick in.
• Deep learning – allows the system to learn representations naturally as well.

Page 15: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Success Stories of DLNs

• Android voice recognition system – based on DLNs; improves accuracy by 25% compared to the state of the art.
• Microsoft Skype Translate software and the digital assistant Cortana.
• ImageNet data (1.2 million images, 1000 classes) – error rate of 15.3%, better than the state of the art at 26.1%.

Page 16: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Success Stories of DLNs (continued)

• SENNA system – PoS tagging, chunking, NER, semantic role labeling, syntactic parsing; F1 scores comparable to the state of the art with a huge speed advantage (5 days vs. a few hours).
• DLNs vs. TF-IDF: relevance search over 1 million documents – 3.2 ms vs. 1.2 s.
• Robot navigation.

Page 17: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Potential Applications of DLNs

• Speech recognition/enhancement
• Video sequencing
• Emotion recognition (video/audio)
• Malware detection
• Robotics – navigation
• Multi-modal learning (text and image)
• Natural language processing

Page 18: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Available Resources

• Deeplearning4j – open-source implementation of Jeffrey Dean's distributed deep learning paper.
• Theano – a Python library of math functions; makes efficient use of GPUs transparently.
• Hinton's courses on Coursera: https://www.coursera.org/instructor/~154

Page 19: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Challenges in Realizing DLNs

• A large number of training examples is needed for high accuracy; a large number of parameters can also improve accuracy.
• Inherently sequential nature – layers are trained (and frozen) one at a time.
• GPUs improve training speed, but are limited by CPU-to-GPU data transfers.
• Distributed DLNs – Jeffrey Dean's work.

Page 20: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Distributed DLNs

• Motivation: scalable, low-latency training; parallelize over the training data and learn fast.
• Jeffrey Dean's work, DistBelief – a pseudo-centralized realization.

Page 21: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

What is Spark?

• Spark provides a computing abstraction that generalizes MapReduce.
• A more powerful set of operations than just map and reduce – group by, order by, sort, reduce by key, sample, union, etc.
• Provides an efficient execution environment based on distributed shared memory – keeps the working set of data in memory.
• Shark provides a Hive Query Language (HQL) interface over Spark.

Page 22: Distributed deep learning_over_spark_20_nov_2014_ver_2.8


What is Spark? Data Flow in Hadoop

Page 23: Distributed deep learning_over_spark_20_nov_2014_ver_2.8


What is Spark? Data Flow in Spark

Page 24: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Real-world use-case example: the HITS algorithm

The hub score and authority score for a node are calculated with the following algorithm (a Spark sketch of these update rules follows below):

1. Start with each node having a hub score and an authority score of 1, i.e. auth(p) = 1 and hub(p) = 1.
2. Run the authority update rule: update each node's authority score to be the sum of the hub scores of the nodes that point to it. That is, a node is given a high authority score by being linked to by pages that are recognized as hubs for information.
3. Run the hub update rule: update each node's hub score to be the sum of the authority scores of the nodes that it points to. That is, a node is given a high hub score by linking to nodes that are considered to be authorities on the subject.
4. Normalize the values by dividing each hub score by the square root of the sum of the squares of all hub scores, and each authority score by the square root of the sum of the squares of all authority scores.
5. Repeat from step 2 as necessary.
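A minimal sketch (not from the deck) of these update rules as Spark operations in Scala; the links RDD of (source, destination) edges and the toy graph are illustrative assumptions.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object HitsOnSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "HITS")

    // Directed edges (source, destination) – a toy graph for illustration
    val links = sc.parallelize(Seq((1, 2), (1, 3), (2, 3), (3, 1))).cache()
    val nodes = links.flatMap { case (s, d) => Seq(s, d) }.distinct().cache()

    // Step 1: every node starts with auth = 1.0 and hub = 1.0
    var auths = nodes.map(n => (n, 1.0))
    var hubs  = nodes.map(n => (n, 1.0))

    for (_ <- 1 to 10) {
      // Step 2: auth(p) = sum of the hub scores of nodes pointing to p
      val newAuths = links.join(hubs)                      // (source, (dest, hub(source)))
        .map { case (_, (d, h)) => (d, h) }
        .reduceByKey(_ + _)

      // Step 3: hub(p) = sum of the authority scores of nodes that p points to
      val newHubs = links.map { case (s, d) => (d, s) }
        .join(newAuths)                                    // (dest, (source, auth(dest)))
        .map { case (_, (s, a)) => (s, a) }
        .reduceByKey(_ + _)

      // Step 4: normalize by the root of the sum of squares
      val authNorm = math.sqrt(newAuths.values.map(v => v * v).sum())
      val hubNorm  = math.sqrt(newHubs.values.map(v => v * v).sum())
      auths = newAuths.mapValues(_ / authNorm)
      hubs  = newHubs.mapValues(_ / hubNorm)
    }

    auths.collect().foreach { case (n, a) => println(s"node $n: auth = $a") }
    hubs.collect().foreach  { case (n, h) => println(s"node $n: hub  = $h") }
  }
}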

Page 25: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Solve the HITS algorithm using Hadoop MR

[Diagram: the four steps below, with every step reading its input from and writing its output back to HDFS storage.]

Step 1: auth(p) = 1 and hub(p) = 1
Step 2: run the authority update rule, auth(p) = X
Step 3: run the hub update rule, hub(p) = Y
Step 4: normalize hub(p) and auth(p)

Page 26: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Solve the HITS algorithm using Spark

[Diagram: the same four steps, with intermediate data kept in memory between steps rather than written back to HDFS storage.]

Step 1: auth(p) = 1 and hub(p) = 1
Step 2: run the authority update rule, auth(p) = X
Step 3: run the hub update rule, hub(p) = Y
Step 4: normalize hub(p) and auth(p)

Page 27: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Spark

[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.

Transformations/Actions and their descriptions:

Map(function f1) – Passes each element of the RDD through f1 in parallel and returns the resulting RDD.
Filter(function f2) – Selects the elements of the RDD that return true when passed through f2.
flatMap(function f3) – Similar to Map, but f3 returns a sequence, so a single input can map to multiple outputs.
Union(RDD r1) – Returns the union of the RDD r1 with self.
Sample(flag, p, seed) – Returns a randomly sampled (with seed) p percentage of the RDD.
groupByKey(noTasks) – Can only be invoked on key-value paired data; returns the data grouped by key. The number of parallel tasks is given as an argument (default is 8).
reduceByKey(function f4, noTasks) – Aggregates the result of applying f4 to elements with the same key. The number of parallel tasks is the second argument.
Join(RDD r2, noTasks) – Joins RDD r2 with self; computes all possible pairs for a given key.
groupWith(RDD r3, noTasks) – Joins RDD r3 with self and groups by key.
sortByKey(flag) – Sorts the self RDD in ascending or descending order based on the flag.
Reduce(function f5) – Aggregates the result of applying f5 to all elements of the self RDD.
Collect() – Returns all elements of the RDD as an array.
Count() – Counts the number of elements in the RDD.
take(n) – Gets the first n elements of the RDD.
First() – Equivalent to take(1).
saveAsTextFile(path) – Persists the RDD in a file in HDFS or another Hadoop-supported file system at the given path.
saveAsSequenceFile(path) – Persists the RDD as a Hadoop sequence file. Can only be invoked on key-value paired RDDs whose elements implement the Hadoop Writable interface or equivalent.
foreach(function f6) – Runs f6 in parallel on the elements of the self RDD.
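To make a few of these operations concrete, a small word-count example in Scala (not from the deck; the input path is a placeholder):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "WordCount")
    val lines = sc.textFile("hdfs://.../data.txt")   // placeholder path
    val counts = lines
      .flatMap(line => line.split("\\s+"))           // one line -> many words
      .map(word => (word, 1))                        // key-value pairs
      .reduceByKey(_ + _)                            // aggregate counts per key
    counts.take(10).foreach(println)
  }
}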

Page 28: Distributed deep learning_over_spark_20_nov_2014_ver_2.8


Berkeley Big-data Analytics Stack (BDAS)

Page 29: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Spark: Use Cases

Ooyala
• Uses Cassandra for video data personalization.
• Pre-computed aggregates vs. on-the-fly queries.
• Moved to Spark for ML and computing views.
• Moved to Shark for on-the-fly queries – C* OLAP aggregate queries take 130 seconds on Cassandra vs. 60 ms in Spark.

Conviva
• Uses Hive for repeatedly running ad-hoc queries on video data.
• Optimized ad-hoc queries using Spark RDDs – found Spark to be 30 times faster than Hive.
• ML for connection analysis and video streaming optimization.

Yahoo
• Advertisement targeting: 30K nodes on Hadoop YARN.
• Hadoop – batch processing; Spark – iterative processing; Storm – on-the-fly processing.
• Content recommendation – collaborative filtering.

Page 30: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Spark Use Cases: Spark is good for linear algebra, optimization and N-body problems

Computations/Operations (the "giants" of data analysis [1]):
• Giant 1 (simple statistics) is perfect for Hadoop 1.0.
• Giants 2 (linear algebra), 3 (N-body) and 4 (optimization): Spark from UC Berkeley is efficient – logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares. An example is the social-group-first approach for consumer churn analysis [2].
• Interactive/on-the-fly data processing – Storm.
• OLAP – data cube operations: Dremel/Drill.
• Data sets that are not embarrassingly parallel? Deep learning – artificial neural networks/deep belief networks: machine vision from Google [3], speech analysis from Microsoft.
• Giant 5 – graph processing: GraphLab, Pregel, Giraph.

[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
[2] Richter, Yossi; Yom-Tov, Elad; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.
[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012: 1232-1240.

Page 31: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Some Spark(ling) examples

Scala code (serial):

var count = 0
for (i <- 1 to 100000) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count / 100000.0)

Sample random points in the unit square and count how many fall inside the unit circle (roughly PI/4 of them); hence you get an approximate value for PI. With PS the total points sampled and PC the points inside the circle, PS/PC approximates the area ratio AS/AC = 4/PI, so PI = 4 * (PC/PS).

Page 32: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Some Spark(ling) examples: Spark code (parallel)

val spark = new SparkContext(<Mesos master>)
var count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 100000, 12)) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count.value / 100000.0)

Notable points:
1. A Spark context is created – it talks to the Mesos master (see note below).
2. count becomes a shared variable – an accumulator.
3. The for loop iterates over an RDD – parallelize breaks the Scala range object (1 to 100000) into 12 slices.
4. The for comprehension invokes the foreach method of the RDD.

Note: Mesos is an Apache-incubated clustering system – http://mesosproject.org

Page 33: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Logistic Regression in Spark: Serial Code

// Read data file and convert it into Point objects
val lines = scala.io.Source.fromFile("data.txt").getLines()
val points = lines.map(x => parsePoint(x))

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  var gradient = Vector.zeros(D)
  for (p <- points) {
    val scale = (1 / (1 + Math.exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Result: " + w)

Page 34: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Logistic Regression in Spark: Parallel Code

// Read data file and transform it into Point objects
val spark = new SparkContext(<Mesos master>)
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
val points = lines.map(x => parsePoint(x)).cache()

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = spark.accumulator(Vector.zeros(D))
  for (p <- points) {
    val scale = (1 / (1 + Math.exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Result: " + w)

The differences from the serial version: the points RDD is cached in memory and reused across iterations, and the gradient is an accumulator so that contributions from parallel tasks are aggregated into a single shared value.

Page 35: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Deep Learning on Spark

• Fully distributed deep learning network implementation on Spark.
• Spark handles the parallelism, synchronization, distribution, and failover.
• The input data set resides in HDFS; intermediate data in the local file system.
• A publish/subscribe message-passing framework built on top of Apache Spark using the Akka framework.

Page 36: Distributed deep learning_over_spark_20_nov_2014_ver_2.8
Page 37: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Conclusions

• ANNs to distributed deep learning
• Key ideas in deep learning
• The need for distributed realizations – DistBelief, deeplearning4j, etc.
• Our work on large-scale distributed deep learning
• Deep learning leads us from statistics-based machine learning towards brain-inspired AI.

Page 38: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Thank You!

Mail • [email protected]

LinkedIn • http://in.linkedin.com/in/vijaysrinivasagneeswaran

Blogs • blogs.impetus.com

Twitter • @a_vijaysrinivas.

Copyright © Impetus Technologies, 2014

Page 39: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Backup Slides


Page 40: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Energy Based Models

• RBMs are Energy-Based Models (EBMs).
• EBMs associate an energy with every configuration of a system.
• Learning corresponds to modifying the shape of the energy function so that it has desirable properties.
• As in physics, lower energy = more stability.
• So, modify the shape of the energy function such that the desirable configurations have lower energy.

http://www.cs.nyu.edu/~yann/research/ebm/loss-func.png

Page 41: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Other DL Networks: Convolutional Networks

Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. 1999. Object Recognition with Gradient-Based Learning. In Shape, Contour and Grouping in Computer Vision, David A. Forsyth, Joseph L. Mundy, Vito Di Gesù, and Roberto Cipolla (Eds.). Springer-Verlag, London, UK, 319-.

Page 42: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Other Brain-like Approaches

• Recurrent neural networks – Long Short-Term Memory (LSTM), temporal data
• Sum-product networks – deep architectures of sum-product networks
• Hierarchical temporal memory – an online structural and algorithmic model of the neocortex

Page 43: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Recurrent Neural Networks

• Connections between units form a directed cycle, i.e. there are feedback connections.
• RNNs can use their internal memory to process arbitrary sequences of inputs.
• Plain RNNs cannot learn to look far back into the past.
• LSTMs solve this problem by introducing memory cells.
• These memory cells can remember a value for an arbitrary amount of time.

Page 44: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Sum-Product Networks (SPNs)

• An SPN is a deep network model structured as a directed acyclic graph.
• These networks allow the probability of an event to be computed quickly.
• SPNs turn multilinear functions into computationally compact forms consisting of multiple additions and multiplications.
• Leaves correspond to variables; internal nodes correspond to sums and products.

Page 45: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Hierarchical Temporal Memory

• An online machine learning model developed by Jeff Hawkins.
• This model learns one instance at a time.
• Best explained by an online stock model: today's state of a stock helps in predicting tomorrow's.
• An HTM network is a tree-shaped hierarchy of levels.
• Higher hierarchy levels can reuse patterns learned at lower levels – adopted from the brain's learning model in the form of the neocortex.

Page 46: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

http://en.wikipedia.org/wiki/Hierarchical_temporal_memory


Page 47: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Mathematical Equations

• The energy function (for an RBM with visible units x and hidden units h) is defined as follows:

\[
E(x, h) = -b'x - c'h - h'Wx
\]

where W represents the weights connecting the visible layer and the hidden layer, and b' and c' are the biases (of the visible and hidden layers respectively).
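The next slide refers to a free energy F(x); it is not defined in the deck, but for a binary RBM with the energy above the standard form is

\[
F(x) = -\log \sum_{h} e^{-E(x,h)} = -b'x - \sum_i \log\left(1 + e^{\,c_i + W_i x}\right)
\]

where W_i denotes the i-th row of W.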

Page 48: Distributed deep learning_over_spark_20_nov_2014_ver_2.8

Learning Energy Based Models

• Energy-based models can be learnt by performing gradient descent on the negative log-likelihood of the training data.
• The gradient has the following form:

\[
-\frac{\partial \log p(x)}{\partial \theta}
  = \underbrace{\frac{\partial F(x)}{\partial \theta}}_{\text{positive phase}}
  \;-\; \underbrace{\sum_{\tilde{x}} p(\tilde{x}) \, \frac{\partial F(\tilde{x})}{\partial \theta}}_{\text{negative phase}}
\]

In practice the expectation in the negative phase is intractable and is approximated with samples drawn from the model (as in Hinton's contrastive-divergence training of RBMs).