
Introduction to Apache Horn (Incubating)


Page 1: Introduction to apache horn (incubating)

Apache Horn (Incubating): a Large-scale Deep Learning Platform

Edward J. Yoon @eddieyoon
Oct 15, 2015 @ R3 Diva-Hall, Samsung Electronics

Page 2: Introduction to apache horn (incubating)

I am ..
● Member of Apache Software Foundation
● PMC member and committer, or Mentor of
  ○ Apache Incubator,
  ○ Apache Hama, Apache Horn, Apache MRQL,
  ○ and Apache Rya, Apache BigTop.
● Cloud Tech Lab, Software R&D Center
  ○ HPC Cloud (Network Analysis, ML & DNN)

Page 3: Introduction to apache horn (incubating)

What’s Apache Horn?

Horn [hɔ:n]: 얼(혼) 魂, Korean for “soul” = Mind

● Horn is a clone project of Google’s DistBelief that supports both data and model parallelism.
  ○ Apache Incubator Project (since Sep 2015)
  ○ 9 initial members from Samsung Electronics, Microsoft, Cldi Inc, LINE Plus, TUM, KAIST, …, etc.

Page 4: Introduction to apache horn (incubating)

Google’s DistBelief
● GPUs are expensive, both to buy and to rent.
● Most GPUs can only hold a relatively small amount of data in memory, and CPU-to-GPU data transfer is very slow.
  ○ Therefore, the training speed-up is small when the model does not fit in GPU memory.
● DistBelief is a framework for training deep neural networks that avoids a GPU-only approach (for the above reasons) and scales to problems with a large number of examples and dimensions (e.g., high-resolution images).

Page 5: Introduction to apache horn (incubating)

Google’s DistBelief

● It supports both Data and Model Parallelism
  ○ Data Parallelism: the training data is partitioned across several machines, each having its own replica of the model. Each replica trains on its partition of the data in parallel (see the toy sketch after this list).
  ○ Model Parallelism: the layers of each model replica are distributed across machines.
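
To make the data-parallel half concrete, here is a toy, self-contained Java sketch (not Horn or DistBelief code; the ModelReplica class and its update rule are invented for illustration): each worker thread owns a full copy of the model and trains it on its own partition of the data.

import java.util.Arrays;
import java.util.List;

public class DataParallelismSketch {

  // A stand-in "model": a single weight updated by toy training steps.
  static class ModelReplica {
    double weight = 0.0;
    void train(double[] partition) {
      for (double x : partition) {
        weight += 0.01 * (x - weight); // placeholder update rule
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // The training data, already split into partitions.
    List<double[]> partitions = Arrays.asList(
        new double[] {1, 2, 3}, new double[] {4, 5, 6}, new double[] {7, 8, 9});

    // One model replica and one worker thread per data partition.
    ModelReplica[] replicas = new ModelReplica[partitions.size()];
    Thread[] workers = new Thread[partitions.size()];
    for (int i = 0; i < partitions.size(); i++) {
      replicas[i] = new ModelReplica();
      final int id = i;
      workers[i] = new Thread(() -> replicas[id].train(partitions.get(id)));
      workers[i].start();
    }
    for (Thread w : workers) w.join();
    // The replicas would then be merged (e.g., averaged) into one model.
  }
}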

Page 6: Introduction to apache horn (incubating)

DistBelief: Basic Architecture
Each worker group performs a minibatch in the BSP paradigm and interacts with the Parameter Server asynchronously.

Page 7: Introduction to apache horn (incubating)

What’s BSP?
● Bulk Synchronous Parallel, a parallel computing model developed by Leslie Valiant of Harvard University during the 1980s.
● Iteratively (see the sketch after this list):
  a. Local Computation
  b. Communication (Message Passing)
  c. Global Barrier Synchronization
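
A minimal sketch of one such superstep, written against Apache Hama’s BSP API (the engine the later slides build on); the numeric payload and the “local work” are placeholders:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class SuperstepExample extends
    BSP<NullWritable, NullWritable, NullWritable, NullWritable, DoubleWritable> {

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, DoubleWritable> peer)
      throws IOException, SyncException, InterruptedException {
    // a. Local Computation: some value derived from this task's partition
    double localResult = 42.0; // placeholder for real work

    // b. Communication: message passing to all peers
    for (String other : peer.getAllPeerNames()) {
      peer.send(other, new DoubleWritable(localResult));
    }

    // c. Global Barrier Synchronization: every task blocks here
    peer.sync();

    // After the barrier, the messages the peers sent before it are readable.
    double sum = 0.0;
    DoubleWritable msg;
    while ((msg = peer.getCurrentMessage()) != null) {
      sum += msg.get(); // combine the peers' results
    }
  }
}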

Page 8: Introduction to apache horn (incubating)

DistBelief: Batch Optimization

The Coordinator 1) finds stragglers (slow tasks) for better load balancing and resource usage, similar to Google MapReduce’s “Backup Tasks”, and 2) reduces communication overheads between the central Parameter Server and the workers, acting something like Aggregators.

Page 9: Introduction to apache horn (incubating)

As a result:
● A CPU cluster can train deep networks significantly faster than a GPU, without a limit on the maximum model size.
  ○ The CPU cluster is 10x faster than a GPU.
● Trained a model with over 1 billion parameters to achieve better-than-state-of-the-art performance on the ImageNet challenge.

Nov 2012: IBM simulates 530 billion neurons and 100 trillion synapses, using 1,572,864 processor cores, 1.5 PB of memory, and 6,291,456 threads.

Page 10: Introduction to apache horn (incubating)

Wait, .. Why do we need this?
● Deep learning is likely to spur applications beyond speech and image recognition in the near term.
  ○ e.g., medicine, manufacturing, and transportation.

Page 11: Introduction to apache horn (incubating)

and, it’s Closed Source Software
● We need to solve problems of scale (the size of the training set and of the neural networks), but many OSS projects such as Caffe, DeepDist, Spark MLlib, Deeplearning4j, and NeuralGiraph are data-parallel or model-parallel only.
● So, we started to clone Google’s DistBelief, called Apache Horn (Incubating).

Page 12: Introduction to apache horn (incubating)

The key idea of the implementation

● .. is to use existing OSS distributed systems
  ○ Apache Hadoop: Distributed File System, Resource Manager.
  ○ Apache Hama: a general-purpose BSP computing engine on top of Hadoop, which can be used for both data-parallel and graph-parallel workloads in a flexible way.

Page 13: Introduction to apache horn (incubating)

Apache Hama: BSP framework

[Diagram: Task 1 … Task N running on the BSP framework (on Hama or YARN) over Hadoop HDFS]

Like MapReduce, the Apache Hama BSP framework schedules tasks according to the distance between a task’s input data and the requesting nodes.

BSP tasks are globally synchronized after performing computations on local data and communication actions.

Page 14: Introduction to apache horn (incubating)

Global Regional Synchronization

[Diagram: Tasks 1–6 split into groups on the BSP framework (on Hama or YARN) over Hadoop HDFS]

All tasks within the same group are synchronized with each other. Each group works asynchronously as an independent BSP job.

Page 15: Introduction to apache horn (incubating)

Async mini-batches using Regional Synchronization

[Diagram: task groups (Tasks 1–6) on the BSP framework (on Hama or YARN) over Hadoop HDFS, exchanging parameters with two Parameter Servers via Parameter Swapping]

Each group performs a minibatch in the BSP paradigm and interacts with the Parameter Server asynchronously (a hypothetical sketch of this loop follows below).
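
A hypothetical sketch of that loop in Java (invented class and method names, not the actual Horn API): each group pulls possibly-stale parameters, computes a gradient for its minibatch, and pushes the update back with no barrier across groups.

import java.util.Arrays;
import java.util.List;

public class AsyncMinibatchSketch {

  // Stand-in for the Parameter Server: one shared weight vector.
  static class ParameterServer {
    private volatile double[] params = new double[] {0.0};
    double[] fetch() { return params.clone(); }     // may be slightly stale
    synchronized void push(double[] grad) {          // apply an update
      double[] next = params.clone();
      for (int i = 0; i < next.length; i++) next[i] -= 0.01 * grad[i];
      params = next;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ParameterServer ps = new ParameterServer();
    List<double[]> minibatches =
        Arrays.asList(new double[] {1.0}, new double[] {2.0});

    // Two worker "groups", each cycling through its own minibatches.
    Runnable group = () -> {
      for (double[] batch : minibatches) {
        double[] w = ps.fetch();                        // pull parameters
        double[] grad = new double[] {w[0] - batch[0]}; // placeholder gradient
        ps.push(grad);              // push asynchronously; no cross-group barrier
      }
    };
    Thread g1 = new Thread(group), g2 = new Thread(group);
    g1.start(); g2.start(); g1.join(); g2.join();
    System.out.println(ps.fetch()[0]);
  }
}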

Page 16: Introduction to apache horn (incubating)

Async mini-batches using Regional Synchronization (cont.)

[Diagram: same layout as the previous slide, with one task group highlighted as the Coordinator]

One of the groups works as a Coordinator; each group performs a minibatch in the BSP paradigm and interacts with the Parameter Server asynchronously via Parameter Swapping.

Page 17: Introduction to apache horn (incubating)

Neuron-centric Programming APIs

User-defined neuron-centric programming APIs: the activation and cost functions compute the propagated information or error messages and send their updates to the Parameter Server (but not fully designed yet).

Similar to Google’s Pregel (a hypothetical sketch follows below).
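
Since the slide notes the design was not final, here is only a hypothetical sketch of what such a Pregel-style, neuron-centric API might look like; Neuron, forward, backward, and the commented-out propagate/push calls are all invented names, not the actual Horn API:

// Each neuron, like a Pregel vertex, reacts to incoming messages.
abstract class Neuron {
  protected double output;

  // Forward pass: combine inputs from the previous layer, apply the
  // activation function, and emit the result to the next layer.
  public abstract void forward(Iterable<Double> inputs);

  // Backward pass: receive error messages from the next layer, compute
  // this neuron's gradient, and send its update to the Parameter Server.
  public abstract void backward(Iterable<Double> deltas);
}

class SigmoidNeuron extends Neuron {
  @Override
  public void forward(Iterable<Double> inputs) {
    double sum = 0.0;
    for (double in : inputs) sum += in;        // weights folded into messages here
    output = 1.0 / (1 + Math.exp(-sum));       // sigmoid activation
    // propagate(output);  // hypothetical: emit to the next layer
  }

  @Override
  public void backward(Iterable<Double> deltas) {
    double err = 0.0;
    for (double d : deltas) err += d;
    double grad = err * output * (1 - output); // sigmoid derivative
    // pushToParameterServer(grad);  // hypothetical: send the update
  }
}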

Page 18: Introduction to apache horn (incubating)

Job Configuration APIs

/*
 * Sigmoid Activation Function
 */
public static class Sigmoid extends ActivationFunction {
  public double apply(double input) {
    return 1.0 / (1 + Math.exp(-input));
  }
}

...

public static void main(String[] args) {
  ANNJob ann = new ANNJob();

  // Initialize the topology of the model
  ann.addLayer(int featureDimension, Sigmoid.class, int numOfTasks);
  ann.addLayer(int featureDimension, Step.class, int numOfTasks);
  ann.addLayer(int featureDimension, Tanh.class, int numOfTasks);
  …

  ann.setCostFunction(CrossEntropy.class);
  ..
}

Page 19: Introduction to apache horn (incubating)

Job Submission Flow

[Diagram: the user’s ANN Job is submitted through the Apache Horn Client and Web UI to the BSP framework on Apache Hama or YARN clusters, over Hadoop HDFS. Tasks 1–9 are arranged in worker groups, with Data Parallelism across groups and Model Parallelism within each group; one worker group works as a Coordinator, and groups exchange parameters with the two Parameter Servers via Parameter Swapping.]

Page 20: Introduction to apache horn (incubating)

Horn Community
● https://horn.incubator.apache.org/
● https://issues.apache.org/jira/browse/HORN
● Mailing lists
  ○ dev@horn.incubator.apache.org