
Page 1:

Mocha.jl: Deep Learning for Julia

Chiyuan Zhang (@pluskid), CSAIL, MIT

Page 2:

JULIA BASICS
A 10-minute Introduction to Julia

Page 3:

A glance of basic syntax

Numpy | Matlab | Julia | Description
x[0] | x(1) | x[1] | Index 1st elem of array
np.random.randn(3,3) | randn(3) | randn(3,3) | 3-by-3 random Gaussian matrix
np.arange(1,11) | 1:10 | 1:10 | 1,2,…,10
X * Y | X .* Y | X .* Y | Elementwise multiplication
np.dot(X,Y) | X * Y | X * Y | Matrix multiplication
linalg.solve(X,Y) | X \ Y | X \ Y | Left matrix division
d = {'one':1, 'two':2}; d['one'] | d = containers.Map({'one','two'},{1,2}); d('one') | d = Dict("one"=>1,"two"=>2); d["one"] | Hash table
r = np.random.rand(*x.shape); y = x * (r > t) | r = rand(size(x)); y = x .* (r > t) | r = rand(size(x)); y = x .* (r .> t) | Dropout
f = lambda x, mu, sigma: np.exp(-(x-mu)**2/(2*sigma**2)) / np.sqrt(2*np.pi*sigma**2) | f = @(x,mu,sigma) exp(-(x-mu)^2/(2*sigma^2)) / sqrt(2*pi*sigma^2) | f(x,μ,σ) = exp(-(x-μ)^2/(2σ^2)) / sqrt(2π*σ^2) | Gaussian density function

Page 4:

Beyond Syntax

Close-to-C performance in native Julia code: you typically do not need to explicitly vectorize your code (as you would in Matlab).

Type annotations, an LLVM-based just-in-time (JIT) compiler, easy parallelization with coroutines on a single machine or across the nodes of a cluster, and so on.
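To make the performance claim concrete, here is a minimal sketch (not from the slides) of an explicit loop; the JIT compiles it to machine code, so there is no need to rewrite it in "vectorized" form for speed:

# An explicit loop in Julia; the JIT compiles it to fast native code.
function sumsq(x::Vector{Float64})
    s = 0.0
    for v in x
        s += v * v
    end
    return s
end

sumsq(rand(10000))   # runs at close-to-C speed despite the explicit loop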

Page 5:

Convenient FFI

Calling C/Fortran functions

Calling Python functions (PyCall.jl, PyPlot.jl, IJulia, …)

Interaction with C++ functions / objects directly, see Cxx.jl
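For example (a hedged sketch: the libc call is the standard example from the Julia manual, and the PyCall syntax is the 0.4-era @pyimport form):

t = ccall((:clock, "libc"), Int32, ())   # call the C function clock() directly, no glue code

using PyCall
@pyimport math as pymath                 # import a Python module via PyCall.jl
pymath.cos(2pi)                          # returns 1.0, computed by Python's math.cos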

Page 6:

Powerful Macros

JuMP, optimization models

OpenPPL, probabilistic programming
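A toy illustration (hypothetical example, not taken from JuMP or OpenPPL) of why macros are powerful: a macro receives the expression itself and can rewrite it before compilation, which is how these packages turn natural-looking syntax into model-building code.

macro logged(ex)
    # interpolate the expression's source text at expansion time,
    # then evaluate the original expression in the caller's scope
    return :( (println("evaluating: ", $(string(ex))); $(esc(ex))) )
end

@logged 1 + 2 * 3    # prints "evaluating: 1 + 2 * 3" and returns 7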

Page 7:

Disadvantages of Julia

■ Still at an early stage, so:
– The ecosystem is still young (653 Julia packages vs. 66,687 PyPI packages), e.g. Images.jl still does not have a resize function…
– The core language is still evolving, e.g. the current v0.4-RC introduced a lot of breaking changes (and also exciting new features)
– ??

Page 8:

MOCHA.JL
Deep Learning in Julia

Page 9:

Image sources: Li Deng and Dong Yu. Deep Learning – Methods and Applications. Zheng, Shuai et al. “Conditional Random Fields as Recurrent Neural Networks.” arXiv.org cs.CV (2015). Google Deep Mind. Human-level control through deep reinforcement learning. Nature, Feb. 2015. Andrej Karpathy and Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015.

Page 10:

Why Is Deep Learning Successful?

■ Theoretical point of view
– Nowhere near a complete theoretical understanding of deep learning yet

■ Practical point of view
– Big Data: large amounts of labeled (thanks to Amazon Mechanical Turk), high-dimensional data (large images, videos, speech and text corpora, etc.), made available by the Internet
– Computational Power: GPUs, large clusters
– Human Power: the "deep learning conspiracy"
– Software Engineering: network architecture & computation components decoupled

Page 11:

Layers & Back-propagation

Top: The typical way of visualizing a neural network: clear and intuitive, but it does not decompose the computation cleanly into layers.

Bottom: An alternative way of thinking about neural networks. Each layer is a black box that can carry out forward and backward computation. The important point: the computation is completely encapsulated inside the layer; the black box does NOT need to know the external environment (e.g. the overall network architecture) to do its computation. E.g. a Linear Layer (Input to Output), whose forward and backward passes are sketched below.
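As a concrete instance (the standard formulas; the slide shows these as a figure, so treat this rendering as a reconstruction), a linear layer with weights W and bias b computes:

Forward: y = W·x + b
Backward: ∂L/∂x = Wᵀ·(∂L/∂y),  ∂L/∂W = (∂L/∂y)·xᵀ,  ∂L/∂b = ∂L/∂y

That is, given the gradient of the loss L with respect to its output, the layer produces the gradient with respect to its input (passed to the layer below) and with respect to its own parameters (used by the solver), without ever looking outside the black box.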

More generally, a deep neural network can be viewed as a directed acyclic graph (optionally with time-delayed recurrent connections).

Page 12:

Advantages of the Decoupled View of NNs

■ Highly efficient computation components can be written by programmers and experts in high-performance computing and code optimization.
– E.g. the cuDNN library from Nvidia

■ Researchers can try out novel architectures easily without needing to worry too much about the internal implementation of commonly used layers.
– Some examples of complicated networks built with standard components: Network-in-Network, Siamese Networks, Fully-Convolutional Networks, etc.

Image Source: J. Long, E. Shelhamer, T. Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Page 13:

Deep Learning Libraries

■ C++: Caffe (widely adopted in academia), dmlc/cxxnet, cuda-convnet, CNTK (by MSR), etc.

■ Python: Theano (auto-differentiation) and various wrappers; NervanaSystems/neon; etc.

■ Lua: Torch 7 (supported by Facebook and Google DeepMind)

■ Matlab: MatConvNet (by VGG)

■ Julia: pluskid/Mocha.jl

■ …

Page 14:

Why Mocha.jl?

1. Julia: written in Julia and easy interaction with the rest of the Julia ecosystem.

2. Minimal dependencies: the Julia backend runs out of the box; the CUDA backend depends on Nvidia cuDNN.

3. Correctness: all the computation components are unit-tested.

4. Modular architecture: layers, activation functions, regularizers, network topology, solvers, etc.

Julia compiles via LLVM, so Julia code could potentially be compiled directly for GPU devices in the future. When that happens, writing neural networks in Julia will be really enjoyable!

Page 15:

Image Classification IJulia Demo

Page 16:

Mini-Tutorial: ConvNets on MNIST

■ MNIST: Handwritten digits

■ Data preparation:
– Following convention, images are represented as a 4D tensor: width-by-height-by-channels-by-count
– For MNIST: 28-by-28-by-1-by-64
– Mocha.jl supports general ND tensors

■ Data are stored in the HDF5 file format
– Commonly supported by Matlab, Numpy, etc.
– See examples/mnist/gen-mnist.sh
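A minimal sketch (assuming the HDF5.jl package; this is not the gen-mnist.sh script itself) of writing such a file from Julia, with dataset names matching the blobs the data layer exposes:

using HDF5

data  = rand(Float32, 28, 28, 1, 64)                  # width-by-height-by-channels-by-count
label = convert(Array{Float32}, rand(0:9, 1, 64))     # one label per image

h5open("data/train.hdf5", "w") do file
    write(file, "data",  data)      # dataset names "data" and "label" are assumed
    write(file, "label", label)     # to match the :data and :label blobs
end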

Page 17:

Defining Network Architecture

■ A network starts with data layers (inputs), and ends with prediction or loss layers

data_layer = AsyncHDF5DataLayer(name="train-data", source="data/train.txt", batch_size=64, shuffle=true)

■ The source file data/train.txt lists the HDF5 files of the training set

■ 64 images are provided in each mini-batch

■ The data are shuffled to improve convergence

■ The async data layer uses Julia's @async to pre-read data while computation is running on the CPU / GPU

Page 18:

Convolution Layer

LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

conv_layer = ConvolutionLayer(name="conv1", n_filter=20, kernel=(5,5), bottoms=[:data], tops=[:conv])

Page 19:

Pooling Layer

pool_layer = PoolingLayer(name="pool1", kernel=(2,2), stride=(2,2), bottoms=[:conv], tops=[:pool])

■ The pooling layer operates on the output of the convolution layer

■ By default, MAX pooling is performed; you can switch to MEAN pooling by specifying pooling=Pooling.Mean()

Page 20:

Constructing DAG with tops and bottoms

■ Network architecture is determined by connecting tops (output) blobs to bottoms (input) blobs with matching blob names.

■ Layers are automatically sorted and connected as a directed acyclic graph (DAG).

■ The figure on the right shows the visualization of the LeNet for MNIST: conv-pool x2 + dense x2

Page 21:

Definition of the rest of the layers

conv2_layer = ConvolutionLayer(name="conv2", n_filter=50, kernel=(5,5), bottoms=[:pool], tops=[:conv2])

pool2_layer = PoolingLayer(name="pool2", kernel=(2,2), stride=(2,2), bottoms=[:conv2], tops=[:pool2])

fc1_layer = InnerProductLayer(name="ip1", output_dim=500, neuron=Neurons.ReLU(), bottoms=[:pool2], tops=[:ip1])

fc2_layer = InnerProductLayer(name="ip2", output_dim=10, bottoms=[:ip1], tops=[:ip2])

loss_layer = SoftmaxLossLayer(name="loss", bottoms=[:ip2, :label])
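These layer definitions are then assembled into a Net on a backend (a sketch following the layout of Mocha's MNIST example; the backend construction is shown again on the backend slide below):

backend = CPUBackend()        # or GPUBackend(), see below
init(backend)

common_layers = [conv_layer, pool_layer, conv2_layer, pool2_layer, fc1_layer, fc2_layer]
net = Net("MNIST-train", backend, [data_layer, common_layers..., loss_layer])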

Page 22:

The Stochastic Gradient Descent Solver

method = SGD()

params = make_solver_parameters(method, max_iter=10000, regu_coef=0.0005, mom_policy=MomPolicy.Fixed(0.9), lr_policy=LRPolicy.Inv(0.01, 0.0001, 0.75), load_from=exp_dir)

solver = Solver(method, params)

■ Solvers have many customizable parameters, including the learning-rate policy, momentum policy, etc. Advanced policies, such as halving the learning rate when performance on the validation set drops, are also supported.

■ See Mocha.jl document for other available solvers.

Page 23:

Coffee Breaks

… for the solver

setup_coffee_lounge(solver, save_into="$exp_dir/statistics.jld", every_n_iter=1000)

# report training progress every 100 iterations

add_coffee_break(solver, TrainingSummary(), every_n_iter=100)

# save snapshots every 5000 iterations

add_coffee_break(solver, Snapshot(exp_dir), every_n_iter=5000)

Page 24:

Solver Statistics

Solver statistics are automatically saved if the coffee lounge is set up.

Snapshots save the training progress periodically; training can be resumed from the last snapshot after an interruption.

Page 25:

Switching Backends: CPU vs GPU

backend = use_gpu ? GPUBackend() : CPUBackend()
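The rest of the training script is identical for both backends; a sketch of the surrounding calls (init, solve, and shutdown are the usual Mocha entry points; treat the exact sequence as an assumption based on the MNIST example):

init(backend)             # initialize the chosen backend before building the Net
# ... define the net, solver, and coffee breaks as on the previous slides ...
solve(solver, net)        # run training; the same code works on CPU and GPU
shutdown(backend)         # release backend resources when training is done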

Page 26:

THANK YOU!

http://julialang.org

https://github.com/pluskid/Mocha.jl