A screencast with audio is available at: http://youtu.be/fdbQreQacIQ
In this talk, we take Deep Learning to task with real-world data puzzles to solve.
Data:
- Africa Soil Kaggle Challenge: top (#1) position by H2O Deep Learning
- Higgs binary classification dataset (10M rows, 29 cols)
- MNIST 10-class dataset
- Weather categorical dataset
- eBay text classification dataset (8500 cols, 500k rows, 467 classes)
- ECG heartbeat anomaly detection

Deep Learning through Examples
0xdata / H2O.ai: Scalable In-Memory Machine Learning
Silicon Valley Big Data Science Meetup, Vendavo, Mountain View, 9/11/14
Arno Candel

Who am I?
- PhD in Computational Physics, 2005, from ETH Zurich, Switzerland
- 6 years at SLAC: Accelerator Physics Modeling
- 2 years at Skytree, Inc: Machine Learning
- 9 months at 0xdata/H2O: Machine Learning
- 15 years in HPC/Supercomputing/Modeling
- Named “2014 Big Data All-Star” by Fortune Magazine
@ArnoCandel
H2O Deep Learning, @ArnoCandel

H2O Deep Learning: Kaggle #1 rank (out of 413), 40 days left
matlabulous (Jo-fai Chow, Blend it like a Bayesian) says:
“I am 99.99999999999% sure that I can still go further with H2O.”
Achieved with H2O Deep Learning from R.

Outline
- Intro & Live Demo (10 mins)
- Methods & Implementation (20 mins)
- Results & Live Demos (25 mins)
  - Higgs boson detection
  - MNIST handwritten digits
  - text classification
- Q & A (5 mins)

About H2O (aka 0xdata)
- Java, Apache v2 Open Source
- Join the www.h2o.ai/community
- #1 Java Machine Learning in GitHub

Customer Demands for Practical Machine Learning
Requirements    Value
In-Memory       Fast (Interactive)
Distributed     Big Data (No Sampling)
Open Source     Ownership of Methods
API / SDK       Extensibility
H2O was developed by 0xdata from scratch to meet these requirements.

H2O Integration
[Diagram: H2O runs standalone, over YARN, or on Hadoop MRv1, with data in HDFS; client interfaces: Java, R, Scala, JSON, Python]

H2O Architecture
[Diagram: distributed in-memory K-V store with column compression; memory manager; MapReduce; machine learning algorithms (e.g. Deep Learning); R engine; nano-fast scoring engine; prediction engine]

H2O - The Killer App on Spark
http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html

H2O Deep Learning on Spark
Brand-Sparkling-New Sneak Preview!

    // Test if we can correctly learn A, B where Y = logistic(A + B*X)
    test("deep learning log regression") {
      val nPoints = 10000
      val A = 2.0
      val B = -1.5

      // Generate testing data
      val trainData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 42)
      // Create RDD from testing data
      val trainRDD = sc.parallelize(trainData, 2)
      trainRDD.cache()

      import H2OContext._
      // Create H2O data frame (will be implicit in the future)
      val trainH2ORDD = toDataFrame(sc, trainRDD)

      // Create a H2O DeepLearning model
      val dlParams = new DeepLearningParameters()
      dlParams.source = trainH2ORDD
      dlParams.response = trainH2ORDD.lastVec()
      dlParams.classification = true
      val dl = new DeepLearning(dlParams)
      val dlModel = dl.train().get()

      // Score validation data
      val validationData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 17)
      val validationRDD = sc.parallelize(validationData, 2)
      val validationH2ORDD = toDataFrame(sc, validationRDD)
      val predictionH2OFrame = new DataFrame(dlModel.score(validationH2ORDD))('predict)
      val predictionRDD = toRDD[DoubleHolder](sc, predictionH2OFrame) // will be implicit in the future

      // Validate prediction
      validatePrediction(
        predictionRDD.collect().map(_.predict.getOrElse(Double.NaN)),
        validationData)
    }

John Chambers (creator of the S language, R-core member) names H2O R API in top three promising R projects.
H2O R CRAN package

H2O + R = Happy Data Scientist
Machine Learning on Big Data with R: data resides on the H2O cluster.

Higgs Particle Discovery
Machine Learning Meets Physics. Or rather: Back to the roots (WWW was invented at CERN in ’89…)
Higgs vs. Background
Large Hadron Collider: Largest experiment of mankind! $13+ billion, 16.8 miles long, 120 MegaWatts, -456F, 1 PB/day, etc.
Higgs boson discovery (July ’12) led to 2013 Nobel prize!
http://arxiv.org/pdf/1402.4735v2.pdf
Images courtesy CERN / LHC

Higgs: Binary Classification Problem
Current methods of choice for physicists:
- Boosted Decision Trees
- Neural networks with 1 hidden layer
BUT: Must first add derived high-level features (physics formulae)
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows

Algorithm                    low-level H2O AUC   all features H2O AUC (derived features added)
Generalized Linear Model     0.596               0.684
Random Forest                0.764               0.840
Gradient Boosted Trees       0.753               0.839
Neural Net 1 hidden layer    0.760               0.830

Metric: AUC = Area under the ROC curve (range: 0.5…1, higher is better)

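As a rough illustration of the AUC metric used in the table, here is a minimal Python sketch. It uses the pairwise interpretation (the fraction of positive/negative pairs that are ranked correctly, ties counting half). The function name `auc` and the toy scores are assumptions for illustration, not H2O code, and the quadratic loop is only meant for small inputs.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
toy_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

A model that ranks at random lands near 0.5, which is why the slide gives the range as 0.5…1.
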
Higgs: Can Deep Learning Do Better?
Let’s build a H2O Deep Learning model and find out! (That was my last weekend.)

Algorithm                    low-level H2O AUC   all features H2O AUC
Generalized Linear Model     0.596               0.684
Random Forest                0.764               0.840
Gradient Boosted Trees       0.753               0.839
Neural Net 1 hidden layer    0.760               0.830
Deep Learning                [Your guess goes here]

Reference paper results: baseline 0.733

What is Deep Learning?
Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.
Example: input data (image) -> prediction (who is it?)
Facebook's DeepFace (Yann LeCun) recognises faces as well as humans.

What is NOT Deep
- Linear models are not deep (by definition)
- Neural nets with 1 hidden layer are not deep (only 1 layer, no feature hierarchy)
- SVMs and Kernel methods are not deep (2 layers: kernel + linear)
- Classification trees are not deep (operate on original input space, no new features generated)

Deep Learning is Trending
[Chart: Google Trends, 2009 to 2013]
Businesses are using Deep Learning techniques:
- Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
- FBI FACE: $1 billion face recognition project
- Chinese Search Giant Baidu Hires Man Behind the “Google Brain” (Andrew Ng)

Deep Learning History
Slides by Yann LeCun (now Facebook)
Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs!?) smarter.

Deep Learning in H2O
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy (weight initialization, adaptive learning rate, momentum, dropout regularization, L1/L2 regularization, grid search, checkpointing, auto-tuning, model averaging)
= Top-notch prediction engine!

Example Neural Network
“fully connected” directed graph of neurons
[Diagram: information flows from the input layer (age, income, employment) through two hidden layers to the output layer (married, single)]

Layer             # neurons   # connections to next layer
Input layer       3           3x4
Hidden layer 1    4           4x3
Hidden layer 2    3           3x2
Output layer      2

Prediction: Forward Propagation
“neurons activate each other via weighted sums”
Inputs $x_i$ (age, income, employment), hidden activations $y_j$ and $z_k$, outputs $p_l$ (married, single):
$y_j = \tanh\left(\sum_i x_i u_{ij} + b_j\right)$
$z_k = \tanh\left(\sum_j y_j v_{jk} + c_k\right)$
$p_l = \mathrm{softmax}\left(\sum_k z_k w_{kl} + d_l\right)$
$\mathrm{softmax}(x_k) = \exp(x_k) / \sum_{k'} \exp(x_{k'})$
- $u_{ij}$, $v_{jk}$, $w_{kl}$: weights; $b_j$, $c_k$, $d_l$: bias values (independent of inputs)
- per-class probabilities: $\sum_l p_l = 1$
- activation function: tanh; alternative: $x \to \max(0, x)$ (“rectifier”)
- $p_l$ is a non-linear function of $x_i$: can approximate ANY function with enough layers!

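The forward pass can be sketched in a few lines of plain Python. This is an illustrative toy for the 3-4-3-2 example network, not H2O's implementation: the helper names (`dense`, `softmax`, `forward`, `random_layer`), the random weights in [-0.5, 0.5] and the input values are all assumptions.

```python
import math
import random


def dense(inputs, weights, biases):
    """Weighted sums: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [
        sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]


def softmax(values):
    """exp(v_k) / sum_k' exp(v_k'), shifted by max(values) for numerical stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """layers is a list of (weights, biases); tanh on hidden layers, softmax on the output."""
    activation = x
    for weights, biases in layers[:-1]:
        activation = [math.tanh(v) for v in dense(activation, weights, biases)]
    weights, biases = layers[-1]
    return softmax(dense(activation, weights, biases))


def random_layer(n_in, n_out, rng):
    """Hypothetical initialization: small random weights, zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


rng = random.Random(42)
layers = [random_layer(3, 4, rng), random_layer(4, 3, rng), random_layer(3, 2, rng)]
probs = forward([0.2, -1.0, 0.5], layers)  # standardized age, income, employment
```

Whatever the weights are, the two outputs are positive and sum to 1, which is what lets them be read as per-class probabilities.
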
Data preparation & Initialization
Neural Networks are sensitive to numerical noise and operate best in the linear regime (not saturated).
Automatic standardization of data:
- $x_i$: mean = 0, stddev = 1
- horizontalize categorical variables, e.g. {full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed
Automatic initialization of weights:
- Poor man’s initialization: random weights $w_{kl}$
- Default (better): Uniform distribution in +/- sqrt(6/(#units + #units_previous_layer))

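A minimal Python sketch of these three preparation steps follows. The function names are hypothetical, the standardization uses the population standard deviation (an assumption; the slide does not say which variant is used), and the categorical encoding drops the last level so that it maps to all zeros, as in the slide's example.

```python
import math
import random


def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(variance) or 1.0  # constant columns are left unscaled
    return [(v - mean) / std for v in column]


def encode_category(value, levels):
    """Indicator columns for all levels but the last; the last level is all zeros."""
    return [1 if value == level else 0 for level in levels[:-1]]


def init_weights(n_in, n_out, rng):
    """Uniform weights in +/- sqrt(6 / (units + units_previous_layer))."""
    bound = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-bound, bound) for _ in range(n_out)] for _ in range(n_in)]


ages = standardize([20.0, 30.0, 40.0])
levels = ["full-time", "part-time", "none", "self-employed"]
part_time = encode_category("part-time", levels)
self_employed = encode_category("self-employed", levels)
weights = init_weights(3, 4, random.Random(0))
```

Here `part_time` and `self_employed` reproduce the {0,1,0} and {0,0,0} patterns from the slide.
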
Training: Update Weights & Biases
For each training row, we make a prediction and compare with the actual label (supervised learning):

           predicted   actual
married    0.8         1
single     0.2         0

Objective: minimize prediction error (MSE or cross-entropy)
- Mean Square Error = (0.2^2 + 0.2^2)/2: “penalize differences per-class”
- Cross-entropy = -log(0.8): “strongly penalize non-1-ness”
Stochastic Gradient Descent: update weights and biases via the gradient of the error (via back-propagation):
$w \leftarrow w - \mathrm{rate} \cdot \partial E / \partial w$
[Plot: error E as a function of weight w, with a step of size rate along the negative gradient]

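The two error measures and the update rule can be checked on the slide's own numbers with a short Python sketch. The function names and the example weights and gradients passed to `sgd_step` are assumptions for illustration.

```python
import math


def mean_square_error(predicted, actual):
    """Average squared difference over the classes."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def cross_entropy(predicted, actual):
    """-log of the probability assigned to the actual class (one-hot labels)."""
    return -sum(a * math.log(p) for p, a in zip(predicted, actual) if a > 0)


def sgd_step(weights, gradients, rate):
    """w <- w - rate * dE/dw, applied element-wise."""
    return [w - rate * g for w, g in zip(weights, gradients)]


predicted = [0.8, 0.2]  # married, single
actual = [1, 0]
mse = mean_square_error(predicted, actual)   # (0.2^2 + 0.2^2) / 2
ce = cross_entropy(predicted, actual)        # -log(0.8)
updated = sgd_step([1.0, -2.0], [0.5, -1.0], rate=0.1)
```

Cross-entropy grows without bound as the probability of the actual class approaches 0, which is the “strongly penalize non-1-ness” remark on the slide.
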
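Backward Propagation
How to compute $\partial E / \partial w_i$ for $w_i \leftarrow w_i - \mathrm{rate} \cdot \partial E / \partial w_i$?
Naive: For every i, evaluate E twice at $(w_1, \ldots, w_i \pm \Delta, \ldots, w_N)$… Slow!
Backprop: Compute $\partial E / \partial w_i$ via chain rule going backwards:
$\mathrm{net} = \sum_i w_i x_i + b$
$y = \mathrm{activation}(\mathrm{net})$
$E = \mathrm{error}(y)$
$\partial E / \partial w_i = \partial E / \partial y \cdot \partial y / \partial \mathrm{net} \cdot \partial \mathrm{net} / \partial w_i = \partial(\mathrm{error}(y)) / \partial y \cdot \partial(\mathrm{activation}(\mathrm{net})) / \partial \mathrm{net} \cdot x_i$
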
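To make the comparison concrete, the sketch below computes the gradient for a single neuron both ways: with the chain rule and with the naive two-evaluations-per-weight approach. It assumes a tanh activation and a squared error E = (y - target)^2 / 2; those choices, the function names and the numbers are illustrative assumptions.

```python
import math


def neuron_error(weights, bias, x, target):
    """E = 0.5 * (tanh(sum_i w_i x_i + b) - target)^2."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    y = math.tanh(net)
    return 0.5 * (y - target) ** 2


def backprop_gradient(weights, bias, x, target):
    """dE/dw_i = dE/dy * dy/dnet * dnet/dw_i, from one forward pass."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    y = math.tanh(net)
    dE_dy = y - target        # derivative of 0.5 * (y - target)^2
    dy_dnet = 1.0 - y * y     # derivative of tanh(net)
    return [dE_dy * dy_dnet * xi for xi in x]   # dnet/dw_i = x_i


def finite_difference_gradient(weights, bias, x, target, delta=1e-6):
    """Naive approach: evaluate E twice per weight, at w_i + delta and w_i - delta."""
    gradients = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += delta
        down[i] -= delta
        diff = neuron_error(up, bias, x, target) - neuron_error(down, bias, x, target)
        gradients.append(diff / (2 * delta))
    return gradients


w, b, x, target = [0.3, -0.2, 0.5], 0.1, [1.0, 2.0, -1.0], 1.0
analytic = backprop_gradient(w, b, x, target)
numeric = finite_difference_gradient(w, b, x, target)
```

Both give the same gradient up to rounding, but the naive version needs 2N error evaluations for N weights, while the chain rule reuses a single forward pass.
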
H2O Deep Learning Architecture
[Diagram: H2O nodes (JVMs), each with an HTTPD and a share of the H2O atomic in-memory K-V store; communication between nodes/JVMs is synchronous, between threads asynchronous]
1. Initial model: weights and biases w, held in the H2O atomic in-memory K-V store
2. Map: each node trains a copy of the weights and biases with (some or all of) its local data with asynchronous F/J threads, giving w1, w2, w3, w4
3. Reduce: model averaging: average weights and biases from all nodes (partial sums w1+w3 and w2+w4), w = (w1+w2+w3+w4)/4
4. Updated model: w
Keep iterating over the data (“epochs”), score from time to time.
Query & display the model via JSON, WWW.
The number of training points per MapReduce iteration is auto-tuned (default) or user-specified.
Speedup is at least #nodes/log(#rows), arXiv:1209.4129v3

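The map/reduce training loop can be imitated on one machine with a small Python sketch. To keep it short, a linear model trained by SGD stands in for the neural network, the "nodes" are just list slices processed one after another, and all names and numbers are assumptions; it only illustrates the averaging scheme, not H2O's distributed code.

```python
def local_sgd(weights, rows, rate):
    """Map step: train a private copy of the weights on one node's rows."""
    w = list(weights)
    for x, target in rows:
        prediction = sum(wi * xi for wi, xi in zip(w, x))
        error = prediction - target
        w = [wi - rate * error * xi for wi, xi in zip(w, x)]
    return w


def average_models(models):
    """Reduce step: element-wise average of the per-node weights."""
    return [sum(column) / len(models) for column in zip(*models)]


def train(shards, n_weights, rate=0.05, iterations=200):
    """Repeat map (local training) and reduce (averaging) over the data."""
    weights = [0.0] * n_weights
    for _ in range(iterations):
        node_models = [local_sgd(weights, shard, rate) for shard in shards]
        weights = average_models(node_models)
    return weights


# Toy data: target = 2*x1 - 3*x2, split across 4 hypothetical nodes.
rows = [((a / 4.0, b / 4.0), 2 * a / 4.0 - 3 * b / 4.0)
        for a in range(-4, 5) for b in range(-4, 5)]
shards = [rows[i::4] for i in range(4)]
weights = train(shards, n_weights=2)
```

On this noise-free toy problem the averaged model approaches the generating coefficients (2, -3), even though no single node sees all the rows.
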
“Secret” Sauce to Higher Accuracy
- Adaptive learning rate, ADADELTA (Google): automatically set learning rate for each neuron based on its training history
- Grid Search and Checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s)
- Regularization: L1 penalizes non-zero weights; L2 penalizes large weights; Dropout randomly ignores certain inputs

Detail: Adaptive Learning Rate
Compute moving average of $\Delta w_i^2$ at time t for window length rho:
$E[\Delta w_i^2]_t = \rho \, E[\Delta w_i^2]_{t-1} + (1 - \rho) \, \Delta w_i^2$
Compute RMS of $\Delta w_i$ at time t with smoothing epsilon:
$\mathrm{RMS}[\Delta w_i]_t = \sqrt{E[\Delta w_i^2]_t + \epsilon}$
Do the same for $\partial E / \partial w_i$, then obtain per-weight learning rate:
$\mathrm{rate}(w_i, t) = \mathrm{RMS}[\Delta w_i]_{t-1} / \mathrm{RMS}[\partial E / \partial w_i]_t$
- Adaptive annealing / progress: gradient-dependent learning rate, moving window prevents “freezing” (unlike ADAGRAD: no window)
- Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time
cf. ADADELTA paper

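The update rule can be written out for a single weight as a short Python sketch. It follows the formulas on the slide; the function name, the default values for rho and epsilon, and the one-dimensional test problem E = (w - 3)^2 are assumptions chosen for illustration.

```python
import math


def adadelta_minimize(gradient, w, rho=0.95, epsilon=1e-6, steps=500):
    """Minimize a 1-D function given its gradient, using a per-weight adaptive rate."""
    avg_sq_grad = 0.0    # E[(dE/dw)^2], moving average over the window
    avg_sq_delta = 0.0   # E[(delta w)^2], moving average over the window
    for _ in range(steps):
        g = gradient(w)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # rate = RMS[delta w]_{t-1} / RMS[dE/dw]_t
        rate = math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon)
        delta = -rate * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        w += delta
    return w


# Toy problem: E = (w - 3)^2, so dE/dw = 2 * (w - 3); start at w = 0.
w_final = adadelta_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
```

No learning rate is passed in: the step size comes entirely from the two moving averages, so it starts small (set by epsilon) and grows while the gradients keep pointing the same way.
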
Detail: Dropout Regularization
Training: For each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations.
[Diagram: the example network (age, income, employment -> married, single) with some activations crossed out]
Testing: Use all activations, but scale them down by the retained fraction 1 - p (for p = 0.5, halve them) to “simulate” the missing activations during training.
cf. Geoff Hinton's paper

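A minimal Python sketch of the two modes follows. It makes two simplifying assumptions that the slide does not spell out: each activation is dropped independently with probability p (so the dropped fraction is p only on average), and test-time activations are multiplied by the retained fraction 1 - p. The function names are hypothetical.

```python
import random


def dropout_train(activations, p, rng):
    """Training: zero out each activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]


def dropout_test(activations, p):
    """Testing: keep every activation, scaled by the retained fraction (1 - p)."""
    return [a * (1.0 - p) for a in activations]


rng = random.Random(7)
hidden = [0.5, -1.0, 2.0, 0.25]
train_view = dropout_train(hidden, p=0.5, rng=rng)   # a different mask on every call
test_view = dropout_test(hidden, p=0.5)
```

Because a fresh mask is drawn for every training sample, the network cannot rely on any single activation being present.
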
MNIST: digits classification
MNIST = Digitized handwritten digits database (Yann LeCun)
Data: 28x28=784 pixels with (gray-scale) values in 0…255
Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes
Standing world record: Without distortions or convolutions, the best-ever published error rate on test set: 0.83% (Microsoft)
Yann LeCun: “Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet.”
Let’s see how H2O does on the MNIST dataset!

H2O Deep Learning on MNIST: 0.87% test set error (so far)
Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
Frequent errors: confuse 2/7 and 4/9
World-class results! No pre-training, no distortions, no convolutions, no unsupervised training
Running on 4 nodes with 16 cores each

Weather Dataset
Predict “RainTomorrow” from Temperature, Humidity, Wind, Pressure, etc.

Live Demo: Weather Prediction
Interactive ROC curve with real-time updates
3 hidden Rectifier layers, Dropout, L1-penalty
5-fold cross-validation: 12.7% error is at least as good as GBM/RF/GLM models

Live Demo: Grid Search
How did I find those parameters? Grid Search! (works for multiple hyper-parameters at once)
Then continue training the best model.

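The idea behind a grid search over several hyper-parameters at once can be sketched generically in Python. This is not the H2O grid search API: `grid_search`, the parameter names `l1` and `hidden_layers`, and the stand-in scoring function `toy_validation_error` are all hypothetical, and a real run would train a model and return its validation error instead.

```python
import itertools


def grid_search(train_and_score, grid):
    """Try every combination in `grid`; return (best_error, best_params)."""
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = train_and_score(**params)
        if best is None or error < best[0]:
            best = (error, params)
    return best


def toy_validation_error(l1, hidden_layers):
    """Stand-in for 'train a model and return its validation error'."""
    return abs(l1 - 1e-4) * 1000 + abs(hidden_layers - 3) * 0.01


grid = {"l1": [1e-5, 1e-4, 1e-3], "hidden_layers": [1, 2, 3]}
best_error, best_params = grid_search(toy_validation_error, grid)
```

The number of runs is the product of the list lengths (9 here), which is why the slide suggests short runs first and then continuing to train only the most promising model.
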
Text Classification
Goal: Predict the item from seller’s text description
“Vintage 18KT gold Rolex 2 Tone in great condition”
Data: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 (set bits: vintage, gold, condition)
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
Let’s see how H2O does on the eBay dataset!

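A binary word vector of this kind can be built with a few lines of Python. The five-word vocabulary below is a made-up stand-in for the 8,647 columns of the real dataset, and the whitespace tokenization is an assumption.

```python
def binary_word_vector(text, vocabulary):
    """1 if the vocabulary word occurs in the text, else 0 (counts are ignored)."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


vocabulary = ["bike", "condition", "gold", "phone", "vintage"]  # hypothetical columns
vector = binary_word_vector(
    "Vintage 18KT gold Rolex 2 Tone in great condition", vocabulary)
```

Words outside the vocabulary (here "Rolex" or "18KT") simply leave no trace in the vector, and most entries are 0, which is why such data compresses well.
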
Text Classification
Out-Of-The-Box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!
Note 1: H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)

Parallel Scalability (for 64 epochs on MNIST, with “0.87%” parameters)
[Chart: speedup (0 to 40) vs. number of H2O nodes (1, 2, 4, 8, 16, 32, 63); 4 cores per node, 1 epoch per node per MapReduce]
[Chart: training time in minutes (0 to 100) vs. number of H2O nodes (1, 2, 4, 8, 16, 32, 63); annotation: 2.7 mins]

Deep Learning Auto-Encoders for Anomaly Detection
Toy example: Find anomaly in ECG heart beat data. First, train a model on what’s “normal”: 20 time-series samples of 210 data points each.
Deep Auto-Encoder: Learn low-dimensional non-linear “structure” of the data that allows the original data to be reconstructed.
Also for categorical data!

Deep Learning Auto-Encoders for Anomaly Detection
Test set with anomaly + model of what’s “normal” => test set prediction is the reconstruction, which looks “normal”
Found anomaly! Large reconstruction error

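The detection step itself (flag samples whose reconstruction error is large) can be sketched in Python. Note the swap: instead of a trained deep auto-encoder, the sketch reconstructs every sample as the mean of the “normal” training samples, which is only a crude stand-in. The function names, the threshold and the tiny 4-point series are assumptions for illustration.

```python
def mean_profile(samples):
    """Stand-in 'model of normal': the point-wise mean of the training samples."""
    return [sum(column) / len(samples) for column in zip(*samples)]


def reconstruction_error(sample, reconstruction):
    """Mean squared difference between a sample and its reconstruction."""
    return sum((s - r) ** 2 for s, r in zip(sample, reconstruction)) / len(sample)


def find_anomalies(test_samples, reconstruct, threshold):
    """Indices of test samples whose reconstruction error exceeds the threshold."""
    return [i for i, sample in enumerate(test_samples)
            if reconstruction_error(sample, reconstruct(sample)) > threshold]


normal = [[0.0, 1.0, 0.0, -1.0], [0.0, 1.1, 0.0, -0.9], [0.0, 0.9, 0.0, -1.1]]
profile = mean_profile(normal)
test_samples = [[0.0, 1.0, 0.0, -1.0], [0.0, 1.0, 3.0, -1.0]]  # second one has a spike
anomalies = find_anomalies(test_samples, lambda sample: profile, threshold=0.5)
```

With a real auto-encoder, `reconstruct` would be the trained model's prediction for the sample; the thresholding logic stays the same.
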
H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms
All parameters are available from R…

POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production!

How well did H2O Deep Learning do?
Let’s see how H2O did in the past 30 minutes!
Higgs Particle Discovery with H2O
Deep Learning: [Your guess goes here]
Reference paper results
Any guesses for AUC on low-level features? AUC=0.76 was the best for RF/GBM/NN (H2O)

H2O Steam: Scoring Platform
Live Demo: Higgs Dataset Demo on 10-node cluster. Let’s score all our H2O models and compare them!
http://server:port/steam/index.html

Scoring Higgs Models in H2O Steam
Live Demo on 10-node cluster: under 10 minutes runtime for all algos! Better than LHC baseline of AUC=0.73!

Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows

Algorithm                       Paper’s l-l AUC   low-level H2O AUC   all features H2O AUC   Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model        -                 0.596               0.684                  default, binomial
Random Forest                   -                 0.764               0.840                  50 trees, max depth 50
Gradient Boosted Trees          0.73              0.753               0.839                  50 trees, max depth 15
Neural Net 1 layer              0.733             0.760               0.830                  1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers   0.836             0.850               -                      3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers   0.868             0.869               -                      4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers   0.880             running             -                      6x500 Rectifier, L1=L2=1e-5

Deep Learning on low-level features alone beats everything else! H2O prelim. results compare well with paper’s results (TMVA & Theano).
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf

Tips for H2O Deep Learning
General:
- More layers for more complex functions (exp. more non-linearity)
- More neurons per layer to detect finer structure in data (“memorizing”)
- Add some regularization for less overfitting (lower validation set error)
Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier, try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation set. Input dropout is recommended for noisy high-dimensional input
- Distributed: More training samples per iteration: faster, but less accuracy?
- With ADADELTA: Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10; rho = 0.9, 0.95, 0.99
- Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
- Try balance_classes = true for datasets with large class imbalance
- Enable force_load_balance for small datasets
- Enable replicate_training_data if each node can hold all the data

Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O! Add your own JIRA tickets!

Key Take-Aways
- H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
- H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data!
Join our Community and Meetups!
https://github.com/h2oai
http://docs.h2o.ai
www.h2o.ai/community
@h2oai
Thank you!
Who am IPhD in Computational Physics 2005
from ETH Zurich Switzerland
6 years at SLAC - Accelerator Physics Modeling 2 years at Skytree Inc - Machine Learning 9 months at 0xdataH2O - Machine Learning
15 years in HPCSupercomputingModeling
Named ldquo2014 Big Data All-Starrdquo by Fortune Magazine
ArnoCandel
H2O Deep Learning ArnoCandel 3
matlabulous (Jo-fai Chow Blend it like a Bayesian) says
ldquoI am 9999999999999 sure that I can still go further with H2Ordquo
Achieved with H2O Deep Learning from R
H2O DeepLearning Kaggle 1 rank (out of 413) - 40d left
1
17
H2O Deep Learning ArnoCandel
OutlineIntro amp Live Demo (10 mins)
Methods amp Implementation (20 mins)
Results amp Live Demos (25 mins)
Higgs boson detection
MNIST handwritten digits
text classification
Q amp A (5 mins)
4
H2O Deep Learning ArnoCandel
About H20 (aka 0xdata)Java Apache v2 Open Source
Join the wwwh2oaicommunity 1 Java Machine Learning in Github
5
H2O Deep Learning ArnoCandel
Customer Demands for Practical Machine Learning
6
Requirements Value
In-Memory Fast (Interactive)
Distributed Big Data (No Sampling)
Open Source Ownership of Methods
API SDK Extensibility
H2O was developed by 0xdata from scratch to meet these requirements
H2O Deep Learning ArnoCandel
H2O Integration
H2O
HDFS HDFS HDFS
YARN Hadoop MR
R ScalaJSON Python
Standalone Over YARN On MRv1
7
H2O H2O
Java
H2O Deep Learning ArnoCandel
H2O Architecture
Distributed In-Memory K-V storeCol compression
Machine Learning
Algorithms
R EngineNano fast
Scoring Engine
Prediction Engine
Memory manager
eg Deep Learning
8
MapReduce
H2O Deep Learning ArnoCandel
H2O - The Killer App on Spark9
httpdatabrickscomblog20140630sparkling-water-h20-sparkhtml
H2O Deep Learning ArnoCandel
H2O DeepLearning on Spark10
Test if we can correctly learn A B where Y = logistic(A + BX) test(deep learning log regression) val nPoints = 10000 val A = 20 val B = -15 Generate testing data val trainData = DeepLearningSuitegenerateLogisticInput(A B nPoints 42) Create RDD from testing data val trainRDD = scparallelize(trainData 2) trainRDDcache() import H2OContext_ Create H2O data frame (will be implicit in the future) val trainH2ORDD = toDataFrame(sc trainRDD) Create a H2O DeepLearning model val dlParams = new DeepLearningParameters() dlParamssource = trainH2ORDD dlParamsresponse = trainH2ORDDlastVec() dlParamsclassification = true val dl = new DeepLearning(dlParams) val dlModel = dltrain()get() Score validation data val validationData = DeepLearningSuitegenerateLogisticInput(A B nPoints 17) val validationRDD = scparallelize(validationData 2) val validationH2ORDD = toDataFrame(sc validationRDD) val predictionH2OFrame = new DataFrame(dlModelscore(validationH2ORDD))(predict) val predictionRDD = toRDD[DoubleHolder](sc predictionH2OFrame) will be implicit in the future Validate prediction validatePrediction( predictionRDDcollect()map (_predictgetOrElse(DoubleNaN)) validationData)
Brand-Sparkling-New Sneak Preview
H2O Deep Learning ArnoCandel 11
John Chambers (creator of the S language R-core member) names H2O R API in top three promising R projects
H2O R CRAN package
H2O Deep Learning ArnoCandel
H2O + R = Happy Data Scientist
12
Machine Learning on Big Data with RData resides on the H2O cluster
H2O Deep Learning ArnoCandel 13
Higgs Particle Discovery
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
Machine Learning Meets Physics
Or rather Back to the roots (WWW was invented at CERN in rsquo89hellip)
H2O Deep Learning ArnoCandel 14
Higgs Binary Classification ProblemCurrent methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
H2O Deep Learning ArnoCandel 15
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
H2O Deep Learning ArnoCandel
WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using
architectures composed of multiple non-linear transformations
What is Deep Learning
Example Input data(image)
Prediction (who is it)
16
Facebooks DeepFace (Yann LeCun) recognises faces as well as humans
H2O Deep Learning ArnoCandel
What is NOT DeepLinear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers kernel + linear)
Classification trees are not deep (operate on original input space no new features generated)
17
H2O Deep Learning ArnoCandel
Deep Learning is Trending
20132009
Google trends
2011
18
Businesses are usingDeep Learning techniques
Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)
H2O Deep Learning ArnoCandel
Deep Learning Historyslides by Yan LeCun (now Facebook)
19
Deep Learning wins competitions AND
makes humans businesses and machines (cyborgs) smarter
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)
= Top-notch prediction engine
Deep Learning in H2O20
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network21
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
22
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
23
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
24
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
25
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
26
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
27
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
28
H2O Deep Learning ArnoCandel
Detail Dropout Regularization29
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
30
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
Letrsquos see how H2O does on the MNIST dataset
H2O Deep Learning ArnoCandel
Frequent errors confuse 27 and 49
H2O Deep Learning on MNIST 087 test set error (so far)
31
test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours
World-class results
No pre-training No distortions
No convolutions No unsupervised
training
Running on 4 nodes with 16 cores each
H2O Deep Learning A Candel
Weather Dataset32
Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc
H2O Deep Learning A Candel
Live Demo Weather Prediction
Interactive ROC curve with real-time updates
33
3 hidden Rectifier layers Dropout
L1-penalty
127 5-fold cross-validation error is at least as good as GBMRFGLM models
5-fold cross validation
H2O Deep Learning ArnoCandel
Live Demo Grid Search
How did I find those parameters Grid Search(works for multiple hyper parameters at once)
34
Then continue training the best model
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
35
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Binary word vector 0010000010001hellip0
vintagegold condition
Letrsquos see how H2O does on the ebay dataset
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
36
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
37
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
Deep Learning Auto-Encoders for Anomaly Detection
Toy example: Find anomaly in ECG heart beat data. First, train a model on what’s “normal”: 20 time-series samples of 210 data points each
Deep Auto-Encoder: Learn low-dimensional non-linear “structure” of the data that allows reconstructing the original data
Also for categorical data!
Deep Learning Auto-Encoders for Anomaly Detection
Test set with anomaly + Model of what’s “normal” => Test set prediction is a reconstruction that looks “normal”
Found anomaly! Large reconstruction error
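The detection step itself is independent of how the model of “normal” was trained: reconstruct each test series, measure the reconstruction error, and flag series whose error is unusually large. The sketch below illustrates only that step. As a deliberately simple stand-in for the deep auto-encoder it uses the point-wise mean of the normal series as the "reconstruction"; the toy data and the threshold rule (3x the worst training error) are assumptions made for illustration:

```python
def fit_normal_profile(train_series):
    """Stand-in for the auto-encoder: point-wise mean of the normal series."""
    n = len(train_series)
    return [sum(col) / n for col in zip(*train_series)]

def reconstruction_error(series, profile):
    """Mean squared difference between a series and its 'reconstruction'."""
    return sum((x - p) ** 2 for x, p in zip(series, profile)) / len(profile)

def find_anomalies(test_series, profile, threshold):
    """Indices of test series whose reconstruction error exceeds the threshold."""
    return [i for i, s in enumerate(test_series)
            if reconstruction_error(s, profile) > threshold]

# Hypothetical toy data: 4-point "heartbeats" instead of 210-point ECG samples.
normal = [[0.0, 1.0, 0.0, -1.0], [0.1, 0.9, 0.0, -1.1], [-0.1, 1.1, 0.0, -0.9]]
profile = fit_normal_profile(normal)
threshold = 3 * max(reconstruction_error(s, profile) for s in normal)

test = [[0.0, 1.0, 0.1, -1.0],   # close to normal
        [1.0, -1.0, 1.0, 1.0]]   # clearly different shape
anomalies = find_anomalies(test, profile, threshold)
```

A real auto-encoder replaces `fit_normal_profile` with a learned non-linear compression, so it can reconstruct many different normal shapes rather than a single average one.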
H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms
All parameters are available from R…
POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
How well did H2O Deep Learning do?
Higgs Particle Discovery with H2O
Let’s see how H2O did in the past 30 minutes!
Any guesses for AUC on low-level features? AUC=0.76 was the best for RF/GBM/NN (H2O)
<Your guess goes here>
reference paper results
H2O Steam: Scoring Platform
Live Demo: Higgs Dataset Demo on 10-node cluster. Let’s score all our H2O models and compare them!
http://server:port/steam/index.html
Scoring Higgs Models in H2O Steam
Live Demo on 10-node cluster: <10 minutes runtime for all algos! Better than LHC baseline of AUC=0.73!
Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
Algorithm | Paper’s l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5
Deep Learning on low-level features alone beats everything else! H2O prelim. results compare well with paper’s results (TMVA & Theano)
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
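All numbers in the table are AUC values, i.e. the area under the ROC curve. An equivalent way to read AUC is as the probability that a randomly chosen signal event gets a higher score than a randomly chosen background event, with ties counting as one half. The sketch below computes it that way by comparing all pairs; it is meant to make the metric concrete, not to reproduce how H2O computes it, and its quadratic cost would not be practical on 500k test rows:

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Tiny made-up example: 1 = signal (Higgs), 0 = background.
example = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A model that scores every event identically gets 0.5 under this definition, and a perfect ranking gets 1.0.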
Tips for H2O Deep Learning
General:
More layers for more complex functions (exp. more non-linearity)
More neurons per layer to detect finer structure in data (“memorizing”)
Add some regularization for less overfitting (lower validation set error)
Specifically:
Do a grid search to get a feel for convergence, then continue training
Try Tanh/Rectifier, try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3
Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation set. Input dropout is recommended for noisy high-dimensional input
Distributed: More training samples per iteration: faster, but less accuracy?
With ADADELTA: Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99
Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
Try balance_classes = true for datasets with large class imbalance
Enable force_load_balance for small datasets
Enable replicate_training_data if each node can hold all the data
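The "Without ADADELTA" tip ties momentum_ramp to 1/rate_annealing. The sketch below shows why that pairing is natural, assuming the learning rate is annealed as rate / (1 + rate_annealing * samples) and the momentum rises linearly from momentum_start to momentum_stable over momentum_ramp training samples. Both forms are my reading of how these parameters behave and should be checked against the H2O documentation; the concrete values are picked from the ranges in the tips:

```python
def annealed_rate(rate, rate_annealing, samples):
    """Learning rate after `samples` training samples (assumed annealing form)."""
    return rate / (1.0 + rate_annealing * samples)

def momentum_at(samples, momentum_start, momentum_stable, momentum_ramp):
    """Momentum after `samples` training samples (assumed linear ramp)."""
    if samples >= momentum_ramp:
        return momentum_stable
    frac = samples / momentum_ramp
    return momentum_start + frac * (momentum_stable - momentum_start)

rate, rate_annealing = 0.005, 1e-6
momentum_ramp = 1.0 / rate_annealing   # the pairing suggested in the tips

# Learning rate and momentum at the start, halfway through and at the end of the ramp.
schedule = [
    (n, annealed_rate(rate, rate_annealing, n), momentum_at(n, 0.5, 0.99, momentum_ramp))
    for n in (0, 500_000, 1_000_000)
]
```

Under these assumptions the learning rate has dropped to half its initial value after 1/rate_annealing samples, which is exactly when the momentum reaches its stable value.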
Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O! Add your own JIRA tickets!
Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data!
Join our Community and Meetups! https://github.com/h2oai http://docs.h2o.ai www.h2o.ai/community @h2oai
Thank you!
H2O Deep Learning ArnoCandel
OutlineIntro amp Live Demo (10 mins)
Methods amp Implementation (20 mins)
Results amp Live Demos (25 mins)
Higgs boson detection
MNIST handwritten digits
text classification
Q amp A (5 mins)
4
H2O Deep Learning ArnoCandel
About H20 (aka 0xdata)Java Apache v2 Open Source
Join the wwwh2oaicommunity 1 Java Machine Learning in Github
5
H2O Deep Learning ArnoCandel
Customer Demands for Practical Machine Learning
6
Requirements Value
In-Memory Fast (Interactive)
Distributed Big Data (No Sampling)
Open Source Ownership of Methods
API SDK Extensibility
H2O was developed by 0xdata from scratch to meet these requirements
H2O Deep Learning ArnoCandel
H2O Integration
H2O
HDFS HDFS HDFS
YARN Hadoop MR
R ScalaJSON Python
Standalone Over YARN On MRv1
7
H2O H2O
Java
H2O Deep Learning ArnoCandel
H2O Architecture
Distributed In-Memory K-V storeCol compression
Machine Learning
Algorithms
R EngineNano fast
Scoring Engine
Prediction Engine
Memory manager
eg Deep Learning
8
MapReduce
H2O Deep Learning ArnoCandel
H2O - The Killer App on Spark9
httpdatabrickscomblog20140630sparkling-water-h20-sparkhtml
H2O Deep Learning ArnoCandel
H2O DeepLearning on Spark10
Test if we can correctly learn A B where Y = logistic(A + BX) test(deep learning log regression) val nPoints = 10000 val A = 20 val B = -15 Generate testing data val trainData = DeepLearningSuitegenerateLogisticInput(A B nPoints 42) Create RDD from testing data val trainRDD = scparallelize(trainData 2) trainRDDcache() import H2OContext_ Create H2O data frame (will be implicit in the future) val trainH2ORDD = toDataFrame(sc trainRDD) Create a H2O DeepLearning model val dlParams = new DeepLearningParameters() dlParamssource = trainH2ORDD dlParamsresponse = trainH2ORDDlastVec() dlParamsclassification = true val dl = new DeepLearning(dlParams) val dlModel = dltrain()get() Score validation data val validationData = DeepLearningSuitegenerateLogisticInput(A B nPoints 17) val validationRDD = scparallelize(validationData 2) val validationH2ORDD = toDataFrame(sc validationRDD) val predictionH2OFrame = new DataFrame(dlModelscore(validationH2ORDD))(predict) val predictionRDD = toRDD[DoubleHolder](sc predictionH2OFrame) will be implicit in the future Validate prediction validatePrediction( predictionRDDcollect()map (_predictgetOrElse(DoubleNaN)) validationData)
Brand-Sparkling-New Sneak Preview
H2O Deep Learning ArnoCandel 11
John Chambers (creator of the S language R-core member) names H2O R API in top three promising R projects
H2O R CRAN package
H2O Deep Learning ArnoCandel
H2O + R = Happy Data Scientist
12
Machine Learning on Big Data with RData resides on the H2O cluster
H2O Deep Learning ArnoCandel 13
Higgs Particle Discovery
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
Machine Learning Meets Physics
Or rather Back to the roots (WWW was invented at CERN in rsquo89hellip)
H2O Deep Learning ArnoCandel 14
Higgs Binary Classification ProblemCurrent methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
H2O Deep Learning ArnoCandel 15
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
H2O Deep Learning ArnoCandel
WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using
architectures composed of multiple non-linear transformations
What is Deep Learning
Example Input data(image)
Prediction (who is it)
16
Facebooks DeepFace (Yann LeCun) recognises faces as well as humans
H2O Deep Learning ArnoCandel
What is NOT DeepLinear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers kernel + linear)
Classification trees are not deep (operate on original input space no new features generated)
17
H2O Deep Learning ArnoCandel
Deep Learning is Trending
20132009
Google trends
2011
18
Businesses are usingDeep Learning techniques
Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)
H2O Deep Learning ArnoCandel
Deep Learning Historyslides by Yan LeCun (now Facebook)
19
Deep Learning wins competitions AND
makes humans businesses and machines (cyborgs) smarter
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)
= Top-notch prediction engine
Deep Learning in H2O20
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network21
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
22
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
23
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
24
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
25
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
26
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
27
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
28
H2O Deep Learning ArnoCandel
Detail: Dropout Regularization
Training: For each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations
[Diagram: network with inputs age, income, employment and outputs married, single; three activations are crossed out (X)]
Testing: Use all activations, but scale them down to compensate for the dropped fraction p, i.e. multiply by (1 - p) (to "simulate" the missing activations during training)
cf. Geoff Hinton's paper
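The two dropout phases described above can be sketched as below. This is a minimal sketch under stated assumptions: each activation is dropped independently with probability p (the common way to realise "a random fraction p"), and test-time activations are multiplied by (1 - p), which is the usual convention; for p = 0.5 this coincides with reducing by a factor p. The function names are hypothetical.

```python
import random


def dropout_train(activations, p, rng):
    """Training: zero out each input activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]


def dropout_test(activations, p):
    """Testing: keep every activation, scaled by (1 - p) to match the training average."""
    return [a * (1.0 - p) for a in activations]


rng = random.Random(42)  # fixed seed so the example is repeatable
inputs = [0.2, 0.9, 0.4, 0.7]
print(dropout_train(inputs, 0.5, rng))  # a different random subset is zeroed on every call
print(dropout_test(inputs, 0.5))
```

Because a fresh mask is drawn for every training sample, no neuron can rely on any single input being present, which is the regularizing effect.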
MNIST: digits classification
MNIST = Digitized handwritten digits database (Yann LeCun)
Data: 28x28=784 pixels with (gray-scale) values in 0…255
Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes
Standing world record: Without distortions or convolutions, the best-ever published error rate on test set: 0.83% (Microsoft)
Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."
Let's see how H2O does on the MNIST dataset!
H2O Deep Learning on MNIST: 0.87% test set error (so far)
test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
Frequent errors: confuse 2/7 and 4/9
World-class results! No pre-training, no distortions, no convolutions, no unsupervised training
Running on 4 nodes with 16 cores each
Weather Dataset
Predict "RainTomorrow" from Temperature, Humidity, Wind, Pressure, etc.
Live Demo: Weather Prediction
Interactive ROC curve with real-time updates
3 hidden Rectifier layers, Dropout, L1-penalty
12.7% 5-fold cross-validation error is at least as good as GBM/RF/GLM models
Live Demo: Grid Search
How did I find those parameters? Grid Search! (works for multiple hyper-parameters at once)
Then continue training the best model
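A grid search over several hyper-parameters at once amounts to scoring every combination and keeping the best one. The sketch below is generic and does not use the H2O API: `grid_search`, `toy_validation_error` and the parameter names `l1` and `hidden_dropout` are hypothetical, and the scoring function is a stand-in for "train a model and return its validation error".

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid; return (best_params, best_score).

    Lower scores are better (for example a validation error).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_error(params):
    # Hypothetical stand-in for training a model and measuring its error.
    return abs(params["l1"] - 1e-5) * 1e3 + abs(params["hidden_dropout"] - 0.5)


grid = {"l1": [0.0, 1e-5, 1e-3], "hidden_dropout": [0.0, 0.25, 0.5]}
best, err = grid_search(grid, toy_validation_error)
print(best, err)
```

In practice each combination would be trained only briefly, and the most promising model would then be continued from its checkpoint.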
Text Classification
Goal: Predict the item from seller's text description
Example: "Vintage 18KT gold Rolex 2 Tone in great condition"
Data: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 (the non-zero entries mark the words "vintage", "gold", "condition")
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
Let's see how H2O does on the eBay dataset!
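The binary word vector used as input above can be sketched as follows: one column per vocabulary word, set to 1 if the word occurs in the description. The five-word vocabulary and the function name `binary_word_vector` are assumptions for illustration; the dataset in the talk has 8,647 columns.

```python
import re


def binary_word_vector(text, vocabulary):
    """Return a 0/1 list with one entry per vocabulary word (1 = word occurs in text)."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [1 if term in words else 0 for term in vocabulary]


# Hypothetical five-word vocabulary.
vocabulary = ["antique", "condition", "gold", "silver", "vintage"]
description = "Vintage 18KT gold Rolex 2 Tone in great condition"
print(binary_word_vector(description, vocabulary))  # [0, 1, 1, 0, 1]
```

Such vectors are almost entirely zeros, which is why a column-compressed store can hold them in far less space than a dense CSV.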
Text Classification
Out-Of-The-Box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!
Note 1: H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)
Parallel Scalability (for 64 epochs on MNIST, with "0.87%" parameters)
[Chart: Speedup (axis 0 to 40) vs. number of H2O nodes (1, 2, 4, 8, 16, 32, 63); 4 cores per node, 1 epoch per node per MapReduce]
[Chart: Training time in minutes (axis 0 to 100) vs. number of H2O nodes (1, 2, 4, 8, 16, 32, 63); annotated value: 2.7 mins]
Deep Learning Auto-Encoders for Anomaly Detection
Toy example: Find anomaly in ECG heart beat data. First, train a model on what's "normal": 20 time-series samples of 210 data points each
Deep Auto-Encoder: Learn low-dimensional non-linear "structure" of the data that allows reconstructing the original data
Also for categorical data!
Model of what's "normal" + Test set with anomaly => Test set prediction is reconstruction, looks "normal"
Found anomaly: large reconstruction error
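The detection step described above (flag samples whose reconstruction error is large) can be sketched independently of the model that produces the reconstruction. In this sketch the "model of normal" is deliberately not an auto-encoder: it is a stand-in that always returns the mean of the normal training samples, so that the example stays self-contained. The function names, the three-point "heartbeats" and the threshold value are all assumptions.

```python
def reconstruction_error(sample, reconstruction):
    """Mean squared error between a sample and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(sample, reconstruction)) / len(sample)


def find_anomalies(samples, reconstruct, threshold):
    """Return the indices of samples whose reconstruction error exceeds threshold."""
    return [
        i
        for i, sample in enumerate(samples)
        if reconstruction_error(sample, reconstruct(sample)) > threshold
    ]


# Stand-in "model of normal": the mean of the normal training samples.
# A trained auto-encoder would take its place in a real setting.
normal = [[0.0, 1.0, 0.0], [0.0, 1.2, 0.0], [0.0, 0.8, 0.0]]
mean_beat = [sum(column) / len(normal) for column in zip(*normal)]


def reconstruct(sample):
    return mean_beat


test_set = [[0.0, 1.1, 0.0], [1.0, 0.0, 1.0]]
print(find_anomalies(test_set, reconstruct, threshold=0.1))  # [1]
```

The threshold is typically chosen from the distribution of reconstruction errors on the normal training data.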
H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms
All parameters are available from R…
POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production!
How well did H2O Deep Learning do?
Let's see how H2O did in the past 30 minutes!
Higgs Particle Discovery with H2O
Any guesses for AUC on low-level features? AUC=0.76 was the best for RF/GBM/NN (H2O)
<Your guess goes here>
reference paper results
H2O Steam: Scoring Platform
Live Demo
Higgs Dataset Demo on 10-node cluster. Let's score all our H2O models and compare them!
http://server:port/steam/index.html
Scoring Higgs Models in H2O Steam
Live Demo on 10-node cluster: <10 minutes runtime for all algos! Better than LHC baseline of AUC=0.73!
Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
Algorithm | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5
Deep Learning on low-level features alone beats everything else! H2O prelim. results compare well with paper's results (TMVA & Theano)
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
Tips for H2O Deep Learning
General:
- More layers for more complex functions (exp. more non-linearity)
- More neurons per layer to detect finer structure in data ("memorizing")
- Add some regularization for less overfitting (lower validation set error)
Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier, try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation set. Input dropout is recommended for noisy high-dimensional input
- Distributed: More training samples per iteration: faster, but less accuracy
- With ADADELTA: Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10, rho = 0.9, 0.95, 0.99
- Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
- Try balance_classes = true for datasets with large class imbalance
- Enable force_load_balance for small datasets
- Enable replicate_training_data if each node can hold all the data
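To see how the non-ADADELTA parameters in the tips above interact, the sketch below assumes two schedules that the slide does not spell out: the learning rate decays as rate / (1 + rate_annealing * samples), and the momentum rises linearly from momentum_start to momentum_stable over momentum_ramp training samples. Both formulas and the function names are assumptions for illustration; under them, momentum_ramp = 1/rate_annealing means the momentum reaches its stable value at the point where the rate has halved.

```python
def annealed_rate(rate, rate_annealing, samples):
    """Assumed schedule: the learning rate halves after 1/rate_annealing samples."""
    return rate / (1.0 + rate_annealing * samples)


def ramped_momentum(momentum_start, momentum_stable, momentum_ramp, samples):
    """Assumed schedule: linear ramp over momentum_ramp samples, then constant."""
    if samples >= momentum_ramp:
        return momentum_stable
    return momentum_start + (momentum_stable - momentum_start) * samples / momentum_ramp


rate, rate_annealing = 0.01, 1e-6
momentum_ramp = 1.0 / rate_annealing  # the pairing suggested in the tips
for samples in (0, 500_000, 1_000_000, 2_000_000):
    print(
        samples,
        annealed_rate(rate, rate_annealing, samples),
        ramped_momentum(0.5, 0.99, momentum_ramp, samples),
    )
```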
Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers PUB-644
- Anomaly Detection PUB-806
- Pre-Training: Stacked Auto-Encoders PUB-1014
- Faster Training: GPGPU support PUB-1013
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O! Add your own JIRA tickets!
Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data!
Join our Community and Meetups! https://github.com/h2oai, http://docs.h2o.ai, www.h2o.ai/community, @h2oai
Thank you!
Customer Demands for Practical Machine Learning
6
Requirements Value
In-Memory Fast (Interactive)
Distributed Big Data (No Sampling)
Open Source Ownership of Methods
API SDK Extensibility
H2O was developed by 0xdata from scratch to meet these requirements
H2O Deep Learning ArnoCandel
H2O Integration
H2O
HDFS HDFS HDFS
YARN Hadoop MR
R ScalaJSON Python
Standalone Over YARN On MRv1
7
H2O H2O
Java
H2O Deep Learning ArnoCandel
H2O Architecture
Distributed In-Memory K-V storeCol compression
Machine Learning
Algorithms
R EngineNano fast
Scoring Engine
Prediction Engine
Memory manager
eg Deep Learning
8
MapReduce
H2O Deep Learning ArnoCandel
H2O - The Killer App on Spark9
httpdatabrickscomblog20140630sparkling-water-h20-sparkhtml
H2O Deep Learning ArnoCandel
H2O DeepLearning on Spark10
Test if we can correctly learn A B where Y = logistic(A + BX) test(deep learning log regression) val nPoints = 10000 val A = 20 val B = -15 Generate testing data val trainData = DeepLearningSuitegenerateLogisticInput(A B nPoints 42) Create RDD from testing data val trainRDD = scparallelize(trainData 2) trainRDDcache() import H2OContext_ Create H2O data frame (will be implicit in the future) val trainH2ORDD = toDataFrame(sc trainRDD) Create a H2O DeepLearning model val dlParams = new DeepLearningParameters() dlParamssource = trainH2ORDD dlParamsresponse = trainH2ORDDlastVec() dlParamsclassification = true val dl = new DeepLearning(dlParams) val dlModel = dltrain()get() Score validation data val validationData = DeepLearningSuitegenerateLogisticInput(A B nPoints 17) val validationRDD = scparallelize(validationData 2) val validationH2ORDD = toDataFrame(sc validationRDD) val predictionH2OFrame = new DataFrame(dlModelscore(validationH2ORDD))(predict) val predictionRDD = toRDD[DoubleHolder](sc predictionH2OFrame) will be implicit in the future Validate prediction validatePrediction( predictionRDDcollect()map (_predictgetOrElse(DoubleNaN)) validationData)
Brand-Sparkling-New Sneak Preview
H2O Deep Learning ArnoCandel 11
John Chambers (creator of the S language R-core member) names H2O R API in top three promising R projects
H2O R CRAN package
H2O Deep Learning ArnoCandel
H2O + R = Happy Data Scientist
12
Machine Learning on Big Data with RData resides on the H2O cluster
H2O Deep Learning ArnoCandel 13
Higgs Particle Discovery
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
Machine Learning Meets Physics
Or rather Back to the roots (WWW was invented at CERN in rsquo89hellip)
H2O Deep Learning ArnoCandel 14
Higgs Binary Classification ProblemCurrent methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
H2O Deep Learning ArnoCandel 15
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
H2O Deep Learning ArnoCandel
WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using
architectures composed of multiple non-linear transformations
What is Deep Learning
Example Input data(image)
Prediction (who is it)
16
Facebooks DeepFace (Yann LeCun) recognises faces as well as humans
H2O Deep Learning ArnoCandel
What is NOT DeepLinear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers kernel + linear)
Classification trees are not deep (operate on original input space no new features generated)
17
H2O Deep Learning ArnoCandel
Deep Learning is Trending
20132009
Google trends
2011
18
Businesses are usingDeep Learning techniques
Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)
H2O Deep Learning ArnoCandel
Deep Learning Historyslides by Yan LeCun (now Facebook)
19
Deep Learning wins competitions AND
makes humans businesses and machines (cyborgs) smarter
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)
= Top-notch prediction engine
Deep Learning in H2O20
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network21
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
22
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
23
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
24
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
25
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
[Diagram: H2O nodes (JVMs), each with an HTTPD server and part of the K-V store; communication between nodes/JVMs is synchronous, between threads asynchronous]
Initial model: weights and biases w
Map: each node trains a copy of the weights and biases with (some or all of) its local data, using asynchronous Fork/Join (FJ) threads, giving per-node models w1, w2, w3, w4
Reduce: model averaging; average the weights and biases from all nodes (partial sums w1+w3 and w2+w4 are combined): $w = (w_1 + w_2 + w_3 + w_4)/4$
Updated model: w, kept in the H2O atomic in-memory K-V store
The number of training points per MapReduce iteration is auto-tuned (default) or user-specified
Keep iterating over the data ("epochs"), score from time to time
Speedup is at least nodes/log(rows) (arXiv:1209.4129v3)
Query & display the model via JSON, WWW
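The map/reduce loop on this slide (train a copy of the model per node, then average) can be sketched as follows. This is a toy, sequential illustration under loud assumptions: a two-weight linear model stands in for the neural network, the four "nodes" are simulated in a single process, and the data, learning rate, and function names are all hypothetical.

```python
def train_local(weights, rows, rate=0.1):
    """'Map' step: one SGD pass over a node's local rows, starting from the
    shared weights (linear model with squared error, for illustration only)."""
    w = list(weights)
    for x, target in rows:
        err = sum(wi * xi for wi, xi in zip(w, x)) - target
        w = [wi - rate * err * xi for wi, xi in zip(w, x)]
    return w

def average_models(models):
    """'Reduce' step: element-wise average of the per-node weights."""
    return [sum(ws) / len(models) for ws in zip(*models)]

# Four simulated nodes, each holding rows of target = 2*x0 - 1*x1.
shards = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)],
    [([1.0, 1.0], 1.0), ([1.0, -1.0], 3.0)],
    [([0.5, 0.5], 0.5), ([1.0, 0.0], 2.0)],
    [([0.0, 1.0], -1.0), ([0.5, -0.5], 1.5)],
]

w = [0.0, 0.0]                                            # initial model
for iteration in range(200):                              # MapReduce iterations
    local = [train_local(w, shard) for shard in shards]   # map
    w = average_models(local)                             # reduce
print(w)
```

In this toy setting the averaged model approaches the weights that generated the data; how often to average (the number of training points per iteration) is the trade-off the slide mentions between communication cost and accuracy.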
"Secret" Sauce to Higher Accuracy
Adaptive learning rate - ADADELTA (Google): automatically set the learning rate for each neuron based on its training history
Grid Search and Checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s)
Regularization: L1 penalizes non-zero weights; L2 penalizes large weights; Dropout randomly ignores certain inputs
Detail: Adaptive Learning Rate
Compute the moving average of $\Delta w_i^2$ at time $t$ for window length rho:
$E[\Delta w_i^2]_t = \rho \, E[\Delta w_i^2]_{t-1} + (1-\rho) \, \Delta w_i^2$
Compute the RMS of $\Delta w_i$ at time $t$ with smoothing epsilon:
$\text{RMS}[\Delta w_i]_t = \sqrt{E[\Delta w_i^2]_t + \epsilon}$
Do the same for $\partial E/\partial w_i$, then obtain the per-weight learning rate:
$\text{rate}(w_i, t) = \text{RMS}[\Delta w_i]_{t-1} \, / \, \text{RMS}[\partial E/\partial w_i]_t$
Adaptive annealing / progress: gradient-dependent learning rate; the moving window prevents "freezing" (unlike ADAGRAD, which has no window)
Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time
cf. ADADELTA paper
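The update rule above can be written out for a single weight. The sketch below follows the slide's formulas literally; the class name, the quadratic test error E(w) = (w - 3)^2, and the values rho = 0.95 and epsilon = 1e-6 are illustrative assumptions, not H2O's code or defaults.

```python
import math

class AdadeltaState:
    """Running averages for one weight, following the slide's formulas."""

    def __init__(self, rho=0.95, epsilon=1e-6):
        self.rho = rho
        self.epsilon = epsilon
        self.avg_sq_grad = 0.0    # E[(dE/dw)^2]
        self.avg_sq_delta = 0.0   # E[delta_w^2]

    def step(self, gradient):
        """Return the weight change delta_w for the given gradient dE/dw."""
        rho, eps = self.rho, self.epsilon
        self.avg_sq_grad = rho * self.avg_sq_grad + (1.0 - rho) * gradient ** 2
        # rate = RMS[delta_w]_{t-1} / RMS[dE/dw]_t
        rate = math.sqrt(self.avg_sq_delta + eps) / math.sqrt(self.avg_sq_grad + eps)
        delta = -rate * gradient
        self.avg_sq_delta = rho * self.avg_sq_delta + (1.0 - rho) * delta ** 2
        return delta

# Toy objective: E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3).
state = AdadeltaState()
w = 0.0
first_delta = None
for t in range(100):
    delta = state.step(2.0 * (w - 3.0))
    if first_delta is None:
        first_delta = delta
    w += delta
```

No global learning rate appears anywhere: the step size comes from the ratio of the two running RMS values, so it starts small (of the order of sqrt(epsilon)) and adapts per weight as training proceeds.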
Detail: Dropout Regularization
Training: for each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of the input activations
[Diagram: network with inputs age, income, employment and outputs married, single; dropped activations are marked with X]
Testing: use all activations, but scale them by the retained fraction (1 - p), which halves them for p = 0.5 (to "simulate" the missing activations during training)
cf. Geoff Hinton's paper
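A minimal sketch of the two dropout modes, assuming p denotes the dropped fraction, so that the test-time scale is the retained fraction 1 - p (the two coincide for p = 0.5). Dropping each activation independently with probability p is one common reading of "a random fraction p"; the function names are hypothetical.

```python
import random

def dropout_train(activations, p, rng):
    """Training: zero out each activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p):
    """Testing: keep every activation, scaled by the retained fraction (1 - p)."""
    return [a * (1.0 - p) for a in activations]

rng = random.Random(0)
p = 0.5
activations = [1.0, 2.0, 4.0]

# Averaged over many random masks, the training-time activations match the
# deterministic test-time scaling.
trials = 20000
sums = [0.0] * len(activations)
for _ in range(trials):
    for i, a in enumerate(dropout_train(activations, p, rng)):
        sums[i] += a
train_mean = [s / trials for s in sums]
test_scaled = dropout_test(activations, p)   # [0.5, 1.0, 2.0]
```

The comparison at the end is the point of the test-time scaling: downstream neurons see, on average, the same input magnitude in both modes.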
MNIST: digits classification
MNIST = digitized handwritten digits database (Yann LeCun)
Data: 28x28 = 784 pixels with (gray-scale) values in 0...255
Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes
Standing world record: without distortions or convolutions, the best-ever published error rate on the test set is 0.83% (Microsoft)
Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."
Let's see how H2O does on the MNIST dataset!
H2O Deep Learning on MNIST: 0.87% test set error (so far)
Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
Frequent errors: confuse 2/7 and 4/9
World-class results! No pre-training, no distortions, no convolutions, no unsupervised training
Running on 4 nodes with 16 cores each
Weather Dataset
Predict "RainTomorrow" from Temperature, Humidity, Wind, Pressure, etc.
Live Demo: Weather Prediction
Interactive ROC curve with real-time updates
Model: 3 hidden Rectifier layers, Dropout, L1-penalty
5-fold cross validation: the 12.7% cross-validation error is at least as good as GBM/RF/GLM models
Live Demo: Grid Search
How did I find those parameters? Grid Search! (works for multiple hyper-parameters at once)
Then continue training the best model
Text Classification
Goal: predict the item from the seller's text description
Example: "Vintage 18KT gold Rolex 2 Tone in great condition"
Data: binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,...,0 (the set bits correspond to "vintage", "gold", "condition")
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
Let's see how H2O does on the eBay dataset!
Text Classification
Out-of-the-box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
Note 1: H2O's columnar-compressed in-memory store needs only 60 MB to store 5 billion values (dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)
Parallel Scalability (for 64 epochs on MNIST, with the "0.87%" parameters)
[Plot: Training Time in minutes (axis 0 to 100) versus number of H2O nodes (1, 2, 4, 8, 16, 32, 63); annotation: 2.7 mins]
[Plot: Speedup (axis 0 to 40) versus number of H2O nodes (1, 2, 4, 8, 16, 32, 63); 4 cores per node, 1 epoch per node per MapReduce]
Deep Learning Auto-Encoders for Anomaly Detection
Toy example: find the anomaly in ECG heart beat data. First, train a model on what's "normal": 20 time-series samples of 210 data points each
Deep Auto-Encoder: learn the low-dimensional non-linear "structure" of the data that allows reconstructing the original data
Also for categorical data!
Deep Learning Auto-Encoders for Anomaly Detection
Test set with anomaly + model of what's "normal" => the test set prediction is the reconstruction, which looks "normal"
Found anomaly: large reconstruction error!
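The detection step itself (flag samples whose reconstruction error is large) is simple enough to sketch. To stay self-contained, the example below does not train an auto-encoder: it uses the mean of the "normal" samples as a stand-in for the model's reconstruction, and the toy data, the threshold rule, and all names are hypothetical.

```python
def reconstruction_error(sample, reconstruction):
    """Mean squared difference between a sample and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(sample, reconstruction)) / len(sample)

def mean_template(samples):
    """Point-wise mean of the training samples.
    Stand-in for a trained auto-encoder's reconstruction (assumption)."""
    return [sum(column) / len(samples) for column in zip(*samples)]

# "Normal" training series (toy data, 4 points each instead of 210).
normal = [
    [0.0, 1.0, 0.0, -1.0],
    [0.1, 0.9, 0.0, -1.1],
    [-0.1, 1.1, 0.0, -0.9],
]
template = mean_template(normal)

# Hypothetical rule: three times the worst error seen on the normal data.
threshold = 3.0 * max(reconstruction_error(s, template) for s in normal)

test_set = [
    [0.0, 1.0, 0.1, -1.0],   # looks like the training data
    [1.0, -1.0, 1.0, 1.0],   # does not
]
anomalies = [i for i, s in enumerate(test_set)
             if reconstruction_error(s, template) > threshold]
print(anomalies)
```

With a real auto-encoder, `template` would be replaced by the model's per-sample reconstruction; the thresholding logic stays the same.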
H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms
All parameters are available from R...
POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production!
How well did H2O Deep Learning do?
Higgs Particle Discovery with H2O
Let's see how H2O did in the past 30 minutes!
Any guesses for AUC on low-level features? AUC = 0.76 was the best for RF/GBM/NN (H2O)
<Your guess goes here>
reference paper results
H2O Steam: Scoring Platform
Live Demo
Higgs Dataset Demo on 10-node cluster: let's score all our H2O models and compare them!
http://server:port/steam/index.html
Scoring Higgs Models in H2O Steam
Live Demo on 10-node cluster: <10 minutes runtime for all algos! Better than LHC baseline of AUC = 0.73!
Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
Algorithm | Paper's low-level AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5
Deep Learning on low-level features alone beats everything else! H2O prelim. results compare well with the paper's results (TMVA & Theano)
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
Tips for H2O Deep Learning
General:
More layers for more complex functions (exp. more non-linearity)
More neurons per layer to detect finer structure in data ("memorizing")
Add some regularization for less overfitting (lower validation set error)
Specifically:
Do a grid search to get a feel for convergence, then continue training
Try Tanh/Rectifier; try max_w2 = 10...50, L1 = 1e-5...1e-3 and/or L2 = 1e-5...1e-3
Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set. Input dropout is recommended for noisy high-dimensional input
Distributed: more training samples per iteration: faster, but less accuracy?
With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10; rho = 0.9, 0.95, 0.99
Without ADADELTA: try rate = 1e-4...1e-2, rate_annealing = 1e-5...1e-9, momentum_start = 0.5...0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
Try balance_classes = true for datasets with large class imbalance
Enable force_load_balance for small datasets
Enable replicate_training_data if each node can hold all the data
Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O! Add your own JIRA tickets!
Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data!
Join our Community and Meetups! https://github.com/h2oai http://docs.h2o.ai www.h2o.ai/community @h2oai
Thank you!
H2O Integration
[Diagram: three H2O deployment options (Standalone, Over YARN, On MRv1), each on top of HDFS; interfaces: Java, R, Scala, JSON, Python]
H2O Architecture
[Diagram of components: Prediction Engine; R Engine; Nano-fast Scoring Engine; Machine Learning Algorithms (e.g. Deep Learning); MapReduce; Memory manager; Distributed In-Memory K-V store with column compression]
H2O - The Killer App on Spark
http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html
H2O DeepLearning on Spark
Brand-Sparkling-New Sneak Preview!
    // Test if we can correctly learn A, B where Y = logistic(A + B*X)
    test("deep learning log regression") {
      val nPoints = 10000
      val A = 2.0
      val B = -1.5
      // Generate testing data
      val trainData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 42)
      // Create RDD from testing data
      val trainRDD = sc.parallelize(trainData, 2)
      trainRDD.cache()
      import H2OContext._
      // Create H2O data frame (will be implicit in the future)
      val trainH2ORDD = toDataFrame(sc, trainRDD)
      // Create a H2O DeepLearning model
      val dlParams = new DeepLearningParameters()
      dlParams.source = trainH2ORDD
      dlParams.response = trainH2ORDD.lastVec()
      dlParams.classification = true
      val dl = new DeepLearning(dlParams)
      val dlModel = dl.train().get()
      // Score validation data
      val validationData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 17)
      val validationRDD = sc.parallelize(validationData, 2)
      val validationH2ORDD = toDataFrame(sc, validationRDD)
      val predictionH2OFrame = new DataFrame(dlModel.score(validationH2ORDD))('predict)
      val predictionRDD = toRDD[DoubleHolder](sc, predictionH2OFrame) // will be implicit in the future
      // Validate prediction
      validatePrediction(predictionRDD.collect().map(_.predict.getOrElse(Double.NaN)), validationData)
    }
John Chambers (creator of the S language, R-core member) names the H2O R API among the top three promising R projects
H2O R CRAN package
H2O + R = Happy Data Scientist
Machine Learning on Big Data with R: the data resides on the H2O cluster!
Higgs Particle Discovery
Machine Learning Meets Physics. Or rather: back to the roots (WWW was invented at CERN in '89...)
Higgs vs. Background
Large Hadron Collider: largest experiment of mankind! $13+ billion, 16.8 miles long, 120 MegaWatts, -456F, 1 PB/day, etc. The Higgs boson discovery (July '12) led to the 2013 Nobel prize!
http://arxiv.org/pdf/1402.4735v2.pdf
Images courtesy CERN / LHC
Higgs: Binary Classification Problem
Current methods of choice for physicists:
- Boosted Decision Trees
- Neural networks with 1 hidden layer
BUT: must first add derived high-level features (physics formulae)
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
Algorithm | low-level H2O AUC | all features H2O AUC
Generalized Linear Model | 0.596 | 0.684
Random Forest | 0.764 | 0.840
Gradient Boosted Trees | 0.753 | 0.839
Neural Net 1 hidden layer | 0.760 | 0.830
(low-level -> all features: add derived features)
Metric: AUC = Area under the ROC curve (range 0.5...1, higher is better)
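For readers unfamiliar with the metric, AUC can be computed directly from its pairwise interpretation: it equals the fraction of (signal, background) pairs in which the signal event receives the higher score, with ties counting one half. The sketch below uses made-up scores and a hypothetical function name; it is quadratic in the number of rows and meant only as an illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 1, 0, 0]                 # 1 = signal, 0 = background (toy data)
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]     # model outputs, made up
example_auc = auc(scores, labels)           # 8 of 9 pairs ranked correctly
print(example_auc)
```

A model that scores every event the same gets 0.5 and a perfect ranking gets 1.0, which is the range quoted on the slide.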
Higgs: Can Deep Learning Do Better?
Let's build a H2O Deep Learning model and find out! (That was my last weekend)
Algorithm | low-level H2O AUC | all features H2O AUC
Generalized Linear Model | 0.596 | 0.684
Random Forest | 0.764 | 0.840
Gradient Boosted Trees | 0.753 | 0.839
Neural Net 1 hidden layer | 0.760 | 0.830
Deep Learning | <Your guess goes here> |
Reference paper results: baseline 0.733
What is Deep Learning?
Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.
Example: input data (image) -> prediction (who is it?)
Facebook's DeepFace (Yann LeCun) recognises faces as well as humans
What is NOT Deep
Linear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers: kernel + linear)
Classification trees are not deep (operate on the original input space, no new features generated)
Deep Learning is Trending
[Plot: Google trends, 2009 to 2013]
Businesses are using Deep Learning techniques:
Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Chinese Search Giant Baidu Hires Man Behind the "Google Brain" (Andrew Ng)
Deep Learning History
slides by Yann LeCun (now Facebook)
Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs!?) smarter
Deep Learning in H2O
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy (weight initialization, adaptive learning rate, momentum, dropout regularization, L1/L2 regularization, grid search, checkpointing, auto-tuning, model averaging)
= Top-notch prediction engine!
Example Neural Network
"fully connected" directed graph of neurons
Input layer: 3 neurons (age, income, employment)
Hidden layer 1: 4 neurons
Hidden layer 2: 3 neurons
Output layer: 2 neurons (married, single)
Connections: 3x4, 4x3, 3x2
[Diagram legend: input/output neuron, hidden neuron; information flows from the input layer to the output layer]
Prediction: Forward Propagation
"neurons activate each other via weighted sums"
Inputs $x_i$: age, income, employment
$y_j = \tanh(\sum_i x_i u_{ij} + b_j)$
$z_k = \tanh(\sum_j y_j v_{jk} + c_k)$
$p_l = \text{softmax}(\sum_k z_k w_{kl} + d_l)$
$\text{softmax}(x_k) = \exp(x_k) / \sum_k \exp(x_k)$
Per-class probabilities: $\sum_l p_l = 1$
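The forward pass for the 3-4-3-2 example network can be written out in plain Python. This is a sketch under stated assumptions, not H2O code: the weights are random placeholders rather than trained values, the input row is made up, and the helper names are hypothetical. The variable names u, v, w, b, c, d follow the slide.

```python
import math
import random

def softmax(values):
    """exp(x_k) / sum_k exp(x_k), shifted by the maximum for numerical stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Weighted sums: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
            for j in range(len(biases))]

def forward(x, u, b, v, c, w, d):
    y = [math.tanh(s) for s in dense(x, u, b)]   # hidden layer 1
    z = [math.tanh(s) for s in dense(y, v, c)]   # hidden layer 2
    return softmax(dense(z, w, d))               # per-class probabilities

def random_matrix(rows, cols, rng):
    """Placeholder weights; a trained model would supply real values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
u, b = random_matrix(3, 4, rng), [0.0] * 4   # 3 inputs -> 4 hidden neurons
v, c = random_matrix(4, 3, rng), [0.0] * 3   # 4 hidden -> 3 hidden neurons
w, d = random_matrix(3, 2, rng), [0.0] * 2   # 3 hidden -> 2 output classes

x = [0.1, -0.4, 0.7]   # standardized age, income, employment (made-up row)
p = forward(x, u, b, v, c, w, d)
print(p, sum(p))
```

Whatever the weights are, the softmax output is a pair of positive numbers summing to 1, which is what lets them be read as the probabilities of "married" and "single".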
H2O Architecture
Distributed In-Memory K-V storeCol compression
Machine Learning
Algorithms
R EngineNano fast
Scoring Engine
Prediction Engine
Memory manager
eg Deep Learning
8
MapReduce
H2O Deep Learning ArnoCandel
H2O - The Killer App on Spark9
httpdatabrickscomblog20140630sparkling-water-h20-sparkhtml
H2O Deep Learning ArnoCandel
H2O DeepLearning on Spark10
Test if we can correctly learn A B where Y = logistic(A + BX) test(deep learning log regression) val nPoints = 10000 val A = 20 val B = -15 Generate testing data val trainData = DeepLearningSuitegenerateLogisticInput(A B nPoints 42) Create RDD from testing data val trainRDD = scparallelize(trainData 2) trainRDDcache() import H2OContext_ Create H2O data frame (will be implicit in the future) val trainH2ORDD = toDataFrame(sc trainRDD) Create a H2O DeepLearning model val dlParams = new DeepLearningParameters() dlParamssource = trainH2ORDD dlParamsresponse = trainH2ORDDlastVec() dlParamsclassification = true val dl = new DeepLearning(dlParams) val dlModel = dltrain()get() Score validation data val validationData = DeepLearningSuitegenerateLogisticInput(A B nPoints 17) val validationRDD = scparallelize(validationData 2) val validationH2ORDD = toDataFrame(sc validationRDD) val predictionH2OFrame = new DataFrame(dlModelscore(validationH2ORDD))(predict) val predictionRDD = toRDD[DoubleHolder](sc predictionH2OFrame) will be implicit in the future Validate prediction validatePrediction( predictionRDDcollect()map (_predictgetOrElse(DoubleNaN)) validationData)
Brand-Sparkling-New Sneak Preview
H2O Deep Learning ArnoCandel 11
John Chambers (creator of the S language R-core member) names H2O R API in top three promising R projects
H2O R CRAN package
H2O Deep Learning ArnoCandel
H2O + R = Happy Data Scientist
12
Machine Learning on Big Data with RData resides on the H2O cluster
H2O Deep Learning ArnoCandel 13
Higgs Particle Discovery
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
Machine Learning Meets Physics
Or rather Back to the roots (WWW was invented at CERN in rsquo89hellip)
H2O Deep Learning ArnoCandel 14
Higgs Binary Classification ProblemCurrent methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)
add derived
features
H2O Deep Learning ArnoCandel 15
Higgs Can Deep Learning Do Better
Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)
Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
Deep Learning
ltYour guess goes heregt
reference paper results baseline 0733
H2O Deep Learning ArnoCandel
WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using
architectures composed of multiple non-linear transformations
What is Deep Learning
Example Input data(image)
Prediction (who is it)
16
Facebooks DeepFace (Yann LeCun) recognises faces as well as humans
H2O Deep Learning ArnoCandel
What is NOT DeepLinear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers kernel + linear)
Classification trees are not deep (operate on original input space no new features generated)
17
H2O Deep Learning ArnoCandel
Deep Learning is Trending
20132009
Google trends
2011
18
Businesses are usingDeep Learning techniques
Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)
H2O Deep Learning ArnoCandel
Deep Learning Historyslides by Yan LeCun (now Facebook)
19
Deep Learning wins competitions AND
makes humans businesses and machines (cyborgs) smarter
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)
= Top-notch prediction engine
Deep Learning in H2O20
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network21
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
22
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
23
married
single
wkl
Training: Update Weights & Biases
For each training row, we make a prediction and compare with the actual label (supervised learning).
Example: predicted married = 0.8, single = 0.2; actual married = 1, single = 0.
Mean Square Error = $(0.2^2 + 0.2^2)/2$: “penalize differences per-class”
Cross-entropy = $-\log(0.8)$: “strongly penalize non-1-ness”
Objective: minimize prediction error (MSE or cross-entropy)
Stochastic Gradient Descent: update weights and biases via gradient of the error (via back-propagation):
$w \leftarrow w - \mathrm{rate} \cdot \partial E / \partial w$
[Figure: error $E$ plotted against weight $w$, with a step of size rate down the gradient]
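The two error measures for the married/single example can be checked directly. This sketch assumes the prediction is (0.8, 0.2) and the actual label is (1, 0), which is what the slide's numbers imply; the function names are illustrative.

```python
import math

def mean_square_error(predicted, actual):
    """Average squared difference over the classes."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def cross_entropy(predicted, actual):
    """-log of the probability assigned to the true class (one-hot actual)."""
    return -sum(a * math.log(p) for p, a in zip(predicted, actual) if a > 0)

predicted = [0.8, 0.2]   # married, single
actual = [1.0, 0.0]

mse = mean_square_error(predicted, actual)  # (0.2^2 + 0.2^2) / 2, about 0.04
ce = cross_entropy(predicted, actual)       # -log(0.8), about 0.223
```

MSE charges both classes for their 0.2 miss, while cross-entropy looks only at how far the true class's probability is from 1.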
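Backward Propagation
How to compute $\partial E / \partial w_i$ for $w_i \leftarrow w_i - \mathrm{rate} \cdot \partial E / \partial w_i$?
Naive: for every $i$, evaluate $E$ twice at $(w_1, \ldots, w_i \pm \Delta, \ldots, w_N)$… Slow!
Backprop: compute $\partial E / \partial w_i$ via chain rule going backwards
$\mathrm{net} = \sum_i w_i x_i + b$
$y = \mathrm{activation}(\mathrm{net})$
$E = \mathrm{error}(y)$
$\partial E / \partial w_i = \partial E / \partial y \cdot \partial y / \partial \mathrm{net} \cdot \partial \mathrm{net} / \partial w_i$
$= \partial(\mathrm{error}(y)) / \partial y \cdot \partial(\mathrm{activation}(\mathrm{net})) / \partial \mathrm{net} \cdot x_i$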
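A sketch comparing the naive and the chain-rule gradient for a single neuron. The choices here are assumptions for illustration only: tanh activation, squared error E = (y - target)^2 / 2, and made-up weights and inputs.

```python
import math

def forward(w, b, x):
    """y = activation(net) with net = sum_i(w_i * x_i) + b and tanh activation."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    return math.tanh(net)

def error(y, target):
    return 0.5 * (y - target) ** 2

def backprop_gradient(w, b, x, target):
    """dE/dw_i = dE/dy * dy/dnet * dnet/dw_i, all from one forward pass."""
    y = forward(w, b, x)
    dE_dy = y - target
    dy_dnet = 1.0 - y * y          # derivative of tanh
    return [dE_dy * dy_dnet * xi for xi in x]

def naive_gradient(w, b, x, target, delta=1e-6):
    """Central differences: two evaluations of E for every weight."""
    grads = []
    for i in range(len(w)):
        up = list(w)
        up[i] += delta
        down = list(w)
        down[i] -= delta
        diff = error(forward(up, b, x), target) - error(forward(down, b, x), target)
        grads.append(diff / (2 * delta))
    return grads

w, b, x, target = [0.3, -0.1, 0.2], 0.05, [1.0, 2.0, -1.5], 1.0
analytic = backprop_gradient(w, b, x, target)
numeric = naive_gradient(w, b, x, target)

# One gradient descent step: w_i <- w_i - rate * dE/dw_i
rate = 0.1
w_updated = [wi - rate * gi for wi, gi in zip(w, analytic)]
```

Both routes give the same gradient up to rounding, but the naive one needs 2N forward passes for N weights, while backprop reuses a single pass.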
H2O Deep Learning Architecture
[Figure: H2O nodes (JVMs), each with an HTTPD and a K-V store. Communication: nodes/JVMs sync, threads async.]
Initial model: weights and biases $w$ (in the H2O atomic in-memory K-V store)
map: each node trains a copy of the weights and biases with (some or all of) its local data with asynchronous F/J threads, giving $w_1, w_2, w_3, w_4$
reduce: model averaging: average weights and biases from all nodes ($w_1 + w_3$ and $w_2 + w_4$ are combined first): $w^* = (w_1 + w_2 + w_3 + w_4)/4$
Updated model: $w^*$
Speedup is at least #nodes/log(#rows), arXiv:1209.4129v3
Keep iterating over the data (“epochs”), score from time to time
Query & display the model via JSON, WWW
Auto-tuned (default) or user-specified number of points per MapReduce iteration
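The map/reduce loop from the architecture slide can be imitated on a toy problem. This is a sketch under loud assumptions: `train_on_node` is a hypothetical stand-in that fits a single weight of y = w * x with plain SGD, whereas the real map step is a multi-threaded neural network pass on each node's local data.

```python
def train_on_node(weights, local_rows, rate=0.1):
    """Stand-in for one node's local training pass (the 'map' step)."""
    w = list(weights)              # each node works on its own copy
    for x, y in local_rows:
        gradient = (w[0] * x - y) * x
        w[0] -= rate * gradient
    return w

def average_models(models):
    """The 'reduce' step: element-wise average of all node models."""
    return [sum(ws) / len(models) for ws in zip(*models)]

# Four nodes, each holding its own shard of rows drawn from y = 2x.
shards = [
    [(1.0, 2.0), (0.5, 1.0)],
    [(2.0, 4.0), (1.5, 3.0)],
    [(0.2, 0.4), (1.0, 2.0)],
    [(0.8, 1.6), (1.2, 2.4)],
]

w = [0.0]                        # initial model
for iteration in range(20):      # each pass is one MapReduce iteration
    node_models = [train_on_node(w, shard) for shard in shards]  # map
    w = average_models(node_models)                              # reduce
```

After the loop, the averaged weight sits close to the true value 2; processing more rows per iteration means fewer averaging rounds but lets the node copies drift further apart between them.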
“Secret” Sauce to Higher Accuracy
Adaptive learning rate - ADADELTA (Google): automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s)
Regularization: L1 penalizes non-zero weights; L2 penalizes large weights; Dropout randomly ignores certain inputs
Detail: Adaptive Learning Rate
Compute moving average of $\Delta w_i^2$ at time $t$ for window length rho:
$E[\Delta w_i^2]_t = \rho \, E[\Delta w_i^2]_{t-1} + (1 - \rho) \, \Delta w_i^2$
Compute RMS of $\Delta w_i$ at time $t$ with smoothing epsilon:
$\mathrm{RMS}[\Delta w_i]_t = \sqrt{E[\Delta w_i^2]_t + \epsilon}$
Do the same for $\partial E / \partial w_i$, then obtain per-weight learning rate:
$\mathrm{rate}(w_i, t) = \mathrm{RMS}[\Delta w_i]_{t-1} / \mathrm{RMS}[\partial E / \partial w_i]_t$
cf. ADADELTA paper
Adaptive annealing / progress: gradient-dependent learning rate, moving window prevents “freezing” (unlike ADAGRAD: no window)
Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time
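A sketch of the per-weight ADADELTA update for a single weight, following the moving-average and RMS formulas on the slide. The toy error function and the values rho = 0.99 and epsilon = 1e-8 are assumptions for illustration; this is not H2O's implementation.

```python
import math

def adadelta_minimize(gradient, w, steps, rho=0.99, epsilon=1e-8):
    """Minimize a one-weight error function with ADADELTA updates."""
    avg_sq_grad = 0.0    # E[(dE/dw)^2], moving average
    avg_sq_delta = 0.0   # E[(delta w)^2], moving average
    for _ in range(steps):
        g = gradient(w)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # rate(w, t) = RMS[delta w]_{t-1} / RMS[dE/dw]_t
        rate = math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon)
        delta = -rate * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        w += delta
    return w

# Toy error E(w) = (w - 3)^2 with gradient 2 * (w - 3), starting from w = 0.
w_final = adadelta_minimize(lambda w: 2.0 * (w - 3.0), w=0.0, steps=5000)
```

No learning rate is passed in: the step size comes entirely from the two moving averages, with epsilon setting the size of the very first steps.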
Detail: Dropout Regularization
Training: for each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations.
[Figure: example network (inputs age, income, employment; outputs married, single) with some activations crossed out]
Testing: use all activations, but scale them by a factor 1 - p, the fraction kept during training (to “simulate” the missing activations during training). For the common choice p = 0.5, this halves them.
cf. Geoff Hinton’s paper
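A sketch of the two dropout modes. Two assumptions are made here: each activation is dropped independently with probability p (so the dropped fraction is p only on average), and the test-time reduction is read as multiplying by the keep probability 1 - p, which coincides with p for the common choice p = 0.5.

```python
import random

def dropout_train(activations, p, rng):
    """Training: zero out each input activation with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p):
    """Testing: keep every activation, scaled by the keep probability."""
    return [a * (1.0 - p) for a in activations]

rng = random.Random(0)
activations = [1.0] * 10000
p = 0.5

trained = dropout_train(activations, p, rng)
tested = dropout_test(activations, p)

mean_train = sum(trained) / len(trained)   # close to 1 - p on average
mean_test = sum(tested) / len(tested)      # exactly 1 - p for these inputs
```

The scaling makes the expected input to the next layer the same at test time as it was during training.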
MNIST: Digits Classification
MNIST = Digitized handwritten digits database (Yann LeCun)
Data: 28x28 = 784 pixels with (gray-scale) values in 0…255
Train: 60,000 rows, 784 integer columns, 10 classes. Test: 10,000 rows, 784 integer columns, 10 classes
Standing world record: without distortions or convolutions, the best-ever published error rate on test set: 0.83% (Microsoft)
Yann LeCun: “Yet another advice: don’t get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet.”
Let’s see how H2O does on the MNIST dataset!
H2O Deep Learning on MNIST: 0.87% test set error (so far)
Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
Frequent errors: confuse 2/7 and 4/9
World-class results! No pre-training, no distortions, no convolutions, no unsupervised training
Running on 4 nodes with 16 cores each
Weather Dataset
Predict “RainTomorrow” from Temperature, Humidity, Wind, Pressure, etc.
Live Demo: Weather Prediction
Interactive ROC curve with real-time updates
3 hidden Rectifier layers, Dropout, L1-penalty
12.7% 5-fold cross-validation error is at least as good as GBM/RF/GLM models
Live Demo: Grid Search
How did I find those parameters? Grid Search! (works for multiple hyper-parameters at once)
Then continue training the best model!
Text Classification
Goal: predict the item from seller’s text description
“Vintage 18KT gold Rolex 2 Tone in great condition”
Data: binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 (the 1s mark the words vintage, gold, condition)
Train: 578,361 rows, 8,647 cols, 467 classes. Test: 64,263 rows, 8,647 cols, 143 classes
Let’s see how H2O does on the eBay dataset!
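A sketch of how a seller's description becomes a binary word vector. The 8-word vocabulary is hypothetical (the dataset in the talk has 8,647 columns), and the whitespace tokenizer is an assumption; the actual preprocessing is not described in the talk.

```python
def binary_word_vector(text, vocabulary):
    """Return one 0/1 entry per vocabulary word: 1 if the word occurs in the text."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical vocabulary, one column per word.
vocabulary = ["antique", "condition", "diamond", "gold",
              "new", "silver", "vintage", "watch"]

description = "Vintage 18KT gold Rolex 2 Tone in great condition"
vector = binary_word_vector(description, vocabulary)
# vector == [0, 1, 0, 1, 0, 0, 1, 0]: condition, gold and vintage are present
```

Most entries are 0 for any single description, which is why such data compresses well in a columnar store.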
Text Classification
Out-of-the-box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!
Note 1: H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)
Parallel Scalability (for 64 epochs on MNIST, with “0.87%” parameters)
[Figure: two plots against the number of H2O nodes (1, 2, 4, 8, 16, 32, 63; 4 cores per node, 1 epoch per node per MapReduce). Training time in minutes (axis 0 to 100), annotated with 2.7 mins. Speedup (axis 0 to 40).]
Deep Learning Auto-Encoders for Anomaly Detection
Toy example: find anomaly in ECG heart beat data. First, train a model on what’s “normal”: 20 time-series samples of 210 data points each
Deep Auto-Encoder: learn low-dimensional non-linear “structure” of the data that allows to reconstruct the orig. data
Also for categorical data!
Model of what’s “normal” + test set with anomaly => test set prediction is the reconstruction, looks “normal” => found anomaly: large reconstruction error
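The detection step (flagging rows with a large reconstruction error) can be sketched without training a network. Loud assumption: `fit_placeholder_model` is not an auto-encoder; it simply "reconstructs" every input as the average normal series, so only the thresholding logic is illustrated. The data, the names, and the factor 2.0 on the threshold are all made up.

```python
def reconstruction_error(series, reconstruction):
    """Mean squared difference between a series and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(series, reconstruction)) / len(series)

def fit_placeholder_model(normal_series):
    """Placeholder for the auto-encoder: the per-point mean of the training data."""
    n = len(normal_series)
    mean_beat = [sum(points) / n for points in zip(*normal_series)]
    return lambda series: mean_beat

# Training data: what "normal" looks like.
normal = [
    [0.0, 1.0, 0.0, -1.0, 0.0, 1.1],
    [0.1, 0.9, 0.0, -1.1, 0.0, 1.0],
    [0.0, 1.1, 0.1, -0.9, -0.1, 0.9],
]
model = fit_placeholder_model(normal)

# Threshold: the worst reconstruction error seen on normal data, with some slack.
threshold = 2.0 * max(reconstruction_error(s, model(s)) for s in normal)

test_set = {
    "normal-looking": [0.05, 1.0, 0.0, -1.0, 0.0, 1.0],
    "anomalous": [0.0, 1.0, 2.5, 2.0, 0.0, 1.0],
}
anomalies = [name for name, s in test_set.items()
             if reconstruction_error(s, model(s)) > threshold]
```

A model trained only on normal data reproduces normal inputs well and anomalous inputs badly, so the error itself is the anomaly score.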
H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms/
All parameters are available from R…
POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production!
Higgs Particle Discovery with H2O
How well did H2O Deep Learning do?
Let’s see how H2O did in the past 30 minutes!
Any guesses for AUC on low-level features? AUC = 0.76 was the best for RF/GBM/NN (H2O)
<Your guess goes here>
reference paper results
H2O Steam: Scoring Platform
Live Demo
Higgs Dataset Demo on 10-node cluster. Let’s score all our H2O models and compare them!
http://server:port/steam/index.html
Scoring Higgs Models in H2O Steam
Live Demo on 10-node cluster: <10 minutes runtime for all algos! Better than LHC baseline of AUC = 0.73!
Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
Algorithm | Paper’s low-level AUC | H2O low-level AUC | H2O all-features AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5
Deep Learning on low-level features alone beats everything else! H2O prelim. results compare well with paper’s results (TMVA & Theano)
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
Tips for H2O Deep Learning
General:
- More layers for more complex functions (exp. more non-linearity)
- More neurons per layer to detect finer structure in data (“memorizing”)
- Add some regularization for less overfitting (lower validation set error)
Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier, try max_w2 = 10…50, L1 = 1e-5…1e-3 and/or L2 = 1e-5…1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation set. Input dropout is recommended for noisy high-dimensional input
- Distributed: more training samples per iteration: faster, but less accuracy
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
- Try balance_classes = true for datasets with large class imbalance
- Enable force_load_balance for small datasets
- Enable replicate_training_data if each node can hold all the data
Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O! Add your own JIRA tickets!
Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data!
Join our Community and Meetups! https://github.com/h2oai, http://docs.h2o.ai, www.h2o.ai/community, @h2oai
Thank you!
H2O - The Killer App on Spark
http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html
H2O DeepLearning on Spark
Brand-Sparkling-New Sneak Preview!
// Test if we can correctly learn A, B where Y = logistic(A + B*X)
test("deep learning log regression") {
  val nPoints = 10000
  val A = 2.0
  val B = -1.5

  // Generate testing data
  val trainData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 42)
  // Create RDD from testing data
  val trainRDD = sc.parallelize(trainData, 2)
  trainRDD.cache()

  import H2OContext._
  // Create H2O data frame (will be implicit in the future)
  val trainH2ORDD = toDataFrame(sc, trainRDD)
  // Create a H2O DeepLearning model
  val dlParams = new DeepLearningParameters()
  dlParams.source = trainH2ORDD
  dlParams.response = trainH2ORDD.lastVec()
  dlParams.classification = true
  val dl = new DeepLearning(dlParams)
  val dlModel = dl.train().get()

  // Score validation data
  val validationData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 17)
  val validationRDD = sc.parallelize(validationData, 2)
  val validationH2ORDD = toDataFrame(sc, validationRDD)
  val predictionH2OFrame = new DataFrame(dlModel.score(validationH2ORDD))('predict)
  val predictionRDD = toRDD[DoubleHolder](sc, predictionH2OFrame) // will be implicit in the future

  // Validate prediction
  validatePrediction(predictionRDD.collect().map(_.predict.getOrElse(Double.NaN)), validationData)
}
John Chambers (creator of the S language, R-core member) names H2O R API in top three promising R projects
H2O R CRAN package
H2O + R = Happy Data Scientist
Machine Learning on Big Data with R: data resides on the H2O cluster!
Higgs Particle Discovery
Higgs vs Background
Large Hadron Collider: largest experiment of mankind! $13+ billion, 16.8 miles long, 120 MegaWatts, -456F, 1 PB/day, etc. Higgs boson discovery (July ’12) led to 2013 Nobel prize!
http://arxiv.org/pdf/1402.4735v2.pdf
Images courtesy CERN / LHC
Machine Learning Meets Physics
Or rather: back to the roots (WWW was invented at CERN in ’89…)
Higgs: Binary Classification Problem
Current methods of choice for physicists:
- Boosted Decision Trees
- Neural networks with 1 hidden layer
BUT: must first add derived high-level features (physics formulae)
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
Algorithm | low-level H2O AUC | all features H2O AUC
Generalized Linear Model | 0.596 | 0.684
Random Forest | 0.764 | 0.840
Gradient Boosted Trees | 0.753 | 0.839
Neural Net 1 hidden layer | 0.760 | 0.830
(low-level -> all features: add derived features)
Metric: AUC = Area under the ROC curve (range: 0.5…1, higher is better)
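For readers unfamiliar with the metric, AUC can be computed from scores and labels by comparing every positive row with every negative row. This is a sketch with made-up scores; the pairwise loop is quadratic in the number of rows, so it is meant for illustration rather than for a 10M-row dataset.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive row gets a
    higher score than a randomly chosen negative row (ties count half).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 1, 0, 0, 0]
perfect = auc([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], labels)  # every positive outranks every negative: 1.0
mixed = auc([0.9, 0.4, 0.7, 0.5, 0.2, 0.1], labels)    # 8 of 9 pairs ranked correctly: 8/9
constant = auc([0.5] * 6, labels)                      # uninformative scores: 0.5
```

An AUC of 0.5 therefore corresponds to scores that carry no ranking information, and 1.0 to a perfect separation of signal and background.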
Higgs: Can Deep Learning Do Better?
Let’s build a H2O Deep Learning model and find out! (That was my last weekend.)
Algorithm | low-level H2O AUC | all features H2O AUC
Generalized Linear Model | 0.596 | 0.684
Random Forest | 0.764 | 0.840
Gradient Boosted Trees | 0.753 | 0.839
Neural Net 1 hidden layer | 0.760 | 0.830
Deep Learning | ? | ?
<Your guess goes here>
reference paper results: baseline 0.733
What is Deep Learning?
Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.
Example: input data (image) -> prediction (who is it?)
Facebook’s DeepFace (Yann LeCun) recognises faces as well as humans
What is NOT Deep
Linear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers: kernel + linear)
Classification trees are not deep (operate on original input space, no new features generated)
Deep Learning is Trending
[Figure: Google Trends chart, 2009 to 2013]
Businesses are using Deep Learning techniques!
Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Chinese Search Giant Baidu Hires Man Behind the “Google Brain” (Andrew Ng)
Deep Learning History
slides by Yann LeCun (now Facebook)
Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs) smarter
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)
= Top-notch prediction engine
Deep Learning in H2O20
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network21
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
22
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
23
married
single
wkl
Training: Update Weights & Biases
For each training row, we make a prediction and compare with the actual label (supervised learning):
married: predicted 0.8, actual 1
single: predicted 0.2, actual 0
Objective: minimize prediction error (MSE or cross-entropy)
Mean Square Error = $(0.2^2 + 0.2^2)/2$ ("penalize differences per-class")
Cross-entropy = $-\log(0.8)$ ("strongly penalize non-1-ness")
Stochastic Gradient Descent: update weights and biases via gradient of the error (via back-propagation):
$w \leftarrow w - \text{rate} \cdot \partial E / \partial w$
[Figure: error E as a function of a weight w, with a step of size rate along the gradient]
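The two error measures for this example row can be checked numerically. The sketch below only restates the slide's arithmetic; the dictionary layout is an arbitrary choice.

```python
import math

predicted = {"married": 0.8, "single": 0.2}
actual = {"married": 1.0, "single": 0.0}

# Mean square error: average of the squared per-class differences.
mse = sum((predicted[k] - actual[k]) ** 2 for k in predicted) / len(predicted)

# Cross-entropy: only the actual class (label 1) contributes.
cross_entropy = -sum(actual[k] * math.log(predicted[k]) for k in predicted)
```

The mean square error comes out at about 0.04 and the cross-entropy at about 0.223, so the cross-entropy penalizes the remaining gap to a probability of 1 more strongly.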
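Backward Propagation
How to compute $\partial E / \partial w_i$ for $w_i \leftarrow w_i - \text{rate} \cdot \partial E / \partial w_i$?
Naive: for every i, evaluate E twice at $(w_1, \ldots, w_i \pm \Delta, \ldots, w_N)$… Slow!
Backprop: compute $\partial E / \partial w_i$ via chain rule going backwards
$\text{net} = \sum_i w_i x_i + b$
$y = \text{activation}(\text{net})$
$E = \text{error}(y)$
$\partial E / \partial w_i = \partial E / \partial y \cdot \partial y / \partial \text{net} \cdot \partial \text{net} / \partial w_i$
$= \partial(\text{error}(y)) / \partial y \cdot \partial(\text{activation}(\text{net})) / \partial \text{net} \cdot x_i$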
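To make the comparison concrete, here is a minimal sketch for a single neuron that computes the gradient both ways. It assumes a tanh activation and the error $E = (y - \text{target})^2 / 2$, which the slide leaves generic; all names and numbers are illustrative.

```python
import math

def error(w, b, x, target):
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = math.tanh(net)
    return 0.5 * (y - target) ** 2

def gradient_chain_rule(w, b, x, target):
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = math.tanh(net)
    dE_dy = y - target       # derivative of 0.5 * (y - target)^2
    dy_dnet = 1.0 - y * y    # derivative of tanh(net)
    return [dE_dy * dy_dnet * xi for xi in x]   # dnet/dwi = xi

def gradient_naive(w, b, x, target, delta=1e-6):
    # Two evaluations of E per weight: slow for large networks.
    grads = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + delta] + w[i + 1:]
        down = w[:i] + [w[i] - delta] + w[i + 1:]
        grads.append((error(up, b, x, target) - error(down, b, x, target)) / (2 * delta))
    return grads

w, b, x, target = [0.5, -0.3, 0.8], 0.1, [1.0, 2.0, -1.5], 1.0
analytic = gradient_chain_rule(w, b, x, target)
numeric = gradient_naive(w, b, x, target)
```

Both lists agree to within numerical precision, but the chain-rule version needs one forward evaluation instead of two per weight.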
H2O Deep Learning Architecture
[Figure: several nodes/JVMs, each with an HTTPD and a share of the H2O atomic in-memory K-V store; communication between nodes/JVMs is synchronous, threads within a node are asynchronous]
Initial model: weights and biases w
map: each node trains a copy of the weights and biases with (some or all of) its local data with asynchronous F/J threads
reduce: model averaging: average weights and biases from all nodes, e.g. w = (w1+w2+w3+w4)/4 for four nodes
Updated model: w
Speedup is at least nodes/log(rows) (arXiv:1209.4129v3)
Keep iterating over the data ("epochs"), score from time to time
Number of points per MapReduce iteration: auto-tuned (default) or user-specified
Query & display the model via JSON, WWW
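The map/reduce loop can be imitated on a single machine. The sketch below is a toy stand-in, not H2O code: the "nodes" are just list slices, the model is a two-parameter linear fit instead of a neural network, and training within a node is sequential rather than multi-threaded.

```python
def train_local(weights, rows, rate=0.05):
    # "map": plain SGD over one node's local rows for the model y = w0 + w1 * x.
    w0, w1 = weights
    for x, y in rows:
        err = (w0 + w1 * x) - y
        w0 -= rate * err
        w1 -= rate * err * x
    return [w0, w1]

def average(models):
    # "reduce": element-wise average of the weights from all nodes.
    return [sum(ws) / len(models) for ws in zip(*models)]

# Toy data following y = 1 + 2x, split across 4 hypothetical nodes.
data = [(x / 10.0, 1.0 + 2.0 * x / 10.0) for x in range(-20, 20)]
shards = [data[i::4] for i in range(4)]

w = [0.0, 0.0]  # initial model
for iteration in range(200):
    # Every node starts from the same model, trains on local data, then models are averaged.
    w = average([train_local(w, shard) for shard in shards])
```

After the loop, `w` is close to `[1.0, 2.0]`, the parameters that generated the toy data.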
"Secret" Sauce to Higher Accuracy
Adaptive learning rate - ADADELTA (Google): automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s)
Regularization: L1 penalizes non-zero weights, L2 penalizes large weights, Dropout randomly ignores certain inputs
Detail: Adaptive Learning Rate
Compute moving average of $\Delta w_i^2$ at time t for window length rho:
$E[\Delta w_i^2]_t = \rho \, E[\Delta w_i^2]_{t-1} + (1 - \rho) \, \Delta w_i^2$
Compute RMS of $\Delta w_i$ at time t with smoothing epsilon:
$\mathrm{RMS}[\Delta w_i]_t = \sqrt{E[\Delta w_i^2]_t + \epsilon}$
Do the same for $\partial E / \partial w_i$, then obtain per-weight learning rate:
$\text{rate}(w_i, t) = \mathrm{RMS}[\Delta w_i]_{t-1} / \mathrm{RMS}[\partial E / \partial w_i]_t$
Adaptive annealing / progress: gradient-dependent learning rate, moving window prevents "freezing" (unlike ADAGRAD: no window)
Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time
cf. ADADELTA paper
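A minimal single-weight sketch of this update rule follows. It assumes the standard ADADELTA ordering (update the gradient average, compute the rate, apply the step, then update the step average) and uses a toy quadratic error; it is not taken from H2O's source.

```python
import math

def adadelta_minimize(grad, w, steps, rho=0.95, epsilon=1e-6):
    avg_sq_grad = 0.0    # moving average of (dE/dw)^2
    avg_sq_delta = 0.0   # moving average of (delta w)^2
    for _ in range(steps):
        g = grad(w)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # rate = RMS[delta w] at t-1 divided by RMS[dE/dw] at t
        rate = math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon)
        delta = -rate * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        w += delta
    return w

def loss(w):
    # Toy error with its minimum at w = 3.
    return (w - 3.0) ** 2

w_start = 0.0
w_end = adadelta_minimize(lambda w: 2.0 * (w - 3.0), w_start, steps=100)
```

No learning rate is supplied: the step size starts near `sqrt(epsilon)` and adapts from the history of gradients and updates, so `w_end` has moved from 0 toward the minimum.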
Detail: Dropout Regularization
Training: for each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations
[Figure: example network (age, income, employment -> married, single) with dropped activations marked X]
Testing: use all activations, but reduce them by a factor p (to "simulate" the missing activations during training)
cf. Geoff Hinton's paper
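A small sketch of both phases, under the assumption that p is the probability of dropping an activation, so that test-time activations are scaled by the keep probability 1 - p (for p = 0.5 the two readings coincide). The function names are hypothetical.

```python
import random

def dropout_train(activations, p, rng):
    # Training: zero out each input activation independently with probability p.
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p):
    # Testing: keep every activation, scaled by the keep probability (1 - p).
    return [a * (1.0 - p) for a in activations]

rng = random.Random(0)
activations = [1.0] * 10000
p = 0.5
train_mean = sum(dropout_train(activations, p, rng)) / len(activations)
test_mean = sum(dropout_test(activations, p)) / len(activations)
```

The scaled test-time activations match the average signal seen during training, which is the purpose of the "simulate" step.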
MNIST: digits classification
MNIST = Digitized handwritten digits database (Yann LeCun)
Data: 28x28=784 pixels with (gray-scale) values in 0…255
Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes
Standing world record: without distortions or convolutions, the best-ever published error rate on test set: 0.83% (Microsoft)
Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."
Let's see how H2O does on the MNIST dataset!
H2O Deep Learning on MNIST: 0.87% test set error (so far)
Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
World-class results! No pre-training, no distortions, no convolutions, no unsupervised training
Running on 4 nodes with 16 cores each
Frequent errors: confuse 2/7 and 4/9
Weather Dataset
Predict "RainTomorrow" from Temperature, Humidity, Wind, Pressure, etc.
Live Demo: Weather Prediction
Interactive ROC curve with real-time updates
3 hidden Rectifier layers, Dropout, L1-penalty
5-fold cross validation: 12.7% 5-fold cross-validation error is at least as good as GBM/RF/GLM models
Live Demo: Grid Search
How did I find those parameters? Grid Search! (works for multiple hyper-parameters at once)
Then continue training the best model
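In outline, a grid search over several hyper-parameters at once tries every combination and ranks the results. The sketch below is generic Python, not the H2O API: `toy_validation_error` is a hypothetical stand-in for training a model and scoring it on validation data, and the parameter names are illustrative.

```python
import itertools

def grid_search(grid, evaluate):
    # Evaluate every combination of hyper-parameter values; best (lowest error) first.
    names = sorted(grid)
    results = []
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        results.append((evaluate(params), params))
    return sorted(results, key=lambda r: r[0])

def toy_validation_error(params):
    # Hypothetical stand-in for "train a model and measure its validation error".
    return abs(params["l1"] - 1e-4) * 1000 + abs(params["hidden_dropout"] - 0.5)

grid = {"l1": [1e-5, 1e-4, 1e-3], "hidden_dropout": [0.0, 0.25, 0.5]}
ranked = grid_search(grid, toy_validation_error)
best_error, best_params = ranked[0]
```

The top-ranked entries in `ranked` are the candidates one would continue training.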
Text Classification
Goal: Predict the item from seller's text description
"Vintage 18KT gold Rolex 2 Tone in great condition"
Data: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 (the 1s mark the words vintage, gold, condition)
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
Let's see how H2O does on the eBay dataset!
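A binary word vector can be built as follows. This is only a sketch: the five-word vocabulary is invented for the example (the dataset described above has 8647 columns), and the whitespace tokenization is an assumption.

```python
def binary_word_vector(text, vocabulary):
    # 1 if the vocabulary word occurs in the text, else 0 (word counts are ignored).
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Tiny hypothetical vocabulary, one column per word.
vocabulary = ["condition", "gold", "silver", "vintage", "watch"]
vector = binary_word_vector("Vintage 18KT gold Rolex 2 Tone in great condition", vocabulary)
```

Here `vector` is `[1, 1, 0, 1, 0]`: the columns for condition, gold and vintage are set.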
Text Classification
Out-Of-The-Box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!
Note 1: H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)
Parallel Scalability (for 64 epochs on MNIST, with "0.87%" parameters)
[Figure: two plots versus number of H2O nodes (1, 2, 4, 8, 16, 32, 63; 4 cores per node, 1 epoch per node per MapReduce): speedup (axis 0 to 40) and training time in minutes (axis 0 to 100), with a data label of 2.7 mins]
Deep Learning Auto-Encoders for Anomaly Detection
Toy example: Find anomaly in ECG heart beat data. First, train a model on what's "normal": 20 time-series samples of 210 data points each
Deep Auto-Encoder: Learn low-dimensional non-linear "structure" of the data that allows to reconstruct the orig. data
Also for categorical data!
[Figure: model of what's "normal" + test set with anomaly => test set prediction is reconstruction, looks "normal"; found anomaly: large reconstruction error]
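The detection step (flag samples whose reconstruction error is large) can be sketched without a neural network. In the snippet below the "model" is simply the mean of the normal samples, a deliberately crude stand-in for a trained auto-encoder's reconstruction; the data, the threshold rule and all names are made up for illustration.

```python
def mean_series(samples):
    # Stand-in "model of normal": the point-wise mean of the normal samples.
    return [sum(col) / len(col) for col in zip(*samples)]

def reconstruction_error(sample, reconstruction):
    # Mean squared difference between a sample and its reconstruction.
    return sum((s - r) ** 2 for s, r in zip(sample, reconstruction)) / len(sample)

normal = [[0.0, 1.0, 0.0, -1.0], [0.1, 0.9, 0.0, -1.1], [-0.1, 1.1, 0.1, -0.9]]
model = mean_series(normal)

# Assumed rule: anything above twice the worst error on normal data is an anomaly.
threshold = 2.0 * max(reconstruction_error(s, model) for s in normal)

test_set = [[0.0, 1.0, 0.1, -1.0], [1.0, -1.0, 1.0, 1.0]]
anomalies = [i for i, s in enumerate(test_set)
             if reconstruction_error(s, model) > threshold]
```

Only the second test sample (index 1) is flagged. With a real auto-encoder, `model` would be replaced by the network's reconstruction of each individual sample.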
H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms
All parameters are available from R…
POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production!
Higgs Particle Discovery with H2O
How well did H2O Deep Learning do? Let's see how H2O did in the past 30 minutes!
Any guesses for AUC on low-level features? AUC=0.76 was the best for RF/GBM/NN (H2O)
<Your guess goes here>
reference paper results
H2O Steam: Scoring Platform
Higgs Dataset Demo on 10-node cluster. Let's score all our H2O models and compare them!
Live Demo: http://server:port/steam/index.html
Scoring Higgs Models in H2O Steam
Live Demo on 10-node cluster: <10 minutes runtime for all algos! Better than LHC baseline of AUC=0.73!
Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
Algorithm | Paper's low-level AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5
Deep Learning on low-level features alone beats everything else! H2O prelim. results compare well with paper's results (TMVA & Theano)
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
Tips for H2O Deep Learning
General:
- More layers for more complex functions (exp. more non-linearity)
- More neurons per layer to detect finer structure in data ("memorizing")
- Add some regularization for less overfitting (lower validation set error)
Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier, try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation set. Input dropout is recommended for noisy high-dimensional input
- Distributed: More training samples per iteration: faster, but less accuracy?
- With ADADELTA: Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99
- Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
- Try balance_classes = true for datasets with large class imbalance
- Enable force_load_balance for small datasets
- Enable replicate_training_data if each node can hold all the data
Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O! Add your own JIRA tickets!
Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data!
Join our Community and Meetups! https://github.com/h2oai, http://docs.h2o.ai, www.h2o.ai/community, h2oai
Thank you!
John Chambers (creator of the S language, R-core member) names H2O R API in top three promising R projects
H2O R CRAN package
H2O + R = Happy Data Scientist
Machine Learning on Big Data with R: Data resides on the H2O cluster!
Higgs Particle Discovery
Higgs vs. Background
Large Hadron Collider: Largest experiment of mankind! $13+ billion, 16.8 miles long, 120 MegaWatts, -456F, 1 PB/day, etc. Higgs boson discovery (July '12) led to 2013 Nobel prize!
http://arxiv.org/pdf/1402.4735v2.pdf
Images courtesy CERN / LHC
Machine Learning Meets Physics
Or rather: Back to the roots (WWW was invented at CERN in '89…)
Higgs: Binary Classification Problem
Current methods of choice for physicists:
- Boosted Decision Trees
- Neural networks with 1 hidden layer
BUT: Must first add derived high-level features (physics formulae)
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
Metric: AUC = Area under the ROC curve (range: 0.5…1, higher is better)
Algorithm | low-level H2O AUC | all features (derived features added) H2O AUC
Generalized Linear Model | 0.596 | 0.684
Random Forest | 0.764 | 0.840
Gradient Boosted Trees | 0.753 | 0.839
Neural Net 1 hidden layer | 0.760 | 0.830
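For readers unfamiliar with the metric: AUC equals the probability that a randomly chosen positive (signal) row receives a higher score than a randomly chosen negative (background) row. The sketch below computes it directly from that definition on made-up labels and scores; it is quadratic in the number of rows and meant only as an illustration.

```python
def auc(labels, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count as half.
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up example: 1 = signal, 0 = background.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
example_auc = auc(labels, scores)
```

In this example 8 of the 9 positive/negative pairs are ordered correctly, so `example_auc` is 8/9. Random scores give about 0.5 and a perfect ranking gives 1.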
Higgs: Can Deep Learning Do Better?
Let's build a H2O Deep Learning model and find out! (That was my last weekend)
Algorithm | low-level H2O AUC | all features H2O AUC
Generalized Linear Model | 0.596 | 0.684
Random Forest | 0.764 | 0.840
Gradient Boosted Trees | 0.753 | 0.839
Neural Net 1 hidden layer | 0.760 | 0.830
Deep Learning | <Your guess goes here> |
reference paper results: baseline 0.733
What is Deep Learning?
Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.
Example: Input data (image) -> Prediction (who is it?)
Facebook's DeepFace (Yann LeCun) recognises faces as well as humans
What is NOT Deep
Linear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers: kernel + linear)
Classification trees are not deep (operate on original input space, no new features generated)
Deep Learning is Trending
[Figure: Google trends, 2009 to 2013]
Businesses are using Deep Learning techniques
Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Chinese Search Giant Baidu Hires Man Behind the "Google Brain" (Andrew Ng)
Deep Learning History
slides by Yann LeCun (now Facebook)
Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs) smarter
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)
= Top-notch prediction engine
Deep Learning in H2O20
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network21
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
22
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
23
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
24
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
25
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
26
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
27
ldquoSecretrdquo Sauce to Higher Accuracy
Detail: Adaptive Learning Rate
Compute moving average of $\Delta w_i^2$ at time t for window length rho:
$E[\Delta w_i^2]_t = \rho \, E[\Delta w_i^2]_{t-1} + (1-\rho) \, \Delta w_i^2$
Compute RMS of $\Delta w_i$ at time t with smoothing epsilon:
$RMS[\Delta w_i]_t = \sqrt{E[\Delta w_i^2]_t + \epsilon}$
Do the same for $\partial E/\partial w_i$, then obtain per-weight learning rate:
$rate(w_i, t) = RMS[\Delta w_i]_{t-1} \, / \, RMS[\partial E/\partial w_i]_t$
Adaptive annealing / progress: Gradient-dependent learning rate, moving window prevents "freezing" (unlike ADAGRAD: no window)
Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time
cf. ADADELTA paper
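The per-weight rate above translates almost directly into code. The following is a minimal sketch of the update rule as stated on the slide, applied to a toy one-weight problem E = w^2; the function name, the defaults rho = 0.95 and epsilon = 1e-6, and the toy objective are assumptions for illustration, not H2O's implementation.

```python
import math

def adadelta_minimize(grad, w, rho=0.95, epsilon=1e-6, steps=1000):
    """Minimize using rate(w_i, t) = RMS[dw_i]_{t-1} / RMS[dE/dw_i]_t for each weight."""
    avg_sq_grad = [0.0] * len(w)   # moving average E[(dE/dw_i)^2]
    avg_sq_delta = [0.0] * len(w)  # moving average E[dw_i^2]
    w = list(w)
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            avg_sq_grad[i] = rho * avg_sq_grad[i] + (1 - rho) * g[i] ** 2
            # RMS of previous updates divided by RMS of gradients (both smoothed by epsilon)
            rate = math.sqrt(avg_sq_delta[i] + epsilon) / math.sqrt(avg_sq_grad[i] + epsilon)
            delta = -rate * g[i]
            avg_sq_delta[i] = rho * avg_sq_delta[i] + (1 - rho) * delta ** 2
            w[i] += delta
    return w

# Toy objective E = w^2 with gradient 2w, starting at w = 1 (illustrative)
one_step = adadelta_minimize(lambda w: [2.0 * w[0]], [1.0], steps=1)
w_final = adadelta_minimize(lambda w: [2.0 * w[0]], [1.0], steps=1000)
```

No global learning rate is set anywhere: the step size comes entirely from the two moving averages, which is the point of the method.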
Detail: Dropout Regularization
Training: For each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations
Diagram: example network (inputs age, income, employment; outputs married, single) with some activations crossed out (X)
Testing: Use all activations, but scale them down to compensate for the dropped fraction p, i.e. multiply by (1 - p), which halves them for p = 0.5 (to "simulate" the missing activations during training)
cf. Geoff Hinton's paper
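A small sketch of the two dropout modes may help here. It assumes each input activation is dropped independently with probability p during training, and that test-time activations are scaled by the retention probability 1 - p (which equals p for the common choice p = 0.5); the function names are illustrative.

```python
import random

def dropout_train(activations, p, rng):
    """Training: zero out each input activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p):
    """Testing: keep all activations, scaled by the retention probability (1 - p)."""
    return [a * (1.0 - p) for a in activations]

rng = random.Random(42)
p = 0.5
# Average of one activation (value 1.0) over many random training masks
n_trials = 20000
train_mean = sum(dropout_train([1.0], p, rng)[0] for _ in range(n_trials)) / n_trials
test_value = dropout_test([1.0], p)[0]
```

The averaged training-time activation and the scaled test-time activation come out close to each other, which is what the test-time scaling is meant to achieve.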
MNIST: digits classification
MNIST = Digitized handwritten digits database (Yann LeCun)
Data: 28x28=784 pixels with (gray-scale) values in 0…255
Train: 60,000 rows, 784 integer columns, 10 classes. Test: 10,000 rows, 784 integer columns, 10 classes.
Standing world record: Without distortions or convolutions, the best-ever published error rate on test set: 0.83% (Microsoft)
Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."
Let's see how H2O does on the MNIST dataset!
H2O Deep Learning on MNIST: 0.87% test set error (so far)
test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
Frequent errors: confuse 2/7 and 4/9
World-class results! No pre-training, no distortions, no convolutions, no unsupervised training
Running on 4 nodes with 16 cores each
Weather Dataset
Predict "RainTomorrow" from Temperature, Humidity, Wind, Pressure, etc.
Live Demo: Weather Prediction
Interactive ROC curve with real-time updates
3 hidden Rectifier layers, Dropout, L1-penalty, 5-fold cross validation
12.7% 5-fold cross-validation error is at least as good as GBM/RF/GLM models
Live Demo: Grid Search
How did I find those parameters? Grid Search! (works for multiple hyper parameters at once)
Then continue training the best model!
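The idea of scanning several hyper-parameters at once can be sketched generically. In this illustration, toy_validation_error is a hypothetical stand-in for "train a model and return its validation error"; the parameter names l1 and hidden_dropout and the grid values are assumptions, and no H2O call is involved.

```python
import itertools

def grid_search(train_and_score, grid):
    """Try every combination in the grid; return (error, params) pairs, best first."""
    names = sorted(grid)
    results = []
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        results.append((train_and_score(**params), params))
    results.sort(key=lambda r: r[0])  # lower validation error is better
    return results

def toy_validation_error(l1, hidden_dropout):
    # Hypothetical stand-in for training a model and scoring it on a validation set
    return (l1 - 1e-4) ** 2 * 1e6 + (hidden_dropout - 0.5) ** 2

grid = {"l1": [1e-5, 1e-4, 1e-3], "hidden_dropout": [0.0, 0.25, 0.5]}
ranked = grid_search(toy_validation_error, grid)
best_error, best_params = ranked[0]
```

In practice the top-ranked configuration would then be trained further, for example from a checkpoint, rather than retrained from scratch.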
Text Classification
Goal: Predict the item from seller's text description
"Vintage 18KT gold Rolex 2 Tone in great condition"
Data: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 (the 1 entries mark the words vintage, gold, condition)
Train: 578,361 rows, 8,647 cols, 467 classes. Test: 64,263 rows, 8,647 cols, 143 classes.
Let's see how H2O does on the ebay dataset!
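A binary word vector of this kind can be built as follows. This is a minimal sketch, assuming whitespace tokenization and lower-casing; the second description ("New gold ring") is a made-up example to give the vocabulary more than one document, and the helper names are illustrative.

```python
def build_vocabulary(descriptions):
    """Map every distinct lower-cased word to a column index."""
    words = sorted({w for d in descriptions for w in d.lower().split()})
    return {w: i for i, w in enumerate(words)}

def to_binary_vector(description, vocab):
    """1 if the word occurs in the description, 0 otherwise (counts are ignored)."""
    vec = [0] * len(vocab)
    for w in description.lower().split():
        if w in vocab:
            vec[vocab[w]] = 1
    return vec

docs = [
    "Vintage 18KT gold Rolex 2 Tone in great condition",  # from the slide
    "New gold ring",                                      # hypothetical second listing
]
vocab = build_vocabulary(docs)
vec = to_binary_vector(docs[0], vocab)
```

With thousands of vocabulary words, each row is almost entirely zeros, which is why such data compresses well in a columnar store.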
Text Classification
Out-Of-The-Box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!
Train: 578,361 rows, 8,647 cols, 467 classes. Test: 64,263 rows, 8,647 cols, 143 classes.
Note 1: H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)
Parallel Scalability (for 64 epochs on MNIST, with "0.87%" parameters)
[Two charts: Speedup vs. number of H2O Nodes (1 to 63), and Training Time in minutes vs. number of H2O Nodes (1 to 63), annotated "27 mins". 4 cores per node, 1 epoch per node per MapReduce.]
Deep Learning Auto-Encoders for Anomaly Detection
Toy example: Find anomaly in ECG heart beat data. First, train a model on what's "normal": 20 time-series samples of 210 data points each
Deep Auto-Encoder: Learn low-dimensional non-linear "structure" of the data that allows reconstructing the original data
Also for categorical data!
Deep Learning Auto-Encoders for Anomaly Detection
Model of what's "normal" + Test set with anomaly => Test set prediction is reconstruction, looks "normal" => Found anomaly! (large reconstruction error)
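The scoring step, flagging samples whose reconstruction error is large, can be sketched without training a network. Note the swap: the "model of normal" below is just the per-time-point mean of the training series, a deliberately simple stand-in and not an auto-encoder. The synthetic sine "beats", the 2x threshold rule, and all names are assumptions for illustration.

```python
import math

def reconstruction_error(series, reconstruction):
    """Mean squared error between a series and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(series, reconstruction)) / len(series)

def fit_normal_profile(train):
    """Stand-in for the model of "normal": the mean value at each time point."""
    n = len(train)
    return [sum(col) / n for col in zip(*train)]

def beat(phase, scale=1.0):
    """Synthetic series of 210 points (hypothetical substitute for ECG data)."""
    return [scale * math.sin(2 * math.pi * (t / 210.0) + phase) for t in range(210)]

# 20 "normal" training series of 210 points each, as in the toy example
train = [beat(0.0, 1.0 + 0.01 * k) for k in range(20)]
profile = fit_normal_profile(train)
# Assumed rule: flag anything with more than twice the worst training error
threshold = 2.0 * max(reconstruction_error(s, profile) for s in train)

normal_test = beat(0.0, 1.05)
anomalous_test = beat(math.pi / 2)  # shifted beat, unlike anything in training
flags = [reconstruction_error(s, profile) > threshold
         for s in (normal_test, anomalous_test)]
```

With a real auto-encoder the reconstruction would be the network's output for each input rather than a fixed profile, but the thresholding on reconstruction error works the same way.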
H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms
All parameters are available from R…
POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production!
Higgs Particle Discovery with H2O
How well did H2O Deep Learning do? Let's see how H2O did in the past 30 minutes!
Any guesses for AUC on low-level features? AUC=0.76 was the best for RF/GBM/NN (H2O)
<Your guess goes here>
reference paper results
H2O Steam: Scoring Platform
Live Demo: Higgs Dataset Demo on 10-node cluster. Let's score all our H2O models and compare them!
http://server:port/steam/index.html
Scoring Higgs Models in H2O Steam
Live Demo on 10-node cluster: <10 minutes runtime for all algos! Better than LHC baseline of AUC=0.73!
Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
Algorithm | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5
Deep Learning on low-level features alone beats everything else! H2O prelim. results compare well with paper's results (TMVA & Theano)
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
Tips for H2O Deep Learning
General:
- More layers for more complex functions (exp. more non-linearity)
- More neurons per layer to detect finer structure in data ("memorizing")
- Add some regularization for less overfitting (lower validation set error)
Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier, try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation set. Input dropout is recommended for noisy high-dimensional input
- Distributed: More training samples per iteration: faster, but less accuracy?
- With ADADELTA: Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10, rho = 0.9, 0.95, 0.99
- Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
- Try balance_classes = true for datasets with large class imbalance
- Enable force_load_balance for small datasets
- Enable replicate_training_data if each node can hold all the data
Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers PUB-644
- Anomaly Detection PUB-806
- Pre-Training: Stacked Auto-Encoders PUB-1014
- Faster Training: GPGPU support PUB-1013
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O! Add your own JIRA tickets!
Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data!
Join our Community and Meetups! https://github.com/h2oai http://docs.h2o.ai www.h2o.ai/community h2oai
Thank you!
Higgs Particle Discovery
Higgs vs. Background
Large Hadron Collider: Largest experiment of mankind! $13+ billion, 16.8 miles long, 120 MegaWatts, -456F, 1PB/day, etc. Higgs boson discovery (July '12) led to 2013 Nobel prize!
http://arxiv.org/pdf/1402.4735v2.pdf
Images courtesy CERN / LHC
Machine Learning Meets Physics
Or rather: Back to the roots (WWW was invented at CERN in '89…)
Higgs: Binary Classification Problem
Current methods of choice for physicists: - Boosted Decision Trees - Neural networks with 1 hidden layer. BUT: Must first add derived high-level features (physics formulae)
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
Algorithm | low-level H2O AUC | all features H2O AUC
Generalized Linear Model | 0.596 | 0.684
Random Forest | 0.764 | 0.840
Gradient Boosted Trees | 0.753 | 0.839
Neural Net 1 hidden layer | 0.760 | 0.830
Arrow from the low-level column to the all-features column: add derived features
Metric: AUC = Area under the ROC curve (range 0.5…1, higher is better)
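For readers unfamiliar with the metric, AUC can be computed directly from labels and scores. This is a minimal sketch using the pairwise (Mann-Whitney) formulation: AUC is the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. The function name and the example data are illustrative, and the quadratic loop is only suitable for small inputs.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts half
    return wins / (len(pos) * len(neg))

perfect = auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])   # every positive outranks every negative
partial = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])   # one of four pairs is mis-ordered
constant = auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])  # uninformative scores
```

A model that scores all rows identically lands at 0.5, which is why 0.5 is the lower end of the useful range quoted on the slide.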
Higgs: Can Deep Learning Do Better?
Algorithm | low-level H2O AUC | all features H2O AUC
Generalized Linear Model | 0.596 | 0.684
Random Forest | 0.764 | 0.840
Gradient Boosted Trees | 0.753 | 0.839
Neural Net 1 hidden layer | 0.760 | 0.830
Deep Learning | <Your guess goes here>
reference paper results: baseline 0.733
Let's build a H2O Deep Learning model and find out! (That was my last weekend)
What is Deep Learning?
Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.
Example: Input data (image) -> Prediction (who is it?)
Facebook's DeepFace (Yann LeCun) recognises faces as well as humans
What is NOT Deep
Linear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers: kernel + linear)
Classification trees are not deep (operate on original input space, no new features generated)
Deep Learning is Trending
[Chart: Google trends, 2009 to 2013]
Businesses are using Deep Learning techniques!
Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Chinese Search Giant Baidu Hires Man Behind the "Google Brain" (Andrew Ng)
Deep Learning History
slides by Yann LeCun (now Facebook)
Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs) smarter
Deep Learning in H2O
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy (weight initialization, adaptive learning rate, momentum, dropout regularization, L1/L2 regularization, grid search, checkpointing, auto-tuning, model averaging)
= Top-notch prediction engine!
Example Neural Network
"fully connected" directed graph of neurons
Input layer (3 neurons: age, income, employment) -> Hidden layer 1 (4 neurons) -> Hidden layer 2 (3 neurons) -> Output layer (2 neurons: married, single)
#connections: 3x4, 4x3, 3x2
Legend: input/output neuron, hidden neuron; arrows show information flow
Prediction: Forward Propagation
"neurons activate each other via weighted sums"
Inputs $x_i$ (age, income, employment), outputs $p_l$ (married, single)
$y_j = \tanh(\sum_i x_i u_{ij} + b_j)$
$z_k = \tanh(\sum_j y_j v_{jk} + c_k)$
$p_l = \mathrm{softmax}(\sum_k z_k w_{kl} + d_l)$, with $\mathrm{softmax}(x_k) = \exp(x_k) / \sum_k \exp(x_k)$
per-class probabilities: $\sum_l p_l = 1$
$b_j, c_k, d_l$: bias values (indep. of inputs)
activation function: tanh; alternative: x -> max(0,x) "rectifier"
$p_l$ is a non-linear function of $x_i$: can approximate ANY function with enough layers!
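The three formulas of this slide map onto a short forward pass. The sketch below assumes the 3-4-3-2 layout of the example network and uses arbitrary, made-up weight values; the helper names dense, softmax and forward are illustrative.

```python
import math

def dense(inputs, weights, biases):
    """Weighted sums: out_j = sum_i(inputs_i * weights[i][j]) + biases_j."""
    return [sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
            for j in range(len(biases))]

def softmax(xs):
    """exp(x_k) / sum_k(exp(x_k)), shifted by max(xs) for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, u, b, v, c, w, d):
    y = [math.tanh(a) for a in dense(x, u, b)]  # hidden layer 1
    z = [math.tanh(a) for a in dense(y, v, c)]  # hidden layer 2
    return softmax(dense(z, w, d))              # per-class probabilities

# Arbitrary example values (assumptions), shapes 3x4, 4x3, 3x2
x = [0.5, -1.0, 0.25]
u = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
b = [0.0] * 4
v = [[0.05 * (j - k) for k in range(3)] for j in range(4)]
c = [0.1] * 3
w = [[0.2 * (k + 1), -0.2 * (k + 1)] for k in range(3)]
d = [0.0, 0.0]
p = forward(x, u, b, v, c, w, d)
```

Swapping math.tanh for a rectifier, lambda a: max(0.0, a), changes only the two hidden-layer lines.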
Data preparation & Initialization
Neural Networks are sensitive to numerical noise, operate best in the linear regime (not saturated)
Automatic standardization of data $x_i$: mean = 0, stddev = 1
horizontalize categorical variables, e.g. {full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed
Automatic initialization of weights
Poor man's initialization: random weights $w_{kl}$
Default (better): Uniform distribution in +/- sqrt(6/(#units + #units_previous_layer))
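The three preparation steps can be sketched as small helpers. Assumptions in this illustration: the standard deviation is the population one (divide by n), the last category level is the all-zeros reference as in the self-employed example, and the helper names and sample values are made up.

```python
import math
import random

def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]

def one_hot(value, levels):
    """Encode N levels with N-1 indicator columns; the last level is all zeros."""
    return [1 if value == lvl else 0 for lvl in levels[:-1]]

def uniform_init(units_prev, units, rng):
    """Weights drawn uniformly from +/- sqrt(6 / (units + units_prev))."""
    limit = math.sqrt(6.0 / (units + units_prev))
    return [[rng.uniform(-limit, limit) for _ in range(units)]
            for _ in range(units_prev)]

ages = standardize([25.0, 35.0, 45.0, 55.0])
levels = ["full-time", "part-time", "none", "self-employed"]
part_time = one_hot("part-time", levels)
self_employed = one_hot("self-employed", levels)
weights = uniform_init(3, 4, random.Random(0))  # 3 inputs -> 4 hidden units
```

Keeping inputs near zero and weights inside this range keeps the initial weighted sums small, so tanh units start out in their roughly linear, unsaturated region.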
Training: Update Weights & Biases
For each training row, we make a prediction and compare with the actual label (supervised learning):
married: predicted 0.8, actual 1
single: predicted 0.2, actual 0
Objective: minimize prediction error (MSE or cross-entropy)
Mean Square Error = (0.2^2 + 0.2^2)/2 "penalize differences per-class"
Cross-entropy = -log(0.8) "strongly penalize non-1-ness"
Stochastic Gradient Descent: Update weights and biases via gradient of the error (via back-propagation): $w \leftarrow w - rate \cdot \partial E/\partial w$
[Plot: E as a function of w, with step size rate]
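The two error measures on this slide can be worked out for the married/single example, followed by one gradient-descent update. This is a small sketch; the dictionaries, the sgd_step helper, and the weights and gradient passed to it are illustrative values, not output from a trained model.

```python
import math

predicted = {"married": 0.8, "single": 0.2}
actual = {"married": 1.0, "single": 0.0}

# Mean Square Error: average of the squared per-class differences, (0.2^2 + 0.2^2) / 2
mse = sum((predicted[k] - actual[k]) ** 2 for k in predicted) / len(predicted)

# Cross-entropy: -log of the probability assigned to the actual class, -log(0.8)
cross_entropy = -sum(actual[k] * math.log(predicted[k]) for k in predicted)

def sgd_step(w, grad, rate):
    """One update w <- w - rate * dE/dw for every weight."""
    return [wi - rate * gi for wi, gi in zip(w, grad)]

# Made-up weights and gradient, only to show the update
new_w = sgd_step([1.0, -2.0], [0.5, -1.0], rate=0.1)
```

Here the MSE is 0.04 and the cross-entropy is about 0.223; cross-entropy grows without bound as the probability of the true class approaches zero, which is the "strongly penalize non-1-ness" behaviour.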
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
25
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
26
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
27
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
28
H2O Deep Learning ArnoCandel
Detail Dropout Regularization29
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
30
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
Letrsquos see how H2O does on the MNIST dataset
H2O Deep Learning ArnoCandel
Frequent errors confuse 27 and 49
H2O Deep Learning on MNIST 087 test set error (so far)
31
test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours
World-class results
No pre-training No distortions
No convolutions No unsupervised
training
Running on 4 nodes with 16 cores each
H2O Deep Learning A Candel
Weather Dataset32
Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc
H2O Deep Learning A Candel
Live Demo Weather Prediction
Interactive ROC curve with real-time updates
33
3 hidden Rectifier layers Dropout
L1-penalty
127 5-fold cross-validation error is at least as good as GBMRFGLM models
5-fold cross validation
H2O Deep Learning ArnoCandel
Live Demo Grid Search
How did I find those parameters Grid Search(works for multiple hyper parameters at once)
34
Then continue training the best model
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
35
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Binary word vector 0010000010001hellip0
vintagegold condition
Letrsquos see how H2O does on the ebay dataset
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
36
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
37
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
Deep Learning Auto-Encoders for Anomaly Detection
Toy example: Find anomaly in ECG heart beat data. First, train a model on what's "normal": 20 time-series samples of 210 data points each
Deep Auto-Encoder: Learn low-dimensional non-linear "structure" of the data that allows to reconstruct the orig. data
Also for categorical data
Deep Learning Auto-Encoders for Anomaly Detection
Test set with anomaly + Model of what's "normal" => Test set prediction is a reconstruction that looks "normal"
Found anomaly: large reconstruction error
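The scoring step (flag a sample when its reconstruction error is large) can be illustrated without training a network. In the sketch below the deep auto-encoder is replaced by a trivial stand-in, the per-position mean of the normal samples, so this shows reconstruction-error thresholding only, not auto-encoder training. The data is synthetic (not the ECG set), and the threshold rule (largest error seen on the normal data) is an assumption.

```python
import math

def mean_profile(samples):
    """Stand-in 'model of normal': the per-position mean of the normal samples."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def reconstruction_error(sample, reconstruction):
    """Mean squared error between a sample and its reconstruction."""
    return sum((s - r) ** 2 for s, r in zip(sample, reconstruction)) / len(sample)

# 20 synthetic "normal" series of 210 points each (made-up data).
normal = [[math.sin(2 * math.pi * t / 210) + 0.01 * k for t in range(210)]
          for k in range(-10, 10)]
model = mean_profile(normal)

# Assumed threshold: the largest reconstruction error seen on the normal data.
threshold = max(reconstruction_error(s, model) for s in normal)

# One test series with an injected spike, and one without.
anomalous = [math.sin(2 * math.pi * t / 210) + (2.0 if 100 <= t < 110 else 0.0)
             for t in range(210)]
normal_test = [math.sin(2 * math.pi * t / 210) + 0.005 for t in range(210)]

is_anomaly = reconstruction_error(anomalous, model) > threshold
```

With a real auto-encoder, `model` would be the network's reconstruction of each test sample rather than a fixed profile; the thresholding logic stays the same.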
H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms
All parameters are available from R…
POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
Higgs: Particle Discovery with H2O
How well did H2O Deep Learning do?
Let's see how H2O did in the past 30 minutes
<Your guess goes here>
reference paper results
Any guesses for AUC on low-level features? AUC=0.76 was the best for RF/GBM/NN (H2O)
H2O Steam: Scoring Platform
Live Demo
Higgs Dataset Demo on 10-node cluster: Let's score all our H2O models and compare them
http://server:port/steam/index.html
Scoring Higgs Models in H2O Steam
Live Demo on 10-node cluster: <10 minutes runtime for all algos. Better than LHC baseline of AUC=0.73
Higgs: Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
| Algorithm | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes |
| Generalized Linear Model | - | 0.596 | 0.684 | default, binomial |
| Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50 |
| Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15 |
| Neural Net, 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs |
| Deep Learning, 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs |
| Deep Learning, 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs |
| Deep Learning, 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5 |
Deep Learning on low-level features alone beats everything else. H2O prelim. results compare well with the paper's results (TMVA & Theano)
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
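The AUC values quoted for the Higgs models are areas under the ROC curve. As a rough illustration of what that number measures (this is not the code H2O or the paper used), AUC equals the fraction of positive/negative pairs in which the positive example gets the higher score. The function name `auc` and the four-row example are made up; the quadratic loop is fine for a demonstration but far too slow for 10M rows.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, which puts the gap between 0.73 and 0.88 in perspective.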
Tips for H2O Deep Learning
General:
- More layers for more complex functions (exp. more non-linearity)
- More neurons per layer to detect finer structure in data ("memorizing")
- Add some regularization for less overfitting (lower validation set error)
Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier, try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation set. Input dropout is recommended for noisy high-dimensional input
- Distributed: More training samples per iteration: faster, but less accuracy
- With ADADELTA: Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10, rho = 0.9, 0.95, 0.99
- Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
- Try balance_classes = true for datasets with large class imbalance
- Enable force_load_balance for small datasets
- Enable replicate_training_data if each node can hold all the data
Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers PUB-644
- Anomaly Detection PUB-806
- Pre-Training: Stacked Auto-Encoders PUB-1014
- Faster Training: GPGPU support PUB-1013
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O: Add your own JIRA tickets
Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups: https://github.com/h2oai http://docs.h2o.ai www.h2o.ai/community h2oai
Thank you
Higgs: Can Deep Learning Do Better?
Let's build a H2O Deep Learning model and find out (That was my last weekend)
| Algorithm | low-level H2O AUC | all features H2O AUC |
| Generalized Linear Model | 0.596 | 0.684 |
| Random Forest | 0.764 | 0.840 |
| Gradient Boosted Trees | 0.753 | 0.839 |
| Neural Net, 1 hidden layer | 0.760 | 0.830 |
Deep Learning: <Your guess goes here>
reference paper results: baseline 0.733
What is Deep Learning?
Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.
Example: Input data (image) -> Prediction (who is it?)
Facebook's DeepFace (Yann LeCun) recognises faces as well as humans
What is NOT Deep
Linear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers: kernel + linear)
Classification trees are not deep (operate on original input space, no new features generated)
Deep Learning is Trending
[Figure: Google Trends chart with time axis 2009, 2011, 2013]
Businesses are using Deep Learning techniques:
- Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
- FBI FACE: $1 billion face recognition project
- Chinese Search Giant Baidu Hires Man Behind the "Google Brain" (Andrew Ng)
Deep Learning History
slides by Yann LeCun (now Facebook)
Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs) smarter
Deep Learning in H2O
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy (weight initialization, adaptive learning rate, momentum, dropout regularization, L1/L2 regularization, grid search, checkpointing, auto-tuning, model averaging)
= Top-notch prediction engine
Example Neural Network
"fully connected" directed graph of neurons
Input layer (3 neurons): age, income, employment
Hidden layer 1 (4 neurons)
Hidden layer 2 (3 neurons)
Output layer (2 neurons): married, single
Connections: 3x4, 4x3, 3x2
[Figure legend: information flow; input/output neuron; hidden neuron]
Prediction: Forward Propagation
"neurons activate each other via weighted sums"
Inputs $x_i$: age, income, employment
Hidden layer 1: $y_j = \tanh\left(\sum_i x_i u_{ij} + b_j\right)$
Hidden layer 2: $z_k = \tanh\left(\sum_j y_j v_{jk} + c_k\right)$
Output (married, single): $p_l = \mathrm{softmax}\left(\sum_k z_k w_{kl} + d_l\right)$, per-class probabilities with $\sum_l p_l = 1$
$\mathrm{softmax}(x_k) = \exp(x_k) / \sum_k \exp(x_k)$
$b_j$, $c_k$, $d_l$: bias values (indep. of inputs)
Activation function: tanh; alternative: $x \to \max(0, x)$ ("rectifier")
$p_l$ is a non-linear function of $x_i$: can approximate ANY function with enough layers
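The forward pass can be written out directly from the three formulas on this slide. A minimal sketch for the 3-4-3-2 example network: the weight and bias values are made up, the helper names `dense`, `softmax` and `forward` are hypothetical, and the max-shift inside `softmax` is a standard numerical-stability trick that is not on the slide.

```python
import math

def dense(inputs, weights, biases):
    """Weighted sums: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
            for j in range(len(biases))]

def softmax(values):
    """exp(v_k) / sum_k exp(v_k), shifted by max(values) for numerical stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, u, b, v, c, w, d):
    y = [math.tanh(s) for s in dense(x, u, b)]   # hidden layer 1
    z = [math.tanh(s) for s in dense(y, v, c)]   # hidden layer 2
    return softmax(dense(z, w, d))               # per-class probabilities

# Made-up parameters: u is 3x4, v is 4x3, w is 3x2.
u = [[0.1, -0.2, 0.3, 0.0], [0.4, 0.1, -0.3, 0.2], [-0.1, 0.2, 0.1, -0.4]]
b = [0.0, 0.1, -0.1, 0.0]
v = [[0.2, -0.1, 0.3], [0.1, 0.4, -0.2], [-0.3, 0.2, 0.1], [0.0, -0.1, 0.2]]
c = [0.1, 0.0, -0.1]
w = [[0.5, -0.5], [-0.2, 0.3], [0.1, 0.1]]
d = [0.0, 0.0]

# Standardized (age, income, employment) in, (married, single) probabilities out.
p = forward([0.5, -1.0, 0.25], u, b, v, c, w, d)
```

Swapping `math.tanh` for `lambda s: max(0.0, s)` in the two hidden layers would give the rectifier variant mentioned on the slide.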
Data preparation & Initialization
Neural Networks are sensitive to numerical noise, operate best in the linear regime (not saturated)
Automatic standardization of data $x_i$ (age, income, employment): mean = 0, stddev = 1
Horizontalize categorical variables, e.g. {full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed
Automatic initialization of weights $w_{kl}$
Poor man's initialization: random weights $w_{kl}$
Default (better): Uniform distribution in +/- sqrt(6/(units + units_previous_layer))
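The three preparation steps can be sketched as follows. This is an illustration, not H2O's implementation: the function names are hypothetical, the population standard deviation and the zero output for a constant column are assumptions, and the categorical encoding follows the slide's example, where the last level maps to all zeros.

```python
import math
import random

def standardize(column):
    """Shift and scale a numeric column to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)  # population std-dev (assumption)
    if std == 0.0:
        return [0.0] * n  # constant column: nothing to scale (assumption)
    return [(v - mean) / std for v in column]

def horizontalize(value, levels):
    """Encode a categorical value as len(levels) - 1 indicators; the last level is all zeros."""
    return [1 if value == level else 0 for level in levels[:-1]]

def init_weights(units_previous_layer, units, rng):
    """Uniform weights in +/- sqrt(6 / (units + units_previous_layer))."""
    limit = math.sqrt(6.0 / (units + units_previous_layer))
    return [[rng.uniform(-limit, limit) for _ in range(units)]
            for _ in range(units_previous_layer)]

levels = ["full-time", "part-time", "none", "self-employed"]
ages = standardize([20.0, 30.0, 40.0, 50.0])
part_time = horizontalize("part-time", levels)
self_employed = horizontalize("self-employed", levels)
weights = init_weights(3, 4, random.Random(42))  # 3 inputs feeding 4 hidden neurons
```

Keeping the inputs near zero and the initial weights small is what keeps tanh neurons in their roughly linear, non-saturated range.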
Training: Update Weights & Biases
For each training row, we make a prediction and compare with the actual label (supervised learning):
married: predicted 0.8, actual 1
single: predicted 0.2, actual 0
Objective: minimize prediction error (MSE or cross-entropy)
Mean Square Error = $(0.2^2 + 0.2^2)/2$ "penalize differences per-class"
Cross-entropy = $-\log(0.8)$ "strongly penalize non-1-ness"
Stochastic Gradient Descent: Update weights and biases via gradient of the error (via back-propagation)
$w \leftarrow w - \mathrm{rate} \cdot \partial E/\partial w$
[Figure: error E as a function of weight w, with step size given by rate]
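The two error measures and the update rule from this slide, worked through for the married/single example. A small sketch: the function names are hypothetical, the natural logarithm is assumed for the cross-entropy, and the gradient values passed to `sgd_step` are made up.

```python
import math

def mean_square_error(predicted, actual):
    """Average squared difference over the classes."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)

def cross_entropy(predicted, actual):
    """-log of the probability assigned to the true class (actual is one-hot)."""
    return -sum(a * math.log(p) for p, a in zip(predicted, actual) if a > 0)

def sgd_step(weights, gradient, rate):
    """w <- w - rate * dE/dw, applied element-wise."""
    return [w - rate * g for w, g in zip(weights, gradient)]

predicted = [0.8, 0.2]   # married, single
actual = [1, 0]

mse = mean_square_error(predicted, actual)   # (0.2**2 + 0.2**2) / 2
ce = cross_entropy(predicted, actual)        # -log(0.8)
updated = sgd_step([1.0, -2.0], [0.5, -1.0], rate=0.1)
```

The MSE comes out at 0.04 and the cross-entropy at about 0.223, so the cross-entropy reacts more strongly to the same miss.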
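Backward Propagation
How to compute $\partial E/\partial w_i$ for $w_i \leftarrow w_i - \mathrm{rate} \cdot \partial E/\partial w_i$?
Naive: For every $i$, evaluate $E$ twice at $(w_1, \ldots, w_i \pm \Delta, \ldots, w_N)$… Slow
Backprop: Compute $\partial E/\partial w_i$ via chain rule going backwards
$\mathrm{net} = \sum_i w_i x_i + b$
$y = \mathrm{activation}(\mathrm{net})$
$E = \mathrm{error}(y)$
$\partial E/\partial w_i = \partial E/\partial y \cdot \partial y/\partial \mathrm{net} \cdot \partial \mathrm{net}/\partial w_i = \partial(\mathrm{error}(y))/\partial y \cdot \partial(\mathrm{activation}(\mathrm{net}))/\partial \mathrm{net} \cdot x_i$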
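Both ways of getting the gradient can be compared on a single neuron. The sketch below assumes a tanh activation and the squared error E = (y - target)^2 / 2; the weights, inputs and target are made up and the function names are hypothetical. The naive version evaluates E twice per weight, while the chain-rule version needs one forward pass.

```python
import math

def neuron_error(w, b, x, target):
    """E = error(y) with y = tanh(net) and net = sum_i w_i x_i + b."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = math.tanh(net)
    return 0.5 * (y - target) ** 2

def backprop_gradient(w, b, x, target):
    """dE/dw_i = dE/dy * dy/dnet * dnet/dw_i, evaluated via the chain rule."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = math.tanh(net)
    dE_dy = y - target          # derivative of 0.5 * (y - target)**2
    dy_dnet = 1.0 - y * y       # derivative of tanh
    return [dE_dy * dy_dnet * xi for xi in x]   # dnet/dw_i = x_i

def naive_gradient(w, b, x, target, delta=1e-6):
    """Central differences: evaluate E at w_i + delta and w_i - delta for every i."""
    gradient = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += delta
        down[i] -= delta
        gradient.append((neuron_error(up, b, x, target)
                         - neuron_error(down, b, x, target)) / (2.0 * delta))
    return gradient

w, b, x, target = [0.5, -0.3, 0.8], 0.1, [1.0, 2.0, -1.5], 0.7
analytic = backprop_gradient(w, b, x, target)
numeric = naive_gradient(w, b, x, target)
```

The two results agree to many digits; the difference is cost, since the naive approach needs 2N error evaluations for N weights.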
H2O Deep Learning Architecture
[Figure: H2O cluster; nodes/JVMs communicate synchronously, threads asynchronously; each node has an HTTPD and a K-V store (H2O atomic in-memory K-V store)]
initial model: weights and biases w
map: each node trains a copy of the weights and biases with (some or all of) its local data with asynchronous F/J threads
reduce: model averaging: average weights and biases from all nodes (partial sums $w_1 + w_3$ and $w_2 + w_4$ are combined into $w = (w_1 + w_2 + w_3 + w_4)/4$); speedup is at least nodes/log(rows), arXiv:1209.4129v3
updated model: w
Keep iterating over the data ("epochs"), score from time to time
Query & display the model via JSON, WWW
Number of points per MapReduce iteration: auto-tuned (default) or user-specified
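The map/reduce loop can be mimicked on one machine. The sketch below is a toy under stated assumptions: the "nodes" are just four lists of rows processed one after another, the model is a single-weight linear fit of made-up data generated from y = 3x, and `local_sgd`, `average_models` and `map_reduce_iteration` are hypothetical names. Only the structure (train local copies, then average) follows the slide.

```python
def local_sgd(weights, rows, rate):
    """Map step stand-in: one SGD pass over a node's local rows (squared error)."""
    w = list(weights)
    for x, y in rows:
        error = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - rate * error * xi for wi, xi in zip(w, x)]
    return w

def average_models(models):
    """Reduce step: element-wise average of the per-node weight vectors."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

def map_reduce_iteration(weights, partitions, rate):
    local_models = [local_sgd(weights, rows, rate) for rows in partitions]  # map
    return average_models(local_models)                                     # reduce

# Four "nodes", each holding two rows of made-up data from y = 3 * x.
partitions = [
    [((1.0,), 3.0), ((2.0,), 6.0)],
    [((0.5,), 1.5), ((1.5,), 4.5)],
    [((1.0,), 3.0), ((0.5,), 1.5)],
    [((2.0,), 6.0), ((1.5,), 4.5)],
]

weights = [0.0]                      # initial model
for _ in range(50):                  # keep iterating over the data
    weights = map_reduce_iteration(weights, partitions, rate=0.1)
```

Each pass hands the averaged weights back to every node as the starting point for the next iteration, which is the role the K-V store plays in the diagram.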
"Secret" Sauce to Higher Accuracy
Adaptive learning rate - ADADELTA (Google): Automatically set learning rate for each neuron based on its training history
Regularization: L1: penalizes non-zero weights; L2: penalizes large weights; Dropout: randomly ignore certain inputs
Grid Search and Checkpointing: Run a grid search to scan many hyper-parameters, then continue training the most promising model(s)
Detail: Adaptive Learning Rate
Compute moving average of $\Delta w_i^2$ at time $t$ for window length rho:
$E[\Delta w_i^2]_t = \rho \, E[\Delta w_i^2]_{t-1} + (1-\rho) \, \Delta w_i^2$
Compute RMS of $\Delta w_i$ at time $t$ with smoothing epsilon:
$\mathrm{RMS}[\Delta w_i]_t = \sqrt{E[\Delta w_i^2]_t + \epsilon}$
Do the same for $\partial E/\partial w_i$, then obtain per-weight learning rate:
$\mathrm{rate}(w_i, t) = \mathrm{RMS}[\Delta w_i]_{t-1} / \mathrm{RMS}[\partial E/\partial w_i]_t$
Adaptive annealing / progress: Gradient-dependent learning rate, moving window prevents "freezing" (unlike ADAGRAD: no window)
Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time
cf. ADADELTA paper
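The per-weight rate above translates into a short update routine. This is a sketch following the formulas on this slide, not H2O's code: the class name `Adadelta`, the defaults rho = 0.95 and epsilon = 1e-6, and the toy objective E = (w - 2)^2 / 2 are all assumptions made for illustration.

```python
import math

class Adadelta:
    """Per-weight learning rate from moving averages of squared gradients and updates."""

    def __init__(self, n_weights, rho=0.95, epsilon=1e-6):
        self.rho = rho
        self.epsilon = epsilon
        self.avg_sq_grad = [0.0] * n_weights    # E[(dE/dw_i)^2]
        self.avg_sq_delta = [0.0] * n_weights   # E[(delta w_i)^2]

    def step(self, weights, gradient):
        new_weights = []
        for i, (w, g) in enumerate(zip(weights, gradient)):
            # Moving average of the squared gradient at time t.
            self.avg_sq_grad[i] = self.rho * self.avg_sq_grad[i] + (1 - self.rho) * g * g
            # rate = RMS[delta w]_{t-1} / RMS[dE/dw]_t
            rate = (math.sqrt(self.avg_sq_delta[i] + self.epsilon)
                    / math.sqrt(self.avg_sq_grad[i] + self.epsilon))
            delta = -rate * g
            # Moving average of the squared update, used at the next step.
            self.avg_sq_delta[i] = self.rho * self.avg_sq_delta[i] + (1 - self.rho) * delta * delta
            new_weights.append(w + delta)
        return new_weights

# Toy objective E = 0.5 * (w - 2)**2, so dE/dw = w - 2.
optimizer = Adadelta(n_weights=1)
first = optimizer.step([0.0], [0.0 - 2.0])
current = first
for _ in range(99):
    current = optimizer.step(current, [current[0] - 2.0])
```

No global learning rate appears anywhere: only rho and epsilon need to be chosen, which is why the tips slide lists just those two for the ADADELTA case.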
Detail: Dropout Regularization
Training: For each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations
[Figure: example network (inputs age, income, employment; outputs married, single) with dropped activations marked X]
Testing: Use all activations, but reduce them by a factor p (to "simulate" the missing activations during training)
cf. Geoff Hinton's paper
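The two modes can be sketched as a pair of small functions. The assumptions are spelled out here because the slide is terse: each activation is dropped independently with probability p, and at test time the activations are scaled by the keep probability 1 - p, which coincides with the slide's "factor p" only for p = 0.5. The function names are hypothetical.

```python
import random

def dropout_train(activations, p, rng):
    """Training: zero out each input activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p):
    """Testing: keep all activations, scaled by the keep probability 1 - p (assumption)."""
    return [a * (1.0 - p) for a in activations]

rng = random.Random(0)
p = 0.5

# Average a unit activation over many training-time masks.
samples = [dropout_train([1.0], p, rng)[0] for _ in range(10_000)]
train_mean = sum(samples) / len(samples)

scaled = dropout_test([2.0, 4.0], p)
```

The training-time average of a unit activation comes out close to 1 - p, which is the value the test-time scaling reproduces deterministically.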
MNIST: digits classification
MNIST = Digitized handwritten digits database (Yann LeCun)
Data: 28x28=784 pixels with (gray-scale) values in 0…255
Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes
Standing world record: Without distortions or convolutions, the best-ever published error rate on test set: 0.83% (Microsoft)
Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."
Let's see how H2O does on the MNIST dataset
H2O Deep Learning on MNIST: 0.87% test set error (so far)
Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
Frequent errors: confuse 2/7 and 4/9
World-class results: no pre-training, no distortions, no convolutions, no unsupervised training
Running on 4 nodes with 16 cores each
H2O Deep Learning ArnoCandel
What is NOT DeepLinear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers kernel + linear)
Classification trees are not deep (operate on original input space no new features generated)
17
H2O Deep Learning ArnoCandel
Deep Learning is Trending
20132009
Google trends
2011
18
Businesses are usingDeep Learning techniques
Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)
H2O Deep Learning ArnoCandel
Deep Learning Historyslides by Yan LeCun (now Facebook)
19
Deep Learning wins competitions AND
makes humans businesses and machines (cyborgs) smarter
H2O Deep Learning ArnoCandel

Deep Learning in H2O
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy (weight initialization, adaptive learning rate, momentum, dropout regularization, L1/L2 regularization, grid search, checkpointing, auto-tuning, model averaging)
= Top-notch prediction engine

Example Neural Network
“Fully connected” directed graph of neurons
[Figure: network diagram with input/output neurons and hidden neurons; information flows from the input layer to the output layer]
- Input layer: 3 neurons (age, income, employment)
- Hidden layer 1: 4 neurons
- Hidden layer 2: 3 neurons
- Output layer: 2 neurons (married, single)
- Connections: 3x4, 4x3, 3x2

Prediction: Forward Propagation
“Neurons activate each other via weighted sums”
Inputs $x_i$: age, income, employment. Outputs $p_l$: married, single.
$y_j = \tanh\left(\sum_i x_i u_{ij} + b_j\right)$
$z_k = \tanh\left(\sum_j y_j v_{jk} + c_k\right)$
$p_l = \mathrm{softmax}\left(\sum_k z_k w_{kl} + d_l\right)$
$\mathrm{softmax}(x_k) = \exp(x_k) / \sum_k \exp(x_k)$
- Per-class probabilities: $\sum_l p_l = 1$
- $b_j$, $c_k$, $d_l$: bias values (independent of inputs)
- Activation function: tanh; alternative: $x \to \max(0, x)$, the “rectifier”
- $p_l$ is a non-linear function of $x_i$: can approximate ANY function with enough layers
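
The forward pass above can be sketched in a few lines of plain Python. This is only an illustration for the 3-4-3-2 example network, not H2O's implementation: the helper names (`dense`, `forward`, `random_matrix`), the random weight range and the input values are assumptions made for the example.

```python
import math
import random

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Weighted sums: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [
        sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]

def forward(x, u, b, v, c, w, d):
    y = [math.tanh(s) for s in dense(x, u, b)]  # hidden layer 1
    z = [math.tanh(s) for s in dense(y, v, c)]  # hidden layer 2
    return softmax(dense(z, w, d))              # per-class probabilities

def random_matrix(rows, cols, rng):
    # Illustrative weights only; the range is an arbitrary choice.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
u, b = random_matrix(3, 4, rng), [0.0] * 4  # 3 inputs -> 4 hidden neurons
v, c = random_matrix(4, 3, rng), [0.0] * 3  # 4 hidden -> 3 hidden neurons
w, d = random_matrix(3, 2, rng), [0.0] * 2  # 3 hidden -> 2 outputs

# Made-up standardized values for age, income, employment.
p = forward([0.1, -1.2, 0.7], u, b, v, c, w, d)
```

Whatever the weights are, `p` has one entry per class (married, single) and its entries sum to 1.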

Data preparation & Initialization
Neural Networks are sensitive to numerical noise and operate best in the linear regime (not saturated)
- Automatic standardization of data: $x_i$ (age, income, employment) get mean = 0, stddev = 1
- Horizontalize categorical variables, e.g. {full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed
- Automatic initialization of weights $w_{kl}$
- Poor man’s initialization: random weights
- Default (better): Uniform distribution in $\pm\sqrt{6/(\mathrm{units} + \mathrm{units\_previous\_layer})}$
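
A minimal sketch of these three preparation steps, assuming a population standard deviation and assuming that the last categorical level is the one mapped to all zeros (as in the self-employed example). The function names and sample values are hypothetical; H2O performs these steps automatically.

```python
import math
import random

def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]

def one_hot(value, levels):
    """Encode a categorical value; the last level maps to all zeros."""
    return [1 if value == level else 0 for level in levels[:-1]]

def init_weights(units_previous_layer, units, rng):
    """Uniform weights in +/- sqrt(6 / (units + units_previous_layer))."""
    limit = math.sqrt(6.0 / (units + units_previous_layer))
    return [[rng.uniform(-limit, limit) for _ in range(units)]
            for _ in range(units_previous_layer)]

levels = ["full-time", "part-time", "none", "self-employed"]
ages = standardize([23.0, 35.0, 47.0, 59.0])      # made-up ages
weights = init_weights(3, 4, random.Random(42))   # 3 inputs -> 4 hidden neurons
```

With these definitions, `one_hot("part-time", levels)` gives `[0, 1, 0]` and `one_hot("self-employed", levels)` gives `[0, 0, 0]`, matching the encoding on the slide.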

Training: Update Weights & Biases
For each training row, we make a prediction and compare with the actual label (supervised learning):
- married: predicted 0.8, actual 1
- single: predicted 0.2, actual 0
Objective: minimize prediction error (MSE or cross-entropy)
- Mean Square Error = $(0.2^2 + 0.2^2)/2$: “penalize differences per-class”
- Cross-entropy = $-\log(0.8)$: “strongly penalize non-1-ness”
Stochastic Gradient Descent: update weights and biases via gradient of the error (via back-propagation)
$w \leftarrow w - \mathrm{rate} \cdot \partial E / \partial w$
[Figure: error E as a function of weight w, with the step size given by rate]
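
The two error measures and the update rule can be checked numerically with the values from the example row. The weights, gradient and rate passed to `sgd_step` are made-up numbers for illustration.

```python
import math

predicted = {"married": 0.8, "single": 0.2}
actual = {"married": 1.0, "single": 0.0}

# Mean square error: average squared difference over the classes.
mse = sum((predicted[k] - actual[k]) ** 2 for k in actual) / len(actual)

# Cross-entropy: negative log of the probability given to the true class.
cross_entropy = -sum(actual[k] * math.log(predicted[k]) for k in actual)

def sgd_step(w, grad, rate):
    """One stochastic gradient descent update: w <- w - rate * dE/dw."""
    return [wi - rate * gi for wi, gi in zip(w, grad)]

new_w = sgd_step([0.5, -0.3], [0.2, -0.1], rate=0.1)
```

Here `mse` evaluates to about 0.04 and `cross_entropy` to about 0.223, so for the same prediction the cross-entropy penalty is the larger of the two.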
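
Backward Propagation
How to compute $\partial E/\partial w_i$ for $w_i \leftarrow w_i - \mathrm{rate} \cdot \partial E/\partial w_i$?
Naive: for every $i$, evaluate $E$ twice at $(w_1, \ldots, w_i \pm \Delta, \ldots, w_N)$… Slow
Backprop: compute $\partial E/\partial w_i$ via chain rule going backwards
$\mathrm{net} = \sum_i w_i x_i + b$
$y = \mathrm{activation}(\mathrm{net})$
$E = \mathrm{error}(y)$
$\partial E/\partial w_i = \partial E/\partial y \cdot \partial y/\partial \mathrm{net} \cdot \partial \mathrm{net}/\partial w_i = \partial(\mathrm{error}(y))/\partial y \cdot \partial(\mathrm{activation}(\mathrm{net}))/\partial \mathrm{net} \cdot x_i$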
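
To make the comparison concrete, the sketch below computes the gradient for a single tanh neuron both ways. The squared-error choice E = 0.5 (y - target)^2, the function names and the sample values are assumptions for the example.

```python
import math

def neuron_error(w, b, x, target):
    """E = 0.5 * (y - target)^2 with y = tanh(sum_i w_i x_i + b)."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = math.tanh(net)
    return 0.5 * (y - target) ** 2

def backprop_gradient(w, b, x, target):
    """dE/dw_i = dE/dy * dy/dnet * dnet/dw_i, from one forward/backward pass."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = math.tanh(net)
    dE_dy = y - target
    dy_dnet = 1.0 - y * y                      # derivative of tanh
    return [dE_dy * dy_dnet * xi for xi in x]  # dnet/dw_i = x_i

def naive_gradient(w, b, x, target, delta=1e-6):
    """Central differences: two evaluations of E for every weight."""
    grads = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + delta] + w[i + 1:]
        down = w[:i] + [w[i] - delta] + w[i + 1:]
        grads.append((neuron_error(up, b, x, target)
                      - neuron_error(down, b, x, target)) / (2 * delta))
    return grads

w, b, x, target = [0.4, -0.2, 0.1], 0.05, [1.0, 2.0, -1.5], 1.0
analytic = backprop_gradient(w, b, x, target)
numeric = naive_gradient(w, b, x, target)
```

Both give the same gradient up to rounding, but the naive version needs 2N evaluations of E for N weights, while the chain rule reuses a single pass.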

H2O Deep Learning Architecture
[Figure: H2O nodes/JVMs (sync communication), each with an HTTPD and a K-V store and running async threads; weights w are exchanged through the H2O atomic in-memory K-V store]
- Initial model: weights and biases w
- Map: each node trains a copy of the weights and biases with (some or all of) its local data with asynchronous F/J threads
- Reduce: model averaging, i.e. average weights and biases from all nodes: $w = (w_1 + w_2 + w_3 + w_4)/4$, combined pairwise as $w_1 + w_3$ and $w_2 + w_4$
- Updated model: w
- Number of points per MapReduce iteration: auto-tuned (default) or user-specified
- Keep iterating over the data (“epochs”), score from time to time
- Query & display the model via JSON, WWW
- Speedup is at least nodes/log(rows), arXiv:1209.4129v3
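
The map/reduce model-averaging loop can be sketched as follows. This is a sequential, single-process illustration, not H2O code: `train_on_node` is a hypothetical stand-in that runs plain SGD on a linear least-squares model (inside a real node, asynchronous Fork/Join threads would do this work), and the four data shards are made up.

```python
def train_on_node(weights, local_rows, rate):
    """Hypothetical local training: SGD on a linear least-squares model."""
    w = list(weights)
    for x, target in local_rows:
        prediction = sum(wi * xi for wi, xi in zip(w, x))
        error = prediction - target
        w = [wi - rate * error * xi for wi, xi in zip(w, x)]
    return w

def average_models(models):
    """Reduce step: element-wise average of the per-node weight vectors."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

def map_reduce_iteration(weights, shards, rate=0.1):
    local = [train_on_node(weights, shard, rate) for shard in shards]  # map
    return average_models(local)                                       # reduce

# Four "nodes", each holding rows generated by target = 2*x0 - 1*x1.
shards = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)],
    [([1.0, 1.0], 1.0)],
    [([2.0, 0.0], 4.0)],
    [([0.0, 2.0], -2.0)],
]
w = [0.0, 0.0]
for _ in range(200):
    w = map_reduce_iteration(w, shards)
```

After enough iterations the averaged model approaches the generating weights (2, -1), even though no single node sees all the rows.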

“Secret” Sauce to Higher Accuracy
- Adaptive learning rate - ADADELTA (Google): automatically set learning rate for each neuron based on its training history
- Grid Search and Checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s)
- Regularization: L1 penalizes non-zero weights, L2 penalizes large weights, Dropout randomly ignores certain inputs

Detail: Adaptive Learning Rate
Compute moving average of $\Delta w_i^2$ at time $t$ for window length rho:
$E[\Delta w_i^2]_t = \rho \, E[\Delta w_i^2]_{t-1} + (1-\rho) \, \Delta w_i^2$
Compute RMS of $\Delta w_i$ at time $t$ with smoothing epsilon:
$\mathrm{RMS}[\Delta w_i]_t = \sqrt{E[\Delta w_i^2]_t + \epsilon}$
Do the same for $\partial E/\partial w_i$, then obtain per-weight learning rate:
$\mathrm{rate}(w_i, t) = \mathrm{RMS}[\Delta w_i]_{t-1} \, / \, \mathrm{RMS}[\partial E/\partial w_i]_t$
- Adaptive annealing / progress: gradient-dependent learning rate, moving window prevents “freezing” (unlike ADAGRAD: no window)
- Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time
cf. ADADELTA paper
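
A compact sketch of the per-weight rate defined above, following the published ADADELTA update. The class name, the toy quadratic objective and the rho/epsilon values are assumptions chosen for the example, not H2O defaults.

```python
import math

class AdaDelta:
    """Per-weight adaptive learning rate in the style of ADADELTA."""

    def __init__(self, n_weights, rho=0.99, epsilon=1e-8):
        self.rho = rho
        self.epsilon = epsilon
        self.avg_sq_grad = [0.0] * n_weights   # E[(dE/dw_i)^2]
        self.avg_sq_delta = [0.0] * n_weights  # E[(delta w_i)^2]

    def step(self, weights, grads):
        rho, eps = self.rho, self.epsilon
        updated = []
        for i, (w, g) in enumerate(zip(weights, grads)):
            # Moving average of squared gradients at time t.
            self.avg_sq_grad[i] = rho * self.avg_sq_grad[i] + (1 - rho) * g * g
            # rate = RMS[delta w]_{t-1} / RMS[dE/dw]_t
            rate = (math.sqrt(self.avg_sq_delta[i] + eps)
                    / math.sqrt(self.avg_sq_grad[i] + eps))
            delta = -rate * g
            # Moving average of squared updates, used at the next step.
            self.avg_sq_delta[i] = (rho * self.avg_sq_delta[i]
                                    + (1 - rho) * delta * delta)
            updated.append(w + delta)
        return updated

# Toy objective E(w) = 0.5 * (w0^2 + w1^2), whose gradient is simply w.
opt = AdaDelta(n_weights=2, rho=0.95, epsilon=1e-6)
start = [1.0, -2.0]
w = list(start)
start_error = 0.5 * sum(v * v for v in w)
for _ in range(100):
    w = opt.step(w, grads=w)
end_error = 0.5 * sum(v * v for v in w)
```

No global learning rate is set anywhere: each weight's step size comes only from its own update and gradient history, with rho controlling the window length and epsilon getting the first step started.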

Detail: Dropout Regularization
Training: for each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations
[Figure: example network (age, income, employment -> married, single) with dropped activations marked X]
Testing: use all activations, but reduce them by a factor p (to “simulate” the missing activations during training)
cf. Geoff Hinton's paper
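
A minimal sketch of the two dropout modes. Two assumptions are made here: each activation is dropped independently with probability p (rather than dropping an exact fraction), and at test time activations are multiplied by the keep probability 1 - p, which coincides with a factor of p for the common choice p = 0.5.

```python
import random

def dropout_train(activations, p, rng):
    """Training: zero out each input activation with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p):
    """Testing: keep every activation, scaled by the keep probability 1 - p."""
    return [a * (1.0 - p) for a in activations]

rng = random.Random(7)
activations = [1.0] * 10000
p = 0.5
train_mean = sum(dropout_train(activations, p, rng)) / len(activations)
test_mean = sum(dropout_test(activations, p)) / len(activations)
```

The scaling makes the expected input to a neuron the same in both modes, so `train_mean` and `test_mean` end up close to each other.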

MNIST: digits classification
MNIST = Digitized handwritten digits database (Yann LeCun)
Data: 28x28=784 pixels with (gray-scale) values in 0…255
Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes
Standing world record: without distortions or convolutions, the best-ever published error rate on test set: 0.83% (Microsoft)
Yann LeCun: “Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet.”
Let’s see how H2O does on the MNIST dataset

H2O Deep Learning on MNIST: 0.87% test set error (so far)
Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
Frequent errors: confuse 2/7 and 4/9
World-class results:
- No pre-training
- No distortions
- No convolutions
- No unsupervised training
Running on 4 nodes with 16 cores each

Weather Dataset
Predict “RainTomorrow” from Temperature, Humidity, Wind, Pressure, etc.

Live Demo: Weather Prediction
Interactive ROC curve with real-time updates
3 hidden Rectifier layers, Dropout, L1-penalty
5-fold cross validation: 12.7% 5-fold cross-validation error is at least as good as GBM/RF/GLM models

Live Demo: Grid Search
How did I find those parameters? Grid Search (works for multiple hyper parameters at once)
Then continue training the best model
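
The idea of scanning several hyper-parameters at once can be sketched generically. Nothing below is H2O's grid-search API: `grid_search` is a hypothetical helper, and `toy_validation_error` is a made-up stand-in for training a model and scoring it on validation data.

```python
import itertools

def grid_search(param_grid, score):
    """Try every combination in param_grid; return the lowest-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        s = score(params)
        if s < best_score:
            best_params, best_score = params, s
    return best_params, best_score

def toy_validation_error(params):
    """Hypothetical stand-in for 'train a model, return validation error'."""
    return abs(params["l1"] - 1e-4) * 1000 + abs(params["hidden_dropout"] - 0.5)

grid = {"l1": [1e-5, 1e-4, 1e-3], "hidden_dropout": [0.0, 0.25, 0.5]}
best, err = grid_search(grid, toy_validation_error)
```

In practice each combination would be trained only briefly, and the most promising model would then be trained further from its checkpoint.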

Text Classification
Goal: Predict the item from seller’s text description
“Vintage 18KT gold Rolex 2 Tone in great condition”
Data: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 (the 1s mark words such as “vintage”, “gold”, “condition”)
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
Let’s see how H2O does on the eBay dataset
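
A binary word vector of this kind can be built as sketched below. The six-word vocabulary and the simple tokenizer are assumptions for illustration; the dataset in the talk has 8,647 columns and its actual preprocessing is not described.

```python
import re

def binary_word_vector(text, vocabulary):
    """Return a 0/1 vector with one slot per vocabulary word."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical mini-vocabulary, in a fixed column order.
vocabulary = ["antique", "condition", "gold", "silver", "vintage", "watch"]
vector = binary_word_vector(
    "Vintage 18KT gold Rolex 2 Tone in great condition", vocabulary)
```

For this description the vector is `[0, 1, 1, 0, 1, 0]`: only the columns for "condition", "gold" and "vintage" are set.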

Text Classification: Results
Out-Of-The-Box: 11.6% test set error after 10 epochs. Predicts the correct class (out of 143) 88.4% of the time.
Note 1: H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)

Parallel Scalability (for 64 epochs on MNIST, with “0.87%” parameters)
[Figure: Speedup vs. number of H2O nodes (1, 2, 4, 8, 16, 32, 63); 4 cores per node, 1 epoch per node per MapReduce]
[Figure: Training time in minutes vs. number of H2O nodes (1, 2, 4, 8, 16, 32, 63); annotated value: 2.7 mins]

Deep Learning Auto-Encoders for Anomaly Detection
Toy example: find anomaly in ECG heart beat data. First, train a model on what’s “normal”: 20 time-series samples of 210 data points each
Deep Auto-Encoder: learn low-dimensional non-linear “structure” of the data that allows to reconstruct the orig. data
Also for categorical data
[Figure: test set with anomaly + model of what’s “normal” => test set prediction is the reconstruction, which looks “normal”]
Found anomaly: large reconstruction error
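
The scoring step, flagging samples with a large reconstruction error, can be sketched without a neural network. Note the substitution: instead of a deep auto-encoder, the "model of normal" below is just the point-wise mean of the training series, a deliberately simple stand-in. The synthetic sine-wave data and the threshold are also assumptions.

```python
import math

def reconstruction_error(sample, reconstruction):
    """Mean squared error between a sample and its reconstruction."""
    return sum((s - r) ** 2 for s, r in zip(sample, reconstruction)) / len(sample)

def fit_mean_profile(normal_samples):
    """Stand-in 'model of normal': point-wise mean of the training series."""
    n = len(normal_samples)
    return [sum(col) / n for col in zip(*normal_samples)]

# Synthetic data shaped like the toy example: 20 normal series of 210 points.
normal = [[math.sin(2 * math.pi * t / 70) + 0.01 * k for t in range(210)]
          for k in range(20)]
profile = fit_mean_profile(normal)

test_normal = [math.sin(2 * math.pi * t / 70) + 0.1 for t in range(210)]
test_anomaly = [math.sin(2 * math.pi * t / 35) for t in range(210)]  # wrong rhythm

errors = {
    "normal": reconstruction_error(test_normal, profile),
    "anomaly": reconstruction_error(test_anomaly, profile),
}
threshold = 0.05  # hypothetical cut-off; would be tuned on held-out normal data
flagged = [name for name, err in errors.items() if err > threshold]
```

Only the series with the unusual rhythm exceeds the threshold. A trained auto-encoder would take the place of `fit_mean_profile`, while the error-based flagging stays the same.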

H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms
All parameters are available from R…

POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production

Higgs Particle Discovery with H2O
How well did H2O Deep Learning do? Let’s see how H2O did in the past 30 minutes
Any guesses for AUC on low-level features? AUC=0.76 was the best for RF/GBM/NN (H2O)
<Your guess goes here>
(reference paper results)

H2O Steam: Scoring Platform
Live Demo: Higgs Dataset Demo on 10-node cluster. Let’s score all our H2O models and compare them
http://server:port/steam/index.html

Scoring Higgs Models in H2O Steam
Live Demo on 10-node cluster: <10 minutes runtime for all algos. Better than LHC baseline of AUC=0.73

Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf

Algorithm                     | Paper’s l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model      | -               | 0.596             | 0.684                | default, binomial
Random Forest                 | -               | 0.764             | 0.840                | 50 trees, max depth 50
Gradient Boosted Trees        | 0.73            | 0.753             | 0.839                | 50 trees, max depth 15
Neural Net 1 layer            | 0.733           | 0.760             | 0.830                | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836           | 0.850             | -                    | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868           | 0.869             | -                    | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880           | running           | -                    | 6x500 Rectifier, L1=L2=1e-5

Deep Learning on low-level features alone beats everything else. H2O prelim. results compare well with paper’s results (TMVA & Theano)
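
For readers unfamiliar with the metric in the table, AUC can be read as the probability that a randomly chosen signal event is scored above a randomly chosen background event. The sketch below computes it by brute force over all pairs, which is fine for a handful of rows but not for 10M. The labels and scores are made up.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 0, 1, 0, 1]            # 1 = signal, 0 = background (made-up)
scores = [0.9, 0.4, 0.6, 0.7, 0.8]  # model outputs (made-up)
value = auc(labels, scores)
```

Here 5 of the 6 signal/background pairs are ordered correctly, so `value` is 5/6. An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one.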

Tips for H2O Deep Learning
General:
- More layers for more complex functions (exp. more non-linearity)
- More neurons per layer to detect finer structure in data (“memorizing”)
- Add some regularization for less overfitting (lower validation set error)
Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier, try max_w2 = 10…50, L1 = 1e-5..1e-3 and/or L2 = 1e-5…1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation set. Input dropout is recommended for noisy high-dimensional input
Distributed:
- More training samples per iteration: faster, but less accuracy
With ADADELTA:
- Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99
Without ADADELTA:
- Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
Try balance_classes = true for datasets with large class imbalance
Enable force_load_balance for small datasets
Enable replicate_training_data if each node can hold all the data

Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers PUB-644
- Anomaly Detection PUB-806
- Pre-Training: Stacked Auto-Encoders PUB-1014
- Faster Training: GPGPU support PUB-1013
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O: add your own JIRA tickets

Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups: https://github.com/h2oai http://docs.h2o.ai www.h2o.ai/community h2oai
Thank you
Deep Learning Historyslides by Yan LeCun (now Facebook)
19
Deep Learning wins competitions AND
makes humans businesses and machines (cyborgs) smarter
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)
= Top-notch prediction engine
Deep Learning in H2O20
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network21
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
22
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
23
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
24
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
25
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
26
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
27
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
28
H2O Deep Learning ArnoCandel
Detail Dropout Regularization29
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
30
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
Letrsquos see how H2O does on the MNIST dataset
H2O Deep Learning ArnoCandel
Frequent errors confuse 27 and 49
H2O Deep Learning on MNIST 087 test set error (so far)
31
test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours
World-class results
No pre-training No distortions
No convolutions No unsupervised
training
Running on 4 nodes with 16 cores each
H2O Deep Learning A Candel
Weather Dataset32
Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc
H2O Deep Learning A Candel
Live Demo Weather Prediction
Interactive ROC curve with real-time updates
33
3 hidden Rectifier layers Dropout
L1-penalty
127 5-fold cross-validation error is at least as good as GBMRFGLM models
5-fold cross validation
H2O Deep Learning ArnoCandel
Live Demo Grid Search
How did I find those parameters Grid Search(works for multiple hyper parameters at once)
34
Then continue training the best model
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
35
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Binary word vector 0010000010001hellip0
vintagegold condition
Letrsquos see how H2O does on the ebay dataset
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
36
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
37
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Deep Learning Auto-Encoders for Anomaly Detection
38
Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each
Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data
Also for categorical data
Deep Learning Auto-Encoders for Anomaly Detection
[Figure: model of what's "normal" + test set with anomaly => reconstruction]
The test set prediction is the reconstruction; it looks "normal"
Found anomaly: large reconstruction error
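
The "large reconstruction error means anomaly" step can be sketched without a neural network. Note the swap: instead of a deep auto-encoder, this sketch uses the per-point mean of the normal series as a deliberately trivial stand-in "model of normal". The function names, the tiny data and the threshold rule (twice the largest error seen on normal data) are all assumptions for illustration.

```python
def fit_mean_profile(normal_series):
    """Stand-in model of 'normal': the per-point mean of the training series."""
    count = len(normal_series)
    return [sum(column) / count for column in zip(*normal_series)]

def reconstruction_error(series, profile):
    """Mean squared difference between a series and its 'reconstruction'."""
    return sum((a - b) ** 2 for a, b in zip(series, profile)) / len(series)

def find_anomalies(test_series, profile, threshold):
    """Indices of test series whose reconstruction error exceeds the threshold."""
    return [
        index
        for index, series in enumerate(test_series)
        if reconstruction_error(series, profile) > threshold
    ]

normal = [[0.0, 1.0, 0.0, -1.0], [0.1, 0.9, 0.0, -1.1], [-0.1, 1.1, 0.0, -0.9]]
profile = fit_mean_profile(normal)
# Assumed rule of thumb: flag anything worse than twice the worst normal error.
threshold = 2 * max(reconstruction_error(series, profile) for series in normal)
test = [[0.0, 1.0, 0.1, -1.0], [1.0, -1.0, 1.0, 1.0]]
anomalies = find_anomalies(test, profile, threshold)
print(anomalies)
```

With a real auto-encoder, `profile` would be replaced by the network's reconstruction of each test series; the thresholding step stays the same.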
H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms
All parameters are available from R…
POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
How well did H2O Deep Learning do?
Higgs Particle Discovery with H2O
Let's see how H2O did in the past 30 minutes
Any guesses for AUC on low-level features? AUC=0.76 was the best for RF/GBM/NN (H2O)
<Your guess goes here>
[Figure: reference paper results]
H2O Steam: Scoring Platform
Live Demo
Higgs Dataset Demo on 10-node cluster: Let's score all our H2O models and compare them
http://server:port/steam/index.html
Scoring Higgs Models in H2O Steam
Live Demo on 10-node cluster: <10 minutes runtime for all algos. Better than LHC baseline of AUC=0.73
Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
Algorithm | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5
Deep Learning on low-level features alone beats everything else. H2O prelim. results compare well with the paper's results (TMVA & Theano)
Tips for H2O Deep Learning
General:
- More layers for more complex functions (exp. more non-linearity)
- More neurons per layer to detect finer structure in data ("memorizing")
- Add some regularization for less overfitting (lower validation set error)
Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier, try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation set. Input dropout is recommended for noisy high-dimensional input
- Distributed: More training samples per iteration: faster, but less accuracy
- With ADADELTA: Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99
- Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
- Try balance_classes = true for datasets with large class imbalance
- Enable force_load_balance for small datasets
- Enable replicate_training_data if each node can hold all the data
Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O: Add your own JIRA tickets
Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data.
Join our Community and Meetups: https://github.com/h2oai, http://docs.h2o.ai, www.h2o.ai/community, @h2oai
Thank you
Deep Learning in H2O
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy (weight initialization, adaptive learning rate, momentum, dropout regularization, L1/L2 regularization, grid search, checkpointing, auto-tuning, model averaging)
= Top-notch prediction engine
Example Neural Network
"Fully connected" directed graph of neurons
[Figure: information flows from the input layer (age, income, employment) through hidden layer 1 and hidden layer 2 to the output layer (married, single)]
Neurons per layer: 3, 4, 3, 2
Connections between layers: 3x4, 4x3, 3x2
Prediction: Forward Propagation
"Neurons activate each other via weighted sums"
Inputs $x_i$: age, income, employment. Outputs $p_l$: married, single
Hidden layer 1: $y_j = \tanh\left(\sum_i x_i u_{ij} + b_j\right)$
Hidden layer 2: $z_k = \tanh\left(\sum_j y_j v_{jk} + c_k\right)$
Output layer: $p_l = \mathrm{softmax}\left(\sum_k z_k w_{kl} + d_l\right)$, with per-class probabilities $\sum_l p_l = 1$
$\mathrm{softmax}(x_k) = \exp(x_k) / \sum_{k'} \exp(x_{k'})$
$b_j$, $c_k$, $d_l$: bias values (independent of inputs)
Activation function: tanh; alternative: $x \to \max(0, x)$ ("rectifier")
$p_l$ is a non-linear function of $x_i$: can approximate ANY function with enough layers
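
The forward pass on this slide can be written out as a short sketch for the 3-4-3-2 example network. The weights are arbitrary random numbers and the input values are made up; only the structure (two tanh layers followed by a softmax) follows the slide. Subtracting the maximum inside `softmax` is a common numerical-stability step added here, not something the slide states.

```python
import math
import random

def affine(inputs, weights, biases):
    """Weighted sums plus bias; weights[i][j] connects input i to unit j."""
    return [
        sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]

def softmax(values):
    """exp(x_k) divided by the sum of all exp(x_k'), shifted by the max for stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, u, b, v, c, w, d):
    y = [math.tanh(a) for a in affine(x, u, b)]  # hidden layer 1
    z = [math.tanh(a) for a in affine(y, v, c)]  # hidden layer 2
    return softmax(affine(z, w, d))              # per-class probabilities

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)  # arbitrary illustrative weights, not trained values
u, b = random_matrix(3, 4, rng), [0.0] * 4
v, c = random_matrix(4, 3, rng), [0.0] * 3
w, d = random_matrix(3, 2, rng), [0.0] * 2
probabilities = forward([0.5, -1.2, 0.3], u, b, v, c, w, d)
print(probabilities, sum(probabilities))
```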
Data preparation & Initialization
Neural Networks are sensitive to numerical noise and operate best in the linear regime (not saturated)
Automatic standardization of data: $x_i$ with mean = 0, stddev = 1
Horizontalize categorical variables, e.g. {full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed
Automatic initialization of weights $w_{kl}$
Poor man's initialization: random weights
Default (better): Uniform distribution in $\pm\sqrt{6/(\text{units} + \text{units\_previous\_layer})}$
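
The three preparation steps above can be sketched in plain Python. This is a minimal illustration, not H2O's implementation: the function names are hypothetical, the population standard deviation is an assumed choice (the slide does not say which estimator is used), and the example ages are made up. The encoding drops the last level so that it maps to all zeros, matching the slide's {0,0,0} = self-employed example.

```python
import math
import random

def standardize(column):
    """Shift and scale a numeric column to mean 0 and standard deviation 1.

    Uses the population standard deviation (an assumption) and expects a
    non-constant column, otherwise the division by zero fails.
    """
    count = len(column)
    mean = sum(column) / count
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / count)
    return [(v - mean) / std for v in column]

def one_hot_drop_last(value, levels):
    """Indicator columns for all levels except the last, which becomes all zeros."""
    return [1 if value == level else 0 for level in levels[:-1]]

def init_weights(units_previous_layer, units, rng):
    """Uniform weights in +/- sqrt(6 / (units + units_previous_layer))."""
    limit = math.sqrt(6.0 / (units + units_previous_layer))
    return [
        [rng.uniform(-limit, limit) for _ in range(units)]
        for _ in range(units_previous_layer)
    ]

levels = ["full-time", "part-time", "none", "self-employed"]
scaled = standardize([23.0, 35.0, 47.0, 59.0])
weights = init_weights(units_previous_layer=3, units=4, rng=random.Random(0))
print(one_hot_drop_last("part-time", levels))      # [0, 1, 0]
print(one_hot_drop_last("self-employed", levels))  # [0, 0, 0]
```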
Training: Update Weights & Biases
For each training row, we make a prediction and compare with the actual label (supervised learning)
Example: married: predicted 0.8, actual 1; single: predicted 0.2, actual 0
Objective: minimize prediction error (MSE or cross-entropy)
Mean Square Error = $(0.2^2 + 0.2^2)/2$ ("penalize differences per-class"). Cross-entropy = $-\log(0.8)$ ("strongly penalize non-1-ness")
Stochastic Gradient Descent: Update weights and biases via gradient of the error (via back-propagation)
$w \leftarrow w - \text{rate} \cdot \partial E/\partial w$
[Figure: error E plotted against weight w; the step size is set by rate]
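
The two error measures can be checked numerically for the slide's example (predicted 0.8 for "married" and 0.2 for "single", with "married" as the actual class). The dictionary layout is just one convenient way to write it down.

```python
import math

predicted = {"married": 0.8, "single": 0.2}
actual = {"married": 1.0, "single": 0.0}

# Mean square error: average of the squared per-class differences.
mse = sum((predicted[k] - actual[k]) ** 2 for k in predicted) / len(predicted)

# Cross-entropy: only the actual class contributes, giving -log(0.8).
cross_entropy = -sum(actual[k] * math.log(predicted[k]) for k in predicted)

print(round(mse, 4), round(cross_entropy, 4))  # 0.04 0.2231
```

Cross-entropy grows without bound as the predicted probability of the actual class approaches 0, which is the sense in which it "strongly penalizes non-1-ness".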
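Backward Propagation
How to compute $\partial E/\partial w_i$ for $w_i \leftarrow w_i - \text{rate} \cdot \partial E/\partial w_i$?
Naive: For every i, evaluate E twice at $(w_1, \ldots, w_i \pm \Delta, \ldots, w_N)$… Slow
Backprop: Compute $\partial E/\partial w_i$ via chain rule going backwards
$\text{net} = \sum_i w_i x_i + b$
$y = \text{activation}(\text{net})$
$E = \text{error}(y)$
$\partial E/\partial w_i = \partial E/\partial y \cdot \partial y/\partial \text{net} \cdot \partial \text{net}/\partial w_i = \partial(\text{error}(y))/\partial y \cdot \partial(\text{activation}(\text{net}))/\partial \text{net} \cdot x_i$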
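
For a single neuron, the chain-rule gradient and the naive "evaluate E twice per weight" approach can be compared directly. The sketch assumes a tanh activation and the squared error $E = (y - \text{target})^2 / 2$; both choices, along with the sample numbers, are illustrative assumptions.

```python
import math

def neuron_error(w, b, x, target):
    """E = 0.5 * (y - target)^2 with y = tanh(net), net = sum_i(w_i * x_i) + b."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 0.5 * (math.tanh(net) - target) ** 2

def backprop_gradient(w, b, x, target):
    """Chain rule: dE/dw_i = dE/dy * dy/dnet * dnet/dw_i."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = math.tanh(net)
    dE_dy = y - target      # derivative of the squared error
    dy_dnet = 1.0 - y * y   # derivative of tanh
    return [dE_dy * dy_dnet * xi for xi in x]  # dnet/dw_i = x_i

def finite_difference_gradient(w, b, x, target, delta=1e-6):
    """Naive approach: two evaluations of E per weight."""
    gradient = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += delta
        down[i] -= delta
        gradient.append(
            (neuron_error(up, b, x, target) - neuron_error(down, b, x, target))
            / (2 * delta)
        )
    return gradient

w, b, x, target = [0.2, -0.4, 0.1], 0.05, [1.0, 2.0, -1.5], 1.0
analytic = backprop_gradient(w, b, x, target)
numeric = finite_difference_gradient(w, b, x, target)
print(analytic)
print(numeric)
```

The naive version needs 2N error evaluations for N weights, while the chain-rule version reuses one forward computation for all of them.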
H2O Deep Learning Architecture
[Figure: H2O nodes (JVMs), each with an HTTPD and a share of the H2O atomic in-memory K-V store; communication between nodes/JVMs is synchronous, between threads asynchronous]
Initial model: weights and biases w
Map: each node trains a copy of the weights and biases with (some or all of) its local data with asynchronous F/J threads
Reduce: model averaging; average weights and biases from all nodes (w1+w3 and w2+w4 are combined pairwise)
Updated model: w = (w1+w2+w3+w4)/4
Speedup is at least nodes/log(rows) (arxiv:1209.4129v3)
Keep iterating over the data ("epochs"), score from time to time
Query & display the model via JSON, WWW
Auto-tuned (default) or user-specified number of points per MapReduce iteration
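
The reduce step (model averaging) is simple enough to sketch directly. Each node's model is assumed here to be a flat list of weights and biases, and the four example lists are made up; how H2O actually stores and communicates the models is not shown on the slide.

```python
def average_models(node_weights):
    """Element-wise average of the weight vectors trained on each node."""
    count = len(node_weights)
    return [sum(values) / count for values in zip(*node_weights)]

# Hypothetical weight vectors after one map phase on four nodes.
w1, w2, w3, w4 = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]
updated = average_models([w1, w2, w3, w4])
print(updated)  # [4.0, 5.0]
```

The averaged model would then be sent back to all nodes as the starting point for the next MapReduce iteration.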
"Secret" Sauce to Higher Accuracy
Adaptive learning rate - ADADELTA (Google): Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing: Run a grid search to scan many hyper-parameters, then continue training the most promising model(s)
Regularization: L1 penalizes non-zero weights, L2 penalizes large weights. Dropout: randomly ignore certain inputs
Detail: Adaptive Learning Rate
Compute moving average of $\Delta w_i^2$ at time t for window length rho:
$E[\Delta w_i^2]_t = \rho \, E[\Delta w_i^2]_{t-1} + (1-\rho) \, \Delta w_i^2$
Compute RMS of $\Delta w_i$ at time t with smoothing epsilon:
$\mathrm{RMS}[\Delta w_i]_t = \sqrt{E[\Delta w_i^2]_t + \epsilon}$
Do the same for $\partial E/\partial w_i$, then obtain the per-weight learning rate:
$\text{rate}(w_i, t) = \mathrm{RMS}[\Delta w_i]_{t-1} / \mathrm{RMS}[\partial E/\partial w_i]_t$
Adaptive annealing / progress: Gradient-dependent learning rate, moving window prevents "freezing" (unlike ADAGRAD: no window)
Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time
cf. ADADELTA paper
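
The update rule above can be sketched for a single weight. This follows the formulas on the slide (moving averages of squared gradients and squared updates, then the ratio of the two RMS values as the rate); the function name, the default values rho = 0.95 and epsilon = 1e-6, and the toy objective $E(w) = w^2$ are assumptions for illustration.

```python
import math

def adadelta_minimize(gradient, w, steps, rho=0.95, epsilon=1e-6):
    """Run ADADELTA-style updates on a single weight and return the path of w."""
    avg_sq_grad = 0.0   # E[(dE/dw)^2]
    avg_sq_delta = 0.0  # E[(delta w)^2]
    path = [w]
    for _ in range(steps):
        g = gradient(w)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # rate = RMS[delta w]_{t-1} / RMS[dE/dw]_t
        rate = math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon)
        delta = -rate * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        w += delta
        path.append(w)
    return path

# Toy objective E(w) = w^2 with gradient 2w, starting at w = 5.
path = adadelta_minimize(lambda w: 2.0 * w, w=5.0, steps=100)
print(path[0], path[-1])
```

No global learning rate appears anywhere: the step size is derived entirely from the history of gradients and updates, which is the point of the method.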
Detail: Dropout Regularization
Training: For each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations
[Figure: example network (age, income, employment -> married, single) with some activations crossed out]
Testing: Use all activations, but reduce them by a factor p (to "simulate" the missing activations during training)
cf. Geoff Hinton's paper
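
A minimal sketch of the two dropout modes follows. It assumes p is the fraction of activations that is zeroed out during training, so at test time the activations are scaled by the keep probability 1 - p; for the common choice p = 0.5 this equals the slide's "factor p". The function names are hypothetical.

```python
import random

def dropout_train(activations, p, rng):
    """Training: zero out each activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p):
    """Testing: keep every activation, scaled by the keep probability 1 - p."""
    return [a * (1.0 - p) for a in activations]

activations = [1.0, 2.0, 4.0]
rng = random.Random(42)
train_out = dropout_train(activations, 0.5, rng)  # a different mask on every call
test_out = dropout_test(activations, 0.5)
print(train_out, test_out)
```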
MNIST: digits classification
MNIST = Digitized handwritten digits database (Yann LeCun)
Data: 28x28=784 pixels with (gray-scale) values in 0…255
Train: 60,000 rows, 784 integer columns, 10 classes. Test: 10,000 rows, 784 integer columns, 10 classes
Standing world record: Without distortions or convolutions, the best-ever published error rate on test set: 0.83% (Microsoft)
Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."
Let's see how H2O does on the MNIST dataset
H2O Deep Learning on MNIST: 0.87% test set error (so far)
Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
Frequent errors: confuse 2/7 and 4/9
World-class results: No pre-training, no distortions, no convolutions, no unsupervised training
Running on 4 nodes with 16 cores each
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
22
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
23
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
24
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
25
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
26
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
27
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
28
H2O Deep Learning ArnoCandel
Detail Dropout Regularization29
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
30
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
Letrsquos see how H2O does on the MNIST dataset
H2O Deep Learning on MNIST: 0.87% test set error (so far)
Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
World-class results: no pre-training, no distortions, no convolutions, no unsupervised training
Running on 4 nodes with 16 cores each
Frequent errors: confuse 2/7 and 4/9
Weather Dataset
Predict "RainTomorrow" from Temperature, Humidity, Wind, Pressure, etc.
Live Demo: Weather Prediction
Interactive ROC curve with real-time updates
3 hidden Rectifier layers, Dropout, L1-penalty
5-fold cross-validation: the 12.7% cross-validation error is at least as good as the GBM/RF/GLM models
Live Demo: Grid Search
How did I find those parameters? Grid Search (works for multiple hyper-parameters at once)
Then continue training the best model
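As a rough sketch of what a grid search does, the snippet below tries every combination of the listed hyper-parameter values and keeps the one with the lowest error. Everything here is hypothetical: `grid_search`, `toy_error`, and the parameter names and values are stand-ins, and `toy_error` replaces what would really be "train a model and return its validation error".

```python
from itertools import product

def grid_search(train_and_score, grid):
    """Evaluate every combination in the grid; return (best_params, best_error)."""
    names = sorted(grid)
    best_params, best_error = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = train_and_score(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error

def toy_error(params):
    """Stand-in for training a model and measuring its validation error."""
    return abs(params["l1"] - 1e-4) + abs(params["hidden_dropout"] - 0.5)

grid = {"l1": [1e-5, 1e-4, 1e-3], "hidden_dropout": [0.0, 0.25, 0.5]}
best, err = grid_search(toy_error, grid)
print(best, err)  # the winning combination; training would then continue from it
```

The cost grows with the product of the list lengths (here 3 x 3 = 9 models), which is why a short first pass followed by continued training of the best candidate is attractive.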
Text Classification
Goal: predict the item from the seller's text description
"Vintage 18KT gold Rolex 2 Tone in great condition"
Data: binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 (the 1s mark the words "vintage", "gold", "condition")
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
Let's see how H2O does on the eBay dataset
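A binary word vector of this kind can be built as sketched below. The five-word vocabulary is a made-up miniature (the real dataset has 8,647 columns), and the lower-casing and whitespace tokenization are assumptions, not a description of how the eBay data was prepared.

```python
def binary_word_vector(text, vocabulary):
    """Return one 0/1 entry per vocabulary term: 1 if the term occurs in the text."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

# Hypothetical miniature vocabulary, one column per term
vocabulary = ["condition", "gold", "silver", "vintage", "watch"]
description = "Vintage 18KT gold Rolex 2 Tone in great condition"
print(binary_word_vector(description, vocabulary))  # [1, 1, 0, 1, 0]
```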
Text Classification
Out-of-the-box: 11.6% test set error after 10 epochs. Predicts the correct class (out of 143) 88.4% of the time
Note 1: H2O's columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)
Parallel Scalability (for 64 epochs on MNIST, with "0.87%" parameters)
[Figure: two bar charts over the number of H2O nodes (1, 2, 4, 8, 16, 32, 63; 4 cores per node, 1 epoch per node per MapReduce). One shows training time in minutes (axis 0 to 100, annotated "2.7 mins"); the other shows speedup (axis 0.00 to 40.00).]
Deep Learning Auto-Encoders for Anomaly Detection
Toy example: find anomalies in ECG heart beat data. First, train a model on what's "normal": 20 time-series samples of 210 data points each
Deep Auto-Encoder: learn the low-dimensional non-linear "structure" of the data that allows the original data to be reconstructed
Also for categorical data
Test set with anomaly + model of what's "normal" => the test set prediction is a reconstruction that looks "normal"
Found anomaly: large reconstruction error
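The detection step itself is simple once a model of "normal" exists: reconstruct each row and flag the rows whose reconstruction error is large. The sketch below shows only that step. The `reconstruct` function is a hypothetical stand-in for a trained auto-encoder (here it always returns a fixed "normal" pattern), and the data and threshold are made up.

```python
def reconstruction_mse(original, reconstructed):
    """Mean squared difference between a row and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

def flag_anomalies(rows, reconstruct, threshold):
    """Return the indices of rows whose reconstruction error exceeds the threshold."""
    return [i for i, row in enumerate(rows)
            if reconstruction_mse(row, reconstruct(row)) > threshold]

# Stand-in for a trained auto-encoder: always reproduces the learned "normal" beat
NORMAL_BEAT = [0.0, 1.0, 0.0, -1.0]

def reconstruct(row):
    return NORMAL_BEAT

rows = [
    [0.1, 0.9, 0.0, -1.0],  # close to normal: small reconstruction error
    [0.0, 1.0, 3.0, -1.0],  # spike at the third point: large reconstruction error
]
print(flag_anomalies(rows, reconstruct, threshold=0.5))  # [1]
```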
H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms
All parameters are available from R…
POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
How well did H2O Deep Learning do?
Higgs Particle Discovery with H2O
Let's see how H2O did in the past 30 minutes
Any guesses for AUC on low-level features? AUC = 0.76 was the best for RF/GBM/NN (H2O)
<Your guess goes here>
reference paper results
H2O Steam: Scoring Platform
Live Demo
Higgs Dataset Demo on 10-node cluster: let's score all our H2O models and compare them
http://server:port/steam/index.html
Scoring Higgs Models in H2O Steam
Live Demo on 10-node cluster: <10 minutes runtime for all algos. Better than LHC baseline of AUC = 0.73
Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
| Algorithm | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes |
|---|---|---|---|---|
| Generalized Linear Model | - | 0.596 | 0.684 | default, binomial |
| Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50 |
| Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15 |
| Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs |
| Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs |
| Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs |
| Deep Learning 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5 |
Deep Learning on low-level features alone beats everything else. H2O prelim. results compare well with the paper's results (TMVA & Theano)
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
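For readers unfamiliar with the AUC metric used for the Higgs comparison: it is the probability that a randomly chosen signal row receives a higher score than a randomly chosen background row, with ties counting one half. The brute-force sketch below illustrates that definition on four made-up rows. It is not how H2O computes AUC, and it would be far too slow for millions of rows.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count half
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.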
Tips for H2O Deep Learning
General:
- More layers for more complex functions (exp. more non-linearity)
- More neurons per layer to detect finer structure in data ("memorizing")
- Add some regularization for less overfitting (lower validation set error)
Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier; try max_w2 = 10…50, L1 = 1e-5…1e-3 and/or L2 = 1e-5…1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set. Input dropout is recommended for noisy high-dimensional input
- Distributed: more training samples per iteration is faster, but gives less accuracy
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
- Try balance_classes = true for datasets with large class imbalance
- Enable force_load_balance for small datasets
- Enable replicate_training_data if each node can hold all the data
Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O: add your own JIRA tickets
Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data.
Join our Community and Meetups: https://github.com/h2oai, http://docs.h2o.ai, www.h2o.ai/community, @h2oai
Thank you
Data preparation & Initialization
Neural Networks are sensitive to numerical noise and operate best in the linear regime (not saturated)
[Diagram: inputs x_i (age, income, employment), weights w_kl, outputs married, single]
Automatic standardization of data: x_i gets mean = 0, stddev = 1
Horizontalize categorical variables, e.g. {full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed
Automatic initialization of weights w_kl
Poor man's initialization: random weights
Default (better): uniform distribution in +/- sqrt(6/(units + units_previous_layer))
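The three preparation steps can be sketched in plain Python as below. This is an illustration under stated assumptions, not H2O's code: the function names are invented, the standard deviation is the population form (divide by n), and the categorical encoding follows the slide's example, in which the last level is represented by all zeros.

```python
import math
import random

def standardize(column):
    """Shift and scale a numeric column to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

def encode_categorical(value, levels):
    """One 0/1 column per level except the last, which is encoded as all zeros."""
    return [1 if value == level else 0 for level in levels[:-1]]

def init_weights(units, units_previous_layer, rng):
    """Uniform weights in +/- sqrt(6 / (units + units_previous_layer))."""
    limit = math.sqrt(6.0 / (units + units_previous_layer))
    return [[rng.uniform(-limit, limit) for _ in range(units_previous_layer)]
            for _ in range(units)]

levels = ["full-time", "part-time", "none", "self-employed"]
print(standardize([25.0, 40.0, 55.0]))
print(encode_categorical("part-time", levels))      # [0, 1, 0]
print(encode_categorical("self-employed", levels))  # [0, 0, 0]
print(init_weights(2, 3, random.Random(0)))
```

Scaling the initial weights by the layer sizes keeps the summed inputs of each neuron small, which is what keeps the activations out of the saturated regime at the start of training.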
Training: Update Weights & Biases
For each training row, we make a prediction and compare it with the actual label (supervised learning):
married: predicted 0.8, actual 1
single: predicted 0.2, actual 0
Objective: minimize prediction error (MSE or cross-entropy)
Mean Square Error = (0.2^2 + 0.2^2)/2 "penalize differences per-class"
Cross-entropy = -log(0.8) "strongly penalize non-1-ness"
Stochastic Gradient Descent: update weights and biases via the gradient of the error (via back-propagation):
$w \leftarrow w - \mathrm{rate} \cdot \partial E/\partial w$
[Plot: error E as a function of a weight w, with the step size labeled "rate"]
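The two error measures for the married/single example can be checked numerically. The snippet below is only a worked version of the slide's arithmetic (predicted 0.8 and 0.2 against actual 1 and 0); the variable names are arbitrary and the logarithm is the natural log.

```python
import math

predicted = [0.8, 0.2]  # married, single
actual = [1.0, 0.0]

# Mean square error: average squared difference over the two classes
mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Cross-entropy: only the true class (actual = 1) contributes
cross_entropy = -sum(a * math.log(p) for p, a in zip(predicted, actual))

print(round(mse, 4))            # 0.04
print(round(cross_entropy, 4))  # 0.2231
```

Cross-entropy grows without bound as the predicted probability of the true class approaches 0, whereas the squared error of a single class can never exceed 1, which is the sense in which it "strongly penalizes non-1-ness".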
Backward Propagation
How to compute $\partial E/\partial w_i$ for $w_i \leftarrow w_i - \mathrm{rate} \cdot \partial E/\partial w_i$?
Naive: for every i, evaluate E twice at $(w_1, \ldots, w_i \pm \Delta, \ldots, w_N)$… Slow
Backprop: compute $\partial E/\partial w_i$ via the chain rule, going backwards:
$net = \sum_i w_i x_i + b$
$y = \mathrm{activation}(net)$
$E = \mathrm{error}(y)$
$\partial E/\partial w_i = \partial E/\partial y \cdot \partial y/\partial net \cdot \partial net/\partial w_i = \partial(\mathrm{error}(y))/\partial y \cdot \partial(\mathrm{activation}(net))/\partial net \cdot x_i$
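To make the contrast concrete, the sketch below computes the gradient for a single neuron both ways: once with the chain rule and once with the naive "evaluate E twice per weight" approach. The choice of tanh as the activation, the squared error E = (y - target)^2 / 2, and all numeric values are assumptions made for the example.

```python
import math

def forward(w, x, b):
    """net = sum_i(w_i * x_i) + b, y = activation(net) with tanh as the activation."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    return math.tanh(net)

def error(y, target):
    return 0.5 * (y - target) ** 2

def backprop_gradient(w, x, b, target):
    """dE/dw_i = dE/dy * dy/dnet * dnet/dw_i, from a single forward pass."""
    y = forward(w, x, b)
    dE_dy = y - target       # derivative of the squared error
    dy_dnet = 1.0 - y * y    # derivative of tanh
    return [dE_dy * dy_dnet * xi for xi in x]  # dnet/dw_i = x_i

def naive_gradient(w, x, b, target, delta=1e-6):
    """Central differences: two evaluations of E for every single weight."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += delta
        down[i] -= delta
        grads.append((error(forward(up, x, b), target)
                      - error(forward(down, x, b), target)) / (2.0 * delta))
    return grads

w, x, b, target = [0.3, -0.2, 0.5], [1.0, 2.0, -1.0], 0.1, 1.0
print(backprop_gradient(w, x, b, target))
print(naive_gradient(w, x, b, target))  # agrees to several decimal places
```

The naive version needs 2N forward passes for N weights, while the chain rule reuses one forward pass for all of them, which is why it scales to networks with millions of weights.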
H2O Deep Learning Architecture
[Diagram: several nodes/JVMs, each with an HTTPD and a share of the H2O atomic in-memory K-V store. Communication between nodes/JVMs is synchronous; communication between threads is asynchronous.]
Initial model: weights and biases w
Map: each node trains a copy of the weights and biases with (some or all of) its local data, using asynchronous F/J threads, giving w1, w2, w3, w4
Reduce: model averaging. Average the weights and biases from all nodes: w1 + w3 and w2 + w4 are combined, then w = (w1 + w2 + w3 + w4)/4. Speedup is at least #nodes/log(#rows), arxiv:1209.4129v3
Updated model: w
Keep iterating over the data ("epochs"), score from time to time
Query & display the model via JSON, WWW
The number of points per MapReduce iteration is auto-tuned (default) or user-specified
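The reduce step amounts to an element-wise average of the per-node models. The snippet below sketches only that averaging for flat weight vectors. The function name and the four toy vectors are made up, and real models would also average the biases and do this across machines rather than in one process.

```python
def average_models(node_weights):
    """Element-wise average of the weight vectors trained on each node."""
    n = len(node_weights)
    return [sum(values) / n for values in zip(*node_weights)]

# Hypothetical weight vectors after one map phase on 4 nodes
w1 = [1.0, 2.0]
w2 = [3.0, 4.0]
w3 = [5.0, 6.0]
w4 = [7.0, 8.0]

w = average_models([w1, w2, w3, w4])  # w = (w1 + w2 + w3 + w4) / 4
print(w)  # [4.0, 5.0]
```

The averaged w then becomes the initial model for the next MapReduce iteration.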
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
27
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
28
H2O Deep Learning ArnoCandel
Detail Dropout Regularization29
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
30
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
Letrsquos see how H2O does on the MNIST dataset
H2O Deep Learning ArnoCandel
Frequent errors confuse 27 and 49
H2O Deep Learning on MNIST 087 test set error (so far)
31
test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours
World-class results
No pre-training No distortions
No convolutions No unsupervised
training
Running on 4 nodes with 16 cores each
H2O Deep Learning A Candel
Weather Dataset32
Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc
H2O Deep Learning A Candel
Live Demo Weather Prediction
Interactive ROC curve with real-time updates
33
3 hidden Rectifier layers Dropout
L1-penalty
127 5-fold cross-validation error is at least as good as GBMRFGLM models
5-fold cross validation
H2O Deep Learning ArnoCandel
Live Demo Grid Search
How did I find those parameters Grid Search(works for multiple hyper parameters at once)
34
Then continue training the best model
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
35
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Binary word vector 0010000010001hellip0
vintagegold condition
Letrsquos see how H2O does on the ebay dataset
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
36
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
37
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
Deep Learning Auto-Encoders for Anomaly Detection
Toy example: Find anomaly in ECG heart beat data. First, train a model on what's “normal”: 20 time-series samples of 210 data points each.
Deep Auto-Encoder: Learn low-dimensional non-linear “structure” of the data that allows reconstructing the original data.
Also for categorical data!

Deep Learning Auto-Encoders for Anomaly Detection
[Diagram: test set with anomaly + model of what's “normal” => test set prediction]
Test set prediction is the reconstruction, which looks “normal”
Found anomaly: large reconstruction error
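The scoring step (flag the sample with the largest reconstruction error) can be sketched without any deep learning library. Note the swap: the pointwise mean of the normal beats stands in for a trained auto-encoder here, and the data is synthetic; only the sizes (20 series of 210 points) come from the slides.

```python
import math

def mean_beat(beats):
    """Pointwise mean of equally long series: a crude model of 'normal'."""
    n = len(beats)
    return [sum(column) / n for column in zip(*beats)]

def reconstruction_error(series, reconstruction):
    """Mean squared error between a series and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(series, reconstruction)) / len(series)

# Synthetic stand-in data: 20 'normal' beats of 210 points each.
normal = [[math.sin(2 * math.pi * t / 210) + 0.01 * k for t in range(210)]
          for k in range(20)]
model = mean_beat(normal)

# Test set: two normal-looking beats and one with an inserted spike.
anomalous = list(normal[0])
for t in range(100, 110):
    anomalous[t] += 2.0
test_set = [normal[3], anomalous, normal[7]]

errors = [reconstruction_error(beat, model) for beat in test_set]
flagged = max(range(len(errors)), key=errors.__getitem__)  # index of the anomaly
```

A real auto-encoder would produce a different reconstruction per input; the thresholding on reconstruction error works the same way.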
H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms
All parameters are available from R…

POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production!

How well did H2O Deep Learning do?
Let's see how H2O did in the past 30 minutes!
Higgs Particle Discovery with H2O
Any guesses for AUC on low-level features? AUC=0.76 was the best for RF/GBM/NN (H2O)
<Your guess goes here>
reference paper results

H2O Steam: Scoring Platform
Live Demo: Higgs Dataset Demo on 10-node cluster. Let's score all our H2O models and compare them!
http://server:port/steam/index.html

Scoring Higgs Models in H2O Steam
Live Demo on 10-node cluster: <10 minutes runtime for all algos! Better than LHC baseline of AUC=0.73!

Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows.

Algorithm | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5

Deep Learning on low-level features alone beats everything else! H2O prelim. results compare well with paper's results (TMVA & Theano).
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
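For readers unfamiliar with the metric, AUC can be computed from ranks alone. The sketch below uses the rank-sum (Mann-Whitney U) formulation in plain Python; it illustrates the metric and is not the scoring code used by H2O or the paper. The four labels and scores are made-up example values.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formulation.

    labels: 1 for signal, 0 for background; scores: higher means more signal-like.
    Tied scores receive their average rank.
    """
    order = sorted(range(len(scores)), key=scores.__getitem__)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels = [0, 0, 1, 1]            # made-up example: two background, two signal
scores = [0.1, 0.4, 0.35, 0.8]
example_auc = auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of signal from background.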
Tips for H2O Deep Learning
General:
- More layers for more complex functions (exp. more non-linearity)
- More neurons per layer to detect finer structure in data (“memorizing”)
- Add some regularization for less overfitting (lower validation set error)
Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier, try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation set. Input dropout is recommended for noisy high-dimensional input.
Distributed:
- More training samples per iteration: faster, but less accuracy
- With ADADELTA: Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99
- Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
- Try balance_classes = true for datasets with large class imbalance
- Enable force_load_balance for small datasets
- Enable replicate_training_data if each node can hold all the data

Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O! Add your own JIRA tickets!

Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data!
Join our Community and Meetups! https://github.com/h2oai, http://docs.h2o.ai, www.h2o.ai/community, @h2oai
Thank you!

H2O Deep Learning Architecture
[Diagram labels: K-V, HTTPD; communication is sync between nodes/JVMs and async between threads; H2O atomic in-memory K-V store]
Initial model: weights and biases w
map: each node trains a copy of the weights and biases with (some or all of) its local data with asynchronous F/J threads
reduce: model averaging: average weights and biases from all nodes (partial sums w1+w3 and w2+w4 are combined first), w = (w1+w2+w3+w4)/4
Updated model: w
Keep iterating over the data (“epochs”), score from time to time
Query & display the model via JSON, WWW
Speedup is at least #nodes/log(#rows), arxiv:1209.4129v3
Note: auto-tuned (default) or user-specified number of points per MapReduce iteration
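The map/reduce loop above can be sketched on a toy problem. This is a sequential illustration only: the "nodes" are plain lists, the model is a two-weight linear fit, and the asynchronous F/J threads inside each node are not modelled. Function names, the learning rate and the data are assumptions for the example.

```python
def local_sgd(weights, shard, rate=0.1):
    """'map' step: one SGD pass over a node's local (x, y) pairs for y ~ w0 + w1*x."""
    w0, w1 = weights
    for x, y in shard:
        error = (w0 + w1 * x) - y
        w0 -= rate * error
        w1 -= rate * error * x
    return [w0, w1]

def average_weights(models):
    """'reduce' step: element-wise average of the per-node weight vectors."""
    return [sum(ws) / len(models) for ws in zip(*models)]

def train(shards, epochs=500):
    weights = [0.0, 0.0]  # initial model w
    for _ in range(epochs):
        # map: every node starts from the same w and trains on its own shard
        local_models = [local_sgd(weights, shard) for shard in shards]
        # reduce: w = (w1 + w2 + w3 + w4) / 4
        weights = average_weights(local_models)
    return weights

# Toy data for y = 1 + 2x, split across four hypothetical nodes.
points = [(x / 10, 1 + 2 * x / 10) for x in range(20)]
shards = [points[i::4] for i in range(4)]
w = train(shards)
```

Each epoch costs one model exchange per node, so fewer and larger MapReduce iterations trade communication for staleness of the averaged model.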
“Secret” Sauce to Higher Accuracy
Adaptive learning rate - ADADELTA (Google): Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing: Run a grid search to scan many hyper-parameters, then continue training the most promising model(s)
Regularization: L1 penalizes non-zero weights; L2 penalizes large weights; Dropout randomly ignores certain inputs

Detail: Adaptive Learning Rate
Compute moving average of Δw_i² at time t for window length rho:
$$E[\Delta w_i^2]_t = \rho \, E[\Delta w_i^2]_{t-1} + (1-\rho) \, \Delta w_i^2$$
Compute RMS of Δw_i at time t with smoothing epsilon:
$$\mathrm{RMS}[\Delta w_i]_t = \sqrt{E[\Delta w_i^2]_t + \epsilon}$$
Do the same for ∂E/∂w_i, then obtain per-weight learning rate:
$$\mathrm{rate}(w_i, t) = \frac{\mathrm{RMS}[\Delta w_i]_{t-1}}{\mathrm{RMS}[\partial E/\partial w_i]_t}$$
Adaptive annealing / progress: gradient-dependent learning rate; moving window prevents “freezing” (unlike ADAGRAD: no window)
Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time
cf. ADADELTA paper
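A minimal single-weight sketch of the update rule above, written in plain Python. It follows the formulas on this slide (moving averages of squared gradients and squared updates, rate as the ratio of their RMS values); the toy objective, the default `rho` and `epsilon`, and the step count are assumptions for illustration and not H2O's defaults.

```python
import math

def adadelta_minimize(gradient, w, rho=0.95, epsilon=1e-6, steps=500):
    """Minimise a 1-D function from its gradient using ADADELTA-style updates."""
    avg_sq_grad = 0.0    # E[(dE/dw)^2], moving average of squared gradients
    avg_sq_update = 0.0  # E[dw^2], moving average of squared weight updates
    for _ in range(steps):
        g = gradient(w)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # rate(w, t) = RMS[dw]_{t-1} / RMS[dE/dw]_t
        rate = math.sqrt(avg_sq_update + epsilon) / math.sqrt(avg_sq_grad + epsilon)
        update = -rate * g
        avg_sq_update = rho * avg_sq_update + (1 - rho) * update * update
        w += update
    return w

# Toy objective E(w) = (w - 3)^2, so dE/dw = 2 (w - 3).
w_start = 0.0
w_final = adadelta_minimize(lambda w: 2 * (w - 3), w=w_start)
```

No global learning rate appears anywhere: the step size is derived entirely from the history of gradients and updates, which is the point of the method.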
Detail: Dropout Regularization
Training: For each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations.
[Diagram: input features age, income, employment, married, single; X marks the dropped activations]
Testing: Use all activations, but reduce them by a factor p (to “simulate” the missing activations during training).
cf. Geoff Hinton's paper
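The two modes can be sketched as follows. Two assumptions are worth stating: each activation is dropped independently with probability p (so the dropped fraction is p only on average), and at test time the activations are scaled by the keep probability 1 - p, which coincides with "a factor p" for the common choice p = 0.5. The input values are made up.

```python
import random

def dropout_train(activations, p, rng):
    """Training: zero out each input activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p):
    """Testing: keep every activation, scaled by the keep probability (1 - p)."""
    return [a * (1 - p) for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical activations for: age, income, employment, married, single
inputs = [0.2, 1.5, 0.7, 1.0, 0.0]
train_out = dropout_train(inputs, p=0.5, rng=rng)
test_out = dropout_test(inputs, p=0.5)
```

Calling `dropout_train` again draws a different mask, which is what makes every training sample see a slightly different network.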
MNIST: digits classification
MNIST = Digitized handwritten digits database (Yann LeCun)
Data: 28x28=784 pixels with (gray-scale) values in 0…255
Train: 60,000 rows, 784 integer columns, 10 classes. Test: 10,000 rows, 784 integer columns, 10 classes.
Standing world record: Without distortions or convolutions, the best-ever published error rate on test set: 0.83% (Microsoft)
Yann LeCun: “Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet.”
Let's see how H2O does on the MNIST dataset!
H2O Deep Learning ArnoCandel
Detail Dropout Regularization29
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
30
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
Letrsquos see how H2O does on the MNIST dataset
H2O Deep Learning ArnoCandel
Frequent errors confuse 27 and 49
H2O Deep Learning on MNIST 087 test set error (so far)
31
test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours
World-class results
No pre-training No distortions
No convolutions No unsupervised
training
Running on 4 nodes with 16 cores each
H2O Deep Learning A Candel
Weather Dataset32
Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc
H2O Deep Learning A Candel
Live Demo Weather Prediction
Interactive ROC curve with real-time updates
33
3 hidden Rectifier layers Dropout
L1-penalty
127 5-fold cross-validation error is at least as good as GBMRFGLM models
5-fold cross validation
H2O Deep Learning ArnoCandel
Live Demo Grid Search
How did I find those parameters Grid Search(works for multiple hyper parameters at once)
34
Then continue training the best model
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
35
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Binary word vector 0010000010001hellip0
vintagegold condition
Letrsquos see how H2O does on the ebay dataset
Text Classification
H2O Deep Learning ArnoCandel
Text Classification
Out-Of-The-Box: 11.6% test set error after 10 epochs. Predicts the correct class (out of 143) 88.4% of the time.
Note 1: H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)
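As a quick sanity check of Note 1, the arithmetic below recomputes the number of stored values from the training set dimensions and derives what the quoted sizes imply per value. It assumes "60 MB" and "18 GB" are decimal units (10^6 and 10^9 bytes), which the slides do not specify.

```python
rows, cols = 578_361, 8_647           # training set dimensions from the slide
values = rows * cols                   # about 5 billion cells

compressed_bytes = 60 * 10**6          # assumption: 60 MB = 60 * 10^6 bytes
csv_bytes = 18 * 10**9                 # assumption: 18 GB = 18 * 10^9 bytes

bits_per_value = compressed_bytes * 8 / values   # roughly 0.1 bit per value
csv_bytes_per_value = csv_bytes / values         # roughly 3.6 bytes per value
compression_ratio = csv_bytes / compressed_bytes

print(f"{values:,} values")
print(f"{bits_per_value:.3f} bits per value in memory")
print(f"{csv_bytes_per_value:.2f} bytes per value as CSV")
print(f"{compression_ratio:.0f}x smaller than CSV")
```

Storing well under one bit per value is only possible because the binary word vectors are almost entirely zeros.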
Parallel Scalability (for 64 epochs on MNIST, with “0.87%” parameters)
[Figure: speedup and training time (in minutes) plotted against the number of H2O nodes (1, 2, 4, 8, 16, 32, 63); 4 cores per node, 1 epoch per node per MapReduce; annotated value: 2.7 mins]
Deep Learning Auto-Encoders for Anomaly Detection
Toy example: Find anomaly in ECG heart beat data. First, train a model on what’s “normal”: 20 time-series samples of 210 data points each
Deep Auto-Encoder: Learn the low-dimensional non-linear “structure” of the data that allows the original data to be reconstructed
Also for categorical data
Deep Learning Auto-Encoders for Anomaly Detection
Model of what’s “normal” + test set with anomaly => test set prediction is a reconstruction that looks “normal”
Found anomaly: large reconstruction error
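The scoring step described here (reconstruct each test series from the model of "normal", then flag series with a large reconstruction error) can be sketched without a neural network. In the sketch below the auto-encoder is replaced by a much simpler stand-in, the pointwise mean of the normal series, so it illustrates only the thresholding logic, not H2O's auto-encoder. The synthetic data, the function names, and the threshold factor are assumptions.

```python
import math


def mean_series(samples):
    """Pointwise mean of equally long series (stand-in for the learned model)."""
    count = len(samples)
    return [sum(column) / count for column in zip(*samples)]


def reconstruction_error(series, reconstruction):
    """Mean squared error between a series and its reconstruction."""
    return sum((x - r) ** 2 for x, r in zip(series, reconstruction)) / len(series)


def find_anomalies(normal_samples, test_samples, factor=3.0):
    """Indices of test series whose error exceeds factor * worst training error."""
    template = mean_series(normal_samples)
    threshold = factor * max(
        reconstruction_error(s, template) for s in normal_samples
    )
    return [
        i
        for i, s in enumerate(test_samples)
        if reconstruction_error(s, template) > threshold
    ]


# Synthetic stand-in data: 20 "normal" series of 210 points each, as on the slide.
normal = [
    [math.sin(2 * math.pi * t / 210) + 0.01 * ((i % 3) - 1) for t in range(210)]
    for i in range(20)
]
flat_line = [0.0] * 210                      # obviously abnormal series
anomalies = find_anomalies(normal, normal[:3] + [flat_line])
```

With a real auto-encoder the template is replaced by the network's own reconstruction of each input, but the decision rule stays the same.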
H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms
All parameters are available from R…
POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
Higgs Particle Discovery with H2O
How well did H2O Deep Learning do?
Let’s see how H2O did in the past 30 minutes
<Your guess goes here>
reference paper results
Any guesses for AUC on low-level features? AUC=0.76 was the best for RF/GBM/NN (H2O)
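Since the Higgs results are all reported as AUC, a short reminder of what that number measures may help: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. It is a generic illustration, quadratic in the number of rows and therefore only suitable for small inputs, and not how H2O computes AUC on 10M rows; the labels and scores are made up.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.8, 0.1]
example_auc = auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of signal and background.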
H2O Steam: Scoring Platform
Live Demo: Higgs Dataset Demo on 10-node cluster. Let’s score all our H2O models and compare them
http://server:port/steam/index.html
Scoring Higgs Models in H2O Steam
Live Demo on 10-node cluster: <10 minutes runtime for all algos. Better than LHC baseline of AUC=0.73
Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
Algorithm | Paper’s l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5
Deep Learning on low-level features alone beats everything else. H2O prelim. results compare well with paper’s results (TMVA & Theano)
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
Tips for H2O Deep Learning
General:
- More layers for more complex functions (exp. more non-linearity)
- More neurons per layer to detect finer structure in data (“memorizing”)
- Add some regularization for less overfitting (lower validation set error)
Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier, try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation set. Input dropout is recommended for noisy high-dimensional input
- Distributed: More training samples per iteration: faster, but less accuracy
- With ADADELTA: Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99
- Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
- Try balance_classes = true for datasets with large class imbalance
- Enable force_load_balance for small datasets
- Enable replicate_training_data if each node can hold all the data
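To show what the epsilon and rho settings in the tips control, here is a minimal sketch of the ADADELTA update rule as published by Zeiler (2012), applied to a one-dimensional toy problem. It is not H2O's implementation; the function name, the toy objective f(x) = x^2, and the default values are assumptions chosen for illustration.

```python
import math


def adadelta_minimize(grad, x0, rho=0.95, epsilon=1e-6, steps=200):
    """Minimize a 1-D function with the ADADELTA update rule, given its gradient."""
    x = x0
    avg_sq_grad = 0.0   # decaying average of squared gradients
    avg_sq_step = 0.0   # decaying average of squared updates
    for _ in range(steps):
        g = grad(x)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # Step size adapts per parameter: RMS of past steps over RMS of gradients.
        step = -math.sqrt(avg_sq_step + epsilon) / math.sqrt(avg_sq_grad + epsilon) * g
        avg_sq_step = rho * avg_sq_step + (1 - rho) * step * step
        x += step
    return x


# Toy objective f(x) = x^2 with gradient 2x, starting from x = 1.
x_final = adadelta_minimize(lambda x: 2 * x, x0=1.0)
```

rho sets how long past gradients are remembered, and epsilon both avoids division by zero and determines the size of the very first steps, which is why the tips suggest trying several orders of magnitude for it.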
Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O: add your own JIRA tickets
Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data.
Join our Community and Meetups: https://github.com/h2oai, http://docs.h2o.ai, www.h2o.ai/community, @h2oai
Thank you
H2O Deep Learning ArnoCandel
Live Demo Grid Search
How did I find those parameters Grid Search(works for multiple hyper parameters at once)
34
Then continue training the best model
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
35
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Binary word vector 0010000010001hellip0
vintagegold condition
Letrsquos see how H2O does on the ebay dataset
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
36
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
37
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Deep Learning Auto-Encoders for Anomaly Detection
38
Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each
Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data
Also for categorical data
H2O Deep Learning ArnoCandel 39
Test set with anomaly
Test set prediction is reconstruction looks ldquonormalrdquo
Found anomaly large reconstruction error
Model of whatrsquos ldquonormalrdquo
+
=gt
Deep Learning Auto-Encoders for Anomaly Detection
H2O Deep Learning ArnoCandel 40
R Vignette with example R scripts http0xdatacomh2oalgorithms
All parameters are available from Rhellip
H2O brings Deep Learning to R
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
41
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel 42
How well did H2O Deep Learning do
Letrsquos see how H2O did in the past 30 minutes
Higgs Particle Discovery with H2O
<Your guess goes here>
reference paper results
Any guesses for AUC on low-level features? AUC=0.76 was the best for RF/GBM/NN (H2O)
H2O Steam Scoring Platform
Higgs Dataset Demo on 10-node cluster: Let's score all our H2O models and compare them
http://server:port/steam/index.html
Live Demo
Live demo on 10-node cluster: <10 minutes runtime for all algos. Better than LHC baseline of AUC=0.73
Scoring Higgs Models in H2O Steam
Algorithm                       Paper's l-l AUC   low-level H2O AUC   all features H2O AUC   Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model        -                 0.596               0.684                  default, binomial
Random Forest                   -                 0.764               0.840                  50 trees, max depth 50
Gradient Boosted Trees          0.73              0.753               0.839                  50 trees, max depth 15
Neural Net 1 layer              0.733             0.760               0.830                  1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers   0.836             0.850               -                      3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers   0.868             0.869               -                      4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers   0.880             running             -                      6x500 Rectifier, L1=L2=1e-5
Deep Learning on low-level features alone beats everything else. H2O preliminary results compare well with the paper's results (TMVA & Theano)
Higgs Particle Detection with H2O
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
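All Higgs results here are reported as AUC. As a reminder of what that number measures, here is a minimal sketch that computes the area under the ROC curve with the rank-sum (Mann-Whitney U) identity. The function name `auc` and the four-row example are illustrative assumptions, not the scoring code used in the demo.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum identity.

    labels: 0/1 ints (1 = signal); scores: higher means more signal-like.
    Tied scores receive their average rank.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1           # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of signal from background, which puts the table's range of 0.596 to 0.880 in context.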
Tips for H2O Deep Learning
General:
- More layers for more complex functions (exp. more non-linearity)
- More neurons per layer to detect finer structure in data (“memorizing”)
- Add some regularization for less overfitting (lower validation set error)
Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier, try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation set. Input dropout is recommended for noisy high-dimensional input
Distributed:
- More training samples per iteration: faster, but less accuracy
With ADADELTA:
- Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10; rho = 0.9, 0.95, 0.99
Without ADADELTA:
- Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing

- Try balance_classes = true for datasets with large class imbalance
- Enable force_load_balance for small datasets
- Enable replicate_training_data if each node can hold all the data
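The grid-search tip can be sketched as a plain enumeration of parameter combinations. The dictionary keys and the specific grid points below are assumptions read off the suggested ranges; the training call itself is deliberately left out because the slides do not show it.

```python
from itertools import product

# Candidate values taken from the ranges suggested above; the exact grid
# points and the key spellings are illustrative assumptions.
search_space = {
    "activation": ["Tanh", "Rectifier"],
    "l1": [1e-5, 1e-4, 1e-3],
    "epsilon": [1e-4, 1e-6, 1e-8, 1e-10],
    "rho": [0.9, 0.95, 0.99],
}

def expand_grid(space):
    """Yield one dict per combination of the listed values."""
    names = list(space)
    for combo in product(*(space[name] for name in names)):
        yield dict(zip(names, combo))

grid = list(expand_grid(search_space))
print(len(grid))  # 72
print(grid[0])
```

Each dictionary would be handed to one short training run; comparing validation errors across runs gives the feel for convergence that the tip mentions before committing to longer training.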
Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers PUB-644
- Anomaly Detection PUB-806
- Pre-Training: Stacked Auto-Encoders PUB-1014
- Faster Training: GPGPU support PUB-1013
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O! Add your own JIRA tickets!
Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data!
Join our Community and Meetups: https://github.com/h2oai, http://docs.h2o.ai, www.h2o.ai/community, @h2oai
Thank you
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
41
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel 42
How well did H2O Deep Learning do
Letrsquos see how H2O did in the past 30 minutes
Higgs Particle Discovery with H2O
ltYour guess goes heregt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
43
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 44
Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 45
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Tips for H2O Deep Learning

General:
- More layers for more complex functions (exp. more non-linearity)
- More neurons per layer to detect finer structure in data ("memorizing")
- Add some regularization for less overfitting (lower validation set error)

Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier; try max_w2 = 10...50, L1 = 1e-5...1e-3 and/or L2 = 1e-5...1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation set. Input dropout is recommended for noisy high-dimensional input

Distributed:
- More training samples per iteration: faster, but less accuracy?

With ADADELTA:
- Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99

Without ADADELTA:
- Try rate = 1e-4...1e-2, rate_annealing = 1e-5...1e-9, momentum_start = 0.5...0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing

- Try balance_classes = true for datasets with large class imbalance
- Enable force_load_balance for small datasets
- Enable replicate_training_data if each node can hold all the data
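
To make the "Without ADADELTA" parameters more concrete, the sketch below shows one plausible reading of how they interact. It assumes that the learning rate decays as rate / (1 + rate_annealing * samples) and that momentum ramps linearly from momentum_start to momentum_stable over momentum_ramp training samples. Both formulas and both function names are assumptions made for illustration; check the H2O documentation for the exact definitions.

```python
def annealed_rate(rate, rate_annealing, samples):
    """Learning rate after `samples` training samples (assumed annealing form)."""
    return rate / (1.0 + rate_annealing * samples)


def momentum_at(samples, momentum_start, momentum_stable, momentum_ramp):
    """Momentum after `samples` training samples (assumed linear ramp)."""
    if samples >= momentum_ramp:
        return momentum_stable
    fraction = samples / momentum_ramp
    return momentum_start + (momentum_stable - momentum_start) * fraction


if __name__ == "__main__":
    rate, rate_annealing = 1e-3, 1e-6
    # The tip above couples the two schedules: momentum_ramp = 1 / rate_annealing.
    momentum_ramp = 1.0 / rate_annealing
    for samples in (0, 250_000, 1_000_000, 10_000_000):
        lr = annealed_rate(rate, rate_annealing, samples)
        mu = momentum_at(samples, 0.5, 0.99, momentum_ramp)
        print(f"{samples:>10d} samples: rate={lr:.6f} momentum={mu:.3f}")
```

Under these assumptions, setting momentum_ramp = 1/rate_annealing means momentum reaches its stable value at the point where the learning rate has dropped to half its initial value.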
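
The roles of the ADADELTA parameters rho and epsilon from the tips are easier to see in the update rule itself. The following is a minimal sketch of the textbook ADADELTA update (Zeiler, 2012) on a one-dimensional toy problem; it is not H2O's code, and the function name `adadelta_minimize` and the toy objective f(x) = x^2 are assumptions chosen for this example.

```python
import math


def adadelta_minimize(grad, x0, rho=0.95, epsilon=1e-6, steps=1000):
    """Run `steps` ADADELTA updates on a scalar parameter and return it.

    rho     : decay rate of the running averages (memory of past values)
    epsilon : small constant that seeds the first steps and avoids division by zero
    """
    x = x0
    avg_sq_grad = 0.0  # running average of squared gradients
    avg_sq_step = 0.0  # running average of squared parameter updates
    for _ in range(steps):
        g = grad(x)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        # The step is the gradient rescaled by RMS(previous steps) / RMS(gradients).
        step = -math.sqrt(avg_sq_step + epsilon) / math.sqrt(avg_sq_grad + epsilon) * g
        avg_sq_step = rho * avg_sq_step + (1.0 - rho) * step * step
        x += step
    return x


if __name__ == "__main__":
    # Toy objective f(x) = x^2 with gradient 2x and minimum at x = 0.
    for eps in (1e-4, 1e-6, 1e-8):
        x_final = adadelta_minimize(lambda x: 2.0 * x, x0=1.0, rho=0.95, epsilon=eps)
        print(f"epsilon={eps:g}: x after 1000 steps = {x_final:.6f}")
```

In this formulation no global learning rate is needed: epsilon sets the scale of the very first steps, and rho controls how quickly the per-parameter step size adapts, which is why the tips suggest scanning both.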

Extensions for H2O Deep Learning

- Vision: Convolutional & Pooling Layers PUB-644
- Anomaly Detection PUB-806
- Pre-Training: Stacked Auto-Encoders PUB-1014
- Faster Training: GPGPU support PUB-1013
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms

Contribute to H2O: add your own JIRA tickets.

Key Take-Aways

H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.

H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data!

Join our Community and Meetups: https://github.com/h2oai, http://docs.h2o.ai, www.h2o.ai/community, @h2oai

Thank you!