Sparkling Water
Melbourne Spark Meetup February 2016
Todd Niven
05/02/2023 2
H2O: Distributed In-Memory Machine Learning
ML Algorithms
• Generalized Linear Model
• K-Means Clustering
• Naïve Bayes
• Principal Component Analysis
• Generalized Low Rank Model
• Random Forest
• Gradient Boosted Machine
• Deep Learning
REST API Clients
• R
• Python
• Excel
• Tableau
• Flow (H2O’s native UI)
H2O consists of nodes (1 or more JVMs).
Data is distributed amongst the nodes and stored in memory as (compressed) H2OFrames.
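To give a feel for why compressed in-memory columns matter, here is a minimal Python sketch. It is not H2O's actual codec: the names `compress_column` and `decompress_column` are hypothetical, and the "subtract a bias, then store in the narrowest integer type" scheme is only an assumed example of the kind of trick a columnar store can use.

```python
from array import array

def compress_column(values):
    """Pack a non-empty list of ints as (bias, narrow unsigned array).

    Illustrative scheme only, not H2O's real chunk format.
    """
    bias = min(values)
    span = max(values) - bias
    # Pick the narrowest unsigned type code that can hold the span.
    if span < 2**8:
        code = "B"   # 1 byte per value
    elif span < 2**16:
        code = "H"   # 2 bytes per value
    else:
        code = "Q"   # 8 bytes per value
    return bias, array(code, (v - bias for v in values))

def decompress_column(bias, packed):
    """Recover the original integers by adding the bias back."""
    return [bias + v for v in packed]

# A column whose values all sit in a narrow range above 1000.
ages = [1000 + (i % 90) for i in range(10_000)]
bias, packed = compress_column(ages)
restored = decompress_column(bias, packed)
```

Because every value fits within one byte of the bias, the packed column needs one byte per row, and the round trip is lossless.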
Other Features
• Ensemble
• Grid Search
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/architecture/Architecture.md
1. R user importing data.csv from HDFS
Conceptual Model
http://www.h2o.ai/product/sparkling-water/
H2O Running in Spark: Sparkling Water
Sparkling Water provides the ability to pass RDDs to H2O and back again, allowing feature set generation and machine learning to be done in one environment and at scale.
Features and Benefits:
• Flow interface provides easy-to-digest statistics and visualisations of model performance
• Only 2 datatypes to consider in H2OFrames: numeric and enum
• Analysts can use many of their favourite languages to talk to H2O, e.g. R, Python, Java and Scala
• Accessible and useful from a small single-node local machine to multi-node clusters
• In-memory H2OFrames are highly compressed
• Models can be exported as POJOs to be used directly without needing to start H2O
• Accessible for beginners to predictive modelling as well as having the depth for experts
• Competes in speed and accuracy with the other well-known platforms: https://github.com/szilard/benchm-ml
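The "numeric and enum" point can be illustrated with a small sketch of what an enum (categorical) column amounts to: a list of distinct levels plus one integer code per row. The helper `to_enum` is hypothetical, and sorting the levels is just a choice made for this example, not a statement about H2O's internals.

```python
def to_enum(column):
    """Encode a column of strings as (domain, codes).

    domain: the distinct levels, sorted here for reproducibility.
    codes:  one integer per row, indexing into domain.
    """
    domain = sorted(set(column))
    index = {level: i for i, level in enumerate(domain)}
    codes = [index[value] for value in column]
    return domain, codes

colours = ["red", "green", "red", "blue"]
domain, codes = to_enum(colours)

# Decoding is a plain lookup back into the domain.
decoded = [domain[c] for c in codes]
```

Once strings are reduced to small integers like this, every column is effectively numeric underneath, which is what keeps the type system so small.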
Unlike MLlib, H2O’s algorithms do not run directly on Spark RDDs. Spark RDDs must first be copied to H2O (where they are highly compressed), and the results are then copied back to a Spark RDD.
For building models this is not a drawback. For scoring data, the POJO method can be used, removing any need to start H2O at all.
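The idea behind scoring with an exported model can be sketched as follows. H2O's export is a Java POJO; the Python class below is only an analogy, and its name, coefficients, and feature names are all made up for illustration. The point is that the fitted parameters are baked into plain code, so scoring needs no cluster or modelling library.

```python
import math

class ExportedLogisticModel:
    """Hypothetical stand-in for an exported model.

    The coefficients are hard-coded constants, as they would be in
    generated code; nothing is loaded from a running service.
    """
    INTERCEPT = -1.5
    COEFFICIENTS = {"age": 0.03, "is_member": 0.8}

    def score(self, row):
        # Linear predictor followed by the logistic link.
        z = self.INTERCEPT + sum(
            coef * row[name] for name, coef in self.COEFFICIENTS.items()
        )
        return 1.0 / (1.0 + math.exp(-z))

model = ExportedLogisticModel()
p = model.score({"age": 50, "is_member": 1})
```

A class like this can be dropped into any application and called row by row, which is what makes the approach attractive for deployment.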
Common Machine Learning Workflow
• Business Problem (Data Scientist)
• Data Warehouse
• Build machine learning model (R, Python, SAS)
• Deploy Model (Data Warehouse)
A Scalable Alternative
• Business Problem (Data Scientist)
• HDFS
• Sparkling Water
• Output to systems
• Hadoop Platform
Simplifying The Analytical Workflow
GBM 1-Node Performance Comparison
https://github.com/szilard/benchm-ml
Guessing What’s Next?
• Steam: enterprise grade model factory (champion challenger?, auto deployment?)
• Data munging abilities inside H2O (data.tables capabilities?)
• NLP algorithms (word2vec?)
• Random hyper-parameter search
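For readers unfamiliar with the last item, random hyper-parameter search samples a fixed number of parameter combinations instead of trying every point on a grid. The sketch below uses only the Python standard library. The search space, the `toy_score` function (a stand-in for "train a model and measure validation error"), and the function name `random_search` are all hypothetical; none of this is H2O's API.

```python
import random

# Hypothetical search space for a GBM-like model.
SPACE = {
    "ntrees": [50, 100, 200, 400],
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
}

def toy_score(params):
    """Stand-in for training a model; returns a made-up error (lower is better)."""
    return (abs(params["max_depth"] - 5)
            + abs(params["learn_rate"] - 0.05) * 10
            + abs(params["ntrees"] - 200) / 100)

def random_search(space, score, n_trials, seed=0):
    """Sample n_trials random combinations and keep the best one seen."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.choice(options) for name, options in space.items()}
        s = score(params)
        if s < best_score:
            best_params, best_score = params, s
    return best_params, best_score

# 10 trials instead of the 4 * 4 * 3 = 48 a full grid would need.
best_params, best_score = random_search(SPACE, toy_score, n_trials=10)
```

The trade-off is that the best grid point may be missed, in exchange for a cost that is fixed up front and does not grow with the number of parameters.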