
Page 1: Spark meetup feb 2016

Sparkling Water

Melbourne Spark Meetup February 2016

Todd Niven

Page 2: Spark meetup feb 2016

H2O: Distributed In-Memory Machine Learning

ML Algorithms
• Generalized Linear Model
• K-Means Clustering
• Naïve Bayes
• Principal Component Analysis
• Generalized Low Rank Model
• Random Forest
• Gradient Boosted Machine
• Deep Learning

REST API Clients
• R
• Python
• Excel
• Tableau
• Flow (H2O's native UI)

Other Features
• Ensemble
• Grid Search

H2O consists of nodes (1 or more JVMs). Data is distributed amongst the nodes and stored in memory as (compressed) H2OFrames.
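To make the algorithm list concrete, here is a minimal sketch of invoking one of the listed algorithms (GBM) through H2O's JVM API from Scala. It assumes an H2OFrame has already been loaded into the cluster (for example via Flow or one of the REST clients); the key name "train", the response column "label", and the parameter values are illustrative, and class/field names follow the h2o-3 Java API of the time, which may differ across versions.

```scala
import water.DKV
import water.fvec.Frame
import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters

// Look up a frame that was previously imported into the cluster's distributed store
// (the key name "train" is purely illustrative)
val train: Frame = DKV.getGet("train")

// Configure a Gradient Boosted Machine, one of the algorithms listed above
val params = new GBMParameters()
params._train = train._key
params._response_column = "label" // assumed response column
params._ntrees = 50

// Run distributed training across the H2O nodes and block for the resulting model
val model = new GBM(params).trainModel().get()
println(model)
```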

Page 3: Spark meetup feb 2016

https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/architecture/Architecture.md

Conceptual Model (diagram, step 1): an R user importing data.csv from HDFS.

Page 4: Spark meetup feb 2016

http://www.h2o.ai/product/sparkling-water/

H2O Running in Spark: Sparkling Water

Sparkling Water provides the ability to pass RDDs to H2O and back again, allowing feature-set generation and machine learning to be done in one environment and at scale.
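A minimal sketch of that round trip, written against the Sparkling Water 1.x Scala API as it looked around early 2016 (H2OContext.getOrCreate, asH2OFrame and asDataFrame may have different names or signatures in other releases); the toy DataFrame and column names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.h2o._

val sc = new SparkContext(new SparkConf().setAppName("SparklingWaterDemo"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Start H2O nodes inside the Spark executors and get a handle to them
val h2oContext = H2OContext.getOrCreate(sc)

// Spark side: feature-set generation with RDDs / DataFrames (toy data here)
val features = sc.parallelize(Seq((5.1, 3.5, 0), (6.2, 2.9, 1)))
  .toDF("sepal_len", "sepal_wid", "label")

// Hand the data to H2O as a (compressed, in-memory) H2OFrame for model building
val trainFrame = h2oContext.asH2OFrame(features)

// ...build a model with H2O's algorithms, score it, then pull results back to Spark:
// val scoredDF = h2oContext.asDataFrame(scoredH2OFrame)(sqlContext)
```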

Page 5: Spark meetup feb 2016

Features and Benefits:
• Flow interface provides easy-to-digest statistics and visualisations of model performance
• Only 2 datatypes to consider in H2OFrames: numeric and enum
• Analysts can use many of their favourite languages to talk to H2O, e.g. R, Python, Java and Scala
• Accessible and useful from a small single-node local machine to multi-node clusters
• In-memory H2OFrames are highly compressed
• Models can be exported as POJOs to be used directly without needing to start H2O
• Accessible for beginners to predictive modelling, while having the depth for experts
• Competes in speed and accuracy with the other well-known platforms: https://github.com/szilard/benchm-ml

Compared with MLlib, H2O's algorithms do not run directly on Spark RDDs: Spark RDDs must first be copied to H2O (where they are highly compressed), and the results are then copied back to a Spark RDD.

For building models this is not a drawback. For scoring data, the POJO method can be used, removing the need to start H2O at all.
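A sketch of POJO scoring on the JVM using the h2o-genmodel helper classes: the generated model class name (GBMModelExample below) is hypothetical and stands in for whatever name the export produced, and the feature names and values are illustrative.

```scala
import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import hex.genmodel.easy.prediction.BinomialModelPrediction

// The exported POJO, compiled alongside h2o-genmodel.jar; the class name is hypothetical
val rawModel = new GBMModelExample()
val model = new EasyPredictModelWrapper(rawModel)

// Build a row of features to score; note that no H2O cluster is running here
val row = new RowData()
row.put("sepal_len", "5.1") // the easy-predict wrapper accepts values as strings or boxed doubles
row.put("sepal_wid", "3.5")

// Score the row directly in the JVM process
val prediction: BinomialModelPrediction = model.predictBinomial(row)
println(s"Predicted label: ${prediction.label}")
```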

Page 6: Spark meetup feb 2016

Simplifying The Analytical Workflow

Common Machine Learning Workflow (diagram):
Business Problem (Data Scientist) → Data Warehouse → Build machine learning model (R, Python, SAS) → Deploy Model (Data Warehouse)

A Scalable Alternative (diagram):
Business Problem (Data Scientist) → HDFS → Sparkling Water → Output to systems, all running on the Hadoop Platform

Page 7: Spark meetup feb 2016

GBM 1-Node Performance Comparison

https://github.com/szilard/benchm-ml

Page 8: Spark meetup feb 2016

Guessing What’s Next?

• Steam: enterprise-grade model factory (champion/challenger? auto-deployment?)

• Data munging abilities inside H2O (data.table-style capabilities?)

• NLP algorithms (word2vec?)

• Random hyper-parameter search