H2O the Prediction Engine: Better predictions (https://github.com/0xdata/h2o)

0xdata_h2o_BigDataScience_5.28.2013


DESCRIPTION

Data Science is no longer Rocket Science with H2O. H2O is the open source math and prediction engine for big data. H2O makes Hadoop do math! It scales statistics, machine learning and math over big data. With H2O, everyone can get past tooling and scale issues to discover insights in the data. H2O is extensible, and users can build blocks using simple math legos in the core. H2O keeps familiar interfaces like R, Excel & JSON so that big data enthusiasts and experts can explore, munge, model and score datasets using a range of simple to advanced algorithms. Data collection is easy. Decision making is hard. H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling. H2O has a vision of online scoring and modeling in a single platform.

Citation preview

Page 1: 0xdata_h2o_BigDataScience_5.28.2013

H2O the Prediction Engine

Better predictions

https://github.com/0xdata/h2o

Page 2: 0xdata_h2o_BigDataScience_5.28.2013

H2O makes hadoop do math

Hadoop = opportunity

Not enough Data Scientists

Analysts won’t code Java

Page 3: 0xdata_h2o_BigDataScience_5.28.2013

H2O the Prediction Engine Exploration Modeling Scoring

Big Data

Page 4: 0xdata_h2o_BigDataScience_5.28.2013

H2O the Prediction Engine

Adhoc Exploration

Math Modeling

Real-time Scoring

Big Data Velocity Volume

Page 5: 0xdata_h2o_BigDataScience_5.28.2013

H2O the Prediction Engine

Adhoc Exploration

Math Modeling

Real-time Scoring

Big Data

Messy Clustering

Classification

Ensembles

100’s nanos models

Regression

Page 6: 0xdata_h2o_BigDataScience_5.28.2013

H2O the Prediction Engine

Big Data Exploration Modeling Scoring

Real-time

Page 7: 0xdata_h2o_BigDataScience_5.28.2013

H2O the Prediction Engine

Big Data Exploration Modeling Scoring

Real-time

No New API

Approximate results each step

Page 8: 0xdata_h2o_BigDataScience_5.28.2013

H2O the Prediction Engine

Big Data Exploration Modeling Scoring

Real-time

More Data beats Better Algorithms

Page 9: 0xdata_h2o_BigDataScience_5.28.2013

H2O the Prediction Engine

Big Data Exploration Modeling Scoring

Real-time

More Data and Better Algorithms Scale & Parallelism

Page 10: 0xdata_h2o_BigDataScience_5.28.2013

H2O the Prediction Engine

Big Data Exploration Modeling Scoring

Real-time

More Data and Better Algorithms Scale & Parallelism

fraud detection

Apps

reco engine

Page 11: 0xdata_h2o_BigDataScience_5.28.2013

H2O the Prediction Engine

Intellectual Legacy

Math needs to be free

Open Source

Support and Innovation

https://github.com/0xdata/h2o

Page 12: 0xdata_h2o_BigDataScience_5.28.2013

SriSatish Ambati, CEO & Co-founder: Director of Engineering, DataStax, Cassandra & Hadoop Customers & Platform Marketing, Azul

Cliff Click, CTO & Co-founder: Chief JVM Architect, Azul, Sun, HP, Motorola, JIT & Hotspot

Tomas Nykodym: PhD Security, Intrusion Detection

Cyprien Noel: Founder ObjectFabric, TradeWeb, SmartTrade

Michal Malohlava: PhD DSLs, Compilers

Jan Vitek: Full Professor, Purdue, On Sabbatical, Real-time VM, R/stats Compiler

Kevin Normoyle: AMD Fellow, Distinguished Engineer Sun, Consistency Models

Tom Kraljevic: VP of Engineering, founder Luminix, Azul, PMC-Sierra, Chromatic

Credits & Team

Page 13: 0xdata_h2o_BigDataScience_5.28.2013

Stephen Boyd: Professor of Mathematical Engineering, Stanford, Convex Opt

Trevor Hastie: Professor of Statistics, Stanford, Generalized Additive Models

Rob Tibshirani: Professor of Statistics, Stanford, GLMNet, Lasso

Doug Lea: Malloc for C, fork-join, Java memory model, SUNY Oswego

Dhruba Borthakur: HDFS, Hive, Facebook

Niall Dalton: TimeSeries DB, KX, High-frequency Trading, Cantor-Fitz

Charles Zedlewski: VP Products, Cloudera

Data Science & Advisors

Page 14: 0xdata_h2o_BigDataScience_5.28.2013

Distributed! Extensible, reconfigurable!

Math-at Scale – Simple Legos

H2O

+ σ cov

*

µ mean

n

GLM Logistic Regression

rand shuffle

histogram

Random Decision Trees

OLS

k-means

Page 15: 0xdata_h2o_BigDataScience_5.28.2013

Volume: HDFS

HIVE/SQL

Data Scientist

Munging slice n dice Features

Classification Regression Clustering Optimal Model

Engineer

Velocity: Events Online Scoring

Exploration

Modeling

Offline Scoring

Business Analyst

Ensemble models Low latency

Applications

Predictions

Rule Engine

Before H2O

Page 16: 0xdata_h2o_BigDataScience_5.28.2013

Product Road Map

algos: RandomForest, GLM, ADMM, GLMnet, k-means
data: dense, categorical
api: REST, JSON, R-like console
Scale, Single-Execution GridSearch
In 4 pilots

algos: GroupBy, Grep Unbalanced
App: Fraud Detection
data: sparse
api: R, math, string
Adhoc Analytics
Multi-Execution
Scoring Engine
Event Ingest
In production

algos: GBM, SVM, KNN Optimization
App: RecoEngine
data: sparse
api: Tableau Visualization
Multi-tenant
Library
Big Adoption

Timeline: 1.15.2013, 5.15.2013, 8.15.2013

Page 17: 0xdata_h2o_BigDataScience_5.28.2013

secret sauce move code. not data

Linear Regression

fork/join. data partitioning. fine grain parallelism

phase 1 sums phase 2 distance phase 3 validate

arraylets leaf computes parent aggregates

company confidential. copyright 2012
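The "leaf computes, parent aggregates" idea on this slide can be illustrated with a minimal Python sketch. It is not H2O's Java implementation: the function names (`leaf_sums`, `combine`, `fit_line`) and the single-feature setup are assumptions made for illustration. Each data partition computes a handful of local sums, only those sums travel to the parent, and the regression line is solved from the aggregated totals.

```python
from functools import reduce


def leaf_sums(chunk):
    """Leaf task: one pass over a local partition of (x, y) rows.

    Returns the sufficient statistics (n, sum x, sum y, sum x*x, sum x*y).
    """
    n = sx = sy = sxx = sxy = 0.0
    for x, y in chunk:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Parent task: aggregate two partial results by element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))


def fit_line(chunks):
    """Fit y = slope * x + intercept from per-partition sums only."""
    n, sx, sy, sxx, sxy = reduce(combine, map(leaf_sums, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Three partitions of points that lie on y = 2x + 1.
chunks = [[(0, 1), (1, 3)], [(2, 5)], [(3, 7), (4, 9)]]
slope, intercept = fit_line(chunks)
```

Because `combine` is associative, the partial sums can be merged in any tree shape, which is what lets a fork/join scheduler run the leaves in parallel next to the data.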

Page 18: 0xdata_h2o_BigDataScience_5.28.2013

Fraud Detection
Scoring: Event stream on a ScoreCard Model
Modeling: Random Forest for outlier detection
Modeling: Event sequence patterns

Customer Behavior & Merchant Analytics
Scoring: Purchase event stream scoring on Ensemble Models
Modeling: Logistic Regression models for Customer Engagement

Failure Prediction from Sensor Data
Model device failures and rank vendor graphs.

Upstream Oil Exploration
Distance & Regression on 1TB big data MLS for Oil fields

Use Cases

Page 19: 0xdata_h2o_BigDataScience_5.28.2013

Math & Hadoop users recommend us!

Page 20: 0xdata_h2o_BigDataScience_5.28.2013

Data & Algorithms

SQL | HDFS | S3 | NoSQL

H2O – Real Time

REST

patterns sequences

Distributed Collections Execution

JSON R Excel

Java API

Page 21: 0xdata_h2o_BigDataScience_5.28.2013

Hadoop Ecosystem

HDFS

H2O Map Reduce

Hive Pig

Impala Drill

Batch Interactive

H2O

Page 22: 0xdata_h2o_BigDataScience_5.28.2013

•  Alternating Direction Method of Multipliers (Boyd)
•  Decomposition-coordination
•  Small Local Sub-Problems and Global Coordination
•  Broadcast & Gather
•  Decomposability Dual Ascent + Convergence of Multipliers
•  Block & Component Separability
•  Generalized Gradients (Hastie, Tibshirani, et al)

Generalized Linear Modeling
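For reference, the standard scaled form of the ADMM iteration is written out here. The slide does not show it, so the symbols are assumptions: $f$ is the smooth loss (for a GLM, the negative log-likelihood), $g$ is the regularizer, $\rho > 0$ is the penalty parameter, and $u$ is the scaled dual variable. The problem is $\min_{x,z} f(x) + g(z)$ subject to $x = z$.

```latex
\begin{aligned}
x^{k+1} &= \arg\min_x \; f(x) + \tfrac{\rho}{2}\,\lVert x - z^{k} + u^{k} \rVert_2^2 \\
z^{k+1} &= \arg\min_z \; g(z) + \tfrac{\rho}{2}\,\lVert x^{k+1} - z + u^{k} \rVert_2^2 \\
u^{k+1} &= u^{k} + x^{k+1} - z^{k+1}
\end{aligned}
```

The $x$-update is the "small local sub-problem" that can run independently on each data block; the $z$- and $u$-updates are the global coordination step.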

Page 23: 0xdata_h2o_BigDataScience_5.28.2013

l1 norm regularization

https://github.com/0xdata/h2o/blob/master/src/main/java/hex/DLSM.java
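To make the role of the l1 norm concrete, here is a toy Python sketch. It is not the code in DLSM.java; the function names and the problem are assumptions chosen for illustration. It runs ADMM on the smallest possible l1-regularized problem, minimize 0.5*||x - b||^2 + lam*||z||_1 subject to x = z, where the l1 term shows up as a soft-thresholding step that sets small coefficients exactly to zero.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_l1_denoise(b, lam, rho=1.0, iters=200):
    """ADMM for: minimize 0.5*||x - b||^2 + lam*||z||_1  subject to x = z."""
    n = len(b)
    x = [0.0] * n
    z = [0.0] * n
    u = [0.0] * n  # scaled dual variable
    for _ in range(iters):
        # x-update: closed-form minimizer of the quadratic loss plus penalty
        x = [(bi + rho * (zi - ui)) / (1.0 + rho) for bi, zi, ui in zip(b, z, u)]
        # z-update: soft-thresholding, the l1 proximal step
        z = [soft_threshold(xi + ui, lam / rho) for xi, ui in zip(x, u)]
        # dual update: accumulate the remaining disagreement between x and z
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return z


z = admm_l1_denoise([3.0, 0.5, -2.0], lam=1.0)
```

For this toy problem the answer is known in closed form (each entry of `b` soft-thresholded by `lam`), so the iterates can be checked against it; the entry 0.5 is driven exactly to zero, which is the sparsity effect l1 regularization is used for.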

Page 24: 0xdata_h2o_BigDataScience_5.28.2013

•  Text Book implementation from Breiman’s paper.

•  Data is distributed upon ingest
•  Splits on random selection of features
•  Gini & Entropy
•  Handle NAs (during training)
•  Class-Weighting
•  Stratified Sampling (local)

Random Forest

https://github.com/0xdata/h2o/tree/master/src/main/java/hex/rf
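The two split criteria named on this slide, Gini and entropy, can be sketched in a few lines of Python. This is an illustration rather than the code in the `hex/rf` package, and the helper names (`class_probs`, `split_impurity`) are hypothetical. A tree builder would evaluate candidate splits and keep the one with the lowest weighted impurity.

```python
from collections import Counter
from math import log2


def class_probs(labels):
    """Relative frequency of each class among the labels at a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2); zero for a pure node."""
    return 1.0 - sum(p * p for p in class_probs(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)); zero for a pure node."""
    return -sum(p * log2(p) for p in class_probs(labels))


def split_impurity(left, right, measure):
    """Size-weighted impurity of a candidate split into left and right children."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


mixed = ["a", "a", "b", "b"]
before = gini(mixed)
after = split_impurity(["a", "a"], ["b", "b"], gini)
```

A split that separates the two classes perfectly takes the impurity from `before` down to zero in `after`, which is the largest possible gain for this node.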

Page 25: 0xdata_h2o_BigDataScience_5.28.2013

forest for the tree.. iris dataset

Page 26: 0xdata_h2o_BigDataScience_5.28.2013

•  1% increase in predictive power - $11m @ major online payment system

•  Each fraud scored accurately = expected value of tens of thousands of dollars.

•  Leads cost $10-100/lead – Predicting accurate conversion and quality of leads goes directly to bottom line.

•  Competitive advantage in predicting which assets to acquire.

Models unlock value in data

Page 27: 0xdata_h2o_BigDataScience_5.28.2013

Deployment - commodity / cloud

H2O

x86

H2O is pure java and easy-to-install


H2O

H2O

Page 28: 0xdata_h2o_BigDataScience_5.28.2013

H2O the Prediction Engine

Better predictions

https://github.com/0xdata/h2o

Page 29: 0xdata_h2o_BigDataScience_5.28.2013

H2O the Prediction Engine

Big Data Science

Modeling & Scoring Engine

Approximate results each step

No new API

Use R, Excel & SAS

Scale & Parallelism