Real-World Machine Learning - Leverage the Features of MapR Converged Data Platform
Mathieu Dumoulin ([email protected]) Mateusz Dymczyk ([email protected])
Hadoop Summit Tokyo 2016
© 2016 MapR Technologies, MapR Confidential


Today’s goals
• Machine Learning projects in the Enterprise have a LOT of requirements beyond training a good ML model
• Current options are too complex
• Need a Converged Data Platform
• Introduce specific features useful for ML:
– MapR-FS, Volumes, Mirrors and Topologies
– MapR-DB and MapR Streams


Mathieu Dumoulin, Data Engineer
• Master’s degree in text classification on Hadoop at Fujitsu Canada’s Innovation Lab
• In Tokyo, I’ve worked as a Data Scientist, Search Engineer and Data Engineer
• I like Scikit-Learn and H2O
• I love Japanese food, especially nabe (hot pot) and shabu-shabu.


Mateusz Dymczyk, Software Engineer
• M.Sc. in CS (Software and System Engineering) @ AGH UST, Poland
• Ph.D. (Machine Learning) dropout
• Software Engineer @ H2O.ai
• Previously ML/NLP @ Fujitsu Laboratories and en-japan inc
• I’m taking Sommelier classes


A common machine learning pipeline

*Image from scikit-learn.org


… meets the real world (Enterprise IT)


… meets the real world

Data comes from many sources, maybe very large

Needs ETL and cleaning

Finding the best algorithm and parameters can use a lot of CPU

Data isn’t always labeled!

From production systems? Is it real time?

What server will serve predictions?

The predictions are used by another system...


Machine learning here...


Is not the same when you do it here


Enterprise machine learning matters
Growing number of ML use cases at successful companies
• Anomaly Detection
• Customer 360
• Fraud Detection
• Log Security Analysis
• Recommender Engines
• Sensor Data Analysis (IoT)
• Personalized Offers
• Ad Tech


…but it’s HARD

Ref: http://advancedspark.com/ , https://github.com/fluxcapacitor/pipeline


There must be a better way...


Big data Enterprise IT infrastructure for ML
An ideal platform for ML:
• You can start simple and show value quickly
• It just works. Easy configuration and administration.
• Works with existing systems and tools
• Includes common basics (File storage, DB, Streams)
• Strong ecosystem support (Apache projects)
• Enterprise class (multi-tenancy, security, HA, support)


MapR Converged Data Platform
[Architecture diagram]
• Data Processing: Open Source Engines & Tools, Commercial Engines & Applications, Cloud & Managed Services, Search & Others, Custom Apps
• Utility-Grade Platform Services: Enterprise Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)
• Global Namespace, High Availability, Data Protection, Self-healing, Unified Security, Real-time, Multi-tenancy
• Unified Management and Monitoring


MapR is great for Enterprise ML projects

• MapR-FS and NFS mount
• Volumes and Topologies
• Mirrors and Snapshots


MapR Filesystem
Working, battle-tested distributed read-write filesystem
• Native implementation in C/C++, it’s fast
• Use it like your own local filesystem
• Everything that can use files works as usual
• Unique MapR technology
• For more info, watch on YouTube:
– What is MapR-FS
– MapR-FS vs. HDFS


NFS Mount
Mount the cluster as a regular folder

$> sudo mount -o hard,nolock ip-10-0-0-110:/mapr /mapr
$> ll /mapr/hadoopsummit/
total 3
drwxr-xr-x. 3 mapr mapr 1 Oct 13 11:21 apps
drwxr-xr-x. 2 mapr mapr 0 Oct 13 11:12 hbase
drwxr-xr-x. 3 root root 1 Oct 13 11:21 installer
drwxr-xr-x. 2 mapr mapr 0 Oct 13 11:14 opt
drwxrwxrwx. 2 mapr mapr 1 Oct 14 10:41 tmp
drwxr-xr-x. 6 mapr mapr 4 Oct 14 10:52 user
drwxr-xr-x. 3 mapr mapr 1 Oct 13 11:13 var
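Because the cluster is mounted as a regular folder, ordinary file APIs should be enough to read data from it. A minimal Python sketch, assuming the `/mapr/hadoopsummit` mount point from the listing above; the `user/mapr` data location and the helper names `load_rows` and `list_csv_files` are hypothetical and only for illustration:

```python
import csv
from pathlib import Path

# Mount point taken from the listing above; paths below it are hypothetical.
CLUSTER_ROOT = Path("/mapr/hadoopsummit")


def load_rows(csv_path):
    """Read a CSV file into a list of dicts using only the standard library."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def list_csv_files(directory):
    """Return the CSV files directly under a directory, sorted by name."""
    return sorted(Path(directory).glob("*.csv"))


if __name__ == "__main__":
    data_dir = CLUSTER_ROOT / "user" / "mapr"  # hypothetical data location
    for path in list_csv_files(data_dir):
        print(path.name, len(load_rows(path)), "rows")
```

Nothing here is cluster-specific: the same functions work on a laptop directory, which is the point of the NFS mount.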


MapR NFS and Volumes

[mapr@ip-10-0-0-110 mapr]$ pwd
/mapr/hadoopsummit/user/mapr


MapR-FS and NFS mount for ML
• Get started quickly and simply
• Use your favorite tool like...
– Custom code (Scikit-learn, R)
– SPSS, SAS, RapidMiner
– Apache Spark, Drill, Flink
• Super easy data import
– Just save to file on MapR
– Integrate with legacy servers and code
– Use any ecosystem tool (e.g. Sqoop), it all works
• Quick and scalable roundtrip during development
– ETL/cleaning -> train/test -> predict
– Don’t copy data (cluster to cluster, local to cluster)
• Run in production direct from the cluster
– no copying around
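The ETL/cleaning -> train/test -> predict roundtrip can be sketched with plain files, where each stage reads what the previous stage wrote in the same mounted directory. This is only an illustration under assumptions: the `delay` column, the file names and the toy mean-threshold "model" are hypothetical stand-ins for real data and a real library such as Scikit-learn.

```python
import csv
import json
from pathlib import Path


def clean(raw_csv, clean_csv):
    """ETL step: keep only rows whose 'delay' field parses as a number."""
    with open(raw_csv, newline="") as src, open(clean_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            try:
                float(row["delay"])
            except (TypeError, ValueError):
                continue  # drop malformed rows
            writer.writerow(row)


def train(clean_csv, model_path):
    """Training step: a toy 'model' that stores the mean delay as a threshold."""
    with open(clean_csv, newline="") as src:
        delays = [float(row["delay"]) for row in csv.DictReader(src)]
    model = {"threshold": sum(delays) / len(delays)}
    Path(model_path).write_text(json.dumps(model))
    return model


def predict(model_path, delay):
    """Prediction step: load the saved model and classify one value."""
    model = json.loads(Path(model_path).read_text())
    return delay > model["threshold"]
```

If all three paths live under the cluster mount, development and production read the same files and nothing has to be copied between machines.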


Volumes and Topologies - Managed in MCS


Volumes and Topologies
Volumes are just “regular” volumes

Selecting which nodes hold a volume’s data = Topology


Volumes and Topologies for ML
• With YARN’s Node Labels, run tasks on nodes with guaranteed data locality
– Special nodes with GPU, high memory or big CPU
• Multi-Tenancy
– Share cluster with business use cases in production
– Data isolation guaranteed
– Easy unified admin (Data scientists != Hadoop admin)
– Bigger cluster, more reliable and faster


Snapshots and Mirrors


Snapshots - Instant point in time save


Mirrors - Physical copy


Snapshots

[... mateusz]$ cd .snapshot
[... .snapshot]$ ll
total 1
drwxr-xr-x. 2 mapr mapr 1 Oct 14 10:56 mateusz.snap1


Snapshots and Mirrors for ML

• Versioned data and models = Repeatable results

– same model, same data guaranteed

– Go back in time for free

• Keep intermediate transformations

– Quickly change your mind, don’t redo work

• A/B Testing easy-mode
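One way to get "same model, same data" is to train from a snapshot path instead of the live directory and to record which snapshot was used. A small sketch, assuming snapshots are exposed under a `.snapshot` directory as in the listing earlier; the helper names, the manifest layout and the example paths are hypothetical:

```python
import json
from pathlib import Path


def snapshot_path(volume_dir, snapshot_name, relative_file):
    """Build the path of a file as it existed in a named snapshot."""
    return Path(volume_dir) / ".snapshot" / snapshot_name / relative_file


def write_run_manifest(manifest_path, volume_dir, snapshot_name, params):
    """Record which snapshot and parameters produced a model."""
    manifest = {
        "volume_dir": str(volume_dir),
        "snapshot": snapshot_name,
        "params": params,
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

A training job would then read, for example, `snapshot_path(volume_dir, "mateusz.snap1", "train.csv")`, where `volume_dir` is the directory the volume is mounted at and `train.csv` is a hypothetical file name. Re-running against the same snapshot name should see the same bytes even if the live data has changed since.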


Real-time events and DB for ML
• Built-in, no config, it just works
• Support next-gen use cases
– hyper-personalization of web/store content
– IoT Sensor data
• Easy to start small, but grows with your data/use case


MapR Converged Application Blueprint
• Microservices connected by real-time streams
– Ideal to serve predictions from ML models
• Next-Generation large-scale architecture
• Working example: https://www.mapr.com/appblueprint/overview
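The shape of a prediction microservice fed by a stream can be shown in-process. This sketch is not MapR Streams client code: it uses Python's `queue.Queue` as a stand-in for the input and output topics, and the lambda "model" and event fields (`id`, `value`) are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel telling the service to shut down


def scoring_service(events, predictions, model):
    """Consume events, apply the model, publish one prediction per event."""
    while True:
        event = events.get()
        if event is STOP:
            predictions.put(STOP)
            break
        predictions.put({"id": event["id"], "score": model(event["value"])})


def run_demo(values):
    """Push values through the service and collect its predictions in order."""
    events, predictions = queue.Queue(), queue.Queue()
    model = lambda x: x > 0.5  # hypothetical stand-in for a trained model
    worker = threading.Thread(target=scoring_service, args=(events, predictions, model))
    worker.start()
    for i, value in enumerate(values):
        events.put({"id": i, "value": value})
    events.put(STOP)
    results = []
    while True:
        item = predictions.get()
        if item is STOP:
            break
        results.append(item)
    worker.join()
    return results
```

The producer and the consumer only share the two queues, so the scoring service can be replaced or scaled without touching the code that emits events. With a real stream, the queues would become topics and the service a separate process.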


Converged Data Platform 💖 Machine Learning

• Features that work together to support all phases of ML

• Supports your existing tools/code and the state-of-the-art large-scale frameworks

• Easier to manage, more robust and secure.

• MapR is made for the enterprise and great for ML!


Demo of H2O on MapR: Features in Action


Agenda
• Why tooling matters in Machine Learning
• What is H2O and Sparkling Water
• Why MapR
• Demo


ML project problems
• Multiple data sources
• Different formats
• Large volumes of data to be read
• System bootstrap time
• Collaboration between data scientists
• Comparing models
• Deployment of the model
• Versioning
• Too many moving parts!
• etc.


Successful ML platform
• Fast ingestion and manipulation of versatile data
• Intuitive modeling UI/API
• Easy model validation, visualisation and comparison
• Easy model deployment w/ versioning for fast predictions


What is H2O?
• Open source platform
• Exposes math and predictive algorithms
• GLM, Random Forest, GBM, Deep Learning etc.
• Written in high performance Java - native Java API
• Supports multiple file formats and data sources
• ETL capabilities
• Highly parallel and distributed implementation
• Fast in-memory computation on highly compressed data
• Allows you to use all your data without sampling
• Runs on top of most major Hadoop distributions

ML platform, Ingestion platform, Big data platform


FlowUI

• Notebook style open source interface for H2O

• Code execution, mathematics, plots, and rich media


Why H2O?
• Fast ingestion and manipulation of versatile data
– Blazing fast data parsing, supports multiple formats and data sources
• Intuitive modeling UI/API
– FlowUI, R/Python/REST APIs
• Easy model validation, visualisation and comparison
– Cross-validation, FlowUI graphs, comparison via Steam
• Easy model deployment w/ versioning for fast predictions
– Model export as POJO, deploy as service via Steam


What is Sparkling Water?
• Framework integrating Spark and H2O
• H2O instances on Spark executors
• Lets you call Spark and H2O methods together


Why MapR?
• H2O + MapR-FS = fast data ingestion made even faster
• Data resilience
• MapR snapshots + H2O modelling from checkpoints = continuous and versioned modelling


Demo


Airline delay classification
Model predicting flight delays

ETL -> Modelling -> Predictions
• Load data from CSVs
• Model using H2O’s GLM

* https://github.com/h2oai/sparkling-water/tree/master/examples/scripts
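For readers without a Spark setup, the same load-CSV-then-GLM flow can be approximated with H2O's Python API directly. This is a sketch and not the Sparkling Water script used in the demo; the target column name `IsDepDelayed` and the CSV path passed in are assumptions about the dataset.

```python
def feature_columns(all_columns, target, excluded=()):
    """Every column except the target and any explicitly excluded ones."""
    skip = {target, *excluded}
    return [name for name in all_columns if name not in skip]


def train_delay_glm(csv_path, target="IsDepDelayed", excluded=()):
    """Fit a binomial GLM on a CSV of flights; return (model, validation AUC)."""
    # Imported here so the helper above stays usable without H2O installed.
    import h2o
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    h2o.init()  # start or connect to a local H2O instance
    flights = h2o.import_file(csv_path)
    flights[target] = flights[target].asfactor()  # treat target as categorical
    train, valid = flights.split_frame(ratios=[0.8], seed=42)

    glm = H2OGeneralizedLinearEstimator(family="binomial")
    glm.train(
        x=feature_columns(flights.columns, target, excluded),
        y=target,
        training_frame=train,
        validation_frame=valid,
    )
    return glm, glm.auc(valid=True)
```

Columns that leak the answer (for example an actual delay in minutes, if the dataset has one) should be passed in `excluded`. With the cluster NFS-mounted, `csv_path` can simply point at a file under the mount.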


Q & A

Engage with us!
@mapr
[email protected]
mapr-technologies