Zenika matinale spark-zeppelin_ml

Big Data Morning Session (Matinale Big Data)

Spark and Machine Learning

Zenika Lyon, 25/05/16

Hervé RIVIERE

Big Data / NoSQL developer, Couchbase trainer

Fabrice SZNAJDERMAN

Java / Scala / Web developer, Java / Scala trainer, co-organizer of ScalaIO 2016 (Lyon, 27-28 October)

Big Data: Spark + Machine Learning

Agenda

1 Big Data: 2016 overview (15')

2 Introduction to Apache Spark and Apache Zeppelin (45')

3 Break (30')

4 Demystifying Machine Learning (45')

Big Data: 2016 overview

Big Data?

From 2014 to 2017…

2014

• POC / experimentation

• Analytical use

• Hadoop MapReduce / HDFS / Pig / Hive / HBase / Storm…

2015

• Industrializing the data lake / building an analytical Big Data platform

• Streaming POCs / operational Big Data platform

• Spark / Cassandra / HDFS / Kafka / Storm / Samza / Mesos

2016

• Industrializing streaming / operational Big Data platform

• Experimentation / POCs for predictive Big Data / Machine Learning

• Kafka / Spark / Flink / HDFS / web notebooks / Cassandra / Mesos…

2017

• Industrializing predictive Big Data / Machine Learning? Internet of Things?

• Kafka Streams? / Kudu? / Spark 2.0? / Flink? …

What is Big Data for?

• Business intelligence: descriptive statistics on data with a high information density. Example: CRM data in a database.

• Big Data: data with a low information density, but whose sheer volume makes it possible to infer laws / rules (inferential statistics). Example: sensor data in a data lake.

• Fast Data: transform data in real time instead of through daily / weekly / monthly batch jobs. Example: website data in Kafka topics.

Example projects

• 360° customer view (banking / retail / services…)

  o React to certain cross-channel events

  o Recommendation

  o Business-specific ad hoc analysis (marketing, fraud…)

• Log / sensor data analysis (industry, services, IT…)

  o Automate human monitoring

  o Analyze, then optimize

• Offload business intelligence tools with Big Data technologies

  o For scalability

  o For new possibilities (real time, more flexible schema, speed…)

Our services

• Big Data architecture

• Industrialization

• Development

• Java / Scala POCs

• Dataviz

• Training

• Industrializing machine learning algorithms

• Technical expertise

• Innovation workshops

Tools and architectures

[Figure: landscape of Big Data tools by category]

Categories: Streaming, Query/SQL, Machine Learning, Search Engine, Scheduler, Service Discovery, Resource Manager, File System, Columns, Document, Key-Value, In-memory/Cache, Time-Series

Tools shown: Kafka, NiFi, Flink, Storm, ZooKeeper, Spark, YARN, Cassandra, MongoDB, Neo4j, Titan, Couchbase, Druid, InfluxDB, Hazelcast, Aerospike, Kylin, Solr, Elasticsearch, Mahout, Tez, Slider, Impala, HAWQ, HBase, HDFS

Big Data architectures

Real-time / operational layer

Batch / analytics layer

Queries

Data

Analytics

From 3 to 300 nodes! Store / process a (very) large volume of data (terabytes…) at regular intervals. An analytical system, not an operational one!

Storage: commonly used tool

As a complement or alternative

Execution: commonly used tool

Scheduler

• NiFi

• Oozie

Web notebook

• Zeppelin

• Jupyter

Data mining / machine learning

• R / Python

• Mahout / H2O

• Dataiku

• Sqoop

• Kafka

Resource negotiator

• YARN

• Mesos

Operational

From 3 to 300 nodes! Process a large volume of data in real time. An operational system, not an analytical one!

Storage: commonly used tool

Execution: commonly used tool

Schema registry

• Avro

API I/O

• Akka

• Spring

• Play…

Resource negotiator

• YARN

• Mesos

Partners

Our consulting and training partners

Languages & Big Data ecosystem

Integration & continuous delivery

Spark & Zeppelin

Spark and ML morning session

25/05/16, Fabrice Sznajderman

Agenda

● Apache Spark

● Apache Zeppelin

Spark: introduction

Spark introduction: big picture

What is it about?

● A cluster computing framework

● Open source

● Written in Scala

History

2009: Project started at UC Berkeley's AMPLab

2010: Project open-sourced

2013: Became an Apache project; the Databricks company was created

2014: Became a top-level Apache project and the most active project in the Apache Software Foundation (500+ contributors)

2014: Release of Spark 1.0, 1.1 and 1.2

2015: Release of Spark 1.3, 1.4 and 1.5 (1.6 followed in early January 2016)

2015: IBM, SAP… invest in Spark

2015: 2,000 registrations at Spark Summit SF, 1,000 at Spark Summit Amsterdam

2016: New Spark Summit in San Francisco in June 2016

Where is Spark used?

Source: http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf?t=1443057549926

The results reflect the answers and opinions of over 1,417 respondents representing over 842 organizations.

What is it used for?

Multi-language

Spark Shell

● REPL

● Learn the API

● Interactive analysis

RDD: core concept

Definition

● Resilient

● Distributed

● Datasets

Properties

● Immutable

● Serializable

● Can be persisted in RAM and / or on disk

● Simple or complex types

Use as a collection

● DSL

● Monadic type

● Several operators

– map, filter, count, distinct, flatMap, ...

– join, groupBy, union, ...

Created from

● A collection (List, Set)

● Various file formats

– JSON, text, Hadoop SequenceFile, ...

● Various databases

– JDBC, Cassandra, ...

● Other RDDs

Sources should be natively distributed (HDFS, Cassandra, ...); otherwise the network becomes the bottleneck.

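Sample

import org.apache.spark.{SparkConf, SparkContext}

// Configure a local Spark application named "sample".
val conf = new SparkConf()
  .setAppName("sample")
  .setMaster("local")
val sc = new SparkContext(conf)

// Load a text file as an RDD of lines.
val rdd = sc.textFile("data.csv")

// Count the lines longer than 10 characters.
val nb = rdd.map(s => s.length).filter(i => i > 10).count()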

Lazy evaluation

● Intermediate operators (transformations): only describe the computation

– map, filter, distinct, flatMap, …

● Final operators (actions): trigger the computation

– count, mean, fold, first, ...

val nb = rdd.map(s => s.length).filter(i => i > 10).count()
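The distinction between intermediate and final operators can be observed directly. The following is a minimal sketch, assuming Spark 2.0 or later on the classpath (for `longAccumulator`) and a local master; the object name `LazyEvalSketch` and the sample strings are made up for illustration. An accumulator counts how often the `map` function really runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  // Returns (map calls seen before the action, map calls after it, result of count()).
  def run(): (Long, Long, Long) = {
    val conf = new SparkConf().setAppName("lazy-eval-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val mapCalls = sc.longAccumulator("mapCalls") // assumption: Spark 2.0+ accumulator API
      val lines = sc.parallelize(Seq("short", "a much longer line", "another long line"))

      // Transformations only describe the computation; nothing runs yet.
      val longLengths = lines
        .map { s => mapCalls.add(1); s.length }
        .filter(_ > 10)
      val before = mapCalls.value.longValue

      // count() is an action: it triggers the whole chain.
      val nb = longLengths.count()
      (before, mapCalls.value.longValue, nb)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In this local run the accumulator stays at zero until `count()` is called, which is the behaviour the slide describes.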

Caching

● Reuse an intermediate result

● cache operator

● Avoids recomputing

// r is computed once, then reused by both actions below.
val r = rdd.map(s => s.length).cache()
val nb = r.filter(i => i > 10).count()
val sum = r.filter(i => i > 10).sum()

Distributed architecture: core concept

Run locally

// Choose one master URL:
val master = "local"        // a single worker thread
// val master = "local[*]"  // as many worker threads as logical cores
// val master = "local[4]"  // four worker threads
val conf = new SparkConf().setAppName("sample").setMaster(master)

Run on cluster

val master = "spark://..."
val conf = new SparkConf().setAppName("sample").setMaster(master)

[Figure: cluster, master and client]

Modules: core concept

Composed of

Spark Core

Streaming, MLlib, GraphX

ML Pipeline, DataFrames

Several data sources

Usage statistics

http://prog3.com/article/2015-06-18/2824958

Spark SQL

●Structured data processing

●SQL Language

●DataFrame

DataFrame 1/3

● A distributed collection of rows organized into named columns

● An abstraction for selecting, filtering, aggregating and plotting structured data

● Provides a schema

● Not an RDD replacement

DataFrame 1/3

● RDDs are more efficient than what came before (Hadoop)

● But RDDs are still too complicated for common tasks

● DataFrames are simpler and faster

DataFrame 2/3

Optimized

DataFrame 3/3

● Available since Spark 1.3

● The DataFrame API is just an interface

– The implementation is done once, in the Spark engine

– All languages benefit from the optimizations without rewriting anything
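To make the "selecting, filtering, aggregating" description concrete, here is a small sketch. It assumes Spark 2.0 or later (it uses `SparkSession`; on Spark 1.x a `SQLContext` would play that role), and the object name `DataFrameSketch` and column names are illustrative. The rows reuse the housing figures shown later in this deck.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameSketch {
  // Average price per type, for homes larger than 150 m².
  def run(): Array[(String, Double)] = {
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._
    try {
      // Named columns give the DataFrame its schema.
      val homes = Seq(
        ("Apartment", 120, 200000.0),
        ("House", 200, 250000.0),
        ("House", 450, 700000.0)
      ).toDF("type", "surface", "price")

      homes
        .filter($"surface" > 150) // filtering
        .groupBy("type")          // grouping
        .avg("price")             // aggregating
        .as[(String, Double)]     // tuple columns are mapped by position
        .collect()
    } finally spark.stop()
  }
}
```

The query is expressed declaratively, which is what lets the engine optimize it independently of the calling language.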

Spark Streaming

● Framework built on the RDD and DataFrame APIs

● Real-time data processing

● The RDD abstraction becomes the DStream here

● Same as before, but the dataset is not static

Spark Streaming: internal flow

http://spark.apache.org/docs/latest/img/streaming-flow.png

Spark Streaming: inputs / outputs

http://spark.apache.org/docs/latest/img/streaming-arch.png

Spark MLlib

● Makes practical machine learning scalable and easy

● Provides common learning algorithms & utilities

● Divided into two packages

– spark.mllib

– spark.ml

Spark MLlib: spark.mllib

● Original API, based on RDDs

● Each model has its own interface

Spark MLlib: spark.ml

● Provides a uniform set of high-level APIs

● Built on top of DataFrames

● Pipeline concepts

– Transformer

– Estimator

– Pipeline

● Transformer: transform(DF)

– maps a DataFrame to a new one by adding a column

– e.g. predicts the label and adds the result as a new column

● Estimator: fit(DF)

– a learning algorithm

– produces a model from a DataFrame

● Pipeline

– a sequence of stages (Transformers or Estimators)

– run in a specific order
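The three spark.ml concepts can be seen together in a short sketch. It assumes Spark 2.0 or later with the spark-mllib module on the classpath; the object name `PipelineSketch` and the toy sentences and labels are made up for illustration. `Tokenizer` and `HashingTF` are Transformers, `LogisticRegression` is an Estimator, and the fitted pipeline is itself a Transformer.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  // Trains a tiny text classifier and returns its predictions on two unseen rows.
  def run(): Array[Double] = {
    val spark = SparkSession.builder()
      .appName("pipeline-sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      // Toy training set (made up): label 1.0 when the text mentions "spark".
      val training = spark.createDataFrame(Seq(
        (0L, "a b c d e spark", 1.0),
        (1L, "b d", 0.0),
        (2L, "spark f g h", 1.0),
        (3L, "hadoop mapreduce", 0.0)
      )).toDF("id", "text", "label")

      // Stages, in the order they must run.
      val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
      val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
      val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
      val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

      // fit() runs the Estimator and returns a model (a Transformer).
      val model = pipeline.fit(training)

      // transform() adds a "prediction" column to the input DataFrame.
      val test = spark.createDataFrame(Seq((4L, "spark i j k"), (5L, "l m n"))).toDF("id", "text")
      model.transform(test).select("prediction").collect().map(_.getDouble(0))
    } finally spark.stop()
  }
}
```

Because every stage reads and writes DataFrame columns, stages can be swapped without changing the surrounding code.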

Spark 2.0

3 axes

● Easier

● Faster

● Smarter

Spark 2.0: Easier

● Unifying DataFrames and Datasets in Scala/Java

● SparkSession (replaces SQLContext & HiveContext)

● Simpler, more performant Accumulator API

● The spark.ml package emerges as the primary ML API

Spark 2.0: Faster

According to the Databricks 2015 Spark Survey, 91% of users consider performance the most important aspect of Spark.

● Second generation of the Tungsten engine

● Builds upon ideas from

– Modern compilers

– Massively parallel processing (MPP) databases

● Improvements to Spark SQL's Catalyst optimizer

Spark 2.0: Smarter

● Structured Streaming API

● Based on the Catalyst optimizer

● Unifying DataFrames and Datasets

Spark 2.0: Try it

This technical preview version is now available on Databricks:

https://databricks.com/try-databricks

Zeppelin: introduction

Zeppelin introduction: big picture

What is it about?

● "A web-based notebook that enables interactive data analytics"

● 100% open source

● Undergoing incubation, but… now a top-level project at the ASF!

Multi-purpose

● Data Ingestion

● Data Discovery

● Data Analytics

● Data Visualization & Collaboration

Multiple language backends

● Scala

● Shell

● Python

● Markdown

● Your own language, by creating your own interpreter

Data visualization: an easy way to build charts from data

Thank you

Demystifying Machine Learning

Spark and ML morning session

25/05/16, Hervé RIVIERE

Demystifying Machine Learning: agenda

1 Machine Learning?

2 Fundamentals

3 Data preparation

4 Algorithms

5 Tools

6 Setting up an ML project

Machine Learning?

Machine learning: "Field of study that gives computers the ability to learn without being explicitly programmed." Arthur Samuel

Solves tasks that people are good at, but traditional computation is bad at.

Programs that write new programs

Orange: "Save the Liveboxes". Prevent lightning damage by asking customers to unplug their equipment.

Fnac: marketing targeting / sending recommendation emails. Moving from a solution based on static business rules to machine learning algorithms. Optimizing ROI.

Replace static business rules with a self-learning algorithm.

1- Risk measurement (example: loan rate depending on the application)

2- Recommendation (example: movie recommendation, ads)

3- Revenue prediction

4- Predicting customer behaviour (unsubscribing, hotline call…)

Be able to detect and react to weak signals

1- Forecasting and / or detecting a failure

2- Medical diagnosis

3- Machine control: optimizing power consumption

Better understand a dataset through the correlations found by ML algorithms

1- Detect / identify weak signals (e.g. fraud, marketing…)

2- Segmentation into different categories (example: advertising campaign)

Machine Learning, Regression, Deep Learning, Clustering, Data Science, Feature engineering (…)

Fundamentals

Predictive variables (features): Type, Surface, Number of rooms, Construction year. Numeric target variable: Price.

Type | Surface (m²) | Number of rooms | Construction year | Price (€)
Apartment | 120 | 4 | 2005 | 200 000
House | 200 | 7 | 1964 | 250 000
House | 450 | 15 | 1878 | 700 000
Apartment | 300 | 8 | 1986 | ?????

Predicting a numeric value: regression algorithm

Textual target variable (= class): Type.

Type | Surface (m²) | Number of rooms | Construction year | Price (€)
Apartment | 120 | 4 | 2005 | 200 000
House | 200 | 7 | 1964 | 250 000
House | 450 | 15 | 1878 | 700 000
???? | 300 | 8 | 1986 | 600 000

Predicting a textual value: classification algorithm

[Figure: observations (x axis, 0 to 25) vs. actual income, showing the predictive function and the random noise]

Actual price = f(X) + a

f(X): the ML model

a: unpredictable deviation

The prediction is never exact!

If "a" is too large… the data is not predictable!

[Figure: workflow of an ML project]

Sources: Open Data, crawling

Preparation: training dataset with predictive and target variables

Model building: generate a program (i.e. the model)

Production: use the generated program

Model input: hypotheses (predictive variables). Model output: prediction (target variable).

Cars

• Predicting the near future from the past

• Approximating a pattern from examples

• Copying a behaviour as a "black box" (input and output only)

• Algorithms that adapt

Data preparation

- Completeness: missing fields?

- Scale: revenue per country vs. number of purchases per region!

- Accuracy: is the data real?

- Freshness: data from the 19th century?

- Format: CSV, images, JSON, JSON database

- Aggregate

- Enrich

A | B | C | D | E | F | G | H
10 | 3 | 2 | 5 | 7 | 43 | 2 | 4
1 | 24 | 34 | 5 | 876 | 7 | 6 | 52
43 | 24 | 1 | 558 | 23 | 4 | 5 | 6

ML algorithms

Mean of X: 9. Mean of Y: 7.5.

• A potentially (very…) long task

• Thankless?

• Directly influences the model

• Good data preparation beats good algorithms!

Algorithms

Algorithms: regression

Predicting a numeric value (numeric target variable, e.g. a price in €): regression algorithm

2D illustration; most models have 5, 10, … 1000 dimensions

[Figure: two plots of observations (x axis, 0 to 25) vs. actual income, one with a linear fit and one with a polynomial fit]

Linear: f(X) = aX + b (with "a" and "b" discovered automatically)

Polynomial: f(X) = aX^y + bX^z + … (with "a", "b", "y" and "z" discovered automatically)

Program generated by the algorithm after training: a mathematical formula

House price = 2*nbPieces + 3*surface

The algorithm makes successive attempts to find the curve that minimizes the error

Simple to visualize / understand

Supervised algorithm (requires prior training)

Can be used for predictive or descriptive purposes

Very sensitive to initial preparation (outliers…)

Assumes the data can be modeled as equations
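As an illustration of how "a" and "b" can be discovered automatically in the linear case, here is a minimal sketch in plain Scala. Note the swap: instead of the successive trials described on the slide, it uses the closed-form least-squares solution, which yields the error-minimizing line directly for a single feature. The object name `LinearFit` and the sample points are made up.

```scala
object LinearFit {
  // Ordinary least squares for f(X) = a*X + b on a single feature.
  // Returns the pair (a, b) that minimizes the sum of squared errors.
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.size == ys.size && xs.size >= 2, "need at least two (x, y) pairs")
    val n = xs.size
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Covariance of x and y, and variance of x (both left unnormalized).
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(varX > 0, "all x values are identical")
    val a = cov / varX
    val b = meanY - a * meanX
    (a, b)
  }

  // The "generated program": a formula applied to a new observation.
  def predict(model: (Double, Double), x: Double): Double = model._1 * x + model._2

  def main(args: Array[String]): Unit = {
    val model = fit(Seq(1.0, 2.0, 3.0, 4.0), Seq(3.0, 5.0, 7.0, 9.0))
    println(s"a = ${model._1}, b = ${model._2}, f(5) = ${predict(model, 5.0)}")
  }
}
```

With several features the same idea applies, but the solution is usually computed iteratively or with matrix algebra, which is where a library such as MLlib comes in.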

Price of a house: if 10+ rooms…

[Figure: decision tree. Decision nodes: Rooms > 10, Surface > 300, Floor <= 3, City = Paris, House / Apartment, each with Yes / No branches. Leaf prices: 300 000 €, 200 000 €, 900 000 €, 700 000 €, 400 000 €, 600 000 €]

Program generated by the algorithm after training: conditions

if (surface > 10 && piece == 3) {
  if (type == maison) 250 000
  else if (type == appartement) 150 000
} else 145 000

Supervised algorithm (requires prior training)

Less sensitive to the quality of data preparation

Parameters to set: number of trees, depth, etc.

Several trees trained on varied subsets can be combined: Random Forest

Random forest is currently one of the best-performing algorithms
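How several trees are combined can be sketched in a few lines of plain Scala. This is only an illustration of the combining step: the three "trees" below are hand-written conditions with made-up thresholds and prices, standing in for trees that a real algorithm would learn from different subsets of the training data. The names `ForestSketch`, `Home` and `Tree` are hypothetical.

```scala
object ForestSketch {
  final case class Home(kind: String, surface: Double, rooms: Int)

  // A tree is just a function from an observation to a predicted price.
  type Tree = Home => Double

  // Hand-written stand-ins for trained trees (thresholds and prices are made up).
  val trees: Seq[Tree] = Seq(
    h => if (h.rooms > 10) 700000.0 else 250000.0,
    h => if (h.surface > 300) 600000.0 else 200000.0,
    h => if (h.kind == "House") 400000.0 else 300000.0
  )

  // For regression, the forest averages the predictions of its trees.
  def predict(forest: Seq[Tree], h: Home): Double = {
    require(forest.nonEmpty, "the forest needs at least one tree")
    forest.map(tree => tree(h)).sum / forest.size
  }

  def main(args: Array[String]): Unit =
    println(predict(trees, Home("Apartment", 120, 4)))
}
```

For classification, the averaging step would typically be replaced by a majority vote over the classes predicted by each tree.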

Algorithms: classification

Predicting a textual value (textual target variable = class, e.g. House / Apartment): classification algorithm

Sick / healthy

Movie recommendation

Turning a regression problem (e.g. the price of a house) into a classification problem:

"Will this house sell for more than the average price in the city?" Yes / No
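The relabelling step can be sketched in plain Scala. As a simplifying assumption, the "average price in the city" is taken to be the mean of the prices passed in; the object name `PriceLabels` is made up, and the sample prices are the three known prices from the housing table.

```scala
object PriceLabels {
  // Replaces each numeric price by a class: "Yes" if above the mean, "No" otherwise.
  def aboveMean(prices: Seq[Double]): Seq[String] = {
    require(prices.nonEmpty, "need at least one price")
    val mean = prices.sum / prices.size
    prices.map(p => if (p > mean) "Yes" else "No")
  }

  def main(args: Array[String]): Unit =
    println(aboveMean(Seq(200000.0, 250000.0, 700000.0)))
}
```

The resulting "Yes" / "No" column becomes the target variable, and any two-class classification algorithm can then be trained on it.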

Minimize the error

Works with two categories only!

[Figure: decision tree. Node and leaf labels: Drink = alcohol, Price > 30 €, Minced steak, Drink = wine, Lunch / Dinner, Yes / No branches, Child, Teenager, Adult, Senior]

Unsupervised algorithm (no training)

Used for recommendation algorithms (Netflix)

The number of categories is set by the user or determined dynamically

The name / description of each category must be defined by the user
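The slides do not name a specific unsupervised algorithm; as one common example, here is a minimal k-means sketch in plain Scala, restricted to one-dimensional points to stay short. The number of categories is the number of initial centroids supplied by the caller. The object name `KMeans1D`, the parameters and the sample points are all made up for illustration.

```scala
object KMeans1D {
  // Lloyd's algorithm on one-dimensional points.
  // `initial` holds the starting centroids; its size is the number of categories.
  def cluster(points: Seq[Double], initial: Seq[Double], maxIter: Int = 20): Seq[Double] = {
    require(points.nonEmpty && initial.nonEmpty, "need points and initial centroids")
    var centroids = initial
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      // Assignment step: group each point with its nearest centroid.
      val groups = points.groupBy(p => centroids.minBy(c => math.abs(p - c)))
      // Update step: move each centroid to the mean of its group (unchanged if empty).
      val updated = centroids.map(c => groups.get(c).map(g => g.sum / g.size).getOrElse(c))
      changed = updated != centroids
      centroids = updated
      iter += 1
    }
    centroids
  }

  // Index of the category (centroid) closest to a new point.
  def assign(centroids: Seq[Double], p: Double): Int =
    centroids.indices.minBy(i => math.abs(p - centroids(i)))

  def main(args: Array[String]): Unit = {
    val centroids = cluster(Seq(1.0, 2.0, 3.0, 10.0, 11.0, 12.0), Seq(0.0, 5.0))
    println(s"centroids = $centroids, category of 9.0 = ${assign(centroids, 9.0)}")
  }
}
```

The algorithm only returns numbered groups; deciding what each group means is left to the user, as the slide points out.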

Which tools?

Mathematics!

Business knowledge!

Prototyping: think big, start small

Prototyping: test hypotheses quickly and autonomously

• SAS

• Scikit-learn (Python)

• Dataiku

• Excel

• Tableau

• …

Industrialization: automation, performance, maintainability, large data volumes…

Significant code rewriting work!

• ETL component upstream

• Model building:

  o "Small" data volume: industrialized R / SAS / Python

  o "Large" data volume: Spark / Hadoop / Mahout (distributed computing)

  o Cloud solutions (Azure ML / Amazon ML / Google Prediction API)

• Model distribution downstream:

  o Web service

  o Embedded in an application

  o …

Setting up a project

Start Small – Scale Fast

Big Data et machine learning : Manuel du data scientist, Dunod

MOOC Machine Learning, Coursera, Andrew Ng
