Collaborative Predictive Intelligence

via DDF-on-Flink using Distributed DataFrame

Christopher Nguyen, PhD—CEO & Co-Founder, Arimo

Rohit Rai—CEO, Tuplejump

Bringing BigApps to Flink

@arimoinc @pentagoniac http://ddf.io


What Are Adatao Big Apps?

§ Predictive: Predictive Analytics for Business Users

§ Collaborative: Real-time Collaboration with Data Scientists


Demo


The EXPLOSION of Data & Compute Engines

The CIO Challenge

[Figure: Scala, Java, Python, and R clients each wired separately to many compute engines and data sources: Spark, Flink, Presto, Ignite, HDFS, S3, Redshift, BigQuery, Cassandra, and RDBMS.]


The Solution: DDF Data Integration

[Figure: Scala, Java, Python, and R clients on top of DDF, with a DDF adapter for each of Spark, Flink, Ignite, Presto, HDFS, S3, Redshift, BigQuery, Cassandra, and RDBMS. Other labels: Data in Memory, Data at Rest, DWs, DBs, Enterprise Data Bus.]


Benefits of DDF Data Integration

§ FOR DATA ENGINEERS

§ Unified API across data sources and engines

§ HDFS, S3, Cassandra, Redshift, BigQuery, RDBMS, Salesforce, Spark, Flink, Ignite …

§ FOR DATA SCIENTISTS

§ Uniform high-level DataFrame abstractions: ETL, ML, Streaming


Arimo Predictive Intelligence Platform

[Figure: platform stack with layers labeled Big Apps, Big Compute, and Big Data. Components: Custom Apps, Adatao AppBuilder, Adatao PredictiveEngine, and the open-sourced Distributed DataFrame (DDF). Audiences: Business User, Data Scientist, Data Engineer.]


Why Flink?

§ Emerging engine with unique strengths (e.g., streaming)

§ Driven by customer and partner conversations


Demo


DDF: “Under the Hood”

[Figure: Java, Python, and R clients call the unified DDF APIs (ETL interfaces, ML interfaces, streaming interfaces). DDF sits on Spark, Flink, and Redshift. Spark APIs shown: RDD, DataFrame, DStream. Flink APIs shown: DataSet, Table, DataStream, …]


DDF API in a Nutshell

// To start working with an engine
DDFManager manager = DDFManager.get("flink"); // or "spark"

// Then, data can be loaded into a DDF as follows:
DDF table = manager.sql2ddf("select * from airline");

// ETL, transform
table = table.transform("dist = round(distance/2, 2)");

// Run machine learning using MLlib, then run prediction
KMeansModel kmeansModel = (KMeansModel) table.ML.train("kmeans", 5, 5).getRawModel();
Int prediction = table.ML.applyModel(kmeansModel, false, true);


Demo


Lessons Learned

§ It was easy for us to implement DDF on Flink

§ The Flink API is close to a functional collection API


Lessons Learned

§ With DDF, it’s easy to port applications from one engine to another
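The portability claim can be illustrated with a minimal plain-Java sketch. It assumes a made-up engine-neutral interface (`ColumnFrame`) and two toy stand-in "engines" (`EagerFrame`, `LazyFrame`). None of these names come from DDF, Spark, or Flink; they only show the pattern of writing application logic once and choosing the engine by name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleUnaryOperator;

// Hypothetical engine-neutral interface; NOT the real DDF API.
interface ColumnFrame {
    ColumnFrame transform(DoubleUnaryOperator f); // apply f to every value
    double sum();                                  // aggregate all values
}

// Toy engine 1: evaluates every transformation immediately.
final class EagerFrame implements ColumnFrame {
    private final List<Double> values;
    EagerFrame(List<Double> values) { this.values = values; }

    public ColumnFrame transform(DoubleUnaryOperator f) {
        List<Double> out = new ArrayList<>();
        for (double v : values) out.add(f.applyAsDouble(v));
        return new EagerFrame(out);
    }

    public double sum() {
        double total = 0;
        for (double v : values) total += v;
        return total;
    }
}

// Toy engine 2: records transformations and runs them only when sum() is called.
final class LazyFrame implements ColumnFrame {
    private final List<Double> source;
    private final DoubleUnaryOperator pipeline;
    LazyFrame(List<Double> source, DoubleUnaryOperator pipeline) {
        this.source = source;
        this.pipeline = pipeline;
    }

    public ColumnFrame transform(DoubleUnaryOperator f) {
        return new LazyFrame(source, pipeline.andThen(f));
    }

    public double sum() {
        double total = 0;
        for (double v : source) total += pipeline.applyAsDouble(v);
        return total;
    }
}

// Hypothetical factory that picks an engine by name.
final class FrameManager {
    static ColumnFrame load(String engine, List<Double> data) {
        switch (engine) {
            case "eager": return new EagerFrame(data);
            case "lazy":  return new LazyFrame(data, DoubleUnaryOperator.identity());
            default: throw new IllegalArgumentException("unknown engine: " + engine);
        }
    }
}

final class PortableApp {
    // Application logic written once; only the engine name changes.
    // The lambda mirrors the slide's "round(distance/2, 2)" expression.
    static double halvedDistanceTotal(String engine, List<Double> distances) {
        return FrameManager.load(engine, distances)
                .transform(d -> Math.round(d / 2 * 100.0) / 100.0)
                .sum();
    }
}
```

Calling `PortableApp.halvedDistanceTotal("eager", data)` and `PortableApp.halvedDistanceTotal("lazy", data)` runs the same logic on both stand-in engines.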


Lessons Learned

§ There’s now an opportunity to use Flink for interactive applications

§ Backtracking scheduler, session management, better graph analysis


Lessons Learned

§ Null/missing value handling in Flink

§ Null value support needed in RowSerializer
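One common way for a row serializer to support missing values is to write a null flag per field ahead of the payload. The sketch below is plain Java with a hypothetical `NullableRowCodec` for rows of nullable doubles. It is not Flink's `RowSerializer` and does not reflect how Flink handles this; the byte layout is an assumption chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical codec; layout (assumed): field count, one null flag per field,
// then the values of the non-null fields only.
final class NullableRowCodec {

    static byte[] write(Double[] row) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(row.length);
            for (Double field : row) out.writeBoolean(field == null);
            for (Double field : row) {
                if (field != null) out.writeDouble(field);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Double[] read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int fieldCount = in.readInt();
            boolean[] isNull = new boolean[fieldCount];
            for (int i = 0; i < fieldCount; i++) isNull[i] = in.readBoolean();
            Double[] row = new Double[fieldCount];
            for (int i = 0; i < fieldCount; i++) {
                if (isNull[i]) {
                    row[i] = null;            // missing value survives the round trip
                } else {
                    row[i] = in.readDouble();
                }
            }
            return row;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout a null field costs one flag byte and no payload, so a row with missing values serializes without a sentinel value that could collide with real data.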


Lessons Learned

§ Map vs MapPartitions vs Accumulators

§ Map for aggregations can cause a lot of object creation overhead

§ Accumulators may fail for huge datasets
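The object-creation point can be sketched in plain Java, with nested lists standing in for partitions. The `Stats` and `AggregationStyles` classes are hypothetical and do not use the Flink API; the allocation counter exists only to make the difference visible.

```java
import java.util.List;

// Hypothetical partial aggregate; the static counter tracks how many are created.
final class Stats {
    static long allocations = 0;
    long count;
    double sum;

    Stats(long count, double sum) {
        this.count = count;
        this.sum = sum;
        allocations++;
    }

    Stats merge(Stats other) {
        return new Stats(count + other.count, sum + other.sum);
    }
}

final class AggregationStyles {
    // Map-style: wrap every record in its own Stats, then merge pairwise.
    static Stats perRecord(List<List<Double>> partitions) {
        Stats total = new Stats(0, 0.0);
        for (List<Double> partition : partitions) {
            for (double v : partition) {
                total = total.merge(new Stats(1, v)); // two allocations per record
            }
        }
        return total;
    }

    // MapPartition-style: one mutable Stats per partition, merged once at the end.
    static Stats perPartition(List<List<Double>> partitions) {
        Stats total = new Stats(0, 0.0);
        for (List<Double> partition : partitions) {
            Stats local = new Stats(0, 0.0);
            for (double v : partition) {
                local.count++;
                local.sum += v;
            }
            total = total.merge(local); // two allocations per partition
        }
        return total;
    }
}
```

Both methods return the same count and sum. In this sketch the per-record version allocates in proportion to the number of records, while the per-partition version allocates in proportion to the number of partitions.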


Lessons Learned

§ Use caution when copying arrays in the Table API


DDF: Where is it heading?

§ More Engines: DBs & DWs: BigQuery, Cassandra, Teradata, Presto, Ignite

§ Enterprise Databus to seamlessly move data across sources

§ Richer APIs


Get Started with DDF

§ Increase your productivity & build engine-agnostic apps

• Build your analytics apps on existing modules

• Flink, Spark, JDBC

§ Expand possibilities. Contribute to DDF

• Enrich existing plugins: Data APIs, ML APIs...

• Add new DDF plugins:

• BigQuery, Cassandra

• Marketo

• Ignite, Presto

§ Spread the word!

www.ddf.io/gettingstarted
