
Interactive Data Analysis with Apache Flink @ Flink Meetup in Berlin


Till Rohrmann Flink PMC member

@stsffap

Interactive Data Analysis with Apache Flink


Data Analysis


Exploratory Data Analysis
§  Visualize data
§  Calculate main characteristics
§  Understand data and find possible new hypotheses


Data Analysts


Read-Evaluate-Print Loop
§  New Scala shell offers REPL
§  Interactive queries
§  Lets you explore data quickly


Scala Shell


Simple Scala Shell Example


Problems
§  No visualization
§  No saving or replaying of written code
§  No assistance → bad IDE


Notebooks
§  Web-based interactive computation environment
§  Combines rich text, executable code, plots and rich media
§  Storytelling


Apache Zeppelin
§  Web-based REPL with pluggable interpreters
§  In the Apache Incubator since 2014
§  Supported interpreters:
    •  Flink
    •  Spark
    •  Python
    •  Markdown
    •  Many more …


Word Count with Zeppelin
§  Find the 10 most frequent words with more than 4 letters in the King James Version of the Bible.
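The task on this slide can be sketched in plain Python. This is not the Flink code from the talk's notebook (those slides did not survive as text), only a small stand-in that shows the same logic: tokenize, keep words with more than 4 letters, count, take the top 10. The function name `top_words` and the inline sample sentence are illustrative assumptions; the real input would be the full Bible text.

```python
import re
from collections import Counter


def top_words(text, n=10, min_letters=5):
    """Return the n most frequent words with at least min_letters letters."""
    # Lowercase the text and split on anything that is not a letter.
    words = re.findall(r"[a-z]+", text.lower())
    # "More than 4 letters" means 5 letters or more.
    counts = Counter(w for w in words if len(w) >= min_letters)
    return counts.most_common(n)


if __name__ == "__main__":
    # Tiny stand-in sample (assumption); replace with the real text file contents.
    sample = (
        "In the beginning God created the heaven and the earth. "
        "And the earth was without form."
    )
    for word, count in top_words(sample):
        print(word, count)
```

In a distributed engine the same steps would typically map to a flat-map over lines, a filter on word length, and a grouped sum, followed by a sort on the counts.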


Linear regression
§  Let’s predict the influence of advertising spending on sales
§  Input data set: http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv
§  Features:
    •  TV advertisement money
    •  Radio advertisement money
    •  Newspaper advertisement money
§  Response:
    •  Sales
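A minimal sketch of the model on this slide, sales as a linear function of the three advertising features, in plain Python. The source does not preserve the notebook code, so this is a stand-in, not the talk's implementation: it fits ordinary least squares in closed form via the normal equations rather than using a Flink library. The names `fit_linear_regression` and `predict` are hypothetical, and the numbers in the usage section are made-up toy values, not rows from Advertising.csv.

```python
def fit_linear_regression(features, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Returns [intercept, w_1, ..., w_k]. Assumes the features are not
    collinear; otherwise the elimination below divides by zero.
    """
    # Prepend a constant 1.0 to each row so that w[0] is the intercept.
    rows = [[1.0] + list(row) for row in features]
    n = len(rows[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(weights, row):
    """Predicted response for one feature row."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


if __name__ == "__main__":
    # Toy (TV, radio, newspaper) budgets and sales: illustrative values only.
    budgets = [(100, 20, 10), (200, 10, 30), (50, 40, 20), (150, 30, 5), (80, 5, 50)]
    sales = [12.1, 15.2, 13.4, 16.3, 8.1]
    weights = fit_linear_regression(budgets, sales)
    print("intercept and coefficients:", weights)
    print("prediction:", predict(weights, (120, 25, 15)))
```

The fitted coefficients indicate how strongly each advertising channel is associated with sales, which is the question the slide poses. For large data sets an iterative solver such as gradient descent is the more common choice.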


Classification
§  Let’s build a classifier for insult detection
§  Kaggle challenge: https://www.kaggle.com/c/detecting-insults-in-social-commentary
§  Label: 1 – Insult, 0 – No insult
§  Feature: Comment text
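To make the setup on this slide concrete, here is a small text classifier in plain Python that maps a comment to label 1 (insult) or 0 (no insult). The source does not say which classifier the talk used, so a bag-of-words naive Bayes model with Laplace smoothing is swapped in as a simple stand-in. The function names and the four training comments are illustrative assumptions, not data from the Kaggle challenge.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a comment and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(comments, labels):
    """Collect per-class word counts and class frequencies.

    Assumes both labels (0 and 1) occur at least once in the training data.
    """
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(comments, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocabulary


def classify(model, text):
    """Return 1 (insult) or 0 (no insult) for a comment."""
    word_counts, class_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        # Log prior plus Laplace-smoothed log likelihood of each known word.
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy training set (assumption), far too small for real use.
    comments = [
        "you are an idiot",
        "what a stupid idiot",
        "thanks for the helpful answer",
        "great point, I agree",
    ]
    labels = [1, 1, 0, 0]
    model = train_naive_bayes(comments, labels)
    print(classify(model, "you idiot"))
    print(classify(model, "thanks, great answer"))
```

Words that never appeared in training are skipped, so a comment made up entirely of unseen words falls back to the class prior.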


Conclusion
§  Interactive data analysis is really easy with Apache Flink
§  Apache Zeppelin is a great interactive notebook
§  Zeppelin and Flink play well together to solve machine learning tasks and more


flink.apache.org @ApacheFlink