30

Spark Summit EU talk by Sudeep Das and Aish Faenton

Embed Size (px)

Citation preview

Page 1: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 2: Spark Summit EU talk by Sudeep Das and Aish Faenton

@NetflixResearch@aishfenton @datamusing

The missing MatPlotLib for Scala/Spark

Page 3: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 4: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 5: Spark Summit EU talk by Sudeep Das and Aish Faenton

Jan 6th, 2016

Page 6: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 7: Spark Summit EU talk by Sudeep Das and Aish Faenton

MACHINE LEARNINGSYSTEMS CAN GET QUITECOMPLICATED

Real life workflow

Page 8: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 9: Spark Summit EU talk by Sudeep Das and Aish Faenton

STATISTICAL VISUALIZATIONS CAN BE PAINFULConsider a researcher at Netflix who who has raw data in a spark dataframe with columns:

show_id, num_of_views, country, timestamp, video_age

The researcher wants to make the following plot:

Plot the five most popular titles,

according to total number of views,

in the last 5 hours,

as bar charts faceted by country,

where the bars are color coded by video_age.

Sorting

Aggregating after filtering by timestamp

Grouping data by a categorical value (Country)

Mapping a quantitative column to a color

One could perform all these operations on the DF first and then create one bar plot per country via a loop.

Painful indeed!

Page 10: Spark Summit EU talk by Sudeep Das and Aish Faenton

DECLARATIVESTATISTICALVISUALIZATION GRAMMAR

IN SCALA

You tell is WHAT should be done with the data, and it knowsHOW to do it!

Operations such as filtering, aggregation, faceting are built into the visualization, rather than putting the burden on the user to massage the data into shape.

Complex visualizations can be built with a few high level abstractions:

DATA

TRANS-FORMS

SCALES

GUIDES

MARKS

cf : Altair Talk by Brian Granger in PyData 2016 https://youtu.be/v5mrwq7yJc4

Page 11: Spark Summit EU talk by Sudeep Das and Aish Faenton

Anatomy of a plot

X/Y channel

Shape Channel

Size Channel

Color Channel

Page 12: Spark Summit EU talk by Sudeep Das and Aish Faenton

Features…

Page 13: Spark Summit EU talk by Sudeep Das and Aish Faenton

1. Supports most plot types

Page 14: Spark Summit EU talk by Sudeep Das and Aish Faenton

2. Trellis plots

Page 15: Spark Summit EU talk by Sudeep Das and Aish Faenton

3. Layers

Layer 1.

Layer 2.

Layer 3.

Page 16: Spark Summit EU talk by Sudeep Das and Aish Faenton

4. Notebook and Consoles

Page 17: Spark Summit EU talk by Sudeep Das and Aish Faenton

5. Built-in spark support

Vegas.withDataFrame(myDataFrame).encodeX(“population”).encodeY(“age”)

Mapped Columns

Pass In DF.

Page 18: Spark Summit EU talk by Sudeep Das and Aish Faenton

6. Visual statistics

● Advanced Binning

● Sorting

● Scaling

● Custom Transforms

● Time Series

● Aggregation

● Filtering

● Math functions (log, etc)

● Missing data support

● Descriptive Statistics

Page 19: Spark Summit EU talk by Sudeep Das and Aish Faenton

How It Works !

Page 20: Spark Summit EU talk by Sudeep Das and Aish Faenton

VEGA

D3JS

VEGA - LITE

VEGAS SCALA DSL EMITS TYPE-CHECKED

VEGA-LITE JSON

VEGA-LITE CONVERTS INTERNALLY TO VEGA JSON SPEC

VEGA TRANSLATES JSON TO D3JS CODE THAT CAN BE VERY VERBOSE

A SCALA DSL FOR VEGA-LITE

Page 21: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 22: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 23: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 24: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 25: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 26: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 27: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 28: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 29: Spark Summit EU talk by Sudeep Das and Aish Faenton
Page 30: Spark Summit EU talk by Sudeep Das and Aish Faenton