Information Visualization for Large-Scale Data Workflows Michael Conover Senior Data Scientist, LinkedIn @vagabondjack reasonengine.wordpress.com Wednesday, October 9, 2013

Information Visualization for Large-Scale Data Workflows

DESCRIPTION

The ability to instrument and interrogate data as it moves through a processing pipeline is fundamental to effective machine learning at scale. Applied in this capacity, information visualization technologies drive product innovation, shorten iteration cycles, reduce uncertainty, and ultimately improve the performance of predictive models. It can be challenging, however, to understand where in a workflow to employ data visualization, and, once committed to doing so, developing revealing visualizations that suggest clear next steps can be similarly daunting. In this talk we’ll describe the role that information visualization technologies play in the LinkedIn data science ecosystem, and explore best practices for understanding the structure of large-scale data in a production environment. From hypothesis generation and feature development to model evaluation and tooling, visualization is at the heart of LinkedIn’s machine learning workflows, enabling our data scientists to reason and communicate more effectively. Broken down into clear, structured insights based on proven workflow patterns, this talk will help you understand how to apply information visualization to the analytical challenges you encounter every day.

Citation preview

Page 1: Information Visualization for Large-Scale Data Workflows

Information Visualization for Large-Scale Data Workflows

Michael Conover

Senior Data Scientist, LinkedIn

@vagabondjack

reasonengine.wordpress.com

Wednesday, October 9, 2013

Page 2: Information Visualization for Large-Scale Data Workflows

Emergent Structure

Page 3: Information Visualization for Large-Scale Data Workflows

Elegant Complexity

Credit: Pedro Cruz, University of Coimbra; David Crandall, Indiana University; John Nelson, IDV Solutions

Page 4: Information Visualization for Large-Scale Data Workflows

Intellectual Dividends

Realistic Mental Models

Verification of Assumptions

Shortened Iteration Cycles

Improved Predictive Performance

Product Insights

Clarity of Communication

Page 5: Information Visualization for Large-Scale Data Workflows

Hypothesis Generation

Page 6: Information Visualization for Large-Scale Data Workflows

Page 7: Information Visualization for Large-Scale Data Workflows

Color Commentary

@whitehouse #RSVP

Page 8: Information Visualization for Large-Scale Data Workflows

Flock Together

Page 9: Information Visualization for Large-Scale Data Workflows

Political Polarization On Twitter

Page 10: Information Visualization for Large-Scale Data Workflows

Basic Workflow Structure

Page 11: Information Visualization for Large-Scale Data Workflows

Basic Visualization Battery

aes_string()

Page 12: Information Visualization for Large-Scale Data Workflows

Feature Development

Page 13: Information Visualization for Large-Scale Data Workflows

Anscombe’s Quartet

http://en.wikipedia.org/wiki/Anscombe's_quartet
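A minimal sketch of why the quartet appears on a feature-development slide: the four datasets share nearly identical summary statistics, so only a plot reveals how different they are. The data values are Anscombe's published ones; the helper names (`pearson`, `summarize`, `SUMMARIES`) are illustrative and not from the talk.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(xs, ys):
    """Summary statistics rounded to two decimals, as they would be reported."""
    return {
        "mean_x": round(mean(xs), 2),
        "var_x": round(variance(xs), 2),   # sample variance
        "mean_y": round(mean(ys), 2),
        "var_y": round(variance(ys), 2),
        "r": round(pearson(xs, ys), 2),
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARIES.items():
        print(name, stats)
```

Running the script prints one row of statistics per dataset; the rows agree to within rounding even though the scatter plots look nothing alike.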

Page 14: Information Visualization for Large-Scale Data Workflows

[Figure: two density plots; x-axis: Standard Normal, y-axis: Density; panels labeled 100,000 and 1,000,000]

Page 15: Information Visualization for Large-Scale Data Workflows

A Lens On The Joint Distribution

[Figure: scatter plot; x-axis: log(Connections), y-axis: log(Endorsement Pagerank)]

geom_point()

Page 16: Information Visualization for Large-Scale Data Workflows

A Lens On The Joint Distribution

[Figure: scatter plot with semi-transparent points; x-axis: log(Connections), y-axis: log(Endorsement Pagerank)]

geom_point(alpha=1/5)
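A small worked calculation of how transparency turns overplotting into a density cue. It assumes the renderer uses standard source-over compositing, which is an assumption about the graphics device rather than something stated on the slide; the function names are illustrative.

```python
def stacked_opacity(alpha: float, n: int) -> float:
    """Opacity of n marks drawn on top of each other, each with the given alpha.

    Assumes source-over compositing: every mark lets (1 - alpha) of what is
    underneath show through, so n marks let (1 - alpha) ** n through.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - alpha) ** n


def marks_to_reach(alpha: float, target: float) -> int:
    """Smallest number of overlapping marks whose combined opacity >= target."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    n = 0
    while stacked_opacity(alpha, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20):
        print(n, round(stacked_opacity(1 / 5, n), 3))
```

Under this model a single point at alpha = 1/5 is faint, while regions where many points coincide approach full opacity, which is what lets dense areas stand out from sparse ones.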

Page 17: Information Visualization for Large-Scale Data Workflows

A Lens On The Joint Distribution

[Figure: 2D binned heatmap; x-axis: log(Connections), y-axis: log(Endorsement Pagerank); fill legend: count (25, 50, 75, 100)]

geom_bin2d(bins=35)
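A rough sketch of the idea behind two-dimensional binning: divide the plane into a grid and count the points in each cell, so that the plot shows counts instead of individual marks. This is plain Python, not ggplot2's implementation, and its bin edges may differ from what `geom_bin2d` computes; `bin2d` and its parameters are illustrative names.

```python
from collections import Counter


def bin2d(xs, ys, bins=35):
    """Count points in a bins-by-bins grid spanning the range of the data.

    Returns a Counter mapping (column, row) cell indices to point counts.
    Cells with no points are simply absent from the result.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)

    def cell(value, lo, hi):
        if hi == lo:                       # degenerate axis: a single bin
            return 0
        i = int((value - lo) / (hi - lo) * bins)
        return min(i, bins - 1)            # the maximum falls in the last bin

    counts = Counter()
    for x, y in zip(xs, ys):
        counts[(cell(x, x_lo, x_hi), cell(y, y_lo, y_hi))] += 1
    return counts


if __name__ == "__main__":
    demo = bin2d([0.0, 0.1, 1.0, 1.0], [0.0, 0.1, 1.0, 1.0], bins=2)
    for cell_index, count in sorted(demo.items()):
        print(cell_index, count)
```

Because the output size depends on the number of cells rather than the number of points, this kind of aggregation can run inside the pipeline and only the counts need to reach the plotting layer.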

Page 18: Information Visualization for Large-Scale Data Workflows

A Lens On The Joint Distribution

[Figure: scatter plot colored by Class (Negative, Positive); x-axis: log(Connections), y-axis: log(Endorsement Pagerank)]

geom_point(alpha=1/5, aes(color=label))

Page 19: Information Visualization for Large-Scale Data Workflows

A Lens On The Joint Distribution

[Figure: 2D density contours colored by Class (Negative, Positive); x-axis: log(Connections), y-axis: log(Endorsement Pagerank)]

geom_density2d(aes(color=label), bins=10)

Page 20: Information Visualization for Large-Scale Data Workflows

A Lens On The Joint Distribution

Marginal Histograms

Page 21: Information Visualization for Large-Scale Data Workflows

A Lens On The Joint Distribution

[Figure: pairs plot of the iris measurements (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), colored by Species (setosa, versicolor, virginica). Correlations shown in the panels, overall and per species:]

Sepal.Length vs Sepal.Width: Cor −0.118 (setosa 0.743, versicolor 0.526, virginica 0.457)

Sepal.Length vs Petal.Length: Cor 0.872 (setosa 0.267, versicolor 0.754, virginica 0.864)

Sepal.Length vs Petal.Width: Cor 0.818 (setosa 0.278, versicolor 0.546, virginica 0.281)

Sepal.Width vs Petal.Length: Cor −0.428 (setosa 0.178, versicolor 0.561, virginica 0.401)

Sepal.Width vs Petal.Width: Cor −0.366 (setosa 0.233, versicolor 0.664, virginica 0.538)

Petal.Length vs Petal.Width: Cor 0.963 (setosa 0.332, versicolor 0.787, virginica 0.322)

GGally (ggpairs)

Page 22: Information Visualization for Large-Scale Data Workflows

Model Fitting & Evaluation

Page 23: Information Visualization for Large-Scale Data Workflows

Model Selection

                    Model A    Model B
Training Data I     Battery    Battery
Training Data II    Battery    Battery
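One way to read the grid above is that the same battery of diagnostics runs on every model and training-set combination, so the cells are directly comparable. The sketch below illustrates that pattern only; the threshold "models", the toy rows, and the two diagnostics are hypothetical stand-ins, not anything from the talk.

```python
from itertools import product


def run_battery(model, rows, battery):
    """Apply every diagnostic in the battery to one (model, data) pair."""
    predictions = [model(feature) for feature, _ in rows]
    labels = [label for _, label in rows]
    return {name: diagnostic(predictions, labels)
            for name, diagnostic in battery.items()}


def evaluate_grid(models, datasets, battery):
    """Run the same battery for every combination of model and dataset."""
    return {
        (model_name, data_name): run_battery(model, rows, battery)
        for (model_name, model), (data_name, rows)
        in product(models.items(), datasets.items())
    }


# Hypothetical stand-ins: each "model" is a threshold on a single score.
MODELS = {
    "Model A": lambda score: score >= 0.5,
    "Model B": lambda score: score >= 0.8,
}
# Hypothetical toy data: (score, true label) pairs.
DATASETS = {
    "Training Data I": [(0.9, True), (0.6, True), (0.4, False), (0.1, False)],
    "Training Data II": [(0.9, True), (0.7, False), (0.3, False), (0.85, True)],
}
# Hypothetical battery: in practice each entry would produce a plot.
BATTERY = {
    "accuracy": lambda preds, labels:
        sum(p == y for p, y in zip(preds, labels)) / len(labels),
    "positive_rate": lambda preds, labels:
        sum(1 for p in preds if p) / len(preds),
}

GRID = evaluate_grid(MODELS, DATASETS, BATTERY)

if __name__ == "__main__":
    for cell, results in GRID.items():
        print(cell, results)
```

Keeping the battery as a separate mapping means a diagnostic added once shows up in every cell of the grid.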

Page 24: Information Visualization for Large-Scale Data Workflows

Homework At Scale

stanford.edu/~jhuang11/

Page 25: Information Visualization for Large-Scale Data Workflows

Topic Modeling

vis.stanford.edu/papers/termite

Page 26: Information Visualization for Large-Scale Data Workflows

Layercake

Page 27: Information Visualization for Large-Scale Data Workflows

Workflow Principles

Latent, Pervasive

Modular

Consistent Visual Language

Page 28: Information Visualization for Large-Scale Data Workflows

Workflow Management

Page 29: Information Visualization for Large-Scale Data Workflows

Azkaban

data.linkedin.com/opensource/azkaban

Page 30: Information Visualization for Large-Scale Data Workflows

White Elephant

data.linkedin.com/opensource/white-elephant

Page 31: Information Visualization for Large-Scale Data Workflows

Netflix’ Lipstick

github.com/Netflix/Lipstick

Page 32: Information Visualization for Large-Scale Data Workflows

Information Visualization for Large-Scale Data Workflows

Michael Conover

Senior Data Scientist, LinkedIn

@vagabondjack

reasonengine.wordpress.com

Page 33: Information Visualization for Large-Scale Data Workflows

Extended Toolbox

Page 34: Information Visualization for Large-Scale Data Workflows

Tableau

tableausoftware.com/public

Page 35: Information Visualization for Large-Scale Data Workflows

RStudio Shiny

rstudio.com/shiny/

Page 36: Information Visualization for Large-Scale Data Workflows

GoogleVis

code.google.com/p/google-motion-charts-with-r

Page 37: Information Visualization for Large-Scale Data Workflows

rweb.stat.ucla.edu/ggplot2/

Page 38: Information Visualization for Large-Scale Data Workflows

Adobe Kuler

kuler.adobe.com

Page 39: Information Visualization for Large-Scale Data Workflows

Color Brewer

colorbrewer2.org

Page 40: Information Visualization for Large-Scale Data Workflows

D3.js

d3js.org

Page 41: Information Visualization for Large-Scale Data Workflows

Bostock’s Blocks

bl.ocks.org/mbostock

Page 42: Information Visualization for Large-Scale Data Workflows

Stamen OpenStreetMap Tiles

maps.stamen.com

Page 43: Information Visualization for Large-Scale Data Workflows

SF Health Inspections

zipfianacademy.com/maps/h3/