The ability to instrument and interrogate data as it moves through a processing pipeline is fundamental to effective machine learning at scale. Applied in this capacity, information visualization technologies drive product innovation, shorten iteration cycles, reduce uncertainty, and ultimately improve the performance of predictive models. It can be challenging, however, to know where in a workflow to employ data visualization, and, once committed to doing so, it can be just as daunting to develop revealing visualizations that suggest clear next steps. In this talk we’ll describe the role that information visualization technologies play in the LinkedIn data science ecosystem, and explore best practices for understanding the structure of large-scale data in a production environment. From hypothesis generation and feature development to model evaluation and tooling, visualization is at the heart of LinkedIn’s machine learning workflows, enabling our data scientists to reason and communicate more effectively. Organized around proven workflow patterns, this talk will help you understand how to apply information visualization to the analytical challenges you encounter every day.
Information Visualization for Large-Scale Data Workflows
Michael Conover
Senior Data Scientist, LinkedIn
@vagabondjack
reasonengine.wordpress.com
Wednesday, October 9, 2013
Emergent Structure
Elegant Complexity
Credit: Pedro Cruz, University of Coimbra; David Crandall, Indiana University; John Nelson, IDV Solutions
Intellectual Dividends
Realistic Mental Models
Verification of Assumptions
Shortened Iteration Cycles
Improved Predictive Performance
Product Insights
Clarity of Communication
Hypothesis Generation
Color Commentary
@whitehouse #RSVP
Flock Together
Political Polarization On Twitter
Basic Workflow Structure
aes_string()
Basic Visualization Battery
Feature Development
Anscombe’s Quartet
http://en.wikipedia.org/wiki/Anscombe's_quartet
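The point of Anscombe's quartet is that four datasets with nearly identical summary statistics look entirely different once plotted. A minimal sketch in Python (the slides themselves use R and ggplot2; the helper names `pearson` and `summarize` are illustrative, not from the talk) computes the shared statistics from the published quartet values:

```python
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973). Datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def summarize(quartet):
    """Return mean, sample variance of x, and correlation for each dataset."""
    return {
        name: {
            "mean_x": mean(xs),
            "mean_y": mean(ys),
            "var_x": variance(xs),
            "r": pearson(xs, ys),
        }
        for name, (xs, ys) in quartet.items()
    }


if __name__ == "__main__":
    for name, stats in summarize(QUARTET).items():
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four rows printed are nearly indistinguishable, which is why the summary table alone cannot replace a plot of each dataset.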
[Figure: two density plots; x-axis: Standard Normal, y-axis: Density; slide labels: 100,000 and 1,000,000]
A Lens On The Joint Distribution
[Figure: scatter plot; x-axis: log(Connections), y-axis: log(Endorsement Pagerank)]
geom_point()
A Lens On The Joint Distribution
[Figure: scatter plot with transparency; x-axis: log(Connections), y-axis: log(Endorsement Pagerank)]
geom_point(alpha=1/5)
A Lens On The Joint Distribution
[Figure: binned counts; x-axis: log(Connections), y-axis: log(Endorsement Pagerank); fill: count, legend ticks 25 to 100]
geom_bin2d(bins=35)
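Binning replaces overplotted points with a count per grid cell, so dense regions stay readable however many rows are drawn. The sketch below, in Python rather than the R used on the slides, shows the counting step only. The function name `bin2d` and the equal-width binning over the data range are assumptions for illustration and are not claimed to match ggplot2's exact bin boundaries.

```python
import random
from collections import Counter


def bin2d(xs, ys, bins=35):
    """Count points in a bins-by-bins grid of equal-width cells.

    Returns a Counter mapping (column, row) cell indices to point counts.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Fall back to a width of 1.0 when all values on an axis are identical.
    x_width = (x_max - x_min) / bins or 1.0
    y_width = (y_max - y_min) / bins or 1.0
    counts = Counter()
    for x, y in zip(xs, ys):
        # Clamp so the maximum value lands in the last cell, not a new one.
        col = min(int((x - x_min) / x_width), bins - 1)
        row = min(int((y - y_min) / y_width), bins - 1)
        counts[(col, row)] += 1
    return counts


if __name__ == "__main__":
    # Synthetic correlated data, generated only to exercise the function.
    rng = random.Random(0)
    xs = [rng.gauss(0, 1) for _ in range(10_000)]
    ys = [x + rng.gauss(0, 0.5) for x in xs]
    grid = bin2d(xs, ys, bins=35)
    print("occupied cells:", len(grid))
    print("densest cell:", grid.most_common(1)[0])
```

The resulting counts are what a heatmap layer would map to fill color.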
A Lens On The Joint Distribution
[Figure: scatter plot colored by class; x-axis: log(Connections), y-axis: log(Endorsement Pagerank); color: Class (Negative, Positive)]
geom_point(alpha=1/5, aes(color=label))
A Lens On The Joint Distribution
[Figure: density contours colored by class; x-axis: log(Connections), y-axis: log(Endorsement Pagerank); color: Class (Negative, Positive)]
geom_density2d(aes(color=label), bins=10)
A Lens On The Joint Distribution
Marginal Histograms
A Lens On The Joint Distribution
[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, colored by Species (setosa, versicolor, virginica); upper panels report overall and per-species correlations]
Sepal.Length vs Sepal.Width: Cor −0.118 (setosa 0.743, versicolor 0.526, virginica 0.457)
Sepal.Length vs Petal.Length: Cor 0.872 (setosa 0.267, versicolor 0.754, virginica 0.864)
Sepal.Length vs Petal.Width: Cor 0.818 (setosa 0.278, versicolor 0.546, virginica 0.281)
Sepal.Width vs Petal.Length: Cor −0.428 (setosa 0.178, versicolor 0.561, virginica 0.401)
Sepal.Width vs Petal.Width: Cor −0.366 (setosa 0.233, versicolor 0.664, virginica 0.538)
Petal.Length vs Petal.Width: Cor 0.963 (setosa 0.332, versicolor 0.787, virginica 0.322)
GGally (ggpairs)
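The pairs plot reports a correlation for the pooled data and for each species, and the two can disagree in sign (Sepal.Length against Sepal.Width is negative overall but positive within every species). The following Python sketch shows that computation on a tiny synthetic table; the slides use R's GGally for this, and the names `grouped_correlation` and `ROWS`, along with the data values, are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def grouped_correlation(rows):
    """Correlate x and y over all rows and within each group.

    rows: iterable of (group, x, y) tuples.
    Returns {"all": r_overall, group_name: r_within_group, ...}.
    Assumes no group is itself named "all".
    """
    rows = list(rows)
    by_group = defaultdict(list)
    for group, x, y in rows:
        by_group[group].append((x, y))
    result = {"all": pearson([x for _, x, _ in rows], [y for _, _, y in rows])}
    for group, pairs in by_group.items():
        result[group] = pearson([x for x, _ in pairs], [y for _, y in pairs])
    return result


# Synthetic rows, invented for illustration (not the iris measurements).
ROWS = [
    ("a", 1, 5), ("a", 2, 6), ("a", 3, 7),
    ("b", 4, 1), ("b", 5, 2), ("b", 6, 3),
]

if __name__ == "__main__":
    for name, r in grouped_correlation(ROWS).items():
        print(f"{name}: {r:+.3f}")
```

Here each group trends upward while the pooled data trends downward, which is the pattern the per-species breakdown in the figure makes visible.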
Model Fitting & Evaluation
Model Selection
                   Model A   Model B
Training Data I    Battery   Battery
Training Data II   Battery   Battery
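The grid above applies the same battery to every combination of model and training data, so results can be compared cell by cell. A rough Python sketch of that loop follows. Everything in it is hypothetical: the two toy models, the numeric checks standing in for the visualization battery, the tiny datasets, and the choice to score each model on its own training data are assumptions made only to keep the example self-contained.

```python
from itertools import product
from statistics import mean


def fit_constant(data):
    """Hypothetical 'Model A': always predict the mean training label."""
    average = mean(y for _, y in data)
    return lambda x: average


def fit_line(data):
    """Hypothetical 'Model B': ordinary least-squares line through the data."""
    mx = mean(x for x, _ in data)
    my = mean(y for _, y in data)
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    return lambda x: my + slope * (x - mx)


def mean_abs_error(preds, labels):
    return mean(abs(p - y) for p, y in zip(preds, labels))


def max_abs_error(preds, labels):
    return max(abs(p - y) for p, y in zip(preds, labels))


# Stand-in battery: numeric checks in place of the plots a real battery draws.
BATTERY = {"mean_abs_error": mean_abs_error, "max_abs_error": max_abs_error}


def run_grid(models, datasets, battery=BATTERY):
    """Fit every model on every dataset and run the full battery on each."""
    results = {}
    for (data_name, data), (model_name, fit) in product(datasets.items(),
                                                        models.items()):
        predict = fit(data)
        preds = [predict(x) for x, _ in data]  # scored on the training data
        labels = [y for _, y in data]
        results[(data_name, model_name)] = {
            name: check(preds, labels) for name, check in battery.items()
        }
    return results


MODELS = {"Model A": fit_constant, "Model B": fit_line}
DATASETS = {
    "Training Data I": [(0, 0), (1, 1), (2, 2), (3, 3)],
    "Training Data II": [(0, 1), (1, 3), (2, 2), (3, 4)],
}

if __name__ == "__main__":
    for cell, report in run_grid(MODELS, DATASETS).items():
        print(cell, report)
```

Because every cell is produced by the same battery, a difference between two cells reflects the model or the data rather than the way it was measured.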
Homework At Scale
stanford.edu/~jhuang11/
Topic Modeling
vis.stanford.edu/papers/termite
Layercake
Workflow Principles
Latent, Pervasive
Modular
Consistent Visual Language
Workflow Management
Azkaban
data.linkedin.com/opensource/azkaban
White Elephant
data.linkedin.com/opensource/white-elephant
Netflix’ Lipstick
github.com/Netflix/Lipstick
Extended Toolbox
Tableau
tableausoftware.com/public
RStudio Shiny
rstudio.com/shiny/
GoogleVis
code.google.com/p/google-motion-charts-with-r
rweb.stat.ucla.edu/ggplot2/
Adobe Kuler
kuler.adobe.com
Color Brewer
colorbrewer2.org
Bostock’s Blocks
bl.ocks.org/mbostock
Stamen OpenStreetMap Tiles
maps.stamen.com
SF Health Inspections
zipfianacademy.com/maps/h3/