39
Information Visualization for Large-Scale Data Workflows Michael Conover Senior Data Scientist, LinkedIn @vagabondjack

Information Visualization for Large-Scale Data Workflows by Michael Conover (Linkedin)

Embed Size (px)

DESCRIPTION

The ability to instrument and interrogate data as it moves through a processing pipeline is fundamental to effective machine learning at scale. Applied in this capacity, information visualization technologies drive product innovation, shorten iteration cycles, reduce uncertainty, and ultimately improve the performance of predictive models. It can be challenging, however, to understand where in a workflow to employ data visualization, and, once committed to doing so, developing revealing visualizations that suggest clear next steps can be similarly daunting. In this talk we’ll describe the role that information visualization technologies play in the LinkedIn data science ecosystem, and explore best practices for understanding the structure of large-scale data in a production environment. From hypothesis generation and feature development to model evaluation and tooling, visualization is at the heart of LinkedIn’s machine learning workflows, enabling our data scientists to reason and communicate more effectively. Broken down into clear, structured insights based on proven technology and workflow patterns, this talk will help you understand how to apply information visualization to the analytical challenges you encounter every day. Meet Michael Conover Mike Conover builds machine learning technologies that leverage the behavior and relationships of hundreds of millions of people. A senior data scientist at LinkedIn, Mike has a Ph.D. in complex systems analysis with a focus on information propagation in large-scale social networks. His work has appeared in the New York Times, the Wall Street Journal, Science, MIT Technology Review and on National Public Radio.

Citation preview

Information Visualization for Large-Scale Data Workflows

Michael ConoverSenior Data Scientist, LinkedIn

@vagabondjack

Emergent Structure

Credit Pedro Cruz, University of CoimbraDavid Crandall, Indiana UniversityJohn Nelson, IDV Solutions Elegant Complexity

Intellectual Dividends

Realistic Mental Models

Verification of Assumptions

Shortened Iteration Cycles

Improved Predictive Performance

Product Insights

Clarity of Communication

Hypothesis Generation

@whitehouse #RSVP

Color Commentary

Flock Together

Political Polarization on Twitter

Feature Development

Anscombe’s Quartet

http://en.wikpedia.org/Anscombe’s_quartet

Basic Workflow Structure

0.0

0.1

0.2

0.3

0.4

−5.0 −2.5 0.0 2.5 5.0Standard Normal

Den

sity

0.0

0.1

0.2

0.3

0.4

−5.0 −2.5 0.0 2.5 5.0Standard Normal

Den

sity

100,

000

1,00

0,00

0

geom_point()

A Lens on the Joint Distribution

geom_point(alpha=1/5)

A Lens on the Joint Distribution

geom_bin2d(bins=35)

A Lens on the Joint Distribution

geom_point(alpha=1/5, aes(color=label))

A Lens on the Joint Distribution

geom_density2d(aes(color=label), bins=20)

A Lens on the Joint Distribution

Marginal Histograms

A Lens on the Joint Distribution

GGally (ggpairs)

A Lens on the Joint Distribution

Model Fitting & Evaluation

Model A Model B

Training Data I

Training Data II

Evaluation Batteries

stanford.edu/~jhuang11/ Homework at Scale

github.com/StanfordHCI/termiteTopic Modeling

Layercake

Workflow Principles

Latent, Pervasive

Modular

Consistent Visual Language

Workflow Management

data.linkedin.com/opensource/azkabanLinkedIn Azkaban

data.linkedin.com/opensource/white-elephant LinkedIn White Elephant

github.com/Netflix/Lipstick Netflix Lipstick

driven.cascading.io Concurrent Driven

bigml.com BigML Sunburst

Information Visualization for Large-Scale Data Workflows

Michael ConoverSenior Data Scientist, LinkedIn

@vagabondjack

Tableau

RStudio Shiny

rstudio.com/shiny/

RStudio Shiny

rweb.stat.ucla.edu/ggplot2/

Data Driven Documents (D3)

d3js.org

Superconductor

superconductor.github.io

Stamen Tiles

maps.stamen.com