Aditya Parameswaran
Assistant Professor, University of Illinois
http://data-people.cs.illinois.edu
Three Tools for “Human-in-the-loop” Data Science
Many many contributors!
• PIs: Kevin Chang, Karrie Karahalios, Aaron Elmore, Sam Madden, Amol Deshpande (Spanning Illinois, UMD, MIT, Chicago)
• PhD Students: Mangesh Bendre, Himel Dev, John Lee, Albert Kim, Manasi Vartak, Liqi Xu, Silu Huang, Sajjadur Rahman, Stephen Macke
• MS Students: Vipul Venkataraman, Tarique Siddiqui, Chao Wang, Sili Hui
• Undergrads: Paul Zhou, Ding Zhang, Kejia Jiang, Bofan Sun, Ed Xue, Sean Zou, Jialin Liu, Changfeng Liu, Xiaofo Yu
Scale is a Solved Problem
Most work in the database community is myopically focused on scale: the ability to pose SQL queries on larger and larger datasets.
My claim: scale is a solved problem.
Findings:
– Median job size at Microsoft and Yahoo is 16GB
– >90% of the jobs within Facebook are <100GB
The bottleneck is no longer our ability to pose SQL queries on large datasets!
Of course, exceptions exist: the “1%” of data analysis needs
What about the Needs of the 99%?
The bottleneck is actually the “humans-in-the-loop”.
As our data size has grown, what has stayed constant is
• the time for analysis,
• the human cognitive load,
• the skills to extract value from data
There is a severe need for tools that can help analysts extract value from even moderately sized datasets
From “Big Data and Its Technical Challenges”, CACM 2014
For big data to fully reach its potential, we need to consider scale not just for the system but also from the perspective of humans. We have to make sure that the end points—humans—can properly “absorb” the results of the analysis and not get lost in a sea of data.
Need of the hour: Human-In-the-Loop Data Analytics Tools
HILDA tools:
• treat both humans and data as first-class citizens
• reduce human labor
• minimize complexity
(Venn diagram: HILDA sits at the intersection of Interaction, Data Mining, and Databases; it takes the human perspective into account, goes beyond SQL, and keeps scalability/interactivity important. The magic happens at the intersection.)
A Maslow’s Hierarchy for HILDA
Background: in 1943, Maslow developed a highly influential theory of what motivates individuals, depicted as a pyramid ranging from Basic Needs at the bottom to Complex Needs at the top.
A Maslow’s Hierarchy for HILDA
Share & Collaborate
Play & View
Touch & Feel
(Increasing sophistication of analysis)
Touch and Feel: DataSpread
DataSpread is a spreadsheet-database hybrid.
Goal: Marrying the flexibility and ease of use of spreadsheets with the scalability and power of databases
Enables the “99%” with large datasets but limited programming skills to open, touch, and examine their datasets
http://dataspread.github.io
[VLDB’15,VLDB’15,ICDE’16]
Play and View: Zenvisage
Zenvisage is an effortless visual exploration tool.
Goal: “fast-forward” to visual patterns and trends without having the analyst step through each one individually
Enables individuals to play with, and extract insights from, large datasets in a fraction of the time.
http://zenvisage.github.io
[TR’16,VLDB’16,VLDB’15,DSIA’15,VLDB’14,VLDB’14]
Collaborate and Share: OrpheusDB
OrpheusDB is a tool for managing dataset versions with a database.
Goal: building a versioned database system to reduce the burden of recording datasets in various stages of analysis
Enables individuals to collaborate on data analysis, and share, keep track of, and retrieve dataset versions.
http://orpheus-db.github.io
[VLDB’16,VLDB’15,VLDB’15,TAPP’15,CIDR’15]
(also part of datahub: a collaborative analysis system with MIT & UMD)
This talk
About 10 minutes per system:
overview + architecture + one key technical challenge
Common theme: if you torture databases enough, you can get them to do what you want!
Motivation
Most of the people doing ad-hoc data manipulation and analysis use spreadsheets, e.g., Excel.
Why?
• Easy to use: direct manipulation
• Built-in visualization capabilities
• Flexible: no need for a schema
But Spreadsheets are Terrible!
– Slow
• a single change can mean waiting minutes on a 10,000 x 10 spreadsheet
• can’t even open a spreadsheet with >1M cells
• speed by itself can prevent analysis
– Tedious + not powerful
• filters via copy-paste
• only FK joins via VLOOKUPs; others impossible
• even simple operations are cumbersome
– Brittle
• sharing Excel sheets around, with no collaboration/recovery
• using spreadsheets for collaboration is painful and error-prone
Let’s turn to Databases
Databases are:
• Slow → Scalable
• Tedious + not powerful → Powerful and expressive (SQL)
• Brittle → Collaboration, recovery, succinct
So why not use databases? Well, for the same reason why spreadsheets are so useful:
• Easy to use → Not easy to use
• Built-in visualization → No built-in visualization
• Flexible → Not flexible
Combining the benefits of spreadsheets and databases
Spreadsheet as a frontend interface
Database as a backend engine
Result: retain the benefits of both!
But it’s not that simple…
Different Ideologies
Databases and spreadsheets have different ideologies that need to be reconciled; because of this, the integration is not trivial.

Feature        | Databases                      | Spreadsheets
Data Model     | Schema-first                   | Dynamic/no schema
Addressing     | Tuples, with PK                | Cells, using row/col
Presentation   | Set-oriented, no such notion   | Notion of current window, order
Modifications  | Must correspond to queries     | Can be done at any granularity
Computation    | Query at a time                | Value at a time
First Problem: Representation
Q: how do we represent spreadsheet data?
Dense spreadsheets: represent as tables, i.e., (Row #, Col1 val, Col2 val, …)
Sparse spreadsheets: represent as triples, i.e., (Row #, Column #, Value)
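To make the two extremes concrete, here is a minimal Python sketch (hypothetical helper names, not DataSpread’s actual code) laying out a spreadsheet, held as a dict from (row, col) to value, in each representation:

```python
# Two extreme storage layouts for a spreadsheet held as {(row, col): value}.
# Illustrative helpers only, not DataSpread's actual code.

def as_dense_table(cells, n_rows, n_cols):
    """Dense layout: one record per row -- (row#, col1 val, col2 val, ...).
    Empty cells become None, so sparse sheets waste space here."""
    return [
        (r,) + tuple(cells.get((r, c)) for c in range(n_cols))
        for r in range(n_rows)
    ]

def as_triples(cells):
    """Sparse layout: one (row#, col#, value) triple per non-empty cell.
    Dense sheets pay roughly a 3x overhead here versus the table layout."""
    return [(r, c, v) for (r, c), v in sorted(cells.items())]

# Example: a 3x2 sheet with two empty cells.
sheet = {(0, 0): "a", (0, 1): 1, (1, 0): "b", (2, 1): 3}
print(as_dense_table(sheet, n_rows=3, n_cols=2))
print(as_triples(sheet))
```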
First Problem: Representation
Can we do even better than the two extremes? Yes!
Carve out dense areas and store them as tables; store sparse areas as triples.
First Problem: Representation
However, even if we only use “tables”, carving out the ideal number of partitions (minimizing storage, modification, and access costs) is NP-hard, by reduction from the minimum edge-length partition of rectilinear polygons.
Thankfully, we have a way out…
Solution: Constrain the Problem
A new class of partitionings: recursive decomposition.
A very natural class of partitionings!
Solution: Constrain the Problem
The optimal recursive decomposition partitioning can be found in PTIME using dynamic programming (sketched below).
That is still quadratic in the number of rows and columns, so we merge rows/columns with identical signatures, bringing the cost down to roughly the time for a single scan.
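A minimal sketch of that dynamic program, with a made-up storage-only cost model (the real objective also accounts for modification and access costs, and the signature-merging step is omitted here):

```python
from functools import lru_cache

def best_decomposition_cost(cells, n_rows, n_cols):
    """Cost of the optimal recursive decomposition of an n_rows x n_cols
    region: a rectangle is either stored whole (as a dense table or as
    triples, whichever is cheaper) or split by a full horizontal or
    vertical cut, with the two halves solved recursively."""

    def filled(r1, r2, c1, c2):
        # Non-empty cells in the rectangle (inclusive bounds); recomputed
        # naively here -- a real implementation would use 2-D prefix sums.
        return sum(1 for (r, c) in cells if r1 <= r <= r2 and c1 <= c <= c2)

    @lru_cache(maxsize=None)
    def cost(r1, r2, c1, c2):
        rows, cols = r2 - r1 + 1, c2 - c1 + 1
        best = min(
            rows * (1 + cols),            # dense table: row id + slot per col
            3 * filled(r1, r2, c1, c2),   # triples: (row, col, value) per cell
        )
        for r in range(r1, r2):           # every horizontal cut
            best = min(best, cost(r1, r, c1, c2) + cost(r + 1, r2, c1, c2))
        for c in range(c1, c2):           # every vertical cut
            best = min(best, cost(r1, r2, c1, c) + cost(r1, r2, c + 1, c2))
        return best

    return cost(0, n_rows - 1, 0, n_cols - 1)
```

Memoizing over all rectangles is what makes this polynomial; merging rows/columns with identical emptiness signatures shrinks the grid before the DP runs.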
Initial Progress and Architecture
• Postgres backend
• ZK Spreadsheet: open-source web frontend
• Comfortably scales to arbitrarily many rows + handles SQL queries
Hopefully this brings spreadsheets into the big data age!
Standard Visual Data Analysis Recipe:
1. Load dataset into viz tool
2. Select viz to be generated
3. See if it matches the desired visual pattern or insight
4. Repeat until you find a match
Tedious and Time-consuming!
Key Issue: visualizations can be generated by
• varying subsets of data, and
• varying attributes being visualized
Too many visualizations to look at to find the desired visual patterns!
Motivation
This is a real problem!
• Advertisers at Turn: find keywords with similar CTRs to a specific one
• Bioinformaticians at an NIH genomics center: find aspects on which two sets of genes differ
• Battery scientists at CMU: find solvents with desired properties
Common theme: finding the “right” visualization can take several hours of combing through visualizations manually.
Key Insight
We can automate that!
• instead of combing through visualizations manually,
• tell us what you want, and we can “fast-forward” to the desired insights
Desiderata for automation:
• Expressive: the ability to specify what you want
• Interactive: interact with the results, catering to non-programmers
• Scalable: get interesting results quickly
Enter Zenvisage (zen + envisage: to effortlessly visualize)
Effortless Visual Exploration of Large Datasets with Zenvisage
Ingredients:
• Drag-and-drop and sketch-based interactions, to find specific patterns (see the matching sketch after this list)
• A sophisticated visual exploration language, ZQL, to ask more elaborate questions
• A scalable visualization generation engine: preprocessing, plus batched and parallel evaluation for interactive results
• Rapid pattern-matching algorithms: sampling-based techniques
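As an illustration of the pattern-matching ingredient, here is a simple sketch-matching routine using z-normalized Euclidean distance, one standard choice; zenvisage’s production matcher and its sampling optimizations may differ. It assumes the sketch and the candidate trend lines have already been resampled to a common length:

```python
import numpy as np

def top_k_matches(sketch, candidates, k=5):
    """Rank candidate trend lines (name -> list of y-values) by Euclidean
    distance to the user's sketch after z-normalization, so matching is
    on shape rather than absolute scale."""
    def znorm(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-9)

    s = znorm(sketch)
    ranked = sorted(
        candidates.items(),
        key=lambda kv: float(np.linalg.norm(znorm(kv[1]) - s)),
    )
    return ranked[:k]
```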
(Screenshots: attribute selection panel, sketching canvas, matches with typical trends and outliers, and the ZQL advanced exploration interface)
Challenges: One Specific Instance
Find the visualizations on which two groups of data differ most.
Examples:
• find visualizations where solvent x differs from solvent y
• find visualizations where product x differs from product y
We represent a visualization using [d, m, f]:
• dimension = x axis
• measure = y axis
• function = aggregate applied to y
Each [d, m, f] on a specific subset of data can be computed using a single SQL query.
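For instance, such a query has the following shape (a sketch; the table, column, and predicate names are placeholders):

```python
def viz_query(table, d, m, f, subset_predicate):
    """Build the SQL for the visualization [d, m, f] restricted to one
    subset of the data, e.g. one product or one solvent."""
    return (
        f"SELECT {d}, {f}({m}) "
        f"FROM {table} "
        f"WHERE {subset_predicate} "
        f"GROUP BY {d} ORDER BY {d}"
    )

# viz_query("sales", "year", "revenue", "avg", "product = 'chair'")
# -> SELECT year, avg(revenue) FROM sales
#    WHERE product = 'chair' GROUP BY year ORDER BY year
```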
Challenge: One Specific Instance
Find the visualizations on which two groups of data differ most.
Naïve approach:
For each [d, m, f]: compute the visualization for both products (two SQL queries), then compare.
Pick the k best (“highest utility”) [d, m, f].
Utility metric: we ignore how to compare for now, but there are many standard distance metrics.
Scale: 10s of dimensions x 10s of measures x a handful of aggregates → 100s of queries for a single user task!
Issues w/ Naïve Approach
• Repeated processing of the same data in sequence across queries → fix via Sharing
• Computation wasted on low-utility visualizations → fix via Pruning
Sharing Optimizations
1. Minimize # of queries: group queries together (see the sketch after this list)
• Combine multiple aggregates: (d1, m1, f1), (d1, m2, f1) → (d1, [m1, m2], f1)
• Combine multiple group-bys: (d1, m1, f1), (d2, m1, f1) → ([d1, d2], m1, f1)
2. Minimize sequential execution: parallel query evaluation (a bit tricky!)
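A sketch of what the rewrites in step 1 might produce, assuming a Postgres-style backend (the engine’s actual rewriting logic may differ): multiple aggregates share one scan directly, and multiple group-bys can share one via GROUPING SETS.

```python
def combine_aggregates(table, d, measures, f):
    """(d, m1, f), (d, m2, f), ... -> one query, one scan."""
    aggs = ", ".join(f"{f}({m}) AS {f}_{m}" for m in measures)
    return f"SELECT {d}, {aggs} FROM {table} GROUP BY {d}"

def combine_groupbys(table, dims, m, f):
    """(d1, m, f), (d2, m, f), ... -> one query via GROUPING SETS."""
    sets = ", ".join(f"({d})" for d in dims)
    return (
        f"SELECT {', '.join(dims)}, {f}({m}) "
        f"FROM {table} GROUP BY GROUPING SETS ({sets})"
    )

# combine_aggregates("sales", "year", ["revenue", "profit"], "avg")
# -> SELECT year, avg(revenue) AS avg_revenue, avg(profit) AS avg_profit
#    FROM sales GROUP BY year
```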
Pruning Optimizations
• Keep running estimates of utility
• Prune visualizations based on estimates; two flavors:
– Vanilla confidence-interval-based pruning (sketched below)
– Multi-armed bandit pruning
Discard low-utility views early to avoid wasted computation.
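A minimal sketch of the vanilla confidence-interval flavor, with an illustrative half-width formula rather than the exact bound from the paper: after each round of partial processing, any visualization whose upper bound falls below the k-th best lower bound can never make the top k and is discarded.

```python
import math

def ci_prune(means, counts, k, z=1.96):
    """Return the visualizations that can be safely discarded.
    means[v] is the running utility estimate for visualization v, and
    counts[v] is how many samples contributed to it.  Assumes at least
    k visualizations remain under consideration."""
    bounds = {
        v: (m - z / math.sqrt(counts[v]), m + z / math.sqrt(counts[v]))
        for v, m in means.items()
    }
    # k-th largest lower bound: anything whose upper bound is below this
    # cannot be among the k highest-utility visualizations.
    kth_lower = sorted((lo for lo, _ in bounds.values()), reverse=True)[k - 1]
    return {v for v, (_, hi) in bounds.items() if hi < kth_lower}
```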
Experimental Findings
(Architecture: viz interface → middleware layer, where the optimizer applies sharing and pruning to the 100s of queries → DBMS → visualizations)
Up to 300X speedup: <1s for S/M datasets, 4s for L.
Effortless Visual Exploration of Large Datasets with Zenvisage
Ingredients:
• Drag-and-drop and sketch-based interactions
• Sophisticated visual exploration language, ZQL
• Scalable visualization generation engine
• Rapid pattern-matching algorithms
Motivation
Collaborative data science is ubiquitous:
• Many users, many versions of the same dataset stored at many stages of analysis
• Status quo: stored in a file system, with relationships unknown
Challenge: can we build a versioned data store?
– Support efficient access, retrieval, querying, and modification of versions
Motivation: Starting Points
• VCS: Git/svn is inefficient and unsuitable
– Ordered semantics
– No data manipulation API
– No efficient multi-version queries
– Poor support for massive files
• DBMS: relational databases don’t support versioning, but are efficient and scalable
OrpheusDB: Current Focus
PostgreSQL + versioning commands
Challenge: Storing Versions Compactly / Retrieving Versions Quickly
1000s of versions, spanning millions of records. Two extremes:
• Store all versions independently: huge storage, but version access time is very small
• Store one version, and all others via chains of “deltas”: very small storage, but version access time is high (see the sketch below)
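A toy sketch of the delta-chain extreme (an illustrative encoding, not OrpheusDB’s actual storage scheme): a version is reconstructed by replaying its chain of deltas on a stored base version, which is exactly why access time grows with chain length.

```python
def materialize(base, deltas):
    """Reconstruct a version from a base version (record id -> tuple) and
    a chain of deltas, each a pair (added: dict, removed: set).  Cost is
    proportional to the length of the chain."""
    version = dict(base)
    for added, removed in deltas:
        for rid in removed:
            version.pop(rid, None)
        version.update(added)
    return version

# v0 = {1: ("alice", 100), 2: ("bob", 90)}
# chain = [({3: ("carol", 120)}, set()),   # v1: add record 3
#          ({2: ("bob", 95)}, {1})]        # v2: update record 2, delete 1
# materialize(v0, chain) -> {2: ("bob", 95), 3: ("carol", 120)}
```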
And Answer Queries…
• Retrieve the first version that contains this tuple
• Find versions where avg(salary) is greater than 1000
• Find all pairs of versions where over 100 new tuples were added
• Show the history of the tuple with record id 34
For more examples, see [TAPP’15]; one possible encoding of the first query is sketched below.
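Assuming a hypothetical versioned layout in which all versions share one table and every record carries an array column vlist of the version ids containing it (one possible encoding, not necessarily OrpheusDB’s), and assuming version ids increase over time, the first query could be written as:

```python
# Hypothetical schema: data(rid int, vlist int[], ...attributes)
FIRST_VERSION_WITH_TUPLE = """
SELECT min(v)                        -- earliest version id with the tuple
FROM data, unnest(data.vlist) AS v   -- one row per (record, version) pair
WHERE data.rid = %(rid)s;
"""
```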
Framework
(Architecture: a user interface layer accepting git-style commands, or SQL with versions as relations → a “versioning” layer for translation/bookkeeping, containing the parser & translator and the layout optimizer → an unmodified Postgres backend that is not aware of versions)
Summary: Make Data Analytics Great Again!
• Touch & Feel: DataSpread (dataspread.github.io)
• Play & View: Zenvisage (zenvisage.github.io)
• Share & Collaborate: OrpheusDB (orpheus-db.github.io)
(Increasing sophistication of analysis)
My website: http://data-people.cs.illinois.edu
Twitter: @adityagp