Aditya Parameswaran
Assistant Professor, University of Illinois
http://data-people.cs.illinois.edu
Three Tools for “Human-in-the-loop” Data Science
Many many contributors!
• PIs: Kevin Chang, Karrie Karahalios, Aaron Elmore, Sam Madden, Amol Deshpande (Spanning Illinois, UMD, MIT, Chicago)
• PhD Students: Mangesh Bendre, Himel Dev, John Lee, Albert Kim, Manasi Vartak, Liqi Xu, Silu Huang, Sajjadur Rahman, Stephen Macke
• MS Students: Vipul Venkataraman, Tarique Siddiqui, Chao Wang, Sili Hui
• Undergrads: Paul Zhou, Ding Zhang, Kejia Jiang, Bofan Sun, Ed Xue, Sean Zou, Jialin Liu, Changfeng Liu, Xiaofo Yu
Scale is a Solved Problem
Most work in the database community is myopically focused on scale: the ability to pose SQL queries on larger and larger datasets.
My claim: scale is a solved problem.
Findings:
– Median job size at Microsoft and Yahoo is 16GB
– >90% of the jobs within Facebook are <100GB
The bottleneck is no longer our ability to pose SQL queries on large datasets!
Of course, exceptions exist: the “1%” of data analysis needs
What about the Needs of the 99%?
The bottleneck is actually the “humans-in-the-loop”.
As our data size has grown, what has stayed constant is
• the time for analysis,
• the human cognitive load,
• the skills to extract value from data
There is a severe need for tools that can help analysts extract value from even moderately sized datasets
From “Big Data and Its Technical Challenges”, CACM 2014
For big data to fully reach its potential, we need to consider scale not just for the system but also from the perspective of humans. We have to make sure that the end points—humans—can properly “absorb” the results of the analysis and not get lost in a sea of data.
Need of the hour: Human-In-the-Loop Data Analytics Tools
HILDA tools:
• treat both humans and data as first-class citizens
• reduce human labor
• minimize complexity
(Venn diagram: HILDA sits at the intersection of Interaction, Data Mining, and Databases; it takes the human perspective into account, goes beyond SQL, and keeps scalability/interactivity important. The magic happens at the intersection.)
A Maslow’s Hierarchy for HILDA
Background: in 1943, Maslow developed a highly influential theory of what motivates individuals, depicted as a pyramid ranging from Basic Needs at the bottom to Complex Needs at the top.
A Maslow’s Hierarchy for HILDA
Share & Collaborate
Play & View
Touch & Feel
(Increasing sophistication of analysis)
Touch and Feel: DataSpread
DataSpread is a spreadsheet-database hybrid.
Goal: Marrying the flexibility and ease of use of spreadsheets with the scalability and power of databases
Enables the “99%” with large datasets but limited programming skills to open, touch, and examine their datasets
http://dataspread.github.io
[VLDB’15,VLDB’15,ICDE’16]
Play and View: Zenvisage
Zenvisage is an effortless visual exploration tool.
Goal: “fast-forward” to visual patterns and trends without having the analyst step through each one individually
Enables individuals to play with, and extract insights from, large datasets in a fraction of the time.
http://zenvisage.github.io
[TR’16,VLDB’16,VLDB’15,DSIA’15,VLDB’14,VLDB’14]
Collaborate and Share: OrpheusDB
OrpheusDB is a tool for managing dataset versions with a database.
Goal: building a versioned database system to reduce the burden of recording datasets in various stages of analysis
Enables individuals to collaborate on data analysis, and share, keep track of, and retrieve dataset versions.
http://orpheus-db.github.io
[VLDB’16,VLDB’15,VLDB’15,TAPP’15,CIDR’15]
(also part of datahub: a collaborative analysis system with MIT & UMD)
This talk
About 10 minutes per system:
overview + architecture + one key technical challenge
Common theme: if you torture databases enough, you can get them to do what you want!
Motivation
Most of the people doing ad-hoc data manipulation and analysis use spreadsheets, e.g., Excel.
Why?
• Easy to use: direct manipulation
• Built-in visualization capabilities
• Flexible: no need for a schema
But Spreadsheets are Terrible!
– Slow
• a single change can mean waiting minutes on a 10,000 x 10 spreadsheet
• can’t even open a spreadsheet with >1M cells
• speed by itself can prevent analysis
– Tedious + not powerful
• filters via copy-paste
• only FK joins via VLOOKUPs; others impossible
• even simple operations are cumbersome
– Brittle
• sharing Excel sheets around, with no collaboration/recovery
• using spreadsheets for collaboration is painful and error-prone
Let’s turn to Databases
Databases are:
• Slow → Scalable
• Tedious + not powerful → Powerful and expressive (SQL)
• Brittle → Collaboration, recovery, succinct
So why not use databases? Well, for the same reason why spreadsheets are so useful:
• Easy to use → Not easy to use
• Built-in visualization → No built-in visualization
• Flexible → Not flexible
Combining the benefits of spreadsheets and databases
Spreadsheet as a frontend interface
Database as a backend engine
Result: retain the benefits of both!
But it’s not that simple…
Different Ideologies
Databases and spreadsheets have different ideologies that need to be reconciled; because of this, the integration is not trivial.

Feature        | Databases                      | Spreadsheets
Data Model     | Schema-first                   | Dynamic/no schema
Addressing     | Tuples, with PK                | Cells, using row/col
Presentation   | Set-oriented, no such notion   | Notion of current window, order
Modifications  | Must correspond to queries     | Can be done at any granularity
Computation    | Query at a time                | Value at a time
First Problem: Representation
Q: how do we represent spreadsheet data?
Dense spreadsheets: represent as tables, i.e., (Row #, Col1 val, Col2 val, …)
Sparse spreadsheets: represent as triples, i.e., (Row #, Column #, Value)
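To make the two extremes concrete, here is a minimal Python sketch (hypothetical helper names, not DataSpread’s actual code) laying out a spreadsheet, held as a dict from (row, col) to value, in each representation:

```python
# Two extreme storage layouts for a spreadsheet held as {(row, col): value}.
# Illustrative helpers only, not DataSpread's actual code.

def as_dense_table(cells, n_rows, n_cols):
    """Dense layout: one record per row -- (row#, col1 val, col2 val, ...).
    Empty cells become None, so sparse sheets waste space here."""
    return [
        (r,) + tuple(cells.get((r, c)) for c in range(n_cols))
        for r in range(n_rows)
    ]

def as_triples(cells):
    """Sparse layout: one (row#, col#, value) triple per non-empty cell.
    Dense sheets pay roughly a 3x overhead here versus the table layout."""
    return [(r, c, v) for (r, c), v in sorted(cells.items())]

# Example: a 3x2 sheet with two empty cells.
sheet = {(0, 0): "a", (0, 1): 1, (1, 0): "b", (2, 1): 3}
print(as_dense_table(sheet, n_rows=3, n_cols=2))
print(as_triples(sheet))
```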
First Problem: Representation
Can we do even better than the two extremes? Yes!
Carve out dense areas and store them as tables; store sparse areas as triples.
First Problem: Representation
However, even if we only use “tables”, carving out the ideal number of partitions (minimizing storage, modification, and access costs) is NP-hard, by reduction from the minimum edge-length partition of rectilinear polygons.
Thankfully, we have a way out…
Solution: Constrain the Problem
A new class of partitionings: recursive decomposition.
A very natural class of partitionings!
Solution: Constrain the Problem
The optimal recursive decomposition partitioning can be found in PTIME using dynamic programming (sketched below).
That is still quadratic in the number of rows and columns, so we merge rows/columns with identical signatures, bringing the cost down to roughly the time for a single scan.
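A minimal sketch of that dynamic program, with a made-up storage-only cost model (the real objective also accounts for modification and access costs, and the signature-merging step is omitted here):

```python
from functools import lru_cache

def best_decomposition_cost(cells, n_rows, n_cols):
    """Cost of the optimal recursive decomposition of an n_rows x n_cols
    region: a rectangle is either stored whole (as a dense table or as
    triples, whichever is cheaper) or split by a full horizontal or
    vertical cut, with the two halves solved recursively."""

    def filled(r1, r2, c1, c2):
        # Non-empty cells in the rectangle (inclusive bounds); recomputed
        # naively here -- a real implementation would use 2-D prefix sums.
        return sum(1 for (r, c) in cells if r1 <= r <= r2 and c1 <= c <= c2)

    @lru_cache(maxsize=None)
    def cost(r1, r2, c1, c2):
        rows, cols = r2 - r1 + 1, c2 - c1 + 1
        best = min(
            rows * (1 + cols),            # dense table: row id + slot per col
            3 * filled(r1, r2, c1, c2),   # triples: (row, col, value) per cell
        )
        for r in range(r1, r2):           # every horizontal cut
            best = min(best, cost(r1, r, c1, c2) + cost(r + 1, r2, c1, c2))
        for c in range(c1, c2):           # every vertical cut
            best = min(best, cost(r1, r2, c1, c) + cost(r1, r2, c + 1, c2))
        return best

    return cost(0, n_rows - 1, 0, n_cols - 1)
```

Memoizing over all rectangles is what makes this polynomial; merging rows/columns with identical emptiness signatures shrinks the grid before the DP runs.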
Initial Progress and Architecture
• Postgres backend
• ZK Spreadsheet: open-source web frontend
• Comfortably scales to arbitrarily many rows + handles SQL queries
Hopefully this brings spreadsheets into the big data age!
Standard Visual Data Analysis Recipe:
1. Load dataset into viz tool
2. Select viz to be generated
3. See if it matches the desired visual pattern or insight
4. Repeat until you find a match
Tedious and Time-consuming!
Key Issue: visualizations can be generated by
• varying subsets of data, and
• varying attributes being visualized
Too many visualizations to look at to find the desired visual patterns!
Motivation
This is a real problem!
• Advertisers at Turn: find keywords with similar CTRs to a specific one
• Bioinformaticians at an NIH genomics center: find aspects on which two sets of genes differ
• Battery scientists at CMU: find solvents with desired properties
Common theme: finding the “right” visualization can take several hours of combing through visualizations manually.
Key Insight
We can automate that!
• instead of combing through visualizations manually,
• tell us what you want, and we can “fast-forward” to the desired insights
Desiderata for automation:
• Expressive: the ability to specify what you want
• Interactive: interact with the results, catering to non-programmers
• Scalable: get interesting results quickly
Enter Zenvisage (zen + envisage: to effortlessly visualize)
Effortless Visual Exploration of Large Datasets with Zenvisage
Ingredients:
• Drag-and-drop and sketch-based interactions, to find specific patterns (see the matching sketch after this list)
• A sophisticated visual exploration language, ZQL, to ask more elaborate questions
• A scalable visualization generation engine: preprocessing, plus batched and parallel evaluation for interactive results
• Rapid pattern-matching algorithms: sampling-based techniques
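As an illustration of the pattern-matching ingredient, here is a simple sketch-matching routine using z-normalized Euclidean distance, one standard choice; zenvisage’s production matcher and its sampling optimizations may differ. It assumes the sketch and the candidate trend lines have already been resampled to a common length:

```python
import numpy as np

def top_k_matches(sketch, candidates, k=5):
    """Rank candidate trend lines (name -> list of y-values) by Euclidean
    distance to the user's sketch after z-normalization, so matching is
    on shape rather than absolute scale."""
    def znorm(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-9)

    s = znorm(sketch)
    ranked = sorted(
        candidates.items(),
        key=lambda kv: float(np.linalg.norm(znorm(kv[1]) - s)),
    )
    return ranked[:k]
```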
(Screenshots: attribute selection panel, sketching canvas, matches with typical trends and outliers, and the ZQL advanced exploration interface)
Challenges: One Specific Instance
Find the visualizations on which two groups of data differ most.
Examples:
• find visualizations where solvent x differs from solvent y
• find visualizations where product x differs from product y
We represent a visualization using [d, m, f]:
• dimension = x axis
• measure = y axis
• function = aggregate applied to y
Each [d, m, f] on a specific subset of data can be computed using a single SQL query.
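For instance, such a query has the following shape (a sketch; the table, column, and predicate names are placeholders):

```python
def viz_query(table, d, m, f, subset_predicate):
    """Build the SQL for the visualization [d, m, f] restricted to one
    subset of the data, e.g. one product or one solvent."""
    return (
        f"SELECT {d}, {f}({m}) "
        f"FROM {table} "
        f"WHERE {subset_predicate} "
        f"GROUP BY {d} ORDER BY {d}"
    )

# viz_query("sales", "year", "revenue", "avg", "product = 'chair'")
# -> SELECT year, avg(revenue) FROM sales
#    WHERE product = 'chair' GROUP BY year ORDER BY year
```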
Challenge: One Specific Instance
Find the visualizations on which two groups of data differ most.
Naïve approach:
For each [d, m, f]: compute the visualization for both products (two SQL queries), then compare.
Pick the k best (“highest utility”) [d, m, f].
Utility metric: we ignore how to compare for now, but there are many standard distance metrics.
Scale: 10s of dimensions x 10s of measures x a handful of aggregates → 100s of queries for a single user task!
Issues w/ Naïve Approach
• Repeated processing of the same data in sequence across queries → fix via Sharing
• Computation wasted on low-utility visualizations → fix via Pruning
Sharing Optimizations
1. Minimize # of queries: group queries together (see the sketch after this list)
• Combine multiple aggregates: (d1, m1, f1), (d1, m2, f1) → (d1, [m1, m2], f1)
• Combine multiple group-bys: (d1, m1, f1), (d2, m1, f1) → ([d1, d2], m1, f1)
2. Minimize sequential execution: parallel query evaluation (a bit tricky!)
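A sketch of what the rewrites in step 1 might produce, assuming a Postgres-style backend (the engine’s actual rewriting logic may differ): multiple aggregates share one scan directly, and multiple group-bys can share one via GROUPING SETS.

```python
def combine_aggregates(table, d, measures, f):
    """(d, m1, f), (d, m2, f), ... -> one query, one scan."""
    aggs = ", ".join(f"{f}({m}) AS {f}_{m}" for m in measures)
    return f"SELECT {d}, {aggs} FROM {table} GROUP BY {d}"

def combine_groupbys(table, dims, m, f):
    """(d1, m, f), (d2, m, f), ... -> one query via GROUPING SETS."""
    sets = ", ".join(f"({d})" for d in dims)
    return (
        f"SELECT {', '.join(dims)}, {f}({m}) "
        f"FROM {table} GROUP BY GROUPING SETS ({sets})"
    )

# combine_aggregates("sales", "year", ["revenue", "profit"], "avg")
# -> SELECT year, avg(revenue) AS avg_revenue, avg(profit) AS avg_profit
#    FROM sales GROUP BY year
```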
Pruning Optimizations
• Keep running estimates of utility
• Prune visualizations based on estimates; two flavors:
– Vanilla confidence-interval-based pruning (sketched below)
– Multi-armed bandit pruning
Discard low-utility views early to avoid wasted computation.
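A minimal sketch of the vanilla confidence-interval flavor, with an illustrative half-width formula rather than the exact bound from the paper: after each round of partial processing, any visualization whose upper bound falls below the k-th best lower bound can never make the top k and is discarded.

```python
import math

def ci_prune(means, counts, k, z=1.96):
    """Return the visualizations that can be safely discarded.
    means[v] is the running utility estimate for visualization v, and
    counts[v] is how many samples contributed to it.  Assumes at least
    k visualizations remain under consideration."""
    bounds = {
        v: (m - z / math.sqrt(counts[v]), m + z / math.sqrt(counts[v]))
        for v, m in means.items()
    }
    # k-th largest lower bound: anything whose upper bound is below this
    # cannot be among the k highest-utility visualizations.
    kth_lower = sorted((lo for lo, _ in bounds.values()), reverse=True)[k - 1]
    return {v for v, (_, hi) in bounds.items() if hi < kth_lower}
```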
Experimental Findings
(Architecture: viz interface → middleware layer, where the optimizer applies sharing and pruning to the 100s of queries → DBMS → visualizations)
Up to 300X speedup: <1s for S/M datasets, 4s for L.
Effortless Visual Exploration of Large Datasets with Zenvisage
Ingredients:
• Drag-and-drop and sketch-based interactions
• Sophisticated visual exploration language, ZQL
• Scalable visualization generation engine
• Rapid pattern-matching algorithms
Motivation
Collaborative data science is ubiquitous:
• Many users, many versions of the same dataset stored at many stages of analysis
• Status quo: stored in a file system, with relationships unknown
Challenge: can we build a versioned data store?
– Support efficient access, retrieval, querying, and modification of versions
Motivation: Starting Points
• VCS: Git/svn is inefficient and unsuitable
– Ordered semantics
– No data manipulation API
– No efficient multi-version queries
– Poor support for massive files
• DBMS: relational databases don’t support versioning, but are efficient and scalable
OrpheusDB: Current Focus
PostgreSQL + versioning commands
Challenge: Storing Versions Compactly / Retrieving Versions Quickly
1000s of versions, spanning millions of records. Two extremes:
• Store all versions independently: huge storage, but version access time is very small
• Store one version, and all others via chains of “deltas”: very small storage, but version access time is high (see the sketch below)
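A toy sketch of the delta-chain extreme (an illustrative encoding, not OrpheusDB’s actual storage scheme): a version is reconstructed by replaying its chain of deltas on a stored base version, which is exactly why access time grows with chain length.

```python
def materialize(base, deltas):
    """Reconstruct a version from a base version (record id -> tuple) and
    a chain of deltas, each a pair (added: dict, removed: set).  Cost is
    proportional to the length of the chain."""
    version = dict(base)
    for added, removed in deltas:
        for rid in removed:
            version.pop(rid, None)
        version.update(added)
    return version

# v0 = {1: ("alice", 100), 2: ("bob", 90)}
# chain = [({3: ("carol", 120)}, set()),   # v1: add record 3
#          ({2: ("bob", 95)}, {1})]        # v2: update record 2, delete 1
# materialize(v0, chain) -> {2: ("bob", 95), 3: ("carol", 120)}
```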
And Answer Queries…
• Retrieve the first version that contains this tuple
• Find versions where avg(salary) is greater than 1000
• Find all pairs of versions where over 100 new tuples were added
• Show the history of the tuple with record id 34
For more examples, see [TAPP’15]; one possible encoding of the first query is sketched below.
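Assuming a hypothetical versioned layout in which all versions share one table and every record carries an array column vlist of the version ids containing it (one possible encoding, not necessarily OrpheusDB’s), and assuming version ids increase over time, the first query could be written as:

```python
# Hypothetical schema: data(rid int, vlist int[], ...attributes)
FIRST_VERSION_WITH_TUPLE = """
SELECT min(v)                        -- earliest version id with the tuple
FROM data, unnest(data.vlist) AS v   -- one row per (record, version) pair
WHERE data.rid = %(rid)s;
"""
```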
Framework
(Architecture: a user interface layer accepting git-style commands, or SQL with versions as relations → a “versioning” layer for translation/bookkeeping, containing the parser & translator and the layout optimizer → an unmodified Postgres backend that is not aware of versions)
Summary: Make Data Analytics Great Again!
• Touch & Feel: DataSpread (dataspread.github.io)
• Play & View: Zenvisage (zenvisage.github.io)
• Share & Collaborate: OrpheusDB (orpheus-db.github.io)
(Increasing sophistication of analysis)
My website: http://data-people.cs.illinois.edu
Twitter: @adityagp