Democratizing Data Science in the Cloud Bill Howe, Ph.D. Associate Director and Senior Data Science Fellow, eScience Institute Affiliate Associate Professor, Computer Science & Engineering 11/1/2016 Bill Howe, UW 1


Page 1: Democratizing Data Science in the Cloud

Democratizing Data Science

in the Cloud

Bill Howe, Ph.D.

Associate Director and Senior Data Science Fellow, eScience Institute

Affiliate Associate Professor, Computer Science & Engineering

11/1/2016 Bill Howe, UW 1

Page 2: Democratizing Data Science in the Cloud


Cloud Data Management is about

sharing resources between tenants

We’re interested in new services powered by sharing

more than infrastructure – schema, data, queries

Page 3: Democratizing Data Science in the Cloud

Why? Example: JBOT* Open Data systems

Google Fusion Tables


Entrepreneurship

1) “Data once guarded for assumed but untested

reasons is now open, and we're seeing benefits.”

-- Nigel Shadbolt, Open Data Institute

2) Need to help “non-specialists within an

organization use data that had been the

realm of programmers and DB admins”

-- Benjamin Romano, Xconomy

“Businesses are now using data the way

scientists always have”

-- Jeff Hammerbacher

Mt. Sinai, formerly Cloudera

*Just a Bunch of Tables

Page 4: Democratizing Data Science in the Cloud

Data, data, data


Kevin Merritt

CEO

Socrata

Deep Dhillon

CTO

Socrata

Page 5: Democratizing Data Science in the Cloud

Q Q Q

….

Control Plane /

Infrastructure

Data Plane /

Database sys.

Application /

schema, data,

query logs

Page 6: Democratizing Data Science in the Cloud

Q Q Q

….

Benefits: Significantly reduced management overhead

Challenges: security, scheduling, SLAs, isolation

Virtualization

Control Plane /

Infrastructure

Data Plane /

Database sys.

Application /

schema, data,

query logs

Page 7: Democratizing Data Science in the Cloud

Q Q Q

….

DB-as-a-Service

Benefits: Significantly reduced management overhead

Challenges: security, scheduling, SLAs, isolation

Control Plane /

Infrastructure

Data Plane /

Database sys.

Application /

schema, data,

query logs

Page 8: Democratizing Data Science in the Cloud

Q Q Q

….

JBOT* Query-as-a-Service Systems

Goal:

smart cross-tenant services,

trained on everyone’s data

• Metadata inference and data curation

• Query recommendation via common idioms

• Data discovery – e.g., “find me things to join with”

• Visualization recommendation

• Semi-automatic integration services
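The data discovery bullet ("find me things to join with") can be illustrated with a minimal sketch. It assumes join candidates are ranked by the Jaccard overlap of column values; that measure, the function names, and the toy tables are illustrative assumptions, not the method used by any system named in these slides.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def join_candidates(target, others, threshold=0.5):
    """Rank (target_col, table, col, score) pairs by value overlap.

    `target` maps column name -> list of values; `others` maps
    table name -> such a dict. All names here are hypothetical.
    """
    hits = []
    for table, columns in others.items():
        for (tcol, tvals), (ocol, ovals) in product(target.items(), columns.items()):
            score = jaccard(tvals, ovals)
            if score >= threshold:
                hits.append((tcol, table, ocol, score))
    return sorted(hits, key=lambda h: h[3], reverse=True)


samples = {"gene": ["brca1", "tp53", "egfr"], "reads": [10, 12, 9]}
catalog = {
    "annotations": {"symbol": ["tp53", "egfr", "brca1", "myc"],
                    "chrom": ["17", "7", "17", "8"]},
    "weather": {"station": ["sea", "pdx"], "temp": [10, 12]},
}
ranked = join_candidates(samples, catalog)
```

In this toy catalog the `gene`/`symbol` pair ranks first, while the `reads`/`temp` pair shows that raw value overlap can also produce coincidental matches; a cross-tenant service would presumably combine it with other signals.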

Control Plane /

Infrastructure

Data Plane /

Database sys.

Application /

schema, data,

query logs

*Just a Bunch of Tables

Page 9: Democratizing Data Science in the Cloud

Example Service: Automated Data Curation


Microarray samples submitted to the Gene Expression Omnibus

Curation is fast becoming the

bottleneck to data sharing

Maxim Grechkin, Hoifung Poon

Page 10: Democratizing Data Science in the Cloud

Example Service: Automated Data Curation

Maxim Grechkin, Hoifung Poon

Goal: Repair metadata for genetic

datasets using the content of the data, the

structure of an associated ontology, the

abstract of the paper, and everything else.

Deep Neural Network

Tissue Type Labels

Innovations in transfer learning,

poor training data, etc.

Paper

Abstract

Page 11: Democratizing Data Science in the Cloud

Example Service: Automated Data Curation

Maxim Grechkin, Hoifung Poon

Iterative co-learning between a text-based classifier and an

expression-based classifier: both models improve by

training on each other's results

Page 12: Democratizing Data Science in the Cloud

• SQLShare: Query-as-a-Service

• VizDeck: Visualization recommendation

• Myria: Big Data Ecosystems

VizDeck

Some Cloud Data Systems

Page 13: Democratizing Data Science in the Cloud

1) Upload data “as is”

Cloud-hosted, secure; no

need to install or design a

database; no pre-defined

schema; schema inference;

some integration

2) Write Queries

Right in your browser,

writing views on top of

views on top of views ...

SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC

3) Share the results

Make them public, tag them,

share with specific colleagues –

anyone with access can query

http://sqlshare.escience.washington.edu
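The three steps above (upload "as is", write views on views, query the result) can be sketched with Python's built-in `sqlite3` module. This is only a stand-in under stated assumptions: it is not SQLShare's implementation, the `score` column and the `infer_type` and `upload` helpers are hypothetical, and the type-inference rule is deliberately simple.

```python
import csv
import io
import sqlite3


def infer_type(values):
    """Pick the narrowest SQLite type that fits every non-empty value."""
    for caster, name in ((int, "INTEGER"), (float, "REAL")):
        try:
            for v in values:
                if v != "":
                    caster(v)
            return name
        except ValueError:
            continue
    return "TEXT"


def upload(conn, table, text):
    """Load CSV text 'as is': the schema is inferred, not declared."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    types = [infer_type([r[i] for r in data]) for i in range(len(header))]
    cols = ", ".join(f'"{h}" {t}' for h, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    marks = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', data)
    return dict(zip(header, types))


conn = sqlite3.connect(":memory:")
schema = upload(conn, "tigrfam_surface",
                "hit,score\nTIGR01,3.5\nTIGR02,1.0\nTIGR01,2.25\n")

# Step 2: a view on top of a view.
conn.execute("CREATE VIEW hit_counts AS "
             "SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit")
conn.execute("CREATE VIEW top_hits AS "
             "SELECT hit, cnt FROM hit_counts WHERE cnt > 1")

# Step 3: anyone with access to the view can query it.
top = conn.execute("SELECT hit, cnt FROM top_hits ORDER BY cnt DESC").fetchall()
```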

Page 14: Democratizing Data Science in the Cloud


http://sqlshare.escience.washington.edu

Page 15: Democratizing Data Science in the Cloud

SIGMOD 2016

Page 16: Democratizing Data Science in the Cloud

SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp

, x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp

, w.category as nc_category

, CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)

THEN x.end_bp - x.start_bp + 1

WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)

THEN x.end_bp - w.start_bp + 1

WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)

THEN w.end_bp - x.start_bp + 1

END AS len_overlap

FROM [[email protected]].[hotspots_deserts.tab] x

INNER JOIN [[email protected]].[table_noncoding_positions.tab] w

ON x.chr = w.chr

WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)

OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)

OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)

ORDER BY x.strain, x.chr ASC, x.start_bp ASC

Non-programmers can write very complex queries

(rather than relying on staff programmers)

Example: Computing the overlaps of two sets of blast results

We see thousands of

queries written by

non-programmers
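For readers who want to check what the overlap query computes, here is a minimal Python sketch of the same idea. It assumes closed base-pair intervals and uses the general formula min(ends) - max(starts) + 1, which is a simplification of the CASE expression in the query rather than a transcription of it; the record layout mirrors the column names in the query, and the sample rows are made up.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by two closed intervals, 0 if disjoint."""
    shared = min(a_end, b_end) - max(a_start, b_start) + 1
    return max(shared, 0)


def overlaps(snps, noncoding):
    """Nested-loop join on chromosome, keeping pairs that share a position."""
    out = []
    for x in snps:
        for w in noncoding:
            if x["chr"] != w["chr"]:
                continue
            n = overlap_length(x["start_bp"], x["end_bp"],
                               w["start_bp"], w["end_bp"])
            if n > 0:
                out.append((x["strain"], x["chr"], w["category"], n))
    return out


snps = [{"strain": "s1", "chr": "1", "start_bp": 10, "end_bp": 20}]
noncoding = [
    {"chr": "1", "start_bp": 15, "end_bp": 30, "category": "intron"},
    {"chr": "2", "start_bp": 15, "end_bp": 30, "category": "utr"},
    {"chr": "1", "start_bp": 40, "end_bp": 50, "category": "utr"},
]
result = overlaps(snps, noncoding)
```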

Page 17: Democratizing Data Science in the Cloud

The SQLShare Corpus:

A multi-year log of hand-written analytics queries

Queries 24275

Views 4535

Tables 3891

Users 591

SIGMOD 2016

Shrainik Jain

https://uwescience.github.io/sqlshare

Page 18: Democratizing Data Science in the Cloud


A SQL “learner”

http://uwescience.github.io/sqlshare/

Page 19: Democratizing Data Science in the Cloud

Latent Idioms for Schema-Independent Query Recommendation

Background on

Word2Vec, GloVe:

Map each term in a

corpus to a vector in

a high-dimensional

space based on its

co-occurrences.

Linear relationships

between these

vectors appear to

capture remarkable

semantic properties
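To make "a vector based on its co-occurrences" concrete, the sketch below builds raw co-occurrence count vectors and compares them with cosine similarity. This is a count-based stand-in, not Word2Vec or GloVe themselves (both learn dense vectors by optimization); the window size, the function names, and the tiny corpus are assumptions for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each term to a sparse vector of neighbor counts within `window`."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


corpus = [
    "select count from genes".split(),
    "select count from proteins".split(),
    "select avg from genes".split(),
]
vecs = cooccurrence_vectors(corpus)
similarity = cosine(vecs["genes"], vecs["proteins"])
```

Terms that appear in similar contexts (here, two table names) end up with similar vectors, which is the property the idiom-mining work relies on.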

Page 20: Democratizing Data Science in the Cloud

:

SELECT COUNT(*) FROM [[email protected]].[table_Firearms.txt]

SELECT COUNT (HiLo) FROM [[email protected]].[table_MUK.csv]

SELECT count(*) FROM [[email protected]].[Depth_combined]

select count(Wave_Height) from [[email protected]].[Join]

SELECT count(*) FROM [[email protected]].[ecoli_nogaps_1.csv]

SELECT Count(*) FROM [[email protected]].[TargetTrackFeatures.csv]

SELECT count(*) FROM [billhowe].[sunrise sunset times 2009 - 2011]

SELECT Count(*) FROM [[email protected]].[table_ec_pdb_genus.csv]

SELECT count(*) FROM [[email protected]].[ecoli_nogaps_1.csv]

SELECT COUNT(*) FROM [[email protected]].[Tokyo_0_merged.csv]

SELECT COUNT(*) FROM [[email protected]].[SPID_GOnumber.txt]

SELECT COUNT (species) FROM [[email protected]].[Orthosia]

SELECT COUNT (species) FROM [[email protected]].[Leucania]

:

Apply the same trick to the SQLShare corpus, cluster the results

A not-very-interesting cluster:

Latent SQL Idioms

Page 21: Democratizing Data Science in the Cloud

:

SELECT COUNT(*) FROM [[email protected]].[table_proteins.csv] WHERE species LIKE 'Homo sapiens%'

SELECT count (*) FROM [[email protected]].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'

SELECT count (*) FROM [[email protected]].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'

SELECT Count (*) FROM [[email protected]].[Dated_Join] WHERE Category = 'Warm'

SELECT COUNT (*) FROM [[email protected]].[table_PopulationV2.txt] WHERE Column1='Country'

SELECT COUNT(*) FROM [[email protected]].[table_pHWaterTemp] WHERE TempCategory='normal'

SELECT COUNT(*) FROM [[email protected]].[no retweete] WHERE hashtags_in_text LIKE '%#odisha%'

:

Another not-very-interesting cluster:

We see other clusters that seem to capture more basics: “union,”

“group by with one grouping column,” “left outer join,” “string

manipulation,” etc.

Latent SQL Idioms
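One way to see why such clusters are schema-independent is to strip everything schema-specific from each query and group what remains. The sketch below does that with a few regular expressions. It is a cruder, swapped-in technique (exact template matching rather than embeddings plus clustering), and the keyword list, placeholder names, and owner names such as `[alice]` are hypothetical.

```python
import re
from collections import defaultdict

KEYWORDS = {"select", "count", "from", "where", "like",
            "group", "by", "order", "and", "or"}


def template(sql):
    """Reduce a query to a schema-independent shape (illustrative rules)."""
    s = re.sub(r"'[^']*'", " <str> ", sql)                          # string literals
    s = re.sub(r"\[[^\]]*\](\s*\.\s*\[[^\]]*\])*", " <id> ", s)     # [owner].[table]
    tokens = re.findall(r"<\w+>|\w+|[^\w\s]", s.lower())
    shaped = []
    for tok in tokens:
        if re.fullmatch(r"\w+", tok) and tok not in KEYWORDS:
            tok = "<num>" if tok[0].isdigit() else "<col>"          # bare names, numbers
        shaped.append(tok)
    return " ".join(shaped)


def group_by_template(queries):
    """Group queries that share the same shape."""
    groups = defaultdict(list)
    for q in queries:
        groups[template(q)].append(q)
    return dict(groups)


queries = [
    "SELECT COUNT(*) FROM [alice].[table_Firearms.txt]",
    "select count(*) from [bob].[Depth_combined]",
    "SELECT COUNT (HiLo) FROM [carol].[table_MUK.csv]",
    "SELECT Count (*) FROM [dave].[Dated_Join] WHERE Category = 'Warm'",
]
groups = group_by_template(queries)
```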

Page 22: Democratizing Data Science in the Cloud

Latent SQL Idioms

More interesting examples:

select floor(latitude/0.7)*0.7 as latbin

, floor(longitude/0.7)*0.7 as lonbin

, species

FROM [[email protected]].[All3col]

select distinct case when patindex('%[0-9]%', [protein]) = 1 -- first char is number

and charindex(',', [protein]) = 0 -- and no comma present

then [protein]

else substring([protein], patindex('%[0-9]%', [protein]),

charindex(',', [protein])-patindex('%[0-9]%', [protein]))

end as [protein d1124],

[tot indep spectra] as [tot spectra d1124]

from [[email protected]].[d1_file124.txt]

Parsing a common

bioinformatics file format

Expressions for binning

space and time columns
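The binning idiom `floor(x/w)*w` maps a value to the left edge of its fixed-width bin. A minimal Python sketch, assuming (latitude, longitude) pairs as input; the function names and sample points are illustrative.

```python
from collections import Counter
from math import floor


def bin_floor(value, width):
    """Left edge of the fixed-width bin containing `value`."""
    return floor(value / width) * width


def bin_counts(points, width=0.7):
    """Count (latitude, longitude) points per grid cell.

    Cells are keyed by integer bin index rather than by the floating-point
    bin edge, which avoids near-equal float keys for the same cell.
    """
    return Counter((floor(lat / width), floor(lon / width)) for lat, lon in points)


points = [(47.6, -122.3), (47.7, -122.4), (45.5, -122.7)]
counts = bin_counts(points, width=0.5)
```

Multiplying a bin index by the width recovers the `latbin`/`lonbin` values that the SQL expression emits.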

Page 23: Democratizing Data Science in the Cloud

MYRIA: BIG DATA POLYSTORES


Page 24: Democratizing Data Science in the Cloud

Q Q Q

….

Control Plane /

Infrastructure

Data Plane /

Database sys.

Application /

schema, data,

query logs

Page 25: Democratizing Data Science in the Cloud

Q Q Q

….

Polystore Ecosystems: “Software Defined Databases”

Data Plane /

Database sys.

Application /

schema, data,

query logs

RDBMS HPC / Linear Algebra Graphs

Page 26: Democratizing Data Science in the Cloud

Polystore

Execution

Plan

move

data

execute

query

Page 27: Democratizing Data Science in the Cloud

Polystore

Execution

Plan

Tables KeyVal Arrays Graphs

Page 28: Democratizing Data Science in the Cloud

Myria Algebra

Tables KeyVal Arrays Graphs

Page 29: Democratizing Data Science in the Cloud

Spark Accumulo CombBLAS GraphX

Parallel Algebra

Logical Algebra

RACO: Relational Algebra COmpiler

CombBLAS API

Spark API

Accumulo Graph API

rewrite rules

Array Algebra

MyriaL

Services: visualization, logging, discovery, history, browsing

Orchestration

https://github.com/uwescience/raco

Page 30: Democratizing Data Science in the Cloud
Page 31: Democratizing Data Science in the Cloud

https://metanautix.com/tr/01_big_data_techniques_for_media_graphics.pdf

Page 32: Democratizing Data Science in the Cloud


Ollie Lo, Los Alamos National Lab

Page 33: Democratizing Data Science in the Cloud

CurGood = SCAN(public:adhoc:sc_points);
DO
    mean = [FROM CurGood EMIT val=AVG(v)];
    std = [FROM CurGood EMIT val=STDEV(v)];
    NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
    CurGood = CurGood - NewBad;
    continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);

Sigma-clipping, V0
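For comparison, the same loop in plain Python: recompute the mean and standard deviation over the surviving points each round, and drop anything more than two standard deviations away. This is a sketch of the technique, not of Myria's execution; it assumes the sample standard deviation (the unbiased formula with cnt - 1 used in the later versions), and the function name and test data are illustrative.

```python
from statistics import mean, stdev


def sigma_clip(values, k=2.0):
    """Repeatedly drop points more than k sample standard deviations from the mean."""
    good = list(values)
    while len(good) >= 2:
        m, s = mean(good), stdev(good)
        bad = [v for v in good if abs(v - m) > k * s]
        if not bad:
            break  # fixed point: nothing left to clip
        good = [v for v in good if abs(v - m) <= k * s]
    return good


data = [10, 11, 9, 10, 12, 8, 10, 11, 9, 100]
clipped = sigma_clip(data)
```

Every iteration rescans all surviving points to recompute both aggregates, which is the cost the incremental versions avoid.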

Page 34: Democratizing Data Science in the Cloud

CurGood = P;
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)];
cnt = [FROM CurGood EMIT COUNT(*)];
NewBad = [];
DO
    sum = sum - [FROM NewBad EMIT SUM(val)];
    sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
    cnt = cnt - [FROM NewBad EMIT COUNT(*)];
    mean = sum / cnt;
    std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum));
    NewBad = FILTER([ABS(val-mean) > 2 * std], CurGood);
    CurGood = CurGood - NewBad;
WHILE NewBad != {};

Sigma-clipping, V1: Incremental
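The incremental idea is that the running sum, sum of squares, and count are enough to recover the mean and sample standard deviation, so each round only needs to subtract the contribution of the points just removed. A minimal Python sketch under the same assumptions as before (threshold of two standard deviations, illustrative names and data):

```python
from math import sqrt


def sigma_clip_incremental(values, k=2.0):
    """Sigma clipping that maintains sum, sum of squares, and count incrementally."""
    good = list(values)
    total = sum(good)
    total_sq = sum(v * v for v in good)
    count = len(good)
    new_bad = []
    while True:
        # Subtract only the points removed in the previous round.
        total -= sum(new_bad)
        total_sq -= sum(v * v for v in new_bad)
        count -= len(new_bad)
        if count < 2:
            break
        m = total / count
        # Sample std from the aggregates; clamp tiny negative round-off to zero.
        s = sqrt(max(count * total_sq - total * total, 0.0) / (count * (count - 1)))
        new_bad = [v for v in good if abs(v - m) > k * s]
        if not new_bad:
            break
        good = [v for v in good if abs(v - m) <= k * s]
    return good


data = [10, 11, 9, 10, 12, 8, 10, 11, 9, 100]
clipped = sigma_clip_incremental(data)
```

The sum-of-squares formula can lose precision through cancellation when values are large relative to their spread, which is one reason to treat this as a sketch rather than a numerically robust implementation.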

Page 35: Democratizing Data Science in the Cloud

Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = [];
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
DO
    new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
    aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum,
            sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
    stats = [FROM aggs EMIT mean=_sum/cnt,
             std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
    newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
    tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v
              AND v >= bounds.lower EMIT v=Points.v];
    tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v
               AND v <= bounds.upper EMIT v=Points.v];
    newBad = UNIONALL(tooLow, tooHigh);
    bounds = newBounds;
    continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;
output = [FROM Points, bounds WHERE Points.v > bounds.lower AND
          Points.v < bounds.upper EMIT v=Points.v];
DUMP(output);

Sigma-clipping, V2

Page 36: Democratizing Data Science in the Cloud

Dominik Moritz

EuroVis 15

Empower the end user to do

performance profiling, debugging, etc.

Page 37: Democratizing Data Science in the Cloud

Diagnosing problems

Source node

Destination node

Dominik Moritz

EuroVis 15

Page 38: Democratizing Data Science in the Cloud

Some ongoing work

• “from scratch” polystore optimizer

– Columbia-style, with some ideas from PL community

• Anecdotal Optimization

– Infer optimization decisions based on coarse-grained experimental

results from unreliable sources (blogs, literature)

– “System X is 2X faster than System Y on PageRank”

• Benchmarking Linear Algebra Systems vs. Databases

– HPC community thinks they are 1000X faster; they aren’t

– DB community thinks they are competitive; they aren’t

• Query compilation

– Bridge the gap between MPI and DB

• New query language Kamooks blending arrays and relations


Page 39: Democratizing Data Science in the Cloud


Page 40: Democratizing Data Science in the Cloud
Page 41: Democratizing Data Science in the Cloud


Page 42: Democratizing Data Science in the Cloud

Query compilation for distributed processing

Approach 1: pipeline as parallel code -> parallel compiler -> machine code [Myers ’14]

Approach 2: pipeline split into fragment code; each fragment -> sequential compiler -> machine code [Crotty ’14, Li ’14, Seo ’14, Murray ’11]

Page 43: Democratizing Data Science in the Cloud

RADISH

ICS 16

Brandon

Myers

Page 44: Democratizing Data Science in the Cloud


1% selection microbenchmark, 20GB

Avoid long code paths

ICS 16

Brandon

Myers

Page 45: Democratizing Data Science in the Cloud


Q2 SP2Bench, 100M triples, multiple self-joins

Communication optimization

ICS 16

Brandon

Myers

Page 46: Democratizing Data Science in the Cloud

Graph Patterns


• SP2Bench, 100 million triples

• Queries compiled to a PGAS C++ language layer, then

compiled again by a low-level PGAS compiler

• One of Myria’s supported back ends

• Comparison with Shark/Spark, which itself has been shown to

be 100X faster than Hadoop-based systems

• …plus PageRank, Naïve Bayes, and more

RADISH

ICS 16

Brandon

Myers

Page 47: Democratizing Data Science in the Cloud


ICS 15

RADISH

ICS 16

Brandon

Myers

TPC-H

Page 48: Democratizing Data Science in the Cloud

Some ongoing work

• “from scratch” polystore optimizer

– Columbia-style, with some ideas from PL community

• Anecdotal Optimization

– Infer optimization decisions based on coarse-grained experimental

results from unreliable sources (blogs, literature)

– “System X is 2X faster than System Y on PageRank”

• Benchmarking Linear Algebra Systems vs. Databases

– HPC community thinks they are 1000X faster; they aren’t

– DB community thinks they are competitive; they aren’t

• Query compilation

– “Software-defined Databases”

– Bridge the gap between MPI and DB

• New query language Kamooks blending arrays and relations


Page 49: Democratizing Data Science in the Cloud

select A.i, B.k, sum(A.val*B.val)

from A, B

where A.j = B.j

group by A.i, B.k

Matrix multiply in RA

Matrix multiply
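The query above treats each sparse matrix as a relation of (row, column, value) triples: the join on `A.j = B.j` pairs up the terms of each dot product, and the group-by sums them. A minimal Python sketch of the same plan as a hash join followed by an aggregation; the function name and the 2 x 2 test matrices are illustrative.

```python
from collections import defaultdict


def matmul_relational(a, b):
    """Multiply sparse matrices stored as (row, col, val) triples.

    Mirrors: SELECT A.i, B.k, SUM(A.val*B.val) FROM A, B
             WHERE A.j = B.j GROUP BY A.i, B.k
    """
    b_by_j = defaultdict(list)           # hash-join build side, keyed on j
    for j, k, val in b:
        b_by_j[j].append((k, val))
    out = defaultdict(float)             # group by (i, k) with SUM
    for i, j, a_val in a:
        for k, b_val in b_by_j.get(j, ()):
            out[(i, k)] += a_val * b_val
    return dict(out)


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], zeros omitted.
A = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
B = [(0, 0, 4), (1, 0, 5), (1, 1, 6)]
AB = matmul_relational(A, B)
```

Only non-zero entries are ever touched, which is why the relational formulation behaves like a sparse algorithm.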

Page 50: Democratizing Data Science in the Cloud

Complexity of matrix multiply

n = number of rows

m = number of non-zeros

[Plot: complexity exponent vs. sparsity exponent (r s.t. m = n^r)]

• naïve sparse algorithm: mn

• best known sparse algorithm: m^0.7 n^1.2 + n^2

• best known dense algorithm: n^2.38

• "lots of room here"

Slide adapted from Zwick; R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication

Page 51: Democratizing Data Science in the Cloud

BLAS vs. SpBLAS vs. SQL (10k)

off the shelf

database

15X

Page 52: Democratizing Data Science in the Cloud


20k X 20k matrix multiply by sparsity

CombBLAS, MyriaX, Radish

Page 53: Democratizing Data Science in the Cloud


50k X 50k matrix multiply by sparsity

CombBLAS, MyriaX, Radish

Filter to upper left corner of result matrix

Page 54: Democratizing Data Science in the Cloud

select AB.i, C.m, sum(AB.val*C.val)

from

(select A.i, B.k, sum(A.val*B.val)

from A, B

where A.j = B.j

group by A.i, B.k

) AB,

C

where AB.k = C.k

group by AB.i, C.m

A x B x C

select A.i, C.m, sum(A.val*B.val*C.val)

from A, B, C

where A.j = B.j

and B.k = C.k

group by A.i, C.m

A(i, j, val)

B(j, k, val)

C(k, m, val)

take three sparse

matrices

Now compute

multiway hypercube join:

O (|A|/p + |B|/p^2 + |C|/p)

Group by:

~O (N)

But wait, there’s more…..
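The load formula for the multiway hypercube join can be made concrete with a small simulation. The sketch below assumes servers arranged in a p x p grid, one dimension per join attribute (j and k), so there are p^2 servers in total; that reading of p, the use of `index % p` as the hash function, and all function names are assumptions for illustration. A tuple of B fixes both coordinates and goes to one server, while tuples of A and C fix one coordinate and are replicated along the other, giving roughly |A|/p + |B|/p^2 + |C|/p tuples per server.

```python
from collections import defaultdict


def hypercube_shuffle(a, b, c, p):
    """Route tuples of A(i,j,val), B(j,k,val), C(k,m,val) to a p x p server grid."""
    servers = defaultdict(lambda: {"A": [], "B": [], "C": []})
    for t in a:                                   # A knows j only: replicate over k
        for hk in range(p):
            servers[(t[1] % p, hk)]["A"].append(t)
    for t in b:                                   # B knows j and k: exactly one server
        servers[(t[0] % p, t[1] % p)]["B"].append(t)
    for t in c:                                   # C knows k only: replicate over j
        for hj in range(p):
            servers[(hj, t[0] % p)]["C"].append(t)
    return servers


def local_join_aggregate(parts):
    """Three-way join and partial SUM on one server."""
    out = defaultdict(float)
    for i, j, av in parts["A"]:
        for j2, k, bv in parts["B"]:
            if j != j2:
                continue
            for k2, m, cv in parts["C"]:
                if k == k2:
                    out[(i, m)] += av * bv * cv
    return out


def abc_product(a, b, c, p=2):
    """A x B x C via one hypercube shuffle, local joins, and a final group-by."""
    total = defaultdict(float)
    for parts in hypercube_shuffle(a, b, c, p).values():
        for key, val in local_join_aggregate(parts).items():
            total[key] += val
    return dict(total)


# A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]], C = [[1, 1], [0, 2]], zeros omitted.
A = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
B = [(0, 0, 4), (1, 0, 5), (1, 1, 6)]
C = [(0, 0, 1), (0, 1, 1), (1, 1, 2)]
ABC = abc_product(A, B, C, p=2)
```

Each joining triple of tuples meets on exactly one server, the one addressed by its j and k values, so no result is double-counted and no second shuffle of an intermediate A x B relation is needed.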

Page 55: Democratizing Data Science in the Cloud

Hypercube shuffle: 2 seconds, balanced

Partitioned hash join: 43 seconds, tons of skew

Task: self-multiply with 1M non-zeros

Page 56: Democratizing Data Science in the Cloud

Seung-Hee Bae

Scalable Graph Clustering

Version 1

Parallelize Best-known

Serial Algorithm

ICDM 2013

Version 2

Free 30% improvement

for any algorithm

TKDD 2014 SC 2015

Version 3

Distributed approx.

algorithm, 1.5B edges

Page 57: Democratizing Data Science in the Cloud

http://escience.washington.edu

http://myria.cs.washington.edu

http://uwescience.github.io/sqlshare/

Page 58: Democratizing Data Science in the Cloud

VIZDECK: VISUALIZATION

RECOMMENDATION


Page 59: Democratizing Data Science in the Cloud

“Data Triage” Pipeline


SAS

Excel

XML

CSV

SQL Azure

Files Tables Views

parse /

extract

“relational

analysis”

visual

analysis

Visualizations

SIGMOD 11, SSDBM 13, SIGMOD 16

sqlshare.escience.washington.edu

CHI 12, SIGMOD 12, iConference 13

SSDBM 11, CiSE 13, SSDBM 15

Page 60: Democratizing Data Science in the Cloud


Page 61: Democratizing Data Science in the Cloud


Page 62: Democratizing Data Science in the Cloud
Page 63: Democratizing Data Science in the Cloud

video

[Bar chart: Task Completion Rate / Time, all questions, comparing Fusion, VizDeck, ManyEyes, and Tableau; y-axis from 0.0 to 1.2]

CHI 13

Page 64: Democratizing Data Science in the Cloud

Visualization Recommendation

• Model each “vizlet” as a triple

(x_column, y_column, vizlet_type)

• Extract features from each column

(f1x, f2x,…, fNx, f1y, f2y, …, fNy, vizlet_type)

• Interpret each “promotion” as a yes vote and each “discard” as a

no vote

• Train a (simple) model to predict vizlet type from features

• Recommend highest-scoring vizlets

• Add a diversity term to prevent a bunch of similar plots

• Incorporate score modifiers defined by the vizlet designer

– “My bar chart looks best when there are about 5 bars.”

– “My timeseries plot ignores null values”


Page 65: Democratizing Data Science in the Cloud

Example of a Learned Rule (1)

low x-entropy => bad scatter plot

bad scatter plot, good scatter plot

Page 66: Democratizing Data Science in the Cloud

Example of a Learned Rule (2)

low x-entropy => histogram


bad scatter plot good histogram
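The "x-entropy" feature in learned rules (1) and (2) can be read as the Shannon entropy of the values in the x column. The sketch below computes it and applies a hand-written version of the rule. The threshold of 1.5 bits and the function names are hypothetical; in the approach described on these slides the rule is learned from user votes, not hard-coded.

```python
from collections import Counter
from math import log2


def column_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of a column."""
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def suggest_vizlet(x_values, low_entropy_bits=1.5):
    """Toy rule: few distinct x values favor a histogram over a scatter plot."""
    if column_entropy(x_values) < low_entropy_bits:
        return "histogram"
    return "scatter"


categorical = ["a", "a", "b", "b"]
continuous = list(range(8))
choices = (suggest_vizlet(categorical), suggest_vizlet(continuous))
```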

Page 67: Democratizing Data Science in the Cloud

Example of a Learned Rule (3)


high x-periodicity => timeseries plot

(periodicity = 1 / variance in gap length between successive values)
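The periodicity feature defined above translates directly into code. This sketch assumes gaps are taken between values in the order given, uses the population variance, and returns infinity for perfectly regular spacing; all three choices, and the function name, are assumptions the slide does not specify.

```python
from statistics import pvariance


def periodicity(values):
    """1 / variance of the gap lengths between successive values."""
    gaps = [b - a for a, b in zip(values, values[1:])]
    if not gaps:
        return 0.0  # fewer than two values: no gaps to measure
    var = pvariance(gaps)
    return float("inf") if var == 0 else 1.0 / var


regular = periodicity([0, 1, 2, 3, 4])      # evenly spaced timestamps
irregular = periodicity([0, 1, 3, 4, 6])    # gaps alternate between 1 and 2
```

Evenly sampled columns such as timestamps score high, which matches the learned rule that high x-periodicity suggests a timeseries plot.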

Page 68: Democratizing Data Science in the Cloud

Voyager


Kanit “Ham” Wongsuphasawat Dominik Moritz

InfoVis 15

Page 69: Democratizing Data Science in the Cloud

Within the first few queries, you’ve

touched all the tables.

SIGMOD 2016

Shrainik Jain

http://uwescience.github.io/sqlshare/