DeepDive: A Data Management System for Automatic Knowledge Base Construction Ce Zhang Department of Computer Sciences [email protected]

PowerPoint Presentationftp.cs.wisc.edu/machine-learning/shavlik-group/zhang... · PPT file · Web view2015-08-25 · DeepDive: A Data Management System for Automatic Knowledge Base


Page 1:

DeepDive: A Data Management System for Automatic Knowledge Base Construction

Ce Zhang

Department of Computer Sciences

[email protected]

Page 2:

http://deepdive.stanford.edu

DeepDive for Knowledge Base Construction (KBC)

Page 3:

Overview

It is feasible to build a data management system to support the end-to-end workflow of building KBC applications.

Application: Why KBC? How does DeepDive help KBC?

Abstraction: How to build a KBC application with DeepDive?

Techniques: How to make DeepDive efficient and scalable?

Page 4:

Overview

It is feasible to build a data management system to support the end-to-end workflow of building KBC applications.

Application (covered in prelim.): Why KBC? How does DeepDive help KBC?

Abstraction (covered in prelim.): How to build a KBC application with DeepDive?

Techniques: How to make DeepDive efficient and scalable?

Page 5:

DeepDive Workflow

[Figure: DeepDive workflow. Stages: feature extraction, probabilistic knowledge engineering, statistical learning & inference. Components: input sources, external KB, feature extractor, features, supervision rule, domain knowledge rule, factor graph over random variables (R.V.), inference result with a probability p per fact (e.g., 0.9, 0.6).]

[IEEE Data Eng. Bull. 2014]

Page 6:

Overview

Application: Why KBC? How does DeepDive help KBC?

Abstraction: How to build a KBC application with DeepDive?

Techniques: How to make DeepDive efficient and scalable?

Page 7:

Technique: Teasers

1. One-shot Execution: Performant and scalable statistical inference and learning on modern hardware.

2. Iterative Execution: Materialization optimizations to support exploratory, iterative development for statistical workloads.

Techniques: How to make DeepDive efficient and scalable?

Page 8:

Why are there efficiency and scalability challenges in DeepDive?

Page 9:

Data Flow of PaleoDeepDive

[Figure: the DeepDive workflow annotated with PaleoDeepDive data sizes. Stages: feature extraction, probabilistic knowledge engineering, statistical learning & inference. Components: input sources, external KB, feature extractor, features, supervision rule, domain knowledge rule, factor graph over random variables (R.V.), inference result with probabilities (e.g., 0.9, 0.6). Annotated sizes: 300 K, 2 TB; > 10 M tuples; 3 TB; 0.3 B variables, 0.7 B factors.]

[IEEE Data Eng. Bull. 2014]

Add a new feature!

Add a new rule!

Batch Execution

Incremental Maintenance

Page 10:

Techniques

Batch Execution

Scalable statistical inference (via Gibbs sampling) over factor graphs. [SIGMOD 2013]

Performant statistical learning on modern hardware. [VLDB 2014]

Incremental Maintenance

Performant iterative feature selection. [SIGMOD 2014]

Performant iterative feature engineering. [VLDB 2015]

Page 11:

Scalable Gibbs Sampling: System Elementary

Page 12:

Scalable Gibbs Sampling

Goal: Scalable statistical inference.

Contributions: Reexamine the impact of classical DB tradeoffs on Gibbs sampling.

Setting: Terabyte-scale databases; data stored in different storage backends.

Tradeoffs: Materialization; page-oriented layout; buffer replacement.

Results: Run inference on 6 TB factor graphs on a single machine in 1 day. Topic modeling and relation extraction over 1 billion words every day.

[SIGMOD 2013]

Page 13:

Overview

Background: Gibbs sampling & factor graph

Elementary

Experimental Results

Page 14:

Gibbs Sampling

Background: Gibbs Sampling & Factor Graph

Variables and factors: If we set v1 to True, we are rewarded with 5 points! If we set v2 and v3 to the same value, we get 10 more points!

[Figure: factor graph with variables v1, v2, v3 and factors f1, f2. The assignment v1 = F, v2 = T, v3 = F is one “possible world”.]

Probability:

1. Initialize variables with a random assignment.
2. For each random variable:
   2.1 Calculate the points we earn for each assignment,
       e.g., v2 = T: 0 points
             v2 = F: 10 points
   2.2 Randomly pick one assignment,
       e.g., P(v2 = T) = exp(0)/(exp(0)+exp(10))
             P(v2 = F) = exp(10)/(exp(0)+exp(10))
3. Generate one sample. Go to 2 if we want more samples.
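The sampling loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Elementary's implementation: the function names (`score`, `conditional_true_prob`, `gibbs`) are made up for this example, and the slide's point values are treated as log-weights.

```python
import math
import random

# Factor graph from the slide: f1 rewards v1 = True with 5 points,
# f2 rewards v2 == v3 with 10 points. Points are treated as log-weights.
VARIABLES = ["v1", "v2", "v3"]


def score(world):
    """Total points earned by one possible world (dict: variable -> bool)."""
    points = 0.0
    if world["v1"]:
        points += 5.0
    if world["v2"] == world["v3"]:
        points += 10.0
    return points


def conditional_true_prob(world, var):
    """P(var = True | all other variables), proportional to exp(points)."""
    s_true = score({**world, var: True})
    s_false = score({**world, var: False})
    # Subtract the max before exponentiating for numerical stability.
    m = max(s_true, s_false)
    e_true, e_false = math.exp(s_true - m), math.exp(s_false - m)
    return e_true / (e_true + e_false)


def gibbs(num_samples, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial assignment.
    world = {v: rng.random() < 0.5 for v in VARIABLES}
    samples = []
    for _ in range(num_samples):
        # Step 2: resample each variable given the others.
        for var in VARIABLES:
            world[var] = rng.random() < conditional_true_prob(world, var)
        # Step 3: one full pass over the variables yields one sample.
        samples.append(dict(world))
    return samples
```

With v3 = False, `conditional_true_prob(world, "v2")` evaluates to exp(0)/(exp(0)+exp(10)), matching step 2.2 above.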

Page 15:

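Gibbs Sampling as Joins

Variables and factors are stored as two relations:

Assignments (A): Variable ID, Assignment
Edges (E): Variable ID, Factor ID

Example tuples:

Variable ID  Factor ID
v2           f2

Variable ID  Factor ID
v3           f2

Variable ID  Assignment
v3           False

[Figure: the example factor graph with v1 = F, v2 = T, v3 = F and factors f1, f2.]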

Page 16:

More about Joins

v   f   v’  a
v1  f1  v1  F
v2  f2  v2  T
v2  f2  v3  F
v3  f2  v2  T
v3  f2  v3  F

[Figure: the example factor graph with v1 = F, v2 = T, v3 = F and factors f1, f2.]

Twist 1: Update the view Q after each variable.

Twist 2: Run sequential scans multiple times in the same order (v1, v2, v3, then a new epoch).
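The table on this slide can be reproduced by joining the edge and assignment relations. The sketch below assumes the view is Q(v, f, v', a) = E(v, f) joined with E(v', f) and A(v', a), which is inferred from the table rows rather than stated on the slide; `build_view` and `update_view` are hypothetical names.

```python
# Edges E(v, f) and assignments A(v, a) for the example graph on the slide.
E = [("v1", "f1"), ("v2", "f2"), ("v3", "f2")]
A = {"v1": False, "v2": True, "v3": False}


def build_view(edges, assignments):
    """Q(v, f, v', a): E(v, f) joined with E(v', f) and A(v', a)."""
    by_factor = {}
    for v, f in edges:
        by_factor.setdefault(f, []).append(v)
    return [
        (v, f, v2, assignments[v2])
        for v, f in edges
        for v2 in by_factor[f]
    ]


def update_view(view, var, new_value):
    """Twist 1: after resampling `var`, refresh every row that mentions it."""
    return [
        (v, f, v2, new_value if v2 == var else a)
        for v, f, v2, a in view
    ]


Q = build_view(E, A)
```

Because every epoch scans the variables in the same order (Twist 2), the access pattern over Q is known in advance, which is what the later layout and buffer-replacement choices exploit.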

Page 17:

The State-of-the-art Architecture

Graph in main memory, read directly by the Gibbs sampler.

[Figure: the example factor graph (v1, v2, v3, f1, f2), annotated “Billions!”]

The Elementary Architecture

Storage backend (Unix file, HBase, Accumulo), then a main memory buffer, then the Gibbs sampler.

How do classical DB techniques affect performance and scalability?

Page 18:

Trade-off Space

Materialization

Page-oriented Layout

Buffer Replacement

Page 19:

Trade-off Space: Materialization

[Figure: the four strategies plotted by update cost against lookup cost.]

LAZY: No materialization

EAGER: Materialize Q

V-COC: Materialize QV(v,f,v’) :- E(v,f), E(v’,f)

F-COC: Materialize QF(v’,f,a’) :- E(v’,f), A(v’,a’)
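The two extremes of this trade-off can be made concrete with a toy in-memory version. This is only an illustration of the lookup-cost versus update-cost tension under assumed data structures; the class names `Lazy` and `Eager` are hypothetical and do not come from Elementary.

```python
from collections import defaultdict


class Lazy:
    """No materialization: join E and A at lookup time."""

    def __init__(self, edges, assignments):
        self.edges = list(edges)
        self.assignments = dict(assignments)

    def lookup(self, var):
        # Scan E twice per lookup: once for var's factors, once for co-members.
        factors = {f for v, f in self.edges if v == var}
        return [(f, v2, self.assignments[v2])
                for v2, f in self.edges if f in factors]

    def update(self, var, value):
        self.assignments[var] = value  # a single write


class Eager:
    """Materialize Q(v, f, v', a), keyed by v, and maintain it on update."""

    def __init__(self, edges, assignments):
        members = defaultdict(list)
        for v, f in edges:
            members[f].append(v)
        self.rows = defaultdict(list)
        for v, f in edges:
            for v2 in members[f]:
                self.rows[v].append([f, v2, assignments[v2]])

    def lookup(self, var):
        return [tuple(row) for row in self.rows[var]]  # no join needed

    def update(self, var, value):
        # Every materialized row that mentions var must be rewritten.
        for rows in self.rows.values():
            for row in rows:
                if row[1] == var:
                    row[2] = value
```

`Lazy` pays on every lookup and almost nothing on update; `Eager` is the reverse. V-COC and F-COC sit in between by materializing only part of the join.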

Page 20:

Trade-off Space: Page Layout

Random access, e.g., E(v’, f) in LAZY.

[Figure: the sampler requests a tuple from the main memory buffer, which is backed by storage.]

Page 21:

Trade-off Space: Page Layout

Random access, e.g., E(v’, f) in LAZY.

[Figure: the sampler requests a tuple from the main memory buffer, which is backed by storage.]

Q1: How to organize a relation into pages?

Q2: What buffer replacement strategy to use?

Tuples: t1, t2, …, tn

Visiting sequence: ta1, …, tam

Proposition: Finding the optimal paging strategy for f1,…,fn given the visiting sequence ta1, …, tam is NP-hard for the LRU or OPTIMAL buffer replacement strategy.

HEURISTIC: Greedily pack f1,…,fn into pages according to ta1, …, tam.
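One plausible reading of the greedy heuristic is to fill pages in order of first visit, so that tuples accessed together tend to share a page. The sketch below implements that reading only; the talk does not spell out the exact rule, and `greedy_pack` and `page_switches` are hypothetical helper names.

```python
def greedy_pack(tuples, visiting_sequence, page_size):
    """Fill pages with tuples in order of their first visit."""
    order, seen = [], set()
    for t in visiting_sequence:
        if t not in seen:
            seen.add(t)
            order.append(t)
    # Tuples that are never visited go last, in their original order.
    order.extend(t for t in tuples if t not in seen)
    return [order[i:i + page_size] for i in range(0, len(order), page_size)]


def page_switches(pages, visiting_sequence):
    """Count how often consecutive visits land on different pages."""
    page_of = {t: i for i, page in enumerate(pages) for t in page}
    visited = [page_of[t] for t in visiting_sequence]
    return sum(1 for a, b in zip(visited, visited[1:]) if a != b)


tuples = ["t1", "t2", "t3", "t4", "t5", "t6"]
sequence = ["t1", "t4", "t2", "t5", "t3", "t6"]
in_order_pages = [tuples[i:i + 2] for i in range(0, len(tuples), 2)]
greedy_pages = greedy_pack(tuples, sequence, page_size=2)
```

On this toy sequence the in-order layout switches pages on every step, while the greedy layout switches only when a page is exhausted.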

Page 22:

Trade-off Space: Buffer Replacement

Random access, e.g., E(v’, f) in LAZY.

Tuples: t1, t2, …, tn

Visiting sequence: ta1, …, tam

[Figure: pages are loaded from secondary storage into the main memory buffer and evicted back.]

LRU: Evict the page that is least recently used.

OPTIMAL: Evict the page whose next use lies furthest in the future.
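Because the sampler repeats the same scan every epoch, the future access sequence is known, which makes the OPTIMAL policy implementable. The following sketch counts buffer misses for both policies on a repeated sequential scan; it is a textbook simulation written for this example, not code from Elementary.

```python
from collections import OrderedDict


def lru_misses(sequence, capacity):
    buffer, misses = OrderedDict(), 0
    for page in sequence:
        if page in buffer:
            buffer.move_to_end(page)  # mark as most recently used
        else:
            misses += 1
            if len(buffer) >= capacity:
                buffer.popitem(last=False)  # evict least recently used
            buffer[page] = True
    return misses


def optimal_misses(sequence, capacity):
    buffer, misses = set(), 0
    for i, page in enumerate(sequence):
        if page in buffer:
            continue
        misses += 1
        if len(buffer) >= capacity:
            def next_use(p):
                # Position of the next request for p, or infinity if none.
                for j in range(i + 1, len(sequence)):
                    if sequence[j] == p:
                        return j
                return float("inf")
            buffer.remove(max(buffer, key=next_use))
        buffer.add(page)
    return misses


# The same sequential scan over 4 pages, repeated for 3 epochs.
scan = [0, 1, 2, 3] * 3
```

With a buffer of 3 pages, LRU misses on every access of this cyclic scan, because it always evicts the page that is needed next, while OPTIMAL keeps most of the buffer useful across epochs.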

Page 23:

Trade-off Space: Recap

Materialization: 4 strategies

Page-oriented layout: HEURISTIC

Buffer replacement: OPTIMAL

Page 24:

Overview

Background: Gibbs sampling & factor graph

Elementary

Experimental Results

Page 25:

Experiments

Main experiments: End-to-end comparison with other systems.

Trade-off 1, materialization: Compare LAZY, EAGER, VCOC, FCOC.

Trade-off 2, page-oriented layout: Compare RANDOM, HEURISTIC.

Trade-off 3, buffer replacement: Compare LRU, RANDOM, OPTIMAL.

Page 26:

Experiments Setup

LR: Logistic Regression
CRF: Skip-chain CRF
LDA: Latent Dirichlet Allocation

       Bench (1x)               Scale (100,000x)
       #Var   #Factor  Size     #Var   #Factor  Size
LR     47K    47K      2MB      5B     5B       0.2TB
CRF    47K    94K      3MB      5B     9B       0.3TB
LDA    0.4M   12K      10MB     39B    0.2B     0.9TB

Other systems: FACTORIE (LR, CRF, LDA), PGibbs (LR, CRF, LDA), WinBUGS (LR, LDA), MADLib (LDA)

Page 27:

Main Experiments

[Plot: LR. x-axis: data set size (1x, 10x, 100x, 1,000x, 10,000x, 100,000x). y-axis: throughput (#samples/second), log scale from 1E-06 to 1E+01. Series: EleMM, EleFILE, EleHBASE, other main-memory systems. Annotation: 40GB buffer.]

Page 28:

Trade-offs: Materialization

[Plots: CRF (EleFILE) and LDA (EleFILE). x-axis: different page-size/buffer-size settings. y-axis: normalized throughput (0 to 1). Series: LAZY, EAGER, V-CoC, F-CoC. Annotation: does not finish in 1 hour.]

Page 29:

Trade-offs: Page-oriented Layout

[Plots: CRF (EleFILE) and LDA (EleFILE). x-axis: different page-size/buffer-size settings. y-axis: normalized throughput (0 to 1). Series: Greedy, Shuffle. Annotation: does not finish in 1 hour.]

Page 30:

Trade-offs: Buffer Replacement

[Plots: CRF (EleFILE) and LDA (EleFILE). x-axis: different page-size/buffer-size settings. y-axis: normalized throughput (0 to 1). Series: Optimal, LRU, Random.]

Page 31:

Conclusion (of Elementary)

Task: Gibbs sampling over factor graphs (terabyte-scale factor graphs!).

Elementary System: Scaling up Gibbs sampling by revisiting classical DB techniques.

Page 32:

Data Flow

[Figure: the DeepDive workflow annotated with data sizes. Stages: feature extraction, probabilistic knowledge engineering, statistical learning & inference. Components: input sources, external KB, feature extractor, features, supervision rule, domain knowledge rule, factor graph over random variables (R.V.), inference result with probabilities (e.g., 0.9, 0.6). Annotated sizes: 300 K, 2 TB; > 10 M tuples; 3 TB; 0.3 B variables, 0.7 B factors.]

Add a new feature!

Add a new rule!

Batch Execution

Incremental Maintenance

Page 33:

Feature Selection: System Columbus (Joint effort with Arun & Pradap)

Page 34:

Feature Selection

Data: Customer Information

Name   Age  State  Churn?
Alice  20   CA     Yes
Bob    21   CA     No
Dave   22   WI     ?
…      …

Features: Name, Age, State. Predict: Churn?

Task: Select a subset of features

[SIGMOD 2014]

Page 35:

Feature Selection: Motivation

How does one select features?

Age, Name, State, Credit score, # Calls, # Messages

Statistical Performance

Explanatory Power

Human-in-the-loop Dialogue

[* Interviews are done by Arun and Pradap]

Page 36:

Feature Selection Dialogue

“Age” may affect customer churn

I get an accuracy of 70% by just using {Age}.

[Figure: data table (Name, Age, State, Churn?) with rows Alice, 20, CA, Yes and Bob, 21, CA, No. Subselect {Age}, then train a model to predict Churn?: accuracy = 70%.]

Page 37:

Feature Selection Dialogue

Not bad! Add “Age”

[Figure: data table (Name, Age, State, Churn?) with rows Alice, 20, CA, Yes and Bob, 21, CA, No. Subselect {Age}, then train a model to predict Churn?: accuracy = 70%.]

Page 38:

Feature Selection Dialogue

I want to add one more feature. Which one should I add?

The accuracy of {Age, State} is higher than that of {Age, Name}.

[Figure: data table (Name, Age, State, Churn?) with rows Alice, 20, CA, Yes and Bob, 21, CA, No. Starting from {Age}, subselect and train a model for each candidate: adding Name gives accuracy = 30%, adding State gives accuracy = 80%. Result: {Age, State}.]

Page 39:

Feature Selection Dialogue

Let’s add “State”

[Figure: data table (Name, Age, State, Churn?) with rows Alice, 20, CA, Yes and Bob, 21, CA, No. Subselect and train a model for each candidate: Name gives accuracy = 30%, State gives accuracy = 80%. Selected: {Age, State}.]

Page 40:

Feature Selection Dialogue

… …

[Figure: data table (Name, Age, State, Churn?) with rows Alice, 20, CA, Yes and Bob, 21, CA, No. Subselect and train a model for each candidate (Name: accuracy = 30%, State: accuracy = 80%), starting from {Age, State}; the dialogue continues.]

Page 41:

Feature Selection Dialogue

I want to add three more features out of the 100 available features.

[Figure: starting from {Age, State}, subselect and train a model to predict Churn? (Yes/No) for every candidate set.]

161,700 different models to train!

Can we make this dialogue faster?

How does an analyst specify a dialogue?
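The 161,700 figure is the number of ways to choose 3 features out of 100. A quick check in Python, plus an enumeration over a smaller pool to show what "one model per candidate set" means; the pool below reuses feature names from the motivation slide purely as placeholders.

```python
import math
from itertools import combinations

available = 100
to_add = 3
num_models = math.comb(available, to_add)  # C(100, 3)

# Enumerating the candidates for a smaller pool shows what gets trained:
# one model per extended feature set. Pool contents are placeholders.
pool = ["Name", "Credit score", "# Calls", "# Messages"]
candidates = [{"Age", "State", *extra} for extra in combinations(pool, 2)]
```

Even with this tiny pool, adding two features already requires one training run per pair, which is why reusing computation across runs matters.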

Page 42:

Feature Selection Dialogue

[Figure: subselect and train model steps starting from {Age, State}, with candidate accuracies Acc. = 30% and Acc. = 80%.]

Optimization technique: Make each operation faster (RIOT-DB)

Optimization technique: Reuse computation across operations (Columbus)

Higher-level DSL: StepAdd, CrossValidation, …

Page 43:

Columbus: Technical Contributions

Study opportunities for data and computation reuse

Classical database techniques: Materialized view, shared scan, etc.

Classical numerical analysis techniques: QR decomposition, etc.

Classical DB techniques lead to a 2x speedup.

Applying all techniques yields up to a 100x speedup.

Page 44:

Outline

Experimental Result

System Overview

Materialization Tradeoff

Page 45:

System Overview

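Program, Basic Blocks, R Operations

Program:

    A, b <- DataSet(“file://...”)
    fs1 <- FeatureSet(f1, f2)
    fs2 <- StepAdd(A, fs1)
    fs3 <- FeatureSet(f3)
    fs4 <- UNION(fs2, fs3)

[Figure: the program is compiled into basic blocks (BB: StepAdd over A, b, {f1, f2}; R: UNION) and then into R operations (R: QR(A) and further R operations run in parallel; R: UNION), connecting fs1, fs2, fs3, fs4. The result looks like a query plan. One stage is marked “Focus of this talk.”]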

Page 46:

Basic Block

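[Figure: a StepAdd basic block takes data (A, b), a loss, and a set of subselections; it trains models and returns one accuracy per subselection (Accuracy1, Accuracy2).]

Loss: Linear least squares regression, support vector machine, logistic regression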

Page 47:

Outline

Experimental Result

System Overview

Materialization Tradeoff

Page 48:

Outline

Materialization Tradeoff

Database inspired: Lazy vs. Eager

Numerical analysis inspired: QR decomposition

Page 49:

Linear Basic Block: Lazy Strategy

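[Figure: Task: a basic block with data matrix A (5 rows, 3 columns in the example), vector b, and sub-selections (labelled R and F). Lazy strategy: apply each sub-selection to A (e.g., keep the first column, or the first two columns), then solve each resulting problem using R.]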

Page 50:

Linear Basic Block: Classical Database Opt.

[Figure: Task: a basic block with data matrix A (5 rows, 3 columns in the example), vector b, and sub-selections (labelled R and F). Each sub-selection is applied to A (e.g., keep the first column, or the first two columns) and the resulting problem is solved using R.]

Eager: Project away extra columns (rows)

Batch I/O if all “solves” are scans

Page 51:

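Linear Basic Block: Numerical Analysis Opt.

Background: QR Decomposition

$A = QR$, where $Q$ is orthogonal ($Q^T = Q^{-1}$) and $R$ is upper triangular.

Solve $R x = Q^T b$ for $x$.

Cost, for $A$ with $n$ rows and $d$ columns: $2d^2n$ for the decomposition, $d^2$ for the triangular solve.

[Figure: Task: a basic block with data matrix A (5 rows, 3 columns in the example), vector b, and sub-selections (labelled R and F).]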
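Why solving the triangular system is enough can be seen from the normal equations. The derivation below is a standard one, written under two assumptions not stated on the slide: $A$ has full column rank, and the thin factorization is used, so $Q$ is $n \times d$ with orthonormal columns ($Q^\top Q = I$).

```latex
\begin{align*}
\min_x \|Ax - b\|_2^2
  &\;\Longrightarrow\; A^\top A\, x = A^\top b
  && \text{(normal equations)} \\
  &\;\Longrightarrow\; R^\top Q^\top Q R\, x = R^\top Q^\top b
  && (A = QR) \\
  &\;\Longrightarrow\; R^\top R\, x = R^\top Q^\top b
  && (Q^\top Q = I) \\
  &\;\Longrightarrow\; R\, x = Q^\top b
  && (R \text{ invertible})
\end{align*}
```

Since $R$ is $d \times d$ and upper triangular, the last system is solved by back substitution in on the order of $d^2$ operations, independent of the number of rows $n$.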

Page 52:

Linear Basic Block: Numerical Analysis Opt.

[Figure: $A = QR$, with $Q$ orthogonal ($Q^T = Q^{-1}$) and $R$ upper triangular, is computed once at cost $2d^2n$. For each sub-selection, a smaller triangular system is then solved at cost $d^2$: $R_1 x = Q^T b$, $R_2 x = Q^T b$.]

Page 53:

Linear Basic Block: Lazy vs. QR

Lazy: each of the three solves over A and b costs $d^2n + d^3$.

QR: the decomposition $A = QR$ costs $2d^2n$ once; each of the three solves against $R$ and $Q^T b$ then costs $d^2$.
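The per-solve costs on this slide suggest a simple cost model for choosing between the two strategies. The sketch below is a deliberately crude version: it assumes every solve uses all d columns and ignores constants, I/O, and parallelism, all of which a real optimizer would account for; the function names are hypothetical.

```python
def lazy_cost(n, d, num_solves):
    """Each solve starts from A and b: d^2 n + d^3 per solve."""
    return num_solves * (d * d * n + d ** 3)


def qr_cost(n, d, num_solves):
    """One QR factorization (2 d^2 n), then d^2 per solve."""
    return 2 * d * d * n + num_solves * d * d


def cheaper_strategy(n, d, num_solves):
    """Pick the strategy with the lower modeled cost."""
    if qr_cost(n, d, num_solves) < lazy_cost(n, d, num_solves):
        return "QR"
    return "Lazy"
```

Under this model, a single solve favors Lazy because the factorization is never amortized, while QR wins as soon as its result is reused a few times.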

Page 54:

Linear Basic Block: Tradeoff Space

[Plots: three panels comparing QR and Lazy. Data: time (log scale) against # features (0 to 400). Task: time against # reuse (1 to 20). Parallelism: time against # threads (1 to 20).]

Tradeoff dimensions: Data (e.g., # features), Task (e.g., # reuse), Parallelism (e.g., # threads)

We find that a simple cost-based optimizer works pretty well

Page 55:

Outline

Experimental Result

System Overview

Materialization Tradeoff

Page 56:

Experimental Result

We use feature selection programs from analysts

         # Features  # Rows
KDD      481         191 K
Census   161         109 K
Music    91          515 K
Fund     16          74 M
House    10          2 M

More CrossValidation

More StepAdd

Page 57:

Experimental Results

[Bar chart: x-axis: datasets (KDD, Census, Music, Fund, House). y-axis: execution time (seconds), log scale from 1 to 10000. Series: VanillaR, dbOPT, Columbus. Annotated speedups: 183x, 25x.]

Page 58:

Other Techniques

Sampling-based Optimization: (Coreset) importance sampling, with error tolerance ε

[Figure: solving over a sample of the rows of A and b.]

Non-linear Basic Block: Reduced to linear basic blocks with ADMM. Same tradeoff applies!

Warmstart

Multi-block Optimization: The problem of deciding the optimal merging/splitting of basic blocks is NP-hard. Greedy heuristic.

Page 59:

Conclusion (of Columbus)

Columbus takes advantage of opportunities for data and computation reuse in feature selection workloads.

We build a DSL in Columbus to facilitate the feature selection dialogue.

Page 60:

Recap (Before Future Work)

Application: Why KBC? How does DeepDive help KBC?

Abstraction: How to build a KBC application with DeepDive?

Technique: How to make DeepDive efficient and scalable?

Page 61:

Gibbs sampling over Petabyte Factor Graphs?

Page 62:

Is it possible with Elementary?

Amazon EC2 d2.xlarge instance: $3.216/hour = 48 TB storage
=> Petabyte storage is only $60/hour
=> Full scan in 1.3 hours with 100 machines ($418)
=> 20 epochs = $8360 & 26 hours

Not bad, but not ideal!

How to achieve $8.3K/20 epochs?

How to improve $8.3K/20 epochs?
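The scan-cost figures follow from simple arithmetic on the slide's own numbers. The snippet below reproduces them; the variable names are illustrative, and the prices and timings are taken from the slide rather than checked against current EC2 pricing.

```python
price_per_hour = 3.216      # dollars per instance-hour, from the slide
machines = 100
scan_hours = 1.3            # one full scan, i.e. one epoch
epochs = 20

cost_per_epoch = price_per_hour * machines * scan_hours
total_cost = cost_per_epoch * epochs
total_hours = scan_hours * epochs
```

`cost_per_epoch` comes out at about $418 and `total_hours` at 26; the slide's $8360 total corresponds to rounding the per-epoch cost to $418 before multiplying by 20.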

Page 63:

To Achieve: Better Partitioning

[Figure: the example factor graph (variables v1, v2, v3; factors f1, f2; v1 = F, v2 = T, v3 = F) split across nodes in two ways: Partition Strategy 1 and Partition Strategy 2.]

How to minimize the amount of communication between different nodes? Can we decide this without grounding the whole graph?

Observation: Factor graphs in DeepDive are grounded with high-level rules.

IsNoun(docid, sentid2, wordid2, word2) :- IsNoun(docid, sentid1, wordid1, word2), IsNeighbor(wordid1, wordid2)

Should partition with this key. [PODS 1991]

When there are multiple rules? We just need a database optimizer (hopefully).
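The effect of the partitioning key can be illustrated on a synthetic graph. The sketch assumes that "this key" refers to `docid`, the attribute shared by the head and body of the rule, which is a reading of the slide rather than something it states; the data and the helper `cross_partition_edges` are hypothetical.

```python
def cross_partition_edges(edges, key, num_partitions):
    """Count edges whose endpoints land on different nodes."""
    def node(var):
        return key(var) % num_partitions
    return sum(1 for a, b in edges if node(a) != node(b))


# Hypothetical variables are (docid, wordid) pairs; the rule only links
# neighboring words inside the same document.
edges = [((doc, w), (doc, w + 1)) for doc in range(4) for w in range(4)]

by_docid = cross_partition_edges(edges, key=lambda v: v[0], num_partitions=2)
by_wordid = cross_partition_edges(edges, key=lambda v: v[1], num_partitions=2)
```

Partitioning on `docid` keeps every edge local, whereas partitioning on `wordid` cuts all of them. The point of the slide is that this can be read off the rule itself, before the graph is grounded.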

Page 64:

To Improve: Better Compression

Factors(wordid, feature) :- IsNoun(docid, sentid, wordid, word), WordFeature(word, feature)

[Figure: two occurrences of “dog” each ground the same factors f1, f2, f3; “cat” grounds f4.]

Similar to multivalued dependencies, can we ground only one copy of the factors for the same word? Similar to the idea of ‘lifted inference’, but we are more interested in the systems side.

How does the decision of compression interact with the decision of partition? How far can we push these classic static analysis techniques toward machine learning?

Coming Soon (Hopefully)…