DeepDive: A Data Management System for Automatic Knowledge Base
Construction
Ce Zhang, Department of Computer Sciences
http://deepdive.stanford.edu
DeepDive for Knowledge Base Construction (KBC)
Overview
It is feasible to build a data management system to support the end-to-end workflow of building KBC applications.
Application: Why KBC? How does DeepDive help KBC?
Abstraction: How to build a KBC application with DeepDive?
Techniques: How to make DeepDive efficient and scalable?
Overview
It is feasible to build a data management system to support the end-to-end workflow of building KBC applications.
Application: Why KBC? How does DeepDive help KBC? (Covered in Prelim.)
Abstraction: How to build a KBC application with DeepDive? (Covered in Prelim.)
Techniques: How to make DeepDive efficient and scalable?
DeepDive Workflow
[Figure: the DeepDive workflow. Input Sources and an External KB feed Feature Extraction, where Feature Extractors produce Features; Probabilistic Knowledge Engineering turns Supervision Rules and Domain Knowledge Rules into a Factor Graph over random variables (R.V.); Statistical Learning & Inference then produces an Inference Result with a probability p per variable, e.g., 0.9 and 0.6.]
[IEEE Data Eng. Bull. 2014]
Overview
Application: Why KBC? How does DeepDive help KBC?
Abstraction: How to build a KBC application with DeepDive?
Techniques: How to make DeepDive efficient and scalable?
Technique: Teasers
1. One-shot Execution: Performant and scalable statistical inference and learning on modern hardware.
2. Iterative Execution: Materialization optimizations to support exploratory iterative development for statistical workloads.
Techniques
How to make DeepDive
Efficient and Scalable?
Why are there efficiency and scalability challenges in DeepDive?
Data Flow of PaleoDeepDive
[Figure: the same workflow annotated with PaleoDeepDive's scale: Input Sources of 300 K documents (2 TB); Feature Extraction produces > 10 M tuples of features (3 TB); the grounded factor graph has 0.3 B variables and 0.7 B factors; Statistical Learning & Inference yields results with probabilities, e.g., 0.9 and 0.6.]
[IEEE Data Eng. Bull. 2014]
Two execution modes: Batch Execution for the first run, and Incremental Maintenance when the developer adds a new feature or a new rule.
Techniques
Batch Execution:
- Scalable statistical inference (via Gibbs sampling) over factor graphs. [SIGMOD 2013]
- Performant statistical learning on modern hardware. [VLDB 2014]
Incremental Maintenance:
- Performant iterative feature selection. [SIGMOD 2014]
- Performant iterative feature engineering. [VLDB 2015]
Scalable Gibbs Sampling: System Elementary
Scalable Gibbs Sampling
Goal: Scalable statistical inference.
Contributions: Reexamine the impact of classical DB tradeoffs (materialization, page-oriented layout, buffer replacement) on Gibbs sampling over terabyte-scale databases, with data stored in different storage backends.
Results: Run inference on 6 TB factor graphs on a single machine in 1 day; topic modeling and relation extraction over 1 billion words every day.
[SIGMOD 2013]
Overview
Background: Gibbs sampling & factor graph
Elementary
Experimental Results
Gibbs Sampling
Background: Gibbs Sampling & Factor Graph
[Figure: a factor graph with variables v1, v2, v3 and factors f1, f2. f1: if we set v1 to True, we are rewarded by 5 points. f2: if we set v2 and v3 to the same value, we get 10 more points. The probability of an assignment grows with its total points.]
1. Initialize variables with a random assignment.
2. For each random variable:
   2.1 Calculate the points we earn for each assignment, e.g., one assignment of v2 earns 0 points, the other earns 10 points.
   2.2 Randomly pick one assignment in proportion, e.g., P(v2 = the 0-point value) = exp(0)/(exp(0)+exp(10)) and P(v2 = the 10-point value) = exp(10)/(exp(0)+exp(10)).
3. Generate one sample. Go to step 2 if we want more samples.
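The three steps above can be sketched directly; a minimal Python implementation over the slide's toy graph (the variable names and 5/10-point rewards are from the slide; everything else is illustrative):

```python
import math
import random

def points(assign):
    # f1: reward 5 points if v1 is True.
    # f2: reward 10 points if v2 and v3 agree.
    total = 0.0
    if assign["v1"]:
        total += 5.0
    if assign["v2"] == assign["v3"]:
        total += 10.0
    return total

def gibbs(num_samples, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial assignment.
    assign = {v: rng.random() < 0.5 for v in ("v1", "v2", "v3")}
    samples = []
    for _ in range(num_samples):
        for v in ("v1", "v2", "v3"):
            # Step 2.1: score both assignments of v, others fixed.
            weights = []
            for value in (False, True):
                assign[v] = value
                weights.append(math.exp(points(assign)))
            # Step 2.2: sample proportionally, e.g.
            # exp(10) / (exp(0) + exp(10)) for a 10-point assignment.
            assign[v] = rng.random() < weights[1] / (weights[0] + weights[1])
        # Step 3: record one sample, then sweep again.
        samples.append(dict(assign))
    return samples

samples = gibbs(1000)
frac_v1_true = sum(s["v1"] for s in samples) / len(samples)
frac_agree = sum(s["v2"] == s["v3"] for s in samples) / len(samples)
```

With these rewards, v1 should be True in almost all samples, and v2 and v3 should almost always agree.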
[Figure: two copies of the factor graph, each with a concrete truth assignment (True/False) for v1, v2, v3; each complete assignment is a "possible world".]
Gibbs Sampling as Joins
The factor graph is stored as two relations: Assignments A(Variable ID, Assignment) and Edges E(Variable ID, Factor ID); e.g., E contains (v1, f1), (v2, f2), (v3, f2) and A contains (v1, False), (v2, True), (v3, False).
More about Joins
The view Q(v, f, v', a) joins E with itself and with A: for each variable v, it lists v's factors and the current assignment of every variable v' in those factors.
  v   f   v'  a
  v1  f1  v1  F
  v2  f2  v2  T
  v2  f2  v3  F
  v3  f2  v2  T
  v3  f2  v3  F
Twist 1: Update the view Q after each variable.
Twist 2: Run sequential scans multiple times in the same order: v1, v2, v3, then a new epoch.
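The view-as-join idea can be sketched in SQL; here is a minimal sqlite3 reconstruction (the relation names E, A, Q are from the slide; the tuple contents are the toy example, and the alias `vp` stands in for v'):

```python
import sqlite3

# Edges E(var, factor) and Assignments A(var, assignment), plus the
# view Q that, for each variable v, exposes its factors and the current
# assignments of every variable vp appearing in those factors.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE E(v TEXT, f TEXT);
CREATE TABLE A(v TEXT, a TEXT);
INSERT INTO E VALUES ('v1','f1'), ('v2','f2'), ('v3','f2');
INSERT INTO A VALUES ('v1','F'), ('v2','T'), ('v3','F');
CREATE VIEW Q AS
  SELECT e1.v AS v, e1.f AS f, e2.v AS vp, a.a AS a
  FROM E e1
  JOIN E e2 ON e1.f = e2.f
  JOIN A a  ON a.v = e2.v;
""")
rows = sorted(db.execute("SELECT * FROM Q"))
```

Evaluating Q reproduces the five-row table on the slide; the "twists" are that Q must be refreshed after each variable update and is scanned in the same variable order every epoch.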
The State-of-the-art Architecture
How do classical DB techniques play a role in performance and scalability?
Elementary
[Figure: the example factor graph again, but now with billions of variables and factors.]
The Elementary Architecture
[Figure: the state-of-the-art architecture keeps the whole factor graph in main memory next to the Gibbs sampler. The Elementary architecture instead keeps the graph in a storage backend (Unix file, HBase, or Accumulo) and pages it through a main-memory buffer feeding the Gibbs sampler.]
Trade-off Space
Materialization
Page-oriented Layout
Buffer Replacement
Trade-off Space: Materialization
[Figure: the four strategies plotted by update cost vs. lookup cost.]
LAZY: No materialization.
EAGER: Materialize Q.
V-COC: Materialize QV(v, f, v') :- E(v, f), E(v', f).
F-COC: Materialize QF(v', f, a') :- E(v', f), A(v', a').
Trade-off Space: Page Layout
[Figure: a random access, e.g., a lookup of E(v', f) under LAZY, sends a request to storage and returns a tuple to the main-memory buffer.]
Trade-off Space: Page Layout
e.g., E(v’, f) in LAZY
Tuples: t1, t2, …, tn
Visiting Sequence: ta1, …, tam
Proposition: Finding the optimal paging strategy for f1,…,fn given visiting sequence ta1, …, tam is NP-hard for LRU or OPTIMAL buffer replacement strategy.
HEURISTIC: Greedily pack f1,…,fn into pages according to ta1, …, tam.
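That heuristic can be sketched in a few lines (the function name and the toy visiting sequence are made up): tuples are laid onto pages in order of their first visit, so tuples visited together tend to share a page.

```python
def greedy_pack(visit_seq, page_size):
    # Pack tuples into pages in order of first appearance in the
    # visiting sequence; co-visited tuples end up on the same page.
    pages, seen = [], set()
    for t in visit_seq:
        if t not in seen:
            if not pages or len(pages[-1]) == page_size:
                pages.append([])          # open a fresh page
            pages[-1].append(t)
            seen.add(t)
    return pages

# t3 and t1 are visited back-to-back, so they share the first page.
pages = greedy_pack(["t3", "t1", "t3", "t2", "t4"], page_size=2)
```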
Random Access
Storage Main Memory Buffer
RequestTuple
Q1: How to organize a relation into pages?
Q2: What buffer replacement strategy to use?
Trade-off Space: Buffer Replacement
Random Access
e.g., E(v’, f) in LAZY
LRU: Evict the page that is Least-Recently-Used.
OPTIMAL: Evict the page whose next use is farthest in the future.
[Figure: pages are loaded from secondary storage into the main-memory buffer and evicted back as needed.]
Tuples: t1, t2, …, tn
Visiting Sequence: ta1, …, tam
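The LRU/OPTIMAL contrast is easy to simulate; a small sketch (page IDs and buffer size are made up) shows why repeated same-order scans, as in Gibbs epochs, are a worst case for LRU:

```python
def misses_lru(seq, buf_size):
    # Least-Recently-Used: evict the page untouched for longest.
    buf, last_use, misses = set(), {}, 0
    for t, p in enumerate(seq):
        if p not in buf:
            misses += 1
            if len(buf) == buf_size:
                buf.remove(min(buf, key=lambda q: last_use[q]))
            buf.add(p)
        last_use[p] = t
    return misses

def misses_optimal(seq, buf_size):
    # Belady's OPT: evict the page whose next use is farthest away.
    buf, misses = set(), 0
    for t, p in enumerate(seq):
        if p not in buf:
            misses += 1
            if len(buf) == buf_size:
                def next_use(q):
                    for u in range(t + 1, len(seq)):
                        if seq[u] == q:
                            return u
                    return len(seq)  # never used again
                buf.remove(max(buf, key=next_use))
            buf.add(p)
    return misses

# A cyclic reference string: 4 pages scanned in the same order for
# 5 epochs, with room for only 3 pages in the buffer.
seq = [1, 2, 3, 4] * 5
```

On this sequence LRU misses on every single access (20 misses), while OPTIMAL misses far less often; that gap is why the replacement policy matters for epoch-structured scans.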
Trade-off Space: Recap
Materialization: 4 strategies (LAZY, EAGER, V-COC, F-COC)
Page-oriented Layout: HEURISTIC
Buffer Replacement: OPTIMAL
Overview
Background: Gibbs sampling & factor graph
Elementary
Experimental Results
Experiments
Main experiments: end-to-end comparison with other systems.
Trade-off 1 (Materialization): compare LAZY, EAGER, V-COC, F-COC.
Trade-off 2 (Page-oriented Layout): compare RANDOM, HEURISTIC.
Trade-off 3 (Buffer Replacement): compare LRU, RANDOM, OPTIMAL.
Experiments Setup
LR: Logistic Regression; CRF: Skip-chain CRF; LDA: Latent Dirichlet Allocation.

        Bench (1x)               Scale (100,000x)
        #Var   #Factor  Size     #Var   #Factor  Size
  LR    47K    47K      2MB      5B     5B       0.2TB
  CRF   47K    94K      3MB      5B     9B       0.3TB
  LDA   0.4M   12K      10MB     39B    0.2B     0.9TB

Competitors: FACTORIE (LR, CRF, LDA), PGibbs (LR, CRF, LDA), WinBUGS (LR, LDA), MADLib (LDA).
Main Experiments
[Figure: throughput (# samples/second, log scale) vs. data set size from 1x to 100,000x on LR. The Elementary variants (EleMM, EleFILE, EleHBASE, each with a 40 GB buffer) keep running at 100,000x, while the other main-memory systems stop once the graph no longer fits in memory.]
Trade-offs: Materialization
[Figure: normalized throughput of LAZY, EAGER, V-CoC, and F-CoC on CRF (EleFILE) and LDA (EleFILE) across different page-size/buffer-size settings; some strategies do not finish in 1 hour.]
Trade-offs: Page-oriented Layout
[Figure: normalized throughput of the Greedy and Shuffle layouts on CRF (EleFILE) and LDA (EleFILE) across different page-size/buffer-size settings; some settings do not finish in 1 hour.]
Trade-offs: Buffer Replacement
[Figure: normalized throughput of Optimal, LRU, and Random replacement on CRF (EleFILE) and LDA (EleFILE) across different page-size/buffer-size settings.]
Conclusion (of Elementary)
Task
Gibbs Sampling over Factor Graphs (terabyte-scale factor graphs!).
Elementary System
Scaling up Gibbs sampling by revisiting classical DB techniques.
Data Flow
[Figure: the same data flow at PaleoDeepDive scale (300 K documents / 2 TB in; > 10 M feature tuples / 3 TB; a factor graph with 0.3 B variables and 0.7 B factors). Batch Execution is now checked off (✔); Incremental Maintenance, triggered by adding a new feature or a new rule, is up next.]
Feature Selection: System Columbus (joint effort with Arun & Pradap)
Feature Selection
Task: select a subset of features to predict customer churn.
[Table: customer information with features Name, Age, State and label Churn?; e.g., (Alice, 20, CA, Yes), (Bob, 21, CA, No), and the unlabeled (Dave, 22, WI, ?).]
[SIGMOD 2014]
Feature Selection: Motivation
How does one select features? Candidate features (Name, Age, State, Credit score, # Calls, # Messages) are weighed for statistical performance and explanatory power in a human-in-the-loop dialogue.
[* Interviews were done by Arun and Pradap]
Feature Selection Dialogue
Analyst: "Age" may affect customer churn.
System: I get an accuracy of 70% by just using {Age}.
[Figure: subselect the Age column from the data, train a model; accuracy = 70%.]
Feature Selection Dialogue
Analyst: Not bad! Add "Age".
[Figure: the same subselect/train pipeline; {Age} is kept with accuracy = 70%.]
Feature Selection Dialogue
Analyst: I want to add one more feature; which one should I add?
System: The accuracy of {Age, State} is higher than that of {Age, Name}.
[Figure: subselect both {Age, State} and {Age, Name}, train both models; accuracy = 80% vs. 30%.]
Feature Selection Dialogue
Analyst: Let's add "State".
[Figure: {Age, State} is kept; accuracy = 80% vs. 30% for the alternatives.]
Feature Selection Dialogue
… and the dialogue continues, feature by feature.
[Figure: the subselect/train loop repeats with growing feature sets.]
Feature Selection Dialogue
Analyst: I want to add three more features out of the 100 available features.
That is 161,700 different models to train!
Can we make this dialogue faster? How does an analyst specify a dialogue?
[Figure: the subselect/train loop fans out over many candidate feature sets.]
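The model count is just a binomial coefficient; a one-line check (choosing which 3 of the 100 remaining features to add means training one model per 3-element subset):

```python
from math import comb

# One model per way of choosing 3 features out of 100.
num_models = comb(100, 3)  # 161700
```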
Feature Selection Dialogue
[Figure: the same subselect/train pipeline annotated with two optimization opportunities: make each operation faster, and reuse computation across operations (cf. RIOT-DB). Columbus exposes a higher-level DSL with operations such as StepAdd and CrossValidation.]
Columbus: Technical Contributions
Classical database techniques: materialized views, shared scans, etc.
Classical numerical analysis techniques: QR decomposition, etc.
We study opportunities for data and computation reuse: classical DB techniques alone lead to a 2x speedup; applying all techniques improves performance by up to 100x.
Outline
Experimental Result
System Overview
Materialization Tradeoff
System Overview
A Columbus program compiles into basic blocks and then into R operations:

  A, b <- DataSet("file://...")
  fs1 <- FeatureSet(f1, f2)
  fs2 <- StepAdd(A, fs1)
  fs3 <- FeatureSet(f3)
  fs4 <- UNION(fs2, fs3)

[Figure: the compiled plan looks like a query plan: a StepAdd basic block over (A, b, {f1, f2}) lowers to R operations such as QR(A) that run in parallel, and a final UNION produces fs4. The basic block is the focus of this talk.]
Basic Block
[Figure: a StepAdd basic block takes data (A, b), a loss function, and a set of subselections; it trains one model per subselection and reports each model's accuracy.]
Supported losses include linear least squares regression, support vector machines, and logistic regression.
Outline
Experimental Result
System Overview
Materialization Tradeoff
Outline
Database Inspired: Lazy vs. Eager Materialization Tradeoff
Numerical Analysis Inspired: QR Decomposition
Linear Basic Block: Lazy Strategy
[Figure: the task. A is a 5x3 data matrix (entries a through s) with target vector b; each subselection picks a subset of A's columns (e.g., the first column, or the first two columns). Under LAZY, the sub-selection is applied at solve time and each reduced least-squares problem is solved from scratch using R.]
Linear Basic Block: Classical Database Opt.
[Figure: the same task. EAGER projects away the extra columns (and rows) up front; I/O can be batched when all "solves" are sequential scans.]
Background: QR Decomposition
Linear Basic Block: Numerical Analysis Opt.
Factor A = QR, where Q is orthogonal (Q^T = Q^(-1)) and R is upper triangular; computing the factorization of an n x d matrix A costs about 2d^2n flops. Least squares then reduces to the d x d triangular system R x = Q^T b, which costs only about d^2 to solve.
Task: solve least squares over several column subsets of A against the same b.
Linear Basic Block: Numerical Analysis Opt.
Pay the ~2d^2n factorization of A once; every subselection F then reuses it. Solving R_F x = Q^T b, where R_F keeps only the columns of R selected by F (e.g., R1 and R2 for two different subsets), costs about d^2 per subset.
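The reuse trick can be sketched with NumPy (assuming NumPy is available; the matrix sizes and the subset [0, 2, 3] are made up): factor A once, then solve every column subset against the small R and the shared Q^T b instead of touching A again.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 6
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Upfront work (~2*d^2*n flops): one thin QR of the full design matrix.
Q, R = np.linalg.qr(A)   # A = Q R, Q has orthonormal columns, R is upper triangular
c = Q.T @ b              # d-vector shared by every subset

def solve_subset(cols):
    # Least squares on A[:, cols] without touching A:
    # ||A_F x - b||^2 = ||R_F x - Q^T b||^2 + const, a d-sized problem.
    x, *_ = np.linalg.lstsq(R[:, cols], c, rcond=None)
    return x

x_fast = solve_subset([0, 2, 3])
# Reference: solve the subset problem directly against A and b.
x_ref, *_ = np.linalg.lstsq(A[:, [0, 2, 3]], b, rcond=None)
```

Because A_F = Q R_F and Q has orthonormal columns, both solves share the same minimizer; only the per-subset cost differs.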
Linear Basic Block: Lazy vs. QR
LAZY pays nothing up front but about d^2n + d^3 per solve; QR pays 2d^2n up front and only about d^2 per solve thereafter.
Linear Basic Block: Tradeoff Space
[Figure: runtime of QR vs. LAZY as a function of the data (# features), the task (# reuses), and the parallelism (# threads); neither strategy dominates, and the crossover point moves along all three axes.]
We find that a simple cost-based optimizer over data, task, and parallelism works well.
Outline
Experimental Result
System Overview
Materialization Tradeoff
Experimental Results
We use feature selection programs from analysts; some are CrossValidation-heavy, others StepAdd-heavy.

  Dataset  # Features  # Rows
  KDD      481         191 K
  Census   161         109 K
  Music    91          515 K
  Fund     16          74 M
  House    10          2 M

[Figure: execution time (seconds, log scale) of VanillaR, dbOPT, and Columbus on each dataset; the highlighted speedups of Columbus reach 183x and 25x.]
Other Techniques
Sampling-based optimization: solve on an importance-sampled subset of the rows (a coreset) within error tolerance ε.
Non-linear basic blocks: reduce to linear basic blocks via ADMM; the same tradeoff applies.
Multi-block optimization: deciding the optimal merging/splitting of basic blocks is NP-hard, so we use a greedy heuristic, plus warmstarting across blocks.
Conclusion (of Columbus)
Columbus takes advantage of opportunities for data and computation reuse in feature selection workloads.
We build a DSL in Columbus to facilitate the feature selection dialogue.
Recap (Before Future Work)
Application: Why KBC? How does DeepDive help KBC?
Abstraction: How to build a KBC application with DeepDive?
Technique: How to make DeepDive efficient and scalable?
Gibbs sampling over petabyte-scale factor graphs? Is it possible with Elementary?
An Amazon EC2 d2.xlarge instance costs $3.216/hour with 48 TB of storage:
=> petabyte-scale storage is only ~$60/hour
=> a full scan takes 1.3 hours with 100 machines ($418)
=> 20 epochs = $8,360 and 26 hours
Not bad, but not ideal!
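The arithmetic can be checked directly (the price, storage figure, and scan time are the slide's circa-2015 numbers, not current AWS pricing):

```python
# Slide's back-of-envelope costing for petabyte-scale Gibbs sampling.
price_per_hour = 3.216   # $/hour per d2.xlarge instance (slide's figure)
machines = 100
scan_hours = 1.3         # one full scan of the petabyte graph
epochs = 20

cost_per_scan = machines * scan_hours * price_per_hour  # ~$418
total_cost = epochs * cost_per_scan                     # ~$8.4K
total_hours = epochs * scan_hours                       # 26 hours
```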
How to achieve $8.3K/20 epochs?
How to improve $8.3K/20 epochs?
To Achieve: Better Partitioning
[Figure: the same factor graph (variables v1, v2, v3; factors f1, f2) partitioned across machines under two different strategies; the edges cut by each partition differ.]
How do we minimize the amount of communication between different nodes? Can we decide this without grounding the whole graph?
Observation: factor graphs in DeepDive are grounded with high-level rules, e.g.:

  IsNoun(docid, sentid2, wordid2, word2) :-
      IsNoun(docid, sentid1, wordid1, word2),
      IsNeighbor(wordid1, wordid2)

We should partition with this key [PODS 1991]. When there are multiple rules, we just need a database optimizer (hopefully).
To Improve: Better Compression

  Factors(wordid, feature) :-
      IsNoun(docid, sentid, wordid, word),
      WordFeature(word, feature)

[Figure: every occurrence of the word "dog" grounds the same factors f1, f2, f3; "cat" grounds f4.]
Similar to multi-valued dependencies: can we ground only one copy of the factors for the same word? This is similar in spirit to "lifted inference", but we are more interested in the systems side.
Coming Soon (Hopefully)…
How does the decision of compression interact with the decision of partitioning? How far can we push these classic static analysis techniques into machine learning?