DeepDive: A Data Management System for Automatic Knowledge Base
Construction
Ce Zhang, Department of Computer Sciences
http://deepdive.stanford.edu
DeepDive for Knowledge Base Construction (KBC)
Overview
It is feasible to build a data management system to support the end-to-end workflow of building KBC applications.
Application: Why KBC? How does DeepDive help KBC?
Abstraction: How to build a KBC application with DeepDive?
Techniques: How to make DeepDive efficient and scalable?
Overview
It is feasible to build a data management system to support the end-to-end workflow of building KBC applications.
Application: Why KBC? How does DeepDive help KBC? (Covered in Prelim.)
Abstraction: How to build a KBC application with DeepDive? (Covered in Prelim.)
Techniques: How to make DeepDive efficient and scalable?
DeepDive Workflow
[Figure: the DeepDive workflow. Input Sources and an External KB feed Feature Extraction, where Feature Extractors produce Features; Probabilistic Knowledge Engineering turns Supervision Rules and Domain Knowledge Rules into a Factor Graph over random variables (R.V.); Statistical Learning & Inference then produces an Inference Result with a probability p per variable, e.g., 0.9 and 0.6.]
[IEEE Data Eng. Bull. 2014]
Overview
Application: Why KBC? How does DeepDive help KBC?
Abstraction: How to build a KBC application with DeepDive?
Techniques: How to make DeepDive efficient and scalable?
Technique: Teasers
1. One-shot Execution: Performant and scalable statistical inference and learning on modern hardware.
2. Iterative Execution: Materialization optimizations to support exploratory iterative development for statistical workloads.
Techniques
How to make DeepDive
Efficient and Scalable?
Why are there efficiency and scalability challenges in DeepDive?
Data Flow of PaleoDeepDive
[Figure: the same workflow annotated with PaleoDeepDive's scale: Input Sources of 300 K documents (2 TB); Feature Extraction produces > 10 M tuples of features (3 TB); the grounded factor graph has 0.3 B variables and 0.7 B factors; Statistical Learning & Inference yields results with probabilities, e.g., 0.9 and 0.6.]
[IEEE Data Eng. Bull. 2014]
Two execution modes: Batch Execution for the first run, and Incremental Maintenance when the developer adds a new feature or a new rule.
Techniques
Batch Execution:
- Scalable statistical inference (via Gibbs sampling) over factor graphs. [SIGMOD 2013]
- Performant statistical learning on modern hardware. [VLDB 2014]
Incremental Maintenance:
- Performant iterative feature selection. [SIGMOD 2014]
- Performant iterative feature engineering. [VLDB 2015]
Scalable Gibbs Sampling: System Elementary
Scalable Gibbs Sampling
Goal: Scalable statistical inference.
Contributions: Reexamine the impact of classical DB tradeoffs (materialization, page-oriented layout, buffer replacement) on Gibbs sampling over terabyte-scale databases, with data stored in different storage backends.
Results: Run inference on 6 TB factor graphs on a single machine in 1 day; topic modeling and relation extraction over 1 billion words every day.
[SIGMOD 2013]
Overview
Background: Gibbs sampling & factor graph
Elementary
Experimental Results
Gibbs Sampling
Background: Gibbs Sampling & Factor Graph
[Figure: a factor graph with variables v1, v2, v3 and factors f1, f2. f1: if we set v1 to True, we are rewarded by 5 points. f2: if we set v2 and v3 to the same value, we get 10 more points. The probability of an assignment grows with its total points.]
1. Initialize variables with a random assignment.
2. For each random variable:
   2.1 Calculate the points we earn for each assignment, e.g., one assignment of v2 earns 0 points, the other earns 10 points.
   2.2 Randomly pick one assignment in proportion, e.g., P(v2 = the 0-point value) = exp(0)/(exp(0)+exp(10)) and P(v2 = the 10-point value) = exp(10)/(exp(0)+exp(10)).
3. Generate one sample. Go to step 2 if we want more samples.
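The three steps above can be sketched directly; a minimal Python implementation over the slide's toy graph (the variable names and 5/10-point rewards are from the slide; everything else is illustrative):

```python
import math
import random

def points(assign):
    # f1: reward 5 points if v1 is True.
    # f2: reward 10 points if v2 and v3 agree.
    total = 0.0
    if assign["v1"]:
        total += 5.0
    if assign["v2"] == assign["v3"]:
        total += 10.0
    return total

def gibbs(num_samples, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial assignment.
    assign = {v: rng.random() < 0.5 for v in ("v1", "v2", "v3")}
    samples = []
    for _ in range(num_samples):
        for v in ("v1", "v2", "v3"):
            # Step 2.1: score both assignments of v, others fixed.
            weights = []
            for value in (False, True):
                assign[v] = value
                weights.append(math.exp(points(assign)))
            # Step 2.2: sample proportionally, e.g.
            # exp(10) / (exp(0) + exp(10)) for a 10-point assignment.
            assign[v] = rng.random() < weights[1] / (weights[0] + weights[1])
        # Step 3: record one sample, then sweep again.
        samples.append(dict(assign))
    return samples

samples = gibbs(1000)
frac_v1_true = sum(s["v1"] for s in samples) / len(samples)
frac_agree = sum(s["v2"] == s["v3"] for s in samples) / len(samples)
```

With these rewards, v1 should be True in almost all samples, and v2 and v3 should almost always agree.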
[Figure: two copies of the factor graph, each with a concrete truth assignment (True/False) for v1, v2, v3; each complete assignment is a "possible world".]
Gibbs Sampling as Joins
The factor graph is stored as two relations: Assignments A(Variable ID, Assignment) and Edges E(Variable ID, Factor ID); e.g., E contains (v1, f1), (v2, f2), (v3, f2) and A contains (v1, False), (v2, True), (v3, False).
More about Joins
The view Q(v, f, v', a) joins E with itself and with A: for each variable v, it lists v's factors and the current assignment of every variable v' in those factors.
  v   f   v'  a
  v1  f1  v1  F
  v2  f2  v2  T
  v2  f2  v3  F
  v3  f2  v2  T
  v3  f2  v3  F
Twist 1: Update the view Q after each variable.
Twist 2: Run sequential scans multiple times in the same order: v1, v2, v3, then a new epoch.
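The view-as-join idea can be sketched in SQL; here is a minimal sqlite3 reconstruction (the relation names E, A, Q are from the slide; the tuple contents are the toy example, and the alias `vp` stands in for v'):

```python
import sqlite3

# Edges E(var, factor) and Assignments A(var, assignment), plus the
# view Q that, for each variable v, exposes its factors and the current
# assignments of every variable vp appearing in those factors.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE E(v TEXT, f TEXT);
CREATE TABLE A(v TEXT, a TEXT);
INSERT INTO E VALUES ('v1','f1'), ('v2','f2'), ('v3','f2');
INSERT INTO A VALUES ('v1','F'), ('v2','T'), ('v3','F');
CREATE VIEW Q AS
  SELECT e1.v AS v, e1.f AS f, e2.v AS vp, a.a AS a
  FROM E e1
  JOIN E e2 ON e1.f = e2.f
  JOIN A a  ON a.v = e2.v;
""")
rows = sorted(db.execute("SELECT * FROM Q"))
```

Evaluating Q reproduces the five-row table on the slide; the "twists" are that Q must be refreshed after each variable update and is scanned in the same variable order every epoch.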
The State-of-the-art Architecture
How do classical DB techniques play a role in performance and scalability?
Elementary
[Figure: the example factor graph again, but now with billions of variables and factors.]
The Elementary Architecture
[Figure: the state-of-the-art architecture keeps the whole factor graph in main memory next to the Gibbs sampler. The Elementary architecture instead keeps the graph in a storage backend (Unix file, HBase, or Accumulo) and pages it through a main-memory buffer feeding the Gibbs sampler.]
Trade-off Space
Materialization
Page-oriented Layout
Buffer Replacement
Trade-off Space: Materialization
[Figure: the four strategies plotted by update cost vs. lookup cost.]
LAZY: No materialization.
EAGER: Materialize Q.
V-COC: Materialize QV(v, f, v') :- E(v, f), E(v', f).
F-COC: Materialize QF(v', f, a') :- E(v', f), A(v', a').
Trade-off Space: Page Layout
[Figure: a random access, e.g., a lookup of E(v', f) under LAZY, sends a request to storage and returns a tuple to the main-memory buffer.]
Trade-off Space: Page Layout
e.g., E(v’, f) in LAZY
Tuples: t1, t2, …, tn
Visiting Sequence: ta1, …, tam
Proposition: Finding the optimal paging strategy for f1,…,fn given visiting sequence ta1, …, tam is NP-hard for LRU or OPTIMAL buffer replacement strategy.
HEURISTIC: Greedily pack f1,…,fn into pages according to ta1, …, tam.
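That heuristic can be sketched in a few lines (the function name and the toy visiting sequence are made up): tuples are laid onto pages in order of their first visit, so tuples visited together tend to share a page.

```python
def greedy_pack(visit_seq, page_size):
    # Pack tuples into pages in order of first appearance in the
    # visiting sequence; co-visited tuples end up on the same page.
    pages, seen = [], set()
    for t in visit_seq:
        if t not in seen:
            if not pages or len(pages[-1]) == page_size:
                pages.append([])          # open a fresh page
            pages[-1].append(t)
            seen.add(t)
    return pages

# t3 and t1 are visited back-to-back, so they share the first page.
pages = greedy_pack(["t3", "t1", "t3", "t2", "t4"], page_size=2)
```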
Random Access
Storage Main Memory Buffer
RequestTuple
Q1: How to organize a relation into pages?
Q2: What buffer replacement strategy to use?
Trade-off Space: Buffer Replacement
Random Access
e.g., E(v’, f) in LAZY
LRU: Evict the page that is Least-Recently-Used.
OPTIMAL: Evict the page whose next use is farthest in the future.
[Figure: pages are loaded from secondary storage into the main-memory buffer and evicted back as needed.]
Tuples: t1, t2, …, tn
Visiting Sequence: ta1, …, tam
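The LRU/OPTIMAL contrast is easy to simulate; a small sketch (page IDs and buffer size are made up) shows why repeated same-order scans, as in Gibbs epochs, are a worst case for LRU:

```python
def misses_lru(seq, buf_size):
    # Least-Recently-Used: evict the page untouched for longest.
    buf, last_use, misses = set(), {}, 0
    for t, p in enumerate(seq):
        if p not in buf:
            misses += 1
            if len(buf) == buf_size:
                buf.remove(min(buf, key=lambda q: last_use[q]))
            buf.add(p)
        last_use[p] = t
    return misses

def misses_optimal(seq, buf_size):
    # Belady's OPT: evict the page whose next use is farthest away.
    buf, misses = set(), 0
    for t, p in enumerate(seq):
        if p not in buf:
            misses += 1
            if len(buf) == buf_size:
                def next_use(q):
                    for u in range(t + 1, len(seq)):
                        if seq[u] == q:
                            return u
                    return len(seq)  # never used again
                buf.remove(max(buf, key=next_use))
            buf.add(p)
    return misses

# A cyclic reference string: 4 pages scanned in the same order for
# 5 epochs, with room for only 3 pages in the buffer.
seq = [1, 2, 3, 4] * 5
```

On this sequence LRU misses on every single access (20 misses), while OPTIMAL misses far less often; that gap is why the replacement policy matters for epoch-structured scans.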
Trade-off Space: Recap
Materialization: 4 strategies (LAZY, EAGER, V-COC, F-COC)
Page-oriented Layout: HEURISTIC
Buffer Replacement: OPTIMAL
Overview
Background: Gibbs sampling & factor graph
Elementary
Experimental Results
Experiments
Main experiments: end-to-end comparison with other systems.
Trade-off 1 (Materialization): compare LAZY, EAGER, V-COC, F-COC.
Trade-off 2 (Page-oriented Layout): compare RANDOM, HEURISTIC.
Trade-off 3 (Buffer Replacement): compare LRU, RANDOM, OPTIMAL.
Experiments Setup
LR: Logistic Regression; CRF: Skip-chain CRF; LDA: Latent Dirichlet Allocation.

        Bench (1x)               Scale (100,000x)
        #Var   #Factor  Size     #Var   #Factor  Size
  LR    47K    47K      2MB      5B     5B       0.2TB
  CRF   47K    94K      3MB      5B     9B       0.3TB
  LDA   0.4M   12K      10MB     39B    0.2B     0.9TB

Competitors: FACTORIE (LR, CRF, LDA), PGibbs (LR, CRF, LDA), WinBUGS (LR, LDA), MADLib (LDA).
Main Experiments
[Figure: throughput (# samples/second, log scale) vs. data set size from 1x to 100,000x on LR. The Elementary variants (EleMM, EleFILE, EleHBASE, each with a 40 GB buffer) keep running at 100,000x, while the other main-memory systems stop once the graph no longer fits in memory.]
Trade-offs: Materialization
[Figure: normalized throughput of LAZY, EAGER, V-CoC, and F-CoC on CRF (EleFILE) and LDA (EleFILE) across different page-size/buffer-size settings; some strategies do not finish in 1 hour.]
Trade-offs: Page-oriented Layout
[Figure: normalized throughput of the Greedy and Shuffle layouts on CRF (EleFILE) and LDA (EleFILE) across different page-size/buffer-size settings; some settings do not finish in 1 hour.]
Trade-offs: Buffer Replacement
[Figure: normalized throughput of Optimal, LRU, and Random replacement on CRF (EleFILE) and LDA (EleFILE) across different page-size/buffer-size settings.]
Conclusion (of Elementary)
Task
Gibbs Sampling over Factor Graphs (terabyte-scale factor graphs!).
Elementary System
Scaling up Gibbs sampling by revisiting classical DB techniques.
Data Flow
[Figure: the same data flow at PaleoDeepDive scale (300 K documents / 2 TB in; > 10 M feature tuples / 3 TB; a factor graph with 0.3 B variables and 0.7 B factors). Batch Execution is now checked off (✔); Incremental Maintenance, triggered by adding a new feature or a new rule, is up next.]
Feature Selection: System Columbus (joint effort with Arun & Pradap)
Feature Selection
Task: select a subset of features to predict customer churn.
[Table: customer information with features Name, Age, State and label Churn?; e.g., (Alice, 20, CA, Yes), (Bob, 21, CA, No), and the unlabeled (Dave, 22, WI, ?).]
[SIGMOD 2014]
Feature Selection: Motivation
How does one select features? Candidate features (Name, Age, State, Credit score, # Calls, # Messages) are weighed for statistical performance and explanatory power in a human-in-the-loop dialogue.
[* Interviews were done by Arun and Pradap]
Feature Selection Dialogue
Analyst: "Age" may affect customer churn.
System: I get an accuracy of 70% by just using {Age}.
[Figure: subselect the Age column from the data, train a model; accuracy = 70%.]
Feature Selection Dialogue
Analyst: Not bad! Add "Age".
[Figure: the same subselect/train pipeline; {Age} is kept with accuracy = 70%.]
Feature Selection Dialogue
Analyst: I want to add one more feature; which one should I add?
System: The accuracy of {Age, State} is higher than that of {Age, Name}.
[Figure: subselect both {Age, State} and {Age, Name}, train both models; accuracy = 80% vs. 30%.]
Feature Selection Dialogue
Analyst: Let's add "State".
[Figure: {Age, State} is kept; accuracy = 80% vs. 30% for the alternatives.]
Feature Selection Dialogue
… and the dialogue continues, feature by feature.
[Figure: the subselect/train loop repeats with growing feature sets.]
Feature Selection Dialogue
Analyst: I want to add three more features out of the 100 available features.
That is 161,700 different models to train!
Can we make this dialogue faster? How does an analyst specify a dialogue?
[Figure: the subselect/train loop fans out over many candidate feature sets.]
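The model count is just a binomial coefficient; a one-line check (choosing which 3 of the 100 remaining features to add means training one model per 3-element subset):

```python
from math import comb

# One model per way of choosing 3 features out of 100.
num_models = comb(100, 3)  # 161700
```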
Feature Selection Dialogue
[Figure: the same subselect/train pipeline annotated with two optimization opportunities: make each operation faster, and reuse computation across operations (cf. RIOT-DB). Columbus exposes a higher-level DSL with operations such as StepAdd and CrossValidation.]
Columbus: Technical Contributions
Classical database techniques: materialized views, shared scans, etc.
Classical numerical analysis techniques: QR decomposition, etc.
We study opportunities for data and computation reuse: classical DB techniques alone lead to a 2x speedup; applying all techniques improves performance by up to 100x.
Outline
Experimental Result
System Overview
Materialization Tradeoff
System Overview
A Columbus program compiles into basic blocks and then into R operations:

  A, b <- DataSet("file://...")
  fs1 <- FeatureSet(f1, f2)
  fs2 <- StepAdd(A, fs1)
  fs3 <- FeatureSet(f3)
  fs4 <- UNION(fs2, fs3)

[Figure: the compiled plan looks like a query plan: a StepAdd basic block over (A, b, {f1, f2}) lowers to R operations such as QR(A) that run in parallel, and a final UNION produces fs4. The basic block is the focus of this talk.]
Basic Block
[Figure: a StepAdd basic block takes data (A, b), a loss function, and a set of subselections; it trains one model per subselection and reports each model's accuracy.]
Supported losses include linear least squares regression, support vector machines, and logistic regression.
Outline
Experimental Result
System Overview
Materialization Tradeoff
Outline
Database Inspired: Lazy vs. Eager Materialization Tradeoff
Numerical Analysis Inspired: QR Decomposition
Linear Basic Block: Lazy Strategy
[Figure: the task. A is a 5x3 data matrix (entries a through s) with target vector b; each subselection picks a subset of A's columns (e.g., the first column, or the first two columns). Under LAZY, the sub-selection is applied at solve time and each reduced least-squares problem is solved from scratch using R.]
Linear Basic Block: Classical Database Opt.
[Figure: the same task. EAGER projects away the extra columns (and rows) up front; I/O can be batched when all "solves" are sequential scans.]
Background: QR Decomposition
Linear Basic Block: Numerical Analysis Opt.
Factor A = QR, where Q is orthogonal (Q^T = Q^(-1)) and R is upper triangular; computing the factorization of an n x d matrix A costs about 2d^2n flops. Least squares then reduces to the d x d triangular system R x = Q^T b, which costs only about d^2 to solve.
Task: solve least squares over several column subsets of A against the same b.
Linear Basic Block: Numerical Analysis Opt.
Pay the ~2d^2n factorization of A once; every subselection F then reuses it. Solving R_F x = Q^T b, where R_F keeps only the columns of R selected by F (e.g., R1 and R2 for two different subsets), costs about d^2 per subset.
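The reuse trick can be sketched with NumPy (assuming NumPy is available; the matrix sizes and the subset [0, 2, 3] are made up): factor A once, then solve every column subset against the small R and the shared Q^T b instead of touching A again.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 6
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Upfront work (~2*d^2*n flops): one thin QR of the full design matrix.
Q, R = np.linalg.qr(A)   # A = Q R, Q has orthonormal columns, R is upper triangular
c = Q.T @ b              # d-vector shared by every subset

def solve_subset(cols):
    # Least squares on A[:, cols] without touching A:
    # ||A_F x - b||^2 = ||R_F x - Q^T b||^2 + const, a d-sized problem.
    x, *_ = np.linalg.lstsq(R[:, cols], c, rcond=None)
    return x

x_fast = solve_subset([0, 2, 3])
# Reference: solve the subset problem directly against A and b.
x_ref, *_ = np.linalg.lstsq(A[:, [0, 2, 3]], b, rcond=None)
```

Because A_F = Q R_F and Q has orthonormal columns, both solves share the same minimizer; only the per-subset cost differs.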
Linear Basic Block: Lazy vs. QR
LAZY pays nothing up front but about d^2n + d^3 per solve; QR pays 2d^2n up front and only about d^2 per solve thereafter.
Linear Basic Block: Tradeoff Space
[Figure: runtime of QR vs. LAZY as a function of the data (# features), the task (# reuses), and the parallelism (# threads); neither strategy dominates, and the crossover point moves along all three axes.]
We find that a simple cost-based optimizer over data, task, and parallelism works well.
Outline
Experimental Result
System Overview
Materialization Tradeoff
Experimental Results
We use feature selection programs from analysts; some are CrossValidation-heavy, others StepAdd-heavy.

  Dataset  # Features  # Rows
  KDD      481         191 K
  Census   161         109 K
  Music    91          515 K
  Fund     16          74 M
  House    10          2 M

[Figure: execution time (seconds, log scale) of VanillaR, dbOPT, and Columbus on each dataset; the highlighted speedups of Columbus reach 183x and 25x.]
Other Techniques
Sampling-based optimization: solve on an importance-sampled subset of the rows (a coreset) within error tolerance ε.
Non-linear basic blocks: reduce to linear basic blocks via ADMM; the same tradeoff applies.
Multi-block optimization: deciding the optimal merging/splitting of basic blocks is NP-hard, so we use a greedy heuristic, plus warmstarting across blocks.
Conclusion (of Columbus)
Columbus takes advantage of opportunities for data and computation reuse in feature selection workloads.
We build a DSL in Columbus to facilitate the feature selection dialogue.
Recap (Before Future Work)
Application: Why KBC? How does DeepDive help KBC?
Abstraction: How to build a KBC application with DeepDive?
Technique: How to make DeepDive efficient and scalable?
Gibbs sampling over petabyte-scale factor graphs? Is it possible with Elementary?
An Amazon EC2 d2.xlarge instance costs $3.216/hour with 48 TB of storage:
=> petabyte-scale storage is only ~$60/hour
=> a full scan takes 1.3 hours with 100 machines ($418)
=> 20 epochs = $8,360 and 26 hours
Not bad, but not ideal!
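The arithmetic can be checked directly (the price, storage figure, and scan time are the slide's circa-2015 numbers, not current AWS pricing):

```python
# Slide's back-of-envelope costing for petabyte-scale Gibbs sampling.
price_per_hour = 3.216   # $/hour per d2.xlarge instance (slide's figure)
machines = 100
scan_hours = 1.3         # one full scan of the petabyte graph
epochs = 20

cost_per_scan = machines * scan_hours * price_per_hour  # ~$418
total_cost = epochs * cost_per_scan                     # ~$8.4K
total_hours = epochs * scan_hours                       # 26 hours
```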
How to achieve $8.3K/20 epochs?
How to improve $8.3K/20 epochs?
To Achieve: Better Partitioning
[Figure: the same factor graph (variables v1, v2, v3; factors f1, f2) partitioned across machines under two different strategies; the edges cut by each partition differ.]
How do we minimize the amount of communication between different nodes? Can we decide this without grounding the whole graph?
Observation: factor graphs in DeepDive are grounded with high-level rules, e.g.:

  IsNoun(docid, sentid2, wordid2, word2) :-
      IsNoun(docid, sentid1, wordid1, word2),
      IsNeighbor(wordid1, wordid2)

We should partition with this key [PODS 1991]. When there are multiple rules, we just need a database optimizer (hopefully).
To Improve: Better Compression

  Factors(wordid, feature) :-
      IsNoun(docid, sentid, wordid, word),
      WordFeature(word, feature)

[Figure: every occurrence of the word "dog" grounds the same factors f1, f2, f3; "cat" grounds f4.]
Similar to multi-valued dependencies: can we ground only one copy of the factors for the same word? This is similar in spirit to "lifted inference", but we are more interested in the systems side.
Coming Soon (Hopefully)…
How does the decision of compression interact with the decision of partitioning? How far can we push these classic static analysis techniques into machine learning?