Optimizing and Parallelizing Ranked Enumeration
Konstantin Golenberg (The Hebrew University of Jerusalem)
Benny Kimelfeld (IBM Research – Almaden)
Yehoshua Sagiv (The Hebrew University of Jerusalem)
VLDB 2011, Seattle, WA
Background: DB Search at HebrewU
[Screenshot: search box with the query “eu brussels search”]
• Initial implementation was too slow…
• Purchased a multi-core server
• Didn’t help: cores were usually idle
– Due to the inherent flow of the enumeration technique we used
• Needed deeper understanding of ranked enumeration to benefit from parallelization
– This paper
Demo in SIGMOD’10, implementation in SIGMOD’08, algorithms in PODS’06
Outline
• Lawler-Murty’s Ranked Enumeration
• Optimizing by Progressive Bounds
• Parallelization / Core Utilization
• Conclusions
Ranked Enumeration
User → Problem → huge number (e.g., 2^|Problem|) of ranked answers: best answer, 2nd best answer, 3rd best answer, …
Examples:
• Various graph optimizations
– Shortest paths
– Smallest spanning trees
– Best perfect matchings
• Top results of keyword search on DBs (graph search)
• Most probable answers in probabilistic DBs
• Best recommendations for schema integration
“Complexity”:
• What is the delay between successive answers?
• How much time to get top-k?
Here
(Can’t afford to instantiate all answers)
Abstract Problem Formulation
Input:
• O = a collection of objects
• A = the answers a ⊆ O (huge, described by a condition on O’s subsets)
• score(): score(a) is high means a is of high quality
Goal: find top-k answers a1, a2, a3, …, ak
[Figure: answers listed by decreasing score, e.g., 32, 31, 28, …, 17]
Graph Search in the Abstraction
Input:
• Data graph G
• Set Q of keywords
O = edges of G
A = subtrees (edge sets) a containing all keywords in Q (w/o redundancy, see [GKS 2008])
score(a): 1/weight(a), IR measures, etc.
Goal: find top-k answers
What is the Challenge?
• 1st (top) answer (score 32 in the example): an optimization problem
• 2nd answer (score 31): ?
• …
• j-th answer (score 17): ?
– ≠ previous (j-1) answers
– best remaining answer
How to handle these constraints? (j may be large!)
Conceivably, much more complicated than top-1!
Lawler-Murty’s Procedure [Murty, 1968] [Lawler, 1972]
Lawler-Murty’s gives a general reduction:
• Finding top-k answers reduces to finding the top-1 answer under simple constraints
• If the latter is in PTIME, then so is the former
We understand optimization much better!
• Often, amounts to classical optimization, e.g., shortest path (but sometimes it may get involved, e.g., [KS 2006])
Other general top-k procedure:
• [Hamacher & Queyranne 84], very similar!
Among the Uses of Lawler-Murty’s
Graph/Combinatorial Algorithms:
• Shortest simple paths [Yen 1972]
• Minimum spanning trees [Gabow 1977, Katoh et al., 1981]
• Best solutions in resource allocation [Katoh et al. 1981]
• Best perfect matchings, best cuts [Hamacher & Queyranne 1985]
• Minimum Steiner trees [KS 2006]
Bioinformatics:
• Yen’s algorithm to find sets of metabolites connected by chemical reactions [Takigawa & Mamitsuka 2008]
Data Management:
• ORDER-BY queries [KS 2006, 2007]
• Graph/XML search [GKS 2008]
• Generation of forms over integrated data [Talukdar et al. 2008]
• Course recommendation [Parameswaran & Garcia-Molina 2009]
• Querying Markov sequences [K & Ré 2010]
Lawler-Murty’s Method: Conceptual
1. Find & print the top answer
In principle, at this point we should find the second-best answer. But instead…
2. Partition the remaining answers
Partition defined by a set of simple constraints:
• Inclusion constraint: “must contain ”
• Exclusion constraint: “must not contain ”
3. Find the top of each set
4. Find & print the second answer
Next answer: best among all the top answers in the partitions
5. Further divide the chosen partition
… and so on … (until k answers are printed)
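The five steps above can be sketched in a few lines of Python. This is a minimal illustration on a toy problem that is not from the talk: the answers are all subsets of exactly `m` objects, and the score is the sum of object weights. The names `best_subset` and `lawler_murty` are hypothetical. The only problem-specific part is the top-1 solver under inclusion and exclusion constraints; the enumeration loop itself is generic.

```python
import heapq
from itertools import count


def best_subset(weights, m, include, exclude):
    """Top-1 under constraints (toy problem): the best m-subset that
    contains every object in `include` and none in `exclude`."""
    free = sorted((o for o in weights if o not in include and o not in exclude),
                  key=lambda o: (-weights[o], o))
    need = m - len(include)
    if need < 0 or need > len(free):
        return None  # no answer satisfies the constraints
    answer = frozenset(include) | frozenset(free[:need])
    return sum(weights[o] for o in answer), answer


def lawler_murty(weights, m, k):
    """Yield up to k (score, answer) pairs in non-increasing score order."""
    tie = count()  # unique tie-breaker, so the heap never compares sets
    queue = []     # one entry per partition: its best answer + its constraints

    def push(include, exclude):
        top = best_subset(weights, m, include, exclude)
        if top is not None:
            score, answer = top
            heapq.heappush(queue, (-score, next(tie), answer, include, exclude))

    push(frozenset(), frozenset())  # step 1: the unconstrained top answer
    for _ in range(k):
        if not queue:
            return
        neg_score, _t, answer, include, exclude = heapq.heappop(queue)
        yield -neg_score, answer  # best among the tops of all partitions
        # Partition the rest of this partition: the i-th child keeps the
        # first i-1 chosen objects and excludes the i-th one.
        fixed = include
        for o in sorted(answer - include):
            push(fixed, exclude | {o})
            fixed = fixed | {o}
```

With weights a=5, b=4, c=3, d=1 and `m=2`, the first four scores produced are 9, 8, 7, 6. Each printed answer triggers one top-1 computation per new partition, which is where the running time goes.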
Lawler-Murty’s: Actual Execution
[Figure, three animation steps: the output list (34 and 30 printed already) and a queue of partition representatives, each stored with the best answer of its partition (e.g., 24, 19, 18). The best entry is popped and printed. For each new partition, a task to find the best answer is run, and the results (e.g., 22, 21, 18) are added to the queue.]
Typical Bottleneck
[Figure, two animation steps: 34, 30 and 24 are output; the queue holds answers 14 and 12. The new partitions yield answers 22, 20 and 15. Annotation: “In top k?”]
Progressive Upper Bound
• Throughout the execution, an optimization alg. often upper bounds its final solution’s score
• Progressive: bound gets smaller in time
• Often, nontrivial bounds, e.g.,
– Dijkstra's algorithm: distance at the top of the queue
– Similarly: some Steiner-tree algorithms [DreyfusWagner72]
– Viterbi algorithms: max intermediate probability
– Primal-dual methods: value of dual LP solution
[Figure: over time, a task’s bound tightens (≤24, ≤22, ≤18, ≤14) until it ends with score 12]
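As an illustration of the Dijkstra bullet, the following sketch writes Dijkstra's algorithm as a Python generator that reports a progressive bound after every settled node. It assumes nonnegative edge weights and a graph given as a dict of adjacency lists; `dijkstra_with_bounds` and `run_to_completion` are hypothetical names. The yielded value is a lower bound on the final distance, so under a score such as 1/weight it corresponds to an upper bound on the final score.

```python
import heapq


def dijkstra_with_bounds(graph, source, target):
    """Yield progressive lower bounds on dist(source, target);
    the generator's return value is the final distance.

    graph: dict mapping node -> list of (neighbor, nonnegative weight).
    """
    dist = {source: 0}
    settled = set()
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale queue entry
        settled.add(u)
        if u == target:
            return d
        # Nodes are settled in non-decreasing distance order, so the
        # still-unsettled target is at distance >= d.
        yield d
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")  # target unreachable


def run_to_completion(task):
    """Drive a bound-reporting task; return (list of bounds, final value)."""
    bounds = []
    while True:
        try:
            bounds.append(next(task))
        except StopIteration as stop:
            return bounds, stop.value
```

Because the caller drives the generator one step at a time, it can stop calling `next` at any point, which is what makes freezing a task possible.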
Freezing Tasks (Simplified)
[Figure, four animation steps: 34 and 30 are output, then 24 (the best in the queue); the queue of partition representatives also holds answers 14 and 12. Tasks for the new partitions report shrinking bounds as they run (≤24, ≤23, ≤22, …) and produce answers 22 and 20. A task whose bound drops to ≤20 is frozen, since 22 > 20, and waits in the queue with bound ≤20. It is resumed later (≤18, ≤16, ≤15) and ends with answer 15.]
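A simplified, single-threaded sketch of freezing follows. It assumes each task is a Python generator that yields non-increasing upper bounds on its final score and returns that score when it finishes; `ranked_with_freezing` and `simulated_task` are hypothetical names, and the bound sequences in the usage note are made up for illustration. Splitting partitions after each printed answer is left out to keep the freezing logic visible.

```python
import heapq
from itertools import count


def simulated_task(name, bounds, final, log):
    """Stand-in for an optimization task: reports each bound, then finishes."""
    for b in bounds:
        log.append(name)  # record one unit of work
        yield b
    return final


def ranked_with_freezing(tasks, initial_bound=float("inf")):
    """Yield final scores in non-increasing order, running each task only
    while its upper bound can still compete with the rest of the queue."""
    tie = count()
    # Entries are (-bound_or_score, tie, task); task is None for a finished answer.
    queue = [(-initial_bound, next(tie), t) for t in tasks]
    heapq.heapify(queue)
    while queue:
        neg, _t, task = heapq.heappop(queue)
        if task is None:
            yield -neg  # no frozen task has a bound above this score
            continue
        bound = -neg
        rival = -queue[0][0] if queue else float("-inf")
        try:
            while bound >= rival:  # may still beat the next entry: keep working
                bound = next(task)
        except StopIteration as stop:
            heapq.heappush(queue, (-stop.value, next(tie), None))
            continue
        heapq.heappush(queue, (-bound, next(tie), task))  # freeze the task
```

For three simulated tasks with bound sequences (24, 23, 22), (24, 23, 20, 18, 16, 15) and (24, 21, 20) and final scores 22, 15 and 20, the first two results are 22 and 20, and the second task is still frozen at that point. If only the top two answers are requested, its remaining work is never done.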
Improvement of Freezing
Experiments: Graph Search
2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory
[Figure: three bar charts of running time (ms), simple Lawler-Murty vs. Lawler-Murty w/ freezing, for k = 10 and k = 100, on Mondial, DBLP (part) and DBLP (full)]
On average, freezing saved 56% of the running time
Straightforward Parallelization
[Figure, three animation steps: 34, 30 and 24 are output; the queue holds answers 14 and 12. The awaiting tasks of the new partitions run in parallel, and their answers (22, 20, 15) are added to the queue.]
Not so fast…
• Typical: reduced 30% of running time
• Same for 2, 3, …, 8 threads!
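The straightforward scheme can be sketched with Python's `concurrent.futures`: after each printed answer, the tasks of the new partitions are submitted to a thread pool, and the main loop waits for all of them before popping again. The toy problem (best subsets of size `SIZE` by total weight) and the names `solve`, `split` and `parallel_lawler_murty` are assumptions made for illustration, not the authors' implementation. In CPython, threads do not speed up CPU-bound pure-Python code because of the global interpreter lock, so the block shows the task structure rather than a measured speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import count

WEIGHTS = {"a": 5, "b": 4, "c": 3, "d": 1}  # toy data, not from the talk
SIZE = 2                                    # answers are subsets of this size


def solve(part):
    """Top-1 answer of a partition given as (include, exclude), or None."""
    include, exclude = part
    free = sorted((o for o in WEIGHTS if o not in include and o not in exclude),
                  key=lambda o: (-WEIGHTS[o], o))
    need = SIZE - len(include)
    if need < 0 or need > len(free):
        return None
    answer = include | frozenset(free[:need])
    return sum(WEIGHTS[o] for o in answer), answer


def split(part, answer):
    """Child partitions covering every answer of `part` except `answer`."""
    include, exclude = part
    children, fixed = [], include
    for o in sorted(answer - include):
        children.append((fixed, exclude | {o}))
        fixed = fixed | {o}
    return children


def parallel_lawler_murty(root, k, workers=4):
    tie, queue, results = count(), [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        new_parts = [root]
        while len(results) < k:
            # All new partitions are solved concurrently, but the main
            # thread blocks here until every one of them has finished.
            for part, top in zip(new_parts, pool.map(solve, new_parts)):
                if top is not None:
                    heapq.heappush(queue, (-top[0], next(tie), top[1], part))
            if not queue:
                break
            neg, _t, answer, part = heapq.heappop(queue)
            results.append((-neg, answer))
            new_parts = split(part, answer)
    return results
```

The number of tasks available at any moment is limited to the children of the last printed answer, and the slowest of them holds up the next pop. This is consistent with the slide's observation that adding threads beyond two did not help.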
Idle Cores while Waiting
[Figure, two animation steps: same setting as before (34, 30, 24 output; 14 and 12 in the queue). Tasks that have produced their answers (22, 20, 15) leave their cores idle while the remaining awaiting task is still running.]
Early Popping
[Figure: tasks are still running and report shrinking bounds (e.g., ≤24, ≤23, ≤20; others at ≤22 and ≤19). Since 22 > 20, the answer with score 22 is popped before the task with bound ≤20 completes.]
Skipped issues:
• Thread synchronization
– semaphores, locking, etc.
• Correctness
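One reading of the “22 > 20” annotation is the following test, shown here as a tiny Python sketch: the best queued answer may be popped early when no running task has an upper bound above its score. The function name `can_pop_early` is hypothetical, and the thread synchronization that the slide lists as skipped is left out here as well.

```python
def can_pop_early(top_score, running_bounds):
    """True if no task that is still running can end with a score
    strictly above the best answer already in the queue.

    running_bounds: current upper bounds of the tasks still in progress.
    """
    return all(top_score >= bound for bound in running_bounds)
```

For example, `can_pop_early(22, [20])` holds, so the main thread would not have to wait for that task, whereas `can_pop_early(22, [23])` does not hold and the main thread has to wait.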
Improvement of Early Popping
[Figure: two panels, Mondial and DBLP (part), each with short, medium-size & long queries. x-axis: number of threads (1, 2, 4, 6, 8); y-axis: running time as % of Lawler-Murty (0% to 150%).]
Early Popping vs. (Serial) Freezing
[Figure: two panels, Mondial and DBLP (part), each with short, medium-size & long queries. x-axis: number of threads (1, 2, 4, 6, 8); y-axis: running time as % of serial freezing (0 to 300).]
• Need 4 threads to start gaining
• And even then, fairly poor…
Combining Freezing & Early Popping
• We discuss additional ideas and techniques to further utilize the cores
– Not here, see the paper
• Main speedup by combining early popping with freezing
– Cores kept busy… on high-potential tasks
– Thread synchronization is quite involved
• At the high level, the final algorithm has the following flow:
Combining: General Idea
[Figure, three animation steps: two queues, “Partition Reps. as Frozen Tasks” and “Computed Answers (to-print)”, next to the output. Threads work on frozen tasks and produce frozen + new tasks and computed answers.]
Main task just pops computed results to print… but validates: no better results by frozen tasks
Combined vs. (Serial) Freezing
[Figure: two panels, Mondial and DBLP, each with short, medium & long queries. x-axis: number of threads (1, 2, 4, 6, 8); y-axis: running time as % of serial freezing (0% to 120%).]
Now, significant gain (≈50%) already w/ 2 threads
Improvement of Combined
[Figure: two panels, DBLP and Mondial, each with short, medium & long queries. x-axis: number of threads (1, 2, 4, 6, 8); y-axis: running time as % of Lawler-Murty (0% to 50%). Panel annotations: 4%-5% and 3%-10%.]
On average, with 8 threads we got 5.7% of the original running time
Conclusions
• Considered Lawler-Murty’s ranked enumeration
– Theoretical complexity guarantees
– …but a direct implementation is very slow
– Straightforward parallelization poorly utilizes cores
• Ideas: progressive bounds, freezing, early popping
– In the paper: additional ideas, combination of ideas
• Most significant speedup by combining these ideas
– Flow substantially differs from the original procedure
– 20x faster on 8 cores
• Test case: graph search; focus: general apps
– Future: additional test cases
Questions?