Optimizing and Parallelizing Ranked Enumeration

Konstantin Golenberg (The Hebrew University of Jerusalem)
Benny Kimelfeld (IBM Research – Almaden)
Yehoshua Sagiv (The Hebrew University of Jerusalem)

VLDB 2011, Seattle, WA

Background: DB Search at HebrewU

Example query: eu brussels search

• Initial implementation was too slow…
• Purchased a multi-core server
• Didn’t help: cores were usually idle
  – Due to the inherent flow of the enumeration technique we used
• Needed a deeper understanding of ranked enumeration to benefit from parallelization
  – This paper

Demo in SIGMOD’10, implementation in SIGMOD’08, algorithms in PODS’06

Outline

• Lawler-Murty’s Ranked Enumeration
• Optimizing by Progressive Bounds
• Parallelization / Core Utilization
• Conclusions

Ranked Enumeration

User → Problem → huge number (e.g., 2^|Problem|) of ranked answers: best answer, 2nd best answer, 3rd best answer, . . .

Examples:
• Various graph optimizations
  – Shortest paths
  – Smallest spanning trees
  – Best perfect matchings
• Top results of keyword search on DBs (graph search)
• Most probable answers in probabilistic DBs
• Best recommendations for schema integration

“Complexity”:
• What is the delay between successive answers?
• How much time to get top-k?

Here

(Can’t afford to instantiate all answers)

Abstract Problem Formulation

Goal: Find top-k answers

• Input: O = a collection of objects
• A = answers a ⊆ O
  – Huge, described by a condition on A’s subsets
• score(): score(a) is high ⇔ a is of high quality
  – Example scores: 21, 31, 28, 27, 17, …
• Output: a1, a2, a3, …, ak in decreasing score (e.g., 32, 31, 28, …, 17)

Graph Search in The Abstraction

Goal: Find top-k answers

• Input: data graph G, set Q of keywords
• O = edges of G
• A = subtrees (edge sets) a ⊆ O containing all keywords in Q (w/o redundancy, see [GKS 2008])
• score(a): 1/weight(a), IR measures, etc.

What is the Challenge?

• 1st (top) answer (score 32): an optimization problem
• 2nd answer (score 31): ??
• . . .
• j-th answer (score 17): ??
  – ≠ previous (j-1) answers
  – best remaining answer

Conceivably, much more complicated than top-1!

How to handle these constraints? (j may be large!)

Lawler-Murty’s Procedure [Murty, 1968] [Lawler, 1972]

Lawler-Murty’s gives a general reduction:

• Finding top-k answers → finding top-1 answer under simple constraints
• If finding the top-1 answer under simple constraints is in PTIME, then finding the top-k answers is in PTIME

We understand optimization much better!

Often, amounts to classical optimization, e.g., shortest path (but sometimes it may get involved, e.g., [KS 2006])

Other general top-k procedure: [Hamacher & Queyranne 84], very similar!

Among the Uses of Lawler-Murty’s

Graph/Combinatorial Algorithms:
• Shortest simple paths [Yen 1972]
• Minimum spanning trees [Gabow 1977, Katoh et al., 1981]
• Best solutions in resource allocation [Katoh et al. 1981]
• Best perfect matchings, best cuts [Hamacher & Queyranne 1985]
• Minimum Steiner trees [KS 2006]

Bioinformatics:
• Yen’s algorithm to find sets of metabolites connected by chemical reactions [Takigawa & Mamitsuka 2008]

Data Management:
• ORDER-BY queries [KS 2006, 2007]
• Graph/XML search [GKS 2008]
• Generation of forms over integrated data [Talukdar et al. 2008]
• Course recommendation [Parameswaran & Garcia-Molina 2009]
• Querying Markov sequences [K & Ré 2010]

Lawler-Murty’s Method: Conceptual

1. Find & Print the Top Answer
   – In principle, at this point we should find the second-best answer. But instead…
2. Partition the Remaining Answers
   – Partition defined by a set of simple constraints
   – Inclusion constraint: “must contain [object]”
   – Exclusion constraint: “must not contain [object]”
3. Find the Top of Each Set
4. Find & Print the Second Answer
   – Next answer: best among all the top answers in the partitions
5. Further Divide the Chosen Partition
   – … and so on … (until k answers are printed)
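The conceptual steps can be sketched in Python on a toy problem. This is a minimal illustration under stated assumptions, not the graph-search setting of the talk: answers are assumed to be the m-element subsets of O, the score is the sum of hypothetical per-object weights, and the names `best_under_constraints` and `lawler_murty` are introduced here for illustration only.

```python
import heapq
import itertools

def best_under_constraints(weights, m, include, exclude):
    """Top-1 answer among m-subsets that contain `include` and avoid `exclude`."""
    free = sorted((o for o in weights if o not in include and o not in exclude),
                  key=lambda o: (-weights[o], o))
    need = m - len(include)
    if need < 0 or need > len(free):
        return None  # the partition is empty
    answer = frozenset(include) | frozenset(free[:need])
    return sum(weights[o] for o in answer), answer

def lawler_murty(weights, m, k):
    """Return up to k best m-subsets as (score, sorted objects), best first."""
    tie = itertools.count()   # tie-breaker so the heap never compares sets
    heap = []                 # partition reps. + best of each (max-heap via negation)

    def push(include, exclude):
        best = best_under_constraints(weights, m, include, exclude)
        if best is not None:
            score, answer = best
            heapq.heappush(heap, (-score, next(tie), answer, include, exclude))

    push(frozenset(), frozenset())            # step 1: the unconstrained top answer
    out = []
    while heap and len(out) < k:
        neg, _, answer, include, exclude = heapq.heappop(heap)
        out.append((-neg, sorted(answer)))    # print the best among all partitions
        # Split the chosen partition: the i-th part keeps the first i-1 free
        # objects of the answer and excludes the i-th one.
        fixed = set(include)
        for o in sorted(answer - include):
            push(frozenset(fixed), exclude | {o})
            fixed.add(o)
    return out

weights = {"a": 5, "b": 4, "c": 3, "d": 1}
top3 = lawler_murty(weights, 2, 3)
```

Each popped answer spawns one top-1 task per new partition, which is the step the later slides optimize and parallelize.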

Lawler-Murty’s: Actual Execution

• Output: answers printed already (34, 30)
• Queue: partition reps. + best of each (24, 19, 18)
• The best in the queue (24) is popped and printed
• For each new partition, a task to find the best answer (22, 21, 18)
• Then the best in the queue is popped again…

Optimizing by Progressive Bounds

Typical Bottleneck

• Output: 34, 30, 24; queue (partition reps. + best of each): 14, 12
• New partitions: a task to find the best answer of each (22, 20, 15)
• In top k?

Progressive Upper Bound

• Throughout the execution, an optimization alg. often upper bounds its final solution’s score
• Progressive: bound gets smaller in time
  – E.g., over time: ≤24, ≤22, ≤18, ≤14, final score 12
• Often, nontrivial bounds, e.g.,
  – Dijkstra’s algorithm: distance at the top of the queue
    (similarly: some Steiner-tree algorithms [DreyfusWagner72])
  – Viterbi algorithms: max intermediate probability
  – Primal-dual methods: value of dual LP solution
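The Dijkstra example can be made concrete with a small sketch. It assumes, for illustration only, that the score of a path is the negated distance, so the distance at the top of the queue (a lower bound on the final distance) gives an upper bound on the final score. The generator name `dijkstra_with_bounds` and the event format are hypothetical.

```python
import heapq

def dijkstra_with_bounds(graph, source, target):
    """Yield ('bound', b) events, then a single ('done', score) event.

    score is -(shortest distance); every reported b is an upper bound on it.
    `graph` maps a node to a list of (neighbor, non-negative weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        if u == target:
            yield ("done", -d)
            return
        # Any path to the target still has length >= d, so its score is <= -d.
        yield ("bound", -d)
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    yield ("done", float("-inf"))  # target unreachable

graph = {"s": [("a", 1), ("b", 4)], "a": [("b", 1), ("t", 5)], "b": [("t", 1)]}
events = list(dijkstra_with_bounds(graph, "s", "t"))
```

Because the task is a generator, a caller can stop pulling events at any point and resume later, which is the property freezing relies on.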

Freezing Tasks (Simplified)

• Output: 34, 30, 24; computed answers in the queue: 14, 12
• Each new task tracks its progressive upper bound while it runs
  – First task: ≤24, ≤23, then completes with 22
  – Second task: ≤24, ≤23, ≤22, then completes with 20
  – Third task: ≤24, ≤23, ≤20; since 22 > 20, it is frozen at ≤20
• 22 is the best and is printed next
• The frozen task, when resumed, continues: ≤18, ≤16, ≤15, and completes with 15
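The freezing walkthrough can be replayed with a short simulation. This is a simplified sketch under assumptions: tasks are scripted generators that replay the bound sequences from the slides (`scripted_task` is hypothetical), and the partitioning step that would create new tasks after each printed answer is left out.

```python
import heapq
import itertools

def scripted_task(bounds, final):
    """Hypothetical task: reports shrinking upper bounds, then its answer."""
    for b in bounds:
        yield ("bound", b)
    yield ("done", final)

def slide_tasks():
    return [scripted_task([24, 23], 22),
            scripted_task([24, 23, 22], 20),
            scripted_task([24, 23, 20, 18, 16, 15], 15)]

def run_with_freezing(computed, tasks, k):
    """Return (printed scores, number of task steps actually executed)."""
    heap, seq, steps = [], itertools.count(), 0

    def push(priority, kind, payload):
        heapq.heappush(heap, (-priority, next(seq), kind, payload))

    def advance(task):
        nonlocal steps
        for kind, value in task:          # resumes where the task was frozen
            steps += 1
            if kind == "done":
                push(value, "answer", value)
                return
            if heap and value < -heap[0][0]:
                push(value, "frozen", task)   # cannot beat the queue top: freeze
                return

    for score in computed:
        push(score, "answer", score)
    for task in tasks:
        advance(task)

    printed = []
    while heap and len(printed) < k:
        _, _, kind, payload = heapq.heappop(heap)
        if kind == "answer":
            printed.append(payload)
        else:
            advance(payload)              # a frozen task reached the top: resume
    return printed, steps

top1, steps_used = run_with_freezing([14, 12], slide_tasks(), 1)
```

With k = 1 the third task stays frozen at ≤20 and its remaining steps are never executed; asking for more answers resumes it only once its bound reaches the top of the queue.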

Improvement of Freezing

Experiments: Graph Search. 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory

[Figure: running time (ms), Simple Lawler-Murty vs. w/ Freezing; panels: Mondial (k = 10, 100), DBLP (part) (k = 10, 100), DBLP (full) (k = 10, 100)]

On average, freezing saved 56% of the running time

Parallelization / Core Utilization

Straightforward Parallelization

• Output: 34, 30, 24; queue: 14, 12
• Awaiting tasks (best answers of the new partitions) run in parallel
• Their results (22, 20, 15) join 14, 12 in the queue

Not so fast…

• Typical: running time reduced by 30%
• Same for 2, 3, …, 8 threads!

Idle Cores while Waiting

• Output: 34, 30, 24; queue: 14, 12; awaiting tasks: 22, 20, 15
• Cores sit idle while waiting for the awaiting tasks to finish

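Early Popping

• Output: 34, 30, 24; computed answers in the queue: 22, 20, 14, 12
• Awaiting tasks report progressive upper bounds (e.g., ≤24, ≤23, ≤20; ≤22; ≤19)
• 22 > 20: a computed answer (22) that exceeds a running task’s bound (≤20) need not wait for that task

Skipped issues:
• Thread synchronization
  – semaphores, locking, etc.
• Correctness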
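The early-popping test itself is small. The following is a sketch of the condition only, under the assumption that every running task exposes its current upper bound; the function name `can_pop_early` is hypothetical, and the synchronization issues listed as skipped are not addressed.

```python
def can_pop_early(best_computed, running_bounds):
    """True if no task that is still running can beat the best computed answer."""
    return all(best_computed >= bound for bound in running_bounds)

# Bounds taken from the slide: tasks at <=20, <=22 and <=19, best computed answer 22.
ready = can_pop_early(22, [20, 22, 19])
# A task whose bound is still <=23 could yet produce something better than 22.
blocked = not can_pop_early(22, [23, 20])
```

A task with bound exactly 22 does not block the pop, since the best it can produce is a tie.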

Improvement of Early Popping

[Figure: % of Lawler-Murty vs. number of threads (1, 2, 4, 6, 8) for short, medium-size & long queries; panels: Mondial, DBLP (part)]

Early Popping vs. (Serial) Freezing

[Figure: % of Serial Freezing vs. number of threads (1, 2, 4, 6, 8) for short, medium-size & long queries; panels: Mondial, DBLP (part)]

• Need 4 threads to start gaining
• And even then, fairly poor…

Combining Freezing & Early Popping

• We discuss additional ideas and techniques to further utilize the cores
  – Not here, see the paper
• Main speedup by combining early popping with freezing
  – Cores kept busy… on high-potential tasks
  – Thread synchronization is quite involved
• At the high level, the final algorithm has the following flow:

Combining: General Idea

• Computed Answers (to-print): a queue of computed answers
• Partition Reps. as Frozen Tasks: a queue of frozen + new tasks
• Threads work on frozen tasks and deliver computed answers
• Main task just pops computed results to print… but validates: no better results by frozen tasks
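The flow above can be sketched with Python threads. This is a rough illustration under several assumptions, not the algorithm from the paper: tasks are scripted generators (`scripted_task` is hypothetical), workers advance the highest-bound frozen task one step at a time, new partitions are not generated, and one condition variable guards all shared state. Python threads also do not run CPU-bound code in parallel, so the sketch shows the control flow rather than the speedup.

```python
import heapq
import itertools
import threading

def scripted_task(bounds, final):
    """Hypothetical task: reports shrinking upper bounds, then its answer."""
    for b in bounds:
        yield ("bound", b)
    yield ("done", final)

def run_combined(tasks, k, num_threads=4):
    cond = threading.Condition()
    seq = itertools.count()
    frozen = [(-float("inf"), next(seq), t) for t in tasks]  # max-heap by bound
    heapq.heapify(frozen)
    in_flight = {}    # worker id -> bound of the task it is currently advancing
    computed = []     # max-heap (negated scores) of computed answers
    printed = []
    stopping = False

    def worker(wid):
        while True:
            with cond:
                while not frozen and not stopping:
                    cond.wait()
                if stopping:
                    return
                neg_bound, _, task = heapq.heappop(frozen)  # highest potential
                in_flight[wid] = -neg_bound
            kind, value = next(task)      # one step of work, outside the lock
            with cond:
                del in_flight[wid]
                if kind == "done":
                    heapq.heappush(computed, -value)
                else:
                    heapq.heappush(frozen, (-value, next(seq), task))
                cond.notify_all()

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    with cond:   # main task: pop computed results, but validate against bounds
        while len(printed) < k:
            pending = [-entry[0] for entry in frozen] + list(in_flight.values())
            if computed and all(-computed[0] >= b for b in pending):
                printed.append(-heapq.heappop(computed))
            elif not computed and not pending:
                break                     # fewer than k answers exist
            else:
                cond.wait()
        stopping = True
        cond.notify_all()
    for t in threads:
        t.join()
    return printed

tasks = [scripted_task([24, 23], 22),
         scripted_task([24, 23, 22], 20),
         scripted_task([24, 23, 20, 18, 16, 15], 15)]
top2 = run_combined(tasks, 2)
```

The validation step is what keeps the output order correct regardless of how the threads interleave: an answer is printed only when it is at least as high as every bound still outstanding.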

Combined vs. (Serial) Freezing

[Figure: % of Serial Freezing vs. number of threads (1, 2, 4, 6, 8) for short, medium & long queries; panels: Mondial, DBLP]

Now, significant gain (≈50%) already w/ 2 threads

Improvement of Combined

[Figure: % of Lawler-Murty vs. number of threads (1, 2, 4, 6, 8) for short, medium & long queries; panels: Mondial, DBLP; annotations: 4%-5%, 3%-10%]

On average, with 8 threads we got 5.7% of the original running time
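As a quick check of how such percentages translate into speedup factors (an illustration added here, assuming each percentage is the ratio of the combined running time to the original running time):

```latex
\[
\text{speedup} = \frac{T_{\text{original}}}{T_{\text{combined}}}, \qquad
\frac{1}{0.057} \approx 17.5, \qquad
\frac{1}{0.05} = 20, \qquad
\frac{1}{0.04} = 25 .
\]
```

Under that assumption, the 5.7% average corresponds to a speedup of roughly 17.5x, and the 4%-5% range to 20x-25x.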


Conclusions

• Considered Lawler-Murty’s ranked enumeration
  – Theoretical complexity guarantees
  – …but a direct implementation is very slow
  – Straightforward parallelization poorly utilizes cores
• Ideas: progressive bounds, freezing, early popping
  – In the paper: additional ideas, combination of ideas
• Most significant speedup by combining these ideas
  – Flow substantially differs from the original procedure
  – 20x faster on 8 cores
• Test case: graph search; focus: general apps
  – Future: additional test cases

Questions?
