A Parallelization of State-of-the-Art Graph Bisection Algorithms


Nan Dun, Kenjiro Taura, Akinori Yonezawa

Graduate School of Information Science and Technology

The University of Tokyo

July 31, Kochi, SWoPP 2006

Problem Description

• Graph Partition
  - Goal: minimize the edge-cut
  - K-partition
  - Bisection (bipartition)
• Problem Complexity
  - Finding the best partition, or even an approximate partition, is NP-hard 1)2)
• Solutions
  - Heuristics
  - Non-deterministic
  - On the Grid


Graph bisection problem

Given an undirected graph G = (V, E), find a partition (L, R) of V that satisfies |L| = |R| and minimizes the number of edges between L and R.

[Figure: example graph with vertices 1 to 6, bisected into L = {1,2,3} and R = {4,5,6}]
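As a minimal sketch of the objective, the edge-cut of a bisection can be counted as below. Only the split L = {1,2,3}, R = {4,5,6} comes from the slide; the edge list is hypothetical, since the edges of the example figure are not recoverable, and `edge_cut` is an illustrative helper name.

```python
def edge_cut(edges, left):
    """Count the edges that have exactly one endpoint in `left`."""
    left = set(left)
    return sum(1 for u, v in edges if (u in left) != (v in left))

# Hypothetical edges: two triangles joined by the single edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
L, R = {1, 2, 3}, {4, 5, 6}

assert len(L) == len(R)      # the bisection constraint |L| = |R|
print(edge_cut(edges, L))    # number of edges between L and R
```

For this hypothetical graph, any other balanced split cuts more edges, which is the sense in which (L, R) is the best bisection.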


Practical Applications

• In Mathematics
  - Analysis of sparse systems of linear equations
• In Computer Science
  - Modeling data placement on distributed memory to minimize communication
• In Various Other Domains
  - VLSI design
  - Transportation networks
  - Communication networks



Bisection Flow

• Bisection Initialization
  - Random initialization
  - Half-half initialization
  - Region growing
• Bisection Refinement
  - Kernighan-Lin 3)4)
  - Tabu Search 7)
    - Fixed Tabu Search
    - Reactive Tabu Search


[Figure: flow diagram. Bisection Initialization yields the Initial Bisection; Bisection Refinement yields the Final Bisection.]


Min-Max Greedy Growing 7)

• Min: search for the vertices whose addition causes the minimal edge-cut
• Max: break ties by maximizing internal connections

[Figure: vertices A, B, and C and the set being grown ("addset")]
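The Min and Max rules can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes two random seed vertices, assumes the two sets grow alternately to stay balanced, and reads "internal connections" as edges into the set being grown (the "addset" of the figure). The function name and the example graph are hypothetical.

```python
import random

def min_max_greedy(adj, seed=0):
    """Grow two sets alternately; returns {vertex: 0 or 1}."""
    rng = random.Random(seed)
    vertices = list(adj)
    first, second = rng.sample(vertices, 2)   # assumed: random seed vertices
    side = {first: 0, second: 1}
    sizes = [1, 1]
    while len(side) < len(vertices):
        addset = 0 if sizes[0] <= sizes[1] else 1   # keep the sets balanced
        other = 1 - addset
        best, best_key = None, None
        for v in vertices:
            if v in side:
                continue
            # Edges to the other set would become cut edges (Min rule).
            external = sum(1 for u in adj[v] if side.get(u) == other)
            # Edges into the growing set are internal connections (Max rule).
            internal = sum(1 for u in adj[v] if side.get(u) == addset)
            key = (external, -internal)
            if best_key is None or key < best_key:
                best, best_key = v, key
        side[best] = addset
        sizes[addset] += 1
    return side

# Hypothetical graph: two triangles joined by the edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
side = min_max_greedy(adj, seed=1)
print(sorted(v for v in side if side[v] == 0))
```

The quadratic scan over unassigned vertices keeps the sketch short; a practical version would presumably maintain the candidate keys incrementally.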


Kernighan-Lin 3)4)

1. Calculate the gain of each vertex
2. Search for a series of pairs whose swapping leads to the maximal edge-cut reduction
3. Swap the pairs of vertices obtained in step 2 and lock them against further swaps in the current pass
4. Iterate steps 1, 2, and 3 until the edge-cut stops decreasing


[Figure: Swapping a Pair of Vertices, showing vertices A, B, C, and D]

* gain := (# of internal edges) - (# of external edges)

gain(B) = -1, gain(C) = -2

ΔCut of swapping B, C = gain(B) + gain(C) + 2 = -1
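The gain arithmetic above can be checked with a small sketch. The four-vertex graph is hypothetical, chosen only so that it reproduces gain(B) = -1 and gain(C) = -2; the edges of the slide's figure are not recoverable. The "+ 2" term is applied only when the swapped vertices are adjacent, because their shared edge is counted as external by both gains but stays cut after the swap.

```python
def edge_cut(adj, side):
    """Number of edges whose endpoints lie on different sides."""
    return sum(1 for v in adj for u in adj[v] if side[v] != side[u]) // 2

def gain(adj, side, v):
    """Slide's definition: internal edges minus external edges of v."""
    internal = sum(1 for u in adj[v] if side[u] == side[v])
    external = len(adj[v]) - internal
    return internal - external

def delta_cut(adj, side, a, b):
    """Change in edge-cut if a and b (on opposite sides) are swapped."""
    shared = 2 if b in adj[a] else 0
    return gain(adj, side, a) + gain(adj, side, b) + shared

# Hypothetical graph: edges A-B, A-C, B-C, B-D; sides {A, B} and {C, D}.
adj = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"}, "D": {"B"}}
side = {"A": 0, "B": 0, "C": 1, "D": 1}

before = edge_cut(adj, side)
predicted = delta_cut(adj, side, "B", "C")
side["B"], side["C"] = 1, 0          # perform the swap
after = edge_cut(adj, side)
print(before, predicted, after)
```

A negative ΔCut means the swap reduces the edge-cut, so step 2 of the algorithm looks for the sequence of swaps with the most negative cumulative ΔCut.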


Tabu Search 7)

• Kernighan-Lin-like
  - Swaps pairs of vertices according to their gains
• Temporarily Forbidden
  - Previously swapped vertices are temporarily forbidden to move for a period of time (the Tabu Length)
  - Tabu Length: a fraction (the Tabu Fraction) of |V|
  - E.g.: Tabu Fraction = 0.01, |V| = 1000, Tabu Length = 0.01 x |V| = 10
  - Previously swapped pairs are allowed to move again after 10 other swaps
  - Purpose: to escape local minima

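A minimal sketch of a fixed tabu search built on these rules is given below. It is an illustration under stated assumptions rather than the authors' code: each step takes the best swap among non-tabu vertices even when it worsens the cut, the tabu length is rounded and clamped to at least 1, and all function names and the example graph are hypothetical.

```python
def edge_cut(adj, side):
    return sum(1 for v in adj for u in adj[v] if side[v] != side[u]) // 2

def gain(adj, side, v):
    internal = sum(1 for u in adj[v] if side[u] == side[v])
    return internal - (len(adj[v]) - internal)

def delta_cut(adj, side, a, b):
    return gain(adj, side, a) + gain(adj, side, b) + (2 if b in adj[a] else 0)

def tabu_length(num_vertices, tabu_fraction):
    """Tabu Length = Tabu Fraction x |V| (assumed: rounded, at least 1)."""
    return max(1, round(tabu_fraction * num_vertices))

def fixed_tabu_search(adj, side, tabu_fraction, steps):
    side = dict(side)
    length = tabu_length(len(adj), tabu_fraction)
    free_at = {}                      # vertex -> first step it may move again
    cut = edge_cut(adj, side)
    best_cut, best_side = cut, dict(side)
    for step in range(steps):
        free = [v for v in adj if free_at.get(v, 0) <= step]
        left = [v for v in free if side[v] == 0]
        right = [v for v in free if side[v] == 1]
        if not left or not right:
            continue                  # everything is tabu; wait for a release
        a, b = min(((x, y) for x in left for y in right),
                   key=lambda pair: delta_cut(adj, side, pair[0], pair[1]))
        cut += delta_cut(adj, side, a, b)   # may increase: escapes local minima
        side[a], side[b] = 1, 0
        free_at[a] = free_at[b] = step + 1 + length
        if cut < best_cut:
            best_cut, best_side = cut, dict(side)
    return best_cut, best_side

# Hypothetical graph: two triangles joined by the edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
start = {1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1}
best_cut, best_side = fixed_tabu_search(adj, start, tabu_fraction=0.2, steps=10)
print(edge_cut(adj, start), best_cut)
```

Because every move is a swap, the bisection stays balanced throughout, and the best bisection seen so far is kept separately from the current one.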


Graph Types – Tabu Lengths

• Vertex Degree
  - Denser random graphs tend to prefer smaller tabu lengths, while denser geometric graphs tend to prefer larger tabu lengths 8)
• Distribution of Vertex Degree
  - Graphs with a uniform distribution of vertex degree tend to have a single best-fitting tabu length


[Figure: two plots of edge-cut against tabu fraction (0.01 to 0.25). Graphs shown: |V| = 17758, |E| = 54196, degree max 573, min 1, avg. 6.1; and |V| = 35000, |E| = 346572, degree max 43, min 3, avg. 19.8.]


RRTS 7)

• Synthesis of Heuristics
  - The heuristics complement each other
• Reactive
  - Tries each tabu length to see which is better
  - Adapts to various graphs
• Best Quality
  - Goes beyond local minima
• Long Running Time
  - Caused by the scoring phase


REACTIVE RANDOMIZED TABU SEARCH
  Score each tabu length by small runs of TS
  do I times
    Initial bisection by Min-Max
    do J times
      TS with the high-scored tabu length
      Refine by Kernighan-Lin runs

R. Battiti and A. A. Bertossi. Greedy, Prohibition, and Reactive Heuristics for Graph Partitioning. IEEE Transactions on Computers, Vol. 48, April 1999.


Multi-level for Large Graphs

• Coarsening Phase
  - Coarsen a large graph into a smaller one by using a matching scheme
  - Multi-level coarsening
• Bisection Phase
  - Bisecting small graphs is usually very fast
• Uncoarsening Phase
  - Map back to the original graph
  - Perform refinement at each uncoarsening level
• METIS 5)12)


[Figure: Matching Scheme]
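One coarsening level can be sketched with a random maximal matching, as below. This is an assumption made for illustration: the slide does not say which matching scheme is used, and this is not METIS code. Matched pairs are contracted into one coarse vertex and parallel edges are merged by summing their weights; `coarsen_once` and the example path graph are hypothetical.

```python
import random

def coarsen_once(adj_w, seed=0):
    """adj_w: {v: {u: weight}}. Returns (coarse graph, fine -> coarse map)."""
    rng = random.Random(seed)
    order = list(adj_w)
    rng.shuffle(order)
    coarse_of = {}
    next_id = 0
    for v in order:
        if v in coarse_of:
            continue
        coarse_of[v] = next_id
        # Match v with a random neighbour that is still unmatched, if any.
        partners = [u for u in adj_w[v] if u not in coarse_of]
        if partners:
            coarse_of[rng.choice(partners)] = next_id
        next_id += 1
    coarse = {c: {} for c in range(next_id)}
    for v, nbrs in adj_w.items():
        cv = coarse_of[v]
        for u, w in nbrs.items():
            cu = coarse_of[u]
            if cu != cv:              # edges inside a matched pair disappear
                coarse[cv][cu] = coarse[cv].get(cu, 0) + w
    return coarse, coarse_of

# Hypothetical input: the path 0 - 1 - 2 - 3 with unit edge weights.
path = {0: {1: 1}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1}}
coarse, coarse_of = coarsen_once(path, seed=0)
print(len(path), "->", len(coarse))
```

Applying `coarsen_once` repeatedly gives the multi-level hierarchy; the `coarse_of` maps are what the uncoarsening phase would use to project a bisection back to the finer graph.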


Comparison of Heuristics

       METIS           RRTS100              FTS10000
       cut    time     cut    time          cut    time
G1     130    0.01     130    168.11        130    1.22
G2     366    0.07     353    696.49        354    13.85
G3     311    0.10     311    935.56        306    32.85
G4     6337   0.04     6257   353.45        6316   3.77
G5     950    0.17     Timeout (1 hour)     929    31.55

Graph         |V|     |E|      Degree              Best Tabu Fraction
                               Avg    Min   Max
G1:fe_4elt    11143   32818    7.93   0     15     0.02
G2:fe_pwt     36519   144794   5.89   3     12     0.02
G3:fe_body    45087   163734   7.26   0     28     0.02
G4:mem        17758   54196    6.10   1     573    0.14
G5:wing       62032   121544   3.92   2     4      0.01


Comparison of Heuristics

• METIS
  - Extremely fast
    - Uses the multi-level technique
  - High-quality bisections, but worse than RRTS
    - Multi-level lacks global optimization during the coarsening phase
• RRTS
  - Very slow
    - The scoring phase is time-consuming
  - “Ever-best” bisections
    - Adapts to different kinds of graphs
• FTS with Known Tabu Length
  - Much faster than RRTS
  - Results comparable to RRTS



A Naive Parallelization

• Run RRTS independently on each node
  - Simply equivalent to scaling up the number of iterations
• Generate different seeds for different nodes
  - The heuristics are sensitive to the initial bisection
  - 10% to 20% improvement


[Figure: graphs are dispatched to several nodes, each running RRTS100, and the results are then synthesized]
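The dispatch and synthesize pattern can be sketched as below. Everything here is a stand-in chosen for illustration: worker threads play the role of Grid nodes, `randomized_heuristic` is a hypothetical placeholder that draws a pseudo-random cut instead of running RRTS, and the value range is arbitrary.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def randomized_heuristic(seed):
    """Placeholder for one independent RRTS run; returns (edge_cut, seed)."""
    rng = random.Random(seed)
    return rng.randint(1050, 1370), seed    # arbitrary placeholder values

def run_on_nodes(num_nodes, base_seed=0):
    """Dispatch one run per node with a distinct seed, then keep the best."""
    seeds = [base_seed + i for i in range(num_nodes)]
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        results = list(pool.map(randomized_heuristic, seeds))
    return min(results)                     # synthesize: smallest edge-cut wins

best_cut, best_seed = run_on_nodes(8)
print(best_cut, best_seed)
```

Since the runs never communicate, this is equivalent to running the same seeds one after another on a single node, which is why the slide calls it a scale-up of iterations.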


Statistical Properties of Cut-size

• Incidence of Bests
  - Average quality is good
  - Only 0.25% of the results are the best
• General Property
  - The distribution becomes more peaked as |V| grows
  - The distribution tends towards Gaussian 8)
  - Mean and variance scale linearly with |V|


[Figure: histogram of count (0 to 80) against edge-cut (1050 to 1370). |V| = 35000, |E| = 346572, degree: max 43, min 3, avg 19.80. RRTS100 on 400 nodes provided by Grid Challenge Federation.]


Issues of Parallelizing Heuristics

• Hard with the Message-Passing Model (MPI)
  - J.R. Gilbert and E. Zmijewski 9): A parallel graph partitioning algorithm for a message-passing multiprocessor. International Journal of Parallel Programming
  - Par-METIS (Parallel METIS)
    - Par-METIS parallelizes only the “coarsen-uncoarsen” part
• Hard to Be Efficient (statistical property)
  - Even if we could parallelize the heuristic efficiently
    - The fraction of iterations that reach the best bisections is still small
  - If we cooperatively run independent instances on the Grid
    - How many nodes will lead to the best partition?
    - When will a good threshold come?

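The "how many nodes" question can be made concrete with a back-of-the-envelope sketch. It assumes that runs are independent and that each run reaches the best bisection with the same probability p; the 0.25% incidence reported earlier is used as p purely for illustration, and both function names are hypothetical.

```python
import math

def prob_at_least_one_best(p_best, runs):
    """P(at least one of `runs` independent runs reaches the best cut)."""
    return 1.0 - (1.0 - p_best) ** runs

def runs_needed(p_best, confidence):
    """Smallest number of independent runs reaching `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_best))

p = 0.0025                                  # 0.25% incidence of the best cut
print(prob_at_least_one_best(p, 400))       # one run on each of 400 nodes
print(runs_needed(p, 0.95))                 # runs for a 95% chance
```

Under these assumptions, one run on each of 400 nodes finds the best cut with a probability of roughly 0.63, so adding nodes alone is an expensive way to approach certainty.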


Contribution of Phases

• Initial Phase
  - Removes a large portion of the edge-cut
  - Good initial partitions lead to good final partitions
  - Its running time is consistent across runs; good initial partitions save time for refinement
• TS and KL Phases
  - Their reductions tend to be alike
  - More iterations, better results


[Figure: chart of ΔEdge-Cut (axis 900 to 1900) for the best edge-cuts 912, 1035, 1075, and 1079; series: Init, FTS, KL, Final]


Results from Same Initial Bisections

• Given the Same Initial Partitions
  - The best initial partitions lead to the best final partitions
  - FTS and KL tend to be deterministic
    - Fewer swaps are available
• The diversity of edge-cut can be cancelled by distributing only one phase
  - Running FTS and KL on one node is enough


[Figure: count (0 to 50) of final edge-cuts when FTS and KL are performed on the same initial partitions, 50 nodes. Labels: 915 (Init: 1078), 985 (Init: 1156), 987 (Init: 1197), 1000 (Init: 1185).]


Multi-level Scoring

• Mainly Used to Adapt to Large-Scale Graphs
  - If |V| = 1000, Tabu = 0.01 x 1000 = 10
  - If |V| = 100000, Tabu = 0.01 x 100000 = 1000
• Tuning the Tabu Length to Fit Specific Graphs Better
  - Level-1 scoring distinguishes graphs by their types
  - Level-2 scoring searches for a better tabu length for the specific graph


[Figure: two plots of edge-cut (Avg. Cut and Min. Cut). One is plotted against the Level-1 Tabu Fraction (0.01 to 0.25), the other against the Level-2 Tabu Fraction (0.001 to 0.019).]
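A two-level search over tabu fractions might look like the sketch below. It is an interpretation, not the authors' procedure: it assumes that Level-2 refines a finer grid centred on the Level-1 winner, which is consistent with the Level-2 axis (0.001 to 0.019) if the Level-1 winner is 0.01. The `score` callable stands for "average cut of a few short TS runs" and is replaced here by a synthetic function; all names are hypothetical.

```python
def two_level_scoring(score, coarse_fractions, fine_step, half_width=3):
    """Pick the best coarse fraction, then refine around it on a finer grid."""
    best1 = min(coarse_fractions, key=score)            # Level-1 scoring
    fine = [round(best1 + k * fine_step, 6)
            for k in range(-half_width, half_width + 1)]
    fine = [f for f in fine if f > 0]                   # keep valid fractions
    best2 = min(fine, key=score)                        # Level-2 scoring
    return best1, best2

# Level-1 grid taken from the plot axis: 0.01, 0.04, ..., 0.25.
level1 = [round(0.01 + 0.03 * i, 2) for i in range(9)]

def synthetic_score(fraction):
    """Stand-in for the measured average cut; lower is better."""
    return abs(fraction - 0.013)

best1, best2 = two_level_scoring(synthetic_score, level1, fine_step=0.003)
print(best1, best2)
```

Each call to `score` is independent of the others, which is what makes the scoring phase a natural candidate for running one tabu length per group of nodes.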


Final Approaches

• Do Not Use Multi-level Partitioning
  - To preserve the “best” quality
• Do Not Parallelize the Heuristics Themselves
  - Not a good trade-off
• Parallelize the Scoring Phase
  - One group of nodes scores one tabu length
  - With the multi-level scoring technique
• Parallelize the Initial Phase Only
  - Remove the diversity of edge-cut as soon as possible
    - Take advantage of distributed runs to remove the diversity of edge-cut
  - Reduce the computing effort as much as possible
    - Further refinement can be done on a single node
• Use the GXP Cluster Shell
  - “mw” command: mw M {{ W }}



Full Picture

[Figure: overall flow. Multi-level scoring: nodes score Level-1 tabu fractions (S: 0.01 to 0.07) and, for the high-scored Level-1 tabu fraction, Level-2 tabu fractions (S: 0.001 to 0.007). Initial phase: several Init runs yield the best initial partitions. Refinement phase: FTS and KL.]


Conclusions

• Bisection Quality
  - “Ever-best” partitions
  - Edge-Cut(ours) ≤ Edge-Cut(RRTS) ≤ Edge-Cut(METIS)
• Bisection Time
  - Comparable and reasonable
  - Time(METIS) < Time(ours) << Time(RRTS)
  - Speedup of 10 compared to RRTS
• Adapted to the Grid Environment
  - Scalable performance
  - Convenient usage
  - Good fault tolerance



Thank you for your attention!
