
Parallel All-Points Shortest Paths
ECE 563 - Spring 2013

Jason Holmes, Bharadwaj Krishnamurthy, Hector Rodriguez-Simmonds

Outline

• Overview
• Sequential Code Development
• Parallel Dijkstra
• Parallel Floyd-Warshall
• Results

Overview
• Tackled the All-Points Shortest Paths problem
• Constructed graphs from real data (social networks, road networks, etc.)
• Wrote a modification of Dijkstra's Algorithm
  • Better for sparse graphs
• Wrote the Floyd-Warshall dynamic programming algorithm
  • Less structural overhead
  • Can handle negative edge weights
• Developed parallel versions using OpenMP
  • Parallel Dijkstra: 7.6x speedup on 8 cores
  • Parallel Floyd-Warshall: ~6x speedup on 8 cores

Sequential Code - graphCreate

# Input File
<1, 2> <1, 4> <2, 5> <3, 5> <3, 6> <4, 2> <5, 4> <6, 6>

[Diagram: input data → buildGraph(Dijkstra) → adjacency list;
          input data → buildGraph(FW) → adjacency matrix]
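For reference, a minimal sketch of loading an edge list in this format into an adjacency matrix. The function name loadAdjMatrix, the unit edge weights, and the assumption that the file contains only <u, v> pairs are all illustrative; this is not the project's actual buildGraph code:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

#define INF (INT_MAX / 2)   /* half of INT_MAX so a sum of two paths cannot overflow */

/* Read "<u, v>" pairs and build a v_count x v_count adjacency matrix
   with unit edge weights; absent edges get the INF sentinel. */
int ** loadAdjMatrix(const char * path, int v_count) {
    int ** m = malloc(v_count * sizeof(int *));
    for (int i = 0; i < v_count; i++) {
        m[i] = malloc(v_count * sizeof(int));
        for (int j = 0; j < v_count; j++)
            m[i][j] = (i == j) ? 0 : INF;
    }
    FILE * fp = fopen(path, "r");
    int u, v;
    while (fscanf(fp, " <%d, %d>", &u, &v) == 2)
        m[u - 1][v - 1] = 1;   /* vertices in the input file are 1-based */
    fclose(fp);
    return m;
}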

Sequential Code - Dijkstra

graph = (vertex **) buildGraphFromFile(argv[1], LIST, &numberOfVertices);

for (source = 0; source < numberOfVertices; source++) {
    for (target = 0; target < numberOfVertices; target++) {
        vertex * VSource = returnVertex(graph, source);
        vertex * VTarget = returnVertex(graph, target);
        VSource->distance = 0;
        int dist = Dijkstra3(graph, VSource, VTarget, VSource->number);
        initGraph(graph, numberOfVertices);  /* reset distances for the next run */
    }
}

Run Dijkstra's single-source algorithm V times
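For context, here is a generic O(V^2) single-source Dijkstra over an adjacency matrix. This is an illustrative sketch only; the project's Dijkstra3 works on an adjacency list and has a different interface:

#include <limits.h>
#include <stdbool.h>

#define INF (INT_MAX / 2)

/* Generic single-source Dijkstra over adjacency matrix w, assuming
   non-negative edge weights; fills dist[] with shortest distances
   from source. */
void dijkstra(int ** w, int v_count, int source, int * dist) {
    bool done[v_count];
    for (int i = 0; i < v_count; i++) { dist[i] = INF; done[i] = false; }
    dist[source] = 0;
    for (int iter = 0; iter < v_count; iter++) {
        int u = -1;
        /* pick the closest unfinished vertex */
        for (int i = 0; i < v_count; i++)
            if (!done[i] && (u < 0 || dist[i] < dist[u])) u = i;
        if (u < 0 || dist[u] == INF) break;   /* the rest is unreachable */
        done[u] = true;
        /* relax every edge leaving u */
        for (int v = 0; v < v_count; v++)
            if (w[u][v] < INF && dist[u] + w[u][v] < dist[v])
                dist[v] = dist[u] + w[u][v];
    }
}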

Sequential Code - FW
• Dynamic programming problem
• Find the shortest path from i to j using only intermediate nodes 1 to k-1
• Once k reaches the total number of nodes, we have the shortest path from i to j
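Written out, these bullets describe the standard Floyd-Warshall recurrence, with d_0(i,j) equal to the weight of edge (i,j):

    d_k(i,j) = \min\left( d_{k-1}(i,j),\; d_{k-1}(i,k) + d_{k-1}(k,j) \right)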

Sequential Code - FW

edge ** FW_direct(edge ** matrix, int v_count) {
    int i, j, k;
    edge ** max_node;

    max_node = malloc(v_count * sizeof(edge *));
    for (i = 0; i < v_count; i++) { … }   /* allocate and initialize rows */
    for (k = 1; k < v_count; k++) {
        for (j = 0; j < v_count; j++) {
            for (i = 0; i < v_count; i++) {
                if (matrix[i][j] > matrix[i][k] + matrix[k][j]) {
                    matrix[i][j] = matrix[i][k] + matrix[k][j];
                    max_node[i][j] = k;
                }
            }
        }
    }
    return (max_node);
}

The k loop cannot be parallelized! Iteration k reads row k and column k as updated by iteration k-1, so the k iterations must run in order.

Bad Parallelization

[Diagram: the i×j matrix with k = 5, split across CORE 0 through CORE 3]

Sequential Code - FW
• Change the algorithm: use smaller blocks and deal with the dependencies

Parallel Floyd-Warshall
• Transformations:
  1. Parallel with tuned blocks
  2. Restructured parallel with nowait
  3. Manual balancing of workload distribution
  4. Parallelized computation of the self-dependent block
  5. Loop-coalesced version of the previous transformation

Parallel Floyd-Warshall

[Diagrams: blocked decomposition of the i×j distance matrix]

1. Parallel With Tuned Blocks
• Transformed from naïve OpenMP directives
• Large block size reduces the number of independent blocks that can run in parallel
• Small block sizes cut down on the number of computations per block
• Optimum block size found to be ~20x20
  • This is somewhat graph-size dependent
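To make the blocking concrete, here is a minimal sketch of one blocked Floyd-Warshall pass with one parallel loop per phase. BLOCK_SIZE, fw_block, and fw_blocked are illustrative names and structure, not the project's actual code:

#define BLOCK_SIZE 20   /* the ~20x20 tuned value from this slide */

/* Update block (bi, bj) of distance matrix d using intermediate
   vertices from k-block kb; d is assumed pre-filled with a
   non-overflowing INF sentinel for absent edges. */
static void fw_block(int ** d, int bi, int bj, int kb, int v_count) {
    int i0 = bi * BLOCK_SIZE, j0 = bj * BLOCK_SIZE, k0 = kb * BLOCK_SIZE;
    for (int k = k0; k < k0 + BLOCK_SIZE && k < v_count; k++)
        for (int i = i0; i < i0 + BLOCK_SIZE && i < v_count; i++)
            for (int j = j0; j < j0 + BLOCK_SIZE && j < v_count; j++)
                if (d[i][j] > d[i][k] + d[k][j])
                    d[i][j] = d[i][k] + d[k][j];
}

void fw_blocked(int ** d, int v_count) {
    int n_blocks = (v_count + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (int kb = 0; kb < n_blocks; kb++) {
        /* Phase 1: the self-dependent diagonal block, serial */
        fw_block(d, kb, kb, kb, v_count);
        /* Phase 2: blocks sharing a row or column with it */
        #pragma omp parallel for
        for (int b = 0; b < n_blocks; b++)
            if (b != kb) {
                fw_block(d, kb, b, kb, v_count);   /* row block */
                fw_block(d, b, kb, kb, v_count);   /* column block */
            }
        /* Phase 3: all remaining blocks, mutually independent */
        #pragma omp parallel for collapse(2)
        for (int bi = 0; bi < n_blocks; bi++)
            for (int bj = 0; bj < n_blocks; bj++)
                if (bi != kb && bj != kb)
                    fw_block(d, bi, bj, kb, v_count);
    }
}

Within a block, the k loop still runs sequentially; only blocks that do not depend on each other for the current k-block run in parallel.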

2. Restructured with NOWAIT
• Issue: Many separate loops can run in parallel for processing different block types
• Most for loops combined into one OMP parallel construct
  • Eliminates multiple fork/join (wakeup/sleep) operations
• Intermediate serial sections handled by OpenMP master
• NOWAIT clause added to loops where correctness would not be violated
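A minimal sketch of that restructuring, reusing the illustrative fw_block from the previous sketch (again an assumption, not the project's actual loop structure):

void fw_blocked_nowait(int ** d, int v_count) {
    int n_blocks = (v_count + BLOCK_SIZE - 1) / BLOCK_SIZE;
    /* One fork/join for the whole pass instead of one per loop */
    #pragma omp parallel
    {
        for (int kb = 0; kb < n_blocks; kb++) {
            /* Serial self-dependent block, done by the master thread */
            #pragma omp master
            fw_block(d, kb, kb, kb, v_count);
            /* master has no implied barrier, so add one explicitly */
            #pragma omp barrier
            /* Row and column blocks are independent of each other,
               so the first loop can drop its barrier with nowait */
            #pragma omp for nowait
            for (int b = 0; b < n_blocks; b++)
                if (b != kb) fw_block(d, kb, b, kb, v_count);
            #pragma omp for
            for (int b = 0; b < n_blocks; b++)
                if (b != kb) fw_block(d, b, kb, kb, v_count);
            /* The implied barrier above covers both loops, so the
               remaining blocks can now be updated safely */
            #pragma omp for collapse(2)
            for (int bi = 0; bi < n_blocks; bi++)
                for (int bj = 0; bj < n_blocks; bj++)
                    if (bi != kb && bj != kb)
                        fw_block(d, bi, bj, kb, v_count);
        }
    }
}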

3. Redistribute Workload
• Issue: The self-dependent block migrates as k varies, so the workload becomes unbalanced
• Using various scheduling options (guided, dynamic) decreased performance
• Hence, manually restructured the loops to balance the workload

4/5. Loop Coalescing
• 4. Parallelize the internal loops of the self-dependent block to eliminate serialization
• 5. Coalesce those loops, since the number of iterations per loop is small: coalescing the i and j loops gives OpenMP BLOCK_SIZE_SQ (= BLOCK_SIZE²) iterations to schedule instead of BLOCK_SIZE (compare the two versions below)

Normal:

#pragma omp for nowait
for (i = block_ly; i < (block_ly + BLOCK_SIZE); i++) {
    for (j = block_lx; j < (block_lx + BLOCK_SIZE); j++) {
        if ((i >= v_count) || (j >= v_count) || (k >= v_count))
            continue;
        if (submatrix[i][j] > (submatrix[i][k] + submatrix[k][j])) {
            submatrix[i][j] = submatrix[i][k] + submatrix[k][j];
            max_node[i][j] = k;
        }
    }
}

Coalesced:

for (k = start_k; k < (start_k + BLOCK_SIZE); k++) {
    #pragma omp for nowait
    for (ij = 0; ij < BLOCK_SIZE_SQ; ij++) {
        i = (ij / BLOCK_SIZE) + block_ly;   /* recover row index */
        j = (ij % BLOCK_SIZE) + block_lx;   /* recover column index */
        if ((i >= v_count) || (j >= v_count) || (k >= v_count))
            continue;
        if (submatrix[i][j] > (submatrix[i][k] + submatrix[k][j])) {
            submatrix[i][j] = submatrix[i][k] + submatrix[k][j];
            max_node[i][j] = k;
        }
    }
}

Parallel Dijkstra

graph0 = (vertex **) buildGraphFromFile(argv[1], LIST, &numberOfVertices);

// Able to parallelize the very outer loop; the compiler could not
// detect this due to the subroutine calls
#pragma omp parallel
{
    // Done once per thread: each of the X threads gets its own copy
    vertex ** graphX = copyGraph(graph0, numberOfVertices);
    #pragma omp for private(target)
    for (source = 0; source < numberOfVertices; source++) {
        for (target = 0; target < numberOfVertices; target++) {
            // Each thread runs Dijkstra3 on its private copy graphX
            vertex * VSource = returnVertex(graphX, source);
            vertex * VTarget = returnVertex(graphX, target);
            VSource->distance = 0;
            int dist = Dijkstra3(graphX, VSource, VTarget, VSource->number);
            initGraph(graphX, numberOfVertices);
        }
    }
}

Parallel Dijkstra

[Diagram: Build Graph, then each of X threads copies the graph and processes N/X single-source shortest paths]

• Outer loop parallelized; each thread executes Dijkstra's algorithm with N/X source vertices (X = # of cores)
• Each thread retains a copy of the graph to modify

Results - FW

[Chart: Floyd-Warshall speedup by program version – Input Graph 1]

• Graph 1
  • 493 vertices
  • 1189 edges
• Final speedup: 4.93 on 8 cores

Results - FW

[Chart: Floyd-Warshall speedup by program version – Input Graph 2]

• Graph 2
  • 767 vertices
  • 1795 edges
• Final speedup: 6.66 on 8 cores

Results - FW

[Chart: Floyd-Warshall speedup by program version – Input Graph 3]

• Graph 3
  • 5,242 vertices
  • 28,980 edges
• Final speedup: 7.66 on 8 cores

Results - FW

[Chart: Speedup vs. # of cores (1, 2, 4, 8) for Graphs 1, 2, and 3]

• More parallelism exploited for larger graphs

Results - Dijkstra

[Chart: Parallel Dijkstra speedup on 8 cores for Graphs 1, 2, and 3, all roughly between 7.66 and 7.82]

• Near-linear speedup due to the outer-loop parallelization
• As graph size increases, the graph build and copy overhead matters less

Future Work / Improvements
• Utilize MapReduce for huge graph input sets
• Convert Floyd-Warshall to MPI to deal with memory limits on one machine
• Port to a map API to view shortest-path information in a GUI
  • OpenStreetMap
• Add mechanisms to detect sparsity and negative edge weights, and call the appropriate routine

Questions?