
Parallel All-Points Shortest Paths
ECE 563 - Spring 2013

Jason Holmes, Bharadwaj Krishnamurthy, Hector Rodriguez-Simmonds

Outline

• Overview
• Sequential Code Development
• Parallel Dijkstra
• Parallel Floyd-Warshall
• Results

Overview
• Tackled the All-Points Shortest Paths problem
• Constructed graphs from real data (social networks, road networks, etc.)
• Wrote a modification of Dijkstra's Algorithm
  • Better for sparse graphs
• Wrote the Floyd-Warshall dynamic programming algorithm
  • Less structural overhead
  • Can handle negative edge weights
• Developed parallel versions using OpenMP
  • Parallel Dijkstra: 7.6x speedup on 8 cores
  • Parallel Floyd-Warshall: ~6x speedup on 8 cores

Sequential Code - graphCreate

# Input File
<1, 2> <1, 4> <2, 5> <3, 5> <3, 6> <4, 2> <5, 4> <6, 6>

[Diagram: input data → buildGraph(Dijkstra) → adjacency list;
          input data → buildGraph(FW) → adjacency matrix]
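For reference, a minimal sketch of loading an edge list in this format into an adjacency matrix. The function name loadAdjMatrix, the unit edge weights, and the assumption that the file contains only <u, v> pairs are all illustrative; this is not the project's actual buildGraph code:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

#define INF (INT_MAX / 2)   /* half of INT_MAX so a sum of two paths cannot overflow */

/* Read "<u, v>" pairs and build a v_count x v_count adjacency matrix
   with unit edge weights; absent edges get the INF sentinel. */
int ** loadAdjMatrix(const char * path, int v_count) {
    int ** m = malloc(v_count * sizeof(int *));
    for (int i = 0; i < v_count; i++) {
        m[i] = malloc(v_count * sizeof(int));
        for (int j = 0; j < v_count; j++)
            m[i][j] = (i == j) ? 0 : INF;
    }
    FILE * fp = fopen(path, "r");
    int u, v;
    while (fscanf(fp, " <%d, %d>", &u, &v) == 2)
        m[u - 1][v - 1] = 1;   /* vertices in the input file are 1-based */
    fclose(fp);
    return m;
}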

Sequential Code - Dijkstra

graph = (vertex **) buildGraphFromFile(argv[1], LIST, &numberOfVertices);

for (source = 0; source < numberOfVertices; source++) {
    for (target = 0; target < numberOfVertices; target++) {
        vertex * VSource = returnVertex(graph, source);
        vertex * VTarget = returnVertex(graph, target);
        VSource->distance = 0;
        int dist = Dijkstra3(graph, VSource, VTarget, VSource->number);
        initGraph(graph, numberOfVertices);  /* reset distances for the next run */
    }
}

Run Dijkstra's single-source algorithm V times
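For context, here is a generic O(V^2) single-source Dijkstra over an adjacency matrix. This is an illustrative sketch only; the project's Dijkstra3 works on an adjacency list and has a different interface:

#include <limits.h>
#include <stdbool.h>

#define INF (INT_MAX / 2)

/* Generic single-source Dijkstra over adjacency matrix w, assuming
   non-negative edge weights; fills dist[] with shortest distances
   from source. */
void dijkstra(int ** w, int v_count, int source, int * dist) {
    bool done[v_count];
    for (int i = 0; i < v_count; i++) { dist[i] = INF; done[i] = false; }
    dist[source] = 0;
    for (int iter = 0; iter < v_count; iter++) {
        int u = -1;
        /* pick the closest unfinished vertex */
        for (int i = 0; i < v_count; i++)
            if (!done[i] && (u < 0 || dist[i] < dist[u])) u = i;
        if (u < 0 || dist[u] == INF) break;   /* the rest is unreachable */
        done[u] = true;
        /* relax every edge leaving u */
        for (int v = 0; v < v_count; v++)
            if (w[u][v] < INF && dist[u] + w[u][v] < dist[v])
                dist[v] = dist[u] + w[u][v];
    }
}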

Sequential Code - FW
• Dynamic programming problem
• Find the shortest path from i to j using only intermediate nodes 1 to k-1
• Once k reaches the total number of nodes, we have the shortest path from i to j
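Written out, these bullets describe the standard Floyd-Warshall recurrence, with d_0(i,j) equal to the weight of edge (i,j):

    d_k(i,j) = \min\left( d_{k-1}(i,j),\; d_{k-1}(i,k) + d_{k-1}(k,j) \right)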

Sequential Code - FW

edge ** FW_direct(edge ** matrix, int v_count) {
    int i, j, k;
    edge ** max_node;

    max_node = malloc(v_count * sizeof(edge *));
    for (i = 0; i < v_count; i++) { … }   /* allocate and initialize rows */
    for (k = 1; k < v_count; k++) {
        for (j = 0; j < v_count; j++) {
            for (i = 0; i < v_count; i++) {
                if (matrix[i][j] > matrix[i][k] + matrix[k][j]) {
                    matrix[i][j] = matrix[i][k] + matrix[k][j];
                    max_node[i][j] = k;
                }
            }
        }
    }
    return (max_node);
}

The k loop cannot be parallelized! Iteration k reads row k and column k as updated by iteration k-1, so the k iterations must run in order.

Bad Parallelization

[Diagram: the i×j matrix with k = 5, split across CORE 0 through CORE 3]

Sequential Code - FW
• Change the algorithm: use smaller blocks and deal with the dependencies

Parallel Floyd-Warshall
• Transformations:
  1. Parallel with tuned blocks
  2. Restructured parallel with nowait
  3. Manual balancing of workload distribution
  4. Parallelized computation of the self-dependent block
  5. Loop-coalesced version of the previous transformation

Parallel Floyd-Warshall

[Diagrams: blocked decomposition of the i×j distance matrix]

1. Parallel With Tuned Blocks
• Transformed from naïve OpenMP directives
• Large block size reduces the number of independent blocks that can run in parallel
• Small block sizes cut down on the number of computations per block
• Optimum block size found to be ~20x20
  • This is somewhat graph-size dependent
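To make the blocking concrete, here is a minimal sketch of one blocked Floyd-Warshall pass with one parallel loop per phase. BLOCK_SIZE, fw_block, and fw_blocked are illustrative names and structure, not the project's actual code:

#define BLOCK_SIZE 20   /* the ~20x20 tuned value from this slide */

/* Update block (bi, bj) of distance matrix d using intermediate
   vertices from k-block kb; d is assumed pre-filled with a
   non-overflowing INF sentinel for absent edges. */
static void fw_block(int ** d, int bi, int bj, int kb, int v_count) {
    int i0 = bi * BLOCK_SIZE, j0 = bj * BLOCK_SIZE, k0 = kb * BLOCK_SIZE;
    for (int k = k0; k < k0 + BLOCK_SIZE && k < v_count; k++)
        for (int i = i0; i < i0 + BLOCK_SIZE && i < v_count; i++)
            for (int j = j0; j < j0 + BLOCK_SIZE && j < v_count; j++)
                if (d[i][j] > d[i][k] + d[k][j])
                    d[i][j] = d[i][k] + d[k][j];
}

void fw_blocked(int ** d, int v_count) {
    int n_blocks = (v_count + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (int kb = 0; kb < n_blocks; kb++) {
        /* Phase 1: the self-dependent diagonal block, serial */
        fw_block(d, kb, kb, kb, v_count);
        /* Phase 2: blocks sharing a row or column with it */
        #pragma omp parallel for
        for (int b = 0; b < n_blocks; b++)
            if (b != kb) {
                fw_block(d, kb, b, kb, v_count);   /* row block */
                fw_block(d, b, kb, kb, v_count);   /* column block */
            }
        /* Phase 3: all remaining blocks, mutually independent */
        #pragma omp parallel for collapse(2)
        for (int bi = 0; bi < n_blocks; bi++)
            for (int bj = 0; bj < n_blocks; bj++)
                if (bi != kb && bj != kb)
                    fw_block(d, bi, bj, kb, v_count);
    }
}

Within a block, the k loop still runs sequentially; only blocks that do not depend on each other for the current k-block run in parallel.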

2. Restructured with NOWAIT
• Issue: Many separate loops can run in parallel for processing different block types
• Most for loops combined into one OMP parallel construct
  • Eliminates multiple fork/join (wakeup/sleep) operations
• Intermediate serial sections handled by OpenMP master
• NOWAIT clause added to loops where correctness would not be violated
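A minimal sketch of that restructuring, reusing the illustrative fw_block from the previous sketch (again an assumption, not the project's actual loop structure):

void fw_blocked_nowait(int ** d, int v_count) {
    int n_blocks = (v_count + BLOCK_SIZE - 1) / BLOCK_SIZE;
    /* One fork/join for the whole pass instead of one per loop */
    #pragma omp parallel
    {
        for (int kb = 0; kb < n_blocks; kb++) {
            /* Serial self-dependent block, done by the master thread */
            #pragma omp master
            fw_block(d, kb, kb, kb, v_count);
            /* master has no implied barrier, so add one explicitly */
            #pragma omp barrier
            /* Row and column blocks are independent of each other,
               so the first loop can drop its barrier with nowait */
            #pragma omp for nowait
            for (int b = 0; b < n_blocks; b++)
                if (b != kb) fw_block(d, kb, b, kb, v_count);
            #pragma omp for
            for (int b = 0; b < n_blocks; b++)
                if (b != kb) fw_block(d, b, kb, kb, v_count);
            /* The implied barrier above covers both loops, so the
               remaining blocks can now be updated safely */
            #pragma omp for collapse(2)
            for (int bi = 0; bi < n_blocks; bi++)
                for (int bj = 0; bj < n_blocks; bj++)
                    if (bi != kb && bj != kb)
                        fw_block(d, bi, bj, kb, v_count);
        }
    }
}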

3. Redistribute Workload
• Issue: The self-dependent block migrates as k varies, so the workload becomes unbalanced
• Using various scheduling options (guided, dynamic) decreased performance
• Hence, manually restructured the loops to balance the workload

4/5. Loop Coalescing
• 4. Parallelize the internal loops of the self-dependent block to eliminate serialization
• 5. Coalesce those loops, since the number of iterations per loop is small: coalescing the i and j loops gives OpenMP BLOCK_SIZE_SQ (= BLOCK_SIZE²) iterations to schedule instead of BLOCK_SIZE (compare the two versions below)

Normal:

#pragma omp for nowait
for (i = block_ly; i < (block_ly + BLOCK_SIZE); i++) {
    for (j = block_lx; j < (block_lx + BLOCK_SIZE); j++) {
        if ((i >= v_count) || (j >= v_count) || (k >= v_count))
            continue;
        if (submatrix[i][j] > (submatrix[i][k] + submatrix[k][j])) {
            submatrix[i][j] = submatrix[i][k] + submatrix[k][j];
            max_node[i][j] = k;
        }
    }
}

Coalesced:

for (k = start_k; k < (start_k + BLOCK_SIZE); k++) {
    #pragma omp for nowait
    for (ij = 0; ij < BLOCK_SIZE_SQ; ij++) {
        i = (ij / BLOCK_SIZE) + block_ly;   /* recover row index */
        j = (ij % BLOCK_SIZE) + block_lx;   /* recover column index */
        if ((i >= v_count) || (j >= v_count) || (k >= v_count))
            continue;
        if (submatrix[i][j] > (submatrix[i][k] + submatrix[k][j])) {
            submatrix[i][j] = submatrix[i][k] + submatrix[k][j];
            max_node[i][j] = k;
        }
    }
}

Parallel Dijkstra

graph0 = (vertex **) buildGraphFromFile(argv[1], LIST, &numberOfVertices);

// Able to parallelize the very outer loop; the compiler could not
// detect this due to the subroutine calls
#pragma omp parallel
{
    // Done once per thread: each of the X threads gets its own copy
    vertex ** graphX = copyGraph(graph0, numberOfVertices);
    #pragma omp for private(target)
    for (source = 0; source < numberOfVertices; source++) {
        for (target = 0; target < numberOfVertices; target++) {
            // Each thread runs Dijkstra3 on its private copy graphX
            vertex * VSource = returnVertex(graphX, source);
            vertex * VTarget = returnVertex(graphX, target);
            VSource->distance = 0;
            int dist = Dijkstra3(graphX, VSource, VTarget, VSource->number);
            initGraph(graphX, numberOfVertices);
        }
    }
}

Parallel Dijkstra

[Diagram: Build Graph, then each of X threads copies the graph and processes N/X single-source shortest paths]

• Outer loop parallelized; each thread executes Dijkstra's algorithm with N/X source vertices (X = # of cores)
• Each thread retains a copy of the graph to modify

Results - FW

[Chart: Floyd-Warshall speedup by program version – Input Graph 1]

• Graph 1
  • 493 vertices
  • 1189 edges
• Final speedup: 4.93 on 8 cores

Results - FW

[Chart: Floyd-Warshall speedup by program version – Input Graph 2]

• Graph 2
  • 767 vertices
  • 1795 edges
• Final speedup: 6.66 on 8 cores

Results - FW

[Chart: Floyd-Warshall speedup by program version – Input Graph 3]

• Graph 3
  • 5,242 vertices
  • 28,980 edges
• Final speedup: 7.66 on 8 cores

Results - FW

[Chart: Speedup vs. # of cores (1, 2, 4, 8) for Graphs 1, 2, and 3]

• More parallelism exploited for larger graphs

Results - Dijkstra

[Chart: Parallel Dijkstra speedup on 8 cores for Graphs 1, 2, and 3, all roughly between 7.66 and 7.82]

• Near-linear speedup due to the outer-loop parallelization
• As graph size increases, the graph build and copy overhead matters less

Future Work / Improvements
• Utilize MapReduce for huge graph input sets
• Convert Floyd-Warshall to MPI to deal with memory limits on one machine
• Port to a map API to view shortest-path information in a GUI
  • OpenStreetMap
• Add mechanisms to detect sparsity and negative edge weights, and call the appropriate routine

Questions?