Parallel All-Points Shortest Paths
ECE 563 - Spring 2013
Jason Holmes, Bharadwaj Krishnamurthy, Hector Rodriguez-Simmonds
Outline
• Overview
• Sequential Code Development
• Parallel Dijkstra
• Parallel Floyd-Warshall
• Results
Overview
• Tackled the All-Points Shortest Paths problem
• Constructed graphs from real data (social networks, road networks, etc.)
• Wrote a modification of Dijkstra's Algorithm
  • Better for sparse graphs
• Wrote the Floyd-Warshall dynamic programming algorithm
  • Less structural overhead
  • Can handle negative edge weights
• Developed parallel versions using OpenMP
  • Parallel Dijkstra: 7.6x speedup on 8 cores
  • Parallel Floyd-Warshall: ~6x speedup on 8 cores
Sequential Code - graphCreate

# Input File
<1, 2>
<1, 4>
<2, 5>
<3, 5>
<3, 6>
<4, 2>
<5, 4>
<6, 6>

Input Data → buildGraph(Dijkstra) → Adj. List
Input Data → buildGraph(FW) → Adj. Matrix
Sequential Code - Dijkstra

graph = (vertex **) buildGraphFromFile(argv[1], LIST, &numberOfVertices);
for (source = 0; source < numberOfVertices; source++) {
    for (target = 0; target < numberOfVertices; target++) {
        vertex * VSource = returnVertex(graph, source);
        vertex * VTarget = returnVertex(graph, target);
        VSource->distance = 0;
        int dist = Dijkstra3(graph, VSource, VTarget, VSource->number);
        initGraph(graph, numberOfVertices);
    }
}

Run Dijkstra's single-source algorithm V times
Sequential Code - FW
• Dynamic programming problem
• Find the shortest path from i to j using only intermediate nodes 1 to k-1
• Once k reaches the total number of nodes, we have the shortest path from i to j
Sequential Code - FW

edge ** FW_direct(edge ** matrix, int v_count)
{
    int i, j, k;
    edge ** max_node;
    max_node = malloc(v_count * sizeof(edge *));
    for (i = 0; i < v_count; i++) { …. }   /* row allocation elided on the slide */
    for (k = 1; k < v_count; k++) {
        for (j = 0; j < v_count; j++) {
            for (i = 0; i < v_count; i++) {
                if (matrix[i][j] > matrix[i][k] + matrix[k][j]) {
                    matrix[i][j] = matrix[i][k] + matrix[k][j];
                    max_node[i][j] = k;
                }
            }
        }
    }
    return (max_node);
}
The k loop cannot be parallelized!
Bad Parallelization
[Figure: i × j distance matrix with K = 5, columns split across CORE 0 – CORE 3]
Sequential Code - FW
• Change the algorithm – use smaller blocks and deal with dependencies
Parallel Floyd-Warshall
• Transformations
  1. Parallel with tuned blocks
  2. Restructured parallel with nowait
  3. Manual balancing of workload distribution
  4. Parallelized computation of self-dependent block
  5. Loop-coalesced version of previous transformation
Parallel Floyd-Warshall
[Figures: two slides showing the blocked i × j matrix decomposition]
1. Parallel With Tuned Blocks
• Transformed from naïve OpenMP directives
• Large block sizes reduce the number of independent blocks that can run in parallel
• Small block sizes cut down on the number of computations per block
• Optimum block size found to be ~20x20
  • This is somewhat graph-size dependent
2. Restructured with NOWAIT
• Issue: Many separate loops can run in parallel for processing different block types
• Most for loops combined into one OMP parallel construct
  • Eliminates multiple fork/join (wakeup/sleep) operations
• Intermediate serial sections handled by OpenMP master
• NOWAIT clause added to loops where correctness would not be violated
3. Redistribute Workload
• Issue: The self-dependent block migrates as k varies, so the workload becomes unbalanced
• Using various scheduling options (guided, dynamic) decreased performance
• Hence, manually restructured the loops to balance the workload
4/5. Loop Coalescing
• 4. Parallelize internal loops of self-dependent blocks to eliminate serialization
• 5. Coalesce loops, as the number of iterations is small
Normal:

#pragma omp for nowait
for (i = block_ly; i < (block_ly + BLOCK_SIZE); i++) {
    for (j = block_lx; j < (block_lx + BLOCK_SIZE); j++) {
        if ((i >= v_count) || (j >= v_count) || (k >= v_count))
            continue;
        if (submatrix[i][j] > (submatrix[i][k] + submatrix[k][j])) {
            submatrix[i][j] = submatrix[i][k] + submatrix[k][j];
            max_node[i][j] = k;
        }
    }
}

Coalesced:

for (k = start_k; k < (start_k + BLOCK_SIZE); k++) {
    #pragma omp for nowait
    for (ij = 0; ij < BLOCK_SIZE_SQ; ij++) {
        i = (ij / BLOCK_SIZE) + block_ly;
        j = (ij % BLOCK_SIZE) + block_lx;
        if ((i >= v_count) || (j >= v_count) || (k >= v_count))
            continue;
        if (submatrix[i][j] > (submatrix[i][k] + submatrix[k][j])) {
            submatrix[i][j] = submatrix[i][k] + submatrix[k][j];
            max_node[i][j] = k;
        }
    }
}
Parallel Dijkstra

graph0 = (vertex **) buildGraphFromFile(argv[1], LIST, &numberOfVertices);
// Able to parallelize the very outer loop; the compiler could not detect this due to subroutine calls
#pragma omp parallel
{
    vertex ** graphX = copyGraph(graph0, numberOfVertices); // Done X times for X threads
    #pragma omp for private(target)
    for (source = 0; source < numberOfVertices; source++) {
        for (target = 0; target < numberOfVertices; target++) {
            // Each thread works on its own private copy graphX
            vertex * VSource = returnVertex(graphX, source);
            vertex * VTarget = returnVertex(graphX, target);
            VSource->distance = 0;
            int dist = Dijkstra3(graphX, VSource, VTarget, VSource->number);
            initGraph(graphX, numberOfVertices);
        }
    }
}
Parallel Dijkstra
[Figure: Build Graph, then each of X threads does Copy Graph and processes N/X single-source shortest paths]
• Outer loop parallelized; each thread executes Dijkstra's algorithm for N/X source vertices (X = number of cores)
• Each thread retains a copy of the graph to modify
Results - FW
[Chart: Floyd-Warshall Speedup – Input Graph 1; speedup (0–5) by program version]
• Graph 1
  • 493 vertices
  • 1,189 edges
• Final Speedup: 4.93 on 8 cores
Results - FW
[Chart: Floyd-Warshall Speedup – Input Graph 2; speedup (0–7) by program version]
• Graph 2
  • 767 vertices
  • 1,795 edges
• Final Speedup: 6.66 on 8 cores
Results - FW
[Chart: Floyd-Warshall Speedup – Input Graph 3; speedup (0–8) by program version]
• Graph 3
  • 5,242 vertices
  • 28,980 edges
• Final Speedup: 7.66 on 8 cores
Results - FW
[Chart: Speedup vs. # of Cores (1, 2, 4, 8) for Graph 1, Graph 2, Graph 3]
• More parallelism exploited for larger graphs
Results - Dijkstra
[Chart: Parallel Dijkstra Speedup on 8 Cores, roughly 7.66–7.82 across Graph 1, Graph 2, Graph 3]
• Near-linear speedup due to outer-loop parallelization
• As graph size increases, the graph build and copy overhead matters less
Future Work / Improvements
• Utilize MapReduce for huge graph input sets
• Convert Floyd-Warshall to MPI to deal with memory limits on one machine
• Port to a map API to view shortest-path information in a GUI
  • OpenStreetMap
• Add mechanisms to detect sparsity and negative edge weights, and call the appropriate routines
Questions?