
Optimizing Graph Algorithms for Improved Cache Performance

Aya Mire & Amir Nahir

Based on: Optimizing Graph Algorithms for Improved Cache Performance – Joon-Sang Park, Michael Penner, Viktor K. Prasanna

The Problem with Graphs…

Graph problems pose unique challenges to improving cache performance due to their irregular data access patterns.


Agenda

• A recursive implementation of the Floyd-Warshall Algorithm.

• A tiled implementation of the Floyd-Warshall Algorithm.

• Efficient data structures for general graph problems.

• Optimizations for the maximum matching algorithm.

Analysis model

• All proofs and complexity analysis are based on the I/O model, i.e., the goal of an improved algorithm is to minimize the number of CPU-memory transactions.

[Figure: the memory hierarchy CPU ↔ Cache ↔ Main Memory, with transfer costs labeled A, B and C; cost(A) ≪ cost(B) and cost(C) ≪ cost(B), i.e., the transactions that reach main memory (B) are the expensive ones.]

Analysis model

All proofs will assume total control of the cache, i.e., if the cache is big enough to hold two data blocks, then the two can be held in the cache without running over each other (no conflict misses).

The Floyd Warshall Algorithm

• An ‘all pairs shortest paths’ algorithm.
• Works by iteratively calculating D(k), where D(k) is the matrix of all pair shortest paths going through vertices 1, 2, …, k only.

• Each iteration depends on the result of the previous one.

• Time complexity: Θ(|V|³).

The Floyd Warshall Algorithm

Pseudo code:

  for k from 1 to |V|
    for i from 1 to |V|
      for j from 1 to |V|
        Di,j(k) ← min{ Di,j(k-1), Di,k(k-1) + Dk,j(k-1) }
  return D(|V|)
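As a concrete reference point, here is a minimal C sketch of this triple loop; the function name, the INF convention and the dense-array storage are our choices for illustration, not the paper's.

  #include <limits.h>

  #define INF (INT_MAX / 2)   /* "no edge"; halved so INF + INF cannot overflow */

  /* dist must be initialized with the edge weights (INF where there is
     no edge) and dist[i][i] = 0. The update is done in place, exactly as
     in the pseudo code above. */
  void floyd_warshall(int n, int dist[n][n])
  {
      for (int k = 0; k < n; k++)            /* intermediate vertex */
          for (int i = 0; i < n; i++)        /* source */
              for (int j = 0; j < n; j++) {  /* destination */
                  int via_k = dist[i][k] + dist[k][j];
                  if (via_k < dist[i][j])
                      dist[i][j] = via_k;
              }
  }

Note how iteration k reads row k and column k while writing the whole matrix; this is the access pattern the cache-friendly variants below reorganize.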

The Floyd Warshall Algorithm

The algorithm accesses the entire matrix in each iteration.

The dependency of the kth iteration on the results of the (k-1)th iteration eliminates the ability to perform data reuse.

Lemma 1

Suppose Di,j(k) is computed as

  Di,j(k) ← min{ Di,j(k-1), Di,k(k') + Dk,j(k'') }   for k-1 ≤ k', k'' ≤ |V|;

then upon termination the FW algorithm correctly computes the all pair shortest paths.

(The traditional computation is the special case k' = k'' = k-1:

  Di,j(k) ← min{ Di,j(k-1), Di,k(k-1) + Dk,j(k-1) }.)

Lemma 1 - Proof

To distinguish from the traditional FW algorithm, we'll use Ti,j(k) to denote the results calculated using the “new” computation:

  Ti,j(k) ← min{ Ti,j(k-1), Ti,k(k') + Tk,j(k'') }   for k-1 ≤ k', k'' ≤ |V|.


Lemma 1 - Proof

First, we'll show that the following inequality holds for 1 ≤ k ≤ |V|:

  Ti,j(k) ≤ Di,j(k)

We prove this by induction.

Base case: by definition we have Ti,j(0) = Di,j(0).

Lemma 1 - Proof

Induction step: suppose Ti,j(k) ≤ Di,j(k) holds for k = m-1. Then:

  Ti,j(m) ← min{ Ti,j(m-1), Ti,m(m') + Tm,j(m'') }
          ≤ min{ Di,j(m-1), Ti,m(m') + Tm,j(m'') }      (by the induction hypothesis)
          ≤ min{ Di,j(m-1), Ti,m(m-1) + Tm,j(m-1) }     (limiting the choices for intermediate vertices makes a path the same or longer)
          ≤ min{ Di,j(m-1), Di,m(m-1) + Dm,j(m-1) }     (by the induction hypothesis)
          = Di,j(m)                                     (by definition)

Hence Ti,j(k) ≤ Di,j(k) for 1 ≤ k ≤ |V|.

Lemma 1 - Proof

On the other hand, since the traditional algorithm computes the shortest paths at termination, and since Ti,j(|V|) is the length of some path, we have:

  Di,j(|V|) ≤ Ti,j(|V|)

Together with Ti,j(k) ≤ Di,j(k) for 1 ≤ k ≤ |V|, this gives Di,j(|V|) = Ti,j(|V|). ∎


FW’s Algorithm – Recursive Implementation

We first consider the basic case of a two-node graph.

[Figure: a two-vertex graph with edge weights w1 (1 → 2) and w2 (2 → 1), and its 2×2 distance matrix with w1, w2 off the diagonal.]

Floyd-Warshall (T):   (base case; the first four updates use vertex 1 as the intermediate vertex, the last four use vertex 2)
  T11 = min{ T11, T11 + T11 }
  T12 = min{ T12, T11 + T12 }
  T21 = min{ T21, T21 + T11 }
  T22 = min{ T22, T21 + T12 }
  T22 = min{ T22, T22 + T22 }
  T21 = min{ T21, T22 + T21 }
  T12 = min{ T12, T12 + T22 }
  T11 = min{ T11, T12 + T21 }

FW’s Algorithm – Recursive Implementation

The general case: the matrix is divided into four quadrants,

  I   II
  III IV

Floyd-Warshall (T):
  if (not base case)
    TI   = min(TI,   TI,   TI)
    TII  = min(TII,  TI,   TII)
    TIII = min(TIII, TIII, TI)
    TIV  = min(TIV,  TIII, TII)
    TIV  = min(TIV,  TIV,  TIV)
    TIII = min(TIII, TIV,  TIII)
    TII  = min(TII,  TII,  TIV)
    TI   = min(TI,   TII,  TIII)
  else
    … (the 2×2 base case above)

Here 'TX = min(TX, TY, TZ)' denotes a recursive call that updates quadrant TX using quadrants TY and TZ (see the C sketch below).
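A possible C rendering of this recursion is sketched below. It uses the generalized three-argument form FWR(A, B, C), which performs A[i][j] = min(A[i][j], B[i][k] + C[k][j]) in Floyd-Warshall order, so Floyd-Warshall(T) corresponds to the top-level call fwr(T, T, T, n, n). For readability the sketch addresses quadrants inside a plain row-major array via an explicit stride, rather than using the Z-Morton layout discussed later; n is assumed to be a power of two, and all names are ours.

  /* pointer to quadrant (qi, qj) of the n x n block starting at m;
     quadrants are numbered I=(0,0), II=(0,1), III=(1,0), IV=(1,1) */
  static int *quad(int *m, int stride, int n, int qi, int qj)
  {
      return m + qi * (n / 2) * stride + qj * (n / 2);
  }

  enum { BASE = 1 };   /* recursion cutoff; see the cache-conscious variant later */

  /* A = min(A, B "x" C) over (min, +), with FW-style ordering */
  void fwr(int *A, int *B, int *C, int n, int stride)
  {
      if (n <= BASE) {
          for (int k = 0; k < n; k++)
              for (int i = 0; i < n; i++)
                  for (int j = 0; j < n; j++) {
                      int v = B[i * stride + k] + C[k * stride + j];
                      if (v < A[i * stride + j])
                          A[i * stride + j] = v;
                  }
          return;
      }
      int h = n / 2;
      int *A1 = quad(A, stride, n, 0, 0), *A2 = quad(A, stride, n, 0, 1),
          *A3 = quad(A, stride, n, 1, 0), *A4 = quad(A, stride, n, 1, 1);
      int *B1 = quad(B, stride, n, 0, 0), *B2 = quad(B, stride, n, 0, 1),
          *B3 = quad(B, stride, n, 1, 0), *B4 = quad(B, stride, n, 1, 1);
      int *C1 = quad(C, stride, n, 0, 0), *C2 = quad(C, stride, n, 0, 1),
          *C3 = quad(C, stride, n, 1, 0), *C4 = quad(C, stride, n, 1, 1);
      fwr(A1, B1, C1, h, stride);   /* TI   = min(TI,   TI,   TI)   */
      fwr(A2, B1, C2, h, stride);   /* TII  = min(TII,  TI,   TII)  */
      fwr(A3, B3, C1, h, stride);   /* TIII = min(TIII, TIII, TI)   */
      fwr(A4, B3, C2, h, stride);   /* TIV  = min(TIV,  TIII, TII)  */
      fwr(A4, B4, C4, h, stride);   /* TIV  = min(TIV,  TIV,  TIV)  */
      fwr(A3, B4, C3, h, stride);   /* TIII = min(TIII, TIV,  TIII) */
      fwr(A2, B2, C4, h, stride);   /* TII  = min(TII,  TII,  TIV)  */
      fwr(A1, B2, C3, h, stride);   /* TI   = min(TI,   TII,  TIII) */
  }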

FW's Recursive Algorithm – Correctness

It can be shown that for each update

  Di,j(k) ← min{ Di,j(k-1), Di,k(k-1) + Dk,j(k-1) }

in FW's traditional implementation, there is a corresponding update

  Ti,j(k) ← min{ Ti,j(k-1), Ti,k(k') + Tk,j(k'') },   where k-1 ≤ k', k'' ≤ |V|,

in the recursive implementation. Hence the algorithm's correctness follows from Lemma 1.

FW’s Recursive Algorithm – How does it actually work…

[Figure: the recursion's progress on the four quadrants. Starting from T(0), the first four calls bring the quadrants up to date with respect to the first |V|/2 intermediate vertices (TI(|V|/2), TII(|V|/2), TIII(|V|/2), TIV(|V|/2)); the remaining four calls complete the computation with respect to all |V| vertices, yielding TI(|V|), TII(|V|), TIII(|V|), TIV(|V|).]

FW’s Recursive Algorithm - Example

[Figure: an 8-vertex example graph with edge weights and its initial 8×8 distance matrix.]

FW’s Recursive Algorithm – Example

[Figure: the first recursive steps of the example; entries improve via the paths 1-3-4 and 7-6-8.]

FW’s Recursive Algorithm – Example

[Figure: later recursive steps of the example; further entries improve via paths such as 7-2-4, 7-2-8, 7-6-4, 7-6-5, 1-6-4, 1-6-5, 1-6-8, 2-6-4 and 2-6-5.]

Representing the Matrix in an efficient way

We usually store matrices in the memory in one of two ways, illustrated here for a 4×4 matrix whose elements are numbered 0–15 in row-major order:

  Row-major layout:    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  Column-major layout: 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15

Using either of these layouts will not improve performance, since the algorithm breaks the matrix into quadrants.

Representing the Matrix in an efficient way

The Z-Morton layout: perform the following operations recursively, until the quadrant size is a single data unit:

  divide the matrix into four quadrants;
  store quadrants I, II, III, IV contiguously, in that order.

For example, for the 4×4 matrix above:

  Z-Morton layout: 0 1 4 5 2 3 6 7 8 9 12 13 10 11 14 15
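When the matrix dimension is a power of two, the Z-Morton position of element (row, col) can be computed directly by interleaving the bits of the two indices. A small C sketch (our naming):

  /* Column bits go to even bit positions and row bits to odd ones, which
     reproduces the recursive I, II, III, IV quadrant order. For the 4x4
     example above: zmorton(0, 1) = 1 and zmorton(1, 0) = 2. */
  unsigned zmorton(unsigned row, unsigned col)
  {
      unsigned idx = 0;
      for (unsigned b = 0; b < 16; b++) {
          idx |= ((col >> b) & 1u) << (2 * b);
          idx |= ((row >> b) & 1u) << (2 * b + 1);
      }
      return idx;
  }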

Complexity Analysis

The running time of the algorithm is given by the recurrence T(|V|) = 8·T(|V|/2) + O(1), i.e., T(|V|) = Θ(|V|³).

Without considering the cache, the number of CPU-memory transactions is exactly the running time.

Complexity Analysis - Theorem

There exists some B, where B = O(|cache|^(1/2)), such that, when using the recursive implementation with the matrix stored in the Z-Morton layout, the number of CPU-memory transactions is reduced by a factor of B.

⇒ there will be O(|V|³/B) CPU-memory transactions.

Complexity Analysis

After k recursive calls, the size of a quadrant's dimension is |V|/2^k.

There exists some k such that B ≜ |V|/2^k and 3·B² ≤ |cache|.

Once this condition is fulfilled, three matrices of size B² can be placed in the cache at once, and no further CPU-memory transactions are required while solving a B-size subproblem.

⇒ B = O(|cache|^(1/2))

Complexity Analysis

Therefore we get:

  O((|V|/B)³) · O(B²) = O(|V|³/B)

where O((|V|/B)³) is the transaction complexity of FW when the matrix dimension is |V|/B and there is no cache, and O(B²) is the number of transactions required to bring a B×B quadrant into the cache.

⇒ the number of CPU-memory transactions is reduced by a factor of B.

Complexity Analysis – lower bound

In “I/O Complexity: The Red-Blue Pebble Game”, J.-W. Hong and H. T. Kung have shown that the lower bound on CPU-memory transactions for multiplying matrices is Ω(N³/B), where B = O(|cache|^(1/2)).

Complexity Analysis – lower bound – Theorem

The lower bound on CPU-memory transactions for the Floyd-Warshall algorithm is Ω(|V|³/B), where B = O(|cache|^(1/2)).

Proof: by reduction

Complexity Analysis – lower bound theorem - Proof

Matrix multiplication,

  for k from 1 to N
    for i from 1 to N
      for j from 1 to N
        Ck,i += Ak,j · Bj,i

has the same loop structure and data access pattern as FW's update on |V| × |V| matrices,

  Di,j(k) ← min{ Di,j(k-1), Di,k(k-1) + Dk,j(k-1) }

(matrix multiplication is the same computation over the semiring (+, ·) instead of (min, +)). Hence the Ω(N³/B) lower bound for matrix multiplication carries over to FW with N = |V|.

Complexity Analysis - Conclusion

The algorithm's complexity: O(|V|³/B).
Lower bound for FW: Ω(|V|³/B).

⇒ The recursive implementation is asymptotically optimal among all implementations of the Floyd-Warshall algorithm (with respect to CPU-memory transactions).

FW’s Algorithm – Recursive Implementation - Comments

Note that the size of the cache is not one of the algorithm's parameters, nor is it needed in order to store the matrix in the Z-Morton layout.

Therefore: the algorithm is cache-oblivious.

FW’s Algorithm – Recursive Implementation - Comments

Though the analysis model included only a single level of cache, no special attributes of the cache were assumed, so the proofs generalize to multiple levels of cache.

[Figure: a multi-level hierarchy: L0 cache, L1 cache, L2 cache, main memory.]

FW’s Algorithm – Recursive Implementation - Comments

Since the cache parameters are disregarded by the algorithm, the best (and simplest) way to find the optimal block size B is by experiment.

FW's Algorithm – Recursive Implementation - Improvement

The algorithm can be further improved by making it cache-conscious: performing the recursive calls only until the problem size is reduced to B, and solving the B-size problem in the traditional way (this saves the overhead of the deeper recursive calls).

This modification showed up to a 2× improvement in running time on some of the machines.
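In the recursive C sketch shown earlier, this cache-conscious variant amounts to nothing more than raising the recursion cutoff, so that a B×B block is solved by the plain triple loop:

  enum { BASE = 64 };   /* was 1; set to the experimentally best B */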

FW’s Algorithm – Tiled Implementation

Consider the special case of Lemma 1 in which k', k'' are restricted such that

  k-1 ≤ k', k'' ≤ k-1+B,   where 3·B² ≤ |cache|   (B = O(|cache|^(1/2))).

This leads to the following tiled implementation of FW's algorithm.

FW’s Algorithm – Tiled Implementation

Divide the matrix into B×B tiles.
Perform |V|/B iterations; during the tth iteration:
  I.   update the (t,t)th tile;
  II.  update the remainder of the tth tile-row and tth tile-column;
  III. update the rest of the matrix.

FW’s Algorithm – Tiled Implementation

Each iteration consists of three phases.

Phase I: performing FW's algorithm on the (t,t)th tile (which is self-dependent).

FW’s Algorithm – Tiled Implementation

Phase II: updating the remainder of tile-row t:

  Ai,j(k) ← min{ Ai,j(k-1), Ai,k(tB) + Ak,j(k-1) }

and updating the remainder of tile-column t:

  Ai,j(k) ← min{ Ai,j(k-1), Ai,k(k-1) + Ak,j(tB) }

During the tth iteration, k goes from (t-1)·B+1 to t·B.

FW’s Algorithm – Tiled Implementation

Phase III: updating the rest of the matrix:

  Ai,j(k) ← min{ Ai,j(k-1), Ai,k(tB) + Ak,j(tB) }

As in phase II, during the tth iteration k goes from (t-1)·B+1 to t·B. A C sketch of all three phases follows.
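Putting the three phases together, the tiled algorithm might look as follows in C. All names are ours; for readability, tiles are addressed inside a plain row-major array (the slides store tiles contiguously in the Z-Morton layout, as described next), and n is assumed to be a multiple of the tile size TB.

  enum { TB = 64 };   /* tile dimension B; choose so that 3*TB*TB fits in cache */

  /* FW-ordered update of the tile at (ai,aj) using the tiles at (bi,bj)
     and (ci,cj): a[i][j] = min(a[i][j], b[i][k] + c[k][j]) for k in the tile. */
  static void fw_tile(int n, int dist[n][n],
                      int ai, int aj, int bi, int bj, int ci, int cj)
  {
      for (int k = 0; k < TB; k++)
          for (int i = 0; i < TB; i++)
              for (int j = 0; j < TB; j++) {
                  int v = dist[bi + i][bj + k] + dist[ci + k][cj + j];
                  if (v < dist[ai + i][aj + j])
                      dist[ai + i][aj + j] = v;
              }
  }

  void fw_tiled(int n, int dist[n][n])
  {
      for (int t = 0; t < n / TB; t++) {
          int o = t * TB;                        /* origin of the (t,t) tile */
          fw_tile(n, dist, o, o, o, o, o, o);    /* phase I: self-dependent tile */
          for (int m = 0; m < n / TB; m++) {     /* phase II: row t and column t */
              if (m == t) continue;
              int p = m * TB;
              fw_tile(n, dist, o, p, o, o, o, p);   /* tile (t,m) */
              fw_tile(n, dist, p, o, p, o, o, o);   /* tile (m,t) */
          }
          for (int i = 0; i < n / TB; i++)       /* phase III: everything else */
              for (int j = 0; j < n / TB; j++) {
                  if (i == t || j == t) continue;
                  fw_tile(n, dist, i * TB, j * TB, i * TB, o, o, j * TB);
              }
      }
  }

The same kernel serves all three phases; only which tiles play the roles of A, B and C changes, and each call touches at most three tiles, which is why 3·B² ≤ |cache| suffices.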

FW’s Algorithm – Tiled Example

[Figure: the same 8-vertex example graph and its initial 8×8 distance matrix, now processed in B×B tiles.]

FW’s Algorithm – Tiled Example

[Figure: the first step of the tiled example; e.g., an entry improves via the path 7-2-8.]

FW’s Algorithm – Tiled Example

[Figure: the next step of the tiled example; e.g., an entry improves via the path 1-3-4.]

FW’s Algorithm – Tiled Example

[Figure: updating the rest of the matrix; entries improve via paths through vertex 6, such as 1-6-4, 1-6-5, 1-6-8, 2-6-4, 2-6-5, 7-6-4 and 7-6-5.]

FW’s Algorithm – Tiled Example

[Figure: the matrix at the end of the iteration.]

Representing the Matrix in an efficient way

In order to match the data access pattern, a tile must be stored in contiguous memory.

Therefore, the Z-Morton layout is used here as well:

  Z-Morton layout: 0 1 4 5 2 3 6 7 8 9 12 13 10 11 14 15

FW's Tiled Algorithm – Correctness

Let Di,j(k) be the result of the kth iteration of the traditional FW implementation.

Even though Di,j(k) and Ai,j(k) may not be equal during the “inner” iterations, it can be shown by induction that at the end of each iteration Di,j(k) = Ai,j(k) (where k = t·B).

Complexity Analysis - Theorem

There exists some B, where B = O(|cache|^(1/2)), such that, when using the tiled implementation, the number of CPU-memory transactions is reduced by a factor of B.

⇒ there will be O(|V|³/B) CPU-memory transactions.

Complexity Analysis

There are (|V|/B) × (|V|/B) tiles in the matrix.

There are |V|/B iterations in the algorithm; in each iteration, all tiles are accessed.

Updating a tile requires holding at most 3 tiles in the cache ⇒ 3·B² ≤ |cache|.

Complexity Analysis

Therefore we get:

  (|V|/B) · [(|V|/B) × (|V|/B)] · O(B²) = O(|V|³/B)

where |V|/B is the number of iterations, (|V|/B) × (|V|/B) is the number of tiles in the matrix, and O(B²) is the number of transactions required to bring a B×B tile into the cache.

⇒ the number of CPU-memory transactions is reduced by a factor of B.

Complexity Analysis - Conclusion

The algorithm's complexity: O(|V|³/B).
Lower bound for FW: Ω(|V|³/B).

⇒ The tiled implementation is asymptotically optimal among all implementations of the Floyd-Warshall algorithm (with respect to CPU-memory transactions).

FW’s Algorithm – Tiled Implementation - Comments

Note that, when using the tiling method, the size of the cache is one of the algorithm's parameters.

Therefore: the tiled algorithm is cache-conscious.

FW’s Algorithm – Tiled Implementation - Comments

Still, the best (and simplest) way to find the optimal tile size B is by experiment.

FW’s Algorithm – experimental results

Both algorithms (recursive and tiled) showed a 30% improvement in L1 cache misses and a 50% improvement in L2 cache misses for problem sizes of 1024 and 2048 vertices.

The results for the two algorithms are nearly identical (less than 1% difference).

Dijkstra's Algorithm for Single Source Shortest Paths & Prim's Algorithm for Minimum Spanning Tree

Dijkstra's algorithm:
  S ← ∅
  Q ← V
  while Q ≠ ∅
    u ← extract-min(Q)
    S ← S ∪ {u}
    for each v ∈ adj(u)
      update d[v]
  return S

Prim's algorithm:
  Q ← V
  for each u ∈ Q do key(u) ← ∞
  key(root) ← 0
  while Q ≠ ∅
    u ← extract-min(Q)
    for each v ∈ adj(u)
      if v ∈ Q and weight(u,v) < key(v) then key(v) ← weight(u,v)

Both Algorithms have the same data access pattern

Graph representation

There are two commonly used graph representations.

The adjacency matrix: A(i,j) = the cost of the edge from vertex i to vertex j.

Elements are accessed in an adjacent (sequential) fashion.

Representation size: O(|V|²).

Graph representation

The adjacency list representation: a pointer-based representation where a list of adjacent vertices is stored for each vertex in the graph; each node in the list holds the cost of the edge from the given vertex to the adjacent vertex.

Representation size: O(|V| + |E|).

Pointer-based representation leads to cache pollution.

Adjacency Array representation

For each vertex in the graph, there exists a packed array of its adjacent vertices.

Representation size: O(|V| + |E|).

Elements are accessed in an adjacent (sequential) fashion.

[Figure: an index 1, 2, 3, …, |V| pointing into per-vertex arrays of (vertex, cost) pairs vi wi, vj wj, ….]
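A minimal C sketch of this representation (graph_t, edge_t and the field names are ours):

  typedef struct { int v; int w; } edge_t;   /* adjacent vertex and edge cost */

  typedef struct {
      int      nv;    /* number of vertices */
      int     *deg;   /* deg[u]: number of neighbors of u */
      edge_t **adj;   /* adj[u]: packed array of deg[u] (vertex, cost) pairs */
  } graph_t;

  /* The neighbor scan at the heart of Dijkstra/Prim/BFS: the entries
     adj[u][0..deg[u]-1] are read sequentially, which is cache-friendly,
     whereas an adjacency list would chase one pointer per edge. */
  void scan_neighbors(const graph_t *g, int u, void (*visit)(int v, int w))
  {
      for (int i = 0; i < g->deg[u]; i++)
          visit(g->adj[u][i].v, g->adj[u][i].w);
  }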

Matching Algorithm for Bipartite Graph

Matching: a set M of edges in a graph is a matching if no vertex of the graph is an endpoint of more than one edge in M.

A matching is considered maximum if it is larger than any other matching.

[Figure: a four-vertex bipartite example in which 1–4 alone is a maximal matching, while 1–3, 2–4 is a maximum matching.]

Matching Algorithm for Bipartite Graph

Let M be a matching. All edges in the graph are divided into two groups: matching edges and non-matching edges.

A vertex is called free if it is not an endpoint of any matching edge.

Matching Algorithm for Bipartite Graph

A path P = u0, e1, u1, …, ek, uk is called an augmenting path (with respect to M) if:
  - u0 and uk are free;
  - the even-numbered edges e2, e4, …, ek-1 are matching edges.

The set of edges M \ {e2, e4, …, ek-1} ∪ {e1, e3, …, ek} is also a matching, and it has one more edge than M has. So, if we find an augmenting path, we can construct a larger matching.

Finding Augmenting paths in a Bipartite Graph

In bipartite graphs, each augmenting path has one end in A and one end in B. Following such a path starting from its end in A, we traverse non-matching edges from A to B and matching edges from B to A.

By turning the graph into a directed graph (all matching edges are directed vB → vA, all the rest vA → vB), we turn the problem into a simple path-finding problem in a directed graph.

Matching Algorithm for Bipartite Graph

The algorithm:

  while (there exists an augmenting path P)
    increase |M| by one using P
  return M

Algorithm's complexity: O(|V|·|E|).
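As an illustration, here is a compact C sketch of this loop, searching for augmenting paths with a DFS (the slides use BFS; either search preserves the O(|V|·|E|) bound). It reuses the graph_t adjacency-array type from the earlier sketch: vertices 0..na-1 form side A, adj[] is indexed by A-vertices with B-side neighbors numbered 0..nb-1, and match_b[v] holds the A-vertex matched to v, or -1 if v is free.

  #include <stdlib.h>
  #include <string.h>
  #include <stdbool.h>

  /* Try to find an augmenting path starting at the A-vertex u; if one
     exists, flip its edges by rewriting match_b along the path. */
  static bool augment(const graph_t *g, int u, int *match_b, bool *seen)
  {
      for (int i = 0; i < g->deg[u]; i++) {
          int v = g->adj[u][i].v;              /* B-side neighbor */
          if (seen[v]) continue;
          seen[v] = true;
          /* v is free, or v's current partner can be re-matched elsewhere */
          if (match_b[v] < 0 || augment(g, match_b[v], match_b, seen)) {
              match_b[v] = u;                  /* (u,v) becomes a matching edge */
              return true;
          }
      }
      return false;
  }

  int max_matching(const graph_t *g, int na, int nb, int *match_b)
  {
      bool *seen = malloc(nb * sizeof *seen);
      int size = 0;
      memset(match_b, 0xFF, nb * sizeof *match_b);   /* -1: all of B starts free */
      for (int u = 0; u < na; u++) {                 /* one search per A-vertex */
          memset(seen, 0, nb * sizeof *seen);
          if (augment(g, u, match_b, seen)) size++;
      }
      free(seen);
      return size;
  }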

Matching Algorithm for Bipartite Graph – first optimization

In order to find augmenting paths, we use the BFS algorithm, which has a data access pattern similar to that of Dijkstra/Prim.

Therefore, using the adjacency array instead of the adjacency list/matrix improves running time.

Matching Algorithm for Bipartite Graph – second optimization

We try to reduce the size of the working set, as in tiling:
  I.   Partition G into g[1], g[2], …, g[p].
  II.  Find the maximum matching in g[i] for each i ∈ {1, 2, …, p} using the basic algorithm.
  III. Unite all sub-matchings into M.
  IV.  Find the maximum matching in G using the basic algorithm (starting with M).

Matching Algorithm for Bipartite Graph – second optimization

If the sizes of the sub-graphs are chosen appropriately, so that each fits into the cache, phase II incurs only O(|V| + |E|) CPU-memory transactions, because each data element needs to be loaded into the cache only once.

The best size for a sub-graph is found by experiment.

Matching Algorithm for Bipartite Graph – best case

In the best case, the maximum matching is already found in phase II, and the algorithm's CPU-memory transaction complexity is O(|V| + |E|).

That leaves us with the problem of partitioning the graph optimally.

Partitioning the Bipartite Graph

The goal: to partition the edges into two groups such that the best matching possible is found within each group.

Algorithm (see the sketch below):
  I.   Arbitrarily partition the vertices into 4 equal parts.
  II.  Count the number of edges between each pair of parts.
  III. Combine the parts into two partitions such that as many “internal” edges as possible are created.
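A small C sketch of this heuristic (our naming; edges are given as endpoint pairs, and 'vertex id mod 4' stands in for the arbitrary initial 4-way partition):

  typedef struct { int u, v; } epair_t;

  /* Assigns each vertex to side 0 or 1. Within-part edges are internal
     under every pairing, so it suffices to pick the pairing of parts
     that turns the most cross-part edges into internal ones. */
  void partition2(int nv, int ne, const epair_t *edges, int *side)
  {
      long cnt[4][4] = {{0}};
      for (int i = 0; i < ne; i++) {
          int a = edges[i].u % 4, b = edges[i].v % 4;   /* steps I + II */
          cnt[a][b]++;
          cnt[b][a]++;
      }
      /* step III: the three ways to merge four parts into two sides */
      long s1 = cnt[0][1] + cnt[2][3];   /* sides {0,1} | {2,3} */
      long s2 = cnt[0][2] + cnt[1][3];   /* sides {0,2} | {1,3} */
      long s3 = cnt[0][3] + cnt[1][2];   /* sides {0,3} | {1,2} */
      int mate = (s1 >= s2 && s1 >= s3) ? 1 : (s2 >= s3) ? 2 : 3;
      for (int x = 0; x < nv; x++)
          side[x] = (x % 4 == 0 || x % 4 == mate) ? 0 : 1;
  }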

Conclusions

Using efficient data representation methods can substantially improve an algorithm's running time.

Further improvement can be achieved by methods such as tiling and recursion.

Other graph algorithms, such as Bellman-Ford, BFS and DFS, can be improved by the same techniques, because of their similar data access patterns.