
Page 1

BFS algorithm for large graph stored in external memory

Philippe Giabbanelli CMPT 880 – Spring 2008

Page 2

• A bit of theory

• Main algorithms used to solve this problem until now

• How to design better algorithms and/or implementations

• Experiments and results

This article is not an introduction to a topic but a follow-up to numerous works (algorithms, implementations…). Hence, I will begin by defining the background and the models, so that we get an idea of the problems to be faced. Then, we can move on to the algorithms used so far.

Page 3

• First, let me remind you how Breadth-First Search (BFS) works…

• Starting from a given root, the graph is decomposed into layers of vertices. The vertices in one layer are the neighbours of the vertices in the previous layer (provided they have not appeared in an earlier layer).

• Although it is a basic algorithm, it is used in many important situations: collecting data from the Web, state-space exploration in model checking…
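To keep the picture concrete, here is a minimal in-memory sketch of this layer-by-layer decomposition (the adjacency-list representation and the tiny example graph are assumptions for illustration, not the paper's code):

```cpp
#include <cstdio>
#include <vector>

// Decompose a graph into BFS layers from a root. Layer 0 is {root};
// layer i+1 contains the unvisited neighbours of the vertices in layer i.
std::vector<std::vector<int>> bfs_layers(
    const std::vector<std::vector<int>>& adj, int root) {
    std::vector<bool> visited(adj.size(), false);
    std::vector<std::vector<int>> layers{{root}};
    visited[root] = true;
    while (true) {
        std::vector<int> next;
        for (int u : layers.back())          // expand the current layer
            for (int v : adj[u])             // scan u's adjacency list
                if (!visited[v]) {           // keep only new vertices
                    visited[v] = true;
                    next.push_back(v);
                }
        if (next.empty()) break;             // no further layer
        layers.push_back(std::move(next));
    }
    return layers;
}

int main() {
    // Small example: a path 0-1-2 plus an extra edge 1-3.
    std::vector<std::vector<int>> adj{{1}, {0, 2, 3}, {1}, {1}};
    auto layers = bfs_layers(adj, 0);
    for (std::size_t i = 0; i < layers.size(); ++i) {
        std::printf("layer %zu:", i);
        for (int v : layers[i]) std::printf(" %d", v);
        std::printf("\n");
    }
}
```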

Page 4

• When we want to run a Breadth-First Traversal on a massive graph (a billion edges), doing so efficiently is not trivial.

• Nowadays we have good processors, with clock rates measured in GHz… but in the meantime, hard-disk latency is on the order of milliseconds!

• In other words, instructions are cheap, but an Input/Output operation costs a million times more. Thus, the cost of I/O operations dominates the overall cost of a BFS traversal for large graphs, and basic implementations are not viable anymore.

• So, what can we do?

∙ As instructions are much cheaper than I/O operations, we can store the graph in a compact way and spend some instructions converting this ‘compact representation’ so that we can work on it.

∙ Include I/O operations in the cost metric of the model used to analyse the algorithm.

Page 5

Compact representation of a Graph

• If the graph is random or sparse, there isn’t much we can save.

• However, most graphs that arise in practice are neither random nor sparse: they are dense, with structural properties (scale-free, small-world, hierarchical, …).

• One widely used basic structural property is the existence of separators. Informally, a separator is a set of vertices such that, if we delete them, we can partition the graph into subgraphs of almost equal size.

Definition. Let G = (V, E) be a digraph. A separator S is a subset of V that divides V into A1 and A2 such that A1 ∪ S ∪ A2 = V and there are no edges between A1 and A2.

• Lipton & Tarjan’s Theorem. For any graph G with n vertices, there is a separation (A1, S, A2) where neither A1 nor A2 contains more than αn vertices (α < 1) and S contains no more than βf(n) vertices (β > 0).

Thus, separator theorems are expressed with the constants α and β.

Page 6

Compact representation of a Graph

• In a random graph, the expected separator size is too big. However, it is O(√n) for planar graphs, which makes it feasible to compress planar graphs (hence lots of research on the topic).

• In scale-free networks, a few vertices maintain the connectivity of the whole network, thus the separator size is small. As many networks are scale-free, we can apply this compression approach in a good range of situations.

• A data structure based on separators can work by recursively separating the graph, then numbering one subgraph, the separator, and the other subgraph. The representation needs O(n) bits, and supports constant-time adjacency/degree queries and listing of neighbours.

If you want to know more, an article on the topic is Compact Representations of Separable Graphs, by Blandford, Blelloch and Kash (Carnegie Mellon University), ACM-SIAM 2003.
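To make the numbering scheme concrete, here is a minimal sketch. The find_separator below is a toy stand-in that ignores the edges entirely; a real separator routine (e.g. Lipton & Tarjan's on planar graphs) would guarantee that no edge joins A1 and A2:

```cpp
#include <cstdio>
#include <vector>

struct Separation { std::vector<int> A1, S, A2; };

// Toy stand-in for a separator oracle: splits the part in half and
// pretends the middle vertex is the separator. A real implementation
// would use the graph structure to ensure no edge joins A1 and A2.
Separation find_separator(const std::vector<int>& part) {
    std::size_t mid = part.size() / 2;
    Separation sep;
    sep.A1.assign(part.begin(), part.begin() + mid);
    sep.S.assign(1, part[mid]);
    sep.A2.assign(part.begin() + mid + 1, part.end());
    return sep;
}

// Recursively assign consecutive numbers: first A1, then S, then A2.
// Vertices that are close in the graph end up close in the numbering,
// so adjacency lists can later be encoded with small differences.
void number_recursively(const std::vector<int>& part,
                        std::vector<int>& label, int& next) {
    if (part.empty()) return;
    if (part.size() == 1) { label[part[0]] = next++; return; }
    Separation sep = find_separator(part);
    number_recursively(sep.A1, label, next);
    for (int v : sep.S) label[v] = next++;   // separator takes the middle range
    number_recursively(sep.A2, label, next);
}

int main() {
    std::vector<int> part{0, 1, 2, 3, 4};
    std::vector<int> label(5, -1);
    int next = 0;
    number_recursively(part, label, next);
    for (int v = 0; v < 5; ++v)
        std::printf("vertex %d -> label %d\n", v, label[v]);
}
```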

Page 7

Computation Models

• What if the graph doesn’t have good separators (sparse, …)? We begin by taking I/O operations as the metric to measure the cost of the algorithm.

• The RAM (Random Access Machine) model is not realistic enough for this approach, as it considers unbounded memory and the ability to store arbitrarily large integers in each of the memory cells.

• The external memory model of Aggarwal and Vitter uses the following two-level memory hierarchy:

∙ The faster internal memory (cache) can hold M vertices/edges

∙ In an I/O operation, we can transfer a block of B vertices/edges from the disk to the cache.

∙ The number of I/O operations to read N contiguous items from the disk is scan(N) = Θ(N/B). To sort: sort(N) = Θ((N/B) log_{M/B}(N/B)).

∙ For realistic values of N, B, M : scan(N) < sort(N) << N.
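As a back-of-the-envelope illustration of that last inequality, with made-up (but plausible) values of N, B and M:

```cpp
#include <cmath>
#include <cstdio>

// Evaluate the Aggarwal-Vitter cost formulas for illustrative values
// (the constants below are assumptions, not the paper's machine).
int main() {
    double N = 1e9;   // items to process (e.g. a billion edges)
    double B = 1e5;   // items per disk block
    double M = 1e8;   // items fitting in internal memory
    double scanN = N / B;
    double sortN = (N / B) * (std::log(N / B) / std::log(M / B));
    std::printf("scan(N) = %.0f I/Os\n", scanN);   // 10^4
    std::printf("sort(N) = %.0f I/Os\n", sortN);   // ~1.3 * 10^4
    std::printf("N       = %.0f\n", N);            // 10^9
    // scan(N) < sort(N) << N, as claimed.
}
```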

Page 8

Computation Models

• Another model is the cache-oblivious model.

∙ The performance is still measured by the number of memory transfers, but we have less control over them. The algorithm does not know the values of M and B, and we cannot control the cache by saying « put this block in this slot ».

∙ When algorithms are designed in the cache-oblivious model, they do not need to be tuned to a particular system. Typically, these algorithms perform well on machines with multi-level memory hierarchies.

∙ The number of I/O operations to read N contiguous items from the disk was N/B in the external memory model (if they are stored in adjacent positions, with the first element on a block boundary). Here, we cannot ensure that the first element is on a boundary, so it becomes (N/B) + 1.

Page 9

• Remember the basic algorithm, traditionally taught in algorithms classes:

∙ appropriate candidate nodes for the next vertex to be visited are kept in a FIFO queue Q

∙ the adjacency list of a vertex taken from the queue is examined, so as to append unvisited nodes to Q (thus nodes are marked either visited or unvisited)

• However, remembering visited nodes has a big overhead in terms of I/O operations, and the accesses to adjacency lists are not great either.

The RAM model predicts that the algorithm should run in some minutes. It takes hours. The RAM model (and hence the complexity given under this model) does not take I/O operations sufficiently into account, and here they are the main contribution to the overall cost.

Page 10

Classic method’s overhead:
∙ Remembering visited nodes: Θ(m)
∙ Unstructured adjacency lists: Θ(n)
∙ Total: Θ(n + m)

Munagala & Ranade’s overhead:
∙ Remembering visited nodes: Θ(sort(n + m))
∙ Unstructured adjacency lists: Θ(n)
∙ Total: Θ(n + sort(n + m))

• Munagala and Ranade use this idea to simplify the remembering part. The ‘sort’ part is when we sort a level to eliminate the duplicates.

Say we are at level i and we want to compute the nodes at level i + 1… Some of their neighbours have been visited, some not. But we can tell which from the layer structure! Among our neighbours, we just have to delete the ones from levels i – 1 and i. The ones that are left cannot have been visited…

• More formally, let L(t) be the set of nodes at level t, and N(L(t – 1)) the multi-set of neighbours of the nodes at level t – 1.

• Given that we are at level t – 1, we want the nodes at level t. We begin by computing the multi-set of neighbours of the nodes at level t – 1, which requires |L(t – 1)| accesses to the adjacency lists.

• Then, we remove the duplicates: sort the list, then scan and compact, which costs O(sort(|N(L(t – 1))|)).
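A minimal in-memory sketch of this level step, where std::sort stands in for the external sorter (in the real algorithm every step below is an external scan or sort):

```cpp
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <vector>

// One level step in Munagala & Ranade's style: compute L(t) from
// L(t-2) and L(t-1), never consulting a global "visited" structure.
std::vector<int> next_level(const std::vector<std::vector<int>>& adj,
                            std::vector<int> L_prev2,   // L(t-2)
                            std::vector<int> L_prev) {  // L(t-1)
    // 1. Multi-set of neighbours N(L(t-1)): |L(t-1)| adjacency-list accesses.
    std::vector<int> cand;
    for (int u : L_prev)
        cand.insert(cand.end(), adj[u].begin(), adj[u].end());

    // 2. Remove duplicates: sort, then scan and compact (the sort(...) term).
    std::sort(cand.begin(), cand.end());
    cand.erase(std::unique(cand.begin(), cand.end()), cand.end());

    // 3. Delete the nodes of levels t-1 and t-2; thanks to the layer
    //    structure, whatever remains cannot have been visited earlier.
    std::sort(L_prev.begin(), L_prev.end());
    std::sort(L_prev2.begin(), L_prev2.end());
    std::vector<int> tmp, Lt;
    std::set_difference(cand.begin(), cand.end(),
                        L_prev.begin(), L_prev.end(),
                        std::back_inserter(tmp));
    std::set_difference(tmp.begin(), tmp.end(),
                        L_prev2.begin(), L_prev2.end(),
                        std::back_inserter(Lt));
    return Lt;
}

int main() {
    // Path 0-1-2-3: from L(0) = {0} and L(1) = {1}, the next level is {2}.
    std::vector<std::vector<int>> adj{{1}, {0, 2}, {1, 3}, {2}};
    for (int v : next_level(adj, {0}, {1})) std::printf("%d\n", v);
}
```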

Page 11

Classic method’s overhead:
∙ Remembering visited nodes: Θ(m)
∙ Unstructured adjacency lists: Θ(n)
∙ Total: Θ(n + m)

Mehlhorn & Meyer’s overhead:
∙ Total: O(√(n · (n + m) / B) + sort(n + m)) (expected, with the random pre-processing)

• Mehlhorn and Meyer’s algorithm has two steps:

∙ Pre-processing. We randomly choose some nodes, each with probability µ. By launching a BFS in parallel from each of them, we partition the graph into disjoint clusters of small diameter.

∙ We perform a BFS in Munagala and Ranade’s way, with small modifications. We know that when a node from a cluster is reached, the other nodes of the cluster will be reached soon after (as the diameter is small).

We keep a pool of clusters so that when we want the neighbours of a node, we just have to scan the pool and not the whole graph. This ‘pool’ data structure saves us some I/O.
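A minimal in-memory sketch of the random pre-processing, growing all clusters one hop per round as a stand-in for the parallel BFS runs (the graph representation and the use of std::rand are assumptions for illustration):

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// Each node becomes a cluster "master" with probability mu; clusters
// then grow outward in rounds, so each cluster has small diameter.
// Nodes in components that received no master keep -1; the real
// algorithm handles this case separately.
std::vector<int> random_clusters(const std::vector<std::vector<int>>& adj,
                                 double mu) {
    int n = (int)adj.size();
    std::vector<int> cluster(n, -1);       // -1 = not yet captured
    std::vector<int> frontier;
    for (int v = 0; v < n; ++v)
        if ((double)std::rand() / RAND_MAX < mu) {
            cluster[v] = v;                // v is the master of its own cluster
            frontier.push_back(v);
        }
    while (!frontier.empty()) {            // grow all clusters one hop per round
        std::vector<int> next;
        for (int u : frontier)
            for (int v : adj[u])
                if (cluster[v] == -1) {
                    cluster[v] = cluster[u];
                    next.push_back(v);
                }
        frontier.swap(next);
    }
    return cluster;                        // cluster[v] = master of v's cluster
}

int main() {
    std::srand(42);
    std::vector<std::vector<int>> adj{{1}, {0, 2}, {1, 3}, {2}};
    for (int c : random_clusters(adj, 0.5)) std::printf("%d\n", c);
}
```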

Page 12

• There are other ways to do the pre-processing (partitioning into clusters of small diameter) than the random one.

∙ Build a spanning tree, in around O(sort(n+m) · log log(n/m)).

∙ Then we build an Euler tour, and we chop it into chunks of √B nodes (using the deterministic list-ranking algorithm), as in the sketch below.
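A minimal sketch of that chopping step, assuming list ranking has already laid the tour out as an array:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Chop a ranked Euler tour into chunks of about sqrt(B) entries; a node
// is assigned to the chunk containing its first occurrence in the tour.
std::vector<int> chunk_by_tour(const std::vector<int>& tour,
                               int num_nodes, int B) {
    int chunk_size = std::max(1, (int)std::sqrt((double)B));
    std::vector<int> chunk_of(num_nodes, -1);
    for (std::size_t i = 0; i < tour.size(); ++i)
        if (chunk_of[tour[i]] == -1)            // first occurrence only
            chunk_of[tour[i]] = (int)(i / chunk_size);
    return chunk_of;
}

int main() {
    // Euler tour of the 3-node path 0-1-2: visits 0 1 2 1 0.
    std::vector<int> tour{0, 1, 2, 1, 0};
    for (int c : chunk_by_tour(tour, 3, 4)) std::printf("%d\n", c);
}
```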

• Brodal’s algorithm is based on the same principle of creating clusters and searching through the pool of clusters. However, its pre-processing produces a hierarchical clustering, and thus it uses a hierarchy of pools.

So, lots of things have been done. What do we do in this paper?

• Improve Munagala and Ranade’s algorithm (MR_BFS) and Mehlhorn and Meyer’s algorithm with random pre-processing (MM_BFS_R).

• Improve the implementation of the deterministic pre-processing. It is shown that for most graph classes, the deterministic approach is better than the random one. A new heuristic to manage the pool is also given.

Page 13

• What the authors mean by « improve MR_BFS and MM_BFS_R » is a detail about initializing the sorter for the nodes of the previous level.

• Although it does reduce the overhead at each level, it concerns STXXL (Standard Template Library for Extra Large Data Sets) more than it provides a brand new idea.

• We have to use an external sorter (with I/O operations) if there are more than B elements to sort. This sorter must be initialized.

• An external sorter is a complex beast, and initializing it has a cost; it is only necessary when there are more than B elements to sort. As we do not know in advance the number of elements to sort, we initialize the external sorter at each level…

• One way not to initialize the external sorter at each level is to buffer the first elements. If we get more than B elements, we initialize and use the external sorter; otherwise, we use an internal sorter. This reduces the overhead.
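A sketch of this buffering heuristic. The ExternalSorter below is a toy stand-in, not STXXL's actual interface; in reality it would be disk-backed and its init() would be the expensive part:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Toy stand-in for an external sorter (NOT STXXL's API).
struct ExternalSorter {
    std::vector<int> data_;
    void init() {}                      // the expensive step in reality
    void push(int x) { data_.push_back(x); }
    std::vector<int> finish() {
        std::sort(data_.begin(), data_.end());
        return data_;
    }
};

// Lazy sorter: elements are buffered in memory; only when more than B
// of them arrive do we pay for initializing the external sorter.
class LazySorter {
    std::size_t B_;
    std::vector<int> buffer_;
    ExternalSorter ext_;
    bool external_ = false;
public:
    explicit LazySorter(std::size_t B) : B_(B) {}
    void push(int x) {
        if (external_) { ext_.push(x); return; }
        buffer_.push_back(x);
        if (buffer_.size() > B_) {      // too big for internal sorting:
            ext_.init();                // initialize the external sorter once
            for (int y : buffer_) ext_.push(y);
            buffer_.clear();
            external_ = true;
        }
    }
    std::vector<int> finish() {
        if (external_) return ext_.finish();
        std::sort(buffer_.begin(), buffer_.end());  // cheap internal sort
        return buffer_;
    }
};

int main() {
    LazySorter s(4);                                  // B = 4 for the example
    for (int x : {5, 3, 8, 1, 9, 2}) s.push(x);       // 6 > B: goes external
    for (int x : s.finish()) std::printf("%d ", x);
    std::printf("\n");
}
```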

Page 14

How to improve the implementation of MM_BFS_D

• There are three operations to perform: sorting, minimum spanning tree, and list ranking. For each of these operations, many algorithms/implementations are available. The authors compared the implementations empirically: applying different implementations to the same situations and seeing which one is faster.

• Guess what? EM_MST is faster than CO_MST, hence better to use!

• A problem worth discussing is list ranking. When we make an Euler tour T of the spanning tree, we assign to each node a rank: the position of its first occurrence in T.

• Once a list is ranked, we can turn it into an array on which operations can be performed efficiently. However, it is hard to design a list-ranking algorithm that is efficient in external memory.

• To find the better algorithm for list ranking, the authors compared two: one took 14 hours, and the other one (Sibeyn’s algorithm) 40 minutes…
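To make the problem concrete, here is list ranking done naively in memory; the pointer chasing in the loop is exactly what external memory is bad at, since every hop may cost one random I/O, so external algorithms such as Sibeyn's must proceed differently:

```cpp
#include <cstdio>
#include <vector>

// In-memory list ranking: given succ[i] = successor of node i in a
// linked list (-1 for the tail) and the head node, compute rank[i] =
// position of i in the list.
std::vector<int> list_rank(const std::vector<int>& succ, int head) {
    std::vector<int> rank(succ.size(), -1);
    int r = 0;
    for (int v = head; v != -1; v = succ[v])  // walk the list once
        rank[v] = r++;
    return rank;
}

int main() {
    // List 2 -> 0 -> 1 (head 2): succ[2] = 0, succ[0] = 1, succ[1] = -1.
    std::vector<int> succ{1, -1, 0};
    for (int r : list_rank(succ, 2)) std::printf("%d ", r);  // prints 1 2 0
    std::printf("\n");
}
```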

Page 15

• When you do experiments, you have to state precisely on which machine you ran them. So, here we go!

∙ Implemented in C++ using g++ 4.0.2 with optimization level O3

∙ Linux kernel version 2.6, STXXL library version 0.77

∙ 2 GHz Opteron processors, of which we will use only one

∙ 3 GB of RAM, of which we will use only 1 GB

∙ 1 MB cache

∙ 250 GB Seagate Barracuda hard disks with 8 MB cache

∙ Average read/write access times of 8 ms and 9 ms, with a peak data rate of 65 MB/s

• You must give enough information so that somebody can redo the same experiments under the same conditions if necessary.

• For example, the authors restrict themselves to 1 processor and 1 GB of memory to compare their results with previous experiments.

• It is probably biased (different architecture/speed of processors and latency of memory), but it is a better basis for comparison than nothing!

Page 16

(Results figure: running times on a line graph, i.e. n nodes and n – 1 edges such that the path from u to v consists of all the edges.)
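Such an input is easy to generate. In the sketch below, "random line graph" is read as "a path whose node identifiers are randomly permuted", which destroys any locality on disk; this reading is an assumption for illustration:

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Generate a line graph: n nodes, n-1 edges forming a single path.
// Shuffling the node identifiers removes any correlation between a
// node's position on the path and its position on disk, which makes
// the instance hard for external BFS.
std::vector<std::pair<int, int>> random_line_graph(int n, unsigned seed) {
    std::vector<int> id(n);
    std::iota(id.begin(), id.end(), 0);       // ids 0..n-1
    std::mt19937 gen(seed);
    std::shuffle(id.begin(), id.end(), gen);  // random placement of the path
    std::vector<std::pair<int, int>> edges;
    for (int i = 0; i + 1 < n; ++i)
        edges.emplace_back(id[i], id[i + 1]); // consecutive nodes on the path
    return edges;
}

int main() {
    for (const auto& e : random_line_graph(5, 123))
        std::printf("%d - %d\n", e.first, e.second);
}
```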

Page 17

• The experiments also showed that for Mehlhorn and Meyer’s algorithm, the random pre-processing (MM_BFS_R) was not as efficient as the deterministic one (MM_BFS_D), given a set of improvements.

∙ The clusters resulting from the random strategy have a bigger diameter than if we use a good spanning tree and chunk it carefully after an Euler tour.

∙ The deterministic pre-processing gathers neighbouring clusters of the graph in contiguous locations on the disk, hence it optimizes the I/O operations.

Page 18

For each graph class, the best implementation currently available and its total running time.

Although it still takes a number of hours, the time has been considerably reduced. For example, the previous implementation of MM_BFS_R on a grid required 54 hours, whereas here it is done in 21 hours.

Even more significant is the difference on a random line graph: it takes 12 days with MR_BFS and a few months with IM_BFS, whereas we can do it in just 3.6 hours.

Page 19

T H A N K Y O U

Main article used in this presentation

Improved external memory BFS implementations (Deepak Ajwani, Ulrich Meyer, Vitaly Osipov; ALENEX 2007, published by SIAM)

Compact Representations of Separable Graphs (Daniel Blandford, Guy Blelloch, Ian Kash; Proc. ACM-SIAM Symposium on Discrete Algorithms, 2003)

Review of the Cache-Oblivious Model (Erik Demaine; Lecture 15 of course 6.897 at MIT)

Other articles used to provide a better understanding

From parallel to external list ranking (Jop Sibeyn; Technical Report MPI-I-97-1-021, Max-Planck-Institut für Informatik, 1997)