Eindhoven University of Technology MASTER Simulation of ... · Simulation of large-scale genetic regulatory systems Janssen, T.H.M. Award date: 2006 Link to publication Disclaimer

Eindhoven University of Technology

MASTER

Simulation of large-scale genetic regulatory systems

Janssen, T.H.M.

Award date:2006

Link to publication

DisclaimerThis document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Studenttheses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the documentas presented in the repository. The required complexity or quality of research of student theses may vary by program, and the requiredminimum study period may vary in duration.

General rightsCopyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright ownersand it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain

https://research.tue.nl/en/studentthesis/simulation-of-largescale-genetic-regulatory-systems(4275bd48-ceca-47ec-988f-0605958fde81).html

TECHNISCHE UNIVERSITEIT EINDHOVEN

Department of Mathematics and Computer Science

Master’s Thesis

Simulation of Large-scaleGenetic Regulatory Systems

T.H.M. Janssen

Supervisor:

Prof. dr. P.A.J. Hilbers

Eindhoven, June 2006

.

Abstract

On the molecular level, processes that occur inside the cells of an organism are controlledby genetic regulatory systems. These systems are highly complex and currently not wellunderstood. To gain a better understanding, simulations of such systems are needed. Theinherent complexity of the regulatory systems results in a need to abstract away from details.A Boolean network is such an abstraction. In this thesis we will present a parallel algorithmfor the simulation of Boolean networks, in order to provide a means to simulate geneticregulatory systems. We apply the algorithm on different Boolean networks, and compare theresults to each other.

.

Acknowledgements

This thesis concludes my studies at the Department of Mathematics and Computer Science atEindhoven University of Technology. The project was conducted in the System Architectureand Networking area of expertise, during the period from August 2005 to June 2006.

I would like to thank my supervisor, Peter Hilbers, for his input in the project, and inparticular for the highly motivating talks about my work. I also would like to thank the othermembers of the assessment committee, Mark de Berg and Rudolf Mak, for reviewing my work.

Much appreciation goes to Richard Verhoeven, for the excellent and speedy support he offeredon the use of the sandpit cluster.

A word of thanks goes to my family and friends, and in particular my parents, for theirencouragements and belief in a good ending of the project.

And finally I would like to thank my girlfriend Monique, for all her love, patience and support.

Thijs JanssenZoetermeer, June 2006

.

Contents

Abstract iii

Acknowledgements v

1 Introduction 11.1 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 Biological network topology 22.1 Random networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.2 Scale-free networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

3 Boolean networks 5

4 Algorithm and implementation 94.1 Parallel computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104.3 Load balancing for boolean networks . . . . . . . . . . . . . . . . . . . . . . . 104.4 Load distribution for parallel programs . . . . . . . . . . . . . . . . . . . . . . 11

4.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114.4.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114.4.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124.4.4 Adding parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164.4.5 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204.4.6 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.5 Boolean network simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.5.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.5.3 Performance expectations . . . . . . . . . . . . . . . . . . . . . . . . . 33

5 Results 395.1 Behaviour of Boolean networks . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.1.1 Random Boolean functions . . . . . . . . . . . . . . . . . . . . . . . . 395.1.2 Pseudo-random Boolean functions . . . . . . . . . . . . . . . . . . . . 405.1.3 Weighted vote . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415.1.4 Weighted vote with inhibitors . . . . . . . . . . . . . . . . . . . . . . . 425.1.5 Weighted vote with random inhibitor edges . . . . . . . . . . . . . . . 44

5.2 Performance tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

viii CONTENTS

6 Conclusion and recommendations 536.1 Recommendations for future work . . . . . . . . . . . . . . . . . . . . . . . . 546.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

A Test results for Parallel Recursive Minimal Cut 57A.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

A.1.1 Binary tree network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57A.1.2 Cyclic network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59A.1.3 Random network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62A.1.4 Scale-free network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

B Histograms of Minimal Cut for scale-free networks 66

Chapter 1

Introduction

Inside the cells of an organism, many different and highly complicated processes take place.Examples of such processes are cell differentiation, responding to external stimuli and DNAreplication ([6]). The genome (the collection of genes) of an organism has a very importantrole in these processes, as the genome determines the proteins that are generated inside acell. These proteins can act as transcription factors by binding to regulatory sites of othergenes, as enzymes that catalyze metabolic reactions, or as a component of signal transductionpathways ([6]).

Besides internal regulation, cells also receive a large amount of different signals that areaimed to influence the behaviour of the cell. In order to handle these signals, highly complexsignalling networks have evolved, called signalling pathways. In the past few years, it hasbecome apparent that these signalling pathways not only process external signals, but arealso related to the cell’s internal signalling ([11]). It turns out that internal signalling, genenetwork regulation and metabolic regulation are not separate control systems, but that theyare highly related to and integrated with each other ([11]). As such, these signalling networksplay a very important role in the organization and functioning of the cell. Therefore, in orderto understand how a cell functions on the molecular level, we need an understanding of thesesignalling networks.

Different genome sequencing projects have resulted in huge sources of information on differ-ent parts of the regulatory systems. The patterns and behaviour that these parts constitutetogether are, unfortunately, much less understood ([6]). Gaining a better understanding ofthese systems is therefore considered to be a huge scientific challenge.

1.1 Goal

The goal of this project is to develop and implement an efficient parallel algorithm for thesimulation of large-scale Boolean networks. Such an algorithm can then be used to simulategenetic regulatory networks, in order to gain more insight in their dynamics.

Chapter 2

Biological network topology

In this thesis, we will be representing biologic signalling networks as graphs, where the ac-tors are represented by vertices, and the existence of relations between actors are representedby edges. A vertex can then represent a protein, a concentration of some sort of moleculeexceeding a threshold, the absence of some molecule, etc. An edge connecting two verticesrepresents that one of the vertices is influenced by the other. In general, such an edge will bedirected, as the relations between vertices in the network are in general not symmetrical.

An important property of such a signalling network is the topology: the structure of thenetwork. The topology defines the architectural properties the network admits. We willmainly be considering two different topologies in this work, namely random networks andscale-free networks. These topologies will be described next.

2.1 Random networks

The most straightforward topology is a random graph. There are mainly two approaches togenerate a random graph with n vertices ([4]):

• Initially start with the graph consisting of the n vertices, but without any edges. Next,edges are randomly added; every edge has a probability p to be included in the graph.

• Consider the set of all possible graphs of n vertices. Randomly draw an element fromthis set. This element is the resulting graph.

In this thesis we will be working with the former type of random graphs. Such a randomgraph contains ”typical” vertices. The number of connections for each vertex follows a normaldistribution. Therefore, a single vertex that has the average number of connections can beused to characterize the network. The latter category will in general not display such ”typical”vertices.Random networks display the so-called small-world property : for the average path length lbetween two different vertices we have that l ∼ log n, with n the number of vertices ([4]).

2.2 Scale-free networks 3

2.2 Scale-free networks

A special type of networks are the so-called scale-free networks, which are described in detailin [2]. Instead of each node having about the same connectedness as in random networks,the number of connections that vertices have in scale-free networks follows a power law. Inparticular, the following holds for the probability for a vertex to have k links:

P (k) ∼ k−γ

For most scale-free networks, the degree exponent γ is in the range 2 < γ < 3. The resultingnetworks are highly non-uniform: there are no ”typical” nodes in the network. Instead, in ascale-free network most vertices have very few links, whereas a very small number of nodeshas a very large number of links. These highly connected nodes are called the ”hubs” of thenetwork ([3]). In section 2.1 we have seen that random networks display the small-world prop-erty. Scale-free networks have an average path length l between two vertices of l ∼ log log n(n is the number of vertices), which is referred to as the ultra-small-world property ([3]).

Scale-free networks appear in various areas. In [2], the collaboration graph of movie ac-tors is described. In this graph, vertices represent specific actors, where an edge between twovertices means that those two actors were cast in the same movie. This graph turns out to bescale-free, with γ = 2.3± 0.1 ([2]). In the same work, it is shown that when the World WideWeb is represented as a graph, with pages as vertices, and links between pages as edges, theresulting graph also is scale-free, with γ = 2.1 ± 0.1. There are numerous other examples,such as the graph representation of friendship in a certain population ([5]) and the citationbetween scientific papers ([2]). Networks become scale-free through growth (new nodes areadded over time), and a phenomenon called preferential attachment, which means that newnodes prefer to connect to nodes that already have many connections ([3]).

For our purposes there is another area of interest in which scale-free networks appear: itturns out that cellular networks and signalling pathways are organized as scale-free graphs([3]). The consequence of this organization is that a large number of elements of a geneticregulatory system is related to only a very small part of the system, whereas a very smallnumber of elements is related to a very large part of the system. As described earlier, thelatter category are the hubs of the network.

A scale-free network can be generated by starting with a very small network, with somerandom connections between the vertices. In each step, we add a single vertex with a randomnumber of connections (the number of allowed connections can be limited to some range).Next, the vertices to which the connections lead are determined. The probability for a con-nection to lead to some vertex is proportional with the number of connections that specificvertex already has. As a result, strongly connected nodes have a higher probability to receiveadditional connections. This effect leads to the development of hubs in the network. Verticesthat have a small number of connections have a much lower probability to receive additionalconnections, which implies that they are likely to remain less connected.

In Figure 2.1, taken from [5], we see examples of three different network topologies.

4 Biological network topology

�

�

�

�

�

�

� �

Figure 2.1: Different network topologies

The regular network is created by connecting nodes that are close to each other ([5]). Thesenetworks are mostly ”cliquish”: local groups of nodes are highly connected, whereas inter-local nodes are not connected. In order to generate such a network, the networks need toinclude a spatial component for the vertices.

We will only be using scale-free and random networks in the remainder of this thesis.

Chapter 3

Boolean networks

The goal of this thesis is to gain better insight in the dynamics of genetic regulatory systems.One problem with these systems is their inherent complexity: the different number of pathsthat such a system can admit is extremely large, making it impossible to simulate such asystem in full detail. Therefore, abstractions are needed, so we can limit the complexity.We can then make statements about a genetic regulatory system by looking at the results ofsimulating an abstraction of that system.In [6], different approaches are described for abstracting from real genetic regulatory systems.One approach is that of Boolean networks, which is the approach we adopt for our simulation.This approach was first proposed in [9], in which the relation between Boolean networks andgenetic networks was first described. The reason for choosing Boolean networks is that theyoffer a simple and straightforward mechanism, that can be extended in the future to incorpo-rate more detail, resulting in better simulations. Although Boolean networks are well-knownin computer science, little is known on their behaviour, even for small networks. Complexityof the networks is the main reason for this.

In a boolean network abstraction, the state of each gene is represented by a boolean variable,that can either be true (meaning the gene is active) or false (meaning the gene is inactive).Only the products of active genes are assumed to be present in the cell. The new state of agene is then determined by the state of other genes influencing that gene.

Definition 3.0.1 A boolean network consists of a directed graph G = (V, E), and a set ofboolean functions, such that every vertex i ∈ V has its own boolean function. The inputto vertex v ∈ V is defined as the set of vertices that have an outgoing edge to v, that is:Inputs(v) = {v′|v′ ∈ V ∧ (v′, v) ∈ E}. Let vertex v have k inputs. Boolean function fv(x) :Bk → B then denotes the boolean function assigned to vertex v, that maps the states of allinputs of v to a Boolean value.

Let n be the number of vertices of which the boolean network consists. That is: n = |V |.Furthermore, let x denote the n-vector of boolean variables, representing the state of thenetwork. Since each xi can be either true or false, the network has a total of 2n differentstates. Now, assume that vertex v has k incoming edges. Then there are 22k

different booleanfunctions possible for vertex v. Of course, k can be different for each vertex. Clearly, thenumber of different functions explodes with the number of inputs. For example, when k = 2,the number of different functions is 16.

6 Boolean networks

The behavior of a vertex in the boolean network in terms of state transitions is now de-fined as follows:

xi(t + 1) = fi(x(t)), 1 ≤ i ≤ n

We see that the state of vertex xi at step t + 1 is determined by a function of the state ofthe network at step t. Besides the state of a single vertex, there is also the state of the entirenetwork, which consists of the list of states of all vertices that are part of the network. Sincethe state of each vertex changes in every step that is taken in the boolean network, the stateof the network also changes with every step taken (unless we are in a steady state, whichwill be explained below). The global state of the boolean network at step t, denotes as x(t),consist of the sequence of vertex states at step t. That is:

x(t) = (x0(t), x1(t), ..., x|V |−1(t)) = (xi)i∈V

Important is that the behavior of a state transition of the network is completely deterministic:the current state of the network combined with the definition of the boolean functions foreach vertex completely determine the next state of each vertex, and therefore the completenetwork as well.From this point forward, we are mainly interested in states of the network. Whenever we referto a ”state”, it means the state of the entire network, unless specifically stated otherwise.We refer to a sequence of states as a trajectory. Given that the number of different states thenetwork can reach is finite, the number of different states in a trajectory must be finite aswell. In particular, every initial state of a trajectory eventually will reach a cycle of states,of size at least 1 (a cycle of size 1 is a steady state). Such a cycle is called a point attractorfor a steady state, and a dynamic attractor for a cycle of size larger then 1. The states thatare part of the trajectory, but not part of the cycle are referred to as the basin of attraction.

There are now two interesting questions that arise in this theory:

• Given a boolean network and an initial state of the network, to what state cycle doesthe network evolve?

• Given a boolean network and a state cycle, what initial states lead to that specific statecycle?

In this dissertation, we will try to answer the first question. That is: we are interested in theway that a network evolves from a certain state to a state cycle or stable state.

Although a boolean network is an abstraction from genetic regulatory systems, it still remainsa very complex system, with a large number of possible paths in its state space. Since thenumber of different states and boolean functions rapidly explode as the number of verticesincreases, we need a computer program to simulate such networks, even for a small numberof vertices.

7

In order to illustrate the theory, we discuss a simple example. Consider a network consistingof three vertices. The connections between the vertices are shown in Figure 3.1:

�

�

�

Figure 3.1: Example Boolean network

Now assume the following functions have been assigned to the different vertices:

Vertex Functiona ¬b

b a ∨ c

c ¬a

The state space and state transitions this network yields are shown in Figure 3.2, whereeach state consists of the values of vertices a, b and c respectively. A value of 0 means thecorresponding vertex is set to false, whereas a value of 1 indicates the vertex is set to true.

��

��

��

��

��

��

��

��

Figure 3.2: State space corresponding to example

Note that every state has precisely one outgoing edge, which is the result of the deterministicnature of a boolean network. We see that some states have no incoming edges. These statesare at the start of a basin of attraction. Other vertices have several incoming edges, meaningthat they are either the point where a basin of attraction turns into a state cycle, or thatthey join different basins of attraction to one state sequence (or both). In our example, thereis one state that has an outgoing edge to itself. This state is therefore a steady state. Note

8 Boolean networks

that such a steady state can have other incoming edges as well (although this is not the casein our example).

As explained, we will be using this theory as an abstraction of genetic regulatory systems. Inthe next chapter, we will present a parallel algorithm for the simulation of boolean networks.The results of these simulations can then be used to make statements and predictions aboutreal genetic regulatory systems.

Chapter 4

Algorithm and implementation

In this chapter we will describe a parallel algorithm that can, given a boolean network and aninitial state, determine the dynamic or point attractor to which the network will evolve. Sinceload balancing and load distribution are very important issues in parallel computing, we willdeal with these problems first. In section 4.3 we will describe how a boolean network can bedivided into smaller pieces, such that it can be distributed over several processes. In section4.4, we will describe an algorithm that assigns different tasks to processes, such that overallcommunication requirements are minimized. In section 4.5, we describe a parallel algorithmfor the simulation of boolean networks.

4.1 Parallel computing

First, we describe what a parallel program is.A parallel program is a computer program that performs its instructions not as a singleprocess, but as a set of different processes. Each process runs a sequential program, and allprocesses are executed concurrently. Concurrent execution means that different processes canexecute their atomic actions at the same time.We will only be using Single Program, Multiple Data (or SPMD) programs, meaning thateach proces executes the same program, but with different data.

In order to reduce the time needed to run the parallel program, different processes are mappedto different processors, as this results in a larger amount of computational resources. Betweenthe processes communication is possible, enabling the processes to share the results of theircomputations with each other. Using message passing, processes are able to send results oftheir computation or instructions to other processes, such that all processes can work togetheron the same task.

For current hardware technologies, simple computation actions take little time, whereas smallcommunication actions are expensive in time. In order for the parallel program to be effi-cient, the distribution of the workload is an important issue. A distribution that requiresmuch communication when compared to computation time will result in poor performance- possibly worse than a sequential implementation. Besides workload distribution, load bal-ancing is another important issue. Load balancing means the amount of workload that eachprocess has been assigned is about equal. Such a load balance will result in each processes

10 Algorithm and implementation

performing about the same amount of work. If processes have to be synchronized at somepoint in their execution (for example, for communication), a bad load balance will most likelyresult in a process with a low workload waiting for a process with high workload. In thatcase, the process with low workload has to wait for the other process, resulting in idle time,and therefore, a waste of computational resources. This is in general a bad idea. Therefore,we need good load balancing.

4.2 Notation

In the remainder of this chapter, we will refer to graphs as a tuple, G = (V,E). This meansthat graph G consists of a set of vertices, V , and a set of edges, E. We denote the length ofa set x as |x|. So, graph G has |V | vertices and |E| edges.When we consider a parallel system, we will refer to the set of available processes as P , andthe number of available processes as |P |. We assume that each process is assigned a uniqueidentification number, called its rank, which is in the range 0 ≤ rank < |P |.

4.3 Load balancing for boolean networks

Since we are developing a parallel algorithm for the simulation of boolean networks, weshould come up with an approach to split the network over several processes. We choose avery straightforward approach: we consider a vertex in the boolean network to be the smallestunit of computation.The advantage of this approach is that it allows us to consider communication between tasksas being the same as connections between vertices in the boolean network. This follows fromthe fact that the inputs to vertices, which are specified in the boolean network, determine thecommunication needs: a vertex requires the state of each of its inputs to determine its ownnew state.As noted earlier, good performance requires good load distribution: each process is assignedan (almost) equal amount of work. This can easily be achieved by assigning about thesame number of vertices to each process, assuming the computational resources every vertexrequires are (almost) equal.Next, we choose to let each process in our parallel system be assigned to a unique processor.This results in the number of processors being equal to the number of processes. Each processis then assigned a subset of the vertices, such that every vertex is assigned to precisely oneprocess. Clearly, this results in a need to communicate the state of vertices between processes,in order to make state transitions. We are looking for an assignment of vertices to processesthat minimizes communication dependencies between processes. In particular, we are lookingfor an assignment in which communication dependencies between vertices within a processcan be very high, whereas communication dependencies between vertices that are assigned todifferent processes is minimized. An additional requirement related to load balancing is thatthe amount of external communication per process is about equal; if there is one process thatperforms much more communication, it will delay other processes, as the state transitions aresynchronized.In the next section we provide an algorithm that returns an assignment of vertices to processesthat satisfies our requirements.

4.4 Load distribution for parallel programs 11

4.4 Load distribution for parallel programs

4.4.1 Introduction

As mentioned, a good load distribution is essential for good performance of a parallel program;distributing the workload in a smart manner may reduce communication time, resultingin a shorter runtime of a parallel program. However, a bad distribution may cause hugecommunication overhead, resulting in very bad performance of an otherwise perhaps verygood implementation. In this section we present a systematic way to distribute the workloadamong the computation nodes. We compare the results of the distribution for different graphs(representing communication actions) with expected results of a random distribution.

4.4.2 Approach

The algorithm which we propose takes as its input a directed graph, where each vertexrepresents a computational unit, and each edge represents a data dependency, and thereforea communication action. For example, if the graph contains an edge from vertex x to vertexy, this means that vertex y needs data from vertex x in order to perform its computation.A computational unit is an action or set of actions that can be performed on one process,assuming the required input is available. One can choose for larger computational units, re-sulting in coarse-grained programs. Smaller computational units yield fine-grained programs,that are characterized by a larger communication graph with more communication actions.Clearly, this will have a significant effect on the performance of a load distribution algorithm.In order to reduce the overall communication cost within the network we propose a recursiveMinimal Cut as a strategy to distribute the computational components. This approach wasinspired by the algorithm proposed in [7].Minimal Cut is a well-known problem in graph theory ([1]). Its formal specification reads:

Definition 4.4.1 Given an undirected graph G = (V, E), a Minimal Cut of G is a parti-tioning of set V into two (non-trivial) subsets V1 and V2 such that:

• V1 ∪ V2 = V

• V1 ∩ V2 = ∅• 〈#i, j : i ∈ V1 ∧ j ∈ V2 : (i, j) ∈ E〉 is minimal

In order to use Minimal Cut on the mapping problem for a parallel program we have tocome up with a mapping from Load Distribution to Minimal Cut. This can be done asfollows: we choose to map computational units in the parallel program to vertices in theMinimal Cut problem. Communication actions are represented by edges. In order to beable to accurately represent real communication behavior, we must make a choice in whatedges exactly represent. We can choose to give weights to edges representing the number ofcommunication actions between the vertices that the edge connects, or we can introduce anew edge for each communication action. We choose the latter, since this offers the ability toeasily extend the algorithm in the future with edge weights representing the amount of datapassing over a communication line. (When edges are weighted, the definition of Minimal Cut


is slightly different; instead of the number of edges in the cut, we are interested in minimizingthe total weight of all edges in the cut). However, this approach is not possible in the currentdefinition of our graphs. Instead of a set E of edges, we let E represent a multiset of edges,which allows the same edge to be included several times.Communication in a parallel program is a directed action. Since Minimal Cut requires undi-rected edges, we choose to remove the direction of the edge. This doesn’t really make adifference, as long as the same edge can occur more than once in a graph. This is necessaryin order to give a realistic representation of the program. For example, given two computa-tional units a and b. If a requires information from b, and vice versa, this would take twocommunication actions. This information must be incorporated in the graph representationin order to obtain good results from the Minimal Cut computation.We want the workload evenly distributed among the processes, so we choose to require thateach Minimal Cut returns two clusters that differ at most one vertex in size. That is:

−1 ≤ |V1| − |V2| ≤ 1

The problem is now similar to Minimum b-Balanced Cut ([1]).Minimal Cut is known to be NP-complete in both its general and its balanced form, meaningthat there currently is no algorithm known to man that runs in polynomial time, and thatcalculates the optimal cut. Therefore, the best we can do in polynomial time (assuming P 6=NP) is to construct an approximation to the optimal cut. The algorithm makes use of theKernighan-Lin heuristic, which was first described in [10]. Given the communication graphand two clusters of vertices C1 and C2, the value of the heuristic for a vertex v, referred toas its gain, can be calculated by counting the number of edges going from v to vertices in theother cluster, minus the number of edges from v to vertices in the cluster of which v is a part.This is the value for the heuristic when edges are not weighted. The heuristic determines thechange in the cutsize, should vertex v be moved. The value of the gain of vertex v, gain(v),is now defined as follows:

gain(v) =

〈#i : i ∈ C2 : (i, v) ∈ E ∨ (v, i) ∈ E〉−〈#i : i ∈ C1 : (i, v) ∈ E ∨ (v, i) ∈ E〉 , if v ∈ C1;

〈#i : i ∈ C1 : (i, v) ∈ E ∨ (v, i) ∈ E〉−〈#i : i ∈ C2 : (i, v) ∈ E ∨ (v, i) ∈ E〉 , if v ∈ C2;

This heuristic is the basis of our Minimal Cut algorithm, which will be described next.

4.4.3 Algorithm

The algorithm we propose is, as mentioned, largely based on the algorithm proposed in [7].Some details are removed from the algorithm, and it is implemented as a parallel program.The part we leave out is the optimization that is specific for clusters with a hypercube-architecture. We assume that the communication network between our processes is fullyconnected. A parallel implementation is not suggested in [7]. We will describe a method torun the algorithm on a parallel system.The algorithm is based on a heuristic that calculates, for each vertex, the change in the sizeof the cut, should that vertex be moved to the other cluster of vertices. It is known as theKernighan-Lin heuristic, and is described in detail in [10]. The value of the heuristic for somevertex is referred to as the gain of that vertex.


The (sequential) Minimal Cut algorithm looks as follows (almost literally copied from [7]):

Algorithm MinCut(Set Orig-C1, Set orig-C2):

// accep t s two c l u s t e r s Orig−C1, Orig−C2 as input and t r i e s to2 // reduce the c u t s i z e between the s e c l u s t e r s by moving v e r t i c e s

// between them . The a l gor i thm re turns the f i n a l c l u s t e r s .4

Set C1 := or ig−C1 ;6 Set C2 := or ig−C2 ;

8

10 Assoc i a t e a gain value v . gain , with each node v ;f o r a l l v in C1 ∪ C2 :

12 v . ga in := 0 ;

14

do {16 − Mark a l l nodes unlocked

− Assume V = C1 ∪ C2

18 − f o r a l l v in V:v . ga in = calc GAIN (v , C1 , C2)

20

− Compute W1 and W222 // ( t ha t i s : the t o t a l workload in C1 and C2 r e s p e c t i v e l y )

24 seqno := 0 ;done := fa l se ;

26 repeat {seqno := seqno + 1

28 Let Ci be the c l u s t e r with g r e a t e r t o t a l weightand Cj the c l u s t e r with l e s s e r t o t a l weight ,

30 i . e . , Wi ≥ Wj , i, j ∈ {1, 2}, i 6= j ;

32 Among the unlocked v e r t i c e s , i d e n t i f y v∗ ∈ Ci

such that 〈∀v : v ∈ Ci ∧ unlocked(v) : v.gain ≤ v∗.gain〉34 // That i s : v∗ i s an unlocked v e r t e x

// wi th maximal gain36

I f no such ver tex e x i s t s :38 done := true ;

break ;40

Assume that v∗ i s moved to Cj , update the42 gain for a l l unlocked nodes connected to


v∗ , then re−c a l c u l a t e l oads W1 and W2

44 for C1 and C2 ;

46 Lock v∗ and record the s t a tu s o f the movement ;

48 gain [ seqno ] := v∗ . ga in ;} un t i l done

50

Let G∗ = max(∑|C1|+|C2|

i=1 gain [ i ] ) =∑l∗

i=1 gain [ i ]52 // i . e . l∗ i s the number o f movements t ha t maximizes

// the cumula t ive gain54

i f (G∗ > 0)56 perform a l l moves from 1 to l∗

} while G∗ > 058 return (Cl , C2)

Algorithm Calc GAIN(Vertex v, Set C1, Set C2):

2 // A g l o b a l adjacency matrix f o r the en t i r e graph i s used .// This rou t ine has acces s to a l l ne i ghbor s o f v and

4 // in format ion about them to c a l c u l a t e the exac t gain f o r// v e r t e x v during any s t a g e o f the r e cu r s i v e

6 // b i p a r t i t i o n i n g proces s

8 v . ga in = 0 // I n i t i a l i z e the gain o f v e r t e x v to zerofor ( each neighbor vi o f v in C1 or C2) do

10 i f (vi i s in the same c l u s t e r where v i s ) then// su b t r a c t edge−weigh t from gain :

12 v . gain = v . gain − co s t (v, vi)else i f ( edge (vi, v) i s in the cut between C1 and C2 ) then

14 // add edge−weigh t to the gainv . gain = v . gain + cos t (v, vi)

16

end i f18 endfor // each ne ighbor

return v . gain

When this algorithm terminates, the following holds for clusters C1 and C2:

Theorem 4.4.2 When keeping −1 ≤ |C1|−|C2| ≤ 1 invariant, the cut is optimal when usingthe Kernighan-Lin heuristic as a measure (that is: we use the Kernighan-Lin heuristic todetermine the best movement).


Proof The first part of the proof follows from the guard of the main loop. The algorithmdetermines if there is a sequence of vertex movements that optimizes the cut. If there is sucha sequence, the algorithm performs the movements in that sequence, and the new clusters areevaluated. Only when such a sequence does not exist, the algorithm terminates. If we cannow show that the inner loop maintains the invariant, the proof is complete. If we considerthe inner loop, we observe that the next vertex that is to be moved is taken from the biggercluster (lines 28-32). Next, in lines 41- 44, the algorithm changes the weights of the twoclusters and the gains of all unlocked vertices, assuming that the chosen vertex would bemoved. Clearly, if movements of vertices are performed in the same order as the vertices areconsidered for moving, the invariant will be maintained. In line 51 a prefix of all movementsis chosen, which are the movements that are to be performed. Clearly, such a prefix maintainsthe order of movements. In line 56 this prefix of movements is executed. ¥

Note that there is a possibility that the algorithm terminates while there is a vertex withpositive gain (moving the vertex would reduce the cut size). However, such a situation canonly occur if moving that vertex would not maintain the invariant −1 ≤ |C1| − |C2| ≤ 1.

An important question is whether the algorithm can admit an infinite sequence of vertexmovements.

Theorem 4.4.3 The described Minimal Cut algorithm terminates in all cases.

Proof Consider the main loop, that terminates when there is no sequence of movements thatreduces the total cut size (the guard is true when G∗ > 0). Assume that after the main loophas been executed t times, the total size of the cut equals St. The algorithm determines thebest possible sequence of moves from that specific cut, and if that sequence actually reducesthe cut size, the moves are performed. If we can find such a movement after t steps, thatmovement will be performed. The result is, that after t + 1 executions of the main loop, thenew size of the cut St+1 will be less then St. If such a sequence is not possible, the algorithmterminates.In particular, we have that every sequence of moves that is performed, actually reduces thecut size. Since the optimal cut of any graph is at least 0, it is now clear that this algorithmwill terminate in all cases. ¥

When we look at the complexity, we observe that the number of movement sequences that areperformed is at most the initial size of the cut; in the worst case, every movement sequencereduces the cut size by one. In practice, we can expect much less iterations are needed;a minimal cut of size 0 is usually impossible (we will mainly have connected graphs), andespecially in the first iterations we expect to make good progress in reduction of the cut size.In every iteration, the following actions are taken:

• Calculate gains for all vertices (O(|V |2))• For every vertex:

– Move the vertex precisely once to the other cluster, and store the change in cutsize that movement yields (O(|V |) to find the vertex to move, and O(1) to movethe vertex)


– When the vertex is moved, re-calculate the gains for all unmoved vertices that areconnected to it (O(|E|))

• Determine the optimal sequence of moves (O(|V |))• Perform the optimal sequence of moves (O(|V |))

The initial cut size will be O(|E|), so we get a total complexity of:

O(|V |2) + O(|E|) ∗ (O(|E|) + O(|V |) ∗ (O(|V |) + O(|E|)) + 2 ∗O(|V |))=

O(|E|)2 + O(|E|)2 ∗O(|V |) + O(|V |)2 ∗O(|E|)

This algorithm only calculates one Minimal Cut. We want to recursively run this algorithmon the resulting clusters, until we have as many clusters of computational units as there areprocesses. This can be easily done as follows (assuming we have x processes, and there issome integer i > 0 such that 2i = x):

Algorithm RecursiveMinCut(V, NumberOfProcesses, CurrentDepth):

2 − Evenly and randomly s p l i t V among s e t s C1 and C2

4 MaxDepth = 2 log NumberOfProcessesMinCut(C1 , C2)

6

i f ( CurrentDepth ≤ MaxDepth) {8 RecursiveMinCut (C1 , NumberOfProcesses , CurrentDepth+1)

RecursiveMinCut (C2 , NumberOfProcesses , CurrentDepth+1)10 }

Running this algorithm for the initial set of all vertices, and with parameter CurrentDepthinitially set to 1, results in one set of vertices for each process.

If the number of processes is not a power of 2, the Recursive Minimal Cut algorithm cannot result in clusters that are of about the same size.

4.4.4 Adding parallelism

The algorithm as proposed in [7] and described above is a sequential algorithm. Since ourintended use of the algorithm is a pre-processing step in a parallel computation, we want tomake better use of the parallel processing capabilities of the system the algorithm is runningon. The initial computation of the gain function for each vertex is a computational step that


can easily be distributed among different processes. Therefore, we adopt this distribution ofworkload.

A possible approach for this distribution is described next. At each depth, all processesare synchronized, i.e. all processes are calculating Minimal Cuts for the same depth. Further-more, all Minimal Cuts at some depth are calculated concurrently, and for each Minimal Cut,an equal number of processes is assigned. The processes for each Minimal Cut are assignedvia a block mapping. In the remainder of this chapter, we assume that depths are numberedstarting from 0. At depth x, there are 2x Minimal Cuts to be calculated. By dividing the totalnumber of available processes through the number of cuts at the current level, the number ofprocesses available for each Minimal Cut is obtained.

More formally, at depth x, 2x Minimal Cuts are being calculated at the same time. Fur-thermore, we have that the number of processes working on a specific Minimal Cut equals|P |/2x. In particular, the i’th Minimal Cut on depth x is assigned to all processes that havea rank in the following range:

(i− 1)(|P |/2x) ≤ rank < i(|P |/2x)

When |P | = 16, the maximum depth of this Minimal Cut tree equals 3 (at depth 4, everyprocess has been assigned its own cluster). In Figure 4.1, the assignment of processes toMinimal Cuts at different depths are displayed graphically. In this picture, a node representsa Minimal Cut.

��

��

��

��

��

��

��

�

�

�

Figure 4.1: Assignment of processes to Minimal Cuts, |P | = 16

Every time the depth increases by one, the number of Minimal Cuts is doubled. As a result,per Minimal Cut, the number of available processes will be halved. However, the number ofvertices for which a MinCut is calculated is also halved, resulting in much better performanceper Minimal Cut, even when much less processes are available for the Minimal Cut. Themain reason for this is that the number of possible movements of vertices is radically reducedif the number of vertices is halved. Also, the initial size of the cut, which provides an upperbound on the number of iterations the algorithm requires, is expected to be much smaller ifthe number of vertices is halved.

Next, we consider what happens in a specific computation of a Minimal Cut. The firstprocess in a block that computes a specific Minimal Cut (the process with the lowest rank)


is designated to be the host process for that MinCut. This process splits the initial set intotwo (random) parts that are equal in size, and communicates these two sets to all processesin the block, such that each process knows which vertices are part of the computation, andhow the vertices are distributed among the two clusters.The host process does all the computation, except calculation of the gain function for thedifferent vertices. This is done because that operation can easily be done concurrently, asthere are no real data dependencies. We assume that every process has a local copy of alledges, i.e. every process can determine if two vertices are connected or not. We choose to letall processes with odd rank handle cluster C2, and processes with even rank handle clusterC1. The division of vertices for gain calculation among processes in one block is again donethrough a block mapping.

A process first has to identify the host process in its block, before it can receive the clus-ters. As mentioned, at depth x, 2x Minimal Cuts are being calculated at the same time, andevery Minimal Cut consists of |P |/2x processes. A process is a host process if the followingholds for its rank i:

i mod (|P |/2x) = 0

In that case, the process has to split the assigned set of vertices in two clusters, and send theresulting clusters to the next (|P |/2x)− 1 processes.

If the process is not a host process, it has to determine from which process it will receive thetwo clusters. If i denotes the rank of the process which we are considering, its host processwill be the process with the following rank:

(i div (|P |/2x)) ∗ (|P |/2x)

Given this information, communication of the cluster from the host process to the assistingprocesses can take place. Next, each process in a block (a block is the set of vertices assignedto one Minimal Cut, including the host process) calculates the gains for a part of the verticesin the two clusters. In order to keep matters relatively simple, the vertices from cluster C1

are assigned to the processes in the block with even rank. The vertices from cluster C2 areassigned to processes in the block with odd rank. What remains is to assign the specificvertices in a cluster to a proces, such that every vertex is handled by precisely one process.Again a block mapping is used, as this is the most straightforward solution. We assumethe vertices in each cluster can be uniquely identified with a positive integer, such that thevertices can be ordered, and it is possible to refer to the i’th vertex in a cluster. First, thenumber of processes that are assigned to clusters C1 and C2 need to be determined. Assumethere are y processes available to calculate gains for vertices in C1 ∪ C2, that is: y = |P |/2x

(x denotes the current depth). Then dy/2e processes (the processes in the block with evenrank) are assigned to cluster C1, and by/2c processes (the processes in the block with oddrank) are assigned to cluster C2.

Now assume the host process has rank h, and that the processes for which the assignment ofvertices needs to be determined has rank i. Furthermore, assume that there are x1 processesavailable for cluster C1, and x2 processes available for cluster C2. Finally, assume the vertices


in clusters C1 and C2 are numbered in the range 0 ≤ j < |C1| and 0 ≤ j < |C2| respectively.Note that such a numbering is trivial to obtain if the vertices can be ordered with theirrespective identifying numbers. The assignment of vertices to process i is then as follows:

• If i mod 2 = 0: the process is assigned vertices from cluster C1. There are z = (i−h)/2processes preceding this process in the block mapping. There is a possibility that thevertices can not be evenly spread among the processes. This possibility has to be takeninto consideration. There are two options:

– If |C1| mod x1 > z, process i is assigned the first (|C1| div x1)+1 vertices countedfrom and including the vertex with number (z ∗ ((|C1| div x1) + 1) in C1.

– If |C1| mod x1 ≤ z, process i is assigned the first |C1| div x1 vertices countedfrom and including the vertex with number (z ∗ (|C1| div x1)) + (|C1| mod x1 inC1.

• If i mod 2 = 1: the process is assigned vertices from cluster C2. There are z =(i− (h + 1))/2 processes preceding this process in the block mapping. Again there aretwo options, completely analogous to the assignment for C1:

– If |C2|mod x2 > z, process i is assigned the first (|C2| div x2)+1 vertices, countedfrom and including the vertex with number (z ∗ ((|C2| div x2) + 1) in C2.

– If |C2| mod x2 ≤ z, process i is assigned the first |C2| div x2 vertices countedfrom and including the vertex with number (z ∗ (|C2| div x2)) + (|C2| mod x2 inC2.

After a process has finished its part of gain calculations, it sends the results to the host pro-cess of its block. The host process obtains gain information from all processes, and combinesthis information with the gains it has calculated itself.

The host process then determines which vertices are to be moved, and adapts C1 and C2

accordingly, as in the sequential algorithm. If the Minimal Cut has not terminated, the hostprocess again communicates C1 and C2 to the assisting processes, gain calculation is doneagain by the different processes, and the host process again optimizes the Minimal Cut. Wechoose to let one process determine what vertices to move, as this is an operation that wouldtake a lot of communication and synchronization in order to distribute it, which would takemuch more time.

This process is repeated until the Minimal Cut computation terminates.Next, all processes are aligned again. Each process that hosted a Minimal Cut computationsends its cluster C2 to the process that is halfway its block (that is: if process x is a hostprocess, and there are a total of y processes available for each Minimal Cut computation,process x then sends cluster C2 to process x + (y/2)). Such a process receiving the verticesof cluster C2 will be a host process for computations at the next level. Next, the depth isincreased by one, and the process can start over. If the depth is beyond the maximum depth,all processes have one cluster of vertices, which is then the final assignment of vertices tothe processes. Each process then communicates to the other processes which vertices it hasobtained.


If the depth is not beyond the maximum depth, each process that either hosted a MinimalCut in the previous step, or has obtained cluster C2 from the process that hosted a MinimalCut computation in the previous step will now host a Minimal Cut computation.

The block mappings and distribution of the clusters only function correctly if the numberof processes is a power of two, so we put this restriction on the total number of processes.Note that, given its number and two sets of vertices, each process can determine by itself forwhich vertices it has to calculate the gain functions, and to which process the result has tobe sent. This results in no communication overhead to assign jobs to processes, and thereforein a faster algorithm

4.4.5 Performance

The described Minimal Cut algorithm has been implemented as a C++ program. For messagepassing between processes LAM/MPI is used.

In order to test the algorithm, we use various networks with different topologies. We testthe algorithm on cycles, binary trees, random networks and scale-free networks. As theload-distribution algorithm is only used as a pre-processing step for the boolean network sim-ulation, all test results are included in Appendix A. We will shortly discuss the results here.

The goal of the algorithm is to limit the number of external connections (connections fromone process to another), at the cost of a high number of internal connections (connectionswithin a process). For some network topologies, this is easier to achieve then for others. Forexample, a cycle can easily be distributed among processes in such a way that every processhas precisely two outgoing connections. A random network, however, will in general not allowsuch a good assignment. The algorithm usually does not find the optimal solution, but itperforms quite satisfactory. In particular, it performs much better then a random assignment.Since we are mainly interested in scale-free networks, we will discuss results on such networksin more detail here.

In Appendix B, various histograms are shown. These histograms show the number of in-ternal and external connections for different processes that the Minimal Cut algorithm yields.Looking at the histograms, we observe large fluctuations in the number of internal connections,and smaller fluctuations in the number of external connections. Since an even distributionof the number of external connections is most important for the performance of a parallelalgorithm, we can conclude that the algorithm performs satisfactory in most cases. However,in some cases (such as runs 9 and 10), the differences in number of connections can becomequite large. The most likely reason for this is the occurrence of hubs in scale-free networks.Our algorithm has no specific mechanism to assign the hubs to different processes. Since weconsider every vertex to have equal computational requirements, there is a good probabilitythat two or more hubs are assigned to one process. This process then automatically gains alarge amount of connections. It would perhaps be better to determine what vertices are the|P | largest hubs, split them evenly among the clusters, and keep their position fixed. Thisway, all of the |P | processes receive precisely one of the |P | largest hubs. The vertices withfew connections can then be moved between the clusters in order to keep them in the samecluster as the hubs to which they are connected. Although this approach might result in a


significant improvement, we did not implement it.

4.4.6 Improvements

The described algorithm only considers the gains for moving single vertices. One might ex-pect the algorithm to perform differently (and perhaps better) if the gains for moving a setof vertices are considered.Weakening the requirement that both resulting clusters differ at most one vertex in size, mightalso result in different behaviour. If one particular vertex results in much communication fora process, letting some other process do a little extra work by including that specific vertexmight result in better performance. The proposed algorithm enforces strict load balancing.

Since the goal of the algorithm is to limit the number of external connections per process,it might be interesting to include the current number of external connections of a cluster inits weight. We then move vertices from the cluster with the most external connections. Ofcourse, the difference in size of the two clusters should be limited.

The restriction on the number of processes that is the result of the algorithm (the numberof processes has to be a power of 2) is another fact that can be altered. The Kernighan-Linheuristic we use is suitable for two clusters. It can easily be extended to more clusters, how-ever. If one needs n clusters in total, expanding the heuristic to n clusters might be a goodsolution. There are some problems to this approach, however. The first problem is decidingto which cluster a vertex should be moved. There is a possibility that it is clear which vertexshould be moved, but that there are several clusters to which the vertex can be moved, eachresulting in the same cut size. A simple solution would be to choose a random cluster fromthe candidates.There is also a problem related to load balancing: suppose the best candidate cluster for avertex already has a very high load. In that case, a choice has to be made whether good loadbalancing is more important then a small cut size.A third issue related to the Kernighan-Lin heuristic for multiple clusters lies in the lockingof vertices. In the two-cluster approach, a vertex is locked (meaning it can not be movedanymore) if it has been moved once, mostly to prevent an infinite sequence of moves. In ascenario with multiple clusters, it might be a good idea to lock a vertex only if it has been inall clusters, or in at least some part of the clusters. Allowing a small number of movementsfor each vertex might result in poor cuts, whereas allowing a high number of movements foreach vertex probably results in bad running times. A sensible choice has to be made for thisproblem.

The multiple-cluster approach clearly has some issues that need to be resolved. It is a muchmore complex approach, as the different number of assignments of vertices to clusters, aswell as the possible number of movement sequences is much larger then in the two-clusterapproach.


4.5 Boolean network simulation

4.5.1 Introduction

In the remainder of this chapter, we will describe a parallel algorithm for the simulation ofBoolean networks. The Recursive Minimal Cut algorithm we described earlier can be used asa pre-processing step for the Boolean network simulation. A formal description of Booleannetworks, including an example, can be found in chapter 3.

4.5.2 Algorithm

Since Boolean networks have a state space that grows exponentially with the number of ver-tices, the computation time readily becomes too large for single processor systems. Besidescomputation time, memory requirements can be expected to become very large as well, evenfor moderate networks, as a history of states must be kept in order to determine whethersome state has been visited before.

In order to obtain shorter computation times, we propose a parallel algorithm. Besides anincrease in computational capabilities a parallel system can offer, it also offers the advantageof extra memory, since each node of the parallel computer has its own memory. Since a par-allel cluster usually consists of several relatively standard computers, the combined memoryof all these systems is available, which can become quite large. For a single processor system,having a memory larger then 4 GB would result in a need for non-standard hardware, andtherefore, much larger costs. The memory offered by a parallel cluster is therefore more cost-efficient per MB. Unfortunately, the memory in a parallel cluster is a distributed memory,which is a disadvantage when we consider the intended use.

The reason that such a distributed memory is inconvenient lies in the way state transitionsare made. To determine the new state of a vertex, the states of all its neighboring verticesmust be known. If the vertices and their states are split over several processes, a sharedmemory would result in easy access to the necessary information to make a state transition.A parallel cluster only offers a distributed memory, however, which forces explicit synchro-nization between processes to be part of the algorithm, as there has to be a mechanism forvertices to retrieve the states of their respective inputs.

However, distributing the state space also has advantages. Again, we assume that the booleannetwork is defined as a graph G = (V, E), and we use the notation described in section 4.2.As mentioned, the global state space has size 2|V |. If we distribute the vertices evenly amongall processes, and let each process store only the state space of its local vertices, we see that

each process has to store a maximum state space of 2|V ||P | bytes (assuming the state of one

vertex is stored in one byte). This yields a total memory requirement for the algorithm of

|P | ∗ 2|V ||P | bytes (assuming the state of one vertex can be stored in one byte of memory). Of

course, the state space needs to be stored in a somewhat smart manner, as the order in whichthe local states are traversed is unknown. A simple and effective solution is to store eachunique local state precisely once, and for each state store a list of steps in which that statewas reached. This results in a negligible overhead.

4.5 Boolean network simulation 23

Even for small P , we have that |P | ∗ 2|V ||P | is much smaller then 2|V |. Hence, the total amount

of memory needed by all process together in a distributed memory implementation is signif-icantly smaller then the total amount of memory needed in a single processor system. Ofcourse, the memory requirements will still grow exponentially with the number of verticesin the parallel implementation, and such a memory requirement is still unfeasible for largenetworks that either have very large basins of attraction, or have large state cycles. However,we expect that the networks we are interested in converge fast enough to a stable state or astate cycle.

When the vertices are split among different processes, there will be a distinction betweenthe local state of a process (the states of all vertices assigned to that process) and the globalstate (the states of all vertices in the Boolean network). The global state equals the combinedlocal states of all processes. We now introduce some notation, to be able to refer to localstates of processes and the global state of the network. We let the current state of all verticesof some process p ∈ P be denoted by a vector sp, with p ∈ P . Such a vector is a sequenceof the form {0, 1}∗, with length equal to the number of vertices assigned to process p. Theglobal state S of the network is then defined as the following sequence:

S = (s0, s1, ..., s|P |−1) = (sp)p∈P

In order to be able to identify the states after different number of steps, let sp,i, with p ∈ Pand i ≥ 0 be the state of process p after i steps. The global state of the network after i steps,Si is then defined as follows:

Si = (s0,i, s1,i, ..., s|P |−1,i)i≥0 = (sp,i)p∈P

Now we can state that the global state of the network after n steps has been visited previouslyif and only if the following holds:

〈∃m : 0 ≤ m < n : Sm = Sn〉We now make the following important observation, which is the basis for our parallel algo-rithm:

〈∀m,n : 0 ≤ m < n : Sm = Sn ≡ 〈∀p : p ∈ P : sp,m = sp,n〉〉What this formula states is that if the global state of the network after m steps is equal to theglobal state after n steps, then the local state of every process after m steps is equal to thelocal state of that same process after n steps. Important for our purposes is that it also holdsthe other way around, that is: we can determine whether the global state has been visitedbefore using only information on the local state history of processes. In particular, in order tofind such a solution for m, it is not necessary to compare the local states of different processesto each other; it suffices to compare candidate values for m that each process returns.Now, assume that each process has stored its local state history, that contains all local statesthat have been visited in previous steps, and let x denote the total amount of transitionsmade in the network. Furthermore, assume that each process can determine, for a specificlocal state, the set of step numbers for which that specific local state was reached.


Then we have that the state of the network after x steps has been previously visited atthe steps that are part of the following set:

⋂

p∈P

{i : 0 ≤ i < x ∧ si,p = sx,p}

So, when there is some natural number i < x for which all processes have a local state afteri transitions that is equal to the local state after x transitions, the global network is in astate that it has visited previously. Now, when a stable state has been found, we know fromthe definition of a stable state that i = x − 1. Otherwise, a state cycle has been found,which has size x − i. If the simulation is stopped once a previously visited state is reached,it is also clear that there are either zero or one values for i that satisfy the stated requirement.

Next, the parallel algorithm is presented. We assume that each process has a unique numberi in the range 0 ≤ i < |P |, and that each process has been assigned a subset of the vertices,such that every vertex is assigned to precisely one process. Furthermore, each process hasenough information to determine which vertices have an incoming connection to each of itsown vertices.

SimulateBooleanNetwork ( ) {2 int Step = 0 ;

//Determines the number o f t r a n s i t i o n s made4

bool Stab le = fa l se ;6 //Determines i f a s t a b l e s t a t e / bas in was found

8 int Stab leStep ;// Stores the stepnumber f o r which

10 //a s t a b l e s t a t e / bas in was found

12 array Loca lS ta t e s [ ] ;// Stores h i s t o r y o f t r a v e r s ed s t a t e s

14

while (not Stab le ) {16 Loca lS ta t e s [ Step ] = CalculateNewLocalState ( ) ;

// I f Step==0, t h i s means s e t t i n g the i n i t i a l s t a t e18

f o r a l l p r o c e s s e s i :20 Communicate new l o c a l s t a t e o f i

to a l l other p r o c e s s e s ;22

Set I d en t i c a l S t e p s ;24

for ( int i =0; i<Step ; i++) {26 i f ( Loca lState [ i ] == Loca lState [ Step ] ) {

I d en t i c a l S t e p s . add ( i ) ;28 }

}


30

Communicate I d en t i c a l S t e p s through t r e e ;32 //See be low fo r exp l ana t i on o f t r e e procedure

34 i f ( Step mod s i z e == rank ) {Calcu la te Stable−value from re c e i v ed data ;

36 Stab le = Calcu lated value ;i f ( Stab l e ) {

38 Stab leStep = Common value r e c e i v ed ;//Note : t h e r e i s a t most one common va lue

40 }Send Stable−value down through t r e e ;

42 }else {

44 Receive new value for s t ab l e through t r e e ;i f (One value r e c e i v ed ) {

46 Stab le = true ;}

48 //The root o f the t r e e i s Step mod s i z e , see be low}

50

52

}54

//A s t a b l e s t a t e / s t a t e c y c l e was found56 i f ( l o c a l p roce s s was root in t r e e ) {

i f ( ( Step − Stab leStep ) == 1) {58 pr in t ” Stab le s t a t e found a f t e r ”

<< Stab leStep << ” s t ep s ” ;60 }

else {62 pr in t ” State cy c l e found a f t e r ”

<< Stab leStep << ” s t ep s ” ;64 pr in t ” S i z e o f the cy c l e i s ”

<< Step − Stab leStep ;66 }

}68 }

Next, we will explain the algorithm. In lines 2-12, several variables are declared. IntegerStep denotes the number of state transitions the algorithm has made. Boolean Stable denoteswhether the algorithm has reached a steady state or state cycle. This is the case when thealgorithm makes a transition to a state that has been visited previously. Integer StableStepstores the number of steps after which a familiar state has been found. This variable is usedto determine the size of a state cycle at the end of the algorithm. The array LocalStatesstores all local states that a process has visited. Initially, this array is empty. A very simple


(and inefficient) way to store the local states is used in this description, as it keeps matterssimple. The approach is to dump each local state at the end of the array. Finding a localstate can then be done using a linear search. A better solution is is to store each unique localstate once per process, and keep a list of step number per state for which that local state wasobtained. When looking up the local state history, the required list of step numbers is thenfound when the matching state (if any) has been found.Next, the simulation begins, at line 15. The loop terminates only when variable Stable istrue, meaning that a stable state or cycle has been found. At the start of every execution ofthe loop (line 16), each processes calculates and stores its new local state. If Step equals 0,this implies that the local state has to be set. Next each process communicates its new localstate to the other processes (lines 19-21), so that each process can calculate the new states ofits vertices in the next iteration of the loop.After the communication is done, each process determines the previous step numbers forwhich the same local state was reached as the local vertices are currently in. This is donein lines 25-29. Next, these sets of step numbers are sent through a tree-structure, which isexplained in section 4.5.2. The result is that one process can calculate an intersection of allthe sets of step numbers that the local processes have calculated, which is done in lines 34-39.This process then sends the resulting intersection to all other processes (line 41). All otherprocesses receive the intersection (line 44). If this intersection is not empty, it is then clearthat the current global state of the network has been visited previously, and therefore, thesimulation can be stopped by setting the value of Stable to the value of the number in theintersection (line 45). Since the simulation is stopped after a previously visited state has beenreached, the intersection will contain at most one value.In lines 55-67, the process that calculated the state intersection determines the size of thebasin of attraction and state cycle, and writes this information to the screen. The root ofthe tree needs to do this, as this is the only process that has the correct value for variableStableStep.

In the algorithm, a tree-structure is used to gather information on local identical states thatwere previously reached, and to communicate information if a global stable state was found.In order to obtain better load balancing, a different root is appointed for the tree in everystep. The root of the tree in step x equals x mod |P |.

A slight improvement

When considering the theory of Boolean networks and the algorithm, we can make a simpleobservation: there will be precisely one step in which we find a previous reached state that hasthe same global state as the current step. After that step, the algorithm terminates. In otherwords: sending the new value of Stable down through the tree is a waste of time, except forone case. Assuming that the total required number of steps before the algorithm terminatesis not very small, distributing the new value of Stable down the tree in every round will bea waste of communication time. Therefore, we propose not to communicate the new value ofstable in the current step. Instead, the root of the tree (which is the only process that has thenew value of stable) will include the new value in the next step, when it communicates thenew state of its local vertices. This yields a negligible communication overhead, but comes atthe cost of making one more step then needed. However, if a stable state has been found inthe previous step, there is no need to calculate local identical states in history again. Luckily,


the new value for Stable is available before the tree-communication procedure is executed.Therefore, an extra check is introduced halfway the procedure, which prevents performingunnecessary operations if a stable state has been found.

The improved algorithm then looks as follows:

SimulateBooleanNetwork ( ) {2 int Step = 0 ;

bool Stab le = fa l se ;4 bool TempStable = fa l se ;

6 int Stab leStep ;

8 while (not Stab le ) {

10

; Ca l cu la t e new l o c a l s t a t e ;12 // I f Step==0 t h i s i s the i n i t i a l s t a t e

14 ; S tate [ Step ] = New l o c a l s t a t e ;

16 i f ( Step >0) {//This i s the normal procedure :

18 // (We make a d i s t i n c t i o n f o r the f i r s t s t ep// s ince the tree−procedure i s not necessary

20 // in the f i r s t s t ep )

22 //The new s t a t e must be d i s t r i b u t e d// throughout the netwerk

24

// I f t h i s proces s was roo t in the26 // prev ious s tep , i t must i n c l ude

// the va lue o f TempStable in i t s s t a t e28

i f ( rank == ( Step−1)mod s i z e ) {30

Di s t r i bu t e new s t a t e and va r i ab l e32 TempStable through network ;

Acquire new s t a t e s from other p r o c e s s e s ;34 }

else {36 Di s t r i bu t e new s t a t e through network ;

Acquire new s t a t e s from other p r o c e s s e s ;38

Set va r i ab l e TempStable to value o f40 TempStable r e c e i v ed from proce s s

that was root in prev ious s tep ;


42 }

44

i f ( TempStable ) {46 Stab le = true ;

}48 i f (not Stab le ) {

//When a s t a b l e s t a t e was found , t h e r e50 // i s no need to see i f another s t a b l e s t a t e

//was found52

i f ( rank == ( Step mod s i z e ) ) {54 //This proces s i s the roo t o f

// the t r e e t ha t c o l l e c t s s t a b l e s t a t e s56

Co l l e c t s t ab l e s t a t e s from ch i l d r en ;58

Compare s t ab l e s t a t e s from ch i l d r en to60 own s t ab l e s t a t e s with s e t intersection ;

62 i f ( Stab l e s t a t e has been found ) {TempStable = true ;

64 Stab leStep = Result from se t intersection}

66 }else {

68 //This proces s i s a normal node in the// t r e e t ha t c o l l e c t s s t a b l e s t a t e s

70

Co l l e c t s t ab l e s t a t e s from both72 ch i ld ren , if a pp l i c ab l e ;

74 Compare s t ab l e s t a t e s fromch i l d r en to own s t ab l e s t a t e s ;

76

Send o v e r a l l s t ab l e s t a t e s to parent ;78 }

}80

else {82 //This i s the i n i t i a l s t ep

84

Di s t r i bu t e the i n i t i a l s t a t e through network ;86 Acquire i n i t i a l s t a t e s from other p r o c e s s e s ;

88 // In the f i r s t s tep , i t i s impo s s i b l e


// t ha t a s t a b l e s t a t e has been found ,90 // so we do not need to send data

// through a spanning t r e e92

// Var iab l e s S t a b l e and TempStable94 //have been i n i t i a l i z e d to f a l s e ,

//and need not to be changed96

}98

Step++;100 }

102

//A s t a b l e s t a t e or bas in o f a t t r a c t i o n has been reached104 //This s t a b l e s t a t e was found at s t ep wi th number Step−2

//Hence , proces s wi th rank = ( Step−2)mod s i z e has the106 // stepnumber on which the s t a b l e s t a t e was reached :

108 i f ( rank == ( Step−2) mod s i z e ) {i f ( Stab leStep = ( Step−3)) {

110 // S t a b l e s t a t ep r i n t l n ” Stab le s t a t e reached a f t e r ”

112 << Step−3 << ” s t ep s ” ;else {

114 //Basin o f a t t r a c t i o n was foundp r i n t l n ” State cy c l e o f s i z e ”

116 << ( Step−2) − Stab leStep << ” reached a f t e r ”<< Stab leStep << ” s t ep s ” ;

118 }}

120 }

The algorithm can also be improved by letting each process communicate only the stateof vertices that serve as input to a vertex that is not assigned to the local process. Moreformally, if the communication dependencies are represented by graph G = (V, E), and theset of vertices Vi ⊂ V is assigned to process i, process i only communicates the states of thefollowing set of vertices:

{j|j ∈ V ∧ j /∈ Vi ∧ i ∈ Vi ∧ (i, j) ∈ E}

Since the vertex states are broadcasted throughout the network, it is not relevant to considerthe exact processes that use vertex i. The algorithm sends all vertex states needed by atleast one other processes to all other processes. This small improvement should significantlyreduce the amount of data that is communicated, as the Recursive Minimal Cut algorithmis used to distribute the vertices in a smart manner. That is: data dependencies betweendifferent processes are relatively small.


Tree algorithm

As mentioned, a tree algorithm is used to communicate the step intersections of local pro-cesses, in order to combine them into a global step intersection. With this information, it ispossible to determine whether a state cycle/stable state has been found. The root node ofthe tree will be the last node to finish this procedure. For the structure of the tree, we usea binary tree. This solution is chosen as it allows the most efficient communication of datafor the number of processes we use. Recall that the Recursive Minimal Cut procedure onlyallows a number of processes that is a power of 2. In particular, if more processes need to beadded, the number of processes must be doubled. Since it is not possible to put all processesin a binary tree that has, for every node in the tree, the same number of children in the leftsubtree and the right subtree, we choose to let the one remaining process be attached to theroot node as well, so that the root node receives data from three processes. An example ofsuch a tree is shown in Figure 4.2.

�

��

��

��

�

�

� ��

�

�

��

��

Figure 4.2: Binary tree of size 16 with process 0 as root, step 0

Now, when processes are added to the tree (that is: the number of processes is doubled),observe that all leafs of the tree obtain two new children, such that the tree remains bal-anced (except for the one node below the root). The advantage of this property is that itonly requires the additional time of two communication actions to include the informationof the additional processes in the collection procedure. This follows from the fact that everynode has to receive data from all of its children. In this case, it means two children in allcases but the root. All nodes at the same level can perform the receiving of data in parallel,which implies that it is possible to go up one level in the tree while using time equal to twocommunication actions. Only the step to the root node takes three actions, as the root hasthree children. However, if more then four processes are available, the first of these threecommunication actions can be done while communication is happening further down the tree.

If a tree-structure with more then two children per node is chosen, adding processes wouldresult in slightly worse scaling behaviour, as other tree-structures can not be as well-balancedfor our numbers of processes as a binary tree structure can. If less then two children per nodeare allowed, there can be no parallelism in communication. Clearly, that is not a good option.

Now the data collection process itself can be described. In the collection process, each processgathers step numbers from its children (if any), and then calculates the intersection of thesestep numbers, combined with its own step numbers. The result is then sent to the parent ofthe node (if any).

In order to have computation loads evenly spread among the processes, a different root node


is appointed for each step. If x denotes the current number of steps traversed by the networkand |P | the number of processes, we choose to let process x mod |P | be the root of thetree. It turns out that a rotating root process, when compared to a fixed root process, resultsin a performance increase of more then 100%: in the same time, the algorithm is able tosimulate more then the double amount of steps. The most likely explanation for this is theposition the root node has in the next step. Since the root node is the last node to finishthe data collection procedure, we choose to make it the node with the least work in the nextstep, which is the position directly below the root node. That specific process only has togo through its local state space, and then send the result to the root node. It is now alsoclear that communication times can have a significant impact on performance of the algorithm.

For a network with 16 processes, the assignment of processes to tree nodes are displayedin Figure 4.3 and Figure 4.4.

�

��

�

��

�

�

� ��

�

�

��

��

�

��

�

�

�

�

� ��

�

�

��

��

Figure 4.3: Binary tree of size 16, steps 1 (left) and 2 (right)

��

��

��

��

�

�

� � ��

� �

�

��

��

�

��

��

��

�

�

� ��

�

�

��

��

Figure 4.4: Binary tree of size 16, steps 15 (left) and 16 (right)

Note that the tree for step 16 is the same as the tree for step 0, as there are 16 processes inthis example.

An alternative approach to store the state history

Instead of the proposed strategy for storing the state space, a different approach is also pos-sible, where the traversed states are stored using a binary tree. In such a tree, we let the leafsrepresent global states of the network, such that every state is a leaf of the tree. Every nodeand leaf in the tree can be marked as “not reached” or “reached”. Initially, all nodes andleafs are marked as “not reached”. When some state is reached, the corresponding leaf x ismarked “reached”. Now let node p be the parent of x, and let node y be the other child of p.We then mark parent node p “reached” if and only if y is marked as reached as well. Leafsx and y can then be removed from the tree. If we marked p as “reached”, we continue to


parent p′ of p, and check whether the other child of p′, y′ is marked “reached” as well. If thisis the case, we mark p′ as “reached” and remove p and y′ from the tree. This way, we recurseupwards through the tree, until we find a parent that may not be marked “reached” usingthis rule. In this case, we stop removing nodes from the tree and the recursive procedureterminates.

Such a tree can be stored quite easily and efficiently, by storing the depth of the lowestnodes or leafs that are marked “not reached”. (Or, we can store a depth-number, togetherwith the number of leafs/nodes directly followed by that leaf/node that are at the same depth,resulting in very efficient storage at the beginning of the simulation).If we store the depths as they occur from left-to-right in the tree, the number of depths thatneed to be stored will also decrease in size as we have visited more states. Using this list,we can reconstruct the tree. Furthermore, given some state, we can check whether that statewas reached before in the reconstructed tree by checking whether its leaf is still present or,alternatively, by checking if any of its parents is marked as “reached”.

This method offers a very efficient way to store the state space. However, it also has dis-advantages. The first, and most important, disadvantage is that one process has to maintainthe complete administration of the tree. This is not really a problem in terms of memory,but it is a problem in terms of communication. If one process has to store the entire tree, thestates from all processes somehow have to reach that process so they can be stored. For largenetworks, this would result in large amounts of data being sent to one process, and wouldmost likely yield an algorithm that uses most time on communication instead of computation.Furthermore, this process would be the only process that has to compare the current stateto the state history, resulting in very bad load balancing. Such a solution can be expected toperform badly.The second disadvantage is that we can use the tree to determine which states were reachedpreviously, but not when a specific state was reached. Since we are interested in the numberof steps needed to reach a state cycle or stable state, we would have to find some alternativeapproach. We did think of a work-around for this problem. If we reach a familiar state afterx steps, we start a special routine. This routine continues to simulate the network until weagain reach the same state that was reached after x steps. Let the number of steps needed toreach the familiar state again be denoted by y (so, when starting at the state after x steps,the network needs to make y state transitions to reach the state after x steps again). Then wehave that the size of the state cycle equals y. Furthermore, we can then conclude that the sizeof the basin of attraction was x−y, since after x steps, we went through one entire state cycle.

When we have very small state cycles, combined with networks of moderate size, this ap-proach yields only a minor overhead. However, we do not know how large the cycles are, anddoing the same work twice seems like a waste of time. Also, the memory of all but one nodein the cluster is not really used, which we also consider to be a waste of resources. Therefore,we reject this approach.

One might suggest that we can let each process store its local state history using this binarytree approach. Unfortunately, this is not an option. Each individual process has to be ableto reconstruct a list of step numbers for which the local vertices where in some specific state.As mentioned above, the binary tree approach does not provide such a possibility, and can


therefore not be used in a distributed manner.

4.5.3 Performance expectations

As mentioned, we expect the parallel implementation to be able to tackle much larger networksthan a sequential algorithm. In order to determine the speed of the parallel implementation,we have to make some assumptions on the time needed for communication and computation.We assume the following:

• A communication operation takes constant time. Although this is in general not true,the assumption will hold for our algorithm. Every time the algorithm makes the nextstate transition, each process has to communicate the state of local vertices that otherprocesses need, as well as the step numbers that resulted in the same local state as thecurrent local state. The number of vertex states sent in every iteration is constant. Theamount of step numbers sent is not a constant, but it can be expected to be negligiblewhen compared to the number of vertex states that are being sent. Therefore, we havethat in every iteration of the main loop, we have to send about the same amount ofdata, resulting in a constant amount of communication time in every iteration of themain loop.

• Calculating a new state for a vertex takes time proportional to the number of inputs ofthat vertex.

• Communication operations take T milliseconds on average.

We only have to look at the differences between a parallel and a sequential algorithm for everystep to determine the break-even point between the two.Now, when we look at the algorithm, we see that in every step, each process must communicateits new local state, and each process must communicate its history of previous step numberswith identical local states as the current step. Communication of the local states can not bedone in parallel, so this takes |P |∗T milliseconds. Communication through the tree, however,can be performed in parallel. Since we choose to use a binary tree, we make the followingobservations:

1. Between two levels in the tree, we need two times the communication cost of one op-eration, as the parent has to receive data from two processes, and this has to be donesequentially.

2. The number of levels in the tree equals log2 |P |, so the effective number of levels thatmust be traversed equals (log2 |P |)− 1

With these observations, we conclude that communication time for one step equals (2 ∗(log2|P | − 1) + |P |) ∗ T milliseconds.

This is the time needed to communicate the local states, plus the time needed to communi-cate the state history of identical states. A sequential algorithm obviously does not admitcommunication time. However, a sequential algorithm must compute the new state for everyvertex, and go through the state history of the global network. We are interested in theproperties of networks that can be simulated faster using the parallel implementation.


A sequential algorithm would need to compute the new local state, and traverse the en-tire state history to find step numbers with identical states. This would take |E| to computethe new local state. The time needed to compute the state history is proportional to the num-ber of vertices multiplied by the current step. The parallel algorithm can do both operationswith a speedup factor proportional to the number of processes.

The parallel algorithm has another advantage in traversing the state history. A sequentialalgorithm will find a state it has already been in only when it has reached a stable state, orit has been through a state cycle, and has reached the first state of that cycle again. (Herethe first state is the state of the cycle that was visited first). The parallel algorithm however,has the advantage that the state history is split among different processes. Each process goesthrough the state history of only its own vertices. We can expect that processes will reachseveral local states that have been visited before by the process, but without reaching a globalstable state or state cycle. This results in the advantage that we can maintain a list of stepnumbers that resulted in some local state. Given this strategy, we can reduce the memoryrequirements, as we do not need to store each unique global state completely; we only needto store unique local states. Furthermore, we can adopt a more efficient search strategy togo through the local state history; once we find an identical state, all we have to do is takethe list of step numbers associated with that state, and we have the list of step numbersthat resulted in that local state. (Of course, we need to add the current step to that list).Clearly, this can significantly reduce the time needed to go through the state history. Theexact speedup, however, is very difficult to predict, as this effect is determined by the slowestprocess. Therefore, in our calculations, we ignore this effect.

Assuming we have x vertices, with an average of y connections per vertex, and we havealready made z steps, then we have for the sequential algorithm that in every step we needto calculate the items mentioned in the table.

Action OperationsCalculate new state space xyGo through state history xz

Total x ∗ (y + z)

Now for the parallel implementation. We assume the same properties for the network. Weassume T = 1000 microseconds, which is a very high estimate, given the results of thebenchmark of the cluster in [12]. We deliberately choose a very high estimate, as the processescan not be expected to be perfectly aligned at the time of communication; if one process wantsto send data, it can only do so if the receiving proces is at its receiving statement. Clearly,such an alignment of two processes will in general take some additional time to establish.We also assume that we have 32 processors, each with 3 GHz, which we need to convertcommunication time to computation units. Then we have the following:


Action OperationsCalculate new state space xy

32Go through state history xz

32Communicate state space 32 ∗ 1000 µs = 96 ∗ 106

Communicate state intersections 2 ∗ ((log2 32)− 1) ∗ 1000µs = 24 ∗ 106

Total x∗(y+z)32 + 120 ∗ 106

So, the break-even point is at 3132(x ∗ (y + z)) = 120 ∗ 106 = 1, 20 ∗ 108. Now, if we assume we

have a network of 10000 vertices, where each vertex has an average of 12 connections, thenwe can determine the number of steps z after which the parallel algorithm becomes faster:

3132(x ∗ (y + z)) = 1, 20 ∗ 108

3132(10000 ∗ (12 + z)) = 1, 20 ∗ 108

10000 ∗ (12 + z) ≈ 1, 23 ∗ 108

12 + z ≈ 1, 23 ∗ 104

z ≈ 12300

When we have 25000 vertices, again with an average of 12 connections per vertex, it looks asfollows:

3132(x ∗ (y + z)) = 1, 20 ∗ 108

3132(25000 ∗ (12 + z)) = 1, 20 ∗ 108

25000 ∗ (12 + z) ≈ 1, 23 ∗ 108

12 + z ≈ 4, 92 ∗ 103

z ≈ 4, 92 ∗ 103

So we expect the parallel algorithm to scale well with the number of vertices. In fact, theperformance increase will be slightly better then suggested, due to the advantage we gain bythe distributed storage of the state space. As mentioned, we ignored this advantage in ourcalculations, as its effects are difficult to predict.

We remark that the described calculation results in a very rough estimate of the expectedperformance. For example, we assume a processor can perform one operation per clock cycle,which is unrealistic. However, since we only want to have a rough idea of what performancewe can expect, we see no reason to include more detail in the calculations.

Of course, we are interested in how well the algorithm works in practice, when comparedto a sequential algorithm. The results exceed the expectations. As input, we generate ascale-free graph of 2000 vertices, where each vertex has an average of 12 connections, and weuse completely random boolean functions. When we simulate such a network for 2 hours usinga sequential algorithm on one processor, and for 2 hours on a cluster with 32 processors (on 16nodes, each processor handles one process), we observe that the sequential algorithm is ableto go through somewhere between 8500 and 9000 states. The parallel algorithm, however,can simulate between 47500 and 48000 steps. We should remark that the used sequentialalgorithm consists of the parallel algorithm for the simulation on one process, where the com-munication has been stripped from the algorithm. Of course, the algorithm is not designedto run sequential; an optimized sequential version would perform much better. Since we have


no optimized sequential version available, such comparisons have not been made.

When considering the described test results, the first impression is that the parallel algo-rithm performs quite acceptable: it appears as if we get a performance increase of about500%. However, performance is in fact much better. The explanation for this claim lies in thesearching of the state history. Recall that the algorithm stores all previously visited states.For every state transition made, the algorithm has to do a lookup in the state history todetermine whether that state was visited before. As more states have been traversed, sucha lookup will require more time, as the history will be larger. In other words: as more statetransitions have been made, making a new state transition will become more expensive intime. A better way to measure performance is therefore to fix the workload, for example bysimulating a fixed number of state transitions for some network. Results for such performancetests can be found in Section 5.2.

We believe there are two major reasons for the performance measurements described aboveto exceed the expectations:

• Looking up a state in the local state history is much faster then expected; the searchstrategy used in the parallel implementation results in a bigger advantage then expected.Ignoring this effect in performance predictions turns out to result in estimates that aretoo low.

• Communication times are probably lower then estimated, resulting in much better per-formance then predicted.

Bad performance

When compared to a sequential algorithm, we can expect the parallel algorithm to performbad in the following cases:

• Small networks: for a small network, the communication overhead will be larger thenthe advantage of additional computational resources.

• Networks that converge very fast to a basin of attraction.

We believe the performance of the parallel and a sequential algorithm relate to each other asdepicted in Figure 4.5.


��

��

��

��

Figure 4.5: Comparison of expected performance: parallel and sequential implementation

The values on the axes are omitted, as they are dependent on several factors, such as thenumber of edges and number of processes. We reason as follows: the number of verticesare evenly split among the processes. When vertices are added, each process of the parallelalgorithm is assigned only a portion of the added workload. A sequential process howeverwill have to do computations for all the added vertices.Additional vertices also result in more time needed to traverse the state history. Again, theparallel algorithm has the advantage that each process only receives a portion of the addedprocess. The additional time required for traversing the state history will therefore be muchsmaller. Hence, we expect the parallel algorithm to scale better when vertices are added.As more vertices are added, traversing the state history will begin to take a larger portionof computation times, until it will almost fully dominate computation times. At that point,performance scales very badly with the addition of vertices. Hence the curved lines.The performance for making additional state transitions will also display the same behaviour.That is: the same picture, but with “Number of traversed states” as title for the x-axis and“Time for additional transition” on the y-axis.

Maximum problem size

When considering the parallel algorithm, we see that each process has to store the followingdata:

• Network administration (including information on connections)

• History of local states

• Copy of global state of previous step

Again, we ignore the advantage that the storage of the state history offers; we assume wehave to store every local state we encounter. Assume we have about 900 MB available for thestorage of the state history for each process. That is: each process has 1 GB of memory, andwe assume we need 100 MB for the network administration and global state space. Assumingthe following:


• We can store one vertex state per byte

• The network has 10000 vertices in total

• 32 processes are available

Then we have for the number of steps x we can simulate:

x ∗ 1000032 ≈ 900 ∗ 106

x ≈ 2, 88 ∗ 106

So, we can traverse about 3 million states before we run out of memory. Of course, this isa huge number; it will take a large amount of time before we have traversed such a numberof states. Simulation will likely be finished or aborted before we run out of memory, andtherefore, memory is considered not to be an issue when we have 10000 vertices. When wescale the size of the network with some factor, the number of steps we can simulate reduceswith the same factor. So, if we double the number of vertices to 20000, the number of stepswe can simulate is about 1, 44 ∗ 106 (assuming we can still store the network administrationin 100 MB).

Chapter 5

Results

In this chapter, we will discuss results obtained with the parallel algorithm. We use a C++implementation of the algorithm. For message passing, LAM/MPI is used.

This chapter is divided into two parts. In Section 5.1, we describe simulation results ofdifferent graphs. In Section 5.2, performance of the algorithm is tested and evaluated.

5.1 Behaviour of Boolean networks

In this section, simulation results of different networks and different Boolean functions aredescribed. A multitude of different approaches is described. Many questions arise, but notall of them are answered.We will mainly be concerned with behaviour of the different Boolean networks. Recall that aBoolean network always converges to a steady state or a state cycle in a finite number of steps.The states traversed from the initial state to the state cycle or stable state are referred to asthe basin of attraction. When simulating a Boolean network, we will mainly be interested inthe size of the basin of attraction, and the size of the resulting state cycle.

Results are arranged according to the Boolean functions that are used; in Section 5.1.1,we describe behaviour of networks that use completely random Boolean functions. In Sec-tion 5.1.2, we slightly change the random Boolean functions, by varying the probability forsuch a function to yield true or false.In Section 5.1.3 we describe results for simulations where the Boolean functions are definedas a weighted vote of the inputs. We define a metric to define the importance of an input.Section 5.1.4 contains results of simulations where we combine the weighted vote approachwith inhibitors. Finally, in Section 5.1.5, results are described with inhibitor edges, where theprobability for an edge to be an inhibitor is varied.

5.1.1 Random Boolean functions

We have performed some preliminary tests with the algorithm. We have generated severalscale-free networks with sizes of 2000, 5000, 10000 and 15000 vertices. We have also generatedrandom initial states and random boolean functions, that have for each specific combinationof inputs, an equal probability of yielding true or false. Compiling a source file with these

40 Results

functions turns out to be only feasible for functions that have up to 10 inputs. Since ournetworks contain vertices with more then 10 inputs, we choose to let vertices with more then10 inputs to calculate a new value through a layered structure: in each round we reduce thenumber of inputs until we have an input set of at most 10 inputs. In round x, every 10inputs are used as input to a boolean function. The result is an input for round x + 1. Ofcourse, we need to make sure the order in which we do this is deterministic, in order to obtaindeterministic behavior. This can be done by numbering the inputs in every round using adeterministic method. Every vertex is then assigned its own boolean function.

When running the algorithm using boolean functions that have an equal probability of yield-ing true or false, we observe that the networks do not converge to stable states in a reasonablenumber of steps: in more then 90% of the cases, the networks need over 50000 steps, evenfor (small) networks of 1000 vertices. However, based on [6], we would expect the number ofrequired steps to be about 2∗

√|V |. Clearly, something is wrong, either with the assumptions

on behavior of scale-free networks, or in our input.

5.1.2 Pseudo-random Boolean functions

Since very little is known on the behaviour and functioning of biological networks, we chooseto adapt the boolean functions. This choice comes from the fact that in [3], the authors arguethat biological networks have a scale-free topology. If the topology is fixed, the only thing wecan change is the boolean functions. We adapt the boolean functions by setting a parameterthat determines the probability for some combination of inputs to yield false. We have per-formed more tests on scale-free networks of various sizes and with different probabilities forboolean functions to yield false. For each combination of parameters, we perform ten runs,were in each run we generate a new network, new boolean functions for each vertex and anew initial state.

Our results show that the networks converge very fast to a steady state/state cycle, wheneverwe keep the probability for the boolean functions to yield false below 10%. The basins ofattraction are about 25 steps in size. Furthermore, we observe that networks with much morevertices display no significant difference in behaviour; a network of 2000 and a network of10000 vertices each go through about the same number of steps to reach a state that hasbeen visited previously. One might expect that this behaviour is a specific property of scale-free networks; a small number of hubs determine the behaviour of the network to a largeextent. However, when we perform the same tests on random networks, we observe the samebehaviour. The conclusion is that the described boolean functions determine the speed withwhich the networks converge to a steady state or state cycle.

Our results also show that, whenever we adjust the probability for the boolean functionsto become ”false” to anything above 10%, the networks suddenly start to require very largenumbers of steps to reach a familiar state.

We believe the reason for this behavior is the complete randomness of our boolean functions.In particular, we believe random functions are not a realistic model of genetic regulatory

5.1 Behaviour of Boolean networks 41

systems. The reason we come to this conclusion, is that in our functions, every input tosome vertex is equally important to its new state. It would seem more logical that there aresome ”dominant” inputs, that have a higher influence on the new state than other vertices.For example, the hubs in scale-free networks are likely be more important; if they have moreconnections, one might expect that they have something useful to contribute to the system.

In other words: we expect that real genetic regulatory systems are highly systematic, whereasrandom boolean functions obviously are not. By decreasing the probability for a function toresult in ”false”, we bring some order to the chaotic nature of the system.

5.1.3 Weighted vote

These observations suggest that we should think of different boolean functions to map to thevertices. We want our functions to take into account the importance of different vertices;as mentioned, one can expect that some vertices are more important then others in a bio-logical network. The approach we suggest for modeling such a function is the following: wedetermine the importance of a vertex by the number of outputs that vertex has. The reasonfor this choice is that, in a biological signalling network, an outgoing edge from vertex x tovertex y means that the neighboring vertex y is influenced by products of vertex x. The moreoutgoing edges a vertex has, the more important we consider that vertex to be.

The state of a vertex is then determined by a ”weighted vote” of all its inputs. Every vertex”votes” value 1 if its current state is true, and 0 otherwise. The value of each vote is thenmultiplied by the number of outputs the voting vertex has. The sum of weighted votes fromall inputs is then determined, and divided by the total number of outputs that all incomingvertices have. The result is a value between 0 and 1. A vertex is then set to ”true” if thatvalue exceeds 0.5.

There is the possibility that a vertex has no inputs, and such a vertex can therefore notdetermine its new state. In this case, we have to make an exception. We choose to set thestate of such a vertex to false.When this type of boolean functions is applied to either scale-free or random networks, weobserve that a stable state is reached in all cases (a cycle size of 1 denotes a stable state),and the size of the basin of attraction is less then ten. See the following table for the resultsof experiments with random networks and scale-free networks:

42 Results

Run Size of basin Size of cycle1 5 12 6 13 7 14 6 15 6 16 6 17 5 18 5 19 6 110 8 1

Table 5.1: Random network, 20000 vertices, avg. 20 connections per vertex


Table 5.2: Scale-free network, 20000 vertices, avg. 12 connections per vertex

Further tests indicate that adding additional vertices yields no significant change in behaviour.An important observation we make is that there is no real difference in conversion speedbetween scale-free and random networks. We would expect that the presence of hubs in scale-free networks would result in a difference in behaviour. When we look at the resulting stablestates, we see that either all vertices are set to ”false”, all vertices are set to ”true”, or onlyvertices without inputs are set to ”true”. Note that this last category consists of vertices forwhich we made an exception in the boolean functions.

5.1.4 Weighted vote with inhibitors

An important aspect of biological networks that is not incorporated in the described booleanfunctions is that of inhibitors: the presence of products of one gene suppresses the productionof proteins from another gene. As a first (very simple) abstraction from this property, weassociate a random boolean value with each vertex. The new state of that vertex is then equalto its random value if the resulting weighted sum from the vote exceeds 0.5, and is set to thenegation of its random value otherwise. When we simulate networks with these functions,we observe a very strange result: scale-free networks do not converge in reasonable time toa steady state or basin of attraction (for a network of 2000 vertices we are able to simulate


more then 50.000 steps without reaching a state that has been previously visited). A randomnetwork, however, admits a basin of attraction of size between 10 and 1000 steps. The sizeof the state cycles is between 1 and 100. See the following table:



About half of the vertices end up being set to ”true”, while the other vertices end up beingset to ”false”. This result is in accordance with our previous findings: the original functionsresult in either all vertices being true, or all vertices being false. If we invert the functionsfor half of the vertices, one might expect that half of the vertices end up in a different finalstate.Scale-free networks appear to converge slower. When simulating a scale-free network of 500instead of 2000 vertices, where each vertex has an average of 12 connections, we get thefollowing results:



We observe that both the size of the basins of attraction and the size of the state cyclesdisplay large fluctuations. More importantly, there appears to be no real pattern in theresults. When simulating a network with the same properties, but with 250 vertices, weobtain similar results, although the fluctuations are less extreme:

44 Results



5.1.5 Weighted vote with random inhibitor edges

Instead of letting a single vertex have inhibiting properties, it is more realistic if some vertexcan be an inhibitor to some of its outputs, and a normal output to the rest of its outputs.In order to include this better representation of inhibitors, we make another change in theway boolean functions are calculated. We assign a weight of either 1 or -1 to each edge. Theprobability for an edge to have weight -1 is a parameter that can be set. Again, we deter-mine the importance of a vertex by the number of outputs a vertex has. We now determinethe new state of the vertex as follows: for each input that is set to ”true”, we multiply itsimportance with its weight. We then sum these products for all obtained values. If the resultis larger then zero, the new state of the vertex is set to ”true”. Otherwise, it is set to ”false”.Inputs that are set to ”false” are ignored. This approach is inspired by neural networks theory.

Simulating networks with such functions yields results that demonstrate much less fluctu-ations. Therefore, average values have more meaning. In the next table, average results andstandard deviations of different networks are displayed. Averages are taken from 10 runs ofnetworks with properties as depicted in the table. The percentage of vertices set to true isthe average value of the first state of the cycle that was reached.


P(Inhibitor edge) Avg. size ofbasin

σ|basin| Avg. size ofcycle

σ|cycle| Avg. % ofvertices setto true

σ%

0 2.4 0.5 1 0 95.8 1.10.10 8.1 3.0 2.9 2.8 91.9 1.60.20 3.6 0.7 1 0 86.4 2.60.30 4.1 0.7 1.3 0.5 76.2 3.60.40 8.1 3.0 2.9 2.8 62.0 3.80.50 263.8 702.6 51.2 75.1 45.0 6.50.60 179.0 266.0 393.9 1152.8 30.4 2.10.70 702.2 1785.9 265.5 364.2 19.1 2.40.80 224.3 262.9 230.0 366.4 10.5 2.80.90 10.6 7.6 13.5 16.1 3.1 3.01 1 0 1 0 0 0


It is clear that the number of inhibitor edges has a large effect on the number of statetransitions needed. The largest fluctuations in the size of basins of attraction are observedwhen 70% of the edges are inhibiting. We would have expected to see the largest fluctuationswhen 50% of the edges are inhibiting. When 60% of the edges are inhibiting, the networksconverge faster to a steady state or state cycle then when either 50% or 70% of the edgesare inhibiting. However, when 60% of the edges are inhibitors, the average size and standarddeviation of the cycles are much larger. High standard deviations indicate large differencesbetween results of different runs.

We also observe a jump in sizes of the basins of attraction when switching from 40% ofinhibitor edges to 50% of inhibitor edges.

When all edges are inhibiting, the network converges to a steady state in one step, and inthat state all vertices are set to false. However, when no edges are inhibiting, the networkdoes not always converge to a steady state, and not all vertices are set to true in all cases.This is most likely the result of the exception made for vertices without inputs. All otherpercentages are almost linear in the percentage of edges that are inhibiting.

In Figure 5.1, the average basin size and its standard deviation are shown. Figure 5.2 showsaverage values of cycle size and its standard deviation. Figure 5.3 displays the percentage ofvertices set to true in the first state of the cycle that was visited.

46 Results

��

�

��

��

��

��

��

��

��

��

� ��

��

� ��

��

�

��

��

��

��

��

��

��

��

��

��

� ��

��

��

Figure 5.1: Average size of basin (left) and standard deviation (right) - scale-free network

��

�

��

��

��

��

��

��

��

��

��

� ��

��

� ��

��

�

��

��

��

��

��

��

��

� ��

��

��

Figure 5.2: Average size of state cycle (left) and standard deviation (right) - scale-free network

�

��

��

��

��

��

��

� ��

��

� ��

Figure 5.3: Average percentage of vertices set to true - scale-free network

The large fluctuations in the sizes of basins and cycles are quite unexpected. Since thenetworks are generated using the same rules, we expect different networks to behave aboutthe same when the percentage of inhibitor edges is fixed. In order to validate this expectation,ten runs are made using a fixed network of 250 vertices. The inhibitor edges are all the samein every run. The only difference is the initial state, which is generated for every run, witha 50% chance for a vertex to be set to true. The results of these simulations are shown inTable 5.7.


Run Size of basin Size of cycle % of vertices set to true1 19 35 24.82 20 4 28.83 48 2 25.64 12 37 25.25 29 2 25.66 33 35 26.87 18 36 26.48 51 35 26.89 39 2 25.610 16 51 28.0

Average 28.5 23.9 26.4

Table 5.7: Scale-free network, 250 vertices, avg. 12 connections per vertex, P(Inhibitor)=0.6

The standard deviations of the basin size and cycle size are 13.8 and 19.0 respectively, whichis much smaller then the simulations with different networks.The same simulation, but with a scale-free network of 250 vertices with the probability foran edge to be an inhibitor set to 70% yields the results shown in Table 5.8.


Average 24.9 11.4 21.7


Standard deviations for basin size and cycle size are 13.5.

The conclusion is that it is not the initial state that results in large fluctuations; differ-ent networks turn out not to display the same behaviour. In particular, different scale-freenetworks, but all with the same properties, display different behaviour. This is a very im-portant result, as the Boolean functions we use are determined by the topology. It turns outthat these functions still allow a multitude of different systems.

When looking at the stable states or start of the state cycles, we observe another remarkableproperty: there is a large overlap in the vertices that are set to true. For the results in Ta-ble 5.7, 14.4% of the vertices are true in all ten runs, which is roughly 60% of the vertices thatare true in any run. Of the vertices that are set to false, 59.6% are set to false in all runs.

48 Results

So 74% of the vertices has the same value in the stable state or first state of the state cyclein all runs. Considering that the initial state is different for every run, this is an incredibleresult: only 26% of the vertices differ in value in the final states.

For results in Table 5.8, the overlap for vertices set to true is 6.0%, which is about 25% ofthe vertices that are true for any run. 62.4% of the vertices are set to false in all runs. Intotal, 68.4% of the vertices have the same value in the final state of all runs.Clearly, these numbers are too large to be a coincidence. We expect that when the statecycles are compared to each other, even more similarities can be identified. When we relatethis to genetic regulatory systems, a possible conclusion is that these systems indeed performa pre-defined function: changes in the initial state lead to similar state cycles. Similar statecycles can be thought of as a specific task being executed by the system. A basin of attractionis then the system responding to external stimuli or a disturbance of its function. The systemwill try to handle the anomaly and go back to its original function. In other words: thesystems seem to display very high robustness.

We also test the same boolean functions on random networks, again with different probabilitiesfor edges to be inhibitors. For each combination, we make ten runs in which a new networkis generated every time. This yields the following results:

P(Inhibitor edge) Avg. size ofbasin

σ|basin| Avg. size ofcycle

σ|cycle| Avg. % ofvertices setto true

σ%

0 1.5 0.5 1 0 100 00.10 2 0.3 1 0 99.2 0.20.20 3 0 1 0 98.2 1.00.30 5.9 1.4 1.4 0.7 90.9 2.40.40 252.4 496.1 133.2 270.5 71.5 2.80.50 No results0.60 No results0.70 No results0.80 No results0.90 9.4 6.5 10.1 17.0 1.4 1.91 1 0 1 0 0 0


For some simulations with random networks, we are unable to obtain results. The simulationof only one run already requires more time then the parallel cluster allows for a job. Thebasin and cycle size together are at least 150.000 states in size. The most likely explanationis that the specific combination of the network topology and percentage of inhibiting edgesresults in a network that displays no systematic organization.

When the probability for an edge to be an inhibitor reaches 40%, we observe a sudden changein the average basin and cycle size. Of the ten runs made with this probability, only two runs


display large basin sizes. If we leave these two runs out, the average basin size is only 29.3,and the average cycle size then becomes 59.4. There are two other runs with high cycle sizes(200 states instead of about 4). One of the runs left out because of its basin size has a cyclesize of only 3. The number of vertices that are set to true in the end is about the same forevery run.

It is difficult to draw conclusions from these observations. The best explanation we haveis that the networks display little structure.

Since we are mainly interested in scale-free networks, we have not investigated this phe-nomenon in more detail. The important thing is that there clearly are functions for whichscale-free networks behave completely different than random networks.

In order to determine whether random networks have the same overlapping in the final states,a single random network was generated with a 40% chance for an edge to be inhibiting. Thisnetwork has been simulated ten times, with a new initial state for every run, where a vertexhas equal probability of being true or false. Results for these tests are shown in Table 5.10.


Average 25.8 24.4 72.2

Table 5.10: Random network, 250 vertices, avg. 12 connections per vertex, P(Inhibitor)=0.4

Standard deviations for the basin and cycle size are 11.0 and 13.9 respectively. Again wecompare the final states to each other, as was done for scale-free networks as well. Weobserve that 67.2% of the vertices are set to true for all runs, and 20.8% set to false in allruns. So a total of 88% of the vertices has the same value in all final states, which seemsquite a large number. However, a similar simulation (with P(Inhibitor) = 0.40) has not beenperformed for scale-free networks, so it is difficult to compare. Therefore, a simulation forscale-free networks has been performed, resulting in the data in Table 5.11.

50 Results


Average 7.6 1 65.3


For the scale-free network, we observe a stable state in every run. Furthermore, the numberof vertices set to true is equal in all runs but one. 64.8% of the vertices are set to true in allruns, and 34.0% of the vertices are set to false in all runs . This yields a total of 98.8% of allvertices that have the same value in all runs, which is an amazing result. It is significantlylarger when compared to the random network. We expect that nine of the ten runs haveresulted in the exact same stable state, although we did not verify this claim. The randomnetwork clearly results in more different state cycles, which becomes clear when consideringthe sizes of the cycles: they have different sizes. The deterministic nature of the networksimply that two cycles of different size have no overlapping state.At this point, not enough research has been done to claim that scale-free networks have ahigher probability to end up in a specific state cycle then random networks. However, theseresults do suggest such a property. Clearly, this is an interesting starting point for futureresearch.

5.2 Performance tests

In the previous sections we have described test results describing behaviour of boolean net-works. In this section we will discuss performance of the algorithm. All tests are conducted ona parallel cluster that consists of 16 nodes, where each node meets the following specifications:

• 2 Intel Pentium 4 3.06 GHz CPU’s (512KB cache per CPU)

• 2 GB RAM

• Gigabit LAN connection to other nodes

• Running Gentoo Linux 2.4.26

• LAM 7.0.6/MPI 2 C++

In our tests, we map precisely one process to a CPU, i.e.: for every two processes, we use onenode.

5.2 Performance tests 51

In order to test performance, we generate a single scale-free network with a specific num-ber of vertices. We then simulate 50.000 state transitions several times, each time using adifferent number of processes. Since we keep the network fixed for all tests, every simulationhas an exactly equal workload. We measure the runtime the simulation requires in seconds.The results are as follows, for different number of vertices:

#Processes #Nodes Runtime (hours)2 1 2.64 2 1.58 4 1.016 8 0.7432 16 0.78

Table 5.12: Performance results for network of 250 vertices

In this table we see that the algorithm scales acceptable with the number of processes, eventhough the number of vertices is quite small. We observe that the fastest result is obtainedwith 16 processes, and not with 32. The most likely explanation for this would be that thecommunication overhead becomes too big in the step from 16 to 32 processes. When thenetwork contains more vertices, we expect this effect to disappear.

The results for 500 vertices are as follows:

#Processes #Nodes Runtime (hours)2 1 10.14 2 5.38 4 3.116 8 1.732 16 1.1


For 500 vertices, we already have that 32 processes perform better then 16 processes, as ex-pected. As the number of vertices increases, the advantage will become bigger.

For 750 vertices, results are as follows:

#Processes #Nodes Runtime (hours)2 1 > 244 2 13.18 4 7.016 8 4.432 16 2.0


These results are visualized in Figure 5.4.

52 Results

�

�

��

��

��

��

��

� � ��

��

��

��

��

��

Figure 5.4: Performance comparison

In order to determine whether comparing the current local state to the state history requiresmore time then communication, the algorithm was tested on the same network of 750 ver-tices, using 16 nodes, but with 64 and 128 processes. Runtimes are not as good as using 32processes. Conclusion: communication is more expensive then searching the state history.

Given these performance results, we can conclude that the performance increase is signifi-cant when adding processes. If the network is relatively small, adding more processors andprocesses will not necessarily result in better performance. Using more processes on the samenumber of processors results in a performance decrease when compared to one process perprocessor.

Chapter 6

Conclusion and recommendations

Genetic regulatory systems are highly complex systems, and difficult to comprehend. Simu-lation provides a manner to gain better insight in such systems. In order to allow simulation,the complexity needs to be reduced by abstracting away from the details. One possible ab-straction is that of Boolean networks. Even though such an abstraction greatly reduces thecomplexity of the simulation, it still remains a system that displays a rapidly growing statespace, resulting in the need for computer simulation. In order to obtain reasonable runningtimes, a parallel algorithm has been designed and implemented. The algorithm has beenshown to be much faster then a sequential implementation.

To obtain an effective distribution of workload between processes in a parallel algorithm,a Recursive Minimal Cut algorithm has been designed, implemented and tested. This algo-rithm performs good for different topologies, and in particular for scale-free networks. Assuch, it can be used as a general load distribution algorithm. Several optimizations have beensuggested for further improving the Minimal Cut algorithm.

Besides the load distribution algorithm, a parallel algorithm for simulating Boolean networkshas been designed and implemented. When simulating Boolean networks, an important choiceis what Boolean functions are mapped to the vertices in a Boolean network. In extreme casesthey can completely determine the behaviour of the network; the size or topology of thenetwork are irrelevant to the results. The scale-free topology of genetic regulatory systemshas been shown to provide different behaviour then random networks for a special type ofBoolean functions. Furthermore, different scale-free networks with the same properties yielddifferent simulation results.However, a fixed scale-free network with varied initial states results in steady states or statecycles with many similarities, which suggests the networks display high robustness.

The results from all simulations we have done differ from what is described in literature.A possible explanation lies in the choice of Boolean functions. But scientific literature onlydescribes expectations on behaviour of Boolean networks, which are not necessarily true. Itis possible that these expectations prove to be wrong.

Performance tests indicate that the parallel algorithm is much faster then a sequential imple-mentation. The size of a Boolean network influences the performance increase of the parallel

54 Conclusion and recommendations

algorithm. However, even moderate sized networks allow a significant performance increasewhen more processes are added.

The implementation of the parallel algorithm provides an efficient solution to simulatingBoolean networks. Although the number of different simulations described in this thesis issmall, it does provide some approaches for choosing Boolean functions, which is the mostimportant choice for the simulation. It is clear that random Boolean functions do not offerrealistic simulation results. It is difficult, however, to draw conclusions on genetic regulatorysystems from simulation results without a background in molecular biology. In particular,without a way to validate the Boolean functions, conclusions on the behaviour of geneticregulatory systems are impossible to draw.

6.1 Recommendations for future work

The work presented in this thesis provides a means to simulate Boolean networks. Severalquestions related to the behaviour of the described networks and functions are not answered.For example, how does the behaviour change when we make the following changes to thenetwork:

• Adding additional vertices

• Adding additional edges

• Removing one or more vertices from a scale-free network

• Removing one or more hubs from the network

In the area of cellular development, it is interesting to see what happens when some vertexis removed from the network, or some new vertices or edges are added. Is the network thencapable of maintaining its function? And how important are the hubs to the behaviour of thenetwork?

Although the answers to some of these questions are easy to obtain using our implemen-tation of the algorithm, they are all quite time-consuming to simulate.

In the results, we observe a large similarity in steady states and state cycles that a spe-cific network leads to, given some random initial state. The results suggest that this can notbe a coincidence. The experiments described in this thesis are not broad enough to state aconclusion on this. It would be very interesting to see whether there really is a systematicrelation. In particular, it would be interesting to be able to create a ”map” of the differentstates of a network, and how they are connected to each other. This will, however, requirethe simulation of all state transitions in the network, which is a huge computational challengeeven for moderate networks.

Another important thing that would be of interest is a method to compare the Boolean func-tions used in this thesis to real genetic regulatory systems. The functions we use are designedfrom the perspective of computer science, by using simple metrics of the graphs. It is very

6.1 Recommendations for future work 55

well possible that these functions prove to be unrealistic abstractions. However, without abackground in molecular biology, it is difficult to validate the functions. One needs a moredetailed understanding on the functioning of the regulatory systems to determine suitableBoolean functions.

One of the disadvantages of Boolean networks, is that each state transition is completelysynchronized, that is: every vertex changes to its new state at the same moment. It mightbe more realistic if there is a less strict mechanism, where groups of vertices synchronizewith each other every step, and vertices are only globally synchronized after some (perhapsrandom) time interval. In particular, it would be more logical if vertices that are close toeach other in the network influence each other more then vertices that are far away from eachother. Such a simulation would require a spatial component in the graphs representing thenetworks. The regular networks, which are discussed very briefly in Section 2.2, might proveto be an interesting topology for this purpose.

One of the problems encountered when relaxing synchronization between vertices is the lackof determinism, making it very difficult, if not impossible, to properly recognize a state cycle;if the simulation becomes non-deterministic, every possible path in the state space has to beconsidered. Of course, doing this in an efficient manner is a complicated task, as the statespace branches (and therefore explodes) more and more as we loosen the synchronizationrestrictions. Reaching the same state twice is then no longer a proof of a state cycle. Clearly,such an approach poses a huge challenge.

56 BIBLIOGRAPHY

6.2 References

[1] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Pro-tasi. Complexity and Approximation: Combinatorial optimization problems and theirapproximability properties. Springer, 2003.

[2] A.L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509,1999.

[3] A.L. Barabasi and Z. Oltvai. Network biology: understanding the cell’s functional orga-nization. Nature Reviews - Genetics, 5:101–114, 2004.

[4] B. Bollobas. Random Graphs. Cambridge University Press, January 2001.

[5] D. Bray. Molecular networks: The top-down view. Science, 301(5641):1864–1865,September 2003.

[6] H. de Jong. Modeling and simulation of genetic regulatory systems: A literature review.Journal of Computational Biology, 9(1):67–103, 2002.

[7] F. Ercal, J. Ramanujam, and P. Sadayappan. Task allocation onto a hypercube byrecursive mincut bipartitioning. Journal of Parallel and Distributed Computing, 10(1):35–44, 1990.

[8] W. Feijen and A. van Gasteren. On a method of multiprogramming. Springer-Verlag NewYork, Inc., New York, NY, USA, 1999.

[9] S. Kauffman. Metabolic stability and epigenesis in randomly constructed genetic nets.Theoretical biology, 22:437–467, 1969.

[10] B. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. TheBell system technical journal, 49(1):291–307, 1970.

[11] H. Sauro and B. Kholodenko. Quantitative analysis of signaling networks. Progress inBiophysics and Molecular Biology, 86(1):5–43, September 2004.

[12] R. Verhoeven. Measurements of SKaMPI, Version 4.0 (Special Karlsruher MPI-Benchmark) of the pentium4 at SAN / TU/e performed by river on Thu Jan 20 13:11:422005. http://sandpit.win.tue.nl/main/sandpit-skampi-20050120.pdf, January 2005.

Appendix A

Test results for Parallel RecursiveMinimal Cut

A.1 Results

We tested the algorithm on four different types of graphs: a binary tree, a cycle, a randomnetwork and a scale-free network. The vertices were distributed among 32 processes. Wevaried the total number of computational units in the graphs, and measured the results. Wemeasured runtime, average number of internal connections and average number of externalconnections. We compared the results with expected values for a random mapping of com-putational units to the processes, requiring that each process is assigned a similar number ofcomputational units.Results were obtained on a cluster consisting of 16 nodes, each with the following specifica-tions:

• Dual 3.06 GHz Pentium 4 Processor

• 2 GB RAM

• Operating system: Gentoo Linux, kernel 2.4.26

• LAM 7.0.6/MPI 2 C++

All networks were divided into 32 clusters in total, in order to be used in a parallel programwith 32 processes.

A.1.1 Binary tree network

First, we generated binary tree networks, for different numbers of vertices, with edges in onedirection. We measured the results of our algorithm for these networks:

58 Test results for Parallel Recursive Minimal Cut

#Vertices Runtime(sec)

#Internalconn. -alg

#Externalconn.- alg

#Internalconn. -random

#Externalconn. -random

500 1 12.6 2.1 0.48 15.111000 2 25.7 4.4 0.98 30.242500 5 61.6 15.4 2.44 75.655000 18 126.6 28.5 4.88 151.3410000 69 258.5 53.2 9.76 302.7

Table A.1: Binary tree network - Avg. results per process

We also represented these results in various graphs; see below.

�

��

��

��

��

��

��

��

��

��

��

��

Figure A.1: Binary tree - internal connections

A.1 Results 59

�

��

��

��

��

��

��

��

��

��

��

��

��

Figure A.2: Binary tree - external connections

�

��

��

��

��

��

��

��

�

��

��

��

Figure A.3: Binary tree - runtime

We see that the algorithm performs quite well. The ideal distribution would result in at most3 external connection per cluster. So the optimal solution is not obtained. Although thereis a large factor between the optimal solution and the resulting distribution, we consider theresulting distributions satisfactory when related to the size of the graphs.

A.1.2 Cyclic network

We generated a cyclic network, with connections in one direction. Results were as follows:



#Internalconn. -alg

#Externalconn.- alg



500 1 10.5 4 0.49 15.141000 2 22 8.3 0.98 30.272500 4 56 21 2.44 75.685000 17 112.8 42.6 4.88 151.3710000 66 226.4 85.1 9.77 302.73

Table A.2: Cyclic network - results

�

��

��

��

��

��

��

��

��

��

��

Figure A.4: Cyclic network - internal connections

A.1 Results 61

�

��

��

��

��

��

��

��

��

��

��

��

��

Figure A.5: Cyclic network - external connections

�

��

��

��

��

��

��

��

��

��

��

Figure A.6: Cyclic network - runtime

Again, we see that the algorithm performs quite well, although not as good as in the caseof the binary tree network. An ideal distribution would result in two external connectionsper process. The reason that an ideal distribution is not obtained lies in the choice of theheuristic; in the ideal distribution the first half of the cycle would become one cluster as resultof the MinCut. However, if the (preceding) neighbor of some vertex is in one cluster, and thevertex itself and its successor neighbor in the other, the heuristic would return a gain of 0 forthat vertex, making it not really feasible to move.


A.1.3 Random network

We generated a random network, where each vertex has a 5% change to have an outgoingconnection to each other vertex.


#Internalconn. -alg

#Externalconn.- alg



500 1 43.7 347.6 12.21 378.421000 3 136.8 1423 48.83 1513.672500 35 645.8 9114.8 305.18 9460.455000 266 2160.1 36867.7 1220.7 37841.810000 2359 7492.6 148750.3 4882.81 151367.19

Table A.3: Random network - results

�

��

��

��

��

��

��

��

��

��

��

��

��

��

Figure A.7: Random network - internal connections

A.1 Results 63

�

��

��

��

��

��

��

��

��

��

��

��

��

��

Figure A.8: Random network - external connections

�

��

��

��

��

��

��

��

��

Figure A.9: Random network - runtime

These results are by far not as good as the results for the cyclic and binary tree networks, notonly in terms of communication cost, but also in terms of runtime. There is, however, a goodexplanation for this. The random network has much more connections than the other twonetworks, which results in a lot more possibilities to move vertices and improve the cutsize.Also, more edges result in a longer time to calculate the gain for each vertex. This results inlonger runtimes.Furthermore, the network is completely random generated. There are no real ”clusters” ofdensely connected vertices; edges are spread evenly through the network, which means thatthere is not a real ”good” solution as was the case with the other two networks.


A.1.4 Scale-free network

A special type of network is a so-called ”Scale-free” network. Such a network can be char-acterized by a relatively small number of ”hubs”, which are vertices with a large number ofconnections. This type of network is described in [3]. In this article, the authors show thatscale-free networks are observed in many biological cells. We generated scale-free networkswith the method described in the paper, and tested the algorithm on these graphs. We alsodetermined the number of different processes among which the largest hubs were distributed.A better distribution means that communication load is spread more evenly. We measuredthe number of processes among which the sixteen largest hubs were assigned to. In best-case,this would be sixteen. All scale-free networks used for testing were generated with an averagenumber of connections per vertex of 5.

We did not compare the quality of the mapping to a random mapping.

Number ofvertices

Runtime #Internalconnections

#Externalconnections

#Processesfor sixteenlargest hubs

500 2 15.3 4.5 14.31000 3 32.6 7.9 12.62500 9 85.1 17.8 13.15000 38 172.0 34.5 12.910000 172 351.8 62.1 13.6

Table A.4: Scale-free network - results

�

��

��

��

��

��

��

��

��

��

��

��

��

Figure A.10: Scale-free network - internal connections

A.1 Results 65

�

��

��

��

��

��

��

��

��

��

��

��

Figure A.11: Scale-free network - external connections

�

��

��

��

��

��

��

��

��

��

��

��

��

��

Figure A.12: Scale-free network - runtime

In general we can state that the distribution is pretty good for scale-free networks. The hubsare distributed in an acceptable manner, although the algorithm does not necessarily split thehubs from each other. We believe the distribution of the hubs is the result of the randomnessin splitting one set of vertices into two clusters for the MinCut algorithm. The algorithmitself has no explicit mechanism to separate strongly connected nodes.

Appendix B

Histograms of Minimal Cut forscale-free networks

��

�

��

��

��

��

��

��

� � � � � � � � ��

��

��

��

�

��

��

��

��

��

��

��

� � � � � � � � ��

��

��

Figure B.1: Histogram of connections for run 1

��

�

��

��

��

��

��

��

� � � � � � � � ��

��

��

��

�

��

��

��

��

��

��

��

� � � � � � � � ��

��

��


67

��

�

��

��

��

��

��

��

� � � � � � � � ��

��

��

��

�

��

��

��

��

��

��

��

� � � � � � � � ��

��

��


��

��

��

��

��

��

��

��

��

��

��

��

��

� � � � � � � � ��

��

��

��

�

��

��

��

��

��

��

� � � � � � � � ��

��

��


��

�

��

��

��

��

��

��

� � � � � � � � ��

��

��

��

�

��

��

��

��

��

��

��

� � � � � � � � ��

��

��

��


��

�

��

��

��

��

��

��

� � � � � � � � ��

��

��

��

�

��

��

��

��

��

��

��

� � � � � � � � ��

��

��


68 Histograms of Minimal Cut for scale-free networks

��

��

��

��

��

��

��

��

��

� � � � � � � � ��

��

��

��

�

��

��

��

��

��

��

��

� � � � � � � � ��

��

��


��

�

��

��

��

��

��

��

� � � � � � � � ��

��

��

��

�

��

��

��

��

��

��

��

� � � � � � � � ��

��

��


��

�

��

��

��

��

��

��

� � � � � � � � ��

��

��

��

�

��

��

��

��

��

��

� � � � � � � � ��

��

��


��

�

��

��

��

��

��

��

� � � � � � � � ��

��

��

��

�

��

��

��

��

��

��

��

� � � � � � � � ��

��

��


Documents

Eindhoven University of Technology MASTER Simulation of ... · Simulation of large-scale genetic regulatory systems Janssen, T.H.M. Award date: 2006 Link to publication Disclaimer