
[IEEE 2011 6th International Symposium on Health Informatics and Bioinformatics (HIBIT) - Izmir, Turkey (2011.05.2-2011.05.5)] Proceedings of the 6th International Symposium on Health


978-1-4673-4431-9/11//$31.00 ©2012 IEEE, May 02-05, 2011, Izmir, Turkey

Parallel SPICi

Seyedsasan Hashemikhabir, Tolga Can

Computer Engineering Department, Middle East Technical University

Ankara, Turkey [email protected],

[email protected]

ABSTRACT

In this paper, a concurrent implementation of the SPICi algorithm is proposed for clustering large-scale protein-protein interaction (PPI) networks. In each run, the method selects a defined number of protein seed pairs and expands multiple clusters concurrently from the selected pairs; it terminates when no protein nodes remain to be processed. This approach can cluster large PPI networks with a considerable performance gain over the sequential SPICi algorithm. Experiments show that the parallel approach clusters the STRING human dataset nearly three times faster on a system with a 4-core CPU while maintaining high clustering quality.

1- Introduction

In recent years, advances in cell-level science and high-throughput experimental methods have resulted in large and dense protein interaction networks. Analyzing and understanding the structure of these large-scale networks and revealing their functional modules has become a challenging task for researchers. The precision of the functional modules and the amount of processing time are the main issues for large-scale protein interaction networks. In previous work, many methods have been proposed for networks of moderate size (Bader and Hogue [3]; Hartwell et al. [4]; Pereira-Leal et al. [5]; Rives and Galitski [6]; Spirin and Mirny [7]; Altaf-Ul-Amin et al. [8]; Brun et al. [9]; Chen and Yuan [10]; Colak et al. [11]; Enright et al. [12]; Georgii et al. [13]; King et al. [14]; Loewenstein et al. [15]; Navlakha et al. [16]; Palla et al. [17]; Samanta and Liang [18]; Sharan et al. [19]). Most of these methods are based on sequential algorithms. However, with the development of multi-core processor architectures, many traditional serial methods have moved toward parallel versions. Yang and Lonardi [2] proposed a parallel version of Girvan and Newman’s clustering algorithm that achieves almost linear speed-up on up to 32 processors.

The parallel algorithm proposed in this paper is based on the SPICi algorithm by Jiang and Singh [1]. SPICi uses a greedy approach: it selects seed nodes with the highest weighted degree and expands each selected seed using its adjacent nodes while keeping the density of the resulting cluster above a defined threshold. In the sequential SPICi algorithm, a single seed pair is selected in each run; in Parallel SPICi, multiple seed pairs are selected and their clusters are expanded concurrently, subject to the density threshold check. After each run is completed, seed selection and concurrent expansion continue until no unprocessed node is left in the network. The sequential SPICi algorithm runs in O(V log V + E) time, where V and E are the numbers of vertices and edges in the network, and the parallel version is expected to be faster than the sequential implementation.

This paper is organized as follows. First, we explain the different parts of the parallel algorithm in detail. Then the proposed algorithm is tested on the STRING human protein network dataset and the quality of the resulting clusters is measured. Finally, conclusions are given.

2- SPICi Concepts and Overview

Jiang and Singh [1] introduced three main concepts in the SPICi algorithm, which are also used in Parallel SPICi. The network is represented as a graph G = (V, E), where V and E are the nodes and edges of the graph, and each edge (u, v) carries a confidence score $0 < w_{u,v} < 1$.

The “node weight” of a node u is defined as the sum of the confidence scores of all edges incident to u:

$$\mathrm{weight}(u) = \sum_{(u,v) \in E} w_{u,v}$$

The “density” of a subset of nodes S, where S ⊂ V, is given as the sum of all edge confidence values among the nodes in S divided by the total number of possible edges in S:

$$\mathrm{density}(S) = \frac{\sum_{u,v \in S,\ (u,v) \in E} w_{u,v}}{|S| \, (|S| - 1) / 2}$$


This definition of density is closely related to the clustering coefficient of a node and takes the weights on the edges into account. The “support” of a node u by S, where S ⊂ V, is defined as the sum of the confidence scores of all edges between u and the nodes in S, and is computed as follows:

$$\mathrm{support}(u, S) = \sum_{v \in S} w_{u,v}$$
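The three measures defined in this section can be illustrated with a minimal Python sketch. The adjacency-map representation, the toy network, and the function names `node_weight`, `density`, and `support` are assumptions made for illustration; they are not taken from the SPICi implementation.

```python
from itertools import combinations

# Hypothetical toy network: node -> {neighbor: edge confidence}.
graph = {
    "a": {"b": 0.9, "c": 0.8},
    "b": {"a": 0.9, "c": 0.7, "d": 0.2},
    "c": {"a": 0.8, "b": 0.7},
    "d": {"b": 0.2},
}

def node_weight(graph, u):
    """Sum of the confidence scores of all edges incident to u."""
    return sum(graph[u].values())

def density(graph, nodes):
    """Total confidence among `nodes` divided by the number of possible edges."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0.0  # no possible edge; this convention is an assumption
    total = sum(graph[u].get(v, 0.0) for u, v in combinations(nodes, 2))
    return total / (len(nodes) * (len(nodes) - 1) / 2)

def support(graph, u, nodes):
    """Sum of the confidence scores of edges between u and members of `nodes`."""
    return sum(w for v, w in graph[u].items() if v in nodes)
```

On this toy network, the triangle {a, b, c} has density (0.9 + 0.8 + 0.7) / 3 = 0.8, and node d has support 0.2 by that triangle because its only edge into the set is the one to b.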

In sequential SPICi, the nodes are sorted according to their node weights, and the node with the highest node weight is selected as the first member of the seed pair. The second node of the seed pair is selected among the neighbors of the first node as the one with the highest edge confidence and node weight. After the seed pair S is selected, cluster expansion starts by adding the node with the highest support by S. The expansion continues until the support of the next candidate node falls below a defined threshold, Ts, or the density of the resulting cluster falls below the density threshold, Td. The nodes in the resulting cluster are removed from the graph, and the same procedure is applied until no nodes are left in the graph.

3- Parallel SPICi

The main idea behind Parallel SPICi is to select multiple start nodes and expand multiple clusters simultaneously. The clusters are expanded in a similar way to sequential SPICi, with some minor changes. The degree of concurrency, Dc, defines the number of clusters that are expanded in a time unit; the algorithm can expand at most Dc clusters in a time unit. In each iteration of the algorithm, the processed nodes from all clusters are removed from the main graph. The algorithm terminates when no nodes are left in the graph.
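The round structure described above can be sketched in Python as follows. This is only a structural illustration under stated assumptions: `run_rounds`, `pick_seeds`, and `grow` are hypothetical names, the seed selection and cluster growth are passed in as callbacks, and `grow` is assumed to return disjoint clusters. Python threads stand in for the concurrent workers and will not by themselves reproduce the speed-ups reported in this paper.

```python
from concurrent.futures import ThreadPoolExecutor

def run_rounds(graph, dc, pick_seeds, grow):
    """Repeat: pick up to dc seeds, grow their clusters concurrently,
    then remove all processed nodes from the working graph."""
    remaining = {u: dict(nbrs) for u, nbrs in graph.items()}  # working copy
    clusters = []
    with ThreadPoolExecutor(max_workers=dc) as pool:
        while remaining:
            seeds = pick_seeds(remaining, dc)
            if not seeds:
                break  # nothing selectable; avoid looping forever
            # Workers only read `remaining`; it is modified after they finish.
            grown = list(pool.map(lambda s: grow(remaining, s), seeds))
            clusters.extend(grown)
            done = set().union(*grown)
            for u in done:
                del remaining[u]
            for nbrs in remaining.values():
                for u in done.intersection(nbrs):
                    del nbrs[u]
    return clusters

# Trivial callbacks for demonstration: take the first dc nodes in
# alphabetical order and return each seed as a singleton cluster.
demo_graph = {"a": {"b": 0.9}, "b": {"a": 0.9, "c": 0.4}, "c": {"b": 0.4}, "d": {}, "e": {}}
demo_clusters = run_rounds(demo_graph, 2, lambda g, k: sorted(g)[:k], lambda g, s: {s})
```

With a degree of concurrency of 2, the five-node demonstration graph is consumed in three rounds, and the input graph itself is left unmodified because the loop works on a copy.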

4- Selection of the First Seed Nodes

We need to select multiple start nodes in order to expand multiple clusters simultaneously. First, all nodes are sorted based on their node weights. In sequential SPICi, the node with the highest weight is selected for expansion; however, simply choosing the several nodes with the highest weights is not suitable for the concurrent method. If the selected nodes come from the same part of the graph, some of them may belong to the same cluster, and the algorithm would expand two or more clusters that are logically the same. Different heuristics can be applied to solve this problem. In this paper, the neighbors of each candidate node are compared with the neighbors of every node already in the start seed node set. If the number of shared neighbors is greater than a defined threshold, Tn, the candidate node is rejected and the node with the next highest weight is compared with the start seed node set. Tn can be learned from the structure of the test dataset. However, in each iteration of the algorithm the graph becomes sparser and the number of nodes decreases, so the Tn threshold needs to be updated frequently for a more precise selection of the first seed nodes.
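The shared-neighbor heuristic described in this section can be sketched as follows. This is a minimal illustration, assuming an adjacency-map graph; the names `build_graph` and `select_first_seeds` and the toy edge list are hypothetical, and a full sort is used here for brevity where the implementation described later relies on hash tables.

```python
def build_graph(edges):
    """Build a symmetric adjacency map from (u, v, confidence) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph

def select_first_seeds(graph, dc, tn):
    """Pick up to dc first seed nodes, highest node weight first.
    A candidate is rejected if it shares more than tn neighbors
    with any seed that has already been accepted."""
    weights = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    seeds = []
    for cand in sorted(weights, key=weights.get, reverse=True):
        if len(seeds) == dc:
            break
        cand_nbrs = set(graph[cand])
        if all(len(cand_nbrs & set(graph[s])) <= tn for s in seeds):
            seeds.append(cand)
    return seeds

# Toy network: a dense group {a, b, c, d} and a separate triangle {x, y, z}.
edges = [("a", "b", 0.9), ("a", "c", 0.9), ("a", "d", 0.9),
         ("b", "c", 0.8), ("b", "d", 0.8), ("c", "d", 0.7),
         ("x", "y", 0.6), ("x", "z", 0.6), ("y", "z", 0.5)]
graph = build_graph(edges)
print(select_first_seeds(graph, dc=2, tn=1))  # ['a', 'x']
```

With Tn = 1, nodes b, c, and d are each rejected because they share two neighbors with the already accepted seed a, so the second seed comes from the other group. With a loose threshold such as Tn = 5, the two heaviest nodes a and b would both be accepted even though they belong to the same dense group.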

5- Selection of Second Seed Nodes and Process Assignment

After choosing the first member of each seed pair, we assign a process to each node. In every process, as in sequential SPICi, the neighbors of the first node are divided into five bins based on their edge confidence values: (0,0.2], (0.2,0.4], (0.4,0.6], (0.6,0.8], and (0.8,1]. Each process looks for the node with the highest node weight in the highest non-empty bin. When this node is found, it is combined with the first seed node to form the seed pair. However, two or more processes may request the same node as their second seed node. In this case, the process that requested the node first takes it, and the other processes select their second-best node. In another scenario, a process may take a node and eliminate it from the network graph, and other processes may request the same deleted node in a later time unit. To handle this, a shared list of deleted nodes is kept in the system, which every process can access to verify the availability of a node.
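The binning rule and the shared availability check can be sketched as follows. This is an illustrative sketch rather than the C# implementation used in the experiments: `ClaimedNodes`, `edge_bin`, and `pick_second_seed` are hypothetical names, and a lock-protected Python set is assumed as the shared list so that only the first requester obtains a node.

```python
import threading
from bisect import bisect_left

class ClaimedNodes:
    """Shared registry of nodes already taken by some process."""
    def __init__(self):
        self._lock = threading.Lock()
        self._taken = set()

    def try_claim(self, node):
        """Atomically claim `node`; return False if it is already taken."""
        with self._lock:
            if node in self._taken:
                return False
            self._taken.add(node)
            return True

def edge_bin(w):
    """Map a confidence value to bin 0..4 for (0,0.2], (0.2,0.4], ..., (0.8,1]."""
    return bisect_left([0.2, 0.4, 0.6, 0.8], w)

def pick_second_seed(graph, weights, first, claimed):
    """Try neighbors of `first` from the highest bin down, heaviest node first,
    and return the first one that can still be claimed (or None)."""
    ranked = sorted(graph[first],
                    key=lambda v: (edge_bin(graph[first][v]), weights[v]),
                    reverse=True)
    for v in ranked:
        if claimed.try_claim(v):
            return v
    return None

# Hypothetical neighborhoods of two first seed nodes, "a" and "e".
graph = {"a": {"b": 0.85, "c": 0.95, "d": 0.5}, "e": {"b": 0.9, "f": 0.3}}
weights = {"b": 3.0, "c": 2.0, "d": 5.0, "f": 1.0}
claimed = ClaimedNodes()
second_a = pick_second_seed(graph, weights, "a", claimed)  # "b"
second_e = pick_second_seed(graph, weights, "e", claimed)  # "f"
```

In the example, b and c both fall in the top bin for seed a, and b wins because of its higher node weight, even though d is heavier overall. Seed e would also prefer b, finds it already claimed, and falls back to its second-best neighbor f.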

6- Cluster Expansion

We define a vertex set S for each process, which initially contains the seed nodes of the cluster. To expand the clusters, as in sequential SPICi, we search among the unclustered neighbors of the vertices in S for a node u with the highest support(u,S). If support(u,S) is lower than the defined threshold Ts, the new node is rejected and the current set is output as a new cluster. If support(u,S) is higher than Ts, the density of the set S together with the new node u is calculated, and if it is lower than the defined threshold Td, node u is also rejected. Finally, if both conditions are satisfied, the selected node is checked against the shared list to see whether it has already been taken by another process. If it has, the second-best node is checked. Every process in the system repeats this procedure until no nodes are left that satisfy the criteria.
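The expansion step can be sketched in Python as follows. This sketch makes several assumptions: it uses the acceptance test as written in Figure 1 (support(t,S) ≥ Ts·|S|·density(S) and density(S∪{t}) > Td), it re-sorts the frontier on every step for simplicity, and it uses a plain set named `claimed` as a single-threaded stand-in for the shared list. The function names and the toy network are hypothetical.

```python
from itertools import combinations

def density(graph, nodes):
    """Total confidence among `nodes` divided by the number of possible edges."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0.0
    total = sum(graph[u].get(v, 0.0) for u, v in combinations(nodes, 2))
    return total / (len(nodes) * (len(nodes) - 1) / 2)

def support(graph, u, nodes):
    """Sum of the confidence scores of edges between u and members of `nodes`."""
    return sum(w for v, w in graph[u].items() if v in nodes)

def expand(graph, seed_pair, ts, td, claimed):
    """Grow a cluster from seed_pair; `claimed` holds nodes taken by any cluster."""
    cluster = set(seed_pair)
    claimed.update(cluster)
    while True:
        frontier = {v for u in cluster for v in graph[u]} - cluster
        ranked = sorted(frontier, key=lambda v: support(graph, v, cluster), reverse=True)
        for t in ranked:
            too_weak = support(graph, t, cluster) < ts * len(cluster) * density(graph, cluster)
            too_sparse = density(graph, cluster | {t}) <= td
            if too_weak or too_sparse:
                return cluster      # best remaining candidate fails: output cluster
            if t not in claimed:    # free node: take it and rebuild the frontier
                claimed.add(t)
                cluster.add(t)
                break
            # otherwise t was taken by another cluster: try the next-best node
        else:
            return cluster          # frontier empty or every candidate already taken

graph = {
    "a": {"b": 0.9, "c": 0.8},
    "b": {"a": 0.9, "c": 0.7, "d": 0.2},
    "c": {"a": 0.8, "b": 0.7},
    "d": {"b": 0.2},
}
free_run = expand(graph, ("a", "b"), ts=0.5, td=0.5, claimed=set())
contested_run = expand(graph, ("a", "b"), ts=0.5, td=0.5, claimed={"c"})
```

In the first run the cluster grows from {a, b} to {a, b, c} and then stops because d has too little support. In the second run c is already taken, so the process falls through to d, which fails the support test, and the cluster stays at {a, b}. This also illustrates how contention between processes can lead to smaller clusters.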

7- Data Structures and Running Time

Several data structures are defined for the implementation of the algorithm. Most of them are based on hash tables for fast insertion, update, and retrieval. To keep track of node weights, a hash table, WeightsH, is used in which the keys are node names and the values are the corresponding weights, so the Insert, UpdateKey, and Retrieve operations can be done in O(1) time. For selecting the seed nodes, we need a data structure that can retrieve the nodes efficiently. For this purpose, we use a hash table, DegreeH, in which the keys are rounded node weights and each value is a hash set of the nodes that have the corresponding weight. Since each node’s weight can be accessed in constant time in WeightsH, the Insert, DecreaseKey, and Retrieve operations can be done in O(1) time in DegreeH. If we keep track of Max(Node Weight) while constructing DegreeH, we know where to start looking for the nodes with the highest weight in DegreeH.

The next data structure is used for cluster expansion. It must support efficient insertion, update, and retrieval of the node with the highest support. For this purpose, the sequential SPICi algorithm uses a Fibonacci heap, which achieves amortized O(1) time for the Insert and Update operations and O(lg n) time for extracting the node with maximum support. In Parallel SPICi, the node with the highest support may already have been selected by another process, so each process may call the ExtractMax function several times. We therefore use a hash set, CandidateH, with O(1) running time for the insert and update operations. For efficient ExtractMax, we first sort the members of the hash set using insertion sort, which runs in O(n+d) time, where n is the number of set members and d is the number of inversions; for nearly sorted data, insertion sort therefore takes linear time, O(n). If ExtractMax is not successful on the first attempt, the second and later attempts take O(1) time because the candidates are already sorted. After every successful ExtractMax operation, the unclustered neighbors of the new node are added to the sorted CandidateH, which can then be re-sorted in nearly O(n) time using insertion sort.
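The interplay of WeightsH and DegreeH can be sketched as a bucket structure keyed by rounded node weight. The class below is a hypothetical illustration, not the paper's C# code: the name `DegreeBuckets` and its methods are assumptions, Python's built-in `round` is used as the rounding rule, and the maximum pointer relies on the fact that node weights only decrease as clusters are removed.

```python
class DegreeBuckets:
    """WeightsH plus DegreeH: rounded node weight -> set of nodes."""

    def __init__(self, weights):
        self.weights = dict(weights)   # WeightsH: node -> weight
        self.buckets = {}              # DegreeH: rounded weight -> nodes
        for node, w in self.weights.items():
            self.buckets.setdefault(round(w), set()).add(node)
        self.max_key = max(self.buckets, default=0)  # tracked maximum

    def _discard(self, node, key):
        bucket = self.buckets[key]
        bucket.discard(node)
        if not bucket:
            del self.buckets[key]      # drop empty buckets

    def decrease(self, node, amount):
        """DecreaseKey: lower a node's weight and move it between buckets."""
        old = round(self.weights[node])
        self.weights[node] -= amount
        new = round(self.weights[node])
        if new != old:
            self._discard(node, old)
            self.buckets.setdefault(new, set()).add(node)

    def remove(self, node):
        """Remove a clustered node from both tables."""
        self._discard(node, round(self.weights.pop(node)))

    def max_bucket(self):
        """Return the nodes in the highest non-empty bucket."""
        while self.max_key > 0 and self.max_key not in self.buckets:
            self.max_key -= 1          # weights never increase, so only scan down
        return self.buckets.get(self.max_key, set())

degree_h = DegreeBuckets({"a": 2.7, "b": 2.4, "x": 1.2})
first_max = set(degree_h.max_bucket())    # {"a"}
degree_h.decrease("a", 1.0)               # weight 1.7 now rounds to 2
second_max = set(degree_h.max_bucket())   # {"a", "b"}
degree_h.remove("b")
third_max = set(degree_h.max_bucket())    # {"a"}
```

Each update touches one or two buckets, and the downward scan of the maximum pointer is paid for only once per key, which is consistent with the constant-time behavior claimed for DegreeH provided the range of rounded weights is modest.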

Figure 1 shows the steps of the proposed parallel algorithm in detail.

Search
  Initialize DegreeH to be V
  Initialize WeightsH to be V with their weight values
  Initialize DC as the degree of concurrency
  While DegreeH is not empty
    1- Select DC nodes, u[], from DegreeH, considering the criteria mentioned in the text.
    2- For each node k in u[] do in parallel
         If k has a second seed node then:
           Find the second seed node, v, among the adjacent vertices according to the text.
           Add Expand(k,v) to the cluster set, S.
         else Add k to the cluster set S.
    3- Remove the resulting cluster nodes from V.
    4- For each neighbor, t, of a node in a resulting cluster, first decrease its weight by support(t,Resulted_Cluster) in WeightsH, then update DegreeH accordingly.

Expand(u,v)
  Initialize cluster S as {u,v}
  Initialize CandidateH to contain the unprocessed neighbors of S
  While CandidateH is not empty
    1- Sort CandidateH using insertion sort.
    2- Try to get node t with the highest support (see text)
       If support(t,S) ≥ Ts |S| density(S) and density(S∪{t}) > Td then
         Add t to S
         Increase the support for vertices connected to t in CandidateH
         For all unclustered neighbors of t, add them to CandidateH if not processed yet.
       else break from loop

Figure 1. Parallel SPICi pseudocode

8- Experimental Results

The main test machine for the experiments had an Intel quad-core 2.8 GHz CPU with 8 GB of memory. The algorithm was implemented in C# 4.0 using the “Task Parallel Library”. The STRING human dataset, with 17369 nodes and 1288886 edges, is used as the test dataset. We set Ts and Td to 0.5 and Tn to 900. After every 20 output clusters, Tn is decreased by 50, until ¾ of all nodes are processed. The degree of concurrency is set to 5 in our experiments. We tested the robustness of our system with different types of CPUs on different machines; the running times in seconds are as follows:


CPU                   Sequential SPICi   Parallel SPICi
Pentium 4 (2 GHz)     23                 27
Core2 dual (3 GHz)    15                 11
QuadCore (2.8 GHz)    14                 5
Core i7 (2.8 GHz)     14                 4

Table 1. Running time (in seconds) of the algorithms on different CPU architectures

As shown in Table 1, Parallel SPICi runs slightly faster on the Core i7 than on the Quad Core CPU. Both CPUs have 4 cores; however, according to the Intel specification, the Core i7 CPU supports 8 threads at the OS level, whereas the Quad Core CPU supports 4. This difference might be the reason for the faster computation time on the Core i7 system.
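The speed-up implied by the Table 1 timings is the sequential time divided by the parallel time. The short script below works this out; the dictionary simply restates the table, and the variable names are illustrative.

```python
# (sequential seconds, parallel seconds) per CPU, as reported in Table 1.
timings = {
    "Pentium 4 (2 GHz)": (23, 27),
    "Core2 dual (3 GHz)": (15, 11),
    "QuadCore (2.8 GHz)": (14, 5),
    "Core i7 (2.8 GHz)": (14, 4),
}

# Speed-up > 1 means the parallel version is faster.
speedup = {cpu: seq / par for cpu, (seq, par) in timings.items()}

for cpu, s in speedup.items():
    print(f"{cpu}: {s:.2f}x")
# Pentium 4 (2 GHz): 0.85x
# Core2 dual (3 GHz): 1.36x
# QuadCore (2.8 GHz): 2.80x
# Core i7 (2.8 GHz): 3.50x
```

The 2.80x figure on the quad-core machine corresponds to the "nearly three times faster" result stated in the abstract, while on the Pentium 4 the parallel version is slower than the sequential one.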

The performance of the parallel algorithm was also tested with different values of the degree of concurrency on our main test machine, to observe the effect of the degree of concurrency on system robustness.

[Figure 2: plot of running time (vertical axis, ticks 0 to 18) against number of processes (horizontal axis, 1 to 15)]

Figure 2. Running time of Parallel SPICi with different degrees of concurrency

Figure 2 shows that as the number of processes in the system increases, the overhead of concurrency has a negative effect on the running time. This overhead can be explained by the time needed to create processes and the time spent handling hits and misses when accessing shared data.

We have also tested the quality of the clusters generated by Parallel SPICi using the method proposed by Song and Singh [20], which is based on GO annotations. They use three measures, Jaccard, PR (Precision-Recall), and Semantic Density, to measure the overlap between the resulting clusters and GO functional modules. As in the sequential SPICi paper, we calculate these measures for the Biological Process (BP) and Cellular Component (CC) subontologies separately. For a detailed description, please refer to the SPICi paper [1] and Song and Singh [20]. The results show that the quality of Parallel SPICi is almost the same as that of sequential SPICi.

                   BP                            CC
Algorithm          sDensity  Jaccard  PR         sDensity  Jaccard  PR
Parallel SPICi     0.290     0.187    0.172      0.312     0.113    0.088
Sequential SPICi   0.311     0.209    0.176      0.328     0.119    0.096

Table 2. GO analysis of clusters

The analysis shows that the sequential SPICi algorithm scores slightly higher than Parallel SPICi on every measure. A possible explanation is that the clusters produced by Parallel SPICi contain, on average, fewer members than those produced by the sequential algorithm: in Parallel SPICi, processes race with each other to take a node, and as a result each cluster ends up with fewer nodes.

9- Conclusions and Future Work

Analyzing and extracting functional modules in large-scale networks is a challenging research topic in bioinformatics. Recent methods have proposed fast and efficient algorithms with acceptable cluster quality. Jiang and Singh [1] proposed the SPICi algorithm, which can cluster large and dense networks by selecting the node with the highest node weight as the first seed node and forming the seed pair with the neighbor of the first node that has the highest edge value and node weight. After the seed pair is formed, nodes with high support by the current set are added to it, and the procedure stops when the density of the resulting cluster falls below a defined threshold. The algorithm iterates until no nodes are left in the graph. Most of the previous methods are based on serial execution of the functions and methods in the algorithms. Today’s computing power allows multiple functions to be executed in parallel, which benefits concurrent methods in terms of performance.

In this paper, a concurrent implementation of SPICi is proposed. In this method, multiple start nodes are selected as the first seed nodes of independent clusters, and then the second seed nodes are chosen concurrently using the same procedure as in sequential SPICi, while checking that the same node does not appear in two clusters at the same time. The expansion method then expands each seed pair concurrently according to the criteria explained. Finally, the processed nodes are removed from the graph, and the algorithm iterates until no nodes are left in the system. We have shown that, by utilizing multiple cores in a system, the SPICi algorithm can be parallelized and a three- to four-fold speed-up is possible while keeping the quality of the resulting clusters close to that of sequential SPICi. Future work includes better selection of first seed nodes with more independence, running the algorithm on clustered systems, and measuring the performance of the algorithm with different thresholds.

10- References

[1] Jiang,P. and Singh,M. (2010) SPICi: a fast clustering algorithm for large biological networks. Bioinformatics, 26, 1105–1111.

[2] Yang,Q. and Lonardi,S. (2005) A parallel algorithm for clustering protein-protein interaction networks. In 2005 IEEE Computational Systems Bioinformatics Conference - Workshops (CSBW'05), pp. 174–177.

[3] Bader,G.D. and Hogue,C.W. (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4, 2.

[4] Hartwell,L.H. et al. (1999) From molecular to modular cell biology. Nature, 402, 6761.

[5] Pereira-Leal,J. et al. (2004) Detection of functional modules from protein interaction networks. Proteins, 54, 49–57.

[6] Rives,A.W. and Galitski,T. (2003) Modular organization of cellular networks. Proc. Natl Acad. Sci. USA, 100, 1128–1133.

[7] Spirin,V. and Mirny,L.A. (2003) Protein complexes and functional modules in molecular networks. Proc. Natl Acad. Sci. USA, 100, 12123.

[8] Altaf-Ul-Amin,M. et al. (2006) Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics, 7, 207.

[9] Brun,C. et al. (2003) Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol., 5, R6.

[10] Chen,J. and Yuan,B. (2006) Detecting functional modules in the yeast protein-protein interaction network. Bioinformatics, 22, 2283–2290.

[11] Colak,R. et al. (2009) Dense graphlet statistics of protein interaction and random networks. In Pacific Symposium on Biocomputing, pp. 178–189.

[12] Enright,A.J. et al. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res., 30, 1575–1584.

[13] Georgii,E. et al. (2009) Enumeration of condition-dependent dense modules in protein interaction networks. Bioinformatics, 25, 933–940.

[14] King,A.D. et al. (2004) Protein complex prediction via cost-based clustering. Bioinformatics, 20, 3013–3020.

[15] Loewenstein,Y. et al. (2008) Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics, 24, i41–i49.

[16] Navlakha,S. et al. (2009) Revealing biological modules via graph summarization. J. Comput. Biol., 16, 253–264.

[17] Palla,G. et al. (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435, 814–818.

[18] Samanta,M. and Liang,S. (2003) Predicting protein functions from redundancies in large-scale protein interaction networks. Proc. Natl Acad. Sci. USA, 100, 12579–12583.

[19] Sharan,R. et al. (2005) Conserved patterns of protein interaction in multiple species. Proc. Natl Acad. Sci. USA, 102, 1974–1979.

[20] Song,J. and Singh,M. (2009) How and when should interactome-derived clusters be used to predict functional modules and protein function? Bioinformatics, 25, 3143–3150.