978-1-4673-4431-9/11/$31.00 ©2012 IEEE
May 02-05, 2011, Izmir, Turkey
Parallel SPICi
Seyedsasan Hashemikhabir, Tolga Can
Computer Engineering Department, Middle East Technical University
Ankara, Turkey
[email protected]
ABSTRACT
In this paper, a concurrent implementation of the SPICi
algorithm is proposed for clustering large-scale protein-protein
interaction networks. The method selects a defined
number of protein seed pairs, expands multiple clusters
concurrently from the selected pairs in each iteration, and
terminates when no unprocessed protein nodes remain.
This approach can cluster large PPI networks with a
considerable performance gain over the sequential SPICi
algorithm. Experiments show that the parallel approach
achieves nearly three times faster clustering on the
STRING human dataset on a system with a 4-core CPU
while maintaining high clustering quality.
1- Introduction
In recent years, advances in cell-level science and
high-throughput experimental methods have produced large
and dense protein interaction networks. Analyzing these
large-scale networks to understand their structure and
reveal their functional modules has become a challenging
task for researchers. The precision of the detected
functional modules and the amount of processing time
are the main issues for large-scale protein interaction
networks. In previous work, many methods have been
proposed for networks of moderate size (Bader and
Hogue [3]; Hartwell et al. [4]; Pereira-Leal et al. [5];
Rives and Galitski [6]; Spirin and Mirny [7];
Altaf-Ul-Amin et al. [8]; Brun et al. [9]; Chen and
Yuan [10]; Colak et al. [11]; Enright et al. [12]; Georgii
et al. [13]; King et al. [14]; Loewenstein et al. [15];
Navlakha et al. [16]; Palla et al. [17]; Samanta and
Liang [18]; Sharan et al. [19]). Most of these methods
are based on sequential algorithms. However, with the
development of multi-core processor architectures, many
traditional serial methods have moved toward parallel
versions. Yang and Lonardi [2] proposed a parallel
version of Girvan and Newman's clustering algorithm
which achieves almost linear speed-up on up to 32
processors. The parallel algorithm proposed in this paper
is based on the SPICi algorithm by Jiang and Singh
[1]. SPICi uses a greedy approach: it selects seed
nodes with the highest node weight and expands each
selected seed using incident nodes while keeping the
density of the resulting cluster above a defined
threshold. In the sequential SPICi algorithm,
a single seed pair is selected in each iteration; in
Parallel SPICi, multiple seed nodes are selected and
cluster expansion is conducted concurrently on the
selected seed pairs, with the density threshold checked
during expansion. After each iteration completes,
seed selection and concurrent expansion continue until
no unprocessed nodes are left in the network. The
sequential SPICi algorithm runs in O(V log V + E) time,
where V and E are the numbers of vertices and edges in
the network, and the parallel version of SPICi is
expected to be even faster than the sequential implementation.
This paper is organized as follows: first, the parts of
the parallel algorithm are explained in detail; then the
proposed algorithm is tested on the STRING human
protein network dataset and the quality of the resulting
clusters is measured; finally, conclusions are given.
2- SPICi Concepts and Overview
Jiang and Singh [1] introduced three main concepts
in the SPICi algorithm which are also used in Parallel
SPICi. The network is represented as G = (V, E), where V
and E are the nodes and edges of the graph, and each
edge (u, v) carries a confidence score 0 < w_{u,v} < 1.
"Node weight" of a node u is defined as the sum of the
confidence scores of all its incident edges:

    weight(u) = Σ_{(u,v) ∈ E} w_{u,v}
"Density" of a subset of nodes S, where S ⊂ V, is given
as the sum of all edge confidence values among the
nodes in S divided by the total number of possible edges
in S:

    density(S) = ( Σ_{u,v ∈ S, (u,v) ∈ E} w_{u,v} ) / ( |S| · (|S| − 1) / 2 )
This definition of density is closely related to the
clustering coefficient of a node and takes the edge
weights into account. "Support" of a node u by a set S,
where S ⊂ V, is defined as the sum of the confidences of
all edges between u and nodes in S, computed as follows:

    support(u, S) = Σ_{v ∈ S, (u,v) ∈ E} w_{u,v}
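The three measures above can be sketched in a few lines of Python, assuming a hypothetical representation of the network as a dictionary mapping each node to a dictionary of {neighbor: confidence} weights (the paper itself does not prescribe this layout):

```python
# Sketch of the three SPICi measures. The graph is assumed to be a
# dict mapping each node to a dict of {neighbor: confidence} weights.

def weight(graph, u):
    """Node weight: sum of confidence scores of all edges incident to u."""
    return sum(graph[u].values())

def density(graph, S):
    """Density of node set S: total edge confidence within S divided by
    the number of possible edges, |S|*(|S|-1)/2."""
    S = set(S)
    if len(S) < 2:
        return 0.0
    # each internal edge is seen from both endpoints, so halve the sum
    total = sum(w for u in S for v, w in graph[u].items() if v in S) / 2.0
    return total / (len(S) * (len(S) - 1) / 2.0)

def support(graph, u, S):
    """Support of u by S: sum of confidences of edges between u and S."""
    return sum(w for v, w in graph[u].items() if v in S)
```
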
In sequential SPICi, nodes are sorted according to their
node weights, and then the node with the highest node
weight is selected as the first member of seed pair. The
second node in the seed pair is selected among the
incident nodes to the first node with highest edge
confidence and node weight. After selecting the seed pair
S, cluster expansion starts with selecting the node with
highest support by S, and the expansion continues until
the support of a new candidate node would be less than a
defined threshold, Ts, or the density of the resulting
cluster would become less than the density threshold, Td.
The nodes in the resulting cluster are removed from the
graph, and the same procedure is applied until no nodes
are left.
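The sequential loop just described can be condensed into the following sketch. It is illustrative only: the real algorithm bins neighbors by edge confidence when picking the second seed and uses the data structures described later, and the default thresholds here are assumptions.

```python
# Compact sketch of the sequential SPICi loop: pick the highest-weight
# unclustered node, pair it with its best neighbor, grow greedily.

def sequential_spici(graph, Ts=0.5, Td=0.5):
    remaining = set(graph)
    clusters = []

    def support(u, S):
        return sum(w for v, w in graph[u].items() if v in S)

    def density(S):
        n = len(S)
        if n < 2:
            return 1.0
        total = sum(w for a in S for b, w in graph[a].items() if b in S) / 2
        return total / (n * (n - 1) / 2)

    while remaining:
        # first seed: highest weight restricted to unclustered neighbors
        u = max(remaining, key=lambda n: support(n, remaining))
        S = {u}
        nbrs = [v for v in graph[u] if v in remaining]
        if nbrs:
            # second seed: the neighbor with the highest edge confidence
            S.add(max(nbrs, key=lambda v: graph[u][v]))
        while True:
            cands = {v for a in S for v in graph[a] if v in remaining - S}
            if not cands:
                break
            t = max(cands, key=lambda v: support(v, S))
            if support(t, S) < Ts or density(S | {t}) < Td:
                break  # thresholds violated: output the current cluster
            S.add(t)
        clusters.append(S)
        remaining -= S  # clustered nodes leave the graph
    return clusters
```
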
3- Parallel SPICi
The main intuition behind parallel SPICi is selecting
multiple start nodes and expanding the multiple clusters
simultaneously. The clusters are expanded in a similar
way as sequential SPICi with some minor changes.
The degree of concurrency, Dc, defines the number of
clusters that may be expanded simultaneously; the
algorithm expands at most Dc clusters at a time.
In each iteration of the algorithm, processed nodes from
all clusters are removed from the main graph. The
algorithm terminates when there is no more node left in
the graph.
4- Selection of the First Seed Nodes
We need to select multiple start nodes for expanding
multiple clusters simultaneously. First, all the nodes are
sorted based on their node weights. In sequential SPICi,
the node with the highest weight is selected for
expansion; however, choosing a number of nodes with
highest weights is not directly feasible in the concurrent
method. If the selected nodes come from the same part of
the graph, some of them may belong to the same cluster,
and the algorithm would expand two or more clusters
which are logically the same. Different heuristics can be
applied to this problem. In this paper, each candidate
node's neighbors are compared with the neighbors of
every node already in the start seed set. If the number
of shared neighbors exceeds a defined threshold, Tn, the
candidate node is rejected and the next node with the
highest weight is compared against the start seed set.
Tn can be learned from the structure of the dataset.
However, in each iteration the graph becomes sparser and
the number of nodes decreases, so Tn needs to be updated
frequently for precise selection of first seed nodes.
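The shared-neighbor filter could be sketched as follows; the graph representation (node → {neighbor: weight} dict) and the function name are illustrative, not the paper's own API:

```python
# Sketch of first-seed selection for Parallel SPICi: pick up to Dc
# high-weight nodes whose neighborhoods overlap with every already
# chosen seed by at most Tn shared neighbors.

def select_first_seeds(graph, node_weight, Dc, Tn):
    seeds = []
    # consider nodes from highest to lowest node weight
    for u in sorted(node_weight, key=node_weight.get, reverse=True):
        # reject u if it shares more than Tn neighbors with a chosen seed
        if all(len(graph[u].keys() & graph[s].keys()) <= Tn for s in seeds):
            seeds.append(u)
        if len(seeds) == Dc:
            break
    return seeds
```
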
5- Selection of Second Seed Nodes and Process
Assignment
After choosing the first member of each seed pair, we
assign a process to each node. In every process, similar to
the sequential SPICi, neighbors of the first node are
divided into five bins based on their edge values: (0,0.2],
(0.2,0.4], (0.4,0.6], (0.6,0.8], and (0.8,1]. Each process
starts looking for a node from the highest bin with the
highest node weight. When the node is found, it is
combined with the first seed node and both constitute the
seed pair. However, two or more processes may request the
same node as the second seed node. In this case, the
process that requested it first takes the node, and the
other processes select the next best node. In another
scenario, a process may request a node and remove it from
the network graph while other processes, in the next time
unit, request the same deleted node; therefore, a shared
list of deleted nodes is kept in the system, which every
process can access to verify the availability of a node.
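A minimal sketch of this step, with a lock-protected set standing in for the shared list of taken/deleted nodes (the paper uses C# tasks; the names and the thread-based synchronization here are assumptions):

```python
import threading

# Sketch of second-seed selection: neighbors of the first seed are
# binned by edge confidence and searched from the highest bin down.

BINS = [(0.8, 1.0), (0.6, 0.8), (0.4, 0.6), (0.2, 0.4), (0.0, 0.2)]

claimed = set()               # shared list of taken/deleted nodes
claimed_lock = threading.Lock()

def select_second_seed(graph, node_weight, first):
    for lo, hi in BINS:  # highest-confidence bin first
        bin_nodes = [v for v, w in graph[first].items() if lo < w <= hi]
        # within a bin, prefer the highest node weight
        for v in sorted(bin_nodes, key=node_weight.get, reverse=True):
            with claimed_lock:  # atomically check availability and claim
                if v not in claimed:
                    claimed.add(v)
                    return v
    return None  # no available neighbor: the first seed stands alone
```
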
6- Cluster Expansion
We have defined a vertex set S for each process which
initially contains seed nodes of the cluster. For expanding
the clusters, similar to the sequential SPICi, we search
among the unclustered neighbors of vertices in S to find a
node u with the highest support(u, S). If support(u, S) is
lower than the defined threshold Ts, the new node is
rejected and the current set is output as a new cluster.
If support(u, S) is higher than Ts, the density of S
including the new node u is calculated, and if it is
lower than the defined threshold Td, node u is also
rejected. Finally, if both conditions are satisfied, the
selected node is checked against the shared list to see
whether it has been taken by another process. If it was
already taken, the second best node is checked. Every
process repeats this procedure until no more nodes
satisfy the criteria.
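The expansion step for a single process could be sketched as below. The support/density helpers are inlined for self-containment, and the shared-list availability check is omitted; names and the graph layout are illustrative:

```python
# Sketch of cluster expansion: grow S from the seed pair, repeatedly
# adding the unclustered neighbor with the highest support, as long as
# both the support (Ts) and density (Td) thresholds hold.

def expand(graph, seed_u, seed_v, Ts, Td, unclustered):
    S = {seed_u, seed_v}

    def support(u):
        return sum(w for v, w in graph[u].items() if v in S)

    def density(nodes):
        n = len(nodes)
        if n < 2:
            return 1.0
        total = sum(w for a in nodes for b, w in graph[a].items() if b in nodes) / 2
        return total / (n * (n - 1) / 2)

    while True:
        candidates = {v for u in S for v in graph[u]
                      if v in unclustered and v not in S}
        if not candidates:
            break
        t = max(candidates, key=support)
        # reject t if either threshold is violated, and finish the cluster
        if support(t) < Ts or density(S | {t}) < Td:
            break
        S.add(t)
    return S
```
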
7- Data Structures and Running Time
Several data structures are defined for the algorithm
implementation. Most of the data structures are based on
the hash table data structure for fast insert, update, and
retrievals. For keeping track of node weights, a hash
table, WeightsH, is used, where the keys are node names
and the values are the corresponding weights, so Insert,
UpdateKey, and retrieve operations can be done in O(1).
For selecting seed nodes, we need a data structure that
can retrieve nodes efficiently. For this purpose, we use
a hash table, DegreeH, where the keys are rounded node
weights and the values are hash sets of the nodes with
the corresponding weight. Since each node's weight can be
accessed in constant time via WeightsH, Insert,
DecreaseKey, and retrieve can be done in O(1) time in
DegreeH. If we keep track of the maximum node weight
while constructing DegreeH, it gives a starting point for
finding the highest-degree nodes in DegreeH. The
next data structure we describe is used for cluster
expansions. A data structure is needed that we can insert,
update, and retrieve the node with highest support in
efficient time. For this purpose, in the sequential SPICi
algorithm, a Fibonacci heap is used which in amortized
analysis, achieves a running time of O(1) for both Insert
and update operations and O(lg n) time for extracting the
node with maximum support. In Parallel SPICi, a scenario
exists where the node with the highest support is already
selected by another process, so each process may make
multiple calls to the ExtractMax function. In Parallel
SPICi, we use a hash set data structure, CandidateH,
with O(1) running time for insert and update operations.
For an efficient ExtractMax, we first sort the hash set
using insertion sort, which performs in O(n + d) on
average, where n is the number of set members and d is
the number of inversions; insertion sort takes linear
O(n) time on nearly sorted data. Thus, if ExtractMax is
not successful on the first attempt, subsequent
ExtractMax attempts take O(1) time. Since on every
successful ExtractMax operation the unclustered neighbors
of the new node are added to the sorted CandidateH, it
can be re-sorted in nearly O(n) time using insertion sort.
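The bucketed DegreeH idea could be sketched as follows; the class and method names are illustrative, not the paper's implementation:

```python
# Sketch of DegreeH: buckets of nodes keyed by rounded node weight,
# giving O(1) insert and DecreaseKey; extracting a high-weight node
# only has to scan occupied bucket keys, not individual nodes.

class DegreeH:
    def __init__(self):
        self.buckets = {}  # rounded weight -> set of node names

    def insert(self, node, weight):
        # O(1): drop the node into the bucket for its rounded weight
        self.buckets.setdefault(round(weight), set()).add(node)

    def decrease_key(self, node, old_weight, new_weight):
        # O(1): move the node between buckets only if its rounded key changed
        old_key, new_key = round(old_weight), round(new_weight)
        if old_key != new_key:
            self.buckets[old_key].discard(node)
            self.buckets.setdefault(new_key, set()).add(node)

    def pop_max(self):
        # scan bucket keys from the highest downward
        for key in sorted(self.buckets, reverse=True):
            if self.buckets[key]:
                return self.buckets[key].pop()
        return None  # structure is empty
```
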
Figure 1 shows the steps of the proposed parallel
algorithm in detail.
Initialize DegreeH to contain all nodes in V
Initialize WeightsH to contain all nodes in V with their weight values
Initialize Dc as the degree of concurrency
While DegreeH is not empty:
  1- Select Dc nodes, u[], from DegreeH, considering the criteria described in the text.
  2- For each node k in u[], do in parallel:
       If k has a second seed node:
         Find the second seed node, v, among the adjacent vertices as described in the text.
         Add Expand(k, v) to the cluster set S.
       Else add {k} to the cluster set S.
  3- Remove the resulting cluster nodes from V.
  4- For each neighbor t of a node in a resulting cluster, decrease its weight by
     support(t, cluster) in WeightsH, then update DegreeH accordingly.

Expand(u, v):
  Initialize cluster S as {u, v}
  Initialize CandidateH to contain the unprocessed neighbors of S
  While CandidateH is not empty:
    1- Sort CandidateH using insertion sort.
    2- Try to get the node t with the highest support (see text).
       If support(t, S) ≥ Ts · |S| · density(S) and density(S ∪ {t}) > Td:
         Add t to S
         Increase the support of vertices connected to t in CandidateH
         Add all unclustered neighbors of t to CandidateH, if not already processed
       Else break from the loop

Figure 1. Parallel SPICi pseudocode
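The structure of the outer loop in Figure 1 can be illustrated with Python's thread pool standing in for the paper's C# Task Parallel Library. This is a structural sketch only: `expand_stub` is a placeholder, not the real expansion, and seed filtering is omitted.

```python
from concurrent.futures import ThreadPoolExecutor

def expand_stub(graph, seed, snapshot):
    # placeholder: a real implementation runs the full SPICi expansion
    return {seed} | (set(graph[seed]) & snapshot)

def parallel_spici(graph, Dc=5):
    remaining = set(graph)
    clusters = []
    with ThreadPoolExecutor(max_workers=Dc) as pool:
        while remaining:
            # pick up to Dc seeds per iteration (seed filtering omitted)
            seeds = sorted(remaining)[:Dc]
            snapshot = frozenset(remaining)  # stable view for the workers
            results = list(pool.map(lambda s: expand_stub(graph, s, snapshot),
                                    seeds))
            for cluster in results:
                cluster &= remaining  # drop nodes claimed by an earlier cluster
                if cluster:
                    clusters.append(cluster)
                    remaining -= cluster  # processed nodes leave the graph
    return clusters
```

Forcing `pool.map` into a list before touching `remaining` keeps the workers' snapshot consistent, mirroring the paper's rule that processed nodes are removed only after each round completes.
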
8- Experimental Results
We used Intel Quad core 2.8Ghz CPU with 8 GB of
memory for performing the experiments as the main
machine. The programming language for algorithm
implementation is C# 4.0 using “Task Parallel Library”.
STRING human dataset with 17369 nodes and 1288886
edges is used as test dataset. We set Ts and Td to 0.5 and
Tn to 900. After every 20 clusters is outputted the Tn is
decreased by 50 until ¾ of all nodes are processed.
Degree of Concurrency is also set to 5 in our
experiments. We have tested our system robustness with
different type of CPUs and in different machines and the
running time results in seconds are as follows:
    CPU                   Sequential SPICi   Parallel SPICi
    Pentium 4 (2 GHz)            23                27
    Core2 Duo (3 GHz)            15                11
    Quad Core (2.8 GHz)          14                 5
    Core i7 (2.8 GHz)            14                 4

Table 1. Running time (seconds) of the algorithms on different CPU
architectures
As shown in Table 1, both the Core i7 and the Quad Core
CPUs have 4 cores; however, according to Intel's
specifications, the Core i7 CPU supports 8 threads at the
OS level whereas the Quad Core CPU supports 4. This
difference may explain the faster computation time on the
Core i7 system.
The performance of the parallel algorithm is also tested
with different values of the degree of concurrency on our
main test machine, to observe the effect of the
concurrency degree on the system.
[Figure: running time in seconds (0-18) versus number of processes (1-15)]

Figure 2. Running time of Parallel SPICi with different degrees of concurrency
Figure 2 shows that as the number of processes in the
system increases, the overhead of concurrency has a
negative effect on the running time of the system; this
overhead can be explained as the time needed to create
processes and the cost of hits and misses when accessing
shared data.
We have also tested the quality of the clusters generated
by Parallel SPICi using the GO-annotation-based method
proposed by Song and Singh [20]. They use three measures,
Jaccard, PR (precision-recall), and semantic density, to
quantify the overlap between the resulting clusters and
GO functional modules. As in the sequential SPICi paper,
we calculate these measures for the Biological Process
and Cellular Component subontologies separately; for a
detailed description, please refer to the SPICi paper [1]
and Song and Singh [20]. The results show that the
quality of Parallel SPICi is almost the same as that of
sequential SPICi.
                              BP                         CC
    Algorithm        sDensity  Jaccard  PR      sDensity  Jaccard  PR
    Parallel SPICi    0.290     0.187   0.172    0.312     0.113   0.088
    Sequential SPICi  0.311     0.209   0.176    0.328     0.119   0.096

Table 2. GO analysis of clusters
The analysis shows that the sequential SPICi algorithm
produces slightly better results than Parallel SPICi. The
reason is that the clusters produced by Parallel SPICi
have, on average, fewer members than the sequential
clusters: in Parallel SPICi, processes race with each
other to take nodes, and as a result each cluster ends up
with fewer nodes.
9- Conclusions and Future Work
Analyzing and extracting functional modules in large-
scale networks is a challenging research topic in
bioinformatics. Recent methods proposed fast and
efficient algorithms with acceptable cluster quality.
Jiang and Singh [1] proposed the SPICi algorithm, which
can cluster large and dense networks by selecting the
node with the highest node weight as the first seed node
and forming the seed pair by choosing, among the
neighbors of the first node, the node with the highest
edge value and node weight. After forming the seed pair,
nodes with high support for the current set are added to
the initial seed set, and the procedure stops when the
density of the resulting cluster falls below a defined
threshold. The algorithm iterates until no more nodes are
left in the graph. Most of the
previous methods are based on serial execution. Today's
computing power enables parallel execution of multiple
functions, benefiting concurrent methods in terms of
performance. In this paper, a concurrent implementation
of SPICi is proposed. Multiple start nodes are selected
as the first seeds of independent clusters; then,
concurrently, second seed nodes are chosen following the
same procedure as in sequential SPICi, while checking
that the same node does not appear in two clusters at the
same time. The expansion method concurrently expands each
seed pair according to the criteria explained above.
Finally, the processed nodes are removed from the graph,
and the algorithm iterates until no more nodes are left.
We have shown that, by utilizing multiple cores in a
system, the SPICi algorithm can be parallelized and a
three- to four-fold speed-up is possible without
sacrificing the accuracy of the resulting clusters.
Future work includes better selection of first seed nodes
with more independence, running the algorithm on
clustered systems, and measuring its performance with
different thresholds.
10- References
[1] Jiang,P. and Singh,M. (2010) SPICi: a fast clustering
algorithm for large biological networks. Bioinformatics,
26, 1105–1111.
[2] Yang,Q. and Lonardi,S. (2005) A parallel algorithm
for clustering protein-protein interaction networks. In
2005 IEEE Computational Systems Bioinformatics Conference
- Workshops (CSBW'05), pp. 174–177.
[3] Bader,G.D. and Hogue,C.W. (2003) An automated method
for finding molecular complexes in large protein
interaction networks. BMC Bioinformatics, 4, 2.
[4] Hartwell,L.H. et al. (1999) From molecular to
modular cell biology. Nature, 402, 6761
[5] Pereira-Leal,J. et al. (2004) Detection of functional
modules from protein interaction networks. Proteins, 54,
49–57.
[6] Rives,A.W. and Galitski,T. (2003) Modular
organization of cellular networks. Proc. Natl Acad. Sci.
USA, 100, 1128–1133.
[7] Spirin,V. and Mirny,L.A. (2003) Protein complexes and
functional modules in molecular networks. Proc. Natl
Acad. Sci. USA, 100, 12123.
[8] Altaf-Ul-Amin,M. et al. (2006) Development and
implementation of an algorithm for detection of protein
complexes in large interaction networks. BMC
Bioinformatics, 7, 207.
[9] Brun,C. et al. (2003) Functional classification of
proteins for the prediction of cellular function from a
protein-protein interaction network. Genome Biol., 5, R6
[10] Chen,J. and Yuan,B. (2006) Detecting functional
modules in the yeast protein-protein interaction network.
Bioinformatics, 22, 2283–2290.
[11] Colak,R. et al. (2009) Dense graphlet statistics of
protein interaction and random networks. In Pacific
Symposium on Biocomputing, pp. 178–189.
[12] Enright,A.J. et al. (2002) An efficient algorithm for
large-scale detection of protein families. Nucleic Acids
Res., 30, 1575–1584.
[13] Georgii,E. et al.(2009) Enumeration of condition-
dependent dense modules in protein interaction networks.
Bioinformatics, 25, 933–940.
[14] King,A.D. et al. (2004) Protein complex prediction
via cost-based clustering. Bioinformatics, 20, 3013–3020.
[15] Loewenstein,Y. et al. (2008) Efficient algorithms for
accurate hierarchical clustering of huge datasets: tackling
the entire protein space. Bioinformatics, 24, i41–i49
[16] Navlakha,S. et al. (2009) Revealing biological
modules via graph summarization. J. Comput. Biol., 16,
253–264.
[17] Palla,G. et al. (2005) Uncovering the overlapping
community structure of complex networks in nature and
society. Nature, 435, 814–818.
[18] Samanta,M. and Liang,S. (2003) Predicting protein
functions from redundancies in large-scale protein
interaction networks. Proc. Natl Acad. Sci. USA, 100,
12579–12583
[19] Sharan,R. et al. (2005) Conserved patterns of protein
interaction in multiple species. Proc. Natl Acad. Sci.
USA, 102, 1974–1979
[20] Song,J. and Singh,M. (2009) How and when should
interactome-derived clusters be used to predict
functional modules and protein function? Bioinformatics,
25, 3143–3150.