ORIGINAL ARTICLE
A self-organized approach for detecting communities in networks
Ben Collingsworth • Ronaldo Menezes
Received: 9 May 2013 / Revised: 6 September 2013 / Accepted: 21 November 2013
© Springer-Verlag Wien 2014
Abstract The community structure of networks reveals
hidden information about the whole network structure that
cannot be discerned using other topological properties. Yet
the importance of identifying community structure in networks
to many fields, such as medicine, social sciences, and
national security, calls for better approaches to performing
the identification. The prevalent community detection
algorithms utilize a centralized approach that is unlikely to
scale to very large networks and does not handle dynamic
networks. We propose a self-organized approach to com-
munity detection which utilizes a newly introduced concept
of node entropy to allow individual nodes to make
decentralized and independent decisions concerning the
community to which they belong; we call our approach
Self-Organized Community Identification ALgorithm
(SOCIAL). Node entropy is a mathematical expression of
an individual node’s satisfaction with its current community.
As nodes become more “satisfied”, i.e., as their entropy
becomes low, the community structure of the network emerges.
Our algorithm offers several advantages over existing
algorithms including near-linear performance, identifica-
tion of partial community overlaps, and handling of
dynamic changes in the network in a local manner.
1 Introduction
The identification of community structure in networks has
helped many scientific fields understand complex phe-
nomena (John 2011); prime examples include: medicine
(Srividhya et al. 2012), social sciences (Akshay et al.
2007), national security (Valdis 2002), and marketing
(Adomavicius and Tuzhilin 2005; Troy and Chawla 2011).
Community structure reveals global hidden information
about networks that cannot be discerned using other
topological properties. Early research in this area reasonably
focused on a limited aspect of the problem: finding
communities. However, these solutions neglected important
properties of real-world networks: (1)
the enormous size of many networks, (2) community
overlap (i.e., a node belonging to multiple communities),
and (3) the dynamic nature of networks. The importance of
the first property (network size) is evident when running
one of the various network analysis packages on a very
large network; the package may require many hours to run.
This issue is heightened with increased interest in complex
network analysis coupled with accessibility to databases
capable of generating huge networks. The second property
(community overlap) is the reality that nodes in a network
often belong to more than one community. This property is
particularly evident in social networks describing direct
relationships between people (Bradley et al. 2012). For
example, in a network describing ties between students, an
individual may participate in activities such as academic
societies, sports activities, and religious affiliations, which
may constitute several communities. Important information
is lost if this student must be cast into a single community.
Finally, real-world networks are constantly evolving, with
nodes and edges dynamically being created and removed.
Networks representing the spread of pandemic diseases are
B. Collingsworth · R. Menezes (✉)
Computer Sciences BioComplex Laboratory,
Florida Institute of Technology, Melbourne, FL, USA
e-mail: [email protected]
B. Collingsworth
e-mail: [email protected]
Soc. Netw. Anal. Min. (2014) 4:169
DOI 10.1007/s13278-014-0169-5
a good example of these dynamic structures. Here, the
network is continuously changing as the disease spreads to
new individuals and expires in those that are infected. In
these networks, a community detection algorithm that must
be completely re-executed with each update is impractical.
The importance of community detection is highlighted by
the increasing attention it receives from researchers.
Indeed, a search on arXiv.org for the words “community”
and “networks” reveals that about 10 % of papers published
today are about networks and that more than half of them
relate to communities (details in Fig. 1). Moreover, the
increase over the years is quite significant (5-fold for
networks and 3-fold for community-related papers).
We propose a community detection algorithm which
observes these three properties while maintaining partition
quality. In this algorithm, individual nodes are indepen-
dently responsible for determining the community to which
they belong. The mechanism for making this decision is
derived from Shannon entropy (Claude et al. 1948) and
requires a node to have knowledge only of its immediate
neighbors. Initially, entropy is high, and there is tension in
the network as communities form and nodes make deci-
sions to join or leave these communities. Over time,
entropy becomes low as nodes are satisfied with their
current community. At this point, community structure is
emergent. Since each node’s decision on community is
based only on immediate neighbors, the algorithm offers
near-linear performance on average. In addition, near-lin-
ear performance is preserved as the size of the network
increases. Nodes that belong to multiple communities are
identified by a high individual entropy when the overall
network has stabilized. These nodes may be seen to toggle
between communities because they are uncertain about
where they belong; moreover, the toggling allows us to
introduce a completely new idea on community overlap in
which nodes can belong to multiple communities at dif-
ferent levels of belongingness (e.g., a node can be 20 % in
one community and 80 % in another). These levels of
belongingness make SOCIAL fundamentally different
from other approaches able to identify overlaps such as the
clique-percolation approach (Gergely et al. 2005). Finally,
dynamic changes to the network are processed locally. The
processing required to adapt to a change in the network is
proportional to the number of nodes directly impacted by
the change.
2 Related work
In this section, we describe a few of the community detection
algorithms proposed to date. Our intent is not to be
comprehensive but rather to demonstrate that each generally
fails to handle at least one of the issues described in
Sect. 1 that SOCIAL handles, namely (1) complexity,
(2) detection of community overlap, and (3) the ability
to handle dynamic network changes efficiently. For a
comprehensive survey of community detection algorithms, see
the work of Fortunato (Santo 2010).
A broad class of community detection algorithms may
be termed metric-based heuristic algorithms where some
network, node, or edge property is calculated across the
network by means of a centralized control process, and
used to partition the network into communities. A well-
known example of this type of algorithm is the Girvan–
Newman (Girvan and Newman 2002) algorithm. This
algorithm utilizes edge betweenness centrality to perform a
sequence of divisive cuts in the network which result in a
community partition. Intuitively, the edge with the highest
betweenness centrality is likely to be a bridge between
communities, and its removal will result in the isolation of
two communities previously connected by it (see Fig. 2a).
The algorithm is run iteratively, where at each iteration, an
edge is selected and removed. In addition, each iteration
updates a dendrogram structure which reflects the current
partition (see Fig. 2b). The algorithm is terminated when
all edges have been removed. At this point, the dendrogram
is used to select the desired level of community decomposition.
The performance of the algorithm is tied to the edge
betweenness calculation, which is O(n^2) for each iteration
and leads to an overall complexity of O(n^2 m),
where m is the number of iterations (edges) and n is the
number of nodes in the network. Optimization techniques
have been developed to reduce the cost of this calculation
(Wilkinson and Huberman 2004; Rattigan et al. 2007).
However, these optimizations impact the quality of the
partition. The algorithm does not reveal community over-
lap; each node is assigned to a single community. Further,
the algorithm is not suited for dynamic networks since
changes in network topology require a completely new
dendrogram to be generated (restart of the whole process).
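The divisive step described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: it assumes an unweighted, undirected graph stored as a dictionary of neighbor sets, computes edge betweenness with a Brandes-style breadth-first search, and stops after the first split instead of building the full dendrogram. The function names and the toy graph are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness (Brandes-style) for an unweighted graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s: count shortest paths and predecessors.
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, crediting each edge with its share.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every node pair was visited from both endpoints, so halve the totals.
    return {edge: total / 2.0 for edge, total in score.items()}


def components(adj):
    """Connected components of an adjacency-set graph."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until one more component appears."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    target = len(components(adj)) + 1
    removed = []
    while len(components(adj)) < target and any(adj.values()):
        scores = edge_betweenness(adj)
        u, v = sorted(max(scores, key=scores.get))
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append((u, v))
    return removed, components(adj)


# Hypothetical toy graph: two triangles joined by the bridge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
removed, parts = girvan_newman_split(toy)
```

On the toy graph the bridge carries all shortest paths between the two triangles, so it is the first edge removed. Betweenness is recomputed after every removal, which is where the cost discussed above comes from.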
Fig. 1 Percentage of articles placed on arXiv.org that relate to
“networks” and “community”. The chart shows that nearly
6 % of the papers placed in the arXiv in 2012 relate to communities.
The arXiv is the main repository for papers in Network Sciences
(from bookworm ArXiv http://bit.ly/RlRNBD. Last accessed on
November 10, 2012)
Another metric-based algorithm is the Blondel et al.
(2008) algorithm which utilizes the Girvan–Newman
modularity metric (Clauset et al. 2004) to discern com-
munity structure. Modularity is a quality function based on
the idea that random graphs are not expected to contain
cluster structures. The modularity is then used as a com-
parison between the actual density of edges in a subgraph
and the density we would expect to have if the vertices of
the graph were attached randomly. The algorithm uses an
iterative two-phased approach. In the first phase of each
iteration, nodes are assigned to communities in an
arrangement that maximizes modularity. In the second
phase, a new network is constructed where nodes in the
same community are combined to create a single node. The
network resulting from the second phase is the starting
point for the first phase of the next iteration. The algorithm
terminates when modularity is maximized and no addi-
tional updates can be made. The output of the algorithm is
the hierarchical set of communities which correspond to
each iteration. While the Blondel algorithm offers a com-
plexity of O(n), it has been shown to have accuracy issues
which lead to spurious partitions during hierarchical
agglomeration (Santo 2010). In addition, Girvan–Newman
Modularity exhibits a resolution loss when applied to dense
networks (Fortunato and Barthelemy 2007). Finally,
Blondel’s algorithm does not detect community overlap
and is not suited for managing dynamic networks.
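To make the modularity comparison concrete, the sketch below evaluates the standard modularity score of a given partition: for each community, the fraction of edges that fall inside it minus the fraction expected if edges were attached at random with the same degrees. It only scores a partition and does not perform Blondel's two-phase optimization. The function name and the toy graph are illustrative assumptions.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: mapping from node to community label.
    """
    m = len(edges)
    intra = {}    # community -> number of edges with both ends inside it
    degree = {}   # community -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    # Observed intra-community edge fraction minus the random expectation.
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Hypothetical toy graph: two triangles joined by a single bridge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
everything = {v: "a" for v in range(6)}

q_split = modularity(edges, by_triangle)   # 5/14, about 0.357
q_single = modularity(edges, everything)   # 0.0
```

Splitting the toy graph along the bridge scores higher than lumping every node together, which is the signal a modularity-maximizing algorithm follows.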
A large class of community detection algorithms utilizes
spectral analysis to partition networks. These algorithms,
referred to as spectral-clustering algorithms (Santo 2010),
transform the relationship between nodes into a spatial
relationship in which similarities between nodes are much
more evident, thereby simplifying community detection.
The transformation is done by generating and evaluating
the eigenvectors of various matrices derived from the
network adjacency matrix. Laplacian matrices are particularly
effective for this transformation because the eigenvectors
produced correspond to node coordinate vectors in
a k-dimensional space, where nodes belonging to the same
community are in close proximity to each other. From this
transformation, simple clustering algorithms such as
k-means may be applied to identify the set of communities,
based on node density in the spatial representation of the
network. An early example of the use of spectral clustering
is given by Shi et al. (1997). Here, the algorithm is used to
accomplish perceptual grouping (i.e., image partitioning).
They reduce the problem to simple network partitioning by
transforming images into spatial representations based on
attributes such as brightness, color, and texture. The algorithm
groups image features into communities by examining
the coherence of these attributes in Euclidean space.
Similar to the Girvan–Newman algorithm, spectral clustering
generates good community partitions but has complexity
issues. The complexity of producing the
eigenvectors for a network containing n nodes is O(n^3).
Again, strategies have been developed to reduce this
complexity, but at the cost of partition quality. Given that
spectral clustering reveals each node’s proximity to other
nodes, and hence proximity to communities, community
overlap detection is achievable. Indeed, Ma et al. (2010)
have demonstrated this capability by extending the tradi-
tional spectral clustering algorithm to recognize candidate
overlapping nodes through examination of spatial density.
Finally, as shown by Ning et al. (2010), spectral clustering
may be applied to dynamic networks. This is accomplished
by maintaining incremental approximations of the eigen-
values and eigenvectors rather than performing a recalcu-
lation of eigenvectors after each update to the network.
This approach has shown promising initial results. How-
ever, Ning et al. concede that the incremental approxima-
tions incur cumulative errors, which impact the algorithm’s
performance.
Raghavan et al. (2007) introduced a family of commu-
nity detection algorithms which utilize label propagation.
Label propagation algorithms (LPAs) are similar to
SOCIAL in that they are self-organizing and retain no
Fig. 2 Girvan–Newman algorithm based on systematic identification of edges with highest betweenness (a), which yields a dendrogram that represents
the series of cuts made during the algorithm (b)
global information regarding the network. Label propaga-
tion begins by assigning a unique label to each node in the
network. Following this, an iterative process is used where
each node is selected and assigned the label shared by the
majority of its neighbors. During an iteration, the order in
which nodes are selected is random. If a tie occurs between
the possible labels a node may select, the label is selected
randomly from the set of tied candidate labels. Iteration
continues until every node in the network has the label to
which the maximum number of its neighbors belong.
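The label propagation procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of Raghavan et al. or of any library: the graph is a dictionary of neighbor sets, and the `seed` and `max_sweeps` parameters are hypothetical additions that make runs repeatable and bounded.

```python
import random
from collections import Counter


def label_propagation(adj, seed=None, max_sweeps=1000):
    """Asynchronous label propagation with random tie breaking."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}   # every node starts with a unique label
    nodes = list(adj)

    def top_labels(v):
        # Labels carried by the largest number of v's neighbors.
        counts = Counter(labels[w] for w in adj[v])
        best = max(counts.values())
        return [lab for lab, c in counts.items() if c == best]

    for _ in range(max_sweeps):
        rng.shuffle(nodes)         # random visiting order in each iteration
        for v in nodes:
            if adj[v]:
                labels[v] = rng.choice(top_labels(v))  # ties broken randomly
        # Stop once every node carries a label held by a maximal
        # number of its neighbors.
        if all(not adj[v] or labels[v] in top_labels(v) for v in nodes):
            break
    return labels


# Hypothetical toy graph: two disconnected triangles.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = label_propagation(toy, seed=1)
```

Because labels only travel along edges, the two triangles always end with different labels. On connected graphs with ambiguous nodes, changing `seed` can change the reported partition.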
Random tie breaking in the label propagation algorithm
leads to the disadvantage that different community struc-
tures are reported on different runs of the algorithm. As a
result, the algorithm must be run a number of times, with a
consensus of the results taken as the true community
structure. This ambiguity may be observed in the simple
network shown in Fig. 3. Arguably, by observing the
density of connections between nodes, this network can be
separated into the two distinct communities shown. Each
community has a higher number of edges between com-
munity members than edges going to the peer community.
Further, each community has a relatively large number of
triads, indicating social structure. Node 1 has an equal
number of edges going to each community. However, node
1 completes two triads in the red community and one triad
in the blue community, indicating stronger membership to
the red community. The LPA provided by the igraph
(Csardi and Nepusz 2006) network analysis package was
run ten times on this network with the following results: (1)
in four of the runs, two communities were detected as
shown in Fig. 3; (2) in four of the runs, two communities
were detected with node 1 in the blue community; and (3) in
two of the runs, a single community was discerned. For
comparison, SOCIAL was run on the same network ten times.
SOCIAL detected the communities shown in Fig. 3 in
every run of the algorithm. SOCIAL has two advantages with
respect to accuracy and consistency. First, the SOCIAL entropy
calculation favors community assignments where triads, a
fundamental social network structure, are present. Second,
the SOCIAL algorithm allows nodes to choose to make no
decision on community selection if a node’s entropy is
high. Delayed community selection of the uncertain nodes
allows other less ambiguous nodes to begin forming true
communities based on the presence of community struc-
ture. Once the core community structure is formed, the
uncertain nodes have a more sound basis for community
selection. This process eliminates scenarios where a ran-
dom ordering of community selection allows a spurious
community to emerge and claim the entire network.
Enhancements to the Raghavan et al. algorithm have been
made to extend its functionality to support dynamic net-
work processing and community overlap detection as
supported by SOCIAL. Xie et al. (2013) have proposed a
mechanism to allow the algorithm to remain operating after
convergence to receive and process dynamic changes in the
network. As with SOCIAL, changes in the network are
incorporated into the last snapshot of community structure
and the algorithm continues to iterate and process these
changes. Gregory (2010) has developed a version of the
label propagation algorithm which detects community
overlap. In this algorithm, information regarding label
selection, including candidate labels which were rejected,
is retained and reported when convergence is achieved.
While these extensions increase the utility of label propagation,
they further exacerbate the issue of requiring multiple
runs to reach a consensus. In the case of processing
dynamic networks, this would seem to require running
multiple instances of the algorithm concurrently for con-
sensus to be achieved. Similarly, achieving consensus on
community overlap is complex and burdensome. Clearly,
SOCIAL provides a preferable solution. Community
detection provided by SOCIAL is accurate and consistent.
Further, the ability to process dynamic network updates
and community overlap detection is inherent to the algo-
rithm, i.e., disparate versions of the algorithm are not
required to achieve full functionality.
Many hybrid solutions have been developed to address
the community detection problem. One interesting example
of this is the solution proposed by Cruz et al. (2011). They
apply a combination of algorithms for community detec-
tion in social networks. First, the agglomerative Blondel
(2008) algorithm is used to perform community partition-
ing based on modularity. Following the partitioning by
Blondel’s algorithm, community structure is enhanced
through observation of semantic information contained in
the network. The semantic information consists of attri-
butes assigned to each node. For example, in a network of
employees, the attributes might include age, gender, and
profession. An entropy calculation is used to measure the
Fig. 3 Sample network containing two communities, which highlights
the inconsistency of LPA and the need for multiple runs to achieve
consensus. In the text of the paper, the red community refers to the
nodes on the right-hand side of the figure, while the blue
community refers to the nodes on the left-hand side
(color figure online)
level of similarity between a node and the peers in its
community. The goal of entropy reduction is achieved by
means of Monte Carlo selection. In this process, nodes are
selected and moved to the community where the lowest
entropy value is calculated. The combination of algorithms
is executed iteratively, maximizing modularity with
Blondel’s algorithm and minimizing entropy through the
relocation of nodes to communities with common attri-
butes. Since modularity optimization increases entropy,
and entropy optimization decreases modularity, a balance
must be struck between the two algorithms. The overall
process favors modularity by terminating when modularity
improvements are no longer achievable. The use of
semantic information improves partition quality. However,
the overall approach of Cruz et al. does not support com-
munity overlap or dynamic networks. The use of semantic
information suggests a capability for community overlap
detection. However, the semantics have the effect of
grouping nodes that have the highest commonality among
the entire set of semantics. For example, using the student
network previously described, students belonging to the same
academic society, sports team, and religious group would
be attracted to each other to form a single community
rather than each individual being assigned to four over-
lapping communities.
3 Algorithm description
As argued earlier, SOCIAL’s main concern is with decen-
tralization. The algorithm is inspired by self-organized systems
observed in nature. In these systems, complex goals are
achieved with high efficiency and robustness (Marco et al.
2006). The bottlenecks associated with centralized control are
removed as work is performed by individuals possessing a
minimal amount of intelligence and tools required for the task.
From the apparent chaos of no centralized leadership, highly
effective processes are deployed with frequently astounding
emergent results (e.g., foraging in ant colonies and beehives).
The basis for the self-organized community detection
algorithm is node entropy calculation. Node entropy is
derived from Shannon entropy, and expresses the certainty
each node has with regard to its current community. The
use of entropy is consistent with the simple localized
algorithms employed by other self-organized systems. The
work of discerning community structure is focused on
individual nodes rather than the entire network. Equation 1
shows the node entropy (S) calculation.
$$S = -\sum_{i=1}^{n} \frac{p_i \log_2 p_i}{e^{k/4}}, \qquad (1)$$
where n is the number of communities a node may
potentially join (including its current community), p_i is
the ratio of the number of neighbors in potential community i
to the neighborhood size of the node, and
k is the number of intra-community triads formed by
joining a particular community. As shown by the equation,
intra-community triad structures are favored by the
algorithm. Knowledge beyond a node’s immediate
neighborhood is not required to perform the calculation.
To determine which community a node belongs to, the
entropy calculation is performed with the node tempo-
rarily assigned to each community it could possibly join
(pi). It is important to note that entropy is also calculated
on the community that the node currently belongs to.
Here, if the node is the only member of the community,
it is essentially assumed to be connected to itself. After
the entropy calculations are performed, the resulting
entropy values are then evaluated to make a decision on
community assignment. For example, Fig. 4 shows a
simple network where node 4 must decide which com-
munity to choose.
The entropy calculation is done for each of the three
choices node 4 has:
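1. Stay in the green community:
$$S_g = -\left(\frac{1}{5}\log_2\frac{1}{5} + \frac{3}{5}\log_2\frac{3}{5} + \frac{1}{5}\log_2\frac{1}{5}\right) = 1.37$$
2. Join the blue community:
$$S_b = -\left(\frac{1}{5}\log_2\frac{1}{5} + \frac{\frac{4}{5}\log_2\frac{4}{5}}{e^{0.25}}\right) = 0.264$$
3. Join the red community:
$$S_r = -\left(\frac{2}{5}\log_2\frac{2}{5} + \frac{3}{5}\log_2\frac{3}{5}\right) = 0.972$$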
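The node entropy of Eq. 1 can be sketched as a short function. This is an illustration under assumptions rather than the authors' code: each term is given as a hypothetical pair (neighbor count, triad count), and the triad discount e^{k/4} is applied per term using that term's own k. The sketch is checked here only against the two triad-free cases of the worked example, for which it gives approximately 1.37 and 0.97.

```python
import math


def node_entropy(terms):
    """Node entropy S for one candidate community assignment.

    terms: list of (count, k) pairs, one per community in the node's
    neighborhood, where count is the number of neighborhood members in
    that community and k is the number of intra-community triads
    credited to that term (assumption: the discount is applied per term).
    """
    total = sum(count for count, _ in terms)
    s = 0.0
    for count, k in terms:
        if count == 0:
            continue                                  # empty terms add nothing
        p = count / total                             # share of the neighborhood
        s -= p * math.log2(p) / math.exp(k / 4)       # triads reduce the term
    return s


# Triad-free cases from the worked example for node 4.
s_green = node_entropy([(1, 0), (3, 0), (1, 0)])   # about 1.37
s_red = node_entropy([(2, 0), (3, 0)])             # about 0.97
```

With all k equal to zero the function reduces to plain Shannon entropy of the neighborhood shares. Raising k for a term shrinks its contribution, which is how assignments that close triads end up with lower entropy.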
Following the entropy calculation, the entropy values
are inverted so that the “best” entropies are the largest.
1. Staying in the green community: 0.74
2. Join the blue community: 3.79
3. Join the red community: 1.03
Fig. 4 Example used to demonstrate how the entropy is calculated.
The colors represent different communities. The example focuses on
node 4. The red community is composed of nodes (1, 2, and 3), the
green community is composed of only node (4), and the blue
community is composed of nodes (5, 6, 7, and 8) (color figure online)
Finally, the entropy values are normalized for the rou-
lette wheel selection used to choose a community. The
normalization makes the sum of entropy values equal to 1.
1. Staying in the green community: 0.132
2. Join the blue community: 0.682
3. Join the red community: 0.186
The normalized entropy values may be viewed as the
probability of a particular choice being made. For example,
there is a 68.2 % probability that the blue community will
be chosen for node 4.
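The inversion, normalization, and roulette wheel selection can be sketched as below. The sketch assumes that "inverted" means taking the reciprocal 1/S, which is consistent with the values listed above up to rounding, and that every entropy is strictly positive. The function names are hypothetical.

```python
import random


def selection_probabilities(entropies):
    """Invert each entropy (assumed reciprocal) and normalize to sum to 1."""
    inverted = {c: 1.0 / s for c, s in entropies.items()}  # requires s > 0
    total = sum(inverted.values())
    return {c: value / total for c, value in inverted.items()}


def roulette_select(probabilities, rng=random):
    """Pick a community with probability proportional to its share."""
    spin = rng.random()
    cumulative = 0.0
    for community, p in probabilities.items():
        cumulative += p
        if spin < cumulative:
            return community
    return community  # guard against floating-point round-off


# Entropy values for node 4 taken from the worked example.
probs = selection_probabilities({"green": 1.37, "blue": 0.264, "red": 0.972})
choice = roulette_select(probs, random.Random(0))
```

The resulting shares are close to 0.13 for green, 0.68 for blue, and 0.19 for red, matching the normalized values above to within rounding. Since the choice is random, node 4 usually joins the blue community but not always.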
Algorithm 1 describes the overall community detection
process. Because of SOCIAL’s ability to process dynamic
changes to the network, it continues running, looking for
changes in the network, until the user terminates it. For
static networks, the user runs SOCIAL until the message
indicating a stable community structure is displayed, and
then terminates the algorithm. During each iteration, a growing
subset of nodes recognize that they are completely certain
of their current community. For these nodes, the entropy
calculations are not performed. A node is in this state in
two cases. First, the node is completely surrounded
by neighbors that are in the same community as
itself. Second, the inverted normalized entropy
value for the node in its current community is greater than
or equal to 0.80 (i.e., 80 %) and none of the node’s
neighbors changed community in the preceding iteration.
An added performance characteristic of the algorithm
is the ability to calculate the node entropy values
required for each iteration in parallel. As may be observed
in any beehive or anthill, concurrent activity is the rule for
self-organizing systems. With today’s multi-core and par-
allel computer architectures, algorithms which support
concurrent operations enjoy a significant performance
advantage over centralized algorithms.
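The two conditions under which a node skips its entropy calculation can be sketched as a single predicate. The data layout is an assumption made for illustration: `adj` maps nodes to neighbor sets, `community_of` maps nodes to community labels, `own_share` is the node's inverted normalized entropy in its current community, and `changed_last_iteration` is the set of nodes that switched community in the preceding iteration.

```python
def is_settled(node, adj, community_of, own_share,
               changed_last_iteration, threshold=0.80):
    """Return True if the node can skip its entropy calculations."""
    neighbors = adj[node]
    # Case 1: every neighbor is in the node's own community.
    if all(community_of[w] == community_of[node] for w in neighbors):
        return True
    # Case 2: the node is confident enough and its neighborhood was quiet.
    neighborhood_quiet = not (neighbors & changed_last_iteration)
    return own_share >= threshold and neighborhood_quiet


# Hypothetical toy state: node 1 sits between communities "a" and "b".
adj = {1: {2, 3}, 2: {1}, 3: {1}}
community_of = {1: "a", 2: "a", 3: "b"}

inner = is_settled(2, adj, community_of, 0.0, set())       # case 1
confident = is_settled(1, adj, community_of, 0.85, set())  # case 2
disturbed = is_settled(1, adj, community_of, 0.85, {3})    # neighbor moved
unsure = is_settled(1, adj, community_of, 0.50, set())     # below threshold
```

Because the predicate reads only a node's own neighborhood, it can be evaluated for many nodes at once, in line with the parallel evaluation mentioned above.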
4 Experimental results
4.1 Understanding the algorithm
Zachary’s Karate Club Network (Zachary 1977) is
commonly used to evaluate partition quality of community
algorithms. Although it is small with 34 nodes and 77 edges,
this network has become a standard because, through par-
ticipant feedback, we have the ground truth on community
divisions. In addition, the network contains subtle topolog-
ical features that can become a stumbling block to commu-
nity detection algorithms. The network represents a karate
club which split into two factions after a dispute between the
club president (called John A) and the lead instructor (Mr.
Hi). The dispute led to the formation of two separate clubs
with the original 34 members. The ground truth of the
community division after the dispute is represented in Fig. 5.
We executed SOCIAL on the karate network and
found that the algorithm correctly partitions the
network between the club president (node 34) and the
instructor (node 1). This can be observed in Fig. 6.
SOCIAL divides the club, but higher resolution in the
partition is demonstrated by the correct identification of two
sub-communities within this main division. This higher
Fig. 5 After the dispute between the club president (node 34) and the
lead instructor (node 1), the club was split into two factions (new
clubs). This figure represents how each of the 34 original members
was split (based on their preference of who to work with: president or
lead instructor)
Fig. 6 Partitioned Zachary’s Karate Club Network found by
SOCIAL. Note the identification of two sub-communities in the
graph. The leftmost community is orange, the community of node 1
is red, the community of node 34 is blue, and the community
of 4 nodes just above the community of node 34 is green
(color figure online)
resolution is not unique to SOCIAL, and we use it here
simply to show that we match other algorithms’ results.
Our results are highly correlated to algorithms based on
Girvan–Newman modularity. In fact, Fig. 7 shows the
communities according to a modularity-based algorithm
proposed by Blondel et al. (2008). The correlation to the
network modularity is demonstrated in Fig. 8.
The entropy values used in Fig. 8 are the non-inverted and
non-normalized entropy of each node in its current com-
munity, averaged over the entire network. During the exe-
cution of SOCIAL, the average entropy starts high and is
reduced with each iteration. At the same time, modularity
starts low and increases over time. The two metrics converge
as community structure emerges. Although, as pointed out in
Sect. 2, Girvan–Newman Modularity has serious resolution
issues in highly connected networks, these issues are not
present in the karate club network example. In this case, the
comparison between entropy and modularity serves as the
validation of node entropy as a community quality metric.
The main difference between the solutions found by
SOCIAL and by Blondel’s algorithm concerns node 33. Although
this might appear to be strange, node 33 forms a triad with
two other nodes in the same community as node 33 (shown
in red on the online version of the paper). Although node
33 is surrounded by nodes in another community (shown in
blue on the online version of the paper), these nodes do not
know each other, except for the case involving node 34.
Hence, we believe node 33 can be correctly classified as
not belonging to the community of node 34. Note that it is
also acceptable for it to be part of that community and
given the probabilistic nature of SOCIAL it does happen
that it is classified as such in some of the executions.
4.2 Understanding overlaps in SOCIAL
Community overlap may be discerned from the node entropy
values seen in a partitioned network. The final inverted and
normalized entropy values of four nodes with interesting
entropies from the karate network are as follows:
• Node 2: red community 1.0
• Node 5: orange community 0.729, red community 0.271
• Node 6: orange community 0.896, red community 0.104
• Node 29: blue community 0.33, red community 0.33,
green community 0.33
These nodes are highlighted in Fig. 9. The colors men-
tioned above are visible in Fig. 6.
These entropies are an indication of the certainty of each
node with its current community. Here, the value associ-
ated with the current community of a node is an indication
of the node’s certainty that it belongs to that community,
where a higher value indicates a higher certainty. Nodes
with an entropy value of 1.0 (i.e., 100 % certainty), such as
node 2, are surrounded by neighboring nodes in the same
community (as seen in Fig. 6). Hence, there is no uncer-
tainty concerning which community node 2 belongs in.
Nodes with multiple low entropy values of nearly equal
value are identified as nodes which overlap into multiple
communities. For example, node 29 has an entropy value
of 0.33 for three different communities. This entropy value
indicates a 100 percent chance that node 29 will re-evaluate
its community position. Node 29 has a single link to each of
three separate communities (blue, red, and green in Fig. 6),
and no intra-community triads are formed if the node were
to join any one of these communities. With each iteration
of the algorithm, the entropy value which results from the
node joining each of these communities is calculated.
These three entropies are equal, with a value of 0.33. As a
Fig. 7 Partitioned Zachary’s Karate Club Network found by Blondel
et al. (2008) algorithm
Fig. 8 Correlation between average node entropy and Girvan–
Newman modularity in Zachary’s Karate Club Network
Fig. 9 Nodes with different entropies have different overlap proba-
bilities. The lower the inverted and normalized entropy, the more
likely the node belongs to more than one community
result, the node has an equal probability of selecting any
one of these communities. Indeed, during the final itera-
tions of the algorithm, this node is seen to toggle between
the three communities. Node 5 has a final entropy of 0.729
in the orange community. It has one edge going into the red
community, and two edges going into the orange com-
munity with no intra-community triads. Because of the
higher number of edges in the orange community, its
entropy favors remaining in the orange community. Node 6
has an entropy of 0.896 in the orange community which
corresponds to having 3 edges in the orange community,
one edge in the red community, and an intra-community
triad in the orange community. From the entropies calculated
for each community to which a node may belong, a ratio of the
‘‘belongingness’’ of the node to each community is determined.
This concept corresponds to the fuzzy community structures
described by Reichardt and Bornholdt (2004), where the
robustness of node assignments to communities is examined to
obtain a higher resolution of community structure than seen by
traditional algorithms which detect overlap.
For example, in the karate network, node 29 belongs
equally in the green, orange, and blue communities. In
general, a threshold may be used to determine whether a
community has sufficient node attraction to indicate over-
lap. Using a threshold, the level of participation of a node
in each community is discerned from the entropy values
calculated when the node is evaluated in each community
during the last iteration of the algorithm. Following the
discernment of participation level, a community may be
eliminated as an overlap candidate if the node entropy is
sufficiently low when evaluated in that community.
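The threshold step described above can be illustrated with a minimal Python sketch. It assumes that the per-community values are already inverted and normalized so that a larger value means stronger attraction; the function name `participation`, the threshold of 0.2, and the input values for the third example are hypothetical and chosen only for illustration.

```python
def participation(scores, threshold=0.0):
    """Turn per-community scores into participation ratios.

    `scores` maps a community label to a non-negative value where larger
    means stronger attraction (an assumption of this sketch). Ratios sum
    to 1; communities whose ratio falls below `threshold` are dropped as
    overlap candidates.
    """
    total = sum(scores.values())
    if total <= 0:
        return {}
    ratios = {c: s / total for c, s in scores.items()}
    return {c: r for c, r in ratios.items() if r >= threshold}


# Node 7 of Fig. 10: equal values for both communities.
node7 = participation({"light": 0.5, "dark": 0.5}, threshold=0.2)

# Node 29 of the karate network: three equal values of 0.33.
node29 = participation({"blue": 0.33, "red": 0.33, "green": 0.33}, threshold=0.2)

# Hypothetical node that is strongly attached to a single community.
skewed = participation({"orange": 0.9, "red": 0.1}, threshold=0.2)

print(node7, node29, skewed)
```

With equal inputs the ratios are equal, so the node is reported as overlapping all candidate communities; in the skewed case only the dominant community survives the threshold.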
A more graphic illustration of SOCIAL’s overlap
detection capability is shown in Fig. 10. In this figure, a
small network is pictured in an intermediate stage of
community discernment where two communities have
formed. The nodes in the figure are represented by pie
charts. Here, each pie chart shows the level of participation
of the corresponding node in each community. Several
nodes (2, 4, and 6) are completely surrounded by neighbors
in the same community. Hence, they show no participation
in their peer communities. Nodes 1 and 8 have intra-
community overlaps that are so insignificant that the
overlaps are not visible in the node pie charts. Each of
these nodes has multiple triads within its current commu-
nity with single edges going out to the external community.
Nodes 5 and 3 have identical configurations with four
edges in the dark community and 3 edges in the light
community. These nodes favor the dark community
because they each form four triads within the dark com-
munity and only three triads in the light community. Node
9 strongly favors the light community because it forms 3
triads in the light community and only 1 triad in the dark
community. The most interesting case is node 7, which has
equal participation in both communities. The inverted and
normalized entropy values for node 7 are light 0.5, and
dark 0.5. Node 7 has 3 edges going to each of the communities
with 3 intra-community triads in both communities. As a result,
node 7 will toggle between the two communities if it must be
assigned to exactly one of them; the algorithm, however,
indicates that node 7 belongs to both communities with 50 %
participation in each. In this network, the instability of
node 7 results in the light community being absorbed by the
dark community. Because of the centrality of node 5 to the
light community, when it toggles to the dark community, its
peers in the light community follow.
Fig. 10 Network with nodes represented by pie charts showing levels of community participation
4.3 Dealing with dynamics
The amount of processing required by the algorithm to
incorporate real-time updates to a network is proportional to
the complexity of the change. For example, if a single node
with a single edge is added to a network, one iteration of the
algorithm is required, which impacts only the added node as it
joins the community it is connected to. On the other hand, if a
node with many edges joining several communities is added,
more processing is required. As an example, node 35 is added
to the karate network. This node has multiple edges into both
the orange and red communities (colors refer to Fig. 6).
Further, a triad is formed between node 35 and nodes in the
blue community. Fig. 11 shows how the community structure
is changed by this addition.
Fig. 11 Impact of adding disruptive node (35) to Zachary’s Karate Club Network
The addition of node 35 results in the orange community
in Fig. 6 being absorbed by the red community. Only four
iterations were required for the network to stabilize, and six
nodes were impacted. The four iterations reflect a cascading
effect as the nodes most attached to the red community
decide to change communities, and the remaining nodes
follow. The important characteristic to note is that the update
to community structure is localized to the affected nodes.
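The locality of such updates can be sketched with a simple work queue in Python. This is an illustration only: the decision rule `preferred` is a stand-in that picks the most common community among a node's neighbors, and it is not the node entropy used by SOCIAL. The names `preferred` and `add_node` and the two-triangle example network are hypothetical.

```python
from collections import Counter, deque


def preferred(node, adj, community):
    """Stand-in decision rule (NOT SOCIAL's node entropy): choose the most
    common community among the neighbors; ties keep the current one."""
    counts = Counter(community[v] for v in adj[node])
    if not counts:
        return community[node]
    best = max(counts.values())
    if counts.get(community[node], 0) == best:
        return community[node]
    return min((c for c, k in counts.items() if k == best), key=str)


def add_node(adj, community, node, neighbors):
    """Insert `node`, then re-evaluate only nodes reached by the cascade."""
    adj[node] = set(neighbors)
    for v in neighbors:
        adj[v].add(node)
    community[node] = ("new", node)  # the new node starts alone
    queue, changed = deque([node]), []
    while queue:
        u = queue.popleft()
        choice = preferred(u, adj, community)
        if choice != community[u]:
            community[u] = choice
            changed.append(u)
            queue.extend(adj[u])  # only neighbors of a changed node are revisited
    return changed


# Two triangles joined by the edge c-d, already partitioned.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
community = {"a": "red", "b": "red", "c": "red",
             "d": "blue", "e": "blue", "f": "blue"}

# A node with a single edge affects only itself.
changed = add_node(adj, community, "g", ["a"])
print(changed, community["g"])
```

Because a node is revisited only when one of its neighbors changes community, the work done is bounded by the size of the cascade rather than by the size of the network.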
4.4 Performance
The network generation package developed by Lancichinetti
et al. (2008) was used to generate benchmark networks for
performance evaluation. This package utilizes a variety of
parameters to generate networks containing social structure.
Using these parameters, the algorithm follows a set of rules
when connecting nodes which result in the desired social
structure. For example, the user may specify the mean degree
and the exponent of the power-law degree distribution (γ) to
create social networks with varying levels of density. A key
feature of the algorithm is the output of the set of known
communities upon completion. Based on the input parame-
ters and the fixed set of generation rules, the set of commu-
nities and the community that each node belongs to are
known (ground truth). This known community structure may
then be compared to the structure discerned by a community
detection algorithm to evaluate performance. The parame-
ters of interest for this analysis and their corresponding
values used are described in Table 1.
The mixing parameter μ specifies the fraction of edges
of each node that connect to nodes in other communities,
i.e., inter-community edges that span two communities. The
remaining fraction 1 − μ of each node’s edges are
intra-community edges.
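As an illustration of this definition, the following Python sketch measures the per-node mixing fraction of an already partitioned network. The function name `node_mixing`, the adjacency-list representation, and the six-node example are assumptions made for this sketch and are not part of the benchmark package.

```python
def node_mixing(adj, community):
    """Return {node: fraction of the node's edges that end in another community}."""
    mix = {}
    for node, nbrs in adj.items():
        if not nbrs:
            continue  # isolated nodes have no defined mixing fraction
        external = sum(1 for v in nbrs if community[v] != community[node])
        mix[node] = external / len(nbrs)
    return mix


# Two triangles (a, b, c) and (d, e, f) joined by the single edge a-d.
adj = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

mix = node_mixing(adj, community)
average_mixing = sum(mix.values()) / len(mix)
print(mix, average_mixing)
```

Nodes a and d each have one of three edges leaving their community, so their mixing fraction is 1/3, while the remaining nodes have a fraction of 0.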
For algorithm testing in this analysis, networks were
generated for each of the following μ values: 0.1, 0.2, 0.3, 0.4,
0.5, and 0.6. These μ values result in networks in which
communities are increasingly difficult to discern
automatically. Networks generated with μ values ≤ 0.3 exhibit
distinct communities which are relatively easy to identify.
However, the communities seen in networks created with
μ values > 0.3 are highly interconnected, hence the
communities are quite difficult to discern. Both the SOCIAL
algorithm and the network generation package are
probabilistic. As a result, to ensure a balanced view of
performance, ten networks were generated with each μ value.
Using the 10 networks, the standard deviation was taken
over the set of performance results.
In addition to the benchmark network generation pack-
age, Lancichinetti et al. (2008) provide a metric named
normalized mutual information which measures the simi-
larity between expected community structure and the
structure discerned by a community detection algorithm.
This metric utilizes concepts borrowed from information
theory to compute a numerical measure of the similarity
between two community partitions. The value which
results from a comparison is in the range from 0 to 1, where
0 indicates no similarity and a value of 1 indicates an
identical partitioning.
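For non-overlapping partitions, the metric can be sketched in a few lines of Python. The sketch assumes the commonly used normalization 2I(A;B) / (H(A) + H(B)), where I is the mutual information and H the Shannon entropy of the label distributions; the function names and the convention of returning 1 when both partitions consist of a single community are choices made for this illustration.

```python
from collections import Counter
from math import log


def label_entropy(labels):
    """Shannon entropy (natural log) of a list of community labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Normalized mutual information between two partitions given as label lists."""
    if len(a) != len(b) or not a:
        raise ValueError("partitions must be non-empty and of equal length")
    n = len(a)
    ca, cb, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mutual = sum((nxy / n) * log(nxy * n / (ca[x] * cb[y]))
                 for (x, y), nxy in joint.items())
    denom = label_entropy(a) + label_entropy(b)
    # Convention for this sketch: two single-community partitions are identical.
    return 1.0 if denom == 0 else 2.0 * mutual / denom


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])        # same split, different label names
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])   # statistically independent splits
collapsed = nmi([0, 0, 1, 1], [0, 0, 0, 0])   # detected partition is one community
print(same, unrelated, collapsed)
```

The third case shows why an algorithm that returns a single community for a network with known structure receives a score of 0.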
Three algorithms were tested with SOCIAL for perfor-
mance comparison. The first algorithm was developed by
Clauset et al. (2004). It utilizes modularity maximization to
discern community structure. The second algorithm is the
Blondel et al. (2008) algorithm described in Sect. 2 which relies
on a combination of agglomeration and modularity maxi-
mization. The third algorithm is the LPA implementation
provided by the igraph (Csardi and Nepusz 2006) network
analysis package. Three criteria were used in the selection
of these algorithms. First, algorithms which were ‘‘pack-
aged’’, well tested, and available for download were con-
sidered. Second, algorithms were selected because of their
relatively fast execution. A variety of downloadable algo-
rithms are available. However, many of them require hours
(or days) to complete and were not practical for use in this
analysis. Finally, well-known algorithms were desired.
The performance tests were executed on a Windows
Vista 64-bit platform. The system configuration includes
Quad CPUs running at 3.00 GHz with 4 GB of RAM.
During testing, care was taken to dedicate the system
exclusively to the algorithm being evaluated.
Table 1 Parameters used in the benchmark program by Lancichinetti et al. (2008)
Parameter                                      Value
Number of nodes                                1,000
Mean degree                                    15
Maximum degree                                 50
Exponent for degree distribution (γ)           2
Exponent for community size distribution (β)   1
Mixing parameter (μ)                           Varied
Minimum for the community sizes                20
Maximum for the community sizes                50
Figure 12 shows a comparison of the accuracies of the
four algorithms utilizing the normalized mutual information
metric. As described earlier, each algorithm was executed on
the ten networks corresponding to each μ setting and the
standard deviations of the results were taken. In addition, to
overcome the inconsistency issue seen with LPA, each
generated network was parsed ten times by LPA and the
average of the results was taken. In Fig. 12, the results of
SOCIAL and the Blondel algorithm are similar, with excellent
accuracy for μ values ≤ 0.4. For μ values > 0.4, performance
begins to degrade as communities become more interconnected
and social structure is more difficult to discern. LPA
also performed well with all μ values except 0.6. With this
μ value, LPA consistently produced a single community. This
failure suggests the presence of network structures which are
problematic for LPA. These structures, if present even
sparsely in real-world networks, could lead to inaccurate
community discernment with LPA.
The execution time of SOCIAL is compared to the other
three algorithms in Fig. 13. Precise time measurement in the
Windows test environment was not possible. As a result,
execution times less than one second should simply be
interpreted as ‘extremely fast’. The actual values should not
be viewed as precise measurements. While SOCIAL is
several seconds slower than the Blondel algorithm and LPA,
it should be noted that the current version of SOCIAL is not
optimized to utilize the multi-processor configurations
which are ubiquitous in today’s personal computing platforms.
The self-organized approach used by SOCIAL allows
entropy computations to be performed in parallel; it is part of
the algorithm design. As a result, SOCIAL execution time is
expected to scale down linearly as multiple processors are
exploited to perform concurrent entropy calculations. In
Fig. 14, a linear progression is seen in the number of itera-
tions required by SOCIAL to achieve convergence as the
social structure within the networks becomes more complex.
Although the number of iterations increases with network
complexity, Fig. 15 demonstrates that with each iteration, fewer
entropy calculations are required as the algorithm progresses.
This reduction in entropy calculations is attributed to the
emergent community structure. As communities are formed, an
increasing number of individual nodes are ‘‘satisfied’’ with their
community and no entropy calculation is required.
The inset in Fig. 15 shows the total entropy calculations
performed by SOCIAL for each mixing parameter μ.
Fig. 12 SOCIAL accuracy compared to other algorithms using the
normalized mutual information metric. Despite being self-organized,
SOCIAL nearly matches the results of Blondel’s algorithm
Fig. 13 SOCIAL execution time compared to two other algorithms
Fig. 14 SOCIAL iterations required to achieve convergence for each
mixing parameter value
Fig. 15 SOCIAL entropy calculations per iteration with inset show-
ing total entropy calculations
Again, a linear correspondence is seen with the complexity
of social structure exhibited by each network.
The time complexity of SOCIAL is difficult to precisely
characterize because it is dependent on the complexity of
the network being partitioned. Network complexity deter-
mines three factors that impact the time complexity of
SOCIAL: (1) the number of nodes in the network $n$, (2) the
number of iterations required for convergence $I_{\max}$, and (3)
the average degree of nodes in the network $z$. For the
purposes of this discussion, the worst case is used where a
very complex network, i.e., μ = 0.6, is analyzed. Arguably,
it would be very unusual to find a real-world social
network containing topological features more complex
than the ones found in this network. Referring to Fig. 14,
the number of iterations required to achieve convergence
on a network with μ = 0.6 is approximately 59. This value is
referred to as $I_{\max}$ in the calculations that follow. The
number of entropy calculations per iteration is the core of
SOCIAL time complexity. For each node, an entropy cal-
culation is done for each community surrounding the node,
and the node’s current community. The first iteration
requires a maximum of $n \times (z + 1)$ entropy calculations
because each node initially belongs to a unique community.
The fact that many communities are merged during
the first iteration is ignored to achieve a conservative
estimate of time complexity. For subsequent iterations, the
number of entropy calculations is reduced as communities
merge and the average number of communities surrounding
each node is reduced. Again, the worst case of μ = 0.6
is used where the number of entropy calculations is
reduced by only 10 % with each iteration as shown in
Fig. 15. To model this behavior, the expression
$(z + 1) \times 0.9^{i}$ is used to show the decline in entropy
calculations with each iteration. Here, $i$ is the iteration
number. Finally, using the expression for the number of
entropy calculations per iteration and the maximum number
of iterations $I_{\max}$, time complexity, i.e., total entropy
calculations, is computed as:
$$O(n \times T_e), \qquad (2)$$
where $T_e$ represents the number of entropy calculations for
the first and subsequent iterations of the algorithm (from 2 to
$I_{\max}$):
$$T_e = \sum_{i=0}^{I_{\max}} (z + 1) \times 0.9^{i}. \qquad (3)$$
Note that the geometric series in Eq. 3, $\sum_{i=0}^{I_{\max}} 0.9^{i}$,
is bounded above by $1/(1 - 0.9) = 10$ for any $I_{\max}$, so that
$T_e \le 10\,(z + 1)$. The performance of SOCIAL therefore
depends on $n$ and $z$, and given that $z \ll n$ the performance
has near-linear time complexity in regard to the number of
nodes in the network, $n$. Clearly, this could degenerate into a
performance that is not linear for very dense networks but
the argument here is that these are not typical networks that
appear as the result of real phenomena.
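The estimate can be checked numerically with a short Python sketch. It plugs in the values reported in this section (mean degree z = 15 from Table 1, approximately 59 iterations, n = 1,000 nodes, and a 10 % reduction per iteration); the function name `entropy_calcs_per_node` is hypothetical, and the partial sum is compared against the limit of the geometric series, 1/(1 − 0.9) = 10.

```python
def entropy_calcs_per_node(z, i_max, decay=0.9):
    """Sum of (z + 1) * decay**i for i = 0..i_max (the form of Eq. 3)."""
    return sum((z + 1) * decay ** i for i in range(i_max + 1))


z, i_max, n = 15, 59, 1000

t_e = entropy_calcs_per_node(z, i_max)

# Closed form of the same finite geometric sum, used as a cross-check.
closed_form = (z + 1) * (1 - 0.9 ** (i_max + 1)) / (1 - 0.9)

# Limit as the number of iterations grows without bound.
upper_bound = (z + 1) / (1 - 0.9)

total_calcs = n * t_e
print(round(t_e, 2), round(upper_bound, 2), round(total_calcs))
```

Since the per-node term stays below a constant multiple of z + 1 regardless of the number of iterations, the total grows in proportion to n × z under these modeling assumptions.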
5 Discussion
The self-organized approach to community detection used by
SOCIAL is new and quite exciting given its capabilities of
high-resolution overlap detection and the ability to process
dynamic networks. In addition, SOCIAL was shown to be
comparable in accuracy and execution speed to the well-known
Blondel algorithm and LPA. Furthermore, SOCIAL
does not have the resolution issue present in algorithms, such
as the Blondel algorithm, which utilize Girvan–Newman
modularity for network partitioning. Although this resolution
issue was not evident in the networks generated for
performance testing in this analysis, it is well documented and
considered a serious concern (Fortunato and Barthelemy
2007). Likewise, SOCIAL was shown not to have the
inconsistency issue present in LPA, which requires multiple
runs to be performed and a consensus to be taken from the
results. Future testing will include performance comparisons on
networks which explicitly include topologies that are
problematic for the Blondel algorithm and LPA. Finally,
SOCIAL was several seconds slower than the Blondel
algorithm and LPA in execution time on the more complex
networks. As stated previously, SOCIAL has not been optimized
for utilization of multi-processor platforms. Although 3 or 4
seconds is hardly an excessive execution time for
networks as complex as those tested, a significant gain in
speed is expected once this optimization is included.
6 Conclusions
We have introduced a self-organizing community detection
algorithm called SOCIAL which reflects the high efficiency
and adaptivity of decentralized systems observed in nature.
The algorithm is based on localized node entropy calcula-
tions which require only knowledge of each node’s local
neighborhood. The core entropy calculation reveals com-
munity overlap, and additionally, shows the level of
‘‘belongingness’’ or ‘‘fuzziness’’ of nodes to the communities
to which they belong. In a demonstration of community
partition quality, the algorithm was shown to successfully
identify a high-resolution partitioning of Zachary’s
Karate Club Network. Finally, the algorithm is well suited to
manage dynamically changing networks. Network pertur-
bations are localized and the processing required for each
update is proportional to the complexity of the update.
Future work includes further exploration of the perfor-
mance characteristics of the algorithm on large networks
containing more than 100,000 nodes. The performance,
with regard to partition quality and speed, will be com-
pared to existing community detection algorithms. Part of
these evaluations will include measurement of performance
gains made by exploiting our algorithm’s ability to support
concurrent entropy calculations on multi-core computer
architectures. Lastly, community detection in dynamic
networks will be further explored. Here, networks will be
iteratively generated, and with each iteration, updates to the
network will be made accessible to the algorithm.
References
Adomavicius G, Tuzhilin A (2005) Toward the next generation of
recommender systems: A survey of the state-of-the-art and possible
extensions. IEEE Trans Knowl Data Eng 17(6):734–749
Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast
unfolding of communities in large networks. J Stat Mech Theory
Exp 2008(10):P10008
Clauset A, Newman MEJ, Moore C (2004) Finding community
structure in very large networks. Phys Rev E 70(6):066111.
doi:10.1103/PhysRevE.70.066111
Cruz JD, Bothorel C, Poulet F (2011) Entropy based community
detection in augmented social networks. In: 2011 International
Conference on computational aspects of social networks (CA-
SoN), 19–21 October 2011, pp 163–168
Csardi G, Nepusz T (2006) The igraph software package for complex
network research. Int J Complex Systems 1695. http://igraph.sf.net
Fortunato S (2010) Community detection in graphs. Physics Reports
486:75–174
Fortunato S, Barthelemy M (2007) Resolution limit in community
detection. PNAS 104(1):36–41
Girvan M, Newman MEJ (2002) Community structure in social and
biological networks. PNAS 99(12):7821–7826
Gregory S (2010) Finding overlapping communities in networks by
label propagation. New J Phys 12(10):103018
Java A, Song X, Finin T, Tseng B (2007) Why we twitter:
understanding microblogging usage and communities. In: Pro-
ceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop
on Web mining and social network analysis, WebKDD/SNA-
KDD ’07. ACM, New York, NY, pp 56–65
Krebs V (2002) Mapping networks of terrorist cells. Connections
24(3):43–52
Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs
for testing community detection algorithms. Phys Rev E
78(4):046110
Ma X, Gao L, Yong X (2010) Eigenspaces of networks reveal the
overlapping and hierarchical community structure more
precisely. J Stat Mech Theory Exp 2010(08):P08012
Mamei M, Menezes R, Tolksdorf R, Zambonelli F (2006) Case
studies for self-organization in computer science. J Syst
Archit 52(8):443–460
Newman MEJ, Girvan M (2004) Finding and evaluating community
structure in networks. Phys Rev E 69(2):026113
Ning H, Xu W, Chi Y, Gong Y, Huang TS (2010) Incremental
spectral clustering by efficiently updating the eigen-system.
Pattern Recogn 43(1):113–127
Palla G, Derenyi I, Farkas I, Vicsek T (2005) Uncovering the
overlapping community structure of complex networks in
nature and society. Nature 435(7043):814–818
Raeder T, Chawla NV (2011) Market basket analysis with
networks. Soc Netw Anal Min 1(2):97–113
Raghavan UN, Albert R, Kumara S (2007) Near linear time algorithm
to detect community structures in large-scale networks. Phys
Rev E 76(3):036106
Rattigan MJ, Maier M, Jensen D (2007) Graph clustering with
network structure indices. In: Proceedings of the 24th interna-
tional conference on machine learning, ICML ’07. ACM, New
York, NY, pp 783–790
Rees BS, Gallagher KB (2012) Overlapping community detection
using a community optimized graph swarm. Soc Netw Anal Min
2(4):405–417
Reichardt J, Bornholdt S (2004) Detecting fuzzy community struc-
tures in complex networks with a potts model. Phys Rev Lett
93(21):218701
Scott J (2011) Social network analysis: developments, advances,
and prospects. Soc Netw Anal Min 1(1):21–26
Shannon CE (1948) A mathematical theory of communication.
Bell Syst Tech J 27:379–423
Shi J, Malik J (1997) Normalized cuts and image segmentation. In:
Proceedings of the 1997 conference on computer vision and
pattern recognition (CVPR ’97), CVPR ’97. IEEE Computer
Society, Washington, DC, p 731
Venugopal S, Stoner E, Cadeiras M, Menezes R (2012) The social
structure of organ transplantation in the united states. In:
Proceedings of the 3rd international workshop on complex
networks (CompleNet 2012), vol 424 of studies in computational
intelligence. Springer-Verlag, Melbourne, FL, pp 199–206
Wilkinson DM, Huberman BA (2004) A method for finding
communities of related genes. Proc Natl Acad Sci 101(Suppl
1):5241–5248. doi:10.1073/pnas.0307740100
Xie J, Chen M, Szymanski BK (2013) LabelRankT: incremental
community detection in dynamic networks via label propagation
Zachary WW (1977) An information flow model for conflict and
fission in small groups. Journal of Anthropological Research
33:452–473