ORIGINAL ARTICLE

A self-organized approach for detecting communities in networks

Ben Collingsworth • Ronaldo Menezes

Received: 9 May 2013 / Revised: 6 September 2013 / Accepted: 21 November 2013

© Springer-Verlag Wien 2014

Abstract The community structure of networks reveals

hidden information about the whole network structure that

cannot be discerned using other topological properties. The importance of identifying community structure in networks to many fields, such as medicine, social sciences, and national security, calls for better approaches for performing the identification. The prevalent community detection

algorithms utilize a centralized approach that is unlikely to

scale to very large networks and does not handle dynamic

networks. We propose a self-organized approach to com-

munity detection which utilizes a newly introduced concept

of node entropy to allow individual nodes to make

decentralized and independent decisions concerning the

community to which they belong; we call our approach

Self-Organized Community Identification ALgorithm

(SOCIAL). Node entropy is a mathematical expression of an individual node’s satisfaction with its current community. As nodes become more “satisfied”, i.e., as their entropy becomes low, the community structure of the network emerges.

Our algorithm offers several advantages over existing

algorithms including near-linear performance, identifica-

tion of partial community overlaps, and handling of

dynamic changes in the network in a local manner.

1 Introduction

The identification of community structure in networks has

helped many scientific fields understand complex phe-

nomena (John 2011); prime examples include: medicine

(Srividhya et al. 2012), social sciences (Akshay et al.

2007), national security (Valdis 2002), and marketing

(Adomavicius and Tuzhilin 2005; Troy and Chawla 2011).

Community structure reveals global hidden information

about networks that cannot be discerned using other

topological properties. Early research in this area reasonably focused on one aspect of the problem: finding communities. However, these solutions neglected important properties when dealing with real-world networks: (1)

the enormous size of many networks, (2) community

overlap (i.e., a node belonging to multiple communities),

and (3) the dynamic nature of networks. The importance of

the first property (network size) is evident when running

one of the various network analysis packages on a very

large network; the package may require many hours to run.

This issue is heightened with increased interest in complex

network analysis coupled with accessibility to databases

capable of generating huge networks. The second property

(community overlap) is the reality that nodes in a network

often belong to more than one community. This property is

particularly evident in social networks describing direct

relationships between people (Bradley et al. 2012). For

example, in a network describing ties between students, an

individual may participate in activities such as academic

societies, sports activities, and religious affiliations, which

may constitute several communities. Important information

is lost if this student must be cast into a single community.

Finally, real-world networks are constantly evolving, with

nodes and edges dynamically being created and removed.

B. Collingsworth · R. Menezes (✉)
Computer Sciences BioComplex Laboratory, Florida Institute of Technology, Melbourne, FL, USA
e-mail: rmenezes@cs.fit.edu

B. Collingsworth
e-mail: bcolling@my.fit.edu

Soc. Netw. Anal. Min. (2014) 4:169
DOI 10.1007/s13278-014-0169-5

Networks representing the spread of pandemic diseases are a good example of these dynamic structures. Here, the

network is continuously changing as the disease spreads to

new individuals and expires in those that are infected. In

these networks, a community detection algorithm that must

be completely re-executed with each update is impractical.

The importance of community detection can be high-

lighted by the increasing attention given by researchers.

Indeed, a search on arXiv.org for the words “community” and “networks” reveals that about 10 % of papers published today are about networks and more than half of them

relate to communities (details in Fig. 1). Moreover, the

increase over the years is quite significant (5-fold for net-

works and 3-fold for community-related papers).

We propose a community detection algorithm which

observes these three properties while maintaining partition

quality. In this algorithm, individual nodes are indepen-

dently responsible for determining the community to which

they belong. The mechanism for making this decision is

derived from Shannon entropy (Claude et al. 1948) and

requires a node to have knowledge only of its immediate

neighbors. Initially, entropy is high, and there is tension in

the network as communities form and nodes make deci-

sions to join or leave these communities. Over time,

entropy becomes low as nodes are satisfied with their

current community. At this point, community structure is

emergent. Since each node’s decision on community is

based only on immediate neighbors, the algorithm offers

near-linear performance on average. In addition, near-lin-

ear performance is preserved as the size of the network

increases. Nodes that belong to multiple communities are

identified by a high individual entropy when the overall

network has stabilized. These nodes may be seen to toggle

between communities because they are uncertain about

where they belong; moreover, the toggling allows us to introduce a completely new idea of community overlap in

which nodes can belong to multiple communities at dif-

ferent levels of belongingness (e.g., a node can be 20 % in

one community and 80 % in another). These levels of

belongingness make SOCIAL fundamentally different

from other approaches able to identify overlaps such as the

clique-percolation approach (Gergely et al. 2005). Finally,

dynamic changes to the network are processed locally. The

processing required to adapt to a change in the network is

proportional to the number of nodes directly impacted by

the change.

2 Related work

In this section, we describe a few community detection

algorithms proposed to date. Our intent for this section is

not to be comprehensive but rather to demonstrate that they

generally fail to handle at least one of the issues described

in Sect. 1 that are handled in SOCIAL, that is, (1) com-

plexity, (2) detection of community overlap, and (3) ability

to efficiently handle dynamic network changes. For a very

comprehensive work on community algorithms one should

refer to the work of Fortunato (Santo 2010).

A broad class of community detection algorithms may

be termed metric-based heuristic algorithms where some

network, node, or edge property is calculated across the

network by means of a centralized control process, and

used to partition the network into communities. A well-

known example of this type of algorithm is the Girvan–

Newman (Girvan and Newman 2002) algorithm. This

algorithm utilizes edge betweenness centrality to perform a

sequence of divisive cuts in the network which result in a

community partition. Intuitively, the edge with the highest

betweenness centrality is likely to be a bridge between

communities, and its removal will result in the isolation of

two communities previously connected by it (see Fig. 2a).

The algorithm is run iteratively, where at each iteration, an

edge is selected and removed. In addition, each iteration

updates a dendrogram structure which reflects the current

partition (see Fig. 2b). The algorithm is terminated when

all edges have been removed. At this point, the dendrogram

is used to select the desired level of community decomposition. The performance of the algorithm is tied to the edge betweenness calculation, which is $O(n^2)$ for each iteration, leading to an overall complexity of $O(n^2 m)$, where $m$ is the number of iterations (edges) and $n$ is the number of nodes in the network. Optimization techniques

have been developed to reduce the cost of this calculation

(Wilkinson and Huberman 2004; Rattigan et al. 2007).

However, these optimizations impact the quality of the

partition. The algorithm does not reveal community over-

lap; each node is assigned to a single community. Further,

the algorithm is not suited for dynamic networks since

changes in network topology require a completely new

dendrogram to be generated (restart of the whole process).
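The divisive procedure described above can be sketched in a few lines. This is a minimal illustration and not the authors' or any library's implementation: the function names and the toy edge list are hypothetical, edge betweenness is computed with a Brandes-style breadth-first accumulation, and the dendrogram is flattened into a list of successive partitions.

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Brandes-style edge betweenness for an undirected, unweighted graph."""
    score = defaultdict(float)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies, farthest nodes first.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return score  # every node pair is counted from both endpoints

def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts

def girvan_newman_levels(edges):
    """Remove the highest-betweenness edge until no edges remain.

    Returns the successive partitions (a flattened dendrogram).
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    levels = [components(adj)]
    while any(adj.values()):
        score = edge_betweenness(adj)      # recomputed after every cut
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
        parts = components(adj)
        if len(parts) > len(levels[-1]):   # a cut isolated a new community
            levels.append(parts)
    return levels

# Hypothetical toy graph: two triangles joined by the bridge (3, 4).
levels = girvan_newman_levels(
    [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)])
```

On this toy graph the bridge carries the most shortest paths, so the first cut separates the two triangles. Recomputing betweenness after every removal is what makes the procedure expensive, and any change to the edge list requires starting over.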

Fig. 1 Percentage of articles placed on arXiv.org that relate to “networks” and “community”. The chart above shows that nearly

6 % of the papers placed in the arXiv in 2012 relate to communities.

The arXiv is the main repository for papers in Network Sciences

(from bookworm ArXiv http://bit.ly/RlRNBD. Last accessed on

November 10, 2012)


Another metric-based algorithm is the Blondel et al.

(2008) algorithm which utilizes the Girvan–Newman

modularity metric (Clauset et al. 2004) to discern com-

munity structure. Modularity is a quality function based on

the idea that random graphs are not expected to contain

cluster structures. The modularity is then used as a com-

parison between the actual density of edges in a subgraph

and the density we would expect to have if the vertices of

the graph were attached randomly. The algorithm uses an

iterative two-phased approach. In the first phase of each

iteration, nodes are assigned to communities in an

arrangement that maximizes modularity. In the second

phase, a new network is constructed where nodes in the

same community are combined to create a single node. The

network resulting from the second phase is the starting

point for the first phase of the next iteration. The algorithm

terminates when modularity is maximized and no addi-

tional updates can be made. The output of the algorithm is

the hierarchical set of communities which correspond to

each iteration. While the Blondel algorithm offers a com-

plexity of O(n), it has been shown to have accuracy issues

which lead to spurious partitions during hierarchical

agglomeration (Santo 2010). In addition, Girvan–Newman

Modularity exhibits a resolution loss when applied to dense

networks (Fortunato and Barthelemy 2007). Finally,

Blondel’s algorithm does not detect community overlap

and is not suited for managing dynamic networks.
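To make the modularity comparison concrete, the following sketch evaluates a given partition. It assumes the standard Newman–Girvan form $Q = \sum_c \left[ l_c/m - (d_c/2m)^2 \right]$, where $l_c$ is the number of edges inside community $c$, $d_c$ is the total degree of its nodes, and $m$ is the number of edges; the formula is not spelled out in this section, so both it and the toy graph are assumptions made for illustration.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to a community label.
    """
    m = 0
    internal = defaultdict(int)     # edges with both ends in the community
    degree_sum = defaultdict(int)   # total degree of the community's nodes
    for u, v in edges:
        m += 1
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Observed share of internal edges minus the share expected at random.
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Hypothetical toy graph: two triangles joined by one bridge edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
split = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
merged = {node: "a" for node in split}
q_split = modularity(edges, split)    # 5/14, about 0.357
q_merged = modularity(edges, merged)  # 0.0
```

Placing every node in a single community yields zero modularity, while the two-triangle split scores higher, which is the signal a modularity-maximizing algorithm follows.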

A large class of community detection algorithms utilizes

spectral analysis to partition networks. These algorithms,

referred to as spectral-clustering algorithms (Santo 2010),

transform the relationship between nodes into a spatial

relationship in which similarities between nodes are much

more evident, thereby simplifying community detection.

The transformation is done by generating and evaluating

the eigenvectors of various matrices derived from the network adjacency matrix. Laplacian matrices are particularly effective for this transformation because the eigenvectors produced correspond to node coordinate vectors in

a k-dimensional space, where nodes belonging to the same

community are in close proximity to each other. From this

transformation, simple clustering algorithms such as

k-means may be applied to identify the set of communities,

based on node density in the spatial representation of the

network. An early example of the use of spectral clustering

is given by Shi et al. (1997). Here, the algorithm is used to

accomplish perceptual grouping (i.e., image partitioning).

They reduce the problem to simple network partitioning by

transforming images into spatial representations based on

the attributes such as brightness, color, texture. The algo-

rithm groups image features into communities by exam-

ining the coherence of these attributes in Euclidean space.

Similar to the Girvan–Newman algorithm, spectral clus-

tering generates good community partitions, but has com-

plexity issues. The complexity of producing the

eigenvectors for a network containing $n$ nodes is $O(n^3)$.

Again, strategies have been developed to reduce this

complexity, but at the cost of partition quality. Given that

spectral clustering reveals each node’s proximity to other

nodes, and hence proximity to communities, community

overlap detection is achievable. Indeed, Ma et al. (2010)

have demonstrated this capability by extending the tradi-

tional spectral clustering algorithm to recognize candidate

overlapping nodes through examination of spatial density.

Finally, as shown by Ning et al. (2010), spectral clustering

may be applied to dynamic networks. This is accomplished

by maintaining incremental approximations of the eigen-

values and eigenvectors rather than performing a recalcu-

lation of eigenvectors after each update to the network.

This approach has shown promising initial results. How-

ever, Ning et al. concede that the incremental approxima-

tions incur cumulative errors, which impact the algorithm’s

performance.

Fig. 2 The Girvan–Newman algorithm, based on systematic identification of the edges with highest betweenness (a), yields a dendrogram that represents the series of cuts made during the algorithm (b)

Raghavan et al. (2007) introduced a family of community detection algorithms which utilize label propagation. Label propagation algorithms (LPAs) are similar to SOCIAL in that they are self-organizing and retain no global information regarding the network. Label propagation begins by assigning a unique label to each node in the

network. Following this, an iterative process is used where

each node is selected and assigned the label shared by the

majority of its neighbors. During an iteration, the order in

which nodes are selected is random. If a tie occurs between

the possible labels a node may select, the label is selected

randomly from the set of tied candidate labels. Iteration

continues until every node in the network has a label to which the maximum number of its neighbors belong.
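The label propagation steps just described can be sketched as follows. This is an illustrative reading of the description, not the igraph routine or the original authors' code; the function names, the `max_sweeps` safeguard, and the toy graph are assumptions.

```python
import random
from collections import Counter

def majority_labels(v, adj, labels):
    """Labels carried by the largest number of v's neighbors (ties included)."""
    counts = Counter(labels[w] for w in adj[v])
    best = max(counts.values())
    return [lab for lab, c in counts.items() if c == best]

def label_propagation(adj, seed=None, max_sweeps=100):
    """adj: dict mapping each node to the set of its neighbors."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # every node starts with a unique label
    nodes = list(adj)
    for _ in range(max_sweeps):           # assumed cap to guarantee termination
        rng.shuffle(nodes)                # random visiting order in each sweep
        for v in nodes:
            if adj[v]:
                # Ties are broken uniformly at random.
                labels[v] = rng.choice(majority_labels(v, adj, labels))
        # Stop once every node carries a label held by most of its neighbors.
        if all(not adj[v] or labels[v] in majority_labels(v, adj, labels)
               for v in adj):
            break
    return labels

# Hypothetical toy graph: two disconnected triangles.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5, 6}, 5: {4, 6}, 6: {4, 5}}
labels = label_propagation(adj, seed=1)
```

Both the visiting order and the tie-breaking draw on the random generator, so different seeds can settle on different labelings for ambiguous nodes.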

Random tie breaking in the label propagation algorithm

leads to the disadvantage that different community struc-

tures are reported on different runs of the algorithm. As a

result, the algorithm must be run a number of times, with a

consensus of the results taken as the true community

structure. This ambiguity may be observed in the simple

network shown in Fig. 3. Arguably, by observing the

density of connections between nodes, this network can be

separated into the two distinct communities shown. Each

community has a higher number of edges between com-

munity members than edges going to the peer community.

Further, each community has a relatively large number of

triads, indicating social structure. Node 1 has an equal

number of edges going to each community. However, node

1 completes two triads in the red community and one triad

in the blue community, indicating stronger membership to

the red community. The LPA provided by the igraph

(Csardi and Nepusz 2006) network analysis package was

run ten times on this network with the following results: (1) in four of the runs, two communities were detected as shown in Fig. 3; (2) in four of the runs, two communities were detected with node 1 in the blue community; and (3) in two of the runs, a single community was discerned. For comparison, SOCIAL was run on the same network ten times.

SOCIAL detected the communities shown in Fig. 3 with

each run of the algorithm. With respect to accuracy and consistency, SOCIAL has two advantages. First, the SOCIAL entropy

calculation favors community assignments where triads, a

fundamental social network structure, are present. Second,

the SOCIAL algorithm allows nodes to choose to make no

decision on community selection if a node’s entropy is

high. Delayed community selection of the uncertain nodes

allows other less ambiguous nodes to begin forming true

communities based on the presence of community struc-

ture. Once the core community structure is formed, the

uncertain nodes have a more sound basis for community

selection. This process eliminates scenarios where a ran-

dom ordering of community selection allows a spurious

community to emerge and claim the entire network.

Enhancements to the Raghavan et al. algorithm have been

made to extend its functionality to support dynamic net-

work processing and community overlap detection as

supported by SOCIAL. Xie et al. (2013) have proposed a

mechanism to allow the algorithm to remain operating after

convergence to receive and process dynamic changes in the

network. As with SOCIAL, changes in the network are

incorporated into the last snapshot of community structure

and the algorithm continues to iterate and process these

changes. Gregory (2010) has developed a version of the

label propagation algorithm which detects community

overlap. In this algorithm, information regarding label

selection, including candidate labels which were rejected,

is retained and reported when convergence is achieved.

While these extensions increase the utility of label propagation, they further exacerbate the issue of requiring multiple runs to reach a consensus. In the case of processing

dynamic networks, this would seem to require running

multiple instances of the algorithm concurrently for con-

sensus to be achieved. Similarly, achieving consensus on

community overlap is complex and burdensome. Clearly,

SOCIAL provides a preferable solution. Community

detection provided by SOCIAL is accurate and consistent.

Further, the ability to process dynamic network updates

and community overlap detection is inherent to the algo-

rithm, i.e., disparate versions of the algorithm are not

required to achieve full functionality.

Many hybrid solutions have been developed to address

the community detection problem. One interesting example

of this is the solution proposed by Cruz et al. (2011). They

apply a combination of algorithms for community detec-

tion in social networks. First, the agglomerative Blondel

(2008) algorithm is used to perform community partition-

ing based on modularity. Following the partitioning by

Blondel’s algorithm, community structure is enhanced

through observation of semantic information contained in

the network. The semantic information consists of attri-

butes assigned to each node. For example, in a network of

employees, the attributes might include age, gender, and

profession. An entropy calculation is used to measure the level of similarity between a node and the peers in its community.

Fig. 3 Sample network containing two communities, which highlights the inconsistency of LPA and the need for multiple runs to achieve consensus. In the text of the paper, the red community refers to the nodes on the right-hand side of the figure, while the blue community refers to the nodes on the left-hand side (color figure online)

The goal of entropy reduction is achieved by

means of Monte Carlo selection. In this process, nodes are

selected and moved to the community where the lowest

entropy value is calculated. The combination of algorithms is executed iteratively, maximizing modularity with

Blondel’s algorithm and minimizing entropy through the

relocation of nodes to communities with common attri-

butes. Since modularity optimization increases entropy,

and entropy optimization decreases modularity, a balance

must be struck between the two algorithms. The overall

process favors modularity by terminating when modularity

improvements are no longer achievable. The use of

semantic information improves partition quality. However,

the overall approach of Cruz et al. does not support com-

munity overlap or dynamic networks. The use of semantic

information suggests a capability for community overlap

detection. However, the semantics have the effect of

grouping nodes that have the highest commonality among

the entire set of semantics. For example, using the student

network previously described, students belonging to the same

academic society, sports team, and religious group would

be attracted to each other to form a single community

rather than each individual being assigned to three overlapping communities.

3 Algorithm description

As argued earlier, SOCIAL’s main concern is with decen-

tralization. The algorithm is inspired by self-organized systems

observed in nature. In these systems, complex goals are

achieved with high efficiency and robustness (Marco et al.

2006). The bottlenecks associated with centralized control are

removed as work is performed by individuals possessing a

minimal amount of intelligence and tools required for the task.

From the apparent chaos of no centralized leadership, highly

effective processes are deployed with frequently astounding

emergent results (e.g., foraging in ant colonies and bee hives).

The basis for the self-organized community detection

algorithm is node entropy calculation. Node entropy is

derived from Shannon entropy, and expresses the certainty

each node has with regard to its current community. The

use of entropy is consistent with the simple localized

algorithms employed by other self-organized systems. The

work of discerning community structure is focused on

individual nodes rather than the entire network. Equation 1

shows the node entropy (S) calculation.

\[ S = -\sum_{i=1}^{n} \frac{p_i \log_2 p_i}{e^{k/4}}, \tag{1} \]

where $n$ is the number of communities a node may potentially join (including its current community), $p_i$ is the ratio of the number of neighbors in potential community $i$ to the neighborhood size of the node, and $k$ is the number of intra-community triads formed by joining a particular community. As shown by the equation, intra-community triad structures are favored by the

algorithm. Knowledge beyond a node’s immediate

neighborhood is not required to perform the calculation.

To determine which community a node belongs to, the

entropy calculation is performed with the node tempo-

rarily assigned to each community it could possibly join

($p_i$). It is important to note that entropy is also calculated

on the community that the node currently belongs to.

Here, if the node is the only member of the community,

it is essentially assumed to be connected to itself. After

the entropy calculations are performed, the resulting

entropy values are then evaluated to make a decision on

community assignment. For example, Fig. 4 shows a

simple network where node 4 must decide which com-

munity to choose.

The entropy calculation is done for each of the three

choices node 4 has:

1. Stay in the green community:

\[ S_g = -\left(\frac{1}{5}\log_2\frac{1}{5} + \frac{3}{5}\log_2\frac{3}{5} + \frac{1}{5}\log_2\frac{1}{5}\right) = 1.37 \]

2. Join the blue community:

\[ S_b = -\left(\frac{1}{5}\log_2\frac{1}{5} + \frac{\frac{4}{5}\log_2\frac{4}{5}}{e^{.25}}\right) = 0.264 \]

3. Join the red community:

\[ S_r = -\left(\frac{2}{5}\log_2\frac{2}{5} + \frac{3}{5}\log_2\frac{3}{5}\right) = 0.972 \]
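The two choices that involve no triads (staying in the green community and joining the red one) can be checked numerically with a short sketch of Eq. 1. The function name and its signature are hypothetical, and tracking the triad count $k$ per candidate community, with zero where no triad forms, is an assumption about how the damping term is applied.

```python
import math

def node_entropy(fractions, triads=None):
    """Node entropy following Eq. 1, term by term.

    fractions: p_i for each candidate community (share of the neighborhood).
    triads: k for each community (assumed per community; defaults to 0).
    """
    if triads is None:
        triads = [0] * len(fractions)
    return -sum(p * math.log2(p) / math.exp(k / 4)
                for p, k in zip(fractions, triads) if p > 0)

s_green = node_entropy([1/5, 3/5, 1/5])  # node 4 stays in its own community
s_red = node_entropy([2/5, 3/5])         # node 4 joins the red community
print(round(s_green, 2), round(s_red, 2))  # prints 1.37 0.97
```

With all triad counts at zero the expression reduces to plain Shannon entropy, which reproduces the green and red values above up to rounding in the last digit.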

Following the entropy calculation, the entropy values

are inverted so that the “best” entropies are the largest.

1. Staying in the green community: 0.74

2. Join the blue community: 3.79

3. Join the red community: 1.03

Fig. 4 Example used to demonstrate how the entropy is calculated. The colors represent different communities. The example focuses on node 4. The red community is composed of nodes (1, 2, and 3), the green community is composed of only node (4), and the blue community is composed of nodes (5, 6, 7, and 8) (color figure online)


Finally, the entropy values are normalized for the rou-

lette wheel selection used to choose a community. The

normalization makes the sum of entropy values equal to 1.

1. Staying in the green community: 0.132

2. Join the blue community: 0.682

3. Join the red community: 0.186

The normalized entropy values may be viewed as the

probability of a particular choice being made. For example,

there is a 68.2 % probability that the blue community will

be chosen for node 4.
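The inversion, normalization, and roulette wheel steps can be sketched as follows, starting from the three entropy values listed for node 4. The function names are hypothetical, the inversion is taken to be a simple reciprocal, and the sketch assumes strictly positive entropies; how a zero entropy is treated is not described here.

```python
import random

def selection_probabilities(entropies):
    """Invert each entropy (lower is better) and normalize to sum to 1."""
    inverted = [1.0 / s for s in entropies]   # assumes every s > 0
    total = sum(inverted)
    return [x / total for x in inverted]

def roulette_pick(options, probabilities, rng=random):
    """Pick one option with probability proportional to its weight."""
    return rng.choices(options, weights=probabilities, k=1)[0]

# Entropy values reported for node 4 in the example above.
entropies = {"green": 1.37, "blue": 0.264, "red": 0.972}
probs = selection_probabilities(list(entropies.values()))
choice = roulette_pick(list(entropies), probs, random.Random(0))
```

The resulting probabilities agree with the normalized values listed above to within about 0.001. Because the pick is random, the blue community is the most likely outcome for node 4 but not a guaranteed one.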

Algorithm 1 describes the overall community detection

process. Because of SOCIAL’s ability to process dynamic

changes to the network, it continues running, looking for

changes in the network, until the user terminates it. For

static networks, the user runs SOCIAL until the message

indicating a stable community structure is displayed, and

terminates the algorithm. During each iteration, a growing

subset of nodes recognize that they are completely certain

with their current community. For these nodes, the entropy

calculations are not performed. There are two cases where

a node is in this state. First, a node is completely surrounded by neighbors that are in the same community as itself. Second, the inverted normalized entropy value for the node in its current community is greater than or equal to 0.80 (i.e., 80 %) and none of the node’s neighbors has changed community in the preceding iteration. An added performance characteristic of the algorithm is the ability to calculate the node entropy values

required for each iteration in parallel. As may be observed

in any beehive or anthill, concurrent activity is the rule for

self-organizing systems. With today’s multi-core and par-

allel computer architectures, algorithms which support

concurrent operations enjoy a significant performance

advantage over centralized algorithms.
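The two conditions under which a node skips its entropy calculation can be expressed as a small predicate. This is only a sketch of the rule stated above; the function name, its arguments, and the toy graph are hypothetical, and the rest of Algorithm 1 is not reproduced.

```python
def is_certain(node, adj, community, own_probability,
               changed_last_iteration, threshold=0.80):
    """True if the node may skip its entropy calculation this iteration.

    own_probability: inverted, normalized entropy of the node's current
        community, as computed in the previous evaluation of the node.
    changed_last_iteration: set of nodes that switched community in the
        preceding iteration.
    """
    neighbors = adj[node]
    # Case 1: every neighbor already shares the node's community.
    if all(community[w] == community[node] for w in neighbors):
        return True
    # Case 2: strong preference for the current community and a quiet
    # neighborhood in the preceding iteration.
    return (own_probability >= threshold
            and not any(w in changed_last_iteration for w in neighbors))

# Hypothetical toy graph: a triangle (1, 2, 3) with node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
community = {1: "a", 2: "a", 3: "a", 4: "b"}
skip_node_1 = is_certain(1, adj, community, 0.5, set())
skip_node_3 = is_certain(3, adj, community, 0.85, {4})
```

Here node 1 is skipped because all of its neighbors share its community, whereas node 3 is re-evaluated because its neighbor, node 4, changed community in the preceding iteration.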

4 Experimental results

4.1 Understanding the algorithm

The Zachary’s Karate Club Network (Zachary 1977) is

commonly used to evaluate partition quality of community

algorithms. Although it is small with 34 nodes and 77 edges,

this network has become a standard because, through par-

ticipant feedback, we have the ground truth on community

divisions. In addition, the network contains subtle topolog-

ical features that can become a stumbling block to commu-

nity detection algorithms. The network represents a karate

club which split into two factions after a dispute between the

club president (called John A) and the lead instructor (Mr.

Hi). The dispute led to the formation of two separate clubs

with the original 34 members. The ground truth of the

community division after the dispute is represented in Fig. 5.

We executed SOCIAL on the karate network and found that the algorithm correctly partitions the

network between the club president (node 34) and the

instructor (node 1). This can be observed in Fig. 6. SOCIAL divides the club, but higher resolution in the partition is demonstrated by the correct identification of two sub-communities within this main division. This higher resolution is not unique to SOCIAL; we show it here simply to demonstrate that we match other algorithms’ results.

Fig. 5 After the dispute between the club president (node 34) and the lead instructor (node 1), the club was split into two factions (new clubs). This figure represents how each of the 34 original members was split (based on their preference of who to work with: president or lead instructor)

Fig. 6 Partitioned Zachary’s Karate Club Network found by SOCIAL. Note the identification of two sub-communities in the graph. The leftmost community is orange, node 1 and its community are red, node 34 and its community are blue, and the community composed of 4 nodes just above the community of node 34 is green (color figure online)

Our results are highly correlated to algorithms based on

Girvan–Newman modularity. In fact, Fig. 7 shows the

communities according to a modularity-based algorithm

proposed by Blondel et al. (2008). The correlation to the

network modularity is demonstrated in Fig. 8.

The entropy values used in Fig. 8 are the non-inverted and non-normalized entropy of each node in its current community, averaged over the entire network. During the execution of SOCIAL, the average entropy starts high and is reduced with each iteration. At the same time, modularity starts low and increases over time. The two metrics converge as community structure emerges. Although, as pointed out in Sect. 2, Girvan–Newman modularity has serious resolution issues in highly connected networks, these issues are not present in the karate club network example. In this case, the comparison between entropy and modularity serves as a validation of node entropy as a community quality metric.
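To make the modularity side of this comparison concrete, the following minimal sketch computes Newman–Girvan modularity for a given partition of an undirected, unweighted graph. The function name `modularity`, the edge-list input format, and the toy graph are assumptions made for this illustration; they are not taken from the SOCIAL implementation.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity Q of a partition.

    edges: iterable of (u, v) pairs (undirected, no duplicates, no self-loops).
    community_of: dict mapping each node to its community label.
    """
    m = 0
    internal = defaultdict(int)    # edges with both ends in the same community
    degree_sum = defaultdict(int)  # summed degree of the nodes in a community
    for u, v in edges:
        m += 1
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    if m == 0:
        return 0.0
    # Q = sum over communities of (fraction of internal edges)
    #     minus (expected fraction under random rewiring).
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
split = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
merged = {node: "a" for node in split}

q_split = modularity(edges, split)    # one community per triangle
q_merged = modularity(edges, merged)  # everything in one community
```

On this toy graph the two-community partition scores higher than the single-community one, which mirrors the rising modularity curve described above as structure emerges.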

The main difference between the solution found by SOCIAL and the one found by Blondel’s algorithm concerns node 33. Although this might appear strange, node 33 forms a triad with two other nodes in its own community (shown in red in the online version of the paper). Node 33 is surrounded by nodes in another community (shown in blue in the online version of the paper), but these nodes do not know each other, except for the case involving node 34. Hence, we believe node 33 can be correctly classified as not belonging to the community of node 34. Note that it is also acceptable for it to be part of that community and, given the probabilistic nature of SOCIAL, it is classified as such in some of the executions.
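The argument above rests on counting the triads a node closes inside a candidate community. The sketch below shows one way such a count could be made. The helper `triads_in_community`, the adjacency-set representation, and the toy graph are hypothetical, and the exact way SOCIAL folds triads into node entropy is not reproduced here.

```python
def triads_in_community(adj, node, community, community_of):
    """Count triangles formed by `node` with two of its neighbours that both
    belong to `community` and are linked to each other."""
    members = [v for v in adj[node] if community_of[v] == community]
    count = 0
    for i, u in enumerate(members):
        for w in members[i + 1:]:
            if w in adj[u]:  # the two neighbours know each other
                count += 1
    return count


# Hypothetical toy graph: node "x" has two linked "red" neighbours and
# two "blue" neighbours that are not linked to each other.
adj = {
    "x": {"a", "b", "c", "d"},
    "a": {"x", "b"},
    "b": {"x", "a"},
    "c": {"x"},
    "d": {"x"},
}
community_of = {"x": "red", "a": "red", "b": "red", "c": "blue", "d": "blue"}

red_triads = triads_in_community(adj, "x", "red", community_of)
blue_triads = triads_in_community(adj, "x", "blue", community_of)
```

Here "x" has as many blue neighbours as red ones, yet it closes a triad only on the red side, which is the same kind of situation described for node 33.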

4.2 Understanding overlaps in SOCIAL

Community overlap may be discerned from the node entropy

values seen in a partitioned network. The final inverted and

normalized entropy values of four nodes with interesting

entropies from the karate network are as follows:

• Node 2: red community 1.0
• Node 5: orange community 0.729, red community 0.271
• Node 6: orange community 0.896, red community 0.104
• Node 29: blue community 0.33, red community 0.33, green community 0.33

These nodes are highlighted in Fig. 9. The colors men-

tioned above are visible in Fig. 6.

These entropies are an indication of the certainty of each

node with its current community. Here, the value associ-

ated with the current community of a node is an indication

of the node’s certainty that it belongs to that community,

where a higher value indicates a higher certainty. Nodes

with an entropy value of 1.0 (i.e., 100 % certainty), such as

node 2, are surrounded by neighboring nodes in the same

community (as seen in Fig. 6). Hence, there is no uncer-

tainty concerning which community node 2 belongs in.

Nodes with multiple low entropy values of nearly equal

value are identified as nodes which overlap into multiple

communities. For example, node 29 has an entropy value of 0.33 for three different communities. This entropy value indicates a 100 percent chance that node 29 will re-evaluate its community position. Node 29 has a single link to each of three separate communities (blue, red, and green in Fig. 6),

and no intra-community triads are formed if the node were

to join any one of these communities. With each iteration

of the algorithm, the entropy value which results from the

node joining each of these communities is calculated.

These three entropies are equal, with a value 0.33. As a


Fig. 7 Partitioned Zachary’s Karate Club Network found by Blondel

et al. (2008) algorithm

Fig. 8 Correlation between average node entropy and Girvan–

Newman modularity in Zachary’s Karate Club Network


Fig. 9 Nodes with different entropies have different overlap proba-

bilities. The lower the inverted and normalized entropy, the more

likely the node belongs to more than one community


result, the node has an equal probability of selecting any

one of these communities. Indeed, during the final itera-

tions of the algorithm, this node is seen to toggle between

the three communities. Node 5 has a final entropy of 0.729 in the orange community. It has one edge going into the red community and two edges going into the orange community, with no intra-community triads. Because of the

higher number of edges in the orange community, its

entropy favors remaining in the orange community. Node 6

has an entropy of 0.896 in the orange community which

corresponds to having 3 edges in the orange community,

one edge in the red community, and an intra-community

triad in the orange community. From the entropies calculated for each community to which a node may belong, a ratio of the “belongingness” of the node to each community is determined. This concept corresponds to the fuzzy community structures described by Reichardt and Bornholdt (2004), where the robustness of node assignments to communities is examined to obtain a higher resolution of community structure than is seen with traditional algorithms which detect overlap. For example, in the karate network, node 29 belongs equally to the blue, red, and green communities. In

general, a threshold may be used to determine whether a

community has sufficient node attraction to indicate over-

lap. Using a threshold, the level of participation of a node

in each community is discerned from the entropy values

calculated when the node is evaluated in each community

during the last iteration of the algorithm. Following the

discernment of participation level, a community may be

eliminated as an overlap candidate if the node entropy is

sufficiently low when evaluated in that community.
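The thresholding step can be sketched using the inverted and normalized entropy values listed above for nodes 2, 5, 6, and 29. The threshold of 0.25 and the helper name `overlap_communities` are assumptions chosen only for illustration; the text does not prescribe a particular threshold.

```python
def overlap_communities(belonging, threshold):
    """Return the communities whose belonging value reaches the threshold."""
    return sorted(c for c, value in belonging.items() if value >= threshold)


# Inverted and normalized entropy values reported for the karate network.
belonging = {
    2: {"red": 1.0},
    5: {"orange": 0.729, "red": 0.271},
    6: {"orange": 0.896, "red": 0.104},
    29: {"blue": 0.33, "red": 0.33, "green": 0.33},
}

THRESHOLD = 0.25  # hypothetical cut-off, not a value from the paper

overlaps = {node: overlap_communities(values, THRESHOLD)
            for node, values in belonging.items()}
```

With this particular cut-off, node 6 is assigned to the orange community only, node 5 is treated as overlapping orange and red, and node 29 overlaps all three of its neighbouring communities. A stricter or looser threshold would move node 5 in or out of the overlap set.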

A more graphic illustration of SOCIAL’s overlap

detection capability is shown in Fig. 10. In this figure, a

small network is pictured in an intermediate stage of

community discernment where two communities have

formed. The nodes in the figure are represented by pie

charts. Here, each pie chart shows the level of participation

of the corresponding node in each community. Several

nodes (2, 4, and 6) are completely surrounded by neighbors

in the same community. Hence, they show no participation

in their peer communities. Nodes 1 and 8 have intra-

community overlaps that are so insignificant that the

overlaps are not visible in the node pie charts. Each of

these nodes has multiple triads within its current commu-

nity with single edges going out to the external community.

Nodes 5 and 3 have identical configurations with four

edges in the dark community and 3 edges in the light

community. These nodes favor the dark community

because they each form four triads within the dark com-

munity and only three triads in the light community. Node

9 strongly favors the light community because it forms 3

triads in the light community and only 1 triad in the dark

community. The most interesting case is node 7, which has

equal participation in both communities. The inverted and

normalized entropy values for node 7 are light 0.5, and

dark 0.5. Node 7 has 3 edges going to each of the communities, with 3 intra-community triads in each. As a result, node 7 will toggle between the two communities if we insist on assigning it to only one of them; the algorithm, however, indicates that node 7 belongs to both communities with 50 % participation in each. In this network, the instability of node 7 results in the

light community being absorbed by the dark community.

Because of the centrality of node 5 to the light community,

when it toggles to dark community, its peers in the light

community follow.
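The toggling behaviour of a node with equal participation values can be imitated with a weighted random choice. The sketch below assumes that a community is drawn with probability proportional to the node's inverted and normalized entropy in that community. This is an illustrative reading of the text, which only states that equal values give equal selection probability, and `choose_community` is a hypothetical helper.

```python
import random


def choose_community(belonging, rng):
    """Draw one community, weighted by the node's belonging values."""
    labels = sorted(belonging)
    weights = [belonging[c] for c in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


rng = random.Random(0)                # fixed seed so the run is repeatable
node7 = {"light": 0.5, "dark": 0.5}   # values given for node 7 in Fig. 10

picks = [choose_community(node7, rng) for _ in range(1000)]
share_light = picks.count("light") / len(picks)
```

Over many draws the node lands in each community about half of the time, which is why reporting both participation values is more informative than any single assignment.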

4.3 Dealing with dynamics

The amount of processing required by the algorithm to

incorporate real-time updates to a network is proportional to

the complexity of the change. For example, if a single node


Fig. 10 Network with nodes represented by pie charts showing levels

of community participation


Fig. 11 Impact of adding disruptive node (35) to Zachary’s Karate

Club Network


with a single edge is added to a network, one iteration of the

algorithm is required which impacts only the added node as it

joins the community it is connected to. On the other hand, if a

node with many edges joining several communities is added,

more processing is required. As an example, node 35 is added

to the karate network. This node has multiple edges into both

the orange and red communities (colors refer to Fig. 6).

Further, a triad is formed between node 35 and nodes in the

blue community. Fig. 11 shows how the community struc-

ture is changed by this addition.

The addition of node 35 results in the orange community

in Fig. 6 being absorbed by the red community. Only four

iterations were required for the network to stabilize, and six

nodes are impacted. The four iterations reflect a cascading

effect as the nodes most attached to the red community

decide to change communities, and the remaining nodes

follow. The important characteristic to note is that the update

to community structure is localized to the affected nodes.
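The locality of such updates can be sketched with a worklist: only the new node is examined at first, and neighbours are queued for re-evaluation only when a node actually changes community. In this sketch a simple majority vote over neighbouring communities stands in for SOCIAL's entropy-based decision, so it illustrates the propagation pattern rather than the algorithm itself. The function `local_update` and the toy graph are hypothetical.

```python
from collections import Counter, deque


def local_update(adj, community_of, new_node, new_neighbors, max_steps=1000):
    """Add `new_node` and re-evaluate only nodes touched by a change.

    Returns the set of nodes whose community changed.
    NOTE: the majority rule below is a stand-in, not SOCIAL's entropy rule.
    """
    adj[new_node] = set(new_neighbors)
    for v in new_neighbors:
        adj[v].add(new_node)
    community_of[new_node] = new_node  # a new node starts in its own community

    queue = deque([new_node])
    changed = set()
    steps = 0
    while queue and steps < max_steps:  # cap guards against oscillation
        steps += 1
        node = queue.popleft()
        counts = Counter(community_of[v] for v in adj[node])
        if not counts:
            continue
        best = max(counts.values())
        if counts.get(community_of[node], 0) == best:
            continue  # node is already in a best community: nothing to do
        target = min((c for c, k in counts.items() if k == best), key=str)
        community_of[node] = target
        changed.add(node)
        queue.extend(adj[node])  # only neighbours of a changed node are revisited
    return changed


# Hypothetical toy network: a path a-b-c in one community.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
community_of = {"a": "red", "b": "red", "c": "red"}

# Add node "d" with a single edge to "c".
changed = local_update(adj, community_of, "d", ["c"])
```

In this single-edge case only the added node changes community, matching the simplest scenario described above. A node with edges into several communities would enqueue more neighbours and take more steps to settle.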

4.4 Performance

The network generation package developed by Lancichinetti

et al. (2008) was used to generate benchmark networks for

performance evaluation. This package utilizes a variety of

parameters to generate networks containing social structure.

Using these parameters, the algorithm follows a set of rules

when connecting nodes which result in the desired social

structure. For example, the user may specify the mean degree

and the exponent of the power-law degree distribution (γ) to

create social networks with varying levels of density. A key

feature of the algorithm is the output of the set of known

communities upon completion. Based on the input parame-

ters and the fixed set of generation rules, the set of commu-

nities and the community that each node belongs to are

known (ground truth). This known community structure may

then be compared to the structure discerned by a community

detection algorithm to evaluate performance. The parame-

ters of interest for this analysis and their corresponding

values used are described in Table 1.

The mixing parameter μ specifies the fraction of edges of each node that connect to nodes in other communities, i.e., edges that span two communities. The remaining fraction 1 − μ of each node’s edges are intra-community edges.
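The definition of the mixing parameter can be checked on a labelled network by measuring, for each node, the share of its edges that leave its community. The helper `mixing_fraction` and the toy star graph below are assumptions for illustration and are not part of the benchmark package.

```python
def mixing_fraction(adj, community_of, node):
    """Fraction of `node`'s edges that lead to a different community."""
    neighbors = adj[node]
    if not neighbors:
        return 0.0
    external = sum(1 for v in neighbors
                   if community_of[v] != community_of[node])
    return external / len(neighbors)


# Hypothetical toy graph: "a" has three neighbours in its own community
# and one neighbour ("e") in another community.
adj = {
    "a": {"b", "c", "d", "e"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a"},
    "e": {"a"},
}
community_of = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 2}

mu_a = mixing_fraction(adj, community_of, "a")  # 1 of 4 edges is external
mu_e = mixing_fraction(adj, community_of, "e")  # its only edge is external
```

Node "a" therefore has a mixing fraction of 0.25, and the complementary 0.75 of its edges stay inside its community.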

For algorithm testing in this analysis, networks were generated for each of the following μ values: 0.1, 0.2, 0.3, 0.4, 0.5, and 0.6. As μ increases, communities become increasingly difficult to discern automatically. Networks generated with μ values ≤ 0.3 exhibit distinct communities which are relatively easy to identify. However, the communities seen in networks created with μ values > 0.3 are highly interconnected; hence the communities are quite difficult to discern. Both the SOCIAL algorithm and the network generation package are probabilistic. As a result, to ensure a balanced view of performance, ten networks were generated with each μ value. Using the 10 networks, the standard deviation was taken over the set of performance results.

In addition to the benchmark network generation pack-

age, Lancichinetti et al. (2008) provide a metric named

normalized mutual information which measures the simi-

larity between expected community structure and the

structure discerned by a community detection algorithm.

This metric utilizes concepts borrowed from information

theory to compute a numerical measure of the similarity

between two community partitions. The value which

results from a comparison is in the range from 0 to 1, where

0 indicates no similarity and a value of 1 indicates an

identical partitioning.
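As a rough illustration of how such a score behaves, the sketch below computes a widely used form of normalized mutual information for two non-overlapping partitions, NMI = 2 I(A;B) / (H(A) + H(B)). Whether this matches the exact variant used in the experiments is an assumption, and the function name and label lists are hypothetical.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels_a, labels_b):
    """NMI of two partitions given as equal-length lists of community labels."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Entropy of each partition.
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    # Mutual information between the two partitions.
    mutual = sum(c / n * log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())

    if h_a + h_b == 0:
        return 1.0  # both partitions consist of a single community
    return 2 * mutual / (h_a + h_b)


truth = [0, 0, 1, 1]
same_up_to_names = ["x", "x", "y", "y"]  # identical grouping, renamed labels
unrelated = [0, 1, 0, 1]                 # grouping independent of `truth`

nmi_same = normalized_mutual_information(truth, same_up_to_names)
nmi_unrelated = normalized_mutual_information(truth, unrelated)
```

The score depends only on how nodes are grouped, not on the label names, so a detected partition does not need to reuse the ground-truth community identifiers.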

Three algorithms were tested with SOCIAL for perfor-

mance comparison. The first algorithm was developed by

Clauset et al. (2004). It utilizes modularity maximization to

discern community structure. The second algorithm is the

Blondel et al. (2008) algorithm described in Sect. 2, which relies

on a combination of agglomeration and modularity maxi-

mization. The third algorithm is the LPA implementation

provided by the igraph (Csardi and Nepusz 2006) network

analysis package. Three criteria were used in the selection

of these algorithms. First, algorithms which were “packaged”, well tested, and available for download were considered. Second, algorithms were selected because of their

relatively fast execution. A variety of downloadable algo-

rithms are available. However, many of them require hours

(or days) to complete and were not practical for use in this

analysis. Finally, well-known algorithms were desired.

The performance tests were executed on a Windows

Vista 64-bit platform. The system configuration includes

Quad CPUs running at 3.00 GHz with 4 GB of RAM.

During testing, care was taken to dedicate the system

exclusively to the algorithm being evaluated.

Figure 12 shows a comparison of the accuracies of the

four algorithms utilizing the normalized mutual information

metric. As described earlier, each algorithm was executed on

Table 1 Parameters used in the benchmark program by Lancichinetti et al. (2008)

Parameter                                      Value
Number of nodes                                1,000
Mean degree                                    15
Maximum degree                                 50
Exponent for degree distribution (γ)           2
Exponent for community size distribution (β)   1
Mixing parameter (μ)                           Varied
Minimum for the community sizes                20
Maximum for the community sizes                50


the ten networks corresponding to each μ setting and the standard deviations of the results were taken. In addition, to overcome the inconsistency issue seen with LPA, each generated network was parsed ten times by LPA and the average of the results was taken. In Fig. 12, the results of SOCIAL and the Blondel algorithm are similar, with excellent accuracy for μ values ≤ 0.4. For μ values > 0.4, performance begins to degrade as communities become more interconnected and social structure is more difficult to discern. LPA also performed well with all μ values except 0.6. With this μ value, LPA consistently produced a single community. This failure suggests the presence of network structures which are problematic for LPA. These structures, even if only sparsely present in real-world networks, could lead to inaccurate community discernment with LPA.

The execution time of SOCIAL is compared to the other

three algorithms in Fig. 13. Precise time measurement in the

Windows test environment was not possible. As a result,

execution times less than one second should simply be

interpreted as ‘extremely fast’. The actual values should not

be viewed as precise measurements. While SOCIAL is

several seconds slower than the Blondel algorithm and LPA,

it should be noted that the current version of SOCIAL is not

optimized to utilize the multi-processor configurations

which are ubiquitous in today’s personal computing platforms. The self-organized approach used by SOCIAL allows

entropy computations to be performed in parallel; it is part of

the algorithm design. As a result, SOCIAL execution time is

expected to scale down linearly as multiple processors are

exploited to perform concurrent entropy calculations. In

Fig. 14, a linear progression is seen in the number of itera-

tions required by SOCIAL to achieve convergence as the

social structure within the networks becomes more complex.
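Because each node's evaluation reads only its own neighbourhood, the per-node work within an iteration can be handed to a pool of workers. The sketch below shows that pattern with Python's standard `concurrent.futures` module. The scoring function `same_community_fraction` is a placeholder and not the node entropy defined in this paper, and `evaluate_all` and the toy graph are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def same_community_fraction(node, adj, community_of):
    """Placeholder per-node score: share of neighbours in the node's community.
    It stands in for the entropy calculation, which is not reproduced here."""
    neighbors = adj[node]
    if not neighbors:
        return 1.0
    same = sum(1 for v in neighbors if community_of[v] == community_of[node])
    return same / len(neighbors)


def evaluate_all(adj, community_of, workers=4):
    """Score every node concurrently; each task reads only local information."""
    nodes = list(adj)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = pool.map(
            lambda n: same_community_fraction(n, adj, community_of), nodes)
        return dict(zip(nodes, scores))


# Hypothetical toy graph: a triangle (1, 2, 3) with a pendant node 4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
community_of = {1: "a", 2: "a", 3: "a", 4: "b"}

scores = evaluate_all(adj, community_of)
```

A thread pool is used here only to keep the example short. In CPython, CPU-bound scoring would need a process pool or a different runtime to see a real speed-up, and any linear scaling remains an expectation to be measured rather than a result.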

Although the number of iterations increases with network

complexity, Fig. 15 demonstrates that with each iteration, fewer

entropy calculations are required as the algorithm progresses.

This reduction in entropy calculations is attributed to the

emergent community structure. As communities are formed, an

increasing number of individual nodes are ‘‘satisfied’’ with their

community and no entropy calculation is required.

The inset in Fig. 15 shows the total entropy calculations

performed by SOCIAL for each mixing parameter μ.

Fig. 12 SOCIAL accuracy compared to other algorithms using the

normalized mutual information metric. Despite being self-organized,

SOCIAL nearly matches the results of Blondel’s algorithm

Fig. 13 SOCIAL execution time compared to two other algorithms

Fig. 14 SOCIAL iterations required to achieve convergence for each

mixing parameter value

Fig. 15 SOCIAL entropy calculations per iteration with inset show-

ing total entropy calculations


Again, a linear correspondence is seen with the complexity

of social structure exhibited by each network.

The time complexity of SOCIAL is difficult to characterize precisely because it depends on the complexity of the network being partitioned. Three factors impact the time complexity of SOCIAL: (1) the number of nodes in the network, $n$; (2) the number of iterations required for convergence, $I_{max}$; and (3) the average degree of nodes in the network, $z$. For the purposes of this discussion, the worst case is used, where a very complex network, i.e., μ = 0.6, is analyzed. Arguably, it would be very unusual to find a real-world social network containing topological features more complex than the ones found in this network. Referring to Fig. 14, the number of iterations required to achieve convergence on a network with μ = 0.6 is approximately 59. This value is referred to as $I_{max}$ in the calculations that follow. The number of entropy calculations per iteration is the core of SOCIAL’s time complexity. For each node, an entropy calculation is done for each community surrounding the node, and for the node’s current community. The first iteration requires a maximum of $n \times (z + 1)$ entropy calculations because each node initially belongs to a unique community. The fact that many communities are merged during the first iteration is ignored to achieve a conservative estimate of time complexity. For subsequent iterations, the number of entropy calculations is reduced as communities merge and the average number of communities surrounding each node is reduced. Again, the worst case of μ = 0.6 is used, where the number of entropy calculations is reduced by only 10 % with each iteration, as shown in Fig. 15. To model this behavior, the expression $(z + 1) \times 0.9^{i}$ is used to show the decline in entropy calculations with each iteration. Here, $i$ is the iteration number. Finally, using the expression for the number of entropy calculations per iteration and the maximum number of iterations $I_{max}$, time complexity, i.e., total entropy calculations, is computed as:

$$O(n \times T_e), \tag{2}$$

where $T_e$ represents the number of entropy calculations for the first and subsequent iterations of the algorithm (from 2 to $I_{max}$):

$$T_e = \sum_{i=0}^{I_{max}} (z + 1) \times 0.9^{i}. \tag{3}$$

Note that the geometric series in Eq. 3, $\sum_{i=0}^{I_{max}} 0.9^{i}$, is bounded above by $1/(1 - 0.9) = 10$ regardless of $I_{max}$, which means that the performance of SOCIAL depends on $n$ and $z$. Given that $z \ll n$, the algorithm has near-linear time complexity with regard to the number of nodes in the network, $n$. Clearly, this could degenerate into performance that is not linear for very dense networks, but the argument here is that these are not typical of the networks that arise from real phenomena.
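The estimate can be checked numerically. The sketch below evaluates the sum in Eq. 3 for the benchmark setting, taking the node count and mean degree from Table 1 and the iteration count from the worst case discussed above, and compares it with the closed form of the geometric series. The function name `entropy_calcs_per_node` is hypothetical, and the 10 % decay per iteration is the modelling assumption stated in the text.

```python
def entropy_calcs_per_node(z, i_max, decay=0.9):
    """Sum of (z + 1) * decay**i for i = 0 .. i_max (Eq. 3)."""
    return sum((z + 1) * decay ** i for i in range(i_max + 1))


n = 1000     # number of nodes (Table 1)
z = 15       # mean degree (Table 1)
i_max = 59   # approximate iterations to converge in the worst case

t_e = entropy_calcs_per_node(z, i_max)

# Closed form of the finite geometric series, and its limit as i_max grows.
closed_form = (z + 1) * (1 - 0.9 ** (i_max + 1)) / (1 - 0.9)
upper_bound = (z + 1) / (1 - 0.9)

total_calcs = n * t_e  # the quantity inside O(n * T_e) in Eq. 2
```

Summed from i = 0, the series factor stays just below 10 however many iterations are run, so under this model the total work grows with n and z but not with the iteration count.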

5 Discussion

The self-organized approach to community detection used by

SOCIAL is new and quite exciting given its capabilities of

high-resolution overlap detection and the ability to process

dynamic networks. In addition, SOCIAL was shown to be

comparable in accuracy and execution speed to the well-

known Blondel and LP algorithms. Furthermore, SOCIAL

does not have the resolution issue present in algorithms, such

as the Blondel algorithm, which utilize Girvan–Newman

Modularity for network partitioning. Although this resolution

issue was not evident in the networks generated for performance testing in this analysis, it is well documented and considered a serious concern (Fortunato and Barthelemy

2007). Likewise, SOCIAL was shown not to have the incon-

sistency issue present in LPA which requires multiple runs to

be performed and a consensus to be taken from the results.

Future testing will include performance comparisons on

networks which explicitly include topologies which are

problematic for the Blondel algorithm and LPA. Finally,

SOCIAL was several seconds slower than the Blondel algorithm and LPA in execution time on the more complex networks. As stated previously, SOCIAL has not been optimized

for utilization of multi-processor platforms. Although 3 or 4

seconds is hardly considered an excessive execution time for

networks as complex as those tested, a significant gain in

speed is expected once this optimization is included.

6 Conclusions

We have introduced a self-organizing community detection

algorithm called SOCIAL which reflects the high efficiency

and adaptivity of decentralized systems observed in nature.

The algorithm is based on localized node entropy calcula-

tions which require only knowledge of each node’s local

neighborhood. The core entropy calculation reveals com-

munity overlap, and additionally, shows the level of

‘‘belongingness’’ or ‘‘fuzziness’’ of nodes to the communi-

ties in which they belong. In a demonstration of community

partition quality, the algorithm was shown to successfully

identify a high-resolution partitioning of Zachary’s

Karate Club Network. Finally, the algorithm is well suited to

manage dynamically changing networks. Network pertur-

bations are localized and the processing required for each

update is proportional to the complexity of the update.

Future work includes further exploration of the perfor-

mance characteristics of the algorithm on large networks


containing more than 100,000 nodes. The performance,

with regard to partition quality and speed, will be com-

pared to existing community detection algorithms. Part of

these evaluations will include measurement of performance

gains made by exploiting our algorithm’s ability to support

concurrent entropy calculations on multi-core computer

architectures. Lastly, community detection in dynamic

networks will be further explored. Here, networks will be

iteratively generated, and with each iteration, updates to the

network will be made accessible to the algorithm.

References

Adomavicius G, Tuzhilin A (2005) Toward the next generation of

recommender systems: A survey of the state-of-the-art and possible

extensions. IEEE Trans. Knowl. Data Eng 17(6):734–749

Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast

unfolding of communities in large networks. J Stat Mech Theory

Exp 2008(10):P10008

Clauset A, Newman MEJ, Moore C (2004) Finding community structure in very large networks. Phys Rev E 70(6):066111. doi:10.1103/PhysRevE.70.066111

Cruz JD, Bothorel C, Poulet F (2011) Entropy based community detection in augmented social networks. In: 2011 International Conference on computational aspects of social networks (CASoN), 19–21 October 2011, pp 163–168

Csardi G, Nepusz T (2006) The igraph software package for complex network research. Int J Complex Systems 1695. http://igraph.sf.net

Fortunato S (2010) Community detection in graphs. Physics Reports

486:75–174

Fortunato S, Barthelemy M (2007) Resolution limit in community detection. PNAS 104(1):36–41

Girvan M, Newman MEJ (2002) Community structure in social and biological networks. PNAS 99(12):7821–7826

Gregory S (2010) Finding overlapping communities in networks by

label propagation. New J Phys 12(10):103018

Java A, Song X, Finin T, Tseng B (2007) Why we twitter:

understanding microblogging usage and communities. In: Pro-

ceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop

on Web mining and social network analysis, WebKDD/SNA-

KDD ’07. ACM, New York, NY, pp 56–65

Krebs V (2002) Mapping networks of terrorist cells. Connections 24(3):43–52

Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs

for testing community detection algorithms. Phys Rev E

78(4):046110

Ma X, Gao L, Yong X (2010) Eigenspaces of networks reveal the overlapping and hierarchical community structure more precisely. J Stat Mech Theory Exp 2010(08):P08012

Mamei M, Menezes R, Tolksdorf R, Zambonelli F (2006) Case studies for self-organization in computer science. J Syst Archit 52(8):443–460

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026113

Ning H, Xu W, Chi Y, Gong Y, Huang TS (2010) Incremental spectral clustering by efficiently updating the eigen-system. Pattern Recogn 43(1):113–127

Palla G, Derenyi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814–818

Raeder T, Chawla NV (2011) Market basket analysis with networks. Soc Netw Anal Min 1(2):97–113

Raghavan UN, Albert R, Kumara S (2007) Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E 76(3):036106

Rattigan MJ, Maier M, Jensen D (2007) Graph clustering with

network structure indices. In: Proceedings of the 24th interna-

tional conference on machine learning, ICML ’07. ACM, New

York, NY, pp 783–790

Rees BS, Gallagher KB (2012) Overlapping community detection

using a community optimized graph swarm. Soc Netw Anal Min

2(4):405–417

Reichardt J, Bornholdt S (2004) Detecting fuzzy community struc-

tures in complex networks with a potts model. Phys Rev Lett

93(21):218701

Scott J (2011) Social network analysis: developments, advances, and prospects. Soc Netw Anal Min 1(1):21–26

Claude Shannon, Noshirwan Petigara, Satwiksai Seshasai (1978) A

mathematical theory of communication. Bell System Technical

Journal 27:379–423

Shi J, Malik J (1997) Normalized cuts and image segmentation. In:

Proceedings of the 1997 conference on computer vision and

pattern recognition (CVPR ’97), CVPR ’97. IEEE Computer

Society, Washington, DC, p 731

Venugopal S, Stoner E, Cadeiras M, Menezes R (2012) The social

structure of organ transplantation in the united states. In:

Proceedings of the 3rd international workshop on complex

networks (CompleNet 2012), vol 424 of studies in computational

intelligence. Springer-Verlag, Melbourne, FL, pp 199–206

Wilkinson DM, Huberman BA (2004) A method for finding

communities of related genes. Proc Natl Acad Sci 101(Suppl

1):5241–5248. doi:10.1073/pnas.0307740100

Xie J, Chen M, Szymanski BK (May 2013) LabelRankT: incremental

community detection in dynamic networks via label propagation

Zachary WW (1977) An information flow model for conflict and

fission in small groups. Journal of Anthropological Research

33:452–473
