Community Detecting and Feature Analysis in Real Directed Weighted Social Networks

Yao Liu, Qiao Liu, and Zhiguang Qin

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China

E-mail: [email protected], {qliu, qinzg}@uestc.edu.cn

Abstract—Real social networks usually exhibit structural features of complex networks, such as community structure, a scale-free degree distribution, clustering, the "small world" property, and dynamic evolution. This paper proposes a new community detection algorithm for directed and weighted social networks. Because it uses more reference information (edge directions and weights), the algorithm is more accurate than several typical detection algorithms. Because it uses a heap structure and a multi-task modular architecture, it also achieves higher computational efficiency than other algorithms. The effectiveness and efficiency of the algorithm are validated by experiments on real social networks. Based on the theories and models of complex networks, the features of a real large social network are analyzed.

Index Terms—modularity, degree distribution, clustering coefficient, hierarchy

I. INTRODUCTION

Recently, many researchers in different fields have used the topological properties and evolutionary processes of complex networks to describe relationships and collective behaviors in their own fields [1, 2]. This methodology is called network analysis, and it has produced new analysis methods and topological properties.

Network modeling is becoming an essential tool for studying and understanding the complexity of many natural and artificial systems [3]. Such studies first analyze the topological features of these systems, which are usually those of complex networks. Examples are the degree distribution, the average degree, the clustering coefficient, the node betweenness, the shortest path length, and the assortative mixing that describes node correlations in the network.

Detecting community structure, a problem closely related to graph partitioning, has become a hot area in complex network research. Many definitions of community have been proposed. In essence, a community structure is a grouping of nodes such that the density of edges within groups is higher than between them [4]. The main problem is to find communities in large networks automatically and with the shortest possible execution time.

Several algorithms have been proposed to find reasonably good partitions reasonably fast. Early community detection algorithms are rooted in the theories of graph partitioning and hierarchical clustering: the agglomerative algorithm [5] and the divisive algorithm [2, 3, 6] use hierarchical clustering, while the Kernighan-Lin algorithm [7] and the spectral bisection algorithm [8] use graph partitioning. However, none of them can determine the size and number of the communities, or the optimal community partition, on its own.

To solve this problem, a number of new algorithms have been proposed in recent years. Girvan and Newman proposed the GN divisive algorithm, which uses edge betweenness as a metric to identify the boundaries of communities [2]. However, the GN algorithm makes heavy demands on computational resources, running in $O(M^2 N)$ time on an arbitrary network with $M$ edges and $N$ nodes, or $O(N^3)$ time on a sparse graph. This restricts its use to networks of at most a few thousand nodes.

In 2004, Newman proposed the modularity concept, which can be used to measure the quality of a community partition [6]. The modularity of a partition is a scalar value between -1 and 1 that measures the density of links inside communities as compared to links between communities [1, 2]. A high modularity indicates that the internal link density of the communities is high compared with groups of randomly linked nodes.

Based on modularity, a number of faster algorithms have been proposed. Newman proposed an algorithm based on the greedy optimization of modularity [5]. This method runs in $O((M + N)N)$ time, or $O(N^2)$ on a sparse graph, which is substantially faster than the GN algorithm, so it can be used on larger networks.

Clauset et al. proposed the CNM algorithm [4], which performs the same greedy optimization as the algorithm of [5]. By exploiting some shortcuts in the optimization problem and using more sophisticated data structures, it runs far more quickly, in $O(dM \log N)$ time, where $d$ is the depth of the "dendrogram" describing the network's community structure [4]. For sparse networks ($M \sim N$) with a hierarchical structure, where $d \propto \log N$, the CNM algorithm has an essentially linear running time of $O(N \log^2 N)$. However, the CNM algorithm tends to produce super-communities that contain a large fraction of the nodes, even on synthetic networks that have no significant community structure. This tendency also slows the algorithm down considerably and makes it inapplicable to large networks with more than a million nodes. These undesired effects have been circumvented by tricks that balance the sizes of the communities being merged, which speeds up the running time and makes it possible to deal with networks of millions of nodes [9].

1432 JOURNAL OF NETWORKS, VOL. 8, NO. 6, JUNE 2013

© 2013 ACADEMY PUBLISHER, doi:10.4304/jnw.8.6.1432-1439

Ref. [10] pointed out that modularity optimization has a number of problems. Ref. [11] found that the modularity landscape is glassy and includes an exponentially growing number of local maxima, even though the corresponding partitions may differ considerably in topological structure. Ref. [12] concluded that modularity optimization has a resolution limit that leads to the systematic merging of small clusters into larger modules, even when the clusters are well defined and loosely connected to each other. The resolution limit can be addressed by adopting multiresolution modularity [13, 14].

From the above, modularity optimization is the most widely used approach to community detection. However, existing research focuses on undirected, unweighted graphs, whereas many real social networks are essentially weighted and their edges have different strengths. When relationships between nodes exist at multiple levels, a single unweighted edge is not enough to describe them. To fully describe the interactions between nodes in real networks, edge weights should be considered. Edge weights describe the similarity between nodes more accurately and provide more information for community detection.

Based on a unified treatment of undirected/directed and unweighted/weighted networks, this paper proposes a new community detection algorithm. By introducing edge directions and weights, a directed weighted modularity is used to detect the community structure. Thanks to this additional reference information, the accuracy and effectiveness of the algorithm are greatly improved. The method uses a heap data structure and a multi-task modular architecture to raise computational efficiency, so that the modularity optimization remains efficient on large-scale networks with millions of nodes. Finally, the experimental results are analyzed using the theories and models of complex networks, so as to uncover the connection relationships between nodes and reveal the inner structures and evolution patterns of social networks.

II. COMMUNITY MINING ALGORITHM

In this section, we propose our community detection algorithm for mining communities in directed weighted graphs (DWG). The algorithm finds high-modularity partitions of large networks in a short time and unfolds a complete hierarchical community structure for the network. In contrast to other community detection algorithms, the network size it can handle is limited by storage capacity rather than by computation time. The algorithm can identify communities in a network of 100 million nodes in only 176 minutes.

In the case of a DWG, the modularity is defined as

$$Q^{dw} = \frac{1}{w} \sum_{ij} \left[ W_{ij} - \frac{w_i^{in} w_j^{out}}{w} \right] \delta(c_i, c_j) \tag{1}$$

where $c_i$ is the community to which node $i$ is assigned, the $\delta$-function $\delta(u, v)$ is 1 if $u = v$ and 0 otherwise, $W_{ij}$ is the weight of the edge between nodes $i$ and $j$, $w_i^{in}$ and $w_j^{out}$ are the in-weighted and out-weighted degrees of the nodes, and $w = \sum_{ij} W_{ij} = \sum_i w_i^{in} = \sum_j w_j^{out}$ is the total edge weight of the directed network. Eq. (1) can be simplified as

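$$\begin{aligned}
Q^{dw} &= \frac{1}{w} \sum_{ij} \left[ W_{ij} - \frac{w_i^{in} w_j^{out}}{w} \right] \sum_v \delta(c_i, v)\, \delta(c_j, v) \\
&= \sum_v \left[ \frac{1}{w} \sum_{ij} W_{ij}\, \delta(c_i, v)\, \delta(c_j, v) - \frac{1}{w} \sum_i w_i^{in}\, \delta(c_i, v) \cdot \frac{1}{w} \sum_j w_j^{out}\, \delta(c_j, v) \right] \\
&= \sum_v \left( S^d_{vv} - r_v^{in} r_v^{out} \right)
\end{aligned} \tag{2}$$

where $S^d_{vv}$ is the fraction of the total directed edge weight that connects two nodes inside community $v$, and $r_v^{in}$ and $r_v^{out}$ are the fractions of the total directed edge weight entering and leaving the nodes of community $v$.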
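The equivalence of Eq. (1) and Eq. (2) can be checked numerically. The following is a minimal Python sketch, not the authors' code: the function names, the dictionary-based edge representation `{(i, j): weight}` and the requirement that every node appears in `community` are assumptions made for illustration. It follows the index convention of Eq. (1) as written, with the in-weighted degree taken at the first index and the out-weighted degree at the second.

```python
from collections import defaultdict

def modularity_eq1(edges, community):
    """Q^dw from Eq. (1): explicit sum over all node pairs in the same community."""
    w = sum(edges.values())                      # total edge weight
    w_in, w_out = defaultdict(float), defaultdict(float)
    for (i, j), weight in edges.items():         # edge i -> j
        w_out[i] += weight
        w_in[j] += weight
    q = 0.0
    for i in community:
        for j in community:
            if community[i] == community[j]:
                q += edges.get((i, j), 0.0) - w_in[i] * w_out[j] / w
    return q / w

def modularity_eq2(edges, community):
    """Q^dw from Eq. (2): sum over communities of S_vv - r_in * r_out."""
    w = sum(edges.values())
    s = defaultdict(float)                       # S^d_vv
    r_in, r_out = defaultdict(float), defaultdict(float)
    for (i, j), weight in edges.items():
        if community[i] == community[j]:
            s[community[i]] += weight / w
        r_out[community[i]] += weight / w
        r_in[community[j]] += weight / w
    return sum(s[v] - r_in[v] * r_out[v] for v in set(community.values()))
```

Eq. (2) needs only one pass over the edges, whereas the direct form of Eq. (1) loops over all node pairs, which is why the community-level form is the one worth using on large networks.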

The gain in modularity $\Delta Q^{dw}$ obtained by merging community $u$ with community $v$ can be computed as:

$$\begin{aligned}
\Delta Q^{dw} &= \left[ S^d_{vv} + S^d_{uu} + S^d_{uv} - (r_v^{in} + r_u^{in})(r_v^{out} + r_u^{out}) \right] - \left( S^d_{vv} - r_v^{in} r_v^{out} \right) - \left( S^d_{uu} - r_u^{in} r_u^{out} \right) \\
&= S^d_{uv} - r_v^{in} r_u^{out} - r_u^{in} r_v^{out}
\end{aligned} \tag{3}$$
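Eq. (3) can be verified against a direct recomputation of the modularity before and after a merge. The sketch below is illustrative only: the function names are hypothetical, and it assumes that $S^d_{uv}$ in Eq. (3) counts the edge weight running between $u$ and $v$ in both directions, which is what makes the closed form agree with the difference of the two modularity values.

```python
from collections import defaultdict

def community_fractions(edges, community):
    """Return (S, r_in, r_out). S[(u, v)] is the fraction of total weight on
    edges going from community u to community v."""
    w = sum(edges.values())
    S, r_in, r_out = defaultdict(float), defaultdict(float), defaultdict(float)
    for (i, j), weight in edges.items():
        u, v = community[i], community[j]
        S[(u, v)] += weight / w
        r_out[u] += weight / w
        r_in[v] += weight / w
    return S, r_in, r_out

def modularity(edges, community):
    """Q^dw as the community-level sum of Eq. (2)."""
    S, r_in, r_out = community_fractions(edges, community)
    return sum(S[(v, v)] - r_in[v] * r_out[v] for v in set(community.values()))

def merge_gain(edges, community, u, v):
    """Closed-form gain of Eq. (3) for merging communities u and v."""
    S, r_in, r_out = community_fractions(edges, community)
    s_uv = S[(u, v)] + S[(v, u)]      # assumption: both directions are counted
    return s_uv - r_in[v] * r_out[u] - r_in[u] * r_out[v]
```

Because the gain depends only on quantities attached to $u$ and $v$, a merge never changes the gain of a pair that involves neither of them, which is what makes a heap of candidate merges practical.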

The algorithm proceeds as follows.

1. Initialization. We start with each node as the sole member of its own community and initialize the modularity $Q^{dw} = 0$. We set $S^d_{uv} = W_{ij}/w$ if $u$ and $v$ are connected and zero otherwise, and $r_v = w_j/w$, where $\delta(c_i, u) = 1$ and $\delta(c_j, v) = 1$. Thus we initially set

$$\Delta Q^{dw}_{uv}\Big|_{0} = \frac{W_{ij}\, w - w_i w_j}{w^2} \tag{4}$$

2. Calculate the initial values of $\Delta Q^{dw}_{uv}$ and $r_v$ according to Eq. (4), and populate the max-heap $H$ with the largest element of each row of the matrix $\Delta Q^{dw}$.

3. Select the largest $\Delta Q^{dw}_{uv}$ from $H$, join the corresponding communities $u$ and $v$, label the combined community $v$, and update the matrix $\Delta Q^{dw}$, the heap $H$ and $r_v$.

   (1) Only a few elements of $\Delta Q^{dw}$ need to be adjusted: update row and column $v$, and remove row and column $u$. The update rules are as follows:

      (a) If community $k$ is connected to both $u$ and $v$, then

      $$\Delta Q'^{dw}_{vk} = \Delta Q^{dw}_{uk} + \Delta Q^{dw}_{vk} \tag{5}$$

      (b) If community $k$ is connected to $u$ but not to $v$, then

      $$\Delta Q'^{dw}_{vk} = \Delta Q^{dw}_{uk} - 2 r_v r_k \tag{6}$$

      (c) If community $k$ is connected to $v$ but not to $u$, then

      $$\Delta Q'^{dw}_{vk} = \Delta Q^{dw}_{vk} - 2 r_u r_k \tag{7}$$

   (2) Update the heap $H$ according to $\Delta Q^{dw}$.

   (3) Update $r_v$: $r'_v = r_u + r_v$; $r'_u = 0$.

   (4) Increase the modularity, $Q'^{dw} = Q^{dw} + \Delta Q^{dw}$, and record it.

4. Repeat step 3 until only one community remains.

Note that these equations imply that $Q^{dw}$ has a single peak over the course of the algorithm, since once the largest $\Delta Q^{dw}$ becomes negative, $Q^{dw}$ can only decrease. So when the peak value of $Q^{dw}$ appears, the algorithm should be stopped, and the partition at that point is taken as the best division.

The detailed implementation steps of the algorithm are described in another paper of ours, so here we only discuss the main theory and describe the algorithm in brief.
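The greedy procedure of steps 1 to 4 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name is hypothetical; merge gains are recomputed from Eq. (3) for pairs touching the merged community instead of applying the incremental rules (5) to (7); `heapq` with lazy invalidation (version stamps) stands in for the row-maximum max-heap of step 2; $S_{uv}$ is assumed to count weight in both directions; the running modularity starts from the value Eq. (2) gives for the all-singletons partition; and node labels are assumed to be mutually comparable (for example, all integers).

```python
import heapq
from collections import defaultdict

def greedy_directed_weighted(edges):
    """Greedy agglomeration on directed weighted edges {(i, j): weight}.
    Returns (labels, q): node -> community label, and the modularity reached."""
    edges = {(i, j): x for (i, j), x in edges.items() if i != j}  # drop self-loops
    w = sum(edges.values())
    r_in, r_out = defaultdict(float), defaultdict(float)
    link = defaultdict(dict)   # link[u][v]: weight fraction between u and v
    for (i, j), x in edges.items():
        r_out[i] += x / w
        r_in[j] += x / w
        link[i][j] = link[i].get(j, 0.0) + x / w
        link[j][i] = link[j].get(i, 0.0) + x / w
    members = {n: {n} for n in link}
    version = dict.fromkeys(members, 0)   # bumped whenever a community changes

    def gain(u, v):                       # Eq. (3)
        return link[u][v] - r_in[v] * r_out[u] - r_in[u] * r_out[v]

    heap = []
    def push(u, v):                       # heapq is a min-heap, so negate
        heapq.heappush(heap, (-gain(u, v), u, v, version[u], version[v]))

    for u in list(link):
        for v in link[u]:
            if u < v:
                push(u, v)

    q = -sum(r_in[n] * r_out[n] for n in members)   # Eq. (2), singletons
    while heap:
        neg_gain, u, v, ver_u, ver_v = heapq.heappop(heap)
        if u not in members or v not in members:
            continue                      # one side was merged away
        if version[u] != ver_u or version[v] != ver_v:
            continue                      # stale entry
        if -neg_gain <= 0:
            break                         # peak of Q reached
        q -= neg_gain
        members[v] |= members.pop(u)      # merge u into v, keep label v
        r_in[v] += r_in[u]
        r_out[v] += r_out[u]
        for k, x in link.pop(u).items():
            del link[k][u]
            if k != v:
                link[v][k] = link[v].get(k, 0.0) + x
                link[k][v] = link[k].get(v, 0.0) + x
        version[v] += 1
        for k in link[v]:
            push(v, k)
    labels = {n: c for c, nodes in members.items() for n in nodes}
    return labels, q
```

On a toy graph made of two tightly linked triples joined by one weak edge, the sketch stops with the two triples as communities. Lazy invalidation keeps the heap logic short at the cost of leaving stale entries in the heap until they are popped.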

III. APPLICATION IN LARGE NETWORKS

We collected the mail logs of a campus mail system from Jan 1 to Dec 31, 2011. A large amount of spam circulates on the Internet and inevitably disturbs data mining research on human social behavior, so for the integrity and reliability of the data we extract only the mail logs of local users. To protect users' privacy, we replace each mail address with a numeric ID. The logs from the mail server comprise 5,435 local mail users and 1,400,740 mails in 2011, so each user sends 257.73 mails per year on average.

Definition 1. A mail network is a weighted graph $G = (V, E, W)$, where $V$ is the set of nodes, $E$ is the set of edges, and $W$ is the weight function on edges. A node $v \in V$ represents a mail address (person name or email address). An edge $e \in E$, $e = [v_i, v_j]$, represents that member $v_i$ sent a mail to member $v_j$. The connection strength $W_{ij}$ is the number of mails that member $v_i$ sent to member $v_j$. For convenience, we use $E_{ij}$ to denote the edge $[v_i, v_j]$.

A mail network can be seen as a directed graph: member $v_i$ sends a mail to member $v_j$, but $v_j$ may not send a mail to $v_i$. Edge directions therefore have different influences on the importance of nodes, so we measure the importance of a node with its directed weighted degree.

Definition 2. The out-weighted degree of node $v_i$ is the total weight of the edges leaving it, $w_i^{out} = \sum_j W_{ij}$; the in-weighted degree is the total weight of the edges entering it, $w_i^{in} = \sum_j W_{ji}$; and the weighted degree is defined as $w_i = w_i^{in} + w_i^{out}$.

We remove system administrator accounts such as "admin", "webmaster", "mail-Daemon" and "emdg-daemon", which regularly send mass mails to all local domain users. We also remove self-loop edges such as $E_{ii}$ in the undirected graph. As shown in Fig. 1, the resulting mail data set consists of 4,368 active nodes, 49,886 undirected edges with weights ranging from 1 to 1625, and 77,936 directed edges with weights ranging from 1 to 876 over one year.

(a) Undirected-Unweighted (b) Directed-Unweighted

(c) Undirected-Weighted (d) Directed-Weighted

Figure 1. System Architecture Diagram
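Definitions 1 and 2 and the preprocessing described above can be sketched as follows. This is an illustration under assumptions rather than the authors' pipeline: the log is assumed to be an iterable of `(sender, recipient)` pairs, the function names are hypothetical, self-loops are dropped from both graph variants, and the undirected weight of a pair is assumed to be the sum of the two directed weights (the text does not state how the undirected weights were formed).

```python
from collections import Counter

# Administrator accounts named in the text.
ADMIN_ACCOUNTS = {"admin", "webmaster", "mail-Daemon", "emdg-daemon"}

def build_mail_network(log):
    """Return (directed, undirected) weight tables from (sender, recipient) pairs.
    directed[(i, j)] is W_ij, the number of mails i sent to j."""
    directed = Counter()
    for sender, recipient in log:
        if sender == recipient:
            continue                                   # self-loop
        if sender in ADMIN_ACCOUNTS or recipient in ADMIN_ACCOUNTS:
            continue                                   # mass-mailing accounts
        directed[(sender, recipient)] += 1
    undirected = Counter()
    for (i, j), weight in directed.items():
        undirected[frozenset((i, j))] += weight        # assumption: W_ij + W_ji
    return directed, undirected

def weighted_degrees(directed):
    """Definition 2: in-weighted, out-weighted and total weighted degree."""
    w_out, w_in = Counter(), Counter()
    for (i, j), weight in directed.items():
        w_out[i] += weight
        w_in[j] += weight
    total = {n: w_in[n] + w_out[n] for n in set(w_in) | set(w_out)}
    return w_in, w_out, total
```

Keeping the directed table as the primary structure lets all four network types of Fig. 1 be derived from one pass over the log.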

To verify the effectiveness and efficiency of our algorithm, we applied it to the real directed weighted mail network and compared its performance across the four network types. Fig. 2 shows the community assignments of the mail network in the four network types, drawn with the Yifan Hu layout algorithm [15]; different colors stand for different communities. The experiments were conducted on a PC running Windows XP with a Pentium Dual-Core 2.5 GHz CPU and 2 GB of memory.


(a) Undirected-Unweighted (b) Directed-Unweighted

(c) Undirected-Weighted (d) Directed-Weighted

Figure 2. Community partition graph

The modularity described above is the metric used to evaluate the quality of a graph partition. The larger the modularity $Q$, the better the partition summarizes the overall community structure [16]. For partitions of practical interest $Q$ lies between 0 and 1, and a value larger than 0.3 indicates a remarkable community structure [5]. The optimal modularity of real networks usually lies between 0.3 and 0.7.

TABLE 1. COMPARISON OF THE PARTITION RESULTS IN DIFFERENT NETWORK TYPES

| Type | Nodes | Edges | Modularity | Number of Communities |
|---|---|---|---|---|
| Undirected-Unweighted | 4368 | 49886 | 0.615 | 23 |
| Directed-Unweighted | 4368 | 77936 | 0.609 | 70 |
| Undirected-Weighted | 4368 | 49886 | 0.797 | 72 |
| Directed-Weighted | 4368 | 77936 | 0.769 | 71 |

As shown in Table 1, the modularity differs across network types. The lowest value (0.609) appears in the directed-unweighted type, and the peak value (0.797) appears in the undirected-weighted type, which indicates that the strongest community structure appears in this type. With weights, the modularity increases from about 0.6 to above 0.75. We can therefore conclude that introducing weights increases the modularity, which means it improves the community detection performance of the algorithm. We also notice that, in the weighted case, the modularity of the directed type (0.769) is lower than that of the undirected type (0.797). This suggests that the larger number of edges disturbs the community partition, so edge directions do not improve the detection performance here.

[Figure 3, panel (a): number of vertices in each community versus community size rank. Panel (b): cumulative number of vertices (%) versus community size rank. Both panels plot the Undirected-Unweighted, Directed-Unweighted, Undirected-Weighted and Directed-Weighted types.]

Figure 3. Community partitions for the mail network

We compare the partition results of the four network types in detail. Fig. 3 shows the community partition details: the Y axis represents the number of nodes inside each community, and the X axis represents the community size rank.

In the weighted types, the partition results for the directed and undirected graphs are similar. In the unweighted types, we find super-communities, i.e., communities that contain a large fraction of the nodes. The biggest community has 624 (14.28%) nodes in the undirected-unweighted type and 712 (16.30%) nodes in the directed-unweighted type. By contrast, the biggest community accounts for only about 10% of all nodes in the weighted types. So the use of weights can avoid the super-community problem.

In summary, the use of weights helps to increase the partition performance and accuracy of our algorithm, whereas the use of directions does not improve the partition performance.

TABLE 2. COMPARISON OF THE EXPERIMENT RESULTS OF DIFFERENT ALGORITHMS

| Algorithm | Run time | Modularity | Number of Communities |
|---|---|---|---|
| GN | / | / | / |
| CNM | 47.69 s | 0.543 | 67 |
| SNCD (directed-weighted) | 6.35 s | 0.769 | 71 |

In Table 2, we compare our algorithm with two typical community detection algorithms, GN and CNM. Owing to its high time complexity, the GN algorithm is not suitable for large-scale networks: the campus mail network has 4368 nodes and 49886 edges, and the GN algorithm fails to produce a result in our experimental environment. Because the true community structure of the campus mail network is unknown, we use modularity to measure the quality of community detection. Compared with the CNM algorithm, our algorithm is both more effective (higher modularity) and more efficient (shorter run time).


IV. FEATURE ANALYSIS IN REAL MAIL NETWORK

It is interesting to compare the basic statistics of the four network types. Table 3 collects a few statistical features of the four types.

TABLE 3. COMPARISON OF STATISTICAL FEATURES

| Feature | Undirected-Unweighted | Directed-Unweighted | Undirected-Weighted | Directed-Weighted |
|---|---|---|---|---|
| Avg. Degree | 22.842 | 17.842 | 22.842 | 17.842 |
| Avg. Weighted Degree | / | / | 297.897 | 155.515 |
| Avg. Clustering Coefficient | 0.407 | 0.244 | 0.407 | 0.244 |
| Avg. Path Length | 3.282 | 3.36 | 3.282 | 3.36 |
| Network Diameter | 8 | 8 | 8 | 8 |
| Graph Density | 0.05 | 0.04 | 0.05 | 0.04 |

A. Node Degree Distribution

The degree of a node, as defined in Eq. (8), is the number of edges directly connected to it. The degree is one of the measures of node centrality in networks: the higher the degree, the more important the node.

$$k_i = \sum_{j=1}^{N} a_{ij} \tag{8}$$

where $a_{ij}$ is the corresponding element of the adjacency matrix.

Direction is one of the important features of mail communication. In directed networks, the in-degree and out-degree of a node are its numbers of incoming and outgoing links:

$$k_i^{in} = \sum_{j=1}^{N} a_{ji}; \quad k_i^{out} = \sum_{j=1}^{N} a_{ij}; \quad k_i = k_i^{in} + k_i^{out} \tag{9}$$

The average degree is obtained from the total number of edges $M$ and the number of nodes $N$ [17]. It shows how many connections, on average, each user has to other unique users in the mail network. The higher the average degree, the more densely the users are connected. The average degree of the mail network is $K = 2M/N = 22.84$ in the undirected type and $K = M/N = 17.84$ in the directed type. These results agree with the statistical features in Table 3.
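As a quick check of these figures, assuming $M$ and $N$ are the edge and node counts reported for the data set in Section III (49,886 undirected edges, 77,936 directed edges, 4,368 nodes):

$$K_{\text{undirected}} = \frac{2M}{N} = \frac{2 \times 49886}{4368} \approx 22.84, \qquad K_{\text{directed}} = \frac{M}{N} = \frac{77936}{4368} \approx 17.84$$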

The most basic topological feature of a graph $G$ can be obtained in terms of the degree distribution $P(k)$, defined as the probability that a node chosen at random has degree $k$, or equivalently as the fraction of nodes in the graph having degree $k$. It can shed light on how the network came into existence. The degree distribution of the nodes is close to a power-law distribution $P(k) \sim k^{-\alpha}$ (see Fig. 4, where the X axis is the degree and the Y axis is the degree frequency). In the undirected type, $\alpha = 0.7875$ for degree. In the directed type, $\alpha = 0.7931$ for in-degree, $\alpha = 0.7016$ for out-degree, and $\alpha = 0.8237$ for degree. The degree distributions of the different types are similar: the frequency decays as the degree increases. The connectivity of the mail network is therefore moderate: most nodes have few edges, while a small but significant number of nodes act as centers with a large number of edges.
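An exponent of this kind can be estimated from a list of node degrees as sketched below. Note the swapped-in technique: the sketch fits a straight line to the log-log frequencies by ordinary least squares, which is a simpler method than the nonlinear fit of $y = a x^b$ reported in Fig. 4 and will generally give somewhat different values on real data. The function names are hypothetical.

```python
import math
from collections import Counter

def degree_frequencies(degrees):
    """Map each degree k >= 1 to the number of nodes that have it."""
    return Counter(k for k in degrees if k >= 1)

def fit_power_law(freq):
    """Least-squares line through (log k, log frequency).
    Returns (a, alpha) such that frequency ~ a * k**(-alpha)."""
    xs = [math.log(k) for k in freq]
    ys = [math.log(c) for c in freq.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope
```

The same helper applies unchanged to strength or edge weight frequencies, since it only needs a mapping from value to count with at least two distinct positive values.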

[Figure 4: log-log plots of frequencies $P(k)$ versus degree $k$, each with an Allometric1 fit $y = a x^b$.
(a) Degree in undirected type: a = 423.55454 (standard error 21.05598), b = -0.78754 (standard error 0.02664), adj. R-square 0.93836.
(b) In-degree in directed type: a = 456.49112 (standard error 21.64411), b = -0.79314 (standard error 0.02554), adj. R-square 0.94483.
(c) Out-degree in directed type: a = 328.22172 (standard error 15.47435), b = -0.70164 (standard error 0.02338), adj. R-square 0.94297.
(d) Degree in directed type: a = 404.86487 (standard error 12.77772), b = -0.82365 (standard error 0.01604), adj. R-square 0.95616.]

Figure 4. Node degree distribution

B. Node Strength (Weighted Degree) Distribution

The degree distribution alone does not reveal the activity or importance of network nodes. For example, one user has 1,000 contacts but exchanges only a single mail with each of them, while another has 100 contacts but exchanges many mails with each of them every day. The latter user is more active or important than the former, so the weighted degree better describes the activity or importance of a node. Table 4 shows that the node IDs of the top seven active nodes differ when they are ranked by degree and by weighted degree. After examining the node ID dictionary, we find that the top node by degree, node ID 1019, is a mailbox for campus alumni, which exchanges mails with many alumni but only a few mails with each. Node ID 100, the top node by weighted degree, is a teacher's mailbox, which exchanges a large number of mails with a few contacts.

TABLE 4. THE TOP 7 NODES MEASURED BY DEGREE AND WEIGHTED DEGREE

| Measure | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Degree | 1019 | 3435 | 2804 | 1000 | 1071 | 2824 | 1931 |
| Weighted degree | 100 | 2341 | 148 | 1310 | 407 | 1660 | 1860 |

In a weighted network, we use a different feature to measure node importance. The node weighted degree, or strength, $s_i$ is defined as

$$s_i = \sum_{j \in N(v_i)} a_{ij} w_{ij} \tag{10}$$

where $N(v_i)$ is the set of neighbors of $v_i$. The strength of a mail user represents the total number of mails it handles. The strength distribution $R(s)$ measures the probability that a node has strength $s$. Together with the degree distribution $P(k)$, it provides useful information about a weighted network. In particular, since the node strength is related to the node degree, we can expect $R(s)$ to be heavy-tailed in weighted networks where $P(k)$ decays slowly.

[Figure 5: log-log plots of frequencies $R(s)$ versus strength $s$.
(a) Undirected type, with an Allometric1 fit $y = a x^b$: a = 294.11603 (standard error 25.70999), b = -0.94195 (standard error 0.03725), adj. R-square 0.85539.
(b) Directed type: directed strength, in-strength and out-strength.]

Figure 5. Node strength distribution

Fig. 5 shows the strength distribution, which is close to the power-law distribution $R(s) \sim s^{-\alpha}$ in the undirected type. Fig. 6 shows the correlation between the degree $k$ and the average strength $S(k)$ of nodes with that degree. We find that $S(k)$ increases with $k$ as $S(k) \sim k^{\beta}$, with $\beta = 1.1715$ in the undirected mail network and $\beta = 1.0963$ in the directed one. The exponents imply that node strength is strongly correlated with degree. For the weighted mail network, the general trend is that nodes with large degree have large strength; however, this trend no longer holds once the node degree exceeds a certain threshold. As illustrated in Fig. 6, the threshold for the mail network is 100.
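The quantity $S(k)$ can be computed from the weighted edge list as sketched below. This is an illustration for the undirected case only; the function name and the input layout (each undirected edge listed once, no self-loops) are assumptions.

```python
from collections import defaultdict

def average_strength_by_degree(weights):
    """weights: {(i, j): w_ij}, each undirected edge listed once, no self-loops.
    Returns {k: mean strength of the nodes that have degree k}."""
    degree, strength = defaultdict(int), defaultdict(float)
    for (i, j), w_ij in weights.items():
        for node in (i, j):
            degree[node] += 1          # one more neighbor
            strength[node] += w_ij     # Eq. (10): sum of incident edge weights
    total, count = defaultdict(float), defaultdict(int)
    for node, k in degree.items():
        total[k] += strength[node]
        count[k] += 1
    return {k: total[k] / count[k] for k in total}
```

Fitting $\beta$ then amounts to a power-law fit of the returned values against $k$; an exponent above 1 means strength grows faster than degree.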

[Figure 6: log-log plots of average strength $S(k)$ versus degree $k$, each with an Allometric1 fit $y = a x^b$.
(a) Undirected type: a = 6.75412 (standard error 3.22144), b = 1.17147 (standard error 0.11063), adj. R-square 0.72928.
(b) Directed type: a = 5.63964 (standard error 2.62784), b = 1.09625 (standard error 0.09878), adj. R-square 0.65448.]

Figure 6. Average strength $S(k)$ of nodes with degree $k$

C. Edge Weight Distribution

In the weighted network, each edge has a weight representing its communication strength, as defined in Definition 1. Fig. 7 shows the edge weight distribution of the weighted mail network; the X axis is the edge weight and the Y axis is the edge weight frequency. The distribution can be approximated by the power law $P(W) \sim W^{-\gamma}$, with $\gamma = 1.2138$ in the undirected mail network and $\gamma = 1.3985$ in the directed mail network.

Figure 7. Edge weight distribution: (a) undirected type; (b) directed type. Axes: edge weight W versus frequency P(W), both on logarithmic scales. Power-law fits y = a*x^b: undirected type, a = 15394.2728 (standard error 120.485), b = -1.21381 (standard error 0.01129), adjusted R-square 0.99241; directed type, a = 30086.1575 (standard error 79.59027), b = -1.3985 (standard error 0.00513), adjusted R-square 0.99909.

D. Clustering Coefficient

Clustering, also known as transitivity, is a typical feature of acquaintance networks, where two individuals with a common friend are likely to know each other. In terms of a generic graph $G$, transitivity means the presence of a large number of triangles. This can be quantified by defining the transitivity $T$ of the graph as the relative number of transitive triples, i.e. the fraction of connected triples of nodes (triads) that also form triangles:

$$T = \frac{3 \times (\text{\# of triangles in } G)}{\text{\# of connected triples of nodes in } G} \qquad (11)$$

An alternative is to use the graph clustering coefficient, defined as follows. The clustering coefficient $C_i$ of a node $v_i$ is the ratio of the number of links shared by its neighboring nodes to the maximum possible number of links among them. In other words, the clustering coefficient of a node $v$ (of degree at least 2) is the probability that two randomly chosen neighbors of $v$ are linked together. It is computed by dividing the number of triangles containing $v$ by the number of possible edges between its neighbors [18]:

$$C_i = \frac{2E_i}{k_i (k_i - 1)} \qquad (12)$$

where $E_i$ is the number of edges shared by the neighboring nodes of $v_i$, and $k_i$ is the degree of $v_i$.
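As an illustration of Eqs. (11) and (12), the following minimal Python sketch computes the local clustering coefficient and the transitivity of a small undirected, unweighted graph stored as a dictionary of neighbor sets. The representation, the function names and the toy graph are assumptions made for this example; edge direction and weight are ignored.

```python
from itertools import combinations

def linked_neighbor_pairs(adj, v):
    """E_i: number of edges among the neighbors of v."""
    return sum(1 for u, w in combinations(adj[v], 2) if w in adj[u])

def local_clustering(adj):
    """C_i = 2 * E_i / (k_i * (k_i - 1)) for every node of degree at least 2."""
    coeffs = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k >= 2:
            coeffs[v] = 2.0 * linked_neighbor_pairs(adj, v) / (k * (k - 1))
    return coeffs

def transitivity(adj):
    """T = 3 * (# of triangles) / (# of connected triples)."""
    # Summing E_i over all centre nodes counts every triangle three times,
    # so the sum already equals 3 * (# of triangles).
    closed = sum(linked_neighbor_pairs(adj, v) for v in adj)
    triples = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    return closed / triples if triples else 0.0

# Hypothetical toy graph: triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
coeffs = local_clustering(adj)
T = transitivity(adj)
```

Node d has degree 1 and is therefore excluded from the local coefficients, in line with the "degree at least 2" condition.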

Fig. 8 shows the scaling of $C(K)$ with $K$ for the undirected and directed mail networks. Previous studies show that the average clustering coefficient decreases significantly as $K$ increases. Fig. 8 does not follow this law completely: for $K \le 100$, $C(K)$ changes little, whereas for $K > 100$ it falls rapidly. This is similar to the behavior reported by Boguñá and Tsai [19, 20].

Figure 8. Average clustering coefficient $C(K)$ of nodes with degree $k$: (a) undirected type; (b) directed type. Axes: degree k versus average clustering coefficient c(k), both on logarithmic scales. Power-law fits y = a*x^b: undirected type, a = 3.63989 (standard error 0.65125), b = -0.6012 (standard error 0.04071), adjusted R-square 0.61685; directed type, a = 4.52398 (standard error 1.93492), b = -0.62776 (standard error 0.08385), adjusted R-square 0.29044.

V. TOPOLOGY OF THE REAL MAIL NETWORK

Most real social networks are characterized by certain topological features, for instance relatively short path lengths, high clustering coefficients, fat-tailed degree distributions, degree correlations, and the presence of motifs and community structures. These features make real social networks radically different from regular graphs and random graphs. Modeling social networks can help us understand the evolution mechanisms of the network topology.

A. Small-world Network

The small-world property is a well-known mathematical property of some network models. In contrast to random graphs, the small-world property in real networks is often associated with the presence of clustering, quantified by the clustering coefficient [21]. For this reason, Watts and Strogatz defined a small-world network as a network having both a small average path length $L$, like random graphs, and a high clustering coefficient $C$, like regular graphs [22].

The clustering coefficient of a graph is the average of the clustering coefficients of all nodes (of degree at least 2). The average clustering coefficient measures the global density of interconnected nodes in the network. It is defined as

$$C = \frac{1}{N} \sum_{i=1}^{N} C_i \qquad (13)$$

The clustering coefficient measures local cohesiveness by taking into account the interaction intensity found on local triplets. The undirected and directed clustering coefficients can be compared (the normalization ensures that $0 \le C_i \le 1$) to assess the accumulation of traffic on interconnected triplets. The average clustering coefficient is $C = 0.407$ in the undirected type and $C = 0.244$ in the directed type, more than an order of magnitude higher than that of a random network of the same scale, $C_{rand} \sim K/N = 0.005$, where $K$ is the average degree. We can conclude that the mail network has a much higher clustering coefficient than a random network.

We found an average path length of $L = 3.282$ in the undirected type and $L = 3.36$ in the directed type, which is of the same order as that of a random network of the same scale, $L_{rand} \sim \ln N / \ln K = 2.6792$. These two features indicate that the real mail network is a small-world network.
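The comparison with random-graph baselines can be sketched in a few lines of Python. The sketch uses the estimates $C_{rand} \sim K/N$ and $L_{rand} \sim \ln N / \ln K$ quoted above; the function names, the decision thresholds `c_factor` and `l_factor`, and the toy values of `n_nodes` and `avg_degree` are hypothetical choices for illustration, not quantities taken from the mail network.

```python
import math

def random_graph_baselines(n_nodes, avg_degree):
    """Random-graph estimates: C_rand ~ K / N and L_rand ~ ln N / ln K."""
    c_rand = avg_degree / n_nodes
    l_rand = math.log(n_nodes) / math.log(avg_degree)
    return c_rand, l_rand

def looks_small_world(c, l, n_nodes, avg_degree, c_factor=10.0, l_factor=2.0):
    """Heuristic check: clustering much higher than random, path length comparable.

    c_factor and l_factor are illustrative thresholds (assumptions of this sketch).
    """
    c_rand, l_rand = random_graph_baselines(n_nodes, avg_degree)
    return c >= c_factor * c_rand and l <= l_factor * l_rand

# Hypothetical network size and average degree.
c_rand, l_rand = random_graph_baselines(n_nodes=1000, avg_degree=10)
clustered = looks_small_world(c=0.4, l=3.3, n_nodes=1000, avg_degree=10)
random_like = looks_small_world(c=0.012, l=3.3, n_nodes=1000, avg_degree=10)
```

With these toy inputs the first network passes the check and the second does not, because its clustering coefficient is barely above the random baseline.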

B. Scale-free Network

A large amount of research on the topological characterization of real networks focuses on constructing graphs with power-law degree distributions. Such graphs can be obtained simply as special cases of random graphs with a given degree distribution.

The distributions in Figs. 4, 5 and 7 all follow a power law of the form $P(k) \sim A k^{-\alpha}$. The real mail network can therefore be called a scale-free network [23], because a power law has the same functional form at all scales [1]. The highly inhomogeneous degree distribution of the mail network means that a few nodes (the hubs) have links to many other nodes, while most nodes have only a few links.

In finite-size networks, fat-tailed distributions have natural cut-offs [24]. In the mail network, this occurs because the data carry rather strong intrinsic noise due to the finite size of the sample. When the size of the system is small and the degree distribution $P(k)$ is heavy-tailed, it is sometimes advisable to measure the cumulative degree distribution $P_{cum}(k)$, defined as $P_{cum}(k) = \sum_{k'=k}^{\infty} P(k')$. Fig. 9 shows the cumulative degree distribution of the undirected and directed mail networks, with $\alpha_{cum} = 0.2177 / 0.1792$. The scaling exponent $\alpha_{cum}$ is related to the exponent $\alpha$ of $P(k)$ by $\alpha_{cum} = \alpha - 1$.
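A minimal Python sketch of the cumulative degree distribution is given below. It returns, for each observed degree $k$, the fraction of nodes with degree at least $k$; multiplying by the number of nodes would give cumulative frequencies instead. The function name and the toy degree sequence are hypothetical.

```python
from collections import Counter

def cumulative_degree_distribution(degrees):
    """Return {k: P_cum(k)}, where P_cum(k) is the fraction of nodes with degree >= k."""
    n = len(degrees)
    counts = Counter(degrees)
    cum = {}
    running = 0
    # Accumulate from the largest degree downwards so that each entry
    # sums P(k') over all k' >= k.
    for k in sorted(counts, reverse=True):
        running += counts[k]
        cum[k] = running / n
    return dict(sorted(cum.items()))

# Hypothetical degree sequence (not the mail network).
degrees = [1, 1, 1, 1, 2, 2, 3, 5]
p_cum = cumulative_degree_distribution(degrees)
```

Because every point of the cumulative curve aggregates the whole tail above it, the curve is much less noisy at large $k$ than the raw histogram, which is why it is preferred for small, heavy-tailed samples.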

Figure 9. Cumulative degree distribution: (a) undirected type; (b) directed type. Axes: degree k versus cumulative frequency P(>k), both on logarithmic scales. Power-law fit y = a*x^b for the undirected type: a = 15004.57594 (standard error 365.24471), b = -0.21765 (standard error 0.01231), adjusted R-square 0.93902.

C. Hierarchical Organization

Hierarchy is a fundamental feature of many complex systems. The coexistence of a high degree of clustering and a scale-free topology generates a hierarchical network. The mail network therefore has a hierarchical organization, which implies that small groups of nodes organize into increasingly large groups in a hierarchical manner, while the network also maintains a scale-free topology. In hierarchical networks, the degree of clustering characterizing the different groups follows a strict scaling law, which can be used to identify the presence of a hierarchical organization in real networks.

Ref. [25] pointed out that hierarchical networks are expected to show a non-trivial power-law decay of $C(K)$ as a function of $K$, which means that low-degree nodes belong to interconnected communities. Many real networks have been found to have such non-trivial decay [25-27]. From Fig. 8, we find that in the mail network $C(K)$ shows power-law decay with $C(K) \sim K^{-\alpha}$. The mail network therefore has an inherent hierarchical organization in its architecture. This is an interesting result, since it has been found [25] that some infrastructure networks constrained by “geographical organization”, such as the Internet at the router level and the power grid of the Western United States, do not show such hierarchical organization.

VI. CONCLUSIONS

Social networks describe the relations between members of society. With the rapid development of computer network technology, emerging Internet-based social networks such as e-mail networks, peer-to-peer networks and blog networks have developed greatly and have had a profound influence on the behavior patterns of human society. Research on social networks is therefore of both theoretical significance and practical value. Since social networks are often large in scale and complex in structure, we usually use the theories and models of complex networks to study and measure them.

In this paper, we propose a solution for detecting weighted and directed communities in a large mail network, based on the modularity optimization method. Our algorithm, which uses a flexible multi-task architecture, has a considerable speed advantage over previous algorithms and brings new possibilities for working with large complex data sets. The algorithm is tested on a real large mail network, and the visualization module produces valuable visual results. The features of the mail communication network under different weight and direction types are analyzed in detail. The weight metrics allow us to characterize statistical features such as link and node strength. We analyze quantities of the campus mail network such as the node degree distribution, the edge weight distribution, the node strength distribution, and the correlation between degree and several average statistical indexes. We also calculate the clustering coefficients of nodes. These results give us clues for understanding the mail network topology on the basis of complex network models.

ACKNOWLEDGMENT

This work was sponsored by the National Natural Science Foundation of China (Grant No. 61133016) and the National High Technology Research and Development Program (“863” Program) of China (Grant No. 2011AA010706).

REFERENCES

[1] R. Albert and A. L. Barabási, "Statistical mechanics of complex networks", Reviews of Modern Physics, Maryland, vol. 74, no. 1, pp. 47-101, 2002.

[2] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks", Proceedings of the National Academy of Sciences, Washington, DC, vol. 99, no. 12, pp. 78-21, 2002.

[3] F. Radicchi, C. Castellano, "Defining and identifying communities in networks", Proceedings of the National Academy of Sciences, Washington, DC, vol. 101, no. 9, pp. 26-58, 2004.

[4] A. Clauset, M. E. J. Newman, "Finding community structure in very large networks", Physical Review E, Maryland, vol. 70, no. 6, pp. 066-111, 2004.

[5] M. E. J. Newman, "Fast algorithm for detecting community structure in networks", Physical Review E, Maryland, vol. 69, no. 6, pp. 066-133, 2004.

[6] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks", Physical Review E, Maryland, vol. 69, no. 2, pp. 026-113, 2004.

[7] B. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs", Bell System Technical Journal, Bell, vol. 5, no. 4, pp. 67-81, 1970.

[8] M. Fiedler, "Algebraic connectivity of graphs", Czechoslovak Mathematical Journal, Heidelberg, vol. 23, no. 2, pp. 298-305, 1973.

[9] B. Ball, B. Karrer, "An efficient and principled method for detecting communities in networks", Physical Review E, Maryland, vol. 84, no. 3, pp. 036-103, 2011.

[10] A. Lancichinetti and S. Fortunato, "Limits of modularity maximization in community detection", Physical Review E, Maryland, vol. 84, no. 6, pp. 066-122, 2011.

[11] B. H. Good, Y.-A. de Montjoye, "Performance of modularity maximization in practical contexts", Physical Review E, Maryland, vol. 81, no. 4, pp. 046-106, 2010.

[12] S. Fortunato and M. Barthelemy, "Resolution limit in community detection", Proceedings of the National Academy of Sciences, Washington, DC, vol. 104, no. 1, pp. 36-41, 2007.

[13] D. S. Bassett, M. A. Porter, "Robust detection of dynamic community structure in networks", arXiv preprint arXiv:1206.4358, 2012.

[14] J. W. Berry, B. Hendrickson, "Tolerating the community detection resolution limit with edge weighting", Physical Review E, Maryland, vol. 83, no. 5, pp. 056-119, 2011.

[15] Y. Hu, "Algorithms for Visualizing Large Networks", Combinatorial Scientific Computing, vol. 5, no. 3, pp. 180-186, 2011.

[16] D. Duan, Y. Li, "Community mining on dynamic weighted directed graphs", in Proceedings of the 1st ACM international workshop on Complex networks meet information & knowledge management, London, pp. 11-18, 2009.

[17] M. Newman, Networks: an introduction, Oxford University Press, England, 2010.

[18] M. Latapy, "Main-memory triangle computations for very large (sparse (power-law)) graphs", Theoretical Computer Science, Elsevier, vol. 407, no. 1, pp. 458-473, 2008.

[19] M. Boguñá, R. Pastor-Satorras, "Models of social networks based on social distance attachment", Physical Review E, Maryland, vol. 70, no. 5, pp. 056-122, 2004.

[20] Y. Tsai and C. C. Lin, "Characteristics of Weighted Email Communications", in Proceedings of Fifth International Conference on Information, Communications and Signal Processing, Bangkok, pp. 399-403, 2005.

[21] G. Dán and N. Carlsson, "Power-law revisited: large scale measurement study of P2P content popularity", in Proceedings of the 9th international conference on Peer-to-peer systems, California, pp. 12-16, 2010.

[22] D. Watts and S. Strogatz, "The small world problem", Collective Dynamics of Small-World Networks, NASA, vol. 393, pp. 440-442, 1998.

[23] G. Caldarelli, Scale-Free Networks: complex webs in nature and technology, OUP Catalogue, England, 2011.

[24] L. A. N. Amaral, A. Scala, "Classes of small-world networks", Proceedings of the National Academy of Sciences, Washington, DC, vol. 97, no. 21, pp. 11149-11152, 2000.

[25] E. Ravasz and A. L. Barabási, "Hierarchical organization in complex networks", Physical Review E, Maryland, vol. 67, no. 2, pp. 026-112, 2003.

[26] E. Ravasz, A. L. Somera, "Hierarchical organization of modularity in metabolic networks", Science, Washington, DC, vol. 297, no. 5586, pp. 1551-1563, 2002.

[27] A. Vázquez, R. Pastor-Satorras, "Large-scale topological and dynamical properties of the Internet", Physical Review E, Maryland, vol. 65, no. 6, pp. 066-130, 2002.
