[IEEE 2014 International Conference on Information Networking (ICOIN) - Phuket, Thailand (2014.02.10-2014.02.12)] The International Conference on Information Networking 2014 (ICOIN2014)

IASM: An Integrated Attribute Similarity for Complex Networks Generation

Bassant E. Youssef Bradley Department of Electrical and

Computer Engineering, Virginia Tech

Email: [email protected]

Hoda M. Hassan Bradley Department of Electrical and

Computer Engineering, Virginia Tech

Email: hmhassan@vt.edu

Abstract— Complex networks arise in many real-life disciplines. They are characterized by a scale-free power-law degree distribution, a small average path length (the small-world phenomenon), a high average clustering coefficient, and the emergence of community structure. Most proposed complex-network models do not incorporate all four of these common properties, and they also neglect the heterogeneous nature of network nodes. In this paper, we propose two generation models for heterogeneous complex networks, both variants of the Integrated Attribute Similarity Model (IASM). IASM uses preferential attachment to connect nodes based on their attribute similarity integrated with each node's structural popularity (normalized degree or eigenvector centrality). The proposed IASM models are then modified with a triad formation step to increase their clustering coefficient.

Keywords—Complex network modelling; BA model; preferential attachment; Heterogeneous nodes

I. INTRODUCTION

Complex networks are observed in numerous fields such as the Internet, the World Wide Web (WWW), social networks, food web networks, and many other disciplines [1]. The study and analysis of data extracted from complex networks revealed a number of distinct features and behavioural patterns that distinguish these networks. Solid awareness of these features can lead to an improved understanding of a network's structure and dynamics. Such knowledge can be used in different fields, for example to enhance decision-making strategies for network management and resource allocation, to identify the mediators of disease transmission in sexual networks, to predict future connections between websites in the WWW, and to identify critical nodes or links in power grid networks. Accordingly, mathematical models that can faithfully mimic the structure, dynamics and evolution of complex networks are of paramount importance. To create such models, researchers use advanced computing capabilities to analyse large real-world databases and identify the essential properties for modelling complex networks [2, 3].

The properties observed in complex networks are: the small-world effect, a high average clustering coefficient, a scale-free power-law degree distribution, and the emergence of community structure [3, 4]. The small-world effect means that, for a fixed value of the nodes' mean degree, the average path length scales logarithmically, or slower, with network size. A node's clustering coefficient C can be defined as "the average fraction of pairs of neighbours of a node that are also neighbours of each other", where C lies between 0 and 1 [1]. The average clustering coefficients in real complex networks tend to have high values. Community structure emerges when nodes in a community have denser connections among themselves than to vertices of different communities [5]. The degree distribution, which is the fraction of vertices in the network with degree k, follows a scale-free power-law distribution in real complex networks. Scale-free power-law distributions, $P(k) \sim k^{-\gamma}$, have a power-law (PL) exponent $\gamma$ that is independent of the size of the network and takes values $\gamma > 1$ [3, 4].

Efforts to faithfully model complex networks have produced several models. The most influential models in the complex-network modelling field are those of Erdős and Rényi (ER), Watts and Strogatz (WS), and Barabási and Albert (BA). Networks generated according to the ER random graph model have a small average path length, but they have Poisson degree distributions and clustering coefficients lower than those found in real complex networks [3, 4]. Networks generated by the WS small-world model have a short average path length and a high clustering coefficient. However, the WS model does not capture the scale-free property of the networks' degree distribution [2, 3, 4]. Thus, the scale-free power-law degree distribution of real complex networks is not represented in either the ER or the WS model, so neither model reproduces all four characteristics of real complex networks. This motivated Barabási and Albert to induce the scale-free property for the node-degree distribution in their highly acclaimed model [2]. The BA model uses a Preferential Attachment (PA) connection algorithm that reflects the belief that nodes usually prefer to connect to higher-degree, structurally popular nodes [2]. The BA model succeeded in preserving the PL degree distribution and the small-world phenomenon of real complex networks. Networks generated by the BA model show a power-law heavy-tailed degree distribution if and only if the model has the following two properties: growth (new nodes are continuously added to the network) and preferential attachment (PA). The BA model starts with a small number of nodes ($m_0$), referred to as the seed network. A new node is added at each time step. The new node preferentially attaches to $m$ other nodes (where $m \le m_0$) using a connection function based on the old nodes' normalized degrees.
Thus, a new node $i$ connects preferentially to an old node $j$ of degree $D_j$ using a connection function (CF) based on the node's normalized degree $d_j$, where

$$d_j = \frac{D_j}{\sum_{k} D_k},$$

and the sum runs over all nodes already present in the network.
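The growth and preferential-attachment steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' MATLAB implementation: the function names, the fully connected seed network, and the use of `random.choices` for weighted sampling are all assumptions made for the example.

```python
import random


def normalized_degrees(degrees):
    """Return d_j = D_j / sum_k D_k for every node j."""
    total = sum(degrees.values())
    return {node: deg / total for node, deg in degrees.items()}


def ba_grow(m0, m, n_final, rng):
    """Grow a BA-style graph and return adjacency sets keyed by node id.

    Assumption: the seed is a fully connected graph on m0 nodes.
    """
    adj = {i: set(range(m0)) - {i} for i in range(m0)}
    for new in range(m0, n_final):
        d = normalized_degrees({j: len(nbrs) for j, nbrs in adj.items()})
        nodes = list(d)
        weights = [d[j] for j in nodes]
        # Draw old nodes with probability proportional to their
        # normalized degree until m distinct targets are found.
        targets = set()
        while len(targets) < m:
            targets.add(rng.choices(nodes, weights=weights)[0])
        adj[new] = set()
        for j in targets:
            adj[new].add(j)
            adj[j].add(new)
    return adj
```

Calling `ba_grow(5, 5, 1000, random.Random(0))` would grow a network with the seed size, link count, and final size used later in the paper's simulations.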

Networks generated using the BA model have a scale-free power-law degree distribution, and their average path lengths exhibit the small-world phenomenon. However, the BA model generates networks with a constant PL exponent of $\gamma = 3$, unlike real networks, where the exponent differs according to the network type (with $\gamma > 1$). Additionally, the average clustering coefficient of BA-modelled networks is lower than that observed in real complex networks of the same size [3, 4, 5].

The BA model was therefore still inaccurate in representing all four properties observed in real complex networks. This motivated many researchers to introduce modifications to the BA model in an attempt to remedy the model's shortcomings. Nevertheless, devising a model that can represent all four properties of complex networks is still an ongoing research effort [5].

In this paper we aim to devise a mathematical model that is capable of reflecting the four properties of complex networks. Moreover, our model takes into consideration node heterogeneity, a factor that, we claim, has been overlooked in most contemporary complex-network models [4, 5]. Even the heterogeneous complex-network models that have been proposed are not general enough to apply to different types of complex networks. We define node heterogeneity as the heterogeneity of node characteristics. The contributions of this paper can be summarized as:

1) Accounting for node heterogeneity in the graph-theoretic description of a network by incorporating node attributes as one of the elements defining a network graph. Accordingly, our model defines the network graph G as a set of three elements, G = {V, E, A}, where V is the set of nodes in the network, E is the set of edges [6], and A is the set of attributes assigned to each network node.

2) Based on (1), proposing an Integrated Attribute Similarity Model (IASM) for generating complex networks. IASM is based on the BA algorithm for network generation. However, it acknowledges the heterogeneous nature of nodes by integrating the attribute similarity with the structural popularity measure within the CF. Two measures of node structural popularity are used. The first uses the nodes' normalized degree (IASM_A), while the other uses eigenvector centrality as a more accurate structural popularity measure (IASM_B). To increase the clustering coefficient of IASM-generated networks, triad formation is applied to the model as an enhancement.

Each of the proposed models is validated using MATLAB simulation [7]. The success of each proposed model in mimicking real complex networks is verified by examining the statistical properties of the generated networks, namely the average path length, clustering coefficient, and degree distribution.

The rest of the paper is organized as follows: Section II presents the related work, Section III presents our proposed models and their simulation results, and Section IV presents the conclusion and future work.

II. RELATED WORK

Several researchers have proposed mathematical models that address the heterogeneous nature of the nodes composing a network. The success of these models in generating networks that mimic real complex-networks was examined by observing the statistical properties of these networks. This section will review a subset of these attempts.

Bianconi and Barabási in [8] introduced the term node fitness to represent nodes' different abilities to attain connections. Their work was motivated by the observation that a node's ability to attract connections does not depend only on its degree, which in turn reflects the node's age. WWW nodes that provide good content are likely to acquire more connections irrespective of their ages. In citation networks, a new paper with a breakthrough is likely to have more connections than older papers. Thus each node should be assigned a parameter that describes its competitive nature in attaining connections. In their model, node $j$ is assigned upon birth a fitness factor $\eta_j$, drawn from some distribution $\rho(\eta)$, which represents its intrinsic ability to attain connections. The Bianconi and Barabási model follows the BA PA connection algorithm with a modified PA function: the PA function value for connecting an old node $j$ to a newly added node $i$ depends on the old node's degree $D_j$ and its fitness value $\eta_j$.

When $\rho(\eta)$ follows a uniform distribution, the degree distribution is a generalized power law with an inverse logarithmic correction. The average clustering coefficient and average path length of networks generated by this model were not calculated in the presented work.

Shaohua et al. in [9] observed that nodes with common traits or interests tend to interact. They introduced an evolving model based on attribute similarity between the nodes. Each of the network nodes has an attribute set. Node attributes can be described by a true-or-false function, as in fuzzy logic. Shaohua et al. used fuzzy similarity rules to define a similarity function between the attribute sets of two nodes. A connection is established between two nodes if their attribute similarity falls within a certain sector. Although this model satisfies the small-world property, its degree distribution does not follow a power law.

Yixiao Li et al. in [10] argued that every vertex is identified with a social identity represented by a vector whose elements represent distinctive social features. The new node added at each time step connects with probability p to the group closest to its social identity and with probability (1-p) to the other groups. Within the chosen group, the new node attaches to a higher-degree node using PA. Random linking to neighbours of the previously attached old node is then repeated until the new node establishes its m links. Their generated networks follow a power-law degree distribution, and triad formation is used to produce high average clustering coefficients. The authors claimed that using triad formation produced high average clustering, but they did not present values for it, and they did not measure the average path length of their generated networks. Additionally, the model did not increase the length of the attribute vector to more than one.
While [8, 9, 10] based their connection algorithms on the PA algorithm, some authors experimented with models that were not based on the BA PA algorithm, such as those presented in [11] and [12]. Kleinberg et al. in [11] used a copying mechanism, which entails randomly choosing a node and then connecting its m links to neighbours of other randomly chosen nodes. The model was found to preserve power-law distributions using heuristics only. They argued that analytical tools were unable to prove this conclusion, because the copying mechanism generates dependencies between random variables. Krapivsky et al. [12] argued that an author citing a paper in a citation network is most likely going to cite one of its references as well. In their model, when a new node 'i' is added to the network, its edge attaches to a randomly chosen node 'j' with probability (1-r). With probability r, this edge from the new node 'i' is instead redirected to the ancestor node 'o' of the randomly chosen node 'j'. The rate equations of the model show that it has a power-law degree distribution whose exponent decreases as the probability r increases. Other statistical properties were not studied.

These models were able to generate networks having a power-law degree distribution without using the PA algorithm of BA. However, they are not applicable to all complex networks. Copying connections from a random node, or connecting to the ancestor of a previously chosen node, is not meaningful for some types of complex networks. Additionally, the nodes from which links are copied, and the nodes whose ancestors are chosen, are selected randomly without regard to the nodes' heterogeneous characteristics or their heterogeneous connection standards. Therefore, finding a ubiquitous mathematical model for heterogeneous complex networks that preserves their four statistical properties is still an ongoing effort.
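As a concrete illustration of the redirection mechanism attributed to [12] above, the following Python sketch grows a network in which each new node either attaches to a uniformly chosen old node or, with probability r, is redirected to that node's ancestor. The function name, the single-root seed, and the bookkeeping dictionaries are assumptions made for the example and are not taken from [12].

```python
import random


def grow_with_redirection(n_final, r, rng):
    """Grow a directed network by random attachment with redirection.

    Returns (parent, in_degree): parent[i] is the node that i links to,
    in_degree[j] counts the links that j has received.
    Assumption: the seed is a single root node (id 0) with no ancestor.
    """
    parent = {0: None}
    in_degree = {0: 0}
    for new in range(1, n_final):
        j = rng.randrange(new)  # uniformly chosen old node
        target = j
        # With probability r, redirect the edge to j's ancestor (if any).
        if parent[j] is not None and rng.random() < r:
            target = parent[j]
        parent[new] = target
        in_degree[new] = 0
        in_degree[target] += 1
    return parent, in_degree
```

With r = 0 the sketch reduces to uniform random attachment, while with r = 1 every edge that can be redirected is redirected, so all links end up at the root.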


III. PROPOSED MODEL

Nodes, whether users or entities, in real complex networks have different profiles and characteristics. Connections between nodes affect the network dynamics and the network's future evolution. We argue that nodes having different characteristics influence the density and the pattern of connections within a network. The notion of node attributes is used to highlight each node's distinct characteristics. The attribute set is extracted from the characteristics or profile of the network node. In our models, nodes are assigned their attributes upon their arrival to the network. Accordingly, the network graph G is now defined by a three-element set G = {V, E, A}, where V is the set of nodes in the network, E is the set of edges, and A is the set of attributes defining the profiles or characteristics of all the network nodes. Our models are more general than that presented in [8] in that the node-attribute vector length is not restricted to one.

All of our proposed models grow as nodes are constantly added to the network during its evolution. Our model, the Integrated Attribute Similarity Model (IASM), uses the concepts of growth and the PA connection algorithm, as in Barabási-Albert (BA), for network generation. We base IASM on modifying the CF of the BA model. The BA model is chosen because it is the only one of the three influential models that succeeded in generating graphs having PL degree distributions. Instead of having the CF depend on the old node's fitness or degree alone, we propose making it depend also on a parameter expressing the attribute similarity between the newly added node and the old node. IASM integrates the attribute similarity between the new node and old nodes with the structural popularity of old nodes to modify the CF used for PA in the BA model. The node's structural popularity is a measure of its popularity based on its network position and connections. IASM is the first network model to integrate an attribute-similarity measure within the CF.

IASM uses two structural-popularity measures. In IASM_A, the normalized node degree is used as the structural popularity measure, while eigenvector centrality is used in IASM_B. Eigenvector centrality is considered a more accurate structural popularity measure, as it takes into consideration both the density and the quality of the links attached to a node.
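To make the second popularity measure concrete, the sketch below computes eigenvector centrality by power iteration on an undirected graph stored as adjacency sets. It is an illustrative sketch only: the paper does not state how the centrality is computed, and the function name, the identity shift (iterating with A + I, which has the same eigenvectors as A but avoids oscillation on bipartite graphs), and the normalization so that the scores sum to 1 are assumptions.

```python
def eigenvector_centrality(adj, iterations=100, tol=1e-9):
    """Approximate eigenvector centrality by power iteration.

    adj maps each node to the set of its neighbours (undirected graph).
    Scores are normalized to sum to 1 so that they are on a scale
    comparable to normalized degrees.
    """
    nodes = list(adj)
    x = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # One multiplication by (A + I): own score plus neighbours' scores.
        new = {v: x[v] + sum(x[u] for u in adj[v]) for v in nodes}
        norm = sum(new.values())
        new = {v: val / norm for v, val in new.items()}
        if max(abs(new[v] - x[v]) for v in nodes) < tol:
            return new
        x = new
    return x
```

On a star graph, for example, the hub receives the largest score and all leaves receive equal scores, which matches the intuition that a node is popular when it is connected to many nodes or to well-connected nodes.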

To further enhance the clustering coefficient values in IASM, a triad formation step (TFS) has been added to the network generation algorithm. TFS reflects the preference of a node to connect to its neighbor’s neighbor rather than to any other randomly chosen node.

To evaluate IASM, we generate networks based on it using MATLAB simulation. For each of the generated networks, values for the power-law exponent, the average path length, and the average clustering coefficient were measured and assessed against values reported for a variety of real complex networks [3, 4].
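Two of the measured statistics can be computed directly from an adjacency-set representation of the generated network. The sketch below is one possible way to do so in Python and is not the authors' MATLAB code: the function names are hypothetical, nodes with fewer than two neighbours are assumed to contribute a clustering coefficient of 0, and the path-length routine assumes a connected graph with at least two nodes.

```python
from collections import deque
from itertools import combinations


def average_clustering(adj):
    """Mean over nodes of the fraction of neighbour pairs that are linked."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # assumption: such nodes contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        # Breadth-first search gives shortest hop counts from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs
```

For a triangle both functions return 1.0; for a three-node path the clustering is 0 and the average path length is 4/3.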

A. Integrated Attribute Similarity Model (IASM)

In IASM, each new network node possesses upon birth its own distinct attribute set (attribute vector) that represents the interests or engagements of the node in the network's L interests or activities. The CF in IASM does not depend solely on a specific characteristic of the old node but on the characteristics of both the new and the old nodes. Accordingly, a new node usually prefers to connect with old nodes that are the most topologically popular and have interests or attributes similar to its own.

IASM is a growing network model. IASM starts with a seed network of size $m_0$. Then at each time step a new node is added with $m$ edges to be connected to it, where $m \le m_0$. Each node is assigned an attribute vector having L elements. Each element takes a binary value of 1 or 0, representing the presence or absence of an attribute in the attribute vector, respectively. Our proposed attribute similarity is equal to the normalized summation of the inner product of the new-node and old-node attribute vectors. Each newly added node is preferentially connected to $m$ old nodes based on the value of the CF.

The CF in both models is used to connect the nodes via a preferential attachment algorithm. We used the algorithm proposed by Newman in [1] to implement the preferential attachment. Each node is identified by a Node-Id that represents its arrival order. For each arriving node, a list of Node-Ids is created in which Node-Ids are repeated based on their corresponding CF values. Thus, a new vector is formed in which nodes having higher CF values are repeated more frequently. Each arriving node has to establish $m$ connections with nodes randomly selected from this vector. Thus, the connection function CF used for preferentially connecting a new node 'i' with a chosen old node 'j' depends on the structural popularity of node j ($SP_j$) and the node-attribute similarity ($A_{ij}$) of nodes i and j. The CF is expressed as

$$CF_{ij} = \beta \, SP_j + \alpha \, A_{ij} \, SP_j + w \, A_{ij},$$

where $\alpha + w + \beta = 1.0$, $0 \le \alpha \le 1$, $0 \le w \le 1$, and $0 \le \beta \le 1$. The coefficients $\alpha$, $w$, and $\beta$ give different weights to the different terms of the CF so that their influence can be tested. Simulation of IASM_A and IASM_B starts with a seed network of size $m_0 = 5$, as shown in Figure 1. The network grows as new nodes arrive, until it reaches a predetermined final size N. The simulation parameters are listed in Table 1, and a flow chart of the simulation algorithm is shown in Figure 2. Each newly arriving node has to establish $m$ links with the preexisting network nodes, where $m = m_0 = 5$. Each new node in the network is randomly assigned an attribute vector of length L = 10, whose elements are drawn from a uniform distribution.

Figure 1. Seed network, $m_0 = 5$.

Figure 2. IASM algorithm with modified CF depending on normalized degree and the attribute similarity.

Figure 3. IASM algorithm with triad formation step.
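The connection step can be sketched in Python as follows. This is a hedged illustration, not the authors' MATLAB code. It assumes that the attribute similarity is the inner product of the two binary vectors divided by the vector length L (the text only says "normalized"), and that the CF weights structural popularity alone by β, the product of similarity and popularity by α, and similarity alone by w, as implied by the coefficient settings described in the simulation discussion. The function names and the `scale` parameter, which turns real-valued CF values into integer repeat counts for the Node-Id list, are hypothetical.

```python
import random


def attribute_similarity(a_new, a_old):
    """Inner product of two binary attribute vectors, divided by L.

    Dividing by the vector length is an assumed normalization.
    """
    return sum(x * y for x, y in zip(a_new, a_old)) / len(a_new)


def connection_function(sp_j, a_ij, alpha, w, beta):
    """CF for old node j: beta*SP_j + alpha*A_ij*SP_j + w*A_ij."""
    if abs(alpha + w + beta - 1.0) > 1e-9:
        raise ValueError("alpha + w + beta must equal 1")
    return beta * sp_j + alpha * a_ij * sp_j + w * a_ij


def build_selection_pool(cf_values, scale=1000):
    """Repeat each Node-Id in proportion to its CF value."""
    pool = []
    for node_id, cf in cf_values.items():
        pool.extend([node_id] * round(cf * scale))
    return pool


def pick_targets(pool, m, rng):
    """Draw uniformly from the pool until m distinct nodes are chosen."""
    if len(set(pool)) < m:
        raise ValueError("pool holds fewer than m distinct nodes")
    targets = set()
    while len(targets) < m:
        targets.add(rng.choice(pool))
    return targets
```

Setting `beta=1.0` and `alpha=w=0.0` makes the CF equal to the structural popularity alone, so with normalized degree as the popularity measure the sketch falls back to plain degree-based preferential attachment.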

The CF in IASM depends on the attribute similarity between newly arriving nodes and old network nodes as well as on the structural popularity of the old nodes. Two different structural popularity measures are used in the IASM simulation. In IASM_A, a node's structural popularity is based on the number of connections that the node has, i.e. the node degree, while in IASM_B it is based on the node's eigenvector centrality. We speculate that eigenvector centrality might be a better measure of a node's topological popularity, since it considers both the number and the quality of the node's connections. A node has a high eigenvector centrality if it is connected to many nodes, or if it is connected to a few nodes that themselves have many connections.

MATLAB simulations were performed for different combinations of the CF coefficients for both models. The reported results are the average of 10 experiments with different random-number-generator seeds. The CF can be based on the normalized degree only ($\beta = 1$, $\alpha = w = 0$), on the degree with added attribute similarity ($\alpha = 0$ and $w = 1 - \beta$, where $0 \le \beta \le 1$), or on the degree multiplied by the attribute similarity ($w = 0$ and $\alpha = 1 - \beta$, where $0 \le \beta \le 1$). IASM_A reduces to the basic BA model when $\beta = 1$ and $\alpha = w = 0$. We use the basic BA model as a baseline to show the effect of our proposed modifications on the statistical properties of the generated networks.

Simulation results for the Average Clustering Coefficient (Av_CC), the Average Path Length (Av_Pl), and the Exponent of the PL (Exp_PL) are shown in Table 2. The values recorded for the network statistical parameters are very similar in IASM_A and IASM_B, which means that the method used to measure a node's structural popularity has a minor effect on the simulated parameters of the network. Introducing attribute similarity into the CF preserved the small-world phenomenon, while slightly decreasing the average path length in the case of the multiplicative attribute-similarity-based CF. Moreover, the PL exponent values for all coefficient variations in both IASM_A and IASM_B are similar to those of the BA model (IASM_A when $\beta = 1$, $\alpha = w = 0$) and within the values reported in [1, 3, 4]. Incorporating attribute similarity in the CF calculation, however, had a positive effect on the average clustering coefficient. Results show a 37% improvement in IASM_A (when $\alpha = 1$, $\beta = w = 0$) over the BA model, with Av_CC rising from 0.032 to 0.044. The average clustering coefficient was found to increase with increasing $\alpha$ when $w = 0$ and $\beta = 1 - \alpha$. In contrast, using additive attribute similarity resulted in a decrease in the generated network's average clustering coefficient. Thus, we argue that multiplicative attribute similarity is a better similarity measure than additive attribute similarity in IASM.

In an effort to increase the average clustering coefficient of networks generated by IASM_A and IASM_B, we add a Triad Formation Step (TFS). In a TFS, nodes form connections with the neighbors of their neighbors. To form a triad, the newly arriving node attaches to a randomly chosen second-degree neighbor node, and then a neighbor of this old node is randomly chosen and is also connected to the newly arriving node. Simulation of IASM_A and IASM_B is repeated after adding a TFS, using the same $m_0$, $m$, L, and N values as before. Adding the triad formation step increases the average clustering coefficient for both IASM_A and IASM_B, as shown in Table 2. The TFS caused a decrease in the PL exponent compared to the values of the BA model. However, the PL exponent values recorded are similar to those of some of the real complex networks reported in [1, 3, 4]. In addition, the TFS increased the average path length while preserving the small-world phenomenon. The increase in the average path length could be a result of nodes making more connections with their second-degree neighbors rather than making preferential connections.
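A triad formation step can be sketched as below. The paper's wording leaves the exact procedure open, so this example follows one common interpretation: after the new node has linked to an old node (called `anchor` here), it also links to a randomly chosen neighbour of that node, which closes a triangle. The function name, the `anchor` parameter, and the choice to return `None` when no eligible neighbour exists are assumptions.

```python
import random


def triad_formation_step(adj, new, anchor, rng):
    """Link `new` to a random neighbour of `anchor`, closing a triangle.

    adj maps each node to the set of its neighbours; `new` is assumed
    to be linked to `anchor` already. Returns the chosen neighbour, or
    None if every neighbour of `anchor` is already linked to `new`.
    """
    candidates = sorted(
        u for u in adj[anchor] if u != new and u not in adj[new]
    )
    if not candidates:
        return None
    u = rng.choice(candidates)
    adj[new].add(u)
    adj[u].add(new)
    return u
```

Each successful call adds one edge and one triangle (new, anchor, chosen neighbour), which is consistent with the reported rise in the average clustering coefficient once the step is enabled.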

TABLE 1. SIMULATION PARAMETER VALUES

| m0 | m | L | N |
| 5 | 5 | 10 | 1000 |

TABLE 2. SIMULATION RESULTS FOR IASM_A AND IASM_B

The first three columns are the connection function (CF) coefficients.

| α | w | β | IASM_A Exp_PL | IASM_A Av_Pl | IASM_A Av_CC | IASM_B Exp_PL | IASM_B Av_Pl | IASM_B Av_CC | IASM_A with TFS Exp_PL | IASM_A with TFS Av_Pl | IASM_A with TFS Av_CC | IASM_B with TFS Exp_PL | IASM_B with TFS Av_Pl | IASM_B with TFS Av_CC |
| 0 | 0 | 1 | 2.49 | 3.03 | 0.032 | 2.48 | 3.04 | 0.032 | 1.91 | 3.51 | 0.526 | 1.93 | 3.43 | 0.537 |
| 0.2 | 0 | 0.8 | 2.44 | 3.02 | 0.032 | 2.43 | 3.04 | 0.031 | 1.89 | 3.52 | 0.526 | 1.96 | 3.42 | 0.535 |
| 0.5 | 0 | 0.5 | 2.39 | 3.02 | 0.034 | 2.53 | 3.01 | 0.031 | 1.89 | 3.52 | 0.525 | 1.96 | 3.4 | 0.539 |
| 0.8 | 0 | 0.2 | 2.41 | 2.98 | 0.041 | 2.44 | 3.04 | 0.031 | 1.90 | 3.46 | 0.526 | 2.00 | 3.43 | 0.535 |
| 1 | 0 | 0 | 2.33 | 2.96 | 0.044 | 2.20 | 2.97 | 0.042 | 1.88 | 3.46 | 0.526 | 1.97 | 3.33 | 0.536 |
| 0 | 0.5 | 0.5 | 2.14 | 3.14 | 0.021 | 2.02 | 3.14 | 0.019 | 1.78 | 3.58 | 0.515 | 1.73 | 3.55 | 0.518 |
| 0.5 | 0.5 | 0 | 2.06 | 3.13 | 0.022 | 1.68 | 3.28 | 0.014 | 1.76 | 3.58 | 0.515 | 1.58 | 3.68 | 0.520 |


IV. CONCLUSION

This paper took into consideration that mathematical models of complex networks should incorporate the networks' statistical properties and should also reflect the heterogeneous nature of network nodes. We proposed a mathematical model that paves the way to successfully mimicking real complex networks. The proposed model has heterogeneous network nodes with assigned distinct attributes. Our work is the first to assign more than one attribute to each node. We introduced IASM, which is the first complex-network generation model that integrates an attribute-similarity measure within the PA function. In IASM_A, nodes are linked based on a PA function that depends on the old node's degree together with the attribute similarity between the new node and the old node. IASM_B replaces the degree-based popularity measure of IASM_A with another structural popularity measure, the eigenvector centrality. Both models reflected some of the statistical properties of real complex networks. IASM preserved the power-law degree distribution and the small-world phenomenon, but it did not reflect the high average clustering coefficient or the emergence of community structure. We enhanced IASM by adding a TFS, which increases the clustering coefficient values. We are also working on adding an algorithm to IASM that would result in the emergence of community structure. The effect of using eigenvector centrality instead of degree centrality on the emergence of community structure in IASM remains to be examined in future work. Developing an analytical model for IASM is also part of our future work.

REFERENCES

[1] Xiao Fan Wang and Guanrong Chen, "Complex networks: small-world, scale-free and beyond," IEEE Circuits and Systems Magazine, vol. 3, no. 1, pp. 6-20, 2003, doi: 10.1109/MCAS.2003.1228503.
[2] A.-L. Barabási and R. Albert, "Emergence of scaling in random networks," Science 286, 509-512 (1999).
[3] M. E. J. Newman, "The structure and function of complex networks," SIAM Review 45, 167-256 (2003).
[4] R. Albert and A.-L. Barabási, "Statistical mechanics of complex networks," Rev. Modern Phys. 74, pp. 47-97 (2002).
[5] Emilio Ferrara, "Mining and Analysis of Online Social Networks," Ph.D. dissertation, 2012.
[6] R. Diestel, Graph Theory. Springer-Verlag (2005).
[7] http://www.mathworks.com/products/matlab/
[8] G. Bianconi and A.-L. Barabási, "Competition and multiscaling in evolving networks," Europhys. Lett. 54, 436-442 (2001).
[9] Shaohua Tao and Xiaopeng Yue, "The attributes similar-degree of complex networks," 2010 2nd International Conference on Future Computer and Communication (ICFCC), vol. 3, pp. V3-531-V3-535, 21-24 May 2010, doi: 10.1109/ICFCC.2010.5497519.
[10] Yixiao Li, Xiaogang Jin, Fansheng Kong, and Jiming Li, "Linking via social similarity: The emergence of community structure in scale-free network," 1st IEEE Symposium on Web Society (SWS '09), pp. 124-128, 23-24 Aug. 2009, doi: 10.1109/SWS.2009.5271769.
[11] J. M. Kleinberg, S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, "The Web as a graph: Measurements, models and methods," in Proceedings of the International Conference on Combinatorics and Computing, no. 1627 in Lecture Notes in Computer Science, pp. 1-18, Springer, Berlin (1999).
[12] P. L. Krapivsky and S. Redner, "Organization of Growing Random Networks," Phys. Rev. E 63, 0661 (2001).
