Communities and Clustering in some Social Networks Guido Caldarelli SMC CNR-INFM Rome


INTRODUCTION

Summary

Guido Caldarelli, Communities and Clustering in Some Social Networks, NetSci 2007, New York, May 20th 2007


1 Introduction to basic notions of graphs and clustering

2 Introduction to clustering methods based on similarity/centrality

3 Introduction to clustering methods based on spectral analysis

4 Case study: the word association network

5 Case study: Wikipedia

6 Conclusions and advertisements


1.0 Basic matrix notation

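Adjacency matrix of an undirected graph with four vertices (labelled 1 to 4 in the figure):

\[
A = \begin{pmatrix}
0 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0
\end{pmatrix}
\]

Adjacency matrix of a directed graph with four vertices:

\[
A = \begin{pmatrix}
0 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0
\end{pmatrix}
\]

Degree matrix: the diagonal matrix of the vertex degrees k_i = \sum_{j=1,n} a_{ij}:

\[
K = \begin{pmatrix}
k_1 & 0 & \dots & 0 \\
0 & k_2 & \dots & 0 \\
\dots & \dots & \dots & \dots \\
0 & 0 & \dots & k_n
\end{pmatrix}
= \begin{pmatrix}
\sum_{j=1,n} a_{1j} & 0 & \dots & 0 \\
0 & \sum_{j=1,n} a_{2j} & \dots & 0 \\
\dots & \dots & \dots & \dots \\
0 & 0 & \dots & \sum_{j=1,n} a_{nj}
\end{pmatrix}
\]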
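As a small illustration of this notation, the sketch below builds the undirected 4-vertex adjacency matrix and its degree matrix with NumPy. The use of NumPy and the variable names are choices made for this example, not part of the slides.

```python
import numpy as np

# Undirected 4-vertex example: a_ij = 1 if vertices i and j are linked.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
])

# k_i = sum_j a_ij is the degree of vertex i; K is the diagonal degree matrix.
k = A.sum(axis=1)
K = np.diag(k)

print(k)  # degrees of vertices 1..4
```

For an undirected graph A is symmetric, so row sums and column sums give the same degrees.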


1.1 Clusters and Communities

Generally a cluster corresponds to a community.

Some communities are hard to detect with clustering analysis.


1.2 Small graphs

In order to detect communities, clustering is a good clue


• Clustering Coefficient

• Motifs
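As a hedged illustration of the first bullet: the local clustering coefficient of a vertex i is commonly defined as the fraction of pairs of neighbours of i that are themselves linked, C_i = 2 t_i / (k_i (k_i - 1)), where t_i is the number of triangles through i. The sketch below assumes this standard definition (the slide does not spell one out); `local_clustering` is an illustrative name and the 4-vertex matrix is a toy input.

```python
import numpy as np

def local_clustering(A):
    """Local clustering coefficient of each vertex of an undirected simple graph."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)
    # (A^3)_ii counts closed walks of length 3 from i: twice the triangles through i.
    triangles = np.diag(A @ A @ A) / 2.0
    pairs = k * (k - 1) / 2.0  # number of neighbour pairs of each vertex
    # Vertices with fewer than two neighbours get C_i = 0 by convention.
    return np.divide(triangles, pairs, out=np.zeros_like(k), where=pairs > 0)

A = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
])
C = local_clustering(A)
```

Here vertex 1 has three neighbours of which only one pair is linked, so its coefficient is 1/3, while vertices 3 and 4 each sit in a closed triangle.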


1.3 Hubs and Authorities

Sometimes vertices differ from each other according to their function.

• HITS
  • Hubs are those web pages that point to a large number of authorities (i.e. they have a large number of outgoing edges).
  • Authorities are those web pages pointed to by a large number of hubs (i.e. they have a large number of incoming edges).

Kleinberg, J.M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46, 604–632.


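If every page i has an authority U_i and a hubness H_i, then

\[
U_i = \sum_{j \to i} H_j, \qquad H_i = \sum_{i \to j} U_j
\]

In matrix form, with a_{ij} = 1 when page i points to page j (and up to normalisation):

\[
U = A^{T} H, \qquad H = A U
\quad \Rightarrow \quad
U = A^{T} A \, U, \qquad H = A A^{T} H
\]

We can divide the pages according to their value of U or H. These values are obtained from the principal eigenvectors of the matrices A^{T}A and AA^{T} respectively.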
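The two coupled relations can be iterated numerically. The following is a minimal sketch, assuming a simple power iteration with Euclidean normalisation at each step; the function name `hits`, the iteration count and the small directed test graph are choices made for this example.

```python
import numpy as np

def hits(A, iterations=100):
    """Return (authority, hub) scores, each normalised to unit Euclidean length.

    A[i, j] = 1 means that page i points to page j.
    """
    A = np.asarray(A, dtype=float)
    U = np.ones(A.shape[0])
    for _ in range(iterations):
        H = A @ U        # a good hub points to good authorities
        U = A.T @ H      # a good authority is pointed to by good hubs
        U /= np.linalg.norm(U)
    H = A @ U
    H /= np.linalg.norm(H)
    return U, H

# Small directed example: 1->2, 1->3, 3->4, 4->1 (stored with 0-based indices).
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
])
U, H = hits(A)
```

In this toy graph the first vertex emerges as the only hub and the two pages it points to share the authority score, which matches the principal eigenvectors of AA^T and A^TA.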


TOPOLOGICAL ANALYSIS

2.1 Agglomerative Methods

One way to cluster vertices is to find similarities between them. One “topological” way is to consider their neighbours. One can then define a distance x given by

\[
x^{S}_{ij} = \frac{N(S_i \cup S_j) - N(S_i \cap S_j)}{N(S_i \cup S_j) + N(S_i \cap S_j)}
= \frac{\sum_k a_{ik} + \sum_k a_{jk} - 2 \sum_k a_{ik} a_{jk}}{\sum_k a_{ik} + \sum_k a_{jk}}
\]

where S_i is the set of neighbours of vertex i and N(S) is the number of elements of the set S.

Brun, et al (2003). Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biology, 5, R6 1–13.
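A neighbour-based distance of this kind can be evaluated directly from the adjacency matrix. The sketch below assumes that S_i is the plain neighbour set of i (not including i itself) and that at least one of the two vertices has a neighbour; `neighbour_distance` is an illustrative name.

```python
import numpy as np

def neighbour_distance(A, i, j):
    """Distance between vertices i and j based on the overlap of their neighbourhoods."""
    A = np.asarray(A)
    k_i, k_j = A[i].sum(), A[j].sum()
    common = (A[i] * A[j]).sum()   # N(S_i intersection S_j)
    union = k_i + k_j - common     # N(S_i union S_j)
    return (union - common) / (union + common)

A = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
])
d_34 = neighbour_distance(A, 2, 3)  # vertices 3 and 4 share one neighbour
d_12 = neighbour_distance(A, 0, 1)  # vertices 1 and 2 share no neighbour
```

The distance is 0 for identical neighbourhoods and 1 for disjoint ones, so an agglomerative method would merge the closest pairs first.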


2.2 Divisive Methods: betweenness

The algorithm of Girvan and Newman recursively removes the edge with the largest betweenness in the graph.

The betweenness is a measure of the centrality of a vertex/edge in a graph:

\[
b(i) = \sum_{\substack{j,l = 1,n \\ i \neq j \neq l}} \frac{D_{jl}(i)}{D_{jl}}
\]

where D_{jl} is the number of shortest paths between j and l, and D_{jl}(i) is the number of those paths that pass through i.

Girvan, M. and Newman, M.E.J. (2002). Community structure in social and biological networks. Proc. Natl. Acad. of Science (USA), 99, 7821–7826.
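One divisive step can be sketched as follows for an unweighted, undirected graph. The edge betweenness is computed here with a Brandes-style breadth-first accumulation, which is an implementation choice for this example and not something the slides specify; the function names and the toy graph (two triangles joined by one edge) are likewise illustrative.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness of an unweighted, undirected graph given as {vertex: set(neighbours)}."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path fractions from the farthest vertices.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    # Every unordered pair of endpoints was counted twice.
    return {e: b / 2.0 for e, b in bet.items()}

def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def split_once(adj):
    """Remove highest-betweenness edges until the graph gains one more component."""
    adj = {v: set(nb) for v, nb in adj.items()}
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)
        if not bet:
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {v: set() for v in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

bridge = edge_betweenness(adj)[frozenset((2, 3))]
communities = split_once(adj)
```

Applying `split_once` again to each resulting component would give the successive levels of the dendrogram.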


2.3 Examples

Applied to a more complicated network, the procedure produces a dendrogram of the community structure.

(a) friendship network from Zachary’s karate club study (26). Nodes associated with the club administrator’s faction are drawn as circles, those associated with the instructor’s faction are drawn as squares. (b) Hierarchical tree showing the complete community structure. (c) Hierarchical tree calculated by using edge-independent path counts, which fails to extract the known community structure of the network.


A typical example is an e-mail network. Below is the case study of the University of Tarragona (Spain); different colors correspond to different departments.

Guimerà, R., Danon, L., Diaz-Guilera, A., Giralt, F., and Arenas, A. (2002). Self-similar community structure in organisations. Physical Review E, 68, 065103.


2.4 Random walks and communities

Random walks on graphs are at the basis of the PageRank algorithm (Google): the larger the probability of passing through a certain page, the greater its interest.

Random walks can also be used to detect clusters in graphs. The idea is that the more closed a subgraph is, the longer a random walker needs to escape from it.

One of the heuristic algorithms based on random walks is the Markov Cluster (MCL) algorithm. A complete description and the code are available at

http://micans.org/mcl

• Start from the normal matrix.
• Through matrix manipulation (power), obtain the matrix for n-step connections.
• Raise the elements to a certain power and then normalize; this enhances the passages within clusters relative to those between clusters.
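The three steps above can be sketched in a few lines. This is only a toy version under stated assumptions, not the reference implementation distributed at the address above: it uses a row-stochastic matrix, adds self-loops (a common MCL convention), squares the matrix at each expansion, uses an inflation power of 2 and a fixed number of iterations, and reads clusters off as groups of vertices that still exchange flow. All names and parameter values are choices made for this example.

```python
import numpy as np

def mcl(A, inflation=2.0, iterations=50, tol=1e-6):
    """Toy Markov Cluster sketch on an undirected graph; returns a list of vertex sets."""
    A = np.asarray(A, dtype=float) + np.eye(len(A))  # self-loops (assumed convention)
    M = A / A.sum(axis=1, keepdims=True)             # normal (row-stochastic) matrix
    for _ in range(iterations):
        M = M @ M                                    # expansion: two-step connections
        M = M ** inflation                           # inflation: favour strong passages
        M /= M.sum(axis=1, keepdims=True)            # renormalize each row
    # Vertices that still exchange flow belong to the same cluster.
    linked = (M > tol) | (M.T > tol)
    clusters, seen = [], set()
    for start in range(len(M)):
        if start in seen:
            continue
        cluster, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in cluster:
                continue
            cluster.add(v)
            stack.extend(int(w) for w in np.flatnonzero(linked[v]) if w not in cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters

# Two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

clusters = mcl(A)
```

A larger inflation power tends to produce more and smaller clusters, which is the main tuning knob of the method.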


SPECTRAL ANALYSIS

3.1 The functions of the adjacency matrix

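For the undirected example graph with four vertices (labelled 1 to 4 in the figure):

\[
A = \begin{pmatrix}
0 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0
\end{pmatrix}
\]

Normal Matrix

\[
N = K^{-1} A = \begin{pmatrix}
a_{11}/k_1 & a_{12}/k_1 & \dots & a_{1n}/k_1 \\
a_{21}/k_2 & a_{22}/k_2 & \dots & a_{2n}/k_2 \\
\dots & \dots & \dots & \dots \\
a_{n1}/k_n & a_{n2}/k_n & \dots & a_{nn}/k_n
\end{pmatrix}
\]

Laplacian Matrix

\[
L = K - A = \begin{pmatrix}
k_1 & -a_{12} & \dots & -a_{1n} \\
-a_{21} & k_2 & \dots & -a_{2n} \\
\dots & \dots & \dots & \dots \\
-a_{n1} & -a_{n2} & \dots & k_n
\end{pmatrix}
\]

where K is the diagonal degree matrix:

\[
K = \begin{pmatrix}
k_1 & 0 & \dots & 0 \\
0 & k_2 & \dots & 0 \\
\dots & \dots & \dots & \dots \\
0 & 0 & \dots & k_n
\end{pmatrix}
= \begin{pmatrix}
\sum_{j=1,n} a_{1j} & 0 & \dots & 0 \\
0 & \sum_{j=1,n} a_{2j} & \dots & 0 \\
\dots & \dots & \dots & \dots \\
0 & 0 & \dots & \sum_{j=1,n} a_{nj}
\end{pmatrix}
\]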
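As a quick numerical check of these two definitions, the sketch below computes N and L for the 4-vertex example with NumPy. It assumes every vertex has at least one neighbour, so that K can be inverted.

```python
import numpy as np

A = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

k = A.sum(axis=1)
K = np.diag(k)

N = np.linalg.inv(K) @ A  # normal matrix, N_ij = a_ij / k_i (needs every k_i > 0)
L = K - A                 # Laplacian matrix
```

Each row of N sums to one and each row of L sums to zero, two properties used repeatedly in spectral methods.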


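Writing the degrees as sums of the adjacency elements, the Laplacian matrix is

\[
L = \begin{pmatrix}
\sum_{j \neq 1} a_{1j} & -a_{12} & \dots & -a_{1n} \\
-a_{21} & \sum_{j \neq 2} a_{2j} & \dots & -a_{2n} \\
\dots & \dots & \dots & \dots \\
-a_{n1} & -a_{n2} & \dots & \sum_{j \neq n} a_{nj}
\end{pmatrix}
\]

If \phi' = L\phi, with \phi = (\phi_1, \phi_2, \dots, \phi_n), then

\[
\phi'_i = k_i \phi_i - \sum_{j=1,n} a_{ij} \phi_j = \sum_{j=1,n} a_{ij} (\phi_i - \phi_j)
\]

which is the discrete analogue of -\nabla^2 \phi.

\[
N = \begin{pmatrix}
0 & a_{12}/k_1 & \dots & a_{1n}/k_1 \\
a_{21}/k_2 & 0 & \dots & a_{2n}/k_2 \\
\dots & \dots & \dots & \dots \\
a_{n1}/k_n & a_{n2}/k_n & \dots & 0
\end{pmatrix}
\]

The elements of matrix N give the probability with which a field passes from a vertex i to each of its neighbours.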
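The probabilistic reading of N can be checked on the 4-vertex example: a row vector of occupation probabilities p evolves as p N. The sketch below also uses a standard fact about random walks on connected undirected graphs (not stated on the slide): the stationary distribution is proportional to the degree.

```python
import numpy as np

A = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
], dtype=float)
k = A.sum(axis=1)
N = A / k[:, None]                   # N_ij = a_ij / k_i

p = np.array([1.0, 0.0, 0.0, 0.0])   # walker starts on vertex 1
p_next = p @ N                       # distribution after one step

pi = k / k.sum()                     # degree-proportional distribution
```

After one step the walker is on each of the three neighbours of vertex 1 with probability 1/3, and `pi` is left unchanged by a further step.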


3.2 The block properties in clustered graphs


0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0

0 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 0

0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0

0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0

0 0 0 0 0 0 0 1 1 1 0 1 1 0 0 0 0 0 0

0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1

0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0

0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1

0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1

0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0

In a very clustered graph, the adjacency matrix can be put in a block form.


Given this probabilistic interpretation of the matrix N, we have a series of results, for example:

• One eigenvalue is equal to one.
• The related eigenvector is constant.

Consider the case of disconnected subclusters: the matrix N is made of blocks, and a general eigenvector with eigenvalue one is obtained by joining the eigenvectors of the blocks. It is constant within each block, but the constant can be different from block to block!
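A minimal numerical check of the disconnected case, assuming a toy graph made of two separate triangles: the eigenvalue one appears once per block, and a vector that is constant on each block (with different constants) is left unchanged by N.

```python
import numpy as np

triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
zero = np.zeros((3, 3))
A = np.block([[triangle, zero], [zero, triangle]])   # two disconnected triangles
N = A / A.sum(axis=1, keepdims=True)

# Constant on each block, with a different constant per block.
x = np.array([1.0, 1.0, 1.0, -2.0, -2.0, -2.0])

# N is symmetric here because every vertex has the same degree, so eigvalsh applies.
eigenvalues = np.linalg.eigvalsh(N)
n_unit = int(np.isclose(eigenvalues, 1.0).sum())
```

When the blocks are only weakly connected the degeneracy is lifted, but the eigenvectors close to eigenvalue one keep an approximately stepwise profile.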


3.3 Eigenvalues and Communities

It is possible to express the eigenvector problem as the search for a minimum under a constraint:

1. Define a fictitious quantity x for the sites of the graph
2. Define a suitable function z on these x’s (a “distance”)
3. Define a suitable constraint on these x’s (to avoid having all equal or all 0)

For example

\[
z(x) = \sum_{i,j=1,N} (x_i - x_j)^2 w_{ij}
\]

where the x_i are values assigned to nodes, with some constraint expressed by

\[
\sum_{i,j=1,N} x_i x_j m_{ij} = 1 \qquad \text{(A)}
\]

Stationary points of z(x) + constraint (A) → Lagrange multiplier


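\[
z(x) = \sum_{i,j=1,N} (x_i - x_j)^2 w_{ij}, \qquad \sum_{i,j=1,N} x_i x_j m_{ij} = 1
\]

With a Lagrange multiplier \mu, the stationary points satisfy, for every i,

\[
\frac{\partial}{\partial x_i} \left[ z(x) - \mu \left( \sum_{j,l=1,N} x_j x_l m_{jl} - 1 \right) \right] = 0
\]

For symmetric w_{ij} and m_{ij} this gives

\[
2 x_i \sum_{j=1,N} w_{ij} - 2 \sum_{j=1,N} w_{ij} x_j - \mu \sum_{j=1,N} m_{ij} x_j = 0
\]

that is, in matrix form,

\[
(K_w - A_w)\, x = (\mu/2)\, M x
\]

where A_w is the matrix of the weights w_{ij}, K_w is the diagonal matrix with elements \sum_{j=1,N} w_{ij}, and M is the matrix of the m_{ij}.

\[
M = K_w: \qquad K_w^{-1} A_w\, x = (1 - \mu/2)\, x
\]

Lagrange Multiplier = Normal Eigenvalue problem

\[
M = 1: \qquad (K_w - A_w)\, x = (\mu/2)\, x
\]

Lagrange Multiplier = Laplacian Eigenvalue problem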
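The link between the constrained minimum and the normal eigenvalue problem can be checked numerically. The sketch below assumes unit weights (w_ij = a_ij) on a toy graph of two triangles joined by one edge, and takes the eigenvector of N with the second-largest eigenvalue; the symmetric rescaling used to call `numpy.linalg.eigh` is an implementation choice for this example.

```python
import numpy as np

# Two triangles joined by a single edge; unit weights, so A_w = A and K_w = K.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
k = A.sum(axis=1)

# N = K^-1 A is similar to the symmetric matrix S = K^-1/2 A K^-1/2,
# so its eigenpairs can be obtained with a symmetric solver.
d = 1.0 / np.sqrt(k)
S = d[:, None] * A * d[None, :]
eigenvalues, V = np.linalg.eigh(S)   # eigenvalues in ascending order
lam = eigenvalues[-2]                # second-largest eigenvalue of N
x = d * V[:, -2]                     # corresponding eigenvector of N

# The sign of the components separates the two communities.
labels = x > 0
```

With M = K_w the multiplier is recovered from the eigenvalue as mu/2 = 1 - lam, so the same vector x also satisfies (K - A) x = (1 - lam) K x.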


WORD ASSOCIATION NETWORK

4.1 The experimental data

The data are collected through a psychological experiment: persons (about 100) are given a single word as a stimulus, e.g. “House”. They must answer with the first word that comes to their mind, e.g. “Family”. Answers are later given as new stimuli, so that a network of average associations forms.

Steyvers, M. and Tenenbaum, J.B. (2005). The large scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29, 41–78.

A path from “Volcano” to “Ache”


Capocci, A., Servedio, V. D. P., Caldarelli, G., and Colaiori, F. (2005). Detecting communities in large networks. Physica A, 352, 669–676.

The number of connections (i.e. the degree of nodes) is power-law distributed


4.2 The community structure

Therefore we expect similar words to be on the same plateau. We can measure the correlation between the values of various vertices averaged over 10 different eigenvectors.

Correlation with the words “science”, “literature” and “piano”:

science      1        literature   1        piano       1
scientific   0.994    dictionary   0.994    cello       0.993
chemistry    0.990    editorial    0.990    fiddle      0.992
physics      0.988    synopsis     0.988    viola       0.990
concentrate  0.973    words        0.987    banjo       0.988
thinking     0.973    grammar      0.986    saxophone   0.985
test         0.973    adjective    0.983    director    0.984
lab          0.969    chapter      0.982    violin      0.983
brain        0.965    prose        0.979    clarinet    0.983
equation     0.963    topic        0.976    oboe        0.983
examine      0.962    English      0.975    theater     0.982
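A correlation of this kind can be sketched as follows, assuming a plain Pearson correlation between the profiles of two vertices across the chosen eigenvectors (the slide does not give the exact formula). The function name and the small input array are hypothetical and are not the word-association data.

```python
import numpy as np

def vertex_correlation(X):
    """X[i, m] is the component of vertex i in eigenvector m; returns the matrix r_ij."""
    return np.corrcoef(np.asarray(X, dtype=float))

# Hypothetical components of three vertices in four eigenvectors.
X = np.array([
    [1.0, 2.0, 3.0, 4.0],   # vertex 0
    [2.0, 4.0, 6.0, 8.0],   # vertex 1: same profile as vertex 0, rescaled
    [4.0, 3.0, 2.0, 1.0],   # vertex 2: opposite profile
])
r = vertex_correlation(X)
```

Vertices lying on the same plateau in every eigenvector have correlation close to one, which is how the lists of related words above would be ranked.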


http://www.wikipedia.org


WIKIPEDIA

5.1 Introduction


A Nature investigation aimed to find out whether Wikipedia is an authoritative source of information compared with established sources such as Encyclopedia Britannica.

Among the 42 entries tested, the difference in accuracy was not particularly great:
• the average science entry in Wikipedia contained around four inaccuracies;
• the one in Britannica, about three.
On the other hand, the articles in Wikipedia are on average longer than those in Britannica, which implies a lower rate of errors in Wikipedia.


5.2 The network properties

We generated six wikigraphs, wikiEN, wikiDE, wikiFR, wikiES, wikiIT and wikiPT, from the English, German, French, Spanish, Italian and Portuguese datasets, respectively. The graphs were obtained from an old dump of June 13, 2004. We are not using the current data due to disk space restrictions: the English dataset of June 2005 is more than 36 GB compressed, that is about 200 GB expanded.


In-degree (empty symbols) and out-degree (filled symbols) frequency distributions for the wikigraphs in English (o) and Portuguese ().

The degree shows fat tails that can be approximated by a power-law function of the kind

\[
P(k) \sim k^{-\gamma}
\]

where the exponent is the same for both in-degree and out-degree.

In the case of the WWW, 2 \le \gamma_{in} \le 2.1

Capocci, A., et al. (2006). Preferential attachment in the growth of social networks: The internet encyclopedia Wikipedia. Physical Review E, 74, 036116
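An exponent of this kind is often estimated from the slope of the distribution on a log-log scale. The sketch below assumes a plain least-squares fit, which is a crude estimator chosen here for brevity and not necessarily the one used for the Wikipedia data; the input is synthetic and noise-free, and `fit_power_law_exponent` is an illustrative name.

```python
import numpy as np

def fit_power_law_exponent(k, p):
    """Least-squares slope of log p versus log k; returns the estimate of gamma."""
    slope, _intercept = np.polyfit(np.log(k), np.log(p), 1)
    return -slope

k = np.arange(1, 101, dtype=float)
p = k ** -2.1                      # synthetic P(k) ~ k^(-2.1)
gamma = fit_power_law_exponent(k, p)
```

On real, noisy degree data the tail of the histogram is sparse, so logarithmic binning or a maximum-likelihood estimate is usually more reliable than a direct fit.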


The average neighbors’ in–degree, computed along incoming edges, as a function of the in–degree for the English (o) and Portuguese ()

As regards the assortativity (as measured by the average degree of the neighbours of a vertex with degree k) there is no evidence of any assortative behaviour.


5.3 The growth of Wikipedia

Given the history of growth, one can verify the hypothesis of preferential attachment. This is done by means of the histogram P(k), which gives the number of vertices of degree k acquiring new connections at time t. This quantity is weighted by the factor

N(t)/n(k,t)

where N(t) is the number of vertices at time t and n(k,t) is the number of vertices with degree k at time t.

We find preferential attachment for both in-degree and out-degree.

English (o) and Portuguese (). White = in-degree, filled = out-degree.
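The weighted histogram can be sketched on a toy growth history. The code below assumes that each event adds one connection to an existing vertex and that no vertices are created during the observed window, so N(t) stays fixed; the function name, the input format and the three-vertex example are hypothetical.

```python
from collections import Counter, defaultdict

def attachment_histogram(initial_degree, events):
    """Weighted count of connections received by vertices of degree k.

    initial_degree: dict vertex -> degree before the first event.
    events: vertices receiving one new connection each, in time order.
    Each event on a degree-k vertex contributes N(t) / n(k, t).
    """
    degree = dict(initial_degree)
    histogram = defaultdict(float)
    for vertex in events:
        n_k = Counter(degree.values())   # n(k, t): number of vertices per degree value
        n_total = len(degree)            # N(t): number of vertices at time t
        k = degree[vertex]
        histogram[k] += n_total / n_k[k]
        degree[vertex] = k + 1           # the vertex gains one connection
    return dict(histogram)

hist = attachment_histogram({"a": 1, "b": 1, "c": 2}, ["c", "a"])
```

The weighting compensates for the fact that low-degree vertices are far more numerous; a histogram that still grows with k is the signature of preferential attachment.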


5.4 The communities in Wikipedia

Taxonomy

The categorization provided imposes a taxonomy on the pages.


Given different wikigraphs one can compute the frequency of the category sizes in the various systems


Similarly, the cluster size frequency distribution (computed with the MCL algorithm) can also be considered.

The agreement is qualitatively rather good. But are they the same?


NOT REALLY! The power-law shape is probably a very common feature for any categorization


SUMMARY

Communities represent an important categorization of graphs. Methods to detect them vary according to the specific case study.

• SMALL GRAPHS (motifs, clustering coefficient)
• LARGE GRAPHS
  • FUNCTION OF VERTICES (HITS, Vertex Similarity)
  • CENTRALITY (Girvan Newman Algorithms)
  • DIFFUSION ON THE GRAPH
    • MCL Algorithm
    • Spectral analysis of the stochastic matrices associated with the graph


SHAMELESS ADVERTISEMENT


http://www.complexnetworks.net