INTRODUCTION
Summary
Guido Caldarelli, Communities and Clustering in Some Social Networks. NetSci 2007, New York, May 20th 2007
1 Introduction to basic notions of graphs and clustering
2 Introduction to clustering methods based on similarity/centrality
3 Introduction to clustering methods based on spectral analysis
4 Case study: the word association network
5 Case study: Wikipedia
6 Conclusions and advertisements
INTRODUCTION
1.0 Basic matrix notation
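For an undirected graph on four vertices (1, 2, 3, 4) the adjacency matrix is symmetric:

$$
A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}
$$

For a directed graph on the same vertices the adjacency matrix is not symmetric:

$$
A = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}
$$

The degree matrix K is diagonal:

$$
K = \begin{pmatrix} k_1 & 0 & \dots & 0 \\ 0 & k_2 & \dots & 0 \\ \dots & \dots & \dots & \dots \\ 0 & 0 & \dots & k_n \end{pmatrix}
= \begin{pmatrix} \sum_{j=1,n} a_{1j} & 0 & \dots & 0 \\ 0 & \sum_{j=1,n} a_{2j} & \dots & 0 \\ \dots & \dots & \dots & \dots \\ 0 & 0 & \dots & \sum_{j=1,n} a_{nj} \end{pmatrix}
$$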
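As a small illustration of this notation, the degree matrix can be computed directly from the adjacency matrix. The following is only a sketch: the helper name `degree_matrix` and the list-of-lists representation are assumptions made here, not part of the original slides.

```python
def degree_matrix(a):
    """Return the diagonal matrix K with k_i = sum_j a_ij (hypothetical helper)."""
    n = len(a)
    return [[sum(a[i]) if i == j else 0 for j in range(n)] for i in range(n)]


# Undirected four-vertex example graph.
A = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
]

K = degree_matrix(A)
for row in K:
    print(row)
```

For a directed graph the same row sum gives the out-degree; the in-degree would be the corresponding column sum.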
INTRODUCTION
1.1 Clusters and Communities
Generally a cluster corresponds to a community.
Some communities are hard to detect with clustering analysis.
INTRODUCTION
1.2 Small graphs
In order to detect communities, clustering is a good clue
• Clustering Coefficient
• Motifs
INTRODUCTION
1.3 Hubs and Authorities
Sometimes vertices differ from each other according to their function.
• HITS
• Hubs are those web pages that point to a large number of authorities (i.e. they have a large number of outgoing edges).
• Authorities are those web pages pointed to by a large number of hubs (i.e. they have a large number of incoming edges).
Kleinberg, J.M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46, 604–632.
INTRODUCTION
1.3 Hubs and Authorities
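If every page i has authority $U_i$ and hubness $H_i$, then

$$
U_i = \sum_{j \to i} H_j, \qquad H_i = \sum_{i \to j} U_j
$$

In matrix form, with $a_{ij} = 1$ when page i points to page j,

$$
U = A^T H, \qquad H = A U \quad \Longrightarrow \quad U = A^T A\, U, \qquad H = A A^T H
$$

We can divide the pages according to their value of U or H. Up to normalization, these values are obtained from the principal eigenvectors of the matrices $A^T A$ and $A A^T$ respectively.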
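The coupled relations between authority and hubness can be iterated numerically until they settle on the principal eigenvectors. The sketch below is a plain power iteration written for illustration: the function names, the fixed number of iterations and the toy link matrix are assumptions, and it is not Kleinberg's reference implementation.

```python
import math


def normalize(v):
    """Rescale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def hits(a, iterations=50):
    """Iterate U = A^T H and H = A U, where a[i][j] = 1 if page i points to page j."""
    n = len(a)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        auth = normalize([sum(a[j][i] * hub[j] for j in range(n)) for i in range(n)])
        hub = normalize([sum(a[i][j] * auth[j] for j in range(n)) for i in range(n)])
    return auth, hub


# Toy example (hypothetical): pages 0 and 1 link to pages 2 and 3.
links = [
    [0, 0, 1, 1],  # page 0 points to pages 2 and 3
    [0, 0, 1, 0],  # page 1 points to page 2
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

authority, hubness = hits(links)
best_authority = max(range(4), key=lambda i: authority[i])
best_hub = max(range(4), key=lambda i: hubness[i])
print("best authority:", best_authority, "best hub:", best_hub)
```

In this toy graph page 2 receives links from both hubs and page 0 points to both authorities, so they come out on top of the two rankings.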
TOPOLOGICAL ANALYSIS
2.1 Agglomerative Methods
One way to cluster vertices is to find similarities between them. One “topological” way is to consider their neighbours. Calling $S_i$ the set of neighbours of vertex i and $N(S)$ the number of elements of a set S, one can define a distance $x^S_{ij}$ given by

$$
x^S_{ij} = \frac{N(S_i \cup S_j) - N(S_i \cap S_j)}{N(S_i \cup S_j) + N(S_i \cap S_j)}
= \frac{\sum_k a_{ik} + \sum_k a_{jk} - 2\sum_k a_{ik} a_{jk}}{\sum_k a_{ik} + \sum_k a_{jk}}
$$
Brun, et al (2003). Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biology, 5, R6 1–13.
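The neighbour-based distance can be evaluated from the adjacency matrix alone. The following sketch assumes an unweighted, undirected graph stored as a list of lists; the function name `neighbour_distance` and the convention of returning 0 for two isolated vertices are choices made here for illustration.

```python
def neighbour_distance(a, i, j):
    """Distance based on shared neighbours: (|union| - |intersection|) / (|union| + |intersection|)."""
    n = len(a)
    common = sum(a[i][k] * a[j][k] for k in range(n))  # N(S_i ∩ S_j)
    union = sum(a[i]) + sum(a[j]) - common             # N(S_i ∪ S_j)
    if union + common == 0:
        return 0.0  # assumption: two isolated vertices are treated as identical
    return (union - common) / (union + common)


# Undirected four-vertex example graph.
A = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
]

d_12 = neighbour_distance(A, 1, 2)
d_01 = neighbour_distance(A, 0, 1)
d_23 = neighbour_distance(A, 2, 3)
print(d_12, d_01, d_23)
```

Vertices with no neighbour in common are at distance 1, while vertices with identical neighbourhoods are at distance 0; an agglomerative method then merges the closest vertices first.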
TOPOLOGICAL ANALYSIS
2.2 Divisive Methods: betweenness
The betweenness is a measure of the centrality of a vertex/edge in a graph:

$$
b(i) = \sum_{\substack{j,l=1,n \\ i \neq j \neq l}} \frac{\sigma_{jl}(i)}{\sigma_{jl}}
$$

where $\sigma_{jl}$ is the number of shortest paths from j to l and $\sigma_{jl}(i)$ is the number of those paths that pass through i.
The algorithm of Girvan and Newman recursively removes the edge with the largest betweenness in the graph, recomputing the betweenness after each removal.
Girvan, M. and Newman, M.E.J. (2002). Community structure in social and biological networks. Proc. Natl. Acad. of Science (USA), 99, 7821–7826.
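A compact way to see the divisive procedure at work is to compute the edge betweenness with a breadth-first search from every vertex and remove the top-ranked edge until the graph splits. The sketch below is an illustrative pure-Python version for small, unweighted, undirected graphs without self-loops; all function names and the toy graph are assumptions, and it stops at the first split rather than building the whole dendrogram.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness from shortest-path counts (one BFS per source vertex)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate path fractions from the farthest vertices back
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    return {e: b / 2.0 for e, b in bet.items()}  # each pair was counted from both ends


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the number of components grows."""
    adj = {v: set(nb) for v, nb in adj.items()}
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)
        if not bet:
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Toy graph (hypothetical): two triangles joined by the bridge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
graph = {v: set() for v in range(6)}
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

bridge_betweenness = edge_betweenness(graph)[frozenset((2, 3))]
communities = sorted(sorted(c) for c in girvan_newman_split(graph))
print(bridge_betweenness, communities)
```

The bridge carries every shortest path between the two triangles, so it has the largest betweenness and is removed first.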
TOPOLOGICAL ANALYSIS
2.3 Examples
The procedure, applied to a more complicated network, produces a dendrogram of the community structure.
Figure: (a) Friendship network from Zachary’s karate club study. Nodes associated with the club administrator’s faction are drawn as circles, those associated with the instructor’s faction are drawn as squares. (b) Hierarchical tree showing the complete community structure. (c) Hierarchical tree calculated by using edge-independent path counts, which fails to extract the known community structure of the network.
TOPOLOGICAL ANALYSIS
2.3 Examples
One typical example is that of an e-mail network. Below is the case study of the University of Tarragona (Spain). Different colors correspond to different departments.
Guimerà, R., Danon, L., Diaz-Guilera, A., Giralt, F., and Arenas, A. (2003). Self-similar community structure in organisations. Physical Review E, 68, 065103.
TOPOLOGICAL ANALYSIS
2.4 Random walks and communities
Random walks on graphs are at the basis of the PageRank algorithm (Google). The idea is that the larger the probability that a random walker passes through a certain page, the greater the interest of that page.
Random walks can also be used to detect clusters in graphs: the idea is that the more closed a subgraph is, the longer a random walker needs to escape from it.
One of the heuristic algorithms based on random walks is the Markov Cluster (MCL) algorithm. The complete description and code are available at
http://micans.org/mcl
• Start from the Normal Matrix.
• Through matrix manipulation (taking powers), one obtains a matrix for n-step connections.
• Enhance intracluster passages at the expense of intercluster ones by raising the elements to a certain power and then normalizing.
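The steps listed above can be sketched in a few lines. This is a simplified illustration and not the MCL program distributed at micans.org: the column-stochastic convention, the added self-loops, the inflation exponent of 2, the fixed iteration count and the read-out threshold are all assumptions made here.

```python
def normalize_columns(m):
    """Rescale every column so that it sums to one (transition probabilities)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product for square lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mcl(adj, inflation=2.0, iterations=30):
    """Alternate expansion (matrix power) and inflation (element-wise power, then normalize)."""
    n = len(adj)
    # Assumption: self-loops are added before normalizing.
    m = [[float(adj[i][j]) + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                                     # two-step connections
        m = normalize_columns([[x ** inflation for x in row] for row in m])  # favour strong flows
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)  # each surviving row collects the vertices of one cluster
    return clusters


# Toy graph (hypothetical): two triangles joined by the bridge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = [[0] * 6 for _ in range(6)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

clusters = mcl(A)
print(sorted(sorted(c) for c in clusters))
```

Larger inflation exponents tend to give finer clusters, smaller ones coarser clusters.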
SPECTRAL ANALYSIS
3.1 The functions of the adjacency matrix
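For the undirected example graph on four vertices (1, 2, 3, 4)

$$
A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}
$$

one defines two functions of the adjacency matrix, $N = K^{-1} A$ and $L = K - A$, where K is the diagonal matrix of the degrees:

$$
K = \begin{pmatrix} k_1 & 0 & \dots & 0 \\ 0 & k_2 & \dots & 0 \\ \dots & \dots & \dots & \dots \\ 0 & 0 & \dots & k_n \end{pmatrix},
\qquad k_i = \sum_{j=1,n} a_{ij}
$$

Normal Matrix

$$
N = \begin{pmatrix} a_{11}/k_1 & a_{12}/k_1 & \dots & a_{1n}/k_1 \\ a_{21}/k_2 & a_{22}/k_2 & \dots & a_{2n}/k_2 \\ \dots & \dots & \dots & \dots \\ a_{n1}/k_n & a_{n2}/k_n & \dots & a_{nn}/k_n \end{pmatrix}
$$

Laplacian Matrix

$$
L = \begin{pmatrix} k_1 & -a_{12} & \dots & -a_{1n} \\ -a_{21} & k_2 & \dots & -a_{2n} \\ \dots & \dots & \dots & \dots \\ -a_{n1} & -a_{n2} & \dots & k_n \end{pmatrix}
$$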
SPECTRAL ANALYSIS
3.1 The functions of the adjacency matrix
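Writing the degrees explicitly, the Laplacian matrix is

$$
L = \begin{pmatrix} \sum_{j \neq 1} a_{1j} & -a_{12} & \dots & -a_{1n} \\ -a_{21} & \sum_{j \neq 2} a_{2j} & \dots & -a_{2n} \\ \dots & \dots & \dots & \dots \\ -a_{n1} & -a_{n2} & \dots & \sum_{j \neq n} a_{nj} \end{pmatrix}
$$

If $\phi' = L\phi$, with $\phi = (\phi_1, \phi_2, \dots, \phi_n)^T$, then

$$
\phi'_i = k_i \phi_i - \sum_{j=1,n} a_{ij} \phi_j
$$

which is the discrete analogue of $-\nabla^2 \phi$.

For a graph without self-loops ($a_{ii} = 0$) the Normal matrix is

$$
N = \begin{pmatrix} 0 & a_{12}/k_1 & \dots & a_{1n}/k_1 \\ a_{21}/k_2 & 0 & \dots & a_{2n}/k_2 \\ \dots & \dots & \dots & \dots \\ a_{n1}/k_n & a_{n2}/k_n & \dots & 0 \end{pmatrix}
$$

The elements of matrix N give the probability with which a field (e.g. a random walker) passes from a vertex i to its neighbours.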
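Both matrices are easy to build from the adjacency matrix, and their row sums give a quick sanity check: every row of N sums to one (it is a set of transition probabilities) and every row of L sums to zero. The sketch below assumes an undirected graph without isolated vertices or self-loops; the function names are illustrative.

```python
def normal_matrix(a):
    """N = K^-1 A: divide every row of A by the degree of its vertex."""
    return [[a_ij / sum(row) for a_ij in row] for row in a]


def laplacian_matrix(a):
    """L = K - A: degrees on the diagonal, minus the adjacency elsewhere."""
    n = len(a)
    return [[(sum(a[i]) if i == j else 0) - a[i][j] for j in range(n)] for i in range(n)]


# Undirected four-vertex example graph.
A = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
]

N = normal_matrix(A)
L = laplacian_matrix(A)

for row in N:
    print([round(x, 3) for x in row])
for row in L:
    print(row)
```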
SPECTRAL ANALYSIS
3.2 The block properties in clustered graphs
0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 0 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0
In a very clustered graph, the adjacency matrix can be put in a block form.
SPECTRAL ANALYSIS
3.2 The block properties in clustered graphs
Given this probabilistic interpretation of the matrix N, we have a series of results. For example:
• One eigenvalue is equal to one.
• The related eigenvector is constant.
Consider the case of disconnected subclusters: the matrix N is made of blocks, and an eigenvector with eigenvalue one is built by joining the constant eigenvectors of the blocks (the constant can be different from block to block!).
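The block argument can be checked numerically on a toy graph made of two disconnected triangles: a vector that is constant on each block, with a different constant per block, is left unchanged by N. The graph, the constants and the helper names below are illustrative assumptions.

```python
def normal_matrix(a):
    """N = K^-1 A for a graph without isolated vertices."""
    return [[a_ij / sum(row) for a_ij in row] for row in a]


def apply(m, x):
    """Matrix-vector product."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]


# Two disconnected triangles: vertices 0-2 and vertices 3-5.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

N = normal_matrix(A)
x = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]  # constant on each block, different constants
Nx = apply(N, x)
print(Nx)
```

When the blocks are only weakly connected instead of disconnected, the corresponding eigenvectors are no longer exactly piecewise constant, but they still show plateaus on the clusters.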
SPECTRAL ANALYSIS
3.3 Eigenvalues and Communities
It is possible to express the eigenvector problem as a search for a minimum under a constraint:
1. Define a fictitious quantity x for the sites of the graph.
2. Define a suitable function z on these x’s (a “distance”).
3. Define a suitable constraint on these x’s (to avoid having all equal or all 0).
For example

$$
z(x) = \sum_{i,j=1,N} (x_i - x_j)^2 w_{ij}
$$

where the $x_i$ are values assigned to nodes, with some constraint expressed by

$$
\sum_{i,j=1,N} x_i x_j m_{ij} = 1 \qquad \text{(A)}
$$

Stationary points of z(x) + constraint (A) → Lagrange multiplier
SPECTRAL ANALYSIS
3.3 Eigenvalues and Communities
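$$
z(x) = \sum_{i,j=1,N} (x_i - x_j)^2 w_{ij}, \qquad \sum_{i,j=1,N} x_i x_j m_{ij} = 1
$$

The stationary points satisfy, for every i,

$$
\frac{\partial z(x)}{\partial x_i} - \mu \frac{\partial}{\partial x_i} \Big( \sum_{i,j=1,N} x_i x_j m_{ij} \Big) = 0
$$

For symmetric $w_{ij}$ and $m_{ij}$, after dividing by a common factor 2, this gives

$$
2 \Big[ x_i \sum_{j=1,N} w_{ij} - \sum_{j=1,N} w_{ij} x_j \Big] - \mu \sum_{j=1,N} m_{ij} x_j = 0
$$

In matrix form, with $A_w = (w_{ij})$ and $K_w = \mathrm{diag}\big(\sum_{j} w_{ij}\big)$,

$$
(K_w - A_w)\, x = \frac{\mu}{2}\, M x
$$

$$
M = K_w: \qquad K_w^{-1} A_w\, x = \Big(1 - \frac{\mu}{2}\Big) x
$$

Lagrange Multiplier = Normal Eigenvalue problem

$$
M = 1: \qquad (K_w - A_w)\, x = \frac{\mu}{2}\, x
$$

Lagrange Multiplier = Laplacian Eigenvalue problem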
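To get a feeling for why minimizing z under a normalization constraint favours assignments that are flat inside communities, one can evaluate z(x) for two assignments with the same norm on a toy graph. The graph (two triangles joined by one edge), the unit weights and the two test vectors are assumptions chosen for illustration.

```python
def z(x, w):
    """z(x) = sum over all ordered pairs (i, j) of (x_i - x_j)^2 * w_ij."""
    n = len(x)
    return sum((x[i] - x[j]) ** 2 * w[i][j] for i in range(n) for j in range(n))


# Toy graph (hypothetical): two triangles joined by the bridge 2-3, unit weights.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
W = [[0] * 6 for _ in range(6)]
for u, v in edges:
    W[u][v] = W[v][u] = 1

x_blocks = [1, 1, 1, -1, -1, -1]       # constant inside each triangle
x_alternating = [1, -1, 1, -1, 1, -1]  # ignores the community structure

z_blocks = z(x_blocks, W)
z_alternating = z(x_alternating, W)
print(z_blocks, z_alternating)
```

Both vectors have the same sum of squares, so with M = 1 they satisfy the same normalization; the block-constant one pays only for the single edge between the triangles.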
WORD ASSOCIATION NETWORK
4.1 The experimental data
The data are collected through a psychological experiment: persons (about 100) are given a single word as a stimulus, e.g. “House”. They must answer with the first word that comes to their mind, e.g. “Family”. Answers are later given as new stimuli, so that a network of average associations forms.
Steyvers, M. and Tenenbaum, J.B. (2005). The large scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29, 41–78.
A path from “Volcano” to “Ache”
WORD ASSOCIATION NETWORK
4.1 The experimental data
Capocci, A., Servedio, V. D. P., Caldarelli, G., and Colaiori, F. (2005). Detecting communities in large networks. Physica A, 352, 669–676.
The number of connections (i.e. the degree of nodes) is power-law distributed
WORD ASSOCIATION NETWORK
4.2 The community structure
Word         Correlation   Word         Correlation   Word        Correlation
science      1             literature   1             piano       1
scientific   0.994         dictionary   0.994         cello       0.993
chemistry    0.990         editorial    0.990         fiddle      0.992
physics      0.988         synopsis     0.988         viola       0.990
concentrate  0.973         words        0.987         banjo       0.988
thinking     0.973         grammar      0.986         saxophone   0.985
test         0.973         adjective    0.983         director    0.984
lab          0.969         chapter      0.982         violin      0.983
brain        0.965         prose        0.979         clarinet    0.983
equation     0.963         topic        0.976         oboe        0.983
examine      0.962         English      0.975         theater     0.982
Therefore we expect similar words to be on the same plateau. We can measure the correlation between the values of various vertices averaged over 10 different eigenvectors.
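One way to turn this idea into a number is a Pearson-like correlation between the components that two vertices take in a small set of eigenvectors, with the averages running over the eigenvectors. The sketch below assumes that reading of the sentence above; the function name and the component values are invented for illustration and are not data from the experiment.

```python
import math


def correlation(x_i, x_j):
    """Correlation between the components of two vertices, averaged over the eigenvectors."""
    n = len(x_i)
    mean_i = sum(x_i) / n
    mean_j = sum(x_j) / n
    cov = sum(a * b for a, b in zip(x_i, x_j)) / n - mean_i * mean_j
    var_i = sum(a * a for a in x_i) / n - mean_i ** 2
    var_j = sum(b * b for b in x_j) / n - mean_j ** 2
    return cov / math.sqrt(var_i * var_j)


# Hypothetical components of three vertices in four eigenvectors.
word_a = [0.2, -0.1, 0.4, 0.3]
word_b = [0.4, -0.2, 0.8, 0.6]    # same plateau pattern as word_a
word_c = [-0.2, 0.1, -0.4, -0.3]  # opposite pattern

r_ab = correlation(word_a, word_b)
r_ac = correlation(word_a, word_c)
print(round(r_ab, 3), round(r_ac, 3))
```

Vertices that sit on the same plateaus across the eigenvectors get a correlation close to one, which is the quantity one would rank to list the words closest to a given word.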
http://www.wikipedia.org
WIKIPEDIA
5.1 Introduction
A Nature investigation aimed to find out whether Wikipedia is an authoritative source of information compared with established sources such as Encyclopedia Britannica.
Among 42 entries tested, the difference in accuracy was not particularly great:
• the average science entry in Wikipedia contained around four inaccuracies;
• the one in Britannica, about three.
On the other hand, the articles on Wikipedia are longer on average than those of Britannica. This implies a lower rate of errors per unit of text in Wikipedia.
WIKIPEDIA
5.2 The network properties
We generated six wikigraphs, wikiEN, wikiDE, wikiFR, wikiES, wikiIT and wikiPT, from the English, German, French, Spanish, Italian and Portuguese datasets, respectively. The graphs were obtained from an old dump of June 13, 2004. We are not using the current data due to disk space restrictions: the English dataset of June 2005 is more than 36 GB compressed, that is about 200 GB expanded.
WIKIPEDIA
5.2 The network properties
Figure: In-degree (empty symbols) and out-degree (filled symbols) occurrence distributions for the wikigraphs in English (o) and Portuguese ().
The degree distribution shows fat tails that can be approximated by a power-law function of the kind

$$
P(k) \sim k^{-\gamma}
$$

where the exponent is the same for both in-degree and out-degree.
In the case of the WWW, $2 \le \gamma_{in} \le 2.1$.
Capocci, A., et al. (2006). Preferential attachment in the growth of social networks: The internet encyclopedia Wikipedia. Physical Review E, 74, 036116
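A common way to estimate the exponent of a power-law tail is the maximum-likelihood formula for a continuous power law above a cutoff, $\gamma \approx 1 + n / \sum_i \ln(k_i / k_{min})$. The sketch below applies it to synthetic data and is not the procedure used in the paper cited above; the cutoff `k_min`, the sample size, the seed and the function names are assumptions made for illustration.

```python
import math
import random


def estimate_exponent(degrees, k_min=1.0):
    """Maximum-likelihood exponent of P(k) ~ k^-gamma for k >= k_min (continuous approximation)."""
    tail = [k for k in degrees if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / k_min) for k in tail)


def sample_power_law(gamma, size, k_min=1.0, seed=0):
    """Draw synthetic values with density proportional to k^-gamma by inverse transform."""
    rng = random.Random(seed)
    return [k_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(size)]


# Synthetic check with a known exponent (2.1, the upper value quoted for the WWW in-degree).
samples = sample_power_law(gamma=2.1, size=20000)
gamma_hat = estimate_exponent(samples)
print(round(gamma_hat, 2))
```

On real degree data the choice of `k_min` matters, since only the tail is expected to follow the power law.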
WIKIPEDIA
5.2 The network properties
Figure: The average neighbours’ in-degree, computed along incoming edges, as a function of the in-degree for the English (o) and Portuguese () wikigraphs.
As regards the assortativity (as measured by the average degree of the neighbours of a vertex with degree k) there is no evidence of any assortative behaviour.
WIKIPEDIA
5.3 The growth of Wikipedia
Given the history of growth, one can verify the hypothesis of preferential attachment. This is done by means of the histogram P(k), which gives the number of vertices (whose degree is k) acquiring new connections at time t. This quantity is weighted by the factor
N(t)/n(k,t)
where N(t) is the number of vertices at time t and n(k,t) is the number of vertices with degree k at time t.
We find preferential attachment for both in-degree and out-degree.
Figure: English (o) and Portuguese (). White = in-degree, filled = out-degree.
WIKIPEDIA
5.4 The communities in Wikipedia
Taxonomy
The categorization provided by Wikipedia imposes a taxonomy on the pages.
WIKIPEDIA
5.4 The communities in Wikipedia
Given different wikigraphs one can compute the frequency of the category sizes in the various systems
WIKIPEDIA
5.4 The communities in Wikipedia
Similarly, the cluster size frequency distribution (computed with the MCL algorithm) can also be considered.
Qualitatively, the agreement is rather good. But are they the same?
WIKIPEDIA
5.4 The communities in Wikipedia
NOT REALLY! The power-law shape is probably a very common feature of any categorization.
SUMMARY
Communities represent an important categorization of graphs. Methods to detect them vary according to the specific case study.
• SMALL GRAPHS (motifs, clustering coefficient)
• LARGE GRAPHS
  • FUNCTION OF VERTICES (HITS, vertex similarity)
  • CENTRALITY (Girvan-Newman algorithm)
  • DIFFUSION ON THE GRAPH
    • MCL algorithm
    • Spectral analysis of the stochastic matrices associated with the graph
SHAMELESS ADVERTISEMENT