1
Community detection and proximity measures Maximilien Danisch 1,2 , Jean-Loup Guillaume 1 and Bénédicte Le Grand 2 1 LIP6 (CNRS – UPMC) and 2 CRI (Université Paris 1) Context: in complex networks, there are communities networks nodes edges communities Wikipedia Wikipedia pages hypertext links Wikipedia categories Facebook profiles friendships colleagues, families, sports club P2P peers file exchanges communities of interests Problematics: How to find these communities? I The greedy optimization of a quality function suffers from: (i) local minima and (ii) hidden scale parameters. I Let’s look for an alternative approach! Idea 1: use the related notion of proximity measure Rank nodes according to their proximity to a given node. Irregularities in the decrease can indicate the presence of communities: I Often a powerlaw is obtained = no scale can be extracted = the node belongs to several communities of various sizes. Idea 2: use the notion of multi-ego-centered community I While one node generaly belongs to many communities, a small set of nodes, e.g. 2, is enough to define a single community. Idea 3: a proximity measure has to be parametrized n common neighbors 2 1 3 Which node is closer to 1: 2 or 3? What to do with HUBs? Prox αβλ (i, j )= 1 d β j λ l =0 α l N l (i, j ) N l (i, j ): number of paths of length l between i and j . d j : degree of node j. I If we know some nodes that are near to one another, e.g. nodes belonging to a same community, we can learn the parameters that rank these nodes as close as possible to one another. AUC optimization: the AUC is a measure of the accuracy of a classifier, it is equal to the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one. 0.0001 0.001 0.01 0.1 1 ALPHA 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 BETA 0.989 0.9892 0.9894 0.9896 0.9898 0.99 0.9902 0.9904 Application 1: find all communities of a node A- Choose a set of candidates: C- Clean and label the communities: B- Find bi-ego-centered communities: I We can use a similar framework to find all overlapping communities in a networks. Application 2: community completion Average number of paths of length 1 to 6 between two nodes in the same community and two nodes in different communities: Learning a proximity measure, combining scorings and unfolding the community: 0 500 1000 1500 2000 2500 3000 3500 4000 AMONG THE K FIRST NODES 0.0 0.2 0.4 0.6 0.8 1.0 PROPORTION OF NODES OF THE COMMUNITY DGLG KATZ T and F CAROP DISTANCE

Community detection and proximity measures · Community detection and proximity measures MaximilienDanisch1,2,Jean-LoupGuillaume1 andBénédicteLeGrand2 1LIP6(CNRS–UPMC)and2CRI(UniversitéParis1)

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Community detection and proximity measures · Community detection and proximity measures MaximilienDanisch1,2,Jean-LoupGuillaume1 andBénédicteLeGrand2 1LIP6(CNRS–UPMC)and2CRI(UniversitéParis1)

Community detection and proximity measuresMaximilien Danisch1,2, Jean-Loup Guillaume1 and Bénédicte Le Grand2

1 LIP6 (CNRS – UPMC) and 2 CRI (Université Paris 1)

Context: in complex networks, there are communities

networks nodes edges communitiesWikipedia Wikipedia pages hypertext links Wikipedia categoriesFacebook profiles friendships colleagues, families, sports club

P2P peers file exchanges communities of interests

Problematics: How to find these communities?I The greedy optimization of a quality function suffers from:(i) local minima and (ii) hidden scale parameters.I Let’s look for an alternative approach!

Idea 1: use the related notion of proximity measure

Rank nodes according to their proximity to a given node.Irregularities in the decrease can indicate the presence of communities:

I Often a powerlaw is obtained = no scale can be extracted= the node belongs to several communities of various sizes.

Idea 2: use the notion of multi-ego-centered community

I While one node generaly belongs to many communities, a smallset of nodes, e.g. 2, is enough to define a single community.

Idea 3: a proximity measure has to be parametrized

n common neighbors

2 1 3

Which node is closer to 1: 2 or 3?

What to do with HUBs?

Proxαβλ(i, j) = 1dβj

λ∑l=0αlNl(i, j)

Nl(i, j): number of paths of length lbetween i and j. dj: degree of node j.

I If we know some nodes that are near to one another, e.g. nodesbelonging to a same community, we can learn the parameters thatrank these nodes as close as possible to one another.

AUC optimization: the AUC is ameasure of the accuracy of a classifier,it is equal to the probability that aclassifier ranks a randomly chosenpositive instance higher than arandomly chosen negative one.

0.0001 0.001 0.01 0.1 1

ALPHA

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

BETA

0.989

0.9892

0.9894

0.9896

0.9898

0.99

0.9902

0.9904

Application 1: find all communities of a node

A- Choose a set of candidates:

C- Clean and label the communities:

0 10 20 30 40 50

0

10

20

30

40

500.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

B- Find bi-ego-centered communities:

100 101 102 103 104 105 10610-6

10-5

10-4

10-3

10-2

10-1

100

Chess BoxingChessboardMIN: Chess notation

100 101 102 103 104 105 10610-6

10-5

10-4

10-3

10-2

10-1

100

Chess BoxingAchi, NaganoMIN: Morabaraba

I We can use a similar framework to find all overlappingcommunities in a networks.

Application 2: community completion

Average number of paths of length 1 to 6 between two nodes in the samecommunity and two nodes in different communities:

100 101 102 103 104

DEGREE OF THE TARGET NODE10-7

10-6

10-5

10-4

10-3

10-2

10-1

100

NUM

BER

OF N

BP O

F LE

NGTH

1 INOUT

100 101 102 103 104

DEGREE OF THE TARGET NODE10-3

10-2

10-1

100

101

102

NUM

BER

OF N

BP O

F LE

NGTH

2 INOUT

100 101 102 103 104

DEGREE OF THE TARGET NODE10-1

100

101

102

103

104

NUM

BER

OF N

BP O

F LE

NGTH

3 INOUT

100 101 102 103 104

DEGREE OF THE TARGET NODE103

104

105

106

NUM

BER

OF N

BP O

F LE

NGTH

4 INOUT

100 101 102 103 104

DEGREE OF THE TARGET NODE105

106

107

108

109

NUM

BER

OF N

BP O

F LE

NGTH

5 INOUT

100 101 102 103 104

DEGREE OF THE TARGET NODE108

109

1010

1011

1012

NUM

BER

OF N

BP O

F LE

NGTH

6 INOUT

Learning a proximity measure, combining scorings and unfolding the community:

0 500 1000 1500 2000 2500 3000 3500 4000

AMONG THE K FIRST NODES

0.0

0.2

0.4

0.6

0.8

1.0

PR

OPO

RTIO

N O

F N

OD

ES

OF

TH

E C

OM

MU

NIT

Y

DGLG

KATZ

T and F

CAROP

DISTANCE2.5

2.0

1.5

1.0

0.5

0.0

0.5

1.0

1.5

SECO

ND D

ERIV

ATIV

E

10-9

10-8

10-7

10-6

10-5

10-4

10-3

PROX

IMIT

Y SC

ORE

100 101 102 104 105 106103

ALL NODESTRAINING SETTEST SETSECONDE DERIVATIVE

RANK OF THE NODE