DISCOVERING IMPORTANT NODES AND EDGES

Source: cazabetremy.fr/Teaching/catedra/2-IdentifyNE.pdf


MACRO-MICRO

• First course: description of the graph at the macro level

• Second course: micro level

• How to describe each element?

• How to find “exceptional” elements?


NODE

• We can measure a node's importance using so-called centrality measures.

• Bad term: nothing to do with being central in general

• Common practice: run many centralities and check relation between centralities and properties/identity of nodes


NODE DEGREE

• Degree: how many neighbors

• Often enough to find important nodes
  ‣ Main characters of a series talk with the most people
  ‣ Largest airports have the most connections
  ‣ …

• But not always
  ‣ Facebook users with the most friends are spam accounts
  ‣ Webpages/Wikipedia pages with the most links are simple lists of references
  ‣ …


NODE DEGREE

• In directed networks, degree is split into:
  ‣ In-degree
  ‣ Out-degree

• Example: web pages
  ‣ Highest out-degree: list of references
  ‣ Highest in-degree: website that attracts a lot of links, probably interesting
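A minimal sketch of in-degree and out-degree with networkx, on a made-up toy web graph (the page names are hypothetical and only serve the illustration):

```python
import networkx as nx

# Hypothetical toy web graph: an edge (u, v) means "page u links to page v".
web = nx.DiGraph()
web.add_edges_from([
    ("list", "a"), ("list", "b"), ("list", "c"),  # a list of references
    ("a", "hub"), ("b", "hub"), ("c", "hub"),     # a page many others link to
])

in_deg = dict(web.in_degree())    # number of incoming links per page
out_deg = dict(web.out_degree())  # number of outgoing links per page

most_linked = max(in_deg, key=in_deg.get)     # highest in-degree
most_linking = max(out_deg, key=out_deg.get)  # highest out-degree
print(most_linked, most_linking)
```

On this toy graph the reference list has the highest out-degree, while the page that attracts links has the highest in-degree.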


NODE STRENGTH

• Strength: Degree in a weighted network

• Sum of the weights of the edges incident to the node
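As a small illustration, networkx returns the strength when the degree view is given a weight attribute. The graph and its weights below are invented for the example:

```python
import networkx as nx

# Hypothetical weighted graph: (node, node, weight).
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 5.0),
    ("a", "c", 1.0),
    ("b", "c", 2.0),
    ("c", "d", 0.5),
])

degree = dict(G.degree())                   # number of neighbors
strength = dict(G.degree(weight="weight"))  # sum of incident edge weights

print(degree)
print(strength)
```

Here node c has the most neighbors, but node b has the highest strength, so the two rankings can disagree.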


NODE CLUSTERING COEFFICIENT

• Clustering coefficient: already seen for global analysis

• The local version

• Tells you if the neighbors of the node are connected

• Be careful!
  ‣ Degree 2: value 0 or 1
  ‣ Degree 1000: not 0 or 1 (usually)
  ‣ Ranking nodes of different degrees by this value is not meaningful
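A short sketch of the local clustering coefficient with networkx, on a hand-made graph chosen to show the degree-2 case, where the value can only be 0 or 1:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    (0, 1), (0, 2), (1, 2),  # triangle: the two neighbors of node 0 are connected
    (3, 4), (3, 5),          # node 3: its two neighbors are not connected
])

cc = nx.clustering(G)  # dict: node -> local clustering coefficient
print(cc[0], cc[3])
```

Both node 0 and node 3 have degree 2, yet one gets the maximum value and the other the minimum, which is why comparing such values with those of high-degree nodes needs care.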


NODE CLUSTERING COEFFICIENT

• Clustering coefficient: already seen for global analysis

• Can be used as a proxy for “community” membership:
  ‣ If a node belongs to a single group: high CC
  ‣ If a node belongs to several groups: lower CC


NODE BETWEENNESS

• Betweenness centrality:
  ‣ 1) Compute all shortest paths between all pairs of nodes
  ‣ 2) Count the fraction of them going through the node

• Idea: if the node is “between” many nodes, then it is important.

• Related to the notion of “flow” of information in the network


NODE BETWEENNESS

• Betweenness centrality:
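A common way to write this definition, assuming that σ_st denotes the number of shortest paths between s and t and σ_st(v) the number of those that pass through v (the notation is an assumption, in line with the "sigma: shortest paths" convention used for edges):

```latex
C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}
```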


NODE BETWEENNESS

• Betweenness centrality:

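• Computationally expensive on large graphs (exact computation with Brandes' algorithm takes O(nm) time on unweighted graphs)

• Common approximation:
  ‣ Compute shortest paths from only k randomly sampled nodes (e.g. k=100)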
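A hedged sketch of both variants with networkx, whose `betweenness_centrality` accepts a `k` argument to sample source nodes. The barbell graph and the values `k=6`, `seed=42` are arbitrary choices for the illustration:

```python
import networkx as nx

# Two cliques of 5 nodes (0-4 and 6-10) joined through the single node 5.
G = nx.barbell_graph(5, 1)

# Exact: shortest paths from every node.
exact = nx.betweenness_centrality(G)

# Approximation: shortest paths from only k sampled source nodes.
approx = nx.betweenness_centrality(G, k=6, seed=42)

bridge_node = max(exact, key=exact.get)
print(bridge_node, exact[bridge_node], approx[bridge_node])
```

On a graph this small the sampling brings nothing; it becomes useful when the exact computation is too slow, at the price of noisier scores.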


NODE PAGERANK

• Idea: ranking webpages by relevance

• Problems with in-degree:
  ‣ Easy to fool
  ‣ Where the link comes from is ignored


NODE PAGERANK

• Solution: give each node an “authority” score, which in turn determines the score of other nodes

• Interpretation:
  ‣ Likelihood of reaching a particular page by clicking links at random

• Parameter:
  ‣ Probability of a random hop to any page, to avoid dead-end biases

• Computation:
  ‣ Principal eigenvector of the normalized link matrix (including random hops)
  ‣ Power method: random walks on the graph


NODE PAGERANK

• Interpretation: A node is important if many important nodes are linking to it.

• Often correlated with in-degree

• Allows finding the top of hierarchical structures:
  ‣ Commoners talk to their local deputy, deputies talk to ministers, ministers talk to the president: the president has low in-degree, but high PageRank.
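The power method can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the one used by any particular library: the damping value 0.85, the fixed number of iterations, the handling of dead ends (their score is spread over all nodes) and the toy hierarchy are all choices made for the example.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-method PageRank for a dict {node: [nodes it links to]}."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Random hop: with probability (1 - damping), jump to any node.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            targets = out_links[u]
            if targets:
                # Follow one of the outgoing links uniformly at random.
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dead end: spread the score evenly over all nodes.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank


# Hypothetical hierarchy: commoners -> deputies -> ministers -> president.
links = {"president": []}
for m in range(2):
    links[f"minister{m}"] = ["president"]
    for d in range(2):
        deputy = f"deputy{m}{d}"
        links[deputy] = [f"minister{m}"]
        for c in range(3):
            links[f"commoner{m}{d}{c}"] = [deputy]

rank = pagerank(links)
print(max(rank, key=rank.get))
```

In this toy hierarchy the president receives only 2 links, fewer than each deputy (3 links), yet ends up with the highest score because those links come from high-scoring nodes.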


NODE PAGERANK

• Then how does Google rank pages when we run a search?

• Create a subgraph of documents related to our topic

• Compute pagerank

• (Of course now it is much more complex…)


EIGENVECTOR CENTRALITY

• A node's score is its entry in the eigenvector associated with the largest eigenvalue of the adjacency matrix

• Crude version of the PageRank
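A minimal sketch with networkx's `eigenvector_centrality`, on a hand-made graph (a triangle with one extra pendant node) chosen only for illustration:

```python
import networkx as nx

# Triangle 0-1-2 plus a pendant node 3 attached to node 2.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])

ev = nx.eigenvector_centrality(G)  # dict: node -> eigenvector entry
most_central = max(ev, key=ev.get)
print(most_central, ev)
```

Node 3 is attached to the best-connected node but has a single neighbor, so it scores lower than nodes 0 and 1: a node's score depends on the scores of its neighbors, not only on how many it has.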

KATZ CENTRALITY

• Variant of PageRank and eigenvector centrality

• Katz centrality of node i:

$$C_{\mathrm{Katz}}(i) = \sum_{k=1}^{\infty} \sum_{j=1}^{n} \alpha^{k} \, (A^{k})_{ij}$$

  ‣ Sum over k: repeat for all distances, as long as possible (until convergence)
  ‣ Sum over j: sum over every node j
  ‣ α is a parameter; its strength α^k decreases at each iteration
  ‣ (A^k)_{ij}: number of different paths from i to j of length k

• In words: sum of the paths to all other nodes at each distance, multiplied by a factor that decreases with distance
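The sum can be sketched directly in plain Python by truncating it at a maximum length. This is a sketch, not a library implementation: the adjacency-matrix input, the names `alpha` and `max_k`, and the toy path graph are assumptions of the example, and the series only converges when alpha is smaller than the inverse of the largest eigenvalue of the adjacency matrix.

```python
def katz_centrality(adj, alpha, max_k=50):
    """Truncated Katz sum for an adjacency matrix given as a list of lists."""
    n = len(adj)
    scores = [0.0] * n
    # walks[i][j] = number of paths of length k from i to j (starts at k = 0).
    walks = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    factor = 1.0
    for _ in range(max_k):
        # One more step: multiply the current counts by the adjacency matrix.
        walks = [
            [sum(walks[i][m] * adj[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)
        ]
        factor *= alpha  # alpha^k, decreasing with distance
        for i in range(n):
            scores[i] += factor * sum(walks[i])
    return scores


# Hypothetical example: path graph 0 - 1 - 2.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
scores = katz_centrality(adjacency, alpha=0.1)
print(scores)
```

The middle node gets the highest score, and the two end nodes get the same score by symmetry.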


NODE CLOSENESS

• Farness: sum of the lengths of the shortest paths to all other nodes.

• Closeness: inverse of the farness

  ‣ Highest closeness = most central
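A minimal sketch of farness and closeness using a breadth-first search, assuming an unweighted, connected graph with at least two nodes, given as a dict of neighbor lists (this input format is an assumption of the example):

```python
from collections import deque


def closeness(adj):
    """Closeness = 1 / farness, for a connected graph {node: [neighbors]}."""
    result = {}
    for source in adj:
        # Breadth-first search gives shortest-path lengths from the source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        farness = sum(dist.values())
        result[source] = 1.0 / farness
    return result


# Hypothetical example: path graph 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = closeness(path)
print(scores)
```

Note that networkx's `closeness_centrality` additionally multiplies by the number of other reachable nodes, so its values differ from this raw inverse by a constant factor on a connected graph.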


HARMONIC CENTRALITY

• Harmonic centrality: related to closeness centrality, but sums the inverse distances to all other nodes instead of inverting the sum of distances
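The two measures can be written side by side, assuming d(i, j) denotes the shortest-path distance between i and j (the notation is an assumption). With the usual convention that 1/d(i, j) = 0 when j cannot be reached from i, the harmonic version stays well defined on disconnected graphs:

```latex
C_{\mathrm{closeness}}(i) = \frac{1}{\sum_{j \neq i} d(i, j)}
\qquad
C_{\mathrm{harmonic}}(i) = \sum_{j \neq i} \frac{1}{d(i, j)}
```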


OTHERS

• Many other centralities have been proposed

• The problem is how to interpret them

• Can be used as a supervised tool:
  ‣ Compute many centralities on all nodes
  ‣ Learn how to combine them to find chosen nodes
  ‣ Discover new similar nodes
  ‣ (roles in social networks, key elements in an infrastructure, …)
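A sketch of the first step only (computing many centralities on all nodes and arranging them as one feature vector per node), using networkx's bundled karate club graph as a stand-in dataset. The choice of measures is arbitrary, and the learning step itself is left out:

```python
import networkx as nx

G = nx.karate_club_graph()  # small social network bundled with networkx

measures = {
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "harmonic": nx.harmonic_centrality(G),
    "clustering": nx.clustering(G),
}

# One feature vector per node, with columns in a fixed (alphabetical) order.
columns = sorted(measures)
features = {v: [measures[name][v] for name in columns] for v in G.nodes}

print(columns)
print(features[0])
```

These vectors could then be passed, together with labels for the chosen nodes, to any supervised learning library.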


WHICH IS WHICH?

[Figure: six panels, A to F, to match with Harmonic, Closeness, Betweenness, Eigenvector, Katz, Degree]

Answers: A: Betweenness, B: Closeness, C: Eigenvector, D: Degree, E: Harmonic, F: Katz

TRY AGAIN :)

[Figure: four panels, A to D, to match with Degree, Betweenness, Closeness, Eigenvector]

Answers: A: Degree, B: Closeness, C: Betweenness, D: Eigenvector

EDGE CENTRALITIES


EDGES

• Most centralities can be computed for edges

• Methods based on flow are more natural for edges:
  ‣ Edge betweenness centrality: how many shortest paths go through the edge
  ‣ σ: number of shortest paths
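A minimal sketch with networkx's `edge_betweenness_centrality`, on a hand-made graph of two triangles joined by a single edge (chosen for the illustration):

```python
import networkx as nx

# Two triangles (0, 1, 2) and (3, 4, 5) joined by the single edge (2, 3).
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])

eb = nx.edge_betweenness_centrality(G)  # dict: edge (u, v) -> score
top_edge = max(eb, key=eb.get)
print(top_edge, eb[top_edge])
```

Every shortest path between the two triangles has to use the joining edge, so it gets the highest score.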


EDGES

Can you guess the edges of highest betweenness in the European rail network?


K-PATH EDGE CENTRALITY

s: source node

k-path: random walk of length k


CURRENT-FLOW BETWEENNESS

Analogy with an electrical circuit

How much current flows through the node when one unit of current is injected at a random node and collected at another random node (averaged over all pairs)

The current flowing through the i-th vertex is given by half of the sum of the absolute values of the currents flowing along the edges incident on that vertex

Average current flow over all pairs


CURRENT-FLOW BETWEENNESS

Also called random-walk betweenness

Average probability of going through the edge in a random walk from U to V, over all pairs of nodes (U, V)

COMMUNICABILITY BETWEENNESS CENTRALITY

Number of shortest paths of length < s

Number of shortest paths of length > s

Score for paths going through r / score for all paths


EDGE CENTRALITIES

• And of course, many more


SOME EXAMPLES ON REAL NETWORKS


WIKIPEDIA

• What are the most important pages on Wikipedia?

• Wikipedia network:
  ‣ Nodes are pages
  ‣ Links are hypertext links

• English-language Wikipedia: cultural bias!

• Results from http://wikirank-2018.di.unimi.it

WIKIPEDIA

[Figures: results for all pages (2018), Movies, Colombian personalities, Colombia]

(Side note: shortest paths in Wikipedia: https://www.sixdegreesofwikipedia.com)


USAGES OF CENTRALITIES

• Identifying important nodes/edges
  ‣ Search on the web or any document base
  ‣ Recommendation (products…)
  ‣ Social network analysis (criminal networks…)

• Identifying critical nodes/edges:
  ‣ High betweenness: a “bridge”, affects flow if it disappears
  ‣ High PageRank: in dependency graphs, many elements depend on that element (supply chain, production line…)

• Visualisation: size of nodes determined by centrality


LIBRARIES FOR GRAPH MANIPULATION


THE FASTEST

• Stanford Network Analysis Project (SNAP) library

• http://snap.stanford.edu

• Built by Jure Leskovec (Prof. at Stanford)

• C++ / Python interface

• Not many built-in functions but all the building-blocks

• Single machine (advantages and drawbacks)


THE MOST STATISTICAL

• graph-tool

• https://graph-tool.skewed.de

• C++ / Python interface

• Richer than SNAP, fast

• Best for statistical inference (Stochastic Block Model, see later)

• Cons: often difficult to install


THE SIMPLEST

• Networkx

• https://networkx.github.io

• Python

• Very simple syntax

• A lot of already implemented functions

• Cons: does not scale to large graphs


AND THE OTHERS

• Big Data frameworks
  ‣ Apache Giraph: http://giraph.apache.org
  ‣ Spark GraphX: https://spark.apache.org/graphx/
  ‣ + Efficient in distributed multi-computer environments
  ‣ - Few functions, poorly documented

• Java
  ‣ Jung (http://jung.sourceforge.net)
  ‣ GraphStream (http://graphstream-project.org) => dynamic graphs
  ‣ (poorly maintained)

• Other famous one:
  ‣ igraph (http://igraph.org) (R/C/Python) (2nd best for many things)
  ‣ …


NETWORKX

• For this class, I propose to use networkx
  ‣ Already included in Anaconda (standard Python package)
  ‣ Easiest to use (in my opinion)

• You are free to try another library, they all have strengths and weaknesses


NETWORKX

• Where to start:

• Tutorial on the website
  ‣ https://networkx.github.io/documentation/stable/tutorial.html

• The “Reference” page lists all methods, organized by category


NETWORKX

• Drawing with networkx: not recommended.

• Better to export and load with Gephi
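One possible way to do the export, sketched with networkx's `write_gexf` (GEXF is a format Gephi can open). The graph, the attribute name `betweenness`, and the output file name are placeholders for the example:

```python
import os
import tempfile

import networkx as nx

# Hypothetical small graph: two triangles joined by one edge.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])

# Store a centrality as a node attribute so it travels with the file
# and can be used in Gephi (for example to set node sizes).
nx.set_node_attributes(G, nx.betweenness_centrality(G), "betweenness")

# Write to a temporary directory; replace with any path of your choice.
path = os.path.join(tempfile.mkdtemp(), "example.gexf")
nx.write_gexf(G, path)
print(path)
```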


NETWORKX

• Proposed working environment:
  ‣ Python 3
  ‣ Jupyter notebook
    - Perfect for experimenting (avoids reloading graphs…)
    - Makes work easily reproducible
  ‣ IDE/text editor for more complex functions
    - You can call these functions from your notebook

• Additional libraries
  ‣ Pandas (handling tabular data: similar to R, spreadsheet logic)
  ‣ Seaborn (for plotting faster)
  ‣ Sklearn (data mining)
  ‣ => They simplify things, not complexity :)


NETWORKX Demo

(for airports: file with coordinates and countries)

1) Compute other centralities/network measures

2) Centrality of airports in Colombia?

3) What if we take only South American countries?

4) Can we compare graphs by continent?

5) The graph of connections between countries?

6) Exporting and plotting the graph…