
Page 1

Lecture 3

1. Different centrality measures of nodes

2. Hierarchical Clustering

3. Line graphs

Page 2

1. Centrality measures

Within graph theory and network analysis, there are various measures of the centrality of a vertex that quantify its relative importance within the graph.

We will discuss the following centrality measures:

• Degree centrality

• Betweenness centrality

• Closeness centrality

• Eigenvector centrality

• Subgraph centrality

Page 3

Degree centrality

Degree centrality is defined as the number of links incident upon a node, i.e. the degree of the node.

Degree centrality is often interpreted in terms of the immediate risk of a node catching whatever is flowing through the network (such as a virus or a piece of information).

The degree centrality of the blue nodes is higher.
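A minimal sketch in Python (assuming the networkx package; the small six-node star-like graph below is made up for illustration):

# Minimal sketch of degree centrality on a small made-up graph (networkx assumed).
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "c"), ("b", "c"), ("c", "d"), ("c", "e"), ("c", "f")])

# Raw degree centrality: number of links incident on each node.
raw = dict(G.degree())

# networkx's degree_centrality() normalizes by (n - 1), the maximum possible degree.
normalized = nx.degree_centrality(G)

print(raw)         # node "c" has degree 5
print(normalized)  # node "c" has 5 / (6 - 1) = 1.0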

Page 4

Betweenness centrality

The vertex betweenness centrality BC(v) of a vertex v is defined as follows:

BC(v) = \sum_{u \neq v \neq w} \frac{\sigma_{uw}(v)}{\sigma_{uw}}

Here σuw is the total number of shortest paths between nodes u and w, and σuw(v) is the number of those shortest paths that pass through node v.

Vertices that occur on many shortest paths between other vertices have higher betweenness than those that do not.

Page 5

[Figure: the example graph on vertices a, b, c, d, e, f used in the calculation below]

Betweenness centrality: calculation for node c

(u,w)   σuw   σuw(c)   σuw(c)/σuw
(a,b)    1      0        0
(a,d)    1      1        1
(a,e)    1      1        1
(a,f)    1      1        1
(b,d)    1      1        1
(b,e)    1      1        1
(b,f)    1      1        1
(d,e)    1      0        0
(d,f)    1      0        0
(e,f)    1      0        0

Betweenness centrality of node c = 6

Betweenness centrality of node a = 0
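A minimal Python sketch of this calculation (assuming networkx, and assuming an edge list consistent with the table above, since the figure itself is not reproduced here):

# Sketch of the betweenness calculation above, assuming an edge list consistent
# with the table (a and b adjacent, c adjacent to all others, d, e, f mutually adjacent).
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("a", "c"), ("b", "c"),   # a, b attached to c
    ("c", "d"), ("c", "e"), ("c", "f"),   # c is the bridge
    ("d", "e"), ("d", "f"), ("e", "f"),   # d, e, f form a triangle
])

# For undirected graphs, normalized=False returns the raw sum of
# sigma_uw(v) / sigma_uw over all unordered pairs (u, w).
bc = nx.betweenness_centrality(G, normalized=False)
print(bc["c"])  # expected 6, as in the table
print(bc["a"])  # expected 0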

Page 6

Hue (from red=0 to blue=max) shows the node betweenness.

Betweenness centrality

•Nodes of high betweenness centrality are important for transport.

•If they are blocked, transport becomes less efficient; on the other hand, if their capacity is improved, transport becomes more efficient.

•Edge betweenness is calculated using a similar concept.

http://en.wikipedia.org/wiki/Betweenness_centrality#betweenness

Page 7

Closeness centrality

The farness of a vertex is the sum of the shortest-path distances from the vertex to all other vertices in the graph. The reciprocal of farness is the closeness centrality (CC).

Here, d(v,t) is the shortest distance between vertex v and vertex t

Closeness centrality can be viewed as the efficiency of a vertex in spreading information to all other vertices

CC(v) = \frac{1}{\sum_{t \in V \setminus \{v\}} d(v,t)}
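A minimal Python sketch (assuming networkx and the same made-up six-node graph as in the betweenness sketch):

# Sketch of closeness centrality as the reciprocal of farness.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("a", "c"), ("b", "c"),
                  ("c", "d"), ("c", "e"), ("c", "f"),
                  ("d", "e"), ("d", "f"), ("e", "f")])

def closeness(G, v):
    # Farness: sum of shortest-path distances from v to every other vertex.
    dist = dict(nx.single_source_shortest_path_length(G, v))
    farness = sum(d for t, d in dist.items() if t != v)
    return 1.0 / farness

print(closeness(G, "c"))  # c reaches every vertex in one step: 1 / 5 = 0.2
# Note: nx.closeness_centrality(G) reports (n - 1) / farness, a normalized variant.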

Page 8

Eigenvector centrality

Let A be the adjacency matrix of a graph, λ the largest eigenvalue of A, and x the corresponding eigenvector. Then

A x = λ x  -----(1)   (dimensions: A is N×N, x is N×1, λx is N×1)

The ith component of the eigenvector x then gives the eigenvector centrality score of the ith node in the network.

From (1),

x_i = \frac{1}{\lambda} \sum_{j=1}^{N} A_{ij} x_j

• Therefore, for any node, the eigenvector centrality score is proportional to the sum of the scores of all nodes connected to it.

• Consequently, a node has a high EC value either if it is connected to many other nodes or if it is connected to nodes that themselves have high EC.


λ is the largest root of the characteristic equation |A - λI| = 0, where I is the identity matrix.
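A minimal numpy sketch of eigenvector centrality on the same made-up example graph, computing the leading eigenvector of A directly:

# Eigenvector centrality via the leading eigenvector of the adjacency matrix.
import numpy as np

nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("c", "d"), ("c", "e"), ("c", "f"),
         ("d", "e"), ("d", "f"), ("e", "f")]

idx = {n: i for i, n in enumerate(nodes)}
A = np.zeros((6, 6))
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1

# Largest eigenvalue and its eigenvector (A is symmetric, so eigh is appropriate).
vals, vecs = np.linalg.eigh(A)
x = vecs[:, np.argmax(vals)]
x = np.abs(x) / np.abs(x).sum()   # fix the sign and normalize for readability

for n in nodes:
    print(n, round(x[idx[n]], 3))  # "c" gets the largest score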

Page 9

Subgraph centrality

The number of closed walks of length k starting and ending at vertex i in the network is given by the local spectral moment μk(i), which is simply defined as the ith diagonal entry of the kth power of the adjacency matrix A:

\mu_k(i) = (A^k)_{ii}

The subgraph centrality of vertex i, as defined in the paper cited below, is the weighted sum of these moments:

SC(i) = \sum_{k=0}^{\infty} \frac{\mu_k(i)}{k!} = (e^{A})_{ii}

Closed walks can be trivial or nontrivial and are directly related to the subgraphs of the network.

Subgraph Centrality in Complex Networks, Physical Review E 71, 056103(2005)
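A minimal Python sketch (assuming numpy and scipy), computing SC(i) = (e^A)_ii on the same made-up six-node example graph:

# Subgraph centrality as the diagonal of the matrix exponential of A.
import numpy as np
from scipy.linalg import expm

A = np.array([[0, 1, 1, 0, 0, 0],    # a: b, c
              [1, 0, 1, 0, 0, 0],    # b: a, c
              [1, 1, 0, 1, 1, 1],    # c: a, b, d, e, f
              [0, 0, 1, 0, 1, 1],    # d: c, e, f
              [0, 0, 1, 1, 0, 1],    # e: c, d, f
              [0, 0, 1, 1, 1, 0]],   # f: c, d, e
             dtype=float)

sc = np.diag(expm(A))                 # subgraph centrality of each vertex
print(dict(zip("abcdef", np.round(sc, 3))))

# For comparison, networkx provides nx.subgraph_centrality(G) with the same definition.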

Page 10

Subgraph centrality

Adjacency matrix: Muv = 1 if there is an edge between nodes u and v, and 0 otherwise.

M =

0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 0 1 1 1 0 0 0 0 0 0 0 0
0 1 1 0 1 1 0 1 0 0 0 0 0 0
0 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 1 1 1 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 1 0 0 1 1
0 0 0 0 0 0 0 0 1 0 1 0 1 1
0 0 0 0 0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 0 1 1 0 0 1 0

Page 11

Subgraph centrality

M² =

1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 4 2 2 3 2 1 1 0 0 0 0 0 0
1 2 4 3 2 3 1 1 0 0 0 0 0 0
1 2 3 5 2 3 1 0 1 0 0 0 0 0
0 3 2 2 3 2 1 1 0 0 0 0 0 0
1 2 3 3 2 5 0 1 0 0 1 0 0 0
0 1 1 1 1 0 2 0 0 1 0 0 0 0
0 1 1 0 1 1 0 2 0 1 0 0 1 1
0 0 0 1 0 0 0 0 4 2 1 1 2 2
0 0 0 0 0 0 1 1 2 4 0 1 2 2
0 0 0 0 0 1 0 0 1 0 2 0 1 1
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 1 2 2 1 0 4 2
0 0 0 0 0 0 0 1 2 2 1 1 2 3

(M²)uv for u ≠ v is the number of common neighbors of nodes u and v; the diagonal entry (M²)uu is the local spectral moment μ2(u).
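A quick numpy check of these two observations for the 14-node adjacency matrix M shown above:

# Off-diagonal entries of M^2 count common neighbors; the diagonal entries
# equal the node degrees, i.e. mu_2(i).
import numpy as np

M = np.array([
    [0,1,0,0,0,0,0,0,0,0,0,0,0,0],
    [1,0,1,1,0,1,0,0,0,0,0,0,0,0],
    [0,1,0,1,1,1,0,0,0,0,0,0,0,0],
    [0,1,1,0,1,1,0,1,0,0,0,0,0,0],
    [0,0,1,1,0,1,0,0,0,0,0,0,0,0],
    [0,1,1,1,1,0,1,0,0,0,0,0,0,0],
    [0,0,0,0,0,1,0,0,0,0,1,0,0,0],
    [0,0,0,1,0,0,0,0,1,0,0,0,0,0],
    [0,0,0,0,0,0,0,1,0,1,0,0,1,1],
    [0,0,0,0,0,0,0,0,1,0,1,0,1,1],
    [0,0,0,0,0,0,1,0,0,1,0,0,0,0],
    [0,0,0,0,0,0,0,0,0,0,0,0,1,0],
    [0,0,0,0,0,0,0,0,1,1,0,1,0,1],
    [0,0,0,0,0,0,0,0,1,1,0,0,1,0],
])

M2 = M @ M
print(np.array_equal(np.diag(M2), M.sum(axis=1)))  # True: mu_2(i) = degree of i
print(M2[1, 4])  # 3 common neighbors of nodes 2 and 5 (0-based indices 1 and 4)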

Page 12

Table 2. Summary of results of eight real-world complex networks.

Page 13

Hierarchical Clustering

Example protein-protein interaction pairs: (AtpB, AtpA), (AtpG, AtpE), (AtpA, AtpH), (AtpB, AtpH), (AtpG, AtpH), (AtpE, AtpH)

Data are not always available as binary relations, as in the case of protein-protein interactions, where we can directly apply network clustering algorithms.

In many cases, for example microarray gene expression analysis, the data are multivariate.

An Introduction to Bioinformatics Algorithms by Jones & Pevzner

Page 14

We can convert multivariate data into networks and apply network clustering algorithms, which we will discuss in the next class.

If the dimension of the multivariate data is 3 or less, we can cluster the data by plotting them directly.

Hierarchical Clustering

An Introduction to Bioinformatics Algorithms by Jones & Pevzner

Page 15

However, when the dimension is more than 3, we can apply hierarchical clustering to the multivariate data.

In hierarchical clustering the data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place.

Some data reveal a good cluster structure when plotted, but some do not.

Data plotted in 2 dimensions

Hierarchical Clustering

Page 16

Hierarchical clustering is a technique that organizes elements into a tree.

A tree is a connected graph that has no cycles.

A tree with n nodes has exactly n-1 edges.

[Figure: an example graph and an example tree]

Hierarchical Clustering

Page 17

Hierarchical clustering is subdivided into two types:

1. agglomerative methods, which proceed by a series of fusions of the n objects into groups, and

2. divisive methods, which separate the n objects successively into finer groupings.

Agglomerative techniques are more commonly used.

The data can be viewed at every level, from a single cluster containing all objects down to n clusters, each containing a single object.

Hierarchical Clustering

Page 18

Distance measurements

The Euclidean distance between points p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) in Euclidean n-space is defined as:

d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

Euclidean distance between g1 = (10, 8, 10) and g2 = (10, 0, 9):

\sqrt{(10-10)^2 + (8-0)^2 + (10-9)^2} = \sqrt{0 + 64 + 1} = \sqrt{65} \approx 8.0622

Hierarchical Clustering
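A quick Python check of the worked example above (g1 and g2 as reconstructed from the slide's calculation):

# Euclidean distance between the two example points.
import math

g1 = (10, 8, 10)
g2 = (10, 0, 9)

d = math.sqrt(sum((p - q) ** 2 for p, q in zip(g1, g2)))
print(round(d, 4))  # 8.0623 (the slide truncates sqrt(65) to 8.0622)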

Page 19

An Introduction to Bioinformatics Algorithms by Jones & Pevzner

Instead of Euclidean distance, correlation can also be used as a distance measure.

For biological analyses involving genes and proteins, nucleotide and/or amino acid sequence similarity can also be used as the distance between objects.

Hierarchical Clustering

Page 20

• An agglomerative hierarchical clustering procedure produces a series of partitions of the data, Pn, Pn-1, ..., P1. The first, Pn, consists of n single-object 'clusters'; the last, P1, consists of a single group containing all n cases.

• At each stage the method joins together the two clusters that are closest together (most similar). (At the first stage, of course, this amounts to joining the two objects that are closest together, since at the initial stage each cluster contains one object.)

Hierarchical Clustering
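A minimal sketch of this agglomerative procedure (assuming Python with scipy; the 2-dimensional points are made up for illustration):

# Agglomerative hierarchical clustering with scipy: linkage() records the
# series of fusions P_n, ..., P_1 described above.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],    # one tight group
                   [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])   # another tight group

# Each row of Z records one fusion: the two clusters merged and their distance.
Z = linkage(points, method='single', metric='euclidean')
print(Z)

# Cut the tree into 2 clusters.
print(fcluster(Z, t=2, criterion='maxclust'))  # e.g. [1 1 1 2 2 2]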

Page 21

An Introduction to Bioinformatics Algorithms by Jones & Pevzner

Differences between methods arise because of the different ways of defining distance (or similarity) between clusters.

Hierarchical Clustering

Page 22

How can we measure distances between clusters?

Single linkage clustering

Distance between two clusters A and B, D(A,B), is computed as

D(A,B) = Min { d(i,j) : object i is in cluster A and object j is in cluster B }

Hierarchical Clustering

Page 23

Complete linkage clustering

Distance between two clusters A and B, D(A,B), is computed as

D(A,B) = Max { d(i,j) : object i is in cluster A and object j is in cluster B }

Hierarchical Clustering

Page 24

Average linkage clustering

Distance between two clusters A and B, D(A,B), is computed as

D(A,B) = TAB / (NA * NB)

where TAB is the sum of all pairwise distances between objects of cluster A and objects of cluster B, and NA and NB are the sizes of clusters A and B, respectively.

Total NA * NB edges between the two clusters.

Hierarchical Clustering

Page 25

Average group linkage clustering

Distance between two clusters A and B, D(A,B), is computed as

D(A,B) = Average { d(i,j) : observations i and j are in cluster t, the cluster formed by merging clusters A and B }

Total n(n-1)/2 edges

Hierarchical Clustering
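A minimal Python sketch of the four cluster-distance definitions above, on two small made-up clusters of one-dimensional points:

# Single, complete, average and average group linkage for two toy clusters.
from itertools import combinations

def d(i, j):
    return abs(i - j)            # any distance measure could be used here

A = [1.0, 2.0, 3.0]
B = [8.0, 9.0]

between = [d(i, j) for i in A for j in B]          # the NA * NB pairwise distances

single   = min(between)                            # single linkage
complete = max(between)                            # complete linkage
average  = sum(between) / (len(A) * len(B))        # average linkage: TAB / (NA * NB)

merged = A + B                                     # cluster t = A merged with B
group_average = (sum(d(i, j) for i, j in combinations(merged, 2))
                 / (len(merged) * (len(merged) - 1) / 2))  # average group linkage

print(single, complete, average, group_average)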

Page 26

Alizadeh et al. Nature 403: 503-511 (2000).

Hierarchical Clustering

Page 27

Classifying bacteria based on 16S rRNA sequences.

Page 28

Line Graphs

Given a graph G, its line graph L(G) is a graph such that:

• each vertex of L(G) represents an edge of G; and

• two vertices of L(G) are adjacent if and only if their corresponding edges share a common endpoint ("are adjacent") in G.

[Figure: step-by-step construction of a line graph: the graph G; vertices in L(G) constructed from the edges of G; added edges in L(G); the resulting line graph L(G)]

http://en.wikipedia.org/wiki/Line_graph
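A minimal Python sketch of the line-graph construction (assuming networkx; the small graph is made up for illustration):

# Line graph of a small made-up graph.
import networkx as nx

G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4)])

L = nx.line_graph(G)

# Each vertex of L(G) is an edge of G; two are adjacent iff the edges share an endpoint.
print(sorted(L.nodes()))   # the four edges of G, e.g. (1, 2), (1, 3), (2, 3), (3, 4)
print(sorted(L.edges()))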

Page 29

Line Graphs

RASCAL: Calculation of Graph Similarity using Maximum Common Edge Subgraphs, by John W. Raymond, Eleanor J. Gardiner and Peter Willett

The Computer Journal, Vol. 45, No. 6, 2002

The above paper introduced a new graph similarity calculation procedure for comparing labelled graphs.

The chemical graphs G1 and G2 are shown in Figure (a), and their respective line graphs are depicted in Figure (b).

Page 30

Line Graphs

Detection of Functional Modules From Protein Interaction Networks, by Jose B. Pereira-Leal, Anton J. Enright, and Christos A. Ouzounis

PROTEINS: Structure, Function, and Bioinformatics 54:49–57 (2004)

Transforming a network of proteins to a network of interactions. a: Schematic representation illustrating a graph representation of protein interactions: nodes correspond to proteins and edges to interactions. b: Schematic representation illustrating the transformation of the protein graph connected by interactions to an interaction graph connected by proteins. Each node represents a binary interaction and edges represent shared proteins. Note that labels that are not shared correspond to terminal nodes in (a)

A star is transformed into a clique
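A minimal Python check of this observation (assuming networkx): the line graph of a star with k leaves is the complete graph on k vertices.

# A star is transformed into a clique by the line-graph construction.
import networkx as nx

star = nx.star_graph(4)             # one hub connected to 4 leaves (5 nodes, 4 edges)
L = nx.line_graph(star)

print(L.number_of_nodes())          # 4 (one vertex per edge of the star)
print(nx.is_isomorphic(L, nx.complete_graph(4)))  # True: the line graph is a clique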