Presented by Yuhua Jiao 2012-12-4

Outline

• Limitations of some network clustering methods
• Hierarchical Agglomerative Clustering
  – Method
  – Performance evaluation
• Results and Discussion
  – Data preparation
  – Empirical evaluation
  – Multi-resolution view of a physical interaction network

Background of network clustering

• Challenges in biological network analysis
  – Inferring the structure of subgroups of related vertices
  – Predicting possible links not represented in the data
• Network clustering is a valuable approach for
  – summarizing the structure in large networks,
  – predicting unobserved interactions,
  – predicting functional annotations.

Common limitations for some network clustering algorithms

• Poor resolution of top-level clusters
  – Stochastic block models
• Over-splitting of bottom-level clusters
  – Hierarchical network model
• Requirement to fix the number of clusters before analysis
  – Stochastic block models
• An inability to jointly cluster over multiple interaction types

Hierarchical network model by Clauset, Moore, and Newman (CMN)

Hierarchical Agglomerative Clustering

• Hierarchical Agglomerative Clustering
  – An approximation for optimizing a network probability motivated by CMN.
  – Interactions with vertices outside a group often provide more information than within-group interactions.
  – Power Graph Analysis is a lossless transformation of biological networks into a compact, less redundant representation, exploiting the abundance of cliques and bicliques as elementary topological motifs.

HAC-Method

• Notation
  – Graph
  – Groups
  – Edges between groups
  – Total possible connections
  – Number of holes

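HAC-Method

• For a given pair of groups i and j, the edges between the groups are the result of tij independent Bernoulli trials.
• The probability of the observed edges, conditioned on the parameter θij, is (writing eij for the number of observed edges and hij = tij − eij for the number of holes)
  $P_{ij}(\theta_{ij}) = \theta_{ij}^{e_{ij}} (1 - \theta_{ij})^{h_{ij}}$
• The maximum likelihood estimate of θij is
  $\hat{\theta}_{ij} = e_{ij} / t_{ij}$
• The maximum likelihood value of Pij(θij) is
  $P_{ij}(\hat{\theta}_{ij}) = (e_{ij}/t_{ij})^{e_{ij}} \, (h_{ij}/t_{ij})^{h_{ij}}$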
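The Bernoulli model for one pair of groups can be evaluated numerically. The sketch below is an illustration only: the function name `block_log_likelihood` and its arguments are hypothetical, and it assumes the standard Bernoulli result that the likelihood is maximized at an edge probability equal to the observed edge fraction. It works in log space to avoid underflow for large groups.

```python
import math


def block_log_likelihood(edges: int, possible: int) -> float:
    """Maximized log-likelihood of `edges` observed edges among `possible`
    independent Bernoulli trials (the remaining trials are holes)."""
    if not 0 <= edges <= possible:
        raise ValueError("edges must lie between 0 and possible")
    holes = possible - edges
    log_lik = 0.0
    # Use the convention 0 * log(0) = 0 for empty or complete blocks.
    if edges > 0:
        log_lik += edges * math.log(edges / possible)
    if holes > 0:
        log_lik += holes * math.log(holes / possible)
    return log_lik
```

A block that is completely full or completely empty has log-likelihood 0 (probability 1), so the least predictable blocks are those that are half filled.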

An instance of the flat model

• Given two groups: ni = 5, nj = 4
• Probability density
• The likelihood of the flat model
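A flat model with two groups of sizes 5 and 4 can be scored as in the following sketch. The edge list is invented for illustration and is not the network from the slide. The sketch assumes the flat-model likelihood is the product of maximized Bernoulli terms over all within-group and between-group blocks, with n(n − 1)/2 possible within-group connections (undirected graph, no self edges); `flat_log_likelihood` and `max_log_lik` are hypothetical names.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement


def max_log_lik(e, t):
    """Maximized Bernoulli log-likelihood for e edges among t possible pairs."""
    return sum(x * math.log(x / t) for x in (e, t - e) if x > 0)


def flat_log_likelihood(edges, group_of):
    """Sum block log-likelihoods over all within- and between-group blocks.

    edges: iterable of (u, v) pairs; group_of: dict vertex -> sortable label.
    """
    sizes = Counter(group_of.values())
    e = Counter()
    for u, v in edges:
        e[tuple(sorted((group_of[u], group_of[v])))] += 1
    total = 0.0
    for i, j in combinations_with_replacement(sorted(sizes), 2):
        if i == j:
            t = sizes[i] * (sizes[i] - 1) // 2  # pairs inside one group
        else:
            t = sizes[i] * sizes[j]  # pairs across two groups
        total += max_log_lik(e[(i, j)], t)
    return total


# Hypothetical example: group A has 5 vertices, group B has 4.
A = [f"a{k}" for k in range(5)]
B = [f"b{k}" for k in range(4)]
group_of = {**{v: "A" for v in A}, **{v: "B" for v in B}}
# A is a clique, B has no internal edges, and half of the 20 A-B pairs are linked.
example_edges = [(A[p], A[q]) for p in range(5) for q in range(p + 1, 5)]
example_edges += [(a, b) for a in A for b in B[:2]]
example_ll = flat_log_likelihood(example_edges, group_of)
```

In this example only the between-group block contributes, because the two within-group blocks are perfectly predictable.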

HAC-Method

• Generalization to hierarchical model
  – Binary dendrogram T
  – Each node r in the dendrogram represents the joining of the vertices in the left sub-tree L(r) and the vertices in the right sub-tree R(r).
  – Er and hr are the numbers of edges and holes crossing between the left and right sub-trees.
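The hierarchical likelihood can be sketched by walking the dendrogram and scoring each internal node by the edges and holes that cross between its two sub-trees. This is a minimal sketch under assumptions: the dendrogram is encoded as nested 2-tuples with non-tuple vertex labels at the leaves, each node contributes an independent maximized Bernoulli term, and `tree_log_likelihood` is a hypothetical name.

```python
import math


def max_log_lik(e, t):
    """Maximized Bernoulli log-likelihood for e edges among t possible pairs."""
    return sum(x * math.log(x / t) for x in (e, t - e) if x > 0)


def tree_log_likelihood(tree, edges):
    """Log-likelihood of a binary dendrogram.

    tree: nested 2-tuples, e.g. (("a", "b"), ("c", "d")); leaves are vertices.
    edges: set of frozenset({u, v}) pairs for an undirected graph.
    """
    def visit(node):
        if not isinstance(node, tuple):
            return [node], 0.0  # a leaf contributes no term
        left, ll_left = visit(node[0])
        right, ll_right = visit(node[1])
        # Edges crossing between the left and right sub-trees of this node.
        crossing = sum(
            1 for u in left for v in right if frozenset((u, v)) in edges
        )
        possible = len(left) * len(right)
        return left + right, ll_left + ll_right + max_log_lik(crossing, possible)

    return visit(tree)[1]
```

For example, `tree_log_likelihood((("a", "b"), ("c", "d")), {frozenset("ab"), frozenset("cd")})` scores a tree whose root separates two disconnected pairs, so every node is perfectly predictable.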

HAC-Agglomerative clustering

• Maximum likelihood guide tree
  – K top-level clusters
  – R total tree nodes
  – Merge clusters 1 and 2 into cluster 1’, defining a new model M’

[Figure: clusters 1 and 2 in the current top level are merged into cluster 1’]
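One way to pick the next merge is to compare the likelihood of the merged model M’ with that of the current model. The sketch below is an assumption-laden illustration, not the presented implementation: it assumes that merging two top-level clusters leaves their mutual block unchanged (it becomes the new tree node) and only pools their blocks with every other top-level cluster. The names `merge_score` and `best_merge` are hypothetical.

```python
import math
from itertools import combinations


def max_log_lik(e, t):
    """Maximized Bernoulli log-likelihood for e edges among t possible pairs."""
    return sum(x * math.log(x / t) for x in (e, t - e) if x > 0)


def count_edges(group_a, group_b, edges):
    """Number of edges between two disjoint vertex sets."""
    return sum(1 for u in group_a for v in group_b if frozenset((u, v)) in edges)


def merge_score(a, b, clusters, edges):
    """Log of L(M') / L(M) when top-level clusters a and b are merged.

    clusters: dict label -> set of vertices; edges: set of frozenset pairs.
    """
    score = 0.0
    for k, members in clusters.items():
        if k in (a, b):
            continue
        e_a = count_edges(clusters[a], members, edges)
        e_b = count_edges(clusters[b], members, edges)
        t_a = len(clusters[a]) * len(members)
        t_b = len(clusters[b]) * len(members)
        # Pooling the two blocks can only lower the maximized likelihood.
        score += max_log_lik(e_a + e_b, t_a + t_b)
        score -= max_log_lik(e_a, t_a) + max_log_lik(e_b, t_b)
    return score


def best_merge(clusters, edges):
    """Pair of top-level clusters whose merge loses the least likelihood."""
    return max(
        combinations(sorted(clusters), 2),
        key=lambda pair: merge_score(pair[0], pair[1], clusters, edges),
    )
```

Under this score, two clusters with identical connections to the rest of the network merge at no cost, which reflects the point that interactions with vertices outside a group carry the information.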

HAC-Agglomerative clustering

HAC-Bayesian model selection for terminal clusters

• During the merging process, if clusters 1 and 2 are selected for merging and are both collapsed, the probability ratio λc is calculated, where the subscripts indicate edges and holes within and between groups.
• The merged cluster is collapsed if λc ≥ 1.
• Clusters of two vertices are always merged because λc = 1.
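The collapse decision can be illustrated with a Bayesian ratio. The exact formula is not preserved here, so the sketch below assumes a uniform prior on each edge probability, which gives each block the marginal likelihood B(e + 1, h + 1), and compares a single collapsed block with three separate blocks (within cluster 1, within cluster 2, and between them). The function names are hypothetical. Under these assumptions two single vertices give a ratio of exactly 1, matching the statement above.

```python
from math import lgamma


def log_marginal(e, h):
    """log of the integral of theta^e (1 - theta)^h over [0, 1],
    i.e. log B(e + 1, h + 1), assuming a uniform prior on theta."""
    return lgamma(e + 1) + lgamma(h + 1) - lgamma(e + h + 2)


def log_collapse_ratio(e11, h11, e22, h22, e12, h12):
    """log(lambda_c): collapsed single block versus three separate blocks.

    e11/h11 and e22/h22 are edges/holes within clusters 1 and 2;
    e12/h12 are edges/holes between them.
    """
    collapsed = log_marginal(e11 + e22 + e12, h11 + h22 + h12)
    separate = (
        log_marginal(e11, h11) + log_marginal(e22, h22) + log_marginal(e12, h12)
    )
    return collapsed - separate


def should_collapse(e11, h11, e22, h22, e12, h12):
    """Collapse when lambda_c >= 1, i.e. when log(lambda_c) >= 0.
    A small tolerance guards against floating-point round-off at the boundary."""
    return log_collapse_ratio(e11, h11, e22, h22, e12, h12) >= -1e-12
```

For instance, two triangles that are fully connected to each other, `should_collapse(3, 0, 3, 0, 9, 0)`, form one clique and are collapsed, while two triangles with no edges between them, `should_collapse(3, 0, 3, 0, 0, 9)`, are kept separate.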

Performance Evaluation

• Data preparation
  – BioGRID database (http://thebiogrid.org)
  – The graph is undirected and unweighted with no self edges.
• Other methods
  – Fast Modularity (CNM)
  – Variational Bayes modularity (VBM)
  – Graph Diffusion Kernel (GDK)
  – Heuristic merging scores
    • Edge density (HAC-E)
    • Combined edge density and shared neighbor density (HAC-ES)
    • Decomposed Newman modularity Q from CNM (HAC-Q)

Link Prediction

• Starting with a real-world network, training networks are generated by deleting a specified fraction of edges.
• A test set is defined by the held-out edges and a random choice of an equal number of holes.
• The trained group structure provides maximum likelihood estimates for edges within and between clusters (Eq. 9). For VBM and CNM, we estimated edge densities between all pairs of clusters and within all clusters. For hierarchical models, we estimated densities between all left and right clusters at all tree levels. For GDK, each pair’s diffusion was directly used to rank pairs.
• Finally, we assessed precision and recall of pairs in the test set ranked by link probability or GDK score.
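The evaluation protocol above can be sketched with the standard library alone. Everything here is illustrative: `split_edges` and `precision_recall` are hypothetical names, the seed and rounding rule are arbitrary choices, and the scores passed to `precision_recall` would come from whichever clustering method is being tested.

```python
import random
from itertools import combinations


def split_edges(vertices, edges, fraction, seed=0):
    """Hold out a fraction of edges and sample an equal number of holes.

    Returns (training edges, held-out edges, sampled holes); every pair is a
    sorted tuple. Raises ValueError if the graph has too few holes to sample.
    """
    rng = random.Random(seed)
    edge_list = sorted(tuple(sorted(e)) for e in edges)
    rng.shuffle(edge_list)
    n_test = int(round(fraction * len(edge_list)))
    held_out, train = edge_list[:n_test], edge_list[n_test:]
    edge_set = set(edge_list)
    holes = [p for p in combinations(sorted(vertices), 2) if p not in edge_set]
    return train, held_out, rng.sample(holes, n_test)


def precision_recall(scored_pairs, positives):
    """Precision and recall after each rank, highest score first.

    scored_pairs: list of (score, pair); positives: the held-out edges
    (must be non-empty).
    """
    positives = set(positives)
    ranked = sorted(scored_pairs, key=lambda sp: sp[0], reverse=True)
    hits, curve = 0, []
    for rank, (_, pair) in enumerate(ranked, start=1):
        hits += pair in positives
        curve.append((hits / rank, hits / len(positives)))
    return curve
```

A method is trained on the training edges only, assigns a link probability (or GDK score) to every pair in the held-out edges and sampled holes, and the resulting curve summarizes how well true edges are ranked above holes.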

Results and Discussion

• Data Preparation

Further Discussion

• Extending HAC to dynamic networks is limited:
  – The identifiability problem must be solved: how complexes inferred at one time point correspond to complexes inferred at other time points.
  – Transitions of a protein from one complex to another must be permitted by the model, requiring dynamical coupling between network snapshots.
• Dynamical Hierarchical Agglomerative Clustering (DHAC)
  – Maximum likelihood is converted to fully Bayesian statistics.
  – The likelihood modularity is ‘kernelized’ with an adaptive bandwidth to couple network clusters at nearby time points.
  – Matching clusters across time points is solved with a new belief propagation method that extends Expectation-Maximization and belief propagation for bipartite matching to consistently match multiple time-evolving clusters.