Integrated Mining of PPI Networks: A Case for Ensemble Clustering

Srinivasan Parthasarathy
Department of Computer Science and Engineering
The Ohio State University

Joint work with Sitaram Asur and Duygu Ucar

Copyright 2006, Data Mining Research Laboratory




I. Preliminaries and Motivation


Proteins

• Central component of cell machinery and life
  – It is the proteins dynamically generated by a cell that execute the genetic program [Kahn 1995]
• Proteins work with other proteins [Von Mering et al 2002]
  – Form large interaction networks, typically referred to as protein-protein interaction (PPI) networks
  – Regulate and support each other for specific functionality or process


Protein-Protein Interaction Networks

• Why analyze?
  – To fully understand cellular machinery, simply listing proteins is not enough; (clusters of) interactions need to be delineated as well [v.Mering 2002]
• Understanding the organism
  – Protein function prediction
    • E.g. no functional annotations for one-third of baker’s yeast
  – Drug design
• Goal: To find modular clusters


Challenges in analyzing PPI Networks

– Noisy data
  • False positives [Deane 2002], false negatives [Hsu 06]
– Existence of hub nodes
  • Particularly problematic for standard clustering and graph partitioning algorithms: they lead to very large core clusters and not much else!
– Proteins can be multi-faceted
  • Can belong to multiple functional groups; most clustering algorithms are hard, hence the need for soft or fuzzy clustering
– Data integration issues
  • Multiple sources
    – Two-hybrid (Y2H), mass spectrometry, genetic co-occurrence
  • Different targets
    – Y2H, mass spectrometry: target binding
    – Gene co-occurrence: target functional
  • Different weaknesses (missing certain interactions)
    – Y2H: translation
    – Mass spectrometry: transport & sensing


Ensemble Clustering

• A useful approach to combine the results from multiple clustering arrangements into a single arrangement based on consensus [SG03]
• Objective: Map the clusters obtained by different algorithms to a single clustering arrangement
• Our hypothesis: Potentially offers a viable solution for these problems simultaneously
  – Given the nice theory in the context of classification, it is likely to be particularly useful in a noisy environment.
    • A weak analogy to the audience vote in "Who Wants to Be a Millionaire"
  – Naturally handles arrangements produced from different sources or domain-driven segmentation.


Ensemble Clustering on PPI networks: Key Questions

• What are the base clustering methods and arrangements to use in the context of interaction networks?
  – How to handle the influence of noise and hubs?
• How do we scale to problems of the size of interaction networks?
• How do we address the issue of soft clustering?
• How to address the issue of data integration?
  – Another day, another time


II. Ensemble Clustering Framework


Bird's-eye view (coarse grained)

[Figure: framework overview. Labels: Scale-free graph; Topology-based Similarity Metrics (x); Clustering Algorithms (y); Clustering Arrangements (xy base clustering arrangements); Cluster Representation; (soft) Consensus Clustering; Final clusters.]


Similarity Metrics

• Central to any clustering algorithm
• Key idea:
  – Leverage topological information to determine the similarity between two proteins in the interaction network
  – With the ensemble approach we are not limited to one!
• Metrics:
  – Clustering coefficient based (edge oriented, local)
  – Edge betweenness based (edge oriented, global)
  – Neighborhood based (local, non-edge oriented)


Clustering coefficient-based similarity

• Clustering coefficient
  – "all-my-friends-know-each-other" property
  – Measures the interconnectivity of a node’s neighbors.
• Clustering coefficient-based similarity of two connected nodes vi and vj
  – Measures the contribution of the edge between the nodes towards the clustering coefficient of the nodes

[Figure: example graph with nodes 1-6 and the connected pair vi, vj]
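The slide's formula did not survive extraction. The following is a minimal sketch only, assuming the similarity is the drop in the two endpoints' clustering coefficients when their edge is removed; that form is an assumption, not necessarily the authors' exact definition. The dict-of-sets graph and all function names are illustrative.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def without_edge(adj, u, v):
    """Copy of the graph with the undirected edge (u, v) removed."""
    g = {n: set(ns) for n, ns in adj.items()}
    g[u].discard(v)
    g[v].discard(u)
    return g

def cc_similarity(adj, u, v):
    """Assumed form: CC(u) + CC(v) with the edge, minus the same sum without it."""
    g = without_edge(adj, u, v)
    before = clustering_coefficient(adj, u) + clustering_coefficient(adj, v)
    after = clustering_coefficient(g, u) + clustering_coefficient(g, v)
    return before - after
```

Under this assumed form, an edge inside a triangle scores high, while an edge to a pendant node can score below zero because removing it raises the hub's coefficient.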


Edge betweenness-based similarity

• Shortest path edge betweenness [Newman et al]
  – “I-am-between-every-pair” property
  – Computes the fraction of shortest paths passing through an edge
  – Edges that lie between communities have high values of betweenness
• Edge betweenness-based similarity

[Figure: example graph with nodes 1-8]
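As an illustration of the betweenness idea above, here is a sketch that counts shortest paths through each edge with a Brandes-style BFS on an unweighted graph. The final step, turning betweenness into a similarity as 1 minus betweenness over the maximum, is an assumed normalization; the slide's own formula is not in the extracted text. Names and the dict-of-sets graph are illustrative.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        dist = {s: 0}
        sigma = {n: 0.0 for n in adj}
        sigma[s] = 1.0
        preds = {n: [] for n in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path shares from the farthest nodes back towards s.
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair (s, t) was counted from both ends.
    return {e: b / 2.0 for e, b in score.items()}

def betweenness_similarity(adj):
    """Assumed normalization: 1 - betweenness / maximum betweenness."""
    eb = edge_betweenness(adj)
    top = max(eb.values())
    return {e: 1.0 - b / top for e, b in eb.items()}
```

On two triangles joined by a single bridge, the bridge carries all nine cross-community paths and so receives the lowest similarity, which matches the slide's point that inter-community edges have high betweenness.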


Neighborhood-based similarity

• “my-friends-are-your-friends” property
• Based on the number of common neighbors between nodes (Czekanowski-Dice metric [Brun et al, 2004])
  – where Int(i) = number of neighbors of node i

[Figure: example graph with nodes 1-6]
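The equation on this slide is missing from the extracted text. A small sketch follows, assuming the usual Czekanowski-Dice distance in which Int(i) is taken as the set of neighbors of i together with i itself; that reading, the conversion to a similarity as 1 minus the distance, and the function names are assumptions.

```python
def czekanowski_dice_distance(adj, i, j):
    """|Int(i) symmetric-difference Int(j)| / (|union| + |intersection|)."""
    a = adj[i] | {i}   # neighbors of i plus i itself (assumed convention)
    b = adj[j] | {j}
    return len(a ^ b) / (len(a | b) + len(a & b))

def neighborhood_similarity(adj, i, j):
    """Assumed conversion of the distance into a similarity in [0, 1]."""
    return 1.0 - czekanowski_dice_distance(adj, i, j)
```

Two nodes that share all their interaction partners get distance 0; nodes with no shared partners approach distance 1.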


Base Clustering

• Base clustering algorithms: different criteria
  – kMetis
  – Repeated bisections
  – Direct k-way partitioning
• Topology-based similarity measures: weight interactions
  – Clustering coefficient-based: local, targets false positives (FP)
  – Edge betweenness-based: global, targets FP
  – Neighborhood: local, potentially targets false negatives (FN) & FP
• 3 × 3 = 9 arrangements (variance is good!)
  – K clusters per arrangement (K clusters)


PCA-based Consensus Technique

Cluster Purification

Dimensionality Reduction

Consensus Clustering


Cluster Purification

• Goal: Prune unreliable base clusters
• Intra-cluster similarity measure
  – where SP(i,j) represents the shortest path between i and j
• Low intra-cluster distance => high reliability
• Remove clusters with low reliability
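The purification measure itself is not in the extracted text. As a rough sketch, assume reliability is judged by the mean shortest-path length SP(i,j) over all pairs inside a cluster, and that clusters above a cutoff are dropped. The averaging, the `max_mean_distance` cutoff and the penalty for unreachable pairs are all hypothetical choices.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def mean_intra_cluster_distance(adj, cluster):
    """Mean SP(i, j) over all pairs in the cluster (assumed measure)."""
    members = sorted(cluster)
    if len(members) < 2:
        return 0.0
    total = 0
    for idx, a in enumerate(members):
        dist = bfs_distances(adj, a)
        for b in members[idx + 1:]:
            # Unreachable pairs get a penalty of len(adj) hops (hypothetical).
            total += dist.get(b, len(adj))
    pairs = len(members) * (len(members) - 1) // 2
    return total / pairs

def purify(adj, clusters, max_mean_distance):
    """Keep only clusters whose members sit close together in the graph."""
    return [c for c in clusters
            if mean_intra_cluster_distance(adj, c) <= max_mean_distance]
```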


Dimensionality Reduction

• Cluster membership matrix to represent pruned base clusters
• Dimensions likely to be high (9 × k)
• Clustering is inefficient for high-dimensional data
  – Distance metric computations do not scale well
• Lot of noise and redundancy in the matrix
• Solution: Reduce dimensions of the matrix
  – Apply logistic PCA
  – Variant of PCA for binary data (Schein et al, 2003)
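To make the cluster membership matrix concrete, the sketch below builds one binary row per protein and one column per retained base cluster. The layout (proteins as rows) and the function name are assumptions; logistic PCA itself is not implemented here and would be applied to this matrix as the next step.

```python
def membership_matrix(proteins, arrangements):
    """Binary matrix: entry is 1 if the protein belongs to that base cluster.

    `arrangements` is a list of clusterings, each a list of sets of proteins.
    Columns run over every cluster of every arrangement, in order.
    """
    columns = [cluster for arrangement in arrangements for cluster in arrangement]
    return [[1 if p in cluster else 0 for cluster in columns] for p in proteins]
```

With 9 arrangements of k clusters each, every row has up to 9 × k binary entries, which is the high-dimensional input the slide refers to.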


Consensus Clustering

• Agglomerative hierarchical clustering
  – Bottom-up clustering algorithm
  – Begin with each point in a separate cluster
  – Iteratively merge clusters that are similar
• Recursive Bisection (RBR) algorithm
• Soft clustering variants
  – Find initial clusters using agglo or RBR
  – Assign points to multiple clusters based on similarity
  – Hub nodes have high propensity for multiple membership
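The bottom-up procedure and the soft variant can be sketched as follows. This is an illustration under stated assumptions: Euclidean distance on the reduced vectors, average linkage, and a hypothetical `slack` parameter that decides when a point also joins a second cluster. None of these choices is confirmed by the slides.

```python
from itertools import combinations

def distance(p, q):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def average_linkage(points, c1, c2):
    """Mean pairwise distance between two clusters of point indices."""
    return sum(distance(points[i], points[j]) for i in c1 for j in c2) / (len(c1) * len(c2))

def agglomerative(points, k):
    """Start with singletons and merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: average_linkage(points, clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]   # a < b, so index a is unaffected
    return clusters

def soften(points, clusters, slack):
    """Also place a point in any cluster nearly as close as its home cluster."""
    soft = [set(c) for c in clusters]
    for i in range(len(points)):
        d = [average_linkage(points, [i], c) for c in clusters]
        home = next(idx for idx, c in enumerate(clusters) if i in c)
        for idx, value in enumerate(d):
            if value <= d[home] + slack:
                soft[idx].add(i)
    return soft
```

A point lying between two groups, as a hub protein often does, is the one most likely to pick up a second membership under this rule.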


Ensemble Framework (Detailed View)

[Figure: framework diagram. Labels: Topological Metrics (weights); Weighted Graph; Base Clustering; Base clustering arrangements; Cluster Purification (pruning); Principal Component Analysis; Consensus Clustering (Agglomerative Clustering: PCA-agglo; PCA-rbr; Soft: PCA-soft-variants); Final clusters.]


III. Evaluation


Validation Metrics: Domain Independent

• Topological measure: Modularity [Newman & Girvan 04]
  – Measures the modularity within clusters
  – dij represents the fraction of edges linking nodes in clusters i and j
• Information theoretic measure: Normalized Mutual Information [Strehl & Ghosh 03]
  – Measures the shared information between the consensus and base clustering arrangements
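Both measures can be written down compactly. The sketch below assumes the standard Newman-Girvan form Q = Σ_i (d_ii − (Σ_j d_ij)²) and the Strehl-Ghosh normalization NMI = I(X;Y) / sqrt(H(X) H(Y)); the slide's own equations are not in the extracted text, so treat these as the textbook definitions rather than a confirmed transcription. The graph and label representations are illustrative.

```python
from collections import Counter
from math import log, sqrt

def modularity(adj, clusters):
    """Q = sum_i (d_ii - (sum_j d_ij)^2), with d_ij the fraction of edges between i and j."""
    label = {n: idx for idx, c in enumerate(clusters) for n in c}
    m = sum(len(ns) for ns in adj.values()) / 2.0   # number of undirected edges
    k = len(clusters)
    d = [[0.0] * k for _ in range(k)]
    for u in adj:
        for v in adj[u]:
            # Each edge is visited twice (u->v and v->u): add half a share each time.
            d[label[u]][label[v]] += 0.5 / m
    return sum(d[i][i] - sum(d[i]) ** 2 for i in range(k))

def nmi(labels_a, labels_b):
    """Mutual information of two labelings, normalized by sqrt(H(a) * H(b))."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(c / n * log(c * n / (ca[a] * cb[b])) for (a, b), c in joint.items())
    ha = -sum(c / n * log(c / n) for c in ca.values())
    hb = -sum(c / n * log(c / n) for c in cb.values())
    if ha == 0 or hb == 0:
        return 0.0
    return mutual / sqrt(ha * hb)
```

To score a consensus clustering against the ensemble, one option is to average `nmi` between the consensus labels and each base arrangement's labels.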


Validation Metric: Domain Dependent

• Domain-based measure:
  – Gene Ontology annotations for each cluster of proteins
    • Cellular Component
    • Molecular Function
    • Biological Process
  – P-value to measure statistical significance of clusters
    • Computes the probability of the grouping being random
    • Smaller p-values represent higher biological significance
  – Clustering Score to measure overall clustering arrangement
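The slides do not give the p-value formula. A common choice for Gene Ontology enrichment is the hypergeometric tail, sketched here as an assumption: the probability that a random cluster of the same size contains at least as many proteins with a given annotation. Parameter names are illustrative.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster_size, hits):
    """P(at least `hits` annotated proteins in a random cluster of `cluster_size`).

    population: proteins in the network; annotated: those carrying the GO term.
    """
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    tail = sum(comb(annotated, i) * comb(population - annotated, cluster_size - i)
               for i in range(hits, upper + 1))
    return tail / total
```

For example, with 10 proteins of which 3 carry a term, a cluster of 3 that contains all 3 has probability 1/120 of arising by chance under this model.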


Experimental Setup

• Algorithms proposed by Strehl et al, 2003
  – HyperGraph Partitioning Algorithm (HGPA)
    • Minimal hyperedge separator using HMetis
  – Meta-CLustering Algorithm (MCLA)
    • Group related hyperedges to form meta-clusters
    • Assign each point to the closest meta-cluster
  – Cluster-based Similarity Partitioning Algorithm (CSPA)
    • Pairwise similarity matrix is partitioned with METIS
• Algorithms proposed by Gionis et al, ICDE 2005
  – Agglomerative algorithm (CE-agglo)
  – Density-based clustering algorithm (CE-balls)
  – Use strict thresholds and are non-parametric
• Database of Interacting Proteins (DIP)
  – 4928 proteins, 17194 interactions


Modularity and NMI

Algorithm    Modularity   NMI
PCA-agglo    0.471        0.66
PCA-rbr      0.46         0.656
MCLA         0.41         0.614
HGPA         0.1          0.275

• CSPA algorithm ran out of memory
• CE-agglo and CE-balls algorithms resulted in pairs and singleton clusters (cluster-sizes 2121 and 2783 respectively)
• PCA-based consensus methods provide the best scores!


Comparison with Ensemble Algorithms

[Figure: bar chart "Ensemble Algorithms". Y-axis: Clustering Score (0 to 1). X-axis: CE-balls, CE-agglo, HGPA, PCA-agglo, PCA-rbr, MCLA, Wt-agglo. Series: Process, Function, Component.]

• PCA-based consensus methods outperform all other algorithms!
• MCLA performs best of the other algorithms


Existing Solutions to Identify Dense Regions

• Molecular Complex Detection (MCODE)
  – Bader et al, 2003
  – Use local neighborhood density to identify seed vertices
  – Group highly weighted vertices around seed vertices
• Markov Cluster Algorithm (MCL)
  – Dongen et al 2000
  – Random walks on the graph will infrequently go from one natural cluster to another
  – Cluster structure separates out
  – Fast, scalable and non-parametric
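The random-walk intuition behind MCL can be illustrated with a toy version of its two alternating steps: expansion (squaring the column-stochastic matrix) and inflation (element-wise powering followed by re-normalization). This is a simplified sketch, not the reference implementation; the inflation value, the fixed iteration count, the added self-loops and the 1e-6 read-out threshold are all assumed parameters.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
    return [[m[r][c] / sums[c] for c in range(n)] for r in range(n)]

def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def mcl(adjacency, inflation=2.0, iterations=50):
    """Toy Markov clustering on a dense 0/1 adjacency matrix."""
    n = len(adjacency)
    # Add self-loops so a walker may stay in place.
    m = [[adjacency[r][c] + (1.0 if r == c else 0.0) for c in range(n)]
         for r in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                                     # expansion
        m = normalize_columns([[x ** inflation for x in row] for row in m])  # inflation
    # Rows that retain flow name the members of a cluster.
    clusters = set()
    for row in m:
        members = frozenset(c for c, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return clusters
```

Flow never crosses between disconnected components, so each component ends up as its own cluster; on connected graphs, inflation progressively starves the weak inter-cluster edges.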


Comparison with MCODE and MCL

• MCODE produced only 59 clusters
  – Not all proteins clustered (794/4928)
  – 10-20 clusters insignificant
• MCL produced 1246 clusters
  – Most of the clusters insignificant (close to 75-80%)

Algorithm    Modularity
PCA-agglo    0.471
MCL          0.217
MCODE        0.372


Soft Clustering: Comparison with Hub Duplication (Ucar 2006)

[Figure: flowchart of hub duplication. Labels: For Hub Hi; Hub-induced Subgraph Si; Dense components of Si; Duplicate Hi (D’i); i++; Graph Partitioning.]


Benefits of Soft Ensemble Clustering


A closer look at soft clustering performance

• CKA1 (hub protein)

Base Algorithm

Annotation PCA-agglo

PCA-softagglo

Direct-bet Kinase CK2 complex Kinase CK2 complex

Kinase CK2 complex

Direct-cc rRNA metabolism rRNA metabolism

RBR-bet Kinase CK2 complex Cell organization and biogenesis

RBR-cc Kinase CK2 complex

Metis-bet Cell organization and biogenesis

Metis-cc


Concluding Remarks

• Clustering PPI networks is challenging
  – Noise
  – Presence of hubs
  – Need for soft clustering
  – Integration
• Ensemble clustering shows promise as a unified method to handle these problems
  – Competes well against existing stand-alone solutions
  – Scalable: straightforward parallelization for the most part
• Ongoing work
  – General applicability
    • WWW applications
    • Social network analysis
  – Explicit modeling of domain knowledge
    • E.g. encoding directionality
  – Data integration
    • Key is to weight edges and/or components of the ensemble
  – Leveraging graphical models
  – More robust base models
    • Extrinsic similarity measures
    • Impact of anomalies


Questions?

• We acknowledge the following grants for support
  – NSF: CAREER-IIS-0347662
  – NSF: NGS-CNS-0406386
  – NSF: RI-CNS-0403342
  – DOE: ECPI-FG02
• Graduate student colleagues
  – S. Asur and D. Ucar
• Details
  – http://dmrl.cse.ohio-state.edu
  – www.cse.ohio-state.edu/~srini/