

Clustering Metabolic Networks Using Minimum Cut Trees

Ryan Kellogg (1), Allison Heath (2), Lydia Kavraki (2,3)

(1) Carnegie Mellon University, Department of Electrical & Computer Engineering; (2) Rice University, Department of Computer Science; (3) Rice University, Department of Bioengineering

Motivation

Finding clusters in metabolic networks is important for several reasons:

- Clusters may correspond to groups of reactions that perform a common function.
- Complex metabolic networks can be simplified based on their cluster composition.
- Insights into large-scale organization and evolutionary history can be gained [3].

Our approach is interesting because:

- The size and number of clusters can be changed by adjusting a single parameter.
- The algorithm is elegant and mathematically robust.
- Execution is efficient and based on network flow computations.

Problem

This project is about the discovery and analysis of clusters in metabolic networks. We implement an algorithm for cluster detection based on minimum cut trees, apply the algorithm to metabolic network data, analyze the identified clusters, and discuss the biological implications.

Overview

Results

Method

Minimum Cut Trees

The algorithm for detecting clusters is based on a structure called a minimum cut tree [2].

The minimum cut tree T of a graph G has the property that the lowest edge weight along the path between two nodes in T equals the capacity of the minimum cut between the same two nodes in G.

Consider the following example graph and its corresponding minimum cut tree:

[Figure: example graph and its minimum cut tree]

Explanation: Suppose we are interested in the minimum cut between nodes A and F. The dashed red line indicates this cut, which has capacity 17. Consequently, in the min-cut tree, the lowest edge weight along the path between nodes A and F is 17.
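The path-minimum property of a minimum cut tree can be illustrated with a short Python sketch. The tree below is hypothetical: its node names and edge weights are made up for illustration and are not the tree from the poster figure, although they are chosen so that the A to F query also returns 17. The function name `min_cut_from_tree` is likewise an assumption.

```python
from collections import deque

def min_cut_from_tree(tree, source, target):
    """Return the lowest edge weight on the unique tree path source -> target."""
    # best[n] holds the smallest edge weight seen on the path source -> n.
    best = {source: float("inf")}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return best[node]
        for neighbor, weight in tree[node].items():
            if neighbor not in best:
                best[neighbor] = min(best[node], weight)
                queue.append(neighbor)
    raise ValueError("target is not reachable from source")

# Hypothetical minimum cut tree (illustrative weights only).
tree_edges = [("A", "B", 18), ("B", "C", 17), ("C", "D", 20), ("C", "F", 19)]
tree = {}
for u, v, w in tree_edges:
    tree.setdefault(u, {})[v] = w
    tree.setdefault(v, {})[u] = w

print(min_cut_from_tree(tree, "A", "F"))
```

Once the tree is built, every pairwise minimum cut value can be read off this way without running another flow computation.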

Metabolic Networks as Graphs

We model metabolic networks using a directed, bipartite graph:

- One set of nodes represents compounds.
- One set of nodes represents reactions.
- Edges associate compounds with reactions.

Metabolic networks are very complex, so this model is a first-order approximation. It captures the topological information necessary for cluster identification.
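A minimal Python sketch of this graph model follows. The reaction list is a hypothetical toy example, and the edge orientation (substrate to reaction, reaction to product) is an assumption; the poster states only that the graph is directed and that edges associate compounds with reactions.

```python
from collections import defaultdict

# Hypothetical toy data: reaction id -> (substrates, products).
reactions = {
    "R1": (["glucose", "ATP"], ["G6P", "ADP"]),
    "R2": (["G6P"], ["F6P"]),
}

def build_bipartite_graph(reactions):
    """Build compound and reaction node sets plus directed adjacency."""
    compounds, reaction_nodes = set(), set()
    successors = defaultdict(set)
    for rid, (substrates, products) in reactions.items():
        reaction_nodes.add(rid)
        for compound in substrates:
            compounds.add(compound)
            successors[compound].add(rid)   # assumed: substrate -> reaction
        for compound in products:
            compounds.add(compound)
            successors[rid].add(compound)   # assumed: reaction -> product
    return compounds, reaction_nodes, dict(successors)

compounds, reaction_nodes, successors = build_bipartite_graph(reactions)
print(sorted(compounds), sorted(reaction_nodes))
```

Because every edge joins a compound to a reaction, no two compounds and no two reactions are directly adjacent, which is what makes the graph bipartite.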

Clustering Algorithm

The minimum cut tree clustering (MCTC) algorithm proceeds as follows [1]:

1. Begin with an undirected, weighted graph G.
2. Attach an artificial sink to each node in G with an edge of weight α. Call this structure the "expanded graph".
3. Compute the minimum cut tree of the expanded graph.
4. Remove the artificial sink from the minimum cut tree.
5. The connected components that remain are the clusters of G.
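The steps above can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: it assumes Edmonds-Karp for the flow computations and Gusfield's method for building the minimum cut tree, and the function names, the sink label `_sink_`, and the toy graph are all hypothetical.

```python
from collections import deque, defaultdict

def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp. Returns (cut value, nodes on the s side of a minimum cut)."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:  # BFS for a shortest augmenting path
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow, set(parent)  # nodes still reachable from s
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck

def min_cut_tree(capacity):
    """Gusfield's method. Returns {node: (tree parent, edge weight)} for non-root nodes."""
    nodes = list(capacity)
    root = nodes[0]
    parent = {v: root for v in nodes}
    weight = {v: 0 for v in nodes}
    for s in nodes[1:]:
        t = parent[s]
        value, s_side = max_flow_min_cut(capacity, s, t)
        weight[s] = value
        for v in nodes:
            if v != s and v in s_side and parent[v] == t:
                parent[v] = s
        if parent[t] in s_side:
            parent[s], parent[t] = parent[t], s
            weight[s], weight[t] = weight[t], value
    return {v: (parent[v], weight[v]) for v in nodes if v != root}

def cut_clustering(graph, alpha, sink="_sink_"):
    """Steps 2-5: expand with a sink, build the tree, drop the sink, collect components."""
    expanded = {u: dict(nbrs) for u, nbrs in graph.items()}
    expanded[sink] = {}
    for u in graph:
        expanded[u][sink] = alpha
        expanded[sink][u] = alpha
    adjacency = defaultdict(set)
    for child, (par, _weight) in min_cut_tree(expanded).items():
        if sink not in (child, par):  # removing the sink drops its tree edges
            adjacency[child].add(par)
            adjacency[par].add(child)
    clusters, seen = [], set()
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in component:
                component.add(v)
                stack.extend(adjacency[v] - component)
        seen |= component
        clusters.append(component)
    return clusters

def undirected(edges):
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = w
        graph[v][u] = w
    return dict(graph)

# Hypothetical toy graph: two triangles (edge weight 2) joined by a bridge (weight 1).
toy = undirected([("a", "b", 2), ("b", "c", 2), ("a", "c", 2),
                  ("d", "e", 2), ("e", "f", 2), ("d", "f", 2), ("c", "d", 1)])
for alpha in (0.25, 1, 8):
    print(alpha, sorted(sorted(c) for c in cut_clustering(toy, alpha)))
```

On this toy graph a small α keeps everything in one cluster, an intermediate α separates the two triangles, and a large α isolates every node, which is the single-parameter control described earlier.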

Biological Analysis

We obtain the best-fit clustering for each of the four organisms and compare it with known metabolic pathways. Matches fall roughly into four categories:

- Full match: A cluster coincides exactly with a pathway.
- Partial match: A cluster is contained in a pathway but does not fill it.
- Multi-match: A single cluster spans multiple pathways.
- No match: There is little discernible clustering in a pathway.

We present an example of each type:

Conclusion and Future Work

This is an ongoing project. More analysis is necessary to determine the extent to which the MCTC algorithm is useful for understanding metabolic networks. Current progress is encouraging: the algorithm seems to produce biologically meaningful clusters with reasonable efficiency. Future work will explore cluster detection when pathway structure is unknown, simplified network representations based on cluster composition, and applications to other types of biological networks, such as motif identification in regulatory networks.

References

[1] G.W. Flake, R.E. Tarjan, K. Tsioutsiouliklis. “Graph Clustering and Minimum Cut Trees.” Internet Mathematics; 1: 385-408. 2002.

[2] R.E. Gomory and T.C. Hu. “Multi-terminal Network Flows.” J. Soc. Indust. Appl. Math; 9: 551-571. 1961.

[3] P. Holme and M. Huss. “Discovery and Analysis of Biochemical Network Hierarchies.” Bioinformatics; 19: 532-538. 2003.

For questions or comments: Allison Heath [email protected]

Tuning Alpha

We seek an objective way to select α in our analysis:

- Choose the value of α whose clusters "best fit" the known metabolic pathway structure.
- To calculate it, find the value of α at which the average pathways per cluster (PPC) and the average clusters per pathway (CPP) intersect.
- The accompanying figure shows the best-fit α values for the four organisms in our study.
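The PPC/CPP criterion can be sketched as follows. The poster does not define the two averages precisely, so this sketch assumes that a cluster "touches" a pathway when they share at least one node, and that the crossing point is found by linear interpolation between sampled α values. All function names and the sample numbers are hypothetical.

```python
def average_pathways_per_cluster(clusters, pathways):
    """Mean number of pathways that share at least one node with each cluster."""
    counts = [sum(1 for members in pathways.values() if members & cluster)
              for cluster in clusters]
    return sum(counts) / len(counts)

def average_clusters_per_pathway(clusters, pathways):
    """Mean number of clusters that share at least one node with each pathway."""
    counts = [sum(1 for cluster in clusters if cluster & members)
              for members in pathways.values()]
    return sum(counts) / len(counts)

def crossing_alpha(alphas, ppc, cpp):
    """Interpolate the alpha at which the PPC and CPP curves intersect."""
    diffs = [p - c for p, c in zip(ppc, cpp)]
    for i in range(len(diffs) - 1):
        d0, d1 = diffs[i], diffs[i + 1]
        if d0 == 0:
            return alphas[i]
        if d0 * d1 < 0:  # sign change: the curves cross inside this interval
            fraction = d0 / (d0 - d1)
            return alphas[i] + fraction * (alphas[i + 1] - alphas[i])
    return alphas[-1] if diffs[-1] == 0 else None

# Hypothetical clustering and pathway annotation.
clusters = [{"a", "b"}, {"c"}]
pathways = {"P1": {"a", "b", "c"}, "P2": {"c"}}
print(average_pathways_per_cluster(clusters, pathways),
      average_clusters_per_pathway(clusters, pathways))

# Hypothetical PPC and CPP values sampled at two alpha settings.
print(crossing_alpha([0.0, 1.0], [4.0, 1.0], [1.0, 2.0]))
```

In practice the two averages would be recomputed from a fresh clustering at each sampled α before looking for the crossing.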

Cluster Statistics

Interesting observations:

- The number of clusters changes with α in a step-like fashion.
- Moderate-sized clusters occur for only a small range of α.
- Overall behavior is as expected.

- Full match: E. coli, fatty acid biosynthesis
- Partial match: S. cerevisiae, nucleotide sugars metabolism
- Multi-match: H. sapiens, methionine metabolism / cysteine metabolism
- No match: A. thaliana, reductive carboxylate cycle
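The four match categories can be made concrete with a small Python sketch. The decision rules below are one possible reading of the category descriptions and are an assumption, as are the function name and the toy pathway data; in particular, a cluster that overlaps a single pathway without being contained in it falls through to "no match" here.

```python
def classify_cluster(cluster, pathways):
    """Assign a cluster to one of the four match categories (assumed rules)."""
    if any(cluster == members for members in pathways.values()):
        return "full match"
    if any(cluster < members for members in pathways.values()):
        return "partial match"  # strictly contained in a pathway
    touched = [name for name, members in pathways.items() if cluster & members]
    if len(touched) > 1:
        return "multi-match"    # spans more than one pathway
    return "no match"

# Hypothetical pathway annotation.
pathways = {"P1": {"a", "b", "c"}, "P2": {"d", "e"}}
for cluster in ({"a", "b", "c"}, {"a", "b"}, {"c", "d"}, {"x"}):
    print(sorted(cluster), classify_cluster(cluster, pathways))
```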

Data

Our data comes from the Kyoto Encyclopedia of Genes and Genomes (KEGG). We study the full metabolism of four organisms:

- Saccharomyces cerevisiae
- Arabidopsis thaliana
- Escherichia coli
- Homo sapiens

The KEGG data for each organism is disconnected, so we focus on the largest connected component (LCC) of each metabolic network.

Organism        Total edges   Total nodes   LCC edges   LCC nodes
S. cerevisiae   2361          2415          1174        1055
A. thaliana     2670          2818          1267        1151
E. coli         3109          3152          1654        1470
H. sapiens      3529          3507          1785        1575
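Extracting the largest connected component can be sketched with a breadth-first search. This sketch assumes the network is given as an undirected adjacency mapping, that is, edge directions are ignored when testing connectivity; the function name and the toy network are hypothetical.

```python
from collections import deque

def largest_connected_component(adjacency):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Hypothetical toy network with two components.
network = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "x": {"y"}, "y": {"x"},
}
print(sorted(largest_connected_component(network)))
```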