80
Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of Saarland University, Saarbrücken, Germany, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science supervised by Prof. Dr. Thomas Lengauer, Ph.D. and Dr. Mario Albrecht Computational Biology and Applied Algorithms Group Max Planck Institute for Informatics Saarbrücken, Germany November 2006

Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

Master Thesis

Topological Analysis of Biological Networks

by Yassen Assenov

A Thesis Submitted to the Department of Computer Science of Saarland University, Saarbrücken, Germany,

in Partial Fulfillment of the Requirements for the Degree of Master of Science

in Computer Science

supervised by Prof. Dr. Thomas Lengauer, Ph.D.

and Dr. Mario Albrecht

Computational Biology and Applied Algorithms Group Max Planck Institute for Informatics

Saarbrücken, Germany

November 2006

Page 2: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

2

Yassen Assenov

Topological Analysis of Biological Networks

Master Thesis in Computer Science

Saarland University

Date: November 30, 2006

Page 3: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

3

Declaration

I hereby declare that this thesis is entirely my own work except where otherwise indicated. I have used only the resources given in the list of references.

Yassen Assenov November 2006

Page 4: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

4

Page 5: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

5

Abstract An astonishing diversity of biological concepts and processes can be represented by the relatively simple formalism of networks, which consist of nodes connected by edges. Examples range from neuron communication to metabolism and gene suppression mechanisms. The pursuit of describing many interactions and dependencies between the constituents of a living organism has produced large networks of interacting proteins, coexpressed genes, metabolite transformations and other.

With advances in large-scale experimental technologies, such as microarray experiments and yeast-two-hybrid screens, data are constantly increasing both in size and complexity. Additional techniques, such as literature curation and prediction methods diversify the biological data in the available repositories. The development of a unified framework for the analysis and integration of these data has appeared as a central challenge in bioinformatics research.

This work presents two applications for the manipulation and investigation of biological networks. The Cytoscape plugin named NetworkAnalyzer facilitates topological analysis of large networks through computation and visualization of a variety of network parameters. Simultaneous inspection of analysis results of two or more networks enables network topology comparison. The stand-alone tool NetworkLoader compiles interaction networks from different molecular interaction datasets, providing flexible criteria for data filtering and integration of identifiers. The successful application of NetworkLoader and NetworkAnalyzer in handling protein-protein interaction data, as well as community demand for the functionality provided by NetworkAnalyzer encourages future support and development of these applications.

In addition, we introduce an adaptable approach for quantifying the topological role of a subset of nodes in a network. The approach consists of estimating p-values for a collection of topological properties., It application to groups of polyanion-binding proteins within protein-protein interaction networks in yeast and human revealed intriguing topological features of some of the analyzed groups.

Page 6: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

6

Page 7: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

7

Acknowledgements I would like to thank my supervisor Prof. Thomas Lengauer for his inspiring lectures that motivated me to work in the field of Bioinformatics. I am grateful for the helpful discussions with him and with Dr. Mario Albrecht about the projects I was involved in.

I would also like to thank Dr. Jörg Rahnenführer, Fidel Ramírez, Andreas Schlicker, Adrian Alexa, and Laura Toloşi for their suggestions on improvements in my work. Last but not least I would like to thank Lyublena Antova for her numerious remarks and corrections on earlier manuscripts of this thesis.

Page 8: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

8

Page 9: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

9

Table of Contents

1 Introduction----------------------------------------------------------------------------------------11 1.1 Tools for the analysis of biological networks ------------------------------------------12 1.2 Outline---------------------------------------------------------------------------------------13

2 Definitions -----------------------------------------------------------------------------------------14 2.1 Biological Networks -----------------------------------------------------------------------15 2.2 Network Parameters -----------------------------------------------------------------------16

3 Interaction Databases ----------------------------------------------------------------------------23 3.1 BioGRID ------------------------------------------------------------------------------------23 3.2 IntAct ----------------------------------------------------------------------------------------23 3.3 CYGD ---------------------------------------------------------------------------------------23 3.4 HPRD----------------------------------------------------------------------------------------24 3.5 DIP -------------------------------------------------------------------------------------------24 3.6 UniProt --------------------------------------------------------------------------------------24

4 Interaction File Formats -------------------------------------------------------------------------26 4.1 SIF--------------------------------------------------------------------------------------------26 4.2 PSI MI ---------------------------------------------------------------------------------------26

5 Implementation of Network Tools -------------------------------------------------------------27 5.1 Compiling Interactomes with NetworkLoader -----------------------------------------27 5.2 Analyzing Topology with NetworkAnalyzer-------------------------------------------31 5.3 Quantifying Node Subset Significance--------------------------------------------------33

6 Results of Topological Analysis----------------------------------------------------------------36 6.1 Network Comparison ----------------------------------------------------------------------36 6.2 Significance of Polyanion-binding Proteins --------------------------------------------38

7 Conclusions----------------------------------------------------------------------------------------47 7.1 Summarizing Remarks --------------------------------------------------------------------47 7.2 Outlook --------------------------------------------------------------------------------------47

8 Bibliography---------------------------------------------------------------------------------------47 Appendix A – Implementation Details of NetworkAnalyzer -------------------------------------57 Appendix B – Details of the Analysis of PABPs ---------------------------------------------------63

Estimation of p-values for properties of PABPs in yeast---------------------------------------63 Estimation of p-values for properties of PABPs in human-------------------------------------68 Estimation of p-values for properties of yeast homologues of human PABPs in yeast ----78

Page 10: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

10

Page 11: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

11

1 Introduction Large-scale experiments, prediction techniques and approaches for mining biomedical literature produce vast amounts of biological data. The formalism of networks can be used to provide a systematic view of these data. Various network types are introduced based on the nature of the underlying data. Examples of such network types are protein-protein interaction, genetic interaction, transcriptional regulation, sequence homology, and expression correlation.

The bioinformatics community has focused on the analysis of these network data, testing associations between them with the aim of obtaining a robust and integrated view of the underlying biology. Constructing models based on more than one source of biological network data can give important insights into the functions of biological entities. By combining transcriptional regulatory information and gene-expression data, Luscombe et al. [61] have concluded that fast signal propagation arises as an environmental response. On the other hand, cell cycle and sporulation favor direct temporal progression of signals through multiple stages. Another study has focused on incorporating phosphorilation data with protein-protein interaction networks and transcriptional factor binding data [82].

Combining biological networks can also reveal interesting information related to protein complexes. For example, the formation of protein complexes can be analyzed by integrating protein interaction and gene expression networks [26]. Successful integration of protein interactions with expression profiling similarity and phenotypic profiling similarity can provide insights into protein complex activity [39].

Enriching an existing network with additional non-network data is also a promising approach of gaining insights into complex biological processes. Incorporating information about the three-dimensional protein structure in a protein-protein interaction network facilitates understanding of the overall behavior of a cell [7]. Yan et al. [119] have successfully integrated quantitative proteomic analysis with protein interaction data, discovering novel regulatory components of interferon response. The results of their research facilitate the understanding of the interplay between Hepatitis C virus, interferon and the host response to virus infection.

Other studies focus on the computationally intensive task of identifying significant substructures in a network, also known as motifs. Sharan et al. [93] have aligned the protein-protein interaction networks of Caenorhabditis elegans, Drosophila melanogaster, and Saccharomyces cerevisiae by grouping proteins from different species based on their sequence similarity. The aligned network has revealed 71 regions that are conserved among the three species. This conservation is used to identify protein function and predict unknown interactions. Zhang and colleagues [127] have found many three- and four-node motifs in a network which combines protein-protein interactions, genetic interactions, transcriptional regulation, sequence homology and expression correlation. Most of the motifs form classes of higher order recurring interconnection patterns, which can be tied to specific biological phenomena and are likely to represent fundamental network design principles.

Protein-protein interaction networks can also guide the prediction of structure, function and sub-cellular location of unknown proteins [53]. Goehler et al. have functionally annotated 16 uncharacterized proteins by investigating the clusters of a network obtained by combining library and matrix yeast-two-hybrid screens [35]. Sometimes network structure alone can be used to predict missing data. In protein-protein interaction networks constructed by high-throughput experiments, some of the interactions missed by the experiment are reflected as missing edges in nearly complete cliques [124].

Similar results of studies on biological data have encouraged the application of algorithms for network analysis and decomposition from graph theory to biological networks. Recent reviews discuss the algorithmic approach to information extraction from intertwined data in these

Page 12: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

12

networks. [1, 53].

Most of the approaches described above apply a reductionist approach, seeking to understand complex systems through detailed investigation of their individual parts. This thesis focuses on a different direction in the study of complex biological networks. It presents a systematic approach to characterization of the topology of very large networks, which enables network comparison on a structural level. It also facilitates the process of identifying functional and evolutionary properties of biological networks.

Direct functional inference from topological properties is subject to moderate criticism. In contrast to other studies [13, 40], Siegal and colleagues [94] have observed only weak correlation between node connectivity and lethality in protein-protein interaction networks. In addition, extrapolation of characteristics found in subnetworks to the global network are not always justified. For example, Stumpf et al. [99] have proven that the scale-free nature of partial networks does not always imply scale-free properties for the whole (usually unknown) network. Nevertheless, computation of topological parameters has inspired the development of network growth models [77, 84] and has given insights into the evolution and behaviour of organisms [4].

1.1 Tools for the analysis of biological networks A variety of applications provide automated means of network construction and visualization, with a special focus on biomolecular data. Examples include the Cytoscape software platform [92], the VisANT visual framework [47], the Protein Interactions and Network Analysis (PIANA) tool [9] and the Osprey network visualization system [20]. Recently, comprehensive reviews of network analysis tools became available [72,106]. The developers of Cytoscape have focused on creating a bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data. Any additional features are included in the form of plugins. For example, the GenePro plugin [108] is used for the visualization of protein clusters at different resolutions. By providing network abstraction and visualization capabilities, Cytoscape is a suitable framework for developing a topological analysis tool. PIANA is also an extensible system, focusing on data integration from multiple sources. In addition, it provides a library that handles operations on the network, enabling python scripts to perform network analysis of the loaded interaction data. A disadvantage of the system is that it lacks network visualization and editing functionality. In contrast to PIANA, the Java application VisANT provides flexible visualization of network data. It also automatically collects interaction data from MIPS [68], BIND and other databases. VisANT computes a very limited set of topological parameters, related to node degrees, clustering coefficients and shortest path lengths. Unfortunately, PIANA and VisANT can load interaction files only in the old version (1.0) of the PSI MI format. Osprey is a Java-based tool for visualization and manipulation of interaction networks. It uses the BioGRID [96] database for building networks. Osprey provides additional functionality, such as text search by gene names. However, topological analysis is not supported.

The tools described above concentrate on the loading and visualizing of interaction data. A different goal is followed for example in the TopNet [125] and SpectralNet [32] applications, namely facilitating topological analysis of biological data. TopNet provides users with the ability to compare the topological statistics of different sub-populations within the same network and to explore complex networks in pieces. SpectralNet is a .NET application that computes parameters such as degree of a node, clustering coefficient and so on. Both TopNet and SpectralNet share the same disadvantage: they support only a limited number of topological parameters.

Indeed, certain topological parameters of biomolecular networks, such as the degree

Page 13: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

13

distribution, have received impressive attention [11, 33, 125, 32]. On the other hand, the computation of some parameters, e.g. topological coefficients, has remained an isolated case [97].

In response to the demand for tools supporting the new version of the common PSI MI format [74] and providing comprehensive topological analysis of biomolecular networks, we have developed two applications, related to the manipulation and analysis of biological network data. NetworkLoader is a tool for loading, saving, filtering and transforming molecular interaction data. It compiles a network from different interaction sources provided in PSI MI 2.5 format. NetworkAnalyzer is a plugin which calculates and displays the values of a number of network parameters, a necessary step for the exhaustive topological analysis of large networks. Both tools are developed in the Java programming language. Successful application of the tools on compiling and comparing protein-protein interaction networks, as well as positive user feedback on NetworkAnalyzer’s functionality and ease of use, encourages their future support and development.

In addition, we introduce three properties for quantifying the structural importance of a node subset within a network. These properties are used to identify important topological characteristics of groups of polyanion-binding proteins in yeast and human protein-protein interaction networks.

1.2 Outline Chapter 2 introduces several node and network parameters, as well as interpretations of their values. It also lists several types of biological networks. Chapter 3 describes popular databases providing protein-protein interactions, and Chapter 4 continues with a brief introduction to two file formats for molecular interaction storage and exchange. Chapter 5 discusses the tools we have developed for manipulating and analysing biological data. In addition, it presents an approach for quantifying the significance of a node subset and the impact it has on the topology of the whole network. Results on the application of our tools and our approach are provided in Chapter 5. Chapter 6 gives concluding remarks and an outlook on future extensions of our work. Appendix A discusses implementation details of NetworkAnalyzer, and can effectively serve as developer’s manual for the tool. Appendix B presents some of the data used in quantifying the significance of polyanion-binding proteins in yeast and human protein-protein interaction networks.

Page 14: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

14

2 Definitions Throughout the thesis, several symbols are used for indicating common terms, such as G for network, E for the set of edges in a network and lower case Latin letters for node names. Many of these symbols are borrowed from graph theory [28], a field in computer science. However, the concepts of network and graph are not equivalent and thus these terms should not be considered interchangeable.

Network is defined as a pair ( )EVG ,= of disjoint sets, where V is the set of nodes in the network, and E is the set of binary edges. Every edge ( ) Evue ∈= , contains two ends

Vvu ∈, and it is said to link the nodes u and v . If one of the ends of an edge is marked as source, and the other – as target, the edge is directed [8]; otherwise it is undirected. An edge that links a node to itself is called a self-loop. The empty network is the network that does not contain any nodes: 0/== EV .

Usually nodes are visually depicted as circles or polygons, and edges are represented by lines or curves connecting them. Directed edges are depicted by arrows on their target ends. For example, the network on Figure 1 contains four nodes and four edges. Three of the edges are undirected (one is a self-loop) and one is directed.

Figure 1. Example network of four nodes and four edges.

In Figure 1, the nodes a and b are neighbors to each other, because they are linked. Moreover, these two nodes give an example of a multi-edge node pair, that is, a pair of nodes that are linked by more than one edge. Node d has no neighbors (is not linked to any other node), and is therefore an isolated node. When directed edges are present, one can speak of incoming and outgoing edges of a node. The first type of edges determines the in-neighbors of the node, and the second type – the out-neighbors.

A network that contains only directed edges is called directed. Similarly, undirected networks contain undirected edges only. Figure 2 shows an example of an undirected network.

Page 15: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

15

Figure 2. Undirected network of 12 nodes and 14 edges.

Network ( )111 , EVG = is a subnetwork of ( )222 , EVG = when 21 VV ⊆ and 21 EE ⊆ . Let ( )EVG ,= be a network and VS ⊆ . Then ( )SS ESG ,= , where

( ) ( ){ }EbaSbabaEs ∈∧∈= ,,, , is the subnetwork of G , induced the node subset S .

2.1 Biological Networks Biomolecular interaction networks, some of which are also referred to as cellular maps [113], provide a natural framework for describing different aspects of biological systems, such as a living cell. Recently, various reviews have provided comprehensive information on the available networks, as well as experimental and prediction techniques for constructing them [118, 22]. An exemplary, and by no means exhaustive, list of biological network types is given below.

2.1.1 Metabolic networks These networks combine the knowledge on metabolic pathways and cycles in an organism or a cell type. Nodes in the metabolic networks are metabolic substrates or products, and edges represent transitions through chemical reactions, catalyzed by enzymes [2, 60]. Transitions are always directional and thus the metabolic networks are directed.

2.1.2 Expression networks In expression networks, nodes are genes and edges connect genes that are coexpressed. These networks are constructed by large-scale DNA microarray experiments [29], and the unordered composition of a pair of coexpressed genes leads to the undirected nature of the networks.

2.1.3 Regulatory networks There are two types of nodes in regulatory networks – genes and transcription factors. Edge connecting a gene and a transcription factor indicate up- or downregulation [54, 23].

Page 16: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

16

2.1.4 Genetic interaction networks In general, two genes are said to interact genetically if a mutation in one gene either suppresses or enhances the phenotype of a mutation in its partner gene [102, 118]. The nodes in genetic interaction networks are genes and the edges are genetic interactions. Containing only directed interactions, genetic networks are an example for directed biological networks.

2.1.5 Protein-protein interaction networks In protein-protein interaction (PPI) networks, nodes represent proteins and edges depict physical or genetic interactions between them. The collection of all protein-protein interactions for one organism is identified by the term interactome. Systematic yeast two-hybrid (Y2H) and mass spectrometry-based biochemical affinity co-purification approaches have presumably revealed the dominant part the yeast interactome [68, 98]. The human interactome, however, remains largely unknown [83, 98].

2.1.6 Domain interaction networks Domains are independently folded modules of a protein. A domain-domain interaction (DDI) network is constructed when each protein in a PPI network is replaced by one or more nodes representing its constituent domains, and edges connecting two proteins are transformed to connect the corresponding domain nodes [90, 5]. Since most of the known proteins are composed of more than one domain, a domain-domain interaction network usually gets much larger than the original protein-protein interaction network. However, different proteins (often functionally unrelated) frequently share identical domains and therefore one domain node in a DDI network usually appears multiple times in the context of different proteins. On the other hand, in domain co-occurrence networks each domain is represented by a single node, and two nodes are connected by an edge when the corresponding domains occur in one protein at least once [115, 116].

2.2 Network Parameters A variety of layout algorithms supports the identification and characterization of global network features through visual inspection. However, the size and complexity of many biological networks severely limits the effectiveness of such an approach. Gaining insights into the organization and topology of a large network can be achieved through computation and analysis of network parameters – parameters that quantify different aspects of the network’s internal structure. Note that most of the parameters described below are not defined for arbitrary networks, but for directed and undirected networks only.

2.2.1 Simple Parameters Network parameters whose values are integer or real numbers are referred to as simple network parameters. Number of nodes, number of edges and number of self-loops of a network are examples of simple network parameters because they all take natural values.

Number of connected components In undirected networks, two nodes are connected if there is a path of edges connecting them. Obviously, if Vba ∈, , then a is connected to b if and only if b is connected to a . Therefore, the notations ( )ba, and ( )ab, are considered identical. Within one network, all nodes that are pairwise connected form a connected component. Every non-empty network contains one or more connected components. A connected network is a network that consists of a single

Page 17: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

17

connected component; all networks that are not connected are considered disconnected. An articulation point is a node in the network, whose deletion increases the number of connected components. For example, the network shown in Figure 1 contains two connected components – { }cba ,, and { }d ; it is therefore disconnected. Node b is an articulation point, as when removed the first component of the network splits into two components. The number of connected components is a simple parameter, indicating the connectivity of a network – a lower number of connected components suggests a stronger connectivity. The expression ( )GCC is used to denote the number of connected components in the network ( )EVG ,= . Since every connected component contains at least one node, the following inequalities hold:

( ) VGCC ≤≤1

In directed networks, every path is formed by a sequence of edges where the target of one edge is the source of the next edge in the path. Therefore, paths also have directions and, unlike undirected networks, a connectivity of <a,b> does not imply connectivity of <b,a>.

Parameters related to shortest paths The length of a path is considered to be the number of edges forming it. Often there are multiple paths connecting two given nodes. The shortest path length, also called distance, between two nodes n and m is denoted by ( )mnL , . The network diameter is a simple parameter, computed as the greatest distance between two nodes. If a network is disconnected, its diameter is the maximum of all diameters of its connected components. This parameter shows the maximum distance between two connected nodes in the network. Another simple parameter – the average shortest path length – gives the expected distance between two connected nodes. Additional insight into the shortest path lengths is provided by the complex parameter shortest path length distribution, described later in this chapter.

Note that two nodes are connected if and only if a shortest path between them exists. The simple parameter shortest paths is the number of node pairs that are connected by paths. Similarly to the number of connected components, described above, the value of this parameter is influenced by the connectivity of the network. For a network of V nodes, the highest possible value for

shortest paths is 2

1−VV when the network is undirected (and 1−VV when it is directed).

Moreover, the maximum value is reached only when the network is connected.

Parameters related to neighborhoods The neighborhood of a given node a is the set of its neighbors and is depicted by ( )aN . The connectivity of a , denoted by ak , is the size of its neighborhood. The average number of neighbors is a simple parameter which shows the average connectivity of a node in the network. The value of this parameter reflects the density of the network. The parameter number of isolated nodes could give additional insights into how the network’s density is distributed. In addition, the simple parameter number of multi-edge node pairs indicates how often neighboring nodes are linked by more than one edge.

Clustering coefficient In undirected networks, the clustering coefficient nC of a node n is defined as

Page 18: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

18

( )12

−=

nn

nn kk

eC , where nk is the connectivity (number of neighbors) of n and ne is the

number of connected pairs between all neighbors of n [112,12]. In directed networks, the

definition is slightly different: ( )1−=

nn

nn kk

eC .

In both cases, the clustering coefficient is a ratio MN , where N is the number of edges

between the neighbors of n , and M is the maximum number of edges that could possibly exist between the neighbors of n . The clustering coefficient of a node is always a number between 0 and 1.

The network clustering coefficient ( )GC is the average of the clustering coefficients for all nodes in the network1. An alternative definition for the network clustering coefficient, mostly

used in sociology literature [71], is ( )Gintriplesnodeconnectedofnumber

GintrianglesofnumberGC ×=

3, where

a connected node triple is formed by a node with edges running to an unordered pair of other nodes [14]. In order to remain consistent with other studies on biological networks, this thesis makes use of the first definition of network clustering coefficient given above.

2.2.2 Complex Parameters In contrast to the simple parameters, the values of the complex network parameters are data structures, such as histograms or sets of two-dimensional points.

Degree distributions In undirected networks, the node degree of a node n is the number of edges incident on n . Special cases are the self-loops, which are counted twice for the node degree [28]. The complex parameter node degree distribution gives the number of nodes with degree k for K,1,0=k . This sequence is often described by kp , a function of the degree k .

In directed networks, the in-degree of a node n is the number of incoming edges, and the out-degree is the number of outgoing edges. Similar to undirected networks, there are in-degree distribution and out-degree distribution.

Neighborhood connectivity In undirected networks, the neighborhood connectivity of a node n is defined as the average connectivity of all neighbors of n [66]:

( ) { } ( )nNxxkaveragenNC ∈=

The neighborhood connectivity distribution is a complex network parameter that gives the average of the neighborhood connectivities of all nodes n with k neighbors for K,1,0=k . Figure 3 shows the neighborhood connectivity distribution for the network presented in Figure 2.

1 In the computation of the network clustering coefficient, all nodes with less than two neighbors are assumed to have a clustering coefficient of 0.

Page 19: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

19

Figure 3. Neighborhood connectivity distribution of the network shown in Figure 2, as presented by NetworkAnalyzer.

We introduce similar parameters for directed networks. In analogy to the in- and out-degree, every node in a directed network has in- and out-connectivity. Thus, in directed networks, a node has the following types of neighborhood connectivity:

• “only in” – the average out-connectivity of all in-neighbors of n :

( ) { } ( )nNxxoutin

inkaveragenNC ∈=

• “only out” – the average in-connectivity of all out-neighbors of n :

( ) { } ( )nNxxinout

outkaveragenNC ∈=

• “in and out” – the average connectivity of all neighbors of n (edge direction is ignored):

( ) { } ( )nNxxkaveragenNC ∈=

Based on the three definitions given above, we have introduced three neighborhood connectivity distributions – “only in”, “only out” and “in and out”.

Shortest path length distribution The shortest path length distribution gives the number of node pairs ( )mn, with ( ) kmnL =, for K,2,1=k . Here, the maximum value of k is the diameter of the network.

Average clustering coefficient distribution The average clustering coefficient distribution gives the average of the clustering coefficients for all nodes n with k neighbors for K,3,2=k .

Shared neighbors distribution Let us denote by ( )mnP , the number of partners shared between the nodes n and m , that is, nodes that are neighbors of both n and m . We have introduced the complex parameter shared neighbors distribution. This parameter gives the number of node pairs ( )mn, with ( ) kmnP =, for K,2,1=k .

Page 20: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

20

Topological coefficients The topological coefficient [95] nT of a node n with nk neighbors is defined as:

( )( )n

n kmnJavgT ,

= .

Here, ( )mnJ , is defined for all nodes m which share at least one neighbor with n . The value of ( )mnJ , is the number of neighbors shared between the nodes n and m , plus one if there is a direct link between n and m . The complex parameter topological coefficients consists of the set of the topological coefficients of all nodes in the network having one or more neighbors.

2.2.3 Interpretations and Network Topologies The node coefficients described above can reveal intriguing properties of a particular node. The clustering coefficient of a node is the number of triangles (3-loops) that pass through this node, relative to the maximum number of 3-loops that could pass through the node. For example, in Figure 2, there is one triangle that passes through node g (the triangle djg ). The maximum number of triangles that could pass through g is three (if the pairs ( )id , and ( )ji, were connected). Thus, the clustering coefficient is 3/1=gC .

The interpretation of neighborhood connectivity of a node is often domain-specific. One example comes from genetic interaction networks, in which nodes represent genes, and edges – gene suppression. In these networks, nodes with higher out-degree are more general suppressors than nodes with a lower out-degree. Therefore, the “only in” neighborhood of node n shows the universality of the suppressors of n .

Similarly, network parameters encode interesting information about the structure of a network. Note that the topological coefficient is a relative measure for the extent to which a node shares neighbors with other nodes. Therefore, the plot of the topological coefficients can be used to estimate the tendency of the network nodes to have shared neighbors. Using this complex parameter, we could even compare two networks with respect to the tendencies to contain shared neighbors, without introducing bias because of differences in the sizes of the networks. In contrast to the topological coefficients, the shared neighbors distribution counts frequencies and thus contains absolute, rather than relative, values. The shared neighbors distribution is therefore of limited use in network comparison; however, it can be used to identify over-representation of certain motifs, such as the one presented in Figure 4.

Figure 4. Motif of two nodes sharing exactly four neighbors..

If the neighborhood connectivity distribution is approximating a decreasing function, the network is dominated by edges between low connected and highly connected nodes. Maslov et

Page 21: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

21

al. have discovered this effect in PPI and regulatory networks [66] and argue that it increases the overall robustness of the network.

Hubs Hubs are nodes that have significantly higher connectivity than the average one in a network. These strongly connected nodes have received considerable attention and have shown a tendency to play an essential role in protein-protein interaction networks. [122, 40]. By analyzing different conditions of the yeast transcriptional network, Luscombe et al. found that “a few transcription factors serve as permanent hubs but most act transiently only during certain conditions” [61].

Another study on dynamical properties of networks separates the hubs in protein-protein interaction networks into party hubs and date hubs [40]. Party hubs interact with all or most of their neighbors simultaneously, whereas date hubs bind their neighbors at different times or locations. Ekman et al. [31] have found that proteins which act as party hubs contain long disordered regions more often than date hubs do. This observation indicates that disordered regions are important for flexible binding but less so for static interactions.

Scale-free networks The degree distribution of many real-world networks approximates a power law: ( ) αkkDD ~ for some negative constant α . Examples of such networks include the World Wide Web [3, 12, 20] human sexual contacts [58, 50], and some biological networks [11, 56]. Networks with power-law degree distributions are referred to as scale-free [11] – a term pointing out their disagreement to the model of random networks (as defined by Erdős and Rényi [30, 17]). In fact, scale-free is any function ( )xf that remains unchanged within a multiplicative factor under rescaling of x [71], that is, any function ( )xf , such that:

( ) ( ) ( )xfabaxfa =∀ :

Since a power-law is the only solution to the equation above, the terms scale-free function and power-law can be considered equivalent.

Instances of scale-free networks often reveal additional topological properties beyond their degree distribution (e.g. error tolerance and attack vulnerability [49, 40]), and the class of scale-free networks was implicitly assumed to include only networks having the observed properties. As a result, the definition of scale-free networks lacks consistency across scientific publications [18], and thus conclusions on properties of scale-free networks are prone to criticism [41, 33]. Validity of methods for identifying if a network is scale-free is also questioned [36, 101]. Some attempts were made to provide a unified framework for studying scale-free networks. For example, Li et al. demonstrate that networks with power-law degree sequences can be divided into two distinct classes, which they name scale-free and scale-rich [57].

Modularity Biological networks have a comparatively high network clustering coefficient, independent of the network size [111, 84]. This property indicates that they have a modular structure, being constructed of loosely connected dense clusters (modules). Certain algorithms are able to identify modules in a network even when the modules are partially overlapping [75]. Interestingly, in some protein-protein interaction networks, the division of modules strongly correlates with protein function. This feature enables function prediction for an unknown protein based on the cluster it belongs to [35].

Page 22: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

22

Hierarchical networks Besides the power law degree distribution, Ravasz et al. have found that metabolic networks of different organisms are organized into “many highly connected components that combine in a hierarchical manner into larger less cohesive units” [84]. Networks that are simultaneously scale-free and modular are referred to as hierarchical networks, and can be identified by inspection of the average clustering coefficient distribution. In contrast to other scale-free networks, the average clustering coefficient distribution of hierarchical networks approximates a power law in the form 1~ −xy . In the network model by Ravasz et al, hubs connect different highly connected units, rather than forming clusters, which explains their lower clustering coefficients.

Other studies have confirmed the hierarchical structure of regulatory and protein-protein interaction networks, although they have exhibited some disagreement on the role of hubs [121, 107]. Various computational methods for identifying hierarchical organization were proposed in the past years [84, 62, 76, 46].

Small-world Property Although generally sparse, biological networks experience the small-world phenomenon, popular in the sociological literature as the “six degrees of separation” [38]. Small-world networks are characterized by an average shortest path length which is significantly shorter than expected by chance (exact definition in [71]). This is the case for many observed biological networks, including metabolic [111] and protein-protein interaction networks [56]. The small-world phenomenon is thought of as a property of networks which are neither regular nor completely disordered (random) [112, 14]. Bollobás and Riordan have shown that the average shortest path length of scale-free networks increases no faster than nn loglog/log [19] and therefore scale-free networks could be considered a subclass of the small-world ones.

Page 23: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

23

3 Interaction Databases Although relatively immature, the field of proteomics has a lot of repositories at its disposal, which collect data on protein sequences, three-dimensional structures, post-translational modifications, physical and genetic interactions, and other. This chapter introduces six of the well-established databases for protein interaction data. The list below could be extended by BIND, MINT and STRING; however, these three databases are not relevant to our studies.

3.1 BioGRID The Biological General Repository for Interaction Datasets (BioGRID) [96] was created in 2003 as a source for protein and genetic interactions. It is a curated biological database containing more that 116000 interactions in Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens. BioGRID is accessible at http://www.thebiogrid.org/ and supports searches for large number of different identifiers. User-customized graphs with embedded protein, gene and interaction attributes can be constructed with a visualization system called Osprey that is dynamically linked to the BioGRID.

3.2 IntAct IntAct [45] provides an open source database and toolkit for the storage, presentation and analysis of protein interactions. To date, IntAct contains 2200 binary and complex interactions imported from the literature and curated in collaboration with the Swiss-Prot team, making intensive use of controlled vocabularies to ensure data consistency.

The data model of IntAct has three main components: Interactor, Interaction and Experiment. An interactor is a biological entity participating in an Interaction, which can be a protein, a DNA sequence, or a small molecule. An interaction contains one or more interactors participating in the interaction. Interactions are not limited to the binary case; multi-protein interactions can be represented as one interaction, without artificially splitting them up into several binary interactions. An experiment groups a number of interactions, usually from one publication, and classifies the experimental conditions in which these Interactions have been generated. Each experiment can have only a single interaction, or hundreds of interactions in the case of large-scale experiments.

IntAct uses controlled vocabularies instead of free text attributes to ensure the data from different sources remains correct and searchable. IntAct uses existing reference systems, such as the NCBI taxonomy database or Gene Ontology (GO), together with newly developed controlled vocabularies.

The web interface available at http://www.ebi.ac.uk/intact provides textual and graphical representations of protein interactions, and allows exploring interaction networks in the context of the GO annotations of the interacting proteins.

3.3 CYGD The Comprehensive Yeast Genome Database (CYGD) [37] compiles a comprehensive data resource for information on the cellular functions of the yeast Saccharomyces cerevisiae and related species. In addition to sequence information and functional annotations on individual genes and proteins, the database provides information on the physical and functional interactions among proteins as well as other genetic elements. These cellular networks include metabolic and regulatory pathways, signal transduction and transport processes as well as co-regulated gene clusters. CYGD provides a way of exploring related genomes with the aid of the

Page 24: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

24

S.cerevisiae genome as a backbone and SIMAP, the Similarity Matrix of Proteins. The comprehensive resource is available under http://mips.gsf.de/genre/proj/yeast/.

The Functional Catalog (FunCat) is a hierarchically structured, organism-independent controlled classification system, used for the functional description of proteins. To represent complex data of fungal genomes, CYGD uses the Genome Research Environment (GenRE) for annotations. GenRE allows for the combination of information on different classes of genetic elements and their relationships, such as protein–protein interactions or common regulatory features; it provides annotation features as well as flexible data retrieval interfaces.

3.4 HPRD The Human Protein Reference Database (HPRD) [79] is an object-oriented database that serves as a resource of genomic and proteomic information and assists in biomedical discoveries by providing an integrated view of sequence, structure, function and protein networks in health and disease. HPRD contains information about several aspects of human proteins including protein-protein interactions, post-translational modifications, enzyme-substrate relationships and disease associations. The cataloged information was derived manually by critical reading of the published literature by expert biologists and through bioinformatics analyses of the protein sequence.

HPRD is available at http://www.hprd.org.

3.5 DIP The Database of Interacting Proteins (DIP) [117] catalogues around 55000 protein-protein interactions between 19000 proteins that were discovered in experiments. The data has been curated both manually by experts and automatically by computational approaches that use the knowledge derived from a reliable core subset of DIP interactions. The DIP server is accessible at http://dip.doe-mbi.ucla.edu/ and provides additional tools for analyzing and visualizing interaction information.

The DIP database contains four main tables – Protein, Interaction, Method, and Reference. The first table stores information on the DIP Nodes – the proteins involved in interactions such as cross-references to UniProt and GenBank, as well as protein localization and function. The second table lists the DIP Edges, that is the binary interactions between proteins and when known the domains responsible for the interaction. Experimental techniques used to unveil interactions are cataloged in the Method table. The fourth table Reference contains references to scientific literature describing protein interactions.

3.6 UniProt UniProt [8] is a repository for protein data. The abbreviation stands for Universal Protein Knowledgebase consortium and aims to provide the scientific community with a freely accessible, centralized resource for protein sequences and functional information.

UniProt originated from the merging of three databases with different sequence coverage and annotation priorities: the Swiss-Prot and TrEMBL and the PIR-PSD database. Swiss-Prot contains extensive protein annotations, cross-references and computational analysis by experts. TrEMBL (Translated EMBL Nucleotide Sequence Data Library) provides automated annotations for the proteins not in Swiss-Prot.

UniProt is constituted by three database layers: UniProt Archive (UniParc), UniProt Knowledgebase (UniProtKB), and UniProt Reference Clusters (UniRef). UniParc is a non-redundant protein sequence collection obtained from different sources including Swiss-Prot,

Page 25: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

25

TrEMBL, PIR-PSD, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, and WormBase. Sequences are identified by a unique UniParc identifier and cross-referenced to the source databases. The UniProt knowledgebase constitutes the core of the UniProt consortium and incorporates protein sequences with functional annotations from UniProt, TrEMBL, and PIR-PSD. The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from the UniProt Knowledgebase and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions (100%, >90% and >50%). Three databases constitute UniRef (NREF100, NREF90, NREF50), each a non-redundant sequence collection clustered by sequence identity and taxonomy. They facilitate sequence merging in UniProt and faster sequence similarity searches. NREF100 contains all records in the UniProt Knowledgebase and selected UniParc records. NREF90 and NREF50 are obtained from NREF100 by merging records from all organisms with mutual sequence identities greater 90% or 50%, respectively, into a single record that links to the corresponding UniProt knowledgebase records.

The UniProt database is available at http://www.uniprot.org.

Page 26: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

26

4 Interaction File Formats Unfortunately, the databases described in Chapter 3 are not synchronized with each other. One of the reasons is that their data formats are incompatible, which hampers the consolidation of interaction data from different sources. In addition to tab-delimited text files, some database providers offer data export in their own XML-based storage formats, such as the XIN format used by DIP. Here, we list two file formats, designed for storage of network and interaction data, respectively. These formats are relevant to the implementation of the NetworkLoader tool, presented in Chapter 5.1.

4.1 SIF The simple interaction format (SIF) is a flat file containing a list of interactions. Each line in the file specifies a source node, a relationship type (or edge type), and one or more target nodes. The file can contain duplicate entries, corresponding to multiple edges between the same pair of nodes.

SIF is supported by the Cytoscape platform and is defined in Cytoscape’s documentation. This file type facilitates easy building of a network from a list of interactions. However, since no additional information is stored in the file, network layout has to be recomputed upon each loading.

4.2 PSI MI The PSI MI format [44] is a data exchange format for molecular interactions. It has been jointly developed by members of the Proteomics Standards Initiative (PSI) work group of the Human Proteome Organization (HUPO), and is supported by major protein interaction data providers, such as DIP, HPRD, BIND, and other.

The PSI MI format uses a multi-level approach. Level 1 provides a basic format for representing the majority of the currently available protein-protein interaction data and does not contain details on the interaction mechanisms or full experimental description. Level 2.5 adds more details and support for additional interacting entities.

Each PSI MI 2.5 document contains as root node the entrySet element which defines one or more entries. Each entry describes a list of one or more interactions as a self-contained unit. An interaction is identified by a list of experiments and a list of molecules participating in it. It can include one, two or more participants, in which case it is intramolecular, binary, or n-ary, respectively. The participants in the interaction can have identifiers in an external databases. It is also possible to specify a type for the interaction. One or more interactions can have associated description of experiments. These desriptions contain the detection method used to determine interactions and, optionally, the host organism in which the experiment was performed. Each experiment can assign a confidence value for the object.

Both Level 1 and Level 2.5 of the PSI MI support controlled vocabularies provided in Gene Ontology and OBO formats, such as “interaction type”, “sequence feature type”, “feature detection”, “participant detection” and “interaction detection method”.

Page 27: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

27

5 Implementation of Network Tools

5.1 Compiling Interactomes with NetworkLoader

5.1.1 Motivation The new PSI-MI 2.5 format for storage and transfer of interaction data facilitates the processes of synchronization, validation and manipulation of interaction data, available in biological data repositories. It contrast to the expectation to become the predominant standard in January 2006 [73], its immature state and slow popularity growth result in a restricted availability of accompanying tools [74]. As of July 2006, VisANT supports only the old version of the molecular interaction standard, and the PSI MI Import plugin for Cytoscape is still under beta-testing. Reflecting the community demand for applications that operate on PSI-MI 2.5 files, we have developed NetworkLoader, a tool for filtering and combining interaction data from different sources in PSI-MI 2.5 format.

Unless otherwise stated, the abbreviation PSI in the rest of this chapter refers to version 2.5 of the specification. Thus, for example, PSI file should be read as file in PSI-MI 2.5 format.

5.1.2 NetworkLoader Functionality NetworkLoader compiles an interaction network from a group of interaction datasets. Every node in the network is an interactor identifier in an external database, and every edge depicts an interaction between the corresponding interactors. The datasets are successively loaded from PSI files. Upon loading, NetworkLoader validates the structure of every file for compliance with the PSI specification. If a deviation from the specification is found, it is reported as a warning or an error message. For example, when an unknown tag in the PSI file is reached, NetworkLoader ignores it and outputs a warning. On the other hand, invalid XML structure of the document gives rise to an error because further processing is impossible. In addition, NetworkLoader examines the use of (some of the) external controlled vocabularies. For example, if the loaded file contains terms for interaction detection method that are not defined in the MI Ontology, the user is assisted in including them (see Figure 5).

Page 28: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

28

Figure 5. Dialog window asking the user to attach the unknown interaction detection methods to the ontology. These methods are read from the PSI MI 2.5 export of S. cerevisiae PPI data in BioGRID 2.0.20.

When a dataset is successfully loaded, NetworkLoader displays a table that lists the external databases referenced in interactor descriptions (see Chapter 4.2). For every database, the number of interactors that are referencing it is presented. After selecting one of the databases, the user can integrate the dataset into the compiled network. Every interactor is transformed to a node that stores its ID in the chosen database, and interactions are represented by edges between the corresponding nodes.

NetworkLoader also provides a filter that either accepts or rejects an interaction, based on a set of customizable criteria. Only accepted interactions can be integrated into the network. The criteria include the following interaction properties:

Interaction arity. Self-interactions, as well as interactions involving more than two participants can be rejected. In PPI datasets, interactions of arity greater than two usually describe protein complexes, and therefore this property can be used for discarding them2.

Interaction Detection Method. Users can specify a set of acceptable interaction detection methods (IDMs). Interactions whose detection methods are not in the specified set will be

2 Note that interaction arity is independent of the term hierarchical interactions in the PSI MI specification. NetworkLoader ignores any hierarchical interactions defined in a PSI file.

Page 29: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

29

rejected. NetworkLoader assists the user by visualizing the ontology of IDMs. When the user selects a detection method, its descendents in the ontology are automatically included in the selection.

Feature Detection Methods. Similarly to IDM, an interaction can be accepted or rejected based on its feature detection methods (FMDs). Since this property is optional in the PSI format, some interactions may have unspecified FDMs. The user controls whether such interactions are accepted or not.

Interaction Type. The mechanism for filtering interactions based on their types is identical to the one for FDMs described above.

Host Organism. By specifying an NCBI taxonomy ID, the user can adjust the filter to accept only interactions whose participants are from a given organism. Inter-species interactions, among others, can be discarded using this property.

Another useful feature of NetworkLoader is its ability to apply mappings to interactor identifiers. A mapping translates identifiers in a source database (source IDs) to identifiers in a target database (target IDs). The source-to-target translations are defined in mapping files, which are tab-delimited text files (see Figure 6). After specifying a mapping file, the user can choose to apply the mapping to the currently loaded dataset. As a result, some (or all) interactors obtain identifiers for the target database. This happens when an interactor identifier is listed as a source ID in the mapping file.

P46994 S000003705 P10869 S000000854 P38824 S000001158 P38803 S000001127 P38896 S000000088 S000001256 P07390 S000005328

Figure 6. Excerpt of a mapping file between UniProt and SGD (http://www.yeastgenome.org/) identifiers. Each line starts with a UniProt accession number, which is followed by one or more SGD identifiers for the same protein. Note that this is a one-to-many mapping because P38896 is mapped to two identifiers.

In addition, a mapping can also be applied to the compiled network. In this case, the node labels are translated to the target IDs specified in the mapping file. When several source IDs are translated to the same target ID, the corresponding nodes are automatically merged. In case of one-to-many mappings, NetworkLoader splits every node whose identifier is mapped to more than one target ID (see Figure 6).

The resulting network can be exported in Simple Interaction Format. All interactions in the output file are labeled pp.

5.1.3 Compiling the Yeast Interactome with NetworkLoader We have used NetworkLoader to compile a protein interactome3 of Saccharomyces cerevisiae using the datasets provided in BioGRID, GYGD and IntAct. Only experimentally detected binary physical interactions were selected from each database. Hierarchical, genetic and predicted interactions were not included in the interactome. Interactions describing protein complexes were also ignored.

Using the mapping functionality of NetworkLoader, we were able to assign the systematic name to nearly all interactors in the three databases. Thus, the nodes of the resulting network are

3 Not to be confused with the yeast interactome. In this thesis, the term compiled interactome refers to a network compiled of several sources of interaction data.

Page 30: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

30

systematic gene names, and an edge between two nodes signifies that the products of the respective genes physically interact.

Table 1 lists the PPI interactions in yeast, extracted from the three aforementioned databases.

Database name PPIs (in yeast) Binary physical experimentally detected PPIs

PPIs included in the compiled

interactome BioGRID 82633 34370 34370 CYGD 15215 6759 6759 IntAct (small scale experiments) 8040 3300 3360

Table 1. Sizes of the three databases used to compile the yeast interactome.

BioGRID 2.0.20 contains definitions for 82633 binary PPIs in yeast. 34370 of them are experimentally detected physical interactions. The SGD interactor identifiers (http://www.yeastgenome.org/) were used when PPI data from CYGD was exported to a network. All identifiers were successfully mapped to systematic names. Mapping data was obtained from the web sites of UniProt and of CYGD.

We have extracted only small-scale experimental data from IntAct. The high-throughput datasets by [43] and [48] were not included in order to keep a low rate of false-positive interactions [10, 95, 110, 85]. We obtained 3300 binary physical interactions involving/among 2378 proteins identified by their UniProt accession numbers. After applying a mapping to systematic names, the resulting network contained 3360 interactions involving 2399 genes. The increase in the numbers of proteins and interactions can be explained with the fact that one UniProt accession number was sometimes mapped to multiple gene identifiers. 40 UniProt identifiers could not be mapped to systematic names and were therefore ignored.

A large fraction of the protein-protein interactions are stored in more than one database. The overlaps of physical interactions between IntAct (small-scale experiments only), BioGRID and CYGD are summarized in Figure 7. BioGRID is not only considerably larger than the other two databases, but also contains almost 90% of the interactions stored in CYGD and almost two-thirds of the interactions in IntAct.

Figure 7. Overlap of binary physical interactions in yeast, stored in BioGRID, CYGD and IntAct.

Page 31: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

31

5.2 Analyzing Topology with NetworkAnalyzer

5.2.1 Motivation In recent years, high-throughput experiments have produced large networks of interacting molecules, which are represented as nodes linked by edges in complex networks. In this context, the computation of topological parameters has become very popular in order to gain insight into the global network structure [12, 88, 97, 100, 104]. However, general software libraries for graph analysis such as JUNG, LEDA, NetworkX and yFiles are not easily applied by the biological end-user. While applications like Pajek [15] require expert knowledge in graph theory on the user side, other tools for biological networks such as TopNet [125], VisANT [47] and SpectralNet [32] calculate only a limited set of topological parameters.

Therefore, we have developed NetworkAnalyzer, which works as a user-friendly Java plugin for Cytoscape (http://www.cytoscape.org/), an established open-source software platform for the visualization of molecular interaction networks [92]. The plugin can be downloaded from our web site (http://med.bioinf.mpi-inf.mpg.de/netanalyzer/). Since its initial release in January 2006, the plugin has been downloaded by more than 400 users.

5.2.2 Features of NetworkAnalyzer NetworkAnalyzer can be applied to undirected as well as directed networks loaded into Cytoscape. A loaded network may contain only directed edges, but still be undirected in its nature (for example in case of a domain interaction network). Moreover, one network may contain both directed and undirected edges. This situation could arise, for instance, when one network is compiled by combining data from several different sources. In the scenarios described above, NetworkAnalyzer requests user input on how to interpret the edges and whether the internal representation of the network edges matches its nature. Figure 8 shows an example of a small network that contains only directed edges, along with three possible interpretations of its representation.

Figure 8. Network with directed paired edges and its three possible interpretations.

In agreement to the definitions presented in Chapter 2, NetworkAnalyzer computes two categories of parameters – simple and complex. A simple parameter is a single integer or real number. A complex parameter, on the other hand, is represented by a more complex structure, such as a distribution or a set of two-dimensional data points. NetworkAnalyzer uses bar charts

Page 32: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

32

and scatter plots to visualize complex parameters (see Figure 9).

Simple parameters computed include the number of nodes, edges, isolated nodes and self-loops in the network, the number of multi-edge node pairs, the number of connected components and connected node pairs, the clustering coefficient of the network, the average number of neighbors, the diameter and the average length of all shortest paths. In disconnected networks, the diameter reported is the longest among the diameters of all connected components.

For undirected networks, the following complex parameters are computed and visualized by NetworkAnalyzer: topological coefficients, distributions of node degrees, of average clustering coefficients, of shortest path lengths, of shared neighbors of two nodes, and of neighborhood connectivities. In the directed case, the node degree distribution is replaced by in-degree and out-degree distributions, and there exist three different neighborhood connectivity distributions, based on the definitions of neighborhood connectivity, presented in Chapter 2. In NetworkAnalyzer, the set of computed parameters is referred to as network statistics. Once calculated and visualized, network statistics can be saved into and loaded from a file in order to avoid recomputation.

Figure 9. Analysis of a human protein interaction network for neurodegenerative diseases with 3607 nodes and 7093 edges [59]. The shortest path length distribution (left) indicates that the network possesses small-world property. The decreasing trend of the neighborhood connectivity (right) shows that the network is dominated by edges between low and highly connected nodes.

The online help pages of NetworkAnalyzer provide detailed information about the definitions and the usage of the topological parameters listed above. Explanations of the different network interpretations are also included in the help pages. To ensure accurate values, NetworkAnalyzer results were compared to independent parameter computations using Pajek, TopNet, or the Python graph library NetworkX.

Aside from parameter computations, NetworkAnalyzer offers a useful set of network modifications. Users can create the intersection, union, and differences of two networks. In addition, NetworkAnalyzer provides an intuitive interface assisting the extraction of connected components as separate networks. Automatic removal of all self-loops in a network is also included in the plugin’s functionality.

Customizable settings of the plugin include the following:

Store node parameters in node attributes. For every node in a network, NetworkAnalyzer computes its degree (in- and out-degrees for directed networks), its clustering coefficient, the number of self-loops, and a variety of other parameters. If this option is enabled, NetworkAnalyzer stores the computed values as attributes of the corresponding node. This

Page 33: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

33

enables the users to apply different visualizations or to filter nodes based on the values of the computed attributes.

Use expandable style for the dialog that displays analysis results. If this option is enabled, analysis results are presented in window, in which all charts are placed below the simple parameters in expandable boxes. If this option is disabled, analysis results are presented in a window which contains tabs for the group of simple parameters and for every complex parameter. Users, who wish to view simultaneously two or more complex parameters of one network, should enable this option.

Location of the help documents. URL of NetworkAnalyzer's online help.

5.3 Quantifying Node Subset Significance

5.3.1 Topological Properties of Node Subsets In Chapter 2, we have introduced the notion of hubs. Removing a relatively small amount of highly connected nodes (hubs) from a scale-free network has a massive impact on the network’s global connectivity and topology [122]. It can therefore be concluded that hubs in scale-free networks have a significant contribution to the small-world effect of these network types. However, the manner of quantifying such significance is neither straightforward nor unique. In this chapter, we introduce a number of numerical properties of subsets of nodes. Each property is designed to quantify a specific topological role of the subset of interest.

Number of edges involved Given a subset S of the nodes in network ( )EVG ,= , we define the number of edges involved in S as:

( ) ( ) ( ){ }SvSuEvuvuSEI ∈∨∈∧∈= ,:,

( )SEI is the number of edges in the network for which at least one of the two ends is a node in S . This property can have only non-negative integer values. More precisely, ( ) { }ESEI ,,1,0 K∈ . For the network shown in Figure 10, { }( ) 9,, =cbaEI . If S consists of a

single node v , the number of edges involved in S is the degree of v 4. Therefore, this property can be seen as a generalization of the node degree to a set of nodes. The number of edges involved in a subset S can be used to quantify the contribution of the targeted node set to the overall network density. For example, in Figure 10 the subset { }cba ,, has the highest value for EI among all possible subsets of three nodes. Consequently, removing { }cba ,, decreases the average connectivity of the network stronger than removing any other set of three nodes.

4 The value of EI for a node that contains self-loops is lower than its degree. The reason is that a self-loop contributes two to the degree of a node, but only one to the number of edges involved. In networks that do not contain self-loops, the values are equal.

Page 34: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

34

Figure 10. Example network of 6 nodes and 9 edges. { }( ) 9,, =cbaEI ; { }( ) 2,, =cbaCCI .

Increase in Connected Components The property increase in the number of connected components, or simply increase in connected components, is defined for a subset S of the nodes in a network ( )EVG ,= in the following way:

( ) ( ) ( )GCCGCCSCCI SV −= \

In the formula above, SVG \ denotes the network induced by the node subset SV \ .The increase in connected components measures how many more connected components exist in the network with the nodes in S removed, in comparison to the whole network G . This property can be viewed as an extension and refinement of the definition of articulation point of a graph. Note that removing a set of nodes does not necessarily increase the number of connected components in the network. As a simple example, consider the case when VS = . Then

( ) ( ) ( )GCCVCCISCCI −== . In general, it can easily be shown that ( ) ( ) ( ){ }1\,,1, −+−−∈ SVGCCGCCSCCI K . In Figure 10, { }( ) 2,, =cbaCCI .

Intuitively, a high EI value for a set should imply a high CCI value. However, these two properties are not necessarily correlated. Note that in Figure 11a the node set { }ba, has a comparatively low EI value but a high CCI value. Also, the EI value of { }dcba ,,, in Figure 11b is high, whereas the CCI value is low.

(a) (b)

Figure 11. Two example networks.

Number of connected components in a subnetwork Let ( )SS ESG , be the subnetwork of ( )EVG , , induced by the set VS ⊆ . Then the number of connected components in S is

( ) ( )SGCCSCCN =

This property shows the tendency of the selected subset of nodes to form a network on its own.

Page 35: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

35

After applying the inequalities for CC in Chapter 2, we obtain the range of possible values for CCN : ( ) { }SSCCN ,,2,1 K∈ . The exact value of CCN is generally of little use because it strongly depends on the connectedness of the network G and, of course, on the size and the choice of S . However, its comparison to an expected value for a random choice of S might reveal information about the locations of the nodes of S in the networkG . In Figure 10, for example, { }( ) 3,, =fedCCN , the highest possible value for 3=S . Observing that the values of CCN for all other set of three nodes in the network are lower than 3, one could conclude that the distances between nodes in { }fed ,, tend to be longer than expected by chance.

Page 36: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

36

6 Results of Topological Analysis

6.1 Network Comparison

6.1.1 Motivation Composing all human protein-protein interactions (PPIs) on can give key insights into the molecular basis of life and disease. Such investigations can uncover putative functional relationships of proteins in cellular processes and between the human host and its pathogenic intruders [24, 104]. However, we are still far from a complete human interactome map, which is estimated to contain between 200,000 to 400,000 interactions [83]. Most human PPIs known until recently have been obtained in small-scale experiments collected from the literature in several databases [6, 79, 89, 126, 45]. Altogether, there are no more than 30,000-40,000 non-redundant PPIs currently available in the literature and publicly accessible [83, 34]. Since experimental data on PPIs are still scarce, bioinformatics methods have been used to predict human PPIs from experimentally derived interactions in other organisms such as yeast, worm, and fruit fly [88, 10, 26]. Lately, two comprehensive yeast two-hybrid (Y2H) screens have been performed with about 2,500 human proteins, resulting in about 4,800 PPIs [95, 88].

Unfortunately, even though the publications on those predictions and Y2H screens offer some evaluation of the provided interaction data, no comprehensive comparison and quality assessment of the different human datasets have been performed. Therefore, the reliability and the mutual relationship in terms of coverage and bias of predicted and new high-throughput data is unclear. This situation renders the selection or the combination of one or another PPI dataset for further studies of the human interactome difficult. Previous analyses of PPI datasets have mainly focused on high-throughput data for yeast, concluding that the reliability of the provided data is limited and that about 50% of the interactions may be spurious [10, 95, 110, 85]. Two studies on human PPIs have primarily compared literature-derived human datasets with each other or with interactions observed in other species [83, 34], but do not consider predicted datasets and recent Y2H screens. As a part of a comprehensive evaluation of the available human PPI datasets, we present comparison of PPI datasets based on their topological characteristics.

6.1.2 Methods We retrieved a number of human PPI datasets from different sources as follows: six predicted datasets from Bioverse [67], HiMAP [87], HomoMINT [80], Sanger [55], OPHID, and POINT, two experimental high-throughput Y2H screens CCSB-HI1 [88] and MDC (set LacZ4) [95], and the literature-curated dataset HPRD (release of September 13, 2005) [79]. We also obtained the fragment length of each gene product in the MDC dataset [95]. CCSB-HI1 was assembled using full-length proteins as baits and preys. The CCSB-HI1 authors reported that 78% of the interactions from a random sample of 217 interaction pairs could be verified using in vivo co-affinity purification assays [88]. The MDC technology was based on protein fragments of varying size as baits and preys [95]. The MDC authors described a verification rate of 62% interactions for a random sample of 116 interactions using membrane co-inmunoprecipitation assays and of 66% interactions for a random sample of 131 interactions obtained using pull-down experiments. Additionally, we used three core datasets that are available as subsets of Bioverse, HiMAP, and Sanger and are assumed to consist of particularly reliable predictions. The Bioverse-core subset contains interactions with a joint sequence similarity of at least 80% between human and the source species homologues [123]. As suggested in the original publication on HiMAP, the HiMAP-core subset consists of interactions with a likelihood ratio

Page 37: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

37

exceeding the computed threshold 1526 [87]. The Sanger-core subset comprises predictions derived from high-throughput experiments in yeast, worm, and fly that were detected more than once in the experimental assays [55]. Diverse original protein and gene identifiers RefSeq [81], Ensembl (gene) [42], Entrez Gene ID (former LocusLink) [64], UniProt [8] and GeneInfo Identifier (gi) [16] used in the interaction datasets were converted to Entrez Gene ID to allow comparisons of the datasets.

Dataset Number of proteins Number of interactions

Predicted protein-protein interactions Bioverse 7711 233941 Bioverse-core 1263 3266 HiMAP 5790 38378 HiMAP-core 2901 8832 HomoMINT 4184 10870 OPHID 4559 28255 POINT 12058 98528 Sanger 5923 67518 Sanger-core 3728 11131

Manually curated protein-protein interactions HPRD 6103 20121

Experimental Y2H protein-protein interactions CCSB-HI1 1549 2754 MDC 1124 2033

Table 2. Sizes of the 12 datasets used in the topological comparison. The given sizes are obtained after applying a mapping to Entrez gene identifiers.

Degree distributions, diameters, shortest path lengths, clustering coefficients, and topological coefficients of the interaction networks formed by the different human datasets were computed using the NetworkAnalyzer plugin.

6.1.3 Results The topological parameters can be used to uncover potential bias in the networks formed by the different PPI datasets. We computed the degree distribution, diameter, average shortest path length, and clustering coefficient for all datasets. We found that all networks fit well to a scale-free degree distribution, with degree exponent lying between -1.1887 and -2.0799. However, the average number of neighbors varies considerably between the datasets from 3.43 of CCSB-HI1 to 60.24 of Bioverse which reflects large differences in their densities (see Table 3), most noticeable between the experimentally derived and the predicted networks.

Dataset Average number of neighbors

Maximum number of neighbors

Exponent of degree distribution

Bioverse 60.24 842 -1.19 Bioverse-core 4.67 34 -1.76 HiMAP 13.26 159 -1.74 HiMAP-core 6.09 44 -1.8 HomoMINT 4.95 68 -2.08 OPHID 12.39 192 -1.43 POINT 16.26 522 -1.69 Sanger 22.69 365 -1.41

Page 38: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

38

Sanger-core 5.87 75 -1.84 CCSB-HI1 3.43 129 -1.56 MDC 3.58 95 -1.51 HPRD 6.5 161 -1.97

Table 3. Degree and neighborhood-related topological network parameters for each human protein interaction datasets. The degree distributions are fitted to power laws with exponentsγ.

The network clustering coefficient, a measure of network modularity, also exhibits significant differences ranging from 0.02 of MDC to 0.5 of Bioverse-core (Table 4). Another strong dissimilarity between the datasets is revealed by their average clustering coefficient distributions. The distributions of most of the networks did not show a strong correlation to the fitted power laws. In addition, the exponents vary from -0.82 to 0.22, suggesting lack hierarchical organization in the datasets of predicted PPIs. We have also found that all networks exhibit the small-world phenomenon, as indicated by their diameters and average shortest path lengths. This result comes to no surprise given the scale-free nature of the compared datasets and confirms previous studies on biological networks [111, 121, 122].

Dataset Network diameter

Average shortest path length

Clustering coefficient

Exponent of average clustering coefficient

distribution Bioverse 10 3.5 0.48 -0.18 Bioverse-core 17 6.32 0.5 0.22 HiMAP 18 5.16 0.44 -0.1 HiMAP-core 26 9.4 0.32 0.13 HomoMINT 12 4.92 0.07 -0.45 OPHID 18 4.54 0.19 0.09 POINT 10 3.53 0.09 -0.35 Sanger 10 3.87 0.23 0 Sanger-core 20 6.45 0.19 0.37 CCSB-HI1 12 4.36 0.06 -0.79 MDC 12 4.62 0.02 -0.82 HPRD 15 4.77 0.26 -0.61

Table 4. Topological network parameters for each human protein interaction datasets. The degree and clustering coefficient distributions are fitted to power laws with exponents γ.

6.2 Significance of Polyanion-binding Proteins

6.2.1 Motivation and Related Work A living cell consists of a number of only partially defined compartments in constant communication. Both the interior and exterior of a cell and its compartments possess regions of high negative charge density. Exterior negatively charged entities such as phospholipids, proteoglycans and polysialic acids as well as intra-cellular components including inositol phosphates, nucleotides, actin and tubulin microfilaments, DNA, RNA, and ribosomes render cells an ultimately highly crowded and extremely polyanionic environment. Considering the nature of this environment, it seems reasonable to postulate that numerous non-specific interactions between cellular polyanions and proteins can take place. Cellular polyanions are of

Page 39: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

39

important functional significance to cells since at a minimum they: a) perform regulatory functions (e.g. binding of growth factors to protoglycans) [78, 104]; b) transfer genetic information (RNA/DNA); c) play a multitude of roles in cell structure and dynamics (actin and tubulin); d) potentially serve as chaperones for protein folding [25, 65, 86]; e) stabilize proteins [109, 103, 63]; and f) perhaps facilitate non-classical transport of proteins into and out of the cell [114, 51, 52].

Given the diversity of functions involving polyanion molecules, understanding the proteome would be unlikely without characterization and analysis of the proteins in direct contact to negatively charged molecules – the polyanion-binding proteins (PABPs). We aimed at analyzing the topological role of five groups of PABPs within a compiled interactome.

In the bioinformatics community, current experience in determining the essentiality of a set of nodes in a biological network is limited to the analysis of hubs in scale-free networks [31]. Regrettably, not all PABPs are hubs, although many of them are indeed highly connected in the compiled interactomes. Various algorithms are able to extract node subsets with intriguing properties, such as paths and functional modules [62], overlapping clusters [75] or defective cliques [124]. Unfortunately, these algorithms are of little use for determining the structural role of polyanion-binding proteins, because the overlap between PABPs and an intriguing subset (as identified by the above class of algorithms) is unlikely to be complete.

In contrast to the methods discussed above, the properties discussed in Chapter 5.3 provide an ideal framework for the desired analysis. We have performed the analysis on PABPs in yeast and in human and have provided suggestions onto how their topological role can be interpreted as a functional significance.

6.2.2 Methods

PABPs in Yeast Polyanion-binding proteins were identified using yeast protein microarrays. The microarrays contain 4,087 purified proteins from Saccharomyces cerevisiae open reading frames (ORFs) which are expressed as N-terminal-GST-(Glutathione STransferase)-6xHis fusion proteins, purified, and spotted in duplicate on a nitrocellulose-coated glass slide [128]. Binding affinity was measured after series of preparation steps, including exposure to strepavidin-Alexa Fluor® 647, washing, drying, subsequent incubation with NaCl solutions and image analysis [70]. Interactions which were considered statistically significant demonstrated a mean signal greater than median signal of all protein spots on the array plus one standard deviation (+1SD). Two higher-confidence sets for each microarray were obtained by including only the proteins with a signal of +2SD and +3SD, respectively.

The above procedure was used to identify five groups of PABPs: actin-binding, DNA-binding, heparin-binding, HS-binding and Tubulin-binding. Each of these five groups contained three sets – the +1SD, +2SD and +3SD. Thus, for example, the “HS-binding 2” set contains all proteins that show a signal stronger than the median signal plus 2 standard deviations on the microarray utilized with heparin sulphate. In addition, three sets of polyanion binding proteins were formed by combining all sets of the same confidence level. The combined protein sets were named “All 1SD”, “All 2SD” and “All 3SD”.

As described in Chapter 5.1.3, the yeast interactome was compiled by combining the yeast protein-protein interaction networks from databases BioGrid, GYGD, and IntAct. Only the experimentally detected binary physical interactions were selected from each database. Regarding a specific topological property of subnetworks formed by PABP datasets, we investigated the first two properties described in Chapter 5.3, namely the number of edges (protein-protein interactions) involving at least one PABP and the increase in number of

Page 40: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

40

connected components after removing PABPs.

We constructed 50,000 random subnetworks by arbitrarily picking the same number of proteins as in the respective PABP dataset and including their interaction partners. We used the resulting histogram of the number of interactions contained in each random subnetwork to estimate a p-value quantifying the significance of the subnetwork for the respective PABP dataset. In addition, a 95% binomial confidence interval for the p-value was calculated. To give a concrete example of our evaluation procedure, there were 108 PABPs in the dataset detected at the statistical level of +3SD, and 103 of them were present in the compiled yeast interactome (see Table 5). Thus, we generated 50,000 different random subnetworks of 103 proteins and their direct interaction partners. We then counted the protein-protein interactions in each subnetwork to derive a histogram. Similarly, we constructed 50,000 random subnetworks having the same number of nodes as the respective PABP dataset, and used these subnetworks to estimate a p-value for the increase in connected components.

PABP type Standard deviations

Number of PABPs

PABPs present in the interactome

Actin-binding 1 135 126 Actin-binding 2 63 62 Actin-binding 3 27 26 DNA-binding 1 86 81 DNA-binding 2 44 40 DNA-binding 3 29 27 Heparin-binding 1 264 240 Heparin-binding 2 107 100 Heparin-binding 3 59 56 HS-binding 1 348 325 HS-binding 2 124 110 HS-binding 3 66 62 Tubulin-binding 1 60 55 Tubulin-binding 2 13 13 Tubulin-binding 3 10 10 All 1 529 489 All 2 205 188 All 3 108 103

Table 5. Sizes of the eighteen PABP sets in yeast.

PABPs in Human Seeking further evidence for the existence of an extensive role for polyanion-protein interactions in the activities of much more complex eukaryotic cells, a similar study was performed exploring a human protein array. All protein arrays were scanned using a Molecular Devices GenePix 4000B (Sunnyvale, CA), at 635 nm with 10 μm pixel size. A detailed description of the experimental procedure is available in [69].

A human interactome was compiled by combining predicted protein-protein interactions from HiMAP (http://www.himap.org) with human physical protein interaction networks from IntAct (http://www.ebi.ac.uk/intact/index.jsp), the Human Protein Reference Database (http://www.hprd.org), BioGrid (http://www.thebiogrid.org), and Database of Interacting

Page 41: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

41

Proteins (http://dip.doe-mbi.ucla.edu). Since these datasets use different accession numbers, we have converted all to NCBI Entrez Gene identifiers, leading to slightly different numbers of PABPs in the reported tables.

To probe specific topological properties of subnetworks formed by PABP datasets, we investigated the number of protein-protein interactions involving at least one PABP. In a complete analogy to the approach in yeast, we constructed 50,000 random subnetworks by arbitrarily picking the same number of proteins in the respective PABP dataset and including their interaction partners. We used the resulting histogram of the number of interactions contained in each random sub-network to estimate a p-value quantifying the significance of the subnetwork for the respective PABP dataset. Additionally, a 95% binomial confidence interval for the estimated p-value was calculated. To give a concrete example of our analysis, there were 149 PABPs in the dataset detected at the statistical level of +3SD, and 111 of them were present in the compiled human interactome (see Table 6). Thus, we generated 50,000 different random subnetworks of 111 proteins and their direct interaction partners. Then we counted the protein-protein interactions in each subnetwork to derive a histogram.

PABP type Standard deviations

Number of PABPs

PABPs present in the interactome

Actin-binding 1 189 141 Actin-binding 2 96 73 Actin-binding 3 58 44 DNA-binding 1 161 123 DNA-binding 2 85 64 DNA-binding 3 54 44 Heparin-binding 1 184 124 Heparin-binding 2 92 64 Heparin-binding 3 54 40 HS-binding 1 133 97 HS-binding 2 83 62 HS-binding 3 54 38 Tubulin-binding 1 86 65 Tubulin-binding 2 55 41 Tubulin-binding 3 36 26 All 1 370 264 All 2 214 156 All 3 149 111

Table 6. Sizes of the eighteen PABP sets in human.

In subnetworks formed exclusively by PABPs, we additionally investigated the increase in connected components (CCI) and number of connected components in the subnetworks. Similar to the case of the topological property described above, a p-value was estimated for each PABP dataset by generating 50,000 random subnetworks and computing histograms of CII and the number of connected components in them. Every p-value estimation includes the calculation of a 95% binomial confidence interval.

We have also analysed the yeast homologues of human PABPs [69]. For each human PABP dataset, we could use a corresponding set of homologous yeast proteins. Again, we have performed the topological analyses described above on the resulting sets of yeast proteins using

Page 42: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

42

a compiled yeast interactome discussed in Chapter 5.1.3.

PABP type Standard deviations

Number of PABPs in

human

Number of PABPs

mapped to homologues

Number of homologous proteins in

yeast

Homologues present in the

yeast interactome

Actin-binding 1 201 100 99 98 Actin-binding 2 100 47 66 65 Actin-binding 3 59 29 47 46 DNA-binding 1 170 73 87 85 DNA-binding 2 90 37 54 53 DNA-binding 3 56 27 46 45 Heparin-binding 1 194 90 111 110 Heparin-binding 2 99 43 73 73 Heparin-binding 3 59 25 57 57 HS-binding 1 143 58 83 83 HS-binding 2 92 37 52 52 HS-binding 3 60 24 35 35 Tubulin-binding 1 90 46 63 62 Tubulin-binding 2 58 31 50 49 Tubulin-binding 3 37 19 43 42 All 1 395 192 148 145 All 2 227 105 104 102 All 3 156 72 90 88

Table 7. Sizes of the eighteen sets of human PABP homologues in yeast.

6.2.3 Results

PABPs in Yeast To investigate whether yeast PABPs form a specific interaction network, we compiled a yeast interactome consisting of 36,135 physical interactions between 5,574 proteins.

If polyanions and PABPs have an important functional role in many cellular processes, the PABPs should be involved in significantly more protein-protein interactions than a random set of proteins. We found out that the number of interactions involving PABPs were significantly higher than in random subnetworks, judged by the estimated p-values (average of 0.04, see Table 8). The resulting histograms from counting the protein-protein interactions in each sub-network are shown in Figure 15. Interestingly, the p-values for the PABP sets at the +3SD statistical level are not always higher than the ones for the corresponding +1SD and/or +2SD datasets (see Table 8).

PABP type Standard deviation

Interactions involved (p-value)

95% confidence

interval

Increase in CC

(p-value)

95% confidence

interval Actin-binding 1 0.083 ± 0.002 0.145 ± 0.003 Actin-binding 2 0.066 ± 0.002 0.341 ± 0.004 Actin-binding 3 0.049 ± 0.002 0.920 ± 0.002

Page 43: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

43

DNA-binding 1 0.003 ± 0.0005 0.034 ± 0.002 DNA-binding 2 0.015 ± 0.001 0.091 ± 0.003 DNA-binding 3 0.021 ± 0.001 0.141 ± 0.003 Heparin-binding 1 0.061 ± 0.002 0.221 ± 0.004 Heparin-binding 2 0.034 ± 0.002 0.077 ± 0.002 Heparin-binding 3 0.031 ± 0.002 0.184 ± 0.003 HS-binding 1 0.036 ± 0.002 0.121 ± 0.003 HS-binding 2 0.017 ± 0.001 0.144 ± 0.003 HS-binding 3 0.024 ± 0.001 0.276 ± 0.004 Tubulin-binding 1 0.057 ± 0.002 0.088 ± 0.002 Tubulin-binding 2 0.009 ± 0.001 0.012 ± 0.001 Tubulin-binding 3 0.031 ± 0.002 0.632 ± 0.004 All 1 0.063 ± 0.002 0.290 ± 0.004 All 2 0.009 ± 0.001 0.064 ± 0.002 All 3 0.028 ± 0.001 0.276 ± 0.004

Table 8. Estimated p-values for the properties “number of interactions involved” and “increase in connected components” of PABPs in yeast.

We have also investigated the increase in connected components, assuming that removing proteins which bridge or mediate different processes, would cause the interactome to be disconnected into significantly higher number of connected components than expected by chance. However, the value of increase in connected components for PABPs, was not higher than the values for most of the 50,000 random sets. We have obtained high p-values for this property which suggests that PABPs are not irreplaceable with this respect.

PABPs in Human To investigate whether human PABPs are involved in some type of network of interactions as seen in yeast, we compiled a human interactome consisting of 97,157 physical interactions between 18,461 proteins. We addressed the following three questions: 1) Do polyanions and PABPs play important functional role in many cellular processes; 2) are PABPs highly interconnected and do they form a network of physical interactions among themselves, and 3) do PABPs have a role of bridging cellular processes, providing a transition from one to another?

If the answer to the first question is positive, the PABPs should be involved in significantly more protein-protein interactions than a random set of proteins. We found that the number of interactions involving all PABPs is indeed significantly larger than in random subnetworks as judged by the estimated p-values (see Table 9). For instance, at the statistical level of +1SD, the p-value from the histogram is 0.003, indicating that PABPs are involved in cellular events more than just by mere chance.

PABP type

Standard deviation

PPI involving PAPBs

(p-value)

95% confidence interval

Increase in CC

(p-value)

95% confidence

interval

CC of group

(p-value)

95% confidence

interval

Actin-binding 1 0.072 ± 0.002 0.988 ± 0.001 0.11728 ± 0.0028

Actin-binding 2 0.397 ± 0.004 0.849 ± 0.003 0.18240 ± 0.0034

Actin- 3 0.228 ± 0.004 0.536 ± 0.004 1.00000 ± 0.0000

Page 44: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

44

binding DNA-

binding 1 0.033 ± 0.002 0.965 ± 0.002 0.00008 ± 0.0001

DNA-binding 2 0.038 ± 0.002 0.762 ± 0.004 0.00094 ± 0.0003

DNA-binding 3 0.181 ± 0.003 0.923 ± 0.002 0.02082 ± 0.0013

Heparin-binding 1 0.006 ± 0.001 0.967 ± 0.002 0.00018 ± 0.0001

Heparin-binding 2 0.004 ± 0.001 0.878 ± 0.003 0.00308 ± 0.0005

Heparin-binding 3 0.003 ± 0.000 0.740 ± 0.004 0.07490 ± 0.0023

HS-binding 1 0.124 ± 0.003 0.997 ± 0.000 0.00120 ± 0.0003

HS-binding 2 0.410 ± 0.004 0.993 ± 0.001 0.00008 ± 0.0001

HS-binding 3 0.648 ± 0.004 0.991 ± 0.001 0.30993 ± 0.0041

Tubulin-binding 1 0.423 ± 0.004 0.995 ± 0.001 0.03768 ± 0.0017

Tubulin-binding 2 0.905 ± 0.003 0.993 ± 0.001 1.00000 ± 0.0000

Tubulin-binding 3 0.735 ± 0.004 0.965 ± 0.002 1.00000 ± 0.0000

All 1 0.003 ± 0.001 0.994 ± 0.001 0.00002 ± 0.00004 All 2 0.009 ± 0.001 0.943 ± 0.002 0.00002 ± 0.00004 All 3 0.016 ± 0.001 0.876 ± 0.003 0.00080 ± 0.0002

Table 9. Estimated p-values for the properties “number of interactions involved” and “increase in connected components” of PABPs in yeast.

Concerning the second question, a high connectivity of PABPs should be reflected by few large connected components of interacting PABPs. Indeed, the results of our study show that the network of all PABPs and some subnetworks of PABPs such as DNA-binding and heparin-binding tend to be connected, usually having less connected components than randomly generated subnetworks of the same size (see Table 9). This can be seen in the histogram of Figure 12 where the p-value is as low as 0.00008 at the level of +1SD.

Page 45: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

45

Figure 12. Histogram for the number of connected components in randomly generated subnetworks of the compiled human interactome. The horizontal axis of the histogram represents the different values obtained for the number of connected components in the generated subnetworks. The vertical axis shows how many random subnetworks contained the given number of connected components. This histogram are produced for the statistical level of +2SD of PABP detection to estimate one of the p-values shown in Table 9. The red circle represents the number of connected components in the sub-network formed by the HS-binding proteins.

The role of PABPs, addressed in the third question, can be evaluated by observing the increase in connected components. As shown in Table 9, the estimated p-values were in the range of 0.74 to 0.997 which demonstrates that eventual removal of PABPs would not result in slicing the interactome into more connected components than expected by chance. Therefore, the polyanion-binding proteins in human do not have a tendency to bridge cellular processes. This was also shown to be the case for PABPs in yeast.

The inclusion of predicted interactions into the compiled human network was necessitated by the limited availability of experimental data on human protein-protein interactions, which results in a partially incomplete human interactome. This incompleteness can be observed by the relatively large fraction of 106 out of 370 PABPs (using Entrez Gene identifiers) that were not found in the compiled network. Since the available yeast interactome is much more complete, we have additionally mapped the human PABPs to yeast and performed the same analyses for the yeast homologues of the human PABPs. Here, the yeast homologues for PABP datasets also participate in a larger number of physical interactions in the yeast interactome (see Table 10). In particular, they also show a clear tendency to form strongly connected subnetworks. These results are in good agreement with our study of yeast PABPs.

PABP type Standard deviation

PPI involving PAPB

homologues (p-value)

95% confidence

interval

CC of group (p-value)

95% confidence

interval

Actin-binding 1 0.0001 ± 0.0001 0.0000 ± 0.0000 Actin-binding 2 0.004 ± 0.0006 0.0000 ± 0.0001

Page 46: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

46

Actin-binding 3 0.007 ± 0.0007 0.0002 ± 0.0001 DNA-binding 1 0.0001 ± 0.0001 0.0000 ± 0.0000 DNA-binding 2 0.0005 ± 0.0002 0.0001 ± 0.0001 DNA-binding 3 0.001 ± 0.0002 0.0001 ± 0.0001 Heparin-binding 1 0.001 ± 0.0003 0.0000 ± 0.0000 Heparin-binding 2 0.002 ± 0.0004 0.0000 ± 0.0000 Heparin-binding 3 0.002 ± 0.0004 0.0000 ± 0.0000 HS-binding 1 0.001 ± 0.0002 0.0000 ± 0.0001 HS-binding 2 0.002 ± 0.0004 0.0000 ± 0.0000 HS-binding 3 0.005 ± 0.0006 0.0000 ± 0.0000 Tubulin-binding 1 0.013 ± 0.0010 0.0015 ± 0.0003 Tubulin-binding 2 0.012 ± 0.0010 0.0005 ± 0.0002 Tubulin-binding 3 0.010 ± 0.0009 0.0003 ± 0.0002 All 1 0.0003 ± 0.0002 0.0000 ± 0.0000 All 2 0.0002 ± 0.0001 0.0000 ± 0.0000 All 3 0.0003 ± 0.0002 0.0000 ± 0.0001

Table 10. Estimated p-values for the properties “number of interactions involved”, and “connected components in a subnetwork” of human PABPs homologues in yeast.

Page 47: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

47

7 Conclusions

7.1 Summarizing Remarks This thesis presented two Java applications that assist the manipulation and analysis of biological networks.

The Cytoscape plugin named NetworkAnalyzer is the tool that to date provides the most comprehensive topological analysis of biological and other networks. Among the parameters it computes are degree distribution (in- and out- degree distributions for directed networks), average clustering coefficient distribution, topological coefficients, and neighborhood connectivity distribution (which comes in three different versions for directed networks).

The stand-alone tool NetworkLoader compiles interaction networks from different molecular interaction datasets. It provides flexible criteria for data filtering and integration of externally controlled vocabularies. NetworkLoader can also automatically translate identifiers in a source database to identifiers in a target database as defined in mapping files. Network loader was successfully used to compile networks of known protein-protein interactions in yeast and in human.

Finally, we presented a novel framework for quantifying the structural significance of a subset of nodes in a network. Our framework consists of estimating p-values for a collection of topological properties. We have defined, and later used, three such properties. However, the framework is further extensible through inclusion of additional topological properties.

We have used NetworkAnalyzer to compare a set of twelve protein-protein interaction networks and have found anticipated similarities, as well as surprising differences in their topologies. NetworkAnalyzer was downloaded by more than 400 users since its initial release, which indicates a substantial demand for such tools from the bioinformatics community.

The framework for quantifying node subset significance was applied to groups of polyanion-binding proteins to determine their topological role in protein-protein interaction networks in yeast and human. We have found that polyanion-binding proteins tend to participate in significantly higher number of interactions in yeast and in human. In addition, the human PABPs form cohesive subnetworks, consisting of less connected components than randomly generated subnetworks of the same size. We suggested that PAPBs play an important role in the interactomes, based on the topological characteristics we have observed in them.

7.2 Outlook We are convinced that the type of analysis presented here will be essential for studying the growing amount of genomic and proteomics data in model organisms and in human.

In particular, NetworkAnalyzer is capable of spotting structural mechanisms and perturbations in the organization of directed and undirected networks. Including the computation of additional parameters, such as vertex betweenness centrality [120], would further reinforce the topological principles investigated by the plugin. An implementation of randomized algorithms for producing approximate, rather than exact results, is likely to be of benefit, as NetworkAnalyzer will then require less computational recourses.

Although still immature, the PSI MI 2.5 format is rapidly gaining popularity. An increasing number of molecular interaction repositories are providing data export to this XML based format, which favors the usage of tools like NetworkLoader. With some minor performance improvements, NetworkLoader is likely to be one of the preferred applications for the

Page 48: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

48

compilation and transformation of intraction data.

Page 49: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

49

8 Bibliography 1. Aittokallio, T., Schwikowski, B.: Graph-based methods for analyzing networks in cell

biology. Briefings in Bioinformatics 7 (2006) 243-255

2. Albert, R.: Scale-free networks in cell biology. Cell Science 118 (2005) 4947-4957

3. Albert, R., Jeong, H., Barabási, A.-L., Diameter of the world-wide web, Nature 401 (1999) 130–131

4. Albert, R., Jeong, H., Barabasi, A.: Error and attack tolerance of complex networks. Nature 406 (2000) 378-382

5. Albrecht, M., Huthmacher, C., Tosatto, S.C.E., Lengauer, T.: Decomposing protein networks into domain–domain interactions. Bioinformatics 21 (2005) ii220-ii221

6. Alfarano, C., Andrade, C. E., Anthony, K., Bahroos, N., et al.: The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Research 33 (2005) D418-424

7. Aloy, P., Russell, R.B.: Structure-based systems biology: a zoom lens for the cell. FEBS Letters 579 (2005) 1854–1858

8. Apweiler, R., Bairoch, A., Wu, C. H., Barker, W. C., et al., Nucleic Acids Research 32 (2004) D115-119

9. Aragues R., Jaeggi D., Oliva, B.: PIANA: protein interactions and network analysis. Bioinformatics 22(8) (2006) 1015-1017

10. Bader, J. S., Chaudhuri, A., Rothberg, J. M., Chant, J.: Gaining confidence in high-throughput protein interaction networks. Nature Biotechnology 22 (2004) 78-85.

11. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286 (1999) 509-512

12. Barabasi, A.-L., Albert, R., and Jeong, H., Scale-free characteristics of random networks: The topology of the World Wide Web, Physica A 281 (2000) 69–77

13. Barabási, A.L., Oltvai, Z.N.: Network biology: understanding the cell's functional organization. Nature reviews. Genetics 5 (2004) 101-113

14. Barrat, A., Weigt, M.: On the properties of small-world network models. European Physical Journal B 13 (2000) 547-560

15. Batagelj, V., Mrvar, A.: Pajek – Program for large network analysis. Connections 21 (1998) 47-57

16. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Wheeler, D. L., Nucleic Acids Research 34 (2006) D16-20.

17. Bollobas, B.: Random graphs. Cambridge University Press, Cambridge (2001), ISBN: 0-521-80920-7

18. Bollobás, B., Riordan, O.: Robustness and Vulnerability of Scale-Free Random Graphs. Internet Mathematics 1 (2003) 1–35

19. Bollobas, B, Riordan, O.: The diameter of a scale-free random graph. Combinatorica 24 (1) (2004) 5–34

20. Breitkreutz, B.J., Stark, C., Tyers, M.: Osprey: a network visualization system. Genome Biology 4(3) (2003) R22

Page 50: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

50

21. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., Wiener, J., Graph structure in the web, Computer Networks 33 (2000) 309–320

22. Chen, Y., Xu, D.: Computational Analyses of High-Throughput Protein-Protein Interaction Data. Current Protein and Peptide Science 4 (2003) 159-181

23. Chua, G., Robinson, M.D., Morris, Q., Hughes, T.R.: Transcriptional networks: reverse-engineering gene regulation on a global scale. Current Opinion in Microbiology 7 (2004) 638–646

24. Cusick, M. E., Klitgord, N., Vidal, M., Hill, D. E.: Interactome: gateway into systems biology. Human molecular genetics 14 (2005) R171-181

25. Dabora, J.M., Sanyal, G., and Middaugh, C.R.: Effect of polyanions on the refolding of human acidic fibroblast growth factor. The Journal of biological chemistry 266 (1991) 23637-40

26. de Lichtenberg, U., Jensen, L.J., Brunak, S., Bork, P..: Dynamic Complex Formation During the Yeast Cell Cycle. Science 307 (2005) 724-7

27. Deane, C. M., Salwinski, Xenarios, I., Eisenberg, D.,: Two Methods for Assessment of the Reliability of High Throughput Observations. Molecular & Cellular Proteomics 1 (2002) 349-356.

28. Diestel, R: Graph Theory. Springer-Verlag, Heidelberg (2005), ISBN 3-540-26182-6

29. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. PNAS 95 (1998) 14863–14868

30. Erdős, P., Renyi, A.: On the evolution of random graphs. Publication of the Mathematical Institute of Hungarian Academy of Science 5 (1960) 17

31. Ekman, D., Light, S., Bjorklund, A.K., Elofsson, A.: What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae? Genome Biology 7(6) (2006) R45.

32. Forman, J.J., Clemons, P.A., Schreiber, S.L., Haggarty, S.J.: SpectralNET – an application for spectral graph analysis and visualization. BMC Bioinformatics 6 (2005)

33. Friedel, C.C., Zimmer, R.: Toward the complete interactome. Nature Biotechnology 24 (2006) 614 – 615

34. Gandhi, T. K., Zhong, J., Mathivanan, S., Karthick, L., et al.: Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nature Genet 38 (2006) 285-293.

35. Goehler H., Lalowski M., Stelzl U., Waelter S., Stroedicke M., Worm U., Droege A., Lindenberg KS., Knoblich M., Haenig C., Herbst M., Suopanki J., Scherzinger E., Abraham C., Bauer B., Hasenbank R., Fritzsche A., Ludewig AH., Bussow K., Coleman SH., Gutekunst CA., Landwehrmeyer BG., Lehrach H., Wanker EE.: A protein interaction network links GIT1, an enhancer of huntingtin aggregation, to Huntington's disease. Molecular Cell 15(6) (2004) 853-865

36. Goldstein, M.L., Morris, S.A., Yena, G.G.: Problems with fitting to the power-law distribution. European Physical Journal B 41 (2004) 255–258

37. Guldener U, Munsterkotter M, Kastenmuller G, Strack N, van Helden J, Lemer C, Richelles J, Wodak SJ, Garcia-Martinez J, Perez-Ortin JE, Michael H, Kaps A, Talla E, Dujon B, Andre B, Souciet JL, De Montigny J, Bon E, Gaillardin C, Mewes HW.: CYGD: the Comprehensive Yeast Genome Database. Nucleic Acids Research 33 (2005) D364-368

38. Guare, J.: Six Degrees of Separation: A Play. Vintage Books, New York, 1990

Page 51: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

51

39. Gunsalus, K.C., Ge, H., Schetter, A.J., Goldberg, D.S., Han, J.D., Hao, T., Berriz, G.F., Bertin, N., Huang, J., Chuang, L.S., Li, N., Mani, R., Hyman, A.A., Sonnichsen, B., Echeverri, C.J., Roth, FP., Vidal, M., Piano, F.: Predictive models of molecular machines involved in Caenorhabditis elegans early embryogenesis.Nature 436(2005) 861-5

40. Han, J.D.J., Bertin, N., Hao, T., Goldberg, D.S., Berriz, G., Zhang, L.V., Dupuy, D., Walhaut, A.J.M., Cusick, M.E., Roth, F.P., Vidal, M.: Evidence for dynamically organized modularity in the yeast protein–protein interaction network. Nature 430 (2004) 88-93

41. Han, J.D.J., Dupuy, D., Bertin, N., Cusick, M.E., Vidal, M.: Effect of sampling on topology predictions of protein-protein interaction networks. Nature Biotechnology 23 (2005) 839-844

42. Harris, M. A., Clark, J., Ireland, A., Lomax, J., et al., Nucleic Acids Research 32 (2004) D258-261

43. Hazbun, T.R., Malmström, L., Anderson, S., Graczyk,B.J., Fox, B., Riffle, M., Sundin, B.A., Aranda, J.D., McDonald, W.H., Chiu, C.H., Snydsman,B.E., Bradley, P., Muller, E.G.D., Fields, S., Baker, D., Yates, J.R., Davis, T.N.: Assigning Function to Yeast Proteins by Integration of Technologies. Molecular Cell 12 (2003) 1353–1365

44. Hermjakob, H., Montecchi-Palazzi, L., Bader, G., Wojcik, J., et al.: The HUPO PSI’s Molecular Interaction format – a community standard for the representation of protein interaction data. Nature Biotechnology 22(2) (2004) 177-183

45. Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., et al.: IntAct: an open source molecular interaction database. Nucleic Acids Research 32 (2004) D452-455

46. Herrgård, M.J., Palsson , B.Ø: Untangling the web of functional and physical interactions in yeast. Journal of Biology 4 (2005)

47. Hu, Z., Mellor, J., Wu, J., Yamada, T., Holloway, D., DeLisi,C.: VisANT: data-integrating visual framework for biological networks and modules. Nucleic Acids Research 33 (2005) W352–W357

48. Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., Sakaki, Y.: A comprehensive two-hybrid analysis to explore the yeast protein interactome. PNAS 98 (2001) 4569–4574

49. Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., Barabasi, A.L.: The large-scale organization of metabolic networks. Nature, 407 (2000) 651-654

50. Jones, J. H. and Handcock, M. S., An assessment of preferential attachment as a mechanism for human sexual network formation, Preprint, University of Washington (2003)

51. Kueltzo, L.A., Middaugh, C.R.: Potential use of non-classical pathways for the transport of macromolecular drugs. Expert Opin. Investig. Drugs 9 (2000) 2039-50

52. Kueltzo, L.A., Normand, N., O'Hare, P., Middaugh, C.R.: Conformational lability of herpesvirus protein VP22. J. Biol. Chem. 275 (2000) 33213-21

53. Lappe, M., Holm, L.: Algorithms for protein interaction networks. Biochemical Society Transactions 33(3) (2005) 530-534

54. Lee, T.I., Rinaldi,N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I., Zeitlinger, J., Jennings, E.G., Murray, H.L., Gordon, D.B.,Ren, B., Wyrick, J.J., Tagne, J.B., Volkert, T.L., Fraenkel, E.,Gifford, D.K., Young, R.A.: Transcriptional Regulatory Networks in Saccharomyces cerevisiae. Science 25 (2002) 799 – 804

55. Lehner, B., Fraser, A. G., Genome Biology 5 (2004) R63

Page 52: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

52

56. Li, D., Li, J., Ouyang, S., Wang, J., Wu, S. Wan, P., Zhu, Y., Xu, X., He, F.: Protein interaction networks of Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster : Large-scale organization and robustness. Proteomics 6 (2006) 456–461

57. Li, L., Alderson, D., Tanaka, R., Doyle, J.C., Willinger, W.: Towards a Theory of Scale-Free Graphs: Definition, Properties, and Implications (Extended Version). Technical report (2005)

58. Liljeros, F., Edling, C. R., Amaral, L. A. N., Stanley, H. E., Aberg, Y., The web of human sexual contacts, Nature 411 (2001) 907–908

59. Lim, J., Hao, T., Shaw, C., Patel, A.J., Szabo´, G, Rual, J.F., Fisk, C.F., Li, N., Smolyar, A., Hill, D.E., Barabási, A.L., Vidal, M., Zoghbi, H.Y.: A Protein–Protein Interaction Network for Human Inherited Ataxias and Disorders of Purkinje Cell Degeneration. Cell 125 (2006) 801-814

60. Lunn, J.E.: Compartmentation in plant metabolism. Journal of Experimental Botany, To Appear.

61. Luscombe, N. M., Babu, M. M., Yu, H., Snyder, M., Teichmann, S. A., Gerstein, M.: Genomic analysis of regulatory network dynamics reveals large topological changes Nature 431 (2004) 308-312

62. Ma, H. W., Zhao, X. M., Yuan, Y. J., Zeng, A.P.: Decomposition of metabolic network into functional modules based on the global connectivity structure of reaction graph Bioinformatics 20(12) (2004) 1870–1876

63. Mach, H., Ryan, J.A., Burke, C.J., Volkin, D.B., and Middaugh, C.R.: Partially structured self-associating states of acidic fibroblast growth factor. Biochemistry 32 (1993) 7703-11

64. Maglott, D., Ostell, J., Pruitt, K. D., Tatusova, T., Nucleic Acids Res 2005, 33, D54-58

65. Manna, T., Sarkar, T., Poddar, A., Roychowdhury, M., Das, K.P., and Bhattacharyya, B.: Chaperone-like activity of tubulin. binding and reactivation of unfolded substrate enzymes. The Journal of biological chemistry (2001) 276 39742-7

66. Maslov, S., Sneppen, K: Specificity and stability in topology of protein networks. Science 296 (2002) 910-913.

67. McDermott, J., Bumgarner, R., Samudrala, R., Bioinformatics 21 (2005) 3217-3226.

68. Mewes, H.W., Frishman, D., Mayer, K.F., Munsterkotter, M., Noubibou, O., Pagel, P., Rattei, T., Oesterheld, M., Ruepp, A., Stumpflen, V.: MIPS: Analysis and Annotation of Proteins from Whole genomes in 2005. Nucleic Acids Research 34 (2006) 34 D169-D172

69. Salamat-Miller, N., Fang, J., Seidel, C., Assenov, Y., Albrecht, M., Middaugh, C.R.: A Network-Based Analysis of Polyanion Binding Proteins Utilizing Human Protein Arrays. Submitted for publication (2006).

70. Salamat-Miller, N., Fang, J., Seidel, C., Smalter, A., Assenov, Y., Albrecht, M., Middaugh, C.R.: A network-based analysis of polyanion binding proteins utilizing yeast protein arrays. Molecular & cellular proteomics (2006)

71. Newman, M.E.J.: The structure and function of complex networks. Technical Report (2003)

72. Ng, A., Bursteinas, B., Gao, Q., Mollison, E., Zvelebil, M.: Resources for integrative systems biology: from data through databases to networks and dynamic system models. Briefings in Bioinformatics (2006), To Appear

73. Orchard, S., Hermjakob, H., Taylor, C., Binz, P.A., Hoogland, C., Julian, R., Garavelli, J.S, Aebersold, R., Apweiler, R.: Autumn 2005 Workshop of the Human Proteome

Page 53: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

53

Organisation Proteomics Standards Initiative (HUPO-PSI) Geneva, September, 4–6, 2005. Proteomics 6 (2006) 738-741.

74. Orchard, S., Apweiler, R., Barkovich, R., Field, D., Garavelli, J.S., Horn, D., Jones, A., Jones, P., Julian, R., McNally,R., Nerothin, J., Paton, N., Pizarro, A., Seymour, S. , Taylor, C., Wiemann, S., Hermjakob, H.: Proteomics and Beyond. Proteomics 6 (2006) 4439-4443

75. Palla, G., Derenyi, I., Farkas, I. Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(9) (2005) 814-818

76. Papin, J., Reed, J., Palsson, B.O. Hierarchical thinking in network biology: the unbiased modularization of biochemical networks. TRENDS in Biochemical Sciences 29(12) (2004)

77. Pastor-Satorras, R., Vazquez, A., Vespignani, A.: Dynamical and correlation properties of the internet.Physical review letters 87(25) (2001)

78. Pellegrini, L., Burke, D.F., von Delft, F., Mulloy, B., and Blundell, T.L.: Crystal structure of fibroblast growth factor receptor ectodomain bound to ligand and heparin. Nature 407 (2000) 1029-34

79. Peri, S., Navarro, J. D., Amanchy, R., Kristiansen, T. Z., et al.: Development of Human Protein Reference Database as an Initial Platform for Approaching Systems Biology in Humans. Genome Research 13 (2003) 2363-2371

80. Persico, M., Ceol, A., Gavrila, C., Hoffmann, R., et al., BMC Bioinformatics 6 (2005)

81. Pruitt, K. D., Tatusova, T., Maglott, D. R., Nucleic Acids Research 33 (2005) D501-504.

82. Ptacek, J., Devgan, G., Michaud, G., Zhu, H., Zhu, X., Fasolo, J., Guo, H., Jona, G., Breitkreutz, A., Sopko, R., McCartney, RR., Schmidt, M.C., Rachidi, N., Lee, S.J., Mah, A.S., Meng, L., Stark, M.J., Stern, D.F., De Virgilio, C., Tyers, M., Andrews, B., Gerstein, M., Schweitzer, B., Predki, P.F., Snyder, M.: Global analysis of protein phosphorylation in yeast. Nature 438 (2005) 679-84

83. Ramani, A. K., Bunescu, R. C., Mooney, R. J., Marcotte, E. M.: Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome. Genome Biology 6 (2005) R40

84. Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N., Barabási, A.L.,: Hierarchical organization of modularity in metabolic networks. Science 297 (2002) 1551-1555

85. Reguly, T., Breitkreutz, A., Boucher, L., Breitkreutz, B. J., et al., Journal of Biology 5 (2006) 11

86. Rentzeperis, D., Jonsson, T., and Sauer, R.T.: Acceleration of the refolding of Arc repressor by nucleic acids and other polyanions. Nat. Struct. Biol. 6 (1999) 569-73

87. Rhodes, D. R., Tomlins, S. A., Varambally, S., Mahavisno, V., et al., Nature Biotechnology 23 (2005) 951-959

88. Rual, J. F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., et al.: Towards a proteome-scale map of the human protein–protein interaction network. Nature 437 (2005) 1173-1178

89. Salwinski, L., Miller, C. S., Smith, A. J., Pettit, F. K., et al.: STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Research 32 (2004) D449-451.

90. Santonico, E., Castagnoli, L., Cesareni, G.: Methods to reveal domain networks. Drug Discovery Today 10 (2005) 1111-1117

91. Schlicker, A., Domingues, F. S., Rahnenführer, J., Lengauer, T., BMC Bioinformatics

Page 54: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

54

2006, 7, 302.

92. Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., Ideker, T.: Cytoscape: A Software Environment for Integrated Models of Biomelucular Interaction Networks. Genome Research 13 (2003) 2498-2504

93. Sharan, R., Suthram, S., Kelley, R.M., Kuhn, T., McCuine, S., Uetz, P., Sittler, T., Karp, R.M., Ideker, T.: Conserved patterns of protein interaction in multiple species. PNAS 102 (2005) 1974-1979

94. Siegal, M.L., Promislow, D.E., Bergman, A.: Functional and evolutionary inference in gene networks: does topology matter? Genetica (2006), To Appear.

95. Sprinzak, E., Sattath, S., Margalit, H.: How reliable are experimental protein-protein interaction data? J Mol Biol 327 (2003) 919-923

96. Stark, C., Breitkreutz, B.J., Reguly, T., Boucher, L., Breitkreutz A, Tyers. M.: BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 34 (2006) D535-539

97. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann S, Goedde A, Toksoz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H, Wanker EE: A human protein-protein interaction network: a resource for annotating the proteome. Cell 122 (2005) 957-968.

98. Stelzl, U., Wanker, E.E.: The value of high quality protein–protein interaction networks for systems biology. Current Opinion in Chemical Biology 10 (2006) 1–8

99. Stumpf, M.P.H., Wiuf, C., May, R.M.: Subnets of scale-free networks are not scale-free: Sampling properties of networks. PNAS 102(12) (2005) 4221-4224

100. Suthram, S., Sittler, T., Ideker, T.: The Plasmodium protein network diverges from those of other eukaryotes. Nature 438(7064) (2005) 108-112

101. Tanaka, R., Yi, T.M, Doyle, J.: Some protein interaction data do not exhibit power law statistics. FEBS Letters 579 (2005) 5140–5144

102. Tong, A.H.Y, Evangelista, M., Parsons, A.B., Xu, H., Bader, G.D., Page, N., Robinson, M., Raghibizadeh, Hogue, C.W.V,, Bussey, H., Andrews, B., Tyers, M., Boone, C.: Systematic Genetic Analysis with Ordered Arrays of Yeast Deletion Mutants. Science 294 (2001) 2364-2368

103. Tsai, P.K., Volkin, D.B., Dabora, J.M., Thompson, K.C., Bruner, M.W., Gress, J.O., Matuszewska, B., Keogan, M., Bondi, J.V., and Middaugh, C.R.: Formulation design of acidic fibroblast growth factor. Pharm. Res. 10 (1993) 649-59

104. Tumova, S., Woods, A., and Couchman, J.R.: Heparan sulfate proteoglycans on the cell surface: versatile coordinators of cellular functions. The international journal of biochemistry & cell biology 32 (2000) 269-88

105. Uetz, P., Dong, Y. A., Zeretzke, C., Atzler, C., et al.: Herpesviral Protein networks and their interaction with the human proteome. Science 311 (2006) 239-242.

106. Urfer, W, Grzegorczyk, M, Jung, K.: Statistics for proteomics: a review of tools for analyzing experimental data. Proteomics 2 (2006) 48-55

107. Valente, A.X.C.N., Cusick, M.E.: Yeast Protein Interactome topology provides framework for coordinated-functionality.Nucleic Acids Research, 34(9) (2006) 2812–2819

108. Vlasblom, J., Wu,S., Pu, S., Superina, M., Liu, G., Orsi,C., Wodak, S.J.: GenePro: a cytoscape plug-in for advanced visualization and analysis of interaction networks. Bioinformatics 22 (2006) 2178-2179

Page 55: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

55

109. Volkin, D.B., Tsai, P.K., Dabora, J.M., Gress, J.O., Burke, C.J., Linhardt, R.J., Middaugh, C.R.: Physical stabilization of acidic fibroblast growth factor by polyanions. Arch. Biochem. Biophys. 300 (1993) 30-41

110. von Mering, C., Krause, R., Snel, B., Cornell, M., et al., Nature 417 (2002) 399-403

111. Wagner, A., David A. Fell, D.A.: The small world inside large metabolic networks. The Royal Society 268 (2001) 1803-1810

112. Watts D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393 (1998) 440-442

113. Weitzman, J.B.: A road map of yeast interactions. Journal of Biology 4 (2005)

114. Wiedlocha, A., Madshus, I.H., Mach, H., Middaugh, C.R., and Olsnes, S.: Tight folding of acidic fibroblast growth factor prevents its translocation to the cytosol with diphtheria toxin as vector. Embo. J. 11 (1992) 4835-42

115. Wuchty, S.: Scale-Free Behavior in Protein Domain Networks. Molecular biology and evolution 18(9) (2001) 1694-1702

116. Wuchty, S., Almaas, E.: Evolutionary cores of domain co-occurrence networks. BMC Evolutionary Biology 5(24) (2005)

117. Xenarios, I., Salwinski, L., Duan, X. J., Higney, P., Kim, S.-M., Eisenberg, D.: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research 30(1) (2002) 303–305

118. Xia, Y., Yu, H., Jansen, R., Seringhaus, M., Baxter, S., Greenbaum, D., Zhao, H., Gerstein, M.: Analyzing cellular biochemistry in terms of molecular networks. Annual review of biochemistry 73 (2004) 1051–87

119. Yan, W., Lee, H., Yi, E.C., Reiss, D., Shannon, P., Kwieciszewski, B.K., Coito, C., Li, X.J., Keller, A., Eng, J., Galitski, T., Goodlett, DR., Aebersold, R., Katze, MG.: System-based proteomic analysis of the interferon response in human liver cells. Genome Biology 5(8) 2004 R54

120. Yoon, J., Blumer, A., Lee, K.: An algorithm for modularity analysis of directed and weighted biological networks based on edge-betweenness centrality. Bioinformatics (2006) To Appear

121. Yu, H., Gerstein, M.: Genomic analysis of the hierarchical structure of regulatory networks. PNAS 103(40) (2006) 14724-14731

122. Yu, H., Greenbaum, D., Lu, H.X., Zhu, X., Gerstein, M.: Genomic analysis of essentiality within protein networks. TRENDS in genetics 20 (2004) 227-231

123. Yu, H., Luscombe, N. M., Lu, H. X., Zhu, X., et al., Genome Research 14 (2004) 1107-1118

124. Yu, H., Paccanaro, A., Trifonov, V., Gerstein, M.: Predicting interactions in protein networks by completing defective cliques. Bioinformatics 22(7) (2006) 823-829

125. Yu, H., Zhu, X., Greenbaum, D., Karro, J. Gerstein, M.: TopNet: a tool for comparing biological subnetworks, correlating protein properties with topological statistics. Nucleic Acids Research 32 (2004) 328-337

126. Zanzoni, A., Montecchi-Palazzi, L., Quondam, M., Ausiello, G., et al., FEBS Letters 513 (2002) 135-140.

127. Zhang, L. V., King, O. D., Wong, S. L., Goldberg, D. S., Tong, A. H. Y., Lesage, G., Andrews, B., Bussey, H., Boone, C., Roth, F. P.: Motifs, themes and thematic maps of an

Page 56: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

56

integrated Saccharomyces cerevisiae interaction network. Journal of Biology 4(2):6 (2005)

128. Zhu, H., Bilgin, M., Bangham, R., Hall, D., Casamayor, A., Bertone, P., Lan, N., Jansen, R., Bidlingmaier, S., Houfek, T., Mitchell, T., Miller, P., Dean, R.A., Gerstein, M., Snyder, M.: Global analysis of protein activities using proteome chips. Science 293 (2001) 2101-5

Page 57: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

57

Appendix A – Implementation Details of NetworkAnalyzer This appendix describes the aspects integrated in the NetworkAnalyzer project. In addition, it gives an overview of the architecture of the package implementing the most advanced aspect – plugin settings management.

Parameters NetworkAnalyzer computes two types of parameters for every network. The first type is simple parameters, which can be numbers or strings. Currently all simple parameters are numbers. The second type is the complex parameter type, which is a dataset and is visualized by chart. The simple and complex parameters for an example network can be seen in Figure 13.

Figure 13. Screenshot of NetworkAnalyzer, displaying the values of simple and complex parameters.

The computed simple parameters are the ones situated in the Simple parameters box on top of

Page 58: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

58

the window5. Examples of simple parameters are average number of neighbors, diameter of a network, and shortest paths. The other boxes in the window are visualizing complex parameters – the node degree distribution (of type IntHistogram), average clustering coefficient distribution (type is Points2D), topological coefficients (type is Points2D), shortest path length ditribution (type is LongHistogram), shared partners distribution (type is LongHistogram) and neighborhood connectivity distribution (type is IntHistogram). All complex parameters implement the interface ComplexParam.

The core functionality of NetworkAnalyzer consists of two tasks – computation and visualization of network parameters.

Network Analysis The computation of simple and complex parameters for a given network is refered to as network analysis; a class that performs this computation is an analyzer. The base class is called NetworkAnalyzer, and it has two subclasses: DirNetworkAnalyzer (for directed networks) and UndirNetworkAnalyzer (for undirected networks).

Generally speaking, the input of an analyzer is a Cytoscape network, and the output is a set of parameters. The computation itself involves iterations over the nodes and the edges of the given network and computing certain properties of the iterated objects. The abstract class NetworkAnalyzer provides common interface for passing the network to be analyzed, as well as for tracking the analysis progress and cancelling the analysis.

Among other functionality, the classes UndirNetworkAnalyzer and DirNetworkAnalyzer implement a modified version of Dijkstra’s algorithm for finding all shortest paths from a given node. Since no edge weights are introduced in the analysed networks, a relaxation of non-infinite tentative distance cannot occur by breadth-first traversal of the nodes. As a result, the initially assigned distance to a node cannot be modified later in the algorithm which effectively removes the necessity for an advanced structure such as priority queue. Therefore, unlike the classical algorithm, the running time of the modified algorithm is ( )EO for finding the shortest paths from one node to all other nodes in the networks.

Visualization The visualization of simple parameters is implemented by methods of the class AnalysisDialog. In terms of programming complexity, this is a trivial task and is not discussed here. Complex parameters are visualized by charts; the type of a chart depends on the type of its underlying data, that is, the complex parameter type. For example, in Figure 13, the node degree distribution is displayed as a bar chart because its type is IntHistogram, whereas the average clustering coefficient distribution is visualized as a scatter plot, used for the Points2D complex parameter type. NetworkAnalyzer uses the JFreeChart chart drawing library (http://www.jfree.org/jfreechart) for creating charts as Swing controls. Although an extensive functionality from JFreeChart is used in the plugin, a migration to another library is facilitated by the standardized interface of the class JFreeChartConn.

In addition, every chart has several adjustable properties, for example, the background color or the title of the x-axis. These properties are discussed in details later in this appendix.

The visualization of specific complex parameter type instances is dedicated to a separate class. The base class is ComplexParamVisualizer and it has subclasses for every complex parameter type. The names of all these classes end with the suffix Visualizer, e.g.

5 More precisely, all values excluding Analysis time (sec)

Page 59: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

59

IntHistogramVisualizer. In addition to a chart, several buttons are dedicated to every complex parameter. For instance, the buttons shown next to the bar chart of the node degree distribution, are labelled “View enlarged”, “Save chart”, “Export data”, “Filter data”, “Fit a power law”, “Visual settings” and “Help”. Most of the buttons are present for every complex parameter but some of them, such as “Fit a power law” are parameter-dependent and their presence is encoded in the plugin’s settings file. The class ChartDisplayPanel is responsible for creating the required buttons for the given complex parameter instance and implementing their functionality.

Settings An important aspect of the plugin is the management of settings. All settings are stored in the file NetworkAnalyzer.xml, situated in the plugins/ subdirectory of Cytoscape. In case this file does not exist or is invalid, the default settings are loaded from the file de.mpg.mpi_inf.bio.networkanalyzer.NetworkAnalyzer.xml in the plugin's file NetworkAnalyzer.jar.

An instance of the PluginSettings class stores the plugin's general settings. It contains the URLs for error reporting and new version check – these activities are described in section dedicated to the plugin’s communication to the server.

As described above, every complex parameter has a set of adjustable properties (depending on its type), also refered to as visual settings. For example, the topological coefficients parameter is of type Points2D and as such it has the following visual settings: chart title, label of x-axis, label of y-axis, and color of the points in the scatter plot. The default values for them are “Topological Coefficients”, “Number of neighbors”, “TCs”, and black, respectively. The user can change any of these. The abstract class Settings and its inheritors are used for storing visual settings. The extending classes of SettingsGroup specify which of these settings are used in the different complex parameter types.

Loading and Saving Complex parameters, charts, analysis results and settings can be loaded from and saved to files. The mechanism of loading the settings was briefly introduced in the previous section. Note that analysis results and settings support both loading and saving. The classes StatsSerializer and SettingsSerializer control the I/O operations of analysis results and settings, respectively. StatsSerializer relies on two additional classes – Netstats1Reader and Netstats2Reader – for its support of different versions of analysis results (.netstats) files. Complex parameters and charts support saving only.

Communication to the Server If the computer NetworkAnalyzer is running on is connected to the Internet, NetworkAnalyzer can exchange information with the MPII server. There are two services that involve communication over the Internet.

The first type of communication is checking for new versions (update check). In order to achieve this, NetworkAnalyzer sends an HTTP request for a URL. The response is an XML document that contains information about the newest version available for download. In case the current version of NetworkAnalyzer is not the newest one, an informative message is displayed to the user. The update check is performed only the first time an action by the plugin is executed. The service is then disabled until Cytoscape is restarted. The update check may fail due to one of the following reasons: the user has no Internet connection, the URL is invalid or it does not exist, or when the response is not an XML document with the expected structure. In

Page 60: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

60

any case, the failure is simply ignored since it should not be reported to the user. The class UpdateChecker implements this service.

The second type of communication is error reporting. When an internal error occurs, it is encapsulated in an InnerException instance and a dialog is displayed to the user asking for a permission to send an error report. If the user agrees, NetworkAnalyzer sends an HTTP request for a specific URL, passing as POST parameters system information (OS type and version, Java VM type and version) and the stack trace of the InnerException. The response is one of “OK” or “ERROR” and it is ignored. If the error reporting fails, the failure is ignored. The class that implements this functionality is ErrorReportSender.

Implementation Details of the Settings Package The management of plugin settings is the most advanced concept in NetworkAnalyzer and therefore, the package dedicated for this task is discussed here in details. The settings package contains storage classes for various plugin settings. The members of this package and their relationships are shown in Figure 14.

Figure 14. UML diagram showing the organisation of the package de.mpg.mpi_inf.bioinf.netanalyzer.settings.

All classes implement the XMLSerializable interface which contains a method for saving the data to an XML document. Thus the settings classes are able to present their data in the form of an XML tree. Moreover, they can also be initialized from a given XML tree. This functionality is used for loading all settings from an XML file, and saving them to such. The serialization of an XML document is implemented by the class SettingsSerializer (member of the de.mpg.mpi_inf.bioinf.netanalyzer.io package).

Page 61: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

61

The two base classes in this package are Settings and SettingsGroup. Inheritors of Settings store theme-specific settings data. For example, the AxesSettings class stores settings related to the x and y axes of a chart control, such as axis labels, axis types and gridline visibility. The Settings class provides automated means of saving to and loading from an XML tree, as well as methods aiding the generation of dialogs through which the user modifies the settings data. The following paragraph gives insights into the aspect of settings management.

The basic unit in the settings is a property. Inheritors of the Settings class serve the mere function of storing property values. Every property has name and type. For example, the AxesSettings class contains the properties listed in Table 11. Properties defined in AxesSettings..

Property name Property type

Description

domainAxisLabel String Label of the domain (horizontal) axis.

rangeAxisLabel String Label of the range (vertical) axis.

logarithmicDomainAxis boolean Flag indicating logarithmic scale of the domain axis.

logarithmicRangeAxis boolean Flag indicating logarithmic scale of the range axis.

integerDomainAxisTick boolean Flag indicating if the domain axis shows integer values only.

integerRangeAxisTick boolean Flag indicating if the range axis shows integer values only.

Table 11. Properties defined in AxesSettings.

The list of currently supported types for properties is shown in Table 12. Supported property types.. A property is defined by declaring a field with its name and a getter method. The field must have a default visibility, that is, it must be declared as neither private nor public. The signature of the getter method must follow the rules for getters in Java.

Property types

Boolean boolean

Color

Double double

Float float

Integer int

Long long

String

Table 12. Supported property types.

Properties that can be modified by the user also include the declaration of a corresponding setter method.

This set of rules for defining visual settings is designed to automate many settings-related tasks.

Page 62: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

62

More precisely, no additional programming effort is required to enable loading and saving of newly added properties. Moreover, the dialog window for adjusting visual settings in a chart is generated automatically, relieving the programmer from the burden of creating user interface controls for every property and thus minimizing the probability of erroneous processing of user’s input. The architecture of NetworkAnalyzer enables easy extensions to the plugin in the form of additional simple and complex parameters, as well as new visualization techniques and adjustments.

Page 63: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

63

Appendix B – Details of the Analysis of PABPs This appendix lists the figures of histograms resulting from the p-value estimation of the properties for polyanion-binding proteins, discussed in Chapter 6.2. The estimated p-values are given in the respective table.

Estimation of p-values for properties of PABPs in yeast This section shows the histograms obtained after computing property values for 50,000 randomly generated subnetworks of the compiled yeast interactome, as described in Chapter 6.2. The properties investigated are number of interactions involved and increase in connected components. Only the results for the sets of actin-binding (+1SD, +2SD and +3SD) and All (+1SD, +2SD and +3SD) PABPs are given.

(1SD)

(2SD)

Page 64: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

64

Figure 15. Histograms for the number of protein-protein interactions in randomly generated subnetworks of the compiled yeast interactome. The horizontal axis of the histogram represents the different values obtained for the number of protein-protein interactions in the generated subnetworks. The vertical axis shows how many random subnetworks contained the given number of interactions. These histograms are produced for three different statistical levels +1SD, +2SD and +3SD of PABP detection to estimate the p-values shown in Table 8. The red circle represents the number of interactions in the subnetwork formed by Actin-binding proteins and their direct interaction partners.

(1SD)

Page 65: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

65

(2SD)

Figure 16. Histograms for the number of protein-protein interactions in randomly generated subnetworks of the compiled yeast interactome. The horizontal axis of the histogram represents the different values obtained for the number of protein-protein interactions in the generated subnetworks. The vertical axis shows how many random subnetworks contained the given number of interactions. These histograms are produced for three different statistical levels +1SD, +2SD and +3SD of PABP detection to estimate the p-values shown in Table 8. The red circle represents the number of interactions in the subnetwork formed by PABPs and their direct interaction partners.

Page 66: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

66

(1SD)

(2SD)

Page 67: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

67

Figure 17. Histograms for the value of increase in connected components (CCI) after removal of random proteins in the compiled yeast interactome. The horizontal axis of the histogram represents the different values obtained for CCI. Negative values are treated as zeros. The vertical axis shows how many random subnetworks had the given value for CCI. These histograms are produced for three different statistical levels +1SD, +2SD and +3SD of PABP detection to estimate the p-values shown in Table 8. The red circle represents the increase in connected components after removal of the Actin-binding proteins.

(1SD)

Page 68: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

68

(2SD)

Figure 18. Histograms for the value of increase in connected components (CCI) after removal of random proteins in the compiled yeast interactome. The horizontal axis of the histogram represents the different values obtained for CCI. Negative values are treated as zeros. The vertical axis shows how many random subnetworks had the given value for CCI. These histograms are produced for three different statistical levels +1SD, +2SD and +3SD of PABP detection to estimate the p-values shown in Table 8. The red circle represents the increase in connected components after removal of the all PABPs.

Estimation of p-values for properties of PABPs in human This section shows the histograms obtained after computing property values for 50,000 randomly generated subnetworks of the compiled human interactome, as described in Chapter 6.2. The properties investigated are number of interactions involved, increase in connected components and number of connected components in subnetwork. Only the results for the sets of DNA-binding (+1SD, +2SD and +3SD) and All (+1SD, +2SD and +3SD) PABPs are given.

Page 69: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

69

(1SD)

(2SD)

Page 70: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

70

Figure 19. Histograms for the number of protein-protein interactions in randomly generated subnetworks of the compiled human interactome. The horizontal axis of the histogram represents the different values obtained for the number of protein-protein interactions in the generated subnetworks. The vertical axis shows how many random subnetworks contained the given number of interactions. These histograms are produced for three different statistical levels +1SD, +2SD and +3SD of PABP detection to estimate the p-values shown in Table 9. The red circle represents the number of interactions in the subnetwork formed by DNA-binding proteins and their direct interaction partners.

(1SD)

Page 71: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

71

(2SD)

Figure 20. Histograms for the number of protein-protein interactions in randomly generated subnetworks of the compiled human interactome. The horizontal axis of the histogram represents the different values obtained for the number of protein-protein interactions in the generated subnetworks. The vertical axis shows how many random subnetworks contained the given number of interactions. These histograms are produced for three different statistical levels +1SD, +2SD and +3SD of PABP detection to estimate the p-values shown in Table 9. The red circle represents the number of interactions in the subnetwork formed by PABPs and their direct interaction partners.

Page 72: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

72

(1SD)

(2SD)

Page 73: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

73

Figure 21. Histograms for the value of increase in connected components (CCI) after removal of random proteins in the compiled human interactome. The horizontal axis of the histogram represents the different values obtained for CCI. Negative values are treated as zeros. The vertical axis shows how many random subnetworks had the given value for CCI. These histograms are produced for three different statistical levels +1SD, +2SD and +3SD of PABP detection to estimate the p-values shown in Table 9. The red circle represents the increase in connected components after removal of the DNA-binding proteins.

(1SD)

Page 74: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

74

(2SD)

Figure 22. Histograms for the value of increase in connected components (CCI) after removal of random proteins in the compiled human interactome. The horizontal axis of the histogram represents the different values obtained for CCI. Negative values are treated as zeros. The vertical axis shows how many random subnetworks had the given value for CCI. These histograms are produced for three different statistical levels +1SD, +2SD and +3SD of PABP detection to estimate the p-values shown in Table 9. The red circle represents the increase in connected components after removal of the all PABPs.

Page 75: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

75

(1SD)

(2SD)

Page 76: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

76

Figure 23. Histograms for the number of connected components in randomly generated subnetworks of the compiled human interactome. The horizontal axis of the histogram represents the different values obtained for the number of connected components in the generated subnetworks. The vertical axis shows how many random subnetworks contained the given number of connected components. These histograms are produced for three different statistical levels (+1SD, +2SD and +3SD) of PABP detection to estimate the p-values shown in Table 9. The red circle represents the number of connected components in the sub-network formed by DNA-binding proteins only.

(1SD)

Page 77: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

77

(2SD)

Figure 24. Histograms for the number of connected components in randomly generated subnetworks of the compiled human interactome. The horizontal axis of the histogram represents the different values obtained for the number of connected components in the generated subnetworks. The vertical axis shows how many random subnetworks contained the given number of connected components. These histograms are produced for three different statistical levels (+1SD, +2SD and +3SD) of PABP detection to estimate the p-values shown in Table 9. The red circle represents the number of connected components in the sub-network formed by PABPs only.

Page 78: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

78

Estimation of p-values for properties of yeast homologues of human PABPs in yeast This section shows the histograms obtained after computing property values for 50,000 randomly generated subnetworks of the compiled yeast interactome, as described in Chapter 6.2. The properties investigated are number of interactions involved and number of connected components in subnetwork. Only the results for the sets of all yeast homologues (+1SD, +2SD and +3SD) of PABPs in human are given.

(1SD)

(2SD)

Page 79: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

79

(3SD)

Figure 25. Histograms for the number of protein-protein interactions in randomly generated subnetworks of the compiled yeast interactome. The horizontal axis of the histogram represents the different values obtained for the number of protein-protein interactions in the generated subnetworks. The vertical axis shows how many random subnetworks contained the given number of interactions. These histograms are produced for three different statistical levels +1SD, +2SD and +3SD of PABP detection to estimate the p-values shown in Table 10. The red circle represents the number of interactions in the subnetwork formed by yeast proteins homologous to PABP proteins in human, and their direct interaction partners.

(1SD)

Page 80: Topological Analysis of Biological Networks...Master Thesis Topological Analysis of Biological Networks by Yassen Assenov A Thesis Submitted to the Department of Computer Science of

80

(2SD)

(3SD)

Figure 26. Histograms for the number of connected components in randomly generated subnetworks of the compiled yeast interactome. The horizontal axis of the histogram represents the different values obtained for the number of connected components in the generated subnetworks. The vertical axis shows how many random subnetworks contained the given number of connected components. These histograms are produced for three different statistical levels (+1SD, +2SD and +3SD) of PABP detection to estimate the p-values shown in Table 10. The red circle represents the number of connected components in the sub-network formed by all yeast proteins homologous to PABP proteins in human.