9. Lecture WS 2004/05, Bioinformatics III: “Systems biology”, “Integrative cell biology”. Summary, part 2: lectures 9-16




V9 - visualize cellular interaction data

e.g. protein interaction data (undirected): nodes – proteins, edges – interactions

metabolic pathways (directed): nodes – substances, edges – reactions

regulatory networks (directed): nodes – transcription factors + regulated proteins, edges – regulatory interactions

co-localization (undirected): nodes – proteins, edges – co-localization information

homology (undirected/directed): nodes – proteins, edges – sequence similarity (BLAST score)



Force-directed algorithm for graph layout

http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html

Various graph layout algorithms have been

developed to solve this visualisation task.

20 years ago, Peter Eades proposed a graph

layout heuristic [A heuristic for graph

drawing. Congressus Numerantium, 42:149-

160, 1984] which is called the ``Spring

Embedder'' algorithm.

Edges are replaced by springs and

vertexes are replaced by rings that

connect the springs. A layout can be

found by simulating the dynamics of such

a physical system.

This method and other methods, which

involve similar simulations to compute the

layout, are called ``Force Directed''

algorithms.



Force-directed algorithm

http://www.it.usyd.edu.au/~aquigley/3dfade/

The edges can be modeled as gravitational (or electrostatic) attraction

and all nodes have an electrical repulsion between them.

It is also possible for the system to simulate unnatural forces acting on the

bodies, which have no direct physical analogy, for example the use of a

logarithmic distance measure rather than Euclidean.
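The physical analogy above can be sketched in a few lines of Python. This is a minimal toy sketch, not the Spring Embedder as published: the force laws (repulsion k²/d between all pairs, attraction d²/k along edges), the function name `force_layout`, and the parameters `k`, `step` and `max_move` are assumptions chosen for illustration.

```python
import math
import random

def force_layout(nodes, edges, k=1.0, step=0.05, max_move=0.1,
                 iterations=200, seed=0):
    """Toy 2D force-directed layout; returns {node: (x, y)}."""
    rng = random.Random(seed)
    pos = {v: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for v in nodes}

    def apply(force, u, v, magnitude):
        # Positive magnitude pushes u and v apart, negative pulls them together.
        dx = pos[u][0] - pos[v][0]
        dy = pos[u][1] - pos[v][1]
        d = math.hypot(dx, dy) or 1e-9
        f = magnitude(d)
        force[u][0] += f * dx / d
        force[u][1] += f * dy / d
        force[v][0] -= f * dx / d
        force[v][1] -= f * dy / d

    for _ in range(iterations):
        force = {v: [0.0, 0.0] for v in nodes}
        # Repulsion between every pair of nodes (the quadratic-cost part).
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                apply(force, u, v, lambda d: k * k / d)
        # Spring-like attraction along every edge.
        for u, v in edges:
            apply(force, u, v, lambda d: -d * d / k)
        # Move each node a small, capped step along its net force.
        for v in nodes:
            mx, my = step * force[v][0], step * force[v][1]
            length = math.hypot(mx, my)
            if length > max_move:
                mx, my = mx * max_move / length, my * max_move / length
            pos[v][0] += mx
            pos[v][1] += my
    return {v: tuple(p) for v, p in pos.items()}
```

With these assumed force laws, two nodes joined by one edge settle at distance k, where attraction and repulsion cancel.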



Force-directed algorithm


Because of the underlying analogy to a physical system, the force directed graph

layout methods tend to meet various aesthetic standards, such as

- efficient space filling,

- uniform edge length (when equal weights and repulsions are used)

- symmetry and the

- capability of rendering the layout process with smooth animation (visual

continuity).

Having these nice features, the force directed graph layout has become

the ``work horse'' of layout algorithms.

It has been successfully adapted to many domains with variations of

implementation.



Scaling


Force directed layout methods commonly have computational scaling problems.

When there are more than a few thousand vertexes in the graph, the running time

of the layout computation can become unacceptable.

This is caused by the fact that in each step of the simulation, the repulsive

force between each pair of unconnected vertexes needs to be computed,

costing a running time of $O(0.5\,V^2 - E)$.

Here V is the number of vertexes and E is the number of edges in the graph.

This complexity is hard to escape for general graphs without hierarchical structure.



Protein interaction graphs

Ju et al. Bioinformatics 19, 317 (2003)

Most protein interaction data have the following characteristics:

(1) When visualized as a graph, the data yields a disconnected graph with many

connected components

(2) The data yields a nonplanar graph with a large number of edge crossings that

cannot be removed in a 2D drawing

(3) the number of interactions per protein varies widely within the same data set (degree distribution p(k))

(4) the data often contain protein interactions corresponding to self loops

These characteristics demand a robust layout algorithm.



InterViewer: Example of force-directed layout algorithm

Ju et al. Bioinformatics 19, 317 (2003)

InterViewer does not place initial nodes

randomly, but on the surface of a

sphere. Fixed # of iterations.

The original algorithm has complexity $O(N^2)$ per time step, where N is the number of nodes.

When using multipole methods, this can be reduced to $O(N \log N)$.

Time may also be saved by introducing

a cut-off, e.g. only computing

interactions with the next neighbor

cells. Update neighbor list infrequently.



Aim: analyze and visualize homologies across the protein universe.

50 genomes, 145,579 proteins, $21 \times 10^9$ BLASTP pairwise sequence comparisons.

Expect that fusion proteins („Rosetta Stone proteins“) will link proteins

of related function.

Need to visualize extremely large network! Develop stepwise scheme.



LGL

Adai et al. J. Mol. Biol. 340, 179 (2004)

(1) separate original network into connected sets

(2) generate coordinates for each node in each connected set

(using force-directed layout algorithm and a recipe for the sequential lay out of

nodes guided by a minimum spanning tree of the network).

(3) integrate connected sets into one coordinate system via a funnel process:

the connected sets are sorted in descending size by the number of vertices.

The first connected set is placed at the bottom of a potential funnel and other

sets are placed one at a time on the rim of the potential funnel and allowed to

fall towards the bottom where they are frozen in space upon collision with the

previous sets.

We concentrate on step (2) in the following



Minimum Spanning Tree

Given: undirected graph G = (V,E) where for each edge $(u,v) \in E$ there is a weight w(u,v) specifying the cost to connect u and v.

Find an acyclic subset $T \subseteq E$ that connects all of the nodes and whose total weight

$$w(T) = \sum_{(u,v) \in T} w(u,v)$$

is minimized.

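Popular algorithms by Kruskal and Prim.

Both are greedy algorithms making the best choice at the moment.

Greedy strategies in general carry no guarantee of finding the best global solution, but for the minimum spanning tree problem both algorithms provably yield an optimal tree.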

[Cormen]



Kruskal’s algorithm

Consider edges in sorted order by weight.

The arrow points to the edge under consideration at each step.

[Cormen]



Kruskal’s algorithm (II)

Running time O(E log V)

[Cormen]
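Kruskal's procedure can be sketched as follows. This is an illustrative sketch, not the textbook pseudocode: the function name `kruskal`, the `(weight, u, v)` edge format, and the simple union-find with path halving are assumptions made for the example.

```python
def kruskal(vertices, weighted_edges):
    """Return MST edges (u, v, w) for edges given as (w, u, v) tuples."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the set representative (with path halving).
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    # Consider edges in sorted order by weight.
    for w, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # edge joins two different components
            parent[root_u] = root_v   # merge the components
            tree.append((u, v, w))
    return tree
```

Sorting the edges dominates the cost, which is consistent with the stated O(E log V) running time since log E is O(log V).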



Intuitive description of LGL


Successive iterations of the layout. The MST determines the order of placement of

the nodes. The root node could be chosen randomly or based on its centrality in the

network (e.g. minimizing the sum of distances to all other nodes). All other nodes

are assigned a level according to their edge-based distance in the MST from the

root node.

Level one vertices (red circles) are placed randomly on a sphere around the root

node (black circle). The system is allowed to iterate through time satisfying attractive

and repulsive forces until at rest.

Level two nodes (blue circles) are placed randomly on spheres directed away from

the current layout. Again, the system is allowed to evolve through time till at rest.

This process is iterated for the entire graph.
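The level assignment described above (edge-based distance from the root within the MST) amounts to a breadth-first search on the tree. The sketch below illustrates only that step; the function name `mst_levels` and the edge-list input format are assumptions, and the force-directed placement of each level is not shown.

```python
from collections import deque

def mst_levels(tree_edges, root):
    """Map each node to its edge distance from root within the tree."""
    adj = {}
    for u, v in tree_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in level:
                level[v] = level[u] + 1   # one edge further from the root
                queue.append(v)
    return level
```

Nodes would then be placed level by level: all nodes with level 1 first, then level 2, and so on.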



What is the role of fusion proteins?


A protein homology map summarizes the results of billions of sequence comparisons by modeling

the proteins as vertices in a network, and the statistically significant sequence similarities as edges

connecting the relevant proteins. In this manner, proteins within a sequence family (such as A, A′, A

″, and AB; or B, B′ and AB) are all or mostly connected to each other, forming a cluster in the map.

Fusion proteins (such as AB) serve to connect their component proteins' families. The structure of

the resulting map reflects historic genetic events, such as gene fusions, fissions, and duplications,

which are responsible for producing the modern-day genes. The map simultaneously represents

homology relationships (edges), remote homologies (proteins not directly connected but in the same

cluster), and non-homologous functional relationships (adjacent clusters and clusters linked by

fusion proteins).



LGL Algorithm for very large biological networks


The complete protein homology map. A layout of the entire protein homology

map; a total of 11,516 connected sets containing 111,604 proteins (vertices)

with 1,912,684 edges. The largest connected set is shown more clearly in the

inset and is enlarged further on the right side.



Functionally related gene families form adjacent clusters


Three examples illustrate spatial

localization of protein function in the map,

specifically

A, the linkage of the tryptophan synthase

family to the functionally coupled but non-

homologous family by the yeast

tryptophan synthase fusion protein,

B, protein subunits of the pyruvate

synthase and alpha-ketoglutarate

ferredoxin oxidoreductase complexes

C, metabolic enzymes, particularly those of

acetyl CoA and amino acid metabolism.



Colocalization


Neighboring proteins tend to be in the

same cellular system. The tendency

for proteins to operate in the same

cellular system, as defined by the

percentage of matching assignments

into the 18 COG database pathways,

is plotted against the spatial

separation in multiples of a typical

cluster size.

The functional similarity decays exponentially with distance, proportional to the function $e^{-0.26 d}$, where d is the separation measured in units of a typical cluster diameter.



Modularity in molecular networks?

A functional module is, by definition, a discrete entity whose function is

separable from those of other modules.

This separation depends on chemical isolation, which can originate from

spatial localization or from chemical specificity.

E.g. a ribosome concentrates the reactions involved in making a polypeptide

into a single particle, thus spatially isolating its function.

A signal transduction system is an extended module that achieves its isolation

through the specificity of the initial binding of the chemical signal to receptor

proteins, and of the interactions between signalling proteins within the cell.

Hartwell et al. Nature 402, C47 (1999)



Modularity in molecular networks

Modules can be insulated from or connected to each other.

Insulation allows the cell to carry out many diverse reactions without cross-talk

that would harm the cell.

Connectivity allows one function to influence another.

The higher-level properties of cells, such as their ability to integrate information

from multiple sources, will be described by the pattern of connections among their

functional modules.

Hartwell et al. Nature 402, C47 (1999)



Organization of large-scale molecular networks

Organization of molecular networks revealed by large-scale experiments:

- power-law, $P(k) \propto k^{-\gamma}$, or similar distribution of the node degree k (i.e. the number of edges of a node)

- small-world property (i.e. a high clustering coefficient and a small shortest path

between every pair of nodes)

- anticorrelation in the node degree of connected nodes (i.e. highly interacting

nodes tend to be connected to low-interacting ones)

These properties become evident when hundreds or thousands of molecules and

their interactions are studied together.

On the other end of the spectrum: recently discovered motifs that consist of 3-4

nodes.



Mesoscale properties of networks

Most relevant processes in biological networks correspond to the

mesoscale (5-25 genes or proteins) not to the entire network.

However, it is computationally enormously expensive to study mesoscale

properties of biological networks.

e.g. a network of 1000 nodes contains about $1 \times 10^{23}$ possible 10-node sets.

Spirin & Mirny analyzed combined network of protein interactions with data from

CELLZOME, MIPS, BIND: 6500 interactions.



Identify connected subgraphs

The network of protein interactions is typically presented as an undirected graph

with proteins as nodes and protein interactions as undirected edges.

Aim: identify highly connected subgraphs (clusters) that have more interactions

within themselves and fewer with the rest of the graph.

A fully connected subgraph, or clique, that is not a part of any other clique is an

example of such a cluster.

In general, clusters need not be fully connected.

Measure density of connections by

$$Q = \frac{2m}{n(n-1)}$$

where n is the number of proteins in the cluster and m is the number of interactions between them.

Spirin, Mirny, PNAS 100, 12123 (2003)
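The density measure Q = 2m / (n(n-1)) can be computed directly from an edge list. A minimal sketch, assuming each undirected interaction is listed once; the function name `cluster_density` is hypothetical.

```python
def cluster_density(cluster, edges):
    """Q = 2m / (n(n-1)) for the subgraph induced by the cluster."""
    members = set(cluster)
    n = len(members)
    if n < 2:
        raise ValueError("Q is only defined for clusters with at least 2 proteins")
    # m counts interactions with both partners inside the cluster (no self loops).
    m = sum(1 for u, v in edges if u != v and u in members and v in members)
    return 2.0 * m / (n * (n - 1))
```

A clique gives Q = 1; a three-protein chain with two interactions gives Q = 2/3.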



(method I) Identify all fully connected subgraphs (cliques)

Generally, finding all cliques of a graph is an NP-hard problem.

Because the protein interaction graph is so far very sparse (the number of interactions (edges) is similar to the number of proteins (nodes)), this can be done quickly.

To find cliques of size n one needs to enumerate only the cliques of size n-1.

The search for cliques starts with n = 4: pick all pairs of known edges (6500 × 6500 pairs of protein interactions) successively.

For every pair A-B and C-D check whether there are edges between A and C, A and

D, B and C, and B and D. If these edges are present, ABCD is a clique.

For every clique identified, ABCD, pick all known proteins successively.

For every picked protein E, if all of the interactions E-A, E-B, E-C, and E-D are known,

then ABCDE is a clique with size 5.

Continue for n = 6, 7, ... The largest clique found in the protein-interaction network

has size 14. Spirin, Mirny, PNAS 100, 12123 (2003)
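The extension step (cliques of size n are built from cliques of size n-1) can be sketched as below. This is a simplified variant: it starts from single edges as cliques of size 2 instead of testing pairs of edges for n = 4, and the function name `enumerate_cliques` is an assumption. It is only practical for sparse graphs such as the one described here.

```python
def enumerate_cliques(edges):
    """Return {size: set of cliques (frozensets)} by stepwise extension."""
    adj = {}
    for u, v in edges:
        if u != v:  # self loops carry no clique information
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    # Every edge is a clique of size 2.
    level = {frozenset((u, v)) for u in adj for v in adj[u]}
    by_size = {}
    size = 2
    while level:
        by_size[size] = level
        larger = set()
        for clique in level:
            # A protein extends the clique if it interacts with all members.
            common = set.intersection(*(adj[v] for v in clique))
            for extra in common:
                larger.add(clique | {extra})
        level = larger
        size += 1
    return by_size
```

The result still contains the redundant sub-cliques; they would have to be filtered in a separate pass.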



(I) Identify all fully connected subgraphs (cliques)

These results include, however, many redundant cliques.

For example, the clique with size 14 contains 14 cliques with size 13.

To find all nonredundant subgraphs, mark all proteins comprising the clique of size

14, and out of all subgraphs of size 13 pick those that have at least one protein

other than marked.

After all redundant cliques of size 13 are removed, proceed to remove redundant

twelves etc.

In total, only 41 nonredundant cliques with sizes 4 - 14 were found.




(method II) Superparamagnetic Clustering (SPC)

SPC uses an analogy to the physical properties of an inhomogeneous ferromagnetic

model to find tightly connected clusters on a large graph.

Every node on the graph is assigned a Potts spin variable Si = 1, 2, ..., q.

The value of this spin variable Si performs thermal fluctuations, which are

determined by the temperature T and the spin values on the neighboring nodes.

Energetically, 2 nodes connected by an edge are favored to have the same spin

value. Therefore, the spin at each node tends to align itself with the majority of its

neighbors.

When such a Potts spin system reaches equilibrium for a given temperature T,

high correlation between fluctuating Si and Sj at nodes i and j would indicate that

nodes i and j belong to the same cluster.




(II) Superparamagnetic Clustering (SPC)

The protein-interaction network is represented by a graph where every pair of

interacting proteins is an edge of length 1.

The simulations are run for temperatures ranging from 0 to 1 in units of the

coupling strength.

The network splits into monomers at temperatures between 0.7 and 0.8,

whereas larger clusters only exist for temperatures between 0.1 and 0.7.

Clusters are recorded at all temperature values.

The overlapping clusters are then merged and redundant ones are removed.




(method III) Monte Carlo Simulation

Use MC to find a tight subgraph of a predetermined number of nodes M.

At time t = 0, a random set of M nodes is selected.

For each pair of nodes i,j from this set, the shortest path Lij between i and j on the

graph is calculated.

Denote the sum of all shortest paths Lij from this set as L0.

At every time step one of M nodes is picked at random, and one node is picked at

random out of all its neighbors.

The new sum of all shortest paths, L1, is calculated if the original node were to be

replaced by this neighbor.

If L1 < L0, accept replacement with probability 1.

If L1 > L0, accept replacement with probability

$$\exp\left(-\frac{L_1 - L_0}{T}\right)$$

where T is the effective temperature.



(III) Monte Carlo Simulation

Every tenth time step an attempt is made to replace one of the nodes from

the current set with a node that has no edges to the current set to avoid

getting caught in an isolated disconnected subgraph.

This process is repeated

(i) until the original set converges to a complete subgraph, or

(ii) for a predetermined number of steps,

after which the tightest subgraph (the subgraph corresponding to the smallest

L0) is recorded.

The recorded clusters are merged and redundant clusters are removed.
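The Monte Carlo search can be sketched as a Metropolis-style loop. This is a hedged sketch under several assumptions: the function names, the penalty distance for unreachable pairs, and the restriction of replacements to neighbors outside the current set are illustrative choices, and the extra jump attempted every tenth step is omitted.

```python
import math
import random
from collections import deque

def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from source to all reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_sum(adj, subset):
    """Sum of shortest paths L_ij over all pairs i, j in subset."""
    nodes = sorted(subset)
    penalty = len(adj)  # assumed distance for pairs with no connecting path
    total = 0
    for i, u in enumerate(nodes):
        dist = bfs_distances(adj, u)
        total += sum(dist.get(v, penalty) for v in nodes[i + 1:])
    return total

def mc_search(adj, m, temperature, steps=1000, seed=0):
    """Search for a tight m-node subgraph; returns (best set, its path sum)."""
    rng = random.Random(seed)
    current = set(rng.sample(sorted(adj), m))
    l0 = path_sum(adj, current)
    best, best_l = set(current), l0
    for _ in range(steps):
        node = rng.choice(sorted(current))
        neighbors = sorted(adj[node] - current)
        if not neighbors:
            continue
        candidate = (current - {node}) | {rng.choice(neighbors)}
        l1 = path_sum(adj, candidate)
        # Always accept improvements; accept worse sets with exp(-(L1-L0)/T).
        if l1 <= l0 or rng.random() < math.exp(-(l1 - l0) / temperature):
            current, l0 = candidate, l1
            if l0 < best_l:
                best, best_l = set(current), l0
    return best, best_l
```

For a set of M nodes the smallest possible path sum is M(M-1)/2, reached exactly when the set is a clique.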




Optimal temperature in MC simulation

For every cluster size there is an

optimal temperature that gives the

fastest convergence to the tightest

subgraph.


Time to find a clique with size 7 in MC steps

per site as a function of temperature T.

The region with optimal temperature is

shown in Inset.

The required time increases sharply as the

temperature goes to 0, but has a relatively

wide plateau in the region 3 < T < 7.

Simulations suggest that the choice of temperature $T \approx M$ would be safe for any cluster size M.



Comparison of clusters found with

SPC (blue) and MC simulation

(red).

Reasonable overlap (ca. one third

of all clusters are found by both

methods) – but both methods

seem complementary.


Comparison of SPC and Monte Carlo methods



The SPC method is best at detecting high-Q value clusters with relatively few links

with the outside world. An example is the TRAPP complex, a fully connected clique

of size 10 with just 7 links with outside proteins.

This cluster was perfectly detected by SPC, whereas the MC simulation was able to

find smaller pieces of this cluster separately rather than the whole cluster.

By contrast, MC simulations are better suited for finding very „outgoing“ cliques.

The Lsm complex, a clique of size 11, includes 3 proteins with more interactions

outside the complex than inside. This complex was easily found by MC, but was not

detected as a stand-alone cluster by SPC.


Comparison of SPC and Monte Carlo methods

Q: Why does the SPC method work particularly well for finding clusters with high Q values and few links to the outside, whereas the Monte Carlo method mainly finds „outgoing“ cliques?



Merging Overlapping Clusters

A simple statistical test shows that nodes which have only one link to a cluster are statistically insignificant. Remove such statistically insignificant members first.

Then merge overlapping clusters:

For every cluster A_i find all clusters A_k that overlap with this cluster by at least one protein.

For every such cluster calculate the Q value of a possible merged cluster A_i ∪ A_k. Record the cluster A_best(i) that gives the highest Q value if merged with A_i.

After the best match is found for every cluster, every cluster A_i is replaced by the merged cluster A_i ∪ A_best(i) unless the Q value of A_i ∪ A_best(i) is below a certain threshold value Q_C.

This process continues until there are no more overlapping clusters or until merging any of the remaining clusters would produce a cluster with a Q value lower than Q_C.
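The merging procedure can be sketched as follows. This is a simplified, sequential variant (one merge at a time rather than replacing all clusters simultaneously), and it skips the preliminary removal of insignificant members; the function names and the parameter `q_min` (standing in for the threshold Q_C) are assumptions.

```python
def density(cluster, edges):
    """Q = 2m / (n(n-1)); returns 0.0 for clusters with fewer than 2 nodes."""
    n = len(cluster)
    if n < 2:
        return 0.0
    m = sum(1 for u, v in edges if u != v and u in cluster and v in cluster)
    return 2.0 * m / (n * (n - 1))

def merge_clusters(clusters, edges, q_min):
    """Greedily merge overlapping clusters while the merged Q stays >= q_min."""
    clusters = list(dict.fromkeys(frozenset(c) for c in clusters))
    changed = True
    while changed:
        changed = False
        for a in clusters:
            best, best_q = None, q_min
            for b in clusters:
                if b != a and a & b:            # overlap by at least one protein
                    q = density(a | b, edges)
                    if q >= best_q:
                        best, best_q = b, q
            if best is not None:
                merged = a | best
                clusters = [c for c in clusters if c not in (a, best, merged)]
                clusters.append(merged)
                changed = True
                break                            # restart with the updated list
    return set(clusters)
```

Each merge reduces the number of clusters by at least one, so the loop terminates.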




Statistical significance of complexes and modules

Number of complete cliques (Q = 1) as

a function of clique size enumerated in

the network of protein interactions

(red) and in randomly rewired graphs

(blue, averaged >1,000 graphs where

number of interactions for each protein

is preserved).

Inset shows the same plot in log-

normal scale. Note the dramatic

enrichment in the number of cliques in

the protein-interaction graph

compared with the random graphs.

Most of these cliques are parts of

bigger complexes and modules.




Statistical significance of complexes and modules


Distribution of Q of clusters found by the MC search

method.

Red bars: original network of protein interactions.

Blue curves: randomly rewired graphs.

Clusters in the protein network have many more

interactions than their counterparts in the random

graphs.



Discovered functional modules


Examples of discovered functional modules.

(A) A module involved in cell-cycle regulation. This module consists of cyclins (CLB1-4 and

CLN2) and cyclin-dependent kinases (CKS1 and CDC28) and a nuclear import protein (NIP29).

Although they have many interactions, these proteins are not present in the cell at the same

time.

(B) Pheromone signal transduction pathway in the network of protein–protein interactions. This

module includes several MAPK (mitogen-activated protein kinase) and MAPKK (mitogen-

activated protein kinase kinase) kinases, as well as other proteins involved in signal

transduction. These proteins do not form a single complex; rather, they interact in a

specific order.



Robustness of clusters found

Model effect of false positives in

experimental data: randomly reconnect,

remove or add 10-50% of interactions

in network.

Cluster recovery probability as a

function of the fraction of altered links.

Black curves correspond to the case

when a fraction of links are rewired.

Red, removed;

green, added.

Circles represent the probability to

recover 75% of the original cluster;

triangles represent the probability to

recover 50%.


Noise in the form of removal or addition of links has a less deteriorating effect

than random rewiring. About 75% of

clusters can still be found when 10% of

links are rewired.



Summary

Here: analysis of meso-scale properties demonstrated the presence of highly

connected clusters of proteins in a network of protein interactions. Strong

support for suggested modular architecture of biological networks.

Distinguish 2 types of clusters: protein complexes and dynamic functional modules.

Both complexes and modules have more interactions among their members than

with the rest of the network.

Dynamic modules are elusive to experimental purification because they are not

assembled as a complex at any single point in time.

Computational analysis allows detection of such modules by integrating pairwise

molecular interactions that occur at different times and places.

However, computational analysis alone does not allow one to distinguish between

complexes and modules or between transient and simultaneous interactions.



V10 Protein complexes and their shared components

- Most cellular processes result from a cascade of events mediated by proteins that act in a cooperative manner.

- Protein complexes can share components: proteins can be reused and participate in several complexes.

Methods for analyzing high-throughput protein interaction data have mainly used

clustering techniques.

They have been applied to assign protein function by inference from the biological

context as given by their interactors, and to identify complexes as dense regions

of the network (see V9).

The logical organization into shared and specific components, and its

representation remains elusive.

Gagneur et al. Genome Biology 5, R57 (2004)



shared components

Shared components = proteins or groups of proteins occurring in different

complexes are fairly common:

A shared component may be a small part of many complexes, acting as a

unit that is constantly reused for its function.

Also, it may be the main part of the complex e.g. in a family of variant complexes

that differ from each other by distinct proteins that provide functional specificity.

Aim: identify and properly represent the modularity of protein-protein interaction

networks by identifying the shared components and the way they are arranged to

generate complexes.

Gagneur et al. Genome Biology 5, R57 (2004); Georg Casari, Cellzome (Heidelberg)



Modules

A graph and its modules.

Nodes connected by a link are

called neighbors.

In graph theory, a module is a set

of nodes that have the same

neighbors outside the module.

In addition to the trivial modules {a},

{b},...,{g} and {a,b,c,..,g}, this graph

contains the modules {a,b,c}, {a,b},

{a,c},{b,c} and {e,f}.
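The module definition can be checked mechanically: all members must see the same set of neighbors outside the candidate set. A minimal sketch; the function name `is_module` and the adjacency-dict format are assumptions, and the small graph in the usage note is hypothetical, not the graph from the figure.

```python
def is_module(adj, candidate):
    """True if all members of candidate have identical outside neighbors."""
    members = set(candidate)
    # Collect each member's neighborhood restricted to nodes outside the set.
    outside_views = {frozenset(adj[v] - members) for v in members}
    return len(outside_views) <= 1
```

For a hypothetical graph in which a, b and c are each linked only to d, and d is also linked to e, the sets {a,b,c} and {a,b} are modules, whereas {a,d} is not.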




Quotient

Because the elements of a module have exactly the same neighbors outside the module, one can substitute a single representative node for all of them.

In a quotient, all elements of the module are replaced by the representative node,

and the edges with the neighbors are replaced by edges to the representative.

Quotients can be iterated until the entire graph is merged into a final

representative node.

Iterated quotients can be captured in a tree, where each node represents a

module, which is a subset of its parent and the set of its descendant leaves.




Modular decomposition

Modular decomposition of the

example graph shown before.

Modular decomposition gives a

labeled tree that represents iterations

of particular quotients, here the

successive quotients on the modules

{a,b,c} and {e,f}.

The modular decomposition is a

unique, canonical tree of iterated

quotients

(a formal proof exists: Möhring 1985).




Modular decomposition

The nodes of the modular decomposition

are labeled in 3 ways:

As series when the direct descendants

are all neighbors of each other,

as parallel when the direct descendants

are all non-neighbors of each other,

and by the structure of the module

otherwise (prime module case).


Series are labeled by an asterisk within a circle, parallel by two parallel lines within a circle,

and prime by a P within a circle. The prime is advantageously labeled by its structure.

The graph can be retrieved from the tree on the right by recursively expanding the modules

using the information in the labels. Therefore, the labeled tree can be seen as an exact

alternative representation of the graph.



Results from protein complex purifications (PCP), e.g. TAP

Different types of data:

- Y2H: detects direct physical interactions between proteins

- PCP by tandem affinity purification with mass-spectrometric identification of the

protein components identifies multi-protein complexes

Modular decomposition will have a different meaning due to different semantics

of such graphs.

Here, focus analysis on PCP content.

PCP experiment: select a bait protein to which the TAP label is attached, then co-purify it together with those proteins that co-occur in at least one complex with the bait protein.

In future, integrated view combining both types of data would be preferred.




Clique and maximal clique

A clique is a fully connected sub-graph, that is, a set

of nodes that are all neighbors of each other.

In this example, the whole graph is a clique and

consequently any subset of it is also a clique, for

example {a,c,d,e} or {b,e}. A maximal clique is a

clique that is not contained in any larger clique. Here

only {a,b,c,d,e} is a maximal clique.


Assuming complete datasets and ideal results, a permanent complex will appear

as a clique.

The opposite is not true: not every clique in the network necessarily derives from

an existing complex. E.g. 3 connected proteins can be the outcome of a single

trimer, 3 heterodimers or combinations thereof.



Results from protein complex purifications (PCP), e.g. TAP

Interpretation of graph and module labels

for systematic PCP experiments.

(a) Two neighbors in the network are

proteins occurring in the same complex.

(b) Several potential sets of complexes

can be the origin of the same observed

network. Restricting interpretation to the

simplest model (top right), the series

module reads as a logical AND between

its members.

(c) A module labeled ´parallel´

corresponds to proteins or modules

working as strict alternatives with respect

to their common neighbors.

(d) The ´prime´ case is a structure where

neither of the two previous cases occurs.

Gagneur et al. Genome Biology 5, R57 (2004)



Obtain maximal cliques

Modular decomposition provides an instruction set to deliver all maximal cliques

of a graph.

In particular, when the decomposition has only series and parallels, the maximal

cliques are straightforwardly retrieved by traversing the tree recursively from top

to bottom.

A series module acts as a product: the maximal cliques are all the combinations

made up of one maximal clique from each „child“ node.

A parallel module acts as a sum: the set of maximal cliques is the union of all

maximal cliques from the „child“ nodes.
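The product and sum rules can be written as a short recursion over a decomposition tree that contains only series and parallel nodes. This is an illustrative sketch: the tree encoding (a leaf is a string, an internal node is a `(label, children)` tuple) and the function name `maximal_cliques` are assumptions, and prime nodes are not handled.

```python
from itertools import product

def maximal_cliques(node):
    """Maximal cliques encoded by a series/parallel decomposition tree."""
    if isinstance(node, str):                 # leaf: a single protein
        return [frozenset([node])]
    label, children = node
    child_cliques = [maximal_cliques(child) for child in children]
    if label == "series":
        # Product: combine one maximal clique from each child.
        return [frozenset().union(*combo) for combo in product(*child_cliques)]
    if label == "parallel":
        # Sum: union of the children's maximal cliques.
        return [clique for cliques in child_cliques for clique in cliques]
    raise ValueError("prime modules are not handled in this sketch")
```

For example, the tree `("series", ["a", ("parallel", ["b", "c"])])` encodes a graph with edges a-b and a-c, and yields the maximal cliques {a,b} and {a,c}.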




Consider an undirected graph G=(V,E) with n = |V| vertices and m = |E| edges.

The complement of a graph G is denoted by $\overline{G}$.

If X is a subset of vertices, then G[X] is the subgraph of G induced by X.

Let x be an arbitrary vertex; then $N(x)$ and $\overline{N}(x)$ stand respectively for the neighborhood of x and its non-neighborhood.

A vertex x distinguishes two vertices u and v if $(x,u) \in E$ and $(x,v) \notin E$.

A module M of a graph G is a set of vertices that is not distinguished by any vertex outside M.

Note: this part has been shortened considerably. Only the basic aspects of the algorithm are important.



A simple linear algorithm for modular decomposition

The modules of a graph are a potentially exponentially-sized family

However, the sub-family of strong modules, the modules that overlap no other

modules, has size O(n).

A overlaps B if $A \cap B \neq \emptyset$, $A \setminus B \neq \emptyset$ and $B \setminus A \neq \emptyset$.

The inclusion order of this family defines the previously explained

modular tree decomposition, which is enough to store the module family of a

graph.

The root of this tree is the trivial module V and its n leaves are the trivial modules

{x}, $x \in V$.

Habib, de Montgolfier, Paul (2004)



Aim: a simple linear algorithm for modular decomposition

Any graph G with at least 3 vertices is either not connected,

or its complement $\overline{G}$ is not connected,

or G and $\overline{G}$ are both connected.

In the last case, the maximal modules define a partition of the vertex-set and are

said to be a prime composition.

The modular decomposition tree can be recursively built by a top-down approach.

At each step, the algorithm recurses on graphs induced by the maximal strong

modules. This technique gives an $O(n^4)$ complexity.

Here, derive a linear-time algorithm that computes a modular factorizing

permutation without computing the underlying decomposition tree.

This tree may be derived in a second step.

Habib, de Montgolfier, Paul (2004)



Modular decomposition of protein interaction graphs

A graph and its modular tree decomposition. The set {1,2} is a strong module.

The module {7,8} is weak: it is overlapped by the module {8,9}.

The permutation (1,2,3,4,5,6,7,8,9) is a modular factorizing permutation.




Module-factorizing orders

Let G=(V,E) be a graph and let O be a partial order on V.

For two comparable elements x and y where $x <_O y$, we say that x precedes y and y follows x.

Two subsets A and B cross if there exist $a, a' \in A$ and $b, b' \in B$ such that $a <_O b$ and $a' >_O b'$.

A linear extension of a partially ordered set ('poset') is a completion of the poset into a total order.

Definition 1. A partial order O is a Module-Factorizing Partial Order (MFPO) of

V(G) if any pair of non-intersecting strong modules of G do not cross.

The modular factorizing permutations are exactly the module-factorizing total orders.

Proposition 1. A partial order O is an MFPO if and only if it can be completed into a

factorizing permutation.




Module-factorizing orders

Definition 2. An ordered partition is a collection {P1, ..., Pk} of pairwise disjoint parts, with $V = \bigcup_i P_i$, and an order O such that for all $x \in P_i$ and $y \in P_j$, $x <_O y$ if i < j.

Start with trivial partition (a single part equal to the vertex set) and iteratively

extend (or refine) it until every part is a singleton.

A center vertex $c \in V$ is distinguished and two refining rules, preserving the MFPO

property, are used. They are defined in Lemma 1:




Defining rules

Lemma 1.

1. Center Rule: For any vertex c, the ordered partition

is module-factorizing.


The center rule picks a center and breaks a trivial partition to start the

algorithm.

Once launched, the process goes on based on the pivot rule, that splits each

part Pi (except the part Pi that contains the pivot), according to the neighborhood

of the pivot.



Defining rules: pivot rule

Lemma 1 continued.

2. Pivot Rule: Let O be an ordered partition with center c and $p \in P_i$ such that $P_j$, $i \neq j$, overlaps $N(p)$.

If O is an MFPO, then the following refinements preserve the module-factorizing property:




Preliminary algorithm

Partition refinement scheme that outputs a partition of V into the maximal

modules not containing c.


When this algorithm ends, every part is a module. To obtain a factorizing

permutation it has to be recursively relaunched on the non-singleton parts.




Execution example of algorithm

The resulting factorizing permutation is (a, s, v, w, u, y, x, z, t).



Ordered chain partition yields linear-time algorithm

Definition 3. An ordered chain partition (OCP) is a partial order such that each

vertex belongs to one and only one chain, and one chain belongs to one and

only one part. The vertices of the same chain are totally ordered, the chains

of the same part are incomparable, and the parts are totally ordered.


A trivial chain contains only 1 vertex, and a monochain part contains only one

chain. The OCPs generalize the Ordered Partitions since the latter ones contain

only trivial chains.




C(x) denotes the chain containing x while P(x) denotes the part of the partition

containing x.

Each chain C has a representative vertex r(C) ∈ C.

During the algorithm, the chains will behave as their representative vertices.

Chains are possibly merged. Then, the representative of the new chain is one of

the former representatives. But chains will never be split.

The algorithm still uses the center and pivot rules.

The chains are moved by these 2 rules, according to the adjacency between

their representative vertex and the center or the pivot.

But there is a third rule, the chaining rule (line 9 of algorithm).




Defining rule 3: Chaining rule

There is a third rule, the chaining rule.

Unlike the first two, the third rule removes comparisons from the order.

It first concatenates a sequence of monochain parts, that occur consecutively in

O, into one chain. Then this new chain is inserted into one of the two parts,

say P, neighboring the chain.

Chaining rule, chaining the black vertices into P.


The comparisons between the chain and P are lost.

But since the number of chains strictly decreases during the algorithm,

the process is guaranteed to end.




Use each vertex a constant number of times as a pivot.





Execution example of algorithm

The resulting factorizing permutation is (a, s, v, w, u, y, x, z, t).

Summary: a simple, linear-time algorithm is now available for modular decomposition of graphs.

What is the meaning of such modules when applied to real data?



In the modular decomposition tree, the leaves are proteins,

the root represents the whole network.

In between, each node is a module that is a sub-part of its parent.

The label of a node gives the nature of the relationship between its direct children.

Proteins or modules in a parallel module can be seen as

alternatives. If A is neighbor of B and C, which are not neighbors

of each other, then A can belong to a complex together with

either B or C, but not with both at the same time.

B and C define a parallel module and thus are alternative

partners in a complex with their common neighbor A.

This situation corresponds to a logical „exclusive OR“

between B and C.

Interpretation for PCP protein interaction networks




Proteins or modules in a series module can be

seen as potentially combined in any way.

If A is neighbor of B and C, and B and C are

also neighbors, then A can belong to a complex

together with B or C, or with both at the same

time.

This corresponds to a logical „OR“ between B

and C.

A series module can be seen as a unit: a set of

proteins (modules) that function together.

A ‚prime‘ is a graph where neither of these cases

occurs.
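The parallel / series / prime distinction can be checked mechanically. The following is a minimal sketch, assuming the standard graph-theoretic reading (parallel: the induced subgraph is disconnected; series: its complement is disconnected; prime: neither). The function names and the toy complexes are hypothetical and not taken from the lecture.

```python
from collections import deque

def components(nodes, adjacent):
    """Count connected components among `nodes` under the predicate `adjacent`."""
    unseen, count = set(nodes), 0
    while unseen:
        count += 1
        queue = deque([unseen.pop()])
        while queue:
            u = queue.popleft()
            linked = {v for v in unseen if adjacent(u, v)}
            unseen -= linked
            queue.extend(linked)
    return count

def module_type(edges, members):
    """Label the subgraph induced by `members` as parallel, series or prime."""
    edge_set = {frozenset(e) for e in edges}

    def is_edge(u, v):
        return frozenset((u, v)) in edge_set

    if components(members, is_edge) > 1:
        return "parallel"   # induced subgraph is disconnected: alternatives (XOR)
    if components(members, lambda u, v: not is_edge(u, v)) > 1:
        return "series"     # complement is disconnected: combinable partners (OR)
    return "prime"

# Hypothetical toy complexes around a common partner A
alternative = [("A", "B"), ("A", "C")]               # B and C do not interact
combined = [("A", "B"), ("A", "C"), ("B", "C")]      # B and C interact
path4 = [("a", "b"), ("b", "c"), ("c", "d")]         # a path on 4 vertices

print(module_type(alternative, ["B", "C"]))
print(module_type(combined, ["B", "C"]))
print(module_type(path4, ["a", "b", "c", "d"]))
```

In the first toy network B and C are alternative partners of A, in the second they can be combined freely, and the 4-vertex path is the smallest graph that is neither.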

Interpretation for PCP protein interaction networks




Three examples of modular

decomposition of protein-protein

interaction networks. In each case

from top to bottom: schema of

complexes, the corresponding

protein-protein interaction network as

determined from PCP experiments,

and its modular decomposition

(MOD).

(a) Protein phosphatase 2A. Parallel

modules group proteins that do not

interact but are functionally

equivalent. Here these are the

catalytic Pph21 and Pph22 (module

2) and the regulatory Cdc55 and

Rts1 (module 3).

Back to the real world …





RNA polymerases I, II and III

A good layout of the corresponding network

gives an intuitive idea of what the constitutive

units of the complexes are. Modular

decomposition extracts them and makes their

logical combinations explicit.



Summary


Ongoing: need for modular description of molecular biology.

What are suitable modules?

Spirin&Mirny, Barabasi et al. : identify dense parts of the network

Alon and co-workers: identify (small) repeated motifs

Here: apply established method of modular graph decomposition

to protein interaction networks. It can be (and has been) applied to other networks.

What is the biological relevance of modules at different levels?

Integrate with gene ontology?



V11 – modules in cellular networks – wrap up

traditional biology (reductionist approach) produces long lists:

lists of genes in genomes

lists of transcripts in different cell types

lists of protein interactions in model organisms

genomes, transcriptomes, proteomes, interactomes,

databases of genetic perturbations, and corresponding phenotypes

How to make sense of it all?

Will meaningful hypotheses and discoveries emerge?

Systems biology: formalized mathematical modeling, simulations, quantitative measurements.

Still room for reductionism: test hypotheses from systems biology experiments.




Strategies to detect communities in networks

„Community“ stands for module, class, group, cluster, ...

Define community as a subset of nodes within the graph such that connections

between the nodes are denser than connections with the rest of the network.

The detection of community structure is generally intended as a procedure for

mapping the network into a tree („dendrogram“ in social sciences).

Radicchi et al. PNAS 101, 2658 (2004)

Leaves: nodes; branches join nodes or (at higher level) groups of nodes.



Agglomerative algorithms for mapping to tree

Traditional method to perform this mapping: hierarchical clustering.

For every pair i,j of nodes in the network compute weight Wij that measures how

closely connected the vertices are.

Starting from the set of all nodes and no edges,

links are iteratively added between pairs of

nodes in order of decreasing weight.

In this way nodes are grouped into larger and larger

communities, and the tree is built up to the root,

which represents the whole network.

„agglomerative“ algorithm

Girvan, Newman, PNAS 99, 7821 (2002); Radicchi et al. PNAS 101, 2658 (2004)

Here: 3 communities of densely connected vertices (circles with solid lines) with a much lower density of connections (gray lines) between them.
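The agglomerative procedure described above can be sketched with a union-find structure. This is only an illustration under stated assumptions: the weights W_ij are given as a dictionary, ties are broken arbitrarily, and the names `agglomerate`, `toy_nodes` and `toy_weights` are hypothetical.

```python
def agglomerate(nodes, weights):
    """Add links in order of decreasing weight and return the merge history.

    `weights` maps a pair (i, j) to W_ij. Each merge joins two communities,
    so the history read from first to last is the tree from leaves to root.
    """
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    members = {v: [v] for v in nodes}
    merges = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                        # already in the same community
        parent[rj] = ri
        members[ri] = members[ri] + members.pop(rj)
        merges.append((w, sorted(members[ri])))
    return merges

# Hypothetical weights: two tight pairs, weakly linked to each other
toy_nodes = ["a", "b", "c", "d"]
toy_weights = {("a", "b"): 5, ("c", "d"): 4, ("b", "c"): 1, ("a", "d"): 0.5}
for weight, community in agglomerate(toy_nodes, toy_weights):
    print(weight, community)
```

The last merge contains all nodes and corresponds to the root of the tree.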



Possible definitions of the weights

(1) number of node-independent paths between vertices

2 paths that connect the same pair of vertices are said to be node-independent if

they share none of the same vertices other than their initial and final vertices.

(2) edge-independent paths.

It has been shown that the number of node-independent (edge-independent) paths

between 2 vertices i and j in a graph is equal to the minimum number of vertices

(edges) that must be removed from the graph to disconnect i and j from one

another (Menger, 1927).

These numbers are a measure of the robustness of the network to deletion of

nodes (edges).

Girvan, Newman, PNAS 99, 7821 (2002)
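By Menger's theorem, the number of edge-independent paths between two vertices equals a maximum flow with unit edge capacities. The sketch below counts them with breadth-first augmenting paths; it is one standard way to compute the weight, not necessarily the one used in the cited paper, and the toy graph is hypothetical.

```python
from collections import defaultdict, deque

def edge_independent_paths(edges, s, t):
    """Number of edge-independent paths between s and t (unit-capacity max flow)."""
    cap = defaultdict(int)
    adj = defaultdict(set)
    for u, v in edges:
        cap[(u, v)] += 1          # an undirected edge can carry one path
        cap[(v, u)] += 1          # in either direction
        adj[u].add(v)
        adj[v].add(u)
    flow = 0
    while True:
        prev = {s: None}
        queue = deque([s])
        while queue and t not in prev:      # BFS in the residual graph
            u = queue.popleft()
            for v in adj[u]:
                if v not in prev and cap[(u, v)] > 0:
                    prev[v] = u
                    queue.append(v)
        if t not in prev:
            return flow                      # no augmenting path left
        v = t
        while prev[v] is not None:           # push one unit along the path
            u = prev[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1

# Hypothetical toy graph: a square a-b-c-d plus a peripheral vertex e attached to a
square = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "e")]
print(edge_independent_paths(square, "a", "c"))
print(edge_independent_paths(square, "e", "c"))
```

The peripheral vertex e gets a lower count to every other vertex than the members of the square get among themselves, because all its paths must use its single edge.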



Possible definitions of the weights (II)

(3) count total number of paths that run between them (not just those that are

node- or edge-independent).

Because the number of paths between any 2 vertices is either 0 or infinite, one

typically weights paths of length l by a factor α^l with small α so that the weighted

count of paths converges.

Thus long paths contribute exponentially less weight than short paths.
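The weighted count can be written in closed form. A sketch, assuming A denotes the adjacency matrix of the graph and α the weighting factor (this notation is introduced here for illustration): the entry (A^l)_ij counts the paths of length l between i and j, so the weights form a matrix geometric series.

```latex
W \;=\; \sum_{l=1}^{\infty} (\alpha A)^{l} \;=\; (I - \alpha A)^{-1} - I ,
\qquad \text{which converges for } \alpha < 1/\lambda_{\max}(A),
```

where λ_max(A) is the largest eigenvalue of A. A path of length l + 1 is therefore down-weighted by one further factor α relative to a path of length l.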

These node- or edge-dependent path definitions for weights work okay for certain

community structures, but show typical pathologies.




Problems

In particular, both counting of node- and edge-independent paths has a tendency

to separate single peripheral vertices from the communities to which they should

rightly belong.

If a vertex is, e.g., connected to the rest of a network by only a single edge then, to

the extent that it belongs to any community, it should clearly be considered to

belong to the community at the other end of that edge.

Unfortunately, both the numbers of independent paths and the weighted path

counts for such vertices are small and hence single nodes often remain isolated

from the network when the communities are constructed.

This and other pathologies make the hierarchical clustering method, although

useful, far from perfect.




New strategy: Use “betweenness” as definition of weights

Focus on those edges that are least central, that are „between“ communities.

Define edge betweenness of an edge as the number of shortest paths between

pairs of vertices that run along it.

If there is more than one shortest path between a pair of vertices, each path is

given equal weight such that the total weight of all of the paths is 1.

If a network contains communities or groups that are only loosely connected by a

few intergroup edges, then all shortest paths between different communities must

go along one of these few edges.

Thus, the edges connecting communities will have high edge betweenness.

By removing these edges we separate groups from one another and so reveal the

underlying community structure of the graph.




GN Algorithm

1. Calculate betweenness for all m edges in a graph of n vertices

(can be done in O(mn) time).

2. Remove the edge with the highest betweenness.

3. Recalculate betweenness for all edges affected by the removal.

4. Repeat from step 2 until no edges remain.

Because step 3 has to be done for all edges, the algorithm runs in worst-case time

O(m²n).
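The four steps can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: edge betweenness is computed with a breadth-first, Brandes-style accumulation, and for simplicity it is recomputed for all edges after every removal rather than only for the affected ones. The function names and the toy network are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness of an unweighted, undirected graph {node: set(neighbours)}."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:                          # BFS: count shortest paths from s
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0.0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in order}
        for v in reversed(order):             # split path weight among equal paths
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1.0 + delta[v])
                score[frozenset((u, v))] += share
                delta[u] += share
    return {e: b / 2.0 for e, b in score.items()}   # every pair was counted twice

def components(adj):
    """Connected components as a sorted list of sorted node lists."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(sorted(comp))
    return sorted(comps)

def girvan_newman(adj):
    """Yield the partition each time an edge removal splits a community."""
    adj = {u: set(vs) for u, vs in adj.items()}      # work on a copy
    parts = components(adj)
    while any(adj.values()):                         # step 4: until no edges remain
        bet = edge_betweenness(adj)                  # steps 1 and 3
        u, v = max(bet, key=bet.get)                 # step 2: highest betweenness
        adj[u].discard(v)
        adj[v].discard(u)
        new_parts = components(adj)
        if len(new_parts) > len(parts):
            parts = new_parts
            yield parts

# Hypothetical toy network: two triangles joined by the bridge 3-4
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(next(girvan_newman(toy)))
```

In the toy network all shortest paths between the two triangles run over the bridge, so it is removed first and the two triangles appear as the first split.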






Application of Girvan & Newman Algorithm

(a) The friendship network from Zachary's karate club study. Nodes associated with the club administrator's faction are drawn as circles, those associated with the instructor's faction are drawn as squares. (b) Hierarchical tree showing the complete community structure for the network calculated by using the GN algorithm. The initial split of the network into two groups is in agreement with the actual factions observed by Zachary, with the exception that node 3 is misclassified. (c) Hierarchical tree calculated by using edge-independent path counts, which fails to extract the known community structure of the network.



Divisive algorithms for mapping to tree

The tree is constructed in the reverse order compared with agglomerative algorithms:

start with the whole graph and iteratively cut the edges

divide network progressively into smaller and smaller disconnected subnetworks

identified as the communities.

Crucial point: how to select those edges to be cut.

Example: Girvan & Newman algorithm (GN)

Problem of GN algorithm: requires the repeated evaluation of a global property, the

betweenness, for each edge whose value depends on the properties of the whole

system.

This becomes computationally very expensive for networks with e.g. 10,000 nodes.




Faster algorithm

Introduce divisive algorithm that only requires the consideration of local quantities.

Need: quantity that can single out edges connecting nodes belonging to different

communities.

Consider edge-clustering coefficient:

number of triangles to which a given edge belongs divided by the number of

triangles that might potentially include it, given the degrees of the adjacent

nodes.

For the edge connecting node i to node j, the edge-clustering coefficient is

$$C_{i,j}^{(3)} = \frac{z_{i,j}^{(3)} + 1}{\min[(k_i - 1), (k_j - 1)]}$$

where $z_{i,j}^{(3)}$ is the number of triangles built on that edge and

$\min[(k_i - 1), (k_j - 1)]$ is the maximal possible number of them.

1 is added to $z_{i,j}^{(3)}$ to remove degeneracy for $z_{i,j}^{(3)} = 0$.
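The coefficient only needs the neighbor sets of the two end nodes, which is what makes it a local quantity. A minimal sketch follows; the function name and the toy network are hypothetical, and returning infinity for edges whose denominator is zero is an assumption made here so that such edges are never selected for removal.

```python
def edge_clustering(adj, i, j):
    """Edge-clustering coefficient (z + 1) / min(k_i - 1, k_j - 1) of edge (i, j)."""
    triangles = len(adj[i] & adj[j])          # each common neighbour closes a triangle
    possible = min(len(adj[i]) - 1, len(adj[j]) - 1)
    if possible == 0:
        return float("inf")                   # assumption: edge to a degree-1 node
    return (triangles + 1) / possible

# Hypothetical toy network: two triangles joined by the bridge 3-4
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
for i, j in [(1, 2), (1, 3), (3, 4)]:
    print((i, j), edge_clustering(toy, i, j))
```

The bridge between the two triangles belongs to no triangle and receives the smallest value, so a divisive algorithm based on this coefficient would cut it first.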



Faster algorithm

Edges connecting nodes in different communities are included in few or no

triangles and tend to have small values of Ci,j(3).

On the other hand, many triangles exist within clusters.

By considering higher order cycles one can define coefficients of order g


$$C_{i,j}^{(g)} = \frac{z_{i,j}^{(g)} + 1}{s_{i,j}^{(g)}}$$

where zi,j(g) is the number of cyclic structures of order g the edge (i,j) belongs to,

and si,j(g) is the number of possible cyclic structures of order g that can be built

given the degrees of the nodes.

Define, for every g, a detection algorithm that works exactly as the GN method

with the difference that, at every step, the removed edges are those with the

smallest value of Ci,j(g).

By considering increasing values of g, one can smoothly interpolate between a

local and a nonlocal algorithm.



Comparison with GN algorithm

Plot of the dendrograms for the network of college football teams, obtained by

using the GN algorithm (Left) and the algorithm of Radicchi et al. with g = 4 (Right).

Different symbols denote teams belonging to different conferences.

In both cases, the observed communities perfectly correspond to the conferences,

with the exception of the six members of the „Independent conference“, which are

misclassified.




Simple network clustering based on shortest-path distance

Aim: compute modular organization of cellular networks controlling specific

biological responses.

Ideas:

(i) the shortest path between any two vertices (proteins) is probably the most

relevant for functional associations;

(ii) each vertex in a network has a unique profile of shortest-path distances through

the network to every other vertex

(iii) module comembers are likely to have similar (clustered) shortest-path-distance

profiles.

Rives & Galitski PNAS 100, 1128 (2003)



Network clustering

Yeast PI network; 4079 proteins, 6761 protein interactions.

MIPS: 133 signaling proteins, 64 have ≥ 1 interaction with another signaling

protein.

Algorithm: assign length 1 to each edge in protein interaction network.

Compute all-pairs shortest-path distance matrix: contains length of the shortest

path (distance) d between every pair of vertices in the network.

Convert into „association matrix“ using 1/d².

Associations range from 0 to 1.

Emphasizes local association in subsequent clustering.

Use hierarchical agglomerative average-linkage clustering.
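The first two steps (all-pairs shortest-path distances with unit edge lengths, then conversion to associations 1/d²) can be sketched as follows. The subsequent average-linkage clustering is not shown. Setting the self-association to 1 and the association of unconnected pairs to 0 is an assumption made here so that all values lie between 0 and 1; the function name and toy network are hypothetical.

```python
from collections import deque

def association_matrix(adj):
    """Map every ordered pair (s, t) to 1 / d(s, t)**2 using BFS distances."""
    nodes = sorted(adj)
    assoc = {}
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:                       # BFS gives shortest paths for unit edges
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for t in nodes:
            if t == s:
                assoc[(s, t)] = 1.0        # assumption: a protein is fully associated with itself
            elif t in dist:
                assoc[(s, t)] = 1.0 / dist[t] ** 2
            else:
                assoc[(s, t)] = 0.0        # assumption: no path means no association
    return assoc

# Hypothetical toy network: a chain a-b-c and an unconnected protein d
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
assoc = association_matrix(chain)
print(assoc[("a", "b")], assoc[("a", "c")], assoc[("a", "d")])
```

Because of the square in the denominator, direct neighbors dominate each protein's profile, which is the local emphasis mentioned above.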


Q: Based on this simple measure, construct an algorithm that identifies the proteins belonging to a biochemical pathway in a protein-protein interaction network.



Clustering of yeast signaling protein interaction network

A symmetrical matrix of 64 proteins of the

MIPS-database signaling category was

clustered identically in both dimensions. The

cluster tree is not shown. Each row or

column represents a protein. Each feature is

the intersection of two proteins and is a

grayscale representation of pairwise protein

association.

Columns to the right of the clustered network

represent MIPS-defined signaling pathways

[P, polarity-PKC; R, Ras; H, HOG; M,

mating/filamentation MAPK (mfMAPK)].

White bars in the MIPS-pathway columns

indicate protein members of the pathway.

Ras-pathway proteins form a single

cluster.

3 MAPK pathways as clusters.


Q: The diagram above was obtained by applying a simple measure of the distance between two proteins in an interaction network. What do you expect for the proteins of a biochemical pathway?



Network clustering of high-throughput data sets

HTS-Data usually has high (50%) false-positive error frequencies!

Also, many binary interactions may not occur within modules.

Because interacting proteins usually localize in the same subcellular compartment

one may integrate interaction and localization data for the identification of modules.

Single proteins with many interactions in Y2H screens (hubs) nucleate large

clusters that are not modules.




examples of derived clusters

Clustering of the yeast nuclear-protein network

derived from high-throughput interaction and

localization data.

(A) Examples of clusters representing module

rudiments are labeled. The cluster tree is not

shown. Arrows indicate high-connectivity hub

proteins.

(B) Example clusters are shown in detail.

Cluster comembers participating in some

common structure or function have large bold

labels.




Properties of hubs

All hub proteins indicated bind > 90 proteins in the global Y2H network.

The proteins bound by these hubs are randomly distributed in cellular

compartments.

The nuclear-localized proteins bound by these hubs form the 4 largest clusters.

Proteins bound by high-connectivity hubs will have few or no interactions among

themselves if they are not functionally associated („hub-and-spokes“ structure).

Hence, proteins bound by each high-connectivity hub are not functionally associated with

each other, and their clusters do not represent modules.




connectivity neighborhood clustering

Global protein connectivity versus

neighborhood clustering. Each

protein in the global protein network

is plotted by its connectivity,

k, and its neighborhood clustering,

C. Arrows indicate high-connectivity

proteins shown in Fig. 2A.


The 4 high-connectivity hubs are among 15 outliers. Although these proteins have

exceedingly high connectivity, they almost completely lack neighborhood clustering.

A useful criterion to distinguish modules from nonmodules?
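The two plotted quantities can be computed per protein as sketched below, assuming that neighborhood clustering C is the usual clustering coefficient (fraction of pairs of neighbors that interact with each other). The function name and the toy network are hypothetical.

```python
def connectivity_and_clustering(adj):
    """Return {protein: (k, C)} with k = degree and C = clustering coefficient."""
    result = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        # count interacting pairs among the neighbours of v (each pair once)
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        possible = k * (k - 1) / 2
        result[v] = (k, links / possible if possible else 0.0)
    return result

# Hypothetical toy network: a hub with four unrelated partners, and a triangle
net = {
    "hub": {"s1", "s2", "s3", "s4"},
    "s1": {"hub"}, "s2": {"hub"}, "s3": {"hub"}, "s4": {"hub"},
    "t1": {"t2", "t3"}, "t2": {"t1", "t3"}, "t3": {"t1", "t2"},
}
stats = connectivity_and_clustering(net)
print(stats["hub"], stats["t1"])
```

The hub combines the highest connectivity with zero neighborhood clustering (the hub-and-spokes pattern), whereas the triangle members have low connectivity but C = 1.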



Application to biological-response networks

Incorporate network clustering into a 3-step process to study complex biomolecular

systems; this generates a modular network-structure model:

(i) compile known and suspected components of the response network (from

databases, expression profiling, proteomics, genetic screens, metabolite profiles ...)

(ii) cluster network based on interactions between vertices. Edges can represent

any type of interaction.

(iii) abstract modular network-structure model showing modules.

Cluster 90 filamentation-network proteins that have ≥ 1 interaction with other

filamentation proteins.




Clustering of the yeast filamentation network

Proteins of the yeast

filamentation network were

clustered. A tree-depth

threshold was set.

Tree branches with ≥ 3 leaves

(clusters with ≥ 3 proteins)

below the tree threshold are

shown.

Bullets and large bold labels

indicate proteins of highest

intracluster connectivity.




Modular model of the yeast filamentation network

Clusters indicated in Fig. 4 are

abstracted as modules. All intermodule

paths in the filamentation network are

indicated as black lines with the

interacting proteins at the termini.

A gray line connecting the Ras and

protein kinase A modules was added to

indicate a connection mediated by the

small molecule cAMP.




Biological Insights from modular network abstraction

(1) In an integrated network, data on molecules and interactions shows clustered

organization that can be identified quantitatively

(2) Cluster co-member genes show significant coordination of expression change,

as expected for genes involved in a collective function.

(3) Cluster co-member genes show significant overrepresentation of biological-

process annotations, indicating collective function.

(4) The modular network abstraction intuitively stimulates testable biological

insights on complex biological properties.

Prinz et al. Genome Research 14, 380 (2004)



Evolutionary conservation of motif constituents in the yeast protein interaction network

Wuchty, Oltvai, Barabasi, Nature Gen 35, 176 (2003)

Question: why are some cellular components conserved across species

but others evolve rapidly?

Many biological functions are carried out by the integrated activity of highly

interacting cellular components = functional modules

Motifs = topologically distinct interaction patterns within complex networks

may represent the simplest building blocks of modules.

Here, test the correlation between a protein‘s evolutionary rate and the

structure of the motif it is embedded in

identify all 2-, 3-, 4-node motifs and some 5-node motifs



shared components

Data from DIP database,

3183 interacting yeast proteins

if there is evolutionary pressure to

maintain specific motifs, their

components should be evolutionarily

conserved and have identifiable

orthologs in other organisms.

Study conservation of 678 S. cerevisiae

proteins with an ortholog in each of 5

higher eukaryotes:

Arabidopsis thaliana, C. elegans,

Drosophila melanogaster, Mus

musculus, Homo sapiens.


Algorithm to detect all

n-node subgraphs:

scan all rows of the adjacency

matrix M. For each non-zero

element (i,j) representing a link,

scan through all neighbors of

(i,j) until a specific n-node

subgraph is detected.
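For n = 3 the scan can be sketched in a few lines. This is a simplified illustration that distinguishes only the two connected 3-node subgraphs (the linear chain and the triangle) and uses neighbor sets instead of an adjacency matrix; it is not the implementation of the cited study, and the names and toy network are hypothetical.

```python
from itertools import combinations

def three_node_motifs(adj):
    """Count linear 3-node chains and triangles in an undirected graph."""
    chains, triangles = 0, set()
    for centre, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                triangles.add(frozenset((centre, a, b)))   # seen from 3 centres, stored once
            else:
                chains += 1          # a - centre - b with a and b not linked
    return chains, len(triangles)

# Hypothetical toy network: a triangle 1-2-3 with one extra protein 4 bound to 3
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(three_node_motifs(toy))
```

Larger motifs would be found the same way by extending each detected subgraph with further neighbors, at a rapidly growing cost.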



shared components

#motifs of a given kind in the yeast PI

network

fraction of original yeast motifs that is

evolutionarily fully conserved: each of

their protein components belongs to

678 orthologous proteins

fraction of motifs that is fully conserved

for the random ortholog distribution

column 4 / column 5

less than 5% of #2 (linear 3-component

proteins) are completely maintained


47% of the fully connected pentagons

(#11) are fully conserved!



topology conservation of individual proteins

Larger motifs tend to

be conserved as a

whole, where each

component has an

ortholog.


E.g. less than 1% of the fully connected pentagon motifs disappeared completely,

for 69% of them, each of the subunits had an ortholog in human.

Clear correlation between the conservation rate and the degree of saturation of

a motif.

Participation in motifs substantially influences the evolutionary conservation of

specific components.



From 65% (C = 0) to 84% (C = 1) of neighbors of a human ortholog were also

human orthologs (filled circles). The conserved fraction of the nonorthologous

protein‘s neighborhood is markedly smaller.

Enrichment = ratio between the percentages of orthologous proteins at distance d

from an ortholog in the natural and the random orthologous sets.

d: shortest distance between i and target protein measured along network links.

Proteins that interact directly with an ortholog at d=1 have a 50% higher chance of

conservation than at random!


clustering coefficient conservation of proteins ?



Examine if the specific function of the yeast proteins within motifs affects their rate

of evolutionary conservation.

Assign each motif to functional class to which its protein components belong.

Larger motifs have a notable functional homogeneity:

- for 95% of fully connected yeast pentagon motifs (#11) all components shared at

least one common functional class,

- only 10% of the 2-node motifs (#1) are functionally conserved.

Identify type and number of evolutionarily fully conserved motifs of each functional

class in S. cerevisiae, for those that have an ortholog in humans.


function conservation?



shared components

For 3 functional classes

(subcellular localization, protein

fate, transcription) each of the 11

studied motifs is considerably

overrepresented.

Some other functional classes

have only 1 or 2 characteristic

motifs.

No motifs are found for:

transposable elements, energy,

cellular fate, cellular communication,

cellular rescue, cellular

organization, metabolism,

protein activity, protein binding

Wuchty, Oltvai, Barabasi, Nature Gen 35, 176 (2003)




The fully connected motifs (#9 and #11) tend to identify protein complexes.

However, the mere existence of protein complexes cannot explain the observed

trends towards higher conservation rates of the highly connected motifs.




shared components

Shared components = proteins or groups of proteins occurring in different

complexes are fairly common:

A shared component may be a small part of many complexes, acting as a unit that

is constantly reused for its function.

Also, it may be the main part of the complex e.g. in a family of variant complexes

that differ from each other by distinct proteins that provide functional specificity.

Aim: identify and properly represent the modularity of protein-protein interaction

networks by identifying the shared components and the way they are arranged to

generate complexes.




Summary

Modules are a key intermediate level in the organizational hierarchy of cells.

Biological Module: loose association of preferred molecular interaction partners

that interact to perform a collective function.

Modules can be identified based on structural characteristics such as their closely

connected members and interfaces to other modules.

There is evidence that modules are evolutionarily conserved.

Module co-members tend to be coordinately expressed.



V12: Reliability of Protein Interaction Networks

Direct comparison of different data sets

Bayesian Network approach



High-throughput methods for detecting protein interactions

Yeast two-hybrid assay. Pairs of proteins to be tested for interaction are expressed as fusion proteins ('hybrids') in yeast: one protein is fused to a DNA-binding domain, the other to a transcriptional activator domain. Any interaction between them is detected by the formation of a functional transcription factor. Benefits: it is an in vivo technique; transient and unstable interactions can be detected; it is independent of endogenous protein expression; and it has fine resolution, enabling interaction mapping within proteins. Drawbacks: only two proteins are tested at a time (no cooperative binding); it takes place in the nucleus, so many proteins are not in their native compartment; and it predicts possible interactions, but is unrelated to the physiological setting.

Mass spectrometry of purified complexes. Individual proteins are tagged and used as 'hooks' to biochemically purify whole protein complexes. These are then separated and their components identified by mass spectrometry. Two protocols exist: tandem affinity purification (TAP), and high-throughput mass-spectrometric protein complex identification (HMS-PCI). Benefits: several members of a complex can be tagged, giving an internal check for consistency; and it detects real complexes in physiological settings. Drawbacks: it might miss some complexes that are not present under the given conditions; tagging may disturb complex formation; and loosely associated components may be washed off during purification.

Correlated mRNA expression (synexpression). mRNA levels are systematically measured under a variety of different cellular conditions, and genes are grouped if they show a similar transcriptional response to these conditions. These groups are enriched in genes encoding physically interacting proteins. Benefits: it is an in vivo technique, albeit an indirect one; and it has much broader coverage of cellular conditions than other methods. Drawbacks: it is a powerful method for discriminating cell states or disease outcomes, but is a relatively inaccurate predictor of direct physical interaction; and it is very sensitive to parameter choices and clustering methods during analysis.

Von Mering et al. Nature 417, 399 (2002)



High-throughput methods for detecting protein interactions

Genetic interactions (synthetic lethality). Two nonessential genes that cause lethality when mutated at

the same time form a synthetic lethal interaction. Such genes are often functionally associated and their

encoded proteins may also interact physically. This type of genetic interaction is currently being studied in

an all-versus-all approach in yeast. Benefits: it is an in vivo technique, albeit an indirect one; and it is

amenable to unbiased genome-wide screens.

In silico predictions through genome analysis. Whole genomes can be screened for three types of

interaction evidence: (1) in prokaryotic genomes, interacting proteins are often encoded by

conserved operons; (2) interacting proteins have a tendency to be either present or absent

together from fully sequenced genomes, that is, to have a similar 'phylogenetic profile'; and (3)

seemingly unrelated proteins are sometimes found fused into one polypeptide chain. This is an

indication for a physical interaction. Benefits: fast and inexpensive in silico techniques; and coverage

expands as more genomes are sequenced. Drawbacks: it requires a framework for assigning orthology

between proteins, failing where orthology relationships are not clear; and so far it has focused mainly on

prokaryotes.

Von Mering et al. Nature 417, 399 (2002)

Q: Describe 3 in silico methods for predicting protein-protein interactions from genomic data.



Data set

Experiment:

Uetz et al. 957 interactions

Ito et al. 4549 interactions

HMS-PCI 33014 interactions

In silico:

Conserved gene neighborhood 6387 interactions

Gene fusions 358 interactions

Co-occurrence of genes 997 interactions




Counting interactions

Various high-throughput methods

give differing results on the same

complex.

>80,000 interactions available

for yeast.

Only 2,400 are supported by

more than 1 method.


Possible explanations?

- Methods may not have reached saturation

- Many of the methods produce a significant fraction of false positives

- Some methods may have difficulties for certain types of interactions



Protein interactions between functional categories

Each technique produces a unique distribution of interactions with respect to functional

categories: the methods have specific strengths and weaknesses.

E.g. TAP and HMS-PCI predict few interactions for proteins involved in transport and sensing

because these categories are enriched with membrane proteins.

E.g. Y2H detects few proteins involved in translation.




Complementarity between data sets

Glycine decarboxylase

- Multienzyme complex needed when Gly is used as 1-carbon source.

- Its key components GCV1, GCV2, GCV3 are only induced when there is excess glycine and folate levels are low. This may explain why the complex is not detected in experiments.

However, 3 components can be detected by several independent in silico methods:

- Gene neighborhood of all 3 components in 7 diverged species

- Genes show very similar phylogenetic distribution

- Microarrays: genes are closely co-regulated.


Opposite example: PPH3 protein

Complex found in 4 independent purifications,

but no in silico method predicts interaction.

Q: Interpret the scheme shown above. Which experimental method is best?

A: Different methods measure different properties of the interactions. From this scheme alone one cannot decide which is best. What would be a good test for this?



Quantitative comparison of interaction data sets

The various data sets are benchmarked

against a reference set of 10,907 trusted

interactions, which are derived from protein

complexes annotated manually at MIPS and

YPD databases.

Coverage and accuracy are lower limits

owing to incompleteness of the reference

set. Each dot in the graph represents an

entire interaction data set.

For the combined evidence, consider only

interactions supported by an agreement of

two (or three) of any of the methods shown.
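The benchmark can be sketched as a pair of set operations. It is assumed here that coverage means the fraction of the reference set recovered by a data set and accuracy the fraction of a data set confirmed by the reference set; the function names and all protein pairs below are hypothetical toy data, not values from the study.

```python
def benchmark(predicted, reference):
    """Return (coverage, accuracy) of a predicted interaction set."""
    predicted = {frozenset(p) for p in predicted}    # A-B equals B-A
    reference = {frozenset(p) for p in reference}
    hits = predicted & reference
    coverage = len(hits) / len(reference)    # share of trusted interactions found
    accuracy = len(hits) / len(predicted)    # share of predictions that are trusted
    return coverage, accuracy

def supported_by(datasets, minimum=2):
    """Interactions reported by at least `minimum` of the given data sets."""
    counts = {}
    for data in datasets:
        for pair in {frozenset(p) for p in data}:
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, n in counts.items() if n >= minimum}

# Hypothetical toy data
reference = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
method_1 = [("A", "B"), ("X", "Y")]
method_2 = [("B", "A"), ("B", "C"), ("C", "D"), ("X", "Z")]

print(benchmark(method_1, reference))
print(benchmark(method_2, reference))
print(benchmark(supported_by([method_1, method_2]), reference))
```

In the toy data, requiring agreement of two methods raises accuracy at the price of coverage, which is the trade-off the combined-evidence points illustrate.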


Q: Do you expect that the confirmation of a protein-protein interaction by several independent experiments strengthens its significance? Describe a very suitable method for describing this combination.

A: Bayesian network.



Biases in interaction coverage

Experiment:

Uetz et al. 957 interactions

Ito et al. 4549 interactions

HMS-PCI 33014 interactions

In silico:

Conserved gene neighborhood 6387 interactions

Gene fusions 358 interactions

Co-occurrence of genes 997 interactions

None of the methods covers more than 60% of the proteins in the yeast genome.

Are there common biases as to which proteins are covered?




Bias 1 towards proteins of high abundance

mRNA abundance is a rough measure of protein

abundance.

Here, divide yeast genome into 10 mRNA

abundance classes (bins) of equal size.

For each data set and abundance class, the

number of interactions is recorded having at least

one protein in that class. Each interaction (A–B) is

counted twice: once under the abundance class

of partner A, and once under the abundance

class of partner B.

Most data sets are heavily biased towards

proteins of high abundance except for genetic

techniques (Y2H and synthetic lethality)





Bias 2 towards cellular localization

Independent quality measure:

Do interacting proteins belong to the same

compartment?

Y2H method gives relatively poor results

here.



Outlook

How many protein-protein interactions can be expected in yeast?

Overlap of high-throughput data is 20 times larger than expected by chance. Good signal-to-noise ratio.

Also, for interactions discovered ≥ 2 times, usually both partners have the same

functional category and cellular localization.

Overlap mainly consists of „true positives“.

Less than 1/3 of new interactions in overlap set were previously known.

Given ~10,000 currently known interactions, predict > 30,000 protein interactions in yeast (lower bound).




Problems

Jansen et al. Science 302, 449 (2003)

Unfortunately, interaction data sets are often incomplete and contradictory (von

Mering et al. 2002).

In the context of genome-wide analyses, these inaccuracies are greatly magnified

because the protein pairs that do not interact (negatives) by far outnumber those

that do interact (positives).

E.g. in yeast, the ~6000 proteins allow for N (N-1) / 2 ~ 18 million potential interactions. But the estimated number of actual interactions is < 100,000.

Therefore, even reliable techniques can generate many false positives when

applied genome-wide.

Think of a diagnostic with a 1% false-positive rate for a rare disease occurring in

0.1% of the population. This would roughly produce 1 true positive for every 10

false ones.
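The arithmetic behind this diagnostic example can be checked with a short sketch. It assumes that the test detects every true case (sensitivity 1), which the slide does not state; the function name is purely illustrative.

```python
def true_per_false_positive(prevalence, false_positive_rate, sensitivity=1.0):
    """Expected number of true positives per false positive in a screened population."""
    true_positives = prevalence * sensitivity                    # fraction correctly flagged
    false_positives = (1.0 - prevalence) * false_positive_rate   # healthy but flagged
    return true_positives / false_positives


# Rare disease (0.1% of the population), test with a 1% false-positive rate.
ratio = true_per_false_positive(prevalence=0.001, false_positive_rate=0.01)
print(f"true positives per false positive: {ratio:.2f}")  # roughly 1 true per 10 false
```

The same base-rate effect applies to genome-wide interaction screens, where non-interacting pairs vastly outnumber interacting ones.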



Integrative Approach (very important!)


One would like to integrate evidence from many different sources to improve the discrimination between true and false protein-protein interaction predictions.

Here, use Bayesian approach for integrating interaction information that allows for

the probabilistic combination of multiple data sets; apply to yeast.

Input: Approach can be used for combining noisy genomic interaction data sets.

Normalization: Each source of evidence for interactions is compared against

samples of known positives and negatives (“gold-standard”).

Output: predict for every possible protein pair likelihood of interaction.

Verification: test on experimental interaction data not included in the gold-standard + new TAP (tandem affinity purification) experiments.



Integration of various information sources


3 different types of data used:

(i) Interaction data from high-throughput experiments. These comprise large-scale two-hybrid screens (Y2H) and in vivo pull-down experiments.

(ii) Other genomic features: expression data, biological function of proteins (from Gene Ontology biological process and the MIPS functional catalog), and data about whether proteins are essential.

(iii) Gold-standards of known interactions and noninteracting protein pairs.



Combination of data sets into probabilistic interactomes

The 4 interaction data sets from HT experiments were combined into 1 PIE (probabilistic interactome, experimental).

The PIE represents a transformation of the individual binary-valued interaction sets into a data set where every protein pair is weighed according to the likelihood that it exists in a complex.

Because the 4 experimental interaction data sets contain correlated evidence, a fully connected Bayesian network is used.

A „naïve” Bayesian network is used to model the PIP (probabilistic interactome, predicted) data. These information sets hardly overlap.



Bayesian Networks

Bayesian networks are probabilistic models that graphically encode probabilistic

dependencies between random variables.

[Figure: network with a node Y and directed arcs from Y to E1, E2 and E3]

Bayesian networks also include a quantitative measure of dependency. For each

variable and its parents this measure is defined using a conditional probability

function or a table.

Here, one such measure is the probability Pr(E1|Y).

A directed arc between variables

Y and E1 denotes conditional

dependency of E1 on Y, as

determined by the direction of

the arc.



Bayesian Networks

Together, the graphical structure and the conditional probability functions/tables

completely specify a Bayesian network probabilistic model.

[Figure: network with a node Y and directed arcs from Y to E1, E2 and E3]

This model, in turn, specifies a particular factorization of the joint probability distribution function over the variables in the network.

Here, Pr(Y,E1,E2,E3) = Pr(E1|Y) Pr(E2|Y) Pr(E3|Y) Pr(Y)
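The factorization Pr(Y,E1,E2,E3) = Pr(E1|Y) Pr(E2|Y) Pr(E3|Y) Pr(Y) can be made concrete with a minimal sketch. All probability values below are hypothetical and chosen only for illustration; they are not taken from the lecture or from Jansen et al.

```python
from itertools import product

# Hypothetical tables: Pr(Y) and, for each evidence node, Pr(E_i = True | Y).
p_y = {True: 0.1, False: 0.9}
p_e_given_y = [
    {True: 0.8, False: 0.2},   # E1
    {True: 0.6, False: 0.1},   # E2
    {True: 0.7, False: 0.3},   # E3
]


def joint(y, evidence):
    """Pr(Y=y, E1, E2, E3) as the product Pr(Y) * prod_i Pr(E_i | Y)."""
    p = p_y[y]
    for e, table in zip(evidence, p_e_given_y):
        p *= table[y] if e else 1.0 - table[y]
    return p


def posterior_y(evidence):
    """Pr(Y=True | E1, E2, E3), obtained by normalizing the joint distribution."""
    numerator = joint(True, evidence)
    return numerator / (numerator + joint(False, evidence))


# The joint distribution must sum to 1 over all 2 * 2^3 assignments.
total = sum(joint(y, ev) for y in (True, False)
            for ev in product((True, False), repeat=3))
print(round(total, 12), round(posterior_y((True, True, True)), 3))
```

Only one small table per arc is needed, instead of a full table over all 16 joint assignments.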



Gold-Standard


should be

(i) independent from the data sources serving as evidence

(ii) sufficiently large for reliable statistics

(iii) free of systematic bias (e.g. towards certain types of interactions).

Positives: use MIPS (Munich Information Center for Protein Sequences, HW

Mewes) complexes catalog: hand-curated list of complexes (8250 protein pairs that

are within the same complex) from biomedical literature.

Negatives:

- harder to define

- essential for successful training

Assume that proteins in different compartments do not interact.

Synthesize “negatives” from lists of proteins in separate subcellular compartments.



Measure of reliability: likelihood ratio


Consider a genomic feature f expressed in binary terms (i.e. „absent“ or „present“).

Likelihood ratio L(f) is defined as:

$$ L(f) = \frac{\text{fraction of gold-standard positives having feature } f}{\text{fraction of gold-standard negatives having feature } f} $$

L(f) = 1 means that the feature has no predictability: the same fraction of positives and negatives have feature f.

The larger L(f) the better its predictability.



Combination of features


For two features f1 and f2 with uncorrelated evidence,

the likelihood ratio of the combined evidence is simply the product:

L(f1,f2) = L(f1) L(f2)

For correlated evidence L(f1,f2) cannot be factorized in this way.

Bayesian networks are a formal representation of such relationships between

features.

The combined likelihood ratio is proportional to the estimated odds that two

proteins are in the same complex, given multiple sources of information.
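As a minimal sketch of these two definitions, the code below computes L(f) from gold-standard counts and multiplies the ratios of uncorrelated features. The counts are hypothetical toy numbers and the function names are illustrative only.

```python
from math import prod


def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L(f): fraction of positives with feature f over fraction of negatives with f."""
    return (pos_with_f / pos_total) / (neg_with_f / neg_total)


def combined_likelihood_ratio(ratios):
    """Valid only for uncorrelated (conditionally independent) features."""
    return prod(ratios)


# Hypothetical gold standard: 100 positives and 1000 negatives.
l1 = likelihood_ratio(pos_with_f=50, pos_total=100, neg_with_f=100, neg_total=1000)
l2 = likelihood_ratio(pos_with_f=30, pos_total=100, neg_with_f=100, neg_total=1000)
print(round(l1, 6), round(l2, 6), round(combined_likelihood_ratio([l1, l2]), 6))
```

For correlated features the product is not valid; the likelihood ratio then has to be estimated for each joint combination of feature values.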



Prior and posterior odds

„positive“ : a pair of proteins that are in the same complex. Given the number of positives among the total number of protein pairs, the „prior“ odds of finding a positive are:

$$ O_{prior} = \frac{P(pos)}{P(neg)} = \frac{P(pos)}{1 - P(pos)} $$

„posterior“ odds: odds of finding a positive after considering N datasets with values f1 ... fN :

$$ O_{post} = \frac{P(pos \mid f_1 \ldots f_N)}{P(neg \mid f_1 \ldots f_N)} $$

The terms „prior“ and „posterior“ refer to the situation before and after knowing the

information in the N datasets.




Static naive Bayesian Networks

In the case of protein-protein interaction data, the posterior odds describe the

odds of having a protein-protein interaction given that we have the information from

the N experiments,

whereas the prior odds are related to the chance of randomly finding a protein-

protein interaction when no experimental data is known.

If Opost > 1, the chances of having an interaction are


higher than having no interaction.



Static naive Bayesian Networks

The likelihood ratio L defined as

$$ L(f_1 \ldots f_N) = \frac{P(f_1 \ldots f_N \mid pos)}{P(f_1 \ldots f_N \mid neg)} $$

relates prior and posterior odds according to Bayes‘ rule:

$$ O_{post} = L(f_1 \ldots f_N) \, O_{prior} $$

In the special case that the N features are conditionally independent (i.e. they provide uncorrelated evidence) the Bayesian network is a so-called „naïve” network, and L can be simplified to:

$$ L(f_1 \ldots f_N) = \prod_{i=1}^{N} L(f_i) = \prod_{i=1}^{N} \frac{P(f_i \mid pos)}{P(f_i \mid neg)} $$




Computation of prior and posterior odds

L can be computed from contingency tables relating positive and negative examples with the N features (by binning the feature values f1 ... fN into discrete intervals) – wait for examples.

Determining the prior odds Oprior is somewhat arbitrary in that it requires an assumption about the number of positives.

Jansen et al. believe that 30,000 is a conservative lower bound for the number of positives (i.e. pairs of proteins that are in the same complex).

Considering that there are ca. 18 million = 0.5 * N (N – 1) possible protein pairs in total (with N = 6000 for yeast),

$$ O_{prior} \approx \frac{3 \cdot 10^4}{18 \cdot 10^6} = \frac{1}{600} $$

Opost > 1 can be achieved with L > 600.
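The numbers in this estimate (30,000 positives, N = 6000 proteins) can be plugged into a short sketch. The function names are illustrative; the exact odds P(pos)/(1 - P(pos)) are used, which differ only marginally from the rounded value 1/600.

```python
def prior_odds(n_positives, n_pairs):
    """O_prior = P(pos) / (1 - P(pos))."""
    p_pos = n_positives / n_pairs
    return p_pos / (1.0 - p_pos)


def posterior_odds(likelihood_ratio, o_prior):
    """Bayes' rule in odds form: O_post = L * O_prior."""
    return likelihood_ratio * o_prior


n_proteins = 6000
n_pairs = n_proteins * (n_proteins - 1) // 2       # about 18 million pairs
o_prior = prior_odds(30_000, n_pairs)

print(f"O_prior is about 1/{1 / o_prior:.0f}")
for l_value in (100, 600, 1000):
    o_post = posterior_odds(l_value, o_prior)
    # Convert odds to a probability: P = O / (1 + O).
    print(l_value, round(o_post, 3), round(o_post / (1 + o_post), 3))
```

A pair therefore needs a combined likelihood ratio of roughly 600 before an interaction becomes more likely than not.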



Essentiality (PIP)

Consider whether proteins are essential or non-essential = does a deletion mutant

where this protein is knocked out from the genome have the same phenotype?


It should be more likely that both of 2 proteins in a complex are essential or non-

essential, but not a mixture of these two attributes.

Deletion mutants of either one protein should impair the function of the same

complex.



Parameters of the naïve Bayesian Networks (PIP)

Column 1 describes the genomic feature. In the „essentiality data“ protein pairs can take on 3 discrete

values (EE: both essential; NN: both non-essential; NE: one essential and one not).


Column 2 gives the number of protein pairs with a particular feature (i.e. „EE“) drawn from the whole yeast

interactome (~18M pairs).

Columns „pos“ and „neg“ give the overlap of these pairs with the 8,250 gold-standard positives and the

2,708,746 gold-standard negatives.

Columns „sum(pos)“ and „sum(neg)“ show how many gold-standard positives (negatives) are among the

protein pairs with likelihood ratio L, computed by summing up the values in the „pos“ (or „neg“) column.

P(feature value|pos) and P(feature value|neg) give the conditional probabilities of the feature values – and

L, the ratio of these two conditional probabilities.

Example (essentiality data):

$$ L = \frac{1114 / 2150}{81924 / 573724} = \frac{0.518}{0.143} \approx 3.6 $$



mRNA expression data

Proteins in the same complex tend to have correlated expression profiles.

Although large differences can exist between the mRNA and protein abundance, protein abundance can

be indirectly and quite crudely measured by the presence or absence of the corresponding mRNA

transcript.


Experimental data source:

- time course of expression fluctuations during the yeast cell cycle

- Rosetta compendium: expression profiles of 300 deletion mutants and cells under

chemical treatments.

Problem: both data sets are strongly correlated.

Compute first principal component of the vector of the 2 correlations.

Use this as independent source of evidence for the P-P interaction prediction.

The first principal component is a stronger predictor of P-P interactions than either of the 2 expression correlation datasets by themselves.



mRNA expression data

The values for mRNA expression correlation (first principal component) range on a

continuous scale from -1.0 to +1.0 (fully anticorrelated to fully correlated).

This range was binned into 19 intervals.




PIP – Functional similarity

Quantify functional similarity between two proteins:


- consider which set of functional classes two proteins share, given either the MIPS or Gene

Ontology (GO) classification system.

- Then count how many of the ~18 million protein pairs in yeast share the exact same

functional classes as well (yielding integer counts between 1 and ~ 18 million). It was binned

into 5 intervals.

- In general, the smaller this count, the more similar and specific is the functional description

of the two proteins.



PIP – Functional similarity

Observation: low counts correlate with a higher chance of two proteins being in

the same complex. But signal (L) is quite weak.




Calculation of the fully connected Bayesian network (PIE)

The 4 binary experimental interaction datasets can be combined in at most 2^4 = 16

different ways (subsets). For each of these 16 subsets, one can compute a

likelihood ratio from the overlap with the gold-standard positives („pos“) and

negatives („neg“).

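Example: for a subset that overlaps with 26 of the 8,250 gold-standard positives and 2 of the 2,708,746 gold-standard negatives,

$$ L = \frac{26 / 8250}{2 / 2708746} $$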
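The fully connected network amounts to a lookup table with one likelihood ratio per joint presence/absence pattern. The sketch below illustrates this with toy data; the dataset labels G, H, U, I follow the slides, while the pattern lists and the function name are hypothetical.

```python
from collections import Counter
from itertools import product

DATASETS = ("G", "H", "U", "I")   # the 4 binary experimental data sets


def pattern_likelihood_ratios(pos_patterns, neg_patterns):
    """One likelihood ratio per presence/absence pattern across all data sets.

    Each pattern is a tuple of 0/1 flags, one per data set. Patterns that do not
    occur in both gold-standard sets get None (ratio undefined or unbounded).
    """
    pos_counts, neg_counts = Counter(pos_patterns), Counter(neg_patterns)
    ratios = {}
    for pattern in product((0, 1), repeat=len(DATASETS)):
        p, n = pos_counts[pattern], neg_counts[pattern]
        if p == 0 or n == 0:
            ratios[pattern] = None
        else:
            ratios[pattern] = (p / len(pos_patterns)) / (n / len(neg_patterns))
    return ratios


# Toy gold standard: 10 positive pairs and 100 negative pairs.
positives = [(1, 1, 0, 0)] * 3 + [(0, 0, 0, 0)] * 7
negatives = [(1, 1, 0, 0)] * 1 + [(0, 0, 0, 0)] * 99
ratios = pattern_likelihood_ratios(positives, negatives)
print(len(ratios), ratios[(1, 1, 0, 0)])
```

Because every pattern gets its own ratio, no independence between the experiments has to be assumed. The price is that each of the 16 cells needs enough gold-standard pairs for a stable estimate.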




Distribution of likelihood ratios

Number of protein pairs in the individual datasets and the probabilistic interactomes

as a function of the likelihood ratio.

There are many more protein pairs with high

likelihood ratios in the probabilistic interactomes

(PIE) than in the individual datasets G,H,U,I.

Protein pairs with high likelihood ratios provide

leads for further experimental investigation of

proteins that potentially form complexes.





Overview

PIP and PIE are separately tested against the

gold-standard.



PIP vs. the information sources

Ratio of true to false positives (TP/FP) increases

monotonically with Lcut, confirming L as an

appropriate measure of the odds of a real

interaction.

The ratio is computed as:

$$ \frac{TP(L_{cut})}{FP(L_{cut})} = \frac{\sum_{L > L_{cut}} pos(L)}{\sum_{L > L_{cut}} neg(L)} $$

Protein pairs with Lcut > 600 have a > 50% chance of being in the same complex.

Jansen et al. Science 302, 449 (2003)
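A minimal sketch of the TP/FP computation, assuming the sums run over all likelihood-ratio bins above the cutoff. The bins (L, pos, neg) are hypothetical toy values, not the data of Jansen et al.

```python
def tp_fp_ratio(bins, l_cut):
    """TP/FP for a cutoff: summed gold-standard positives over summed negatives.

    bins: iterable of (L, pos, neg) tuples, one per likelihood-ratio bin.
    Returns None when no gold-standard negative lies above the cutoff.
    """
    tp = sum(pos for l_value, pos, neg in bins if l_value > l_cut)
    fp = sum(neg for l_value, pos, neg in bins if l_value > l_cut)
    return tp / fp if fp else None


# Toy bins: (likelihood ratio, gold-standard positives, gold-standard negatives).
bins = [(0.5, 10, 1000), (5, 20, 200), (50, 30, 30), (700, 40, 4)]
curve = [tp_fp_ratio(bins, cut) for cut in (0, 1, 10, 600)]
print(curve)
```

With these toy bins the ratio rises as the cutoff is raised, which is the behaviour the slide reports for the real data.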



PIE vs. the information sources

9897 interactions are predicted from PIP and

163 from PIE.

In contrast, likelihood ratios derived from single

genomic factors (e.g. mRNA coexpression) or

from individual interaction experiments (e.g. the

Ho data set) did not exceed the cutoff when used

alone.

This demonstrates that information sources that,

taken alone, are only weak predictors of

interactions can yield reliable predictions when

combined.




parts of PIP graph

To test whether the thresholded PIP was biased toward certain complexes, compare the distribution of predictions among gold-standard positives.

(A ) The complete set of gold-

standard positives and their overlap

with the PIP. The PIP (green) covers

27% of the gold-standard positives

(yellow).

The predicted complexes are roughly equally apportioned among the different complexes → no bias.

Jansen et al. Science 302, 449 (2003)



V13 Prediction of Phylogenies based on single genes

Material of this lecture taken from

- chapter 6, DW Mount „Bioinformatics“

and from Joseph Felsenstein‘s book.

A phylogenetic analysis of a family of related

nucleic acid or protein sequences is a determination

of how the family might have been derived during

evolution.

Placing the sequences as outer branches on a tree,

the evolutionary relationships among the sequences

are depicted.

Phylogenies, or evolutionary trees, are the basic structures to describe

differences between species, and to analyze them statistically.

They have been around for over 140 years.

Statistical, computational, and algorithmic work on them is ca. 40 years old.



Methods for Single-Gene Phylogeny

[Flow chart:]

(1) Choose set of related sequences.

(2) Obtain multiple sequence alignment.

(3) Is there strong sequence similarity? Yes: maximum parsimony methods. No: go to (4).

(4) Is there clearly recognizable sequence similarity? Yes: distance methods. No: maximum likelihood methods.

(5) Analyze how well data support prediction.

Q: Fill into the diagram which of the 3 phylogeny methods covered in the lecture is best suited in each case. Briefly explain why.



Parsimony methods (heavily abridged)

Edwards & Cavalli-Sforza (1963): that evolutionary tree is to be preferred that

involves „the minimum net amount of evolution“.

→ seek that phylogeny on which, when we reconstruct the evolutionary

events leading to our data, there are as few events as possible.

(1) We must be able to make a reconstruction of events, involving as few events

as possible, for any proposed phylogeny.

(2) We must be able to search among all possible phylogenies for the one or

ones that minimize the number of events.



Counting evolutionary changes

2 related dynamic programming algorithms: Fitch (1971) and Sankoff (1975)

- evaluate a phylogeny character by character

- for each character, consider it as rooted tree, placing the root wherever seems

appropriate.

- update some information down a tree; when we reach the bottom, the number of

changes of state is available.

Do not actually locate changes or reconstruct interior states at the nodes of the tree.



Fitch algorithm

intended to count the number of changes in a bifurcating tree with nucleotide

sequence data, in which any one of the 4 bases (A, C, G, T) can change to any

other.

At the particular site, we have observed the bases C, A, C, A and G in the 5 species.

Give them in the order in which they appear in the tree, left to right.



Fitch algorithm

For the left two, at the node that is their immediate common ancestor, attempt to construct the intersection of the two sets.

But as {C} ∩ {A} = ∅, instead construct the union {C} ∪ {A} = {AC} and count 1 change of state.

For the rightmost pair of species, assign common ancestor as {AG}, since {A} ∩ {G} = ∅, and count another change of state.

.... proceed to bottom

Total number of changes = 3. Algorithm works on arbitrarily large trees.
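The set operations of the Fitch algorithm fit into a few lines. This is a sketch, not the lecture's implementation: trees are written as nested tuples, and the topology ((C,A),(C,(A,G))) is an assumption, since the slide only gives the order of the tips and the figure is not reproduced here.

```python
def fitch(tree):
    """Return (state set, number of changes) for one site on a bifurcating tree.

    A tip is a base such as "A"; an interior node is a pair (left, right).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_set, left_changes = fitch(tree[0])
    right_set, right_changes = fitch(tree[1])
    common = left_set & right_set
    if common:
        # Non-empty intersection: no additional change is needed at this node.
        return common, left_changes + right_changes
    # Empty intersection: take the union and count one change of state.
    return left_set | right_set, left_changes + right_changes + 1


# Tips C, A, C, A, G from left to right; the topology is assumed.
tree = (("C", "A"), ("C", ("A", "G")))
root_set, changes = fitch(tree)
print(sorted(root_set), changes)
```

Under the assumed topology the count agrees with the total of 3 changes stated on the slide.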

Q: Briefly describe the Fitch algorithm and fill in the tree shown above.



Sankoff algorithm

Fitch algorithm is very effective – but it is not obvious why it works.

Sankoff algorithm: more complex, but its structure is more apparent.

Assume that we have a table of the cost of changes cij between each character state

i and each other state j.

Compute the total cost of the most parsimonious combinations of events by

computing it for each character.

For a given character, compute for each node k in the tree a quantity Sk(i).

This is interpreted as the minimal cost, given that node k is assigned state i,

of all the events upwards from node k in the tree.



Sankoff algorithm

If we can compute these values for all nodes, we can also compute them for the bottom node in the tree.

Simply choose the minimum of these values

$$ S = \min_i S_0(i) $$

which is the desired total cost we seek, the minimum cost of evolution for this character.

At the tips of the tree, the S(i) are easy to compute. The cost is 0 if the observed state is state i, and infinite otherwise.

If we have observed an ambiguous state, the cost is 0 for all states that it could be, and infinite for the rest.

Now we just need an algorithm to calculate the S(i) for the immediate common ancestor of two nodes.



Sankoff algorithm

Suppose that the two descendant nodes are called l and r (for „left“ and „right“).

For their immediate common ancestor, node a, we compute

$$ S_a(i) = \min_j \left[ c_{ij} + S_l(j) \right] + \min_k \left[ c_{ik} + S_r(k) \right] $$

The smallest possible cost given that node a is in state i is the cost cij of going from state i to state j in the left descendant lineage, plus the cost Sl(j) of events further up in the subtree given that node l is in state j. Select the value of j that minimizes that sum.

The same calculation for the right descendant lineage gives a second minimum. The sum of these two minima is the smallest possible cost for the subtree above node a, given that node a is in state i.

Apply equation successively to each node in the tree, working downwards.

Finally compute all S0(i) and use previous eq. to find minimum cost for whole tree.



Sankoff algorithm

The array (6,6,7,8) at the bottom of the tree has a minimum value of 6

= minimum total cost of the tree for this site.
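The recursion can be sketched as follows. Two things are assumptions, because the figure is not reproduced in these notes: the topology ((C,A),(C,(A,G))) and the cost matrix (transitions A<->G and C<->T cost 1, transversions cost 2.5, as in Felsenstein's textbook example). With these choices the root array matches the (6,6,7,8) quoted on the slide.

```python
INF = float("inf")
STATES = "ACGT"


def cost(i, j):
    """Assumed cost matrix: 0 for no change, 1 for transitions, 2.5 for transversions."""
    if i == j:
        return 0.0
    return 1.0 if {i, j} in ({"A", "G"}, {"C", "T"}) else 2.5


def sankoff(tree):
    """Return the list [S(A), S(C), S(G), S(T)] for the root of `tree`."""
    if isinstance(tree, str):
        # Tip: cost 0 for the observed state, infinite for all others.
        return [0.0 if s == tree else INF for s in STATES]
    left, right = (sankoff(child) for child in tree)
    # S_a(i) = min_j [c_ij + S_l(j)] + min_k [c_ik + S_r(k)]
    return [
        min(cost(i, j) + left[a] for a, j in enumerate(STATES))
        + min(cost(i, k) + right[b] for b, k in enumerate(STATES))
        for i in STATES
    ]


tree = (("C", "A"), ("C", ("A", "G")))   # assumed topology
root = sankoff(tree)
print(root, min(root))
```

Setting all off-diagonal costs to 1 reduces the minimum to the change count of the Fitch algorithm.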

Q: Briefly describe the Sankoff algorithm and enter into the tree shown on the left the values that result from a modified cost matrix.



Finding the best tree by heuristic search

The obvious method for searching for the most parsimonious tree is to consider ALL

trees and evaluate each one.

Unfortunately, generally the number of possible trees is too large.

→ use heuristic search methods that attempt to find the best trees without looking at all possible trees.

(1) Make an initial estimate of the tree and make small rearrangements of it

= find „neighboring“ trees.

(2) If any of these neighbors are better, consider them and continue search.



Resolve Incongruences in Phylogeny

There are many possible reasons that make it difficult to decide how to handle conflicts in larger sets of molecular data.

E.g. two genes with different evolutionary history (e.g. owing to hybridization or

horizontal transfer) will necessarily give incongruent pictures while still depicting

true histories.

Here: compare genome sequence data for 7 Saccharomyces yeast species:

S. cerevisiae

S. paradoxus

S. mikatae

S. kudriavzevii

S. bayanus

S. castellii

S. kluyveri

plus one outgroup fungus Candida albicans.

Rokas et al. Nature 425, 798 (2003)



Bootstrap analysis

A method for testing how well a particular data set fits a model.

E.g. the validity of the branch arrangement in a predicted phylogenetic tree can be tested by resampling columns in a multiple sequence alignment to create many new alignments.

The appearance of a particular branch in trees generated from these resampled sequences can then be measured.

Alternatively, a sequence may be left out of an analysis to determine how much the sequence influences the results of an analysis.

Here: swap individual nucleotide sites or positions of genes (bootstrap replicas).

Q: Explain the basic principle of the bootstrap method.
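The resampling step can be sketched as follows. Columns are drawn with replacement, and `has_branch` is a hypothetical callback standing in for a full tree reconstruction; the toy predicate below only compares Hamming distances and is not a phylogeny method.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw alignment columns with replacement to form one bootstrap replicate."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(alignment, has_branch, n_replicates=100, seed=0):
    """Fraction of replicates in which has_branch(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(bool(has_branch(bootstrap_alignment(alignment, rng)))
               for _ in range(n_replicates))
    return hits / n_replicates


def first_two_are_closest(aln):
    """Toy stand-in for 'sequences 0 and 1 form a branch': smallest Hamming distance."""
    def dist(x, y):
        return sum(a != b for a, b in zip(x, y))
    return dist(aln[0], aln[1]) < min(dist(aln[0], aln[2]), dist(aln[1], aln[2]))


alignment = ["ACGTACGTAC", "ACGTACGTTC", "TCGAACCTAG"]   # made-up sequences
support = bootstrap_support(alignment, first_two_are_closest)
print(f"bootstrap support: {support:.2f}")
```

In a real analysis the callback would rebuild the tree (parsimony, distance or likelihood) for each replicate and check whether the branch of interest is present.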



Alternative Tree topologies

Single-gene data sets generate multiple, robustly supported alternative topologies.

Representative alternative trees recovered from analyses of nucleotide data of 106

selected single genes and six commonly used genes are shown. The trees are the

50% majority-rule consensus trees from the genes YBL091C (a), YDL031W (b),

YER005W (c), YGL001C (d), YNL155W (e) and YOL097C (f).

These 6 genes were selected without consideration of their function. Maybe

commonly used, well known genes of important functions provide a better resolution?




The alternative phylogenies could have resulted from a number of different

scenarios:

(1) most genes could have weakly supported most phylogenies and strongly

supported only a few alternative trees,

(2) most genes could have strongly supported one phylogeny and a few genes

strongly supported only a small number of alternatives,

(3) there could have been some combinations of these scenarios so that each

branch among alternative phylogenies had either weak or strong support

depending on the gene.

To distinguish between these possibilities, identify all branches recovered during

single-gene analyses, record each bootstrap value with respect to the gene and

method of analysis.

8 branches were shared by all three analyses with multiple instances of

bootstrap values > 50%.

Explanations?




Concatenation of single genes gives a single tree!

Phylogenetic analyses of the

concatenated data set composed

of 106 genes yield maximum

support for a single tree,

irrespective of method and type of

character evaluated. Numbers

above branches indicate bootstrap

values (ML on nucleotides/MP on

nucleotides/MP on amino acids).

All alternative topologies were rejected.

This level of support for a single tree with 5 internal branches is unprecedented.

This tree can now be referred to as species tree.




Convergence on single tree

A minimum of 20 genes is required to recover >95% bootstrap values for each

branch of the species tree. a, b, The bootstrap values for branches 3 (a) and 5 (b)

were constructed from the concatenation of randomly re-sampled orthologous

nucleotides (left) or random subsets of genes (right).

The species tree is recovered with robust support (>95% bootstrap values in all

branches at 95% confidence interval) by analyses of a minimum of 20

concatenated genes. All analyses were performed using MP.

branch 3

branch 5




Independent evolution?

It has been suggested that nucleotides within a given gene do not evolve

independently.

Re-sample subset of orthologous nucleotides from the total data set.

Only 8000 randomly chosen nucleotide positions (corresponding to less than three

concatenated genes) are sufficient to generate single tree with > 95% confidence.

This indicates that nucleotides in genes have not evolved independently (because

when using complete genes more than 20 genes are necessary to generate single

tree).


Q: Give a structural explanation why different evolutionary rates are observed at different positions of a gene. How can it then be explained that a consistent tree can be obtained from 8000 randomly selected nucleotide positions of aligned genomes?



Implications for resolution of phylogenies

Unreliability of single-gene data sets stems from the fact that each gene is

shaped by a unique set of functional constraints through evolution.

Phylogenetic algorithms are sensitive to such constraints.

Such problems can be avoided with genome-wide sampling of independently

evolving genes.

In other cases the amount of sequence information needed to resolve specific

relationships will be dependent on the particular phylogenetic history under

examination.

Branches depicting speciation events separated by long time intervals may be

resolved with a smaller amount of data, and those depicting speciation events

separated by shorter intervals may be much harder to resolve.

Rokas et al. Nature 425, 798 (2003)

Q: What is the advantage of using several proteins for phylogenetic comparisons between organisms?



Summary

Robust strategies exist for phylogenies built on single-gene comparisons

(maximum parsimony, distance, maximum likelihood).

Problem of incongruence of phylogenies derived from individual genes.

Can be resolved by integrative analysis of multiple (here > 20) genes.

It is desirable to combine results from phylogenies constructed from local

sequence information with trees constructed from genome rearrangement.

The power of genome rearrangement studies is the construction of ancestral

genomes. Then one can derive the speed of evolution at different times, dissect

mutation biases at different times from the influence of genomic context ...

and possibly derive the driving forces of biological evolution.



V14: Phylogeny (II)

Distance matrix methods

Least squares

(leave out problematic UPGMA method)

Neighbor-joining

Maximum likelihood

An early "Universal Tree of Life" deduced from

ribosomal RNA (rRNA) data. The study upon which

this figure was based did not resolve the branching

of the three kingdoms most familiar to all of us:

plants, Fungi and animals. Subsequent analyses,

however, have revealed that the biochemistry of

fungi (in particular, the synthesis of chitin) is most

similar to animals. Thus, counter-intuitively, plants

are likely to have diverged first, leaving fungi and

animals as sister groups.

http://www.palaeos.com/Systematics/Cladistics/molecular.html

http://www.palaeos.com/Fungi/default.htm



Distance matrix methods

introduced by Cavalli-Sforza & Edwards (1967)

and by Fitch & Margoliash (1967)

general idea „seems as if it would not work very well“ (Felsenstein):

- calculate a measure of the distance between each pair of species

- find a tree that predicts the observed set of distances as closely as

possible.

All information from higher-order combinations of character states is left out.

But computer simulation studies show that the amount of lost information is

remarkably small.

Best way to think about distance matrix methods:

consider distances as estimates of the branch length separating that pair of

species.



Least square method

- observed table (matrix) of distances Dij

- any particular tree leads to a predicted set of distances dij.



Least square method

Measure of the discrepancy between the observed and expected distances:

$$ Q = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \left( D_{ij} - d_{ij} \right)^2 $$

where the weights wij can be differently defined:

- wij = 1 (Cavalli-Sforza & Edwards, 1967)

- wij = 1/Dij^2 (Fitch & Margoliash, 1967)

- wij = 1/Dij (Beyer et al., 1974)

Aim: Find tree topology and branch lengths that minimize Q.

Equation above is quadratic in branch lengths.

Take derivative with respect to branch lengths, set = 0,

and solve system of linear equations. Solution will minimize Q.

Doug Brutlag‘s course



Least square method

Number species in alphabetical order.

The expected distance between species A and D is d14 = v1 + v7 + v4.

The expected distance between species B and E is d25 = v5 + v6 + v7 + v2.

[Figure: tree connecting the 5 species, with branches labelled v1 ... v7]



Least square method

Number all branches of the tree and introduce an indicator variable xijk:

xijk = 1 if branch k lies in the path from species i to species j

xijk = 0 otherwise.

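The expected distance between i and j will then be

$$ d_{ij} = \sum_k x_{ijk} v_k $$

and

$$ Q = \sum_{i=1}^{n} \sum_{j: j \neq i} w_{ij} \left( D_{ij} - \sum_k x_{ijk} v_k \right)^2 $$

Setting the derivative with respect to each branch length vk to zero gives

$$ \frac{dQ}{dv_k} = -2 \sum_{i=1}^{n} \sum_{j: j \neq i} w_{ij} \, x_{ijk} \left( D_{ij} - \sum_{k'} x_{ijk'} v_{k'} \right) = 0 $$

Note: this is one equation for each of the k branches.

For the case with wij = 1 for all i, j, these equations read: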



Least square method

DAB + DAC + DAD + DAE = 4v1 + v2 + v3 + v4 + v5 + 2v6 + 2v7

DAB + DBC + DBD + DBE = v1 + 4v2 + v3 + v4 + v5 + 2v6 + 3v7

DAC + DBC + DCD + DCE = v1 + v2 + 4v3 + v4 + v5 + 3v6 + 2v7

DAD + DBD + DCD + DDE = v1 + v2 + v3 + 4v4 + v5 + 2v6 + 3v7

DAE + DBE + DCE + DDE = v1 + v2 + v3 + v4 + 4v5 + 3v6 + 2v7

DAC + DAE + DBC + DBE + DCD + DDE = 2v1 + 2v2 + 3v3 + 2v4 + 3v5 + 6v6 + 4v7

DAB + DAD + DBC + DCD + DBE + DDE = 2v1 + 3v2 + 2v3 + 3v4 + 2v5 + 4v6 + 6v7

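Stack up the (4 + 3 + 2 + 1 = 10) Dij, in alphabetical order, into a vector d, and arrange the coefficients xijk in a matrix X, with each row corresponding to the Dij in the same row of d and containing a 1 if branch k occurs on the path between species i and j:

$$
d = \begin{pmatrix} D_{AB} \\ D_{AC} \\ D_{AD} \\ D_{AE} \\ D_{BC} \\ D_{BD} \\ D_{BE} \\ D_{CD} \\ D_{CE} \\ D_{DE} \end{pmatrix},
\qquad
X = \begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 & 1 & 1 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1
\end{pmatrix}
$$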



Least square method

If we also stack up the 7 vi into a vector v, the previous set of linear equations can be compactly expressed as:

$$ X^T d = X^T X \, v $$

Multiplied from the left by the inverse of XTX one can solve for the least squares branch lengths

$$ v = \left( X^T X \right)^{-1} X^T d $$

This is a standard method of expressing least squares problems in matrix notation and solving them.

check for example :-)
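The matrix form can be checked numerically with a small sketch in pure Python. The rows of X follow the pairs AB, AC, ..., DE for the example tree; the branch lengths `true_v` are hypothetical. Instead of forming the inverse explicitly, the normal equations are solved by Gauss-Jordan elimination with exact fractions, which gives the same solution.

```python
from fractions import Fraction

# Rows: AB, AC, AD, AE, BC, BD, BE, CD, CE, DE; columns: branches v1 ... v7.
X = [
    [1, 1, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 1],
]


def normal_equations(x, d):
    """Return (X^T X, X^T d)."""
    k = len(x[0])
    xtx = [[sum(row[a] * row[b] for row in x) for b in range(k)] for a in range(k)]
    xtd = [sum(row[a] * di for row, di in zip(x, d)) for a in range(k)]
    return xtx, xtd


def solve(a, b):
    """Solve a v = b by Gauss-Jordan elimination with exact fractions."""
    n = len(b)
    m = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[-1] for row in m]


true_v = [1, 2, 3, 4, 5, 6, 7]                 # hypothetical branch lengths
d = [sum(x * v for x, v in zip(row, true_v)) for row in X]   # exact tree distances
xtx, xtd = normal_equations(X, d)
v = solve(xtx, xtd)
print(xtx[0], [int(x) for x in v])
```

The first row of X^T X reproduces the coefficients 4, 1, 1, 1, 1, 2, 2 of the first normal equation. Because the distances here are generated from a tree without noise, the least squares solution returns the branch lengths that were put in.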



Least square method

When we have weighted least squares, with a diagonal matrix of weights in the same order as the Dij:

$$ W = \mathrm{diag}\left( w_{AB}, w_{AC}, w_{AD}, w_{AE}, w_{BC}, w_{BD}, w_{BE}, w_{CD}, w_{CE}, w_{DE} \right) $$

then the least square equations can be written

$$ X^T W d = X^T W X \, v $$

and their solution

$$ v = \left( X^T W X \right)^{-1} X^T W d $$



Finding the least squares tree topology

Now that we are able to assign branch lengths to each tree topology, we need to search among tree topologies.

This can be done by the same methods of heuristic search that were presented for

the Maximum Parsimony method.

Note: no-one has so far presented a branch-and-bound method for finding the least

squares tree exactly. Day (1986) has shown that this problem is NP-complete.

The search is not only among tree topologies, but also among branch lengths.



neighbor-joining method

introduced by Saitou and Nei (1987) – algorithm works by clustering - does not

assume a molecular clock but approximates the „minimum evolution“ model.

„Minimum evolution“ model:

among possible tree topologies, choose the one with minimal total branch length.

Neighbor-joining, as the least-squares method, is guaranteed to recover the true

tree if the distance matrix is an exact reflection of the tree.



neighbor-joining method

(1) For each tip, compute

$$ u_i = \sum_{j: j \neq i}^{n} \frac{D_{ij}}{n - 2} $$

(2) Choose the i and j for which Dij – ui – uj is smallest.

(3) Join items i and j. Compute the branch length from i to the new node (vi) and from j to the new node (vj) as

$$ v_i = \frac{1}{2} D_{ij} + \frac{1}{2} \left( u_i - u_j \right), \qquad v_j = \frac{1}{2} D_{ij} + \frac{1}{2} \left( u_j - u_i \right) $$

(4) Compute distance between the new node (ij) and each of the remaining tips as

$$ D_{(ij),k} = \frac{D_{ik} + D_{jk} - D_{ij}}{2} $$

(5) Delete tips i and j from the tables and replace them by the new node, (ij), which is now treated as a tip.

(6) If more than 2 nodes remain, go back to step (1). Otherwise, connect the two remaining nodes (e.g. l and m) by a branch of length Dlm.
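Steps (1) to (6) can be sketched as follows. The 4-taxon distance matrix is a hypothetical example generated from a tree with branch lengths A:1, B:2, C:4, D:5 and an interior branch of length 3, so the result can be checked by hand; the names and data structures are illustrative only.

```python
def neighbor_joining(names, dist):
    """Return a list of (node, neighbour, branch_length) following steps (1)-(6).

    names: tip labels; dist: dict of dicts with symmetric distances.
    """
    d = {a: dict(dist[a]) for a in names}       # working copy of the table
    nodes = list(names)
    branches = []
    while len(nodes) > 2:
        n = len(nodes)
        # (1) u_i = sum over j != i of D_ij / (n - 2)
        u = {i: sum(d[i][j] for j in nodes if j != i) / (n - 2) for i in nodes}
        # (2) pair with the smallest D_ij - u_i - u_j (first pair wins ties)
        pairs = ((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:])
        i, j = min(pairs, key=lambda p: d[p[0]][p[1]] - u[p[0]] - u[p[1]])
        # (3) branch lengths from i and j to the new node
        vi = 0.5 * d[i][j] + 0.5 * (u[i] - u[j])
        vj = 0.5 * d[i][j] + 0.5 * (u[j] - u[i])
        new = f"({i},{j})"
        branches += [(i, new, vi), (j, new, vj)]
        # (4) distances from the new node to all remaining nodes
        d[new] = {}
        for k in nodes:
            if k not in (i, j):
                d[new][k] = d[k][new] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        # (5) replace i and j by the new node
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    # (6) connect the last two nodes
    l, m = nodes
    branches.append((l, m, d[l][m]))
    return branches


names = ["A", "B", "C", "D"]
pair_distances = {("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 9,
                  ("B", "C"): 9, ("B", "D"): 10, ("C", "D"): 9}
D = {a: {} for a in names}
for (a, b), value in pair_distances.items():
    D[a][b] = D[b][a] = value

branches = neighbor_joining(names, D)
for branch in branches:
    print(branch)
```

For this exactly tree-like matrix the procedure returns the branch lengths from which the distances were generated.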



limitation of distance methods

Distance matrix methods are the easiest phylogeny method to program,

and they are very fast.

Distance methods have problems when the evolutionary rates vary largely.

One can correct for this in distance methods as well as in likelihood methods.

When variation of rates is large, these corrections become important.

In likelihood methods, the correction can use information from changes in one part

of the tree to inform the correction in others.

Once a particular part of the molecule is seen to change rapidly in the primates, this

will affect the interpretation of that part of the molecule among the rodents as well.

But a distance matrix method is inherently incapable of propagating the information

in this way. Once one is looking at changes within rodents, it will forget where

changes were seen among primates.



Maximum Likelihood

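For any 2 hypotheses H1 and H2 about a set of data D, since

$\mathrm{Prob}(H \mid D) = \frac{\mathrm{Prob}(H \text{ and } D)}{\mathrm{Prob}(D)} = \frac{\mathrm{Prob}(D \mid H)\,\mathrm{Prob}(H)}{\mathrm{Prob}(D)}$

we have

$\frac{\mathrm{Prob}(H_1 \mid D)}{\mathrm{Prob}(H_2 \mid D)} = \frac{\mathrm{Prob}(D \mid H_1)}{\mathrm{Prob}(D \mid H_2)} \cdot \frac{\mathrm{Prob}(H_1)}{\mathrm{Prob}(H_2)}$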

This expresses the „odds“ ratio in favor of hypothesis 1 over hypothesis 2 as a

product of two terms.

The first is the ratio of the probabilities of the data given the 2 hypotheses.

The second is the ratio of the prior probabilities of the 2 hypotheses before we

look at the data.



Maximum Likelihood

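If we have independent observations $D^{(1)}, D^{(2)}, \ldots, D^{(n)}$, then

$\mathrm{Prob}(D \mid H_i) = \mathrm{Prob}(D^{(1)} \mid H_i)\,\mathrm{Prob}(D^{(2)} \mid H_i) \cdots \mathrm{Prob}(D^{(n)} \mid H_i)$

It follows that

$\frac{\mathrm{Prob}(H_1 \mid D)}{\mathrm{Prob}(H_2 \mid D)} = \left( \prod_{i=1}^{n} \frac{\mathrm{Prob}(D^{(i)} \mid H_1)}{\mathrm{Prob}(D^{(i)} \mid H_2)} \right) \frac{\mathrm{Prob}(H_1)}{\mathrm{Prob}(H_2)}$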
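As a small numerical illustration of the odds ratio for independent observations, the sketch below multiplies the per-observation likelihood ratios (in log space, to avoid underflow) and then applies the prior odds. The function name `posterior_odds` and the example numbers are assumptions made for this illustration only.

```python
import math

def posterior_odds(likelihoods_h1, likelihoods_h2, prior_odds=1.0):
    """Prob(H1|D) / Prob(H2|D) for independent observations.

    likelihoods_h1[i] is Prob(D(i)|H1), likelihoods_h2[i] is Prob(D(i)|H2).
    """
    log_ratio = sum(math.log(p1) - math.log(p2)
                    for p1, p2 in zip(likelihoods_h1, likelihoods_h2))
    return math.exp(log_ratio) * prior_odds

# Two observations; the first is twice as probable under H1, the second
# is equally probable under both hypotheses; prior odds of 3 to 1 for H1.
odds = posterior_odds([0.5, 0.5], [0.25, 0.5], prior_odds=3.0)
print(odds)
```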



Computing the likelihood of a tree

Suppose that we have a set of aligned DNA-sequences with m sites.

We are given a phylogeny with branch lengths, and an evolutionary model that allows us to compute probabilities of changes of states along this tree.

In particular, the model allows us to compute transition probabilities Pij(t), the

probability that state j will exist at the end of a branch of length t, if the state at the

start of the branch is i. (t measures branch length, not time).

We will make 2 assumptions that are central to computing the likelihoods:

(1) Evolution in different sites (on the given tree) is independent.

(2) Evolution in different lineages is independent.




The first assumption

(1) Evolution in different sites (on the given tree) is independent.

allows us to take the likelihood and decompose it into a product, one term for each site:

$L = \mathrm{Prob}(D \mid T) = \prod_{i=1}^{m} \mathrm{Prob}(D^{(i)} \mid T)$

where $D^{(i)}$ is the data at the i-th site.

Hence we only need to know how to compute the likelihood at a single site.




Suppose that we have a tree, and the data at a site.

The likelihood of the tree for this site is the sum, over all possible nucleotides that may have existed at the interior nodes of the tree, of the probabilities of each scenario of events:

$\mathrm{Prob}(D^{(i)} \mid T) = \sum_x \sum_y \sum_z \sum_w \mathrm{Prob}(A, C, C, C, G, x, y, z, w \mid T)$

Each summation runs over all 4 nucleotides.




The second assumption

(2) Evolution in different lineages is independent.

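allows us to decompose

$\mathrm{Prob}(A, C, C, C, G, x, y, z, w \mid T)$

into a product of terms:

$\mathrm{Prob}(A, C, C, C, G, x, y, z, w \mid T) = \mathrm{Prob}(x)\,\mathrm{Prob}(y \mid x, t_6)\,\mathrm{Prob}(A \mid y, t_1)\,\mathrm{Prob}(C \mid y, t_2)\,\mathrm{Prob}(z \mid x, t_8)\,\mathrm{Prob}(C \mid z, t_3)\,\mathrm{Prob}(w \mid z, t_7)\,\mathrm{Prob}(C \mid w, t_4)\,\mathrm{Prob}(G \mid w, t_5)$

The problem with this expression is that a tree with n species has n – 1 interior nodes, and each can have one of 4 states.

So we need $4^{n-1}$ terms. This may become enormously large when we need to sum over all x, y, z, w. Refine the strategy!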



Economizing on the computation

The algorithm is applied starting at the node that has all of its immediate

descendants being tips (there will always be one such node).

Then it is applied successively to nodes further down the tree, not applying it to any

node until all of its descendants have been processed.

The result is the L0(i) for the bottom-most node in the tree.

Once the likelihood for each site is computed, the overall likelihood of the

tree is the product of the site likelihoods.
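The bottom-up procedure described here can be sketched as a recursion over conditional likelihoods. The sketch below is an illustration under stated assumptions, not the lecture's code: it assumes the Jukes-Cantor model for the transition probabilities Pij(t), uniform base frequencies of 1/4 at the bottom-most node, and a tree encoded as nested tuples `((left, t_left), (right, t_right))` with tips given as single-letter strings. The names `jc_prob`, `conditional_likelihoods` and `site_likelihood` are hypothetical.

```python
import math

BASES = "ACGT"

def jc_prob(i, j, t):
    """Assumed Jukes-Cantor transition probability P_ij(t);
    t is the branch length in expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def conditional_likelihoods(node):
    """Likelihood of the data below `node`, for each state at `node`."""
    if isinstance(node, str):            # a tip: the observed base is certain
        return {s: 1.0 if s == node else 0.0 for s in BASES}
    (left, t_left), (right, t_right) = node
    l_left = conditional_likelihoods(left)    # descendants are processed first
    l_right = conditional_likelihoods(right)
    return {
        s: sum(jc_prob(s, x, t_left) * l_left[x] for x in BASES)
           * sum(jc_prob(s, y, t_right) * l_right[y] for y in BASES)
        for s in BASES
    }

def site_likelihood(root):
    """Sum over the states of the bottom-most node, weighted by 1/4 each."""
    l0 = conditional_likelihoods(root)
    return sum(0.25 * l0[s] for s in BASES)

# The five-tip tree of the previous slides (tips A, C, C, C, G), with all
# eight branch lengths set to 0.1 purely for illustration.
y = (("A", 0.1), ("C", 0.1))
w = (("C", 0.1), ("G", 0.1))
z = (("C", 0.1), (w, 0.1))
x = ((y, 0.1), (z, 0.1))
print(site_likelihood(x))
```

Each interior node is visited once and does a constant amount of work per state, so the cost grows linearly with the number of species instead of requiring $4^{n-1}$ terms.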



V 16 Genome Rearrangement

Two genomes may have many genes in common, but the genes may be arranged in a different sequence or be moved between chromosomes. Such differences in gene orders are the results of rearrangement events that are common in molecular evolution (frequency: only ca. 1 event per million years!)

- Substitution

- Insertion

- Deletion

- Translocation

- Inversion/ Reversal

- Duplication



Types of Genome Rearrangements

In unichromosomal genomes, the most common rearrangement events are

reversals, in which a contiguous interval of genes is put into the reverse order.

For multichromosomal genomes, the most common rearrangement events are

reversals, translocations, fissions, and fusions.

The pairwise genome rearrangement problem is to find an optimal scenario

transforming one genome to another via these rearrangement events.

Genomic distance: the number of inversions and translocations needed to

transform one genome into another. Fissions and fusions may be included as a

special case of translocations in which one of the input or output chromosomes is

empty.



Representation of a genome

We consider a unichromosomal genome to be a sequence of n genes. The

genes are represented by numbers 1, 2, ..., n.

The two orientations of gene i are represented by i and -i.

A genome is represented as a signed permutation of the numbers 1, 2, ..., n.

For example, a unichromosomal genome with n = 5 genes is 5 -3 4 2 -1



Unichromosomal genomes: sorting by reversal

A reversal in a signed permutation is an operation that takes an interval in a

permutation, reverses the order of the numbers, and changes all their signs. For

example,

5 1 3 2 -9 7 -4 6 8

5 1 -7 9 -2 -3 -4 6 8

The reversal distance between two genomes is the minimum number of

reversals it takes to get from one genome to the other.

For a given pair of genomes, the reversal distance is unique, but there are

usually many possible reversal scenarios with this distance.

However, it is (of course) possible that this mathematical notion of reversal

distance can underestimate the actual number of steps that occurred

biologically.
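A reversal on a signed permutation is easy to express in code. The following sketch reproduces the example above; the function name `apply_reversal` and the 1-based, inclusive position convention are assumptions made for this illustration.

```python
def apply_reversal(perm, i, j):
    """Reverse positions i..j (1-based, inclusive) of a signed permutation
    and change the sign of every element in that interval."""
    segment = [-gene for gene in reversed(perm[i - 1:j])]
    return perm[:i - 1] + segment + perm[j:]

genome = [5, 1, 3, 2, -9, 7, -4, 6, 8]
print(apply_reversal(genome, 3, 6))   # reverses the interval 3 2 -9 7
```

Applying the same reversal twice restores the original genome, so every reversal scenario can also be read backwards.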



Signed and unsigned genomes

Most comparative mapping techniques determine the physical locations and

relative order of genes in each chromosome, but do not determine which of

two orientations each gene has.

Current sequencing methods do provide the orientations. It turns out that the genome rearrangement problem (uni- and multichromosomal) for unsigned permutations is NP-hard, but the same problems for signed data can be solved in polynomial time.

Fortunately, with many genomes currently being sequenced, it is likely that

many comparative maps (corresponding to unsigned permutations) will soon be

replaced by sequencing data (corresponding to signed permutations).



Inversion, Transposition and inverted Transposition

[Figure: schematic of an inversion, a transposition, and an inverted transposition]



Sorting by Reversals

[Figure: series of reversals transforming the gene order of cabbage into the gene order of turnip]



Permutation (π): an ordered arrangement of the set {1, 2, …, n}

Reversal (ρ): a rearrangement that inverts a block, e.g. {3 4 7 6 1 5 2} · ρ(3,6) = {3 4 5 1 6 7 2}

Signed permutation: a permutation where the elements are oriented; a reversal switches element orientation, e.g.

{+3 -4 +7 -6 +1 -5 +2} · ρ(3,6) = {+3 -4 +5 -1 +6 -7 +2}



easy to do by eye ...

[Figure: the same cabbage-to-turnip reversal series, with the intermediate permutations labelled π·ρ1, π·ρ1ρ2, π·ρ1ρ2ρ3, ..., π·ρ1ρ2…ρt = σ, and read backwards π = σ·ρt…ρ2ρ1]



Formal Approach: Sorting by Reversals

The order of genes in 2 organisms is represented by permutations π = π1 π2 ... πn and σ = σ1 σ2 ... σn.

A reversal ρ(i,j) of an interval [i,j] is the permutation

1 2 ... i-1 i i+1 ... j-1 j j+1 ... n

1 2 ... i-1 j j-1 ... i+1 i j+1 ... n

ρ(i,j) has the effect of reversing the order of πi πi+1 ... πj and transforming π1 ... πi-1 πi ... πj πj+1 ... πn into π·ρ(i,j) = π1 ... πi-1 πj ... πi πj+1 ... πn.

Given permutations π and σ, the reversal distance problem is to find a series of reversals ρ1, ρ2, ..., ρt such that π·ρ1·ρ2 ... ρt = σ and t is minimal.

t is called the reversal distance between π and σ.
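For very small unsigned permutations the reversal distance can be found by exhaustive breadth-first search over all reversals. The sketch below is only meant to make the definition concrete; it explores up to n! permutations and is not one of the efficient algorithms referred to in the lecture. The names `reversals` and `reversal_distance` are assumptions made for this illustration.

```python
from collections import deque

def reversals(perm):
    """Yield every permutation obtained from perm by one reversal rho(i, j), i < j."""
    n = len(perm)
    for i in range(n):
        for j in range(i + 1, n):
            yield perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]

def reversal_distance(pi, sigma):
    """Minimal number t of reversals with pi . rho_1 ... rho_t = sigma (brute force)."""
    start, goal = tuple(pi), tuple(sigma)
    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return seen[current]
        for nxt in reversals(current):
            if nxt not in seen:
                seen[nxt] = seen[current] + 1
                queue.append(nxt)
    raise ValueError("sigma is not a permutation of the elements of pi")

print(reversal_distance([3, 1, 2], [1, 2, 3]))
```

Breadth-first search visits permutations in order of increasing number of reversals, so the first time σ is reached the number of steps is minimal.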



Reconstruction of phylogenetic trees from WG data

1 Phylogeny reconstruction as optimization problem?

Attempt to reconstruct an evolutionary scenario with a minimum number of permitted evolutionary events (e.g. duplications, insertions, deletions, inversions, transpositions) on a tree. All known approaches are NP-hard.

Also, no automated tool exists so far.

2 Estimate leaf-to-leaf distances

(based on some metric) between all genomes. Then use a standard distance-based method such as neighbour-joining to construct the tree.

Such approaches are quite fast but cannot recover the ancestral gene order.

2a Breakpoint phylogeny (Blanchette & Sankoff)

for special case in which the genomes all have the same set of genes, and

each gene appears once. Use breakpoint distance as distance matrix.



Reversal distance problem

The reversal distance for a pair of genomes can be computed in polynomial time

(Hannenhalli & Pevzner 1999 and others, also see Bioinformatics 1 lecture).

However, its use in studies of multiple genome rearrangements was somewhat

limited since it was not clear how to combine pairwise rearrangement scenarios

into a multiple rearrangement scenario.

In particular, Caprara (1999) demonstrated that even the simplest version of the Multiple Genome Rearrangement Problem, the Median Problem, is NP-hard.

Therefore, this line of research was abandoned for a while in favor of the

breakpoint analysis approach (see Blanchette & Sankoff).

The existing tools BPAnalysis or GRAPPA use the so-called breakpoint distance

to derive rearrangement scenarios.



Breakpoint phylogeny

When each genome has the same set of genes and each gene appears exactly

once, a genome can be described by a (circular or linear) ordering =

permutation of these genes.

Each gene has either positive (gi) or negative (- gi) orientation.

Given 2 genomes G and G' on the same set of genes, a breakpoint in G is defined as an ordered pair of genes (gi,gj) such that gi and gj appear consecutively in that order in G, but neither (gi,gj) nor (-gj,-gi) appears consecutively in that order in G'.

The breakpoint distance between two genomes is simply the number of

breakpoints between that pair of genomes.

The breakpoint score of a tree in which each node is labelled by a signed

ordering of genes is then the sum of the breakpoint distances along the edges

of the tree.
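The breakpoint distance and the breakpoint score of a labelled tree can be sketched as follows. This sketch assumes linear genomes without end markers and uses the convention that an adjacency (gi,gj) is also conserved when G' contains (-gj,-gi), i.e. the same pair read on the opposite strand. The function names and the tree encoding (a list of edges plus a dictionary of gene orders) are assumptions made for this illustration.

```python
def adjacencies(genome):
    """Ordered pairs of consecutive genes in a linear signed genome."""
    return list(zip(genome, genome[1:]))

def breakpoint_distance(g, g_prime):
    """Number of adjacencies of g that are not conserved in g_prime."""
    conserved = set(adjacencies(g_prime))
    # The pair (a, b) read from the opposite strand appears as (-b, -a).
    conserved |= {(-b, -a) for a, b in adjacencies(g_prime)}
    return sum(1 for pair in adjacencies(g) if pair not in conserved)

def breakpoint_score(edges, gene_orders):
    """Sum of breakpoint distances along the edges of a labelled tree."""
    return sum(breakpoint_distance(gene_orders[u], gene_orders[v])
               for u, v in edges)

g1 = [1, 2, 3, 4, 5]
g2 = [1, 2, -5, -4, -3]
print(breakpoint_distance(g1, g2))

# A star tree whose internal node is labelled with g1 (hypothetical labels).
orders = {"anc": g1, "leaf1": g1, "leaf2": g2}
print(breakpoint_score([("anc", "leaf1"), ("anc", "leaf2")], orders))
```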



Breakpoint Graph

Sorting a permutation is a hard problem.

Breakpoints were introduced by Watterson et al. (1982) and by Nadeau and Taylor

(1984) and correlations were noticed between the reversal distance and the

number of breakpoints.

Let i ∼ j if |i – j| = 1. Extend a permutation π = π1 π2 ... πn by adding π0 = 0 and πn+1 = n + 1. We call a pair of elements (πi, πi+1), 0 ≤ i ≤ n, of π an adjacency if πi ∼ πi+1, and a breakpoint if πi ≁ πi+1.

2 3 1 4 6 5 7

0 2 3 1 4 6 5 7 8

[Figure: the extended permutation with its adjacencies and breakpoints marked]

As the identity permutation has no breakpoints, sorting by reversals corresponds to eliminating breakpoints.

The observation that every reversal can eliminate at most 2 breakpoints implies that the reversal distance d(π) ≥ b(π) / 2, where b(π) is the number of breakpoints in π.

However, this lower bound is usually far from tight.
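Counting breakpoints follows directly from the definition. The sketch below extends the permutation by 0 and n + 1 and counts neighbouring pairs whose values differ by more than 1; the function name `count_breakpoints` is an assumption made for this illustration.

```python
import math

def count_breakpoints(perm):
    """b(pi): number of pairs (pi_i, pi_{i+1}), 0 <= i <= n, with |pi_i - pi_{i+1}| != 1."""
    extended = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if abs(a - b) != 1)

pi = [2, 3, 1, 4, 6, 5, 7]
b = count_breakpoints(pi)
lower_bound = math.ceil(b / 2)   # every reversal removes at most 2 breakpoints
print(b, lower_bound)
```

For the permutation 2 3 1 4 6 5 7 the breakpoints are (0,2), (3,1), (1,4), (4,6) and (5,7), so at least 3 reversals are needed according to this bound.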



Breakpoint Graph

The breakpoint graph of a permutation π is an edge-colored graph G(π) with n + 2 vertices {π0, π1, ..., πn, πn+1} = {0, 1, ..., n, n+1}. We join vertices πi and πi+1 by a black edge for 0 ≤ i ≤ n. We join vertices πi and πj by a gray edge if πi ∼ πj.

[Figure: black path and gray path on the vertices 0 2 3 1 4 6 5 7; the superposition of the black and gray paths forms the breakpoint graph]

A breakpoint graph is obtained by a superposition of a black path traversing the vertices 0, 1, ..., n, n+1 in the order given by the permutation π and a gray path traversing the vertices in the order given by the identity permutation.

more next week ...

Q: Construct the breakpoint graph for the following permutation.



Multiple Genome Rearrangement Problem

Find a phylogenetic tree describing the most „plausible“ rearrangement

scenario for multiple species.

The genomic distance in the case of genome rearrangement is defined in terms

of (1) reversals, (2) translocations, (3) fusions, and (4) fissions which are

the most common rearrangement events in multichromosomal genomes.

The special case of three genomes (m = 3) is called the Median Problem.

Given the gene order of three unichromosomal genomes G1, G2, and G3, find the ancestral genome A which minimizes the total reversal distance

$d(A, G_1) + d(A, G_2) + d(A, G_3)$



Multiple Genome Rearrangement Problem

New approach:

Given a set of m permutations (existing genomes) of order n, find a tree T with the m permutations as leaf nodes and assign permutations (ancestral genomes) to internal nodes such that D(T) is minimized, where

$D(T) = \sum_{(\pi, \gamma) \in T} d(\pi, \gamma)$

is the sum of reversal distances over all edges of the tree.

The breakpoint analysis attempts to solve the Median Problem by minimizing

the breakpoint distance instead of the reversal distance.

However, the breakpoint distance, in contrast to the reversal distance, does not

correspond to a minimum number of rearrangement events!

As a result, the breakpoint median, recovered by breakpoint analysis, rarely corresponds to the ancestral median, the genome that minimizes the overall number of rearrangements in the evolutionary scenario.



New algorithm

Aim: Among all possible reversals for each of the three genomes identify good

reversals.

A good reversal in a genome G1 is a reversal that brings a genome closer to

the ancestral genome.

But since the ancestral genome is unknown, it is unclear how to find good reversals, oops!

Instead: assume that reversals that reduce the reversal distance between G1 and G2 and the reversal distance between G1 and G3 are likely to be good reversals.

With Δ(ρ) as the overall reduction in the reversal distances:

$\Delta(\rho) = [d(G_1, G_2) + d(G_1, G_3)] - [d(G_1 \cdot \rho, G_2) + d(G_1 \cdot \rho, G_3)]$

the reversal ρ is good if Δ(ρ) = 2.
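The test for good reversals can be tried out on tiny signed genomes. The sketch below computes reversal distances by brute-force breadth-first search, which is feasible only for a handful of genes and is not the polynomial-time algorithm used by the authors. All function names and the three example genomes are assumptions made for this illustration.

```python
from collections import deque

def signed_reversals(perm):
    """Yield ((i, j), result) for every reversal of positions i..j (1-based)."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            flipped = tuple(-g for g in reversed(perm[i:j + 1]))
            yield (i + 1, j + 1), perm[:i] + flipped + perm[j + 1:]

def distance(a, b):
    """Reversal distance between two small signed permutations (brute force)."""
    a, b = tuple(a), tuple(b)
    seen, queue = {a: 0}, deque([a])
    while queue:
        current = queue.popleft()
        if current == b:
            return seen[current]
        for _, nxt in signed_reversals(current):
            if nxt not in seen:
                seen[nxt] = seen[current] + 1
                queue.append(nxt)
    raise ValueError("genomes do not contain the same genes")

def good_reversals(g1, g2, g3):
    """Reversals rho on g1 that reduce d(g1, g2) + d(g1, g3) by exactly 2."""
    g1 = tuple(g1)
    before = distance(g1, g2) + distance(g1, g3)
    good = []
    for (i, j), candidate in signed_reversals(g1):
        delta = before - (distance(candidate, g2) + distance(candidate, g3))
        if delta == 2:
            good.append((i, j))
    return good

print(good_reversals([1, -2, 3], [1, 2, 3], [1, 2, -3]))
```

In this example only the reversal that flips gene 2 brings G1 closer to both other genomes at once.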



New algorithm

Iteratively carry on these good rearrangements until the genomes G1, G2, and

G3 are transformed into an identical genome, hoping that this is the most likely

„ancestral median“.

When we are dealing with multichromosomal genomes and with four different

types of rearrangements, ambiguous situations may occur too.



Ambiguities again possible

E.g. G1 = 1 2 3 4 5

G2 = 1 2 -5 -4 -3

G3 = 1 2 | 3 4 5 (two chromosomes)

Here $d(G_1, G_2) = d(G_1, G_3) = d(G_2, G_3) = 1$.

The parsimony principle does not allow us to unambiguously reconstruct the evolutionary scenario.

If the ancestor coincides with G1, then a reversal occurred on the way to G2, and a fission occurred on the way to G3.

One can as well start with G2 or G3 as the ancestor.

This kind of ambiguity does not exist for unichromosomal genomes because,

there, it is impossible to find 3 genomes that would all be within one reversal of

each other.



Strategy for choosing reversals

Therefore one has to select carefully among the good rearrangements.

Observe that in most genomes of interest reversals and translocations are

more common than fusions and fissions.

Therefore use as a rule always to select reversals/translocations before

fusions/fissions.

Often, the list of good reversals contains nonoverlapping reversals, and the

order in which these reversals are performed is often irrelevant.

Compute for each good reversal ρ the number of good reversals n(ρ) that will be available if ρ is carried out. Then choose the good reversal with the maximal n(ρ) to be carried out.

If we run out of good reversals before reaching a solution, the best reversal to

be taken will be the result of a depth k search minimizing the total pairwise

rearrangement distances.



How good a measure is the reversal distance?

Authors claim that the reversal distance is a good approximation of the true distance for many biologically relevant cases.

Let π be a genome that evolved from a genome γ by k reversals.

I.e. the true distance between π and γ is k.

We say that π and γ form a valid pair if d(π, γ) = k.

Otherwise we say that d(π, γ) underestimates the true distance.

Typically two genomes form a valid pair if the number of rearrangements

between them is relatively small – exactly the case in a number of genome

rearrangement studies.



Reversal distance vs. True distance

Reversal distance, d(π, γ), versus the actual number of reversals performed to transform γ into π, where π is a genome/permutation that evolved from the identity permutation γ = 1, 2, ..., 100 by k random reversals.

The simulations were repeated 10 times for every k.

Shown is the average difference between the reversal distance and the actual number of reversals performed (k).

For a genome with n = 100 markers, the reversal distance approximates the true distance very well as long as the number of reversals remains below 0.4 n. This is the case in many biologically relevant cases.

Bourque, Pevzner, Genome Res (2002)



Nadeau & Taylor model (1984)

- suggest presence of conserved segments (i.e., segments with preserved gene

orders without disruption by rearrangements)

- estimated that there are ca. 180 conserved segments in human and mouse

- provided convincing evidence that the random breakage model of genomic evolution postulated by Ohno is correct. The model assumes a random (i.e., uniform and independent) distribution of chromosome rearrangement breakpoints and is supported by the observation that the lengths of synteny blocks shared by human and mouse are well fitted by the exponential distribution predicted by the random breakage model,

$f(x) = \frac{1}{L} e^{-x/L}$

where L is the average length of segments.

- model has become widely accepted

- new studies of significantly larger datasets confirmed that newly discovered synteny blocks still fit the predicted exponential distribution very well.



Breakpoint reusage

Two different most parsimonious scenarios that transform the order of the 11 synteny blocks on the mouse X

chromosome into the order on the human X chromosome. The arrangement of synteny blocks in the ancestor is

unspecified (and is assumed to coincide with one of intermediate arrangements) because it cannot be inferred without

availability of a third genome.

Breakpoint uses are shown as short vertical yellow lines, and breakpoint region reuses are shown as double yellow

lines. In the first scenario (Left) the breakpoint reuses are located in human in breakpoint regions (3,4), (4,5), and (5,6),

whereas in the second one (Right) they are located in (5,6), (6,7), and after block 11. In the second scenario, a potential

hidden block is shown as a black dot; it restricts the set of possible most parsimonious scenarios, and it separates two

breakpoint uses that would have been a breakpoint region reuse. Our theory implies that any rearrangement scenario

based on these 11 blocks has at least three reuses of breakpoint regions (possibly including chromosome ends).

Pevzner, Tesler, PNAS 100, 7672 (2003)



Length of synteny blocks

(Left) Histogram of synteny block lengths in human for Nb = 281 synteny blocks of length at least 1 Mb, fitted by an exponential distribution with mean block length L = Gb/Nb = 9.6 Mb, where Gb = 2,707 Mb is the overall length of syntenic blocks. The bin size is 2.5 Mb.

(Center) The same histogram superimposed with the 190 hidden synteny blocks revealed by genome rearrangement analysis, under the assumption that all hidden blocks are short, i.e., <1 Mb in length.

(Right) Histogram of breakpoint region lengths in the human genome (bin size is 100 kb). Most breakpoint regions are very short, with 109 of 258 regions being <100 kb. However, there is a small number of long breakpoint regions: 17 regions are between 1 and 2.5 Mb, and 15 are >2.5 Mb (shown by a single bar at the right end).

The rearrangement analysis confirms the existence of many short breakpoints. Their existence immediately implies that an exponential distribution is not a good fit to reality, thus pointing to limitations of the random breakage model.

Pevzner, Tesler, PNAS 100, 7672 (2003)



Rat – mouse – man

The Rat Sequencing Consortium,

Nature 428, 493 (2004)

X chromosome on each pair. GRIMM synteny for 16 orthologous pairs. Arrangement of the 16 blocks: 15 rearrangement events necessary. Shown is one of a number of most parsimonious inversion scenarios. The last common ancestor of human, mouse and rat should be on the evolutionary path between median ancestor and human.



Summary

Breakpoint analysis (BPA) is a robust technique for small rearrangement

problems. Problem of ambiguity between different optimal solutions.

Although complexity could be dramatically reduced by algorithmic improvements

(e.g. GRAPPA), method is still too expensive for more than 10 genomes.

Heuristic MGR algorithm by Bourque & Pevzner minimizes reversal distance instead of breakpoint distance. (Taking the number of breakpoints / 2 was not the optimal lower bound for the reversal distance.)

Runs more efficiently + can be applied to much larger problems + provides only one or a few solutions.

MGR algorithm: analogy to conformational search in some energy landscape ...

What is the correct way to identify the biologically correct = true evolutionary trees:

by minimizing the breakpoint distance or the reversal distance or something else?



V16 – genome rearrangement

Important information – contained in the order in which genes occur on the

genomes of different species – allows inferring phylogenetic relationships.

Together with phylogenetic information, ancestral gene order reconstructions

give some clues about the conservation of the functional organisation of genomes

towards a global knowledge of life evolution.

Often, phylogeny reconstruction techniques using gene order data rely on the

definition of an evolutionary distance between two gene orders.

These distances are usually computed as the minimal number of

rearrangement operations needed to transform one genome into another one.

Bergeron et al. WABI 2004, 14-25 (2004)




Most choices of rearrangements quickly lead to hard algorithmic problems.

Therefore, the set of operations is usually restricted to reversals, translocations, fusions or fissions, for which linear-time algorithms have been developed in recent years.

However, this choice of rearrangement operations is dictated more by algorithmic necessity than by biological reality. E.g., in some genomes, transpositions and

inverted transpositions can be quite common.

A family of phylogenetic approaches labelled „distance-based“ methods relies on

pair-wise evolutionary distances which are then fed into an algorithm such as

neighbor-joining to infer tree topology and branch lengths.

These methods do not provide information about the putative ancestral gene

order.





Parsimony-based approaches attempt to identify the rearrangement scenario

(including tree topology and gene orders at the internal nodes) that minimizes the

number of evolutionary events required.

This problem is computationally much more difficult than just computing distances.

Heuristic algorithms exist that use either breakpoint or reversal distances.

However, these methods only provide us with one (or a small number of) possible hypotheses about ancestral gene orders, with no information about alternate optimal or near-optimal solutions.

Today:

- quick look at the reversal distance problem again

- new method „sets of conserved intervals“ (Bergeron & Jens Stoye)




Breakpoint Graph

The breakpoint graph of a permutation π is an edge-colored graph G(π) with n + 2 vertices {π0, π1, ..., πn, πn+1} = {0, 1, ..., n, n+1}. We join vertices πi and πi+1 by a black edge for 0 ≤ i ≤ n. We join vertices πi and πj by a gray edge if πi ∼ πj.

[Figure: black path and gray path on the vertices 0 2 3 1 4 6 5 7; the superposition of the black and gray paths forms the breakpoint graph]

A breakpoint graph is obtained by a superposition of a black path traversing the vertices 0, 1, ..., n, n+1 in the order given by the permutation π and a gray path traversing the vertices in the order given by the identity permutation.



Cycle decomposition

A cycle in an edge-colored graph G is called alternating if the colors of every two

consecutive edges of this cycle are distinct. In the following, cycles will mean

alternating cycles.

Cycle decomposition of the breakpoint graph:

[Figure: the breakpoint graph on the vertices 0 2 3 1 4 6 5 7 and its decomposition into alternating cycles]

A vertex v in a graph G is called balanced if the number of black edges incident to v equals the number of grey edges incident to v.

A balanced graph is a graph in which every vertex is balanced. G(π) is a balanced graph.

Therefore, there exists a cycle decomposition of G(π) into edge-disjoint alternating cycles (every edge in the graph belongs to exactly one cycle in the decomposition). Cycles in an edge decomposition may be self-intersecting. The previous breakpoint graph can be decomposed into 4 cycles, one of which is self-intersecting.



Effects of reversals on cycles

(A) For reversals acting on two cycles, Δ(b – c) = 1.

(B) For reversals acting on an unoriented cycle, Δ(b – c) = 0.

(C) For reversals acting on an oriented cycle, Δ(b – c) = -1.

Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)



Cycle decomposition

What is the decomposition of the breakpoint graph into a maximum number c(π) of edge-disjoint alternating cycles? Here, c(π) = 4.

Cycle decompositions play an important role in estimating reversal distances.

When a reversal is applied to a permutation, the number of cycles in a maximum decomposition can change by at most one (while the number of breakpoints can change by two).

Bafna & Pevzner (1996) proved the bound for the reversal distance d(π):

d(π) ≥ n + 1 - c(π)

which is much tighter than the bound in terms of breakpoints, d(π) ≥ b(π) / 2.

For many biological problems, d(π) = n + 1 - c(π).

Therefore, the reversal distance problem reduces to the problem of finding

the maximal cycle decomposition.

Hurdles, Super-hurdles, fortresses ...



Alternative concept: conserved intervals

Bergeron & Stoye, Report 2003-01 Uni Bielefeld

Distance matrices can be used as data for phylogenetic reconstruction, or to reconstruct ancestral genomes.

However, all distances (except for the breakpoint distance) are closely tied to initial choices of allowable rearrangement operations.

They are pure distances because similarities between genomes are ignored.

The breakpoint distance is based on the notion of conserved adjacencies. These are easy to compute, but the breakpoint distance often fails to capture more global relations between genomes.

A first generalization of adjacencies: common intervals that identify

subsets of genes that appear consecutively in two or

more genomes.

Jens Stoye



Permutations, Gene Order, and Rearrangements


Assume that the genes of an organism are ordered and oriented along linear or

circular DNA molecules. E.g. mitochondrial genes in insects

Collapse 38 genes into set of 17 blocks. Genes in one block do not change order

between these species.

Distance approaches: focus on the difference between 2 particular genomes.

E.g. Fruit Fly differs from Mosquito by the reversal of gene 10, and the

transposition of genes 7 and 8.

count minimal number of reversals and/or transpositions

distance matrix for the set of species





breakpoint distance: counts the lost adjacencies between genomes.

E.g. given the circularity of the genomes, Fruit Fly and Mosquito have 12

conserved adjacencies and a breakpoint distance of 5.

E.g. the first 4 species of table 1 share 6 adjacencies:

[1,2], [2,3], [11,12], [15,16], [16,17], and [17,1].

When comparing all 6 species, [17,1] is the only adjacency left.





Observation: the 6 permutations are very „similar“.

E.g. the genes in the interval [1,12] are all the same, with small variations in their

ordering.

This is also true for the genes in the intervals [3,6], [6,9], [9,11], and [12,17].

Such intervals, together with conserved adjacencies play a fundamental role in

rearrangement and distance theories, ancestral genome reconstructions, and

phylogeny.

Family portrait of the conserved intervals of the permutations of table 1

Here, the elements that can be glued together to form larger objects are boxed in rectangles.

Q: Construct the family portrait of the conserved intervals for the following sequences …



Which arrangements are preferable?


All permutations of table 1 fit the representation with the following conventions

(1) free objects within a rectangle can be reordered, or can change sign

(2) connections between rectangles are fixed.

Consider 2 rearrangement scenarios that transform silkworm into Locust using a

minimal number of reversals

The two scenarios are fundamentally different, although both use 6 reversals.

The right one uses much longer reversals than the left one, and the right one

breaks conserved intervals between Silkworm and Locust in intermediate

permutations, namely [3,6], [1,12], and [12,17].

The right scenario looks highly suspicious.



Conserved intervals

Definition 1 Let G be a set of signed permutations of n elements. An interval [a,b] is a conserved interval of the set G if:

(1) either a precedes b, or –b precedes –a, in each permutation, and

(2) the set of unsigned elements that appear between a and b is the same for all permutations in G.

If [a,b] is a conserved interval, so is [-b,-a].

Consider 2 permutations

P = 1 2 3 7 5 6 -4 8

Q = 1 7 -3 -2 5 -6 -4 8

Here, [1,5] and [2,3] are conserved intervals, but not [1,6].

The other conserved intervals of P and Q are [1,-4], [1,8], [5,-4], [5,8], and [-4,8].

The diagram representation of these intervals is

1 2 3 7 5 6 -4 8
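Definition 1 can be checked mechanically. The sketch below lists all conserved intervals [a,b] of a set of signed permutations, reporting each interval in the orientation in which it occurs in the first permutation (so [a,b] is listed but its mirror image [-b,-a] is not). The function names are assumptions made for this illustration.

```python
def elements_between(perm, a, b):
    """Unsigned elements strictly between a and b if a precedes b in perm,
    or between -b and -a if -b precedes -a; None if neither holds."""
    pos = {x: k for k, x in enumerate(perm)}
    if a in pos and b in pos and pos[a] < pos[b]:
        lo, hi = pos[a], pos[b]
    elif -b in pos and -a in pos and pos[-b] < pos[-a]:
        lo, hi = pos[-b], pos[-a]
    else:
        return None
    return frozenset(abs(x) for x in perm[lo + 1:hi])

def is_conserved(perms, a, b):
    """Conditions (1) and (2) of Definition 1 for the interval [a, b]."""
    contents = [elements_between(p, a, b) for p in perms]
    return None not in contents and len(set(contents)) == 1

def conserved_intervals(perms):
    """All conserved intervals, oriented as in the first permutation."""
    first = perms[0]
    return [(first[i], first[j])
            for i in range(len(first))
            for j in range(i + 1, len(first))
            if is_conserved(perms, first[i], first[j])]

P = [1, 2, 3, 7, 5, 6, -4, 8]
Q = [1, 7, -3, -2, 5, -6, -4, 8]
print(conserved_intervals([P, Q]))
```

For P and Q this yields the seven intervals named above: [1,5], [2,3], [1,-4], [1,8], [5,-4], [5,8] and [-4,8].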



Conserved intervals

When the identity permutation is not in G, it is always possible to rename the elements of G such that conserved intervals will be intervals of consecutive elements.

E.g. if one composes the permutations P and Q of the example with the inverse permutation $P^{-1}$,

P' = $P^{-1} \circ P$ = 1 2 3 4 5 6 7 8

Q' = $P^{-1} \circ Q$ = 1 4 -3 -2 5 -6 7 8

[Figure: diagram of the conserved intervals of P' and Q', drawn on 1 2 3 4 5 6 7 8]

Proposition 1 Let R be a permutation and G a set of permutations, denote by

R o G the set of permutations obtained by composing each permutation in G with

R. The interval [a,b] is conserved in G if and only if the interval [R(a),R(b)] is

conserved in R o G.
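The renaming by $P^{-1}$ amounts to replacing every element by its signed position in P. A minimal sketch, in which the function name `rename_by` is an assumption made for this illustration:

```python
def rename_by(p, q):
    """Compose q with the inverse of p: each element of q is replaced by
    its position in p, with the sign flipped if it occurs negated."""
    new_name = {}
    for position, element in enumerate(p, start=1):
        new_name[element] = position
        new_name[-element] = -position
    return [new_name[x] for x in q]

P = [1, 2, 3, 7, 5, 6, -4, 8]
Q = [1, 7, -3, -2, 5, -6, -4, 8]
print(rename_by(P, P))   # the identity permutation P'
print(rename_by(P, Q))   # Q'
```

Under this renaming the conserved interval [1,-4] of P and Q becomes [1,7] in P' and Q', in line with Proposition 1.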



Conserved intervals

Proof of Proposition 1: if a permutation P is written as P = p1 p2 ... pn, then R o P is: R o P = R(p1) R(p2) ... R(pn).

If [a,b] is conserved in G, then each permutation in G has a consecutive block of elements beginning with a and ending with b, or beginning with –b and ending with –a. These properties also hold for the set R o G, if one replaces a by R(a) and b by R(b).

Some intervals, such as [1,7] for the set {P', Q'} in the above example, are the union of smaller intervals: [1,7] = [1,5] ∪ [5,7]. Intervals that are not unions are especially useful.

Definition 2 Conserved intervals that are not the union of shorter conserved

intervals are called irreducible.

Sets of conserved intervals can be characterized by the set of irreducible intervals.
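Irreducible intervals can be filtered out of a list of conserved intervals. The sketch below relies on the observation that a conserved interval [a,b] is a union of shorter conserved intervals exactly when it can be split at some element c into two conserved intervals [a,c] and [c,b]. It assumes that all intervals are given in the same orientation, e.g. after renaming so that the identity permutation is in the set. The function name is an assumption made for this illustration.

```python
def irreducible_intervals(conserved):
    """Conserved intervals that cannot be split into [a, c] and [c, b]."""
    conserved = set(conserved)
    endpoints = {x for interval in conserved for x in interval}
    return {
        (a, b) for (a, b) in conserved
        if not any((a, c) in conserved and (c, b) in conserved
                   for c in endpoints)
    }

# Conserved intervals of P' and Q' from the example above.
conserved = [(1, 5), (2, 3), (1, 7), (1, 8), (5, 7), (5, 8), (7, 8)]
print(sorted(irreducible_intervals(conserved)))
```

For this example the irreducible intervals are [1,5], [2,3], [5,7] and [7,8].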

Q: Which of the previously identified conserved intervals are irreducible?



Irreducible conserved intervals

Proposition 2 Two different irreducible conserved intervals [a,b] and [c,d]

of a set G of permutations are either

1) disjoint

2) nested with different endpoints, or

3) overlapping on one element.

Proof. Wlog we can assume that G contains the identity permutation and that conserved

intervals are intervals of consecutive elements.

Suppose that [a,b] and [c,d] are nested with a = c and d < b. Since [c,d] is a conserved

interval, it contains all integers between c and d the interval [d,b] contains all integers

between d and b, and [a,b] is not irreducible.

If [a,b] and [c,d] overlap on more than one element, we can suppose

a < c < b < d. Since all elements between c and d are greater than c, the interval

between a and c must contain all elements between a and c, thus [a,b] is not irreducible.

Q: Using these irreducible conserved intervals, quickly show that the 3 properties shown on the left are satisfied.



Conserved intervals

Overlapping irreducible intervals form chains linked by their successive common

elements. A chain of k-1 intervals [a1,a2] [a2,a3] ... [ak-1,ak] will be denoted simply

by its k links [a1,a2,a3 ... ak].

E.g. [1,5,7,8] is a chain of the set of conserved intervals of P' and Q'.

A maximal chain is a chain that cannot be extended.

Proposition 3. Every irreducible conserved interval belongs to a unique maximal

chain.

Proof: By Prop. 2: if [a,b] is an irreducible conserved interval, then no other

irreducible conserved interval can begin with a or end with b.

Maximal chains, as sets of links, together with isolated genes, form a partition of

the set of genes.



Conserved intervals

A set of permutations on n elements can have as many as n(n-1)/2 conserved

intervals, but at most n-1 irreducible intervals.

These bounds are achieved with sets containing only one permutation.

Proposition 4. Each maximal chain of k links contributes k(k-1)/2 to the total

number of conserved intervals.

Proof. Conserved intervals [a,b] are in bijection with chains of the form

[a, x1, ..., xk, b]

of irreducible intervals. Each maximal chain of k links has k(k-1)/2 such sub-chains.
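The counting in Proposition 4 can be illustrated with a short Python sketch. It is an assumption-laden example, not the lecture's algorithm: the function name `maximal_chains` is hypothetical, and the input is the hand-derived list of irreducible intervals for the pair P = 0 1 2 ... 10 and Q = 0 5 -7 -6 8 9 1 2 3 4 10. The sketch relies on Propositions 2 and 3, i.e. each element starts at most one irreducible interval.

```python
def maximal_chains(irreducible):
    """Link irreducible intervals [a1,a2], [a2,a3], ... into maximal chains."""
    successor = dict(irreducible)      # start -> end of the interval
    ends = set(successor.values())
    chains = []
    for start in successor:
        if start in ends:              # not the first link of a chain
            continue
        chain = [start]
        while chain[-1] in successor:  # follow the common elements
            chain.append(successor[chain[-1]])
        chains.append(chain)
    return chains


# Hand-derived irreducible intervals of the example pair (assumption).
irreducible = [(0, 10), (1, 2), (2, 3), (3, 4), (5, 8), (6, 7), (8, 9)]

chains = maximal_chains(irreducible)
print(chains)  # [[0, 10], [1, 2, 3, 4], [5, 8, 9], [6, 7]]

# Each maximal chain of k links contributes k(k-1)/2 conserved intervals.
total = sum(len(c) * (len(c) - 1) // 2 for c in chains)
print(total)   # 11
```

The four chains contribute 1 + 6 + 3 + 1 = 11 conserved intervals, and their link sets {0,10}, {1,2,3,4}, {5,8,9}, {6,7} partition the 11 genes with no isolated gene left over.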



Conserved intervals

Proposition 5 Let P be a permutation that is contained in both sets G1 and G2.

The interval [a,b] is a conserved interval of G = G1 ∪ G2 if and only if there exist

two chains of irreducible conserved intervals, with respect to P, with k ≥ 0, l ≥ 0:

[a, x1, ..., xk, b] in G1

[a, y1, ..., yl, b] in G2.

The interval [a,b] is irreducible if and only if {x1, ..., xk} and {y1, ..., yl} are disjoint.

Proof. The interval [a,b] is a conserved interval of G if and only if it is a conserved interval

in both G1 and G2, therefore there must exist chains beginning by a and ending by b for

both sets G1 and G2. If [a,b] is irreducible in G, and if [a,x] and [x,b] are conserved intervals

of G1, say, then x cannot belong to the set {y1, ..., yl}. If there is a common element x to

both sets {x1, ..., xk} and {y1, ..., yl}, then [a,b] = [a,x] ∪ [x,b] and both [a,x] and [x,b] are

conserved intervals of G.



Similarity and distance

The number of conserved intervals of a set of permutations is a measure of

similarity, but can easily be transformed into a distance between two

permutations, or two sets of permutations.

Definition 3 Let G1 and G2 be two permutations (or sets of permutations) on n elements, with N1 and

N2 conserved intervals. Let N be the number of conserved intervals in G1 ∪ G2.

The interval distance between G1 and G2 is then defined by:

d(G1,G2) = N1 + N2 – 2N

The interval distance satisfies the fundamental properties of a mathematical

distance, e.g. it fulfils the triangle inequality:

d(P,Q) + d(Q,R) ≥ d(P,R)




When comparing two permutations, the interval distance counts the total number

of intervals that are unique to one of them. E.g. the distance between

P = 0 1 2 3 4 5 6 7 8 9 10

Q = 0 5 -7 -6 8 9 1 2 3 4 10

is given by d(P,Q) = (11·10)/2 + (11·10)/2 – 2·11 = 88

The 2 measures sometimes disagree. The behavior of the interval distance

reflects that the length (number of genes) involved in a rearrangement operation

matters: short reversals are less disturbing than long ones.
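The example distance d(P,Q) = 88 can be reproduced with a brute-force Python sketch. This is an illustration under stated assumptions, not an efficient algorithm: the function names `between`, `conserved_intervals` and `interval_distance` are hypothetical, intervals are named as they read in the first permutation of the set, and an interval counts as conserved when every permutation contains a ... b or -b ... -a framing the same unsigned elements.

```python
def between(perm, a, b):
    """Unsigned elements strictly between a and b (or between -b and -a).

    Returns None if neither a..b nor -b..-a occurs in that order in perm.
    """
    for x, y in ((a, b), (-b, -a)):
        if x in perm and y in perm and perm.index(x) < perm.index(y):
            return {abs(e) for e in perm[perm.index(x) + 1:perm.index(y)]}
    return None


def conserved_intervals(perms):
    """All conserved intervals (a, b), named as they read in perms[0]."""
    first = perms[0]
    found = set()
    for i in range(len(first)):
        for j in range(i + 1, len(first)):
            a, b = first[i], first[j]
            inner = {abs(e) for e in first[i + 1:j]}
            if all(between(p, a, b) == inner for p in perms[1:]):
                found.add((a, b))
    return found


def interval_distance(g1, g2):
    """d(G1,G2) = N1 + N2 - 2N, with N counted in the union of G1 and G2."""
    n1 = len(conserved_intervals(g1))
    n2 = len(conserved_intervals(g2))
    n = len(conserved_intervals(g1 + g2))
    return n1 + n2 - 2 * n


P = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Q = [0, 5, -7, -6, 8, 9, 1, 2, 3, 4, 10]

print(len(conserved_intervals([P])))     # 55 = 11*10/2
print(len(conserved_intervals([P, Q])))  # 11
print(interval_distance([P], [Q]))       # 88
```

The sketch keeps 0 as the first element of every permutation, since 0 cannot carry a sign in Python.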



Comparison with other distance measures

Breakpoint distance also gives different results than interval distance,

while the same results are obtained by transposition + reversal distances.

Q: Both interval distance and breakpoint distance are mathematically well-defined quantities. Which distance corresponds to the real biological distance?

A: We do not know this today, and we may never know it for certain.




Proposition 7 Suppose that P and Q have n elements, then

(1) if P is obtained from Q by reversing k elements, then the interval distance

between P and Q is k (n – k);

(2) if P is obtained from Q by transposing two consecutive blocks of a and b

elements, then the interval distance between P and Q is (a+b)(n – (a+b)) + ab.

Because the interval distance is affected by length, one should question

the practice of collapsing identical strips of genes.

Why not use all available information?



Link with rearrangement theories

Characterize the rearrangement operations that preserve conserved intervals.

Definition 4. Let P and Q be two permutations, and ρ a rearrangement

operation applied to P yielding P'. We say that ρ preserves the conserved

intervals of P and Q if the conserved intervals of {P,Q} are contained in those

of {P',Q}.

Only rearrangements within blocks are preserving. Note that all operations,

except fusions, destroy some adjacencies that existed in the original permutation:

the number and nature of these adjacencies is a key concept.

Definition 5. Let ρ be a rearrangement operation that transforms P into P'.

A breakpoint of ρ is a pair of elements that are adjacent in P but not in P'.

Breakpoints are where one has to cut P in order to apply ρ.

Reversals and translocations have 2 breakpoints, transpositions have 3, and

fissions have 1.
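Definition 5 translates directly into a few lines of Python. The sketch below is illustrative: the function names `adjacencies` and `breakpoints` and the example permutations are assumptions, and it treats the adjacency x y as identical to -y -x, so that reversing a block does not break the adjacencies inside it. Only single-chromosome operations (reversal, transposition) are shown; translocations and fissions would need a multi-chromosome representation that is not modelled here.

```python
def adjacencies(perm):
    """Adjacencies of a signed permutation; x y is the same as -y -x."""
    adj = set()
    for x, y in zip(perm, perm[1:]):
        adj.add((x, y))
        adj.add((-y, -x))
    return adj


def breakpoints(p, p_prime):
    """Pairs of elements that are adjacent in p but not in p_prime."""
    after = adjacencies(p_prime)
    return [(x, y) for x, y in zip(p, p[1:]) if (x, y) not in after]


P = [0, 1, 2, 3, 4, 5, 6, 7]

# Reversal of the block 2 3 4.
reversal = [0, 1, -4, -3, -2, 5, 6, 7]
print(breakpoints(P, reversal))       # [(1, 2), (4, 5)]

# Transposition of the consecutive blocks 2 3 and 4 5 6.
transposition = [0, 1, 4, 5, 6, 2, 3, 7]
print(breakpoints(P, transposition))  # [(1, 2), (3, 4), (6, 7)]
```

The reversal yields 2 breakpoints and the transposition 3, in line with the counts stated above.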



Link with rearrangement theories

Consider the irreducible intervals of P and P' with respect to P.

Adjacencies in P either belong to a (smallest) irreducible interval, or are free.

E.g. in this diagram

the adjacency (3,4) belongs to the interval [1,5], (2,3) belongs to [2,3], and (8,9)

is free.

When 2 adjacencies belong to the same irreducible interval, then none of these

adjacencies is conserved between P and P'.



Link with rearrangement theories

Theorem 3. Reversals, transpositions, and reverse transpositions are preserving

if and only if all their breakpoints belong to the same irreducible interval, or are

free. Translocations and fissions are preserving if and only if all their breakpoints

are free.

Proof. If the breakpoints of any operation are free, then no conserved interval is cut.

If the breakpoints of a reversal, transposition, or reverse transposition belong to the same

irreducible interval, then the operation reorders, or reverses, some blocks within that

interval, thus preserving conserved intervals.

If a reversal has its two breakpoints in different intervals, it will break those two intervals. If

it has only one free breakpoint, it will break the interval containing the other breakpoint.

The same kind of argument holds for transpositions and reverse transpositions.

If a breakpoint of a translocation or fission is not free, then it belongs to an irreducible

interval whose extremities will end up in two different chromosomes.

It turns out that most rearrangement operations used in optimal scenarios are

indeed preserving.
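Definition 4 can be tested by brute force. The Python sketch below is a hedged illustration, not the lecture's method: the function names `between`, `conserved_intervals` and `is_preserving` are hypothetical, intervals are named as they read in Q (the permutation shared by {P,Q} and {P',Q}), and the two reversals are example choices. For P = 0 1 ... 10 and Q = 0 5 -7 -6 8 9 1 2 3 4 10, reversing the block 6 7 has both breakpoints, (5,6) and (7,8), inside the irreducible interval [5,8], while reversing the block 4 5 has its breakpoints in two different irreducible intervals, [3,4] and [5,8].

```python
def between(perm, a, b):
    """Unsigned elements strictly between a and b (or between -b and -a).

    Returns None if neither a..b nor -b..-a occurs in that order in perm.
    """
    for x, y in ((a, b), (-b, -a)):
        if x in perm and y in perm and perm.index(x) < perm.index(y):
            return {abs(e) for e in perm[perm.index(x) + 1:perm.index(y)]}
    return None


def conserved_intervals(perms):
    """All conserved intervals (a, b), named as they read in perms[0]."""
    first = perms[0]
    found = set()
    for i in range(len(first)):
        for j in range(i + 1, len(first)):
            a, b = first[i], first[j]
            inner = {abs(e) for e in first[i + 1:j]}
            if all(between(p, a, b) == inner for p in perms[1:]):
                found.add((a, b))
    return found


def is_preserving(p, p_prime, q):
    """True if every conserved interval of {P,Q} is conserved in {P',Q}."""
    return conserved_intervals([q, p]) <= conserved_intervals([q, p_prime])


P = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Q = [0, 5, -7, -6, 8, 9, 1, 2, 3, 4, 10]

inside = [0, 1, 2, 3, 4, 5, -7, -6, 8, 9, 10]   # reversal of 6 7
across = [0, 1, 2, 3, -5, -4, 6, 7, 8, 9, 10]   # reversal of 4 5

print(is_preserving(P, inside, Q))  # True
print(is_preserving(P, across, Q))  # False
```

The second reversal destroys the interval [3,4], which matches the statement that a reversal with breakpoints in different irreducible intervals breaks those intervals.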



Summary

Linear-time algorithms have been developed for minimal reversal-distance

rearrangement scenarios.

Open question: which distance measures (breakpoint distance, reversal

distance, interval distance ...) are most appropriate to compare genome

architectures?

Experimental evidence provides new insights into which types of

rearrangements have likely occurred in the past; algorithms need to be

adapted to the biological reality.

The concept of "conserved intervals" sounds very promising – it can account for

arbitrary types of rearrangements.
