9. Lecture WS 2004/05
Bioinformatics III 1
Bioinformatics III: “Systems biology”, “Integrative cell biology”
Summary Part 2: Lectures 9-16
V9 - visualize cellular interaction data
e.g. protein interaction data (undirected):
nodes – proteins
edges – interactions
metabolic pathways (directed):
nodes – substances
edges – reactions
regulatory networks (directed):
nodes – transcription factors + regulated proteins
edges – regulatory interactions
co-localization (undirected):
nodes – proteins
edges – co-localization information
homology (undirected/directed):
nodes – proteins
edges – sequence similarity (BLAST score)
Force-directed algorithm for graph layout
http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html
Various graph layout algorithms have been
developed to solve this visualisation task.
20 years ago, Peter Eades proposed a graph
layout heuristic [A heuristic for graph
drawing. Congressus Numerantium, 42:149-
160, 1984] which is called the ``Spring
Embedder'' algorithm.
Edges are replaced by springs and
vertexes are replaced by rings that
connect the springs. A layout can be
found by simulating the dynamics of such
a physical system.
This method and other methods, which
involve similar simulations to compute the
layout, are called ``Force Directed''
algorithms.
Force-directed algorithm
http://www.it.usyd.edu.au/~aquigley/3dfade/
The edges can be modeled as gravitational (or electrostatic) attraction
and all nodes have an electrical repulsion between them.
It is also possible for the system to simulate unnatural forces acting on the
bodies, which have no direct physical analogy, for example the use of a
logarithmic distance measure rather than Euclidean.
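The simulation described above can be sketched in a few lines. This is only a toy illustration, assuming Hooke-like springs along edges and an inverse-square repulsion between all node pairs; the function name `spring_layout`, all constants and the displacement cap `max_move` are hypothetical choices and are not taken from Eades' paper.

```python
import math
import random

def spring_layout(nodes, edges, iterations=200, k_spring=0.1, k_repulse=0.5,
                  rest_length=1.0, step=0.1, max_move=0.5, seed=0):
    """Toy 2D force-directed layout; all constants are illustrative."""
    rng = random.Random(seed)
    pos = {v: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for v in nodes}
    for _ in range(iterations):
        force = {v: [0.0, 0.0] for v in nodes}
        # Repulsion between every pair of nodes (the quadratic part).
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                f = k_repulse / dist ** 2
                force[u][0] += f * dx / dist
                force[u][1] += f * dy / dist
                force[v][0] -= f * dx / dist
                force[v][1] -= f * dy / dist
        # Spring force along each edge (attractive beyond the rest length).
        for u, v in edges:
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            dist = math.hypot(dx, dy) or 1e-9
            f = k_spring * (dist - rest_length)
            force[u][0] += f * dx / dist
            force[u][1] += f * dy / dist
            force[v][0] -= f * dx / dist
            force[v][1] -= f * dy / dist
        # Move every node along its net force, capping the displacement.
        for v in nodes:
            fx, fy = force[v]
            mag = math.hypot(fx, fy)
            scale = step if mag * step <= max_move else max_move / mag
            pos[v][0] += scale * fx
            pos[v][1] += scale * fy
    return pos

layout = spring_layout(["a", "b", "c"], [("a", "b"), ("b", "c")])
```

The displacement cap is an assumption added for numerical stability; without it, two nodes that start very close together would be thrown far apart in a single step.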
Force-directed algorithm
http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html
Because of the underlying analogy to a physical system, the force directed graph
layout methods tend to meet various aesthetic standards, such as
- efficient space filling,
- uniform edge length (when equal weights and repulsions are used)
- symmetry and the
- capability of rendering the layout process with smooth animation (visual
continuity).
Having these nice features, the force directed graph layout has become
the ``work horse'' of layout algorithms.
It has been successfully adapted to many domains with variations of
implementation.
Scaling
http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html
Force directed layout methods commonly have computational scaling problems.
When there are more than a few thousand vertexes in the graph, the running time
of the layout computation can become unacceptable.
This is caused by the fact that in each step of the simulation, the repulsive
force between each pair of unconnected vertexes needs to be computed,
costing a running time of $O(\tfrac{1}{2}V^2 - E)$.
Here V is the number of vertexes and E is the number of edges in the graph.
This complexity is hard to escape for general graphs without hierarchical structure.
Protein interaction graphs
Ju et al. Bioinformatics 19, 317 (2003)
Most protein interaction data have the following characteristics:
(1) When visualized as a graph, the data yields a disconnected graph with many
connected components
(2) The data yields a nonplanar graph with a large number of edge crossings that
cannot be removed in a 2D drawing
(3) The number of interactions per protein varies widely within the same set of data (degree distribution p(k))
(4) The data often contain protein interactions corresponding to self loops
→ This demands a robust algorithm.
InterViewer: Example of force-directed layout algorithm
Ju et al. Bioinformatics 19, 317 (2003)
InterViewer does not place initial nodes
randomly, but on the surface of a
sphere. It runs for a fixed number of iterations.
The original algorithm has complexity
$O(N^2)$ per timestep, with N the number of nodes.
When using multipole methods, this
can be reduced to $O(N \log N)$.
Time may also be saved by introducing
a cut-off, e.g. only computing
interactions with the neighboring
cells, and by updating the neighbor list infrequently.
Aim: analyze and visualize homologies within the protein universe :-)
50 genomes → 145,579 proteins → $21 \times 10^9$ BLASTP pairwise sequence
comparisons.
Expect that fusion proteins (“Rosetta Stone proteins”) will link proteins
of related function.
Need to visualize extremely large network! Develop stepwise scheme.
LGL
Adai et al. J. Mol. Biol. 340, 179 (2004)
(1) separate original network into connected sets
(2) generate coordinates for each node in each connected set
(using a force-directed layout algorithm and a recipe for the sequential layout of
nodes guided by a minimum spanning tree of the network).
(3) integrate connected sets into one coordinate system via a funnel process:
the connected sets are sorted in descending size by the number of vertices.
The first connected set is placed at the bottom of a potential funnel and other
sets are placed one at a time on the rim of the potential funnel and allowed to
fall towards the bottom where they are frozen in space upon collision with the
previous sets.
We concentrate on step (2) in the following
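Step (1), together with the size ordering used in step (3), can be sketched as follows. This is a minimal illustration using breadth-first search; the function name `connected_sets` and the input format (a node list plus a list of undirected edge pairs) are assumptions, not the LGL implementation.

```python
from collections import deque

def connected_sets(nodes, edges):
    """Split a graph into connected sets, largest first."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:                      # breadth-first traversal
            u = queue.popleft()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(comp))
    # Descending size by number of vertices, as required for the funnel step.
    components.sort(key=len, reverse=True)
    return components

sets = connected_sets(list("abcdef"), [("d", "e"), ("a", "b"), ("b", "c")])
```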
Minimum Spanning Tree
Given: undirected graph G = (V,E)
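where for each edge $(u,v) \in E$
there exists a weight $w(u,v)$ specifying
the cost to connect u and v.
Find an acyclic subset $T \subseteq E$ that
connects all of the nodes and
whose total weight
$$w(T) = \sum_{(u,v) \in T} w(u,v)$$
is minimized.
Popular algorithms by Kruskal and Prim.
Both are greedy algorithms, making the
best choice at the moment.
In general, a greedy strategy carries no guarantee of finding the best global
solution; for the minimum spanning tree problem, however, both algorithms provably find an optimal tree.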
[Cormen]
Kruskal’s algorithm
Consider edges in sorted order by weight.
The arrow points to the edge under consideration at each step.
[Cormen]
Kruskal’s algorithm (II)
Running time O(E log V)
[Cormen]
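A compact sketch of Kruskal's algorithm, assuming vertices are numbered 0..V-1 and edges are given as (weight, u, v) tuples. The function name `kruskal` and the simple union-find with path halving are illustrative choices; the sort dominates the running time.

```python
def kruskal(num_vertices, edges):
    """Return (tree_edges, total_weight) of a minimum spanning forest.

    edges: iterable of (weight, u, v) with vertices 0..num_vertices-1.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Follow parent pointers to the set representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree, total = [], 0
    for w, u, v in sorted(edges):         # consider edges by increasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                      # edge joins two different trees
            parent[ru] = rv
            tree.append((u, v, w))
            total += w
    return tree, total

example_edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3), (5, 1, 3)]
tree, total = kruskal(4, example_edges)
```

In the example, the edge of weight 3 is skipped because it would close a cycle among vertices 0, 1 and 2.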
Intuitive description of LGL
Adai et al. J. Mol. Biol. 340, 179 (2004)
Successive iterations of the layout. The MST determines the order of placement of
the nodes. The root node could be chosen randomly or based on its centrality in the
network (e.g. minimizing the sum of distances to all other nodes). All other nodes
are assigned a level according to their edge-based distance in the MST from the
root node.
Level one vertices (red circles) are placed randomly on a sphere around the root
node (black circle). The system is allowed to iterate through time satisfying attractive
and repulsive forces until at rest.
Level two nodes (blue circles) are placed randomly on spheres directed away from
the current layout. Again, the system is allowed to evolve through time till at rest.
This process is iterated for the entire graph.
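The level assignment that drives the placement order can be sketched as a breadth-first traversal of the MST. The function name `mst_levels` and the edge-list input are assumptions for illustration; the force simulation between placement rounds is omitted.

```python
from collections import deque

def mst_levels(tree_edges, root):
    """Edge-based distance of every node from the root within the MST."""
    adj = {}
    for u, v in tree_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, []):
            if w not in level:            # first visit fixes the level
                level[w] = level[u] + 1
                queue.append(w)
    return level

levels = mst_levels([("r", "a"), ("r", "b"), ("a", "c")], root="r")
# Nodes would then be placed level by level: first level 1, then level 2, ...
```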
What is the role of fusion proteins?
Adai et al. J. Mol. Biol. 340, 179 (2004)
A protein homology map summarizes the results of billions of sequence comparisons by modeling
the proteins as vertices in a network, and the statistically significant sequence similarities as edges
connecting the relevant proteins. In this manner, proteins within a sequence family (such as A, A′,
A″, and AB; or B, B′ and AB) are all or mostly connected to each other, forming a cluster in the map.
Fusion proteins (such as AB) serve to connect their component proteins' families. The structure of
the resulting map reflects historic genetic events, such as gene fusions, fissions, and duplications,
which are responsible for producing the modern-day genes. The map simultaneously represents
homology relationships (edges), remote homologies (proteins not directly connected but in the same
cluster), and non-homologous functional relationships (adjacent clusters and clusters linked by
fusion proteins).
LGL Algorithm for very large biological networks
Adai et al. J. Mol. Biol. 340, 179 (2004)
The complete protein homology map. A layout of the entire protein homology
map; a total of 11,516 connected sets containing 111,604 proteins (vertices)
with 1,912,684 edges. The largest connected set is shown more clearly in the
inset and is enlarged further on the right side.
Functionally related gene families form adjacent clusters
Adai et al. J. Mol. Biol. 340, 179 (2004)
Three examples illustrate spatial
localization of protein function in the map,
specifically
A, the linkage of the tryptophan synthase α
family to the functionally coupled but
non-homologous β family by the yeast
tryptophan synthase αβ fusion protein,
B, protein subunits of the pyruvate
synthase and alpha-ketoglutarate
ferredoxin oxidoreductase complexes,
C, metabolic enzymes, particularly those of
acetyl CoA and amino acid metabolism.
Colocalization
Adai et al. J. Mol. Biol. 340, 179 (2004)
Neighboring proteins tend to be in the
same cellular system. The tendency
for proteins to operate in the same
cellular system, as defined by the
percentage of matching assignments
into the 18 COG database pathways,
is plotted against the spatial
separation in multiples of a typical
cluster size.
The functional similarity decays
exponentially with distance,
proportional to the function $e^{-0.26 d}$,
where d is the separation in units of a typical cluster diameter.
Modularity in molecular networks?
A functional module is, by definition, a discrete entity whose function is
separable from those of other modules.
This separation depends on chemical isolation, which can originate from
spatial localization or from chemical specificity.
E.g. a ribosome concentrates the reactions involved in making a polypeptide
into a single particle, thus spatially isolating its function.
A signal transduction system is an extended module that achieves its isolation
through the specificity of the initial binding of the chemical signal to receptor
proteins, and of the interactions between signalling proteins within the cell.
Hartwell et al. Nature 402, C47 (1999)
Modularity in molecular networks
Modules can be insulated from or connected to each other.
Insulation allows the cell to carry out many diverse reactions without cross-talk
that would harm the cell.
Connectivity allows one function to influence another.
The higher-level properties of cells, such as their ability to integrate information
from multiple sources, will be described by the pattern of connections among their
functional modules.
Hartwell et al. Nature 402, C47 (1999)
Organization of large-scale molecular networks
Organization of molecular networks revealed by large-scale experiments:
- power-law distribution, $P(k) \propto k^{-\gamma}$, or a
similar distribution of the node degree k (i.e. the number of edges of a node)
- small-world property (i.e. a high clustering coefficient and a small shortest path
between every pair of nodes)
- anticorrelation in the node degree of connected nodes (i.e. highly interacting
nodes tend to be connected to low-interacting ones)
These properties become evident when hundreds or thousands of molecules and
their interactions are studied together.
On the other end of the spectrum: recently discovered motifs that consist of 3-4
nodes.
Mesoscale properties of networks
Most relevant processes in biological networks correspond to the
mesoscale (5-25 genes or proteins) not to the entire network.
However, it is computationally enormously expensive to study mesoscale
properties of biological networks.
e.g. a network of 1000 nodes contains on the order of $10^{23}$ possible 10-node sets.
Spirin & Mirny analyzed combined network of protein interactions with data from
CELLZOME, MIPS, BIND: 6500 interactions.
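The size of this search space is a simple binomial coefficient and can be checked directly. The snippet below assumes Python 3.8 or later for `math.comb`; the variable name `n_sets` is arbitrary.

```python
import math

# Number of distinct 10-node subsets of a 1000-node network: C(1000, 10).
n_sets = math.comb(1000, 10)
print(f"{n_sets:.2e}")
```

The result is roughly 2.6 × 10^23, which is why exhaustive enumeration of mesoscale node sets is out of reach.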
Identify connected subgraphs
The network of protein interactions is typically presented as an undirected graph
with proteins as nodes and protein interactions as undirected edges.
Aim: identify highly connected subgraphs (clusters) that have more interactions
within themselves and fewer with the rest of the graph.
A fully connected subgraph, or clique, that is not a part of any other clique is an
example of such a cluster.
In general, clusters need not be fully connected.
Measure the density of connections by
$$Q = \frac{2m}{n(n-1)}$$
where n is the number of proteins in the cluster
and m is the number of interactions between them.
Spirin, Mirny, PNAS 100, 12123 (2003)
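The density Q = 2m / (n(n-1)) is the fraction of possible pairs within the cluster that actually interact. A minimal sketch, assuming each undirected interaction appears once in the edge list; the function name `cluster_density` is hypothetical.

```python
def cluster_density(cluster, edges):
    """Q = 2m / (n (n - 1)) for the subgraph induced by `cluster`."""
    members = set(cluster)
    n = len(members)
    if n < 2:
        raise ValueError("Q is undefined for fewer than two proteins")
    # m: interactions with both partners inside the cluster (self loops ignored)
    m = sum(1 for u, v in edges if u != v and u in members and v in members)
    return 2 * m / (n * (n - 1))

triangle = [("a", "b"), ("b", "c"), ("a", "c")]
q_clique = cluster_density({"a", "b", "c"}, triangle)                    # fully connected
q_sparse = cluster_density({"a", "b", "c", "d"}, triangle)               # d has no links
```

A clique gives Q = 1; adding the unconnected protein d to the triangle halves the density to 0.5.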
(method I) Identify all fully connected subgraphs (cliques)
Generally, finding all cliques of a graph is an NP-hard problem.
Because the protein interaction graph is so far very sparse (the number of interactions
(edges) is similar to the number of proteins (nodes)), this can be done quickly.
To find cliques of size n one needs to enumerate only the cliques of size n-1.
The search for cliques starts with n = 4: pick all pairs of (known) edges (6500 × 6500
pairs of protein interactions) successively.
For every pair A-B and C-D check whether there are edges between A and C, A and
D, B and C, and B and D. If these edges are present, ABCD is a clique.
For every clique identified, ABCD, pick all known proteins successively.
For every picked protein E, if all of the interactions E-A, E-B, E-C, and E-D are known,
then ABCDE is a clique with size 5.
Continue for n = 6, 7, ... The largest clique found in the protein-interaction network
has size 14.
Spirin, Mirny, PNAS 100, 12123 (2003)
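The extension step (every clique of size n is a clique of size n-1 plus one protein that interacts with all of its members) can be sketched as below. For brevity this illustration starts from single edges (size 2) rather than from pairs of edges as on the slide; the function name `cliques_by_size` is hypothetical.

```python
def cliques_by_size(edges):
    """Enumerate all cliques, grouped by size, by growing them one node at a time."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue                      # ignore self loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    current = {frozenset((u, v)) for u in adj for v in adj[u]}
    by_size, size = {}, 2
    while current:
        by_size[size] = current
        larger = set()
        for clique in current:
            # Proteins that interact with every member of the clique.
            common = set.intersection(*(adj[p] for p in clique))
            for extra in common:
                larger.add(clique | {extra})
        current = larger
        size += 1
    return by_size

k4_plus_tail = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
                ("b", "d"), ("c", "d"), ("d", "e")]
found = cliques_by_size(k4_plus_tail)
```

This brute-force growth is only practical because the interaction graph is sparse, as the slide points out.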
(I) Identify all fully connected subgraphs (cliques)
These results include, however, many redundant cliques.
For example, the clique with size 14 contains 14 cliques with size 13.
To find all nonredundant subgraphs, mark all proteins comprising the clique of size
14, and out of all subgraphs of size 13 pick those that have at least one protein
other than marked.
After all redundant cliques of size 13 are removed, proceed to remove redundant
twelves etc.
In total, only 41 nonredundant cliques with sizes 4 - 14 were found.
Spirin, Mirny, PNAS 100, 12123 (2003)
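Removing redundant cliques amounts to keeping only those that are not contained in a larger clique already kept. A minimal sketch, assuming the cliques are supplied as a dictionary from size to a set of frozensets; the function name `nonredundant` and the example data are hypothetical.

```python
def nonredundant(by_size):
    """Keep only cliques that are not a subset of a larger kept clique."""
    kept = []
    for size in sorted(by_size, reverse=True):     # largest cliques first
        for clique in by_size[size]:
            # A clique is redundant if all its proteins are in a larger clique.
            if not any(clique < bigger for bigger in kept):
                kept.append(clique)
    return kept

example = {
    4: {frozenset("abcd")},
    3: {frozenset("abc"), frozenset("abd"), frozenset("acd"),
        frozenset("bcd"), frozenset("cde")},
}
survivors = nonredundant(example)
```

In the example, the four triangles inside {a,b,c,d} are dropped, while {c,d,e} survives because it contains a protein outside the larger clique.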
(method II) Superparamagnetic Clustering (SPC)
SPC uses an analogy to the physical properties of an inhomogeneous ferromagnetic
model to find tightly connected clusters on a large graph.
Every node on the graph is assigned a Potts spin variable $S_i = 1, 2, \ldots, q$.
The value of this spin variable $S_i$ undergoes thermal fluctuations, which are
determined by the temperature T and the spin values on the neighboring nodes.
Energetically, 2 nodes connected by an edge are favored to have the same spin
value. Therefore, the spin at each node tends to align itself with the majority of its
neighbors.
When such a Potts spin system reaches equilibrium for a given temperature T,
high correlation between fluctuating Si and Sj at nodes i and j would indicate that
nodes i and j belong to the same cluster.
Spirin, Mirny, PNAS 100, 12123 (2003)
(II) Superparamagnetic Clustering (SPC)
The protein-interaction network is represented by a graph where every pair of
interacting proteins is an edge of length 1.
The simulations are run for temperatures ranging from 0 to 1 in units of the
coupling strength.
The network splits into monomers at temperatures between 0.7 and 0.8,
whereas larger clusters only exist for temperatures between 0.1 and 0.7.
Clusters are recorded at all values of the temperature.
The overlapping clusters are then merged and redundant ones are removed.
Spirin, Mirny, PNAS 100, 12123 (2003)
(method III) Monte Carlo Simulation
Use MC to find a tight subgraph of a predetermined number of nodes M.
At time t = 0, a random set of M nodes is selected.
For each pair of nodes i,j from this set, the shortest path Lij between i and j on the
graph is calculated.
Denote the sum of all shortest paths Lij from this set as L0.
At every time step one of M nodes is picked at random, and one node is picked at
random out of all its neighbors.
The new sum of all shortest paths, L1, is calculated if the original node were to be
replaced by this neighbor.
If L1 < L0, accept replacement with probability 1.
If L1 > L0, accept replacement with probability
$$\exp\left(-\frac{L_1 - L_0}{T}\right)$$
where T is the effective temperature.
Spirin, Mirny, PNAS 100, 12123 (2003)
(III) Monte Carlo Simulation
Every tenth time step an attempt is made to replace one of the nodes from
the current set with a node that has no edges to the current set to avoid
getting caught in an isolated disconnected subgraph.
This process is repeated
(i) until the original set converges to a complete subgraph, or
(ii) for a predetermined number of steps,
after which the tightest subgraph (the subgraph corresponding to the smallest
L0) is recorded.
The recorded clusters are merged and redundant clusters are removed.
Spirin, Mirny, PNAS 100, 12123 (2003)
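The Monte Carlo search can be sketched as follows. This is an illustration under several assumptions that the slides do not specify: the replacement neighbor is drawn only from nodes outside the current set, the every-tenth-step jump is accepted with the same Metropolis rule, and unreachable pairs are charged a `penalty` equal to the number of nodes. All function names are hypothetical.

```python
import math
import random
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def path_sum(adj, nodes, penalty):
    """Sum of shortest paths L over all pairs in `nodes`."""
    nodes = sorted(nodes)
    total = 0
    for i, u in enumerate(nodes):
        dist = bfs_distances(adj, u)
        total += sum(dist.get(v, penalty) for v in nodes[i + 1:])
    return total

def mc_tight_subgraph(adj, M, T, steps=2000, seed=0):
    """Search for the M-node set with the smallest sum of shortest paths."""
    rng = random.Random(seed)
    all_nodes = sorted(adj)
    penalty = len(all_nodes)              # assumed cost for unreachable pairs
    complete = M * (M - 1) // 2           # L of a complete subgraph
    current = set(rng.sample(all_nodes, M))
    L0 = path_sum(adj, current, penalty)
    best, best_L = set(current), L0
    for t in range(1, steps + 1):
        if best_L == complete:            # converged to a clique
            break
        old = rng.choice(sorted(current))
        if t % 10 == 0:
            # Every tenth step: try a node with no edges to the current set.
            pool = [v for v in all_nodes
                    if v not in current and not (adj[v] & current)]
        else:
            pool = [v for v in sorted(adj[old]) if v not in current]
        if not pool:
            continue
        new = rng.choice(pool)
        trial = (current - {old}) | {new}
        L1 = path_sum(adj, trial, penalty)
        # Metropolis criterion: always accept improvements.
        if L1 <= L0 or rng.random() < math.exp(-(L1 - L0) / T):
            current, L0 = trial, L1
            if L0 < best_L:
                best, best_L = set(current), L0
    return best, best_L
```

The adjacency structure `adj` is expected to map every node to the set of its neighbors. Recomputing all shortest paths at each step is wasteful but keeps the sketch short.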
Optimal temperature in MC simulation
For every cluster size there is an
optimal temperature that gives the
fastest convergence to the tightest
subgraph.
Spirin, Mirny, PNAS 100, 12123 (2003)
Time to find a clique with size 7 in MC steps
per site as a function of temperature T.
The region with optimal temperature is
shown in Inset.
The required time increases sharply as the
temperature goes to 0, but has a relatively
wide plateau in the region 3 < T < 7.
Simulations suggest that the choice of
temperature $T \approx M$ would be safe for any
cluster size M.
Comparison of SPC and Monte Carlo methods
Comparison of clusters found with
SPC (blue) and MC simulation
(red).
Reasonable overlap (ca. one third
of all clusters are found by both
methods), but both methods
seem complementary.
Spirin, Mirny, PNAS 100, 12123 (2003)
Comparison of SPC and Monte Carlo methods
The SPC method is best at detecting high-Q value clusters with relatively few links
with the outside world. An example is the TRAPP complex, a fully connected clique
of size 10 with just 7 links with outside proteins.
This cluster was perfectly detected by SPC, whereas the MC simulation was able to
find smaller pieces of this cluster separately rather than the whole cluster.
By contrast, MC simulations are better suited for finding very “outgoing” cliques.
The Lsm complex, a clique of size 11, includes 3 proteins with more interactions
outside the complex than inside. This complex was easily found by MC, but was not
detected as a stand-alone cluster by SPC.
Spirin, Mirny, PNAS 100, 12123 (2003)
Q: Why does the SPC method work particularly well for finding clusters with high Q values and few links to the outside, whereas the Monte Carlo method mainly finds “outgoing” cliques?
Merging Overlapping Clusters
A simple statistical test shows that nodes which have only one link to a cluster are
statistically insignificant. Remove such statistically insignificant members first.
Then merge overlapping clusters:
For every cluster A_i find all clusters A_k that overlap with this cluster by at least one
protein.
For every such cluster calculate the Q value of a possible merged cluster
A_i ∪ A_k. Record the cluster A_best(i) which gives the highest Q value if merged with A_i.
After the best match is found for every cluster, every cluster A_i is replaced by the
merged cluster A_i ∪ A_best(i), unless the Q value of A_i ∪ A_best(i) is below a certain threshold value
Q_C.
This process continues until there are no more overlapping clusters or until merging
any of the remaining clusters would make a cluster with a Q value lower than Q_C.
Spirin, Mirny, PNAS 100, 12123 (2003)
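A simplified sketch of the merging loop. Note the deviation from the slide: instead of replacing every cluster by its best merge in each round, this variant merges only the single best-scoring overlapping pair per round, which is easier to reason about and always terminates. The function names and the adjacency-dictionary input are assumptions.

```python
def density(cluster, adj):
    """Q = 2m / (n (n - 1)) for the subgraph induced by `cluster`."""
    n = len(cluster)
    if n < 2:
        return 0.0
    m = sum(len(adj[u] & cluster) for u in cluster) // 2
    return 2 * m / (n * (n - 1))

def merge_overlapping(clusters, adj, q_threshold):
    """Greedily merge overlapping clusters while the merged Q stays >= threshold."""
    clusters = {frozenset(c) for c in clusters}
    while True:
        best = None                       # (Q of merged cluster, A_i, A_k)
        for a in clusters:
            for b in clusters:
                if a != b and a & b:      # overlap by at least one protein
                    q = density(a | b, adj)
                    if q >= q_threshold and (best is None or q > best[0]):
                        best = (q, a, b)
        if best is None:                  # nothing left that may be merged
            return clusters
        _, a, b = best
        clusters = (clusters - {a, b}) | {a | b}

adj = {"a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
       "d": {"a", "b", "c", "e"}, "e": {"d"}}
merged = merge_overlapping([{"a", "b", "c"}, {"b", "c", "d"}, {"d", "e"}],
                           adj, q_threshold=0.8)
```

In the example, {a,b,c} and {b,c,d} merge into the clique {a,b,c,d} (Q = 1), whereas adding e would lower Q to 0.7, so {d,e} stays separate.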
Statistical significance of complexes and modules
Number of complete cliques (Q = 1) as
a function of clique size enumerated in
the network of protein interactions
(red) and in randomly rewired graphs
(blue, averaged over >1,000 graphs in which the
number of interactions for each protein
is preserved).
Inset shows the same plot in log-
normal scale. Note the dramatic
enrichment in the number of cliques in
the protein-interaction graph
compared with the random graphs.
Most of these cliques are parts of
bigger complexes and modules.
Spirin, Mirny, PNAS 100, 12123 (2003)
Statistical significance of complexes and modules
Spirin, Mirny, PNAS 100, 12123 (2003)
Distribution of Q of clusters found by the MC search
method.
Red bars: original network of protein interactions.
Blue curves: randomly rewired graphs.
Clusters in the protein network have many more
interactions than their counterparts in the random
graphs.
Discovered functional modules
Spirin, Mirny, PNAS 100, 12123 (2003)
Examples of discovered functional modules.
(A) A module involved in cell-cycle regulation. This module consists of cyclins (CLB1-4 and
CLN2) and cyclin-dependent kinases (CKS1 and CDC28) and a nuclear import protein (NIP29).
Although they have many interactions, these proteins are not present in the cell at the same
time.
(B) Pheromone signal transduction pathway in the network of protein–protein interactions. This
module includes several MAPK (mitogen-activated protein kinase) and MAPKK (mitogen-
activated protein kinase kinase) kinases, as well as other proteins involved in signal
transduction. These proteins do not form a single complex; rather, they interact in a
specific order.
Robustness of clusters found
Model effect of false positives in
experimental data: randomly reconnect,
remove or add 10-50% of interactions
in network.
Cluster recovery probability as a
function of the fraction of altered links.
Black curves correspond to the case
when a fraction of links are rewired.
Red, removed;
green, added.
Circles represent the probability to
recover 75% of the original cluster;
triangles represent the probability to
recover 50%.
Spirin, Mirny, PNAS 100, 12123 (2003)
Noise in the form of removal or addition
of links has a less deteriorating effect
than random rewiring. About 75% of
clusters can still be found when 10% of
links are rewired.
Summary
Here: analysis of meso-scale properties demonstrated the presence of highly
connected clusters of proteins in a network of protein interactions. Strong
support for suggested modular architecture of biological networks.
Distinguish 2 types of clusters: protein complexes and dynamic functional modules.
Both complexes and modules have more interactions among their members than
with the rest of the network.
Dynamic modules are elusive to experimental purification because they are not
assembled as a complex at any single point in time.
Computational analysis allows detection of such modules by integrating pairwise
molecular interactions that occur at different times and places.
However, computational analysis alone does not make it possible to distinguish between
complexes and modules or between transient and simultaneous interactions.
V10 Protein complexes and their shared components
- Most cellular processes result from a cascade of events mediated by proteins
that act in a cooperative manner.
- Protein complexes can share components: proteins can be reused and
participate in several complexes.
Methods for analyzing high-throughput protein interaction data have mainly used
clustering techniques.
They have been applied to assign protein function by inference from the biological
context as given by their interactors, and to identify complexes as dense regions
of the network (see V9).
The logical organization into shared and specific components, and its
representation, remain elusive.
Gagneur et al. Genome Biology 5, R57 (2004)
shared components
Shared components = proteins or groups of proteins occurring in different
complexes are fairly common:
A shared component may be a small part of many complexes, acting as a
unit that is constantly reused for its function.
Also, it may be the main part of the complex e.g. in a family of variant complexes
that differ from each other by distinct proteins that provide functional specificity.
Aim: identify and properly represent the modularity of protein-protein interaction
networks by identifying the shared components and the way they are arranged to
generate complexes.
Gagneur et al. Genome Biology 5, R57 (2004)
Georg Casari, Cellzome (Heidelberg)
Modules
A graph and its modules.
Nodes connected by a link are
called neighbors.
In graph theory, a module is a set
of nodes that have the same
neighbors outside the module.
In addition to the trivial modules {a},
{b},...,{g} and {a,b,c,..,g}, this graph
contains the modules {a,b,c}, {a,b},
{a,c},{b,c} and {e,f}.
Gagneur et al. Genome Biology 5, R57 (2004)
Quotient
Elements of a module have exactly the same neighbors outside the module,
so one can substitute a single representative node for all of them.
In a quotient, all elements of the module are replaced by the representative node,
and the edges with the neighbors are replaced by edges to the representative.
Quotients can be iterated until the entire graph is merged into a final
representative node.
Iterated quotients can be captured in a tree, where each node represents a
module, which is a subset of its parent and the set of its descendant leaves.
Gagneur et al. Genome Biology 5, R57 (2004)
Modular decomposition
Modular decomposition of the
example graph shown before.
Modular decomposition gives a
labeled tree that represents iterations
of particular quotients, here the
successive quotients on the modules
{a,b,c} and {e,f}.
The modular decomposition is a
unique, canonical tree of iterated
quotients
(a formal proof exists: Möhring 1985).
Gagneur et al. Genome Biology 5, R57 (2004)
Modular decomposition
The nodes of the modular decomposition
are labeled in 3 ways:
As series when the direct descendants
are all neighbors of each other,
as parallel when the direct descendants
are all non-neighbors of each other,
and by the structure of the module
otherwise (prime module case).
Gagneur et al. Genome Biology 5, R57 (2004)
Series are labeled by an asterisk within a circle, parallel by two parallel lines within a circle,
and prime by a P within a circle. The prime is advantageously labeled by its structure.
The graph can be retrieved from the tree on the right by recursively expanding the modules
using the information in the labels. Therefore, the labeled tree can be seen as an exact
alternative representation of the graph.
Results from protein complex purifications (PCP), e.g. TAP
Different types of data:
- Y2H: detects direct physical interactions between proteins
- PCP by tandem affinity purification with mass-spectrometric identification of the
protein components: identifies multi-protein complexes
Modular decomposition will have a different meaning due to the different semantics
of such graphs.
Here, focus analysis on PCP content.
PCP experiment: select a bait protein to which the TAP label is attached → co-purify
the bait with those proteins that co-occur in at least one complex with the bait
protein.
In future, integrated view combining both types of data would be preferred.
Gagneur et al. Genome Biology 5, R57 (2004)
Clique and maximal clique
A clique is a fully connected sub-graph, that is, a set
of nodes that are all neighbors of each other.
In this example, the whole graph is a clique and
consequently any subset of it is also a clique, for
example {a,c,d,e} or {b,e}. A maximal clique is a
clique that is not contained in any larger clique. Here
only {a,b,c,d,e} is a maximal clique.
Gagneur et al. Genome Biology 5, R57 (2004)
Assuming complete datasets and ideal results, a permanent complex will appear
as a clique.
The opposite is not true: not every clique in the network necessarily derives from
an existing complex. E.g. 3 connected proteins can be the outcome of a single
trimer, 3 heterodimers or combinations thereof.
Results from protein complex purifications (PCP), e.g. TAP
Interpretation of graph and module labels
for systematic PCP experiments.
(a) Two neighbors in the network are
proteins occurring in the same complex.
(b) Several potential sets of complexes
can be the origin of the same observed
network. Restricting interpretation to the
simplest model (top right), the series
module reads as a logical AND between
its members.
(c) A module labeled ‘parallel’
corresponds to proteins or modules
working as strict alternatives with respect
to their common neighbors.
(d) The ‘prime’ case is a structure where
neither of the two previous cases occurs.
Gagneur et al. Genome Biology 5, R57 (2004)
Obtain maximal cliques
Modular decomposition provides an instruction set to deliver all maximal cliques
of a graph.
In particular, when the decomposition has only series and parallels, the maximal
cliques are straightforwardly retrieved by traversing the tree recursively from top
to bottom.
A series module acts as a product: the maximal cliques are all the combinations
made up of one maximal clique from each “child” node.
A parallel module acts as a sum: the set of maximal cliques is the union of all
maximal cliques from the “child” nodes.
Gagneur et al. Genome Biology 5, R57 (2004)
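The product and sum rules translate directly into a short recursion. In this sketch the decomposition tree is encoded, purely as an assumption for illustration, as nested tuples `(label, children)` with protein names as string leaves; prime modules are not handled.

```python
from itertools import product

def maximal_cliques(node):
    """Maximal cliques encoded by a series/parallel decomposition tree."""
    if isinstance(node, str):             # leaf: a single protein
        return [frozenset([node])]
    label, children = node
    child_cliques = [maximal_cliques(child) for child in children]
    if label == "series":
        # Product: one maximal clique from each child, joined together.
        return [frozenset().union(*combo) for combo in product(*child_cliques)]
    if label == "parallel":
        # Sum: union of the children's maximal cliques.
        return [clique for cliques in child_cliques for clique in cliques]
    raise ValueError("prime modules are not handled in this sketch")

# Hypothetical tree: a and d are shared, b and c are strict alternatives.
tree = ("series", ["a", ("parallel", ["b", "c"]), "d"])
cliques = maximal_cliques(tree)
```

For the example tree the recursion yields the two maximal cliques {a,b,d} and {a,c,d}.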
Consider an undirected graph G=(V,E) with n = |V| vertices and m = |E| edges.
The complement of a graph G is denoted by $\overline{G}$.
If X is a subset of vertices, then G[X] is the subgraph of G induced by X.
Let x be an arbitrary vertex; then $N(x)$ and $\overline{N}(x)$ stand respectively for the
neighborhood of x and its non-neighborhood.
A vertex x distinguishes two vertices u and v if $(x,u) \in E$ and $(x,v) \notin E$.
A module M of a graph G is a set of vertices that is not distinguished by any
vertex outside M.
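The definition can be checked mechanically: a vertex set is a module if every outside vertex is adjacent either to all of its members or to none. A minimal sketch, assuming the graph is given as a dictionary from each vertex to the set of its neighbors; the function name `is_module` and the example graph are hypothetical.

```python
def is_module(adj, candidate):
    """True if no vertex outside `candidate` distinguishes two of its members."""
    module = set(candidate)
    for x in adj:
        if x in module:
            continue
        hits = adj[x] & module
        # x distinguishes members if it sees some of them but not all.
        if hits and hits != module:
            return False
    return True

# Hypothetical graph: a, b, c all attach to d only; e hangs off d.
adj = {"a": {"b", "d"}, "b": {"a", "d"}, "c": {"d"},
       "d": {"a", "b", "c", "e"}, "e": {"d"}}
abc_is_module = is_module(adj, {"a", "b", "c"})   # d sees all, e sees none
ad_is_module = is_module(adj, {"a", "d"})         # c sees d but not a
```

Edges inside the candidate set (here a-b) play no role; only the view from outside matters.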
This part has been shortened considerably. Only the basic aspects of the algorithm are important.
A simple linear algorithm for modular decomposition
The modules of a graph are a potentially exponentially-sized family.
However, the sub-family of strong modules, the modules that overlap no other
modules, has size O(n).
A overlaps B if $A \cap B \neq \emptyset$, $A \setminus B \neq \emptyset$ and $B \setminus A \neq \emptyset$.
The inclusion order of this family defines the previously explained
modular tree decomposition, which is enough to store the module family of a
graph.
The root of this tree is the trivial module V and its n leaves are the trivial modules
{x}, $x \in V$.
Habib, de Montgolfier, Paul (2004)
Aim: a simple linear algorithm for modular decomposition
Any graph G with at least 3 vertices is either not connected,
or its complement $\overline{G}$ is not connected,
or G and $\overline{G}$ are both connected.
In the last case, the maximal modules define a partition of the vertex-set and are
said to be a prime composition.
The modular decomposition tree can be recursively built by a top-down approach.
At each step, the algorithm recurses on graphs induced by the maximal strong
modules. This technique gives an $O(n^4)$ complexity.
Here, derive a linear-time algorithm that computes a modular factorizing
permutation without computing the underlying decomposition tree.
This tree may be derived in a second step.
Habib, de Montgolfier, Paul (2004)
Modular decomposition of protein interaction graphs
A graph and its modular tree decomposition. The set {1,2} is a strong module.
The module {7,8} is weak: it is overlapped by the module {8,9}.
The permutation σ = (1,2,3,4,5,6,7,8,9) is a modular factorizing permutation.
Habib, de Montgolfier, Paul (2004)
Module-factorizing orders
Let G=(V,E) be a graph and let O be a partial order on V.
For two comparable elements x and y where x <_O y we say that x precedes y and y
follows x.
Two subsets A and B cross if there exist a, a' ∈ A and b, b' ∈ B such that
a <_O b and a' >_O b'.
A linear extension of a partially ordered set ('poset') is a completion of the poset
into a total order.
Definition 1. A partial order O is a Module-Factorizing Partial Order (MFPO) of
V(G) if any pair of non-intersecting strong modules of G do not cross.
The modular factorizing permutations are exactly the module-factorizing total orders.
Proposition 1. A partial order O is an MFPO if and only if it can be completed into a
factorizing permutation.
Habib, de Montgolfier, Paul (2004)
Module-factorizing orders
Definition 2. An ordered partition is a collection {P1, ..., Pk} of pairwise disjoint
parts, with V = P1 ∪ ... ∪ Pk, and an order O such that for all
x ∈ Pi and y ∈ Pj, x <_O y if i < j.
Start with trivial partition (a single part equal to the vertex set) and iteratively
extend (or refine) it until every part is a singleton.
A center vertex c ∈ V is distinguished and two refining rules, preserving the MFPO
property, are used. They are defined in Lemma 1:
Habib, de Montgolfier, Paul (2004)
Defining rules
Lemma 1.
1. Center Rule: For any vertex c, the ordered partition
is module-factorizing.
Habib, de Montgolfier, Paul (2004)
The center rule picks a center and breaks a trivial partition to start the
algorithm.
Once launched, the process goes on based on the pivot rule, that splits each
part Pi (except the part Pi that contains the pivot), according to the neighborhood
of the pivot.
Lemma 1 continued.
2. Pivot Rule: Let O = {P1, ..., Pk} be an ordered partition with
center c and p ∈ Pi such that Pj, i ≠ j, overlaps N(p).
If O is an MFPO, then the following refinements preserve the module-
factorizing property:
Defining rules: pivot rule
Habib, de Montgolfier, Paul (2004)
Preliminary algorithm
Partition refinement scheme that outputs a partition of V into the maximal
modules not containing c.
Habib, de Montgolfier, Paul (2004)
When this algorithm ends, every part is a module. To obtain a factorizing
permutation it has to be recursively relaunched on the non-singleton parts.
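The refinement idea can be illustrated with a deliberately naive sketch. It ignores the ordering of the parts, so it does not produce a factorizing permutation and is not linear-time; it only shows how splitting parts by pivot neighborhoods ends in the maximal modules not containing the center. The function name and the toy graph are assumptions made for illustration.

```python
def maximal_modules_avoiding(adj, c):
    """Unordered partition refinement around a center vertex c.

    Starts from the split {non-neighbors of c, neighbors of c} and keeps
    splitting any part that some pivot p outside the part distinguishes.
    """
    vertices = set(adj)
    parts = [p for p in (vertices - adj[c] - {c}, set(adj[c])) if p]
    changed = True
    while changed:
        changed = False
        for p in vertices - {c}:
            refined = []
            for part in parts:
                inside = part & adj[p]
                outside = part - adj[p]
                if p not in part and inside and outside:
                    refined.extend([inside, outside])  # N(p) splits this part
                    changed = True
                else:
                    refined.append(part)
            parts = refined
    return parts


# Hypothetical graph: a - b, b - c, b - d (c and d are twins).
adj = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"}}
print(sorted(sorted(p) for p in maximal_modules_avoiding(adj, "a")))
# [['b'], ['c', 'd']]
```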
Habib, de Montgolfier, Paul (2004)
Execution example of algorithm
The resulting factorizing permutation is (a, s, v, w, u, y, x, z, t).
Ordered chain partition yields linear-time algorithm
Definition 3. An ordered chain partition (OCP) is a partial order such that each
vertex belongs to one and only one chain, and one chain belongs to one and
only one part. The vertices of the same chain are totally ordered, the chains
of the same part are incomparable, and the parts are totally ordered.
Habib, de Montgolfier, Paul (2004)
A trivial chain contains only 1 vertex, and a monochain part contains only one
chain. The OCPs generalize the Ordered Partitions since the latter ones contain
only trivial chains.
Ordered chain partition yields linear-time algorithm
C(x) denotes the chain containing x while P(x) denotes the part of the partition
containing x.
Each chain C has a representative vertex r(C) ∈ C.
During the algorithm, the chains will behave as their representative vertices.
Chains are possibly merged. Then, the representative of the new chain is one of
the former representatives. But chains will never be split.
The algorithm still uses the center and pivot rules.
The chains are moved by these 2 rules, according to the adjacency between
their representative vertex and the center or the pivot.
But there is a third rule, the chaining rule (line 9 of algorithm).
Habib, de Montgolfier, Paul (2004)
Defining rule 3: Chaining rule
There is a third rule, the chaining rule
Unlike the first two rules, the third rule removes comparisons from the order.
It first concatenates a sequence of monochain parts, that occur consecutively in
O, into one chain. Then this new chain is inserted into one of the two parts,
say P, neighboring the chain.
Chaining rule, chaining the black vertices into P.
Habib, de Montgolfier, Paul (2004)
The comparisons between the chain and P are lost.
But since the number of chains strictly decreases during the algorithm,
the process is guaranteed to end.
Ordered chain partition yields linear-time algorithm
Use each vertex a constant number of times as a pivot.
Habib, de Montgolfier, Paul (2004)
Habib, de Montgolfier, Paul (2004)
Execution example of algorithm
The resulting factorizing permutation is (a, s, v, w, u, y, x, z, t).
Summary: a simple, linear-time algorithm is now available for modular decomposition of graphs.
What is the meaning of such modules when applied to real data?
In the modular decomposition tree, the leaves are proteins,
the root represents the whole network.
In between, each node is a module that is a sub-part of its parent.
The label of a node gives the nature of the relationship between its direct children.
Proteins or modules in a parallel module can be seen as
alternatives. If A is neighbor of B and C, which are not neighbors
of each other, then A can belong to a complex together with
either B or C, but not with both at the same time.
B and C define a parallel module and thus are alternative
partners in a complex with their common neighbor A.
This situation corresponds to a logical „exclusive OR“
between B and C.
Interpretation for PCP protein interaction networks
Gagneur et al. Genome Biology 5, R57 (2004)
Proteins or modules in a series module can be
seen as potentially combined in any way.
If A is neighbor of B and C, and B and C are
also neighbors, then A can belong to a complex
together with B or C, or with both at the same
time.
This corresponds to a logical „OR“ between B
and C.
A series module can be seen as a unit: a set of
proteins (modules) that function together.
A ‚prime‘ is a graph where neither of these cases
occurs.
Interpretation for PCP protein interaction networks
Gagneur et al. Genome Biology 5, R57 (2004)
Three examples of modular
decomposition of protein-protein
interaction networks. In each case
from top to bottom: schema of
complexes, the corresponding
protein-protein interaction network as
determined from PCP experiments,
and its modular decomposition
(MOD).
(a) Protein phosphatase 2A. Parallel
modules group proteins that do not
interact but are functionally
equivalent. Here these are the
catalytic Pph21 and Pph22 (module
2) and the regulatory Cdc55 and
Rts1 (module 3).
Back to the real world …
Gagneur et al. Genome Biology 5, R57 (2004)
Gagneur et al. Genome Biology 5, R57 (2004)
RNA polymerases I, II and III
A good layout of the corresponding network
gives an intuitive idea of what the constitutive
units of the complexes are. Modular
decomposition extracts them and makes their
logical combinations explicit.
Summary
Gagneur et al. Genome Biology 5, R57 (2004)
Ongoing: need for modular description of molecular biology.
What are suitable modules?
Spirin&Mirny, Barabasi et al. : identify dense parts of the network
Alon and co-workers: identify (small) repeated motifs
Here: apply established method of modular graph decomposition
to protein interaction networks. It can be (and has been) applied to other networks.
What is the biological relevance of modules at different levels?
Integrate with gene ontology?
V11 – modules in cellular networks – wrap up
traditional biology (reductionist approach) produces long lists:
lists of genes in genomes
lists of transcripts in different cell types
lists of protein interactions in model organisms
genomes, transcriptomes, proteomes, interactomes,
databases of genetic perturbations, and corresponding phenotypes
How to make sense of it all?
Will meaningful hypotheses and discoveries emerge?
→ systems biology: formalized mathematical modeling, simulations, quantitative measurements.
Still room for reductionism: test hypotheses from systems biology experiments.
Gagneur et al. Genome Biology 5, R57 (2004)
Strategies to detect communities in networks
„Community“ stands for module, class, group, cluster, ...
Define community as a subset of nodes within the graph such that connections
between the nodes are denser than connections with the rest of the network.
The detection of community structure is generally intended as a procedure for
mapping the network into a tree („dendrogram“ in social sciences).
Radicchi et al. PNAS 101, 2658 (2004)
Leaves: nodes; branches join nodes or (at higher level) groups of nodes.
Agglomerative algorithms for mapping to tree
Traditional method to perform this mapping: hierarchical clustering.
For every pair i,j of nodes in the network compute weight Wij that measures how
closely connected the vertices are.
Starting from the set of all nodes and no edges,
links are iteratively added between pairs of
nodes in order of decreasing weight.
In this way nodes are grouped into larger and larger
communities, and the tree is built up to the root,
which represents the whole network.
„agglomerative“ algorithm
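The agglomerative procedure described above can be sketched as follows. This is a minimal single-linkage illustration, assuming the weights are supplied as a dictionary keyed by node pairs; the function name, the union-find bookkeeping, and the toy weights are choices made here, not taken from the cited papers.

```python
def agglomerate(nodes, weights):
    """Add links in order of decreasing weight and record each merge.

    weights: dict {(i, j): W_ij}. Returns a list of
    (weight, group_a, group_b) tuples, i.e. the tree from leaves to root.
    """
    parent = {v: v for v in nodes}
    members = {v: {v} for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    merges = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # i and j are already in the same community
        merges.append((w, frozenset(members[ri]), frozenset(members[rj])))
        parent[rj] = ri
        members[ri] |= members.pop(rj)
    return merges


# Hypothetical weights: two tight pairs joined by one weak link.
weights = {("a", "b"): 0.9, ("c", "d"): 0.8, ("b", "c"): 0.2}
merges = agglomerate("abcd", weights)
for w, left, right in merges:
    print(w, sorted(left), sorted(right))
```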
Girvan, Newman, PNAS 99, 7821 (2002)
Radicchi et al. PNAS 101, 2658 (2004)
Here: 3 communities of densely connected vertices (circles with solid lines) with a much lower density of connections (gray lines) between them.
Possible definitions of the weights
(1) number of node-independent paths between vertices
2 paths that connect the same pair of vertices are said to be node-independent if
they share none of the same vertices other than their initial and final vertices.
(2) edge-independent paths.
It has been shown that the number of node-independent (edge-independent) paths
between 2 vertices i and j in a graph is equal to the minimum number of vertices
(edges) that must be removed from the graph to disconnect i and j from one
another (Menger, 1927).
→ These numbers are a measure of the robustness of the network to deletion of
nodes (edges).
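For the edge-independent case, the count can be obtained with a unit-capacity maximum flow, which by Menger's theorem also equals the minimum number of edges whose removal disconnects the two vertices. The sketch below uses breadth-first augmenting paths; the function name and the example graph are illustrative assumptions, and s and t must be different vertices.

```python
from collections import deque


def edge_independent_paths(adj, s, t):
    """Number of edge-disjoint paths between s and t in an undirected graph."""
    # Each undirected edge gets residual capacity 1 in both directions.
    cap = {(u, v): 1 for u in adj for v in adj[u]}
    flow = 0
    while True:
        prev = {s: None}
        queue = deque([s])
        while queue and t not in prev:
            u = queue.popleft()
            for v in adj[u]:
                if v not in prev and cap[(u, v)] > 0:
                    prev[v] = u
                    queue.append(v)
        if t not in prev:
            return flow  # no augmenting path left
        v = t
        while prev[v] is not None:  # push one unit of flow back along the path
            u = prev[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1


# Hypothetical graph: a 4-cycle 1-2-3-4 with a pendant vertex 5 attached to 1.
adj = {1: {2, 4, 5}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}, 5: {1}}
print(edge_independent_paths(adj, 1, 3))  # 2
print(edge_independent_paths(adj, 5, 3))  # 1
```

The second call shows the pathology discussed later in the lecture: a peripheral vertex attached by a single edge always gets the lowest possible count.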
Girvan, Newman, PNAS 99, 7821 (2002)
Possible definitions of the weights (II)
(3) count total number of paths that run between them (not just those that are
node- or edge-independent).
Because the number of paths between any 2 vertices is either 0 or infinite, one
typically weights paths of length l by a factor α^l with small α so that the weighted
count of number of paths converges.
Thus long paths contribute exponentially less weight than short paths.
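The weighted count can be written in closed form. In the sketch below, A denotes the adjacency matrix of the graph (a symbol introduced here, not used on the slide), so that the (i,j) entry of A^l is the number of paths of length l between i and j:

```latex
W \;=\; \sum_{l=0}^{\infty} (\alpha A)^{l} \;=\; (I - \alpha A)^{-1},
\qquad \text{which converges for } \alpha < \frac{1}{\lambda_{\max}(A)},
```

where λ_max(A) is the largest eigenvalue of A. The entry W_ij is then the weight assigned to the vertex pair (i, j).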
These node- or edge-dependent path definitions for weights work okay for certain
community structures, but show typical pathologies.
Girvan, Newman, PNAS 99, 7821 (2002)
Problems
In particular, counting either node- or edge-independent paths has a tendency
to separate single peripheral vertices from the communities to which they should
rightly belong.
If a vertex is, e.g., connected to the rest of a network by only a single edge then, to
the extent that it belongs to any community, it should clearly be considered to
belong to the community at the other end of that edge.
Unfortunately, both the numbers of independent paths and the weighted path
counts for such vertices are small and hence single nodes often remain isolated
from the network when the communities are constructed.
This and other pathologies make the hierarchical clustering method, although
useful, far from perfect.
Girvan, Newman, PNAS 99, 7821 (2002)
New strategy: Use “betweenness” as definition of weights
Focus on those edges that are least central, that are „between“ communities.
Define edge betweenness of an edge as the number of shortest paths between
pairs of vertices that run along it.
If there is more than one shortest path between a pair of vertices, each path is
given equal weight such that the total weight of all of the paths is 1.
If a network contains communities or groups that are only loosely connected by a
few intergroup edges, then all shortest paths between different communities must
go along one of these few edges.
→ The edges connecting communities will have high edge betweenness.
By removing these edges we separate groups from one another and so reveal the
underlying community structure of the graph.
Girvan, Newman, PNAS 99, 7821 (2002)
GN Algorithm
1. Calculate betweenness for all m edges in a graph of n vertices
(can be done in O(mn) time).
2. Remove the edge with the highest betweenness.
3. Recalculate betweenness for all edges affected by the removal.
4. Repeat from step 2 until no edges remain.
Because step 3 may have to be done for all edges, the algorithm runs in worst-case time
O(m^2 n).
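The four steps can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it recomputes the betweenness of all edges after every removal instead of only the affected ones, and it stops at the first split. The betweenness routine follows the standard breadth-first accumulation scheme; all names and the example graph are assumptions.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness of an unweighted, undirected graph (dict of sets)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate from the farthest vertices
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    # Every vertex pair was counted once from each endpoint.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman_first_split(adj):
    """Remove highest-betweenness edges until the graph falls apart."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start and any(adj.values()):
        bet = edge_betweenness(adj)
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical graph: two triangles joined by the single edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(edge_betweenness(adj)[frozenset((3, 4))])  # 9.0
print(sorted(sorted(c) for c in girvan_newman_first_split(adj)))
# [[1, 2, 3], [4, 5, 6]]
```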
Girvan, Newman, PNAS 99, 7821 (2002)
Application of the Girvan & Newman algorithm
(a) The friendship network from Zachary's karate club study. Nodes associated with the club administrator's faction are drawn as circles, those associated with the instructor's faction are drawn as squares. (b) Hierarchical tree showing the complete community structure for the network calculated by using the algorithm presented in this article. The initial split of the network into two groups is in agreement with the actual factions observed by Zachary, with the exception that node 3 is misclassified. (c) Hierarchical tree calculated by using edge-independent path counts, which fails to extract the known community structure of the network.
Girvan, Newman, PNAS 99, 7821 (2002)
Divisive algorithms for mapping to tree
The tree is constructed in the reverse order compared with agglomerative algorithms:
start with the whole graph and iteratively cut the edges
divide network progressively into smaller and smaller disconnected subnetworks
identified as the communities.
Crucial point: how to select those edges to be cut.
Example: Girvan & Newman algorithm (GN)
Problem of the GN algorithm: it requires the repeated evaluation of a global property,
the betweenness, for each edge; its value depends on the properties of the whole
system.
→ This becomes computationally very expensive for networks with e.g. 10000 nodes.
Radicchi et al. PNAS 101, 2658 (2004)
Faster algorithm
Introduce divisive algorithm that only requires the consideration of local quantities.
Need: quantity that can single out edges connecting nodes belonging to different
communities.
Consider edge-clustering coefficient:
number of triangles to which a given edge belongs divided by the number of
triangles that might potentially include it, given the degrees of the adjacent
nodes.
For the edge connecting node i to node j, the edge-clustering coefficient is

$$C^{(3)}_{i,j} = \frac{z^{(3)}_{i,j} + 1}{\min[(k_i - 1), (k_j - 1)]}$$

where $z^{(3)}_{i,j}$ is the number of triangles built on that edge and
$\min[(k_i - 1), (k_j - 1)]$ is the maximal possible number of them.
1 is added to $z^{(3)}_{i,j}$ to remove degeneracy for $z^{(3)}_{i,j} = 0$.
Radicchi et al. PNAS 101, 2658 (2004)
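The coefficient is cheap to evaluate because it only needs the neighbor sets of the two end nodes. The following is a minimal sketch under the dictionary-of-sets representation; returning infinity when the denominator is zero (an end node of degree 1) is an assumption made here so that such edges are never the first to be removed.

```python
def edge_clustering_coefficient(adj, i, j):
    """Edge-clustering coefficient of order 3 for the edge (i, j)."""
    triangles = len(adj[i] & adj[j])  # common neighbors close a triangle
    possible = min(len(adj[i]) - 1, len(adj[j]) - 1)
    if possible == 0:
        return float("inf")  # assumption: pendant edges are kept
    return (triangles + 1) / possible


# Hypothetical graph: two triangles joined by the single edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(edge_clustering_coefficient(adj, 1, 2))  # 2.0 (inside a community)
print(edge_clustering_coefficient(adj, 3, 4))  # 0.5 (between communities)
```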
Faster algorithm
Edges connecting nodes in different communities are included in few or no
triangles and tend to have small values of Ci,j(3).
On the other hand, many triangles exist within clusters.
By considering higher order cycles one can define coefficients of order g

$$C^{(g)}_{i,j} = \frac{z^{(g)}_{i,j} + 1}{s^{(g)}_{i,j}}$$

where $z^{(g)}_{i,j}$ is the number of cyclic structures of order g the edge (i,j) belongs to,
and $s^{(g)}_{i,j}$ is the number of possible cyclic structures of order g that can be built
given the degrees of the nodes.
Radicchi et al. PNAS 101, 2658 (2004)
Define, for every g, a detection algorithm that works exactly as the GN method
with the difference that, at every step, the removed edges are those with the
smallest value of Ci,j(g).
By considering increasing values of g, one can smoothly interpolate between a
local and a nonlocal algorithm.
Comparison with GN algorithm
Plot of the dendrograms for the network of college football teams, obtained by
using the GN algorithm (Left) and the algorithm of Radicchi et al. with g = 4 (Right).
Different symbols denote teams belonging to different conferences.
In both cases, the observed communities perfectly correspond to the conferences,
with the exception of the six members of the „Independent conference“, which are
misclassified.
Radicchi et al. PNAS 101, 2658 (2004)
Simple network clustering based on shortest-path distance
Aim: compute modular organization of cellular networks controlling specific
biological responses.
Ideas:
(i) the shortest path between any two vertices (proteins) is probably the most
relevant for functional associations;
(ii) each vertex in a network has a unique profile of shortest-path distances through
the network to every other vertex
(iii) module comembers are likely to have similar (clustered) shortest-path-distance
profiles.
Rives & Galitski PNAS 100, 1128 (2003)
Network clustering
Yeast PI network; 4079 proteins, 6761 protein interactions.
MIPS: 133 signaling proteins, 64 have ≥ 1 interaction with another signaling
protein.
Algorithm: assign length 1 to each edge in protein interaction network.
Compute all-pairs shortest-path distance matrix: contains length of the shortest
path (distance) d between every pair of vertices in the network.
Convert into „association matrix“ using 1/d^2.
Associations range from 0 to 1.
Emphasizes local association in subsequent clustering.
Use hierarchical agglomerative average-linkage clustering.
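The first steps of this procedure can be sketched as follows: unit edge lengths, breadth-first search for the all-pairs distances, and conversion to associations 1/d^2. Setting the association of unreachable pairs to 0 is an assumption made here, and the function names are illustrative. The resulting matrix could then be passed to any average-linkage clustering routine.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source (every edge has length 1)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def association_matrix(adj):
    """Association 1/d^2 for every ordered pair of distinct vertices."""
    assoc = {}
    for u in adj:
        dist = shortest_path_lengths(adj, u)
        for v in adj:
            if u == v:
                continue
            d = dist.get(v)  # None if v cannot be reached from u
            assoc[(u, v)] = 1.0 / d ** 2 if d else 0.0  # assumption: 0 if unreachable
    return assoc


# Hypothetical network: chain a - b - c plus an isolated protein x.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "x": set()}
assoc = association_matrix(adj)
print(assoc[("a", "b")], assoc[("a", "c")], assoc[("a", "x")])  # 1.0 0.25 0.0
```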
Rives & Galitski PNAS 100, 1128 (2003)
Q: Based on this simple measure, construct an algorithm that identifies the proteins belonging to a biochemical pathway in a protein interaction network.
Clustering of yeast signaling protein interaction network
A symmetrical matrix of 64 proteins of the
MIPS-database signaling category was
clustered identically in both dimensions. The
cluster tree is not shown. Each row or
column represents a protein. Each feature is
the intersection of two proteins and is a
grayscale representation of pairwise protein
association.
Columns to the right of the clustered network
represent MIPS-defined signaling pathways
[P, polarity-PKC; R, Ras; H, HOG; M,
mating/filamentation MAPK (mfMAPK)].
White bars in the MIPS-pathway columns
indicate protein members of the pathway.
Ras-pathway proteins form a single
cluster.
3 MAPK pathways as clusters.
Rives & Galitski PNAS 100, 1128 (2003)
Q: The diagram above was obtained by applying a simple measure for the distance between two proteins in an interaction network. What do you expect for the proteins of a biochemical pathway?
Network clustering of high-throughput data sets
HTS-Data usually has high (50%) false-positive error frequencies!
Also, many binary interactions may not occur within modules.
Because interacting proteins usually localize in the same subcellular compartment
one may integrate interaction and localization data for the identification of modules.
Single proteins with many interactions in Y2H screens (hubs) nucleate large
clusters that are not modules.
Rives & Galitski PNAS 100, 1128 (2003)
examples of derived clusters
Clustering of the yeast nuclear-protein network
derived from high-throughput interaction and
localization data.
(A) Examples of clusters representing module
rudiments are labeled. The cluster tree is not
shown. Arrows indicate high-connectivity hub
proteins.
(B) Example clusters are shown in detail.
Cluster comembers participating in some
common structure or function have large bold
labels.
Rives & Galitski PNAS 100, 1128 (2003)
Properties of hubs
All hub proteins indicated bind > 90 proteins in the global Y2H network.
The proteins bound by these hubs are randomly distributed in cellular
compartments.
The nuclear-localized proteins bound by these hubs form the 4 largest clusters.
Proteins bound by high-connectivity hubs will have few or no interactions among
themselves if they are not functionally associated („hub-and-spokes“ structure).
→ Proteins bound by each high-connectivity hub are not functionally associated with
each other, and their clusters do not represent modules.
Rives & Galitski PNAS 100, 1128 (2003)
connectivity neighborhood clustering
Global protein connectivity versus neighborhood clustering. Each protein in the
global protein network is plotted by its connectivity, k, and its neighborhood
clustering, C. Arrows indicate high-connectivity proteins shown in Fig. 2A.
Rives & Galitski PNAS 100, 1128 (2003)
The 4 high-connectivity hubs are among 15 outliers. Although these proteins have
exceedingly high connectivity, they almost completely lack neighborhood clustering.
→ Useful criterion to distinguish modules from nonmodules?
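Both quantities in the plot can be computed per protein. The sketch below assumes that neighborhood clustering means the standard clustering coefficient, i.e. the fraction of possible links among the neighbors of a protein that are actually present; the function name and the example graphs are illustrative.

```python
def connectivity_and_clustering(adj, v):
    """Return (k, C): degree of v and clustering among its neighbors."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return k, 0.0  # no neighbor pairs, clustering set to 0
    # Each link between two neighbors is seen from both of its ends.
    links = sum(len(adj[u] & neighbours) for u in neighbours) / 2
    return k, links / (k * (k - 1) / 2)


# Hypothetical hub-and-spokes structure versus a fully connected triangle.
star = {"hub": {"p", "q", "r", "s"}, "p": {"hub"}, "q": {"hub"},
        "r": {"hub"}, "s": {"hub"}}
triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
print(connectivity_and_clustering(star, "hub"))  # (4, 0.0)
print(connectivity_and_clustering(triangle, 1))  # (2, 1.0)
```

A hub with high k and C close to 0 is the hub-and-spokes pattern described above.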
Application to biological-response networks
Incorporate network clustering into a 3-step process to study complex biomolecular
systems; this generates a modular network-structure model:
(i) compile known and suspected components of the response network (from
databases, expression profiling, proteomics, genetic screens, metabolite profiles ...)
(ii) cluster network based on interactions between vertices. Edges can represent
any type of interaction.
(iii) abstract modular network-structure model showing modules.
Cluster 90 filamentation-network proteins that have ≥ 1 interaction with other
filamentation proteins.
Rives & Galitski PNAS 100, 1128 (2003)
Clustering of the yeast filamentation network
Proteins of the yeast
filamentation network were
clustered. A tree-depth
threshold was set.
Tree branches with ≥ 3 leaves
(clusters with ≥ 3 proteins)
below the tree threshold are
shown.
Bullets and large bold labels
indicate proteins of highest
intracluster connectivity.
Rives & Galitski PNAS 100, 1128 (2003)
Modular model of the yeast filamentation network
Clusters indicated in Fig. 4 are
abstracted as modules. All intermodule
paths in the filamentation network are
indicated as black lines with the
interacting proteins at the termini.
A gray line connecting the Ras and
protein kinase A modules was added to
indicate a connection mediated by the
small molecule cAMP.
Rives & Galitski PNAS 100, 1128 (2003)
Biological Insights from modular network abstraction
(1) In an integrated network, data on molecules and interactions shows clustered
organization that can be identified quantitatively
(2) Cluster co-member genes show significant coordination of expression change,
as expected for genes involved in a collective function.
(3) Cluster co-member genes show significant overrepresentation of biological-
process annotations, indicating collective function.
(4) The modular network abstraction intuitively stimulates testable biological
insights on complex biological properties.
Prinz et al. Genome Research 14, 380 (2004)
Evolutionary conservation of motif constituents in the yeast protein interaction network
Wuchty, Oltvai, Barabasi, Nature Gen 35, 176 (2003)
Question: why are some cellular components conserved across species
but others evolve rapidly?
Many biological functions are carried out by the integrated activity of highly
interacting cellular components = functional modules
Motifs = topologically distinct interaction patterns within complex networks
may represent the simplest building blocks of modules.
Here, test the correlation between a protein‘s evolutionary rate and the
structure of the motif it is embedded in
identify all 2-, 3-, 4-node motifs and some 5-node motifs
shared components
Data from DIP database,
3183 interacting yeast proteins
if there is evolutionary pressure to
maintain specific motifs, their
components should be evolutionarily
conserved and have identifiable
orthologs in other organisms.
Study conservation of 678 S. cerevisiae
proteins with an ortholog in each of 5
higher eukaryotes:
Arabidopsis thaliana, C. elegans,
Drosophila melanogaster, Mus
musculus, Homo sapiens.
Wuchty, Oltvai, Barabasi, Nature Gen 35, 176 (2003)
Algorithm to detect all
n-node subgraphs:
scan all rows of the adjacency
matrix M. For each non-zero
element (i,j) representing a link,
scan through all neighbors of
(i,j) until a specific n-node
subgraph is detected.
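For n = 3 the scan can be written out in a few lines. The sketch below uses neighbor sets instead of an explicit adjacency matrix and distinguishes the two connected 3-node patterns (closed triangle and open linear path); the function name and the example network are assumptions, and larger motifs would need a more general subgraph enumeration.

```python
def three_node_motifs(adj):
    """Enumerate triangles and linear 3-node paths in an undirected graph."""
    triangles, paths = set(), set()
    for i in adj:
        for j in adj[i]:  # each link (i, j), i.e. non-zero matrix element
            for k in adj[j]:  # scan the neighbors of the link
                if k == i:
                    continue
                trio = frozenset((i, j, k))
                if k in adj[i]:
                    triangles.add(trio)   # i, j, k are mutually linked
                else:
                    paths.add((trio, j))  # open path i - j - k with center j
    return triangles, paths


# Hypothetical network: triangle 1-2-3 with protein 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
triangles, paths = three_node_motifs(adj)
print(len(triangles), len(paths))  # 1 2
```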
shared components
#motifs of a given kind in the yeast PI
network
fraction of original yeast motifs that is
evolutionarily fully conserved: each of
their protein components belongs to
678 orthologous proteins
fraction of motifs that is fully conserved
for the random ortholog distribution
column 4 / column 5
less than 5% of #2 (linear 3-component
proteins) are completely maintained
Wuchty, Oltvai, Barabasi, Nature Gen 35, 176 (2003)
47% of the fully connected pentagons
(#11) are fully conserved!
topology conservation of individual proteins
Larger motifs tend to
be conserved as a
whole, where each
component has an
ortholog.
Wuchty, Oltvai, Barabasi, Nature Gen 35, 176 (2003)
E.g. less than 1% of the fully connected pentagon motifs disappeared completely,
for 69% of them, each of the subunits had an ortholog in human.
Clear correlation between the conservation rate and the degree of saturation of
a motif.
Participation in motifs substantially influences the evolutionary conservation of
specific components.
From 65% (C = 0) to 84% (C = 1) of neighbors of a human ortholog were also
human orthologs (filled circles). The conserved fraction of the nonorthologous
protein‘s neighborhood is markedly smaller.
Enrichment = ratio between the percentages of orthologous proteins at distance d
from an ortholog in the natural and the random orthologous sets.
d: shortest distance between i and target protein measured along network links.
Proteins that interact directly with an ortholog (d = 1) have a 50% higher chance of
conservation than at random!
Wuchty, Oltvai, Barabasi, Nature Gen 35, 176 (2003)
clustering coefficient conservation of proteins ?
Examine if the specific function of the yeast proteins within motifs affects their rate
of evolutionary conservation.
Assign each motif to functional class to which its protein components belong.
Larger motifs have a notable functional homogeneity:
- for 95% of fully connected yeast pentagon motifs (#11) all components shared at
least one common functional class,
- only 10% of the 2-node motifs (#1) are functionally conserved.
Identify type and number of evolutionarily fully conserved motifs of each functional
class in S. cerevisiae, for those that have an ortholog in humans.
Wuchty, Oltvai, Barabasi, Nature Gen 35, 176 (2003)
function conservation?
shared components
For 3 functional classes (subcellular localization, protein fate, transcription) each of
the 11 studied motifs is considerably overrepresented.
Some other functional classes have only 1 or 2 characteristic motifs.
No motifs are found for:
transposable elements, energy, cellular fate, cellular communication, cellular
rescue, cellular organization, metabolism, protein activity, protein binding
The fully connected motifs (#9 and #11) tend to identify protein complexes.
However, the mere existence of protein complexes cannot explain the observed
trends towards higher conservation rates of the highly connected motifs.
Wuchty, Oltvai, Barabasi, Nature Gen 35, 176 (2003)
shared components
Shared components = proteins or groups of proteins occurring in different
complexes are fairly common:
A shared component may be a small part of many complexes, acting as a unit that
is constantly reused for its function.
Also, it may be the main part of the complex e.g. in a family of variant complexes
that differ from each other by distinct proteins that provide functional specificity.
Aim: identify and properly represent the modularity of protein-protein interaction
networks by identifying the shared components and the way they are arranged to
generate complexes.
Wuchty, Oltvai, Barabasi, Nature Gen 35, 176 (2003)
Summary
Modules are key intermediate level in the organizational hierarchy of cells.
Biological Module: loose association of preferred molecular interaction partners
that interact to perform a collective function.
Modules can be identified based on structural characteristics such as their closely
connected members and interfaces to other modules.
There is evidence that modules are evolutionarily conserved.
Module co-members tend to be coordinately expressed.
Direct comparison of different data sets
Bayesian Network approach
V12: Reliability of Protein Interaction Networks
High-throughput methods for detecting protein interactions
Yeast two-hybrid assay. Pairs of proteins to be tested for interaction are expressed as fusion proteins ('hybrids') in yeast: one protein is fused to a DNA-binding domain, the other to a transcriptional activator domain. Any interaction between them is detected by the formation of a functional transcription factor. Benefits: it is an in vivo technique; transient and unstable interactions can be detected; it is independent of endogenous protein expression; and it has fine resolution, enabling interaction mapping within proteins. Drawbacks: only two proteins are tested at a time (no cooperative binding); it takes place in the nucleus, so many proteins are not in their native compartment; and it predicts possible interactions, but is unrelated to the physiological setting.
Mass spectrometry of purified complexes. Individual proteins are tagged and used as 'hooks' to biochemically purify whole protein complexes. These are then separated and their components identified by mass spectrometry. Two protocols exist: tandem affinity purification (TAP), and high-throughput mass-spectrometric protein complex identification (HMS-PCI). Benefits: several members of a complex can be tagged, giving an internal check for consistency; and it detects real complexes in physiological settings. Drawbacks: it might miss some complexes that are not present under the given conditions; tagging may disturb complex formation; and loosely associated components may be washed off during purification.
Correlated mRNA expression (synexpression). mRNA levels are systematically measured under a variety of different cellular conditions, and genes are grouped if they show a similar transcriptional response to these conditions. These groups are enriched in genes encoding physically interacting proteins. Benefits: it is an in vivo technique, albeit an indirect one; and it has much broader coverage of cellular conditions than other methods. Drawbacks: it is a powerful method for discriminating cell states or disease outcomes, but is a relatively inaccurate predictor of direct physical interaction; and it is very sensitive to parameter choices and clustering methods during analysis.
Von Mering et al. Nature 417, 399 (2002)
High-throughput methods for detecting protein interactions
Genetic interactions (synthetic lethality). Two nonessential genes that cause lethality when mutated at
the same time form a synthetic lethal interaction. Such genes are often functionally associated and their
encoded proteins may also interact physically. This type of genetic interaction is currently being studied in
an all-versus-all approach in yeast. Benefits: it is an in vivo technique, albeit an indirect one; and it is
amenable to unbiased genome-wide screens.
In silico predictions through genome analysis. Whole genomes can be screened for three types of
interaction evidence: (1) in prokaryotic genomes, interacting proteins are often encoded by
conserved operons; (2) interacting proteins have a tendency to be either present or absent
together from fully sequenced genomes, that is, to have a similar 'phylogenetic profile'; and (3)
seemingly unrelated proteins are sometimes found fused into one polypeptide chain. This is an
indication for a physical interaction. Benefits: fast and inexpensive in silico techniques; and coverage
expands as more genomes are sequenced. Drawbacks: it requires a framework for assigning orthology
between proteins, failing where orthology relationships are not clear; and so far it has focused mainly on
prokaryotes.
Von Mering et al. Nature 417, 399 (2002)
Q: Describe 3 in silico methods for predicting protein-protein interactions from genomic data.
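The phylogenetic-profile idea (method 2 above) can be sketched in a few lines. This is a minimal illustration under assumptions, not the procedure of von Mering et al.: the gene names, the genome labels and the use of the Jaccard index as similarity measure are all hypothetical choices.

```python
from itertools import combinations

# Hypothetical presence/absence data: gene -> set of genomes containing an ortholog
profiles = {
    "geneA": {"g1", "g2", "g5"},
    "geneB": {"g1", "g2", "g5"},
    "geneC": {"g1", "g3", "g4"},
    "geneD": {"g2", "g5"},
}

def jaccard(a, b):
    """Profile similarity: genomes containing both genes / genomes containing either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_profile_pairs(profiles, threshold=1.0):
    """Gene pairs whose phylogenetic profiles are at least `threshold` similar."""
    return [
        (x, y)
        for x, y in combinations(sorted(profiles), 2)
        if jaccard(profiles[x], profiles[y]) >= threshold
    ]

# Only geneA and geneB are present or absent together in every genome
candidates = similar_profile_pairs(profiles)
```

Lowering `threshold` admits pairs with nearly identical profiles (here geneD with geneA or geneB), at the price of more false positives.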
Data set
Experiment:
Uetz et al. 957 interactions
Ito et al. 4549 interactions
HMS-PCI 33014 interactions
In silico:
Conserved gene neighborhood 6387 interactions
Gene fusions 358 interactions
Co-occurrence of genes 997 interactions
Von Mering et al. Nature 417, 399 (2002)
Counting interactions
Various high-throughput methods
give differing results on the same
complex.
More than 80,000 interactions are available
for yeast.
Only 2,400 of them are supported by
more than 1 method.
Von Mering et al. Nature 417, 399 (2002)
Possible explanations?
- Methods may not have reached saturation.
- Many of the methods produce a significant fraction of false positives.
- Some methods may have difficulties with certain types of interactions.
Protein interactions between functional categories
Each technique produces a unique distribution of interactions with respect to functional
categories, i.e. the methods have specific strengths and weaknesses.
E.g. TAP and HMS-PCI predict few interactions for proteins involved in transport and sensing
because these categories are enriched with membrane proteins.
E.g. Y2H detects few proteins involved in translation.
Von Mering et al. Nature 417, 399 (2002)
Complementarity between data sets
Glycine decarboxylase
- Multienzyme complex needed when Gly is used as 1-carbon source.
- Its key components GCV1, GCV2, GCV3 are only induced when there is excess glycine and folate levels are low. This may explain why the complex is not detected in experiments.
However, the 3 components can be detected by several independent in silico methods:
- Gene neighborhood of all 3 components in 7 diverged species.
- The genes show a very similar phylogenetic distribution.
- Microarrays: the genes are closely co-regulated.
Von Mering et al. Nature 417, 399 (2002)
Opposite example: PPH3 protein
Complex found in 4 independent purifications,
but no in silico method predicts interaction.
Q: Interpret the scheme shown above. Which experimental method is best?
A: Different methods measure different properties of the interactions. From this scheme alone one cannot decide which is the best. What would be a good test for this?
Quantitative comparison of interaction data sets
The various data sets are benchmarked
against a reference set of 10,907 trusted
interactions, which are derived from protein
complexes annotated manually at MIPS and
YPD databases.
Coverage and accuracy are lower limits
owing to incompleteness of the reference
set. Each dot in the graph represents an
entire interaction data set.
For the combined evidence, consider only
interactions supported by an agreement of
two (or three) of any of the methods shown.
Von Mering et al. Nature 417, 399 (2002)
Q: Do you expect that confirming a protein-protein interaction by several independent experiments strengthens its significance? Describe a well-suited method for modeling this combination of evidence.
A: Bayesian network.
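The benchmarking idea can be illustrated with plain set operations. This is only a sketch under assumptions: coverage is taken to mean the share of reference interactions that a data set recovers, accuracy the share of its interactions that the reference confirms, and all protein pairs below are invented toy data.

```python
def norm(pair):
    """Store an undirected interaction as a sorted tuple so that (A, B) == (B, A)."""
    return tuple(sorted(pair))

def coverage_and_accuracy(predicted, reference):
    """Coverage: share of reference pairs recovered. Accuracy: share of predictions confirmed."""
    predicted = {norm(p) for p in predicted}
    reference = {norm(p) for p in reference}
    hits = predicted & reference
    return len(hits) / len(reference), len(hits) / len(predicted)

# Toy data, purely illustrative
reference = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E")]
y2h = [("B", "A"), ("A", "D"), ("D", "E"), ("C", "E")]
tap = [("A", "B"), ("B", "C"), ("D", "E"), ("A", "E")]

# Combined evidence: keep only interactions reported by both methods
combined = {norm(p) for p in y2h} & {norm(p) for p in tap}
```

In this toy case the combined set is fully confirmed by the reference but recovers only half of it, which mirrors the trade-off between accuracy and coverage when agreement of several methods is required.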
Biases in interaction coverage
Experiment:
Uetz et al. 957 interactions
Ito et al. 4549 interactions
HMS-PCI 33014 interactions
In silico:
Conserved gene neighborhood 6387 interactions
Gene fusions 358 interactions
Co-occurrence of genes 997 interactions
None of the methods covers more than 60% of the proteins in the yeast genome.
Are there common biases as to which proteins are covered?
Von Mering et al. Nature 417, 399 (2002)
Bias 1 towards proteins of high abundance
mRNA abundance is a rough measure of protein
abundance.
Here, the yeast genome is divided into 10 mRNA
abundance classes (bins) of equal size.
For each data set and abundance class, the
number of interactions with at least one protein
in that class is recorded. Each interaction (A–B) is
counted twice: once under the abundance class
of partner A, and once under the abundance
class of partner B.
Most data sets are heavily biased towards
proteins of high abundance, except for the genetic
techniques (Y2H and synthetic lethality).
Von Mering et al. Nature 417, 399 (2002)
Bias 2 towards cellular localization
Independent quality measure:
Do interacting proteins belong to the same
compartment?
The Y2H method gives relatively poor results
here.
Von Mering et al. Nature 417, 399 (2002)
Outlook
How many protein-protein interactions can be expected in yeast?
The overlap of the high-throughput data sets is 20 times larger than expected by chance, i.e. the signal-to-noise ratio is good.
Also, for interactions discovered ≥ 2 times, both partners usually have the same
functional category and cellular localization.
The overlap therefore consists mainly of "true positives".
Less than 1/3 of the interactions in the overlap set were previously known.
Given the 10,000 currently known interactions, one can thus predict > 30,000 protein interactions in
yeast (lower bound).
Von Mering et al. Nature 417, 399 (2002)
Problems
Jansen et al. Science 302, 449 (2003)
Unfortunately, interaction data sets are often incomplete and contradictory (von
Mering et al. 2002).
In the context of genome-wide analyses, these inaccuracies are greatly magnified
because the protein pairs that do not interact (negatives) by far outnumber those
that do interact (positives).
E.g. in yeast, the ~6000 proteins allow for N (N-1) / 2 ≈ 18 million potential
interactions, but the estimated number of actual interactions is < 100,000.
Therefore, even reliable techniques can generate many false positives when
applied genome-wide.
Think of a diagnostic with a 1% false-positive rate for a rare disease occurring in
0.1% of the population. This would roughly produce 1 true positive for every 10
false ones.
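The arithmetic behind the diagnostic example can be checked directly. The sketch below assumes, for simplicity, that the test detects every true case (sensitivity 1.0); the function name is a hypothetical helper, not part of any cited work.

```python
def true_to_false_positive_ratio(prevalence, false_positive_rate, sensitivity=1.0):
    """Expected ratio TP / FP when a test is applied to a whole population."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / false_positives

# Diagnostic example from the text: disease in 0.1% of people, 1% false-positive rate
ratio = true_to_false_positive_ratio(prevalence=0.001, false_positive_rate=0.01)

# Yeast analogue: number of potential pairs among ~6000 proteins
n_proteins = 6000
n_pairs = n_proteins * (n_proteins - 1) // 2
```

The ratio comes out at roughly 0.1, i.e. about 1 true positive for every 10 false ones, and `n_pairs` is about 18 million.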
Integrative Approach (very important!)
Jansen et al. Science 302, 449 (2003)
One would like to integrate evidence from many different sources to better
separate true from false protein-protein interaction predictions.
Here, a Bayesian approach is used for integrating interaction information; it allows for
the probabilistic combination of multiple data sets and is applied to yeast.
Input: the approach can be used for combining noisy genomic interaction data sets.
Normalization: each source of evidence for interactions is compared against
samples of known positives and negatives ("gold standard").
Output: for every possible protein pair, predict the likelihood of interaction.
Verification: test on experimental interaction data not included in the gold
standard, plus new TAP (tandem affinity purification) experiments.
Integration of various information sources
Jansen et al. Science 302, 449 (2003)
3 different types of data are used:
(i) Interaction data from high-throughput experiments. These comprise large-scale two-hybrid screens (Y2H) and in vivo pull-down experiments.
(ii) Other genomic features: expression data, biological function of proteins (from Gene Ontology biological process and the MIPS functional catalog), and data about whether proteins are essential.
(iii) Gold standards of known interactions and noninteracting protein pairs.
Combination of data sets into probabilistic interactomes
The 4 interaction data sets from HT experiments were combined into 1 PIE (probabilistic interactome, experimental).
The PIE represents a transformation of the individual binary-valued interaction sets into a data set where every protein pair is weighted according to the likelihood that it exists in a complex.
Because the 4 experimental interaction data sets contain correlated evidence, a fully connected Bayesian network is used.
A "naïve" Bayesian network is used to model the PIP (probabilistic interactome, predicted) data. These information sets hardly overlap.
Jansen et al. Science 302, 449 (2003)
Bayesian Networks
Bayesian networks are probabilistic models that graphically encode probabilistic
dependencies between random variables.
[Figure: Bayesian network with parent node Y and directed arcs to the child nodes E1, E2, E3]
A directed arc between variables Y and E1 denotes conditional dependency of E1 on Y, as
determined by the direction of the arc.
Bayesian networks also include a quantitative measure of dependency. For each
variable and its parents this measure is defined using a conditional probability
function or a table.
Here, one such measure is the probability Pr(E1|Y).
Bayesian Networks
Together, the graphical structure and the conditional probability functions/tables
completely specify a Bayesian network probabilistic model.
[Figure: Bayesian network with parent node Y and directed arcs to the child nodes E1, E2, E3]
This model, in turn, specifies a particular factorization of the joint
probability distribution function over the variables in the network.
Here, Pr(Y,E1,E2,E3) = Pr(E1|Y) Pr(E2|Y) Pr(E3|Y) Pr(Y)
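The factorization can be made concrete with a small numerical sketch. All probabilities below are hypothetical; Y is read as "the protein pair interacts" and E1 to E3 as binary pieces of evidence, which is an assumption made only for this illustration.

```python
from itertools import product

# Hypothetical parameters: Pr(Y) and Pr(E_k = 1 | Y) for k = 1, 2, 3
p_y = {1: 0.2, 0: 0.8}
p_e = [
    {1: 0.7, 0: 0.1},
    {1: 0.6, 0: 0.2},
    {1: 0.5, 0: 0.3},
]

def joint(y, evidence):
    """Pr(Y, E1, E2, E3) = Pr(Y) * Pr(E1|Y) * Pr(E2|Y) * Pr(E3|Y)."""
    prob = p_y[y]
    for table, e in zip(p_e, evidence):
        prob *= table[y] if e == 1 else 1.0 - table[y]
    return prob

def posterior(evidence):
    """Pr(Y = 1 | E1, E2, E3), obtained from the joint by Bayes' rule."""
    numerator = joint(1, evidence)
    return numerator / (numerator + joint(0, evidence))

# Sanity check: the factorized joint sums to 1 over all 16 configurations
total = sum(joint(y, e) for y in (0, 1) for e in product((0, 1), repeat=3))
```

With these toy numbers, three positive observations raise the probability of Y = 1 from the prior 0.2 to about 0.9.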
Gold-Standard
Jansen et al. Science 302, 449 (2003)
A gold standard should be
(i) independent from the data sources serving as evidence
(ii) sufficiently large for reliable statistics
(iii) free of systematic bias (e.g. towards certain types of interactions).
Positives: use MIPS (Munich Information Center for Protein Sequences, HW
Mewes) complexes catalog: hand-curated list of complexes (8250 protein pairs that
are within the same complex) from biomedical literature.
Negatives:
- harder to define
- essential for successful training
Assume that proteins in different compartments do not interact.
Synthesize “negatives” from lists of proteins in separate subcellular compartments.
Measure of reliability: likelihood ratio
Jansen et al. Science 302, 449 (2003)
Consider a genomic feature f expressed in binary terms (i.e. "absent" or "present").
The likelihood ratio L(f) is defined as:
$$L(f) = \frac{\text{fraction of gold-standard positives having feature } f}{\text{fraction of gold-standard negatives having feature } f}$$
L(f) = 1 means that the feature has no predictive power: the same fraction of positives
and negatives have feature f.
The larger L(f), the better its predictive power.
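As a minimal sketch of this definition for a binary feature, with invented counts chosen only for illustration:

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L(f) = fraction of positives having f / fraction of negatives having f."""
    return (pos_with_f / pos_total) / (neg_with_f / neg_total)

# Hypothetical gold standard: 1,000 positives and 100,000 negatives
informative = likelihood_ratio(400, 1000, 2000, 100000)      # 0.4 / 0.02
uninformative = likelihood_ratio(100, 1000, 10000, 100000)   # 0.1 / 0.1
```

The first feature is 20 times more frequent among positives than among negatives; the second is equally frequent in both groups and therefore carries no information (L = 1).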
Combination of features
Jansen et al. Science 302, 449 (2003)
For two features f1 and f2 with uncorrelated evidence,
the likelihood ratio of the combined evidence is simply the product:
L(f1,f2) = L(f1) L(f2)
For correlated evidence L(f1,f2) cannot be factorized in this way.
Bayesian networks are a formal representation of such relationships between
features.
The combined likelihood ratio is proportional to the estimated odds that two
proteins are in the same complex, given multiple sources of information.
Prior and posterior odds
"positive": a pair of proteins that are in the same complex. Given the number of
positives among the total number of protein pairs, the "prior" odds of finding a
positive are:
$$O_{prior} = \frac{P(pos)}{P(neg)} = \frac{P(pos)}{1 - P(pos)}$$
"posterior" odds: odds of finding a positive after considering N datasets with values
f1 ... fN:
$$O_{post} = \frac{P(pos \mid f_1 \ldots f_N)}{P(neg \mid f_1 \ldots f_N)}$$
The terms "prior" and "posterior" refer to the situation before and after knowing the
information in the N datasets.
Jansen et al. Science 302, 449 (2003)
Static naive Bayesian Networks
In the case of protein-protein interaction data, the posterior odds describe the
odds of having a protein-protein interaction given that we have the information from
the N experiments,
whereas the prior odds are related to the chance of randomly finding a protein-
protein interaction when no experimental data is known.
If Opost > 1, the chances of having an interaction are higher than those of having no interaction.
Jansen et al. Science 302, 449 (2003)
Static naive Bayesian Networks
The likelihood ratio L, defined as
$$L(f_1 \ldots f_N) = \frac{P(f_1 \ldots f_N \mid pos)}{P(f_1 \ldots f_N \mid neg)}$$
relates prior and posterior odds according to Bayes' rule:
$$O_{post} = L(f_1 \ldots f_N) \, O_{prior}$$
In the special case that the N features are conditionally independent
(i.e. they provide uncorrelated evidence) the Bayesian network is a so-called
"naïve" network, and L can be simplified to:
$$L(f_1 \ldots f_N) = \prod_{i=1}^{N} L(f_i) = \prod_{i=1}^{N} \frac{P(f_i \mid pos)}{P(f_i \mid neg)}$$
Jansen et al. Science 302, 449 (2003)
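For the naïve case, the update from prior to posterior odds reduces to one multiplication per feature. The sketch below uses the prior odds of 1/600 quoted in this lecture; the three per-feature likelihood ratios are hypothetical values for a single protein pair.

```python
from math import prod

def posterior_odds(prior_odds, likelihood_ratios):
    """Naive Bayes: O_post = O_prior * product of the per-feature likelihood ratios."""
    return prior_odds * prod(likelihood_ratios)

def odds_to_probability(odds):
    """Convert odds O into a probability O / (1 + O)."""
    return odds / (1.0 + odds)

# Hypothetical likelihood ratios of three independent features for one protein pair
ratios = [3.6, 20.0, 10.0]
o_post = posterior_odds(1 / 600, ratios)
p_post = odds_to_probability(o_post)
```

Here the combined ratio is 720, so the posterior odds are 1.2 and the pair is slightly more likely than not to be in the same complex, although no single feature comes close to a ratio of 600.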
Computation of prior and posterior odds
L can be computed from contingency tables relating positive and negative
examples with the N features (by binning the feature values f1 ... fN into discrete
intervals); examples follow.
Determining the prior odds Oprior is somewhat arbitrary in that it requires an
assumption about the number of positives.
Jansen et al. believe that 30,000 is a conservative lower bound for the number of
positives (i.e. pairs of proteins that are in the same complex).
Considering that there are ca. 18 million = 0.5 * N (N – 1) possible protein pairs in
total (with N = 6000 for yeast),
$$O_{prior} \approx \frac{3 \times 10^{4}}{18 \times 10^{6}} = \frac{1}{600}$$
Opost > 1 can therefore be achieved with L > 600.
Jansen et al. Science 302, 449 (2003)
Essentiality (PIP)
Consider whether proteins are essential or non-essential, i.e. does a deletion mutant
in which this protein is knocked out from the genome have the same phenotype?
It should be more likely that the 2 proteins in a complex are either both essential or both
non-essential than a mixture of these two attributes.
Deletion mutants of either protein should impair the function of the same
complex.
Jansen et al. Science 302, 449 (2003)
Parameters of the naïve Bayesian Networks (PIP)
Column 1 describes the genomic feature. In the "essentiality data" protein pairs can take on 3 discrete
values (EE: both essential; NN: both non-essential; NE: one essential and one not).
Jansen et al. Science 302, 449 (2003)
Column 2 gives the number of protein pairs with a particular feature (i.e. „EE“) drawn from the whole yeast
interactome (~18M pairs).
Columns „pos“ and „neg“ give the overlap of these pairs with the 8,250 gold-standard positives and the
2,708,746 gold-standard negatives.
Columns „sum(pos)“ and „sum(neg)“ show how many gold-standard positives (negatives) are among the
protein pairs with likelihood ratio L, computed by summing up the values in the „pos“ (or „neg“) column.
P(feature value|pos) and P(feature value|neg) give the conditional probabilities of the feature values – and
L, the ratio of these two conditional probabilities.
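$$L = \frac{P(\text{feature value} \mid pos)}{P(\text{feature value} \mid neg)} = \frac{1114 / 2150}{81924 / 573724} = \frac{0.518}{0.143} \approx 3.6$$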
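How such a table yields one likelihood ratio per feature value can be sketched as follows. The toy gold standard below is invented and far smaller than the real one; only the feature labels EE, NE and NN are taken from the slide.

```python
from collections import Counter

def likelihood_table(pos_values, neg_values):
    """Per feature value v: L(v) = P(v | pos) / P(v | neg), estimated from counts."""
    pos_counts, neg_counts = Counter(pos_values), Counter(neg_values)
    table = {}
    for value in sorted(set(pos_counts) | set(neg_counts)):
        p_pos = pos_counts[value] / len(pos_values)
        p_neg = neg_counts[value] / len(neg_values)
        # A value never seen among the negatives has an unbounded ratio
        table[value] = p_pos / p_neg if p_neg > 0 else float("inf")
    return table

# Hypothetical gold standard: feature value of each positive / negative pair
pos = ["EE"] * 5 + ["NE"] * 2 + ["NN"] * 3
neg = ["EE"] * 2 + ["NE"] * 8 + ["NN"] * 10
table = likelihood_table(pos, neg)
```

In this toy data EE pairs are enriched among the positives (L = 5), while mixed NE pairs are depleted (L = 0.5), which is the qualitative pattern the essentiality argument predicts.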
mRNA expression data
Proteins in the same complex tend to have correlated expression profiles.
Although large differences can exist between the mRNA and protein abundance, protein abundance can
be indirectly and quite crudely measured by the presence or absence of the corresponding mRNA
transcript.
Jansen et al. Science 302, 449 (2003)
Experimental data source:
- time course of expression fluctuations during the yeast cell cycle
- Rosetta compendium: expression profiles of 300 deletion mutants and cells under
chemical treatments.
Problem: both data sets are strongly correlated.
Compute first principal component of the vector of the 2 correlations.
Use this as independent source of evidence for the P-P interaction prediction.
The first principal component is a stronger predictor of P-P interactions than either
of the 2 expression correlation datasets by themselves.
mRNA expression data
The values for mRNA expression correlation (first principal component) range on a
continuous scale from -1.0 to +1.0 (fully anticorrelated to fully correlated).
This range was binned into 19 intervals.
Jansen et al. Science 302, 449 (2003)
PIP – Functional similarity
Quantify functional similarity between two proteins:
- Consider which set of functional classes the two proteins share, given either the MIPS or Gene
Ontology (GO) classification system.
- Then count how many of the ~18 million protein pairs in yeast share the exact same
functional classes as well (yielding integer counts between 1 and ~18 million). This count was binned
into 5 intervals.
- In general, the smaller this count, the more similar and specific is the functional description
of the two proteins.
Jansen et al. Science 302, 449 (2003)
PIP – Functional similarity
Observation: low counts correlate with a higher chance of two proteins being in
the same complex. But signal (L) is quite weak.
Jansen et al. Science 302, 449 (2003)
Calculation of the fully connected Bayesian network (PIE)
The 4 binary experimental interaction datasets can be combined in at most 2^4 = 16
different ways (subsets). For each of these 16 subsets, one can compute a
likelihood ratio from the overlap with the gold-standard positives ("pos") and
negatives ("neg"), e.g. for a subset that overlaps with 26 of the 8,250 positives and 2 of the 2,708,746 negatives:
$$L = \frac{26 / 8250}{2 / 2708746} = \frac{26 \cdot 2708746}{2 \cdot 8250} \approx 4.3 \times 10^{3}$$
Jansen et al. Science 302, 449 (2003)
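The difference between the fully connected and the naïve treatment can be seen in a toy example. The sketch below uses only 2 datasets instead of 4 to keep it short, and all counts are invented; each gold-standard pair is described by a 0/1 pattern saying which datasets report it.

```python
from collections import Counter
from itertools import product

def pattern_likelihood_ratios(pos_patterns, neg_patterns, n_datasets):
    """One likelihood ratio per joint 0/1 pattern across the datasets."""
    pos_counts, neg_counts = Counter(pos_patterns), Counter(neg_patterns)
    ratios = {}
    for pattern in product((0, 1), repeat=n_datasets):
        p_pos = pos_counts[pattern] / len(pos_patterns)
        p_neg = neg_counts[pattern] / len(neg_patterns)
        if p_neg > 0:  # patterns never seen among the negatives are skipped
            ratios[pattern] = p_pos / p_neg
    return ratios

# Hypothetical gold-standard pairs, described by 2 correlated datasets
pos = [(1, 1)] * 6 + [(1, 0)] * 1 + [(0, 1)] * 1 + [(0, 0)] * 2
neg = [(1, 1)] * 1 + [(1, 0)] * 4 + [(0, 1)] * 4 + [(0, 0)] * 91
ratios = pattern_likelihood_ratios(pos, neg, n_datasets=2)
```

For the pattern (1, 1) the joint ratio is 60, whereas multiplying the two single-dataset ratios (0.7 / 0.05 = 14 each) would give 196. With correlated evidence the naïve product overstates the support, which is why the joint patterns are tabulated instead.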
Distribution of likelihood ratios
Number of protein pairs in the individual datasets and the probabilistic interactomes
as a function of the likelihood ratio.
There are many more protein pairs with high
likelihood ratios in the probabilistic interactomes
(PIE) than in the individual datasets G,H,U,I.
Protein pairs with high likelihood ratios provide
leads for further experimental investigation of
proteins that potentially form complexes.
Jansen et al. Science 302, 449 (2003)
Overview
PIP and PIE are separately tested against the
gold-standard.
Jansen et al. Science 302, 449 (2003)
PIP vs. the information sources
The ratio of true to false positives (TP/FP) increases
monotonically with Lcut, confirming L as an
appropriate measure of the odds of a real
interaction.
The ratio is computed as:
$$\frac{TP(L_{cut})}{FP(L_{cut})} = \frac{\sum_{L > L_{cut}} pos(L)}{\sum_{L > L_{cut}} neg(L)}$$
Protein pairs with L > 600 have a > 50%
chance of being in the same complex.
Jansen et al. Science 302, 449 (2003)
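The cumulative TP/FP ratio can be sketched as a sum over the rows of a likelihood-ratio table. The rows below are hypothetical; each holds a likelihood ratio together with the number of gold-standard positives and negatives at that ratio.

```python
def tp_fp_ratio(rows, l_cut):
    """Sum gold-standard positives and negatives over all rows with L > l_cut."""
    tp = sum(pos for L, pos, neg in rows if L > l_cut)
    fp = sum(neg for L, pos, neg in rows if L > l_cut)
    return tp / fp if fp else float("inf")

# Hypothetical rows: (likelihood ratio, gold-standard positives, gold-standard negatives)
rows = [(0.5, 300, 90000), (5.0, 200, 9000), (100.0, 150, 500), (1000.0, 100, 20)]
ratios = [tp_fp_ratio(rows, cut) for cut in (0.1, 1.0, 10.0, 600.0)]
```

Raising the cutoff discards the rows dominated by negatives first, so the ratio grows with each step in this toy table.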
PIE vs. the information sources
9897 interactions are predicted from PIP and
163 from PIE.
In contrast, likelihood ratios derived from single
genomic factors (e.g. mRNA coexpression) or
from individual interaction experiments (e.g. the
Ho data set) did not exceed the cutoff when used
alone.
This demonstrates that information sources that,
taken alone, are only weak predictors of
interactions can yield reliable predictions when
combined.
Jansen et al. Science 302, 449 (2003)
parts of PIP graph
To test whether the thresholded PIP is biased toward certain
complexes, compare the distribution of
predictions among the gold-standard
positives.
(A) The complete set of gold-standard positives and their overlap
with the PIP. The PIP (green) covers
27% of the gold-standard positives
(yellow).
The predictions are roughly
equally apportioned among the
different complexes, i.e. there is no bias.
Jansen et al. Science 302, 449 (2003)
V13 Prediction of Phylogenies based on single genes
Material of this lecture taken from
- chapter 6, DW Mount "Bioinformatics"
and from Joseph Felsenstein's book.
A phylogenetic analysis of a family of related
nucleic acid or protein sequences is a determination
of how the family might have been derived during
evolution.
The evolutionary relationships among the sequences
are depicted by placing the sequences as outer
branches on a tree.
Phylogenies, or evolutionary trees, are the basic structures to describe
differences between species, and to analyze them statistically.
They have been around for over 140 years.
Statistical, computational, and algorithmic work on them is ca. 40 years old.
Methods for Single-Gene Phylogeny
(1) Choose set of related sequences.
(2) Obtain multiple sequence alignment.
(3) Is there strong sequence similarity?
- Yes: maximum parsimony methods.
- No: is there clearly recognizable sequence similarity?
  - Yes: distance methods.
  - No: maximum likelihood methods.
(4) Analyze how well the data support the prediction.
Q: Fill in the diagram: which of the 3 phylogeny methods covered in the lecture is best suited in each case? Briefly explain why.
Parsimony methods (heavily abridged)
Edwards & Cavalli-Sforza (1963): the evolutionary tree to be preferred is the one that
involves "the minimum net amount of evolution".
That is, seek the phylogeny on which, when we reconstruct the evolutionary
events leading to our data, there are as few events as possible.
(1) We must be able to make a reconstruction of events, involving as few events
as possible, for any proposed phylogeny.
(2) We must be able to search among all possible phylogenies for the one or
ones that minimize the number of events.
Counting evolutionary changes
2 related dynamic programming algorithms: Fitch (1971) and Sankoff (1975)
- evaluate a phylogeny character by character
- for each character, consider it as rooted tree, placing the root wherever seems
appropriate.
- update some information down a tree; when we reach the bottom, the number of
changes of state is available.
Do not actually locate changes or reconstruct interior states at the nodes of the tree.
Fitch algorithm
The Fitch algorithm is intended to count the number of changes in a bifurcating tree with nucleotide
sequence data, in which any one of the 4 bases (A, C, G, T) can change to any
other.
At a particular site, we have observed the bases C, A, C, A and G in the 5 species,
given in the order in which they appear in the tree, left to right.
Fitch algorithm
For the left two species, at the node that is their immediate common ancestor,
attempt to construct the intersection of the two sets.
But as {C} ∩ {A} = ∅, instead construct
the union {C} ∪ {A} = {A, C} and count 1
change of state.
For the rightmost pair of species, assign
the common ancestor the set {A, G},
since {A} ∩ {G} = ∅, and count another
change of state.
... proceed to the bottom of the tree.
Total number of changes = 3. The algorithm works on arbitrarily large trees.
Q: Briefly describe the Fitch algorithm and fill in the tree shown above.
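The counting rule can be written as a short recursion. This is a sketch, not a reference implementation: the tree is encoded as nested tuples, and the topology ((C, A), (C, (A, G))) is an assumption, since the figure with the actual tree is not part of the text. It is consistent with the two pairings described above and with the total of 3 changes.

```python
def fitch(tree):
    """Return (state set, number of changes) for a bifurcating tree.

    Tips are single-character strings, internal nodes are (left, right) tuples.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_set, left_changes = fitch(tree[0])
    right_set, right_changes = fitch(tree[1])
    common = left_set & right_set
    if common:
        # Non-empty intersection: keep it, no change of state is counted
        return common, left_changes + right_changes
    # Empty intersection: take the union and count one change of state
    return left_set | right_set, left_changes + right_changes + 1

# Assumed topology for the 5 observed bases C, A, C, A, G (left to right)
tree = (("C", "A"), ("C", ("A", "G")))
root_set, changes = fitch(tree)
```

For this tree the recursion counts 3 changes and ends with the set {A, C} at the bottom node.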
Sankoff algorithm
The Fitch algorithm is very effective, but it is not obvious why it works.
The Sankoff algorithm is more complex, but its structure is more apparent.
Assume that we have a table of the cost of changes cij between each character state
i and each other state j.
Compute the total cost of the most parsimonious combinations of events by
computing it for each character.
For a given character, compute for each node k in the tree a quantity Sk(i).
This is interpreted as the minimal cost, given that node k is assigned state i,
of all the events upwards from node k in the tree.
Sankoff algorithm
If we can compute these values for all nodes,
we can also compute them for the bottom node in the tree.
Simply choose the minimum of these values,
$$S = \min_i S_0(i)$$
which is the desired total cost we seek, the minimum cost of evolution for this
character.
At the tips of the tree, the S(i) are easy to compute. The cost is 0 if the observed
state is state i, and infinite otherwise.
If we have observed an ambiguous state, the cost is 0 for all states that it could be,
and infinite for the rest.
Now we just need an algorithm to calculate the S(i) for the immediate common
ancestor of two nodes.
Sankoff algorithm
Suppose that the two descendant nodes are called l and r (for "left" and "right").
For their immediate common ancestor, node a, we compute
$$S_a(i) = \min_j \left[ c_{ij} + S_l(j) \right] + \min_k \left[ c_{ik} + S_r(k) \right]$$
The smallest possible cost given that node a is in state i is the cost cij of going from
state i to state j in the left descendant lineage, plus the cost Sl(j) of events further up
in the subtree given that node l is in state j. Select the value of j that minimizes that sum.
The same calculation is done for the right descendant lineage; the sum of these two minima is the
smallest possible cost for the subtree above node a, given that node a is in state i.
Apply the equation successively to each node in the tree, working downwards.
Finally compute all S0(i) and use the previous equation to find the minimum cost for the whole tree.
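The recursion translates almost literally into code. Two things in the sketch below are assumptions, because the figure and the cost table are not part of the text: the tree topology ((C, A), (C, (A, G))) and a cost table in which transitions (A↔G, C↔T) cost 1 and transversions cost 2.5.

```python
INF = float("inf")
STATES = "ACGT"

# Assumed cost table c_ij: transitions cost 1, transversions cost 2.5
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
COST = {
    (i, j): 0.0 if i == j else (1.0 if (i, j) in TRANSITIONS else 2.5)
    for i in STATES
    for j in STATES
}

def sankoff(tree):
    """Return [S(A), S(C), S(G), S(T)] for the root of a nested-tuple tree."""
    if isinstance(tree, str):
        # Tip: cost 0 for the observed state, infinite for all other states
        return [0.0 if s == tree else INF for s in STATES]
    s_left, s_right = sankoff(tree[0]), sankoff(tree[1])
    return [
        min(COST[i, j] + s_left[n] for n, j in enumerate(STATES))
        + min(COST[i, k] + s_right[n] for n, k in enumerate(STATES))
        for i in STATES
    ]

tree = (("C", "A"), ("C", ("A", "G")))  # assumed topology
root_costs = sankoff(tree)
min_cost = min(root_costs)
```

Under these assumptions the bottom node receives the array (6, 6, 7, 8), so the minimum cost for this site is 6.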
Sankoff algorithm
The array (6,6,7,8) at the bottom of the tree has a minimum value of 6
= minimum total cost of the tree for this site.
Q: Briefly describe the Sankoff algorithm and enter into the tree shown on the left the values that result from a modified cost matrix.
Finding the best tree by heuristic search
The obvious method for searching for the most parsimonious tree is to consider ALL
trees and evaluate each one.
Unfortunately, the number of possible trees is generally too large.
Therefore, use heuristic search methods that attempt to find the best trees without looking at
all possible trees.
(1) Make an initial estimate of the tree and make small rearrangements of it,
i.e. find "neighboring" trees.
(2) If any of these neighbors are better, consider them and continue the search.
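How quickly exhaustive search becomes infeasible can be seen from the standard counting formulas for bifurcating trees with labeled tips, (2n - 3)!! rooted and (2n - 5)!! unrooted trees for n species. These formulas are not stated on the slide; they are added here as background, and the function names are hypothetical.

```python
def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ..., taken as 1 for n <= 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def rooted_trees(n_species):
    """Number of rooted bifurcating trees with labeled tips: (2n - 3)!!"""
    return double_factorial(2 * n_species - 3)

def unrooted_trees(n_species):
    """Number of unrooted bifurcating trees with labeled tips: (2n - 5)!!"""
    return double_factorial(2 * n_species - 5)

for n in (5, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
```

Already for 10 species there are more than 34 million rooted trees, which motivates the heuristic search strategy.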
Resolve Incongruences in Phylogeny
There are many possible reasons why it can be difficult to decide how to handle conflicts in
larger sets of molecular data.
E.g. two genes with different evolutionary histories (e.g. owing to hybridization or
horizontal transfer) will necessarily give incongruent pictures while still depicting
true histories.
Here: compare genome sequence data for 7 Saccharomyces yeast species:
S. cerevisiae
S. paradoxus
S. mikatae
S. kudriavzevii
S. bayanus
S. castellii
S. kluyveri
plus one outgroup fungus, Candida albicans.
Rokas et al. Nature 425, 798 (2003)
Bootstrap analysis
A method for testing how well a particular data set fits a model.
E.g. the validity of the branch arrangement in a predicted phylogenetic tree can
be tested by resampling columns in a multiple sequence alignment to create
many new alignments.
The appearance of a particular branch in trees generated from these resampled
sequences can then be measured.
Alternatively, a sequence may be left out of an analysis to determine how
much the sequence influences the results of an analysis.
Here: resample individual nucleotide sites or positions of genes (bootstrap replicates).
Q: Explain the basic principle of the bootstrap method.
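The resampling step can be sketched as follows. The alignment is an invented toy example, and the function `groups_sp1_with_sp2` is only a stand-in for a real tree-building step (it compares mismatch counts); both are assumptions made for illustration.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the alignment length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}

def bootstrap_support(alignment, has_branch, n_replicates=100, seed=0):
    """Fraction of bootstrap replicates in which `has_branch(replicate)` is true."""
    rng = random.Random(seed)
    hits = sum(has_branch(bootstrap_alignment(alignment, rng)) for _ in range(n_replicates))
    return hits / n_replicates

# Toy alignment, purely illustrative
alignment = {
    "sp1": "ACGTACGTAC",
    "sp2": "ACGTACGTAA",
    "sp3": "TCGAACGGAC",
    "sp4": "TCGAACGGAA",
}

def groups_sp1_with_sp2(aln):
    """Stand-in for tree building: is sp1 closer to sp2 than to sp3 and sp4?"""
    def mismatches(a, b):
        return sum(x != y for x, y in zip(aln[a], aln[b]))
    return mismatches("sp1", "sp2") < min(mismatches("sp1", "sp3"), mismatches("sp1", "sp4"))

support = bootstrap_support(alignment, groups_sp1_with_sp2)
```

In a real analysis each replicate would be passed to the same tree-building method as the original alignment, and the support of a branch is the fraction of replicate trees that contain it.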
Alternative Tree topologies
Single-gene data sets generate multiple, robustly supported alternative topologies.
Representative alternative trees recovered from analyses of nucleotide data of 106
selected single genes and six commonly used genes are shown. The trees are the
50% majority-rule consensus trees from the genes YBL091C (a), YDL031W (b),
YER005W (c), YGL001C (d), YNL155W (e) and YOL097C (f).
These 6 genes were selected without consideration of their function. Maybe
commonly used, well known genes of important functions provide a better resolution?
Rokas et al. Nature 425, 798 (2003)
The alternative phylogenies could have resulted from a number of different
scenarios:
(1) most genes could have weakly supported most phylogenies and strongly
supported only a few alternative trees,
(2) most genes could have strongly supported one phylogeny and a few genes
strongly supported only a small number of alternatives,
(3) there could have been some combinations of these scenarios so that each
branch among alternative phylogenies had either weak or strong support
depending on the gene.
To distinguish between these possibilities, identify all branches recovered during
single-gene analyses, record each bootstrap value with respect to the gene and
method of analysis.
8 branches were shared by all three analyses with multiple instances of
bootstrap values > 50%.
Explanations?
Rokas et al. Nature 425, 798 (2003)
Concatenation of single genes gives a single tree!
Phylogenetic analyses of the
concatenated data set composed
of 106 genes yield maximum
support for a single tree,
irrespective of method and type of
character evaluated. Numbers
above branches indicate bootstrap
values (ML on nucleotides/MP on
nucleotides/MP on amino acids).
All alternative topologies were rejected.
This level of support for a single tree with 5 internal branches is unprecedented.
This tree can now be referred to as species tree.
Rokas et al. Nature 425, 798 (2003)
Convergence on single tree
A minimum of 20 genes is required to recover >95% bootstrap values for each
branch of the species tree. a, b, The bootstrap values for branches 3 (a) and 5 (b)
were constructed from the concatenation of randomly re-sampled orthologous
nucleotides (left) or random subsets of genes (right).
The species tree is recovered with robust support (>95% bootstrap values in all
branches at 95% confidence interval) by analyses of a minimum of 20
concatenated genes. All analyses were performed using MP.
Rokas et al. Nature 425, 798 (2003)
Independent evolution?
It has been suggested that nucleotides within a given gene do not evolve
independently.
Re-sample a subset of orthologous nucleotides from the total data set.
Only 8000 randomly chosen nucleotide positions (corresponding to fewer than three
concatenated genes) are sufficient to generate a single tree with > 95% confidence.
This indicates that nucleotides in genes have not evolved independently (because
when using complete genes, at least 20 genes are necessary to generate a single
tree).
Rokas et al. Nature 425, 798 (2003)
Q: Give a structural explanation for why different evolutionary rates are observed at different positions of a gene. How, then, can one obtain a consistent tree from 8000 randomly selected nucleotide positions of aligned genomes?
Implications for resolution of phylogenies
Unreliability of single-gene data sets stems from the fact that each gene is
shaped by a unique set of functional constraints through evolution.
Phylogenetic algorithms are sensitive to such constraints.
Such problems can be avoided with genome-wide sampling of independently
evolving genes.
In other cases the amount of sequence information needed to resolve specific
relationships will be dependent on the particular phylogenetic history under
examination.
Branches depicting speciation events separated by long time intervals may be
resolved with a smaller amount of data, and those depicting speciation events
separated by shorter intervals may be much harder to resolve.
Rokas et al. Nature 425, 798 (2003)
Q: What is the advantage of using several proteins for phylogenetic comparisons between organisms?
Summary
Robust strategies exist for phylogenies built on single-gene comparisons
(maximum parsimony, distance, maximum likelihood).
Problem of incongruence of phylogenies derived from individual genes.
Can be resolved by integrative analysis of multiple (here > 20) genes.
It is desirable to combine results from phylogenies constructed from local
sequence information with trees constructed from genome rearrangement.
The power of genome rearrangement studies is the construction of ancestral
genomes. Then one can derive the speed of evolution at different times, dissect
mutation biases at different times from the influence of genomic context ...
and possibly derive the driving forces of biological evolution.
V14: Phylogeny (II)
Distance matrix methods
Least squares
(leave out problematic UPGMA method)
Neighbor-joining
Maximum likelihood
An early "Universal Tree of Life" deduced from
ribosomal RNA (rRNA) data. The study upon which
this figure was based did not resolve the branching
of the three kingdoms most familiar to all of us:
plants, fungi and animals. Subsequent analyses,
however, have revealed that the biochemistry of
fungi (in particular, the synthesis of chitin) is most
similar to animals. Thus, counter-intuitively, plants
are likely to have diverged first, leaving fungi and
animals as sister groups.
http://www.palaeos.com/Systematics/Cladistics/molecular.html
Distance matrix methods
introduced by Cavalli-Sforza & Edwards (1967)
and by Fitch & Margoliash (1967)
general idea „seems as if it would not work very well“ (Felsenstein):
- calculate a measure of the distance between each pair of species
- find a tree that predicts the observed set of distances as closely as
possible.
All information from higher-order combinations of character states is left out.
But computer simulation studies show that the amount of lost information is
remarkably small.
Best way to think about distance matrix methods:
consider distances as estimates of the branch length separating that pair of
species.
Least square method
- observed table (matrix) of distances Dij
- any particular tree leads to a predicted set of distances dij.
Least square method
Measure of the discrepancy between the observed and expected distances:
\[ Q = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \left( D_{ij} - d_{ij} \right)^2 \]
where the weights wij can be defined in different ways:
- wij = 1 (Cavalli-Sforza & Edwards, 1967)
- wij = 1/Dij^2 (Fitch & Margoliash, 1967)
- wij = 1/Dij (Beyer et al., 1974)
Aim: Find tree topology and branch lengths that minimize Q.
Equation above is quadratic in branch lengths.
Take derivative with respect to branch lengths, set = 0,
and solve system of linear equations. Solution will minimize Q.
Doug Brutlag‘s course
Least square method
Number species in alphabetical order.
The expected distance between species A and D is d14 = v1 + v7 + v4.
The expected distance between species B and E is d25 = v5 + v6 + v7 + v2.
[Figure: example tree for species A to E with branches labelled v1 to v7]
Least square method
Number all branches of the tree and introduce an indicator variable xijk:
xijk = 1 if branch k lies in the path from species i to species j
xijk = 0 otherwise.
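The expected distance between i and j will then be
\[ d_{ij} = \sum_{k} x_{ij,k} \, v_k \]
and
\[ Q = \sum_{i=1}^{n} \sum_{j \neq i} w_{ij} \left( D_{ij} - \sum_{k} x_{ij,k} \, v_k \right)^2 \]
Setting the derivative with respect to each branch length v_k to zero gives
\[ \frac{dQ}{dv_k} = -2 \sum_{i=1}^{n} \sum_{j \neq i} w_{ij} \, x_{ij,k} \left( D_{ij} - \sum_{l} x_{ij,l} \, v_l \right) = 0 \]
For the case with wij = 1 for all i, j, the weights drop out of these equations.
Note: this gives one equation for each of the k branches.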
Least square method
DAB + DAC + DAD + DAE = 4v1 + v2 + v3 + v4 + v5 + 2v6 + 2v7
DAB + DBC + DBD + DBE = v1 + 4v2 + v3 + v4 + v5 + 2v6 + 3v7
DAC + DBC + DCD + DCE = v1 + v2 + 4v3 + v4 + v5 + 3v6 + 2v7
DAD + DBD + DCD + DDE = v1 + v2 + v3 + 4v4 + v5 + 2v6 + 3v7
DAE + DBE + DCE + DDE = v1 + v2 + v3 + v4 + 4v5 + 3v6 + 2v7
DAC + DAE + DBC + DBE + DCD + DDE = 2v1 + 2v2 + 3v3 + 2v4 + 3v5 + 6v6 + 4v7
DAB + DAD + DBC + DCD + DBE + DDE = 2v1 + 3v2 + 2v3 + 3v4 + 2v5 + 4v6 + 6v7
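Stack up the (4 + 3 + 2 + 1 = 10) Dij, in alphabetical order, into a vector d,
and arrange the coefficients xijk in a matrix X, with each row corresponding
to the Dij in the same row of d and containing a 1 if branch k
occurs on the path between species i and j (columns are branches v1 to v7):
\[
\mathbf{d} =
\begin{pmatrix}
D_{AB} \\ D_{AC} \\ D_{AD} \\ D_{AE} \\ D_{BC} \\ D_{BD} \\ D_{BE} \\ D_{CD} \\ D_{CE} \\ D_{DE}
\end{pmatrix},
\qquad
\mathbf{X} =
\begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 & 1 & 1 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1
\end{pmatrix}
\]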
Least square method
If we also stack up the 7 vi into a vector v, the previous set of linear equations can
be compactly expressed as:
\[ \mathbf{X}^{T} \mathbf{d} = \mathbf{X}^{T} \mathbf{X} \, \mathbf{v} \]
Multiplying from the left by the inverse of X^T X, one can solve for the least squares
branch lengths:
\[ \mathbf{v} = \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \mathbf{X}^{T} \mathbf{d} \]
This is a standard method of expressing least squares problems in matrix notation
and solving them.
check for example :-)
Least square method
When we have weighted least squares, with a diagonal matrix of weights in the
same order as the Dij:
\[ \mathbf{W} = \operatorname{diag}\left( w_{AB}, w_{AC}, w_{AD}, w_{AE}, w_{BC}, w_{BD}, w_{BE}, w_{CD}, w_{CE}, w_{DE} \right) \]
then the least square equations can be written
\[ \mathbf{X}^{T} \mathbf{W} \mathbf{d} = \mathbf{X}^{T} \mathbf{W} \mathbf{X} \, \mathbf{v} \]
and their solution
\[ \mathbf{v} = \left( \mathbf{X}^{T} \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^{T} \mathbf{W} \mathbf{d} \]
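The normal equations above can be checked numerically. The following is a minimal sketch, not part of the original lecture: it hard-codes the 0/1 path matrix X of the five-species example tree, picks hypothetical branch lengths (`true_v`) to generate perfectly tree-like distances, and solves (X^T W X) v = X^T W d with exact fractions. The helper names `solve` and `least_squares_branch_lengths` are illustrative.

```python
from fractions import Fraction

# Rows: species pairs AB, AC, AD, AE, BC, BD, BE, CD, CE, DE.
# Columns: branches v1..v7 (v1..v5 lead to tips A..E, v6 and v7 are interior).
X = [
    [1, 1, 0, 0, 0, 0, 1],  # AB
    [1, 0, 1, 0, 0, 1, 0],  # AC
    [1, 0, 0, 1, 0, 0, 1],  # AD
    [1, 0, 0, 0, 1, 1, 0],  # AE
    [0, 1, 1, 0, 0, 1, 1],  # BC
    [0, 1, 0, 1, 0, 0, 0],  # BD
    [0, 1, 0, 0, 1, 1, 1],  # BE
    [0, 0, 1, 1, 0, 1, 1],  # CD
    [0, 0, 1, 0, 1, 0, 0],  # CE
    [0, 0, 0, 1, 1, 1, 1],  # DE
]


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination (exact)."""
    n = len(a)
    m = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def least_squares_branch_lengths(x, d, w=None):
    """Return v with (X^T W X) v = X^T W d; W defaults to the identity."""
    w = w or [1] * len(d)
    k = len(x[0])
    xtwx = [[sum(wi * r[a] * r[b] for wi, r in zip(w, x)) for b in range(k)]
            for a in range(k)]
    xtwd = [sum(wi * r[a] * di for wi, r, di in zip(w, x, d)) for a in range(k)]
    return solve(xtwx, xtwd)


# Hypothetical branch lengths, used only to generate additive distances.
true_v = [3, 2, 4, 1, 5, 2, 3]
d = [sum(xi * vi for xi, vi in zip(row, true_v)) for row in X]

v_unweighted = least_squares_branch_lengths(X, d)                       # w_ij = 1
v_fitch = least_squares_branch_lengths(X, d, [Fraction(1, x * x) for x in d])  # w_ij = 1/D_ij^2
```

Because the distances here are exactly additive on the tree, both weightings return the branch lengths that generated them; with noisy distances the two solutions would generally differ.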
Finding the least squares tree topology
Now that we are able to assign branch lengths to each tree topology,
we need to search among tree topologies.
This can be done by the same methods of heuristic search that were presented for
the Maximum Parsimony method.
Note: no one has so far presented a branch-and-bound method for finding the least
squares tree exactly. Day (1986) has shown that this problem is NP-complete.
The search is not only among tree topologies, but also among branch lengths.
neighbor-joining method
introduced by Saitou and Nei (1987) – algorithm works by clustering - does not
assume a molecular clock but approximates the „minimum evolution“ model.
„Minimum evolution“ model:
among possible tree topologies, choose the one with minimal total branch length.
Neighbor-joining, like the least-squares method, is guaranteed to recover the true
tree if the distance matrix is an exact reflection of the tree.
neighbor-joining method
(1) For each tip, compute
\[ u_i = \sum_{j \neq i}^{n} \frac{D_{ij}}{n - 2} \]
(2) Choose the i and j for which Dij – ui – uj is smallest.
(3) Join items i and j. Compute the branch length
from i to the new node (vi) and from j to the new
node (vj) as
\[ v_i = \tfrac{1}{2} D_{ij} + \tfrac{1}{2} (u_i - u_j), \qquad v_j = \tfrac{1}{2} D_{ij} + \tfrac{1}{2} (u_j - u_i) \]
(4) Compute the distance between the new node (ij) and each of the remaining tips k as
\[ D_{(ij),k} = \frac{D_{ik} + D_{jk} - D_{ij}}{2} \]
(5) Delete tips i and j from the tables and replace them by the new node, (ij), which
is now treated as a tip.
(6) If more than 2 nodes remain, go back to step (1). Otherwise, connect the two
remaining nodes (e.g. l and m) by a branch of length Dlm.
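The six steps can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function name `neighbor_joining`, the dict-of-dicts input format and the four-taxon example distances are assumptions made here. The example distances are additive on a hypothetical tree with tip branches A:1, B:2, C:4, D:5 and an interior branch of length 3.

```python
def neighbor_joining(dist):
    """dist maps each tip name to a dict of distances to every other tip.
    Returns the edges of the unrooted tree as (node_a, node_b, length)."""
    d = {a: dict(row) for a, row in dist.items()}  # working copy
    edges = []
    while len(d) > 2:
        n = len(d)
        # Step 1: u_i = sum_{j != i} D_ij / (n - 2)
        u = {i: sum(d[i][j] for j in d if j != i) / (n - 2) for i in d}
        # Step 2: choose the pair with the smallest D_ij - u_i - u_j
        i, j = min(
            ((a, b) for a in d for b in d if a < b),
            key=lambda p: d[p[0]][p[1]] - u[p[0]] - u[p[1]],
        )
        # Step 3: branch lengths from i and j to the new node
        vi = 0.5 * d[i][j] + 0.5 * (u[i] - u[j])
        vj = 0.5 * d[i][j] + 0.5 * (u[j] - u[i])
        new = f"({i},{j})"
        edges += [(i, new, vi), (j, new, vj)]
        # Step 4: distances from the new node to the remaining tips
        new_row = {k: (d[i][k] + d[j][k] - d[i][j]) / 2
                   for k in d if k not in (i, j)}
        # Step 5: replace i and j by the new node, now treated as a tip
        for k in (i, j):
            del d[k]
        for k in d:
            del d[k][i], d[k][j]
            d[k][new] = new_row[k]
        d[new] = new_row
    # Step 6: connect the two remaining nodes
    l, m = d
    edges.append((l, m, d[l][m]))
    return edges


example = {
    "A": {"B": 3, "C": 8, "D": 9},
    "B": {"A": 3, "C": 9, "D": 10},
    "C": {"A": 8, "B": 9, "D": 9},
    "D": {"A": 9, "B": 10, "C": 9},
}
edges = neighbor_joining(example)
```

Ties in step (2) are broken by taking the first minimal pair; for exactly additive distances any tied choice leads to the same unrooted tree.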
limitation of distance methods
Distance matrix methods are the easiest phylogeny method to program,
and they are very fast.
Distance methods have problems when the evolutionary rates vary widely.
One can correct for this in distance methods as well as in likelihood methods.
When variation of rates is large, these corrections become important.
In likelihood methods, the correction can use information from changes in one part
of the tree to inform the correction in others.
Once a particular part of the molecule is seen to change rapidly in the primates, this
will affect the interpretation of that part of the molecule among the rodents as well.
But a distance matrix method is inherently incapable of propagating the information
in this way. Once one is looking at changes within rodents, it will forget where
changes were seen among primates.
Maximum Likelihood
For any 2 hypotheses H1 and H2 about a set of data D
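\[ \mathrm{Prob}(H \mid D) = \frac{\mathrm{Prob}(H \text{ and } D)}{\mathrm{Prob}(D)} = \frac{\mathrm{Prob}(D \mid H) \, \mathrm{Prob}(H)}{\mathrm{Prob}(D)} \]
\[ \frac{\mathrm{Prob}(H_1 \mid D)}{\mathrm{Prob}(H_2 \mid D)} = \frac{\mathrm{Prob}(D \mid H_1)}{\mathrm{Prob}(D \mid H_2)} \cdot \frac{\mathrm{Prob}(H_1)}{\mathrm{Prob}(H_2)} \]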
This expresses the „odds“ ratio in favor of hypothesis 1 over hypothesis 2 as a
product of two terms.
The first is the ratio of the probabilities of the data given the 2 hypotheses.
The second is the ratio of the prior probabilities of the 2 hypotheses before we
look at the data.
Maximum Likelihood
If we have independent observations, then
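\[ \mathrm{Prob}(D \mid H_i) = \mathrm{Prob}(D^{(1)} \mid H_i) \, \mathrm{Prob}(D^{(2)} \mid H_i) \cdots \mathrm{Prob}(D^{(n)} \mid H_i) \]
It follows that
\[ \frac{\mathrm{Prob}(H_1 \mid D)}{\mathrm{Prob}(H_2 \mid D)} = \left( \prod_{i=1}^{n} \frac{\mathrm{Prob}(D^{(i)} \mid H_1)}{\mathrm{Prob}(D^{(i)} \mid H_2)} \right) \frac{\mathrm{Prob}(H_1)}{\mathrm{Prob}(H_2)} \]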
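A small worked example may make the product form concrete. The sketch below uses a made-up coin-tossing setting (hypotheses, data and the function name `posterior_odds` are all assumptions for illustration): H1 is a fair coin, H2 a coin with Prob(heads) = 3/4, and both have equal prior probability.

```python
from fractions import Fraction
from math import prod


def posterior_odds(observations, lik1, lik2, prior1, prior2):
    """Odds of H1 over H2: product of per-observation likelihood ratios
    times the ratio of the prior probabilities."""
    likelihood_ratio = prod(lik1(x) / lik2(x) for x in observations)
    return likelihood_ratio * Fraction(prior1) / Fraction(prior2)


# Hypothetical likelihoods of a single toss under each hypothesis.
def fair_coin(x):
    return Fraction(1, 2)


def biased_coin(x):
    return Fraction(3, 4) if x == "H" else Fraction(1, 4)


tosses = "HHTH"  # independent observations
odds = posterior_odds(tosses, fair_coin, biased_coin,
                      prior1=Fraction(1, 2), prior2=Fraction(1, 2))
```

With equal priors the posterior odds equal the likelihood ratio, here (1/2)^4 divided by (3/4)^3 (1/4), which is 16/27 and therefore slightly favours the biased coin.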
Computing the likelihood of a tree
Suppose that we have a set of aligned DNA-sequences with m sites.
We are given a phylogeny with branch lengths, and an evolutionary model that
allows us to compute probabilities of changes of states along this tree.
In particular, the model allows us to compute transition probabilities Pij(t), the
probability that state j will exist at the end of a branch of length t, if the state at the
start of the branch is i. (t measures branch length, not time).
We will make 2 assumptions that are central to computing the likelihoods:
(1) Evolution in different sites (on the given tree) is independent.
(2) Evolution in different lineages is independent.
Computing the likelihood of a tree
The first assumption
(1) Evolution in different sites (on the given tree) is independent.
allows us to take the likelihood and decompose it into a product, one term for each
of the m sites:
\[ L = \mathrm{Prob}(D \mid T) = \prod_{i=1}^{m} \mathrm{Prob}(D^{(i)} \mid T) \]
where D(i) is the data at the ith site.
Thus we only need to know how to compute the likelihood at a single site.
Computing the likelihood of a tree
Suppose that we have a tree, and the data at a site.
The likelihood of the tree for this site is the sum, over all possible nucleotides that
may have existed at the interior nodes of the tree, of the probabilities of each
scenario of events:
\[ \mathrm{Prob}(D^{(i)} \mid T) = \sum_{x} \sum_{y} \sum_{z} \sum_{w} \mathrm{Prob}(A, C, C, C, G, x, y, z, w \mid T) \]
Each summation runs over all 4 nucleotides.
Computing the likelihood of a tree
The second assumption
(2) Evolution in different lineages is independent.
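allows us to decompose
\[ \mathrm{Prob}(A, C, C, C, G, x, y, z, w \mid T) \]
into a product of terms:
\[
\begin{aligned}
\mathrm{Prob}(A, C, C, C, G, x, y, z, w \mid T) = {} & \mathrm{Prob}(x) \, \mathrm{Prob}(y \mid x, t_6) \, \mathrm{Prob}(A \mid y, t_1) \, \mathrm{Prob}(C \mid y, t_2) \\
& \times \mathrm{Prob}(z \mid x, t_8) \, \mathrm{Prob}(C \mid z, t_3) \, \mathrm{Prob}(w \mid z, t_7) \, \mathrm{Prob}(C \mid w, t_4) \, \mathrm{Prob}(G \mid w, t_5)
\end{aligned}
\]
The problem with this expression is that a tree with n species has n – 1 interior
nodes, and each can have one of 4 states.
So we need 4^(n-1) terms. This may become enormously large when we need to sum
over all x, y, z, w. Refine strategy!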
Economizing on the computation
The algorithm is applied starting at the node that has all of its immediate
descendants being tips (there will always be one such node).
Then it is applied successively to nodes further down the tree, not applying it to any
node until all of its descendants have been processed.
The result is the L0(i) for the bottom-most node in the tree.
Once the likelihood for each site is computed, the overall likelihood of the
tree is the product of the site likelihoods.
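The bottom-up procedure described here is commonly known as Felsenstein's pruning algorithm. The sketch below illustrates it on the five-tip example (tips A, C, C, C, G; interior nodes x, y, z, w) and checks it against the explicit sum over all 4^4 interior states. Several things are assumptions made for illustration only: the Jukes-Cantor substitution model, the uniform prior Prob(x) = 1/4 at the root, the numerical branch lengths t1..t8, and the nested-tuple tree encoding.

```python
import math
from itertools import product

BASES = "ACGT"


def p_jc(i, j, t):
    """Jukes-Cantor transition probability P_ij(t) (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


# A tip is a base; an interior node is a pair of (subtree, branch_length).
# Branch lengths are hypothetical values.
t1, t2, t3, t4, t5, t6, t7, t8 = 0.1, 0.2, 0.3, 0.1, 0.2, 0.15, 0.25, 0.05
w = (("C", t4), ("G", t5))
z = (("C", t3), (w, t7))
y = (("A", t1), ("C", t2))
x = ((y, t6), (z, t8))  # root


def conditional_likelihoods(node):
    """For each state s: probability of all tips below node, given s at node."""
    if isinstance(node, str):
        return {s: 1.0 if s == node else 0.0 for s in BASES}
    result = {s: 1.0 for s in BASES}
    for child, t in node:  # children are processed before their parent
        below = conditional_likelihoods(child)
        for s in BASES:
            result[s] *= sum(p_jc(s, c, t) * below[c] for c in BASES)
    return result


def site_likelihood(root):
    """Combine the root's conditional likelihoods with a uniform prior."""
    cond = conditional_likelihoods(root)
    return sum(0.25 * cond[s] for s in BASES)


def brute_force():
    """Explicit sum over all 4^4 assignments of x, y, z, w."""
    total = 0.0
    for sx, sy, sz, sw in product(BASES, repeat=4):
        total += (0.25 * p_jc(sx, sy, t6) * p_jc(sy, "A", t1) * p_jc(sy, "C", t2)
                  * p_jc(sx, sz, t8) * p_jc(sz, "C", t3) * p_jc(sz, sw, t7)
                  * p_jc(sw, "C", t4) * p_jc(sw, "G", t5))
    return total
```

The recursive version does work proportional to the number of nodes times 4 x 4 state pairs per branch, instead of growing as 4^(n-1).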
V 16 Genome Rearrangement
Two genomes may have many genes in common, but the genes may be
arranged in a different sequence or be moved between chromosomes. Such
differences in gene orders are the results of rearrangement events that are
common in molecular evolution (frequency ca. only 1 event per million years!)
- Substitution
- Insertion
- Deletion
- Translocation
- Inversion/ Reversal
- Duplication
Types of Genome Rearrangements
In unichromosomal genomes, the most common rearrangement events are
reversals, in which a contiguous interval of genes is put into the reverse order.
For multichromosomal genomes, the most common rearrangement events are
reversals, translocations, fissions, and fusions.
The pairwise genome rearrangement problem is to find an optimal scenario
transforming one genome to another via these rearrangement events.
Genomic distance: the number of inversions and translocations needed to
transform one genome into another. Fissions and fusions may be included as a
special case of translocations in which one of the input or output chromosomes is
empty.
Representation of a genome
We consider a unichromosomal genome to be a sequence of n genes. The
genes are represented by numbers 1, 2, ..., n.
The two orientations of gene i are represented by i and -i.
A genome is represented as a signed permutation of the numbers 1, 2, ..., n.
For example, a unichromosomal genome with n = 5 genes is 5 -3 4 2 -1
Unichromosomal genomes: sorting by reversal
A reversal in a signed permutation is an operation that takes an interval in a
permutation, reverses the order of the numbers, and changes all their signs. For
example,
5 1 3 2 -9 7 -4 6 8
5 1 -7 9 -2 -3 -4 6 8
The reversal distance between two genomes is the minimum number of
reversals it takes to get from one genome to the other.
For a given pair of genomes, the reversal distance is unique, but there are
usually many possible reversal scenarios with this distance.
However, it is (of course) possible that this mathematical notion of reversal
distance can underestimate the actual number of steps that occurred
biologically.
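The reversal operation on a signed permutation is easy to express in code. A minimal sketch, assuming a genome is stored as a Python list of signed integers and positions are 1-based and inclusive (the function name `reverse` is chosen here for illustration):

```python
def reverse(genome, i, j):
    """Reverse positions i..j (1-based, inclusive) and flip all their signs."""
    segment = [-g for g in reversed(genome[i - 1:j])]
    return genome[:i - 1] + segment + genome[j:]


before = [5, 1, 3, 2, -9, 7, -4, 6, 8]
after = reverse(before, 3, 6)  # [5, 1, -7, 9, -2, -3, -4, 6, 8]
```

Applying the same reversal twice restores the original genome, which is why a reversal scenario can be read in either direction.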
Signed and unsigned genomes
Most comparative mapping techniques determine the physical locations and
relative order of genes in each chromosome, but do not determine which of
two orientations each gene has.
Current sequencing methods do provide the orientations. It turns out that the
genome rearrangement problem (uni- and multichromosomal) for unsigned
permutations is NP-hard, but the same problem for signed data can be solved in
polynomial time.
Fortunately, with many genomes currently being sequenced, it is likely that
many comparative maps (corresponding to unsigned permutations) will soon be
replaced by sequencing data (corresponding to signed permutations).
Inversion, Transposition and inverted Transposition
[Figure: schematic of an inversion, a transposition and an inverted transposition]
Sorting by Reversals
[Figure: sorting-by-reversals scenario transforming the cabbage gene order into the turnip gene order (genes 1 to 11)]
Permutation (π): an ordered arrangement of
the set {1, 2, …, n}
Reversal ρ(i,j): a rearrangement that inverts a
block in π: {3 4 7 6 1 5 2} · ρ(3,6) = {3 4 5 1 6 7 2}
Signed permutation (π): a permutation
where the elements are oriented;
a reversal switches element orientation:
{+3 -4 +7 -6 +1 -5 +2} · ρ(3,6) = {+3 -4 +5 -1 +6 -7 +2}
easy to do by eye ...
[Figure: the cabbage-to-turnip reversal scenario, annotated with the successive products π·ρ1, π·ρ1·ρ2, π·ρ1·ρ2·ρ3, ..., π·ρ1·ρ2 ... ρt = σ]
Formal Approach: Sorting by Reversals
The order of genes in 2 organisms is represented by permutations π = π_1 π_2 ... π_n and σ = σ_1 σ_2 ... σ_n.
A reversal ρ(i,j) of an interval [i,j] is the permutation
\[
\begin{pmatrix}
1 & 2 & \dots & i-1 & i & i+1 & \dots & j-1 & j & j+1 & \dots & n \\
1 & 2 & \dots & i-1 & j & j-1 & \dots & i+1 & i & j+1 & \dots & n
\end{pmatrix}
\]
ρ(i,j) has the effect of reversing the order of π_i π_{i+1} ... π_j and transforming
π_1 ... π_{i-1} π_i ... π_j π_{j+1} ... π_n into π·ρ(i,j) = π_1 ... π_{i-1} π_j ... π_i π_{j+1} ... π_n.
Given permutations π and σ, the reversal distance problem is to find a series of
reversals ρ_1, ρ_2, ..., ρ_t such that π·ρ_1·ρ_2 ... ρ_t = σ and t is minimal.
t is called the reversal distance between π and σ.
Reconstruction of phylogenetic trees from WG data
1 Phylogeny reconstruction as optimization problem?
Attempt to reconstruct an evolutionary scenario with a minimum number of
permitted evolutionary events (e.g. duplications, insertions, deletions,
inversions, transpositions) on a tree. All known approaches are NP-hard.
Also, no automated tool exists so far.
2 Estimate leaf-to-leaf distances
(based on some metric) between all genomes. Then use a standard distance-based
method such as neighbour-joining to construct the tree.
Such approaches are quite fast but cannot recover the ancestral gene order.
2a Breakpoint phylogeny (Blanchette & Sankoff)
for special case in which the genomes all have the same set of genes, and
each gene appears once. Use breakpoint distance as distance matrix.
Reversal distance problem
The reversal distance for a pair of genomes can be computed in polynomial time
(Hannenhalli & Pevzner 1999 and others, also see Bioinformatics 1 lecture).
However, its use in studies of multiple genome rearrangements was somewhat
limited since it was not clear how to combine pairwise rearrangement scenarios
into a multiple rearrangement scenario.
In particular, Caprara (1999) demonstrated that even the simplest version of the
Multiple Genome Rearrangement Problem, the Median Problem, is NP-hard.
Therefore, this line of research was abandoned for a while in favor of the
breakpoint analysis approach (see Blanchette & Sankoff).
The existing tools BPAnalysis or GRAPPA use the so-called breakpoint distance
to derive rearrangement scenarios.
Breakpoint phylogeny
When each genome has the same set of genes and each gene appears exactly
once, a genome can be described by a (circular or linear) ordering =
permutation of these genes.
Each gene has either positive (gi) or negative (- gi) orientation.
Given 2 genomes G and G‘ on the same set of genes, a breakpoint in G is
defined as an ordered pair of genes (gi,gj) such that gi and gj appear
consecutively in that order in G, but neither (gi,gj) nor (-gj,-gi) appears
consecutively in that order in G‘.
The breakpoint distance between two genomes is simply the number of
breakpoints between that pair of genomes.
The breakpoint score of a tree in which each node is labelled by a signed
ordering of genes is then the sum of the breakpoint distances along the edges
of the tree.
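The breakpoint distance between two signed gene orders can be sketched as follows. This sketch makes several assumptions that are not spelled out on the slide: genomes are linear, the two chromosome ends are not counted, and a pair (g_i, g_j) is treated as conserved if either (g_i, g_j) or its reverse-strand reading (-g_j, -g_i) is consecutive in the other genome. The function name is illustrative.

```python
def breakpoint_distance(g, g_prime):
    """Number of consecutive pairs of g that are not conserved in g_prime."""
    conserved = set()
    for a, b in zip(g_prime, g_prime[1:]):
        conserved.add((a, b))
        conserved.add((-b, -a))  # the same adjacency read from the other strand
    return sum((a, b) not in conserved for a, b in zip(g, g[1:]))


g1 = [1, 2, 3, 4, 5]
g2 = [1, 2, -5, -4, -3]  # g1 with the block 3 4 5 reversed
distance = breakpoint_distance(g1, g2)
```

In this example only the pair (2, 3) is disrupted, so the single reversal produces a single breakpoint under the stated conventions. The breakpoint score of a tree would then be the sum of this quantity over all edges.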
Breakpoint Graph
Sorting a permutation is a hard problem.
Breakpoints were introduced by Watterson et al. (1982) and by Nadeau and Taylor
(1984) and correlations were noticed between the reversal distance and the
number of breakpoints.
Let i ~ j if |i – j| = 1. Extend a permutation π = π_1 π_2 ... π_n by adding π_0 = 0 and
π_{n+1} = n + 1. We call a pair of elements (π_i, π_{i+1}), 0 ≤ i ≤ n, of π an adjacency
if π_i ~ π_{i+1}, and a breakpoint if π_i ≁ π_{i+1}.
Example: π = 2 3 1 4 6 5 7 is extended to 0 2 3 1 4 6 5 7 8,
with adjacencies (2,3), (6,5), (7,8)
and breakpoints (0,2), (3,1), (1,4), (4,6), (5,7).
As the identity permutation has no
breakpoints, sorting by reversals
corresponds to eliminating breakpoints.
The observation that every reversal can
eliminate at most 2 breakpoints implies that
the reversal distance d(π) ≥ b(π) / 2, where
b(π) is the number of breakpoints in π.
However, this lower bound is usually not tight: it can clearly underestimate the reversal distance.
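The definitions of adjacency and breakpoint for an unsigned permutation translate directly into code. A minimal sketch (the function name `breakpoints` is an assumption), applied to the example permutation 2 3 1 4 6 5 7:

```python
def breakpoints(pi):
    """Split consecutive pairs of the extended permutation 0, pi, n+1
    into adjacencies (values differ by 1) and breakpoints (all others)."""
    ext = [0] + list(pi) + [len(pi) + 1]
    pairs = list(zip(ext, ext[1:]))
    adjacencies = [p for p in pairs if abs(p[0] - p[1]) == 1]
    bps = [p for p in pairs if abs(p[0] - p[1]) != 1]
    return adjacencies, bps


adj, bps = breakpoints([2, 3, 1, 4, 6, 5, 7])
lower_bound = -(-len(bps) // 2)  # ceil(b / 2), since distances are integers
```

For this permutation there are 5 breakpoints, so at least 3 reversals are needed; the bound says nothing about whether 3 actually suffice.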
Breakpoint Graph
The breakpoint graph of a permutation π is an edge-colored graph G(π) with n +
2 vertices {π_0, π_1, ..., π_n, π_{n+1}} = {0, 1, ..., n, n+1}. We join vertices π_i and π_{i+1} by a
black edge for 0 ≤ i ≤ n. We join vertices π_i and π_j by a gray edge if π_i ~ π_j.
[Figure: black path and gray path drawn on the vertices 0 2 3 1 4 6 5 7]
The superposition of the black and gray paths forms the breakpoint graph:
A breakpoint graph is obtained by superposing a black path traversing the vertices 0, 1, ..., n, n+1 in the order given by the permutation π and a gray path traversing the vertices in the order given by the identity permutation.
more next week ...
Q: Construct the breakpoint graph for the following permutation.
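Following the definition literally, the two edge sets of the breakpoint graph of an unsigned permutation can be listed as below. This is an illustrative sketch; the function name and the representation of edges as ordered pairs are choices made here, not part of the lecture.

```python
def breakpoint_graph(pi):
    """Return (black_edges, gray_edges) on the vertices 0..n+1.
    Black edges join neighbours in the extended permutation,
    gray edges join values that differ by 1."""
    ext = [0] + list(pi) + [len(pi) + 1]
    black = list(zip(ext, ext[1:]))
    gray = [(v, v + 1) for v in range(len(ext) - 1)]
    return black, gray


black, gray = breakpoint_graph([2, 3, 1, 4, 6, 5, 7])
# Edges present in both colours correspond to adjacencies of the permutation.
shared = [e for e in black if e in gray or (e[1], e[0]) in gray]
```

For the identity permutation the black and gray paths coincide completely, which matches the statement that the identity has no breakpoints.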
Multiple Genome Rearrangement Problem
Find a phylogenetic tree describing the most „plausible“ rearrangement
scenario for multiple species.
The genomic distance in the case of genome rearrangement is defined in terms
of (1) reversals, (2) translocations, (3) fusions, and (4) fissions which are
the most common rearrangement events in multichromosomal genomes.
The special case of three genomes (m = 3) is called the Median Problem.
Given the gene order of three unichromosomal genomes G1, G2, and G3,
find the ancestral genome A which minimizes the total reversal distance
\[ d(A, G_1) + d(A, G_2) + d(A, G_3) \]
Multiple Genome Rearrangement Problem
New approach:
Given a set of m permutations (existing genomes) of order n, find a tree T
with the m permutations as leaf nodes and assign permutations (ancestral
genomes) to internal nodes such that D(T) is minimized, where
\[ D(T) = \sum_{(\pi, \gamma) \in T} d(\pi, \gamma) \]
is the sum of reversal distances over all edges of the tree.
The breakpoint analysis attempts to solve the Median Problem by minimizing
the breakpoint distance instead of the reversal distance.
However, the breakpoint distance, in contrast to the reversal distance, does not
correspond to a minimum number of rearrangement events!
As a result, the breakpoint median recovered by breakpoint analysis rarely
corresponds to the ancestral median, the genome that minimizes the overall
number of rearrangements in the evolutionary scenario.
New algorithm
Aim: Among all possible reversals for each of the three genomes identify good
reversals.
A good reversal in a genome G1 is a reversal that brings a genome closer to
the ancestral genome.
But since the ancestral genome is unknown, it is unclear how to find good reversals, oops!
Instead: assume that reversals that reduce the reversal distance between G1
and G2 and the reversal distance between G1 and G3 are likely to be good
reversals.
With Δ(ρ) as the overall reduction in the reversal distances,
\[ \Delta(\rho) = \left[ d(G_1, G_2) + d(G_1, G_3) \right] - \left[ d(G_1 \cdot \rho, G_2) + d(G_1 \cdot \rho, G_3) \right] \]
the reversal ρ is good if Δ(ρ) = 2.
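For very small unichromosomal genomes the criterion Δ(ρ) = 2 can be tried out by brute force. The sketch below is an assumption-laden illustration: it replaces the polynomial-time reversal distance algorithm with an exhaustive breadth-first search (practical only for a handful of genes), considers reversals only, and uses made-up genomes; all function names are hypothetical.

```python
from collections import deque


def reversals(g):
    """All signed permutations obtained from tuple g by a single reversal."""
    n = len(g)
    for i in range(n):
        for j in range(i, n):
            yield g[:i] + tuple(-x for x in reversed(g[i:j + 1])) + g[j + 1:]


def reversal_distance(a, b):
    """Exact reversal distance by breadth-first search (tiny genomes only)."""
    seen, queue = {a: 0}, deque([a])
    while queue:
        g = queue.popleft()
        if g == b:
            return seen[g]
        for h in reversals(g):
            if h not in seen:
                seen[h] = seen[g] + 1
                queue.append(h)
    raise ValueError("genomes must be signed permutations of the same genes")


def good_reversals(g1, g2, g3):
    """Results of all reversals rho of g1 with Delta(rho) = 2."""
    base = reversal_distance(g1, g2) + reversal_distance(g1, g3)
    return [h for h in reversals(g1)
            if base - reversal_distance(h, g2) - reversal_distance(h, g3) == 2]


# Hypothetical genomes: g1 is one reversal away from g2 and two from g3.
g1 = (1, -3, -2, 4)
g2 = (1, 2, 3, 4)
g3 = (1, 2, 3, -4)
good = good_reversals(g1, g2, g3)
```

Here the only good reversal of g1 is the one that turns it into g2, since that single step brings it closer to both other genomes.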
New algorithm
Iteratively carry on these good rearrangements until the genomes G1, G2, and
G3 are transformed into an identical genome, hoping that this is the most likely
„ancestral median“.
When we are dealing with multichromosomal genomes and with four different
types of rearrangements, ambiguous situations may occur too.
Ambiguities again possible
E.g. G1 = 1 2 3 4 5
G2 = 1 2 -5 -4 -3
G3 = 1 2 | 3 4 5 (two chromosomes)
The parsimony principle does not allow one to unambiguously reconstruct the
evolutionary scenario.
If the ancestor coincides with G1, then a reversal occurred on the way to G2,
and a fission occurred on the way to G3.
One can as well start with G2 or G3 as the ancestor. In this case
\[ d(G_1, G_2) = d(G_1, G_3) = d(G_2, G_3) \]
This kind of ambiguity does not exist for unichromosomal genomes because,
there, it is impossible to find 3 genomes that would all be within one reversal of
each other.
Strategy for choosing reversals
Therefore one has to select carefully among the good rearrangements.
Observe that in most genomes of interest reversals and translocations are
more common than fusions and fissions.
Therefore use as a rule always to select reversals/translocations before
fusions/fissions.
Often, the list of good reversals contains nonoverlapping reversals, and the
order in which these reversals are performed is often irrelevant.
Compute for each good reversal ρ the number of good reversals n(ρ) that will be
available if ρ is carried out. Then choose the good reversal with the maximal n(ρ)
to be carried out.
If we run out of good reversals before reaching a solution, the best reversal to
be taken will be the result of a depth k search minimizing the total pairwise
rearrangement distances.
How good a measure is reversal distance?
The authors claim that the reversal distance is a good approximation of the true
distance for many biologically relevant cases.
Let σ be a genome that evolved from a genome π by k reversals,
i.e. the true distance between π and σ is k.
We say that π and σ form a valid pair if d(π, σ) = k.
Otherwise we say that d(π, σ) underestimates the true distance.
Typically two genomes form a valid pair if the number of rearrangements
between them is relatively small – exactly the case in a number of genome
rearrangement studies.
Reversal distance vs. True distance
Reversal distance, d(π, σ), versus
the actual number of reversals
performed to transform π into σ,
where σ is a genome/permutation
that evolved from the identity
permutation π = 1, 2, ..., 100 by k
random reversals.
The simulations were repeated
10 times for every k.
Shown is the average difference
between the reversal distance and
the actual number of reversals
performed (k).
For a genome with n=100 markers,
the reversal distance approximates
the true distance very well as long
as the number of reversals remains
below 0.4 n. This is the case in
many biologically relevant cases.
Bourque, Pevzner, Genome Res (2002)
Nadeau & Taylor model (1984)
- suggest presence of conserved segments (i.e., segments with preserved gene
orders without disruption by rearrangements)
- estimated that there are ca. 180 conserved segments in human and mouse
- provided convincing evidence that random breakage model of genomic
evolution postulated by Ohno is correct. The model assumes a random (i.e.,
uniform and independent) distribution of chromosome rearrangement
breakpoints and is supported by the observation that the lengths of
synteny blocks shared by human and mouse are well fitted by the
predicted exponential distribution imposed by the random breakage
model.
\[ f(x) = \frac{1}{L} \, e^{-x/L} \]
where L is the average length of segments.
- model has become widely accepted
- new studies of significantly larger datasets confirmed that newly
discovered synteny blocks still fit the predicted exponential distribution very well.
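The exponential length distribution of the random breakage model is simple to evaluate. A minimal sketch, with function names and the mean length of 10 Mb chosen purely for illustration:

```python
import math


def segment_density(x, mean_length):
    """f(x) = (1/L) * exp(-x/L), the predicted density of segment lengths."""
    return math.exp(-x / mean_length) / mean_length


def fraction_shorter_than(x, mean_length):
    """Integral of f from 0 to x, i.e. 1 - exp(-x/L)."""
    return 1.0 - math.exp(-x / mean_length)


L = 10.0  # hypothetical mean segment length in Mb
short_fraction = fraction_shorter_than(1.0, L)   # segments below 1 Mb
below_mean = fraction_shorter_than(L, L)         # segments below the mean
```

Under this model a little under 10% of segments would be shorter than one tenth of the mean, and about 63% shorter than the mean itself; comparing such predicted fractions with observed block counts is one way to judge the fit.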
Breakpoint reusage
Two different most parsimonious scenarios that transform the order of the 11 synteny blocks on the mouse X
chromosome into the order on the human X chromosome. The arrangement of synteny blocks in the ancestor is
unspecified (and is assumed to coincide with one of intermediate arrangements) because it cannot be inferred without
availability of a third genome.
Breakpoint uses are shown as short vertical yellow lines, and breakpoint region reuses are shown as double yellow
lines. In the first scenario (Left) the breakpoint reuses are located in human in breakpoint regions (3,4), (4,5), and (5,6),
whereas in the second one (Right) they are located in (5,6), (6,7), and after block 11. In the second scenario, a potential
hidden block is shown as a black dot; it restricts the set of possible most parsimonious scenarios, and it separates two
breakpoint uses that would have been a breakpoint region reuse. Our theory implies that any rearrangement scenario
based on these 11 blocks has at least three reuses of breakpoint regions (possibly including chromosome ends).
Pevzner, Tesler, PNAS 100, 7672 (2003)
Length of synteny blocks
(Left) Histogram of synteny block lengths in human for Nb = 281 synteny blocks of length at least 1 Mb,
fitted by an exponential distribution with mean block length L = Gb/Nb = 9.6 Mb, where Gb = 2,707 Mb is
the overall length of syntenic blocks. The bin size is 2.5 Mb.
(Center) The same histogram superimposed with the 190 hidden synteny blocks revealed by genome
rearrangement analysis, under the assumption that all hidden blocks are short, i.e., <1 Mb in length.
(Right) Histogram of breakpoint region lengths in the human genome (bin size is 100 kb). Most
breakpoint regions are very short, with 109 of 258 regions being <100 kb. However, there is a small
number of long breakpoint regions: 17 regions are between 1 and 2.5 Mb, and 15 are >2.5 Mb (shown
by a single bar at the right end).
The rearrangement analysis confirms the existence of many short breakpoints. Their existence
immediately implies that an exponential distribution is not a good fit to reality, thus pointing to
limitations of the random breakage model.
Pevzner, Tesler, PNAS 100, 7672 (2003)
Rat – mouse – man
The Rat Sequencing Consortium,
Nature 428, 493 (2004)
X chromosome on each pair. GRIMM synteny for 16 orthologous pairs.
Arrangement of the 16 blocks: 15 rearrangement events necessary.
Shown is one of a number of most parsimonious inversion scenarios.
The last common ancestor of human, mouse and rat should be on the evolutionary path between the median ancestor and human.
Summary
Breakpoint analysis (BPA) is a robust technique for small rearrangement
problems. Problem of ambiguity between different optimal solutions.
Although complexity could be dramatically reduced by algorithmic improvements
(e.g. GRAPPA), method is still too expensive for more than 10 genomes.
Heuristic MGR algorithm by Bourque & Pevzner minimizes reversal distance
instead of breakpoint distance. (Taking the number of breakpoints / 2 was not the
optimal lower bound for the reversal distance.)
It runs more efficiently, can be applied to much larger problems, and provides only one
or a few solutions.
MGR algorithm: analogy to conformational search in some energy landscape ...
What is the correct way to identify the biologically correct = true evolutionary trees:
by minimizing the breakpoint distance or the reversal distance or something else?
V16 – genome rearrangement
Important information – contained in the order in which genes occur on the
genomes of different species – allows inferring phylogenetic relationships.
Together with phylogenetic information, ancestral gene order reconstructions
give some clues about the conservation of the functional organisation of genomes
towards a global knowledge of life evolution.
Often, phylogeny reconstruction techniques using gene order data rely on the
definition of an evolutionary distance between two gene orders.
These distances are usually computed as the minimal number of
rearrangement operations needed to transform one genome into another one.
Bergeron et al. WABI 2004, 14-25 (2004)
V16 – genome rearrangement
Most choices of rearrangements quickly lead to hard algorithmic problems.
Therefore, the set of operations is usually restricted to reversals, translocations,
fusions or fissions, for which linear-time algorithms have been developed in recent
years.
However, this choice of rearrangement operations is dictated more by algorithmic
necessity than by biological reality. E.g., in some genomes, transpositions and
inverted transpositions can be quite common.
A family of phylogenetic approaches labelled „distance-based“ methods relies on
pair-wise evolutionary distances which are then fed into an algorithm such as
neighbor-joining to infer tree topology and branch lengths.
These methods do not provide information about the putative ancestral gene
order.
Bergeron et al. WABI 2004, 14-25 (2004)
V16 – genome rearrangement
Parsimony-based approaches attempt to identify the rearrangement scenario
(including tree topology and gene orders at the internal nodes) that minimizes the
number of evolutionary events required.
This problem is computationally much more difficult than just computing distances.
Heuristic algorithms exist that use either breakpoint or reversal distances.
However, these methods only provide us with one (or a small number of) possible
hypotheses about ancestral gene orders, with no information about alternate
optimal or near-optimal solutions.
Today:
- quick look at the reversal distance problem again
- new method „sets of conserved intervals“ (Bergeron & Jens Stoye)
Bergeron et al. WABI 2004, 14-25 (2004)
Breakpoint Graph
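The breakpoint graph of a permutation $\pi = \pi_1 \ldots \pi_n$ is an edge-colored graph $G(\pi)$ with
$n + 2$ vertices $\{\pi_0, \pi_1, \ldots, \pi_n, \pi_{n+1}\} \equiv \{0, 1, \ldots, n, n+1\}$. We join vertices $\pi_i$ and $\pi_{i+1}$ by
a black edge for $0 \le i \le n$. We join vertices $\pi_i$ and $\pi_j$ by a gray edge if $\pi_i \sim \pi_j$, i.e. if $|\pi_i - \pi_j| = 1$.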
Black path
0 2 3 1 4 6 5 7
Grey path
0 2 3 1 4 6 5 7
Superposition of black and grey paths forms the breakpoint graph:
A breakpoint graph is obtained by a superposition of a black path traversing the vertices 0, 1, ..., n, n+1 in the order given by the permutation and a gray path traversing the vertices in the order given by the identity permutation.
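The construction above can be sketched in a few lines of Python. This is a minimal sketch that assumes an unsigned permutation of 1..n; the function name `breakpoint_graph` is chosen here for illustration and does not come from the lecture.

```python
def breakpoint_graph(perm):
    """Black and gray edge lists of the breakpoint graph of an unsigned permutation of 1..n."""
    n = len(perm)
    framed = [0] + list(perm) + [n + 1]   # frame the permutation with 0 and n+1
    # black path: neighbours in the order given by the permutation
    black = [(framed[i], framed[i + 1]) for i in range(n + 1)]
    # gray path: neighbours in the order given by the identity permutation
    gray = [(v, v + 1) for v in range(n + 1)]
    return black, gray


black, gray = breakpoint_graph([2, 3, 1, 4, 6, 5])
print(black)  # [(0, 2), (2, 3), (3, 1), (1, 4), (4, 6), (6, 5), (5, 7)]
print(gray)   # [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
```

Both paths have n + 1 = 7 edges, so the superposition has 14 edges on the 8 vertices of the example.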
Cycle decomposition
A cycle in an edge-colored graph G is called alternating if the colors of every two
consecutive edges of this cycle are distinct. In the following, cycles will mean
alternating cycles.
Cycle decomposition of the breakpoint graph:
[Figure: the cycles of the decomposition, each drawn on the vertex row 0 2 3 1 4 6 5 7]
A vertex v in a graph G is called balanced if the
number of black edges incident to v equals the
number of grey edges incident to v.
A balanced graph is a graph in which every
vertex is balanced. $G(\pi)$ is a balanced graph.
Therefore, there exists a cycle decomposition
of $G(\pi)$ into edge-disjoint alternating cycles
(every edge in the graph belongs to exactly one
cycle in the decomposition). Cycles in an edge
decomposition may be self-intersecting. The
previous breakpoint graph can be decomposed
into 4 cycles, one of which is self-intersecting.
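The two claims above (balance, and a decomposition into 4 alternating cycles) can be checked mechanically. The sketch below is an illustration only: the four cycles listed in `cycles` were worked out by hand for the example 0 2 3 1 4 6 5 7 and are an assumption, since the slide's figure is not reproduced in the text; all function names are hypothetical.

```python
from collections import Counter

perm = [0, 2, 3, 1, 4, 6, 5, 7]
black = [frozenset(e) for e in zip(perm, perm[1:])]   # order of the permutation
gray = [frozenset((v, v + 1)) for v in range(7)]       # order of the identity


def is_balanced(black, gray):
    """True if every vertex has as many black as gray incident edges."""
    b = Counter(v for e in black for v in e)
    g = Counter(v for e in gray for v in e)
    return b == g


# Hand-made decomposition (assumption): each cycle is a closed walk of
# (colour, from, to) steps; the last cycle visits vertex 1 twice.
cycles = [
    [("b", 2, 3), ("g", 3, 2)],
    [("b", 6, 5), ("g", 5, 6)],
    [("b", 4, 6), ("g", 6, 7), ("b", 7, 5), ("g", 5, 4)],
    [("b", 0, 2), ("g", 2, 1), ("b", 1, 3), ("g", 3, 4), ("b", 4, 1), ("g", 1, 0)],
]


def is_alternating_cycle(cycle):
    """Closed walk in which consecutive edges (cyclically) differ in colour."""
    for (c1, _, v1), (c2, u2, _) in zip(cycle, cycle[1:] + cycle[:1]):
        if c1 == c2 or v1 != u2:
            return False
    return True


def covers_each_edge_once(cycles, black, gray):
    """True if the cycles use every edge of the graph exactly once."""
    used = Counter((c, frozenset((u, v))) for cyc in cycles for c, u, v in cyc)
    graph = Counter([("b", e) for e in black] + [("g", e) for e in gray])
    return used == graph


print(is_balanced(black, gray))                      # True
print(all(is_alternating_cycle(c) for c in cycles))  # True
print(covers_each_edge_once(cycles, black, gray))    # True
```

The check confirms that this particular choice is an edge-disjoint alternating decomposition; it does not prove that 4 is the maximum.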
Effects of reversals on cycles
(A) For reversals acting on two
cycles, $\Delta(b - c) = 1$.
(B) For reversals acting on an
unoriented cycle, $\Delta(b - c) = 0$.
(C) For reversals acting on an
oriented cycle, $\Delta(b - c) = -1$.
Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)
Cycle decomposition
What is the decomposition of the breakpoint graph into a maximum number $c(\pi)$
of edge-disjoint alternating cycles? Here, $c(\pi) = 4$.
Cycle decompositions play an important role in estimating reversal distances.
When a reversal is applied to a permutation, the number of cycles in a maximum
decomposition can change by at most one (while the number of breakpoints
can change by two).
Bafna & Pevzner (1996) proved the bound for the reversal distance $d(\pi)$:
$d(\pi) \ge n + 1 - c(\pi)$
which is much tighter than the bound in terms of breakpoints, $d(\pi) \ge b(\pi) / 2$.
For many biological problems, $d(\pi) = n + 1 - c(\pi)$.
Therefore, the reversal distance problem reduces to the problem of finding
the maximal cycle decomposition.
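As a worked example, the two bounds can be evaluated for the permutation 2 3 1 4 6 5 (framed by 0 and 7) used on the previous slides. The breakpoint count $b(\pi) = 5$ is not stated on the slide; it is read off the black path here, counting the neighbouring pairs (0,2), (3,1), (1,4), (4,6) and (5,7) whose values are not consecutive.

```latex
\begin{align*}
d(\pi) &\ge n + 1 - c(\pi) = 6 + 1 - 4 = 3, \\
d(\pi) &\ge \frac{b(\pi)}{2} = \frac{5}{2} = 2.5 .
\end{align*}
```

Since $d(\pi)$ is an integer, both bounds require at least 3 reversals on this small example.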
Hurdles, Super-hurdles, fortresses ...
Alternative concept: conserved intervals
Bergeron & Stoye, Report 2003-01 Uni Bielefeld
Distance matrices can be used as data for phylogenetic reconstruction, or to
reconstruct ancestral genomes.
However, all distances (except for the breakpoint distance) are closely tied to
initial choices of allowable rearrangement operations.
They are pure distances because similarities between genomes are ignored.
The breakpoint distance is based on the notion of conserved adjacencies. These are
easy to compute, but the breakpoint distance often fails to capture more global
relations between genomes.
A first generalization of adjacencies: common intervals that identify
subsets of genes that appear consecutively in two or
more genomes.
Jens Stoye
Permutations, Gene Order, and Rearrangements
Bergeron & Stoye, Report 2003-01 Uni Bielefeld
Assume that the genes of an organism are ordered and oriented along linear or
circular DNA molecules. E.g. mitochondrial genes in insects
Collapse 38 genes into set of 17 blocks. Genes in one block do not change order
between these species.
Distance approaches: focus on the difference between 2 particular genomes.
E.g. Fruit Fly differs from Mosquito by the reversal of gene 10, and the
transposition of genes 7 and 8.
Counting the minimal number of reversals and/or transpositions for each pair
yields a distance matrix for the set of species.
Permutations, Gene Order, and Rearrangements
Bergeron & Stoye, Report 2003-01 Uni Bielefeld
Breakpoint distance: counts the lost adjacencies between genomes.
E.g. given the circularity of the genomes, Fruit Fly and Mosquito have 12
conserved adjacencies and a breakpoint distance of 5.
E.g. the first 4 species of Table 1 share 6 adjacencies:
[1,2], [2,3], [11,12], [15,16], [16,17], and [17,1].
When comparing all 6 species, [17,1] is the only adjacency that is left.
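Counting conserved adjacencies between two signed, circular gene orders can be sketched as follows. This is a minimal sketch, not the procedure used to produce Table 1: the function names and the two 5-gene permutations are made up for illustration, and an adjacency a b is treated as identical to -b -a (the same neighbours read on the opposite strand).

```python
def adjacencies(perm, circular=True):
    """Set of signed adjacencies of a gene order, each in a canonical orientation."""
    pairs = list(zip(perm, perm[1:]))
    if circular:
        pairs.append((perm[-1], perm[0]))   # close the circle
    # (a, b) and (-b, -a) are the same adjacency read in opposite directions
    return {min((a, b), (-b, -a)) for a, b in pairs}


def conserved_adjacencies(p, q, circular=True):
    return adjacencies(p, circular) & adjacencies(q, circular)


def breakpoint_distance(p, q, circular=True):
    """Number of adjacencies of p that are lost in q."""
    return len(adjacencies(p, circular) - adjacencies(q, circular))


# Hypothetical example: q is p with the block 3 4 reversed.
p = [1, 2, 3, 4, 5]
q = [1, 2, -4, -3, 5]
print(len(conserved_adjacencies(p, q)))  # 3
print(breakpoint_distance(p, q))         # 2
```

For circular genomes with n blocks there are n adjacencies, so 12 conserved adjacencies among 17 blocks give the breakpoint distance 17 - 12 = 5 quoted above.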
Permutations, Gene Order, and Rearrangements
Bergeron & Stoye, Report 2003-01 Uni Bielefeld
Observation: the 6 permutations are very „similar“.
E.g. the genes in the interval [1,12] are all the same, with small variations in their
ordering.
This is also true for the genes in the intervals [3,6], [6,9], [9,11], and [12,17].
Such intervals, together with conserved adjacencies play a fundamental role in
rearrangement and distance theories, ancestral genome reconstructions, and
phylogeny.
Family portrait of the conserved intervals of the permutations of Table 1
Here, the elements that can be glued together to form larger objects are boxed in
rectangles.
Q: Construct the family portrait of the conserved intervals for the following sequences …
Which arrangements are preferable?
Bergeron & Stoye, Report 2003-01 Uni Bielefeld
All permutations of Table 1 fit the representation with the following conventions:
(1) free objects within a rectangle can be reordered, or can change sign
(2) connections between rectangles are fixed.
Consider 2 rearrangement scenarios that transform Silkworm into Locust using a
minimal number of reversals.
The two scenarios are fundamentally different, although both use 6 reversals.
The right one uses much longer reversals than the left one, and the right one
breaks conserved intervals between Silkworm and Locust in intermediate
permutations, namely [3,6], [1,12], and [12,17].
The right scenario looks highly suspicious.
Conserved intervals
Definition 1 Let G be a set of signed permutations of n elements. An
interval [a,b] is a conserved interval of the set G if:
(1) either a precedes b, or –b precedes –a, in each permutation, and
(2) the set of unsigned elements that appear between a and b is the same
for all permutations in G.
If [a,b] is a conserved interval, so is [-b,-a].
Consider 2 permutations
P = 1 2 3 7 5 6 -4 8
Q = 1 7 -3 -2 5 -6 -4 8
Here, [1,5] and [2,3] are conserved intervals, but not [1,6].
The other conserved intervals of P and Q are [1,-4], [1,8], [5,-4], [5,8], and [-4,8].
The diagram representation of these intervals is
1 2 3 7 5 6 -4 8
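Definition 1 translates directly into a brute-force search over all pairs of elements. The following is a minimal quadratic-time sketch, not the efficient algorithm of Bergeron & Stoye; the names `conserved_intervals` and `_conserved_in` are hypothetical, and each interval is reported once, by its endpoints as they appear in the first permutation.

```python
def _conserved_in(perm, a, b, inner):
    """Check conditions (1) and (2) of Definition 1 for [a,b] in one permutation."""
    pos = {x: i for i, x in enumerate(perm)}
    if a in pos and b in pos and pos[a] < pos[b]:          # a precedes b
        lo, hi = pos[a], pos[b]
    elif -a in pos and -b in pos and pos[-b] < pos[-a]:    # -b precedes -a
        lo, hi = pos[-b], pos[-a]
    else:
        return False
    # same set of unsigned elements between the two endpoints
    return {abs(x) for x in perm[lo + 1:hi]} == inner


def conserved_intervals(perms):
    """Conserved intervals of a list of signed permutations (first one is the reference)."""
    ref = perms[0]
    found = []
    for i in range(len(ref)):
        for j in range(i + 1, len(ref)):
            a, b = ref[i], ref[j]
            inner = {abs(x) for x in ref[i + 1:j]}
            if all(_conserved_in(p, a, b, inner) for p in perms[1:]):
                found.append((a, b))
    return found


P = [1, 2, 3, 7, 5, 6, -4, 8]
Q = [1, 7, -3, -2, 5, -6, -4, 8]
print(conserved_intervals([P, Q]))
# [(1, 5), (1, -4), (1, 8), (2, 3), (5, -4), (5, 8), (-4, 8)]
```

The output reproduces the seven intervals listed above; [1,6] is rejected because 6 changes sign in Q while 1 does not.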
Conserved intervals
When the identity permutation is not in G, it is always possible to rename the
elements of G such that conserved intervals will be intervals of consecutive
elements.
E.g. if one composes the permutations P and Q of the example with the inverse
permutation $P^{-1}$,
$P' = P^{-1} \circ P$ = 1 2 3 4 5 6 7 8
$Q' = P^{-1} \circ Q$ = 1 4 -3 -2 5 -6 7 8
or 1 2 3 4 5 6 7 8
Proposition 1 Let R be a permutation and G a set of permutations, denote by
$R \circ G$ the set of permutations obtained by composing each permutation in G with
R. The interval [a,b] is conserved in G if and only if the interval [R(a),R(b)] is
conserved in $R \circ G$.
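The renaming step can be sketched as follows. This is a minimal sketch in which the function name `rename` is hypothetical; it composes a signed permutation with the inverse of a reference permutation, carrying signs along, so that the reference itself becomes the identity.

```python
def rename(reference, perm):
    """Compose perm with the inverse of reference (both signed permutations)."""
    # inverse of the reference: unsigned element -> signed 1-based position
    inv = {abs(x): (i + 1 if x > 0 else -(i + 1)) for i, x in enumerate(reference)}
    # apply the inverse to every element, flipping the sign for negative elements
    return [inv[abs(x)] if x > 0 else -inv[abs(x)] for x in perm]


P = [1, 2, 3, 7, 5, 6, -4, 8]
Q = [1, 7, -3, -2, 5, -6, -4, 8]
print(rename(P, P))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(rename(P, Q))  # [1, 4, -3, -2, 5, -6, 7, 8]
```

After renaming, the conserved intervals [1,-4] and [5,-4] of the original example become [1,7] and [5,7], i.e. intervals of consecutive elements.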
Conserved intervals
Proposition 1 Let R be a permutation and G a set of permutations, denote by
$R \circ G$ the set of permutations obtained by composing each permutation in G with
R. The interval [a,b] is conserved in G if and only if the interval [R(a),R(b)] is
conserved in $R \circ G$.
Proof: if a permutation P is written as $P = p_1 p_2 \ldots p_n$
then $R \circ P$ is: $R \circ P = R(p_1) R(p_2) \ldots R(p_n)$
If [a,b] is conserved in G, then each permutation in G has a consecutive block of elements
beginning with a and ending with b, or beginning with –b and ending with –a. These
properties hold also for the set R o G, if one replaces a by R(a) and b by R(b).
Some intervals, such as [1,7] for the set {P', Q'} in the above example, are the
union of smaller intervals: [1,7] = [1,5] ∪ [5,7]. Intervals that are not unions are
especially useful.
Definition 2 Conserved intervals that are not the union of shorter conserved
intervals are called irreducible.
Sets of conserved intervals can be characterized by the set of irreducible intervals.
Q: Which of the previously identified conserved intervals are irreducible?
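Definition 2 can be tested mechanically once the conserved intervals are known. The sketch below assumes that the set contains the identity permutation (so intervals are pairs a < b of positive integers) and reads "union of shorter conserved intervals" as a split into two conserved intervals sharing one element, as in [1,7] = [1,5] ∪ [5,7]. The function name `irreducible` is hypothetical; the input list is the set of conserved intervals of {P', Q'} obtained by renaming the intervals of P and Q.

```python
def irreducible(intervals):
    """Conserved intervals that cannot be split as [a,x] and [x,b], both conserved."""
    known = set(intervals)

    def splits(a, b):
        return any((a, x) in known and (x, b) in known for x in range(a + 1, b))

    return sorted(iv for iv in known if not splits(*iv))


# conserved intervals of {P', Q'} (renamed from the example of P and Q)
intervals = [(1, 5), (2, 3), (1, 7), (1, 8), (5, 7), (5, 8), (7, 8)]
print(irreducible(intervals))  # [(1, 5), (2, 3), (5, 7), (7, 8)]
```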
Irreducible conserved intervals
Proposition 2 Two different irreducible conserved intervals [a,b] and [c,d]
of a set G of permutations are either
1) disjoint
2) nested with different endpoints, or
3) overlapping on one element.
Proof. Wlog we can assume that G contains the identity permutation and that conserved
intervals are intervals of consecutive elements.
Suppose that [a,b] and [c,d] are nested with a = c and d < b. Since [c,d] is a conserved
interval, it contains all integers between c and d. Hence the interval [d,b] contains all integers
between d and b, and [a,b] is not irreducible.
If [a,b] and [c,d] overlap with more than one element, we can suppose
a < c < b < d. Since all elements between c and d are greater than c, the interval
between a and c must contain all elements between a and c, thus [a,b] is not irreducible.
Q: Using these irreducible conserved intervals, quickly show that the 3 properties shown on the left are fulfilled.
Conserved intervals
Overlapping irreducible intervals form chains linked by their successive common
elements. A chain of k-1 intervals $[a_1,a_2], [a_2,a_3], \ldots, [a_{k-1},a_k]$ will be denoted simply
by its k links $[a_1, a_2, a_3, \ldots, a_k]$.
E.g. [1,5,7,8] is a chain of the set of conserved intervals of P' and Q'.
A maximal chain is a chain that cannot be extended.
Proposition 3. Every irreducible conserved interval belongs to a unique maximal
chain.
Proof: By Prop. 2: if [a,b] is an irreducible conserved interval, then no other
irreducible interval can begin with a or end with b.
Maximal chains, as sets of links, together with isolated genes, form a partition of
the set of genes.
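Because no two irreducible intervals share a left endpoint or a right endpoint, maximal chains can be assembled by simply following the links. The sketch below is an illustration with a hypothetical function name `maximal_chains`; it assumes the irreducible intervals are given as pairs (a, b) with the identity in the set, and uses the irreducible intervals [1,5], [2,3], [5,7], [7,8] worked out for P' and Q'.

```python
def maximal_chains(irreducible_intervals):
    """Group irreducible intervals into maximal chains, returned as lists of links."""
    nxt = dict(irreducible_intervals)   # left endpoint -> right endpoint (unique by Prop. 2)
    ends = set(nxt.values())
    chains = []
    for start in sorted(nxt):
        if start in ends:
            continue                    # start is an inner link of some chain
        chain = [start]
        while chain[-1] in nxt:         # extend the chain as far as possible
            chain.append(nxt[chain[-1]])
        chains.append(chain)
    return chains


print(maximal_chains([(1, 5), (2, 3), (5, 7), (7, 8)]))
# [[1, 5, 7, 8], [2, 3]]
```

Together with the isolated genes 4 and 6, the link sets {1,5,7,8} and {2,3} partition the genes 1..8 of the example.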
Conserved intervals
A set of permutations on n elements can have as many as n(n-1)/2 conserved
intervals, but at most n-1 irreducible intervals.
These bounds are achieved with sets containing only one permutation.
Proposition 4. Each maximal chain of k links contributes k(k-1)/2 to the total
number of conserved intervals.
Proof. Conserved intervals [a,b] are in bijection with chains of the form
$[a, x_1, \ldots, x_k, b]$
of irreducible intervals. Each maximal chain of k links has k(k-1)/2 such sub-chains.
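As a worked check of Proposition 4 on the example {P', Q'}: its irreducible intervals [1,5], [5,7], [7,8] and [2,3] (derived from the example above, not listed on the slide) form the maximal chains [1,5,7,8] with k = 4 links and [2,3] with k = 2 links.

```latex
\[
\frac{4 \cdot 3}{2} + \frac{2 \cdot 1}{2} = 6 + 1 = 7
\]
```

This agrees with the seven conserved intervals found for P and Q.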
Conserved intervals
Proposition 5 Let P be a permutation that is contained in both sets G1 and G2.
The interval [a,b] is a conserved interval of G = G1 ∪ G2 if and only if there exist
two chains of irreducible conserved intervals, with respect to P, with k ≥ 0, l ≥ 0:
$[a, x_1, \ldots, x_k, b]$ in G1
$[a, y_1, \ldots, y_l, b]$ in G2.
The interval [a,b] is irreducible if and only if $\{x_1, \ldots, x_k\}$ and $\{y_1, \ldots, y_l\}$ are disjoint.
Proof. The interval [a,b] is a conserved interval of G if and only if it is a conserved interval
in both G1 and G2, therefore there must exist chains beginning with a and ending with b for
both sets G1 and G2. If [a,b] is irreducible in G, and if [a,x] and [x,b] are conserved intervals
of G1, say, then x cannot belong to the set $\{y_1, \ldots, y_l\}$. If there is an element x common to
both sets $\{x_1, \ldots, x_k\}$ and $\{y_1, \ldots, y_l\}$, then [a,b] = [a,x] ∪ [x,b] and both [a,x] and [x,b] are
conserved intervals of G.
Similarity and distance
The number of conserved intervals of a set of permutations is a measure of
similarity, but can easily be transformed into a distance between two
permutations, or two sets of permutations.
Definition 3 Let G1 and G2 be two sets of permutations on n elements, with N1 and
N2 conserved intervals, respectively. Let N be the number of conserved intervals in
G1 ∪ G2. The interval distance between G1 and G2 is then defined by:
d(G1,G2) = N1 + N2 – 2N
The interval distance satisfies the fundamental properties of a mathematical
distance, e.g. it fulfils the triangle inequality:
d(P,Q) + d(Q,R) ≥ d(P,R)
Similarity and distance
When comparing two permutations, the interval distance counts the total number
of intervals that are unique to one of them. E.g. the distance between
P = 0 1 2 3 4 5 6 7 8 9 10
Q = 0 5 -7 -6 8 9 1 2 3 4 10
is given by d(P,Q) = (11·10)/2 + (11·10)/2 – 2·11 = 88
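The value 88 can be reproduced with a brute-force implementation of Definitions 1 and 3. This is a minimal sketch with hypothetical function names; it enumerates all pairs of elements, which is quadratic in n and only meant for small examples. A single permutation of n elements has n(n-1)/2 conserved intervals, which gives the two terms (11·10)/2.

```python
def _conserved_in(perm, a, b, inner):
    """Conditions of Definition 1 for the interval [a,b] in one permutation."""
    pos = {x: i for i, x in enumerate(perm)}
    if a in pos and b in pos and pos[a] < pos[b]:
        lo, hi = pos[a], pos[b]
    elif -a in pos and -b in pos and pos[-b] < pos[-a]:
        lo, hi = pos[-b], pos[-a]
    else:
        return False
    return {abs(x) for x in perm[lo + 1:hi]} == inner


def conserved_intervals(perms):
    """Conserved intervals of a list of signed permutations."""
    ref = perms[0]
    found = []
    for i in range(len(ref)):
        for j in range(i + 1, len(ref)):
            inner = {abs(x) for x in ref[i + 1:j]}
            if all(_conserved_in(p, ref[i], ref[j], inner) for p in perms[1:]):
                found.append((ref[i], ref[j]))
    return found


def interval_distance(g1, g2):
    """d(G1, G2) = N1 + N2 - 2N for two lists of permutations (Definition 3)."""
    n1 = len(conserved_intervals(g1))
    n2 = len(conserved_intervals(g2))
    n = len(conserved_intervals(g1 + g2))
    return n1 + n2 - 2 * n


P = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Q = [0, 5, -7, -6, 8, 9, 1, 2, 3, 4, 10]
print(len(conserved_intervals([P, Q])))  # 11
print(interval_distance([P], [Q]))       # 88
```

The 11 common intervals are [0,10], the six intervals inside the block 1 2 3 4, and [5,8], [5,9], [6,7], [8,9].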
The 2 measures sometimes disagree. The behavior of the interval distance
reflects that the length (number of genes) involved in a rearrangement operation
matters: short reversals are less disturbing than long ones.
Comparison with other distance measures
Breakpoint distance also gives different results than interval distance,
while the same results are obtained by transposition + reversal distances.
Q: Both interval distance and breakpoint distance are mathematically well-defined quantities. Which distance corresponds to the real biological distance?
A: We do not know this today, and we may never know it for certain.
Similarity and distance
Proposition 7 Suppose that P and Q have n elements, then
(1) if P is obtained from Q by reversing k elements, then k (n – k) conserved intervals
are lost and, by Definition 3, the interval distance between P and Q is 2 k (n – k);
(2) if P is obtained from Q by transposing two consecutive blocks of a and b
elements, then (a+b)(n – (a+b)) + ab conserved intervals are lost and the interval
distance between P and Q is 2 [(a+b)(n – (a+b)) + ab].
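The counts in Proposition 7 can be checked by brute force on small examples. The sketch below counts the conserved intervals of P that are no longer conserved in {P, Q}; since a single permutation has n(n-1)/2 conserved intervals, Definition 3 gives an interval distance of twice this number. The function names and the two 8-element test cases are made up for illustration.

```python
def conserved_pairs(p, q):
    """Conserved intervals of {p, q}, listed by their endpoints in p."""
    pos = {x: i for i, x in enumerate(q)}
    found = []
    for i in range(len(p)):
        for j in range(i + 1, len(p)):
            a, b = p[i], p[j]
            if a in pos and b in pos and pos[a] < pos[b]:
                lo, hi = pos[a], pos[b]
            elif -a in pos and -b in pos and pos[-b] < pos[-a]:
                lo, hi = pos[-b], pos[-a]
            else:
                continue
            if {abs(x) for x in p[i + 1:j]} == {abs(x) for x in q[lo + 1:hi]}:
                found.append((a, b))
    return found


def lost_intervals(p, q):
    """Number of conserved intervals of p alone that are not conserved in {p, q}."""
    n = len(p)
    return n * (n - 1) // 2 - len(conserved_pairs(p, q))


P = [1, 2, 3, 4, 5, 6, 7, 8]                 # n = 8
reversed_block = [1, 2, -5, -4, -3, 6, 7, 8]  # reversal of k = 3 elements
transposed = [1, 4, 5, 6, 2, 3, 7, 8]         # blocks of a = 2 and b = 3 elements swapped

print(lost_intervals(P, reversed_block))  # 15 = k (n - k) = 3 * 5
print(lost_intervals(P, transposed))      # 21 = (a+b)(n-(a+b)) + ab = 5 * 3 + 6
```

Both counts grow with the length of the affected segment, which is the length dependence discussed on this slide.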
Because the interval distance is affected by length, one should question
the practice of collapsing identical strips of genes.
Why not use all available information?
Link with rearrangement theories
Characterize the rearrangement operations that preserve conserved intervals.
Definition 4. Let P and Q be two permutations, and ρ a rearrangement
operation applied to P yielding P'. We say that ρ preserves the conserved
intervals of P and Q if the conserved intervals of {P,Q} are contained in those
of {P',Q}.
Only rearrangements within blocks are preserving. Note that all operations,
except fusions, destroy some adjacencies that existed in the original permutation:
the number and nature of these adjacencies is a key concept.
Definition 5. Let ρ be a rearrangement operation that transforms P into P'.
A breakpoint of ρ is a pair of elements that are adjacent in P but not in P'.
Breakpoints are where one has to cut P in order to apply ρ.
Reversals and translocations have 2 breakpoints, transpositions have 3, and
fissions have 1.
Link with rearrangement theories
Consider the irreducible intervals of P and P' with respect to P.
Adjacencies in P either belong to a (smallest) irreducible interval, or are free.
E.g. in this diagram
the adjacency (3,4) belongs to the interval [1,5], (2,3) belongs to [2,3], and (8,9)
is free.
When 2 adjacencies belong to the same irreducible interval, then none of these
adjacencies is conserved between P and P'.
Link with rearrangement theories
Theorem 3. Reversals, transpositions, and reverse transpositions are preserving
if and only if all their breakpoints belong to the same irreducible interval, or are
free. Translocations and fissions are preserving if and only if all their breakpoints
are free.
Proof. If the breakpoints of any operation are free, then no conserved interval is cut.
If the breakpoints of a reversal, transposition, or reverse transposition belong to the same
irreducible interval, then the operation reorders, or reverses, some blocks within that
interval, thus preserving conserved intervals.
If a reversal has its two breakpoints in different intervals, it will break those two intervals. If
it has only one free breakpoint, it will break the interval containing the other breakpoint.
The same kind of arguments hold for transpositions and reverse transpositions.
If a breakpoint of a translocation or fission is not free, then it belongs to an irreducible
interval whose extremities will end up in two different chromosomes.
It turns out that most rearrangement operations used in optimal scenarios are
indeed preserving.
Summary
Linear-time algorithms could be developed to find rearrangement scenarios that
minimize the reversal distance.
It is an open question which distance measures (breakpoint distance, reversal
distance, interval distance ...) are most appropriate to compare genome
architectures.
Experimental evidence provides new insights into which types of
rearrangements have likely occurred in the past; algorithms need to be adapted
to the biological reality.
The concept of „conserved intervals“ sounds very promising: it can account for
arbitrary types of rearrangements.