
11. Lecture WS 2004/05

Bioinformatics III

Semester schedule

25.11. Ass #6

30.11. Quality analysis of PI networks: Bayesian statistics Ass #7

2.12. + 7.12. Phylogeny

9.12. + 14.12. Genome rearrangement Ass #8

16.12. Christmas lecture

11.1 V19 Introduction to metabolic networks

13.1 V20 Extreme pathways Ass. #9

18.1 V21 Elementary mode analysis

20.1. V22 Integration of metabolic and regulatory networks Ass. #10

25.1. V23 Modeling of signal transduction cascades

27.1. Modeling of signal transduction cascades (II)

1.2. chemical genomics

3.2. V12 pharmacogenomics

8.2. Integrative network analysis

10.2. Summary for the exam

Exam date: when?



V11 – modules in cellular networks – wrap up

traditional biology (reductionist approach) produces long lists:

lists of genes in genomes

lists of transcripts in different cell types

lists of protein interactions in model organisms

genomes, transcriptomes, proteomes, interactomes,

databases of genetic perturbations, and corresponding phenotypes

How to make sense of it all?

Will meaningful hypotheses and discoveries emerge?

systems biology: formalized mathematical modeling, simulations, quantitative measurements

still room for reductionism: test hypotheses from systems biology experiments

Gagneur et al. Genome Biology 5, R57 (2004)



Strategies to detect communities in networks

„Community“ stands for module, class, group, cluster, ...

Define community as a subset of nodes within the graph such that connections

between the nodes are denser than connections with the rest of the network.

The detection of community structure is generally intended as a procedure for

mapping the network into a tree (a "dendrogram" in the social sciences).

Radicchi et al. PNAS 101, 2658 (2004)

Leaves: nodes; branches join nodes or (at a higher level) groups of nodes.



Agglomerative algorithms for mapping to tree

Traditional method to perform this mapping: hierarchical clustering.

For every pair i,j of nodes in the network compute weight Wij that measures how

closely connected the vertices are.

Starting from the set of all nodes and no edges,

links are iteratively added between pairs of

nodes in order of decreasing weight.

In this way nodes are grouped into larger and larger

communities, and the tree is built up to the root,

which represents the whole network.

„agglomerative“ algorithm

Girvan, Newman, PNAS 99, 7821 (2002); Radicchi et al. PNAS 101, 2658 (2004)

Here: 3 communities of densely connected vertices (circles with solid lines) with a much lower density of connections (gray lines) between them.
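The agglomerative procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `agglomerate`, the dictionary layout of the weights, and the use of a union-find structure to track groups are choices made for this sketch, not something prescribed by the slides.

```python
def agglomerate(nodes, weights):
    """Add links in order of decreasing weight and record every merge.

    nodes   -- iterable of hashable node labels
    weights -- dict mapping frozenset({i, j}) -> weight W_ij (hypothetical layout)
    Returns a list of (weight, merged_group) tuples, first merge first, root last.
    """
    parent = {v: v for v in nodes}
    members = {v: {v} for v in parent}

    def find(v):
        # Follow parent pointers to the representative of v's group.
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    merges = []
    # Process pairs from the highest weight to the lowest.
    for pair, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        i, j = tuple(pair)
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # the link lies inside an already formed community
        parent[rj] = ri
        members[ri] = members[ri] | members.pop(rj)
        merges.append((w, frozenset(members[ri])))
    return merges
```

Each recorded merge corresponds to one internal branch point of the tree; the last merge, if the weights connect all nodes, is the root that represents the whole network.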



Possible definitions of the weights

(1) number of node-independent paths between vertices

2 paths that connect the same pair of vertices are said to be node-independent if

they share none of the same vertices other than their initial and final vertices.

(2) edge-independent paths.

It has been shown that the number of node-independent (edge-independent) paths

between 2 vertices i and j in a graph is equal to the minimum number of vertices

(edges) that must be removed from the graph to disconnect i and j from one

another (Menger, 1927).

Thus these numbers are a measure of the robustness of the network to deletion of

nodes (edges).

Girvan, Newman, PNAS 99, 7821 (2002)
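By Menger's theorem, the number of edge-independent paths between two vertices equals the size of the smallest edge cut separating them, which in turn equals a unit-capacity maximum flow. The following is a minimal sketch of that computation; the function name `edge_independent_paths` and the adjacency-set input format are assumptions of this example, and the max-flow routine is a plain BFS augmenting-path method, not necessarily what the cited authors used.

```python
from collections import deque


def edge_independent_paths(adj, s, t):
    """Count edge-independent paths between s and t in an undirected graph.

    adj -- dict mapping each node to the set of its neighbours (symmetric).
    Uses unit-capacity max-flow with BFS augmenting paths.
    """
    # Residual capacity 1 in each direction of every undirected edge.
    cap = {(u, v): 1 for u in adj for v in adj[u]}
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        prev = {s: None}
        queue = deque([s])
        while queue and t not in prev:
            u = queue.popleft()
            for v in adj[u]:
                if v not in prev and cap[(u, v)] > 0:
                    prev[v] = u
                    queue.append(v)
        if t not in prev:
            return flow  # no further augmenting path: flow is maximal
        # Push one unit of flow back along the path found.
        v = t
        while prev[v] is not None:
            u = prev[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1
```

For two triangles joined by a single bridge, two vertices of the same triangle are connected by two edge-independent paths, whereas vertices in different triangles are connected by only one, which is the bridge that would have to be cut.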



Possible definitions of the weights (II)

(3) count total number of paths that run between them (not just those that are

node- or edge-independent).

Because the number of paths between any 2 vertices is either 0 or infinite, one

typically weights paths of length l by a factor α^l with small α so that the weighted

count of the number of paths converges.

Thus long paths contribute exponentially less weight than short paths.
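The weighted path count can be written in closed form. The symbols below are assumptions of this note rather than slide notation: A is the adjacency matrix of the graph, I the identity matrix, α the damping factor, and λ_max(A) the largest eigenvalue of A. Since (A^l)_{ij} counts the paths of length l from i to j (vertices may be revisited, which is why the unweighted count diverges), the weighted count is a matrix geometric series:

```latex
W_{ij} \;=\; \sum_{l=0}^{\infty} \alpha^{l} \left(A^{l}\right)_{ij}
       \;=\; \left[(I - \alpha A)^{-1}\right]_{ij},
\qquad \text{convergent for } \alpha < \frac{1}{\lambda_{\max}(A)} .
```

The factor α^l is what makes a path of length l + 1 count α times as much as a path of length l.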

These node- or edge-dependent path definitions for weights work okay for certain

community structures, but show typical pathologies.




Problems

In particular, counts of both node- and edge-independent paths tend

to separate single peripheral vertices from the communities to which they should

rightly belong.

If a vertex is, e.g., connected to the rest of a network by only a single edge then, to

the extent that it belongs to any community, it should clearly be considered to

belong to the community at the other end of that edge.

Unfortunately, both the numbers of independent paths and the weighted path

counts for such vertices are small and hence single nodes often remain isolated

from the network when the communities are constructed.

This and other pathologies make the hierarchical clustering method, although

useful, far from perfect.




New strategy: Use “betweenness” as definition of weights

Focus on those edges that are least central, that are „between“ communities.

Define edge betweenness of an edge as the number of shortest paths between

pairs of vertices that run along it.

If there is more than one shortest path between a pair of vertices, each path is

given equal weight such that the total weight of all of the paths is 1.

If a network contains communities or groups that are only loosely connected by a

few intergroup edges, then all shortest paths between different communities must

go along one of these few edges.

Hence, the edges connecting communities will have high edge betweenness.

By removing these edges we separate groups from one another and so reveal the

underlying community structure of the graph.




GN Algorithm

1. Calculate betweenness for all m edges in a graph of n vertices

(can be done in O(mn) time).

2. Remove the edge with the highest betweenness.

3. Recalculate betweenness for all edges affected by the removal.

4. Repeat from step 2 until no edges remain.

Because the recalculation in step 3 may have to be done for all edges after each of the m removals, the algorithm runs in worst-case time

O(m^2 n).
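The steps above can be sketched as follows. This is an illustrative sketch under several assumptions: the function names are hypothetical, the graph is an unweighted, undirected adjacency-set dictionary without self-loops, betweenness is recomputed for all edges after every removal (simpler than restricting to affected edges), and the loop stops at the first split instead of continuing until no edges remain.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness of an unweighted, undirected graph (Brandes-style)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors.
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0.0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Accumulate path fractions from the farthest vertices back to s.
        delta = {v: 0.0 for v in order}
        for v in reversed(order):
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1.0 + delta[v])
                bet[frozenset((u, v))] += share
                delta[u] += share
    # Every unordered vertex pair was counted once from each endpoint.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    """Connected components as a list of frozensets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(frozenset(comp))
    return comps


def girvan_newman_first_split(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    n_start = len(components(adj))
    while True:
        bet = edge_betweenness(adj)
        if not bet:
            break  # no edges left to remove
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
        if len(components(adj)) > n_start:
            break
    return components(adj)
```

Calling the split function again on each resulting component would build the tree level by level.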






Application of Girvan & Newman Algorithm

(a) The friendship network from Zachary's karate club study. Nodes associated with the club administrator's faction are drawn as circles, those associated with the instructor's faction are drawn as squares.

(b) Hierarchical tree showing the complete community structure for the network calculated by using the algorithm presented in this article. The initial split of the network into two groups is in agreement with the actual factions observed by Zachary, with the exception that node 3 is misclassified.

(c) Hierarchical tree calculated by using edge-independent path counts, which fails to extract the known community structure of the network.



Divisive algorithms for mapping to tree

The tree is constructed in the reverse order compared with agglomerative algorithms:

start with the whole graph and iteratively cut the edges

divide network progressively into smaller and smaller disconnected subnetworks

identified as the communities.

Crucial point: how to select those edges to be cut.

Example: Girvan & Newman algorithm (GN)

Problem of the GN algorithm: it requires the repeated evaluation of a global property, the

betweenness, for each edge, whose value depends on the properties of the whole

system.

This becomes computationally very expensive for networks with, e.g., 10000 nodes.




Faster algorithm

Introduce divisive algorithm that only requires the consideration of local quantities.

Need: quantity that can single out edges connecting nodes belonging to different

communities.

Consider edge-clustering coefficient:

number of triangles to which a given edge belongs divided by the number of

triangles that might potentially include it, given the degrees of the adjacent nodes.

For the edge connecting node i to node j, the edge-clustering coefficient is

$$C^{(3)}_{i,j} = \frac{z^{(3)}_{i,j} + 1}{\min\left[(k_i - 1),\, (k_j - 1)\right]}$$

where $z^{(3)}_{i,j}$ is the number of triangles built on that edge and

$\min[(k_i - 1), (k_j - 1)]$ is the maximal possible number of them.

1 is added to $z^{(3)}_{i,j}$ to remove degeneracy for $z^{(3)}_{i,j} = 0$.
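As a small illustration, the triangle-based coefficient (number of triangles on the edge plus 1, divided by min(k_i − 1, k_j − 1)) can be computed directly from adjacency sets. The function name and input format are assumptions of this sketch, and returning infinity when one endpoint has degree 1 is a convention chosen here so that such edges are never picked for removal.

```python
def edge_clustering_coefficient(adj, i, j):
    """(triangles on edge (i, j) + 1) / min(k_i - 1, k_j - 1)."""
    # Every common neighbour of i and j closes one triangle on the edge.
    triangles = len(adj[i] & adj[j])
    possible = min(len(adj[i]) - 1, len(adj[j]) - 1)
    if possible == 0:
        # Assumed convention: an edge to a degree-1 node cannot lie in a triangle.
        return float("inf")
    return (triangles + 1) / possible
```

In a graph made of two triangles joined by a bridge, the bridge gets the value (0 + 1) / 2 = 0.5, while an edge inside a triangle whose endpoints both have degree 2 gets (1 + 1) / 1 = 2, so the bridge is the first edge to be cut.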



Faster algorithm

Edges connecting nodes in different communities are included in few or no

triangles and tend to have small values of Ci,j(3).

On the other hand, many triangles exist within clusters.

By considering higher order cycles one can define coefficients of order g

$$C^{(g)}_{i,j} = \frac{z^{(g)}_{i,j} + 1}{s^{(g)}_{i,j}}$$

where $z^{(g)}_{i,j}$ is the number of cyclic structures of order g the edge (i,j) belongs to,

and $s^{(g)}_{i,j}$ is the number of possible cyclic structures of order g that can be built

given the degrees of the nodes.

Define, for every g, a detection algorithm that works exactly as the GN method

with the difference that, at every step, the removed edges are those with the

smallest value of $C^{(g)}_{i,j}$.

By considering increasing values of g, one can smoothly interpolate between a

local and a nonlocal algorithm.



Comparison with GN method

Test of the efficiency of the different algorithms in the analysis of the artificial graph

with four communities. Here N = 128 and pin is changed with pout to keep the

average degree equal to 16.

(Left) Strong definition: fraction of successes for the different algorithms compared

with the analytical probability that four communities are actually defined.

(Right) Weak definition: in addition to the same quantities plotted in Left, here we

report, for every algorithm, the fraction f of nodes not correctly classified.




Comparison with GN algorithm

Plot of the dendrograms for the network of college football teams, obtained by

using the GN algorithm (Left) and our algorithm with g = 4 (Right).

Different symbols denote teams belonging to different conferences.

In both cases, the observed communities perfectly correspond to the conferences,

with the exception of the six members of the „Independent conference“, which are

misclassified.




Simple network clustering based on shortest-path distance

Aim: compute modular organization of cellular networks controlling specific

biological responses.

Ideas:

(i) the shortest path between any two vertices (proteins) is probably the most

relevant for functional associations;

(ii) each vertex in a network has a unique profile of shortest-path distances through

the network to every other vertex

(iii) module comembers are likely to have similar (clustered) shortest-path-distance

profiles.

Rives & Galitski PNAS 100, 1128 (2003)



Network clustering

Yeast PI network; 4079 proteins, 6761 protein interactions.

MIPS: 133 signaling proteins, 64 have at least 1 interaction with another signaling

protein.

Algorithm: assign length 1 to each edge in protein interaction network.

Compute all-pairs shortest-path distance matrix: contains length of the shortest

path (distance) d between every pair of vertices in the network.

Convert into "association matrix" using 1/d^2.

Associations range from 0 to 1.

Emphasizes local association in subsequent clustering.

Use hierarchical agglomerative average-linkage clustering.
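The distance and association steps can be sketched as below. This is a minimal sketch, assuming unit edge lengths as stated above; the function names are hypothetical, and two conventions are choices of this example rather than of the slides: the self-association on the diagonal is set to 1, and unreachable pairs get association 0.

```python
from collections import deque


def shortest_path_distances(adj):
    """All-pairs shortest-path distances by BFS from every vertex (edge length 1)."""
    dist = {}
    for s in adj:
        d = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[s] = d
    return dist


def association_matrix(adj):
    """Map every vertex pair (u, v) to the association 1 / d(u, v)^2."""
    dist = shortest_path_distances(adj)
    assoc = {}
    for u in adj:
        for v in adj:
            if u == v:
                assoc[u, v] = 1.0          # assumed convention for the diagonal
            elif v in dist[u]:
                assoc[u, v] = 1.0 / dist[u][v] ** 2
            else:
                assoc[u, v] = 0.0          # assumed convention: no path, no association
    return assoc
```

Direct interaction partners get association 1, partners at distance 2 get 0.25, and so on, which is how the transformation emphasizes local association before the clustering step.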




Clustering of yeast signaling protein interaction network

A symmetrical matrix of 64 proteins of the MIPS-database signaling category was clustered identically in both dimensions. The cluster tree is not shown. Each row or column represents a protein. Each feature is the intersection of two proteins and is a grayscale representation of pairwise protein association.

Columns to the right of the clustered network

represent MIPS-defined signaling pathways

[P, polarity-PKC; R, Ras; H, HOG; M,

mating/filamentation MAPK (mfMAPK)].

White bars in the MIPS-pathway columns

indicate protein members of the pathway.

Ras-pathway proteins form a single

cluster.

3 MAPK pathways as clusters.




Network clustering of high-throughput data sets

HTS-Data usually has high (50%) false-positive error frequencies!

Also, many binary interactions may not occur within modules.

Because interacting proteins usually localize in the same subcellular compartment

one may integrate interaction and localization data for the identification of modules.

Single proteins with many interactions in Y2H screens (hubs) nucleate large

clusters that are not modules.




examples of derived clusters

Clustering of the yeast nuclear-protein network

derived from high-throughput interaction and

localization data.

(A) Examples of clusters representing module

rudiments are labeled. The cluster tree is not

shown. Arrows indicate high-connectivity hub

proteins.

(B) Example clusters are shown in detail.

Cluster comembers participating in some

common structure or function have large bold

labels.




Properties of hubs

All hub proteins indicated bind > 90 proteins in the global Y2H network.

The proteins bound by these hubs are randomly distributed in cellular

compartments.

The nuclear-localized proteins bound by these hubs form the 4 largest clusters.

Proteins bound by high-connectivity hubs will have few or no interactions among

themselves if they are not functionally associated („hub-and-spokes“ structure).

Hence, proteins bound by each high-connectivity hub are not functionally associated with

each other, and their clusters do not represent modules.




connectivity neighborhood clustering

Global protein connectivity versus neighborhood clustering. Each protein in the global protein network is plotted by its connectivity, k, and its neighborhood clustering, C. Arrows indicate high-connectivity proteins shown in Fig. 2A.


The 4 high-connectivity hubs are among 15 outliers. Although these proteins have

exceedingly high connectivity, they almost completely lack neighborhood clustering.

A useful criterion to distinguish modules from nonmodules?
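The two quantities plotted against each other can be computed per protein as in the following sketch. It assumes that "neighborhood clustering" means the usual clustering coefficient, that is, the fraction of pairs of a node's neighbours that are themselves linked; that reading, the function name, and the value 0 for nodes with fewer than 2 neighbours are assumptions of this example.

```python
def connectivity_and_clustering(adj):
    """Return {node: (k, C)} with degree k and neighbourhood clustering C."""
    result = {}
    for v, nbr_set in adj.items():
        nbrs = list(nbr_set)
        k = len(nbrs)
        if k < 2:
            result[v] = (k, 0.0)  # assumed convention: no neighbour pairs to link
            continue
        # Count links among the neighbours of v.
        links = sum(
            1
            for a in range(k)
            for b in range(a + 1, k)
            if nbrs[b] in adj[nbrs[a]]
        )
        result[v] = (k, 2.0 * links / (k * (k - 1)))
    return result
```

A hub-and-spokes structure shows up as a node with large k and C close to 0, whereas a member of a densely connected module has C close to 1.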



Application to biological-response networks

Incorporate network clustering into a 3-step process to study complex biomolecular

systems; this generates a modular network-structure model:

(i) compile known and suspected components of the response network (from

databases, expression profiling, proteomics, genetic screens, metabolite profiles ...)

(ii) cluster network based on interactions between vertices. Edges can represent

any type of interaction.

(iii) abstract modular network-structure model showing modules.

Cluster 90 filamentation-network proteins that have at least 1 interaction with other

filamentation proteins.




Clustering of the yeast filamentation network

Proteins of the yeast

filamentation network were

clustered. A tree-depth

threshold was set.

Tree branches with ≥ 3 leaves

(clusters with ≥ 3 proteins)

below the tree threshold are

shown.

Bullets and large bold labels

indicate proteins of highest

intracluster connectivity.




Modular model of the yeast filamentation network

Clusters indicated in Fig. 4 are

abstracted as modules. All intermodule

paths in the filamentation network are

indicated as black lines with the

interacting proteins at the termini.

A gray line connecting the Ras and

protein kinase A modules was added to

indicate a connection mediated by the

small molecule cAMP.




Filamentous growth-response of yeast cells

(A) Wild-type yeast-form cells grown in SHAD

liquid medium.

(B) Wild-type filamentous-form cells grown for

10 h on SLAD agar medium.

For budding yeast diploid cells, low availability

of ammonium and a solid growth substrate

trigger a dimorphic switch to filamentous-form

growth, characterized by cell elongation,

unipolar distal budding, adhesion and invasion.

Prominent involved pathways: cAMP-dependent

protein kinase, fMAP kinase, Cdc28 kinase

activity, ubiquitination by SCF ubiquitin-ligase.

Here: investigate next step, ubiquitin-dependent

degradation by 26S proteasome.

Prinz et al. Genome Research 14, 380 (2004)



Integrated filamentation network

The filamentation network includes proteins (rectangular nodes) implicated in filamentous growth by expression profiling or known phenotypes, and metabolites (triangular nodes) that are either substrates or products of filamentation-protein enzymes.

Not shown are filamentation proteins with neither a protein–metabolite interaction nor a protein–protein interaction with another filamentation protein.

Blue edges: protein–protein interactions.

Green edges: protein–metabolite interactions.

Each gene node is colored based on its

expression log-ratio. Shades of red indicate

higher expression in the filamentous form

relative to the yeast form; shades of blue

indicate the opposite response; white indicates

no difference.

Prinz et al. Genome Research 14, 380 (2004)



Collective Functions of Network Clusters

If clusters in an integrated network

represent biological modules, the

clusters should have collective

functions in specific biological

processes.

Specific biological-process gene annotations (taken from the GO database) are found overrepresented in specific filamentation-network clusters.

Significance: -log (cumulative probability of the

observed data and all more extreme probabilities).

Prinz et al. Genome Research 14, 380 (2004)
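The significance score can be illustrated with a short calculation. The slide does not say which distribution was used, so the hypergeometric model below, the base-10 logarithm, the function name, and the parameter names are all assumptions of this sketch: N genes in the network, K of them carrying the annotation, a cluster of n genes, and k annotated genes observed in the cluster.

```python
from math import comb, log10


def enrichment_significance(N, K, n, k):
    """-log10 of P(X >= k) for X ~ Hypergeometric(N, K, n).

    N -- genes in the whole network (assumed background)
    K -- genes in the network carrying the annotation
    n -- genes in the cluster
    k -- annotated genes observed in the cluster
    """
    # Sum the probability of the observed count and all more extreme counts.
    tail = sum(
        comb(K, x) * comb(N - K, n - x)
        for x in range(k, min(K, n) + 1)
    )
    return -log10(tail / comb(N, n))
```

Larger values mean that the observed overlap between cluster and annotation is less likely to arise by chance under the assumed model.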



Modular abstraction of the filamentation network

Network clusters are abstracted

as circular "module nodes."

The area of each module node

is proportional to the number of

member molecules. The color of

each module node reflects the

average expression log-ratio of

member genes.

Each module node is assigned

the name of the member node

of highest intracluster degree

(the highest number of interactions with cluster co-members); most are proteins,

some are metabolites.




Quantitative identification of network clusters

Nodes of the filamentation network were

iteratively joined into clusters.

(A) A cluster was defined as a joined group

containing at least 3 protein nodes.

The number of clusters is plotted as a

function of join number.

(B) The selection of nodes/clusters to join

was based on average-linkage Manhattan

distance of node shortest-paths-distance

profiles.

This distance metric is plotted as a function of

join number.

The arrows indicate join #535 corresponding

to the highest join number with the highest

number of clusters.

Prinz et al. Genome Research 14, 380 (2004)



RPN12, GRR1, and CDC28 modules and their components

Modules (A), and their respective

components (B) with collective functions in

cell-cycle control and ubiquitin-dependent

proteolysis are shown.




growth behavior of rpn4 mutants

rpn4 mutants show Cln1-dependent hyperelongation and cell type-independent agar adhesion.

(A) Diploid wild-type, rpn4, cln1, and rpn4 cln1 strains were grown

on SLAD agar plates and

photographed after 9 h.

(B) Patches of strains of the indicated

cell types and genotypes were

subjected to a wash-off assay of

adhesion. The plate was imaged

before and after washing with water.




Stabilization of Cln1 protein in rpn4 mutants

(A) Northern blot analysis of total RNA from wild-type and

rpn4 strains, and a cln1 strain. The blot was probed

consecutively with probes for CLN1 and RPN12. The asterisk

in the CLN1 blot indicates a cross-hybridizing band that also

serves as a loading control.

(B) Western blot analysis of Cln1 protein in diploid wild-type

and rpn4 strains carrying HA-tagged CLN1, and a no-tag

wild-type control strain. Protein extracts were prepared from

cells grown for 10 h on SLAD agar plates. Pgk1 protein levels

served as a loading control.

(C) Cln1-HA protein was immunoprecipitated from an rpn4

strain. Aliquots of the immunoprecipitate were incubated with

calf-intestine phosphatase (CIP), or without CIP, and

analyzed by Western blotting.

(D) myc-tagged Cln1 protein was immunoprecipitated in

diploid wild-type and rpn4 strains, and a no-tag control

strain. All strains had a multicopy plasmid expressing HA-tagged

ubiquitin. Immunoprecipitates were analyzed by gel

electrophoresis and immunoblotting with anti-HA antibody to

detect ubiquitin conjugates. The blot membrane was stripped

and reprobed with anti-myc antibodies to detect the

immunoprecipitated Cln1.





Non-random interaction among filamentation proteins

a Interaction data include all protein-protein interactions plus all metabolic interactions. Each analysis used either biological interaction data or 10 datasets in which interactions were assigned randomly to pairs of proteins.

b Each analysis included a list of either the 1026 filamentation proteins, or the 873 expression-implicated proteins, or 10 sets of random proteins.

c The number of proteins in the list that have at least one interaction with another protein in the list.

d The number of direct interactions between pairs of proteins in the list.

e Node degree: number of incident edges of the node. Mean node degree: ratio of 2 × (number of interactions) to the number of interacting proteins.



Expression change within clusters

RPN12, GRR1, and CDC28

modules and their components.

Modules (A), and their respective

components (B) with collective

functions in cell-cycle control and

ubiquitin-dependent proteolysis are

shown. Graphic representations

are as in Figures 2 and 3.


... table continues ...



Biological Insights from modular network abstraction

(1) In an integrated network, data on molecules and interactions shows clustered

organization that can be identified quantitatively

(2) Cluster co-member genes show significant coordination of expression change,

as expected for genes involved in a collective function.

(3) Cluster co-member genes show significant overrepresentation of biological-

process annotations, indicating collective function.

(4) The modular network abstraction intuitively stimulates testable biological

insights on complex biological properties.




Evolutionary conservation of motif constituents in the yeast protein interaction network

Wuchty, Oltvai, Barabasi, Nature Gen 35, 176 (2003)

Question: why are some cellular components conserved across species

but others evolve rapidly?

Many biological functions are carried out by the integrated activity of highly

interacting cellular components = functional modules

Motifs = topologically distinct interaction patterns within complex networks;

they may represent the simplest building blocks of modules.

Here, test the correlation between a protein's evolutionary rate and the

structure of the motif it is embedded in

identify all 2-, 3-, 4-node motifs and some 5-node motifs



shared components

Data from DIP database,

3183 interacting yeast proteins

if there is evolutionary pressure to

maintain specific motifs, their

components should be evolutionarily

conserved and have identifiable

orthologs in other organisms.

Study conservation of 678 S. cerevisiae

proteins with an ortholog in each of 5

higher eukaryotes:

Arabidopsis thaliana, C. elegans,

Drosophila melanogaster, Mus

musculus, Homo sapiens.


Algorithm to detect all

n-node subgraphs:

scan all rows of the adjacency

matrix M. For each non-zero

element (i,j) representing a link,

scan through all neighbors of

(i,j) until a specific n-node

subgraph is detected.
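For the smallest motifs, the scan over links and their neighbours can be illustrated as below. This sketch only covers the two 3-node motifs (linear chain and triangle) and works on adjacency sets instead of an adjacency matrix; the function name and the requirement that node labels be sortable are assumptions of this example, not the procedure used in the cited study.

```python
from itertools import combinations


def three_node_motifs(adj):
    """Enumerate 3-node motifs of an undirected graph.

    Returns (chains, triangles): chains are (end, center, end) tuples whose
    two ends are NOT linked; triangles are frozensets of three linked nodes.
    """
    chains, triangles = set(), set()
    for center in adj:
        # Every pair of neighbours of `center` forms a 3-node subgraph with it.
        for a, b in combinations(sorted(adj[center]), 2):
            if b in adj[a]:
                triangles.add(frozenset((a, b, center)))  # closed: triangle
            else:
                chains.add((a, center, b))                # open: linear chain
    return chains, triangles
```

Larger motifs can be enumerated in the same spirit by extending each detected subgraph by one further neighbour and checking which of the possible links are present.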



shared components

#motifs of a given kind in the yeast PI

network

fraction of original yeast motifs that is evolutionarily fully conserved: each of their protein components belongs to the 678 orthologous proteins

fraction of motifs that is fully conserved

for the random ortholog distribution

column 4 / column 5

less than 5% of #2 (linear 3-protein motifs) are completely maintained

47% of the fully connected pentagons (#11) are fully conserved!
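The quantities in these columns can be illustrated with a short sketch: the fraction of motifs whose components are all orthologs, and its ratio to the same fraction obtained when the ortholog labels are placed on randomly chosen proteins. The function names, the number of random trials, and the fixed random seed are assumptions of this example and are not taken from the cited study.

```python
import random


def fully_conserved_fraction(motifs, orthologs):
    """Fraction of motifs in which every protein is in the ortholog set."""
    motifs = list(motifs)
    conserved = sum(1 for m in motifs if all(p in orthologs for p in m))
    return conserved / len(motifs)


def conservation_ratio(motifs, proteins, orthologs, trials=100, seed=0):
    """Natural conservation fraction divided by the mean random fraction."""
    motifs = list(motifs)
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    pool = sorted(proteins)
    natural = fully_conserved_fraction(motifs, orthologs)
    randomized = [
        # Same number of orthologs, but assigned to random proteins.
        fully_conserved_fraction(motifs, set(rng.sample(pool, len(orthologs))))
        for _ in range(trials)
    ]
    mean_random = sum(randomized) / trials
    return natural / mean_random if mean_random > 0 else float("inf")
```

A ratio well above 1 for a motif type indicates that its complete conservation occurs more often than expected from the number of orthologs alone.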



topology conservation of individual proteins

Larger motifs tend to

be conserved as a

whole, where each

component has an

ortholog.


E.g. less than 1% of the fully connected pentagon motifs disappeared completely,

for 69% of them, each of the subunits had an ortholog in human.

Clear correlation between the conservation rate and the degree of saturation of

a motif.

Participation in motifs substantially influences the evolutionary conservation of

specific components.



From 65% (C = 0) to 84% (C = 1) of neighbors of a human ortholog were also

human orthologs (filled circles). The conserved fraction of the nonorthologous

protein's neighborhood is markedly smaller.

Enrichment = ratio between the percentages of orthologous proteins at distance d

from an ortholog in the natural and the random orthologous sets.

d: shortest distance between i and target protein measured along network links.

Proteins that interact directly with an ortholog at d=1 have a 50% higher chance of

conservation than at random!


clustering coefficient conservation of proteins ?



Examine if the specific function of the yeast proteins within motifs affects their rate

of evolutionary conservation.

Assign each motif to functional class to which its protein components belong.

Larger motifs have a notable functional homogeneity:

- for 95% of fully connected yeast pentagon motifs (#11) all components shared at

least one common functional class,

- only 10% of the 2-node motifs (#1) are functionally conserved.

Identify type and number of evolutionary fully conserved motifs of each functional

class in S. cerevisiae, for those that have an ortholog in humans.


function conservation?



shared components

For 3 functional classes

(subcellular localization, protein

fate, transcription) each of the 11

studied motifs is considerably

overrepresented.

Some other functional classes

have only 1 or 2 characteristic

motifs.

No motifs are found for: transposable elements, energy, cellular fate, cellular communication, cellular rescue, cellular organization, metabolism, protein activity, protein binding.

Wuchty, Oltvai, Barabasi, Nature Gen 35, 176 (2003)




The fully connected motifs (#9 and #11) tend to identify protein complexes.

However, the mere existence of protein complexes cannot explain the observed

trends towards higher conservation rates of the highly connected motifs.




shared components

Shared components = proteins or groups of proteins occurring in different

complexes are fairly common:

A shared component may be a small part of many complexes, acting as a unit that

is constantly reused for its function.

Also, it may be the main part of the complex e.g. in a family of variant complexes

that differ from each other by distinct proteins that provide functional specificity.

Aim: identify and properly represent the modularity of protein-protein interaction

networks by identifying the shared components and the way they are arranged to

generate complexes.




Summary

Modules are a key intermediate level in the organizational hierarchy of cells.

Biological Module: loose association of preferred molecular interaction partners

that interact to perform a collective function.

Modules can be identified based on structural characteristics such as their closely

connected members and interfaces to other modules.

There is evidence that modules are evolutionarily conserved.

Module co-members tend to be coordinately expressed.
