View
1
Download
0
Category
Preview:
Citation preview
Spectral Algorithms for BiologicalNetworks
Des HighamDepartment of Mathematics
University of Strathclyde
djh@maths.strath.ac.uk
graphnet07 – p.1/42
Collaborators
Julie Morrison, formerly U. Strathclyde
David Gilbert, U. Glasgow
Rainer Breitling, U. Groningen
A lock-and-key model for protein-protein interactions,Bioinformatics, 2006
Nataša Pržulj, U. C. Irvine
Modelling protein-protein interaction networks via astickiness index, J. Royal Society Interface, 2006
Marija Rašajski, U. C. Irvine & U. Belgrade
graphnet07 – p.2/42
EPSRC-funded project
Theory and Tools for Complex Biological Networks(2007–2010)
Peter Grindrod, University of Reading
Gabriela Kalna, University of Strathclyde
Alastair Spence, University of Bath
Zhivko Stoyanov, University of Bath
Keith Vass, Beatson Inst. for Cancer Research
graphnet07 – p.3/42
Overview
Protein-protein interaction (PPI) networks
Random graph models
Geometric model
Algorithm for testing geometric model
Lock-and-key model
Algorithm for discovering locks and keys
Results on biological data
graphnet07 – p.4/42
Central Dogma of Molecular Biology
graphnet07 – p.5/42
Yeast 2-Hybrid Protein-Protein Interaction Networks
Data:
list of N proteins (nodes)
list of protein pairs (edges)
This is an undirected, unweighted graph
Also, a symmetric N × N matrix of 0’s and 1’sYeast has N ≈ 3, 000
graphnet07 – p.6/42
Uetz et al. 2000, Yeast PPIyeast.gif (GIF Image, 612x695 pixels) http://www-personal.umich.edu/~mejn/networks/yeast.gif
1 of 1 17/10/05 13:54
Specificity and stability in topology of proteinnetworks, S. Maslov & K. Sneppen, Science, 2002
graphnet07 – p.7/42
Adjacency Matrix: Uetz et al. 2000, Yeast PPI
0 100 200 300 400 500 600 700 800 900 1000
0
100
200
300
400
500
600
700
800
900
1000
graphnet07 – p.8/42
Adjacency Matrix: Ito et al. 2001, Yeast PPI
0 500 1000 1500 2000 2500 3000
0
500
1000
1500
2000
2500
3000
graphnet07 – p.9/42
Y2H Protein-Protein Interaction Networks
Noisy: 50–90% false positive, 50–90% false negative
Two types of false positive
Technical: experimental limitations
Biological: don’t occur in vivo
not expressed at same timenot in same sub-cellular compartment, or sametissue
Interactions may also depend on the environment
How can we use this data . . . . . . ?
graphnet07 – p.10/42
Fine details. . .
Typical questions:
Are there any other proteins like protein Y?
What is the biological function of protein X?
Which proteins act together?
What happens if protein Z is removed?
Also: which are the false pos/negs ?
graphnet07 – p.11/42
Big picture. . .
PPI networks are not regularDescribe them by a random graph model?
capture many PPI networks with a small number ofparameters:
distinguish between different organismsget evolutionary insights
generate synthetic data sets to test algorithms
Several random graph “models” have been proposed . . . . . .
graphnet07 – p.12/42
Comparing Networks
Global Measures
Degree distribution
Pathlength distribution
Clustering coefficients
Local Measure
graphlet frequencies . . .
graphnet07 – p.13/42
Graphlet Frequencies (Pržulj et al.)
7 86543
9 1110 12 13 14 15 16
4-node graphlets3-node graphlets
292827262524232221
191817
21
20
5-node graphlets
Frequency of graphlet i (0 ≤ i ≤ 29)
number of graphlets of type i
total number of graphletsgraphnet07 – p.14/42
Geometric Model (Pržulj et al., 2004)
0 0.2 0.4 0.6 0.8 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
graphnet07 – p.15/42
Geometric Random Graphrandomly place N nodes in unit square
connect nodes within distance ε
Able to match PPI properties (pathlengths, clusteringcoefficients, degree distributions, graphlet frequencies)
Modeling Interactome: Scale-Free or Geometric?, N.Pržulj, D. Corneil and I. Jurisica, Bioinformatics, 2004
Analyzing Large Biological Networks . . . , N. Pržulj,Ph.D. Thesis, University of Toronto, 2005
Question: Given a PPI network, can we map it on to ageometric random graph?
⇒ develop a tool for reverse engineering a GRG
Given nodes and edges, optimally place the nodes in R2
such that nodes within a distance ε are connectedgraphnet07 – p.16/42
Multi-Dimensional Scaling (MDS)
Problem:
Given all pairwise distances {dij}Ni,j=1,
find vectors {x[i]}Ni=1 ∈ R
m such that
‖x[i] − x
[j] ‖ = dij , ∀i, j
i.e. go from pairwise distance to locationNotation
X =
x[1]
x[2]
x[N ]
......
......
......
......
∈ R
m×N
graphnet07 – p.17/42
MDS Theory
Define (sym. pos. def.) A ∈ RN×N by
A(i,j) = -0.5*( Dsq(i,j) - mean(Dsq(i,:)) ...- mean(Dsq(:,j)) + mean(mean(Dsq)) );
Then XT X = A ⇒ ‖x[i] − x
[j] ‖ = dij
Symm. Real Schur Decomp. A = UT ΣU ⇒ use
X = Σ1
2 U =
√σ1u
[1] . . . . . . . . .√σ2u
[2] . . . . . . . . .. . . . . . . . . . . .√
σNu[N ] . . . . . . . . .
∈ RN×N
To embed into, say, R2 “best” approximation is
X = Σ1
2 U =
[ √σ1u
[1] . . . . . . . . .√σ2u
[2] . . . . . . . . .
]
graphnet07 – p.18/42
MDS to reverse engineer a GRG?
PPI data is “0 or 1”, we don’t have Euclidean distances
Idea use pathlengthd2
ij = pathlength from node i to node j
Compute pathlengths 1, 2, . . . ,K, set the rest to Kmax, so
distance matrix is sparse plus rank 1
∞’s avoided
Now apply MDS to recover locations in R2
graphnet07 – p.19/42
N = 100, ε = 0.25 (K = 4, Kmax
= 5)
Eigs of A: 38.2, 30.1, 10.7, 8.9, 6.1, 3.8, . . .
0 0.5 10
0.2
0.4
0.6
0.8
1Original Nodes
0 50 100
0
20
40
60
80
100
Adjacency Matrix
0 0.5 10
0.2
0.4
0.6
0.8
1Relocated Nodes
0 0.1 0.2 0.3 0.4 0.50
0.5
1
1.5
2
ε
|| W
− W
ε ||2 /
|| W
||2
graphnet07 – p.20/42
Same Example, “optimal” ε
0 10 20 30 40 50 60 70 80 90 100
0
10
20
30
40
50
60
70
80
90
100
Correct Neg: 4091Correct Pos: 610False Neg: 102False Pos: 147
graphnet07 – p.21/42
Same Example, ROC curve
Area under curve is 0.965
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1S
ensi
tivity
: TP
/(T
P +
FN
)
1−Specificity: 1−[TN/(TN+FP)]
graphnet07 – p.22/42
GRG data, coin flip to predict links
Area under curve is 0.48
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1S
ensi
tivity
: TP
/(T
P +
FN
)
1−Specificity: 1−[TN/(TN+FP)]
graphnet07 – p.23/42
Erdõs–Rényi Random Graph with MDS algorithm
Area under curve is 0.67
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1S
ensi
tivity
: TP
/(T
P +
FN
)
1−Specificity: 1−[TN/(TN+FP)]
graphnet07 – p.24/42
Nineteen PPI networks
YU YICUYIC YIP YHCY11K YK YD YM FH WS WC WE HV HSHHSMHSL HM HH0.5
0.55
0.6
0.65
0.7
0.75
0.8
0.85
0.9
0.95
1Values of the areas under the ROC curve for PPI networks
dim=2
dim=3
dim=4
High Confidence Von Mering et al. gave 0.89graphnet07 – p.25/42
Lock-and-Key Model
On the structure of protein-protein interactionnetworks, A. Thomas, R. Cannings, N. Monk, C. Cannings,Biochemical Society Transactions 31, 2003
Idea: two proteins interact because they ‘fit together’⇒ complementary domains, i.e. locks and keys
graphnet07 – p.26/42
Lock-and-Key Model
Thomas et al. model: m locks and m matching keys
let each protein have each lock and key withindependent probability p
put an edge between two proteins ⇔ they share at leastone lock/key pair
Thomas et al. looked at big picture issue:Does this model reproduce the almost scale free nature ofPPI networks?
Our approach:
introduce different modelling assumptions
develop an algorithm for inferring locks and keys
answer both big picture and fine detail questions
graphnet07 – p.27/42
Our AssumptionsThere exists a lock/key pair in the network such that anyprotein with this lock/key
does not have the matching key/lock
will only interact with a protein having the matchingkey/lock
only has a fixed proportion 0 ≤ θ ≤ 1 of its lock/keymatches recorded as interactions
⇒ the adjacency matrix has a pair of eigenvalues
λ = ±θ√
locksum × keysum
with eigenvectors√
keysum ind[lock] ±
√locksum ind
[key]
graphnet07 – p.28/42
Algorithm
Calculate eigenvals/vecs
Group into ≈ ±λ pairs
For each pair with eigvecs ua and ub
choose a threshold, K
|ua + ub|i ≥ K means protein i has lock
|ua − ub|i ≥ K means protein i has key
Successful at recovering locks and keys in syntheticallygenerated networks (good sensitivity and specificity)
graphnet07 – p.29/42
Spectral Properties: Uetz (2000) data>> [U,D] = eigs(W,8,’BE’);
>> diag(D)
ans =
-6.4614
-5.1460
-4.1557
-4.1270
4.3778
5.4309
5.8397
7.4096
200 400 600 800 1000−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8Sum and Difference
graphnet07 – p.30/42
Result for Uetz et al. (2000) yeast data
YFR024C
SLA1
YSC84
BZZ1
KTR3
YPL246C YPR171W
YKR030W
SNA3
YOR284W
YGR268C
APP1
SYS1
BOS1
VPS73
YMR253C
YLR064WYJR083C
YMR192W
ACF2
MMS4
RVS167
graphnet07 – p.31/42
Further Investigation . . .
Other biological data shows thatall five proteins in one group possess the SH3 domain
⇒ we have identified the key!
Recent experiments (Kessels & Qualmann 2004, Friesen etal. 2005) show that the SH3 domain is involved in traffickingof vesicles
All proteins in the other group are part of the actin corticalpatch assembly mechansim of vesicle endocytosis (Dreeset al. 2001)
[vesicle: small, enclosed compartment within a cell ]graphnet07 – p.32/42
Arabidopsis Thaliana (small flowering plant)
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
Homeobox Transcription Factor module?
graphnet07 – p.33/42
Saccharomyces Cerevisiae (yeast)
0 2 4 6 8 10 12 14 16 18 20
0
2
4
6
8
10
12
14
16
18
20
Protein Trafficking module?
graphnet07 – p.34/42
Homo Sapiens (us!)
0 1 2 3 4 5 6 7 8 9 10 11
0
1
2
3
4
5
6
7
8
9
10
11
Smad Transcription Factor module?
graphnet07 – p.35/42
Drosophila Melanogaster (fruit fly)
0 5 10 15 20 25 30 35 40 45 50
0
5
10
15
20
25
30
35
40
45
50
Cell Cycle Transcriptional Regulation module?
. . . plus many more . . .graphnet07 – p.36/42
Recap
Lock-and-key model: Extension of Thomas et al. (2003)model to
make testable predictions about PPI network structure
extract important structural information from (noisy)PPI data sets
Note: different to traditional clusteringEssentially clustering on paths of length two
Match local/global PPI properties? . . .
graphnet07 – p.37/42
Stickiness Model
Back to the big picture
Can we produce a model that matches PPI networkproperties?
Inferring number and distribution of locks and keys in a real(noisy) network: very challenging
Idea summarize abundance/popularity of binding domainsas a single number per protein: stickiness index
[Analogous idea of fitness in physics community]
graphnet07 – p.38/42
Modelling Assumptions
Assumption 1High degree implies many and/or popular binding domains:high stickinessSo high degree ⇒ high stickiness
Assumption 2A pair of proteins is more likely to interact (sharecomplementary binding domains) if they both have highstickiness indexTake the product of stickiness indices
Hence, we suppose P (i ↔ j) = f (degi) f(
degj
)
Match expected degree ⇒ f (degi) = degi√
P
N
k=1deg
k
graphnet07 – p.39/42
PseudoCode
input {degi}Ni=1
output{wij}Ni,j=1
for i = 1 to N
θi = degi/√
∑Nj=1 degj
endInitialize all wij = 0for i = 1 to N − 1
for j = i + 1 to Ncompute a uniform (0, 1) sample, r
if r ≤ θiθj
wij = 1 and wji = 1end if
end forend for
graphnet07 – p.40/42
Graphlet Frequency Comparison
0
50
100
150
200
250
YHC Y11K YIC YU YICU FE FH WE WC HS HG
Rel
ativ
e G
raph
let F
requ
ency
Dis
tanc
e
PPI Networks
Relative Graphlet Frequency Distances Between PPI and Model Networks
ERER-DD
SFGEO-3DSTICKY
graphnet07 – p.41/42
What’s New?
Spectral algorithm for discovering bi-partite subgraphs(locks and keys)
Realistic results on PPI networks
Spectral algorithm for reverse engineering ageometric graph
Supports the claim that PPI networks have somegeometric structure
Simplified stickiness model gives excellent local andglobal fit to PPI data
with Alan Taylor:CONTEST (CONTrolable TEST matrices) for MATLAB athttp://www.maths.strath.ac.uk/research/groups/numerical_analysis/contest
graphnet07 – p.42/42
Recommended