Approximation of Protein Structure for Fast Similarity Measures
Itay Lotan, Fabian Schwarzer
Comparing Protein Structures
Same protein: one conformation vs. another
Analysis of MDS and MCS trajectories (http://folding.stanford.edu)
Structure prediction applications
• Evaluating decoy sets
• Clustering predictions (Shortle et al, Biophysics ’98)
Graph-based Methods
• Stochastic Roadmap Simulation (Apaydin et al, RECOMB ’02)
k Nearest-Neighbors Problem
Given a set S of conformations of a protein and a query conformation c, find the k conformations in S most similar to c.
Can be done in $O(N \log k \, L)$ time, where
N – size of S
L – time to compare two conformations
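A minimal sketch of a brute-force query, assuming a linear scan over S that keeps the best k candidates in a bounded heap. The function name `k_nearest` and the `dist` callback are hypothetical and only serve the illustration. The scan performs N comparisons of cost L plus at most N heap updates of cost O(log k), which stays within the bound quoted above.

```python
import heapq

def k_nearest(S, c, k, dist):
    """Return the k items of S most similar to c as (distance, index) pairs,
    nearest first. `dist` is any function comparing two conformations."""
    heap = []  # max-heap through negated distances; never holds more than k entries
    for idx, s in enumerate(S):
        d = dist(s, c)                      # one comparison, cost L
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif -d > heap[0][0]:               # closer than the current k-th nearest
            heapq.heapreplace(heap, (-d, idx))
    return sorted((-neg_d, idx) for neg_d, idx in heap)
```

Calling `k_nearest` once per conformation of S gives the all-pairs variant, which is where the cost of a single comparison starts to dominate.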
k Nearest-Neighbors Problem
What if needed for all c in S?
$O(N^2 \log k \, L)$: too much time
Can be improved by:
1. Reducing L
2. A more efficient algorithm
Our Solution
• Reduce structure description → approximate but fast similarity measures
• Reduce description further → efficient nearest-neighbor algorithms can be used
Description of a Protein’s Structure
3n coordinates of Cα atoms (n – number of residues)
m-Averaged Approximation
• Cut chain into pieces of length m
• Replace each sequence of m Cα atoms by its centroid
3n coordinates → 3n/m coordinates
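The two steps above can be sketched in a few lines. The function name `m_average` is hypothetical, and the handling of a trailing piece shorter than m (it is averaged over the atoms it has) is an assumption, since the slides do not say what happens when m does not divide n.

```python
def m_average(coords, m):
    """coords: list of (x, y, z) C-alpha positions in chain order.
    Returns the centroids of consecutive pieces of length m."""
    if m < 1:
        raise ValueError("m must be >= 1")
    reduced = []
    for start in range(0, len(coords), m):
        piece = coords[start:start + m]     # last piece may be shorter (assumption)
        size = len(piece)
        centroid = tuple(sum(p[axis] for p in piece) / size for axis in range(3))
        reduced.append(centroid)
    return reduced
```

For a chain of n residues this returns about n/m points, so any measure computed on the result works with 3n/m coordinates instead of 3n.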
Similarity Measures - cRMS
The RMS of the distances between corresponding atoms after the two conformations are optimally aligned
$$\mathrm{cRMS}(P, Q) = \min_{T} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\| p_i - T q_i \right\|^2}$$
Computed in O(n) time
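As an illustration, cRMS can be computed with the Kabsch procedure: match the centroids, then obtain the optimal rotation from the SVD of a 3x3 covariance matrix. The slides do not say which alignment method was used, so the choice of Kabsch, the function name `crms`, and the use of NumPy are all assumptions of this sketch. Building the covariance matrix is linear in n and the SVD is on a fixed-size matrix, consistent with the O(n) bound.

```python
import numpy as np

def crms(P, Q):
    """cRMS between two (n, 3) coordinate arrays whose rows correspond."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)              # optimal translation: match centroids
    Qc = Q - Q.mean(axis=0)
    H = Qc.T @ Pc                        # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])           # exclude reflections
    R = Vt.T @ D @ U.T                   # rotation taking Qc onto Pc
    diff = Pc - Qc @ R.T
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

Because reflections are excluded, a conformation and its mirror image generally have a non-zero cRMS.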
Similarity Measures - dRMS
The Euclidean distance between the intra-molecular distance matrices of the two conformations
$$\mathrm{dRMS}(P, Q) = \sqrt{\frac{2}{n(n-1)} \sum_{i=2}^{n} \sum_{j=1}^{i-1} \left( d^{P}_{ij} - d^{Q}_{ij} \right)^2}$$
Computed in $O(n^2)$ time
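A direct sketch of this definition, with the hypothetical function name `drms`. It loops over all n(n-1)/2 atom pairs, which is where the quadratic cost comes from. No alignment step is needed because internal distances do not change under rigid motion.

```python
import math

def drms(P, Q):
    """dRMS between two conformations given as equal-length lists of (x, y, z)."""
    n = len(P)
    if n != len(Q) or n < 2:
        raise ValueError("need two conformations of equal length >= 2")
    total = 0.0
    for i in range(1, n):
        for j in range(i):
            d_p = math.dist(P[i], P[j])   # intra-molecular distance in P
            d_q = math.dist(Q[i], Q[j])   # corresponding distance in Q
            total += (d_p - d_q) ** 2
    return math.sqrt(2.0 * total / (n * (n - 1)))
```

Internal distances are also unchanged by reflection, so a conformation and its mirror image have dRMS 0.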
Evaluation: Test Sets
8 structurally diverse proteins of size 54 – 76 residues:
1. Decoy sets: conformations from the Park-Levitt set (Park et al, JMB ’97), N = 10,000
2. Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue, Proteins ’00), N = 5,000
Decoy Sets Correlation
m     cRMS           dRMS
3     0.99           0.96 – 0.98
4     0.98 – 0.99    0.94 – 0.97
6     0.92 – 0.99    0.78 – 0.93
9     0.81 – 0.98    0.65 – 0.96
12    0.54 – 0.92    0.52 – 0.69
Higher correlation for random sets!
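Speed-up for Decoy Sets
• 9x for cRMS (m = 9)
• 36x for dRMS (m = 6)
with very small error
cRMS takes O(n) time, so m-averaging speeds it up by a factor of m; dRMS takes O(n²) time, so the factor is m².
For random sets the speed-up for dRMS goes up to 81x (m = 9)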
Efficient Nearest-Neighbor Algorithms
There are efficient nearest-neighbor algorithms, but they are not compatible with these similarity measures:
• cRMS is not a Euclidean metric
• dRMS uses a space of dimensionality n(n-1)/2
Further Dimensionality Reduction of dRMS
• kd-trees require dimension ≤ 20
• m-averaging with dRMS is not enough (for example, a 64-residue chain with m = 4 still gives 16·15/2 = 120 dimensions)
• Reduce further using SVD
SVD: a tool for principal component analysis. Computes directions of greatest variance.
Reduction Using SVD
1. Stack m-averaged distance matrices as vectors
2. Compute the SVD of entire set
3. Project onto most important singular vectors
dRMS is thus reduced to ≤ 20 dimensions
Without m-averaging SVD can be too costly
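The three steps can be sketched with NumPy as follows. The function names `distance_vector` and `svd_reduce` are hypothetical, and subtracting the mean before the SVD (as in standard principal component analysis) is an assumption, since the slides only say that the SVD of the entire set is computed. The input coordinates are assumed to be already m-averaged.

```python
import numpy as np

def distance_vector(coords):
    """Step 1: flatten the upper triangle of the intra-molecular distance matrix."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist[np.triu_indices(len(coords), k=1)]

def svd_reduce(vectors, n_components):
    """Steps 2 and 3: SVD of the stacked vectors, then projection onto the
    leading right singular vectors."""
    X = np.asarray(vectors, dtype=float)
    mean = X.mean(axis=0)                    # centring is an assumption
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_components]                # most important directions
    return (X - mean) @ basis.T, mean, basis
```

The Euclidean distance between two stacked vectors equals dRMS up to the constant factor sqrt(2 / (n(n-1))), and the projection can only shrink that distance, so distances in the reduced space approximate dRMS from below. The reduced points are ordinary low-dimensional Euclidean vectors, which is the form a kd-tree expects.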
Testing the Method
• Use decoy sets (N = 10,000) and random sets (N = 5,000)
• m-averaging with m = 4
• Project onto 16 PCs for decoys, 12 PCs for random sets
• Find k = 10, 25, 100 NNs for 250 conformations in each set
Results
Decoy sets:
• ~77% correct
• Furthest approximate NN off by 10% – 15%
• ~4k approximate NNs contain all true k NNs
Random sets:
• 71%, 76%, 84% correct for k = 10, 25, 100 respectively
• Furthest approximate NN off by 5% – 10%
• ~3k approximate NNs contain all true k NNs
More Results: N = 100,000
1CTF decoys:
• ~70% correct
• Furthest approximate NN off by ~20%
• ~6k approximate NNs contain all true k NNs
1CTF random:
• 46%, 48%, 60% correct for k = 10, 25, 100 respectively
• Furthest approximate NN off by ~16%
• ~7k approximate NNs contain all true k NNs
Running Time
N = 100,000, m = 4, PC = 16
Find k = 100 for each conformation
• Brute-force: ~84 hours
• Brute-force + m-averaging: ~4.8 hours
• Brute-force + m-averaging + SVD: 41 minutes
• Kd-tree + m-averaging + SVD: 19 minutes
kd-trees will have more impact for larger sets
Structural Classification
Computing the similarity between structures of two different proteins is more involved:
1IRD vs. 2MM1
The correspondence problem: which parts of the two structures should be compared?
STRUCTAL (Subbiah et al, ’93)
1. Compute optimal correspondence using dynamic programming
2. Optimally align the corresponding parts in space to minimize cRMS
3. Repeat until convergence
$O(n_1 n_2)$ time, where n_1 and n_2 are the numbers of residues of the two proteins
STRUCTAL + m-averaging
• 256 protein domains (180 – 420 res)
• 3691 good matches (Sandelin’s PROTOFARM)
• 6375 random pairs
• Compute SAS scores (cRMS/length*100)
m    correlation   speed-up
3    0.81          ~9x
4    0.77          ~16x
5    0.70          ~25x
[Figure: number of pairs vs. SAS score, for OK (P < 0.005) and BAD (P > 0.005) pairs]
Misclassification
        P ≤ 0.005                     P ≤ 0.001
m       Total         FP    FN        Total         FP    FN
1       410 (4%)      176   225       457 (4.5%)    215   242
3       802 (8%)      274   528       660 (6.5%)    262   398
4       993 (10%)     248   745       870 (8.5%)    282   588
5       1602 (16%)    186   1416      1129 (11%)    148   981
Random Chains
[Figure: random chain with vertices c_0, c_1, ..., c_{n-1}]
• The dimensions are uncorrelated
• Average behavior can be approximated by normal variables:
$$c_{i+1} = c_i + l \cdot N(0, 1)$$
1-D Haar Wavelet Transform
Recursive averaging and differencing of the values
Level   Averages               Detail Coefficients
3       [ 9 7 2 6 5 1 4 6 ]
2       [ 8 4 3 5 ]            [ 1 -2 2 -1 ]
1       [ 6 4 ]                [ 2 -1 ]
0       [ 5 ]                  [ 1 ]
[ 9 7 2 6 5 1 4 6 ] → [ 5 1 2 -1 1 -2 2 -1 ]
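The recursive averaging and differencing can be sketched as below, assuming the unnormalized convention in which each pair (a, b) becomes the average (a + b)/2 and the detail (a - b)/2. The function name `haar_transform` is hypothetical, and the restriction to power-of-two lengths is a simplification of this sketch.

```python
def haar_transform(values):
    """Averaging/differencing Haar transform of a list whose length is a power of two.
    Returns [overall average, coarsest detail, ..., finest details]."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        details = level_details + details   # coarser levels go in front
    return averages + details
```

Stopping the loop after the first pass leaves exactly the pairwise averages, so for the example input the level-2 averages [8, 4, 3, 5] are the 2-averaged signal.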
Transform of Random Chains
pdf of the detail coefficients is:
$$d_j \sim N\left(0, O(4^{j})\right)$$
• Coefficients expected to be ordered!
• Discard coefficients starting at lowest level
• m-averaging corresponds to discarding the lowest log m levels of detail coefficients
Random Chains and Proteins
Protein backbones behave on average like random chains
• Chain topology
• Limited compactness
Conclusion
• Fast computation of similarity measures
• Trade-off between speed and precision
• Exploits chain topology of proteins and limited compactness
• Allows use of efficient nearest-neighbor algorithms
• Can be used as filter when precision is important