A data-mining approach for multiple structural
alignment of proteins
WY Siu, N Mamoulis, SM Yiu, HL Chan
The University of Hong Kong
Sep 9, 2009
2
Aligning protein structures
Important step for understanding protein functions
  Sequencing proteins and determining 3D structures is easy
    X-ray crystallography, NMR spectroscopy
  Testing functions of proteins is hard
One useful observation
  Mutations change sequences
  Structures are conserved
  Structural similarity => functional similarity
Good structural alignment algorithm => predict functions of proteins
3
Our focus
We propose studying the problem with less information and fewer assumptions
Sequence order independence
  Sequence order (arrangement of amino acids) is unknown
  Reduces the need to find sequence information
Subset alignment
  Find a large alignment for every subset
  Extract similar structures from a mixture with dissimilar ones
Bottleneck metric
  In an alignment, every pair of aligned points has a small distance
4
Related work
Pairwise alignment
  Dali [Holm & Sander 93]
  VAST [Gibrat, Madej, & Bryant 96]
  CE [Shindyalov & Bourne 98]
Techniques to obtain multiple alignments
  center-star [Akutsu & Kim 99]
  Tree-progress [Taylor, Flores, & Orengo 94]
Multiple alignment
  MultiProt [Shatsky et al. 02] (seq. order)
  MultiBind [Shatsky et al. 06] (all align.)
  MUSTA [Leibowitz et al. 01] (all align.)
  MASS [Dror et al. 03] (seq. order)
  POSA [Ye & Godzik 05] (seq. order)
5
In the following
Model
Algorithms
  SOIL
Experimental results
6
Model
A protein is a set of amino acids in 3D, and an amino acid = 3 points in space
  For the Cα atom, C, and N
Substructure = subset of amino acids
Transformation T(S)
  For each s ∈ S, T(s) = Rs + t, where
    R is a 3 × 3 rotation matrix
    t is a 3 × 1 translation vector
Similarity
  Let C = {c1, …, cn} be a set of substructures and T = {T1, …, Tn} be a set of transformations
  C is ε-congruent w.r.t. T if we can transform each structure in C and align the amino acids such that the Cα atoms of every aligned pair are close (distance <= ε)
An ε-congruent alignment
  For a set S of structures, an alignment is a set of substructures C and transformations T
[Figure: structures S1 and S2 brought into alignment by a rotation and a translation]
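The transformation T(s) = Rs + t and the ε-congruence test can be illustrated with a minimal sketch. The function names and the toy coordinates are hypothetical and not from the talk; the check simply applies the bottleneck rule that every aligned pair must lie within ε.

```python
import math

def apply_transform(R, t, s):
    """Return T(s) = R s + t for a 3x3 rotation R (list of rows) and translation t."""
    return tuple(sum(R[r][c] * s[c] for c in range(3)) + t[r] for r in range(3))

def is_eps_congruent(points_a, points_b, pairs, eps):
    """Bottleneck check: every aligned pair (i, j) must lie within eps."""
    return all(math.dist(points_a[i], points_b[j]) <= eps for i, j in pairs)

# Hypothetical toy data: S2 is S1 rotated 90 degrees about z and then shifted.
S1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = (5.0, 0.0, 0.0)
S2 = [apply_transform(R, t, s) for s in S1]

# Undo the motion on S2: the inverse of a rotation is its transpose,
# and the inverse translation is -R^T t.
R_inv = [[R[c][r] for c in range(3)] for r in range(3)]
t_inv = tuple(-x for x in apply_transform(R_inv, (0.0, 0.0, 0.0), t))
S2_back = [apply_transform(R_inv, t_inv, s) for s in S2]

pairs = [(0, 0), (1, 1), (2, 2)]
aligned = is_eps_congruent(S1, S2_back, pairs, eps=1e-9)
```

Without the inverse transformation, the same pairs fail the test for any small ε, because the first points of S1 and S2 are 5 units apart.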
7
Problem definition
Size of an alignment: number of aligned amino acids of each protein
Cardinality: number of structures involved
Input
  A set of structures S = {S1, S2, …, Sm}
  A distance threshold ε
  A subset size threshold min_cardinality
  An alignment length threshold min_size
Output
  For each subset S' ⊆ S with |S'| ≥ min_cardinality, the maximal-length ε-congruent alignment whose length is at least min_size
8
The SOIL Algorithm
Sequence Order Independent aLignment
  Step 1. Geometric hashing
  Step 2. Frequent pattern mining
  Step 3. Generating alignments
9
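Geometric Hashing
Purpose
  Take each amino acid as a base (reference) and store the relative locations of the other amino acids in a hash table.
[Figure: structures S1 and S2 with amino acids numbered 1 to 5, laid over a grid of boxes. Each box stores the base under which a point falls in it. Length of box = ε]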
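The hashing step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the frame construction from the three atoms, the helper names, and the toy coordinates are all hypothetical. Each amino acid in turn serves as the base, the Cα positions of the other amino acids are expressed in its local frame, and the base is stored in the box (side ε) where each one lands.

```python
import math
from collections import defaultdict

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)

def local_frame(ca, c, n):
    """Orthonormal frame from the three atoms of one amino acid (assumed construction)."""
    x = unit(sub(c, ca))
    z = unit(cross(x, sub(n, ca)))
    y = cross(z, x)
    return ca, (x, y, z)

def build_hash_table(structures, eps):
    """structures[k] is a list of residues, each a (ca, c, n) triple of 3D points."""
    table = defaultdict(set)
    for k, residues in enumerate(structures):
        for i, (ca, c, n) in enumerate(residues):
            origin, axes = local_frame(ca, c, n)
            for j, (other_ca, _, _) in enumerate(residues):
                if j == i:
                    continue
                d = sub(other_ca, origin)
                box = tuple(math.floor(dot(d, axis) / eps) for axis in axes)
                table[box].add((k, i))   # store the base (structure, residue)
    return table

def make_residue(ca):
    """Hypothetical residue: C placed along x and N along y from the C-alpha atom."""
    return (ca, (ca[0] + 1.5, ca[1], ca[2]), (ca[0], ca[1] + 1.5, ca[2]))

# Toy input: the second structure is a translated copy of the first.
cas = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 4.0, 0.0)]
shifted = [(x + 10.0, y - 4.0, z + 2.0) for x, y, z in cas]
structures = [[make_residue(p) for p in cas], [make_residue(p) for p in shifted]]
table = build_hash_table(structures, eps=3.0)
```

Because the two toy structures differ only by a translation, every box ends up holding the matching bases from both structures, for example `table[(1, 0, 0)] == {(0, 0), (1, 0)}`.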
10
Mining Frequent Patterns
Main observation. Assume that a pair of bases {(k1, i1), (k2, i2)} appears together in x boxes. Then, if structures S_k1 and S_k2 are transformed using the bases S_k1^i1 and S_k2^i2, there are at least x+1 pairs of points located close to each other (distance at most √3ε, i.e., the diagonal length of a box): one pair per shared box, plus the two bases themselves, which coincide.
Proof. Why is (k1, i1) in a box?
  When S_k1 is transformed using the base S_k1^i1, an amino acid is located in that box.
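The √3ε bound follows from the box geometry. As a short derivation, assuming axis-aligned cubic boxes of side ε, two points p and q in the same box differ by at most ε in each coordinate, so:

```latex
\[
\|p - q\|
  = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2 + (p_z - q_z)^2}
  \le \sqrt{\varepsilon^2 + \varepsilon^2 + \varepsilon^2}
  = \sqrt{3}\,\varepsilon .
\]
```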
11
Mining Frequent Patterns
Let each hash box be a coincidence group, or transaction.
Consider all bases as items.
Find all sets of items that appear frequently in the coincidence groups.
  This is the "frequent pattern mining problem", a well-studied problem in the database area.
  Efficient algorithms, such as those based on the FP-tree, are known.
Efficient: all possible transformations can be considered at the same time.
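To make the transaction view concrete, here is a minimal sketch that counts frequent pairs of bases across coincidence groups. It uses plain pair counting rather than an FP-tree, and it assumes that only bases from different structures form a useful pattern; the function name and the sample groups are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_base_pairs(coincidence_groups, min_support):
    """Count how many groups contain each pair of bases from different structures."""
    support = Counter()
    for group in coincidence_groups:
        for a, b in combinations(sorted(group), 2):
            if a[0] != b[0]:          # assumption: skip pairs within one structure
                support[(a, b)] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}

# Hypothetical coincidence groups: each item is a base (structure index, residue index).
groups = [
    {(0, 1), (1, 4)},
    {(0, 1), (1, 4), (2, 0)},
    {(0, 1), (1, 4), (0, 3)},
    {(0, 3), (2, 0)},
]
patterns = frequent_base_pairs(groups, min_support=3)
```

Here the pair of bases (0, 1) and (1, 4) shares three groups, so by the main observation, transforming the two structures with these bases should bring at least four pairs of points close together. Larger itemsets, which correspond to alignments of more than two structures, would need a full frequent itemset miner.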
12
Generating Alignments
Given a frequent pattern
  E.g., (S_1^2, S_2^1)
Use the bases in the tuple to transform the structures involved
Generate a matching of points: bipartite matching for pairwise alignment, greedy for multiple alignment
Output the largest alignment
[Figure: transformed S1 and S3 plotted in a common x-y frame, with the resulting alignment]
Alignment
  S1  S3
  5   1
  1   5
  2   4
  3   3
  4   2
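For the pairwise case, the matching step can be sketched as a maximum bipartite matching over point pairs that lie within ε after transformation. The talk does not say which matching algorithm is used; the augmenting-path method (Kuhn's algorithm) below is an assumption, and the function name and coordinates are hypothetical.

```python
import math

def max_bipartite_matching(points_a, points_b, eps):
    """Maximum matching between two point sets, pairing only points within eps.

    Uses augmenting paths (Kuhn's algorithm); returns a sorted list of (i, j) pairs.
    """
    neighbours = [
        [j for j, q in enumerate(points_b) if math.dist(p, q) <= eps]
        for p in points_a
    ]
    match_of_b = [None] * len(points_b)

    def try_augment(i, visited):
        for j in neighbours[i]:
            if j in visited:
                continue
            visited.add(j)
            # Take j if it is free, or if its current partner can move elsewhere.
            if match_of_b[j] is None or try_augment(match_of_b[j], visited):
                match_of_b[j] = i
                return True
        return False

    for i in range(len(points_a)):
        try_augment(i, set())
    return sorted((i, j) for j, i in enumerate(match_of_b) if i is not None)

# Hypothetical transformed points of two structures.
A = [(2.0, 0.0, 0.0), (0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
B = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (50.0, 0.0, 0.0)]
matching = max_bipartite_matching(A, B, eps=1.5)
```

In this toy input the first point of A initially takes the only candidate of the second point, and the augmenting step reassigns it, so both get a partner. A purely greedy pass in the same order would have produced an alignment of length 1 instead of 2.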
13
Experimental evaluation
Implemented in C++
Test cases run on an Intel® Core™ 2 Duo with a 2.66 GHz CPU and 4 GB main memory
Default settings
  ε: 3 Å
  min_size: 2
  LRF: 3 atoms
  Coincidence group: Bin
  max_trans: 30Avg
14
Pairwise alignment
10 pairs of proteins used in earlier work, e.g., by MultiProt
  SCOP and PRINT families
Comparison of running time
  C-alpha match: within a few seconds (from web)
  MultiProt: 0.211 s
  MultiBind: 1.968 s
  SOIL: 0.235 s
[Figure: alignment length (0 to 250) per test case for C-alpha match, MultiProt, MultiBind, and SOIL]
Multiple alignments
10 groups of proteins
  Various superfamilies in SCOP, protein interfaces from PRINT
16
Multiple alignments
[Figure: Comparison of Lengths. Alignment length (0 to 200) at each level for MultiProt, MultiBind, and SOIL on Calcium Binding (levels 6 to 2), 4-helix Bundle (levels 4 to 2), Superhelix (levels 5 to 2), Supersandwich (levels 3 to 2), and Concanavalin (levels 10 to 2)]
[Figure: Comparison of Lengths (Cont'd). Alignment length (0 to 400) at each level for MultiProt, MultiBind, and SOIL on tRNA synthetase (levels 4 to 2), G-proteins (levels 6 to 2), PTB domain (levels 5 to 2), PRINT 45 (levels 3 to 2), and PRINT 8158 (levels 3 to 2)]
17
Multiple alignment
[Figure: Comparison of Running Time. Running time in seconds (log scale, 0.01 to 10000) for test cases 1 to 10, for MultiProt, MultiBind, and SOIL]
18
Conclusion
Proposed a more difficult problem
  Sequence order independence
    Modeled as the largest common point set problem
  Subset alignment
    Automatically detects subsets of similar structures
  Similarity measurement
    Adopts the bottleneck metric
Developed the SOIL algorithm
  Combination of Geometric Hashing and Frequent Itemset Mining
  Simultaneous alignment
Evaluated the algorithm with experiments
  Can be combined with other methods by simply taking the maximum.