A data-mining approach for multiple structural alignment of proteins WY Siu, N Mamoulis, SM Yiu, HL Chan The University of Hong Kong Sep 9, 2009


Aligning protein structures
  Important step for understanding protein functions
  Sequencing proteins and determining 3D structures is easy
    X-ray crystallography, NMR spectroscopy
  Testing functions of proteins is hard
  One useful observation
    Mutations change sequences
    Structures conserved
    Structural similarity => Functional similarity
  Good structural alignment algorithm => Predict functions of proteins

Our focus
  We propose studying the problem with less information and fewer assumptions
  Sequence order independence
    Sequence order (arrangement of amino acids) is unknown
    Reduce the need to find sequence information
  Subset alignment
    Find a large alignment for every subset
    Extract similar structures from a mixture with dissimilar ones
  Bottleneck metric
    In an alignment, every pair of aligned points has a small distance

Related work
  Pairwise alignment
    Dali [Holm & Sander 93]
    VAST [Gibrat, Madej, & Bryant 96]
    CE [Shindyalov & Bourne 98]
  Techniques to obtain multiple alignments
    center-star [Akutsu & Kim 99]
    Tree-progress [Taylor, Flores, & Orengo 94]
  Multiple alignment
    MultiProt [Shatsky et al. 02] (seq. order)
    MultiBind [Shatsky et al. 06] (all align.)
    MUSTA [Leibowitz et al. 01] (all align.)
    MASS [Dror et al. 03] (seq. order)
    POSA [Ye & Godzik 05] (seq. order)

In the following
  Model
  Algorithms
    SOIL
  Experimental results

Model
  A protein is a set of amino acids in 3D, and an amino acid = 3 points in space
    For the Cα atom, C, and N
  Substructure = subset of amino acids
  Transformation T(S)
    For each s ∈ S, T(s) = Rs + t, where
      R is a 3 × 3 rotation matrix
      t is a 3 × 1 translation vector
  Similarity
    Let C = {c1, …, cn} be a set of substructures and T = {T1, …, Tn} be a set of transformations
    C is ε-congruent w.r.t. T if we can transform each structure in C and align the amino acids such that the Cα atoms of every aligned pair are close (distance <= ε)
  An ε-congruent alignment
    For a set S of structures, an alignment is a set of substructures C and transformations T

  [Figure: structures S1 and S2 brought together by a rotation and a translation]
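The transformation T(s) = Rs + t and the ε-congruence test from the model can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the choice of a rotation about the z-axis, and the toy coordinates are all assumptions made for the example.

```python
import math

def rotation_z(theta):
    """3x3 rotation matrix about the z-axis (one example of a valid R)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transform(point, R, t):
    """Apply T(s) = R s + t to one 3D point."""
    return tuple(sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))

def is_eps_congruent(aligned_pairs, eps):
    """Bottleneck test: every aligned pair of C-alpha points is within eps."""
    return all(math.dist(p, q) <= eps for p, q in aligned_pairs)

# Hypothetical toy data: s2 is s1 rotated by 90 degrees about z, then shifted.
s1 = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
R = rotation_z(math.pi / 2)
t = (5.0, 0.0, 0.0)
s2 = [transform(p, R, t) for p in s1]

# Undo the motion on s2 with the inverse transformation R^T (q - t).
R_inv = [[R[j][i] for j in range(3)] for i in range(3)]
zero = (0.0, 0.0, 0.0)
back = [transform(tuple(q[i] - t[i] for i in range(3)), R_inv, zero) for q in s2]
pairs = list(zip(s1, back))
```

With the right transformation the pairs coincide up to rounding, so `is_eps_congruent(pairs, 3.0)` holds; without it, the first pair alone is more than 4 units apart and the test fails.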

Problem definition
  Size of an alignment: number of aligned amino acids of each protein
  Cardinality: number of structures involved
  Input
    A set of structures S = {S1, S2, …, Sm}
    A distance threshold ε
    A subset size threshold min_size
    An alignment length threshold min_length
  Output
    For each subset S’ ⊆ S with |S’| >= min_size, the maximal-length ε-congruent alignment whose length is at least min_length
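To make the output specification concrete, the sketch below enumerates the subsets that must be reported and filters them by alignment length, using the threshold names from the Output line. It only illustrates what is asked for, not how SOIL computes it; the function names and the alignment lengths are hypothetical.

```python
from itertools import combinations

def subsets_to_report(structure_ids, min_size):
    """Yield every subset S' of the input with |S'| >= min_size."""
    ids = sorted(structure_ids)
    for k in range(min_size, len(ids) + 1):
        yield from combinations(ids, k)

def keep_long_enough(best_length_by_subset, min_length):
    """Drop subsets whose best alignment is shorter than min_length."""
    return {s: n for s, n in best_length_by_subset.items() if n >= min_length}

subsets = list(subsets_to_report(["S1", "S2", "S3"], min_size=2))

# Hypothetical best alignment lengths, for illustration only.
lengths = {subsets[0]: 40, subsets[1]: 12, subsets[2]: 35, subsets[3]: 9}
reported = keep_long_enough(lengths, min_length=10)
```

The number of subsets grows exponentially with the number of structures, which is why an algorithm that handles all subsets at once is attractive.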

The SOIL Algorithm
  Sequence Order Independent aLignment
  Step 1. Geometric hashing
  Step 2. Frequent pattern mining
  Step 3. Generating alignments

Geometric Hashing
  Purpose
    Take each amino acid as a base (reference) and store the relative locations of the other amino acids in a hash table.

  [Figure: structures S1 and S2, with amino acids labeled 1 to 5, drawn over a grid of boxes; each box stores the base. Length of box = ε]
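The hashing step can be sketched as follows. The way the local frame is built from the three atoms (origin at Cα, x-axis toward C, z-axis normal to the plane of the three atoms), the floor-based binning, and all names are assumptions made for illustration; the slides only state that each amino acid serves as a base and that boxes have side ε.

```python
import math
from collections import defaultdict

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def local_frame(ca, c, n):
    """Orthonormal frame at one amino acid; assumes the 3 atoms are not collinear."""
    x = unit(sub(c, ca))
    z = unit(cross(x, sub(n, ca)))
    y = cross(z, x)
    return ca, (x, y, z)

def build_hash_table(structures, eps):
    """structures: {k: [(ca, c, n), ...]}. Returns {box: set of bases (k, i)}."""
    table = defaultdict(set)
    for k, residues in structures.items():
        for i, (ca, c, n) in enumerate(residues):
            origin, axes = local_frame(ca, c, n)
            for j, (other_ca, _, _) in enumerate(residues):
                if j == i:
                    continue  # the base itself sits at the origin
                rel = sub(other_ca, origin)
                coords = tuple(dot(rel, axis) for axis in axes)
                box = tuple(math.floor(v / eps) for v in coords)
                table[box].add((k, i))  # store the base in the box
    return table

# Hypothetical toy data: structure B is structure A shifted by a constant vector.
residues_a = [
    ((0, 0, 0), (1, 0, 0), (0, 1, 0)),
    ((4, 0, 0), (4, 1, 0), (5, 0, 0)),
    ((0, 5, 2), (0, 5, 3), (1, 5, 2)),
]
shift = (10, -4, 7)
residues_b = [tuple(tuple(a + s for a, s in zip(atom, shift)) for atom in res)
              for res in residues_a]
table = build_hash_table({"A": residues_a, "B": residues_b}, eps=3.0)
```

Because the coordinates are expressed relative to each base, the shifted copy B lands in exactly the same boxes as A, so every box that holds base (A, i) also holds base (B, i).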

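Mining Frequent Patterns
  Main observation. Assume that a pair of bases {(k1, i1), (k2, i2)} appears together in x boxes. Then, if structures S_{k1} and S_{k2} are transformed using the bases S_{k1}^{i1} and S_{k2}^{i2}, there are at least x+1 pairs of points located close to each other (distance at most √3ε, i.e., the diagonal length of a box).
  Proof. Why is (k1, i1) in a box?
    When S_{k1} is transformed using the base S_{k1}^{i1}, one of its amino acids falls in that box. The same holds for (k2, i2), so each shared box contains one point from each transformed structure. The two bases themselves coincide after the transformation, which gives the extra pair.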
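The √3ε bound follows from the box geometry alone. A short derivation, writing p and q for two points that fall in the same axis-aligned box of side ε (the symbols p, q, and j are introduced here for illustration):

```latex
% p and q lie in the same axis-aligned box of side \varepsilon,
% so they differ by at most \varepsilon in each coordinate.
\[
|p_j - q_j| \le \varepsilon \quad (j = 1, 2, 3)
\;\Longrightarrow\;
\|p - q\| = \sqrt{\sum_{j=1}^{3} (p_j - q_j)^2}
\le \sqrt{3\varepsilon^2} = \sqrt{3}\,\varepsilon .
\]
```

For example, with ε = 3Å two points sharing a box are at most about 5.2Å apart.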

Mining Frequent Patterns
  Let each hash box be a coincidence group, or transaction.
  Consider all bases as items
  Find all sets of items that appear frequently in the coincidence groups.
  "Frequent pattern mining problem", a well-studied problem in the database area.
    Efficient algorithms, like FP-tree, are known
  Efficient: can consider all possible transformations at the same time
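The transaction view can be illustrated with a brute-force counter over pairs of bases. This is a deliberately simple stand-in, not the FP-tree method mentioned on the slide, and it only finds itemsets of size two; larger itemsets, needed for alignments of more than two structures, would call for a proper frequent itemset miner. The function name, the `min_support` parameter, and the toy coincidence groups are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_base_pairs(coincidence_groups, min_support):
    """Count, for every pair of bases from different structures, the number of
    coincidence groups (hash boxes) containing both; keep the frequent pairs."""
    counts = Counter()
    for group in coincidence_groups:
        for a, b in combinations(sorted(group), 2):
            if a[0] != b[0]:  # the two bases must come from different structures
                counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical coincidence groups; each item is a base (structure k, amino acid i).
groups = [
    {(1, 2), (2, 1)},
    {(1, 2), (2, 1), (3, 4)},
    {(1, 2), (2, 1)},
    {(1, 5), (3, 4)},
]
frequent = frequent_base_pairs(groups, min_support=3)
```

Here the pair of bases (1, 2) and (2, 1) shares three boxes, so it is the only pattern that survives a support threshold of 3.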

Generating Alignments
  Given a frequent pattern
    E.g., (S_1^2, S_2^1)
  Use the bases in a tuple to transform the structures involved
  Generate a matching of points: bipartite matching for pairwise, greedy for multiple
  Output the largest alignment

  [Figure: transformed S1 and S3 plotted on x and y axes, with amino acids S_1^1 to S_1^5 and S_3^1 to S_3^5]
  Alignment
    S1  S3
    5   1
    1   5
    2   4
    3   3
    4   2
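Once the structures have been transformed by the bases of a frequent pattern, the matching step pairs up nearby points. The sketch below uses a greedy closest-first matching for a single pair of structures; the slides name bipartite matching for the pairwise case, so this is a simpler substitute chosen for brevity. The function name, the `max_dist` parameter, and the coordinates are hypothetical.

```python
import math

def greedy_matching(points_a, points_b, max_dist):
    """Greedily pair points, closest first; each point is used at most once."""
    candidates = sorted(
        (math.dist(p, q), i, j)
        for i, p in enumerate(points_a)
        for j, q in enumerate(points_b)
        if math.dist(p, q) <= max_dist
    )
    used_a, used_b, alignment = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            alignment.append((i, j))
    return sorted(alignment)

# Hypothetical, already transformed C-alpha coordinates of two structures.
a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
b = [(6.2, 0.0, 0.0), (0.1, 0.0, 0.0), (2.9, 0.0, 0.0)]
alignment = greedy_matching(a, b, max_dist=1.0)
```

The matching ignores the order of the points within each structure, which is what sequence order independence requires: point 0 of `a` is matched to point 1 of `b`, and the outlier at x = 20 stays unmatched.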

Experimental evaluation
  Implemented in C++
  Test cases run on an Intel® Core™ 2 Duo with 2.66GHz CPU and 4GB main memory
  Default settings
    ε: 3Å
    min_size: 2
    LRF: 3Atoms
    Coincidence group: Bin
    max_trans: 30Avg

Pairwise alignment
  10 pairs of proteins used in earlier work, e.g., by MultiProt
  SCOP and PRINT families
  Comparison of running time
    C-alpha match: within a few seconds (from web)
    MultiProt: 0.211s
    MultiBind: 1.968s
    SOIL: 0.235s

  [Figure: alignment length (axis 0 to 250) per test case (1 to 15) for C-alpha match, MultiProt, MultiBind, and SOIL]

Multiple alignments
  10 groups of proteins
  Various superfamilies in SCOP, protein interfaces from PRINT

Multiple alignments
  [Figure: two bar charts, "Comparison of Lengths" (length axis 0 to 200) and "Comparison of Lengths (Cont'd)" (length axis 0 to 400), comparing MultiProt, MultiBind, and SOIL at each level]
  Levels shown per group
    Calcium Binding: 6 to 2
    4-helix Bundle: 4 to 2
    Superhelix: 5 to 2
    Supersandwich: 3 to 2
    Concanavalin: 10 to 2
    tRNA synthetase: 4 to 2
    G-proteins: 6 to 2
    PTB domain: 5 to 2
    PRINT 45: 3 to 2
    PRINT 8158: 3 to 2

Multiple alignment
  [Figure: "Comparison of Running Time", running time in seconds (log scale, 0.01 to 10000) per test case (1 to 10) for MultiProt, MultiBind, and SOIL]

Conclusion
  Proposed a more difficult problem
    Sequence order independence
      Modeled as the largest common point set problem
    Subset alignment
      Automatically detect subsets of similar structures
    Similarity measurement
      Adopt the bottleneck metric
  Developed the SOIL algorithm
    Combination of Geometric Hashing and Frequent Itemset Mining
    Simultaneous alignment
  Evaluated the algorithm with experiments
    Can be combined with other methods by simply taking the maximum.
