A data-mining approach for multiple structural alignment of proteins WY Siu, N Mamoulis, SM Yiu, HL...

WY Siu, N Mamoulis, SM Yiu, HL Chan
The University of Hong Kong

Sep 9, 2009

Aligning protein structures
Important step for understanding protein functions
Sequencing proteins and determining 3D structures is comparatively easy
  X-ray crystallography, NMR spectroscopy
Testing functions of proteins is hard
One useful observation
  Mutations change sequences
  Structures are conserved
  Structural similarity => Functional similarity
Good structural alignment algorithm => Predict functions of proteins

Our focus
We propose studying the problem with less information and fewer assumptions
Sequence order independence
  Sequence order (arrangement of amino acids) is unknown
  Reduce the need to find sequence information
Subset alignment
  Find a large alignment for all subsets
  Extract similar structures from a mixture with dissimilar ones
Bottleneck metric
  In an alignment, every pair of aligned points has a small distance

Related work
Pairwise alignment
  Dali [Holm & Sander 93]
  VAST [Gibrat, Madej, & Bryant 96]
  CE [Shindyalov & Bourne 98]
Techniques to obtain multiple alignments
  Center-star [Akutsu & Kim 99]
  Tree-progress [Taylor, Flores, & Orengo 94]
Multiple alignment
  MultiProt [Shatsky et al. 02] (seq. order)
  MultiBind [Shatsky et al. 06] (all align.)
  MUSTA [Leibowitz et al. 01] (all align.)
  MASS [Dror et al. 03] (seq. order)
  POSA [Ye & Godzik 05] (seq. order)

In the following
Model
Algorithms
  SOIL
Experimental results

Model
A protein is a set of amino acids in 3D, and an amino acid = 3 points in space
  The Cα atom, C, and N
Substructure = subset of amino acids
Transformation T(S)
  For each s ∈ S, T(s) = Rs + t, where
    R is a 3 × 3 rotation matrix
    t is a 3 × 1 translation vector
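The transformation T(s) = Rs + t can be illustrated with a minimal Python sketch. The helper names (rotation_z, transform_point, transform_structure) and the sample points are hypothetical and only serve the illustration; a structure is reduced here to a plain list of 3D points.

```python
import math

def rotation_z(theta):
    """3x3 rotation matrix (nested lists) for a rotation by theta about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transform_point(R, t, s):
    """Return T(s) = R s + t for a single 3D point s."""
    return tuple(sum(R[i][j] * s[j] for j in range(3)) + t[i] for i in range(3))

def transform_structure(R, t, S):
    """Apply T to every point of a structure S (a list of 3D points)."""
    return [transform_point(R, t, s) for s in S]

# Hypothetical example: rotate by 90 degrees about z, then shift by 1 along z.
S = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
T_S = transform_structure(rotation_z(math.pi / 2), (0.0, 0.0, 1.0), S)
print(T_S)
```

Any proper rotation matrix could be used in place of rotation_z; a rotation about one axis is chosen only to keep the example short.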

Similarity
Let C = {c1, …, cn} be a set of substructures and T = {T1, …, Tn} be a set of transformations
C is ε-congruent w.r.t. T if we can transform each structure in C and align the amino acids such that the Cα atoms of every aligned pair are close (distance ≤ ε)
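The ε-congruence condition amounts to a bottleneck test: the largest distance between any two aligned Cα atoms must not exceed ε. A minimal sketch follows, assuming the substructures have already been transformed and that aligned[k][j] holds the Cα position of the j-th aligned amino acid of structure k. The function names and sample coordinates are hypothetical.

```python
import math
from itertools import combinations

def bottleneck(aligned):
    """Largest distance between any two points aligned to each other.

    aligned[k][j] is the (already transformed) C-alpha position of the
    j-th aligned amino acid of structure k.
    """
    worst = 0.0
    for column in zip(*aligned):            # one column per aligned position
        for p, q in combinations(column, 2):
            worst = max(worst, math.dist(p, q))
    return worst

def is_eps_congruent(aligned, eps):
    """True if every aligned pair of points is within distance eps."""
    return bottleneck(aligned) <= eps

# Hypothetical example with three structures and two aligned positions.
aligned = [
    [(0, 0, 0), (5, 0, 0)],
    [(1, 0, 0), (5, 2, 0)],
    [(0, 1, 0), (5, 0, 2)],
]
print(is_eps_congruent(aligned, 3.0), is_eps_congruent(aligned, 2.0))
```

In this example the worst pair is (5, 2, 0) and (5, 0, 2), at distance √8, so the set passes for ε = 3 and fails for ε = 2.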

An ε-congruent alignment
For a set S of structures, an alignment is a set of substructures C and transformations T
[Figure: a structure is superimposed on another in two labelled steps, Rotate and Translate]

Problem definition
Size of an alignment: number of aligned amino acids of each protein
Cardinality: number of structures involved
Input
  A set of structures S = {S1, S2, …, Sm}
  A distance threshold ε
  A subset size threshold min_size
  An alignment length threshold min_length
Output
  For each subset S' ⊆ S with |S'| ≥ min_size, the maximum-length ε-congruent alignment whose length is at least min_length

The SOIL Algorithm
Sequence Order Independent aLignment
Step 1. Geometric hashing
Step 2. Frequent pattern mining
Step 3. Generating alignments

Geometric Hashing
Purpose
  Take each amino acid as a base (reference) and store the relative locations of the other amino acids in a hash table.
In each box, store the base
Length of box = ε
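The hashing step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' C++ implementation: the local reference frame built from the three atoms (origin at Cα, x-axis towards C, z-axis normal to the Cα-C-N plane) is one plausible choice, and all function names and coordinates are hypothetical. Each amino acid is a tuple (Cα, C, N) of 3D points, and each box of side ε records the bases (k, i) under which some other amino acid falls into it.

```python
import math
from collections import defaultdict

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)

def local_frame(residue):
    """Assumed frame: origin at C-alpha, x towards C, z normal to the CA-C-N plane."""
    ca, c, n = residue
    x = unit(sub(c, ca))
    z = unit(cross(x, sub(n, ca)))
    y = cross(z, x)
    return ca, (x, y, z)

def build_hash_table(structures, eps):
    """Map each box (cell of side eps) to the set of bases (k, i) that hit it."""
    table = defaultdict(set)
    for k, residues in enumerate(structures):
        for i, base in enumerate(residues):
            origin, axes = local_frame(base)
            for j, other in enumerate(residues):
                if j == i:
                    continue
                v = sub(other[0], origin)   # C-alpha of the other amino acid
                box = tuple(math.floor(dot(v, axis) / eps) for axis in axes)
                table[box].add((k, i))
    return table

# Hypothetical data: structure B is structure A shifted by 10 along x.
A = [((0, 0, 0), (1, 0, 0), (0, 1, 0)),
     ((4, 1, 0), (5, 1, 0), (4, 2, 0)),
     ((1, 5, 2), (2, 5, 2), (1, 6, 2))]
B = [tuple((p[0] + 10, p[1], p[2]) for p in r) for r in A]
table = build_hash_table([A, B], eps=3.0)
print(sorted(table[(1, 0, 0)]))
```

Because B is a rigid copy of A, corresponding bases of the two structures land in the same boxes, which is the kind of coincidence the mining step looks for.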

Mining Frequent Patterns
Main observation. Assume that a pair of bases {(k1, i1), (k2, i2)} appears in x boxes. Then, if structures S_{k1} and S_{k2} are transformed using the bases S^{k1}_{i1} and S^{k2}_{i2} (amino acid i1 of S_{k1} and amino acid i2 of S_{k2}), at least x+1 pairs of points lie close to each other (distance at most √3·ε, i.e., the diagonal length of a box).
Proof. Why is (k1, i1) in a box?
  When S_{k1} is transformed using the base S^{k1}_{i1}, an amino acid is located in that box.

Mining Frequent Patterns
Let each hash box be a coincidence group, or transaction.
Consider all bases as items.
Find all sets of items that appear frequently in the coincidence groups.
  "Frequent pattern mining problem", a well-studied problem in the database area.
  Efficient algorithms, such as FP-growth (based on the FP-tree), are known.
Efficient: all possible transformations can be considered at the same time.
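The transaction view can be sketched with a small Python example. It does not implement an FP-tree; it is a brute-force count of co-occurring pairs of bases, which is enough to show what a frequent pattern of size two looks like. The function name frequent_base_pairs, the min_support parameter, and the sample table are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_base_pairs(table, min_support):
    """Count how many boxes (transactions) contain each pair of bases.

    table maps a box to the set of bases (k, i) stored in it. Only pairs
    whose bases come from different structures are counted.
    """
    counts = Counter()
    for bases in table.values():
        for a, b in combinations(sorted(bases), 2):
            if a[0] != b[0]:
                counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

# Hypothetical hash table with three boxes.
table = {
    (1, 0, 0): {(0, 0), (1, 0)},
    (0, 1, 0): {(0, 0), (1, 0), (2, 3)},
    (-1, 1, 0): {(0, 1), (1, 1)},
}
patterns = frequent_base_pairs(table, min_support=2)
print(patterns)
```

Here the pair of bases (0, 0) and (1, 0) shares two boxes, so by the main observation transforming structures 0 and 1 with these bases brings at least three pairs of points close together. A real frequent-itemset miner would also return larger sets of bases, covering more than two structures at once.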

Generating Alignments
Given a frequent pattern
  E.g., (S^{1}_{2}, S^{2}_{1}), i.e., amino acid 2 of S1 and amino acid 1 of S2
Use the bases in the tuple to transform the structures involved
Generate a matching of points: bipartite matching for pairwise, greedy for multiple
Output the largest alignment

[Figure: structures S1 and S3, the transformed S1 and S3, and the resulting alignment]
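The matching step can be sketched in Python for two point sets that are assumed to be already transformed into a common frame. The slides name bipartite matching for the pairwise case; for brevity this sketch uses the greedy variant instead, pairing the closest candidates first. The function name greedy_match and the coordinates are hypothetical.

```python
import math

def greedy_match(P, Q, eps):
    """Greedily pair points of P and Q that are at most eps apart.

    Candidate pairs are taken closest first, and each point is used at
    most once. Returns a list of index pairs (i, j).
    """
    candidates = sorted(
        (math.dist(p, q), i, j)
        for i, p in enumerate(P)
        for j, q in enumerate(Q)
        if math.dist(p, q) <= eps
    )
    used_p, used_q, matching = set(), set(), []
    for _, i, j in candidates:
        if i not in used_p and j not in used_q:
            used_p.add(i)
            used_q.add(j)
            matching.append((i, j))
    return matching

# Hypothetical C-alpha positions of two transformed structures.
P = [(0, 0, 0), (4, 0, 0), (9, 0, 0)]
Q = [(0.5, 0, 0), (4, 1, 0), (20, 0, 0)]
matching = greedy_match(P, Q, eps=3.0)
print(matching)
```

Greedy matching is not guaranteed to find the largest possible matching, which is one reason an exact bipartite matching is attractive when only two structures are involved.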

Experimental evaluation
Implemented in C++
Test cases run on an Intel® Core™ 2 Duo with a 2.66 GHz CPU and 4 GB main memory
Default settings
  ε: 3 Å
  min_size: 2
  LRF: 3Atoms
  Coincidence group: Bin
  max_trans: 30Avg

Pairwise alignment
10 pairs of proteins used before, e.g., by MultiProt
SCOP and PRINT families
Comparison of running time
  C-alpha match: within a few seconds (from web)
  MultiProt: 0.211 s
  MultiBind: 1.968 s
  SOIL: 0.235 s

[Figure: chart over test cases 1 to 15 with series C-alpha match, MultiProt, and MultiBind; value axis ticks 100 to 250]

Multiple alignments
10 groups of proteins
Various superfamilies in SCOP, protein interfaces from PRINT

Multiple alignments
Comparison of Lengths
[Figure: alignment lengths of MultiProt, MultiBind, and SOIL by level for Calcium Binding (levels 6 to 2), 4-helix Bundle (4 to 2), Superhelix (5 to 2), Supersandwich (3 to 2), and Concanavalin (10 to 2)]
Comparison of Lengths (Cont'd)
[Figure: alignment lengths of MultiProt, MultiBind, and SOIL by level for tRNA synthetase (levels 4 to 2), G-proteins (6 to 2), PTB domain (5 to 2), PRINT 45 (3 to 2), and PRINT 8158 (3 to 2)]

Multiple alignment
Comparison of Running Time
[Figure: running time over test cases 1 to 10 with series MultiProt and MultiBind; time axis ticks from 0.01 to 1000]

Conclusion
Proposed a more difficult problem
  Sequence order independence
    Modeled as the largest common point set problem
  Subset alignment
    Automatically detect subsets of similar structures
  Similarity measurement
    Adopt the bottleneck metric
Developed the SOIL algorithm
  Combination of Geometric Hashing and Frequent Itemset Mining
  Simultaneous alignment
Evaluated the algorithm with experiments
  Can be combined with other methods by simply taking the largest alignment found.
