29
Finding Compact Structural Motifs Presented By: Xin Gao Authors: Jianbo Qian, Shuai Cheng Li, Dongbo Bu, Ming Li, and Jinbo Xu University of Waterloo, Ontario, Canada [email protected]

Finding Compact Structural Motifs Presented By: Xin Gao Authors: Jianbo Qian, Shuai Cheng Li, Dongbo Bu, Ming Li, and Jinbo Xu University of Waterloo,

  • View
    235

  • Download
    1

Embed Size (px)

Citation preview

Finding Compact Structural Motifs

Presented By: Xin GaoAuthors: Jianbo Qian, Shuai Cheng Li, Dongbo Bu,

Ming Li, and Jinbo XuUniversity of Waterloo, Ontario, Canada [email protected]

Outline

Introduction to Structural Motif

Related Work

Compact Motif-finding Problem Formulation

NP-Hard of the Compact Motif-finding Problem

A Polynomial Time Approximate Scheme

Outline

Introduction to Structural Motif

Related Work

Compact Motif-finding Problem Formulation

NP-Hard of the Compact Motif-finding Problem

A Polynomial Time Approximate Scheme

Introduction

Protein is a sequence of amino acids. A protein always folds into a specific

3-D shape. Structures are important to proteins:

The functional properties of proteins depend on their 3-D structures.

Structures are more conserved than sequence during the evolution of proteins.

Structural Motif

Structural motif is a frequently occurring substructure of proteins.

Motifs are thought to be tightly related to protein functions.

Identifying motifs from a set of proteins can help us to know their evolutionary history and functions.

Structural Motif Finding Problem Given a set of protein structures, to find the

frequently occurring substructure. Informally, to find one substructure from each

protein, that exhibit the highest degree of similarity.

How to measure the similarity of two substructures?

Two popular measurements: dRMSD: measure the root mean square Euclidean

distance between the corresponding residues from different protein

structures.

cRMSD: calculate the internal distance matrix for each protein, and compare the distance matrices for input structures.

Outline

Introduction to Structural Motif

Related Work

Compact Motif-finding Problem Formulation

NP-Hard of the Compact Motif-finding Problem

A Polynomial Time Approximate Scheme

Related Work L.P.Chew proposed an iterative algorithm to

compute the conserved shape and proved its convergence. (2002)

D. Bandyopadhyay applied graph-based data-mining tools to find the family-specific fingerprints. (2006)

M. Shatsky presented an algorithm to uncover the binding pattern. (2006)

DALI and CE attempt to identify structural alignment with minimal dRMSD.

STRUCTRAL and TM-Align employ heuristics to detect the alignment with minimal cRMSD.

Related Work (continued)

However, these methods are all heuristic; the solutions are not guaranteed to be optimal or near optimal.

The first PTAS for pairwise structural alignment: R. Kolodny explored the Lipschitz property of the scoring

function. (2004) Though this algorithm can be extended to the case

of multiple structure alignment, the simple extension has a time complexity exponential in the number of proteins.

Is there a PTAS to multiple structure motif finding?

Outline

Introduction to Structural Motif

Related Work

Compact Motif-finding Problem Formulation

NP-Hard of the Compact Motif-finding Problem

A Polynomial Time Approximate Scheme

We focus on (R, C)-Compact Motif.

What is (R, C)-compact motif? A motif is bounded in a minimum ball with radius R. In this ball, at most C residues do not belong to this motif.

(R,C)-compact motif is biologically meaningful since We focus on globular proteins. We allows at most C exceptions.

(R, C)-Compact Motif Finding Problem Input: protein structures S1 …, Sn, and length l

Output: a consensus consists of l 3D points q=(q1, …, ql ) a substructure ui from each protein Si

Objective: min (1 in d2(q, ui))1/2

Here, we adopt the dRMSD distance function, i.e., d(q, ui)=min||q- (ui)||2

consists of a rotation and a translation ||*||2 is the Euclidean metric.

n

Outline

Introduction to Structural Motif

Related Work

Compact Motif-finding Problem Formulation

NP-Hard of the Compact Motif-finding Problem

A Polynomial Time Approximate Scheme

(R,C)-compact motif finding is still NP-Hard. Reduction from the Sequence Consensus Problem

Input: n binary strings S1, …, Sn, each is of length m

Output: A substring ti of length l from each string Si, 1i n,

Objective: minimize 1 i <i’ n dH(ti, ti’), where dH is Hamming distance.

Basic Idea: Try to find a way of reduction to make:

dRMSD=Hamming Distance

(R,C)-compact motif finding is still NP-Hard. Each l-mer is

transformed into 6l 3D points.

110 110 001 000000 111111

0(0, 2i, 0), 1(1, 2i, 0)

(R,C)-compact motif finding is still NP-Hard. Each l-mer is transformed into 6l 3D points.

110 110 001 000000 111111 0(0, 2i, 0), 1(1, 2i, 0)

The centroid will be (1/2, 2i, 0) (Easy translation) Large “tail” no rotation RMSD = Hamming Distance

Small distortion to each point to make it protein-like.

Sequence Consensus Problem (1,0)-Compact Motif Finding Problem

Outline

Introduction to Structural Motif

Related Work

Compact Motif-finding Problem Formulation

NP-Hard of the Compact Motif-finding Problem

A Polynomial Time Approximate Scheme

The Basic Idea of Our PTAS

There are always a few “important” sub-structures, whose consensus holds most of the “secrets” of the true optimal motif.

Therefore, if we can simply do exhaustive search to find these few sub-structures, then the trivial optimal solution for these sub-structures is a good approximation to the real optimal solution.

Technique 1: Sampling

We sample only r proteins, consider each motif in a sampled protein, we can say we almost know the optimal

solution.

Sampling will introduce only a bit of error. There is at least one selection schema,

whose consensus has a cost value less than (1+1/r)OPT.

So, we can find this schema by simply enumerating operation.

Technique 2: Discretize the Rotation Space

Each rotation is parameterized by three angles 1, 2, 3[0, 2)

Discretize the angles with step size ’ we get an ’-rotation net.

Discretized rotation will not introduce a large error, either.

A parameterized algorithm for protein structure alignment. J. Xu, F. Jiao, and B. Berger. RECOMB2006.

PTAS

PTAS

Performance Ratio Analysis

Running Time

Each protein contains M motifs M is a polynomial of protein length

Each motif can adopt W rotations

W depends on the constant

So the number of consensus is less than O(nr(MW)r)= O((nMW)r)

)()( llm mOCOM

Conclusion and Future Work

We prove the (R,C)-compact motif finding problem is NP-hard

We obtain a PTAS for this problem. Future Work:

Further reduce the time complexity Design some practical algorithms. Solve a more general case.

Thank You. Questions…