A Novel Geometric Build-Up Algorithm for Solving the Distance Geometry Problem and Its Application to Multidimensional Scaling

Zhijun Wu
Department of Mathematics
Program on Bio-Informatics and Computational Biology
Iowa State University

Joint work with Tauqir Bibi, Feng Cui, Qunfeng Dong, Peter Vedell, Di Wu



Page 2

S: Multidimensional Scaling
• data classification
• geometric mapping of data

T: Distance Geometry
• mapping from semi-metric to metric spaces
• Euclidean and non-Euclidean

B: Molecular Conformation
• embedding in 3D Euclidean space
• protein structure prediction and determination

Fundamental problem: find the coordinates for a set of points, given the distances for all pairs of points.

• Cayley-Menger determinant: necessary and sufficient conditions of embedding
• singular-value decomposition method; strain/stress minimization
• sparse, inexact distances; bounds on the distances; probability distributions
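The Cayley-Menger condition mentioned above can be illustrated numerically. The following is a minimal sketch, assuming NumPy; the function name `cayley_menger_det` is introduced here for illustration and does not come from the slides. For four points, the determinant equals 288 times the squared volume of the tetrahedron they span; for five points that embed in 3D, it vanishes.

```python
import numpy as np


def cayley_menger_det(D):
    """Cayley-Menger determinant of an n x n matrix of pairwise distances."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Border the matrix of squared distances with a row and column of ones
    # and a zero in the corner.
    M = np.ones((n + 1, n + 1))
    M[0, 0] = 0.0
    M[1:, 1:] = D ** 2
    return np.linalg.det(M)


def pairwise_distances(P):
    """Euclidean distance matrix of the rows of P."""
    P = np.asarray(P, dtype=float)
    return np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)


# Unit right tetrahedron: volume 1/6, so the determinant is 288 * (1/6)^2 = 8.
tetra = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
det4 = cayley_menger_det(pairwise_distances(tetra))

# Adding a fifth point in the same 3D space makes the determinant vanish
# (up to rounding), which is the embeddability condition in 3D.
five = tetra + [[0.3, 0.4, 0.5]]
det5 = cayley_menger_det(pairwise_distances(five))
```

A determinant of five points that is far from zero signals that the given distances cannot come from points in 3D Euclidean space.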

Page 3

Proteins are building blocks of life and key ingredients of biological processes.

A biological system may have up to hundreds of thousands of different proteins, each with a specific role in the system.

A protein is formed by a polypeptide chain with typically several hundred amino acids and tens of thousands of atoms.

A protein has a unique 3D structure, which determines in many ways the function of the protein.

An example: HIV Retrotranscriptase, 554 amino acids, 4200 atoms

Page 4

Molecular Distance Geometry Problem

Given n atoms $a_1, \dots, a_n$ and a set of distances $d_{i,j}$ between $a_i$ and $a_j$, $(i,j) \in S$,

find the coordinates $x_1, \dots, x_n$ for $a_1, \dots, a_n$ such that

$\|x_i - x_j\| = d_{i,j}, \quad (i,j) \in S$

Page 5

Problems and Complexity

problems with all distances:
$\|x_i - x_j\| = d_{i,j}, \quad (i,j) \in D = \{(i,j) \mid i,j = 1, \dots, n\}$
solvable in $O(n^3)$ using SVD

problems with sparse sets of distances:
$\|x_i - x_j\| = d_{i,j}, \quad (i,j) \in S \subset D$
NP-complete (Saxe 1979)

problems with distance ranges (NMR results):
$l_{i,j} \le \|x_i - x_j\| \le u_{i,j}, \quad (i,j) \in S \subset D$
NP-complete (Moré and Wu 1997), if the ranges are small

problems with probability distributions of distances:
$\|x_i - x_j\| = d_{i,j}, \quad (i,j) \in S \subset D$, with $d_{i,j}$ described by $(l_{i,j}, u_{i,j}, p_{i,j})$
stochastic multidimensional scaling, structure prediction
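For the case with all distances, the SVD approach can be sketched as classical multidimensional scaling. This is a minimal illustration assuming NumPy, not the implementation behind the slides; the function name `classical_mds` is hypothetical. It uses a symmetric eigendecomposition of the Gram matrix, which coincides with the SVD when that matrix is positive semidefinite, and costs $O(n^3)$ for n points.

```python
import numpy as np


def classical_mds(D, dim=3):
    """Recover coordinates (up to a rigid motion) from a full distance matrix."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double centering turns squared distances into a Gram matrix of the
    # centered coordinates.
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J
    # Eigendecomposition of the symmetric Gram matrix; keep the `dim`
    # largest eigenvalues.
    w, V = np.linalg.eigh(G)
    idx = np.argsort(w)[::-1][:dim]
    w_top = np.clip(w[idx], 0.0, None)  # guard against tiny negative values
    return V[:, idx] * np.sqrt(w_top)


# Usage: distances of random 3D points are reproduced by the embedding.
rng = np.random.default_rng(1)
P = rng.normal(size=(12, 3))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
X = classical_mds(D, dim=3)
D_rec = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
```

The recovered coordinates generally differ from the originals by a rotation, translation, and possibly a reflection, so only the distances can be compared directly.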

Page 6

Current Approaches

• Embed Algorithm by Crippen and Havel

• CNS Partial Metrization by Brünger et al.

• Graph Reduction by Hendrickson

• Alternating Projection by Glunt and Hayden

• Global Optimization by Moré and Wu

• Multidimensional Scaling by Trosset et al.

Page 7

Embed Algorithm

1. bound smoothing: keep distances consistent
2. distance metrization: estimate the missing distances
3. repeat (say 1000 times):
4. randomly generate D between L and U
5. find X using SVD with D
6. if X is found, stop
7. select the best approximation X
8. refine X with simulated annealing
9. final optimization

Crippen and Havel 1988 (DGII, DGEOM); Brünger et al. 1992, 1998 (XPLOR, CNS)

time consuming: $O(n^3)$ to $O(n^4)$

costly: $O(n^2)$ to $O(n^3)$
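Step 1 of the list above, bound smoothing, can be sketched with the triangle inequality. This is an illustrative Floyd-Warshall-style pass assuming NumPy and symmetric bound matrices `L` and `U` with zero diagonals; the function name `smooth_bounds` is hypothetical, and the actual DGII or CNS routines may differ. The triple loop over pivots and pairs is where the cubic cost comes from.

```python
import numpy as np


def smooth_bounds(L, U):
    """Tighten lower bounds L and upper bounds U using triangle inequalities."""
    L = np.array(L, dtype=float)
    U = np.array(U, dtype=float)
    n = U.shape[0]
    # Upper bounds: u_ij <= u_ik + u_kj (shortest paths through pivot k).
    for k in range(n):
        U = np.minimum(U, U[:, [k]] + U[[k], :])
    # Lower bounds: l_ij >= l_ik - u_kj and l_ij >= l_kj - u_ik.
    for k in range(n):
        L = np.maximum(L, L[:, [k]] - U[[k], :])
        L = np.maximum(L, L[[k], :] - U[:, [k]])
    return L, U


# Usage: three atoms, where the 0-2 bounds are initially uninformative.
U0 = np.array([[0.0, 5.0, 10.0],
               [5.0, 0.0, 1.0],
               [10.0, 1.0, 0.0]])
L0 = np.array([[0.0, 4.0, 0.0],
               [4.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]])
L1, U1 = smooth_bounds(L0, U0)
# The 0-2 upper bound drops to 5 + 1 = 6 and the lower bound rises to 4 - 1 = 3.
```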

Page 8

Geometric Build-Up

Independent Points: A set of k+1 points in $R^k$ is called independent if it is not a set of points in $R^{k-1}$, that is, if the points do not all lie in a common (k-1)-dimensional hyperplane.

Metric Basis: A set of points B in a space S is a metric basis of S provided each point of S is uniquely determined by its distances from the points in B.

Fundamental Theorem: Any k+1 independent points in $R^k$ form a metric basis for $R^k$.

Blumenthal 1953: Theory and Applications of Distance Geometry

Page 9

Geometric Build-Up in two dimensions

Page 10

Geometric Build-Up in three dimensions

Page 11

Geometric Build-Up in three dimensions

Page 12

Geometric Build-Up

Base points with known coordinates:

$x_1 = (u_1, v_1, w_1)$
$x_2 = (u_2, v_2, w_2)$
$x_3 = (u_3, v_3, w_3)$
$x_4 = (u_4, v_4, w_4)$

Unknown points: $x_i = (u_i, v_i, w_i)$ and $x_j = (u_j, v_j, w_j)$, determined from their distances to the base points:

$\|x_i - x_1\| = d_{i,1}$
$\|x_i - x_2\| = d_{i,2}$
$\|x_i - x_3\| = d_{i,3}$
$\|x_i - x_4\| = d_{i,4}$

$\|x_j - x_1\| = d_{j,1}$
$\|x_j - x_2\| = d_{j,2}$
$\|x_j - x_3\| = d_{j,3}$
$\|x_j - x_4\| = d_{j,4}$
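One way to see why four base points suffice is the following derivation, offered as a sketch rather than as the slides' own presentation. Squaring each distance equation and subtracting the first from the other three cancels the quadratic term in $x_i$ and leaves a 3-by-3 linear system. The names $A$ and $b$ are notation introduced here for illustration.

```latex
\begin{align*}
\|x_i - x_k\|^2 &= \|x_i\|^2 - 2\,x_k^{T} x_i + \|x_k\|^2 = d_{i,k}^2,
  \qquad k = 1, 2, 3, 4 \\
2\,(x_k - x_1)^{T} x_i &= \|x_k\|^2 - \|x_1\|^2 - \left(d_{i,k}^2 - d_{i,1}^2\right),
  \qquad k = 2, 3, 4 \\
A\,x_i &= b, \qquad
A = 2 \begin{pmatrix} (x_2 - x_1)^{T} \\ (x_3 - x_1)^{T} \\ (x_4 - x_1)^{T} \end{pmatrix}, \qquad
b_{k-1} = \|x_k\|^2 - \|x_1\|^2 - d_{i,k}^2 + d_{i,1}^2
\end{align*}
```

The matrix $A$ is nonsingular exactly when the four base points are not coplanar, which matches the independence condition of the Fundamental Theorem. The same system with $d_{j,k}$ in place of $d_{i,k}$ gives $x_j$.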

Page 13

The geometric build-up algorithm solves a molecular distance geometry problem in $O(n)$ when distances between all pairs of atoms are given, while the singular value decomposition algorithm requires $O(n^2)$ to $O(n^3)$ computing time!
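The linear cost can be made concrete with a small sketch, assuming NumPy, exact distances, and a first set of four atoms that is not coplanar. The function names `base_coordinates` and `build_up` are hypothetical and this is not the authors' code. Each remaining atom is located from its four distances to the base atoms by one 3-by-3 linear solve, so the work grows linearly with the number of atoms.

```python
import numpy as np


def base_coordinates(D4):
    """Place four atoms from their 4x4 distance matrix (assumed non-coplanar)."""
    d01, d02, d03 = D4[0, 1], D4[0, 2], D4[0, 3]
    d12, d13, d23 = D4[1, 2], D4[1, 3], D4[2, 3]
    # Atom 0 at the origin, atom 1 on the x-axis, atom 2 in the xy-plane.
    u2 = (d01**2 + d02**2 - d12**2) / (2.0 * d01)
    v2 = np.sqrt(max(d02**2 - u2**2, 0.0))
    u3 = (d01**2 + d03**2 - d13**2) / (2.0 * d01)
    v3 = (d02**2 + d03**2 - d23**2 - 2.0 * u2 * u3) / (2.0 * v2)
    # Choosing w3 >= 0 picks one of the two mirror images.
    w3 = np.sqrt(max(d03**2 - u3**2 - v3**2, 0.0))
    return np.array([[0.0, 0.0, 0.0],
                     [d01, 0.0, 0.0],
                     [u2, v2, 0.0],
                     [u3, v3, w3]])


def build_up(D):
    """Coordinates for all atoms using only distances to the first four atoms."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    X = np.zeros((n, 3))
    X[:4] = base_coordinates(D[:4, :4])
    base = X[:4]
    # Subtracting the squared-distance equation for atom 0 from those for
    # atoms 1..3 gives a 3x3 linear system A x = b for every remaining atom.
    A = 2.0 * (base[1:] - base[0])
    sq = np.sum(base**2, axis=1)
    B = (sq[1:] - sq[0]) - (D[4:, 1:4] ** 2 - D[4:, [0]] ** 2)  # shape (n-4, 3)
    X[4:] = np.linalg.solve(A, B.T).T
    return X
```

Because distances do not distinguish a structure from its mirror image, the sketch fixes the handedness arbitrarily through the sign of the fourth atom's third coordinate.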

Page 14

The X-ray crystallography structure (left) of the HIV-1 RT p66 protein (4200 atoms) and the structure (right) determined by the geometric build-up algorithm using the distances for all pairs of atoms in the protein. The algorithm took only 188,859 floating-point operations to obtain the structure, while a conventional singular-value decomposition algorithm required 1,268,200,000 floating-point operations. The RMSD of the two structures is ~$10^{-4}$ Å.
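Coordinates obtained from distances are defined only up to a rigid motion, so an RMSD between two structures is normally computed after superposition. The slides do not say how the RMSD was obtained; the sketch below uses the Kabsch algorithm as a common choice, assuming NumPy, with the hypothetical function name `kabsch_rmsd`.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between point sets P and Q after optimal rigid superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translations by centering both sets.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    H = Pc.T @ Qc
    U, S, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation,
    # not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0.0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc @ R.T - Qc
    return np.sqrt(np.sum(diff**2) / len(P))


# Usage: a rotated and translated copy of a structure has RMSD near zero.
rng = np.random.default_rng(2)
P = rng.normal(size=(20, 3))
c, s = np.cos(0.7), np.sin(0.7)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
Q = P @ Rz.T + np.array([1.0, -2.0, 0.5])
rmsd = kabsch_rmsd(P, Q)
```

If the build-up happens to produce the mirror image of the reference structure, one of the two coordinate sets would need to be reflected before this comparison.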

Page 15

Problems with Sparse Sets of Distances

Page 16

Control of Rounding Errors

Page 17

Control of Rounding Errors

Page 18

Tolerate Distance Errors

Page 19

Tolerate Distance Errors

$\min_{x_i} \sum_{j:\,(i,j) \in S} \left( \|x_i - x_j\|^2 - d_{i,j}^2 \right)^2$

where the sum is over (i,j) in S for which the $x_j$ are determined.

Page 20

$\min_{x_i} \sum_{j:\,(i,j) \in S} \left( \|x_i - x_j\|^2 - d_{i,j}^2 \right)^2$

where the sum is over (i,j) in S for which the $x_j$ are determined.

The objective function is convex and the problem can be solved using a standard Newton method.

Each function evaluation requires on the order of n floating-point operations, where n is the number of atoms.

In the ideal case when every atom can be determined, n atoms require $O(n^2)$ floating-point operations.
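The least-squares step for a single atom can be sketched as a plain Newton iteration on the objective above. This is an illustration assuming NumPy, with no line search or safeguards and with a starting point supplied by the caller (for example, from an exact solve with four of the neighbors). The function name `locate_least_squares` and its parameters are hypothetical.

```python
import numpy as np


def locate_least_squares(neighbors, dists, x0, iters=50, tol=1e-12):
    """Minimize sum_j (||x - x_j||^2 - d_j^2)^2 over x with Newton's method."""
    Xj = np.asarray(neighbors, dtype=float)   # determined atoms, shape (m, 3)
    d2 = np.asarray(dists, dtype=float) ** 2  # squared target distances
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        diff = x - Xj
        r = np.sum(diff**2, axis=1) - d2      # residuals r_j
        # Gradient 4 * sum_j r_j (x - x_j).
        grad = 4.0 * diff.T @ r
        # Hessian 8 * sum_j (x - x_j)(x - x_j)^T + 4 * (sum_j r_j) * I.
        hess = 8.0 * diff.T @ diff + 4.0 * r.sum() * np.eye(x.size)
        step = np.linalg.solve(hess, grad)
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x


# Usage: five determined neighbors, exact distances, and a perturbed start.
nbrs = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0],
                 [0.0, 0.0, 2.0], [2.0, 2.0, 2.0]])
x_true = np.array([0.7, 1.1, 0.4])
d = np.linalg.norm(nbrs - x_true, axis=1)
x_hat = locate_least_squares(nbrs, d, x_true + np.array([0.1, -0.1, 0.05]))
```

Each iteration touches every neighbor once, so with up to n neighbors the cost per atom is on the order of n, in line with the operation counts stated above.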

Page 21

NMR Structure Determination

The distances are given with their possible ranges.

find $x_i$ such that

$l_{i,j} \le \|x_i - x_j\| \le u_{i,j}, \quad (i,j) \in S$

Page 22

find $x_i$ such that

$l_{i,j} \le \|x_i - x_j\| \le u_{i,j}, \quad (i,j) \in S$

$\min_{x_i} \sum_{j:\,(i,j) \in S} \left[ \left( \|x_i - x_j\|^2 - u_{i,j}^2 \right)^2 + \left( l_{i,j}^2 - \|x_i - x_j\|^2 \right)^2 \right]$

Page 23

Computational Results

The structure of 4MBA (red lines) determined by using a geometric build-up algorithm with a subset of all pairs of inter-atomic distances. The X-ray crystallography structure is shown in blue lines.

Page 24

Computational Results

The total distance errors (red) for the partial structures of a polypeptide chain obtained by using a geometric build-up are all smaller than 1 Å, while those (blue) obtained by using CNS (Brünger et al.) grow quickly with increasing numbers of atoms in the chain.

Page 25

Extension to Statistical Distance Data

the distributions of the distances in structure database

$\max_{x_i} \sum_{(i,j) \in S} \log \left[ p_{i,j}\left( \|x_i - x_j\| \right) \right]$

structure prediction