
Protein Folds and Protein Structure Superposition

Amit P. Singh

Biochemistry 218/MIS 231

November 24, 1999

PDB Size

l More than 11,000 released atomic coordinate entries

l About 4.5 new entries per day

[Figure: number of PDB entries by year, 1970 to 2000; y-axis "Number of Entries" from 0 to 9000]

Tasks

l Compare all known structures to each other

l Classify and organize all structures in a meaningful way

l Find common folding patterns and structural motifs

l Compute evolutionary distances between protein structures

l Study interactions between structures and other molecules (Protein Docking)

l Use known structures to predict structure from sequence (Protein Threading)

l Many more ...

Overview

l Protein folds

• Classification of protein structures

• Current databases

l Structure Superposition

• Existing methods

• LOCK - hierarchical protein structure superposition

• 3dSearch - fast secondary structure superposition

l Demonstrations

Classification of Protein Structures

l Class

• Similar secondary structure content

• All α, all β, α/β, etc.

l Fold (Architecture)

• Major structural similarity

• SSEs in similar arrangement

l Superfamily (Topology)

• Probable common ancestry

l Family

• Clear evolutionary relationship

• Sequence similarity usually > 25%

Classes of Protein Structures

l Mainly α

l Mainly β

l α/β

• Parallel β sheets, β-α-β units

l α+β

• Anti-parallel β sheets, segregated α and β regions

• Helices mostly on one side of sheet

Classes of Protein Structures

l Others

• Multi-domain, membrane and cell surface, small proteins, peptides and fragments, designed proteins

Folds / Architectures

l Mainly α

• Bundle

• Non-Bundle

l Mainly β

• Single sheet

• Roll

• Barrel

• Clam

• Sandwich

• Prism

• 4/6/7/8 Propeller

• Solenoid

l α/β and α+β

• Closed

» Barrel

» Roll, ...

• Open

» Sandwich

» Clam, ...

e.g. The TIM Barrel Fold

A Conceptual Problem ...

For example:

Another example: Globin vs. Colicin

Databases of Folds

l SCOP

• Structural Classification of Proteins

• Manual assembly by inspection

• All nodes are annotated (e.g. All-alpha, alpha/beta)

• Structural similarity search using 3dSearch (Singh and Brutlag)

l CATH

• Class - Architecture - Topology - Homologous Superfamily

• Manual classification at Architecture level

• Automated topology classification using the SSAP algorithm (Orengo and Taylor)

• No structural similarity search

l FSSP

• Fully automated using the DALI algorithm (Holm and Sander)

• No internal node annotations

• Structural similarity search using DALI

l PClass

• Fully automated using the LOCK and 3dSearch algorithms (Singh and Brutlag)

• All internal nodes automatically annotated with common terms

• Java based classification browser

• Structural similarity search using 3dSearch


Automating Structure Classification and Fold Detection

l Growth of PDB demands automated techniques for classification and fold detection

l Protein Structure Comparison

• computing structure based evolutionary distances

• identifying protein function

• understanding functional mechanism

• identifying structurally conserved regions in the protein

• finding binding sites or other functionally important regions of the protein

Structure Superposition

l Find the transformation matrix that best overlaps the table and the chair

l i.e. Find the transformation matrix that minimizes the root mean square deviation between corresponding points of the table and the chair

l Correspondences:

• Top of chair to top of table

• Front of chair to front of table, etc.
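The quantity being minimized can be written out directly. The following is a minimal sketch, assuming the correspondences are already known and supplied as two equal-length lists of (x, y, z) tuples; the function names `apply_transform` and `rmsd` are illustrative and do not come from the lecture.

```python
import math

def apply_transform(points, rotation, translation):
    """Apply a 3x3 rotation (list of rows) and a translation to each point."""
    moved = []
    for x, y, z in points:
        moved.append(tuple(
            rotation[r][0] * x + rotation[r][1] * y + rotation[r][2] * z + translation[r]
            for r in range(3)
        ))
    return moved

def rmsd(points_a, points_b):
    """Root mean square deviation between corresponding points."""
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(points_a, points_b)
    )
    return math.sqrt(total / len(points_a))
```

A superposition method searches for the rotation and translation that make `rmsd(apply_transform(chair, R, t), table)` as small as possible.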

Absolute Orientation Algorithm

Closed-form solution of absolute orientation using unit quaternions

Berthold K. P. Horn, J. Opt. Soc. Am. A, April 1987, Vol. 4, No. 4

The key is finding corresponding points between the two structures
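Once correspondences are fixed, the quaternion method reduces to finding the eigenvector with the largest eigenvalue of a symmetric 4x4 matrix built from the centered coordinates. The sketch below follows that outline in plain Python. The helper names (`centroid`, `largest_eigenvector`, `horn_superpose`) are hypothetical, and the small Jacobi eigen-solver is only a stand-in for whatever numerical library would normally be used.

```python
import math

def centroid(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def largest_eigenvector(sym, sweeps=50):
    """Jacobi iteration on a small symmetric matrix; returns the eigenvector
    that belongs to the largest eigenvalue."""
    n = len(sym)
    a = [row[:] for row in sym]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        if sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j) < 1e-24:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):      # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):      # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):      # accumulate eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    best = max(range(n), key=lambda i: a[i][i])
    return [v[k][best] for k in range(n)]

def horn_superpose(left, right):
    """Return (R, t) so that R * left + t matches right in the least-squares sense."""
    cl, cr = centroid(left), centroid(right)
    a = [tuple(p[k] - cl[k] for k in range(3)) for p in left]
    b = [tuple(p[k] - cr[k] for k in range(3)) for p in right]
    # s[i][j] = sum over corresponding points of (left coordinate i) * (right coordinate j)
    s = [[sum(pa[i] * pb[j] for pa, pb in zip(a, b)) for j in range(3)] for i in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    n_matrix = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    q0, qx, qy, qz = largest_eigenvector(n_matrix)
    norm = math.sqrt(q0 * q0 + qx * qx + qy * qy + qz * qz)
    q0, qx, qy, qz = q0 / norm, qx / norm, qy / norm, qz / norm
    rot = [
        [q0*q0 + qx*qx - qy*qy - qz*qz, 2 * (qx*qy - q0*qz), 2 * (qx*qz + q0*qy)],
        [2 * (qy*qx + q0*qz), q0*q0 - qx*qx + qy*qy - qz*qz, 2 * (qy*qz - q0*qx)],
        [2 * (qz*qx - q0*qy), 2 * (qz*qy + q0*qx), q0*q0 - qx*qx - qy*qy + qz*qz],
    ]
    t = tuple(cr[r] - sum(rot[r][k] * cl[k] for k in range(3)) for r in range(3))
    return rot, t
```

The sketch handles rotation and translation only (no scaling) and assumes the two lists are the same length and already paired index by index.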

Algorithms for Structure Superposition

l Distance based methods:

• DALI (Holm and Sander): Aligning scalar distance plots

• STRUCTAL (Gerstein and Levitt): Dynamic programming using pairwise inter-molecular distances

• SSAP (Orengo and Taylor): Dynamic programming using intra-molecular vector distances

• MINAREA (Falicov and Cohen): Minimizing soap-bubble surface area

l Vector based methods:

• VAST (Bryant): Graph theory based secondary structure alignment

• 3dSearch (Singh and Brutlag) & 3D Lookup (Holm and Sander): Fast secondary structure index lookup

l Both:

• LOCK (Singh and Brutlag): Hierarchically uses both secondary structure vectors and atomic distances

DALI

[Figure: an intra-molecular distance plot for myoglobin]

l Based on aligning 2-D intra-molecular distance matrices

l Computes the best subset of corresponding residues from the two proteins such that the similarity between the 2-D distance matrices is maximized

l Searches through all possible alignments of residues using Monte-Carlo and Branch-and-Bound algorithms

Score(i, j) = 1.5 - |distanceA(i, j) - distanceB(i, j)|
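To make the pair score concrete, here is a small sketch that builds the two intra-molecular distance matrices and sums Score(i, j) over an alignment that is assumed to be given already (residue k of A paired with residue k of B). The function names are illustrative, and summing over all pairs with i != j is an assumption. The search over alignments that DALI performs is not shown.

```python
import math

def distance_matrix(coords):
    """Entry [i][j] is the distance between residues i and j of one protein."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def dali_like_score(coords_a, coords_b, threshold=1.5):
    """Sum of threshold - |dA(i, j) - dB(i, j)| over all residue pairs i != j.
    coords_a[k] is assumed to be aligned with coords_b[k]."""
    if len(coords_a) != len(coords_b):
        raise ValueError("aligned residue lists must have equal length")
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    n = len(coords_a)
    return sum(
        threshold - abs(da[i][j] - db[i][j])
        for i in range(n) for j in range(n) if i != j
    )
```

Because only intra-molecular distances enter the score, translating or rotating either structure leaves it unchanged, so no superposition is needed before scoring.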

STRUCTAL

l Based on Iterative Dynamic Programming to align inter-molecular distances

l Pair-wise alignment score in each square of the matrix is inversely proportional to the distance between the two atoms
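One round of such a dynamic programming pass can be sketched as follows. The score function `1 / (1 + d / scale)` is only an illustrative choice that decreases with distance, not STRUCTAL's actual formula, and the names `pair_score` and `align_by_distance` and the `gap` and `scale` parameters are assumptions. In the iterative scheme the aligned pairs would be used to re-superpose the structures before the next round.

```python
import math

def pair_score(p, q, scale=1.0):
    """Larger when the two atoms are closer (illustrative form only)."""
    return 1.0 / (1.0 + math.dist(p, q) / scale)

def align_by_distance(coords_a, coords_b, gap=0.0, scale=1.0):
    """Global dynamic programming over inter-molecular distances.
    Returns (total score, list of aligned index pairs)."""
    n, m = len(coords_a), len(coords_b)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], move[i][0] = best[i - 1][0] - gap, "up"
    for j in range(1, m + 1):
        best[0][j], move[0][j] = best[0][j - 1] - gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (best[i - 1][j - 1] + pair_score(coords_a[i - 1], coords_b[j - 1], scale), "diag"),
                (best[i - 1][j] - gap, "up"),
                (best[i][j - 1] - gap, "left"),
            ]
            best[i][j], move[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "up":
            i -= 1
        else:
            j -= 1
    return best[n][m], pairs[::-1]
```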

VAST - Vector Alignment Search Tool

l Aligns only secondary structure elements (SSE)

l Represents each SSE as a vector

l Finds all possible pairs of vectors from the two structures that are similar

l Uses a graph theory algorithm to find the maximal subset of similar vectors

l Overall alignment score is based on the number of similar pairs of vectors between the two structures
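The graph step can be illustrated with a small clique search. This is a sketch under several assumptions: each SSE is a dict with hypothetical keys `kind`, `start` and `end`, two matched pairs are called compatible when their midpoint distances and relative angles agree within arbitrary tolerances, and the exhaustive search below stands in for whatever graph algorithm VAST actually uses.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def midpoint(sse):
    return tuple((s + e) / 2 for s, e in zip(sse["start"], sse["end"]))

def angle(sse1, sse2):
    u, v = sub(sse1["end"], sse1["start"]), sub(sse2["end"], sse2["start"])
    cosine = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

def compatible(a_i, a_j, b_p, b_q, max_dist=2.0, max_angle=20.0):
    """Matches (a_i -> b_p) and (a_j -> b_q) agree if both proteins place the
    two vectors at a similar distance and relative angle."""
    d_a = norm(sub(midpoint(a_i), midpoint(a_j)))
    d_b = norm(sub(midpoint(b_p), midpoint(b_q)))
    return abs(d_a - d_b) <= max_dist and abs(angle(a_i, a_j) - angle(b_p, b_q)) <= max_angle

def largest_consistent_match(sses_a, sses_b):
    """Largest set of SSE matches that are pairwise compatible (a maximum clique)."""
    nodes = [(i, p) for i, a in enumerate(sses_a) for p, b in enumerate(sses_b)
             if a["kind"] == b["kind"]]

    def ok(n1, n2):
        (i, p), (j, q) = n1, n2
        return i != j and p != q and compatible(sses_a[i], sses_a[j], sses_b[p], sses_b[q])

    best = []

    def grow(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = clique[:]
        for k, node in enumerate(candidates):
            if len(clique) + len(candidates) - k <= len(best):
                break  # cannot beat the best clique found so far
            grow(clique + [node], [c for c in candidates[k + 1:] if ok(node, c)])

    grow([], nodes)
    return best
```

The size of the returned set plays the role of the alignment score described on the slide: more mutually consistent vector pairs means a better match.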




LOCK - Creating Secondary Structure Vectors

Comparing Secondary Structure Vectors

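[Figure: secondary structure vectors i, j, k in one protein and p, q, r in the other]

$$S(d) = \frac{2M}{1 + (d/d_0)^2} - M$$

[Figure: plot of S(d) against d, with M, -M and d0 marked on the axes]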

Orientation Independent Scores:

S = S(|angle(i,k) - angle(p,r)|)

S = S(|distance(i,k) - distance(p,r)|)

S = S(|length(k) - length(r)|)

Orientation Dependent Scores:

S = S(angle(k,r))

S = S(distance(k,r))
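As an illustration, the orientation independent terms can be computed as below. The sketch assumes the scoring function has the form S(d) = 2M / (1 + (d/d0)^2) - M, that each vector is a (start, end) pair of points, that distance(i,k) means the distance between midpoints, and that the three terms are simply added. The d0 thresholds are placeholder values, not LOCK's parameters.

```python
import math

def lock_score(d, d0, M=1.0):
    """Equals M at d = 0, 0 at d = d0, and tends to -M for large d."""
    return 2.0 * M / (1.0 + (d / d0) ** 2) - M

def direction(v):
    return tuple(e - s for s, e in zip(v[0], v[1]))

def length(v):
    return math.dist(v[0], v[1])

def midpoint(v):
    return tuple((s + e) / 2 for s, e in zip(v[0], v[1]))

def angle(u, v):
    du, dv = direction(u), direction(v)
    cosine = sum(a * b for a, b in zip(du, dv)) / (length(u) * length(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

def distance(u, v):
    return math.dist(midpoint(u), midpoint(v))

def orientation_independent_score(i, k, p, r, angle_d0=30.0, dist_d0=4.0, len_d0=5.0):
    """Compare query vectors (i, k) with target vectors (p, r) without superposing them."""
    return (
        lock_score(abs(angle(i, k) - angle(p, r)), angle_d0)
        + lock_score(abs(distance(i, k) - distance(p, r)), dist_d0)
        + lock_score(abs(length(k) - length(r)), len_d0)
    )
```

The orientation dependent scores would use the same `lock_score` on `angle(k, r)` and `distance(k, r)` directly, which is only meaningful after the two proteins share a frame of reference.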

Aligning Secondary Structure Vectors

[Figure: dynamic programming matrix aligning the secondary structure elements of two proteins, labeled H and S]

Best local alignment: HHSSSHSSH
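Local alignment of secondary structure elements can be sketched with the Smith-Waterman recurrence. This example compares plain H/S letters with fixed match, mismatch and gap values, all of which are assumptions made for illustration; LOCK itself scores pairs of vectors geometrically rather than by letter identity.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment of two secondary structure strings.
    Returns (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Because scores are clamped at zero, the alignment may start and end anywhere in either protein, which is what allows a shared sub-domain to be found inside two otherwise different structures.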

Three Step Algorithm

l Local Secondary Structure Superposition

• Find an initial superposition of the two proteins by using dynamic programming to align the secondary structure vectors

l Atomic Superposition

• Apply a greedy nearest neighbor method to minimize the RMSD between the C-α atoms from query and the target (i.e. find the nearest local minimum in the alignment space)

l Core Superposition

• Find the best sequential core of aligned C-α atoms and minimize the RMSD between them
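The pairing part of the atomic superposition step might look like the sketch below. It assumes the two proteins are already roughly superposed, pairs C-α atoms greedily in order of increasing distance, and uses an arbitrary 3.0 cutoff; the name `greedy_pairs` and the cutoff are illustrative. In a full implementation the pairs would feed a least-squares superposition, and pairing and superposition would repeat until the RMSD stops improving.

```python
import math

def greedy_pairs(query, target, cutoff=3.0):
    """Pair each query atom with at most one target atom, closest pairs first."""
    candidates = sorted(
        (math.dist(q, t), i, j)
        for i, q in enumerate(query)
        for j, t in enumerate(target)
        if math.dist(q, t) <= cutoff
    )
    used_query, used_target, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_query and j not in used_target:
            used_query.add(i)
            used_target.add(j)
            pairs.append((i, j))
    return sorted(pairs)
```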

Step 1: Local Secondary Structure Superposition

[Figure: secondary structure vectors of two proteins, A1-A4 and B1-B4 (labels H1, S2, H3, S4), with candidate local alignments]

pair               # of aligned vectors   total alignment score
A1,A2 / B2,B3      3                      71
A3,A4 / B3,B4      2                      32

Step 2: Atomic Superposition

Step 3: Core Superposition

Features

l Able to detect both global structural similarity and sub-domain similarity

l Fast: O(n^3), where n = number of secondary structure elements

l ~4 min to compare myoglobin (153 residues) to 800 representative proteins from the PDB (on a 195 MHz MIPS R10000 processor)

l Can align flexible proteins that contain multiple sub-domains

LOCK PDB Search: Myoglobin

1myh-A  0.562531  153  MYOGLOBIN (AQUOMET, PH 7.1) MUTANT
2dhb-A  1.335804  132  HEMOGLOBIN (HORSE,DEOXY)
1eca    1.497830  130  HEMOGLOBIN (ERYTHROCRUORIN, AQUO MET)
1flp    1.349387  129  HEMOGLOBIN I (MONOMERIC) (FERRIC)
2lhb    1.084818  128  HEMOGLOBIN V (CYANO,MET)
1hds-B  1.235697  128  HEMOGLOBIN (SICKLE CELL)
1ith-A  1.302358  126  HEMOGLOBIN (CYANOMET)
1mba    1.501337  126  MYOGLOBIN (MET) ($P*H 7.0)
1hbg    1.412812  125  HEMOGLOBIN (CARBON MONOXY)
1ash    1.431599  122  HEMOGLOBIN (DOMAIN ONE)
1hlb    1.475264  121  HEMOGLOBIN (SEA CUCUMBER)
1hbi-A  1.408008  118  HEMOGLOBIN I (OXYGENATED, HOMODIMER)
1gdi    1.476536  118  LEGHEMOGLOBIN (CARBON MONOXY)
1cpc-A  1.787059   80  C-PHYCOCYANIN
1cpc-L  1.682854   74  C-PHYCOCYANIN
1tox-A  1.640087   63  DIPHTHERIA TOXIN DIMER COMPLEXED WITH NAD
1pbg-A  1.701624   60  MOL_ID: 1;
1col-A  1.798676   58  COLICIN *A (C-TERMINAL DOMAIN)
2sbl-B  1.519786   53  LIPOXYGENASE-1 (SOYBEAN)
1oxa    1.851156   52  CYTOCHROME P450 (DONOR:O2 OXIDOREDUCTASE)
1krb-C  1.680887   49  MOL_ID: 1;
1irk    1.409630   48  INSULIN RECEPTOR (TYROSINE KINASE DOMAIN)
2hpd-A  1.619509   48  CYTOCHROME P450 (BM-3)

LOCK PDB Search: Helix-Turn-Helix

1lmb-4  0.207748  22  DNA-BINDING REGULATORY PROTEIN
1pra    0.451552  22  GENE REGULATING PROTEIN
1adr    0.472226  22  TRANSCRIPTION REGULATION
1yrn-A  0.492206  22  COMPLEX (TWO DNA-BINDING PROTEINS/DNA)
1dik    0.517499  22  PHOSPHOTRANSFERASE
1pnr-A  0.613499  22  COMPLEX (DNA-BINDING REGULATION/DNA)
4fis-A  0.677586  22  DNA-BINDING PROTEIN
1oct-C  0.697683  22  DNA-BINDING PROTEIN
1ftt    0.742871  22  DNA BINDING PROTEIN
1cop-E  0.768256  22  GENE REGULATING PROTEIN
1dtr    0.780090  22  DNA BINDING PROTEIN
1gdt-A  0.795171  22  COMPLEX (SITE-SPECIFIC RECOMBINASE/DNA)
3gap-B  0.817812  22  GENE REGULATORY PROTEIN
1lfb    0.820489  22  TRANSCRIPTION REGULATION
1ads    0.920977  22  OXIDOREDUCTASE
1pdn-C  0.949104  22  COMPLEX (GENE REGULATING PROTEIN/DNA)
1hom    0.958631  22  DNA-BINDING PROTEIN
1mse-C  0.965817  22  COMPLEX (BINDING PROTEIN/DNA)
1cma-A  1.023169  22  DNA-BINDING REGULATORY PROTEIN
1trr-A  1.048990  22  DNA-BINDING REGULATORY PROTEIN
1tpl-B  1.467803  22  LYASE(CARBON-CARBON)
1hcr-A  0.930009  21  DNA-BINDING
1neq    0.949148  21  DNA-BINDING PROTEIN
1ade-A  0.979226  21  LIGASE (SYNTHETASE)
153l    1.093427  21  HYDROLASE(O-GLYCOSYL)

3dSearch

l Place vectors of all target proteins into an index table

l Structure comparison = Index lookup

l Building the table:

• For each pair of vectors in the target protein, generate a unique coordinate system

• Record the position and orientation of all remaining vectors in this coordinate system in an index table

l Searching the table:

• For each pair of vectors in the query protein, generate a unique coordinate system

• Search the table of target vectors to find the query coordinate system that results in the maximum number of aligned vectors

Generating a Coordinate System

l For each pair of vectors (Vi,Vj)

• Transform the coordinates such that Vi lies on the Z-axis and the projection of Vj onto the X-Y plane is parallel to the Y-axis
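The frame construction for a pair (Vi, Vj) can be sketched as follows. Each vector is assumed to be a (start, end) pair of points; placing the origin at the start of Vi and using the direction of Vj for the projection are assumptions the slide does not spell out, and the function names are illustrative.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])

def unit(v):
    size = math.sqrt(dot(v, v))
    if size < 1e-9:
        raise ValueError("degenerate vector pair (zero length or parallel)")
    return tuple(c / size for c in v)

def frame_from_pair(vi, vj):
    """Return (origin, (x_axis, y_axis, z_axis)) with Vi along Z and the
    X-Y projection of Vj's direction along Y."""
    z = unit(sub(vi[1], vi[0]))
    dj = sub(vj[1], vj[0])
    k = dot(dj, z)
    # Removing the Z component of Vj's direction leaves its X-Y projection.
    y = unit(tuple(d - k * c for d, c in zip(dj, z)))
    x = cross(y, z)  # completes a right-handed frame
    return vi[0], (x, y, z)

def to_frame(point, origin, axes):
    """Coordinates of a point in the new frame of reference."""
    rel = sub(point, origin)
    return tuple(dot(rel, axis) for axis in axes)
```

Because the frame depends only on the two vectors, the same pair yields the same coordinates for every other vector no matter how the protein was originally positioned.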

Building the Index Table

l For each coordinate system

• Compute the position and orientation of the remaining vectors in this new frame of reference
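The table build and lookup can be sketched in the style of geometric hashing. Everything specific here is an assumption for illustration: vectors are (start, end) point pairs, the table key is the start and end point of a remaining vector rounded to a 2.0 grid cell in the pair's frame, and the best match is the target that collects the most votes in any single frame. The real 3dSearch index is not described in enough detail on the slides to reproduce.

```python
import math
from collections import defaultdict
from itertools import permutations

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])

def frame(vi, vj):
    """Frame with Vi on the Z-axis and Vj's X-Y projection along Y; None if degenerate."""
    z = sub(vi[1], vi[0])
    z_len = math.sqrt(dot(z, z))
    if z_len < 1e-9:
        return None
    z = tuple(c / z_len for c in z)
    dj = sub(vj[1], vj[0])
    k = dot(dj, z)
    y = tuple(d - k * c for d, c in zip(dj, z))
    y_len = math.sqrt(dot(y, y))
    if y_len < 1e-9:
        return None  # Vj is parallel to Vi
    y = tuple(c / y_len for c in y)
    return vi[0], (cross(y, z), y, z)

def key(vector, origin, axes, cell=2.0):
    """Quantized start and end points of a vector expressed in the frame."""
    coords = []
    for point in vector:
        rel = sub(point, origin)
        coords.extend(math.floor(dot(rel, axis) / cell + 0.5) for axis in axes)
    return tuple(coords)

def entries(vectors):
    """Yield (i, j, key) for every remaining vector in every pair frame."""
    for i, j in permutations(range(len(vectors)), 2):
        built = frame(vectors[i], vectors[j])
        if built is None:
            continue
        for k, v in enumerate(vectors):
            if k not in (i, j):
                yield i, j, key(v, *built)

def build_table(targets):
    table = defaultdict(set)
    for name, vectors in targets.items():
        for i, j, cell_key in entries(vectors):
            table[cell_key].add((name, i, j))
    return table

def search(table, query):
    """Return (votes, target name) for the best-matching target frame."""
    votes = defaultdict(int)
    for i, j, cell_key in entries(query):
        for hit in table.get(cell_key, ()):
            votes[(i, j) + hit] += 1
    best = (0, None)
    for frame_pair, count in votes.items():
        if count > best[0]:
            best = (count, frame_pair[2])
    return best
```

After the one-time table build, a query costs only dictionary lookups, which is the sense in which structure comparison becomes index lookup.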




Comparing Structural Alignment Algorithms

[Figure, panels B and C: Number of True Positives against Number of False Positives for DALI, STRUCTAL, VAST, MINAREA, LOCK and 3dSearch]

[Figure: subset of PClass, with regions labeled TIM Barrel, Immunoglobulin, α, α/β & α+β, α+β and β]

Comparing PClass to SCOP: NumClusters

[Figure: example tree with groups labeled A, B and C]

l Clusters(S) = number of PClass “Clusters” required to cover all leaves of S

l Given a set of leaves L, a PClass node, P, is a “Cluster” if:

• all leaves of P are in the set L (i.e. the leaves of P are a subset of L)

• no ancestor of P is also a “Cluster”
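The cluster count follows directly from the two conditions. In this sketch the PClass tree is represented as nested Python lists with strings as leaves, which is purely an illustrative encoding, and the function names are hypothetical.

```python
def leaves(node):
    """Set of leaf names under a node (a leaf is a string, an internal node a list)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result

def num_clusters(node, group):
    """Number of maximal subtrees whose leaves all belong to `group`."""
    if leaves(node) <= group:
        return 1  # this node is a cluster, so its descendants are not counted
    if isinstance(node, str):
        return 0  # a single leaf outside the group
    return sum(num_clusters(child, group) for child in node)
```

Stopping at the first node that lies entirely inside the group is what enforces the "no ancestor is also a Cluster" condition. A group that PClass keeps together yields 1, and the count grows as PClass scatters the group across the tree.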

Comparing PClass to SCOP: NumClusters (example)

[Figure: example tree with groups labeled A, B and C]

Clusters(A) = 3

Clusters(B) = 2

Clusters(C) = 1

Comparing PClass to SCOP: Spread

l PClassDistance(i,j,S):

• Find the lowest common ancestor, P, of i and j

• Compute the number of leaves of P that are not leaves of S (i.e. false positives)

$$\mathrm{Spread}(S) = \frac{\sum_{\text{all pairs of leaves } (i,j) \text{ under } S} \mathrm{PClassDistance}(i,j,S)}{\text{number of pairs } (i,j) \text{ under } S}$$

Comparing PClass to SCOP: Spread (example)

[Figure: example tree with groups labeled A, B and C]

FP(A) = 1

FP(B) = 2

FP(C) = 1

Spread(A) = (2*4*1)/(6C2) = 8/15 ≈ 0.53

Spread(B) = (1*4*2)/(5C2) = 8/10 = 0.8

Spread(C) = 0
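Spread can be computed by averaging PClassDistance over all unordered pairs of leaves in a group. The sketch uses a nested-list tree with string leaves, which is an illustrative encoding, and a small hypothetical tree rather than the one in the slide's figure.

```python
from itertools import combinations

def leaves(node):
    """Set of leaf names under a node (a leaf is a string, an internal node a list)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result

def lowest_common_ancestor(node, i, j):
    """Smallest subtree that contains both leaves i and j."""
    if not isinstance(node, str):
        for child in node:
            if {i, j} <= leaves(child):
                return lowest_common_ancestor(child, i, j)
    return node

def pclass_distance(tree, i, j, group):
    """Number of leaves under the lowest common ancestor that are not in the group."""
    return len(leaves(lowest_common_ancestor(tree, i, j)) - group)

def spread(tree, group):
    """Average PClassDistance over all unordered pairs of leaves in the group."""
    pairs = list(combinations(sorted(group), 2))
    if not pairs:
        return 0.0
    return sum(pclass_distance(tree, i, j, group) for i, j in pairs) / len(pairs)
```

For the tree `[[["a1", "a2"], "b1"], ["a3", "a4"]]` and the group `{"a1", "a2", "a3", "a4"}`, four of the six pairs meet at the root, where `b1` is the only outsider, so the spread is 4/6.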

[Figures: bar charts of Sizes, NumClusters, ClusterCoverage and Spread for Family, SuperFamily, Fold and Class groups; the plotted values are not recoverable]

Demonstrations

l Fold Databases

• SCOP (http://scop.stanford.edu/scop)

• FSSP (http://www2.ebi.ac.uk/dali/fssp/fssp.html)

• PClass (http://gene.stanford.edu/PClass)

l Structural Alignment Tools

• LOCK (http://gene.stanford.edu/lock)

• 3dSearch (http://gene.stanford.edu/3dSearch/)

• DALI (http://www2.ebi.ac.uk/dali)
