Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Protein Folds and ProteinStructure Superposition
Amit P. Singh
Biochemistry 218/MIS 231
November 24, 1999
PDB Size
l More than 11,000 released atomic coordinateentries
l About 4.5 new entries per day
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
1970 1980 1990 2000
Year
Nu
mb
er o
f E
ntr
ies
Tasksl Compare all known structures to each other
l Classify and organize all structures in a meaningfulway
l Find common folding patterns and structural motifs
l Compute evolutionary distances between proteinstructures
l Study interactions between structures and othermolecules (Protein Docking)
l Use known structures to predict structure fromsequence (Protein Threading)
l Many more ...
Overview
l Protein folds• Classification of protein structures
• Current databases
l Structure Superposition• Existing methods
• LOCK - hierarchical protein structure superposition
• 3dSearch - fast secondary structure superposition
l Demonstrations
Classification of Protein Structuresl Class
• Similar secondary structure content• All α, all β, α/β, etc
l Fold (Architecture)• Major structural similarity
• SSE’s in similar arrangement
l Superfamily (Topology)• Probable common ancestry
l Family• Clear evolutionary relationship
• Sequence similarity usually > 25%
Classes of Protein Structuresl Mainly αl Mainly βl α/β
• Parallel β sheets, β-α-β units
l α+β• Anti-parallel β sheets, segregated α
and β regions
• helices mostly on one side of sheet
Classes of Protein Structures
l Others• Multi-domain, membrane and cell surface, small
proteins, peptides and fragments, designedproteins
Folds / Architecturesl Mainly α
• Bundle
• Non-Bundle
l Mainly β• Single sheet
• Roll
• Barrel
• Clam
• Sandwich
• Prism
• 4/6/7/8 Propeller
• Solenoid
l α/β and α+β• Closed
» Barrel
» Roll, ...
• Open
» Sandwich
» Clam, ...
eg. The TIM Barrel Fold
A Conceptual Problem ...
For example:
Another example:Globin
vsColicin
Databases of Foldsl SCOP
• Structural Classification of Protein Structures
• Manual assembly by inspection
• All nodes are annotated (eg. All-alpha, alpha/beta)
• Structural similarity search using 3dSearch (Singh and Brutlag)
l CATH• Class - Architecture - Topology - Homologous Superfamily
• Manual classification at Architecture level
• Automated topology classification using the SSAP algorithm(Orengo and Taylor)
• No structural similarity search
l FSSP• Fully automated using the DALI algorithm (Holm and
Sander)
• No internal node annotations
• Structural similarity search using DALI
l PClass• Fully automated using the LOCK and 3dSearch algorithms
(Singh and Brutlag)
• All internal nodes automatically annotated with commonterms
• JAVA based classification browser
• Structural similarity search using 3dSearch
Databases of Folds
Automating Structure Classification andFold Detection
l Growth of PDB demands automated techniques forclassification and fold detection
l Protein Structure Comparison• computing structure based evolutionary distances
• identifying protein function
• understanding functional mechanism
• identifying structurally conserved regions in the protein
• finding binding sites or other functionally important regionsof the protein
Structure Superposition
l Find the transformation matrix that best overlaps the table andthe chair
l i.e. Find the transformation matrix that minimizes the rootmean square deviation between corresponding points of thetable and the chair
l Correspondences:
• Top of chair to top of table
• Front of chair to front of table, etc.
Absolute Orientation Algorithm
+Closed-form solution of absoluteorientation using unitquaternions
Berthold K.P. Horn,J.Opt.Soc.Am,April 1987, Vol 4, No. 4
The key is finding corresponding points between the twostructures
Algorithms for Structure Superpositionl Distance based methods:
l DALI (Holm and Sander): Aligning scalar distance plots
l STRUCTAL (Gerstein and Levitt): Dynamic programming using pairwise inter-molecular distances
l SSAP (Orengo and Taylor): Dynamic programming using intra-molecular vector distances
l MINAREA (Falicov and Cohen): Minimizing soap-bubble surfacearea
l Vector based methods:l VAST (Bryant): Graph theory based secondary structure alignment
l 3dSearch (Singh and Brutlag) & 3D Lookup (Holm andSander): Fast secondary structure index lookup
l Bothl LOCK (Singh and Brutlag): Hierarchically uses both secondary
structure vectors and atomic distances
DALI
An intra-molecular distance plot for myoglobin
DALI
l Based on aligning 2-D intra-molecular distancematrices
l Computes the best subset of corresponding residuesfrom the two proteins such that the similarity betweenthe 2-D distance matrices is maximized
l Searches through all possible alignments of residuesusing Monte-Carlo and Branch-and-Bound algorithms
Score(i, j) = 1.5 - |distanceA(i, j) - distanceB(i, j)|
STRUCTALl Based on Iterative Dynamic Programming to align
inter-molecular distances
l Pair-wise alignment score in each square of the matrixis inversely proportional to distance between the twoatoms
VAST - Vector Alignment Search Tool
l Aligns only secondary structure elements (SSE)
l Represents each SSE as a vector
l Finds all possible pairs of vectors from the twostructures that are similar
l Uses a graph theory algorithm to find maximalsubset of similar vectors
l Overall alignment score is based on the number ofsimilar pairs of vectors between the two structures
Algorithms for Structure Superpositionl Distance based methods:
l DALI (Holm and Sander): Aligning scalar distance plots
l STRUCTAL (Gerstein and Levitt): Dynamic programming using pair wiseinter-molecular distances
l SSAP (Orengo and Taylor): Dynamic programming using intra-molecularvector distances
l MINAREA (Falicov and Cohen): Minimizing soap-bubble surface area
l Vector based methods:l VAST (Bryant): Graph theory based secondary structure alignment
l 3dSearch (Singh and Brutlag): Fast secondary structure index lookup
l Bothl LOCK (Singh and Brutlag): Hierarchically uses both secondary
structure vectors and atomic distances
LOCK - Creating Secondary Structure Vectors
Comparing Secondary Structure Vectors
i
p
jk
r
q
S(d) =
M
d
d0
1
2
+
-2M M
d
M-
d0
Orientation Independent Scores:S = S(|angle(i,k) - angle(p,r)|)S = S(|distance(i,k) - distance(p,r)|)S = S(|length(k) - length(r)|)
Orientation Dependent Scores:S = S(angle(k,r))S = S(distance(k,r))
Aligning Secondary Structure Vectors
Best local alignment : HHSSSHSSH
H H S S
H
HS
SS
Three Step Algorithm
l Local Secondary Structure Superposition• Find an initial superposition of the two proteins by using
dynamic programming to align the secondary structurevectors
l Atomic Superposition• Apply a greedy nearest neighbor method to minimize the
RMSD between the C-α atoms from query and the target(i.e. find the nearest local minimum in the alignment space)
l Core Superposition• Find the best sequential core of aligned C-α atoms and
minimize the RMSD between them
Step 1: Local Secondary Structure Superposition
H1H3
S2S4
S2H1
H3S4
Step 1: Local Secondary Structure Superposition
A1A3
A2A4
# of aligned vectors
A1,A2 B2,B3 A3,A4B3,B4
total alignment scorepair
B2B1
B3B4
3 71
2 32
Step 1: Local Secondary Structure Superposition
Step 2: Atomic Superposition
Step 3: Core Superposition
Features
l Able to detect both global structural similarity as wellas sub-domain similarity
l Fast: O(n3), where n = number of secondary structureelements
l ~4 min to compare myoglobin (153 residues) to 800representative proteins from the PDB (on a MIPSR10000, 195 MHz processor)
l Can align flexible proteins that contain multiple sub-domains
LOCK PDB Search: Myoglobin1myh-A 0.562531 153 MYOGLOBIN (AQUOMET, PH 7.1) MUTANT2dhb-A 1.335804 132 HEMOGLOBIN (HORSE,DEOXY)1eca 1.497830 130 HEMOGLOBIN (ERYTHROCRUORIN, AQUO MET)1flp 1.349387 129 HEMOGLOBIN I (MONOMERIC) (FERRIC)2lhb 1.084818 128 HEMOGLOBIN V (CYANO,MET)1hds-B 1.235697 128 HEMOGLOBIN (SICKLE CELL)1ith-A 1.302358 126 HEMOGLOBIN (CYANOMET)1mba 1.501337 126 MYOGLOBIN (MET) ($P*H 7.0)1hbg 1.412812 125 HEMOGLOBIN (CARBON MONOXY)1ash 1.431599 122 HEMOGLOBIN (DOMAIN ONE)1hlb 1.475264 121 HEMOGLOBIN (SEA CUCUMBER)1hbi-A 1.408008 118 HEMOGLOBIN I (OXYGENATED, HOMODIMER)1gdi 1.476536 118 LEGHEMOGLOBIN (CARBON MONOXY)1cpc-A 1.787059 80 C-PHYCOCYANIN1cpc-L 1.682854 74 C-PHYCOCYANIN1tox-A 1.640087 63 DIPHTHERIA TOXIN DIMER COMPLEXED WITHNAD1pbg-A 1.701624 60 MOL_ID: 1;1col-A 1.798676 58 COLICIN *A (C-TERMINAL DOMAIN)2sbl-B 1.519786 53 LIPOXYGENASE-1 (SOYBEAN)1oxa 1.851156 52 CYTOCHROME P450 (DONOR:O2OXIDOREDUCTASE)1krb-C 1.680887 49 MOL_ID: 1;1irk 1.409630 48 INSULIN RECEPTOR (TYROSINE KINASEDOMAIN)2hpd-A 1.619509 48 CYTOCHROME P450 (BM-3)
1lmb-4 0.207748 22 DNA-BINDING REGULATORY PROTEIN1pra 0.451552 22 GENE REGULATING PROTEIN1adr 0.472226 22 TRANSCRIPTION REGULATION1yrn-A 0.492206 22 COMPLEX (TWO DNA-BINDING PROTEINS/DNA)1dik 0.517499 22 PHOSPHOTRANSFERASE1pnr-A 0.613499 22 COMPLEX (DNA-BINDING REGULATION/DNA)4fis-A 0.677586 22 DNA-BINDING PROTEIN1oct-C 0.697683 22 DNA-BINDING PROTEIN1ftt 0.742871 22 DNA BINDING PROTEIN1cop-E 0.768256 22 GENE REGULATING PROTEIN1dtr 0.780090 22 DNA BINDING PROTEIN1gdt-A 0.795171 22 COMPLEX (SITE-SPECIFICRECOMBINASE/DNA)3gap-B 0.817812 22 GENE REGULATORY PROTEIN1lfb 0.820489 22 TRANSCRIPTION REGULATION1ads 0.920977 22 OXIDOREDUCTASE1pdn-C 0.949104 22 COMPLEX (GENE REGULATING PROTEIN/DNA)1hom 0.958631 22 DNA-BINDING PROTEIN1mse-C 0.965817 22 COMPLEX (BINDING PROTEIN/DNA)1cma-A 1.023169 22 DNA-BINDING REGULATORY PROTEIN1trr-A 1.048990 22 DNA-BINDING REGULATORY PROTEIN1tpl-B 1.467803 22 LYASE(CARBON-CARBON)1hcr-A 0.930009 21 DNA-BINDING1neq 0.949148 21 DNA-BINDING PROTEIN1ade-A 0.979226 21 LIGASE (SYNTHETASE)153l 1.093427 21 HYDROLASE(O-GLYCOSYL)
LOCK PDB Search: Helix-Turn-Helix
3dSearchl Place vectors of all target proteins into an index table
l Structure comparison = Index lookup
l Building the table:• For each pair of vectors in the target protein, generate a
unique coordinate system
• Record the position and orientation of all remaining vectors inthis coordinate system in an index table
l Searching the table:• For each pair of vectors in the query protein, generate a
unique coordinate system
• Search the table of target vectors to find the query coordinatesystem that results in the maximum number of aligned vectors
l For each pair of vectors (Vi,Vj)• Transform the coordinates such that Vi lies on the Z-axis
and the projection of Vj on to the X-Y plane is parallel to theY-axis
Generating a Coordinate System
Building the Index Table
l For each coordinate system• Compute the position and orientation of the remaining
vectors in this new frame of reference
Algorithms for Structure Superpositionl Distance based methods:
l DALI (Holm and Sander): Aligning scalar distance plots
l STRUCTAL (Gerstein and Levitt): Dynamic programming using pair wiseinter-molecular distances
l SSAP (Orengo and Taylor): Dynamic programming using intra-molecularvector distances
l MINAREA (Falicov and Cohen): Minimizing soap-bubble surface area
l Vector based methods:l VAST (Bryant): Graph theory based secondary structure alignment
l 3dSearch (Singh and Brutlag): Fast secondary structure indexlookup
l Bothl LOCK (Singh and Brutlag): Hierarchically uses both secondary structure
vectors and atomic distances
B
0
10
20
30
40
50
60
0 10 20 30 40 50 60
Number of False Positives
Number of True Posit ives DALI
STRUCTALVASTMINAREALOCK3dSEARCH
C
0
5
10
15
20
25
30
35
40
0 10 20 30 40 50 60
Number of False Positives
Number of True Posi t ives DALI
STRUCTALVASTMINAREALOCK3dSEARCH
Comparing Structural Alignment Algorithms
TIM Barrel
Immunoglobulin
αα/β & α+β
α+β β
Subset ofPClass
Comparing PClass to Scop: NumClusters
A
B C
l Clusters (S) = number of PClass “Clusters” required to cover all leaves ofS
l Given a set of leaves L, a PClass node, P, is a “Cluster” if:
• all leaves of P are in the set L (i.e. P is a subset of L)
• no ancestor of P is also a “Cluster”
Comparing Pclass to Scop: NumClusters
A
B C
Clusters(A) = 3
Clusters(B) = 2
Clusters(C) = 1
l PClassDistance(i,j,S) :• Find lowest common ancestor, P, of i and j
• Computer the number of leaves P that are not leaves of S (i.e. False Positives)
A
B C
Sunder j)(i, pairs ofnumber
S) j,ance(i,PClassDist
)S(Spread Sunder j)(i, leaves of pairs all�
=
Comparing PClass to Scop: Spread
Comparing PClass to Scop: Spread
A
B C
FP(A) = 1
FP(B) = 2
FP(C) = 1
Spread(A) = (2*4*1)/(6C2) = 8/(6*5) =0.27
Spread(C) = 0
Spread(B) = (1*4*2)/(5C2) = 8/(5*4) =0.4
Family Sizes
0
50
100
150
200
250
300
350
400
450
1 2 3 4 5 6 7 8 9 10 11
SuperFamily Sizes
0
50
100
150
200
250
300
1 2 3 4 5 6 7 8 9 10 11 12 14 16 17 19 21 38
Fold Sizes
0
20
40
60
80
100
120
140
160
180
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 21 51 58
Class Sizes
0
0.2
0.4
0.6
0.8
1
1.2
1 17 28 52 148 183 249 268
Class NumClusters
0
10
20
30
40
50
60
70
80
1 17 28 52 148 183 249 268
Fold NumClusters
0
2
4
6
8
10
12
14
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 21 51 58
SuperFamily NumClusters
0
2
4
6
8
10
12
14
1 2 3 4 5 6 7 8 9 10 11 12 14 16 17 19 21 38
Family NumClusters
0
1
2
3
4
5
6
7
8
1 2 3 4 5 6 7 8 9 10 11
Family ClusterCoverage
0
2
4
6
8
10
12
1 2 3 4 5 6 7 8 9 10 11
SuperFamily ClusterCoverage
0
5
10
15
20
25
30
35
40
1 2 3 4 5 6 7 8 9 10 11 12 14 16 17 19 21 38
Fold ClusterCoverage
0
10
20
30
40
50
60
70
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 21 51 58
Class ClusterCoverage
0
50
100
150
200
250
1 17 28 52 148 183 249 268
Class Spread
0
100
200
300
400
500
600
700
1 17 28 52 148 183 249 268
Fold Spread
0
50
100
150
200
250
300
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 21 51 58
SuperFamily Spread
0
50
100
150
200
250
300
350
1 2 3 4 5 6 7 8 9 10 11 12 14 16 17 19 21 38
Family Spread
0
50
100
150
200
250
300
350
1 2 3 4 5 6 7 8 9 10 11
Demonstrations
l Fold Databases• SCOP (http://scop.stanford.edu/scop)
• FSSP (http://www2.ebi.ac.uk/dali/fssp/fssp.html)
• PClass (http://gene.stanford.edu/PClass)
l Structural Alignment Tools• LOCK (http://gene.stanford.edu/lock)
• 3dSearch (http://gene.stanford.edu/3dSearch/)
• DALI (http://www2.ebi.ac.uk/dali)