Homology Modeling Lu Chih-Hao 1. Why study protein structure? Proteins play crucial functional roles in all biological processes: enzymatic catalysis,

Homology Modeling

Lu Chih-Hao

1

Why study protein structure?

• Proteins play crucial functional roles in all biological processes: enzymatic catalysis, signaling messengers …

• Function depends on 3D structure.

• Easy to obtain protein sequences, difficult to determine structure.

2

Where find the data?

• Protein Data Bank (PDB)– http://www.rcsb.org/pdb/– > ~65,500 structures of proteins

• Text file contain: coordinates for each heavy atom from the first residue to the last

X Y Z

3

PDB Statistics

4

TIM barrel

5

How to determine the protein structure?

• By experimentation– X-Ray– NMR (nuclear magnetic resonance spectroscopy)

• Sequence-Structure gap

6

Protein Structure Prediction

• The primary sequence already contain all the information necessary to define 3D structure.

• The 3D protein structure can be predicted according to three main categories of methods (Rost & O’Donoghue, 1997): (1) homology modeling; (2) fold recognition (threading); (3) ab initio techniques.

• Homology modeling is currently the most accurate method to predict protein 3D structure (Tramontano, 1998).

7

Protein Structure Prediction

Sequence

Sequence HomologyTo known fold

HomologyModeling

>30%

Threading

Match Found?

Ab initio

No

Model

Yes

<30%

8

.

0

20

40

60

80

100

0 50 100 150 200 250

identity

Number of residues aligned

Perc

enta

ge s

equence

identi

ty/s

imila

rity

(B.Rost, Columbia, NewYork)

Sequence identity implies structural similarity

Sequence similarity implies structural similarity?

Safe zone

9

Homology Modeling

• Basis– Structure is much more conserved than sequence

during evolution

• Limited applicability– A large number of proteins and ORFs have no

similarity to proteins with known structure

10

??

KQFTKCELSQNLYDIDGYGRIALPELICTMFHTSGYDTQAIVENDESTEYGLFQISNALWCKSSQSPQSRNICDITCDKFLDDDITDDIMCAKKILDIKGIDYWIAHKALCTEKLEQWLCEKE

Use as template

8lyz1alc

KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRLShare Similar

Sequence

Homologous

What is Homology Modeling?

Target Template

11

Structure prediction by homology modeling

12

Step 1

Step 2

Step 3

Step 4

Homology detection and template selection

• Homology detection– To detect the fold of a probe sequence from a library

of known target fold.

• The three type of sequence based methods:– Pair-wise sequence-sequence comparison

• FASTA, BLAST

– Sequence profile comparison• PSI-BLAST, IMPALA, HMMER, SAM

– Profile-profile comparison• prof_sim, COMPASS

13

Q T

Sequence-Sequence comparison

BLAST, FASTA, SSEARCH

14

Q T

Profile-Sequence comparison

PSI-BLAST15

PSI-BLAST Overview

16

Q T

Sequence-Profile comparison

RPS-BLAST, IMPALA, HMMER, SAM

17

Q T

Profile-Profile comparison

prof_sim, COMPASS18

Method_11lmb3 <-> 1pou shift = 9.34 σ = 39.62LEDARRLKAIYEKKKNELGLSQESVADKMGMGQSGVGALFNGINALNAYNAALLAKILKVSVEEFSPSIAREIYEMYEAHHHHHHHHHHHHHHHHHCCCChhhhhhhhccchhhhhhhhccccccchhhhhhhhhhhccchhhcchhhhhhhhhhhhh||||||||||||||||||||| ++++++++ + ++++++++++++ ++++++++000000000000000000000 99999999 X XXXXXXXXXXXX XXXXXXXXHHHHHHHHHHHHHHHHHHCCC---------cchhhhhhhhhcccccc---chhhhhhhcccccccchhhhhhhhhhhhhLEELEQFAKTFKQRRIKLGFT---------QGDVGLAMGKLYGNDFS---QTTISRFEALNLSFKNMCKLKPLLEKWLN

Method_21lmb3 <-> 1pou Shift = 0.67 σ = 60.78LEDARRLKAIYEKKKNELGLS----QESVADKMG--MGQSGVGALFN-GINALNAYNAALLAKILKVSVEEFSHHHHHHHHHHHHHHHHHCCCC----hhhhhhhhc--cCHHHHHHHHC-cccccchhhhhhhhhhhccchhhcc||||||||||||||||||||| ---- |||||||||| -- ++++++++ ++ 000000000000000000000 4444 0000000000 11 11111111 44 HHHHHHHHHHHHHHHHHHCCCcchhhhhhhhhcccccCCHHHHHHHCccccccchhhhhhhhhhh---hhhccLEELEQFAKTFKQRRIKLGFTQGDVGLAMGKLYGNDFSQTTISRFEALNLSFKNMCKLKPLLEKW---LNDAE

The importance of the sequence alignment

SCR; structure conserved region SVR; structure variable region

19

Backbone generation

• Rigid-body assembly– Building model core

20

21

Construction of loops might be done by:

Wedemeyer,ScheragaJ. Comput. Chem.20, 819-844(1999)

Ab initio methods - without any prior knowledge. This is done by empirical scoring functions that check large number of conformations and evaluates each of them.

22

data clustereddata

library

Construction of loops might be done by:

Using database of loops which appear in known structures. The loops could be categorized by their length or sequence

23

Scan database and search protein fragments with correct number of residuesand correct end-to-end distances

24

25

26

Loop length

cRM

S (Ǻ

)

Method breaksdown for loopslarger than 9

Loop Modeling: A database approach

27

GDT_TS = 45.96 GDT_TS = 60.48

Predicted model with long loop Without loop

Target: 2bj7A

28

29

Errors in Homology Modeling

a) Side chain packing b)Distortions and shifts c) No template

Template ModelTrue structure30

Errors in Homology Modeling

d) Misalignments e) Incorrect template

(Marti-Renom et al., 2000)

Template ModelTrue structure31

PROCHECK, Verify3D, Prosa, Anolea, Bala …

32

PROCHECK

β

α http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html

33

Verify3D

• Verify3D analyzes the compatibility of an atomic model (3D) with its own amino acid sequence (1D).

Luethy et al., 1992 34

ProQ Server

• ProQ is a neural network-based predictor

– Structural features quality of a protein model.

Arne Elofssons group: http://www.sbc.su.se/~bjorn/ProQ/

Correct Good Very goodLGscore > 1.5 LGscore > 3 LGscore > 5MaxSub > 0.1 MaxSub > 0.5 MaxSub > 0.8

35

Modeling accuracy

(Marti-Renom et al., 2000)

36

Utility of Structural Information

37

38

39

(PS)2: protein structure prediction server

40

Consensus strategy

• The idea of consensus analysis is to gather predictions from a set of different methods.

• The performance of consensus methods is significantly higher than for individual methods.

3d-shotgun (Fischer D., 2003)3d-jury (Ginalski K et al., 2003)Pmodeller (Bjorn W et al., 2003)

41

Structure prediction by homology modeling

42

Step 1

Step 2

Step 3

Step 4

Figure 1. Overview of the protein structure prediction server, (PS)2.

Overview of the (PS)2 method

Step1: Template search/selection by the

consensus of PSI-BLAST and IMPALA

Step2: Target-template alignment by the consensus of T-Coffee, PSI-BLAST,

and IMPALA

Step3: Model building by MODELLER and structure evaluation and visualization

by CHIME and Raster3D

(d)

(a)

(b) (c)

43

: Aligned path of PSI-BLAST

: Aligned path of T-Coffee

: Aligned path of IMPALA

: Final aligned path

9: aligned in 1st cycle7: aligned in 2nd cycle5: aligned in 3rd cycle3: aligned in 4th cycle4 and 2: unfeasible solution

Input: target and template sequences

Output: target-template aligned sequences

Step 1: Initial all entries of the aligned matrix to 0. Align target and template sequences using PSI-BLAST, IMPALA, and T-Coffee.

Step 2: Sum aligned scores of these three alignments for each position with different scoring weights.

Step 3: Take the positions with the highest score as the aligned points to build the final target-template alignment. (e.g., the highest scoring is 9 for the 1st cycle in (b) )

Step 4: Identify the unfeasible positions. ( 4 and 2 in (b))

Step 5: Change the scores of unfeasible positions and the aligned points to 0.

Step 6: Repeatedly Steps 3 and 5 until all entries are 0.

Step 7: Output the path with the aligned points as the target-template alignment

(b)(a)

Alignment method

44

http://predictioncenter.org/

45

CASP3 servers registered:

1. 3D-PSSM (Sternberg) [email protected] 2. Karplus [email protected] 3. frsvr (Fischer) [email protected] 4. pscan (Eloffson) [email protected] 5. BASIC (Godzik) [email protected] 6. GenTHREADER [email protected] 7. Valentina di Francesco [email protected] 8. TOPITS (Rost) [email protected] 9. Bork

46

CASP8 servers registered:

47

}8 4, 2, 1,{(%)4

100_

dN

GDT

TSGDT d

d

- N is the total number residues of the target (native structure)- GDTd is the number of aligned residues whose Cα-atom distance

between the target and predicted model is less than d- d is 1, 2, 4, or 8 Å.

Model Evaluation

• Performance evaluation– Comparing the 47 CM targets to evaluate the

performance with the other groups in CASP6.

• GDT_TS Score

48

10 272

6 294 Native structure

PSI-BLAST modelGDT_TS = 64.97

272(PS)2 modelGDT_TS = 67.22

GDT_TS = 66.00

10

10 272 IMPALA modelGDT_TS = 63.32

294 T-Coffee modelGDT_TS = 65.14

T0264 (1wde)

Aligned rate: 91.00 %

Aligned rate: 91.00 %

Aligned rate: 100 %6

Aligned rate: 100 %6 294

Figure 3. Comparison (PS)2 with PSI-BLAST, IMPALA, and T-Coffee of the prediction accuracies (global / local GDT_TS scores) on target T0264.

49

Figure 4. Comparison of (PS)2 models with all automated servers in CASP6.

0

20

40

60

80

100

T0196

T0199_1

T0200

T0204

T0205

T0208

T0211

T0222_1

T0223_1

T0226

T0226_1

T0229

T0229_1

T0229_2

T0231

T0233

T0233_1

T0233_2

T0234

T0235_1

T0240

T0246

T0247

T0247_1

T0247_2

T0247_3

T0264

T0264_1

T0264_2

T0266

T0267

T0268

T0268_1

T0268_2

T0269

T0269_1

T0269_2

T0271

T0274

T0275

T0276

T0277

T0279

T0279_1

T0279_2

T0280_1

T0282

Targets

GD

T_T

S Sc

ore

(%)

50

51

Table 1. Compare with the other groups in CASP6

(PS)2 RBTA ESYP 3DJR MGTH 3DJS PROS PMO5 PRCM PCO5 PCOB

Average GDT_TS

65.89 64.92 63.14 62.54 61.27 61.08 58.11 57.93 57.62 56.37 37.57

• Cases

T0269, Template 1prxA(PS)2 model, GDT_TS: 85.76

T0269, Template 1qq2AESYP model, GDT_TS: 78.48

http://ps2.life.nctu.edu.tw

52

53

54

Documents

Homology Modeling Lu Chih-Hao 1. Why study protein structure? Proteins play crucial functional roles in all biological processes: enzymatic catalysis,