2° structure, TM regions, and solvent accessibility

Topic 13

Chapter 29, Gu and Bourne “Structural Bioinformatics”

The Truth (Information) is Out (In) There

But we’re still having a tough time finding it.

Protein Secondary Structure Prediction

Given a protein sequence (primary structure), predict its secondary structures.

GHWIATRGQLIREAYEDYRHFSSECPFIP
CEEEEECCCEEEEECCCHHHHHHCCCCCC

E: β-strand, H: α-helix, C: coil

Assumption: short stretches of residues have a propensity to adopt a certain conformation, so the conformation of the central residue in a sequence fragment depends only on its flanking residues (sliding window).

Eight states reduced to three:

H: (H: α-helix, G: 3₁₀ helix, I: π-helix)
E: (E: β-strand, B: bridge)
C: (T: turn, S: bend, C: coil)
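The sliding-window assumption and the eight-to-three state reduction can be sketched in a few lines of Python. This is only an illustration: the window size of 7, the `-` padding character, and the function names are assumptions, not values from the slides.

```python
# Reduction of eight assigned states to the three classes listed above.
STATE_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # alpha-, 3-10 and pi-helix
    "E": "E", "B": "E",             # strand and bridge
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil
}


def reduce_states(states):
    """Map an eight-state string to three states.

    Unknown symbols are treated as coil (an assumption of this sketch).
    """
    return "".join(STATE_TO_3.get(s, "C") for s in states)


def windows(sequence, size=7, pad="-"):
    """Yield (central residue, window) for every position, padding the ends."""
    half = size // 2
    padded = pad * half + sequence + pad * half
    for i, residue in enumerate(sequence):
        yield residue, padded[i:i + size]


seq = "GHWIATRGQLIREAYEDYRHFSSECPFIP"
for residue, window in list(windows(seq))[:3]:
    print(residue, window)
```

A predictor built on this assumption sees only the window, never the rest of the chain, when it assigns a state to the central residue.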

Why secondary structure prediction?

-- Because we can (kind of).

-- Because it could be a first step towards prediction of protein tertiary structure.

“Have solution, need problem.” Nearly every imaginable algorithm has been applied to secondary structure prediction.

Secondary Structure Prediction Methods

1. First generation: single amino acid propensities. Chou-Fasman method (1974), GOR I-IV. ~56-60% accuracy.

2. Second generation: segments of 3-51 adjacent residues. NNSSP, SSPAL. ~65% accuracy.

Third generation methods, using evolutionary information (~76% accuracy):

3. Neural networks: PHD, Psi-Pred, J-Pred

4. Support vector machines (SVM)

5. Hidden Markov models (HMM)

Secondary Structure Prediction Accuracy

1. Three-state per-residue prediction accuracy

$$Q_3 = \frac{100}{N_{obs}} \sum_{i=1}^{3} M_{ii}$$

M_ii: number of residues observed in state i and predicted in state i. N_obs: the total number of residues observed in the 3 states.

2. Per-segment prediction accuracy (SOV, Segment of OVerlap)

Per-state segment overlap:

S1: observed SS segment
S2: predicted SS segment
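As a minimal sketch of the per-residue measure: when both strings use only the three states, the sum of the diagonal counts M_ii is the number of positions where observed and predicted states agree. The function name `q3` and the toy strings are illustrative.

```python
def q3(observed, predicted):
    """Three-state per-residue accuracy in percent.

    Assumes both strings use only the three states (H, E, C), so that the
    sum of M_ii equals the number of matching positions.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("empty input")
    matches = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * matches / len(observed)


# Toy example: three of four residues agree.
print(q3("HHEC", "HHCC"))  # 75.0
```

Q3 rewards every matching residue equally, which is why a segment-based measure such as SOV is reported alongside it.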

Single Residue Propensity Methods

Calculate the propensity for a given amino acid to adopt a certain SS type:

$$P_\alpha^i = \frac{P(\alpha \mid aa_i)}{p(\alpha)} = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)}$$

i: amino acid; α: secondary structure state

Example: from a data set with 30 proteins

#Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 580

p(α, aa) = 580/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000

P = 580 / (4,000/10) = 1.45
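The worked example can be checked numerically. The sketch below computes the propensity as the joint frequency divided by the product of the two marginal frequencies; the function and argument names are illustrative.

```python
def propensity(n_aa_in_state, n_state, n_aa, n_total):
    """Propensity of an amino acid for a state: p(state, aa) / (p(state) * p(aa))."""
    p_joint = n_aa_in_state / n_total
    p_state = n_state / n_total
    p_aa = n_aa / n_total
    return p_joint / (p_state * p_aa)


# Counts from the slide: Ala in helix.
p_ala_helix = propensity(n_aa_in_state=580, n_state=4000, n_aa=2000, n_total=20000)
print(round(p_ala_helix, 2))  # 1.45
```

A value above 1 means the residue occurs in that state more often than expected from the two frequencies alone.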

Amino Acid Propensities to Secondary Structures

Chou-Fasman method

Residue  T   S   P   T   A    E    L    M    R   S   T   G
P(H)     69  77  57  69  142  151  121  145  98  77  69  57
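One way to use such a table is to scan for stretches rich in helix-favoring residues. The sketch below flags windows of six residues in which at least four have P(H) above 100. This is a simplified version of the helix nucleation idea in the Chou-Fasman method; the window size, the count, and the threshold are assumptions of this sketch, not values given on the slide.

```python
SEQ = "TSPTAELMRSTG"
# P(H) values as listed in the table above, one per residue of SEQ.
P_H = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]


def helix_nuclei(p_h, window=6, min_formers=4, threshold=100):
    """Return 0-based start positions of windows rich in helix formers."""
    starts = []
    for start in range(len(p_h) - window + 1):
        formers = sum(1 for v in p_h[start:start + window] if v > threshold)
        if formers >= min_formers:
            starts.append(start)
    return starts


nuclei = helix_nuclei(P_H)
print(nuclei)                                 # [2, 3, 4]
print([SEQ[s:s + 6] for s in nuclei])         # ['PTAELM', 'TAELMR', 'AELMRS']
```

All three flagged windows contain the A-E-L-M run, which carries the four values above 100.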

Nearest Neighbor Methods

* The idea is simple: predict the SS of the central residue of a given segment from homologous segments (neighbors).

For example, find in a database some number of the sequences closest to the subsequence defined by a window around the central residue, then use max(N_α, N_β, N_c) to assign the SS, where N_α, N_β and N_c count the neighbors whose central residue is in a helix, a strand or a coil.

[Figure: query sequence RSTEVRASRQLAKEKVN with a window marked on it, a set of homologous sequences with central-residue states E C C H H C C, and the resulting assignment C.]

Key parameters:
1. How to define similarity?
2. What size window of sequence should be examined?
3. How many close sequences should be selected?
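A minimal sketch of the voting step follows. It makes one arbitrary choice for each of the three key parameters: similarity is the number of identical positions, the window is five residues, and the three closest segments vote. The database entries are invented for illustration.

```python
from collections import Counter


def identity(a, b):
    """Similarity as the number of identical positions (one simple choice)."""
    return sum(1 for x, y in zip(a, b) if x == y)


def predict_central(window, database, k=3):
    """Assign the state most common among the k segments closest to `window`.

    database: list of (segment, state of the segment's central residue).
    """
    ranked = sorted(database, key=lambda entry: identity(window, entry[0]),
                    reverse=True)
    votes = Counter(state for _, state in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical database of five-residue segments.
database = [
    ("ASRQL", "H"), ("ASRQI", "H"), ("GSRQL", "C"),
    ("PPGPP", "C"), ("VTVTV", "E"),
]
print(predict_central("ASRQL", database))          # H

# The vote on the slide: neighbor states E C C H H C C give C.
print(Counter("ECCHHCC").most_common(1)[0][0])     # C
```

Real nearest-neighbor predictors replace the identity count with a substitution-matrix or profile score, which is where most of the accuracy difference between implementations comes from.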

The Devil is in the details…

Psi-Pred Method

D. Jones, J. Mol. Biol. 292, 195 (1999).

Method: neural network

Input data: PSSM generated by PSI-BLAST

Bigger and better sequence database: combining several databases and data filtering

Training and test sets preparation:

-- SS prediction only makes sense for proteins with no homologous structure.

-- No sequence or structural homologues between training and test sets, checked with CATH and PSI-BLAST (mimicking the realistic situation).

Psi-Pred Method: Neural Network

Window size = 15

Two networks

First network (sequence-to-structure):
-- 315 = (20 + 1) × 15 inputs; the extra unit per position indicates where the window spans either the N or C terminus
-- Data are scaled to the [0, 1] range by using 1/[1 + exp(-x)]
-- 75 hidden units
-- 3 outputs (H, E, L)

Second network (structure-to-structure):
-- Structural correlation between adjacent sequence positions
-- 60 = (3 + 1) × 15 inputs
-- 60 hidden units
-- 3 outputs

Accuracy ~76%
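The input layer of the first network can be sketched as follows: each of the 15 window positions contributes 20 logistic-scaled profile scores plus one flag unit, giving 315 values. How positions beyond the chain ends are filled is an assumption here (zeros for the scores, 1 for the flag); the slide only states that the extra unit marks windows spanning a terminus.

```python
import math

WINDOW = 15   # residues per window, from the slide
N_AA = 20     # profile scores per residue


def logistic(x):
    """Scale a raw profile score to the [0, 1] range."""
    return 1.0 / (1.0 + math.exp(-x))


def encode_window(pssm, center):
    """Build the (20 + 1) * 15 = 315 inputs for the residue at `center`.

    pssm: list of rows, each holding 20 raw scores for one residue.
    Positions outside the chain get zero scores and flag = 1 (assumption).
    """
    half = WINDOW // 2
    inputs = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            inputs.extend(logistic(score) for score in pssm[pos])
            inputs.append(0.0)
        else:
            inputs.extend([0.0] * N_AA)
            inputs.append(1.0)
    return inputs


# Toy profile: 30 residues, all scores zero.
toy_pssm = [[0] * N_AA for _ in range(30)]
print(len(encode_window(toy_pssm, 0)))   # 315
```

The second network would take the three outputs of the first network plus a flag unit at each of the 15 positions, which is where the 60 inputs come from.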

Sample Psi-Pred Output

Conf: Confidence (0=low, 9=high), very important!
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence

# PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)

Conf: 966899999997542002357777557999999716898188034435788873356776
Pred: CCHHHHHHHHHHHHHHHCCCCCCCHHHHHHHHHHHCCCCCEEECCCCEEEEEEECCCCCC
  AA: MMWEQFKKEKLRGYLEAKNQRKVDFDIVELLDLINSFDDFVTLSSCSGRIAVVDLEKPGD
              10        20        30        40        50        60

Conf: 777179998337888888988751235636899718261220179868899999998557
Pred: CCCCEEEEEECCCCCHHHHHHHHHCCCCCEEEEECCCEEEEECCCHHHHHHHHHHHHHCC
  AA: KASSLFLGKWHEGVEVSEVAEAALRSRKVAWLIQYPPIIHVACRNIGAAKLLMNAANTAG
              70        80        90       100       110       120

Conf: 200242314703799714651435541487355188999999999999999889999999
Pred: CCCCCCEECCCEEEEEECCCEEEEEECCCCCEEECHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: FRRSGVISLSNYVVEIASLERIELPVAEKGLMLVDDAYLSYVVRWANEKLLKGKEKLGRL
             130       140       150       160       170       180

***Compare the prediction for residues 9 and 17***
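To make the comparison concrete, the sketch below reads the first block of the sample output and looks up individual residues by their 1-based position. The three strings are copied from the sample; the helper names and the confidence cutoff of 3 are assumptions.

```python
conf = "966899999997542002357777557999999716898188034435788873356776"
pred = "CCHHHHHHHHHHHHHHHCCCCCCCHHHHHHHHHHHCCCCCEEECCCCEEEEEEECCCCCC"
aa = "MMWEQFKKEKLRGYLEAKNQRKVDFDIVELLDLINSFDDFVTLSSCSGRIAVVDLEKPGD"


def residue(position):
    """Return (amino acid, predicted state, confidence) for a 1-based position."""
    i = position - 1
    return aa[i], pred[i], int(conf[i])


def low_confidence(cutoff=3):
    """List the 1-based positions whose confidence is below `cutoff`."""
    return [i + 1 for i, c in enumerate(conf) if int(c) < cutoff]


print(residue(9))
print(residue(17))
print(low_confidence())
```

Residue 9 sits in the middle of a predicted helix with the highest confidence, while residue 17 lies at the end of that segment where confidence drops to the bottom of the scale. Segment boundaries are typically where predictions are least reliable.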

Sample Psi-Pred Output-II

Again, voting-rule methods tend to be best

ATKAVCVLKGDGPVQGTIHFEAKGDTVVVTGSITGLTEGDHGFHVHQFGDNTQGCTSAGP 2SOD
CCCCCCCCCCCCCCCCEEHCCHHECEEEEEEEEEEEECCCCCCCCCCCCCCCCCCCCCCC BPS
CCHEEEEECCCCCCCCEEEHHHCCCEEEEEEEEECECCCCCCEEEECCCCCCCCCCCCCC D_R
CCCEEEEEECCCCCEEEEEEEECCCEEEEEEEEEEEECCCCCEEEEECCCCCCCCCCCCC DSC
CCCEEEEECCCCCCCEEEEEECCCCEEEEEEEEECCCCCCCCEEEEEECCCCCCCCCCCC GGR
HHHCEEEECCCCCCCEEEEEECCCCEEEEEECEEEEEECCCCEEEEECCCCCCEEECCCC GOR
CCCCEEEECCCCCCCCCEEECCCCCCEEEEECEEECCCCCCCEEEECCCCCCCCEEECCC H_K
CCCCEEEEECCCCCCCCCEEECCCCCEEEECCCCCCCCCCCEEEEEEEECCCCCCCCCCC K_S
CCCCEEEECCCCCCCCEEEEECCCCEEEEEEEEEEECCCCCCEEEEECCCCCCCCCCCCC JOI
---EEEEE------EEEEEEEEE--EEEEEEEEE-----EEEEEEEE------------- 2SOD

HFNPLSKKHGGPKDEERHVGDLGNVTADKNGVAIVDIVDPLISLSGEYSIIGRTMVVHEK 2SOD
CCCCCCCCCCCCCCCCCCCCCCECCCCCCHEECCCCCCCCCECCEECEEEEEEEEEEECC BPS
CCCCCCCCCCCCCCCHHCECCCCCECCCCCCEEEEEEECCEEEECCCEEEEEEEEEEECC D_R
CCCCCCCCCCCCCCEEEEECCCCCCCCCCCCEEEEEECCCCCCCCCCEEEEEEEEEEECC DSC
CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEECCCCCCCCCCEEEECEEEEEECC GGR
CCCCCCCCCCCCCCHHEEECCCCCCCCCCCCEEEEEEECCEEECCCCEEEEEEEEEECCC GOR
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEECCCCCCCCCCCCCCHHHHHHEECCC H_K
CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEEEEEEEEEECCCEEECCEEEEEEE K_S
CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCEEEEEECCCCECCCCCEEEEEEEEEEECC JOI
--------------------EEEEEE------EEEEEEE--------------EEEEE-- 2SOD
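A voting rule of this kind can be sketched as a per-column majority vote over the individual predictions. The tie-breaking choice (fall back to coil) and the toy input strings are assumptions; the sketch does not reproduce how the joint row in the example above was actually computed.

```python
from collections import Counter


def consensus(predictions, tie="C"):
    """Per-column majority vote over equal-length prediction strings.

    A tie between the two most common states is resolved to `tie`
    (coil by default, an arbitrary choice for this sketch).
    """
    result = []
    for column in zip(*predictions):
        counts = Counter(column).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            result.append(tie)
        else:
            result.append(counts[0][0])
    return "".join(result)


# Toy predictions from three hypothetical methods.
methods = ["CCEEEH", "CEEEEC", "CCEECC"]
print(consensus(methods))  # CCEEEC
```

Voting helps because different methods tend to make their errors at different positions, so isolated mistakes are outvoted.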

Prediction Accuracy (EVA)

EVA: Automatic evaluation of prediction servers

[Figure: distribution of per-protein accuracy for PSIPRED, SSpro, PROF, PHDpsi, JPred2 and PHD. x-axis: percentage of correctly predicted residues per protein (30 to 100); y-axis: percentage of all 150 proteins (0 to 25).]

How Far Can We Go?

Currently ~76%

Proteins with more than 100 homologues: 80%

Assignment is ambiguous (5-15%). Recall DSSP vs STRIDE.

-- Non-unique protein structures (dynamic), H-bond cutoff, etc.

Different secondary structures between homologues (~12%).

Non-locality. Secondary structure is influenced by long-range interactions.

-- Some segments can have multiple structure types (chameleon sequences).

Solvent accessibility

Conceptually similar problem to SS prediction: Buried vs. Exposed.

Weighted Ensemble Solvent Accessibility predictor: http://pipe.scs.fsu.edu/wesa.html

[Figure: residues of a structure labeled E (exposed) or B (buried).]
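The two-state labels can be illustrated by thresholding a per-residue relative solvent accessibility value. The threshold of 0.25 and the input values are assumptions for illustration only; the slides do not state which cutoff defines "buried", and this is not a description of how the WESA server works.

```python
def burial_states(relative_accessibility, threshold=0.25):
    """Label each residue E (exposed) or B (buried).

    relative_accessibility: values in [0, 1], one per residue.
    threshold: illustrative cutoff, not taken from the slides.
    """
    return "".join("E" if value > threshold else "B"
                   for value in relative_accessibility)


# Hypothetical accessibility values for four residues.
print(burial_states([0.05, 0.60, 0.25, 0.30]))  # BEBE
```

Once residues carry such labels, the prediction task has the same shape as secondary structure prediction: one class label per residue, predicted from the sequence.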

Why bother?

To provide structural context for putative mutations that one wants to characterize biochemically or biophysically.

Transmembrane Segment Prediction

Again, conceptually similar problem to SS prediction: TM vs. Not.
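As a simple illustration of the "TM vs. Not" framing, the sketch below uses a sliding-window hydropathy average, a classic baseline that is not taken from the slides. It uses the Kyte-Doolittle hydropathy scale; the window of 19 residues and the cutoff of 1.6 are the values commonly quoted with that scale and are assumptions here.

```python
# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def tm_candidates(sequence, window=19, cutoff=1.6):
    """Return 0-based start positions of windows with mean hydropathy above cutoff."""
    hits = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        mean = sum(KD[residue] for residue in segment) / window
        if mean > cutoff:
            hits.append(start)
    return hits


# Artificial sequence: a leucine stretch flanked by aspartates.
toy = "DDDDD" + "L" * 19 + "DDDDD"
print(tm_candidates(toy))
```

Like the single-residue propensity methods for secondary structure, this uses only local composition; the methods that do better add evolutionary information in the same way the third generation SS predictors do.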