
Protein Folding is different from structure prediction ...isoft.postech.ac.kr/~gblee/Course/NLP_for... · Protein Structure Prediction Russ B. Altman BMI 214 CS 274 Protein Folding


Protein Structure Prediction

Russ B. Altman, BMI 214 / CS 274

Protein Folding is different from structure prediction

--Folding is concerned with the process by which the chain takes on its 3D shape, usually modeled from physical principles.

--Prediction uses any statistical, theoretical or empirical data to try to get at the end result.

Protein Structure Prediction

1. A bit of history: Asilomar, 1994, 1996, 1998 & 2000.

2. Four approaches to structure prediction:
a. Homology Modeling
b. Ab initio prediction
c. Sequence-Structure Threading
d. Docking

3. Two ways of threading
• Dynamic programming
• Knowledge-based potentials

Asilomar, 1994, 1996, 1998 & 2000

1. Asilomar is a state conference ground near Carmel and Monterey.

2. December 1994: “Meeting on Critical Assessment of Techniques for Protein Structure Prediction”

3. December 1996 & 1998: “Second” and “Third” meeting, etc…

4. Competition was held to compare/contrast methods.

Asilomar

4. Competition worked like this:

• Experimentalists who had structure that would be solved before date of CASP meeting submitted the sequence of the unknown to central repository.

• Predictors could download sequence and minimal information about protein (name), and could enter one of three categories.

• Assessors use automatic programs for analysis in addition to expertise to evaluate quality of predictions.

Asilomar Categories

1. Homology Modeling (sequences with high homology to sequences of known structure)

Given a sequence with > 25-30% sequence identity to a known structure in the PDB, use the known structure as the starting point to create a model of the 3D structure of the sequence.

Takes advantage of knowledge of a closely related protein. Use sequence alignment techniques to establish correspondences between known “template” and unknown.


Asilomar Categories

2. Ab initio prediction (no known homology with any sequence of known structure)

Given only the sequence, predict the 3D structure from “first principles”, based on energetic or statistical principles.

Secondary structure prediction and multiple alignment techniques are used to predict features of these molecules. Some method is then needed to assemble the 3D structure.

Ab initio prediction

New sequence:
MLDTNMKTQLKAYLEKLTKPVELIATLDDSAKSAEIKELL

Predict secondary structure:
MLDTNMKTQLKAYLEKLTKPVELIATLDDSAKSAEIKELL
…HHHHHCCCCCHHHHHHHHHHCCCCBBBBBBBCCBBBB

Predict 3D structure entirely:

Comparison of calculated (red) and experimental (blue) structures for the protein myoglobin using the refined potential function. The calculated structure is the lowest energy structure obtained from 3 different jobs with clustering and energy selection. The total simulation time on a 16 node partition CM-5 massively parallel computer was 60 hours, in which about 5 billion structures were generated. The RMS deviation of the two structures is 6.2 Å.

Asilomar Categories

3. Fold recognition (sequences with no significant sequence identity (<= 30%) to sequences of known structure)

Given the sequence, and a set of folds observed in the PDB, see whether the sequence could adopt one of the known folds.

Takes advantage of knowledge of existing structures, and principles by which they are stabilized (favorable interactions).


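Fold Recognition

New sequence:
MLDTNMKTQLKAYLEKLTKPVELIATLDDSAKSAEIKELL…

Library of known folds:

[Figure: the sequence is tested (?) against four candidate folds; three are rejected (X) and one is accepted (!).]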

Asilomar Categories

4. Docking two proteins (‘96 only)

Given two separate (known) protein structures, predict the geometry of their physical association.

Use information about surface properties to find the best hand/glove or lock/key fit between two known structures. Can be done by rigid-body docking or flexible docking (harder).

[Figure: Protein Docking]

Asilomar Results

How to evaluate predictions?

• RMSD
• Overall identification and topology of secondary structures
• Energy considerations (contacts, H-bonds)
• Similarity of hydrophobic core
• Sequence alignment quality (and systematic shift)
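As an illustration of the first criterion, here is a minimal sketch of an RMSD calculation. It assumes the two structures are already superimposed and that atoms are paired one-to-one (for example, C-alpha atoms in the same order); the function name `rmsd` and the tuple-based coordinate format are choices made for this example only.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z) points.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

In practice an optimal superposition is computed first, so this sketch covers only the final averaging step.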

See review of CASP4 at http://www3.interscience.wiley.com/cgi-bin/issuetoc?Type=DD&ID=90010623

Asilomar Results

Homology Modeling

• When sequence homology is > 70%, high resolution models are possible (< 3 Å RMSD).

• Sophisticated energy minimization techniques do not dramatically improve upon initial guess.

• Rigorous criteria applied such as torsion angles, van der Waals violations, RMSD.

Sample Homology Modeling

MODELLER (Sali et al, see course web page)

1. Find homologous proteins with known structure and align

2. Collect distance distributions between atoms in known protein structures

3. Use these distributions to compute positions for equivalent atoms in alignment

4. Refine using energetics


Homology modeling sample. Thick backbone shows known structure. Thin lines show modeled structures. Some sidechains are not positioned correctly, but backbone and other sidechains look quite good.

a. Sidechain mistakes
b. Shifts with correct alignment
c. No template
d. Misalignment
e. Incorrect template

Asilomar Results

Use of sensitive multiple alignment (e.g. PSI-BLAST) techniques helped get best alignments.

Sidechain modeling using libraries of known amino acid conformations. Success ranged from 45% to 80% correct (= angles within 30° of experimental structure).

Energy based refinement still not improving the structures.

PSI BLAST

Extension of BLAST with extra features:

1. Multiple blocks aligned (not just 1)
2. Profile used iteratively to increase sensitivity in picking up distant sequences
– build profile based on initial hits
– use profile to conduct another search
– rebuild profile
– repeat
3. Be careful about repeating too many times… PSI-BLAST DRIFT


PSI BLAST OVERVIEW

SKIP FOLD RECOGNITION AND COME BACK TO IT…

Asilomar Results

Ab Initio Predictions 1° to 2°: (Secondary structure prediction)

Range of accuracy from 66% to 77% (3 state labeling: helix, coil or beta).

Human hand editing improves the accuracy.

Multiple sequence alignments improve the performance of secondary structure prediction.
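The three-state accuracy quoted above can be computed as the fraction of residues whose predicted state matches the observed one. The sketch below assumes one-letter state strings (H, C, B as in the earlier example) of equal length; the function name `q3_accuracy` is illustrative.

```python
def q3_accuracy(predicted, observed):
    """Fraction of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Seven of eight positions agree in this toy example.
print(q3_accuracy("HHHCCBBB", "HHHCCCBB"))
```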

Asilomar Results

Ab Initio Predictions 2° to 3°: (Assemble secondary structures into 3D)

• Sensitive to errors in secondary structure

• Predictors were more likely to predict previously known structures.

Asilomar Results

Ab Initio Predictions 1° to 3°: (Predict 3D from sequence only)

• Predict interresidue contacts and then compute structure (mild success)

• Simplified energy term + reduced search space (phi/psi or lattice) (moderate success)

• Creative ways to memorize sequence <-> structure correlations in short segments from the PDB, and use these to model new structures. ROSETTA Method.

Asilomar Results

Ab Initio Predictions 1° to 3°: Good progress (3 models better than fold recognition results in CASP III)

1. Associate sequence of unknown with known 3D structure library, and then optimize contact frequency of amino acids, as measured in PDB (Baker et al).

2. Generate all folds on lattice and then filter the bad ones out (Samudrala et al)

3. Combine multiple sequence alignment, secondary structure prediction and lattice. (Skolnick et al)


Lattice search

Rosetta Method for ab initio

1. Break target into fragments of 9 amino acids

2. Create profile, X, for target

3. Create profile, S, for similar PDB sequences

4. Align profiles X, S to get rank order list of best match fragments in the PDB

(REF: Simons…Baker, JMB 306: 1191-1199)

Rosetta Method for ab initio

5. Start with extended chain, and evaluate the effect of introducing the fragments into the chain.

6. Use Metropolis-type algorithm for optimization, using following terms:
– hydrophobic burial
– polar side-chain interactions
– hydrogen bonding between beta-strands
– hard sphere repulsion (van der Waals)

7. Create 1000 structures, cluster them.

8. Choose one representative from each cluster as possible prediction…

Use an ellipsoid to be sure that hydrophobic residues are central.
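The Metropolis-type optimization step can be sketched generically as follows. This is not Rosetta code: `metropolis_accept`, `optimize`, `propose`, and `score` are hypothetical names, and `score` stands in for whatever combination of the listed terms is used, with lower values assumed to be better. In the fragment setting, `propose` would replace a stretch of the chain with a fragment conformation.

```python
import math
import random

def metropolis_accept(delta_score, temperature, rng=random):
    """Always accept improvements; accept a worse move with probability exp(-delta/T)."""
    if delta_score <= 0:
        return True
    return rng.random() < math.exp(-delta_score / temperature)

def optimize(initial_state, propose, score, steps, temperature, seed=0):
    """Run a fixed number of Metropolis steps and return the best state seen."""
    rng = random.Random(seed)
    current, current_score = initial_state, score(initial_state)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = propose(current, rng)          # e.g. insert one fragment
        candidate_score = score(candidate)
        if metropolis_accept(candidate_score - current_score, temperature, rng):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score
```

Running `optimize` many times with different seeds would give the population of structures that is then clustered.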


CASP IV Performance

Performance of Rosetta Method

Alexey Murzin (Proteins Volume 45, Issue S5, 2001. Pages: 76-85)

“In 1996, in CASP2, we presented a semimanual approach to the prediction of protein structure that was aimed at the recognition of probable distant homology, where it existed, between a given target protein and a protein of known structure (Murzin and Bateman, [Proteins 1997; Suppl 1:105-112]). Central to our method was the knowledge of all known structural and probable evolutionary relationships among proteins of known structure classified in the SCOP database (Murzin et al., J Mol Biol 1995;247:536-540). It was demonstrated that a knowledge-based approach could compete successfully with the best computational methods of the time in the correct recognition of the target protein fold.”

Murzin prediction CASP IV

[Figure: experimental and predicted structures.]

The computational community responds…

Alexey can’t play!


Asilomar Results

Fold Recognition (check if sequence matches known 3D fold)

CASP1: Of 21 target proteins, 11 wound up having folds that were previously known.
CASP2: Of 22 targets, 15 with available folds
CASP3: Of 43 targets, 36 with available folds
CASP4: Of 56 target domains…hard to say…

• Every predictor does well on something.
• Common folds (more examples) are easier to recognize.
• Fold recognition was the surprise performer at the first competition. Incremental progress at second, third, fourth…

Asilomar Results

Fold Recognition

• Not “all or none.” List of top N hits much better than top hit.

• Common folds easier to recognize.

• Quality of alignments that result is NOT good.

• Potentials include: residue pair contact terms, hydrophobicity, polarity, H-bonds, local structure terms.

• Simple Dynamic Programming with environmental matching sometimes performs as well as sophisticated 3D potentials...

Fold Recognition

New sequence:
MLDTNMKTQLKAYLEKLTKPVELIATLDDSAKSAEIKELL…

Library of known folds:

[Figure: the sequence is tested (?) against four candidate folds; three are rejected (X) and one is accepted (!).]

[Figures: three examples, each labeled N-1 = target, N-2 = fold in PDB.]


Predictors for CASP I are along top row. Target sequences along first column. Dark gray means bad prediction, light gray pretty good, white very good. Hatched means no prediction. Upper left corner shows rank of best answer among list submitted by predictors (also shows fold used to make prediction, shift error and general protein class).

Fold Recognition ~ Threading ~ Inverse Folding

Fold Recognition: given a sequence, and a library of backbones, find the backbone that accommodates the sequence best.

Threading: Given a backbone, find the best way to “mount” the sequence on the backbone (with gaps) to maximize good interactions.

Inverse Folding: (Folding = sequence to 3D). Start with 3D and find a good sequence.

Elements of a fold recognition algorithm

1. Library of protein structures, suitably processed
- All structures
- Representative subset
- Structures with loops removed

2. Scoring function
- contact potential
- environmental evaluation function

3. Method for generating initial alignments and/or searching for better alignments.

Dynamic Programming with Environmental Strings

(The subject of one of the homeworks)

IDEA: Instead of aligning a sequence to a sequence, align a sequence to a string of descriptors that describe the 3D environment of the target structure.

Usual DP, score matrix relates two amino acids:

      A    R    N    D    C    Q
A     2   -2    0    0   -2    0
R    -2    6    0   -1   -4    1
N     0    0    2    2   -4    1
D     0   -1    2    4   -5    2
C    -2   -4   -4   -5   12   -5
Q     0    1    1    2   -5    4

Thread DP, relate AAs to environments in 3D structure:

       E1      E2      E3      E4      E5    ...
A    -0.77   -1.05   -0.54   -0.65   -1.52
R    -1.80   -1.52   -2.35   -0.11   -0.41
N    -1.76   -2.18   -2.61   -0.48   -0.26
D    -2.48   -1.80   -2.63   -0.80   -2.08
C    -0.43   -0.45   -0.59    0.15   -0.72
Q    -1.38   -2.03   -0.84    0.16   -0.79


What are environments?

Conceptually, superimpose multiple structures and look at the statistically conserved features around each 3D xyz position. This may include:

• Is AA buried/partially buried or exposed?
• If buried, how polar is the environment?
• If partially buried, how polar?
• What kind of secondary structures?

(Buried status, polarity and secondary structure)

How do you compute them?

1. Align proteins with similar 3D structure.
2. Align homologous proteins by sequence alone.
3. For each position in protein, identify what environment it is by computing the local properties of interest (e.g. secondary structure, buried, polarity).
4. Count frequencies of different amino acids (within multiple alignment) in different environments.

This creates a MATCH MATRIX. Bowie et al define 18 environments… Another example of position-specific scores.
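The counting step can be sketched as below. The slides do not give the exact scoring formula, so this example assumes a log-odds score, ln(P(aa | environment) / P(aa)), with a pseudocount to avoid taking the log of zero; `build_match_matrix` and the two toy environment labels are illustrative only.

```python
import math
from collections import Counter, defaultdict

def build_match_matrix(observations, pseudocount=1.0):
    """Build {environment: {amino_acid: log-odds score}} from (amino_acid, environment) pairs.

    Assumed score: ln(P(aa | env) / P(aa)), with add-`pseudocount` smoothing
    applied to the per-environment counts.
    """
    observations = list(observations)
    aa_counts = Counter(aa for aa, _ in observations)
    env_counts = defaultdict(Counter)
    for aa, env in observations:
        env_counts[env][aa] += 1
    alphabet = sorted(aa_counts)
    total = sum(aa_counts.values())
    matrix = {}
    for env, counts in env_counts.items():
        env_total = sum(counts.values()) + pseudocount * len(alphabet)
        matrix[env] = {}
        for aa in alphabet:
            p_aa_given_env = (counts[aa] + pseudocount) / env_total
            p_aa = aa_counts[aa] / total
            matrix[env][aa] = math.log(p_aa_given_env / p_aa)
    return matrix

# Toy data: leucine mostly buried, lysine mostly exposed.
toy = ([("L", "buried")] * 8 + [("K", "buried")] * 2
       + [("L", "exposed")] * 2 + [("K", "exposed")] * 8)
match = build_match_matrix(toy)
```

With the toy data, leucine gets a positive score in the buried environment and lysine a negative one, which is the kind of preference the match matrix is meant to capture.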

DP threading Match Matrix

Sample matrix showing alignment of amino acids and environments for globins. Entries indicate possible score for each amino acid at each environmental position, taken from match matrix.

Z-Scores of DP threading for myoglobins, globins and non-globins.

How do you thread a new sequence?

Using standard dynamic programming, use new score matrix to align the sequence of environments from the structure of interest to the sequence of amino acids from unknown sequence.

The highest scoring alignment is the best superposition of the sequence onto the structure.

Using knowledge of scores of sequences with known structure, can see if the score is high enough to put the new sequence in the family.
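A minimal sketch of this threading step, assuming global (Needleman-Wunsch style) alignment with a linear gap penalty. The function name `thread_score`, the gap value of -2, and the two-environment toy matrix are assumptions for illustration; a real match matrix would cover all 20 amino acids and all environment classes.

```python
def thread_score(sequence, environments, match, gap=-2.0):
    """Best global alignment score of a sequence against a string of environments.

    match[env][aa] is the score for placing amino acid `aa` in environment `env`;
    `gap` is a linear penalty for skipping a residue or an environment position.
    """
    n, m = len(sequence), len(environments)
    # dp[i][j] = best score aligning sequence[:i] with environments[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match[environments[j - 1]][sequence[i - 1]]
            dp[i][j] = max(dp[i - 1][j - 1] + pair,   # residue placed in environment
                           dp[i - 1][j] + gap,        # residue left unplaced
                           dp[i][j - 1] + gap)        # environment left empty
    return dp[n][m]

# Toy matrix: B = buried, E = exposed (hypothetical two-class environments).
toy_match = {"B": {"L": 1.0, "K": -1.0}, "E": {"L": -1.0, "K": 1.0}}
print(thread_score("LLKK", ["B", "B", "E", "E"], toy_match))
```

Recovering the alignment itself would need a traceback over `dp`, which is omitted here to keep the sketch short.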


DP Threading

Advantages:

1. Environmental proclivities may be more accurate than simple amino acid similarity:
• structural information
• local context
• potentially, many other features

2. Fast.

3. Pretty good performance (at Asilomar even).

Sample alignment

Net Result:

B1   E2α  B2α  B2α  E2α  B2β  P2β  Eα   Eβ   Eα  ..
His  Asp  Val  Ile  Lys  Ile  Tyr  -    -    Ser ..

DP Threading

Disadvantages

• Requires previous examples to work.

• Resulting match usually needs refinement

• May share some problems of DP in general (independence assumption from column to column, gap penalty choice, etc...)

DP Threading

Disadvantages

• Assumes “average” amino acid preferences over all similar protein-family environments.

• Doesn’t compute the actual environment created by mounting the sequence on the structure.

• Assumes that the environment is relatively constant, and that only amino acid details change. But could have different types of interactions...

Contact Potential Threading

IDEA: Instead of modeling energies from first physical principles, simplify the problem by positioning only amino acids, and compute empirical energies from the observed associations of amino acids.

“GLU is attracted to LYS” = E(glu,lys)

Contact potential threading

Create energy terms between amino acids:

E(interaction) = -KT ln[frequency of interaction]

where K is constant, T is temperature (constant), frequency of interaction measured in database of known structures.

More frequent —> more favorable.
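A small numeric illustration of the relation above, with the product KT folded into a single parameter `kt` in arbitrary units (an assumption made for this example; the function name `interaction_energy` is also illustrative).

```python
import math

def interaction_energy(frequency, kt=1.0):
    """E = -KT ln(frequency), with KT passed as one number in arbitrary units."""
    if frequency <= 0:
        raise ValueError("frequency must be positive")
    return -kt * math.log(frequency)

# A contact seen twice as often gets a lower (more favorable) energy.
rare = interaction_energy(0.1)
common = interaction_energy(0.2)
```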


Contact potential (After Sippl et al.)

More specifically:
a = amino acid type a (ALA, VAL, etc...)
b = amino acid type b
s = separation in sequence

$\Delta E^{ab}_{s}(r) = E^{ab}_{s}(r) - E_{s}(r)$

Energy of interaction between a and b minus average energy at that separation equals the energy difference that contributes to stability.

Contact Potential

$\Delta E^{ab}_{s}(r) = -KT \ln \left[ f^{ab}_{s}(r) / f_{s}(r) \right]$

For any given sequence in 3D, compute distances between all pairs of amino acids (usually up to r = 10-15 Å), and sum.

$\Delta E_{tot} = \sum_{\text{all } a,b \text{ pairs}} \Delta E^{ab}_{s}(r)$
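The summation over pairs can be sketched as follows. The pair energy is passed in as a function so the sketch does not commit to any particular table; `total_contact_energy`, `pair_energy`, the one-point-per-residue coordinates, and the 10 Å default cutoff are assumptions for illustration.

```python
import math

def total_contact_energy(residues, coords, pair_energy, cutoff=10.0):
    """Sum pair_energy(a, b, s, r) over all residue pairs within `cutoff`.

    residues: list of residue type names; coords: one (x, y, z) point per residue;
    s is the separation in sequence and r the distance between the two points.
    """
    total = 0.0
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            r = math.dist(coords[i], coords[j])
            if r <= cutoff:
                total += pair_energy(residues[i], residues[j], j - i, r)
    return total

# Hypothetical toy potential: GLU-LYS contacts favorable, everything else mildly unfavorable.
def toy_pair_energy(a, b, s, r):
    return -1.0 if {a, b} == {"GLU", "LYS"} else 0.5

energy = total_contact_energy(
    ["GLU", "LYS", "ALA"],
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (30.0, 0.0, 0.0)],
    toy_pair_energy,
)
```

Only the GLU-LYS pair falls within the cutoff in this toy case, so it alone contributes to the total.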

Using contact potential

1. Given 3D structure, need to mount the sequence on the structure.
– simple dynamic programming (misses the point)
– other dynamic programming (better)
– exhaustive enumeration (too expensive)
  • recent paper shows that this is NP-hard
– heuristic enumeration: limit on gap lengths, loop lengths (heuristic)

2. Evaluate the contact potential for the alignment.
3. {Optional} Locally optimize the potential score.
4. Compare potential with random shuffle of sequence, and with other sequences to approximate z-score.

Using contact potential

Z-score. Number of standard deviations away from mean. Most meaningful for normal distributions...

[Figure: score distribution with the mean and 2 SD marked.]
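A minimal sketch of the shuffle-based z-score, assuming a scoring function that takes a sequence string and returns a number. `shuffle_z_score`, `score_fn`, and the number of shuffles are illustrative choices. Note the sign convention: for an energy (lower is better) a good threading gives a strongly negative z-score, while for a higher-is-better score it gives a positive one.

```python
import random
import statistics

def shuffle_z_score(sequence, score_fn, n_shuffles=100, seed=0):
    """Z-score of the native sequence's score against shuffled versions of itself."""
    rng = random.Random(seed)
    native = score_fn(sequence)
    shuffled_scores = []
    for _ in range(n_shuffles):
        residues = list(sequence)
        rng.shuffle(residues)                     # same composition, random order
        shuffled_scores.append(score_fn("".join(residues)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (native - mean) / sd

# Toy higher-is-better score: number of positions matching a fixed pattern.
PATTERN = "LLLLLKKKKK"

def toy_score(seq):
    return sum(a == b for a, b in zip(seq, PATTERN))

z = shuffle_z_score("LLLLLKKKKK", toy_score)
```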

Sample threading.

Other uses of contact potentials

• Fold recognition (as discussed here)

• Incorrect fold recognition: detect unlikely or wrong structures (bad predictions, bad contacts, etc.)

• Measure protein stability

• Use for ab initio prediction....


Conclusions

1. Protein fold recognition will get asymptotically better, as we get more folds.

2. Best ab initio methods use knowledge of database, and will thus also improve.

3. Estimates are that we now have between 30% and 50% of folds that occur.

4. Given fold, we need to improve refinement with homology modeling techniques.

Other information

1. http://PredictionCenter.llnl.gov/ points to CASP results and targets.

2. Special journal issues devoted to CASP:
CASP1: Proteins 23(3), 1995
CASP2: Proteins Supplement 1, 1997
CASP3: Nature Structural Biology, Vol 6, No. 2, Feb 1999, page 108.
CASP4: Proteins Vol 45 (S5), 2001.