Protein Structure Prediction Ram Samudrala University of Washington


Rationale for understanding protein structure and function

Protein sequence
- large numbers of sequences, including whole genomes

Protein structure
- three dimensional
- complicated
- mediates function

Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution

[Diagram: protein sequence, protein structure, and protein function, connected by links labelled "structure determination", "structure prediction", "homology", "rational mutagenesis", "biochemical analysis", and "model studies"; one link is marked "?"]


Protein folding

DNA (written here as mRNA codons): …-CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU-…

protein sequence (one amino acid per codon): …-L-K-E-G-V-S-K-D-…

unfolded protein → native state: spontaneous self-organisation (~1 second)

unfolded protein
- not unique
- mobile
- inactive
- expanded
- irregular

native state
- unique shape
- precisely ordered
- stable/functional
- globular/compact
- helices and sheets
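The step from codons to amino acids on this slide can be sketched in a few lines. This is only an illustration: `CODON_TABLE` is a hand-written subset of the standard genetic code covering the codons shown here, and the final codon is taken to be GAU (aspartate) so that the translation ends in D, which is an assumption about the intended sequence.

```python
# Hypothetical helper: a subset of the standard genetic code, limited to the
# codons that appear on this slide (the full table has 64 entries).
CODON_TABLE = {
    "CUA": "L", "AAA": "K", "GAA": "E", "GGU": "G",
    "GUU": "V", "AGC": "S", "AAG": "K", "GAU": "D",
}


def translate(mrna_codons):
    """Translate a list of mRNA codons into one-letter amino acid codes."""
    return "".join(CODON_TABLE[codon] for codon in mrna_codons)


# Assumption: the last codon is GAU so that it encodes D (aspartate).
codons = "CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU".split("-")
peptide = translate(codons)
print("-".join(peptide))  # L-K-E-G-V-S-K-D
```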

Protein folding landscape

Large multi-dimensional space of changing conformations

[Figure: free energy plotted along the folding reaction, with the unfolded state, the molten globule (J = 10^-8 s), a barrier of height G**, and the native state (J = 10^-3 s)]

Protein primary structure

twenty types of amino acids

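[Figure: an amino acid, with a central Cα atom bonded to a hydrogen, a side chain (R), an amino group, and a carboxyl group]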

two amino acids join by forming a peptide bond

[Figure: amino acids linked through peptide bonds into a polypeptide chain, each residue carrying its own side chain (R)]

each residue in the amino acid main chain has two degrees of freedom (φ and ψ)

the amino acid side chains can have up to four degrees of freedom (χ1-χ4)
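Each of these degrees of freedom is a torsion (dihedral) angle defined by four consecutive atoms; for the main chain, φ is conventionally measured over C(i-1), N, Cα, C and ψ over N, Cα, C, N(i+1). A minimal sketch of the calculation follows, using only the standard library. The function name `dihedral` and the test coordinates are illustrative choices, not part of the original slides.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral(p1, p2, p3, p4):
    """Signed torsion angle in degrees about the p2-p3 bond."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n2 = _cross(b2, b3)
    # atan2 form of the torsion angle; gives 0 for cis and 180 for trans.
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(_cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates: four points with a right-angle twist about the z axis.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(f"torsion angle: {angle:.1f} degrees")
```

With two such angles per residue, the main-chain conformation of an N-residue protein is described by roughly 2N numbers rather than by the Cartesian coordinates of every atom.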

Protein secondary structure

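[Figure: Ramachandran plot of the main-chain torsion angles φ and ψ, each axis running from -180 to +180]

many combinations are not possible

[Figures: helix; sheet (anti-parallel); sheet (parallel), with the N and C termini of the strands marked]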

Protein tertiary and quaternary structures

[Figures: Ribonuclease inhibitor (2bnh); Haemoglobin (1hbh); Hemagglutinin (1hgd)]

Methods for determining protein structure

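[Diagram: protein sequence, protein structure, and protein function; the step to protein structure is labelled with X-ray crystallography and NMR spectroscopy, alongside the labels "model studies" and "?"]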


X-ray crystallography - concept

• X-rays interact with electrons in protein molecules arranged in a crystal to produce diffraction patterns

• The diffraction patterns of the x-rays can be used to determine the three-dimensional structure of proteins

• Provides a “static” picture

From <http://info.bio.cmu.edu/courses/03231/LecF01/Lec25/lec25.html>

X-ray crystallography - details

• Prepare protein crystals where the proteins are organised in a precise crystal lattice

• Shine X-rays on the crystals; the X-rays are diffracted by the electrons of the atoms in the crystal, and the intensities of the individual reflections are measured

• Phases are usually obtained indirectly by isomorphous replacement, from the way one or a few heavy atoms incorporated into the same isomorphous crystal lattice affect the diffraction pattern

• Intensities and phases of all reflections are combined in a Fourier transform to provide maps of electron density

• Interpret the map by fitting the polypeptide chain to the contours

• Refine the model by minimising the difference between the observed amplitudes and the calculated amplitudes
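The Fourier step above can be illustrated with a one-dimensional toy crystal. The sketch below is not a crystallographic program: the atom positions and weights in `ATOMS` are invented, and the "unit cell" is a single axis of length 1. It shows that the measured intensities give only |F(h)|², while rebuilding the density needs the complex structure factors, that is, amplitudes and phases together.

```python
import cmath
import math

# Hypothetical 1D "crystal": (fractional position, scattering weight) per atom.
ATOMS = [(0.20, 6.0), (0.55, 8.0)]


def structure_factor(h, atoms=ATOMS):
    """F(h) = sum_j f_j * exp(2*pi*i*h*x_j): one complex number per reflection."""
    return sum(f * cmath.exp(2j * math.pi * h * x) for x, f in atoms)


def density(x, factors):
    """Inverse Fourier sum; uses both the amplitude and the phase of each F(h)."""
    total = sum(F * cmath.exp(-2j * math.pi * h * x) for h, F in factors.items())
    return total.real


h_max = 15
factors = {h: structure_factor(h) for h in range(-h_max, h_max + 1)}

# A detector records intensities only; the phase of each F(h) is lost.
intensities = {h: abs(F) ** 2 for h, F in factors.items()}

# Rebuild the density on a grid and locate its highest peak.
grid = [k / 100 for k in range(100)]
rho = [density(x, factors) for x in grid]
peak = grid[rho.index(max(rho))]
print(f"highest density at x = {peak:.2f}")
```

The highest peak of the reconstructed map falls on the heavier of the two atoms, which is the one-dimensional analogue of fitting a chain to the contours of an electron density map.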

NMR spectroscopy - concept

• The magnetic-spin properties of atomic nuclei within a molecule are used to obtain a list of distance constraints between atoms in the molecule, from which a three-dimensional structure of the protein molecule can be obtained

• Provides a “dynamic” picture

[Figures: NK-lysin (1nkl); S1 RNA binding domain (1sro)]

NMR spectroscopy - details

• Protein molecules placed in a strong magnetic field have the nuclear spins of their hydrogen atoms aligned with the field; the aligned spins can be excited by applying radio frequency (RF) pulses

• Possible to obtain unique signal (chemical shift) for each hydrogen atom in a protein molecule

• Structural information arises primarily from the Nuclear Overhauser Effect (NOE), which gives information about distances between atoms in a molecule

• A pair of protons gives a detectable NOE cross-peak if they are within 5.0 Å of each other in space

• After obtaining NOE data for protons throughout the structure, a number of independent structures can be generated that are consistent with the distance constraints
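Checking a candidate structure against NOE-derived distance constraints reduces to a distance test per proton pair. The sketch below assumes a single upper bound of 5.0 Å for every restraint, as quoted above; the proton names and coordinates are invented for illustration, and real restraint lists usually carry individual bounds.

```python
import math

NOE_CUTOFF = 5.0  # Å, the detection limit quoted above


def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def violated_restraints(coords, restraints, cutoff=NOE_CUTOFF):
    """Return the proton pairs whose distance in this model exceeds the cutoff."""
    return [(i, j) for i, j in restraints
            if distance(coords[i], coords[j]) > cutoff]


# Hypothetical proton coordinates (Å) for two candidate models.
model_a = {"H1": (0.0, 0.0, 0.0), "H2": (3.0, 0.0, 0.0), "H3": (3.0, 3.0, 0.0)}
model_b = {"H1": (0.0, 0.0, 0.0), "H2": (3.0, 0.0, 0.0), "H3": (3.0, 9.0, 0.0)}

# Each observed NOE cross-peak becomes one restraint between two protons.
restraints = [("H1", "H2"), ("H2", "H3"), ("H1", "H3")]

for name, model in (("model_a", model_a), ("model_b", model_b)):
    print(name, "violations:", violated_restraints(model, restraints))
```

A model with no violations is consistent with the data; an NMR ensemble is a set of such models generated independently.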

Computer representation of protein structure

• Structures are stored in the Protein Data Bank (PDB), a repository of mostly experimental models based on X-ray crystallographic and NMR studies

• <http://www.rcsb.org>

• Atoms are defined by their Cartesian coordinates:

ATOM   1  N    GLU  1  18.222  18.496  -16.203  1.00  21.95
ATOM   2  CA   GLU  1  17.706  17.982  -14.905  1.00  16.74
ATOM   3  C    GLU  1  17.368  16.466  -15.121  1.00  15.45
ATOM   4  O    GLU  1  16.780  16.073  -16.175  1.00  18.81
ATOM   5  CB   GLU  1  16.552  18.744  -14.351  1.00  17.35
ATOM   6  CG   GLU  1  16.952  20.118  -13.803  1.00  24.48
ATOM   7  CD   GLU  1  15.881  21.145  -13.597  1.00  31.51
ATOM   8  OE1  GLU  1  16.012  22.316  -13.292  1.00  29.12
ATOM   9  OE2  GLU  1  14.701  20.768  -13.799  1.00  35.19
ATOM  10  N    PHE  2  17.762  15.746  -14.052  1.00  15.83
ATOM  11  CA   PHE  2  17.509  14.262  -14.184  1.00  13.24

• These structures provide the basis for most theoretical work in protein folding and protein structure prediction
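Records like the ones above are straightforward to read programmatically. The sketch below splits each record on whitespace, which matches how the records appear on this slide; real PDB files use fixed column positions and carry extra fields such as a chain identifier, so this is an assumption made for illustration. The `Atom` type and `parse_atom_line` function are hypothetical names.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_line(line):
    """Parse one whitespace-separated ATOM record like those on the slide."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "ATOM":
        raise ValueError(f"not an ATOM record: {line!r}")
    return Atom(int(fields[1]), fields[2], fields[3], int(fields[4]),
                float(fields[5]), float(fields[6]), float(fields[7]))


# Four of the records shown on the slide.
records = """\
ATOM 1 N GLU 1 18.222 18.496 -16.203 1.00 21.95
ATOM 2 CA GLU 1 17.706 17.982 -14.905 1.00 16.74
ATOM 10 N PHE 2 17.762 15.746 -14.052 1.00 15.83
ATOM 11 CA PHE 2 17.509 14.262 -14.184 1.00 13.24
"""

atoms = [parse_atom_line(line) for line in records.splitlines()]
ca_atoms = [atom for atom in atoms if atom.name == "CA"]
for atom in ca_atoms:
    print(atom.residue, atom.residue_number, atom.x, atom.y, atom.z)
```

Selecting the CA records gives the Cα trace, which is the representation most structure comparisons start from.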

Comparison of protein structures

• Need ways to determine if two protein structures are related and to compare predicted models to experimental structures

• Commonly used measure is the root mean square deviation (RMSD) of the Cartesian coordinates of the atoms between two structures after optimal superposition (McLachlan, 1979):

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(dx_i^2 + dy_i^2 + dz_i^2\right)}$$

• Usually use Cα atoms

[Figure: NK-lysin (1nkl), Bacteriocin T102/as48 (1e68), and the T102 best model, with RMSD values of 3.6 Å and 2.9 Å]

• Other measures include contact maps and torsion angle RMSDs
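The RMSD itself is a short calculation once the two structures have been superposed. The sketch below assumes the coordinate lists are already optimally superposed and matched atom for atom; it does not perform the superposition step, and the coordinates are invented.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two matched coordinate lists (Å)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # dx^2 + dy^2 + dz^2 for one pair of equivalent atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical Cα traces: the second is the first shifted by 1 Å along z.
trace_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
trace_b = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]
print(f"RMSD = {rmsd(trace_a, trace_b):.2f} Å")
```

Because every deviation is squared before averaging, a few badly placed atoms can dominate the value, which is one reason contact maps and torsion angle RMSDs are used alongside it.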

Methods for predicting protein structure

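[Diagram: protein sequence, protein structure, and protein function; the step to protein structure is labelled with comparative modelling, fold recognition, and ab initio prediction, alongside the labels "model studies" and "?"]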


Comparative modelling of protein structure

• Proteins that have similar sequences (i.e., related by evolution) have similar three-dimensional structures

• A model of a protein whose structure is not known can be constructed if the structure of a related protein has been determined by experimental methods

• Similarity must be obvious and significant for good models to be built

• Need ways to build regions that are not similar between the two related proteins

• Need ways to move model closer to the native structure

Comparative modelling of protein structure

Alignment of the two sequences (* marks identical residues):

KDHPFGFAVPTKNPDGTMNLMNWECAIP
KDPPAGIGAPQDN----QNIMLWNAVIP
** * *   *  *     * * *   **

Steps: scan → align → build initial model → construct non-conserved side chains and main chains → refine
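Whether the similarity is "obvious and significant" is usually judged first by sequence identity over the alignment. The sketch below counts identical positions for the two aligned sequences shown on this slide. Reporting identity over the aligned (non-gap) columns is one common convention and is an assumption here; other denominators are also in use.

```python
def alignment_identity(seq_a, seq_b, gap="-"):
    """Return (identical positions, aligned non-gap positions) for an alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != gap)
    aligned = sum(1 for a, b in zip(seq_a, seq_b) if a != gap and b != gap)
    return identical, aligned


# The two aligned sequences from the slide.
seq_top = "KDHPFGFAVPTKNPDGTMNLMNWECAIP"
seq_bottom = "KDPPAGIGAPQDN----QNIMLWNAVIP"

identical, aligned = alignment_identity(seq_top, seq_bottom)
percent = 100.0 * identical / aligned
print(f"{identical} identical out of {aligned} aligned positions ({percent:.1f}%)")
```

The four gap columns mark a region that has no counterpart in the related structure, so that stretch has to be built separately.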

Fold recognition

• The number of possible protein structures/folds is limited (large number of sequences but few folds)

• Proteins that do not have similar sequences sometimes have similar three-dimensional structures

• A sequence whose structure is not known is fitted directly (or “threaded”) onto a known structure and the “goodness of fit” is evaluated using a discriminatory function

• Need ways to move model closer to the native structure

[Figure: NK-lysin (1nkl) and Bacteriocin T102/as48 (1e68): 3.6 Å, 5% sequence identity]

Fold recognition

Alignment of the two sequences (* marks identical residues):

KDHPFGFAVPTKNPDGTMNLMNWECAIP
KDPPAGIGAPQDN----QNIMLWNAVIP
** * *   *  *     * * *   **

Steps: evaluate fit → build initial model → construct non-conserved side chains and main chains → refine

Ab initio prediction of protein structure – concept

• Go from sequence to structure by sampling the conformational space in a reasonable manner and selecting a native-like conformation using a good discrimination function

• Problems: conformational space is astronomical, and it is hard to design functions that are not fooled by non-native conformations (or “decoys”)

Ab initio prediction of protein structure

sample conformational space such that native-like conformations are found
- astronomically large number of conformations: 5 states per residue for 100 residues = 5^100 ≈ 10^70

select
- hard to design functions that are not fooled by non-native conformations (“decoys”)
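The size of the search space follows directly from the numbers on this slide. The sketch below reproduces the arithmetic; the sampling rate of 10^12 conformations per second is a hypothetical figure chosen only to give a sense of scale.

```python
import math

states_per_residue = 5
residues = 100

conformations = states_per_residue ** residues         # exact integer
exponent = residues * math.log10(states_per_residue)   # log10 of that count

# Hypothetical sampling rate, used only to convey the scale of the problem.
rate_per_second = 1e12
seconds_per_year = 3600 * 24 * 365
years = conformations / rate_per_second / seconds_per_year

print(f"5^100 is about 10^{exponent:.1f} conformations")
print(f"enumerating them all would take about 10^{math.log10(years):.0f} years")
```

This is why practical methods either sample a small part of the space or reduce it with discrete state models.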

Sampling conformational space – continuous approaches

• Most work in the field
- Molecular dynamics
- Continuous energy minimisation (follow a valley)
- Monte Carlo simulation
- Genetic Algorithms

• Like the real polypeptide folding process

• Cannot be sure if native-like conformations are sampled

[Figure: energy diagram]

Molecular dynamics

• Force = -dU/dx (negative slope of the potential U); the acceleration a(t) follows from m a(t) = force

• All atoms are moving so forces between atoms are complicated functions of time

• Analytical solution for x(t) and v(t) is impossible; numerical solution is trivial

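• Atoms move for very short time steps of 10^-15 seconds, or 0.001 picoseconds (ps)

New position from the old position, the old velocity, and the accelerations:

$$x(t+\Delta t) = x(t) + v(t)\,\Delta t + \left[4a(t) - a(t-\Delta t)\right]\Delta t^2/6$$

New velocity from the old velocity and the accelerations:

$$v(t+\Delta t) = v(t) + \left[2a(t+\Delta t) + 5a(t) - a(t-\Delta t)\right]\Delta t/6$$

$$U_{kinetic} = \tfrac{1}{2}\sum_i m_i v_i(t)^2 = \tfrac{1}{2}\, n\, k_B T$$

where n is the number of coordinates (not atoms)

• Total energy (U_potential + U_kinetic) must not change with time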
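The position and velocity updates above have the form usually called Beeman's algorithm. A minimal sketch for a single coordinate follows, with a unit mass on a spring standing in for the potential U; that test system, the step size, and the choice to reuse a(t) for a(t-Δt) on the very first step are assumptions made for illustration.

```python
import math


def beeman(accel, x0, v0, dt, n_steps):
    """Integrate one coordinate with the position and velocity updates above."""
    x, v = x0, v0
    a = accel(x)
    a_prev = a  # assumption: no earlier step exists, so reuse a(t) for a(t - dt)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        x_new = x + v * dt + (4.0 * a - a_prev) * dt * dt / 6.0
        a_new = accel(x_new)
        v_new = v + (2.0 * a_new + 5.0 * a - a_prev) * dt / 6.0
        x, v, a_prev, a = x_new, v_new, a, a_new
        trajectory.append((x, v))
    return trajectory


# Hypothetical test system: unit mass on a spring, U(x) = 0.5 * k * x^2.
k = 1.0


def harmonic_accel(x):
    return -k * x  # a = F/m = -(dU/dx)/m with m = 1


def total_energy(x, v):
    return 0.5 * k * x * x + 0.5 * v * v  # U_potential + U_kinetic


traj = beeman(harmonic_accel, x0=1.0, v0=0.0, dt=0.01, n_steps=10_000)
energies = [total_energy(x, v) for x, v in traj]
drift = max(energies) - min(energies)
print(f"energy spread over the run: {drift:.2e}")
```

Monitoring the spread of the total energy is a simple check on the last bullet above: if it grows steadily, the time step is too large for the forces involved.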

Energy minimisation

• For a given protein, the energy depends on thousands of x,y,z Cartesian atomic coordinates; reaching a deep minimum is not trivial

• With convergence, we have an accurate equilibrium conformation and a well-defined energy value

[Figure: energy versus number of steps, from the starting conformation to a deep minimum, comparing steepest descent and conjugate gradient]

[Figure: energy and RMSD versus number of steps; the run either converges or gives up]
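Steepest descent, the simpler of the two minimisers named on this slide, repeatedly steps against the gradient until the gradient vanishes (converge) or a step budget runs out (give up). The sketch below uses an invented two-coordinate energy surface in place of a protein force field, and a fixed step size rather than a line search.

```python
def energy(x, y):
    """Hypothetical energy surface with its single minimum at (1, -2)."""
    return (x - 1.0) ** 2 + 10.0 * (y + 2.0) ** 2


def gradient(x, y):
    return 2.0 * (x - 1.0), 20.0 * (y + 2.0)


def steepest_descent(x, y, step=0.04, tol=1e-8, max_steps=10_000):
    """Follow the negative gradient; return (x, y, steps used, converged?)."""
    for n in range(max_steps):
        gx, gy = gradient(x, y)
        if gx * gx + gy * gy < tol * tol:
            return x, y, n, True      # converge
        x -= step * gx
        y -= step * gy
    return x, y, max_steps, False     # give up


x_min, y_min, steps, converged = steepest_descent(5.0, 5.0)
print(f"converged={converged} after {steps} steps, energy={energy(x_min, y_min):.2e}")
```

The surface is much steeper along y than along x, so the fixed step that is safe for y makes progress along x slow; conjugate gradient methods are designed to cope with such valleys in fewer steps.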

Monte Carlo simulation

• Discrete moves in torsion or cartesian conformational space

• Evaluate energy after every move and compare to previous energy (ΔE)

• Accept conformation based on Boltzmann probability:

$$P = \exp\left(-\frac{\Delta E}{kT}\right)$$

• Many variations, including simulated annealing (starting with a high temperature so more moves are accepted initially and then cooling)

• If run for infinite time, simulation will produce a Boltzmann distribution
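The acceptance rule can be sketched with the Metropolis criterion, one common way of applying the Boltzmann probability: downhill moves are always accepted and uphill moves are accepted with probability exp(-ΔE/kT). The one-dimensional double-well energy, the move size, and the seed below are invented for illustration.

```python
import math
import random


def acceptance_probability(delta_e, kT):
    """Metropolis rule: accept downhill moves, uphill ones with exp(-dE/kT)."""
    return 1.0 if delta_e <= 0.0 else math.exp(-delta_e / kT)


def energy(x):
    """Hypothetical double-well energy with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2


def monte_carlo(x, kT, n_moves, max_step=0.5, seed=0):
    """Random-walk sampling; returns the lowest-energy state visited."""
    rng = random.Random(seed)
    e = energy(x)
    best_x, best_e = x, e
    for _ in range(n_moves):
        trial = x + rng.uniform(-max_step, max_step)   # discrete move
        e_trial = energy(trial)
        if rng.random() < acceptance_probability(e_trial - e, kT):
            x, e = trial, e_trial
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


best_x, best_e = monte_carlo(x=0.0, kT=0.3, n_moves=5000)
print(f"lowest energy found: {best_e:.4f} at x = {best_x:.3f}")
```

Simulated annealing would wrap this loop in a schedule that lowers kT over time, so that uphill moves become progressively rarer.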

Genetic Algorithms

• Generate an initial pool of conformations

• Perform crossover and mutation operations on this set to generate a much larger pool of conformations

• Select a subset of the fittest conformations from this large pool

• Repeat the above two steps until convergence
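The loop described above can be sketched as follows. A conformation is represented as a list of torsion angles, and the fitness function is an invented stand-in (lower is better, zero when every angle is -60 degrees), not a real energy function. A fixed number of generations replaces a convergence test, and all pool sizes and rates are arbitrary.

```python
import random


def fitness(angles):
    """Hypothetical score: lower is better, zero when every angle is -60."""
    return sum((a + 60.0) ** 2 for a in angles)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(angles, rng, rate=0.1, spread=30.0):
    """Perturb each angle with probability `rate`."""
    return [a + rng.gauss(0.0, spread) if rng.random() < rate else a
            for a in angles]


def genetic_algorithm(n_angles=10, pool_size=20, offspring=200,
                      generations=50, seed=0):
    rng = random.Random(seed)
    # Generate an initial pool of conformations.
    pool = [[rng.uniform(-180.0, 180.0) for _ in range(n_angles)]
            for _ in range(pool_size)]
    history = [min(fitness(c) for c in pool)]
    for _ in range(generations):
        # Crossover and mutation produce a much larger pool.
        children = [mutate(crossover(rng.choice(pool), rng.choice(pool), rng), rng)
                    for _ in range(offspring)]
        # Select the fittest subset of parents plus children.
        pool = sorted(pool + children, key=fitness)[:pool_size]
        history.append(fitness(pool[0]))
    return pool[0], history


best, history = genetic_algorithm()
print(f"best score: {history[0]:.1f} -> {history[-1]:.1f}")
```

Because parents compete with their children during selection, the best score can never get worse from one generation to the next.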

Sampling conformational space – exhaustive approaches

enumerate all possible conformations: view entire space (perfect partition function)
- computationally intractable: 5 states per residue for 100 residues = 5^100 ≈ 10^70 possible conformations
- must use discrete state models to minimise number of conformations explored

select

Scoring/energy functions

• Need a way to select native-like conformations from non-native ones

• Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms

• Knowledge-based scoring functions: derive information about atomic properties from a database of experimentally determined conformations; common parameters include pairwise atomic distances and amino acid burial/exposure.

Requirements for sampling methods and scoring functions

• Sampling methods must produce good decoy sets that are comprehensive and include several native-like structures

• Scoring function scores must correlate well with RMSD of conformations (the better the score/energy, the lower the RMSD)
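The second requirement can be checked numerically by correlating score with RMSD over a decoy set. The sketch below computes a Pearson correlation coefficient; the five (score, RMSD) pairs are invented numbers used only to show the calculation, not results from any method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical decoy set: (score, RMSD to native in Å). Invented values.
decoys = [(-120.0, 1.5), (-95.0, 3.0), (-80.0, 4.2), (-60.0, 6.1), (-40.0, 9.0)]

scores = [score for score, _ in decoys]
rmsds = [rmsd for _, rmsd in decoys]
r = pearson(scores, rmsds)

# With "lower score is better", the best-scoring decoy should be near native.
best_score, best_rmsd = min(decoys, key=lambda d: d[0])
print(f"correlation r = {r:.2f}; best-scoring decoy has RMSD {best_rmsd} Å")
```

A correlation close to +1 under this sign convention means lower scores go with lower RMSD, which is the behaviour a useful scoring function needs.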

Overview of CASP experiment

• Three categories: comparative/homology modelling, fold recognition/threading, and ab initio prediction

• Goal is to assess structure prediction methods in a blind and rigorous manner; blind prediction is necessary for accurate assessment of methods

• Ask modellers to build models of structures as they are in the process of being solved experimentally

• After prediction season is over, compare predicted models to the experimental structures

• Discuss what went right, what went wrong, and why

• Compare progress from CASP1 to CASP4

• Results published in special issues of Proteins: Structure, Function, Genetics 1995, 1997, 1999, 2002

Comparative modelling at CASP - methods

• Alignment: PSI-BLAST, FASTA, CLUSTALW - multiple sequence alignments carefully hand-edited using secondary structure information

• More successful side chain prediction methods include:
- backbone-dependent rotamer libraries (Bower & Dunbrack)
- segment matching followed by energy minimisation (Levitt)
- self-consistent mean field optimisation (Bates et al)
- graph-theory + knowledge-based functions (Samudrala et al)

• More successful loop building methods include:
- satisfaction of spatial restraints (Sali)
- internal coordinate mechanics energy optimisation (Abagyan et al)
- graph-theory + knowledge-based functions (Samudrala et al)

• Overall model building: there is no substitute for careful hand-constructed models (Sternberg et al, Venclovas)

A graph theoretic representation of protein structure

[Figure: graph-theoretic model building, in four panels]
- represent residues as nodes: I, V1, V2, F, K
- weigh nodes: -0.5 (I), -0.6 (V1), -0.9 (V2), -1.0 (F), -0.7 (K)
- construct graph: edges between nodes carry weights from -0.1 to -0.4
- find cliques: the panel shows the nodes I, V2, F, and K with W = -4.5
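The clique-finding step can be sketched with the basic Bron-Kerbosch algorithm. The node weights below follow the labels in the figure, but the assignment of edge weights to particular pairs of nodes is invented, and V1 and V2 are assumed to be mutually exclusive alternatives that share no edge. The sketch therefore illustrates the technique and does not attempt to reproduce the figure's W = -4.5.

```python
# Node weights as labelled in the figure.
NODE_WEIGHT = {"I": -0.5, "V1": -0.6, "V2": -0.9, "F": -1.0, "K": -0.7}

# Hypothetical edges: which pairs are connected, and with what weight, is an
# assumption made for illustration.
EDGE_WEIGHT = {
    frozenset(("I", "V2")): -0.3, frozenset(("I", "F")): -0.4,
    frozenset(("I", "K")): -0.2, frozenset(("V2", "F")): -0.1,
    frozenset(("V2", "K")): -0.1, frozenset(("F", "K")): -0.1,
    frozenset(("V1", "F")): -0.2, frozenset(("V1", "K")): -0.2,
}


def neighbours(node):
    """All nodes that share an edge with `node`."""
    return {other for edge in EDGE_WEIGHT if node in edge
            for other in edge if other != node}


def maximal_cliques(r, p, x):
    """Bron-Kerbosch without pivoting: yield every maximal clique."""
    if not p and not x:
        yield r
        return
    for node in list(p):
        yield from maximal_cliques(r | {node},
                                   p & neighbours(node),
                                   x & neighbours(node))
        p = p - {node}
        x = x | {node}


def clique_weight(clique):
    """Sum of node weights plus the weights of all edges inside the clique."""
    nodes = sum(NODE_WEIGHT[n] for n in clique)
    edges = sum(w for edge, w in EDGE_WEIGHT.items() if edge <= clique)
    return nodes + edges


cliques = list(maximal_cliques(frozenset(), frozenset(NODE_WEIGHT), frozenset()))
best = min(cliques, key=clique_weight)
print(sorted(best), round(clique_weight(best), 2))
```

Each maximal clique is a set of mutually compatible choices, and the clique with the lowest total weight is taken as the preferred combination.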

Historical perspective on comparative modelling

               BC          CASP1
alignment      excellent   poor
side chain     ~80%        ~50%
short loops    1.0 Å       ~3.0 Å
longer loops   2.0 Å       >5.0 Å

Predictions for CASP4 targets

CASP4: overall model accuracy ranging from 1 Å to 6 Å for 50-10% sequence identity

Cα RMSD (number of residues; percentage sequence identity):
**T128/sodm – 1.0 Å (198 residues; 50%)
**T111/eno – 1.7 Å (430 residues; 51%)
**T122/trpa – 2.9 Å (241 residues; 33%)
**T125/sp18 – 4.4 Å (137 residues; 24%)
**T112/dhso – 4.9 Å (348 residues; 24%)
**T92/yeco – 5.6 Å (104 residues; 12%)

Comparative modelling at CASP - conclusions

               BC          CASP1     CASP2     CASP3     CASP4
alignment      excellent   poor      fair      fair      fair
side chain     ~80%        ~50%      ~75%      ~75%      ~75%
short loops    1.0 Å       ~3.0 Å    ~1.0 Å    ~1.0 Å    ~1.0 Å
longer loops   2.0 Å       >5.0 Å    ~3.0 Å    ~2.5 Å    ~2.0 Å

Fold recognition at CASP - methods

• Visual inspection with sequence comparison (Murzin group)

• Procyon - potential of mean force based on pairwise interactions and global dynamic programming (Sippl group)

• Threader - potential of mean force and double dynamic programming (Jones group)

• Environmental 3D Profiles (Eisenberg group)

• NCBI Threading Program using contact potentials and models of sequence-structure conservation (Bryant group)

• Hidden Markov Models (Karplus group)

• Combination of threading with ab initio approaches (Friesner group)

• Environment-specific substitution tables and structure-dependent gap penalties (Blundell group)

Fold recognition at CASP - conclusions

• Fold recognition is one of the more successful approaches at predicting structure at all four CASPs

• At CASP2 and CASP4, one of the best methods was simple sequence searching with careful manual inspection (Murzin group)

• At CASP3 and CASP4, none of the threading targets could have been recognised by the best standard sequence comparison methods such as PSI-BLAST

• For the most difficult targets, the methods were able to predict 60 residues to 6.0 Å Cα RMSD, approaching comparative modelling accuracies as the similarity between proteins increased.

Ab initio prediction at CASP – methods

• Assembly of fragments with simulated annealing (Simons et al)

• Exhaustive sampling and pruning using knowledge-based scoring functions (Samudrala et al)

• Constraint-based Monte Carlo optimisation (Skolnick et al)

• Thermodynamic model for secondary structure prediction with manual docking of secondary structure elements and minimisation (Lomize et al)

• Minimisation of a physical potential energy function with a simplified representation (Scheraga et al, Osguthorpe et al)

• Neural networks to predict secondary structure (Jones, Rost)

Semi-exhaustive segment-based folding

EFDVILKAAGANKVAVIKAVRGATGLGLKEAKDLVESAPAALKEGVSKDDAEALKKALEEAGAEVEVK

generate: fragments from database; 14-state φ/ψ model

minimise: Monte Carlo with simulated annealing; conformational space annealing, GA

filter: all-atom pairwise interactions, bad contacts; compactness, secondary structure

Historical perspective on ab initio prediction

Before CASP (BC): “solved” (biased results)

CASP1: worse than random

CASP2: worse than random with one exception

CASP3: consistently predicted correct topology - ~6.0 Å for 60+ residues
*T56/dnab – 6.8 Å (60 residues; 67-126)
**T59/smd3 – 6.8 Å (46 residues; 30-75)
**T61/hdea – 7.4 Å (66 residues; 9-74)
**T64/sinr – 4.8 Å (68 residues; 1-68)
*T74/eps15 – 7.0 Å (60 residues; 154-213)
**T75/ets1 – 7.7 Å (77 residues; 55-131)

CASP4: ?

Predictions for CASP4 targets (figures not reproduced): T110/rbfa (Cα RMSD of 4.0 Å for 80 residues; 1-80), T97/er29, T106/sfrp3, T98/sp0a, T126/omp, T114/afp1

Postdiction for CASP4 target T102/as48

Ab initio prediction at CASP - conclusions

Before CASP (BC): “solved” (biased results)

CASP1: worse than random

CASP2: worse than random with one exception

CASP3: consistently predicted correct topology - ~6.0 Å for 60+ residues

CASP4: consistently predicted correct topology - ~4.0-6.0 Å for 60-80+ residues
**T110/rbfa – 4.0 Å (80 residues; 1-80)
*T114/afp1 – 6.5 Å (45 residues; 36-80)
**T97/er29 – 6.0 Å (80 residues; 18-97)
**T106/sfrp3 – 6.2 Å (70 residues; 6-75)
*T98/sp0a – 6.0 Å (60 residues; 37-105)
**T102/as48 – 5.3 Å (70 residues; 1-70)

Computational aspects of structural genomics

[Figure: computational aspects of structural genomics, with panels A. sequence space, B. comparative modelling, C. fold recognition, D. ab initio prediction, E. target selection (targets), and F. analysis]

(Figure idea by Steve Brenner.)

Key points

• DNA/gene is the blueprint - proteins are the functional representatives of genes

• Protein structure can be used to understand protein function

• Large numbers of genes being sequenced - need structures

• Protein folding (from primary sequence to tertiary structure) is a fast self-organising process where a disordered non-functional chain of amino acids becomes a stable, compact, and functional molecule

• The free energy difference between the folded and unfolded states is not very high

• Experimental methods to determine protein structures include X-ray crystallography and NMR spectroscopy

• Theoretical methods to predict protein structures include comparative/homology modelling, fold recognition/threading, and ab initio prediction

• For ab initio prediction, you need a method that samples the conformational space adequately (to find native-like conformations) and a function that can identify them

• CASP experiment shows limited progress in protein structure prediction

Acknowledgements

Michael Levitt, Stanford University
John Moult, CARB
Patrice Koehl, Stanford University
Yu Xia, Stanford University
Levitt and Moult groups

<http://compbio.washington.edu>
