Protein Structure Prediction Ram Samudrala University of Washington


Rationale for understanding protein structure and function

Protein sequence
- large numbers of sequences, including whole genomes

? (structure determination, structure prediction)

Protein structure
- three dimensional
- complicated
- mediates function

homology, rational mutagenesis, biochemical analysis, model studies

Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution

Protein folding

DNA: …-CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU-…

protein sequence (each letter is one amino acid): …-L-K-E-G-V-S-K-D-…

unfolded protein
- not unique
- mobile
- inactive
- expanded
- irregular

spontaneous self-organisation (~1 second)

native state
- unique shape
- precisely ordered
- stable/functional
- globular/compact
- helices and sheets
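
The mapping from codons to the one-letter protein sequence can be sketched in a few lines. This is only an illustration: the table `CODON_TO_AA` and the function `translate` are hypothetical names, and the table covers just the eight codons needed for this example (taken from the standard genetic code) rather than all 64.

```python
# Partial codon table: only the codons used in this example
# (assignments follow the standard genetic code).
CODON_TO_AA = {
    "CUA": "L", "AAA": "K", "GAA": "E", "GGU": "G",
    "GUU": "V", "AGC": "S", "AAG": "K", "GAU": "D",
}

def translate(codons: str) -> str:
    """Translate a dash-separated codon string into one-letter amino acid codes."""
    return "-".join(CODON_TO_AA[codon] for codon in codons.split("-"))

protein = translate("CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU")
print(protein)
```

Each codon contributes exactly one amino acid, so the protein string has as many letters as the input has codons.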

Protein folding landscape

Large multi-dimensional space of changing conformations

[Figure: free energy versus folding reaction, showing the unfolded, molten globule, and native states; labels: J = 10^-8 s, J = 10^-3 s, G**, barrier height]

Protein primary structure

twenty types of amino acids

[Figure: chemical structure of an amino acid: a central Cα bonded to H, the side chain R, the amino group, and the carboxyl group]

two amino acids join by forming a peptide bond

[Figure: chemical structures of two amino acids joining into a polypeptide chain through peptide bonds]

each residue in the amino acid main chain has two degrees of freedom (φ and ψ)

the amino acid side chains can have up to four degrees of freedom (χ1-4)

Protein secondary structure

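[Figure: plot of the main-chain torsion angles φ and ψ, both axes running from -180 to +180 through 0, with a region labelled L; many combinations are not possible]

[Figure: helix; sheet (anti-parallel); sheet (parallel), with N and C termini marked]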

Protein tertiary and quaternary structures

[Figure: Ribonuclease inhibitor (2bnh); Haemoglobin (1hbh); Hemagglutinin (1hgd)]

Methods for determining protein structure

Protein sequence
- large numbers of sequences, including whole genomes

? (X-ray crystallography, NMR spectroscopy)

Protein structure
- three dimensional
- complicated
- mediates function

homology, rational mutagenesis, biochemical analysis, model studies

Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution

X-ray crystallography - concept

• X-rays interact with electrons in protein molecules arranged in a crystal to produce diffraction patterns

• The diffraction patterns of the x-rays can be used to determine the three-dimensional structure of proteins

• Provides a “static” picture

From <http://info.bio.cmu.edu/courses/03231/LecF01/Lec25/lec25.html>

X-ray crystallography - details

• Prepare protein crystals where the proteins are organised in a precise crystal lattice

• Shine X-rays on the crystals; the X-rays diffract off the electrons of the atoms in the crystals, and the intensities of the individual reflections are measured

• Phases are usually obtained indirectly by isomorphous replacement, from the way one or a few heavy atoms incorporated into the same isomorphous crystal lattice affect the diffraction pattern

• Intensities and phases of all reflections are combined in a Fourier transform to provide maps of electron density

• Interpret the map by fitting the polypeptide chain to the contours

• Refine the model by minimising the difference between the observed amplitudes and the calculated amplitudes

NMR spectroscopy - concept

• The magnetic-spin properties of atomic nuclei within a molecule are used to obtain a list of distance constraints between atoms in the molecule, from which a three-dimensional structure of the protein molecule can be obtained

• Provides a “dynamic” picture

[Figure: NK-lysin (1nkl); S1 RNA binding domain (1sro)]

NMR spectroscopy - details

• Protein molecules placed in a strong magnetic field have their hydrogen atoms aligned to the field; the alignment can be excited by applying radio frequency (RF) pulses

• Possible to obtain unique signal (chemical shift) for each hydrogen atom in a protein molecule

• Structural information arises primarily from the Nuclear Overhauser Effect (NOE), which gives information about distances between atoms in a molecule

• A pair of protons give a detectable NOE cross-peak if they are within 5.0 Å of each other in space

• After obtaining NOE data for protons throughout the structure, a number of independent structures can be generated that are consistent with the distance constraints

Computer representation of protein structure

• Structures are stored in the protein data bank (PDB), a repository of mostly experimental models based on X-ray crystallographic and NMR studies

• <http://www.rcsb.org>

• Atoms are defined by their Cartesian coordinates:

ATOM   1  N    GLU  1   18.222  18.496  -16.203  1.00  21.95
ATOM   2  CA   GLU  1   17.706  17.982  -14.905  1.00  16.74
ATOM   3  C    GLU  1   17.368  16.466  -15.121  1.00  15.45
ATOM   4  O    GLU  1   16.780  16.073  -16.175  1.00  18.81
ATOM   5  CB   GLU  1   16.552  18.744  -14.351  1.00  17.35
ATOM   6  CG   GLU  1   16.952  20.118  -13.803  1.00  24.48
ATOM   7  CD   GLU  1   15.881  21.145  -13.597  1.00  31.51
ATOM   8  OE1  GLU  1   16.012  22.316  -13.292  1.00  29.12
ATOM   9  OE2  GLU  1   14.701  20.768  -13.799  1.00  35.19
ATOM  10  N    PHE  2   17.762  15.746  -14.052  1.00  15.83
ATOM  11  CA   PHE  2   17.509  14.262  -14.184  1.00  13.24

• These structures provide the basis for most of the theoretical work in protein folding and protein structure prediction
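
Reading such records into a program can be sketched as follows. This is a minimal illustration that assumes the simplified, whitespace-separated listing used on this slide (ten fields, no chain identifier); real PDB files use fixed column positions, so a production parser would slice by column instead. The names `Atom` and `parse_atom_line` are hypothetical, and the last two fields are interpreted as occupancy and temperature factor, as in the standard ATOM record.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int       # atom serial number
    name: str         # atom name, e.g. "CA" for the alpha carbon
    res_name: str     # three-letter residue name
    res_seq: int      # residue sequence number
    x: float          # Cartesian coordinates in angstroms
    y: float
    z: float
    occupancy: float
    b_factor: float   # temperature factor

def parse_atom_line(line: str) -> Atom:
    """Parse one whitespace-separated ATOM line (assumed simplified layout)."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "ATOM":
        raise ValueError(f"not a simplified ATOM line: {line!r}")
    return Atom(
        int(fields[1]), fields[2], fields[3], int(fields[4]),
        float(fields[5]), float(fields[6]), float(fields[7]),
        float(fields[8]), float(fields[9]),
    )

atom = parse_atom_line("ATOM 2 CA GLU 1 17.706 17.982 -14.905 1.00 16.74")
print(atom.name, atom.x, atom.y, atom.z)
```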

Comparison of protein structures

• Need ways to determine if two protein structures are related and to compare predicted models to experimental structures

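• Commonly used measure is the root mean square deviation (RMSD) of the Cartesian coordinates of equivalent atoms between two structures after optimal superposition (McLachlan, 1979):

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(dx_i^2 + dy_i^2 + dz_i^2\right)}$$

where N is the number of atom pairs and dx_i, dy_i, dz_i are the coordinate differences for pair i

• Usually use Cα atoms

[Figure: NK-lysin (1nkl), Bacteriocin T102/as48 (1e68), and the T102 best model; labels: 3.6 Å, 2.9 Å]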

• Other measures include contact maps and torsion angle RMSDs
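
The RMSD measure can be sketched directly from its definition. This is a minimal illustration that assumes the two coordinate sets are already optimally superimposed and listed in the same atom order; it does not perform the superposition step itself. The function name `rmsd` is hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) tuples.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Two atoms; the second differs by 2 angstroms along z.
value = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
             [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)])
print(round(value, 3))
```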

Methods for predicting protein structure

Protein sequence
- large numbers of sequences, including whole genomes

? (comparative modelling, fold recognition, ab initio prediction)

Protein structure
- three dimensional
- complicated
- mediates function

homology, rational mutagenesis, biochemical analysis, model studies

Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution

Comparative modelling of protein structure

• Proteins that have similar sequences (i.e., related by evolution) have similar three-dimensional structures

• A model of a protein whose structure is not known can be constructed if the structure of a related protein has been determined by experimental methods

• Similarity must be obvious and significant for good models to be built

• Need ways to build regions that are not similar between the two related proteins

• Need ways to move model closer to the native structure

Comparative modelling of protein structure

KDHPFGFAVPTKNPDGTMNLMNWECAIP
KDPPAGIGAPQDN----QNIMLWNAVIP
** * *   *  *     * * *   **

… …

scan

align

build initial model

construct non-conserved side chains and main chains

refine
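
How similar two aligned sequences are can be quantified as percent identity. The sketch below is an illustration only: the function name `percent_identity` is hypothetical, and dividing by the number of columns without gaps is one assumed convention among several in use (others divide by the alignment length or by the shorter sequence length).

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-"):
    """Return (identical positions, percent identity) for two aligned sequences.

    Assumed convention: columns containing a gap are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    identical = sum(1 for a, b in columns if a == b)
    return identical, 100.0 * identical / len(columns)

n_identical, pid = percent_identity(
    "KDHPFGFAVPTKNPDGTMNLMNWECAIP",
    "KDPPAGIGAPQDN----QNIMLWNAVIP",
)
print(n_identical, round(pid, 1))
```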

Fold recognition

• The number of possible protein structures/folds is limited (large number of sequences but few folds)

• Proteins that do not have similar sequences sometimes have similar three-dimensional structures

• A sequence whose structure is not known is fitted directly (or “threaded”) onto a known structure and the “goodness of fit” is evaluated using a discriminatory function

• Need ways to move model closer to the native structure

[Figure: NK-lysin (1nkl) and Bacteriocin T102/as48 (1e68); 3.6 Å, 5% ID]

Fold recognition

KDHPFGFAVPTKNPDGTMNLMNWECAIP
KDPPAGIGAPQDN----QNIMLWNAVIP
** * *   *  *     * * *   **

… …

evaluate fit

build initial model

construct non-conserved side chains and main chains

refine

Ab initio prediction of protein structure – concept

• Go from sequence to structure by sampling the conformational space in a reasonable manner and selecting a native-like conformation using a good discrimination function

• Problems: conformational space is astronomical, and it is hard to design functions that are not fooled by non-native conformations (or “decoys”)

Ab initio prediction of protein structure

sample conformational space such that native-like conformations are found

astronomically large number of conformations: 5 states per residue, 100 residues = $5^{100} \approx 10^{70}$

select

hard to design functions that are not fooled by non-native conformations (“decoys”)

Sampling conformational space – continuous approaches

• Most work in the field
- Molecular dynamics
- Continuous energy minimisation (follow a valley)
- Monte Carlo simulation
- Genetic Algorithms

• Like real polypeptide folding process

• Cannot be sure if native-like conformations are sampled

energy

Molecular dynamics

• Force = -dU/dx (slope of potential U); the acceleration a(t) follows from m a(t) = force

• All atoms are moving so forces between atoms are complicated functions of time

• Analytical solution for x(t) and v(t) is impossible; numerical solution is trivial

• Atoms move for very short times of $10^{-15}$ seconds or 0.001 picoseconds (ps)

New position from the old position, the old velocity, and the accelerations:

$$x(t+\Delta t) = x(t) + v(t)\,\Delta t + \left[4a(t) - a(t-\Delta t)\right]\Delta t^2/6$$

New velocity from the old velocity and the accelerations:

$$v(t+\Delta t) = v(t) + \left[2a(t+\Delta t) + 5a(t) - a(t-\Delta t)\right]\Delta t/6$$

$$U_{kinetic} = \tfrac{1}{2}\sum_i m_i v_i(t)^2 = \tfrac{1}{2}\, n\, k_B T$$

where n is the number of coordinates (not atoms)

• Total energy ($U_{potential} + U_{kinetic}$) must not change with time
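
The position and velocity updates above can be tried on a toy system. The sketch below is only an illustration: it integrates a one-dimensional harmonic oscillator with unit mass and unit force constant (an assumed test case, not a protein force field), and it starts the recursion by assuming a(t - Δt) equals a(t) at the first step. The function name `integrate` is hypothetical.

```python
def integrate(accel, x, v, dt, n_steps):
    """Propagate one coordinate with the update rules
    x(t+dt) = x + v*dt + [4a(t) - a(t-dt)]*dt^2/6 and
    v(t+dt) = v + [2a(t+dt) + 5a(t) - a(t-dt)]*dt/6.
    """
    a = accel(x)
    a_prev = a  # assumption: no earlier step exists, so reuse a(t)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        x_new = x + v * dt + (4.0 * a - a_prev) * dt * dt / 6.0
        a_new = accel(x_new)
        v_new = v + (2.0 * a_new + 5.0 * a - a_prev) * dt / 6.0
        x, v, a_prev, a = x_new, v_new, a, a_new
        trajectory.append((x, v))
    return trajectory

# Toy potential U = x^2 / 2 with m = 1, so a(x) = -x.
traj = integrate(lambda x: -x, x=1.0, v=0.0, dt=0.01, n_steps=1000)
energies = [0.5 * v * v + 0.5 * x * x for x, v in traj]
print(min(energies), max(energies))
```

The printed minimum and maximum of the total energy should stay close to the starting value of 0.5, which is the conservation check described in the last bullet.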

Energy minimisation

• For a given protein, the energy depends on thousands of x,y,z Cartesian atomic coordinates; reaching a deep minimum is not trivial

• With convergence, we have an accurate equilibrium conformation and a well-defined energy value

[Figure: energy versus number of steps, from the starting conformation to a deep minimum, comparing steepest descent and conjugate gradient]

[Figure: energy and RMSD versus number of steps, with the outcomes "converge" and "give up"]
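
Steepest descent, the simpler of the two minimisers named in the figure, can be sketched on a toy energy surface. This is an illustration under stated assumptions: the two-coordinate quadratic energy, the adaptive step rule (grow the step after a successful move, halve it after a failed one), and the names `steepest_descent`, `toy_energy`, and `toy_gradient` are all hypothetical, not the method of any particular package.

```python
def steepest_descent(energy, gradient, start, step=0.1, tol=1e-6, max_steps=10000):
    """Follow the negative gradient until its largest component is below tol.

    Returns (coords, energy, steps_used, converged).
    """
    coords = list(start)
    e = energy(coords)
    for n in range(max_steps):
        g = gradient(coords)
        if max(abs(gi) for gi in g) < tol:
            return coords, e, n, True          # converge
        trial = [c - step * gi for c, gi in zip(coords, g)]
        e_trial = energy(trial)
        if e_trial < e:
            coords, e = trial, e_trial         # accept the downhill move
            step *= 1.2
        else:
            step *= 0.5                        # overshoot: retry with a smaller step
    return coords, e, max_steps, False         # give up

def toy_energy(c):
    return (c[0] - 1.0) ** 2 + 10.0 * (c[1] + 2.0) ** 2

def toy_gradient(c):
    return [2.0 * (c[0] - 1.0), 20.0 * (c[1] + 2.0)]

coords, e_min, n_used, converged = steepest_descent(toy_energy, toy_gradient, [0.0, 0.0])
print(coords, e_min, n_used, converged)
```

For a protein the coordinate list would hold thousands of x, y, z values rather than two, which is why reaching a deep minimum is not trivial.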

Monte Carlo simulation

• Discrete moves in torsion or cartesian conformational space

• Evaluate energy after every move and compare to previous energy (ΔE)

• Accept conformation based on Boltzmann probability: $P = \exp\left(-\Delta E / kT\right)$

• Many variations, including simulated annealing (starting with a high temperature so more moves are accepted initially and then cooling)

• If run for infinite time, simulation will produce a Boltzmann distribution
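
The acceptance rule and the annealing variation can be sketched together. This is an illustration only: it uses the usual Metropolis convention of always accepting downhill moves and accepting uphill moves with probability exp(-ΔE/kT). The one-dimensional double-well energy, the geometric cooling schedule, and the names `metropolis_accept`, `simulated_annealing`, and `double_well` are assumptions made for the example.

```python
import math
import random

def metropolis_accept(delta_e, kT, rng):
    """Accept downhill moves always, uphill moves with probability exp(-dE/kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)

def simulated_annealing(energy, x0, kT_start=2.0, kT_end=0.01,
                        n_steps=5000, max_move=0.5, seed=0):
    """Random moves in one coordinate while kT is lowered geometrically."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for step in range(n_steps):
        kT = kT_start * (kT_end / kT_start) ** (step / (n_steps - 1))
        trial = x + rng.uniform(-max_move, max_move)
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e, kT, rng):
            x, e = trial, e_trial
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

def double_well(x):
    # Two minima near x = -1 and x = +1; the left one is slightly lower.
    return (x * x - 1.0) ** 2 + 0.3 * x

best_x, best_e = simulated_annealing(double_well, x0=2.0)
print(best_x, best_e)
```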

Genetic Algorithms

• Generate an initial pool of conformations

• Perform crossover and mutation operations on this set to generate a much larger pool of conformations

• Select a subset of the fittest conformations from this large pool

• Repeat above two steps until convergence
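
The loop described above can be sketched with a toy representation. This is an illustration under stated assumptions: a conformation is a list of discrete states (one per residue), the score is a made-up distance to a hypothetical target list, parents are kept in the selection pool, and a fixed number of generations stands in for a real convergence test. The names `genetic_search`, `target`, and `score` are hypothetical.

```python
import random

def genetic_search(score, n_genes, n_states=5, pool_size=20,
                   n_generations=50, seed=0):
    """Minimise score() over lists of n_genes integers in range(n_states)."""
    rng = random.Random(seed)
    # Generate an initial pool of conformations.
    pool = [[rng.randrange(n_states) for _ in range(n_genes)]
            for _ in range(pool_size)]
    for _ in range(n_generations):
        # Crossover and mutation produce a much larger pool.
        children = []
        for _ in range(4 * pool_size):
            a, b = rng.sample(pool, 2)
            cut = rng.randrange(1, n_genes)
            child = a[:cut] + b[cut:]
            if rng.random() < 0.3:
                child[rng.randrange(n_genes)] = rng.randrange(n_states)
            children.append(child)
        # Select the fittest subset (lowest score) from parents and children.
        pool = sorted(pool + children, key=score)[:pool_size]
    return pool[0]

# Made-up score: number of positions that differ from a hypothetical target.
target = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

def score(conformation):
    return sum(1 for c, t in zip(conformation, target) if c != t)

best = genetic_search(score, n_genes=len(target))
print(best, score(best))
```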

Sampling conformational space – exhaustive approaches

enumerate all possible conformations

view entire space (perfect partition function)

computationally intractable: 5 states per residue, 100 residues = $5^{100} \approx 10^{70}$ possible conformations

select

must use discrete state models to minimise number of conformations explored
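
The size of the search space, and what exhaustive enumeration looks like for a chain short enough to handle, can be sketched as follows. This is an illustration only; the function names are hypothetical and each conformation is simply a tuple of state indices, one per residue.

```python
import math
from itertools import product

def count_conformations(n_states: int, n_residues: int) -> int:
    """Number of conformations for a discrete state model."""
    return n_states ** n_residues

def enumerate_conformations(n_states: int, n_residues: int):
    """Lazily yield every assignment of one state per residue."""
    return product(range(n_states), repeat=n_residues)

# 5 states per residue and 100 residues: report the exponent of 10.
exponent = math.log10(count_conformations(5, 100))
print(round(exponent, 1))

# A 4-residue chain is small enough to enumerate completely.
small = list(enumerate_conformations(5, 4))
print(len(small), small[0], small[-1])
```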

Scoring/energy functions

• Need a way to select native-like conformations from non-native ones

• Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms

• Knowledge-based scoring functions: derive information about atomic properties from a database of experimentally determined conformations; common parameters include pairwise atomic distances and amino acid burial/exposure.

Requirements for sampling methods and scoring functions

• Sampling methods must produce good decoy sets that are comprehensive and include several native-like structures

• Scoring function scores must correlate well with RMSD of conformations (the better the score/energy, the lower the RMSD)

Overview of CASP experiment

• Three categories: comparative/homology modelling, fold recognition/threading, and ab initio prediction

• Goal is to assess structure prediction methods in a blind and rigorous manner; blind prediction is necessary for accurate assessment of methods

• Ask modellers to build models of structures as they are in the process of being solved experimentally

• After prediction season is over, compare predicted models to the experimental structures

• Discuss what went right, what went wrong, and why

• Compare progress from CASP1 to CASP4

• Results published in special issues of Proteins: Structure, Function, Genetics 1995, 1997, 1999, 2002

Comparative modelling at CASP - methods

• Alignment: PSI-BLAST, FASTA, CLUSTALW - multiple sequence alignments carefully hand-edited using secondary structure information

• More successful side chain prediction methods include:
- backbone-dependent rotamer libraries (Bower & Dunbrack)
- segment matching followed by energy minimisation (Levitt)
- self-consistent mean field optimisation (Bates et al)
- graph-theory + knowledge-based functions (Samudrala et al)

• More successful loop building methods include:
- satisfaction of spatial restraints (Sali)
- internal coordinate mechanics energy optimisation (Abagyan et al)
- graph-theory + knowledge-based functions (Samudrala et al)

• Overall model building: there is no substitute for careful hand-constructed models (Sternberg et al, Venclovas)

A graph theoretic representation of protein structure

[Figure: residues drawn as a weighted graph. Steps labelled: represent residues as nodes; weigh nodes; construct graph; find cliques. Node weights: -0.5 (I), -0.6 (V1), -0.9 (V2), -1.0 (F), -0.7 (K); edge weights between -0.1 and -0.4; clique weight W = -4.5]
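
Finding the best-scoring clique in a small weighted graph can be sketched by brute force. This is an illustration only, not the clique-finding method used in the work cited on this slide. The node weights are the ones shown in the figure, but the edge weights and the choice of which nodes are connected are hypothetical (V1 and V2 are assumed to be alternatives for the same residue, so no edge joins them), and the result is therefore not expected to reproduce the W value in the figure.

```python
from itertools import combinations

def best_clique(node_w, edge_w):
    """Return (members, weight) of the clique with the lowest total weight.

    Total weight = sum of node weights + sum of edge weights inside the clique.
    Brute force over all subsets, so only suitable for very small graphs.
    """
    def edge(a, b):
        return edge_w.get((a, b), edge_w.get((b, a)))

    names = sorted(node_w)
    best, best_weight = (), 0.0
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            pairs = list(combinations(subset, 2))
            if any(edge(a, b) is None for a, b in pairs):
                continue  # some pair is not connected: not a clique
            weight = (sum(node_w[n] for n in subset)
                      + sum(edge(a, b) for a, b in pairs))
            if weight < best_weight:
                best, best_weight = subset, weight
    return best, best_weight

node_w = {"I": -0.5, "V1": -0.6, "V2": -0.9, "F": -1.0, "K": -0.7}
edge_w = {  # hypothetical edge weights
    ("I", "V2"): -0.3, ("I", "F"): -0.4, ("I", "K"): -0.1,
    ("V2", "F"): -0.2, ("V2", "K"): -0.1, ("F", "K"): -0.1,
    ("V1", "F"): -0.2, ("V1", "K"): -0.2,
}

members, weight = best_clique(node_w, edge_w)
print(members, round(weight, 2))
```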

Historical perspective on comparative modelling

                BC          CASP1
alignment       excellent   poor
side chain      ~80%        ~50%
short loops     1.0 Å       ~3.0 Å
longer loops    2.0 Å       >5.0 Å

Prediction for CASP4 target T128/sodm

Cα RMSD of 1.0 Å for 198 residues (PID 50%)

Prediction for CASP4 target T111/eno

Cα RMSD of 1.7 Å for 430 residues (PID 51%)

Prediction for CASP4 target T122/trpa

Cα RMSD of 2.9 Å for 241 residues (PID 33%)

Prediction for CASP4 target T125/sp18

Cα RMSD of 4.4 Å for 137 residues (PID 24%)

Prediction for CASP4 target T112/dhso

Cα RMSD of 4.9 Å for 348 residues (PID 24%)

Prediction for CASP4 target T92/yeco

Cα RMSD of 5.6 Å for 104 residues (PID 12%)

CASP4: overall model accuracy ranging from 1 Å to 6 Å for 50-10% sequence identity

**T128/sodm – 1.0 Å (198 residues; 50%)
**T111/eno – 1.7 Å (430 residues; 51%)
**T122/trpa – 2.9 Å (241 residues; 33%)
**T125/sp18 – 4.4 Å (137 residues; 24%)
**T112/dhso – 4.9 Å (348 residues; 24%)
**T92/yeco – 5.6 Å (104 residues; 12%)

Comparative modelling at CASP - conclusions

                BC          CASP1     CASP2     CASP3     CASP4
alignment       excellent   poor      fair      fair      fair
side chain      ~80%        ~50%      ~75%      ~75%      ~75%
short loops     1.0 Å       ~3.0 Å    ~1.0 Å    ~1.0 Å    ~1.0 Å
longer loops    2.0 Å       >5.0 Å    ~3.0 Å    ~2.5 Å    ~2.0 Å

Fold recognition at CASP - methods

• Visual inspection with sequence comparison (Murzin group)

• Procyon - potential of mean force based on pairwise interactions and global dynamic programming (Sippl group)

• Threader - potential of mean force and double dynamic programming (Jones group)

• Environmental 3D Profiles (Eisenberg group)

• NCBI Threading Program using contact potentials and models of sequence-structure conservation (Bryant group)

• Hidden Markov Models (Karplus group)

• Combination of threading with ab initio approaches (Friesner group)

• Environment-specific substitution tables and structure-dependent gap penalties (Blundell group)

Fold recognition at CASP - conclusions

• Fold recognition is one of the more successful approaches at predicting structure at all four CASPs 

• At CASP2 and CASP4, one of the best methods was simple sequence searching with careful manual inspection (Murzin group)

• At CASP3 and CASP4, none of the threading targets could have been recognised by the best standard sequence comparison methods such as PSI-BLAST

• For the most difficult targets, the methods were able to predict 60 residues to 6.0 Å Cα RMSD, approaching comparative modelling accuracies as the similarity between proteins increased.

Ab initio prediction at CASP – methods

• Assembly of fragments with simulated annealing (Simons et al)

• Exhaustive sampling and pruning using knowledge-based scoring functions (Samudrala et al)

• Constraint-based Monte Carlo optimisation (Skolnick et al)

• Thermodynamic model for secondary structure prediction with manual docking of secondary structure elements and minimisation (Lomize et al)

• Minimisation of a physical potential energy function with a simplified representation (Scheraga et al, Osguthorpe et al)

• Neural networks to predict secondary structure (Jones, Rost)

Semi-exhaustive segment-based folding

EFDVILKAAGANKVAVIKAVRGATGLGLKEAKDLVESAPAALKEGVSKDDAEALKKALEEAGAEVEVK

generate: fragments from database, 14-state φ,ψ model

… …

minimise: monte carlo with simulated annealing, conformational space annealing, GA

… …

filter: all-atom pairwise interactions, bad contacts, compactness, secondary structure

Historical perspective on ab initio prediction

Before CASP (BC): “solved” (biased results)

CASP1: worse than random

CASP2: worse than random with one exception

CASP3: consistently predicted correct topology - ~ 6.0 Å for 60+ residues

*T56/dnab – 6.8 Å (60 residues; 67-126)
**T59/smd3 – 6.8 Å (46 residues; 30-75)
**T61/hdea – 7.4 Å (66 residues; 9-74)
**T64/sinr – 4.8 Å (68 residues; 1-68)
*T74/eps15 – 7.0 Å (60 residues; 154-213)
**T75/ets1 – 7.7 Å (77 residues; 55-131)

CASP4: ?

Prediction for CASP4 target T110/rbfa

Cα RMSD of 4.0 Å for 80 residues (1-80)

Prediction for CASP4 target T97/er29

Cα RMSD of 6.2 Å for 80 residues (18-97)

Prediction for CASP4 target T106/sfrp3

Cα RMSD of 6.2 Å for 70 residues (6-75)

Prediction for CASP4 target T98/sp0a

Cα RMSD of 6.0 Å for 60 residues (37-105)

Prediction for CASP4 target T126/omp

Cα RMSD of 6.5 Å for 60 residues (87-146)

Prediction for CASP4 target T114/afp1

Cα RMSD of 6.5 Å for 45 residues (36-80)

Postdiction for CASP4 target T102/as48

Cα RMSD of 5.3 Å for 70 residues (1-70)

Ab initio prediction at CASP - conclusions

Before CASP (BC): “solved” (biased results)

CASP1: worse than random

CASP2: worse than random with one exception

CASP3: consistently predicted correct topology - ~ 6.0 Å for 60+ residues

CASP4: consistently predicted correct topology - ~ 4-6.0 Å for 60-80+ residues

**T110/rbfa – 4.0 Å (80 residues; 1-80)
*T114/afp1 – 6.5 Å (45 residues; 36-80)
**T97/er29 – 6.0 Å (80 residues; 18-97)
**T106/sfrp3 – 6.2 Å (70 residues; 6-75)
*T98/sp0a – 6.0 Å (60 residues; 37-105)
**T102/as48 – 5.3 Å (70 residues; 1-70)

Computational aspects of structural genomics

[Figure: A. sequence space; B. comparative modelling; C. fold recognition; D. ab initio prediction; E. target selection (targets); F. analysis]

(Figure idea by Steve Brenner.)

Key points

• DNA/gene is the blueprint - proteins are the functional representatives of genes

• Protein structure can be used to understand protein function

• Large numbers of genes being sequenced - need structures

• Protein folding (from primary sequence to tertiary structure) is a fast self-organising process where a disordered non-functional chain of amino acids becomes a stable, compact, and functional molecule 

• The free energy difference between the folded and unfolded states is not very high

• Experimental methods to determine protein structures include X-ray crystallography and NMR spectroscopy

• Theoretical methods to predict protein structures include comparative/homology modelling, fold recognition/threading, and ab initio prediction

• For ab initio prediction, you need a method that samples the conformational space adequately (to find native-like conformations) and a function that can identify them 

• CASP experiment shows limited progress in protein structure prediction

Acknowledgements

Michael Levitt, Stanford University
John Moult, CARB
Patrice Koehl, Stanford University
Yu Xia, Stanford University
Levitt and Moult groups

<http://compbio.washington.edu>