Rationale for understanding protein structure and function
Protein sequence
- large numbers of sequences, including whole genomes
Protein structure
- three dimensional
- complicated
- mediates function
Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution
[Diagram labels linking the three boxes: "?"; structure determination, structure prediction; homology, rational mutagenesis, biochemical analysis, model studies]
Protein folding
DNA (shown here as mRNA codons, one codon per amino acid): …-CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU-…
protein sequence: …-L-K-E-G-V-S-K-D-…
unfolded protein: not unique, mobile, inactive, expanded, irregular
spontaneous self-organisation (~1 second)
native state: unique shape, precisely ordered, stable/functional, globular/compact, helices and sheets
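As a small illustration of the sequence step on this slide, the sketch below translates the codons into one-letter amino acid codes. The table holds only the codons needed here, and the name `translate` is made up for this example. The final codon is taken as GAU (aspartate) so that the result matches the peptide L-K-E-G-V-S-K-D; that choice is an assumption.

```python
# Partial standard genetic code: only the codons used in this example.
CODON_TABLE = {
    "CUA": "L",  # leucine
    "AAA": "K",  # lysine
    "AAG": "K",  # lysine
    "GAA": "E",  # glutamate
    "GGU": "G",  # glycine
    "GUU": "V",  # valine
    "AGC": "S",  # serine
    "GAU": "D",  # aspartate (assumed final codon, see lead-in)
}


def translate(mrna: str) -> str:
    """Translate hyphen-separated mRNA codons into hyphen-separated residues."""
    return "-".join(CODON_TABLE[codon] for codon in mrna.split("-"))


peptide = translate("CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU")
print(peptide)
```

A full translator would carry all 64 codons and handle stop codons; the partial table keeps the example short.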
Protein folding landscape
Large multi-dimensional space of changing conformations
[Figure: free energy along the folding reaction, from unfolded through the molten globule (time scale 10^-8 s) to native (time scale 10^-3 s); the barrier height is marked ΔG**]
Protein primary structure
twenty types of amino acids
[Figure: a single amino acid, H2N-CαH(R)-COOH]
two amino acids join by forming a peptide bond
[Figure: amino acids joined into a polypeptide chain of repeating -NH-CαH(R)-CO- units]
each residue in the amino acid main chain has two degrees of freedom (φ and ψ)
the amino acid side chains can have up to four degrees of freedom (χ1-χ4)
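The main-chain and side-chain degrees of freedom mentioned above are torsion (dihedral) angles, each defined by four consecutive atoms. The sketch below computes such an angle from Cartesian coordinates using only the standard library. The function name `dihedral` and the test points are illustrative, not taken from any structure.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Four points with the central bond along z: cis, trans and a quarter turn.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```

For a protein, φ would use the atoms C(i-1), N(i), Cα(i), C(i) and ψ the atoms N(i), Cα(i), C(i), N(i+1).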
Protein secondary structure
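[Figure: plot of the two main-chain torsion angles, both axes running from -180 to +180, with regions labelled helix, sheet and L]
many combinations are not possible
[Figure: helix, sheet (anti-parallel) and sheet (parallel), with the N and C termini marked]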
Protein tertiary and quaternary structures
[Figure: Ribonuclease inhibitor (2bnh), Haemoglobin (1hbh), Hemagglutinin (1hgd)]
Methods for determining protein structure
[Overview diagram of protein sequence, protein structure and protein function, repeated from the first slide, now with X-ray crystallography and NMR spectroscopy in place of structure determination]
X-ray crystallography - concept
• X-rays interact with electrons in protein molecules arranged in a crystal to produce diffraction patterns
• The diffraction patterns of the x-rays can be used to determine the three-dimensional structure of proteins
• Provides a “static” picture
[Figure from <http://info.bio.cmu.edu/courses/03231/LecF01/Lec25/lec25.html>]
X-ray crystallography - details
• Prepare protein crystals in which the proteins are organised in a precise crystal lattice
• Shine X-rays on the crystals; the X-rays diffract off the electrons of the atoms in the crystal, and the intensities of the individual reflections are measured
• Phases are usually obtained indirectly by isomorphous replacement, from the way one or a few heavy atoms incorporated into the same isomorphous crystal lattice affect the diffraction pattern
• Intensities and phases of all reflections are combined in a Fourier transform to provide maps of electron density
• Interpret the map by fitting the polypeptide chain to the contours
• Refine the model by minimising the difference between the observed amplitudes and the calculated amplitudes
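The Fourier step can be illustrated with a one-dimensional toy crystal. This is a sketch under simplifying assumptions, not a crystallographic program: the two "atoms", their weights, the resolution limit `H_MAX` and the grid size are all invented. It shows that amplitudes combined with phases give density peaks at the atom positions, while amplitudes alone (all phases set to zero) do not, which is why the phases have to be obtained separately.

```python
import cmath
import math

# Hypothetical 1D unit cell: (fractional position, scattering weight).
atoms = [(0.2, 6.0), (0.7, 8.0)]
H_MAX = 10   # highest reflection index included
GRID = 100   # number of sample points across the unit cell


def structure_factor(h):
    """Complex structure factor F(h): amplitude and phase of reflection h."""
    return sum(f * cmath.exp(2j * math.pi * h * x) for x, f in atoms)


F = {h: structure_factor(h) for h in range(-H_MAX, H_MAX + 1)}


def density(x, coeffs):
    """Fourier synthesis of the density at fractional coordinate x."""
    total = sum(c * cmath.exp(-2j * math.pi * h * x) for h, c in coeffs.items())
    return total.real


xs = [i / GRID for i in range(GRID)]
rho = [density(x, F) for x in xs]

# Same amplitudes, but with every phase discarded (set to zero).
amplitudes_only = {h: abs(c) for h, c in F.items()}
rho_no_phase = [density(x, amplitudes_only) for x in xs]

peak = xs[rho.index(max(rho))]
peak_no_phase = xs[rho_no_phase.index(max(rho_no_phase))]
print(peak, peak_no_phase)
```

With phases, the highest peak sits on the heavier atom; without them, it collapses onto the origin.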
NMR spectroscopy - concept
• The magnetic-spin properties of atomic nuclei within a molecule are used to obtain a list of distance constraints between atoms in the molecule, from which a three-dimensional structure of the protein molecule can be obtained
• Provides a “dynamic” picture
NK-lysin (1nkl) S1 RNA binding domain (1sro)
NMR spectroscopy - details
• When protein molecules are placed in a strong magnetic field, the nuclear spins of their hydrogen atoms align with the field; the aligned spins can be excited by applying radio frequency (RF) pulses
• Possible to obtain unique signal (chemical shift) for each hydrogen atom in a protein molecule
• Structural information arises primarily from the Nuclear Overhauser Effect (NOE), which gives information about distances between atoms in a molecule
• A pair of protons give a detectable NOE cross-peak if they are within 5.0 Å of each other in space
• After obtaining NOE data for protons throughout the structure, a number of independent structures can be generated that are consistent with the distance constraints
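The distance constraints can be pictured as a list of proton pairs that must lie within the NOE cutoff. The sketch below lists the pairs within 5.0 Å in one set of coordinates, then checks a second candidate structure against that list. The proton names and coordinates are invented for illustration, and real structure calculation is far more involved than this check.

```python
import math
from itertools import combinations

NOE_CUTOFF = 5.0  # Å, the detection limit quoted on the slide

# Hypothetical proton coordinates in Å.
reference = {
    "HA_1": (0.0, 0.0, 0.0),
    "HN_2": (2.5, 0.0, 0.0),
    "HB_7": (0.0, 4.0, 0.0),
    "HN_9": (9.0, 0.0, 0.0),
}


def noe_pairs(coords, cutoff=NOE_CUTOFF):
    """Proton pairs close enough to give a detectable NOE cross-peak."""
    names = sorted(coords)
    return [(a, b) for a, b in combinations(names, 2)
            if math.dist(coords[a], coords[b]) < cutoff]


def violations(coords, restraints, cutoff=NOE_CUTOFF):
    """Restraints that a candidate structure fails to satisfy."""
    return [(a, b) for a, b in restraints
            if math.dist(coords[a], coords[b]) >= cutoff]


restraints = noe_pairs(reference)

# A candidate structure in which HB_7 has drifted away from the others.
candidate = dict(reference, HB_7=(0.0, 8.0, 0.0))
print(restraints)
print(violations(candidate, restraints))
```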
Computer representation of protein structure
• Structures are stored in the Protein Data Bank (PDB), a repository of mostly experimental models based on X-ray crystallographic and NMR studies
• <http://www.rcsb.org>
• Atoms are defined by their Cartesian coordinates:
ATOM 1 N GLU 1 18.222 18.496 -16.203 1.00 21.95
ATOM 2 CA GLU 1 17.706 17.982 -14.905 1.00 16.74
ATOM 3 C GLU 1 17.368 16.466 -15.121 1.00 15.45
ATOM 4 O GLU 1 16.780 16.073 -16.175 1.00 18.81
ATOM 5 CB GLU 1 16.552 18.744 -14.351 1.00 17.35
ATOM 6 CG GLU 1 16.952 20.118 -13.803 1.00 24.48
ATOM 7 CD GLU 1 15.881 21.145 -13.597 1.00 31.51
ATOM 8 OE1 GLU 1 16.012 22.316 -13.292 1.00 29.12
ATOM 9 OE2 GLU 1 14.701 20.768 -13.799 1.00 35.19
ATOM 10 N PHE 2 17.762 15.746 -14.052 1.00 15.83
ATOM 11 CA PHE 2 17.509 14.262 -14.184 1.00 13.24
• These structures provide the basis for most of the theoretical work in protein folding and protein structure prediction
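To make the record layout concrete, the sketch below parses the first three ATOM records from the listing and measures the N-Cα bond length. It splits on whitespace because that is how the records appear here; real PDB files are fixed-column, so a production parser would slice by column instead. The `Atom` type and `parse_atom` function are names made up for this example, and the last two fields are read as occupancy and temperature factor.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


RECORDS = """\
ATOM 1 N GLU 1 18.222 18.496 -16.203 1.00 21.95
ATOM 2 CA GLU 1 17.706 17.982 -14.905 1.00 16.74
ATOM 3 C GLU 1 17.368 16.466 -15.121 1.00 15.45
"""


def parse_atom(line: str) -> Atom:
    """Parse one whitespace-separated ATOM record (simplified layout)."""
    f = line.split()
    return Atom(int(f[1]), f[2], f[3], int(f[4]),
                float(f[5]), float(f[6]), float(f[7]),
                float(f[8]), float(f[9]))


atoms = [parse_atom(line) for line in RECORDS.splitlines()
         if line.startswith("ATOM")]

n, ca = atoms[0], atoms[1]
bond = math.dist((n.x, n.y, n.z), (ca.x, ca.y, ca.z))
print(f"N-CA distance: {bond:.3f} Å")
```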
Comparison of protein structures
• Need ways to determine if two protein structures are related and to compare predicted models to experimental structures
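• Commonly used measure is the root mean square deviation (RMSD) of the Cartesian coordinates of the atoms between two structures after optimal superposition (McLachlan, 1979):
\[ \mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( dx_i^2 + dy_i^2 + dz_i^2 \right)} \]
where dx_i, dy_i and dz_i are the coordinate differences for atom pair i and N is the number of atom pairs
• Usually use Cα atoms
[Figure: NK-lysin (1nkl), Bacteriocin T102/as48 (1e68) and the best T102 model; RMSDs of 3.6 Å and 2.9 Å]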
• Other measures include contact maps and torsion angle RMSDs
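The RMSD itself is a short calculation once the two structures share a frame. The sketch below assumes the coordinate sets are already optimally superposed and paired atom for atom; the superposition step is deliberately left out. The coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long, pre-superposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical Cα positions: every atom is displaced by exactly 1 Å.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
experiment = [(0.0, 0.0, 1.0), (3.8, 1.0, 0.0), (7.6, 0.0, -1.0)]

print(rmsd(model, experiment))
```

Without the superposition step the value depends on how the two structures happen to be placed, so this function should only be applied after fitting.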
Methods for predicting protein structure
[Overview diagram of protein sequence, protein structure and protein function, repeated from the first slide, now with comparative modelling, fold recognition and ab initio prediction in place of structure prediction]
Comparative modelling of protein structure
• Proteins that have similar sequences (i.e., related by evolution) have similar three-dimensional structures
• A model of a protein whose structure is not known can be constructed if the structure of a related protein has been determined by experimental methods
• Similarity must be obvious and significant for good models to be built
• Need ways to build regions that are not similar between the two related proteins
• Need ways to move model closer to the native structure
Comparative modelling of protein structure
KDHPFGFAVPTKNPDGTMNLMNWECAIP
KDPPAGIGAPQDN----QNIMLWNAVIP
** * *   *  *     * * *   **
scan -> align -> build initial model -> construct non-conserved side chains and main chains -> refine
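How similar the two aligned sequences on this slide are can be put as a number. The sketch below reads the alignment as two 28-column rows, which is an assumption about the original layout, and counts identical positions. The names `seq_a`, `seq_b` and `identity` are illustrative.

```python
# The two rows of the alignment shown on the slide (gaps written as "-").
seq_a = "KDHPFGFAVPTKNPDGTMNLMNWECAIP"
seq_b = "KDPPAGIGAPQDN----QNIMLWNAVIP"


def identity(a: str, b: str):
    """Return (identical positions, aligned positions without gaps)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    matches = sum(1 for x, y in aligned if x == y)
    return matches, len(aligned)


matches, aligned = identity(seq_a, seq_b)
percent = 100.0 * matches / aligned
print(f"{matches} identical of {aligned} aligned positions ({percent:.0f}%)")
```

Conventions differ on the denominator (aligned positions, alignment length, or the shorter sequence), so any quoted percentage should say which one was used.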
Fold recognition
• The number of possible protein structures/folds is limited (large number of sequences but few folds)
• Proteins that do not have similar sequences sometimes have similar three-dimensional structures
• A sequence whose structure is not known is fitted directly (or “threaded”) onto a known structure and the “goodness of fit” is evaluated using a discriminatory function
• Need ways to move model closer to the native structure
[Figure: NK-lysin (1nkl) and Bacteriocin T102/as48 (1e68); 3.6 Å, 5% ID]
Fold recognition
KDHPFGFAVPTKNPDGTMNLMNWECAIP
KDPPAGIGAPQDN----QNIMLWNAVIP
** * *   *  *     * * *   **
evaluate fit -> build initial model -> construct non-conserved side chains and main chains -> refine
Ab initio prediction of protein structure – concept
• Go from sequence to structure by sampling the conformational space in a reasonable manner and select a native-like conformation using a good discrimination function
• Problems: conformational space is astronomical, and it is hard to design functions that are not fooled by non-native conformations (or “decoys”)
Ab initio prediction of protein structure
• sample conformational space such that native-like conformations are found: astronomically large number of conformations (5 states per residue, 100 residues: 5^100 ≈ 10^70)
• select: hard to design functions that are not fooled by non-native conformations ("decoys")
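The size of the search space follows from simple arithmetic: with 5 states per residue and 100 residues there are 5^100 conformations, and 100 × log10(5) ≈ 69.9, hence roughly 10^70. The sketch below checks this. The sampling rate used for the time estimate is a made-up figure, included only to give a sense of scale.

```python
import math

states_per_residue = 5
residues = 100

conformations = states_per_residue ** residues       # exact integer
exponent = residues * math.log10(states_per_residue)  # power of ten

print(f"5^100 is about 10^{exponent:.1f}")

# Hypothetical rate: one trillion conformations evaluated per second.
rate_per_second = 1e12
seconds_per_year = 3600 * 24 * 365
years = conformations / rate_per_second / seconds_per_year
print(f"exhaustive search at that rate: about 10^{math.log10(years):.0f} years")
```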
Sampling conformational space – continuous approaches
• Most work in the field
- Molecular dynamics
- Continuous energy minimisation (follow a valley)
- Monte Carlo simulation
- Genetic Algorithms
• Like real polypeptide folding process
• Cannot be sure if native-like conformations are sampled
Molecular dynamics
• Force = -dU/dx (slope of potential U); the acceleration follows from m a(t) = force
• All atoms are moving so forces between atoms are complicated functions of time
• Analytical solution for x(t) and v(t) is impossible; numerical solution is trivial
• Atoms are moved in very short time steps Δt of 10^-15 seconds or 0.001 picoseconds (ps)
• New position from the old position, the old velocity and the accelerations:
\[ x(t+\Delta t) = x(t) + v(t)\,\Delta t + \left[ 4a(t) - a(t-\Delta t) \right] \Delta t^2 / 6 \]
• New velocity from the old velocity and the accelerations:
\[ v(t+\Delta t) = v(t) + \left[ 2a(t+\Delta t) + 5a(t) - a(t-\Delta t) \right] \Delta t / 6 \]
• Kinetic energy and temperature, where n is the number of coordinates (not atoms):
\[ U_{kinetic} = \tfrac{1}{2} \sum_i m_i v_i(t)^2 = \tfrac{1}{2}\, n\, k_B T \]
• Total energy (U_potential + U_kinetic) must not change with time
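The position and velocity updates on this slide can be tried out on a single particle in a harmonic potential, where the exact answer is known. The sketch below is a toy under stated assumptions: the potential U = k x²/2, the unit mass and spring constant, the time step and the way the first a(t-Δt) is estimated are all choices made for the example, not part of the lecture.

```python
import math


def integrate_oscillator(x0=1.0, v0=0.0, k=1.0, m=1.0, dt=0.01, steps=1000):
    """Integrate m a = -dU/dx for U = k x^2 / 2 with the slide's update rules."""
    def accel(x):
        return -k * x / m

    x, v = x0, v0
    a = accel(x)
    # Bootstrap a(t - dt) from a Taylor estimate of the previous position.
    a_prev = accel(x - v * dt + 0.5 * a * dt * dt)
    for _ in range(steps):
        x_new = x + v * dt + (4.0 * a - a_prev) * dt * dt / 6.0
        a_new = accel(x_new)  # forces depend on positions only
        v_new = v + (2.0 * a_new + 5.0 * a - a_prev) * dt / 6.0
        x, v, a_prev, a = x_new, v_new, a, a_new
    return x, v


def total_energy(x, v, k=1.0, m=1.0):
    """Potential plus kinetic energy of the oscillator."""
    return 0.5 * k * x * x + 0.5 * m * v * v


x_end, v_end = integrate_oscillator()
e_start = total_energy(1.0, 0.0)
e_end = total_energy(x_end, v_end)
print(x_end, math.cos(10.0), abs(e_end - e_start))
```

After 1000 steps of 0.01 time units the position should track cos(10), and the total energy should stay close to its starting value, which is the check stated on the slide.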
Energy minimisation
• For a given protein, the energy depends on thousands of x,y,z Cartesian atomic coordinates; reaching a deep minimum is not trivial
• With convergence, we have an accurate equilibrium conformation and a well-defined energy value
[Figure: energy against number of steps, from the starting conformation to a deep minimum, comparing steepest descent and conjugate gradient]
[Figure: energy and RMSD against number of steps; a run either converges or gives up]
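A minimal sketch of steepest descent, the simpler of the two minimisers named on this slide, is given below. It runs on an invented two-variable energy surface rather than a protein force field. The step size, tolerance and step limit are arbitrary choices, and the run either converges or gives up when the limit is reached.

```python
def energy(p):
    """Hypothetical energy surface with its minimum at (1, -2)."""
    x, y = p
    return (x - 1.0) ** 2 + 10.0 * (y + 2.0) ** 2


def gradient(p):
    """Analytical gradient of the energy surface above."""
    x, y = p
    return (2.0 * (x - 1.0), 20.0 * (y + 2.0))


def steepest_descent(start, step=0.04, tol=1e-8, max_steps=10000):
    """Follow the negative gradient until it vanishes or steps run out."""
    p = start
    for n in range(max_steps):
        gx, gy = gradient(p)
        if gx * gx + gy * gy < tol * tol:
            return p, n, True        # converged
        p = (p[0] - step * gx, p[1] - step * gy)
    return p, max_steps, False       # gave up


minimum, n_steps, converged = steepest_descent((0.0, 0.0))
print(minimum, n_steps, converged, energy(minimum))
```

The narrow valley in y limits the usable step size, which is the usual reason conjugate gradient is preferred once the steepest slopes have been removed.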
Monte Carlo simulation
• Discrete moves in torsion or cartesian conformational space
• Evaluate energy after every move and compare to previous energy (E)
• Accept conformation based on Boltzmann probability:
\[ P = \exp\left( -\frac{\Delta E}{kT} \right) \]
• Many variations, including simulated annealing (starting with a high temperature so more moves are accepted initially and then cooling)
• If run for infinite time, simulation will produce a Boltzmann distribution
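The move, evaluate and accept cycle can be sketched on a one-dimensional harmonic energy. The acceptance rule below is the usual Metropolis form, which is an assumption about what the slide intends: downhill moves are always accepted, uphill moves with probability exp(-ΔE/kT). The energy function, move size, step count and random seed are arbitrary.

```python
import math
import random


def accept_probability(delta_e, kT):
    """Metropolis rule: certain for downhill, Boltzmann-weighted for uphill."""
    return 1.0 if delta_e <= 0.0 else math.exp(-delta_e / kT)


def metropolis(energy, x0, kT, steps, max_move, rng):
    """Sample a 1D coordinate; returns the visited positions."""
    x, e = x0, energy(x0)
    samples = []
    for _ in range(steps):
        x_new = x + rng.uniform(-max_move, max_move)  # discrete trial move
        e_new = energy(x_new)
        if rng.random() < accept_probability(e_new - e, kT):
            x, e = x_new, e_new
        samples.append(x)
    return samples


rng = random.Random(0)
samples = metropolis(lambda x: 0.5 * x * x, x0=0.0, kT=1.0,
                     steps=50000, max_move=1.5, rng=rng)
mean_sq = sum(s * s for s in samples) / len(samples)
print(mean_sq)
```

For E = x²/2 at kT = 1 the Boltzmann distribution has a mean square displacement of 1, so a long run should land close to that value. Simulated annealing would wrap the same loop in a schedule that lowers kT over time.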
Genetic Algorithms
• Generate an initial pool of conformations
• Perform crossover and mutation operations on this set to generate a much larger pool of conformations
• Select a subset of the fittest conformations from this large pool
• Repeat above two steps until convergence
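The four steps above can be sketched with conformations encoded as lists of discrete states. Everything specific here is invented for the example: the 5-state encoding, the `TARGET` string standing in for a native-like conformation, the mismatch-count score, the pool sizes and the fixed number of generations used in place of a convergence test.

```python
import random

STATES = 5
LENGTH = 20
TARGET = [i % STATES for i in range(LENGTH)]  # hypothetical "native" states


def score(conf):
    """Number of positions that differ from TARGET (lower is fitter)."""
    return sum(1 for a, b in zip(conf, TARGET) if a != b)


def crossover(a, b, rng):
    """Single-point crossover of two parent conformations."""
    cut = rng.randrange(1, LENGTH)
    return a[:cut] + b[cut:]


def mutate(conf, rng, rate=0.05):
    """Randomly reassign each position with probability `rate`."""
    return [rng.randrange(STATES) if rng.random() < rate else s for s in conf]


def evolve(pool_size=20, offspring=200, generations=40, seed=1):
    rng = random.Random(seed)
    pool = [[rng.randrange(STATES) for _ in range(LENGTH)]
            for _ in range(pool_size)]
    history = []
    for _ in range(generations):
        children = [mutate(crossover(rng.choice(pool), rng.choice(pool), rng), rng)
                    for _ in range(offspring)]
        # Parents compete with children, so the best score never gets worse.
        pool = sorted(pool + children, key=score)[:pool_size]
        history.append(score(pool[0]))
    return pool[0], history


best, history = evolve()
print(history[0], history[-1])
```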
Sampling conformational space – exhaustive approaches
• enumerate all possible conformations; view entire space (perfect partition function)
• computationally intractable: 5 states per residue, 100 residues: 5^100 ≈ 10^70 possible conformations
• must use discrete state models to minimise number of conformations explored
• select
Scoring/energy functions
• Need a way to select native-like conformations from non-native ones
• Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms
• Knowledge-based scoring functions: derive information about atomic properties from a database of experimentally determined conformations; common parameters include pairwise atomic distances and amino acid burial/exposure.
Requirements for sampling methods and scoring functions
• Sampling methods must produce good decoy sets that are comprehensive and include several native-like structures
• Scoring function scores must correlate well with RMSD of conformations (the better the score/energy, the lower the RMSD)
Overview of CASP experiment
• Three categories: comparative/homology modelling, fold recognition/threading, and ab initio prediction
• Goal is to assess structure prediction methods in a blind and rigorous manner; blind prediction is necessary for accurate assessment of methods
• Ask modellers to build models of structures as they are in the process of being solved experimentally
• After prediction season is over, compare predicted models to the experimental structures
• Discuss what went right, what went wrong, and why
• Compare progress from CASP1 to CASP4
• Results published in special issues of Proteins: Structure, Function, Genetics 1995, 1997, 1999, 2002
Comparative modelling at CASP - methods
• Alignment: PSI-BLAST, FASTA, CLUSTALW - multiple sequence alignments carefully hand-edited using secondary structure information
• More successful side chain prediction methods include:
- backbone-dependent rotamer libraries (Bower & Dunbrack)
- segment matching followed by energy minimisation (Levitt)
- self-consistent mean field optimisation (Bates et al)
- graph-theory + knowledge-based functions (Samudrala et al)
• More successful loop building methods include:
- satisfaction of spatial restraints (Sali)
- internal coordinate mechanics energy optimisation (Abagyan et al)
- graph-theory + knowledge-based functions (Samudrala et al)
• Overall model building: there is no substitute for careful hand-constructed models (Sternberg et al, Venclovas)
A graph theoretic representation of protein structure
[Figure: represent residues as nodes -> construct graph -> weigh nodes -> find cliques. Example node weights: I (-0.5), V1 (-0.6), V2 (-0.9), F (-1.0), K (-0.7); edge weights between -0.1 and -0.4; selected clique W = -4.5]
Historical perspective on comparative modelling
         alignment   side chain   short loops   longer loops
BC       excellent   ~80%         1.0 Å         2.0 Å
CASP1    poor        ~50%         ~3.0 Å        >5.0 Å
CASP4: overall model accuracy ranging from 1 Å to 6 Å for 50-10% sequence identity
**T112/dhso – 4.9 Å (348 residues; 24%)
**T92/yeco – 5.6 Å (104 residues; 12%)
**T128/sodm – 1.0 Å (198 residues; 50%)
**T125/sp18 – 4.4 Å (137 residues; 24%)
**T111/eno – 1.7 Å (430 residues; 51%)
**T122/trpa – 2.9 Å (241 residues; 33%)
Comparative modelling at CASP - conclusions
         alignment   side chain   short loops   longer loops
BC       excellent   ~80%         1.0 Å         2.0 Å
CASP1    poor        ~50%         ~3.0 Å        >5.0 Å
CASP2    fair        ~75%         ~1.0 Å        ~3.0 Å
CASP3    fair        ~75%         ~1.0 Å        ~2.5 Å
CASP4    fair        ~75%         ~1.0 Å        ~2.0 Å
Fold recognition at CASP - methods
• Visual inspection with sequence comparison (Murzin group)
• Procyon - potential of mean force based on pairwise interactions and global dynamic programming (Sippl group)
• Threader - potential of mean force and double dynamic programming (Jones group)
• Environmental 3D Profiles (Eisenberg group)
• NCBI Threading Program using contact potentials and models of sequence-structure conservation (Bryant group)
• Hidden Markov Models (Karplus group)
• Combination of threading with ab initio approaches (Friesner group)
• Environment-specific substitution tables and structure-dependent gap penalties (Blundell group)
Fold recognition at CASP - conclusions
• Fold recognition is one of the more successful approaches at predicting structure at all four CASPs
• At CASP2 and CASP4, one of the best methods was simple sequence searching with careful manual inspection (Murzin group)
• At CASP3 and CASP4, none of the threading targets could have been recognised by the best standard sequence comparison methods such as PSI-BLAST
• For the most difficult targets, the methods were able to predict 60 residues to 6.0 Å Cα RMSD, approaching comparative modelling accuracies as the similarity between proteins increased.
Ab initio prediction at CASP – methods
• Assembly of fragments with simulated annealing (Simons et al)
• Exhaustive sampling and pruning using knowledge-based scoring functions (Samudrala et al)
• Constraint-based Monte Carlo optimisation (Skolnick et al)
• Thermodynamic model for secondary structure prediction with manual docking of secondary structure elements and minimisation (Lomize et al)
• Minimisation of a physical potential energy function with a simplified representation (Scheraga et al, Osguthorpe et al)
• Neural networks to predict secondary structure (Jones, Rost)
Semi-exhaustive segment-based folding
EFDVILKAAGANKVAVIKAVRGATGLGLKEAKDLVESAPAALKEGVSKDDAEALKKALEEAGAEVEVK
• generate: fragments from database; 14-state φ,ψ model
• minimise: Monte Carlo with simulated annealing; conformational space annealing, GA
• filter: all-atom pairwise interactions, bad contacts; compactness, secondary structure
Historical perspective on ab initio prediction
Before CASP (BC): "solved" (biased results)
CASP1: worse than random
CASP2: worse than random with one exception
CASP3: consistently predicted correct topology - ~6.0 Å for 60+ residues
CASP4: ?
*T56/dnab – 6.8 Å (60 residues; 67-126)
**T59/smd3 – 6.8 Å (46 residues; 30-75)
**T61/hdea – 7.4 Å (66 residues; 9-74)
**T64/sinr – 4.8 Å (68 residues; 1-68)
*T74/eps15 – 7.0 Å (60 residues; 154-213)
**T75/ets1 – 7.7 Å (77 residues; 55-131)
Ab initio prediction at CASP - conclusions
Before CASP (BC): "solved" (biased results)
CASP1: worse than random
CASP2: worse than random with one exception
CASP3: consistently predicted correct topology - ~6.0 Å for 60+ residues
CASP4: consistently predicted correct topology - ~4-6.0 Å for 60-80+ residues
**T110/rbfa – 4.0 Å (80 residues; 1-80)
*T114/afp1 – 6.5 Å (45 residues; 36-80)
**T97/er29 – 6.0 Å (80 residues; 18-97)
**T106/sfrp3 – 6.2 Å (70 residues; 6-75)
*T98/sp0a – 6.0 Å (60 residues; 37-105)
**T102/as48 – 5.3 Å (70 residues; 1-70)
Computational aspects of structural genomics
[Figure: panels A. sequence space, B. comparative modelling, C. fold recognition, D. ab initio prediction, E. target selection (targets), F. analysis]
(Figure idea by Steve Brenner.)
Key points
• DNA/gene is the blueprint - proteins are the functional representatives of genes
• Protein structure can be used to understand protein function
• Large numbers of genes being sequenced - need structures
• Protein folding (from primary sequence to tertiary structure) is a fast self-organising process where a disordered non-functional chain of amino acids becomes a stable, compact, and functional molecule
• The free energy difference between the folded and unfolded states is not very high
• Experimental methods to determine protein structures include X-ray crystallography and NMR spectroscopy
• Theoretical methods to predict protein structures include comparative/homology modelling, fold recognition/threading, and ab initio prediction
• For ab initio prediction, you need a method that samples the conformational space adequately (to find native-like conformations) and a function that can identify them
• CASP experiment shows limited progress in protein structure prediction