Swiss Institute of Bioinformatics
Torsten Schwede
Biozentrum - Universität Basel
Swiss Institute of Bioinformatics
Klingelbergstr 50-70
CH - 4056 Basel, Switzerland
Tel: +41-61 267 15 81
EMBnet course: Introduction to Protein Structure Bioinformatics
Homology Modeling
Lausanne, February 22, 2007
How many structures do we know?
http://www.wwpdb.org/
How many structures do we know?
[ PDB: http://www.pdb.org ]
Growth of the Protein Data Bank PDB
[ PDB: http://www.pdb.org ]
[Figure legend: Total, Yearly]
[Figure: number of entries in TrEMBL, SwissProt and PDB on a logarithmic scale (100 to 10,000,000), 1986 to 2006]
No experimental structure for most protein sequences
(Sources: PDB, EBI, SIB)
How many structures do we know?
In the near future for most of the known protein sequences
no experimental structure will be available.
Can we predict protein structures
from genome sequences?
MNIFEMLRID EGLRLKIYKD TEGYYTIGIG HLLTKSPSLN AAKSELDKAI GRNCNGVITKDEAEKLFNQD VDAAVRGILR NAKLKPVYDS LDAVRRCALI NMVFQMGETG VAGFTNSLRMLQQKRWDEAA VNLAKSRWYN QTPNRAKRVI TTFRTGTWDA YKNL
The protein sequence contains all information needed to create a correctly folded protein.
Can we predict the folding process of a protein structure from its sequence (ab initio)?
Many proteins fold spontaneously to their native structure
Protein folding is relatively fast (nsec to sec)
Chaperones speed up folding, but do not alter the structure
MNIFEMLRID EGLRLKIYKD TEGYYTIGIG HLLTKSPSLN AAKSELDKAI GRNCNGVITKDEAEKLFNQD VDAAVRGILR NAKLKPVYDS LDAVRRCALI NMVFQMGETG VAGFTNSLRMLQQKRWDEAA VNLAKSRWYN QTPNRAKRVI TTFRTGTWDA YKNL
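\[
V(\mathbf{r}^N) = \sum_{\text{bonds}} \frac{k_i}{2}\,(l_i - l_{i,0})^2
+ \sum_{\text{angles}} \frac{k_i}{2}\,(\theta_i - \theta_{i,0})^2
+ \sum_{\text{torsions}} \frac{V_n}{2}\,\bigl(1 + \cos(n\omega - \gamma)\bigr)
+ \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( 4\varepsilon_{ij} \left[ \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6} \right] + \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}} \right)
\]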
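As a rough illustration of the non-bonded part of a molecular mechanics potential (Lennard-Jones plus Coulomb), the sketch below sums pair energies over all atom pairs. The function names, the single shared epsilon and sigma for every pair, and the unit choices (kcal/mol, Å, elementary charges, a Coulomb constant of about 332.06) are assumptions made for this example and are not taken from the slides.

```python
import math

# Assumed unit system: energies in kcal/mol, distances in Å, charges in e.
# 332.0636 is the commonly used value of 1/(4*pi*eps0) in these units.
COULOMB_CONST = 332.0636

def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy for one atom pair at distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def coulomb(r, qi, qj):
    """Coulomb energy for two point charges at distance r."""
    return COULOMB_CONST * qi * qj / r

def nonbonded_energy(coords, charges, epsilon, sigma):
    """Sum Lennard-Jones and Coulomb terms over all pairs i < j.

    Simplification: one epsilon and sigma for all pairs, no cutoff,
    no exclusion of bonded neighbours.
    """
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            total += lennard_jones(r, epsilon, sigma)
            total += coulomb(r, charges[i], charges[j])
    return total
```

The double loop makes the cost grow with the square of the number of atoms, which is one reason the force calculation dominates the cost of a molecular dynamics step.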
Molecular Dynamics
Ab initio protein folding simulation
[ http://www.research.ibm.com/bluegene/ ]
Physical time for simulation: 10^-4 seconds
Typical time-step size: 10^-15 seconds
Number of MD time steps: 10^11
Atoms in a typical protein and water simulation: 32'000
Approximate number of interactions in force calculation: 10^9
Machine instructions per force calculation: 1000
Total number of machine instructions: 10^23
Petaflop capacity computer (floating point operations per second): 1 petaflop (10^15)
Blue Gene will need 1-3 years to simulate 100 μsec.
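The estimate can be checked with simple arithmetic. The sketch below multiplies out the figures quoted on the slide; the function and parameter names are illustrative, and the calculation assumes one machine instruction per floating point operation and perfect utilisation of the machine.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def simulation_cost(simulated_time=1e-4,            # seconds of physical time
                    time_step=1e-15,                # seconds per MD step
                    interactions_per_step=1e9,
                    instructions_per_interaction=1e3,
                    machine_rate=1e15):             # instructions per second
    """Return (MD steps, total instructions, wall-clock years)."""
    n_steps = simulated_time / time_step
    total_instructions = (n_steps * interactions_per_step
                          * instructions_per_interaction)
    wall_seconds = total_instructions / machine_rate
    return n_steps, total_instructions, wall_seconds / SECONDS_PER_YEAR

steps, instructions, years = simulation_cost()
print(f"{steps:.0e} steps, {instructions:.0e} instructions, {years:.1f} years")
```

With these inputs the run takes about 10^8 seconds of wall-clock time, which is roughly three years and falls at the upper end of the range quoted on the slide.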
[ PDB: http://www.pdb.org ]
Growth of the Protein Data Bank PDB
New folds per year
“Old” folds per year
[ http://www.biochem.ucl.ac.uk/bsm/cath_new/ ]
CATH - Protein Structure Classification
Class (C): derived from secondary structure content; assigned automatically.
Architecture (A): describes the gross orientation of secondary structures, independent of connectivity.
Topology (T): clusters structures according to their topological connections and numbers of secondary structures.
Homologous Superfamily (H): groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous.
[Figure: pairwise sequence identity (0 to 100%) vs. number of residues aligned. Region labels: "Sequence identity implies structural similarity!" and "Don't know region". (B. Rost, Columbia, New York)]
Sequence similarity implies structural similarity?
[Figure: percentage sequence identity / similarity (0 to 100) vs. number of residues aligned (0 to 250), with separate curves for identity and similarity. (B. Rost, Columbia, New York)]
Sequence similarity implies structural similarity?
Don't know region
Sequence identity implies structural similarity
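Both axes of these plots, the number of residues aligned and the percentage of identical residues, can be read off a pairwise alignment. The sketch below is one possible way to compute them; the function name and the convention of counting only columns without gaps are assumptions, since the slides do not state which denominator is used.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (residues aligned, percent identity) for two aligned sequences.

    Assumed convention: only columns where neither sequence has a gap
    count as aligned, and identity is reported relative to that count.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        return 0, 0.0
    return aligned, 100.0 * identical / aligned

n_aligned, identity = percent_identity("MNIF-EML", "MNLFAEMI")
print(n_aligned, round(identity, 1))
```

Reporting both numbers matters because a given percentage identity is far less informative for a short alignment than for a long one.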
Find a compatible fold for a given sequence ....
>Protein XY
MSTLYEKLGGTTAVDLAVDKFYERVLQDDRIKHFFADVDMAKQRAHQKAFLTYAFGGTDKYDGRYMREAHKELVENHGLNGEHFDAVAEDLLATLKEMGVPEDLIAEVAAVAGAPAHKRDVLNQ
≈?
Fold recognition / Threading
The number of protein folds that occur in nature is limited. Fold recognition can be used to:
• Identify templates for comparative modeling
• Assign protein function
Fold recognition / Threading
The "biological" perspective: Homologous proteins have evolved by
molecular evolution from a common ancestor. If we can establish
homology, we can predict aspects of structure and function of a new
protein by analogy.
The "physical" perspective: The native conformation of a protein
corresponds to a global free energy minimum of the protein / solvent
system. To identify a compatible fold, the protein sequence is "threaded"
through a library of folds, and empirical energy calculations are used to
evaluate compatibility.
No single method is perfect. Consensus methods often perform better:
MetaPP: http://cubic.bioc.columbia.edu/predictprotein/
http://bioinfo.pl/meta/
Further reading: Adam Godzik, "Fold Recognition Methods", in:
"Structural Bioinformatics", Bourne & Weissig, Eds.
Protein Structure / Fold Databases
PDB: http://www.pdb.org
EBI-MSD http://www.ebi.ac.uk/msd/
SCOP http://scop.mrc-lmb.cam.ac.uk/scop/
CATH http://www.biochem.ucl.ac.uk/bsm/cath_new/
Fold Recognition Servers
Meta server http://bioinfo.pl/meta/
3DPSSM / Phyre http://www.sbg.bio.ic.ac.uk/servers/3dpssm/
http://www.sbg.bio.ic.ac.uk/~phyre/
GenTHREADER http://bioinf.cs.ucl.ac.uk/psipred/
FUGUE2 http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html
SAM http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html
FOLD http://fold.doe-mbi.ucla.edu/
FFAS/PDBBLAST http://bioinformatics.burnham-inst.org/
Evolution of the globin family:
[Figure: RMSD of backbone atoms in core (0.0 to 2.5) vs. percent identical residues in core (100 to 0)]
[ Chothia & Lesk (1986) ]
Evolution of protein structure families
Common core = all residues that can be superposed in 3D
For protein pairs with more than 60% identical residues, the core contains more than 90% of all residues, deviating by less than 1.0 Å.
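The deviation between two structures is usually reported as a root-mean-square deviation (RMSD) over equivalent atoms. The sketch below assumes the two coordinate sets are already superposed and listed in the same atom order; finding the optimal superposition is a separate step that is not shown. The function name is illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumes the structures are already superposed and that atom k in
    coords_a is equivalent to atom k in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Two atoms, each displaced by exactly 1 Å along z
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
```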
Homology modeling = Comparative protein modeling = Knowledge-based modeling
Idea: Using experimental 3D-structures of related family members (templates) to calculate a model for a new sequence (target).
Similar Sequence ⇒ Similar Structure
Known Structures (Templates)
Target Sequence
Template Selection
Alignment Template - Target
Structure modeling
Structure Evaluation & Assessment
Homology Model(s)
Comparative Modeling
Database of templates
• Protein Data Bank PDB http://www.pdb.org
• Separate into single chains
• Remove bad structures (models)
• Create BLASTable database or fold library (profiles, HMMs)
Comparative Modeling
Template selection:
1. Sequence Similarity / Fold recognition
2. Structure quality (resolution, experimental method)
3. Experimental conditions (ligands and cofactors)
Comparative Modeling
• Multiple sequence alignment for pairs > 40% identity
or
• Use structural alignment of templates to guide sequence alignment of target
or
• Use separate profiles for template and targets
Comparative Modeling
• Errors in template selection or alignment result in bad models
⇒ iterative cycles of alignment, modeling and evaluation
• Build many models, choose the best.
Comparative Modeling
I. Manual Model building
II. Template based fragment assembly
– Composer (Sybyl, Tripos)
– SWISS-MODEL
III. Satisfaction of spatial restraints
– Modeller (Insight II, MSI)
– CPH-Models
Comparative Modeling
[ http://www.expasy.org/spdbv/ ]
I. Manual Modeling
II. Template based fragment assembly
Find structurally conserved core regions
II. Template based fragment assembly
Build model core… by averaging core template backbone atoms (weighted by local sequence similarity with the target sequence). Leave non-conserved regions (loops) for later ….
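The averaging step can be sketched as a weighted mean over superposed templates. This is a minimal illustration, not the SWISS-MODEL implementation: the function name, the data layout (one coordinate list per template, same atom order in each) and the idea of supplying one weight per template and atom are assumptions for the example.

```python
def average_core(template_coords, weights):
    """Weighted average of equivalent backbone atoms from several templates.

    template_coords[t][k] is the (x, y, z) position of core atom k in
    superposed template t; weights[t][k] is the weight of that template
    at that position (for example a local sequence similarity score).
    """
    n_atoms = len(template_coords[0])
    model = []
    for k in range(n_atoms):
        total_w = sum(w[k] for w in weights)
        position = tuple(
            sum(w[k] * t[k][axis] for w, t in zip(weights, template_coords))
            / total_w
            for axis in range(3)
        )
        model.append(position)
    return model

# One core atom, two templates; the second template is weighted 3x
print(average_core([[(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]], [[1.0], [3.0]]))
```

Using per-position weights lets a template dominate where it resembles the target locally and contribute less elsewhere.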
II. Template based fragment assembly
Loop (insertion) modeling
Use the “spare part” algorithm to find compatible fragments in a Loop-Database, or “ab-initio” rebuilding (e.g. Monte Carlo, MD, GA, etc.) to build missing loops.
II. Template based fragment assembly
Side Chain placement
Find the most probable side chain conformation, using
• homologous structure information
• backbone-dependent rotamer libraries
• energetic and packing criteria
II. Template based fragment assembly
Rotamer Libraries
Only a small fraction of all possible side chain conformations is observed in experimental structures
Rotamer libraries provide an ensemble of likely conformations
The propensity of rotamers depends on the backbone geometry:
II. Template based fragment assembly
Energy minimization
The modeling method will produce unfavorable contacts and bonds.
Energy minimization is used to
• regularize local bond and angle geometry
• relax close contacts and geometric strain
Extensive energy minimization will move coordinates away from the real structure ⇒ keep it to a minimum.
SWISS-MODEL uses the GROMOS 96 force field for steepest descent energy minimization.
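To show what a steepest descent step does, the toy sketch below regularizes bond lengths using only a harmonic bond term. It is not the GROMOS 96 force field: the force constant, ideal bond length, step size and function names are all assumptions chosen for the example.

```python
import math

def bond_energy_and_gradient(coords, bonds, k=100.0, l0=1.5):
    """Harmonic bond energy sum(k/2 * (l - l0)^2) and its gradient."""
    energy = 0.0
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for i, j in bonds:
        d = [coords[i][a] - coords[j][a] for a in range(3)]
        length = math.sqrt(sum(x * x for x in d))
        energy += 0.5 * k * (length - l0) ** 2
        factor = k * (length - l0) / length
        for a in range(3):
            grad[i][a] += factor * d[a]
            grad[j][a] -= factor * d[a]
    return energy, grad

def steepest_descent(coords, bonds, steps=50, step_size=0.001):
    """Move every atom a small, fixed step against the energy gradient."""
    coords = [list(c) for c in coords]
    for _ in range(steps):
        _, grad = bond_energy_and_gradient(coords, bonds)
        for c, g in zip(coords, grad):
            for a in range(3):
                c[a] -= step_size * g[a]
    return coords

start = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]   # bond stretched to 2.0 Å
relaxed = steepest_descent(start, bonds=[(0, 1)])
print(math.dist(relaxed[0], relaxed[1]))
```

The small number of steps reflects the advice above: minimization should remove strain without letting the coordinates drift far from the template-derived model.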
Homology Modeling
III. Satisfaction of Spatial restraints
Alignment of target sequence with templates
Extraction of spatial restraints from templates
Modeling by satisfaction of spatial restraints
Some features of a protein structure:
R: resolution of X-ray experiment
r: amino acid residue type
Φ, Ψ: main chain angles
t: secondary structure class
M: main chain conformation class
Χi, ci: side chain dihedral angle class
a: residue solvent accessibility
s: residue neighborhood difference
d: Cα - Cα distance
Δd: difference between two Cα - Cα distances
III. Satisfaction of Spatial restraints
Feature properties can be associated with
a protein (e.g. X-ray resolution)
residues (e.g. solvent accessibility)
pairs of residues (e.g. Cα - Cα distance)
other features (e.g. main chain classes)
How can we derive modeling restraints from this data?
A restraint is defined as a probability density function (pdf) p(x):
\[ p(x_1 \le x < x_2) = \int_{x_1}^{x_2} p(x)\,dx \]
with
\[ p(x) > 0, \qquad \int p(x)\,dx = 1 \]
III. Satisfaction of Spatial restraints
a) 11 Cys residues Chi-1 angles
b) smoothed distribution from a)
c) 297 Cys Chi-1 angles as control
III. Satisfaction of Spatial restraints
Derive pdfs from frequency tables by smoothing:
[Figure: frequency tables for feature ranges such as 0.2 < s' < 0.4, 0.4 < s' < 0.6, 0.2 < s'' < 0.4 and 0.4 < s'' < 0.6]
III. Satisfaction of Spatial restraints
Combine basis pdfs to molecular probability density functions
Satisfaction of spatial restraints
Find the protein model with the highest probability
Variable target function:
• Start with a linear conformation model or a model close to the template conformation
• At first, use only local restraints
• Minimize for some steps using conjugate gradient optimization
• Repeat, introducing more and more long-range restraints until all restraints are used
III. Satisfaction of Spatial restraints
Optimization schedule and progress
EVA: Evaluation of Automatic protein structure prediction [ Burkhard Rost, Andrej Sali, http://maple.bioc.columbia.edu/eva/ ]
CASP: Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction http://predictioncenter.org/casp7/
Model Accuracy Evaluation
Evaluation of Automatic protein structure prediction
[ Burkhard Rost, Andrej Sali, http://maple.bioc.columbia.edu/eva/ ]
[Figure: EVA cycle with numbered steps 1 to 3, linking target sequences (e.g. MNIFEMLRID EGLRLKIYKD TEGYYTIGIG HLLTKSPSLN AAKSELDKAI GRNCNGVITK) from a new PDB release, prediction servers, and evaluation of prediction accuracy]
Typical types of errors
Sequence alignment errors.
Loops which cannot be rebuilt.
Inappropriate template selection.
Subunit displacement.
Structural rearrangements cause problems for template selection and automated evaluation because they are sequence independent:
e.g. flap region in adenylate kinases (1AKE, 4AKE)
e.g. DNA-binding domains (1AWC, 1ETC)
Problem:
How can we identify errors in 3-dimensional protein structures (without knowing the correct answer)?
Protein Structure Evaluation
Bond & Angle Geometry
Molecular Interactions
Empirical Force Fields
Statistical Methods
e.g. GROMOS, CHARMM, AMBER, ...
Which types of errors in a protein structure can you identify with an empirical force field?
Which types of errors are not recognized?
Empirical Force Fields
Useful to identify regions with errors in geometry
Statistical Methods
Ramachandran Plot of backbone angles (ϕ,ψ)
• favored regions
• generously allowed regions
• disallowed regions
Amino acids with special properties:
• PRO: ϕ ≈ -60º
• GLY
Similar plots for χ-angle distributions
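The backbone angles in a Ramachandran plot are dihedral angles defined by four consecutive atoms: ϕ by C(i-1), N(i), Cα(i), C(i) and ψ by N(i), Cα(i), C(i), N(i+1). The sketch below computes a dihedral from four points with the standard atan2 formulation; the function and helper names are illustrative, and reading the coordinates from a structure file is not shown.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle about the p1-p2 axis, in degrees (-180 to 180).

    A cis arrangement gives 0 and a trans arrangement gives 180.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal of the plane p0, p1, p2
    n2 = _cross(b2, b3)          # normal of the plane p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

The same function applies to side chain χ angles once the four defining atoms are chosen.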
Probability for a feature to occur in a given environment,
e.g.
Solvent exposed / buried
Hydrophobic / polar environment
Electrostatic interactions
Secondary structure
See: R. Luthy (1992) Assessment of protein models with
three-dimensional profiles, Nature, 356(6364):83-5
1D - 3D Checks
[Figure: 3D-1D profile with marked residues: I, Val13; *, Met80; +, Ile86; II, Phe134; III, Ala182; panels A and B]
Statistical Mean Force Potentials
Atomic non-local interaction energy.
[Figure: atom type definitions; mean force potential (MFP, kcal/mol) vs. distance (Å) for methyl-methyl pairs and cysteine S-S pairs]
Statistical Mean Force Potentials
Use the inverse Boltzmann law to derive an atomic Potential of Mean Force (Ū) from the observed number of atomic pairs (i,j) within a distance shell r±Δr in the training database of protein structures:
\[ \bar{U}(i,j,r) = -RT \ln \frac{N_{\text{observed}}(i,j,r)}{N_{\text{expected}}(i,j,r)} \]
R: gas constant
T: temperature
N_expected is the expected number of atomic pairs (i,j) in the same distance shell if there were no interactions between atoms (reference state).
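The inverse Boltzmann relation is easy to evaluate for a single distance shell. In the sketch below the function name, the temperature default of 298 K and the choice of kcal/(mol·K) for the gas constant are assumptions for illustration; the pair counts would normally come from a training database of structures.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)

def mean_force_potential(n_observed, n_expected, temperature=298.0):
    """Inverse Boltzmann: U = -RT * ln(N_observed / N_expected)."""
    if n_observed <= 0 or n_expected <= 0:
        raise ValueError("pair counts must be positive")
    return -GAS_CONSTANT * temperature * math.log(n_observed / n_expected)

# A pair seen twice as often as expected is favorable (negative energy)
print(mean_force_potential(200, 100))
# A pair seen half as often as expected is unfavorable (positive energy)
print(mean_force_potential(50, 100))
```

Shells with very few observations give noisy or undefined values, so in practice sparse counts need special handling that this sketch leaves out.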
ANOLEA : (Atomic Non-Local Environment Assessment)
http://protein.bio.puc.cl/cardex/servers/anolea/
http://swissmodel.expasy.org/anolea/
Correct structure (PDB: 1GES)
Model with wrong alignment:
Detects local packing errors
Errors in alignments
ANOLEA
Checks the stereo-chemical quality of a protein structure, producing a
number of plots analyzing its overall and residue-by-residue geometry.
• Covalent geometry
• Planarity
• Dihedral angles
• Chirality
• Non-bonded interactions
• Main-chain hydrogen bonds
• Disulphide bonds
• Stereochemical parameters
• Residue-by-residue analysis
Laskowski R A, MacArthur M W, Moss D S & Thornton J M (1993). PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst., 26, 283-291.
Morris A L, MacArthur M W, Hutchinson E G & Thornton J M (1992). Stereochemical quality of protein structure coordinates. Proteins, 12, 345-364.
PROCHECK
WHAT IF I check my structure?
Imagine ...
• An everyday situation in a biocomputing lab: "Should they use the structure?"
• An everyday situation in a crystallography lab: "Should they deposit the structure already?"
In a WHAT_CHECK report, each reported fact has an assigned severity:
error: severe errors encountered during the analyses. Items marked as errors are considered severe problems requiring immediate attention.
warning: either less severe problems or uncommon structural features. These still need special attention.
note: statistical values, plots, or other verbose results of tests and analyses that have been performed.
WHAT IF: A molecular modeling and drug design program. G. Vriend, J. Mol. Graph. (1990) 8, 52-56.
Errors in protein structures. R.W.W. Hooft, G. Vriend, C. Sander, E.E. Abola, Nature (1996) 381, 272-272.
WhatCheck / WhatIf
# 49 # Note: Summary report for users of a structure
This is an overall summary of the quality of the structure as compared with current reliable structures. This summary is most useful for biologists seeking a good structure to use for modelling calculations.

The second part of the table mostly gives an impression of how well the model conforms to common refinement constraint values. The first part of the table shows a number of constraint-independent quality indicators.

Structure Z-scores, positive is better than average:
1st generation packing quality : -2.550
2nd generation packing quality : -5.472 (bad)
Ramachandran plot appearance : -1.898
chi-1/chi-2 rotamer normality : -1.433
Backbone conformation : -2.173

RMS Z-scores, should be close to 1.0:
Bond lengths : 0.905
Bond angles : 1.476
Omega angle restraints : 0.921
Side chain planarity : 2.681 (loose)
Improper dihedral distribution : 1.771 (loose)
Inside/Outside distribution : 1.333 (unusual)
whatcheck.txt
WhatCheck / WhatIf report for a bad model ...
All checking tools are happy, so can I believe it now?
Models are not experimental facts !
Models can be partially inaccurate or sometimes completely wrong !
A model is a tool that helps to interpret biochemical data.
ANOLEA : (Atomic Non-Local Environment Assessment)
• http://protein.bio.puc.cl/cardex/servers/anolea/
• http://swissmodel.expasy.org/anolea/
ProCheck
• http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
WhatCheck
• http://www.cmbi.kun.nl/gv/whatcheck/
Verify3D
• http://www.doe-mbi.ucla.edu/Services/Verify_3D/
Biotech Validation Suite for Protein Structures
• http://biotech.ebi.ac.uk:8400/
Some useful Evaluation Tools
What can models be used for ?
“A Model must be wrong, in some respects, else it would
be the thing itself. The trick is to see where it is right.”
(Henry A. Bent)
[Figure: model quality vs. sequence identity, with regions labeled Safe Zone, Twilight Zone and Midnight Zone]
• Annotation by fold assignment
• 3D-motif searching, active site recognition
• Including NMR restraints
• Supporting site directed mutagenesis
• X-ray molecular replacement models
• Docking of small molecules
• Drug development; comparable to medium resolution NMR or low resolution X-ray structures
What can models be used for ?
Knowledge of the 3-dimensional structures of target proteins makes it possible to understand the interactions of inhibitors and drugs with their target proteins.
Application example: Understanding drug interactions
Reference:
Discovery of a potent and selective protein kinase CK2 inhibitor by high-throughput docking.
Vangrevelinghe E, Zimmermann K, Schoepfer J, Portmann R, Fabbro D, Furet P. Oncology Research, Novartis Pharma, Basle. J Med Chem. 2003 Jun 19;46(13):2656-62.
Discovery of CK2a Inhibitors by in silico docking
Homology model of
the target molecule:
Medicines are not Effective in all Patients
Group            Incomplete/absent efficacy
SSRI             10-25%
ACE-I            10-30%
Beta blockers    15-25%
Statins          30-70%
Beta2 agonists   40-70%
[ Spear BB (2001) Trends Mol Med;7(5):201-204 ]
Inter-individual differences in drug efficacy:
E.g. Changes in the electrostatic properties upon mutation
[Figure: electrostatic potential color scale from -8 to +8 kT/e; labeled positions 1 to 8]
Structural analysis of human mutations and nsSNPs
[Figure: public database holdings for TrEMBL, SwissProt and PDB on a logarithmic scale (100 to 1'000'000), 1986 to 2004]
• large scale experimental structure solution projects
Goal: Most of the sequences in a genome database should match
at least one structure with a sufficient sequence identity
allowing for reliable modeling.
Range of sequence space that can be modeled with acceptable accuracy.
The modeling error determines selection of targets for structural genomics.
Structural Genomics
Structural Genomics – Target Selection
Protein Modeling Resources
SWISS-MODEL http://swissmodel.expasy.org
Modeller http://www.salilab.org
WhatIf http://www.cmbi.kun.nl/whatif/
3D-JIGSAW http://www.bmm.icnet.uk/people/paulb/3dj/form.html
CPHmodels http://www.cbs.dtu.dk/services/CPHmodels/
SDSC1 http://cl.sdsc.edu/hm.html