Optimisation & Landscape Analysis
for Studying Protein Folding
Bridging the Gap Workshop: Dynamic Optimisation
University of Birmingham, 24 February 2011
Roy L. Johnston
School of Chemistry
University of Birmingham
Overview
• Introduction
– The Protein Folding Problem
– Protein Models
• Genetic Algorithms
– HP Lattice Bead Model
– Dynamical Lattice Model
• Energy Landscapes for Protein Folding
– BLN Model
– Principal Component Analysis
– Landscape Complexity
• Conclusions
The Protein Folding Problem
• To predict the 3D local spatial arrangement (secondary structure)
and folded conformation (tertiary structure) of a protein from
knowledge of its primary structure – the 1D sequence of amino acid
residues.
Why Study Protein Folding?
• To rationalise and predict the relationship between sequence, 3D
structure and function.
• To understand the effect of mutations on protein structure and
function.
• To understand protein folding dynamics – e.g. in order to
understand protein misfolding diseases (Alzheimer's, CJD etc.).
Search Methods in Protein Folding
• Even for the minimalist HP lattice bead model, global optimization is NP-hard.
• Search methods adopted include:
– Monte Carlo
– Simulated Annealing
– Chain Growth Algorithms
– Genetic Algorithms (Unger & Moult)
– Ant Colony Optimization (Hoos)
– Immune Algorithms (Cutello, Timmis et al.)
Protein Models
• Bead Models
– minimalist models – each amino acid is
represented by a bead, usually based on
their hydrophobic or hydrophilic nature
– e.g. HP and BLN models
– beads may be constrained to a lattice or
may be off-lattice.
• United Atom Models
– with backbone and side chain beads
– e.g. Dynamical Lattice Model.
• All-Atom Models
– full atomistic treatment of protein
– e.g. CHARMM, AMBER.
The HP Lattice Bead Model
Amino acids are classed as either Hydrophobic (H) or Polar (P).
Each amino acid is represented as a hard sphere (“bead”) on a lattice (e.g. 2-D square and 3-D diamond lattices).
Interactions occur between beads which are adjacent on the lattice (topological neighbours) but are not directly bonded (sequence neighbours).
Standard HP Model
$$E = \sum_{i<j} \varepsilon_{ij}\,\Delta_{ij}$$
$\varepsilon_{HH} = -1$, $\varepsilon_{HP} = \varepsilon_{PP} = 0$
$\Delta_{ij} = 1$ if i and j are topological neighbours (but not sequence neighbours)
$\Delta_{ij} = 0$ otherwise.
[Figure: example lattice conformation with P and H beads labelled]
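As a minimal sketch of the energy sum above, the function below evaluates an HP chain on a 2-D square lattice. It assumes the usual convention of ε_HH = -1 for each non-bonded H-H contact; the function name `hp_energy` and the coordinate representation are illustrative choices, not taken from the slides.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def hp_energy(sequence: str, coords: List[Coord], eps_hh: float = -1.0) -> float:
    """Energy of an HP chain on a 2-D square lattice.

    Adds eps_hh for every pair of H beads that are lattice neighbours
    (topological neighbours) but not adjacent along the chain.
    eps_hh = -1 is an assumed convention (standard HP model).
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("invalid structure: superimposed beads")
    energy = 0.0
    n = len(sequence)
    for i in range(n):
        for j in range(i + 2, n):  # j >= i + 2 skips sequence neighbours
            if sequence[i] == "H" and sequence[j] == "H":
                dx = abs(coords[i][0] - coords[j][0])
                dy = abs(coords[i][1] - coords[j][1])
                if dx + dy == 1:  # beads sit on adjacent lattice sites
                    energy += eps_hh
    return energy


# Four beads folded around a unit square: beads 0 and 3 are in contact.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hp_energy("HPPH", square))
print(hp_energy("HPPP", square))
```

The double loop is quadratic in chain length, which is adequate for the 20 to 50 bead benchmark sequences; a dictionary of occupied sites would scale better for longer chains.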
Protein Folding GA: HP Model
• Local coordinate scheme:
– Conformation vector c = {131221 …}
– Sequence vector s = {HPPHHP …}
• Initial valid conformations generated
using Recoil Growth Algorithm.
• Fitness simply related to energy of the conformation: F_i = −E_i + 0.01
• Roulette wheel & Brood selection.
• 1-point crossover.
• Variety of mutation operators.
• Monte Carlo local search.
• Diversity checking – no duplicate structures allowed.
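The sketch below illustrates how a local-coordinate conformation vector might be decoded and scored. The move coding (1 = straight on, 2 = turn left, 3 = turn right, on a 2-D square lattice) is a hypothetical choice, since the slides do not define the digits. The fitness is taken as -E + 0.01 on the assumption that favourable energies are negative. `decode`, `fitness` and `roulette_select` are illustrative names.

```python
import random
from typing import List, Optional, Tuple

Coord = Tuple[int, int]


def decode(conformation: List[int]) -> Optional[List[Coord]]:
    """Turn relative moves into lattice coordinates.

    Hypothetical coding: 1 = straight on, 2 = turn left, 3 = turn right.
    Returns None for an invalid structure (superimposed beads).
    """
    coords = [(0, 0), (1, 0)]  # first bond fixed along +x
    dx, dy = 1, 0              # current heading
    for move in conformation:
        if move == 2:
            dx, dy = -dy, dx   # rotate heading 90 degrees anticlockwise
        elif move == 3:
            dx, dy = dy, -dx   # rotate heading 90 degrees clockwise
        elif move != 1:
            raise ValueError(f"unknown move {move}")
        x, y = coords[-1]
        nxt = (x + dx, y + dy)
        if nxt in coords:
            return None
        coords.append(nxt)
    return coords


def fitness(energy: float) -> float:
    """Assumed form: the offset keeps zero-energy folds selectable."""
    return -energy + 0.01


def roulette_select(fitnesses: List[float], rng: random.Random) -> int:
    """Return an index chosen with probability proportional to fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0.0, total)
    running = 0.0
    for index, value in enumerate(fitnesses):
        running += value
        if pick <= running:
            return index
    return len(fitnesses) - 1  # guard against floating-point round-off


print(decode([2, 2]))     # three moves on from the fixed first bond
print(decode([2, 2, 2]))  # the chain runs back into its first bead
```

Because the first bond is fixed, a chain of n beads needs n - 2 moves in this coding, which removes trivially rotated copies of the same fold from the search space.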
Invalid Structure
(superimposed beads)
Dead End Structure
(no further growth possible)
Crossover
1-point crossover leads to higher GA success rates (and fewer
structure evaluations) than 2-pt.
1-pt crossover is better at maintaining schemata (good regions
of local structure).
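A minimal sketch of 1-point crossover on conformation vectors follows, assuming both parents are equal-length lists of moves. The function name is illustrative. Offspring produced this way can contain superimposed beads, so in practice they would still need a validity check or repair step.

```python
import random
from typing import List, Tuple


def one_point_crossover(
    parent_a: List[int], parent_b: List[int], rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Swap the tails of two conformation vectors at one random cut point."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if len(parent_a) < 2:
        raise ValueError("need at least two genes to cut between")
    cut = rng.randrange(1, len(parent_a))  # cut lies strictly inside the vector
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


rng = random.Random(0)
print(one_point_crossover([1, 1, 1, 1, 1, 1], [3, 3, 3, 3, 3, 3], rng))
```

With a single cut, each child inherits one contiguous block from each parent, which is the property credited above with preserving good regions of local structure.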
Mutation
Corrector operator – introduced to “repair” invalid structures
generated by mutation. (Sequential 1-bit changes.)
Benchmark Sequences
Name    E(GM)   Sequence                               % Success
HP-20   -9      HPHP2H2PHP2HPH2P2HPH                   99.5
HP-24   -9      H2P2(HP2)6H2                           93.5
HP-25   -8      P2HP2(H2P4)3H2                         86.5
HP-36   -14     P3H2P2H2P5H7P2H2P4H2P2HP2              4.0
HP-48   -23     P2H(P2H2)2P5H10P6(H2P2)2HP2H5
HP-50   -21     H2(PH)3PH4P(HP3)3P(HP3)2HPH4(PH)4H
200 GA runs. Parameters: X-over = 1.0, mutation = 0.5, elitism = 30%.
Structures sampled capped at 60,000.
Modified GA
Local Search
• Introduce long range Monte Carlo move operator to allow local
searching around each offspring and mutant.
• Conformation c1 (energy = E1) undergoes random fold mutation
(changing one bit in conformation vector): c1(E1) → c2(E2)
– E2 < E1: accept move.
– E2 > E1: accept move with probability p = E2/15E1
• 30 attempted MC steps = 1 local search.
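The local search described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the energy function and the uphill acceptance rule are passed in by the caller, and the demonstration uses a toy energy with a purely downhill placeholder rule (`accept_prob` always 0) rather than the acceptance probability quoted on the slide.

```python
import random
from typing import Callable, List, Sequence, Tuple


def local_search(
    conformation: List[int],
    energy_fn: Callable[[List[int]], float],
    accept_prob: Callable[[float, float], float],
    rng: random.Random,
    steps: int = 30,
    alphabet: Sequence[int] = (1, 2, 3),
) -> Tuple[List[int], float]:
    """Run `steps` attempted single-gene Monte Carlo moves.

    A move to lower energy is always accepted; any other move is accepted
    with probability accept_prob(e1, e2), supplied by the caller.
    """
    current = list(conformation)
    e1 = energy_fn(current)
    for _ in range(steps):
        trial = list(current)
        pos = rng.randrange(len(trial))
        # change one gene to a different allowed value
        trial[pos] = rng.choice([a for a in alphabet if a != trial[pos]])
        e2 = energy_fn(trial)
        if e2 < e1 or rng.random() < accept_prob(e1, e2):
            current, e1 = trial, e2
    return current, e1


def toy_energy(conf: List[int]) -> float:
    """Stand-in energy for the demonstration only."""
    return float(sum(conf))


rng = random.Random(0)
start = [3, 2, 3, 1, 2, 3]
print(local_search(start, toy_energy, lambda e1, e2: 0.0, rng))
```

Passing the acceptance rule as a callable keeps the loop independent of the particular criterion, so a Metropolis-style or energy-ratio rule can be swapped in without touching the search itself.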
Brood Selection
• More than 2 offspring generated from a selected pair of parents.
• The best 2 offspring replace the parents.
• Allows wider exploration of crossover space around the two parents.
• Optimum brood size = 5.
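Brood selection as listed above might look like the following sketch. It assumes a 1-point crossover operator and an energy function where lower is better; both, along with the name `brood_select`, are illustrative stand-ins rather than the authors' code.

```python
import random
from typing import Callable, List, Tuple


def one_point(a: List[int], b: List[int], rng: random.Random) -> List[List[int]]:
    """Illustrative crossover operator returning two children."""
    cut = rng.randrange(1, len(a))
    return [a[:cut] + b[cut:], b[:cut] + a[cut:]]


def brood_select(
    parent_a: List[int],
    parent_b: List[int],
    energy_fn: Callable[[List[int]], float],
    rng: random.Random,
    brood_size: int = 5,
) -> Tuple[List[int], List[int]]:
    """Generate a brood from one parent pair and keep the best two offspring."""
    if brood_size < 2:
        raise ValueError("brood_size must be at least 2")
    brood: List[List[int]] = []
    while len(brood) < brood_size:
        brood.extend(one_point(parent_a, parent_b, rng))
    brood = brood[:brood_size]
    brood.sort(key=energy_fn)  # lowest energy first
    return brood[0], brood[1]


rng = random.Random(0)
best, second = brood_select([1] * 6, [3] * 6, lambda c: float(sum(c)), rng)
print(best, second)
```

Each extra crossover samples a different cut point, so a brood of five explores several recombinations of the same two parents before any of them enters the population.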
Comparison with Previous GA
           This Work                              Unger & Moult
Sequence   E(GM)   D(GM)   %Success   Neval       E(GM)   Neval
HP-20      -9      2       100        18,338      -9      30,492
HP-24      -9      19      100        27,278      -9      30,491
HP-25      -8      16      100        35,128      -8      20,400
HP-36      -14     192     70         113,667     -14     301,339
HP-48      -23     285     13         261,311     -22     126,547
HP-50      -21     370     100        97,691      -21     592,887
200 GA runs. Parameters: X-over = 1.0, mutation = 0.1, elitism = 30%, DPL = 1,
local search, brood size = 5.
Maximum generations = 100.
G.A. Cox, T. V. Mortimer-Jones, R. P. Taylor, RLJ, Theor. Chem. Acc. 112, 163-178 (2004).
Example Global Minima for Benchmark Sequences
[Figure: global-minimum structures for HP-20, HP-24, HP-25, HP-36, HP-48 and HP-50]
Dynamical Lattice Model*
* F. Dressel, S. Kobe, Chem. Phys. Lett. 424, 369-373 (2006).
Amino acid residues
have preferred conformations
determined by backbone
angles φ and ψ.
Dynamical lattice = discrete
but non-regular grid.
From cluster analysis of
Ramachandran plots, certain
allowed (φ, ψ) pairs are
defined for each residue.
Dynamical Lattice Model
Graham Cox; S. Kobe, F. Dressel (Dresden)
• Amino acid residues treated explicitly.
• Beads for all backbone atoms (N, Cα, C).
• Hard sphere beads for side chains (R).
• Energy obtained by summing
interactions between C beads.
[Equation: pair energy E_ij = e_ij × (tanh-based function of the bead separation r_ij), with e_ij = e(Se_i, Se_j)]
$$E_T = \sum_{i,j}^{n} E_{ij}$$
Dynamical Lattice Model
Parameters
* From cluster analysis
Dynamical Lattice Model
GA Parameters
Population 200
Structure limit 500,000
X-over 1.0
Mutation 0.1
Elitism 30%
Local search 20%
No duplicates allowed
Repair invalid structures
(hard sphere overlap)
Gene coding (e.g. cysteine): integer codes 0, 1, 2, 3 each select an allowed (φ, ψ) pair.
Results (400 GA runs)
Chain lengths: n = 13, 14, 17, 20, 20, 21, 26, 28
* = GM from Branch & Bound Search
GM(B+B) = -2.5157
1AL1 (right-handed α-helix, n = 13): XELLKKLLEELKG
1A1P (Compstatin, n = 14): ICVVQDWGHHRCTX
[Figures: PDB structure compared with GA structure for each peptide]
Energy Landscapes
• The EL determines kinetics and thermodynamics of e.g. clusters, liquids,
glasses and biomolecules.
• Determines ability of system to
find the global minimum energy
and of search methods to find the GM.
• Examples:
– Potential Energy Surfaces
– Free Energy Surfaces (as function of T)
• Multidimensional surfaces are difficult
to visualise.
• Consider connected network of minima and transition states:
– Eigenvector following, successive confinement, parallel tempering, and nudged elastic band methods.
[Figure axis: Energy]
Representing Energy Landscapes
• Disconnectivity Graph (DG) approach (Hoffman, Sibani, Schoen, Becker & Karplus, Berry et al., Wales et al.) allows visualisation of the connectivity of high-dimensional PES (e.g. for proteins, clusters, spin glasses).
• BUT – the x-coordinate has no meaning. Can more physically meaningful coordinates be obtained?
Metric Disconnectivity Graphs (MDGs):
– Reproducible placement of superbasins.
– Separation of superbasins reflects structural difference.
– Thickness of line represents “size” of superbasin.
[Figure: two panels, each with an energy (E) axis]
Principal Component Analysis (PCA)
• Perform a linear transformation of coordinates of energy minima and transition states (1st rank saddles).
• Identify the principal components – coordinates that maximise the variance of the system.
• PCA finds orthogonal lines of best fit through a data set.
• These lines of best fit are used as coordinates to re-plot the data.
• This analysis can be used to show and visualise trends in multi-dimensional data.
[Figure: data in (x, y, z) re-plotted along principal directions D1 and D2]
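To make the PCA steps concrete, the sketch below finds the first principal component of a small data set by power iteration on the covariance matrix. It uses only the standard library. Power iteration is a swapped-in technique chosen for brevity, not the method stated in the slides, and the function names and sample data are illustrative.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def covariance(data: Matrix) -> Matrix:
    """Sample covariance matrix of row-wise observations."""
    n, d = len(data), len(data[0])
    mean = [sum(row[k] for row in data) / n for k in range(d)]
    centred = [[row[k] - mean[k] for k in range(d)] for row in data]
    return [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def first_pc(cov: Matrix, iterations: int = 500) -> Tuple[List[float], float]:
    """Leading eigenvector and eigenvalue of cov by power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    cv = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
    eigenvalue = sum(v[a] * cv[a] for a in range(d))
    return v, eigenvalue


def explained_fraction(cov: Matrix, eigenvalue: float) -> float:
    """Share of the total variance (trace of cov) carried by one component."""
    return eigenvalue / sum(cov[k][k] for k in range(len(cov)))


# Made-up points lying close to the line y = 2x.
points = [[0.0, 0.1], [1.0, 1.9], [2.0, 4.2], [3.0, 5.8]]
cov = covariance(points)
direction, variance = first_pc(cov)
print(direction, explained_fraction(cov, variance))
```

For real structural data a library eigensolver would normally replace the power iteration, since all components are usually needed and the iteration converges slowly when the top two eigenvalues are close.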
PCA for Proteins
• There are several ways to represent the structure of a protein:
– dihedral angles
– (x,y,z) Cartesian co-ordinates of atoms
• We have (mostly) used (x,y,z) co-ordinates, with translations
and rotations removed.
PCA-based Disconnectivity Graphs
• PCA can be combined with DGs to produce MDGs in which the x and y axes
are used to display structural information.
• Structures are grouped into “superbasins” that are mutually accessible without
passing through a transition state with energy > Emax.
• The MDG is produced using the average coordinate of all members of a
superbasin to place the node.
• The number of superbasins and the connectivity is assessed at intervals Esep.
• The thickness of the lines can be used to represent the number of structures or
the structural diversity within the superbasin.
• Structural diversity = number of dimensions needed to reproduce (say) 99%
(SD0.99) of the information (variance) within a superbasin.
[Figure panels: Energy axes, with horizontal coordinates labelled X1, V and Q1]
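The superbasin grouping and the SD0.99 measure described above can be sketched as below. The union-find grouping is one reasonable way to do it, not necessarily the authors' procedure. The data layout (a dictionary of minimum energies and a list of transition states as `(minimum_a, minimum_b, energy)` tuples) and all names are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def superbasins(
    minima: Dict[str, float],
    transition_states: List[Tuple[str, str, float]],
    e_max: float,
) -> List[List[str]]:
    """Group minima that interconvert without crossing a TS above e_max."""
    parent = {m: m for m, e in minima.items() if e <= e_max}

    def find(m: str) -> str:
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for a, b, e_ts in transition_states:
        if e_ts <= e_max and a in parent and b in parent:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for m in parent:
        groups.setdefault(find(m), []).append(m)
    return sorted(sorted(g) for g in groups.values())


def structural_diversity(eigenvalues: List[float], fraction: float = 0.99) -> int:
    """Number of principal components needed to reach `fraction` of the variance."""
    total = sum(eigenvalues)
    if total <= 0.0:
        return 0
    running = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total >= fraction:
            return count
    return len(eigenvalues)


# Made-up landscape: four minima joined in a chain by three transition states.
minima = {"A": -3.0, "B": -2.5, "C": -2.0, "D": -1.0}
ts = [("A", "B", -1.5), ("B", "C", -0.5), ("C", "D", 0.0)]
for e_max in (-2.2, -1.0, -0.4, 0.0):  # thresholds spaced by some Esep
    print(e_max, superbasins(minima, ts, e_max))
```

Repeating the grouping at successive energy thresholds gives the node list of the disconnectivity graph: superbasins that merge between one threshold and the next are joined at that level.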
The Off-Lattice BLN Bead Model
• 3 Types of bead: HydrophoBic (B), HydrophiLic (L)
and Neutral (N).
• Off-lattice model – has bond-stretching (r), angle-bending (θ),
torsional (φ) and through-space (Lennard-Jones) components.
[Figure labels: anti and gauche torsional conformations]
Gō Model
All non-native attractive contacts removed.
Single-funnel PES with same GM as BLN.
Efficient folding.
46-Bead BLN Model
Global minimum is a 4-strand β-barrel.
Frustrated PES.
Inefficient folding.
PCA for the 46-bead BLN Model
• 1st PC (Q1) contains approx. 30% of total variance.
• 1st + 2nd PCs (Q1,Q2) contain approx. 45% of total variance.
• 2D and 3D disconnectivity graphs can be plotted against Q1 and Q2.
• Line thickness related to structural diversity within a superbasin.
T. Komatsuzaki, K. Hoshino, Y. Matsunaga, G.J. Rylance, RLJ, D.J. Wales, J. Chem. Phys. 122, 084714 (2005).
[Figure: 2D and 3D disconnectivity graphs for the BLN and Gō models]
3D Disconnectivity Graphs: Gō vs. BLN
[Figure panels: Gō, BLN]
Dihedral Angles: 46-Bead Gō Model
Disconnectivity graph based on 43 dihedrals
[Figure panels: A, B]
Landscape Complexity
• How can we quantify the complexity of an energy landscape?
• Residential probability (p_r): probability of being located in a given superbasin at a certain energy.
• Branching probability (p_b): probability of taking a particular path to a given superbasin compared to all possible paths leading from the parent node.
• Landscape complexity (C_L): Shannon entropy of residential probabilities:
$$C_L(V_i) = -\sum p_r \log p_r$$
• Path complexity (C_P): Shannon entropy of branching probabilities:
$$C_P(V_i) = -\sum p_b \log p_b$$
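Both complexity measures are Shannon entropies, so a single helper covers them. The sketch below assumes natural logarithms and takes the probabilities as given. How residential and branching probabilities are obtained from the landscape is not specified here, so the example values are made up.

```python
import math
from typing import Sequence


def shannon_entropy(probabilities: Sequence[float]) -> float:
    """Return -sum(p * log p) over a normalised probability distribution.

    Zero-probability entries contribute nothing (limit of p log p as p -> 0).
    Natural logarithm is an assumption; the base only rescales the result.
    """
    if any(p < 0.0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


# Hypothetical residential probabilities at one energy level.
funnelled = [0.97, 0.01, 0.01, 0.01]   # one dominant superbasin
frustrated = [0.25, 0.25, 0.25, 0.25]  # four equally populated superbasins
print(shannon_entropy(funnelled), shannon_entropy(frustrated))
```

A landscape dominated by one superbasin gives an entropy near zero, while evenly populated competing superbasins give the maximum value log(N), which is the sense in which a frustrated landscape is more complex than a single funnel.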
Complexity: Gō vs. BLN
[Figure: landscape complexity plots for the Gō and BLN models, annotated CL = 1.23 − 1 and CL = 3.89 − 1]
Conclusions
• Nature-inspired Computation has proven to be successful in searching for low energy folds for simple model proteins.
• Disconnectivity Graphs – give information about the connectivity and important energy barriers on energy (and other) landscapes.
• The Dynamical Lattice Model shows promise as an intermediate
between simple bead models and all-atom models.
• PCA-based visualisation and complexity analysis of protein
folding landscapes allows us to explain difficulties encountered
by global optimisation algorithms in certain cases – and may aid
the design of more robust search algorithms.
Acknowledgments
Birmingham
• Dr Gareth Rylance
• Dr Graham Cox
• Dr Ben Curley
• Dr Lesley Lloyd
• Dr Andrew Bennett
• Dr Jun He (now Aberystwyth)
• Eleanor Turpin
External
• Prof. Sigismund Kobe (Dresden)
• Prof. Tamiki Komatsuzaki (Kobe)
• Prof. David Wales (Cambridge)
• Prof. Martin Karplus & Dr Paul Maragakis (Harvard)
• Prof. Said Salhi (Kent)
Funding
• EPSRC
• The Royal Society
• Wellcome VIP Scheme
• JSPS
• Leverhulme Trust
• University of Birmingham
• BlueBEAR
Simulation of Self-Assembly (Programme Grant EP/I001352), 2010-15
Cambridge
– Daan Frenkel, David Wales & Mark Miller
Oxford
– Jon Doye
Birmingham
– Roy Johnston, Mark Oakley
• Methods
– New coarse-grained potentials.
– Analysis of potential energy, free energy and other landscapes.
– New hybrid search algorithms.
– Dynamical and thermodynamic simulations.
– Investigation of hierarchical self-assembly.
• Example Systems
– Proteins; DNA, RNA; Liquid crystals