Optimisation & Landscape Analysis for Studying Protein Foldingbtg.bham.ac.uk/events/workshops/ws7/PresentationJohnston.pdf · Optimisation & Landscape Analysis for Studying Protein

Optimisation & Landscape Analysis

for Studying Protein Folding

Bridging the Gap Workshop: Dynamic Optimisation

University of Birmingham, 24 February 2011

Roy L. Johnston

School of Chemistry

University of Birmingham

Overview

• Introduction

– The Protein Folding Problem

– Protein Models

• Genetic Algorithms

– HP Lattice Bead Model

– Dynamical Lattice Model

• Energy Landscapes for Protein Folding

– BLN Model

– Principal Component Analysis

– Landscape Complexity

• Conclusions

The Protein Folding Problem

• To predict the local 3D spatial arrangement (secondary structure) and folded conformation (tertiary structure) of a protein from knowledge of its primary structure – the 1D sequence of amino acid residues.

Why Study Protein Folding?

• To rationalise and predict the relationship between sequence, 3D

structure and function.

• To understand the effect of mutations on protein structure and

function.

• To understand protein folding dynamics, e.g. in order to understand protein misfolding diseases (Alzheimer's, CJD etc.).


Search Methods in Protein Folding

• Even for the minimalist HP lattice bead model, global optimization is NP-hard.

• Search methods adopted include:

– Monte Carlo

– Simulated Annealing

– Chain Growth Algorithms

– Genetic Algorithms (Unger & Moult)

– Ant Colony Optimization (Hoos)

– Immune Algorithms (Cutello, Timmis et al.)

Protein Models

• Bead Models

– minimalist models – each amino acid is

represented by a bead, usually based on

their hydrophobic or hydrophilic nature

– e.g. HP and BLN models

– beads may be constrained to a lattice or

may be off-lattice.

• United Atom Models

– with backbone and side chain beads

– e.g. Dynamical Lattice Model.

• All-Atom Models

– full atomistic treatment of protein

– e.g. CHARMM, AMBER.

The HP Lattice Bead Model

Amino acids are classed as either Hydrophobic (H) or Polar (P).

Each amino acid is represented as a hard sphere (“bead”) on a lattice (e.g. 2-D square and 3-D diamond lattices).

Interactions occur between beads which are adjacent on the lattice (topological neighbours) but are not directly bonded (sequence neighbours).

Standard HP Model

$E = \sum_{i<j} \epsilon_{ij} \Delta_{ij}$

$\epsilon_{HH} = -1$, $\epsilon_{HP} = \epsilon_{PP} = 0$

$\Delta_{ij} = 1$ if $i$ and $j$ are topological neighbours (but not sequence neighbours)

$\Delta_{ij} = 0$ otherwise.

[Figure: example lattice conformation; legend: P and H beads.]
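The contact-counting energy of the standard HP model can be sketched in a few lines. This is a minimal illustration for a 2-D square lattice, assuming the conventional attractive H-H contact energy of −1; the function name `hp_energy` and the coordinate-list input are choices made for this example, not part of the original work.

```python
from itertools import combinations

def hp_energy(sequence, coords, eps_hh=-1.0):
    """Energy of an HP chain on a 2-D square lattice.

    sequence: string of 'H'/'P'; coords: one (x, y) lattice site per bead,
    in chain order. eps_hh = -1 is the assumed H-H contact energy.
    """
    if len(sequence) != len(coords):
        raise ValueError("one lattice site per bead is required")
    if len(set(coords)) != len(coords):
        raise ValueError("invalid structure: superimposed beads")
    energy = 0.0
    for i, j in combinations(range(len(sequence)), 2):
        if j - i == 1:
            continue  # sequence neighbours do not interact
        (xi, yi), (xj, yj) = coords[i], coords[j]
        adjacent = abs(xi - xj) + abs(yi - yj) == 1  # topological neighbours
        if adjacent and sequence[i] == "H" and sequence[j] == "H":
            energy += eps_hh
    return energy
```

For example, the four-bead chain `HPPH` folded around a unit square brings its two H beads into contact, giving one H-H interaction.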

Protein Folding GA: HP Model

• Local coordinate scheme:

– Conformation vector c = {131221 …}

– Sequence vector s = {HPPHHP …}

• Initial valid conformations generated

using Recoil Growth Algorithm.

• Fitness simply related to energy of the conformation: Fi = Ei + 0.01

• Roulette wheel & Brood selection.

• 1-point crossover.

• Variety of mutation operators.

• Monte Carlo local search.

• Diversity checking – no duplicate structures allowed.
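Roulette wheel selection, one of the selection schemes listed above, picks parents with probability proportional to fitness. The sketch below is a generic version that takes the fitness values as given; the function name and the `rng` parameter are assumptions made for illustration.

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0 or min(fitnesses) < 0:
        raise ValueError("fitnesses must be non-negative with a positive sum")
    pick = rng.random() * total          # uniform in [0, total)
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick < running:
            return individual
    return population[-1]                # guard against floating-point round-off
```

The small constant added to the fitness in the scheme above would keep every conformation, including those with zero energy, selectable with non-zero probability.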

Invalid Structure

(superimposed beads)

Dead End Structure

(no further growth possible)

Crossover

1-point crossover leads to higher GA success rates (and fewer

structure evaluations) than 2-pt.

1-pt crossover is better at maintaining schemata (good regions

of local structure).
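A minimal sketch of 1-point crossover on conformation vectors is given below, assuming both parents are lists of equal length. The offspring inherit an unbroken head from one parent and an unbroken tail from the other, which is why contiguous regions of local structure tend to survive. The function name and return convention are illustrative assumptions.

```python
import random

def one_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length conformation vectors at one cut point."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have the same length (at least 2)")
    cut = rng.randrange(1, len(parent_a))   # cut strictly inside the vector
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2, cut
```

Offspring produced this way are not guaranteed to be self-avoiding, so each one still needs a validity check before its energy is evaluated.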

Mutation

Corrector operator – introduced to “repair” invalid structures

generated by mutation. (Sequential 1-bit changes.)

Benchmark Sequences

Name    E(GM)   Sequence                               % Success
HP-20   9       HPHP2H2PHP2HPH2P2HPH                   99.5
HP-24   9       H2P2(HP2)6H2                           93.5
HP-25   8       P2HP2(H2P4)3H2                         86.5
HP-36   14      P3H2P2H2P5H7P2H2P4H2P2HP2              4.0
HP-48   23      P2H(P2H2)2P5H10P6(H2P2)2HP2H5
HP-50   21      H2(PH)3PH4P(HP3)3P(HP3)2HPH4(PH)4H

(In the sequences, a number gives the repeat count of the preceding residue or bracketed group, e.g. P2 = PP.)

200 GA runs. Parameters: X-over = 1.0, mutation = 0.5, elitism = 30%.

Structures sampled capped at 60,000.
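The compact benchmark notation can be expanded into full H/P strings with a small helper. This sketch reads a number as the repeat count of the preceding residue or bracketed group, and assumes brackets are not nested (true of the six sequences listed); `expand` is a name chosen for this example.

```python
import re

def expand(compact):
    """Expand e.g. 'H2P2(HP2)6H2' into the full string of H and P residues."""
    # Step 1: repeat bracketed groups, e.g. (HP2)6 -> HP2HP2HP2HP2HP2HP2
    flat = re.sub(r"\(([HP\d]+)\)(\d+)",
                  lambda m: m.group(1) * int(m.group(2)), compact)
    # Step 2: repeat single residues, e.g. P2 -> PP, H10 -> HHHHHHHHHH
    return re.sub(r"([HP])(\d+)",
                  lambda m: m.group(1) * int(m.group(2)), flat)

BENCHMARKS = {
    "HP-20": "HPHP2H2PHP2HPH2P2HPH",
    "HP-24": "H2P2(HP2)6H2",
    "HP-25": "P2HP2(H2P4)3H2",
    "HP-36": "P3H2P2H2P5H7P2H2P4H2P2HP2",
    "HP-48": "P2H(P2H2)2P5H10P6(H2P2)2HP2H5",
    "HP-50": "H2(PH)3PH4P(HP3)3P(HP3)2HPH4(PH)4H",
}
```

Each expanded sequence has the length given in its name, which is a quick consistency check on the notation.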

Modified GA

Local Search

• Introduce long range Monte Carlo move operator to allow local

searching around each offspring and mutant.

• Conformation c1 (energy = E1) undergoes random fold mutation (changing one bit in conformation vector): c1(E1) → c2(E2)

– E2 < E1: accept move.

– E2 > E1: accept move with probability p = E2/(15 E1).

• 30 attempted MC steps = 1 local search.
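The acceptance rule and the 30-step local search can be sketched as follows. Energies are taken to be negative for favourable folds, so an uphill move has 0 ≤ E2/E1 < 1 and is accepted with probability below 1/15. The handling of E1 = 0 and of E2 = E1, and the `energy_fn` / `mutate_fn` callables, are assumptions made for this illustration.

```python
import random

def accept_move(e1, e2, rng=random):
    """Accept downhill moves; accept other moves with p = E2 / (15 * E1)."""
    if e2 < e1:
        return True
    if e1 == 0:
        return False                     # assumption: ratio undefined, reject
    return rng.random() < e2 / (15.0 * e1)

def local_search(conf, energy_fn, mutate_fn, steps=30, rng=random):
    """Run `steps` attempted Monte Carlo moves starting from `conf`."""
    current, e_current = conf, energy_fn(conf)
    for _ in range(steps):
        trial = mutate_fn(current, rng)  # e.g. change one entry of the vector
        if trial is None:
            continue                     # skip invalid structures
        e_trial = energy_fn(trial)
        if accept_move(e_current, e_trial, rng):
            current, e_current = trial, e_trial
    return current, e_current
```

The low acceptance probability for uphill moves makes the search strongly downhill-biased while still allowing an occasional escape from a poor local fold.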

Brood Selection

• More than 2 offspring generated from a selected pair of parents.

• The best 2 offspring replace the parents.

• Allows wider exploration of crossover space around the two parents.

• Optimum brood size = 5.
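Brood selection can be sketched as generating several offspring from one parent pair and keeping the two of lowest energy. The example below assumes a brood of five single children made by 1-point crossover, alternating which parent supplies the head; `make_child`, `brood_selection` and the `energy_fn` callable are hypothetical names for this illustration.

```python
import random

def make_child(head_parent, tail_parent, rng):
    """One offspring from 1-point crossover of two equal-length vectors."""
    cut = rng.randrange(1, len(head_parent))
    return head_parent[:cut] + tail_parent[cut:]

def brood_selection(parent_a, parent_b, energy_fn, brood_size=5, rng=random):
    """Generate `brood_size` offspring and return the two of lowest energy."""
    brood = []
    for k in range(brood_size):
        # Alternate which parent contributes the head of the vector.
        head, tail = (parent_a, parent_b) if k % 2 == 0 else (parent_b, parent_a)
        brood.append(make_child(head, tail, rng))
    brood.sort(key=energy_fn)            # lowest energy first
    return brood[:2]
```

A larger brood samples more cut points per parent pair at the cost of more energy evaluations, which is the trade-off behind an optimum brood size.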

Comparison with Previous GA

           This Work                              Unger & Moult
Sequence   E(GM)   D(GM)   %Success   Neval       E(GM)   Neval
HP-20      9       2       100        18,338      9       30,492
HP-24      9       19      100        27,278      9       30,491
HP-25      8       16      100        35,128      8       20,400
HP-36      14      192     70         113,667     14      301,339
HP-48      23      285     13         261,311     22      126,547
HP-50      21      370     100        97,691      21      592,887

200 GA runs. Parameters: X-over = 1.0, mutation = 0.1, elitism = 30%, DPL = 1,

local search, brood size = 5.

Maximum generations = 100.

G.A. Cox, T. V. Mortimer-Jones, R. P. Taylor, RLJ, Theor. Chem. Acc. 112, 163-178 (2004).

Example Global Minima for Benchmark Sequences

[Figure: example global-minimum conformations for HP-20, HP-24, HP-25, HP-36, HP-48 and HP-50.]

Dynamical Lattice Model*

* F. Dressel, S. Kobe, Chem. Phys. Lett. 424, 369-373 (2006).

Amino acid residues have preferred conformations determined by backbone angles φ and ψ.

Dynamical lattice = discrete but non-regular grid.

From cluster analysis of Ramachandran plots, certain allowed (φ, ψ) pairs are defined for each residue.

Dynamical Lattice Model

Graham Cox; S. Kobe, F. Dressel (Dresden)

• Amino acid residues treated explicitly.

• Beads for all backbone atoms (N, Cα, C).

• Hard sphere beads for side chains (R).

• Energy obtained by summing

interactions between C beads.

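[Figure: residue with side-chain bead R.]

$E_T = \sum_{i,j}^{n} E_{ij}$

where each pair energy $E_{ij}$ is a tanh function of the bead separation $r_{ij}$, scaled by the pair parameter $e_{i,j} = e_{Se_i,Se_j}$.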

Dynamical Lattice Model: Parameters

[Table of model parameters; entries marked * are from cluster analysis.]

Dynamical Lattice Model

GA Parameters

Population 200

Structure limit 500,000

X-over 1.0

Mutation 0.1

Elitism 30%

Local search 20%

No duplicates allowed

Repair invalid structures

(hard sphere overlap)

Gene coding (e.g. cysteine)

Codes 0, 1, 2, 3 each label one allowed (φ, ψ) pair.
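The gene coding can be pictured as a per-residue lookup table from integer codes to allowed backbone angle pairs. In the sketch below the angle values are placeholders invented purely for illustration (the real pairs come from cluster analysis of Ramachandran plots and are not reproduced here); the table name and `decode_genes` are likewise hypothetical.

```python
# Hypothetical lookup table: placeholder (phi, psi) angles in degrees,
# NOT the values used in the Dynamical Lattice Model.
ALLOWED_ANGLES = {
    "C": [(-60.0, -45.0), (-120.0, 130.0), (-75.0, 150.0), (60.0, 45.0)],
    "G": [(-60.0, -45.0), (80.0, 10.0)],
}

def decode_genes(sequence, genes):
    """Translate one integer code per residue into its (phi, psi) pair."""
    if len(sequence) != len(genes):
        raise ValueError("one gene per residue is required")
    angles = []
    for residue, code in zip(sequence, genes):
        options = ALLOWED_ANGLES[residue]
        if not 0 <= code < len(options):
            raise ValueError(f"code {code} is not allowed for residue {residue}")
        angles.append(options[code])
    return angles
```

Because each residue type has its own list of allowed pairs, the search space is discrete but the grid of accessible conformations is not regular.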

Results (400 GA runs)

[Table: GA results for peptides of length n = 13, 14, 17, 20, 20, 21, 26 and 28.]

* = GM from Branch & Bound Search

GM(B+B) = -2.5157

1AL1 (right-handed α-helix, n = 13): XELLKKLLEELKG

1A1P (Compstatin, n = 14): ICVVQDWGHHRCTX

[Figures: PDB structure compared with GA structure for each peptide.]

Energy Landscapes

• The energy landscape (EL) determines the kinetics and thermodynamics of e.g. clusters, liquids, glasses and biomolecules.

• It determines the ability of a system to reach its global minimum (GM) energy, and of search methods to find the GM.

• Examples:

– Potential Energy Surfaces

– Free Energy Surfaces (as function of T)

• Multidimensional surfaces are difficult to visualise.

• Consider a connected network of minima and transition states, located using:

– Eigenvector following, successive confinement, parallel tempering, and nudged elastic band methods.

[Figure: schematic energy landscape; vertical axis: Energy.]

Representing Energy Landscapes

• Disconnectivity Graph (DG) approach (Hoffman, Sibani, Schoen, Becker & Karplus, Berry et al., Wales et al.) allows visualisation of the connectivity of high-dimensional PES (e.g. for proteins, clusters, spin glasses).

• BUT – the x-coordinate has no meaning. Can more physically meaningful coordinates be obtained?

Metric Disconnectivity Graphs (MDGs):

– Reproducible placement of superbasins.

– Separation of superbasins reflects structural difference.

– Thickness of line represents “size” of superbasin.

[Figure: two disconnectivity graphs; vertical axis: E.]

Principal Component Analysis (PCA)

• Perform a linear transformation of coordinates of energy minima and transition states (1st rank saddles).

• Identify the principal components – coordinates that maximise the variance of the system.

• PCA finds orthogonal lines of best fit through a data set.

• These lines of best fit are used as coordinates to re-plot the data.

• This analysis can be used to show and visualise trends in multi-dimensional data.

[Figure: data in (x, y, z) re-plotted along principal components D1 and D2.]
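As a small illustration of the idea, the sketch below finds the first principal component of a data set by power iteration on the covariance matrix, using only the standard library. Power iteration is a substitute chosen here for brevity, not the method used in the original work; with NumPy available, `numpy.linalg.eigh` on the covariance matrix would give all components at once. The function name and return values are assumptions.

```python
import math

def first_principal_component(data, iterations=100):
    """Return (direction, fraction of variance, projections) for the 1st PC."""
    n, dim = len(data), len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(dim)]
    centred = [[row[k] - means[k] for k in range(dim)] for row in data]
    # Sample covariance matrix of the centred data.
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    vec = [1.0] * dim
    for _ in range(iterations):          # power iteration
        new = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            raise ValueError("data have no variance along the start vector")
        vec = [x / norm for x in new]
    variance = sum(vec[a] * cov[a][b] * vec[b]
                   for a in range(dim) for b in range(dim))
    total = sum(cov[a][a] for a in range(dim))
    projections = [sum(r[k] * vec[k] for k in range(dim)) for r in centred]
    return vec, variance / total, projections
```

For points lying exactly on a straight line, the first component points along that line and carries all of the variance.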

PCA for Proteins

• There are several ways to represent the structure of a protein:

– (φ, ψ, ω) dihedral angles

– (x,y,z) Cartesian co-ordinates of atoms

• We have (mostly) used (x,y,z) co-ordinates, with translations and rotations removed.

PCA-based Disconnectivity Graphs

• PCA can be combined with DGs to produce MDGs in which the x and y axes are used to display structural information.

• Structures are grouped into “superbasins” that are mutually accessible without passing through a transition state with energy > Emax.

• The MDG is produced using the average coordinate of all members of a superbasin to place the node.

• The number of superbasins and their connectivity are assessed at energy intervals Esep.

• The thickness of the lines can be used to represent the number of structures or the structural diversity within the superbasin.

• Structural diversity = number of dimensions needed to reproduce (say) 99% (SD0.99) of the information (variance) within a superbasin.

[Figure: disconnectivity graphs; axis labels: Energy, V, X1, Q1.]
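The superbasin grouping at a given threshold Emax can be sketched with a union-find structure: two minima belong to the same superbasin if they are linked by a chain of transition states that all lie at or below Emax. The input format (a list of minimum energies plus `(i, j, energy)` transition-state triples) and the function name are assumptions made for this example.

```python
def superbasins(minima_energies, transition_states, e_max):
    """Group minima that are mutually accessible below the energy e_max."""
    # Only minima at or below the threshold exist at this energy level.
    parent = {i: i for i, e in enumerate(minima_energies) if e <= e_max}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i, j, e_ts in transition_states:
        if e_ts <= e_max and i in parent and j in parent:
            parent[find(i)] = find(j)       # merge the two superbasins

    groups = {}
    for i in parent:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(group) for group in groups.values())
```

Repeating the grouping at a series of thresholds spaced by Esep gives the sequence of merges that a disconnectivity graph draws as branch points.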

The Off-Lattice BLN Bead Model

• 3 types of bead: HydrophoBic (B), HydrophiLic (L) and Neutral (N).

• Off-lattice model – has stretching (bond r), bending (angle θ), torsional (φ) and through-space (Lennard-Jones) components.

[Figure: torsional conformations labelled anti and gauche.]
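For reference, the four components named above are usually combined in a potential of the following general form. This is the form commonly quoted for the BLN model in the literature rather than one taken from these slides; the force constants $K_r$, $K_\theta$ and the coefficients $A_i$, $B_i$, $C_{ij}$, $D_{ij}$ depend on the bead types involved, and their values are not given here.

```latex
V = \tfrac{1}{2} K_r \sum_{\text{bonds}} (r_i - r_e)^2
  + \tfrac{1}{2} K_\theta \sum_{\text{angles}} (\theta_i - \theta_e)^2
  + \epsilon \sum_{\text{dihedrals}} \left[ A_i (1 + \cos\phi_i) + B_i (1 + \cos 3\phi_i) \right]
  + 4\epsilon \sum_{i<j-1} C_{ij} \left[ \left(\frac{\sigma}{r_{ij}}\right)^{12}
      - D_{ij} \left(\frac{\sigma}{r_{ij}}\right)^{6} \right]
```

The $\cos 3\phi$ term gives the torsional potential its three minima (one anti, two gauche), and the sign of $D_{ij}$ decides whether a non-bonded pair attracts or only repels.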

Gō Model

All non-native attractive contacts removed.

Single-funnel PES with same GM as BLN.

Efficient folding.

46-Bead BLN Model

Global minimum is a 4-strand β-barrel.

Frustrated PES.

Inefficient folding.

PCA for the 46-bead BLN Model

• 1st PC (Q1) contains approx. 30% of total variance.

• 1st + 2nd PCs (Q1,Q2) contain approx. 45% of total variance.

• 2D and 3D disconnectivity graphs can be plotted against Q1 and Q2.

• Line thickness related to structural diversity within a superbasin.

T. Komatsuzaki, K. Hoshino, Y. Matsunaga, G.J. Rylance, RLJ, D.J. Wales, J. Chem. Phys. 122, 084714 (2005).

[Figure: 2D and 3D disconnectivity graphs for the BLN and Gō models.]
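The variance fractions quoted for Q1 and Q2, and a structural diversity measure such as SD0.99, both follow from the PCA eigenvalues (the variance along each component). The sketch below shows the arithmetic; the eigenvalue list in the usage note is invented so that the first two fractions come out at 30% and 45%, and is not data from the BLN study.

```python
def cumulative_variance(eigenvalues):
    """Cumulative fraction of total variance, largest component first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    fractions, running = [], 0.0
    for value in ordered:
        running += value
        fractions.append(running / total)
    return fractions

def structural_diversity(eigenvalues, threshold=0.99):
    """Number of components needed to reach `threshold` of the variance."""
    for count, fraction in enumerate(cumulative_variance(eigenvalues), start=1):
        if fraction >= threshold:
            return count
    return len(eigenvalues)
```

With the made-up eigenvalues `[30, 15, 10, 10, 10, 10, 10, 5]`, the first component holds 30% of the variance, the first two hold 45%, and all eight are needed to pass 99%.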

3D Disconnectivity Graphs: Gō vs. BLN

[Figure: 3D disconnectivity graphs for the Gō and BLN models.]

Dihedral Angles – 46-Bead Gō Model

Disconnectivity graph based on 43 dihedrals

[Figure: panels A and B.]

Landscape Complexity

• How can we quantify the complexity of an energy landscape?

• Residential probability ($p_r$): probability of being located in a given superbasin at a certain energy.

• Branching probability ($p_b$): probability of taking a particular path to a given superbasin compared to all possible paths leading from the parent node.

• Landscape complexity ($C_L$): Shannon entropy of residential probabilities:

$C_L(V_i) = -\sum p_r \log p_r$

• Path complexity ($C_{P,i}$): Shannon entropy of branching probabilities:

$C_{P,i}(V) = -\sum p_b \log p_b$
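Both complexity measures are Shannon entropies, so the same small function serves for either set of probabilities. The sketch below uses the natural logarithm, which is an assumption since the base is not stated; deriving probabilities from the number of minima in each superbasin is likewise only one possible choice, made here for illustration.

```python
import math

def shannon_entropy(probabilities):
    """Return -sum(p * ln p) for a normalised probability distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p = 0 contribute nothing (p * ln p -> 0 as p -> 0).
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def probabilities_from_counts(counts):
    """Hypothetical estimate: probability proportional to superbasin size."""
    total = sum(counts)
    return [count / total for count in counts]
```

A single-funnel landscape, where one superbasin holds nearly all the probability at each energy, gives an entropy near zero; a landscape split evenly over many superbasins gives a large one.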

Complexity: Gō vs. BLN

Gō: $C_L = 1.23\,\epsilon^{-1}$

BLN: $C_L = 3.89\,\epsilon^{-1}$

[Figure: landscape complexity plots for the Gō and BLN models.]

Conclusions

• Nature-inspired Computation has proven to be successful in searching for low energy folds for simple model proteins.

• Disconnectivity Graphs – give information about the connectivity and important energy barriers on energy (and other) landscapes.

• The Dynamical Lattice Model shows promise as an intermediate

between simple bead models and all-atom models.

• PCA-based visualisation and complexity analysis of protein

folding landscapes allows us to explain difficulties encountered

by global optimisation algorithms in certain cases – and may aid

the design of more robust search algorithms.

Acknowledgments

Birmingham

• Dr Gareth Rylance

• Dr Graham Cox

• Dr Ben Curley

• Dr Lesley Lloyd

• Dr Andrew Bennett

• Dr Jun He (now Aberystwyth)

• Eleanor Turpin

External

• Prof. Sigismund Kobe (Dresden)

• Prof. Tamiki Komatsuzaki (Kobe)

• Prof. David Wales (Cambridge)

• Prof. Martin Karplus & Dr Paul Maragakis (Harvard)

• Prof. Said Salhi (Kent)

Funding

• EPSRC

• The Royal Society

• Wellcome VIP Scheme

• JSPS

• Leverhulme Trust

• University of Birmingham

• BlueBEAR

Simulation of Self-Assembly (Programme Grant EP/I001352), 2010-15

Cambridge

– Daan Frenkel, David Wales & Mark Miller

Oxford

– Jon Doye

Birmingham

– Roy Johnston, Mark Oakley

• Methods

– New coarse-grained potentials.

– Analysis of potential energy, free energy and other landscapes.

– New hybrid search algorithms.

– Dynamical and thermodynamic simulations.

– Investigation of hierarchical self-assembly.

• Example Systems

– Proteins; DNA, RNA; Liquid crystals
