Computational Protein Design: A problem in combinatorial optimization

Computational Protein Design:A problem in combinatorial

optimization

CSE 549 Guest LectureSeptember 17, 2009

David GreenApplied Mathematics & Statistics

What is a protein?

• Polymers (chains) of amino acids.– There are 20 different amino acids that can be

part of the chain.

• Machines of the cell.– It’s proteins that do most of the work involved in

life!

Polymers of amino acids.

• Amino acids link to form polypeptides.– There is a backbone of constant composition.– There are side chains that vary.

The twenty amino acids.

• AA side chains vary from:– Big to small.– Non-polar (all C and

H) to polar.– Positive to

negative.– Flexible to rigid.

The machinery of life.

• Protein sensors (receptors) are responsible for all the senses (sight, smell, taste, touch, hearing).

• Enzymes are proteins the catalyze chemical reactions, like the ones that convert food to energy.

• Specialized structural proteins make skin elastic, and make the lens of the eye work.

• Muscles are primarily composed of proteins that combine structural and enzymatic parts to make a machine.

Why design proteins?

• New sensors based on biology.– Proteins have been engineered to detect TNT (explosive)

and sarin (nerve gas).• Proteins are used as treatments for many diseases.

– Protein engineering has helped improve proteins that are given to cancer patients on radiation or chemo-therapy.

– Work in the Green lab is on-going to design proteins for use as anti-HIV prophylatics.

• Many nanotechnology applications that haven’t even been considered yet!

Where do proteins come from?

• The genome contains instructions for every protein in a cell.– A few HUGE molecules of DNA.

• Each gene is the code for one protein.– There are ~30,000 genes in humans.

• Genes are expressed through an intermediate molecule, RNA.– Many copies of each protein can be made.

The Central Dogma of Molecular Biology.

• Then proteins do the work!

How do proteins work?

• Proteins fold into a unique 3-dimensional structure.

• The amino acid sequence of a protein dictates it’s structure.

• The function of a protein is controlled by it’s structure.

Many polymers are long, unstructured chains.

• Polyethylene – Is made of long

chains of the same monomers.

– Adopts a random mesh of inter-weaving strands.

– This structure gives us PLASTIC!

DNA has the same structure for every sequence.

• The “double-helix” is a great structure for storing and replication information.

Protein structures are well-defined and diverse!

• One chain or many.

• Elongated or globular.

• Many forms of symmetry (or none).

What does a protein look like?

• Cyanovirin – A protein that inhibits the entry of HIV into human cell.


• The atoms of a protein form a compact, well-packed cluster.


• A protein can be thought of as a nearly solid object.


• Simplified cartoons make the structure easier to see.


• The path of the backbone of a protein is called it’s “fold”.


• Different types of amino acids are found all along the protein chain.


• Each amino acid has a side chain that protrudes from the backbone.


• Many proteins bind other molecules, like the sugar molecules here.


• Binding interfaces are usually a close fit of two complementary surfaces.


• The core of a protein is key in keeping a stable structure.

Many side chains fill the core.

The core is well packed …

… with groups from all along the chain.

Each side chain fits perfectly.

What is a protein?

• A protein is a complicated three-dimensional structure, made up by an amazing 3-D jigsaw puzzle of interlocking amino acids.

• Amino acids pack together not just geometrically, but with complementary chemical groups as well.

• Proteins move too, but we’ll ignore that for now.

How can we design one?!?

1. Choose a fold (path of the backbone).2. Pack the core with the right set of amino

acids to achieve the desired fold.3. Choose other amino acids to achieve the

desired function (such as binding to a target molecule, or getting the right molecular motions).

Structure prediction is a forward problem.

•Given a protein sequence, what is the structure that it will adopt (fold to)?•This is a VERY hard problem, and it not yet fully solved.•Prediction is difficult because you are stuck with what nature gives you.

Protein design is an inverse problem.

•Given a desired 3-dimensional protein structure, what is a sequence that will fold to that structure?•We have the freedom to add constraints that simplify the problem.•As a result, methods for protein design have had many successes.

• Pabo. Nature 301: 200 (1981).• Drexler. PNAS 78: 5275-5278 (1981).

A designed sequence should fold according to design.

•ANY sequence which folds to the correct target structure (and carries out the desired function) can be considered a successful design •There is more than one right answer, unlike in prediction!

Choosing a backbone fold.

• The structure dictates the function, and a big part of structure is the fold.

• We still don’t really know how to choose the “best” fold.

• Instead, we just borrow from nature – redesign a natural protein to do something new.

Zinc finger proteins bind DNA.

A Zinc ion holds them together.

• The protein will not fold if zinc is not present.

• The protein only binds DNA when it is folded.

• A group at Caltech set out to design a zinc finger that doesn’t need zinc!

1997: The first fully automated protein design!

• Dahiyat and Mayo. Science 287: 82-87 (1997).

Designing function.

• Making a molecule bind is like designing a the core – we want to make the interface between the two pieces complementary.

• Other functions are a lot trickier … and we don’t have good ways to solve them yet, but we’re on our way.

2003: A Duke group designs a set of protein sensors.

• Looger, Dwyer, Smith and Hellinga. Nature 423: 185-190 (2003).

Protein design is a BIG problem.• The zinc finger is one of the smallest protein domains …

about 30 amino acids long.• How many different 30 amino acid polypeptides are

there?– Choose from any of 20 amino acids at each position.– Total sequences = 2030 = 1x1039

• Mass of earth = 6x1027 g• Mass of a grain of sand ~ 1x10-3 g

– A billion earths’ worth of sand grains

• Enumeration of possible states is beyond impossible — must take advantage of need to achieve complementary interactions between amino acids.

Many different structures are possible.

• An arginine and a glutamate interact.






• An arginine and a glutamate interact in several different conformations.

Really Big!!!

• Amino-acid side chains are flexible.• But not every shape (conformation) is equal.• Each amino acid has a set of preferred

conformations (rotamers).– 1 to 80 per amino acid.

• Instead of choosing from 20 amino acids … we need to choose from ~400 (at least) amino acid rotamers!– Total structures = 40030 = 1x1078

• (approx. number of atoms in the universe!!!!!)

• How do you solve a jigsaw puzzle?• Impossible to try all combinations of piece

placement – Unique ways of placing N pieces on a grid is (4N)(N!)

• For N=100, (1.6x1060)(9.3x10157)= 1.5x10218

• Trying each piece one by one is better, but still infeasible– Number of iterative tries for a N piece puzzle is:

• For N=100, 1.37x106

Packing side chains – a puzzle.

)1)(4(3

4

6

)12)(1(

2

)1()1(8

)1(8

)22)((4

2

1

2

1

2

1

NNN

NNNNNNN

iiNN

iiN

N

i

N

i

N

i )1)(4(3

4)22)((4

1

NNNiiNN

i

Packing side chains – be smart.

• How do you solve a jigsaw puzzle?• Group pieces by colors and patterns.• Iterate over matching of pieces that are

complementary– Shape is important.– The pattern must also match.

Pattern matching in proteins?

• What does it mean for two amino acids in the core of a protein to “match”?– Must fit close together (but not too close) Steric

complementarity.– Neighboring atoms must have complementary

charges (neutral likes neutral, positive likes negative) Electrostatic complementarity.

Steric fit: Lennard-Jones potential.• Van der Waals attraction

between atoms at moderate distances.

• Repulsion of atoms from one another at short distances.

• If atoms are not nearby, the energy between them will be very close to zero.

• The total score of the “goodness” of fit in a molecule is the sum of the energy for every pair of atoms.

6

0

12

0 2r

r

r

rUU o

Electrostatic fit: Coulomb’s Law• Atoms in molecules can be thought

of as having tiny charges on them, even if the total charge on a molecule is zero.

• Coulomb’s Law describes the energy of how two charges interact.

• The overall electrostatic fit is calculated by adding up the energy of all pairs of atoms.– Like charges give a positive value.– Opposite charges give a negative. – Neutral (zero charge) groups don’t

matter.

r

qqkU C

21

The total energy describes the fitness of a structure.

• Van der Waals + Coulomb’s Law, for every pair of atoms, and all added up.

• Negative energies are favorable, positive energies unfavorable.

– Nature works to MINIMIZE energy.

Protein Design as a Discrete Conformational Search

Position 1

Position 2

Position 3

Conformational states of system

Tree Pruning with Dead End Elimination

The Dead-end Elimination TheoremGiven two rotamer choices X, and Y at position I, if the best energy of X (with any choice of rotamers at other positions) is worse than the worst energy of Y, then X can not be part of the global energy minimum.

Need to make the comparison,

but the min and max functions require evaluating all states.

Molecular Mechanics energy Van der Waals Coulombic Bond, angle, dihedral

ji

pairRjRi

selfRiRi jiii

EEE ),(),,(21

,,

positionsi

Ri iERE ,)(

i , j are positions in sequenceRi is rotamer choice at position i

)),((max)),((min?

RYERXE iR

iR

Making DEE feasible.

ip ipq

pairRqRp

selfRp

iq

pairRqXi

selfXi

p pq

pairRqRp

selfRpi

qppqii

qpp

EEEE

EERXE

,),(),,(2

1,),(),,(,

),(),,(21

,),(

Last sum is invariant with respect to our choice at position i, and thus:

* Ri is used to replace Xi in the first equation.

iq

pairRqYi

selfYi

Riq

pairRqXi

selfXi

Ri

Ri

R qiiqiiEEEERYERXE ),(),,(,

?

),(),,(,

?

maxmin),(max),(min

iq

pairRqXi

R

selfXi

iq

pairRqXi

selfXi

R qiq

iqiiEEEE ),(),,(,),(),,(, minmin

iq

pairRqYi

R

selfYi

iq

pairRqYi

selfYi

R qiq

iqiiEEEE ),(),,(,),(),,(, maxmax

But note that:

iq

pairRqYi

R

selfYi

iq

pairRqXi

R

selfXi qi

qiqi

qi

EEEE ),(),,(,

?

),(),,(, maxmin

This gives a sufficient condition for the DEE theorem to hold:

min and max are evaluated over rotamers at a single position, so entirely feasible!

Improving the bounds for DEE

0),(),(min),(max),(min??

RYERXERYERXE iiR

iR

iR

Again, we use the same trick to bound the minimum in a feasible manner:

This gives a alternate sufficient condition for the DEE theorem to hold:

This is a tighter bound on the “true” desired comparison, since:

0min?

),(),,(),(),,(,,

pairRqYi

iq

pairRqXi

R

selfYi

selfXi qiqiii

EEEE

iq

pairRqYi

pairRqXi

R

pairRqYi

iq

pairRqXi

R qiqiq

qiqiEEEE ),(),,(),(),,(),(),,(),(),,( minmin

Original problem statement was to compare the best structure with Xi at position I to the worst with Yi; it is a easier to satisfy, but still sufficient, criterion to find the single set of choices at all other positions with the minimum difference between choice Xi and Yi.

0min?

),(),,(),(),,(,, iq

pairRqYi

pairRqXi

R

selfYi

selfXi qiqi

qii

EEEE

iq

pairRqYi

pairRqXi

Riq

pairRqYi

Riq

pairRqXi

R qiqiq

qiq

qiq

EEEE ),(),,(),(),,(),(),,(),(),,( minmaxmin

DEE as an iterative algorithmAs rotamers are flagged as incompatible with the global minimum solution, the min/max functions are evaluated over a smaller and smaller set of choices, and so additional iterations of the comparison can eliminate more possibilities. Thus, the algorithm can be outlined as:

While (any rotamers eliminated ) { For each position in the sequence (i){ For each rotamer choice (X) at position i { For each rotamer choice (Y ne X) at position i { If DEE criterion is satisfied { Eliminate choice X at position I } } } }}

The order of each cycle is ~N*R2, where N is the number of positions, and R is the number of choices at each position. However, as R decreases with iteration, each subsequent cycle costs less.

DEE identifies branches that are incompatible with the global minimumPosition 1

Position 2

Position 3


The pruned tree can be much smallerPosition 1

Position 2

Position 3


DEE can be highly effectiveExample of a 5-position design, with 306 choices per position, done in a simple MATLAB script:

Iteration 0: 306 306 306 306 306Iteration 0: Structures = 2.682916e+12Iteration 1: 143 198 145 42 33Iteration 1: Structures = 5.690265e+09 from 2.682916e+12Iteration 1: Elapsed time is 96.461053 seconds.Iteration 2: 52 83 55 11 8Iteration 2: Structures = 20889440 from 5.690265e+09Iteration 2: Elapsed time is 11.604577 seconds.Iteration 3: 4 12 36 6 5Iteration 3: Structures = 51840 from 20889440Iteration 3: Elapsed time is 0.625266 seconds.Iteration 4: 4 12 35 6 5Iteration 4: Structures = 50400 from 51840Iteration 4: Elapsed time is 0.148598 seconds.Iteration 5: 4 12 35 6 5Iteration 5: Structures = 50400 from 50400Iteration 5: Elapsed time is 0.045631 seconds.

DEE: Notes and CaveatsDead-end elimination is not guaranteed not to eliminate any choices … in this case computational expense is used at zero gain. However, experience suggests that in the case of protein design, the algorithm is highly efficient.

For large design problems, even a highly efficient pruning can leave a tree which is too large to be searched by enumeration (such as depth-first search); for example, consider an original space of 10100 states, reduced to 1020. An efficient bounding heuristic can be defined using similar tricks as discussed here; this can be used in the A* algorithm to find the global minimum within the remaining space.

The DEE criterion can be extended to pairs, triples, and n-tuples of positions. Most applications use only singles and pairs.

Optimization as a tree search.How to choose a path through a tree, which the goal of reaching the global minimum state?

1

1),(),,(,,

i

j

pairRjRi

selfRiRi jiii

EEE

positionsi

Ri iERE ,)(

First, define an energy to be associated with each step through the tree:

This is the energy of placing rotamer choice R at position i, given that all positions in the tree above i have been selected.Note this is similar, but not identical, to the definition used with DEE.

At the leaf node, the state energy is then the sum of all individual path energies:

Thus, we wish to find the path with the lowest total.

Efficient search using a heuristic (A*)Challenge is that we do not know the total until we have traversed the tree to a leaf!

*

, iRiE

Assume we have some heuristic, that estimates the best possible energy that we can achieve down some path. For choice R at position I, we will call this

When choosing a path to take at a given node, we know the path we have already taken, and thus the true cost. We thus combine this information with the heuristic in deciding what step to take:

At a leaf node, the heuristic is 0, and this gives the total energy.

1

,*

,),(

i

ijRji jiRi

EERif

However, if we only use the heuristic, how do we know we get the correct solution?

Guaranteed optimality with A*1. Allow backtracking.

At every step, not only choose between the possible paths from the current node, but also the paths from all nodes which have been visited in the past.In this way, if the heuristic turned out to be poor for a given path (and the true energy became large), a new path is chosen.

Thus, it is possible to have a guarantee that the solution found is the true global minimum!

2. Use a heuristic that bounds the true solution.

Consider a heuristic that is guaranteed to be lower than the true best energy down a path (an optimistic prediction of the best energy). When a leaf is reached, a comparison between the true energy of that leaf and the heuristic energy of the un-followed choice can be made. Since the heuristic is an optimistic guess, if the true energy is lower than the heuristic for all other choices, it must be the global minimum.

Defining the bounding heuristic

1

1),(),,(,,

i

j

pairRjRi

selfRiRi jiii

EEE

positionsi

Ri iERE ,)(

Recall our energy definitions:

N

ijRj

R

i

jRji

opt

jjEERif ,

1

1, min),(

The optimal heuristic is:

Now, we bound the second term:

N

ij

j

k

pairRkRj

selfRj

R

N

ij

j

k

pairRkRj

selfRj

R kjjkjjEEEE

1

1),(),,(,

1

1),(),,(, minmin

N

ij

j

k

pairRkRj

R

selfRj

R

N

ij

j

k

pairRkRj

selfRj

R kjjj

kjjEEEE

1

1),(),,(,

1

1),(),,(, minminmin

N

ij

j

k

pairRkRj

R

selfRj

R

N

ij

j

k

pairRkRj

R

selfRj

R kjk

jj

kjjj

EEEE1

1),(),,(,

1

1),(),,(, minminminmin

Overview of A* for protein design

Our final heuristic is:

The second term involves a minimum over each choice at following position, which is order (N-i)*R, with an inner minimum over the same, order (N-j)*R. Thus, the cost is less than (N-j)2*R2.

Initialize a list of traveled nodes with the root.While (no mininum leaf in list) { Select minimum f* of all paths from nodes in list. Add this new node to list. If (new node is leaf) { Compare leaf energy to minimum f* from list. If (Leaf energy < min(f*)) { Leaf is global minimum. }}

N

ij

j

k

pairRkRj

R

selfRj

R

i

j

j

k

pairRkRj

selfRji kj

kj

jkjj

EEEERif1

1),(),,(,

1

1

1

1),(),,(,

* minmin),(

A*: Notes and CaveatsPerformance of A* is not guaranteed — in the worst case, the entire tree must be enumerated to find the solution. Again, however, experience suggests that the algorithm is highly efficient for protein design.

In many cases, we wish not only to have the global minimum, but all solutions within a cutoff of the minimum. A* can be adapted to solve this problem as well, but care must be taken in the first step of pruning with DEE — the elimination criterion must be modified to prevent elimination of low, but not minimum, energy states.

ReferencesProtein design as an inverse problem:

1. Pabo. Nature 301: 200 (1981).2. Drexler. PNAS 78: 5275-5278 (1981).

Examples of successful protein design:3. Dahiyat and Mayo. Science 287: 82-87 (1997).4. Looger, Dwyer, Smith and Hellinga. Nature 423: 185-190 (2003).

Development of the Dead-End Elimination algorithm:5. Desmet, DeMaeyer, Hazes and Lasters, Nature 356: 539-542 (1992).6. Goldstein, Biophys. J. 66: 1335-1340 (1994).7. Gordon and Mayo. J. Comput. Chem. 19: 1505-1514 (1998).

Formulation of A* for protein design:8. Leach and Lemon, Proteins 33: 227-239 (1998).

Documents

Computational Protein Design: A problem in combinatorial optimization