
Appendix 1 Biological and Chemical Basics Related to Protein Structures

Hong Guo and Haobo Guo

A1.1 Amino Acid Residues

Proteins, whose three-dimensional structures are a major focus of this book, belong to one of the most important classes of macromolecules in living systems. They are linear polymers built from amino acids, which are linked together by peptide bonds formed during condensation. The term proteins is often used for relatively large molecules (say, containing 50 or more amino-acid residues) with folded conformations, while the term peptides is used for smaller molecules (say, with fewer than 50 residues). The structures of the 20 naturally occurring amino-acid residues are shown in Fig. A1.1. The frequency of occurrence for each residue is different. For instance, Ala is about five to six times more abundant than Trp in globular proteins. The amino-acid side chains possess a variety of physical-chemical properties. These properties determine how a given amino-acid residue participates in interactions with other groups, as well as its role in protein folding and function.

The hydrophobic residues such as Ala, Val, Phe, Pro, Leu, and Ile do not interact favorably with water. Therefore, these residues tend to avoid contact with water and often interact with each other or with other nonpolar groups. This hydrophobic effect is one of the major factors stabilizing the folded conformations of proteins. The chemical reactivity of Ala, Val, Leu, and Ile is generally very low. Polar and charged residues can interact with each other, with peptide groups of the backbone, with water, and with polar groups on ligands through hydrogen bonding and electrostatic interactions. For instance, the hydroxyl groups of the Ser and Thr residues can act as both donors and acceptors for hydrogen bonds. Asn and Gln may also act as hydrogen-bond donors and acceptors in hydrogen-bonding interactions.

For the charged residues, the Henderson-Hasselbalch equation, pH = pKa + log([A-]/[HA]), can be used to describe the relationship between the pH of a solution and the concentrations of the basic and acidic forms ([A-] and [HA], respectively) of the amino acids. For instance, the pKa value of Glu and Asp is ~4 in aqueous solution. Based on the Henderson-Hasselbalch equation, half of the side chains are in the basic form and the other half in the acidic form at pH ~4, and the side chains of these two residues are normally ionized and negatively charged as carboxylates (i.e., in the basic form) at physiologically relevant pH values. Charged residues and some polar residues may change their charge state depending on their microenvironment.
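As a numerical illustration of this relationship, the fraction of side chains in the basic (deprotonated) form at a given pH follows directly from the equation; a minimal sketch (the function name is ours):

```python
def fraction_basic(pH, pKa):
    """Fraction of groups in the basic form [A-], from the
    Henderson-Hasselbalch equation:
        pH = pKa + log10([A-]/[HA])  =>  [A-]/[HA] = 10**(pH - pKa)
    """
    ratio = 10 ** (pH - pKa)       # [A-]/[HA]
    return ratio / (1 + ratio)     # [A-] / ([A-] + [HA])

# Glu/Asp side chains (pKa ~ 4): half ionized at pH 4, and almost
# fully ionized (carboxylate) at a physiological pH of ~7.4.
print(fraction_basic(4.0, 4.0))    # 0.5
print(fraction_basic(7.4, 4.0))    # ~0.9996
```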


[Figure A1.1: chemical structures of the side chains, grouped in the original as hydrophobic (Ile, Leu, Ala, Val, Phe, Pro, Met), charged (Glu, Lys, Arg), polar (His, Cys, Thr, Ser, Tyr, Asn, Gln, Trp), and glycine (Gly); structures not reproduced here.]

Fig. A1.1 Side chains of the 20 α-amino-acid residues of proteins and peptides. The predominant forms under physiological conditions are given for the charged residues


For example, the pKa values of Glu and Asp can be significantly higher in the interior of a protein; pKa values of 8–10 for the carboxyl group of Glu and Asp in the active sites of certain proteins have been reported. The pKa values (around 10.5–12) for the guanidino group of Arg and the amino group of Lys ensure that most of the Arg and Lys residues are in the acidic forms and are positively charged at physiologically relevant pH values. These residues therefore have high hydrogen-bond-donating capacity as well as the ability to form salt bridges with negatively charged groups such as the carboxylates of Asp or Glu. For a detailed discussion of the interactions involving different polar and charged groups as well as hydrophobic residues, see Appendix 3.

The side chain of His has a pKa value of around 6.5. Therefore, both the basic (imidazole) and acidic (imidazolium) forms may exist under physiological conditions. In the nonionized form, the nitrogen bearing the hydrogen atom is a donor for hydrogen bonding, and the other nitrogen, without a hydrogen atom, is an acceptor. In the ionized form of His, both nitrogen atoms can act as donors for hydrogen bonds. The Pro residue is unique because its side chain is covalently bonded to the nitrogen atom of the backbone. Therefore, the backbone of the Pro residue has no amide hydrogen and cannot act as a donor for hydrogen bonding. The peptide bond preceding Pro has a higher probability of adopting the cis configuration. The two residues that contain sulfur are Met and Cys. Met has a hydrophobicity similar to that of Leu. Cys has a thiol group that ionizes at slightly alkaline pH values, and the thiolate anion is a powerful nucleophile. Cys residues can serve as ligands to a variety of metals and may be oxidized to form disulfide bonds when two Cys residues are adjacent in the three-dimensional structure of a protein. Such disulfide bonds can stabilize the structure of the protein. Gly is the simplest amino-acid residue, and its side chain contains only a hydrogen atom. The backbone at Gly has much greater conformational flexibility than at other residues.

A1.2 Nucleic Acids

Another important class of biomolecules is the nucleic acids, in the form of deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). DNA makes reproduction possible by storing the genetic information that controls all cellular processes, including the synthesis of proteins. James Watson and Francis Crick proposed the structure of DNA in 1953, in which the two helical strands of DNA are coiled about one another to form a double-stranded helix, as shown in Fig. A1.2A. Intertwining of the two helical strands creates two grooves, a wide one (the major groove) and a narrow one (the minor groove).

Nucleic acids are linear polymers built from nucleotides. The sequences of DNA contain four bases: adenine (A), cytosine (C), guanine (G), and thymine (T); RNA has the same bases except that uracil (U) replaces thymine (T) (Fig. A1.2B). Adenine always pairs with thymine in DNA, and cytosine always pairs with guanine. This relationship between bases in the double helix is described as complementary,


[Figure A1.2B: chemical structures of adenine (A), guanine (G), cytosine (C), uracil (U), and thymine (T); structures not reproduced here.]

Fig. A1.2 (A) Space-filling model of B-DNA; (B) major purine and pyrimidine bases found in DNA and RNA

i.e., every base on one strand is matched by a complementary base on the other strand. Such complementary base-pairing has a very important implication for the molecular mechanism of DNA replication, because each strand can be used as a template for the synthesis of a new complementary strand. The determination of the structure of DNA by J. Watson and F. Crick is often considered the birth of modern molecular biology.

The interrelationship of DNA, RNA, and proteins constitutes the "central dogma of molecular biology," in which genetic information is transmitted from DNA to a messenger RNA (mRNA) by transcription, and the sequence of the mRNA is then translated into a protein sequence. There exists a relationship (the genetic code) between the DNA base sequence of a gene and the amino-acid sequence of the protein encoded by the gene. The genetic code is a set of rules that maps DNA sequences to proteins, with each amino acid represented by a codon consisting of three consecutive, nonoverlapping nucleotides in the gene.

The 64 triplet codons of mRNA and the corresponding amino acids are given in Table A1.1. Three of the codons (UAA, UAG, and UGA) do not code for any amino acid; instead they signal the termination of the polypeptide chain and are called stop codons. Most of the 20 amino acids are designated by multiple codons, with the exception of Met and Trp, which are represented by single codons. The genetic code is therefore said to be degenerate. For an amino acid represented by multiple codons, these codons often differ only in the third position. An example of this rule is Gly, which is encoded by GGA, GGC, GGG, and GGU. However, there are exceptions to this rule. That is, each of leucine, arginine, and serine is represented by six codons, which cannot be generated simply by replacing the third nucleotide (as there are only four nucleotides). AUG (and less frequently, GUG) signals the initiation of protein synthesis; AUG also codes for methionine and GUG for valine. The genetic code allows us to understand


Table A1.1 The genetic code

                              2nd position
1st position   U               C            A               G            3rd position
U              Phenylalanine   Serine       Tyrosine        Cysteine     U
               Phenylalanine   Serine       Tyrosine        Cysteine     C
               Leucine         Serine       STOP            STOP         A
               Leucine         Serine       STOP            Tryptophan   G
C              Leucine         Proline      Histidine       Arginine     U
               Leucine         Proline      Histidine       Arginine     C
               Leucine         Proline      Glutamine       Arginine     A
               Leucine         Proline      Glutamine       Arginine     G
A              Isoleucine      Threonine    Asparagine      Serine       U
               Isoleucine      Threonine    Asparagine      Serine       C
               Isoleucine      Threonine    Lysine          Arginine     A
               Methionine      Threonine    Lysine          Arginine     G
G              Valine          Alanine      Aspartic acid   Glycine      U
               Valine          Alanine      Aspartic acid   Glycine      C
               Valine          Alanine      Glutamic acid   Glycine      A
               Valine          Alanine      Glutamic acid   Glycine      G

the nature of mutations involving changes in genes. A point mutation is a change in a single base pair in DNA (and in the corresponding mRNA). A change in the third position may lead to a codon that codes for the same amino acid (see above). If this is the case, the mutation is called silent. A missense mutation is a change in the DNA sequence that leads to the substitution of one amino acid for another. A point mutation can also introduce or eliminate a termination codon, leading to a change in the length of the polypeptide chain. A nonsense mutation changes a normal codon that encodes an amino acid to a stop codon and results in a truncated polypeptide.
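The three mutation classes can be distinguished mechanically by comparing the amino acids encoded before and after the change; a sketch (the codon dictionary below is a small hand-picked excerpt of the genetic code, just enough for these examples):

```python
# Excerpt of the genetic code (mRNA codons), sufficient for the examples below
CODE = {"GGA": "Gly", "GGG": "Gly", "GAA": "Glu", "GUA": "Val",
        "UAC": "Tyr", "UAA": "STOP"}

def classify(before, after):
    """Classify a point mutation by its effect on the encoded residue."""
    a, b = CODE[before], CODE[after]
    if a == b:
        return "silent"      # same amino acid (often a third-position change)
    if b == "STOP":
        return "nonsense"    # normal codon changed to a stop codon
    return "missense"        # one amino acid substituted for another

print(classify("GGA", "GGG"))  # silent: third-position change, still Gly
print(classify("GAA", "GUA"))  # missense: Glu -> Val
print(classify("UAC", "UAA"))  # nonsense: Tyr -> stop codon
```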

The biosynthesis of polypeptides is directed by mRNA, and the process is called translation. Polypeptides are synthesized on ribosomes, which are multicomponent particles that contain RNA and some of the enzymes required for protein synthesis. This process can be divided into three stages: initiation, elongation, and termination. Polypeptides that are synthesized on ribosomes must fold to their native conformations to function as proteins. A detailed discussion of protein synthesis and folding is beyond the scope of this appendix.
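The codon table (Table A1.1) and the translation process can be sketched in code; a minimal model (ignoring initiation details such as the GUG start, and assuming the reading frame begins at position 0):

```python
BASES = "UCAG"
# 64 one-letter amino-acid codes in codon order (first, second, third base
# each running U, C, A, G); "*" marks the three stop codons UAA, UAG, UGA.
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {b1 + b2 + b3: AA[16 * i + 4 * j + k]
               for i, b1 in enumerate(BASES)
               for j, b2 in enumerate(BASES)
               for k, b3 in enumerate(BASES)}

def translate(mrna):
    """Translate an mRNA string codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":          # stop codon terminates the chain
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGGGAGGCUAA"))               # MGG (Met-Gly-Gly, then stop)
print({CODON_TABLE["GG" + b] for b in BASES})  # {'G'}: third position degenerate
```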

A1.3 Protein Structures

Proteins perform a variety of essential biochemical functions, and the functional properties of proteins are determined by their three-dimensional structures. The important relationship between protein structure and function is discussed in Chapter 2 of this book. In general, all protein functions involve specific recognition and binding to other molecules (small ligands or macromolecules such as other proteins and


DNA). Therefore, the ability of proteins to recognize and bind specific ligands is perhaps the most important property associated with protein function. Almost all of the chemical reactions that occur in metabolism and biosynthesis, as well as in signaling and motor activities in cells, need to be catalyzed, as the rates of these reactions are generally too slow by themselves to sustain life under physiological conditions. These chemical reactions are generally catalyzed by protein enzymes, although a number of RNA molecules can also function as catalysts. Protein enzymes are remarkable catalysts, and the rates of chemical reactions can be enhanced by many orders of magnitude. Proteins can also serve as switches for the control of cellular processes, owing to their flexibility and ability to undergo conformational changes in response to pH changes or ligand binding. Another important role of proteins is to provide structural support. For instance, fibrous proteins from skin, tendon, and bone play supportive roles in living systems.

The backbone of a protein or peptide is a sequence of repeated units consisting of the amide nitrogen N(H), the α-carbon Cα(R), and the carbonyl carbon C(=O) (Fig. A1.3A) for each residue in the protein or peptide. The peptide bond connecting the neighboring

[Figure A1.3: panels (a)–(c); plot details not reproduced here.]

Fig. A1.3 (a) Amino-acid residues are connected by peptide bonds to form the linear chain of a protein. Here "R" represents the side chains of the residues. The planes containing the peptide linkages are shown in the shaded boxes, and the dipole moments are indicated by the arrows on the peptide bonds. (b) Matrix metalloproteinase (MMP; PDB ID: 1Y93). (c) The Ramachandran plot of MMP obtained from the Rampage server (www-cryst.bioc.cam.ac.uk)


residues is a partial double bond with significant π-electron distribution across the N–C=O structure. As a result, C, N, and the four atoms attached to them have a strong tendency to be coplanar, with only small possible deviations. Rotation about the peptide bond is therefore restricted by a substantial energetic barrier. The trans configuration shown in Fig. A1.3A (i.e., with the Cα atoms of two adjacent residues on opposite sides of the peptide bond that joins the residues) is energetically more stable than the cis configuration; the probability of occurrence of the nonproline cis configuration has been estimated to be as low as 0.1%. One important consequence of the planar nature of the peptide bond is that the backbone conformation of a polypeptide is largely determined by the torsion angles about the Cα–N bond (φ) and the Cα–C bond (ψ) (see Fig. A1.3A). Not all φ and ψ values are allowed, owing to possible steric clashes between nonneighboring groups. This leads to a further reduction in the number of possible regular structures in proteins. The Ramachandran plot is often used to display the allowed φ–ψ values for proteins [Ramachandran, 1963]. Figure A1.3B shows the three-dimensional structure of matrix metalloproteinase (MMP), and the corresponding Ramachandran plot is given in Fig. A1.3C. The (φ, ψ) values are clustered around the regions for helices and for parallel and antiparallel β-sheets (see below), consistent with the structure in Fig. A1.3B. One important property of the protein backbone is the large dipole moment (about 3.36 D) of the peptide group, which can therefore make favorable interactions with solvent and form hydrogen bonds with other peptide groups or side chains.
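A torsion angle such as φ or ψ is a dihedral angle defined by four consecutive backbone atoms (C of residue i−1, N, Cα, C for φ; N, Cα, C, N of residue i+1 for ψ). A sketch of computing such an angle from atomic coordinates (helper names are ours; the sign convention is the common atan2 formulation):

```python
import math

def dihedral(p1, p2, p3, p4):
    """Torsion angle in degrees defined by four 3-D points."""
    def sub(a, b): return [a[i] - b[i] for i in range(3)]
    def cross(a, b): return [a[1]*b[2] - a[2]*b[1],
                             a[2]*b[0] - a[0]*b[2],
                             a[0]*b[1] - a[1]*b[0]]
    def dot(a, b): return sum(x * y for x, y in zip(a, b))

    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)   # normals of the two planes
    b2_unit = [x / math.sqrt(dot(b2, b2)) for x in b2]
    m1 = cross(n1, b2_unit)
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))

# Planar zigzag (trans-like) vs. U-shape (cis-like) arrangements:
print(dihedral((0,0,0), (1,0,0), (1,1,0), (2,1,0)))  # +/-180.0
print(dihedral((0,0,0), (1,0,0), (1,1,0), (0,1,0)))  # 0.0
```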

There are different levels of protein structure. The primary structure refers to the amino-acid sequence of a protein. Sequence comparison of primary structures of different proteins is widely used to predict similarity in structure and function and is often the first step in protein structure prediction. These comparisons are based on alignments of the sequences that maximize the number of identical residues and minimize the number of insertions and deletions. Detailed discussions of sequence comparison can be found in other chapters of this book.
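The alignment criterion just described (maximize identities, minimize insertions and deletions) is commonly implemented by dynamic programming. A generic sketch of the Needleman-Wunsch global-alignment score (the scoring parameters are arbitrary illustrative values, not those of any method in this book):

```python
def nw_score(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment score. F[i][j] holds the best score for
    aligning the prefixes s[:i] and t[:j]."""
    rows, cols = len(s) + 1, len(t) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap            # align s[:i] against all gaps
    for j in range(1, cols):
        F[0][j] = j * gap            # align t[:j] against all gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i-1][j-1] + (match if s[i-1] == t[j-1] else mismatch)
            F[i][j] = max(diag,          # substitution or identity
                          F[i-1][j] + gap,   # deletion
                          F[i][j-1] + gap)   # insertion
    return F[-1][-1]

print(nw_score("HEAGAWGHEE", "HEAGAWGHEE"))  # 10 (identical sequences)
```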

The secondary structure refers to the regular repeating patterns of backbone conformations that are usually held together by hydrogen bonds. Side chains are normally not considered at the level of secondary structure, although different side chains often have different preferences for different elements of secondary structure (see Chapter 8 for related discussions). Figure A1.4 shows the elements of secondary structure, including the right-handed α-helix, the β-turn, and parallel and antiparallel β-sheets. α-Helices are widely observed in globular proteins (and fibrous proteins), and about one-third of the residues of proteins are in α-helices. The existence of the α-helix as a stable structure in proteins was predicted by Linus Pauling and his co-workers in 1951, and this prediction was soon supported by diffraction patterns from hemoglobin crystals obtained by Max Perutz. The average span of the helices is approximately 11 residues, or three turns, although much longer helices exist in proteins. The α-helix has 3.6 amino-acid residues per turn and is also known as the 3.6-helix, with a hydrogen bond between the C=O group of residue n and the N–H group of residue n+4. These hydrogen-bonding interactions play an important role in stabilizing helical conformations. The φ and ψ values for amino-acid


Fig. A1.4 Schematic diagrams for elements of secondary structure. (a) α-helix; (b) β-turn; (c) antiparallel and parallel β-sheets


residues are approximately -60° and -40°, respectively (see Fig. A1.3C), and the side chains lie outside the helical core.

The C=O groups generally tilt outward and are sometimes able to interact with solvent molecules or other protein groups to satisfy their hydrogen-bonding potential. There are unsatisfied hydrogen-bond donors at the N-terminal end of a helix and unsatisfied hydrogen-bond acceptors at the C-terminal end (Fig. A1.4A). Polar and charged groups (e.g., from side chains or other peptide groups) normally interact with these unsatisfied hydrogen-bond donors and acceptors at the termini (i.e., forming the so-called helix-capping interactions) to satisfy their hydrogen-bonding potential [Presta and Rose, 1988]. All the peptide groups in an α-helix point approximately in the same direction. It has been widely observed that helices have the ability to stabilize negatively charged groups at their N-termini, such as certain reaction intermediates and transition states as well as phosphate (sulfate) groups. It is widely believed that this is due to the cumulative effect of the peptide dipoles, leading to the formation of macro-dipoles of the helices with the positive ends at the N-termini and the negative ends at the C-termini [Hol, 1984]. However, there are different proposals concerning the charge-stabilization effects of helices, including explanations based on cooperative effects of hydrogen bonding [Guo, 1998] and local peptide dipoles [Aqvist, 1991].

Another type of helix frequently observed in proteins is the 3₁₀ helix. In this helix, the carbonyl oxygen atom of one residue (n) accepts a hydrogen bond from the amide nitrogen atom three residues farther along (i.e., residue n + 3 instead of residue n + 4 as in the α-helix). The φ and ψ values for these amino-acid residues are approximately -70° and -15°, respectively. The 3₁₀ helices normally contain only a few residues and frequently occur at the C-termini of α-helices.

The β-sheet is another major secondary structural element of proteins. Similar to the α-helix, it has a regular pattern of backbone hydrogen bonding and repeating φ and ψ values. The sheet structures are formed by two or more β-strands, usually from 5 to 10 residues long, in an almost fully extended conformation. These β-strands can run either in the same direction (parallel β-sheets) or in opposite directions (antiparallel β-sheets). The (φ, ψ) values are around (-120°, 110°) for parallel β-sheets and (-140°, 135°) for antiparallel β-sheets. Within each strand of a β-sheet, the side chains alternate from one side to the other, but they can interact with the side chains of adjacent strands. Parallel β-sheets are normally located within the hydrophobic core of proteins and contain five or more strands. In contrast, small antiparallel β-sheets (e.g., containing only two strands) are frequently observed. It has been suggested that antiparallel sheets are more stable and have stronger hydrogen bonds than parallel sheets. In antiparallel β-sheets, the amino-acid residues tend to alternate between hydrophobic and hydrophilic along the strands, with the face containing the hydrophilic residues exposed to solvent. The backbone hydrogen bonds between the C=O and N–H groups in β-sheets involve residues from different strands that may be distant from each other in the sequence. This is in contrast to α-helices, where the hydrogen bonds occur between residues that are close in the primary structure. One important structural feature of β-sheets is the existence


Fig. A1.5 Two of the χ angles defining the conformations of the side chain of residue i

of a right-handed twist of the strands and sheets. Turns are another type of important structural element widely observed in proteins. They are stabilized by one or two backbone hydrogen bonds and are able to reverse the direction of the polypeptide chain; e.g., successive strands in antiparallel β-sheets are connected by turns.

The conformations of the side chains of the amino-acid residues in proteins can be described by the χ torsion angles along the side chains (see Fig. A1.5) and other structural parameters. The χ1 torsion angle for residue i is defined by Nᵢ, Cαᵢ, Cβᵢ, and Cγᵢ. There are three possible positions for the rotation about the Cα–Cβ bond when other nonbonded interactions are not significant. These positions correspond to one trans (χ1 ≈ 180°) and two gauche (χ1 ≈ 60° and -60°) conformations, although large deviations (say, up to 20°) from these "standard" χ1 values are quite common in protein structures.
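Assigning an observed χ1 angle to the nearest of these three staggered conformations can be sketched as follows (the ±20° tolerance follows the text; since naming conventions for the two gauche wells vary, they are labeled here simply by the sign of the angle):

```python
def chi1_rotamer(chi1, tol=20.0):
    """Assign a chi1 angle (degrees) to the nearest staggered rotamer:
    trans (~180 deg) or one of the two gauche wells (~+60, ~-60 deg)."""
    for name, ref in (("trans", 180.0), ("gauche+", 60.0), ("gauche-", -60.0)):
        # angular difference wrapped into [-180, 180)
        diff = (chi1 - ref + 180.0) % 360.0 - 180.0
        if abs(diff) <= tol:
            return name
    return "outlier"

print(chi1_rotamer(175.0))    # trans
print(chi1_rotamer(-185.0))   # trans (wraps around -180)
print(chi1_rotamer(-65.0))    # gauche-
```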

The elements of secondary structure of a polypeptide chain can be packed together to form the tertiary structure of the protein, containing one or several structural units called domains. A domain is a compact and stable structure formed by a polypeptide chain, or a part of a polypeptide chain, that folds independently. Many proteins contain more than one polypeptide chain, and the association of two or more polypeptide chains gives rise to the quaternary structure of a protein. For a more detailed discussion of protein domains and of protein structure comparison and classification, see Chapters 5 and 6.

Suggested Further Readings

Voet, D., and Voet, J.G. 2004. Biochemistry, 3rd ed. New York, Wiley.
Creighton, T.E. 1993. Proteins: Structure and Molecular Properties, 2nd ed. New York, Freeman.
Petsko, G.A., and Ringe, D. 2004. Protein Structure and Function. London, New Science Press.
Branden, C., and Tooze, J. 1999. Introduction to Protein Structure, 2nd ed. New York, Garland Publishing.
Ramachandran, G.N., et al. 1963. J. Mol. Biol. 7:95–99.
Presta, L.G., and Rose, G.D. 1988. Science 240:1632–1641.
Hol, W.G. 1984. Prog. Biophys. Mol. Biol. 45:149–195.
Guo, H., and Salahub, D.R. 1998. Angew. Chem. Int. Ed. Engl. 37:2985–2990.
Aqvist, J., Luecke, H., Quiocho, F.A., and Warshel, A. 1991. Proc. Natl. Acad. Sci. USA 88:2026–2030.
Richardson, J.S. 1981. Adv. Protein Chem. 34:167–339.


Appendix 2 Computer Science for Structural Informatics

Guohui Lin, Dong Xu, and Ying Xu

A2.1 Introduction

Many of the computational problems arising in structural informatics, such as protein structure prediction and protein docking, are highly computationally intensive. It represents a significant challenge to design computer programs that solve these problems efficiently enough to be practically useful. One of the main reasons that ab initio folding algorithms have not been widely used for protein structure prediction is that people have not been able to make these algorithms run fast enough while achieving acceptable accuracy, besides the fact that these algorithms are limited by the inadequacy of the current understanding of protein folding. For these problems, while programming-level techniques, advanced hardware, and parallel implementation can help to make the programs run faster, improvements at the algorithmic level often play larger and more fundamental roles. Consequently, the design and analysis of algorithms come into play. To address algorithmic issues, we first need to discuss a basic concept, namely "computational complexity," as it measures the running time of an algorithm, or the intrinsic difficulty of a computational problem, in mathematical terms. Furthermore, good algorithm design generally involves two levels of consideration, namely, high-level algorithms and lower-level data structures, both of which we discuss in this appendix.

Extensive algorithmic research has been carried out on structural informatics problems, such as protein threading, protein docking, and protein structure–structure alignment, as we have presented in this book. As we have seen, solving many computational problems in structural informatics requires program developers to have a deep understanding of algorithmic techniques and good domain knowledge of the biological problems. From time to time we see situations in which, under one particular formulation, a computational problem is exceedingly difficult to solve. However, a different formulation, which might require a deeper understanding of the biological problem, can often lead to tractable algorithms. For example, without a good understanding of the protein threading problem, people might formulate threading as a computational problem in which all pairwise interactions are considered. Under this formulation, the protein threading problem has been proven to be NP-hard. But this mathematical result might not necessarily have much to do with the intrinsic computational difficulty


of a real threading problem. Protein structural biologists will tell us that only short-range interactions are important, while long-range interactions are not. With this knowledge, protein threading could potentially be solved efficiently, as discussed in Chapter 12. So a key issue, before starting to design a computer algorithm, is to gain a good understanding of the biological problem. Still, many well-formulated structural informatics problems are highly difficult to solve. For some of them, efficient and rigorous algorithms might not be possible. Hence, a compromise between computational efficiency and prediction accuracy has to be made. Approximation and heuristic algorithmic techniques have often been used to gain computational efficiency at the expense of (some) prediction accuracy.

In this appendix, we use several computational problems arising from protein structure prediction and whole-genome phylogeny reconstruction to demonstrate a number of key algorithmic design techniques. It should be noted that these techniques fall into the general discipline of algorithm design and analysis. We discuss these techniques only briefly, as the main purpose of this appendix is to provide an introduction to basic algorithmic concepts and techniques for researchers and students who have no formal training in computer algorithm design, so as to help them better follow the other chapters of this book. Interested readers should consult algorithms books for more in-depth discussions of the details of these techniques. A list of such references is provided at the end of the appendix.

A2.2 Efficient Data Structures

How data are organized for a computational problem can have a significant impact on the computational complexity of an algorithm for solving the problem. For example, finding a particular number in a set of numbers takes time linear in the size of the set if the numbers are stored in an unsorted array, but only time proportional to the logarithm of the size (using binary search) if the numbers are sorted in descending or ascending order, indicating the power of data organization. As a key component of algorithm design, the use of proper data structures can make an algorithm run much faster than a naive implementation. Typically, a data structure is defined in terms of how its data are organized and what operations (e.g., finding the maximum value in a set, or merging two sets) it supports. Previous experience in combinatorial algorithm design has shown that many algorithms can be described in terms of a set of basic operations on the data, and, interestingly, some of these basic operations recur in many different algorithm design problems, making them a set of commonly used operations. Making these basic operations run as efficiently as possible through careful design allows a wide range of algorithms to run more efficiently. Efficient implementation of such basic operations is the focus of data structure design. The array is one of the simplest data structures and is widely used in the implementation of many computer algorithms. It supports direct access to each data point in constant time if the address of the data point is known. Otherwise it might require time proportional to the size


A2. Computer Science for Structural Informatics

of the array to access a particular data point, as the whole array may have to be scanned to find it. Other simple data structures include (a) linked lists, a data structure defined on a sequence of nodes, each with one or two references pointing to the next and/or previous nodes; (b) stacks, a data structure that maintains a last-in, first-out order; and (c) queues, a data structure that maintains a first-in, first-out order. More sophisticated data structures include (a) hash tables, (b) suffix trees, (c) heaps and priority queues, (d) disjoint sets, (e) red-black trees, and many more. Each of these data structures supports a particular set of operations and organizes its data in such a way that the computing time for the supported operations is "optimized." We now introduce several commonly used data structures.
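The contrast drawn above between searching an unsorted and a sorted array can be sketched in a few lines (an illustrative Python sketch, not from the original text, using the standard bisect module for the logarithmic-time search):

```python
from bisect import bisect_left

def linear_search(values, target):
    """O(n) scan of an unsorted array: may visit every element."""
    for i, v in enumerate(values):
        if v == target:
            return i
    return -1

def binary_search(sorted_values, target):
    """O(log n) search, valid only when the array is sorted."""
    i = bisect_left(sorted_values, target)
    if i < len(sorted_values) and sorted_values[i] == target:
        return i
    return -1

data = [29, 3, 17, 8, 42]
print(linear_search(data, 17))          # index in the unsorted array -> 2
print(binary_search(sorted(data), 17))  # index in the sorted copy -> 2
```

Note that sorting itself costs O(n log n); the logarithmic search pays off when the same array is queried many times.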

A2.2.1 Hash Tables

Biological objects are usually associated with many interesting attributes, and they are referenced through so-called "keys." In applications that involve a small number of objects with distinct keys, an array of pointers can be allocated to store the addresses of the objects. These pointers are referenced by their indices in the array, which in the simplest case equal the key values of the objects; that is, the kth pointer records the address of the object whose key value equals k. Such a simple direct-addressing scheme ensures that the dictionary operations SEARCH, INSERT, and DELETE run in constant time. However, one can easily see the limitation of such an addressing scheme, namely, the key values have to be consecutive integers; otherwise many entries in the array contain no data, and hence are wasted.

Resolving these issues leads to the search for a function h(·) that transforms a key value k into a value h(k), such that the range of h(·) densely populates the interval [1, n], where n is the number of objects. Such a function is generally referred to as a hash function, and the new array is referred to as a hash table. It should be noted that the hashed key values may not be distinct; a collision occurs when two or more hashed key values are the same. For an efficient search, collisions should occur with minimum probability. To resolve collisions, each entry in a hash table can point to a (linked) list of pointers that address the objects with the same hashed key value. The quality of a hash function is generally measured by the density of its range in the interval and by the length of the longest linked list. There is no universal scheme for finding good hash functions, as hash functions are application dependent. Nonetheless, an average-quality hash function suffices to support the dictionary operations in constant time on average. One should note that the main benefit of using hash tables is to save memory.
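Collision resolution by chaining, as just described, can be sketched as follows (an illustrative Python sketch; the class name and the sample key are invented for the example, and in practice Python's built-in dict already provides a hash table):

```python
class ChainedHashTable:
    """Minimal hash table with collision resolution by chaining:
    each slot holds a list of (key, value) pairs sharing a hashed key."""

    def __init__(self, n_slots=101):
        self.slots = [[] for _ in range(n_slots)]

    def _h(self, key):
        # Hash function h(.): map an arbitrary key into [0, n_slots).
        return hash(key) % len(self.slots)

    def insert(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:            # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))  # a collision simply extends the chain

    def search(self, key):
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        i = self._h(key)
        self.slots[i] = [(k, v) for k, v in self.slots[i] if k != key]

t = ChainedHashTable()
t.insert("1ABC", "lysozyme")   # a made-up key/value pair
print(t.search("1ABC"))        # lysozyme
```

With an average-quality hash function the chains stay short, so SEARCH, INSERT, and DELETE all run in constant time on average.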

As an example, an information-content-based phylogeny construction method using whole genomes is to use the frequencies of strings of a certain number of nucleotides to map a whole genome to a high-dimensional vector, and then to define the cosine of the angle between two vectors as the evolutionary distance between the two genomes (Wu et al., 2006). When a string is short, for instance shorter than 8, it is very likely to appear in every genome sequence. However, when it is


Guohui Lin et al.

longer than 12, it becomes rare and might not appear in any genome. Since the nucleotide alphabet has a size of 4, using length-k strings a genome can be plotted as a 4^k-dimensional vector, which is computationally infeasible when k is larger than 13 (requiring at least 10 GB of memory). On the other hand, plotting genomes using too-short strings would not provide sufficient discerning power to distinguish genomes. To make this computation feasible, the observation that many entries in the high-dimensional vectors are 0 suggests the use of a hash table for frequency counting, namely, only the non-0 frequencies are recorded.
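The frequency-counting idea can be sketched with a dictionary serving as the hash table (an illustrative Python sketch; the sample sequences are invented, and only the cosine computation is shown, from which Wu et al. (2006) derive their evolutionary distance):

```python
import math
from collections import Counter

def kmer_counts(genome, k):
    """Frequencies of length-k strings; the dict (hash table) stores only
    the non-zero entries of the 4^k-dimensional vector."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

def cosine(c1, c2):
    """Cosine of the angle between two sparse k-mer frequency vectors;
    zero entries never enter the computation."""
    dot = sum(c1[s] * c2[s] for s in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

g1 = "ACGTACGTACGTTTAC"   # made-up toy "genomes"
g2 = "ACGTTTACGTACGTAC"
print(cosine(kmer_counts(g1, 4), kmer_counts(g2, 4)))
```

Because only observed k-mers are stored, the memory cost scales with the genome length rather than with 4^k.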

Geometric hashing is the hashing technique most relevant to structural informatics problems. It represents a geometric object as a set of triangles and stores the triangles in a hash table. It is efficient for finding geometric objects of similar shape, and it has been widely used in computer vision. Applications of geometric hashing in structural informatics include structural alignment of proteins (Weskamp et al., 2004) and protein docking (Fischer et al., 1993).

A2.2.2 Suffix Trees

A suffix tree is a data structure that exposes the internal structure of a sequence/string (e.g., a DNA or protein sequence). It enables the speed-up of a number of operations performed on sequences. Given a sequence S of length n, a suffix tree T is a rooted, directed, edge-labeled tree with exactly n leaves numbered from 1 to n. Every internal node of T, except for the root, has at least two edges. The edges are labeled by nonempty substrings of S such that the edges coming out of a node have labels starting with distinct characters. For every leaf, the concatenation of the edge labels on the root-to-leaf path spells out a suffix of S, and the starting position of that suffix in S is the leaf label. Figure A2.1 illustrates the suffix tree for sequence S = xabxac.

To demonstrate the power of a suffix tree, we use the following protein sequence search problem as an example, in which a protein sequence is represented as a suffix tree. Let P be a protein sequence and S a peptide sequence; we want to know whether P contains the peptide S. A naive algorithm can decide whether S is part of P by simply comparing S to each segment of P and then answering "yes" or "no" after possibly

Fig. A2.1 The suffix tree for sequence S = xabxac.


going through the whole sequence. In computer science terms, it takes time proportional to the length of P times the length of S to answer the question. This search becomes computationally expensive when we have to search many different S's against P. Using a suffix tree data structure for P, we only need to spend time proportional to the length of S for each peptide S. The key observation is that once the suffix tree for P is built, we may compare S from the root and along the branch whose label matches S. At the end, either the comparison ends with no match, or all occurrences of S in P are identified: the leaves in the subtree rooted at the point where S is exhausted record the positions where S occurs in P. One thing worth noting is that there is an overhead for constructing the suffix tree for the protein sequence P, but this can be done once, before any search, in time linear in the length of P (Gusfield, 1997). The suffix tree and its variants have found many other applications in computational biology, such as pattern finding (Marsan and Sagot, 2000).
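The search procedure can be illustrated with a naive suffix trie, in which every edge is labeled by a single character (an illustrative Python sketch; it conveys the search idea but uses quadratic-time construction, whereas true suffix trees are built in linear time, e.g., by Ukkonen's algorithm as described in Gusfield, 1997):

```python
def build_suffix_trie(text):
    """Naive O(n^2) suffix trie: each node is a dict of single-character
    edges; a "$" entry at a node records the 1-based starting positions
    of the suffixes ending there."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(start + 1)  # 1-based leaf label
    return root

def find_occurrences(root, pattern):
    """Walk down along pattern, then collect all leaf labels below:
    time proportional to len(pattern) plus the number of occurrences."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []            # mismatch: pattern does not occur
        node = node[ch]
    leaves, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == "$":
                leaves.extend(child)
            else:
                stack.append(child)
    return sorted(leaves)

trie = build_suffix_trie("xabxac")
print(find_occurrences(trie, "xa"))  # [1, 4]
```

The subtree below the point where the pattern is exhausted plays exactly the role described in the text: its leaves are the occurrence positions.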

A2.2.3 Disjoint Sets

Some biological applications involve grouping distinct objects into a collection of disjoint sets such that the objects in one set share some commonality. For example, one may want to cluster protein structures in structure prediction when a large number of candidate structures are computationally generated. In such applications, one may want to find the set that a given object belongs to, and to merge two sets if their objects are similar (to create a larger set). A disjoint-set data structure can be used to implement such operations efficiently.

A disjoint-set data structure maintains a collection S = {S1, S2, ..., Sk} of disjoint dynamic sets, where each set is identified by one of its members, called its representative. Letting x denote an object, the data structure supports the following operations (Cormen et al., 2001):

• MAKESET(x) creates a new set whose only member is x. It is required that x not belong to any other set; x naturally becomes the representative of the set.

• UNION(x, y) merges the sets that contain x and y, say Sx and Sy, into a new set Sx ∪ Sy. Sx and Sy are assumed to be distinct, and any member of the new set can be elected to serve as the representative. In many implementations, the representative of either Sx or Sy is picked as the new representative. After the merging, Sx and Sy are destroyed, i.e., removed from the collection S.

• FIND(x) returns the representative of the unique set containing x.

A naive implementation of this data structure, say, putting all the elements of a set into a linked list, requires linear time for FIND on a particular element, though it takes only constant time to merge two sets. Through careful design, the combined computing time of these two operations can be reduced significantly. For example, by connecting the elements of a set into a rooted tree and using the root as the representative, one may implement the FIND operation to return the root of the tree that the element is in, and implement the UNION operation to make


the root of the shorter tree a child of the other root. This forest-of-rooted-trees implementation ensures an average computing time per operation of O(log n), where n is the total number of elements. A further implementation of FIND, called compressed find, can reduce the average running time to almost a constant.
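The forest-of-rooted-trees idea with compressed find can be sketched as follows (an illustrative Python sketch; it uses union by size, a common variant of the shorter-tree-under-taller-tree rule described above, and the sample object names are invented):

```python
class DisjointSets:
    """Disjoint sets as a forest of rooted trees, with union by size and
    path compression ("compressed find") for near-constant amortized time."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def make_set(self, x):
        if x not in self.parent:
            self.parent[x] = x   # x is its own representative
            self.size[x] = 1

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:   # second pass: compress the path
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.size[rx] < self.size[ry]:   # attach the smaller tree
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]

ds = DisjointSets()
for s in ("structA", "structB", "structC"):   # made-up structure labels
    ds.make_set(s)
ds.union("structA", "structB")
print(ds.find("structA") == ds.find("structB"))  # True
print(ds.find("structA") == ds.find("structC"))  # False
```

In a structure-clustering application, each UNION merges two clusters of similar structures and FIND answers which cluster a structure currently belongs to.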

A2.2.4 Heaps

A heap data structure is essentially an array whose objects satisfy some implicit relationships specified by their indices.

The simplest heap is a binary min-heap (or max-heap), which is an array A of n objects, each associated with a nonnegative rational key value, such that A[i] ≤ A[2i] and A[i] ≤ A[2i + 1] hold for every i (as long as 2i ≤ n and/or 2i + 1 ≤ n). Such a binary heap may be viewed as a nearly complete binary tree in which the key value of a node is smaller than or equal to the key values of its child nodes. Given an arbitrary array A of size n, A can be built into a binary min-heap in linear time. For example, A[1 ... 7] = (1, 3, 2, 6, 4, 5, 7) is a heap.

As noted above, by regarding the array as a binary tree, the height of the tree is proportional to the logarithm of the size of the heap, i.e., O(log n). Inserting a new object into the tree structure while maintaining the heap property can be done as follows. We append the new object to the end of the array, i.e., the last position in the binary tree, and compare its key value to the key value of its parent. If it is larger than or equal to its parent's, the heap property is maintained and we may stop; otherwise we swap the new object with its parent and continue. The process ends when either no swap is needed or the root is reached. Since the height is O(log n), this insertion operation takes O(log n) time. In a similar fashion, other heap operations such as "extracting an object with the minimum key value" or "increasing the key value of an object" can be accomplished in O(log n) time. These heap operations, together with the heap itself, give rise to another important and useful data structure, perhaps more frequently termed a priority queue.
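The sift-up insertion just described can be sketched in a few lines (an illustrative Python sketch using a 1-indexed array, with A[0] unused, so that the children of position i are 2i and 2i + 1; in practice Python's heapq module provides 0-indexed heap operations):

```python
def heap_insert(A, key):
    """Insert key into a binary min-heap stored in a 1-indexed array,
    sifting the new object up toward the root in O(log n) swaps."""
    A.append(key)
    i = len(A) - 1
    while i > 1 and A[i // 2] > A[i]:
        A[i], A[i // 2] = A[i // 2], A[i]  # swap with the parent
        i //= 2                            # move up one level

A = [None, 1, 3, 2, 6, 4, 5, 7]  # the heap from the text, 1-indexed
heap_insert(A, 0)
print(A[1])  # 0 has sifted all the way up to the root
```

Each swap moves the new key one level up, so the number of swaps is bounded by the tree height, O(log n).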

Among the heaps, it is interesting to note the Fibonacci heap, which can serve as a key component in efficient implementations of other data structures and algorithms, such as minimum spanning tree algorithms and Dijkstra's algorithm for the shortest path between two nodes in a graph. A Fibonacci heap is a collection of min-heaps (or max-heaps, defined analogously), not necessarily binary. In a Fibonacci heap, every min-heap is treated as a rooted tree. Each node x contains several fields besides its key value:

• a pointer p[x] to its parent;
• a pointer child[x] to any one of its children, which are linked together in a circular, doubly linked list called the child list of x;
• degree[x], which records the size of its child list;
• pointers left[x] and right[x] to its left and right siblings, respectively;
• a Boolean field mark[x], which indicates whether node x has lost a child since the last time x was made the child of another node.


The roots of the min-heaps in the Fibonacci heap are also doubly linked into a circular list, which is accessed through a pointer min to the root containing a minimum key value. One global attribute of the Fibonacci heap is n, which records the number of nodes in the heap.

Besides the operations of a normal binary heap, an additional operation UNION is supported, which merges two min-heaps into one by making the root of one heap a child of the root of the other. All the supported operations ensure that the maximum degree of a node in a Fibonacci heap of size n is O(log n). Through a potential function and an amortized running-time analysis, the amortized running time per operation on a Fibonacci heap is constant (except for extracting the minimum and deleting a node, which take O(log n) amortized time). It should be noted that such a constant is not very small. Therefore, when the number of objects is small, a Fibonacci heap might not be better than a normal binary min-heap in terms of actual computing time.

A2.2.5 Other Data Structures

There are many other efficient data structures that support various operations such as search, insertion, deletion, merge, split, and so on. These data structures include binary search trees, red-black trees, sorting networks, k-d trees, etc. The interested reader may refer to standard algorithm textbooks such as Cormen et al. (2001) for their detailed definitions and sample applications. More advanced applications in structural informatics can be found in many research articles (Koch et al., 1996; Shyu et al., 2004).

A2.3 Computational Complexity and NP-Hardness

A2.3.1 Concept of Computational Complexity

Computational complexity is a measure of the computational resources needed to solve a given problem. The computational resources include computing time and memory. In most cases, computing time is more of a bottleneck for solving a structural informatics problem than the computer memory required; hence, we focus on computing time here. Computing time in the context of computational complexity does not give an estimate of the actual running time; rather, it gives the general trend as a function of the problem size. In this chapter, we do not use the rigorous definitions of computational complexity concepts that are customary in the theory of computational complexity. Rather, we introduce them in an informal way and emphasize their intuitive meanings.

Generally, people are most interested in the classes of computational problems that are solvable in polynomial time, though there are finer levels of hierarchical structure in the theory of computational complexity. The term polynomial here means polynomial in the size of the input instance, which could be the number of elements in a data set, such as the number of amino acids in a protein sequence.


Computational complexity can be characterized through abstraction. An abstract problem specifies what the input to the problem is and what goal is to be achieved. Abstract problems differ dramatically from one another across applications. For example, the input to the sorting problem is a list of elements that can be pairwise compared, and the goal is to rearrange them in nondecreasing (or nonincreasing) order. Note that there is no value associated with the desired solution to the sorting problem.

Another simple abstract problem is the minimum spanning tree (MST) problem. In this problem, a simple, connected, undirected, edge-weighted graph constitutes the input, and the goal is to compute a minimum-weight spanning tree of the graph. Whenever an input graph is given, that is, G = (V, E, w), where V is the vertex set, E is the edge set, and w : E → Q+ is the weight function, an instance of the MST problem is defined. In general, storing such a graph G = (V, E, w) takes Θ(|V| + |E|) memory units, where the notation Θ(f) indicates that the computational resource requirement has the order of f. Sometimes people prefer using O(f), which means that the computational resource requirement is no more than the order of f in the worst-case scenario.

In the context of the theory of computational complexity, there is a uniformity requirement on computational problems, which leads to the definition of decision problems. Essentially, decision problems are a special case of abstract problems in which the answer to the query that the problem asks is either yes or no. For this purpose, we may write down the decision version of the MST problem by adding to the abstract version an upper bound on the weight of spanning trees:

MINIMUM SPANNING TREE PROBLEM (MST):
INPUT: A simple, connected, undirected, and edge-weighted graph G = (V, E, w), where V is the vertex set, E is the edge set, and w : E → Q+ is the weight function, and a positive rational number ℓ ∈ Q+.
QUERY: Is there a spanning tree T of G such that w(T) ≤ ℓ?

To solve the decision version of the MST problem, we may simply apply a well-established MST-finding algorithm, such as Prim's algorithm, disregarding the parameter ℓ, and then compare the weight of the output tree to ℓ. If it is lower than or equal to ℓ, we answer yes; otherwise no. The decision version of the MST problem can thus be solved in polynomial time.
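This reduction of the decision version to an MST-finding algorithm can be sketched as follows (an illustrative Python sketch of Prim's algorithm with a priority queue; the vertex labels and the sample graph are invented for the example):

```python
import heapq

def prim_mst_weight(n, edges):
    """Total weight of a minimum spanning tree of a connected graph on
    vertices 0..n-1, by Prim's algorithm with a binary-heap priority queue."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((w, v))
        adj[v].append((w, u))
    seen = [False] * n
    pq = [(0, 0)]          # (edge weight, vertex); grow the tree from vertex 0
    total = 0
    while pq:
        w, u = heapq.heappop(pq)
        if seen[u]:
            continue        # u already in the tree; skip the stale entry
        seen[u] = True
        total += w
        for edge in adj[u]:
            if not seen[edge[1]]:
                heapq.heappush(pq, edge)
    return total

def mst_decision(n, edges, bound):
    """Decision version: is there a spanning tree of weight at most bound?"""
    return prim_mst_weight(n, edges) <= bound

edges = [(0, 1, 4), (1, 2, 2), (0, 2, 5), (2, 3, 1)]
print(mst_decision(4, edges, 7))  # True: the MST weight is 4 + 2 + 1 = 7
```

The decision procedure simply ignores the bound while computing the optimum and compares only at the end, exactly as described above.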

The weighted bipartite matching (BM) problem represents a common problem that many applications can be formulated as. Given a graph G = (V, E), a matching is a subset of edges no two of which are incident at a common vertex. A graph G is bipartite if its vertex set V can be partitioned into two subsets such that every edge is incident at one vertex in one subset and another vertex in the other subset. We now present a generalized version of the bipartite matching problem, called the Constrained Bipartite Matching (CBM) problem, and apply it to model and solve a structural informatics problem.


MAXIMUM WEIGHT BIPARTITE MATCHING PROBLEM (BM):
INPUT: An edge-weighted bipartite graph G = (V1, V2, E, w), where V1 ∪ V2 is the vertex set, E is the edge set, and w : E → Q+ is the weight function, and a positive rational number ℓ ∈ Q+.
QUERY: Is there a matching M of G such that w(M) ≥ ℓ?

Like the MST problem, the BM problem can also be solved in polynomial time.

CONSTRAINED BIPARTITE MATCHING PROBLEM (CBM):
INPUT: An edge-weighted bipartite graph G = (V1, V2, E, w), where V1 ∪ V2 is the vertex set, E is the edge set, and w : E → Q+ is the weight function, and a positive rational number ℓ ∈ Q+. The vertices in V1 are linearly ordered into v1, v2, ..., v|V1|; the vertices in V2 are partitioned into a collection of subsets, and the vertices in each subset are linearly ordered.
QUERY: Is there a matching M of G such that (1) M satisfies the linear order constraints, i.e., the vertices in each subset of V2 are mapped to consecutive vertices in V1, and (2) w(M) ≥ ℓ?

Unlike the MST and BM problems, so far there is no polynomial-time algorithm that solves the CBM problem. Nonetheless, if the answer to an instance of the CBM problem is yes, a proof supporting the answer would be a matching. All we need to do to verify the answer is to check whether all the constraints are satisfied by the provided matching and whether its weight is indeed greater than or equal to ℓ. Such a verification can be done in polynomial time. Polynomial-time solution algorithms and polynomial-time verification algorithms lead to the definitions of deterministic and nondeterministic Turing machines, which are used to formally define the hierarchical classes of problems.

Computational problems can be categorized into a number of complexity classes. The complexity class P consists of all decision problems that can be solved in polynomial time; the complexity class NP consists of all decision problems whose yes-instances can be verified in polynomial time. We know that the MST problem and the BM problem both belong to P. With an empty proof and a solution algorithm used as the verification algorithm, we can easily show that a problem in P is also in NP; in other words, P is a subset of NP. The three problems outlined above, MST, BM, and CBM, all belong to NP. For every decision problem in NP, the class co-NP contains the same decision problem with the yes and no answers swapped (no-instances, or complementary instances). That is, the class co-NP consists of all decision problems whose no-instances can be verified in polynomial time. P is a subset of co-NP too, and the relationships among P, NP, and co-NP constitute a longstanding open problem. Computational decision problems from most practical applications fall into these three classes.

Given two problems Π1 and Π2 in NP, if every instance I1 of Π1 can be transformed into an instance I2 of Π2 in polynomial time (in the size of I1), and I1 is a yes-instance if and only if I2 is a yes-instance, then we say that Π1 is polynomial-time reducible to Π2 and denote it as Π1 ≤p Π2. The transformation function is referred


to as a polynomial-time reduction. Polynomial-time reductions are useful in both algorithm design and mathematical proofs of computational hardness. For example, if problem Π2 admits a polynomial-time algorithm A, then for every instance I1 of problem Π1 we may use the reduction to construct an instance I2 of Π2, apply algorithm A to I2, and finally use the solution to I2 as the solution to I1. The property of the reduction ensures that the solution to I1 is correct. Also, since the reduction takes polynomial time, the size of I2 must be polynomial in the size of I1. Noticing that the composition of polynomials is still a polynomial, we can conclude that the overall computational time to solve I1 is polynomial; in other words, we can develop a polynomial-time algorithm for problem Π1.

Perhaps the more important application of polynomial-time reductions is to use them to define the hardest problems, or the notion of completeness. That a problem is complete with respect to a class has two meanings: (1) the problem belongs to the class; (2) the problem is among the most difficult ones to solve within the class. Therefore, for a problem Π ∈ NP, if we are able to rigorously prove that Π' ≤p Π for every other problem Π' ∈ NP, then Π is NP-complete. There are thousands of problems that have been proved to be in NP, and the number is still growing; in fact, the size of NP is expected to be infinite. In this sense, proving a problem NP-complete seems impossible. The first breakthrough was made by Cook (1971), who showed that the SATISFIABILITY (SAT) problem is NP-complete. The proof is through a reduction that transforms an instance I of any decision problem Π in NP into an instance I' of SAT, such that the transformation takes polynomial time in the size of I and the answer to I is yes if and only if the answer to I' is yes.

With one problem proven to be NP-complete, proofs of NP-completeness can be astonishingly simplified by the use of polynomial-time reductions. Notice that Π1 ≤p Π2 essentially (or informally) indicates that Π2 is at least as difficult as Π1. Therefore, if Π1 is known to be NP-complete and Π2 is in NP, then Π2 must be NP-complete too. To summarize, a proof of NP-completeness for a problem Π can be simplified as follows:

1. Show that Π ∈ NP;
2. Show that there exists an NP-complete problem Π' such that Π' ≤p Π.

To date, almost all the problems in NP that are not known to admit polynomial-time algorithms have been shown to be NP-complete, with a few exceptions. Integer factorization and graph isomorphism are probably the two most well-known problems for which we neither have polynomial-time algorithms nor have succeeded in proving NP-completeness. The CBM problem introduced above has been proven to be NP-complete (Xu et al., 2002).

A2.3.2 Optimization Problems

There are problems whose membership in NP we might not be able to prove, or which might not even be decision problems. If we are still able to prove that some NP-complete problem is reducible to them, then they can be classified as


NP-hard problems. Notice that the abstract version of the CBM problem is to compute a maximum-weight constrained matching. Such a problem contains an objective function to be optimized and is thus an optimization problem. Certainly, it is not a decision problem, and thus it does not belong to NP. Nonetheless, if the decision version could be solved in polynomial time, then through a binary search for the largest value of the parameter ℓ in the decision version such that the answer remains yes, we could obtain a constrained matching of weight ℓ while no constrained matching has weight larger than ℓ. In this way, we would obtain a polynomial-time algorithm for the optimization problem. On the other hand, if there is a polynomial-time algorithm solving the optimization problem, we can certainly use it to solve the decision problem. This says that the optimization problem is at least as difficult as the decision problem, and thus it is NP-hard.

Many computational problems formulated from bioinformatics applications are optimization problems, such as the MST, BM, and CBM problems. In Section A2.4, which introduces algorithm design and analysis techniques, many of the target problems are optimization problems. Some of these optimization problems can be solved in polynomial time, such as the MST problem, while others are NP-hard. The algorithm design techniques apply to both categories of optimization problems: for some of them the resultant algorithms output the optimal solutions, while for others the resultant algorithms output only approximate solutions, with or without guarantees on how close they are to the optimal ones.

A2.4 Algorithmic Techniques

Combinatorial optimization problems represent a large class of computational problems that many structural informatics problems, such as protein threading and protein structure alignment, can be formulated as. Some of these problems admit polynomial-time exact algorithms, while the others can be proven to be NP-hard. In this section, we introduce a number of popular and powerful algorithmic techniques that have been used to tackle these problems, including enumeration, dynamic programming, integer programming, branch-and-bound, heuristic search, dead-end elimination, greedy methods, reduction techniques, and divide-and-conquer. We present only the basic design principles in an informal way, and refer the reader seeking mathematical proofs of the detailed properties of the designed algorithms to computer algorithm books such as Cormen et al. (2001).

Depending on the target problems, these computational techniques may lead to exact, approximate, or heuristic algorithms. Exact algorithms are those that output optimal solutions to the computational problems. Approximation algorithms and heuristics usually target near-optimal solutions, and they normally run in polynomial time. The need for approximation algorithms and heuristics arises because many computational problems in structural informatics are NP-hard, so exact algorithms would run in exponential time; moreover, for many of them, although computing the optimal solutions is the most desirable, near-optimal solutions are also acceptable


and sometimes desired in biological applications, to provide several candidate targets that can be used for further investigation.

The main difference between approximation algorithms and heuristics is that approximation algorithms have theoretical performance guarantees while heuristics do not. For example, an approximation algorithm might guarantee that the solution it computes is at most 10% worse than the best solution, while a heuristic algorithm might give solutions that are "intuitively" close to the best solution. To define the performance guarantee of an approximation algorithm APX for a minimization problem Π, for an instance I of Π let OPT(I) denote the value of the optimal solution to I, and let APX(I) denote the value of the algorithm's output solution. Then

ρ = max_I APX(I) / OPT(I)

guarantees that the quality of the computed solution is within ρ times the optimum; ρ is called the worst-case performance ratio of algorithm APX, and problem Π is also said to admit a ρ-approximation algorithm. The performance ratio of an approximation algorithm for a maximization problem can be defined analogously.

A2.4.1 Exhaustive Enumeration

Exhaustive enumeration is a commonly used algorithmic technique in bioinformatics. We use the CBM problem defined in a previous section as an example to describe the exhaustive enumeration method; specifically, we use a special case of the CBM problem formulated from protein NMR backbone resonance sequential assignment, a crucial step in protein structure determination through NMR spectroscopy. An instance of this CBM problem consists of an edge-weighted (complete) bipartite graph G = (V1, V2, E, w), where |V1| = |V2|, V1 is the set of amino acid residues in the target protein sequence arranged in sequential order, and V2 is the set of spin systems observed from the protein NMR spectra; every edge connecting a spin system and a residue indicates a possible assignment of the spin system to the residue, with its weight measuring the confidence. The overall goal is to compute a maximum-weight perfect matching, which gives a sequential assignment.

The above formulation is in fact an instance of the BM problem, and thus a maximum-weight perfect matching can be computed in polynomial time. However, researchers found problems with this formulation. Since the weighting function cannot distinguish the sequential positions of amino acid residues and there are only 20 types of residues, one spin system can map equally well to multiple copies of the same residue type, which cannot be distinguished. Consequently, there are too many maximum-weight perfect matchings for one protein. This makes the returned matching not applicable, since protein NMR spectroscopy requires a nearly complete assignment as a prerequisite for the later-stage structure calculation.

Fortunately, connectivity information between spin systems can often be derived from the NMR spectra, and it can be identified with a certain level of accuracy.


One way of using the connectivity information is to connect spin systems into strings. Note that one spin system can be in at most one string. A string of spin systems must be mapped to a consecutive peptide in the target protein. Under the connectivity constraint, the instance becomes a nontrivial instance of the CBM problem, which seeks a maximum-weight perfect matching that satisfies the constraint. Since the CBM problem is NP-hard, it is unlikely that polynomial-time algorithms are available. One way of handling the connectivity constraint is to apply a two-layer algorithm (Xu et al., 2002), which falls into the enumeration category.

Essentially, in the two-layer algorithm, the first layer enumerates all possible assignments for the strings (of length ≥ 2), and the second layer completes each possible assignment by adding assignments for the singleton spin systems, which can be done by calling a maximum-weight matching algorithm.

Exhaustive enumeration requires essentially no prior knowledge or understanding of the problem under consideration. Consequently, the running time can be very long, and most of the time too long to be practical. The actual implementation of the two-layer algorithm in Xu et al. (2002) greedily selects only a small fraction of the possible assignments for the strings, those regarded as the most likely ones. With some knowledge or understanding of the problem under consideration, better enumeration schemes can often be developed, which subsequently fall into other categories of exact algorithm design techniques.

A2.4.2 Dynamic Programming

Dynamic programming is one of the most widely used algorithmic techniques in bioinformatics. The popular sequence-sequence alignment problem was solved using a dynamic programming algorithm (Smith and Waterman, 1981). This class of algorithms has been used in solving a wide range of bioinformatics problems, including gene model assembly (Xu et al., 1994), frame-shift detection (Xu et al., 1995), protein threading (Jones et al., 1992), mass spectrometry data interpretation (Yan et al., 2005), RNA secondary structure prediction (Mathews et al., 2004), and haplotype block partitioning (Zhang et al., 2002). We now use a string matching problem as an example to explain the essence of a dynamic programming algorithm.

Given an alphabet Σ, every element of it is called a letter, a base, or a character. A string on Σ can be defined recursively: (1) an empty string contains no letter; (2) every single letter from Σ is a string; (3) the concatenation of two strings is a string. Intuitively, a string can be regarded as a sequence of linearly ordered letters, also referred to as a sequence. A subsequence of a given sequence is obtained by deleting some letters from the sequence and then concatenating the remaining letters while maintaining their relative order in the original sequence. Given two sequences, a common subsequence of them is a subsequence of both. Measuring the length of a sequence by the number of elements therein, the longest common subsequence (LCS) problem is to compute a longest common subsequence of two given sequences. In general, the ratio between the length of an LCS and the lengths of the given sequences reflects the extent of similarity between the two given sequences. The LCS problem is an essential element in biological sequence comparison (Needleman and Wunsch, 1970; Smith and Waterman, 1981).

Given two sequences S1[1, n] and S2[1, m] of lengths n and m, respectively, to compute an LCS for S1 and S2, one simple recursive method is to ask whether or not the last letter from sequence S1 is in the LCS. Obviously, if it is not, then the computation is reduced to computing an LCS for S1[1, n - 1] and S2[1, m], which is a smaller instance than the original problem. Similarly, we can reduce it to the problem of computing an LCS for S1[1, n] and S2[1, m - 1]. When both last letters from the given sequences are in the LCS, we conclude that they must be identical and both map to the last letter in the LCS. Therefore, the computation is reduced to computing an LCS for S1[1, n - 1] and S2[1, m - 1], which is appended with S1[n] to give the LCS. Such a recurrence relationship gives a simple recursive algorithm, in which solving one instance reduces to solving three smaller instances. Note that the recursion could go as deep as m + n levels, and therefore the worst-case running time for such a recursive algorithm would be O(3^(m+n)).

A more careful look at the computation reveals that such an exponential running time results from the fact that the algorithm makes a very large number of repetitive calls. Let LCS(i, j) denote the length of an LCS between prefixes S1[1, i] and S2[1, j]. Notice that there are a total of O(nm) distinct LCS(i, j) entries. The dynamic programming technique avoids such repetitive calls by making each call exactly once and storing the result. It follows that if LCS(i - 1, j), LCS(i, j - 1), and LCS(i - 1, j - 1) are known, then computing LCS(i, j) takes constant time through the following recurrence (i ≥ 1, j ≥ 1):

LCS[i, j] = max { LCS[i - 1, j],
                  LCS[i, j - 1],
                  LCS[i - 1, j - 1] + 1, if S1[i] = S2[j] }

Note that LCS[i, 0] = LCS[0, j] = 0 are the base cases. The overall running time of this dynamic programming algorithm is thus O(nm). The algorithm requires O(nm) space, which can be reduced to linear by combining divide-and-conquer (see Section 4.9) with dynamic programming. A key feature, and a very important design step, of the dynamic programming technique is to order the LCS(i, j) entries to be computed in ascending order of the indices. In general, the recurrence relation describes, in a top-down manner, the relationships among the entries, while the actual computation is done in a bottom-up fashion. For the LCS problem, LCS(i, j) must be computed before any entry LCS(i', j') where i' ≥ i, j' ≥ j.
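The recurrence can be turned directly into a short bottom-up routine; a minimal Python sketch (the function name is ours):

```python
def lcs_length(s1, s2):
    """Length of a longest common subsequence of s1 and s2 in O(nm) time."""
    n, m = len(s1), len(s2)
    # table[i][j] = length of an LCS of s1[:i] and s2[:j]; row and column 0
    # are the base cases LCS[i, 0] = LCS[0, j] = 0.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]
```

This returns only the LCS length; recovering an actual common subsequence requires a standard traceback through the filled table.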

A2.4.3 Integer Programming

The CBM problem can also be solved using an integer programming approach. Integer programming is a class of linear programming, which optimizes a linear objective function subject to a list of linear constraints. In integer programming, or integer linear programming, all variables can only take integer values. To solve the CBM problem using integer programming, let x_ij = 1 if in the output assignment spin system s_i is mapped to amino acid residue a_j, and x_ij = 0 otherwise. Let w_ij denote the weight of mapping spin system s_i to amino acid residue a_j. The integer program associated with the CBM problem is:

max  Σ_{i,j} w_ij x_ij

s.t. Σ_j x_ij ≤ 1,  ∀i;

     Σ_i x_ij ≤ 1,  ∀j;

     x_ij = x_{i+1,j+1} = ... = x_{i+k,j+k},  ∀ string s_i s_{i+1} ... s_{i+k};

     x_ij ∈ {0, 1},  ∀i, j;

where the objective function is to compute a maximum weight matching, the first set of constraints guarantees that every spin system is mapped to at most one amino acid residue, the second set of constraints guarantees that at most one spin system is mapped to each amino acid residue, and the third set of constraints encodes the connectivity constraints. The general integer programming problem is NP-hard. There are software packages available for solving small to medium size integer programming problems, among which CPLEX (http://www.ilog.com/products/cplex/) is a popular one. Integer programming has also been applied in protein threading (Xu et al., 2003) and protein structure comparison (Chen et al., 2005).
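To make the formulation concrete, the following sketch evaluates the three constraint sets by brute-force enumeration on a toy instance, standing in for a real solver such as CPLEX (the function name and data layout are ours; the connectivity check additionally forces a string that would run off either end of the residue sequence to be unassigned, a boundary case the displayed constraints leave implicit):

```python
from itertools import product

def solve_cbm_by_enumeration(w, strings):
    """Exhaustively solve the CBM integer program on a tiny instance.

    w[i][j]  -- weight of mapping spin system i to amino acid residue j
    strings  -- tuples of spin-system indices that must occupy consecutive
                residues (the connectivity constraints)
    """
    n, m = len(w), len(w[0])
    best = 0
    for bits in product((0, 1), repeat=n * m):
        x = [bits[i * m:(i + 1) * m] for i in range(n)]
        if any(sum(row) > 1 for row in x):         # spin system -> at most one residue
            continue
        if any(sum(col) > 1 for col in zip(*x)):   # residue -> at most one spin system
            continue
        feasible = True
        for s in strings:                          # connectivity constraints
            for a, b in zip(s, s[1:]):
                for j in range(-1, m):             # j = -1 and j = m-1 handle the ends
                    left = x[a][j] if j >= 0 else 0
                    right = x[b][j + 1] if j + 1 < m else 0
                    if left != right:
                        feasible = False
        if feasible:
            best = max(best, sum(w[i][j] * x[i][j]
                                 for i in range(n) for j in range(m)))
    return best
```

Enumeration over 2^(nm) assignments is of course only viable for illustration-sized instances.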

A2.4.4 Branch-and-Bound

Branch-and-bound is another algorithmic technique that can solve the CBM problem exactly. Branch-and-bound and the subsequent two techniques, A* search and dead-end-elimination, are all tree-based optimization techniques. The key idea in a branch-and-bound algorithm is to construct a search tree where every node represents a partial instance of the original instance. Starting from the root node, which represents the original instance, the algorithm applies a depth-first search (DFS) for a partial solution to the instance and creates a child node to represent the remaining instance. It applies some method to estimate the optimal solution for the remaining instance, which is added to the partial solution to serve as an estimated solution for this child node. If such an estimated solution is no better than the current best solution recorded by the algorithm, then the child node is marked as infeasible and the algorithm steps backward to find another partial solution; otherwise, the algorithm repeats the same process on this child node to continue the search. Each time a better feasible solution is found, which corresponds to a leaf node or an empty remaining instance, the algorithm updates the current best solution. The algorithm terminates when the search tree is exhausted. The last solution recorded by the algorithm is an optimal solution to the problem.

For the CBM problem, a branch-and-bound algorithm is proposed in Lin et al. (2003), where at each search node the possible immediate steps are the possible mappings for the longest unassigned string. Certainly, if there is no possible immediate step but the remaining instance is not empty, the search node is infeasible and the algorithm steps backward. To effectively estimate the best solution to the partial instances, a small number of approximations and heuristics can be employed, to be detailed in later sections. Essentially, these estimation algorithms come up with an upper bound on the best solutions for the remaining instances. If adding this upper bound to the partial assignment at the search node is no better than the currently recorded best solution, then the node is infeasible and ignored.
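A minimal sketch of the branch-and-bound scheme on a toy assignment problem (this is not the CBM algorithm of Lin et al., 2003; the upper bound simply grants every remaining item its single best slot):

```python
def assignment_branch_and_bound(w):
    """Assign each item i to a distinct slot j, or to no slot, maximizing
    total weight, by depth-first search with upper-bound pruning."""
    n, m = len(w), len(w[0])
    # Optimistic completion bound: every remaining item gets its best slot.
    suffix = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + max(max(w[i]), 0)
    best = 0

    def dfs(i, used, weight):
        nonlocal best
        best = max(best, weight)
        # Prune: even the optimistic completion cannot beat the incumbent.
        if i == n or weight + suffix[i] <= best:
            return
        dfs(i + 1, used, weight)                  # leave item i unassigned
        for j in range(m):
            if j not in used:
                dfs(i + 1, used | {j}, weight + w[i][j])

    dfs(0, frozenset(), 0)
    return best
```

The looser the bound, the less pruning occurs; with no pruning at all the search degenerates to exhaustive enumeration.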

A2.4.5 A* Search

A* search is another algorithmic technique that is similar to branch-and-bound. The main difference is that while branch-and-bound is a DFS method, A* search is a best-first-search (BFS) method, in that it examines all the partial solutions achieved at the moment (forming an open list) and picks the one with the largest potential to continue the search. Those partial solutions that do not lead to optimal solutions to the original instance are marked to form a closed list.

For the CBM problem, the A* search tree is similar to the search tree produced by a branch-and-bound algorithm, except that the orders in which nodes are explored differ. The immediate steps at every node are the same. For each node, the A* search algorithm records the weight of the partial solution, and again estimates an upper bound on the optimal solution for the remainder instance. The sum of the partial weight and the estimated upper bound becomes the key value associated with the search node, which is used to sort all the partial solutions. The algorithm picks the node with the maximum key value to expand the search. Note that the estimate is an upper bound on the optimal solution for the remainder instance. The first feasible solution found by the algorithm is therefore guaranteed to be an optimal solution to the original instance.
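The same toy assignment problem can be searched best-first; a sketch with the open list kept in a heap (names are ours; because the bound never underestimates, the first complete node popped is optimal):

```python
import heapq
from itertools import count

def assignment_a_star(w):
    """Best-first (A*-style) maximum-weight assignment: the key of a node is
    its partial weight plus an optimistic upper bound on the remainder."""
    n, m = len(w), len(w[0])
    suffix = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + max(max(w[i]), 0)
    tie = count()                       # tie-breaker so the heap never compares sets
    # Max-heap via negated keys; a node is (depth, weight, used slots).
    open_list = [(-suffix[0], next(tie), 0, 0, frozenset())]
    while open_list:
        neg_key, _, i, weight, used = heapq.heappop(open_list)
        if i == n:
            return weight               # admissible bound: first complete node wins
        children = [(weight, used)]     # leave item i unassigned
        children += [(weight + w[i][j], used | {j})
                     for j in range(m) if j not in used]
        for cw, cu in children:
            heapq.heappush(open_list,
                           (-(cw + suffix[i + 1]), next(tie), i + 1, cw, cu))
    return 0
```

Compared with the depth-first version, the open list here can grow large, which is exactly the memory/speed trade-off discussed below.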

It should be noted that in most cases, branch-and-bound algorithms and their corresponding A* search algorithms are equivalent to each other, and most of the time it is hard to predict which one will perform better on a specific application. Nonetheless, as branch-and-bound algorithms use depth-first search, the required memory size is proportional to the product of the depth of the search tree and the size of an instance. A* search algorithms apply best-first search, which requires a larger volume of memory, proportional to the product of the size of the open list and the size of an instance. In general, A* search algorithms trade memory for speed, and the implementation of the open list often requires a hash table. Most of the time, choosing between the two comes down to the researcher's personal preference and a compromise between speed and memory.


A2.4.6 Dead-End-Elimination Algorithm

Another exact algorithm for global optimization is the dead-end-elimination (DEE) algorithm. DEE is similar to the branch-and-bound and A* search algorithms in that all these algorithms attempt to avoid exhaustive search by evaluating partial solutions. The branch-and-bound and A* search algorithms achieve this by building effective search trees, and they apply to general objective functions. DEE takes advantage of the pairwise decomposition property of a typical protein energy function and removes certain ranges/states of search variables from further consideration to limit the search space.

DEE was first developed for side-chain placement in homology modeling applications (Desmet et al., 1992). The search for an optimal side-chain placement is based on a rotamer library of side-chain conformations. The energy function of a protein can be decomposed into the energy of each rotamer and the interactions between rotamer pairs. Given such a property of the energy function, one can compare two states (A and B) of a rotamer X and, with a rigorous mathematical proof, eliminate one of them from further consideration as a possible state in the global optimal solution. If the energy of rotamer X plus the sum of all the interactions between X and the other rotamers in the protein is, even in the worst case for A, still lower than in the best case for B, then B cannot be part of the global optimal solution, and hence can be removed from further consideration. DEE has become an effective search strategy in bioinformatics. In addition to the side-chain placement problem, DEE has also been applied to protein design (Dahiyat and Mayo, 1997) and sequence comparison (Lukashin and Rosa, 1999).
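A sketch of that elimination test (data layout and names are ours, and this is only the simple pairwise criterion, not the full machinery of Desmet et al., 1992):

```python
def dee_eliminates(self_energy, pair_energy, pos, r, t, rotamers):
    """Dead-end-elimination test: rotamer r at position pos can be eliminated
    if its best possible (lowest) energy contribution is still worse than the
    worst possible (highest) contribution of a competing rotamer t.

    self_energy[(p, r)]       -- self energy of rotamer r at position p
    pair_energy[(p, r, q, s)] -- interaction between r at p and s at q
    rotamers[q]               -- candidate rotamers at each position q
    """
    best_r = self_energy[(pos, r)]     # lower bound on r's contribution
    worst_t = self_energy[(pos, t)]    # upper bound on t's contribution
    for q, choices in rotamers.items():
        if q == pos:
            continue
        best_r += min(pair_energy[(pos, r, q, s)] for s in choices)
        worst_t += max(pair_energy[(pos, t, q, s)] for s in choices)
    return best_r > worst_t
```

Whenever the test returns True, state r is provably absent from the global minimum-energy conformation, so the search space shrinks without any loss of exactness.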

A2.4.7 Greedy Algorithms

Greedy is a simple algorithm design technique that can be applied to a large number of problems. Along this line, the problem under consideration has to be partitioned into iterations, and at each iteration the algorithm picks the best "local solution." A typical example is Kruskal's algorithm for the minimum spanning tree (MST) problem, which at every iteration picks a least weighted edge that does not destroy the acyclicity of the obtained forest. After exactly |V| - 1 iterations, the algorithm terminates with a minimum spanning tree. For problems that fall into a category called matroids (Edmonds, 1971), such as the MST problem, greedy algorithms provide rigorous solutions. For other problems, there might be a performance guarantee, such as the one for the CBM problem to be introduced next, or there might not be any performance guarantee.
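Kruskal's algorithm is short enough to state in full; a sketch using union-find for the acyclicity check:

```python
def kruskal_mst(num_vertices, edges):
    """Kruskal's greedy MST: scan edges by increasing weight and keep an edge
    iff it joins two different components (checked with union-find).

    edges -- iterable of (weight, u, v) tuples on vertices 0..num_vertices-1
    """
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    tree, total = [], 0
    for wgt, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                        # keeping the edge preserves acyclicity
            parent[ru] = rv
            tree.append((u, v))
            total += wgt
    return total, tree
```

Each locally cheapest safe edge is globally correct here precisely because spanning forests form a matroid.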

We present a greedy approximation algorithm for the CBM problem which has a performance ratio of 2; that is, the algorithm produces an assignment whose weight is at least half of the weight of the optimal assignment. Note that the amino acid residues in a CBM instance follow a sequential order as they appear in the primary protein sequence. For every amino acid residue, there could be many edges connecting it to the heading spin systems of some strings. Among these edges, the one incident at the heading spin system of a shortest string is called an innermost edge. Scanning from left to right, the innermost edge incident at the first amino acid residue is called the leading innermost edge. It can be shown that for the leading innermost edge, any feasible solution to the CBM problem has at most two edges conflicting with it, where conflicting means that they cannot both be present in a constrained matching. Using the concept of the leading innermost edge in the local ratio technique, which is of a greedy nature, an algorithm for the CBM problem has been proven to have a performance ratio of 2. For the details of the algorithm, we refer the reader to Chen et al. (2003). Figure A2.2 outlines the steps in the algorithm.

2-APPROXIMATION on G:
1. if (E(G) = ∅) output the empty set and halt;
2. find a leading innermost edge e in G;
3. Γ = {e} ∪ {e' | e' ∈ E(G), e' conflicts with e};
4. find the minimum weight c of an edge of Γ in G;
5. for (every edge f ∈ Γ) subtract c from the weight of f;
6. F = {e | e ∈ Γ, e has weight 0};
7. G' = G − F;
8. recursively call 2-APPROXIMATION on G' and output M1';
9. find a maximal edge set M2' ⊆ F s.t. M1' ∪ M2' is a feasible matching in G;
10. output M1' ∪ M2' and halt.

Fig. A2.2 A greedy algorithm for finding a feasible constrained matching

The kBCMV problem, which asks for a resolving scheme to produce a minimum number of distinct binary vectors, is another example where greedy algorithms are useful. To show this, we construct a graph G whose vertices represent the given binary ternary vectors, and two vertices are adjacent if and only if the two corresponding vectors can be resolved into a common binary vector. Then the problem is equivalent to finding a minimum clique partition of graph G, that is, a partition of the vertex set of G such that the induced subgraph on every subset is a clique (its vertices are pairwise adjacent) and the number of subsets in the partition is minimized. Every clique/subset consists of those given vectors that can be resolved into a common binary vector. One greedy strategy to achieve this minimum clique partition is to find a maximum clique in the remainder graph at each iteration and then remove it from the graph. In Figueroa et al. (2004), the GCP algorithm essentially follows the above greedy scheme, and empirical studies demonstrated that the algorithm performs quite well, although there is no theoretical accuracy guarantee.
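A simplified sketch of the clique-partition idea on {0, 1, U} vectors (it grows each group greedily from the first unused vector instead of computing a true maximum clique, so it is a weaker heuristic than the GCP algorithm; names are ours):

```python
def compatible(u, v):
    """Two {0, 1, 'U'} vectors can be resolved to a common binary vector
    iff no coordinate holds a 0 in one and a 1 in the other."""
    return all(a == b or a == 'U' or b == 'U' for a, b in zip(u, v))

def greedy_clique_partition(vectors):
    """Partition vector indices into groups whose members are pairwise
    compatible, i.e., share a common resolved binary vector."""
    remaining = list(range(len(vectors)))
    groups = []
    while remaining:
        group = [remaining.pop(0)]
        for i in remaining[:]:
            if all(compatible(vectors[i], vectors[j]) for j in group):
                group.append(i)
                remaining.remove(i)
        groups.append(group)
    return groups
```

Note that pairwise compatibility suffices for a whole group: if no coordinate sees both a 0 and a 1 across any pair, the group as a whole admits a common resolution.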

A2.4.8 Reduction Techniques

As discussed earlier in this chapter, reduction is an algorithmic technique that reduces the problem under consideration to another problem that usually admits good known algorithms. We use an example from oligonucleotide fingerprinting to explain the basic idea of the technique. One application of this problem is using cDNA arrays to perform microbe classification. For this purpose, probes are carefully designed to form cDNA chips, and their hybridization values are read off and fed to classification tools. There could be missing values in the hybridization results. The problem is to resolve the missing values and classify the microbes into a minimum number of clusters. In more detail, the mathematical model consists of a set of n vectors of length m, where each entry is either 0 (for no hybridization), 1 (for hybridization), or U (for unknown, also called a missing value). Vectors of this type are called binary ternary vectors. The goal is to resolve the unknown values, that is, to assign either 0 or 1 to each U, such that the number of distinct resolved vectors is minimized. The general problem is called binary clustering with missing values (BCMV) (Figueroa et al., 2004). If every vector is known to contain at most k unknown entries, then the problem is called kBCMV. It is known that kBCMV is NP-hard when k ≥ 2 and is solvable in polynomial time when k = 1.

In a classic computational problem, SET COVER, a base set S and a collection C of subsets of S are given, and the goal is to find a minimum-size subcollection C' of subsets such that every element in S appears in some subset in C'. SET COVER is notoriously hard, but if every subset in C has size at most p, the problem, called p-SET COVER, admits a (1 + Σ_{i=3}^{p} 1/i)-approximation (Duh and Fürer, 1997).
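For contrast, the classic greedy SET COVER heuristic repeatedly takes the subset that covers the most still-uncovered elements; this is the textbook logarithmic-ratio heuristic, not the semi-local optimization of Duh and Fürer:

```python
def greedy_set_cover(universe, subsets):
    """Greedy SET COVER: repeatedly pick the subset covering the most
    still-uncovered elements; returns the chosen subset indices."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(range(len(subsets)),
                   key=lambda i: len(subsets[i] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("universe cannot be covered by the given subsets")
        cover.append(best)
        uncovered -= subsets[best]
    return cover
```

The greedy choice is locally optimal per iteration but can still return more subsets than an optimal cover on adversarial instances.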

To approximate the kBCMV problem for k ≥ 2, one way is to reduce it to SET COVER. For this purpose, one may construct a subset to contain the given vectors that can be resolved into a common resolved vector. By setting the base set S to be the set of all given vectors, we reduce the kBCMV problem to the 2^k-SET COVER problem. Therefore, such a reduction implies a (1 + Σ_{i=3}^{2^k} 1/i)-approximation algorithm for the kBCMV problem.

There are a few other well-known approximation algorithm design techniques, such as (1) integer programming with rounding and (2) semidefinite programming. They may be more powerful than the techniques introduced above. Nonetheless, they are also more complicated and require much space to describe. Interested readers may wish to consult advanced textbooks on approximation algorithms, such as Ausiello et al. (1999) and Vazirani (2001).

Reduction can also be applied to obtain a quick understanding of the computational structure of a problem. Such a technique is typically applied when some special cases of the problem are found to be equivalent to well-known problems that admit good algorithms (exact or approximation).

For example, for the CBM problem, if there are no strings but only singleton spin systems, then an optimal assignment between the set of spin systems and the set of amino acid residues can be computed efficiently; that is, the special case gives a bipartite matching problem. The existence and application of strings complicates the problem while making a solution more accurate. In a greedy filtering process proposed in Chen et al. (2002), the mapping positions for the strings are computed greedily, and the remaining problem involving only singleton spin systems is solved by calling efficient bipartite matching algorithms. The greedy filtering process sets up a maximum number k of best partial assignments, which are identified iteratively. At the first iteration, the k best mapping positions for the longest string are identified, forming the k best partial assignments. At the second iteration, the algorithm tries to find the best mapping positions for the second longest string, starting from the obtained k best partial assignments. To do so, for each of the k best partial assignments, the algorithm searches for the k best mapping positions for the second longest string. If none is found, then the partial assignment is marked infeasible and subsequently discarded from further consideration. The result of this iteration is thus at most k^2 partial assignments, among which the k best ones are retained while the others are discarded. The algorithm then proceeds to the third iteration, and so on, until all the strings are considered. At the end of this greedy filtering process, at most k partial assignments that involve all the strings are obtained. For each of them, the algorithm calls a maximum weight matching algorithm, such as the one by Goldberg and Kennedy (1995), to complete the assignment. It returns the best one as the output assignment.

A2.4.9 Divide-and-Conquer Algorithms

Divide-and-conquer is another useful algorithmic technique that can lead to two possible outcomes, depending on the detailed design: one is an exact algorithm potentially running in exponential time, and the other is a potentially good heuristic running in polynomial time. A well-known divide-and-conquer algorithm is quicksort, which picks one element as a pivot to partition the other elements into two sublists, one containing only elements smaller than the pivot and one containing only elements larger than the pivot, and then recursively runs quicksort on the two sublists. Mergesort is another sorting algorithm that falls into the divide-and-conquer category. The sorting problem belongs to P. The quicksort algorithm runs in O(n^2) time in the worst case and O(n log n) time in the average case, where n is the number of elements to be sorted. The mergesort algorithm runs in O(n log n) time in both the worst and the average case.
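Mergesort shows the divide-combine structure most cleanly; a minimal sketch:

```python
def merge_sort(items):
    """Mergesort: split the list in half, sort each half recursively, and
    merge the two sorted halves -- O(n log n) in all cases."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # merge step: interleave the halves
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]
```

Here the "divide" is trivial (split at the midpoint) and all the work is in the combine step; quicksort makes the opposite choice, working at the divide step and combining trivially.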

For NP-hard problems, a good divide-and-conquer algorithm usually partitions the original instance effectively into several smaller instances such that the linkages among the smaller instances are minimized. The algorithm solves or approximates the smaller instances and then assembles a solution for the original instance out of the achieved solutions for the smaller instances together with the linkages. In general, only partial linkage is used to assemble the solution, to ensure that the assembling can be done in a reasonable amount of time, and thus the algorithm is only a heuristic. Often, if all the linkages are used, the algorithm runs in exponential time in the worst case but nonetheless becomes an exact algorithm.

One good example is the protein threading problem. In protein threading, target structure templates are given, as well as an energy function that includes single residue fitness, residue mutations, alignment gaps, and pairwise interactions. By discarding the pairwise interaction energy, the threading problem becomes polynomial-time solvable; it is essentially a sequence alignment problem. With the pairwise interactions, the problem becomes more difficult, as it was proven to be NP-hard under some overly generalized conditions. The software PROSPECT (Xu et al., 1998) uses short-range pairwise interactions to cut the threading problem on a whole sequence into a number of smaller instances whose optimal solutions are solved (in polynomial time), as described in Chapter 12 of this book. In this case, the algorithm is still an exact one, while the model for the computational problem is modified using heuristics so that the problem becomes solvable in polynomial time.

A2.5 Parallel Computing

As many of the problems in structural informatics are computationally intensive, parallel computing has been widely used in structural informatics. There are two main types of parallelism, i.e., the single-instruction-multiple-data (SIMD) model and the multiple-instruction-multiple-data (MIMD) model. The SIMD model performs the same operation (single instruction) on multiple data in parallel, while the MIMD model carries out different operations (multiple instructions) on different data at the same time. Earlier Cray vector machines, Intel's MMX, SSE, SSE2, and SSE3, and AMD's 3DNow used the SIMD model. Newer machines, such as the Cray T90 and Linux clusters, mostly use the MIMD model, as MIMD is more flexible.

Message passing is one of the basic operations used in parallel computing; it facilitates communication among different processors by sending and receiving data from one processor to another. Portable message passing libraries such as MPI (Message Passing Interface) and PVM (Parallel Virtual Machine) have been widely used for this purpose. These libraries can run on different machines and provide a software infrastructure that enables heterogeneous collections of machines to perform parallel computing.

The rapid growth of interconnected Linux/Unix workstations has opened a door for a new parallel computing paradigm, i.e., network computing or clustered workstation computing. With gigabit switches and Asynchronous Transfer Mode (ATM) network technology, connected workstations can achieve very fast message passing between machines, while being much cheaper than supercomputers. As a result, Linux clusters are widely used, as they are particularly suitable for structural informatics problems. Many structural informatics problems (e.g., threading) can be solved using data parallelism, which applies the same operation to all the elements of a large data set, such as a protein structure database. This data parallelism model has evolved from the SIMD concept, and it requires little message passing. Hence, Linux clusters, or even heterogeneous machines across the Internet, can perform data-parallel computing efficiently.

Examples of using parallel computing in structural informatics include molecular dynamics simulation using the NAMD program (Nelson et al., 1996), the parallel-computing tool CEPAR (Pekurovsky et al., 2004) for finding pairwise protein structure similarities in the PDB, the fold recognition server PROSPECT-PSPP (http://csbl.bmb.uga.edu/protein.pipeline/), and the alignment server SAM-T02 (http://www.cse.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html).

A2.6 Programming

An important aspect of structural informatics is developing computational tools for solving structure prediction, modeling, and analysis problems. For this purpose, good skills in computer programming and software engineering are essential. Most structural informatics tools are implemented in C/C++ or Java and run on Unix/Linux, especially the computationally intensive algorithms. Although increasingly more tools have been developed under the Windows operating system, the mainstream development environment for structural informatics problems is still Unix/Linux. As a result, a minimum requirement for structural informatics tool development is to be comfortable working in a command-line computing environment under Unix/Linux. This is generally not difficult to achieve even for people without any formal training in computer science, though learning a programming language may take some effort for researchers and students who have no computer science background. A quick way to gain hands-on experience in structural informatics programming is to use a scripting language such as Perl or Python. Perl and Python are especially suited for the simple parsing, formatting, and text conversion problems that arise frequently in working with protein sequences and structures. An extension of Perl, called Bioperl (http://bioperl.org/), is specifically designed for dealing with biosequence and molecular structure data. Bioperl currently consists of a number of pre-made modules that are useful for structural informatics, including protein sequence analysis and handling PDB files.

A2.7 Summary

Computer algorithm design is a key component of bioinformatics, particularly of structural informatics, as many structural informatics problems are intrinsically computationally difficult. For those problems, good algorithm design can often have a much bigger impact on computational efficiency than hardware acceleration or parallel computing. The reason is simply that good algorithm design can often achieve better-than-linear speedup. For problems that do not have efficient algorithms (e.g., NP-hard problems), one often needs to consider the trade-off between computational efficiency and prediction accuracy. Ideally, the best solutions are achieved within an acceptable amount of time. Otherwise, approximation algorithms or heuristic algorithms can be used to obtain solutions (though not necessarily the best ones) within a desired time. Given that so many algorithmic techniques are available, a bioinformatics tool developer should consider the pros and cons of the different algorithmic techniques for a specific application and choose an appropriate technique to implement. Good practice in algorithm design for structural informatics takes experience and domain knowledge of structural biology. Beginners, especially those without formal training in computer science, could focus on a component-based software engineering approach, e.g., applying modules with well-developed algorithms, such as Bioperl or BLAST, to implement new tools.

Further Reading

The textbook Introduction to Algorithms (Cormen et al., 2001) provides a detailed description of algorithm design and analysis. It includes almost all the algorithmic design techniques mentioned above, and several more algorithms that are potentially useful for biological applications. Other textbooks contain many insightful comments regarding algorithm design and analysis, for example, the one by Kleinberg and Tardos (2005). One can find good textbooks and monographs on approximation algorithms, such as Ausiello et al. (1999) and Vazirani (2001). An introduction to algorithm design in the context of bioinformatics can be found in Jones and Pevzner (2004).

For researchers and students who have no computer science background and are learning computational approaches to biology for the first time, a quick way to gain hands-on experience with computational modeling and prediction is to follow Gibas and Jambeck (2001). This book covers some basics of using computers for biology, including the Unix file system, databases, an introduction to Perl for bioinformatics, data mining, and data visualization. Another book for hands-on experience is Tisdall (2002). It covers the core Perl language and many of its module extensions (especially Bioperl) in the context of biological data and problems.

An area which we do not cover in this chapter is machine learning and data mining approaches for structural informatics. Monte Carlo methods (Appendix 4), genetic algorithms, neural networks, support vector machines (SVMs), fuzzy logic, etc. are all used in bioinformatics, especially for optimization problems. The scope is too broad to cover in this chapter. Interested readers are referred to Baldi and Brunak (2001) on this subject.

A2.8 Acknowledgments

GL is grateful for research support from CFI and NSERC. The research of DX and YX was sponsored in part by NSF/ITR-IIS-0407204 and the U.S. Department of Energy's Genomes to Life program (www.doegenomestolife.org) under the project "Carbon Sequestration in Synechococcus sp.: From Molecular Machines to Hierarchical Modeling" (www.genomes2life.org). YX's work was also supported in part by NSF/DBI-0354771 and a "Distinguished Cancer Scholar" grant from the Georgia Cancer Coalition.


References

Guohui Lin et al.

Ausiello, G., Crescenzi, P., Gambosi, G., Kann, V., Marchetti-Spaccamela, A., and Protasi, M. 1999. Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties. Berlin, Springer-Verlag.

Baldi, P., and Brunak, S. 2001. Bioinformatics: The Machine Learning Approach, 2nd edition. Cambridge, MA, MIT Press.

Chen, L., Zhou, T., and Tang, Y. 2005. Protein structure alignment by deterministic annealing. Bioinformatics 21:51-62.

Chen, Z.Z., Jiang, T., Lin, G.H., Wen, J.J., Xu, D., Xu, J., and Xu, Y. 2003. Approximation algorithms for NMR spectral peak assignment. Theor. Comput. Sci. 299:211-229.

Chen, Z.Z., Jiang, T., Lin, G.H., Wen, J.J., Xu, D., and Xu, Y. 2002. Improved approximation algorithms for NMR spectral peak assignment. In Proceedings of the 2nd Workshop on Algorithms in Bioinformatics (WABI 2002), LNCS 2452, pp. 82-96.

Cook, S. 1971. The complexity of theorem proving procedures. In Proceedings of the Third Annual ACM Symposium on Theory of Computing (STOC 1971), pp. 151-158.

Cormen, T.H., Leiserson, C.E., Rivest, R.L., and Stein, C. 2001. Introduction to Algorithms, 2nd edition. Cambridge, MA, MIT Press.

Dahiyat, B.I., and Mayo, S.L. 1997. De novo protein design: Fully automated sequence selection. Science 278:82-87.

Desmet, J., de Maeyer, M., Hazes, B., and Lasters, I. 1992. The dead-end elimination theorem and its use in protein side-chain positioning. Nature 356:539-542.

Duh, R., and Fürer, M. 1997. Approximation of k-set cover by semi-local optimization. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing (STOC 1997), pp. 256-265.

Edmonds, J. 1971. Matroids and the greedy algorithm. Math. Programming 1:127-136.

Figueroa, A., Borneman, J., and Jiang, T. 2004. Clustering binary fingerprint vectors with missing values for DNA array data analysis. J. Comput. Biol. 11:887-910.

Fischer, D., Norel, R., Wolfson, H., and Nussinov, R. 1993. Surface motifs by a computer vision technique: Searches, detection, and implications for protein-ligand recognition. Proteins 16:278-292.

Gibas, C., and Jambeck, P. 2001. Developing Bioinformatics Computer Skills. Sebastopol, CA, O'Reilly.

Goldberg, A.V., and Kennedy, R. 1995. An efficient cost scaling algorithm for the assignment problem. Math. Programming 71:153-178.

Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences. Cambridge, Cambridge University Press.

Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. A new approach to protein fold recognition. Nature 358:86-89.


A2. Computer Science for Structural Informatics 265

Jones, N.C., and Pevzner, P.A. 2004. An Introduction to Bioinformatics Algorithms. Cambridge, MA, MIT Press.

Kleinberg, J., and Tardos, E. 2005. Algorithm Design. Upper Saddle River, NJ, Pearson Education.

Koch, I., Lengauer, T., and Wanke, E. 1996. An algorithm for finding maximal common subtopologies in a set of protein structures. J. Comput. Biol. 3:289-306.

Lin, G.H., Xu, D., Chen, Z.Z., Jiang, T., Wen, J.J., and Xu, Y. 2003. Computational assignment of protein backbone NMR peaks by efficient bounding and filtering. J. Bioinform. Comput. Biol. 1:387-409.

Lukashin, A.V., and Rosa, J.J. 1999. Local multiple sequence alignment using dead-end elimination. Bioinformatics 15:947-953.

Marsan, L., and Sagot, M.F. 2000. Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J. Comput. Biol. 7:345-362.

Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M., and Turner, D.H. 2004. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl. Acad. Sci. USA 101:7287-7292.

Needleman, S.B., and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443-453.

Nelson, M., Humphrey, W., Gursoy, A., Dalke, A., Kale, L., Skeel, R.D., and Schulten, K. 1996. NAMD: A parallel, object-oriented molecular dynamics program. Int. J. Supercomput. Appl. High Perform. Comput. 10:251-268.

Pekurovsky, D., Shindyalov, I.N., and Bourne, P.E. 2004. A case study of high-throughput biological data processing on parallel platforms. Bioinformatics 20:1940-1947.

Shyu, C.R., Chi, P.H., Scott, G., and Xu, D. 2004. ProteinDBS: A real-time retrieval system for protein structure comparison. Nucleic Acids Res. 32:W572-W575.

Smith, T.F., and Waterman, M.S. 1981. Comparison of biosequences. Adv. Appl. Math. 2:482-489.

Tisdall, J.D. 2002. Mastering Perl for Bioinformatics. Sebastopol, CA, O'Reilly.

Vazirani, V. 2001. Approximation Algorithms. Berlin, Springer-Verlag.

Weskamp, N., Kuhn, D., Hüllermeier, E., and Klebe, G. 2004. Efficient similarity search in protein structure databases by k-clique hashing. Bioinformatics 20:1522-1526.

Wu, X., Wan, X.F., Wu, G., Xu, D., and Lin, G.H. 2006. Phylogenetic analysis using complete signature information of whole genomes and clustered neighbor-joining method. Int. J. Bioinform. Res. Appl. 2:219-248.

Xu, J., Li, M., Kim, D., and Xu, Y. 2003. RAPTOR: Optimal protein threading by linear programming. J. Bioinform. Comput. Biol. 1:95-117.

Xu, Y., Mural, R.J., and Uberbacher, E.C. 1994. Constructing gene models from accurately predicted exons: An application of dynamic programming. Bioinformatics 10:613-623.


Xu, Y., Mural, R.J., and Uberbacher, E.C. 1995. Correcting sequencing errors in DNA coding regions using a dynamic programming approach. Bioinformatics 11:117-124.

Xu, Y., Xu, D., Kim, D., Olman, V., Razumovskaya, J., and Jiang, T. 2002. Automated assignment of backbone NMR peaks using constrained bipartite matching. IEEE Comput. Sci. Eng. 4:50-62.

Xu, Y., Xu, D., and Uberbacher, E.C. 1998. An efficient computational method for globally optimal threading. J. Comput. Biol. 5:597-614.

Yan, B., Pan, C., Olman, V., Hettich, R., and Xu, Y. 2005. A graph-theoretic approach to separation of b and y ions in tandem mass spectra. Bioinformatics 21:563-574.

Zhang, K., Deng, M., Chen, T., Waterman, M., and Sun, F. 2002. A dynamic programming algorithm for haplotype block partitioning. Proc. Natl. Acad. Sci. USA 99:7335-7339.


Appendix 3 Physical and Chemical Basis for Structural Bioinformatics

Hui Lu, Ognjen Perisic, and Dong Xu

A3.1 Introduction

Physics and chemistry are key components of the foundations for structural bioinformatics. This appendix gives an introduction to the basic concepts of physics and chemistry which are relevant to structural bioinformatics. Special attention is given to the description of various types of physical forces which govern the behavior of biomolecules. The theoretical framework of statistical physics and thermodynamics is described together with sampling techniques, such as molecular dynamics simulation and the Monte Carlo method. The purpose of this appendix is to help researchers and students who lack an extensive physics or chemistry background to understand other chapters in this book.

Structure and dynamics of biomolecules, including proteins, nucleotides, and membranes, are governed by the laws of physics and chemistry. However, due to the large size, wide range of time scales, and complex energy topography of molecular biological problems, there are certain unique features in applying physics and chemistry to structural bioinformatics. In particular, structural bioinformatics addresses various modeling and prediction problems related to physics and chemistry on different levels:

• Quantum mechanics. The modeling of enzyme reactions, electron transfer, and excited states of molecules typically requires a quantum-mechanical description. Many atomic force fields used in the description of biological molecules were derived from quantum-mechanical calculations. One can also apply quantum calculation to model structures of small organic molecules that serve as substrates for proteins and nucleic acids.

• Classical physics. Models based on classical physics are applied in energy minimization, molecular dynamics simulations, structural refinement, and certain ab initio protein folding predictions.

• Potential of mean force (PMF). This approach derives an average force (and the corresponding potential) measured during perturbation of a biological polymer from more detailed modeling of a system. The resulting PMF can be used to investigate the dynamic process during the perturbation of biological polymers. For example, one can calculate the PMF of an ion during its movement through a membrane channel, and then simulate the behavior of that ion in a simplified model using the corresponding PMF.

• Coarse-grained approach. This approach treats a group of atoms as a single entity. Examples of this level of modeling include handling a cluster of atoms (e.g., an α-helix in a protein) as a rigid body, studying elastic properties of DNA molecules using a simplified representation, and calculating the surface tension of the cell membrane. These approaches have their origins in physics, although they are not derived explicitly from the atomic-level description.

• Bulk properties. Bulk properties include dielectric effects of the solution, salt concentration, pH level, temperature, pressure, etc. They describe the environment and overall properties of biomolecules, and as a result, they play important roles in modeling biomolecules. For example, a continuum model for calculating electrostatics in a protein should include the dielectric effect of the solvent.

The application of the different physical principles mentioned above depends on the complexity of the modeling problem and the availability of computational resources. Ideally, everything could be modeled using quantum mechanics, but this is not feasible because most structural bioinformatics problems deal with hundreds to millions of atoms. In addition, some problems may not need a complex description. For example, a typical protein folding process can be described efficiently without a quantum-level description. Furthermore, a high-level description may capture some key features that are difficult to model using more detailed descriptions. For example, the influence of certain dielectric properties of the solvent on the protein can be better modeled using a continuum model than an atomic model. On the other hand, as computers become more powerful, the complexity of quantum-mechanical calculations may not be the limiting factor anymore (Anikin et al., 2004). Another class of methods uses hybrid classical and quantum mechanics to model biological molecules. Those methods apply high-level descriptions to simulate the whole protein and quantum mechanics to calculate properties of a few key atoms (Hayashi et al., 2003; Gherman et al., 2005; Friesner and Guallar, 2005).

Other than the models based on physics and chemistry, statistics also provides a theoretical basis for another type of computational technique. For example, knowledge-based scoring functions are derived from statistics of known protein structures rather than from physics principles. However, that derivation still applies certain concepts in physics, such as the Boltzmann distribution. Computational techniques, such as Monte Carlo optimization, simulated annealing, and Boltzmann machines, can also be interpreted using physical concepts, although they do not describe actual physical processes in biomolecules.

A3.2 Physics Concepts

A3.2.1 Units

In structural and dynamics modeling, it is common to use the International System of Units (SI) to describe biological molecules and interactions among them. Several non-SI units have found their place as they conveniently represent observables related to biomolecules.

For length measurement, the common non-SI unit is the angstrom (1 Å = 10^-10 m = 100 pm = 0.1 nm). It is used because chemical bond lengths are in the range of a few angstroms.

Nonstandard units widely used to describe energy are the kilocalorie (1 kcal = 4184 J; energies are usually expressed in kcal/mol, where 1 kcal/mol corresponds to 6.9479 x 10^-21 J per molecule) and kBT (Boltzmann's constant kB = 1.3807 x 10^-23 J K^-1 = 8.6171 x 10^-5 eV K^-1 = 1.3807 x 10^-16 erg K^-1). At room temperature T = 300 K, 1 kBT = 4.1421 x 10^-21 J = 0.02585 eV. These units are chosen because they are of the same order of magnitude as the energies involved in molecular biology processes (for example, thermal energy is about 1 kBT; ATP hydrolysis releases about 25 kBT; and the free energy difference between the folded and unfolded states of a protein is 15-25 kBT).
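These unit conversions are easy to check numerically. The following sketch uses the constant values quoted above and reproduces the quoted figures:

```python
K_B = 1.3807e-23        # Boltzmann constant, J/K
AVOGADRO = 6.0221e23    # Avogadro's number, 1/mol
KCAL_IN_J = 4184.0      # 1 kcal = 4184 J (thermochemical calorie, exact)
EV_IN_J = 1.6022e-19    # 1 eV in J

# One kcal/mol expressed per molecule, in joules
kcal_per_mol_in_j = KCAL_IN_J / AVOGADRO            # ~6.948e-21 J

# Thermal energy at room temperature (T = 300 K), in several unit systems
kbt_300 = K_B * 300.0                               # ~4.142e-21 J
kbt_300_ev = kbt_300 / EV_IN_J                      # ~0.0259 eV
kbt_300_kcal_mol = kbt_300 * AVOGADRO / KCAL_IN_J   # ~0.6 kcal/mol

print(f"1 kcal/mol  = {kcal_per_mol_in_j:.4e} J per molecule")
print(f"1 kBT @300K = {kbt_300:.4e} J = {kbt_300_ev:.4f} eV "
      f"= {kbt_300_kcal_mol:.3f} kcal/mol")
```

The last figure, roughly 0.6 kcal/mol per kBT at room temperature, is the conversion used constantly when comparing force-field energies to thermal energy.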

The duration of biological processes on the molecular level covers a range from femtoseconds (1 fs = 10^-15 s) to microseconds (1 μs = 10^-6 s).

Figure A3.1 shows various scales of time, length, and energy related to biological processes, especially those related to structural bioinformatics.

Fig. A3.1 Scales in length, energy, and time. Examples from biomolecular modeling are provided for different scales: on the length axis, from the Bohr diameter of the hydrogen atom in the ground state (106 pm) through proteins, viruses, ribosomes, mitochondria, bacteria, and eukaryotic cells up to a human; on the energy axis, from thermal fluctuations (~1 kBT) and hydrogen bonds through 1 kcal/mol to covalent bonds, electron transfer, protein folding, and ATP hydrolysis; on the time axis, from electron transfer and bond vibrations through proton transfer, side-chain rotations, and protein folding up to a bacterial lifetime.

A3.2.2 Potential Energy Surface

Proteins pass through many different conformations while folding into their native, stable states. The number of possible conformations is very large, much larger than the number of conformations the protein goes through during the folding process, and it depends on the number of residues. Each conformation of the protein corresponds to a certain potential energy. That potential energy of the biological molecule is a multidimensional function of the coordinates of the atomic nuclei. The main difficulty in computing and visualizing the potential energy surface is the large number of associated variables. In the case of N atoms, the potential energy is a function of 3N Cartesian coordinates or 3N - 6 internal coordinates; since the whole system has 6 overall degrees of freedom (rigid-body translation and rotation), the internal coordinates have only 3N - 6 dimensions. Calculation of the energy surface can be simplified if only part of the surface is analyzed or shown. In that case, the energy surface (potential of mean force) can be described as a function of just a few coordinates.

The key features of the energy landscape are its minima (see Fig. A3.2). These minimum points correspond to stable or metastable molecular conformations. At those points the first derivative of the energy with respect to all of the coordinates is equal to zero, and the second derivatives are positive. The most important minimum among them is the global energy minimum, the point on the energy landscape with the lowest energy, where the biomolecule has its most stable conformation.

Another important group of points on the potential energy surface are the points which correspond to the transition conformations of the biological molecule. At these points, the first derivative of the energy is equal to zero with respect to all of the coordinates, and at least one second derivative is positive and at least one is negative. These points, also called saddle points, represent the highest points on the lowest-energy path between two minima.

A stricter definition of saddle points is that at a nondegenerate critical point (a point where the first derivative is equal to zero), the Hessian matrix (the square matrix of partial second derivatives) is indefinite, i.e., it has both positive and negative eigenvalues.
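This eigenvalue test is easy to demonstrate on a toy two-dimensional surface. The sketch below uses two hypothetical test functions, and estimates the Hessian by finite differences rather than deriving it analytically; the eigenvalues of the symmetric 2x2 Hessian are then obtained in closed form:

```python
import math

def hessian_2d(f, x, y, h=1e-4):
    """Finite-difference Hessian of f(x, y) at the point (x, y)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return fxx, fxy, fyy

def classify_critical_point(f, x, y):
    """Classify a critical point by the signs of the Hessian eigenvalues."""
    a, b, c = hessian_2d(f, x, y)
    # Eigenvalues of the symmetric matrix [[a, b], [b, c]]
    mean = (a + c) / 2
    disc = math.sqrt(((a - c) / 2) ** 2 + b**2)
    lam1, lam2 = mean - disc, mean + disc
    if lam1 > 0 and lam2 > 0:
        return "minimum"              # Hessian positive definite
    if lam1 * lam2 < 0:
        return "saddle point"         # Hessian indefinite
    return "maximum or degenerate"

bowl = lambda x, y: x**2 + y**2       # single minimum at the origin
saddle = lambda x, y: x**2 - y**2     # saddle point at the origin
```

Here `classify_critical_point(bowl, 0, 0)` reports a minimum while `classify_critical_point(saddle, 0, 0)` reports a saddle point, matching the definitions above.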

The important feature of transition conformations (sometimes called transition states) is that at those points the system is unstable, i.e., it has equal probability of going to the different minima. Any small perturbation at the transition states may therefore have a dramatic effect on the dynamic process of the biological system.

A3.2.3 Coordinate Systems

Computer modeling programs often use two approaches to represent atomic coordinates of biological molecules. The first approach, commonly used by dynamics simulation programs, specifies each atom in the molecule using either Cartesian coordinates (x, y, z) or polar coordinates (r, φ, θ).

The Cartesian coordinate system, also called a rectangular coordinate system, is defined by three orthogonal axes, labeled x, y, and z.

Fig. A3.2 Potential energy surfaces with two degrees of freedom. The bottom figure illustrates a saddle point in a three-dimensional energy surface. The top left shows a three-dimensional illustration of an energy surface with its global minimum, while the top right shows a contour plot of the same surface. The contour plot displays a three-dimensional function (a function of two input variables) using isolines, lines on a graph which connect points of equal value.

Polar coordinates (also called spherical coordinates when used to represent mathematical objects in three dimensions) are represented as positions on the surfaces of concentric spheres (see Fig. A3.3). The position of a point is specified by a tuple of three components (r, φ, θ). The radius r is the distance between the point and the origin of the coordinate system. The angle φ (0° ≤ φ ≤ 360°; the azimuth) is the angle between the x-axis and the projection onto the xy plane of the line between the origin and the point. The angle θ (0° ≤ θ ≤ 180°; the zenith, colatitude, or polar angle) is the angle between the positive z-axis and the line from the origin to the point.


Fig. A3.3 Polar (spherical) coordinates are defined by the distance r from a fixed point (the origin of the system) and two angles (φ, θ). The angle φ (0° ≤ φ ≤ 360°; the azimuth) is the angle between the x-axis and the projection onto the xy plane of the line between the origin and the point. The angle θ (0° ≤ θ ≤ 180°; the zenith, colatitude, or polar angle) is the angle between the positive z-axis and the line from the origin to the point.

The spherical coordinates can be obtained from the Cartesian coordinates (x0, y0, z0) through the following equations:

r = sqrt(x0^2 + y0^2 + z0^2),    (A3.1)

φ = arctan(y0/x0),

θ = arcsin(sqrt(x0^2 + y0^2)/r) = arccos(z0/r).    (A3.2)

The Cartesian coordinates can also be expressed as functions of the spherical ones:

x0 = r cos φ sin θ,

y0 = r sin φ sin θ,

z0 = r cos θ.
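The two transformations can be sketched in code and checked against each other by a round trip. Note that `math.atan2` is used in place of the bare arctan(y0/x0) so that the correct quadrant of φ is chosen automatically:

```python
import math

def cartesian_to_spherical(x, y, z):
    """Convert Cartesian (x, y, z) to spherical (r, phi, theta).

    phi is the azimuth (angle from the x-axis in the xy plane),
    theta the polar angle (measured from the positive z-axis).
    Assumes the point is not at the origin.
    """
    r = math.sqrt(x**2 + y**2 + z**2)
    phi = math.atan2(y, x)          # quadrant-aware arctan(y/x)
    theta = math.acos(z / r)
    return r, phi, theta

def spherical_to_cartesian(r, phi, theta):
    """Convert spherical (r, phi, theta) back to Cartesian (x, y, z)."""
    x = r * math.cos(phi) * math.sin(theta)
    y = r * math.sin(phi) * math.sin(theta)
    z = r * math.cos(theta)
    return x, y, z

# Round-trip check on an arbitrary point
x, y, z = spherical_to_cartesian(*cartesian_to_spherical(1.0, 2.0, 2.0))
```

The round trip recovers the original point to within floating-point precision.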

The second approach to representing atomic coordinates of biological molecules uses internal coordinates, i.e., the positions of each atom relative to the other atoms in the molecule. They efficiently describe spatial relations between atoms in an individual molecule. Computer programs which use quantum mechanics to analyze properties of biological molecules commonly use internal coordinates to represent molecules, especially small ones. Internal coordinates are usually represented using a Z-matrix. Details can be found in Leach (2001).


Fig. A3.4 Schematic representation of a peptide bond linking residue 1 and residue 2, with the amino and carboxyl groups indicated. φ, ψ, and ω are dihedral angles.

A3.3 Basic Chemistry

A3.3.1 Chemical Reactions

Chemical reactions describe the formation and breakage of chemical bonds in molecules. Many chemical reactions in living cells are based on the chemistry of the carbon atom. Living organisms use carbon due to its versatility: carbon can make up to four covalent bonds, and several carbon atoms can form rings and chains.

A3.3.2 Formation of the Peptide Bond

Proteins are polymer chains composed of amino acids. Amino acids can condense, with the loss of a water molecule, to form a peptide bond (Figs. A3.4 and A3.5). When an amino acid participates in the formation of a peptide bond, it is called an amino acid residue (or residue) because it loses one water molecule (an H+ from the nitrogenous side and an OH- from the carboxylic side). A peptide bond is a standard amide bond. Two or more amino acids can form one continuous peptide chain. Conventionally, these chains are classified based on their molecular masses.

Fig. A3.5 Steric representation of a peptide bond.


Fig. A3.6 Ramachandran plot for the φ, ψ angles, with both axes running from -180° to 180°. The contours show areas (dark) with the minimum energy and correspond to the stable conformations of polypeptide chains, such as the right-handed and left-handed α helices.

Peptides are chains with molecular masses below 5000 daltons (Da). Polypeptides are chains with molecular masses in the range from about 6 to 40 kDa. The end of the polypeptide chain with the free amino group (-NH3+) is called the N-terminus, and the end with the free carboxyl group is called the C-terminus. Schematic representations start with the N-terminus on the left and end with the C-terminus on the right.

A peptide bond is planar and has only two possible conformations, i.e., the rotation angle ω can take two values: trans, which corresponds to a rotation angle of 180°, and cis, which corresponds to a 0° rotation angle (Fig. A3.5).

Angles around the N-Cα bond (called the φ angle) and the Cα-C bond (called the ψ angle) are more flexible. Values of those torsional angles in natural proteins are restricted to a certain set of pairs. The allowed values (φ, ψ) of the torsional angles are nicely depicted by Ramachandran plots (Fig. A3.6), which map the potential energy surface using the torsional angles (φ, ψ) as the sampling coordinates. White areas correspond to protein conformations with high energies, mainly due to overlap among side-chain atoms, except for glycine, which does not have a side chain. Dark gray areas correspond to favorable conformations without steric clashes, and light gray areas correspond to conformations where the van der Waals overlaps are minor.
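Given Cartesian coordinates for four consecutive backbone atoms, the corresponding torsion angle can be computed directly. A minimal sketch follows; the atan2-based formula is a standard way to obtain a signed angle, and the four test points at the end are hypothetical coordinates chosen to give exact trans (180°) and cis (0°) geometries:

```python
import math

def dihedral(p1, p2, p3, p4):
    """Signed torsion angle (degrees) defined by four points in space."""
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))
    def cross(a, b):
        return (a[1]*b[2] - a[2]*b[1],
                a[2]*b[0] - a[0]*b[2],
                a[0]*b[1] - a[1]*b[0])
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)   # normals of the two planes
    m = cross(n1, b2)                       # frame vector orthogonal to n1, b2
    x = dot(n1, n2)
    y = dot(m, n2) / math.sqrt(dot(b2, b2))
    return math.degrees(math.atan2(y, x))

angle_trans = dihedral((0, 0, 0), (1, 0, 0), (1, 1, 0), (2, 1, 0))  # planar anti
angle_cis = dihedral((0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0))    # planar syn
```

For a real protein one would feed in the appropriate backbone atoms, e.g., C(i-1), N(i), Cα(i), and C(i) to obtain φ of residue i.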

A3.4 Physical Forces in Proteins and Nucleic Acids

A3.4.1 Covalent Bond

Atoms tend to share electrons to fill their outer electron shells, i.e., electron orbitals. A covalent bond is formed when two atoms share one or more pairs of electrons. Those bonds are stronger than noncovalent bonds such as hydrogen bonds.


Covalent bonds are usually formed by atoms with similar and high electronegativities. The energy needed to remove an electron from the outer shell is large. Covalent bonding is highly directional (unlike ionic bonding, which is governed by nondirectional Coulombic attraction), which results in a small number of characteristic bonding shapes with specific bonding angles. Bond lengths and bond angles are generally modeled with harmonic potentials (see Chapter 2 for details).

Bond order describes the number of electron pairs shared by the atoms forming the covalent bond. The most common is the single bond, but double and triple bonds also occur. Quadruple bonds (carbon and silicon can form them) are possible, but the compounds formed with those bonds are very unstable.

A3.4.2 Electrostatic Interactions

Electrostatic properties are important for the stability of individual biomolecules and for intermolecular interactions. Some approximations are needed if computationally demanding quantum-mechanical calculations are to be avoided. The common method is to use fractional point charges. These point charges are constructed in a way that reproduces the overall charge distribution of the molecule, although they are not real physical entities. If the charges are positioned at the nuclei of atoms, they are referred to as partial atomic charges or net atomic charges.

Another method is to represent electrostatic properties with a central multipole expansion, a method based on the electric moments/multipoles, such as the dipole, quadrupole, and octopole. Every multipole is composed of an even number of electric charges: half of them are positive, and half of them negative.
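As a small illustration of the point-charge picture, the lowest two moments of a charge distribution, the total charge and the dipole vector μ = Σ q_i r_i, follow directly from the definitions. The charges and positions below are hypothetical, water-like values used only for illustration:

```python
import math

# Hypothetical partial charges (in units of e) at positions (in angstroms)
charges = [
    (-0.8, (0.000, 0.000, 0.000)),   # oxygen-like site
    (+0.4, (0.757, 0.586, 0.000)),   # hydrogen-like site
    (+0.4, (-0.757, 0.586, 0.000)),  # hydrogen-like site
]

# Monopole moment: the net charge of the set
total_charge = sum(q for q, _ in charges)

# Dipole moment vector: mu = sum_i q_i * r_i  (units: e * angstrom)
mu = [sum(q * r[k] for q, r in charges) for k in range(3)]
mu_magnitude = math.sqrt(sum(c * c for c in mu))

# 1 e*angstrom is about 4.803 debye, a common conversion in modeling
mu_debye = mu_magnitude * 4.803
```

For this symmetric arrangement the net charge is zero and the dipole points along the y-axis, as expected from the geometry.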

A3.4.3 van der Waals Interactions

The interactions among atomic particles include forces besides the electrostatic interaction between point charges, which means that the total force between atoms and molecules cannot be calculated accurately within the electrostatic framework alone. For example, in rare gases the multipole moments are equal to zero, yet rare gases have liquid and solid phases and deviate from ideal-gas behavior. Thus, another term, the van der Waals (vdW) interaction, is introduced to account for these interactions. The vdW force arises because the electron cloud surrounding the nucleus is not perfectly spherical at every instant; its slight distortions give the atom transient dipole properties.

At infinite distance, this interaction energy is equal to zero. As the distance decreases, the energy decreases gradually and drops to a minimum (e.g., at a distance of 3.8 Å for argon). Then the energy increases rapidly. A widely used form of this interaction is the Lennard-Jones potential (Fig. A3.7):

V(r) = 4ε[(σ/r)^12 - (σ/r)^6],    (A3.3)

where ε is the depth of the energy minimum and σ is the interatomic separation at which the potential crosses zero.


Fig. A3.7 Representation of the van der Waals interaction between atoms using the Lennard-Jones potential V(r). It is a combination of a repulsive term and an attractive term.
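In the common (σ, ε) parameterization, the Lennard-Jones curve crosses zero at r = σ and reaches its minimum of -ε at r = 2^(1/6)·σ. A sketch, using rough argon-like literature values (σ ≈ 3.4 Å, ε ≈ 0.996 kJ/mol) purely for illustration:

```python
def lennard_jones(r, sigma=3.4, epsilon=0.996):
    """Lennard-Jones potential V(r) = 4*eps*[(sigma/r)**12 - (sigma/r)**6].

    sigma in angstroms, epsilon in kJ/mol (rough argon-like values).
    """
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

r_min = 2 ** (1 / 6) * 3.4      # position of the energy minimum
v_min = lennard_jones(r_min)    # depth of the minimum, equals -epsilon
v_zero = lennard_jones(3.4)     # potential crosses zero at r = sigma
```

For these argon-like parameters the minimum falls near 3.82 Å, consistent with the 3.8 Å value quoted above.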

A3.4.4 Hydrogen Bond

A hydrogen bond is formed when two electronegative atoms (e.g., oxygen and nitrogen) share a hydrogen atom. Polarity of the participating molecules is the basis of formation of a hydrogen bond. The energy of a hydrogen bond is in the range of several kilojoules per mole. This is small compared to the covalent C-C bond, which has an energy of 380 kJ/mol. The hydrogen bond is stronger than typical electrostatic interactions between partial charges, but it is easily dissociated by heat or by interaction with other atoms. Hydrogen bonds play a crucial role in the structures of proteins and nucleic acids. DNA has a double-helical structure mainly due to hydrogen bonding between base pairs. Hydrogen bonds also define the secondary structure of proteins; they are formed between the backbone oxygens and amide hydrogens.

A3.4.5 Disulfide Bond

Covalent bonds can form between cysteine residues that are not neighbors in a peptide chain. This disulfide bond (disulfide bridge or disulfide link) is formed when two sulfur atoms (sulfhydryl groups) in the side chains of cysteine residues are covalently linked (Fig. A3.8). The oxidation reaction uses one external oxygen atom and the two hydrogen atoms bound to the sulfur atoms in the side chains. When a bridge between two cysteines is formed, a water molecule is released. When two cysteines are bonded by a disulfide bridge, the end product is called cystine. This covalent bond provides extra stability to the protein's native fold.

Fig. A3.8 The formation of a disulfide bond. An oxidation process removes the sulfur-bound hydrogen atoms from the side chains of two cysteine residues. The sulfur atoms covalently bind to reach stability and thus form a disulfide bond.

A3.4.6 Solvation

Water is a polar solvent, and it separates ions in solution. In structural modeling and molecular dynamics programs, water can be treated in two ways. One way is to explicitly model all water molecules in solution as individual entities with coordinates, mass, electrostatic properties, etc. The other way is to use an implicit solvent model. Implicit solvent treats the bulk water surrounding the biomolecules as a continuous medium with averaged physical properties. The advantage of the implicit solvent model is that the computational cost is low and the model allows better handling of the dielectric effects. The disadvantage is that the model overlooks the influences of individual water molecules and their nonequilibrium behavior.

A3.4.7 Hydrophobic Interactions

Hydrophobicity is the property of amino acids which forces them to avoid water molecules, i.e., a polar environment. Residues such as Phe, Val, and Leu have side chains which cannot form hydrogen bonds with water, so they prefer an environment with no water molecules. Those residues can perturb the hydrogen bonds between water molecules in the solvent, and thus destabilize the protein-water system. Such hydrophobic behavior is one of the main driving forces behind protein folding, and it has a major impact on protein stability. Residues that exhibit such behavior are referred to as hydrophobic (or nonpolar) residues.

To form as many hydrogen bonds as possible, water molecules near the hydrophobic residues effectively make a cage around them (Fig. A3.9). Such cage formation decreases the entropy of the solvent. Hence, it is energetically more favorable when the hydrophobic residues form a compact shape and expose a smaller surface area to the solvent, so that fewer water molecules are involved in the formation of ordered structures around them. Protein folding decreases the entropy of the nonpolar residues, but the entropy of the solvent is significantly increased because fewer ordered structures are formed by the water molecules (i.e., the overall free energy is decreased).


Fig. A3.9 Hydrophobic interactions. Dark gray rectangles are hydrophobic residues. Lighter gray rectangles are hydrophilic or neutral residues. Small circles represent water molecules. Lines between some of the circles represent ordered structures formed among water molecules.

A3.5 Concepts from Statistical Physics and Thermodynamics

Thermodynamic quantities are used to describe the states of biomolecules. Some of them have meanings on both macroscopic and microscopic scales, such as temperature. Others have a statistical origin, for example entropy, and thus cannot be applied to few-body systems or systems with very few degrees of freedom.

A3.5.1 Temperature

Temperature is a measure of the average kinetic energy of the system. Temperature has only macroscopic meaning, because it is a statistical quantity. Nevertheless, it can be applied to a protein composed of only a few hundred atoms.

The instantaneous temperature of a set of atoms is defined through their kinetic energy:

$$\sum_{i=1}^{N} \frac{p_i^2}{2m_i} = \frac{k_B T}{2}\,(3N - N_c), \qquad (A3.4)$$

where $k_B$ is the Boltzmann constant, $N_c$ is the number of constraints, and $3N - N_c$ is the total number of degrees of freedom. In molecular dynamics simulations, the total linear momentum of the system is often constrained to zero, which removes three degrees of freedom from the system. In that case, $N_c$ is equal to 3.
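As a quick numerical check of (A3.4), the following sketch (in Python; the argon-like masses, the velocity sample, and the function name are all hypothetical, chosen for illustration) recovers the temperature from the kinetic energy:

```python
import numpy as np

def instantaneous_temperature(masses, velocities, n_constraints=0):
    """Temperature from the kinetic energy via (A3.4):
    sum_i p_i^2 / (2 m_i) = (k_B T / 2) (3N - N_c)."""
    k_B = 1.380649e-23  # Boltzmann constant, J/K
    kinetic = 0.5 * np.sum(masses[:, None] * velocities**2)
    n_dof = 3 * len(masses) - n_constraints
    return 2.0 * kinetic / (n_dof * k_B)

# Hypothetical sample: 1000 argon-like atoms with Maxwell-Boltzmann velocities at 300 K
rng = np.random.default_rng(0)
m = np.full(1000, 6.63e-26)                    # mass of one atom, kg
sigma = np.sqrt(1.380649e-23 * 300.0 / m[0])   # thermal velocity spread per component
v = rng.normal(0.0, sigma, size=(1000, 3))
T = instantaneous_temperature(m, v)
print(250.0 < T < 350.0)  # True: close to the 300 K used to draw the velocities
```

For a finite sample the instantaneous temperature fluctuates around the target value, which is why the check uses a tolerance rather than exact equality.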

A3.5.2 The Most Probable Distribution

Statistical analysis of a system composed of a set of subsystems which differ only in the energy they possess starts with the probability distribution function of those subsystems. To derive this energy-dependent function, two assumptions have to be introduced:


A3. Physical and Chemical Basis for Structural Bioinformatics

• The number of subsystems is constant, i.e., $\sum_j a_j = A$. $\qquad$ (A3.5)

• The total energy of the whole system is constant, i.e., $\sum_j a_j E_j = E$. $\qquad$ (A3.6)

In these two equations, $a_j$ is the number of subsystems with energy $E_j$, i.e., the number of subsystems in the quantum state $j$; $E$ is the total energy, and $A$ is the total number of subsystems.

To find the probability that a certain subsystem has a particular energy, the principle of equal a priori probability must be applied. This principle says that every combination of subsystems (every distribution of the numbers $a_1, a_2, \ldots, a_n$) is equally probable, and thus must be given equal weight in the calculation. To apply this principle, the number of realizations of any particular distribution $a$ of subsystems must be found, i.e., the number of permutations with repetition:

$$W(a) = \frac{A!}{a_1!\, a_2! \cdots a_n!}. \qquad (A3.7)$$

The probability that a subsystem is in the state with energy $E_j$ is obtained by averaging $a_j/A$ over all possible distributions, i.e., over all possible permutations of the $a_j$:

$$P_j = \frac{\bar{a}_j}{A} = \frac{1}{A}\,\frac{\sum_a W(a)\, a_j(a)}{\sum_a W(a)}. \qquad (A3.8)$$

In this equation, $a_j(a)$ means that $a_j$ depends on the particular distribution $a$. If the total number of subsystems $A$ is very large, then the spread of $W(a)$ is very narrow around the set $a^*$ which maximizes $W(a)$. Under this assumption, the summation $\sum_a W(a)\, a_j(a)$ can be written as $W(a^*)\, a_j^*$, and the probability $P_j$ becomes

$$P_j = \frac{1}{A}\,\frac{\sum_a W(a)\, a_j(a)}{\sum_a W(a)} = \frac{1}{A}\,\frac{W(a^*)\, a_j^*}{W(a^*)} = \frac{a_j^*}{A}. \qquad (A3.9)$$

The expression for the distribution $a^*$ which maximizes $W(a)$ can be calculated using Lagrange's method of undetermined multipliers:

$$\frac{\partial}{\partial a_j}\Big(\ln W(a) - \alpha \sum_k a_k - \beta \sum_k a_k E_k\Big) = 0, \quad j = 1, 2, \ldots \qquad (A3.10)$$

In this set of equations, $\alpha$ and $\beta$ are the undetermined multipliers. When Stirling's approximation ($n! \approx \sqrt{2\pi n}\,(n/e)^n$) is used, these equations can be written as $-\ln a_j - \alpha - 1 - \beta E_j = 0$, which is

$$a_j^* = e^{-\alpha - 1}\, e^{-\beta E_j}, \qquad (A3.11)$$

i.e.,

$$a_j^* = e^{-\alpha'} e^{-\beta E_j}, \quad \alpha' = \alpha + 1. \qquad (A3.12)$$

The normalization condition (A3.5) determines $e^{\alpha'} = \frac{1}{A}\sum_j e^{-\beta E_j}$, which applied to (A3.9) gives the probability that a subsystem has energy $E_j$:

$$P_j = \frac{a_j^*}{A} = \frac{e^{-\beta E_j(N,V)}}{\sum_j e^{-\beta E_j(N,V)}}. \qquad (A3.13)$$

The coefficient $\beta$ can be easily determined, and it is equal to $1/k_B T$, where $k_B$ is Boltzmann's constant ($\approx 8.617 \times 10^{-5}$ eV K$^{-1}$) and $T$ is the temperature. The summation $Z = \sum_j e^{-E_j/k_B T}$ is called the partition function.
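The partition function and the probabilities of (A3.13) can be sketched in a few lines of Python (the three-level system and the function name here are hypothetical, for illustration only):

```python
import numpy as np

def boltzmann_probabilities(energies, kT):
    """P_j = exp(-E_j / kT) / Z, with Z the partition function (A3.13)."""
    weights = np.exp(-np.asarray(energies, dtype=float) / kT)
    Z = weights.sum()
    return weights / Z, Z

# Hypothetical three-level system, energies in units of kT
p, Z = boltzmann_probabilities([0.0, 1.0, 2.0], kT=1.0)
print(round(p.sum(), 12))        # 1.0: the probabilities are normalized
print(bool(p[0] > p[1] > p[2]))  # True: lower energy means higher population
```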

A3.5.3 Entropy

There are several ways to define entropy; a couple of them will be given here for the purpose of clarity.

The first approach is the thermodynamic definition, which says that entropy is conjugate to the part of the total energy of a physical system which cannot be used for useful work. It measures the disorder of the system (e.g., an ideal gas kept in a closed container), and therefore it is closely related to the information entropy which will be explained later. This quantity was introduced by Rudolf Clausius in 1865. His definition gives the difference of the entropy rather than the absolute value of this quantity:

$$\Delta S = \frac{\Delta Q}{T}. \qquad (A3.14)$$

This definition connects the change of the entropy with the heat absorbed by the system ($\Delta Q$) during a reversible process at fixed temperature $T$. For a closed cycle of reversible processes, this can be written as

$$\oint \frac{dQ}{T} = 0. \qquad (A3.15)$$

In the case of irreversible transformations, the entropy change is always greater than or equal to the integral of heat over temperature, i.e.,

$$\Delta S \geq \int \frac{\Delta Q}{T}. \qquad (A3.16)$$

Another approach to defining entropy is via quantum statistics. In quantum statistics and atomic physics, the volume of phase space which corresponds to one quantum state is equal to

$$(2\pi\hbar)^s = h^s \qquad (A3.17)$$

($h$ is Planck's constant, $6.6260693 \times 10^{-34}$ J s).

The states of a system with $s$ degrees of freedom (positions and momenta are each described by $s$ coordinates) can be represented by points in the corresponding $2s$-dimensional phase space. As the coordinates of the system, i.e., positions and momenta, change with time, the corresponding curves in phase space are called phase trajectories.

The behavior of a complex quantum system at equilibrium, divided into many subsystems, can be described by the distribution function $w_n$ of those subsystems. This function usually depicts the distribution of subsystems according to their energy $E$, and may be written as $w_n = w(E)$. To define the probability $W(E)\,dE$ that a particular subsystem has energy between $E$ and $E + dE$, $w(E)$ has to be multiplied by the number of quantum states which have energy in this same interval. The number of quantum states having energy equal to or less than $E$ is denoted $\Gamma(E)$, so the number of states with energy in the range between $E$ and $E + dE$ is

$$\frac{d\Gamma(E)}{dE}\,dE, \qquad (A3.18)$$

and the corresponding energy probability distribution is

$$W(E) = \frac{d\Gamma(E)}{dE}\,w(E). \qquad (A3.19)$$

The normalization condition for the probability distribution function is

$$\int W(E)\,dE = 1. \qquad (A3.20)$$

The function $W(E)$ is usually sharply peaked around the energy $\bar{E}$ which maximizes it, so that (A3.20) can be simplified and written as

$$W(\bar{E})\,\Delta E = 1. \qquad (A3.21)$$

This equation combined with (A3.18) gives

$$w(\bar{E})\,\Delta\Gamma = 1, \qquad (A3.22)$$

with

$$\Delta\Gamma = \frac{d\Gamma(\bar{E})}{dE}\,\Delta E. \qquad (A3.23)$$

Page 53: Appendix 1 Biological and Chemical Basics Related to


$\Delta\Gamma$ is the number of quantum states which correspond to the energy interval $\Delta E$.

The same relation can be applied in classical statistics, with the function $w(E)$ replaced by the classical distribution function $\rho$, and $\Delta\Gamma$ replaced by the classical phase-space volume $\Delta p\,\Delta q$:

$$\rho(\bar{E})\,\Delta p\,\Delta q = 1. \qquad (A3.24)$$

The relation which connects the quasi-classical and quantum volumes in phase space is

$$\Delta\Gamma = \frac{\Delta p\,\Delta q}{(2\pi\hbar)^s}. \qquad (A3.25)$$

It says that the number of quasi-classical states is obtained when the volume occupied by the classical phase state is divided by the volume of one quantum state, $(2\pi\hbar)^s$.

The number of states $\Delta\Gamma$ is also called the statistical weight of the macroscopic state of the subsystem. The statistical weight represents the degree of broadening of the macroscopic state of the system with respect to its microscopic states.

The logarithm of the statistical weight,

$$S = \log \Delta\Gamma, \qquad (A3.26)$$

is called the entropy of the subsystem. The same relation can be used in the classical case:

$$S = \log \frac{\Delta p\,\Delta q}{(2\pi\hbar)^s}. \qquad (A3.27)$$

The entropy itself is dimensionless, but in the classical case it is defined only up to an additive constant. Classical statistics defines entropy as $S = \log \Delta p\,\Delta q$. The value of the product $\Delta p\,\Delta q$ depends on the choice of units; e.g., if the unit is changed by a factor $a$, the product $\Delta p\,\Delta q$ is changed to $a^s\,\Delta p\,\Delta q$ and the entropy becomes $S = \log \Delta p\,\Delta q + s \log a$. This means that in the classical case only the difference between two values of the entropy can be measured, not the absolute value.

The logarithm of the distribution function of a system is an additive quantity, because two subsystems/bodies in equilibrium have independent distributions, which means that the joint distribution has the form $w_{12} = w_1 w_2$. From this equality comes the relation

$$\log w_{12} = \log w_1 + \log w_2. \qquad (A3.28)$$

The logarithm of the distribution can be expanded as a Taylor polynomial in the energy. If higher-order terms are neglected, the general relation can be written as

$$\log w_n(E_n) = \alpha + \beta E_n. \qquad (A3.29)$$

Page 54: Appendix 1 Biological and Chemical Basics Related to


It follows directly that

$$\log w(\bar{E}) = \alpha + \beta \bar{E}, \qquad (A3.30)$$

which can be written as

$$\langle \log w(E_n) \rangle = \alpha + \beta \bar{E}. \qquad (A3.31)$$

Together with (A3.22), this can be used to expand (A3.26) into

$$S = \log \Delta\Gamma = -\log w(\bar{E}) = -\langle \log w(E_n) \rangle. \qquad (A3.32)$$

From the general definition of the mean value in statistics comes the final expression

$$S = -\sum_n w_n \log w_n. \qquad (A3.33)$$

In classical statistics, entropy is expressed in a similar way as

$$S = -\langle \log[(2\pi\hbar)^s \rho] \rangle = -\int \rho \log[(2\pi\hbar)^s \rho]\, dp\, dq. \qquad (A3.34)$$

Equations (A3.33) and (A3.34) show that entropy is an additive quantity, i.e.,

$$S_{12} = S_1 + S_2. \qquad (A3.35)$$

A3.5.4 Information Entropy

Information theory deals with the amount of information, not with its meaning. To work quantitatively with information, one must define the amount of information $I(p)$ contained in an event with probability $p$. For independent events $A$ and $B$ with probabilities $P(A) = a$ and $P(B) = b$, the amount of information should be an additive quantity, i.e., $I(ab) = I(a) + I(b)$. It can be shown that the only continuous solution of this equation is a function of the form $I(p) = c \log p$. The information unit is the shannon, named after Claude Shannon.

When the base of the logarithm is 2, an event with probability 1/2 carries one unit of information, one shannon:

$$I(p) = -\log_2 p. \qquad (A3.36)$$

The mean value of the information contained in the events $X = x_k$, $k = 1, 2, \ldots$, is

$$S(X) = -\sum_k p_k \log_2 p_k. \qquad (A3.37)$$

Page 55: Appendix 1 Biological and Chemical Basics Related to


If $X$ is a continuous variable with the distribution function $f(x)$, then the mean value of the information is

$$S(X) = -\int_{-\infty}^{+\infty} f(x) \log_2 f(x)\, dx. \qquad (A3.38)$$

The mean value of the information thus defined is the information entropy. The information entropy has its maximum value for the uniform distribution, $p_i = 1/n$ for $1 \leq i \leq n$; a system in equilibrium has the maximum entropy.

A3.5.5 Enthalpy

If the system keeps its volume constant during a process, then the change in heat is equal to the change in energy, i.e., $dQ = dU$. If instead the pressure is kept constant, then the change in heat can be written as the differential

$$dQ = d(U + pV) = dH. \qquad (A3.39)$$

In this equation the quantity $H$ is called the enthalpy or heat function of the system. The total differential of this quantity gives an interesting relation concerning the temperature $T$ and volume $V$ of the system. From

$$dU = T\,dS - p\,dV \qquad (A3.40)$$

and

$$dH = dU + p\,dV + V\,dp, \qquad (A3.41)$$

we have

$$dH = T\,dS + V\,dp. \qquad (A3.42)$$

Then $T$ and $V$ can be calculated as

$$T = (\partial H/\partial S)_p \qquad (A3.43)$$

and

$$V = (\partial H/\partial p)_S. \qquad (A3.44)$$

If the system is thermally isolated, then $dQ = 0$ and the enthalpy is conserved. Since the enthalpy, like the energy, is defined only up to an additive constant, only the change in enthalpy between two states can be measured.

Page 56: Appendix 1 Biological and Chemical Basics Related to


The specific heat $C_V$ at constant volume can be written as

$$C_V = \left(\frac{\partial U}{\partial T}\right)_V. \qquad (A3.45)$$

The specific heat $C_p$ at constant pressure can be written as

$$C_p = \left(\frac{\partial H}{\partial T}\right)_p. \qquad (A3.46)$$

A3.5.6 Free Energy

Free energy is the measure of the reversible work that can be extracted from a system. The negative differential of the free energy is equal to the maximum work the given system can perform.

A3.5.6.1 Helmholtz Free Energy

At constant temperature and constant volume, the free energy is called the Helmholtz free energy and is defined as

$$F = U - TS. \qquad (A3.47)$$

In this equation $U$ is the internal energy, $T$ is the temperature, and $S$ is the entropy. The total work performed on a system at constant temperature in a reversible process is equal to the change in the Helmholtz free energy of that system. The first law of thermodynamics states that $dU = \delta Q - \delta W$, with $\delta W = p\,dV$; the second law states that at equilibrium $\delta Q = T\,dS$. Thus we have

$$dF = dU - (T\,dS + S\,dT) = (T\,dS - p\,dV) - T\,dS - S\,dT = -p\,dV - S\,dT. \qquad (A3.48)$$

Entropy has its maximum value when the process is at equilibrium, which means that for processes which are not reversible the change in free energy is smaller than $-p\,dV - S\,dT$, i.e.,

$$dF \leq -p\,dV - S\,dT. \qquad (A3.49)$$

If the process is performed at constant temperature ($dT = 0$), then

$$dF \leq -p\,dV = -\delta W. \qquad (A3.50)$$

Page 57: Appendix 1 Biological and Chemical Basics Related to


Expressing the free energy using statistical physics requires the partition function $Z = \sum_j e^{-\beta E_j}$ and its logarithm,

$$f = \log Z, \qquad (A3.51)$$

regarded as a function of $\beta$ and the $E_j$'s. The total derivative of $f$ is

$$df = \left(\frac{\partial f}{\partial \beta}\right)_{E_j\text{'s}} d\beta + \sum_k \left(\frac{\partial f}{\partial E_k}\right)_{\beta,\, E_{j \neq k}} dE_k. \qquad (A3.52)$$

The two partial derivatives are

$$\frac{\partial f}{\partial \beta} = \frac{-\sum_j E_j e^{-\beta E_j}}{Z} = -\bar{E}, \qquad \frac{\partial f}{\partial E_k} = \frac{-\beta e^{-\beta E_k}}{Z} = -\beta P_k. \qquad (A3.53)$$

These two expressions simplify (A3.52) to

$$df = -\bar{E}\,d\beta - \beta \sum_j P_j\,dE_j. \qquad (A3.54)$$

By adding $\beta\,d\bar{E} + \bar{E}\,d\beta$ to both sides, this equation can be written as

$$df + \beta\,d\bar{E} + \bar{E}\,d\beta = \beta\,d\bar{E} - \beta \sum_j P_j\,dE_j, \qquad (A3.55)$$

i.e.,

$$d(f + \beta\bar{E}) = \beta\Big(d\bar{E} - \sum_j P_j\,dE_j\Big). \qquad (A3.56)$$

$d\bar{E}$ is the average energy increase, and $\sum_j P_j\,dE_j$ is the average reversible work done on the ensemble of systems (the reversible work has the same distribution as the energy of the system, which means that $\sum_j P_j\,dE_j$ is equal to the reversible work). These relations reveal that the value in parentheses on the right-hand side is the average reversible heat supplied to the system. Equation (A3.56) can thus be written as

$$d(f + \beta\bar{E}) = \beta\,\delta q_{\mathrm{reversible}}. \qquad (A3.57)$$

Equation (A3.57) together with (A3.14) gives

$$d(f + \beta\bar{E}) = dS/k. \qquad (A3.58)$$

Page 58: Appendix 1 Biological and Chemical Basics Related to


Integration of (A3.58) yields

$$S = \frac{\bar{E}}{T} + k \ln Z + \mathrm{const.} \qquad (A3.59)$$

The average energy can be written as

$$\bar{E} = \bar{E}(N, V, \beta) = \frac{\sum_j E_j(N,V)\, e^{-\beta E_j(N,V)}}{\sum_j e^{-\beta E_j(N,V)}} = kT^2 \left(\frac{\partial \ln Z}{\partial T}\right)_{N,V}. \qquad (A3.60)$$

This equation can be used to express the entropy in terms of the partition function:

$$S = kT \left(\frac{\partial \ln Z}{\partial T}\right)_{N,V} + k \ln Z + \mathrm{const.} \qquad (A3.61)$$

Equations (A3.47) and (A3.59) give the relation for the free energy using the formalism of classical statistical physics:

$$F(N, V, T) = -kT \ln Z(N, V, T) + \mathrm{const.} \qquad (A3.62)$$

It can be seen that the free energy can be calculated only up to an additive constant. However, only the free energy difference is important in biomolecular modeling:

$$\Delta F = F_2 - F_1 = -kT \ln \frac{Z_2}{Z_1}. \qquad (A3.63)$$

A3.5.6.2 Gibbs Free Energy

When a system has the temperature $T$ and pressure $p$ as independent variables, the differential of the total energy $U$, $dU = T\,dS - p\,dV$, can be written as

$$dU = T\,dS - p\,dV = d(TS) - S\,dT - d(pV) + V\,dp. \qquad (A3.64)$$

The quantity $dG$, the differential of the Gibbs free energy, is introduced as

$$dG = -S\,dT + V\,dp. \qquad (A3.65)$$

From (A3.64) and (A3.65) comes the definition of the Gibbs free energy $G$:

$$G = U - TS + pV. \qquad (A3.66)$$

$U$ is the internal energy of the system, $T$ the temperature, $S$ the entropy, $p$ the pressure, and $V$ the volume of the system. The Gibbs free energy determines properties

Page 59: Appendix 1 Biological and Chemical Basics Related to


of the system such as the voltage of an electrochemical cell or the equilibrium constant for a reversible reaction. A process can run spontaneously if and only if the associated change in $G$ for that system is negative ($\Delta G < 0$).

The Gibbs free energy can be expressed using the enthalpy and the Helmholtz free energy,

$$G = H - TS \qquad (A3.67)$$

and

$$G = F + pV. \qquad (A3.68)$$

The connection of the equilibrium constant $K_{eq} = e^{-\Delta F/k_B T}$ with the difference of the Gibbs free energy at constant pressure is straightforward:

$$\Delta G^0 = -RT \ln K_{eq}. \qquad (A3.69)$$

The Gibbs free energy G is constant along a coexistence line on a p-T phase diagram.
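Equation (A3.69) can be inverted to estimate an equilibrium constant from a standard free energy change; a sketch with a hypothetical reaction (the value of $\Delta G^0$ and the function name are invented for illustration):

```python
import math

R = 8.314  # gas constant, J/(mol K)

def equilibrium_constant(delta_G0, T):
    """Invert (A3.69): K_eq = exp(-dG0 / (R T))."""
    return math.exp(-delta_G0 / (R * T))

# Hypothetical reaction with dG0 = -5.7 kJ/mol at 298 K
K = equilibrium_constant(-5.7e3, 298.0)
print(K > 1.0)  # True: a negative standard Gibbs energy change favors the products
```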

A3.5.7 Kinetic Barrier

Compounds involved in a dynamic process must possess a certain amount of energy to react, as discovered by Arrhenius. The reaction rate in thermal equilibrium at an absolute temperature $T$ can be calculated from the Arrhenius equation:

$$k = A\, e^{-E_a/RT}. \qquad (A3.70)$$

$E_a$ is called the activation energy or kinetic barrier. It is the minimum energy needed, when thermal fluctuations are neglected, for the reaction to be initiated. $A$ (the frequency factor) is a constant specific to a particular reaction and indicates how many collisions between reactants have the correct orientation to lead to the products. The frequency factor is comparatively insensitive to the temperature. The Arrhenius equation is derived from the assumption that all of the molecules involved in the reaction are in thermal equilibrium, which means that they obey the Maxwell-Boltzmann energy distribution.

The height of the kinetic barrier is given by

$$E_a = -RT \ln\left(\frac{k}{A}\right). \qquad (A3.71)$$
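A sketch of (A3.70) and (A3.71) for a hypothetical reaction (the prefactor, barrier, and function names are invented for illustration):

```python
import math

R = 8.314  # gas constant, J/(mol K)

def arrhenius_rate(A, Ea, T):
    """Arrhenius equation (A3.70): k = A exp(-Ea / (R T))."""
    return A * math.exp(-Ea / (R * T))

def barrier_from_rate(k, A, T):
    """Kinetic barrier (A3.71): Ea = -R T ln(k / A)."""
    return -R * T * math.log(k / A)

# Hypothetical reaction: prefactor 1e13 1/s and a 50 kJ/mol barrier
k300 = arrhenius_rate(1e13, 50e3, 300.0)
k310 = arrhenius_rate(1e13, 50e3, 310.0)
print(k310 > k300)                                   # True: the rate grows with temperature
print(round(barrier_from_rate(k300, 1e13, 300.0)))   # 50000: the barrier (J/mol) is recovered
```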

If the kinetic energy is large enough, the molecules in a reaction can overcome repulsive forces to start the reaction (Fig. A3.10). The transition state in a chemical bond formation and breakage reaction occurs when the initial bonds are stretched to their maximum. Its duration is very short, about $10^{-15}$ s. The energy needed to

Page 60: Appendix 1 Biological and Chemical Basics Related to



Fig. A3.10 The energy profile along the reaction coordinate, with the kinetic barrier $E_B$ and the overall reaction energy difference $\Delta E$.

reach the transition state is equal to the kinetic barrier. If the reaction has more than one transition point, the kinetic barrier is equal to the height of the highest one. A substance which lowers the kinetic barrier is called a catalyst. In biology the catalysts are called enzymes, which are mostly proteins and certain RNA molecules.

The concept of the kinetic barrier mostly reflects the mass-action view that has been conventionally used by experimentalists. This simple picture does not directly apply to protein folding, an important subject for protein structure prediction. The protein folding process is governed mainly by the free energy landscape rather than the internal energy, and the entropy plays a crucial role in the folding.

Protein Folding Landscape

The folding process was studied by Cyrus Levinthal, who showed that a protein does not enumerate all of its possible conformations during folding. He argued that for a protein composed of 100 residues, with three possible conformations per residue, there are about $10^{48}$ conformations in total. If the transition between conformations requires $10^{-11}$ s (10 ps), it would take about $10^{29}$ years to explore all of them, which exceeds the lifetime of the universe. This Levinthal paradox leads to the conclusion that proteins follow pathways through the energy landscape during their folding. The folding landscape, a multidimensional free energy surface, has a funnel-like shape (sometimes called the folding funnel). It was found that most proteins fold into their native conformation within a few seconds from random unfolded conformations. It is assumed that evolution shaped the folding landscape so as to have a funneled shape which allows proteins to fold into the global minimum within reasonable times.
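Levinthal's estimate is plain arithmetic and can be reproduced directly:

```python
# Levinthal's estimate: 3 conformations per residue, 100 residues,
# and 1e-11 s (10 ps) per conformational transition.
conformations = 3 ** 100
seconds = conformations * 1e-11
years = seconds / (3600 * 24 * 365)
print(conformations > 1e47)  # True: about 5e47, i.e., on the order of 10^48
print(years > 1e29)          # True: far longer than the lifetime of the universe
```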

A3.6 Physics in Molecular Dynamics

A3.6.1 Newtonian Laws

Molecular dynamics simulation is a method for numerically solving Newton's equations of motion for a set of particles which represent atoms and molecules. Using



this method, the time-dependent behavior of the molecular system of interest can be analyzed in detail.

Interactions among atoms in a molecular dynamics simulation are described using force fields (also called molecular mechanics). The dynamics is then described by Newton's second law, which states that the force equals the rate of change of momentum:

$$F = \frac{d}{dt}p = \frac{d}{dt}(mv). \qquad (A3.72)$$

When the mass $m$ is constant, we have

$$F = m\frac{dv}{dt} = ma, \qquad (A3.73)$$

where $a$ is the acceleration.

A3.6.2 Numeric Solution: Finite Difference Method

With Newton's laws, one can simulate the dynamics of a mechanical system numerically. The basic idea is to calculate the force on each particle based on its initial position and to assign an initial velocity. The initial velocities of the atoms in a molecular system are typically assigned randomly according to the Boltzmann distribution at a particular temperature. A particle will then move according to its velocity and acceleration for a small time step, during which the force has no significant change. One then updates the position, velocity, and force of the particle at the end of the time step, which serves as the starting point for the next cycle of the simulation.

To simulate the behavior of biological molecules using Newton's laws, some approximations must be used. The dynamics integration cannot be done using infinitely small time steps. Finite difference methods have been developed to efficiently integrate, with negligible error, Newton's equations of motion with continuous potential models. The whole idea is that the integration is divided into a large number of small steps, each separated by a fixed time interval $\delta t$. The total force on a particle is the vector sum of the forces coming from all other particles.

Many algorithms have been developed to achieve the correct integration. They all start from the Taylor series expansion of the dynamical properties of all the particles:

$$r(t + \delta t) = r(t) + \delta t\, v(t) + \tfrac{1}{2}\delta t^2 a(t) + \tfrac{1}{6}\delta t^3 b(t) + \tfrac{1}{24}\delta t^4 c(t) + \cdots,$$
$$v(t + \delta t) = v(t) + \delta t\, a(t) + \tfrac{1}{2}\delta t^2 b(t) + \tfrac{1}{6}\delta t^3 c(t) + \cdots, \qquad (A3.74)$$
$$a(t + \delta t) = a(t) + \delta t\, b(t) + \tfrac{1}{2}\delta t^2 c(t) + \cdots,$$
$$b(t + \delta t) = b(t) + \delta t\, c(t) + \cdots,$$

where $b$ and $c$ denote the third and fourth time derivatives of the position.



The most widely used method in molecular dynamics programs is the Verlet algorithm. It uses the positions and accelerations at time $t$, and the positions from the previous step, $r(t - \delta t)$, to calculate the new positions at $t + \delta t$, $r(t + \delta t)$. The relations between the positions and velocities at those two moments in time can be written as

$$r(t + \delta t) = r(t) + \delta t\, v(t) + \tfrac{1}{2}\delta t^2 a(t) + \cdots,$$
$$r(t - \delta t) = r(t) - \delta t\, v(t) + \tfrac{1}{2}\delta t^2 a(t) - \cdots. \qquad (A3.75)$$

These two relations can be added to give

$$r(t + \delta t) = 2r(t) - r(t - \delta t) + \delta t^2 a(t). \qquad (A3.76)$$

The velocities do not explicitly appear in the Verlet algorithm. They can be calculated in several ways. A very simple approach is to divide the difference in positions at times $t + \delta t$ and $t - \delta t$ by $2\delta t$, i.e.,

$$v(t) = [r(t + \delta t) - r(t - \delta t)]/2\delta t. \qquad (A3.77)$$

Another approach calculates the velocities at the half step $t + \tfrac{1}{2}\delta t$:

$$v(t + \tfrac{1}{2}\delta t) = [r(t + \delta t) - r(t)]/\delta t. \qquad (A3.78)$$

Practical application of this algorithm is straightforward and the memory requirements are modest: only the positions at two time steps, $r(t)$ and $r(t - \delta t)$, and the acceleration $a(t)$ have to be recorded. The only drawback is that the new position $r(t + \delta t)$ is obtained by adding the small term $\delta t^2 a(t)$ to the difference of two much larger terms, $2r(t)$ and $r(t - \delta t)$, which requires high precision for $r$ in the numerical calculation.
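A minimal sketch of the Verlet scheme (A3.76) for a one-dimensional harmonic oscillator with unit mass and spring constant (the function name and parameters are invented for illustration; the exact solution is $\cos t$):

```python
import numpy as np

def verlet(r0, v0, accel, dt, n_steps):
    """Position Verlet (A3.76): r(t+dt) = 2 r(t) - r(t-dt) + dt^2 a(t)."""
    r_prev = r0 - dt * v0 + 0.5 * dt**2 * accel(r0)  # bootstrap r(t - dt) from a Taylor step
    r, traj = r0, [r0]
    for _ in range(n_steps):
        r_next = 2.0 * r - r_prev + dt**2 * accel(r)
        r_prev, r = r, r_next
        traj.append(r)
    return np.array(traj)

# Harmonic oscillator a(r) = -r, started at r = 1 with zero velocity
traj = verlet(r0=1.0, v0=0.0, accel=lambda r: -r, dt=0.01, n_steps=628)
print(abs(traj[-1] - np.cos(6.28)) < 1e-3)  # True: small error after about one period
```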

The leapfrog method is a variation of the Verlet algorithm. It uses the following relations:

$$r(t + \delta t) = r(t) + \delta t\, v(t + \tfrac{1}{2}\delta t),$$
$$v(t + \tfrac{1}{2}\delta t) = v(t - \tfrac{1}{2}\delta t) + \delta t\, a(t), \qquad (A3.79)$$
$$v(t) = \tfrac{1}{2}\left[v(t + \tfrac{1}{2}\delta t) + v(t - \tfrac{1}{2}\delta t)\right].$$

The method derives its name from the "leapfrog" jumps that the velocities make over the positions to give their values at $t + \tfrac{1}{2}\delta t$.



The velocity Verlet algorithm gives positions, velocities, and accelerations at the same time and does not compromise precision:

$$r(t + \delta t) = r(t) + \delta t\, v(t) + \tfrac{1}{2}\delta t^2 a(t),$$
$$v(t + \delta t) = v(t) + \tfrac{1}{2}\delta t\,[a(t) + a(t + \delta t)]. \qquad (A3.80)$$

Beeman's algorithm is a derivative of the Verlet method:

$$r(t + \delta t) = r(t) + \delta t\, v(t) + \tfrac{2}{3}\delta t^2 a(t) - \tfrac{1}{6}\delta t^2 a(t - \delta t),$$
$$v(t + \delta t) = v(t) + \tfrac{1}{3}\delta t\, a(t + \delta t) + \tfrac{5}{6}\delta t\, a(t) - \tfrac{1}{6}\delta t\, a(t - \delta t). \qquad (A3.81)$$

A3.6.3 Time-Dependent Properties

Obtaining time-dependent properties is a major advantage of molecular dynamics over other simulation techniques, such as Monte Carlo methods. Using the time-series trajectories from molecular dynamics simulations, one can calculate time correlation coefficients. A correlation function gives a numerically explicit relation between two sets of data. The simplest form of this function for two sets of $M$ values is

$$C_{xy} = \frac{1}{M}\sum_{i=1}^{M}(x_i - \langle x_i \rangle)(y_i - \langle y_i \rangle) = \langle x_i y_i \rangle - \langle x_i \rangle \langle y_i \rangle. \qquad (A3.82)$$

If a normalized value (between $-1$ and $1$) is needed, this function is divided by the root-mean-square deviations of both sets ($x$ and $y$):

$$c_{xy} = \frac{\frac{1}{M}\sum_{i=1}^{M}(x_i - \langle x_i \rangle)(y_i - \langle y_i \rangle)}{\sqrt{\left(\frac{1}{M}\sum_{i=1}^{M}(x_i - \langle x_i \rangle)^2\right)\left(\frac{1}{M}\sum_{i=1}^{M}(y_i - \langle y_i \rangle)^2\right)}}. \qquad (A3.83)$$

If the properties analyzed by the correlation coefficient come from two different sets of data, the correlation function is called the cross-correlation function. If they come from the same set of data, the function is called the autocorrelation function.
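A sketch of the normalized correlation coefficient (A3.83); the test signals and the function name are hypothetical, chosen for illustration:

```python
import numpy as np

def normalized_correlation(x, y):
    """Normalized correlation coefficient (A3.83), always between -1 and 1."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

t = np.linspace(0.0, 10.0, 200)
x = np.sin(t)
print(round(normalized_correlation(x, x), 6))   # 1.0: a signal with itself
print(round(normalized_correlation(x, -x), 6))  # -1.0: perfectly anticorrelated
```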

The data provided by a molecular dynamics simulation are time dependent, and thus they can be analyzed along one more dimension, time. For example, the



time correlation coefficient for two sets of coordinates is

$$c_{xy}(t) = \langle x(t)\,y(t) \rangle - \langle x(t) \rangle \langle y(t) \rangle. \qquad (A3.84)$$

A3.7 Physics of Monte Carlo Methods

A3.7.1 Boltzmann Factor

The Boltzmann factor $e^{-E/k_B T}$ gives the relative probability (population), for a system in equilibrium at temperature $T$, of conformations that have an energy difference of $E$ ($k_B$ is the Boltzmann constant and $T$ is the temperature).

A3.7.2 Metropolis Method

The Metropolis method is a sampling scheme that can be used to efficiently cover the important conformations during an optimization process. In this method, the newly (randomly) generated configuration is accepted or rejected based on comparing a random number with the Boltzmann factor. The Boltzmann factor in this case is the exponential of the normalized difference of the energies between the new and old configurations, i.e., $\exp[-(E_{\mathrm{NEW}}(r^N) - E_{\mathrm{OLD}}(r^N))/k_B T]$. The method assumes that the system is in equilibrium and thus that the energies have the Boltzmann distribution. When the new configuration has lower energy, it is always accepted.
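A minimal sketch of the Metropolis scheme on a hypothetical one-dimensional double-well energy function (the energy, step size, and function name are invented for illustration):

```python
import math
import random

def metropolis_step(x, energy, kT, step=0.5, rng=random):
    """One Metropolis move: always accept downhill, accept uphill with the Boltzmann factor."""
    x_new = x + rng.uniform(-step, step)
    dE = energy(x_new) - energy(x)
    if dE <= 0.0 or rng.random() < math.exp(-dE / kT):
        return x_new  # accepted
    return x          # rejected: keep the old configuration

# Hypothetical 1-D double-well energy E(x) = (x^2 - 1)^2, sampled at kT = 0.3
random.seed(1)
energy = lambda x: (x * x - 1.0) ** 2
x, samples = 0.0, []
for _ in range(20000):
    x = metropolis_step(x, energy, kT=0.3)
    samples.append(x)
avg_E = sum(energy(s) for s in samples) / len(samples)
print(avg_E < energy(0.0))  # True: sampling concentrates near the minima at x = +/-1
```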

A3.7.3 Simulated Annealing

Simulated annealing is an idea borrowed from metallurgy, where gradual cooling produces bigger crystals and smaller defects. This approach is applied in the search for the conformation with the global minimum energy; the idea is to avoid being trapped in local minima. The system is initially heated to a high temperature, so that all conformations are accessible. During the simulation the temperature of the system is gradually lowered. This procedure increases the probability of finding the global minimum by gradually reducing the probability of accepting jumps with unfavorable energy differences.
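The annealing idea can be sketched by combining Metropolis moves with a gradually decreasing temperature (the landscape, cooling schedule, and all parameters here are hypothetical, chosen for illustration):

```python
import math
import random

def simulated_annealing(energy, x0, T0=5.0, cooling=0.999, n_steps=5000, seed=42):
    """Metropolis sampling with a geometrically decreasing temperature."""
    rng = random.Random(seed)
    x, T, best_x = x0, T0, x0
    for _ in range(n_steps):
        x_new = x + rng.uniform(-0.5, 0.5)
        dE = energy(x_new) - energy(x)
        if dE <= 0.0 or rng.random() < math.exp(-dE / T):
            x = x_new
        if energy(x) < energy(best_x):
            best_x = x
        T *= cooling  # gradual cooling schedule
    return best_x

# Hypothetical rugged 1-D landscape with many local minima
energy = lambda x: 0.1 * (x - 2.0) ** 2 + math.sin(5.0 * x)
x_min = simulated_annealing(energy, x0=-3.0)
print(energy(x_min) < energy(-3.0))  # True: annealing improves on the starting point
```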

A3.7.4 Euler Angles

Practical application of Monte Carlo principles to molecular simulations involves translation and rotation of molecules. Translation is straightforward; here the focus is on the descriptions of rotation. Rotations are applied using trigonometric relationships. For example, if the rotation is around the $x$ axis and the rotation angle is $\delta\theta$, the new




Fig. A3.11 A rigid body rotation represented as successive rotations through the three Euler angles.

values of the vector describing the molecule's position, $(x', y', z')$, are calculated from the old ones using the relation

$$x' = x, \quad y' = y\cos\delta\theta - z\sin\delta\theta, \quad z' = y\sin\delta\theta + z\cos\delta\theta. \qquad (A3.85)$$

The Euler angles, as shown in Fig. A3.11, are used to describe the orientation of a molecule. The three Euler angles describe rotations around certain axes: $\phi$ is the rotation about the Cartesian $z$-axis; $\theta$ is the rotation around the new $x$-axis after the first rotation is performed; $\psi$ is the rotation around the new $z$-axis after the first two rotations are done.

These angles are randomly changed by small amounts $\delta\phi$, $\delta\theta$, and $\delta\psi$; the new vector is then calculated using the matrix equation $v_{\mathrm{new}} = A\, v_{\mathrm{old}}$, with

$$A = \begin{pmatrix} \cos\delta\phi\cos\delta\psi - \sin\delta\phi\cos\delta\theta\sin\delta\psi & \sin\delta\phi\cos\delta\psi + \cos\delta\phi\cos\delta\theta\sin\delta\psi & \sin\delta\theta\sin\delta\psi \\ -\cos\delta\phi\sin\delta\psi - \sin\delta\phi\cos\delta\theta\cos\delta\psi & -\sin\delta\phi\sin\delta\psi + \cos\delta\phi\cos\delta\theta\cos\delta\psi & \sin\delta\theta\cos\delta\psi \\ \sin\delta\phi\sin\delta\theta & -\cos\delta\phi\sin\delta\theta & \cos\delta\theta \end{pmatrix}. \qquad (A3.86)$$

Simple random movement in angle space does not give a uniform distribution in Cartesian space, as shown in Fig. A3.12. The solution for this problem is to sample $\cos\theta$ instead of $\theta$. One approach gives the following relations for uniform sampling:

$$\phi_{\mathrm{new}} = \phi_{\mathrm{old}} + (2\xi - 1)\,\delta\phi_{\mathrm{max}},$$
$$\cos\theta_{\mathrm{new}} = \cos\theta_{\mathrm{old}} + (2\xi - 1)\,\delta(\cos\theta)_{\mathrm{max}}, \qquad (A3.87)$$
$$\psi_{\mathrm{new}} = \psi_{\mathrm{old}} + (2\xi - 1)\,\delta\psi_{\mathrm{max}},$$

where $\xi$ is a uniform random number between 0 and 1, drawn independently for each angle.
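The rotation matrix of (A3.86) can be built and sanity-checked in a few lines (the function name and test angles are invented for illustration):

```python
import numpy as np

def euler_matrix(phi, theta, psi):
    """z-x-z Euler rotation matrix, as in (A3.86)."""
    cf, sf = np.cos(phi), np.sin(phi)
    ct, st = np.cos(theta), np.sin(theta)
    cp, sp = np.cos(psi), np.sin(psi)
    return np.array([
        [ cf * cp - sf * ct * sp,  sf * cp + cf * ct * sp, st * sp],
        [-cf * sp - sf * ct * cp, -sf * sp + cf * ct * cp, st * cp],
        [ sf * st,                -cf * st,                ct     ],
    ])

A = euler_matrix(0.3, 1.1, -0.7)
print(np.allclose(A @ A.T, np.eye(3)))    # True: rotation matrices are orthogonal
print(round(float(np.linalg.det(A)), 6))  # 1.0: a proper rotation
```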

A problem with direct application of Euler rotations is that they contain six trigonometric functions (two functions for each of the three angles), which are computationally demanding. A solution for this problem is to use quaternions, four-dimensional vectors whose sum of squared components is equal to 1. The following relations



Fig. A3.12 The nonuniform distribution along $\theta$ in the Euler angle representation. At different $\theta$ angles, the same interval $[\theta, \theta + d\theta]$ corresponds to different sizes of the bands (shown in dark) on the surface of the sphere.

connect quaternions and Euler angles:

$$q_0 = \cos\tfrac{\theta}{2}\cos\tfrac{\phi + \psi}{2}, \quad q_1 = \sin\tfrac{\theta}{2}\cos\tfrac{\phi - \psi}{2}, \quad q_2 = \sin\tfrac{\theta}{2}\sin\tfrac{\phi - \psi}{2}, \quad q_3 = \cos\tfrac{\theta}{2}\sin\tfrac{\phi + \psi}{2}. \qquad (A3.88)$$

The Euler angle rotation matrix can be expressed using these quaternion components:

$$A = \begin{pmatrix} q_0^2 + q_1^2 - q_2^2 - q_3^2 & 2(q_1 q_2 + q_0 q_3) & 2(q_1 q_3 - q_0 q_2) \\ 2(q_1 q_2 - q_0 q_3) & q_0^2 - q_1^2 + q_2^2 - q_3^2 & 2(q_2 q_3 + q_0 q_1) \\ 2(q_1 q_3 + q_0 q_2) & 2(q_2 q_3 - q_0 q_1) & q_0^2 - q_1^2 - q_2^2 + q_3^2 \end{pmatrix}. \qquad (A3.89)$$

A new four-dimensional vector is generated using the following procedure:

1. Generate two pairs of random numbers between $-1$ and $1$, $(r_1, r_2)$ and $(r_3, r_4)$, until

$$S_1 = r_1^2 + r_2^2 < 1, \qquad S_2 = r_3^2 + r_4^2 < 1. \qquad (A3.90)$$



2. Form the random four-dimensional unit vector

$$q = \left(r_1,\; r_2,\; r_3\sqrt{\frac{1 - S_1}{S_2}},\; r_4\sqrt{\frac{1 - S_1}{S_2}}\right). \qquad (A3.91)$$
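The two steps above can be sketched directly (the function name and seed are invented for illustration):

```python
import random

def random_unit_quaternion(seed=7):
    """Rejection procedure of the two steps above, including (A3.90)."""
    rng = random.Random(seed)
    while True:  # step 1a: a pair inside the unit disk
        r1, r2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        s1 = r1 * r1 + r2 * r2
        if s1 < 1.0:
            break
    while True:  # step 1b: a second, independent pair inside the unit disk
        r3, r4 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        s2 = r3 * r3 + r4 * r4
        if s2 < 1.0:
            break
    scale = ((1.0 - s1) / s2) ** 0.5  # step 2: rescale the second pair
    return (r1, r2, r3 * scale, r4 * scale)

q = random_unit_quaternion()
print(abs(sum(c * c for c in q) - 1.0) < 1e-12)  # True: a unit vector in four dimensions
```

The rescaling guarantees unit length by construction: $r_1^2 + r_2^2 + (1 - S_1)(r_3^2 + r_4^2)/S_2 = S_1 + (1 - S_1) = 1$.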

A3.8 Summary



In summary, the behavior of any biomolecule is governed by the basic principles of physics and chemistry, and some of these principles form core components of the foundation of structural bioinformatics. Due to the different sizes, time scales, and energy levels studied in biomolecular systems, as well as feasibility in terms of computing resources, various levels of physics and chemistry principles are applied in structural bioinformatics, including quantum-mechanical calculations, classical mechanical computations, potential of mean force approaches, coarse-grained calculations, and bulk descriptions. At the electron level described by quantum mechanics, the studies are often well grounded in theory. Moving toward the atomic-level description using particles, partial charges, and force fields opens the door for modeling many more properties of biomolecules, such as protein dynamics and folding, although some limitations start to occur. At larger length and time scales, many models (such as knowledge-based scoring functions) and computational techniques (such as Monte Carlo simulations) have been developed. These models and techniques, even incorporating certain approximations, provide powerful tools for structural bioinformatics.

Among the many theories of physics and chemistry, classical mechanics, thermodynamics, and the concepts of chemical bonds and kinetic barriers are widely used in structural bioinformatics. As structural bioinformatics moves forward and computer speed keeps improving, we foresee that more advanced methods from physics and chemistry will be applied. For example, new theories in nonequilibrium thermodynamics and statistical physics may help us better model biomolecular self-assembly, including misfolded proteins, as described in Chapter 9.

Further Reading

Further details about the basic concepts and theories introduced in this appendix can be found in Leach (2001) and Bromberg and Dill (2002). Molecular mechanics, molecular dynamics simulation, and Monte Carlo methods are thoroughly described in Frenkel and Smit (2001). Holtje et al. (2003) is another book with rich material relevant to this appendix. The physical and chemical basis in the context of proteins is well illustrated in Creighton (1992).


References


Anikin, N.A., Anisimov, V.M., Bugaenko, V.L., Bobrikov, V.V., and Andreyev, A.M. 2004. Local SCF method for semiempirical quantum-chemical calculation of ultralarge biomolecules. J. Chem. Phys. 121:1266-1270.

Bromberg, S., and Dill, K.A. 2002. Molecular Driving Forces: Statistical Thermodynamics in Chemistry & Biology. New York, Garland Publishing.

Clote, P., and Backofen, R. 2000. Computational Molecular Biology: An Introduction. New York, John Wiley & Sons.

Creighton, T.E. 1992. Proteins: Structures and Molecular Properties, 2nd edition. New York, Freeman.

Frenkel, D., and Smit, B. 2001. Understanding Molecular Simulation, 2nd edition. San Diego, Academic Press.

Friesner, R.A., and Guallar, V. 2005. Ab initio quantum chemical and mixed quantum mechanics/molecular mechanics (QM/MM) methods for studying enzymatic catalysis. Annu. Rev. Phys. Chem. 56:389-427.

Gherman, B.F., Lippard, S.J., and Friesner, R.A. 2005. Substrate hydroxylation in methane monooxygenase: Quantitative modeling via mixed quantum mechanics/molecular mechanics techniques. J. Am. Chem. Soc. 127:1025-1037.

Hayashi, S., Tajkhorshid, E., and Schulten, K. 2003. Molecular dynamics simulation of bacteriorhodopsin's photoisomerization using ab initio forces for the excited chromophore. Biophys. J. 85:1440-1449.

Holtje, H.D., Sippl, W., Rognan, D., and Folkers, G. 2003. Molecular Modeling: Basic Principles and Applications, 2nd edition. New York, Wiley-VCH.

Landau, L.D., and Lifshitz, E.M. 1984. Statistical Physics, 3rd edition. London, Butterworth-Heinemann.

Leach, A.R. 2001. Molecular Modeling: Principles and Applications, 2nd edition. Englewood Cliffs, NJ, Prentice-Hall.

McQuarrie, D.A. 2000. Statistical Mechanics, 2nd revised edition. New York, University Science Books.

Shannon, C.E. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27:379-423, 623-656.

Shannon, C.E. 1949. Communication in the presence of noise. Proc. IRE 37:10-21.

Wiener, N. 1965. Cybernetics, Second Edition: or the Control and Communication in the Animal and the Machine. Cambridge, MA, MIT Press.


Appendix 4 Mathematics and Statistics for Studying Protein Structures

Yang Dai and Jie Liang

A4.1 Introduction

This chapter presents basics of mathematics and statistics for studying protein structures. The topics include probability distributions (uniform, Gaussian, binomial and multinomial, geometric, Poisson, exponential, Dirichlet and gamma, extreme value distribution), information theory (entropy, relative entropy, mutual information), Markovian processes and hidden Markov models, hypothesis testing, statistical inference (Bayesian approach, maximum likelihood, expectation maximization), and sampling (rejection sampling, Gibbs sampling, and the Metropolis-Hastings algorithm).

A4.2 Probability Distributions

Let P(·) be a probability function of a random variable X. The cumulative distribution function (cdf) of X, denoted by F_X(x), is defined by

F_X(x) = P(X ≤ x), for all x. (A4.1)

Another function, called the probability density function (pdf), is associated with a random variable X and its cdf F_X(x). The pdf f_X(x) of a discrete random variable X is given by

f_X(x) = P(X = x), for all x. (A4.2)

The pdf of a continuous random variable X is the function that satisfies

F_X(x) = ∫_{-∞}^{x} f_X(t) dt, for all x. (A4.3)

The mean of a random variable X, denoted by E(X), is E(X) = Σ_x x P(X = x) if X is discrete, or E(X) = ∫_{-∞}^{∞} x f_X(x) dx if X is continuous. The variance of a random variable X is defined as Var(X) = E(X - E(X))².


Consider a set of n discrete random variables X_1, X_2, ..., X_n. For the random vector X = (X_1, X_2, ..., X_n), a joint probability distribution function f_X(x) is defined as

f_X(x) = P(X_1 = x_1, ..., X_n = x_n), (A4.4)

where x = (x_1, ..., x_n). If the n random variables are independent, then

f_X(x) = ∏_{i=1}^{n} P(X_i = x_i). (A4.5)

In the case of a set of n continuous random variables X_1, X_2, ..., X_n, the joint density function f_X(x) is defined by

P((X_1, ..., X_n) takes values in Ω) = ∫ · · · ∫_Ω f_X(x_1, ..., x_n) dx_1 dx_2 · · · dx_n, (A4.6)

where Ω is the set of all values that the random variables (X_1, ..., X_n) can take. If the random variables are independent, then

f_X(x_1, ..., x_n) = ∏_{i=1}^{n} f_{X_i}(x_i), for all (x_1, ..., x_n). (A4.7)

Several common families of distributions are introduced in the remainder of this section.

Uniform Distribution

A random variable X has a discrete uniform (1, N) distribution if

P(X = x | N) = 1/N, for x = 1, 2, ..., N, (A4.8)

where N is a given integer. A random variable X has a continuous uniform distribution if its pdf is defined as

f_X(x) = 1/(b - a), for x ∈ I, (A4.9)

where a and b are constants such that a < b, and the interval I = [a, b].


Gaussian Distribution

A random variable X has a Gaussian distribution if the pdf is

f_X(x) = [1/(√(2π) σ)] e^{-(x-μ)²/(2σ²)}, for any x ∈ (-∞, ∞), (A4.10)

where μ and σ > 0 are parameters. The mean and variance of this distribution are μ and σ², respectively. When μ = 0 and σ = 1, it is called the standard normal distribution.

The importance of the standard normal distribution is stated in the Central Limit Theorem:

Assume that X_1, X_2, ..., X_n are independent and identically distributed random variables, each having a probability distribution with mean μ and variance σ². Then the standardized random variable

(Σ_{i=1}^{n} X_i - nμ) / (√n σ) (A4.11)

converges in distribution to a random variable having the standard normal distribution as n → ∞.
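The theorem is easy to check empirically. The sketch below (standard-library Python; the Uniform(0,1) summands, n = 30, the trial count, and the seed are illustrative choices, not from the text) standardizes sums of iid draws as in (A4.11) and confirms that their mean and variance approach 0 and 1.

```python
import math
import random

def standardized_sums(n, trials, seed=0):
    """Standardize sums of n iid Uniform(0,1) draws (mu = 1/2, sigma^2 = 1/12)."""
    rng = random.Random(seed)
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    out = []
    for _ in range(trials):
        s = sum(rng.random() for _ in range(n))
        out.append((s - n * mu) / (math.sqrt(n) * sigma))  # Eq. (A4.11)
    return out

zs = standardized_sums(n=30, trials=20000)
mean = sum(zs) / len(zs)
var = sum(z * z for z in zs) / len(zs) - mean ** 2
```

Even for n as small as 30, the standardized sums are already close to standard normal.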

Binomial and Multinomial Distributions

A Bernoulli trial is an experiment with result "success" or "failure." The probabilities of success and failure are denoted by p and 1 - p, respectively. A random variable X has a Bernoulli distribution if X = 1 with probability p and X = 0 with probability 1 - p.

Let the random variable X denote the number of successes in a fixed number m of independent Bernoulli trials with the same probability p of success for each trial. Then X follows a binomial distribution, and its pdf is

f_X(x) = (m choose x) p^x (1 - p)^{m-x}, x = 0, 1, 2, ..., m. (A4.12)

The multinomial distribution is the generalization of the binomial distribution. Let X_1, X_2, ..., X_n be random variables and 0 ≤ p_i ≤ 1 constants with p_1 + ... + p_n = 1. The random vector (X_1, ..., X_n) has a multinomial distribution with m trials if the joint probability is

P(X_1 = x_1, ..., X_n = x_n) = [m! / (x_1! · · · x_n!)] ∏_{i=1}^{n} p_i^{x_i}, (A4.13)

where each x_i is a nonnegative integer and x_1 + ... + x_n = m.


The multinomial distribution is a model for the type of experiment that consists of m independent trials. Each trial generates one of the n distinct possible outcomes, and p_i is the probability of the ith outcome occurring on every trial. Each random variable X_i is the count of the number of times the ith outcome occurred in the m trials. The binomial distribution is useful in modeling the number of matches of two residues in corresponding positions in two sequences. The multinomial distribution can be used in the detection of sequence motifs from a set of protein sequences.
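Equations (A4.12) and (A4.13) translate directly into code. A minimal sketch (standard-library Python; the test parameters are arbitrary):

```python
from math import comb, factorial, prod

def binomial_pmf(x, m, p):
    """Eq. (A4.12): probability of x successes in m Bernoulli trials."""
    return comb(m, x) * p ** x * (1 - p) ** (m - x)

def multinomial_pmf(xs, ps):
    """Eq. (A4.13): joint probability of counts xs over m = sum(xs) trials."""
    coef = factorial(sum(xs))
    for x in xs:
        coef //= factorial(x)
    return coef * prod(p ** x for p, x in zip(ps, xs))
```

With two categories the multinomial probability reduces to the binomial one, which provides a quick consistency check.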

Geometric Distribution

The geometric distribution is a discrete distribution modeling the outcomes of a sequence of independent Bernoulli trials, where each trial has probability p of success. Let the random variable X denote the number of successful trials before the first failure. The pdf of X can be defined by

p_X(x) = (1 - p) p^x, x = 0, 1, 2, .... (A4.14)

The mean and variance of the geometric distribution are p/(1 - p) and p/(1 - p)², respectively. This distribution can model the number of repeated residues in a DNA sequence.
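The stated mean and variance can be verified numerically from (A4.14) by truncating the infinite sums (a sketch with an arbitrary p = 0.6; the truncation point is a practical assumption, since the tail is geometrically small):

```python
def geometric_pmf(x, p):
    """Eq. (A4.14): probability of x successes before the first failure."""
    return (1 - p) * p ** x

p = 0.6
# Truncated sums; the tail beyond 200 terms is negligible for p = 0.6.
mean = sum(x * geometric_pmf(x, p) for x in range(200))
var = sum(x * x * geometric_pmf(x, p) for x in range(200)) - mean ** 2
```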

Gamma Distribution

The full Gamma family has two parameters, α and β. The pdf is

f_X(x | α, β) = [β^α / Γ(α)] x^{α-1} e^{-βx}, for 0 < x < ∞, (A4.15)

where Γ(·) is the Gamma function, defined as

Γ(α) = ∫_0^∞ t^{α-1} e^{-t} dt. (A4.16)

The mean and variance of the Gamma distribution are α/β and α/β², respectively. The Gamma function possesses good properties. The value of Γ(k) is finite if k is a positive constant, and the integral can be expressed in closed form for a positive integer k; namely, Γ(k + 1) = kΓ(k), and Γ(k + 1) = k!, since Γ(1) = 1.

The Gamma and exponential distributions are related in an interesting way. When α = 1, f_X(x | β) = βe^{-βx}, for 0 < x, β < ∞, which is an exponential distribution with parameter β.
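The closed-form properties of Γ(·) quoted above can be checked directly with the standard library's implementation of the Gamma function:

```python
import math

# math.gamma evaluates the integral in Eq. (A4.16); check the recursion
# Gamma(k + 1) = k * Gamma(k) and the factorial identity Gamma(k + 1) = k!.
for k in range(1, 8):
    assert math.isclose(math.gamma(k + 1), k * math.gamma(k))
    assert math.isclose(math.gamma(k + 1), math.factorial(k))
```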


Dirichlet Distribution

The Dirichlet distribution is the conjugate prior¹ of the parameters of the multinomial distribution. The pdf of the Dirichlet distribution for variables p = (p_1, ..., p_n) with parameters u = (u_1, ..., u_n) is defined by

f(p | u) = [1/Z(u)] ∏_{i=1}^{n} p_i^{u_i - 1}, (A4.17)

where p_1, ..., p_n ≥ 0, Σ_{i=1}^{n} p_i = 1, and u_1, ..., u_n > 0. The parameter u_i can be interpreted as a prior observation count for events governed by p_i. The normalizing constant Z(u) for the Dirichlet distribution is Z(u) = ∏_i Γ(u_i) / Γ(Σ_{i=1}^{n} u_i). Let u_0 = Σ_{i=1}^{n} u_i. The mean and variance of the Dirichlet distribution are E(p_i) = u_i/u_0 and Var(p_i) = u_i(u_0 - u_i)/[u_0²(u_0 + 1)], respectively.
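A standard way to draw from a Dirichlet (not from the text, but a well-known construction) is to normalize independent Gamma variates. The sketch below uses hypothetical parameters u = (2, 3, 5) and confirms the mean formula E(p_i) = u_i/u_0 by simulation:

```python
import random

def dirichlet_sample(u, rng):
    """Draw p ~ Dirichlet(u) by normalizing independent Gamma(u_i, 1) variates."""
    g = [rng.gammavariate(ui, 1.0) for ui in u]
    total = sum(g)
    return [x / total for x in g]

rng = random.Random(1)
u = [2.0, 3.0, 5.0]
u0 = sum(u)
draws = [dirichlet_sample(u, rng) for _ in range(20000)]
emp_mean = [sum(d[i] for d in draws) / len(draws) for i in range(len(u))]
```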

Extreme Value Distribution

Consider taking N samples independently from a density function g(x). Let the random variable X denote the largest value of the N samples. The probability P(X ≤ x) can be calculated as

P(X ≤ x) = [G(x)]^N, (A4.18)

where G(x) = ∫_{-∞}^{x} g(u) du. The pdf of X can be derived by differentiating the above cdf with respect to x, giving N g(x) G(x)^{N-1}. The large-N limit of N g(x) G(x)^{N-1}, after a suitable recentering of x, is called the extreme value density function for g(x).

In particular, if the density function g(x) is an exponential distribution, i.e., g(x) = λe^{-λx}, then (with x recentered by (ln N)/λ) the extreme value density function becomes

lim_{N→∞} N λ e^{-λx} (1 - e^{-λx})^{N-1} = λ e^{-λx} e^{-e^{-λx}}. (A4.19)

The extreme value distribution can be used for the approximation of p-values in statistical hypothesis testing.
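The convergence of the maximum of exponentials to this (Gumbel-type) limit can be illustrated by simulation. A sketch under stated assumptions: N = 200 samples per maximum, λ = 1, a single test point for the cdf, and the recentering constant (ln N)/λ discussed above.

```python
import math
import random

def max_of_exponentials(n, lam, rng):
    """Largest of n iid Exponential(lam) samples."""
    return max(rng.expovariate(lam) for _ in range(n))

rng = random.Random(7)
n, lam, trials = 200, 1.0, 5000
shift = math.log(n) / lam          # the recentering constant (ln N)/lambda
x = shift + 0.5                    # a single test point for the cdf
emp = sum(max_of_exponentials(n, lam, rng) <= x for _ in range(trials)) / trials
gumbel_cdf = math.exp(-math.exp(-lam * (x - shift)))  # limit implied by (A4.19)
```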

1 In the Bayesian approach, modeling priors has traditionally been a compromise between a realistic assessment of beliefs and choosing a mathematical function that simplifies the analytic calculations. A well-known strategy is to choose a prior with a suitable form so that the posterior belongs to the same functional family as the prior. The choice of the family depends on the likelihood. A prior and posterior chosen in this way are said to be conjugate. For instance, given a multinomial likelihood and choosing a Dirichlet prior, the posterior is still Dirichlet.


A4.3 Information Theory

Entropy

Suppose that X is a discrete random variable with probability distribution P (the pdf defined in (A4.2) is considered here). The entropy of this probability distribution, denoted by H(P), is defined by

H(P) = -Σ_x P_X(x) log P_X(x), (A4.20)

with the summation taken over all values of x in the range of X. Note P_X(x) = P(X = x). By convention, P_X(x_i) log P_X(x_i) = 0 if P_X(x_i) = 0. The entropy is a measure of the unpredictability of an observed value of a random variable having the distribution. The entropy is maximized when all the P_X(x_i) are equal, say P_X(x_i) = 1/N, which means that the outcome of a random sample is maximally uncertain; then H(P) = -Σ_{i=1}^{N} (1/N) log (1/N) = log N. When the outcome of a sample from the distribution is certain, i.e., P_X(x_k) = 1 for some k and P_X(x_i) = 0 for all i ≠ k, then H(P) = 0.
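The two extreme cases above can be verified with a one-line implementation of (A4.20) (natural logarithm; the example distributions are arbitrary):

```python
import math

def entropy(p):
    """Eq. (A4.20) with natural logarithm; terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)
```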

Relative Entropy

Let X_0 and X_1 be two discrete random variables with probability distributions P_{X_0}(x) and P_{X_1}(x), respectively. These two distributions are assumed to have the same range. Two relative entropies can be associated with these distributions, denoted by H(P_0 || P_1) and H(P_1 || P_0) and defined respectively by

H(P_0 || P_1) = Σ_x P_{X_0}(x) log [P_{X_0}(x) / P_{X_1}(x)] (A4.21)

and

H(P_1 || P_0) = Σ_x P_{X_1}(x) log [P_{X_1}(x) / P_{X_0}(x)]. (A4.22)

The summations in both cases are taken over all values in the common range of the two probability distributions. Note that the relative entropies are not symmetric. The relative entropy measures the dissimilarity between two probability distributions. It is always nonnegative and equals zero only when P_{X_0}(x) = P_{X_1}(x). It is also called the Kullback-Leibler distance between the two distributions.
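The nonnegativity and asymmetry are easy to see numerically (a sketch; the two example distributions are arbitrary, and the implementation assumes P_1 is positive wherever P_0 is):

```python
import math

def relative_entropy(p0, p1):
    """H(P0 || P1) of Eq. (A4.21); assumes p1 > 0 wherever p0 > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p0, p1) if a > 0)
```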

Mutual Information

Let X and Y be two random variables with probability distributions P_X and P_Y, respectively. The mutual information, denoted by I(X; Y), is the relative entropy between the joint distribution and the product distribution:

I(X; Y) = H(P_{X,Y} || P_X P_Y) = Σ_x Σ_y P_{X,Y}(x, y) log [P_{X,Y}(x, y) / (P_X(x) P_Y(y))]. (A4.23)

The mutual information represents how much information one random variable carries about another. When X and Y are independent, i.e., P_{X,Y}(x, y) = P_X(x) P_Y(y), then I(X; Y) = 0. Namely, if the two random variables are independent, then Y provides no information about X.
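Given a finite joint probability table, (A4.23) can be computed by first marginalizing, then summing over the nonzero cells (a sketch; the two 2×2 example tables are arbitrary):

```python
import math

def mutual_information(joint):
    """Eq. (A4.23): I(X;Y) from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi
```

An independent table gives I(X; Y) = 0, while a perfectly dependent one gives log 2 for two equiprobable symbols.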

A4.4 Markovian Process and Hidden Markov Model

Markovian Process

A stochastic process has the Markov property if the conditional probability distribution of future states of the process depends only on the current state. A process with the Markov property is usually called a Markovian process.

A Markov chain is a discrete-time stochastic process with the Markov property. The Markov chain describes the behavior of a random variable changing over time. It consists of a collection of states, denoted by X(t), t = 1, 2, .... The state vector X(t) in a Markov model traditionally lists either the probability that the system is in a particular state at a particular time, or the percentage of the system that is in each state at a given time. Thus, X(t) is a probability distribution vector and must sum to one.

Three properties define a state model as a Markov model: (1) the Markov property, that is, the probability of moving from state i to state j is independent of how one arrives at state i; (2) conservation, that is, the sum of the probabilities out of a state must be one; and (3) the vector X(t) is a probability distribution vector which describes the probability of the system being in each of the states at time t.

The transition matrix, denoted by T, for a Markov chain is a matrix of probabilities of moving from one state to another:

        | p_11  p_21  ...  p_n1 |
    T = | p_12  p_22  ...  p_n2 |        (A4.24)
        | ...               ... |
        | p_1n  p_2n  ...  p_nn |

where p_ij denotes the probability of the system moving from state i to state j. Then Σ_{j=1}^{n} p_ij = 1 for each i. Given an initial distribution X(0), i.e., a distribution for the chance that the system is initially in each of the states, the vector X(t) can be obtained as

X(t) = T X(t - 1) = T^t X(0). (A4.25)


If X(t) → X as t → ∞, then T^t → L. The matrix L is called a steady-state matrix, if it exists. Assume that T satisfies the property that some power of the matrix has all positive entries; then the steady-state distribution and the steady-state matrix can be shown to exist. In this case, X can be found by solving the system of equations X = TX.
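Iterating X(t) = T X(t-1) is one way to reach the steady state numerically. A minimal sketch, using a hypothetical two-state chain (the transition probabilities are illustrative, and T is stored in the column-stochastic layout of (A4.24)):

```python
def evolve(T, x):
    """One step of X(t) = T X(t-1); T[j][i] holds p_ij, so columns sum to one."""
    n = len(x)
    return [sum(T[j][i] * x[i] for i in range(n)) for j in range(n)]

# Hypothetical two-state chain with p_12 = 0.3 and p_21 = 0.4.
T = [[0.7, 0.4],
     [0.3, 0.6]]
x = [1.0, 0.0]
for _ in range(200):
    x = evolve(T, x)
# x now approximates the steady state solving X = TX, namely (4/7, 3/7).
```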

Hidden Markov Model

A hidden Markov model (HMM) is a discrete-time Markov model with some extra features. When a state is visited by a Markov chain, the state "emits" a letter from a fixed time-independent alphabet. An HMM is characterized by a sequence of visited states Q = q_1, q_2, ... and a sequence of emitted symbols O = o_1, o_2, .... However, the sequence O is often observed while the sequence Q is not. Therefore, the sequence Q is hidden.

An HMM consists of the following five components:

(1) A set of N states S_1, ..., S_N.

(2) An alphabet of M distinct symbols A = {a_1, ..., a_M}.

(3) The transition probability matrix P = (p_ij) of the states, where p_ij = P(q_{t+1} = S_j | q_t = S_i), i, j = 1, ..., N.

(4) The emission probabilities: for each state S_i and each a in A, b_i(a) = P(S_i emits symbol a). The probabilities b_i(a) form the elements of an N × M matrix B.

(5) The initial distribution vector π = (π_j), where π_j = P(q_1 = S_j).

Given some observed output sequence O = o_1, ..., o_T, the solutions to the following three problems are required in the HMM.

• Evaluation: given the parameters λ = (P, B, π), efficiently calculate the probability of some given sequence of observed output: P(O | λ).

• Decoding: efficiently calculate the hidden sequence Q = q_1, ..., q_T of states that most likely occurred: argmax_Q P(Q | O).

• Learning: assuming a fixed topology of the model, find the parameters λ = (P, B, π) that maximize P(O | λ).

In the remainder of this section, three algorithms are described for these problems.

Evaluation

The probability of the observations O arising from a specific state sequence Q is

P(O | Q, λ) = ∏_{t=1}^{T} P(o_t | q_t, λ) = b_{q_1}(o_1) × b_{q_2}(o_2) × · · · × b_{q_T}(o_T), (A4.26)


and the probability of the state sequence is (writing a_ij = p_ij for the transition probabilities)

P(Q | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} · · · a_{q_{T-1} q_T}. (A4.27)

Therefore, the probability of the observations given the model can be calculated as

P(O | λ) = Σ_Q P(O | Q, λ) P(Q | λ) = Σ_{q_1, ..., q_T} π_{q_1} b_{q_1}(o_1) a_{q_1 q_2} b_{q_2}(o_2) · · · a_{q_{T-1} q_T} b_{q_T}(o_T). (A4.28)

However, this direct evaluation requires a number of steps exponential in T. An efficient approach can be designed as follows. Let α_t(i) be the probability of the partial observation sequence o_1, o_2, ..., o_t and state S_i at time t, i.e., define the forward variable α_t(i) = P(o_1, o_2, ..., o_t, q_t = S_i | λ). The forward algorithm is described as follows.

(1) Initialization:

α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N. (A4.29)

(2) Induction:

α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}), 1 ≤ t ≤ T - 1, 1 ≤ j ≤ N. (A4.30)

(3) Termination:

P(O | λ) = Σ_{i=1}^{N} α_T(i). (A4.31)
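The three steps above can be sketched in a few lines, and checked against the exponential-time direct evaluation of (A4.28) on a tiny model (the two-state, two-symbol HMM parameters are hypothetical):

```python
import itertools

def forward(obs, pi, P, B):
    """Forward algorithm (A4.29)-(A4.31): returns P(O | lambda)."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]              # initialization
    for o in obs[1:]:                                             # induction
        alpha = [sum(alpha[i] * P[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                             # termination

def brute_force(obs, pi, P, B):
    """Direct evaluation of Eq. (A4.28), exponential in T."""
    N, T = len(pi), len(obs)
    total = 0.0
    for q in itertools.product(range(N), repeat=T):
        p = pi[q[0]] * B[q[0]][obs[0]]
        for t in range(1, T):
            p *= P[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total

# Hypothetical two-state, two-symbol HMM.
pi = [0.6, 0.4]
P = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 1, 0]
```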

Decoding

The aim of decoding is the discovery of the hidden state sequence that was most likely to have produced a given observation sequence. One solution to this problem is the use of the Viterbi algorithm to find the single best state sequence for an observation sequence. The Viterbi algorithm is very similar to the forward algorithm, except that the transition probabilities are maximized at each step instead of summed. Define

δ_t(i) = max_{q_1, q_2, ..., q_{t-1}} P(q_1, q_2, ..., q_{t-1}, q_t = S_i, o_1, o_2, ..., o_t | λ) (A4.32)

as the probability of the most probable state path for the partial observation sequence. The Viterbi algorithm is as follows.


(1) Initialization:

δ_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N, ψ_1(i) = 0. (A4.33)

(2) Recursion:

δ_t(j) = max_{1 ≤ i ≤ N} [δ_{t-1}(i) a_ij] b_j(o_t), 2 ≤ t ≤ T, 1 ≤ j ≤ N, (A4.34)

ψ_t(j) = argmax_{1 ≤ i ≤ N} [δ_{t-1}(i) a_ij], 2 ≤ t ≤ T, 1 ≤ j ≤ N. (A4.35)

(3) Termination:

P* = max_{1 ≤ i ≤ N} [δ_T(i)], q*_T = argmax_{1 ≤ i ≤ N} [δ_T(i)]. (A4.36)

(4) Optimal state sequence backtracking:

q*_t = ψ_{t+1}(q*_{t+1}), t = T - 1, T - 2, ..., 1. (A4.37)
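The recursion and backtracking steps above can be sketched as follows (the two-state, two-symbol HMM parameters are hypothetical; for observations strongly favoring one state, the decoded path stays in that state):

```python
def viterbi(obs, pi, P, B):
    """Viterbi algorithm (A4.33)-(A4.37): most probable state path for obs."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]     # initialization
    psi = []
    for o in obs[1:]:                                     # recursion
        new_delta, back = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: delta[i] * P[i][j])
            back.append(best)
            new_delta.append(delta[best] * P[best][j] * B[j][o])
        delta, psi = new_delta, psi + [back]
    q = max(range(N), key=lambda i: delta[i])             # termination
    path = [q]
    for back in reversed(psi):                            # backtracking
        q = back[q]
        path.append(q)
    return path[::-1]

# Same hypothetical HMM as in the forward-algorithm sketch.
pi = [0.6, 0.4]
P = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
```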

Learning

Given a set of examples from a process, the estimation of the model parameters λ = (P, B, π) that best describe that process is necessary. Two standard approaches to this task exist, called supervised and unsupervised training depending on the form of the examples. If the training examples contain both the inputs and outputs of a process, supervised training is performed by equating inputs to observations and outputs to states. If only the inputs are provided in the training data, then unsupervised training must be used to guess a model that may have produced those observations. In this case, there is no optimal way of estimating the model parameters. One can, however, choose λ = (P, B, π) such that P(O | λ) is locally maximized using an iterative procedure such as the Baum-Welch method.

A4.5 Hypothesis Testing

Null Test

Classical statistical hypothesis testing involves the test of a null hypothesis against an alternative hypothesis. It consists of five steps.

The first step is to declare the null hypothesis H_0 and the alternative hypothesis H_1. The choice of the null and alternative hypotheses should be made before the data are seen. Since the decision as to whether H_0 or H_1 is accepted will be made on the basis of data derived from some random process, it is possible that an incorrect decision will be made: to reject H_0 when it is true (a Type I error), or to accept H_0 when it is false (a Type II error). Frequently, a numerical value α of the Type I error, called the significance level, is fixed at some acceptable level (usually 1 or 5%). Step 2 is to choose such a level. Step 3 determines a test statistic, a quantity calculated from the data whose numerical value leads to acceptance or rejection of the null hypothesis. Step 4 computes the observed value of the test statistic. The final step is to determine whether the observed value of the test statistic is equal to or more extreme than the significance point calculated, and to reject the null hypothesis if it is.

z-Test and T-test

Suppose that a random variable X has the normal distribution (A4.10). Then the "standardized" random variable Z, defined by Z = (X - μ)/σ, has a standard normal distribution with μ = 0 and σ = 1. The observed value z of Z is often called a z-score.

The T-test concerns the unknown mean μ of a normal distribution with unknown population variance σ². A Student t distribution is used to derive the confidence interval for the mean estimate.² The test is of the null hypothesis μ = μ_0 against the one-sided alternative hypothesis μ > μ_0 or the two-sided alternative hypothesis μ ≠ μ_0. Given the Type I error α, the critical value t_(n-1,1-α) or t_(n-1,1-α/2) can be found from the table of the Student t distribution for the one-sided test or the two-sided test, respectively.

If this test is carried out using the observed values of the random variables X_1, X_2, ..., X_n having the normal distribution in question, the sample mean X̄ = (1/n)(X_1 + ... + X_n) can be used to form the test statistic. If the observed value x̄ satisfies

x̄ > μ_0 + t_(n-1,1-α) s/√n, (A4.38)

then the null hypothesis is rejected at Type I error α in the one-sided test. Here s is the sample standard deviation. If the alternative hypothesis had been two-sided, then the null hypothesis would be rejected if

x̄ < μ_0 - t_(n-1,1-α/2) s/√n or x̄ > μ_0 + t_(n-1,1-α/2) s/√n. (A4.39)

The Z-test can be used in a very similar way to the T-test, except that the population standard deviation σ is known rather than estimated by the sample standard deviation s.

2 This distribution is a family of distributions, with each family member indexed by a parameter known as the degrees of freedom (number of samples minus 1). A t distribution with infinite degrees of freedom is a standard normal distribution.


A4.6 Statistical Inference

Bayesian Approach

In the Bayesian paradigm, the probability model that we build can be quite approximate. It reflects one's beliefs and prior experience, and it is described as a personal or subjective view. When the uncertainty about the model can be boiled down to a parameter θ, the Bayesian statistician treats θ as if it were a random variable whose distribution describes that uncertainty. A personal probability is subject to modification upon acquisition of further information supplied by experimental data.

Suppose that a distribution with density p(θ) describes one's present uncertainties about some probability model. Let Y denote the observed data. In order to make probability statements about θ given Y, a model providing a joint distribution for θ and Y must be established. The joint pdf can be written as a product of two densities that are often referred to as the prior distribution p(θ) and the sampling distribution p(Y | θ), respectively:

p(θ, Y) = p(θ) p(Y | θ). (A4.40)

Using Bayes' rule, the conditional probability of θ given Y, called the posterior density, is given as

p(θ | Y) = p(θ, Y)/p(Y) = p(θ) p(Y | θ)/p(Y), (A4.41)

where p(Y) = ∫ p(Y | θ) p(θ) dθ is the marginal distribution of Y, or the prior predictive distribution. Since p(Y) does not depend on θ,

p(θ | Y) ∝ p(θ) p(Y | θ). (A4.42)

The data Y affect the posterior inference only through the function p(Y | θ), which is called the likelihood function. For fixed Y, this function depends only on θ. The primary task of any specific application is the development of a model p(θ, Y) and the execution of the necessary computations for an appropriate summarization of p(θ | Y).

Maximum Likelihood

A maximum likelihood estimator (MLE) of θ is a value of θ that maximizes the likelihood p(Y | θ) or, equivalently, the logarithm of the likelihood. The MLE θ̂ is the value of θ that makes the observed data most likely. The parameter estimation problem in this case is implicit, and its solution is obtained by solving a maximization problem. This differs from the Bayesian approach, where the estimator is explicitly given by the posterior density, which often leads to an integration problem rather than a maximization problem.


Suppose that the likelihood function is differentiable, unimodal, and bounded above. The MLE θ̂ is then computed through the following two steps.

(1) Differentiate the logarithm of the likelihood with respect to θ to obtain ∂ log p(Y | θ)/∂θ.

(2) Solve the system of equations ∂ log p(Y | θ)/∂θ = 0 for θ.

When log p(Y | θ) is quadratic, the above equation is linear in θ and the solution is relatively easy to obtain. In more general situations, when a closed-form solution cannot be found, one may have to use an iterative numerical method, such as the Newton-Raphson algorithm, to locate a value of θ̂. In this case, there is no guarantee that the globally optimal solution is found.
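As an illustration of both routes, consider estimating the rate λ of an exponential distribution from data (an example chosen here, not from the text): the score equation n/λ - Σx = 0 has the closed-form solution λ̂ = 1/x̄, and Newton-Raphson recovers the same value. The true rate 2.5, sample size, and seed are arbitrary.

```python
import random

def mle_newton(data, lam0=1.0, iters=50):
    """Newton-Raphson on the exponential score equation n/lambda - sum(x) = 0."""
    n, s = len(data), sum(data)
    lam = lam0
    for _ in range(iters):
        grad = n / lam - s              # derivative of the log-likelihood
        hess = -n / lam ** 2            # second derivative
        lam -= grad / hess              # Newton update
    return lam

rng = random.Random(2)
true_rate = 2.5
data = [rng.expovariate(true_rate) for _ in range(5000)]
lam_hat = mle_newton(data)
closed_form = len(data) / sum(data)     # analytic MLE: 1 / sample mean
```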

Expectation Maximization

The expectation maximization (EM) algorithm is an iterative method for the determination of posterior modes. Each iteration consists of two steps: the E-step (expectation step) and the M-step (maximization step).

Suppose that θ^i is the current guess of the mode of the observed posterior p(θ | Y). Let p(θ | Y, Z) denote the augmented posterior, i.e., the posterior of the augmented data. Let p(Z | θ^i, Y) denote the conditional predictive distribution of the latent data Z, conditional on the current guess of the posterior mode.

Generally, the E-step computes the expectation of log p(θ | Z, Y) with respect to p(Z | θ^i, Y):

Q(θ, θ^i) = ∫ log p(θ | Z, Y) p(Z | θ^i, Y) dZ. (A4.43)

The M-step maximizes the Q function with respect to θ to obtain θ^{i+1}. The process is repeated until ||θ^{i+1} - θ^i|| or |Q(θ^{i+1}, θ^i) - Q(θ^i, θ^i)| is sufficiently small.

The EM algorithm increases the posterior p(θ | Y) at each iteration, i.e., p(θ^{i+1} | Y) ≥ p(θ^i | Y), where equality holds if and only if Q(θ^{i+1}, θ^i) = Q(θ^i, θ^i).

A4.7 Sampling

Importance and Rejection Samplings

Consider the problem of estimating the integral J(y) = ∫ f(y | x) g(x) dx. In many cases, especially when g(x) is in a high-dimensional space, directly generating samples from g(x) is not possible. There are two popular methods based on generating independent samples: importance sampling and rejection sampling.

Importance sampling draws samples from a proposal density function l(x) that is easy to sample from and approximates g(x). The method of importance sampling approximates J(y) as follows.

(1) Draw x_1, ..., x_m from l(x).

(2) Estimate J(y) by the weighted average Ĵ(y) = Σ_{i=1}^{m} w_i f(y | x_i) / Σ_{i=1}^{m} w_i, where w_i = g(x_i)/l(x_i), i = 1, ..., m, are the importance weights.

Rejection sampling can be considered a special case of importance sampling. It requires a density function l(x) and a constant M > 0 such that the "envelope property" g(x) ≤ M l(x) holds on the support of g(x) for all x. The algorithm consists of the following three steps:

(1) Generate x* from l(x).

(2) Generate u from the uniform [0, 1] distribution, independently of x*.

(3) If u < g(x*)/(M l(x*)), then accept x*; otherwise, repeat steps (1)-(3).

This allows one to sample indirectly from g(x) if the functional form of g(x) is known and samples from an approximating distribution l(x) are available. If the approximation l(x) is poor in some regions, then the algorithm may have a low acceptance rate.
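The three rejection-sampling steps can be sketched directly. The target g(x) = 2x on [0, 1] (a Beta(2,1) density), the Uniform(0,1) proposal, and the envelope constant M = 2 are illustrative assumptions; the accepted draws should have mean E[X] = 2/3.

```python
import random

def rejection_sample(g, proposal_draw, proposal_pdf, M, rng, n):
    """Steps (1)-(3) above: accept x* when u < g(x*) / (M * l(x*))."""
    out = []
    while len(out) < n:
        x = proposal_draw(rng)                       # step (1)
        if rng.random() < g(x) / (M * proposal_pdf(x)):  # steps (2)-(3)
            out.append(x)
    return out

# Hypothetical target g(x) = 2x on [0, 1], proposal l(x) = Uniform(0, 1),
# envelope constant M = 2 so that g(x) <= M * l(x) everywhere.
rng = random.Random(3)
xs = rejection_sample(lambda x: 2.0 * x, lambda r: r.random(),
                      lambda x: 1.0, 2.0, rng, 20000)
mean = sum(xs) / len(xs)
```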

Metropolis-Hastings Algorithm

The key to Markov chain simulation is to create a Markov process whose stationary distribution is the specified p(θ | y), so that the distribution of the current draws is close enough to the stationary distribution after sufficiently many steps of the simulation.

Given a target distribution p(θ | y) that can be computed up to a normalizing constant, the Metropolis algorithm creates a sequence of random points (θ¹, θ², ...) whose distribution converges to the target distribution. Each sequence can be considered a random walk whose stationary distribution is p(θ | y). The algorithm is described as follows.

(1) Draw a starting point θ⁰, for which p(θ⁰ | y) > 0, from a starting distribution p_0(θ).

(2) For t = 1, 2, ...:

(a) Sample a candidate point θ* from a jumping distribution at time t, J_t(θ* | θ^{t-1}). The jumping distribution must be symmetric; that is, J_t(θ_a | θ_b) = J_t(θ_b | θ_a) for all θ_a, θ_b, and t.

(b) Calculate the ratio of the densities

r = p(θ* | y) / p(θ^{t-1} | y). (A4.44)

(c) Set

θ^t = θ* with probability min{r, 1}, and θ^t = θ^{t-1} otherwise. (A4.45)

Page 83: Appendix 1 Biological and Chemical Basics Related to

A4. Mathematics and Statistics for Studying Protein Structures 313

In the Metropolis algorithm, a candidate θ* with larger posterior probability than the current θ^{t-1} is always accepted, while one with lower probability is accepted randomly with probability r.
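A minimal sketch of the Metropolis algorithm, working with log densities for numerical stability (the standard-normal target, the Gaussian jumping distribution with unit step, the chain length, and the burn-in fraction are illustrative assumptions):

```python
import math
import random

def metropolis(logp, x0, step, n, rng):
    """Metropolis algorithm with a symmetric Gaussian jumping distribution."""
    x, lp = x0, logp(x0)
    chain = []
    for _ in range(n):
        cand = x + rng.gauss(0.0, step)              # symmetric proposal
        lp_cand = logp(cand)
        # accept with probability min(r, 1), where r = p(cand | y)/p(x | y)
        if rng.random() < math.exp(min(0.0, lp_cand - lp)):
            x, lp = cand, lp_cand
        chain.append(x)
    return chain

# Hypothetical target: standard normal, known only up to its constant.
rng = random.Random(11)
chain = metropolis(lambda x: -0.5 * x * x, 0.0, 1.0, 50000, rng)
burn = chain[5000:]                                  # discard burn-in draws
mean = sum(burn) / len(burn)
var = sum(x * x for x in burn) / len(burn) - mean ** 2
```

The draws after burn-in should have mean ≈ 0 and variance ≈ 1, the moments of the target.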

The Metropolis-Hastings algorithm generalizes the basic Metropolis algorithm in two ways. First, J_t is not required to be symmetric. Second, to correct for the asymmetry in the jumping rule, the ratio r in (A4.44) is replaced by a ratio of importance ratios:

r = [p(θ* | y) / J_t(θ* | θ^{t-1})] / [p(θ^{t-1} | y) / J_t(θ^{t-1} | θ*)]. (A4.46)

Allowing asymmetric jumping rules can be useful in increasing the speed of convergence of the Markov chain. In fact, the Metropolis-Hastings algorithm can be further generalized if the ratio r is taken as

r = [δ(θ^{t-1}, θ*) / J_t(θ* | θ^{t-1})] / p(θ^{t-1} | y), (A4.47)

where δ(θ^{t-1}, θ*) can be any function as long as it is symmetric in θ^{t-1} and θ* and makes r ≤ 1 for all θ^{t-1} and θ*.

Gibbs sampling

Gibbs sampling is used to approximate a probability model with many variables. It samples from the distributions obtained by keeping all variables fixed except one. Given the starting point (θ_1^(0), θ_2^(0), ..., θ_d^(0)), iteration i + 1 of the algorithm cycles through the following steps:

(1) Sample θ_1^(i+1) from p(θ_1 | θ_2^(i), ..., θ_d^(i), y).

(2) Sample θ_2^(i+1) from p(θ_2 | θ_1^(i+1), θ_3^(i), ..., θ_d^(i), y).

...

(d) Sample θ_d^(i+1) from p(θ_d | θ_1^(i+1), ..., θ_{d-1}^(i+1), y).

The sequence θ^(0), θ^(1), ..., θ^(t), ... represents a realization of a Markov chain, with transition probability from θ' to θ

K(θ', θ) = p(θ_1 | θ'_2, ..., θ'_d, y) p(θ_2 | θ_1, θ'_3, ..., θ'_d, y) p(θ_3 | θ_1, θ_2, θ'_4, ..., θ'_d, y) × · · · × p(θ_d | θ_1, ..., θ_{d-1}, y). (A4.48)

Under rather general conditions, the Markov chain generated by the Gibbs sampling algorithm converges to the target density as the number of iterations becomes large.
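A classic two-variable illustration (an example chosen here, not from the text) is a bivariate standard normal with correlation ρ, whose full conditionals are themselves normal, Normal(ρ · other, 1 - ρ²). The ρ = 0.6, chain length, burn-in, and seed are arbitrary choices.

```python
import random

def gibbs_bivariate_normal(rho, n, rng):
    """Gibbs sampling for a bivariate standard normal with correlation rho;
    each full conditional is Normal(rho * other, 1 - rho**2)."""
    x = y = 0.0
    sd = (1.0 - rho * rho) ** 0.5
    draws = []
    for _ in range(n):
        x = rng.gauss(rho * y, sd)   # sample x given the current y
        y = rng.gauss(rho * x, sd)   # sample y given the updated x
        draws.append((x, y))
    return draws

rng = random.Random(5)
rho = 0.6
draws = gibbs_bivariate_normal(rho, 40000, rng)[2000:]   # drop burn-in
mean_x = sum(a for a, _ in draws) / len(draws)
emp_exy = sum(a * b for a, b in draws) / len(draws)      # estimates E[XY] = rho
```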


Further Reading

Casella and Berger's (2002) book provides a thorough introduction to probability distributions and general approaches of statistical inference. Basic concepts of entropy, relative entropy, mutual information, and their applications are presented in Cover and Thomas (1991). The book by Gelman et al. (1995) gives a comprehensive description of various aspects of the Bayesian paradigm. The computational algorithms in Bayesian analysis are introduced in a unified manner by Tanner (1996). Importance and rejection sampling originated with von Neumann (1951) and Marshall (1956). The original ideas on the Metropolis and Metropolis-Hastings algorithms can be found in Metropolis et al. (1953) and Hastings (1970), and overall sampling methods are described in detail in Tanner and Wong (1987). Liu (2001) provides a comprehensive treatment of the Monte Carlo method and its application in scientific computing. The original Baum-Welch method is presented in Baum et al. (1970). A unified treatment of the three algorithms associated with HMMs can be found in Rabiner (1989).

References

Baum, L., Petrie, T., Soules, G., and Weiss, N. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41:164-171.

Casella, G., and Berger, R.L. 2002. Statistical Inference. Belmont, CA, Duxbury Press.

Cover, T., and Thomas, J. 1991. Elements of Information Theory. New York, John Wiley & Sons.

Ewens, W.J., and Grant, G.R. 2001. Statistical Methods in Bioinformatics: An Introduction. Berlin, Springer.

Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. 1995. Bayesian Data Analysis. London, Chapman & Hall/CRC.

Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6:721-741.

Hastings, W.K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.

Liu, J.S. 2001. Monte Carlo Strategies in Scientific Computing. Berlin, Springer-Verlag.

Marshall, A. 1956. The use of multi-stage sampling schemes in Monte Carlo computations. In Meyer, M. (Ed.), Symposium on Monte Carlo Methods, pp. 123-140.

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., and Teller, E. 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21:1087-1092.

Rabiner, L.R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77:257-286.


Tanner, M.A. 1996. Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, 3rd edition. Berlin, Springer.

Tanner, M.A., and Wong, W.H. 1987. The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc. 82:528-549.

von Neumann, J. 1951. Various techniques used in connection with random digits. NBS Appl. Math. Ser. 12:36-38.


Index

α-helix 7, 66-67, 70, 85, 97, 235, 237
β-sheet 66, 68, 70, 85, 97, 235, 237-238
β-turn 77-78, 235-236
3₁₀ helix 237

A* search 255-257
ab initio 43-44, 160, 177-178, 181, 193, 200, 218, 222, 241, 267
adenine 231
ADMET 136, 144, 155, 159
alignment accuracy 29, 32, 193-194, 199
amphipathicity 85-87
antibody
  complementarity-determining regions (CDRs) 122, 141-142
  Fab fragments 140
  Fc fragment 140
  monoclonal 136, 141-142
  polyclonal 141
anti-parallel β-sheet 237
Atomic Contact Energy (ACE) 118-119, 122-124
autocorrelation function 292
AutoDock 116, 146-148

backbone structure 2, 6-7, 23, 30, 32, 183, 189
back-propagation algorithm 82, 87
bacteriorhodopsin 65-67, 72, 99
base pair 232-233, 276
Bayesian 45, 299, 310, 314
binding free energy 144, 147-148, 151
binomial distribution 301-302
BioPerl 219, 262-263
biosynthesis 233-234
bipartite matching 248-249, 259
BLOCKS 191, 216
Boltzmann constant 9, 278, 293
Boltzmann factor 293
branch-and-bound 16, 28, 251, 255-257

CAFASP 194
CAPRI 121-122, 126, 128
Cartesian coordinate 270, 272
CASP 12, 14, 29, 33, 44, 122, 180-181
catalytic site 143, 191, 211
CATH 6, 210
CHARMM 45, 119, 123-124, 179
chemotherapeutics 135-136
CHIME 209, 212
Clusters of Orthologous Groups (COG) 216
coarse-grained model 268
comparative modeling 177, 199, 210
confidence factors 187
conformation
  bioactive 147, 163-164
  deterministic conformational search 146-147
  ensemble 52, 147
  stochastic conformational search 146-147
  systematic conformational search 146-147
conformational sampling 169
confusion matrix 91-92
consensus prediction 94
consensus-based approach 180-182
constrained bipartite matching 248-249
convex global underestimator (CGU) 126
Coulomb's law 118
covalent bond 273-276
Critical Assessment of Structure Prediction (CASP)
cross-correlation function 292
C-termini 68, 83, 237
cytokine 138-142
cytosine 231

DALI 3
dead-end-elimination algorithm 257
decoys 24, 99, 211
degrees of freedom 46, 114, 120, 124, 145-147, 151, 270, 278, 281
deoxyribonucleic acid 231
desolvation 110, 116, 123-124, 147
DFIRE 11-12, 118-119
Dirichlet distribution 303
disulfide bond 139, 192, 209, 231, 276
divide-and-conquer algorithm 14-16, 26-27, 260


DOCK 112, 147
docking
  flexible ligand 218
  MM/PBSA 151
  receptor flexibility 146, 149, 151
  rigid ligand 114
  validation 152-153
domain 2-3, 29, 56, 81, 100, 139, 141, 179, 181-182, 184, 210-212, 216-217, 238, 263
DOT 76, 118
downhill simplex 124, 126
dynamic programming 13, 17, 26, 28, 78, 82-83, 87-88, 89, 251, 253, 254

electrostatic interaction 98, 139, 164, 275-276
electrostatics 110, 116, 118, 123-124
enthalpy 284, 288
entropy 82, 147, 277, 280, 282
enzyme 67, 110, 135, 138, 142-143, 149, 155
enzyme structures 211
Euler angles 122, 293-294
EVA 194
expectation maximization 78, 299, 311
expert system 180, 186-189, 192

fast Fourier transform (FFT) 112, 115
finite difference method 290
fold recognition 1-2, 13, 26, 30, 32-33, 47, 55-56, 180, 183-184, 187, 190-194
folding landscape 189
free energy 9, 69-70, 74, 97, 144, 147-148, 151, 269, 277, 285-289
free energy of transfer 69-70, 74
frozen approximation 13-14
FSSP (Fold classification based on Structure-Structure alignment of Proteins) 3, 6, 9, 210
FTDock 115
functional motifs 190

Gamma distribution 302
Gaussian distribution 24, 301
GenBank 216
Gene Ontology (GO) 183, 216
genetic code 232
GenTHREADER 28
geometric distribution 302
Gibbs free energy 9, 287-288
Gibbs sampling 299, 313
globular protein 68, 77, 83, 182, 229, 235
glycophorin A 100
GRAMM 115
GRAMM-X 113
greedy algorithm 257-258
guanine 231-232

HADDOCK 119
hash table 116, 243-244, 256
heap 243, 246-247
helix-helix interaction 97-98
Helmholtz free energy 285, 288
hidden Markov model 68, 81-82, 88-89, 94, 193, 216, 299, 305-306
homology modeling 32, 97, 109, 128, 177, 218, 222, 257
hydrogen bonding 28, 46, 98, 110, 119, 147, 229, 231, 235, 237, 276
hydrogen-bond 156, 231, 237
hydropathy analysis 68, 72, 74-75, 77, 79-80, 87, 94-95
hydrophilic 32, 74, 77, 85-86, 237
hydrophilicity 74
hydrophobic 32, 44, 65-68, 70, 72, 75-77, 82-83, 85
hydrophobic moment 45, 50, 54, 70, 72, 86
hydrophobicity 45, 68-70, 72-78, 85-86, 94
hydrophobicity analysis 76-77

ICM 146, 148-149, 151
importance sampling 311-312
independent reference state 9
induced fit 135, 137, 147
information entropy 280, 283-284
integer programming 14, 16, 22, 26, 28, 251, 254-255, 259
iterative density function

Jess expert system shell 187

kinetic barrier 288-289, 296
knowledge-based 9, 45-47, 50, 55-57, 147-148

lattice representation 46
lead identification
  enrichment factor 151, 153
  virtual screening 144, 149, 152, 154
lead optimization
  structure activity relationship 155-156, 158, 164
leap-frog algorithm 291
Lennard-Jones potential 44, 148, 275
linear integer programming 16, 28
lipid bilayer 65, 68, 74-75, 89


LiveBench 194
local ratio technique 258
longest common subsequence 253

machine learning algorithm 80-81, 87
macromolecule 67, 112, 118, 210-211, 233
Markovian process 299, 305
maximum likelihood 88, 299, 310
membrane protein 6, 29-30, 34, 65-100
membrane topology 76, 78
mesofold 5
meta-server 181-182
Metropolis method 293
Metropolis-Hastings algorithm 299, 312-314
minimum spanning tree 246, 248, 257
mmCIF format 208
molecular dynamics 30, 32, 45, 57, 112, 123-124, 146, 151, 218, 261, 267, 276, 278, 289-292
molecular dynamics simulation 32, 45, 210, 218, 261, 267, 278, 289-290, 292
molecular structure 137, 198
MolFit 113
MolScript 212
Monte Carlo sampling 116, 126
Monte Carlo simulated annealing 47, 49
multinomial distribution 301-303
multiple sequence alignment 10, 78-80, 82, 95, 216
mutation 139-140, 142-143, 178, 233, 260
mutual information 299, 304-305

neural network (NN) 26, 68, 76, 80, 82, 87-89, 99
Newtonian laws 289
non-redundant sequence representatives 9
NP-complete 157, 250
NP-hard 13, 241, 251, 253, 255, 259-261
N-termini 237
nucleic acid 208, 210, 215, 222, 231

obligate protein complexes 116
optimization problem 17, 57, 161, 250-251

pairwise interaction 9, 11, 13-14, 17, 19, 27-28, 98, 241, 260-261
parallel β-sheet 237
partition function 9, 280, 286-287
PatchDock 113
PDB 3, 9, 25, 56, 119, 137, 165, 177-178, 181, 190-191, 194, 208
PDB-select 9, 25
peptide 29, 77, 80, 83, 179, 182, 229, 234
peptide bond 46, 229, 231
per-residue accuracy 91-94
per-segment accuracy 92-93, 95
Pfam 216
pharmacodynamics 136
pharmacokinetics 136
pharmacophore
  clique detection 157-158
  feature 156, 158
  maximum common substructure 158
  model 155, 157-158
physicochemical 58, 72, 85-87, 89
physics-based 12, 26, 45-46, 55-56
pKa 229, 231
Poisson-Boltzmann equation 118
polar clamp 97-98
polar coordinate 270
polynomial time reduction 250
polypeptide 74, 76, 88
porin 67, 88
positive-inside rule 71-72, 76-77, 82, 85
potential energy surface 269-270, 274
potential of mean force 267, 270, 296
primary structure 235, 237
principle of optimality 13-14, 21
PRINTS 191, 216
probability density function 299
probability distributions 299, 304
ProDom 216
propensity 46, 68, 76-80
PROSITE 191-192, 216
PROSPECT 13-16, 27-28
PROSPECT-PSPP pipeline 195
protein backbone 66, 70, 120
protein binding 118, 137, 178
protein classification 216
protein complex structure 34, 109, 218
Protein Data Bank (PDB) 65, 99, 137, 177
protein docking
  backbone searching 120
  benchmark 121
  bound 121
  clustering 52-53, 55, 119-120
  prediction evaluation 120
  refinement 32, 34, 56, 113
  side chain searching 112, 114
  symmetric multimers 128
  unbound 112, 117
protein folding 44, 47, 67, 97, 119, 122, 126
protein geometry analysis 217
protein interaction 44


protein modeling 119, 212
protein sequence databases 224
protein structure determination 252
protein structure prediction
  benchmark 199
  evaluation 77, 93, 120
protein therapeutics 135-139, 143
protein threading 1-34
protein-protein docking 109, 112, 119, 124

Quantitative Structure Activity Relationship (QSAR)
  3D-QSAR (CoMFA) 159, 163-164
  descriptor selection 159
  validation 136, 152-154, 159, 162-164
quantum mechanics 97, 160, 267-268
quaternary structure 210, 238

Ramachandran plot 235, 274
RAPTOR 14, 16, 28
RasMol 207, 209, 212
RDOCK 122-124
rejection sampling 299, 311-312, 314
relative entropy 299, 304
Research Collaboratory for Structural Bioinformatics (RCSB) 208
residue-specific all-atom probability discriminatory function (RAPDF) 45, 50, 54, 56
Rete algorithm 187
ribonucleic acid 231-232
rigid-body docking 116, 121, 218
Robetta 49, 56, 181
Root Mean Square Deviation (RMSD) 52, 120-121, 144, 212
RosettaDock 117
rotamers 46, 57, 120, 151, 257
rules based system 86, 164

SCOP 189-190, 193, 210
scoring function 147-149
SCWRL 120
secondary structure 7, 11, 47, 50
segment overlap (SOV) 93-94
sequence fingerprints 259
sequence similarity 9, 143, 177, 182, 184
sequence-structure alignment 2, 12-14, 17
shape complementarity 110, 112, 116-117, 123
side chain 26, 32-33, 46-47, 57, 69-70, 83, 86, 98, 110, 113, 120-121, 123
signal peptide 83, 179, 182, 184
simulated annealing 47, 49, 99, 116, 161, 181, 268, 293
singleton energy 8, 27
solvation 46, 110, 116, 118, 123-124
statistical significance 23-25, 28-29, 33
statistics based energy 7, 9-10, 26
structural domain 2-3, 141, 182
structural fold space 191
structural motifs 191
structural refinement 181, 267
structure based drug design (SBDD)
  ligand-based 137
  receptor-based 137
structure comparison 43, 218, 238, 255
structure families 210
structure quality assessment 184, 210, 218
structure-based function prediction 180
subgraph isomorphism 17
suffix tree 243-245
superfold 5-6
SwissProt 82, 190, 215-217

template structure 2, 7, 11-14, 17, 21, 23-25, 27
tertiary structure 2, 17-18, 65, 67-68, 93, 97, 136
threading algorithm 14, 16-18, 22-23, 25-26, 28-29, 33
three-dimensional structure 65, 67, 78, 95, 137-138, 144
thymine 231-232
torsional space 46, 146
transient protein complexes 124
tree decomposition 19-22, 26
T-test 309

unifold 5-6
uniform distribution 10, 284, 294, 300
uracil 231-232

van der Waals interactions 98, 100, 147-148, 151, 275
van der Waals potential 100
Verlet algorithm 291-292
VMD 212

worst-case performance ratio 252

ZDOCK 115, 122-124
Z-score 26, 29, 188, 190, 196
Z-test 309


(continued from page ii)

Physics of the Human Body: A Physical View of Physiology, Herman, I.P., 2006

Intermediate Physics for Medicine and Biology, Hobbie, R.K., Roth, B., 2006

Computational Methods for Protein Structure Prediction and Modeling (2 volume set), Xu, Y., Xu, D., Liang, J. (Eds.), 2006

Artificial Sight: Basic Research, Biomedical Engineering, and Clinical Advances, Humayun, Weiland, Greenbaum, 2006

Physics and Energy Landscapes of Proteins, Frauenfelder, H., Austin, R., Chan, S., 2006

Biological Membrane Ion Channels, Chung, S.H., Anderson, O.S., Krishnamurthy, V.V., 2006

Cell Motility, Lenz, P., 2007

Applications of Physics in Radiation Oncology, Goitein, M., 2007

Statistical Physics of Macromolecules, Khokhlov, A., Grosberg, A.Y., Pande, V.S., 2007

Biological Physics, Benedek, G., Villars, F., 2007

Protein Structure Protein Modeling, Kurochkina, N., 2007

Three-Dimensional Shape Perception, Zaidi, Q., 2007

Structural Approaches to Sequence Evolution, Bastolla, U.

Radiobiologically Optimized Radiation Therapy, Brahme, A.

Biological Optical Microscopy, Cheng, P.

Microscopic Imaging, Gu, M.

Deciphering Complex Signatures: Applications in Life Sciences, Morfill, G.

Biomedical Opto-Acoustics, Oraevsky, A.A.

Mathematical Methods in Biology: Mathematics for Ecology and Environmental Sciences, Takeuchi, Y.

Mathematical Methods in Biology: Mathematics for Life Science and Medicine, Takeuchi, Y.

In Vivo Optical Biopsy of the Human Retina, Drexler, W., Fujimoto, J.

Tissue Engineering: Scaffold Material, Design and Fabrication Principles, Hutmacher, D.W.

Ion Beam Therapy, Kraft, G.H.

Biomaterials Engineering: Implants, Materials and Tissues, Helsen, J.A.