PRINCIPAL COMPONENT ANALYSIS OF GRAMICIDIN

A multivariate statistical analysis of collective modes in a model protein

by

Martin Kurylowicz

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Department of Molecular Structure and Function, Hospital for Sick Children and Graduate Department of Biochemistry in the

University of Toronto

© Copyright by Martin Kurylowicz (2010)

Principal Component Analysis of Gramicidin: A multivariate statistical analysis of collective modes in a model protein

Martin Kurylowicz

Dissertation for the degree of Doctor of Philosophy (PhD) Department of Molecular Structure and Function, Hospital for Sick Children, Toronto

and Graduate Department of Biochemistry University of Toronto, 2010

Abstract

Computational research making use of molecular dynamics (MD) simulations has begun to expand

the paradigm of structural biology to include dynamics as the mediator between structure and

function. This work aims to expand the utility of MD simulations by developing Principal

Component Analysis (PCA) techniques to extract the biologically relevant information in these

increasingly complex data sets. Gramicidin is a simple protein with a very clear functional role and a

long history of experimental, theoretical and computational study, making it an ideal candidate for

detailed quantitative study and the development of new analysis techniques. First we quantify the

convergence of our PCA results to underwrite the scope and validity of three 64 ns simulations of gA

and two covalently linked analogs (SS and RR) solvated in a glycerol mono-oleate (GMO)

membrane. Next we introduce a number of statistical measures for identifying regions of

anharmonicity on the free energy landscape and highlight the utility of PCA in identifying functional

modes of motion at both long and short wavelengths. We then introduce a simple ansatz for

extracting physically meaningful modes of collective dynamics from the results of PCA, through a

weighted superposition of eigenvectors. Applied to the gA, SS and RR backbone, this analysis results

in a small number of collective modes which relate structural differences among the three analogs to

dynamic properties with functional interpretations. Finally, we apply elements of our analysis to the

GMO membrane, yielding two simple modes of motion from a large number of noisy and complex

eigenvectors. Our results demonstrate that PCA can be used to isolate covariant motions on a number

of different length and time scales, and highlight the need for an adequate structural and dynamical

account of many more PCs than have been conventionally examined in the analysis of protein motion.

Acknowledgements

I am grateful to my supervisor Régis Pomès, for giving me the opportunity to pursue

this research, as well as my committee members Boris Steipe and Ray Kapral for

scientifically fruitful discussions over the years. My thanks also extend to all members of the

Pomès lab, especially my immediate neighbours Nilu Chakrabarti, Rowan Henry and Grace

Li for their daily conversation and motivation, Chris Neale for his expertise and

insightfulness, and Chris Madill for his frequent aid and long-suffering camaraderie. My work

builds on simulations created by Ching-Hsing Yu, and his expertise was invaluable both as a

post-doc in the lab and during his tenure at the Centre for Computational Biology. All of

their fellowship over the last six years has been invaluable to me.

I am ever appreciative of my scientific mentors at the University of British Columbia,

who continue to lend me their ear, pen and encouragement. Walter Hardy, Doug Bonn and

Myer Bloom taught me what I know about experimentation and physics, while Lee Gass and

Mark Maclean taught me much about the art of science as well as living. I am in their debt

not only for instilling a love of science and its practice, but also the motivation to stick with it

in hard times, if only to pay back their many investments in me.

The companionship of my friends has sustained me from close and far. Thanks to my

west-coast family: Josie Hughes, Mike Melnychuk, Tom Bird, Sarah Henderson, Roger

Donaldson, Janet Tecklenborg; I'm glad we're all still travelling along parallel paths. My

Toronto brothers have also made life grand for me in the big city: Dan Fraleigh, Christopher

Oates and Davy Boon, I'll miss living with and next door to you.

The support of my parents Zosia and Stan, as well as my brother Mike, has been

invaluable over my many years of schooling. I hope to make them proud at convocation,

becoming the first doctor in our family. Thanks for bringing me to Canada, Mom and Dad.

And finally, my most heartfelt love and gratitude to Nancy my wife, for loving me so well,

for keeping our heads above water, and for giving birth to our daughter Ivy Lumina, who has

brought new meaning and light to our lives together.

Table of Contents

ABSTRACT …… ii
Acknowledgements …… iii
Table of Contents …… iv
List of Figures …… vi
List of Tables …… vii
List of Appendices …… vii
List of Abbreviations …… vii

PREFACE …… 1

CHAPTER 1: Introduction …… 2
  1.1: Biophysical Background …… 2
  1.2: Gramicidin …… 8
    1.2.1: Biological Characterization …… 8
    1.2.2: Structure of Gramicidin and its Dioxolane-Linked Analogs …… 9
    1.2.3: Studies of Gramicidin Dynamics …… 14
  1.3: Summary and Overview …… 17

CHAPTER 2: Theory and Methods …… 18
  2.1: Molecular Dynamics …… 18
    2.1.1: Background …… 18
    2.1.2: Simulation of gA/SS/RR in GMO membrane …… 22
  2.2: Principal Component Analysis (PCA) …… 23
    2.2.1: Background …… 23
    2.2.2: PCA and Protein Dynamics …… 25
    2.2.3: PCA vs. NMA of Proteins …… 27
    2.2.4: PCA and its Development in Climatology …… 30

CHAPTER 3: Convergence of PCA …… 35
  3.1: Background …… 35
  3.2: Convergence of Structure: Overlap of Covariance Matrices …… 36
    3.2.1: Backbone of gA, SS and RR: Converged Eigenvectors …… 37
    3.2.2: Side Chains and GMO: Unconverged Eigenvectors …… 40
  3.3: Convergence of Dynamics: Average Distributions and Deviations from Gaussian …… 42
    3.3.1: Backbone and Side Chains of gA …… 43
    3.3.2: Backbone of SS and RR …… 45
  3.4: Summary and Conclusions …… 56

CHAPTER 4: Anharmonic Features of Collective Motion …… 47
  4.1: Scaling of PCA Eigenvalues …… 47
  4.2: Non-Gaussian PC Distributions …… 50
  4.3: MSD and Anomalous Diffusion …… 54
  4.4: Collective Oscillations in the Small Covariance Regime …… 58
  4.5: Discussion …… 61
  4.6: Conclusion …… 63

CHAPTER 5: Emergent Modes at Large Covariance …… 65
  5.1: Introduction …… 65
  5.2: Band Gaps in the Eigenvalue Spectra …… 67
  5.3: Spatial Structure of PC Eigenvectors …… 70
  5.4: The Principal Components of gA, SS and RR …… 73
  5.5: Coherent Modes from Weighted Sums of PCs …… 75
  5.6: Covariance of PC Trajectories …… 83
  5.6: Discussion and Conclusions …… 85

CHAPTER 6: PCA of GMO Lipids Solvating Gramicidin …… 87
  6.1: Background …… 87
  6.2: Methods …… 88
  6.3: Results and Discussion …… 89

CHAPTER 7: General Conclusions and Future Directions …… 99

References …… 102

Appendix 1: Normal Mode Analysis …… 113

Appendix 2: Side Chain Conformations of gA …… 115

List of Figures

Figure 1.1: Gramicidin in a hydrated GMO membrane …… 12
Figure 1.2: Structure of dioxolane linker and its orientation in RR and SS analogs …… 13

Figure 3.1: Convergence of gA backbone dynamics in 10 ns and 64 ns simulations …… 39
Figure 3.2: Comparison of convergence for gA, SS and RR backbone dynamics …… 42
Figure 3.3: Unconverged dynamics of side chains and GMO lipids …… 41
Figure 3.4: Average difference from Gaussian distributions for gA …… 44
Figure 3.5: Average difference from Gaussian distributions for SS and RR …… 45

Figure 4.1: PCA eigenvalue spectra for gA for various atomic subsets …… 49
Figure 4.2: Non-Gaussian distributions of long and short PCs vs. timescale for gA …… 51
Figure 4.3: Non-Gaussian distributions of 5 long and 5 short PCs at 1 ns for gA …… 52
Figure 4.4: Surfaces of normalized difference from Gaussian …… 53
Figure 4.5: MSD of various PCs for gA backbone and side chains …… 56
Figure 4.6: MSD slope of various PCs for gA backbone and side chains …… 57
Figure 4.7: Spectral power of MSD oscillations …… 59
Figure 4.8: Illustration of short oscillating backbone PCs …… 61

Figure 5.1: Detail of gA backbone eigenvalue spectrum …… 69
Figure 5.2: Illustration of eigenvector directional coordinates …… 72
Figure 5.3: PC 1-3 of gA, SS and RR backbone …… 74
Figure 5.4: PC 4-9 of gA backbone …… 76
Figure 5.5: Projection of directional coordinate for PC 1-8: gA, SS and RR …… 77
Figure 5.6: Coherent modes of the gA backbone …… 79
Figure 5.7: Projection of directional coordinate for coherent modes of gA …… 79
Figure 5.8: Comparison of modes A and B for gA, SS and RR …… 82
Figure 5.9: Covariance matrix and its absolute value for gA …… 84
Figure 5.10: Covariance matrices for gA/SS/RR …… 86

Figure 6.1: Radial distribution function of GMO surrounding gA …… 89
Figure 6.2: Planar distribution functions of GMO surrounding gA …… 94
Figure 6.3: Comparison of average structure of annular lipids from 2 to 64 ns …… 95
Figure 6.4: Eigenvalue spectra for annular GMO lipids …… 96
Figure 6.5: PC 1-3 for annular GMO lipids …… 97
Figure 6.6: Emergent modes for PC 1-3 and PC 4-12 …… 98
Figure 6.7: Normalized distributions of PC trajectories …… 98

Figure A1: NMA eigenvalues for atomic subsets of gA …… 114
Figure A2: Distributions of PC1 vs. PC2 for NCαC atoms in gA …… 116
Figure A3: Side chain conformations of gA in GMO over 64 ns …… 117

List of Tables

Table 1: PCA studies of protein dynamics …… 26

List of Appendices

Appendix 1: Normal Mode Analysis …… 113
Appendix 2: Conformational Basins of gA Side Chains …… 115

List of Abbreviations

Å: Angstroms
Ala: Alanine
ATP: Adenosine triphosphate
BPTI: Bovine pancreatic trypsin inhibitor
CFA: Common Factor Analysis
CHARMM: Chemistry at Harvard Molecular Mechanics (an MD package)
CPT: Constant Pressure and Temperature algorithm for CHARMM dynamics
DMPC: 1,2-dimyristoylglycero-3-phosphocholine
EOF: Empirical Orthogonal Functions
gA: Gramicidin A
GMO: glycerol monooleate (hydrophilic headgroup with single chain mono-unsaturated lipid)
Lys: Lysine
MD: Molecular Dynamics
MSD: Mean Squared Deviation
NCaC: Nitrogen, alpha-carbon, carbonyl-carbon atoms
NHCaCO: Nitrogen, amide hydrogen, alpha-carbon, carbonyl-group (carbon and oxygen) atoms
NMR: Nuclear Magnetic Resonance
PC: Principal Component
PCA: Principal Component Analysis
POD: Proper Orthogonal Decomposition
RMS: Root Mean Square
RMSΔθ: Root Mean Square change in angular coordinate
RMSD: Root Mean Square Deviation
RMSF: Root Mean Square Fluctuation
RMSIP: Root Mean Square Inner Product
RR: dioxolane-linked analog of gramicidin A with ring perpendicular to helical pitch
SS: dioxolane-linked analog of gramicidin A with ring parallel to helical pitch
TIP3P: Three-point transferable intermolecular potential for water
Trp: Tryptophan
SVD: Singular Value Decomposition

Preface

This work aims to expand the utility of molecular dynamics (MD) simulations

through the use of a multivariate statistical technique called Principal Component Analysis

(PCA). While MD simulations continue to become more powerful, creating longer

trajectories of increasingly large and complex systems, there is a need to develop and refine

mathematical and computational techniques to extract the biologically relevant information

in these increasingly elaborate data sets.

There are four main sections of results, each expanding the use of Principal

Component Analysis (PCA) beyond the traditional applications currently found in the

biomolecular literature. The first concerns the quantification of convergence in Chapter 3,

which is relevant not only to PCA but to the sampling of conformational state space of

complex dynamics in general. The second introduces quantitative statistical measures for

identifying regions of anharmonicity on the free energy landscape (Chapter 4), and highlights

the utility of PCA in identifying functional modes of motion at the equivalent of short

wavelengths, whereas PCA has traditionally been focused almost exclusively on long

wavelength modes. Chapter 5 introduces a simple ansatz for extracting simplified and

physically interpretable modes of collective dynamics from the results of PCA, through a

weighted superposition of eigenvectors. Finally, in Chapter 6 PCA is applied to the

membrane lipids surrounding gramicidin. This is a test case for the utility of PCA on diffuse

collections of monomers which behave as a continuous medium, whose eigenvectors are very

noisy and difficult to interpret.

The structural biologist’s insights linking molecular structure to function in complex

biochemical systems have contributed significantly to the tremendous success of molecular

biology. The advent of molecular dynamics has begun to expand this paradigm to include

dynamics as the mediator between structure and function. The development of multivariate

methods like PCA promises to enrich the analysis of MD data and contribute quantitative

insights into the relationships between structure, dynamics and function.

Chapter 1: Introduction

1.1: Biophysical Background

Computer simulation has become an essential research tool for understanding how the

dynamics of proteins link their structure to their function (1-5). Molecular dynamics (MD) in

particular can be helpful in obtaining information that is experimentally inaccessible with

current technologies. This is especially true in the single molecule regime, where it is

currently impossible to measure the internal motions of proteins with atomic resolution and

at timescales fast enough to resolve conformational transitions. On the other hand, despite

spanning ~9 orders of magnitude – from femtoseconds to microseconds – MD simulations

are still not capable of reaching long enough timescales to model many biologically relevant

processes; even the fastest protein folding event takes microseconds, and simulations on the

millisecond timescale are necessary to model the kinetics of this process. Hopefully there

will be a time in the future when computational and experimental technology will overlap in

the middle ground, when experimental techniques are able to probe small and fast enough,

and computational simulations are large and long enough, to study the same phenomena and

complement each other directly. Until then, many biophysical processes can only be studied

by simulation, and contact with experimental data remains a significant challenge for

computational biochemistry.

By combining energetics and dynamics, MD simulations are capable of calculating

the free energy of a complex system with many degrees of freedom. At temperature T, the

change in free energy ΔG has two components, the change in enthalpy ΔH and entropy ΔS:

ΔG = ΔH − TΔS. The change in enthalpy is the sum of the change in internal energy U and the

mechanical work terms arising from changes in pressure (VΔP) or volume (PΔV). In thermal

equilibrium where no work is done on the system, the internal energy in the microcanonical

ensemble is the sum of all pair-wise potential energy terms between the atoms of a molecule;

this is the quantity which is calculated at each time step of an MD simulation. The values of

molecular parameters in these pair-wise interactions are derived by calibration against both

measured and computed (with high-level quantum calculations) values of well-established

molecular and bulk properties, such as the atomic charge distribution, the orientational

relaxation rate, the dielectric constant, etc. While the enthalpy can be calculated

instantaneously, the entropy is a function of dynamics since the time-evolution of a system

generates the ensemble of states which are actually explored on the potential energy surface.

Hence the entropic component of the free energy is included in an MD simulation by

integrating Newton’s equations of motion over many time steps.

In general if the enthalpy is large, a complex molecular structure is very stable and

hence the entropy is small. On the other hand if the enthalpy is small then more

conformations become accessible and the entropy is large. Together these two terms create a

balancing act which determines whether any biochemical event will proceed spontaneously,

or how large an activation barrier must be overcome, and determines the ratio of substrate to

product in a reaction. One of the defining characteristics of biomolecular dynamics is that

both enthalpic and entropic contributions are very large, since these molecules have many

stabilizing interactions but also many degrees of freedom to explore. This cancelation of

large positive and negative terms gives rise to a free energy landscape which is intrinsically

“rough”, with many minima and maxima. Such a fine balance also means that calculations

of internal energy and of dynamics must be very accurate to yield meaningful free-energy

results.
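
The sensitivity of this balance can be illustrated with a short numerical sketch. The values of ΔH and ΔS below are hypothetical and chosen only to mimic two large, nearly cancelling terms; the function names are likewise illustrative. The sketch assumes the standard relation between free energy and the equilibrium ratio of product to substrate, K = exp(−ΔG/RT).

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)

def free_energy_change(delta_h, delta_s, temperature):
    """Return dG = dH - T*dS (kcal/mol), with dS in kcal/(mol K)."""
    return delta_h - temperature * delta_s

def equilibrium_ratio(delta_g, temperature):
    """Return the equilibrium ratio [product]/[substrate] = exp(-dG/RT)."""
    return math.exp(-delta_g / (R_KCAL * temperature))

T = 300.0      # temperature in K
dH = -100.0    # hypothetical enthalpy change, kcal/mol
dS = -0.320    # hypothetical entropy change, kcal/(mol K), so T*dS = -96

dG = free_energy_change(dH, dS, T)
ratio = equilibrium_ratio(dG, T)

# A 2% error in the large enthalpic term shifts dG by 2 kcal/mol,
# which is half the size of dG itself in this hypothetical case.
dG_err = free_energy_change(dH * 1.02, dS, T)
ratio_err = equilibrium_ratio(dG_err, T)

print(f"dG = {dG:.2f} kcal/mol, ratio = {ratio:.3g}")
print(f"with 2% error in dH: dG = {dG_err:.2f} kcal/mol, ratio = {ratio_err:.3g}")
```

Because the ratio depends exponentially on ΔG, a small relative error in either large term changes the predicted populations by a large factor.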

This rough free-energy landscape model has become a paradigm for understanding

protein dynamics, and especially protein folding (6). It is generally accepted that proteins

exist on a complex free-energy landscape that is “rugged” in the sense of having multiple

nested minima corresponding to stable conformations, while a global funneling or ravine-like

structure of the landscape guides folding around kinetic barriers toward the native structure.

Simulations usually provide insight by describing the conformational ensemble

corresponding to the free-energy minima which are accessible to a complex biomolecule

under physiological conditions. Simulations are also very useful for studying the pathways

between these minima, which elucidates the kinetic barriers, intermediate structures and

transition states along a complex reaction pathway. This is why dynamics are important in

addition to structure, since they fill in the connections among an ensemble of conformations

that make proteins into machines capable of function, rather than static objects.
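
One common way to connect a sampled conformational ensemble to a free-energy landscape is to invert the Boltzmann relation, G(x) = −kT ln P(x), along a chosen coordinate x. The following is a minimal sketch of that idea and is not taken from the analysis in this work; the function name, the binning scheme and the toy samples are assumptions made for illustration.

```python
import math

def free_energy_profile(samples, lo, hi, n_bins, kT=1.0):
    """Estimate G(x) = -kT ln P(x) from samples of a coordinate x.

    Returns a list of (bin_center, G) pairs for non-empty bins,
    shifted so that the lowest free energy is zero.
    """
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for x in samples:
        if lo <= x < hi:
            k = min(int((x - lo) / width), n_bins - 1)
            counts[k] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples fall inside [lo, hi)")
    profile = []
    for k, c in enumerate(counts):
        if c > 0:  # empty bins have undefined (infinite) free energy
            center = lo + (k + 0.5) * width
            profile.append((center, -kT * math.log(c / total)))
    g_min = min(g for _, g in profile)
    return [(x, g - g_min) for x, g in profile]

# Toy ensemble: one basin visited three times as often as the other.
samples = [0.25] * 30 + [0.75] * 10
profile = free_energy_profile(samples, 0.0, 1.0, n_bins=2)
for center, g in profile:
    print(f"x = {center:.2f}, G = {g:.3f} kT")
```

In this toy case the less populated basin lies kT ln 3 above the more populated one, which follows directly from the 3:1 population ratio.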

In recent years MD has contributed significantly to our understanding of biochemical

mechanism in enzyme catalysis as well as protein folding (2). For example, a coarse-grained

MD approach was able to compute hundreds of folding trajectories for a simple three-helix

bundle protein to understand the role of native and non-native contacts along the folding

pathway (7), and these results were consistent with shorter all-atom simulations (8). These

studies were able to determine the relative contributions of secondary structure formation and

hydrophobic collapse in the folding pathway of a simple protein, as well as the sequence in

which these events occur. Such a study elucidates the relative importance and structure of

on-pathway intermediates. Intermediate structures are of special interest in enzyme catalysis,

since it has long been recognized that enzymes function by binding the transition state in a

reaction (9), thereby lowering the activation energy and increasing reaction rates by factors

of up to 10^19 (10). Both the structure and dynamics of an enzyme are important in this

regard, as the structure provides a pre-organized environment which stabilizes the transition

state (11), and dynamic fluctuations are often important in allowing for the substrate to enter

and the product to leave the reactive site (12). Dynamics also play a role through

conformational changes as well as vibrational modes, since these may also contribute directly

to lowering the activation energy for an enzymatic reaction (13). A good example of

structural and dynamic effects can be found in the enzyme triosephosphate isomerase (TIM),

which has a finely structured pocket of residues whose positions lower the activation energy

for the transfer of a proton from substrate to enzyme through electrostatic interactions (9, 13).

Proton transfer reactions are particularly sensitive to structural changes, and can be catalyzed

by deforming a C-H bond by as little as 0.5 Å, or an O-H bond by as little as 0.1 Å. Moreover, proton

transfer reactions are also sensitive to the presence of water, which may catalyze unwanted

side-reactions at the reaction site; the TIM enzyme has a dynamic conformational mechanism

for closing a “lid” over the active site during catalysis, making the reaction centre accessible

to substrate but not water (14).
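
To give a sense of scale for the rate enhancements quoted above, the following sketch converts a rate enhancement into the corresponding reduction in activation free energy. It assumes a simple Arrhenius or transition-state form, k ∝ exp(−ΔG‡/RT), with an unchanged prefactor; this assumption and the function name are illustrative and are not part of the cited studies.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)

def barrier_reduction(rate_enhancement, temperature=298.15):
    """Barrier lowering (kcal/mol) implied by a rate enhancement.

    Assumes k is proportional to exp(-dG_act / RT), so that
    k_cat / k_uncat = exp(ddG / RT) and ddG = RT ln(k_cat / k_uncat).
    """
    return R_KCAL * temperature * math.log(rate_enhancement)

ddg = barrier_reduction(1e19)
print(f"A 10^19-fold enhancement corresponds to about {ddg:.1f} kcal/mol")
```

Under these assumptions, the largest quoted enhancement corresponds to a barrier reduction of a few tens of kcal/mol at room temperature.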

Conformational changes are generally the best characterized examples of functional

motion in proteins. Many proteins bind their ligands through very specific conformational

changes around the binding site (as in myoglobin and hemoglobin), often coupled with other

conformational changes which exert allosteric control over the binding at other sites on the

protein (15). Global conformational changes may also exert mechanical forces in the

function of molecular motors, as in myosin (16, 17), or facilitate chemical catalysis in the

modification of chemical bonds, as in serine proteases (18). However, large-scale global

conformation changes are not the only interesting feature of protein dynamics; motions at

very different length scales are also important to the functioning of a protein. While changes

in tertiary and quaternary structure may span the size of an entire protein, individual residues

will also have important collective motions at much smaller spatial scales, and modification

of hydrogen bonds within the secondary structure will occur on even smaller scales yet. The

same is true for processes occurring at very different timescales, spanning at least 9 orders of

magnitude from femtoseconds (bond vibrations) to milliseconds (folding).

The coupling of large and small structural changes, as well as slow and fast

dynamical processes, is especially pertinent in the study of membrane proteins which form

ion channels. These proteins are responsible for regulating the permeation of material in and

out of cells or organelles, and transporting charges across membranes, an activity essential to

many fundamental biophysical processes from the transmission of electrical signals in

neurons to the generation of ATP. Ion channels have evolved sophisticated molecular

mechanisms to control the specificity with which they conduct various molecular species.

The intrinsically dynamic nature of transport processes makes MD simulations particularly

helpful in elucidating the mechanism of action of these channels. The transport process is

usually much faster than any conformational changes in the protein which modulates it.

Indeed, ion channels are an excellent target for MD studies precisely because the timescale of

ion diffusion is accessible to these simulations, and hence functional properties of the

channel can be probed at equilibrium without biasing dynamics to encourage rare events. At

the other extreme, the membrane in which these proteins function is governed by much

longer timescales, and must be described dynamically as well since no fixed structure exists

for this liquid-crystalline environment. Furthermore, the low dielectric constant of lipids

makes membrane-bound proteins more sensitive to electrostatic forces than water soluble

proteins (4). The detailed atomistic study of ion channels presents a special opportunity for

understanding the structural and dynamic correlates of function.

The KcsA potassium channel illustrates many of these features, and is among the

most studied transmembrane channels after gramicidin. KcsA conducts K+ at rates near the

diffusion limit while discriminating against Na+ by more than a thousand-fold. The “knock-

on” mechanism was described long ago by Hodgkin and Keynes (19), where concerted multi-

ion transitions are mediated simultaneously by ion-channel attraction and ion-ion repulsion,

allowing several ions to move in single file through the narrow pore. This illustrates the fine

balance of interactions and dynamics which exist in ion channels. The selectivity of KcsA

was not clearly understood until its crystal structure was solved (20, 21), showing multiple

dehydrated K+ ions coordinated by main-chain carbonyl groups which line a very narrow

region of the pore corresponding to a highly conserved sequence of six amino acids common

to all K+ channels. Atomic fluctuations are essential to this selectivity mechanism, since

there are regions of the filter that are effectively narrower than suggested by the van der

Waals radius of K+ and carbonyl oxygens in the channel (22). This is also intriguing given

the relatively small size difference between Na+ and K+ (0.38 Å); it would be expected that

only a very rigid pore could discriminate between these, but the pore has been shown to be

quite flexible with RMS fluctuations on the order of 1 Å (23). However, this small size

difference allows for an optimum coordination number of 8 for K+, and only 6 for Na+. Since

there are eight carbonyls in the selectivity filter of KcsA, this turns out to be the basis of K+

selectivity in KcsA (24). We will see below that the solvation of ions by backbone carbonyls

is also a significant feature of the gramicidin channel.

Finally, it has long been recognized that interactions between membrane proteins and

their lipid environment may be integral to function. There has been considerable interest in

this problem in structural biology (25), where understanding lipid interactions may be

essential to crystallization and structural characterization of membrane proteins. There are

many roles for lipid-protein interactions: specific lipid species may confer structural stability

to membrane proteins, control insertion and folding processes, or aid in the assembly or

oligomerisation of multi-subunit complexes (26).

MD simulations of integral membrane proteins have demonstrated a number of

effects which are thought to be relevant to membrane proteins in general. For example,

simulations demonstrated that the presence of the transmembrane region formed by the alpha

helical bundle of the nAChR glycoprotein increases the orientational order of the DMPC

lipid acyl chains relative to the pure lipid bilayer, an effect which is enhanced deeper in the

membrane interior (27). This study also showed a decrease in the number of gauche defects,

a broadening of the orientational distribution of lipid headgroup dipole moments, and an

increase in headgroup orientation toward the water phase. Simulation studies of OmpA have

demonstrated a strong differentiation between bound and free lipids, where the lateral

diffusion coefficients of lipids solvating the protein are about half that of free lipids (28).

The same study also showed that lipid-protein interactions are able to relax to a stable state

on the 20 ns timescale. The shell of relatively immobilized lipids interacting directly with a

protein has been called “annular” lipids (29), in that they form a ring-like structure around

the protein whose properties are distinct from the bulk lipids in the rest of the membrane.

Spin-labelling has been particularly successful in characterizing annular lipids (30),

demonstrating that their interaction with the protein is ‘non-sticky’ and that a particular lipid

molecule remains in the annular shell for approximately 100 ns in the case of diacyl

phospholipids. (These timescales are important to keep in mind as our simulations of

gramicidin are 64 ns long).

These effects are in general agreement with experimental and simulation studies of

gramicidin in a membrane environment. An increase in the ordering of acyl chains was

observed using ESR and 2H NMR for gA in DMPC lipid bilayers in the liquid crystalline

phase (31), although the opposite effect is observed in the gel phase (32). 2D-ELDOR

(electron double resonance) has been used to differentiate between bound and free lipid

behaviour (33), demonstrating that lipids bound in the first solvation shell are immobilized

compared to bulk lipids. A 0.5 ns simulation of gA in a DMPC bilayer has demonstrated

good agreement with the 2H NMR data and an increase in the ordering of the acyl chains was

observed (34). Another 1.2 ns simulation of the same system demonstrated that the effects of

the channel on the lipid bilayer were short range, affecting only those DMPC molecules

bound to the channel (35). However, a comparison of gA simulations in DiPhPC and GMO

bilayers shows that GMO molecules are significantly more ordered than the diacyl chains,

with three distinct solvation shells apparent in the radial distribution function (36).

All of the phenomena described above demonstrate the interplay between structure

and dynamics which is essential to the function of large and small biological molecules.

While MD simulations have often been successful in providing insight into these

relationships, the size of the resulting data set makes their interpretation difficult. Different

parts of a complex molecule (and its solvation environment) may play various functional

roles at different length and time scales, and it is difficult to identify these motions in the

large amount of data resulting from MD trajectories. This is a general problem facing much

of structural biology and computational science: our ability to generate experimental or

simulated data has begun to outpace our ability to analyze it for biologically meaningful

information and insight. One of the outstanding questions posed by the study of molecular

dynamics is how to quantify the structure of motion: we must account not only for the three

dimensional pattern of atomic positions, but also for that of their displacements. Which atoms move

together, how far do they move, and most importantly, in which directions? These are the

questions which motivate this study to undertake Principal Component Analysis as a means

of characterizing the 3D structure of collective displacements.
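
The basic idea can be sketched on a toy trajectory. The example below is a minimal illustration in plain Python and is not the implementation used in this work: it mean-centres a set of frames, builds the covariance matrix of coordinate displacements, and extracts the leading eigenvector by power iteration. The four-coordinate trajectory, the function names and the use of power iteration are all assumptions made for illustration.

```python
import math

def mean_center(frames):
    """Subtract the time-averaged structure from every frame."""
    n_frames, n_coords = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n_frames for j in range(n_coords)]
    return [[f[j] - mean[j] for j in range(n_coords)] for f in frames]

def covariance_matrix(centered):
    """Covariance of displacements: C[i][j] = <dx_i * dx_j>."""
    n_frames, n_coords = len(centered), len(centered[0])
    return [[sum(f[i] * f[j] for f in centered) / (n_frames - 1)
             for j in range(n_coords)] for i in range(n_coords)]

def leading_pc(cov, n_iter=200):
    """Power iteration for the largest eigenvalue and its unit eigenvector."""
    n = len(cov)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * cv[i] for i in range(n)), v

# Toy trajectory: a large collective motion along u (coordinates 0 and 1
# moving together) plus a small motion along w (coordinates 2 and 3
# moving in opposition), around a hypothetical reference structure.
s = 1.0 / math.sqrt(2.0)
u = [s, s, 0.0, 0.0]
w = [0.0, 0.0, s, -s]
reference = [1.0, 2.0, 3.0, 4.0]
frames = []
for t in range(200):
    a = 2.0 * math.sin(0.1 * t)
    b = 0.1 * math.cos(0.37 * t)
    frames.append([reference[k] + a * u[k] + b * w[k] for k in range(4)])

centered = mean_center(frames)
cov = covariance_matrix(centered)
eigenvalue, pc1 = leading_pc(cov)

# Projecting each frame onto the eigenvector gives the PC trajectory.
pc1_trajectory = [sum(f[k] * pc1[k] for k in range(4)) for f in centered]
print("PC1 eigenvalue:", round(eigenvalue, 3))
print("PC1 eigenvector:", [round(x, 3) for x in pc1])
```

The eigenvector answers which coordinates move together and in which directions, while the eigenvalue measures how far they move. For a real system the covariance matrix has 3N × 3N elements for N atoms, and a full eigendecomposition from a linear algebra library would normally replace the power iteration used here.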

In order to develop quantitative techniques for the analysis of dynamics, it behooves

us to study simple systems which have been well characterized in the past, yet also have

adequate complexity to capture the essential features of biological function. Gramicidin is

one of the simplest membrane proteins with a very clear functional role and a long history of

experimental, theoretical and computational study. Other examples of such archetypal

systems include cytochrome c, BPTI, ubiquitin and lysozyme, but all of these are globular

proteins while gramicidin is a membrane-bound channel, which adds an important layer of

complexity. It is very small compared to most proteins, yet it has both secondary and

quaternary structure, and is also a membrane protein which interacts with its lipid

environment. Moreover, because its function as a channel is well understood, MD studies of

gramicidin are especially tractable: we know that functional events take place within the

duration of our simulations. On the other hand, after decades of theoretical and

computational studies of gramicidin, only recently have nanosecond-scale MD simulations of

proteins in an explicit membrane bilayer become tractable. All these features make

gramicidin an ideal candidate for detailed quantitative study and the development of new

analysis techniques.

1.2: Gramicidin

1.2.1: Biological Characterization

Gramicidin was discovered by René Dubos in 1939 (37), who isolated it from the soil

bacterial species Bacillus brevis, and named it for its bactericidal properties. Gramicidin was

one of the first commercially produced antibiotics, making a significant impact on battlefield

medicine during the Second World War. It is active primarily against Gram-positive bacteria

other than the Bacilli, as well as select Gram-negative species. Its use as an antibiotic is

limited to topical applications, as it induces hemolysis when taken internally, and is most

commonly found today in the commercial ointment Neosporin. To give a historical

perspective on the importance of this molecule, when Soviet researchers isolated an entirely

different compound with similar antibacterial properties in 1942, it was named Gramicidin S,

for Soviet. At the end of the wartime effort in 1944 the Soviet Ministry of Health was

collaborating with Great Britain to solve its structure. While the culmination of this effort

had to await the development of x-ray crystallography and NMR spectroscopy, gramicidin

was one of the first proteins whose structure was definitively solved by NMR (38) and for

about 15 years was the only transmembrane channel with known structure. This contributed

significantly to the wealth of research which has been devoted to this molecule.

When inserted into a membrane gramicidin forms a passive trans-membrane pore

which is selective for small monovalent cations (39), and this is essential to its mode of

action as an antibiotic. It kills bacteria by increasing the permeability of their cell membranes,

thereby destroying the ion gradients (primarily of H⁺, Na⁺ and K⁺) between the cytoplasm

and the extracellular environment. The experimentally observed selectivity sequence for

gramicidin is Li⁺ < Na⁺ < K⁺ < Rb⁺ < Cs⁺ (40, 41), which is the same as the order of these ions’

mobility in water, with overall activation free energy barriers on the order of 5-10 kcal/mol and

conductance of ~10⁷ ions per second (39, 42). Gramicidin is impermeable to anions and is

blocked by divalent cations.

It is interesting to note that the natural function of gramicidin in Bacillus brevis is not

known, although it is apparently not used as an antibacterial pore-forming agent in its native

environment. It has been shown to inhibit E. coli RNA polymerase, and in B. brevis it is

believed to play a role in gene regulation during the shift from vegetative growth to

sporulation (43).

1.2.2: Structure of Gramicidin and its Dioxolane-Linked Analogs

Gramicidin has a number of structural analogs, all of which are pentadeca-peptides

which dimerize to form beta-helical transmembrane channels when inserted into a membrane

bilayer. Gramicidin D is the pharmacological extract (named for Dubos), and is a

heterogeneous mixture of 80% gramicidin A, 6% gramicidin B and 14% gramicidin C.

These are all naturally occurring dimers and differ only in the residue at position 11 with the

following chemical formula:

X_L-Gly-Ala_L-Leu_D-Ala_L-Val_D-Val_L-Val_D-Trp_L-Leu_D-Y_L-Leu_D-Trp_L-Leu_D-Trp_L

The L and D subscripts indicate left-handed and right-handed enantiomers of the amino acids

(note that Gly has no optical activity since it is not chiral). Gramicidin A has Y=Trp,

gramicidin B has Y=Phe and gramicidin C has Y=Tyr. There are variants of all three analogs

where X=Val or X=Ile. There are also a number of artificial analogs where the Trp residues

at positions 9, 13 and 15 are also modified. The analog in which all Trp residues have been

replaced with Phe is called gramicidin M.

The structure of gramicidin A has been characterized at high resolution with ¹H-NMR

in lipid micelles (38) and using solid-state NMR in lamellar-phase lipid bilayers (44, 45).

The native channel is composed of two monomers which assemble as a head-to-head non-

covalently-linked dimer, forming a cylindrical pore when solvated in a membrane bilayer, as

shown in Fig. 1.1. Each monomer has 15 alternating L- and D-amino acid residues which

form a β6.3-helix with 2.5 turns per monomer. Four Trp residues stabilize the C-termini at

the water-membrane interface. The gA helix forms a 4-Å-wide cylindrical pore which hosts

a single file chain of water molecules traversing the membrane, thereby creating a pathway

for cation permeation and a hydrogen-bonded wire for the conduction of protons. Divalent

cations are too large to pass through the mouth of the channel, and block it by binding there.

The unique ability to form a beta-helical secondary structure is due to the alternating

L and D amino acids in the structure of gramicidin. L amino acids are by far the dominant

component of proteins in most life forms, with D forms found only in the outer

peptidoglycan walls of bacteria (46). This beta-helix has the carbonyl oxygens alternating

from one side of the backbone to the other. Each of these is hydrogen bonded to an amide

hydrogen, which is the characteristic pattern of hydrogen-bonds in the beta-sheet (hence the

name). However, the alternating L and D amino acids allow for a continuous curve in a

single direction rather than flattening the chain as in a beta-sheet formed exclusively of L-

amino acids. This pattern of alternating carbonyl orientations results in twice the distance

between neighbouring hydrogen-bonds than in an alpha helix, making the beta helix less

rigid. This pattern of carbonyls also exposes a periodic set of partial negative charges to the

lumen of the channel, which play a significant role in lowering the energetic barrier due to

cation dehydration upon entry into the channel, and also in solvating positive ions as they

pass through the channel. Note however that the orientation of the carbonyl dipoles is

parallel to the helical axis, while the optimal solvation geometry would point the dipole

moment radially towards the ion within a pore. This makes the solvation of ions by

backbone carbonyls a more subtle process in gramicidin than in the selectivity filter of KcsA

discussed above, and intrinsically couples tilting motions of carbonyl groups with ion

solvation.

Gramicidin A is a dimer held together by six hydrogen-bonds at the N-termini

located in the centre of the bilayer. These interactions play a dominant role in dimer

association and dissociation. Gating of the native channel is associated with the lifetime of

dimerization, which is on the order of 100 ms (47, 48). Dioxolane-linked analogs of gA have

been synthesized which inhibit dimer dissociation (49, 50), resulting in channels with

increased conductive lifetimes. The presence of two chiral carbon atoms in the dioxolane

ring leads to two distinct diastereoisomers, where both linking carbon atoms are either in the

S or R state. The structure of the linker bridging a dipeptide (in the R configuration) is

shown in Fig. 1.2A. The R and S designation defines the nomenclature of linked channels;

since both chiral carbons must be in the same configuration when linking the gramicidin

dimer, the two diastereoisomers are named SS and RR. The most significant structural

difference between these channels relates to the strain of the linker acting on the helical

backbone: the SS linker fits easily along the pitch of the helix, while the RR linker is

perpendicular to it, creating a wedge-like dislocation with the ring parallel to the helical axis,

as depicted in Fig. 1.2B and 1.2C. Significantly, the SS dimer is much more stable in its

conducting state (hours) than the RR dimer (minutes) (51-53).

Figure 1.1: Gramicidin A in a hydrated lipid bilayer. The GMO molecules are shown in thin lines (cyan carbons, red oxygens), bulk water molecules are shown as small spheres while 9 lumen water molecules are emphasized as large spheres. The β-helical backbone is shown in blue, while hydrophilic side chains (Trp in transparent red) and hydrophobic side chains (Leu, Val, Ala and Gly in transparent green) are shown in stick representation. A top-down view of the channel can be found in Figure A3 of Appendix 2. The SS- and RR-linked analogs were simulated in the same hydrated membrane environment.

Figure 1.2: A: The dioxolane linker (atoms 1-9) inserted in the R configuration between two amino acids. The SS-linked (B) and RR-linked (C) analogs vary in the degree of structural perturbation caused by the linker to the pitch of the beta-helix. Only backbone atoms are shown for clarity.

1.2.3: Studies of Gramicidin Dynamics

Gramicidin has a long history of theoretical and computational study; reviews may

be found in Refs (54-56). The first theoretical model of ion transport in gramicidin was

proposed by Läuger in 1973 (57), and consisted of a simplified array of dipole moments.

Since then the evolution of computing power and the refinement of potential energy

functions has given rise to molecular dynamics simulations with increasingly realistic

membrane and hydration environments and at increasingly long timescales, along with

Monte Carlo simulations, ab initio quantum mechanical calculations, activated dynamics and

free-energy simulation techniques (56). In addition to these, hybrid models have also been

constructed where, for example, an MD treatment of proton transport, channel and lumen

dynamics combines with a Monte-Carlo treatment of entrance and exit from the channel to

yield conductivity values which can be compared with experiment (58).

Calculating the free energy of ion permeation is essential to understanding the

functional mechanism of ion channels, since this quantity relates fundamentally to the

conductance of a channel which can be measured experimentally. Potential of Mean Force

(PMF) calculations were introduced by Kirkwood in 1935 (59) as a means of obtaining free

energy results in liquids, and this technique has become central in the treatment of transition

rates, ion transport and reaction dynamics in general. PMF calculations have become a

benchmark by which to judge the quality of MD calculations (60). A recent review of PMF

results for various cations in gramicidin (61) has demonstrated that semi-quantitative

agreement between experiment and calculation can be attained, but it also highlights the

challenges faced by theorists in the treatment of gramicidin. Since this is such a small

molecule with a very narrow pore, changes in dielectric constant occur over the space of a

few atoms, and polarizability has a large influence over electrostatic properties.

Polarizability is generally not included within standard molecular mechanics force fields, and

the treatment of the dielectric constant (an intrinsically macroscopic quantity) is also

problematic at the atomic scale. This review (61) also showed that the single-file water chain

in the lumen does a surprisingly good job of stabilizing the ion, providing about half the ion’s

bulk hydration free energy even though the ion loses 5 of its 7 solvating water molecules

upon entry into the pore.

Solvation and H-bonding play a strong role in modulating the conduction of ions and

protons in particular along water chains (62-65). Studies of water wires in an electrostatic

field (66) and in the water-transporting channel aquaporin have also shown that the

electrostatic environment modulated by the global conformation of the protein strongly

influences the conductive properties of its water wire (67, 68). In the case of gramicidin A,

hydrogen-bonding between lumen water molecules and backbone carbonyl groups is thought

to play a significant role in organizing the water wire within the channel to provide surrogate

solvation to the hydrated ion (54, 55, 69, 70). Protons in particular are very sensitive to the

orientation of nearby water molecules, as H+ ions conduct by hopping from one water

molecule to the next, which can only happen if an oxygen atom is oriented toward the excess

proton. This is known as the Grotthuss mechanism of proton transport (70-72), which

recognizes that the conduction of a proton along the length of a water wire necessitates the

reorientation of the entire wire before another proton can be transmitted from the same end of

the wire. Not only are carbonyl oxygen atoms well suited to both hydration and mobility of

protons in gramicidin (71), but they also assist in the reorientation step of the Grotthuss

mechanism (70).

The surrogate solvation of cations by carbonyl oxygens in the gramicidin backbone

has a long history of study. A peptide-plane libration mechanism was first proposed on the

basis of experimental conductance measurements (73). A normal mode analysis (NMA)

study concluded there was a band of short wavelength (high frequency) modes between 75

cm⁻¹ and 175 cm⁻¹ which represent librational motions of the peptide planes (74). Early MD

studies concluded that the flexibility of the gA channel modulates its conductivity, and

suggested that picosecond librations of the carbonyl moieties lining the pore were coupled to

the fluctuation of water molecules and of ions in the lumen (75). Tian and Cross have

reviewed the experimental evidence for carbonyl tilting in gA (76), and NMR studies have

provided experimental evidence of peptide plane librations (77, 78) demonstrating that

motions of the backbone occur on the same time scale as cation translation through the

channel. Powder-pattern NMR revealed picosecond librations (78), while ¹⁵N T₁ relaxation

measurements indicated a nanosecond timescale (77), although this slower result was

interpreted as the effect of damping by slower correlated motions. Recent MD studies have

also computed the amplitude of these librations (72), finding significant agreement with the

amplitudes measured by NMR. The frequency of carbonyl librations has also been measured

by far-infrared spectroscopy (79, 80) and found to be in general agreement with the NMA

results reported by Roux and Karplus (74).

In gramicidin the collective structural fluctuations of the entire backbone influence

the more localized dynamics of the water wire, with significant functional impact. It has

been demonstrated that the flexibility of the gA backbone in general influences its ionic

conduction properties (75). Specifically, the local perturbations of channel structure caused

by the dioxolane linker in the SS and RR analogs of gramicidin have a significant impact on

the channel’s conductivity; experimental investigations have shown that single channel

conductance is 2-4 times greater in the SS-linked dimer than in the RR analog

(53), and is strongly influenced by the membrane environment (51, 81, 82). Moreover,

experiments on the concentration-dependence of proton mobilities have suggested that

channel conductance is modified by structural differences in the protein which affect the

organization of water and hydrogen-bonds in the lumen (83). Computational studies of the

linked analogs have revealed that the RR dimer has 4 conformational states not present in the

native or SS channels (84), which are defined by the orientation of the two dioxolane

carbonyl groups pointing in or out of the channel. This study showed that while all the

carbonyl groups of gA and the SS-linked channels undergo unimodal librational motions

with RMS fluctuations on the order of 15°, the dioxolane carbonyl groups of the RR-linked

channel undergo multimodal switching transitions of ~50°, largely to compensate for the

distortion of secondary structure caused by the RR-linker. Furthermore, thermally activated

transitions between these four states were shown to limit the movement of protons through the

channel lumen (72), by coupling proton translocation to the conformational transitions of the

dioxolane linker.

This coupling between fast (localized) modes of water orientation and slow

(distributed) collective modes of the protein is of central relevance to biomolecular function,

and necessitates the quantitative description of both fast and slow collective motions in this

and other biomolecular systems. We note that the computational studies listed above were

based on simulations without a membrane; the hierarchy of local and global modes they

revealed raises the question of the membrane’s role in this hierarchy, which has motivated our

inclusion of a GMO membrane in the current study.

1.3 Summary and Overview

The gramicidin channel is a transmembrane protein with a long history of

computational and experimental study, and serves as an appropriately simple system for the

development of analytical techniques to quantitatively characterize its dynamics. Molecular

dynamics simulations provide a large data set which describes the time evolution of

molecular structure at finite temperatures and in a biologically relevant environment such as

the hydrated membrane. MD provides a natural progression of the structural biologist’s

strategy for understanding molecular biology, by including dynamics in the mechanistic

explanations which link structure to function. In Chapter 2 we describe the MD technique in

more detail, and explain how this data set is amenable to study by multivariate statistical

techniques such as Principal Component Analysis (PCA), which can extract the structure of

collective modes from atomistic dynamics. By and large, PCA has been used on MD data

sets in its most basic form, but we show that the history of PCA in atmospheric science

offers a rich diversity of strategies for improving the ability of PCA to

calculate physically meaningful modes of motion from large, noisy data sets. In Chapter 3

we establish a few quantitative measures of convergence which underwrite the reliability and

statistical meaning of PCA results computed from finite simulation times. In Chapter 4 we

describe how access to the full time-ordered trajectory of an empirical simulation cast onto

the Principal Components offers insights into the statistical mechanics and dynamics of

collective motions in the biophysical state. Here we undertake a study of the Gaussian and

non-Gaussian characteristics of collective motions, which offers insight into the anharmonic

properties of the free energy landscape, and we suggest a few reasons to believe that these

are essential to understanding biologically important dynamics. In Chapter 5 we study the

spatial structures (eigenvectors) of the principal components and propose a simple

transformation which reduces the leading 25 PCs to 4 collective modes with much simplified

structure. Finally, in Chapter 6 we apply some of our PCA techniques to the study of annular

lipids surrounding the gramicidin channel in a membrane, in an attempt to find simplified

structure in the noisy dynamics of solvating lipid molecules.

Chapter 2: Theory and Methods

2.1: Molecular Dynamics (MD)

2.1.1: Background

Molecular Dynamics is an algorithm routinely used to simulate the motion of many

chemically bonded atoms in the classical approximation. It is classical in the sense that it

does not model electrons, whose dynamics are intrinsically quantum mechanical and occur

on timescales much faster than the motion of the nuclei. The simulation of atomic dynamics

is achieved by means of a “molecular mechanics force field”, which is comprised of a set of

interaction energies which depend exclusively on nuclear coordinates and atomic type (which

can also depend on the molecular group – i.e. a carbon atom is described differently in

methyl and carboxyl groups). These functions are empirically parameterized by integrating

over all electronic degrees of freedom in a high level quantum mechanical calculation,

yielding a small number of parameters, within a few simplified functional forms, which

describe bonded and non-bonded interactions. The spatial derivative of these functions

yields force, and the vector sum across all force components yields the total force on any

atom at a single moment in time. There are a number of parameterizations available for MD

of biological molecules, the most common of which are CHARMM (85, 86), AMBER (87,

88), GROMOS (89-91) and OPLS (92).

The sum of all interaction energies is the total potential energy U and is split into

bonding and non-bonding terms. An essential part of the MD algorithm is keeping track of

which atoms are bonded to which other atoms as nearest neighbours (1-2 interactions) or

next-nearest neighbours (1-k interactions, with k < 5). This effectively encodes the primary

structure of the molecule to be simulated. There are 5 possible through-bond interaction

terms for each atom in the CHARMM potential, which account for bond stretching, bond-

angle deformation, and bond torsion:

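\[
\begin{aligned}
U_{\mathrm{bond}} &= k_b\,(b_{ij}-b_0)^2 \\
U_{\mathrm{angle}} &= k_\theta\,(\theta_{ij}-\theta_0)^2 \\
U_{\mathrm{UB}} &= k_u\,(u_{ij}-u_0)^2 \\
U_{\mathrm{improper}} &= k_\omega\,(\omega_{ij}-\omega_0)^2 \\
U_{\mathrm{dihedral}} &= k_\phi\,[1+\cos(n\phi_{ij}-\delta)]
\end{aligned}
\tag{2.1}
\]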

Variables marked with the ij subscript denote that the interaction is between atoms i and j.

These are the dynamic variables in the simulation, and are all functions of atomic positions.

The distance between atoms separated by a single covalent bond is b (1-2 interactions), the

angle at the junction between two bonds is θ, and u is the distance between the two atoms

joined at such a junction (this is the Urey-Bradley term UUB, which is the cross-term in angle

bending involving next-nearest neighbours, i.e. 1-3 interactions). Torsions are created by

series of three bonds (1-4 interactions): the dihedral angle between the bonds on either end

with respect to the axis of the middle bond is φ. Udihedral is the first term of a Fourier

expansion, where n denotes the periodicity of the function and δ is the phase. Improper

dihedral angles are also written as 1-4 interactions, where the torsion around a bond which

branches into three bonds is ω.
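
As a rough illustration of how the terms of Eq. 2.1 are evaluated from atomic positions, the following Python sketch computes each contribution for a hypothetical three-atom fragment. The function names, coordinates and force constants are invented for illustration only; they are not CHARMM22 parameters and this is not the CHARMM implementation.

```python
import math

def harmonic(k, x, x0):
    """k*(x - x0)**2: the form shared by the bond, angle, Urey-Bradley
    and improper terms of Eq. 2.1 (note there is no factor of 1/2)."""
    return k * (x - x0) ** 2

def dihedral(k_phi, n, phi, delta):
    """k_phi*[1 + cos(n*phi - delta)]: the periodic torsion term of Eq. 2.1."""
    return k_phi * (1.0 + math.cos(n * phi - delta))

def bond_angle(a, b, c):
    """Angle in radians at atom b between the bonds b-a and b-c."""
    ba = [p - q for p, q in zip(a, b)]
    bc = [p - q for p, q in zip(c, b)]
    cos_t = sum(p * q for p, q in zip(ba, bc)) / (math.hypot(*ba) * math.hypot(*bc))
    return math.acos(max(-1.0, min(1.0, cos_t)))

# Hypothetical fragment (coordinates in Å) with made-up parameters.
a, b, c = (1.1, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)
theta = bond_angle(a, b, c)                            # angle at the junction
u_bond = harmonic(300.0, math.dist(a, b), 1.0)         # 1-2 bond stretch
u_angle = harmonic(50.0, theta, math.radians(100.0))   # angle bend
u_ub = harmonic(10.0, math.dist(a, c), 1.5)            # 1-3 Urey-Bradley distance
u_tors = dihedral(0.2, 3, math.radians(180.0), 0.0)    # threefold torsion at trans
print(u_bond, u_angle, u_ub, u_tors)
```

The only dynamic inputs are the positions; every other quantity is a fixed parameter, which is what makes these terms cheap to evaluate at every time step.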

The two non-bonded terms in the CHARMM potential account for electrostatic and

electrodynamic (dispersion, or van der Waals) interactions:

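\[
\begin{aligned}
U_{\mathrm{elec}} &= \frac{q_i q_j}{D\,r_{ij}} \\
U_{\mathrm{VdW}} &= \varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - 2\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]
\end{aligned}
\tag{2.2}
\]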

where rij is the distance between atom i and every atom j within a given cut-off radius for the

electrical or van der Waals interactions (cut-offs are independently adjustable parameters

when setting up the simulation, as is the dielectric constant D). The non-bonded forces are

only applied to atom pairs separated by at least three bonds. These interactions are

computationally much more demanding than the bonded interactions, since they include a

sum across a large number of atoms within the cut-off radius at each site.
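
The cost of the non-bonded sum can be seen in a minimal Python sketch of Eq. 2.2, written here as a plain double loop over pairs with a single cut-off. Several details are assumptions of the sketch rather than statements about CHARMM: the Coulomb conversion constant (needed to obtain kcal/mol from charges in units of e and distances in Å, and not written explicitly in Eq. 2.2), the treatment of ε as a positive well depth, the use of one cut-off for both terms, and all names and numerical values. The combination rules follow the definitions of ε_ij and σ_ij given in the text.

```python
import math

COULOMB = 332.0716  # assumed conversion factor, kcal*Å/(mol*e^2)

def u_elec(qi, qj, r, D=1.0):
    """Electrostatic term of Eq. 2.2 for one pair."""
    return COULOMB * qi * qj / (D * r)

def u_vdw(eps_ij, sigma_ij, r):
    """Van der Waals term of Eq. 2.2; its minimum is -eps_ij at r = sigma_ij."""
    x = (sigma_ij / r) ** 6
    return eps_ij * (x * x - 2.0 * x)

def nonbonded_energy(coords, charges, eps, sig, cutoff, excluded=frozenset(), D=1.0):
    """Sum both terms over all pairs inside the cut-off.
    `excluded` holds index pairs (i, j), i < j, that are too close in the
    bonding graph to interact through the non-bonded terms."""
    total = 0.0
    for i in range(len(coords) - 1):
        for j in range(i + 1, len(coords)):
            if (i, j) in excluded:
                continue
            r = math.dist(coords[i], coords[j])
            if r > cutoff:
                continue
            eps_ij = math.sqrt(eps[i] * eps[j])   # geometric mean of well depths
            sig_ij = 0.5 * (sig[i] + sig[j])      # average of the two radii
            total += u_elec(charges[i], charges[j], r, D) + u_vdw(eps_ij, sig_ij, r)
    return total

# Made-up example: two nearby atoms and one beyond the cut-off.
coords = [(0.0, 0.0, 0.0), (3.5, 0.0, 0.0), (0.0, 20.0, 0.0)]
total = nonbonded_energy(coords, [0.4, -0.4, 0.0], [0.1] * 3, [3.5] * 3, cutoff=12.0)
print(total)
```

A production code would replace the double loop with neighbour lists and, for electrostatics, a lattice-sum method; the loop is kept here only because it makes the pairwise structure of the sum explicit.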

All other variables in Eq. 2.1 (kb, kθ, ku, kω, kφ, b0, θ0, u0, ω0, φ0, n, δ) and Eq. 2.2 (qi,

qj, εij, σij) are parameters established by theoretical calculations or empirical calibration for

each atomic kind in the appropriate chemical environment. Those named k are all spring

constants (kcal/mol·Å² or kcal/mol·rad²) for a given harmonic interaction, and those with

subscript 0 denote equilibrium distances (Å) and angles (radians). qi and qj are the partial

charges on atoms i and j, σij is the average of atomic radii i and j (Å), and εij is the square

root of the product εiεj, denoting the strength of the van der Waals interaction for a given

atom (kcal/mol). The units stated here are those used in CHARMM. It is important to note

the assumption underlying the validity of implementing any molecular mechanics force field:

equilibrium parameters derived for smaller molecular subunits must scale appropriately

across the much larger biomolecules in an MD simulation, and also hold across a relatively

broad and high temperature range (compared to zero-temperature quantum calculations).

Computing the potential energy is the purely spatial part of the MD algorithm. To

propagate dynamics the acceleration a of each atom is determined from the spatial derivative

of the internal energy by Newton’s Second Law F = -∇U = ma, where m is the mass of a given

atom, and ∇ is the vector gradient operator (variables in bold are vectors with x, y and z

components). This is achieved by choosing a time step δt (typically 0.5-2 fs) which is short

enough to assume constant acceleration through its duration, such that the position r can be

determined at the next time step:

\[
\mathbf{r}_i(t+\delta t) = \mathbf{r}_i(t) + \mathbf{v}_i(t)\,\delta t + \tfrac{1}{2}\,\mathbf{a}_i(t)\,\delta t^2 \tag{2.3}
\]

This is a set of 3N simultaneous equations where i=1…N labels each atom in the system, and

can be solved numerically even for large N. The explicit inclusion of the velocities vi(t) in

the equations of motion (2.3) can be circumvented by including one step back in time ri(t-δt):

\[
\mathbf{r}_i(t-\delta t) = \mathbf{r}_i(t) - \mathbf{v}_i(t)\,\delta t + \tfrac{1}{2}\,\mathbf{a}_i(t)\,\delta t^2 \tag{2.4}
\]

The sum of Eqs. 2.3 and 2.4 eliminates vi(t) and yields the Verlet algorithm:

\[
\mathbf{r}_i(t+\delta t) = 2\,\mathbf{r}_i(t) - \mathbf{r}_i(t-\delta t) + \mathbf{a}_i(t)\,\delta t^2 \tag{2.5}
\]

Note that to create the initial two time steps ri(t-δt ) and ri(t), Eq. 2.3 must be solved at the

first time step by specifying an initial set of velocities at the beginning of the simulation, in

addition to initial positions. These are chosen from a Maxwell-Boltzmann distribution, and

their evolution in time determines the trajectory of kinetic energy in the system, and therefore

the temperature.
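
The bootstrapping described above (Eq. 2.3 for the first step, Eq. 2.5 thereafter) can be sketched in a few lines of Python. The test system is a hypothetical one-dimensional harmonic oscillator in arbitrary units, chosen only because its exact solution is known; the function name and all parameter values are illustrative.

```python
import math

def verlet(x0, v0, accel, dt, n_steps):
    """Integrate one coordinate for n_steps time steps.
    The first step uses Eq. 2.3, which needs the initial velocity;
    every later step uses the velocity-free Verlet update of Eq. 2.5."""
    xs = [x0, x0 + v0 * dt + 0.5 * accel(x0) * dt ** 2]
    for _ in range(n_steps - 1):
        x_prev, x = xs[-2], xs[-1]
        xs.append(2.0 * x - x_prev + accel(x) * dt ** 2)
    return xs

# Hypothetical test system: U = k x^2 / 2, so a = -(k/m) x.
k, m = 1.0, 1.0
dt, n_steps = 0.01, 1000
traj = verlet(1.0, 0.0, lambda x: -(k / m) * x, dt, n_steps)
exact = math.cos(math.sqrt(k / m) * dt * n_steps)  # analytic position at t = 10
print(traj[-1], exact)
```

In a real simulation the initial velocity of each atom would be drawn from the Maxwell-Boltzmann distribution instead of being set to zero, and the acceleration would come from the gradient of the full force field.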

The inclusion of temperature is essential in MD, since it allows for the calculation of

free energies G rather than simply evolving dynamics on a potential energy surface (at zero

temperature), which describes only the enthalpy H. The temperature algorithm – the

inclusion of thermal fluctuations – controls the distribution of states which is selected on this

potential surface, hence the time-ordered trajectory generated by MD includes the entropy S.

It is dynamics at a finite temperature that make MD simulations an object of interest to

biochemists and molecular biologists, by describing dynamics on the free-energy surface,

dG=dH-TdS.

There are three main temperature regulation algorithms: Berendsen (93), Nosé-

Hoover (94, 95), and Langevin dynamics (96). The Berendsen thermostat (93) is the

simplest, in that it rescales all velocities at every step (or few steps) of the simulation such

that the average kinetic energy is maintained at the desired temperature. This has been

shown to give rise to some problematic artifacts which violate equipartition of energy; the

correct average energy is maintained but becomes distributed asymmetrically across all

degrees of freedom. A canonical example of this is known as the “flying ice cube” (97),

where energy from internal degrees of freedom is channeled into translation and rotation

about the centre of mass of the whole system. If different components of a simulation such

as the water, the protein, or the membrane are governed by different relaxation rates, the

kinetic energy may be deposited largely in one of these phases over another (98). This is

clearly problematic in simulations which make use of aqueous or oil-phase solvent.
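
A minimal Python sketch of this kind of velocity rescaling is given below. It uses the standard weak-coupling scaling factor λ = [1 + (δt/τ)(T₀/T − 1)]^(1/2), which is not written out in the text, and it assumes reduced units (k_B = 1), 3N unconstrained degrees of freedom, and made-up masses and velocities.

```python
import math

def temperature(masses, velocities, k_b=1.0):
    """Instantaneous temperature from equipartition: sum(m v^2) = N_dof k_B T.
    Reduced units and N_dof = 3N are assumptions of this sketch."""
    twice_ke = sum(m * sum(c * c for c in v) for m, v in zip(masses, velocities))
    return twice_ke / (3 * len(masses) * k_b)

def berendsen_rescale(masses, velocities, t_target, dt, tau):
    """Scale every velocity by the same factor lambda (weak-coupling form)."""
    t_now = temperature(masses, velocities)
    lam = math.sqrt(1.0 + (dt / tau) * (t_target / t_now - 1.0))
    return [[lam * c for c in v] for v in velocities]

# Made-up system in which one light atom carries most of the kinetic energy.
masses = [1.0, 16.0, 12.0]
velocities = [[2.0, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, -0.2]]
t_before = temperature(masses, velocities)
scaled = berendsen_rescale(masses, velocities, t_target=1.0, dt=0.002, tau=0.1)
t_after = temperature(masses, scaled)
print(t_before, t_after)
```

Because every velocity is multiplied by the same factor, the rescaling step itself never moves energy from one degree of freedom to another: an uneven distribution of kinetic energy is carried through each rescaling unchanged in proportion, which is consistent with the artifacts described above.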

The Nosé-Hoover (94, 95) thermostat solves this problem by coupling each degree of

freedom in the atomic ensemble independently to another “virtual” degree of freedom,

coupling an oscillator of a specific mass to each particle in the system. By introducing a set

of virtual oscillators which exchange energy with each real particle in the system, the kinetic

energy is redistributed uniformly through the system. This method also has the advantage

that different parts of the solute/solvent ensemble may be coupled independently to different

heat baths; this can be helpful in simulations of membrane-bound proteins, since proteins and

membranes have very different relaxation times and may benefit from different values for the

virtual masses attached to them.

Langevin dynamics (96) explicitly includes friction as well as stochastic perturbations

directly in the differential equations of motion, i.e. Newton’s Second Law is written including a

velocity-dependent friction term with a damping constant γ, and a stochastic term R(t):

\[
\mathbf{F} = m\mathbf{a} = -\nabla U - \gamma m\,\mathbf{v}(t) + (2\gamma m k_B T)^{1/2}\,\mathbf{R}(t) \tag{2.6}
\]

where k_B is Boltzmann’s constant, T is temperature, ⟨R(t)⟩ = 0 and ⟨R(t) R(t’)⟩ = δ(t-t’).

Here δ is the Dirac delta function, and these definitions make R(t) an uncorrelated Gaussian

process. While the Langevin approach performs better than the Berendsen or Nosé-Hoover

thermostats, it is not possible to implement with constant surface tension and pressure with

the algorithms currently available in CHARMM. Since these are necessary for simulations

of the GMO membrane, the Nosé-Hoover thermostat was used in the simulations described

below.
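
To make the roles of the friction and noise terms in Eq. 2.6 concrete, the following Python sketch integrates an ensemble of independent one-dimensional harmonic oscillators and checks that the velocities settle at the target temperature. It uses the simplest possible discretization (Euler-Maruyama), which is an assumption of this sketch and not the integrator used in CHARMM; the reduced units, parameter values and function name are likewise illustrative.

```python
import numpy as np

def langevin_step(x, v, force, m, gamma, kT, dt, rng):
    """One Euler-Maruyama step of Eq. 2.6.
    Dividing Eq. 2.6 by m and integrating the white noise over dt gives a
    random velocity kick with standard deviation sqrt(2*gamma*kT*dt/m)."""
    kick = np.sqrt(2.0 * gamma * kT * dt / m) * rng.standard_normal(x.shape)
    v_new = v + (force(x) / m - gamma * v) * dt + kick
    x_new = x + v * dt
    return x_new, v_new

rng = np.random.default_rng(0)
m, k, gamma, kT, dt = 1.0, 1.0, 1.0, 1.0, 0.01   # reduced units (assumed)
x = np.zeros(20000)                              # 20000 independent oscillators
v = np.zeros(20000)                              # all start cold
for _ in range(1000):                            # ten units of time
    x, v = langevin_step(x, v, lambda q: -k * q, m, gamma, kT, dt, rng)

# Equipartition: <m v^2> should approach kT once the ensemble has equilibrated.
print(np.mean(m * v ** 2) / kT)
```

The friction term removes kinetic energy while the stochastic term supplies it, and the prefactor of R(t) in Eq. 2.6 is exactly what balances the two at temperature T. The explicit scheme used here carries a small time-step-dependent bias in the sampled temperature, which is one reason production codes use more careful integrators.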

2.1.2: Simulation of gA/SS/RR in a GMO Membrane

All simulations described in this work were carried out using the CHARMM 31.1

molecular dynamics package (85) with the TIP3P water model (99) and the CHARMM22

force field (86) for all other atoms. The parameters for the dioxolane linker in the SS and RR

analogs were developed in a previous study (84), by fitting geometry, vibrational

frequencies, and energy to the results of ab initio calculations. The calculations were

performed using Gaussian 98 (Rev. A.9) together with the RHF/6-311G** level of theory, on

both the [1,3]dioxolane fragment alone and on (RR)[1,3]dioxolane-4,5-dicarboxylic acid bis-

methyl-amide (a linked dipeptide). A crystalline array of 122 glycerol-1-mono-oleate

(GMO) molecules was arranged with 5 Å spacing in a bilayer configuration with 3210 water

molecules, and allowed to relax for 100 ps using Langevin dynamics in a cubic box with 50

Å sides. Then a cylindrical hole was created in the centre of the membrane and a gA dimer

was inserted, whose structure was obtained from simulations inside a phospholipid bilayer

(34) with harmonic restraints on selected side chains (72). The same initial structure was

used for the RR and SS analogs, with the addition of the linker and removal of any GMO

molecules in conflict with it. The entire system was then equilibrated at 300 K with strong

harmonic restraints (100 kcal/mol/Å²) on all heavy atoms of the channel for 0.2 ns, then with

moderate restraints (5 kcal/mol/Å²) for 0.8 ns, and finally with no restraints for 1 ns before

the production runs, as described below.

Two sets of simulations of the gA molecule were carried out to probe long- and

short-time dynamics of the gramicidin dimer. One 64 ns production run probed longer time-

scale dynamics using a 2 fs time step and saving coordinates every 200 fs, using the SHAKE

algorithm (100) to constrain stretching of covalent bonds involving hydrogen. Another 10 ns

production run had no bond constraints within the protein and a 0.5 fs time step (saving every

10 fs) to probe hydrogen dynamics at shorter time scales and to yield accurate PCA

eigenvalues at the shortest spatial scales. Only the 64 ns simulations with 2 fs timesteps were

carried out for the SS and RR molecules.
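
The size of the two gA data sets follows directly from the run lengths, time steps and saving intervals quoted above. The short Python calculation below makes the arithmetic explicit; the helper name is ours, and exact fractions are used only to avoid rounding with the 0.5 fs time step.

```python
from fractions import Fraction

def run_size(length_ns, timestep_fs, save_every_fs):
    """Return (integration steps, saved frames) for one production run."""
    length_fs = Fraction(length_ns) * 10 ** 6      # 1 ns = 10^6 fs
    steps = length_fs / Fraction(timestep_fs)
    frames = length_fs / Fraction(save_every_fs)
    return int(steps), int(frames)

long_run = run_size(64, 2, 200)        # 64 ns, 2 fs step, saved every 200 fs
short_run = run_size(10, "0.5", 10)    # 10 ns, 0.5 fs step, saved every 10 fs
print(long_run, short_run)
```

The 64 ns run therefore comprises 32 million integration steps and 320,000 saved frames, while the 10 ns run comprises 20 million steps and one million saved frames, so the shorter run actually yields the larger number of time points for analysis.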

The leapfrog Verlet algorithm was used to propagate dynamics with constant

surface tension and normal pressure on the membrane based on the Parrinello-Rahman

barostat as described in Ref. (101), with a piston mass of 500 amu and a 5 ps coupling

constant. The surface tension was kept constant at a value of zero, since the application of a

finite external pressure has been shown to be unnecessary for GMO bilayers (102). The area

per lipid was stable at 0.25 nm², in quantitative agreement with a previous study (102), upon

equilibration of the membrane with gA inserted. The Nosé-Hoover algorithm was used to

control temperature at 300 K with a thermal piston mass of 1000 kcal·ps² and 5 ps coupling

constant. The simulations were carried out with tetragonal periodic boundary conditions,

updating the crystal parameters for box length every 200 ps. We used the particle-mesh

Ewald (PME) summation method, with a width κ=0.3 Å⁻¹ and grid point spacing of 1.0 Å.

The Lennard-Jones interactions used a force switching function from 10 to 12 Å, with a cut-

off at 14 Å.

2.2: Principal Component Analysis (PCA)

2.2.1: Background

Principal Component Analysis (PCA) is a multivariate statistical analysis for the

reduction of high-dimensional data sets onto collective coordinates. It is related to singular

value decomposition (SVD), which aims to extract only the largest components from data

sets with prohibitively high dimensionality (103, 104). PCA was originally developed to

identify the directions of most variation in data from the social sciences (105, 106) and

meteorology (107). In these disciplines the relationships between measured variables are

complex and non-intuitive, and PCA solved the problem of finding appropriate linear

combinations which capture a large fraction of the variance across the data set. PCA has

long been used in a number of disciplines concerned with the study of noisy, many-particle

trajectories. For the continuous case considered in the study of turbulence, the technique is

called Proper Orthogonal Decomposition (POD) (108), and in climatology the technique is

called Empirical Orthogonal Functions (EOF) (107).

PCA can be applied to the time trajectory of a collection of moving particles, and has

become a well-established technique for extracting collective modes of displacement from

atomistic MD trajectories (109, 110). The application of PCA to protein dynamics was

pioneered by García (111), whose study demonstrated that there are multi-modal

distributions of PCs along a simulated protein trajectory, and hence any harmonic

approximation of protein dynamics fails to capture the essential features of their collective

dynamics. Since it is often the large-amplitude motions which are of interest to biochemists,

PCA can afford significant data reduction by concentrating a large fraction of a system’s total

fluctuations into a small fraction of the collective motions. To this end, there have been

many studies of the largest few PCs of protein motion (18, 111-120). Application of PCA to

proteins has also been called "essential dynamics" (112, 116, 118), where it is argued that

only the largest non-Gaussian distributed PCs are sufficient to account for the functional

dynamics of a protein (112).

Consider a trajectory of N atoms in time $R_i(t)=(x_i(t), y_i(t), z_i(t))$ and let $r(t)$ represent

any one of $x_i(t)$, $y_i(t)$ or $z_i(t)$, where $i=1,2,\ldots,N$ and $t=1,2,\ldots,T$, with T equal to the duration of

the trajectory. To study only the internal dynamics of a protein, it is conventional to align

each snapshot in the trajectory to the time-averaged structure, thereby eliminating translation

and rotation of the entire molecule from the trajectory. The mass-weighted covariance

matrix is

$$C_{ij} = \langle M_{ij}\,\Delta r_i\,\Delta r_j \rangle, \tag{2.7}$$

where $M_{ij} = M_i^{1/2} M_j^{1/2}$ and $\Delta r_i = r_i(t) - \langle r_i(t) \rangle_t$ is the change of position from the time-averaged

structure, for each spatial component of all atoms i and j included in the analysis. Note that

Eq. 2.7 is a product of all coordinates, and yields a 3N × 3N matrix with 3N-6 non-zero eigenvalues (the

six degrees of freedom representing translation and rotation of the molecule having been removed by the

alignment) rather than the dot product $\Delta R_i \cdot \Delta R_j$, which would yield an N × N matrix. Diagonalization

of the covariance matrix C solves the eigenvalue problem

$$C\,v = \sigma^2 v \tag{2.8}$$

and yields a set of eigenvalues $\sigma_k^2$ and eigenvectors $v_k$, where $k=1,2,\ldots,3N-6$.
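Equations 2.7 and 2.8 can be sketched numerically as follows. This is a minimal illustration rather than the analysis code used in this work: it assumes the snapshots have already been aligned to the average structure, it uses NumPy, and the function name, array layout and synthetic test data are assumptions made for the example.

```python
import numpy as np


def mass_weighted_pca(coords, masses):
    """PCA of an aligned trajectory (hypothetical helper).

    coords : array of shape (T, N, 3), snapshots already fitted to the
             time-averaged structure (the alignment step is NOT done here).
    masses : array of shape (N,), atomic masses.
    Returns eigenvalues (descending), eigenvectors (as columns) and the
    mass-weighted displacements used to build the covariance matrix.
    """
    n_frames, n_atoms, _ = coords.shape
    x = coords.reshape(n_frames, 3 * n_atoms)
    dx = x - x.mean(axis=0)                   # displacement from the time average
    sqrt_m = np.sqrt(np.repeat(masses, 3))    # M_i^(1/2) for each Cartesian component
    dq = dx * sqrt_m                          # mass-weighted displacements
    cov = dq.T @ dq / n_frames                # Eq. 2.7: <M_ij dr_i dr_j>
    evals, evecs = np.linalg.eigh(cov)        # Eq. 2.8 (eigh returns ascending order)
    return evals[::-1], evecs[:, ::-1], dq


# Synthetic stand-in for a trajectory: 500 frames of 4 "atoms".
rng = np.random.default_rng(0)
coords = rng.normal(size=(500, 4, 3))
masses = np.array([12.011, 14.007, 15.999, 1.008])

evals, evecs, dq = mass_weighted_pca(coords, masses)
projections = dq @ evecs                      # projection onto every eigenvector
# The variance of each projection should match the corresponding eigenvalue.
print(np.allclose(projections.var(axis=0), evals))
```

Because the synthetic data are unaligned random numbers, all 3N eigenvalues are non-zero in this toy case; for a real trajectory fitted to its average structure, the six modes associated with overall translation and rotation would have vanishing eigenvalues.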

Each eigenvector vk represents a principal component of displacement, and may be

visualized as a set of N three-dimensional vectors attached to the N atoms analyzed within

the molecule. Each of these 3D vectors describes the magnitude and direction of the RMS

fluctuations at a given atom, within a given PC. The MD trajectory can be projected onto

each eigenvector by forming the dot product of atomic displacements with each eigenvector

for all time steps. The resulting distribution of each projection has a variance

$\sigma_k^2$ (and standard deviation $\sigma_k$); this is the physical meaning of the eigenvalues, which

measure the spatial amplitude of each PC across the full trajectory.
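The statement that each projection has a variance equal to its eigenvalue follows in one line from Eqs. 2.7 and 2.8. In the sketch below, $\sigma_k^2$ denotes the k-th eigenvalue of Eq. 2.8, $\Delta q(t)$ is shorthand introduced here (not in the original text) for the vector of mass-weighted displacements $M_i^{1/2}\Delta r_i(t)$, $p_k(t)$ denotes the projection onto the k-th eigenvector, and the eigenvectors are assumed to be normalized:

```latex
\begin{aligned}
p_k(t) &= v_k^{T}\,\Delta q(t), \qquad
\langle p_k \rangle_t = v_k^{T}\,\langle \Delta q \rangle_t = 0,\\
\langle p_k^{2} \rangle_t &= v_k^{T}\,\langle \Delta q\,\Delta q^{T} \rangle_t\, v_k
  = v_k^{T}\, C\, v_k
  = \sigma_k^{2}\, v_k^{T} v_k
  = \sigma_k^{2}.
\end{aligned}
```

The same argument gives $\langle p_k p_l \rangle_t = 0$ for $k \neq l$, so the projections onto different eigenvectors are mutually uncorrelated over the trajectory.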

2.2.2: PCA and Protein Dynamics

Large collective displacements may be used to study conformational changes, and

these are often the best characterized examples of functional motion in proteins. PCA can be

used to compute the RMS fluctuations along the protein backbone, and has been particularly

successful in identifying large concerted motions which may be related to function.

Examples include the hinge-bending motion of thermolysin (118), identification of ligand

binding sites in Cu Plastocyanin and Azurin (114), and regions of hydrogen exchange in

cytochrome c (117). The technique is also useful for comparing the dynamics of similar

proteins within the same superfamily (18). Table 1 shows a representative sample of proteins

that have been analyzed using PCA (17, 18, 111-114, 116-119, 121, 122), and is focused on

those studies where the emphasis is on novel aspects of PCA in the analysis of protein

dynamics. The table demonstrates that the time scales, solvation and size of the systems

analyzed have varied greatly, and also that most studies have focused on the leading 1 to 3

PCs, although select higher PCs have occasionally been examined for comparison in the

earliest work from the 1990’s.

| Protein | No. of Residues | Solvent | ΔT (ns) | PCs examined | Ref. | Year |
|---|---|---|---|---|---|---|
| Crambin | 46 | Water | 0.24 | 1-5 | (111) | 1992 |
| Lysozyme | 129 | Vac. & Water | 1.00 | 1-10, 20, 50 | (112) | 1993 |
| BPTI | 58 | Vacuum | 0.20 | 1-4, 10, 100 | (116) | 1994 |
| Thermolysin | 319 | Vac. & Water | 0.09 | 1-5, 10, 20, 50 | (118) | 1995 |
| G-Actin | 375 | Water | 0.24 | 1-3 | (121) | 1996 |
| Cytochrome c | 105 | Water | 1.50 | 1-3 | (117) | 1999 |
| Cu Plastocyanin | 99 | Water | 0.80 | 1-3 | (114) | 2001 |
| Azurin | 128 | Water | 0.80 | 1-3 | (114) | 2001 |
| Apo-Adenylate Kinase | 225 | Water | 6.00 | 1-3 | (113) | 2006 |
| Lambda repressor | 80 | Water | 10.00 | 1 | (119) | 2006 |
| Protease family | 119-1023 | Water | 20.00 | 1-3 | (18) | 2006 |
| Myosin II motor head | 744 | Water | 5.00 | 1-20 | (17) | 2007 |
| Rhodopsin | 696 | H2O + lipid bilayer | 100.00 | 1 | (122) | 2007 |

Table 1: A representative list of proteins studied using PCA.

Large-scale global conformation changes are not the only interesting feature of

protein dynamics. While the tertiary and quaternary structural changes may span the entire

protein – and we would expect the largest PCs to capture these motions – individual residues

also have important collective motions at much smaller length scales, and modification of

hydrogen bonds or changes in the structure of a binding pocket in an enzyme occur on even

shorter length scales. Biomolecular processes also span at least 9 orders of magnitude in

time, from femtoseconds (bond vibrations) to milliseconds or even longer (folding). The

covariance eigenvalues of short and fast collective motions are necessarily smaller than those

of long and slow motions, and may even be smaller than the covariance of noisy motions at

larger length scales. Hence these motions may not be represented in the largest set of

principal components, and may even be found among small covariance eigenvectors

normally ascribed to motions arising from thermal noise. Although motions on long

length scales generally occur on long timescales, and motions on short length scales on fast

timescales, there are exceptions to this rule; the flipping of aromatic rings buried in the

core of a protein (123, 124) is one example of a short length scale motion which occurs on

long timescales. However, this only reinforces the need to examine small covariance

eigenvectors to isolate functional motions at small length scales and at whatever timescales

are available within an MD simulation.

Although most PCA studies have focused almost exclusively on large-scale motions

to date, there is nothing intrinsic to PCA that gives more meaning to large covariance

motions than to small ones. In Chapter 4 we conduct a comprehensive analysis of the entire

set of PCs and find non-Gaussian distributed PCs with small covariance eigenvalues. We

argue that these are also ‘essential’ in the same sense as the largest components (112), in that

they span the anharmonic portion of the free energy landscape.

2.2.3: PCA vs. NMA of Proteins

The anharmonic portions of the free energy landscape are interesting for a number of

reasons, and foremost among them is the ability to capture multimodal dynamics which can

describe conformational transitions. However, this is not to say that harmonic portions of the

landscape are unimportant. The calculation of normal modes dates back to the early 1970’s

with the work of Gō and Scheraga (125), who undertook a systematic search of minimum-

energy conformations of cyclo-hexaglycyl (126). Much has been learned since then from

harmonic approximations around an equilibrium structure through normal mode analysis

(NMA) (74, 127-132) and elastic network models (ENM) (133-139). Both of these

techniques are based on a single energy-minimized structure, and do not involve dynamics.

In NMA a molecular force field is chosen (such as CHARMM 22) to calculate the potential

energy of the structure, and by taking a harmonic approximation around every degree of

freedom (see Appendix 1) a list of collective modes is generated and ranked from highest to

lowest (spatial) frequency, and thereby energy. ENM is similar, but involves another

approximation which eliminates the force field entirely by replacing all interactions with a

simple harmonic spring linking each atom to every other atom within a cutoff radius. Hence

NMA and ENM are computationally much less expensive than PCA.

NMA has been used to study protein dynamics in the harmonic approximation for

almost 30 years, since the development of empirical potentials which made it possible to

compute their potential energy landscape (140). Two groups first applied NMA to the study

of a small globular protein, BPTI (127, 128), followed shortly by a comparative study of

trypsin inhibitor, crambin, ribonuclease and lysozyme (129). Typical frequencies resulting

from NMA of these proteins range from 5 to 200 cm⁻¹, and the harmonic approximation

makes the timescales of these motions slower than 10-100 ps. The magnitude of RMS

fluctuations at each atomic site may be investigated as a measure of site flexibility, and

comparisons can be made between NMA analysis and MD simulations (141), or with

experimental data such as neutron scattering (142) and crystallographic B-factors (129).
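The comparison with crystallographic B-factors mentioned above rests on the standard isotropic relation $B = (8\pi^2/3)\langle \Delta r^2 \rangle$ between a B-factor and the mean-square positional fluctuation of an atom. The conversion is sketched below; the function names and the example RMSF values are illustrative assumptions, not data from the cited studies.

```python
import math

# Isotropic Debye-Waller relation: B = (8 * pi^2 / 3) * <dr^2>,
# where <dr^2> is the three-dimensional mean-square fluctuation in Å^2.
PREFACTOR = 8.0 * math.pi ** 2 / 3.0


def rmsf_to_bfactor(rmsf_angstrom):
    """Convert an RMS fluctuation (Å) into an isotropic B-factor (Å^2)."""
    return PREFACTOR * rmsf_angstrom ** 2


def bfactor_to_rmsf(bfactor):
    """Inverse conversion, from B-factor (Å^2) to RMS fluctuation (Å)."""
    return math.sqrt(bfactor / PREFACTOR)


# Hypothetical per-atom RMSF values, e.g. from NMA or an MD trajectory.
example_rmsf = [0.3, 0.5, 1.0]
example_b = [rmsf_to_bfactor(r) for r in example_rmsf]
for r, b in zip(example_rmsf, example_b):
    print(f"RMSF {r:.2f} Å  ->  B = {b:.2f} Å^2")
```

The relation assumes isotropic, harmonic fluctuations, so agreement with experimental B-factors is expected to be qualitative at best when static disorder or anharmonic motion contributes.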

To study domain motion in larger proteins, a coarse-grained approach is desirable as

it simplifies the energy potentials of a particular force field or MD package (e.g. CHARMM,

GROMACS, NAMD) and extracts only the lowest frequency modes. Tirion was the first to

demonstrate that the results of NMA for low frequency modes are insensitive to the details of

any particular MD parametrization (143). The resulting model is known as the Elastic

Network Model (ENM), and many recent studies of large proteins have used this approach

(139, 144-146). A related method is called the Gaussian Network Model (GNM) (134, 147).

GNM has been used to analyze the α-amylase inhibitor (141) and to make comparisons

among homologous proteins within the globin family (148).
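The idea of replacing the force field with identical springs between all sites within a cutoff radius can be illustrated with the scalar connectivity (Kirchhoff) matrix used in Gaussian-network-style models. The sketch below is a toy example under stated assumptions: a five-bead linear chain stands in for a protein, the 3.8 Å spacing and 4.0 Å cutoff are arbitrary choices, and the full anisotropic ENM Hessian (3N × 3N) is not constructed.

```python
import numpy as np


def kirchhoff_matrix(positions, cutoff):
    """Connectivity matrix: -1 for each pair within `cutoff`, contact count on the diagonal."""
    pos = np.asarray(positions, dtype=float)
    # Pairwise distances between all sites.
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    # A "spring" joins every distinct pair closer than the cutoff.
    contact = (dist <= cutoff) & ~np.eye(len(pos), dtype=bool)
    gamma = -contact.astype(float)
    np.fill_diagonal(gamma, contact.sum(axis=1))
    return gamma


# Toy system: 5 beads spaced 3.8 Å apart along x, so only neighbours are in contact.
chain = [(3.8 * i, 0.0, 0.0) for i in range(5)]
gamma = kirchhoff_matrix(chain, cutoff=4.0)

# Eigenvalues play the role of squared mode frequencies (in units of the spring constant).
modes = np.linalg.eigvalsh(gamma)
print(np.round(modes, 3))
```

The lowest eigenvalue is zero (uniform displacement of all beads), and the remaining modes are ordered from the longest to the shortest wavelength along the chain, mirroring the ranking of collective modes described above.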

While most NMA and ENM studies have focused on the longest wavelengths as in

PCA studies, some have studied the shorter wavelengths as well (74, 116, 131). For

example, the normal modes of the binding pocket of wild-type α-lytic protease were found to

have a symmetric character, vibrating in phase to maintain the size of the binding pocket,

while a non-binding mutant had asymmetric modes which resulted in contraction and

expansion of the binding site (131). I have argued in the Introduction that the carbonyl

oxygen atoms lining the lumen of gramicidin show functionally relevant dynamics. An early

NMA study of gramicidin argued that the frequency separation of collective modes spanning

the whole protein (< 50 cm⁻¹) and modes describing amide plane fluctuations involving

carbonyl oxygen atoms (75 to 175 cm⁻¹) rules out coherent librations of many amide planes

together (74). An analysis of different amide planes showed that their motion was

uncorrelated, and perturbation of the hydrogen bonds resulted in only small changes to the

NMA frequencies. The fact that functional features have been found in the short wavelength

regime of NMA justifies the examination of the same regime in PCA of simulated MD

trajectories, where the results include the influence of an anharmonic molecular force field as

well as temperature and solvent effects.

It is worth noting a few similarities and differences between PCA and NMA. Both

techniques yield a set of eigenvectors whose components describe the directions and

magnitudes of atomic displacements across the molecular structure, and associated

eigenvalues which describe the spatial amplitude of these eigenvectors. However, PCA

derives its eigenvectors from a time trajectory of all atoms, and as such allows for the

exploration of phase space which gives rise to entropic forces such as the hydrophobic effect.

MD simulations also include a thermostat which regulates the system at a finite temperature,

thereby including the entropic component of free energy. While the entropy can also be

approximated from the curvature of the energy minimum in NMA, this is exact only in the

zero temperature limit where the results of PCA and NMA converge. On the other hand,

PCA describes a dynamic molecular structure at finite temperature, which allows for

thermally activated transitions over energy barriers. Hence the harmonic approximation

lacks the ‘essential’ anharmonicity of an atomistic force field which allows for

conformational changes on a multimodal free energy landscape. While analytical methods

that include entropic terms arising from multiple conformational basins have also been

developed (149), again these approximations are only valid in the low-T limit.

The influence of temperature as well as solvent makes biomolecular dynamics

intrinsically dissipative and over-damped. This makes it problematic to translate the spatial

wavelengths from NMA into temporal frequencies (thereby imitating dynamics), whereas the

time-ordered dynamics of collective modes can be extracted from a simulation by casting the

full dynamics onto a given PCA eigenvector. Also, while the directions of motion described

by NMA eigenvectors are expected to give reliable information regarding protein flexibility,

the spatial magnitudes of NMA eigenvalues are not expected to be physically meaningful

(130). By contrast, PCA eigenvalues describe the real spatial amplitude of motion observed

in the full (simulated) dynamics. Since the non-Gaussian features in PCA describe precisely

that portion of the free energy landscape which is lacking in the harmonic approximation,

this motivates and necessitates the generation of MD trajectories. For these reasons, the

remainder of this study is focused on PCA and extensions thereof.

2.2.4: PCA and its Development in Climatology

PCA is one of a number of eigenvector techniques such as Common Factor Analysis

(CFA) which have their roots in social science and go back to Pearson in 1901 (150), and

later to Hotelling in the 1930’s (105, 106) who first used the term PCA. Lorenz introduced

the technique to atmospheric science and climatological modeling in 1956 (151), where it is

known as Empirical Orthogonal Functions (EOFs). EOFs are widely used in climate

research to identify dominant patterns of variability and reduce the dimensionality of climate

data. From the 1960’s through the 1980’s, climatologists developed many elaborations of the

basic EOF/PCA technique which either extract more information from complex spatio-

temporal data or make the resulting collective modes more physically meaningful or

interpretable. An excellent review is available in Hannachi et al. 2007 (107). It is

advantageous to study the implementation of EOFs in climatology for two reasons: the use of

the technique is much more fully developed with a long history in this discipline, and there

are known climatological patterns (such as variations in sea level pressure throughout the

year, or the shape and distribution of ocean currents) against which comparisons can be made

with PCA/EOF calculations. Having these as a target has allowed for the development of

techniques which can get the “right” answer. The ‘interpretability’ of ‘physical’ modes is the

essential task of structural biology, and in MD we generally do not know what our target is.

Richman (152) reviews four major pitfalls to the conventional use of principal

components. The first is the domain shape dependence of EOF/PCs, that is, the geometry of

boundary conditions surrounding the data set. Buell first illustrated that the topography of

PC patterns is predictable primarily as a function of the geometric shape of the data domain,

and not the covariation of the data (153, 154). Calahan (155) notes that the ‘Buell patterns’

are closely related to spherical harmonics when represented on a sphere, and ‘patch

harmonics’ when represented on a limited domain such as a rectangle. The second drawback

is subdomain instability, a corollary of domain shape dependence; the shapes of eigenvectors

within a subdomain (i.e. a subset of atoms, or weather stations) are in general not the same as

when PCA is executed on the full domain (156). A third problem has to do with sampling

errors becoming very large if neighbouring eigenvalues have similar values, and eigenvectors

become strongly mixed (157, 158). Indeed, some authors have warned that interpretation of

eigenvector shapes is virtually meaningless in such cases (159, 160). Finally, a comparison

of known input patterns with the results of PCA on the combined inputs (161), as well as an

analysis of patterns with obvious physical interpretations (156), has shown that PCs often

have no physical basis for interpretation, and rotation or other transformations are needed to

yield intuitively and physically meaningful patterns of variance. A number of climatological

studies (162-165) indicate that the mathematical constraints of orthogonal PCs which account

for successively maximal residual variance can impair the straightforward physical

interpretation of the modes. Real physical modes, both in atmospheric and biomolecular

science, do not necessarily exhibit this characteristic because physical processes are generally

not independent, and therefore physical modes are expected in general to be non-orthogonal.

Nor are they necessarily uncorrelated in time. Hence, despite the large number of studies

which focus on the structure of one or two dominant PCs in the study of protein dynamics,

the climatological literature suggests there is reason to believe that the shapes of individual

PCs may be meaningless on their own, and extensions of PCA must be developed to yield

physically meaningful results when applying PCA to the MD of proteins.

There is a broad class of extensions to EOF which aim to reduce principal

components to more physically meaningful or interpretable patterns. The most common of

these is ‘rotated’ EOFs, whereby a matrix rotation is executed on the leading group of EOFs

(152), resulting in more variance being concentrated in fewer eigenvectors (that is, the

eigenvalues are pushed either towards one or zero). This is referred to in the literature as

finding EOFs with “simplified structure”. The most common algorithm for orthogonal

rotations is called Varimax (166, 167), which maximizes the difference between fourth order

moments and the square of second order moments (where the kth order ‘moment’ is the sum

across each eigenvector element raised to the kth power). Another related algorithm is

Quartimax (168, 169), which maximizes the fourth order moment of the eigenvectors. It is

worth noting that the normalized 4th order cumulant of a distribution is known as the “kurtosis” κ, and

can be written in terms of the second and fourth central moments $\mu_2$ and $\mu_4$ as follows:

$$\kappa = \frac{\mu_4}{\mu_2^2} - 3 \tag{2.9}$$

For a Gaussian distribution the second moment $\mu_2=\sigma^2$ is the variance, and the

fourth moment is $\mu_4=3\sigma^4$. Substituting $\mu_4$ and $\mu_2$ into Eq. 2.9 gives zero kurtosis for a

Gaussian distribution. Hence kurtosis is a measure of the degree to which a distribution is

heavy-tailed (positive kurtosis) or flat-topped and light-tailed (negative kurtosis) relative to a Gaussian. While both

Quartimax and Varimax optimize the width of eigenvector distributions by means of the

fourth order moment $\mu_4$, we see from Eq. 2.9 that Varimax optimizes for kurtosis κ by

maximizing the difference between $\mu_4$ and $\mu_2^2$.
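The behaviour of the kurtosis for Gaussian and non-Gaussian distributions can be checked numerically. The sketch below uses the standard normalization $\kappa = \mu_4/\mu_2^2 - 3$ with central moments; the sample sizes, the random seed and the idealized two-state sample are assumptions chosen for illustration, not data from this work.

```python
import random


def excess_kurtosis(samples):
    """Excess kurtosis mu4 / mu2**2 - 3 computed from central moments."""
    n = len(samples)
    mean = sum(samples) / n
    mu2 = sum((x - mean) ** 2 for x in samples) / n   # second central moment (variance)
    mu4 = sum((x - mean) ** 4 for x in samples) / n   # fourth central moment
    return mu4 / mu2 ** 2 - 3.0


rng = random.Random(0)

# A Gaussian sample: the kurtosis should be close to zero, whatever the width.
gaussian = [rng.gauss(0.0, 1.5) for _ in range(200_000)]

# An idealized two-state (bimodal) sample hopping between -1 and +1.
two_state = [-1.0, 1.0] * 1000

k_gauss = excess_kurtosis(gaussian)
k_two_state = excess_kurtosis(two_state)
print(f"Gaussian:  {k_gauss:+.3f}")
print(f"Two-state: {k_two_state:+.3f}")
```

For the two-state sample both moments equal one, so the kurtosis is exactly -2, the most negative value possible; a strongly negative kurtosis in the projection of a trajectory onto a PC is therefore one simple indicator of multimodal behaviour.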

Oblique rotations have in general been more successful in capturing simple structure

than the orthogonal rotations described above. These algorithms attain solutions by

optimizing certain products or differences of eigenvector moments; common oblique

algorithms include Quartimin (168), Biquartimin (170), Oblimax (171), Promax (172) and

DAPPFR (173).
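The moment-based criteria behind these rotation algorithms can be made concrete with the simplest orthogonal case. The sketch below rotates a single pair of vectors and scans the rotation angle by brute force to maximize a raw Varimax-type criterion, namely the fourth order moment minus the squared second order moment divided by the number of elements p. This is an illustrative stand-in, not Kaiser's iterative algorithm, and the division by p, the function names and the toy vectors are assumptions made for the example.

```python
import math


def varimax_criterion(columns):
    """Sum over vectors of (sum v^4) - (sum v^2)^2 / p."""
    total = 0.0
    for col in columns:
        p = len(col)
        m2 = sum(v ** 2 for v in col)
        m4 = sum(v ** 4 for v in col)
        total += m4 - m2 ** 2 / p
    return total


def rotate_pair(col_a, col_b, angle):
    """Orthogonal rotation of two vectors within the plane they span."""
    c, s = math.cos(angle), math.sin(angle)
    new_a = [c * a + s * b for a, b in zip(col_a, col_b)]
    new_b = [-s * a + c * b for a, b in zip(col_a, col_b)]
    return new_a, new_b


def varimax_pair(col_a, col_b, steps=900):
    """Brute-force scan of rotation angles in [0, 90 degrees)."""
    best_angle = 0.0
    best_value = varimax_criterion([col_a, col_b])
    for i in range(1, steps):
        angle = i * (math.pi / 2) / steps
        value = varimax_criterion(rotate_pair(col_a, col_b, angle))
        if value > best_value:
            best_angle, best_value = angle, value
    return rotate_pair(col_a, col_b, best_angle), best_angle, best_value


# Two "simple" patterns, each localized on half of the elements...
simple_a, simple_b = [1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]
# ...mixed by a 30 degree rotation, as PCA might mix two physical modes.
mixed_a, mixed_b = rotate_pair(simple_a, simple_b, math.radians(30))

before = varimax_criterion([mixed_a, mixed_b])
(rot_a, rot_b), angle, after = varimax_pair(mixed_a, mixed_b)
print(f"criterion before: {before:.3f}, after: {after:.3f}")
print(f"recovered angle: {math.degrees(angle):.1f} degrees")
```

In this toy case the scan recovers vectors that are again localized on separate halves of the elements (up to ordering and sign), which is the sense in which rotation yields "simplified structure".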

There are various more detailed schemes for obtaining simplified structure, some of

which are discussed in Chapter 11 of Jolliffe 2002 (174). A technique called SCoTLASS

(175-177) successively maximizes variance and constrains EOF patterns to be orthogonal

and ‘simple’ according to a number of rules, such as pushing the size of eigenvector elements

to zero when they are far from the centre of action in a given eigenvector. This is based on a

technique called LASSO (178), an algorithm which solves the problem of unstable regression

coefficients in optimizations involving multiple linear regressions, which implicitly selects

variables by forcing some regression coefficients exactly to zero. There are a number of

similar simplification methods for pushing eigenvalues towards one or zero (179-183). It

should be noted that the extra simplification criteria appropriate for constraining the shapes

of atmospheric modes on a sphere are not necessarily the same as those which would be

appropriate for biomolecules.

Since PCA only uses the average instantaneous covariance in the construction of its

matrix, its eigenvectors lack any time-ordered information. There is a class of modified EOF

techniques called Extended EOFs, which modify the matrix to be diagonalized to include

temporal correlations by expanding this matrix to include new columns of variable values at

two or more different time steps. This allows for memory in the system, and is a much more

realistic vehicle for capturing real physical modes as time-lagged information is included in

the analysis. It was introduced by Weare and Nasstrom (184), further developed by

Broomhead and King (185, 186) for analysis of low order chaotic systems (called Singular

System Analysis – SSA) and multivariate systems (MSSA), and has been used to find

propagating structures in climatological data (187, 188).
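The construction of the extended matrix can be sketched in a few lines. The example below places time-lagged copies of every variable side by side before forming the covariance matrix; the function name, the number of lags and the tiny synthetic series are assumptions for illustration only.

```python
import numpy as np


def extended_matrix(data, lags):
    """Append time-lagged copies of each variable as new columns.

    data : array of shape (T, n), one row per time step.
    lags : number of additional lagged copies to include.
    Returns an array of shape (T - lags, n * (lags + 1)).
    """
    data = np.asarray(data, dtype=float)
    n_frames = data.shape[0]
    blocks = [data[lag:n_frames - lags + lag] for lag in range(lags + 1)]
    return np.hstack(blocks)


# Tiny synthetic series: 6 time steps of 2 variables.
series = np.arange(12, dtype=float).reshape(6, 2)
ext = extended_matrix(series, lags=2)

# Covariance of the extended matrix now contains lagged cross-covariances.
anomalies = ext - ext.mean(axis=0)
ext_cov = anomalies.T @ anomalies / len(anomalies)
print(ext.shape, ext_cov.shape)
```

Each row of the extended matrix holds the state at one time step followed by the states at the next `lags` steps, so diagonalizing its covariance yields patterns that extend over both space and a short window of time.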

Finally, there is a category of modified EOF which uses complex numbers in the

construction of the covariance matrix. The real and imaginary parts of the complex number

a+ib may encode information from two different fields of associated variables, e.g. the zonal

(a) and meridional (b) components of wind velocity (189-191). The eigenvectors of this

matrix encode covarying spatial patterns between the two fields. The complex number may

also be used to encode the value of a single field at two points in time separated by a chosen

time lag τ: x(t)+ix(t+τ). This encodes phase information for a particular time lag. With the

right choice of this parameter, propagating patterns may be revealed within a data set.

Frequency domain (FD) EOF also falls under the complex category (192-194), but was

abandoned in climatological research in favour of the more elegant Hilbert EOF (195, 196).

The Hilbert transform essentially provides information about the rate of change of x(t) with

respect to t at a given frequency, and has been used to study the monsoon (196-198),

atmospheric angular momentum (199), and coastal ocean currents (200).
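The time-lagged complex encoding x(t)+ix(t+τ) can be illustrated with a synthetic travelling wave. In the sketch below the lag is chosen as a quarter of the wave period, in which case a single complex eigenvector captures the whole propagating pattern; the wave parameters, array names and number of sites are assumptions for the example and do not correspond to any data set discussed here.

```python
import numpy as np

n_sites, n_frames = 8, 400
period, lag = 40, 10           # lag = quarter period (assumed, in frames)
wavenumber = 0.5               # phase shift between neighbouring sites (rad)

t = np.arange(n_frames + lag)[:, None]
j = np.arange(n_sites)[None, :]
x = np.cos(wavenumber * j - 2.0 * np.pi * t / period)   # travelling wave x_j(t)

# Complex field x(t) + i x(t + tau), with the time mean removed.
z = x[:n_frames] + 1j * x[lag:lag + n_frames]
z = z - z.mean(axis=0)

# Hermitian covariance matrix and its (real) eigenvalues.
cov = z.conj().T @ z / n_frames
evals, evecs = np.linalg.eigh(cov)          # ascending order
leading_fraction = evals[-1] / evals.sum()

# Phase step between neighbouring sites in the leading eigenvector.
leading = evecs[:, -1]
phase_step = np.angle(leading[1:] / leading[:-1])
print(f"variance in leading mode: {leading_fraction:.4f}")
print(np.round(np.abs(phase_step), 3))
```

The constant phase step between neighbouring sites recovers the wavenumber of the propagating pattern, information that a real-valued covariance matrix would split across a pair of eigenvectors. For other choices of lag the encoding is less clean, which is why the choice of this parameter matters.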

This brief review makes it clear that the state of the art in PCA/EOF of climatological

data is considerably more advanced than its use in MD. Horel (201) states that in

climatology “principal component analysis was used for many years before its inherent

limitations were fully realized”. Let us hope that its use in molecular dynamics can benefit

from the experience of climatologists, and this thesis aims to point the way forward in this

regard. One of the key features of these enhanced PCA techniques is that many of them are

applied to the eigenvectors after weighting by the square root of their associated eigenvalues,

such that the norm squared of each PC is the variance of the corresponding time series. This

is the physical and mathematical basis of our own proposal for enhancing PCA in Chapter 5.

To the best of the author’s knowledge none of the techniques described above (rotated,

simplified, extended or complex PCA) have been applied to protein dynamics, nor does the

MD literature refer to any of the climatology literature cited above, with the exception of

very general references to Jolliffe’s 2002 book (174).

However, a few new developments of PCA on MD data have appeared in recent

years. One of these is ‘Nonlinear’ PCA (NLPCA) which employs hierarchically arranged

neural networks which are trained to build a set of adequate nonlinear mapping functions

between an input vector and its counterpart in PC space (202). When applied to the analysis

of peptide dynamics (triglycine, hexaalanine, and the C-terminal β-hairpin of protein G) it

was shown (203) that this technique reduces the dimensionality of these systems much better

than PCA. In the case of the β-hairpin, 4 NLPCs capture the same structure that is described

by 21 conventional PCs. Furthermore, the free energy landscapes constructed by NLPCA are

much more complex and capture conformational states not apparent in the landscapes

resulting from PCA, and also cleanly separate conformational states which are mixed

together in conventional PCA. Another enhanced PCA technique has been called

‘Multivariate Frequency Domain Analysis’ (MFDA) which is PCA executed on a band pass

filtered process across a range of frequencies (204), and is therefore related to FDEOF.

Applied to the BPTI protein, this study demonstrated that at zero temperature MFDA

eigenvectors are the same as those acquired from NMA, but at 300 K significant differences

become apparent with NMA as well as PCA eigenvectors. By applying the Varimax

algorithm to the MFDA eigenvectors this study was able to establish a set of orthogonal

modes which describe BPTI dynamics at each frequency used in the analysis, thereby

directly assigning a unique timescale to each set of eigenvectors (whereas PCA

eigenvectors have many frequencies in the trajectory of each PC). These advancements, in

addition to those employed by atmospheric scientists, suggest that there is ample room for

enhancement and development of PCA as applied to protein dynamics, and no single solution

to the problems described above has been proposed and accepted.

Chapter 3: Convergence of PCA

3.1: Background

Statistical convergence is the first concern of any scientific simulation. Is our system

in equilibrium? Has it exhaustively explored its available phase space? The answers to these

two questions underpin the scope and validity of a simulated result. It is relatively easy to

ensure that a system is in equilibrium by monitoring various energy terms over time,

ensuring that they fluctuate around a consistent average. The second question is much harder

to address, especially when simulating large complex systems with empirically determined

force fields, as in MD. In principle any finite MD simulation is too short to ‘prove’ the

complete exploration of its conformational space; there may always be unexplored states on

the other side of a large kinetic barrier. In practice it is well-known that biomolecular

dynamics have a large spread of relevant timescales ranging from picoseconds to

milliseconds, and the free energy landscape explored by the conformational dynamics of a

protein is complex, multimodal and ‘rough’ in a fractal sense, such that there are effectively

an infinite number of nested minima to explore. Certainly there are simple systems where

this difficulty is eased, but in general the complete exploration of conformational phase space

for the average protein is more than we can hope for. To make progress we need to be able

to quantify how broadly our system has explored its available phase space, and to establish

how converged is converged enough.

MD simulations have always been limited to timescales shorter than we would like.

Every year the average length of what is considered a ‘reasonably long’ simulation increases;

currently it is on the order of 100 ns, and a 1 μs simulation is considered ‘very long’. Ten

years ago 100 ps simulations were the average, and a few ns was considered ‘very long’. Yet

computational biochemists have been making scientific progress with simulations at limited

timescales for over 20 years. For example, while gA has been fruitfully studied on

picosecond timescales for decades, and more recently on the nanosecond timescale necessary

to reasonably describe membrane dynamics, it is known that the association time of the two

monomers in a membrane is on the order of 100 ms. Not only does this mean that the “brute

force” atomistic study of dimerization kinetics (i.e. at equilibrium, as opposed to biased or

steered towards this reaction) is out of reach for one of the simplest possible protein dimers,

we are unlikely to even observe a single dissociation event with current simulations.

However, this does not mean we cannot study the many interesting faster processes of the

associated dimer. This is the functional state of gA, which is why technically “unconverged”

simulations of this molecule are still extremely informative. This highlights the need to be

somewhat flexible, one might say “reasonable”, about what constitutes a “converged”

simulation; this must be judged with respect to the properties of interest, some of which

converge faster than others. Indeed, simulations of ion channels are of particular interest

since the timescale of ionic diffusion and transport is known to be much faster than current

simulation times.

Since conformation changes constitute large covariant changes in atomic positions,

PCA has been a useful technique when quantifying the convergence of a system’s

exploration of conformational phase space. There are two aspects to this characterization:

structural and dynamic. On the one hand, the consistency of PC eigenvector shapes at

various timescales measures the convergence of spatial characteristics by quantifying how

quickly the exploration of new conformations slows down. On the other hand, the

distribution of states over time will converge differently, depending on how often different

states are visited and how long it takes to achieve equilibrium state populations. The spatial

characteristics are determined by thermodynamics, i.e. the potential energy surface, while the

distributions are determined by dynamics, which include stochastic processes and activation

barriers. We investigate both these aspects of convergence by studying the PCA

eigenvectors and eigenvalues for trajectories of various size and duration. We do this first

for the backbone, since this is the standard practice in the literature, and yields considerably

simpler (and in the case of gA, unimodal) distributions of eigenvector projections over the

simulation trajectory. We compare results for both NCαC and NHCαCO atoms in order to

highlight any differences which may arise from the inclusion of hydrogen bonding elements.

For comparison with multimodal behavior we also analyze the convergence of side chain and

solvating GMO dynamics.

3.2: Convergence of Structure: Overlap of Covariance Matrices

The eigenvalue-normalized overlap s(A,B) introduced by Hess (205) to measure the

‘distance’ between two matrices has been adopted as a measure of convergence in a number

of studies (122, 206, 207):


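$$ s(A,B) = 1 - \sqrt{\frac{\mathrm{tr}\left[\left(A^{1/2} - B^{1/2}\right)^{2}\right]}{\mathrm{tr}(A) + \mathrm{tr}(B)}}, \qquad (3.1) $$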

where A and B are covariance matrices defined by the eigenvectors and eigenvalues from the

PCA of two different trajectories, such that

$$ A^{1/2} = v\,\mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)\,v^{T}. \qquad (3.2) $$

Here v is the complete eigenvector matrix and diag(σk) is a diagonal matrix with the square

roots of all eigenvalues σk². It is conventional to compute the overlap of halves (122, 206) or

thirds (207) of an MD trajectory, to estimate the degree to which the conformational space

explored by a trajectory has ceased to expand. Note that this measure of overlap, by

including the entire covariance matrix in s(A,B) weighted by its eigenvalues, is dominated by

the characteristics of the longest PCs. The convergence of short PCs with small eigenvalues is

likely to be much faster than that of the longest PCs.
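As a rough illustration of Eqs. 3.1 and 3.2, the overlap can be computed directly from two covariance matrices. The following is a minimal NumPy sketch rather than the code used in this work: the function names `psd_sqrt` and `covariance_overlap` are hypothetical, the form of s(A,B) is assumed to be the one given by Hess (205), and the two toy matrices are illustrative values only.

```python
import numpy as np

def psd_sqrt(C):
    """Square root of a symmetric positive semi-definite matrix (Eq. 3.2)."""
    evals, evecs = np.linalg.eigh(C)             # eigenvalues sigma_k^2, eigenvectors as columns
    sigma = np.sqrt(np.clip(evals, 0.0, None))   # clip tiny negative round-off before the root
    return (evecs * sigma) @ evecs.T             # v diag(sigma_k) v^T

def covariance_overlap(A, B):
    """s(A,B) = 1 - sqrt( tr[(A^1/2 - B^1/2)^2] / (tr A + tr B) )."""
    diff = psd_sqrt(A) - psd_sqrt(B)
    d2 = np.trace(diff @ diff)
    return 1.0 - np.sqrt(max(d2, 0.0) / (np.trace(A) + np.trace(B)))

# Toy covariance matrices (not simulation data)
A = np.diag([1.0, 0.0])
B = np.diag([0.0, 1.0])
s_same = covariance_overlap(A, A)    # identical matrices
s_orth = covariance_overlap(A, B)    # fluctuations in orthogonal subspaces
```

With this definition identical matrices give s = 1, while matrices whose fluctuations lie in orthogonal subspaces give s = 0, which is why the measure is dominated by the PCs with the largest eigenvalues.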

3.2.1: Backbone of gA, SS and RR: Converged Eigenvectors

To demonstrate the convergence of sampling for our MD simulations of gA, as well

as certain PCA results derived from them, we calculate the overlap s(A,B) of eigenvector

matrices A and B derived from independent PCA of different time windows within a

trajectory. In Fig. 3.1 (top) we show the overlap s(AΔT,BTtot) of subsets ΔT in a trajectory

with its full duration Ttot, as done by Hess (205). These curves are necessarily equal to one at

ΔT=Ttot, since they overlap increasingly large portions of the same trajectory with each other.

This accounts for their exponential scaling and small error bars. Other studies have

computed s(AΔT1,BΔT2) for independent simulations or non-overlapping sub-segments ΔT1

and ΔT2 of a trajectory, where ΔT1=ΔT2=1/2 Ttot (122, 206), or 1/3 Ttot (207). To

generalize this approach, in Fig. 3.1 (bottom) we show the average overlap s(AΔTk,BΔTk+1) of

all consecutive trajectory sub-segments of equal duration ΔTk and ΔTk+1. The horizontal axis

of this curve extends over increasing durations between 200 ps and 64 ns in the simulation

with SHAKE, or between 50 ps and 10 ns in the simulation without SHAKE. The average

overlap of half the 64 ns trajectory with its full length is 0.93, while for the 10 ns simulation


this quantity is 0.91. For our 64 ns simulation the overlap of two 32 ns segments is 0.89, the

average overlap of 8 ns segments is 0.85, and for 1 ns segments it is 0.81. For our

unconstrained 10 ns simulation the average overlap of 5 ns halves is 0.84, and for 1 ns

segments it is 0.82. The consistency of overlap values between simulations with and without

SHAKE gives us additional confidence in these results.
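The averaging over consecutive sub-segments described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the analysis code behind Fig. 3.1: the helper names are hypothetical, the trajectory `X` is assumed to be an array of already-fitted Cartesian coordinates with shape (frames, coordinates), and the synthetic Gaussian data merely stand in for a real trajectory.

```python
import numpy as np

def psd_sqrt(C):
    w, v = np.linalg.eigh(C)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def overlap(A, B):
    d = psd_sqrt(A) - psd_sqrt(B)
    return 1.0 - np.sqrt(max(np.trace(d @ d), 0.0) / (np.trace(A) + np.trace(B)))

def consecutive_overlap(X, n_frames):
    """Mean and standard deviation of s over consecutive windows of n_frames frames."""
    n_win = X.shape[0] // n_frames               # needs at least two windows
    covs = [np.cov(X[k * n_frames:(k + 1) * n_frames], rowvar=False)
            for k in range(n_win)]
    s = [overlap(covs[k], covs[k + 1]) for k in range(n_win - 1)]
    return np.mean(s), np.std(s)

# Synthetic stand-in for a trajectory: three independent Gaussian coordinates
rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 3)) * np.array([3.0, 1.0, 0.3])
mean_s, std_s = consecutive_overlap(X, 5000)
```

Repeating the call for a range of window lengths gives one point per timescale, with the standard deviation over window pairs serving as the error bar.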

What level of convergence these numbers reflect can only be answered by

comparison with values obtained by similar studies. Our overlap values are in agreement

with another study which computed the PCA overlap for simulations of gA embedded in a

DMPC membrane (206). The authors analyzed the convergence of PCA for the backbone of

membrane proteins of various size on the 10 ns timescale, using gA as a comparative

standard for convergence of simulations of larger proteins. For gA the overlap of two

independent 8 ns trajectories was 0.82 while for two 4 ns trajectories it was 0.8. The overlap

of half a trajectory with its full 8 ns length was between 0.88 and 0.92. The study concluded

that “multi-nanosecond molecular dynamics calculations can provide satisfactory, albeit not

perfect, conformational sampling”. Grossfield et al. (122) have studied the convergence of

26 independent 100 ns simulations of rhodopsin solvated in a membrane containing 99

phospholipids with 1-stearoyl-2-docosahexaenoyl fatty acyl chains attached to 49

phosphatidylcholine and 50 phosphatidylethanolamine headgroups, and 24 cholesterols.

They show that different parts of a large protein exhibit very different convergence of their

respective PCA. The whole protein had a narrow distribution of overlap values centred on

0.2 and the transmembrane helices centred on 0.4. Both extracellular and cytoplasmic loops

had broad distributions ranging from 0.2 to 0.7. Only the CI loop, which is small and

stabilized by secondary structure (and is thereby comparable to gA) converged very well

with overlap values distributed mostly over 0.8. However, this was also the only bimodal

distribution, whose minor peak is centred on 0.3.


Figure 3.1: Eigenvalue-normalized overlap between a sub-trajectory of length ΔT with the full trajectory (top) and overlap between two consecutive trajectories of length ΔT (bottom), averaged over the number of samples available in the trajectory. The analysis is carried out for the gA main chain NCαC atoms (left) as well as the NHCαCO backbone atoms (right), for both the simulations with (solid circles) and without SHAKE (open circles). Error bars indicate one standard deviation.

Given that gA is a single transmembrane helix which is entirely folded into its

secondary structure, we would expect that convergence is much more easily achieved here

than most proteins in the PCA literature, even in their more stable sub-segments. Indeed, our

convergence curves for the gA backbone (Fig. 3.1) are all at higher overlap values than those

quoted for particular timescales in the studies above. Taken together these studies also

suggest that an overlap of s(A,B)=0.8 is an acceptable value of convergence.

In Fig. 3.2 we compare the structural convergence s(AΔTk,BΔTk+1) of the gA backbone

eigenvectors with the convergence of SS and RR backbone eigenvectors. The overlap values

of the linked analogs are lower by ~0.05 at timescales shorter than ~2 ns, but are almost

identical with the native dimer and are larger than 0.8 at longer timescales. This gives us

confidence that the backbone eigenvectors are converged for all three gramicidin molecules


at timescales longer than 2 ns (as studied in chapter 5). The conformational isomerization of

the dioxolane linker in the RR analog, which was found to occur frequently in 16 ns

simulations (72, 84), may be responsible for the lower values at short timescales. However,

this isomerization was not observed for the SS analog, which has overlap values between

those of gA and RR.

Figure 3.2: Comparison of gA, SS and RR backbone eigenvector convergence for 64 ns

simulations.

3.2.2: Side Chains and GMO: Unconverged Eigenvectors

To highlight the utility of s(A,B) as a measure of convergence, in this section we

show what the timescale-dependent overlap looks like for eigenvectors which are not

converged. In Appendix 2 we show certain properties of the gA side-chains (using PCA)

which demonstrate the inadequate sampling of their multi-modal distribution on the 64 ns

timescale. In Chapter 6 we demonstrate that GMO monomers surrounding the surface of the

gA molecule exchange positions only occasionally on the 64 ns timescale, also inadequately

sampling their multimodal distribution. Hence we would not expect the spatial structure of

gA side-chain or annular GMO eigenvectors to converge for our simulations. In Fig. 3.3 we

show that s(AΔT,BTtot) and s(AΔTk,BΔTk+1) have much smaller values for gA side chains and

annular GMO monomers, and even scale distinctly from the backbone case shown above.

s(AΔT,BTtot) is almost flat and ranges from 0.25 to 0.4 for the side chains, and from almost

zero to 0.15 for GMO, for timescales shorter than 10 ns. This part of the curve also has


surprisingly small error bars, with larger error bars apparent when the curve shoots up

towards 1. This is exactly the opposite of the case for NCαC and NHCαCO atoms shown in

Fig. 3.1. Notably, s(AΔTk,BΔTk+1) decreases over most of its timescale range, from 0.6 to 0.4

for the side chains and from 0.2 to almost zero for GMO. Thus s(A,B) is capable of

discerning converged from non-converged eigenvectors.

It is also worth noting that s(A,B) is a measure of the similarity between two

complete PCA matrices where the weighting of all eigenvectors by their associated

eigenvalues makes the leading eigenvectors the dominant terms. Hence the results in this

section represent the convergence of the longest and most likely the slowest modes in the

system. In general we would expect eigenvectors associated with covariant motion at small

length scales to converge faster than the longest components of motion.

Figure 3.3: Un-converged systems: overlap of side-chains and solvation lipids


3.3: Convergence of Dynamics: Average Distributions and Deviation from Gaussian

Gaussian distributions are indicative of motion on a harmonic free energy

landscape, while non-Gaussian distributions are the result of anharmonicity on this

landscape. This follows from the definition of free energy $G = -k_B T \log P$, where a Gaussian

probability $P = \exp[-a x^2]$ yields the harmonic function $G = K x^2$ with spring constant $K = a k_B T$

(here x is a collective reaction coordinate representing protein conformation). This makes the

shape of PC distributions of considerable interest to structural biologists. Anharmonicity

could be evidence of multi-modal dynamics, coupling among modes, activated transitions or

trapped kinetics. A number of studies have observed and interpreted non-Gaussian yet

unimodal distributions in the largest PCs of the backbone for various proteins on the timescale

of a few nanoseconds (112, 113, 116). It is therefore of interest to know whether this is an

artifact of insufficient sampling, in which case these PCs would converge to Gaussian

distributions given enough sampling time, or whether this is an intrinsic anharmonicity in the

free energy surface explored by protein conformations. A similar question may be asked of

multimodal distributions: do they converge to the sum of Gaussians with different centres, or

is the shape of the distribution indeed non-Gaussian? It is important to note that the spatial

shapes of eigenvectors described in section 3.2 converge much faster than their dynamics: the

former requires a quarter oscillation in time, while the latter requires many cycles to

adequately converge the distribution of states.
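The link between a Gaussian distribution and a harmonic free energy stated at the start of this section can be written out in one step. The normalization constant C and the variance σ² are symbols introduced here for illustration; they only shift or rescale the result.

```latex
\begin{align*}
P(x) &= C\,e^{-a x^{2}}, \\
G(x) &= -k_B T \ln P(x) = a\,k_B T\,x^{2} - k_B T \ln C = K x^{2} + \mathrm{const},
\qquad K = a\,k_B T .
\end{align*}
```

For a Gaussian of variance σ² we have a = 1/(2σ²), so K = k_BT/(2σ²): a PC with a large eigenvalue corresponds to a soft effective spring, and any departure of P from a Gaussian appears as an anharmonic correction to G.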

To compare the acquired distribution Pk with a Gaussian distribution, we normalized

each PC trajectory by its standard deviation σk (the square root of the kth covariance

eigenvalue σk²). We also re-binned the distributions into a common 100 bins to align all

distributions with each other for comparison. The resulting normalized distributions PkN

have σk=1, and their shape can be compared against a Gaussian distribution of unit variance

and height $(2\pi)^{-1/2}$, by taking the difference between this and PkN. The resulting ΔPkN will be

a flat line if the acquired distribution is Gaussian; if it is not, the curve will show positive

and negative deviations from zero.
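The normalization and differencing procedure just described can be sketched in a few lines. This is a minimal illustration, not the code used for the figures: the function name `delta_p` is hypothetical, the histogram range of ±5 standard deviations is an assumed value, and the two random samples are arbitrary stand-ins for PC projections.

```python
import numpy as np

def delta_p(projection, bins=100, span=5.0):
    """Difference between a normalized PC distribution and a unit Gaussian.

    `span` is the assumed half-width of the histogram, in units of sigma_k.
    """
    z = (projection - projection.mean()) / projection.std()    # rescale so sigma_k = 1
    density, edges = np.histogram(z, bins=bins, range=(-span, span), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    gauss = np.exp(-0.5 * centers**2) / np.sqrt(2.0 * np.pi)   # peak height (2 pi)^(-1/2)
    return centers, density - gauss

rng = np.random.default_rng(1)
_, dp_gauss = delta_p(rng.normal(size=200_000))            # harmonic-like sample
_, dp_flat = delta_p(rng.uniform(-1.0, 1.0, 200_000))      # clearly non-Gaussian sample
```

For the Gaussian sample the difference stays close to zero in every bin, limited only by counting noise, whereas the flat sample shows a negative deviation at the centre and positive deviations towards the edges.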

While a trajectory distribution does not have any time-ordered information in it, it

may be suggestive of the statistical properties which generate the trajectory in time. For

example, Mandelbrot and van Ness (208) connect the non-Gaussian properties of a trajectory


distribution to non-Brownian (or “fractional” Brownian) properties of its time-ordered

behavior. That is, if consecutive steps in a noisy trajectory are uncorrelated they will have a

Gaussian distribution and their mean square displacement (MSD) will exhibit a linear

dependence on time. This is the signature of diffusive (Brownian) motion. On the other

hand, if steps are anti-correlated the distribution will be narrower than a Gaussian (hugging

close to the average and under-sampling extremes, low kurtosis) and the MSD will scale with

a power less than 1. This sub-linear scaling is the signature of sub-diffusion, since this

trajectory moves more slowly away from its centre than it would by thermal diffusion. For a

trajectory with correlated steps the distribution will be broader than a Gaussian (under-

sampling near the average and over-sampling the extremes, high kurtosis) and the MSD will

scale with a power greater than 1. This super-linear scaling is indicative of super-diffusion,

in that the trajectory moves away from its centre faster than it would by thermal diffusion.

Hence the occurrence of maxima or minima in ΔPkN on either side of the inflection points at

±1 may be indicative of anomalous diffusion in the time evolution of PCs.
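Since the narrow and broad cases above are characterized by low and high kurtosis respectively, a single number per PC can complement the full difference curve. The sketch below computes the excess kurtosis of a normalized projection; the function name is hypothetical, and the uniform and Laplace samples are arbitrary stand-ins for narrow and broad distributions, not models of PC dynamics.

```python
import numpy as np

def excess_kurtosis(projection):
    """Fourth standardized moment minus 3, which is zero for a Gaussian."""
    z = (projection - projection.mean()) / projection.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(2)
k_gauss = excess_kurtosis(rng.normal(size=200_000))
k_narrow = excess_kurtosis(rng.uniform(-1.0, 1.0, 200_000))   # under-samples the extremes
k_broad = excess_kurtosis(rng.laplace(size=200_000))          # over-samples the extremes
```

A negative value indicates a distribution narrower than a Gaussian and a positive value a broader one, matching the sign convention used in the text.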

3.3.1: Backbone and Side Chains of gA

Fig. 3.4 shows <ΔPkN> for the gA backbone with and without hydrogen bonding

atoms (NHCαCO, NCαC), and for the heavy side chain atoms (SIDE). Each panel shows

results averaged over multiple windows of width 1 ns (64 samples) to 64 ns (1 sample), taken

from the 64 ns simulation with SHAKE. A single distribution for the entire 10 ns simulation

without holonomic constraints on hydrogen atoms is also shown in each panel for comparison,

since short PCs probe dynamics involving hydrogen atoms. PC1 is shown on the left and

PC2 is shown on the right.

It is well known that the side chains of gA have multiple conformations, especially

the Trp residues (209). Trp 9 in particular has been observed in two distinct states using

NMR: in the 1MAG (PDB ID) structure obtained by solid state NMR studies it is stacked on

the Trp 15 residue (44, 45), while in the 1JNO (PDB ID) structure obtained by solution state

NMR it is splayed away from Trp 15 in the opposite orientation (210). This discrepancy has

been resolved by MD studies which concluded that Trp 9 spends 80% of its time in the

splayed orientation and 20% in the stacked orientation (211). Hence the multimodal

distributions of side chains (Fig. 3.4E) are not surprising, and reflect this conformational

flexibility (see Appendix 2 for more details). However, the non-Gaussian features of the


longest backbone PCs (Figs. 3.4A and 3.4C) exhibit a similar multimodal profile at short

timescales, though with smaller amplitude. Although these backbone PCs have the

appearance of super-diffusive distributions at short time scales, at long timescales <ΔPkN>

approaches zero, indicating that the longest PCs are actually harmonic (Gaussian). This time

dependence is an artifact of inadequate sampling, as the distribution of an oscillation sampled

over less than the order of a wavelength appears asymmetric; averaging together the

distributions of many such sub-cycles would yield a super-diffusive profile. This analysis

suggests that the dynamics of the gA backbone converge around 10 ns. In Chapter 4 we

contrast this behaviour with the non-Gaussian features of the short PCs.

Figure 3.4: Average difference from Gaussian distributions at various timescales, for the gA

backbone and side chains.


3.3.2: Backbone of SS and RR

In Fig. 3.5 we show similar results for <ΔPkN> of the SS and RR NCαC backbone.

Deviations from a Gaussian for the leading PCs wash out at about the same timescale as

for the gA backbone, though they are in general larger in PC1 and more apparent and

persistent in PC2 for SS and RR than for gA. This may be explained by the structural

dislocation caused by the dioxolane linker. It has been shown that this linker has four

conformational states in the RR molecule (72, 84), and we might expect this to influence the

distribution of the longest PCs. Apparently the dislocation is local enough that this is not the

case, though in Chapter 5 we show that it does make a difference in the shapes of the

eigenvectors. From this we learn that the linker exerts its influence on the thermodynamics

of the molecule by changing the potential energy landscape, rather than changing its

dynamics by introducing kinetic barriers (at least for the longest PCs).

Figure 3.5: Average difference from Gaussian distributions at various timescales, for the SS

and RR main chain.


3.4: Summary and Conclusions

We have seen that for timescales over ~2 ns the spatial shapes of backbone

eigenvectors for all three gA analogs have converged adequately, in that PCs extracted from

longer portions of the simulation yield very similar eigenvectors. On the other hand, the

multimodal dynamics of the side chains as well as the solvating GMO molecules do not show

this convergent behavior, and in fact the shapes of their eigenvectors are increasingly

dissimilar at longer simulation times. The distributions obtained from projecting the

eigenvectors onto the simulation trajectory also offer another measure of convergence, but in

this case it is the convergence of dynamics rather than eigenvector structure. These

distributions show that an apparently super-diffusive behavior at short timescales disappears

for trajectory lengths over ~10 ns in the case of the backbone, whereas it is persistent in the

case of the multimodal dynamics exhibited by side chains and GMO molecules. These

results give us confidence that the backbone eigenvectors and eigenvalues examined in

Chapters 4 and 5 are statistically meaningful, and also delineate the scope of applicability for

PCA of side chains and solvating molecules.


Chapter 4: Anharmonic Features of Collective Modes

The work described in this chapter has been published with the following reference: Kurylowicz M, Yu CH, Pomès R: A Systematic Study of Anharmonic Features in the Principal Component Analysis of Gramicidin A (2010). Biophys. J. 98 (3), 386-395.

In this chapter we present a number of quantitative measures which identify

anharmonic collective motions in gramicidin A: eigenvalue scaling, non-Gaussian PC

distributions and the Mean Square Deviation (MSD). We study the anharmonic features of

properties in the large covariance regime traditionally studied by PCA, but as shown in

Chapter 3 the anharmonicity of these motions is timescale dependent and disappears in

simulations longer than 10 ns. Prompted by the observation of distinct scaling regimes in

the eigenvalue spectrum, we go on to study the MSD and distributions of PCs in the small

covariance regime, where we show that anharmonic features persist over all timescales

studied here. This allows us to isolate bands of PCs which describe short and fast collective

motions which are associated with hydrogen bonding. We focus on a description of one

mode with known functional consequences in the channel backbone: the libration of amide

planes (55, 70, 72, 73, 78-80, 212, 213).

4.1 Scaling of PCA Eigenvalues

The complete PCA eigenvalue spectra for various atomic subsets of gA are shown in

Fig. 4.1. These results are taken from the 10 ns simulation without constraints on hydrogen

atoms, and with a 0.5 fs time step; simulations with SHAKE do not yield the correct

eigenvalues at the short-PC end of the spectrum because they freeze the covalent bond

vibrations of hydrogen atoms. This in turn limits the number of collective degrees of

freedom to less than 3N-6, and yields artificially small eigenvalues for degrees of freedom

involving hydrogen atoms. On the other hand, the long PCs in the 10 ns simulation differ

very little from those of the 64 ns simulation (with holonomic constraints) used for

comparison; the PCA matrices from the 64 ns and 10 ns simulations have an overlap of 0.88.

Each curve in Fig. 4.1A shows the variance of all principal components for a different

atomic subset of the molecular structure: a single atom per residue (Cα), backbone atoms

(NCαC and NHCαCO), side chain atoms (SIDE and SIDEH), and the combined atom set

(ALL and ALLH). While the SIDEH and ALLH sets include all hydrogen atoms, the

NHCαCO curve includes only the amide hydrogen in order to emphasize dynamics within


secondary structure involving hydrogen bonds, and to explicitly capture amide plane

motions. While the long-PC spectra for the whole protein (ALL) and the side chains (SIDE)

are almost identical, the scaling of their short PCs is significantly different. This suggests

that the small eigenvalues and eigenvectors may encode real physical information about the

behaviour of our system, and are not just noise to be ignored as commonly done in previous

PCA studies of protein motion.

There are generally two scaling transitions in the spectra, one at ~25 PCs and the

other at ~100 PCs. Fig. 4.1B shows three distinct power-law scaling regimes in the

heavy-atom PCA eigenvalues $\sigma_k = k^{-\alpha}$, with all linear regressions on the log-log scale scoring

R²>0.99. While the largest PCs follow a power of α~1, there are significant differences in

scaling of the shorter PCs for different parts of the protein. The mid-size regime of the

backbone scales with α=2, while for side-chains it is distinctly more shallow with α~1.5.

The whole protein (ALL) lacks a clear scaling in this mid-scale regime, making a smooth

transition towards steeper scaling at the shortest end of the PC spectrum. In this short-PC

regime the backbone scales with roughly α~2, the side-chains scale much more steeply with

α=4, while the whole protein (ALL) approaches an average between the two, with α=3.

Different numbers of PCs span the same scaling features in these spectra for different atomic

inclusions (e.g. Cα vs NCαC), suggesting that blocks of PCs span statistically distinct regimes

of motion. Hence the scaling shown in Fig. 4.1 may be used as a guide to search for

components of motion with interesting statistical features and determine the boundaries

between distinct regimes of principal components.
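The scaling exponents quoted above come from linear regressions on the log-log eigenvalue spectrum. A minimal sketch of such a fit is given below; the function name `fit_power_law`, the choice of index ranges and the synthetic two-regime spectrum are all illustrative assumptions, not the gA data or the fitting code used for Fig. 4.1.

```python
import numpy as np

def fit_power_law(eigenvalues, k_min, k_max):
    """Fit eigenvalue ~ k^(-alpha) over PC indices k_min..k_max (1-based, inclusive)."""
    k = np.arange(k_min, k_max + 1)
    x = np.log10(k)
    y = np.log10(eigenvalues[k_min - 1:k_max])
    slope, intercept = np.polyfit(x, y, 1)        # straight line on the log-log scale
    residual = y - (slope * x + intercept)
    r2 = 1.0 - residual.var() / y.var()           # coefficient of determination
    return -slope, r2

# Synthetic spectrum with a scaling transition at k = 25
k = np.arange(1, 101, dtype=float)
spectrum = np.where(k <= 25, k**-1.0, 25.0 * k**-2.0)
alpha_long, r2_long = fit_power_law(spectrum, 1, 25)
alpha_mid, r2_mid = fit_power_law(spectrum, 26, 100)
```

On real spectra the boundaries between regimes have to be chosen by inspection, and the R² of each fit indicates whether a single power law is a reasonable description of that block of PCs.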


Figure 4.1: Log10-log10 plots of the complete PCA eigenvalue spectrum of gramicidin A as a function of eigenvalue index i. A: Spectra for the backbone, side chains and whole protein without hydrogen atoms (Gray: Cα, NCαC, SIDE, ALL) and with them (Black: NHCαCO, SIDEH, ALLH). The data have been thinned towards the high indices for clarity. B: The complete PC set for heavy atoms, with linear regressions in regions of different power-law scaling. The ALL curve has been translated upwards for clarity (+ci). Bold lines indicate the range included in the fit, while the thin lines are guides for the eye. The bold numbers above each line indicate the slope of the fit (i.e. the power α), and the R² value for the linear fit is italicized in brackets below the slope.


4.2 Non-Gaussian PC Distributions

Gaussian distributions are indicative of motion on a harmonic free energy landscape,

while non-Gaussian distributions are the result of anharmonicity on this landscape. To

compare the acquired PC distribution Pk with a Gaussian, we normalized each PC trajectory

by its standard deviation σk (the square root of the kth covariance eigenvalue σk²). We also

re-binned the distributions into a common 100 bins to align all distributions with each other

for comparison. The resulting normalized distributions PkN have σk=1, and their shape can

be compared against a Gaussian distribution of unit variance and height $(2\pi)^{-1/2}$. Fig. 4.2

shows ΔPkN, the difference between the acquired PC distributions and a unit Gaussian, for the

gA backbone with and without hydrogen bonding atoms (NHCαCO, NCαC), and for the

heavy side chain atoms (SIDE). Each panel shows results averaged over multiple windows

of width 1 ns (64 samples) to 64 ns (1 sample), taken from the 64 ns simulation, as in Fig.

3.4. A single distribution for the entire 10 ns simulation without holonomic constraints is also

shown for comparison, since short PCs probe dynamics involving hydrogen atoms. Here we

compare a representative long PC (PC1, left), as shown in Chapter 3, with a representative

short PC (right). By contrast with the long PCs, there is no dependence on timescale for the

non-Gaussian features of the short PCs (Fig. 4.2B, D, F), indicating that these sub-diffusive

profiles (under-sampling extremes and over-sampling the average) correspond to persistent

anharmonic aspects of backbone and side-chain dynamics. The short PCs shown in Fig. 4.2

are representative of groups of neighbouring PCs which have similar distributions. Fig. 4.3

highlights this by showing a group of 5 long and short PCs from the 10 ns simulation

averaged over ten 1 ns windows.

We further emphasize the band structure of non-Gaussian short PCs in Fig. 4.4, which

shows ΔPkN surfaces for all 270 PCs for NCαC atoms, 470 PCs for NHCαCO and 430 PCs for

SIDE atoms, from the 10 ns simulation. Flat regions indicate PCs with nearly perfect

Gaussian distributions (ΔPkN<0.01), while peaks and valleys indicate non-Gaussian

distributions and suggest anomalous diffusion of those PCs in time. The landscapes show

central peaks for short PCs indicating sub-diffusion. While the sub-diffusive features at high

PC are concentrated in one band (i.e. a single spatial scale) for NCαC, this is not the case for

the main chain (NHCαCO) or the side chain atoms. The NHCαCO results reveal clusters of


sub-diffusive modes across a number of spatial scales at high PC index. This is also true of

the heavy side chain atoms, where it is interesting to note that these features ride on a super-

diffusive envelope. Fig. 4.4 also shows the RMS deviation between the acquired PC

distribution and a Gaussian curve for each atomic subset. These plots show the distribution

of anharmonic features across the short PC spectrum, and reveal distinct bands of sub-

diffusive components.

Figure 4.2: Difference between eigenvalue-normalized PC distributions and a unit Gaussian, ΔP, for the longest PC (left) and a representative short PC (right) of the backbone and side chains of gramicidin A. PCA was executed independently for multiple windows at various timescales from the 64 ns simulation (with holonomic constraints), and PC distributions were averaged for a given timescale. Results for the 10 ns simulation (without constraints) are also shown for comparison.


Figure 4.3: Non-Gaussian features of long (left) and short (right) PCs for the NCαC backbone (top), the NHCαCO main chain (middle), and side chains (bottom) of gramicidin A. The data were averaged across 10 samples of PC trajectories extracted from PC analyses on 1 ns windows of H-unconstrained simulations recorded every 10 fs. The distributions were normalized by their eigenvalue σk for comparison with a unit Gaussian. Clustered around the dotted Gaussian curve are the acquired distributions P (left axis), while around the origin is shown the difference ΔP between the acquired distribution and the unit Gaussian (right axis). Each 1 ns sample has 100,000 steps, so each distribution in the figures represents about a million data points.


Figure 4.4: Surfaces on the left show the difference between the normalized PC distribution and a unit Gaussian, ΔP, for all components in the PCA of heavy backbone atoms (NCαC), main chain (NHCαCO) and heavy side chain atoms (SIDE) in gramicidin A. The root mean square difference between acquired distributions and a Gaussian distribution is shown on the right.


4.3 MSD and Anomalous Diffusion

All the results presented above describe the spatial characteristics of the system

averaged over time. To make proper contact with anomalous diffusivity, we now study the

time-ordered behaviour of our system by computing the MSD of each PC. The projection

δxk(t) of the full MD trajectory onto each PC is a trajectory of steps whose size is measured

relative to the time-averaged structure. We construct a PC walk to represent the total

displacement along a given eigenvector through the course of our simulated trajectory:

$$ x_k(t) = \sum_{t'=0}^{t} \delta x_k(t') \qquad (4.1) $$

The MSD is related to the autocorrelation function of any trajectory of displacements in time.

It is the ensemble average of all possible displacements x(t) such that

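$$ \mathrm{MSD}(t) = \left\langle \left[ x(t_0 + t) - x(t_0) \right]^{2} \right\rangle_{t_0} \qquad (4.2) $$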

The average <> is over all possible origins t0 and for every timescale t in the trajectory.

Since the number of possible origins t0 for a timescale t is (T-t), we can only expect adequate

statistical sampling up to t~T/2.
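The construction of the PC walk and its MSD can be sketched as follows. This is an illustrative sketch only: it assumes that the walk is the running sum of the projections along one eigenvector and that the MSD is averaged over all available time origins, and the function names and the constant-step test input are hypothetical.

```python
import numpy as np

def pc_walk(dx):
    """Running sum of the projections along one eigenvector (assumed form of Eq. 4.1)."""
    return np.cumsum(dx)

def msd(x, max_lag=None):
    """MSD averaged over all time origins t0, for lags 1..max_lag (in frames)."""
    n = len(x)
    if max_lag is None:
        max_lag = n // 2                      # sampling degrades beyond t ~ T/2
    lags = np.arange(1, max_lag + 1)
    values = np.array([np.mean((x[lag:] - x[:-lag]) ** 2) for lag in lags])
    return lags, values

# Constant steps give a ballistic walk, for which the MSD grows as lag^2
lags, values = msd(pc_walk(np.ones(50)))
```

The explicit loop over lags costs of order T² operations; for long trajectories an FFT-based autocorrelation would be the usual substitute, at the price of less transparent code.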

The term “anomalous diffusion” properly applies to systems whose particles have an

MSD which scales nonlinearly in time (214, 215). These are non-Brownian processes which

obey a generalized Einstein relation:

$$ \mathrm{MSD}(t) = 2\,d\,D_\beta\,t^{\beta}, \qquad (4.3) $$

where $D_\beta$ is the (anomalous) diffusion coefficient and d is the dimensionality of the system.

If β<1, a process is sub-diffusive in the sense that it moves away from its average more

slowly than Brownian diffusion (“sub”-linear). A sub-diffusive process has anti-persistent

correlations, where consecutive steps are more likely to move in opposite directions than they

would in a random walk. If β>1, a process is super-diffusive in that it moves away more

quickly than Brownian diffusion (“super”-linear). A super-diffusive process exhibits

persistent correlations, where consecutive steps are biased to continue in the same direction.

Note the distinction between this temporal exponent β and a spatial scaling exponent, which

we denote as α in the covariance eigenvalue spectra presented above.
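The temporal exponent can be estimated from an MSD curve by a straight-line fit on a log-log scale. The sketch below assumes the standard form of the generalized Einstein relation, MSD(t) = 2 d D t^β, with d = 1 for a single PC; the function name, the time grid and the synthetic sub-diffusive curve are illustrative assumptions.

```python
import numpy as np

def fit_anomalous_exponent(t, msd, d=1):
    """Least-squares fit of MSD = 2 d D_beta t^beta on a log-log scale."""
    beta, log_prefactor = np.polyfit(np.log(t), np.log(msd), 1)
    d_beta = np.exp(log_prefactor) / (2.0 * d)
    return beta, d_beta

t = np.logspace(-2, 1, 40)          # arbitrary time units
msd_sub = 2.0 * 0.3 * t**0.7        # synthetic sub-diffusive curve
beta, d_beta = fit_anomalous_exponent(t, msd_sub)
```

The fit is only meaningful over a range where the MSD follows a single power law, so in practice the time window should exclude both the shortest lags and the plateau of a bounded coordinate.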


Fig. 4.5 shows the MSD for a representative subset spanning all PCs of the NHCαCO

and SIDE atomic subsets, across six orders of magnitude in time, from the 10 ns simulation.

Careful examination of this figure reveals a number of interesting features. First, there is a

leveling of MSD(t) at long timescales past ~1 ns (with the exception of the first PC). This

leveling is a result of the fact that we are analyzing a bounded system of fixed volume: at

some timescale all PCs must cease moving away from their average and return to it. Thus

the rollover in the MSD may be considered an “edge effect”, though it may also contain

interesting information about the timescales of the collective motions in our system. For

example, we would expect that covariant motions at smaller spatial scales in the protein will

be bounded at increasingly short timescales, and this is evident in Fig. 4.5. Comparison of

the MSD curves with lines of slope 1 and 2 (dotted gray lines) makes some general trends

apparent. The longest PCs scale with β=2, indicating ballistic motion unimpeded by thermal

perturbation, while shorter PCs tend towards β=1.5 or even β=1. Moreover, there is non-

trivial structure to the groupings of trends (in time) among PCs in the backbone, which is

made evident by the changes in spacing between groups of curves in the figure. This is

similar to the groupings of non-Gaussian distributions shown in Fig. 4.4.

The most interesting feature in Fig. 4.5 is the observation of pronounced oscillations

among the shortest PCs at timescales below ~1 ps. These oscillations are most visible in the

case of the side chains, although they are also present in the backbone with lower

frequencies. This suggests that the sub-diffusive features apparent in the non-Gaussian

distributions of short PCs are a result of short timescale oscillations, rather than longer

timescale sub-diffusive sampling. The superposition of locally sub-diffusive PCs on a global

super-diffusive envelope in the side chains of Fig. 4.4 may also be attributable to this

interplay of short and long timescale behavior. Note that although oscillations are not

‘diffusive’, they meet the definition of sub-diffusion in that consecutive steps are anti-

correlated (at a particular timescale).

Figure 4.5: The mean square deviation of every 11th PC for the NHCαCO and SIDE atomic subsets. The curves are evenly spaced by a constant c at their origin. Linear (β=1) and ballistic (β=2) values of slope β are shown in dotted gray as a guide for the eye.

In order to amplify small changes in the scaling of the MSD, in Fig. 4.6 we plot the

instantaneous slope of the MSD as a function of time for long and short groups of PCs.

These plots also highlight the fact that consistent power-law scaling is persistent on all

timescales up to ~100 ps for all PCs (and up to ~1 ns for the longest PCs). These plots reveal

a surprising array of oscillations in the short PC regime, with consistent frequencies across

groups of PCs and transitions to higher frequencies for shorter PCs. This figure also makes

clear that there is a general transition at ~1 ps, between very short timescale behavior and

longer timescale dynamics (100 ps - 1 ns). This is the expected ballistic (β=2) to diffusive

(β<2) transition for the longest PCs, indicating the timescale at which collective motions

become restrained by thermal perturbations of their directions and velocities of motion.

However, for short PCs the opposite trend is also apparent in the backbone, from slower sub-

diffusive scaling at short timescales to faster diffusive scaling at long timescales.
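
The instantaneous slope plotted in Fig. 4.6 can be sketched as a finite-difference derivative of log MSD with respect to log t. The example below is only an illustration under assumed inputs: the function name and the synthetic crossover curve (ballistic at short times, diffusive at long times) are not taken from our data.

```python
import numpy as np

def instantaneous_slope(t, msd):
    """Local exponent beta(t) = d log10(MSD) / d log10(t), by finite differences."""
    return np.gradient(np.log10(msd), np.log10(t))

# Synthetic curve: ~t^2 for t << 1 and ~t for t >> 1 (arbitrary time units).
t = np.logspace(-2, 2, 200)
msd = t**2 / (1.0 + t)
beta = instantaneous_slope(t, msd)
```

For this synthetic curve the local exponent falls smoothly from about 2 at the shortest times to about 1 at the longest, which is the kind of ballistic-to-diffusive transition described above for the longest PCs.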

Figure 4.6: Instantaneous slope of log10(MSD) functions shown in Fig. 4.5, for the long (left) and short (right) PCs of the heavy-atom backbone (top), main chain (middle) and the heavy side chain atoms (bottom).

4.4: Collective Oscillations in the Small Covariance Regime

To systematically investigate the frequencies of collective motions revealed in the

MSD, we computed the Fourier transform of the curves depicted in Fig. 4.6, for the

oscillatory regime below 1 ps. In Fig. 4.7 we plot the square of the Fourier amplitude for all

PCs of our three atomic subsets, representing power in the frequency domain. These results

reveal the existence of two dominant collective oscillations in the backbone of gA which can

be compared with experimental results from infrared and Raman spectroscopy. The first is a

broad peak centered at ~5 THz (165 cm-1), spanning PCs 90-120. The second is a sharper

peak centered at ~40 THz (1320 cm-1) near PC 250. There is good agreement between the

results for the two backbone atomic subsets, with the NHCαCO showing the same dominant

features at similar frequencies and PCs as the NCαC set, but with higher resolution and

higher frequency components in the former, as expected from the inclusion of the hydrogen

bonding elements. The side chain spectra also show many sharply resolved modes at high

PC index, with a pair of dominant modes at 20 THz (660 cm-1) and 40 THz (1330 cm-1), and

other distinct modes apparent both above and below these frequencies.
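
A minimal sketch of the spectral step is given below: the squared Fourier amplitude of a real signal, together with the conversion from frequency in THz to wavenumber in cm-1 used when comparing with infrared data. The function names, the 2 fs sampling interval and the synthetic 5 THz cosine are assumptions made for the example, not values taken from our trajectories.

```python
import numpy as np

C_CM_PER_S = 2.99792458e10  # speed of light in cm/s

def power_spectrum(signal, dt_ps):
    """Return frequencies (THz) and squared Fourier amplitude of a real signal."""
    signal = np.asarray(signal, dtype=float)
    amplitude = np.fft.rfft(signal - signal.mean())
    freq_thz = np.fft.rfftfreq(signal.size, d=dt_ps)  # 1/ps is THz
    return freq_thz, np.abs(amplitude) ** 2

def thz_to_wavenumber(freq_thz):
    """Convert a frequency in THz to a wavenumber in cm^-1."""
    return freq_thz * 1e12 / C_CM_PER_S

# Synthetic 5 THz oscillation sampled every 2 fs over 1 ps (assumed values).
dt_ps = 0.002
t = np.arange(500) * dt_ps
freq, power = power_spectrum(np.cos(2 * np.pi * 5.0 * t), dt_ps)
peak_thz = freq[np.argmax(power)]
peak_cm = thz_to_wavenumber(peak_thz)
```

With a 1 ps window the frequency resolution is 1 THz, so the peak falls in the 5 THz bin, which converts to roughly 167 cm-1.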

Although it is tempting to attribute these oscillations to covalent bond vibrations,

analysis of the associated eigenvectors reveals that this is not the case in general. In fact, the

lowest frequency oscillations are associated with motions that span many heavy atoms in

both the backbone and the side chains, and hence represent collective oscillations across

functionally significant portions of our protein. Here we focus on the structure of PC

eigenvectors associated with the lowest frequency backbone oscillations in order to highlight

the possible functional significance of these motions, and the utility of information in the

previously ignored small-covariance regime of PCA.

Figure 4.7: Spectral power of the oscillatory regime for β (below 1 ps, as shown in Fig. 4.6).

Fig. 4.8 depicts three sample backbone eigenvectors from the broad 5 THz (165 cm-1)

band near PC 100. We illustrate the structure of displacements along each eigenvector by

superposition of the NHCαCO backbone projected away from the average structure along the

positive and negative directions of the PC eigenvector. Careful examination of this figure

reveals that in general the displacements are on the scale of a single peptide plane, with

tilting of the carbonyl oxygens and amide hydrogens apparent at a number of amino acids.

This suggests amide plane librations, whose functional significance for cation transport was

reviewed in Chapter 1. There are about 30 PCs in this group, and examination of the

eigenvectors in time makes clear that the group as a whole spans tilting motions of each

amide plane in the protein (note that there are 30 amide planes in gA). Far-infrared FT-IR

spectroscopic measurements of gA without cations have determined that carbonyl librations

occupy a band between 75 cm-1 and 175 cm-1, and there are other IR-active modes up to 500

cm-1 (79, 80). This is consistent with the low-frequency features in Fig. 4.7B, which span the

entire far-IR range from ~33 cm-1 to 500 cm-1. Moreover, the same experiments measured

broad absorption peaks upon addition of Li+ (79), K+, Rb+, and Cs+ (80) cations to the

channel, with the frequencies of cation mobility similar to those of the carbonyl libration

band. This shared timescale suggests that the librational modes of the amide planes may be

coupled to cation transport through the channel.

We have also examined the eigenvectors associated with the higher frequency

backbone mode near 40 THz (1320 cm-1). These are motions within the amide plane

associated with stretching of the carbonyl oxygen and amide hydrogen bonds, and are thus

clearly visible in the NHCαCO eigenvectors. We conclude that gA has coherent oscillations

near 40 THz (1330 cm-1) within the hydrogen bonds which define the secondary structure.

Finally, examination of the side chain eigenvectors shows that the dominant oscillation

modes correspond to bending and torsion of the Trp indole rings (peaks c1 and c2

respectively, in Fig. 4.7C), which carry a significant dipole moment and form hydrogen

bonds with the lipid headgroups in the membrane (34). This suggests that all the MSD

oscillations of short PCs are associated with hydrogen bonding, which also explains their

sub-diffusive distributions as well as their anharmonicity.

Figure 4.8: Illustration of backbone eigenvectors for sub-diffusive PCs 100, 110 and 120 of the main chain NHCαCO atomic subset. The front and back of the helix are shown separately for clarity. The superimposed structures are displaced 5 Å away from the average structure along the appropriate eigenvector, in the positive (red) and negative (blue) directions. Areas where peptide plane motions result in large displacements of the carbonyl oxygen are highlighted in circles.

4.5: Discussion

While most NMA and elastic network studies have focused on the longest

wavelengths as in PCA studies, some have studied the shorter wavelengths (74, 116, 131) as

well. The present study suggests that the same regime should prove to be a fruitful area of

study in the PCA of simulated MD trajectories. The analysis presented above could be used

as a guide to isolate regions of anharmonic motion in a protein. If this region is smaller than

the entire protein, such as a ligand-binding pocket, then a new PCA could be executed on just

this region and the relevant dynamics would now be apparent at the longest PCs of this re-

analysis. There is one example of a PCA study which has focused on such a binding pocket

in carbonmonoxy-myoglobin (216). I will discuss this study in more detail in Chapter 5.

Another interesting study has suggested that short PCs are more important in

determining the protein folding pathway than long PCs, using a method called “Essential

Dynamics Sampling” (217). After extracting 306 eigenvectors for the Cα atoms by

performing PCA on the folded structure of cytochrome c at equilibrium, the authors

performed biased MD on the unfolded protein by accepting steps which approached the

folded state, and projecting steps which did not onto various subsets of the equilibrium

eigenvectors. Surprisingly, the protein could be re-folded by biasing on the shortest 100

eigenvectors but not on the longest or mid-range 100 eigenvectors, or even on the complete

306 eigenvector set. The study concluded that “the most rigid quasiconstraint eigenvectors,

representing in the folded protein the smallest collective vibrations, contain the proper

mechanical information for the folding process”.

The anharmonic character of the Fourier spectra in Fig. 4.7 is also worthy of

comment; the orthogonal decomposition of collective modes in the implementation of

atomistic MD clearly lumps many frequencies of motion together in modes at different

spatial scales. This indicates that collective modes in a protein have complex dynamics with

a nonlinear dispersion relation, as originally pointed out by García (111). This finding

underlines the need to exercise caution when interpreting the spatial wavelengths from PCA

using quasi-harmonic approximations which map one wavelength onto one frequency. This

is one of the central issues in the interpretation of IR spectra, which often uses this

assumption when assigning modes with the aid of NMA calculations.

Finally, the global structure of the PCA eigenvalue spectrum shown in Fig. 4.1

deserves some discussion, as do the different values of scaling exponent α. The linear

scaling observed for all long PCs is evidence that these PCs do not describe thermal motion,

but what of the various values of α≠2 in the short PC regime? It may be that the backbone

and the side-chains exhibit different ‘colors’ of noise (i.e. the frequency dependence of the

spectral density). Moreover, the power-law implies that PCs with common scaling are

structurally related to one another through scale invariance. Mathematically, a function f(x)

is scale invariant if multiplying x by a factor m rescales f(x) by a factor that depends only on m

(independent of x). In general, such scale invariance is defined by the relation

$$f(mx) \propto f(x).$$

It is easy to verify by substitution that the power law f(x)=Ax^p satisfies this relationship

(218). Hence, observation of power-law scaling among the PCs of a protein implies scale

invariance among its collective modes of motion. This suggests a hierarchical structure

among large and small scale motions, and important geometric relationships among the PC

eigenvectors which share the same scaling (as previously pointed out by (117, 219)). This in

turn indicates that an adequate description of protein motion is likely to require information

spread across the entire PCA spectrum, or at least across all components which scale

together, and not just the largest few PCs as conventionally analyzed.
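
The substitution mentioned above can be written out explicitly. The lines below are a short worked step using only the power law f(x)=Ax^p already introduced; the second line is the familiar consequence that a power law appears as a straight line of slope p on a log-log plot.

```latex
\begin{aligned}
f(mx) &= A\,(mx)^{p} = m^{p}\,A\,x^{p} = m^{p} f(x), \\
\log f(x) &= \log A + p \log x .
\end{aligned}
```

The ratio f(mx)/f(x) = m^p depends on m and p but not on x, which is the sense in which the power law is scale invariant.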

The suggestion of grouping PCs together runs counter to the idea that PCs are

independent; by construction, PCA is supposed to yield eigenvectors whose time trajectories

are uncorrelated. But this is only true if those trajectories have a Gaussian distribution, and

this is precisely what we have shown not to be the case for isolated bands of short PCs. This

means that the time trajectories of different PCs may be correlated (in the super-diffusive

case) or anti-correlated (in the sub-diffusive case) in the non-Gaussian regime. The

oscillations in Figs. 4.6 and 4.7 are anti-correlated for a distinct timescale, and many

neighboring PCs share the same frequency of motion; this is evidence that many PCs may be

meaningfully grouped into a single mode. This finding could have a significant impact on

the interpretation of PCA and its ability to isolate functionally meaningful modes of motion

in MD simulations, not only for fast motions but also for the largest PCs spanning the

conformational degrees of freedom in a protein. A quantitative structural analysis of such

groupings within the largest backbone PCs of gA, SS and RR is the subject of Chapter 5.

4.6: Conclusion

PCA has traditionally been used in many disciplines to characterize the degrees of

freedom which span most of the fluctuations in a system. PCA studies of protein dynamics

have been no exception, focusing on the longest (slowest) PCs, motivated by predicting long-

time dynamics beyond the reach of current simulations (220). In an early and influential

study, Amadei et al. (112) defined the ‘essential subspace’ as “a few degrees of freedom in

which anharmonic motion occurs that comprises most of the positional fluctuations” in the

system. Here, we have shown that the anharmonic features of the long PCs may be artifacts

of insufficient sampling, whereas they are persistent for some shorter PCs. Thus,

anharmonicity extends beyond the motions which comprise “most of the positional

fluctuations”, and we suggest that these non-Gaussian-distributed modes are potentially

important in the description of function, regardless of their spatial scale. While function is

difficult to define and quantify, anharmonicity is evidence of coupling among modes, which

is likely to be necessary in the complex motions required for function.

Systematic examination of anharmonic features in the short PC regime has

identified collective oscillations with functional implications for gA; a group of backbone

oscillations was revealed at ~5 THz (165 cm-1) and can be identified as peptide plane

librations, whose carbonyl oxygens help solvate the lumen and cation in the channel. Our

results demonstrate that PCA can be used to isolate interesting covariant motions on a

number of different space and time scales – in a part of the PCA spectrum that is usually

ignored – and highlight the need for an adequate structural and dynamical account of many

more PCs than have been conventionally examined in the analysis of protein motion. This

analysis is readily applicable to any protein system for which MD simulations are available.

Chapter 5: Collective Modes at Large Covariance

5.1: Introduction

We are interested in Principal Component Analysis primarily as a technique for

transforming atomistic trajectories of N particles with 3N degrees of freedom into 3N distinct

coordinates of motion which span all particles in the system. These coordinates represent

collective modes of motion, i.e. the instantaneous displacement of particular groups of atoms

away from an average structure. They are orthogonal in space and uncorrelated in time by

construction. In Chapter 2 we discussed the lessons learned in the atmospheric sciences

through use of PCA (EOF) to isolate physically distinct patterns of atmospheric variables

(such as air pressure or wind velocity): complex systems are not likely to have either

orthogonal or uncorrelated collective dynamics. This implies that further transformations of

PCs are necessary to accurately represent the structure of physically distinct collective modes

in a protein. These patterns, these dynamic ‘structures’, are encoded in the 3D shapes of PC

eigenvectors, that is, the magnitudes and directions of the N 3D-vectors described by the 3N

components of each eigenvector. The accurate structural description of collective modes is

fundamental to the aims of molecular dynamics and structural biology in general – if it can

be shown that the collective dynamics are relevant to biological function.
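
For concreteness, the transformation described above can be sketched in a few lines of Python. This is a generic illustration of PCA on a coordinate trajectory, not the code used in this work: the function name and the synthetic data are assumptions, and the trajectory is assumed to have already been aligned to remove overall translation and rotation.

```python
import numpy as np

def pca(trajectory):
    """PCA of a trajectory array with shape (n_frames, 3N).

    Returns eigenvalues (descending), eigenvectors (one per column) and the
    projection of every frame onto every eigenvector.
    """
    X = np.asarray(trajectory, dtype=float)
    deviations = X - X.mean(axis=0)          # displacement from the average structure
    covariance = deviations.T @ deviations / (X.shape[0] - 1)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]    # largest variance first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    projections = deviations @ eigenvectors
    return eigenvalues, eigenvectors, projections

# Synthetic stand-in for an aligned trajectory: 2000 frames, N = 4 atoms.
rng = np.random.default_rng(1)
frames = rng.standard_normal((2000, 12)) * np.linspace(3.0, 0.2, 12)
eigenvalues, eigenvectors, projections = pca(frames)
```

The eigenvectors returned here are orthonormal and the projections have a diagonal covariance matrix, which is the precise sense in which PCs are orthogonal in space and uncorrelated in time by construction.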

In this chapter we undertake a detailed study of the structure of eigenvectors from

PCA of gA, SS and RR channels solvated in GMO and water. There are 3N-6 eigenvectors,

each with 3N elements, three at each of the N atomic sites. Normally we would refer to each

element of a vector as a component, but since this term becomes ambiguous when discussing

‘Principal Components’ (eigenvectors), we will use the term ‘loading’ below, since this is

conventional in other disciplines. Our first task is to quantitatively describe both the

magnitudes and directions of displacements represented by PC eigenvector loadings at each

site of the structure. What quantity do we compute from the PC loadings to represent

direction, and more to the point, what is the functionally relevant coordinate? Do we monitor

an angle (if so, with respect to what axis), or a dot product (with which unit vector in space)?

And how do we compare different PCs, by calculating RMSD of loading magnitudes or

RMSΔθ of their orientations? In this chapter we emphasize that measuring the relative

orientations Δθ of the PC vectors of neighbouring atoms, i.e. the coherence of motion among

subunits of a molecule, is an excellent measure of the ‘simplicity’ sought by the Empirical

Orthogonal Functions (EOF) methods described in Chapter 2. We can reasonably expect that

functionally organized motions of a complex biomolecule would have co-directional motion

of their structural sub-units. We argue that this is the right thing to measure about the

structure of PC eigenvectors, and also to judge the results of combining PCs together.

The main proposition in this chapter – and this thesis – is a simple transformation of

eigenvectors guided by a straightforward interpretation of the eigenvalue spectrum. If the

covariance eigenvalues represent the spatial amplitude of PCs – defining the variance of their

distributions over time – then a sum of eigenvectors weighted by these amplitudes should

yield a reasonable approximation of a physical mode of motion which has been decomposed

across a number of different Principal Components. Furthermore, there is substructure

apparent in the eigenvalue spectrum of the polypeptide backbone, which suggests one or

more ‘band-gaps’ between distinct modes of motion. We use this structure to determine

which PCs to add together and also to propose a quantitative criterion to separate the

conformational motions from internal dynamics of the backbone. We use our ‘directional

coordinate’ to identify and describe four apparently coherent modes from the 25 PCs in this

conformational regime. This is a valuable reduction of the MD data set, and fulfills the

primary goal of PCA to reduce a high-dimensional MD data set onto a few convenient

coordinates describing comprehensible modes of motion. Comparisons of results for the

gramicidin dimer gA with two of its covalently linked analogs SS and RR demonstrate the

ability of our technique to differentiate functionally relevant motions which arise from

structural differences.
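
The weighted sum proposed above can be sketched as follows. This is a hedged illustration rather than the exact procedure applied later in the chapter: the function name is hypothetical, and because this section does not fix whether the weight is the eigenvalue itself or its square root, both are offered as an assumed option.

```python
import numpy as np

def emergent_mode(eigenvectors, eigenvalues, indices, weights="std"):
    """Combine the PCs listed in `indices` into a single unit-length mode.

    eigenvectors: array (3N, n_pcs), one PC per column.
    weights: "std" uses sqrt(eigenvalue); "var" uses the eigenvalue itself.
    """
    idx = np.asarray(indices)
    lam = np.asarray(eigenvalues, dtype=float)[idx]
    w = np.sqrt(lam) if weights == "std" else lam
    mode = eigenvectors[:, idx] @ w          # weighted sum over the chosen PCs
    return mode / np.linalg.norm(mode)

# Toy check with an identity basis: PCs 1 and 2 have variances 9 and 4.
basis = np.eye(6)
variances = np.array([9.0, 4.0, 1.0, 0.5, 0.2, 0.1])
mode = emergent_mode(basis, variances, [0, 1])
```

Note that the sign of each eigenvector returned by a numerical eigensolver is arbitrary, so any such sum presupposes a sign convention for the PCs being combined; the sketch leaves that choice to the caller.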

The analysis presented below is for the subset of main-chain backbone atoms NCαC

for the 30 amino acids in gA, SS or RR molecules, not including the C-terminus

ethanolamine groups or the N-terminus formyl groups (the terminal groups fluctuate in

an unstructured manner, and while eigenvectors extracted with these included in the analysis

are similar, their features are less clear). Note that PCA is only a descriptive technique, and

the choice of atomic subset only influences the portion of simulation data which is observed;

all atoms still move under the influence of the complete force field created by every atom

within the simulation. Hence we do not include the dioxolane linker in our analysis of SS

and RR, simply for the sake of direct comparison with PCA of gA, but the influence of this

linker is encoded in the motion of the NCαC atoms. Since our primary interest is a

comparison of linked and non-linked analogs of gA, which are distinguished from each other

by the structural characteristics of their backbone, we do not include any analysis of side

chains here.

5.2: Band Gaps in the Eigenvalue Spectra

In the previous chapter Fig. 4.1 showed the PCA eigenvalue spectra for gA on a log-

log scale, where a power law is made obvious by the linear slope of the curve. The different

curves were for PCA carried out with various atomic subsets, from a single atom per residue

(Cα), through backbone atoms (NCαC or NHCαCO), to all atoms including side chains with

(ALLH) and without (ALL) hydrogen atoms. In all these spectra there is clear separation of

regimes with two 1/k^α power laws evident: the long-covariance components have α=1 while

the short components scale with α>1. We hypothesize that the transition made evident by

this change in scaling gives a quantitative criterion by which to separate the concerted

motions representing conformational changes of the protein backbone from the smaller

fluctuations internal to the backbone. Note that Fig. 4.1 displays results from the 10 ns

unconstrained trajectory, to adequately capture the smallest eigenvalues. In this chapter we

are interested in the long and mid-range PCs and use the most converged eigenvectors

available, from 64 ns simulations.

In Fig. 5.1A we examine the first 100 (of 270) eigenvalues for the NCαC backbone

atoms of gA. The change in power law for this spectrum occurs near PC 25. Within the

linear regime there are at least three regions with discernable flattened substructure, marked

A, B and C, with obvious gaps between them. This form indicates a sequence of increasingly

degenerate eigenvalues, suggesting strong mixing of those PCs. The regions marked D and E

indicate regions with similar if less discernable substructure, while F falls in a different

scaling regime with α=2. We call these ‘emergent modes’ A-E, since they emerge from

combinations of many individual PCs. Comparison of the Cα, NCαC, and NHCαCO curves

in Figure 4.1 shows that including more backbone atoms in the PCA fills in the same features

with more points, which suggests that differing numbers of PCs span the same underlying

physical modes; including more atoms is equivalent to increasing the resolution with which

these modes are observed. This in turn implies that the structure of individual PCs is

strongly dependent on the atomic subset included in the PCA (the “subdomain instability”

mentioned in section 2.2.4), while the structure of physical ‘modes’ described by groups of

PCs must be conserved.

The root-mean-square fluctuations (RMSF) of our system can be computed by taking

a sum over normalized eigenvalues, each of which represents the RMSF at a particular

spatial scale. The RMSF spanned by each feature of the spectrum is shown as a percentage

under the groups of components labeled A-F in Fig. 5.1A. Each group of components

exhibits an RMSF of comparable size, indicating that neglecting higher-index components

based on their individual spatial covariance σi may not be justified, as is commonly done in

PCA studies of proteins. Mode A spans 34%, mode B spans 20% and mode C spans 11%

of the system covariance. The less obvious modes span 9% (D) and 4% (E), while the sum

of all eigenvalues above component 23 in the steeper scaling regime accounts for 22% (F,G)

of the total RMSF. This is a significant quantitative result: 82% of the backbone fluctuations

fall in the linear scaling regime and can be described by 23/270=9% of all principal

components, while 91% of the components describe motion in the different scaling regime.
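
The bookkeeping behind these percentages can be sketched as a sum of normalized eigenvalues within each group of PCs. The example is illustrative only: the function name is hypothetical, the spectrum is a synthetic 1/k curve rather than our data, and of the boundaries used, only those after PCs 3, 8 and 23 are mentioned in the text, while 14 and 19 are placeholders.

```python
import numpy as np

def group_fractions(eigenvalues, last_index_of_group):
    """Percentage of the summed eigenvalues carried by consecutive groups of PCs.

    last_index_of_group: 1-based index of the last PC in each group, ascending.
    PCs beyond the final boundary are collected into one remaining group.
    """
    lam = np.asarray(eigenvalues, dtype=float)
    edges = [0, *last_index_of_group, lam.size]
    sums = [lam[a:b].sum() for a, b in zip(edges[:-1], edges[1:])]
    return 100.0 * np.array(sums) / lam.sum()

# Synthetic 1/k spectrum with 270 components; boundaries 14 and 19 are placeholders.
k = np.arange(1, 271)
percent = group_fractions(1.0 / k, [3, 8, 14, 19, 23])
```

The returned percentages sum to 100 by construction, so the share of the remaining high-index group is fixed once the shares of the leading groups are known.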

The essential information derived from Fig. 5.1A is the placement of boundaries

between PCs which distinguish separate modes of motion. While there is one particularly

obvious band-gap between PCs 3 and 4, and another between PCs 8 and 9, a more reliable

and quantitative criterion is needed to delineate other modes. The gray line in Fig. 5.1A is

constructed from 6 points which are the average within each group labeled A-F. In red we

show the same line displaced upwards for visual clarity. A line drawn through these points A

to D yields a power law with α=1 and an R²≈0.999, while moving the boundaries between

modes results in obvious kinks along this line and much worse fits to the linear scaling curve.

We take this as quantitative evidence that our proposal for grouping PCs is sound.
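
The fitting criterion described above can be sketched as a linear regression in log-log space through the group averages. The code is an assumption-laden illustration: the function names are hypothetical, the position of each group is taken to be its mean PC index (the text does not specify this), the spectrum is a synthetic 1/k curve, and the boundaries after PCs 14 and 19 are placeholders.

```python
import numpy as np

def group_means(eigenvalues, edges):
    """Mean 1-based PC index and mean eigenvalue for each group [edges[i], edges[i+1])."""
    lam = np.asarray(eigenvalues, dtype=float)
    pairs = list(zip(edges[:-1], edges[1:]))
    centers = np.array([0.5 * (a + 1 + b) for a, b in pairs])
    means = np.array([lam[a:b].mean() for a, b in pairs])
    return centers, means

def fit_power_law(k, values):
    """Fit values ~ C / k**alpha by linear regression in log-log space.

    Returns (alpha, r_squared).
    """
    x, y = np.log10(k), np.log10(values)
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    r_squared = 1.0 - residual.var() / y.var()
    return -slope, r_squared

# Synthetic 1/k spectrum grouped at assumed boundaries within the first 23 PCs.
k = np.arange(1, 271)
centers, means = group_means(1.0 / k, [0, 3, 8, 14, 19, 23])
alpha, r2 = fit_power_law(centers, means)
```

Repeating the fit with shifted boundaries and comparing the resulting R² values is one way to make the "kinks" argument quantitative.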

Figure 5.1B compares the eigenvalue spectra of gA, SS and RR. While the bandgap between

PCs 3 and 4 persists for the linked analogs, it is about half the size of that in the unlinked gA

dimer. Moreover, the other gaps are not apparent in the linked dimers, nor are there any

other groups of flattened neighbouring eigenvalues: modes B-E are no longer distinct in the

linear scaling regime. Bandgaps are usually indicative of an energy separation between

distinct modes of motion, and it seems reasonable that modes which were clearly separated

in the un-linked dimer become mixed when a covalent linker is introduced. The basis for

these observations becomes clear upon examination of the structure of PCs and modes for all

three structures.

Figure 5.1: A: PCA eigenvalue spectrum for the NCαC backbone atom subset of gA, with groupings of multiple components into modes A-G. The dashed lines are a guide for the eye, showing two distinct power laws 1/f^α (α=1 and α=2). Groups of PCs inside a mode are delineated by the light gray vertical lines: the mode number of the last PC in each group is shown along the top of the plot. Along the bottom the cumulative RMSF is shown as a sum across normalized eigenvalues within each mode. The dark gray line plots the average eigenvalue of groups within each mode, and the red line is the same curve displaced upwards for clarity. B: A comparison of eigenvalues for gA, SS and RR channels.

5.3: Spatial Structure of PC Eigenvectors

In the literature applying PCA to protein dynamics, the most common scheme for

representing the structure of PCs is the superposition of molecular structures projected along

a given eigenvector in the positive and negative directions (112-114, 118, 130, 221).

Alternatively, one can display molecular structures representative of highly visited basins in

PC-projection space (117, 222). While these schemes highlight displacements perpendicular

to the chain, they fail to resolve motion parallel to the backbone since all traces superimpose

onto each other in this direction. Sometimes eigenvectors may be visualized by attaching the

eigenvector loadings as arrows on atomic sites directly (128, 129, 144), but on the page this

often fails to convey the full 3D patterns of displacement. This approach may be useful in

schematic form, especially if the motion is much simplified and approximates the

displacement of entire domains (115, 146). More commonly, no attempt is made to fully

characterize the 3D structure of PCs, and only the magnitudes of fluctuations are analyzed

along the primary sequence (116, 119, 127, 131, 134, 141, 143, 145, 148). While such plots

should predict B-factors and hence make contact with experimental data, they do not help

characterize how the protein moves since they ignore the directions of eigenvector loadings.

To give a detailed structural account of the backbone PCs as well as their emergent

modes A-E, we must first develop good quantitative tools for understanding the full 3D

structure of eigenvectors. In order to describe coherence of motion we focus on quantifying

the direction of PC displacement vectors at each atom (rather than their magnitudes) in order

to portray patterns of common direction. To this end a ‘directional coordinate’ is helpful, in

analogy with the ‘reaction coordinate’ commonly used to simplify the multi-dimensional

description of chemical reactions. The motion along the backbone of a protein can be

decomposed into two convenient directions: parallel or normal to the chain. For a helix the

latter can be either along the helical axis (approximately) or perpendicular to it along the

cylinder radius, and this redundancy makes it difficult to visualize. However, motion parallel

to the chain can be quantified by taking the dot product of an atom’s displacement vector

with the tangent to the chain.

Figure 5.2 illustrates the use of two directional coordinates, d_θ±, which characterize the

direction of motion for any atom relative to a vector chosen carefully with regard to the

molecular structure. We take the dot product of every atom’s 3D PC loading with the

average helical pitch of the backbone, once along the front of the molecule (+), and again

along the back (-). We get two vectors for the two sides of the molecule: d_θ− on one side of

the axis defined by θ, and d_θ+ on the other side. This is a vector of average backbone

direction cast against a plane through the helical axis, chosen at angle θ in the plane normal

to this axis, where the origin θ_0 is chosen at the midpoint of the gap (or linker) between the

monomers of the molecule. We then encode the value of this dot product as a colour ranging

from blue (=-1) through purple (=0) to red (=+1). In Figure 5.2A and 5.2B this vector is

shown in bold colors for both the front (d_θ+) and back (d_θ−) of the β-helix. Since the helical

structure of the backbone reverses the direction of the chain from the front to the back of the

molecule, the direction encoded in the color scale is reversed along the back of the molecule

for visual clarity; in this way every atom of the same color moves in roughly the same

direction in absolute space. This procedure aids considerably in reducing the ambiguity of

representing 3D vectors on a 2D surface. In Fig. 5.2C our directional coordinate is

implemented to display the pattern of coherent displacements for PC2 of gA. While the color

scheme captures the direction of each vector relative to the helical pitch to highlight the

displacement pattern, both the magnitude and direction in absolute space are encoded in the

length and direction of each vector attached to each atom of the average structure.
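
A minimal sketch of the directional coordinate is given below, assuming the loading of one PC has been reshaped to one 3D vector per atom. The function name, argument names and the toy vectors are hypothetical, and normalizing each loading so that the result lies between -1 and +1 is an assumption based on the color scale described above. The reversal of direction along the back of the molecule is handled by supplying a separate reference vector for that side.

```python
import numpy as np

def directional_coordinate(loadings, pitch_front, pitch_back, is_front):
    """Cosine between each atom's PC loading and the pitch vector on its side.

    loadings: (N, 3) array, the 3D loading of one PC at each atom.
    pitch_front, pitch_back: reference vectors for the two sides of the helix.
    is_front: boolean array (N,), True for atoms on the front side.
    Returns values in [-1, 1]; atoms with zero loading get 0.
    """
    v = np.asarray(loadings, dtype=float)
    front = np.asarray(is_front, dtype=bool)[:, None]
    ref = np.where(front, pitch_front, pitch_back).astype(float)
    ref /= np.linalg.norm(ref, axis=1, keepdims=True)
    norms = np.linalg.norm(v, axis=1)
    dots = np.einsum("ij,ij->i", v, ref)
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

# Toy example: two front atoms, two back atoms, pitch reversed on the back.
loadings = [[1, 0, 0], [-2, 0, 0], [0, 3, 0], [0, 0, 0]]
d = directional_coordinate(loadings, [1, 0, 0], [-1, 0, 0],
                           [True, True, False, False])
```

In the toy example the first two atoms move along and against the front pitch vector (values +1 and -1), while the third moves perpendicular to the back pitch vector (value 0), which corresponds to the purple mid-point of the color scale.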

In Fig. 5.2D we demonstrate the utility of the directional coordinate by plotting it

directly as a function of chain position. In the same way that the magnitudes of fluctuation

are commonly displayed as a function of chain position, the directional coordinate allows for

a plot of the directional structure of a given PC; in this case it is easy to see that PC2 is a

bending of the molecule, where the central atoms in the hydrophobic core generally move to

the right while the extremal atoms at the hydrophilic ends move to the left. Fig. 5.2

demonstrates the importance of choosing a casting angle which maximizes the displacement

pattern, and it becomes very useful in comparing the PC structures for analogs of the same

molecule. With this new tool for assessing the directional structure of collective modes, we

may now investigate the shapes of principal components for gA, and make quantitative

comparisons with the PCs of SS and RR.

Figure 5.2: A & B: A color key for representing the direction of motion with respect to the helical pitch. The front atoms of the gA dimer are shown in A and the back in B. These two sets of atoms fall along different helical pitch vectors, shown in bold red and blue. Hence for x>0 (front), red atoms move down the pitch and blue atoms move up the pitch, and vice versa for x<0 (back). The color scales from red (-1) to blue (+1) as a function of the dot product of an atom’s displacement vector with the tangent to the chain at the extreme front (or back) of the molecule. At the center of the channel is an appropriately colored bar aligned with the average pitch of the helix for either the front or back of the molecule, along with a bar normal to the pitch. C: Implementation of the color scheme displaying PC#2. The vectors attached to each atom display the components of PC#2 directly, while the color scheme represents the directional information relative to the pitch vectors displayed in A and B. D: 1-dimensional trace of the directional coordinate encoded in the color scheme of C. Both the horizontal position and the color of the trace encode the same information, laid out against a map of the amino acids along the vertical axis. This example demonstrates the utility of this coordinate in separating the dominant directions of motion for the outer and inner turns of the helix in PC#2.

5.4: The Principal Components of gA, SS and RR

Figure 5.3 implements the color scheme illustrated by Fig. 5.2, and depicts the

structure of collective displacement for the largest 3 PCs of gA from three perspectives in

space (along the x, y and z-axes), and compares them to the same PCs for SS and RR from a

single perspective (z-axis). Blue and red atoms move in opposite directions along the helical

pitch. The purple scale picks out motion perpendicular to the helical axis. The first three

principal components of gA have an identifiable structure which spans the whole dimer. PC1

exhibits a counter-rotation of each monomer around the helical axis, much like the wringing

of a towel; each monomer twists in the opposite sense. This is apparent in PC1 of gA in Fig.

5.3, where the top monomer is blue along the back (moving right) and red along the front of

the molecule (moving left), while the bottom monomer has the opposite color scheme,

indicating that it is twisting in the opposite sense. PCs 2 and 3 of gA are orthogonal bending

motions. PC2 exhibits red vertical extremities and a blue core: all the atoms at the top and

bottom of the protein move to the left while the middle uniformly moves to the right. PC3

has a similar structure, but the bend is normal to the page, as indicated by the dark shades of

purple at the extremities and lighter shades of purple towards the centre.

The depiction of PC2 and PC3 looking from the z-axis shows that these modes of

motion are orthogonal bends for gA. The vertical perspective also easily distinguishes the

differences between the first three PCs of gA, SS and RR. The bending patterns move to

PC1 and PC2 for the linked dimers, while the twisting pattern is PC3. This seems reasonable

given the structural differences between these molecules, as a twisting motion is

comparatively hindered by covalent bonds in the linked analogs, while bending modes

require only torsions around these bonds. Inversely, collective bending is comparatively

disfavored in the non-covalent dimer gA due to the main chain discontinuity at the dimer

junction. Another difference that becomes apparent is

Figure 5.3: Illustrations of the three largest PCA eigenvectors using the color key in figure 5.2. Each PC of gA is displayed from three perspectives for clarity. The atoms move in the direction shown by the vectors at each atom, while their color encodes their direction relative to the helical pitch.

that while the angle between the PC2 and PC3 bends in gA and RR is ~90°, it is smaller in

SS, and the bending modes are in general less apparent in this linked analog. Again this can

be explained by the orientation of the dioxolane linker with respect to the backbone pitch,

which acts like a wedge in RR, strongly favouring one direction of bend, while in SS it runs

along the helix which inhibits bending motions in general (PC1 of SS is not even a clear

bend, incorporating some twist). We also note that the relative directions of the bending modes

are different for the linked and un-linked analogs.

In general, any periodicity of coloring along the backbone makes coherent patterns of

displacement discernable. Figure 5.4 shows PCs 4-9 of gA, where the loss of uniform

displacements is apparent in the variations in magnitude of the vectors, and loss of obvious

coherence is apparent in the loss of uniform stretches of color along the backbone. While

Figs. 5.3 and 5.4 give a qualitative picture of motion on the actual protein structure, a more

quantitative representation is shown in Fig. 5.5 for gA, SS and RR. Here the sequence of the

protein is 'unwrapped' onto the horizontal axis of the plot, which shows the primary amino

acid sequence of gA. The dot product of each atom’s direction of displacement with the

helical pitch vectors (as in Fig. 5.2) is plotted on the vertical axis. The color of the curve also

replicates the directional information plotted on the vertical axis, using the same color

scheme as Figs 5.3 and 5.4 to make the relationship between the figures obvious. The gray

vertical lines denote boundaries between turns of the helix, which help map atomic motions

onto structural features of the protein. For example, a twisting motion shows up along the

helical pitch as a sinusoidal trace with a period of one turn. The coherent motion of a domain

of neighboring atoms moving in the same direction is seen as a straight horizontal line in this

representation, and the relative symmetry of displacements between monomers is also readily

apparent from left to right.
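
The two signatures described here, a sinusoid for a twist and a flat line for a coherently moving domain, can be reproduced on an idealized helix. The sketch below is illustrative only: the number of turns, the number of atoms per turn and the single fixed reference direction are assumptions, and it does not use the separate front and back pitch vectors of Fig. 5.2.

```python
import numpy as np

turns = 5        # hypothetical number of helical turns
per_turn = 36    # hypothetical number of atoms per turn
theta = 2.0 * np.pi * np.arange(turns * per_turn) / per_turn  # angular position along the chain

# Rigid rotation about the helical (z) axis: every atom moves azimuthally.
twist = np.stack([-np.sin(theta), np.cos(theta), np.zeros_like(theta)], axis=1)
# Uniform lateral translation: every atom moves in the same direction.
shift = np.tile([0.0, 1.0, 0.0], (theta.size, 1))

reference = np.array([0.0, 1.0, 0.0])  # fixed projection direction (assumption)
twist_trace = twist @ reference        # cos(theta): one full oscillation per turn
shift_trace = shift @ reference        # constant: a straight horizontal line
```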

Figures 5.3 and 5.5 make it easy to compare the PCs of the gA, SS and RR channels,

and thereby quantify the differences in dynamics which are caused by the inclusion of a

covalent linker between channel monomers. It is apparent that the wringing motion of PC1

in gA is present with smaller amplitude in both linked channels as PC3. The two bending

motions of PC2 and PC3 in gA are also present in SS and RR with larger amplitude, as PC1

and PC2. Although the directions of the bends are slightly different, the two bending

motions are still orthogonal within each linked analog.

Figure 5.4: Illustration of gA principal components 4 through 9, using the color scheme of Figure 5.2. These patterns of displacement are more difficult to describe and generally less coherent than PC1-3, although they share the general characteristic whereby the monomers roughly mirror each other’s pattern.

Figure 5.5: Projection of displacement direction onto helical pitch, as a function of residue. The vertical axis plots the dot product of an atom’s displacement vector with the axes of projection shown in Fig. 5.2. The line across the centre of each curve is zero, with +1 above and -1 below. The curve is colored using the same scheme as figures 5.3 and 5.4 to make the relationship between the figures obvious, such that parts of the chain moving to the left (right) are red (blue). Note that magnitudes of displacement are not captured by this plot, only relative direction.

5.5: Coherent Modes From Weighted Sums of PCs

The central result of this study follows from a straightforward interpretation of the

meaning behind the eigenvalues and eigenvectors of PCA: if the eigenvectors $\mathbf{e}_k$ describe

the shapes of collective displacements while the (square roots of the) eigenvalues $s_k$ represent the

spatial amplitude of $\mathbf{e}_k$ over the average of a time trajectory, then the following weighted

sum describes a physically meaningful collective mode $l$ with displacement vector $\Delta_l$ of

bandwidth $\Delta k$, spanning PCs $k_1$ through $k_2$:

$$\Delta_l = \sum_{k=k_1}^{k_2} s_k \, \mathbf{e}_k .$$

In addition to this observation an ansatz must be made to establish the appropriate bounds k1

and k2. The substructure apparent in the log-log representation of the PCA eigenvalue

spectrum (as in Fig. 5.1) is a physically reasonable guide in this respect. Just as spectral

peaks span the modes of motion in an oscillatory system, regions of distinct power law

scaling may delineate separate modes in the diffusive dynamics of a protein system.
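
As an illustration of the weighted sum, the sketch below builds a mode from a block of PCs, each scaled by the square root of its eigenvalue. It is a minimal example on synthetic data, not the analysis code of this work: the function names, the random trajectory and the choice of block 1-3 are assumptions. The sign of each eigenvector returned by an eigensolver is arbitrary, and since no sign convention is stated here the sketch uses the signs as returned.

```python
import numpy as np

def pca(coords):
    """coords: (n_frames, 3 * n_atoms) aligned trajectory.
    Returns eigenvalues in descending order and eigenvectors as columns."""
    centered = coords - coords.mean(axis=0)
    cov = centered.T @ centered / (len(coords) - 1)
    evals, evecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

def coherent_mode(evals, evecs, k1, k2):
    """Sum of PCs k1..k2 (1-based, inclusive), each weighted by sqrt(eigenvalue).
    Eigenvector signs are used as returned by the solver (arbitrary)."""
    s = np.sqrt(np.clip(evals[k1 - 1:k2], 0.0, None))
    return evecs[:, k1 - 1:k2] @ s

# Synthetic stand-in for a trajectory (hypothetical data, 4 atoms x 3 coordinates).
rng = np.random.default_rng(0)
traj = rng.normal(size=(500, 12)) * np.linspace(3.0, 0.5, 12)
evals, evecs = pca(traj)
mode_a = coherent_mode(evals, evecs, 1, 3)
```

Since the eigenvectors are orthonormal, the squared length of the resulting mode equals the sum of the eigenvalues in the block, which gives a quick consistency check on any implementation.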

Following the bandwidths shown in Fig. 5.1 for gA, mode A is composed of PCs 1-3,

while mode B spans PCs 4-8. Figure 5.6 shows these two largest collective modes for gA,

depicted on the molecular structure of the backbone (as in Figs. 5.3 and 5.4). Mode A

exhibits coherent motion of the hydrophobic turns at the junction of the two monomers

(blue), moving out of phase with the outermost hydrophilic turns (red). This stands in

contrast to Mode B, where the hydrophobic turns move out of phase with each other, and

where the inner two turns of each monomer move out of phase with that monomer’s outer

turn. This symmetric and anti-symmetric character is apparent in the constitutive PCs of

these two modes; PCs 1-3 are generally symmetric while PCs 4-8 are anti-symmetric (see

Fig. 5.5), though no individual PC yields as clear a directional profile as their weighted sum.

The character of motion in the modes is also quite distinct from their constitutive PCs. For

example, the wringing and two orthogonal bends of PCs 1-3 combine to yield almost uniform

lateral displacement of individual turns.

Figure 5.6: Coherent Modes of gA, from the eigenvalue-weighted sum of PCs. In Mode A the middle three hydrophobic turns move laterally out of phase with the outermost hydrophilic turns. In mode B the two inner turns of each monomer move out of phase with each other, opening the connection between them.

Figure 5.7: Coherent Modes of gA, represented by two different directional coordinates as a function of atom number, with the corresponding residues shown below the figure. Motion projected along the helix pitch is shown at left, while motion along the helix axis is shown at right. The vertical gray lines delineate equal angular position (i.e. the five turns of the helix).

To more clearly display these features, in Fig. 5.7 we show modes A-F on the

tangential direction coordinate described above, as a function of chain position (as in Fig.

5.5). In this figure we also present these gA modes on a different directional coordinate for

comparison, using the helical axis for the dot product with atomic displacements, where

yellow and cyan colour opposite vertical directions. The most striking feature of the

weighted sums is that they possess larger sections of uniform direction than their constituent

PCs, as is made apparent by the stretches of horizontal lines in this figure. Furthermore, this

uniformity is concentrated along individual turns of the helix. This feature is most clear for

modes A, B, and E along the tangential coordinate of Fig. 5.7, and along the axis coordinate

for modes C and D. The term 'coherent mode' is justified by this observation, since clearly

identifiable structural sub-units of the protein move together, either in or out of phase with

each other. In fact, it is clear from Figure 5.7 that mode B is simply mode A with one of the

monomers having the opposite phase. Mode C can be best described on the axis coordinate,

where it appears the monomers move in opposite directions along the vertical axis. Mode D

is a shearing motion, where the front and back of the molecule move out of phase vertically.

Finally, in mode E each helical turn moves out of phase laterally with its neighbours.

The main difference between modes A and B is the phase of motion at the junction of

monomers, which suggests a functional interpretation for this conductive channel. Notice

that mode A would preserve a continuous water column at the centre of the channel, while

mode B would disturb this path of water molecules. This suggests that mode A may be the

conductive state of gA, maintaining a conductive path for ions to move between gA

monomers, while mode B may be the non-conducting state, breaking the ionic conduction

path between monomers. It is worth noting that mode B is very similar to the one described

by Miloshevsky and Jordan using Normal Mode Analysis with biased path sampling (based

on the Monte Carlo algorithm) on gA simulated in vacuum: "The open state gating

mechanism of gramicidin A requires relative opposed monomer rotation and simultaneous

lateral displacement" (132). A similar mode and gating mechanism may also lead to

dissociation of the dimer, on the 100 ms timescale.

Figure 5.8 compares modes A and B for the three gramicidin channels. We see here

the remarkable result, that despite differences in the size and shape of the leading 3 PCs (as

shown in Figs. 5.3-5.5) the eigenvalue-weighted sum of PCs 1-3 yields almost identical

modes in gA, SS and RR. However, this is not true for mode B of gA, which is not present

in SS and RR. The eigenvalue spectrum of Fig. 5.1 is easily interpreted in light of these

results; the three analogs share the same eigenvalue scaling and band gap for PCs 1-3, but not

for higher-index PCs. Indeed, there is no bandgap to separate PCs 4-8 in SS and RR, and so

no reason to think that those PCs should yield a coherent mode in these analogs. These

results are also sensible in light of the structural differences among these three molecules: the

covalent link between monomers prevents the opening motion of mode B, as well as vertical

modes C and D, but has little influence on mode A.

Figure 5.8 also compares the mode structure to individual PC structures by overlaying

them on the same axis. The symmetry of the directional coordinate between the two

monomers (left and right of the curve) is much stronger in the coherent modes than their

constituent components. While PCs 1-3 have their own symmetry and coherence, it is

striking that such a simple pattern should emerge for mode B from the apparently incoherent

PCs 4-8 seen in Fig. 5.4. This is also true of mode E, which exhibits counter-motion of

all neighboring turns while its components are so non-uniform as to be un-interpretable. This

emergent symmetry suggests that the grouping ansatz which underlies our results is both

sensible and useful.

Figure 5.8: Comparison of modes A and B for gA, SS and RR. Modes are shown in bold, and their constituent PCs in thin lines on the same axis. The modes are more symmetric across both monomers, and more uniform in their directions, than the individual PCs. Mode A is the same for all three analogs, while mode B is different in gA and the linked channels.

5.6: Covariance of PC Trajectories

The analysis presented above depends strongly on grouping certain blocks of

components into modes. While the eigenvalue spectrum seems to contain this information, it

is difficult to discern spectral structure for higher components. The most convincing

evidence for grouping would come from correlations in the time trajectory of components

within a mode group, correlations which should not be there if indeed PCA yielded

independent modes of motion. In Figure 5.9 we show the average covariance matrix for the

first 23 PC trajectories:

$$\langle pc_i(t)\, pc_j(t) \rangle_{t<T}$$

where $pc_i(t)$ is the time trajectory of the MD simulation projected onto the $i$th PC, and the

average is over the entire time trajectory of duration $T = 64$ ns. By calculating the covariance

of every $i$th PC with every $j$th PC, a matrix of values between -1 (anti-correlated) and 1

(correlated) is generated. In Figure 5.9 we also present the absolute value of this covariance

matrix, which yields a more readily interpretable pattern of uncorrelated (white = 0) and

correlated (black = 1) PC trajectories. These plots show that there are significant correlations

in time among the first 9 eigenvectors, and none for PCs of higher index. This is evidence

that the first 9 PCs are not independent of each other, and that groups of PCs must be

considered together to form a single mode of motion.
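
A covariance matrix of PC trajectories with entries between -1 and 1 can be sketched as follows. This is a hedged illustration on synthetic data, not the procedure used for Figure 5.9; the function names and the random-walk trajectory are hypothetical. One assumption deserves emphasis: projections onto the eigenvectors of a covariance matrix are uncorrelated by construction over the same frames used to build that matrix, so the sketch also averages the normalized covariance over shorter windows, which is one possible reading of the subscript t<T and is not stated in the text.

```python
import numpy as np

def pca(coords):
    """Eigenvalues (descending) and eigenvectors (columns) of the coordinate covariance."""
    centered = coords - coords.mean(axis=0)
    cov = centered.T @ centered / (len(coords) - 1)
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

def project(coords, evecs, n_pc):
    """Time trajectories of the first n_pc principal components, shape (n_frames, n_pc)."""
    return (coords - coords.mean(axis=0)) @ evecs[:, :n_pc]

def normalized_covariance(proj):
    """Covariance of PC trajectories scaled to [-1, 1] (a correlation matrix)."""
    return np.corrcoef(proj, rowvar=False)

def window_averaged(proj, window):
    """Average of the normalized covariance over consecutive windows (assumed scheme)."""
    n_windows = len(proj) // window
    mats = [normalized_covariance(proj[i * window:(i + 1) * window])
            for i in range(n_windows)]
    return np.mean(mats, axis=0)

# Synthetic diffusive trajectory (hypothetical): independent random walks in 6 coordinates.
rng = np.random.default_rng(1)
traj = np.cumsum(rng.normal(size=(4000, 6)), axis=0)
evals, evecs = pca(traj)
proj = project(traj, evecs, n_pc=4)
full = normalized_covariance(proj)            # identity, by construction
windowed = window_averaged(proj, window=500)  # off-diagonal structure can appear
```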

In Figure 5.10 we focus on the absolute value of covariance for the first 10 PCs,

comparing results for the gA, SS and RR channels. The dominant feature in the results for

gA is a white square pattern formed by the rows and columns of PC3 and PC8, which

delineates two blocks of components corresponding to modes A and B of the gA dimer.

The results for the SS and RR channels demonstrate that the correlations among PCs 1-3 are

persistent in the linked dimers, and mode A is largely the same as for the non-linked gA

channel. However, the correlations among PCs 4-8 are degraded in SS and almost non-

existent in RR. Hence the ‘opening’ mode B of gA is not present in the linked dimers, as

expected due to the covalent linkage at their centre (with RR more strongly perturbed than

SS). This is strong evidence that our grouping ansatz is reasonable, and independent modes

of motion are indeed spread across a number of PCs in PCA of adequate atomic resolution.

Figure 5.9: Covariance matrix of projected trajectories for the leading 23 PCs of gA (left) and the absolute value of the same quantity (right). i and j label the PC index.

Figure 5.10: Comparison of covariance matrices (absolute value) for the PC trajectories of gA, SS and RR channels.

5.7: Discussion and Conclusions

The features which are apparent in the log-log plot of the eigenvalue spectrum

suggest the notion of ‘covariance bandwidth’, which may be as relevant to the description of

overdamped dissipative systems as ‘frequency bandwidth’ is in describing the modes of

motion for oscillatory systems. Integrating across these features seems to describe real

physical modes which are projected across multiple PCs, in much the same way that many

points along a broad peak in a high resolution Fourier analysis describe an oscillatory mode

across a band of frequencies. The main claim of this chapter is that a linear combination of

PCs within these groups describes physically meaningful modes of motion, and provides a

simple means of describing large-scale functional motions from the components extracted

through PCA.

In the field of climatology the use of Empirical Orthogonal Functions (‘EOF’) (107) is much more developed than the

current use of PCA in computational biophysics. There is a wide array of techniques for

extending and modifying PCA to make PCs more interpretable, or to determine how a

physical mode of activity is projected across more than one PC. For example, in the

“extended EOF” where time-lagged covariance is included (see page 33), PCs with

degenerate eigenvalues are understood to be components of a single mode (degeneracy here

is used in the approximate sense, where the eigenvalues fall within each other’s error bars).

We note that a flattening of the power spectra, as observed within the PC groups proposed

above (see Fig. 5.1A), favors this effect; the flatter the spectrum, the more degenerate the

grouping. Weighting PCs by the square root of their eigenvalue is also a standard operation

before rotating EOFs to obtain simplified mode structure, and suggests that our ansatz is a

physically meaningful operation. Moreover, in Chapter 4 we found groups of PCs in the

small covariance regime which shared the same frequency and phase of oscillations in their

MSD over picosecond timescales. If it is reasonable to group short components into a single

mode, then it may be expected that the same should hold true for the longest components.
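
The notion of approximate degeneracy, where neighbouring eigenvalues fall within each other's error bars, can be made concrete with a short sketch. The error estimate used here is North's rule of thumb from the EOF literature, in which the sampling error of an eigenvalue is roughly the eigenvalue times the square root of 2/N for N independent samples. Its use here is an assumption, as are the toy eigenvalues and the sample count; the text does not specify how error bars would be obtained.

```python
import numpy as np

def north_error(evals, n_independent):
    """Rule-of-thumb sampling error of each eigenvalue: lambda * sqrt(2 / N)."""
    return np.asarray(evals, dtype=float) * np.sqrt(2.0 / n_independent)

def degenerate_groups(evals, n_independent):
    """Group consecutive PCs (1-based indices) whose error bars overlap.
    evals must be sorted in descending order."""
    evals = np.asarray(evals, dtype=float)
    err = north_error(evals, n_independent)
    groups = [[1]]
    for k in range(1, len(evals)):
        if evals[k - 1] - err[k - 1] <= evals[k] + err[k]:
            groups[-1].append(k + 1)   # overlaps with the previous eigenvalue
        else:
            groups.append([k + 1])     # clear gap: start a new group
    return groups

# Toy spectrum (hypothetical values) with two plateaus separated by gaps.
toy_evals = [10.0, 9.5, 9.0, 4.0, 1.0, 0.95]
groups = degenerate_groups(toy_evals, n_independent=200)
```

For a correlated MD trajectory the number of independent samples is much smaller than the number of frames, so the choice of N dominates the outcome of any grouping of this kind.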

Furthermore, it should be noted that the number of components resulting from PCA

scales with the number of atoms included in the analysis, while the number of “real,

physical” collective modes should be conserved independent of the subset of atoms included

in the PCA. Backbone motion of a protein provides a good illustration of this property; we

would expect to find the same set of physically meaningful backbone modes whether we

included only Cα, NCα, or NCαC atoms in our PCA. However, including N amino acids in

the analysis would result in 3N, 6N or 9N components which span the motion of the

backbone (the factor of 3 arises from the dimensionality of space); with more components

describing the same motion, the resulting eigenvectors would have to change shape. This

means that the shapes of individual components are not likely to be meaningful: no more so

than the shape of a single sin(x) function in the context of a Fourier transform on a noisy

signal. We would also expect the number of eigenvalues falling on the plateaus of Fig. 5.1 to

increase as we increase the number of atoms used in PCA. This is apparent to some degree

in Fig. 4.1.

In conclusion, a weighted superposition of principal components yields a small set of

physical modes of motion from a much larger number of Principal Components. To find

meaningful patterns of biomolecular motion one must target different windows within the

average covariance spectrum; the trick is knowing which components to sum together, and

how much of each. Our results suggest that the PCA eigenvalue spectrum contains this

information, and that there are five distinct collective modes of motion for the backbone of

gA solvated in a membrane. With the aid of carefully chosen directional coordinates for this

simple system, the spatial structure of the gA backbone modes (as well as their constituent

PCs) has been quantified to an unprecedented degree, such that differences in dynamic

modes could be resolved among linked analogs of the same molecule. This work presents

an approach to extracting the coherent structure from the apparent noise of biomolecular

motions, and will be helpful in future analysis of MD simulations.

Chapter 6: PCA of GMO Lipids Solvating Gramicidin

6.1: Background

In section 1.1 we introduced a few features of protein-lipid interactions which are

relevant to protein function and gA in particular. In general the presence of a protein within

a phospholipid bilayer increases the orientational order in the lipid matrix, and differentiates

the behavior of “annular” lipids which solvate the protein from those in the bulk of the

membrane (29). A comparative study of gA simulated in DiPhPC and GMO bilayers has

shown that GMO molecules are significantly more ordered than the diacyl chains, with three

distinct solvation shells apparent in the radial distribution function (36). In the case of the

diacyl phospholipid DMPC it is known that annular lipids remain associated with gA for

approximately 100 ns (30); we speculated that since the free energy for moving a single acyl

chain found in GMO is lower than moving two acyl chains in DMPC, we would expect the

annular residence time of a GMO molecule to be shorter than 100 ns, and therefore similar to

the simulation times considered in the current study of gA solvated in a GMO bilayer.

To the best of the author’s knowledge, there have been no PCA studies of membrane

dynamics. There are a number of intrinsic difficulties in applying PCA to a fluid composed

of many monomers. The diffusion and exchange of monomers prevents convergence to a

well-defined average structure at long timescales; indeed, in the long-time limit we would

expect the average structure to converge to a single point in the centre of the plane of

diffusion (given periodic boundary conditions). In the case of a well-structured liquid-crystal

where a lattice may be defined with a single monomer at each lattice site, this difficulty may

be overcome by exchanging the identity of two monomers when they exchange positions.

This is not possible in the case of a more amorphous liquid like the membrane bilayer, where

such a crystalline lattice cannot be defined (as is apparent in the planar distributions shown in

Fig. 6.2 below). Furthermore, PCA demands a well-defined set of atoms with continuous

coordinate trajectories. Any discontinuities of position associated with identity exchange

would give rise to artifacts in the long PCs, since these would appear as large displacements in

the covariance matrix.

In light of these difficulties we limit the current investigation to the relatively

structured annular shell of lipids in the first solvation shell of the gA molecule, where a

particular set of monomers can be chosen for study using PCA. This necessitates selecting

an appropriate timescale for PCA which is shorter than the residence time of monomers

within the annular shell, but long enough to capture any collective motions of lipids within

the shell itself. We hope to use PCA to compare the collective motions of the annular lipids

with the collective modes of the gA molecule itself, in order to establish whether there is

significant dynamical coupling between the protein and its immediate lipid environment.

6.2: Methods

We have performed PCA of the GMO membrane at six logarithmically spaced

timescales between 2 ns and 64 ns, aligning all frames on the NCαC atoms to subtract out

translation and rotation of the system with respect to the gA molecule. While our interest is

mainly in the annular lipids, we have performed PCA on the full membrane, on one solvation

shell (24 lipids) and on two solvation shells (48 lipids) for comparison, and in each case PCA

has been performed on the lipid headgroups alone, the acyl tails alone, and both together. To

limit the size of the data sets while capturing the relevant degrees of freedom, the tails were

represented by a subset of 6 carbon atoms, including both carbons flanking the double bond

in the middle of the acyl chain, the carbon atoms at the ends of the chain, and the carbon

atoms midway between these two positions. In order to address the coupling of lipids with

the gA molecule, we have performed PCA in each of these cases on the lipids alone as well

as the lipids plus the NCαC backbone atoms.
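
The alignment step can be illustrated with a standard least-squares superposition (the Kabsch algorithm), fitted on a subset of atoms and applied to the whole frame. The text does not name the fitting algorithm that was used, so this is an assumed stand-in; the function name, the synthetic coordinates and the choice of fitting atoms are hypothetical. The last line lists the six window lengths implied by logarithmic spacing between 2 ns and 64 ns.

```python
import numpy as np

def kabsch_align(frame, reference, fit_idx):
    """Rigidly superimpose `frame` onto `reference` using only the atoms in `fit_idx`
    (e.g. backbone atoms), then apply the same transform to every atom of the frame."""
    p = frame[fit_idx]
    q = reference[fit_idx]
    pc, qc = p.mean(axis=0), q.mean(axis=0)
    h = (p - pc).T @ (q - qc)                      # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(u @ vt))             # guard against reflections
    rot = u @ np.diag([1.0, 1.0, d]) @ vt          # acts on row vectors from the right
    return (frame - pc) @ rot + qc

# Synthetic check (hypothetical coordinates): rotate and translate a reference structure.
rng = np.random.default_rng(2)
reference = rng.normal(size=(20, 3))
angle = 0.7
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
frame = reference @ rot_z.T + np.array([1.0, -2.0, 0.5])
aligned = kabsch_align(frame, reference, fit_idx=np.arange(10))

# Six logarithmically spaced window lengths between 2 ns and 64 ns.
windows_ns = 2.0 * 2.0 ** np.arange(6)
```

Fitting on the protein atoms only means that any residual motion of the lipids is expressed in the frame of the gA molecule.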

After equilibration of our simulation the gA molecule was located near the edge of a

box of GMO molecules with periodic boundary conditions. To avoid artifacts due to

periodic translation of GMO molecules, the trajectory needed to be ‘unwrapped’ given the

crystal parameters and configuration at a particular moment in time. Since examination of

the complete 64 ns trajectory revealed a number of exchange events between the annular

lipids and the bulk, we created two such unwrapped trajectories: one from the beginning of

the 64 ns trajectory and one from the mid-point of the trajectory at 32 ns. This resulted in

choosing two different sets of GMO monomers for study with PCA. Annular lipids were

selected by choosing all GMO molecules whose centre of mass was within the first and

second minima of the radial distribution function (RDF) to create one and two solvation

shells respectively, for both configurations at 0 ns and 32 ns. The RDF for GMO solvating

gA is shown in Figure 6.1, with minima at 37 Å and 58 Å.
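
The unwrapping and shell-selection steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this work: the box is taken to be orthorhombic, no particle moves more than half a box length between frames, distances are measured from a hypothetical centre, and all function names and toy coordinates are invented for the example. Only the cutoffs of 37 Å and 58 Å come from the RDF minima quoted above.

```python
import numpy as np

def unwrap(positions, box):
    """Remove periodic jumps from a wrapped trajectory.

    positions: (n_frames, n_particles, 3) wrapped coordinates.
    box: (3,) edge lengths of an orthorhombic box (assumption).
    The first frame is kept as the reference configuration.
    """
    positions = np.asarray(positions, dtype=float)
    box = np.asarray(box, dtype=float)
    steps = np.diff(positions, axis=0)
    steps -= box * np.round(steps / box)   # minimum-image frame-to-frame displacement
    out = np.empty_like(positions)
    out[0] = positions[0]
    out[1:] = positions[0] + np.cumsum(steps, axis=0)
    return out

def within_cutoff(com, centre, cutoff):
    """Indices of molecules whose centre of mass lies closer than `cutoff` to `centre`."""
    r = np.linalg.norm(np.asarray(com, dtype=float) - np.asarray(centre, dtype=float), axis=1)
    return np.flatnonzero(r < cutoff)

# Toy trajectory (hypothetical): one particle drifting across the periodic boundary in x.
box = np.array([10.0, 10.0, 10.0])
t = np.arange(50)[:, None, None]
true_path = np.array([[9.0, 5.0, 5.0]]) + t * np.array([0.3, 0.0, 0.0])
wrapped = true_path % box
recovered = unwrap(wrapped, box)

# Toy centres of mass (hypothetical), selected with the RDF minima as cutoffs in angstroms.
com = np.array([[10.0, 0.0], [0.0, 40.0], [60.0, 0.0]])
first_shell = within_cutoff(com, [0.0, 0.0], 37.0)
two_shells = within_cutoff(com, [0.0, 0.0], 58.0)
```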

Figure 6.1: Radial distribution function of GMO lipids surrounding gA.

6.3: Results and Discussion

In order to determine the appropriate timescale for PCA of the annular GMO

monomers, we compared the 2-dimensional distributions of GMO monomers surrounding the

gA dimer in the plane of the membrane. Figure 6.2 shows contour maps of representative

distributions for the 24 monomers in two solvation shells within a single leaflet of the

membrane at various timescales, where the GMO monomer position was represented by its

centre of mass. The distributions for the mass-weighted average of the headgroup and acyl

tail were also computed for comparison, as well as the ester oxygen linking the two; all

results were qualitatively similar. Figure 6.2 makes it clear that there is no fixed solvation

structure to be found in the annular lipids, and even identifying a coordination number is

difficult in this fluid system. It is also clear that the monomers are more localized at short

timescales and become less so at longer timescales. At the longest timescales the symmetry

of the distributions is broken, and it becomes apparent that the same set of monomers no

longer constitutes two solvation shells around the gA dimer: the 32 ns distribution is the

longest sample for which the annular structure is still apparent, but at 64 ns the monomers

have been displaced too much to discern the solvation structure. To investigate the annular

structure further we also compared the average structures obtained for independent PCA

across various subsets of the 64 ns trajectory. These structures are shown in Figure 6.3, and

revealed that relatively symmetric and uniform lipid distributions around the gA molecule

were obtained up to 32 ns, but not for the full 64 ns trajectory (using either set of annular

GMO monomers selected from configurations at 0 or 32 ns). This figure also reveals that the

internal degrees of freedom of the GMO monomers average out near 32 ns, resulting in

straight-chain average monomer structures. Taken together, the results of Figures 6.2 and 6.3

indicate that the solvation structure of gA is persistent for longer than 16 ns but less than 64

ns.

Figure 6.4 shows representative eigenvalue spectra for PCA of 2 through 64 ns time

windows on the headgroup (bottom), tails (middle), and complete GMO monomers (top) for

a single solvation shell around gA, including (right) and not including (left) the gA backbone

atoms in the PCA. There are almost no differences in the largest eigenvalues for the

headgroups, tails, or both together, either with or without the gA atoms, but differences

between the headgroup and tail spectra appear in the mid- and short-scale PCs. The main

difference arising from inclusion of gA atoms appears to be a steeper scaling of the shortest

PCs. The main feature apparent in all spectra is that the shape of the curve for large

eigenvalues is not consistent for the various durations tested. The 64 ns curve is distinctly

different from the others, and combined with its asymmetric average structure this suggests

that the annular lipid structure is not conserved at this timescale. The similarity of the 16 ns

and 32 ns spectra, in addition to the results of Figures 6.2 and 6.3, leads us to focus on the 32

ns timescale in the following analysis of eigenvectors.

The eigenvectors of the largest three PCs of the complete GMO monomers are

illustrated in Figure 6.5. We have used a coloured direction coordinate to resolve some of

the ambiguity of reading 3D data on a 2D graph, by taking the dot product of each arrow

with the (1,1,0) vector, and mapping +1 to red and -1 to blue, with continuous shading

through purple for the values in between. The length of the arrows indicates the magnitude

of fluctuations on a given atom. We show four panels for each PC for clarity, depicting the

front and back (top and bottom) of the system viewed from the X (Z) direction. The average

structure of the gA backbone is also shown for reference in each panel, taken from

independent PCA of the backbone in the same time window. One of the main features of the

eigenvectors is the largely uniform motion of entire GMO monomers, with no significant

differences between the lipid headgroups and their tails; this suggests that the largest PCs

capture the diffusive motion of whole monomers. There are considerable differences

between the motion of monomers in the top and bottom leaflet of the bilayer. The first PC is

dominated by the displacement of two neighbouring monomers, as is the second PC, where

the same monomers move in the opposite direction. There are no obvious global patterns of

motion apparent in these or the third PC.

Figure 6.6 shows the eigenvalue-weighted superposition of PCs 1 to 3, a block

suggested by the distinct and common scaling shown in the eigenvalue spectrum in Fig. 6.4.

These ‘rotated’ PCA results feature more uniformly distributed magnitudes of displacement

across the GMO monomers than individual PCs, and reveal a torsional mode of collective

tangential motions moving clockwise around the gA backbone. This is most apparent in the

top leaflet (+Z) of the bilayer. There is a quadrant of monomers in the top leaflet which

depart from this pattern, but the monomers in the same quadrant of the bottom leaflet match

the tangential pattern of the top. While still quite noisy, it is clear that the linear combination

of PCs 1 through 3 yields a more coherent collective pattern of displacement than any

individual PC shown in Fig. 6.5. The wringing pattern here is also suggestive of the largest

PC of gA, as shown in Figure 5.3, though not of the emergent mode composed of gA

backbone PCs 1-3. This may be evidence of coupling among the largest covariant motions of

the gA backbone and its solvating lipids. Note that this raises the difficult question

of how many PCs are to be compared when searching for common patterns among differing

subsets of a complex system’s motion.

Figure 6.6 also shows the eigenvalue-weighted sum of PCs 4 through 12, which is the

next scaling group in the eigenvalue spectrum. While the pattern of motion here is less

obvious, the collective displacements are largest on the headgroups with very little motion of

the lipid tails. This mode seems to describe motion internal to the GMO

monomer, while the largest mode described the relative motion of whole monomers. This

result demonstrates the ability of linear superpositions of PCs to separate modes of motion in

a complex and noisy system into patterns which are more interpretable than any individual

PC. Moreover, the differentiation of headgroup and tail motion in this mode is also

suggestive of the countermotion of hydrophilic and hydrophobic turns observed in the

dominant mode of the gA backbone, again suggesting the possibility of coupling between

gA and annular lipids.

The PC distributions shown in Figure 6.7 reveal that the largest PCs of the annular

lipids are multimodal, indicating that the tangential motion described in Figure 6.6 is not

likely to be a gradual drift, but a concerted hopping motion of monomers between favoured

solvation sites. There is a very strong overlap between the distributions of the first and

second PC, which is strong evidence that these are components of the same mode. The

multimodal character of the distributions is still apparent up to ~PC10, but they converge to a

relatively smooth unimodal Gaussian after ~PC25. These distributions are reminiscent of the

multimodal side chain distributions shown in Figures 3.4 and 4.3, and suggest the possibility

of coupling between concerted motions of Trp side-chains and annular GMO molecules.

However, examination and comparison of the PC trajectories for GMO and side chains did

not reveal any consistent patterns of correlated jumps between stable positions.
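
The comparison with a unit Gaussian in Figure 6.7 can be sketched in a few lines of Python. The sketch scales a PC trajectory to unit variance, which is equivalent to dividing by the square root of its eigenvalue, and then computes the excess kurtosis. The excess kurtosis is used here only as an illustrative statistic and is an assumption, not necessarily one of the measures introduced earlier in this work; the function names are hypothetical.

```python
from math import sqrt


def normalize_projection(projection):
    """Centre a PC trajectory and scale it to unit variance.

    The variance of a PC projection equals its eigenvalue, so dividing by
    the standard deviation is the same as dividing by sqrt(eigenvalue).
    """
    n = len(projection)
    mean = sum(projection) / n
    var = sum((x - mean) ** 2 for x in projection) / n
    sd = sqrt(var)
    return [(x - mean) / sd for x in projection]


def excess_kurtosis(z):
    """Fourth moment of a unit-variance sample minus 3 (zero for a Gaussian)."""
    return sum(x ** 4 for x in z) / len(z) - 3.0


# Toy trajectory hopping between two equally populated sites.
hopping = [-1.0] * 50 + [1.0] * 50
k = excess_kurtosis(normalize_projection(hopping))
```

A strongly negative value, as for the two-site toy trajectory, points to a flat or multimodal distribution. A value near zero does not by itself establish that a distribution is Gaussian, so a histogram should still be inspected.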

Examination of the magnitudes of eigenvectors on the NCαC+GMO data set reveals

that there are no large displacements of the backbone atoms within the first 30 PCs; all the

fluctuations of the largest PCs are concentrated on the annular GMO molecules. Furthermore,

collective modes from the PCA of the full membrane are dominated by concerted motions of

lipids far from the gA site. These observations remind us that PCA is most effective with a

judicious choice of atoms to be included in the analysis, since the largest PCs are generally

dominated by motions at the largest spatial scale included in the analysis. One must either

look much further into the PC spectrum toward shorter covariant motions to find collective

modes of interest, or look at the largest PCs for a set of atoms which only span the

appropriate spatial dimension.

In conclusion, there are some suggestions of shared patterns of motion among the gA

dimer and its annular lipids, although these are largely qualitative and no obvious

correlations were found among the relevant time trajectories of collective motions. One of

the main obstacles made apparent in this application is the limited ability of PCA to

analyze and differentiate dynamics spread across widely varying timescales, if these are not

directly coupled to widely varying length scales. While this is a preliminary attempt to apply

PCA in a novel situation, and our results show promise in their ability to extract patterns

from very noisy data, more sophisticated methods of time trajectory analysis are needed to

study the coupling of subsystem dynamics in any detail. The standard application of PCA

relies on simple averaging, and as such it is very difficult to adequately address multi-

timescale behavior. Elaborations of PCA which are capable of separating patterns in time

(i.e. “extended” PCA) in addition to patterns in space would be necessary to more fruitfully

tackle this problem. Moreover, application of PCA to unbounded, diffusive, multimeric

systems is intrinsically problematic for the same reasons, since average structures are ill-

defined in this context. PCA is best used on bounded systems with well-defined average

structure.
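
As a rough illustration of how “extended” PCA brings time into the analysis, the following Python sketch stacks each frame with time-lagged copies of itself and forms the covariance matrix of the augmented frames. It assumes that “extended” PCA corresponds to the lagged-embedding construction used for extended EOFs; the function names, the choice of lags and the population normalization are illustrative assumptions. Diagonalizing the resulting matrix is left to a standard eigenvalue routine.

```python
def lagged_embedding(frames, lags):
    """Stack each frame with its time-lagged copies.

    frames : list of T frames, each a list of n coordinates
    lags   : list of non-negative integer frame offsets, e.g. [0, 5, 10]
    Returns T - max(lags) augmented frames of length n * len(lags).
    """
    span = max(lags)
    return [
        [x for lag in lags for x in frames[t + lag]]
        for t in range(len(frames) - span)
    ]


def covariance(rows):
    """Covariance matrix of the columns of `rows` (population normalization)."""
    t, n = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / t for j in range(n)]
    return [
        [sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / t
         for j in range(n)]
        for i in range(n)
    ]
```

The off-diagonal blocks of this matrix hold covariances between coordinates at different times, which standard PCA discards. The cost is that the matrix dimension grows with the number of lags.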

Figure 6.2: Top leaflet: planar distribution function of GMO lipids surrounding gA.

Figure 6.3: Comparison of average structures for one solvation shell of GMO monomers.

Figure 6.4: Eigenvalue spectra for a single solvation shell of GMO around gA. Representative curves are taken from independent PCA of various timescales, doubling in duration from 2 ns (red) to 64 ns (purple). The NCαC gA backbone is included in the analysis on the right, while only annular GMO molecules are included on the left.

Figure 6.5: PC 1 (top), PC 2 (middle) and PC 3 (bottom) for a single solvation shell of GMO monomers around the gA molecule. The directions of displacement are coloured according to their dot product with the (1,1,0) vector.

Figure 6.6: Eigenvalue-weighted sum of PCs 1 to 3 (top) and PCs 4 to 12 (bottom).

Figure 6.7: Eigenvalue-normalized distributions of PC trajectories for the large eigenvalue regime, shown in comparison with a unit Gaussian.

Chapter 7: General Conclusions and Future Directions

In this study we have seen that there is room for the expansion and development of

PCA as a technique for translating large datasets of atomic motions into quantitative

descriptions of collective motions related to function. We have used gramicidin and its

linked analogs as a test system in this regard, due to its structural and functional simplicity.

In this system we have seen that there is information of interest to structural biologists not

only in a few leading principal components, but also distributed throughout the PC spectrum

in the form of eigenvalue scaling, non-Gaussian distributions and MSD oscillations.

Eigenvalue scaling separates conformational changes where many atoms move in a uniform

direction from vibrations internal to a complex molecular structure. Non-Gaussian

distributions identify collective motions dwelling on an anharmonic free energy

surface, indicating coupled or multimodal dynamics which are suggestive of functional

organization. Oscillations in the MSD connect these dynamics to spectroscopic IR

measurements. Moreover, we have demonstrated that the 3D structures of PCs are not likely

to be individually meaningful, and further transformations or combinations of PCs are

needed to yield a description of concerted physical modes of motion. We have proposed an

eigenvalue-weighted linear superposition of eigenvectors grouped according to band-gaps

observed in the eigenvalue spectrum. With the aid of a directional coordinate we have

reduced the conformational degrees of freedom for our simple protein to a small set of

collective modes which are physically intuitive and functionally interpretable.
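
The grouping of eigenvectors by band-gaps can be sketched as a search for large drops between consecutive eigenvalues. The Python sketch below is only an illustration under stated assumptions: the threshold `min_ratio` and the function name `group_by_gaps` are hypothetical, and in practice the blocks were identified by inspection of the eigenvalue spectrum rather than by an automated rule.

```python
from math import log


def group_by_gaps(eigenvalues, min_ratio=2.0):
    """Split a descending eigenvalue spectrum into blocks at large drops.

    A new block starts wherever eigenvalue k exceeds eigenvalue k+1 by more
    than a factor of `min_ratio` (an illustrative threshold, not a value
    taken from this work).
    Returns a list of (start, stop) index pairs, stop exclusive.
    """
    threshold = log(min_ratio)
    blocks, start = [], 0
    for k in range(len(eigenvalues) - 1):
        # Compare on a log scale so that gaps are judged by ratio.
        if log(eigenvalues[k]) - log(eigenvalues[k + 1]) > threshold:
            blocks.append((start, k + 1))
            start = k + 1
    blocks.append((start, len(eigenvalues)))
    return blocks


# Toy spectrum with two plateaus followed by a small trailing eigenvalue.
blocks = group_by_gaps([10.0, 9.0, 8.0, 2.0, 1.9, 1.8, 0.3])
```

Each returned index pair can then be used to select the eigenvectors that enter one superposition.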

The most obvious next step in this study would be to apply our proposals to different

protein systems with known functional modes, both at long and short spatial scales. For

example, it would be interesting to apply the metrics described in Chapter 4 to

carbonmonoxy myoglobin (MbCO) for the short PCs, to see if any non-Gaussian

distributions or oscillations in the MSD relate to collective modes of the CO ligand, heme

group, or the surrounding hydrophobic binding pocket. This system has been studied using

PCA (216), and the results of conformational analysis have been related to spectroscopic “A-

states” which exhibit four IR absorption bands. These are believed to relate to MbCO’s

ability to differentiate between diatomic ligands CO and O2. Just as our results in Chapter 4

related MSD oscillations to the IR spectra of gA, it would be interesting to produce spectra

like those in Fig. 4.7 for MbCO, and associate the structure of PC eigenvectors with peaks in

the IR regime.

At long spatial scales there are a number of studies which could elucidate the validity

of our ansatz regarding superpositions of PCs. As with any spectroscopy, we can expect that

the mixing of modes becomes more problematic with increasing system size; the separation

of modes and the bandgaps between them are likely to be clearer for less complex

systems. Hence it would be most helpful to study a number of relatively small proteins such

as lysozyme or crambin, to see if there are plateaus or bandgaps in their eigenvalue spectra.

If so, eigenvalue-weighted superpositions within these plateaus should reveal easily

interpreted collective modes of motion which may relate to the function of these proteins.

Another set of interesting comparisons could be made with model systems to separate

harmonic from diffusive degrees of freedom. PCA of MD simulations of pure crystals would

show the eigenvalue signature of purely intramolecular (harmonic) interactions while PCA of

simulations of simple gases such as argon would elucidate intermolecular (diffusive)

interactions. Of course, an essential ingredient in any of these studies would be the

development of appropriate directional coordinates by which to judge the coherent structure

of resulting eigenvectors. Comparisons of PCA with NMA for these systems would also be

insightful.

The coupling of motion among subsystems within a complex molecular assembly is

of general interest to biophysicists and structural biologists. This is the general line of

inquiry begun in Chapter 6 regarding gA and GMO dynamics, and it would be fruitful to

continue this approach to study the coupling of gA dynamics with the water molecules in the

channel lumen, or the ions which translocate through the lumen. Many of the challenges

outlined in Chapter 6 would be apparent in such a study. Not the least of these are the

treatment of the changing molecular identity of the water molecules which occupy the

lumen, and the development of quantitative techniques to detect correlations among collective

motions and to ascribe causal directions to any such motions. On the other

hand, the average structure of lumen waters is much better defined than the solvation

structure of GMO molecules, and PCA may more readily detect functional

motions in this application.

Our discussion of EOFs in section 2.2.4 provides a long list of possible future studies

using PCA on protein dynamics, most of which would be original at this time. Choosing an

appropriate test system is especially important when trying novel analysis techniques, and

there are two criteria to observe in this case. The structural and functional simplicity of

gramicidin is useful in allowing a relatively straightforward description of results. But there

is also a need for well characterized patterns of motion of adequate biological complexity, in

order to test the ability of a technique to extract the relevant biological information from

atomic motions. The transition from T (tense) to R (relaxed) conformation upon O2 binding

in hemoglobin is a good example of this, and could serve as the equivalent of the

well-characterized patterns of atmospheric disturbances which were necessary in developing

extensions of EOFs.

While the utility of rotated EOFs remains controversial, it would be interesting to test

the ability of Varimax and related algorithms to extract simplified modes of motion from

PCA of MD simulations. Probably the most limiting factor in standard PCA is its reliance on

instantaneous covariance among atoms, which ignores the effect of memory and time-lags in

constructing the structure of dynamic modes; “extended” PCA which includes covariance of

motion at different points in time may reveal a much more accurate description of collective

modes in a protein. “Complex” PCA also offers the potential to explore correlations among

related variables in the MD data set; creating a complex number from the position and

velocity of an atom offers the possibility of discovering modes of motion which span the

complete phase space of protein dynamics.
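
The construction behind “complex” PCA can be made concrete with a short Python sketch that builds z = x + iv from paired position and velocity trajectories and forms their Hermitian covariance matrix. The function name `complex_covariance`, the population normalization and the assumption that positions and velocities have already been scaled to comparable units are all illustrative choices rather than an established protocol.

```python
def complex_covariance(positions, velocities):
    """Hermitian covariance of z = x + i*v built from paired trajectories.

    positions, velocities : lists of T frames, each a list of n values,
    assumed to be already scaled to comparable units.
    """
    t, n = len(positions), len(positions[0])
    z = [
        [complex(positions[f][j], velocities[f][j]) for j in range(n)]
        for f in range(t)
    ]
    mean = [sum(row[j] for row in z) / t for j in range(n)]
    return [
        [sum((row[i] - mean[i]) * (row[j] - mean[j]).conjugate() for row in z) / t
         for j in range(n)]
        for i in range(n)
    ]


# Two coordinates oscillating a quarter period apart, sampled at four frames.
pos = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
vel = [[0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [1.0, 0.0]]
c = complex_covariance(pos, vel)
```

The eigenvalues of a Hermitian matrix are real, so the spectrum can be read as in standard PCA, while the phase of each off-diagonal element records the phase lag between two coordinates. In the toy example the purely imaginary off-diagonal element reflects the quarter-period offset.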

We can always learn more from extending MD simulations to longer timescales.

Further characterization of the timescale dependence of eigenvector shapes, eigenvalue

spectra and PC distributions is always welcome, especially to elucidate the convergence of

the grouping plateaus in the eigenvalue spectrum. While we have argued that the backbone

dynamics of gA have adequately converged in this study, much longer simulations would

presumably capture dissociation events of the dimer, and it would be very interesting to

investigate whether mode B as described in Chapter 5 is associated with these events. It is

also clear that both side chain and annular GMO dynamics have not converged at 64 ns and

further analysis of these aspects of gramicidin dynamics would require longer simulations.

References

1. Karplus, M. 1987. Molecular dynamics simulations of proteins. Phys. Today 40:68‐70.

2. Karplus, M., and J. Kuriyan. 2005. Molecular dynamics and protein function. Proc. Nat. Acad. Sci. USA 102:6679‐6685.

3. McCammon, J. A., and S. C. Harvey. 1987. Dynamics of Proteins and Nucleic Acids. Cambridge University Press, New York.

4. Roux, B., and K. Schulten. 2004. Computational studies of membrane channels. Structure 12:1343‐1351.

5. Sanbonmatsu, K. Y., and C. S. Tung. 2007. High performance computing in biology: Multimillion atom simulations of nanoscale systems. Journal of Structural Biology 157:470‐480.

6. Wolynes, P. G. 2005. Energy landscapes and solved protein‐folding problems. Philos. Trans. R. Soc. A 363:453‐467.

7. Zhou, Y., and M. Karplus. 1999. Interpreting the folding kinetics of helical proteins. Nature 401:400‐403.

8. Gianni, S., N. R. Guydosh, F. Khan, T. D. Caldas, U. Mayor, G. W. N. White, M. L. DeMarco, V. Daggett, and A. R. Fersht. 2003. Unifying features in protein‐folding mechanisms. Proc. Nat. Acad. Sci. USA 100:13286‐13291.

9. Garcia‐Viloca, M., J. Gao, M. Karplus, and D. G. Truhlar. 2004. How Enzymes Work: Analysis by Modern Rate Theory and Computer Simulations. Science 303:186‐195.

10. Wolfenden, R., and M. J. Snider. 2001. The Depth of Chemical Time and the Power of Enzymes as Catalysts. Acc. Chem. Res. 34:938‐945.

11. Villa, J., and A. Warshel. 2001. Energetics and Dynamics of Enzymatic Reactions. J. Phys. Chem. B 105:7887‐7907.

12. Brooks, C. L., M. Karplus, and B. M. Pettitt. 1988. Proteins: A Theoretical Perspective of Dynamics, Structure and Thermodynamics. Wiley, New York.

13. Cui, Q., and M. Karplus. 2002. Promoting Modes and Demoting Modes in Enzyme‐Catalyzed Proton Transfer Reactions: A Study of Realistic Systems. J. Phys. Chem. B 106:1768‐1798.

14. Kursula, I., M. Salin, J. Sun, B. V. Norledge, A. M. Haapalainen, N. S. Sampson, and R. K. Wierenga. 2004. Understanding protein lids: structural analysis of active hinge mutants in triosephosphate isomerase. Protein Eng., Des. Sel. 17:375‐382.

15. Horton, H. R., L. A. Moran, K. G. Scrimgeour, M. D. Perry, and J. D. Rawn. 2006. Principles of Biochemistry. Pearson Prentice Hall.

16. Koppole, S., J. C. Smith, and S. Fischer. 2006. Simulations of the myosin II motor reveal a nucleotide‐state sensing element that controls the recovery stroke. J. Mol. Biol. 361:604‐616.

17. Mesentean, S., S. Koppole, J. C. Smith, and S. Fischer. 2007. The principal motions involved in the coupling mechanism of the recovery stroke of the myosin motor. J. Molec. Biol 367:591‐602.

18. Carnevale, V., S. Raugei, C. Micheletti, and P. Carloni. 2006. Convergent Dynamics in the Protease Enzymatic Superfamily. JACS. 128:9766‐9772.

19. Hodgkin, A. L., and R. D. Keynes. 1955. The potassium permeability of a giant nerve fibre. J. Physiol. (Lond.) 128:61‐88.

20. Doyle, D. A., J. M. Cabral, R. A. Pfuetzner, A. Kuo, J. M. Gulbis, S. L. Cohen, B. T. Chait, and R. MacKinnon. 1998. The structure of the potassium channel: molecular basis of K+ conduction and selectivity. Science 280:69‐77.

21. Zhou, Y., J. H. Morais‐Cabral, A. Kaufman, and R. MacKinnon. 2001. Chemistry of ion coordination and hydration revealed by a K+ channel‐Fab complex at 2.0 Å resolution. Nature 414:43‐48.

22. Berneche, S., and B. Roux. 2000. Molecular dynamics of the KcsA K+ channel in a bilayer membrane. Biophys. J. 78:2900‐2917.

23. Noskov, S. Y., S. Berneche, and B. Roux. 2004. Control of ion selectivity in potassium channels by electrostatic and dynamic properties of carbonyl ligands. Nature 431:830‐834.

24. Thomas, M., D. Jayatilaka, and B. Corry. 2007. The Predominant Role of Coordination Number in Potassium Channel Selectivity. Biophys. J. 93:2635‐2643.

25. Lee, A. G. 2003. Lipid‐protein interactions in biological membranes: a structural perspective [Review]. Biochimica et Biophysica Acta 1612:1‐40.

26. Hunte, C., and S. Richers. 2008. Lipids and membrane protein structures. Curr. Opin. Struc. Biol. 18:406‐411.

27. Saiz, L., S. Bandyopadhyay, and M. L. Klein. 2004. Effect of the Pore Region of a Transmembrane Ion Channel on the Physical Properties of a Simple Membrane. J. Phys. Chem. B 108:2608‐2613.

28. Deol, S. S., P. J. Bond, C. Domene, and M. S. P. Sansom. 2004. Lipid‐Protein Interactions of the Integral Membrane Proteins: A Comparative Simulation Study. Biophys. J. 87:3737‐3749.

29. Lee, A. G. 2004. How lipids affect the activities of integral membrane proteins [Review]. Biochimica et Biophysica Acta 1666:62‐87.

30. Marsh, D., and L. I. Horvath. 1998. Structure, dynamics and composition of the lipid‐protein interface. Perspectives from spin‐labelling. Biochimica et Biophysica Acta 1376:267‐296.

31. de Planque, M. R. R., D. V. Greathouse, R. E. I. Koeppe, H. Schafer, D. Marsh, and J. A. Killian. 1998. Influence of Lipid/Peptide Hydrophobic Mismatch on the Thickness of Diacylphosphatidylcholine Bilayers. A 2H NMR and ESR Study Using Designed Transmembrane α‐Helical Peptides and Gramicidin A. Biochemistry 37:9333‐9345.

32. Killian, J. A. 1992. Gramicidin and gramicidin‐lipid interactions. Biochimica et Biophysica Acta 1113:391‐425.

33. Costa‐Filho, A. J., R. H. Crepeau, P. P. Borbat, M. Ge, and J. H. Freed. 2003. Lipid‐Gramicidin Interactions: Dynamic Structure of the Boundary Lipid by 2D‐ELDOR. Biophys. J. 84:3364‐3378.

34. Woolf, T. B., and B. Roux. 1994. Molecular‐dynamics of the gramicidin channel in a phospholipid membrane. Proc. Nat. Acad. Sci. USA 91:11631‐11635.

35. Chiu, S., S. Subramaniam, and E. Jakobsson. 1999. Simulation study of a gramicidin/lipid bilayer system in excess water and lipid. II. Rates and mechanisms of water transport. Biophys. J. 76:1939‐1950.

36. Qin, Z., H. L. Tepper, and G. A. Voth. 2007. Effect of membrane environment on proton permeation through gramicidin A channels. J. Phys. Chem. B 111:9931‐9939.

37. Dubos, R. J., and C. Cattaneo. 1939. Studies on a bactericidal agent extracted from a soil bacillus: III. Preparation and activity of a protein‐free fraction. J. Exp. Med. 70:249‐256.

38. Arseniev, A. S., I. L. Barskov, V. F. Bystov, A. L. Lomize, and Y. A. Ovchinnikov. 1985. 1H‐NMR study of gramicidin A transmembrane ion channel. Head‐to‐head right‐handed single stranded helices. FEBS Letters 186:168‐174.

39. Hladky, S. B., and D. A. Haydon. 1972. Ion transfer across lipid membranes in the presence of gramicidin A: 1. Studies of the unit conductance channel. Biochim. Biophys. Acta 274:294‐312.

40. Eisenman, G., and R. Horn. 1983. Ionic selectivity revisited: the role of kinetic and equilibrium processes in ion permeation through channels. J. Membr. Biol. 76:197‐225.

41. Finkelstein, A., and O. S. Andersen. 1981. The Gramicidin A Channel: a review of its permeability characteristics with special reference to the single‐file aspect of transport. J. Membr. Biol. 59:155‐171.

42. Hladky, S. B., and D. A. Haydon. 1974. Temperature‐dependent properties of gramicidin A channels. Biochim. Biophys. Acta 367:127‐133.

43. Killian, J. A. 1992. Gramicidin and gramicidin‐lipid interactions. Biochim. Biophys. Acta 1113:391‐425.

44. Ketchem, R. R., B. Roux, and T. A. Cross. 1997. High‐resolution polypeptide structure in a lamellar phase lipid environment from solid state NMR derived orientational constraints. Structure 5:1655‐1669.

45. Ketchem, R. R., W. Hu, and T. A. Cross. 1993. High‐Resolution Conformation of Gramicidin A in a Lipid Bilayer by Solid‐State NMR. Science 261:1457‐1460.

46. Mitchell, J. B. O., and J. Smith. 2003. D‐amino acid residues in peptides and proteins. Proteins 50:563‐571.

47. Elliott, J. R., D. Needham, J. P. Dilger, and D. A. Haydon. 1983. The effects of bilayer thickness and tension on gramicidin single‐channel lifetime. Biochim. Biophys. Acta 735:95‐103.

48. Huang, H. W. 1986. Deformation free energy of bilayer membrane and its effect on gramicidin channel lifetime. Biophys. J. 50:1061‐1070.

49. Stankovic, C. J., S. H. Heinemann, J. M. Delfino, F. J. Sigworth, and S. L. Schreiber. 1989. Transmembrane Channels Based on Tartaric Acid ‐ Gramicidin A Hybrids. Science 244:813‐817.

50. Stankovic, C. J., S. H. Heinemann, and S. L. Schreiber. 1990. Immobilizing the Gate of a Tartaric Acid Gramicidin ‐ A Hybrid Channel Molecule by Rational Design. J. Am. Chem. Soc. 112:3702‐3704.

51. Cukierman, S., E. P. Quigley, and D. S. Crumrine. 1997. Proton conduction in gramicidin A and in its dioxolane‐linked dimer in different lipid bilayers. Biophys. J. 73:2489‐2502.

52. Quigley, E. P., D. S. Crumrine, and S. Cukierman. 2000. Gating and Permeation in Ion Channels Formed by Gramicidin A and Its Dioxolane‐linked Dimer in Na+ and Cs+ Solutions. J. Membrane Biol. 174:207‐212.

53. Quigley, E. P., P. Quigley, D. S. Crumrine, and S. Cukierman. 1999. The Conduction of Protons in Different Stereoisomers of Dioxolane‐Linked Gramicidin A Channels. Biophys. J. 77:2479‐2491.

54. Roux, B. 2002. Computational studies of the gramicidin channel. Acc. Chem. Res. 35:366‐375.

55. Roux, B., and M. Karplus. 1991. Ion transport in a model gramicidin channel: Structure and thermodynamics. Biophys. J. 59:961‐981.

56. Roux, B., and M. Karplus. 1994. Molecular dynamics simulations of the gramicidin channel. Annu. Rev. Biophys. Biomol. Struct. 23:731‐761.

57. Lauger, P. 1973. Ion transport through pores: a rate theory analysis. Biochim. Biophys. Acta 311:423‐441.

58. Schumaker, M., R. Pomès, and B. Roux. 2000. A combined molecular dynamics and diffusion model of single proton conduction through gramicidin. Biophys. J. 79:2840‐2857.

59. Kirkwood, J. G. 1935. Statistical mechanics of fluid mixtures. J. Chem. Phys. 3:300.

60. Allen, T. W., O. S. Andersen, and B. Roux. 2006. Ion permeation through a narrow channel: Using gramicidin to ascertain all‐atom molecular dynamics potential of mean force methodology and biomolecular force fields. Biophys. J. 90:3447‐3468.

61. Allen, T. W., O. S. Andersen, and B. Roux. 2006. Molecular dynamics ‐ potential of mean force calculations as a tool for understanding ion permeation and selectivity in narrow channels. Biophys. Chem. 124:251‐267.

62. Decornez, H., K. Drukker, and S. Hammes‐Schiffer. 1999. Solvation and Hydrogen‐Bonding Effects on Proton Wires. J. Phys. Chem. A 103:2891‐2898.

63. Pomès, R. 1995. Quantum effects on the structure and energy of a protonated linear chain of hydrogen‐bonded water molecules. Chemical Physics Letters 234:416‐424.

64. Pomes, R., and B. Roux. 1996. Theoretical Study of H+ Translocation along a Model Proton Wire. J. Phys. Chem. 100:2519‐2527.

65. Pomès, R., and B. Roux. 1998. Free Energy Profiles of H+ Conduction along Hydrogen‐Bonded Chains of Water Molecules. Biophys. J. 75:33‐40.

66. Drukker, K., S. W. de Leeuw, and S. Hammes‐Schiffer. 1998. Proton transport along water chains in an electric field. J. Chem. Phys 108:6799‐6808.

67. Chakrabarti, N., B. Roux, and R. Pomès. 2004. Structural Determinants of Proton Blockage in Aquaporins. J. Mol. Biol. 20:1‐18.

68. Chakrabarti, N., E. Tajkhorshid, B. Roux, and R. Pomès. 2004. Molecular Basis of Proton Blockage in Aquaporins. Structure 12:65‐74.

69. Pomes, R., and B. Roux. 1996. Structure and Dynamics of a Proton Wire: A Theoretical Study of H+ Translocation along the Single‐File Water Chain in the Gramicidin A Channel. Biophys. J. 71:19‐39.

70. Pomès, R., and B. Roux. 2002. Molecular mechanism of H+ conduction in the single‐file water chain of the gramicidin channel. Biophys. J. 82:2304‐2316.

71. Pomès, R., and B. Roux. 1996. Structure and Dynamics of a Proton Wire: A Theoretical Study of H+ Translocation along the Single‐File Water Chain in the Gramicidin A Channel. Biophys. J. 71:19‐39.

72. Yu, C. H., and R. Pomès. 2003. Functional dynamics of ion channels: modulation of proton movement by conformational switches. J. Am. Chem. Soc. 125:13890‐13894.

73. Urry, D. W., S. Alonso‐Romanowski, C. M. Venkatachalam, R. J. Bradley, and R. D. Harris. 1984. Temperature Dependence of Single Channel Currents and the Peptide Libration Mechanism for ion Transport through the Gramicidin A Transmembrane Channel. J. Membr. Biol. 81:205‐217.

74. Roux, B., and M. Karplus. 1988. The normal modes of the gramicidin‐A dimer channel. Biophys. J. 53:297‐309.

75. Chiu, S., E. Jakobsson, S. Subramaniam, and J. A. McCammon. 1991. Time‐correlation analysis of simulated water motion in flexible and rigid gramicidin channels. Biophys. J. 60.

76. Tian, F., and T. A. Cross. 1999. Cation Transport: An Example of Structural Based Selectivity. J. Mol. Biol. 285:1993‐2003.

77. North, C. L., and T. A. Cross. 1995. Correlations between Function and Dynamics: Time Scale Coincidence for Ion Translocation and Molecular Dynamics in the Gramicidin Channel Backbone. Biochemistry 34:5883‐5895.

78. Lazo, N. D., W. Hu, and T. A. Cross. 1995. Low‐Temperature Solid‐State 15N NMR Characterization of Polypeptide Backbone Librations. J. Magn. Reson. B 107:43‐50.

79. Bartl, F., B. Brzezinski, B. Rozalski, and G. Zundel. 1998. FT‐IR Study of the Nature of the Proton and Li+ Motions in Gramicidin A and C. J. Phys. Chem. B 102:5234‐5238.

80. Pankiewicz, R., G. Wojciechowski, G. Schroeder, B. Brzezinski, F. Bartl, and G. Zundel. 2001. FT‐IR study of the nature of K+, Rb+ and Cs+ cation motions in gramicidin A. J. Mol. Struct. 565:213‐217.

81. Armstrong, K. M., and S. Cukierman. 2002. On the Origin of Closing Flickers in Gramicidin Channels: A New Hypothesis. Biophys. J. 82:1329‐1337.

82. de Godoy, C. M. G., and S. Cukierman. 2001. Modulation of Proton Transfer in the Water Wire of Dioxolane‐Linked Gramicidin Channels by Lipid Membranes. Biophys. J. 81:1430‐1438.

83. Cukierman, S. 2000. Proton Mobilities in Water and Different Stereoisomers of Covalently Linked Gramicidin A Channels. Biophys. J. 78:1825‐1834.

84. Yu, C. H., S. Cukierman, and R. Pomès. 2003. Theoretical Study of the Structure and Dynamics Fluctuations of Dioxolane‐Linked Gramicidin Channels. Biophys. J. 84:816‐831.

85. Brooks, B. R., R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. 1983. CHARMM ‐ A Program For Macromolecular energy, minimization, and dynamics calculations. J. Comp. Chem. 4:187‐217.

86. MacKerell, A. D., Jr., D. Bashford, M. Bellott, R. L. Dunbrack, J. D. Evanseck, M. J. Field, S. Fischer, J. B. Gao, H. Guo, S. Ha, D. Joseph‐McCarthy, L. Kuchnir, K. Kuczera, F. T. K. Lau, C. Mattos, S. Michnick, T. Ngo, D. T. Nguyen, B. Prodhom, W. E. Reiher, B. Roux, M. Schlenkrich, J. C. Smith, R. Stote, J. Straub, M. Watanabe, J. Wiorkiewicz‐Kuczera, D. Yin, and M. Karplus. 1998. All‐atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 102:3586‐3616.

87. Cornell, W. D., P. Cieplak, C. I. Bayly, I. R. Gould, K. M. Merz, D. M. Ferguson, D. C. Spellmeyer, T. Fox, J. W. Caldwell, and P. A. Kollman. 1995. A 2nd Generation Force‐Field for the Simulation of Proteins, Nucleic‐Acids, and Organic‐Molecules. J. Am. Chem. Soc. 117:5179‐5197.

88. Pearlman, D. A., D. A. Case, J. W. Caldwell, W. S. Ross, T. E. Cheatham, S. Debolt, D. Ferguson, G. Seibel, and P. Kollman. 1995. Amber, A Package of Computer‐Programs for Applying Molecular Mechanics, Normal‐Mode Analysis, Molecular‐Dynamics and Free‐Energy Calculations to Simulate the Structural And Energetic Properties of Molecules. Comput. Phys. Commun. 91:1‐41.

89. Berendsen, H. J. C., D. Vanderspoel, and R. Vandrunen. 1995. Gromacs ‐ a Message‐Passing Parallel Molecular‐Dynamics Implementation. Comput. Phys. Commun. 91:43‐56.

90. Lindahl, E., B. Hess, and D. van der Spoel. 2001. GROMACS 3.0: a package for molecular simulation and trajectory analysis. J. Mol. Model 7:306‐317.

91. van Gunsteren, W. F., S. R. Billeter, A. A. Eising, P. H. Hunenberger, P. Kruger, A. E. Mark, W. R. P. Scott, and I. G. Tironi. 1996. Biomolecular Simulation: The GROMOS96 manual and user guide. Hochschulverlag AG an der ETH Zurich, Zurich.

92. Jorgensen, W. L., and J. Tiradorives. 1988. The Opls Potential Functions for Proteins ‐ Energy Minimizations for Crystals of Cyclic‐Peptides and Crambin. J. Am. Chem. Soc. 110:1657‐1666.

93. Berendsen, A., J. P. M. Postma, W. F. Van Gunsteren, A. DiNola, and J. R. Haak. 1984. Molecular dynamics with coupling to an external bath. J. Chem. Phys 81:3684‐3690.

94. Hoover, W. G. 1985. Canonical dynamics: Equilibrium phase‐space distributions. Phys. Rev. A 31:1695‐1697.

95. Nosé, S. 1984. A unified formulation of the constant temperature molecular dynamics methods. J. Chem. Phys 81:511‐519.

96. Allen, M. P., and D. J. Tildesley. 1987. Computer Simulation of Liquids. Oxford University Press, Oxford.

97. Harvey, S. C., R. K. Z. Tan, and T. E. Cheatham. 1998. The flying ice cube: Velocity rescaling in molecular dynamics leads to violation of energy equipartition. J. Comput. Chem. 19:726‐740.

98. Herce, H. D., and A. E. Garcia. 2006. Correction of apparent finite size effects in the area per lipid of lipid membranes simulations. J. Chem. Phys. 125:224711.

99. Jorgensen, W. L., J. Chandrasekhar, J. D. Madura, R. W. Impey, and M. L. Klein. 1983. Comparison of Simple Potential Functions for Simulating Water. J. Chem. Phys. 79:926‐935.

100. Ryckaert, J.‐P., G. Ciccotti, and H. J. C. Berendsen. 1977. Numerical Integration of the Cartesian Equations of Motion of a System with Constraints: Molecular Dynamics of n‐Alkanes. J. Comp. Phys. 23:327–341.

101. Zhang, Y., S. Feller, B. Brooks, and R. W. Pastor. 1995. Computer simulation of liquid/liquid interfaces. I. Theory and application to octane/water. J. Chem. Phys 103:10252‐10266.

102. Marrink, S. J., and A. E. Mark. 2001. Effect of Undulations on Surface Tension in Simulated Bilayers. J. Phys. Chem. B 105:6122‐6127.

103. Eckart, C., and G. Young. 1936. The approximation of one matrix by another of lower rank. Psychometrika I:211‐218.

104. Golub, G. H., and W. Kahan. 1965. Calculating the singular values and pseudo‐inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical Analysis 2:205‐224.

105. Hotelling, H. 1935. The most predictable criterion. J. Educ. Psychol. 26:139‐142.

106. Hotelling, H. 1933. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24:417‐520.

107. Hannachi, A., I. T. Jolliffe, and D. B. Stephenson. 2007. Empirical orthogonal functions and related techniques in atmospheric science: A review. Int. J. Climatology 27:1119‐1152.

108. Berkooz, G., P. Holmes, and J. L. Lumley. 1993. The Proper Orthogonal Decomposition in the Analysis of Turbulent Flows. Ann. Rev. Fluid Mech. 25:539‐575.

109. Berendsen, H., and S. Hayward. 2000. Collective protein dynamics in relation to function. Curr. Opin. Struc. Biol. 10:165‐169.

110. Kitao, A., and N. Go. 1999. Investigating protein dynamics in collective coordinate space. Curr. Opin. Struc. Biol. 9:164‐169.

111. García, A. 1992. Large‐amplitude nonlinear motions in proteins. Phys. Rev. Lett. 17:2696‐2699.

112. Amadei, A., A. B. M. Linssen, and H. J. C. Berendsen. 1993. Essential dynamics of proteins. Proteins 17:412‐425.

113. Lou, H., and R. I. Cukier. 2006. Molecular dynamics of Apo‐Adenylate Kinase: A Principal component Analysis. J. Phys. Chem. 110:12796‐12808.

114. Arcangeli, C., A. R. Bizzarri, and S. Cannistraro. 2001. Concerted motions in copper plastocyanin and azurin: an essential dynamics study. Biophys. Chem 90:45‐56.

115. Hayward, S., and H. J. C. Berendsen. 1998. Systematic analysis of domain motions in proteins from conformational change: New results on citrate synthase and T4 lysozyme. Proteins 30:144‐154.

116. Hayward, S., A. Kitao, and N. Go. 1994. Harmonic and anharmonic aspects in the dynamics of BPTI: A normal mode analysis and principal component analysis. Protein Science 3:936‐943.

117. García, A., and G. Hummer. 1999. Conformational dynamics of cytochrome c: correlation to hydrogen exchange. Proteins 36:175‐191.

118. van Aalten, D. M. F., A. Amadei, A. B. M. Linssen, V. G. H. Eijsink, G. Vriend, and H. J. C. Berendsen. 1995. The essential dynamics of thermolysin: confirmation of the hinge‐bending motion and comparison of simulations in vacuum and water. Proteins: Structure, Function and Genetics 22:45‐54.

119. Maisuradze, G. G., and D. M. Leitner. 2006. Principal component analysis of fast‐folding lambda‐repressor mutants. Chem. Phys. Lett. 421:5‐10.

120. Materese, C. K., C. C. Goldmon, and G. A. Papoian. 2008. Hierarchical organization of eglin c native state dynamics is shaped by competing direct and water‐mediated interactions. Proc. Natl. Acad. Sci. USA 105:10659‐10664.

121. Balsera, M. A., W. Wriggers, Y. Oono, and K. Schulten. 1996. Principal component analysis and long time protein dynamics. J. Phys. Chem. 100:2567‐2572.

122. Grossfield, A., S. Feller, and M. Pitman. 2007. Convergence of molecular dynamics simulations of membrane proteins. Proteins 67:31‐40.

123. Hattori, M., H. Li, H. Yamada, K. Akasaka, W. Hengstenberg, W. Gronwald, and H. R. Kalbitzer. 2004. Infrequent cavity‐forming fluctuations in HPr from Staphylococcus carnosus revealed by pressure‐ and temperature‐dependent tyrosine ring flips. Protein Science 13:3104‐3114.

124. Rao, D. K., and A. K. Bhuyan. 2007. Complexity of aromatic ring‐flip motions in proteins: Y97 ring dynamics in cytochrome c observed by cross‐relaxation suppressed exchange NMR spectroscopy. J. Biomol. NMR 39:187‐196.

125. Go, N., and H. A. Scheraga. 1970. Calculation of the Conformation of the Pentapeptide cyclo‐(Glycylglycylglycylprolylprolyl). I. A Complete Energy Map. Macromolecules 3:188‐194.

126. Go, N., and H. A. Scheraga. 1973. Calculation of the Conformation of cyclo‐Hexaglycyl. Macromolecules 6:525‐541.

127. Brooks, B., and M. Karplus. 1983. Harmonic dynamics of proteins: Normal modes and fluctuations in bovine pancreatic trypsin inhibitor. Proc. Natl. Acad. Sci. USA 80:6571‐6575.

128. Go, N., T. Noguti, and T. Nishikawa. 1983. Dynamics of a small globular protein in terms of low‐frequency vibrational modes. Proc. Natl. Acad. Sci. USA 80:3696‐3700.

129. Levitt, M., C. Sander, and P. S. Stern. 1985. Protein Normal‐mode Dynamics: Trypsin Inhibitor, Crambin, Ribonuclease and Lysozyme. J. Mol. Biol. 181:423‐447.

130. Ma, J. 2005. Usefulness and Limitations of Normal Mode Analysis in Modeling Dynamics of Biomolecular Complexes. Structure 13:373‐380.

131. Miller, D. W., and D. A. Agard. 1999. Enzyme specificity under dynamic control: a normal mode analysis of alpha‐lytic protease. J. Mol. Biol. 286:267‐278.

132. Miloshevsky, G., and P. Jordan. 2006. The open state gating mechanism of gramicidin A requires relative opposed monomer rotation and simultaneous lateral displacement. Structure 14:1241‐1249.

133. Atilgan, A. R., S. R. Durell, R. L. Jernigan, M. C. Demirel, O. Keskin, and I. Bahar. 2001. Anisotropy of Fluctuation Dynamics of Proteins with an Elastic Network Model. Biophys. J. 80:505‐515.

134. Bahar, I., A. R. Atilgan, and B. Erman. 1997. Direct evaluation of thermal fluctuations in proteins using a single‐parameter harmonic potential. Folding Design 2:173‐181.

135. Bahar, I., C. Chennubhotla, and B. Erman. 2007. Reply to 'Comment on elastic network models and proteins'. Phys. Biol. 4:64‐65.

136. Bahar, I., and A. Rader. 2005. Coarse‐grained normal mode analysis in structural biology. Current Opinion in Structural Biology 15:586‐592.

137. Chennubhotla, C., A. J. Rader, L. Yang, and I. Bahar. 2005. Elastic network models for understanding biomolecular machinery: from enzymes to supramolecular assemblies. Phys. Biol. 2:S172‐S180.

138. Eyal, E., and I. Bahar. 2008. Toward a Molecular Understanding of the Anisotropic Response of Proteins to External Forces: Insights from Elastic Network Models. Biophys. J. 94:3424‐3435.

139. Tama, F., M. Valle, J. Frank, and C. L. Brooks. 2003. Dynamic reorganization of the functionally active ribosome explored by normal mode analysis and cryo‐electron microscopy. Proc. Natl. Acad. Sci. USA 100:9319‐9323.

140. McCammon, J. A., B. R. Gelin, and M. Karplus. 1977. Dynamics of folded proteins. Nature 267:585‐590.

141. Doruker, P., A. R. Atilgan, and I. Bahar. 2000. Dynamics of Proteins Predicted by Molecular Dynamics Simulations and Analytical Approaches: Application to α‐Amylase Inhibitor. Proteins: Structure, Function and Genetics 40:512‐524.

142. Smith, J., S. Cusack, U. Pezzeca, B. Brooks, and M. Karplus. 1986. Inelastic neutron scattering analysis of low frequency motion in proteins: A normal mode study of the bovine pancreatic trypsin inhibitor. J. Chem. Phys. 85:3636‐3654.

143. Tirion, M. M. 1996. Large Amplitude Elastic Motions in Proteins from a single‐Parameter, Atomic Analysis. Phys. Rev. Lett. 7:1905‐1908.

144. Delarue, M., and Y.‐H. Sanejouand. 2002. Simplified Normal Mode Analysis of Conformational Transitions in DNA‐dependent Polymerases: the Elastic Network Model. J. Mol. Biol. 320:1011‐1024.

145. Tama, F., and Y.‐H. Sanejouand. 2001. Conformational change of proteins arising from normal mode calculations. Protein Engineering 14:1‐6.

146. Zheng, W., B. Brooks, and D. Thirumalai. 2006. Low‐frequency normal modes that describe allosteric transitions in biological nanomachines are robust to sequence variations. Proc. Natl. Acad. Sci. USA 103:7664‐7669.

147. Hinsen, K. 1998. Analysis of Domain Motions by Approximate Normal Mode Calculations. Proteins: Structure, Function, and Genetics 33:417‐429.

148. Maguid, S., S. Fernandez‐Alberti, L. Ferrelli, and J. Echave. 2005. Exploring the common dynamics of homologous proteins. Application to the Globin family. Biophys. J. 89:3‐13.

149. Stillinger, F. H., and T. A. Webber. 1982. Hidden structure in liquids. Phys. Rev. A 25:978‐989.

150. Pearson, K. 1902. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2:559‐572.

151. Lorenz, E. N. 1956. Empirical Orthogonal Functions and Statistical Weather Prediction. In Technical Report, Statistical Forecast Project Report 1. Department of Meteorology, MIT. 49.

152. Richman, M. B. 1986. Rotation of Principal Components. J. Climatology 6:293‐335.

153. Buell, C. E. 1975. The topography of empirical orthogonal functions. In Fourth Conf. on Prob. and Stats. in Atmos. Sci. Amer. Metero. Soc., Tallahassee, FL. 188.

154. Buell, C. E. 1979. On the physical interpretation of empirical orthogonal functions. In Sixth Conf. on Prob. and Stats. in Atmos. Sci. Amer. Metero. Soc., Banff, Alta. 112.

155. Calahan, R. F. 1983. EOF spectral estimation in climate analysis. In Second International Conf. on Stat. Climat. National Institute of Metero. and Geophysics, Lisbon, Portugal. 4.5.1.

156. Richman, M. B., and P. J. Lamb. 1985. Climate pattern analysis of 3‐ and 7‐day summer rainfall in the central United States: Some methodological considerations and a regionalization. J. Clim. Appl. Meteor. 24:1325.

157. Kendall, M. G. 1980. Multivariate Analysis. C. Griffin, London.

158. North, G. R., T. L. Bell, R. F. Calahan, and F. J. Moeng. 1982. Sampling errors in the estimation of empirical orthogonal functions. Mon. Wea. Rev. 110:699.

159. Cliff, N., and C. D. Hamburger. 1967. A study of sampling errors in factor analysis by means of artificial experiments. Psych. Bull. 68:430.

160. Storch, H., and G. Hannoschock. 1985. Statistical aspects of estimated principal vectors (EOFs) based on small sample sizes. J. Clim. Appl. Meteor. 24:716.

161. Vargas, W. M., and R. H. Compagnucci. 1983. Methodological aspects of principal component analysis in meteorological fields. In Second International Conf. on Stat. Climat. National Institute of Metero. and Geophysics, Lisbon, Portugal. 5.3.1.

162. Barnston, A. G., and R. E. Livezey. 1987. Classification, seasonality and persistence of low‐frequency atmospheric circulation patterns. Mon. Wea. Rev. 115:1083‐1126.

163. Craddock, J. M. 1965. A meteorological application of factor analysis. The Statistician 15:143‐156.

164. Horel, J. D. 1981. A rotated principal component analysis of the interannual variability of the Northern Hemisphere 500 mb height field. Mon. Wea. Rev. 109:2080‐2092.

165. Richman, M. B. 1981. Obliquely rotated principal components: an improved meteorological map typing technique? J. App. Met. 20:1145‐1159.

166. Kaiser, H. F. 1958. The Varimax criterion for analytic rotation in factor analysis. Psychometrika 23:187.

167. Kaiser, H. F. 1959. Computer program for Varimax rotation in factor analysis. Educ. Psych. Meas. 19:413.

168. Carroll, J. B. 1953. An analytic solution for approximating simple structure in factor analysis. Psychometrika 18:23.

169. Neuhaus, J. O., and C. Wrigley. 1954. The Quartimax method: an analytical approach to simple structure. Brit. J. Stat. Psych. 7:81.

170. Carroll, J. B. 1957. Biquartimin criterion for rotating to oblique simple structure in factor analysis. Science 126:1114.

171. Saunders, D. R. 1961. The rationale for an "Oblimax" method of transformation in factor analysis. Psychometrika 26:317.

172. Hendrickson, A. E., and P. O. White. 1964. Promax: a quick method to oblique simple structure. Brit. J. Stat. Psych. 17:65.

173. Tucker, L. R., and C. T. Finkbeiner. 1982. Transformation of factors by artificial personal probability functions. In ETS research report 81‐58, test and measurement no. TM 820429.

174. Jolliffe, I. T. 2002. Principal Component Analysis. Springer, New York.

175. Hannachi, A., I. T. Jolliffe, D. B. Stephenson, and N. Trendafilov. 2006. In Search of Simple Structures in Climate: Simplifying EOFs. Int. J. Climatology 26:7‐28.

176. Jolliffe, I. T., N. Trendafilov, and M. Uddin. 2003. A modified principal component technique based on the LASSO. J. Computational and Graphical Statistics 12:531‐547.

177. Trendafilov, N., and I. T. Jolliffe. 2005. Numerical solution of the SCoTLASS. Computational Statistics and Data Analysis 50:242‐253.

178. Tibshirani, R. 1996. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society B 58:267‐288.

179. Bibby, J. 1980. Some effects of rounding optimal estimates. Sankhya B 42:165‐178.

180. Green, B. F. 1977. Parameter sensitivity in multivariate methods. Journal of Multivariate Behavioral Research 12:263‐287.

181. Hausmann, R. 1982. Constrained multivariate analysis. In Optimisation and Statistics. S. H. Zanckis, and J. S. Rustagi, editors. North‐Holland, Amsterdam. 137‐151.

182. Van den Dool, H. M., S. Saha, and J. A. 2000. Empirical orthogonal teleconnections. Journal of Climate 13:1421‐1435.

183. Vines, S. K. 2000. Simple principal components. Applied Statistics 49:441‐451.

184. Weare, B. C., and J. S. Nasstrom. 1982. Examples of extended empirical orthogonal function analysis. Monthly Weather Review 110:481‐485.

185. Broomhead, D. S., and G. P. King. 1986. Extracting qualitative dynamics from experimental data. Physica D 20:217‐236.

186. Broomhead, D. S., and G. P. King. 1986. On the qualitative analysis of experimental dynamical systems. In Nonlinear Phenomena and Chaos. S. Sarkar, editor. Adam Hilger, Bristol. 113‐144.

187. Kimoto, M., M. Ghil, and K. C. Mo. 1991. Spatial structure of the extratropical 40‐day oscillation. In Proceedings of the 8th Conference on Atmospheric and Oceanic Waves and Stability. American Meteorological Society, Boston, MA. 115‐116.

188. Plaut, G., and R. Vautard. 1994. Spells of low‐frequency oscillations and weather regimes in the northern hemisphere. Journal of the Atmospheric Sciences 51:210‐236.

189. Brink, K. H., and R. D. Muench. 1986. Circulation in the Point Conception‐Santa Barbara Channel region. Journal of Geophysical Research C 91:877‐895.

190. Hardy, D. M., and J. J. Walton. 1978. Principal components analysis of vector wind measurements. J. App. Meteorology 17:1153‐1162.

191. Kundu, P. K., and J. S. Allen. 1976. Some three‐dimensional characteristics of low‐frequency current fluctuations near the Oregon coast. Journal of Physical Oceanography 6:181‐199.

192. Johnson, E. S., and M. J. McPhaden. 1993. Structure of intraseasonal Kelvin waves in the equatorial Pacific Ocean. Journal of Physical Oceanography 23:608‐625.

193. Wallace, J. M. 1972. Empirical orthogonal representation of time series in the frequency domain. Part II: Application to the study of tropical wave disturbances. J. App. Meteorology 11:893‐900.

194. Wallace, J. M., and R. E. Dickinson. 1972. Empirical orthogonal representation of time series in the frequency domain. Part I: Theoretical consideration. J. App. Meteorology 11:887‐892.

195. Rasmusson, E. M., P. A. Arkin, W. Y. Chen, and J. B. Jalickee. 1981. Biennial variations in surface temperature over the United States as revealed by singular decomposition. Monthly Weather Review 109:587‐598.

196. Barnett, T. P. 1983. Interaction of the monsoon and Pacific trade wind system at interannual time scales. Part I: The equatorial case. Monthly Weather Review 111:756‐773.

197. Barnett, T. P. 1984. Interaction of the monsoon and Pacific trade wind system at interannual time scales. Part II: The tropical band. Monthly Weather Review 112:2380‐2387.

198. Barnett, T. P. 1984. Interaction of the monsoon and Pacific trade wind system at interannual time scales. Part III: A partial anatomy of the Southern Oscillation. Monthly Weather Review 112:2388‐2400.

199. Anderson, J. R., and R. D. Rosen. 1983. The latitude‐height structure of 40‐50 day variations in atmospheric angular momentum. Journal of the Atmospheric Sciences 40:1584‐1591.

200. Merrifield, M. A., and C. D. Winant. 1989. Shelf circulation in the Gulf of California: a description of the variability. Journal of Geophysical Research 94:18133‐18160.

201. Horel, J. D. 1984. Complex principal component analysis: theory and examples. J. Clim. Appl. Meteor. 23:1660.

202. Saegusa, R., H. Sakano, and S. Hashimoto. 2004. Nonlinear principal component analysis to preserve the order of principal components. Neurocomputing 61:57‐70.

203. Nguyen, D. T. 2006. Complexity of Free Energy Landscapes of Peptides Revealed by Nonlinear Principal Component Analysis. Proteins: Structure, Function and Genetics 65:898‐913.

204. Matsunaga, Y., S. Fuchigami, and A. Kidera. 2009. Multivariate frequency domain analysis of protein dynamics. J. Chem. Phys 130:124104.

205. Hess, B. 2002. Convergence of sampling in protein simulations. Phys. Rev. E 65:031910.

206. Faraldo‐Gomez, J. D., L. R. Forrest, M. Baaden, P. J. Bond, C. Domene, G. Patargias, J. Cuthbertson, and M. S. P. Sansom. 2004. Conformational Sampling and Dynamics of Membrane Proteins From 10‐Nanosecond Computer Simulations. Proteins 57:783‐791.

207. Luchko, T., J. T. Huzil, M. Stepanova, and J. Tuszynski. 2008. Conformational Analysis of the Carboxy‐Terminal Tails of Human β‐Tubulin Isotypes. Biophys. J. 94:1971‐1982.

208. Mandelbrot, B., and J. W. Van Ness. 1968. Fractional Brownian Motions, Fractional Noises and Applications. SIAM Rev. 10:422‐437.

209. Bingham, N. C., N. E. Smith, T. A. Cross, and D. D. Busath. 2003. Molecular dynamics simulations of Trp side‐chain conformational flexibility in the gramicidin A channel. Biopolymers 71:593‐600.

210. Townsley, L. E., W. A. Tucker, S. Sham, and J. F. Hinton. 2001. Structures of Gramicidins A, B, and C Incorporated into Sodium Dodecyl Sulfate Micelles. Biochemistry 40:11676‐11686.

211. Allen, T. W., O. S. Andersen, and B. Roux. 2003. Structure of Gramicidin A in a Lipid Bilayer Environment Determined Using Molecular Dynamics Simulations and Solid‐State NMR Data. J. Am. Chem. Soc. 125:9868‐9877.

212. Andersen, O. S., and R. E. Koeppe. 1992. Molecular determinants of channel function. Physiol. Rev. 72:89S‐158S.

213. Urry, D. W., C. M. Venkatachalam, K. U. Prasad, R. J. Bradley, G. Parenti‐Castelli, and G. Lenaz. 1981. Conduction Processes of the Gramicidin Channel. Int. J. Quantum Chem. Quantum Biol. Symp. 8:385.

214. Mandelbrot, B. 2002. Gaussian Self‐affinity and Fractals. Springer, New York.

215. Metzler, R., and J. Klafter. 2004. The restaurant at the end of the random walk: recent developments in the description of anomalous transport by fractional dynamics. Journal of Physics A 37:R161‐R208.

216. Schulze, B. G., and J. D. Evanseck. 1999. Cooperative role of Arg45 and His64 in the spectroscopic A3 state of carbonmonoxy myoglobin: Molecular dynamics simulations, multivariate analysis and quantum mechanical computations. J. Am. Chem. Soc. 121:6444‐6454.

217. Daidone, I., A. Amadei, D. Roccatano, and A. Di Nola. 2003. Molecular Dynamics Simulation of Protein Folding by Essential Dynamics Sampling: Folding Landscape of Horse Heart Cytochrome c. Biophys. J. 85:2865‐2871.

218. Stanley, H. E., S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.‐K. Peng, and M. Simons. 1999. Scaling features of noncoding DNA. Physica A 273:1‐18.

219. Gao, J. B., Y. Cao, and J. M. Lee. 2003. Principal component analysis of 1/fα noise. Phys. Lett. A 314:392‐400.

220. Maisuradze, G. G., and D. M. Leitner. 2007. Free energy landscape of a biomolecule in dihedral principal component space: sampling convergence and correspondence between structures and minima. Proteins 67:569‐578.

221. Ma, J., and M. Karplus. 1998. The allosteric mechanism of the chaperonin GroEL: A dynamics analysis. Proc. Natl. Acad. Sci. USA 95:8502‐8507.

222. García, A. 1997. Multi‐basin dynamics of a protein in a crystal environment. Physica D 107:225‐239.

223. Goldstein, H. 1980. Classical Mechanics. Addison‐Wesley.

224. Lindahl, E., C. Azuara, P. Koehl, and M. Delarue. 2006. NOMAD‐Ref: visualization, deformation and refinement of macromolecular structures based on all‐atom Normal Mode Analysis. Nucleic Acids Res. 34:W52‐56.

Appendix 1: Normal Mode Analysis

Given a molecular structure with N atoms at coordinates $r_i$ ($i = 1, 2, \ldots, 3N$), and a molecular energy landscape $U(r)$, the force constant matrix may be written

$$K_{ij} = \frac{\partial^2 U}{\partial r_i \, \partial r_j}.$$

The Hessian $H = M^{-1/2} K M^{-1/2}$ is the mass-weighted form of $K$, where $M$ is the diagonal matrix of atomic masses. If a harmonic approximation (quadratic function) is taken for the coordinate dependence of $U(r)$ around a minimum, then the spatial frequencies $\omega_i$ can be obtained by solving the eigenvalue problem

$$H q_i = \omega_i^2 q_i,$$

where each $q_i$ is a 3N dimensional eigenvector of mass-weighted displacements $q = M^{1/2} \Delta r$. A more detailed derivation of NMA can be found in standard mechanics textbooks, such as (223).
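The mass-weighting and diagonalization described above can be illustrated with a minimal numerical sketch. The function name `normal_modes`, the symbol K for the force constant matrix, and the two-mass toy system are assumptions introduced here for illustration; they are not taken from the NOMAD-Ref calculation used in this appendix.

```python
import numpy as np

def normal_modes(force_constants, masses):
    """Return (omega_squared, modes) of the mass-weighted Hessian.

    force_constants: (n, n) symmetric matrix of second derivatives of U.
    masses: length-n sequence, one mass per coordinate (hypothetical input).
    """
    k = np.asarray(force_constants, dtype=float)
    m = np.asarray(masses, dtype=float)
    inv_sqrt_m = 1.0 / np.sqrt(m)
    # Mass-weighting: H_ij = K_ij / sqrt(m_i * m_j)
    h = k * np.outer(inv_sqrt_m, inv_sqrt_m)
    # H is symmetric, so eigh gives real eigenvalues in ascending order
    omega_sq, modes = np.linalg.eigh(h)
    return omega_sq, modes

# Toy system: two unit masses on a line joined by a spring of stiffness 1
spring = 1.0
K = np.array([[spring, -spring],
              [-spring, spring]])
omega_sq, modes = normal_modes(K, [1.0, 1.0])
```

For this toy system one eigenvalue vanishes (free translation) and the other equals the spring stiffness times the sum of the inverse masses, which is the familiar result for a diatomic vibration.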

Note that traditional NMA requires the detailed form of U(r), which is taken from existing molecular mechanics force fields such as CHARMM. More recent Elastic Network Models (143) do away with this detailed energy function and replace it with a network of elastic interactions connecting every atom to every other atom within a cut-off radius (typically 10 Å). Notice that this replaces the harmonic functions usually associated with the bonding topology of a molecule with other harmonic functions now distributed much like non-bonded interactions.
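As a rough illustration of how an elastic network replaces the force field, the sketch below assembles the Hessian of a network in which every pair of nodes within a cut-off is joined by an identical spring. It follows the commonly used anisotropic network form with a single spring constant; the function name `enm_hessian`, the parameter `gamma`, and the four-point test geometry are assumptions made for this example and do not reproduce the calculation reported here.

```python
import numpy as np

def enm_hessian(coords, cutoff=10.0, gamma=1.0):
    """Build a 3N x 3N elastic network Hessian from node coordinates.

    coords: (N, 3) array of positions in Angstrom.
    cutoff: pairs farther apart than this are not connected.
    gamma:  uniform spring constant (hypothetical units).
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            dist_sq = d @ d
            if dist_sq > cutoff ** 2:
                continue  # no spring between distant nodes
            # Off-diagonal 3x3 block for a harmonic spring along d
            block = -gamma * np.outer(d, d) / dist_sq
            hess[3*i:3*i+3, 3*j:3*j+3] = block
            hess[3*j:3*j+3, 3*i:3*i+3] = block
            # Diagonal blocks accumulate the negative of each off-diagonal block
            hess[3*i:3*i+3, 3*i:3*i+3] -= block
            hess[3*j:3*j+3, 3*j:3*j+3] -= block
    return hess

# Four non-coplanar nodes, all within the cut-off of one another
tetra = np.array([[0.0, 0.0, 0.0],
                  [3.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0],
                  [0.0, 0.0, 3.0]])
H = enm_hessian(tetra)
eigvals = np.linalg.eigvalsh(H)
```

For a rigid, fully connected network such as this one, six eigenvalues vanish (three translations and three rotations) and the remaining 3N - 6 are positive.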

In Figure A1 we present NMA results on the gA dimer solvated in a GMO

membrane, which were computed using an online engine called NOMAD-Ref (224) on a

very well minimized structure from our simulations. A special request was made in order to

compute all eigenvalues and eigenvectors for comparison with PCA results (rather than only

the first 10, which is the default standard). We present eigenvalues for the Cα atoms, the

NCαC main chain, backbone with H atoms (NCαCH), with O atoms (NCαCO), and the

complete backbone (NCαCOH), as well as the full gA molecule with (GRAall) and without

(GRAnoH) hydrogen atoms. All of these curves have the same form: the leading 3 eigenvalues are separated from the rest by a significant bandgap, and all curves scale with a power of 0.25. Along with this shallower scaling, the most notable differences from the PCA spectra are the lack of more than one bandgap and the absence of a transition to steeper scaling at high frequencies.
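One simple way to estimate a scaling exponent of this kind is a straight-line fit in log-log coordinates. The sketch below is only an illustration under stated assumptions: the function name `scaling_exponent` is hypothetical, the spectrum is assumed to be ordered by mode index starting at 1, and the synthetic data are constructed to follow a power of 0.25 exactly rather than taken from Figure A1.

```python
import numpy as np

def scaling_exponent(values, first=0):
    """Least-squares slope of log(value) against log(mode index).

    values: spectrum ordered by mode index (index starts at 1).
    first:  number of leading modes to skip, e.g. those below a bandgap.
    """
    values = np.asarray(values, dtype=float)
    index = np.arange(1, len(values) + 1)
    # polyfit returns [slope, intercept] for a degree-1 fit
    slope, _ = np.polyfit(np.log(index[first:]), np.log(values[first:]), 1)
    return slope

# Synthetic spectrum that follows index**0.25 by construction
synthetic = np.arange(1, 201) ** 0.25
exponent = scaling_exponent(synthetic, first=3)
```

Skipping the leading modes with `first` keeps eigenvalues below a bandgap from biasing the fitted slope.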

The interpretation of these curves is not obvious. On the one hand, it is tempting to

ascribe differences in the NMA spectrum – like the lack of scaling transition – to the

presence of temperature in the PCA of MD simulations, which is not present in the NMA

approximation. This seems reasonable given the association of exponents in power law

scaling with properties of noise. On the other hand, we have suggested that the shallow

scaling at low PC index is associated with conformational degrees of freedom, while the

steeper scaling at high PC index is associated with internal vibrations. It is exactly these

internal vibrations that seem to be lacking in the NMA spectrum, which is surprising given

the harmonic nature of the approximation. However, the shallow scaling of the NMA

spectrum would suggest the approximation captures only conformational degrees of freedom,

which arise due to the non-bonded topology of interactions in the ENM model, despite its

harmonic form. Unraveling the effects of dynamics, entropy, temperature, interaction

topology and interaction functions is clearly non-trivial; a more detailed understanding of

the differences between PCA and NMA is a task that would benefit from comparisons of

model systems such as crystals and gases with the protein spectra presented here.

Figure A1: NMA eigenvalues (spatial frequencies ωk) for various atomic subsets of gA.

Appendix 2: Side Chain Conformations of gA

Our analysis of the convergence for side chain eigenvectors in section 2.2.2 shows

that their longest PCs are not converged. In Figure A2 we demonstrate the multi-modal

character of the side-chain eigenvectors by contrasting the PCs of the gA backbone with the

PCs of its side-chains. We plot the density of points along the trajectory of PC1 vs. PC2, as

well as a scatter plot of PC1 vs. PC2 for every 1000th point in the trajectory for the backbone,

and the same plot of PC1 vs. PC2 vs. PC3 for the side-chains. A unimodal density is

apparent for the gA backbone, while the side-chains have a multi-modal density distribution

for PC1 vs. PC2 with 4 distinct peaks, and the scatter plot of PC1 vs. PC2 vs. PC3 shows 4 or

5 distinct clusters. We have labeled these clusters with a representative time step (divided by

1000) as well as a coloured dot. The colours correspond to those used in Figure A3 to

display the 5 different conformations at these representative time steps in a 64 ns simulation

of gA in a GMO membrane. The characteristic feature of a given conformation is also

labeled with its time step in this figure.
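The kind of PC1 vs. PC2 density described above can be sketched as follows. This is a minimal illustration on synthetic data, assuming a trajectory already fitted to a reference structure and stored as a frames-by-coordinates array; the function name `pca_projections` and the two-cluster test data are hypothetical and are not the gA trajectory.

```python
import numpy as np

def pca_projections(traj, n_components=2):
    """Project a (frames, coordinates) trajectory onto its leading PCs."""
    x = np.asarray(traj, dtype=float)
    x = x - x.mean(axis=0)  # remove the average structure
    # Rows of vt are covariance eigenvectors, ordered by decreasing variance
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:n_components].T

# Synthetic "trajectory" hopping between two conformational clusters
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                    [4.0, 4.0, 0.0, 0.0, 0.0, 0.0]])
labels = rng.integers(0, 2, size=2000)
traj = centres[labels] + 0.3 * rng.standard_normal((2000, 6))

pcs = pca_projections(traj)
# 2D histogram of PC1 vs. PC2; multiple peaks indicate multi-modal sampling
density, xedges, yedges = np.histogram2d(pcs[:, 0], pcs[:, 1], bins=40)
```

Plotting every 1000th row of `pcs`, coloured by frame number, would give the time-ordered scatter plot counterpart of the density.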

The blue structure in Fig. A3 is the starting conformation. The red structure

corresponds to step 15, which is the smallest cluster visible in Fig. A2 and is only briefly

visited before returning to the initial cluster. This conformation does not differ significantly

from the starting structure except for a tilting of Trp 13 on both monomers. The green structure at step 135 has a 120° change in χ1 of Trp 11 of monomer 1, as well as a tilt in χ2 of Trp 9 on monomer 2. The orange structure at step 165 has a 120° change in χ1 of Trp 9 on monomer 2. The yellow structure at step 221 exhibits a 120° change in χ1 and a 30° change in χ2 of Trp 15 on monomer 2. The tan structure at step 315 has this same change at Trp 15 on monomer 2, accompanied by a 120° change in χ1 of Trp 13 of monomer 1.

These figures demonstrate that our simulation has only limited sampling of a few

conformational states, with only one or two transitions into each well on the free energy

surface of side-chain dynamics. MD studies in vacuo (209) and in DMPC (211) have described six rotameric states available to each Trp in gA, although only Trp 9 showed a significant number of transitions among them (eighteen) in a 100 ns simulation in DMPC (211). Our results are in general agreement with this study, indicating that Trp rotameric basins are visited on the 10 ns timescale in the GMO membrane, and therefore the longest side chain PCs are not expected to converge within 64 ns.

Figure A2: 2D Distribution of the complete PC1 vs. PC2 scatter plot for the NCαC backbone and side chains, as well as the time-ordered scatter plot for every 1000th point colored from blue (start) through red (end).

Figure A3: Side-chain conformations for a 64 ns simulation of gA in a hydrated GMO bilayer.
